Unit-2 Supervised Machine Learning
Content
• What is supervised ML
• Linear Regression for univariate and Multivariate data
• Cost function
• Gradient Descent
• Logistic regression
• Underfitting and Overfitting
• Support Vector Machine
• Decision Tree, Random Forest, Artificial Neural Network architecture
• Activation functions
• Forward pass in ANN
• Backpropagation in ANN
• Model Evaluation techniques.
What is Supervised Machine Learning?
• Supervised Machine Learning is a paradigm in machine learning
where an algorithm learns from a labeled dataset.
• This means that during training, the algorithm is provided with both
the input data (features) and the corresponding correct output (labels
or targets).
• The goal of supervised learning is to build a model that can map new,
unseen input data to its expected output values based on the patterns it
learned from the labeled training data.
Key Characteristics of Supervised Learning
• Labeled Data:
• Requires datasets where the target variable (what you want to predict) is
known for each input example.
• Explicit Feedback:
• The algorithm receives direct feedback (the correct answer) during training,
which it uses to adjust its internal parameters to minimize the difference
between its predictions and the actual labels.
• Goal-Oriented:
• The primary goal is to predict an outcome or classify data based on the learned
patterns.
Types of Supervised Learning Problems:
• Regression:
• The target variable is a continuous numerical value (e.g., predicting house
prices, temperature, stock prices, age, sales).
• Classification:
• The target variable is a categorical label (e.g., predicting whether an email is
spam or not, classifying an image as a cat or dog, determining if a customer
will churn or not).
Process of Supervised Learning
• Data Collection and Preparation: Gather and clean a dataset, ensuring it has both features and corresponding
labels.
• Splitting Data: Divide the labeled dataset into training, validation (optional), and test sets. The model learns from
the training data, is tuned using validation data, and its final performance is evaluated on unseen test data.
• Model Selection: Choose an appropriate machine learning algorithm (e.g., Linear Regression, Logistic
Regression, Decision Trees, Support Vector Machines, Neural Networks).
• Model Training: The algorithm is fed the training data and adjusts its internal parameters iteratively to minimize a
defined "loss function" (which measures the difference between predicted and actual outputs).
• Model Evaluation: Assess the trained model's performance on the test set using various metrics (e.g., accuracy,
precision, recall, F1-score for classification; R-squared, MSE, RMSE for regression).
• Prediction: Once satisfied with the model's performance, it can be used to make predictions on new, unlabeled
data.
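As a minimal sketch of this workflow in Python using scikit-learn; the dataset (Iris) and the decision tree classifier are placeholder choices for illustration, not a prescribed setup.

# Minimal sketch of the supervised learning workflow (scikit-learn assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Data collection/preparation: load a labeled dataset (features X, labels y)
X, y = load_iris(return_X_y=True)

# 2. Splitting data: hold out a test set for final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3-4. Model selection and training
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# 5. Model evaluation on unseen data
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))

# 6. Prediction on new, unlabeled data (one hypothetical flower measurement)
print("Predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))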
Linear Regression
• Linear Regression is a fundamental supervised learning algorithm used for regression tasks,
meaning it predicts a continuous target variable.
• It assumes a linear relationship between the input features and the output variable.
• The core idea is to find the "best-fitting" straight line (or hyperplane in higher dimensions)
that describes the relationship between the features and the target.
• This line minimizes the sum of squared differences between the predicted values and the
actual values (this is known as the Ordinary Least Squares or OLS method).
Linear Regression
• The graph above presents the linear relationship between the output(y) and
predictor(X) variables.
• The blue line is referred to as the best-fit straight line. Based on the given
data points, we attempt to plot a line that fits the points the best.
Univariate Linear Regression (Simple Linear
Regression)
• Univariate linear regression involves one independent variable (feature) to predict a single
continuous dependent variable (target).
• Mathematical Equation: The equation of a straight line is used to model this relationship:
y = β0 + β1x + ϵ
• Where:
• y: The dependent variable (the value we want to predict).
• x: The single independent variable (feature).
• β0 (beta-naught): The y-intercept, representing the predicted value of y when x is 0.
• β1 (beta-one): The coefficient (or slope) of the independent variable, representing the change in y for a one-unit
change in x.
• ϵ: The error term, representing the difference between the actual y and the predicted y.
Example
• Predicting a student's exam score based on the number of hours they studied.
• y = Exam Score
• x = Hours Studied
• Suppose the fitted model is: Exam Score = 40 + 5 × (Hours Studied). This means:
• A student who studies 0 hours is predicted to score 40 (the intercept).
• For every additional hour studied, the exam score is predicted to increase by 5 points.
• Visualization: You can visualize univariate linear regression as fitting a straight line through a scatter plot of data
points (x, y).
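A small illustrative sketch of fitting such a line with scikit-learn; the hours/score values below are made up so that the fitted coefficients come out near the intercept of 40 and slope of 5 used in the example.

# Illustrative sketch: hours studied vs. exam score (made-up data).
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # feature x
scores = np.array([46, 49, 56, 61, 64, 70])                    # target y

model = LinearRegression().fit(hours, scores)
print("Intercept (beta_0):", model.intercept_)   # roughly 40
print("Slope (beta_1):", model.coef_[0])         # roughly 5
print("Predicted score for 7 hours:", model.predict([[7.0]])[0])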
Linear Regression
• The goal of the linear regression algorithm is to get the best values for B0 and
B1 to find the best-fit line.
• The best-fit line is a line that has the least error which means the error between
predicted values and actual values should be minimum.
Random Error(Residuals)
• In regression, the difference between the observed value of the dependent variable (yi) and the predicted value (ŷi) is called the residual.
• εi = yi − ŷi
• where ŷi = β0 + β1xi
Assumptions of Linear Regression
• Linearity of residuals: There needs to be a linear relationship between the dependent
variable and independent variable(s).
• Independence of residuals: The error terms should not be dependent on one another (as in time-series data, where the next value depends on the previous one). There should be no correlation between the residual terms; the presence of such correlation is known as autocorrelation.
Assumptions of Linear Regression
• The equal variance of residuals: The error terms must have constant variance. This phenomenon
is known as Homoscedasticity. The presence of non-constant variance in the error terms is referred
to as Heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or
extreme leverage values.
Multivariate Linear Regression (Multiple Linear
Regression)
• Multivariate linear regression involves two or more independent variables (features) to predict a single
continuous dependent variable (target).
• The relationship is still assumed to be linear, but now it's a linear combination of multiple features.
• Mathematical Equation: The equation extends the univariate form to include multiple independent variables:
y = β0 + β1x1 + β2x2 + … + βnxn + ϵ
• Where:
• y: The dependent variable (the value we want to predict).
• x1 ,x2 ,…,xn : The n independent variables (features).
• β0 : The y-intercept.
• β1 ,β2 ,…,βn : The coefficients (slopes) for each respective independent variable. β i represents the change in y for a one-unit
change in x i , assuming all other independent variables are held constant.
• ϵ: The error term.
Example
• Predicting a house price based on its size, number of bedrooms, and
location.
• y = House Price
• x1 = Size (in sq. ft.)
• x2 = Number of Bedrooms
• x3 = Location Score (e.g., a numerical rating for neighborhood desirability)
Example
• If the model finds coefficients β0 = 50000, β1 = 100, β2 = 10000, β3 = 5000, the fitted model is: House Price = 50000 + 100 × Size + 10000 × Bedrooms + 5000 × Location Score.
• Visualization: With two independent variables (x 1 ,x 2 ), the relationship can be visualized as a plane in a 3D
space. With more than two independent variables, it becomes a "hyperplane" which is difficult to visualize
directly, but the mathematical principle remains the same.
Polynomial Regression
• Polynomial regression analysis represents a non-linear relationship
between dependent and independent variables.
• This technique is a variant of the multiple linear regression model, but
the best fit line is curved rather than straight.
Polynomial Regression
• The general form of the equation for a polynomial regression of degree n is:
y = β0 + β1x + β2x^2 + … + βnx^n + ϵ
• Choosing the right polynomial degree n is important: a higher degree may fit the data
more closely but it can lead to overfitting.
• The degree should be selected based on the complexity of the data. Once the model is
trained, it can be used to make predictions on new data, capturing non-linear
relationships and providing a more accurate model for real-world applications.
Coefficient of Determination or R-squared
(R2)
• R-squared is a number that explains the amount of variation that is explained/captured by the
developed model. It always ranges between 0 & 1.
• Overall, the higher the value of R-squared, the better the model fits the data.
• Mathematically it can be represented as,
R2 = 1 – ( RSS/TSS )
• Residual Sum of Squares (RSS) is defined as the sum of squares of the residuals for each data point: RSS = Σ (yi − ŷi)^2. It measures the difference between the actual observed output and the predicted output.
• Total Sum of Squares (TSS) is defined as the sum of squared deviations of the data points from the mean of the response variable: TSS = Σ (yi − ȳ)^2.
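A short numeric sketch of computing R² from RSS and TSS; the actual and predicted values below are made up for illustration.

# Small numeric sketch of R^2 = 1 - RSS/TSS using made-up values.
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred   = np.array([2.8, 5.3, 6.9, 9.2])

rss = np.sum((y_actual - y_pred) ** 2)            # residual sum of squares
tss = np.sum((y_actual - np.mean(y_actual)) ** 2) # total sum of squares
r_squared = 1 - rss / tss
print("RSS:", rss, "TSS:", tss, "R^2:", r_squared)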
Cost function
• The cost function, also known as a loss function or objective function, is a fundamental concept in
machine learning, particularly in supervised learning algorithms like Linear Regression.
• It quantifies the "error" or "discrepancy" between the predicted output of a machine learning model
and the actual (true) output.
• The primary goal of training a machine learning model is to minimize this cost function.
• By minimizing the cost function, the model adjusts its internal parameters (like the coefficients and
intercept in Linear Regression) to make its predictions as close as possible to the actual target values.
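As a sketch, the Mean Squared Error cost for univariate linear regression can be written as a small function; the 1/(2m) scaling is one common convention, and the data values are illustrative.

# Sketch of a Mean Squared Error cost function for linear regression.
import numpy as np

def mse_cost(b0, b1, x, y):
    m = len(x)
    predictions = b0 + b1 * x
    return (1.0 / (2 * m)) * np.sum((predictions - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(mse_cost(0.0, 2.0, x, y))  # 0.0 -> perfect fit
print(mse_cost(0.0, 1.0, x, y))  # larger cost -> worse parameters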
Why is Cost Function Important?
• It provides a quantitative measure of model performance.
Example:
• Imagine you're standing on a mountain in dense fog, and you want to reach the lowest point in
the valley. You can't see the whole landscape, so you have to rely on local information. What
would you do? You'd likely look around and take a small step in the direction where the slope is
steepest downwards. You'd repeat this process until you reach a point where you can't go any
lower.
Analogy That describes Gradient Descent
• Cost Function (The Mountain Landscape): This is the function you want to minimize. In machine learning, it
measures how "bad" your model is performing. A higher cost means more errors.
• Parameters (Your Position on the Mountain): These are the adjustable values within your machine learning
model (e.g., weights and biases in a neural network, or the slope and intercept in linear regression). Gradient
Descent iteratively adjusts these parameters.
• Gradient (The Direction of Steepest Descent): The "gradient" of the cost function at a given point tells you the
direction of the steepest ascent (uphill). To minimize the cost, you want to move in the opposite direction of the
gradient, i.e., the direction of steepest descent. Mathematically, the gradient is a vector of partial derivatives of the
cost function with respect to each parameter.
• Learning Rate (The Step Size): This is a crucial hyperparameter that determines how large a step you take in the
direction of the negative gradient.
Learning Rate
• Small learning rate: Takes tiny steps, which can lead to slow convergence but increases the chance of finding a good minimum.
• Large learning rate: Takes big steps, which can speed up convergence but might overshoot the minimum or even diverge (climb up
the other side of the mountain).
Gradient Descent
The goal of the gradient descent algorithm is to minimize the given function (say, cost function). To achieve
this goal, it performs two steps iteratively:
• Compute the gradient (slope), the first-order derivative of the function at that point
• Make a step (move) in the direction opposite to the gradient, i.e., move from the current point by alpha (the learning rate) times the gradient, in the downhill direction.
How Gradient Descent Works (Step-by-Step)
1. Initialize the Parameters:
• The model's parameters (e.g., the coefficients/weights, β1, β2, …, βn, and the intercept, β0, in Linear Regression) are initialized to some arbitrary values, often randomly or to zeros.
2. Calculate the Cost:
• For the current set of parameters, the model makes predictions on the training data.
• The cost function (e.g., Mean Squared Error for Linear Regression) is then calculated to quantify how "bad"
these predictions are.
3. Compute the Gradients:
• This is the crucial step. We calculate the partial derivative of the cost function with respect to each
parameter.
• These partial derivatives tell us the slope of the cost function with respect to each parameter. In other words,
they indicate how much the cost function would change if we slightly adjusted that specific parameter.
• The collection of these partial derivatives forms the gradient vector.
How Gradient Descent Works (Step-by-Step)
4. Update the Parameters:
• Each parameter is moved a small step in the direction opposite to its gradient: βj := βj − α · (∂J/∂βj), where α is the learning rate.
5. Repeat Until Convergence:
• Steps 2 to 4 are repeated until the cost stops decreasing significantly or a maximum number of iterations is reached.
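A minimal sketch of batch gradient descent for univariate linear regression, following the steps above; the data, learning rate, and iteration count are illustrative choices, not tuned values.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.0, 9.0, 11.0, 13.0, 15.0])   # underlying relationship: y = 5 + 2x

b0, b1 = 0.0, 0.0    # step 1: initialize parameters
alpha = 0.05         # learning rate
m = len(x)

for _ in range(2000):                         # step 5: repeat until convergence
    y_hat = b0 + b1 * x                       # step 2: current predictions
    error = y_hat - y
    grad_b0 = (1.0 / m) * np.sum(error)       # step 3: partial derivative w.r.t. b0
    grad_b1 = (1.0 / m) * np.sum(error * x)   # step 3: partial derivative w.r.t. b1
    b0 -= alpha * grad_b0                     # step 4: move against the gradient
    b1 -= alpha * grad_b1

print("b0 ≈", round(b0, 3), "b1 ≈", round(b1, 3))  # expected to land near 5 and 2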
Types of Gradient Descent
• The main variants of Gradient Descent differ in how much data they use to compute the gradient at each update step:
• Batch Gradient Descent: uses the entire training set for every parameter update; stable but slow on large datasets.
• Stochastic Gradient Descent (SGD): uses a single randomly chosen training example per update; fast and noisy, which can help escape shallow local minima.
• Mini-Batch Gradient Descent: uses a small random subset (mini-batch) of the data per update; the most common choice in practice.
Logistic Regression
• Logistic Regression is a supervised learning algorithm used for classification; it models the probability that an input belongs to a particular class.
Types of Logistic Regression
• Binomial Logistic Regression: This type is used when the dependent variable has only two possible
categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most common form of logistic
regression and is used for binary classification problems.
• Multinomial Logistic Regression: This is used when the dependent variable has three or more
possible categories that are not ordered. For example, classifying animals into categories like "cat,"
"dog" or "sheep." It extends the binary logistic regression to handle multiple classes.
• Ordinal Logistic Regression: This type applies when the dependent variable has three or more
categories with a natural order or ranking. Examples include ratings like "low," "medium" and "high."
It takes the order of the categories into account when modeling.
Mathematics Behind Logistic Regression
• Linear Combination of Features: Similar to linear regression, logistic regression starts by forming
a linear combination of the input features and their corresponding weights (coefficients).
• This linear combination, often denoted as z, can take any real value:
z = w0 + w1x1 + w2x2 + … + wnxn
where:
• w0 is the bias (intercept)
• wi are the weights (coefficients) for each feature xi
Sigmoid (Logistic) Function
• The crucial difference from linear regression is that z is then passed through a special
activation function called the sigmoid function (also known as the logistic function).
• This function squashes any real-valued input into a range between 0 and 1, making it
interpretable as a probability.
Decision Rule
• To classify: if the predicted probability σ(z) ≥ 0.5, the example is assigned to class 1; otherwise it is assigned to class 0 (the 0.5 threshold can be adjusted to suit the application).
Sigmoid (Logistic) Function
As shown above, the sigmoid function σ(z) = 1 / (1 + e^(−z)) converts a continuous input into a probability, i.e., a value between 0 and 1.
• σ(z) tends towards 1 as z → +∞
• σ(z) tends towards 0 as z → −∞
• σ(z) is always bounded between 0 and 1
Training the Model (Learning the Weights)
• Objective: The goal of training a logistic regression model is to find the optimal weights (wi) that
best map the input features to the observed target probabilities.
• Loss Function (Cost Function): Unlike linear regression which often uses Mean Squared Error,
logistic regression typically uses a Log Loss (also known as Binary Cross-Entropy Loss) function.
• This function measures how "wrong" the predicted probabilities are compared to the actual class
labels.
• The goal is to minimize this loss.
• For binary classification, the log loss for a single training example is:
L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]
where y is the true label (0 or 1) and ŷ = σ(z) is the predicted probability.
Optimization Algorithm (Gradient Descent)
• To minimize the loss function, an optimization algorithm like Gradient
Descent is used.
• This iteratively adjusts the weights in the direction that reduces the
loss until a minimum is reached.
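A minimal sketch of training logistic regression with gradient descent on the log loss; the one-feature dataset and the hyperparameters are made up for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Feature: hours studied; label: 1 = passed, 0 = failed (illustrative data)
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   1,   1])

w, b = 0.0, 0.0
alpha = 0.5
for _ in range(5000):
    p = sigmoid(w * x + b)                 # predicted probabilities
    grad_w = np.mean((p - y) * x)          # gradient of log loss w.r.t. w
    grad_b = np.mean(p - y)                # gradient of log loss w.r.t. b
    w -= alpha * grad_w
    b -= alpha * grad_b

loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # binary cross-entropy
print("w:", w, "b:", b, "final log loss:", loss)
print("P(pass | 2.5 hours):", sigmoid(w * 2.5 + b))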
Multinomial Logistic Regression
• Multinomial Logistic Regression (MLR) is a type of classification algorithm used when the
dependent variable is categorical with more than two classes.
• It is an extension of binary logistic regression and is used for multi-class classification
problems.
Problem: Predict type of vehicle (Car, Bus, Truck) based on features like engine size, weight, etc.
• Y = {Car, Bus, Truck} ⇒ 3 classes ⇒ use Multinomial Logistic Regression.
Basic Concept of Multinomial Logistic Regression
• Instead of a single sigmoid output, the model computes one linear score per class, zk = wk0 + wk1x1 + … + wknxn, and converts these scores into class probabilities with the softmax function.
Model Formulation
P(y = k | x) = exp(zk) / Σj exp(zj)
• The class with the highest predicted probability is chosen as the final prediction.
Underfitting and Overfitting
• Machine learning models aim to perform well on both training data and new,
unseen data and is considered "good" if:
• It learns patterns effectively from the training data.
• It generalizes well to new, unseen data.
• It avoids memorizing the training data (overfitting) or failing to capture relevant patterns (underfitting).
• Bias and variance are two key sources of error in machine learning models that
directly impact their performance and generalization ability.
• Bias is the error that happens when a machine learning model is too simple and doesn't learn enough detail from the data. It's like assuming all birds are small and can fly, so the model fails to recognize big birds like ostriches, or penguins that can't fly, and its predictions are biased.
• Variance: Error that happens when a machine learning model learns too much from the data, including
random noise.
Overfitting
• Overfitting happens when a model learns too much from the training data, including details that don’t
matter (like noise or outliers).
• For example, imagine fitting a very complicated curve to a set of points. The curve will go through every point, but
it won’t represent the actual pattern.
• As a result, the model works great on training data but fails when tested on new data.
• Overfitting models are like students who memorize answers instead of understanding the topic. They
do well in practice tests (training) but struggle in real exams (testing).
• Underfitting models are like students who don’t study enough. They don’t do well in practice tests or real exams.
Note: An underfitting model has high bias and low variance.
• Reasons for Underfitting:
• The model is too simple, so it may not be capable of representing the complexities in the data.
• The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
• The size of the training dataset is not large enough.
• Excessive regularization is used to prevent overfitting, which constrains the model too much to capture the data well.
• Features are not scaled.
Underfitting, Proper Fitting, and Overfitting
• Underfitting : Straight line trying to fit a curved dataset but cannot capture the data's patterns,
leading to poor performance on both training and test sets.
• Overfitting: A squiggly curve passing through all training points, failing to generalize; it performs well on training data but poorly on test data.
• Appropriate Fitting: A curve that follows the data trend without overcomplicating, capturing the true patterns in the data.
Balance Between Bias and Variance
• The relationship between bias and variance is often referred to as the bias-variance tradeoff, which
highlights the need for balance:
• Increasing model complexity reduces bias but increases variance (risk of overfitting).
• Simplifying the model reduces variance but increases bias (risk of underfitting).
• The goal is to find an optimal balance where both bias and variance are minimized, resulting in good
generalization performance.
• Imagine you're trying to predict the price of houses based on their size, and you decide to draw a line or curve
that best fits the data points on a graph. How well this line captures the trend in the data depends on the
complexity of the model you use.
Bias Variance Tradeoff
Model Evaluation: Confusion Matrix Example (Spam Classification)
                | Predicted Spam | Predicted Not Spam
Actual Spam     |       80       |        20
Actual Not Spam |       10       |        90
• From this:
• TP = 80 (Spam correctly identified)
• FN = 20 (Spam incorrectly identified as Not Spam)
• FP = 10 (Not Spam incorrectly identified as Spam)
• TN = 90 (Not Spam correctly identified)
• Calculate Metrics:
• Accuracy = (80 + 90) / (80 + 20 + 10 + 90) = 170/200 = 85%
• Precision = 80 / (80 + 10) = 80/90 = 88.9%
• Recall = 80 / (80 + 20) = 80/100 = 80%
• F1 Score = 2 * (0.889 * 0.8) / (0.889 + 0.8) ≈ 84.2%
• For 3 or more classes, the confusion matrix becomes an n x n table, where each row represents the actual class,
and each column represents the predicted class. Diagonal elements are correct predictions.
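The metrics above can be reproduced with plain arithmetic from the four counts; a short sketch:

# Recomputing the metrics from the spam example above (plain arithmetic, no libraries).
TP, FN, FP, TN = 80, 20, 10, 90

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 0.85
precision = TP / (TP + FP)                                  # ~0.889
recall    = TP / (TP + FN)                                  # 0.80
f1        = 2 * precision * recall / (precision + recall)   # ~0.842

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")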
Support Vector Machines (SVM)
• Support Vector Machines (SVM) is a supervised machine learning algorithm commonly used for classification tasks.
• SVM constructs a hyperplane or set of hyperplanes in a high-dimensional space that separates the different classes.
• A good separation is achieved by the hyperplane that has the largest margin, meaning the maximum distance between
data points of different classes.
• Support Vector Machine(SVM) is a powerful classifier that works both on linearly and nonlinearly separable data.
• SVM tries to find the “best” margin (distance between the line and the support vectors) that separates the classes.
Types of Support Vector Machine (SVM)
• Linear SVM:
• When the data is perfectly linearly separable only then we
can use Linear SVM.
• Perfectly linearly separable means that the data points can
be classified into 2 classes by using a single straight line(if
2D).
• Non-Linear SVM:
• When the data is not linearly separable, we can use
Non-Linear SVM. This happens when the data points
cannot be separated into two classes using a straight line (if
2D).
• In such cases, we use advanced techniques like kernel
tricks to classify them. In most real-world applications we
do not find linearly separable datapoints hence we use
kernel trick to solve them.
Support Vector Machines (SVM)
• Support Vectors: These are the points closest to the hyperplane from both
classes, highlighted on the boundary lines w ⋅ x − b = 1 and w ⋅ x − b = −1. In
the figure, they are the points on the dashed lines representing the margin.
Support vectors directly influence the positioning of the hyperplane.
• Margin (in yellow): The margin is the distance between the support vectors and
the hyperplane. The goal of SVM is to maximize this margin, ensuring that the
hyperplane separates the classes as clearly as possible. In the figure, the margin is
the region between the dashed lines w ⋅ x − b = 1 and w ⋅ x − b = −1.
• Weight vector w: The arrow labeled w represents the weight vector perpendicular
to the hyperplane. The direction of this vector indicates how the hyperplane is
oriented, and its magnitude determines how steep the slope of the separation
boundary is.
Support Vector Machines (SVM)
• Soft Margin – As most of the real-world data are not fully linearly separable, we will allow some
margin violation to occur which is called soft margin classification. It is better to have a large
margin, even though some constraints are violated. Margin violation means choosing a hyperplane that allows some data points to stay either on the incorrect side of the hyperplane or between the margin and the correct side of the hyperplane.
• Hard Margin – If the training data is linearly separable, we can select two parallel hyperplanes
that separate the two classes of data, so that the distance between them is as large as possible.
Mathematical Computation of SVM
• Consider a binary classification problem with two classes, labeled as +1 and -1.
• We have a training dataset consisting of input feature vectors X and their corresponding class
labels Y.
• The equation for the linear hyperplane can be written as:
w ⋅ x + b = 0
• Where:
• w is the normal vector to the hyperplane (the direction perpendicular to it).
• b is the offset or bias term representing the distance of the hyperplane from the origin along the normal
vector w.
Distance from a Data Point to the Hyperplane
• The distance between a data point xi and the decision boundary can be calculated as:
di = (w ⋅ xi + b) / ||w||
• For the hard-margin SVM, the margin is maximized subject to the constraint that every training point satisfies:
yi (w ⋅ xi + b) ≥ 1, for i = 1, …, m
• Where:
• yi is the class label (+1 or -1) for each training instance.
• xi is the feature vector for the i-th training instance.
• m is the total number of training instances.
• The condition ensures that each data point is correctly classified and lies outside the margin.
Soft Margin in Linear SVM Classifier
• In the presence of outliers or non-separable data, the SVM allows some misclassification by introducing slack variables ζi. The optimization problem is modified as:
minimize (1/2)||w||² + C Σi ζi, subject to yi(w ⋅ xi + b) ≥ 1 − ζi and ζi ≥ 0 for all i
• Where:
• C is a regularization parameter that controls the trade-off between margin maximization and penalty
for misclassifications.
• ζi are slack variables that represent the degree of violation of the margin by each data point.
Kernels in Support Vector Machine
• The most interesting feature of SVM is that it can even work with a non-linear dataset; for this, we use the “Kernel Trick”, which makes it easier to classify the points.
• Suppose we have a dataset like this:
• Here we see we cannot draw a single line or say
hyperplane which can classify the points correctly.
• So what we do is try converting this lower dimension
space to a higher dimension space using some
quadratic functions which will allow us to find a
decision boundary that clearly divides the data points.
• These functions which help us do this are called
Kernels and which kernel to use is purely determined
by hyperparameter tuning.
Polynomial Kernel
• The polynomial kernel allows SVMs to model more complex,
non-linear relationships by introducing polynomial terms.
• It maps the original data into a higher-dimensional feature space where
it might become linearly separable. It's often used in image processing.
K(x, y) = (x ⋅ y + c)^d
Where:
• d is the degree of the polynomial.
• c is a constant term that can control the influence of higher-order terms.
Radial Basis Function (RBF) Kernel /
Gaussian Kernel
• The RBF kernel, also known as the Gaussian kernel, is one of the most widely used and versatile
kernels.
• It maps data into an infinite-dimensional space, making it highly effective for complex, non-linear
classification problems where there's no prior knowledge about the data distribution.
• It measures the similarity between two data points based on their Euclidean distance and a gamma
parameter.
K(x, x') = exp(−γ ||x − x'||²)
Where:
γ (gamma) is a parameter that defines the influence of a
single training example. A small γ means a large influence,
leading to a smoother decision boundary. A large γ means a
small influence, leading to a more complex, potentially
overfitting boundary.
Sigmoid Kernel
• The sigmoid kernel is inspired by neural networks and behaves similarly to the
activation function of a neuron.
• It's often used in scenarios where neural network-like behavior is desired.
K(x, y) = tanh(α (x ⋅ y) + c)
Where:
α (alpha) and c are parameters that control the shape
of the tanh function.
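A sketch comparing kernels with scikit-learn's SVC on a toy non-linear dataset; the dataset generator and the parameter values (C, gamma, degree) are illustrative choices, not recommended settings.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=1.0, gamma="scale", degree=3)
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))

# The RBF kernel typically separates the concentric circles well,
# while the linear kernel cannot.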
Application of SVM Kernel
Decision Tree
• A decision tree is a supervised learning algorithm used for both
classification and regression tasks.
How Does a Decision Tree Work?
1. Start at the Root: The algorithm begins with the entire training dataset at the root node.
2. Find the Best Split: At each node, the algorithm evaluates all available features to find the "best"
way to split the data. "Best" is determined by a specific criterion that aims to maximize the
homogeneity (purity) of the resulting subsets. Common criteria include:
• Information Gain (based on Entropy): For classification, it measures the reduction in uncertainty or disorder in the
dataset after a split. The higher the information gain, the better the split.
• Gini Impurity: For classification, it measures the probability of incorrectly classifying a randomly chosen element from
the dataset if it were randomly labeled according to the distribution of labels in the subset. Lower Gini impurity is
preferred.
• Variance Reduction: For regression, it measures how much the variance of the target variable is reduced after a split.
3. Split the Data: The data is then split into child nodes based on the chosen feature and its threshold
(for numerical features) or categories (for categorical features).
How Does a Decision Tree Work?
4. Repeat (Recursion): This splitting process is recursively applied to each child node until a
stopping criterion is met. Stopping criteria can include:
• All data points in a node belong to the same class (pure node).
• A maximum tree depth is reached.
• A minimum number of samples is required to make a split.
• No more features are available to split on.
5. Form Leaf Nodes: Once a stopping criterion is met, the node becomes a leaf node, and a final
prediction is made for any data point that reaches that leaf (e.g., the majority class for
classification or the average value for regression).
Types of Decision Trees
• Classification Trees
• Used when the target variable is categorical (e.g., predicting "yes" or "no,"
"spam" or "not spam," "dog" or "cat").
• Regression Trees
• Used when the target variable is continuous (e.g., predicting house prices,
temperature, sales figures).
Information Gain (based on Entropy)
• Information Gain, based on Entropy, is a fundamental concept in the
construction of Decision Trees, particularly for classification tasks.
• It's the key metric used by algorithms like ID3 and C4.5 to decide which
feature to split on at each node of the tree.
Entropy
• In the context of decision trees, entropy is a measure of the impurity or uncertainty
within a set of data.
• High Entropy: A dataset with high entropy is very mixed, meaning the classes are evenly
distributed. It's difficult to predict the class of a random sample from this set.
• Low Entropy (or Zero Entropy): A dataset with low entropy (ideally zero) is "pure" or
homogeneous, meaning most or all data points belong to the same class. It's easy to predict the
class of a random sample from this set.
Formula for Entropy
• For a dataset S with C distinct classes, the entropy is calculated as follows:
Entropy(S) = − Σ (i = 1 to C) pi · log2(pi)
Where:
• S is the dataset (or a subset of data at a node).
• C is the number of unique classes in S.
• pi is the proportion (or probability) of instances belonging to class i in dataset S.
• log2 is the base-2 logarithm. The unit of entropy is typically "bits".
Information Gain
• Information Gain (IG) measures the reduction in entropy after a dataset S is split based on a
particular feature (attribute) A. In other words, it quantifies how much "information" a feature
provides about the target variable.
• The goal in building a decision tree is to find the feature that yields the highest Information Gain at
each step, as this feature is considered the "best" for splitting the data and creating more
homogeneous subsets
Example
Let's say you have a dataset for deciding whether to play tennis, with features like Outlook, Temperature, Humidity,
and Windy, and the target variable Play.
• You would repeat this calculation for Temperature, Humidity, and Windy. The feature with
the highest Information Gain would be chosen as the root node of the decision tree. In this
classic example, Outlook usually has the highest Information Gain, making it the first split.
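A small sketch of the entropy and information-gain calculation for the Outlook split, assuming the class counts of the commonly cited 14-example play-tennis dataset (9 Yes / 5 No overall; Sunny 2/3, Overcast 4/0, Rain 3/2).

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [9, 5]                               # Yes, No
subsets = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}

parent_entropy = entropy(parent)              # ~0.940 bits
weighted = sum(sum(c) / sum(parent) * entropy(c) for c in subsets.values())
info_gain = parent_entropy - weighted         # ~0.247 bits

print("Entropy(S):", round(parent_entropy, 3))
print("Information Gain(Outlook):", round(info_gain, 3))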
Gini Index
• The Gini Index, also known as Gini impurity, is a widely used metric in decision tree algorithms to measure the
"impurity" or "mixedness" of a dataset.
• In the context of a decision tree, it helps determine the best way to split a node into sub-nodes to achieve more
homogeneous (pure) groups.
• Probability of Misclassification: More intuitively, the Gini Index can be interpreted as the probability of
misclassifying a randomly chosen element from the dataset if it were randomly labeled according to the class
distribution within that node. A higher Gini Index means a greater chance of misclassification.
How is Gini Index calculated
• The formula for the Gini Index for a given node (dataset D) with C classes is:
Gini(D) = 1 − Σ (i = 1 to C) (pi)^2
Where:
• C is the total number of classes.
• pi is the proportion (or probability) of samples belonging to class i in the node.
Steps to calculate Gini Index for a split
• Calculate Gini Impurity for the parent node: Before any split, calculate the Gini Index of the
entire dataset or the current node being considered for splitting.
• For each potential split (feature and its values):
• Divide the data into child nodes based on the chosen feature and its split point.
• Calculate the Gini Impurity for each individual child node using the formula above.
• Calculate the weighted average Gini Impurity of the child nodes. This is done by multiplying the Gini Impurity of each child node by the proportion of samples it contains, and then summing these values:
Gini_split = Σ (j = 1 to k) (Nj / N) · Gini(Dj)
• Where:
• k is the number of child nodes created by the split.
• Nj is the number of samples in child node j.
• N is the total number of samples in the parent node.
• Gini(Dj) is the Gini Impurity of child node j.
• Choose the best split: The goal is to minimize the Gini impurity. Therefore, the feature and split
point that result in the lowest weighted average Gini Index (or highest Gini Gain) after the split is
chosen as the best split for that node. Gini Gain is calculated as:
• Gini Gain = Gini(parent) − Gini_split
• A higher Gini Gain indicates a better split.
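A short sketch of the Gini impurity and Gini gain calculation; the class counts below are made up for illustration.

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = [10, 10]                 # 10 samples of class A, 10 of class B
left, right = [8, 2], [2, 8]      # child nodes produced by a candidate split

n = sum(parent)
gini_split = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
gini_gain = gini(parent) - gini_split

print("Gini(parent):", gini(parent))              # 0.5 (maximally impure for 2 classes)
print("Weighted Gini after split:", gini_split)   # 0.32
print("Gini gain:", gini_gain)                    # 0.18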
Example
Gini Index vs. Information Gain (Entropy)
• Both Gini Index and Information Gain are popular metrics for splitting in decision trees. They
generally lead to similar results.
Feature | Gini Index | Information Gain (based on Entropy)
Concept | Measures the probability of misclassifying a random sample. | Measures the reduction in uncertainty or randomness.
Formula | 1 − ∑(pi)^2 | Entropy(parent) − ∑(Nj/N)·Entropy(Dj)
Logarithm | Does not involve logarithms. | Involves logarithms.
Computational Efficiency | Generally faster to compute as it avoids log calculations. | Can be slightly more computationally intensive.
Bias | Tends to isolate the most frequent class in its own branch; can be slightly biased towards splits that produce more equal-sized partitions. | Tends to be biased towards attributes with a large number of distinct values (can be mitigated by Gain Ratio).
Typical Use | Used in CART (Classification and Regression Trees) algorithms. | Used in ID3 and C4.5 algorithms.
Range (Binary Classification) | [0, 0.5] (0 for pure, 0.5 for maximally impure) | [0, 1] (0 for pure, 1 for maximally impure)
In most practical scenarios, the choice between Gini Index and Information Gain doesn't drastically change the final tree structure, but
Gini Index is often preferred due to its computational efficiency.
Decision Tree to Decision Rules
Random Forest
• Random Forest is a popular ensemble learning algorithm used for classification,
regression, and other tasks.
• It builds multiple decision trees and merges their results for more accurate and stable
predictions. It is one of the most powerful and widely used algorithms in machine
learning.
Figure: a Random Forest aggregates the outputs of many decision trees (Tree 1, Tree 2, …, Tree N).
How Random Forest Works
• The "random" in Random Forest comes from two key mechanisms that ensure diversity among the
individual decision trees.
• Bagging (Bootstrap Aggregation):
• For each decision tree in the forest, a random subset of the training data is sampled with replacement
(meaning some data points might be selected multiple times, while others might not be selected at all).
This creates different "bootstrap samples" for each tree.
• This technique helps reduce variance and overfitting, which are common problems with individual
decision trees.
• Feature Randomness (Random Subspace Method):
• When building each decision tree, at every split point (node), only a random subset of the available
features is considered to find the best split.
• This further decorrelates the trees, making them less prone to making the same errors and improving the
overall predictive power of the forest.
Prediction Process
• Classification
• For a classification problem, each decision tree in the forest "votes" for a class.
The final prediction of the Random Forest is the class that receives the
majority of votes.
• Regression
• For a regression problem, each decision tree predicts a numerical value. The
final prediction of the Random Forest is typically the average of all the
individual tree predictions.
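An illustrative Random Forest sketch with scikit-learn; the dataset and the hyperparameter values (n_estimators, max_features) are placeholder choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees (one bootstrap sample per tree)
    max_features="sqrt",   # feature randomness: features considered at each split
    random_state=42,
)
forest.fit(X_train, y_train)
print("Test accuracy:", round(forest.score(X_test, y_test), 3))

# Majority vote under the hood: each tree predicts, the forest aggregates.
print("Votes of the first 5 trees for one test sample:",
      [int(tree.predict(X_test[:1])[0]) for tree in forest.estimators_[:5]])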
Figure (Bagging at training time): the training set is sampled with replacement into N subsets, one per tree.
Figure (Bagging at inference time): each tree classifies a test sample and the votes are aggregated (75% confidence in the example).
Figure (Random Subspace Method at training time): each tree is trained on a random subset of the features of the training data.
Figure (Random Subspace Method at inference time): the trees' votes on a test sample are aggregated (66% confidence in the example).
Advantages
Artificial Neural Network architecture
• An Artificial Neural Network (ANN) is a computational model
inspired by the structure and function of the human brain.
• It's a core component of Artificial Intelligence (AI) and a foundational
element of deep learning. ANNs are particularly powerful for tasks
that involve pattern recognition, classification, and making predictions
from complex data.
Biological Neuron
• A nerve cell neuron is a special biological cell that processes information.
According to an estimation, there are a huge number of neurons, approximately 10^11, with numerous interconnections, approximately 10^15.
How an ANN Learns
• Forward Propagation: Input data flows through the network layer by layer; each neuron computes a weighted sum of its inputs, applies an activation function, and passes the result on until the output layer produces a prediction.
• Error Calculation: The network's prediction is then compared to the actual, desired output. The difference between the predicted and actual output is the "error."
• Backpropagation: This is a crucial step in which the error is propagated backward through the network, from the
output layer to the input layer. This process calculates how much each weight contributed to the error.
• Weight Adjustment: Based on the error calculated during backpropagation, the weights of the connections are
adjusted to minimize the error. This is often done using optimization algorithms like gradient descent. The goal is to
make the network's predictions more accurate over time.
• Iteration: This entire process (forward propagation, error calculation, backpropagation, weight adjustment) is
repeated many times over a large dataset. With each iteration, the network continuously refines its weights, learning to
identify patterns and make better predictions.
Artificial neurons vs Biological neurons
Aspect | Biological Neurons | Artificial Neurons
Structure | Dendrites: receive signals from other neurons. Cell Body (Soma): processes the signals. Axon: transmits processed signals to other neurons. | Input nodes: receive data and pass it on to the next layer. Hidden layer nodes: process and transform the data. Output nodes: produce the final result after processing.
Learning Mechanism | Synaptic plasticity: changes in synaptic strength based on activity over time. | Backpropagation: adjusts the weights based on errors in predictions to improve future performance.
Activation | Neurons fire when signals are strong enough to reach a threshold. | Activation function: maps input to output, deciding if the neuron should fire based on the processed data.
Types of neuron connection architecture
• Single-layer feed-forward network
• In this type of network, we have only two layers, the input layer and the output layer, but the input layer does not count because no computation is performed in this layer.
• The output layer is formed when different
weights are applied to input nodes and the
cumulative effect per node is taken. After
this, the neurons collectively give the output
layer to compute the output signals.
Types of neuron connection architecture
• Multilayer feed-forward network
• This network also has a hidden layer that is internal to the network and has no direct contact with the external layer.
• The existence of one or more hidden layers makes the network computationally stronger. It is called a feed-forward network because information flows forward from the input, through the intermediate computations, to determine the output Z.
• There are no feedback connections in which outputs of the model are fed back into
itself.
Types of neuron connection architecture
• Single node with its own feedback
• When outputs can be directed back as inputs to the same layer or
preceding layer nodes, then it results in feedback networks.
• Recurrent networks are feedback networks with closed loops. The
figure shows a single recurrent network having a single neuron with
feedback to itself.
Types of neuron connection architecture
• Single-layer recurrent network
• The network is a single-layer network with a feedback
connection in which the processing element's output can
be directed back to itself or to another processing
element or both.
• A recurrent neural network is a class of artificial neural
networks where connections between nodes form a
directed graph along a sequence.
• This allows it to exhibit dynamic temporal behavior for
a time sequence. Unlike feedforward neural networks,
RNNs can use their internal state (memory) to process
sequences of inputs.
Types of neuron connection architecture
• Multilayer recurrent network
• In this type of network, processing element output
can be directed to the processing element in the
same layer and in the preceding layer forming a
multilayer recurrent network.
• They perform the same task for every element of a
sequence, with the output being dependent on the
previous computations.
• Inputs are not needed at each time step. The main
feature of a Recurrent Neural Network is its hidden
state, which captures some information about a
sequence.
Types of Artificial Neural Networks
• Feedforward Neural Network (FNN)
• Convolutional Neural Network (CNN)
• Radial Basis Function Network (RBFN)
• Recurrent Neural Network (RNN)
Feedforward Neural Network (FNN)
• Feedforward Neural Network (FNN) is a type of artificial neural network in which information flows in a
single direction—from the input layer through hidden layers to the output layer—without loops or
feedback.
• It is mainly used for pattern recognition tasks like image and speech classification.
Feedforward Neural Networks have a structured layered design where data flows
sequentially through each layer.
• Input Layer: The input layer consists of neurons that receive the input data. Each
neuron in the input layer represents a feature of the input data.
• Hidden Layers: One or more hidden layers are placed between the input and
output layers. These layers are responsible for learning the complex patterns in the
data. Each neuron in a hidden layer applies a weighted sum of inputs followed by a
non-linear activation function.
• Output Layer: The output layer provides the final output of the network. The
number of neurons in this layer corresponds to the number of classes in a
classification problem or the number of outputs in a regression problem.
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the neurons to minimize the error
between the predicted output and the actual output. This process is typically performed using backpropagation and
gradient descent.
• Forward Propagation: During forward propagation the input data passes through the network and the output is
calculated.
• Loss Calculation: The loss (or error) is calculated using a loss function such as Mean Squared Error (MSE) for
regression tasks or Cross-Entropy Loss for classification tasks.
• Backpropagation: In backpropagation the error is propagated back through the network to update the weights.
The gradient of the loss function with respect to each weight is calculated and the weights are adjusted using
gradient descent.
Convolutional Neural Network (CNN)
Convolutional Neural Networks (CNNs) are deep learning models
designed to process data with a grid-like topology such as images. They
are the foundation for most modern computer vision applications to
detect features within visual data.
Radial Basis Function Network (RBFN)
• Radial Basis Function (RBF) Neural Networks are used for function
approximation tasks. They are a special category of feed-forward
neural networks comprising three layers.
• Due to this distinct three-layer architecture and universal
approximation capabilities they offer faster learning speeds and
efficient performance in classification and regression problems.
Recurrent Neural Network (RNN)
• Recurrent Neural Networks (RNNs) differ from regular neural
networks in how they process information. While standard neural
networks pass information in one direction i.e from input to output,
RNNs feed information back into the network at each step.
Activation Functions
• It is a mathematical function applied to the output of a neuron. It
introduces non-linearity into the model, allowing the network to
learn and represent complex patterns in the data.
• Activation function decides whether a neuron should be activated by calculating the weighted sum of inputs
and adding a bias term.
• This helps the model make complex decisions and predictions by introducing non-linearities to the output of
each neuron.
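A small sketch of common activation functions referred to later in this unit (sigmoid, tanh, ReLU), applied element-wise to example pre-activation values; the input values are arbitrary.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def tanh(z):    return np.tanh(z)
def relu(z):    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example weighted sums (pre-activations)
print("sigmoid:", np.round(sigmoid(z), 3))  # squashed into (0, 1)
print("tanh:   ", np.round(tanh(z), 3))     # squashed into (-1, 1)
print("relu:   ", np.round(relu(z), 3))     # negatives clipped to 0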
Linear Activation Function
• Linear Activation Function resembles
straight line define by y=x. No matter how
many layers the neural network contains if
they all use linear activation functions the
output is a linear combination of the input.
• The range of the output spans from −∞ to +∞.
• Linear activation function is used at just one
place i.e. output layer.
• Using linear activation across all layers makes
the network's ability to learn complex patterns
limited.
• Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network’s learning and predictive capabilities.
Sigmoid Function
• Sigmoid Activation Function is characterized by an 'S' shape. It is mathematically defined as σ(z) = 1 / (1 + e^(−z)). This formula ensures a smooth and continuous output that is essential for gradient-based optimization methods.
• During training, the gradient computed for each weight/bias tells the network two things:
• Direction: How much each weight/bias needs to change to reduce the error.
• Magnitude: How sensitive the error is to changes in that specific weight/bias.
Forward Pass in ANN
• Calculation: The input signals travel forward through the network, layer by layer. At each neuron,
the inputs are multiplied by their respective weights, summed up, and then passed through an
activation function (e.g., sigmoid, ReLU, tanh) to produce an output for that neuron.
• Prediction: This process continues until the final output layer produces the network's prediction
for the given input.
• Error Calculation: This predicted output is then compared to the actual target output (the "ground
truth"). The difference between these two is quantified by a loss function (e.g., Mean Squared
Error for regression, Cross-Entropy for classification). This loss value represents how "wrong" the
network's prediction was.
Backward Pass (Backpropagation of Error)
• Error Propagation: The calculated error from the output layer is propagated backward through the
network, layer by layer, all the way to the input layer.
• Gradient Calculation (Chain Rule): At each layer, the algorithm calculates how much each weight
and bias contributed to the overall error. This is done using the chain rule of calculus. The chain rule
allows us to calculate the derivative of the loss with respect to a weight in an earlier layer by
multiplying the derivatives of intermediate calculations.
• Essentially, it determines the "blame" for the error and assigns it proportionally to the connections (weights) that
contributed to it.
• Weight/Bias Update: Once the gradients for all weights and biases are known, the optimization
algorithm (Gradient Descent) adjusts these parameters. Each weight/bias is updated in the direction
that minimizes the loss, by subtracting a fraction of its gradient (scaled by a learning rate).
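To tie the forward pass, loss, backpropagation, and weight update together, here is a compact sketch of one training loop for a one-hidden-layer network with sigmoid activations and MSE loss; the data, network size, and learning rate are illustrative assumptions, not a prescribed implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # 4 samples, 2 features
y = np.array([[0.0], [1.0], [1.0], [0.0]])                      # XOR-style targets

W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))   # input -> hidden (3 units)
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))   # hidden -> output
lr = 0.5

for _ in range(5000):
    # ---- forward pass ----
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y_hat = sigmoid(h @ W2 + b2)      # network prediction
    loss = np.mean((y_hat - y) ** 2)  # MSE loss

    # ---- backward pass (chain rule) ----
    d_yhat = 2 * (y_hat - y) / len(X)            # dLoss/dy_hat
    d_z2 = d_yhat * y_hat * (1 - y_hat)          # through the output sigmoid
    d_W2 = h.T @ d_z2
    d_b2 = d_z2.sum(axis=0, keepdims=True)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * h * (1 - h)                     # through the hidden sigmoid
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0, keepdims=True)

    # ---- weight update (gradient descent) ----
    W2 -= lr * d_W2; b2 -= lr * d_b2
    W1 -= lr * d_W1; b1 -= lr * d_b1

print("final loss:", round(loss, 4))
print("predictions:", np.round(y_hat.ravel(), 2))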