Unit-2 Supervised Machine Learning
Content
• What is supervised ML
• Linear Regression for univariate and Multivariate data
• Cost function
• Gradient Descent
• Logistic regression
• Underfitting and Overfitting
• Support Vector Machine
• Decision Tree, Random Forest, Artificial Neural Network architecture
• Activation functions
• Forward pass in ANN
• Backpropagation in ANN
• Model Evaluation techniques.
What is Supervised Machine Learning?
• Supervised Machine Learning is a paradigm in machine learning
where an algorithm learns from a labeled dataset.
• This means that during training, the algorithm is provided with both
the input data (features) and the corresponding correct output (labels
or targets).
• The goal of supervised learning is to build a model that can map new,
unseen input data to its expected output values based on the patterns it
learned from the labeled training data.
Key Characteristics of Supervised Learning
• Labeled Data:
• Requires datasets where the target variable (what you want to predict) is
known for each input example.
• Explicit Feedback:
• The algorithm receives direct feedback (the correct answer) during training,
which it uses to adjust its internal parameters to minimize the difference
between its predictions and the actual labels.
• Goal-Oriented:
• The primary goal is to predict an outcome or classify data based on the learned
patterns.
Types of Supervised Learning Problems:
• Regression:
• The target variable is a continuous numerical value (e.g., predicting house
prices, temperature, stock prices, age, sales).
• Classification:
• The target variable is a categorical label (e.g., predicting whether an email is
spam or not, classifying an image as a cat or dog, determining if a customer
will churn or not).
Process of Supervised Learning
• Data Collection and Preparation: Gather and clean a dataset, ensuring it has both features and corresponding
labels.
• Splitting Data: Divide the labeled dataset into training, validation (optional), and test sets. The model learns from
the training data, is tuned using validation data, and its final performance is evaluated on unseen test data.
• Model Selection: Choose an appropriate machine learning algorithm (e.g., Linear Regression, Logistic
Regression, Decision Trees, Support Vector Machines, Neural Networks).
• Model Training: The algorithm is fed the training data and adjusts its internal parameters iteratively to minimize a
defined "loss function" (which measures the difference between predicted and actual outputs).
• Model Evaluation: Assess the trained model's performance on the test set using various metrics (e.g., accuracy,
precision, recall, F1-score for classification; R-squared, MSE, RMSE for regression).
• Prediction: Once satisfied with the model's performance, it can be used to make predictions on new, unlabeled
data.
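As a minimal sketch of this workflow in Python using scikit-learn; the dataset (Iris) and the decision tree classifier are placeholder choices for illustration, not a prescribed setup.

# Minimal sketch of the supervised learning workflow (scikit-learn assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Data collection/preparation: load a labeled dataset (features X, labels y)
X, y = load_iris(return_X_y=True)

# 2. Splitting data: hold out a test set for final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3-4. Model selection and training
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# 5. Model evaluation on unseen data
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))

# 6. Prediction on new, unlabeled data (one hypothetical flower measurement)
print("Predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))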
Linear Regression
• Linear Regression is a fundamental supervised learning algorithm used for regression tasks,
meaning it predicts a continuous target variable.
• It assumes a linear relationship between the input features and the output variable.
• The core idea is to find the "best-fitting" straight line (or hyperplane in higher dimensions)
that describes the relationship between the features and the target.
• This line minimizes the sum of squared differences between the predicted values and the
actual values (this is known as the Ordinary Least Squares or OLS method).
Linear Regression
• The graph above presents the linear relationship between the output(y) and
predictor(X) variables.
• The blue line is referred to as the best-fit straight line. Based on the given
data points, we attempt to plot a line that fits the points the best.
Univariate Linear Regression (Simple Linear
Regression)
• Univariate linear regression involves one independent variable (feature) to predict a single
continuous dependent variable (target).
• Mathematical Equation: The equation of a straight line is used to model this relationship:
y = β0 + β1x + ϵ
• Where:
• y: The dependent variable (the value we want to predict).
• x: The single independent variable (feature).
• β0 (beta-naught): The y-intercept, representing the predicted value of y when x is 0.
• β1 (beta-one): The coefficient (or slope) of the independent variable, representing the change in y for a one-unit
change in x.
• ϵ: The error term, representing the difference between the actual y and the predicted y.
Example
• Predicting a student's exam score based on the number of hours they studied.
• y = Exam Score
• x = Hours Studied
• Suppose the fitted model is: Exam Score = 40 + 5 × (Hours Studied). This means:
• A student who studies 0 hours is predicted to score 40 (the intercept).
• For every additional hour studied, the exam score is predicted to increase by 5 points.
• Visualization: You can visualize univariate linear regression as fitting a straight line through a scatter plot of data
points (x, y).
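A small illustrative sketch of fitting such a line with scikit-learn; the hours/score values below are made up so that the fitted coefficients come out near the intercept of 40 and slope of 5 used in the example.

# Illustrative sketch: hours studied vs. exam score (made-up data).
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # feature x
scores = np.array([46, 49, 56, 61, 64, 70])                    # target y

model = LinearRegression().fit(hours, scores)
print("Intercept (beta_0):", model.intercept_)   # roughly 40
print("Slope (beta_1):", model.coef_[0])         # roughly 5
print("Predicted score for 7 hours:", model.predict([[7.0]])[0])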
Linear Regression
• The goal of the linear regression algorithm is to get the best values for B0 and
B1 to find the best-fit line.
• The best-fit line is a line that has the least error which means the error between
predicted values and actual values should be minimum.
Random Error(Residuals)
• In regression, the difference between the observed value of the dependent variable (yi) and the predicted value (ŷi) is called the residual.
• εi = yi − ŷi
• where ŷi = β0 + β1xi
Assumptions of Linear Regression
• Linearity of residuals: There needs to be a linear relationship between the dependent
variable and independent variable(s).
• Independence of residuals: The error terms should not be dependent on one another (as in time-series data, where the next value depends on the previous one). There should be no correlation between the residual terms; the presence of such correlation is known as autocorrelation.
Assumptions of Linear Regression
• The equal variance of residuals: The error terms must have constant variance. This phenomenon
is known as Homoscedasticity. The presence of non-constant variance in the error terms is referred
to as Heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or
extreme leverage values.
Multivariate Linear Regression (Multiple Linear
Regression)
• Multivariate linear regression involves two or more independent variables (features) to predict a single
continuous dependent variable (target).
• The relationship is still assumed to be linear, but now it's a linear combination of multiple features.
• Mathematical Equation: The equation extends the univariate form to include multiple independent variables:
y = β0 + β1x1 + β2x2 + … + βnxn + ϵ
• Where:
• y: The dependent variable (the value we want to predict).
• x1 ,x2 ,…,xn : The n independent variables (features).
• β0 : The y-intercept.
• β1 ,β2 ,…,βn : The coefficients (slopes) for each respective independent variable. β i represents the change in y for a one-unit
change in x i , assuming all other independent variables are held constant.
• ϵ: The error term.
Example
• Predicting a house price based on its size, number of bedrooms, and
location.
• y = House Price
• x1 = Size (in sq. ft.)
• x2 = Number of Bedrooms
• x3 = Location Score (e.g., a numerical rating for neighborhood desirability)
Example
• If the model finds coefficients β0 = 50000, β1 = 100, β2 = 10000, β3 = 5000, the fitted model is: House Price = 50000 + 100 × Size + 10000 × Bedrooms + 5000 × Location Score.
• Visualization: With two independent variables (x 1 ,x 2 ), the relationship can be visualized as a plane in a 3D
space. With more than two independent variables, it becomes a "hyperplane" which is difficult to visualize
directly, but the mathematical principle remains the same.
Polynomial Regression
• Polynomial regression analysis represents a non-linear relationship
between dependent and independent variables.
• This technique is a variant of the multiple linear regression model, but
the best fit line is curved rather than straight.
Polynomial Regression
• The general form of the equation for a polynomial regression of degree n is:
y = β0 + β1x + β2x^2 + … + βnx^n + ϵ
• Choosing the right polynomial degree n is important: a higher degree may fit the data
more closely but it can lead to overfitting.
• The degree should be selected based on the complexity of the data. Once the model is
trained, it can be used to make predictions on new data, capturing non-linear
relationships and providing a more accurate model for real-world applications.
Coefficient of Determination or R-squared
(R2)
• R-squared is a number that explains the amount of variation that is explained/captured by the
developed model. It always ranges between 0 & 1.
• Overall, the higher the value of R-squared, the better the model fits the data.
• Mathematically it can be represented as,
R2 = 1 – ( RSS/TSS )
• Residual Sum of Squares (RSS) is defined as the sum of squares of the residuals for each data point: RSS = Σ (yi − ŷi)^2. It measures the difference between the actual observed output and the predicted output.
• Total Sum of Squares (TSS) is defined as the sum of squared deviations of the data points from the mean of the response variable: TSS = Σ (yi − ȳ)^2.
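A short numeric sketch of computing R² from RSS and TSS; the actual and predicted values below are made up for illustration.

# Small numeric sketch of R^2 = 1 - RSS/TSS using made-up values.
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred   = np.array([2.8, 5.3, 6.9, 9.2])

rss = np.sum((y_actual - y_pred) ** 2)            # residual sum of squares
tss = np.sum((y_actual - np.mean(y_actual)) ** 2) # total sum of squares
r_squared = 1 - rss / tss
print("RSS:", rss, "TSS:", tss, "R^2:", r_squared)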
Cost function
• The cost function, also known as a loss function or objective function, is a fundamental concept in
machine learning, particularly in supervised learning algorithms like Linear Regression.
• It quantifies the "error" or "discrepancy" between the predicted output of a machine learning model
and the actual (true) output.
• The primary goal of training a machine learning model is to minimize this cost function.
• By minimizing the cost function, the model adjusts its internal parameters (like the coefficients and
intercept in Linear Regression) to make its predictions as close as possible to the actual target values.
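As a sketch, the Mean Squared Error cost for univariate linear regression can be written as a small function; the 1/(2m) scaling is one common convention, and the data values are illustrative.

# Sketch of a Mean Squared Error cost function for linear regression.
import numpy as np

def mse_cost(b0, b1, x, y):
    m = len(x)
    predictions = b0 + b1 * x
    return (1.0 / (2 * m)) * np.sum((predictions - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(mse_cost(0.0, 2.0, x, y))  # 0.0 -> perfect fit
print(mse_cost(0.0, 1.0, x, y))  # larger cost -> worse parameters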
Why is Cost Function Important?
• It provides a quantitative measure of model performance.
Example:
• Imagine you're standing on a mountain in dense fog, and you want to reach the lowest point in
the valley. You can't see the whole landscape, so you have to rely on local information. What
would you do? You'd likely look around and take a small step in the direction where the slope is
steepest downwards. You'd repeat this process until you reach a point where you can't go any
lower.
Analogy That describes Gradient Descent
• Cost Function (The Mountain Landscape): This is the function you want to minimize. In machine learning, it
measures how "bad" your model is performing. A higher cost means more errors.
• Parameters (Your Position on the Mountain): These are the adjustable values within your machine learning
model (e.g., weights and biases in a neural network, or the slope and intercept in linear regression). Gradient
Descent iteratively adjusts these parameters.
• Gradient (The Direction of Steepest Descent): The "gradient" of the cost function at a given point tells you the
direction of the steepest ascent (uphill). To minimize the cost, you want to move in the opposite direction of the
gradient, i.e., the direction of steepest descent. Mathematically, the gradient is a vector of partial derivatives of the
cost function with respect to each parameter.
• Learning Rate (The Step Size): This is a crucial hyperparameter that determines how large a step you take in the
direction of the negative gradient.
Learning Rate
• Small learning rate: Takes tiny steps, which can lead to slow convergence but increases the chance of finding a good minimum.
• Large learning rate: Takes big steps, which can speed up convergence but might overshoot the minimum or even diverge (climb up
the other side of the mountain).
Gradient Descent
The goal of the gradient descent algorithm is to minimize the given function (say, cost function). To achieve
this goal, it performs two steps iteratively:
• Compute the gradient (slope), the first-order derivative of the function at that point
• Make a step (move) in the direction opposite to the gradient, i.e., move from the current point by alpha (the learning rate) times the gradient, in the downhill direction.
How Gradient Descent Works (Step-by-Step)
1. Initialize the Parameters:
• The model's parameters (e.g., the coefficients/weights, β1, β2, …, βn, and the intercept, β0, in Linear Regression) are initialized to some arbitrary values, often randomly or to zeros.
2. Calculate the Cost:
• For the current set of parameters, the model makes predictions on the training data.
• The cost function (e.g., Mean Squared Error for Linear Regression) is then calculated to quantify how "bad"
these predictions are.
3. Compute the Gradients:
• This is the crucial step. We calculate the partial derivative of the cost function with respect to each
parameter.
• These partial derivatives tell us the slope of the cost function with respect to each parameter. In other words,
they indicate how much the cost function would change if we slightly adjusted that specific parameter.
• The collection of these partial derivatives forms the gradient vector.
How Gradient Descent Works (Step-by-Step)
4. Update the Parameters:
• Each parameter is moved a small step in the direction opposite to its gradient: βj := βj − α · (∂J/∂βj), where α is the learning rate.
5. Repeat Until Convergence:
• Steps 2 to 4 are repeated until the cost stops decreasing significantly or a maximum number of iterations is reached.
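A minimal sketch of batch gradient descent for univariate linear regression, following the steps above; the data, learning rate, and iteration count are illustrative choices, not tuned values.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.0, 9.0, 11.0, 13.0, 15.0])   # underlying relationship: y = 5 + 2x

b0, b1 = 0.0, 0.0    # step 1: initialize parameters
alpha = 0.05         # learning rate
m = len(x)

for _ in range(2000):                         # step 5: repeat until convergence
    y_hat = b0 + b1 * x                       # step 2: current predictions
    error = y_hat - y
    grad_b0 = (1.0 / m) * np.sum(error)       # step 3: partial derivative w.r.t. b0
    grad_b1 = (1.0 / m) * np.sum(error * x)   # step 3: partial derivative w.r.t. b1
    b0 -= alpha * grad_b0                     # step 4: move against the gradient
    b1 -= alpha * grad_b1

print("b0 ≈", round(b0, 3), "b1 ≈", round(b1, 3))  # expected to land near 5 and 2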
Types of Gradient Descent
• The main variants of Gradient Descent differ in how much data they use to compute the gradient at each update step:
• Batch Gradient Descent: uses the entire training set for every parameter update; stable but slow on large datasets.
• Stochastic Gradient Descent (SGD): uses a single randomly chosen training example per update; fast and noisy, which can help escape shallow local minima.
• Mini-Batch Gradient Descent: uses a small random subset (mini-batch) of the data per update; the most common choice in practice.
Logistic Regression
• Logistic Regression is a supervised learning algorithm used for classification; it models the probability that an input belongs to a particular class.
Types of Logistic Regression
• Binomial Logistic Regression: This type is used when the dependent variable has only two possible
categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most common form of logistic
regression and is used for binary classification problems.
• Multinomial Logistic Regression: This is used when the dependent variable has three or more
possible categories that are not ordered. For example, classifying animals into categories like "cat,"
"dog" or "sheep." It extends the binary logistic regression to handle multiple classes.
• Ordinal Logistic Regression: This type applies when the dependent variable has three or more
categories with a natural order or ranking. Examples include ratings like "low," "medium" and "high."
It takes the order of the categories into account when modeling.
Mathematics Behind Logistic Regression
• Linear Combination of Features: Similar to linear regression, logistic regression starts by forming
a linear combination of the input features and their corresponding weights (coefficients).
• This linear combination, often denoted as z, can take any real value:
z = w0 + w1x1 + w2x2 + … + wnxn
where:
• w0 is the bias (intercept)
• wi are the weights (coefficients) for each feature xi
Sigmoid (Logistic) Function
• The crucial difference from linear regression is that z is then passed through a special
activation function called the sigmoid function (also known as the logistic function).
• This function squashes any real-valued input into a range between 0 and 1, making it
interpretable as a probability.
Decision Rule
• To classify: if the predicted probability σ(z) ≥ 0.5, the example is assigned to class 1; otherwise it is assigned to class 0 (the 0.5 threshold can be adjusted to suit the application).
Sigmoid (Logistic) Function
As shown above, the sigmoid function σ(z) = 1 / (1 + e^(−z)) converts a continuous input into a probability, i.e., a value between 0 and 1.
• σ(z) tends towards 1 as z → +∞
• σ(z) tends towards 0 as z → −∞
• σ(z) is always bounded between 0 and 1
Training the Model (Learning the Weights)
• Objective: The goal of training a logistic regression model is to find the optimal weights (wi) that
best map the input features to the observed target probabilities.
• Loss Function (Cost Function): Unlike linear regression which often uses Mean Squared Error,
logistic regression typically uses a Log Loss (also known as Binary Cross-Entropy Loss) function.
• This function measures how "wrong" the predicted probabilities are compared to the actual class
labels.
• The goal is to minimize this loss.
• For binary classification, the log loss for a single training example is:
L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]
where y is the true label (0 or 1) and ŷ = σ(z) is the predicted probability.
Optimization Algorithm (Gradient Descent)
• To minimize the loss function, an optimization algorithm like Gradient
Descent is used.
• This iteratively adjusts the weights in the direction that reduces the
loss until a minimum is reached.
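A minimal sketch of training logistic regression with gradient descent on the log loss; the one-feature dataset and the hyperparameters are made up for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Feature: hours studied; label: 1 = passed, 0 = failed (illustrative data)
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   1,   1])

w, b = 0.0, 0.0
alpha = 0.5
for _ in range(5000):
    p = sigmoid(w * x + b)                 # predicted probabilities
    grad_w = np.mean((p - y) * x)          # gradient of log loss w.r.t. w
    grad_b = np.mean(p - y)                # gradient of log loss w.r.t. b
    w -= alpha * grad_w
    b -= alpha * grad_b

loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # binary cross-entropy
print("w:", w, "b:", b, "final log loss:", loss)
print("P(pass | 2.5 hours):", sigmoid(w * 2.5 + b))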
Multinomial Logistic Regression
• Multinomial Logistic Regression (MLR) is a type of classification algorithm used when the
dependent variable is categorical with more than two classes.
• It is an extension of binary logistic regression and is used for multi-class classification
problems.
Problem: Predict type of vehicle (Car, Bus, Truck) based on features like engine size, weight, etc.
• Y = {Car, Bus, Truck} ⇒ 3 classes ⇒ use Multinomial Logistic Regression.
Basic Concept of Multinomial Logistic Regression
• Instead of a single sigmoid output, the model computes one linear score per class, zk = wk0 + wk1x1 + … + wknxn, and converts these scores into class probabilities with the softmax function.
Model Formulation
P(y = k | x) = exp(zk) / Σj exp(zj)
• The class with the highest predicted probability is chosen as the final prediction.
Underfitting and Overfitting
• Machine learning models aim to perform well on both training data and new,
unseen data and is considered "good" if:
• It learns patterns effectively from the training data.
• It generalizes well to new, unseen data.
• It avoids memorizing the training data (overfitting) or failing to capture relevant patterns (underfitting).
• Bias and variance are two key sources of error in machine learning models that
directly impact their performance and generalization ability.
• Bias is the error that happens when a machine learning model is too simple and doesn't learn enough detail from the data. It's like assuming all birds are small and can fly, so the model fails to recognize big birds like ostriches, or penguins that can't fly, and its predictions are biased.
• Variance: Error that happens when a machine learning model learns too much from the data, including
random noise.
Overfitting
• Overfitting happens when a model learns too much from the training data, including details that don’t
matter (like noise or outliers).
• For example, imagine fitting a very complicated curve to a set of points. The curve will go through every point, but
it won’t represent the actual pattern.
• As a result, the model works great on training data but fails when tested on new data.
• Overfitting models are like students who memorize answers instead of understanding the topic. They
do well in practice tests (training) but struggle in real exams (testing).
• Underfitting models are like students who don’t study enough. They don’t do well in practice tests or real exams.
Note: An underfitting model has high bias and low variance.
• Reasons for Underfitting:
• The model is too simple, so it may not be capable of representing the complexities in the data.
• The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
• The size of the training dataset is not large enough.
• Excessive regularization is used to prevent overfitting, which constrains the model too much to capture the data well.
• Features are not scaled.
Underfitting, Proper Fitting, and Overfitting
• Underfitting : Straight line trying to fit a curved dataset but cannot capture the data's patterns,
leading to poor performance on both training and test sets.
• Overfitting: A squiggly curve passing through all training points, failing to generalize; it performs well on training data but poorly on test data.
• Appropriate Fitting: A curve that follows the data trend without overcomplicating, capturing the true patterns in the data.
Balance Between Bias and Variance
• The relationship between bias and variance is often referred to as the bias-variance tradeoff, which
highlights the need for balance:
• Increasing model complexity reduces bias but increases variance (risk of overfitting).
• Simplifying the model reduces variance but increases bias (risk of underfitting).
• The goal is to find an optimal balance where both bias and variance are minimized, resulting in good
generalization performance.
• Imagine you're trying to predict the price of houses based on their size, and you decide to draw a line or curve
that best fits the data points on a graph. How well this line captures the trend in the data depends on the
complexity of the model you use.
Bias Variance Tradeoff
Model Evaluation: Confusion Matrix Example (Spam Classification)
                | Predicted Spam | Predicted Not Spam
Actual Spam     |       80       |        20
Actual Not Spam |       10       |        90
• From this:
• TP = 80 (Spam correctly identified)
• FN = 20 (Spam incorrectly identified as Not Spam)
• FP = 10 (Not Spam incorrectly identified as Spam)
• TN = 90 (Not Spam correctly identified)
• Calculate Metrics:
• Accuracy = (80 + 90) / (80 + 20 + 10 + 90) = 170/200 = 85%
• Precision = 80 / (80 + 10) = 80/90 = 88.9%
• Recall = 80 / (80 + 20) = 80/100 = 80%
• F1 Score = 2 * (0.889 * 0.8) / (0.889 + 0.8) ≈ 84.2%
• For 3 or more classes, the confusion matrix becomes an n x n table, where each row represents the actual class,
and each column represents the predicted class. Diagonal elements are correct predictions.
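The metrics above can be reproduced with plain arithmetic from the four counts; a short sketch:

# Recomputing the metrics from the spam example above (plain arithmetic, no libraries).
TP, FN, FP, TN = 80, 20, 10, 90

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 0.85
precision = TP / (TP + FP)                                  # ~0.889
recall    = TP / (TP + FN)                                  # 0.80
f1        = 2 * precision * recall / (precision + recall)   # ~0.842

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")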
Support Vector Machines (SVM)
• Support Vector Machines (SVM) is a supervised machine learning algorithm commonly used for classification tasks.
• SVM constructs a hyperplane or set of hyperplanes in a high-dimensional space that separates the different classes.
• A good separation is achieved by the hyperplane that has the largest margin, meaning the maximum distance between
data points of different classes.
• Support Vector Machine(SVM) is a powerful classifier that works both on linearly and nonlinearly separable data.
• SVM tries to find the “best” margin (distance between the line and the support vectors) that separates the classes.
Types of Support Vector Machine (SVM)
• Linear SVM:
• When the data is perfectly linearly separable only then we
can use Linear SVM.
• Perfectly linearly separable means that the data points can
be classified into 2 classes by using a single straight line(if
2D).
• Non-Linear SVM:
• When the data is not linearly separable, we can use
Non-Linear SVM. This happens when the data points
cannot be separated into two classes using a straight line (if
2D).
• In such cases, we use advanced techniques like kernel
tricks to classify them. In most real-world applications we
do not find linearly separable datapoints hence we use
kernel trick to solve them.
Support Vector Machines (SVM)
• Support Vectors: These are the points closest to the hyperplane from both
classes, highlighted on the boundary lines w ⋅ x − b = 1 and w ⋅ x − b = −1. In
the figure, they are the points on the dashed lines representing the margin.
Support vectors directly influence the positioning of the hyperplane.
• Margin (in yellow): The margin is the distance between the support vectors and
the hyperplane. The goal of SVM is to maximize this margin, ensuring that the
hyperplane separates the classes as clearly as possible. In the figure, the margin is
the region between the dashed lines w ⋅ x − b = 1 and w ⋅ x − b = −1.
• Weight vector w: The arrow labeled w represents the weight vector perpendicular
to the hyperplane. The direction of this vector indicates how the hyperplane is
oriented, and its magnitude determines how steep the slope of the separation
boundary is.
Support Vector Machines (SVM)
• Soft Margin – As most of the real-world data are not fully linearly separable, we will allow some
margin violation to occur which is called soft margin classification. It is better to have a large
margin, even though some constraints are violated. Margin violation means choosing a hyperplane that allows some data points to stay either on the incorrect side of the hyperplane or between the margin and the correct side of the hyperplane.
• Hard Margin – If the training data is linearly separable, we can select two parallel hyperplanes
that separate the two classes of data, so that the distance between them is as large as possible.
Mathematical Computation of SVM
• Consider a binary classification problem with two classes, labeled as +1 and -1.
• We have a training dataset consisting of input feature vectors X and their corresponding class
labels Y.
• The equation for the linear hyperplane can be written as:
w ⋅ x + b = 0
• Where:
• w is the normal vector to the hyperplane (the direction perpendicular to it).
• b is the offset or bias term representing the distance of the hyperplane from the origin along the normal
vector w.
Distance from a Data Point to the Hyperplane
• The distance between a data point xi and the decision boundary can be calculated as:
di = (w ⋅ xi + b) / ||w||
• For the hard-margin SVM, the margin is maximized subject to the constraint that every training point satisfies:
yi (w ⋅ xi + b) ≥ 1, for i = 1, …, m
• Where:
• yi is the class label (+1 or -1) for each training instance.
• xi is the feature vector for the i-th training instance.
• m is the total number of training instances.
• The condition ensures that each data point is correctly classified and lies outside the margin.
Soft Margin in Linear SVM Classifier
• In the presence of outliers or non-separable data, the SVM allows some misclassification by introducing slack variables ζi. The optimization problem is modified as:
minimize (1/2)||w||² + C Σi ζi, subject to yi(w ⋅ xi + b) ≥ 1 − ζi and ζi ≥ 0 for all i
• Where:
• C is a regularization parameter that controls the trade-off between margin maximization and penalty
for misclassifications.
• ζi are slack variables that represent the degree of violation of the margin by each data point.
Kernels in Support Vector Machine
• The most interesting feature of SVM is that it can even work with a non-linear dataset; for this, we use the “Kernel Trick”, which makes it easier to classify the points.
• Suppose we have a dataset like this:
• Here we see we cannot draw a single line or say
hyperplane which can classify the points correctly.
• So what we do is try converting this lower dimension
space to a higher dimension space using some
quadratic functions which will allow us to find a
decision boundary that clearly divides the data points.
• These functions which help us do this are called
Kernels and which kernel to use is purely determined
by hyperparameter tuning.
Polynomial Kernel
• The polynomial kernel allows SVMs to model more complex,
non-linear relationships by introducing polynomial terms.
• It maps the original data into a higher-dimensional feature space where
it might become linearly separable. It's often used in image processing.
K(x, y) = (x ⋅ y + c)^d
Where:
• d is the degree of the polynomial.
• c is a constant term that can control the influence of higher-order terms.
Radial Basis Function (RBF) Kernel /
Gaussian Kernel
• The RBF kernel, also known as the Gaussian kernel, is one of the most widely used and versatile
kernels.
• It maps data into an infinite-dimensional space, making it highly effective for complex, non-linear
classification problems where there's no prior knowledge about the data distribution.
• It measures the similarity between two data points based on their Euclidean distance and a gamma
parameter.
K(x, x') = exp(−γ ||x − x'||²)
Where:
γ (gamma) is a parameter that defines the influence of a
single training example. A small γ means a large influence,
leading to a smoother decision boundary. A large γ means a
small influence, leading to a more complex, potentially
overfitting boundary.
Sigmoid Kernel
• The sigmoid kernel is inspired by neural networks and behaves similarly to the
activation function of a neuron.
• It's often used in scenarios where neural network-like behavior is desired.
K(x, y) = tanh(α (x ⋅ y) + c)
Where:
α (alpha) and c are parameters that control the shape
of the tanh function.
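A sketch comparing kernels with scikit-learn's SVC on a toy non-linear dataset; the dataset generator and the parameter values (C, gamma, degree) are illustrative choices, not recommended settings.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=1.0, gamma="scale", degree=3)
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))

# The RBF kernel typically separates the concentric circles well,
# while the linear kernel cannot.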
Application of SVM Kernel
Decision Tree
• A decision tree is a supervised learning algorithm used for both
classification and regression tasks.
How Does a Decision Tree Work?
1. Start at the Root: The algorithm begins with the entire training dataset at the root node.
2. Find the Best Split: At each node, the algorithm evaluates all available features to find the "best"
way to split the data. "Best" is determined by a specific criterion that aims to maximize the
homogeneity (purity) of the resulting subsets. Common criteria include:
• Information Gain (based on Entropy): For classification, it measures the reduction in uncertainty or disorder in the
dataset after a split. The higher the information gain, the better the split.
• Gini Impurity: For classification, it measures the probability of incorrectly classifying a randomly chosen element from
the dataset if it were randomly labeled according to the distribution of labels in the subset. Lower Gini impurity is
preferred.
• Variance Reduction: For regression, it measures how much the variance of the target variable is reduced after a split.
3. Split the Data: The data is then split into child nodes based on the chosen feature and its threshold
(for numerical features) or categories (for categorical features).
How Does a Decision Tree Work?
4. Repeat (Recursion): This splitting process is recursively applied to each child node until a
stopping criterion is met. Stopping criteria can include:
• All data points in a node belong to the same class (pure node).
• A maximum tree depth is reached.
• A minimum number of samples is required to make a split.
• No more features are available to split on.
5. Form Leaf Nodes: Once a stopping criterion is met, the node becomes a leaf node, and a final
prediction is made for any data point that reaches that leaf (e.g., the majority class for
classification or the average value for regression).
Types of Decision Trees
• Classification Trees
• Used when the target variable is categorical (e.g., predicting "yes" or "no,"
"spam" or "not spam," "dog" or "cat").
• Regression Trees
• Used when the target variable is continuous (e.g., predicting house prices,
temperature, sales figures).
Information Gain (based on Entropy)
• Information Gain, based on Entropy, is a fundamental concept in the
construction of Decision Trees, particularly for classification tasks.
• It's the key metric used by algorithms like ID3 and C4.5 to decide which
feature to split on at each node of the tree.
Entropy
• In the context of decision trees, entropy is a measure of the impurity or uncertainty
within a set of data.
• High Entropy: A dataset with high entropy is very mixed, meaning the classes are evenly
distributed. It's difficult to predict the class of a random sample from this set.
• Low Entropy (or Zero Entropy): A dataset with low entropy (ideally zero) is "pure" or
homogeneous, meaning most or all data points belong to the same class. It's easy to predict the
class of a random sample from this set.
Formula for Entropy
• For a dataset S with C distinct classes, the entropy is calculated as follows:
Entropy(S) = − Σ (i = 1 to C) pi · log2(pi)
Where:
• S is the dataset (or a subset of data at a node).
• C is the number of unique classes in S.
• pi is the proportion (or probability) of instances belonging to class i in dataset S.
• log2 is the base-2 logarithm. The unit of entropy is typically "bits".
Information Gain
• Information Gain (IG) measures the reduction in entropy after a dataset S is split based on a
particular feature (attribute) A. In other words, it quantifies how much "information" a feature
provides about the target variable.
• The goal in building a decision tree is to find the feature that yields the highest Information Gain at
each step, as this feature is considered the "best" for splitting the data and creating more
homogeneous subsets
Example
Let's say you have a dataset for deciding whether to play tennis, with features like Outlook, Temperature, Humidity,
and Windy, and the target variable Play.
• You would repeat this calculation for Temperature, Humidity, and Windy. The feature with
the highest Information Gain would be chosen as the root node of the decision tree. In this
classic example, Outlook usually has the highest Information Gain, making it the first split.
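A small sketch of the entropy and information-gain calculation for the Outlook split, assuming the class counts of the commonly cited 14-example play-tennis dataset (9 Yes / 5 No overall; Sunny 2/3, Overcast 4/0, Rain 3/2).

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [9, 5]                               # Yes, No
subsets = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}

parent_entropy = entropy(parent)              # ~0.940 bits
weighted = sum(sum(c) / sum(parent) * entropy(c) for c in subsets.values())
info_gain = parent_entropy - weighted         # ~0.247 bits

print("Entropy(S):", round(parent_entropy, 3))
print("Information Gain(Outlook):", round(info_gain, 3))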
Gini Index
• The Gini Index, also known as Gini impurity, is a widely used metric in decision tree algorithms to measure the
"impurity" or "mixedness" of a dataset.
• In the context of a decision tree, it helps determine the best way to split a node into sub-nodes to achieve more
homogeneous (pure) groups.
• Probability of Misclassification: More intuitively, the Gini Index can be interpreted as the probability of
misclassifying a randomly chosen element from the dataset if it were randomly labeled according to the class
distribution within that node. A higher Gini Index means a greater chance of misclassification.
How is Gini Index calculated
• The formula for the Gini Index for a given node (dataset D) with C classes is:
Gini(D) = 1 − Σ (i = 1 to C) (pi)^2
Where:
• C is the total number of classes.
• pi is the proportion (or probability) of samples belonging to class i in the node.
Steps to calculate Gini Index for a split
• Calculate Gini Impurity for the parent node: Before any split, calculate the Gini Index of the
entire dataset or the current node being considered for splitting.
• For each potential split (feature and its values):
• Divide the data into child nodes based on the chosen feature and its split point.
• Calculate the Gini Impurity for each individual child node using the formula above.
• Calculate the weighted average Gini Impurity of the child nodes. This is done by multiplying the Gini Impurity of each child node by the proportion of samples it contains, and then summing these values:
Gini_split = Σ (j = 1 to k) (Nj / N) · Gini(Dj)
• Where:
• k is the number of child nodes created by the split.
• Nj is the number of samples in child node j.
• N is the total number of samples in the parent node.
• Gini(Dj) is the Gini Impurity of child node j.
• Choose the best split: The goal is to minimize the Gini impurity. Therefore, the feature and split
point that result in the lowest weighted average Gini Index (or highest Gini Gain) after the split is
chosen as the best split for that node. Gini Gain is calculated as:
• Gini Gain = Gini(parent) − Gini_split
• A higher Gini Gain indicates a better split.
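A short sketch of the Gini impurity and Gini gain calculation; the class counts below are made up for illustration.

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = [10, 10]                 # 10 samples of class A, 10 of class B
left, right = [8, 2], [2, 8]      # child nodes produced by a candidate split

n = sum(parent)
gini_split = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
gini_gain = gini(parent) - gini_split

print("Gini(parent):", gini(parent))              # 0.5 (maximally impure for 2 classes)
print("Weighted Gini after split:", gini_split)   # 0.32
print("Gini gain:", gini_gain)                    # 0.18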
Example
Gini Index vs. Information Gain (Entropy)
• Both Gini Index and Information Gain are popular metrics for splitting in decision trees. They
generally lead to similar results.
Feature | Gini Index | Information Gain (based on Entropy)
Concept | Measures the probability of misclassifying a random sample. | Measures the reduction in uncertainty or randomness.
Formula | 1 − ∑(pi)^2 | Entropy(parent) − ∑(Nj/N)·Entropy(Dj)
Logarithm | Does not involve logarithms. | Involves logarithms.
Computational Efficiency | Generally faster to compute as it avoids log calculations. | Can be slightly more computationally intensive.
Bias | Tends to isolate the most frequent class in its own branch; can be slightly biased towards splits that produce more equal-sized partitions. | Tends to be biased towards attributes with a large number of distinct values (can be mitigated by Gain Ratio).
Typical Use | Used in CART (Classification and Regression Trees) algorithms. | Used in ID3 and C4.5 algorithms.
Range (Binary Classification) | [0, 0.5] (0 for pure, 0.5 for maximally impure) | [0, 1] (0 for pure, 1 for maximally impure)
In most practical scenarios, the choice between Gini Index and Information Gain doesn't drastically change the final tree structure, but
Gini Index is often preferred due to its computational efficiency.
Decision Tree to Decision Rules
Random Forest
• Random Forest is a popular ensemble learning algorithm used for classification,
regression, and other tasks.
• It builds multiple decision trees and merges their results for more accurate and stable
predictions. It is one of the most powerful and widely used algorithms in machine
learning.
Figure: a Random Forest aggregates the outputs of many decision trees (Tree 1, Tree 2, …, Tree N).
How Random Forest Works
• The "random" in Random Forest comes from two key mechanisms that ensure diversity among the
individual decision trees.
• Bagging (Bootstrap Aggregation):
• For each decision tree in the forest, a random subset of the training data is sampled with replacement
(meaning some data points might be selected multiple times, while others might not be selected at all).
This creates different "bootstrap samples" for each tree.
• This technique helps reduce variance and overfitting, which are common problems with individual
decision trees.
• Feature Randomness (Random Subspace Method):
• When building each decision tree, at every split point (node), only a random subset of the available
features is considered to find the best split.
• This further decorrelates the trees, making them less prone to making the same errors and improving the
overall predictive power of the forest.
Prediction Process
• Classification
• For a classification problem, each decision tree in the forest "votes" for a class.
The final prediction of the Random Forest is the class that receives the
majority of votes.
• Regression
• For a regression problem, each decision tree predicts a numerical value. The
final prediction of the Random Forest is typically the average of all the
individual tree predictions.
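An illustrative Random Forest sketch with scikit-learn; the dataset and the hyperparameter values (n_estimators, max_features) are placeholder choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees (one bootstrap sample per tree)
    max_features="sqrt",   # feature randomness: features considered at each split
    random_state=42,
)
forest.fit(X_train, y_train)
print("Test accuracy:", round(forest.score(X_test, y_test), 3))

# Majority vote under the hood: each tree predicts, the forest aggregates.
print("Votes of the first 5 trees for one test sample:",
      [int(tree.predict(X_test[:1])[0]) for tree in forest.estimators_[:5]])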
Figure (Bagging at training time): the training set is sampled with replacement into N subsets, one per tree.
Figure (Bagging at inference time): each tree classifies a test sample and the votes are aggregated (75% confidence in the example).
Figure (Random Subspace Method at training time): each tree is trained on a random subset of the features of the training data.
Figure (Random Subspace Method at inference time): the trees' votes on a test sample are aggregated (66% confidence in the example).
Advantages
Artificial Neural Network architecture
• An Artificial Neural Network (ANN) is a computational model
inspired by the structure and function of the human brain.
• It's a core component of Artificial Intelligence (AI) and a foundational
element of deep learning. ANNs are particularly powerful for tasks
that involve pattern recognition, classification, and making predictions
from complex data.
Biological Neuron
• A nerve cell neuron is a special biological cell that processes information.
According to an estimation, there are a huge number of neurons, approximately 10^11, with numerous interconnections, approximately 10^15.
How an ANN Learns
• Forward Propagation: Input data flows through the network layer by layer; each neuron computes a weighted sum of its inputs, applies an activation function, and passes the result on until the output layer produces a prediction.
• Error Calculation: The network's prediction is then compared to the actual, desired output. The difference between the predicted and actual output is the "error."
• Backpropagation: This is a crucial step in which the error is propagated backward through the network, from the
output layer to the input layer. This process calculates how much each weight contributed to the error.
• Weight Adjustment: Based on the error calculated during backpropagation, the weights of the connections are
adjusted to minimize the error. This is often done using optimization algorithms like gradient descent. The goal is to
make the network's predictions more accurate over time.
• Iteration: This entire process (forward propagation, error calculation, backpropagation, weight adjustment) is
repeated many times over a large dataset. With each iteration, the network continuously refines its weights, learning to
identify patterns and make better predictions.
Artificial neurons vs Biological neurons
Aspect | Biological Neurons | Artificial Neurons
Structure | Dendrites: receive signals from other neurons. Cell Body (Soma): processes the signals. Axon: transmits processed signals to other neurons. | Input nodes: receive data and pass it on to the next layer. Hidden layer nodes: process and transform the data. Output nodes: produce the final result after processing.
Learning Mechanism | Synaptic plasticity: changes in synaptic strength based on activity over time. | Backpropagation: adjusts the weights based on errors in predictions to improve future performance.
Activation | Neurons fire when signals are strong enough to reach a threshold. | Activation function: maps input to output, deciding if the neuron should fire based on the processed data.
Types of neuron connection architecture
• Single-layer feed-forward network
• In this type of network, we have only two layers, the input layer and the output layer, but the input layer does not count because no computation is performed in this layer.
• The output layer is formed when different
weights are applied to input nodes and the
cumulative effect per node is taken. After
this, the neurons collectively give the output
layer to compute the output signals.
Types of neuron connection architecture
• Multilayer feed-forward network
• This network also has a hidden layer that is internal to the network and has no direct contact with the external layer.
• The existence of one or more hidden layers makes the network computationally stronger. It is called a feed-forward network because information flows forward from the input, through the intermediate computations, to determine the output Z.
• There are no feedback connections in which outputs of the model are fed back into
itself.
Types of neuron connection architecture
• Single node with its own feedback
• When outputs can be directed back as inputs to the same layer or
preceding layer nodes, then it results in feedback networks.
• Recurrent networks are feedback networks with closed loops. The
figure shows a single recurrent network having a single neuron with
feedback to itself.
Types of neuron connection architecture
• Single-layer recurrent network
• The network is a single-layer network with a feedback
connection in which the processing element's output can
be directed back to itself or to another processing
element or both.
• A recurrent neural network is a class of artificial neural
networks where connections between nodes form a
directed graph along a sequence.
• This allows it to exhibit dynamic temporal behavior for
a time sequence. Unlike feedforward neural networks,
RNNs can use their internal state (memory) to process
sequences of inputs.
Types of neuron connection architecture
• Multilayer recurrent network
• In this type of network, processing element output
can be directed to the processing element in the
same layer and in the preceding layer forming a
multilayer recurrent network.
• They perform the same task for every element of a
sequence, with the output being dependent on the
previous computations.
• Inputs are not needed at each time step. The main
feature of a Recurrent Neural Network is its hidden
state, which captures some information about a
sequence.
Types of Artificial Neural Networks
• Feedforward Neural Network (FNN)
• Convolutional Neural Network (CNN)
• Radial Basis Function Network (RBFN)
• Recurrent Neural Network (RNN)
Feedforward Neural Network (FNN)
• Feedforward Neural Network (FNN) is a type of artificial neural network in which information flows in a
single direction—from the input layer through hidden layers to the output layer—without loops or
feedback.
• It is mainly used for pattern recognition tasks like image and speech classification.
Feedforward Neural Networks have a structured layered design where data flows
sequentially through each layer.
• Input Layer: The input layer consists of neurons that receive the input data. Each
neuron in the input layer represents a feature of the input data.
• Hidden Layers: One or more hidden layers are placed between the input and
output layers. These layers are responsible for learning the complex patterns in the
data. Each neuron in a hidden layer applies a weighted sum of inputs followed by a
non-linear activation function.
• Output Layer: The output layer provides the final output of the network. The
number of neurons in this layer corresponds to the number of classes in a
classification problem or the number of outputs in a regression problem.
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the neurons to minimize the error
between the predicted output and the actual output. This process is typically performed using backpropagation and
gradient descent.
• Forward Propagation: During forward propagation the input data passes through the network and the output is
calculated.
• Loss Calculation: The loss (or error) is calculated using a loss function such as Mean Squared Error (MSE) for
regression tasks or Cross-Entropy Loss for classification tasks.
• Backpropagation: In backpropagation the error is propagated back through the network to update the weights.
The gradient of the loss function with respect to each weight is calculated and the weights are adjusted using
gradient descent.
Convolutional Neural Network (CNN)
Convolutional Neural Networks (CNNs) are deep learning models
designed to process data with a grid-like topology such as images. They
are the foundation for most modern computer vision applications to
detect features within visual data.
Radial Basis Function Network (RBFN)
• Radial Basis Function (RBF) Neural Networks are used for function
approximation tasks. They are a special category of feed-forward
neural networks comprising three layers.
• Due to this distinct three-layer architecture and universal
approximation capabilities they offer faster learning speeds and
efficient performance in classification and regression problems.
Recurrent Neural Network (RNN)
• Recurrent Neural Networks (RNNs) differ from regular neural
networks in how they process information. While standard neural
networks pass information in one direction i.e from input to output,
RNNs feed information back into the network at each step.
Activation Functions
• It is a mathematical function applied to the output of a neuron. It
introduces non-linearity into the model, allowing the network to
learn and represent complex patterns in the data.
• Activation function decides whether a neuron should be activated by calculating the weighted sum of inputs
and adding a bias term.
• This helps the model make complex decisions and predictions by introducing non-linearities to the output of
each neuron.
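A small sketch of common activation functions referred to later in this unit (sigmoid, tanh, ReLU), applied element-wise to example pre-activation values; the input values are arbitrary.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def tanh(z):    return np.tanh(z)
def relu(z):    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example weighted sums (pre-activations)
print("sigmoid:", np.round(sigmoid(z), 3))  # squashed into (0, 1)
print("tanh:   ", np.round(tanh(z), 3))     # squashed into (-1, 1)
print("relu:   ", np.round(relu(z), 3))     # negatives clipped to 0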
Linear Activation Function
• Linear Activation Function resembles
straight line define by y=x. No matter how
many layers the neural network contains if
they all use linear activation functions the
output is a linear combination of the input.
• The range of the output spans from −∞ to +∞.
• Linear activation function is used at just one
place i.e. output layer.
• Using linear activation across all layers makes
the network's ability to learn complex patterns
limited.
• Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network’s learning and predictive capabilities.
Sigmoid Function
• Sigmoid Activation Function is characterized by an 'S' shape. It is mathematically defined as σ(z) = 1 / (1 + e^(−z)). This formula ensures a smooth and continuous output that is essential for gradient-based optimization methods.
• During training, the gradient computed for each weight/bias tells the network two things:
• Direction: How much each weight/bias needs to change to reduce the error.
• Magnitude: How sensitive the error is to changes in that specific weight/bias.
Forward Pass in ANN
• Calculation: The input signals travel forward through the network, layer by layer. At each neuron,
the inputs are multiplied by their respective weights, summed up, and then passed through an
activation function (e.g., sigmoid, ReLU, tanh) to produce an output for that neuron.
• Prediction: This process continues until the final output layer produces the network's prediction
for the given input.
• Error Calculation: This predicted output is then compared to the actual target output (the "ground
truth"). The difference between these two is quantified by a loss function (e.g., Mean Squared
Error for regression, Cross-Entropy for classification). This loss value represents how "wrong" the
network's prediction was.
Backward Pass (Backpropagation of Error)
• Error Propagation: The calculated error from the output layer is propagated backward through the
network, layer by layer, all the way to the input layer.
• Gradient Calculation (Chain Rule): At each layer, the algorithm calculates how much each weight
and bias contributed to the overall error. This is done using the chain rule of calculus. The chain rule
allows us to calculate the derivative of the loss with respect to a weight in an earlier layer by
multiplying the derivatives of intermediate calculations.
• Essentially, it determines the "blame" for the error and assigns it proportionally to the connections (weights) that
contributed to it.
• Weight/Bias Update: Once the gradients for all weights and biases are known, the optimization
algorithm (Gradient Descent) adjusts these parameters. Each weight/bias is updated in the direction
that minimizes the loss, by subtracting a fraction of its gradient (scaled by a learning rate).
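To tie the forward pass, loss, backpropagation, and weight update together, here is a compact sketch of one training loop for a one-hidden-layer network with sigmoid activations and MSE loss; the data, network size, and learning rate are illustrative assumptions, not a prescribed implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # 4 samples, 2 features
y = np.array([[0.0], [1.0], [1.0], [0.0]])                      # XOR-style targets

W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))   # input -> hidden (3 units)
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))   # hidden -> output
lr = 0.5

for _ in range(5000):
    # ---- forward pass ----
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y_hat = sigmoid(h @ W2 + b2)      # network prediction
    loss = np.mean((y_hat - y) ** 2)  # MSE loss

    # ---- backward pass (chain rule) ----
    d_yhat = 2 * (y_hat - y) / len(X)            # dLoss/dy_hat
    d_z2 = d_yhat * y_hat * (1 - y_hat)          # through the output sigmoid
    d_W2 = h.T @ d_z2
    d_b2 = d_z2.sum(axis=0, keepdims=True)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * h * (1 - h)                     # through the hidden sigmoid
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0, keepdims=True)

    # ---- weight update (gradient descent) ----
    W2 -= lr * d_W2; b2 -= lr * d_b2
    W1 -= lr * d_W1; b1 -= lr * d_b1

print("final loss:", round(loss, 4))
print("predictions:", np.round(y_hat.ravel(), 2))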