Unit 2: Supervised Learning: Regression
Prof. Sachin Sambhaji Patil
D. Y. Patil University Ambi, Pune
Supervised Machine Learning
• Supervised learning is the type of machine learning in which machines are trained using well-labelled training data, and on the basis of that data, machines predict the output.
• Labelled data means the input data is already tagged with the correct output.
Supervised Machine Learning
• In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly.
• It applies the same concept as a student learning under the supervision of a teacher.
Supervised Machine Learning
• Supervised learning is a process of providing input data as
well as correct output data to the machine learning model.
The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the
output variable(y).
Supervised Machine Learning
• In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
How Supervised Learning Works?
• In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (a held-out subset of the dataset), and then it predicts the output.
Supervised Machine Learning
• Suppose we have a dataset of different types of shapes, which includes square, rectangle, triangle, and polygon. Now the first step is that we need to train the model for each shape.
• If the given shape has four sides, and all the sides are equal, then it will be labelled as a square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
Supervised Machine Learning
• Now, after training, we test our model using the test set, and the
task of the model is to identify the shape.
• The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides, and predicts the output.
Supervised Machine Learning
• Steps Involved in Supervised Learning:
1. First determine the type of training dataset.
2. Collect/gather the labelled training data.
3. Split the dataset into a training set, a test set, and a validation set.
4. Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
5. Determine a suitable algorithm for the model, such as support vector machine, decision tree, etc.
6. Execute the algorithm on the training dataset. Sometimes we need validation sets as control parameters; these are subsets of the training data.
7. Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, it means our model is accurate (a minimal end-to-end sketch follows these steps).
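As a rough illustration of steps 3–7 (the dataset, algorithm, and split ratio below are assumptions chosen only for the example), a labelled dataset is split, a decision tree is trained, and its accuracy is evaluated on the test set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# a labelled example dataset (iris is only a stand-in here)
X, y = load_iris(return_X_y=True)

# step 3: split into training and test sets (a validation set could be split off the same way)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# steps 5-6: choose a suitable algorithm and execute it on the training data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# step 7: evaluate the accuracy of the model on the test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))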
Types of Supervised Machine learning Algorithms:
• Regression
• Regression algorithms are used if there is a relationship
between the input variable and the output variable.
• It is used for the prediction of continuous variables, such as
Weather forecasting, Market Trends, etc.
• Below are some popular Regression algorithms which come
under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Types of Supervised Machine learning Algorithms:
• Classification
• Classification algorithms are used when the output variable is categorical, for example two classes such as Yes-No, Male-Female, True-False, etc.
• Example application: spam filtering.
• Popular classification algorithms under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Types of Supervised Machine learning Algorithms:
• Supervised learning has two types:
• Classification: It predicts the class of the dataset based on the independent input variables. A class is a categorical or discrete value, e.g., whether the image of an animal is a cat or a dog.
• Regression: It predicts continuous output variables based on the independent input variables, e.g., the prediction of house prices based on parameters such as house age, distance from the main road, location, area, etc.
Types of Supervised Machine learning Algorithms:
• Linear Regression
• Linear regression is a type of supervised machine learning
algorithm that computes the linear relationship between a
dependent variable and one or more independent features.
• When there is only one independent feature, it is known as univariate (simple) linear regression; when there is more than one feature, it is known as multivariate (multiple) linear regression.
Types of Supervised Machine learning Algorithms:
• Linear Regression
• The goal of the algorithm is to find the best linear equation that
can predict the value of the dependent variable based on the
independent variables.
• The equation provides a straight line that represents the
relationship between the dependent and independent variables.
• The slope of the line indicates how much the dependent variable
changes for a unit change in the independent variables.
Linear Regression: Linear Models
• Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x).
• Hence the name Linear Regression.
• For example, X (input) could be the work experience and Y (output) the salary of a person.
• The regression line is the best-fit line for our model.
Linear Regression: Linear Models
Our independent feature is the experience, i.e., X, and the respective salary Y is the dependent variable. If we assume a linear relationship between X and Y, then the salary can be predicted using:
y = θ1 + θ2·x
Linear Regression: Linear Models
• The model gets the best regression fit line by finding the
best θ1 and θ2 values.
• θ1: intercept
• θ2: coefficient of x
• Once we find the best θ1 and θ2 values, we get the best-fit
line. So when we are finally using our model for prediction,
it will predict the value of y for the input value of x.
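As a minimal sketch (the experience/salary numbers are made up for illustration), scikit-learn's LinearRegression can find the best θ1 (intercept) and θ2 (coefficient of x):

import numpy as np
from sklearn.linear_model import LinearRegression

# assumed toy data: years of experience (x) and salary (y)
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([30000, 35000, 40000, 45000, 50000])

model = LinearRegression().fit(x, y)
print("theta1 (intercept):", model.intercept_)
print("theta2 (coefficient of x):", model.coef_[0])
print("predicted salary for 6 years:", model.predict([[6]])[0])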
Linear Regression: Cost Function
• The cost function or the loss function is nothing but the error or
difference between the predicted value and the true value Y.
• It is the Mean Squared Error (MSE) between the predicted value and
the true value.
• The cost function (J) can be written as:
J(θ1, θ2) = (1/n) Σᵢ (ŷᵢ − yᵢ)², where ŷᵢ = θ1 + θ2·xᵢ
How to update θ1 and θ2 values to get the best-fit line?
• To achieve the best-fit regression line, the model aims to predict the
target value such that the error difference between the predicted value
and the true value Y is minimum.
• So, it is very important to update the θ1 and θ2 values to reach the values that minimize the error between the predicted value (ŷ) and the true value (y).
Gradient Descent
• A linear regression model can be trained using the optimization algorithm
gradient descent by iteratively modifying the model’s parameters to reduce
the mean squared error (MSE) of the model on a training dataset.
• To update the θ1 and θ2 values in order to reduce the cost function (minimizing the MSE) and achieve the best-fit line, the model uses Gradient Descent. The idea is to start with random θ1 and θ2 values and then iteratively update them, reaching the minimum cost.
• A gradient is nothing but a derivative that describes how the output of a function changes with a small variation in its inputs.
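The sketch below shows one way this iterative update can be implemented for θ1 and θ2 with the MSE cost; the toy data, starting values, learning rate, and iteration count are all assumptions for the example:

import numpy as np

# assumed toy data, roughly y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

theta1, theta2 = 0.0, 0.0   # initial values (could also be random)
alpha = 0.01                # learning rate
n = len(x)

for _ in range(5000):
    y_pred = theta1 + theta2 * x
    # gradients of the MSE cost J = (1/n) * sum((y_pred - y)**2)
    d_theta1 = (2.0 / n) * np.sum(y_pred - y)
    d_theta2 = (2.0 / n) * np.sum((y_pred - y) * x)
    # step opposite to the gradient
    theta1 -= alpha * d_theta1
    theta2 -= alpha * d_theta2

print(theta1, theta2)   # should approach 1 and 2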
Multidimensional Scaling (MDS)
• Used for dimensionality reduction when the input data is not linearly
arranged or it is not known whether a linear relationship exists or not.
• MDS is a non-linear technique for embedding data in a lower-dimensional
space.
• MDS (multidimensional scaling) is an algorithm that transforms a dataset into another dataset, usually of lower dimensionality, while keeping the euclidean distances between points as close to the original ones as possible.
• It can also be used to detect outliers in a multivariate distribution.
Multidimensional Scaling (MDS)
• The main objective of MDS is to represent dissimilarities as distances
between points in a low dimensional space such that the distances
correspond as closely as possible to the dissimilarities.
• It is a nonlinear method that projects the data into lower dimensions while preserving pairwise distances.
Multidimensional Scaling (MDS)
• The metric MDS calculates distances between each pair of points in
the original high-dimensional space and then maps it to lower-
dimensional space while preserving those distances between points
as well as possible.
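A minimal sketch of metric MDS with scikit-learn (the 3-D points are assumed for illustration); the data is embedded into 2 dimensions while trying to preserve the pairwise distances:

import numpy as np
from sklearn.manifold import MDS

# assumed higher-dimensional (here 3-D) data
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [8.0, 9.0, 10.0],
              [9.0, 10.0, 11.0]])

# embed into 2 dimensions, preserving pairwise euclidean distances as well as possible
mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X)
print(X_2d)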
Differentiating the cost function (J)
∂J/∂θ1 = (2/n) Σᵢ (ŷᵢ − yᵢ)
∂J/∂θ2 = (2/n) Σᵢ (ŷᵢ − yᵢ)·xᵢ
Linear Regression model
• Finding the coefficients of a linear equation that best fits the training data is the objective of linear regression.
• The coefficients are adjusted by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients.
• If alpha (α) is the learning rate, the intercept and the coefficient of X are updated as:
θ1 = θ1 − α·∂J/∂θ1
θ2 = θ2 − α·∂J/∂θ2
Bias-Variance Trade-Off
Model Complexity
• Model complexity in linear regression can be thought of as the number of predictors: as the number of predictors increases, the variance of the estimates also increases, but the bias decreases.
Use of Regularization
• We use regularization, a technique that allows us to decrease this variance at the cost of introducing some bias.
• Finding a good bias-variance trade-off allows us to minimize the model's total error.
Types of Regularization Techniques
• There are three popular regularization techniques, each of them aiming at decreasing the size of the coefficients (a minimal scikit-learn sketch follows this list):
1. Ridge Regression, which penalizes the sum of squared coefficients (L2 penalty).
2. Lasso Regression, which penalizes the sum of absolute values of the coefficients (L1 penalty).
3. Elastic Net, a convex combination of Ridge and Lasso (L1 + L2).
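A minimal scikit-learn sketch of the first two penalties (the data is generated only for illustration; scikit-learn's alpha parameter plays the role of the regularization strength λ used later in these slides):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)   # only two informative features

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can zero some coefficients out

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)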
Types of Regularization Techniques
• L2 regularization takes the square of the weights, so the penalty on large weights grows quadratically.
• L1 regularization takes the absolute values of the weights, so the penalty grows only linearly.
Lasso Regression
• Lasso, or Least Absolute Shrinkage and Selection Operator, is quite
similar conceptually to ridge regression.
• It also adds a penalty for non-zero coefficients, but unlike ridge
regression which penalizes sum of squared coefficients (the so-called L2
penalty), lasso penalizes the sum of their absolute values (L1 penalty).
• As a result, for high values of λ, many coefficients are exactly zeroed
under lasso, which is never the case in ridge regression.
In lasso, the loss is defined as:
L = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|
• In lasso, one of the correlated predictors has a larger coefficient, while
the rest are (nearly) zeroed.
• Lasso tends to do well if there are a small number of significant parameters
and the others are close to zero
Elastic Net Regularization
• Elastic Net first emerged as a result of critique on lasso, whose
variable selection can be too dependent on data and thus
unstable.
• The solution is to combine the penalties of ridge regression
and lasso to get the best of both worlds.
• Elastic Net aims at minimizing the following loss function:
L = Σᵢ (yᵢ − ŷᵢ)² / (2n) + λ [ (1 − α)/2 · Σⱼ βⱼ² + α · Σⱼ |βⱼ| ]
Elastic Net Regularization
• Where α is the mixing parameter between ridge (α = 0) and lasso (α = 1).
Elastic Net Regularization
• The elastic net penalty is a weighted sum of the L1 and L2 penalties.
• The mixing parameter, alpha (α), controls the weight of the L1
penalty relative to the L2 penalty.
• When alpha=1, the penalty reduces to the L1 penalty (Lasso
regression), and when alpha=0, the penalty reduces to the L2
penalty (Ridge regression).
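A minimal sketch with scikit-learn's ElasticNet (toy data assumed). Note the naming difference: scikit-learn's alpha is the overall strength (the λ above), while its l1_ratio is the mixing parameter (the α above):

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

# l1_ratio close to 1 behaves like lasso, close to 0 like ridge, in between mixes both
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("elastic net coefficients:", enet.coef_)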
Elastic Net Regularization
• Elastic net regression is a linear regression technique that uses a
penalty term to shrink the coefficients of the predictors.
• The penalty term is a combination of the l1-norm (absolute value)
and the l2-norm (square) of the coefficients, weighted by a
parameter called alpha.
Elastic Net Regularization
• Now, there are two parameters to tune: λ and α.
• The glmnet package (in R) allows λ to be tuned via cross-validation for a fixed α, but it does not support α-tuning, so we can turn to caret for this job.
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x + b2x² + b3x³ + ...... + bnxⁿ
Polynomial Regression
• It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
• It is a linear model with some modifications made in order to increase the accuracy.
• The dataset used for training in Polynomial Regression is of a non-linear nature.
• It makes use of a linear regression model to fit complicated, non-linear functions and datasets.
Polynomial Regression
• "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled
using a linear model."
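A minimal sketch of exactly this idea (toy data assumed): the original feature x is converted into polynomial features of degree 2 and then modeled with an ordinary linear model:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# assumed non-linear toy data: y roughly follows 1 + 2x + 3x^2
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2

# convert x into polynomial features of degree 2, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))   # close to 1 + 4 + 12 = 17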
Regression Equations
• Simple Linear Regression equation:
y = b0 + b1x .........(a)
• Multiple Linear Regression equation:
y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)
• Polynomial Regression equation:
y = b0 + b1x + b2x² + b3x³ + .... + bnxⁿ .........(c)
Linear Regression equation
• When we compare the above three equations, we can clearly see that
all three equations are Polynomial equations but differ by the degree
of variables.
• The Simple and Multiple Linear equations are also polynomial equations, but with degree one, and the Polynomial Regression equation is a linear equation of degree n.
• So if we add higher-degree terms to our linear equation, it is converted into a Polynomial Regression equation.
Isotonic Regression
• 'iso' means equal and 'tonic' means stretching.
• In terms of machine learning algorithms, isotonic regression
can, therefore, be understood as equal stretching along the
linear regression line.
• It works on top of a linear regression model.
Isotonic Regression
• The fitted values in isotonic regression must be non-decreasing, whereas in linear regression they can decrease.
• This means every point in isotonic regression should be at least as high as the previous point.
• The isotonic fit can be free-form (piecewise), whereas linear regression must be a straight line.
Isotonic Regression
• Isotonic regression can be formulated as an optimization
problem in which the goal is to find a monotonic function
that minimizes the sum of the squared errors between the
predicted and observed values of the target variable.
• The optimization problem can be written as follows:
minimize Σᵢ (yᵢ − f(xᵢ))²
subject to f(xᵢ) ≤ f(xⱼ) whenever xᵢ ≤ xⱼ
Isotonic Regression
• where xᵢ and yᵢ are the predictor and target values for the i-th data point, respectively, and f is the monotonic function being fit to the data.
• The constraint ensures that the function is monotonic.
Applications of Isotonic Regression
1. Calibration of predicted probabilities: Isotonic regression can be
used to adjust the predicted probabilities produced by a classifier
so that they are more accurately calibrated to the true probabilities.
2. Ordinal regression: Isotonic regression can be used to model
ordinal variables, which are variables that can be ranked in order
(e.g., “low,” “medium,” and “high”).
Applications of Isotonic Regression
3. Non-parametric regression: Because isotonic regression does not make
any assumptions about the functional form of the relationship between the
predictor and target variables, it can be used as a non-parametric
regression method.
4. Imputing missing values: Isotonic regression can be used to impute
missing values in a dataset by predicting the missing values based on the
surrounding non-missing values.
Applications of Isotonic Regression
5. Outlier detection: Isotonic regression can be used to identify outliers
in a dataset by identifying points that are significantly different from
the overall trend of the data.
Isotonic Regression
• In scikit-learn, isotonic regression can be performed using the
‘IsotonicRegression’ class. This class implements the isotonic
regression algorithm, which fits a non-decreasing piecewise-constant
function to the data.
• The following example shows how to use the IsotonicRegression class in scikit-learn to perform isotonic regression:
Isotonic Regression
import numpy as np
from sklearn.isotonic import IsotonicRegression

# sample data (assumed for illustration): noisy but increasing trend
x = np.arange(10)
y = np.array([1, 2, 1, 3, 4, 4, 6, 5, 7, 8])

# create an instance of the IsotonicRegression class
ir = IsotonicRegression()

# fit the isotonic regression model and transform the data
y_ir = ir.fit_transform(x, y)
print('Isotonic Regression Predictions :\n', y_ir)
Logistic Regression
• Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class.
• Although it is used for classification, it is called logistic regression because it takes the output of a linear regression function as input and uses a sigmoid function to estimate the probability of the given class.
Logistic Regression
• The difference between linear regression and logistic regression is that the output of linear regression is a continuous value that can be anything, while logistic regression predicts the probability that an instance belongs to a given class.
Logistic Regression
• It is used for predicting the categorical dependent variable
using a given set of independent variables.
• Logistic regression predicts the output of a categorical
dependent variable. Therefore the outcome must be a
categorical or discrete value.
• It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1.
Logistic Regression
• Logistic Regression is very similar to Linear Regression except in how it is used.
• Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 and 1).
Logistic Regression
• The curve from the logistic function indicates the likelihood of
something such as whether the cells are cancerous or not, a mouse is
obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm
because it has the ability to provide probabilities and classify new
data using continuous and discrete datasets.
Logistic Regression
• Logistic Regression can be used to classify the observations using
different types of data and can easily determine the most effective
variables used for the classification.
Logistic Regression: Logistic Function (Sigmoid Function)
• Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
• It maps any real value into another value within the range of 0 to 1. The value of the logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms an "S"-shaped curve.
• The S-form curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
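A small sketch of the sigmoid function and the threshold idea (the 0.5 threshold is just the common default, not something fixed by these slides):

import numpy as np

def sigmoid(z):
    # maps any real value z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
print(probs)                        # values between 0 and 1, forming an S-shape in z
print((probs >= 0.5).astype(int))   # apply a 0.5 threshold to get class 0 or 1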
Type of Logistic Regression
• On the basis of the categories, Logistic Regression can be classified into three
types:
• Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as “cat”, “dogs”, or
“sheep”
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as “low”, “Medium”, or “High”.
Linear Regression vs. Logistic Regression
1. Linear regression is used to predict a continuous dependent variable using a given set of independent variables; logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
2. Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
3. In linear regression we predict the value of continuous variables; in logistic regression we predict the values of categorical variables.
4. In linear regression we find the best-fit line; in logistic regression we find the S-curve.
5. The least-squares estimation method is used for estimation in linear regression; the maximum-likelihood estimation method is used in logistic regression.
6. The output of linear regression must be a continuous value, such as price, age, etc.; the output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
7. Linear regression requires a linear relationship between the dependent and independent variables; logistic regression does not require a linear relationship.
8. In linear regression there may be collinearity between the independent variables; in logistic regression there should not be collinearity between the independent variables.
Logistic Regression: Sigmoid Function
• The sigmoid function takes the input z and maps it to a probability between 0 and 1, i.e., the predicted y.
Logistic Regression
• Logistic Regression Equation
• The odds are the ratio of the probability of something occurring to the probability of it not occurring. Odds differ from probability, which is the ratio of something occurring to everything that could possibly occur. So the odds will be:
odds = p / (1 − p)
• from sklearn.linear_model import LogisticRegression
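Extending the import above into a runnable sketch (the breast-cancer dataset is used here only as a stand-in binary classification problem):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)   # a higher max_iter helps convergence on unscaled features
clf.fit(X_train, y_train)

print("class probabilities for one sample:", clf.predict_proba(X_test[:1]))
print("test accuracy:", clf.score(X_test, y_test))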
Types of Gradient Descent:
• Typically, there are three types of Gradient Descent:
• Batch Gradient Descent
• Stochastic Gradient Descent
• Mini-batch Gradient Descent
Gradient Descent
• Gradient descent is an optimization algorithm that’s used
when training a machine learning model.
• It's based on a convex function and iteratively tweaks its parameters to minimize a given function down to its local minimum.
WHAT IS GRADIENT DESCENT IN MACHINE LEARNING?
• Gradient Descent is an optimization algorithm for finding a local
minimum of a differentiable function.
• Gradient descent in machine learning is simply used to find the
values of a function's parameters (coefficients) that minimize a cost
function as far as possible.
ROLE OF GRADIENT DESCENT
• Starting from initial parameter values, the gradient descent algorithm uses calculus to iteratively adjust the values so that they minimize the given cost function.
ROLE OF GRADIENT DESCENT
• Stochastic gradient descent (SGD) does this for each training example within the dataset, meaning it updates the parameters for each training example one by one.
• Depending on the problem, this can make SGD faster than batch gradient descent.
• One advantage is that the frequent updates give us a fairly detailed rate of improvement.
ROLE OF GRADIENT DESCENT
• The frequent updates, however, are more computationally expensive
than the batch gradient descent approach.
• Additionally, the frequency of those updates can result in noisy
gradients, which may cause the error rate to jump around instead of
slowly decreasing.
Stochastic gradient descent algorithms
• The gradient descent algorithm is an approximate and iterative
method for mathematical optimization.
• You can use it to approach the minimum of any differentiable
function.
Stochastic gradient descent algorithms
• Gradient Descent is an iterative optimization process that searches
for an objective function’s optimum value (Minimum/Maximum).
• It is one of the most used methods for changing a model’s
parameters in order to reduce a cost function in machine learning
projects.
Stochastic gradient descent algorithms
• The primary goal of gradient descent is to identify the model
parameters that provide the maximum accuracy on both training
and test datasets.
• In gradient descent, the gradient is a vector pointing in the general
direction of the function’s steepest rise at a particular point.
• The algorithm might gradually drop towards lower values of the
function by moving in the opposite direction of the gradient, until
reaching the minimum of the function.
Stochastic Gradient Descent
• Stochastic Gradient Descent (SGD) is a variant of the Gradient
Descent algorithm that is used for optimizing machine learning
models.
• It addresses the computational inefficiency of traditional Gradient
Descent methods when dealing with large datasets in machine
learning projects.
Stochastic Gradient Descent
• In SGD, instead of using the entire dataset for each iteration, only a
single random training example (or a small batch) is selected to
calculate the gradient and update the model parameters.
• This random selection introduces randomness into the optimization
process, hence the term “stochastic” in stochastic Gradient Descent
Stochastic Gradient Descent
• The advantage of using SGD is its computational efficiency, especially
when dealing with large datasets.
• By using a single example or a small batch, the computational cost
per iteration is significantly reduced compared to traditional
Gradient Descent methods that require processing the entire dataset.
Stochastic Gradient Descent
• Stochastic Gradient Descent Algorithm
• Initialization: Randomly initialize the parameters of the model.
• Set Parameters: Determine the number of iterations and the learning
rate (alpha) for updating the parameters.
Stochastic Gradient Descent
• Stochastic Gradient Descent Loop: Repeat the following steps until the model converges
or reaches the maximum number of iterations:
a. Shuffle the training dataset to introduce randomness.
b. Iterate over each training example (or a small batch) in the shuffled order.
c. Compute the gradient of the cost function with respect to the model parameters using
the current training example (or batch).
d. Update the model parameters by taking a step in the direction of the negative gradient,
scaled by the learning rate.
e. Evaluate the convergence criteria, such as the difference in the cost function between iterations.
Stochastic Gradient Descent
• Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters.
• In SGD, since only one sample from the dataset is chosen at random for
each iteration, the path taken by the algorithm to reach the minima is
usually noisier than your typical Gradient Descent algorithm.
• But that doesn’t matter all that much because the path taken by the
algorithm does not matter, as long as we reach the minimum and with a
significantly shorter training time.
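A minimal sketch of this loop for a simple linear model, using one randomly ordered example per update (the toy data, learning rate, and epoch count are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)   # roughly y = 1 + 2x

theta1, theta2 = 0.0, 0.0   # initialization
alpha = 0.01                # learning rate
n_epochs = 50

for _ in range(n_epochs):
    order = rng.permutation(len(x))   # a. shuffle the training data
    for i in order:                   # b. iterate over single examples
        y_pred = theta1 + theta2 * x[i]
        error = y_pred - y[i]
        # c. gradient of the squared error for this one example
        # d. update the parameters against the gradient, scaled by the learning rate
        theta1 -= alpha * 2 * error
        theta2 -= alpha * 2 * error * x[i]

print(theta1, theta2)   # should end up near 1 and 2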
The path taken by Batch Gradient Descent is shown below:
• We reach the minimum with a significantly shorter training time.
A path taken by Stochastic Gradient Descent looks as follows:
• One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minima, because of the randomness in its descent.
• Even though it requires a higher number of iterations to reach the minima than typical Gradient Descent, it is still computationally much less expensive than typical Gradient Descent.
• Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
Advantages of Stochastic Gradient Descent
• Speed: SGD is faster than other variants of Gradient Descent such as
Batch Gradient Descent and Mini-Batch Gradient Descent since it uses
only one example to update the parameters.
• Memory Efficiency: Since SGD updates the parameters for each training
example one at a time, it is memory-efficient and can handle large
datasets that cannot fit into memory.
• Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape from local minima and move toward a better minimum.
Disadvantages of Stochastic Gradient Descent
• Noisy updates: The updates in SGD are noisy and have a high variance,
which can make the optimization process less stable and lead to
oscillations around the minimum.
• Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one at
a time.
• Sensitivity to Learning Rate: The choice of learning rate can be critical in
SGD since using a high learning rate can cause the algorithm to overshoot
the minimum, while a low learning rate can make the algorithm converge
slowly.
• Less Accurate: Due to the noisy updates, SGD may not converge to the
exact global minimum and can result in a suboptimal solution. This can
be mitigated by using techniques such as learning rate scheduling and
momentum-based updates
Confusion Matrix
• The target variable has two values: Positive or Negative.
• The columns represent the actual values of the target variable.
• The rows represent the predicted values of the target variable.
Confusion Matrix
• The classification matrix is a standard tool for evaluation of statistical
models and is sometimes referred to as a confusion matrix.
• A Confusion matrix is an N x N matrix used for evaluating
the performance of a classification model, where N is the number
of target classes. The matrix compares the actual target values with
those predicted by the machine learning model.
Confusion Matrix
• A good model is one which has high TP and TN rates, while low FP and
FN rates.
• A confusion matrix is a tabular summary of the number of correct and
incorrect predictions made by a classifier.
• It is used to measure the performance of a classification model.
• It can be used to evaluate the performance of a classification model
through the calculation of performance metrics like accuracy, precision,
recall, and F1-score.
Confusion Matrix
• True Positives (TP): when the actual value is Positive and
predicted is also Positive.
• True negatives (TN): when the actual value is Negative and
prediction is also Negative.
• False positives (FP): when the actual value is Negative but the prediction is Positive. Also known as a Type I error.
• False negatives (FN): when the actual value is Positive but the prediction is Negative. Also known as a Type II error.
Classification Measure
• Classification Measure
• Basically, it is an extended version of the confusion matrix. There are
measures other than the confusion matrix which can help achieve
better understanding and analysis of our model and its
performance.
Classification Measure
a. Accuracy
b. Precision
c. Recall (TPR, Sensitivity)
d. F1-Score
e. FPR (Type I Error)
f. FNR (Type II Error)
Classification Measure: a. Accuracy
• Accuracy simply measures how often the classifier makes the correct prediction.
• It's the ratio between the number of correct predictions and the total number of predictions.
Classification Measure
• In a two-class problem, we are often looking to discriminate between
observations with a specific outcome, from normal observations.
• “true positive” for correctly predicted event values.
• “false positive” for incorrectly predicted event values.
• “true negative” for correctly predicted no-event values.
• “false negative” for incorrectly predicted no-event values.
Confusion Matrix
# Example of a confusion matrix in Python
from sklearn.metrics import confusion_matrix

expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(expected, predicted)
print(results)
# Output (rows = actual 0/1, columns = predicted 0/1):
# [[4 2]
#  [1 3]]
Calculate Accuracy, Error, Precision, Recall and F1 Score
for the following Confusion Matrix
                     Actual Positive   Actual Negative
Predicted Positive        10                 10
Predicted Negative        25                 55
Solution
1. Accuracy is calculated as follows:
(TP + TN) / (TP + TN + FP + FN)
= (10 + 55) / (10 + 55 + 10 + 25) = 65 / 100 = 0.65
2. Error = 1 – Accuracy
Error = 1 – 0.65
Error = 0.35
3. Precision = TP / (TP + FP) = 10 / (10 + 10) = 0.5
4. Recall (Sensitivity) = TP / (TP + FN) = 10 / (10 + 25) = 0.2857
5. F1 Score is calculated as follows:
F1 Score = 2 * Precision * Recall / (Precision + Recall)
F1 Score = 2 * 0.5 * 0.2857 / (0.5 + 0.2857)
F1 Score = 0.3636
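The same numbers can be checked with scikit-learn by reconstructing label lists that reproduce the counts in the matrix above (10 TP, 10 FP, 25 FN, 55 TN):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# rebuild predictions that match the confusion-matrix counts
y_true = [1] * 10 + [0] * 10 + [1] * 25 + [0] * 55
y_pred = [1] * 10 + [1] * 10 + [0] * 25 + [0] * 55

print("Accuracy :", accuracy_score(y_true, y_pred))    # 0.65
print("Precision:", precision_score(y_true, y_pred))   # 0.5
print("Recall   :", recall_score(y_true, y_pred))      # ~0.2857
print("F1 Score :", f1_score(y_true, y_pred))          # ~0.3636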
ROC Curve
• An ROC curve (receiver operating characteristic curve) is a graph
showing the performance of a classification model at all
classification thresholds.
• This curve plots two parameters: True Positive Rate and False
Positive Rate.
ROC Curve
• True Positive Rate (TPR) is a synonym for recall and is therefore defined as: TPR = TP / (TP + FN)
• False Positive Rate (FPR) is defined as: FPR = FP / (FP + TN)
• An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
ROC Curve
• The following figure shows a typical ROC curve.
ROC Curve
• With a ROC curve, you're trying to find a good model that optimizes the trade-off between the False Positive Rate (FPR) and True Positive Rate (TPR). What counts here is how much area is under the curve (Area Under the Curve = AUC).
• The ideal curve in the left image fills in 100%, which means that you’re
going to be able to distinguish between negative results and positive
results 100% of the time (which is almost impossible in real life).
ROC Curve
• A Receiver Operator Characteristic (ROC) curve is a graphical plot used to
show the diagnostic ability of binary classifiers.
• A ROC curve is constructed by plotting the true positive rate (TPR)
against the false positive rate (FPR).
• The true positive rate is the proportion of observations that were
correctly predicted to be positive out of all positive observations (TP/(TP
+ FN)).
Plot ROC-AUC Curve for binary classification problem
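A minimal sketch of plotting the ROC curve and computing the AUC for a binary problem (the dataset and classifier below are stand-ins chosen for the example):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]      # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")      # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()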
Thank You