Unit 2: Supervised Learning: Regression
Prof. Sachin Sambhaji Patil
D. Y. Patil University Ambi, Pune
Supervised Machine Learning
• Supervised learning is the type of machine learning in which machines are trained using well-labelled training data, and on the basis of that data, machines predict the output.
• Labelled data means the input data is already tagged with the correct output.
Supervised Machine Learning
• In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly.
• It applies the same concept as a student learning under the supervision of a teacher.
Supervised Machine Learning
• Supervised learning is a process of providing input data as
well as correct output data to the machine learning model.
The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the
output variable(y).
Supervised Machine Learning
• In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
How Supervised Learning Works?
• In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (a held-out subset of the dataset), and then it predicts the output.
Supervised Machine Learning
• Suppose we have a dataset of different types of shapes, which includes square, rectangle, triangle, and polygon. Now the first step is that we need to train the model for each shape.
• If the given shape has four sides, and all the sides are equal, then it will be labelled as a square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
Supervised Machine Learning
• Now, after training, we test our model using the test set, and the
task of the model is to identify the shape.
• The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides, and predicts the output.
Supervised Machine Learning
• Steps Involved in Supervised Learning:
1. First determine the type of training dataset.
2. Collect/gather the labelled training data.
3. Split the dataset into a training set, a test set, and a validation set.
4. Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
5. Determine a suitable algorithm for the model, such as support vector machine, decision tree, etc.
6. Execute the algorithm on the training dataset. Sometimes we need validation sets as control parameters; these are subsets of the training data.
7. Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, it means our model is accurate (a minimal end-to-end sketch follows these steps).
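As a rough illustration of steps 3–7 (the dataset, algorithm, and split ratio below are assumptions chosen only for the example), a labelled dataset is split, a decision tree is trained, and its accuracy is evaluated on the test set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# a labelled example dataset (iris is only a stand-in here)
X, y = load_iris(return_X_y=True)

# step 3: split into training and test sets (a validation set could be split off the same way)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# steps 5-6: choose a suitable algorithm and execute it on the training data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# step 7: evaluate the accuracy of the model on the test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))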
Types of Supervised Machine learning Algorithms:
• Regression
• Regression algorithms are used if there is a relationship
between the input variable and the output variable.
• It is used for the prediction of continuous variables, such as
Weather forecasting, Market Trends, etc.
• Below are some popular Regression algorithms which come
under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Types of Supervised Machine learning Algorithms:
• Classification
• Classification algorithms are used when the output variable is categorical, for example two classes such as Yes-No, Male-Female, True-False, etc.
• Example application: spam filtering.
• Popular classification algorithms under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Types of Supervised Machine learning Algorithms:
• Supervised learning has two types:
• Classification: It predicts the class of the dataset based on the independent input variables. A class is a categorical or discrete value, e.g., whether the image of an animal is a cat or a dog.
• Regression: It predicts continuous output variables based on the independent input variables, e.g., the prediction of house prices based on parameters such as house age, distance from the main road, location, area, etc.
Types of Supervised Machine learning Algorithms:
• Linear Regression
• Linear regression is a type of supervised machine learning
algorithm that computes the linear relationship between a
dependent variable and one or more independent features.
• When there is only one independent feature, it is known as univariate (simple) linear regression; when there is more than one feature, it is known as multivariate (multiple) linear regression.
Types of Supervised Machine learning Algorithms:
• Linear Regression
• The goal of the algorithm is to find the best linear equation that
can predict the value of the dependent variable based on the
independent variables.
• The equation provides a straight line that represents the
relationship between the dependent and independent variables.
• The slope of the line indicates how much the dependent variable
changes for a unit change in the independent variables.
Linear Regression: Linear Models
• Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x).
• Hence the name Linear Regression.
• For example, X (input) could be the work experience and Y (output) the salary of a person.
• The regression line is the best-fit line for our model.
Linear Regression: Linear Models
Our independent feature is the experience, i.e., X, and the respective salary Y is the dependent variable. If we assume a linear relationship between X and Y, then the salary can be predicted using:
y = θ1 + θ2·x
Linear Regression: Linear Models
• The model gets the best regression fit line by finding the
best θ1 and θ2 values.
• θ1: intercept
• θ2: coefficient of x
• Once we find the best θ1 and θ2 values, we get the best-fit
line. So when we are finally using our model for prediction,
it will predict the value of y for the input value of x.
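As a minimal sketch (the experience/salary numbers are made up for illustration), scikit-learn's LinearRegression can find the best θ1 (intercept) and θ2 (coefficient of x):

import numpy as np
from sklearn.linear_model import LinearRegression

# assumed toy data: years of experience (x) and salary (y)
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([30000, 35000, 40000, 45000, 50000])

model = LinearRegression().fit(x, y)
print("theta1 (intercept):", model.intercept_)
print("theta2 (coefficient of x):", model.coef_[0])
print("predicted salary for 6 years:", model.predict([[6]])[0])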
Linear Regression: Cost Function
• The cost function or the loss function is nothing but the error or
difference between the predicted value and the true value Y.
• It is the Mean Squared Error (MSE) between the predicted value and
the true value.
• The cost function (J) can be written as:
J(θ1, θ2) = (1/n) Σᵢ (ŷᵢ − yᵢ)², where ŷᵢ = θ1 + θ2·xᵢ
How to update θ1 and θ2 values to get the best-fit line?
• To achieve the best-fit regression line, the model aims to predict the
target value such that the error difference between the predicted value
and the true value Y is minimum.
• So, it is very important to update the θ1 and θ2 values to reach the values that minimize the error between the predicted value (ŷ) and the true value (y).
Gradient Descent
• A linear regression model can be trained using the optimization algorithm
gradient descent by iteratively modifying the model’s parameters to reduce
the mean squared error (MSE) of the model on a training dataset.
• To update the θ1 and θ2 values in order to reduce the cost function (minimizing the MSE) and achieve the best-fit line, the model uses Gradient Descent. The idea is to start with random θ1 and θ2 values and then iteratively update them, reaching the minimum cost.
• A gradient is nothing but a derivative that describes how the output of a function changes with a small variation in its inputs.
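The sketch below shows one way this iterative update can be implemented for θ1 and θ2 with the MSE cost; the toy data, starting values, learning rate, and iteration count are all assumptions for the example:

import numpy as np

# assumed toy data, roughly y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

theta1, theta2 = 0.0, 0.0   # initial values (could also be random)
alpha = 0.01                # learning rate
n = len(x)

for _ in range(5000):
    y_pred = theta1 + theta2 * x
    # gradients of the MSE cost J = (1/n) * sum((y_pred - y)**2)
    d_theta1 = (2.0 / n) * np.sum(y_pred - y)
    d_theta2 = (2.0 / n) * np.sum((y_pred - y) * x)
    # step opposite to the gradient
    theta1 -= alpha * d_theta1
    theta2 -= alpha * d_theta2

print(theta1, theta2)   # should approach 1 and 2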
Multidimensional Scaling (MDS)
• Used for dimensionality reduction when the input data is not linearly
arranged or it is not known whether a linear relationship exists or not.
• MDS is a non-linear technique for embedding data in a lower-dimensional
space.
• MDS (multidimensional scaling) is an algorithm that transforms a dataset into another dataset, usually of lower dimensionality, while keeping the euclidean distances between points as close to the original ones as possible.
• It can also be used to detect outliers in a multivariate distribution.
Multidimensional Scaling (MDS)
• The main objective of MDS is to represent dissimilarities as distances
between points in a low dimensional space such that the distances
correspond as closely as possible to the dissimilarities.
• It is a nonlinear method that projects the data into lower dimensions while preserving pairwise distances.
Multidimensional Scaling (MDS)
• The metric MDS calculates distances between each pair of points in
the original high-dimensional space and then maps it to lower-
dimensional space while preserving those distances between points
as well as possible.
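A minimal sketch of metric MDS with scikit-learn (the 3-D points are assumed for illustration); the data is embedded into 2 dimensions while trying to preserve the pairwise distances:

import numpy as np
from sklearn.manifold import MDS

# assumed higher-dimensional (here 3-D) data
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [8.0, 9.0, 10.0],
              [9.0, 10.0, 11.0]])

# embed into 2 dimensions, preserving pairwise euclidean distances as well as possible
mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X)
print(X_2d)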
Differentiating the cost function (J)
∂J/∂θ1 = (2/n) Σᵢ (ŷᵢ − yᵢ)
∂J/∂θ2 = (2/n) Σᵢ (ŷᵢ − yᵢ)·xᵢ
Linear Regression model
• Finding the coefficients of a linear equation that best fits the training data is the objective of linear regression.
• The coefficients are adjusted by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients.
• If alpha (α) is the learning rate, the intercept and the coefficient of X are updated as:
θ1 = θ1 − α·∂J/∂θ1
θ2 = θ2 − α·∂J/∂θ2
Bias-Variance Trade-Off
Model Complexity
• Model complexity in linear regression can be thought of as the number of predictors: as the number of predictors increases, the variance of the estimates also increases, but the bias decreases.
Use of Regularization
• We use regularization, a technique that allows us to decrease this variance at the cost of introducing some bias.
• Finding a good bias-variance trade-off allows us to minimize the model's total error.
Types of Regularization Techniques
• There are three popular regularization techniques, each of them aiming at decreasing the size of the coefficients (a minimal scikit-learn sketch follows this list):
1. Ridge Regression, which penalizes the sum of squared coefficients (L2 penalty).
2. Lasso Regression, which penalizes the sum of absolute values of the coefficients (L1 penalty).
3. Elastic Net, a convex combination of Ridge and Lasso (L1 + L2).
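A minimal scikit-learn sketch of the first two penalties (the data is generated only for illustration; scikit-learn's alpha parameter plays the role of the regularization strength λ used later in these slides):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)   # only two informative features

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can zero some coefficients out

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)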
Types of Regularization Techniques
• L2 regularization takes the square of the weights, so the penalty on large weights grows quadratically.
• L1 regularization takes the absolute values of the weights, so the penalty grows only linearly.
Lasso Regression
• Lasso, or Least Absolute Shrinkage and Selection Operator, is quite
similar conceptually to ridge regression.
• It also adds a penalty for non-zero coefficients, but unlike ridge
regression which penalizes sum of squared coefficients (the so-called L2
penalty), lasso penalizes the sum of their absolute values (L1 penalty).
• As a result, for high values of λ, many coefficients are exactly zeroed
under lasso, which is never the case in ridge regression.
In lasso, the loss is defined as:
L = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|
• In lasso, one of the correlated predictors has a larger coefficient, while
the rest are (nearly) zeroed.
• Lasso tends to do well if there are a small number of significant parameters
and the others are close to zero
Elastic Net Regularization
• Elastic Net first emerged as a result of critique on lasso, whose
variable selection can be too dependent on data and thus
unstable.
• The solution is to combine the penalties of ridge regression
and lasso to get the best of both worlds.
• Elastic Net aims at minimizing the following loss function:
L = Σᵢ (yᵢ − ŷᵢ)² / (2n) + λ [ (1 − α)/2 · Σⱼ βⱼ² + α · Σⱼ |βⱼ| ]
Elastic Net Regularization
• Where α is the mixing parameter between ridge (α = 0) and lasso (α = 1).
Elastic Net Regularization
• The elastic net penalty is a weighted sum of the L1 and L2 penalties.
• The mixing parameter, alpha (α), controls the weight of the L1
penalty relative to the L2 penalty.
• When alpha=1, the penalty reduces to the L1 penalty (Lasso
regression), and when alpha=0, the penalty reduces to the L2
penalty (Ridge regression).
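A minimal sketch with scikit-learn's ElasticNet (toy data assumed). Note the naming difference: scikit-learn's alpha is the overall strength (the λ above), while its l1_ratio is the mixing parameter (the α above):

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

# l1_ratio close to 1 behaves like lasso, close to 0 like ridge, in between mixes both
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("elastic net coefficients:", enet.coef_)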
Elastic Net Regularization
• Elastic net regression is a linear regression technique that uses a
penalty term to shrink the coefficients of the predictors.
• The penalty term is a combination of the l1-norm (absolute value)
and the l2-norm (square) of the coefficients, weighted by a
parameter called alpha.
Elastic Net Regularization
• Now, there are two parameters to tune: λ and α.
• The glmnet package (in R) allows λ to be tuned via cross-validation for a fixed α, but it does not support α-tuning, so we can turn to caret for this job.
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x + b2x² + b3x³ + ...... + bnxⁿ
Polynomial Regression
• It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
• It is a linear model with some modifications made in order to increase the accuracy.
• The dataset used for training in Polynomial Regression is of a non-linear nature.
• It makes use of a linear regression model to fit complicated, non-linear functions and datasets.
Polynomial Regression
• "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled
using a linear model."
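A minimal sketch of exactly this idea (toy data assumed): the original feature x is converted into polynomial features of degree 2 and then modeled with an ordinary linear model:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# assumed non-linear toy data: y roughly follows 1 + 2x + 3x^2
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2

# convert x into polynomial features of degree 2, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))   # close to 1 + 4 + 12 = 17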
Regression Equations
• Simple Linear Regression equation:
y = b0 + b1x .........(a)
• Multiple Linear Regression equation:
y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)
• Polynomial Regression equation:
y = b0 + b1x + b2x² + b3x³ + .... + bnxⁿ .........(c)
Linear Regression equation
• When we compare the above three equations, we can clearly see that
all three equations are Polynomial equations but differ by the degree
of variables.
• The Simple and Multiple Linear equations are also polynomial equations, but with degree one, and the Polynomial Regression equation is a linear equation of degree n.
• So if we add higher-degree terms to our linear equation, it is converted into a Polynomial Regression equation.
Isotonic Regression
• 'iso' means equal and 'tonic' means stretching.
• In terms of machine learning algorithms, isotonic regression
can, therefore, be understood as equal stretching along the
linear regression line.
• It works on top of a linear regression model.
Isotonic Regression
• The fitted values in isotonic regression must be non-decreasing, whereas in linear regression they can decrease.
• This means every point in isotonic regression should be at least as high as the previous point.
• The isotonic fit can be free-form (piecewise), whereas linear regression must be a straight line.
Isotonic Regression
• Isotonic regression can be formulated as an optimization
problem in which the goal is to find a monotonic function
that minimizes the sum of the squared errors between the
predicted and observed values of the target variable.
• The optimization problem can be written as follows:
minimize Σᵢ (yᵢ − f(xᵢ))²
subject to f(xᵢ) ≤ f(xⱼ) whenever xᵢ ≤ xⱼ
Isotonic Regression
• where xᵢ and yᵢ are the predictor and target values for the i-th data point, respectively, and f is the monotonic function being fit to the data.
• The constraint ensures that the function is monotonic.
Applications of Isotonic Regression
1. Calibration of predicted probabilities: Isotonic regression can be
used to adjust the predicted probabilities produced by a classifier
so that they are more accurately calibrated to the true probabilities.
2. Ordinal regression: Isotonic regression can be used to model
ordinal variables, which are variables that can be ranked in order
(e.g., “low,” “medium,” and “high”).
Applications of Isotonic Regression
3. Non-parametric regression: Because isotonic regression does not make
any assumptions about the functional form of the relationship between the
predictor and target variables, it can be used as a non-parametric
regression method.
4. Imputing missing values: Isotonic regression can be used to impute
missing values in a dataset by predicting the missing values based on the
surrounding non-missing values.
Applications of Isotonic Regression
5. Outlier detection: Isotonic regression can be used to identify outliers
in a dataset by identifying points that are significantly different from
the overall trend of the data.
Isotonic Regression
• In scikit-learn, isotonic regression can be performed using the
‘IsotonicRegression’ class. This class implements the isotonic
regression algorithm, which fits a non-decreasing piecewise-constant
function to the data.
• The following example shows how to use the IsotonicRegression class in scikit-learn to perform isotonic regression:
Isotonic Regression
import numpy as np
from sklearn.isotonic import IsotonicRegression

# sample data (assumed for illustration): noisy but increasing trend
x = np.arange(10)
y = np.array([1, 2, 1, 3, 4, 4, 6, 5, 7, 8])

# create an instance of the IsotonicRegression class
ir = IsotonicRegression()

# fit the isotonic regression model and transform the data
y_ir = ir.fit_transform(x, y)
print('Isotonic Regression Predictions :\n', y_ir)
Logistic Regression
• Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class.
• Although it is used for classification, it is called logistic regression because it takes the output of a linear regression function as input and uses a sigmoid function to estimate the probability of the given class.
Logistic Regression
• The difference between linear regression and logistic regression is that the output of linear regression is a continuous value that can be anything, while logistic regression predicts the probability that an instance belongs to a given class.
Logistic Regression
• It is used for predicting the categorical dependent variable
using a given set of independent variables.
• Logistic regression predicts the output of a categorical
dependent variable. Therefore the outcome must be a
categorical or discrete value.
• It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1.
Logistic Regression
• Logistic Regression is very similar to Linear Regression except in how it is used.
• Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 and 1).
Logistic Regression
• The curve from the logistic function indicates the likelihood of
something such as whether the cells are cancerous or not, a mouse is
obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm
because it has the ability to provide probabilities and classify new
data using continuous and discrete datasets.
Logistic Regression
• Logistic Regression can be used to classify the observations using
different types of data and can easily determine the most effective
variables used for the classification.
Logistic Regression: Logistic Function (Sigmoid Function)
• Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
• It maps any real value into another value within the range of 0 to 1. The value of the logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms an "S"-shaped curve.
• The S-form curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
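A small sketch of the sigmoid function and the threshold idea (the 0.5 threshold is just the common default, not something fixed by these slides):

import numpy as np

def sigmoid(z):
    # maps any real value z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
print(probs)                        # values between 0 and 1, forming an S-shape in z
print((probs >= 0.5).astype(int))   # apply a 0.5 threshold to get class 0 or 1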
Type of Logistic Regression
• On the basis of the categories, Logistic Regression can be classified into three
types:
• Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as “cat”, “dogs”, or
“sheep”
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as “low”, “Medium”, or “High”.
Linear Regression vs. Logistic Regression
1. Linear regression is used to predict a continuous dependent variable using a given set of independent variables; logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
2. Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
3. In linear regression we predict the value of continuous variables; in logistic regression we predict the values of categorical variables.
4. In linear regression we find the best-fit line; in logistic regression we find the S-curve.
5. The least-squares estimation method is used for estimation in linear regression; the maximum-likelihood estimation method is used in logistic regression.
6. The output of linear regression must be a continuous value, such as price, age, etc.; the output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
7. Linear regression requires a linear relationship between the dependent and independent variables; logistic regression does not require a linear relationship.
8. In linear regression there may be collinearity between the independent variables; in logistic regression there should not be collinearity between the independent variables.
Logistic Regression: Sigmoid Function
• The sigmoid function takes the input z and maps it to a probability between 0 and 1, i.e., the predicted y.
Logistic Regression
• Logistic Regression Equation
• The odds are the ratio of the probability of something occurring to the probability of it not occurring. Odds differ from probability, which is the ratio of something occurring to everything that could possibly occur. So the odds will be:
odds = p / (1 − p)
• from sklearn.linear_model import LogisticRegression
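Extending the import above into a runnable sketch (the breast-cancer dataset is used here only as a stand-in binary classification problem):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)   # a higher max_iter helps convergence on unscaled features
clf.fit(X_train, y_train)

print("class probabilities for one sample:", clf.predict_proba(X_test[:1]))
print("test accuracy:", clf.score(X_test, y_test))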
Types of Gradient Descent:
• Typically, there are three types of Gradient Descent:
• Batch Gradient Descent
• Stochastic Gradient Descent
• Mini-batch Gradient Descent
Gradient Descent
• Gradient descent is an optimization algorithm that’s used
when training a machine learning model.
• It's based on a convex function and iteratively tweaks its parameters to minimize a given function down to its local minimum.
WHAT IS GRADIENT DESCENT IN MACHINE LEARNING?
• Gradient Descent is an optimization algorithm for finding a local
minimum of a differentiable function.
• Gradient descent in machine learning is simply used to find the
values of a function's parameters (coefficients) that minimize a cost
function as far as possible.
ROLE OF GRADIENT DESCENT
• Starting from initial parameter values, the gradient descent algorithm uses calculus to iteratively adjust the values so that they minimize the given cost function.
ROLE OF GRADIENT DESCENT
• Stochastic gradient descent (SGD) does this for each training example within the dataset, meaning it updates the parameters for each training example one by one.
• Depending on the problem, this can make SGD faster than batch gradient descent.
• One advantage is that the frequent updates give us a fairly detailed rate of improvement.
ROLE OF GRADIENT DESCENT
• The frequent updates, however, are more computationally expensive
than the batch gradient descent approach.
• Additionally, the frequency of those updates can result in noisy
gradients, which may cause the error rate to jump around instead of
slowly decreasing.
Stochastic gradient descent algorithms
• The gradient descent algorithm is an approximate and iterative
method for mathematical optimization.
• You can use it to approach the minimum of any differentiable
function.
Stochastic gradient descent algorithms
• Gradient Descent is an iterative optimization process that searches
for an objective function’s optimum value (Minimum/Maximum).
• It is one of the most used methods for changing a model’s
parameters in order to reduce a cost function in machine learning
projects.
Stochastic gradient descent algorithms
• The primary goal of gradient descent is to identify the model
parameters that provide the maximum accuracy on both training
and test datasets.
• In gradient descent, the gradient is a vector pointing in the general
direction of the function’s steepest rise at a particular point.
• The algorithm might gradually drop towards lower values of the
function by moving in the opposite direction of the gradient, until
reaching the minimum of the function.
Stochastic Gradient Descent
• Stochastic Gradient Descent (SGD) is a variant of the Gradient
Descent algorithm that is used for optimizing machine learning
models.
• It addresses the computational inefficiency of traditional Gradient
Descent methods when dealing with large datasets in machine
learning projects.
Stochastic Gradient Descent
• In SGD, instead of using the entire dataset for each iteration, only a
single random training example (or a small batch) is selected to
calculate the gradient and update the model parameters.
• This random selection introduces randomness into the optimization
process, hence the term “stochastic” in stochastic Gradient Descent
Stochastic Gradient Descent
• The advantage of using SGD is its computational efficiency, especially
when dealing with large datasets.
• By using a single example or a small batch, the computational cost
per iteration is significantly reduced compared to traditional
Gradient Descent methods that require processing the entire dataset.
Stochastic Gradient Descent
• Stochastic Gradient Descent Algorithm
• Initialization: Randomly initialize the parameters of the model.
• Set Parameters: Determine the number of iterations and the learning
rate (alpha) for updating the parameters.
Stochastic Gradient Descent
• Stochastic Gradient Descent Loop: Repeat the following steps until the model converges
or reaches the maximum number of iterations:
a. Shuffle the training dataset to introduce randomness.
b. Iterate over each training example (or a small batch) in the shuffled order.
c. Compute the gradient of the cost function with respect to the model parameters using
the current training example (or batch).
d. Update the model parameters by taking a step in the direction of the negative gradient,
scaled by the learning rate.
e. Evaluate the convergence criteria, such as the difference in the cost function between iterations.
Stochastic Gradient Descent
• Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters.
• In SGD, since only one sample from the dataset is chosen at random for
each iteration, the path taken by the algorithm to reach the minima is
usually noisier than your typical Gradient Descent algorithm.
• But that doesn’t matter all that much because the path taken by the
algorithm does not matter, as long as we reach the minimum and with a
significantly shorter training time.
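A minimal sketch of this loop for a simple linear model, using one randomly ordered example per update (the toy data, learning rate, and epoch count are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)   # roughly y = 1 + 2x

theta1, theta2 = 0.0, 0.0   # initialization
alpha = 0.01                # learning rate
n_epochs = 50

for _ in range(n_epochs):
    order = rng.permutation(len(x))   # a. shuffle the training data
    for i in order:                   # b. iterate over single examples
        y_pred = theta1 + theta2 * x[i]
        error = y_pred - y[i]
        # c. gradient of the squared error for this one example
        # d. update the parameters against the gradient, scaled by the learning rate
        theta1 -= alpha * 2 * error
        theta2 -= alpha * 2 * error * x[i]

print(theta1, theta2)   # should end up near 1 and 2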
The path taken by Batch Gradient Descent is shown below:
• We reach the minimum with a significantly shorter training time.
A path taken by Stochastic Gradient Descent looks as follows:
• One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minima, because of the randomness in its descent.
• Even though it requires a higher number of iterations to reach the minima than typical Gradient Descent, it is still computationally much less expensive than typical Gradient Descent.
• Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
Advantages of Stochastic Gradient Descent
• Speed: SGD is faster than other variants of Gradient Descent such as
Batch Gradient Descent and Mini-Batch Gradient Descent since it uses
only one example to update the parameters.
• Memory Efficiency: Since SGD updates the parameters for each training
example one at a time, it is memory-efficient and can handle large
datasets that cannot fit into memory.
• Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape from local minima and move toward a better minimum.
Disadvantages of Stochastic Gradient Descent
• Noisy updates: The updates in SGD are noisy and have a high variance,
which can make the optimization process less stable and lead to
oscillations around the minimum.
• Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one at
a time.
• Sensitivity to Learning Rate: The choice of learning rate can be critical in
SGD since using a high learning rate can cause the algorithm to overshoot
the minimum, while a low learning rate can make the algorithm converge
slowly.
• Less Accurate: Due to the noisy updates, SGD may not converge to the
exact global minimum and can result in a suboptimal solution. This can
be mitigated by using techniques such as learning rate scheduling and
momentum-based updates
Confusion Matrix
• The target variable has two values: Positive or Negative.
• The columns represent the actual values of the target variable.
• The rows represent the predicted values of the target variable.
Confusion Matrix
• The classification matrix is a standard tool for evaluation of statistical
models and is sometimes referred to as a confusion matrix.
• A Confusion matrix is an N x N matrix used for evaluating
the performance of a classification model, where N is the number
of target classes. The matrix compares the actual target values with
those predicted by the machine learning model.
Confusion Matrix
• A good model is one which has high TP and TN rates, while low FP and
FN rates.
• A confusion matrix is a tabular summary of the number of correct and
incorrect predictions made by a classifier.
• It is used to measure the performance of a classification model.
• It can be used to evaluate the performance of a classification model
through the calculation of performance metrics like accuracy, precision,
recall, and F1-score.
Confusion Matrix
• True Positives (TP): when the actual value is Positive and
predicted is also Positive.
• True negatives (TN): when the actual value is Negative and
prediction is also Negative.
• False positives (FP): when the actual value is Negative but the prediction is Positive. Also known as a Type I error.
• False negatives (FN): when the actual value is Positive but the prediction is Negative. Also known as a Type II error.
Classification Measure
• Classification Measure
• Basically, it is an extended version of the confusion matrix. There are
measures other than the confusion matrix which can help achieve
better understanding and analysis of our model and its
performance.
Classification Measure
a. Accuracy
b. Precision
c. Recall (TPR, Sensitivity)
d. F1-Score
e. FPR (Type I Error)
f. FNR (Type II Error)
Classification Measure: a. Accuracy
• Accuracy simply measures how often the classifier makes the correct prediction.
• It's the ratio between the number of correct predictions and the total number of predictions.
Classification Measure
• In a two-class problem, we are often looking to discriminate between
observations with a specific outcome, from normal observations.
• “true positive” for correctly predicted event values.
• “false positive” for incorrectly predicted event values.
• “true negative” for correctly predicted no-event values.
• “false negative” for incorrectly predicted no-event values.
Confusion Matrix
# Example of a confusion matrix in Python
from sklearn.metrics import confusion_matrix

expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(expected, predicted)
print(results)
# Output (rows = actual 0/1, columns = predicted 0/1):
# [[4 2]
#  [1 3]]
Calculate Accuracy, Error, Precision, Recall and F1 Score
for the following Confusion Matrix
                     Actual Positive   Actual Negative
Predicted Positive        10                 10
Predicted Negative        25                 55
Solution
1. Accuracy is calculated as follows:
(TP + TN) / (TP + TN + FP + FN)
= (10 + 55) / (10 + 55 + 10 + 25) = 65 / 100 = 0.65
2. Error = 1 – Accuracy
Error = 1 – 0.65
Error = 0.35
3. Precision = TP / (TP + FP) = 10 / (10 + 10) = 0.5
4. Recall (Sensitivity) = TP / (TP + FN) = 10 / (10 + 25) = 0.2857
5. F1 Score is calculated as follows:
F1 Score = 2 * Precision * Recall / (Precision + Recall)
F1 Score = 2 * 0.5 * 0.2857 / (0.5 + 0.2857)
F1 Score = 0.3636
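The same numbers can be checked with scikit-learn by reconstructing label lists that reproduce the counts in the matrix above (10 TP, 10 FP, 25 FN, 55 TN):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# rebuild predictions that match the confusion-matrix counts
y_true = [1] * 10 + [0] * 10 + [1] * 25 + [0] * 55
y_pred = [1] * 10 + [1] * 10 + [0] * 25 + [0] * 55

print("Accuracy :", accuracy_score(y_true, y_pred))    # 0.65
print("Precision:", precision_score(y_true, y_pred))   # 0.5
print("Recall   :", recall_score(y_true, y_pred))      # ~0.2857
print("F1 Score :", f1_score(y_true, y_pred))          # ~0.3636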
ROC Curve
• An ROC curve (receiver operating characteristic curve) is a graph
showing the performance of a classification model at all
classification thresholds.
• This curve plots two parameters: True Positive Rate and False
Positive Rate.
ROC Curve
• True Positive Rate (TPR) is a synonym for recall and is therefore defined as: TPR = TP / (TP + FN)
• False Positive Rate (FPR) is defined as: FPR = FP / (FP + TN)
• An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
ROC Curve
• The following figure shows a typical ROC curve.
ROC Curve
• With a ROC curve, you're trying to find a good model that optimizes the trade-off between the False Positive Rate (FPR) and True Positive Rate (TPR). What counts here is how much area is under the curve (Area Under the Curve = AUC).
• The ideal curve in the left image fills in 100%, which means that you’re
going to be able to distinguish between negative results and positive
results 100% of the time (which is almost impossible in real life).
ROC Curve
• A Receiver Operator Characteristic (ROC) curve is a graphical plot used to
show the diagnostic ability of binary classifiers.
• A ROC curve is constructed by plotting the true positive rate (TPR)
against the false positive rate (FPR).
• The true positive rate is the proportion of observations that were
correctly predicted to be positive out of all positive observations (TP/(TP
+ FN)).
Plot ROC-AUC Curve for binary classification problem
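A minimal sketch of plotting the ROC curve and computing the AUC for a binary problem (the dataset and classifier below are stand-ins chosen for the example):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]      # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")      # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()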
Thank You