Module 2-Supervised Learning

The document provides an overview of supervised learning, focusing on linear regression and logistic regression. It explains the concepts of simple and multiple linear regression, including how to derive the regression line and predict outcomes, as well as the use of logistic regression for binary classification problems. Additionally, it discusses performance measures, cost functions, and optimization techniques like gradient descent for improving model accuracy.

Uploaded by

kush tejani
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
49 views74 pages

Module 2-Supervised Learning

The document provides an overview of supervised learning, focusing on linear regression and logistic regression. It explains the concepts of simple and multiple linear regression, including how to derive the regression line and predict outcomes, as well as the use of logistic regression for binary classification problems. Additionally, it discusses performance measures, cost functions, and optimization techniques like gradient descent for improving model accuracy.

Uploaded by

kush tejani
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Supervised Learning

Linear Regression
Module 2
Linear Regression
Simple Linear Regression
• Simple linear regression is when you want to predict
values of one variable, given values of another variable.
For example, you might want to predict a person's height
(in inches) from his weight (in pounds).
• Imagine a sample of ten people for whom you know their
height and weight. You could plot the values on a graph, with
weight on the x axis and height on the y axis.
Linear Regression

• If there were a perfect linear relationship between height and weight, then all 10 points on the graph would fit on a straight line. But this is never the case (unless your data are rigged).
• If there is a (non-perfect) linear relationship between height and weight (presumably a positive one), then you would get a cluster of points on the graph which slopes upward. In other words, people who weigh more should tend to be taller than people who weigh less. (See graph.)
Linear Regression
• The purpose of regression analysis is to come up with the equation of a line that fits through that cluster of points with the minimum amount of deviation from the line.
• The deviation of the points from the line is called
"error." Once you have this regression equation, if you
knew a person's weight, you could then predict their
height.
• Simple linear regression is actually the same as a
bivariate correlation between the independent and
dependent variable.
Linear Regression
• After verifying that the linear correlation between two variables is significant,
next we determine the equation of the line that can be used to predict the value of
y for a given value of x.
For a given x-value,
d = (observed y-value) – (predicted y-value)

• For each data point, di is the difference between the observed y-value and the predicted y-value for a given x-value on the line. These differences are called residuals.
Regression Line
A regression line, also called a line of best fit, is the line for which the sum of the
squares of the residuals is a minimum.

The Equation of a Regression Line


The equation of a regression line for an independent variable
x and a dependent variable y is
ŷ = mx + b
where ŷ is the predicted y-value for a given x-value. The
slope m and y-intercept b are given by,
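The formulas themselves appear as an image in the original; the standard least-squares expressions they denote are:

m = (n ∑xy - (∑x)(∑y)) / (n ∑x² - (∑x)²)
b = ȳ - m·x̄ = (∑y)/n - m·(∑x)/n

where n is the number of data points.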
Regression Line

Example:
Find the equation of the regression line.
x     y     xy     x²     y²
1    -3    -3      1      9
2    -1    -2      4      1
3     0     0      9      0
4     1     4     16      1
5     2    10     25      4
∑x = 15   ∑y = -1   ∑xy = 9   ∑x² = 55   ∑y² = 15
Regression Line

Step 1 : Calculate m
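Using the column sums above (with n = 5), the arithmetic works out as:

m = (n ∑xy - ∑x ∑y) / (n ∑x² - (∑x)²) = (5·9 - 15·(-1)) / (5·55 - 15²) = (45 + 15) / (275 - 225) = 60/50 = 1.2

Step 2 : Calculate b
b = ȳ - m·x̄ = (-1/5) - 1.2·(15/5) = -0.2 - 3.6 = -3.8

So the equation of the regression line is ŷ = 1.2x - 3.8.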
Regression Line
Example:
The following data represents the number of hours 12 different students watched television
during the weekend and the scores of each student who took a test the following Monday.
a.) Find the equation of the regression line.
b.) Use the equation to find the expected test score
for a student who watches 9 hours of TV.

Hours, x        0     1     2     3     3     5     5     5     6     7     7     10
Test score, y   96    85    82    74    95    68    76    84    58    65    75    50
xy              0     85    164   222   285   340   380   420   348   455   525   500
x²              0     1     4     9     9     25    25    25    36    49    49    100
y²              9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500
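Using the column sums (n = 12, ∑x = 54, ∑y = 908, ∑xy = 3724, ∑x² = 332), the regression coefficients work out to:

m = (12·3724 - 54·908) / (12·332 - 54²) = (44688 - 49032) / (3984 - 2916) = -4344/1068 ≈ -4.07
b = ȳ - m·x̄ = 908/12 - (-4.07)(54/12) ≈ 75.67 + 18.30 ≈ 93.97

a.) ŷ ≈ -4.07x + 93.97
b.) For x = 9 hours of TV: ŷ ≈ -4.07(9) + 93.97 ≈ 57.4, so the expected test score is about 57.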
Performance Measures
Performance metrics for linear regression help evaluate how well the model fits the
data. Here are the most commonly used ones:

yi is the ith observed value, and ŷi is the estimated (predicted) value of yi.
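The metric formulas themselves appear as images in the original; the standard definitions presumably being referenced are:

MSE (Mean Squared Error) = (1/n) ∑ (yi - ŷi)²
RMSE (Root Mean Squared Error) = √MSE
MAE (Mean Absolute Error) = (1/n) ∑ |yi - ŷi|
R² (Coefficient of Determination) = 1 - ∑ (yi - ŷi)² / ∑ (yi - ȳ)²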


Multiple regression Methods

In many instances, a better prediction can be found for a dependent (response) variable by using more than one independent (explanatory) variable.
For example, a more accurate prediction of Monday's test grade from the previous section might be made by considering the number of other classes a student is taking as well as the student's previous knowledge of the test material.
A multiple regression equation has the form
ŷ = b + m1x1 + m2x2 + m3x3 + … + mkxk
where x1, x2, x3,…, xk are independent variables, b is the y-intercept, and y is the
dependent variable.
Multiple Regression
After finding the equation of the multiple regression line, you can use the equation to
predict y-values over the range of the data.

Example:
The following multiple regression equation can be used to predict the annual U.S. rice yield
(in pounds).

ŷ = 859 + 5.76x1 + 3.82x2

where x1 is the number of acres planted (in thousands), and x2 is the number of acres
harvested (in thousands).
a.) Predict the annual rice yield when x1 = 2758 and x2 = 2714.
b.) Predict the annual rice yield when x1 = 3581 and x2 = 3021.
Multiple Regression
Example continued:

a.) ŷ = 859 + 5.76x1 + 3.82x2
    = 859 + 5.76(2758) + 3.82(2714)
    = 27,112.56
The predicted annual rice yield is 27,112.56 pounds.

b.) ŷ = 859 + 5.76x1 + 3.82x2
    = 859 + 5.76(3581) + 3.82(3021)
    = 33,025.78
The predicted annual rice yield is 33,025.78 pounds.
Logistic Regression
• Logistic Regression is a “Supervised machine learning” algorithm that can be
used to model the probability of a certain class or event. It is used when the data
is linearly separable and the outcome is binary.
• Logistic regression is usually used for Binary classification problems.
• Let us first see whether we can use linear regression to solve a binary classification problem. Assume we have a dataset that is linearly separable and whose output is discrete with two classes (0, 1).
Logistic Regression
How does Logistic Regression Work?
Consider a model with one predictor “x” and one response variable “ŷ”, where p is the probability that ŷ = 1. The linear equation can be written as:
p = b0+b1x --------> eq 1
• The right-hand side of the equation (b0+b1x) is a linear equation and can hold
values that exceed the range (0,1). But we know probability will always be in the
range of (0,1).
• To overcome that, we predict odds instead of probability.
• Odds: The ratio of the probability of an event occurring to the probability of an
event not occurring.
• Odds = p/(1-p)
• The equation 1 can be re-written as:
• p/(1-p) = b0+b1x --------> eq 2
• ln(p/(1-p)) = b0+b1x --------> eq 3
Logistic Regression
• To recover p from equation 3, we apply the exponential on both sides:
exp(ln(p/(1-p))) = exp(b0+b1x)
p/(1-p) = e^(b0+b1x)
• p = (1-p) * e^(b0+b1x), i.e. p = e^(b0+b1x) - p * e^(b0+b1x)
• Moving the p terms to one side and taking p as common:
p * (1 + e^(b0+b1x)) = e^(b0+b1x), so p = e^(b0+b1x) / (1 + e^(b0+b1x))
• Dividing the numerator and denominator by e^(b0+b1x) on the right-hand side:
p = 1 / (1 + e^-(b0+b1x))
• Similarly, the equation for a logistic model with n predictors is:
p = 1 / (1 + e^-(b0+b1x1+b2x2+b3x3+...+bnxn))
• The right-hand side looks familiar, doesn't it? Yes, it is the sigmoid function. It squeezes the output into the range between 0 and 1.
Sigmoid Function
• The sigmoid function maps any real-valued input to a value between 0 and 1, so its output can be interpreted as a probability.
Logistic Regression

• Linear model: ŷ = b0 + b1x
• Sigmoid function: σ(z) = 1/(1 + e^-z)
• Logistic regression model: ŷ = σ(b0 + b1x) = 1/(1 + e^-(b0+b1x))
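A minimal Python sketch of this model; the coefficient values b0 and b1 below are made up purely for illustration:

import numpy as np

def sigmoid(z):
    # squeezes any real-valued input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 1.5          # hypothetical coefficients, for illustration only

def predict_proba(x):
    # logistic regression model: p(y = 1 | x) = sigmoid(b0 + b1*x)
    return sigmoid(b0 + b1 * x)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(predict_proba(x))                        # probabilities between 0 and 1
print((predict_proba(x) >= 0.5).astype(int))   # class labels with a 0.5 threshold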
Logistic Regression
E.g. 1. Construct a logistic regression model with two predictors
Logistic Regression

Calculate Accuracy using above confusion matrix
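The confusion matrix itself is shown as a figure; in general, accuracy is computed from a confusion matrix as

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives.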


Maximum Likelihood Estimation

But how would you be sure that the given sigmoid curve is the best sigmoid curve?
The logistic regression does this by estimating the optimal values for the coefficients that
maximize the likelihood of the observed data.
This process is called Maximum Likelihood Estimation.

Maximum Likelihood Estimation (MLE) is a statistical method used to estimate the parameters of
a probability distribution that best explains the observed data.
Mathematically, the likelihood function L(θ) is defined as the product of the individual probabilities of each data point given the parameters (derived from the Bernoulli distribution), or equivalently the joint probability of independent events.
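In symbols, for binary labels yi ∈ {0, 1} and predicted probabilities pi, the likelihood described above is

L(β0, β1, ..., βn) = ∏ pi^yi · (1 - pi)^(1 - yi)

where the product runs over all training examples and pi is the model's predicted probability that yi = 1.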
Logistic Regression

So, the best-fitting combination of β0, β1… βn will be the one that maximizes this product.

Log Likelihood
Because the individual probabilities all lie between 0 and 1, their product quickly becomes an extremely small number, which is not easy or feasible to work with mathematically.
Taking the logarithm transforms the product of probabilities in the likelihood function into a sum of logarithms, which is generally easier to work with; we call this the Log Likelihood. Our target is to maximize it, so that we can get the best coefficient values.
Logistic Regression

Negative Log Likelihood

Since the logarithm of a probability value lying between 0 and 1 is always negative, the log likelihood will always return negative values.

So, in order to work with positive values, we multiply it by -1, and by doing so we get the negative log-likelihood.

Cost Function

In logistic regression, we consider the negative log-likelihood as the cost function, and it is also called a
cross-entropy function.
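Written out, the log likelihood and the resulting cost are

log L = ∑ [ yi·ln(pi) + (1 - yi)·ln(1 - pi) ]
Cost (negative log-likelihood / cross-entropy) = -(1/m) ∑ [ yi·ln(pi) + (1 - yi)·ln(1 - pi) ]

where the sum runs over the m training examples (the 1/m averaging factor is a common convention).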
Logistic Regression
Cross entropy is a loss function that is used to quantify the difference between two probability distributions. It is
a measure of how well one distribution predicts another distribution.
So, if it is at a minimum, the true and the predicted probability distributions are close to each other, which is what we always aim for.
To optimize the parameters in logistic regression, iterative optimization algorithms like gradient descent are
commonly used. These algorithms may converge to a local minimum, but there is no guarantee that the found
solution is the global minimum.
Logistic Regression
In linear regression, we use mean squared error (MSE) as the cost function. But in logistic regression, using the mean of the squared differences between actual and predicted outcomes as the cost function would give a wavy, non-convex cost surface containing many local optima.
In this case, gradient descent is not guaranteed to find the optimal solution.
Instead, we use a logarithmic function to represent the
cost of logistic regression.
It is guaranteed to be convex for all input values, containing
only one minimum, allowing us to run the gradient descent
algorithm.
When dealing with a binary classification problem, the
logarithmic cost of error depends on the value of y. We can
define the cost for two cases separately:
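The two cases (shown as a figure in the original) are the standard logarithmic costs

Cost(ŷ, y) = -ln(ŷ)        if y = 1
Cost(ŷ, y) = -ln(1 - ŷ)    if y = 0

so the cost is 0 when the prediction matches the true label exactly and grows without bound as the prediction moves towards the wrong label.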
Logistic Regression
Gradient Descent Algorithm
Gradient descent is an iterative optimization algorithm, which finds the minimum of a differentiable function. In
this process, we try different values and update them to reach the optimal ones, minimizing the output.

This way, we can find an optimal solution minimizing the cost over model parameters:
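In general form, each parameter θj is moved a small step in the direction of the negative gradient of the cost J, with learning rate α:

θj := θj - α · ∂J/∂θj

and the update is repeated until the cost stops decreasing (convergence).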
Gradient descent Algorithm for Logistic Regression

x (Exam Score)    y (Admitted = 1)
2                 0
4                 0
6                 1
8                 1
Gradient descent Algorithm

Iteration    β₀         β₁        grad_b0    grad_b1
1            0.0000     0.1000    0.0000     -1.0000
2           -0.0121     0.1278    0.1210     -0.2780
3           -0.0270     0.1389    0.1490     -0.1110
4           -0.0428     0.1447    0.1575     -0.0581
5           -0.0588     0.1486    0.1602     -0.0392
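A minimal Python sketch that reproduces the table above; the initial values β₀ = β₁ = 0 and the learning rate of 0.1 are assumptions, since the corresponding slide is an image:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # exam scores
y = np.array([0.0, 0.0, 1.0, 1.0])   # admitted (1) or not (0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = 0.0, 0.0    # assumed initial parameters
lr = 0.1             # assumed learning rate

for it in range(1, 6):
    p = sigmoid(b0 + b1 * x)           # predicted probabilities
    grad_b0 = np.mean(p - y)           # gradient of the cross-entropy cost w.r.t. b0
    grad_b1 = np.mean((p - y) * x)     # gradient w.r.t. b1
    b0 -= lr * grad_b0                 # gradient descent updates
    b1 -= lr * grad_b1
    print(f"{it}  b0={b0:.4f}  b1={b1:.4f}  grad_b0={grad_b0:.4f}  grad_b1={grad_b1:.4f}")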


Gradient descent Algorithm
For linear regression
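For the linear model ŷ = mx + b with the MSE cost, the gradient descent updates take the form

m := m - α · (1/n) ∑ (ŷi - yi)·xi
b := b - α · (1/n) ∑ (ŷi - yi)

where α is the learning rate and n is the number of training examples (some presentations keep a factor of 2 from differentiating the squared error).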
Gradient descent Algorithm with logistic regression

We have one feature (e.g. a normalized tumor measurement) and a binary label:
x = feature (continuous)
y ∈ {0, 1}, where 1 = malignant, 0 = benign

x      y
0.5    0
1.0    0
1.5    1
2.0    1
2.5    1
Gradient descent Algorithm with logistic regression
Optimization
• In machine learning, optimizers and loss functions are two components that help
improve the performance of the model.
• A loss function measures the performance of a model by measuring the difference
between the output expected from the model and the actual output obtained from
the model.
• The optimizer helps improve the model by adjusting its parameters to minimize the
loss function value.
• The role of the optimizer is to find the best set of parameters (weights and biases)
of the neural network that allow it to make accurate predictions.
• Optimization is at the heart of almost all machine learning techniques: choosing the best element from some set of available alternatives.
Optimization
• The computational methods for iterative optimization can be broadly divided into three types.
• Zero-Order or Direct Search methods: these explore a range of potential values of the variable x to find the minimum of the objective function.
• First-Order or Gradient methods: these make use of the first-order partial derivatives, e.g. gradient descent.
• Second-Order methods: these make use of the second-order partial derivatives, e.g. Newton's method.
or
• Derivative-based optimization - Steepest Descent, Newton's method
• Derivative-free optimization - Random Search, Downhill Simplex


Optimization
Optimization is the process of finding the best solution from all possible solutions to a problem, usually by maximizing
or minimizing an objective function.
Mathematically:
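In its general form (shown as a figure in the original), an optimization problem is written as

minimize f(x)  subject to  x ∈ S

where f is the objective function, x is the vector of decision variables, and S is the feasible set defined by the constraints (for a maximization problem, maximize f(x) instead).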

Types of Optimization Methods:

1. Derivative-Based Methods
   Gradient Descent
   Newton's Method
2. Derivative-Free Methods
   Grid Search
   Random Search, Downhill Simplex methods
Optimization
General Optimization Algorithm Structure
1. Initialize variables or population.
2. Evaluate the objective function for current solutions.
3. Update parameters using search strategy (gradient step, mutation, exploration, etc.).
4. Check stopping criteria (max iterations, convergence, tolerance).
5. Return the best solution.
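A schematic Python sketch of this loop; the objective, step rule and tolerance below are placeholders chosen only for illustration:

def optimize(objective, x0, step_rule, max_iters=1000, tol=1e-6):
    x = x0                                  # 1. Initialize variables
    fx = objective(x)                       # 2. Evaluate the objective
    best_x, best_fx = x, fx
    for _ in range(max_iters):              # 4. Stop after max_iters at the latest
        x = step_rule(x)                    # 3. Update using the chosen search strategy
        fx_new = objective(x)
        if fx_new < best_fx:
            best_x, best_fx = x, fx_new
        if abs(fx_new - fx) < tol:          # 4. Stop when the change is below tolerance
            break
        fx = fx_new
    return best_x, best_fx                  # 5. Return the best solution found

# example: minimize (x - 3)^2 with a fixed gradient step
print(optimize(lambda x: (x - 3.0) ** 2, 0.0, lambda x: x - 0.1 * 2.0 * (x - 3.0)))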

Example Applications
● Engineering: Minimize material cost while keeping strength.

● Machine Learning: Minimize loss function to improve model accuracy.

● Operations Research: Maximize profits with limited resources.

● Networking: Minimize latency or maximize throughput.


Basic Elements of Optimization
Basic elements of optimization: There are three basic elements of any optimization
problem

Variables: These are the free parameters which the algorithm can tune

Constraints: These are the boundaries within which the parameters (or some
combination thereof) must fall

Objective function: This is the goal towards which the algorithm drives the solution. For machine learning, this often amounts to minimizing some error measure or maximizing some utility function.
Steepest Descent method

Derivative-based optimization deals with gradient-based optimization techniques, capable of determining search directions according to an objective function's derivative information.

1. Steepest Descent: In this method, the search starts from an initial trial point X1 and iteratively moves along the steepest descent direction until the optimum point is found. Although the method is straightforward, it is not well suited to problems having multiple local optima; in such cases the solution may get stuck at a local optimum.
Newton’s Method

Newton method: Newton's method is a very popular method which is based on Taylor's Series
expansion.
Basic Concept of Newton Method
Let us first understand Newton's method for root finding; then we will see how it can be used for optimization.
Problem: We have a function f(x). The goal is to find a root of f(x), i.e. a point where f(x) = 0.
Initial guess: We make an initial guess x0 for the root.
Update: We update the current guess to get a new estimate using the formula below.

The method proceeds iteratively, repeating the above step until we reach a predefined tolerance or a maximum number of iterations.
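This update is the standard Newton root-finding step

x(n+1) = x(n) - f(x(n)) / f'(x(n))

For example, for f(x) = x² - 2 with initial guess x0 = 1, one step gives x1 = 1 - (-1)/2 = 1.5, and a second step gives x2 = 1.5 - 0.25/3 ≈ 1.4167, already close to √2.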
Newton’s Method

Newton's method is a very popular method based on the Taylor series expansion of a function about the current point.
Newton Method for Optimization
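In outline, Newton's method for optimization applies the root-finding step to f'(x) = 0 (a stationary point): truncating the Taylor series at second order around the current point and minimizing that quadratic gives the update

x(n+1) = x(n) - f'(x(n)) / f''(x(n))

or, in several dimensions, x(n+1) = x(n) - H⁻¹ ∇f(x(n)), where H is the Hessian matrix of second derivatives.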


Derivative free optimization
• Derivative-free optimization algorithms are often used when it is difficult to find function derivatives, or when finding such derivatives is time-consuming.
• Derivative-free optimization is a discipline in mathematical optimization that does not use derivative information in the classical sense to find optimal solutions.
• Sometimes information about the derivative of the objective function f is unavailable, unreliable or impractical to obtain.
• For example, f might be non-smooth, time-consuming to evaluate, or noisy in some way, so that methods that rely on derivatives or approximate them via finite differences are of little use.
• The problem of finding optimal points in such situations is referred to as derivative-free optimization.
Random Search
Random Search: This method generates trial solutions for the optimization model using random number generators for the decision variables.
Random search methods include the random jump method, the random walk method, and the random walk method with direction exploitation.
The random jump method generates a large number of data points for the decision variables, assuming a uniform distribution for them, and finds the best solution by comparing the corresponding objective function values.
The random walk method generates trial solutions with sequential improvements, governed by a scalar step length and a unit random vector.
The random walk method with direction exploitation is an improved version of the random walk method in which, first, the successful direction of generating trial solutions is found, and then the maximum possible steps are taken along this successful direction.
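A minimal Python sketch of the random jump method; the objective function, bounds and number of trials below are placeholders chosen only for illustration:

import random

def objective(x):
    # example objective to minimize (placeholder): minimum value 1 at x = 2
    return (x - 2.0) ** 2 + 1.0

def random_jump(objective, low, high, n_trials=1000, seed=0):
    # generate trial points uniformly in [low, high] and keep the best one found
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(low, high)
        f = objective(x)
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

best_x, best_f = random_jump(objective, -10.0, 10.0)
print(best_x, best_f)   # best_x close to 2, best_f close to 1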
Methods of Random Search
Random Search
Random Walk Method
Random Walk Method with Direction Exploitation
Single feature Logistic Regression
Simplex

In geometry, a simplex (plural: simplexes or simplices) is a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions. The simplex is so named because it represents the simplest possible polytope in any given dimension. For example,
a 0-dimensional simplex is a point,
a 1-dimensional simplex is a line segment,
a 2-dimensional simplex is a triangle,
a 3-dimensional simplex is a tetrahedron, and
a 4-dimensional simplex is a 5-cell.
Downhill Simplex
Simplex Method: Simplex method is a conventional direct search algorithm where the best
solution lies on the vertices of a geometric figure in N-dimensional space made of a set of N+1
points.
The method compares the objective function values at the N+1 vertices and moves towards
the optimum point iteratively.
The movement of the simplex algorithm is achieved by reflection, contraction and expansion.

The Nelder–Mead method (also downhill simplex method, amoeba method, or polytope
method) is a numerical method used to find the minimum or maximum of an objective
function in a multidimensional space.
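For reference, a short Python sketch using SciPy's implementation of the Nelder-Mead (downhill simplex) method; the objective function and starting point are placeholders chosen for illustration:

import numpy as np
from scipy.optimize import minimize

def objective(v):
    # example 2-D objective (placeholder): minimum at (1, 2)
    x, y = v
    return (x - 1.0) ** 2 + (y - 2.0) ** 2

x0 = np.array([0.0, 0.0])                           # initial guess
result = minimize(objective, x0, method="Nelder-Mead")
print(result.x)      # approximately [1.0, 2.0]
print(result.fun)    # approximately 0.0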
Simplex
