
MACHINE LEARNING

UNIT 2
21CSC305P
Maximum Likelihood Estimation

• Maximum Likelihood Estimation (MLE) is a statistical method used to estimate the parameters of the probability distribution that best describes a given dataset.
• To analyze the data provided, we first need to identify the distribution from which the data was obtained.
• Next, we use the data to find the parameters of that distribution. A parameter is a numerical characteristic of a distribution.
Maximum Likelihood Estimation
Example distributions and their parameters:
• Normal distribution - mean (µ) and variance (σ²)
• Binomial distribution - number of trials (n) and probability of success (p)
• Gamma distribution - shape (k) and scale (θ)
• Exponential distribution - rate (λ), the inverse of the mean
• These parameters are vital for understanding the size, shape, spread, and other properties of a distribution.
• Since the data we have is usually randomly generated, we often do not know the true values of the parameters characterizing our distribution.
Examples of Probability Distributions
Normal Distribution (µ, σ²)
Blood pressure of healthy adults
Mean (µ) ≈ 120 mmHg
Variance (σ²) shows variation across individuals
Binomial Distribution (n, p)
Quality check of light bulbs
n = 20 bulbs tested
p = 0.1 defective probability
Models: Number of defective bulbs found
Gamma Distribution (k, θ)
Time until 5 patients arrive at ER
k = 5 arrivals
θ = 10 minutes (average gap)
Models: Waiting time for group events
Exponential Distribution (λ)
Time between bus arrivals
λ = 1/15 per minute
Mean waiting time = 15 minutes
Models: Time between random events
Maximum Likelihood Estimation
• An estimator is a function of the data that gives approximate values of the parameters.
Ex: the sample-mean estimator - a simple and frequently used estimator.
• Since the numerical characteristics of the distribution vary with the value of the parameter, it is not easy to estimate the parameter θ of the distribution directly.
• Maximum likelihood estimation is a process of estimation that gives an entire class of estimators called maximum likelihood estimators, or MLEs.
Maximum Likelihood Estimation

When to Use Log-Likelihood:

• When Dealing with Large Datasets: The likelihood function can become extremely small as more data points are considered, leading to computational difficulties. The log-likelihood avoids this by converting multiplication into addition.
• Simplifying Derivatives: When performing MLE, you often need to take derivatives to find the maximum. The log-likelihood simplifies this process, as the logarithm of a product becomes a sum of logarithms, making differentiation easier.
The log-likelihood is a transformed version of the likelihood function that is more mathematically and computationally convenient for optimization in machine learning models.
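For illustration, here is a minimal Python sketch (not from the slides; the data is synthetic) of MLE for a normal distribution, using the closed-form estimates and evaluating the log-likelihood at them:

```python
# Minimal sketch (not from the slides; synthetic data): MLE for a normal
# distribution. The closed-form estimates are the sample mean and the
# (biased, divide-by-n) sample variance; the log-likelihood is evaluated at them.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=120.0, scale=10.0, size=500)   # e.g. blood-pressure-like readings

def normal_log_likelihood(x, mu, sigma2):
    """Log-likelihood of i.i.d. data x under N(mu, sigma2)."""
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

mu_hat = data.mean()                        # MLE of the mean
sigma2_hat = ((data - mu_hat) ** 2).mean()  # MLE of the variance (divides by n, not n-1)

print("MLE mean:", mu_hat)
print("MLE variance:", sigma2_hat)
print("log-likelihood at the MLE:", normal_log_likelihood(data, mu_hat, sigma2_hat))
```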
Maximum Likelihood Estimation in linear regression
Linear regression - solved example
Ordinary Least Squares
Ordinary Least Squares (OLS) is the most common method used to estimate the coefficients (β) in linear regression.
It finds the best-fitting line through the data by minimizing the sum of squared errors.

We assume a linear relationship:

y = β0 + β1x + ϵ

where ϵ is the error term.

OLS chooses β0 and β1 (intercept and slope) so that the total error between the predicted values (ŷ) and the actual values (y) is as small as possible.
Ordinary Least Squares
The error we minimize is the residual sum of squares,

RSS = Σ (yi − ŷi)²,

which is why the method is called least squares.
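A minimal sketch (with made-up data) of an OLS fit that minimizes this residual sum of squares:

```python
# Minimal sketch (made-up data): OLS fit of y = b0 + b1*x by minimizing the
# residual sum of squares, using numpy's least-squares solver.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])

X = np.column_stack([np.ones_like(x), x])      # design matrix: [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of (b0, b1)
b0, b1 = beta

y_hat = X @ beta
rss = np.sum((y - y_hat) ** 2)                 # the quantity OLS minimizes
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}, RSS = {rss:.4f}")
```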

Drawback of OLS - multicollinearity

Multicollinearity happens when two or more predictors (independent variables) are highly
correlated.
This means they provide overlapping information about the target variable.
In regression, this makes it hard to separate their individual effects, leading to unstable
coefficients.
Multicollinearity
Example
We want to predict house price (y) using:
x1 = house size (sq ft)
x2 = number of rooms
Larger houses usually have more rooms, so x1 and x2 are highly correlated → multicollinearity.

OLS tries to assign coefficients β1 and β2 to fit the data.

Because x1 and x2 overlap, OLS struggles to decide how much of the price is explained by size vs. rooms.
Result: unstable and counterintuitive coefficients.
Example:
β1 = 200, β2 = −150
Ridge regression
Ridge Regression is a regularization technique used in linear regression to handle
the problem of multicollinearity and overfitting. It is also known as L2
Regularization.
In ordinary linear regression (OLS), we minimize the residual sum of squares (RSS).
Ridge Regression Objective Function
However, when features are highly correlated, OLS estimates become unstable (large variance in the coefficients). Ridge regression solves this problem by adding a penalty term to the loss function:

minimize RSS + λ Σ βj²

Ridge Regression penalty term
The penalty term λ Σ βj² penalizes large coefficients, shrinks them toward zero, reduces model complexity, and combats multicollinearity and overfitting.

Why a Squared Penalty (L2)?

The L2 penalty (squared coefficients) shrinks coefficients smoothly but never makes them exactly zero.
That is why Ridge keeps all features in the model.
(Compare with Lasso's L1 penalty, which can shrink some coefficients to exactly zero → feature selection.)
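A minimal sketch (synthetic data, arbitrary λ) comparing OLS and ridge coefficients on two highly correlated predictors:

```python
# Minimal sketch (synthetic data): OLS vs. ridge on two highly correlated
# predictors (size and rooms, as in the multicollinearity example). Ridge uses
# its closed form beta = (X^T X + lambda*I)^(-1) X^T y; lambda = 10 is arbitrary.
import numpy as np

rng = np.random.default_rng(1)
size = rng.uniform(50, 200, 100)                    # x1: house size
rooms = size / 30 + rng.normal(0, 0.2, 100)         # x2: almost a function of size
price = 3.0 * size + 10.0 * rooms + rng.normal(0, 20, 100)

X = np.column_stack([size, rooms])
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize before penalizing
y = price - price.mean()

lam = 10.0
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("OLS coefficients:  ", beta_ols)    # can be large and unstable
print("Ridge coefficients:", beta_ridge)  # shrunk toward zero, more stable
```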
Example - OLS and Ridge regression
Logistic regression

Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal (categorical), ordinal (ordered), interval (no true zero) or ratio-level (true zero) independent variables.
Remember: although the name of the algorithm contains "regression", it is used for classification.
Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. Instead of returning the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1.
Logistic regression - Types
Logistic regression

Decision Rule

If p ≥ 0.5, predict class 1.
If p < 0.5, predict class 0.
Logistic regression - sigmoid function
Logistic regression - Coefficients
Logistic regression - solved problem

| No. | x1 | x2 | x1-x1bar | x2-x2bar | (x1-x1bar)(x2-x2bar) | (x1-x1bar)^2 | (x2-x2bar)^2 | Logit L = b0 + b1*x1 + b2*x2 | P = 1/(1+exp(-L)) | Y new | Given Y | Accuracy |
| 1 | 68 | 166 | 27.1 | 77.2 | 2092.12 | 734.41 | 5959.84 | 216.2132671 | 1 | 1 | 1 | 1 |
| 2 | 70 | 178 | 29.1 | 89.2 | 2595.72 | 846.81 | 7956.64 | 225.5147413 | 1 | 1 | 1 | 1 |
| 3 | 72 | 170 | 31.1 | 81.2 | 2525.32 | 967.21 | 6593.44 | 226.7086169 | 1 | 1 | 1 | 1 |
| 4 | 66 | 124 | 25.1 | 35.2 | 883.52 | 630.01 | 1239.04 | 194.750395 | 1 | 1 | 0 | 0 |
| 5 | 66 | 115 | 25.1 | 26.2 | 657.62 | 630.01 | 686.44 | 191.1019757 | 1 | 1 | 0 | 0 |
| 6 | 67 | 135 | 26.1 | 46.2 | 1205.82 | 681.21 | 2134.44 | 201.4280318 | 1 | 1 | 0 | 0 |
| sum | 409 | 888 | 163.6 | 355.2 | 9960.12 | 4489.66 | 24569.84 | 1255.717028 | | | | 50 |

x1-bar = 40.9, x2-bar = 88.8
Logistic regression - solved problem

b1 = Σ(x1-x1bar)(x2-x2bar) / Σ(x1-x1bar)^2 = 9960.12 / 4489.66 = 2.218457522

b2 = Σ(x1-x1bar)(x2-x2bar) / Σ(x2-x2bar)^2 = 9960.12 / 24569.84 = 0.405379929

b0 = x2bar − b1*x1bar = 88.8 − 2.218457522 × 40.9 = −1.934912666
Logistic regression - sigmoid function

| Study hours | Pass/Fail |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |

Let β0 = −31.165 and β1 = 9.19.
Use the formula p = 1 / (1 + exp(−(β0 + β1·x))) to find the pass/fail status of students who study for 3.5 hours, 5.2 hours and 4.5 hours.
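A minimal sketch applying the given coefficients and the decision rule to the requested study hours:

```python
# Minimal sketch: applying p = 1 / (1 + exp(-(b0 + b1*x))) with the given
# coefficients to the requested study hours and using the p >= 0.5 decision rule.
import numpy as np

b0, b1 = -31.165, 9.19
hours = np.array([3.5, 5.2, 4.5])

p = 1.0 / (1.0 + np.exp(-(b0 + b1 * hours)))
labels = (p >= 0.5).astype(int)      # 1 = pass, 0 = fail

for h, prob, label in zip(hours, p, labels):
    print(f"{h} hours: p = {prob:.4f} -> {'pass' if label else 'fail'}")
```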
Robust linear Regression

• Robust linear regression is designed to be less sensitive to outliers than traditional linear regression.
• Traditional linear regression minimizes the sum of squared residuals, which can be heavily influenced by outliers.
• Robust linear regression uses different techniques to mitigate the effect of outliers and produce a more reliable model.
Robust linear Regression

• Linear Regression: suitable when the data meets the assumptions, especially when there are no significant outliers and the relationship is linear.
• Robust Linear Regression: appropriate when there are outliers or when the assumptions of linear regression are violated, making it more reliable for real-world data that may not adhere perfectly to theoretical assumptions.

Robust linear Regression
Robust linear Regression - Huber Loss
Robust linear Regression - Huber Loss
Given:
Residuals: r = [−2, −0.5, 0.3, 1.2], δ = 1
Evaluate the loss and ψ(r) for each residual.
Solution:

Case 1: r = −2
Check: |−2| = 2 > δ → use the second formula.
Loss: L = δ(|r| − 0.5δ) = 1 × (2 − 0.5) = 1.5
ψ: ψ = δ × sign(−2) = −1
Result: Loss = 1.5, ψ = −1

Robust linear Regression - Huber Loss
Case 2: r = −0.5
Check: |−0.5| = 0.5 ≤ 1 → use the first formula.
Loss: L = 0.5 r² = 0.5 × 0.25 = 0.125
ψ: ψ = r = −0.5
Result: Loss = 0.125, ψ = −0.5

Case 3: r = 0.3
Check: |0.3| = 0.3 ≤ 1 → use the first formula.
Loss: L = 0.5 × (0.3)² = 0.5 × 0.09 = 0.045
ψ: ψ = r = 0.3
Result: Loss = 0.045, ψ = 0.3
Robust linear Regression - Huber Loss

Case 4: r = 1.2
Check: |1.2| = 1.2 > 1 → use the second formula.
Loss: L = 1 × (1.2 − 0.5) = 0.7
ψ: ψ = 1 × sign(1.2) = +1
Result: Loss = 0.7, ψ = +1
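A minimal sketch that reproduces all four cases at once:

```python
# Minimal sketch: Huber loss and its derivative psi for the residuals of the
# worked example, with delta = 1.
import numpy as np

def huber_loss(r, delta=1.0):
    """0.5*r^2 when |r| <= delta, else delta*(|r| - 0.5*delta)."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

def huber_psi(r, delta=1.0):
    """r when |r| <= delta, else delta*sign(r)."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

r = np.array([-2.0, -0.5, 0.3, 1.2])
print("loss:", huber_loss(r))   # [1.5, 0.125, 0.045, 0.7]
print("psi: ", huber_psi(r))    # [-1.0, -0.5, 0.3, 1.0]
```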
Bayes' theorem - Background
Bayesian Linear Regression
Bayesian Linear Regression

Prior P(β):
Our belief about the coefficients before seeing the data.

Likelihood P(y, X | β):
Probability of observing the data, given coefficients β.

Evidence P(y, X):
Normalization constant that ensures a valid probability distribution.

Posterior P(β | y, X):
The updated distribution of β after combining the prior and the data.
This is what we ultimately use for prediction.
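A minimal sketch (under assumed conjugate choices: Gaussian prior and Gaussian noise with known variance) of computing this posterior:

```python
# Minimal sketch (assumptions: Gaussian prior beta ~ N(0, tau^2 I) and Gaussian
# noise with known variance sigma^2, both made up here). Under these assumptions
# the posterior P(beta | y, X) is Gaussian with the standard conjugate formulas.
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])   # design matrix [1, x]
true_beta = np.array([1.0, 2.5])
y = X @ true_beta + rng.normal(0, 1.0, 50)

sigma2 = 1.0   # assumed noise variance
tau2 = 10.0    # assumed prior variance on each coefficient

Sigma_n = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)  # posterior covariance
mu_n = Sigma_n @ (X.T @ y) / sigma2                           # posterior mean

print("posterior mean:", mu_n)
print("posterior covariance:\n", Sigma_n)
```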
Bayesian Linear Regression
Discriminant Functions

A discriminant function is used in classification tasks to assign a given input to one of several possible classes. It is designed to make decisions based on the values of the input features by computing a score for each class. The class with the highest score is the one to which the input is assigned.
Discriminant Functions
Discriminant Functions- Fishers LDA

Fisher's Linear Discriminant Analysis (LDA) is a supervised technique used for dimensionality reduction and classification.
It finds a projection line that maximizes the separation between class means while minimizing the within-class scatter.
The method computes a discriminant vector w ∝ Sw⁻¹(m1 − m2) to achieve optimal separation.
It is widely used in face recognition, speech recognition, and other pattern classification tasks.
Discriminant Functions- Fishers LDA

Intuition
Suppose you have two classes of points (say red and blue) in a high-dimensional feature space.
You want to project them onto a line such that they are as well separated as possible.
Fisher's idea: find the projection direction (a vector w) that maximizes the distance between the class means while minimizing the spread (variance) within each class.
Discriminant Functions- Fishers LDA

Projection
Each point x (a feature vector) is projected onto a line:

y = wᵀx

where w is the direction vector we are trying to find.

Discriminant Functions- Fishers LDA

Decision rule
Discriminant Functions- Fishers LDA

Two classes are given:


Class 1 (C1): {2, 3, 4, 5, 6}
Class 2 (C2): {7, 8, 9, 10, 11}
Find Fisher’s LDA discriminant.

Step 1. Class means

m1 = (2 + 3 + 4 + 5 + 6) / 5 = 4,  m2 = (7 + 8 + 9 + 10 + 11) / 5 = 9
Discriminant Functions- Fishers LDA

Step 2: Within-class scatter (Sw)

Sw = Σ(x − m1)² + Σ(x − m2)² = (4 + 1 + 0 + 1 + 4) + (4 + 1 + 0 + 1 + 4) = 10 + 10 = 20
Discriminant Functions- Fishers LDA

Step 3: Fisher’s weight (w)

Step 4: Decision boundary
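Since the numeric results for Steps 3 and 4 were shown on the slides as images, here is a minimal sketch reproducing the whole computation (the decision boundary is taken as the midpoint of the class means, assuming equal priors):

```python
# Minimal sketch reproducing the 1-D example: class means, within-class scatter,
# Fisher's weight w = (m1 - m2) / Sw, and (assuming equal priors) a decision
# boundary at the midpoint of the class means.
import numpy as np

c1 = np.array([2, 3, 4, 5, 6], dtype=float)
c2 = np.array([7, 8, 9, 10, 11], dtype=float)

m1, m2 = c1.mean(), c2.mean()                          # class means: 4 and 9
sw = np.sum((c1 - m1) ** 2) + np.sum((c2 - m2) ** 2)   # within-class scatter: 20
w = (m1 - m2) / sw                                     # Fisher's weight (1-D case)
boundary = 0.5 * (m1 + m2)                             # midpoint threshold: 6.5

print(f"m1 = {m1}, m2 = {m2}, Sw = {sw}, w = {w}, boundary at x = {boundary}")

x_new = 6.2                                            # classify a new point
print("x_new ->", "C1" if abs(x_new - m1) < abs(x_new - m2) else "C2")
```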


Laplace Approximation

Laplace approximation is a technique in machine learning and statistics used to approximate integrals, particularly when dealing with Bayesian inference.
It is often applied when the exact calculation of posterior distributions is intractable. The method relies on approximating the posterior distribution with a Gaussian distribution centered around the mode of the posterior.
This is achieved by taking the second-order Taylor expansion of the log-posterior distribution around its mode.
Laplace Approximation
Key Steps in Laplace Approximation:
1. Find the Mode of the Posterior Distribution:
   - Identify the maximum a posteriori (MAP) estimate, which is the mode of the posterior distribution. This can be done using optimization techniques.
2. Approximate the Posterior with a Gaussian:
   - Use a second-order Taylor expansion of the log-posterior distribution around the MAP estimate. This results in a quadratic approximation, which corresponds to a Gaussian distribution.
3. Calculate the Hessian Matrix:
   - The covariance of the approximating Gaussian is the inverse of the Hessian matrix of the negative log-posterior evaluated at the mode. The Hessian captures the curvature of the posterior distribution.
4. Obtain the Approximation:
   - With the mode and covariance matrix, the posterior is approximated as a multivariate normal distribution.
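A minimal sketch of these steps on a made-up 1-D posterior:

```python
# Minimal sketch (1-D toy problem; the negative log-posterior below is made up):
# find the MAP by optimization, estimate the curvature (Hessian) numerically at
# the mode, and use N(mode, H^-1) as the Gaussian approximation of the posterior.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_posterior(theta):
    # Toy unnormalized example: Gaussian-like prior term plus a data term.
    return 0.5 * theta ** 2 + np.log(1 + np.exp(-3 * theta))

# Step 1: mode of the posterior (MAP estimate)
theta_map = minimize_scalar(neg_log_posterior).x

# Steps 2-3: second derivative of the negative log-posterior at the mode
h = 1e-3
hessian = (neg_log_posterior(theta_map + h)
           - 2 * neg_log_posterior(theta_map)
           + neg_log_posterior(theta_map - h)) / h ** 2

# Step 4: Gaussian approximation N(theta_map, 1/hessian)
print(f"MAP = {theta_map:.4f}, approximate posterior variance = {1.0 / hessian:.4f}")
```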
Laplace Approximation
Support Vector Machine
What is support vector?
• "Support Vector Machine" (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems. However, it is mostly used for classification.
• In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.
• Then, we perform classification by finding the hyperplane that differentiates the two classes well.
Support Vector Machine

SVM can easily handle multiple continuous and categorical variables.
SVM constructs a hyperplane in multidimensional space to separate different classes. It generates the optimal hyperplane in an iterative manner, which minimizes the classification error.
The core idea of SVM is to find a maximum marginal hyperplane (MMH) that best divides the dataset into classes.
Support Vector Machine
Support Vector Machine
Support Vectors
– Support vectors are the data points that are closest to the hyperplane. These points define the separating line by determining the margins.
– They are the most relevant points for the construction of the classifier.
Hyperplane
– A hyperplane is a decision surface that separates a set of objects having different class memberships.
Margin
– A margin is the gap between the two lines drawn through the closest points of each class.
– It is calculated as the perpendicular distance from the separating line to the support vectors (closest points).
– A larger margin between the classes is considered a good margin; a smaller margin is a bad margin.
Support Vector Machine
• The main objective is to segregate the given dataset in the best possible way.
• The distance between the nearest points of either class is known as the margin.
• The objective is to select a hyperplane with the maximum possible margin between the support vectors in the given dataset. SVM searches for the maximum marginal hyperplane in the following steps:
– Generate hyperplanes that segregate the classes well.
– Select the hyperplane with the maximum separation from the nearest data points of either class.
Support Vector Machine
Identify the right hyper-plane (Scenario-1):

Here, we have three hyper-planes (A, B, and C). Now, identify the right hyper-plane to classify stars and circles.

Remember this rule of thumb for identifying the right hyper-plane: "Select the hyper-plane which segregates the two classes better". In this scenario, hyper-plane "B" has performed this job well.
Support Vector Machine
Identify the right hyper-plane (Scenario-2):

• Here, we have three hyper-planes (A, B, and C), and all segregate the classes well. How can we identify the right hyper-plane?

• Maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us decide. This distance is called the margin.
Support Vector Machine
Identify the right hyper-plane (Scenario-2):

You can see that the margin for hyper-plane C is high compared to both A and B. Hence, we choose C as the right hyper-plane. Another important reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane with a low margin, there is a high chance of misclassification.
Support Vector Machine
Identify the right hyper-plane (Scenario-3):
• Hint: use the rules discussed in the previous scenarios to identify the right hyper-plane.

• Some of you may have selected hyper-plane B, as it has a higher margin than A. But here is the catch: SVM selects the hyper-plane that classifies the classes accurately before maximizing the margin. Here, hyper-plane B has a classification error while A has classified everything correctly. Therefore, the right hyper-plane is A.
Support Vector Machine
Identify the right hyper-plane (Scenario-4):
In this scenario, we cannot have a linear hyper-plane between the two classes, so how does SVM classify them? Until now, we have only looked at linear hyper-planes.
Support Vector Machine
Identify the right hyper-plane (Scenario-4):
• SVM solves this problem easily, by introducing an additional feature. Here, we will add a new feature z = x^2 + y^2. Now, let's plot the data points on the x and z axes:
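A minimal sketch (with made-up ring-shaped data) of how the added feature z = x^2 + y^2 makes the classes linearly separable:

```python
# Minimal sketch (made-up points): two classes arranged in concentric rings are
# not linearly separable in (x, y), but after adding z = x^2 + y^2 a simple
# threshold on z separates them.
import numpy as np

rng = np.random.default_rng(3)
r_in, a_in = rng.uniform(0, 1.5, 20), rng.uniform(0, 2 * np.pi, 20)      # inner class
r_out, a_out = rng.uniform(3.5, 4.5, 20), rng.uniform(0, 2 * np.pi, 20)  # outer class
inner = np.column_stack([r_in * np.cos(a_in), r_in * np.sin(a_in)])
outer = np.column_stack([r_out * np.cos(a_out), r_out * np.sin(a_out)])

z_inner = inner[:, 0] ** 2 + inner[:, 1] ** 2
z_outer = outer[:, 0] ** 2 + outer[:, 1] ** 2

# In the (x, z) plane the classes are separated by a horizontal line.
threshold = 0.5 * (z_inner.max() + z_outer.min())
print("max z (inner):", z_inner.max(), "min z (outer):", z_outer.min())
print("linear decision rule in z: predict outer class if z >", threshold)
```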
Types of SVM

• Linear SVM:
- Linear SVMs use a linear decision boundary to separate the data points of different classes.
- Linear SVMs are suitable when the data can be precisely linearly separated, i.e., a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the data points into their respective classes.
• Non-Linear SVM:
- Non-linear SVMs can be used to classify data that cannot be separated into two classes by a straight line (in the 2D case).
- By using kernel functions, non-linear SVMs can handle non-linearly separable data.
Types of SVM
Support Vector Machine
1-Dimensional Data Transformation
The kernel trick uses the kernel function to work with non-linearly separable data.
A polynomial kernel with degree 2 has been applied to transform the data from 1-dimensional to 2-dimensional.
Support Vector Machine
2-Dimensional Data Transformation
In the 2-dimensional case, the kernel trick is applied similarly with a polynomial kernel of degree 2.
The observations can be classified successfully using a linear plane after projecting the data into the higher-dimensional space.
Support Vector Machine - Kernel Trick
SVM algorithms use a set of mathematical functions defined as kernels. The function of a kernel is to take data as input and transform it into the required form.
Firstly, a kernel takes the data from its original space and implicitly maps it to a higher-dimensional space. This is crucial when dealing with data that is not linearly separable in its original form.
Instead of performing computationally expensive high-dimensional calculations, the kernel function calculates the relationships or similarities between pairs of data points as if they were in this higher-dimensional space.
Support Vector Machine - Kernel Trick
Numerical Example of how a Kernel Function works

| Feature (x) | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| x²          | 36 | 25 | 16 | 9  | 4  | 1  | 0 | 1 | 4 | 9 | 16 | 25 | 36 |
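A minimal sketch reproducing this degree-2 mapping (the slide gives only x and x², no class labels, so only the mapping itself is shown):

```python
# Minimal sketch of the table above: the degree-2 feature map x -> x^2. The
# slide gives only x and x^2 (no class labels), so this just reproduces the
# mapping and notes why it helps.
import numpy as np

x = np.arange(-6, 7)       # -6, -5, ..., 6
x_squared = x ** 2

for xi, zi in zip(x, x_squared):
    print(f"x = {xi:3d} -> x^2 = {zi:3d}")

# In the (x, x^2) plane a threshold on x^2 is a straight horizontal line, so
# points that are only separable by |x| in 1-D become linearly separable.
```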
Kernel Functions
Let κ(x, x') ≥ 0 be some measure of similarity between objects x, x' ∈ X, where X is some abstract space; we will call κ a kernel function.
We define a kernel function to be a real-valued function of two arguments, κ(x, x') ∈ R, for x, x' ∈ X. Typically the function is symmetric (i.e., κ(x, x') = κ(x', x)) and non-negative (i.e., κ(x, x') ≥ 0).
Linear Kernel:
If we let φ(x) = x, we get the linear kernel, defined by just the dot product between the two object vectors: κ(x, x') = xᵀx'.
This is useful if the original data is already high dimensional and the original features are individually informative,
• e.g., a bag-of-words representation where the vocabulary size is large, or the expression levels of many genes.
• In such a case, the decision boundary is likely to be representable as a linear combination of the original features, so it is not necessary to work in some other feature space.
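A minimal sketch of the linear kernel described above, alongside the standard polynomial and RBF kernels for comparison (the input vectors are made up):

```python
# Minimal sketch: the linear kernel described above (phi(x) = x, so
# kappa(x, x') = x . x'), plus the standard polynomial and RBF kernels for
# comparison; the inputs a, b are made up.
import numpy as np

def linear_kernel(x, x2):
    return np.dot(x, x2)

def polynomial_kernel(x, x2, degree=2, c=1.0):
    return (np.dot(x, x2) + c) ** degree

def rbf_kernel(x, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x - x2) ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])
print("linear:    ", linear_kernel(a, b))      # 1*2 + 2*0.5 = 3.0
print("polynomial:", polynomial_kernel(a, b))  # (3 + 1)^2 = 16.0
print("RBF:       ", rbf_kernel(a, b))         # exp(-0.5 * 3.25) ≈ 0.197
```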
Kernel Functions
