UNIT II
Supervised Learning
Syllabus
Linear Regression Models : Least squares, single & multiple variables, Bayesian linear regression,
Gradient descent, Linear Classification Models : Discriminant function - Perceptron algorithm,
Probabilistic discriminative model - Logistic regression, Probabilistic generative model - Naive
Bayes, Maximum margin classifier - Support vector machine, Decision Tree, Random Forests
Contents
2.1. Regression
2.2. Linear Classification Models
2.3 Probabilistic Generative Model
2.4 Maximum Margin Classifier : Support Vector Machine
2.5 Decision Tree
2.6 Random Forests
2.7 Two Marks Questions with Answers
2.1 Regression
* Regression finds correlations between dependent and independent variables. If the desired output consists of one or more continuous variables, then the task is called regression.
* Therefore, regression algorithms help predict continuous variables such as house prices, market trends, weather patterns, oil and gas prices etc.
* Fig. 2.1.1 shows regression : the dependent variable plotted against the independent variable with a fitted line.
* When the targets in a dataset are real numbers, the machine learning task is known as regression and each sample in the dataset has a real-valued output or target.
* Regression analysis is a set of statistical methods used for the estimation of
relationships between a dependent variable and one or more independent
variables. It can be utilized to assess the strength of the relationship between
variables and for modelling the future relationship between them.
+ The two basic types of regression are linear regression and multiple linear
regression.
2.1.1 Linear Regression Models
. Linear regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables.
* The objective of a linear regression model is to find a relationship between the input variables and a target variable.
1. One variable, denoted x, is regarded as the predictor, explanatory or independent variable.
2. The other variable, denoted y, is regarded as the response, outcome or dependent variable.
* Regression models predict a continuous variable, such as the sales made on a day, or the temperature of a city. Let's imagine that we fit a line with the training points that we have. If we want to add another data point, then to fit it we need to change the existing model. This will happen with each data point that we add to the model; hence, linear regression isn't good for classification models.
* The regression line gives the average relationship between the two variables in mathematical form.
* For two variables X and Y, there are always two lines of regression.
* Regression line of X on Y : Gives the best estimate for the value of X for any specific given value of Y :
X = a + bY
where a = X-intercept
b = Slope of the line
X = Dependent variable
Y = Independent variable
* Regression line of Y on X : Gives the best estimate for the value of Y for any specific given value of X :
Y = a + bX
where a = Y-intercept
b = Slope of the line
Y = Dependent variable
X = Independent variable
* By using the least squares method (a procedure that minimizes the vertical deviations of plotted points surrounding a straight line) we are able to construct a best fitting straight line to the scatter diagram points and then formulate a regression equation in the form of :

ŷ = a + bx
ŷ = ȳ + b(x − x̄)

Fig. 2.1.2 shows the linear model f(x, w) with a bias term and an input vector.

* Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear regression model, the variable of interest ("dependent" variable) is predicted from k other variables ("independent" variables) using a linear equation. If Y denotes the dependent variable and X1, ..., Xk are the independent variables, then the assumption is that the value of Y at time t in the data sample is determined by the linear equation :

Yt = β0 + β1 X1t + β2 X2t + ... + βk Xkt + εt

where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.

* At each split point, the "error" between the predicted value and the actual values is squared to get a "Sum of Squared Errors (SSE)". The split point errors across the variables are compared and the variable/point yielding the lowest SSE is chosen as the root node/split point. This process is recursively continued.
« Error function measures how much our predictions deviate from the desired
answers.
Mean-squared error : Jn = (1/n) Σ i=1..n (yi − f(xi))²
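As a rough illustration (not from the text), the following sketch fits a line y = a + bx by least squares with NumPy and evaluates this mean-squared error on the training points; the data values are invented.

```python
import numpy as np

# Invented data: e.g. house sizes (x) and prices (y).
x = np.array([50.0, 70.0, 90.0, 120.0, 150.0])
y = np.array([110.0, 150.0, 180.0, 240.0, 310.0])

# Least squares fit of y = a + b*x using the design matrix [1, x].
A = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

y_pred = a + b * x
mse = np.mean((y - y_pred) ** 2)   # J_n = (1/n) * sum (y_i - f(x_i))^2
print(f"a = {a:.3f}, b = {b:.3f}, MSE = {mse:.3f}")
```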
Advantages :
a. Training a linear regression model is usually much faster than methods such as
neural networks.
b. Linear regression models are simple and require minimum memory to implement.
c. By examining the magnitude and sign of the regression coefficients you can infer
how predictor variables affect the target outcome.
2.1.2 Least Squares
+ The method of least squares is about estimating parameters by minimizing the
squared discrepancies between observed data, on the one hand, and their expected
values on the other.
* Consider an arbitrary straight line, ŷ = b0 + b1x, to be fitted through these data points. The question is : which line is the most "representative" ?
* What are the values of b0 and b1 such that the resulting line "best" fits the data points ? Fig. 2.1.3 shows the fitted line ŷ = b0 + b1x and the residual (error) yi − ŷi. But what goodness-of-fit criterion to use to determine among all possible combinations of b0 and b1 ?
* The Least Squares (LS) criterion : the sum of the squares of the errors is minimum. The least squares solution yields y(x) whose elements sum to 1, but it does not ensure the outputs to be in the range [0, 1].
* How to draw such a line based on data points observed ? Suppose an imaginary line of y = a + bx. Imagine a vertical distance (error) between the line and a data point : E = Y − E(Y), where E(Y) = a + bX (Fig. 2.1.4).
* This error is the deviation of the data point from the imaginary line, the regression line. Then what are the best values of a and b ? The a and b that minimize the sum of such errors.
* Deviation does not have good properties for computation. Then why do we use squares of deviation ? Let us get a and b that can minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.
* Least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e. estimators of parameters α and β.
* The process of getting parameter estimators (e.g., a and b) is called estimation. The least squares method is the estimation method of Ordinary Least Squares (OLS).
Disadvantages of least squares :
1. Lack of robustness to outliers.
2. Certain datasets are unsuitable for least squares classification.
3. The decision boundary corresponds to the ML (maximum likelihood) solution.
Example : Fit a straight line to the points in the table. Compute m and b by least squares.
Points    x       y
A         3.00    4.50
B         4.25    4.25
C         5.50    5.50
D         8.00    5.50
Solution : Represent in matrix form, A X = L + V :

[3.00  1]          [4.50]   [vA]
[4.25  1]  [m]  =  [4.25] + [vB]
[5.50  1]  [b]     [5.50]   [vC]
[8.00  1]          [5.50]   [vD]

X = (A^T A)^(-1) A^T L

  = [121.3125   20.7500]^(-1)  [105.8125]   [0.246]
    [ 20.7500    4.0000]       [ 19.7500] = [3.663]

V = A X - L

  = [3.00  1]  [0.246]    [4.50]   [-0.10]
    [4.25  1]  [3.663] -  [4.25] = [ 0.46]
    [5.50  1]             [5.50]   [-0.48]
    [8.00  1]             [5.50]   [ 0.13]
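To check the arithmetic, a small NumPy sketch (not part of the original solution) can solve the same normal equations; it reproduces m ≈ 0.246 and b ≈ 3.663 and, when the rounded coefficients are used, the residual vector quoted above.

```python
import numpy as np

x = np.array([3.00, 4.25, 5.50, 8.00])
L = np.array([4.50, 4.25, 5.50, 5.50])

A = np.column_stack([x, np.ones_like(x)])   # columns correspond to [m, b]
X = np.linalg.solve(A.T @ A, A.T @ L)       # X = (A^T A)^-1 A^T L
print("m, b =", np.round(X, 3))             # ~ [0.246, 3.663]

V = A @ np.round(X, 3) - L                  # residuals from the rounded X, as in the text
print("V    =", np.round(V, 2))             # ~ [-0.10, 0.46, -0.48, 0.13]
```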
2.1.3 Multiple Regression
+ Regression analysis is used to predict the value of one or more responses from a
set of predictors. It can also be used to estimate the linear association between the
predictors and responses. Predictors can be continuous or categorical or a mixture
of both.
* If multiple independent variables affect the response variable, then the analysis calls for a model different from that used for the single predictor variable. In a situation where more than one independent factor (variable) affects the outcome of a process, a multiple regression model is used. This is referred to as a multiple linear regression model or multivariate least squares fitting.
* Let Z1, ..., Zr be a set of r predictors believed to be related to a response variable Y. The linear regression model for the jth sample unit has the form
Yj = β0 + β1 Zj1 + β2 Zj2 + ... + βr Zjr + εj
where εj is a random error and βi, i = 0, 1, ..., r are unknown regression coefficients.
* With n independent observations, we can write one model for each sample unit and stack them, so that in matrix notation
Y = Z β + ε
where Y is n × 1, Z is n × (r+1), β is (r+1) × 1 and ε is n × 1.
* In order to estimate β, we take a least squares approach that is analogous to what we did in the simple linear regression case.
* In matrix form, we can arrange the data in the following form :

    [1  x11  x12  ...  x1K]         [y1]          [β̂0]
Z = [1  x21  x22  ...  x2K]     y = [y2]     β̂ =  [β̂1]
    [⋮                     ]         [⋮ ]          [⋮ ]
    [1  xN1  xN2  ...  xNK]         [yN]          [β̂K]

where β̂j are the estimates of the regression coefficients.
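As an illustrative sketch (not from the text), the estimates β̂ can be obtained with NumPy's least squares solver applied to the design matrix Z; the two-predictor data below are invented.

```python
import numpy as np

# Invented data with two predictors z1, z2 and a response y.
z1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
z2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([5.1, 5.9, 9.2, 9.8, 13.1, 13.9])

Z = np.column_stack([np.ones_like(z1), z1, z2])    # n x (r+1) design matrix
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)   # estimates [b0, b1, b2]
print("beta_hat =", np.round(beta_hat, 3))
```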
Difference between Simple Regression and Multiple Regression
Simple regression :
- One dependent variable Y predicted from one independent variable X.
- One regression coefficient.
- r² : Proportion of variation in dependent variable Y predictable from X.

Multiple regression :
- One dependent variable Y predicted from a set of independent variables (X1, X2, ..., Xk).
- One regression coefficient for each independent variable.
- R² : Proportion of variation in dependent variable Y predictable by the set of independent variables (X's).
2.1.4 Bayesian Linear Regression
* Bayesian linear regression allows a useful mechanism to deal with insufficient data, or poorly distributed data. It allows the user to put a prior on the coefficients and on the noise so that in the absence of data, the priors can take over. A prior is a distribution on a parameter.
* If we could flip the coin an infinite number of times, inferring its bias would be easy by the law of large numbers. However, what if we could only flip the coin a handful of times ? Would we guess that a coin is biased if we saw three heads in three flips, an event that happens one out of eight times with unbiased coins ? The MLE would overfit these data, inferring a coin bias of p = 1.
* A Bayesian approach avoids overfitting by quantifying our prior knowledge that most coins are unbiased, i.e. that the prior on the bias parameter is peaked around one-half. The data must overwhelm this prior belief about coins.
* Bayesian methods allow us to estimate model parameters, to calculate explicit probabilities for hypotheses, to make forecasts and to conduct model comparisons.
* Bayesian classifiers use the simple idea that the training data are utilized to calculate an observed probability of each class based on feature values.
* When a Bayesian classifier is used for unclassified data, it uses the observed probabilities to predict the most likely class for the new features.
« Each observed training example can incrementally decrease or increase the
estimated probability that a hypothesis is correct.
+ Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting a prior probability for each candidate hypothesis and a probability
distribution over observed data for each possible hypothesis.
* Bayesian methods can accommodate hypotheses that make probabilistic
predictions. New instances can be classified by combining the predictions of
multiple hypotheses, weighted by their probabilities.
* Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
* Uses of Bayesian classifiers are as follows :
1. Used in text-based classification for finding spam or junk mail filtering.
2. Medical diagnosis.
3. Network security, such as detecting illegal intrusion.
* The basic procedure for implementing Bayesian linear regression is as follows (a small code sketch is given after the list) :
i) Specify priors for the model parameters.
ii) Create a model mapping the training inputs to the training outputs.
iii) Have a Markov Chain Monte Carlo (MCMC) algorithm draw samples from the posterior distributions for the parameters.
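A minimal illustrative sketch is given below. It is not the textbook's MCMC procedure: it assumes a conjugate Gaussian prior on the weights and a known noise level (the precisions alpha and beta are made-up values), for which the posterior over the weights has a closed form; the data are synthetic.

```python
import numpy as np

alpha, beta = 2.0, 25.0            # assumed prior precision and noise precision

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = 0.5 + 1.5 * x + rng.normal(scale=0.2, size=30)   # synthetic data

Phi = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]

# Posterior over weights w ~ N(m_N, S_N) under the Gaussian prior N(0, alpha^-1 I).
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y
print("posterior mean [intercept, slope]:", np.round(m_N, 3))

# Predictive mean and variance at a new input x* = 0.3.
phi_star = np.array([1.0, 0.3])
pred_mean = phi_star @ m_N
pred_var = 1.0 / beta + phi_star @ S_N @ phi_star
print("predictive mean/variance:", round(pred_mean, 3), round(pred_var, 4))
```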
2.1.5 Gradient Descent
* Goal : Solving nonlinear minimization problems through derivative information.
* First and second derivatives of the objective function or the constraints play an important role in optimization. The first order derivatives are called the gradient and the second order derivatives are called the Hessian matrix.
* Derivative based optimization is also called nonlinear optimization. It is capable of determining search directions according to an objective function's derivative information.
Derivative based optimization methods are used for :
1. Optimization of nonlinear neuro-fuzzy models
2. Neural network learning
3. Regression analysis in nonlinear models
Basic descent methods are as follows :
1. Steepest descent
2. Newton-Raphson method
Gradient Descent :
* Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
* Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle black box functions, and each iteration is cheap.
* Given a differentiable scalar field f(x) and an initial guess x1, gradient descent iteratively moves the guess toward lower values of f by taking steps in the direction of the negative gradient −∇f(x).
* Locally, the negated gradient is the steepest descent direction, i.e., the direction in which x would need to move in order to decrease f the fastest. The algorithm typically converges to a local minimum, but may rarely reach a saddle point, or not move at all if x1 lies at a local maximum.
* The gradient gives the slope of the curve at that x and its direction points toward an increase in the function. So we change x in the opposite direction to lower the function value :
x_{k+1} = x_k − λ ∇f(x_k)
The λ > 0 is a small number that forces the algorithm to make small jumps.
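A small illustrative sketch of this update rule (not from the text), applied to the simple quadratic f(x) = (x1 − 3)² + 10·x2² with an assumed fixed step size λ:

```python
import numpy as np

def f(x):
    return (x[0] - 3.0) ** 2 + 10.0 * x[1] ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 3.0), 20.0 * x[1]])

x = np.array([0.0, 2.0])      # initial guess x_1
lam = 0.05                    # step size lambda > 0 (assumed)

for k in range(200):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-8:      # stop when the gradient (almost) vanishes
        break
    x = x - lam * g                   # x_{k+1} = x_k - lambda * grad f(x_k)

print("minimum found near", np.round(x, 4), "f =", round(f(x), 6))
```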
Limitations of Gradient Descent :
* Gradient descent is relatively slow close to the minimum : technically, its asymptotic rate of convergence is inferior to many other methods.
* For poorly conditioned convex problems, gradient descent increasingly 'zigzags' as the gradients point nearly orthogonally to the shortest direction to a minimum point.
Steepest Descent :
* Steepest descent is also known as the gradient method.
* This method is based on a first order Taylor series approximation of the objective function. It is also called the saddle point method. Fig. 2.1.5 shows the steepest descent method.
* The Steepest Descent method is the simplest of the gradient methods. The choice of direction is where f decreases most quickly, which is the direction opposite to ∇f(x). The search starts at an arbitrary point x0 and then goes down the gradient until it reaches close to the solution.
* The method of steepest descent is the discrete analogue of gradient descent, but the best move is computed using a local minimization rather than computing a gradient. It is typically able to converge in few steps, but it is unable to escape local minima or plateaus in the objective function.
* The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is always orthogonal to the previous step direction.
Consequently, the iterates tend to zig-zag down the valley in a very inefficient
manner.
* The method of Steepest Descent is simple, easy to apply, and each iteration is fast. It is also very stable; if the minimum points exist, the method is guaranteed to locate them, though it may take a very large number of iterations.
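As a rough sketch of the line-minimization behaviour described above (not from the text), consider a poorly conditioned quadratic f(x) = ½xᵀAx, for which the exact minimizing step along −g is α = gᵀg / gᵀAg; printing the iterates shows the characteristic zig-zag:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 25.0]])          # poorly conditioned quadratic f(x) = 0.5 x^T A x
x = np.array([10.0, 1.0])            # arbitrary starting point x0

for k in range(10):
    g = A @ x                        # gradient of f at x
    if np.linalg.norm(g) < 1e-10:
        break
    alpha = (g @ g) / (g @ A @ g)    # exact line minimization along -g
    x = x - alpha * g
    print(k, np.round(x, 4))         # successive iterates zig-zag toward the minimum (0, 0)
```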
2.2 Linear Classification Models
+ A classification algorithm (Classifier) that makes its classification based on a linear
predictor function combining a set of weights with the feature vector.
* A linear classifier makes its classification decision based on the value of a linear combination of the characteristics. Imagine that the linear classifier will merge into its weights all the characteristics that define a particular class.
+ Linear classifiers can represent a lot of things, but they can't represent everything.
The classic example of what they can't represent is the XOR function.
2.2.1 Discriminant Function
* Linear Discriminant Analysis (LDA) is the most commonly used dimensionality reduction technique in supervised learning. Basically, it is a preprocessing step for pattern classification and machine learning applications. LDA is a powerful algorithm that can be used to determine the best separation between two or more classes.
+ LDA is a supervised learning algorithm, which means that it requires a labelled
training set of data points in order to learn the linear discriminant function.
* The main purpose of LDA is to find the line or plane that best separates data
points belonging to different classes. The key idea behind LDA is that the decision
boundary should be chosen such that it maximizes the distance between the
means of the two classes while simultaneously minimizing the variance within
each class's data or within-class scatter. This criterion is known as the Fisher
criterion.
* LDA is one of the most widely used machine learning algorithms due to its accuracy and flexibility. LDA can be used for a variety of tasks such as classification, dimensionality reduction and feature selection.
2-12 oe
Machine Learning :
ify them efficiently, then ye;
classes and we need to classify # a
two
* Suppose we have "
an classes are divided as follows : :
Before LDA After LDA
Fig. 2.2.1 LDA
* The LDA algorithm works based on the following steps (a brief usage sketch follows the list) :
a) The first step is to calculate the mean and standard deviation of each feature.
b) The within-class scatter matrix and between-class scatter matrix are calculated.
c) These matrices are then used to calculate the eigenvectors and eigenvalues.
d) LDA chooses the k eigenvectors with the largest eigenvalues to form a transformation matrix.
e) LDA uses this transformation matrix to transform the data into a new space with k dimensions.
f) Once the transformation matrix transforms the data into the new space with k dimensions, LDA can then be used for classification or dimensionality reduction.
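A brief usage sketch (not from the text) relying on scikit-learn's LinearDiscriminantAnalysis, which carries out these steps internally; the toy data points are invented.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Invented 2-D points for two classes.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis(n_components=1)   # project onto k = 1 dimension
X_new = lda.fit_transform(X, y)                    # transformed (reduced) data
print("projected data:", X_new.ravel())
print("predicted classes:", lda.predict([[2.0, 2.0], [6.8, 7.1]]))
```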
Benefits of using LDA :
a) LDA is used for classification problems.
b) LDA is a powerful tool for dimensionality reduction.
c) LDA is not susceptible to the "curse of dimensionality" like many other machine learning algorithms.
2.2.2 Logistic Regression
* Logistic regression is a form of regression analysis in which the outcome variable is dichotomous. It is a statistical method used to model dichotomous or binary outcomes using predictor variables.
* Logistic component : Instead of modeling the outcome, Y, directly, the method models the log odds of the outcome using the logistic function.
* Regression component : Methods used to quantify the association between an outcome and predictor variables. It could be used to build predictive models as a function of predictors.
* Simple logistic regression is logistic regression with 1 predictor variable.
Logistic Regression :
ln( P / (1 − P) ) = β0 + β1X1 + β2X2 + ... + βkXk + ε
With logistic regression, the response variable is an indicator of some
characteristic, that is, a 0/1 variable. Logistic regression is used to determine
whether other measurements are related to the presence of some characteristic, for
example, whether certain blood measures are predictive of having a disease.
* If analysis of covariance can be said to be a t test adjusted for other variables, then logistic regression can be thought of as a chi-square test for homogeneity of proportions adjusted for other variables. While the response variable in a logistic regression is a 0/1 variable, the logistic regression equation, which is a linear equation, does not predict the 0/1 variable itself.
* Fig. 2.2.2 shows the sigmoid curve for logistic regression (compared with the linear model).
* The linear and logistic probability models are :
Linear Regression :
p = a0 + a1X1 + a2X2 + ... + akXk
Logistic Regression :
ln[ p / (1 − p) ] = b0 + b1X1 + b2X2 + ... + bkXk
* The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1 − p) is a linear function of the regressors.
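A short illustrative sketch (not from the text): fitting a logistic regression with scikit-learn on invented one-predictor data and recovering the predicted probability from the fitted log odds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours studied (X) vs. pass/fail outcome (0/1).
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0, 0]

# The model fits the log odds ln(p / (1 - p)) = b0 + b1*x,
# so p = 1 / (1 + exp(-(b0 + b1*x))).
x_new = 2.2
p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x_new)))
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, P(pass | x = 2.2) = {p:.3f}")
```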
* The major advantage of the linear model is its interpretability. In the linear model, if a1 is 0.05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase in the probability that Y is 1.