SUBJECT CODE: 410242
Choice Based Credit System
SAVITRIBAI PHULE PUNE UNIVERSITY - 2019 SYLLABUS
B.E. (Computer) Semester - VII
MACHINE LEARNING
(For END SEM Exam - 70 Marks)
Iresh A. Dhotre
M.E. (Information Technology)
Ex-Faculty, Sinhgad College of Engineering,
Pune.
• Written by Popular Authors of Text Books of Technical Publications
• Covers Entire Syllabus in Question - Answer Format
• Exact Answers and Solutions
• Solved Model Question Paper (As Per 2019 Pattern)
SOLVED SPPU QUESTION PAPERS
• March - 2019   • June - 2022
A Guide For Engineering Students
MACHINE LEARNING
(For END SEM Exam - 70 Marks)
SUBJECT CODE : 410242
B.E. (Computer Engineering) Semester - VII
© Copyright with Technical Publications
All publishing rights (printed and ebook version) reserved with Technical Publications.
No part of this book should be reproduced in any form, Electronic, Mechanical, Photocopy
or any information storage and retrieval system without prior permission in writing,
from Technical Publications, Pune.
Published by :
TECHNICAL PUBLICATIONS (An Up-Thrust for Knowledge)
Amit Residency, Office No. 1, 412, Shaniwar Peth, Pune - 411030, M.S. INDIA
Ph.: +91-020-24495496/97
Email : [email protected]
Website : www.technicalpublications.in
Printer :
Yosisj Printers & Binders, Sr.No. 10/1A,Ghule Industral Estate, Nanded Village Road,
Tal. - Havel, Dist. - Pune - 411041
ISBN 978-93-5585-241-0
SYLLABUS
Machine Learning - (410242)
Credit : 03        Examination Scheme : End-Sem (Paper) : 70 Marks
Unit III Supervised Learning : Regression
Bias, Variance, Generalization, Underfitting, Overfitting, Linear regression,
Regression : Lasso regression, Ridge regression, Gradient descent algorithm.
Evaluation Metrics : MAE, RMSE, R2 (Chapter - 3)
Unit IV Supervised Learning : Classification
Classification : K-nearest neighbour, Support vector machine. Ensemble Learning : Bagging, Boosting, Random Forest, Adaboost.
Binary-vs-Multiclass Classification, Balanced and Imbalanced Multiclass Classification Problems, Variants of Multiclass Classification : One-vs-One and One-vs-All.
Evaluation Metrics and Score : Accuracy, Precision, Recall, F-score, Cross-validation, Micro-Average Precision and Recall, Micro-Average F-score, Macro-Average Precision and Recall, Macro-Average F-score. (Chapter - 4)
Unit V__ Unsupervised Learning
K-Means, K-medoids, Hierarchical, and Density-based Clustering, Spectral
Clustering. Outlier analysis: introduction of isolation factor, local outlier factor.
Evaluation metrics and score : elbow method, extrinsic and intrinsic methods
(Chapter - 5)
Unit VI Introduction To Neural Networks
Artificial Neural Networks : Single Layer Neural Network, Multilayer Perceptron,
Back Propagation Learning, Functional Link Artificial Neural Network, and Radial
Basis Function Network, Activation functions,
Introduction to Recurrent Neural Networks and Convolutional Neural Networks
(Chapter - 6)
TABLE OF CONTENTS
Chapter - 3   Supervised Learning : Regression   (3 - 1) to (3 - 26)
3.1. Bias and Variance...
3.2 Underfitting and Overfitting...
3.3 Linear Regression ...
3.4. Regression : Lasso Regression, Ridge Regression...
3.5 Gradient Descent Algorithm...
3,6 Evaluation Metrics : MAE, RMSE, R2
Chapter - 4   Supervised Learning : Classification
4.1 Classification : K-Nearest Neighbour
4.2 Support Vector Machine
4.3 Ensemble Learning : Bagging, Boosting, Random Forest, Adaboost
4.4 Binary-vs-Multiclass Classification
4.5 Variants of Multiclass Classification : One-vs-One and One-vs-All
4.6 Evaluation Metrics and Score
4.7 Cross-Validation
4.8 Micro-Average
4.9 Macro-Average
Chapter - 5   Unsupervised Learning   (5 - 1) to (5 - 32)
5.1 Introduction to Clustering...
5.2 K-Means and K-Medoids
5.3. Hierarchical Clustering ....
5.4. Density-Based Clustering...
5.5. Outlier Analysis.
5.6 Evaluation Metrics and Score..
Chapter - 6   Introduction to Neural Networks   (6 - 1) to (6 - 28)
6.1 Artificial Neural Networks
6.2 Multilayer Perceptron
6.3 Back Propagation Learning
6.4 Functional Link Artificial Neural Network
6.5 Radial Basis Function Network
6.6 Activation Functions
6.7 Introduction to Recurrent Neural Networks
6.8 Convolutional Neural Networks
Solved Model Question Paper (M - 1) to (M - 3)
Supervised Learning : Regression
3.1 : Bias and Variance
Q.1 What is bias in machine learning ?
Ans. : • Bias is a phenomenon that skews the result of an algorithm in favour of or against an idea.
• Bias is considered a systematic error that occurs in the machine learning model itself due to incorrect assumptions in the ML process.
• Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. Fig. Q.1.1 shows bias.
Fig. Q.1.1 : Error vs. model complexity for the training sample and the test sample (high bias / low variance at low complexity, low bias / high variance at high complexity)
e Bias and variance are components of reducible error. Reducing errors
requires selecting models that have appropriate complexity and
flexibility, as well as suitable training data.
• Low bias : A low bias model makes fewer assumptions about the form of the target function.
• High bias : A model with high bias makes more assumptions and becomes unable to capture the important features of the dataset. A high bias model also cannot perform well on new data.
• Examples of machine learning algorithms with low bias are decision trees, k-nearest neighbours and support vector machines.
• Algorithms with high bias are linear regression, linear discriminant analysis and logistic regression.
Q.2 How to reduce the high bias ?
Ans. : • If the average predicted values are far off from the actual values, then the bias is high. High bias causes the algorithm to miss the relevant relationships between the input and output variables.
• When a model has high bias, it implies that the model is too simple and does not capture the complexity of the data, thus underfitting the data.
• Low variance means there is a small variation in the prediction of the target function with changes in the training data set. At the same time, high variance shows a large variation in the prediction of the target function with changes in the training dataset.
• High bias can be identified when we have high training error and the validation or test error is the same as the training error.
¢ Following methods are used to reduce high bias :
1. Increase the input features as the model is underfitted.
2. Decrease the regularization term.
3. Use more complex models, such as including some polynomial
features.
Q.3 Define variance. Explain low and high variance ? How to reduce
high variance ?
‘Ans. © Variance indicates how much the estimate of the target function
will alter if different training data were used. In other words, variance
describes how much a random variable differs from its expected value.
© Variance is based on a single training set. Variance measures the
inconsistency of different predictions using different training sets, it's
not a measure of overall accuracy.
« Low variance means there is a small variation in the prediction of the
target function with changes in the training data set. High variance
shows a large variation in the prediction of the target function with
changes in the training dataset.
© Variance comes from highly complex models with a large number of
features.
1. Models with high bias will have low variance.
2. Models with high variance will have a low bias.
« Following methods are used to reduce high variance :
1. Reduce the input features or number of parameters as a model is
overfitted.
2. Do not use a much complex model.
3, Increase the training data,
4, Increase the regularization term.
Q4 Explain bias-variance trade off.
Ans.: ¢ In the experimental practice we observe an important
phenomenon called the bias variance dilemma.
© In supervised learning, the class value assigned by the learning model
built based on the training data may differ from the actual class value.
This error in learning can be of two types, errors due to ‘bias’ and error
due to ‘variance’.
1. Low-bias, low-variance : The combination of low bias and low variance shows an ideal machine learning model. However, it is practically not possible.
2. Low-bias, high-variance : With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters and hence leads to overfitting.
3. High-bias, low-variance : With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses only a few parameters. It leads to underfitting problems in the model.
4. High-bias, high-variance : With high bias and high variance, predictions are inconsistent and also inaccurate on average.
Q.5 What is difference between bias and variance ?
Ans. :
Sr. No. | Bias | Variance
1. | Bias is the difference between the average prediction and the correct value. | Variance is the amount by which the prediction will change if different training data sets were used.
2. | The model is incapable of locating patterns in the dataset it was trained on, and it produces inaccurate results for both seen and unseen data. | The model recognizes the majority of the dataset's patterns and can even learn from the noise or data that is not vital to its operation.
3. | Low bias models : k-nearest neighbours, decision trees and support vector machines. | Low variance models : linear regression and logistic regression.
4. | High bias models : linear regression and logistic regression. | High variance models : k-nearest neighbours, decision trees and support vector machines.
3.2 : Underfitting and Overfitting
Q.6 What is overfitting and underfitting in a machine learning model ? Explain with example.  [SPPU : June-22, Marks 6]
Ans. : • Fig. Q.6.1 shows underfitting and overfitting.
Fig. Q.6.1 (a) Underfitting   (b) Overfitting (values plotted against time)
• Underfitting occurs when the model is unable to match the input data to the target data. This happens when the model is not complex enough to match all the available data, and it performs poorly even on the training dataset.
• These kinds of models are too simple to capture the complex patterns in the data, e.g. linear and logistic regression.
Underfitting examples :
1. The learning time may be prohibitively large and the learning stage was prematurely terminated.
2. The learner did not use a sufficient number of iterations.
3. The learner tries to fit a straight line to a training set whose examples exhibit a quadratic nature.
• Overfitting relates to instances where the model tries to match non-existent data. This occurs when dealing with highly complex models where the model will match almost all the given data points and perform well on the training dataset. However, the model would not be able to generalize to the data points in the test data set and predict the outcome accurately.
• These models have low bias and high variance. Such models are very complex, like decision trees, which are prone to overfitting.
• Reasons for overfitting are noisy data, a training data set that is too small and a large number of features.
Q.7 How to avoid overfitting and underfitting model ?
Ans. : 1. Following methods are used to avoid overfitting :
• Cross validation
• Training with more data
• Removing features
• Early stopping of the training
• Regularization
• Ensembling
2. Following methods are used to avoid underfitting :
• By increasing the training time of the model.
• By increasing the number of features.
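Two of the remedies above (cross-validation and regularization) can be illustrated with a few lines of code. This is a minimal sketch, not part of the text; it assumes scikit-learn and NumPy are available and uses synthetic data.

# A minimal sketch showing cross-validation and L2 regularization (Ridge)
# as ways to detect and reduce overfitting, on synthetic regression data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=60, n_features=30, noise=15.0, random_state=0)

plain = LinearRegression()        # many features, few samples : prone to overfitting
regularized = Ridge(alpha=10.0)   # L2 penalty shrinks the coefficients

for name, model in [("plain", plain), ("ridge", regularized)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, "mean CV R2 =", scores.mean().round(3))

A higher cross-validated score for the regularized model is a typical sign that the plain model was overfitting.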
Q.8 How do we know if we are underfitting or overfitting ?
Ans. :
1. If by increasing capacity we decrease the generalization error, then we are underfitting; otherwise we are overfitting.
2. If the error in representing the training set is relatively large and the generalization error is large, then the model is underfitting.
3. If the error in representing the training set is relatively small and the generalization error is large, then the model is overfitting.
4. Overfitting is also likely when there are many features but a relatively small training set.
Q.9 Explain the Fig. Q.9.1 (a), (b) and (c).
Fig. Q.9.1 : Price vs. size fitted with models of increasing complexity - (a), (b) and (c)
Ans. : • The given Fig. Q.9.1 is related to overfitting and underfitting.
Underfitting (high bias and low variance) :
• A statistical model or a machine learning algorithm is said to be underfitting when it cannot capture the underlying trend of the data.
• It usually happens when we have too little data to build an accurate model and also when we try to build a linear model with non-linear data.
Fig. Q.9.2 : Price vs. size - (a) θ0 + θ1x : high bias (underfit), (b) θ0 + θ1x + θ2x² : good fit, (c) θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴ : high variance (overfit)
In such cases the rules of the machine learning model are too easy and
flexible to be applied on such minimal data and therefore the model
will probably make a lot of wrong predictions.
© Underfitting can be avoided by using more data and also reducing the
features by feature selection.
Overfitting (high variance and low bias) :
• A statistical model is said to be overfitted when we train it with a lot of data.
¢ When a model gets trained with so much of data, it starts learning from
the noise and inaccurate data entries in our data set.
© Then the model does not categorize the data correctly, because of too
many details and noise,
© The causes of overfitting are the non-parametric and non-linear methods
because these types of machine learning algorithms have more freedom
in building the model based on the dataset and therefore they can really
build unrealistic models.
® A solution to avoid overfitting is using a linear algorithm if we have
linear data or using the parameters like the maximal depth if we are
using decision trees.
Q.10 Explain difference between overfitting and underfitting.
Ans. :
Sr. No. | Overfitting | Underfitting
1. | Very low training error. | High training error.
2. | Model is too complex. | Model is too simple.
3. | High variance and low bias. | Low variance and high bias.
4. | Smaller quantity of features. | Larger quantity of features.
5. | Performs more regularization. | Performs less regularization.
6. | Training error much lower than the test error. | Training error close to the test error.
Q.11 What is goodness of fit ?
Ans. : • The goodness of fit of a model explains how well it matches a set of observations. Usually, goodness-of-fit indicators summarize the discrepancy between the observed values and the values predicted by the model.
• In a machine learning algorithm, a good fit is when both the training data error and the test data error are minimal. As the algorithm learns, the error in the training data for the model decreases over time, and so does the error on the test dataset.
• If we train for too long, the error on the training dataset may keep decreasing because the model starts overfitting and learning the irrelevant detail and noise in the training dataset. At the same time, the error on the test set begins to rise again as the ability of the model to generalize decreases.
• The error on the test dataset starts increasing, so the point just before this rise is the sweet spot, and we can stop training there to obtain a good model.
3.3 : Linear Regression
Q.12 Define and explain regression with its model.
Ans. : • Regression finds correlations between dependent and independent variables. If the desired output consists of one or more continuous variables, then the task is called regression.
• Therefore, regression algorithms help predict continuous variables such as house prices, market trends, weather patterns, oil and gas prices, etc.
• Fig. Q.12.1 shows regression.
Fig. Q.12.1 : Regression - data points and the line of regression (dependent variable plotted against the independent variable)
© When the targets in a dataset are real numbers, the machine leaning
task is known as regression and each sample in the dataset has a
real-valued output or target.
¢ Regression analysis is a set of statistical methods used for the
estimation of relationships between a dependent variable and one or
more independent variables. It can be utilized to assess the strength of
the relationship between variables and for modelling the future
relationship between them.
© The two basic types of regression are linear regression and multiple
linear regression.
Q.13 Explain univariate regression.
Ans. : • Univariate data is the type of data in which the result depends only on one variable. If there is only one input variable, we call it "single variable linear regression" or "univariate linear regression".
• The function that we are trying to develop looks like this :
hθ(x) = θ0 + θ1x
which has the same form as y = mx + b.
• That is because linear regression is essentially the algorithm for finding the line of best fit for a set of data.
• The algorithm finds the values for θ0 and θ1 that best fit the inputs and outputs given to the algorithm. This is called univariate linear regression because the θ parameters only go up to θ1.
• The univariate linear regression algorithm is much simpler than the one for multivariate regression.
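The two parameters θ0 and θ1 can be computed directly with the least-squares formulas. The following is a small illustrative sketch, not from the text; it assumes NumPy and uses made-up data.

# Univariate linear regression h(x) = theta0 + theta1 * x via least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical inputs
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # hypothetical targets

theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()

print("h(x) = %.3f + %.3f * x" % (theta0, theta1))
print("prediction at x = 6 :", round(theta0 + theta1 * 6, 2))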
Q.14 When is it suitable to use linear regression over classification ?
Ans. : © Linear regression is a statistical method that allows us to
summarize and study relationships between two continuous (quantitative)
variables.
© The objective of a linear regression model is to find a relationship
between the input variables and a target variable.
1. One variable, denoted x, is regarded as the predictor, explanatory, op
independent variable.
2. The other variable, denoted y, is regarded as the response, outcome,
or dependent variable. :
• Regression models predict a continuous variable, such as the sales made on a day, or the temperature of a city. Let's imagine that we fit a line to the training points we have. If we want to add another data point, then to fit it we may need to change the existing model.
• This will happen with each data point that we add to the model; hence, linear regression is not good for classification problems.
• Regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.
• Classification predicts categorical labels (classes), while prediction models continuous-valued functions. Classification is considered to be supervised learning.
• Classification classifies data based on the training set and the values in a classifying attribute, and uses it in classifying new data. Prediction models continuous-valued functions, i.e. it predicts unknown or missing values.
Q.15 Why do we need to regularize in regression ? Explain.
Ans. : • A regression model may fail to generalize on unseen data. This can happen when the model tries to accommodate all kinds of variation in the data, including those belonging to both the actual pattern and the noise.
• As a result, the model ends up becoming a complex model having significantly high variance due to overfitting, thereby impacting the model performance (accuracy, precision, recall, etc.) on unseen data.
• Regularization is needed for reducing overfitting in the regression model. Regularization techniques are used to calibrate the coefficients of multi-linear regression models in order to minimize an adjusted (penalized) loss function.
Fig. Q.15.2 : Regression model after regularization - a model having high variance and low bias becomes a model having balanced bias-variance
• Regularization methods provide a means to control the regression coefficients, which can reduce the variance and decrease the out-of-sample error.
• The goal is to reduce the variance while making sure that the model does not become biased (underfitting). After applying the regularization technique, the model shown above could be obtained.
Q.16 Consider following data for 5 students.
Each Xi (i = 1 to 5) represents the score of the i-th student in standard X and the corresponding Yi (i = 1 to 5) represents the score of the i-th student in standard XII.
i) What linear regression equation best predicts the standard XII score ?
ii) Find the regression line that fits best for the given sample data.
iii) How to interpret the regression equation ?
iv) If a student's score is 80 in standard X, then what is his expected score in standard XII ?
Score in X standard (Xi) : 95, 85, 80, 70, 60
Score in XII standard (Yi) : 85, 95, 70, 65, 70
Ans. : To find the regression line we need the mean of the x values (X̄), the mean of the y values (Ȳ), the squared deviations Σ(Xi − X̄)² and the deviation products Σ(Xi − X̄)(Yi − Ȳ).
ΣXi = 390, ΣYi = 385, n = 5
X̄ = 390 / 5 = 78,  Ȳ = 385 / 5 = 77
Xi | Yi | (Xi − X̄) | (Yi − Ȳ) | (Xi − X̄)² | (Yi − Ȳ)² | (Xi − X̄)(Yi − Ȳ)
95 | 85 | 17 | 8 | 289 | 64 | 136
85 | 95 | 7 | 18 | 49 | 324 | 126
80 | 70 | 2 | −7 | 4 | 49 | −14
70 | 65 | −8 | −12 | 64 | 144 | 96
60 | 70 | −18 | −7 | 324 | 49 | 126
Totals : Σ(Xi − X̄)² = 730, Σ(Yi − Ȳ)² = 630, Σ(Xi − X̄)(Yi − Ȳ) = 470
The sample standard deviations are Sx = √[730 / (n − 1)] = √(730 / 4) ≈ 13.51 and Sy = √(630 / 4) ≈ 12.55.
i) The regression equation is a linear equation of the form Y = B0 + B1X.
First, we solve for the regression coefficient (B1) :
B1 = Σ[(Xi − X̄)(Yi − Ȳ)] / Σ(Xi − X̄)² = 470 / 730 ≈ 0.644
Once we know the slope (B1), we can solve for the intercept (B0) :
B0 = Ȳ − B1 X̄ = 77 − 0.644 × 78 ≈ 26.77
Therefore, the regression equation is Y = 26.77 + 0.644 X.
ii) This is the least-squares line that best fits the given sample data, since it minimizes the sum of squared residuals.
iii) Interpretation : each additional mark in standard X is associated with an increase of about 0.644 marks in the predicted standard XII score; B0 ≈ 26.77 is the predicted standard XII score when the standard X score is zero.
iv) For a standard X score of 80, the expected standard XII score is Y = 26.77 + 0.644 × 80 ≈ 78.3.
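The arithmetic above can be cross-checked in a few lines of NumPy. This is an illustrative sketch, not part of the original solution; it simply re-applies the same least-squares formulas to the Q.16 data.

# Cross-check of the Q.16 regression coefficients.
import numpy as np

x = np.array([95.0, 85.0, 80.0, 70.0, 60.0])   # standard X scores
y = np.array([85.0, 95.0, 70.0, 65.0, 70.0])   # standard XII scores

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print("Y = %.2f + %.3f X" % (b0, b1))
print("predicted XII score for X = 80 :", round(b0 + b1 * 80, 1))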
Q.17 Consider following data :
i) Find the values of B0 and B1 w.r.t. the linear regression model which best fits the given data.
ii) Interpret and explain the equation of the regression line.
iii) If a new person rates "Bahubali - Part I" as 3, then predict the rating of the same person for "Bahubali - Part II".
Xi = rating for the movie "Bahubali Part - I" by the i-th person, Yi = rating for the movie "Bahubali Part - II" by the i-th person.
Ans. : From the given data (n = 6 persons) :
ΣXi = 18, ΣYi = 18
X̄ = 18 / 6 = 3,  Ȳ = 18 / 6 = 3
Σ(Xi − X̄)² = 10, Σ(Yi − Ȳ)² = 10, Σ(Xi − X̄)(Yi − Ȳ) = 3
The sample standard deviations are Sx = √(10 / 5) ≈ 1.41 and Sy = √(10 / 5) ≈ 1.41.
i) The regression equation is a linear equation of the form Y = B0 + B1X.
B1 = Σ[(Xi − X̄)(Yi − Ȳ)] / Σ(Xi − X̄)² = 3 / 10 = 0.3
B0 = Ȳ − B1 X̄ = 3 − 0.3 × 3 = 2.1
Therefore, the regression equation is Y = 2.1 + 0.3 X.
ii) Interpretation : a one-point increase in the rating of "Bahubali - Part I" is associated with a 0.3-point increase in the predicted rating of "Bahubali - Part II"; B0 = 2.1 is the predicted rating when X = 0.
iii) If a new person rates "Bahubali - Part I" as 3, the predicted rating for "Bahubali - Part II" is Y = 2.1 + 0.3 × 3 = 3.0.
Q.18 What do you mean by a linear regression ? Which applications are best modeled by linear regression ?  [SPPU : March-19, In Sem, Marks 5]
Ans. : • Best applications of linear regression are as follows :
1. If a company's sales have increased steadily every month for the past few years, then by conducting a linear analysis on the sales data with monthly sales, the company could forecast sales in future months.
months.
2. Linear regression can also be used to analyze the marketing
effectiveness, pricing and promotions on sales of a product.
3. Linear regressions can be used in business to evaluate trends and
make estimates or forecasts.
4. Suppose two campaigns are run on TV and radio in parallel; linear regression can capture the isolated as well as the combined impact of running these ads together.
Also refer Q.12.
3.4 : Regression : Lasso Regression, Ridge Regression
Q.19 What do you mean by logistic regression ? Explain with example.  [SPPU : June-22]
‘Ans. : © Logistic regression is a form of regression analysis in which the
outcome variable is binary or dichotomous. A statistical method used to
model dichotomous or binary outcomes using predictor variables.
« Logistic component : Instead of modeling the outcome, Y, directly, the
method models the log odds (Y) using the logistic function.
¢ Regression component : Methods used to quantify association between
an outcome and predictor variables. It could be used to build predictive
models as a function of predictors.
In simple logistic regression, logistic regression with 1 predictor
variable.
Logistic regression :
ln[p / (1 − p)] = β0 + β1X1 + β2X2 + ... + βkXk
Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
© With logistic regression, the response variable is an indicator of some
characteristic, that is, a 0/1 variable. Logistic regression is used to
determine whether other measurements are related to the presence of
some characteristic, for example, whether certain blood measures are
predictive of having a disease.
© If analysis of covariance can be said to be a t test adjusted for other
variables, then logistic regression can be thought of as a chi-square test
for homogeneity of proportions adjusted for other variables. While the
response variable in a logistic regression is a 0/1 variable, the logistic regression equation, which is a linear equation, does not predict the 0/1 variable itself.
• Fig. Q.19.1 shows the sigmoid curve for logistic regression, compared with a linear fit.
Fig. Q.19.1 : Linear vs. logistic (sigmoid) curve
• The linear and logistic probability models are :
Linear regression : p = a0 + a1X1 + a2X2 + ... + akXk
Logistic regression : ln[p / (1 − p)] = b0 + b1X1 + b2X2 + ... + bkXk
© The linear model assumes that the probability p is a linear function of
the regressors, while the logistic model assumes that the natural log of
the odds p/(1—p) is a linear function of the regressors.
• The major advantage of the linear model is its interpretability. In the linear model, if a1 is 0.05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase in the probability that Y is 1.
• The logistic model is less interpretable. In the logistic model, if b1 is 0.05, that means that a one-unit increase in X1 is associated with a 0.05 increase in the log odds that Y is 1, and log odds are hard to interpret intuitively.
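A small illustrative sketch of fitting a logistic regression for a binary (0/1) outcome is given below. It is not from the text; it assumes scikit-learn is available and uses synthetic data.

# Logistic regression for a binary outcome : the model predicts
# p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)

print("coefficients (b1..bk) :", clf.coef_[0].round(3))
print("intercept (b0)        :", clf.intercept_[0].round(3))
print("P(y = 1) for the first sample :", clf.predict_proba(X[:1])[0, 1].round(3))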
Q.20 Explain in detail the ridge regression and the lasso regression.  [SPPU : March-20, In Sem, Marks 9]
Ans. : • Ridge regression and the lasso are two forms of regularized regression. These methods seek to improve the consequences of multicollinearity.
1. When variables are highly correlated, a large coefficient in one variable may be alleviated by a large coefficient in another variable which is negatively correlated to the former.
2. Regularization imposes an upper threshold on the values taken by the coefficients, thereby producing a more parsimonious solution and a set of coefficients with smaller variance.
Ridge :
• Ridge estimation produces a biased estimator of the true parameter β :
E[β̂_ridge | X] = (XᵀX + λI)⁻¹ XᵀX β
             = (XᵀX + λI)⁻¹ (XᵀX + λI − λI) β
             = [I − λ(XᵀX + λI)⁻¹] β
             = β − λ(XᵀX + λI)⁻¹ β
© Ridge regression shrinks the regression coefficients by imposing a
penalty on their size. The ridge coefficients minimize a penalized
residual sum of squares.
© Ridge regression protects against the potentially high variance of
gradients estimated in the short directions.
Lasso :
• One significant problem of ridge regression is that the penalty term will never force any of the coefficients to be exactly zero. Thus, the fitted model will include all p predictors, which creates a challenge in model interpretation. A more modern machine learning alternative is the lasso.
• The lasso works in a similar way to ridge regression, except it uses a different penalty term that shrinks some of the coefficients exactly to zero.
• Lasso is a regularized regression machine learning technique that avoids overfitting and can also be used for feature selection, since it drives some coefficients to exactly zero.
• The lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by :
β̂_lasso = arg min over β of  Σ (i = 1 to n) [ yi − β0 − Σ (j = 1 to p) βj xij ]²
subject to  Σ (j = 1 to p) |βj| ≤ t
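The contrast between the two penalties is easy to see in code. This is an illustrative sketch, not from the text; it assumes scikit-learn and NumPy and uses synthetic data.

# Ridge (L2) shrinks coefficients; Lasso (L1) sets some of them exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge coefficients :", np.round(ridge.coef_, 2))
print("lasso coefficients :", np.round(lasso.coef_, 2))
print("lasso zeroed out", int(np.sum(lasso.coef_ == 0)), "of 10 coefficients")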
3.5 : Gradient Descent Algorithm
• The step size (learning rate) η > 0 is a small number that forces the algorithm to make small jumps.
igMachine Learning 3-22 Supervised Learning : Regression
Limitations of gradient descent :
• Gradient descent is relatively slow close to the minimum : technically, its asymptotic rate of convergence is inferior to many other methods.
• For poorly conditioned convex problems, gradient descent increasingly 'zigzags' as the gradients point nearly orthogonally to the shortest direction to a minimum point.
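A minimal sketch of batch gradient descent for univariate linear regression is shown below. It is not from the text; it assumes NumPy and uses made-up data that follows y = 1 + 2x.

# Batch gradient descent for h(x) = theta0 + theta1 * x, minimizing squared error.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

theta0, theta1 = 0.0, 0.0
eta = 0.05                       # learning rate (the small step size)
for _ in range(2000):
    error = (theta0 + theta1 * x) - y
    # gradient of the squared-error cost (constant factors folded into eta)
    grad0 = error.mean()
    grad1 = (error * x).mean()
    theta0 -= eta * grad0
    theta1 -= eta * grad1

print("theta0 = %.3f, theta1 = %.3f" % (theta0, theta1))   # approaches 1 and 2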
Q.24 Explain steepest descent method.
Ans. : ¢ Steepest descent is also known as gradient method.
• This method is based on a first order Taylor series approximation of the objective function. It is also called the saddle point method. Fig. Q.24.1 shows the steepest descent method.
Fig. Q.24.1 : Steepest descent method (search path starting from the initial point x0)
• The steepest descent is the simplest of the gradient methods. The choice of direction is where f decreases most quickly, which is in the direction opposite to ∇f(xi). The search starts at an arbitrary point x0 and then goes down the gradient, until it reaches close to the solution.
• The method of steepest descent is the discrete analogue of gradient descent, but the best move is computed using a local minimization rather than computing a gradient. It is typically able to converge in few steps, but it is unable to escape local minima or plateaus in the objective function.
• The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is always orthogonal to the previous step direction. Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
• The method of steepest descent is simple, easy to apply and each iteration is fast. It is also very stable; if the minimum points exist, the method is guaranteed to locate them, although possibly only after a very large number of iterations.
3.6 : Evaluation Metrics : MAE, RMSE, R2
Q.25 Define and explain Squared Error (SE) and Mean Squared
Error (MSE) w.r.t regression.
‘Ans. : © The most common measurement of overall error is the sum of
the squares of the errors, or SSE (sum of squared errors). The line with
the smallest SSE is called the least-squares regression line.
© Mean Squared Error (MSE) is calculated by taking the average of the
square of the difference between the original and predicted values of the
data. It can also be called the quadratic cost function or sum of squared
errors.
• The value of MSE is always positive or greater than zero. A value close to zero represents better quality of the estimator / predictor. An MSE of zero (0) means that the predictor is a perfect predictor.
MSE = (1/N) Σ (actual value − predicted value)²
Fig. Q.25.1 : Representation of MSE (residual error between the data points and the best-fit regression line)
• Here N is the total number of observations / rows in the dataset. The sigma symbol denotes that the difference between actual and predicted values is taken for every i ranging from 1 to N.
Mean squared error is the most commonly used loss function for
regression. MSE is sensitive towards outliers and given several
examples with the same input feature values, the optimal pfediction will
be their mean target value. This should be compared with Mean
Absolute Error, where the optimal prediction is the median. MSE is
thus good to use if you believe that your target data, conditioned on the
input, is normally distributed around a mean value, and when it's
important to penalize outliers extra much.
• MSE incorporates both the variance and the bias of the predictor. MSE also gives more weight to larger differences : the bigger the error, the more it is penalized.
• Example : We want to predict future house prices. The price is a continuous value, and therefore we want to do regression. MSE can be used here as the loss function.
Q.26 How is the performance of a regression function measured ?
Ans. : • Following are the performance metrics used for evaluating a regression model :
regression model :
a) Mean Absolute Error (MAE)
b) Mean Squared Error (MSE)
c) Root Mean Squared Error (RMSE)
d) R-squared
e) Adjusted R-squared
1. Mean Absolute Error (MAE) :
• MAE is the sum of absolute differences between the target and predicted values, averaged over the dataset. So it measures the average magnitude of the errors in a set of predictions, without considering their directions.
MAE = (1/n) Σ (i = 1 to n) | yi − ŷi |
2. Mean Squared Error (MSE) :
MSE = (1/n) Σ (i = 1 to n) (yi − ŷi)²
3. Root Mean Square Error (RMSE) :
• RMSE is a standard way to measure the error of a model in predicting quantitative data.
RMSE = √[ (1/n) Σ (i = 1 to n) (yi − ŷi)² ]
4. R-squared :
• R-squared is also known as the coefficient of determination. This metric gives an indication of how well a model fits a given dataset. It indicates how close the regression line is to the actual data values.
• The R-squared value lies between 0 and 1, where 0 indicates that the model does not fit the given data and 1 indicates that the model fits the dataset perfectly.
R² = 1 − (first sum of errors / second sum of errors)
R² = 1 − SSres / SStot = 1 − [ Σ (yi − ŷi)² / Σ (yi − ȳ)² ]
5. Adjusted R-squared :
• The adjusted R-squared shows whether adding additional predictors improves a regression model or not.
Adjusted R² = 1 − [ (1 − R²)(N − 1) / (N − p − 1) ]
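The metrics above can be computed directly from the formulas. This is a small illustrative sketch, not from the text; it assumes NumPy and uses made-up actual / predicted values.

# MAE, MSE, RMSE and R2 for a handful of hypothetical values.
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 10.6])

mae  = np.mean(np.abs(y_true - y_pred))
mse  = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE  =", round(mae, 3))
print("MSE  =", round(mse, 3))
print("RMSE =", round(rmse, 3))
print("R2   =", round(r2, 3))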
9.27 For a given data having 100 examples, if squared errors SE,,
SEz; and SE3 are 13.33, 3.33 and 4.00 respectively, calculate Mean
Squared Error (MSE). State the formula for MSE.
Ans. :
MSE = (Squared Error 1 + Squared Error 2 + ... + Squared Error N) / Number of data samples
Mean Squared Error = (13.33 + 3.33 + 4.00) / 100 = 0.2066
END...
Supervised Learning : Classification
4.1 : Classification : K-Nearest Neighbour
Q.1 What are neighbors ? Why is it necessary to use the nearest neighbor while classifying ?
Ans. : • The idea is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.
• The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning).
The distance can, in general, be any metric measure : standard
Euclidean distance is the most common choice.
In the nearest neighbor algorithm, we classified a new data point by
calculating its distance to all the existing data points, then assigning it
the same label as the closest labeled data point.
© Despite its simplicity, nearest neighbors has been successful in a large
number of classification and regression problems, including handwritten
digits or satellite image scenes.
¢ Being a non-parametric method, it is often successful in classification
situations where the decision boundary is very irregular.
• Neighbors-based classification is a type of instance-based learning or non-generalizing learning : it does not attempt to construct a general internal model, but simply stores instances of the training data.
• Classification is computed from a simple majority vote of the nearest neighbors of each point : a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
© The basic nearest neighbors classification uses uniform weights : that is,
the value assigned to a query point is computed from a simple majority .
vote of the nearest neighbors.
© Under some circumstances, it is better to weight the neighbors such that
nearer neighbors contribute more to the fit. This can be accomplished
through the weights keyword.
* The default value, weights = ‘uniform’, assigns uniform weights to each
neighbor. weights = ‘distance’ assigns weights proportional to the inverse
of the distance from the query point.
© Alternatively, a user-defined function of the distance can be supplied
which is used to compute the weights,
Q.2 Explain KNN algorithm with its advantages and disadvantages.
Ans. : • The k-nearest neighbor (KNN) is a classical classification method that requires no training effort; it critically depends on the quality of the distance measure among examples.
• It is one of the simplest machine learning algorithms, based on the supervised learning technique. The KNN algorithm assumes the similarity between the new data and the available data and puts the new data into the category that is most similar to the available categories.
• The k-nearest neighbour classification is one of the most popular distance-based algorithms. This classification is based on measuring the distances between the test sample and the training samples to determine the final classification output. The traditional k-NN classifier works naturally with numerical data. Fig. Q.2.1 shows KNN.
• KNN stores all available data and classifies a new point based on similarity. This algorithm can also be used for regression as well as for classification, but mostly it is used for classification problems.
• KNN is a non-parametric algorithm, which means it does not make any assumption on the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and at the time of classification, it performs an action on the dataset.
Fig. Q.2.1 : KNN (a new point is assigned to Class A or Class B based on its nearest neighbours)
• In KNN, k is the number of nearest neighbours. The number of neighbours is the core deciding factor. k is generally an odd number if the number of classes is 2. When k = 1, the algorithm is known as the nearest neighbour algorithm.
• The KNN algorithm gives the user the flexibility to choose a distance metric while building the k-NN model :
a) Euclidean distance   b) Hamming distance
c) Manhattan distance   d) Minkowski distance
• The performance of the KNN algorithm is influenced by three main factors :
1. The distance function or distance metric used to determine the nearest neighbours.
2. The decision rule used to derive a classification from the k-nearest neighbours.
3. The number of neighbours used to classify the new example.
Advantages :
1. The KNN algorithm is very easy to implement.
2. Nearly optimal in the large sample limit.
3. Uses local information, which can yield highly adaptive behavior.
4. Lends itself very easily to parallel implementations.
Disadvantages :
1. Large storage requirements.
2. Computationally intensive recall.
Q.3 On a scale of 1 to 10 (where 1 is lowest and 10 is highest), a student is evaluated by an internal examiner and an external examiner, and accordingly the student's result can be pass or fail. A sample data is collected for 4 students. If a new student is rated by the internal and external examiners as 3 and 7 respectively (test instance), decide the new student's result using a KNN classifier.
Student No. | (x1) Rating by internal examiner | (x2) Rating by external examiner | (y) Result
S1 | 7 | 7 | Pass
S2 | 7 | 4 | Pass
S3 | 3 | 4 | Fail
S4 | 1 | 4 | Fail
Snew | 3 | 7 | ?
Ans. :
Xi = input vector of dimension 2 = {xi1, xi2}
where xi1 = rating by internal examiner and xi2 = rating by external examiner, for i = 1, 2, 3, 4 (i.e. 4 sample instances).
a) yi = result of a student and yi ∈ {pass, fail}
b) x0 = (x01, x02) = (3, 7)
Step 1 : Let K = 3 (because the number of classes = 2 and K should be odd).
Step 2 : Calculation of the Euclidean distance between x0 and x1, x2, x3, x4 (number of features in the input vector = 2) :
di0 = √[(xi1 − x01)² + (xi2 − x02)²]
d10 = √[(7 − 3)² + (7 − 7)²] = 4
d20 = √[(7 − 3)² + (4 − 7)²] = 5
d30 = √[(3 − 3)² + (4 − 7)²] = 3
d40 = √[(1 − 3)² + (4 − 7)²] = √13 ≈ 3.60
Step 3 : Arrange all the above distances in non-decreasing order :
(d30, d40, d10, d20) = (3, 3.60, 4, 5)
Step 4 : Select the K = 3 smallest distances : (d30, d40, d10) = (3, 3.60, 4)
Step 5 : Decide the instances corresponding to the 3 nearest distances. For the given test instance, the nearest neighbours are the 3rd student, the 4th student and the 1st student, i.e. (3, 4, fail), (1, 4, fail), (7, 7, pass).
Step 6 : Decide Kpass and Kfail :
Kpass = 1
Kfail = 2
Kfail > Kpass
Step 7 : The new student or test instance x0 is classified as "fail" because Kfail is maximum.
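The same result can be reproduced with scikit-learn. This is an illustrative sketch, not from the text; it assumes scikit-learn is installed.

# Re-doing the Q.3 example with KNeighborsClassifier (k = 3, Euclidean distance).
from sklearn.neighbors import KNeighborsClassifier

X = [[7, 7], [7, 4], [3, 4], [1, 4]]       # (internal, external) ratings
y = ["pass", "pass", "fail", "fail"]        # known results

knn = KNeighborsClassifier(n_neighbors=3)   # Euclidean distance by default
knn.fit(X, y)

print(knn.predict([[3, 7]]))                # expected : ['fail']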
4.2 : Support Vector Machine
Q.4 What do you mean by SVM ? Explain with example.  [SPPU : May-22, Marks 8]
Ans. : © Support Vector Machines (SVMs) are a set of supervised
learning methods which learn from the dataset and used for classification.
© Support vector machines are supervised machine learning algorithms,
and they are used for classification and regression analysis. The SVM
performs both linear classification and nonlinear classification.
¢ The nonlinear classification is performed using the kernel function. In
nonlinear classification, the kernels are homogenous polynomial,
complex polynomial, Gaussian radial basis function, and hyperbolic
tangent function.
© SVM finds a hyperplane to separate the inputs into separate groups.
There can be many hyperplanes that successfully divide the input
vectors. To optimize the solution and find the optimum hyperplane,
support vectors are used.
© The points closest to the hyperplane are known as support vectors. In
SVM, the hyperplane having maximal distance from support vectors is
chosen as the output hyperplane. The hyperplane and support vectors
are shown in Fig. Q.4.1.
Fig. Q.4.1 : Hyperplane and support vectors
• As it is a 2-D space, by using just a straight line we can easily separate these classes. But there can be multiple lines that may separate these classes. Consider the picture below :
Fig. Q.6.2 : Linear SVM understanding
• Hence, the SVM algorithm helps to discover the best line or decision boundary; this best boundary or region is known as the hyperplane. The SVM algorithm finds the nearest points of the lines from each of the classes. These points are referred to as support vectors. The distance between the vectors and the hyperplane is called the margin, and the purpose of SVM is to maximise this margin. The hyperplane with maximum margin is known as the optimal hyperplane.
Q.6 Explain non-linear SVM with examples.  [SPPU : May-19, Marks 4]
Ans. : • Non-linear SVM : Non-linear SVM is used for non-linearly separated data, which means that if a dataset cannot be classified by using a straight line, then such data is called non-linear data and the classifier used is known as a non-linear SVM classifier.
• If the data is linearly arranged, then we can separate it by using a straight line, however for non-linear data we cannot draw a single straight line. Consider the picture below :
Fig. Q.6.1 : Non-linear SVM
• So to separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we will add a third dimension z. It can be calculated as :
z = x² + y²
• By adding the third dimension, the sample space will become as in the picture below :
Fig. Q.6.2 : Non-linear SVM with the third dimension
A Guide for Engineering Studentsos
4-11 Supervised Learning : Cias,,
Sificati
On
Machine Learning
© So now, SVM will divide the datasets into instructions Within
following way. Consider the under photo : the
Fig. @.6.3 Datasets representation
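The kernel trick described above can be tried directly with scikit-learn. This is an illustrative sketch, not from the text; it assumes scikit-learn and uses a synthetic, circularly separated dataset.

# A non-linear SVM : an RBF-kernel SVC separating two concentric-circle classes.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # struggles on circular data
rbf_svm = SVC(kernel="rbf").fit(X, y)         # kernel trick handles it

print("linear kernel training accuracy :", round(linear_svm.score(X, y), 2))
print("RBF kernel training accuracy    :", round(rbf_svm.score(X, y), 2))
print("number of support vectors (RBF) :", int(rbf_svm.n_support_.sum()))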
Q.7 Explain key properties of support vector machine.
Ans. :
1. SVMs use a single hyperplane which subdivides the space into two half-spaces, one occupied by Class 1 and the other by Class 2.
2. They maximize the margin of the decision boundary using quadratic optimization techniques which find the optimal hyperplane.
3. Ability to handle large feature spaces.
4. Overfitting can be controlled by the soft margin approach.
5. When used in practice, SVM approaches frequently map the examples to a higher dimensional space and find margin-maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space.
6. The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher dimensional space to facilitate finding "good" linear decision boundaries in the modified space.
Q.8 Explain applications and limitations of SVM.
Ans. : SVM applications
* SVM has been used successfully in many real-world problems,
1, Text (and hypertext) categorization
2. Image classification
3. Bioinformatics (Protein classification, Cancer classification)
4, Hand-written character recognition
5. Determination of SPAM email.
Limitations of SVM :
1. It is sensitive to noise.
2. The biggest limitation of SVM lies in the choice of the kernel.
3. Another limitation is speed and size.
4. The optimal design for multiclass SVM classifiers is also a research area.
4.3 : Ensemble Learning : Bagging, Boosting,
Random Forest, Adaboost
Q.9 Explain ensemble learning.
Ans. : • The idea of ensemble learning is to employ multiple learners and combine their predictions. If we have a committee of M models with uncorrelated errors, simply by averaging them the average error of a model can be reduced by a factor of M.
® Unfortunately, the key assumption that the errors due to the individual
models are uncorrelated is unrealistic; in practice, the errors. are
typically highly correlated, so the reduction in overall error is generally
small.
© Ensemble modeling is the process of running two or more related but
different analytical models and then synthesizing the results into a
single score or spread in order to improve the accuracy of predictive
analytics and data mining applications.
• An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples.
• Ensembles often give better predictive performance than a single decision tree.
• The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.
• There are two approaches for combining models : voting and stacking.
• In voting, no learning takes place at the meta level when combining classifiers by a voting scheme. The label that is most often assigned to a particular instance is chosen as the correct prediction when using voting.
• Stacking is concerned with combining multiple classifiers generated by different learning algorithms L1, ..., LN on a single dataset S, which is composed of feature vectors si = (xi, ti).
• The stacking process can be broken into two phases :
1. Generate a set of base-level classifiers C1, ..., CN, where Ci = Li(S).
2. Train a meta-level classifier to combine the outputs of the base-level classifiers.
‘© The training set for the meta-level classifier is generated through. &
leave-one-out cross validation process.
2c
wy NC
vi = 1,..nand Vk= 1,
= Ly -5))
© The leamed classifiers are then used to generate predictions for
ik = Ch Oi)
©The meta-level dataset consists of examples
(Gime J )sYih Where the features are the predictions
class of the &
of the
Je 0
base-level classifiers and the class is the correct ss
hand.
EI © Why do ensemble methods work ?
— ae
A Guide for ‘Engineerinf stMachine Learning 4-14 Supervised Learning : Classification
Fig. Q.9.1 : Stacking frame (training set → base-learner hypotheses → meta learner's hypothesis → final prediction for a test observation)
• It is based on one of two basic observations :
1. Variance reduction : If the training sets are completely independent, it always helps to average an ensemble, because this will reduce variance without affecting bias (e.g. bagging) and reduce sensitivity to individual data points.
2. Bias reduction : For simple models, the average of models has much greater capacity than a single model. Averaging models can reduce bias substantially by increasing capacity, while variance is controlled by fitting one component at a time.
Q.10 What is bagging ? Explain bagging steps. List its advantages
and disadvantages.
Ans. : © Bagging is also called Bootstrap aggregating. Bagging and
boosting are meta-algorithms that pool decisions from multiple classifiers.
It creates ensembles by repeatedly randomly resampling the training data.
• The meta-algorithm, which is a special case of model averaging, was originally designed for classification and is usually applied to decision tree models, but it can be used with any type of model for classification or regression.
• Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Although unsupervised models, such as clustering, do not directly generate a label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects.
• For a given training set of size n, create m samples of size n by drawing n examples from the original data, with replacement. Each bootstrap sample will on average contain 63.2 % of the unique training examples, the rest being replicates. Bagging combines the m resulting models using a simple majority vote.
© In particular, on each round, the base Jearner is trained on what is often
called a "bootstrap replicate” of the original training set. Suppose the
training set consists of n examples. Then a bootstrap replicate is a new
training set that also consists of n examples, and which is formed by
repeatedly selecting uniformly at random and with replacement »
examples ftom the original training set. This ‘means that the same
example may appear multiple times. in the bootstrap replicate, or it may
appear not at all.
results due ©
• It also decreases error by decreasing the variance in the results due to unstable learners : algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.
• Pseudocode :
1. Given training data (x1, y1), ..., (xm, ym).
2. For t = 1, ..., T :
   a. Form a bootstrap replicate dataset St by selecting m random examples from the training set with replacement.
   b. Let ht be the result of training the base learning algorithm on St.
3. Output the combined classifier :
   H(x) = majority(h1(x), ..., hT(x))
Bagging steps :
1. Suppose there are N observations and M features in training data set.
A sample from training data set is taken randomly with replacement.
2. A subset of M features is selected randomly and whichever feature
gives the best split is used to split the node iteratively.
3. The tree is grown to the largest.
4, Above steps are repeated n times and prediction is given based on the
aggregation of predictions from n number of trees.
Advantages of bagging :
1, Reduces over-fitting of the model.
2. Handles higher dimensionality data very well.
3. Maintains accuracy for missing data.
Disadvantages of bagging :
1. Since final prediction is based on the mean predictions from subset
trees, it won't give precise values for the classification and regression
model.
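A minimal sketch of bagging in practice is given below. It is not from the text; it assumes scikit-learn and uses a built-in dataset purely for illustration.

# Bagging : an ensemble of decision trees trained on bootstrap samples.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(estimator=single_tree, n_estimators=50,
                           bootstrap=True, random_state=0)
# (in older scikit-learn versions the keyword is base_estimator instead of estimator)

print("single tree CV accuracy :", cross_val_score(single_tree, X, y, cv=5).mean().round(3))
print("bagged trees CV accuracy:", cross_val_score(bagged, X, y, cv=5).mean().round(3))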
Q.11 Explain boosting steps. List advantages and disadvantages of boosting.
Ans. : Boosting steps :
1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
2. Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously falsely classified / misclassified, to train a weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree, to train a third weak learner C3.
4. Combine all the weak learners via majority voting.
Advantages of boosting :
1. Supports different loss functions.
2. Works well with interactions.
Disadvantages of boosting :
1. Prone to over-fitting.
2. Requires careful tuning of different hyper-parameters.
Q.12 What is random forest ? Explain the working of random forest.
Ans. : • Random forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models increases the overall result.
• Random forest is a widely used classification and regression algorithm.
• Fig. Q.12.1 shows the random forest algorithm.
Fig. Q.12.1 : Random forest (Tree 1, Tree 2, ..., Tree n; majority voting for classification, averaging for regression)
Random forest is a classifier that contains several decision trees o
various subsets of a given dataset and takes the average to enhance the
predicted accuracy of that dataset. Instead of relying on a single
decision tree, the random forest collects the result from each tree and
expects the final output based on the majority votes of predictions.
• Steps involved in the random forest algorithm :
Step 1 : In random forest, n random records are taken from a data set having k records.
Step 2 : Individual decision trees are constructed for each sample.
Step 3 : Each decision tree will generate an output.
Step 4 : The final output is considered based on majority voting or averaging, for classification and regression respectively.
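These steps correspond directly to scikit-learn's RandomForestClassifier. The following is an illustrative sketch, not from the text; it assumes scikit-learn and uses the iris dataset.

# Random forest : many trees on bootstrap samples, combined by majority voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)              # each tree sees its own bootstrap sample

print("test accuracy :", round(forest.score(X_test, y_test), 3))
print("predicted class for the first test sample :", forest.predict(X_test[:1]))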
Q.13 Discuss application, advantages and disadvantages of random
forest algorithm.
Ans. : © Application of random forest :
1. Banking : It is mainly used in the banking industry to identify loan risk.
2. Medicine : To identify illness trends and risks.
3. Land use : Random forest classifier is also used to classify places
with similar land-use patterns.
4. Market trends : You can determine market trends using this
algorithm.
© Advantages :
1. Random forest can be used to solve both classification as well as
regression problems.
2. Random forest works well with. both categorical and continuous
variables. .
3. Random forest can automatically handle missing values.
4. It is not affected by the dimensionality curse.
© Disadvantages
1. Due to its complexities, training time is longer than for other
models.
2. Not suitable for real-time predictions.
3. More trees slow down model.
4. Can't describe relationships within data.
Q.14 Write short note on :
i) Bagging  ii) Boosting  iii) Random forest.  [SPPU : June-22, Marks 8]
Ans. : Refer Q.10, Q.11 and Q.12.
Q.15 Write short note on Adaboost.
Ans. : © AdaBoost stands for Adaptive Boosting. Adaboost is also an
ensemble learning algorithm that is created using a bunch of what is
called a decision stump. AdaBoost uses an iterative approach to learn from the mistakes of weak classifiers and turn them into strong ones.
• AdaBoost is an iterative ensemble method. The AdaBoost classifier builds a strong classifier by combining multiple poorly performing classifiers.
• AdaBoost is called adaptive because it uses multiple iterations to generate a single composite strong learner. During each round of training, a new weak learner is added to the ensemble and a weighting vector is adjusted to focus on examples that were misclassified in previous rounds.
« Steps for performing the AdaBoost algorithm :
1. Initially, all observations are given equal weights.
2. A model is built on a subset of data.
3. Using this model, predictions are made on the whole dataset.
4. Errors are calculated by comparing the predictions and actual values.
5. While creating the next model, higher weights are given to the data points which were predicted incorrectly.
6. Weights can be determined using the error value. For instance, the higher the error, the more weight is assigned to the observation.
7. This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.
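• The sketch below trains an AdaBoost classifier with scikit-learn; the dataset and parameter values are illustrative assumptions only.

# AdaBoost sketch (illustrative data and parameters)
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The default base learner is a depth-1 decision tree (a decision stump);
# after each boosting round, misclassified samples receive larger weights.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print("Training accuracy :", ada.score(X, y))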
4.4 : Binary-vs-Multiclass Classification
Q.16 What is binary classification ? Explain with example.
Ans. : • If the output is a binary value, e.g. (yes/no), (+/-), (0/1), then the learning problem is referred to as a binary classification problem.
Example : learning to distinguish sports cars.
• Binary classification would generally fall into the domain of supervised learning since the training dataset is labeled. A discriminant function is used to represent the classifier.
• A well known example of a binary classification problem is predicting
whether an email is spam or not spam.
¢ In binary classification, the objective is to design a rule that assigns
objects to one of 2 classes, often referred to as positive and negative,
on the basis of descriptive vectors of objects.
¢ Example of classification : The ABC bank wants to classify its
customers based on whether they are expected to pay back their
approved home loans. For this purpose, the history of past customers is
used to train the classifier. The classifier provides rules, which identify
potentially reliable future customers. The classification rule is as
follows :
a. If age = "31...40" and income = low then credit-rating = bad
b. If age = "31...40" and income = high then credit-rating = very good
Q.17_ Explain linear classification model.
Ans. : A classification algorithm (Classifier) that makes its classification
based on a linear predictor function combining a set of weights with the
feature vector.
• A linear classifier makes its classification decision based on the value of a linear combination of the characteristics. Imagine that the linear classifier merges into its weights all the characteristics that define a particular class.
• Linear classifiers can represent a lot of things, but they cannot represent everything. The classic example of what they cannot represent is the XOR function.
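• As a small illustration of this limitation, the sketch below (an assumed setup, not from the text) fits a linear perceptron to the four XOR points; no linear decision boundary can classify all four correctly.

# Linear classifier on XOR (illustrative sketch)
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])          # XOR labels

clf = Perceptron(max_iter=1000)
clf.fit(X, y)
# A single linear boundary cannot separate these classes,
# so accuracy on the four points stays below 1.0.
print("XOR accuracy with a linear classifier :", clf.score(X, y))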
Q.18 Define multiclass classification.
Ans. : Multiclass classification is a machine learning classification task
that consists of more than two classes, or outputs. For example, using a
model to identify animal types in images from an encyclopedia is a
multiclass classification example because there are many different animal
classifications that each image can be classified as.
Q.19 Explain difference between Binary vs Multiclass Classification.
Ans. :

Binary Classification | Multiclass Classification
Binary classification is the task where examples are assigned exactly one of two classes. | Multiclass classification is the task where examples are assigned exactly one of more than two classes.
Binary classification is often used in tasks such as spam detection (spam / not spam). | Multiclass classification is often used in image recognition and document classification tasks.
Algorithms used in binary classification are Logistic regression, Decision Trees and SVM. | Algorithms used in multiclass classification are Naive Bayes, Logistic regression, Random forest and Gradient boosting.
Binary classification is a supervised machine learning algorithm that is used to predict one of two classes for an item. | Multiclass classification is a supervised machine learning algorithm used to predict one or more classes for an item.
Q.20 Explain balanced and imbalanced multiclass classification problems.
Ans. : Balanced classification :
• When training a machine learning algorithm, it is very important to train the model on a dataset with nearly equal numbers of samples in each class. This is referred to as a balanced classification problem. Ideally we would like to have balanced training data to train a model; however, if the training data is not balanced, we need to use a class balancing method before applying a machine learning algorithm.
Fig. Q.20.1 Balanced classification (count of samples per class is roughly equal)
Imbalanced classification :
• An imbalanced classification problem is a classification problem in which the distribution of examples across the known classes is biased or skewed. The distribution can vary from a slight bias to a severe imbalance where there is one example in the minority class for hundreds, thousands or millions of examples in the majority class or classes.
• Imbalanced classifications pose a challenge for predictive modelling because most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class. This results in models that have poor predictive performance, especially for the minority class. This is a problem because typically the minority class is more important, and therefore the problem is more sensitive to classification errors for the minority class than for the majority class.
Fig. Q.20.2 Imbalanced classification (imbalanced class distribution)
4.5 : Variants of Multiclass Classification : One-vs-One and One-vs-All
Q.21 Explain various multiclass classification techniques.
Ans. : 1. One-Vs-All (OVA) :
• For each class build a classifier for that class vs the rest. Build N different binary classifiers.
• For this approach, we require N = K binary classifiers, where the k-th classifier is trained with positive examples belonging to class k and negative examples belonging to the other K - 1 classes.
© When testing an unknown example, the classifier producing the
maximum output is considered the winner, and this class label is
assigned to that example.
• It is simple and provides performance that is comparable to other more complicated approaches when the binary classifiers are tuned well.
2. All-Vs-All (AVA) :
• For each pair of classes build a binary classifier. Build N(N - 1)/2 classifiers, one classifier to distinguish each pair of classes i and j.
* A binary classifier is built to discriminate between each pair of classes,
while discarding the rest of the classes.
• When testing a new example, a voting is performed among the classifiers and the class with the maximum number of votes wins.
3. Calibration :
• The decision function f of a classifier is said to be calibrated or well-calibrated if P(x is correctly classified | f(x) = s) = s.
• Informally, f is a good estimate of the probability of classifying correctly a new datapoint x which would have output value s.
• Intuitively, if the "raw" output of a classifier is g, you can calibrate it by estimating the probability of x being well classified given that g(x) = y, for all possible values y.
4. Error-Correcting Output-Coding (ECOC)
* Error correcting code approaches try to combine binary classifiers in a
way that lets you exploit de-correlations and correct errors.
* This approach works by training N binary classifiers to distinguish
between the K different classes. Each class is given a codeword of
length N according to a binary matrix M . Each row of M corresponds
to a certain class.
¢ The following table shows an example for K = 5 classes and N = 7 bit
code words.
• Each class is given a row of the matrix. Each column is used to train a distinct binary classifier. When testing an unseen example, the output codeword from the N classifiers is compared to the given K codewords, and the class whose codeword is closest (e.g., in Hamming distance) is chosen.
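• A minimal sketch of how these strategies can be tried with scikit-learn's wrapper classes is shown below; the dataset, base classifier and code_size value are illustrative assumptions.

# One-vs-All, One-vs-One and ECOC sketches (illustrative setup)
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
base = LinearSVC()                                    # binary base classifier

ova = OneVsRestClassifier(base).fit(X, y)             # K binary classifiers
ovo = OneVsOneClassifier(base).fit(X, y)              # K(K - 1)/2 pairwise classifiers
ecoc = OutputCodeClassifier(base, code_size=2, random_state=0).fit(X, y)

for name, clf in [("OvA", ova), ("OvO", ovo), ("ECOC", ecoc)]:
    print(name, "training accuracy :", clf.score(X, y))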
Q.22 What is mean average precision ? Explain precision - recall appropriateness.
Ans. : • Mean Average Precision (MAP) is the average of the precision values at seen relevant documents. It determines precision at the rank position each time a new relevant document gets retrieved. Average precision is the average of the precision values obtained for the top k documents, each time a relevant document is retrieved.
• MAP avoids interpolation and the use of fixed recall levels; averaging over the collection is arithmetic averaging. Average precision - recall is normally used to compare the performance of distinct IR algorithms.
• Use P = 0 for each relevant document that was not retrieved. Determine the average for each query, then average over queries.
$$\mathrm{MAP} = \frac{1}{N}\sum_{j=1}^{N}\frac{1}{Q_j}\sum_{i=1}^{Q_j} P(\mathrm{doc}_i)$$
where Q_j = Number of relevant documents for query j
N = Number of queries
P(doc_i) = Precision at the i-th relevant document
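• A short sketch of this computation is given below; the ranked relevance lists are made-up examples and the sketch assumes every relevant document appears somewhere in the ranked list.

# Mean Average Precision sketch (illustrative relevance lists)
def average_precision(ranked_relevance):
    """ranked_relevance : 0/1 flags for the retrieved documents, in rank order."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)      # precision at each relevant document
    return sum(precisions) / len(precisions) if precisions else 0.0

# One ranked result list per query (1 = relevant, 0 = not relevant)
queries = [[1, 0, 1, 0, 1], [0, 1, 1, 0, 0]]
map_score = sum(average_precision(q) for q in queries) / len(queries)
print("MAP :", round(map_score, 3))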
Precision - recall appropriateness :
• Precision and recall have been extensively used to evaluate the retrieval performance of IR algorithms. However, a more careful reflection reveals problems with these two measures :
• First, the proper estimation of maximum recall for a query requires detailed knowledge of all the documents in the collection.
• Second, in many situations the use of a single measure could be more appropriate.
• Third, recall and precision measure the effectiveness over a set of queries processed in batch mode.
© Fourth, for systems which require a weak ordering though, recall and
precision might be inadequate.
Q.23 Write short note on F-measure.
Ans. : ¢ The F measure is a measure of a test's accuracy and is defined
as the weighted harmonic mean of the precision and recall of the test. The
F - measure or F - score is one of the most commonly used "single
number" measures in Information Retrieval, Natural Language Processing
and Machine Learning.
F-measure comes from Information Retrieval (IR) where Recall is the
frequency with which relevant documents are retrieved or ‘recalled’ by a
system, but it is known elsewhere as Sensitivity or True Positive Rate
(TPR).
© Precision is the frequency with which retrieved documents or
predictions are relevant or ‘correct’, and is properly a form of Accuracy,
also known as Positive Predictive Value (PPV) or True Positive
Accuracy (TPA). F is intended to combine these into a single measure
of search ‘effectiveness’.
© High precision and low accuracy is possible due to systematic bias. One
of the problems with Recall, Precision, F - measure and Accuracy as
used in Information Retrieval is that they are easily biased.
¢ The F-measure balances the precision and recall. The result is a value
between 0.0 for the worst F-measure and 1.0 for a perfect F - measure.
© The formula for the standard F1 - score is the harmonic mean of the
precision and recall. A perfect model has an F-score of 1.
F - Measure = (2 × Precision × Recall) / (Precision + Recall)
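• A quick numeric sketch of this formula is shown below; the counts are assumed example values.

# Precision, recall and F-measure from counts (illustrative values)
tp, fp, fn = 25, 10, 5               # assumed counts

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
print(round(precision, 3), round(recall, 3), round(f1, 3))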
Q.24 State formulae for calculating accuracy, true positive rate, true
negative rate, false positive rate and false negative rate for binary
classification tasks.
Ans. : The evaluation measure most used in practice is the accuracy rate.
It evaluates the effectiveness of the classifier by its percentage of correct
predictions.
Accuracy rate = (|True negatives| + |True positives|) / (|False negatives| + |False positives| + |True negatives| + |True positives|)
• True Positive Rate (TPR) is also called sensitivity, hit rate and recall.
Sensitivity = Number of true positives / (Number of true positives + Number of false negatives)
• The true negative rate is also called specificity, which is the probability that an actual negative will test negative.
Specificity = |True negative| / (|True negative| + |False positive|)
• The false negative rate is also called the miss rate. It is the probability that a true positive will be missed by the test.
Miss rate = |False negative| / (|False negative| + |True positive|)
Summary :
True Positive Rate (TP rate) = TP / (TP + FN)
True Negative Rate (TN rate) = TN / (TN + FP)
False Positive Rate (FP rate) = FP / (FP + TN)
False Negative Rate (FN rate) = FN / (FN + TP)
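• The sketch below computes these rates from a confusion matrix with scikit-learn; the label vectors are made-up examples.

# Accuracy, TPR, TNR, FPR and FNR from predictions (illustrative labels)
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy :", (tp + tn) / (tp + tn + fp + fn))
print("TPR :", tp / (tp + fn))
print("TNR :", tn / (tn + fp))
print("FPR :", fp / (fp + tn))
print("FNR :", fn / (fn + tp))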
Q.25 What is a contingency table ? What does it represent ?
Ans. : • Contingency tables are used to analyze the relation between two or more categorical variables. A contingency table displays the frequency of specific combinations. They are used to summarize the relationships and not to evaluate the model.
• When normalized by the total count, a contingency table corresponds to a joint probability mass function.
• A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used to present categorical data in terms of frequency counts.
• An r × c contingency table shows the observed frequencies of two variables, arranged into r rows and c columns. The intersection of a row and a column of a contingency table is called a cell.
• A contingency table for binary classification is represented as follows :

True class        Predicted positive     Predicted negative
Positive          True positive          False negative
Negative          False positive         True negative
Q.26 i) Find contingency table ii) Find recall iii) Precision iv) Negative recall v) False positive rate
+ve class : Fast learner
Fast learner : 20
Slow learner : 10
Ans. : Contingency table :

                          Predicted fast learner     Predicted slow learner
Actual fast learner       True positive = 25         False negative = 5
Actual slow learner       False positive = 10        True negative = 30

Precision = True positive / (True positive + False positive)
Recall = True positive / (True positive + False negative)

Calculate precision and recall :
Precision = 25/35 = 0.714
Recall = 25/30 = 0.833
Negative recall = True negative / (True negative + False positive) = 30/40 = 0.75
False positive rate = False positive / (False positive + True negative) = 10/(10 + 30) = 0.25
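• A quick check of this arithmetic in Python, using the contingency-table counts above :

# Verifying the Q.26 values from the contingency-table counts
tp, fn, fp, tn = 25, 5, 10, 30

print("Precision :", round(tp / (tp + fp), 3))              # 25/35 = 0.714
print("Recall :", round(tp / (tp + fn), 3))                  # 25/30 = 0.833
print("Negative recall :", round(tn / (tn + fp), 3))         # 30/40 = 0.75
print("False positive rate :", round(fp / (fp + tn), 3))     # 10/40 = 0.25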
Q.27 Consider the following 3-class confusion matrix. Calculate precision and recall per class. Also calculate weighted average precision and recall for the classifier.
Ans. :
Classifier Accuracy = (15 + 15 + 45) / 100 = 0.75
Calculate per-class precision and recall :
First class : precision = 0.63 and recall = 0.75
Second class : precision = 0.75 and recall = 0.50
Third class : precision = 0.8 and recall = 0.9
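• The sketch below shows how per-class precision/recall, the macro average and the weighted average are computed from a 3-class confusion matrix; the matrix values and variable names are illustrative assumptions, not the matrix from the question.

# Per-class precision/recall and macro / weighted averages (illustrative matrix)
import numpy as np

cm = np.array([[50, 3, 2],
               [5, 40, 5],
               [2, 8, 60]])           # rows = actual class, columns = predicted class

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)       # column sums = predicted counts per class
recall = tp / cm.sum(axis=1)          # row sums = actual counts per class
support = cm.sum(axis=1)

print("Per-class precision :", np.round(precision, 2))
print("Per-class recall    :", np.round(recall, 2))
print("Macro precision     :", round(precision.mean(), 2))
print("Weighted precision  :", round(np.average(precision, weights=support), 2))
print("Weighted recall     :", round(np.average(recall, weights=support), 2))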
Cross-Validation
Q.28 What is cross - validation ? How it improves the accuracy of
the outcome ?
Ans. : • Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
• In general, ML involves deriving models from data, with the aim of achieving some kind of desired behavior, e.g., prediction or classification.
• But this generic task is broken down into a number of special cases. When training is done, the data that was removed can be used to test the performance of the learned model on "new" data. This is the basic idea for a whole class of model evaluation methods called cross-validation.
¢ Types of cross validation methods are holdout, K-fold and
leave-one-out.
• The holdout method is the simplest kind of cross-validation. The data set is separated into two sets, called the training set and the testing set. The function approximator fits a function using the training set only.
© K-fold cross validation is one way to improve over the holdout method.
The data set is divided into k subsets, and the holdout method is
repeated k times. Each time, one of the k subsets is used as the test set
and the other k-1 subsets are put together to form a training set. Then
the average error across all k trials is computed.
• Leave-one-out cross-validation is K-fold cross-validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that, N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point.
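• The sketch below compares the holdout and k-fold methods with scikit-learn; the dataset, model and split sizes are illustrative assumptions.

# Holdout and k-fold cross-validation sketch (illustrative data and parameters)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
model = LogisticRegression(max_iter=1000)

# Holdout : a single train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
print("Holdout accuracy :", model.fit(X_tr, y_tr).score(X_te, y_te))

# K-fold : average accuracy over k = 5 folds
scores = cross_val_score(model, X, y, cv=5)
print("5-fold mean accuracy :", scores.mean())

# Leave-one-out is k-fold with k = number of samples (cv=LeaveOneOut())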