EDA Lecture Module 2

CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)


School of Electronics Engineering (SENSE), VIT-Chennai

Email: sathiyanarayanan.s@vit.ac.in
Handphone No.: +91-9944226963

Winter Semester 2020-21



Suggested Readings

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, “An Introduction to Statistical Learning with Applications in R”, Springer Texts in Statistics, 2013 (Facilitator’s Recommendation).

Ethem Alpaydin, “Introduction to Machine Learning”, 3rd Edition, PHI Learning Private Limited, 2019.



Contents

1 Module 2: Classification



Module 2: Classification

Topics to be covered in Module-2

Logistic Regression
Bayes’ Theorem for classification
Decision Trees
Bagging, Boosting and Random Forest
Hyperplane for Classification
Support Vector Machines



Module 2: Classification

Logistic Regression

The most common problems that occur when we fit a linear regression model to a particular data set are: (i) non-linearity of the response-predictor relationships, (ii) outliers, and (iii) correlation of error terms.
Moreover, the linear regression model assumes that the response variable Y is quantitative (or numerical). But in many situations, Y is instead qualitative (or categorical).
Consider predicting whether an individual will default on his or her
credit card payment, on the basis of annual income and monthly
credit card balance. Since the outcome is not quantitative, the linear
regression model is not appropriate.
In general, if the response Y falls into one of two categories (Yes or
No), logistic regression is used.
Module 2: Classification

Logistic Regression

Rather than modeling Y directly, logistic regression models the probability that Y belongs to a particular category.
For example, in the case of predicting whether an individual will
default on his or her credit card payment on the basis of monthly
credit card balance, logistic regression models the probability of
default as

Pr (default=Yes | balance) = p(balance).

The values of p(balance) range from 0 to 1. For any given value of balance, a prediction can be made for default. For example, one might predict default=Yes for any individual for whom p(balance) exceeds a predefined threshold.
Logistic regression uses a logistic function to model this probability.
Module 2: Classification

Logistic Regression

The logistic function for predicting the probability of Y on the basis of a single predictor variable X can be expressed as
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
where $\beta_0$ and $\beta_1$ are the model parameters.
To fit the above model (i.e. to determine $\beta_0$ and $\beta_1$), a method called maximum likelihood is used.
The estimates of $\beta_0$ and $\beta_1$ are chosen to maximize the likelihood function:
$$\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \times \prod_{i':\, y_{i'} = 0} \big(1 - p(x_{i'})\big).$$
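As a small illustration, the sketch below fits $\beta_0$ and $\beta_1$ by maximum likelihood using scikit-learn; the library choice and the toy balance/default data are assumptions, not part of the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: monthly balance (single predictor X) and default status (response Y).
X = np.array([[300], [700], [1200], [1800], [2100], [2500]])  # balance
y = np.array([0, 0, 0, 1, 1, 1])                              # default: Yes = 1, No = 0

# C is the inverse regularization strength; a very large C approximates
# plain maximum-likelihood estimation (no penalty on the coefficients).
model = LogisticRegression(C=1e6)
model.fit(X, y)

beta0, beta1 = model.intercept_[0], model.coef_[0][0]
print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")

# Predicted probability p(balance) for a new individual.
print(model.predict_proba([[1500]])[0, 1])  # Pr(default=Yes | balance=1500)
```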



Module 2: Classification

Logistic Regression

The logistic function can be manipulated as follows:
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$$
The quantity $\frac{p(X)}{1 - p(X)}$ is called the odds, and can take on any value between 0 and $\infty$. Values of the odds close to 0 and $\infty$ indicate very low and very high probabilities of default, respectively.
Taking the logarithm on both sides of the above equation gives the log-odds or logit:
$$\log_e\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X.$$
The logit of a logistic regression model is linear in X. Note that $\log_e(\cdot)$ is the natural logarithm, usually denoted $\ln(\cdot)$.
Module 2: Classification

Logistic Regression

Logistic regression can be extended to multiple logistic regression (i.e. to make a 2-class prediction based on p predictor variables $X_1, X_2, \ldots, X_p$). The logistic function for multiple logistic regression can be expressed as
$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p}}.$$
Model parameters can be chosen to maximize the same likelihood function as in the case of a single predictor variable.
The logit of a multiple logistic regression model will be linear in $\{X_1, X_2, \ldots, X_p\}$.
Logistic regression can be extended to predict a response variable that
has more than two classes as well. However, for such tasks,
discriminant analysis is preferred.
Module 2: Classification
Question 2.1
Consider the following training examples
Marks scored: X = [81 42 61 59 78 49]
Grade (Pass/Fail): Y = [Pass Fail Pass Fail Pass Fail]
Assume we want to model the probability of Y of the form
$$p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},$$
which is parameterized by $(\beta_0, \beta_1)$.

(i) Which of the following parameters would you use to model p(x)?
(a) (-119, 2)
(b) (-120, 2)
(c) (-121, 2)
(ii) With the chosen parameters, what should be the minimum mark to
ensure the student gets a ‘Pass’ grade with 95% probability?
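One way to check the candidates numerically is to evaluate the likelihood of each $(\beta_0, \beta_1)$ pair on the training examples and then invert the logistic function for the 95% requirement; the sketch below is my own working, not part of the slides.

```python
import numpy as np

X = np.array([81, 42, 61, 59, 78, 49])
y = np.array([1, 0, 1, 0, 1, 0])  # Pass = 1, Fail = 0

def likelihood(b0, b1):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * X)))
    # Product of p(x_i) over Pass examples and (1 - p(x_i)) over Fail examples.
    return np.prod(np.where(y == 1, p, 1.0 - p))

for b0, b1 in [(-119, 2), (-120, 2), (-121, 2)]:
    print((b0, b1), likelihood(b0, b1))

# Part (ii): p(x) >= 0.95  <=>  b0 + b1*x >= log(0.95 / 0.05).
b0, b1 = -120, 2  # whichever pair gives the largest likelihood above
print("minimum mark:", (np.log(0.95 / 0.05) - b0) / b1)
```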



Module 2: Classification

Bayes’ Theorem for Classification

Bayes’ theorem is used in formulating the optimal classifier. The classification task is: given an input x, find the class $\omega_i$ it belongs to. Assume there are $K \geq 2$ classes: $\omega_1, \omega_2, \ldots, \omega_K$.
The likelihood function of class k (i.e. the probability that class k has x in it) is represented as $p(x|\omega_k)$ for $k = 1, 2, \ldots, K$.
The probability of deciding that x belongs to $\omega_k$ is denoted as $p(\omega_k|x)$. This probability distribution is generally unknown and it can be estimated using Bayes’ theorem:
$$p(\omega_k|x) = \frac{p(x|\omega_k)\, p(\omega_k)}{p(x)}$$
where $p(\omega_k)$ is the probability of occurrence of class k and $p(x)$ is the probability of occurrence of x. Note that $p(x)$ is independent of k.



Module 2: Classification

Bayes’ Theorem for Classification


Both $p(x|\omega_k)$ and $p(\omega_k)$ are a priori probabilities and they can be estimated using training data. Using these a priori probabilities, the posterior probability $p(\omega_k|x)$ or its equivalent can be estimated.
The decision function for Bayes’ classifier is
$$d_j(x) = -\sum_{k=1}^{K} L_{kj}\, p(x|\omega_k)\, p(\omega_k)$$
where $L_{kj}$ is the loss/penalty due to misclassification. In general, $L_{kj}$ takes a value between 0 and 1. Since $p(x)$ is independent of k, it becomes a common term and hence it is not included in $d_j(x)$.
The decision is stated as follows:
$$x \rightarrow \omega_i \quad \text{if } d_i = \max_{j}\{d_j\} \text{ for } j = 1, 2, \ldots, K.$$
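The sketch below evaluates this decision rule for K = 2 classes with a 0/1 loss; the priors and likelihood values are assumed toy numbers, used only to make the computation concrete.

```python
import numpy as np

# Assumed toy quantities for K = 2 classes (not from the slides).
priors = np.array([0.6, 0.4])            # p(w_k)
likelihoods = np.array([0.02, 0.10])     # p(x | w_k) for the given input x
L = np.array([[0.0, 1.0],                # L[k, j]: loss of deciding w_j when w_k is true
              [1.0, 0.0]])               # 0/1 loss here

# d_j(x) = - sum_k L[k, j] * p(x | w_k) * p(w_k)
d = -(L * (likelihoods * priors)[:, None]).sum(axis=0)

decided_class = np.argmax(d) + 1
print("decision values:", d, "-> decide class", decided_class)
```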



Module 2: Classification

Question 2.2
Assume A and B are Boolean random variables (i.e. they take one of the
two possible values: True and False).
Given: p(A = True) = 0.3, p(A = False) = 0.7,
p(B = True|A = True) = 0.4, p(B = False|A = True) = 0.6,
p(B = True|A = False) = 0.6, p(B = False|A = False) = 0.4.

Calculate p(A = True|B = False) by applying Bayes’ rule.

Hint: Use the relation

p(B = False) = p(B = False|A = True) × p(A = True) + p(B = False|A = False) × p(A = False).
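A quick numerical check of this exercise (the arithmetic below is a sketch, not part of the slides):

```python
# Given probabilities
p_A = 0.3                      # p(A = True); p(A = False) = 0.7
p_Bf_given_At = 0.6            # p(B = False | A = True)
p_Bf_given_Af = 0.4            # p(B = False | A = False)

# Total probability: p(B = False)
p_Bf = p_Bf_given_At * p_A + p_Bf_given_Af * (1 - p_A)

# Bayes' rule: p(A = True | B = False)
p_At_given_Bf = p_Bf_given_At * p_A / p_Bf
print(p_Bf, p_At_given_Bf)     # 0.46 and roughly 0.391
```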



Module 2: Classification
Bayes’ Theorem for Classification
Naive Bayes’ Classifier (NBC) assumes conditional independence.
Two random variables A and B are said to be conditionally
independent given another random variable C if
p(A ∩ B|C ) = p(A, B|C ) = p(A|C ) × p(B|C ).
This implies, as long as the value of C is known and fixed, A and B
are independent. Equivalently, p(A|B, C ) = p(A|C ).
NBC is termed naive because of this strong assumption which is
unrealistic (for real data), yet very effective.
The joint probability distribution of n random variables $A_1, A_2, \ldots, A_n$ can be expressed as a product of n localized probabilities:
$$p\Big(\cap_{k=1}^{n} A_k\Big) = \prod_{k=1}^{n} p\Big(A_k \,\Big|\, \cap_{j=1}^{k-1} A_j\Big).$$
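A minimal naive Bayes sketch under the conditional-independence assumption described above; the class priors and per-feature conditional probabilities are made-up illustrative values.

```python
# Assumed quantities for a 2-class problem with two binary features A1, A2.
priors = {"w1": 0.5, "w2": 0.5}                       # p(w_k)
cond = {                                              # p(A_j = True | w_k)
    "w1": {"A1": 0.8, "A2": 0.3},
    "w2": {"A1": 0.2, "A2": 0.7},
}

def posterior_scores(a1, a2):
    """Score each class by p(w_k) * p(A1|w_k) * p(A2|w_k) (naive factorization)."""
    scores = {}
    for k in priors:
        p1 = cond[k]["A1"] if a1 else 1 - cond[k]["A1"]
        p2 = cond[k]["A2"] if a2 else 1 - cond[k]["A2"]
        scores[k] = priors[k] * p1 * p2
    total = sum(scores.values())                      # normalizing by p(x)
    return {k: v / total for k, v in scores.items()}

print(posterior_scores(a1=True, a2=False))
```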



Module 2: Classification
Bayes’ Theorem for Classification
Consider the Bayesian network in Figure 1. It is a directed acyclic
graph in which each edge corresponds to a conditional dependency,
and each node corresponds to a unique random variable.
The network has 4 nodes: Cloudy, Sprinkler, Rain and WetGrass.
Since Cloudy has an edge going into Rain, it means that
p(Rain|Cloudy) will be a factor, whose probability values are specified
next to the Rain node in a conditional probability table.
Note that Sprinkler is conditionally independent of Rain given
Cloudy. Therefore,

p(Sprinkler|Cloudy, Rain) = p(Sprinkler|Cloudy).

Using the relationships specified by the Bayesian network, the joint


probability distribution can be obtained as a product of n factors (i.e.
n probabilities) by taking advantage of conditional independence.
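For the network in Figure 1, the implied factorization is p(Cloudy, Sprinkler, Rain, WetGrass) = p(Cloudy) p(Sprinkler|Cloudy) p(Rain|Cloudy) p(WetGrass|Sprinkler, Rain). The sketch below evaluates this product with placeholder conditional probability tables, since Figure 1's actual tables are not reproduced in the text; all numbers are hypothetical.

```python
# Placeholder CPTs (the true values are in Figure 1's tables).
p_cloudy = {True: 0.5, False: 0.5}
p_sprinkler_given_c = {True: 0.1, False: 0.5}               # keyed by Cloudy
p_rain_given_c = {True: 0.8, False: 0.2}                    # keyed by Cloudy
p_wet_given_sr = {(True, True): 0.99, (True, False): 0.9,
                  (False, True): 0.9, (False, False): 0.0}  # keyed by (Sprinkler, Rain)

def joint(c, s, r, w):
    """p(C, S, R, W) as the product of the four factors of the network."""
    pc = p_cloudy[c]
    ps = p_sprinkler_given_c[c] if s else 1 - p_sprinkler_given_c[c]
    pr = p_rain_given_c[c] if r else 1 - p_rain_given_c[c]
    pw = p_wet_given_sr[(s, r)] if w else 1 - p_wet_given_sr[(s, r)]
    return pc * ps * pr * pw

# Question 2.3(a)(i): p(Cloudy=T, Sprinkler=T, Rain=F, WetGrass=T)
print(joint(True, True, False, True))
```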
Module 2: Classification

Figure 1: Bayesian network - example 1



Module 2: Classification

Question 2.3

(a) Consider the Bayesian network in Figure 1. Evaluate the following probability distribution functions:
(i) p(Cloudy = True, Sprinkler = True, Rain = False, WetGrass = True)
(ii) p(Cloudy = True, Sprinkler = False, Rain = True, WetGrass = True)

(b) Consider the Bayesian network in Figure 2. Evaluate the following probability distribution functions:
(i) p(a = 1, b = 0, c = 1, d = 1, e = 0)
(ii) p(a = 1, b = 1, c = 2, d = 0, e = 1)
(iii) p(a = 1, b = 1, c = 2, d = 0)



Module 2: Classification

Figure 2: Bayesian network - example 2



Module 2: Classification

Decision Trees
A decision tree is a hierarchical model for supervised learning. It can
be applied to both regression and classification problems.
A decision tree consists of decision nodes (root and internal) and leaf
nodes (terminal). Figure 3 shows a data set and its classification tree
(i.e. decision tree for classification).
Given an input, at each decision node, a test function is applied and
one of the branches is taken depending on the outcome of the
function. The test function gives discrete outcomes labeling the
branches (say for example, Yes or No).
The process starts at the root node (topmost decision node) and is
repeated recursively until a leaf node is hit. Each leaf node has an
output label (say for example, Class 0 or Class 1).
During the learning process, the tree grows; branches and leaf nodes are added depending on the data.
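A minimal classification-tree sketch with scikit-learn; the toy data and the library choice are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: two features per observation, two classes (0 and 1).
X = np.array([[2.0, 3.0], [1.0, 1.5], [3.5, 4.0], [4.0, 1.0], [5.0, 4.5], [0.5, 0.5]])
y = np.array([0, 0, 1, 1, 1, 0])

# Each internal node applies a univariate test x_j <= threshold; leaves carry class labels.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["x1", "x2"]))  # IF-THEN view of the fitted tree
print(tree.predict([[3.0, 2.0]]))
```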
Module 2: Classification

Figure 3: Data set (left) and the corresponding decision tree (right) - Example of
a classification tree.



Module 2: Classification

Decision Trees

Decision trees do not assume any parametric form for the class
densities and the tree structure is not fixed a priori. Therefore, a
decision tree is a non-parametric model.
Different decision trees assume different models for the test function,
say f (·). In a decision tree, the assumed model for f (·) defines the
shape of the classified regions. For example, in Figure 3, the test
functions define ‘rectangular’ regions.
In a univariate decision tree, the test function in each decision node
uses only one of the input dimensions.
In a classification tree, the ‘goodness of a split’ is quantified by an
impurity measure. Popular among them are entropy and Gini index. If
the split is such that, for all branches, all the instances choosing a
branch belong to the same class, then it is pure.
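A short sketch of the two impurity measures mentioned above, evaluated on a pure and a mixed branch (the class counts are made up):

```python
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

# A pure branch (all instances in one class) versus a mixed branch.
print(entropy([10, 0]), gini([10, 0]))   # 0.0, 0.0  -> pure
print(entropy([5, 5]), gini([5, 5]))     # 1.0, 0.5  -> maximally impure for 2 classes
```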



Module 2: Classification

Question 2.4

What is specified at any non-leaf node in a decision tree?

(a) Class of instance (Class 0 or Class 1)
(b) Data value description
(c) Test function/specification
(d) Data process description



Module 2: Classification

Advantages of Decision Trees

Fast localization of the region covering an input - due to hierarchical placement of decisions. If the decisions are binary, it requires only $\log_2(b)$ decisions to localize b regions (in the best case). In the case of classification trees, there is no need to create dummy variables while handling qualitative predictors.
Easily interpretable (in graphical form) and can be converted to easily
understandable IF-THEN rules. To some extent, decision trees mirror
human decision-making. For this reason, decision trees are sometimes
preferred over more accurate but less interpretable methods.

Disadvantages of Decision Trees

Greedy learning approach - they look for the best split at each step.
Low prediction accuracy compared to methods like regression.



Module 2: Classification

Bagging, Boosting and Random Forest

Since the prediction accuracy of a decision tree is low (due to high variance), techniques like bagging, random forests, and boosting aggregate many decision trees to construct more powerful prediction models.
Bagging creates multiple copies of the original training data using
the bootstrap (i.e. random sampling), fits a separate decision tree to
each copy, and then combines all of the trees in order to create a
single, powerful prediction model. Each tree is independent of the
other trees.
Boosting works in a way similar to bagging, except that the trees are
grown sequentially. Boosting does not involve random sampling;
instead each tree is grown using information from previously grown
trees (i.e. fit on a modified version of the original training data).
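A minimal sketch contrasting the two ideas with scikit-learn (the dataset and hyperparameters are assumptions): bagging fits trees independently on bootstrap samples, while boosting grows shallow trees sequentially.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: bootstrap copies of the training data, one tree per copy (the default
# base estimator is a decision tree), predictions combined by voting.
bag = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: shallow trees grown sequentially, each fit to what the previous trees got wrong.
boost = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))
```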



Module 2: Classification

Bagging, Boosting and Random Forest


As in bagging, random forests build a number of decision trees on
bootstrapped training data. While building these trees, for each split,
a random sample of m predictors is chosen as split candidates from
the full set of p predictors and one among these m is used.
Suppose that there is one very strong predictor in the data set, along
with a number of other moderately strong predictors. In this case,
bootstrap aggregation (i.e. bagging) will not lead to a substantial
reduction in variance over a single tree.
Since in random forests only m out of p predictors are considered for each split, on average $(p-m)/p$ of the splits will not even consider the strong predictor, and therefore other predictors stand a chance. This decorrelation process reduces the variance in the average of the resulting trees and hence improves the reliability and the prediction accuracy. Typically, $m \approx \sqrt{p}$.
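A random forest sketch with m ≈ √p split candidates per node, via scikit-learn's max_features="sqrt"; the data and settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# p = 20 predictors, so roughly sqrt(20) ~ 4-5 candidates are examined at each split.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=1)
forest.fit(X_tr, y_tr)

print("test accuracy:", forest.score(X_te, y_te))
print("first few feature importances:", forest.feature_importances_.round(3)[:5])
```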
Module 2: Classification

Question 2.5

Using a small value of m in building a random forest will typically be helpful when

(a) the number of correlated samples is zero
(b) the number of correlated samples is small
(c) the number of correlated samples is large
(d) all predictors in the data set are moderately strong



Module 2: Classification
Hyperplane for Classification
A hyperplane is a flat subspace of dimension p-1, in a p-dimensional
space. It is mathematically defined as
α0 + α1 X1 + α2 X2 + ... + αp Xp = 0.
The set of points X = {X1 , X2 , ...Xp } (i.e. vectors of length p)
satisfying the above equation lie on the hyperplane.
Suppose that,
α0 + α1 X1 + α2 X2 + ... + αp Xp > 0.
This shows the set of points lie on one side of the hyperplane.
On the other hand, if
α0 + α1 X1 + α2 X2 + ... + αp Xp < 0,
then the set of points lie on the other side of the hyperplane.
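A small NumPy sketch of this side-of-the-hyperplane test, using the hyperplane 1 + 2X1 + 3X2 = 0 from Figure 4:

```python
import numpy as np

alpha0, alpha = 1.0, np.array([2.0, 3.0])   # hyperplane 1 + 2*X1 + 3*X2 = 0

points = np.array([[1.0, 1.0],    # 1 + 2 + 3 = 6  > 0 -> one side
                   [-1.0, -1.0],  # 1 - 2 - 3 = -4 < 0 -> other side
                   [1.0, -1.0]])  # 1 + 2 - 3 = 0      -> on the hyperplane

values = alpha0 + points @ alpha
print(values)            # [ 6. -4.  0.]
print(np.sign(values))   # +1, -1 or 0 tells which side each point falls on
```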
Module 2: Classification

Hyperplane for Classification

In a 2-dimensional space (i.e. for p = 2), a hyperplane is a line dividing the space into two halves. Figure 4 shows the hyperplane $1 + 2X_1 + 3X_2 = 0$ dividing a 2-dimensional space into two. Similarly, for p = 3, a hyperplane is a plane dividing the 3-dimensional space into two halves. In p > 3 dimensions, it becomes hard to visualize a hyperplane, but the notion of dividing p-dimensional space into two halves still applies.
Consider training data X of dimension n × p (i.e. an n × p data matrix consisting of n training observations in p-dimensional space) in which each of the observations falls into one of two classes, say Class -1 and Class 1. Now, given a test observation $x^*$ (i.e. a vector of p features or variables), the concept of a separating hyperplane can be used to develop a classifier that will correctly classify $x^*$.



Module 2: Classification

Figure 4: The hyperplane (i.e. line) $1 + 2X_1 + 3X_2 = 0$ in a 2-dimensional space. Blue region: set of points satisfying $1 + 2X_1 + 3X_2 > 0$. Purple region: set of points satisfying $1 + 2X_1 + 3X_2 < 0$.
Module 2: Classification

Hyperplane for Classification


If the class labels for Class -1 and Class 1 are $y_i = -1$ and $y_i = 1$, respectively, then the separating hyperplane has the property that
$$y_i(\alpha_0 + \alpha_1 x_{i,1} + \alpha_2 x_{i,2} + \ldots + \alpha_p x_{i,p}) > 0 \quad \text{for all } i = 1, 2, \ldots, n.$$
If there exists a hyperplane that separates the training observations perfectly according to their class labels, then $x^*$ can be assigned a class depending on which side of the hyperplane it is located.
As shown in Figure 5, a classifier based on a separating hyperplane leads to a linear boundary, and there can be more than one separating hyperplane. The separating hyperplane that is farthest from the training observations is considered for classification. It is called the optimal separating hyperplane or maximal margin hyperplane. Figure 6 shows one such hyperplane.
Module 2: Classification

Figure 5: Two classes of observations (shown in purple and blue), each having
two features/variables, and three separating hyperplanes.



Module 2: Classification

Figure 6: Two classes of observations (shown in purple and blue), each having two features/variables, and the optimal separating hyperplane or maximal margin hyperplane.
Module 2: Classification

Hyperplane for Classification


Let M represent the marigin of the hyperplane. The maximal marigin
hyperplane is the solution to the following optimization problem:

maximizeα0 ,α1 ,...,αp M


p
X
subject to αj2 = 1,
j=1
 p
X 
yi α0 + αj xij ≥ M for all i = 1, 2, ..., n.
j=1

The two constraints in the above optimization problem ensures that:


(i) each training observation is in the correct side of the hyperplane;
and (ii) each observation is located at least a distance M from the
hyperplane.
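In practice, the maximal margin classifier can be approximated with scikit-learn's SVC using a linear kernel and a very large cost parameter, so that almost no slack is allowed; this is a sketch under that assumption, on made-up separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data, labels in {-1, +1}.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C leaves almost no slack, approximating the maximal margin hyperplane.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("hyperplane coefficients:", clf.coef_[0], "intercept:", clf.intercept_[0])
print("support vectors (the points that determine the margin):")
print(clf.support_vectors_)
```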
Module 2: Classification

Hyperplane for Classification


As shown in Figure 7, the addition of a single observation leads to a dramatic change in the maximal margin hyperplane. Such highly sensitive hyperplanes are problematic in the sense that they may overfit the training data.
Consider a hyperplane that does not perfectly separate the two classes, in the interest of: (i) robustness to individual observations; and (ii) better classification of most of the training observations. A classifier based on such a hyperplane is called a support vector classifier (SVC) or soft margin classifier.
The underlying assumption is, allowing misclassification of a few
training observations will result in a better classification of the
remaining observations.
The SVC is a natural approach for two-class classification, if the
boundary between the two classes is linear.
Module 2: Classification

Figure 7: Two classes of observations (shown in purple and blue), each having
two features/variables, and two separating hyperplanes.

Module 2: Classification

Hyperplane for Classification

The hyperplane for the SVC is the solution to the following optimization problem:
$$\underset{\alpha_0, \alpha_1, \ldots, \alpha_p,\, \epsilon_1, \epsilon_2, \ldots, \epsilon_n}{\text{maximize}}\ M$$
$$\text{subject to } \sum_{j=1}^{p} \alpha_j^2 = 1,$$
$$y_i\Big(\alpha_0 + \sum_{j=1}^{p} \alpha_j x_{ij}\Big) \geq M(1 - \epsilon_i),$$
$$\epsilon_i \geq 0, \quad \sum_{i=1}^{n} \epsilon_i \leq C,$$
for all $i = 1, 2, \ldots, n$, where the slack variables $\epsilon_i$ allow individual observations to violate the margin and C is a non-negative tuning parameter.
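A soft-margin sketch: scikit-learn's SVC also exposes a parameter named C, but there it acts as a misclassification penalty (large C means a narrow, strict margin) rather than the slack budget in the formulation above; the data and values below are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Overlapping classes, so a perfectly separating hyperplane does not exist.
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, class_sep=0.8, random_state=0)

# Compare a tight margin (large penalty) with a wider, more tolerant margin.
for C in [100.0, 0.01]:
    clf = SVC(kernel="linear", C=C)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"C = {C}: mean CV accuracy = {scores.mean():.3f}")
```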


Module 2: Classification
Support Vector Machines
In real-world data, the class boundaries are often non-linear (as shown
in Figure 8) and in such scenarios, SVC or any linear classifier will
perform poorly.
In the case of the SVC, only inner products of the observations are required to compute its coefficients. This inner product can be generalized as $K(x_i, x_{i'})$, where K is some function referred to as a kernel. A linear kernel gives back the SVC.
To handle non-linear boundaries, a polynomial kernel of degree d (where d is a positive integer) is required. Using such a kernel with d > 1 leads to a more flexible decision boundary compared to that of an SVC. When the SVC is combined with a non-linear kernel, the resulting classifier is known as a support vector machine (SVM). Therefore, an SVM is an extension of the SVC that enlarges the feature space using polynomial kernels of degree d > 1, to handle non-linear boundaries.
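A polynomial-kernel sketch with d = 2, compared against a linear SVC on data with a circular class boundary (the dataset and parameters are assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes with a non-linear (circular) boundary: a linear SVC does poorly here.
X, y = make_circles(n_samples=400, factor=0.5, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svc = SVC(kernel="linear").fit(X_tr, y_tr)
poly_svm = SVC(kernel="poly", degree=2, coef0=1.0).fit(X_tr, y_tr)

print("linear SVC accuracy:", linear_svc.score(X_te, y_te))
print("degree-2 SVM accuracy:", poly_svm.score(X_te, y_te))
```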
Module 2: Classification

Figure 8: Two classes of observations (shown in purple and blue), with a non-linear boundary separating them.



Module 2: Classification

Support Vector Machines

The hyperplane for an SVM using a polynomial kernel of degree d = 2 is the solution to the following optimization problem:
$$\underset{\alpha_0, \alpha_{11}, \alpha_{12}, \ldots, \alpha_{p1}, \alpha_{p2},\, \epsilon_1, \epsilon_2, \ldots, \epsilon_n}{\text{maximize}}\ M$$
$$\text{subject to } \sum_{j=1}^{p} \sum_{k=1}^{2} \alpha_{jk}^2 = 1,$$
$$y_i\Big(\alpha_0 + \sum_{j=1}^{p} \alpha_{j1} x_{ij} + \sum_{j=1}^{p} \alpha_{j2} x_{ij}^2\Big) \geq M(1 - \epsilon_i),$$
$$\epsilon_i \geq 0, \quad \sum_{i=1}^{n} \epsilon_i \leq C, \quad \text{for all } i = 1, 2, \ldots, n.$$



Module 2: Classification

Support Vector Machines


A radial kernel or radial basis function (RBF) is a popular non-linear kernel used in SVMs. It takes the form
$$K(x_i, x_{i'}) = \exp\Big(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\Big)$$
where $\gamma$ is a positive constant. For a test observation $x^*$ that is far from a training observation $x_i$, the value of $K(x^*, x_i)$ will be tiny. Therefore, the radial kernel has a local behavior, in the sense that only nearby observations have an effect on the predicted class labels.
Figure 9 shows an example of an SVM with a radial kernel on non-linear data.
Using kernels (instead of simply expanding the feature space) in an SVM is computationally advantageous. A kernel-based approach requires computation of $K(x_i, x_{i'})$ only for the $\frac{n(n-1)}{2}$ distinct pairs $i$ and $i'$.
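An RBF-kernel sketch on non-linearly separable data, in the spirit of Figure 9; the moons dataset and the γ value are assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: a non-linear boundary separates the classes.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# gamma plays the role of the positive constant in the radial kernel above;
# larger gamma makes the kernel more local.
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_tr, y_tr)

print("test accuracy:", rbf_svm.score(X_te, y_te))
```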
Module 2: Classification

Figure 9: SVM with a radial kernel, on non-linear data.
Module 2: Classification

Module-2 Summary

Logistic regression: modeling the probability that the response Y belongs to a particular category, using a logistic function, on the basis of a single variable or multiple variables.
Bayes’ theorem for classification: Bayes’ classifier using conditional independence.
Decision trees and random forests: a non-parametric, ‘information-based learning’ approach which is easy to interpret.
Hyperplane for classification: maximal margin classifier and SVC.
Support Vector Machines (SVMs): extension of the SVC to handle ‘non-linear boundaries’ between classes. Uses kernels for computational efficiency. The RBF kernel exhibits ‘local behavior’.

