EDA Lecture Module 2
Email: sathiyanarayanan.s@vit.ac.in
Handphone No.: +91-9944226963
Module 2: Classification
Logistic Regression
Bayes’ Theorem for classification
Decision Trees
Bagging, Boosting and Random Forest
Hyperplane for Classification
Support Vector Machines
Logistic Regression
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
where $\beta_0$ and $\beta_1$ are the model parameters.
To fit the above model (i.e. to determine $\beta_0$ and $\beta_1$), a method called maximum likelihood is used.
The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize the likelihood function:
$$\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \times \prod_{i':\, y_{i'} = 0} \left(1 - p(x_{i'})\right).$$
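As an illustration (not part of the original slides), the sketch below fits the above model by maximizing the log-likelihood numerically; the data, variable names and optimizer choice are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 1-D training data: x = feature, y = binary label (0/1).
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([0,  0,  0,  1,  0,  1,  1,  1])

def p(beta, x):
    """Logistic model p(X) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X))."""
    z = beta[0] + beta[1] * x
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(beta):
    """Negative log-likelihood; minimizing it maximizes l(b0, b1)."""
    prob = p(beta, x)
    eps = 1e-12  # numerical safeguard against log(0)
    return -np.sum(y * np.log(prob + eps) + (1 - y) * np.log(1 - prob + eps))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="BFGS")
print("Estimated (beta0, beta1):", result.x)
```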
Logistic Regression
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$$
The quantity $\frac{p(X)}{1 - p(X)}$ is called the odds, and can take on any value between 0 and $\infty$. Values of the odds close to 0 and $\infty$ indicate very low and very high probabilities of default, respectively.
Taking the logarithm on both sides of the above equation gives the log-odds or logit:
$$\log_e \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X.$$
The logit of a logistic regression model is linear in X. Note that $\log_e(\cdot)$ is the natural logarithm, usually denoted $\ln(\cdot)$.
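A minimal numerical sketch of the relationship above, using hypothetical parameter values; it confirms that the odds lie in (0, ∞) and that the logit equals β0 + β1 X.

```python
import numpy as np

# Hypothetical parameters and input value for illustration.
beta0, beta1 = -6.0, 0.05
X = 100.0

p = np.exp(beta0 + beta1 * X) / (1.0 + np.exp(beta0 + beta1 * X))
odds = p / (1.0 - p)   # lies in (0, infinity)
logit = np.log(odds)   # equals beta0 + beta1 * X (linear in X)

print(f"p(X) = {p:.4f}, odds = {odds:.4f}, logit = {logit:.4f}")
```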
Logistic Regression
Question 2.1
(i) Which of the following parameter pairs (β0, β1) would you use to model p(x)?
(a) (-119, 2)
(b) (-120, 2)
(c) (-121, 2)
(ii) With the chosen parameters, what should be the minimum mark to
ensure the student gets a ‘Pass’ grade with 95% probability?
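As a hedged illustration of how part (ii) can be approached, the snippet below inverts the logistic model to find the x at which p(x) reaches a target probability; the parameter pair passed in is just one of the listed options, not an asserted answer.

```python
import numpy as np

def min_x_for_probability(beta0, beta1, target=0.95):
    """Invert p(x) = 1/(1 + exp(-(b0 + b1*x))) to find x with p(x) = target."""
    # logit(target) = b0 + b1*x  =>  x = (logit(target) - b0) / b1
    return (np.log(target / (1.0 - target)) - beta0) / beta1

# Example with one of the listed parameter pairs (illustrative only).
print(min_x_for_probability(-120, 2, 0.95))
```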
Bayes' theorem gives the posterior probability of class $\omega_k$ given an observation $x$:
$$p(\omega_k \mid x) = \frac{p(x \mid \omega_k)\, p(\omega_k)}{p(x)}$$
Question 2.2
Assume A and B are Boolean random variables (i.e. they take one of the
two possible values: True and False).
Given: p(A = True) = 0.3, p(A = False) = 0.7,
p(B = True|A = True) = 0.4, p(B = False|A = True) = 0.6,
p(B = True|A = False) = 0.6, p(B = False|A = False) = 0.4.
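Although the full question text is not reproduced above, the given values are sufficient to apply Bayes' theorem; the sketch below (an illustration) computes p(B = True) by total probability and the posterior p(A = True | B = True).

```python
# Given values from Question 2.2.
p_A = {True: 0.3, False: 0.7}
p_B_given_A = {(True, True): 0.4, (False, True): 0.6,
               (True, False): 0.6, (False, False): 0.4}   # keyed by (B, A)

# Total probability: p(B=True) = sum_a p(B=True | A=a) * p(A=a)
p_B_true = sum(p_B_given_A[(True, a)] * p_A[a] for a in (True, False))

# Bayes' theorem: p(A=True | B=True) = p(B=True | A=True) * p(A=True) / p(B=True)
p_A_true_given_B_true = p_B_given_A[(True, True)] * p_A[True] / p_B_true

print(f"p(B=True) = {p_B_true:.2f}")                        # 0.4*0.3 + 0.6*0.7 = 0.54
print(f"p(A=True | B=True) = {p_A_true_given_B_true:.4f}")  # 0.12 / 0.54
```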
Question 2.3
Decision Trees
A decision tree is a hierarchical model for supervised learning. It can
be applied to both regression and classification problems.
A decision tree consists of decision nodes (root and internal) and leaf
nodes (terminal). Figure 3 shows a data set and its classification tree
(i.e. decision tree for classification).
Given an input, at each decision node, a test function is applied and
one of the branches is taken depending on the outcome of the
function. The test function gives discrete outcomes labeling the
branches (say for example, Yes or No).
The process starts at the root node (topmost decision node) and is
repeated recursively until a leaf node is hit. Each leaf node has an
output label (say for example, Class 0 or Class 1).
During the learning process, the tree grows; branches and leaf nodes are added depending on the data.
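A minimal sketch, assuming scikit-learn and a hypothetical two-feature data set, of growing and querying a classification tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data set: two features per observation, binary class labels.
X = np.array([[2.0, 3.0], [1.0, 1.0], [3.5, 4.0], [6.0, 5.0], [7.0, 8.0], [8.0, 6.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each decision node tests one feature against a threshold (a univariate tree),
# so the induced regions are axis-aligned 'rectangles', as in Figure 3.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2)
tree.fit(X, y)

# Prediction follows root -> internal nodes -> a leaf, whose label is returned.
print(tree.predict([[2.5, 2.0], [7.5, 7.0]]))
```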
Figure 3: Data set (left) and the corresponding decision tree (right) - Example of
a classification tree.
Decision Trees
Decision trees do not assume any parametric form for the class densities, and the tree structure is not fixed a priori. Therefore, a decision tree is a non-parametric model.
Different decision trees assume different models for the test function,
say f (·). In a decision tree, the assumed model for f (·) defines the
shape of the classified regions. For example, in Figure 3, the test
functions define ‘rectangular’ regions.
In a univariate decision tree, the test function in each decision node
uses only one of the input dimensions.
In a classification tree, the ‘goodness of a split’ is quantified by an impurity measure; popular choices are entropy and the Gini index. If, for every branch, all the instances taking that branch belong to the same class, the split is pure.
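As an illustration (with hypothetical class counts, not from the slides), the snippet below computes the entropy and Gini index of a node from its class counts; a pure node has zero impurity under both measures.

```python
import numpy as np

def entropy(counts):
    """Entropy of a node: -sum_k p_k * log2(p_k), ignoring empty classes."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini index of a node: 1 - sum_k p_k^2."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

print(entropy([5, 5]), gini([5, 5]))    # maximally impure node: 1.0, 0.5
print(entropy([10, 0]), gini([10, 0]))  # pure node: 0.0, 0.0
```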
Question 2.4
Greedy learning approach: decision trees look for the best split at each step, which need not lead to the globally best tree.
Lower prediction accuracy compared to some other regression and classification approaches.
Question 2.5
Figure 5: Two classes of observations (shown in purple and blue), each having
two features/variables, and three separating hyperplanes.
Figure 6: Two classes of observations (shown in purple and blue), each having two features/variables, and the optimal separating hyperplane or the maximal margin hyperplane.
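A minimal sketch (assuming scikit-learn; synthetic data, not the data shown in the figures) of finding a maximal margin separating hyperplane with a linear SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, linearly separable two-feature data (two classes, as in Figure 6).
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5], [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates the hard-margin (maximal margin) classifier.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("Hyperplane: %.3f*x1 + %.3f*x2 + %.3f = 0" % (w[0], w[1], b))
print("Support vectors:\n", clf.support_vectors_)
```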
Figure 7: Two classes of observations (shown in purple and blue), each having
two features/variables, and two separating hyperplanes.
Module-2 Summary