UET
Since 2004
ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN
VNU-University of Engineering and Technology
INT3405 - Machine Learning
Lecture 5: Classification (P2)
Decision Tree
Hanoi, 09/2023
Outline
● Decision Tree Examples
● Decision Tree Algorithms
● Methods for Expressing Test Conditions
● Measures of Node Impurity
○ Gini index
○ Entropy
○ Gain Ratio
○ Classification Error
FIT-CS INT3405 - Machine Learning 2
Example of a Decision Tree
Source: https://regenerativetoday.com/simple-explanation-on-how-decision-tree-algorithm-makes-decisions/
Example of a Decision Tree
Example of Model Prediction (1)
Example of Model Prediction (2)
Example of Model Prediction (3)
Example of Model Prediction (4)
Example of Model Prediction (5)
Example of Model Prediction (6)
Decision Tree - Another Solution
There could be more than one tree that
fits the same data!
Decision Tree - Classification Task
Decision Tree Induction
● Various Algorithms
○ Hunt’s Algorithm (one of the earliest)
○ CART
○ ID3, C4.5
○ SLIQ, SPRINT
General Structure of Hunt’s Algorithm
● Let Dt be the set of training records that reach a
node t
● General Procedure:
○ If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
○ If Dt contains records that belong to more than
one class, use an attribute test to split the data
into smaller subsets.
○ Recursively apply the procedure to each subset.
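The three cases above translate directly into a short recursive sketch (illustrative only: the record representation, the majority-vote fallback, and the naive choice of the first available attribute are assumptions of this example, not part of the slides):

```python
from collections import Counter

def hunt(records, attributes):
    """Recursive sketch of Hunt's algorithm.

    records: list of (feature_dict, class_label) pairs reaching this node (Dt).
    attributes: attribute names still available for test conditions.
    Returns a class label (leaf) or a dict {attribute: {value: subtree}}.
    """
    labels = [y for _, y in records]
    # Case 1: all records in Dt belong to the same class yt -> leaf labeled yt
    if len(set(labels)) == 1:
        return labels[0]
    # No usable attribute test left -> majority-class leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Case 2: pick an attribute test; a real implementation would choose
    # the best attribute using an impurity measure (see later slides)
    attr = attributes[0]
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        children[value] = hunt(subset, remaining)  # recurse on each subset
    return {attr: children}
```

Running `hunt` on a handful of records returns a nested dict that mirrors the tree diagrams on the following slides.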
Hunt’s Algorithm
Design Issues of Decision Tree Induction
● How should training records be split?
○ Method for expressing test condition
■ depending on attribute types
○ Measure for evaluating the goodness of a test condition
● How should the splitting procedure stop?
○ Stop splitting if all the records belong to the same class or
have identical attribute values
○ Early termination
Methods for Expressing Test Conditions
● Depends on attribute types
○ Binary
○ Nominal
○ Ordinal
○ Continuous
Test Conditions for Nominal Attributes
● Multi-way split:
○ Use as many partitions as distinct values.
● Binary split:
○ Divides values into two subsets
Test Conditions for Ordinal Attributes
● Multi-way split:
○ Use as many partitions as distinct values.
● Binary split:
○ Divides values into two subsets
○ Preserve order property among attribute values
This grouping violates the order property
Test Conditions for Continuous Attributes
Splitting based on Continuous Attributes
● Different ways of handling
○ Discretization to form an ordinal categorical attribute. Ranges
can be found by equal interval bucketing, equal frequency
bucketing (percentiles), or clustering.
■ Static – discretize once at the beginning
■ Dynamic – repeat at each node
○ Binary Decision: (A < v) or (A ≥ v)
■ considers all possible splits and finds the best cut
■ can be more compute-intensive
How to determine the Best Split
Before Splitting: 10 records of class 0,
10 records of class 1
Which test condition is the best?
How to determine the Best Split
● Greedy approach:
– Nodes with purer class distribution are preferred
● Need a measure of node impurity:
High degree of impurity Low degree of impurity
Measures of Node Impurity
● Gini Index
● Entropy
● Misclassification error
Find the Best Split
● Compute impurity measure (P) before splitting
● Compute impurity measure (M) after splitting
○ Compute impurity measure of each child node
○ M is the weighted impurity of child nodes
● Choose the attribute test condition that produces the highest
gain
Gain = P - M
● or equivalently, lowest impurity measure after splitting (M)
Measure of Impurity: Gini (index)
● Gini Index for a given node t:
GINI(t) = 1 – Σj [P(Cj|t)]²
where P(Cj|t) is the relative frequency of class Cj at node t
● For a 2-class problem (p, 1 – p):
◆ GINI = 1 – p² – (1 – p)² = 2p(1 – p)
Computing Gini Index of a Single Node
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0
P(C1) = 1/6 P(C2) = 5/6
Gini = 1 – (1/6)2 – (5/6)2 = 0.278
P(C1) = 2/6 P(C2) = 4/6
Gini = 1 – (2/6)2 – (4/6)2 = 0.444
FIT-CS INT3405 - Machine Learning 32
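The three node computations above can be reproduced with a small helper (a sketch; representing a node by its list of class counts is an assumption of this example):

```python
def gini(class_counts):
    """Gini index of a single node: 1 - sum over classes of P(Cj|t)^2."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)
```

Here `gini([0, 6])`, `gini([1, 5])`, and `gini([2, 4])` give 0, ≈0.278, and ≈0.444, matching the slide.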
Computing Gini Index for a Collection of Nodes
● When a parent node p is split into k child nodes (partitions):
GINI_split = Σi (ni/n) × GINI(i)
where ni is the number of records at child i and n is the number of records at the parent node p
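The weighted combination can be sketched the same way (class-count lists per child are again an assumed representation):

```python
def gini(class_counts):
    """Gini index of a single node."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(children):
    """Weighted Gini of a split: sum over children of (ni/n) * GINI(i).

    children: one class-count list per child node.
    """
    n = sum(sum(counts) for counts in children)
    return sum(sum(counts) / n * gini(counts) for counts in children)
```

For example, splitting 12 records into a pure (6, 0) child and a (2, 4) child gives 6/12 × 0 + 6/12 × 0.444 ≈ 0.222.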
Binary Attributes: Computing GINI Index
● Splits into two partitions (child nodes)
● Effect of weighting partitions:
– Larger and purer partitions are sought
Categorical Attributes: Computing GINI Index
● For each distinct value, gather counts for each class in the dataset
● Use the count matrix to make decisions
Which of these is the best?
Continuous Attributes: Computing Gini Index
● Use Binary Decisions based on one value
● Several Choices for the splitting value
○ Number of possible splitting values
= Number of distinct values
● Each splitting value has a count matrix associated with it
○ Class counts in each of the partitions, A ≤ v and A > v
● Simple method to choose best v
○ For each v, scan the database to gather count matrix
and compute its Gini index
○ Computationally Inefficient! Repetition of work.
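The "simple method" above can be sketched as a naive scan (helper names are hypothetical; candidate thresholds are taken as midpoints between adjacent distinct values, and the count matrix is rebuilt from scratch for each v, which is exactly the inefficiency the slide points out):

```python
def gini(class_counts):
    """Gini index of a node given its class counts."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts) if n else 0.0

def best_split(values, labels):
    """Naive search for the best binary split (A <= v) vs (A > v).

    For each candidate v, scan the data to gather the count matrix and
    compute the weighted Gini; keep the v with the lowest value.
    """
    classes = sorted(set(labels))
    n = len(values)
    distinct = sorted(set(values))
    best_v, best_m = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= v]
        right = [y for x, y in zip(values, labels) if x > v]
        m = ((len(left) / n) * gini([left.count(c) for c in classes])
             + (len(right) / n) * gini([right.count(c) for c in classes]))
        if m < best_m:
            best_v, best_m = v, m
    return best_v, best_m
```

A more efficient variant (not shown) sorts the values once and updates the counts incrementally while sweeping the candidates.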
Measure of Impurity: Entropy
● Entropy for a given node t:
Entropy(t) = – Σj P(Cj|t) log2 P(Cj|t)
(with the convention 0 log2 0 = 0)
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0
P(C1) = 1/6 P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65
P(C1) = 2/6 P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
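A matching sketch for entropy (class-count lists assumed; zero counts are skipped, which implements the 0 log2 0 = 0 convention):

```python
from math import log2

def entropy(class_counts):
    """Entropy of a single node: -sum over classes of P(Cj|t) log2 P(Cj|t)."""
    n = sum(class_counts)
    return -sum(c / n * log2(c / n) for c in class_counts if c > 0)
```

`entropy([0, 6])`, `entropy([1, 5])`, and `entropy([2, 4])` give 0, ≈0.65, and ≈0.92, as computed above.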
Computing Information Gain After Splitting
● Information Gain when a parent node p is split into k partitions:
Gain_split = Entropy(p) – Σi (ni/n) × Entropy(i)
where ni is the number of records in partition i and n is the number of records at the parent
● Choose the split that achieves the largest reduction in entropy (maximizes the gain)
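Putting the pieces together, information gain can be sketched as follows (parent and children are again represented as class-count lists, an assumption of this example):

```python
from math import log2

def entropy(class_counts):
    """Entropy of a node given its class counts (0 log 0 taken as 0)."""
    n = sum(class_counts)
    return -sum(c / n * log2(c / n) for c in class_counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Gain = Entropy(parent) - sum over children of (ni/n) * Entropy(i)."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted
```

A perfectly separating split of a balanced [3, 3] parent yields a gain of 1 bit, while a split that changes nothing yields 0.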
Problem with large number of partitions
● Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure
– Customer ID has the highest information gain because the entropy for all the children is zero
Gain Ratio
● Gain Ratio:
GainRatio_split = Gain_split / SplitINFO
where SplitINFO = – Σi (ni/n) log2 (ni/n) for a split into k partitions of sizes ni
● Adjusts information gain by the entropy of the partitioning: a large number of small partitions is penalized (used in C4.5)
Gain Ratio
SplitINFO = 1.52 SplitINFO = 0.72 SplitINFO = 0.97
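SplitINFO depends only on the partition sizes, so it can be sketched separately (the example sizes below are assumptions chosen to reproduce the values shown, not taken from the slide's figure):

```python
from math import log2

def split_info(partition_sizes):
    """SplitINFO = -sum over partitions of (ni/n) log2 (ni/n)."""
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    """Gain ratio = information gain divided by SplitINFO of the split."""
    return gain / split_info(partition_sizes)
```

For a 10-record node, partitions of sizes 4/4/2, 8/2, and 6/4 give SplitINFO ≈ 1.52, 0.72, and 0.97 respectively.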
Measure of Impurity: Classification Error
● Classification error for a given node t:
Error(t) = 1 – maxj P(Cj|t)
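As with the other measures, a one-line sketch (class-count lists assumed):

```python
def classification_error(class_counts):
    """Classification error of a node: 1 - max over classes of P(Cj|t)."""
    n = sum(class_counts)
    return 1.0 - max(class_counts) / n
```

On the same example nodes used for Gini and entropy, this gives 0, ≈0.167, and ≈0.333.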
Computing Error of a Single Node
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Error = 1 – max(0, 1) = 1 – 1 = 0
P(C1) = 1/6 P(C2) = 5/6
Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6 ≈ 0.167
P(C1) = 2/6 P(C2) = 4/6
Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3 ≈ 0.333
Comparison among Impurity Measures
For a 2-class problem:
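The comparison curve for a 2-class node (p, 1 – p) can be reproduced numerically; all three measures vanish at p = 0 or 1 and peak at p = 0.5 (a minimal sketch):

```python
from math import log2

def impurities(p):
    """(Gini, entropy, misclassification error) for a 2-class node (p, 1-p)."""
    gini = 2 * p * (1 - p)
    ent = 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))
    err = 1 - max(p, 1 - p)
    return gini, ent, err
```

Evaluating `impurities(p)` over a grid of p values traces the three curves of the figure: entropy reaches 1 at p = 0.5, while Gini and misclassification error reach 0.5.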
Decision Tree Classification
● Advantages:
○ Relatively inexpensive to construct
○ Extremely fast at classifying unknown records
○ Easy to interpret for small-sized trees
○ Robust to noise
○ Can easily handle redundant attributes, irrelevant attributes
● Disadvantages:
○ Due to the greedy nature of the splitting criterion, interacting attributes (those that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
○ Each decision boundary involves only a single attribute
Summary
● Decision Tree Examples
● Decision Tree Algorithms
● Methods for Expressing Test Conditions
● Measures of Node Impurity
○ Gini index
○ Entropy
○ Gain Ratio
○ Classification Error
Thank you