ISOM3360 Data Mining for Business Analytics, Session 4
Decision Trees (I)
Instructor: Rong Zheng
Department of ISOM
Fall 2020
Recap: Data Understanding
Preliminary investigation of the data to better understand
its specific characteristics
Helps in selecting appropriate data mining algorithms
Things to look at
Class imbalance
Dispersion of data attribute values
Skewness, outliers, missing values
Correlation analysis
Visualization tools are important
Histograms, box plots
Scatter plot
Recap: Data Preprocessing
Data transformation might be needed
Handling missing values
Handling categorical variables
Feature transformation
e.g., log transformation
Normalization (back to this when clustering is discussed)
Feature discretization
Question
Which of the following normalization method(s)
may transform the original variable into a negative
value: min-max, z-score, both, or neither?
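As a quick illustration, here is a minimal Python sketch (not from the slides) applying both methods to a hypothetical toy vector: min-max rescales values into [0, 1], while z-score subtracts the mean, so values below the mean become negative.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 100.0])   # hypothetical toy data

# Min-max normalization: rescales into [0, 1], so results are never negative.
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: subtracts the mean, so values below the mean become negative.
z_score = (values - values.mean()) / values.std()

print(min_max)   # all entries in [0, 1]
print(z_score)   # entries below the mean are negative
```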
What Induction Algorithm Shall We Use?
Commonly Used Induction Algorithms
Post-mortem analysis of a popular data mining competition
Why Decision Trees?
Decision trees are one of the most popular data mining
tools.
Classification trees (used when the target variable is categorical)
Regression trees (used when the target variable is numeric)
They are easy to understand, implement, and use, and computationally cheap.
Model comprehensibility is important for communication to non-DM-savvy stakeholders.
Classification Tree
Employed Balance Age Default
Yes 123,000 50 No
No 51,100 40 Yes
No 68,000 55 No
Yes 34,000 46 Yes
Yes 50,000 44 No
No 100,000 50 Yes
Objective: predicting borrowers who will default on loan payments.
Classification Tree: Upside-Down
[Tree diagram, read top-down:]
Employed?  (root node)
  Yes -> Class = Not Default  (leaf)
  No  -> Balance?  (node)
    >= 50K -> Class = Not Default
    < 50K  -> Age?
      >= 45 -> Class = Not Default
      < 45  -> Class = Default  (leaf)
Classification Tree: Divide and Conquer
“Recursive Partitioning”

Nodes
  Each node represents a test on one attribute.
  Tests on a nominal attribute: the number of splits (branches) is the number of possible values, or 2 (one value vs. the rest).
  Continuous attributes are discretized.

Leaves
  A class assignment (e.g., default / not default).
  Leaves also provide a distribution over all possible classes (e.g., default with probability 0.25, not default with prob. 0.75).

[Tree diagram: the same Employed / Balance / Age tree shown on the previous slide.]
How a Tree is Used for Classification?
To determine the class of a new example, e.g., Mark, age 40, retired, balance 38K:
  The example is routed down the tree according to the values of its attributes.
  At each node, a test is applied to one attribute.
  When a leaf is reached, the example is assigned to a class, or alternatively to a distribution over the possible classes (e.g., default with probability 0.25, not default with prob. 0.75).

[Tree diagram: the same Employed / Balance / Age tree.]
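As a concrete illustration, here is a minimal Python sketch (not the instructor's code) of the tree shown above, routing the example of Mark down to a leaf; treating "retired" as Employed = "No" is an assumption.

```python
def classify(employed, balance, age):
    """Route an example down the Employed -> Balance -> Age tree from the slides
    and return the class assigned at the leaf that is reached."""
    if employed == "Yes":
        return "Not Default"
    if balance >= 50_000:      # not employed: test the balance next
        return "Not Default"
    if age >= 45:              # balance below 50K: test the age
        return "Not Default"
    return "Default"

# Mark, age 40, balance 38K; "retired" is treated here as Employed = "No" (an assumption).
print(classify(employed="No", balance=38_000, age=40))  # -> "Default"
```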
Assigning Probability Estimates
[Figure: the entire population partitioned into regions by Balance (>= 50k vs. < 50k) and Age (>= 45 vs. < 45).]

Q: How would you assign estimates of class probability (e.g.,
probability of default/not default) based on such trees?
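One standard way to do this (a common approach, not spelled out on the slide): use the class frequencies of the training examples that fall into each leaf or region. For example, if a hypothetical leaf contains 3 defaulters among its 12 training examples, the tree would estimate P(default) = 3/12 = 0.25 and P(not default) = 9/12 = 0.75, which is the kind of 0.25 / 0.75 distribution mentioned on the earlier slides.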
Exercise: Assume this is the classification tree you learned from the
past defaulting data. A new guy, who is 45 and has a 20K balance, is
applying for a credit card issued by your company. Can you predict if
he is going to default? How confident are you about your prediction?
Another girl is also applying for the same credit card, but the only
information we have about her is that she has a 70K balance. Can you
predict if she will default? How sure are you about that?
Classification Tree Learning
Objective: based on customer attributes, partition the customers into subgroups that are less impure with respect to the class (i.e., such that in each group most instances belong to the same class).

[Figure: a mixed group of "Yes" and "No" customers before any partitioning.]
Classification Tree Learning
Partitioning into “purer” groups
[Figure: the population split by body color into an orange-body group (mostly "Yes") and a purple-body group (mostly "No").]
Classification Tree Learning
Partitioning into “purer” groups recursively
[Figure: the purple-body group on its own, still a mix of "Yes" and "No" instances.]
Classification Tree Learning
[Figure: the purple-body group split further by head color (red head vs. green head) into purer subgroups.]
Classification Tree Learning
Partitioning into “purer” groups
[Figure: as before, the population split by body color into the orange-body and purple-body groups.]
Classification Tree Learning
Partitioning into “purer” groups recursively
[Figure: the orange-body group on its own, still containing a mix of "Yes" and "No" instances.]
Classification Tree Learning
[Figure: the orange-body group split further by limb color (blue limbs vs. yellow limbs) into purer subgroups.]
Classification Tree Learning
[Figure: the full tree. The population is split on body color at the root; the purple-body branch is then split on head color, and the orange-body branch on limb color.]
Summary: Classification Tree Learning
A tree is constructed by recursively partitioning the examples.
With each partition, the examples are split into subgroups that are "increasingly pure".
Let’s play a game. I have someone in mind, and
your job is to guess this person. You can only ask
yes/no questions.
This person is an entrepreneur.
Go!
Next…
Some important questions have not been answered yet:
How do we automatically choose which attribute to split the data on?
When do we stop splitting?
How to Choose Which Attribute to Split Over?
Objectives
  For each splitting node, choose the attribute that best partitions the population into less impure groups.
  All else being equal, fewer nodes is better (more comprehensible, easier to use, and less prone to overfitting).
Impurity measures: many are available, but the most common one (from information theory) is entropy.
Entropy
Entropy $= -\sum_i p_i \log_2 p_i$, where $p_i$ is the proportion of class $i$ in the data.
For example: our initial population is composed of 16 cases
of class “Default” and 14 cases of class “Not default”.
Entropy (entire population of examples) $= -\frac{16}{30}\log_2\frac{16}{30} - \frac{14}{30}\log_2\frac{14}{30} \approx 0.997$
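A minimal Python sketch (not from the slides) of this two-class entropy computation; the counts 16 and 14 are the example above.

```python
import math

def entropy(class_counts):
    """Entropy = -sum(p_i * log2(p_i)), using the convention 0 * log2(0) = 0."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(entropy([16, 14]))  # entire population of 30 examples: ~0.997
```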
Entropy Exercise
A dataset is composed of 10 cases of class “Positive” and 10 cases of class “Negative”. Entropy = ?

A dataset is composed of 0 cases of class “Positive” and 20 cases of class “Negative”. Entropy = ?

[Figure: the two-class entropy function, plotting entropy (0 to 1) against the proportion of one class (0 to 1).]

Tip: $0 \log_2 0 = 0$
Information Gain (based on Entropy)
The information gain is based on the reduction in entropy after a dataset is split on an attribute.
Information Gain = entropy (parent) – [weighted average] entropy (children),
where the weight of each child is given by the proportion of the examples in that child.

[Figure: a parent node split into two children, Balance < 50k and Balance >= 50k.]
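A small Python sketch (not from the slides) of this formula, reusing the `entropy` helper from the earlier sketch; `child_counts` is a hypothetical list of per-child class-count lists.

```python
def information_gain(parent_counts, child_counts):
    """Information gain = entropy(parent) - weighted average entropy(children),
    where each child is weighted by its share of the parent's examples."""
    total = sum(parent_counts)
    weighted = sum((sum(counts) / total) * entropy(counts) for counts in child_counts)
    return entropy(parent_counts) - weighted
```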
Information Gain Example
Parent node: 30 instances, entropy = 0.997.
Split on Balance:
  Balance < 50k: 17 instances
  Balance >= 50k: 13 instances
(Weighted) Average Impurity of Children = 0.615
Information Gain = 0.997 - 0.615 = 0.382
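Using the `information_gain` sketch above: the slide gives only the child sizes and the weighted impurity, so the per-child class counts below are a hypothetical reconstruction that is roughly consistent with those numbers, not necessarily the original figure's.

```python
# Hypothetical class counts [default, not default]: a 13-instance child with 12/1
# and a 17-instance child with 4/13, giving a weighted child entropy of about 0.615.
print(information_gain([16, 14], [[12, 1], [4, 13]]))  # ~0.38
```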
What If We Split Over “Age” First?
Parent node: 30 instances.
Split on Age:
  Age < 45: 15 instances
  Age >= 45: 15 instances
Impurity of each child = ?  Exercise!
Information Gain = ?
Recall: the gain from first splitting on “Balance” was 0.382.
Our Original Question
Now we are ready to answer the question we have been waiting for:
How do we choose which attribute to split over?
Answer: at each node, choose the attribute that obtains the maximum information gain!
Decision Tree Algorithm (Full Tree)
Step 1: Calculate the information gain from splitting over
each attribute using the dataset
Step 2: Split the set into subsets using the attribute for which
the information gain is maximum
Step 3: Make a decision tree node containing that attribute,
divide the dataset by its branches and repeat the same
process on every node.
Step 4a: A node with entropy of 0 is a leaf.
Step 4b: A node with entropy more than 0 needs further
splitting.
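Putting these steps together, here is a minimal Python sketch (an ID3-style illustration on categorical attributes, not the instructor's code); the tiny dataset at the bottom is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """-sum(p * log2(p)) over the class proportions in `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy(parent) minus the weighted average entropy of the children
    produced by splitting on the categorical attribute `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attributes):
    """Grow a full tree by recursive partitioning (Steps 1-4 on the slide)."""
    # Step 4a: a node with entropy 0 (all examples in one class) is a leaf;
    # also stop if no attributes are left, returning the majority class.
    if entropy(labels) == 0 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-2: choose the attribute with maximum information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    # Step 3: make a node for that attribute and recurse on each branch.
    branches = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = branches.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree(r, l, remaining)
                   for value, (r, l) in branches.items()}}

# Hypothetical toy data loosely modeled on the slides (categorical attributes only).
rows = [
    {"Employed": "Yes", "Balance": ">=50K"},
    {"Employed": "No",  "Balance": ">=50K"},
    {"Employed": "No",  "Balance": "<50K"},
    {"Employed": "Yes", "Balance": "<50K"},
]
labels = ["Not Default", "Not Default", "Default", "Not Default"]
print(build_tree(rows, labels, ["Employed", "Balance"]))
```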