Decision Trees
[Figure: a classifier takes the attribute values X1 = x1, ..., XM = xM of an input data point and outputs a class prediction Y = y. The classifier is learned from training data.]
Decision Tree Example
Three variables:
Hair = {blond, dark}
Height = {tall,short}
Country = {Gromland, Polvia}
Training data:
(B,T,P), (B,T,P), (B,S,G), (B,S,G), (D,S,G), (D,T,G)
[Figure: decision tree for this data. The root (P:2 G:4) splits on Hair. The Hair = D node contains P:0 G:2 and is pure. The Hair = B node contains P:2 G:2 and splits on Height: the Height = T node contains P:2 G:0 and the Height = S node contains P:0 G:2.]
At each level of the tree, we split the data according to the value of one of the attributes. After enough splits, only one class is represented in the node: this is a terminal leaf of the tree. We call that class the output class for that node.
[Figure: the same tree, annotated to show that G is the output class for the pure Hair = D node.]
A new input is classified by following the tree all the way down to a leaf and reporting the output of that leaf. For example:
(B,T) is classified as P
(D,S) is classified as G
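To make the classification rule concrete, here is a minimal Python sketch (not part of the original slides) of this particular tree written as nested tests; the function name classify_country is an assumption.

def classify_country(hair, height):
    """Hand-built tree from the example: hair in {"B", "D"}, height in {"T", "S"}."""
    if hair == "D":        # Hair = D? -> node is pure, all Gromland
        return "G"
    if height == "T":      # Hair = B branch: Height = T? -> all Polvia
        return "P"
    return "G"             # Hair = B, Height = S -> all Gromland

print(classify_country("B", "T"))  # P
print(classify_country("D", "S"))  # G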
General Case (Discrete Attributes)
We have R observations
from training data
Each observation has M
attributes X1,..,XM
Each Xi can take N distinct
discrete values
Each observation has a
class attribute Y with C
distinct (discrete) values
Problem: Construct a
sequence of tests on the
attributes such that, given a
new input (x1,..,xM), the class
attribute y is correctly
predicted
[Figure: the training data is a table with rows Data 1, ..., Data R, each listing attribute values x1, ..., xM and a class value Y; the input data lists attribute values x1, ..., xM with an unknown class.]
X = attributes of training data (R × M), Y = classes of training data (R)
General Decision Tree (Discrete Attributes)
[Figure: generic tree. The root branches on the value of X1, with one child for each of its possible values (first through nth); interior nodes branch on the value of some attribute Xj; each leaf outputs a class, from Y = y1 to Y = yc.]
Decision Tree Example
[Figure: example with two continuous attributes, X1 and X2, each split at 0.5. The root node (7 examples of one class, 4 of the other) splits on X1 < 0.5 into a pure node (4:0) and a mixed node (3:4); the mixed node splits on X2 < 0.5 into pure nodes (3:0) and (0:4).]
A new input is classified by following the tree all the way down to a leaf and reporting the output of that leaf. For example, (0.2, 0.8) and (0.8, 0.2) are each classified as the class of the leaf they reach (shown in the figure).
General Case (Continuous Attributes)
We have R observations
from training data
Each observation has M
attributes X1,..,XM
Each Xi can now take continuous (real) values
Each observation has a
class attribute Y with C
distinct (discrete) values
Problem: Construct a
sequence of tests of the
form Xi < ti ? on the
attributes such that, given
a new input (x1,..,xM), the
class attribute y is correctly
predicted
[Figure: the training data is a table with rows Data 1, ..., Data R, each listing attribute values x1, ..., xM and a class value Y; the input data lists attribute values x1, ..., xM with an unknown class.]
X = attributes of training data (R × M), Y = classes of training data (R)
General Decision Tree (Continuous Attributes)
[Figure: generic tree. The root tests X1 < t1?; interior nodes test Xj < tj?; each leaf outputs a class, from Y = y1 to Y = yc.]
Basic Questions
How to choose the attribute/value to split
on at each level of the tree?
When to stop splitting? When should a
node be declared a leaf?
If a leaf node is impure, how should the
class label be assigned?
If the tree is too large, how can it be
pruned?
How to choose the attribute/value to split on
at each level of the tree?
Two classes (red circles/green crosses)
Two attributes: X1 and X2
11 points in training data
Idea: construct a decision tree such that the leaf nodes correctly predict the class of all the training examples.
How to choose the attribute/value to split on
at each level of the tree?
[Figure: two candidate splits of the same training data, one labeled Good and one labeled Bad.]
Good split: one child node is pure (only one class left, no ambiguity in the class label) and the other is almost pure (little ambiguity in the class label).
Bad split: the child nodes contain a mixture of classes and do not disambiguate between the classes.
We want to find the most compact, smallest-size tree (Occam's razor) that classifies the training data correctly. We want to find the split choices that get us to pure nodes the fastest.
Digression: Information Content
Suppose that we are dealing with data which can come from four possible values (A, B, C, D).
Each class may appear with some probability.
Suppose P(A) = P(B) = P(C) = P(D) = 1/4.
What is the average number of bits necessary to encode each class?
In this case the average is 2 bits: 2×P(A) + 2×P(B) + 2×P(C) + 2×P(D) = 2, using the codes A = 00, B = 01, C = 10, D = 11.
The distribution is not very informative (impure).
Information Content
Suppose now P(A) = 1/2, P(B) = 1/4, P(C) = 1/8, P(D) = 1/8.
What is the average number of bits necessary to encode each class?
In this case the classes can be encoded using 1.75 bits on average, with the codes A = 0, B = 10, C = 110, D = 111:
Average = 1×P(A) + 2×P(B) + 3×P(C) + 3×P(D) = 1.75
The distribution is more informative (higher purity).
Entropy
In general, the average number of bits necessary to encode n values is the entropy:
H = -Σ_{i=1..n} Pi log2(Pi)
where Pi is the probability of occurrence of value i.
High entropy: all the classes are (nearly) equally likely.
Low entropy: a few classes are likely; most of the classes are rarely observed.
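As a small illustration (not part of the slides), here is a minimal Python sketch of this formula; the helper name entropy_from_counts is an assumption and is reused in later sketches.

import math

def entropy_from_counts(counts):
    """Entropy (in bits) of a class distribution given raw class counts, e.g. [1, 6]."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c == 0:          # 0 * log(0) is taken to be 0
            continue
        p = c / total
        h -= p * math.log2(p)
    return h

# Uniform distribution over four classes -> 2 bits
print(entropy_from_counts([1, 1, 1, 1]))   # 2.0
# Skewed distribution (1/2, 1/4, 1/8, 1/8) -> 1.75 bits
print(entropy_from_counts([4, 2, 1, 1]))   # 1.75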
Entropy
[Figure: two class-frequency histograms. A nearly uniform histogram has high entropy; a histogram concentrated on a few classes has low entropy.]
The entropy captures the degree of purity of the distribution.
Example Entropy Calculation
Node (1): NA = 1, NB = 6 (number of class A and class B examples in the node)
pA = NA/(NA+NB) = 1/7, pB = NB/(NA+NB) = 6/7 (frequencies of occurrence of each class in the node)
H1 = -pA log2(pA) - pB log2(pB) = 0.59 (entropy of node (1))
Node (2): NA = 3, NB = 2
pA = NA/(NA+NB) = 3/5, pB = NB/(NA+NB) = 2/5
H2 = -pA log2(pA) - pB log2(pB) = 0.97
H1 < H2, so node (2) is less pure than node (1).
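Using the hypothetical entropy_from_counts helper sketched earlier, the two values can be checked directly:

print(round(entropy_from_counts([1, 6]), 2))  # 0.59 -> node (1)
print(round(entropy_from_counts([3, 2]), 2))  # 0.97 -> node (2)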
Conditional Entropy
Entropy before splitting: H
After splitting, a fraction PL of the data goes to the left node, which has entropy HL.
After splitting, a fraction PR of the data goes to the right node, which has entropy HR.
(PL is the probability that a random input is directed to the left node; HL is the entropy of the left node, and similarly for PR and HR.)
The average entropy after splitting is:
HL × PL + HR × PR
Information Gain
[Figure: a parent node with entropy H split into a left child (fraction PL of the data, entropy HL) and a right child (fraction PR, entropy HR).]
We want nodes as pure as possible, so we want to reduce the entropy as much as possible.
We want to maximize the difference between the entropy of the parent node and the expected entropy of the children. The information gain (IG) is the amount by which the ambiguity is decreased by splitting the node.
Maximize:
IG = H - (HL × PL + HR × PR)
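A minimal Python sketch of this criterion (an assumption, not code from the slides), reusing the hypothetical entropy_from_counts helper; the name information_gain and its count-based arguments are assumptions.

def information_gain(left_counts, right_counts):
    """IG = H(parent) - (P_L * H(left) + P_R * H(right)).

    left_counts / right_counts: per-class counts in each child after the split;
    the parent counts are their element-wise sum.
    """
    parent_counts = [l + r for l, r in zip(left_counts, right_counts)]
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    h_after = (n_left / n) * entropy_from_counts(left_counts) \
            + (n_right / n) * entropy_from_counts(right_counts)
    return entropy_from_counts(parent_counts) - h_after

# Illustrative counts only: a split that isolates one class gains much more
# than one that leaves both children mixed.
print(round(information_gain([4, 0], [1, 6]), 2))  # ~0.62
print(round(information_gain([3, 2], [2, 4]), 2))  # ~0.05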
Notations
Entropy: H(Y) = entropy of the distribution of classes at a node
Conditional entropy:
Discrete: H(Y|Xj) = average entropy after splitting with respect to variable Xj
Continuous: H(Y|Xj,t) = average entropy after splitting with respect to variable Xj with threshold t
Information gain:
Discrete: IG(Y|Xj) = H(Y) - H(Y|Xj) = reduction in entropy after splitting with respect to variable Xj
Continuous: IG(Y|Xj,t) = H(Y) - H(Y|Xj,t) = reduction in entropy after splitting with respect to variable Xj with threshold t
Example: the parent node (11 training points) has entropy H = 0.99. Two candidate splits are compared.
Split 1 sends 4/11 of the data to a pure left child (HL = 0) and 7/11 to a right child with HR = 0.58, so IG = H - (HL × 4/11 + HR × 7/11).
Split 2 sends 5/11 of the data to a left child with HL = 0.97 and 6/11 to a right child with HR = 0.92, so IG = H - (HL × 5/11 + HR × 6/11).
Split 1 gives IG = 0.62; split 2 gives IG = 0.052. Choose split 1 because its information gain is greater than with the other split.
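The two numbers can be reproduced (up to rounding of the intermediate entropies) directly from the quantities given above:

H = 0.99
ig_split1 = H - (0.0  * 4/11 + 0.58 * 7/11)   # pure left child
ig_split2 = H - (0.97 * 5/11 + 0.92 * 6/11)   # both children still mixed
print(round(ig_split1, 2), round(ig_split2, 2))  # 0.62 0.05 (slide reports 0.62 and 0.052)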
More Complete Example
20 training examples from class A and 20 training examples from class B (two kinds of markers in the figure).
Attributes: the X1 and X2 coordinates.
[Plot: information gain as a function of the X1 split value.]
Best split value (max information gain) for the X1 attribute: 0.24, with IG = 0.138.
[Plot: information gain as a function of the X2 split value.]
Best split value (max information gain) for the X2 attribute: 0.234, with IG = 0.202.
Best X1 split: 0.24, IG = 0.138. Best X2 split: 0.234, IG = 0.202.
Split on X2 at 0.234.
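A minimal Python sketch of this best-split search over one attribute (not from the slides), reusing the hypothetical information_gain helper; the name best_threshold and its arguments are assumptions.

def best_threshold(xs, ys):
    """Scan candidate thresholds on one attribute and return (t*, max IG).

    xs: values of that attribute for the training points; ys: their class labels.
    Candidate thresholds are midpoints between consecutive distinct sorted values.
    """
    classes = sorted(set(ys))
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_t, best_ig = None, -1.0
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue                      # identical values cannot be separated
        t = (lo + hi) / 2.0
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        ig = information_gain([left.count(c) for c in classes],
                              [right.count(c) for c in classes])
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig

Running this once per attribute and keeping the split with the largest IG is exactly the per-attribute comparison made above.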
Splitting on X2 at 0.234: one of the resulting nodes contains only data from a single class, so there is no point in splitting it further; return it as a leaf node with output A.
The other node is not pure, so we need to split it further.
[Plot: information gain as a function of the X1 split value for the impure node's data.]
Best split value (max information gain) for the X1 attribute: 0.22, with IG ≈ 0.182.
[Plot: information gain as a function of the X2 split value for the impure node's data.]
Best split value (max information gain) for the X2 attribute: 0.75, with IG ≈ 0.353.
Best X1 split: 0.22, IG = 0.182. Best X2 split: 0.75, IG = 0.353.
Split on X2 at 0.75.
Again, one of the resulting nodes contains only data from a single class; there is no point in splitting it further, so return it as a leaf node with output A. The remaining impure node is split on X1 at 0.5.
Final decision tree
[Figure: the final tree and the corresponding axis-aligned partition of the (X1, X2) plane into regions labeled A and B.]
Each of the leaf nodes is pure: it contains data from only one class.
Final decision tree
Given an input (X1, X2), follow the tree down to a leaf and return the corresponding output class for that leaf.
Example: (X1, X2) = (0.5, 0.5).
Basic Questions
How to choose the attribute/value to split
on at each level of the tree?
When to stop splitting? When should a
node be declared a leaf?
If a leaf node is impure, how should the
class label be assigned?
If the tree is too large, how can it be
pruned?
Pure and Impure Leaves and When to Stop Splitting
All the data in the node comes from a single class: we declare the node to be a leaf and stop splitting. This leaf will output the class of the data it contains.
Several data points have exactly the same attributes even though they are not from the same class: we cannot split any further. We still declare the node to be a leaf, but it will output the class that holds the majority among the classes in the node (in this example, B).
Decision Tree Algorithm (Continuous Attributes)
LearnTree(X, Y)
Input:
A set X of R training vectors, each containing the values (x1,..,xM) of the M attributes (X1,..,XM)
A vector Y of R elements, where yj = class of the jth datapoint
If all the datapoints in X have the same class value y:
Return a leaf node that predicts y as output
If all the datapoints in X have the same attribute values (x1,..,xM):
Return a leaf node that predicts the majority of the class values in Y as output
Try all the possible attributes Xj and thresholds t, and choose the pair (j*, t*) for which IG(Y|Xj,t) is maximum
XL, YL = the datapoints for which xj* < t*, with their classes
XH, YH = the datapoints for which xj* >= t*, with their classes
Left Child = LearnTree(XL, YL)
Right Child = LearnTree(XH, YH)
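A minimal Python sketch of this recursive procedure (an illustration, not the slides' code), reusing the hypothetical best_threshold helper and representing nodes as plain dicts; all names are assumptions.

from collections import Counter

def learn_tree(X, Y):
    """Build a decision tree for continuous attributes.

    X: list of R attribute vectors (each a list of M numbers); Y: list of R class labels.
    Returns {'leaf': class} or {'attr': j, 'thresh': t, 'left': ..., 'right': ...}.
    """
    if len(set(Y)) == 1:                       # all datapoints share the same class
        return {'leaf': Y[0]}
    if all(x == X[0] for x in X):              # identical attributes: predict majority class
        return {'leaf': Counter(Y).most_common(1)[0][0]}
    # Choose the attribute j* and threshold t* with maximum information gain.
    best = None
    for j in range(len(X[0])):
        t, ig = best_threshold([x[j] for x in X], Y)
        if t is not None and (best is None or ig > best[2]):
            best = (j, t, ig)
    j_star, t_star, _ = best
    left  = [i for i, x in enumerate(X) if x[j_star] <  t_star]
    right = [i for i, x in enumerate(X) if x[j_star] >= t_star]
    return {'attr': j_star, 'thresh': t_star,
            'left':  learn_tree([X[i] for i in left],  [Y[i] for i in left]),
            'right': learn_tree([X[i] for i in right], [Y[i] for i in right])}

Classifying a new input then just walks the dicts: follow 'left' when the input's j-th attribute is below the node's threshold, 'right' otherwise, until a 'leaf' entry is reached.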
Decision Tree Algorithm (Discrete Attributes)
LearnTree(X, Y)
Input:
A set X of R training vectors, each containing the values (x1,..,xM) of the M attributes (X1,..,XM)
A vector Y of R elements, where yj = class of the jth datapoint
If all the datapoints in X have the same class value y:
Return a leaf node that predicts y as output
If all the datapoints in X have the same attribute values (x1,..,xM):
Return a leaf node that predicts the majority of the class values in Y as output
Try all the possible attributes Xj and choose the one, j*, for which IG(Y|Xj) is maximum
For every possible value v of Xj*:
Xv, Yv = the datapoints for which xj* = v, with their classes
Childv = LearnTree(Xv, Yv)
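For the discrete case, the only change is the split itself: one child per attribute value instead of a left/right pair, scored by IG(Y|Xj). A minimal sketch of that score (an assumption, reusing the hypothetical entropy_from_counts helper):

def discrete_information_gain(xs, ys):
    """IG(Y|Xj) = H(Y) - average entropy of the children after a multiway split on Xj.

    xs: values of attribute Xj for the training points; ys: their class labels.
    """
    classes = sorted(set(ys))
    h_parent = entropy_from_counts([ys.count(c) for c in classes])
    h_after = 0.0
    for v in set(xs):
        ys_v = [y for x, y in zip(xs, ys) if x == v]
        h_after += (len(ys_v) / len(ys)) * entropy_from_counts([ys_v.count(c) for c in classes])
    return h_parent - h_after

The recursive construction then mirrors the learn_tree sketch above, with a dict of children keyed by attribute value.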
Decision Trees So Far
Given R observations from training data, each
with M attributes X and a class attribute Y,
construct a sequence of tests (decision tree) to
predict the class attribute Y from the attributes X
Basic strategy for defining the tests (where to split): maximize the information gain on the training data set at each node of the tree.
Problems (next):
Computational issues: how expensive is it to compute the IG?
The tree will end up being much too big: pruning.
Evaluating the tree on training data is dangerous: overfitting.