Decision Trees

Representation of Concepts
Concept learning: conjunction of attributes
  (Sunny AND Hot AND Humid AND Windy) +
Decision trees: disjunctions of conjunctions of attributes
  (Sunny AND Normal) OR (Overcast) OR (Rain AND Weak) +
More powerful representation
Larger hypothesis space H
Can be represented as a tree
Common form of decision making in humans

[Figure: "OutdoorSport" decision tree. Outlook: sunny → Humidity (high → No, normal → Yes); overcast → Yes; rain → Wind (strong → No, weak → Yes)]
Rectangle Learning

[Figure: Conjunctions correspond to a single axis-parallel rectangle separating + instances from - instances; disjunctions of conjunctions correspond to a union of such rectangles.]
Training Examples

Day   Outlook    Temp   Humidity   Wind     Tennis?
D1    Sunny      Hot    High       Weak     No
D2    Sunny      Hot    High       Strong   No
D3    Overcast   Hot    High       Weak     Yes
D4    Rain       Mild   High       Weak     Yes
D5    Rain       Cool   Normal     Weak     Yes
D6    Rain       Cool   Normal     Strong   No
D7    Overcast   Cool   Normal     Strong   Yes
D8    Sunny      Mild   High       Weak     No
D9    Sunny      Cool   Normal     Weak     Yes
D10   Rain       Mild   Normal     Weak     Yes
D11   Sunny      Mild   Normal     Strong   Yes
D12   Overcast   Mild   High       Strong   Yes
D13   Overcast   Hot    Normal     Weak     Yes
D14   Rain       Mild   High       Strong   No
Decision Trees

Decision trees represent learned target functions
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
Can be represented by logical formulas

[Figure: decision tree. Outlook: sunny → Humidity (high → No, normal → Yes); overcast → Yes; rain → Wind (strong → No, weak → Yes)]
Representation in decision trees

Example of representing a rule in DTs:
  if outlook = sunny AND humidity = normal
  OR if outlook = overcast
  OR if outlook = rain AND wind = weak
  then PlayTennis = Yes
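These rules can be written directly as nested conditionals. A minimal Python sketch (the function name and the lowercase string values are illustrative, not from the slides):

```python
def play_tennis(outlook, humidity, wind):
    """The learned PlayTennis tree written as nested conditionals."""
    if outlook == "sunny":
        return humidity == "normal"   # normal -> Yes, high -> No
    if outlook == "overcast":
        return True                   # always Yes
    if outlook == "rain":
        return wind == "weak"         # weak -> Yes, strong -> No
    raise ValueError(f"unknown outlook: {outlook!r}")

print(play_tennis("sunny", "normal", "weak"))   # True (PlayTennis = Yes)
```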
Applications of Decision Trees
Instances describable by a fixed set of attributes and their values
Target function is discrete valued
2-valued
N-valued
But can approximate continuous functions
Disjunctive hypothesis space
Possibly noisy training data
Errors, missing values, etc.
Examples:
Equipment or medical diagnosis
Credit risk analysis
Calendar scheduling preferences
Decision Trees

Given a distribution of training instances, draw axis-parallel lines to separate the instances of each class.

[Figure: scatter of + and - instances in the (Attribute 1, Attribute 2) plane]
Decision Tree Structure

Draw axis-parallel lines to separate the instances of each class.

[Figure: the same (Attribute 1, Attribute 2) scatter, now partitioned by axis-parallel splits]
Decision Tree Structure

Decision nodes (splits): each decision node corresponds to a condition on one attribute.
Decision leaf: a box, i.e., the collection of examples satisfying the conditions on the path to it.
* Alternate splits are possible.

[Figure: the scatter partitioned by splits at Attribute 1 = 20, Attribute 1 = 40, and Attribute 2 = 30]
Decision Tree Construction
Find the best structure
Given a training data set
Top-Down Construction

Start with an empty tree
Main loop:
1. Select the best decision attribute (A) for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, STOP; else iterate over the new leaf nodes

Grow the tree just deep enough for perfect classification, if possible (or approximate at a chosen depth).
Which attribute is best? (A sketch of this loop is given below.)
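A minimal Python sketch of the top-down loop, under the assumption that each training example is a dict of attribute values plus a "label" entry, and that a scoring function such as information gain (defined later) is passed in. The `Node` and `build_tree` names are illustrative:

```python
from collections import Counter

class Node:
    """A decision node (attribute test) or a leaf (class label)."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # attribute tested at this node (None for leaves)
        self.label = label           # class label (leaves only)
        self.children = {}           # attribute value -> subtree

def majority_label(examples):
    return Counter(ex["label"] for ex in examples).most_common(1)[0][0]

def build_tree(examples, attributes, score):
    """Top-down construction: pick the best attribute, split, and recurse."""
    labels = {ex["label"] for ex in examples}
    if len(labels) == 1:                         # perfectly classified: STOP
        return Node(label=labels.pop())
    if not attributes:                           # nothing left to split on
        return Node(label=majority_label(examples))
    best = max(attributes, key=lambda a: score(examples, a))      # 1. best attribute
    node = Node(attribute=best)                                   # 2. assign it to the node
    for value in {ex[best] for ex in examples}:                   # 3. one branch per value
        subset = [ex for ex in examples if ex[best] == value]     # 4. sort examples to leaves
        node.children[value] = build_tree(                        # 5. iterate on new leaves
            subset, [a for a in attributes if a != best], score)
    return node
```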
Best Attribute to Split?

[Figure: the (Attribute 1, Attribute 2) scatter again; which axis-parallel split should be made first?]
Best Attribute to Split?

[Figure: a candidate split, "Attribute 1 > 40?", drawn on the scatter]
Best Attribute to Split?

[Figure: the scatter after making the split "Attribute 1 > 40?"]
Which split to make next?

[Figure: after "Attribute 1 > 40?", one side is already a pure box/node (a pure leaf, no further need to split); the mixed box/node is split further with "Attribute 1 > 20?"]
Which split to make next?

[Figure: the remaining mixed box is split with "A2 > 30?"; pure leaves need no further splitting]
Principle of Decision Tree Construction

Ultimately we want to form pure leaves (correct classification).
Greedy approach to reach correct classification:
1. Initially treat the entire data set as a single box
2. For each box, choose the split that reduces its impurity (in terms of class labels) by the maximum amount
3. Split the box having the highest reduction in impurity
4. Return to Step 2
5. Stop when all boxes are pure
Choosing Best Attribute?

Consider a set of 64 examples, [29+, 35-]. Which split is better?

  A1: [29+, 35-] → [25+, 5-] and [4+, 30-]
  A2: [29+, 35-] → [14+, 16-] and [15+, 19-]

Which is better?

  A1: [29+, 35-] → [21+, 5-] and [8+, 30-]
  A2: [29+, 35-] → [18+, 33-] and [11+, 2-]
Entropy

A measure of
  uncertainty
  purity
  information content
Information theory: an optimal-length code assigns (-log2 p) bits to a message having probability p
S is a sample of training examples
  p+ is the proportion of positive examples in S
  p- is the proportion of negative examples in S
Entropy of S: the expected number of bits needed, under an optimal code, to encode the class (+ or -) of a randomly drawn member of S

  Entropy(S) = -p+ log2(p+) - p- log2(p-)

Can be generalized to more than two values
Entropy

Entropy can also be viewed as measuring
  purity of S,
  uncertainty in S,
  information in S
E.g., the entropy is 0 for p+ = 1, 0 for p+ = 0, and 1 for p+ = 0.5
Easy generalization to more than binary values:
  Entropy = sum over i of pi * (-log2 pi), i = 1..n
  i is + or - for the binary case
  i varies from 1 to n in the general case
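A small Python sketch of the entropy just defined, in its multi-class form, checked against the three cases quoted above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a sample, given the sequence of its class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

print(entropy(["+"] * 10))              # 0.0 (pure sample; Python may print -0.0)
print(entropy(["-"] * 10))              # 0.0 (pure sample)
print(entropy(["+"] * 5 + ["-"] * 5))   # 1.0 (maximally mixed)
```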
Choosing Best Attribute?

Consider the same examples and compute the entropies. Which split is better?

  A1: E(S) = 0.993; [29+, 35-] → [25+, 5-] (E = 0.650) and [4+, 30-] (E = 0.522)
  A2: E(S) = 0.993; [29+, 35-] → [14+, 16-] (E = 0.997) and [15+, 19-] (E = 0.989)

Which is better?

  A1: E(S) = 0.993; [29+, 35-] → [21+, 5-] (E = 0.708) and [8+, 30-] (E = 0.742)
  A2: E(S) = 0.993; [29+, 35-] → [18+, 33-] (E = 0.937) and [11+, 2-] (E = 0.619)
Information Gain

Gain(S, A): the reduction in entropy after choosing attribute A

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

  A1: E(S) = 0.993; [29+, 35-] → [25+, 5-] (E = 0.650) and [4+, 30-] (E = 0.522)    Gain: 0.411
  A2: E(S) = 0.993; [29+, 35-] → [14+, 16-] (E = 0.997) and [15+, 19-] (E = 0.989)  Gain: 0.000

  A1: E(S) = 0.993; [29+, 35-] → [21+, 5-] (E = 0.708) and [8+, 30-] (E = 0.742)    Gain: 0.265
  A2: E(S) = 0.993; [29+, 35-] → [18+, 33-] (E = 0.937) and [11+, 2-] (E = 0.619)   Gain: 0.121
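Reusing the `entropy` helper from the sketch above, information gain can be computed directly; this reproduces the first comparison (A1 vs. A2 on the [29+, 35-] set):

```python
def information_gain(parent_labels, branches):
    """Gain(S, A) = entropy(S) minus the size-weighted entropy of the branches.
    `branches` is one list of labels per value of the attribute."""
    total = len(parent_labels)
    weighted = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent_labels) - weighted

parent = ["+"] * 29 + ["-"] * 35
a1 = [["+"] * 25 + ["-"] * 5, ["+"] * 4 + ["-"] * 30]
a2 = [["+"] * 14 + ["-"] * 16, ["+"] * 15 + ["-"] * 19]
print(round(information_gain(parent, a1), 3))   # ~0.411
print(round(information_gain(parent, a2), 3))   # ~0.000
```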
Gain function

Gain is a measure of how much we can reduce uncertainty
  Its value lies between 0 and 1
What is the significance of
  a gain of 0?
    Example: a 50/50 split of +/- both before and after discriminating on the attribute's values
  a gain of 1?
    Example: going from perfect uncertainty to perfect certainty after splitting on a perfectly predictive attribute
Gain finds patterns in the TEs relating to attribute values
It moves us toward a locally minimal representation of the TEs
Training Examples

Day   Outlook    Temp   Humidity   Wind     Tennis?
D1    Sunny      Hot    High       Weak     No
D2    Sunny      Hot    High       Strong   No
D3    Overcast   Hot    High       Weak     Yes
D4    Rain       Mild   High       Weak     Yes
D5    Rain       Cool   Normal     Weak     Yes
D6    Rain       Cool   Normal     Strong   No
D7    Overcast   Cool   Normal     Strong   Yes
D8    Sunny      Mild   High       Weak     No
D9    Sunny      Cool   Normal     Weak     Yes
D10   Rain       Mild   Normal     Weak     Yes
D11   Sunny      Mild   Normal     Strong   Yes
D12   Overcast   Mild   High       Strong   Yes
D13   Overcast   Hot    Normal     Weak     Yes
D14   Rain       Mild   High       Strong   No
Determine the Root Attribute

S: [9+, 5-]

  Humidity: High → [3+, 4-], Normal → [6+, 1-]
  Wind:     Weak → [6+, 2-], Strong → [3+, 3-]

  Gain(S, Humidity) = 0.151
  Gain(S, Wind) = 0.048
  Gain(S, Outlook) = 0.246
  Gain(S, Temp) = 0.029

Outlook has the highest gain, so it becomes the root.
Sort the Training Examples

S: [9+, 5-], examples D1, ..., D14, split on Outlook:

  Sunny:    {D1, D2, D8, D9, D11}   [2+, 3-]  → ?
  Overcast: {D3, D7, D12, D13}      [4+, 0-]  → Yes
  Rain:     {D4, D5, D6, D10, D14}  [3+, 2-]  → ?

Ssunny = {D1, D2, D8, D9, D11}
  Gain(Ssunny, Humidity) = .970
  Gain(Ssunny, Temp) = .570
  Gain(Ssunny, Wind) = .019
Final Decision Tree for Example

[Tree: Outlook. Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes)]
Hypothesis Space Search (ID3)
Hypothesis space (all possible trees) is complete!
The target function is included in it
Hypothesis Space Search in Decision Trees
Conduct a search of the space of decision trees which
can represent all possible discrete functions.
Goal: to find the best decision tree
Finding a minimal decision tree consistent with a set of data
is NP-hard.
Perform a greedy heuristic search: hill climbing without
backtracking
Statistics-based decisions using all data
Hypothesis Space Search by ID3
Hypothesis space is complete!
H is space of all finite DTs (all discrete functions)
Target function is included in there
Simple to complex hill-climbing search of H
Use of gain as hill-climbing function
Outputs a single hypothesis (which one?)
Cannot assess all hypotheses consistent with D (usually many)
Analogy to breadth first search
Examines all trees of given depth and chooses best
No backtracking
Locally optimal ...
Statistics-based search choices
Use all TEs at each step
Robust to noisy data
Restriction bias vs. Preference bias
Restriction bias (or Language bias)
Incomplete hypothesis space
Preference (or search) bias
Incomplete search strategy
Candidate Elimination has restriction bias
ID3 has preference bias
In most cases, we have both a restriction and a
preference bias.
Inductive Bias in ID3
Preference for short trees, and for those with high
information gain attributes near the root
Principle of Occam's razor
prefer the shortest hypothesis that fits the data
Justification
Smaller likelihood of a short hypothesis fitting the data
at random
Problems
Other ways to reduce random fits to data
Size of hypothesis based on the data representation
Minimum description length principle
Overfitting the Data

Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance.
- There may be noise in the training data that the tree is fitting
- The algorithm might be making decisions based on very little data

A hypothesis h is said to overfit the training data if there is another hypothesis h' such that h has smaller error than h' on the training data, but h has larger error than h' on the test data.

[Figure: accuracy vs. complexity of tree; accuracy on training keeps improving while accuracy on testing peaks and then degrades]
Overfitting

A very deep tree is required to fit just one odd training example.

[Figure: the (Attribute 1, Attribute 2) scatter with a single mislabeled "-" inside a + region]
When to stop splitting further?

A very deep tree is required to fit just one odd training example.

[Figure: the same scatter; splitting further would only serve to isolate the single odd example]
Overfitting in Decision Trees

Consider adding a noisy training example (it should be +):

  Day   Outlook   Temp   Humidity   Wind     Tennis?
  D15   Sunny     Hot    Normal     Strong   No

What effect does it have on the earlier tree?

[Tree: Outlook. Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes)]
Overfitting - Example

Noise or other coincidental regularities cause extra structure to be grown:

[Tree: Outlook.
  Sunny {D1, D2, D8, D9, D11} [2+, 3-] → Humidity: High → No; Normal → Wind (Strong → No, Weak → Yes), an extra split grown to fit the noisy D15
  Overcast {D3, D7, D12, D13} [4+, 0-] → Yes
  Rain {D4, D5, D6, D10, D14} [3+, 2-] → Wind: Strong → No, Weak → Yes]
Avoiding Overfitting
Two basic approaches
- Prepruning: Stop growing the tree at some point during
construction when it is determined that there is not enough
data to make reliable choices.
- Postpruning: Grow the full tree and then remove nodes
that seem not to have sufficient evidence. (more popular)
Methods for evaluating subtrees to prune:
- Cross-validation: Reserve hold-out set to evaluate utility (more popular)
- Statistical testing: Test whether the observed regularity can be dismissed as likely to have occurred by chance
- Minimum Description Length: Is the additional complexity of the hypothesis smaller than remembering the exceptions?
This is related to the notion of regularization that we will see in other contexts: keep the hypothesis simple.
Reduced-Error Pruning

A post-pruning, cross-validation approach
- Partition the training data into a grow set and a validation set.
- Build a complete tree from the grow set
- Until accuracy on the validation set decreases, do:
    For each non-leaf node in the tree:
      Temporarily prune the subtree below it; replace it with a leaf labeled by majority vote.
      Test the accuracy of the hypothesis on the validation set
    Permanently prune the node whose removal gives the greatest increase in accuracy on the validation set.
Problem: uses less data to construct the tree
Sometimes done at the level of rules
General strategy: overfit and simplify
Rule post-pruning
Allow tree to grow until best fit (allow overfitting)
Convert tree to equivalent set of rules
One rule per leaf node
Prune each rule independently of others
Remove various preconditions to improve
performance
Sort final rules into desired sequence for use
Example of rule post-pruning

IF (Outlook = Sunny) ^ (Humidity = High)
  THEN PlayTennis = No
IF (Outlook = Sunny) ^ (Humidity = Normal)
  THEN PlayTennis = Yes

[Tree: Outlook. Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes)]
Extensions of basic algorithm
Continuous valued attributes
Attributes with many values
TEs with missing data
Attributes with associated costs
Other impurity measures
Regression tree
Continuous Valued Attributes

Create a discrete attribute from a continuous variable
  E.g., define a critical Temperature = 82.5
Candidate thresholds
  chosen by the gain function
  can have more than one threshold
  typically placed where the class label changes, e.g. (48+60)/2 and (80+90)/2 below

  Temp:     40   48   60   72   80   90
  Tennis?   N    N    Y    Y    Y    N
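A sketch of threshold selection for a continuous attribute, assuming the `entropy` and `information_gain` helpers from the earlier sketches are in scope:

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(a[0] + b[0]) / 2 for a, b in zip(pairs, pairs[1:]) if a[1] != b[1]]

def best_threshold(values, labels):
    """Pick the candidate threshold with the highest information gain."""
    def gain_at(t):
        below = [l for v, l in zip(values, labels) if v <= t]
        above = [l for v, l in zip(values, labels) if v > t]
        return information_gain(list(labels), [below, above])
    return max(candidate_thresholds(values, labels), key=gain_at)

temps  = [40, 48, 60, 72, 80, 90]
tennis = ["N", "N", "Y", "Y", "Y", "N"]
print(candidate_thresholds(temps, tennis))   # [54.0, 85.0]
print(best_threshold(temps, tennis))         # 54.0 (higher gain than 85.0)
```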
Attributes with Many Values

Problem:
  If an attribute has many values, Gain will tend to select it (why?)
  E.g., a birthdate attribute
    365 possible values
    Likely to discriminate well on a small sample
  For a sample of fixed size n and an attribute with N values, as N → infinity:
    ni/N → 0
    -pi * log pi → 0 for all i, so the entropy of each partition → 0
  Hence gain approaches its maximum value
Attributes with many values

Problem: Gain will select attributes with many values
One approach: use GainRatio instead

  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

  SplitInformation(S, A) = - Σ_{i=1..c} (|Si| / |S|) log2(|Si| / |S|)

where Si is the subset of S for which A has value vi.
SplitInformation is the entropy of the partitioning itself; it penalizes a higher number of partitions.
(Example: if |Si| / |S| = 1/N for all i, then SplitInformation = log2 N.)
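A sketch of the two quantities above, again reusing `information_gain` from the earlier sketch:

```python
from math import log2

def split_information(branch_sizes):
    """Entropy of the partitioning itself: -sum (|Si|/|S|) log2(|Si|/|S|)."""
    total = sum(branch_sizes)
    return -sum((s / total) * log2(s / total) for s in branch_sizes if s)

def gain_ratio(parent_labels, branches):
    """Gain normalized by split information; penalizes many-valued attributes."""
    return information_gain(parent_labels, branches) / split_information(
        [len(b) for b in branches])

print(split_information([4, 4, 4, 4]))   # 2.0 (= log2 4, the uniform-split case above)
```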
Unknown Attribute Values

What if some examples are missing values of attribute A?
Use the training example anyway and sort it through the tree; if node n tests A, either
  assign the most common value of A among the other examples sorted to node n, or
  assign the most common value of A among the other examples with the same target value, or
  assign probability pi to each possible value vi of A and assign fraction pi of the example to each descendant in the tree
Classify test instances with missing values in the same fashion
Used in C4.5
Attributes with Costs

Consider
  medical diagnosis: BloodTest has a cost of $150, Pulse has a cost of $5
  robotics: Width-From-1ft has a cost of 23 sec., from 2 ft a cost of 10 sec.
How do we learn a consistent tree with low expected cost?
Replace gain by a cost-sensitive measure:

  Tan and Schlimmer (1990):  Gain^2(S, A) / Cost(A)

  Nunez (1988):  (2^Gain(S, A) - 1) / (Cost(A) + 1)^w

where w determines the importance of cost
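The two cost-sensitive measures transcribe directly into a small sketch (here `gain` and `cost` are plain numbers):

```python
def tan_schlimmer(gain, cost):
    """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w):
    """Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w weighting the cost."""
    return (2 ** gain - 1) / (cost + 1) ** w
```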
Gini Index

Another sensible measure of impurity (i and j are classes):

  Gini(S) = Σ_{i ≠ j} pi pj = 1 - Σ_i pi^2

After applying attribute A, the resulting Gini index is

  Gini_A(S) = Σ_{v ∈ Values(A)} (|Sv| / |S|) Gini(Sv)

Gini can be interpreted as the expected error rate
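A minimal sketch of the Gini impurity and the corresponding Gini gain used in the table a few slides below, in the same style as the entropy helpers:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (an expected error rate)."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_gain(parent_labels, branches):
    """Reduction in Gini impurity after a split (the GiniGain column below)."""
    total = len(parent_labels)
    return gini(parent_labels) - sum(len(b) / total * gini(b) for b in branches)
```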
Gini Index

Attributes: color, border, dot
Classification: triangle, square

[Figure: the example objects]
Gini Index for Color

[Figure: the example objects partitioned by Color (red / green / yellow), with the Gini index of each partition]
Gain of Gini Index

[The Gini gain of a split: the Gini index of the parent set minus the size-weighted Gini indices of the partitions; see the GiniGain column in the next table.]
Three Impurity Measures

  A         Gain(A)   GainRatio(A)   GiniGain(A)
  Color     0.247     0.156          0.058
  Outline   0.152     0.152          0.046
  Dot       0.048     0.049          0.015
Regression Tree
Similar to classification
Use a set of attributes to predict the value (instead
of a class label)
Instead of computing information gain, compute
the sum of squared errors
Partition the attribute space into a set of
rectangular subspaces, each with its own predictor
The simplest predictor is a constant value
Rectilinear Division

A regression tree is a piecewise constant function of the input attributes.

[Figure: a regression tree with tests on X1 (thresholds t1, t3) and X2 (thresholds t2, t4), and the corresponding partition of the (X1, X2) plane into rectangular regions r1 ... r5, each with its own constant prediction]
Growing Regression Trees

To minimize the squared error on the learning sample, the prediction at a leaf is the average output of the learning cases reaching that leaf.

Impurity of a sample is defined by the variance of the output in that sample:

  I(LS) = var_{y|LS}{y} = E_{y|LS}{ (y - E_{y|LS}{y})^2 }

The best split is the one that reduces the variance the most:

  ΔI(LS, A) = var_{y|LS}{y} - Σ_a (|LS_a| / |LS|) var_{y|LS_a}{y}
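A sketch of the variance-based impurity and its reduction, with the mean as the squared-error-minimizing leaf prediction:

```python
def variance(ys):
    """Variance of the outputs in a sample: the regression-tree impurity I(LS)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(parent_ys, branches):
    """Delta I(LS, A): parent variance minus size-weighted variance of the branches."""
    total = len(parent_ys)
    return variance(parent_ys) - sum(len(b) / total * variance(b) for b in branches)

def leaf_prediction(ys):
    """The squared-error-minimizing prediction at a leaf is the mean output."""
    return sum(ys) / len(ys)
```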
Regression Tree Pruning
Exactly the same algorithms apply: pre-pruning
and post-pruning.
In post-pruning, the tree that minimizes the
squared error on VS is selected.
In practice, pruning is more important in regression because full trees are much more complex (often every object has a different output value, and hence the full tree has as many leaves as there are objects in the learning sample).
When Are Decision Trees Useful ?
Advantages
Very fast: can handle very large datasets with many
attributes
Flexible: several attribute types, classification and
regression problems, missing values
Interpretability: provide rules and attribute importance
Disadvantages
Instability of the trees (high variance)
Not always competitive with other algorithms in terms
of accuracy
History of Decision Tree Research

Hunt and colleagues in psychology used full-search decision tree methods to model human concept learning in the 1960s
Quinlan developed ID3, with the information-gain heuristic, in the late 1970s to learn expert systems from examples
Breiman, Friedman, and colleagues in statistics developed CART (Classification And Regression Trees) around the same time
A variety of improvements in the 1980s: coping with noise, continuous attributes, missing data, non-axis-parallel splits, etc.
Quinlan's updated algorithm, C4.5 (1993), is commonly used (newer: C5)
Boosting (or bagging) over DTs is a good general-purpose algorithm
Summary
Decision trees are practical for concept learning
Basic information measure and gain function for best first
search of space of DTs
ID3 procedure
search space is complete
Preference for shorter trees
Overfitting is an important issue with various solutions
Many variations and extensions possible
Software
In R:
Packages tree and rpart
C4.5:
http://www.cse.unwe.edu.au/~quinlan
Weka
http://www.cs.waikato.ac.nz/ml/weka