Module-2
Decision Tree Learning
• Introduction,
• Decision tree representation,
• Appropriate problems for decision tree learning,
• Issues in decision tree learning,
• Basic decision tree learning algorithm,
• Tree construction,
• Testing and storing the classifier,
• Example: using decision trees to predict contact lens type
Decision tree
Decision tree learning is a method for approximating discrete-valued target functions,
in which the learned function is represented by a decision tree.
(Figure: raw data being turned into a decision-making tree.)
Terminology used in Decision Tree
• Root Node: the topmost node; it represents the entire sample and is split first
• Splitting: dividing a node into two or more sub-nodes
• Decision Node: a sub-node that splits into further sub-nodes
• Leaf / Terminal Node: a node that is not split any further
• Pruning: removing sub-nodes of a decision node (the opposite of splitting)
• Branch / Sub-Tree: a subsection of the entire tree
• Parent and Child Node: a node that is divided into sub-nodes is the parent; the sub-nodes are its children
Decision tree representation
Convert the following Boolean functions into decision trees.
1. A Λ B (A AND B)
Truth table:
A  B  | A Λ B
T  T  |   T
T  F  |   F
F  T  |   F
F  F  |   F
Decision tree: test A at the root. The A = F branch is a leaf labelled F (the function is
false whenever A is false). The A = T branch tests B: the B = T branch is a leaf labelled T,
and the B = F branch is a leaf labelled F. (A small Python sketch of this tree follows the
exercise list below.)
2. A XOR B
3. A Λ ¬B
4. A V [B Λ C]
5. [A Λ B] V [C Λ D]
Appropriate problems for decision tree learning
o Instances are represented by attribute-value pairs
o The target function has discrete output values
o Disjunctive descriptions may be required
o The training data may contain errors
o The training data may contain missing attribute values
Problems:
Classify medical patients by their disease
Equipment malfunctions by their cause
Loan applicants by their likelihood of defaulting on payments
Entropy
Entropy:
Entropy is a measure of the impurity, disorder, or uncertainty in a
collection of examples.
• Example: S is a collection of 14 examples → [9+, 5-]
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14)
Entropy(S) = 0.940
• The entropy is 0 if all members of S belong to the same class.
For example, if all members are positive (taking 0 log2 0 = 0):
Entropy(S) = -(1) log2(1) - (0) log2(0) = 0
• The entropy is 1 when the collection contains an equal number of
positive and negative examples.
• If the collection contains unequal numbers of positive and negative
examples, the entropy is between 0 and 1.
• Entropy of S (p+ = proportion of positive examples, p- = proportion of negative examples):
Entropy(S) = - p+ log2 p+ - p- log2 p-
(More generally, for c classes: Entropy(S) = Σi - pi log2 pi.)
• Interpretation of entropy:
p+ = 1   → Entropy(S) = 0
p+ = 0.5 → Entropy(S) = 1
p+ = 0.8 → Entropy(S) = -0.8 log2 0.8 - 0.2 log2 0.2 ≈ 0.72
(Figure: entropy as a function of p+, equal to 0 at p+ = 0 and p+ = 1, with its maximum of 1 at p+ = 0.5.)
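A small Python sketch (an illustration, not from the slides) of the entropy calculation above; entropy([9, 5]) reproduces the 0.940 value.

# Minimal sketch: entropy of a collection given its class counts.
import math

def entropy(counts):
    """Entropy in bits of a collection with the given class counts, e.g. [9, 5]."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:                      # convention: 0 * log2(0) = 0
            p = c / total
            result -= p * math.log2(p)
    return result

print(round(entropy([9, 5]), 3))   # 0.94
print(entropy([14, 0]))            # 0.0  (all one class)
print(entropy([7, 7]))             # 1.0  (evenly split)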
INFORMATION GAIN MEASURES THE EXPECTED
REDUCTION IN ENTROPY
• Given entropy as a measure of the impurity in a collection of training examples,
we can now define a measure of the effectiveness of an attribute in classifying
the training data.
• The measure we will use, called information gain, is simply the expected
reduction in entropy caused by partitioning the examples according to this
attribute. More precisely, the information gain, Gain(S, A), of an attribute A,
relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) Entropy(Sv),  summed over v ∈ Values(A)
S → collection of examples, A → attribute, Values(A) → possible values of attribute A
Sv → subset of S for which attribute A has value v
Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of
attribute A. Put another way, Gain(S, A) is the information provided about the target
function value, given the value of some other attribute A.
Example:
• Suppose S is a collection of training-example days described by attributes
including Wind, which can have the values Weak or Strong.
• As before, assume S is a collection containing 14 examples, [9+, 5-].
• Of these 14 examples, suppose 6 of the positive and 2 of the negative
examples have Wind = Weak, and the remainder have Wind = Strong.
The information gain due to sorting the original 14 examples by the attribute
Wind may then be calculated as Gain(S, Wind):
Values(Wind) = { Weak, Strong }
S = [9+, 5-]
Sweak = [6+, 2-] → Entropy(Sweak) = 0.811
Sstrong = [3+, 3-] → Entropy(Sstrong) = 1.00
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(Sweak) - (6/14) Entropy(Sstrong)
             = 0.940 - (8/14)(0.811) - (6/14)(1.00)
             = 0.048
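A Python sketch (an illustration, not from the slides) of the Gain(S, A) formula; applied to the Wind split above it reproduces approximately 0.048.

# Minimal sketch: information gain of splitting a labelled sample on one attribute.
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Gain(S, A) for examples given as dicts, e.g. {"Wind": "Weak", "PlayTennis": "Yes"}."""
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# The Wind example above: 6+/2- with Wind = Weak, 3+/3- with Wind = Strong.
S = ([{"Wind": "Weak", "PlayTennis": "Yes"}] * 6 + [{"Wind": "Weak", "PlayTennis": "No"}] * 2 +
     [{"Wind": "Strong", "PlayTennis": "Yes"}] * 3 + [{"Wind": "Strong", "PlayTennis": "No"}] * 3)
print(round(information_gain(S, "Wind", "PlayTennis"), 3))   # about 0.048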
• Consider the following set of training examples (the table itself is a figure in the slides:
six examples, three labelled + and three labelled -, described by Boolean attributes including a2).
a) What is the entropy of this collection of training example with respect to
the target function classification?
b) What is the information gain of a2 relative to these training examples?
a) Number of +ve instances = 3
   Number of -ve instances = 3
   Entropy(S) = 1
b) Gain(S, a2) = Entropy(S) - [ (4/6) Entropy(ST) + (2/6) Entropy(SF) ]
   Entropy(ST) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1
   Entropy(SF) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
   Gain(S, a2) = 1 - [ (4/6)(1) + (2/6)(1) ]
              = 1 - 1
              = 0
Algorithms and splitting criteria used in decision trees:
ID3
Gini Index
Chi-Square
Reduction in Variance
• ID3 (Iterative Dichotomiser 3) was introduced by J. R. Quinlan in 1986.
Basic decision tree learning algorithm
• Most algorithms that have been developed for learning decision trees
are variations on a core algorithm that employs a top-down, greedy
search through the space of possible decision trees.
• The basic algorithm, ID3, learns decision trees by constructing them
top-down, beginning with the question "which attribute should be tested
at the root of the tree?"
• To answer this question, each instance attribute is evaluated using a
statistical test to determine how well it alone classifies the training
examples.
• The best attribute is selected and used as the test at the root node of the
tree. A descendant of the root node is then created for each possible
value of this attribute, and the training examples are sorted to the
appropriate descendant node.
• "Which attribute should be tested at the root of the tree?"
  - Each instance attribute is evaluated using a statistical test.
  - The best attribute is chosen, and a descendant of the root node is created for each of its values.
  - (For boolean-valued target functions, this is a form of concept learning.)
• Which attribute is the best classifier?
  - Attributes are selected using information gain, which is defined in terms of entropy.
Basic decision tree learning algorithm
• A simplified version of the algorithm, specialized to learning
boolean-valued functions (i.e., concept learning)
Algorithm ID3 (Examples, TargetAttribute, Attributes)
1. Create a Root node for the tree
2. If all Examples are positive, return the single-node tree Root with label = +
3. If all Examples are negative, return the single-node tree Root with label = -
4. If Attributes is empty, return the single-node tree Root with label = the most
   common value of TargetAttribute in Examples
   Otherwise Begin
     A ← the attribute from Attributes that best classifies Examples
     The decision attribute for Root ← A
     For each possible value vi of A,
       Add a new tree branch below Root, corresponding to the test A = vi
       Let Examples_vi be the subset of Examples that have value vi for A
       If Examples_vi is empty
         Then below this new branch add a leaf node with label = most
         common value of TargetAttribute in Examples
         Else below this new branch add the subtree
         ID3(Examples_vi, TargetAttribute, Attributes - {A})
   End
5. Return Root
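A compact Python sketch of the pseudocode above (an illustration, not the textbook's code): examples are assumed to be dicts mapping attribute names to values, the returned tree is a nested dict of the form {attribute: {value: subtree}}, and leaves are class labels.

# Minimal ID3 sketch: examples are dicts, e.g. {"Outlook": "Sunny", ..., "PlayTennis": "No"}.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

def most_common(labels):
    return Counter(labels).most_common(1)[0][0]

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    # Steps 2-4: stop if the examples are pure or no attributes remain.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return most_common(labels)
    # Choose the attribute that best classifies the examples (highest information gain).
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        # Branches are created only for values seen in the examples, so subsets are never empty;
        # the pseudocode's "most common value" leaf covers values unseen at this node.
        tree[best][value] = id3(subset, target, remaining)
    return tree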
Summary of the ID3 algorithm
• Summary of the ID3 algorithm specialized to learning
boolean-valued functions. ID3 is a greedy algorithm that
grows the tree top-down, at each node selecting the attribute
that best classifies the local training examples.
• This process continues until the tree perfectly classifies the
training examples, or until all attributes have been used.
An Illustrative Example
• Training examples for the target concept PlayTennis (the standard data set used with ID3):
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
• Entropy(S), S = [9+, 5-]
  = - PYes log2 PYes - PNo log2 PNo = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
• Information Gain
1. Gain(S, Outlook)
Values(Outlook) = { Sunny, Overcast, Rain }
S = [9+, 5-]
OutlookSunny = [2+, 3-]      Entropy(OutlookSunny) = 0.971
OutlookOvercast = [4+, 0-]   Entropy(OutlookOvercast) = 0.0
OutlookRain = [3+, 2-]       Entropy(OutlookRain) = 0.971
Gain(S, Outlook) = E(S) - (5/14) E(OutlookSunny) - (4/14) E(OutlookOvercast) - (5/14) E(OutlookRain)
                 = 0.940 - (5/14)(0.971) - 0 - (5/14)(0.971)
                 = 0.246
2. Gain(S, Temperature)
Values(Temperature) = { Hot, Mild, Cool }
S = [9+, 5-]
TemperatureHot = [2+, 2-]    Entropy(TemperatureHot) = 1.0
TemperatureMild = [4+, 2-]   Entropy(TemperatureMild) = 0.918
TemperatureCool = [3+, 1-]   Entropy(TemperatureCool) = 0.811
Gain(S, Temperature) = E(S) - (4/14) E(TemperatureHot) - (6/14) E(TemperatureMild) - (4/14) E(TemperatureCool)
                     = 0.940 - (4/14)(1.0) - (6/14)(0.918) - (4/14)(0.811)
                     = 0.029
3. Gain(S, Humidity)
Values(Humidity) = { High, Normal }
S = [9+, 5-]
HumidityHigh = [3+, 4-]     Entropy(HumidityHigh) = 0.985
HumidityNormal = [6+, 1-]   Entropy(HumidityNormal) = 0.592
Gain(S, Humidity) = E(S) - (7/14) Entropy(HumidityHigh) - (7/14) Entropy(HumidityNormal)
                  = 0.940 - (7/14)(0.985) - (7/14)(0.592)
                  = 0.151
4. Gain(S, Wind)
Values(Wind) = { Weak, Strong }
S = [9+, 5-]
Sweak = [6+, 2-]    Entropy(Sweak) = 0.811
Sstrong = [3+, 3-]  Entropy(Sstrong) = 1.00
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(Sweak) - (6/14) Entropy(Sstrong)
              = 0.940 - (8/14)(0.811) - (6/14)(1.00)
              = 0.048
1. Gain(S, Outlook) = 0.246
2. Gain(S, Temperature) = 0.029
3. Gain(S, Humidity) = 0.151
4. Gain(S, Wind) = 0.048
• Outlook has the highest information gain, so it is selected as the decision attribute for
  the root node, with one branch for each of its values (Sunny, Overcast, Rain).
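A short check (illustrative, not from the slides) that reproduces the four gains above from the PlayTennis table shown earlier; the helper functions are repeated so the snippet runs on its own.

# Minimal check of the four information gains on the PlayTennis examples.
import math
from collections import Counter

DATA = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
EXAMPLES = [dict(zip(COLUMNS, row)) for row in DATA]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target="PlayTennis"):
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

for attribute in COLUMNS[:-1]:
    print(attribute, round(information_gain(EXAMPLES, attribute), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (the slides round these to 0.246 and 0.151)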
Overcast outlook on decision
• If Outlook is Overcast → Yes (all four Overcast examples are positive, so this branch
becomes a leaf).
• Sunny outlook on decision
• Entropy(OutlookSunny) → 5 instances → S(2+, 3-) → 0.9709
1. Gain(Outlook=Sunny | Temperature) = 0.570
2. Gain(Outlook=Sunny | Humidity) = 0.970
3. Gain(Outlook=Sunny | Wind) = 0.019
• Humidity has the highest gain, so it is tested below the Sunny branch.
• If humidity is High → No
• If humidity is Normal → Yes
• Rain outlook on decision
• Entropy(OutlookRain) → 5 instances → S(3+, 2-) → 0.9709
1. Gain(Outlook=Rain | Temperature) = 0.0199
2. Gain(Outlook=Rain | Humidity) = 0.0199
3. Gain(Outlook=Rain | Wind) = 0.9709
• Wind has the highest gain, so it is tested below the Rain branch.
• If Wind is Weak and Outlook is Rain → Yes
• If Wind is Strong and Outlook is Rain → No
• Final version of the decision tree:
Outlook = Sunny    → test Humidity:  High → No,   Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain     → test Wind:      Strong → No, Weak → Yes
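A small sketch (an illustration, not from the slides) of how the final tree can be applied to a new example when it is stored as a nested dict like the ones produced by the ID3 sketch above.

# The learned PlayTennis tree as a nested dict, mirroring the structure above.
PLAY_TENNIS_TREE = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

def classify(tree, example):
    """Walk the nested-dict tree until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][example[attribute]]
    return tree

print(classify(PLAY_TENNIS_TREE,
               {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}))  # No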
• Identify the entropy, information gain and draw the decision tree for the
following training examples (the example table is a figure in the slides: each example is
described by Gender, Car Ownership, Travel Cost and Income Level, and the class to
predict is the mode of transport, e.g. Bus or Train).
Information gains for the four attributes:
Gain(S, Gender) = 0.101
Gain(S, Car Ownership) = 0.3382
Gain(S, Travel Cost) = 0.76
Gain(S, Income Level) = 0.437
Travel Cost has the highest information gain, so it is chosen as the root node.
(Figure: the resulting decision tree. Travel Cost is the root with branches Cheap, Standard
and Expensive; the Standard and Expensive branches end in leaves (Bus and Train in the
figure's layout), while the Cheap branch is split on Gender, with Male → Bus and Female
split further on Income Level into Bus and Train leaves.)
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
• As with other inductive learning methods, ID3 can be characterized as
searching a space of hypotheses for one that fits the training examples.
• The hypothesis space searched by ID3 is the set of possible decision trees.
• ID3 performs a simple-to-complex, hill-climbing search through this
hypothesis space, beginning with the empty tree, then considering
progressively more elaborate hypotheses in search of a decision tree that
correctly classifies the training data.
• The evaluation function that guides this hill-climbing search is the
information gain measure.
(Figure: the hypothesis space search performed by ID3, growing from the empty tree to
progressively more elaborate decision trees.)
Capabilities and Limitations Of ID3 algorithm
• ID3's hypothesis space of all decision trees is a complete space of finite
discrete-valued functions, relative to the available attributes.
– Because every finite discrete-valued function can be represented by some decision tree, ID3 avoids one of the major
risks of methods that search incomplete hypothesis spaces (such as methods that consider only conjunctive
hypotheses): that the hypothesis space might not contain the target function.
• ID3 maintains only a single current hypothesis as it searches through the
space of decision trees.
• ID3 in its pure form performs no backtracking in its search.
– Once it selects an attribute to test at a particular level in the tree, it never backtracks to reconsider this choice.
Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally
optimal solutions that are not globally optimal.
• ID3 uses all training examples at each step in the search to make statistically
based decisions regarding how to refine its current hypothesis.
Issues in Decision Tree Learning
1. Avoiding Overfitting the Data
Reduced error pruning
Rule post-pruning
2. Incorporating Continuous-Valued Attributes
3. Alternative Measures for Selecting Attributes
4. Handling Training Examples with Missing Attribute Values
5. Handling Attributes with Differing Costs
Avoiding Overfitting the Data
o Informally: a hypothesis h may fit a handful of training points (e.g. [0, 1, 2]) exactly,
while an alternative hypothesis h' fits the whole range of instances (e.g. [0..100]) better.
o Overfit: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the
training data if there exists some alternative hypothesis h' ∈ H, such that h has
smaller error than h' over the training examples, but h' has a smaller error than h
over the entire distribution of instances.
Reasons for overfitting:
o Overfitting can occur when the training examples contain random
errors or noise
o When small numbers of examples are associated with leaf nodes.
Approaches to avoid overfitting
o Pre-pruning (avoidance)
o Post-pruning (recovery)
o Reduced Error Pruning
o Rule post pruning
• Reduced-Error Pruning
Consider each decision node as a candidate for pruning: remove the subtree rooted at that
node, make it a leaf, and assign it the most common classification of the training examples
affiliated with that node. A node is pruned only if the resulting tree performs no worse than
the original over a separate validation set (a sketch follows the list below).
• Training set: used to form the learned tree
• Validation set: used to evaluate the effect of pruning nodes from the tree
• Test set: used to estimate the accuracy of the final tree on unseen data
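A minimal Python sketch of reduced-error pruning for nested-dict trees (illustrative only; it assumes the tree was grown from the same training examples passed in, and example/function names are not from the slides).

# Bottom-up reduced-error pruning for trees of the form {attribute: {value: subtree}}.
from collections import Counter

def classify(tree, example):
    """Walk the tree; fall back to an arbitrary branch for attribute values unseen in training."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        branches = tree[attribute]
        tree = branches.get(example[attribute], next(iter(branches.values())))
    return tree

def accuracy(tree, examples, target):
    if not examples:
        return 1.0          # no validation examples reach this node, so pruning cannot hurt
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def reduced_error_prune(tree, train, valid, target):
    """Replace a subtree with a leaf whenever validation accuracy does not drop."""
    if not isinstance(tree, dict):
        return tree
    attribute = next(iter(tree))
    # First prune the children, passing down the examples that reach each branch.
    for value, subtree in tree[attribute].items():
        tree[attribute][value] = reduced_error_prune(
            subtree,
            [e for e in train if e[attribute] == value],
            [e for e in valid if e[attribute] == value],
            target)
    # Candidate leaf: the most common classification of the training examples at this node.
    leaf = Counter(e[target] for e in train).most_common(1)[0][0]
    # Comparing on the validation examples that reach this node is equivalent to comparing
    # whole-tree validation accuracy, since the rest of the tree is unchanged.
    if accuracy(leaf, valid, target) >= accuracy(tree, valid, target):
        return leaf
    return tree

A single bottom-up pass is used here; the textbook description iterates pruning greedily until further pruning is harmful, which this sketch approximates.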
• Rule Post-Pruning
Convert the learned tree into an equivalent set of rules, one rule per path from the root to a
leaf; prune each rule by removing any precondition whose removal does not worsen the
rule's estimated accuracy; then sort the pruned rules by estimated accuracy and apply them
in that order when classifying new instances (a sketch of the rule-extraction step is below).
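A sketch (illustrative only) of the rule-extraction step for a nested-dict tree; each rule is a list of (attribute, value) preconditions plus the class label at the leaf. The precondition-pruning and sorting steps are not shown.

# Minimal sketch: convert a nested-dict decision tree into IF-THEN rules.
def extract_rules(tree, preconditions=()):
    """Return a list of (preconditions, class_label) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):                      # reached a leaf
        return [(list(preconditions), tree)]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules.extend(extract_rules(subtree, preconditions + ((attribute, value),)))
    return rules

tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}
for conditions, label in extract_rules(tree):
    print("IF " + " AND ".join(f"{a} = {v}" for a, v in conditions) + f" THEN PlayTennis = {label}")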
• Incorporating Continuous-Valued Attributes
For a continuous attribute, ID3 can dynamically define new boolean attributes of the form
A > c that are true when the attribute value exceeds a threshold c. Candidate thresholds are
placed midway between adjacent values where the classification changes:
Temperature: 40   48   60   72   80   90
PlayTennis:  No   No   Yes  Yes  Yes  No
• Candidate thresholds: (48 + 60)/2 = 54 and (80 + 90)/2 = 85
• Candidate boolean attributes: Temperature > 54 and Temperature > 85
• The candidate with the highest information gain is then used like any other attribute.
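A small sketch (illustrative, not the slides' code) that finds the candidate thresholds for the Temperature example above and picks the one with the highest information gain.

# Minimal sketch: candidate split points for a continuous attribute.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def candidate_thresholds(values, labels):
    """Midpoints between adjacent values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

def threshold_gain(values, labels, threshold):
    """Information gain of the boolean attribute (value > threshold)."""
    above = [l for v, l in zip(values, labels) if v > threshold]
    below = [l for v, l in zip(values, labels) if v <= threshold]
    return (entropy(labels)
            - (len(above) / len(labels)) * entropy(above)
            - (len(below) / len(labels)) * entropy(below))

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
thresholds = candidate_thresholds(temperature, play_tennis)
print(thresholds)                                                                   # [54.0, 85.0]
print(max(thresholds, key=lambda t: threshold_gain(temperature, play_tennis, t)))   # 54.0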
Alternative Measures for Selecting Attributes
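Information gain favours attributes with many distinct values (an extreme case is an attribute such as Date, which perfectly separates the training data but is useless for new instances). One alternative measure is the gain ratio, which divides the gain by the attribute's split information, SplitInformation(S, A) = - Σ (|Si|/|S|) log2(|Si|/|S|) over the values of A. A minimal sketch (illustrative; the helper functions are repeated so the snippet stands alone):

# Minimal sketch: gain ratio = information gain / split information.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

def split_information(examples, attribute):
    """Entropy of S with respect to the values of the attribute itself."""
    return entropy([e[attribute] for e in examples])

def gain_ratio(examples, attribute, target):
    # Note: undefined when the attribute has a single value (split information = 0).
    return information_gain(examples, attribute, target) / split_information(examples, attribute)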
• Tree construction: refer to the Python implementation in Harrington, page 39.
• Testing and storing the classifier: Harrington, page 56.
• Example: using decision trees to predict contact lens type: Harrington, page 57.
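Harrington's chapter persists the learned tree so it does not have to be rebuilt for every query. A sketch in that spirit (function names here are illustrative, not necessarily the book's), assuming the tree is stored as a nested dict like the ones above:

# Minimal sketch: persist a learned tree with pickle and reload it for classification.
import pickle

def store_tree(tree, filename):
    """Serialize the nested-dict tree to disk."""
    with open(filename, "wb") as f:
        pickle.dump(tree, f)

def grab_tree(filename):
    """Reload a previously stored tree."""
    with open(filename, "rb") as f:
        return pickle.load(f)

tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}
store_tree(tree, "play_tennis_tree.pkl")
print(grab_tree("play_tennis_tree.pkl") == tree)   # True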