
Module-2

Decision Tree Learning


• Introduction,
• Decision tree representation,
• Appropriate problems for decision tree learning,
• Issues in decision tree learning,
• Basic decision tree learning algorithm,
• Tree Construction,
• Testing and storing the classifier,
• Example: using decision trees to predict contact lens type

1
Decision tree
Decision tree learning is a method for approximating discrete-valued target functions,
in which the learned function is represented by a decision tree.

2
Decision tree

3
Decision Tree

[Figure: raw data and the decision-making tree derived from it]

4
Terminology used in Decision Tree

• Root Node
• Splitting
• Decision Node
• Leaf / Terminal Node
• Pruning
• Branch / Sub-Tree
• Parent and Child Node

5
Decision tree representation

6
Convert the following Boolean functions into decision trees.
1. A Λ B (A AND B)

A     B     A Λ B
T     T       T
T     F       F
F     T       F
F     F       F

[Decision tree: test A at the root. If A = F, the output is F. If A = T, test B: if B = T the output is T, and if B = F the output is F.]

10
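As a quick illustration (not from the slides), the A Λ B tree can be encoded as nested Python dictionaries, with a small helper that walks it; the names tree_and and classify are illustrative choices:

# The decision tree for A AND B as nested dicts: an inner dict maps an
# attribute to its branches; a bare string is a leaf (the predicted value).
tree_and = {'A': {'F': 'F',                   # A = F -> output F
                  'T': {'B': {'T': 'T',       # A = T -> test B
                              'F': 'F'}}}}

def classify(tree, example):
    """Follow the tree's attribute tests using the values in `example`."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))          # attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree

print(classify(tree_and, {'A': 'T', 'B': 'T'}))   # T
print(classify(tree_and, {'A': 'T', 'B': 'F'}))   # F
print(classify(tree_and, {'A': 'F', 'B': 'T'}))   # F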
2. A XOR B

11
2. A XOR B

12
3. A Λ ¬B

13
4. A V [ B Λ C ]

14
5. [A Λ B] V [C Λ D]

15
Appropriate problems for decision tree learning
o Instances are represented by attribute-value pairs
o The target function has discrete output values
o Disjunctive descriptions may be required
o The training data may contain errors
o The training data may contain missing attribute values

Problems :
 Classify medical patients by their disease
 Equipment malfunctions by their cause
 Loan applicants by their likelihood of defaulting on payments

16
17
Entropy

18
Entropy:
Entropy is a measure of the impurity, disorder, or uncertainty in
a collection of examples.

• Example: S is a collection of 14 examples → [9+, 5-]


Entropy(S) = – (9/14) log2(9/14) – (5/14) log2(5/14)
Entropy(S) = 0.940
• The entropy is 0 if all members of S belong to the same class.
For example, if all members are positive:
Entropy(S) = – (1) log2(1) – (0) log2(0) = 0   (using the convention 0 · log2 0 = 0)

19
• The entropy is 1 when the collection contains an equal number of
positive and negative examples.
• If the collection contains unequal numbers of positive and negative
examples, the entropy is between 0 and 1.
• Entropy of S (with p+ the proportion of positive and p– the proportion of negative examples):
Entropy(S) = – p+ log2(p+) – p– log2(p–)

Interpretation of entropy:
p+ = 1 → Entropy = 0 (pure collection)
p+ = 0.5 → Entropy = 1 (maximally mixed)
p+ = 0.8 → Entropy ≈ 0.72 (between 0 and 1)
20
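A minimal Python sketch of this entropy calculation (the function name entropy and the list-of-labels input format are illustrative assumptions, not taken from the slides):

import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection given its list of class labels, e.g. ['+', '-', ...]."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)           # -p * log2(p) summed over the classes
    return result

s = ['+'] * 9 + ['-'] * 5                     # the [9+, 5-] collection above
print(round(entropy(s), 3))                   # 0.94
print(entropy(['+'] * 14))                    # 0.0 -> all members in one class
print(entropy(['+'] * 7 + ['-'] * 7))         # 1.0 -> equal positive/negative split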
INFORMATION GAIN MEASURES THE EXPECTED
REDUCTION IN ENTROPY
• Given entropy as a measure of the impurity in a collection of training examples,
we can now define a measure of the effectiveness of an attribute in classifying
the training data.
• The measure we will use, called information gain, is simply the expected
reduction in entropy caused by partitioning the examples according to this
attribute. More precisely, the information gain, Gain(S, A) of an attribute A,
relative to a collection of examples S, is defined as

S → collection of examples, A → attribute, Values(A) → set of possible values of attribute A,
Sv → subset of S for which attribute A has value v

Gain(S, A) = Entropy(S) – Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)

Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of
attribute A. Put another way, Gain(S, A) is the information provided about the target
function value, given the value of some other attribute A.

21
22
Example:
• Suppose S is a collection of training-example days described by attributes
including Wind, which can have the values Weak or Strong.
• As before, assume S is a collection containing 14 examples, [9+, 5-].
• Of these 14 examples, suppose 6 of the positive and 2 of the negative
examples have Wind = Weak, and the remainder have Wind = Strong.

The information gain due to sorting the original 14 examples by the attribute
Wind may then be calculated as Gain(S, Wind):

Values(Wind) = { Weak, Strong }

S = [9+, 5-]

Sweak = [6+, 2-] → Entropy(Sweak) = 0.811


Sstrong = [3+, 3-] → Entropy(Sstrong) = 1.00
Gain(S, Wind) = Entropy(S) – (8/14) Entropy(Sweak) – (6/14) Entropy(Sstrong)
              = 0.940 – (8/14)(0.811) – (6/14)(1.00)
              = 0.048

23
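A hedged Python sketch of the Gain(S, Wind) computation above; the helper names (entropy, information_gain) and the dict-per-example encoding are assumptions made for this example:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attribute, target='label'):
    """Expected reduction in entropy from partitioning `examples` on `attribute`."""
    total = len(examples)
    gain = entropy([e[target] for e in examples])
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# S = [9+, 5-]: Wind = Weak covers [6+, 2-], Wind = Strong covers [3+, 3-]
S = ([{'label': '+', 'Wind': 'Weak'}] * 6 + [{'label': '-', 'Wind': 'Weak'}] * 2 +
     [{'label': '+', 'Wind': 'Strong'}] * 3 + [{'label': '-', 'Wind': 'Strong'}] * 3)
print(round(information_gain(S, 'Wind'), 3))   # 0.048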
• Consider the following set of training examples
a) What is the entropy of this collection of training example with respect to
the target function classification?
b) What is the information gain of a2 relative to these training examples?

24
a) no. of +ve instances =3
no. of -ve instances =3
Entropy(S) = 1

b) Gain (S,a2) = Entropy(S)-[4/6 Entropy(ST) + 2/6 Entropy(SF)]

Entropy(ST) = -(2/4) log2(2/4) - (2/4) log2(2/4) =1

Entropy(SF) = -(1/2) log2(1/2) -(1/2) log2(1/2) =1

Gain (S,a2) = 1- [4/6(1)+2/6(1)]


= 1-1
=0

25
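A quick check of this arithmetic in Python (H is just an illustrative helper name):

import math

def H(p, n):
    """Entropy of a collection with p positive and n negative examples."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

print(H(3, 3))                                   # Entropy(S) = 1.0
gain_a2 = H(3, 3) - (4/6) * H(2, 2) - (2/6) * H(1, 1)
print(gain_a2)                                   # Gain(S, a2) = 0.0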
26
27
28
Algorithms and splitting criteria used in decision trees:
ID3
Gini Index
Chi-Square
Reduction in Variance

• ID3 (Iterative Dichotomiser 3)
ID3 was proposed by J. R. Quinlan in 1986.

29
Basic decision tree learning algorithm

• Most algorithms that have been developed for learning decision trees
are variations on a core algorithm that employs a top-down, greedy
search through the space of possible decision trees.
• The basic algorithm, ID3, learns decision trees by constructing them
top-down, beginning with the question "which attribute should be tested
at the root of the tree?"
• To answer this question, each instance attribute is evaluated using a
statistical test to determine how well it alone classifies the training
examples.
• The best attribute is selected and used as the test at the root node of the
tree. A descendant of the root node is then created for each possible
value of this attribute, and the training examples are sorted to the
appropriate descendant node

30
• "Which attribute should be tested at the root of the tree?"
Each instance attribute is evaluated with a statistical test; the best attribute is chosen as the test at the root, and a descendant node is created for each of its values.

• Which attribute is the best classifier?
The attribute is selected using the information gain measure, which is defined in terms of entropy.
31
Basic decision tree learning algorithm
• A simplified version of the algorithm, specialized to learning
boolean-valued functions (i.e., concept learning)

32
Algorithm ID3 (Examples, TargetAttribute, Attributes)
1. Create a Root node for the tree
2. If all Examples are positive, return the single-node tree Root with label = +
3. If all Examples are negative, return the single-node tree Root with label = –
4. If Attributes is empty, return the single-node tree Root with label = most common value of TargetAttribute in Examples
Otherwise Begin
   A ← the attribute from Attributes that best classifies Examples
   The decision attribute for Root ← A
   For each possible value vi of A,
      Add a new tree branch below Root, corresponding to the test A = vi
      Let Examples_vi be the subset of Examples that have value vi for A
      If Examples_vi is empty
         Then below this new branch add a leaf node with label = most common value of TargetAttribute in Examples
      Else below this new branch add the subtree
         ID3(Examples_vi, TargetAttribute, Attributes – {A})
End
5. Return Root
33
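A compact Python sketch of this algorithm for a discrete target attribute. The function names (id3, information_gain, most_common) and the list-of-dicts data format are illustrative assumptions rather than the textbook's code, and, as a simplification, branches are only created for attribute values that actually occur in the examples, so the empty-subset case of the pseudocode does not arise:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    total = len(examples)
    gain = entropy([e[target] for e in examples])
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def most_common(examples, target):
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:             # all examples positive or all negative
        return labels[0]
    if not attributes:                    # no attributes left: use the majority label
        return most_common(examples, target)
    # A <- the attribute that best classifies Examples (highest information gain)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return tree

# Hypothetical usage, e.g. with the PlayTennis data used later in this module:
# id3(examples, 'PlayTennis', ['Outlook', 'Temperature', 'Humidity', 'Wind'])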
Summary of the ID3 algorithm
• Summary of the ID3 algorithm specialized to learning
boolean-valued functions. ID3 is a greedy algorithm that
grows the tree top-down, at each node selecting the attribute
that best classifies the local training examples.
• This process continues until the tree perfectly classifies the
training examples, or until all attributes have been used.

34
An Illustrative Example

35
• Entropy(S), S = [9+, 5-]
  = – PYes log2(PYes) – PNo log2(PNo) = 0.940

• Information Gain
1. Gain(S, Outlook)
Values(Outlook) = { Sunny, Overcast, Rain }

S = [9+, 5-]
OutlookSunny = [2+, 3-]      Entropy(OutlookSunny) = 0.971
OutlookOvercast = [4+, 0-]   Entropy(OutlookOvercast) = 0.0
OutlookRain = [3+, 2-]       Entropy(OutlookRain) = 0.971

Gain(S, Outlook) = E(S) – (5/14) E(OutlookSunny) – (4/14) E(OutlookOvercast) – (5/14) E(OutlookRain)
                 = 0.940 – (5/14)(0.971) – (4/14)(0.0) – (5/14)(0.971)
                 = 0.246

36
Entropy([9+, 5-]) = – (9/14) log2(9/14) – (5/14) log2(5/14) = 0.940

38
2. Gain(S, Temperature)

Values(Temperature) = { Hot, Cool, Mild }

S = [9+, 5-]
TemperatureHot = [2+, 2-]    Entropy(TemperatureHot) = 1.0
TemperatureCool = [3+, 1-]   Entropy(TemperatureCool) = 0.811
TemperatureMild = [4+, 2-]   Entropy(TemperatureMild) = 0.918

Gain(S, Temperature) = E(S) – (4/14) E(TemperatureHot) – (4/14) E(TemperatureCool) – (6/14) E(TemperatureMild)
                     = 0.940 – (4/14)(1.0) – (4/14)(0.811) – (6/14)(0.918)
                     = 0.029

39
Entropy([9+, 5-]) = – (9/14) log2(9/14) – (5/14) log2(5/14) = 0.940

41
3. Gain(S, Humidity)

Values(Humidity) = { High, Normal }

S = [9+, 5-]
HumidityHigh = [3+, 4-]     Entropy(HumidityHigh) = 0.985
HumidityNormal = [6+, 1-]   Entropy(HumidityNormal) = 0.592

Gain(S, Humidity) = E(S) – (7/14) Entropy(HumidityHigh) – (7/14) Entropy(HumidityNormal)
                  = 0.1518

42
Entropy([9+, 5-]) = – (9/14) log2(9/14) – (5/14) log2(5/14) = 0.940

44
4. Gain(S, Wind)

Values(Wind) = { Weak, Strong }

S = [9+, 5-]
Sweak = [6+, 2-]     Entropy(Sweak) = 0.811
Sstrong = [3+, 3-]   Entropy(Sstrong) = 1.00

Gain(S, Wind) = Entropy(S) – (8/14) Entropy(Sweak) – (6/14) Entropy(Sstrong)
              = 0.048

45
Entropy([9+, 5-]) = – (9/14) log2(9/14) – (5/14) log2(5/14) = 0.940

47
1. Gain(S, Outlook) = 0.246
2. Gain(S, Temparature) = 0.029
3. Gain(S, Humidity) = 0.1516
4. Gain(S, Wind) = 0.048

• Root node decision: Outlook has the highest information gain, so it becomes the decision attribute at the root of the tree (a quick check in code follows below).

48
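As a tiny sketch of that selection step, using the gain values listed above:

# Information gains computed above for the first (root-level) split
gains = {'Outlook': 0.246, 'Temperature': 0.029, 'Humidity': 0.1516, 'Wind': 0.048}
root = max(gains, key=gains.get)   # attribute with the largest information gain
print(root)                        # Outlook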
50
 Overcast outlook on decision

 Outlook is overcast → Yes

51
 Overcast outlook on decision

52
53
• Sunny outlook on decision

• Entropy(OutlookSunny) → 5 instances → S(2+, 3-) → 0.9709

1. Gain(Outlook=Sunny | Temperature) = 0.570
2. Gain(Outlook=Sunny | Humidity) = 0.970
3. Gain(Outlook=Sunny | Wind) = 0.019

Humidity has the highest gain, so it is tested next under the Sunny branch.
54
• If humidity is High → No

• If humidity is Normal → Yes


59
60
61
• Rain outlook on decision

• Entropy(OutlookRain) → 5 instances → S(3+, 2-) → 0.9709

1. Gain(Outlook=Rain | Temperature) = 0.019973
2. Gain(Outlook=Rain | Humidity) = 0.019973
3. Gain(Outlook=Rain | Wind) = 0.97095

Wind has the highest gain, so it is tested next under the Rain branch.
62
• Rain outlook on decision

• If wind is weak and outlook is rain → Yes

• If wind is strong and outlook is rain → No

67
• Final version of the decision tree:

Outlook = Sunny → Humidity
    Humidity = High → No
    Humidity = Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain → Wind
    Wind = Weak → Yes
    Wind = Strong → No

69
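The same final tree written as a nested Python dictionary (similar in spirit to the nested-dict tree representation in the Harrington code referenced at the end of this module); the classify helper is an illustrative sketch:

play_tennis_tree = {'Outlook': {
    'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
    'Overcast': 'Yes',
    'Rain':     {'Wind': {'Weak': 'Yes', 'Strong': 'No'}},
}}

def classify(tree, example):
    """Follow the tree's attribute tests using the values in `example`."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][example[attribute]]
    return tree

print(classify(play_tennis_tree, {'Outlook': 'Sunny', 'Humidity': 'Normal'}))  # Yes
print(classify(play_tennis_tree, {'Outlook': 'Rain', 'Wind': 'Strong'}))       # No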
Algorithm ID3 (Examples, TargetAttribute, Attributes)
1. Create a Root node for the tree
2. If all Examples are positive, return the single-node tree Root with label = +
3. If all Examples are negative, return the single-node tree Root with label = –
4. If Attributes is empty, return the single-node tree Root with label = most common value of TargetAttribute in Examples
Otherwise Begin
   A ← the attribute from Attributes that best classifies Examples
   The decision attribute for Root ← A
   For each possible value vi of A,
      Add a new tree branch below Root, corresponding to the test A = vi
      Let Examples_vi be the subset of Examples that have value vi for A
      If Examples_vi is empty
         Then below this new branch add a leaf node with label = most common value of TargetAttribute in Examples
      Else below this new branch add the subtree
         ID3(Examples_vi, TargetAttribute, Attributes – {A})
End
5. Return Root
71
• Identify the entropy, information gain and draw the decision tree for the
following training examples.

76
Information Gain (S, Gender) = 0.101
Information Gain (S, Car Ownership) = 0.3382
Information Gain (S, Travel Cost) = 0.76
Information Gain (S, Income Level) = 0.437
Travel Cost has the highest information gain, so it becomes the root node. Resulting tree:

Travel Cost = cheap → Gender
    Gender = Male → Bus
    Gender = Female → Income Level (branches to Bus or Train depending on the income level)
Travel Cost = standard → Bus
Travel Cost = expensive → Train
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING

• As with other inductive learning methods, ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.

• The hypothesis space searched by ID3 is the set of possible decision trees.

• ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data.

• The evaluation function that guides this hill-climbing search is the information gain measure.

102
Hypothesis Space Search in Decision Tree Learning
This search is depicted in Figure

103
Capabilities and Limitations Of ID3 algorithm

• ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes.
  – Because every finite discrete-valued function can be represented by some decision tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis spaces (such as methods that consider only conjunctive hypotheses): that the hypothesis space might not contain the target function.

• ID3 maintains only a single current hypothesis as it searches through the space of decision trees.

• ID3 in its pure form performs no backtracking in its search.
  – Once it selects an attribute to test at a particular level in the tree, it never backtracks to reconsider this choice. Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are not globally optimal.

• ID3 uses all training examples at each step in the search to make statistically
based decisions regarding how to refine its current hypothesis.

104
105
106
Issues in Decision Tree Learning
1. Avoiding Overfitting the Data

Reduced error pruning

Rule post-pruning

2. Incorporating Continuous-Valued Attributes

3. Alternative Measures for Selecting Attributes

4. Handling Training Examples with Missing Attribute Values

5. Handling Attributes with Differing Costs

107
Avoiding Overfitting the Data

o h=[0,1,2] h’=[0:100]
o Overfit: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the
training data if there exists some alternative hypothesis h' ∈ H, such that h has
smaller error than h' over the training examples, but h' has a smaller error than h
over the entire distribution of instances.

108
 Reasons for overfitting:
o Overfitting can occur when the training examples contain random
errors or noise

o When small numbers of examples are associated with leaf nodes.


 Approaches to avoid overfitting
o Pre-pruning (avoidance)
o Post-pruning (recovery)
o Reduced Error Pruning
o Rule post pruning

109
• Reduced-Error Pruning: consider each decision node as a candidate for pruning; pruning a node means removing the subtree rooted at it and turning it into a leaf labeled with the most common classification of the training examples affiliated with that node. A node is removed only if the resulting pruned tree performs no worse than the original over a separate validation set.

112
• Training set
• Validation set
• Test set
113
• Rule Post-Pruning: grow the full tree, convert it into an equivalent set of rules (one rule per path from the root to a leaf), prune each rule by removing any preconditions whose removal does not worsen its estimated accuracy, then sort the pruned rules by estimated accuracy and apply them in that order when classifying.

115
• Incorporating Continuous-Valued Attributes

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No

116
• Incorporating Continuous-Valued Attributes

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No

Candidate thresholds are placed midway between adjacent values where the classification changes:
• (48 + 60)/2 = 54
• (80 + 90)/2 = 85
giving the candidate boolean attributes Temperature > 54 and Temperature > 85 (see the sketch below).

117
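A small Python sketch of this thresholding step: candidate split points are the midpoints between adjacent sorted values whose class changes, and each one defines a boolean test (candidate_thresholds is an illustrative name):

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
print(candidate_thresholds(temperature, play_tennis))   # [54.0, 85.0]

# Each threshold t yields a candidate boolean attribute, e.g. Temperature > 54:
print([t > 54 for t in temperature])    # [False, False, True, True, True, True]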
Alternative Measures for Selecting Attributes

118
• Tree construction – refer to the Python code in Harrington, page 39
• Testing and storing the classifier (Harrington, page 56)
• Example: using decision trees to predict contact lens type (Harrington, page 57)

119