Decision Tree Algorithm (C4.5)
Training Examples for PlayTennis
(Mitchell 1997)
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
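For readers who want to follow the calculations below in code, here is a minimal Python encoding of this table; the variable name and tuple layout are my own, not part of the original example:

```python
# PlayTennis training set (Mitchell 1997), one tuple per day:
# (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
```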
Decision Tree for PlayTennis
(Mitchell 1997)
[Figure: the decision tree for PlayTennis]
Outlook
  Sunny    → Humidity: High → No; Normal → Yes
  Overcast → Yes
  Rain     → Wind: Strong → No; Weak → Yes
Decision Tree Algorithm (C4.5)
The equations of the C4.5 algorithm are as follows:
Calculate Info(S) (the entropy), which measures the class mixture of the training set S:
$$\mathrm{Info}(S) = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|}\,\log_2\frac{\mathrm{freq}(C_i, S)}{|S|}$$
where |S| is the number of cases in the training set, C_i is a class, i = 1, 2, ..., k, k is the number of classes, and freq(C_i, S) is the number of cases belonging to C_i. (Here S has 9 Yes and 5 No.)
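As an illustrative sketch (not from the original slides; the function name is mine), the entropy can be computed directly from this formula:

```python
import math

def info(counts):
    """Entropy Info(S) of a class distribution given as a list of counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

print(round(info([9, 5]), 3))  # the (9 Yes, 5 No) training set -> 0.94
```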
Decision Tree Algorithm (C4.5)
Calculate the expected information value Info_X(S) for feature X to partition S:
$$\mathrm{Info}_X(S) = \sum_{j=1}^{L} \frac{|S_j|}{|S|}\,\mathrm{Info}(S_j)$$
where L is the number of outputs (values) of feature X, S_j is the subset of S corresponding to the j-th output, and |S_j| is the number of cases in subset S_j.
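Continuing the sketch, the expected information is the size-weighted average of subset entropies. The subset class counts below are those of the Outlook split worked out later (Sunny 2+/3-, Overcast 4+/0-, Rain 3+/2-):

```python
import math

def info(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_x(subsets):
    """Info_X(S): size-weighted average entropy over the L subsets."""
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * info(s) for s in subsets)

# Outlook partitions S into Sunny (2+,3-), Overcast (4+,0-), Rain (3+,2-)
print(round(info_x([(2, 3), (4, 0), (3, 2)]), 3))  # -> 0.694
```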
Decision Tree Algorithm (C4.5)
Calculate the information gained after partitioning according to feature X:
$$\mathrm{Gain}(X) = \mathrm{Info}(S) - \mathrm{Info}_X(S)$$
Calculate the partition information value SplitInfo(X) acquired for S partitioned into L subsets:
$$\mathrm{SplitInfo}(X) = -\sum_{j=1}^{L} \frac{|S_j|}{|S|}\,\log_2\frac{|S_j|}{|S|}$$
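Note that SplitInfo is just the entropy of the subset-size distribution itself. A sketch (function name mine), again using Outlook's subset sizes 5, 4, 5:

```python
import math

def split_info(sizes):
    """SplitInfo(X): entropy of the subset-size distribution |S_j|/|S|."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s > 0)

print(round(split_info([5, 4, 5]), 3))  # Outlook's three subsets -> 1.577
```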
Decision Tree Algorithm (C4.5)
Calculate the gain ratio of Gain(X) over SplitInfo(X):
$$\mathrm{GainRatio}(X) = \mathrm{Gain}(X) / \mathrm{SplitInfo}(X)$$
The gain ratio reduces the bias toward choosing attributes with many values. If two attributes have the same largest gain ratio, one of them can be chosen at random.
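Putting the pieces together, here is a self-contained sketch of the full gain-ratio computation from per-subset class counts (all names are my own):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(subsets):
    """GainRatio(X) from per-subset class counts, e.g. [(2,3),(4,0),(3,2)]."""
    totals = [sum(pair) for pair in subsets]            # subset sizes |S_j|
    class_counts = [sum(col) for col in zip(*subsets)]  # class counts in S
    n = sum(totals)
    info_s = entropy(class_counts)                      # Info(S)
    info_x = sum(t / n * entropy(s) for t, s in zip(totals, subsets))
    return (info_s - info_x) / entropy(totals)          # Gain / SplitInfo

print(round(gain_ratio([(2, 3), (4, 0), (3, 2)]), 3))  # Outlook -> 0.156
```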
Decision Tree Algorithm (C4.5)
Step 1
Decide which attribute to consider first: Outlook? Temperature? Humidity? Wind?
Use the entropy $-P_+ \log_2 P_+ - P_- \log_2 P_-$.
At first we have 9 Yes and 5 No, so the starting entropy is
$$\mathrm{Info}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.94$$
Entropy (Info)
[Figure: binary entropy curve. Entropy(S) on the vertical axis (0.0 to 1.0) against the proportion of positive examples on the horizontal axis (0.0 to 1.0); the curve peaks at 1.0 when the classes are evenly split and falls to 0 when all examples belong to one class.]
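The curve can also be tabulated numerically (a quick sketch, not from the slides):

```python
import math

def h(p):
    """Binary entropy of a class with positive-example proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"p+ = {p:4.2f}  Entropy = {h(p):.3f}")  # peaks at p+ = 0.5
```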
Decision Tree Algorithm (C4.5)
Step 2
Compute the gain ratio for each attribute
Information gain = (entropy before the split) - (expected entropy after the split); choose the attribute that maximizes it.
[Figure: candidate splits on Humidity (High / Normal) and Outlook (Sunny / Overcast / Rain)]
Decision Tree Algorithm (C4.5)
Step 2 (cont'd)
Compute the gain ratio for each attribute.
[Figure: candidate splits on Wind (Strong / Weak) and Temperature (Hot / Mild / Cool)]
Decision Tree Algorithm (C4.5)
Step 2 (cont'd)
Compute the gain ratio for each attribute. (Change of base: $\log_2 x = \log_{10} x / \log_{10} 2 = \log_{10} x / 0.30103$.)
Outlook splits S into Sunny (2+, 3-), Overcast (4+, 0-), Rain (3+, 2-):
$E_{\text{Sunny}} = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = -0.4 \times (-1.3219) - 0.6 \times (-0.7370) = 0.5288 + 0.4422 = 0.971$
$E_{\text{Overcast}} = -\tfrac{4}{4}\log_2\tfrac{4}{4} - 0 = -1 \times \log_2 1 - 0 = 0$
$E_{\text{Rain}} = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.971$
$\mathrm{Info}_{\text{Outlook}}(S) = \tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971 = 0.694$
$\mathrm{Gain}(\text{Outlook}) = \mathrm{Info}(S) - \mathrm{Info}_{\text{Outlook}}(S) = 0.94 - 0.694 = 0.246$
$\mathrm{SplitInfo}(\text{Outlook}) = -\tfrac{5}{14}\log_2\tfrac{5}{14} - \tfrac{4}{14}\log_2\tfrac{4}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 1.577$
Gain ratio = 0.246 / 1.577 = 0.156
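The change-of-base trick above can be checked in code (a sketch; Python has math.log2 built in, so the base-10 route is only needed on a plain calculator):

```python
import math

x = 2 / 5
via_base10 = math.log10(x) / 0.30103   # log2 x = log10 x / log10 2
print(round(via_base10, 4), round(math.log2(x), 4))  # both -> -1.3219
```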
Decision Tree Algorithm (C4.5)
Step 2 (cont'd)
Compute the gain ratio for each attribute.
Humidity splits S into High (3+, 4-) and Normal (6+, 1-):
$E_{\text{High}} = -\tfrac{3}{7}\log_2\tfrac{3}{7} - \tfrac{4}{7}\log_2\tfrac{4}{7} = 0.985$
$E_{\text{Normal}} = -\tfrac{6}{7}\log_2\tfrac{6}{7} - \tfrac{1}{7}\log_2\tfrac{1}{7} = 0.592$
Info gain = $0.94 - \tfrac{7}{14} \times 0.985 - \tfrac{7}{14} \times 0.592 = 0.151$
SplitInfo = $-\tfrac{7}{14}\log_2\tfrac{7}{14} - \tfrac{7}{14}\log_2\tfrac{7}{14} = 1$
Gain ratio = 0.151 / 1 = 0.151
Decision Tree Algorithm (C4.5)
Step 2 (cont'd)
Compute the gain ratio for each attribute.
Temperature splits S into Hot (2+, 2-), Mild (4+, 2-), Cool (3+, 1-):
$E_{\text{Hot}} = -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1$
$E_{\text{Mild}} = -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} = 0.918$
$E_{\text{Cool}} = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.811$
Info gain = $0.94 - \tfrac{4}{14} \times 1 - \tfrac{6}{14} \times 0.918 - \tfrac{4}{14} \times 0.811 = 0.029$
SplitInfo = $-\tfrac{4}{14}\log_2\tfrac{4}{14} - \tfrac{6}{14}\log_2\tfrac{6}{14} - \tfrac{4}{14}\log_2\tfrac{4}{14} = 1.557$
Gain ratio = 0.029 / 1.557 = 0.019
Decision Tree Algorithm (C4.5)
Step 2 (cont'd)
Compute the gain ratio for each attribute.
Wind splits S into Strong (3+, 3-) and Weak (6+, 2-):
$E_{\text{Strong}} = -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1$
$E_{\text{Weak}} = -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} = 0.811$
Info gain = $0.94 - \tfrac{6}{14} \times 1 - \tfrac{8}{14} \times 0.811 = 0.048$
SplitInfo = $-\tfrac{6}{14}\log_2\tfrac{6}{14} - \tfrac{8}{14}\log_2\tfrac{8}{14} = 0.985$
Gain ratio = 0.048 / 0.985 = 0.049
Decision Tree Algorithm (C4.5)
Step 2 (cont'd)
Summary of gain ratios:
Gain ratio (Outlook)     = 0.156, the largest, so the root node is Outlook
Gain ratio (Humidity)    = 0.151
Gain ratio (Wind)        = 0.049
Gain ratio (Temperature) = 0.019
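All four gain ratios can be reproduced from the per-subset class counts (a sketch under my own naming; differences in the last digit versus the slides come from intermediate rounding):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(subsets):
    totals = [sum(s) for s in subsets]
    info_s = entropy([sum(col) for col in zip(*subsets)])
    info_x = sum(t / sum(totals) * entropy(s) for t, s in zip(totals, subsets))
    return (info_s - info_x) / entropy(totals)

SPLITS = {  # attribute -> (positive, negative) counts per value
    "Outlook":     [(2, 3), (4, 0), (3, 2)],
    "Humidity":    [(3, 4), (6, 1)],
    "Wind":        [(3, 3), (6, 2)],
    "Temperature": [(2, 2), (4, 2), (3, 1)],
}
for name, subsets in SPLITS.items():
    print(f"{name:12s} gain ratio = {gain_ratio(subsets):.3f}")
```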
Decision Tree Algorithm (C4.5)
Step 3
Decide the next attribute under the root node.
Temperature, Humidity, or Wind may be chosen under the Sunny and Rain branches. (The Overcast branch needs no further split: all of its cases are Yes, so no information remains to gain.)
[Figure: partial tree. Outlook at the root; Overcast → Yes; Sunny and Rain branches still to be expanded]
Decision Tree Algorithm (C4.5)
Step 3 (cont'd)
Look under Outlook = Sunny (2+, 3-).
Temperature, Humidity, or Wind may be chosen under Outlook = Sunny.
$E(\text{Outlook}=\text{Sunny}) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$
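The same procedure then recurses on each open branch. A sketch for the Outlook = Sunny subset, with the (positive, negative) counts read off the training table (names are mine):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(subsets):
    totals = [sum(s) for s in subsets]
    info_s = entropy([sum(col) for col in zip(*subsets)])
    info_x = sum(t / sum(totals) * entropy(s) for t, s in zip(totals, subsets))
    return (info_s - info_x) / entropy(totals)

# Class counts inside Outlook = Sunny (days D1, D2, D8, D9, D11)
SUNNY = {
    "Humidity":    [(0, 3), (2, 0)],          # High, Normal
    "Temperature": [(0, 2), (1, 1), (1, 0)],  # Hot, Mild, Cool
    "Wind":        [(1, 2), (1, 1)],          # Weak, Strong
}
for name, subsets in SUNNY.items():
    print(f"{name:12s} gain ratio = {gain_ratio(subsets):.3f}")
# Humidity wins (1.000), so it becomes the node under Sunny.
```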