Decision Tree
A decision tree is a supervised learning algorithm used for both classification and
regression. The main goal of a decision tree is to create a model that predicts the value of
a target variable by learning simple decision rules inferred from the data features.
[Figure: an example decision tree for predicting mileage. The root node tests Weight (Heavy / Not Heavy); one branch leads directly to a High Mileage leaf, while the other tests Horsepower (<=86 / >86), leading to High Mileage and Low Mileage leaves.]
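Such a tree is just a cascade of simple decision rules. A minimal Python sketch of how it makes a prediction is given below; the function name is our own, and since the figure layout is partly garbled the leaf labels here are only illustrative:

def predict_mileage(weight, horsepower):
    # Root node: test the Weight attribute first.
    if weight == "Heavy":
        return "High Mileage"      # leaf reached directly from the root, as laid out in the figure
    # Otherwise test Horsepower at the threshold 86.
    if horsepower <= 86:
        return "High Mileage"
    return "Low Mileage"

print(predict_mileage("Not Heavy", 90))   # -> "Low Mileage"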
Q. How do we build a decision tree?
Ans: - We can build the decision tree using the sequence of formulas mentioned below (a short Python sketch of all three steps follows the list):
1. Information Gain: - It is also known as the "Kullback-Leibler divergence", denoted
by IG(S, A). It is the effective change in entropy of a set "S" after splitting on a
particular attribute "A"; in other words, the effective change in entropy w.r.t. an
independent variable.
[ IG(S, A) = H(S) − H(S, A) = H(S) − ∑ P(x) ∗ H(x) ]
Here, H(S) -> Entropy of S; the sum runs over the values x of attribute A, and H(x) is the entropy of the subset of S having A = x.
We can also rewrite the above formula, for a set containing S "Yes" examples and A "No" examples, as: -
[ IG(S, A) = − ( S / (S + A) ) ∗ log2( S / (S + A) ) − ( A / (S + A) ) ∗ log2( A / (S + A) ) ]
2. Entropy: - It is also known as the "Shannon entropy", denoted by H(S) for a finite set
"S". It is the measure of the amount of uncertainty or randomness in the data.
[ H(S) = ∑ P(i) ∗ log2( 1 / P(i) ) ] , summed for i = 1 to n
Here, n -> No. of classes (possible outcomes)
P(i) -> Probability of class "i"
We can also rewrite the entropy of a set S with respect to an attribute as: -
[ H(S, A) = ∑ ( (Si + Ai) / (S + A) ) ∗ IG(Si, Ai) ]
where the sum runs over the values of the attribute, and Si and Ai are the numbers of "Yes" and "No" examples having that value.
We can say that entropy tells us about the predictability of a certain event.
3. Final Gain: - In this step we choose the attribute that gives us the highest final gain,
and we take that attribute as the root node of our decision tree.
[ Final Gain = IG(S, A) − H(S, A) ]
Here IG(S, A) is the value from formula 1 and H(S, A) is the attribute entropy from formula 2.
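As promised above, here is a minimal Python sketch of the three formulas (all function and variable names are our own; s and a stand for the counts of "Yes" and "No" examples):

from math import log2

def ig(s, a):
    # Formula 1: information gain term for a set with s "Yes" and a "No" examples
    # (the binary Shannon entropy of that set); a zero count contributes nothing.
    total = s + a
    return -sum((c / total) * log2(c / total) for c in (s, a) if c)

def attribute_entropy(per_value_counts):
    # Formula 2: H(S, A) = sum over attribute values of (Si + Ai)/(S + A) * IG(Si, Ai).
    # per_value_counts is a list of (Si, Ai) pairs, one pair per attribute value.
    grand_total = sum(si + ai for si, ai in per_value_counts)
    return sum(((si + ai) / grand_total) * ig(si, ai) for si, ai in per_value_counts)

def final_gain(s, a, per_value_counts):
    # Formula 3: Final Gain = IG(S, A) - H(S, A).
    return ig(s, a) - attribute_entropy(per_value_counts)

# Example: the "Size" attribute from the dataset below, whose values S, M, L
# have (Yes, No) counts (1, 1), (1, 0) and (2, 2) respectively.
print(round(final_gain(4, 3, [(1, 1), (1, 0), (2, 2)]), 3))   # -> 0.128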
Example: -
Consider the dataset given below: -
Size Shape Colour Choice
M Brick Blue Yes
S Wedge Red No
L Wedge Red No
S Sphere Red Yes
L Pillar Green Yes
L Pillar Red No
L Sphere Green Yes
Here we have 2 classes: -
1. "Yes", with count S = 4
2. "No", with count A = 3
Step-1: - Information Gain
IG(S, A) = − (4/7) ∗ log2(4/7) − (3/7) ∗ log2(3/7) = 0.985
Step-2: - Entropy
Size     Si    Ai    IG(Si, Ai)
S        1     1     1
M        1     0     0
L        2     2     1
H(Size) = (2/7) ∗ 1 + (1/7) ∗ 0 + (4/7) ∗ 1 = 0.857
Shape    Si    Ai    IG(Si, Ai)
Brick    1     0     0
Wedge    0     2     0
Sphere   2     0     0
Pillar   1     1     1
H(Shape) = (1/7) ∗ 0 + (2/7) ∗ 0 + (2/7) ∗ 0 + (2/7) ∗ 1 = 0.286
Colour   Si    Ai    IG(Si, Ai)
Blue     1     0     0
Red      1     3     0.811
Green    2     0     0
H(Colour) = (1/7) ∗ 0 + (4/7) ∗ 0.811 + (2/7) ∗ 0 = 0.464
Step-3: - Final Gain
FG(Size) = IG(S,A) − H(Size) = 0.985 − 0.857 = 0.128
FG(Shape) = IG(S,A) − H(Shape) = 0.985 − 0.286 = 0.699
FG(Colour) = IG(S,A) − H(Colour) = 0.985 − 0.464 = 0.521
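For reference, the three final gains can be re-computed directly from the dataset with a short Python script (all names here are our own); the results agree with the hand calculation up to rounding:

from math import log2
from collections import defaultdict

# The example dataset: (Size, Shape, Colour, Choice)
ROWS = [
    ("M", "Brick",  "Blue",  "Yes"),
    ("S", "Wedge",  "Red",   "No"),
    ("L", "Wedge",  "Red",   "No"),
    ("S", "Sphere", "Red",   "Yes"),
    ("L", "Pillar", "Green", "Yes"),
    ("L", "Pillar", "Red",   "No"),
    ("L", "Sphere", "Green", "Yes"),
]

def binary_entropy(yes, no):
    # IG(Si, Ai): entropy of a Yes/No count pair; a zero count contributes nothing.
    total = yes + no
    return -sum((c / total) * log2(c / total) for c in (yes, no) if c)

def final_gain(rows, column):
    # column is the attribute index: 0 = Size, 1 = Shape, 2 = Colour.
    yes = sum(1 for r in rows if r[3] == "Yes")
    no = len(rows) - yes
    counts = defaultdict(lambda: [0, 0])          # attribute value -> [Si, Ai]
    for r in rows:
        counts[r[column]][0 if r[3] == "Yes" else 1] += 1
    h_attr = sum(((si + ai) / len(rows)) * binary_entropy(si, ai)
                 for si, ai in counts.values())
    return binary_entropy(yes, no) - h_attr

for name, col in [("Size", 0), ("Shape", 1), ("Colour", 2)]:
    print(name, round(final_gain(ROWS, col), 3))
# Prints approximately: Size 0.128, Shape 0.7, Colour 0.522
# -> Shape has the highest final gain and is chosen as the root.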
Now, we select the feature with the highest final gain, i.e. "Shape", and choose it as the root:
[Tree: root node Shape with four branches: Brick, Wedge, Sphere and Pillar.]
It is clear from the data that the "Choice" class for each of these branches is: -
1. Brick = “Yes”
2. Wedge = “No”
3. Sphere = “Yes”
But for “Pillar” we have two choices: -
Size Shape Colour Choice
L Pillar Green Yes
L Pillar Red No
Now, we can remove the "Shape" attribute from the dataset and select the attribute with
the next highest final gain, i.e. "Colour". Since we have already made a choice for every
shape value except "Pillar", we only need to split the "Pillar" branch on "Colour". Then
our decision tree will look like: -
[Tree: root node Shape. Brick -> Yes, Wedge -> No, Sphere -> Yes, Pillar -> Colour node; under Colour, Green -> Yes and Red -> No.]
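As a rough sketch of this whole construction (an ID3-style recursion; the names are our own, and the entropy/gain helpers are written out again so the snippet runs on its own), the tree above can also be built programmatically:

from math import log2
from collections import Counter, defaultdict

ROWS = [
    ("M", "Brick",  "Blue",  "Yes"),
    ("S", "Wedge",  "Red",   "No"),
    ("L", "Wedge",  "Red",   "No"),
    ("S", "Sphere", "Red",   "Yes"),
    ("L", "Pillar", "Green", "Yes"),
    ("L", "Pillar", "Red",   "No"),
    ("L", "Sphere", "Green", "Yes"),
]

def entropy(rows):
    # Shannon entropy of the Choice labels in these rows.
    counts = Counter(r[3] for r in rows)
    return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def gain(rows, col):
    # Final gain of splitting these rows on the attribute at index col.
    groups = defaultdict(list)
    for r in rows:
        groups[r[col]].append(r)
    remainder = sum((len(g) / len(rows)) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder

def build(rows, attrs):
    # attrs maps attribute name -> column index; returns a nested dict or a leaf label.
    labels = {r[3] for r in rows}
    if len(labels) == 1:                       # pure branch -> leaf
        return labels.pop()
    if not attrs:                              # no attributes left -> majority label
        return Counter(r[3] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, attrs[a]))
    col = attrs[best]
    rest = {a: c for a, c in attrs.items() if a != best}   # drop the used attribute
    return {best: {v: build([r for r in rows if r[col] == v], rest)
                   for v in sorted({r[col] for r in rows})}}

print(build(ROWS, {"Size": 0, "Shape": 1, "Colour": 2}))
# -> {'Shape': {'Brick': 'Yes', 'Pillar': {'Colour': {'Green': 'Yes', 'Red': 'No'}},
#               'Sphere': 'Yes', 'Wedge': 'No'}}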
Now, if we get an unlabelled data point X = <S, Pillar, Red>, then with the help of the above
decision tree we can easily predict its label as "No".
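That prediction can be read straight off the tree as a few if/else rules (a hard-coded sketch of the tree built above; the function name is our own):

def predict(size, shape, colour):
    # Root: test Shape; only the Pillar branch needs a second test on Colour.
    if shape == "Brick":
        return "Yes"
    if shape == "Wedge":
        return "No"
    if shape == "Sphere":
        return "Yes"
    if shape == "Pillar":
        return "Yes" if colour == "Green" else "No"
    return None   # shape value not seen in the training data

print(predict("S", "Pillar", "Red"))   # -> "No"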