Decision Trees
➢ A decision tree is a non-parametric supervised learning algorithm.
➢ It has a hierarchical, tree structure, which consists of a root node, branches, internal
nodes and leaf nodes.
➢ A decision tree is a flowchart-like structure in which each internal node represents a
test on a feature (e.g. whether a coin flip comes up heads or tails), each leaf node
represents a class label (decision taken after computing all features) and branches
represent conjunctions of features that lead to those class labels.
➢ Root Node: The initial node at the beginning of a decision tree, where the
entire population or dataset starts dividing based on various features or
conditions.
➢ Decision Nodes: Nodes obtained by splitting the root node (or another
internal node) are known as decision nodes. These nodes represent
intermediate decisions or conditions within the tree.
➢ Leaf Nodes: Nodes where further splitting is not possible, often
indicating the final classification or outcome. Leaf nodes are also referred
to as terminal nodes.
➢ Pruning: The process of removing or cutting down specific nodes in a
decision tree to prevent overfitting and simplify the model.
➢ Branch / Sub-Tree: A subsection of the entire decision tree is referred to
as a branch or sub-tree. It represents a specific path of decisions and
outcomes within the tree.
➢ Parent and Child Node: In a decision tree, a node that is divided into
sub-nodes is known as a parent node, and the sub-nodes emerging from
it are referred to as child nodes. The parent node represents a decision or
condition, while the child nodes represent the potential outcomes or
further decisions based on that condition.
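To make these terms concrete, here is a minimal scikit-learn sketch (not part of the original notes; the Iris dataset and the depth limit are only illustrative assumptions) that fits a small tree and prints its structure, so the root node, decision nodes and leaf nodes can be seen directly:

```python
# Minimal sketch: fit a shallow decision tree and print its structure so that
# the root node, decision (internal) nodes and leaf nodes are visible.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
# max_depth=2 keeps the tree tiny: one root node, a few decision nodes, a few leaves.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Indented lines are child nodes; lines ending in "class: ..." are leaf nodes.
print(export_text(clf, feature_names=data.feature_names))
```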
Types of Decision Trees
Hunt’s algorithm, which was developed in the 1960s to model human learning in Psychology, forms the
foundation of many popular decision tree algorithms.
The most popular of these are ID3, C4.5 and CART.
Example of Decision Tree
➢ Decision trees are drawn upside down, with the root at the top.
➢ Decision trees are nothing but a bunch of if-else statements.
➢ In the diagram below, the tree first asks: what is the weather?
Did you notice anything in the above flowchart? We see that if the weather is cloudy, the decision is to play. Why
didn't it split further? Why did it stop there?
To answer this question, we need to know about a few more concepts such as entropy, information gain, and the Gini index.
But in simple terms, we can say here that the output for the training dataset is always "yes" for cloudy weather; since there
is no disorder here, we don't need to split the node further.
The goal of machine learning is to decrease uncertainty or disorder in the dataset, and for this we use decision
trees; a small if-else sketch of the tree above follows.
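As a rough illustration of the "bunch of if-else statements" idea, here is a sketch of the weather tree in Python. Only the cloudy branch is taken from the flowchart described above; the sunny and rainy sub-splits are placeholders borrowed from the worked example later in these notes, not a definitive rule set.

```python
# A decision tree is essentially nested if/else checks on feature values.
def will_play(weather: str, humidity: str, wind: str) -> str:
    if weather == "cloudy":
        # Pure node: in the training data every cloudy day was "yes",
        # so there is no disorder left and the tree stops splitting here.
        return "yes"
    elif weather == "sunny":
        # Mixed node: the tree asks a further question (placeholder split).
        return "yes" if humidity == "normal" else "no"
    else:  # rainy
        return "yes" if wind == "weak" else "no"

print(will_play("cloudy", "high", "strong"))  # -> "yes"
```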
Entropy
• Entropy is nothing but the uncertainty in our dataset or measure of disorder.
• Entropy is a measure of the randomness in the information being processed.
• Suppose a group of friends is deciding which of two movies to watch together on Sunday, and the votes are
split equally between the two. This is exactly what we call disorder: with an equal number of votes for each
movie, we can't really decide which one to watch. It would have been much easier if most of the votes had
gone to one movie.
In a decision tree, the output is mostly "yes" or "no".
The formula for Entropy is shown below. Mathematically, Entropy for one attribute is represented as:
E(S) = -p+ log2(p+) - p- log2(p-)
(all logarithms are to base 2; they are written simply as "log" in the calculations below)
Here,
• p+ is the probability of the positive class
• p- is the probability of the negative class
• S is the subset of the training examples
Entropy with respect to an attribute (i.e. after a split) is the weighted sum of the entropies of the branches:
E(T, X) = Σ over c in X of P(c) E(c)
where T → current state and X → selected attribute, and the sum runs over the values c of X.
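As a quick sanity check of the formula, here is a small Python helper (an illustrative sketch, not taken from the original notes) that computes the entropy of a node from its class counts:

```python
import math

def entropy(counts):
    """Entropy in bits of a node, given its class counts, e.g. [8, 4]."""
    n = sum(counts)
    # A zero count contributes 0 to the sum, by the usual convention.
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([5, 5]))   # equal votes -> maximum disorder: 1.0
print(entropy([10, 0]))  # pure node   -> no disorder:      0.0
```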
How do Decision Trees use Entropy?
Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells how random
our data is. A pure sub-split means that you should be getting either all "yes" or all "no".
Suppose a node initially has 8 "yes" and 4 "no" instances.
In order to make a decision tree, we need to calculate the impurity of each candidate split, and when the purity is 100%,
we make the node a leaf node.
To check the impurity of the splits on feature 2 and feature 3, we take the help of the entropy formula.
[Figure: entropy calculations for the feature 2 split and the feature 3 split.]
➢ We can clearly see from the tree itself that the left node has lower entropy (more purity) than the right node, since the left
node has a greater proportion of "yes" instances, so it is easier to make a decision there.
➢ Always remember that the higher the Entropy, the lower will be the purity and the higher will be the impurity.
The goal of machine learning is to decrease the uncertainty or impurity in the dataset. Entropy gives us the impurity of a
particular node, but on its own it does not tell us whether the entropy of the parent node has actually decreased after a split.
For this, we bring in a new metric called "information gain", which tells us how much the parent entropy has decreased
after splitting on some feature.
Information Gain
Information gain measures the reduction of uncertainty (entropy) achieved by splitting on some feature, and it is also the
deciding factor for which attribute should be selected as a decision node or the root node.
Information gain can be defined as the amount of information gained about a random variable or signal from observing
another random variable. It can be considered as the difference between the entropy of parent node and weighted
average entropy of child nodes.
It is simply the entropy of the full dataset minus the (weighted) entropy of the dataset given some feature.
It computes the difference between entropy before split and average entropy after split of the dataset based on given
attribute values. ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain.
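The same idea in code: a hedged sketch of information gain as the parent entropy minus the weighted average entropy of the child nodes. The split counts used in the quick check at the bottom are made up purely for illustration.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy of the parent minus the weighted average entropy of its children."""
    n = sum(parent_counts)
    weighted = sum((sum(child) / n) * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Illustrative only: split a [8 yes, 4 no] node into two child nodes.
print(information_gain([8, 4], [[6, 1], [2, 3]]))
```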
To understand this better, let's consider an example. Suppose our entire population has a total of 30 instances, and the
dataset is used to predict whether a person will go to the gym or not. Let's say 16 people go to the gym and 14 people
don't.
Now we have two features to predict whether he/she will go to the gym or not.
Feature 1 is “Energy” which takes two values “high” and “low”
Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral” and “Highly motivated”.
Let’s see how our decision tree will be made using these 2 features. We’ll use information gain to decide which
feature should be the root node and which feature should be placed after the split.
Let's calculate the entropy of the parent node:
E(Parent) = -(16/30) log(16/30) - (14/30) log(14/30) ≈ 0.99
To get the weighted average entropy of the child nodes of the "Energy" split, we weight the entropy of each branch
("high" and "low") by the fraction of instances it receives (the per-branch counts were shown in the accompanying figure):
E(Parent | Energy) = (weight of "high") × E(high) + (weight of "low") × E(low)
Now that we have E(Parent) and E(Parent | Energy), the information gain is:
Information Gain = E(Parent) - E(Parent | Energy) ≈ 0.37
Our parent entropy was near 0.99, and looking at this value of information gain we can say that the
entropy of the dataset will decrease by about 0.37 if we make "Energy" our root node.
Similarly, we will do this with the other feature “Motivation” and calculate its information gain
Let's calculate the entropy here. The parent entropy is the same as before, E(Parent) ≈ 0.99.
To get the weighted average entropy of each child node of the "Motivation" split, we proceed in the same way over the
three branches "No motivation", "Neutral" and "Highly motivated" (the per-branch counts were shown in the accompanying figure):
E(Parent | Motivation) = Σ (weight of branch) × E(branch)
Now that we have E(Parent) and E(Parent | Motivation), the information gain is:
Information Gain = E(Parent) - E(Parent | Motivation)
We now see that the "Energy" feature gives a larger reduction (0.37) than the "Motivation" feature. Hence
we select the feature with the highest information gain and then split the node based on that feature.
In this example "Energy" will be our root node, and we repeat the procedure for the sub-nodes. Here we can see that when
the energy is "high" the entropy is low, so we can say a person will almost certainly go to the gym if they have
high energy; but what if the energy is low? In that case we split the node again, based on the remaining feature,
"Motivation".
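A short numerical check of the figures quoted in this example. Only the parent counts (16 vs 14) are given in the text; the per-branch counts for "Energy" were in a figure that is not reproduced here, so the sketch below only verifies the parent entropy and the weighted child entropy implied by a gain of 0.37.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

e_parent = entropy([16, 14])      # 16 people go to the gym, 14 do not
print(round(e_parent, 3))         # ~0.997, i.e. "near 0.99" as stated above

# If Gain(Energy) is about 0.37, the weighted child entropy must be roughly:
print(round(e_parent - 0.37, 2))  # ~0.63 = E(Parent | Energy)
```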
Let us now work through the classic "Play Tennis" example: a dataset of 14 instances with the attributes Outlook,
Temperature, Humidity and Wind, where the class variable Play is "Yes" for 9 instances and "No" for 5.

Find the entropy of the class variable:
E(S) = -[(9/14)log(9/14) + (5/14)log(5/14)] = 0.94

From the data we can easily arrive at the following table for Outlook:

              Play
Outlook    Yes   No   Total
Sunny        2    3     5
Overcast     4    0     4
Rain         3    2     5
Total        9    5    14

Now we have to calculate the average weighted entropy, i.e. the sum over the values of the feature of the weight of
each value multiplied by its entropy:

E(2,3) = Entropy(S_Sunny)    = -2/5 log(2/5) - 3/5 log(3/5) = 0.971
E(4,0) = Entropy(S_Overcast) = -4/4 log(4/4) - 0/4 log(0/4) = 0
E(3,2) = Entropy(S_Rain)     = -3/5 log(3/5) - 2/5 log(2/5) = 0.971

The next step is to find the information gain. It is the difference between the parent entropy and the average
weighted entropy we found above:

Gain(S, Outlook) = Entropy(S) - 5/14 Entropy(S_Sunny) - 4/14 Entropy(S_Overcast) - 5/14 Entropy(S_Rain)
                 = 0.94 - 5/14 × 0.971 - 4/14 × 0 - 5/14 × 0.971 = 0.2464
Similarly, find the information gain for Temperature [9+, 5-]: Hot, Mild and Cool.

E(S) = -[(9/14)log(9/14) + (5/14)log(5/14)] = 0.94

              Play
Temp       Yes   No   Total
Hot          2    2     4
Mild         4    2     6
Cool         3    1     4
Total        9    5    14

E(2,2) = Entropy(S_Hot)  = -2/4 log(2/4) - 2/4 log(2/4) = 1.0
E(4,2) = Entropy(S_Mild) = -4/6 log(4/6) - 2/6 log(2/6) = 0.9183
E(3,1) = Entropy(S_Cool) = -3/4 log(3/4) - 1/4 log(1/4) = 0.8113

Gain(S, Temperature) = Entropy(S) - 4/14 Entropy(S_Hot) - 6/14 Entropy(S_Mild) - 4/14 Entropy(S_Cool)
                     = 0.94 - 4/14 × 1.0 - 6/14 × 0.9183 - 4/14 × 0.8113 = 0.0289
Similarly, find the information gain for Humidity [9+, 5-]: High and Normal.

E(S) = -[(9/14)log(9/14) + (5/14)log(5/14)] = 0.94

              Play
Humidity   Yes   No   Total
High         3    4     7
Normal       6    1     7
Total        9    5    14

E(3,4) = Entropy(S_High)   = -3/7 log(3/7) - 4/7 log(4/7) = 0.9852
E(6,1) = Entropy(S_Normal) = -6/7 log(6/7) - 1/7 log(1/7) = 0.5916

Gain(S, Humidity) = Entropy(S) - 7/14 Entropy(S_High) - 7/14 Entropy(S_Normal)
                  = 0.94 - 7/14 × 0.9852 - 7/14 × 0.5916 = 0.1516
Similarly, find the information gain for Wind [9+, 5-]: Weak and Strong.

E(S) = -[(9/14)log(9/14) + (5/14)log(5/14)] = 0.94

              Play
Wind       Yes   No   Total
Weak         6    2     8
Strong       3    3     6
Total        9    5    14

E(6,2) = Entropy(S_Weak)   = -6/8 log(6/8) - 2/8 log(2/8) = 0.8113
E(3,3) = Entropy(S_Strong) = -3/6 log(3/6) - 3/6 log(3/6) = 1.0

Gain(S, Wind) = Entropy(S) - 8/14 Entropy(S_Weak) - 6/14 Entropy(S_Strong)
              = 0.94 - 8/14 × 0.8113 - 6/14 × 1.0 = 0.0478
Gain(S, Outlook) = 0.2464
Gain(S, Temperature) = 0.0289
Gain(S, Humidity) = 0.1516
Gain(S, Wind) = 0.0478
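The four gains above can be reproduced from the contingency tables with a few lines of Python (a sketch using the counts given in this section; small differences in the last decimal place are only rounding):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum((sum(c) / n) * entropy(c) for c in children)

parent = [9, 5]                                   # 9 "Yes", 5 "No"
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],      # Sunny, Overcast, Rain
    "Temperature": [[2, 2], [4, 2], [3, 1]],      # Hot, Mild, Cool
    "Humidity":    [[3, 4], [6, 1]],              # High, Normal
    "Wind":        [[6, 2], [3, 3]],              # Weak, Strong
}
for name, children in splits.items():
    print(name, round(information_gain(parent, children), 4))
# Outlook has the largest gain, so it is chosen as the root node.
```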
Outlook has the highest information gain, so it is selected as the root node. The Overcast branch is already pure
(4 "Yes", 0 "No") and becomes a leaf node. The next step is to find the next node in our decision tree: we will now
find the node under Sunny, and we have to determine which of Temperature, Humidity or Wind has the highest
information gain there.

Calculate the parent entropy of the Sunny branch, E(S_Sunny) [2+, 3-]:
E(S_Sunny) = -(2/5)log(2/5) - (3/5)log(3/5) = 0.971
Find the information gain for Temperature under Sunny [2+, 3-]: Hot, Mild and Cool.

E(S_Sunny) = -(2/5)log(2/5) - (3/5)log(3/5) = 0.971

              Play
Temp       Yes   No   Total
Hot          0    2     2
Mild         1    1     2
Cool         1    0     1
Total        2    3     5

E(0,2) = Entropy(S_Hot)  = -0/2 log(0/2) - 2/2 log(2/2) = 0.0
E(1,1) = Entropy(S_Mild) = -1/2 log(1/2) - 1/2 log(1/2) = 1.0
E(1,0) = Entropy(S_Cool) = -1/1 log(1/1) - 0/1 log(0/1) = 0.0

Gain(S_Sunny, Temperature) = E(S_Sunny) - 2/5 Entropy(S_Hot) - 2/5 Entropy(S_Mild) - 1/5 Entropy(S_Cool)
                           = 0.971 - 2/5 × 0.0 - 2/5 × 1.0 - 1/5 × 0.0 = 0.571
Find the information gain for Humidity under Sunny [2+, 3-]: High and Normal.

E(S_Sunny) = -(2/5)log(2/5) - (3/5)log(3/5) = 0.971

              Play
Humidity   Yes   No   Total
High         0    3     3
Normal       2    0     2
Total        2    3     5

E(0,3) = Entropy(S_High)   = -0/3 log(0/3) - 3/3 log(3/3) = 0.0
E(2,0) = Entropy(S_Normal) = -2/2 log(2/2) - 0/2 log(0/2) = 0.0

Gain(S_Sunny, Humidity) = E(S_Sunny) - 3/5 Entropy(S_High) - 2/5 Entropy(S_Normal)
                        = 0.971 - 3/5 × 0.0 - 2/5 × 0.0 = 0.971
Find the information gain for Wind under Sunny [2+, 3-]: Weak and Strong.

E(S_Sunny) = -(2/5)log(2/5) - (3/5)log(3/5) = 0.971

              Play
Wind       Yes   No   Total
Weak         1    1     2
Strong       1    2     3
Total        2    3     5

E(1,1) = Entropy(S_Weak)   = -1/2 log(1/2) - 1/2 log(1/2) = 1.0
E(1,2) = Entropy(S_Strong) = -1/3 log(1/3) - 2/3 log(2/3) = 0.9183

Gain(S_Sunny, Wind) = E(S_Sunny) - 2/5 Entropy(S_Weak) - 3/5 Entropy(S_Strong)
                    = 0.971 - 2/5 × 1.0 - 3/5 × 0.9183 = 0.020

Gain(S_Sunny, Temperature) = 0.571
Gain(S_Sunny, Humidity) = 0.971
Gain(S_Sunny, Wind) = 0.020

Humidity has the highest information gain under Sunny, so Humidity becomes the decision node on the Sunny branch;
both of its children (High → "No", Normal → "Yes") are pure and become leaf nodes.
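The same helper can be applied recursively to the Sunny subset to check the three gains just computed (counts taken from the tables above; again only a sketch, and tiny rounding differences are expected):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum((sum(c) / n) * entropy(c) for c in children)

sunny_parent = [2, 3]                                            # 2 "Yes", 3 "No" under Sunny
print(information_gain(sunny_parent, [[0, 2], [1, 1], [1, 0]]))  # Temperature ~0.571
print(information_gain(sunny_parent, [[0, 3], [2, 0]]))          # Humidity    ~0.971
print(information_gain(sunny_parent, [[1, 1], [1, 2]]))          # Wind        ~0.020
```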
The next step is to find the next node in our decision tree. Now we will find the node under Rain. We have to
determine which of Temperature, Humidity or Wind has the highest information gain there.
Find the information gain for Temperature under Rain [3+, 2-]: Hot, Mild and Cool.

E(S_Rain) = -(3/5)log(3/5) - (2/5)log(2/5) = 0.971

              Play
Temp       Yes   No   Total
Hot          0    0     0
Mild         2    1     3
Cool         1    1     2
Total        3    2     5

E(0,0) = Entropy(S_Hot)  = 0.0 (no Rain instances are Hot, so this branch contributes nothing)
E(2,1) = Entropy(S_Mild) = -2/3 log(2/3) - 1/3 log(1/3) = 0.9183
E(1,1) = Entropy(S_Cool) = -1/2 log(1/2) - 1/2 log(1/2) = 1.0

Gain(S_Rain, Temperature) = E(S_Rain) - 0/5 Entropy(S_Hot) - 3/5 Entropy(S_Mild) - 2/5 Entropy(S_Cool)
                          = 0.971 - 0/5 × 0.0 - 3/5 × 0.9183 - 2/5 × 1.0 = 0.020
Find the information gain for Humidity under Rain [3+, 2-]: High and Normal.

E(S_Rain) = -(3/5)log(3/5) - (2/5)log(2/5) = 0.971

              Play
Humidity   Yes   No   Total
High         1    1     2
Normal       2    1     3
Total        3    2     5

E(1,1) = Entropy(S_High)   = -1/2 log(1/2) - 1/2 log(1/2) = 1.0
E(2,1) = Entropy(S_Normal) = -2/3 log(2/3) - 1/3 log(1/3) = 0.9183

Gain(S_Rain, Humidity) = E(S_Rain) - 2/5 Entropy(S_High) - 3/5 Entropy(S_Normal)
                       = 0.971 - 2/5 × 1.0 - 3/5 × 0.9183 = 0.020
Find the information gain for Wind under Rain [3+, 2-]: Weak and Strong.

E(S_Rain) = -(3/5)log(3/5) - (2/5)log(2/5) = 0.971

              Play
Wind       Yes   No   Total
Weak         3    0     3
Strong       0    2     2
Total        3    2     5

E(3,0) = Entropy(S_Weak)   = -3/3 log(3/3) - 0/3 log(0/3) = 0.0
E(0,2) = Entropy(S_Strong) = -0/2 log(0/2) - 2/2 log(2/2) = 0.0

Gain(S_Rain, Wind) = E(S_Rain) - 3/5 Entropy(S_Weak) - 2/5 Entropy(S_Strong)
                   = 0.971 - 3/5 × 0.0 - 2/5 × 0.0 = 0.971

Gain(S_Rain, Temperature) = 0.020
Gain(S_Rain, Humidity) = 0.020
Gain(S_Rain, Wind) = 0.971

Wind has the highest information gain under Rain, so Wind becomes the decision node on the Rain branch; both of its
children (Weak → "Yes", Strong → "No") are pure leaf nodes, and the tree is complete.
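Putting the whole ID3 result together, the finished tree can be written as the nested if/else it really is. The branch outcomes follow directly from the tables above (Overcast is pure "Yes", Humidity decides under Sunny, Wind decides under Rain); the function name is just for illustration.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> str:
    """Classify one day using the tree grown above (a sketch of the result)."""
    if outlook == "Overcast":
        return "Yes"                                     # pure leaf: 4 Yes, 0 No
    elif outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"   # High: 0 Yes, 3 No
    else:                                                # Rain
        return "Yes" if wind == "Weak" else "No"         # Strong: 0 Yes, 2 No

print(play_tennis("Sunny", "High", "Weak"))   # -> "No"
print(play_tennis("Rain", "High", "Weak"))    # -> "Yes"
```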
When to Stop Splitting?
You may be asking yourself: when do we stop growing the tree? Real-world datasets usually have a large number of
features, which results in a large number of splits and, in turn, a huge tree. Such trees take time to build and can
lead to overfitting: the tree gives very good accuracy on the training dataset but poor accuracy on the test data.
There are many ways to tackle this problem through hyperparameter tuning. We can set the maximum depth of
our decision tree using the max_depth parameter. The larger the value of max_depth, the more complex the tree
will be. The training error will of course decrease if we increase max_depth, but when our test data
comes into the picture, we will get very bad accuracy. Hence we need a value that neither overfits nor underfits
our data, and for this we can tune max_depth, for example with cross-validation.
Another way is to set the minimum number of samples for each split, denoted by min_samples_split. Here we
specify the minimum number of samples required to perform a split. For example, we can require a minimum of 10 samples
to split a node: if a node has fewer than 10 samples, then using this parameter we stop further splitting of the
node and make it a leaf node.
There are more hyperparameters, such as the following (a scikit-learn sketch follows this list):
• min_samples_leaf – the minimum number of samples required to be in a leaf node. Increasing this value
constrains the tree and reduces the risk of overfitting.
• max_features – it helps us decide how many features to consider when looking for the best split.
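These stopping-related hyperparameters map directly onto scikit-learn's DecisionTreeClassifier. The sketch below is only illustrative: the dataset and the particular values (depth 4, 10 samples per split, 5 per leaf) are assumptions, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,            # cap the depth of the tree
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf must contain at least 5 samples
    max_features="sqrt",    # number of features considered at each split
    random_state=0,
)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```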
Pruning
Pruning is another method that can help us avoid overfitting. It improves the performance
of the tree by cutting out the nodes or sub-nodes which are not significant, removing the
branches which have very low importance.
There are mainly 2 ways for pruning:
• Pre-pruning – we stop growing the tree early, which means we prune/remove a node of low
importance while the tree is being grown.
• Post-pruning – once the tree is built to its full depth, we start pruning nodes based on their
significance (see the cost-complexity pruning sketch below).
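As a hedged example of post-pruning, scikit-learn implements cost-complexity pruning through the ccp_alpha parameter (pre-pruning corresponds to limits like max_depth set before training). The dataset and the step used to thin the list of candidate alphas are illustrative assumptions only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths for this training set (a full tree is grown internally).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Larger ccp_alpha -> more aggressive pruning -> smaller tree.
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```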