Decision Tree in Machine Learning
A decision tree is a supervised learning algorithm used for both classification
and regression tasks. It has a hierarchical tree structure which consists of a
root node, branches, internal nodes and leaf nodes. It works like a
flowchart that helps to make decisions step by step, where:
Internal nodes represent attribute tests
Branches represent attribute values
Leaf nodes represent final decisions or predictions.
Decision trees are widely used due to their interpretability, flexibility and low
preprocessing needs.
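Because decision trees need little preprocessing, a basic classifier can be trained in just a few lines. Below is a minimal sketch using scikit-learn's DecisionTreeClassifier; the tiny toy dataset and feature encoding are assumptions made purely for illustration, not part of the original example:

```python
# Minimal sketch: training a decision tree classifier with scikit-learn.
# The toy data (income in dollars, age, previous purchases -> buy or not)
# is invented only to show the API.
from sklearn.tree import DecisionTreeClassifier

# Features: [income, age, previous_purchases]
X = [
    [60_000, 35, 2],
    [45_000, 28, 0],
    [80_000, 40, 1],
    [30_000, 22, 0],
]
y = ["Purchase", "No Purchase", "Purchase", "No Purchase"]

# criterion="entropy" makes the splits Information-Gain based
# (the default criterion is "gini")
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X, y)

# Predict for a new customer
print(model.predict([[55_000, 33, 1]]))
```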
How Does a Decision Tree Work?
A decision tree splits the dataset based on feature values to create pure
subsets ideally all items in a group belong to the same class. Each leaf node
of the tree corresponds to a class label and the internal nodes are feature-
based decision points. Let’s understand this with an example.
[Figure: Decision Tree]
Let’s consider a decision tree for predicting whether a customer will buy a
product based on age, income and previous purchases. Here's how the
decision tree works:
1. Root Node (Income)
First Question: "Is the person’s income greater than $50,000?"
If Yes, proceed to the next question.
If No, predict "No Purchase" (leaf node).
2. Internal Node (Age):
If the person’s income is greater than $50,000, ask: "Is the person’s age
above 30?"
If Yes, proceed to the next question.
If No, predict "No Purchase" (leaf node).
3. Internal Node (Previous Purchases):
If the person is above 30 and has made previous purchases, predict
"Purchase" (leaf node).
If the person is above 30 and has not made previous purchases, predict
"No Purchase" (leaf node).
Decision Making with 2 Decision Trees
Example: Predicting Whether a Customer Will Buy a Product Using Two
Decision Trees
Tree 1: Customer Demographics
The first tree asks two questions:
1. "Income > $50,000?"
If Yes, proceed to the next question.
If No, "No Purchase"
2. "Age > 30?"
Yes: "Purchase"
No: "No Purchase"
Tree 2: Previous Purchases
"Previous Purchases > 0?"
Yes: "Purchase"
No: "No Purchase"
Once we have predictions from both trees, we can combine the results to
make a final prediction. If Tree 1 predicts "Purchase" and Tree 2 predicts "No
Purchase", the final prediction might be "Purchase" or "No Purchase"
depending on the weight or confidence assigned to each tree. This can be
decided based on the problem context.
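As one concrete (hypothetical) way to combine the two trees, each tree can be given a weight and the weighted votes summed. The sketch below assumes weights of 0.6 and 0.4, chosen only to illustrate the idea:

```python
def tree1_demographics(income, age):
    """Tree 1: income, then age."""
    if income <= 50_000:
        return "No Purchase"
    return "Purchase" if age > 30 else "No Purchase"

def tree2_history(previous_purchases):
    """Tree 2: previous purchases only."""
    return "Purchase" if previous_purchases > 0 else "No Purchase"

def combined_prediction(income, age, previous_purchases, w1=0.6, w2=0.4):
    """Weighted vote of the two trees; the weights are illustrative."""
    votes = {"Purchase": 0.0, "No Purchase": 0.0}
    votes[tree1_demographics(income, age)] += w1
    votes[tree2_history(previous_purchases)] += w2
    return max(votes, key=votes.get)

# Tree 1 says "Purchase", Tree 2 says "No Purchase"; Tree 1's weight wins.
print(combined_prediction(income=70_000, age=35, previous_purchases=0))  # Purchase
```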
Information Gain and Gini Index in Decision Tree
So far we have covered the basic intuition behind how a decision tree
works, so let's move on to the attribute selection measures of a decision
tree. Two popular attribute selection measures are used:
1. Information Gain
Information Gain tells us how useful a question (or feature) is for splitting
data into groups. It measures how much the uncertainty decreases after the
split. A good question will create clearer groups and the feature with the
highest Information Gain is chosen to make the decision.
For example, if we split a dataset of people into "Young" and "Old" based on
age, and all young people bought the product while all old people did not, the
Information Gain would be high because the split perfectly separates the two
groups with no uncertainty left.
Suppose S is a set of instances, A is an attribute, S_v is the subset of S for
which A = v, v represents an individual value that the attribute A can take
and Values(A) is the set of all possible values of A, then
$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \cdot Entropy(S_v)$
Entropy: the measure of uncertainty of a random variable; it
characterizes the impurity of an arbitrary collection of examples. The
higher the entropy, the more the information content.
For example, if a dataset has an equal number of "Yes" and "No" outcomes
(like 3 people who bought a product and 3 who didn’t), the entropy is high
because it’s uncertain which outcome to predict. But if all the outcomes are
the same (all "Yes" or all "No"), the entropy is 0, meaning there is no
uncertainty left in predicting the outcome.
Formally, for a set S whose instances belong to classes c with proportions p_c, entropy is defined as
$Entropy(S) = -\sum_{c} p_c \log_2 p_c$
Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3
Entropy $H(X) = -\left[\frac{3}{8}\log_2\frac{3}{8} + \frac{5}{8}\log_2\frac{5}{8}\right] = -[0.375(-1.415) + 0.625(-0.678)] = -(-0.53 - 0.424) = 0.954$
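A few lines of Python can reproduce this calculation; this is only a sketch of the entropy formula applied to the counts above (the helper name entropy is ours, not from any particular library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

X = list("aaabbbbb")         # 3 a's and 5 b's
print(round(entropy(X), 3))  # 0.954
```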
Building a Decision Tree using Information Gain: The Essentials
Start with all training instances associated with the root node
Use info gain to choose which attribute to label each node with
Recursively construct each subtree on the subset of training instances
that would be classified down that path in the tree.
If all positive or all negative training instances remain, label that node
"yes" or "no" accordingly
If no attributes remain, label with a majority vote of the training instances
left at that node
If no instances remain, label with a majority vote of the parent's training
instances.
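These steps can be turned into a short recursive procedure. The sketch below is a simplified ID3-style implementation for discrete features, written only to illustrate the recursion; the function and variable names (and the tiny made-up dataset at the end) are our own, not from any library:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Information gain of splitting (rows, labels) on one feature index."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def build_tree(rows, labels, features, parent_majority=None):
    """Recursively build an ID3-style tree; returns nested dicts or a class label."""
    if not labels:                          # no instances: use parent's majority vote
        return parent_majority
    if len(set(labels)) == 1:               # pure node: label with that class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not features:                        # no attributes left: majority vote
        return majority
    # choose the attribute with the highest information gain
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    tree = {"feature": best, "children": {}}
    remaining = [f for f in features if f != best]
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            remaining, parent_majority=majority)
    return tree

# Tiny made-up run: feature 0 perfectly predicts the label, so it becomes the root.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["yes", "yes", "no", "no"]
print(build_tree(rows, labels, features=[0, 1]))
# prints a nested dict such as {'feature': 0, 'children': {0: 'no', 1: 'yes'}}
```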
Example: Now let us draw a Decision Tree for the following data using
Information gain. Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Here, we have 3 features and 2 output classes. To build a decision tree
using Information Gain, we will take each of the features and calculate the
information gain for each feature.
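As a quick check of which feature would be chosen as the root, the gains for X, Y and Z can be computed directly from the table above. The snippet below is a sketch using a hand-rolled information-gain helper (not a library function):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# The training set from the table above, one column per feature
X = [1, 1, 0, 1]
Y = [1, 1, 0, 0]
Z = [1, 0, 1, 0]
C = ["I", "I", "II", "II"]

for name, column in [("X", X), ("Y", Y), ("Z", Z)]:
    print(name, round(info_gain(column, C), 3))
# X 0.311, Y 1.0, Z 0.0  ->  Y gives the highest gain, so it would become the root.
```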