Decision tree
A decision tree is a versatile and intuitive tool for making decisions based on data. It
splits the data into subsets based on feature values and uses a tree-like structure to
represent the decision-making process, making it easy to understand and implement.
Graphically, a decision tree maps out the possible solutions to a decision under certain
conditions, and the underlying algorithm is used for both classification and regression
tasks. Here's a detailed yet simple explanation:
Key Components of a Decision Tree
Root Node:
1. The top node that represents the entire dataset.
2. It is the starting point for making decisions.
Decision Nodes:
1. These nodes represent the points where the data is split based on certain conditions
or features.
2. Each decision node has two or more branches.
Branches:
These are the lines connecting nodes, representing the outcome of a decision
or test at the decision node.
Leaf Nodes (Terminal Nodes):
1. These nodes represent the final outcome or class label.
2. They do not split further.
How It Works
Starting Point: The process begins at the root node.
Splitting: The dataset is split into subsets based on the value of a feature. The split is chosen
to optimize a criterion (for example, minimizing Gini impurity or maximizing information gain).
Recursive Process: This process is repeated recursively for each derived subset, splitting
them further into smaller subsets.
Stopping Criteria: The recursion stops when one of the stopping conditions is met, such as:
o All instances in a node belong to the same class.
o There are no more features to split on.
o The node contains fewer instances than a specified threshold.
Example
Imagine you want to decide whether to play outside based on the weather. You can
create a decision tree with the following steps:
1. Root Node: Start with the question, "Is it sunny?"
2. Decision Node: If the answer is yes, go to the next question, "Is it hot?"
3. Branches:
o If "Is it hot?" is yes, you might decide "Do not play outside."
o If "Is it hot?" is no, you might decide "Play outside."
4. Leaf Nodes: The decisions "Play outside" and "Do not play outside" are the leaf nodes.
Visual Representation
A decision tree visually looks like this:
Is it sunny? (Root Node)
├── Yes → Is it hot? (Decision Node)
│   ├── Yes → Don't play outside (Leaf Node)
│   └── No → Play outside (Leaf Node)
└── No → Play inside (Leaf Node)
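The same tree can also be written as a short chain of conditionals. This is only an illustrative sketch of the example above; the function and argument names are made up for clarity:

def play_decision(is_sunny: bool, is_hot: bool) -> str:
    # Walk the weather decision tree and return the leaf decision
    if is_sunny:                          # root node
        if is_hot:                        # decision node
            return "Do not play outside"  # leaf node
        return "Play outside"             # leaf node
    return "Play inside"                  # leaf node

print(play_decision(is_sunny=True, is_hot=False))  # Play outside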
Advantages
Easy to Understand: The tree structure is easy to interpret and visualize.
Non-Linear Relationships: Can capture non-linear relationships between features.
No Need for Feature Scaling: Decision trees do not require normalization of data.
Disadvantages
Overfitting: Trees can become too complex and overfit the training data.
Instability: Small changes in the data can result in very different tree structures.
For example, a job seeker might reason, "if I make less than $50k, I won't be able to support
my family." In this way, the complex and difficult decision of predicting one's future
happiness can be reduced to a series of simple decisions.
This decision tree is a graphical representation of the process to decide whether to
accept a new job offer based on several criteria. Here’s a detailed explanation of how
this decision tree works:
Root Node
Salary at least $50,000?
o This is the root node, where the decision-making process starts.
o The decision here is whether the salary offered is at least $50,000.
o There are two possible outcomes: "yes" or "no".
First Decision Level
If the answer is "no":
o Decline offer: The decision here is straightforward. If the salary is less than $50,000,
the job offer is declined.
If the answer is "yes":
o The next criterion is evaluated.
Second Decision Level
Commute more than 1 hour?
o This node checks if the commute to the new job will take more than 1 hour.
o There are two possible outcomes: "yes" or "no".
If the commute is more than 1 hour (answer is "yes"):
Decline offer: If the commute is longer than 1 hour, the job offer is declined.
If the commute is not more than 1 hour (answer is "no"):
The next criterion is evaluated.
Third Decision Level
Offers free coffee?
o This node checks if the job offers free coffee as a perk.
o There are two possible outcomes: "yes" or "no".
If the job offers free coffee (answer is "yes"):
Accept offer: If free coffee is provided, the job offer is accepted.
If the job does not offer free coffee (answer is "no"):
Decline offer: If free coffee is not provided, the job offer is declined.
Summary
1. Salary at least $50,000?
o No: Decline offer.
o Yes: Move to the next decision node.
2. Commute more than 1 hour?
o Yes: Decline offer.
o No: Move to the next decision node.
3. Offers free coffee?
o Yes: Accept offer.
o No: Decline offer.
Interpretation
This decision tree helps visualize the sequence of decisions made to determine
whether to accept a job offer based on three specific criteria: salary, commute time,
and availability of free coffee. Each node represents a decision point that leads to
either accepting or declining the job offer, providing a clear and logical decision-
making path.
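To make the decision path explicit, the same rule can be sketched as a small function; the argument names are illustrative, not taken from the original tree:

def evaluate_offer(salary: float, commute_hours: float, free_coffee: bool) -> str:
    # Apply the three criteria in order and return the final decision
    if salary < 50_000:       # root node
        return "Decline offer"
    if commute_hours > 1:     # second decision level
        return "Decline offer"
    if free_coffee:           # third decision level
        return "Accept offer"
    return "Decline offer"

print(evaluate_offer(salary=60_000, commute_hours=0.5, free_coffee=True))  # Accept offer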
In a decision tree, the root node and terminal nodes (also known as leaf nodes) have
specific roles and characteristics:
Root Node
Definition: The root node is the topmost node in a decision tree. It represents the initial
decision point and contains the entire dataset.
Role: It is the starting point for the decision-making process and is used to split the dataset
based on the most important attribute (according to a certain criterion like Gini impurity,
information gain, etc.).
Characteristics:
o It has no incoming edges.
o It can have two or more child nodes.
o The decision at the root node is based on the attribute that provides the best initial
split of the data.
In the provided decision tree example, the root node is:
"Salary at least $50,000?"
Terminal Node (Leaf Node)
Definition: Terminal nodes, also known as leaf nodes, are the nodes at the end of a branch in
a decision tree. They do not have any child nodes.
Role: Terminal nodes provide the final decision or classification for the instances that reach
that point in the tree.
Characteristics:
o They have no child nodes.
o They represent the outcome of a series of decisions.
o In classification trees, they contain the class label; in regression trees, they contain a
predicted value.
In the provided decision tree example, the terminal nodes (leaf nodes) are:
"Decline offer" (reached when salary < $50,000)
"Decline offer" (reached when commute > 1 hour)
"Accept offer" (reached when salary ≥ $50,000, commute ≤ 1 hour, and offers free coffee)
"Decline offer" (reached when salary ≥ $50,000, commute ≤ 1 hour, and does not offer free
coffee)
Example Explanation
Root Node: "Salary at least $50,000?"
This is the first decision point. Depending on whether the answer is "yes" or
"no," the decision tree proceeds to the next level of decision-making.
Intermediate Nodes (Decision Nodes):
"Commute more than 1 hour?"
This is a decision node that comes into play if the salary is at least $50,000.
o "Offers free coffee?"
This is another decision node that is considered if the commute is not more
than 1 hour.
Terminal Nodes (Leaf Nodes):
o "Decline offer": The outcome if the salary is less than $50,000, if the commute is
more than 1 hour, or if free coffee is not offered.
o "Accept offer": The outcome if the salary is at least $50,000, the commute is not
more than 1 hour, and free coffee is offered.
In summary, the root node is the starting point of the decision tree, where the first
split of the data occurs. Terminal nodes, or leaf nodes, represent the final decisions or
classifications made after a series of splits based on the attributes evaluated at each
decision node.
Decision trees are built using a heuristic called recursive partitioning. This approach
is also commonly known as divide and conquer because it splits the data into
subsets, which are then split repeatedly into even smaller subsets, and so on, until the
algorithm determines that the data within the subsets are sufficiently homogeneous or
another stopping criterion has been met.
Working down each branch, the algorithm continues to divide and conquer the data,
choosing the best candidate feature each time to create another decision node, until a
stopping criterion is reached. Divide and conquer might stop at a node when:
• All (or nearly all) of the examples at the node have the same class
• There are no remaining features to distinguish among the examples
• The tree has grown to a predefined size limit
To illustrate the tree building process, let's consider a simple example. Imagine that
you work for a Hollywood studio, where your role is to decide whether the studio
should move forward with producing the screenplays pitched by promising new
authors. After returning from a vacation, your desk is piled high with proposals.
Without the time to read each proposal cover-to-cover, you decide to develop a
decision tree algorithm to predict whether a potential movie would fall into one of
three categories: Critical Success, Mainstream Hit, or Box Office Bust.
To build the decision tree, you turn to the studio archives to examine the factors
leading to the success and failure of the company's 30 most recent releases. You
quickly notice a relationship between the film's estimated shooting budget, the
number of A-list celebrities lined up for starring roles, and the level of success.
Objective: Predict whether a potential movie will be a Critical Success,
Mainstream Hit, or Box Office Bust.
Input Data: Proposals for new screenplays from promising authors.
Problem: Limited time to read each proposal in detail.
Solution: Develop a decision tree algorithm to classify the movies.
Steps to Build the Decision Tree:
Collect Data: Gather features related to each screenplay (e.g., genre, director, budget,
cast).
Preprocess Data: Clean and prepare the data for analysis.
Choose Features: Select the most relevant features that influence movie success.
Split Data: Divide the data into training and testing sets.
Train Model: Use the training set to build the decision tree.
Test Model: Evaluate the model's accuracy with the testing set.
Optimize Tree: Prune the tree to avoid overfitting and improve performance.
Outcome: Use the decision tree to predict the success category for each
screenplay proposal.
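A compact sketch of these steps with scikit-learn is shown below. The feature values and labels are made-up placeholders standing in for the studio archive, not real data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder features: [estimated budget in $ millions, number of A-list celebrities]
X = np.array([[10, 5], [12, 6], [15, 7], [80, 6], [90, 7], [100, 8],
              [20, 0], [60, 1], [90, 0], [30, 1], [50, 0], [70, 1]])
# Placeholder labels: 0 = Critical Success, 1 = Mainstream Hit, 2 = Box Office Bust
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train the model; max_depth acts as a simple form of pruning to limit overfitting
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Test the model on the held-out set
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))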
The passage describes the initial steps in building a decision tree to predict the success
of movies based on historical data. Here's the meaning in simple terms:
Data Collection: You gather information from the studio's archives on the 30 most
recent movies the studio produced.
Identify Key Factors: You notice that the success of these movies is related to two
main factors:
Shooting Budget: How much money was spent on making the movie.
A-list Celebrities: The number of well-known, top-tier actors involved in the movie.
Observation of Relationships: You see a pattern where the budget and the presence of
A-list celebrities seem to influence whether the movies were successful or not.
Excited about this finding, you produce a scatterplot of estimated budget against the number of celebrities to illustrate the pattern.
Using the divide and conquer strategy, we can build a simple decision tree from this
data. First, to create the tree's root node, we split on the feature indicating the number
of celebrities, partitioning the movies into groups with and without a significant
number of A-list stars. Next, among the movies with a significant number of celebrities,
we make a second split on the estimated budget.
At this point, we have partitioned the data into three groups. The first group is composed
entirely of critically acclaimed films; it is distinguished by a high number of celebrities
and a relatively low budget. In the second group, the majority of movies are mainstream
hits, with high budgets and a large number of celebrities. The final group, which has little
star power but budgets ranging from small to large, contains the flops.
If we wanted, we could continue to divide and conquer the data by splitting it
based on the increasingly specific ranges of budget and celebrity count, until each of
the currently misclassified values resides in its own tiny partition, and is correctly
classified. However, it is not advisable to overfit a decision tree in this way. Though
there is nothing to stop us from splitting the data indefinitely, overly specific decisions
do not always generalize more broadly. We'll avoid the problem of overfitting by
stopping the algorithm here, since more than 80 percent of the examples in each group
are from a single class. This forms the basis of our stopping criterion.
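The 80 percent rule just described can be expressed as a small purity check. This is only a sketch of the idea, with a hypothetical helper name:

from collections import Counter

def is_pure_enough(labels, threshold=0.8):
    # Stop splitting once the most common class covers at least `threshold` of the node
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels) >= threshold

# A node holding 9 critical successes and 1 box office bust is "pure enough"
print(is_pure_enough(["Critical Success"] * 9 + ["Box Office Bust"]))  # True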
Our model for predicting the future success of movies can be represented as a simple
tree. To evaluate a script, follow the branches
through each decision until the script's success or failure has been predicted. In no
time, you will be able to identify the most promising options among the backlog of
scripts and get back to more important work, such as writing an Academy Awards
acceptance speech.
Since real-world data contains more than two features, decision trees quickly
become far more complex than this, with many more nodes, branches, and
leaves.
In the next section, you will learn about a popular algorithm for building decision
tree models automatically.
Decision tree induction is a method of creating a decision tree to model data and make
predictions. Here's a step-by-step algorithm for decision tree induction:
Algorithm for Decision Tree Induction
Start with the Entire Dataset:
Begin with the entire dataset as the root of the tree.
Select the Best Attribute:
Choose the attribute that best separates the data. This is typically done using a measure
like Information Gain, Gini Index, or Gain Ratio.
Split the Dataset:
Divide the dataset into subsets based on the selected attribute. Each subset should
correspond to one of the possible values of the attribute.
Create a Decision Node:
Create a decision node in the tree that corresponds to the chosen attribute. Add
branches from this node for each value of the attribute.
Recursively Apply the Process:
For each subset created in step 3, repeat the process:
1. If all instances in a subset belong to the same class, create a leaf node with
that class.
2. If the subset is empty, create a leaf node with the most common class in
the parent node.
3. If none of the above conditions are met, remove the attribute used for the
split from the set of attributes and repeat steps 2-4 for each branch.
Stopping Criteria:
1. Stop if all attributes have been used.
2. Stop if the depth of the tree reaches a predefined limit.
3. Stop if the number of instances in a node is less than a predefined threshold.
import math


class Node:
    def __init__(self, attribute=None, value=None, is_leaf=False, prediction=None):
        self.attribute = attribute        # attribute this node splits on
        self.value = value                # attribute value of the incoming branch
        self.is_leaf = is_leaf
        self.prediction = prediction      # class label stored at a leaf
        self.children = {}                # maps attribute value -> child Node


def decision_tree(data, attributes, target):
    # Check if all instances have the same class
    if all_same_class(data, target):
        return Node(is_leaf=True, prediction=data[0][target])
    # If no more attributes to split on, return a leaf node with the most common class
    if not attributes:
        return Node(is_leaf=True, prediction=most_common_class(data, target))
    # Select the best attribute to split on
    best_attr = select_best_attribute(data, attributes, target)
    # Create a decision node
    tree = Node(attribute=best_attr)
    # Split the data based on the best attribute and recurse on each subset
    for value in unique_values(data, best_attr):
        subset = split_data(data, best_attr, value)
        if not subset:
            tree.children[value] = Node(is_leaf=True,
                                        prediction=most_common_class(data, target))
        else:
            remaining = [attr for attr in attributes if attr != best_attr]
            tree.children[value] = decision_tree(subset, remaining, target)
    return tree


def all_same_class(data, target):
    # Check if all instances have the same class
    return len(set(row[target] for row in data)) == 1


def most_common_class(data, target):
    # Find the most common class in the data
    labels = [row[target] for row in data]
    return max(set(labels), key=labels.count)


def entropy(data, target):
    # Entropy of the class distribution: -sum(p * log2(p))
    labels = [row[target] for row in data]
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))


def select_best_attribute(data, attributes, target):
    # Use Information Gain (entropy reduction) to find the best attribute
    def information_gain(attribute):
        subsets = [split_data(data, attribute, v) for v in unique_values(data, attribute)]
        remainder = sum(len(s) / len(data) * entropy(s, target) for s in subsets)
        return entropy(data, target) - remainder
    return max(attributes, key=information_gain)


def unique_values(data, attribute):
    # Return unique values for the given attribute in the dataset
    return set(row[attribute] for row in data)


def split_data(data, attribute, value):
    # Split the dataset based on the attribute and value
    return [row for row in data if row[attribute] == value]
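As a quick usage sketch (assuming the functions above), the induction routine can be run on the six-movie dataset introduced in the example that follows; the print_tree helper is added here only for illustration:

# The six-movie dataset from the example below, as a list of dictionaries
movies = [
    {"Budget": "Low", "Genre": "Action", "Star Power": "High", "Success": "No"},
    {"Budget": "Low", "Genre": "Drama", "Star Power": "Medium", "Success": "Yes"},
    {"Budget": "Medium", "Genre": "Action", "Star Power": "Low", "Success": "Yes"},
    {"Budget": "Medium", "Genre": "Drama", "Star Power": "Medium", "Success": "No"},
    {"Budget": "High", "Genre": "Action", "Star Power": "High", "Success": "Yes"},
    {"Budget": "High", "Genre": "Drama", "Star Power": "High", "Success": "Yes"},
]

def print_tree(node, indent=""):
    # Illustrative helper: print the induced tree as indented text
    if node.is_leaf:
        print(indent + "-> " + str(node.prediction))
        return
    for value, child in node.children.items():
        print(indent + f"[{node.attribute} = {value}]")
        print_tree(child, indent + "    ")

tree = decision_tree(movies, ["Budget", "Genre", "Star Power"], "Success")
print_tree(tree)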
Example: Predicting Success of Hollywood Movies
Assume we have a dataset with the following attributes:
Budget: Low, Medium, High
Genre: Action, Comedy, Drama
Star Power: Low, Medium, High
Success: Yes, No (target attribute)
Step-by-Step Example
Initialize the Tree:
Start with the Entire Dataset: The entire dataset is our root.
Select the Best Attribute:
Calculate the Information Gain for each attribute and choose the one with the highest
gain. Assume 'Budget' has the highest Information Gain.
Split the Dataset:
Split the dataset based on 'Budget' (Low, Medium, High).
Create a Decision Node:
Create a decision node for 'Budget' with branches for Low, Medium, and High.
Recursively Apply the Process:
For each branch, repeat the process:
For 'Low' budget movies, calculate the Information Gain for 'Genre' and
'Star Power'. Assume 'Star Power' has the highest gain.
Split 'Low' budget movies based on 'Star Power'.
Create decision nodes and branches for 'Star Power'.
Continue this process until all data is classified or stopping criteria are met.
Stopping Criteria:
If a branch reaches a point where all instances belong to one class (e.g., all 'Low' budget,
'High' star power movies are successful), create a leaf node with that class ('Success' =
Yes).
[Budget]
├── Low → [Star Power]
│   ├── Low → Success = No
│   ├── Medium → Success = Yes
│   └── High → Success = No
├── Medium → [Genre]
│   ├── Action → Success = Yes
│   └── Drama → Success = No
└── High → Success = Yes
Let's explain each part of the tree:
Root Node: The root node is Budget, which splits into three branches: Low,
Medium, and High.
First Level:
For Low budget movies, the next best attribute is Star Power, which splits
into three branches: Low, Medium, and High.
1. For Low star power, the leaf node is Success = No.
2. For Medium star power, the leaf node is Success = Yes.
3. For High star power, the leaf node is Success = No.
For Medium budget movies, the next best attribute is Genre, which splits into
two branches: Action and Drama.
For Action genre, the leaf node is Success = Yes.
For Drama genre, the leaf node is Success = No.
For High budget movies, the leaf node is Success = Yes.
Let's go through examples of calculating Information Gain, Gini Index, and Entropy
using the Hollywood movie dataset. We'll focus on how to calculate these
metrics for the attribute Budget to decide the best split.
Step-by-Step Calculation of Information Gain, Gini Index, and
Entropy
Dataset Example
Consider a simplified dataset of movies with the attributes Budget, Genre, Star
Power, and Success:
Budget   Genre   Star Power   Success
Low      Action  High         No
Low      Drama   Medium       Yes
Medium   Action  Low          Yes
Medium   Drama   Medium       No
High     Action  High         Yes
High     Drama   High         Yes
1. Entropy: Entropy measures the impurity or disorder of a set of instances. For a set S
whose classes occur with proportions p_i, it is calculated as:
Entropy(S) = -sum_i ( p_i * log2(p_i) )
2. Information Gain: Information Gain is the reduction in entropy achieved by splitting S
on an attribute A:
Gain(S, A) = Entropy(S) - sum_v ( |S_v| / |S| ) * Entropy(S_v)
where S_v is the subset of S in which attribute A takes the value v.
3. Gini Index: The Gini Index measures impurity as:
Gini(S) = 1 - sum_i p_i^2
These calculations show how to determine the best attribute to split
the data on using Information Gain and Gini Index. The attribute with
the highest Information Gain or the lowest Gini Index is chosen for
the split.
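As a rough sketch of these calculations in code (using only the Budget and Success columns of the table above), the following computes entropy, Gini index, and information gain for a split on Budget:

import math
from collections import Counter

# (Budget, Success) pairs from the six-movie table above
rows = [
    ("Low", "No"), ("Low", "Yes"),
    ("Medium", "Yes"), ("Medium", "No"),
    ("High", "Yes"), ("High", "Yes"),
]

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2)
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in counts.values())

all_labels = [label for _, label in rows]
print("Entropy before split:", round(entropy(all_labels), 3))   # ~0.918
print("Gini before split:", round(gini(all_labels), 3))         # ~0.444

weighted_entropy = 0.0
weighted_gini = 0.0
for value in ("Low", "Medium", "High"):
    subset = [label for budget, label in rows if budget == value]
    weight = len(subset) / len(rows)
    print(f"Entropy({value}):", round(entropy(subset), 3))      # 1, 1, 0
    weighted_entropy += weight * entropy(subset)
    weighted_gini += weight * gini(subset)

print("Entropy after split:", round(weighted_entropy, 3))       # ~0.667
print("Gini after split:", round(weighted_gini, 3))             # ~0.333
print("Information Gain:", round(entropy(all_labels) - weighted_entropy, 3))  # ~0.252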
Let's go through the process of splitting the dataset for the Hollywood movies example
based on the values of Budget, using the calculated Information Gain and Gini Index.
Dataset Example
Budget   Genre   Star Power   Success
Low      Action  High         No
Low      Drama   Medium       Yes
Medium   Action  Low          Yes
Medium   Drama   Medium       No
High     Action  High         Yes
High     Drama   High         Yes
Calculated Values
Information Gain
Entropy before split: ≈ 0.918 (4 Yes and 2 No out of 6 movies)
Entropy after split on Budget: ≈ 0.667 (weighted average over the Low, Medium, and High subsets)
Information Gain: ≈ 0.252
Gini Index
Gini Index before split: ≈ 0.444
Gini Index after split on Budget: ≈ 0.333
Splitting Process
1. Selecting the Attribute to Split On
Based on the Information Gain calculated:
The attribute Budget has an Information Gain of about 0.25, which turns out to be the
highest among the available attributes, so it is chosen for the split.
2. Splitting the Dataset by Budget
We split the dataset into subsets based on the values of Budget: Low, Medium, and
High.
Detailed Splitting
Split: Low
Budget Genre Star Power Success
Low Action High No
Low Drama Medium Yes
Entropy Calculation for Low:
p_{Yes} = 0.5
p_{No} = 0.5
Entropy: 1
Split: Medium
Budget Genre Star Power Success
Medium Action Low Yes
Medium Drama Medium No
Entropy Calculation for Medium:
p_{Yes} = 0.5
p_{No} = 0.5
Entropy: 1
Split: High
Budget Genre Star Power Success
High Action High Yes
High Drama High Yes
Entropy Calculation for High:
p_{Yes} = 1
p_{No} = 0
Entropy: 0
Building the Decision Tree
1. Root Node: Budget
[Budget]
├── Low
├── Medium
└── High
Subsets: Low Budget:
o Entropy is 1, so we need to split further.
o We calculate Information Gain for Genre and Star Power.
[Budget]
├── Low → [Star Power]
│   ├── Low
│   ├── Medium
│   └── High
├── Medium
└── High
For simplicity, assume Star Power has higher Information Gain.
Low Budget, Low Star Power:
o No instances
Low Budget, Medium Star Power:
o Success = Yes
Low Budget, High Star Power:
o Success = No
[Budget]
├── Low → [Star Power]
│   ├── Low → No
│   ├── Medium → Yes
│   └── High → No
├── Medium
└── High
Medium Budget:
o Entropy is 1, so we need to split further.
o We calculate Information Gain for Genre and Star Power.
[Budget]
├── Low
├── Medium → [Genre]
│   ├── Action
│   └── Drama
└── High
Assume Genre has higher Information Gain.
Medium Budget, Action Genre:
o Success = Yes
Medium Budget, Drama Genre:
o Success = No
[Budget]
├── Low
├── Medium → [Genre]
│   ├── Action → Yes
│   └── Drama → No
└── High
High Budget:
Entropy is 0, so it's a pure node.
All instances are Yes.
[Budget]
├── Low → [Star Power]
│   ├── Low → No
│   ├── Medium → Yes
│   └── High → No
├── Medium → [Genre]
│   ├── Action → Yes
│   └── Drama → No
└── High
Final Decision Tree
[Budget]
├── Low → [Star Power]
│   ├── Low → Success = No
│   ├── Medium → Success = Yes
│   └── High → Success = No
├── Medium → [Genre]
│   ├── Action → Success = Yes
│   └── Drama → Success = No
└── High → Success = Yes
This decision tree splits the dataset on Budget first, then further splits the Low subset
based on Star Power and the Medium subset based on Genre, until all subsets are pure
(contain only one class). The High budget subset is already pure, so it directly becomes
a leaf node with Success = Yes.