DECISION TREE
Attribute selection measures
Attribute selection measures in decision trees include entropy, information gain, Gini index, gain
ratio, reduction in variance, and chi-square. These measures are also known as splitting rules.
Information gain
Measures how much information a feature provides about the class label. It is the decrease in entropy obtained by
splitting the dataset on that feature; the feature with the highest information gain is chosen for the split.
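As a rough illustration (the helper names and the toy labels below are made up for this sketch), information gain is the parent node's entropy minus the weighted entropy of the child nodes:

import numpy as np

def entropy(labels):
    # Shannon entropy of a list of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # decrease in entropy after splitting `parent` into `children`
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# toy example: 10 samples split into two child nodes
parent = ["yes"] * 6 + ["no"] * 4
left = ["yes"] * 5 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 3
print(information_gain(parent, [left, right]))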
Gini index
Also known as Gini impurity, it measures the probability that a randomly chosen sample from a node would be
misclassified if it were labelled at random according to the class distribution at that node.
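A minimal sketch of the same idea in code (the toy label lists are made up):

import numpy as np

def gini_impurity(labels):
    # probability of misclassifying a randomly drawn, randomly labelled sample
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(["yes"] * 9 + ["no"] * 1))  # low impurity: nearly pure node
print(gini_impurity(["yes"] * 5 + ["no"] * 5))  # maximum impurity for two classes (0.5)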
Entropy
Measures the impurity (disorder) of a dataset; a pure node has entropy 0. Decision trees use it to decide where to
split the data.
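In formula form, for class proportions p_i at a node: Entropy = -sum(p_i * log2(p_i)). A 50/50 two-class node has
entropy 1 bit, while a pure node has entropy 0.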
Chi-square
Measures the statistical significance of the difference between a node and its child nodes; it is mainly used with
categorical features (as in the CHAID algorithm).
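A minimal sketch of the idea using scipy (the contingency table below is invented): a large chi-square statistic and a
small p-value indicate that the categorical feature and the class are strongly related, making it a good split candidate.

import numpy as np
from scipy.stats import chi2_contingency

# rows: values of a categorical feature, columns: class counts ("yes", "no")
observed = np.array([[30, 10],
                     [ 5, 35]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # large chi2 / small p-value -> feature is a useful splitter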
Tree-Pruning
When a decision tree is built to its full depth, it often overfits the training data. To combat overfitting, two
techniques are used: post-pruning and pre-pruning.
1. Post-pruning (Cost Complexity Pruning)
Post-pruning involves first allowing the decision tree to grow fully, and then removing parts of the tree that do
not improve its performance.
How it works:
1. Grow the Tree Fully:
The decision tree is initially constructed without any constraints, allowing it to overfit on the training data.
2. Evaluate Node Importance:
The tree is then evaluated to identify nodes and subtrees that do not contribute significantly to the accuracy of
the model.
3. Prune Subtrees:
Nodes that do not add significant value are converted into leaf nodes. For instance, if a node has 90% “Yes”
and 10% “No” outcomes, further splitting may not be beneficial, so the subtree is pruned.
4. Simplify the Tree:
Reducing tree complexity lowers overfitting while maintaining accuracy, which is particularly useful for
small datasets.
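In scikit-learn, post-pruning is exposed as cost complexity pruning via the ccp_alpha parameter. The sketch below
(using the iris toy dataset; the choice of alpha is illustrative, in practice it is usually tuned by cross-validation)
grows a full tree, inspects the candidate alpha values, and refits a pruned tree with one of them:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the tree fully (no constraints), then compute the pruning path
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit with a non-zero ccp_alpha: larger alphas prune more aggressively
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
pruned_tree.fit(X_train, y_train)

print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)   # pruned tree is smaller
print(full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))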
Pre-Pruning
• In pre-pruning, hyperparameters such as max_depth and max_features are set before the tree is
fully constructed to limit its growth.
• max_depth: Limits the maximum depth of the tree.
• max_features: Restricts the number of features considered for splitting at each node.
• This technique reduces the risk of overfitting by preventing the tree from growing too deep and
capturing noise in the data.
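A minimal pre-pruning sketch in scikit-learn (the specific hyperparameter values are illustrative, not
recommendations): the constraints are set before fitting, so the tree never grows past them.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pre_pruned = DecisionTreeClassifier(
    max_depth=3,         # no root-to-leaf path longer than 3 splits
    max_features=2,      # consider at most 2 features when searching each split
    min_samples_leaf=5,  # another common pre-pruning knob: minimum samples per leaf
    random_state=0,
).fit(X, y)

print(pre_pruned.get_depth())  # never exceeds max_depth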