Comprehensive Notes on Decision Trees (for Examination)
Introduction to Decision Trees
• A supervised learning algorithm used for both classification and regression.
• Splits data into branches based on feature values until a decision/leaf node is reached.
Interpretation of Decision Trees
• Root Node: Feature giving maximum purity gain
• Internal Nodes: Decision points
• Leaf Nodes: Final predictions
Building Decision Trees
• Choose best feature at each step using impurity measures (Entropy, Gini, etc.)
• Recursively split until stopping condition is met
⚖ Tree Models vs Linear Models
• Trees: Handle non-linearity, no need for feature scaling
• Linear Models: Work best with linearly separable data
Memory Tip: "Trees branch smartly, lines draw plainly"
Decision Trees for Regression
• Predicts mean value of target in each region
• Uses Mean Squared Error (MSE) or Mean Absolute Error (MAE) to measure impurity
Formula: MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ȳ)²
Example:
• Leaf 1: [200, 210, 190] → Predicted y = 200
• Leaf 2: [400, 390, 410] → Predicted y = 400
Regression Tree Building Process
1. Calculate MSE of the target variable.
2. Split data based on attributes, compute MSE for each resulting node.
3. Subtract resulting MSE from original MSE → MSE Reduction.
4. Choose attribute with highest MSE reduction.
5. Repeat recursively until MSE is low and node is homogeneous.
6. Final prediction at leaf = average of target values.
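A minimal NumPy sketch of these steps, using the hypothetical leaf values from the example above (the candidate split shown is assumed for illustration, not derived from real data):
import numpy as np

def mse(y):
    # Node impurity: mean squared error around the node's mean prediction
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

parent = [200, 210, 190, 400, 390, 410]          # hypothetical target values in the parent node
left, right = [200, 210, 190], [400, 390, 410]   # one candidate split

parent_mse = mse(parent)
post_mse = (len(left) * mse(left) + len(right) * mse(right)) / len(parent)
mse_reduction = parent_mse - post_mse            # choose the split with the largest reduction

print(round(parent_mse, 1), round(post_mse, 1), round(mse_reduction, 1))
print(np.mean(left), np.mean(right))             # leaf predictions = 200.0 and 400.0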
⚖ Impurity Measures for Classification
1. Classification Error: E = 1 − max(pᵢ)
2. Gini Index: G = Σᵢ₌₁ᵏ pᵢ(1 − pᵢ)
3. Entropy: D = −Σᵢ₌₁ᵏ pᵢ log₂(pᵢ)
Memory Tip: "Entropy = Uncertainty, Gini = Diversity, Error = Simplicity"
Information Gain (Entropy-Based)
Gain = D − D_A
• D: Entropy before split
• D_A: Weighted avg. entropy after split
Example:
• Parent: 2 of class A, 2 of class B → D = 1.0
• Split: Left = [A, A], Right = [B, B] → D_A = 0
• Gain = 1 - 0 = 1.0 (Perfect Split)
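The same numbers can be reproduced with a small entropy helper (a sketch; the class counts mirror the example above):
import numpy as np

def entropy(counts):
    # Entropy of a node from its class counts
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

D = entropy([2, 2])                                        # parent: 2 of A, 2 of B
D_A = (2/4) * entropy([2, 0]) + (2/4) * entropy([0, 2])    # weighted avg. entropy after the split
print(D, D_A, D - D_A)                                     # 1.0 0.0 1.0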
Splitting Based on Feature Type
1. Nominal categorical feature (k categories): 2ᵏ⁻¹ − 1 possible binary splits
2. Ordinal or continuous feature (n distinct values):
• Sort the values
• Try the n − 1 split points between consecutive values
Goal: Maximize homogeneity (minimize impurity)
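A quick sketch of how many candidate binary splits each feature type produces (the feature values and category names are made up for illustration):
import numpy as np
from itertools import combinations

values = np.array([3.1, 1.2, 2.5, 4.8])              # continuous feature with n = 4 distinct values
sorted_vals = np.sort(values)
thresholds = (sorted_vals[:-1] + sorted_vals[1:]) / 2
print(thresholds)                                    # n − 1 = 3 candidate thresholds (midpoints)

categories = ["red", "green", "blue"]                # nominal feature with k = 3 categories
k = len(categories)
subsets = [c for r in range(1, k) for c in combinations(categories, r)]
print(len(subsets) // 2, 2 ** (k - 1) - 1)           # each partition counted twice → 3, 3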
📊 Weighted Post-Split Impurity:
Post-Impurity = (n_L / n) · D_L + (n_R / n) · D_R
ΔImpurity = D − Post-Impurity
Choose split with maximum gain (i.e., largest ∆ Impurity).
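A minimal sketch of the weighted post-split impurity and ΔImpurity, here using Gini as the impurity D (the labels and split are hypothetical):
import numpy as np

def gini(labels):
    # Gini impurity of a node from its class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

parent = ["A", "A", "B", "B", "B", "B"]
left, right = ["A", "A", "B"], ["B", "B", "B"]       # one candidate split

n, n_L, n_R = len(parent), len(left), len(right)
post_impurity = (n_L / n) * gini(left) + (n_R / n) * gini(right)
delta = gini(parent) - post_impurity
print(round(post_impurity, 3), round(delta, 3))      # 0.222 0.222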
Disadvantages of Decision Trees
• Overfitting: Trees grow too deep and memorize the training data
• Instability: Small changes in the data can produce a very different tree
• Bias: Splits favor features with many distinct levels
• Poor linear fit: Cannot model smooth linear trends
Memory Tip: "D.O.S.E.: Deep trees, Outliers, Sensitive, Easy to overfit"
✂ Tree Truncation vs Tree Pruning
• Truncation (pre-pruning): Stop growing the tree early; risk of underfitting
• Pruning (post-pruning): Grow the full tree, then cut weak branches; better generalization
Hyperparameter Tuning for Trees
• max_depth: Controls tree depth, prevents overfitting
• min_samples_split: Minimum samples required to split a node
• min_samples_leaf: Minimum samples required at each leaf
• max_features: Maximum number of features considered per split
• criterion: Impurity function (Gini or Entropy for classification, MSE for regression)
• ccp_alpha: Cost-complexity pruning strength
Tip: Use Grid Search or Randomized Search + Cross-Validation
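A sketch of how these hyperparameters are passed to scikit-learn's DecisionTreeClassifier (the data set and the chosen values are arbitrary examples, not recommendations):
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=42)   # synthetic data

tree = DecisionTreeClassifier(
    max_depth=5,             # limit depth to curb overfitting
    min_samples_split=10,    # need at least 10 samples to split a node
    min_samples_leaf=5,      # each leaf keeps at least 5 samples
    max_features=3,          # consider at most 3 features per split
    criterion="gini",        # or "entropy" for classification
    ccp_alpha=0.01,          # cost-complexity pruning strength
    random_state=42,
)
tree.fit(X, y)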
🔍 Feature Importance
• Importance = Total impurity reduction caused by the feature across all splits
Example:
Feature → total gain:
• Income: 0.40
• Age: 0.04
• City: 0.01
Memory Tip: "More impurity it kills, more important it feels."
Log Base 2 Calculation (log₂)
Formula: log₂(x) = log₁₀(x) / log₁₀(2), where log₁₀(2) ≈ 0.3010
Or using the natural log: log₂(x) = ln(x) / ln(2), where ln(2) ≈ 0.6931
Steps on calculator:
1. Use log(x) or ln(x)
2. Divide by log(2) or ln(2)
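For example, log₂(3) can be checked in Python both ways (math.log2 is the built-in shortcut):
import math

x = 3
print(math.log10(x) / math.log10(2))   # ≈ 1.585
print(math.log(x) / math.log(2))       # same value via natural logs
print(math.log2(x))                    # built-in shortcut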
🔍 Entropy with Equal Classes (3 Classes Example)
• Class Distribution: [A, B, C] → each with 1/3
Entropy = −3 × (1/3 × log₂(1/3)) = log₂(3) ≈ 1.585
Memory Tip: Maximum entropy = log₂(k), where k = number of equally likely classes.
⭐ Cross-Validation & K-Fold Cross-Validation
Cross-Validation
• Technique to evaluate model performance more reliably than a single train-test split
• Helps avoid overfitting by testing on multiple data subsets
♻ K-Fold Cross-Validation
1. Split the dataset into K equal parts (folds)
2. For each fold:
• Use it as the test set and the remaining folds as the training set
• Train and evaluate the model
3. Compute the average score across the K runs
Example (K=5):
• Run model 5 times, each time a different fold is test set
Memory Tip: "K parts, K turns as test. Judge by average."
GridSearchCV (Grid Search with Cross-Validation)
• Used to find the best hyperparameters for a model
• Tries all combinations of given parameters using cross-validation
Steps:
1. Define parameter grid:
param_grid = {
'max_depth': [3, 5, 10],
'min_samples_split': [2, 5, 10]
}
2. Apply GridSearchCV:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X, y)
3. Best params:
grid_search.best_params_
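After fitting, GridSearchCV also exposes the best cross-validated score and a model refitted with the winning parameters:
grid_search.best_score_       # mean cross-validated score of the best combination
grid_search.best_estimator_   # tree refitted with the best parameters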
Memory Tip: "Grid = Try All, CV = Test All"