
UNIT 1: INTRODUCTION

(Reference: Alpaydin, Chapter 1)


Total Time: 5 Hours

1. What is Machine Learning?

Definition:
Machine Learning (ML) is a subfield of artificial intelligence (AI) that
focuses on developing systems that can learn from and make decisions or
predictions based on data.

Instead of being explicitly programmed to perform a task, ML systems
learn from experience, improving their performance over time based on
input data and feedback.

Arthur Samuel defined ML as:

“The field of study that gives computers the ability to learn without
being explicitly programmed.”

2. Basic Concepts and Terminology

• Model: A mathematical representation of a real-world process based on data.
• Training Data: The dataset used to train the model.
• Features (Inputs): Independent variables (X₁, X₂, ..., Xn) used by the
model to make predictions.
• Labels (Outputs): Target variable (Y) the model is trying to predict.
• Hypothesis: A function that approximates the true relationship
between input and output.
• Loss Function: A method to measure how far the model's
predictions are from actual results.
• Training: The process of feeding data to the model to minimize the
loss function.
• Testing: Evaluating model performance on unseen data.

3. Key Elements of an ML System


1. Representation:
The choice of how data is represented and the type of model (e.g.,
decision tree, neural network).
2. Evaluation:
A metric to evaluate model performance (e.g., accuracy, precision,
recall, F1 score).
3. Optimization:
The method to adjust parameters to minimize errors (e.g., gradient
descent).
4. Data:
Quality and quantity of data are crucial. Clean, labeled, and relevant
data leads to better learning.
5. Feedback:
Some ML models learn continuously from incoming data or
performance signals (especially in reinforcement learning).

4. Types of Machine Learning

1. Supervised Learning

• Definition: The algorithm is trained on a labeled dataset, meaning
each training example has a corresponding output label.
• Goal: Learn a mapping from inputs (X) to outputs (Y).
• Examples:
o Spam detection
o House price prediction
o Image classification
• Common Algorithms:
o Linear regression
o Logistic regression
o Decision trees
o k-NN
o SVM
o Neural networks

2. Unsupervised Learning

• Definition: The algorithm is given data without any labels. It tries to
learn the structure or patterns in the data.
• Goal: Find hidden patterns or intrinsic structures in data.
• Examples:
o Customer segmentation
o Anomaly detection
o Topic modeling
• Common Algorithms:
o k-means clustering
o Hierarchical clustering
o PCA (Principal Component Analysis)

3. Reinforcement Learning

• Definition: An agent interacts with an environment and learns by
receiving rewards or penalties for its actions.
• Goal: Learn a policy that maximizes the long-term reward.
• Examples:
o Game playing (e.g., AlphaGo)
o Robotics
o Self-driving cars
• Key Concepts:
o Agent, Environment, Actions, States, Reward, Policy

5. Applications of Machine Learning

ML is widely used across industries for automation, prediction,
personalization, and analytics:

Domain                              Application
Healthcare                          Disease diagnosis, drug discovery
Finance                             Credit scoring, fraud detection
Retail                              Product recommendation, customer analytics
Marketing                           Targeted advertising, churn prediction
Transportation                      Self-driving cars, route optimization
Social Media                        Content recommendation, fake news detection
Natural Language Processing (NLP)   Chatbots, language translation

6. Traditional Programming vs Machine Learning

Aspect         Traditional Programming                          Machine Learning
Input          Rules + Data                                     Data + Output
Output         Output                                           Rules (i.e., the learned model)
Flexibility    Rigid – programmer defines rules                 Flexible – model learns patterns from data
Adaptability   Cannot adapt to new data without reprogramming   Can improve with new data

7. Challenges in Machine Learning

• Data Quality and Quantity: Incomplete, noisy, or biased data can
degrade performance.
• Overfitting vs Underfitting: Overfitting occurs when the model
learns noise; underfitting when it fails to learn patterns.
• Interpretability: Some ML models, especially deep learning, are
black boxes.
• Ethical Issues: Bias in data, decision transparency, and privacy
concerns.

UNIT 2: PREPROCESSING
(Reference: James et al., Chapter 6, Sections 6.1.1, 6.1.2; Chapter 10,
Section 10.2)
Total Time: 6 Hours

1. Introduction to Preprocessing

Preprocessing is a critical step in any machine learning pipeline. It involves
transforming raw data into a clean, usable format to improve the accuracy,
efficiency, and reliability of a machine learning model.

Why Preprocessing?

• Raw data often contains noise, missing values, or irrelevant information.
• Many algorithms are sensitive to scale, distribution, and type of input
features.
• Good preprocessing improves model performance and reduces
overfitting.

2. Feature Scaling
Feature scaling is used to normalize the range of independent variables or
features of data. It ensures that features contribute equally to the learning
process.

✅ 2.1 Normalization (Min-Max Scaling)

Formula:

x' = (x − x_min) / (x_max − x_min)

• Transforms data into a range [0, 1] (or any other desired range).
• Useful when the distribution is not Gaussian (e.g., image pixel
intensities).
• Sensitive to outliers.

Example:
A feature ranging from 20 to 80 will be scaled as x' = (x − 20) / (80 − 20),
so a value of 50 maps to 0.5.

✅ 2.2 Standardization (Z-score Scaling)

Formula:

z = (x − μ) / σ

Where:

• μ = mean of the feature
• σ = standard deviation of the feature
• Transforms data to have zero mean and unit variance.
• Preferred when the feature values follow a Gaussian distribution.
• More robust to outliers than min-max normalization.

Use case: Recommended for algorithms that are sensitive to feature scale, such as
logistic regression, SVM, k-means, and PCA.
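
Both scalings can be applied with scikit-learn; a minimal sketch, assuming a small
NumPy feature matrix of made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix: two features on very different scales (hypothetical data)
X = np.array([[20.0, 1000.0],
              [50.0, 3000.0],
              [80.0, 5000.0]])

# Min-max scaling: maps each feature to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: zero mean, unit variance per feature
X_std = StandardScaler().fit_transform(X)

print(X_minmax)                               # first column becomes [0, 0.5, 1]
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 means, ~1 standard deviations
```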
3. Feature Selection

Feature selection is the process of identifying and removing as many
irrelevant and redundant features as possible from the dataset.

Goals:

• Improve model generalization.


• Reduce computational cost.
• Increase model interpretability.

✅ 3.1 Methods of Feature Selection

a. Filter Methods

• Use statistical techniques to evaluate feature importance
independently of any ML model.
• Examples:
o Correlation coefficient
o Chi-square test
o Mutual information

b. Wrapper Methods

• Use a predictive model to evaluate combinations of features and
select the best performing subset.
• Computationally expensive.
• Examples:
o Recursive Feature Elimination (RFE)
o Forward/Backward selection

c. Embedded Methods

• Perform feature selection during the model training process.


• Examples:
o Lasso (L1 regularization): Shrinks some coefficients to zero.
o Decision tree feature importance.
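
One example of each family, sketched with scikit-learn on synthetic data (the
dataset, k value, and alpha are illustrative choices, not prescribed by the notes):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression, Lasso

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Filter method: score each feature independently (mutual information)
filter_sel = SelectKBest(mutual_info_classif, k=4).fit(X, y)
print("Filter keeps:", filter_sel.get_support(indices=True))

# Wrapper method: Recursive Feature Elimination around a model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("RFE keeps:", rfe.get_support(indices=True))

# Embedded method: Lasso drives some coefficients exactly to zero
lasso = Lasso(alpha=0.05).fit(X, y)
print("Lasso keeps:", (lasso.coef_ != 0).nonzero()[0])
```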

4. Dimensionality Reduction

Dimensionality reduction involves reducing the number of input variables
in a dataset while preserving as much information as possible.

Why reduce dimensions?

• Avoid the curse of dimensionality


• Reduce overfitting
• Improve visualization and computation

Two major approaches:

1. Feature Selection (discussed above)


2. Feature Extraction (e.g., PCA)

5. Principal Component Analysis (PCA)

PCA is the most widely used technique for unsupervised feature
extraction and dimensionality reduction.

✅ 5.1 Intuition Behind PCA

• PCA identifies new axes (principal components) in the data that
capture the maximum variance.
• The first component captures the highest variance, the second
captures the next highest variance orthogonal to the first, and so on.
• The data is then projected onto a reduced number of these
components.

✅ 5.2 Steps in PCA

1. Standardize the data: Scale features to zero mean and unit variance.
2. Compute the covariance matrix: Understand the relationships
between variables.
3. Calculate eigenvectors and eigenvalues:
o Eigenvectors represent the directions of new feature space.
o Eigenvalues represent the magnitude of variance in the
direction of each eigenvector.
4. Sort eigenvectors by decreasing eigenvalues and choose k
components.
5. Project data onto these top k eigenvectors to get the reduced
dataset.
✅ 5.3 Mathematical Foundation

Let X be the n × p matrix of standardized data (n observations, p features).

• Covariance matrix: C = (1 / (n − 1)) XᵀX
• Eigen-decomposition: C v = λ v, where each eigenvector v is a principal
component direction and its eigenvalue λ is the variance captured along it.
• Projection: Z = X W_k, where W_k holds the top k eigenvectors as columns and
Z is the reduced (n × k) representation.

✅ 5.4 Choosing the Number of Components

• Use explained variance ratio.


• Choose k such that 95% (or 99%) of the total variance is retained.

Example:
If PCA tells you that the first 3 components explain 97% of the variance, you
can reduce the feature set to 3 dimensions without losing much
information.
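
A minimal PCA sketch with scikit-learn on synthetic data; passing a fraction to
n_components keeps just enough components to reach that explained-variance
threshold:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=300)   # add a highly correlated feature

# Step 1: standardize, then fit PCA keeping 95% of the variance
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print("Components kept:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Reduced shape:", X_reduced.shape)
```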

✅ 5.5 Applications of PCA

• Image compression
• Face recognition
• Noise filtering
• Data visualization (e.g., 2D plots)

❗ Important Notes:

• PCA assumes linear relationships and that directions of large variance
correspond to important structure.
• PCA does not consider output labels; it is unsupervised.
• PCA may discard low-variance directions that are important for classification.
UNIT 3: REGRESSION
(Reference: James et al., Chapter 3 & Chapter 6.2.1)
Total Time: 12 Hours

1. Introduction to Regression

Regression is a supervised learning technique used for predicting
continuous numeric outcomes based on one or more input variables
(features). It estimates the relationship between dependent and
independent variables.

• Goal: Predict or explain a quantitative response.


• Examples:
o Predicting house prices based on size, location, etc.
o Estimating a student’s test score based on study hours.

2. Linear Regression with One Variable (Simple Linear Regression)

✅ 2.1 Definition

Simple linear regression models the relationship between a single
predictor (X) and the response variable (Y) by fitting a straight line:

Y = β0 + β1X + ε

Where:

• Y = dependent variable
• X = independent variable
• β0 = intercept
• β1 = slope (effect of X on Y)
• ε = error term

✅ 2.2 Assumptions of Linear Regression

1. Linearity between predictors and response.


2. Independence of errors.
3. Homoscedasticity (constant variance of errors).
4. Normality of errors.

✅ 2.3 Least Squares Estimation

Objective: Minimize the Residual Sum of Squares (RSS):

RSS = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β0 − β1xᵢ)²

The best-fit line is obtained by solving:

β̂1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
β̂0 = ȳ − β̂1 x̄

✅ 2.4 Coefficient of Determination (R^2)

• Indicates the proportion of variance explained by the model.


• Range: 0 to 1 (higher is better).
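
A short NumPy sketch of the closed-form least-squares estimates and R² on
synthetic data (the true coefficients 3 and 2 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=100)   # true line: Y = 3 + 2X + noise

# Closed-form least-squares estimates
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x
rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss                    # coefficient of determination

print(f"beta0={beta0:.2f}, beta1={beta1:.2f}, R^2={r2:.3f}")
```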

3. Linear Regression with Multiple Variables (Multiple Linear Regression)

Extends simple linear regression to multiple predictors:

Y = β0 + β1X₁ + β2X₂ + ... + βpXp + ε

• Predicts Y based on a linear combination of multiple features.
• Solved using matrix algebra: β̂ = (XᵀX)⁻¹XᵀY
• Can capture more complex relationships than simple regression.


4. Gradient Descent (Optimization Method)

Gradient Descent is an iterative optimization algorithm to minimize the
cost function.

✅ 4.1 Cost Function

J(β) = (1 / 2m) Σᵢ (ŷᵢ − yᵢ)²   (squared error averaged over the m training examples)

✅ 4.2 Update Rule

βⱼ := βⱼ − α · ∂J(β)/∂βⱼ   (applied simultaneously to every parameter βⱼ)

• α: learning rate
• Repeat until convergence

Types of Gradient Descent:

• Batch Gradient Descent: Uses the entire dataset.


• Stochastic GD: Uses one sample at a time (faster).
• Mini-batch GD: Uses a small batch (best trade-off).
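
A compact batch gradient descent sketch for simple linear regression, assuming
the squared-error cost above (the learning rate and iteration count are
illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=200)
y = 1.0 + 4.0 * x + rng.normal(scale=0.5, size=200)

beta0, beta1 = 0.0, 0.0      # start from zero
alpha = 0.05                 # learning rate
m = len(x)

for _ in range(2000):        # batch gradient descent: full dataset each step
    y_hat = beta0 + beta1 * x
    error = y_hat - y
    grad0 = error.mean()             # dJ/d(beta0)
    grad1 = (error * x).mean()       # dJ/d(beta1)
    beta0 -= alpha * grad0
    beta1 -= alpha * grad1

print(f"beta0={beta0:.2f}, beta1={beta1:.2f}")   # should approach 1 and 4
```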

5. Overfitting and Underfitting

✅ 5.1 Overfitting

• Model learns noise instead of pattern.


• High training accuracy, poor test accuracy.

✅ 5.2 Underfitting

• Model is too simple to capture patterns.


• Low training and test accuracy.
6. Regularization

Regularization prevents overfitting by adding a penalty to the loss
function.

✅ 6.1 Ridge Regression (L2 Regularization)

Adds squared magnitude of coefficients to the cost function:

Cost = RSS + λ Σⱼ βⱼ²

• Shrinks coefficients but never sets them to zero.
• λ: regularization parameter (controls penalty).

✅ 6.2 Lasso Regression (L1 Regularization)

Adds absolute value of coefficients:

Cost = RSS + λ Σⱼ |βⱼ|

• Performs feature selection by shrinking some coefficients to zero.
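
A quick comparison sketch with scikit-learn (note that sklearn calls the λ
parameter alpha; the dataset and alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=15, n_informative=5,
                       noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)     # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=5.0).fit(X, y)      # L1 penalty: zeroes some out

print("OLS   nonzero coefs:", np.sum(ols.coef_ != 0))
print("Ridge nonzero coefs:", np.sum(ridge.coef_ != 0))   # all 15, just smaller
print("Lasso nonzero coefs:", np.sum(lasso.coef_ != 0))   # typically fewer than 15
```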

7. Regression Evaluation Metrics

✅ 7.1 Mean Squared Error (MSE)

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

• Sensitive to outliers.

✅ 7.2 Root Mean Squared Error (RMSE)

RMSE = √MSE

• More interpretable because it has the same units as Y.

✅ 7.3 Mean Absolute Error (MAE)

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|

• Less sensitive to outliers than MSE.

✅ 7.4 R-Squared (R^2)

R^2 = 1 − RSS/TSS, where TSS = Σᵢ (yᵢ − ȳ)²

• Measures variance explained by the model.
• R^2 = 1: perfect fit.
• R^2 = 0: model predicts no better than the mean.
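
These metrics are available in sklearn.metrics; a minimal sketch with made-up
prediction arrays:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```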

Key Takeaways

• Linear Regression is foundational for ML. It helps understand
relationships between variables.
• Multiple Regression extends it to multiple predictors.
• Gradient Descent is essential when analytical solutions are
intractable.
• Regularization avoids overfitting and improves generalization.
• Evaluation metrics help assess model accuracy and reliability.
UNIT 4: CLASSIFICATION
(Reference: James et al., Chapter 4 & Chapter 6.2.2)
Total Time: 10 Hours

1. Introduction to Classification

Classification is a supervised learning task where the goal is to predict a
categorical outcome (label or class) based on input features.

• Output: Discrete class labels (e.g., "spam" or "not spam")


• Examples:
o Email spam detection
o Disease diagnosis (yes/no)
o Image recognition (cat/dog/bird)

2. Binary Classification

Binary classification involves only two classes, typically labeled as:

• 0 and 1, or
• Negative and Positive

Goal: Learn a decision boundary that separates the two classes effectively.

3. Logistic Regression

Logistic regression is used when the dependent variable is binary. It
models the probability that a given input belongs to a particular class.

✅ 3.1 Logistic Function (Sigmoid Function)

The model outputs a probability, which is squashed between 0 and 1
using the sigmoid function:

P(Y=1|X) = σ(z) = 1 / (1 + e^(−z)), where z = β0 + β1X₁ + ... + βpXp

• If the output > 0.5 → class 1
• If the output ≤ 0.5 → class 0

✅ 3.2 Log-Odds or Logit Transformation

Instead of modeling P(Y=1|X) directly, logistic regression models the log
odds:

log( P(Y=1|X) / (1 − P(Y=1|X)) ) = β0 + β1X₁ + ... + βpXp

Applying the sigmoid to the linear function on the right transforms it back
into a probability.

✅ 3.3 Model Training

Logistic regression is typically trained using Maximum Likelihood
Estimation (MLE), not least squares.

• Likelihood function: L(β) = Πᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ), where pᵢ = P(Yᵢ=1|Xᵢ)
• The log-likelihood is maximized to estimate the β values.
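
A minimal scikit-learn logistic regression sketch on a built-in dataset (sklearn
fits the coefficients by maximizing a regularized log-likelihood internally):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]   # P(Y=1|X) from the sigmoid
preds = (probs > 0.5).astype(int)         # threshold at 0.5
print("Test accuracy:", (preds == y_test).mean().round(3))
```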

4. Decision Boundaries

A decision boundary is a surface that separates different classes predicted
by the model.

• In logistic regression with two features, the decision boundary is a
line (or a hyperplane in higher dimensions).
• For example, the boundary is the set of points where the predicted
probability equals 0.5, i.e. where β0 + β1X₁ + β2X₂ = 0.

This line separates the class 0 and class 1 regions.


5. Evaluation Metrics for Classification

✅ 5.1 Confusion Matrix

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)

✅ 5.2 Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

• Works well only when classes are balanced.

✅ 5.3 Precision

Precision = TP / (TP + FP)

• Of all predicted positives, how many are correct.

✅ 5.4 Recall (Sensitivity or TPR)

Recall = TP / (TP + FN)

• Of all actual positives, how many did the model find.

✅ 5.5 F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

• Harmonic mean of precision and recall.
• Useful in imbalanced datasets.

✅ 5.6 ROC Curve (Receiver Operating Characteristic)

• Plots True Positive Rate (Recall) vs. False Positive Rate (FPR)
• Area Under the Curve (AUC) measures model performance:
o AUC = 1: perfect classifier
o AUC = 0.5: random classifier
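
All of these metrics are available in sklearn.metrics; a sketch continuing from a
fitted binary classifier, assuming the names y_test, preds, and probs from the
earlier logistic regression sketch:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# y_test: true labels, preds: 0/1 predictions, probs: predicted P(Y=1|X)
print("Confusion matrix:\n", confusion_matrix(y_test, preds))
print("Accuracy :", accuracy_score(y_test, preds))
print("Precision:", precision_score(y_test, preds))
print("Recall   :", recall_score(y_test, preds))
print("F1 score :", f1_score(y_test, preds))
print("ROC AUC  :", roc_auc_score(y_test, probs))   # uses scores, not hard labels
```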

6. K-Nearest Neighbors (KNN)

KNN is a non-parametric, instance-based learning algorithm.

✅ 6.1 Working of KNN

• Store the entire training dataset.


• For a new data point:
1. Calculate distance to all training points.
2. Select the K closest points.
3. Assign the class that is most common among those
neighbors.

✅ 6.2 Distance Metrics

• Euclidean Distance: d(x, x′) = √( Σⱼ (xⱼ − x′ⱼ)² )

• Manhattan Distance: d(x, x′) = Σⱼ |xⱼ − x′ⱼ|

✅ 6.3 Choosing K

• Small K → high variance (overfitting)


• Large K → high bias (underfitting)
• Use cross-validation to find optimal K
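
A short sketch that scans a few values of K with cross-validation (the dataset
and K grid are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate several K values with 5-fold cross-validation
for k in (1, 3, 5, 9, 15):
    knn = KNeighborsClassifier(n_neighbors=k)       # Euclidean distance by default
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k:2d}  CV accuracy={score:.3f}")
```
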
7. Comparison: Logistic Regression vs. KNN

Feature                  Logistic Regression           K-Nearest Neighbors (KNN)
Type                     Parametric                    Non-parametric
Model Interpretability   High                          Low
Decision Boundary        Linear (unless extended)      Non-linear
Training Time            Fast                          Minimal (lazy learner; just stores the data)
Prediction Time          Fast                          Slow (distance calculations)
Handles non-linearity    No (needs transformation)     Yes (implicitly handles it)

Key Takeaways

• Classification predicts categories, not quantities.


• Logistic Regression is a powerful and interpretable linear classifier.
• KNN is simple, flexible, and works well with local structure in data.
• Choosing the right evaluation metric is crucial, especially for
imbalanced data.
• Decision boundaries help us understand model behavior visually.
UNIT 5: RESAMPLING METHODS
Reference: James et al., Chapter 5
Time: 7 Hours

1. What Are Resampling Methods?

Resampling methods are a set of techniques used to assess model
accuracy and improve model performance by reusing data.

They are especially useful when:

• The dataset is not large enough to be split into distinct training and
test sets.
• We want a better estimate of the model's error rate.
• We aim to compare multiple models effectively.

2. Types of Resampling Methods

There are two major resampling methods discussed:

1. Cross-Validation
2. Bootstrap

PART A: Cross-Validation

✅ 1. Why Not Just Use a Validation Set?

When we split the data into training and validation sets:

• The model performance may vary based on how the data is split.
• It uses less data for training, potentially leading to a less accurate
model.

Solution: Use cross-validation to repeatedly split and train/test on different
subsets of the data.
✅ 2. K-Fold Cross-Validation

K-fold cross-validation is the most common cross-validation technique.

Working:

1. Split the dataset into K equal-sized parts (folds).


2. For each fold:
o Use the fold as a validation set.
o Use the remaining K-1 folds as the training set.
3. Repeat K times.
4. Average the validation errors to estimate overall model error.

Example:

If K = 5:

• You split the data into 5 parts.


• Each fold gets to be the validation set once.

Advantages:

• Less variance in the performance estimate compared to a single
validation set.
• All observations are used for both training and validation.

✅ 3. Choosing K

• Small K (e.g., 5): Less computational cost, higher bias.


• Large K (e.g., 10): More accurate, but higher computational cost.
• Extreme case: Leave-One-Out CV (LOOCV) where K = n.

✅ 4. Leave-One-Out Cross-Validation (LOOCV)

• A special case where K = number of data points.


• For each iteration:
o Train on n − 1 points.
o Test on the remaining 1 point.
• Repeat n times.

Advantages:

• Very low bias.


• Utilizes maximum data for training.

Disadvantages:

• Computationally expensive.
• High variance: the training set is almost the same in each iteration.

✅ 5. Stratified K-Fold Cross-Validation

Used when the data is imbalanced (e.g., 90% class A, 10% class B).

• Ensures each fold maintains the same class distribution as the


original dataset.
• More reliable performance evaluation for classification tasks.
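
A sketch comparing plain and stratified 5-fold cross-validation with scikit-learn
(the classifier and the synthetic imbalanced dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Imbalanced binary problem: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

print("K-fold accuracy           :", cross_val_score(clf, X, y, cv=kf).mean().round(3))
print("Stratified K-fold accuracy:", cross_val_score(clf, X, y, cv=skf).mean().round(3))
```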

PART B: Bootstrap Method

✅ 1. What is the Bootstrap?

The Bootstrap is a statistical resampling technique used to estimate the
accuracy (e.g., variance, confidence intervals) of a sample statistic.

Introduced by Bradley Efron in 1979.

✅ 2. How It Works

1. From a dataset of size n, draw B samples with replacement.


2. Each sample is also of size n.
3. Calculate the statistic (e.g., mean, standard deviation, model
accuracy) on each bootstrap sample.
4. Use the distribution of the B results to estimate uncertainty
(standard error, confidence intervals).

✅ 3. Example: Estimating Standard Error of the Mean

Let’s say you have a dataset with 100 values.


• Sample 100 data points with replacement → one bootstrap sample.
• Repeat this B = 1000 times.
• Compute the mean for each sample.
• Use the standard deviation of those means to estimate the standard
error.
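
A NumPy sketch of exactly this procedure, with B = 1000 resamples of a made-up
dataset of 100 values:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100)   # the original sample of 100 values

B = 1000
boot_means = np.empty(B)
for b in range(B):
    sample = rng.choice(data, size=len(data), replace=True)  # draw with replacement
    boot_means[b] = sample.mean()

se_bootstrap = boot_means.std(ddof=1)                 # bootstrap estimate of the SE
se_formula = data.std(ddof=1) / np.sqrt(len(data))    # classical formula, for comparison
print(f"bootstrap SE={se_bootstrap:.3f}  formula SE={se_formula:.3f}")
```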

✅ 4. Bootstrap for Model Evaluation

• Build and test the model on each bootstrap sample.


• Estimate the model’s prediction error using multiple such
bootstrap samples.
• Helps in model selection, bias-variance tradeoff analysis, etc.

✅ 5. Out-of-Bag (OOB) Error Estimation

For each bootstrap sample:

• About 63% of data is included (some data gets repeated).


• The remaining 37% (not selected) is called the out-of-bag (OOB)
data.
• Use OOB data to evaluate model performance.

This acts like internal cross-validation.

Comparison: Cross-Validation vs. Bootstrap


Feature         Cross-Validation                     Bootstrap
Data sampling   Without replacement                  With replacement
Used for        Model performance evaluation         Estimating accuracy and variability
Output          Error estimate (bias, variance)      Distribution of statistics
Best use case   When model selection is the goal     When measuring uncertainty
Computation     Moderate (K folds)                   Often heavier (B samples)

Final Thoughts
• Resampling methods are crucial for reliable model evaluation.
• They help in:
o Choosing the best model
o Tuning hyperparameters
o Understanding model stability
• While they increase computational cost, they offer more
trustworthy performance estimates, especially when the dataset is
small.

UNIT 6: LINEAR MODEL SELECTION AND REGULARIZATION
Reference: James et al., Chapter 6
Time: 7 Hours

Overview

Linear models, such as Linear Regression, can suffer from problems like:

• Overfitting when too many predictors are used.


• Multicollinearity (high correlation among predictors).
• High variance in predictions.

To address these, we use:

1. Subset selection
2. Shrinkage methods (regularization):
o Ridge Regression
o Lasso Regression
3. Dimension reduction methods:
o Principal Component Regression (PCR)
o Partial Least Squares (PLS)

PART A: Best Subset and Stepwise Selection


✅ 1. Best Subset Selection

Involves fitting a separate linear model for every possible combination
of predictors and selecting the best model.

Steps:

• Given p predictors, there are 2^p possible models.


• For each model size k (number of predictors), find the best-fitting
model.
• Select the model with the lowest test error, highest adjusted R², or
lowest AIC/BIC.

Disadvantages:

• Computationally expensive (impractical for large p).


• Prone to overfitting if not paired with cross-validation.

✅ 2. Stepwise Selection

A more practical alternative to best subset selection.

Forward Stepwise Selection:

• Start with no predictors.


• Add predictors one-by-one that improve model performance the
most.
• Stop when adding more does not significantly reduce the error.

Backward Stepwise Selection:

• Start with all predictors.


• Remove predictors one-by-one that reduce model performance the
least.
• Stop when further removal increases error.

Criteria for Evaluation:

• Adjusted R²
• AIC (Akaike Information Criterion)
• BIC (Bayesian Information Criterion)
• Validation error (via cross-validation)
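
Scikit-learn's SequentialFeatureSelector implements this greedy forward/backward
search scored by cross-validation; a minimal sketch (the data and the number of
features to select are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=150, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Forward stepwise: start empty, greedily add the predictor that helps most
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                    direction="forward", cv=5).fit(X, y)
print("Forward selection keeps :", forward.get_support(indices=True))

# Backward stepwise: start with all predictors, greedily drop the least useful
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="backward", cv=5).fit(X, y)
print("Backward selection keeps:", backward.get_support(indices=True))
```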
PART B: Shrinkage Methods (Regularization)

✅ 3. Ridge Regression

Ridge regression addresses overfitting by adding a penalty term to the
ordinary least squares (OLS) loss function.

Objective Function:

minimize  RSS + λ Σⱼ βⱼ²

• λ is the tuning parameter controlling the amount of shrinkage.
• As λ → ∞, coefficients shrink towards zero but never become exactly
zero.
• Useful when multicollinearity exists.

Key Features:

• Retains all predictors but shrinks their coefficients.


• Works well when all predictors are relevant.
• Needs standardization of predictors before use.

✅ 4. Lasso Regression (Least Absolute Shrinkage and Selection Operator)

Lasso modifies the penalty term to use the absolute value of the
coefficients.

Objective Function:

minimize  RSS + λ Σⱼ |βⱼ|

• Forces some coefficients to become exactly zero, leading to
feature selection.
• More interpretable than ridge regression.
Comparison with Ridge:

Feature             Ridge                        Lasso
Penalty type        L2 (squared coefficients)    L1 (absolute coefficients)
Coefficients        Shrinks to small values      Can shrink to zero
Feature selection   ❌ No                        ✅ Yes
Best use case       Many small effects           Sparse model (few predictors matter)

✅ 5. Choosing λ (Tuning Parameter)

Both Ridge and Lasso depend on λ, which is selected using cross-validation.

• Small λ → Less penalty → Model similar to OLS


• Large λ → More shrinkage

Use K-fold cross-validation to find the optimal λ that minimizes the test
error.
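
In scikit-learn, the built-in cross-validated estimators pick this value
automatically (again, sklearn names the parameter alpha); a sketch with an
illustrative alpha grid:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 50)                  # candidate penalty values

ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)   # K-fold CV over the grid
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("Best ridge alpha:", ridge.alpha_)
print("Best lasso alpha:", lasso.alpha_, "| nonzero coefs:", np.sum(lasso.coef_ != 0))
```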

PART C: Dimension Reduction Methods

✅ 6. Principal Component Regression (PCR)

PCR reduces the predictor variables to principal components (PCs), which
are linear combinations of original variables.

Steps:

1. Apply PCA to predictors.


2. Select top M components (based on variance).
3. Use those PCs as predictors in a linear regression.

Advantages:

• Handles multicollinearity.
• Reduces overfitting.
Disadvantages:

• PCs are not necessarily related to the outcome Y.


• May discard variables that are important for prediction.

✅ 7. Partial Least Squares (PLS)

Like PCR, but PLS components are chosen based on both predictors
and response.

• Finds components that explain both the predictors and the response
well.
• Often performs better than PCR.

Summary Table

Method   Feature Selection   Handles Multicollinearity   Good for Interpretation   Uses All Predictors
OLS      ❌                  ❌                          ✅                        ✅
Ridge    ❌                  ✅                          ❌                        ✅
Lasso    ✅                  ✅                          ✅                        ❌ (some)
PCR      ❌                  ✅                          ❌                        (transformed PCs)
PLS      ❌                  ✅                          ❌                        (transformed PCs)

Real-World Use
• In high-dimensional data like genomics, finance, or text mining,
regularization is vital.
• Lasso is widely used for automatic feature selection.
• Ridge is common when you expect many weak predictors.
• PCR/PLS are used when variables are highly correlated.
UNIT 7: TREE-BASED METHODS
Reference: James et al., Chapter 8
Time: 5 Hours

Overview
Tree-based methods partition the feature space into rectangular regions,
making predictions based on the majority vote (classification) or mean
(regression) of observations in those regions.

They are:

1. Easy to interpret (especially decision trees)


2. Able to handle non-linear relationships
3. Prone to overfitting (especially single trees), hence ensemble
methods like bagging, random forests, and boosting are introduced
to improve performance.

PART A: Decision Trees

✅ 1. What is a Decision Tree?

A decision tree is a flowchart-like structure where:

• Internal nodes represent tests on features


• Branches represent outcomes of tests
• Leaf nodes represent predictions (outcomes)

✅ 2. Tree Construction

For Regression:

• Goal: Split data to minimize RSS (Residual Sum of Squares).
• RSS for region Rm is: Σ (yᵢ − ŷ_Rm)², summed over the observations i in Rm,
where ŷ_Rm is the mean response in that region.

For Classification:

• Common criteria for splitting:
o Classification error rate
o Gini index: G = Σₖ p̂mk (1 − p̂mk)
o Entropy (cross-entropy): D = − Σₖ p̂mk log(p̂mk)
where p̂mk is the proportion of class k observations in region m.

✅ 3. Tree Pruning

Trees can overfit if grown fully. To prevent this:

• Grow a large tree.
• Apply cost-complexity pruning using parameter α: among subtrees T, choose
the one minimizing  Σ_m RSS(Rm) + α|T|,
where |T| is the number of terminal nodes; larger α gives smaller trees.

Use cross-validation to choose the optimal α.
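
scikit-learn exposes cost-complexity pruning through ccp_alpha; a sketch that
chooses α by cross-validation over the pruning path (dataset is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas come from the cost-complexity pruning path of a full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"Best alpha={best_alpha:.4f}  CV accuracy={best_score:.3f}")
```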

PART B: Bagging and Random Forests


✅ 4. Bagging (Bootstrap Aggregation)

Bagging builds multiple trees on bootstrapped samples and averages
the predictions.

• Reduces variance
• Works best for unstable models (like decision trees)

For Classification:

• Majority vote across all trees

For Regression:

• Average prediction across all trees

Out-of-Bag (OOB) Error:

• Each observation is used in about 2/3 of bootstraps.


• The remaining 1/3 are OOB samples used to estimate error.

✅ 5. Random Forests

Improvement over bagging:

• Adds randomness to tree growth.


• At each split, only a random subset of features is considered.

This:

• Decorrelates trees
• Further reduces variance

Hyperparameters:

• Number of trees (n_estimators)


• Number of features considered at each split (max_features)

Variable Importance:

• Measures how much prediction error increases when a variable is
randomly permuted.
• Random forests provide variable importance plots.
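
A random forest sketch showing the hyperparameters above plus OOB error and
variable importances (the parameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=500,      # number of bootstrapped trees
                            max_features="sqrt",   # random feature subset per split
                            oob_score=True,        # evaluate on out-of-bag samples
                            random_state=0).fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 3))
top = rf.feature_importances_.argsort()[::-1][:5]
print("Top 5 important features (indices):", top)
```
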
PART C: Boosting

✅ 6. Boosting

Boosting builds trees sequentially, where:

• Each tree learns from the residuals (errors) of the previous one.
• Trees are usually shallow (often stumps with depth = 1).

Algorithm (Gradient Boosting):

1. Start with initial prediction (mean for regression)


2. Compute residuals
3. Fit a small tree to residuals
4. Update the model
5. Repeat for M iterations

Parameters:

• Number of trees M
• Learning rate λ
• Tree depth

Advantages:

• High predictive power


• Works well for both classification and regression

Disadvantages:

• Prone to overfitting if not properly regularized


• Slower to train than random forests
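
A gradient boosting sketch matching the parameters listed above (M trees,
learning rate λ, tree depth); the specific values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=300,   # M: number of sequential trees
                                learning_rate=0.05, # shrinkage applied to each tree
                                max_depth=2,        # shallow trees fit the residuals
                                random_state=0).fit(X_train, y_train)

print("Test R^2:", round(gbm.score(X_test, y_test), 3))
```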

Summary Table

Method          Ensemble?   Variance Reduction   Bias Reduction   Overfitting Risk   Interpretability
Decision Tree   ❌          ❌                   ❌               High               High
Bagging         ✅          ✅                   ❌               Low                Medium
Random Forest   ✅          ✅✅                 ❌               Low                Low
Boosting        ✅          ✅                   ✅               Medium to High     Low

Real-World Use Cases


• Credit scoring (classification trees)
• Customer churn prediction
• Loan default detection
• Sales forecasting (regression trees)
• Medical diagnosis (boosting and random forests)
