
UNIT 1: INTRODUCTION

(Reference: Alpaydin, Chapter 1)


Total Time: 5 Hours

1. What is Machine Learning?

Definition:
Machine Learning (ML) is a subfield of artificial intelligence (AI) that
focuses on developing systems that can learn from and make decisions or
predictions based on data.

Instead of being explicitly programmed to perform a task, ML systems
learn from experience, improving their performance over time based on
input data and feedback.

Arthur Samuel defined ML as:

“The field of study that gives computers the ability to learn without
being explicitly programmed.”

2. Basic Concepts and Terminology

• Model: A mathematical representation of a real-world process based on data.
• Training Data: The dataset used to train the model.
• Features (Inputs): Independent variables (X₁, X₂, ..., Xn) used by the
model to make predictions.
• Labels (Outputs): Target variable (Y) the model is trying to predict.
• Hypothesis: A function that approximates the true relationship
between input and output.
• Loss Function: A method to measure how far the model's
predictions are from actual results.
• Training: The process of feeding data to the model to minimize the
loss function.
• Testing: Evaluating model performance on unseen data.

3. Key Elements of an ML System


1. Representation:
The choice of how data is represented and the type of model (e.g.,
decision tree, neural network).
2. Evaluation:
A metric to evaluate model performance (e.g., accuracy, precision,
recall, F1 score).
3. Optimization:
The method to adjust parameters to minimize errors (e.g., gradient
descent).
4. Data:
Quality and quantity of data are crucial. Clean, labeled, and relevant
data leads to better learning.
5. Feedback:
Some ML models learn continuously from incoming data or
performance signals (especially in reinforcement learning).

4. Types of Machine Learning

1. Supervised Learning

• Definition: The algorithm is trained on a labeled dataset, meaning
each training example has a corresponding output label.
• Goal: Learn a mapping from inputs (X) to outputs (Y).
• Examples:
o Spam detection
o House price prediction
o Image classification
• Common Algorithms:
o Linear regression
o Logistic regression
o Decision trees
o k-NN
o SVM
o Neural networks

2. Unsupervised Learning

• Definition: The algorithm is given data without any labels. It tries to
learn the structure or patterns in the data.
• Goal: Find hidden patterns or intrinsic structures in data.
• Examples:
o Customer segmentation
o Anomaly detection
o Topic modeling
• Common Algorithms:
o k-means clustering
o Hierarchical clustering
o PCA (Principal Component Analysis)

3. Reinforcement Learning

• Definition: An agent interacts with an environment and learns by
receiving rewards or penalties for its actions.
• Goal: Learn a policy that maximizes the long-term reward.
• Examples:
o Game playing (e.g., AlphaGo)
o Robotics
o Self-driving cars
• Key Concepts:
o Agent, Environment, Actions, States, Reward, Policy

5. Applications of Machine Learning

ML is widely used across industries for automation, prediction,
personalization, and analytics:

Domain                              Application
Healthcare                          Disease diagnosis, drug discovery
Finance                             Credit scoring, fraud detection
Retail                              Product recommendation, customer analytics
Marketing                           Targeted advertising, churn prediction
Transportation                      Self-driving cars, route optimization
Social Media                        Content recommendation, fake news detection
Natural Language Processing (NLP)   Chatbots, language translation

6. Traditional Programming vs Machine Learning

Aspect         Traditional Programming                          Machine Learning
Input          Rules + Data                                     Data + Output
Output         Output                                           Rules (i.e., the learned model)
Flexibility    Rigid – programmer defines rules                 Flexible – model learns patterns from data
Adaptability   Cannot adapt to new data without reprogramming   Can improve with new data

7. Challenges in Machine Learning

• Data Quality and Quantity: Incomplete, noisy, or biased data can
degrade performance.
• Overfitting vs Underfitting: Overfitting occurs when the model
learns noise; underfitting when it fails to learn patterns.
• Interpretability: Some ML models, especially deep learning, are
black boxes.
• Ethical Issues: Bias in data, decision transparency, and privacy
concerns.

UNIT 2: PREPROCESSING
(Reference: James et al., Chapter 6, Sections 6.1.1, 6.1.2; Chapter 10,
Section 10.2)
Total Time: 6 Hours

1. Introduction to Preprocessing

Preprocessing is a critical step in any machine learning pipeline. It involves
transforming raw data into a clean, usable format to improve the accuracy,
efficiency, and reliability of a machine learning model.

Why Preprocessing?

• Raw data often contains noise, missing values, or irrelevant information.
• Many algorithms are sensitive to scale, distribution, and type of input
features.
• Good preprocessing improves model performance and reduces
overfitting.

2. Feature Scaling
Feature scaling is used to normalize the range of independent variables or
features of data. It ensures that features contribute equally to the learning
process.

✅ 2.1 Normalization (Min-Max Scaling)

Formula:

x' = (x − x_min) / (x_max − x_min)

• Transforms data into a range [0, 1] (or any other desired range).
• Useful when the distribution is not Gaussian (e.g., image pixel
intensities).
• Sensitive to outliers.

Example:
A feature ranging from 20 to 80 will be scaled as x' = (x − 20) / (80 − 20),
so a value of 50 maps to 0.5.

✅ 2.2 Standardization (Z-score Scaling)

Formula:

z = (x − μ) / σ

Where:

• μ = mean of the feature
• σ = standard deviation of the feature
• Transforms data to have zero mean and unit variance.
• Preferred when the feature values follow a Gaussian distribution.
• More robust to outliers than min-max normalization.

Use case: Recommended for algorithms that are sensitive to feature scale, such as
logistic regression, SVM, k-means, and PCA.
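
Both scalings can be applied with scikit-learn; a minimal sketch, assuming a small
NumPy feature matrix of made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix: two features on very different scales (hypothetical data)
X = np.array([[20.0, 1000.0],
              [50.0, 3000.0],
              [80.0, 5000.0]])

# Min-max scaling: maps each feature to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: zero mean, unit variance per feature
X_std = StandardScaler().fit_transform(X)

print(X_minmax)                               # first column becomes [0, 0.5, 1]
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 means, ~1 standard deviations
```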
3. Feature Selection

Feature selection is the process of identifying and removing as many
irrelevant and redundant features as possible from the dataset.

Goals:

• Improve model generalization.


• Reduce computational cost.
• Increase model interpretability.

✅ 3.1 Methods of Feature Selection

a. Filter Methods

• Use statistical techniques to evaluate feature importance
independently of any ML model.
• Examples:
o Correlation coefficient
o Chi-square test
o Mutual information

b. Wrapper Methods

• Use a predictive model to evaluate combinations of features and
select the best performing subset.
• Computationally expensive.
• Examples:
o Recursive Feature Elimination (RFE)
o Forward/Backward selection

c. Embedded Methods

• Perform feature selection during the model training process.


• Examples:
o Lasso (L1 regularization): Shrinks some coefficients to zero.
o Decision tree feature importance.
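
One example of each family, sketched with scikit-learn on synthetic data (the
dataset, k value, and alpha are illustrative choices, not prescribed by the notes):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression, Lasso

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Filter method: score each feature independently (mutual information)
filter_sel = SelectKBest(mutual_info_classif, k=4).fit(X, y)
print("Filter keeps:", filter_sel.get_support(indices=True))

# Wrapper method: Recursive Feature Elimination around a model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("RFE keeps:", rfe.get_support(indices=True))

# Embedded method: Lasso drives some coefficients exactly to zero
lasso = Lasso(alpha=0.05).fit(X, y)
print("Lasso keeps:", (lasso.coef_ != 0).nonzero()[0])
```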

4. Dimensionality Reduction

Dimensionality reduction involves reducing the number of input variables
in a dataset while preserving as much information as possible.

Why reduce dimensions?

• Avoid the curse of dimensionality


• Reduce overfitting
• Improve visualization and computation

Two major approaches:

1. Feature Selection (discussed above)


2. Feature Extraction (e.g., PCA)

5. Principal Component Analysis (PCA)

PCA is the most widely used technique for unsupervised feature
extraction and dimensionality reduction.

✅ 5.1 Intuition Behind PCA

• PCA identifies new axes (principal components) in the data that
capture the maximum variance.
• The first component captures the highest variance, the second
captures the next highest variance orthogonal to the first, and so on.
• The data is then projected onto a reduced number of these
components.

✅ 5.2 Steps in PCA

1. Standardize the data: Scale features to zero mean and unit variance.
2. Compute the covariance matrix: Understand the relationships
between variables.
3. Calculate eigenvectors and eigenvalues:
o Eigenvectors represent the directions of new feature space.
o Eigenvalues represent the magnitude of variance in the
direction of each eigenvector.
4. Sort eigenvectors by decreasing eigenvalues and choose k
components.
5. Project data onto these top k eigenvectors to get the reduced
dataset.
✅ 5.3 Mathematical Foundation

Let X be the n × p matrix of standardized data (n observations, p features).

• Covariance matrix: C = (1 / (n − 1)) XᵀX
• Eigen-decomposition: C v = λ v, where each eigenvector v is a principal
component direction and its eigenvalue λ is the variance captured along it.
• Projection: Z = X W_k, where W_k holds the top k eigenvectors as columns and
Z is the reduced (n × k) representation.

✅ 5.4 Choosing the Number of Components

• Use explained variance ratio.


• Choose k such that 95% (or 99%) of the total variance is retained.

Example:
If PCA tells you that the first 3 components explain 97% of the variance, you
can reduce the feature set to 3 dimensions without losing much
information.
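
A minimal PCA sketch with scikit-learn on synthetic data; passing a fraction to
n_components keeps just enough components to reach that explained-variance
threshold:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=300)   # add a highly correlated feature

# Step 1: standardize, then fit PCA keeping 95% of the variance
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print("Components kept:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Reduced shape:", X_reduced.shape)
```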

✅ 5.5 Applications of PCA

• Image compression
• Face recognition
• Noise filtering
• Data visualization (e.g., 2D plots)

❗ Important Notes:

• PCA assumes linear relationships and that directions of large variance
correspond to important structure.
• PCA does not consider output labels; it is unsupervised.
• PCA may discard low-variance directions that are important for classification.
UNIT 3: REGRESSION
(Reference: James et al., Chapter 3 & Chapter 6.2.1)
Total Time: 12 Hours

1. Introduction to Regression

Regression is a supervised learning technique used for predicting
continuous numeric outcomes based on one or more input variables
(features). It estimates the relationship between dependent and
independent variables.

• Goal: Predict or explain a quantitative response.


• Examples:
o Predicting house prices based on size, location, etc.
o Estimating a student’s test score based on study hours.

2. Linear Regression with One Variable (Simple Linear Regression)

✅ 2.1 Definition

Simple linear regression models the relationship between a single
predictor (X) and the response variable (Y) by fitting a straight line:

Y = β0 + β1X + ε

Where:

• Y = dependent variable
• X = independent variable
• β0 = intercept
• β1 = slope (effect of X on Y)
• ε = error term

✅ 2.2 Assumptions of Linear Regression

1. Linearity between predictors and response.


2. Independence of errors.
3. Homoscedasticity (constant variance of errors).
4. Normality of errors.

✅ 2.3 Least Squares Estimation

Objective: Minimize the Residual Sum of Squares (RSS):

RSS = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β0 − β1xᵢ)²

The best-fit line is obtained by solving:

β̂1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
β̂0 = ȳ − β̂1 x̄

✅ 2.4 Coefficient of Determination (R^2)

• Indicates the proportion of variance explained by the model.


• Range: 0 to 1 (higher is better).
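
A short NumPy sketch of the closed-form least-squares estimates and R² on
synthetic data (the true coefficients 3 and 2 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=100)   # true line: Y = 3 + 2X + noise

# Closed-form least-squares estimates
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x
rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss                    # coefficient of determination

print(f"beta0={beta0:.2f}, beta1={beta1:.2f}, R^2={r2:.3f}")
```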

3. Linear Regression with Multiple Variables (Multiple Linear Regression)

Extends simple linear regression to multiple predictors:

Y = β0 + β1X₁ + β2X₂ + ... + βpXp + ε

• Predicts Y based on a linear combination of multiple features.
• Solved using matrix algebra: β̂ = (XᵀX)⁻¹XᵀY
• Can capture more complex relationships than simple regression.


4. Gradient Descent (Optimization Method)

Gradient Descent is an iterative optimization algorithm to minimize the
cost function.

✅ 4.1 Cost Function

J(β) = (1 / 2m) Σᵢ (ŷᵢ − yᵢ)²   (squared error averaged over the m training examples)

✅ 4.2 Update Rule

βⱼ := βⱼ − α · ∂J(β)/∂βⱼ   (applied simultaneously to every parameter βⱼ)

• α: learning rate
• Repeat until convergence

Types of Gradient Descent:

• Batch Gradient Descent: Uses the entire dataset.


• Stochastic GD: Uses one sample at a time (faster).
• Mini-batch GD: Uses a small batch (best trade-off).
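
A compact batch gradient descent sketch for simple linear regression, assuming
the squared-error cost above (the learning rate and iteration count are
illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=200)
y = 1.0 + 4.0 * x + rng.normal(scale=0.5, size=200)

beta0, beta1 = 0.0, 0.0      # start from zero
alpha = 0.05                 # learning rate
m = len(x)

for _ in range(2000):        # batch gradient descent: full dataset each step
    y_hat = beta0 + beta1 * x
    error = y_hat - y
    grad0 = error.mean()             # dJ/d(beta0)
    grad1 = (error * x).mean()       # dJ/d(beta1)
    beta0 -= alpha * grad0
    beta1 -= alpha * grad1

print(f"beta0={beta0:.2f}, beta1={beta1:.2f}")   # should approach 1 and 4
```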

5. Overfitting and Underfitting

✅ 5.1 Overfitting

• Model learns noise instead of pattern.


• High training accuracy, poor test accuracy.

✅ 5.2 Underfitting

• Model is too simple to capture patterns.


• Low training and test accuracy.
6. Regularization

Regularization prevents overfitting by adding a penalty to the loss
function.

✅ 6.1 Ridge Regression (L2 Regularization)

Adds squared magnitude of coefficients to the cost function:

Cost = RSS + λ Σⱼ βⱼ²

• Shrinks coefficients but never sets them to zero.
• λ: regularization parameter (controls penalty).

✅ 6.2 Lasso Regression (L1 Regularization)

Adds absolute value of coefficients:

Cost = RSS + λ Σⱼ |βⱼ|

• Performs feature selection by shrinking some coefficients to zero.
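
A quick comparison sketch with scikit-learn (note that sklearn calls the λ
parameter alpha; the dataset and alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=15, n_informative=5,
                       noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)     # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=5.0).fit(X, y)      # L1 penalty: zeroes some out

print("OLS   nonzero coefs:", np.sum(ols.coef_ != 0))
print("Ridge nonzero coefs:", np.sum(ridge.coef_ != 0))   # all 15, just smaller
print("Lasso nonzero coefs:", np.sum(lasso.coef_ != 0))   # typically fewer than 15
```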

7. Regression Evaluation Metrics

✅ 7.1 Mean Squared Error (MSE)

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

• Sensitive to outliers.

✅ 7.2 Root Mean Squared Error (RMSE)

RMSE = √MSE

• More interpretable because it has the same units as Y.

✅ 7.3 Mean Absolute Error (MAE)

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|

• Less sensitive to outliers than MSE.

✅ 7.4 R-Squared (R^2)

R^2 = 1 − RSS/TSS, where TSS = Σᵢ (yᵢ − ȳ)²

• Measures variance explained by the model.
• R^2 = 1: perfect fit.
• R^2 = 0: model predicts no better than the mean.
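
These metrics are available in sklearn.metrics; a minimal sketch with made-up
prediction arrays:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```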

Key Takeaways

• Linear Regression is foundational for ML. It helps understand
relationships between variables.
• Multiple Regression extends it to multiple predictors.
• Gradient Descent is essential when analytical solutions are
intractable.
• Regularization avoids overfitting and improves generalization.
• Evaluation metrics help assess model accuracy and reliability.
UNIT 4: CLASSIFICATION
(Reference: James et al., Chapter 4 & Chapter 6.2.2)
Total Time: 10 Hours

1. Introduction to Classification

Classification is a supervised learning task where the goal is to predict a
categorical outcome (label or class) based on input features.

• Output: Discrete class labels (e.g., "spam" or "not spam")


• Examples:
o Email spam detection
o Disease diagnosis (yes/no)
o Image recognition (cat/dog/bird)

2. Binary Classification

Binary classification involves only two classes, typically labeled as:

• 0 and 1, or
• Negative and Positive

Goal: Learn a decision boundary that separates the two classes effectively.

3. Logistic Regression

Logistic regression is used when the dependent variable is binary. It
models the probability that a given input belongs to a particular class.

✅ 3.1 Logistic Function (Sigmoid Function)

The model outputs a probability, which is squashed between 0 and 1
using the sigmoid function:

P(Y=1|X) = σ(z) = 1 / (1 + e^(−z)), where z = β0 + β1X₁ + ... + βpXp

• If the output > 0.5 → class 1
• If the output ≤ 0.5 → class 0

✅ 3.2 Log-Odds or Logit Transformation

Instead of modeling P(Y=1|X) directly, logistic regression models the log
odds:

log( P(Y=1|X) / (1 − P(Y=1|X)) ) = β0 + β1X₁ + ... + βpXp

Applying the sigmoid to the linear function on the right transforms it back
into a probability.

✅ 3.3 Model Training

Logistic regression is typically trained using Maximum Likelihood
Estimation (MLE), not least squares.

• Likelihood function: L(β) = Πᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ), where pᵢ = P(Yᵢ=1|Xᵢ)
• The log-likelihood is maximized to estimate the β values.
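
A minimal scikit-learn logistic regression sketch on a built-in dataset (sklearn
fits the coefficients by maximizing a regularized log-likelihood internally):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]   # P(Y=1|X) from the sigmoid
preds = (probs > 0.5).astype(int)         # threshold at 0.5
print("Test accuracy:", (preds == y_test).mean().round(3))
```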

4. Decision Boundaries

A decision boundary is a surface that separates different classes predicted
by the model.

• In logistic regression with two features, the decision boundary is a
line (or a hyperplane in higher dimensions).
• For example, the boundary is the set of points where the predicted
probability equals 0.5, i.e. where β0 + β1X₁ + β2X₂ = 0.

This line separates the class 0 and class 1 regions.


5. Evaluation Metrics for Classification

✅ 5.1 Confusion Matrix

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)

✅ 5.2 Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

• Works well only when classes are balanced.

✅ 5.3 Precision

Precision = TP / (TP + FP)

• Of all predicted positives, how many are correct.

✅ 5.4 Recall (Sensitivity or TPR)

Recall = TP / (TP + FN)

• Of all actual positives, how many did the model find.

✅ 5.5 F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

• Harmonic mean of precision and recall.
• Useful in imbalanced datasets.

✅ 5.6 ROC Curve (Receiver Operating Characteristic)

• Plots True Positive Rate (Recall) vs. False Positive Rate (FPR)
• Area Under the Curve (AUC) measures model performance:
o AUC = 1: perfect classifier
o AUC = 0.5: random classifier
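
All of these metrics are available in sklearn.metrics; a sketch continuing from a
fitted binary classifier, assuming the names y_test, preds, and probs from the
earlier logistic regression sketch:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# y_test: true labels, preds: 0/1 predictions, probs: predicted P(Y=1|X)
print("Confusion matrix:\n", confusion_matrix(y_test, preds))
print("Accuracy :", accuracy_score(y_test, preds))
print("Precision:", precision_score(y_test, preds))
print("Recall   :", recall_score(y_test, preds))
print("F1 score :", f1_score(y_test, preds))
print("ROC AUC  :", roc_auc_score(y_test, probs))   # uses scores, not hard labels
```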

6. K-Nearest Neighbors (KNN)

KNN is a non-parametric, instance-based learning algorithm.

✅ 6.1 Working of KNN

• Store the entire training dataset.


• For a new data point:
1. Calculate distance to all training points.
2. Select the K closest points.
3. Assign the class that is most common among those
neighbors.

✅ 6.2 Distance Metrics

• Euclidean Distance: d(x, x′) = √( Σⱼ (xⱼ − x′ⱼ)² )

• Manhattan Distance: d(x, x′) = Σⱼ |xⱼ − x′ⱼ|

✅ 6.3 Choosing K

• Small K → high variance (overfitting)


• Large K → high bias (underfitting)
• Use cross-validation to find optimal K
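
A short sketch that scans a few values of K with cross-validation (the dataset
and K grid are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate several K values with 5-fold cross-validation
for k in (1, 3, 5, 9, 15):
    knn = KNeighborsClassifier(n_neighbors=k)       # Euclidean distance by default
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k:2d}  CV accuracy={score:.3f}")
```
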
7. Comparison: Logistic Regression vs. KNN

Feature                  Logistic Regression           K-Nearest Neighbors (KNN)
Type                     Parametric                    Non-parametric
Model Interpretability   High                          Low
Decision Boundary        Linear (unless extended)      Non-linear
Training Time            Fast                          Minimal (lazy learner; just stores the data)
Prediction Time          Fast                          Slow (distance calculations)
Handles non-linearity    No (needs transformation)     Yes (implicitly handles it)

Key Takeaways

• Classification predicts categories, not quantities.


• Logistic Regression is a powerful and interpretable linear classifier.
• KNN is simple, flexible, and works well with local structure in data.
• Choosing the right evaluation metric is crucial, especially for
imbalanced data.
• Decision boundaries help us understand model behavior visually.
UNIT 5: RESAMPLING METHODS
Reference: James et al., Chapter 5
Time: 7 Hours

1. What Are Resampling Methods?

Resampling methods are a set of techniques used to assess model
accuracy and improve model performance by reusing data.

They are especially useful when:

• The dataset is not large enough to be split into distinct training and
test sets.
• We want a better estimate of the model's error rate.
• We aim to compare multiple models effectively.

2. Types of Resampling Methods

There are two major resampling methods discussed:

1. Cross-Validation
2. Bootstrap

PART A: Cross-Validation

✅ 1. Why Not Just Use a Validation Set?

When we split the data into training and validation sets:

• The model performance may vary based on how the data is split.
• It uses less data for training, potentially leading to a less accurate
model.

Solution: Use cross-validation to repeatedly split and train/test on different
subsets of the data.
✅ 2. K-Fold Cross-Validation

K-fold cross-validation is the most common cross-validation technique.

Working:

1. Split the dataset into K equal-sized parts (folds).


2. For each fold:
o Use the fold as a validation set.
o Use the remaining K-1 folds as the training set.
3. Repeat K times.
4. Average the validation errors to estimate overall model error.

Example:

If K = 5:

• You split the data into 5 parts.


• Each fold gets to be the validation set once.

Advantages:

• Less variance in the performance estimate compared to a single
validation set.
• All observations are used for both training and validation.

✅ 3. Choosing K

• Small K (e.g., 5): Less computational cost, higher bias.


• Large K (e.g., 10): More accurate, but higher computational cost.
• Extreme case: Leave-One-Out CV (LOOCV) where K = n.

✅ 4. Leave-One-Out Cross-Validation (LOOCV)

• A special case where K = number of data points.


• For each iteration:
o Train on n − 1 points.
o Test on the remaining 1 point.
• Repeat n times.

Advantages:

• Very low bias.


• Utilizes maximum data for training.

Disadvantages:

• Computationally expensive.
• High variance: the training set is almost the same in each iteration.

✅ 5. Stratified K-Fold Cross-Validation

Used when the data is imbalanced (e.g., 90% class A, 10% class B).

• Ensures each fold maintains the same class distribution as the


original dataset.
• More reliable performance evaluation for classification tasks.
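
A sketch comparing plain and stratified 5-fold cross-validation with scikit-learn
(the classifier and the synthetic imbalanced dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Imbalanced binary problem: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

print("K-fold accuracy           :", cross_val_score(clf, X, y, cv=kf).mean().round(3))
print("Stratified K-fold accuracy:", cross_val_score(clf, X, y, cv=skf).mean().round(3))
```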

PART B: Bootstrap Method

✅ 1. What is the Bootstrap?

The Bootstrap is a statistical resampling technique used to estimate the
accuracy (e.g., variance, confidence intervals) of a sample statistic.

Introduced by Bradley Efron in 1979.

✅ 2. How It Works

1. From a dataset of size n, draw B samples with replacement.


2. Each sample is also of size n.
3. Calculate the statistic (e.g., mean, standard deviation, model
accuracy) on each bootstrap sample.
4. Use the distribution of the B results to estimate uncertainty
(standard error, confidence intervals).

✅ 3. Example: Estimating Standard Error of the Mean

Let’s say you have a dataset with 100 values.


• Sample 100 data points with replacement → one bootstrap sample.
• Repeat this B = 1000 times.
• Compute the mean for each sample.
• Use the standard deviation of those means to estimate the standard
error.
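
A NumPy sketch of exactly this procedure, with B = 1000 resamples of a made-up
dataset of 100 values:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100)   # the original sample of 100 values

B = 1000
boot_means = np.empty(B)
for b in range(B):
    sample = rng.choice(data, size=len(data), replace=True)  # draw with replacement
    boot_means[b] = sample.mean()

se_bootstrap = boot_means.std(ddof=1)                 # bootstrap estimate of the SE
se_formula = data.std(ddof=1) / np.sqrt(len(data))    # classical formula, for comparison
print(f"bootstrap SE={se_bootstrap:.3f}  formula SE={se_formula:.3f}")
```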

✅ 4. Bootstrap for Model Evaluation

• Build and test the model on each bootstrap sample.


• Estimate the model’s prediction error using multiple such
bootstrap samples.
• Helps in model selection, bias-variance tradeoff analysis, etc.

✅ 5. Out-of-Bag (OOB) Error Estimation

For each bootstrap sample:

• About 63% of data is included (some data gets repeated).


• The remaining 37% (not selected) is called the out-of-bag (OOB)
data.
• Use OOB data to evaluate model performance.

This acts like internal cross-validation.

Comparison: Cross-Validation vs. Bootstrap


Feature         Cross-Validation                     Bootstrap
Data sampling   Without replacement                  With replacement
Used for        Model performance evaluation         Estimating accuracy and variability
Output          Error estimate (bias, variance)      Distribution of statistics
Best use case   When model selection is the goal     When measuring uncertainty
Computation     Moderate (K folds)                   Often heavier (B samples)

Final Thoughts
• Resampling methods are crucial for reliable model evaluation.
• They help in:
o Choosing the best model
o Tuning hyperparameters
o Understanding model stability
• While they increase computational cost, they offer more
trustworthy performance estimates, especially when the dataset is
small.

UNIT 6: LINEAR MODEL SELECTION AND REGULARIZATION
Reference: James et al., Chapter 6
Time: 7 Hours

Overview

Linear models, such as Linear Regression, can suffer from problems like:

• Overfitting when too many predictors are used.


• Multicollinearity (high correlation among predictors).
• High variance in predictions.

To address these, we use:

1. Subset selection
2. Shrinkage methods (regularization):
o Ridge Regression
o Lasso Regression
3. Dimension reduction methods:
o Principal Component Regression (PCR)
o Partial Least Squares (PLS)

PART A: Best Subset and Stepwise Selection


✅ 1. Best Subset Selection

Involves fitting a separate linear model for every possible combination
of predictors and selecting the best model.

Steps:

• Given p predictors, there are 2^p possible models.


• For each model size k (number of predictors), find the best-fitting
model.
• Select the model with the lowest test error, highest adjusted R², or
lowest AIC/BIC.

Disadvantages:

• Computationally expensive (impractical for large p).


• Prone to overfitting if not paired with cross-validation.

✅ 2. Stepwise Selection

A more practical alternative to best subset selection.

Forward Stepwise Selection:

• Start with no predictors.


• Add predictors one-by-one that improve model performance the
most.
• Stop when adding more does not significantly reduce the error.

Backward Stepwise Selection:

• Start with all predictors.


• Remove predictors one-by-one that reduce model performance the
least.
• Stop when further removal increases error.

Criteria for Evaluation:

• Adjusted R²
• AIC (Akaike Information Criterion)
• BIC (Bayesian Information Criterion)
• Validation error (via cross-validation)
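
Scikit-learn's SequentialFeatureSelector implements this greedy forward/backward
search scored by cross-validation; a minimal sketch (the data and the number of
features to select are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=150, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Forward stepwise: start empty, greedily add the predictor that helps most
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                    direction="forward", cv=5).fit(X, y)
print("Forward selection keeps :", forward.get_support(indices=True))

# Backward stepwise: start with all predictors, greedily drop the least useful
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="backward", cv=5).fit(X, y)
print("Backward selection keeps:", backward.get_support(indices=True))
```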
PART B: Shrinkage Methods (Regularization)

✅ 3. Ridge Regression

Ridge regression addresses overfitting by adding a penalty term to the
ordinary least squares (OLS) loss function.

Objective Function:

minimize  RSS + λ Σⱼ βⱼ²

• λ is the tuning parameter controlling the amount of shrinkage.
• As λ → ∞, coefficients shrink towards zero but never become exactly
zero.
• Useful when multicollinearity exists.

Key Features:

• Retains all predictors but shrinks their coefficients.


• Works well when all predictors are relevant.
• Needs standardization of predictors before use.

✅ 4. Lasso Regression (Least Absolute Shrinkage and Selection Operator)

Lasso modifies the penalty term to use the absolute value of the
coefficients.

Objective Function:

minimize  RSS + λ Σⱼ |βⱼ|

• Forces some coefficients to become exactly zero, leading to
feature selection.
• More interpretable than ridge regression.
Comparison with Ridge:

Feature             Ridge                        Lasso
Penalty type        L2 (squared coefficients)    L1 (absolute coefficients)
Coefficients        Shrinks to small values      Can shrink to zero
Feature selection   ❌ No                        ✅ Yes
Best use case       Many small effects           Sparse model (few predictors matter)

✅ 5. Choosing λ (Tuning Parameter)

Both Ridge and Lasso depend on λ, which is selected using cross-validation.

• Small λ → Less penalty → Model similar to OLS


• Large λ → More shrinkage

Use K-fold cross-validation to find the optimal λ that minimizes the test
error.
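
In scikit-learn, the built-in cross-validated estimators pick this value
automatically (again, sklearn names the parameter alpha); a sketch with an
illustrative alpha grid:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 50)                  # candidate penalty values

ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)   # K-fold CV over the grid
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("Best ridge alpha:", ridge.alpha_)
print("Best lasso alpha:", lasso.alpha_, "| nonzero coefs:", np.sum(lasso.coef_ != 0))
```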

PART C: Dimension Reduction Methods

✅ 6. Principal Component Regression (PCR)

PCR reduces the predictor variables to principal components (PCs), which
are linear combinations of original variables.

Steps:

1. Apply PCA to predictors.


2. Select top M components (based on variance).
3. Use those PCs as predictors in a linear regression.

Advantages:

• Handles multicollinearity.
• Reduces overfitting.
Disadvantages:

• PCs are not necessarily related to the outcome Y.


• May discard variables that are important for prediction.

✅ 7. Partial Least Squares (PLS)

Like PCR, but PLS components are chosen based on both predictors
and response.

• Finds components that explain both the predictors and the response
well.
• Often performs better than PCR.

Summary Table

Method   Feature Selection   Handles Multicollinearity   Good for Interpretation   Uses All Predictors
OLS      ❌                  ❌                          ✅                        ✅
Ridge    ❌                  ✅                          ❌                        ✅
Lasso    ✅                  ✅                          ✅                        ❌ (some)
PCR      ❌                  ✅                          ❌                        (transformed PCs)
PLS      ❌                  ✅                          ❌                        (transformed PCs)

Real-World Use
• In high-dimensional data like genomics, finance, or text mining,
regularization is vital.
• Lasso is widely used for automatic feature selection.
• Ridge is common when you expect many weak predictors.
• PCR/PLS are used when variables are highly correlated.
UNIT 7: TREE-BASED METHODS
Reference: James et al., Chapter 8
Time: 5 Hours

Overview
Tree-based methods partition the feature space into rectangular regions,
making predictions based on the majority vote (classification) or mean
(regression) of observations in those regions.

They are:

1. Easy to interpret (especially decision trees)


2. Able to handle non-linear relationships
3. Prone to overfitting (especially single trees), hence ensemble
methods like bagging, random forests, and boosting are introduced
to improve performance.

PART A: Decision Trees

✅ 1. What is a Decision Tree?

A decision tree is a flowchart-like structure where:

• Internal nodes represent tests on features


• Branches represent outcomes of tests
• Leaf nodes represent predictions (outcomes)

✅ 2. Tree Construction

For Regression:

• Goal: Split data to minimize RSS (Residual Sum of Squares).
• RSS for region Rm is: Σ (yᵢ − ŷ_Rm)², summed over the observations i in Rm,
where ŷ_Rm is the mean response in that region.

For Classification:

• Common criteria for splitting:
o Classification error rate
o Gini index: G = Σₖ p̂mk (1 − p̂mk)
o Entropy (cross-entropy): D = − Σₖ p̂mk log(p̂mk)
where p̂mk is the proportion of class k observations in region m.

✅ 3. Tree Pruning

Trees can overfit if grown fully. To prevent this:

• Grow a large tree.
• Apply cost-complexity pruning using parameter α: among subtrees T, choose
the one minimizing  Σ_m RSS(Rm) + α|T|,
where |T| is the number of terminal nodes; larger α gives smaller trees.

Use cross-validation to choose the optimal α.
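
scikit-learn exposes cost-complexity pruning through ccp_alpha; a sketch that
chooses α by cross-validation over the pruning path (dataset is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas come from the cost-complexity pruning path of a full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"Best alpha={best_alpha:.4f}  CV accuracy={best_score:.3f}")
```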

PART B: Bagging and Random Forests


✅ 4. Bagging (Bootstrap Aggregation)

Bagging builds multiple trees on bootstrapped samples and averages
the predictions.

• Reduces variance
• Works best for unstable models (like decision trees)

For Classification:

• Majority vote across all trees

For Regression:

• Average prediction across all trees

Out-of-Bag (OOB) Error:

• Each observation is used in about 2/3 of bootstraps.


• The remaining 1/3 are OOB samples used to estimate error.

✅ 5. Random Forests

Improvement over bagging:

• Adds randomness to tree growth.


• At each split, only a random subset of features is considered.

This:

• Decorrelates trees
• Further reduces variance

Hyperparameters:

• Number of trees (n_estimators)


• Number of features considered at each split (max_features)

Variable Importance:

• Measures how much prediction error increases when a variable is
randomly permuted.
• Random forests provide variable importance plots.
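
A random forest sketch showing the hyperparameters above plus OOB error and
variable importances (the parameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=500,      # number of bootstrapped trees
                            max_features="sqrt",   # random feature subset per split
                            oob_score=True,        # evaluate on out-of-bag samples
                            random_state=0).fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 3))
top = rf.feature_importances_.argsort()[::-1][:5]
print("Top 5 important features (indices):", top)
```
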
PART C: Boosting

✅ 6. Boosting

Boosting builds trees sequentially, where:

• Each tree learns from the residuals (errors) of the previous one.
• Trees are usually shallow (often stumps with depth = 1).

Algorithm (Gradient Boosting):

1. Start with initial prediction (mean for regression)


2. Compute residuals
3. Fit a small tree to residuals
4. Update the model
5. Repeat for M iterations

Parameters:

• Number of trees M
• Learning rate λ
• Tree depth

Advantages:

• High predictive power


• Works well for both classification and regression

Disadvantages:

• Prone to overfitting if not properly regularized


• Slower to train than random forests
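
A gradient boosting sketch matching the parameters listed above (M trees,
learning rate λ, tree depth); the specific values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=300,   # M: number of sequential trees
                                learning_rate=0.05, # shrinkage applied to each tree
                                max_depth=2,        # shallow trees fit the residuals
                                random_state=0).fit(X_train, y_train)

print("Test R^2:", round(gbm.score(X_test, y_test), 3))
```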

Summary Table

Method          Ensemble?   Variance Reduction   Bias Reduction   Overfitting Risk   Interpretability
Decision Tree   ❌          ❌                   ❌               High               High
Bagging         ✅          ✅                   ❌               Low                Medium
Random Forest   ✅          ✅✅                 ❌               Low                Low
Boosting        ✅          ✅                   ✅               Medium to High     Low

Real-World Use Cases


• Credit scoring (classification trees)
• Customer churn prediction
• Loan default detection
• Sales forecasting (regression trees)
• Medical diagnosis (boosting and random forests)
