R23 III B.Tech I Semester
Department of Artificial Intelligence and Machine Learning
Subject: Advanced Machine Learning 23A03351T
Unit-I
Bias and Variance:
 Bias refers to the error that occurs when we try to fit a statistical model to real-world data that does not perfectly follow any simple mathematical form. If we use too simplistic a model, we are likely to face high bias (underfitting): the model is unable to learn the patterns in the data at hand and performs poorly.
 Variance refers to the error that occurs when we make predictions on data the model has not seen before. High variance (overfitting) occurs when the model learns the noise that is present in the training data.
Finding a proper balance between the two, known as the Bias-Variance Tradeoff, helps us design an accurate model.
Bias-Variance Tradeoff
The Bias-Variance Tradeoff refers to the balance between bias and variance which affect
predictive model performance. Finding the right tradeoff is important for creating models that
generalize well to new data.
 The bias-variance tradeoff shows the inverse relationship between bias and variance. When
one decreases, the other tends to increase and vice versa.
 Finding the right balance is important. An overly simple model with high bias won't capture
the underlying patterns while an overly complex model with high variance will fit the noise
in the data.

Overfitting and Underfitting:


Overfitting and underfitting are terms used to describe the performance of machine learning
models in relation to their ability to generalize from the training data to unseen data.


Overfitting happens when a machine learning model learns the training data too well, including the noise and random details. This makes the model perform poorly on new, unseen data because it memorizes the training data instead of understanding the general patterns.
For example, if we study only last week’s weather to predict tomorrow’s, the model might focus on one-time events like a sudden rainstorm, which won’t help with future predictions.

Underfitting is the opposite problem: it happens when the model is too simple to learn even the basic patterns in the data. An underfitted model performs poorly on both training and new data. To fix this we need to make the model more complex or add more features.
For example, if we use only the yearly average temperature to predict tomorrow’s weather, the model misses important details like seasonal changes, which results in bad predictions.
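
The following is an illustrative sketch (not from the notes) of underfitting versus overfitting using polynomial regression on synthetic data; the sine-curve data, noise level and polynomial degrees are assumptions chosen for demonstration. A degree-1 fit shows high bias (train and test errors both high), while a very high degree shows high variance (low train error, high test error).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy samples from a sine curve (assumed toy dataset)
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 2 * np.pi, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Degree 1 tends to underfit (high bias); a very high degree tends to overfit (high variance)
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print("degree =", degree,
          "train MSE =", round(mean_squared_error(y_train, model.predict(X_train)), 3),
          "test MSE =", round(mean_squared_error(y_test, model.predict(X_test)), 3))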

Ensemble Learning
Ensemble learning is a method where we use many small models instead of just one. Each of
these models may not be very strong on its own, but when we put their results together, we get a
better and more accurate answer. It's like asking a group of people for advice instead of just one
person—each one might be a little wrong, but together, they usually give a better answer.
Types of Ensemble Learning in Machine Learning
There are three main types of ensemble methods:
1. Bagging (Bootstrap Aggregating):
Models are trained independently on different random subsets of the training data. Their
results are then combined—usually by averaging (for regression) or voting (for
classification). This helps reduce variance and prevents overfitting.
2. Boosting:
Models are trained one after another. Each new model focuses on fixing the errors made by
the previous ones. The final prediction is a weighted combination of all models, which helps
reduce bias and improve accuracy.
3. Stacking (Stacked Generalization):
Multiple different models (often of different types) are trained, and their predictions are used
as inputs to a final model, called a meta-model. The meta-model learns how to best combine
the predictions of the base models, aiming for better performance than any individual model.
1. Bagging Algorithm
A Bagging classifier can be used for both regression and classification tasks. Here is an overview of the Bagging algorithm:
 Bootstrap Sampling: Creates ‘N’ subsets of the original training data by randomly sampling rows with replacement. This step ensures that the base models are trained on diverse subsets of the data.
 Base Model Training: For each bootstrapped sample we train a base model independently on that subset of data. These weak models are trained in parallel to increase computational efficiency and reduce training time. We can use different base learners, i.e. different ML models, to bring variety and robustness.
 Prediction Aggregation: To make a prediction on test data, the predictions of all base models are combined. For classification tasks this can be majority voting or weighted majority voting, while for regression it involves averaging the predictions.
 Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of
particular base models during the bootstrapping method. These “out-of-bag” samples can be
used to estimate the model’s performance without the need for cross-validation.
 Final Prediction: After aggregating the predictions from all the base models, Bagging
produces a final prediction for each instance.
Python code for a Bagging estimator, using the following libraries:
1. Importing Libraries and Loading Data
 BaggingClassifier: for creating an ensemble of classifiers trained on different subsets of
data.
 DecisionTreeClassifier: the base classifier used in the bagging ensemble.
 load_iris: to load the Iris dataset for classification.
 train_test_split: to split the dataset into training and testing subsets.
 accuracy_score: to evaluate the model’s prediction accuracy.


from sklearn.ensemble import BaggingClassifier


from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2. Loading and Splitting the Iris Dataset
 data = load_iris(): loads the Iris dataset, which includes features and target labels.
 X = data.data: extracts the feature matrix (input variables).
 y = data.target: extracts the target vector (class labels).
 train_test_split(...): splits the data into training (80%) and testing (20%) sets, with
random_state=42 to ensure reproducibility.

data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. Creating a Base Classifier
A decision tree is chosen as the base model. Decision trees are prone to overfitting when trained on small datasets, making them good candidates for bagging.
 base_classifier = DecisionTreeClassifier(): initializes a Decision Tree classifier, which will
serve as the base estimator in the Bagging ensemble.
base_classifier = DecisionTreeClassifier()
4. Creating and Training the Bagging Classifier
 A BaggingClassifier is created using the decision tree as the base classifier.
 n_estimators = 10 specifies that 10 decision trees will be trained on different bootstrapped
subsets of the training data.

bagging_classifier = BaggingClassifier(base_classifier, n_estimators=10, random_state=42)


bagging_classifier.fit(X_train, y_train)
5. Making Predictions and Evaluating Accuracy
 The trained bagging model predicts labels for test data.
 The accuracy of the predictions is calculated by comparing the predicted labels (y_pred) to
the actual labels (y_test).

y_pred = bagging_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0

2. Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak learners to create a strong
learner. Weak models are trained in series such that each next model tries to correct errors of the
previous model until the entire training dataset is predicted correctly. One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting). Here is an overview of the boosting algorithm:
 Initialize Model Weights: Begin with a single weak learner and assign equal weights to all
training examples.
 Train Weak Learner: Train a weak learner on this weighted dataset.
 Sequential Learning: Boosting works by training models sequentially where each model
focuses on correcting the errors of its predecessor. Boosting typically uses a single type of
weak learner like decision trees.
 Weight Adjustment: Boosting assigns weights to training datapoints. Misclassified
examples receive higher weights in the next iteration so that next models pay more attention
to them.
Python code for a Boosting estimator, using the following libraries:
1. Importing Libraries and Modules
 AdaBoostClassifier from sklearn.ensemble: for building the AdaBoost ensemble model.
 DecisionTreeClassifier from sklearn.tree: as the base weak learner for AdaBoost.
 load_iris from sklearn.datasets: to load the Iris dataset.
 train_test_split from sklearn.model_selection: to split the dataset into training and testing
sets.
 accuracy_score from sklearn.metrics: to evaluate the model’s accuracy.

from sklearn.ensemble import AdaBoostClassifier


from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2. Loading and Splitting the Dataset
 data = load_iris(): loads the Iris dataset, which includes features and target labels.
 X = data.data: extracts the feature matrix (input variables).
 y = data.target: extracts the target vector (class labels).
 train_test_split(...): splits the data into training (80%) and testing (20%) sets, with
random_state=42 to ensure reproducibility.

data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

3. Defining the Weak Learner


We are creating the base classifier as a decision tree with maximum depth 1 (a decision stump).
This simple tree will act as a weak learner for the AdaBoost algorithm, which iteratively
improves by combining many such weak learners.

base_classifier = DecisionTreeClassifier(max_depth=1)

4. Creating and Training the AdaBoost Classifier

 base_classifier: The weak learner used in boosting.
 n_estimators = 50: Number of weak learners to train sequentially.
 learning_rate = 1.0: Controls the contribution of each weak learner to the final model.
 random_state = 42: Ensures reproducibility.

adaboost_classifier = AdaBoostClassifier(
base_classifier, n_estimators=50, learning_rate=1.0, random_state=42
)
adaboost_classifier.fit(X_train, y_train)

5. Making Predictions and Calculating Accuracy


We make predictions on the test set with the trained AdaBoost model, then calculate the accuracy by comparing the true labels y_test with the predicted labels y_pred. The accuracy_score function returns the proportion of correctly predicted samples. Then, we print the accuracy value.

y_pred = adaboost_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)


print("Accuracy:", accuracy)
Output:
Accuracy: 1.0

Benefits of Ensemble Learning in Machine Learning


Ensemble learning is a versatile approach that can be applied to machine learning models for:
 Reduction in Overfitting: By aggregating the predictions of multiple models, ensembles can reduce the overfitting that individual complex models might exhibit.
 Improved Generalization: It generalizes better to unseen data by minimizing variance and
bias.
 Increased Accuracy: Combining multiple models gives higher predictive accuracy.
 Robustness to Noise: It mitigates the effect of noisy or incorrect data points by averaging
out predictions from diverse models.
 Flexibility: It can work with diverse models including decision trees, neural networks and
support vector machines making them highly adaptable.
 Bias-Variance Tradeoff: Techniques like bagging reduce variance, while boosting reduces
bias leading to better overall performance.
There are various ensemble learning techniques we can use, each with its own pros and cons.
Ensemble Learning Techniques
Technique Category Description
Random Forest Bagging Constructs multiple decision trees on bootstrapped subsets of the data and aggregates their predictions for the final output, reducing overfitting and variance.
Random Subspace Method Bagging Trains models on random subsets of the input features to enhance diversity and improve generalization while reducing overfitting.
Gradient Boosting Machines (GBM) Boosting Sequentially builds decision trees, with each tree correcting the errors of the previous ones, enhancing predictive accuracy iteratively.
Extreme Gradient Boosting (XGBoost) Boosting Adds optimizations such as tree pruning, regularization and parallel processing for robust and efficient predictive models.
AdaBoost (Adaptive Boosting) Boosting Focuses on challenging examples by assigning weights to data points and combines weak classifiers with weighted voting for the final prediction.
CatBoost Boosting Handles categorical features natively without extensive preprocessing, with high predictive accuracy and automatic overfitting handling.

Bagging
Bagging (Bootstrap Aggregating) is an ensemble learning technique in machine learning that
improves the accuracy and stability of models by reducing variance and avoiding overfitting,
especially in high-variance models like decision trees.

Definition:
Bagging stands for Bootstrap Aggregating. It involves:
 Generating multiple versions of a training dataset using bootstrap sampling (random
sampling with replacement).
 Training separate models (often the same type, like decision trees) on each of these
datasets.
 Aggregating their predictions (averaging for regression, majority vote for
classification).
Workflow of Bagging Algorithm (Step-by-Step):

1. Bootstrap Sampling: Create multiple datasets (say, 𝑘 datasets) from the original training
data using sampling with replacement.
2. Model Training: Train a base learner (e.g., decision tree) on each dataset independently.
3. Aggregation:
o Classification: Use majority voting to decide the final output.
o Regression: Use the average of all predictions as the final output (see the sketch below).
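
To illustrate the regression case just described, here is a minimal sketch using scikit-learn's BaggingRegressor; the synthetic dataset and the number of estimators are illustrative assumptions, not part of the notes.

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumed for demonstration)
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 20 decision trees, each trained on a bootstrap sample; their outputs are averaged
bagging_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20,
                               oob_score=True, random_state=42)
bagging_reg.fit(X_train, y_train)

print("OOB R^2:", bagging_reg.oob_score_)      # out-of-bag estimate, no separate validation set needed
print("Test R^2:", bagging_reg.score(X_test, y_test))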

Uses of Bagging:
 Reduces overfitting by averaging out predictions.
 Decreases model variance (good for unstable models).
 Improves generalization.

Common Algorithms That Use Bagging:


 Random Forest is a prime example: it’s a bagging method using decision trees with
added randomness in feature selection.


Advantages of Bagging:
 Reduces variance, thus improving model stability.
 Works well with high-variance, low-bias models.
 Easy to implement and parallelize.

Limitations:
 Doesn’t help much if the base model is already low in variance (like linear regression).
 May not reduce bias.
 Can be computationally expensive.

Boosting
Boosting is an ensemble learning method that combines multiple weak learners to form a
strong learner. It builds models sequentially, where each model learns from the errors of the
previous ones, improving overall performance.

Definition:
Boosting refers to a family of algorithms that convert weak models (like shallow decision
trees) into a strong model by focusing more on misclassified data points during each iteration.

Working Steps of Boosting:

1. Initialize the model by training a weak learner on the original dataset.


2. Compute Errors: Measure the performance of the model.
3. Update Weights: Increase weights of incorrectly predicted samples.
4. Train Next Learner: The next model focuses more on the harder examples.
5. Combine Models: Final prediction is a weighted sum of all weak learners (see the sketch below).
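
To make the sequential error-correction concrete, here is a small illustrative sketch (an assumption, not part of the notes) using scikit-learn's GradientBoostingRegressor, whose staged_predict method exposes the prediction after each added learner:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 shallow trees added one after another, each fitted to the remaining error
gbr = GradientBoostingRegressor(n_estimators=100, max_depth=2, learning_rate=0.1,
                                random_state=42)
gbr.fit(X_train, y_train)

# Test error after 1, 10, 50 and 100 sequential learners
errors = [mean_squared_error(y_test, pred) for pred in gbr.staged_predict(X_test)]
for n in [1, 10, 50, 100]:
    print("learners:", n, "test MSE:", round(errors[n - 1], 1))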

Key Concepts:
 Sequential training
 Focus on difficult samples
 Reduces both bias and variance
 Final prediction is based on the weighted majority vote (classification) or weighted
average (regression)

Popular Boosting Algorithms:


Algorithm Key Feature
AdaBoost Adjusts weights of samples
Gradient Boosting Optimizes loss function via gradients
XGBoost Optimized, fast version of gradient boosting
LightGBM Faster training, uses histogram-based techniques
CatBoost Handles categorical features efficiently

Advantages of Boosting:
 High accuracy
 Handles both bias and variance
 Performs well on imbalanced data

Limitations:
 Prone to overfitting if not regularized
 Sequential → difficult to parallelize
 Slower than bagging

Random Forest Algorithm
 Random Forest is a supervised ensemble learning algorithm.
 It is used for both classification and regression tasks.
 It builds multiple decision trees and merges them together to get a more accurate and
stable prediction.
A Random Forest is a collection (ensemble) of Decision Trees where:
 Each tree is trained on a different subset of the data using bootstrap sampling (bagging).
 At each node, only a random subset of features is considered for splitting.
 Final output is based on majority voting (classification) or averaging (regression).
Workflow of Random Forest (Step-by-Step)

Step 1: Bootstrap Sampling


 Create N different subsets (with replacement) from the training data.
 Each subset is used to train one decision tree.
Step 2: Build Decision Trees
 For each tree:
o Choose a random subset of features at each split (feature bagging).
o Grow trees fully without pruning.
Step 3: Aggregate Results
 For Classification: Each tree votes → final class = majority vote.
 For Regression: Average the outputs from all trees.
Key Terms
Term Description
Bootstrap Sampling Sampling with replacement from the dataset
Feature Bagging Randomly selecting a subset of features at each split
Ensemble Learning Combining multiple models for better performance
Majority Voting Used in classification
Averaging Used in regression

Advantages
 Reduces overfitting compared to individual decision trees.
 Works well with both categorical and numerical features.
 Can handle missing values and maintain accuracy.
 Robust to outliers and noise.
 Can give feature importance scores (a short demonstration follows the code example below).
Disadvantages
 Computationally intensive (training many trees).
 Less interpretable than a single decision tree.
 Slower in real-time predictions (due to ensemble size).
Applications of Random Forest:
 Medical diagnosis (e.g., cancer prediction)
 Financial risk analysis
 Credit scoring
 Image classification
 Fraud detection

Python Code Example


from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Build model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))

Parameters of Random Forest (Sklearn)


Parameter Description
n_estimators Number of trees
max_features Number of features to consider at each split
max_depth Maximum depth of the tree
min_samples_split Minimum samples required to split an internal node
bootstrap Whether bootstrap samples are used

Comparison with Other Algorithms


Feature Decision Tree Bagging Random Forest Boosting
Overfitting Risk High Low Low Medium
Interpretability High Low Medium Low
Accuracy Medium High High Very High
Training Speed Fast Moderate Slow Slow

AdaBoost Algorithm
AdaBoost (Adaptive Boosting) is a Boosting ensemble technique that combines multiple weak
learners (usually decision stumps — trees with one split) to form a strong classifier.
 It focuses on instances that were previously misclassified.
 Learners are added sequentially, and each one tries to correct the mistakes of the
previous ones.
Key Idea:
Increase the weights of incorrectly classified data points so that subsequent models focus more
on those “hard” cases.
Workflow of AdaBoost:

Step-by-Step:
1. Initialize Weights:
o Assign equal weights to all training samples.
2. Train a Weak Learner:
o Train a classifier (e.g., a decision stump) on the weighted data.
3. Calculate Error:
o Compute the weighted error of the learner:
$\epsilon_t = \dfrac{\sum_{i=1}^{N} w_i \, I\big(h_t(x_i) \neq y_i\big)}{\sum_{i=1}^{N} w_i}$
where $I$ is an indicator function (1 if the prediction is wrong, 0 otherwise).
4. Compute Learner's Weight:
o A classifier with lower error gets higher importance:
$\alpha_t = \tfrac{1}{2} \ln\!\left(\dfrac{1 - \epsilon_t}{\epsilon_t}\right)$
5. Update Weights of Samples:
o Increase weights of misclassified samples and decrease weights of correctly classified samples:
$w_i \leftarrow w_i \, e^{-\alpha_t \, y_i \, h_t(x_i)}$ (with labels $y_i \in \{-1, +1\}$, the weight grows when $h_t(x_i) \neq y_i$)
o Normalize the weights so that they sum to 1.
6. Repeat:
o Train next learner on updated weights.
o Repeat steps for T rounds (number of estimators).
7. Final Prediction:
o Combine all classifiers using their weights:
$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t \, h_t(x)\right)$

Key Notations: $w_i$ is the weight of training sample $i$, $\epsilon_t$ the weighted error of learner $t$, $\alpha_t$ the importance of learner $t$, $h_t(x)$ its prediction, and $T$ the number of boosting rounds.

AdaBoost Code Example (Python)


from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Base weak learner: Decision stump


base = DecisionTreeClassifier(max_depth=1)

# AdaBoost model (the first positional argument is the base estimator;
# it is named 'estimator' in recent scikit-learn and 'base_estimator' in older versions)
model = AdaBoostClassifier(base, n_estimators=50, learning_rate=1.0)
model.fit(X_train, y_train)

# Accuracy
print("Accuracy:", model.score(X_test, y_test))
Advantages of AdaBoost
Feature Benefit
Improves weak learners Combines simple models to perform well
Versatile Works for binary and multi-class classification
Feature importance Can give feature significance
Minimal pre-processing needed Tree-based weak learners work directly on unscaled features

Disadvantages
 Sensitive to noisy data and outliers
 Not suitable for large datasets with many irrelevant features
 Harder to interpret compared to individual trees

Applications
 Face detection (e.g., Viola-Jones algorithm)
 Fraud detection
 Text classification
 Bioinformatics

Comparison: AdaBoost vs Bagging vs Random Forest


Feature AdaBoost Bagging Random Forest
Base Learners Sequential Parallel Parallel
Focus Hard samples Variance reduction Random features & samples
Output Weighted vote Majority vote Majority vote

Gradient Boosting Algorithm
Gradient Boosting is an ensemble learning technique that builds a strong predictive model by
combining multiple weak learners (typically decision trees), trained sequentially to correct the
errors made by previous models.
It uses the idea of minimizing a loss function by applying gradient descent.
Key Idea:
Each new learner is trained to predict the residuals (errors) of the previous learners, thereby
improving the model step by step.
Workflow of Gradient Boosting (Step-by-Step):

Step 1: Initialize the Model
 Use a constant value that minimizes the loss function.
 For regression with MSE this is simply the mean of the targets:
$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma) = \bar{y}$

Step 2: Iterate for T steps (number of trees)
For each step $m = 1, \dots, T$:
 Compute the negative gradient of the loss with respect to the current predictions:
$r_{im} = -\left[\dfrac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}}$
These are the pseudo-residuals.
 Fit a weak learner $h_m(x)$ (a shallow decision tree) to the pseudo-residuals.
 Update the model with learning rate $\eta$: $F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$


Key Terms
Term Description
Weak Learner Typically a decision tree (shallow)
Loss Function Measures error (MSE, Log Loss, etc.)
Learning Rate (η) Shrinks the contribution of each tree
Residuals Errors the model tries to fix
Additive Model Combines learners in a stage-wise manner
Loss Functions
 Regression: squared error, $L\big(y, F(x)\big) = \tfrac{1}{2}\big(y - F(x)\big)^2$
 Classification: log loss, $L(y, p) = -\big[\,y \log p + (1 - y) \log(1 - p)\,\big]$
Gradient Boosting Code in Python


from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Model
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb_model.fit(X_train, y_train)

# Accuracy
print("Accuracy:", gb_model.score(X_test, y_test))
Advantages of Gradient Boosting
 High prediction accuracy
 Handles both regression and classification

 Works with many types of loss functions
 Feature importance ranking
Disadvantages
 Can overfit if not tuned properly
 Training is slower due to sequential nature
 Requires careful parameter tuning (learning rate, depth, etc.)
Comparison: AdaBoost and Gradient Boosting
Feature AdaBoost Gradient Boosting
Loss Optimization Based on exponential loss Any differentiable loss
Weighting Adjusts sample weights Fits to residuals
Robustness to Outliers Lower Higher
Tuning Needed Less More (learning rate, depth)

XGBoost Algorithm
XGBoost (Extreme Gradient Boosting) is an advanced implementation of the Gradient
Boosting algorithm. It is designed to be highly efficient, flexible, and portable, with state-of-
the-art performance.

XGBoost = Gradient Boosting + Regularization + Speed + Flexibility

It is robust, scalable, and tunable, and often outperforms other models in structured/tabular data
tasks.

Why use XGBoost:


 Fast and parallelizable
 Handles missing values
 Includes regularization (to prevent overfitting)
 Excellent performance in Kaggle competitions
 Scales well to large datasets
Core Idea
Like Gradient Boosting, XGBoost builds trees sequentially, where each new tree corrects the
errors of the previous ensemble by minimizing a loss function using gradient descent.
XGBoost enhances this process with:
 Second-order optimization (using both gradient and hessian)
 Regularization
 Tree pruning
 Cache-aware computing

Workflow of XGBoost (Step-by-Step)

Step 1: Objective Function


XGBoost minimizes a regularized objective function:
$\text{Obj} = \sum_{i} l\big(y_i, \hat{y}_i\big) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$
where $T$ is the number of leaves in a tree and $w_j$ are the leaf weights.
Step 2: Second-Order Taylor Approximation
The loss is approximated with gradients ($g_i$) and Hessians ($h_i$) of the loss with respect to the previous prediction:
$\text{Obj}^{(t)} \approx \sum_{i} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t)$

Step 3: Structure Score for Splits
For a node with instance set $I$ split into left and right sets $I_L$ and $I_R$ (with $G = \sum_{i \in I} g_i$ and $H = \sum_{i \in I} h_i$), the gain of the split is:
$\text{Gain} = \tfrac{1}{2} \left[ \dfrac{G_L^2}{H_L + \lambda} + \dfrac{G_R^2}{H_R + \lambda} - \dfrac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$
Choose the split with the highest gain.


Step 4: Tree Building
 Add trees greedily to minimize loss.
 Trees are built level-wise (depth by depth), not leaf-wise like LightGBM.
 Stop growing when score improvement < threshold.
Step 5: Prediction Update
Update the prediction after adding tree $t$:
$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i)$
 η: Learning rate
Advantages of XGBoost
Advantage Description
Speed Parallel and fast due to efficient CPU use
Accuracy Often better than other ML models
Regularization Controls overfitting via λ and γ
Handles Missing Values Smart split-finding for missing data
Built-in Cross-Validation Available in API

Disadvantages
 Complex to tune (many hyperparameters)
 Can overfit on small data if not regularized
 Not ideal for image or sequential data (use CNNs or RNNs instead)

XGBoost Code Example (Python)


import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

# Predict and evaluate


y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Common Parameters
Parameter Meaning
n_estimators Number of boosting rounds
max_depth Maximum tree depth
learning_rate Shrinks contribution of each tree
subsample Fraction of training data per tree
colsample_bytree Feature sampling per tree
lambda L2 regularization
gamma Minimum loss reduction to make a split

Stacking

Stacking (Stacked Generalization) is an ensemble learning technique that combines multiple


different models (called base learners) and trains a meta-model to make the final prediction.

Unlike bagging or boosting (which use the same type of learners), stacking uses diverse models
(e.g., decision trees, SVMs, neural networks).

Workflow of Stacking:

Step-by-Step Process:
1. Train Base Learners
o Train several different machine learning models on the training dataset.
o These models can be of different types (e.g., logistic regression, random forest,
SVM).
2. Generate Base Predictions
o Each base learner makes predictions on:
 Either the validation set (during cross-validation),
 Or directly on the test set.
3. Train Meta-Learner
o A new model (called a meta-model or blender) is trained using the predictions
of base models as features.
o Its goal is to learn how to best combine the outputs of base models.
4. Final Prediction
o The meta-model takes the predictions from base learners and makes the final
decision.

Illustration (Simple Example)


Assume you have 3 base learners:
 Model 1: Logistic Regression
 Model 2: Decision Tree
 Model 3: K-Nearest Neighbors
Let the predictions from these models for a data point be:
Model 1: 0.6
Model 2: 0.8
Model 3: 0.7
These become the features for the meta-model, which might output a final prediction of 0.75.

Use of Stacking:
 Combines strengths of multiple models
 Can reduce generalization error
 Works well when base models are diverse and not highly correlated


Mathematically:
If $h_1, h_2, \dots, h_M$ are the trained base learners and $g$ is the meta-model, the stacked prediction for an input $x$ is $\hat{y} = g\big(h_1(x), h_2(x), \dots, h_M(x)\big)$, where $g$ is fit on the base-model predictions paired with the true targets.

Example in Python (with scikit-learn)


from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Define base learners


base_learners = [
('dt', DecisionTreeClassifier()),
('svc', SVC(probability=True))
]

# Define meta-learner
meta_model = LogisticRegression()

# Build stacking model


stacked_model = StackingClassifier(estimators=base_learners, final_estimator=meta_model)
stacked_model.fit(X_train, y_train)

# Predict and evaluate


y_pred = stacked_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Advantages of Stacking
Benefit Description
Combines model strengths Leverages diversity to improve performance
Reduces generalization error Less likely to overfit than a single model
Flexible Works with any combination of models
Disadvantages
Limitation Description
More complex Requires training multiple models
Risk of overfitting If meta-model is too complex or base models are similar
Slower to train Compared to single-model methods

Blending in Machine Learning

Blending is an ensemble technique used to combine the predictions of multiple machine learning
models using a validation dataset and a meta-model (usually a simple one like logistic
regression or linear regression).
It’s very similar to stacking, but with a few key differences in how data is split and how the
meta-model is trained.

How Does Blending Work?


Steps:
1. Split the dataset into 3 parts:
o Training set: For training base models
o Validation set: For generating predictions from base models
o Test set: For final evaluation
2. Train Base Models:
o Use the training set to train multiple models (e.g., SVM, Random Forest,
XGBoost)
3. Predict on Validation Set:
o Use base models to make predictions on the validation set
o These predictions become input features for the meta-model
4. Train Meta-Model:
o Train a simple model (e.g., logistic regression) using:
 Inputs: Predictions of base models on the validation set
 Targets: True values from the validation set
5. Final Prediction:
o Use base models to predict on the test set
o Meta-model uses these to make final predictions

How It Differs from Stacking


Feature Blending Stacking
Data Split Train/Validation/Test split Usually uses cross-validation
Meta-model trained on Validation-set predictions Out-of-fold predictions from cross-validation
Simplicity Easier to implement More robust but complex
Risk of Overfitting Higher (due to smaller validation set) Lower (thanks to cross-validation)

Why Use Blending?


 Simpler implementation
 Useful when you're in a time crunch (e.g., in competitions)
 Easy to apply when you want to combine different models quickly

Blending Illustration Example


Advantages of Blending
Benefit Description
Simple to implement No need for complex cross-validation setups
Fast to train Meta-model trained on small dataset
Good for competitions Useful in last-minute model improvement

Disadvantages
Drawback Description
High risk of overfitting Meta-model trained on small validation set
Not as robust Compared to stacking with cross-validation
Wastes data Validation data not used in base model training

Small Python Example (Pseudo-code Style)


import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Step 1: Split data into training, validation and test sets
# (the breast cancer dataset is used here purely for illustration)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Step 2: Train base models on the training set
model1 = LogisticRegression(max_iter=5000).fit(X_train, y_train)
model2 = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Step 3: Predict on the validation set
pred1 = model1.predict_proba(X_valid)[:, 1]
pred2 = model2.predict_proba(X_valid)[:, 1]

# Step 4: Stack the validation predictions and train the meta-model
meta_X = np.column_stack((pred1, pred2))
meta_model = LogisticRegression().fit(meta_X, y_valid)

# Step 5: Base models predict on the test set; the meta-model combines them
final_pred = meta_model.predict(np.column_stack((
    model1.predict_proba(X_test)[:, 1],
    model2.predict_proba(X_test)[:, 1]
)))
print("Blended test accuracy:", (final_pred == y_test).mean())

Mathematical Formulation of Blending

Let $h_1, \dots, h_M$ be base models trained on the training set. For each validation sample $x_i$, form the prediction vector $z_i = \big(h_1(x_i), \dots, h_M(x_i)\big)$. The meta-model $g$ is fit on the pairs $(z_i, y_i)$ from the validation set, and the final prediction for a new input $x$ is $\hat{y} = g\big(h_1(x), \dots, h_M(x)\big)$.

Example with 3 Models

Suppose three base models output probabilities $p_1 = 0.6$, $p_2 = 0.8$ and $p_3 = 0.7$ for a validation sample. A linear meta-model with learned weights $w_1, w_2, w_3$ and bias $b$ blends them as $\hat{y} = w_1 p_1 + w_2 p_2 + w_3 p_3 + b$.

Regularization Methods in Machine Learning

Regularization is a technique used to reduce overfitting by adding a penalty term to the loss
function of a machine learning model. This discourages the model from becoming too complex
or sensitive to noise in the training data.

Need to Use Regularization:


 Prevents overfitting
 Improves generalization to unseen data
 Controls the complexity of the model

Benefits of Regularization
The main benefits of regularization are as follows:
1. Prevents Overfitting: Helps models focus on underlying patterns instead of memorizing noise in the training data.
2. Improves Interpretability: L1 (Lasso) regularization simplifies models by reducing less important feature coefficients to zero.
3. Enhances Performance: Prevents excessive weighting of outliers or irrelevant features, which helps improve overall model accuracy.
4. Stabilizes Models: Reduces sensitivity to minor data changes, ensuring consistency across different data subsets.
5. Prevents Complexity: Keeps the model from becoming too complex, which is important with limited or noisy data.
6. Handles Multicollinearity: Reduces the magnitudes of correlated coefficients, improving model stability.
7. Allows Fine-Tuning: Hyperparameters like alpha and lambda control regularization strength, helping balance bias and variance.
8. Promotes Consistency: Ensures reliable performance across different datasets, reducing the risk of large performance shifts.

Common Regularization Methods


1. L1 Regularization (Lasso)
 Adds the absolute values of the coefficients to the loss function: $\text{Loss} = \text{MSE} + \lambda \sum_j |w_j|$
 Encourages sparsity (sets some weights to zero), leading to feature selection. (A short scikit-learn sketch of methods 1-3 follows this list.)

2. L2 Regularization (Ridge)
 Adds the squares of the coefficients to the loss function: $\text{Loss} = \text{MSE} + \lambda \sum_j w_j^2$
 Keeps all features but shrinks their weights.

3. Elastic Net Regularization
 Combines both L1 and L2 penalties: $\text{Loss} = \text{MSE} + \lambda_1 \sum_j |w_j| + \lambda_2 \sum_j w_j^2$
 Useful when there are many correlated features.

4. Dropout (in Neural Networks)


 Randomly sets a fraction of neurons to 0 during training.
 Reduces co-adaptation of neurons.
Intuition:
During each training iteration:
 Drop units with a probability p
 Forces the network to not rely too much on specific paths
5. Early Stopping
 Stop training when the model’s performance on the validation set starts to degrade.
 Prevents overfitting without modifying the loss function.
6. Data Augmentation & Noise Injection
 Add noise to input data or intermediate layers to make the model more robust.
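
The following is a minimal scikit-learn sketch of the first three methods (L1, L2, Elastic Net) on a synthetic regression problem, as mentioned above; the dataset and the alpha values are illustrative assumptions, not prescribed by the notes.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split

# Synthetic data: only 10 of the 50 features are actually informative (assumed toy problem)
X, y = make_regression(n_samples=500, n_features=50, n_informative=10,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "L1 (Lasso)": Lasso(alpha=1.0),                        # drives some coefficients exactly to zero
    "L2 (Ridge)": Ridge(alpha=1.0),                        # shrinks all coefficients, none become zero
    "Elastic Net": ElasticNet(alpha=1.0, l1_ratio=0.5),    # mixes the L1 and L2 penalties
}

for name, reg in models.items():
    reg.fit(X_train, y_train)
    zero_coefs = int((reg.coef_ == 0).sum())
    print(name, "- test R^2:", round(reg.score(X_test, y_test), 3),
          "- zero coefficients:", zero_coefs)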

Cross-Validation Strategies

Cross-validation (CV) is a statistical method used to estimate the performance of machine


learning models. It helps detect overfitting and ensures that the model generalizes well to unseen
data.

Use of Cross-Validation:
 To assess model stability and robustness
 To detect overfitting or underfitting
 To choose the best model hyperparameters

Common Cross-Validation Strategies


1. Hold-Out Validation
 Split dataset into:
o Training set: to train the model
o Test set: to evaluate the model
Limitation:
 High variance depending on how data is split
2. K-Fold Cross-Validation
 Divide data into K equal parts (folds)
 Train the model on K−1 folds, validate on the remaining fold
 Repeat K times, each fold used once as validation
 Final performance = mean of the K results (a code sketch follows the strategies below)
Example:
For K=5
Fold Train On Validate On
1 2,3,4,5 1
2 1,3,4,5 2
3 1,2,4,5 3
4 1,2,3,5 4
5 1,2,3,4 5
3. Stratified K-Fold Cross-Validation
 Like K-Fold but preserves the percentage of samples for each class in every fold.
 Useful for imbalanced datasets.
4. Leave-One-Out Cross-Validation (LOOCV)
 Special case of K-Fold with K=n (number of samples)
 Train on all data except one sample, test on that one
 Repeat for all samples
Limitation:
 Computationally expensive for large datasets
5. Repeated K-Fold Cross-Validation
 Repeats K-Fold CV multiple times with different random splits
 Reduces variance in performance estimation

6. Group K-Fold Cross-Validation
 Ensures that the same group (e.g., from the same patient or user) does not appear in
both training and validation sets.
 Ideal for grouped or clustered data
7. Time Series Split (Rolling Forecast Origin)
 For time series data where order matters
 Avoids data leakage by ensuring that future data is not used to predict the past
Example:
Fold Train On Validate On
1 1, 2 3
2 1, 2, 3 4
3 1, 2, 3, 4 5
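
The code sketch referenced above: a brief scikit-learn illustration of a few of these strategies; the Iris dataset and the Random Forest estimator are assumptions chosen for demonstration.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (KFold, StratifiedKFold,
                                     TimeSeriesSplit, cross_val_score)

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# K-Fold: 5 folds, shuffled once before splitting
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
print("K-Fold mean accuracy:", cross_val_score(model, X, y, cv=kfold).mean())

# Stratified K-Fold: preserves class proportions in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Stratified K-Fold mean accuracy:", cross_val_score(model, X, y, cv=skf).mean())

# Time Series Split: the training window always precedes the validation fold
tss = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tss.split(X), start=1):
    print("Fold", fold, "- train size:", len(train_idx), "- validation size:", len(val_idx))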

Advantages and Disadvantages of Cross-Validation Strategies


Strategy Best For Advantages Disadvantages
Hold-Out Quick checks Simple and fast High variance
K-Fold General-purpose Balanced, less bias Can be slow for large K
Stratified K-Fold Imbalanced classification Maintains class distribution More complex
LOOCV Small datasets Uses almost all data to train Very slow for large datasets
Repeated K-Fold Stability checking Reduces random bias Slower than standard K-Fold
Group K-Fold Grouped data (e.g., patients) Prevents data leakage Requires group identifiers
Time Series Split Time-based data Respects time order Needs careful setup
