AML Unit 1
R23 III B.Tech I Semester
Department of Artificial Intelligence and Machine Learning
Subject: Advanced Machine Learning 23A03351T
Unit-I
Bias and Variance:
Bias refers to the error that arises when we fit a statistical model to real-world data that no mathematical model describes perfectly. If we use too simplistic a model, we are likely to face high bias (underfitting): the model is unable to learn the patterns in the data at hand and performs poorly.
Variance is the error that arises when the model makes predictions on data it has not seen before. High variance (overfitting) occurs when the model learns the noise present in the training data.
Finding a proper balance between the two, known as the Bias-Variance Tradeoff, helps us design an accurate model.
Bias-Variance Tradeoff
The Bias-Variance Tradeoff refers to the balance between bias and variance which affect
predictive model performance. Finding the right tradeoff is important for creating models that
generalize well to new data.
The bias-variance tradeoff shows the inverse relationship between bias and variance. When
one decreases, the other tends to increase and vice versa.
Finding the right balance is important. An overly simple model with high bias won't capture
the underlying patterns while an overly complex model with high variance will fit the noise
in the data.
Overfitting happens when a machine learning model learns the training data too well, including the noise and random details. The model then performs poorly on new, unseen data because it memorizes the training data instead of learning the general patterns.
For example, if we study only last week’s weather to predict tomorrow’s, the model might focus on one-time events like a sudden rainstorm, which won’t help with future predictions.
Underfitting is the opposite problem: the model is too simple to learn even the basic patterns in the data. An underfitted model performs poorly on both training and new data. To fix this we need to make the model more complex or add more features.
For example, if we use only the average temperature of the year to predict tomorrow’s weather, the model misses important details like seasonal changes, which results in bad predictions.
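A small illustrative sketch (not from the notes; it uses an arbitrary synthetic dataset) makes both failure modes concrete: a very shallow decision tree underfits, with poor training and test scores, while a fully grown tree overfits, scoring near-perfectly on training data but worse on test data.
# Illustrative sketch: underfitting vs overfitting with decision trees of different depths.
# The dataset and depth values are arbitrary choices for demonstration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)   # true signal plus noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in [1, 4, None]:   # too simple, balanced, fully grown
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train R2={tree.score(X_train, y_train):.2f}, "
          f"test R2={tree.score(X_test, y_test):.2f}")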
Ensemble Learning
Ensemble learning is a method where we use many small models instead of just one. Each of
these models may not be very strong on its own, but when we put their results together, we get a
better and more accurate answer. It's like asking a group of people for advice instead of just one
person—each one might be a little wrong, but together, they usually give a better answer.
Types of Ensemble Learning in Machine Learning
There are three main types of ensemble methods:
1. Bagging (Bootstrap Aggregating):
Models are trained independently on different random subsets of the training data. Their
results are then combined—usually by averaging (for regression) or voting (for
classification). This helps reduce variance and prevents overfitting.
2. Boosting:
Models are trained one after another. Each new model focuses on fixing the errors made by
the previous ones. The final prediction is a weighted combination of all models, which helps
reduce bias and improve accuracy.
3. Stacking (Stacked Generalization):
Multiple different models (often of different types) are trained, and their predictions are used
as inputs to a final model, called a meta-model. The meta-model learns how to best combine
the predictions of the base models, aiming for better performance than any individual model.
1. Bagging Algorithm
A Bagging ensemble can be used for both regression and classification tasks. Here is an overview of the Bagging algorithm:
Bootstrap Sampling: Creates ‘N’ subsets of the original training data by randomly sampling rows with replacement. This step ensures that the base models are trained on diverse subsets of the data.
Base Model Training: For each bootstrapped sample we train a base model independently on
that subset of data. These weak models are trained in parallel to increase computational
efficiency and reduce time consumption. We can use different base learners i.e. different ML
models as base learners to bring variety and robustness.
Prediction Aggregation: To make a prediction on testing data combine the predictions of all
base models. For classification tasks it can include majority voting or weighted majority
while for regression it involves averaging the predictions.
Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of
particular base models during the bootstrapping method. These “out-of-bag” samples can be
used to estimate the model’s performance without the need for cross-validation.
Final Prediction: After aggregating the predictions from all the base models, Bagging
produces a final prediction for each instance.
Python code for a Bagging estimator using scikit-learn:
1. Importing Libraries
BaggingClassifier: for creating an ensemble of classifiers trained on different subsets of
data.
DecisionTreeClassifier: the base classifier used in the bagging ensemble.
load_iris: to load the Iris dataset for classification.
train_test_split: to split the dataset into training and testing subsets.
accuracy_score: to evaluate the model’s prediction accuracy.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2. Loading the Dataset
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. Creating a Base Classifier
A decision tree is chosen as the base model. Decision trees are prone to overfitting when trained on small datasets, which makes them good candidates for bagging.
base_classifier = DecisionTreeClassifier(): initializes a Decision Tree classifier, which will
serve as the base estimator in the Bagging ensemble.
base_classifier = DecisionTreeClassifier()
4. Creating and Training the Bagging Classifier
A BaggingClassifier is created using the decision tree as the base classifier.
n_estimators = 10 specifies that 10 decision trees will be trained on different bootstrapped
subsets of the training data.
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=10, random_state=42)
bagging_classifier.fit(X_train, y_train)
y_pred = bagging_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
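The out-of-bag evaluation described in the algorithm overview can be obtained from the same BaggingClassifier by enabling its oob_score option; a minimal self-contained sketch:
# Minimal sketch: out-of-bag (OOB) accuracy estimate with oob_score=True.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
oob_bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                oob_score=True, random_state=42)
oob_bagging.fit(X, y)
print("OOB accuracy estimate:", oob_bagging.oob_score_)  # estimated without a separate test set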
2. Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak learners to create a strong learner. Weak models are trained in sequence, with each new model trying to correct the errors made by the previous ones, gradually reducing the overall error on the training data. One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting). Here is an overview of the Boosting algorithm:
Initialize Model Weights: Begin with a single weak learner and assign equal weights to all
training examples.
Train Weak Learner: Train a weak learner on the weighted dataset.
Sequential Learning: Boosting works by training models sequentially where each model
focuses on correcting the errors of its predecessor. Boosting typically uses a single type of
weak learner like decision trees.
Weight Adjustment: Boosting assigns weights to training datapoints. Misclassified
examples receive higher weights in the next iteration so that next models pay more attention
to them.
Python code for a Boosting estimator using scikit-learn:
1. Importing Libraries and Modules
AdaBoostClassifier from sklearn.ensemble: for building the AdaBoost ensemble model.
DecisionTreeClassifier from sklearn.tree: as the base weak learner for AdaBoost.
load_iris from sklearn.datasets: to load the Iris dataset.
train_test_split from sklearn.model_selection: to split the dataset into training and testing
sets.
accuracy_score from sklearn.metrics: to evaluate the model’s accuracy.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
base_classifier = DecisionTreeClassifier(max_depth=1)
base_classifier: The weak learner used in boosting.
n_estimators = 50: Number of weak learners to train sequentially.
learning_rate = 1.0: Controls the contribution of each weak learner to the final model.
random_state = 42: Ensures reproducibility.
adaboost_classifier = AdaBoostClassifier(
base_classifier, n_estimators=50, learning_rate=1.0, random_state=42
)
adaboost_classifier.fit(X_train, y_train)
y_pred = adaboost_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Technique Category Description
Bagging Parallel ensemble Trains base models independently on bootstrap samples; predictions are averaged or voted, which mainly reduces variance
Boosting Sequential ensemble Trains base models one after another, each correcting the errors of the previous ones, which mainly reduces bias
Stacking Meta-learning ensemble Trains a meta-model to combine the predictions of diverse base models
Bagging
Bagging (Bootstrap Aggregating) is an ensemble learning technique in machine learning that
improves the accuracy and stability of models by reducing variance and avoiding overfitting,
especially in high-variance models like decision trees.
Definition:
Bagging stands for Bootstrap Aggregating. It involves:
Generating multiple versions of a training dataset using bootstrap sampling (random
sampling with replacement).
Training separate models (often the same type, like decision trees) on each of these
datasets.
Aggregating their predictions (averaging for regression, majority vote for
classification).
Workflow of Bagging Algorithm (Step-by-Step):
1. Bootstrap Sampling: Create multiple datasets (say, 𝑘 datasets) from the original training
data using sampling with replacement.
2. Model Training: Train a base learner (e.g., decision tree) on each dataset independently.
3. Aggregation:
o Classification: Use majority voting to decide the final output.
o Regression: Use averaging of all predictions to give the final output.
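For regression, step 3 averages the base-learner outputs. A minimal sketch with scikit-learn's BaggingRegressor on an arbitrary synthetic dataset shows the variance reduction compared with a single tree:
# Minimal sketch: bagging for regression, where predictions are averaged.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

single_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
bag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=42)
bag_reg.fit(X_train, y_train)

print("Single tree R2:", round(single_tree.score(X_test, y_test), 3))
print("Bagged trees R2:", round(bag_reg.score(X_test, y_test), 3))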
Uses of Bagging:
Reduces overfitting by averaging out predictions.
Decreases model variance (good for unstable models).
Improves generalization.
Advantages of Bagging:
Reduces variance, thus improving model stability.
Works well with high-variance, low-bias models.
Easy to implement and parallelize.
Limitations:
Doesn’t help much if the base model is already low in variance (like linear regression).
May not reduce bias.
Can be computationally expensive.
Boosting
Boosting is an ensemble learning method that combines multiple weak learners to form a
strong learner. It builds models sequentially, where each model learns from the errors of the
previous ones, improving overall performance.
Definition:
Boosting refers to a family of algorithms that convert weak models (like shallow decision
trees) into a strong model by focusing more on misclassified data points during each iteration.
Key Concepts:
Sequential training
Focus on difficult samples
Reduces both bias and variance
Final prediction is based on the weighted majority vote (classification) or weighted
average (regression)
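The idea that each model learns from the errors of the previous ones can be illustrated with a tiny hand-rolled sketch (a simplified boosting loop for regression, not a production algorithm): each new tree is fit to the residuals of the current ensemble and its predictions are added in.
# Illustrative sketch of the boosting idea: fit each new tree to the residuals
# of the current ensemble (real boosters also use learning rates, loss gradients
# and regularization; the dataset here is an arbitrary synthetic one).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

prediction = np.zeros_like(y_train, dtype=float)   # start from a zero model
test_prediction = np.zeros(len(y_test))
for _ in range(20):                                # 20 boosting rounds
    residuals = y_train - prediction               # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_train, residuals)
    prediction += stump.predict(X_train)           # add the new weak learner
    test_prediction += stump.predict(X_test)

rmse = np.sqrt(np.mean((y_test - test_prediction) ** 2))
print("Test RMSE after 20 rounds:", round(rmse, 2))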
Popular boosting algorithms:
Algorithm Key Feature
AdaBoost Re-weights misclassified samples at each iteration
Gradient Boosting Fits each new learner to the residual errors using gradient descent
XGBoost Regularized, highly optimized implementation of gradient boosting
CatBoost Handles categorical features efficiently
Advantages of Boosting:
High accuracy
Handles both bias and variance
Performs well on imbalanced data
Limitations:
Prone to overfitting if not regularized
Sequential → difficult to parallelize
Slower than bagging
Random Forest Algorithm
Random Forest is a supervised ensemble learning algorithm.
It is used for both classification and regression tasks.
It builds multiple decision trees and merges them together to get a more accurate and
stable prediction.
A Random Forest is a collection (ensemble) of Decision Trees where:
Each tree is trained on a different subset of the data using bootstrap sampling (bagging).
At each node, only a random subset of features is considered for splitting.
Final output is based on majority voting (classification) or averaging (regression).
Workflow of Random Forest (Step-by-Step)
1. Draw multiple bootstrap samples from the original training data.
2. Grow a decision tree on each sample, considering only a random subset of features at each split.
3. Repeat until the desired number of trees has been built.
4. Aggregate the outputs of all trees: majority voting for classification, averaging for regression.
Advantages
Reduces overfitting compared to individual decision trees.
Works well with both categorical and numerical features.
Can handle missing values and maintain accuracy.
Robust to outliers and noise.
Can give feature importance scores.
Disadvantages
Computationally intensive (training many trees).
Less interpretable than a single decision tree.
Slower in real-time predictions (due to ensemble size).
Applications of Random Forest:
Medical diagnosis (e.g., cancer prediction)
Financial risk analysis
Credit scoring
Image classification
Fraud detection
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Build model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
Key parameters of RandomForestClassifier:
Parameter Description
n_estimators Number of trees in the forest
max_depth Maximum depth of each tree
max_features Number of features considered at each split
bootstrap Whether bootstrap samples are used
random_state Seed for reproducibility
AdaBoost Algorithm
AdaBoost (Adaptive Boosting) is a Boosting ensemble technique that combines multiple weak
learners (usually decision stumps — trees with one split) to form a strong classifier.
It focuses on instances that were previously misclassified.
Learners are added sequentially, and each one tries to correct the mistakes of the
previous ones.
Key Idea:
Increase the weights of incorrectly classified data points so that subsequent models focus more
on those “hard” cases.
Workflow of AdaBoost:
Step-by-Step:
1. Initialize Weights:
o Assign equal weights to all training samples.
2. Train a Weak Learner:
o Train a classifier (e.g., a decision stump) on the weighted data.
3. Calculate Error:
o Compute the weighted error of the learner:
ε_t = Σ_i w_i · I(y_i ≠ h_t(x_i)) / Σ_i w_i
where I is an indicator function (1 if the sample is misclassified, 0 otherwise).
4. Compute Learner's Weight:
o A classifier with lower error gets higher importance:
α_t = ½ ln((1 − ε_t) / ε_t)
5. Update Sample Weights:
o Increase the weights of misclassified samples and decrease the weights of correctly classified ones:
w_i ← w_i · exp(−α_t · y_i · h_t(x_i)), with labels y_i ∈ {−1, +1}
o Normalize the weights so that they sum to 1.
6. Repeat:
o Train the next learner on the updated weights.
o Repeat the steps for T rounds (number of estimators).
7. Final Prediction:
o Combine all classifiers using their weights:
H(x) = sign(Σ_t α_t · h_t(x))
Key Notations:
w_i: weight of training sample i
h_t(x): prediction of the weak learner at round t
ε_t: weighted error at round t
α_t: weight (importance) of the weak learner at round t
T: number of boosting rounds (estimators)
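To make the weight-update formulas above concrete, here is a tiny illustrative one-round sketch (the labels and weak-learner predictions are made up for the example):
# Illustrative one-round AdaBoost weight update (labels in {-1, +1}).
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])          # true labels
y_weak = np.array([1, -1, -1, -1, -1])        # weak learner's predictions (2 mistakes)
w = np.full(5, 1 / 5)                         # start with equal weights

eps = np.sum(w * (y_weak != y_true)) / np.sum(w)   # weighted error
alpha = 0.5 * np.log((1 - eps) / eps)              # learner weight
w = w * np.exp(-alpha * y_true * y_weak)           # increase weights on mistakes
w = w / w.sum()                                    # normalize

print("error:", eps, "alpha:", round(alpha, 3))
print("updated weights:", np.round(w, 3))   # misclassified samples now carry more weight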
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# AdaBoost model with a decision stump as the weak learner
base = DecisionTreeClassifier(max_depth=1)
model = AdaBoostClassifier(base, n_estimators=50, learning_rate=1.0)
model.fit(X_train, y_train)
# Accuracy
print("Accuracy:", model.score(X_test, y_test))
Advantages of AdaBoost
Feature Benefit
Improves weak learners Combines simple models to perform well
Versatile Works for binary and multi-class classification
Feature importance Can give feature significance
Minimal data pre-processing Works without feature scaling or normalization of the inputs
Disadvantages
Sensitive to noisy data and outliers
Not suitable for large datasets with many irrelevant features
Harder to interpret compared to individual trees
Applications
Face detection (e.g., Viola-Jones algorithm)
Fraud detection
Text classification
Bioinformatics
Gradient Boosting Algorithm
Gradient Boosting is an ensemble learning technique that builds a strong predictive model by
combining multiple weak learners (typically decision trees), trained sequentially to correct the
errors made by previous models.
It uses the idea of minimizing a loss function by applying gradient descent.
Key Idea:
Each new learner is trained to predict the residuals (errors) of the previous learners, thereby
improving the model step by step.
Workflow of Gradient Boosting (Step-by-Step):
1. Initialize the model with a constant prediction (e.g., the mean of the target for regression).
2. Compute the residuals: the negative gradients of the loss function with respect to the current predictions.
3. Fit a weak learner (a shallow decision tree) to these residuals.
4. Update the model by adding the new learner, scaled by the learning rate η.
5. Repeat steps 2-4 for the chosen number of estimators.
6. Output the final additive model as the sum of all learners.
Key Terms
Term Description
Weak Learner Typically a decision tree (shallow)
Loss Function Measures error (MSE, Log Loss, etc.)
Learning Rate (η) Shrinks the contribution of each tree
Residuals Errors the model tries to fix
Additive Model Combines learners in a stage-wise manner
Loss Functions
Regression: Mean Squared Error, L(y, F(x)) = ½ (y − F(x))²
Classification: Log Loss (deviance), L(y, p) = −[y log(p) + (1 − y) log(1 − p)]
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Model
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb_model.fit(X_train, y_train)
# Accuracy
print("Accuracy:", gb_model.score(X_test, y_test))
Advantages of Gradient Boosting
High prediction accuracy
Handles both regression and classification
Works with many types of loss functions
Feature importance ranking
Disadvantages
Can overfit if not tuned properly
Training is slower due to sequential nature
Requires careful parameter tuning (learning rate, depth, etc.)
Comparison: AdaBoost and Gradient Boosting
Feature AdaBoost Gradient Boosting
Loss Optimization Based on exponential loss Any differentiable loss
Weighting Adjusts sample weights Fits to residuals
Robustness to Outliers Lower Higher
Tuning Needed Less More (learning rate, depth)
XGBoost Algorithm
XGBoost (Extreme Gradient Boosting) is an advanced implementation of the Gradient
Boosting algorithm. It is designed to be highly efficient, flexible, and portable, with state-of-
the-art performance.
It is robust, scalable, and tunable, and often outperforms other models in structured/tabular data
tasks.
Step 1: Objective Function
The objective combines the training loss with a regularization term on the trees:
Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k), where Ω(f) = γT + ½ λ‖w‖²
Step 2: Second-Order Taylor Approximation
The loss is approximated with gradients and Hessians:
Obj^(t) ≈ Σ_i [ g_i f_t(x_i) + ½ h_i f_t(x_i)² ] + Ω(f_t)
where g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction.
η: Learning rate (shrinks the contribution of each new tree)
Advantages of XGBoost
Advantage Description
Speed Parallel and fast due to efficient CPU use
Accuracy Often better than other ML models
Regularization Controls overfitting via the λ and γ penalties
Handles Missing Values Smart split-finding for missing data
Built-in Cross-Validation Available in API
Disadvantages
Complex to tune (many hyperparameters)
Can overfit on small data if not regularized
Not ideal for image or sequential data (use CNNs or RNNs instead)
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
# Accuracy
print("Accuracy:", model.score(X_test, y_test))
Stacking
Stacking (Stacked Generalization) combines the predictions of several base models through a meta-model. Unlike bagging or boosting (which usually use the same type of learner), stacking uses diverse models (e.g., decision trees, SVMs, neural networks).
Workflow of Stacking:
Step-by-Step Process:
1. Train Base Learners
o Train several different machine learning models on the training dataset.
o These models can be of different types (e.g., logistic regression, random forest,
SVM).
2. Generate Base Predictions
o Each base learner makes predictions on:
Either the validation set (during cross-validation),
Or directly on the test set.
3. Train Meta-Learner
o A new model (called a meta-model or blender) is trained using the predictions
of base models as features.
o Its goal is to learn how to best combine the outputs of base models.
4. Final Prediction
o The meta-model takes the predictions from base learners and makes the final
decision.
Use of Stacking:
Combines strengths of multiple models
Can reduce generalization error
Works well when base models are diverse and not highly correlated
Mathematically, the final prediction is ŷ = g(h_1(x), h_2(x), …, h_K(x)), where h_1, …, h_K are the base learners and g is the meta-learner.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Define meta-learner
meta_model = LogisticRegression()
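The notes stop after defining the meta-learner; one possible way to complete the example is scikit-learn's StackingClassifier, sketched below (the choice of base models is illustrative, and the snippet reuses the data and meta_model defined above):
# Minimal sketch: combine two illustrative base models with the meta-learner above.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC

base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("svm", SVC(probability=True, random_state=42)),
]
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
stacking_model.fit(X_train, y_train)
print("Stacking accuracy:", stacking_model.score(X_test, y_test))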
Advantages of Stacking
Benefit Description
Combines model strengths Leverages diversity to improve performance
Reduces generalization error Less likely to overfit than a single model
Flexible Works with any combination of models
Disadvantages
Limitation Description
More complex Requires training multiple models
Risk of overfitting If meta-model is too complex or base models are similar
Slower to train Compared to single-model methods
Blending in Machine Learning
Blending is an ensemble technique used to combine the predictions of multiple machine learning
models using a validation dataset and a meta-model (usually a simple one like logistic
regression or linear regression).
It’s very similar to stacking, but with a few key differences in how data is split and how the
meta-model is trained.
Workflow of Blending (Step-by-Step):
1. Split the training data into a training set and a hold-out validation set (plus the usual test set).
2. Train the base models on the training set only.
3. Use the base models to generate predictions on the validation set.
4. Train the meta-model on these validation-set predictions (used as features) with the true validation labels.
5. For the final prediction, the base models predict on the test set and the meta-model combines those predictions.
Advantages of Blending
Benefit Description
Simple to implement No need for complex cross-validation setups
Fast to train Meta-model trained on small dataset
Good for competitions Useful in last-minute model improvement
Disadvantages
Drawback Description
High risk of overfitting Meta-model trained on small validation set
Not as robust Compared to stacking with cross-validation
Wastes data Validation data not used in base model training
final_pred = meta_model.predict(np.column_stack((
model1.predict_proba(X_test)[:, 1],
model2.predict_proba(X_test)[:, 1]
)))
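The snippet above shows only the final prediction step. A minimal end-to-end blending sketch is given below, assuming two illustrative base models and a hold-out validation split (the model and split choices are not from the notes):
# Minimal blending sketch: base models trained on the training split, meta-model
# trained on their validation-set probabilities, final prediction on the test set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

model1 = RandomForestClassifier(random_state=42).fit(X_train, y_train)
model2 = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Meta-model is trained on the base models' validation-set probabilities
val_features = np.column_stack((model1.predict_proba(X_val)[:, 1],
                                model2.predict_proba(X_val)[:, 1]))
meta_model = LogisticRegression().fit(val_features, y_val)

# Final prediction on the test set (the same step as the snippet above)
test_features = np.column_stack((model1.predict_proba(X_test)[:, 1],
                                 model2.predict_proba(X_test)[:, 1]))
final_pred = meta_model.predict(test_features)
print("Blending accuracy:", accuracy_score(y_test, final_pred))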
Regularization Methods in Machine Learning
Regularization is a technique used to reduce overfitting by adding a penalty term to the loss
function of a machine learning model. This discourages the model from becoming too complex
or sensitive to noise in the training data.
Benefits of Regularization
The main benefits of regularization are as follows:
1. Prevents Overfitting: Helps models focus on the underlying patterns instead of memorizing noise in the training data.
2. Improves Interpretability: L1 (Lasso) regularization simplifies models by shrinking the coefficients of less important features to zero.
3. Enhances Performance: Prevents excessive weighting of outliers or irrelevant features, which improves overall model accuracy.
4. Stabilizes Models: Reduces sensitivity to minor data changes, ensuring consistency across different data subsets.
5. Prevents Complexity: Keeps the model from becoming too complex, which is important for limited or noisy data.
6. Handles Multicollinearity: Shrinks the magnitudes of correlated coefficients, improving model stability.
7. Allows Fine-Tuning: Hyperparameters such as alpha and lambda control the regularization strength, balancing bias and variance.
8. Promotes Consistency: Ensures reliable performance across different datasets and reduces the risk of large performance shifts.
1. L1 Regularization (Lasso)
Adds the absolute values of the coefficients as a penalty to the loss function: Loss = Error + λ Σ |w_i|.
Can shrink some coefficients to exactly zero, effectively performing feature selection.
2. L2 Regularization (Ridge)
Adds the squares of the coefficients as a penalty to the loss function: Loss = Error + λ Σ w_i².
Keeps all features but shrinks their weights toward zero.
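A minimal sketch of both penalties using scikit-learn's Lasso and Ridge estimators (the synthetic dataset and alpha values are arbitrary and would normally be tuned):
# Minimal sketch: L1 (Lasso) vs L2 (Ridge) regularization; alpha sets the penalty strength.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lasso = Lasso(alpha=1.0).fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

print("Lasso test R2:", round(lasso.score(X_test, y_test), 3),
      "| zero coefficients:", int((lasso.coef_ == 0).sum()))   # L1 can drive weights to exactly zero
print("Ridge test R2:", round(ridge.score(X_test, y_test), 3),
      "| zero coefficients:", int((ridge.coef_ == 0).sum()))   # L2 shrinks weights but keeps all features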
Cross-Validation Strategies
Use of Cross-Validation:
To assess model stability and robustness
To detect overfitting or underfitting
To choose the best model hyperparameters
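A minimal sketch of the most common strategy, k-fold cross-validation with cross_val_score (the model and the choice of k = 5 are illustrative):
# Minimal sketch: 5-fold cross-validation to estimate model performance.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())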
6. Group K-Fold Cross-Validation
Ensures that the same group (e.g., from the same patient or user) does not appear in
both training and validation sets.
Ideal for grouped or clustered data
7. Time Series Split (Rolling Forecast Origin)
For time series data where order matters
Avoids data leakage by ensuring that future data is not used to predict the past
Example:
Fold Train On Validate On
1 1, 2 3
2 1, 2, 3 4
3 1, 2, 3, 4 5
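A minimal sketch of both splitters in scikit-learn (the toy data and group assignment are illustrative):
# Minimal sketch: GroupKFold keeps each group on one side of the split;
# TimeSeriesSplit always validates on data that comes after the training window.
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
y = np.arange(12)
groups = np.repeat([1, 2, 3, 4], 3)          # e.g. 4 patients with 3 samples each

for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups):
    print("GroupKFold  train groups:", np.unique(groups[train_idx]),
          "val groups:", np.unique(groups[val_idx]))

for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("TimeSeries  train:", train_idx, "validate:", val_idx)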