Unit-2
Advanced Supervised Algorithms
2.1 Ridge Regression
2.2 Lasso Regression
2.3 Decision tree classifier
2.4 Random forest classifier
2.5 Supervised model optimization Techniques
Ridge Regression-
What is Ridge Regression?
1. Ridge Regression is a method used to improve linear regression models.
2. It is also called L2 regularization.
3. It helps when your independent variables (inputs) are highly correlated — this issue is called
multicollinearity.
4. Multicollinearity can make the model’s predictions unstable and the coefficients unreliable.
5. Ridge Regression also helps to prevent overfitting, which happens when a model learns the
training data too well (including noise) and performs poorly on new data.
Purpose?
Example-1:-
We are building a model to predict house prices based on features like:
Size of the house (in square feet)
Number of bedrooms
Number of bathrooms
Total number of rooms
Issue: Multicollinearity
Total rooms, bedrooms, and bathrooms are highly correlated.
(E.g., more bedrooms usually means more total rooms.)
If we use ordinary linear regression, the model may give very large and unstable
coefficients to some features to compensate for that overlap.
A small increase in the number of bedrooms also increases total rooms, so the model
struggles to tell which feature is actually causing the change in price.
Example-2:-
We want to predict students’ final exam scores using clean, meaningful features:
Hours studied
Attendance rate (%)
Assignment scores
Midterm exam score
Class participation score
We decide to use a very complex model — say, a polynomial regression of
degree 5 or 6. Even with clean, useful features, the model can over-learn small
patterns, noise, or outliers in the training data.
Issue: Overfitting
The model fits the training data perfectly, even capturing tiny ups and downs in
scores, but when tested on new students, the predictions are wildly inaccurate.
This is classic overfitting due to high model complexity: the model has learned
the training data too well and cannot generalize.
How to Fix it?
1. Use a simpler model (e.g., linear regression instead of polynomial of degree 6)
2. Apply regularization (e.g., Ridge or Lasso) to control the size of coefficients
3. Use cross-validation to find a model that works well on unseen data
Let’s Focus on Ridge……
Ridge Regression helps by simplifying the model and reducing the
influence of each feature (input variable).
It uses a method called L2 regularization, which means it adds a penalty
based on the sum of the squares of the coefficients.
This penalty is added to the loss function (a formula that tells how
wrong the model's predictions are).
The loss function for Ridge Regression is:
Loss = RSS + λ × ∑(bᵢ²)
where:
→ RSS/MSE = Residual Sum of Squares (error)
→ λ (lambda) = regularization strength
→ bᵢ = model coefficients
Impact of Lambda (λ) on the Model-
λ = 0 → Ridge becomes ordinary linear regression (no regularization)
λ very high → All coefficients shrink close to 0 (underfitting may occur)
Moderate λ → Best balance between bias and variance.
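To make the effect of λ concrete, here is a minimal sketch (not from these notes) using scikit-learn, where the Ridge penalty strength is called alpha; the house data below is made up for illustration:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
size = rng.uniform(500, 3000, 100)                 # house size in square feet (synthetic)
bedrooms = size / 800 + rng.normal(0, 0.3, 100)    # deliberately correlated with size
X = np.column_stack([size, bedrooms])
price = 50 * size + 10000 * bedrooms + rng.normal(0, 5000, 100)

for lam in [0.01, 1, 100, 10000]:
    model = Ridge(alpha=lam).fit(X, price)         # alpha plays the role of lambda
    print(f"lambda={lam:>8}: coefficients={model.coef_}")

A very small alpha behaves almost like ordinary linear regression, while a very large alpha shrinks both coefficients close to zero (risking underfitting).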
Dataset Example: -
Step-1
Step-2 Apply Linear Regression-
Build a model-
Step-3 Problem — Multicollinearity
Bedrooms and Total Rooms are highly correlated.
The model gives:
Large +95 to bedrooms
Large –80 to total rooms
It’s compensating one for the other — an unstable model!
Step 4: Evaluate the Model
Negative price! This is not realistic.
This shows the instability of coefficients due to multicollinearity.
Step 5: MSE
Let's say MSE = 430
Step 6: Apply Ridge Regression
Let's set λ=10
Step 7: Ridge Coefficients-
Step 8: Build model
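The house-price table used in Steps 1-8 is not reproduced above, so the sketch below uses a small made-up dataset with the same structure (bedrooms and total rooms almost perfectly correlated) to show how ordinary least squares can produce unstable coefficients while Ridge with λ = 10 keeps them smaller and more stable:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
bedrooms = rng.integers(1, 6, size=30).astype(float)
total_rooms = bedrooms + rng.integers(2, 4, size=30)      # highly correlated with bedrooms
X = np.column_stack([bedrooms, total_rooms])
price = 20 * bedrooms + 15 * total_rooms + rng.normal(0, 5, size=30)   # synthetic prices (in thousands)

ols = LinearRegression().fit(X, price)
ridge = Ridge(alpha=10).fit(X, price)                     # lambda = 10, as in Step 6

print("OLS coefficients:  ", ols.coef_)                   # may come out large and opposite in sign
print("Ridge coefficients:", ridge.coef_)                 # shrunk toward zero, more stable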
Lasso Regression-
Lasso Regression stands for Least Absolute Shrinkage and Selection Operator.
Like Ridge, it’s a regularization technique used to prevent overfitting and improve
model generalization.
It is also known as L1 regularization.
Lasso adds a penalty based on the absolute value of the coefficients.
Unlike Ridge, Lasso can shrink some coefficients to exactly zero, effectively performing
feature selection.
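Before the worked example, here is a minimal sketch (assumed, not from these notes) showing Lasso's feature-selection behaviour with scikit-learn on synthetic data where only two of five features actually matter:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                      # 5 candidate features
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)    # only the first two matter

lasso = Lasso(alpha=0.5).fit(X, y)            # alpha plays the role of lambda
print("Lasso coefficients:", lasso.coef_)     # the unimportant features end up at exactly 0.0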
Example:-
Step-1
Step 2: Apply Simple Linear Regression
For Student ‘C’
Let’s say the actual score was 92
Step 4: Apply Lasso Regression (λ = 1)
Difference Between Ridge and Lasso Regression
Full Name: Ridge Regression (L2 Regularization) vs. Lasso Regression (L1 Regularization)
Penalty Term: Ridge adds λ × ∑(coefficients²); Lasso adds λ × ∑|coefficients|
Purpose: Ridge reduces model complexity and avoids overfitting; Lasso reduces model complexity and selects key features
Feature Selection: Ridge keeps all features; Lasso can remove unimportant features (sets their coefficients to 0)
When to Use: Ridge when all features are likely useful but the model may overfit; Lasso when only a few features are important
Output Model: Ridge retains all features with smaller coefficients; Lasso gives a simpler model (some features removed)
Coefficient Shrinking: Ridge shrinks all coefficients toward zero; Lasso shrinks some coefficients to exactly zero
Multicollinearity Handling: Ridge is good at handling multicollinearity; Lasso can help, but may drop one of the correlated features
Interpretability: Ridge is less interpretable (many features remain); Lasso is more interpretable (only key features remain)
High λ (lambda) Impact: Ridge makes coefficients very small, but none become zero; Lasso sets some coefficients exactly to zero
Model Complexity: Ridge is moderate (all features included); Lasso is simpler (fewer features used)
Computational Cost: Ridge is faster (no feature elimination); Lasso is slightly slower (due to feature selection)
Use Case Example: Ridge for predicting house prices using size, location, etc.; Lasso for selecting key genes from thousands in medical data
Drawback: Ridge does not simplify the model (all features are retained); Lasso may remove useful features if λ is too high
Mathematical Optimization: Ridge uses the L2 norm ‖w‖²; Lasso uses the L1 norm ‖w‖₁
Decision tree classifier-
Decision tree classifiers are a fundamental type of supervised
machine learning algorithm used for classification tasks.
It works by splitting the data into branches based on feature
values, creating a tree-like model of decisions.
They create a model that predicts the value of a target variable by
learning simple decision rules inferred from the data features.
Terms-
Root Node: The top decision node in a tree.
Decision Node: A node that splits the data based on a feature.
Leaf Node: A terminal node that gives the classification output.
Splitting: Dividing data based on a condition.
Entropy: A measure of disorder (impurity) in the data, used to determine the best feature to split on.
Pruning: Removing parts of the tree to prevent overfitting.
Working-
1. Start at the Top (Root Node): The decision tree begins by asking an
important question based on the data.
2. Yes or No Questions: It then asks simple yes-or-no questions to divide
the data into smaller groups.
3. Follow the Branches:
o If the answer is yes, it goes one way.
o If the answer is no, it goes another way.
4. Keep Splitting: It keeps asking questions at each step to break the data
down further.
5. Final Answer (Leaf Node): When no more questions are needed, it
reaches a final decision — like classifying something as a "Yes" or "No"
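As a quick illustration of this workflow, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on its built-in Iris dataset (an illustrative choice, not the example dataset used later in these notes):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)                                   # learns yes/no style questions on the features

print(export_text(tree, feature_names=load_iris().feature_names))   # the question asked at each node
print("Predicted class:", tree.predict([[5.1, 3.5, 1.4, 0.2]]))     # follow the branches to a leaf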
Splitting Criteria in Decision Trees-
When building a decision tree, it’s important to choose the best feature to
split the data at each step. This is done using splitting criteria, which help the
tree decide how to divide the data effectively.
Entropy
Measures the amount of disorder or uncertainty in the data.
The tree tries to split the data in a way that reduces entropy the most.
The reduction in entropy achieved by a split is called Information Gain: the more
information we gain from a split, the better.
Pruning-
Pruning is a way to simplify the decision tree by removing extra
branches that don’t add much value.
It is used to avoid overfitting — when the tree learns the training data
too well, including the noise or random patterns.
It Improves accuracy on new (unseen) data by focusing on general
patterns.
Reduces complexity, making the model faster and easier to understand.
Helps the tree generalize better, instead of just memorizing the training
data.
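A small sketch of pruning in practice, using scikit-learn's cost-complexity pruning parameter ccp_alpha on a built-in dataset (the dataset and alpha values are arbitrary choices for illustration): larger ccp_alpha removes more branches, trading a little training accuracy for a simpler tree.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for alpha in [0.0, 0.01, 0.03]:                   # 0.0 means no pruning
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"ccp_alpha={alpha}: leaves={tree.get_n_leaves()}, "
          f"train acc={tree.score(X_tr, y_tr):.2f}, test acc={tree.score(X_te, y_te):.2f}")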
Example:-
Step-1
Total: 14 instances
Target: PlayTennis (Yes = 9, No = 5)
Step 2: Calculate Entropy of the full dataset-
Entropy(S) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) ≈ 0.940
Step 3: Calculate Information Gain for each feature
Feature 1: Outlook
Values: Sunny, Overcast, Rain
→ Split by Sunny:
5 samples: [No, No, No, Yes, Yes] → 2 Yes, 3 No
→ Split by Overcast:
4 samples: All Yes → Entropy = 0 (pure class)
→ Split by Rain:
5 samples: [Yes, Yes, No, Yes, No] → 3 Yes, 2 No
Feature 2: Humidity
Values: High, Normal
High: [No, No, Yes, Yes, No, Yes, No] → 3 Yes, 4 No
Normal: [Yes, No, Yes, Yes, Yes, Yes, Yes] → 6 Yes, 1 No
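The entropy and information-gain calculations for Steps 2-3 can be checked with a few lines of Python, using the class counts listed above:

from math import log2

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                                  # 0 * log2(0) is treated as 0
            p = count / total
            result -= p * log2(p)
    return result

E_full = entropy(9, 5)                             # entropy of the full dataset
# Outlook branches: Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No), Rain (3 Yes, 2 No)
E_outlook = (5/14) * entropy(2, 3) + (4/14) * entropy(4, 0) + (5/14) * entropy(3, 2)
# Humidity branches: High (3 Yes, 4 No), Normal (6 Yes, 1 No)
E_humidity = (7/14) * entropy(3, 4) + (7/14) * entropy(6, 1)

print(f"Entropy(S)        = {E_full:.3f}")               # about 0.940
print(f"Gain(S, Outlook)  = {E_full - E_outlook:.3f}")   # about 0.247, the best split
print(f"Gain(S, Humidity) = {E_full - E_humidity:.3f}")  # about 0.152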
Outlook has the highest Information Gain, so it becomes the root node, with branches Sunny, Overcast, and Rain.
Step-4 Split 'Sunny' Branch Further
Step- 5 Calculate Information Gain for Each Feature
Split on Humidity-
Split on Temperature-
Split on Wind-
Step-6
Overcast: All Yes → Leaf node
Step-7 Split 'Rain' Branch Further
Step-8 Calculate Entropy of the "Rain" Subset-
Step-9 Calculate Information Gain for Each Feature
Split on Wind-
Split on Temperature-
Random forest classifier-
A Random Forest Classifier is a powerful and widely used machine learning
algorithm.
Random forests can be used for solving regression (numeric target variable) and
classification (categorical target variable) problems, but mostly for classification.
Random forests are an ensemble method, meaning they combine the predictions of
multiple smaller models.
Each of the smaller models in the random forest ensemble is a decision tree.
It works by creating many small decision trees (like mini decision-makers), and then
all these trees vote on the final decision.
It outputs the mode (most common class) of the classes predicted by individual
trees.
It is called "random forest" because it uses many trees (a forest), and each tree is
built in a random way.
Terms-
Ensemble Learning: Combines predictions from multiple models to improve accuracy
and robustness.
Decision Trees: Basic building blocks. Each tree is trained on a random subset of the
data and features.
Bagging (Bootstrap Aggregating): Random Forests use bagging to train trees on
different subsets of the training data.
Feature Randomness: At each split in a tree, a random subset of features is
considered — reducing correlation between trees.
Working-
Suppose we have a complex problem to solve, and we gather a group of experts from
different fields to provide their input. Each expert provides their opinion based on their
expertise and experience. Then, the experts would vote to arrive at a final decision.
In a random forest classification, multiple decision trees are created using different random
subsets of the data and features. Each decision tree is like an expert, providing its opinion
on how to classify the data. Predictions are made by calculating the prediction for each
decision tree and then taking the most popular result.
1. Bootstrap Sampling: Random rows are picked (with replacement) to train each tree.
2. Random Feature Selection: Each tree uses a random set of features (not all features).
3. Build Decision Trees: Trees split the data using the best feature from their random set.
Splitting continues until a stopping rule is met (like max depth).
4. Make Predictions: Each tree gives its own prediction.
5. Majority Voting: The final prediction is the one most trees agree on.
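A minimal sketch of the same workflow with scikit-learn (the dataset and settings are illustrative assumptions, not part of these notes):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,        # number of trees in the forest
    max_features="sqrt",     # random subset of features considered at each split
    bootstrap=True,          # each tree sees a bootstrap sample of the rows
    random_state=0,
).fit(X_tr, y_tr)

print("Test accuracy:", forest.score(X_te, y_te))
print("One tree's prediction:  ", forest.estimators_[0].predict(X_te[:1]))
print("Forest's majority vote: ", forest.predict(X_te[:1]))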
Benefits-
Random Forest can handle large datasets and high-dimensional data.
By combining predictions from many decision trees, it reduces the risk of overfitting
compared to a single decision tree.
It is robust to noisy data and works well with categorical data.
Difference Between Decision Tree and Random Forest-
1. Overfitting: Decision trees normally suffer from the problem of overfitting if they are allowed to grow without any control. Random forests are created from subsets of the data, and the final output is based on averaging or majority voting, so the problem of overfitting is taken care of.
2. Speed: A single decision tree is faster in computation; a random forest is comparatively slower.
3. Approach: When a data set with features is taken as input by a decision tree, it formulates a set of rules to make predictions. A random forest randomly selects observations and features, builds many decision trees, and takes the average (or majority) result; it does not rely on a single set of rules.
Example: -
Step-1
Step-2 Build 3 Decision Trees (Using Bootstrapped Samples)
We'll randomly select subsets (with replacement) to create Tree 1, Tree 2, Tree 3.
Step 3: Predict a New Day
Step 4: Voting
Majority Vote → Final Prediction: YES (Play Tennis)
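A tiny sketch of the majority vote in Step 4; the three per-tree predictions below are placeholders for illustration, not the actual outputs of Trees 1-3:

from collections import Counter

tree_predictions = ["Yes", "Yes", "No"]                 # one vote per decision tree
final = Counter(tree_predictions).most_common(1)[0][0]  # the most common class wins
print("Final prediction:", final)                        # -> "Yes"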
Supervised model optimization Techniques-
Optimizing a supervised learning model involves a multifaceted approach that considers
various aspects of the data, the chosen algorithm, and the training process.
The goal is to build a model that performs well on unseen data, generalizes effectively to
new situations, and avoids issues like overfitting or underfitting.
Some common supervised model optimization techniques:
Data preprocessing and feature engineering-
1. Cleaning and Handling Missing Values
2. Feature Scaling
3. Feature Engineering
Algorithm selection and tuning
1. Choosing the Right Learning Algorithm
2. Hyperparameter Tuning
3. Regularization Techniques
4. Gradient Descent Optimization
Model evaluation and refinement
1. Cross-Validation
2. Ensemble Methods
3. Continuous Evaluation and Adjustment
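As a sketch of how cross-validation and hyperparameter tuning fit together in practice (the model, grid values, and dataset below are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                        # 5-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))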
Will Study-
1. Gradient Descent
2. Hyperparameter Tuning
1. Gradient Descent
Cost Function-
It is a function that measures the performance of a model for any given data. Cost
Function quantifies the error between predicted values and expected values and
presents it in the form of a single real number.
After making a hypothesis (an initial guess) with starting parameters, we calculate the cost
function. Then, with the goal of reducing the cost, we modify the parameters using the
gradient descent algorithm on the given data.
Here’s the mathematical representation for it (using mean squared error):
J(w, b) = (1/n) × ∑ᵢ (ŷᵢ − yᵢ)², where ŷᵢ = w·xᵢ + b is the model's prediction for the i-th example.
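A small sketch of this cost function in code, with made-up numbers:

import numpy as np

def mse_cost(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)       # average of the squared errors

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
print("Cost (MSE):", mse_cost(y_pred, y_true))   # 0.5 = (0.25 + 0.25 + 1.0) / 3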
learning rate-
The learning rate is a key parameter in the gradient descent algorithm. It controls how big a
step the model takes when updating its parameters during training.
If the learning rate is too small, the model learns very slowly and takes many steps
to reach the optimal solution.
If the learning rate is too large, the model might skip over the best solution or even
diverge, making the training unstable.
What is Gradient Descent?
Gradient means slope (the derivative) in one dimension.
A vector is a mathematical object that has two main properties:
Magnitude (how big it is)
Direction (where it's pointing)
In higher dimensions, the gradient is a vector showing how much the function increases in each
direction.
The negative gradient is therefore a vector pointing along the steepest downward path at a
given point.
Descent means going downward or decreasing.
A local minimum is a point in a function where the value is lower than at all nearby points,
but not necessarily the lowest overall.
Gradient descent is an optimization algorithm commonly used in machine learning to
minimize a cost function by iteratively adjusting model parameters.
The goal is to find the set of parameters that reduces the difference between the
model's predictions and the actual outputs, thereby improving performance.
The algorithm works by computing the gradient of the cost function, which indicates
the direction of steepest increase.
To minimize the cost, gradient descent moves in the opposite direction—along the
negative gradient.
In each iteration, the model's parameters are updated based on this negative
gradient.
The learning rate, a key hyperparameter, controls the step size of these updates,
affecting both the speed and stability of convergence.
Gradient descent is a versatile method applicable to various machine learning
models, including linear and logistic regression, neural networks, and support vector
machines, making it a foundational tool for model optimization.
Working: -
1. Start with an initial guess for the model's parameters (weights).
2. Calculate the gradient (i.e., the slope or direction of the steepest increase of the loss
function).
3. Update the parameters by moving in the opposite direction of the gradient (to
reduce the loss).
4. Repeat until you reach a minimum (ideally, the lowest point).
Formula: -
Two important points-
Which direction to go (downhill, along the negative gradient)
How big a step to take (controlled by the learning rate α)
For one parameter w:
w := w − α × ∂J/∂w
For multiple weights:
wⱼ := wⱼ − α × ∂J/∂wⱼ (the same update is applied to every weight wⱼ, and to the bias b)
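A hedged end-to-end sketch of these update rules on a tiny made-up dataset (roughly y = 2x), using the mean-squared-error gradients for w and b:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, b = 0.0, 0.0            # initial guesses for the parameters
alpha = 0.05               # learning rate
n = len(x)

for step in range(200):
    y_pred = w * x + b                        # predictions with the current parameters
    dw = (2 / n) * np.sum((y_pred - y) * x)   # gradient of MSE with respect to w
    db = (2 / n) * np.sum(y_pred - y)         # gradient of MSE with respect to b
    w -= alpha * dw                           # move against the gradient
    b -= alpha * db
    if step % 50 == 0:
        print(f"step {step}: w={w:.3f}, b={b:.3f}, loss={np.mean((y_pred - y) ** 2):.4f}")

print(f"final: w={w:.3f}, b={b:.3f}")         # should approach w ≈ 2, b ≈ 0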
Example:-
Step-by-Step: Simple Linear Regression with Gradient Descent:-
Step 1: Dataset
Step 2: Initialize
Step 3: Compute Predictions and Loss
Step 4: Compute Gradients-
Formula for w: ∂J/∂w = (2/n) × ∑ᵢ (ŷᵢ − yᵢ) × xᵢ
Formula for b: ∂J/∂b = (2/n) × ∑ᵢ (ŷᵢ − yᵢ)
Step-5 Update Parameters-
Step 6: New Predictions and Loss
Final conclusion-