AIT401: FOUNDATIONS OF DEEP LEARNING
Module: 2
PREPARED BY: ASHA ROSE THOMAS
AP,CSE(AI),ASIET,KALADY
Syllabus
Module 2: Training deep models
Introduction, setup and initialization - Kaiming, Xavier weight initializations,
Vanishing and exploding gradient problems, Optimization techniques - Gradient
Descent (GD), Stochastic GD, GD with momentum, GD with Nesterov
momentum, AdaGrad, RMSProp, Adam. Regularization Techniques - L1 and L2
regularization, Early stopping, Dataset augmentation, Parameter tying and
sharing, Ensemble methods, Dropout, Batch normalization.
Training deep models
Steps of Neural Network Training
1. Data Preparation:
•Data Collection: Gather relevant data for the problem.
•Data Cleaning: Handle missing values, outliers, and inconsistencies.
•Data Preprocessing: Normalize, standardize, or encode features as required.
•Data Splitting: Divide data into training, validation, and test sets.
2. Model Architecture:
•Define Network Structure: Choose the number of layers, neurons per layer, and activation
functions.
•Select Loss Function: Determine how to measure the error between predicted and actual values.
•Choose Optimizer: Select an algorithm to update network weights (e.g., Gradient Descent, Adam,
SGD).
3. Initialization:
•Assign Weights: Initialize weights and biases with random or specific values.
Steps of Neural Network Training Cont..
4. Training:
•Forward Propagation: Input data passes through the network to produce an output.
•Loss Calculation: Compute the difference between predicted and actual values.
•Backpropagation: Calculate gradients of the loss function with respect to weights.
•Vanishing and exploding gradient problems can occur
•Weight Update: Adjust weights using the optimizer and calculated gradients.
•Regularization: Techniques like L1/L2 regularization or dropout to prevent overfitting.
•Repeat: Iterate through the training dataset multiple times (epochs).
5. Validation:
•Evaluate Model: Assess performance on the validation set to prevent overfitting.
•Hyperparameter Tuning: Adjust hyperparameters based on validation results.
6. Testing:
•Final Evaluation: Measure the model's performance on the unseen test set.
7. Deployment (Optional):
•Integrate Model: Deploy the trained model into a production environment.
Initialization in Neural Networks
❑What is Initialization?
❑Initialization refers to assigning initial values to a neural network's parameters (weights and biases).
❑These parameters significantly impact the network's learning process.
❑Fan-in and fan-out
❑ Fan-in is the number of input connections to a neuron.
❑ Fan-out is the number of output connections from a neuron.
Importance of Initialization
❑Prevents Symmetry
❑If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights,
they will always get exactly the same gradient
❑ So they can never learn to be different features
❑We break symmetry by initialising the weights to have small random values.
❑Controls Gradient Flow
❑maintain gradients within a reasonable range
❑facilitating effective learning by preventing vanishing or exploding gradients.
❑Accelerates Convergence
❑the network converges faster to an optimal solution
❑Because it is starting the optimization process from a promising region.
Probability distributions
❑Uniform Distribution
❑Lower bound (a) and upper bound (b).
❑Equal probability: Any value within the range a to
b has the same chance of being generated.
❑Standard Normal Distribution
❑A probability distribution with a bell-shaped
curve.
❑Central tendency: Most generated numbers
cluster around the mean (0).
❑Spread: The likelihood of generating a number
decreases as it moves farther from the mean,
determined by the standard deviation (1)
Common Initialization Techniques
❑Random Initialization
❑Assign random values to weights and biases.
❑Helps break symmetry, allowing the network to explore different solutions.
❑Random Uniform Initialization
❑Draws each weight w from a uniform distribution within the range -x to +x
❑The user decides the range (the x value) of the uniform distribution
❑Typically x is a small value (e.g. 0.1), so weights are centred around 0
❑Random Normal Initialization
❑Draws each weight w from a normal distribution with mean = 0 and standard deviation σ
❑The user can decide the σ value of the normal distribution
❑A common choice is to set σ to a small value like 0.01
Xavier/ Glorot Initialization
❑Activation function-specific initialization
❑Designed for tanh and sigmoid activation functions
❑Scales the weights based on the fan-in and fan-out of a layer
❑Method
❑Uniform Xavier initialization
❑Draws each weight w from a uniform distribution within the range -x to +x, where
❑ x is decided based on fan-in and fan-out
❑Normal Xavier initialization
❑Draws each weight w from a normal distribution with mean = 0 and standard deviation σ, where
❑ σ is decided based on fan-in and fan-out
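For reference, the commonly used Glorot/Xavier scales are x = sqrt(6 / (fan_in + fan_out)) for the uniform variant and σ = sqrt(2 / (fan_in + fan_out)) for the normal variant. A minimal NumPy sketch under that assumption (function names are illustrative):

import numpy as np

def xavier_uniform(fan_in, fan_out):
    # Glorot/Xavier uniform: W ~ U(-x, +x), x = sqrt(6 / (fan_in + fan_out))
    x = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-x, x, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out):
    # Glorot/Xavier normal: W ~ N(0, sigma^2), sigma = sqrt(2 / (fan_in + fan_out))
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.normal(0.0, sigma, size=(fan_in, fan_out))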
Kaiming (He) Initialization
❑Designed for ReLU and variants of ReLU activations
❑Method
❑Uniform He initialization
❑ Draws each weight w from a uniform distribution within the range -x to +x, where
❑ x is decided based on fan-in
❑Normal He initialization
❑ Draws each weight w from a normal distribution with mean = 0 and standard deviation σ, where
❑ σ is decided based on fan-in
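For reference, the commonly used He/Kaiming scales are x = sqrt(6 / fan_in) for the uniform variant and σ = sqrt(2 / fan_in) for the normal variant. A minimal NumPy sketch under that assumption (function names are illustrative):

import numpy as np

def he_uniform(fan_in, fan_out):
    # He/Kaiming uniform: W ~ U(-x, +x), x = sqrt(6 / fan_in)
    x = np.sqrt(6.0 / fan_in)
    return np.random.uniform(-x, x, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He/Kaiming normal: W ~ N(0, sigma^2), sigma = sqrt(2 / fan_in)
    sigma = np.sqrt(2.0 / fan_in)
    return np.random.normal(0.0, sigma, size=(fan_in, fan_out))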
PRACTICAL SESSION
INITIALIZATION COMPARISON
Home work
❑Xavier initialization works better for activation functions like tanh and sigmoid
❑Why is it more suitable for these functions compared to plain random initialization?
❑He initialization works better for the ReLU activation function
❑Why is it more suitable for ReLU compared to plain random initialization?
Vanishing & Exploding Gradients
(Figure: the chain rule in backpropagation, contrasting a node that can learn from the loss function with one that cannot, illustrating the vanishing and exploding gradient cases)
Vanishing Gradients Problem
❑Gradients become increasingly small as they propagate backward through the network.
❑Difficulty in updating weights of earlier layers.
❑Common in deep networks with sigmoid or tanh activations.
❑Slows down training or prevents convergence.
❑Impact
❑This hinders learning in deep neural networks.
❑Prevent the network from reaching optimal performance.
❑Mitigation Techniques
❑Careful weight initialization (Xavier, Kaiming)
❑Gradient clipping: gradients are rescaled to a maximum threshold during backpropagation (see the sketch after this list)
❑Batch normalization: normalizes the activations within each mini-batch during training
❑LSTM/GRU for recurrent networks
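A minimal NumPy sketch of gradient clipping by global norm, one common variant of the technique listed above (the max_norm threshold is an illustrative value):

import numpy as np

def clip_by_global_norm(gradients, max_norm=1.0):
    # gradients: list of NumPy arrays, one per parameter tensor
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm            # rescale so the norm equals max_norm
        gradients = [g * scale for g in gradients]
    return gradients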
Exploding Gradients Problem
❑Gradients become excessively large during backpropagation.
❑Leads to unstable training and divergence.
❑Often caused by large weights or improper initialization.
❑Can result in NaN values.
❑Impact:
❑Both problems hinder learning in deep neural networks.
❑Prevent the network from reaching optimal performance.
❑Mitigation Techniques:
❑Careful weight initialization (Xavier, Kaiming)
❑Gradient clipping
❑Batch normalization
❑LSTM/GRU for recurrent networks
Reference
1. Goodfellow, I., Bengio,Y., and Courville, A., Deep Learning, MIT Press, 2016.
2. Neural Networks and Deep Learning, Aggarwal, Charu C., c Springer International
Publishing AG, part of Springer Nature 2018
Reference
[Link]
[Link]
[Link]
Weight Initialization for Deep Feedforward Neural Networks ([Link])
Optimization techniques
The Optimization Problem
❑Objective: Minimize the loss
function (error between predicted
and actual values)
❑Challenge: The loss landscape is
often complex, with multiple local minima
❑Goal: Find the global minimum
or a sufficiently good local
minimum
❑Visualization: A complex, hilly
terrain with many valleys
Loss Functions - Revisiting
❑Loss function has multiple local minima
❑Finding the global minimum is
challenging
❑High Dimensionality:
❑Parameter space is vast and complex
❑Visualization is difficult
Complexity of Loss function
❑3D representation of a high
dimensional function
❑VGG-56 deep network's loss
function on the CIFAR-10 dataset
Optimization techniques of Neural
Network
❑Gradient Descent (GD)
❑Stochastic Gradient Descent
❑Gradient Descent with momentum
❑Gradient Descent with Nesterov momentum
❑AdaGrad
❑RMSProp
❑Adam
Gradient Descent: Finding the Lowest
Point
❑Objective: Find the minimum of a function
(the loss function in ML)
❑Analogy: Imagine a blindfolded hiker trying
to reach the bottom of a valley
❑Gradient: Measures the steepness of the
function at a given point
❑Gradient Descent: Iteratively moves in the
opposite direction of the gradient to reach
the lowest point
Gradient Descent Algorithm
❑Initialization:
❑Randomly initialize parameters (weights)
❑Set learning rate (step size)
❑Iteration:
❑Calculate the gradient of the loss function with
respect to the parameters
❑Update parameters by subtracting the learning
rate multiplied by the gradient
❑Termination:
❑Stop when the change in parameters is small or a
maximum number of iterations is reached
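A minimal sketch of the loop above on a toy problem, assuming the loss L(w) = (w - 3)^2 so its gradient is 2(w - 3); the learning rate and tolerance are illustrative values:

# Gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3
w = 0.0                     # initial parameter
learning_rate = 0.1         # step size
for step in range(1000):
    gradient = 2 * (w - 3)
    update = learning_rate * gradient
    w = w - update
    if abs(update) < 1e-6:  # terminate when the change in w becomes tiny
        break
print(w)                    # ends up very close to 3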
Challenges and Improvements
❑Batch Gradient Descent
❑Gradient descent computes the gradient using the entire training dataset in each iteration.
❑It calculates the average gradient over all training examples and updates the parameters
❑Ensures stability during training.
❑Challenges:
❑Can get stuck in local minima
❑Can be computationally expensive for large datasets.
❑May converge slowly for noisy or redundant data.
Stochastic Gradient Descent
❑Massive Datasets
❑Modern machine learning models often train on enormous datasets.
❑Computational Bottleneck
❑Calculating gradients for the entire dataset in each iteration is computationally expensive.
❑Solution
❑Stochastic Gradient Descent (SGD) offers a faster alternative by processing one training example at a
time.
Stochastic Gradient Descent: One Step at
a Time
❑Random Selection: SGD picks a random
training example from the dataset.
❑Gradient Calculation: Computes the
gradient based on this single example.
❑Parameter Update: Updates model
parameters using the calculated gradient.
❑Iterative Process: Repeats the process for
multiple epochs.
SGD - Advantages
❑Speed: Significantly faster than batch and mini-batch gradient descent due to processing only one example.
❑Escaping Local Minima: The stochastic nature can help the model escape local optima.
❑ SGD uses a single random data point for each update, introducing noise into the gradient calculation
❑ This noise allows SGD to explore different regions of the loss landscape, increasing the chance of escaping local minima.
❑ While SGD is more likely to escape local minima compared to batch gradient descent, it's not guaranteed.
❑Challenges: Noisy updates can lead to more fluctuations in the loss function compared to batch GD.
❑Hyperparameter Tuning: Learning rate needs careful adjustment.
Mini-Batch Gradient Descent
❑Compromise between Batch and Stochastic GD: Processes
data in small batches (subset of training data).
❑Faster than Batch GD: Reduces computation time compared to
using the entire dataset.
❑Smoother than Stochastic GD: Less noisy updates than SGD
due to averaging gradients within a batch.
❑Efficient for large datasets: Handles large datasets effectively.
❑Commonly used in deep learning: Preferred optimization
algorithm due to its balance of speed and stability.
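A minimal NumPy sketch of one epoch of mini-batch gradient descent for a toy linear-regression model; batch_size = 1 recovers SGD and batch_size = len(X) recovers batch GD (the data and hyperparameters are illustrative):

import numpy as np

# Toy data: y = 2x + noise
X = np.random.randn(1000)
y = 2 * X + 0.1 * np.random.randn(1000)

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 32

indices = np.random.permutation(len(X))        # shuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]  # small random subset
    error = (w * X[batch] + b) - y[batch]
    grad_w = 2 * np.mean(error * X[batch])     # gradient of MSE w.r.t. w
    grad_b = 2 * np.mean(error)                # gradient of MSE w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b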
PRACTICAL SESSION
PARAMETER OPTIMIZATION IN NEURAL NETWORKS -
[Link]
Basic Optimization Algorithms
These algorithms form the foundation for more complex optimization techniques.
•Gradient Descent (GD):
• Updates parameters using the gradient computed from all training data points in each iteration.
•Stochastic Gradient Descent (SGD):
• Updates parameters using the gradient computed from a single random data point.
• Faster but less stable than GD.
•Mini-batch Gradient Descent:
• Updates parameters using the gradient computed from a small random subset of data points.
• Balances speed and stability between GD and SGD.
GD improvement
These algorithms build upon the basic concepts of GD and incorporate additional techniques for
improved performance.
•Gradient Descent with Momentum
•Nesterov Accelerated Gradient (NAG)
Gradient Descent with Momentum
❑Analogy
❑Imagine a ball rolling down a hill. The ball's momentum helps it overcome small bumps and continue
rolling towards the bottom.
Gradient Descent with Momentum
❑Gradient Descent
❑Update rule
❑ parameters = parameters - (Learning Rate * gradient)
❑Momentum
❑Update rule:
❑ velocity = β * previous velocity + Learning Rate * gradient
❑ parameters = parameters - velocity
❑Momentum factor (β) controls the influence of past gradients.
❑The velocity term acts like the ball's momentum, carrying information about previous steps.
❑This helps the algorithm gain speed in the correct direction and overcome obstacles (like local
minima).
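A minimal NumPy sketch of the momentum update rule above (β and the learning rate are illustrative defaults):

import numpy as np

def momentum_step(params, grads, velocity, learning_rate=0.01, beta=0.9):
    # velocity is an exponentially decaying accumulation of past gradients
    velocity = beta * velocity + learning_rate * grads
    params = params - velocity
    return params, velocity

# usage: params, velocity = momentum_step(params, grads, np.zeros_like(params))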
Gradient Descent with Momentum
Cont..
❑Hyperparameter Tuning: The momentum factor (β) needs to be tuned.
❑Typical values: 0.8 or 0.9
❑High values can lead to overshooting, while low values approach standard gradient descent.
What will be the impact of setting the momentum factor (β) = 0?
Gradient Descent with Nesterov momentum -
Nesterov Accelerated Gradient (NAG)
❑NAG is an enhancement over standard momentum.
❑Key Idea
❑Instead of calculating the gradient at the current position, it calculates the gradient at a point ahead in
the direction of the current momentum.
❑By looking ahead, NAG can anticipate the direction of the parameters.
❑Update Rules
lookahead = parameters - β * previous velocity
gradient = gradient of loss function at lookahead point
velocity = β * previous velocity + Learning Rate * gradient
parameters = parameters - velocity
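A minimal sketch of the NAG update rules above; grad_fn is an assumed helper that returns the gradient of the loss at the given parameter values:

import numpy as np

def nesterov_step(params, velocity, grad_fn, learning_rate=0.01, beta=0.9):
    lookahead = params - beta * velocity       # look ahead along the momentum direction
    grads = grad_fn(lookahead)                 # gradient evaluated at the lookahead point
    velocity = beta * velocity + learning_rate * grads
    return params - velocity, velocity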
Adaptive Learning Rate Algorithms
❑Story So far
❑Traditional gradient descent uses a universal fixed learning rate
❑Fixed learning rate struggles with diverse parameters.
❑Solution
❑Adaptive learning rate algorithms adjust the learning rate for each parameter based on
historical gradients.
❑How it works
❑ Tracks past gradients for each parameter.
❑ Adjusts the learning rate based on this history.
❑ Treats each parameter as an individual learner.
❑ Examples: AdaGrad, RMSprop, Adam
AdaGrad (Adaptive Gradient) Optimizer
❑AdaGrad adapts the learning rate for each parameter individually based on the sum of historical
squared gradients.
❑Accumulates the sum of squared gradients for each parameter.
❑Divides the gradient by the square root of the accumulated sum.
❑The AdaGrad Update Rule:
❑For each weight (parameter) in the model, AdaGrad maintains a running sum of squared gradients.
❑Update Rule:
accumulated_squared_gradient = accumulated_squared_gradient + gradient^2
adjusted_gradient = gradient / sqrt(accumulated_squared_gradient + epsilon)
parameters = parameters - Learning Rate * adjusted_gradient
❑Hyperparameters:
❑Learning rate: Controls the overall step size.
❑Epsilon: Small constant to prevent division by zero.
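A minimal NumPy sketch of the AdaGrad update rule above (the learning rate and epsilon are illustrative defaults):

import numpy as np

def adagrad_step(params, grads, accumulated_sq_grad, learning_rate=0.01, epsilon=1e-8):
    accumulated_sq_grad = accumulated_sq_grad + grads ** 2          # running sum of squared gradients
    adjusted_grad = grads / np.sqrt(accumulated_sq_grad + epsilon)
    return params - learning_rate * adjusted_grad, accumulated_sq_grad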
RMSprop Optimizer
❑RMSProp maintains a moving average of the squared gradients for each weight.
❑RMSprop is an improvement over AdaGrad by introducing a decay factor to prevent the rapid
decrease in learning rate
❑Update Rule
accumulated_squared_gradient = decay_rate * accumulated_squared_gradient + (1 - decay_rate) * gradient^2
adjusted_gradient = gradient / sqrt(accumulated_squared_gradient + epsilon)
parameters = parameters - Learning Rate * adjusted_gradient
❑Hyperparameters
❑Learning rate: Controls the overall step size.
❑Decay rate: Controls the influence of past gradients on the accumulated squared gradient.
❑Epsilon: Small constant to prevent division by zero.
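A minimal NumPy sketch of the RMSprop update rule above (the learning rate, decay rate, and epsilon are illustrative defaults):

import numpy as np

def rmsprop_step(params, grads, avg_sq_grad, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
    # moving average of squared gradients instead of AdaGrad's ever-growing sum
    avg_sq_grad = decay_rate * avg_sq_grad + (1 - decay_rate) * grads ** 2
    adjusted_grad = grads / np.sqrt(avg_sq_grad + epsilon)
    return params - learning_rate * adjusted_grad, avg_sq_grad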
Adam Optimizer
❑Combines RMSprop and Momentum
❑Adam incorporates the best aspects of both RMSprop and momentum.
❑Offers a balance between adaptation and momentum.
❑Update Rules:
first_moment = beta1 * previous_first_moment + (1 - beta1) * gradient
second_moment = beta2 * previous_second_moment + (1 - beta2) * gradient^2
first_moment_corrected = first_moment / (1 - beta1^t)
second_moment_corrected = second_moment / (1 - beta2^t)
parameters = parameters - Learning Rate * first_moment_corrected / (sqrt(second_moment_corrected) + epsilon)
❑Hyperparameters
❑Learning rate: Controls the overall step size.
❑Beta1: Decay rate for the first moment (momentum).
❑Beta2: Decay rate for the second moment (RMSprop).
❑Epsilon: Small constant to prevent division by zero.
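A minimal NumPy sketch of one Adam step following the update rules above; t is the step counter starting at 1, and the hyperparameter defaults are the commonly used values:

import numpy as np

def adam_step(params, grads, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * grads          # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grads ** 2     # second moment (RMSprop term)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    params = params - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return params, m, v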
Comparison of Optimization Algorithms
❑Animation of 5 gradient descent methods
on a surface
❑gradient descent (cyan)
❑GD with momentum (magenta)
❑AdaGrad (white)
❑RMSProp (green)
❑Adam (blue)
❑Left well is the global minimum; right well
is a local minimum.
PRACTICAL SESSION
COMPARISON OF OPTIMIZATION ALGORITHMS
Comparison of Optimization Algorithms
Algorithm | Tuning Parameters | Optimization Time | Strengths | Weaknesses
Gradient Descent | Learning rate | High for large datasets | Simple, deterministic | Slow, prone to local minima, sensitive to learning rate
Stochastic Gradient Descent (SGD) | Learning rate, batch size | Fast | Fast, can escape local minima | Noisy updates, less stable convergence
Gradient Descent with Momentum | Learning rate, momentum factor | Faster than GD | Accelerates convergence, helps escape local minima | Sensitive to learning rate and momentum factor
Gradient Descent with Nesterov Momentum | Learning rate, momentum factor | Faster than GD with momentum | Improves convergence over standard momentum | Sensitive to hyperparameters, complex to implement
Comparison of Optimization Algorithms
Algorithm | Tuning Parameters | Optimization Time | Strengths | Weaknesses
AdaGrad | Learning rate, epsilon | Can be slow | Adapts learning rate per parameter | Can be sensitive to learning rate; learning rate might become too small
RMSprop | Learning rate, decay rate, epsilon | Faster than AdaGrad | Adapts learning rate per parameter, often faster convergence | Sensitive to learning rate and decay rate
Adam | Learning rate, beta1, beta2, epsilon | Fast | Combines advantages of RMSprop and momentum, efficient | More complex, might require careful hyperparameter tuning
Reading
❑Deep Learning Book – Section 8.3 & 8.5
❑#4. A Beginner’s Guide to Gradient Descent in Machine Learning | by Yennhi95zz | Medium
❑Intro to optimization in deep learning: Gradient Descent ([Link])
❑[Link]
❑A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam) |
by Lili Jiang | Towards Data Science
❑[Link]
❑[Link]
Regularization techniques
Regularization in Neural Networks
❑Overfitting: The model learns training data too well, including noise.
❑Poor performance on new data.
❑Regularization: Penalizes complex models.
❑Encourages simpler models.
❑Improves generalization (Enhances model performance on unseen data)
❑Goal: Balance between fitting training data and generalizing to new data.
How Regularization Works
❑Early Stopping: A Regularization Technique
❑Divide data into training, validation, and test sets.
❑Train model on training data, evaluate on validation set.
❑Stop training if validation performance stops improving for a set number of epochs.
❑Prevents model from becoming too complex.
❑Result: Simpler model, better generalization
Regularization Methods
Method | Mechanism | How
L1 and L2 Regularization | Weight decay | Penalizes large weights, promotes sparsity (L1) or smoother weights (L2)
Early Stopping | Model optimization | Stops training before overfitting
Dataset Augmentation | Data-based | Increases data variability with transformations
Parameter Tying and Sharing | Model architecture | Reduces model complexity by sharing weights
Ensemble Methods | Model combination | Combines predictions from multiple models
Dropout | Model training | Randomly drops neurons during training
Batch Normalization | Model training | Normalizes inputs to improve training stability
Regularization Methods
L1 and L2 Regularization
Early Stopping
Dataset Augmentation
Parameter Tying and Sharing
Ensemble Methods
Dropout
Batch Normalization
L1 Regularization (Lasso)
❑Add a Penalty Term to the loss function
❑Penalty term
❑ The absolute value of the magnitude of coefficients.
❑Effect
❑ Encourages sparsity, meaning some coefficients become zero.
❑ This effectively performs feature selection by eliminating irrelevant features.
❑Loss function
Loss = Original Loss + λ * Σ |coefficients|
where λ is the regularization parameter controlling the strength of the penalty.
❑Advantages:
❑ Feature selection
❑ Interpretable models
❑ Robust to outliers
❑Disadvantages:
❑ Less stable than L2 for feature selection
❑ Can be computationally expensive
L2 Regularization (Ridge)
❑Penalty term
❑The square of the magnitude of coefficients.
❑Effect
❑Shrinks coefficients towards zero but doesn't force them to be exactly zero.
❑Loss function
Loss = Original Loss + λ * Σ (coefficients)^2
where λ is the regularization parameter controlling the strength of the penalty.
❑Advantages
❑Improves model stability
❑Handles multicollinearity better
❑Generally faster to compute
❑Disadvantages
❑Doesn't perform feature selection
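A minimal NumPy sketch of how the two penalty terms are added to an original loss (the λ value is illustrative):

import numpy as np

def l1_regularized_loss(original_loss, weights, lam=0.01):
    # L1 (Lasso): penalty is the sum of absolute weights
    return original_loss + lam * np.sum(np.abs(weights))

def l2_regularized_loss(original_loss, weights, lam=0.01):
    # L2 (Ridge): penalty is the sum of squared weights
    return original_loss + lam * np.sum(weights ** 2)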
L1 , L2 Comparison
Feature | L1 Regularization (Lasso) | L2 Regularization (Ridge)
Penalty term | Absolute value of coefficients | Square of coefficients
Effect on coefficients | Drives some coefficients to zero | Shrinks coefficients towards zero
Feature selection | Yes | No
Model complexity | Reduces complexity | Reduces complexity
Computational cost | Higher | Lower
Stability | Less stable | More stable
❑When to Use Which
❑L1 regularization
❑ preferred when you suspect that only a few features are important and you want to identify them.
❑ useful for feature selection in high-dimensional datasets.
❑L2 regularization
❑ preferred when you believe that all features contribute to the prediction and you want to improve model stability and prevent overfitting.
❑Elastic Net
❑ A combination of L1 and L2 regularization can be used to benefit from both techniques.
Early Stopping
❑Stops training before model overfits to training data
❑How it Works
❑Split data into training, validation, and test sets
❑Train model on training set
❑Evaluate model on validation set after each epoch
❑Stop training when validation performance decreases
❑Key Points
❑Crucial for iterative models (e.g., gradient descent)
❑Validation set quality is critical
❑Can be combined with other regularization techniques
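A minimal sketch of the patience-based loop described above; train_one_epoch and validation_loss stand in for the model-specific training and evaluation steps:

def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one pass over the training set
        val_loss = validation_loss()           # evaluate on the validation set
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0     # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # validation stopped improving: stop training
    return best_val_loss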
Dataset Augmentation
❑Goal
❑ Increase training data size and diversity to prevent overfitting.
❑Method
❑ Create variations of existing data points.
❑Common Augmentation Techniques
❑ Image data: Rotation, flipping, cropping, scaling, color jittering, noise addition, adding random transformations.
❑ Text data: Synonym replacement, backtranslation, random insertion/deletion of words, creating sentence variations.
❑ Audio data: Adding background noise, changing pitch or speed, time stretching, augmenting with similar audio clips.
❑Benefits
❑ Improved model performance
❑ Reduced overfitting
❑ Cost-effective (no new data collection)
❑Challenges
❑ Augmentation quality (maintain label relevance)
❑ Computational cost
❑ Augmentation strategy (data-type specific)
❑ Data imbalance
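A minimal NumPy sketch of two of the image augmentations listed above, horizontal flipping and noise addition; images are assumed to be arrays of shape height x width x channels with values in [0, 1], and the noise scale is illustrative:

import numpy as np

def horizontal_flip(image):
    return image[:, ::-1, :]                   # flip the width axis

def add_gaussian_noise(image, scale=0.05):
    noisy = image + np.random.normal(0.0, scale, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)            # keep pixel values in the valid range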
Parameter Tying and Sharing
❑Reduce model complexity by sharing parameters.
❑Tying
❑ Set specific parameters equal.
❑ Example: Tie weights across time steps in RNNs.
❑Sharing
❑ Encourage similar parameters.
❑ Example: Use L2 regularization to penalize differences between weights in CNN filters.
❑Benefits
❑ Fewer parameters
❑ improved generalization
❑ reduced computational cost.
❑Common uses
❑ CNNs, RNNs, language models.
❑Challenges
❑ Careful design
❑ potential loss of representational power.
Ensemble Methods
Ensemble Methods as Regularization
◦ Combines multiple models to improve performance and reduce overfitting.
◦ Introduces diversity among models, reducing reliance on any single model.
◦ Averaging or voting predictions helps smooth out noise and outliers.
◦ Can effectively reduce model complexity despite using multiple models.
Specific Examples
◦ Bagging: Creates diverse models by sampling data with replacement, reducing
variance.
◦ Boosting: Sequentially builds models, focusing on correcting errors, improving
generalization.
Ensemble Regularization vs. Other Techniques
◦ Differs from traditional regularization (L1/L2, dropout) by combining models
instead of modifying individual models.
◦ Complements other techniques for enhanced performance.
Key Points
◦ Powerful technique for improving model robustness and accuracy.
◦ Leverages diversity among models to reduce overfitting.
◦ Effectively combines multiple models for better generalization.
Ensemble Methods Cont…
Dropout
What is Dropout?
◦ Randomly drops neurons during training
How it Works
◦ Randomly selects neurons to deactivate for each training iteration
◦ Forces remaining neurons to learn more robust features
◦ Reduces reliance on any specific neuron
How to Implement
◦ Add a dropout layer to the neural network
◦ Specify dropout rate (percentage of neurons to drop)
◦ Scale outputs of remaining neurons
Benefits
◦ Improves generalization
◦ Reduces overfitting
◦ Acts as an ensemble method
◦ Can be combined with other regularization techniques
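A minimal NumPy sketch of inverted dropout during training (rate is the fraction of neurons dropped; outputs are scaled so their expected value is unchanged):

import numpy as np

def dropout(activations, rate=0.5, training=True):
    if not training or rate == 0.0:
        return activations                     # no dropout at inference time
    keep_prob = 1.0 - rate
    mask = np.random.rand(*activations.shape) < keep_prob   # randomly keep neurons
    return activations * mask / keep_prob      # inverted dropout scaling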
Batch Normalization
❑Internal Covariate Shift:
❑Imagine training a neural network to classify flowers. Some mini-batches contain red rose buds, while others have
fully bloomed roses of various colors.
❑Ideally, each mini-batch should have a similar distribution of features (colors, shapes, etc.).
❑However, if the mini-batches differ significantly (like our two subsets of flowers), it leads to covariate shift—a
challenge during training.
❑What Is Batch Normalization?
❑Batch Normalization (BN) addresses covariate shift by normalizing the activations of each hidden layer.
❑It ensures that the mean and variance of activations remain stable during training.
❑BN operates on a mini-batch of data, not individual examples.
❑How Does It Work?
❑For each layer’s output, BN:
❑ Computes the batch mean and standard deviation.
❑ Normalizes the activations by subtracting the mean and dividing by the standard deviation.
❑ Scales and shifts the normalized values using learnable parameters (gamma and beta).
❑This process stabilizes training, prevents vanishing/exploding gradients, and speeds up convergence.
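A minimal NumPy sketch of the batch-norm forward pass described above for a mini-batch of shape (batch_size, num_features); gamma and beta are the learnable scale and shift parameters:

import numpy as np

def batch_norm_forward(x, gamma, beta, epsilon=1e-5):
    mean = x.mean(axis=0)                          # per-feature batch mean
    var = x.var(axis=0)                            # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + epsilon)    # normalize
    return gamma * x_hat + beta                    # scale and shift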
WORKOUT
WRITE NOTE ON L1 AND L2 REGULARIZATION TECHNIQUES
Thank You