AIT401: FOUNDATIONS OF DEEP LEARNING
Module: 2
PREPARED BY: ASHA ROSE THOMAS
AP,CSE(AI),ASIET,KALADY
Syllabus
Module 2: Training deep models
Introduction, setup and initialization - Kaiming, Xavier weight initializations,
Vanishing and exploding gradient problems, Optimization techniques - Gradient
Descent (GD), Stochastic GD, GD with momentum, GD with Nesterov
momentum, AdaGrad, RMSProp, Adam. Regularization Techniques - L1 and L2
regularization, Early stopping, Dataset augmentation, Parameter tying and
sharing, Ensemble methods, Dropout, Batch normalization.
Training deep models
Steps of Neural Network Training
1. Data Preparation:
•Data Collection: Gather relevant data for the problem.
•Data Cleaning: Handle missing values, outliers, and inconsistencies.
•Data Preprocessing: Normalize, standardize, or encode features as required.
•Data Splitting: Divide data into training, validation, and test sets.
2. Model Architecture:
•Define Network Structure: Choose the number of layers, neurons per layer, and activation
functions.
•Select Loss Function: Determine how to measure the error between predicted and actual values.
•Choose Optimizer: Select an algorithm to update network weights (e.g., Gradient Descent, Adam,
SGD).
3. Initialization:
•Assign Weights: Initialize weights and biases with random or specific values.
Steps of Neural Network Training Cont..
4. Training:
•Forward Propagation: Input data passes through the network to produce an output.
•Loss Calculation: Compute the difference between predicted and actual values.
•Backpropagation: Calculate gradients of the loss function with respect to weights.
•Vanishing and exploding gradient problems can occur
•Weight Update: Adjust weights using the optimizer and calculated gradients.
•Regularization: Techniques like L1/L2 regularization or dropout to prevent overfitting.
•Repeat: Iterate through the training dataset multiple times (epochs).
5. Validation:
•Evaluate Model: Assess performance on the validation set to prevent overfitting.
•Hyperparameter Tuning: Adjust hyperparameters based on validation results.
6. Testing:
•Final Evaluation: Measure the model's performance on the unseen test set.
7. Deployment (Optional):
•Integrate Model: Deploy the trained model into a production environment.
Initialization in Neural Networks
❑What is Initialization?
❑Initialization refers to assigning initial values to a neural network's parameters (weights and biases).
❑These parameters significantly impact the network's learning process.
❑Fan-in and fan-out
❑ Fan-in is the number of input connections to a neuron.
❑ Fan-out is the number of output connections from a neuron.
Importance of Initialization
❑Prevents Symmetry
❑If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights,
they will always get exactly the same gradient
❑ So they can never learn to be different features
❑We break symmetry by initialising the weights to have small random values.
❑Controls Gradient Flow
❑maintain gradients within a reasonable range
❑facilitating effective learning by preventing vanishing or exploding gradients.
❑Accelerates Convergence
❑the network converges faster to an optimal solution
❑Because it is starting the optimization process from a promising region.
Probability distributions
❑Uniform Distribution
❑Lower bound (a) and upper bound (b).
❑Equal probability: Any value within the range a to
b has the same chance of being generated.
❑Standard Normal Distribution
❑A probability distribution with a bell-shaped
curve.
❑Central tendency: Most generated numbers
cluster around the mean (0).
❑Spread: The likelihood of generating a number
decreases as it moves farther from the mean,
determined by the standard deviation (1)
Common Initialization Techniques
❑Random Initialization
❑Assign random values to weights and biases.
❑Helps break symmetry, allowing the network to explore different solutions.
❑Random Uniform Initialization
❑Draws each weight w from a uniform distribution within the range -x to +x
❑The user decides the range (the x value) of the uniform distribution
❑Typically x is a small value (e.g. 0.1), so weights are centred around 0
❑Random Normal Initialization
❑Draws each weight w from a normal distribution with mean = 0 and standard deviation σ
❑The user can decide the σ value of the normal distribution
❑A common choice is to set σ to a small value like 0.01
Xavier/ Glorot Initialization
❑Activation function-specific initialization
❑Designed for tanh and sigmoid activation functions
❑Scales the weights based on the fan-in and fan-out of a layer
❑Method
❑Uniform Xavier initialization
❑Draws each weight w from a uniform distribution within the range -x to +x, where
❑ x is decided based on fan-in and fan-out
❑Normal Xavier initialization
❑Draws each weight w from a normal distribution with mean = 0 and standard deviation σ, where
❑ σ is decided based on fan-in and fan-out
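For reference, the commonly used Glorot/Xavier scales are x = sqrt(6 / (fan_in + fan_out)) for the uniform variant and σ = sqrt(2 / (fan_in + fan_out)) for the normal variant. A minimal NumPy sketch under that assumption (function names are illustrative):

import numpy as np

def xavier_uniform(fan_in, fan_out):
    # Glorot/Xavier uniform: W ~ U(-x, +x), x = sqrt(6 / (fan_in + fan_out))
    x = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-x, x, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out):
    # Glorot/Xavier normal: W ~ N(0, sigma^2), sigma = sqrt(2 / (fan_in + fan_out))
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.normal(0.0, sigma, size=(fan_in, fan_out))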
Kaiming (He) Initialization
❑Designed for ReLU and variants of ReLU activations
❑Method
❑Uniform He initialization
❑ Draws each weight w from a uniform distribution within the range -x to +x, where
❑ x is decided based on fan-in
❑Normal He initialization
❑ Draws each weight w from a normal distribution with mean = 0 and standard deviation σ, where
❑ σ is decided based on fan-in
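For reference, the commonly used He/Kaiming scales are x = sqrt(6 / fan_in) for the uniform variant and σ = sqrt(2 / fan_in) for the normal variant. A minimal NumPy sketch under that assumption (function names are illustrative):

import numpy as np

def he_uniform(fan_in, fan_out):
    # He/Kaiming uniform: W ~ U(-x, +x), x = sqrt(6 / fan_in)
    x = np.sqrt(6.0 / fan_in)
    return np.random.uniform(-x, x, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He/Kaiming normal: W ~ N(0, sigma^2), sigma = sqrt(2 / fan_in)
    sigma = np.sqrt(2.0 / fan_in)
    return np.random.normal(0.0, sigma, size=(fan_in, fan_out))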
PRACTICAL SESSION
INITIALIZATION COMPARISON
Home work
❑Xavier initialization works better for activation functions like tanh and sigmoid
❑Why is it more suitable for these functions compared to plain random initialization?
❑He initialization works better for the ReLU activation function
❑Why is it more suitable for ReLU compared to plain random initialization?
Vanishing & Exploding Gradients
(Figure: the chain rule in backpropagation, contrasting a node that can learn from the loss function with one that cannot, illustrating the vanishing and exploding gradient cases)
Vanishing Gradients Problem
❑Gradients become increasingly small as they propagate backward through the network.
❑Difficulty in updating weights of earlier layers.
❑Common in deep networks with sigmoid or tanh activations.
❑Slows down training or prevents convergence.
❑Impact
❑This hinders learning in deep neural networks.
❑Prevent the network from reaching optimal performance.
❑Mitigation Techniques
❑Careful weight initialization (Xavier, Kaiming)
❑Gradient clipping: gradients are rescaled to a maximum threshold during backpropagation (see the sketch after this list)
❑Batch normalization: normalizes the activations within each mini-batch during training
❑LSTM/GRU for recurrent networks
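A minimal NumPy sketch of gradient clipping by global norm, one common variant of the technique listed above (the max_norm threshold is an illustrative value):

import numpy as np

def clip_by_global_norm(gradients, max_norm=1.0):
    # gradients: list of NumPy arrays, one per parameter tensor
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm            # rescale so the norm equals max_norm
        gradients = [g * scale for g in gradients]
    return gradients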
Exploding Gradients Problem
❑Gradients become excessively large during backpropagation.
❑Leads to unstable training and divergence.
❑Often caused by large weights or improper initialization.
❑Can result in NaN values.
❑Impact:
❑Both problems hinder learning in deep neural networks.
❑Prevent the network from reaching optimal performance.
❑Mitigation Techniques:
❑Careful weight initialization (Xavier, Kaiming)
❑Gradient clipping
❑Batch normalization
❑LSTM/GRU for recurrent networks
Reference
1. Goodfellow, I., Bengio,Y., and Courville, A., Deep Learning, MIT Press, 2016.
2. Neural Networks and Deep Learning, Aggarwal, Charu C., c Springer International
Publishing AG, part of Springer Nature 2018
Reference
[Link]
[Link]
[Link]
Weight Initialization for Deep Feedforward Neural Networks ([Link])
Optimization techniques
The Optimization Problem
❑Objective: Minimize the loss
function (error between predicted
and actual values)
❑Challenge: The loss landscape is
often complex, with multiple local minima
❑Goal: Find the global minimum
or a sufficiently good local
minimum
❑Visualization: A complex, hilly
terrain with many valleys
Loss Functions - Revisiting
❑Loss function has multiple local minima
❑Finding the global minimum is
challenging
❑High Dimensionality:
❑Parameter space is vast and complex
❑Visualization is difficult
Complexity of Loss function
❑3D representation of a high
dimensional function
❑VGG-56 deep network's loss
function on the CIFAR-10 dataset
Optimization techniques of Neural
Network
❑Gradient Descent (GD)
❑Stochastic Gradient Descent
❑Gradient Descent with momentum
❑Gradient Descent with Nesterov momentum
❑AdaGrad
❑RMSProp
❑Adam
Gradient Descent: Finding the Lowest
Point
❑Objective: Find the minimum of a function
(the loss function in ML)
❑Analogy: Imagine a blindfolded hiker trying
to reach the bottom of a valley
❑Gradient: Measures the steepness of the
function at a given point
❑Gradient Descent: Iteratively moves in the
opposite direction of the gradient to reach
the lowest point
Gradient Descent Algorithm
❑Initialization:
❑Randomly initialize parameters (weights)
❑Set learning rate (step size)
❑Iteration:
❑Calculate the gradient of the loss function with
respect to the parameters
❑Update parameters by subtracting the learning
rate multiplied by the gradient
❑Termination:
❑Stop when the change in parameters is small or a
maximum number of iterations is reached
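A minimal sketch of the loop above on a toy problem, assuming the loss L(w) = (w - 3)^2 so its gradient is 2(w - 3); the learning rate and tolerance are illustrative values:

# Gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3
w = 0.0                     # initial parameter
learning_rate = 0.1         # step size
for step in range(1000):
    gradient = 2 * (w - 3)
    update = learning_rate * gradient
    w = w - update
    if abs(update) < 1e-6:  # terminate when the change in w becomes tiny
        break
print(w)                    # ends up very close to 3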
Challenges and Improvements
❑Batch Gradient Descent
❑Gradient descent computes the gradient using the entire training dataset in each iteration.
❑It calculates the average gradient over all training examples and updates the parameters
❑Ensures stability during training.
❑Challenges:
❑Can get stuck in local minima
❑Can be computationally expensive for large datasets.
❑May converge slowly for noisy or redundant data.
Stochastic Gradient Descent
❑Massive Datasets
❑Modern machine learning models often train on enormous datasets.
❑Computational Bottleneck
❑Calculating gradients for the entire dataset in each iteration is computationally expensive.
❑Solution
❑Stochastic Gradient Descent (SGD) offers a faster alternative by processing one training example at a
time.
Stochastic Gradient Descent: One Step at
a Time
❑Random Selection: SGD picks a random
training example from the dataset.
❑Gradient Calculation: Computes the
gradient based on this single example.
❑Parameter Update: Updates model
parameters using the calculated gradient.
❑Iterative Process: Repeats the process for
multiple epochs.
SGD - Advantages
❑Speed: Significantly faster than batch and mini-batch gradient descent due to processing only one example.
❑Escaping Local Minima: The stochastic nature can help the model escape local optima.
❑ SGD uses a single random data point for each update, introducing noise into the gradient calculation
❑ This noise allows SGD to explore different regions of the loss landscape, increasing the chance of escaping local minima.
❑ While SGD is more likely to escape local minima compared to batch gradient descent, it's not guaranteed.
❑Challenges: Noisy updates can lead to more fluctuations in the loss function compared to batch GD.
❑Hyperparameter Tuning: Learning rate needs careful adjustment.
Mini-Batch Gradient Descent
❑Compromise between Batch and Stochastic GD: Processes
data in small batches (subset of training data).
❑Faster than Batch GD: Reduces computation time compared to
using the entire dataset.
❑Smoother than Stochastic GD: Less noisy updates than SGD
due to averaging gradients within a batch.
❑Efficient for large datasets: Handles large datasets effectively.
❑Commonly used in deep learning: Preferred optimization
algorithm due to its balance of speed and stability.
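A minimal NumPy sketch of one epoch of mini-batch gradient descent for a toy linear-regression model; batch_size = 1 recovers SGD and batch_size = len(X) recovers batch GD (the data and hyperparameters are illustrative):

import numpy as np

# Toy data: y = 2x + noise
X = np.random.randn(1000)
y = 2 * X + 0.1 * np.random.randn(1000)

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 32

indices = np.random.permutation(len(X))        # shuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]  # small random subset
    error = (w * X[batch] + b) - y[batch]
    grad_w = 2 * np.mean(error * X[batch])     # gradient of MSE w.r.t. w
    grad_b = 2 * np.mean(error)                # gradient of MSE w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b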
PRACTICAL SESSION
PARAMETER OPTIMIZATION IN NEURAL NETWORKS -
[Link]
Basic Optimization Algorithms
These algorithms form the foundation for more complex optimization techniques.
•Gradient Descent (GD):
• Updates parameters using the gradient computed from all training data points in each iteration.
•Stochastic Gradient Descent (SGD):
• Updates parameters using the gradient computed from a single random data point.
• Faster but less stable than GD.
•Mini-batch Gradient Descent:
• Updates parameters using the gradient computed from a small random subset of data points.
• Balances speed and stability between GD and SGD.
GD improvement
These algorithms build upon the basic concepts of GD and incorporate additional techniques for
improved performance.
•Gradient Descent with Momentum
•Nesterov Accelerated Gradient (NAG)
Gradient Descent with Momentum
❑Analogy
❑Imagine a ball rolling down a hill. The ball's momentum helps it overcome small bumps and continue
rolling towards the bottom.
Gradient Descent with Momentum
❑Gradient Descent
❑Update rule
❑ parameters = parameters - (Learning Rate * gradient)
❑Momentum
❑Update rule:
❑ velocity = β * previous velocity + Learning Rate * gradient
❑ parameters = parameters - velocity
❑Momentum factor (β) controls the influence of past gradients.
❑The velocity term acts like the ball's momentum, carrying information about previous steps.
❑This helps the algorithm gain speed in the correct direction and overcome obstacles (like local
minima).
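A minimal NumPy sketch of the momentum update rule above (β and the learning rate are illustrative defaults):

import numpy as np

def momentum_step(params, grads, velocity, learning_rate=0.01, beta=0.9):
    # velocity is an exponentially decaying accumulation of past gradients
    velocity = beta * velocity + learning_rate * grads
    params = params - velocity
    return params, velocity

# usage: params, velocity = momentum_step(params, grads, np.zeros_like(params))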
Gradient Descent with Momentum
Cont..
❑Hyperparameter Tuning: The momentum factor (β) needs to be tuned.
❑Typical values: 0.8 or 0.9
❑High values can lead to overshooting, while low values approach standard gradient descent.
What will be the impact of setting the momentum factor (β) = 0?
Gradient Descent with Nesterov momentum -
Nesterov Accelerated Gradient (NAG)
❑NAG is an enhancement over standard momentum.
❑Key Idea
❑Instead of calculating the gradient at the current position, it calculates the gradient at a point ahead in
the direction of the current momentum.
❑By looking ahead, NAG can anticipate the direction of the parameters.
❑Update Rules
lookahead = parameters - β * previous velocity
gradient = gradient of loss function at lookahead point
velocity = β * previous velocity + Learning Rate * gradient
parameters = parameters - velocity
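A minimal sketch of the NAG update rules above; grad_fn is an assumed helper that returns the gradient of the loss at the given parameter values:

import numpy as np

def nesterov_step(params, velocity, grad_fn, learning_rate=0.01, beta=0.9):
    lookahead = params - beta * velocity       # look ahead along the momentum direction
    grads = grad_fn(lookahead)                 # gradient evaluated at the lookahead point
    velocity = beta * velocity + learning_rate * grads
    return params - velocity, velocity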
Adaptive Learning Rate Algorithms
❑Story So far
❑Traditional gradient descent uses a universal fixed learning rate
❑Fixed learning rate struggles with diverse parameters.
❑Solution
❑Adaptive learning rate algorithms adjust the learning rate for each parameter based on
historical gradients.
❑How it works
❑ Tracks past gradients for each parameter.
❑ Adjusts the learning rate based on this history.
❑ Treats each parameter as an individual learner.
❑ Examples: AdaGrad, RMSprop, Adam
AdaGrad (Adaptive Gradient) Optimizer
❑AdaGrad adapts the learning rate for each parameter individually based on the sum of historical
squared gradients.
❑Accumulates the sum of squared gradients for each parameter.
❑Divides the gradient by the square root of the accumulated sum.
❑The AdaGrad Update Rule:
❑For each weight (parameter) in the model, AdaGrad maintains a running sum of squared gradients.
❑Update Rule:
accumulated_squared_gradient = accumulated_squared_gradient + gradient^2
adjusted_gradient = gradient / sqrt(accumulated_squared_gradient + epsilon)
parameters = parameters - Learning Rate * adjusted_gradient
❑Hyperparameters:
❑Learning rate: Controls the overall step size.
❑Epsilon: Small constant to prevent division by zero.
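A minimal NumPy sketch of the AdaGrad update rule above (the learning rate and epsilon are illustrative defaults):

import numpy as np

def adagrad_step(params, grads, accumulated_sq_grad, learning_rate=0.01, epsilon=1e-8):
    accumulated_sq_grad = accumulated_sq_grad + grads ** 2          # running sum of squared gradients
    adjusted_grad = grads / np.sqrt(accumulated_sq_grad + epsilon)
    return params - learning_rate * adjusted_grad, accumulated_sq_grad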
RMSprop Optimizer
❑RMSProp maintains a moving average of the squared gradients for each weight.
❑RMSprop is an improvement over AdaGrad by introducing a decay factor to prevent the rapid
decrease in learning rate
❑Update Rule
accumulated_squared_gradient = decay_rate * accumulated_squared_gradient + (1 - decay_rate) * gradient^2
adjusted_gradient = gradient / sqrt(accumulated_squared_gradient + epsilon)
parameters = parameters - Learning Rate * adjusted_gradient
❑Hyperparameters
❑Learning rate: Controls the overall step size.
❑Decay rate: Controls the influence of past gradients on the accumulated squared gradient.
❑Epsilon: Small constant to prevent division by zero.
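A minimal NumPy sketch of the RMSprop update rule above (the learning rate, decay rate, and epsilon are illustrative defaults):

import numpy as np

def rmsprop_step(params, grads, avg_sq_grad, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
    # moving average of squared gradients instead of AdaGrad's ever-growing sum
    avg_sq_grad = decay_rate * avg_sq_grad + (1 - decay_rate) * grads ** 2
    adjusted_grad = grads / np.sqrt(avg_sq_grad + epsilon)
    return params - learning_rate * adjusted_grad, avg_sq_grad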
Adam Optimizer
❑Combines RMSprop and Momentum
❑Adam incorporates the best aspects of both RMSprop and momentum.
❑Offers a balance between adaptation and momentum.
❑Update Rules:
first_moment = beta1 * previous_first_moment + (1 - beta1) * gradient
second_moment = beta2 * previous_second_moment + (1 - beta2) * gradient^2
first_moment_corrected = first_moment / (1 - beta1^t)
second_moment_corrected = second_moment / (1 - beta2^t)
parameters = parameters - Learning Rate * first_moment_corrected / (sqrt(second_moment_corrected) + epsilon)
❑Hyperparameters
❑Learning rate: Controls the overall step size.
❑Beta1: Decay rate for the first moment (momentum).
❑Beta2: Decay rate for the second moment (RMSprop).
❑Epsilon: Small constant to prevent division by zero.
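A minimal NumPy sketch of one Adam step following the update rules above; t is the step counter starting at 1, and the hyperparameter defaults are the commonly used values:

import numpy as np

def adam_step(params, grads, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * grads          # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grads ** 2     # second moment (RMSprop term)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    params = params - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return params, m, v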
Comparison of Optimization Algorithms
❑Animation of 5 gradient descent methods
on a surface
❑gradient descent (cyan)
❑GD with momentum (magenta)
❑AdaGrad (white)
❑RMSProp (green)
❑Adam (blue)
❑Left well is the global minimum; right well
is a local minimum.
PRACTICAL SESSION
COMPARISON OF OPTIMIZATION ALGORITHMS
Comparison of Optimization Algorithms
Algorithm | Tuning Parameters | Optimization Time | Strengths | Weaknesses
Gradient Descent | Learning rate | High for large datasets | Simple, deterministic | Slow, prone to local minima, sensitive to learning rate
Stochastic Gradient Descent (SGD) | Learning rate, batch size | Fast | Fast, can escape local minima | Noisy updates, less stable convergence
Gradient Descent with Momentum | Learning rate, momentum factor | Faster than GD | Accelerates convergence, helps escape local minima | Sensitive to learning rate and momentum factor
Gradient Descent with Nesterov Momentum | Learning rate, momentum factor | Faster than GD with momentum | Improves convergence over standard momentum | Sensitive to hyperparameters, complex to implement
Comparison of Optimization Algorithms
Algorithm | Tuning Parameters | Optimization Time | Strengths | Weaknesses
AdaGrad | Learning rate, epsilon | Can be slow | Adapts learning rate per parameter | Can be sensitive to learning rate; learning rate might become too small
RMSprop | Learning rate, decay rate, epsilon | Faster than AdaGrad | Adapts learning rate per parameter, often faster convergence | Sensitive to learning rate and decay rate
Adam | Learning rate, beta1, beta2, epsilon | Fast | Combines advantages of RMSprop and momentum, efficient | More complex, might require careful hyperparameter tuning
Reading
❑Deep Learning Book – Section 8.3 & 8.5
❑#4. A Beginner’s Guide to Gradient Descent in Machine Learning | by Yennhi95zz | Medium
❑Intro to optimization in deep learning: Gradient Descent ([Link])
❑[Link]
❑A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam) |
by Lili Jiang | Towards Data Science
❑[Link]
❑[Link]
Regularization techniques
Regularization in Neural Networks
❑Overfitting: The model learns training data too well, including noise.
❑Poor performance on new data.
❑Regularization: Penalizes complex models.
❑Encourages simpler models.
❑Improves generalization (Enhances model performance on unseen data)
❑Goal: Balance between fitting training data and generalizing to new data.
How Regularization Works
❑Early Stopping: A Regularization Technique
❑Divide data into training, validation, and test sets.
❑Train model on training data, evaluate on validation set.
❑Stop training if validation performance stops improving for a set number of epochs.
❑Prevents model from becoming too complex.
❑Result: Simpler model, better generalization
Regularization Methods
Method | Mechanism | How
L1 and L2 Regularization | Weight decay | Penalizes large weights, promotes sparsity (L1) or smoother weights (L2)
Early Stopping | Model optimization | Stops training before overfitting
Dataset Augmentation | Data-based | Increases data variability with transformations
Parameter Tying and Sharing | Model architecture | Reduces model complexity by sharing weights
Ensemble Methods | Model combination | Combines predictions from multiple models
Dropout | Model training | Randomly drops neurons during training
Batch Normalization | Model training | Normalizes inputs to improve training stability
Regularization Methods
L1 and L2 Regularization
Early Stopping
Dataset Augmentation
Parameter Tying and Sharing
Ensemble Methods
Dropout
Batch Normalization
L1 Regularization (Lasso)
❑Add a Penalty Term to the loss function
❑Penalty term
❑ The absolute value of the magnitude of coefficients.
❑Effect
❑ Encourages sparsity, meaning some coefficients become zero.
❑ This effectively performs feature selection by eliminating irrelevant features.
❑Loss function
Loss = Original Loss + λ * Σ |coefficients|
where λ is the regularization parameter controlling the strength of the penalty.
❑Advantages:
❑ Feature selection
❑ Interpretable models
❑ Robust to outliers
❑Disadvantages:
❑ Less stable than L2 for feature selection
❑ Can be computationally expensive
L2 Regularization (Ridge)
❑Penalty term
❑The square of the magnitude of coefficients.
❑Effect
❑Shrinks coefficients towards zero but doesn't force them to be exactly zero.
❑Loss function
Loss = Original Loss + λ * Σ (coefficients)^2
where λ is the regularization parameter controlling the strength of the penalty.
❑Advantages
❑Improves model stability
❑Handles multicollinearity better
❑Generally faster to compute
❑Disadvantages
❑Doesn't perform feature selection
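A minimal NumPy sketch of how the two penalty terms are added to an original loss (the λ value is illustrative):

import numpy as np

def l1_regularized_loss(original_loss, weights, lam=0.01):
    # L1 (Lasso): penalty is the sum of absolute weights
    return original_loss + lam * np.sum(np.abs(weights))

def l2_regularized_loss(original_loss, weights, lam=0.01):
    # L2 (Ridge): penalty is the sum of squared weights
    return original_loss + lam * np.sum(weights ** 2)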
L1 , L2 Comparison
Feature | L1 Regularization (Lasso) | L2 Regularization (Ridge)
Penalty term | Absolute value of coefficients | Square of coefficients
Effect on coefficients | Drives some coefficients to zero | Shrinks coefficients towards zero
Feature selection | Yes | No
Model complexity | Reduces complexity | Reduces complexity
Computational cost | Higher | Lower
Stability | Less stable | More stable
❑When to Use Which
❑L1 regularization
❑ preferred when you suspect that only a few features are important and you want to identify them.
❑ useful for feature selection in high-dimensional datasets.
❑L2 regularization
❑ preferred when you believe that all features contribute to the prediction and you want to improve model stability and prevent overfitting.
❑Elastic Net
❑ A combination of L1 and L2 regularization can be used to benefit from both techniques.
Early Stopping
❑Stops training before model overfits to training data
❑How it Works
❑Split data into training, validation, and test sets
❑Train model on training set
❑Evaluate model on validation set after each epoch
❑Stop training when validation performance decreases
❑Key Points
❑Crucial for iterative models (e.g., gradient descent)
❑Validation set quality is critical
❑Can be combined with other regularization techniques
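A minimal sketch of the patience-based loop described above; train_one_epoch and validation_loss stand in for the model-specific training and evaluation steps:

def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one pass over the training set
        val_loss = validation_loss()           # evaluate on the validation set
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0     # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # validation stopped improving: stop training
    return best_val_loss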
Dataset Augmentation
❑Goal
❑ Increase training data size and diversity to prevent overfitting.
❑Method
❑ Create variations of existing data points.
❑Common Augmentation Techniques
❑ Image data: Rotation, flipping, cropping, scaling, color jittering, noise addition, adding random transformations.
❑ Text data: Synonym replacement, backtranslation, random insertion/deletion of words, creating sentence variations.
❑ Audio data: Adding background noise, changing pitch or speed, time stretching, augmenting with similar audio clips.
❑Benefits
❑ Improved model performance
❑ Reduced overfitting
❑ Cost-effective (no new data collection)
❑Challenges
❑ Augmentation quality (maintain label relevance)
❑ Computational cost
❑ Augmentation strategy (data-type specific)
❑ Data imbalance
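A minimal NumPy sketch of two of the image augmentations listed above, horizontal flipping and noise addition; images are assumed to be arrays of shape height x width x channels with values in [0, 1], and the noise scale is illustrative:

import numpy as np

def horizontal_flip(image):
    return image[:, ::-1, :]                   # flip the width axis

def add_gaussian_noise(image, scale=0.05):
    noisy = image + np.random.normal(0.0, scale, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)            # keep pixel values in the valid range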
Parameter Tying and Sharing
❑Reduce model complexity by sharing parameters.
❑Tying
❑ Set specific parameters equal.
❑ Example: Tie weights across time steps in RNNs.
❑Sharing
❑ Encourage similar parameters.
❑ Example: Use L2 regularization to penalize differences between weights in CNN filters.
❑Benefits
❑ Fewer parameters
❑ improved generalization
❑ reduced computational cost.
❑Common uses
❑ CNNs, RNNs, language models.
❑Challenges
❑ Careful design
❑ potential loss of representational power.
Ensemble Methods
Ensemble Methods as Regularization
◦ Combines multiple models to improve performance and reduce overfitting.
◦ Introduces diversity among models, reducing reliance on any single model.
◦ Averaging or voting predictions helps smooth out noise and outliers.
◦ Can effectively reduce model complexity despite using multiple models.
Specific Examples
◦ Bagging: Creates diverse models by sampling data with replacement, reducing
variance.
◦ Boosting: Sequentially builds models, focusing on correcting errors, improving
generalization.
Ensemble Regularization vs. Other Techniques
◦ Differs from traditional regularization (L1/L2, dropout) by combining models
instead of modifying individual models.
◦ Complements other techniques for enhanced performance.
Key Points
◦ Powerful technique for improving model robustness and accuracy.
◦ Leverages diversity among models to reduce overfitting.
◦ Effectively combines multiple models for better generalization.
Ensemble Methods Cont…
Dropout
What is Dropout?
◦ Randomly drops neurons during training
How it Works
◦ Randomly selects neurons to deactivate for each training iteration
◦ Forces remaining neurons to learn more robust features
◦ Reduces reliance on any specific neuron
How to Implement
◦ Add a dropout layer to the neural network
◦ Specify dropout rate (percentage of neurons to drop)
◦ Scale outputs of remaining neurons
Benefits
◦ Improves generalization
◦ Reduces overfitting
◦ Acts as an ensemble method
◦ Can be combined with other regularization techniques
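A minimal NumPy sketch of inverted dropout during training (rate is the fraction of neurons dropped; outputs are scaled so their expected value is unchanged):

import numpy as np

def dropout(activations, rate=0.5, training=True):
    if not training or rate == 0.0:
        return activations                     # no dropout at inference time
    keep_prob = 1.0 - rate
    mask = np.random.rand(*activations.shape) < keep_prob   # randomly keep neurons
    return activations * mask / keep_prob      # inverted dropout scaling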
Batch Normalization
❑Internal Covariate Shift:
❑Imagine training a neural network to classify flowers. Some mini-batches contain red rose buds, while others have
fully bloomed roses of various colors.
❑Ideally, each mini-batch should have a similar distribution of features (colors, shapes, etc.).
❑However, if the mini-batches differ significantly (like our two subsets of flowers), it leads to covariate shift—a
challenge during training.
❑What Is Batch Normalization?
❑Batch Normalization (BN) addresses covariate shift by normalizing the activations of each hidden layer.
❑It ensures that the mean and variance of activations remain stable during training.
❑BN operates on a mini-batch of data, not individual examples.
❑How Does It Work?
❑For each layer’s output, BN:
❑ Computes the batch mean and standard deviation.
❑ Normalizes the activations by subtracting the mean and dividing by the standard deviation.
❑ Scales and shifts the normalized values using learnable parameters (gamma and beta).
❑This process stabilizes training, prevents vanishing/exploding gradients, and speeds up convergence.
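A minimal NumPy sketch of the batch-norm forward pass described above for a mini-batch of shape (batch_size, num_features); gamma and beta are the learnable scale and shift parameters:

import numpy as np

def batch_norm_forward(x, gamma, beta, epsilon=1e-5):
    mean = x.mean(axis=0)                          # per-feature batch mean
    var = x.var(axis=0)                            # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + epsilon)    # normalize
    return gamma * x_hat + beta                    # scale and shift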
WORKOUT
WRITE NOTE ON L1 AND L2 REGULARIZATION TECHNIQUES
Thank You