Deep Learning
Subject Code – EC37T
Course Pre-requisite: EC37P
Dr. Nayana Mahajan
Module II: Convolutional Neural Networks
(CNNs)
Basics of CNNs (Convolution, Pooling, Padding, Stride)
Modern Deep Learning Architectures: LeNet: Architecture,
AlexNet: Architecture
Advanced Architectures: ResNet, DenseNet, EfficientNet
Transfer Learning and Fine-tuning CNNs
Applications: Image Classification, Object Detection
ResNet, DenseNet, and EfficientNet are all advanced
convolutional neural network (CNN) architectures that have
significantly impacted the field of computer vision.
ResNet addresses the vanishing gradient problem in deep
networks through residual connections,
DenseNet enhances feature reuse through dense
connections, and
EfficientNet achieves state-of-the-art accuracy and efficiency
by uniformly scaling network depth, width, and resolution.
In very deep neural networks, gradients (error signals) become very
small as they are backpropagated through many layers →
vanishing gradient problem.
This makes training very deep networks difficult because earlier
layers hardly get updated.
Residual connections (skip connections) in ResNet solve this by:
Allowing the input of a layer (identity mapping) to be directly added
to its output.
This creates a shortcut path for gradients, so even if
intermediate layers shrink the gradient, it can still flow backward
through the skip connection without vanishing.
Thus, ResNet can train networks with hundreds of layers
effectively.
ResNet (Residual Network):
ResNet (Residual Network) is a deep learning
architecture that addresses the vanishing gradient
problem and enables the training of very deep neural
networks.
It introduces skip connections, also known as shortcuts,
that allow gradients to flow more directly through the
network during backpropagation, facilitating efficient
learning even in very deep networks.
ResNet architecture
How it Works
1. Skip Connections:
ResNet utilizes skip connections (shortcut connections) that
allow the input of a block to be directly added to the output
of the block after multiple convolutional layers, effectively
creating a shortcut path for gradient flow.
2. Gradient Flow:
This shortcut helps mitigate the vanishing gradient problem,
where gradients during backpropagation become extremely
small, hindering learning in very deep networks.
How it Works
3. Residual Learning:
By learning a "residual" function, the network learns the
difference between the input and output of the bypassed
layers.
This simplifies the task of learning, especially when layers
might not be contributing significant new information.
The approach behind this network is that, instead of having the
layers learn the underlying mapping H(x) directly, the network
is allowed to fit a residual mapping. So, instead of learning the
initial mapping H(x), the layers fit
F(x) = H(x) - x, the "residual", which is usually smaller and
easier to learn.
The network then combines this residual F(x) with the input x using
a shortcut or skip connection,
which gives H(x) = F(x) + x.
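A minimal sketch of one such residual block in Keras is shown below (the filter counts and input shape are illustrative, not taken from the original ResNet code):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                   # identity mapping x
    # F(x): two 3x3 convolutions that learn the residual
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # H(x) = F(x) + x: add the shortcut, then apply the final non-linearity
    out = layers.Add()([y, shortcut])
    return layers.ReLU()(out)

inputs = tf.keras.Input(shape=(56, 56, 64))        # input must already have 64 channels
outputs = residual_block(inputs, filters=64)       # so the addition shapes match
model = tf.keras.Model(inputs, outputs)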
ResNet architecture (a typical ResNet-50
style).
1. Input Layer
The input image (e.g., 224×224×3 for ImageNet dataset).
2. Zero Padding
Pads the input image with zeros to maintain spatial
dimensions before convolution.
3. Initial Convolution + BN + ReLU
Conv 7×7 with 64 filters: Large receptive field to
capture low-level features.
Batch Normalization (BN): Normalizes activations,
stabilizing and speeding up training.
ReLU Activation: Introduces non-linearity.
4. Max Pooling
Reduces spatial dimensions (downsampling).
Retains the most important features.
5. Residual Blocks (ID Blocks and Conv Blocks)
Conv Block: Used when input and output dimensions differ
→ applies convolution in the shortcut path.
ID Block (Identity Block): Shortcut path is unchanged
(input = output shape), the identity connection simply adds
the input to the output.
The stacking here (for ResNet-50) is organised into stages, each
beginning with a Conv Block followed by Identity Blocks:
2 ID Blocks
3 ID Blocks
5 ID Blocks
2 ID Blocks
These stages correspond to the deep residual layers of ResNet.
6. Average Pooling (7×7)
Global Average Pooling reduces each feature map to a
single number by averaging → creates a compact feature
vector.
7. Flatten
Converts pooled feature maps into a 1D vector.
8. Fully Connected Layer
Final classification layer (e.g., 1000-way softmax for
ImageNet).
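For reference, the full ResNet-50 described above is available pre-built in Keras Applications; a minimal usage sketch (using the ImageNet defaults) is:

import tensorflow as tf

# 7x7 stem convolution -> max pooling -> four stages of conv/identity blocks
# -> global average pooling -> 1000-way softmax classifier
model = tf.keras.applications.ResNet50(weights="imagenet",
                                       input_shape=(224, 224, 3))
model.summary()   # prints the block structure stage by stage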
Benefits of ResNet
Training very deep networks:
ResNet can be used to build deep convolutional neural
networks with a significantly greater number of layers
than traditional networks, while achieving excellent
results.
Mitigating the vanishing gradient problem:
The skip connections in ResNet help in maintaining
stable gradient flow, allowing for more effective training
of deeper networks.
DenseNet (Densely Connected Convolutional
Network):
DenseNet, or Densely Connected Convolutional
Network, is a deep learning architecture for
convolutional neural networks (CNNs) that
revolutionized image classification by directly connecting
each layer to every subsequent layer within a block.
This dense connectivity pattern offers advantages like
improved feature propagation, reduced vanishing
gradients, and parameter efficiency.
How DenseNet Came into the Picture
In deep learning, convolutional neural networks (CNNs) are the
cornerstone for many vision-based tasks.
However, as networks became deeper, researchers faced two
significant challenges:
Vanishing Gradients: As gradients backpropagate through deeper
layers, they diminish, making the network difficult to train.
Redundancy in Feature Maps: Many layers in deep networks
learn repetitive features, leading to inefficiencies in computation and
memory usage.
Redundancy in Feature Maps in Deep
Networks
In traditional deep convolutional neural networks
(CNNs), such as VGG or ResNet, each layer receives
input from the previous layer’s output.
The output of each layer is a set of feature maps —
representations of learned features (e.g., edges, textures,
or more abstract patterns).
Key Features and Structure:
Transition Layers:
These layers, often including 1x1 convolutions and pooling,
reduce feature map dimensions between dense blocks.
Growth Rate (k):
This parameter controls the number of feature maps added
by each layer within a dense block.
Bottleneck Layers (Optional):
1x1 convolutions can be used before 3x3 convolutions to
reduce computational complexity.
Key Features and Structure:
Feature Reuse:
The dense connections facilitate extensive feature reuse,
leading to more compact and efficient models.
Reduced Vanishing Gradients:
By providing shorter paths for gradient flow, DenseNets
mitigate the vanishing gradient problem.
How it works:
1.Input:
The input image or feature map is fed into the first layer of a
dense block.
2. Convolutional Layers:
Each layer in the dense block performs convolutional
operations.
3. Concatenation:
The output of each convolutional layer is concatenated with
all of the feature maps it received as input and passed on to
the next layer within the block.
How it works:
4.Transition Layers:
These layers downsample the feature maps and reduce
the number of channels before passing the output to the
next dense block.
5. Output:
The final output is typically passed through a global
average pooling layer and a softmax classifier for image
classification.
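A minimal sketch of a dense block and a transition layer in Keras (the growth rate, layer count, and compression factor are illustrative choices):

import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    for _ in range(num_layers):
        # each layer adds `growth_rate` new feature maps ...
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        # ... which are concatenated with everything computed so far
        x = layers.Concatenate()([x, y])
    return x

def transition_layer(x, compression=0.5):
    # 1x1 convolution reduces channels, pooling halves the spatial size
    channels = int(x.shape[-1] * compression)
    x = layers.Conv2D(channels, 1)(x)
    return layers.AveragePooling2D(2)(x)

inputs = tf.keras.Input(shape=(32, 32, 24))
x = dense_block(inputs)        # 24 + 4 * 12 = 72 channels
x = transition_layer(x)        # 36 channels, 16 x 16 spatial size
model = tf.keras.Model(inputs, x)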
Advantages:
Improved Feature Reuse:
Dense connections encourage feature reuse, leading to
more efficient and compact models.
Reduced Vanishing Gradients:
Dense connections provide shorter paths for gradient flow
during backpropagation, mitigating the vanishing gradient
problem.
Advantages:
Better Accuracy:
DenseNets have achieved state-of-the-art performance on
various image classification tasks.
Parameter Efficiency:
Due to feature reuse, DenseNets can achieve higher
accuracy with fewer parameters compared to traditional
CNNs.
Growth Rate (k)
The growth rate ( k ) is a critical hyperparameter in
DenseNet.
It defines the number of feature maps each layer in a
dense block produces.
A larger growth rate means more information is added
at each layer, but it also increases the computational cost.
The choice of k affects the network's capacity and
performance.
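As an illustrative example: if a dense block starts with k0 = 64 input channels and uses a growth rate of k = 32, then layer l inside the block receives k0 + k × (l − 1) feature maps as input, so the 6th layer already sees 64 + 32 × 5 = 224 channels while adding only 32 new ones itself.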
A deep DenseNet with three dense blocks. The layers between two adjacent
blocks are referred to as transition layers and change feature-map sizes via
convolution and pooling.
The DenseNet architecture is based on a series of dense
blocks, each containing multiple convolutional layers.
Within a dense block, each layer takes as input the
concatenated outputs of all of the preceding layers in
that block.
This creates a dense connectivity pattern between the
layers of the network, allowing information to flow
more efficiently through the network.
Advantages of DenseNet
Reduced Vanishing Gradient Problem: Dense
connections improve gradient flow and facilitate the
training of very deep networks.
Feature Reuse: Each layer has access to all preceding
layers' feature maps, promoting the reuse of learned
features and enhancing learning efficiency.
Advantages of DenseNet
Fewer Parameters: DenseNets often have fewer
parameters compared to traditional CNNs with similar
depth due to efficient feature reuse.
Improved Accuracy: DenseNets have shown high
accuracy on various benchmarks, such as ImageNet and
CIFAR.
Limitations of DenseNet
High Memory Consumption: Dense connections
increase memory usage due to the storage requirements
for feature maps, making DenseNet less practical for
devices with limited memory.
Computational Complexity: The extensive
connectivity leads to increased computational demands,
resulting in longer training times and higher
computational costs, which may not be ideal for real-time
applications.
Limitations of DenseNet
Implementation Complexity: Managing and
concatenating a large number of feature maps adds
complexity to the implementation, requiring careful
tuning of hyperparameters and regularization
techniques to maintain performance and stability.
Risk of Overfitting: Although DenseNet reduces
overfitting through better feature reuse, there is still a
risk, particularly if the network is not properly
regularized or if the training data is insufficient.
Applications of DenseNet
DenseNet is versatile and can be applied to various tasks in
computer vision, including:
Image Classification: DenseNet's ability to extract rich
feature representations makes it suitable for image
classification tasks.
Object Detection: DenseNet can be used as a backbone
for object detection networks, providing detailed feature
maps for accurate detection.
Semantic Segmentation: DenseNet's dense connections
help in capturing fine details, making it effective for
semantic segmentation tasks.
EfficientNet
EfficientNet is a family of convolutional neural networks
(CNNs) and a scaling method designed for efficient model
size and computational cost while maintaining high accuracy.
It utilizes a compound scaling method to uniformly scale
depth, width, and resolution using a compound coefficient.
This approach contrasts with traditional methods that often
scale these factors arbitrarily.
Model scaling can be achieved in three ways: by
increasing model depth, width, or image resolution.
Depth (d): Scaling network depth is the most
commonly used method. The idea is simple: a deeper
ConvNet captures richer and more complex features
and also generalizes better. However, this solution comes
with a problem: the vanishing gradient problem.
Depth scaling
Width (w): This is used in smaller models. Widening a
model allows it to capture more fine-grained features.
However, very wide but shallow models struggle to
capture higher-level features.
Image resolution (r): Higher resolution images enable
the model to capture more fine-grained patterns.
Previous models used 224 x 224 size images, and newer
models tend to use a higher resolution. However, higher
resolution also leads to increased computation
requirements.
Resolution Scaling
What is EfficientNet?
EfficientNet proposes a simple and highly effective
compound scaling method, which enables it to easily
scale up a baseline ConvNet to any target resource
constraints, in a more principled and efficient way.
What is Compound Scaling?
The creators of EfficientNet observed that the different scaling dimensions
(depth, width, image size) are not independent.
High-resolution images require deeper networks to capture large-scale
features with more pixels. Additionally, wider networks are needed to
capture the finer details present in these high-resolution images.
To pursue better accuracy and efficiency, it is critical to balance all
dimensions of network width, depth, and resolution during ConvNet
scaling.
Scaling CNNs using a fixed set of ratios, rather than arbitrarily, yields
better results. This is what compound scaling does.
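Concretely, the EfficientNet paper fixes a set of constants α, β, γ (found by a small grid search on the baseline network) and scales all three dimensions together with a single compound coefficient φ:
depth: d = α^φ, width: w = β^φ, resolution: r = γ^φ,
subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1.
Because FLOPS grow roughly with d · w² · r², increasing φ by one approximately doubles the total computation.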
The advantage of compound scaling can be visualized
using an activation map.
Key Features:
Compound Scaling:
EfficientNet uniformly scales the network's depth, width, and
resolution using a compound coefficient.
MBConv Blocks:
It employs Mobile Inverted Bottleneck Convolution (MBConv) layers,
which are a variant of depthwise separable convolutions and
inverted residual blocks.
Squeeze-and-Excitation (SE) Optimization:
EfficientNet incorporates SE blocks to further enhance
model performance by recalibrating channel-wise feature
responses.
Key Features:
Inverted Bottleneck Design:
The inverted bottleneck structure increases the number
of channels in each block, improving the network's
capacity without significantly increasing computational
complexity.
Efficient Scaling:
By uniformly scaling all dimensions (depth, width, and
resolution), EfficientNet achieves better accuracy with
fewer parameters and computations compared to other
CNN architectures.
How it Works:
1. Baseline Architecture (EfficientNet-B0):
The foundation of EfficientNet is the EfficientNet-B0,
which is based on MobileNetV2's inverted bottleneck
residual blocks and SE blocks.
2. Compound Coefficient:
A small grid search determines the optimal values for
alpha (depth), beta (width), and gamma (resolution) based
on the baseline model.
How it Works:
3. Scaling Up:
When more computational resources are available, the
network depth is increased by a factor of α^φ, the width by β^φ,
and the image resolution by γ^φ, where α, β, and γ are the scaling
coefficients and φ is the compound coefficient.
4. EfficientNet Variants:
The EfficientNet family includes various models (B0 to B7)
that are scaled versions of the baseline B0, each with
different computational requirements and accuracy levels.
Benefits
Improved Accuracy:
EfficientNet achieves state-of-the-art accuracy on image
classification tasks.
Computational Efficiency:
It requires fewer parameters and computations compared to
other CNN architectures.
Real-time Applications:
The efficiency of EfficientNet makes it suitable for
deployment on devices with limited processing capabilities.
EfficientNet Architecture
EfficientNet-B0, discovered through Neural Architecture
Search (NAS), is the baseline model. The main
components of the architecture are:
MBConv block (Mobile Inverted Bottleneck
Convolution)
Squeeze-and-excitation optimization
The MBConv layer is a fundamental building block of the
EfficientNet architecture.
It is inspired by the inverted residual blocks from MobileNetV2
but with some modifications.
The MBConv layer starts with a point-wise (1x1) convolution that
expands the number of channels, followed by a depth-wise
convolution in the expanded space, and finally another 1x1
convolution that projects the channels back down.
This bottleneck design allows the model to learn efficiently while
maintaining a high degree of representational power.
Residual Learning
Residual Block
Inverted residual block
However, an inverted residual block starts by expanding the
input feature map into a higher-dimensional space using a
1×1 convolution, then applies a depthwise convolution in this
expanded space, and finally uses another 1×1 convolution
that projects the feature map back to a lower-dimensional
space, the same as the input dimension.
The “inverted” aspect comes from this expansion of
dimensionality at the beginning of the block and reduction at
the end, which is opposite to the traditional approach where
expansion happens towards the end of the residual block.
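A minimal sketch of such an inverted residual (MBConv-style) block in Keras, with the SE stage omitted for brevity (the expansion factor and shapes are illustrative):

import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual_block(x, out_channels, expansion=6, stride=1):
    in_channels = x.shape[-1]
    # 1) 1x1 convolution expands to a higher-dimensional space
    y = layers.Conv2D(in_channels * expansion, 1, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("swish")(y)
    # 2) depthwise convolution operates in the expanded space
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("swish")(y)
    # 3) 1x1 convolution projects back down to the output dimension
    y = layers.Conv2D(out_channels, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # residual connection only when input and output shapes match
    if stride == 1 and in_channels == out_channels:
        y = layers.Add()([x, y])
    return y

inputs = tf.keras.Input(shape=(112, 112, 16))
outputs = inverted_residual_block(inputs, out_channels=16)
model = tf.keras.Model(inputs, outputs)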
Inverted Residual Block
In addition to MBConv layers, EfficientNet incorporates
the SE block, which helps the model learn to focus on
essential features and suppress less relevant ones.
The SE block uses global average pooling to reduce the
spatial dimensions of the feature map to a single value per
channel, followed by two fully connected layers.
What is Squeeze-and-Excitation?
Squeeze-and-Excitation (SE) simply allows the model to
emphasize useful features, and suppress the less useful
ones. We perform this in two steps:
Squeeze: This phase aggregates the spatial dimensions
(width and height) of the feature maps across each
channel into a single value, using global average pooling.
This results in a compact feature descriptor that
summarizes the global distribution for each channel,
reducing each channel to a single scalar value.
What is Squeeze-and-Excitation?
Excitation: In this step, fully connected layers applied
after the squeeze step produce a collection of per-channel
weights (activations or scores). The final step is to apply
these learned importance scores to the original input
feature map, channel-wise, effectively scaling each channel
by its corresponding score.
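A minimal sketch of a squeeze-and-excitation block in Keras (the reduction ratio and shapes are illustrative):

import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=4):
    channels = x.shape[-1]
    # Squeeze: global average pooling gives one scalar per channel
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: two fully connected layers produce per-channel weights in [0, 1]
    s = layers.Dense(channels // reduction, activation="swish")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    # Scale: reweight each channel of the input by its learned importance score
    return layers.Multiply()([x, s])

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = se_block(inputs)
model = tf.keras.Model(inputs, outputs)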
Squeeze-and-Excitation block
What is the Swish Activation Function?
Swish is a smooth continuous function, unlike Rectified
Linear Unit (ReLU) which is a piecewise linear function.
Swish allows a small number of negative weights to be
propagated through, while ReLU thresholds all negative
weights to zero.
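Concretely, swish(x) = x · sigmoid(β·x); EfficientNet uses the β = 1 form, swish(x) = x · sigmoid(x) (also known as SiLU), which is smooth everywhere and lets small negative values pass through instead of clipping them to zero as ReLU does.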
ResNet addresses the vanishing gradient problem,
DenseNet enhances feature reuse,
EfficientNet optimizes model scaling for improved
accuracy and efficiency.
In machine learning and deep learning, two common
methods for using pre-trained models are transfer
learning and fine-tuning.
They allow you to borrow the knowledge of existing
models to make your own models smarter.
To simplify, think of transfer learning and fine-tuning as
ways to make your own models better by using what
other models already know.
Transfer learning and fine-tuning
Transfer learning and fine-tuning are powerful techniques in
machine learning, especially when working with
Convolutional Neural Networks (CNNs).
Transfer learning leverages a pre-trained model on a large
dataset (like ImageNet) to improve performance on a new,
related task.
Fine-tuning goes a step further by continuing to train the pre-
trained model on the new dataset, adapting it to the specific
task.
Transfer Learning:
Concept:
Reuses a model trained on a source task (e.g., image classification
on ImageNet) to accelerate learning on a new, related target task.
How it works:
Utilizes the knowledge (feature representations) learned by the
source model, which can be beneficial for the target task.
Often involves freezing the early layers of the pre-trained model,
which typically capture generic features (edges, textures), and
training only the later layers on the target task.
Benefits:
Reduces training time and computational resources.
Improves performance, especially when the target
dataset is limited.
Useful when training from scratch is impractical due to
data scarcity or computational limitations.
Example:
Using a pre-trained ResNet50 model (trained on
ImageNet) as a starting point for classifying images of
animals.
Benefits of Fine-tuning:
Can lead to higher accuracy than feature extraction
alone by adapting the model to the nuances of the new
data.
Potentially better performance than training from
scratch, especially when dealing with limited data.
Example:
Further training a pre-trained ResNet50 model on a new
dataset of bird species, allowing it to learn specific
features related to bird anatomy.
Key Differences:
Transfer Learning (Feature Extraction):
Freezes most or all of the pre-trained model's layers and
trains only new layers added on top.
Fine-tuning:
Unfreezes some or all of the pre-trained model's layers
and retrains them along with the newly added layers.
Transfer learning provides a foundation by reusing a pre-
trained model, while fine-tuning refines that foundation
for a specific task.
Both techniques are valuable for efficient and effective
CNN model development, especially when dealing with
limited labeled data.
Transfer Learning
Transfer Learning is the re-use of a pre-trained model on a new, related
task.
It is particularly beneficial when the new task has limited labeled data and
computational resources.
It is most often discussed in deep learning, where it involves reusing a
trained deep neural network, but it can also be applied to traditional
machine learning models.
This is very useful since most problems typically do not have enough
labeled data points to train such complex models.
Why Should You Use Transfer Learning?
Transfer learning provides several advantages, including
decreased training time, enhanced neural network
performance (in many cases), and the ability to work
effectively with limited data.
Training a neural model from the ground up usually
requires substantial data, which may not always be
available. Transfer learning in CNN addresses this
challenge effectively.
Transfer learning in CNNs leverages pre-trained models
to achieve strong performance with limited training data;
the same idea is central in fields like natural language
processing, where models are pre-trained on vast datasets.
It reduces training time significantly compared to building
complex models from scratch, which can take days or
weeks.
Steps to Use Transfer Learning
When annotated data is insufficient for training, leveraging a pre-
trained model from TensorFlow trained on similar tasks is beneficial.
Restoring the model and retraining specific layers allows adaptation
to your task.
Transfer learning in deep learning relies on general features learned
in the initial task, applicable to new tasks.
Ensure the model’s input size matches the original training
conditions for effective transfer.
Training a Model to Reuse it
If you lack data for training Task A with a deep neural network, consider
finding a related Task B with ample data.
Train your deep neural network on Task B and transfer the learned
model to solve Task A.
Depending on your problem, you may use the entire model or specific
layers. For consistent inputs, you can reuse the model for predictions.
Alternatively, adjust and retrain task-specific layers and the output layer
as needed.
Using a Pre-Trained Model
The second option is to employ a model that has already been trained.
There are a number of these models out there, so do some research
beforehand.
You determine the number of layers to reuse and retrain based on the
task.
The most popular application of this form of transfer learning is deep
learning.
Using a Pre-Trained Model
Keras includes nine pre-trained models that can be used for transfer
learning, prediction, and fine-tuning.
These models, along with quick guides on how to use them, are
available in the Keras Applications documentation.
Many research institutions also make trained models accessible.
Extraction of Features
Extraction of Features in Neural Networks
Neural networks can learn which features are important
and which are not. For complex tasks that would otherwise
require a great deal of manual feature engineering, a
representation learning algorithm can quickly find a good
combination of features.
The learned representation can then be applied to a variety of
other problems.
Extraction of Features in Neural Networks
Use the initial layers of the network for feature representation
and drop the task-specific output layer: data is passed through
an intermediate layer, and its activations are used as a
representation of the raw input. This approach is popular in
computer vision, where the extracted features can reduce dataset
size and make traditional algorithms more efficient.
After freezing the pre-trained layers, we add new layers on top of
the pre-trained model to adapt it to the new task.
These new layers, referred to as the “classifier,” are responsible for
making predictions specific to our task (e.g., classifying different
types of flowers).
Initially, these new layers have random weights.
During training, we feed the input data through the pre-trained
layers to extract features.
These extracted features are then passed to the new classifier
layers, which learn to map these features to the correct output for
the new task.
The weights of these new layers are updated during training using
backpropagation and gradient descent, based on the error between
the predicted output and the true labels.
By training the new classifier on top of the fixed, pre-trained layers,
we effectively transfer the knowledge learned from the original task
to the new task.
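A hedged Keras sketch of this feature-extraction workflow (the flower-classification head, class count, image size, and dataset names are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

# pre-trained convolutional base (ImageNet weights, original classifier removed)
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False                     # freeze the pre-trained layers

# new classifier layers, initialised with random weights
inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)           # extract features with the frozen base
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(5, activation="softmax")(x)   # e.g. 5 flower classes

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)   # train_ds / val_ds assumed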
Why is Transfer Learning Important?
Transfer learning offers solutions to key challenges like:
◦ Limited Data: Acquiring extensive labelled data is often
challenging and costly. Transfer learning enables us to use
pre-trained models, reducing the dependency on large datasets.
◦ Enhanced Performance: Starting with a pre-trained model which
has already learned from substantial data allows for faster and more
accurate results on new tasks, which is ideal for applications needing
high accuracy and efficiency.
Time and Cost Efficiency: Transfer learning shortens
training time and conserves resources by utilizing
existing models, eliminating the need for training
from scratch.
Adaptability: Models trained on one task can be fine-
tuned for related tasks, making transfer learning versatile
for various applications, from image recognition to
natural language processing.
How Does Transfer Learning Work?
Transfer learning involves a structured process to use
existing knowledge from a pre-trained model for new tasks:
Pre-trained Model: Start with a model already trained on
a large dataset for a specific task.
This pre-trained model has learned general features and
patterns that are relevant across related tasks.
Base Model: This pre-trained model, known as the base
model, includes layers that have processed data to learn
hierarchical representations, capturing low-level to complex
features.
Transfer Layers: Identify layers within the base model that
hold generic information applicable to both the original and
new tasks.
These layers, typically the earlier layers of the network,
capture broad, reusable features.
Fine-tuning: Fine-tune these selected layers with data from
the new task.
This process helps retain the pre-trained knowledge while
adjusting parameters to meet the specific requirements of
the new task, improving accuracy and adaptability.
Low-level features learned for task A should be
beneficial for learning a model for task B.
What is Fine-Tuning?
Fine-tuning allows a pre-trained model to adapt to a new task.
This approach uses the knowledge gained from training a model on
a large dataset and applies it to a smaller, domain-specific dataset.
Fine-tuning involves adjusting the weights of the model's layers or
updating certain parts of the model to improve its performance on
the new task.
Fine-tuning is used in transfer learning, where a model
trained on one task is reused for another, similar task,
often with minimal changes.
The underlying assumption is that the model has already
learned useful features in the original task that can be
transferred and adapted to the new task hence reducing
the need for training a model from scratch.
step-by-step approach to effectively fine-tuning a
model:
Select a Pre-trained Model: Choose a pre-trained
model that aligns with your task and dataset.
Understand Model Architecture: Study the
architecture of the pre-trained model, including the
number of layers, their functionalities, and the specific
tasks they were trained on.
Determine Fine-tuning Layers: Decide which layers of the pre-
trained model to fine-tune.
Typically, earlier layers capture low-level features, while later layers
capture more high-level features.
You may choose to fine-tune only the top layers or the
entire model.
Freeze Pre-trained Layers: Freeze the weights of the pre-
trained layers that you do not want to fine-tune.
This prevents these layers from being updated
during training.
Add Task-specific Layers: Add new layers on top of the
pre-trained model to adapt it to your specific task.
These layers referred to as the “classifier,” will be responsible
for making predictions relevant to your task.
Configure Training Parameters: Set the hyperparameters
for training, including the learning rate (typically a small
learning rate for fine-tuning), batch size, and number of epochs.
These parameters may need to be adjusted based on the size
of your dataset and the complexity of your task.
Train the Model: Train the model on your dataset using
a suitable optimization algorithm, such as stochastic
gradient descent (SGD) or Adam.
During training, the weights of the unfrozen layers will be
updated to minimize the loss between the predicted
outputs and the ground truth labels.
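A hedged end-to-end sketch of these steps in Keras (the choice of base model, number of unfrozen layers, class count, learning rate, and dataset names are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

# select a pre-trained base and add new task-specific classifier layers on top
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.GlobalAveragePooling2D()(base(inputs))
outputs = layers.Dense(10, activation="softmax")(x)    # e.g. 10 target classes
model = tf.keras.Model(inputs, outputs)

# fine-tune only the top of the base; freeze the layers we do not want to update
for layer in base.layers[:-30]:            # all but roughly the last 30 layers
    layer.trainable = False

# small learning rate so the pre-trained weights are only nudged
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train; train_ds / val_ds are assumed tf.data datasets of (image, label) pairs
# model.fit(train_ds, validation_data=val_ds, epochs=5)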
How Fine-tuning Works:
1. Pre-training:
A model is first trained on a large, general-purpose
dataset to learn broad features and patterns.
2. Task-specific adaptation:
The pre-trained model is then adapted to a particular
task or dataset by training it on a smaller, task-specific
dataset.
3. Layer selection:
During fine-tuning, some layers of the pre-trained model,
usually the earlier layers, may be frozen (weights are not
updated) to preserve the model's general knowledge,
while the later layers are trained on the new task data.
4. Smaller learning rate:
A smaller learning rate is often used during fine-tuning to
avoid significantly altering the pre-trained model's
weights.
5. Evaluation and refinement:
The fine-tuned model's performance is evaluated on a
validation set, and training parameters may be adjusted to
optimize the results.
Advantages of Fine-tuning:
Efficiency:
Fine-tuning allows you to leverage existing pre-trained models
instead of training from scratch, reducing training time and
computational resources.
Improved performance:
By specializing a model on your specific task, fine-tuning can lead to
better performance than training a model from scratch, especially
on smaller datasets.
Data efficiency:
Fine-tuning allows you to achieve strong performance with smaller
datasets, as the model already has a good foundation of knowledge.
Applications of Fine-tuning:
Natural Language Processing (NLP):
Fine-tuning pre-trained language models for tasks like
sentiment analysis, question answering, or text generation.
Computer Vision:
Fine-tuning pre-trained image classification models for
specific object detection or image segmentation tasks.
Image classification:
Adapting a general image classification model to differentiate
between different breeds of dogs using labeled images of
specific breeds.
Examples of Fine-Tuning
Image Classification: A common use case for fine-
tuning is in computer vision.
A model like ResNet might be pre-trained on a large
dataset like ImageNet.
When we need to classify medical images, we can fine-
tune the model to focus on detecting relevant medical
features such as tumors, without retraining the model
from scratch.
Examples of Fine-Tuning
Natural Language Processing: In NLP, fine-tuning is
done on models like BERT or GPT.
For example, if a model is trained on general text and
needs to be used for a specific task like question-
answering or sentiment analysis, fine-tuning helps adjust
the model's knowledge to suit that particular application.
Key Differences Between Fine-Tuning and Transfer Learning
When to Use Transfer Learning vs Fine-
Tuning
Understanding when and how to use these methods can
significantly enhance the performance of machine
learning models, especially when you are working with
limited data or in scenarios where training a model from
scratch would be computationally expensive.
Use Transfer Learning when:
The new dataset is small.
The new task closely resembles the original task, for
example, classifying different types of images.
A quick solution with limited computational resources is
needed.