Module II
Dataset Preparation
Data Preprocessing
Weight Initialization
Activation Functions
Non-linearity Layer
Sigmoid
Activation Functions: ReLU
Dying ReLU problem: a large gradient flowing through a ReLU neuron can push its weights to a point where the neuron's pre-activation is negative for every input, so it never activates (and receives zero gradient) on any datapoint again.
Source: [Link]
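A minimal NumPy sketch of this failure mode (toy numbers, not from the slides): once the pre-activation is negative for every input, both the output and the gradient are zero, so no update can ever revive the neuron.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# Toy 1-D neuron: y = relu(w * x + b)
x = np.array([0.5, 1.0, 2.0])      # all inputs are positive
w, b = -3.0, -1.0                  # weights after a bad (too large) update

z = w * x + b                      # pre-activations: all negative
y = relu(z)                        # outputs: all zero -> "dead" neuron
dy_dz = (z > 0).astype(float)      # ReLU gradient: all zero

print(y, dy_dz)                    # [0. 0. 0.] [0. 0. 0.]
# With zero gradient on every datapoint, w and b never change again.
```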
Activation Functions: Parametric ReLU
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. IEEE International Conference on Computer Vision (ICCV).
Source: [Link]
Activation Functions: Maxout
The Maxout neuron (introduced by Goodfellow et al.) generalizes the ReLU and its leaky version by computing max(w1^T x + b1, w2^T x + b2).
Both ReLU and Leaky ReLU are special cases of this form (for example, ReLU is recovered with w1 = 0, b1 = 0, w2 = identity, and b2 = 0).
Activation Functions: ELU
Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)." International Conference on Learning Representations (ICLR), 2016.
Klambauer, Günter, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. "Self-normalizing neural networks." Advances in Neural Information Processing Systems (NIPS), 2017.
Activation Functions: Swish
(Figure: CIFAR-10 accuracy comparison)
abReLU (Average Biased ReLU) ensures that neurons producing values above the average of all values in that layer are never dead.
Dubey, S.R. and Chakraborty, S., "Average Biased ReLU Based CNN Descriptor for Improved Face Retrieval." arXiv preprint arXiv:1804.02051, 2018.
Activation Functions: In Practice
Source: cs231n
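For reference, a minimal NumPy sketch of the activations discussed above, using their standard textbook definitions (the alpha and beta defaults are the usual choices, not values from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); reduces to x * sigmoid(x) for beta = 1
    return x * sigmoid(beta * x)

x = np.linspace(-3, 3, 7)
print(swish(x))
```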
Data Preprocessing
Source: cs231n
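The slide's figures are not reproduced here; a sketch of the standard cs231n-style preprocessing (zero-centering, with optional normalization) on hypothetical data:

```python
import numpy as np

X = np.random.randn(100, 3072)   # hypothetical data: N examples x D features

X -= np.mean(X, axis=0)          # zero-center each feature (common for images)
X /= (np.std(X, axis=0) + 1e-8)  # optional: normalize to unit variance
```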
Weight Initialization: Gaussian
First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation).
Symmetry breaking: weights are different for different neurons.
Works ~okay for small networks, but causes problems in deeper networks: the activations of almost all neurons collapse to zero -> gradient diminishing problem.
Increasing the standard deviation to 1 does not help: almost all neurons become completely saturated at either -1 or +1 (with tanh units), so the gradients are all zero -> again a gradient diminishing problem.
Source: cs231n
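A sketch of the cs231n-style experiment behind these observations, assuming a 10-layer, 500-unit tanh network on random data; it prints each layer's activation statistics so the collapse (or, with std 1.0, the saturation) is visible:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 500)             # hypothetical input batch
for i in range(10):
    fan_in = x.shape[1]
    W = np.random.randn(fan_in, 500) * 0.01  # small Gaussian init; try 1.0 to see saturation
    x = np.tanh(x.dot(W))
    print(f"layer {i}: mean {x.mean():+.4f}, std {x.std():.4f}")
# With std 0.01 the activation std shrinks toward 0 layer by layer;
# with std 1.0 activations saturate at +/-1. Either way, gradients vanish.
```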
Weight Initialization: Xavier [Glorot et al., 2010]
Calibrate the variances with 1/sqrt(fan_in).
A reasonable initialization (the mathematical derivation assumes linear activations).
Source: cs231n
Weight Initialization: XavierImproved
For ReLU units, which zero out half of their inputs, scaling by sqrt(2/fan_in) instead keeps the activation variances stable.
Glorot, X. and Bengio, Y., "Understanding the difficulty of training deep feedforward neural networks," AISTATS 2010.
Source: cs231n
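A minimal sketch of both initializations using the standard formulas (not code from the slides):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Glorot/Xavier: variance ~ 1/fan_in (derived assuming linear/tanh units)
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

def he_init(fan_in, fan_out):
    # "XavierImproved" / He init: the extra factor of 2 compensates for ReLU
    # zeroing out half of its inputs on average
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

W1 = xavier_init(512, 256)
W2 = he_init(512, 256)
print(W1.std(), W2.std())  # ~1/sqrt(512) and ~sqrt(2/512)
```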
Things to remember
Training CNN
Activation functions: ReLU is common; Swish can be tried
Data preparation: train/val/test split
Data preprocessing: centering is common
Weight initialization: XavierImproved works well
Training
Loop:
1. Sample labeled data (batch)
2. Forward it through the network, get predictions
3. Back-propagate the errors
4. Update the network weights
Source: [Link]
Optimization
Optimization is the process of finding the set of parameters W that minimize the loss function.
Strategy #1: A first, very bad solution: random search. Simply try out many different random weights and keep track of what works best.
Strategy #2: Random local search. Start out with a random W, generate random perturbations of it, and if the loss at the perturbed W is lower, perform the update (a sketch follows below).
Strategy #3: Following the gradient. There is no need to search randomly for a good direction: the best direction to move is related to the gradient of the loss function.
Source: [Link]
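A minimal sketch of strategy #2; loss_fn here is a hypothetical stand-in for a real training loss:

```python
import numpy as np

def loss_fn(W):
    # Hypothetical stand-in for the real training loss
    return np.sum((W - 3.0) ** 2)

W = np.random.randn(10) * 0.001
best_loss = loss_fn(W)
for _ in range(1000):
    W_try = W + np.random.randn(10) * 0.001  # small random perturbation
    loss = loss_fn(W_try)
    if loss < best_loss:                     # keep only improving moves
        W, best_loss = W_try, loss
print(best_loss)
```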
Gradient Descent
The procedure of repeatedly evaluating the gradient of the loss function and then performing a parameter update.
Loop:
1. Sample a batch of data
2. Forward prop it through the graph
(network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
Source: cs231n
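In code, the loop looks roughly as below; linear regression is used as a toy problem so the gradient has a simple closed form (the setup is illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
weights = np.zeros(5)
step_size = 0.1

for step in range(100):
    # 1. Sample a batch of data
    idx = rng.integers(0, len(X), size=256)
    Xb, yb = X[idx], y[idx]
    # 2. Forward prop: predictions and loss
    preds = Xb @ weights
    loss = np.mean((preds - yb) ** 2)
    # 3. Backprop: gradient of the loss w.r.t. the weights
    weights_grad = 2 * Xb.T @ (preds - yb) / len(yb)
    # 4. Update the parameters using the gradient
    weights += -step_size * weights_grad
print(loss)  # final minibatch loss
```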
Stochastic Gradient Descent (SGD)
The same procedure, but the gradient of the loss is estimated on a minibatch of training examples rather than on the full dataset.
Source: cs231n
SGD: Problems
What if the loss changes quickly in one direction and slowly in another? SGD makes very slow progress along the shallow dimension and jitters along the steep one.
Source: cs231n
SGD: Problems
What if the loss function has a local minimum or saddle point? Zero gradient: gradient descent gets stuck.
Source: cs231n
SGD + Momentum
Build up "velocity" as a running mean of gradients, so updates follow the consistent gradient direction rather than each noisy step. Rho acts as friction; typical values are 0.9 or 0.99.
Source: cs231n
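A sketch of the momentum update; compute_gradient is a hypothetical stand-in for backprop, and rho and learning_rate are typical values:

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient of f(x) = x^2

x = np.array([5.0])
vx, rho, learning_rate = 0.0, 0.9, 0.1
for _ in range(50):
    dx = compute_gradient(x)
    vx = rho * vx + dx        # accumulate a running "velocity" of gradients
    x -= learning_rate * vx   # step along the velocity, not the raw gradient
print(x)  # converges toward the minimum at 0
```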
Nesterov Momentum
Look ahead: evaluate the gradient at the point the current velocity would carry you to, then combine it with the velocity update.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G., "On the importance of initialization and momentum in deep learning," ICML 2013.
Source: cs231n
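A sketch of Nesterov momentum in the "look-ahead" form (same hypothetical compute_gradient helper):

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient of f(x) = x^2

x = np.array([5.0])
v, rho, learning_rate = 0.0, 0.9, 0.1
for _ in range(50):
    # Gradient is evaluated at the look-ahead point x + rho * v
    v = rho * v - learning_rate * compute_gradient(x + rho * v)
    x += v
print(x)
```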
AdaGrad
Scale each parameter's step by the inverse square root of the sum of its squared historical gradients.
RMSProp
The same idea with an exponentially decaying average of squared gradients, so the effective step size does not shrink to zero over time.
Source: cs231n
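A combined sketch of both updates; the decay_rate and the 1e-7 smoothing term are the usual defaults:

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient of f(x) = x^2

x = np.array([5.0])
learning_rate, decay_rate = 0.1, 0.99
grad_squared = np.zeros_like(x)
for _ in range(100):
    dx = compute_gradient(x)
    # AdaGrad accumulates:  grad_squared += dx * dx
    # RMSProp instead decays the history:
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
print(x)  # converges toward 0
```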
Adam
Kingma, D.P. and Ba, J., "Adam: A method for stochastic optimization," ICLR 2015.
Problem (without bias correction):
Initially, second_moment = 0 and beta2 = 0.999.
After the 1st iteration, second_moment is still close to zero,
so the update takes a very large step on x.
Source: cs231n
Adam (with bias correction)
Bias correction compensates for the moments being initialized at zero: first_unbias = first_moment / (1 - beta1**t) and second_unbias = second_moment / (1 - beta2**t).
Source: cs231n
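A sketch of the full Adam update with bias correction, following the standard Kingma and Ba formulation (same hypothetical gradient helper):

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient of f(x) = x^2

x = np.array([5.0])
learning_rate, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
for t in range(1, 201):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx         # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx  # RMSProp-style scaling
    first_unbias = first_moment / (1 - beta1 ** t)    # bias correction:
    second_unbias = second_moment / (1 - beta2 ** t)  # compensates for zero init
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + eps)
print(x)
```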
AMSGrad
Some minibatches (which occur rarely) provide large and informative gradients; exponential averaging diminishes their influence, which can lead to poor convergence.
Reddi, S.J., Kale, S. and Kumar, S., "On the convergence of Adam and beyond," ICLR 2018.
[Link]
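AMSGrad's fix, sketched minimally: keep a running maximum of the second moment so that large, informative gradients are never forgotten (names mirror the Adam sketch above):

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient

x = np.array([5.0])
learning_rate, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v, v_max = np.zeros_like(x), np.zeros_like(x), np.zeros_like(x)
for t in range(1, 201):
    dx = compute_gradient(x)
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx * dx
    v_max = np.maximum(v_max, v)  # never let the denominator shrink
    x -= learning_rate * m / (np.sqrt(v_max) + eps)
print(x)
```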
Learning Rate
SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have a learning rate as a hyperparameter. In practice, start with a relatively high learning rate and decay it over time (e.g., step, exponential, or 1/t decay; a sketch follows below).
Source: cs231n
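A sketch of common decay schedules; lr0 and the decay constants are hypothetical choices:

```python
import numpy as np

lr0 = 0.1  # hypothetical initial learning rate

def step_decay(epoch, drop=0.5, every=10):
    return lr0 * drop ** (epoch // every)  # halve every 10 epochs

def exp_decay(epoch, k=0.05):
    return lr0 * np.exp(-k * epoch)        # smooth exponential decay

def inv_decay(epoch, k=0.05):
    return lr0 / (1 + k * epoch)           # 1/t decay

for e in (0, 10, 50):
    print(e, step_decay(e), exp_decay(e), inv_decay(e))
```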
Optimizer and Learning Rate
In practice: Adam is a good default choice; well-tuned SGD+Momentum with a learning rate schedule can often outperform it.
Source: cs231n
Regularization
Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature
Which W to consider? Among weight vectors that achieve the same data loss, the regularizer decides which one we prefer.
L2 regularization likes to "spread out" the weights: it prefers smaller, more diffuse weight vectors over large, peaky ones.
Source: cs231n
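A minimal sketch of adding an L2 penalty to a loss; the regularization strength is a hypothetical value:

```python
import numpy as np

def l2_regularized_loss(W, data_loss, reg=1e-3):
    # Full loss = data loss + lambda * sum of squared weights.
    # The penalty pushes the optimizer toward smaller, more diffuse W.
    return data_loss + reg * np.sum(W * W)

W = np.random.randn(10, 5)
print(l2_regularized_loss(W, data_loss=1.25))
```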
Dropout
In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.
Srivastava et al., "Dropout: A simple way to prevent neural networks from overfitting," JMLR 2014.
Dropout
How can this possibly be a good idea? Dropout is training a large ensemble of models (that share parameters).
We drop and scale at train time and don't do anything at test time (inverted dropout, sketched below).
Source: cs231n
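A sketch of inverted dropout as described above; p and the tensor shape are hypothetical:

```python
import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_forward(x, train=True):
    if not train:
        return x                               # test time: do nothing
    mask = (np.random.rand(*x.shape) < p) / p  # drop and scale at train time
    return x * mask

h = np.random.randn(4, 8)
print(dropout_forward(h).mean(), dropout_forward(h, train=False).mean())
```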
DropConnect
Drops individual connections (weights) rather than whole neurons.
Problem: do we necessarily want a zero-mean, unit-variance input?
Instance Normalization: Ulyanov et al., "Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis," CVPR 2017.
Group Normalization: Wu and He, arXiv 2018 (appeared 3/22/2018).
Decorrelated Batch Normalization: Huang et al., arXiv 2018 (appeared 4/23/2018).
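A rough sketch of how these variants differ only in which axes of an N x C x H x W activation tensor they normalize over (the group count G is a hypothetical choice; learnable scale and shift are omitted for brevity):

```python
import numpy as np

x = np.random.randn(8, 16, 32, 32)  # hypothetical activations: N, C, H, W
eps = 1e-5

def normalize(x, axes):
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

batch_norm = normalize(x, (0, 2, 3))      # per channel, across the batch
instance_norm = normalize(x, (2, 3))      # per channel, per example
N, C, H, W = x.shape
G = 4                                     # groups of channels
group_norm = normalize(x.reshape(N, G, C // G, H, W), (2, 3, 4)).reshape(x.shape)
```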
Data Augmentation
Data Augmentation (Jittering)
Horizontal flips
Random crops and scales
Source: cs231n
Data Augmentation (Jittering)
Create virtual training samples; get creative for your problem! Common jitters (two are sketched in code below):
Horizontal flip
Random crop
Color casting
Randomize contrast
Randomize brightness
Geometric distortion
Rotation
Photometric changes
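A minimal NumPy sketch of two of the jitters above; the image and crop sizes are hypothetical:

```python
import numpy as np

img = np.random.rand(64, 64, 3)  # hypothetical H x W x C image

def horizontal_flip(img, p=0.5):
    return img[:, ::-1, :] if np.random.rand() < p else img

def random_crop(img, size=56):
    h, w = img.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

aug = random_crop(horizontal_flip(img))
print(aug.shape)  # (56, 56, 3)
```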
• The best way to identify whether a model is overfitting is to compare the training and test error. Generally, training error is lower than test error, and the goal of any machine learning model is to minimize both (a) the training error and (b) the gap between training and test error. If the training error is low and the difference between training and test error is significant, the model may be in the overfitting zone.
• One can think of "overfitting" as a form of "memorization" of the training data, where the model becomes too specialized to the training data and loses its ability to generalize to new data.
Causes of Overfitting and how it can be mitigated
• There are several causes of overfitting.
• Using a complex model for the given dataset. For example, a decision tree with too many levels or a neural network with too many layers and hidden units may be more complex than is necessary to solve the problem at hand. Similarly, having too many features can also make a model complex.
• Using too few examples to train the model, as this makes it more difficult to find the underlying patterns in the data. In that case, getting more training data can help.
• Overfitting can also occur when the model is not regularized properly. L1 or L2 regularization adds a penalty term to the model's loss function that discourages it from fitting the training data too closely.
• Dropout, another regularization technique, can also be used: it randomly drops some of the neurons in a neural network during training, which helps prevent the network from memorizing the training data.
Hyperparameter
“Settings + Tuning”
Neural networks: Pros and cons
Pros
Flexible and general function approximation framework
Can build extremely powerful models by adding more layers
Cons
Hard to analyze theoretically (e.g., training is prone to local optima)
Huge amounts of training data and computing power may be required to get good performance
The space of implementation choices is huge (network architectures, parameters)