Module 2

Training deep models: Introduction, setup and initialization - Kaiming, Xavier weight initializations,
Vanishing and exploding gradient problems, Optimization techniques - Gradient Descent (GD),
Stochastic GD, GD with momentum, GD with Nesterov momentum, AdaGrad, RMSProp, Adam,
Regularization Techniques - L1 and L2 regularization, Early stopping, Dataset augmentation,
Parameter tying and sharing, Ensemble methods, Dropout, Batch normalization.

Setup and initialization


There are several important issues associated with the setup of the neural network, preprocessing, and
initialization. First, the hyperparameters of the neural network (such as the learning rates and
regularization parameters) need to be selected.
1) Tuning Hyperparameters
The term “hyperparameter” is used to specifically refer to the parameters regulating the design of the
model (like learning rate and regularization), and they are different from the more fundamental
parameters representing the weights of connections in the neural network. In the tuning phase, the
performance of the model is evaluated on a validation set for various choices of hyperparameters. This
approach ensures that the tuning process does not overfit to the training data set (which would
otherwise lead to poor test performance).
The most well-known technique for parameter selection is grid search. One issue with this procedure
is that the number of hyperparameters might be large, and the number of points in the grid increases
exponentially with the number of hyperparameters. Therefore, a commonly used trick is to first work
with coarse grids. Later, when one narrows down to a particular range of interest, finer grids are used.
A mathematically justified way of choosing hyperparameters is Bayesian optimization. However, these
methods are often too slow to use practically in large-scale neural networks and remain largely a
research curiosity. For smaller networks, it is possible to use libraries such as Hyperopt, Spearmint,
and SMAC.
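As an illustration of the coarse-to-fine idea, the following is a minimal sketch in Python; the train_and_validate(lr, reg) callable (returning validation accuracy) and the specific grid values are hypothetical placeholders, not part of the original notes.

import itertools
import numpy as np

def coarse_to_fine_search(train_and_validate):
    # Coarse grid: learning rates and regularization strengths on a log scale.
    coarse_lr = [1e-4, 1e-3, 1e-2, 1e-1]
    coarse_reg = [1e-5, 1e-3, 1e-1]
    best = max(itertools.product(coarse_lr, coarse_reg),
               key=lambda p: train_and_validate(lr=p[0], reg=p[1]))

    # Finer grid centred (in log-space) on the best coarse setting.
    fine_lr = np.logspace(np.log10(best[0]) - 0.5, np.log10(best[0]) + 0.5, 5)
    fine_reg = np.logspace(np.log10(best[1]) - 0.5, np.log10(best[1]) + 0.5, 5)
    return max(itertools.product(fine_lr, fine_reg),
               key=lambda p: train_and_validate(lr=p[0], reg=p[1]))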
2) Feature Preprocessing
The feature preprocessing methods used for neural network training are not very different from those in
other machine learning algorithms. There are two forms of feature preprocessing used in machine
learning algorithms.
1. Additive preprocessing and mean-centering: It can be useful to mean-center the data in order to
remove certain types of bias effects. A vector of column-wise means is subtracted from each data
point. A second type of pre-processing is used when it is desired for all feature values to be non-
negative. In such a case, the absolute value of the most negative entry of a feature is added to the
corresponding feature value of each data point.
2. Feature normalization: A common type of normalization is to divide each feature value by its
standard deviation. When this type of feature scaling is combined with mean-centering, the data is
said to have been standardized. The basic idea is that each feature is presumed to have been drawn
from a standard normal distribution with zero mean and unit variance.
The other type of feature normalization is useful when the data needs to be scaled in the range (0, 1).
Let min_j and max_j be the minimum and maximum values of the jth attribute. Then, each feature value
x_ij for the jth dimension of the ith point is scaled by min-max normalization as follows:

x_ij ⇐ (x_ij − min_j) / (max_j − min_j)
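Both forms of preprocessing can be written in a few lines; the sketch below uses a small made-up data matrix purely for illustration.

import numpy as np

X = np.array([[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]])   # toy data, rows = data points

# Standardization: subtract the column-wise mean, divide by the column-wise standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: scale each feature into the range (0, 1).
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))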
Whitening: Another form of feature preprocessing is referred to as whitening, in which the axis system
is rotated to create a new set of de-correlated features, each of which is scaled to unit variance. This
transformation is closely related to principal component analysis (PCA).
Let D be an n × d data matrix that has already been mean-centered. Let C be the d × d covariance
matrix of D, in which the (i, j)th entry is the covariance between dimensions i and j. Because the
matrix D is mean-centered, we have the following:

C = (D^T D) / n
The eigenvectors of the co-variance matrix provide the de-correlated directions in the data.
Furthermore, the eigenvalues provide the variance along each of the directions. Therefore, if one uses
the top-k eigenvectors (i.e., largest k eigenvalues) of the covariance matrix, most of the variance in
the data will be retained and the noise will be removed.
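A minimal PCA-whitening sketch in Python, assuming a small randomly generated data matrix for illustration:

import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 5))          # toy data matrix, n x d
D = D - D.mean(axis=0)                 # mean-centre the columns

C = D.T @ D / D.shape[0]               # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvectors give de-correlated directions

# Project onto the top-k eigenvectors and scale each direction to unit variance.
k = 3
top = np.argsort(eigvals)[::-1][:k]
D_white = (D @ eigvecs[:, top]) / np.sqrt(eigvals[top])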
3) Initialization
Initialization techniques are methods used to set the initial values of the weights and biases in a neural
network. Proper initialization is crucial for training neural networks effectively, as it can affect the
convergence speed and whether the network converges at all.
Kaiming initialization method
The Kaiming initialization method, also known as He initialization, is an initialization technique
designed for deep neural networks, particularly when using the Rectified Linear Unit (ReLU)
activation function. It addresses the issue of vanishing gradients that can occur during training by
providing appropriate initial weights. The Kaiming initialization method is named after its creator,
Kaiming He, who introduced it in a 2015 paper. This method is widely used in convolutional neural
networks (CNNs) and other deep learning architectures with ReLU activations.
The key idea behind Kaiming initialization is to set the initial weights such that the variance of the
activations remains roughly the same across layers. This helps prevent the vanishing gradient problem
that can occur with standard weight initialization methods.
This implies the following initialization scheme:

Initialize the weights with values drawn from a Gaussian (normal) distribution with a mean of 0 and a
variance of 2 / n_in, where n_in is the fan-in (number of inputs) of the layer, i.e., a standard deviation
of sqrt(2 / n_in).
Biases can be initialized to zero or to small constants.
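A minimal sketch of Kaiming initialization for one fully connected layer (the function name and shapes are illustrative, not a reference implementation):

import numpy as np

def kaiming_init(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    # Weights ~ N(0, 2 / n_in), appropriate for ReLU activations.
    W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    b = np.zeros(n_out)                 # biases start at zero
    return W, b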
Xavier/Glorot Initialization:
More sophisticated rules for initialization consider the fact that the nodes in different layers interact
with one another to contribute to output sensitivity. Let r_in and r_out respectively be the fan-in and
fan-out for a particular neuron. One suggested initialization rule, referred to as Xavier initialization or
Glorot initialization, is to use a Gaussian distribution with a standard deviation of sqrt(2 / (r_in + r_out)).
The initialization strategy involves setting the weights of each layer according to a specific
distribution, typically a uniform or normal distribution, with its parameters calculated based on the
number of input and output neurons of that layer. This initialization technique helps prevent vanishing
or exploding gradients during training, which can occur when weights are initialized too small or too
large, respectively. By scaling the initial weights based on the number of input and output neurons, it
aims to maintain a more stable training process.
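The corresponding sketch for the Gaussian form of Xavier/Glorot initialization (again, the helper name is illustrative):

import numpy as np

def xavier_init(r_in, r_out, seed=0):
    rng = np.random.default_rng(seed)
    # Weights ~ N(0, 2 / (r_in + r_out)), balancing fan-in and fan-out.
    return rng.normal(0.0, np.sqrt(2.0 / (r_in + r_out)), size=(r_in, r_out))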
The Vanishing and Exploding Gradient Problems
Deep neural networks have several stability issues associated with training. In particular, networks
with many layers may be hard to train because of the way in which the gradients in earlier and later
layers are related. In order to understand this point, let us consider a very deep network that has a
single node in each layer. We assume that there are (m + 1) layers, including the noncomputational
input layer. The weights of the edges between the various layers are denoted by w1, w2,...wm.
Furthermore, assume that the sigmoid activation function Φ(·) is applied in each layer. Let x be the
input, h_1 ... h_{m−1} be the hidden values in the various layers, and o be the final output. Let Φ′(h_t) be
the derivative of the activation function in hidden layer t, and let ∂L/∂h_t be the derivative of the loss
function with respect to the hidden activation h_t. The architecture is a simple chain of single-unit layers.
It is relatively easy to use the backpropagation update to show the following relationship:

∂L/∂h_t = Φ′(h_{t+1}) · w_{t+1} · (∂L/∂h_{t+1})
The derivative of a sigmoid with output f ∈ (0, 1) is given by f(1 − f). This value takes on its
maximum at f = 0.5, and therefore the value of Φ′(h_t) is no more than 0.25. Since the absolute value
of w_{t+1} is typically no larger than 1 with standard initializations, each backpropagation step will
(typically) cause the value of ∂L/∂h_t to be less than 0.25 times that of ∂L/∂h_{t+1}. Therefore, after
moving backwards by about r layers, this value will typically be less than 0.25^r.

Vanishing gradient
However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied
together. Thus, the gradient decreases exponentially as we propagate down to the initial layers.
a_1 = σ(w_1 x) → a_2 = σ(w_2 a_1) → a_3 = σ(w_3 a_2)

So, most neural network architectures nowadays use the rectified linear unit (ReLU) rather than the logistic
sigmoid function as an activation function in the hidden layers.
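To see the effect numerically, here is a small sketch that multiplies the per-layer sigmoid derivatives along a chain of 20 layers (the choice of 20 layers, unit weights, and the input value 0.5 are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, grad = 0.5, 1.0
for t in range(20):                      # 20 sigmoid layers in a chain
    w = 1.0                              # each layer has weight 1 for illustration
    h = sigmoid(w * x)
    grad *= w * h * (1.0 - h)            # each factor is at most 0.25
    x = h
print(grad)                              # shrinks roughly like 0.25^20, i.e., to ~1e-12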
Optimization techniques - Gradient Descent (GD)
Gradient Descent (GD) Algorithm:
Gradient Descent is a deterministic optimization algorithm that computes the gradient of the cost function with
respect to the model parameters and updates the parameters in the direction of the negative gradient to minimize
the cost.

Initialize the model parameters θ randomly or with some initial values.


Set a learning rate (α), which is a hyperparameter that determines the step size for each update.
Choose a stopping criterion (e.g., a maximum number of iterations or a minimum change in the cost function).
Repeat the following steps until the stopping criterion is met:
a. Compute the gradient of the cost function with respect to the parameters: ∇J(θ).
b. Update the parameters using the gradient and the learning rate:
θ = θ - α * ∇J(θ)
c. Repeat steps a and b for a fixed number of iterations or until convergence.
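A minimal sketch of full-batch gradient descent for a least-squares cost (the specific cost function and step count are illustrative):

import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    # Full-batch GD for J(theta) = mean((X @ theta - y)^2).
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ theta - y) / len(y)   # gradient of the cost over ALL examples
        theta = theta - alpha * grad                   # step in the negative gradient direction
    return theta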

Stochastic GD

Stochastic Gradient Descent (SGD) Algorithm:


Stochastic Gradient Descent is a variation of Gradient Descent that updates the model parameters using a single
randomly chosen training example at each iteration, making it faster and more suitable for large datasets.

Initialize the model parameters θ randomly or with some initial values.


Set a learning rate (α), which is a hyperparameter that determines the step size for each update.
Choose a stopping criterion (e.g., a maximum number of epochs or a minimum change in the cost function).
Repeat the following steps until the stopping criterion is met:
a. Shuffle the training dataset randomly (to introduce randomness).
b. Iterate through the shuffled dataset one training example at a time:
i. Compute the gradient of the cost function for the current example: ∇J(θ, x, y), where x is the input and y is the
target.
ii. Update the parameters using the gradient and the learning rate:
θ = θ - α * ∇J(θ, x, y)
c. Repeat steps a and b for a fixed number of epochs or until convergence.
SGD introduces randomness and noise into the parameter updates, which can help escape local minima and
converge faster, especially in noisy or large datasets
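For comparison with the full-batch version above, a minimal SGD sketch for the same least-squares cost (one randomly chosen example per update; the epoch count is illustrative):

import numpy as np

def sgd(X, y, alpha=0.01, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):              # shuffle, then visit one example at a time
            grad = 2.0 * X[i] * (X[i] @ theta - y[i])  # gradient for a single (x, y) pair
            theta = theta - alpha * grad
    return theta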
GD with momentum
Momentum-based techniques recognize that zigzagging is a result of highly contradictory steps that
cancel out one another and reduce the effective size of the steps in the correct (long-term) direction.
In order to understand this point, consider a setting in which one is performing gradient descent with
respect to the parameter vector W. The normal updates for gradient descent with respect to a loss
function L (defined over a mini-batch of instances) are as follows:

W ⇐ W − α · (∂L/∂W)

Here, α is the learning rate. In momentum-based descent, the update is modified with a velocity vector
V that is maintained with exponential smoothing, where β ∈ (0, 1) is a smoothing parameter:

V ⇐ β · V − α · (∂L/∂W);  W ⇐ W + V
Larger values of β help the approach pick up a consistent velocity V in the correct direction. Setting β
= 0 specializes to straightforward mini-batch gradient-descent. The parameter β is also referred to as
the momentum parameter or the friction parameter.
With momentum-based descent, the learning is accelerated, because one is generally moving in a
direction that often points closer to the optimal solution and the useless “side ways” oscillations are
muted. The basic idea is to give greater preference to consistent directions over multiple steps, which
have greater importance in the descent.
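A single momentum update step, written as a sketch; grad_fn(W) is a hypothetical helper that returns the mini-batch gradient ∂L/∂W:

def momentum_step(W, V, grad_fn, alpha=0.01, beta=0.9):
    V = beta * V - alpha * grad_fn(W)   # exponentially smoothed velocity
    W = W + V                           # move along the accumulated direction
    return W, V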
GD with Nesterov momentum
The Nesterov momentum is a modification of the traditional momentum method in which the
gradients are computed at a point that would be reached after executing a β discounted version of the
previous step again (i.e., the momentum portion of the current step). This point is obtained by
multiplying the previous update vector V with the friction parameter β and then computing the
gradient at W + βV .
Therefore, one is using a certain amount of lookahead in computing the updates. Let us denote the
loss function by L(W) at the current solution W. In this case, it is important to explicitly denote the
argument of the loss function because of the way in which the gradient is computed. Therefore, the
update may be computed as follows:

V ⇐ β · V − α · ∂L(W + β · V)/∂W;  W ⇐ W + V
Note that the only difference from the standard momentum method is in terms of where the gradient is
computed. Using the value of the gradient a little further along the previous update can lead to faster
convergence. The Nesterov method works only in mini-batch gradient descent with modest batch
sizes; using very small batches is a bad idea.
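The only change relative to the momentum sketch above is where the gradient is evaluated; grad_fn is again a hypothetical mini-batch gradient helper:

def nesterov_step(W, V, grad_fn, alpha=0.01, beta=0.9):
    V = beta * V - alpha * grad_fn(W + beta * V)   # gradient at the lookahead point
    W = W + V
    return W, V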
Parameter-Specific Learning Rates
The basic idea in the momentum methods of the previous section is to leverage the consistency in the
gradient direction of certain parameters in order to speed up the updates. This goal can also be
achieved more explicitly by having different learning rates for different parameters. The idea is that
parameters with large partial derivatives are often oscillating and zigzagging, whereas parameters
with small partial derivatives tend to be more consistent but move in the same direction.
AdaGrad
In the AdaGrad algorithm, one keeps track of the aggregated squared magnitude of the partial
derivative with respect to each parameter over the course of the algorithm. The square-root of this
value is proportional to the root-mean-square slope for that parameter (although the absolute value
will increase with the number of epochs because of successive aggregation).
Let A_i be the aggregate value for the ith parameter. Therefore, in each iteration, the following update
is performed:

A_i ⇐ A_i + (∂L/∂w_i)^2   ∀i

The update for the ith parameter w_i is as follows:

w_i ⇐ w_i − (α / √A_i) · (∂L/∂w_i)   ∀i
If desired, one can use √A_i + ε in the denominator instead of √A_i to avoid ill-conditioning. Here, ε is
a small positive value such as 10^−8.
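A vectorized sketch of one AdaGrad step, where w, A, and grad are NumPy arrays of the same shape (the function name is illustrative):

import numpy as np

def adagrad_step(w, A, grad, alpha=0.01, eps=1e-8):
    A = A + grad ** 2                          # accumulate squared partial derivatives
    w = w - alpha * grad / np.sqrt(A + eps)    # per-parameter scaled step
    return w, A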
RMSProp
The RMSProp algorithm [194] uses a similar motivation as AdaGrad for performing the “signal-to-
noise” normalization with the absolute magnitude √ Ai of the gradients. However, instead of simply
adding the squared gradients to estimate Ai , it uses exponential averaging. Since one uses averaging
to normalize rather than aggregate values, the progress is not slowed prematurely by a constantly
increasing scaling factor A_i. The basic idea is to use a decay factor ρ ∈ (0, 1), and weight the squared
partial derivatives occurring t updates ago by ρ^t. Note that this can be achieved easily by multiplying
the current squared aggregate (i.e., the running estimate) by ρ and then adding (1 − ρ) times the current
(squared) partial derivative. The running estimate is initialized to 0. This causes some (undesirable)
bias in early iterations, which disappears over the longer term. Therefore, if A_i is the exponentially
averaged value for the ith parameter w_i, we have the following way of updating A_i:

A_i ⇐ ρ · A_i + (1 − ρ) · (∂L/∂w_i)^2   ∀i
The square root of this value for each parameter is used to normalize its gradient. Then, the following
update is used with (global) learning rate α:

w_i ⇐ w_i − (α / √A_i) · (∂L/∂w_i)   ∀i
Another advantage of RMSProp over AdaGrad is that the importance of ancient (i.e., stale) gradients
decays exponentially with time. Furthermore, it can benefit by incorporating concepts of momentum
within the computational algorithm.
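A sketch of one RMSProp step in the same style as the AdaGrad sketch above (only the accumulation rule changes):

import numpy as np

def rmsprop_step(w, A, grad, alpha=0.001, rho=0.9, eps=1e-8):
    A = rho * A + (1.0 - rho) * grad ** 2      # exponentially averaged squared gradient
    w = w - alpha * grad / np.sqrt(A + eps)
    return w, A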
Adam
The Adam algorithm uses a similar “signal-to-noise” normalization as AdaGrad and RMSProp;
however, it also exponentially smooths the first-order gradient in order to incorporate momentum into
the update. It also directly addresses the bias inherent in exponential smoothing when the running
estimate of a smoothed value is unrealistically initialized to 0. As in the case of RMSProp, let A_i be
the exponentially averaged value of the ith parameter w_i. This value is updated in the same way as in
RMSProp with the decay parameter ρ ∈ (0, 1):

A_i ⇐ ρ · A_i + (1 − ρ) · (∂L/∂w_i)^2   ∀i

At the same time, an exponentially smoothed value of the gradient is maintained, for which the ith
component is denoted by F_i. This smoothing is performed with a different decay parameter ρ_f:

F_i ⇐ ρ_f · F_i + (1 − ρ_f) · (∂L/∂w_i)   ∀i

This type of exponential smoothing of the gradient with ρ_f is a variation of the momentum method
discussed earlier (which is parameterized by a friction parameter β instead of ρ_f). Then, the following
update is used with learning rate α_t in the tth iteration:

w_i ⇐ w_i − (α_t / √A_i) · F_i   ∀i

There are two key differences from the RMSProp algorithm. First, the gradient is replaced with its
exponentially smoothed value in order to incorporate momentum. Second, the learning rate α_t now
depends on the iteration index t, and is defined as follows:

α_t = α · √(1 − ρ^t) / (1 − ρ_f^t)
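Putting the above rules together, one Adam step can be sketched as follows (arrays w, A, F, and grad share one shape; the iteration index t starts at 1):

import numpy as np

def adam_step(w, A, F, grad, t, alpha=0.001, rho=0.999, rho_f=0.9, eps=1e-8):
    A = rho * A + (1.0 - rho) * grad ** 2            # smoothed squared gradient (as in RMSProp)
    F = rho_f * F + (1.0 - rho_f) * grad             # smoothed gradient (momentum)
    alpha_t = alpha * np.sqrt(1.0 - rho ** t) / (1.0 - rho_f ** t)   # bias-corrected learning rate
    w = w - alpha_t * F / np.sqrt(A + eps)
    return w, A, F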
Regularization Techniques - L1 and L2 regularization


Penalty-based regularization is the most common approach for reducing overfitting. In order to
understand this point, let us consider the example of a polynomial of degree d. In this case, the
prediction ŷ for a given value of x is as follows:

ŷ = Σ_{i=0}^{d} w_i · x^i

It is possible to use a single-layer network with d inputs and a single bias neuron with weight w_0 in
order to model this prediction. The ith input is x^i. This neural network uses linear activations, and
the squared loss function for a set of training instances (x, y) from data set D can be defined as follows:

L = Σ_{(x,y)∈D} (y − ŷ)^2
A large value of d tends to increase overfitting. One possible solution to this problem is to reduce the
value of d. In other words, using a model with economy in parameters leads to a simpler model.
Instead of reducing the number of parameters in a hard way, one can use a soft penalty on the use of
parameters. Furthermore, large (absolute) values of the parameters are penalized more than small
values, because small values do not affect the prediction significantly.

The most common choice is L2-regularization, which is also referred to as Tikhonov regularization.
In such a case, the additional penalty is defined by the sum of squares of the values of the parameters.
Then, for the regularization parameter λ > 0, one can define the objective function as follows:

L = Σ_{(x,y)∈D} (y − ŷ)^2 + λ · Σ_{i=0}^{d} w_i^2
Increasing or decreasing the value of λ reduces or increases the softness of the penalty, respectively. One advantage of this type
of parameterized penalty is that one can tune this parameter for optimum performance on a portion of
the training data set that is not used for learning the parameters. This type of approach is referred to as
model validation.
However, it is possible to use other types of penalties on the parameters. A common approach is L1-
regularization in which the squared penalty is replaced with a penalty on the sum of the absolute
magnitudes of the coefficients. Therefore, the new objective function is as follows:

L = Σ_{(x,y)∈D} (y − ŷ)^2 + λ · Σ_{i=0}^{d} |w_i|
The main problem with this objective function is that it contains the term |w_i|, which is not
differentiable when w_i is exactly equal to 0. This requires some modification of the gradient descent
method when w_i is 0. For the case when w_i is non-zero, one can use the straightforward update
obtained by computing the partial derivative.
A question arises as to whether L1- or L2-regularization is desirable. From an accuracy point of view,
L2-regularization usually outperforms L1-regularization. This is the reason that L2-regularization is
almost always preferred over L1-regularization in most implementations. The performance gap is
small when the number of inputs and units is large.
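A sketch of the penalized loss and its (sub)gradient for a linear model, following the two objective functions above (the helper name and the use of a subgradient at w_i = 0 are illustrative choices):

import numpy as np

def regularized_loss_and_grad(w, X, y, lam=0.1, kind="l2"):
    # Squared loss plus an L1 or L2 penalty on the weights (bias handling omitted for brevity).
    err = X @ w - y
    loss = np.sum(err ** 2)
    grad = 2.0 * X.T @ err
    if kind == "l2":
        loss += lam * np.sum(w ** 2)
        grad += 2.0 * lam * w
    else:  # L1: use a subgradient, which is 0 where w_i is exactly 0
        loss += lam * np.sum(np.abs(w))
        grad += lam * np.sign(w)
    return loss, grad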

Early Stopping
When training neural networks, we typically employ gradient-descent methods aiming to converge,
and although they optimize training data loss, they may not always yield optimal results for the test
data due to overfitting in the final stages of training. One way to address this issue is by implementing
early stopping. In this method, a portion of the training data is held out as a validation set. The
backpropagation-based training is only applied to the portion of the training data that does not include
the validation set. At the same time, the error of the model on the validation set is continuously
monitored. At some point, this error begins to rise on the validation set, even though it continues to
reduce on the training set. This is the point at which further training causes overfitting. Therefore, this
point can be chosen for termination.
One advantage of early stopping is that it can be easily added to neural network training without
significantly changing the training procedure. Early stopping can be used in combination with other
regularizers in a relatively straightforward way. One way of understanding the bias-variance trade-off
is that the true loss function of an optimization problem can only be constructed if we have infinite
data. Shift in loss function caused by variance effects and the effect of early stopping. Because of the
differences in the true loss function and that on the training data, the error will begin to rise if gradient
descent is continued beyond a certain point. Here, we have shown a similar shape of the true and
training loss functions for simplicity, although this might not be the case in practice. Unfortunately,
the learning procedure can perform the gradient-descent only on the loss function defined on the
training data set, because the true loss function is unknown.
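A minimal early-stopping loop, as a sketch; train_one_epoch(model), validation_error(model), and model.copy() are hypothetical helpers assumed to be provided by the surrounding training code:

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=5):
    best_err, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)           # monitored on the held-out validation set
        if err < best_err:
            best_err, best_state, bad_epochs = err, model.copy(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:          # stop once validation error keeps rising
                break
    return best_state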
Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. In
practice, the amount of data we have is limited. One way to get around this problem is to create fake
data and add it to the training set. This approach is easiest for classification, like object recognition.
We can generate new (x, y) pairs easily just by transforming the x inputs in our training set.
Operations like translating the training images a few pixels in each direction can often greatly
improve generalization. Many other operations such as rotating the image or scaling the image have
also proven quite effective.
Dataset augmentation is effective for speech recognition tasks as well. Injecting noise in the input to a
neural network can also be seen as a form of data augmentation. Usually, operations that are
generally applicable (such as adding Gaussian noise to the input) are considered part of the machine
learning algorithm, while operations that are specific to one application domain (such as randomly
cropping an image) are considered to be separate pre-processing steps.
This approach is not as readily applicable to many other tasks. For example, it is difficult to generate
new fake data for a density estimation task. One must be careful not to apply transformations that
would change the correct class. For example, optical character recognition tasks require recognizing
the difference between ‘b’ and ‘d’ and the difference between ‘6’ and ‘9’, so horizontal flips and 180°
rotations are not appropriate ways of augmenting datasets for these tasks.

Parameter Tying and Parameter Sharing


Parameter tying and parameter sharing are techniques used in machine learning and neural networks
to control and reduce the number of model parameters, which can help improve training efficiency
and model generalization.

1. Parameter Tying:
- Parameter tying involves constraining or linking certain model parameters together, so they share
the same values during training and inference.
A common type of dependency that we often want to express is that certain parameters should be
close to one another. Consider the following scenario: we have two models performing the same
classification task (with the same set of classes) but with somewhat different input distributions.
Formally, we have model A with parameters w(A) and model B with parameters w(B). The two
models map the input to two different, but related, outputs: ŷ(A) = f(w(A), x) and ŷ(B) = g(w(B), x).
We can express the belief that these parameters should be close to one another by penalizing the
squared difference between the parameters of the two tasks, Ω = ||w(A) − w(B)||^2 (an L2 parameter-norm
penalty). Minimizing this penalty during training effectively makes the parameters for both tasks more alike.
- For example, in natural language processing, word embeddings can be tied for words with similar
meanings. Instead of having separate parameters for each word, similar words share the same
embedding vector. This can be useful for reducing the dimensionality of the model and improving
generalization.

2. Parameter Sharing:
- Parameter sharing is a broader concept where the same set of parameters is used in multiple parts
of a model.
- This sharing can occur across layers or even across different instances of a model.
- In convolutional neural networks (CNNs), parameter sharing is commonly used. In the
convolutional layers, a small set of filters (kernels) is applied to different spatial locations in the input.
These filters are the shared parameters, and this sharing allows the model to learn local patterns that
are applicable across the entire input space.
- Recurrent Neural Networks (RNNs) also use parameter sharing. In an RNN, the same set of
weights is applied at each time step, allowing the model to capture temporal dependencies.

The main advantage of parameter tying and sharing is that they can significantly reduce the number of
parameters in a model, which is important for models with large input data or when computational
resources are limited. Additionally, parameter sharing can lead to more efficient training and
improved generalization, as the model learns to recognize common patterns and features across
different parts of the data.

Ensemble Methods
Ensemble methods in deep learning involve combining predictions from multiple individual models to
produce a more robust and accurate final prediction. Ensemble methods are widely used in deep
learning to improve model performance and reduce overfitting. Ensemble methods derive their
inspiration from the bias-variance trade-off. Ensemble methods are used commonly in machine
learning, and two examples of such methods are bagging and boosting. The former is a method for
variance reduction, whereas the latter is a method for bias reduction. The goal of most ensemble
methods in the neural network setting is variance reduction (i.e., better generalization).
Bagging
Repeatedly create different training data sets and predict the same test instance using models built on
these data sets. The predictions across different data sets can then be averaged to yield the final
prediction. If a sufficient number of independent training data sets is used, the variance of the
prediction can be reduced to nearly zero, although the bias will still remain, depending on the choice of
model. Creating truly independent training sets in this way would require an unlimited supply of data.
Subsampling
Generate new training data sets from the single instance of the base data by sampling. The sampling
can be performed with or without replacement. The predictions on a particular test instance, which are
obtained from the models built with different training sets, are then averaged to create the final
prediction. One can average either the real-valued predictions (e.g., probability estimates of class
labels) or the discrete predictions.
The main difference between bagging and subsampling is in terms of whether or not replacement is
used in the creation of the sampled training data sets. The main challenge in directly using bagging for
neural networks is that one must construct multiple training models, which is highly inefficient.
However, the construction of different models can be fully parallelized, and therefore this type of
setting is a perfect candidate for training on multiple GPU processors.
Parametric Model Selection and Averaging
Hold out a portion of the training data and try different combinations of parameters and model
choices. The selection that provides the highest accuracy on the held-out portion of the training data is
then used for prediction. This is, of course, the standard approach used for parameter tuning in all
machine learning models, and is also referred to as model selection. Also called bucket-of-models
technique.
An additional approach that can be used to reduce the variance, is to select the k best configurations
and then average the predictions of these configurations. However, such an approach cannot be used
in very large-scale settings because each execution might require on the order of a few weeks.

Randomized Connection Dropping


The random dropping of connections between different layers in a multilayer neural network often
leads to diverse models in which different combinations of features are used to construct the hidden
variables. The dropping of connections between layers does tend to create less powerful models
because of the addition of constraints to the model-building process. The averaged prediction from
these different models is often highly accurate.
Dropout
Dropout is a method that uses node sampling instead of edge sampling in order to create a neural
network ensemble. If a node is dropped, then all incoming and outgoing connections from that node
need to be dropped as well. The nodes are sampled only from the input and hidden layers of the
network.
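A minimal sketch of applying dropout to one layer's activations; this uses the common "inverted dropout" variant (rescaling at training time), which is an illustrative choice rather than part of the original notes:

import numpy as np

def dropout_forward(h, p_drop=0.5, train=True, seed=0):
    if not train:
        return h                                   # no nodes are dropped at test time
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) >= p_drop)         # sample which nodes to keep
    return h * mask / (1.0 - p_drop)               # rescale so the expected activation is unchanged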

Data Perturbation Ensembles


A small amount of noise can be added to the input data, and the weights can be learned on the
perturbed data. This process can be repeated with multiple such additions, and the predictions of the
test point from different ensemble components can be averaged. This approach is used commonly in
the unsupervised setting with de-noising autoencoders.
One can also perform other types of data set augmentation. For example, an image instance can be
rotated or translated in order to add to the data set. Carefully designed data augmentation schemes can
often greatly improve the accuracy of a learner by increasing its generalization power. However,
strictly speaking such schemes are not perturbation schemes because the augmented examples are
created with a calibrated procedure and an understanding of the domain at hand. Data Perturbation
Ensembles are applied during model training and involve creating multiple models, each trained on a
perturbed version of the data.

Batch normalization

Batch normalization is a technique used in deep neural networks to stabilize and speed up the training
process. It works by normalizing the inputs of each layer in a mini-batch, typically just before
applying the activation function. Batch normalization is typically used in deep neural networks, such
as deep feedforward networks, convolutional neural networks (CNNs), and recurrent neural networks
(RNNs).
1. Normalization: In a neural network, the input to each layer can vary significantly during training,
which can slow down learning. Batch normalization addresses this by normalizing the inputs. For
each feature (neuron) in a layer, it subtracts the mean and divides by the standard deviation of that
feature within the mini-batch.
2. Scale and Shift: After normalizing, batch normalization allows the network to learn two additional
parameters, usually referred to as gamma (scale) and beta (shift). These parameters give the model
flexibility to adapt the normalized values back to the appropriate scale and location for the problem,
effectively preserving the representational capacity of the network.
In batch normalization, the idea is to add additional “normalization layers” between hidden layers that
resist this type of behavior by creating features with somewhat similar variance. Furthermore, each
unit in the normalization layers contains two additional parameters βi and γi that regulate the precise
level of normalization in the ith unit; these parameters are learned in a data-driven manner. The basic
idea is that the output of the ith unit will have a mean of βi and a standard deviation of γi over each
mini-batch of training instances. One might wonder whether it makes sense to simply set each βi to 0
and each γi to 1, but doing so reduces the representational power of the network. For example, if we
make this transformation, then the sigmoid units will be operating within their linear regions,
especially if the normalization is performed just before the activation.

Consider the case in which the input to the ith unit from the rth element of the batch is v_i(r). Each
v_i(r) is obtained by using the linear transformation defined by the coefficient vector W_i (and biases,
if any). For a particular batch of m instances, let the values of the m pre-activations be denoted by
v_i(1), v_i(2), ..., v_i(m). The first step is to compute the mean μ_i and standard deviation σ_i for the
ith hidden unit. These are then scaled using the parameters β_i and γ_i to create the outputs for the
next layer:

μ_i = (1/m) · Σ_{r=1}^{m} v_i(r)
σ_i^2 = (1/m) · Σ_{r=1}^{m} (v_i(r) − μ_i)^2 + ε
v̂_i(r) = (v_i(r) − μ_i) / σ_i
a_i(r) = γ_i · v̂_i(r) + β_i

Note that a_i(r) is the normalized pre-activation output of the ith node when the rth batch instance
passes through it. This value would otherwise have been set to v_i(r), if we had not applied batch
normalization.
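The forward pass above can be sketched directly in NumPy; V holds the pre-activations of one layer for a mini-batch (shape m × units), and gamma and beta are the learned per-unit parameters:

import numpy as np

def batchnorm_forward(V, gamma, beta, eps=1e-8):
    mu = V.mean(axis=0)                       # mean of each unit over the mini-batch
    sigma = np.sqrt(V.var(axis=0) + eps)      # standard deviation of each unit
    V_hat = (V - mu) / sigma                  # zero-mean, unit-variance pre-activations
    return gamma * V_hat + beta               # learned scale gamma_i and shift beta_i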
Benefits:
- Faster Training: Normalized inputs often lead to faster convergence during training because it
reduces internal covariate shift (changes in the distribution of inputs).
- Regularization: Batch normalization acts as a form of regularization, reducing the risk of
overfitting.
- Stability: It helps with gradient propagation, reducing the likelihood of vanishing or exploding
gradients.
- Enables Higher Learning Rates: It allows the use of higher learning rates, which can speed up
training without destabilizing it.

Important Questions

1. How does early stopping help prevent overfitting, and what considerations should be made
when implementing it?
2. Why is weight initialization crucial in neural network training?
3. Explain the two choices for applying batch normalization layer within a neural network
architecture.
4. Explain the weight update rule of Gradient Descent
5. How does the vanishing gradient problem impact the learning process in a neural network?
Illustrate this issue using a basic network diagram.
6. Explain parameter-specific optimization algorithms (Adam and RMSProp)
7. Consider a 1-dimensional time-series with values 2, 1, 3, 4, 7. Perform a convolution with a
1-dimensional filter 1, 0, 1 and zero padding.
8. Draw the architecture of AlexNet and explain its distinctive features.
9. Explain the Kaiming and Xavier weight initialization methods and their significance in deep
learning.
10. What is regularization? Discuss various regularization techniques.
11. Consider a two-input neuron that multiplies its two inputs x1 and x2 to obtain the output o.
Let L be the loss function that is computed at o. Suppose that you know that ∂L/∂o = 5, x1 =
2, and x2 = 3. Compute the values of ∂L/∂x1 and ∂L/∂x2 .
12. Differentiate between Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-
batch Gradient Descent in terms of their working principles.
