Module 2

Training deep models: Introduction, setup and initialization - Kaiming, Xavier weight initializations,
Vanishing and exploding gradient problems, Optimization techniques - Gradient Descent (GD),
Stochastic GD, GD with momentum, GD with Nesterov momentum, AdaGrad, RMSProp, Adam,
Regularization Techniques - L1 and L2 regularization, Early stopping, Dataset augmentation,
Parameter tying and sharing, Ensemble methods, Dropout, Batch normalization.

Setup and initialization


There are several important issues associated with the setup of the neural network, preprocessing, and
initialization. First, the hyperparameters of the neural network (such as the learning rates and
regularization parameters) need to be selected.
1) Tuning Hyperparameters
The term “hyperparameter” is used to specifically refer to the parameters regulating the design of the
model (like learning rate and regularization), and they are different from the more fundamental
parameters representing the weights of connections in the neural network. In the tuning phase, the
performance of the model is evaluated on a validation set for various choices of hyperparameters. This
approach ensures that the tuning process does not overfit to the training data set (which would
otherwise lead to poor test performance).
The most well-known technique for parameter selection is grid search. One issue with this procedure
is that the number of hyperparameters might be large, and the number of points in the grid increases
exponentially with the number of hyperparameters. Therefore, a commonly used trick is to first work
with coarse grids. Later, when one narrows down to a particular range of interest, finer grids are used.
A mathematically justified way of choosing hyperparameters is Bayesian optimization. However, these
methods are often too slow to use practically in large-scale neural networks and remain largely a
research curiosity. For smaller networks, it is possible to use libraries such as Hyperopt, Spearmint,
and SMAC.
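As an illustration of the coarse-to-fine idea, the following is a minimal sketch in Python; the train_and_validate(lr, reg) callable (returning validation accuracy) and the specific grid values are hypothetical placeholders, not part of the original notes.

import itertools
import numpy as np

def coarse_to_fine_search(train_and_validate):
    # Coarse grid: learning rates and regularization strengths on a log scale.
    coarse_lr = [1e-4, 1e-3, 1e-2, 1e-1]
    coarse_reg = [1e-5, 1e-3, 1e-1]
    best = max(itertools.product(coarse_lr, coarse_reg),
               key=lambda p: train_and_validate(lr=p[0], reg=p[1]))

    # Finer grid centred (in log-space) on the best coarse setting.
    fine_lr = np.logspace(np.log10(best[0]) - 0.5, np.log10(best[0]) + 0.5, 5)
    fine_reg = np.logspace(np.log10(best[1]) - 0.5, np.log10(best[1]) + 0.5, 5)
    return max(itertools.product(fine_lr, fine_reg),
               key=lambda p: train_and_validate(lr=p[0], reg=p[1]))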
2) Feature Preprocessing
The feature preprocessing methods used for neural network training are not very different from those in
other machine learning algorithms. There are two forms of feature preprocessing used in machine
learning algorithms.
1. Additive preprocessing and mean-centering: It can be useful to mean-center the data in order to
remove certain types of bias effects. A vector of column-wise means is subtracted from each data
point. A second type of pre-processing is used when it is desired for all feature values to be non-
negative. In such a case, the absolute value of the most negative entry of a feature is added to the
corresponding feature value of each data point.
2. Feature normalization: A common type of normalization is to divide each feature value by its
standard deviation. When this type of feature scaling is combined with mean-centering, the data is
said to have been standardized. The basic idea is that each feature is presumed to have been drawn
from a standard normal distribution with zero mean and unit variance.
The other type of feature normalization is useful when the data needs to be scaled in the range (0, 1).
Let min_j and max_j be the minimum and maximum values of the jth attribute. Then, each feature value
x_ij for the jth dimension of the ith point is scaled by min-max normalization as follows:

x_ij ⇐ (x_ij − min_j) / (max_j − min_j)
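Both forms of preprocessing can be written in a few lines; the sketch below uses a small made-up data matrix purely for illustration.

import numpy as np

X = np.array([[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]])   # toy data, rows = data points

# Standardization: subtract the column-wise mean, divide by the column-wise standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: scale each feature into the range (0, 1).
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))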
Whitening: Another form of feature preprocessing is referred to as whitening, in which the axis system
is rotated to create a new set of de-correlated features, each of which is scaled to unit variance. This
transformation is closely related to principal component analysis (PCA).
Let D be an n × d data matrix that has already been mean-centered. Let C be the d × d covariance
matrix of D, in which the (i, j)th entry is the covariance between dimensions i and j. Because the
matrix D is mean-centered, we have the following:

C = (D^T D) / n
The eigenvectors of the co-variance matrix provide the de-correlated directions in the data.
Furthermore, the eigenvalues provide the variance along each of the directions. Therefore, if one uses
the top-k eigenvectors (i.e., largest k eigenvalues) of the covariance matrix, most of the variance in
the data will be retained and the noise will be removed.
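A minimal PCA-whitening sketch in Python, assuming a small randomly generated data matrix for illustration:

import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 5))          # toy data matrix, n x d
D = D - D.mean(axis=0)                 # mean-centre the columns

C = D.T @ D / D.shape[0]               # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvectors give de-correlated directions

# Project onto the top-k eigenvectors and scale each direction to unit variance.
k = 3
top = np.argsort(eigvals)[::-1][:k]
D_white = (D @ eigvecs[:, top]) / np.sqrt(eigvals[top])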
3) Initialization
Initialization techniques are methods used to set the initial values of the weights and biases in a neural
network. Proper initialization is crucial for training neural networks effectively, as it can affect the
convergence speed and whether the network converges at all.
Kaiming initialization method
The Kaiming initialization method, also known as He initialization, is an initialization technique
designed for deep neural networks, particularly when using the Rectified Linear Unit (ReLU)
activation function. It addresses the issue of vanishing gradients that can occur during training by
providing appropriate initial weights. The Kaiming initialization method is named after its creator,
Kaiming He, who introduced it in a 2015 paper. This method is widely used in convolutional neural
networks (CNNs) and other deep learning architectures with ReLU activations.
The key idea behind Kaiming initialization is to set the initial weights such that the variance of the
activations remains roughly the same across layers. This helps prevent the vanishing gradient problem
that can occur with standard weight initialization methods.
This implies the following initialization scheme:

Initialize the weights with values drawn from a Gaussian (normal) distribution with a mean of 0 and a
variance of 2 / n_in, where n_in is the fan-in (number of inputs) of the layer, i.e., a standard deviation
of sqrt(2 / n_in).
Biases can be initialized to zero or to small constants.
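A minimal sketch of Kaiming initialization for one fully connected layer (the function name and shapes are illustrative, not a reference implementation):

import numpy as np

def kaiming_init(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    # Weights ~ N(0, 2 / n_in), appropriate for ReLU activations.
    W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    b = np.zeros(n_out)                 # biases start at zero
    return W, b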
Xavier/Glorot Initialization:
More sophisticated rules for initialization consider the fact that the nodes in different layers interact
with one another to contribute to output sensitivity. Let r_in and r_out respectively be the fan-in and
fan-out for a particular neuron. One suggested initialization rule, referred to as Xavier initialization or
Glorot initialization, is to use a Gaussian distribution with a standard deviation of sqrt(2 / (r_in + r_out)).
The initialization strategy involves setting the weights of each layer according to a specific
distribution, typically a uniform or normal distribution, with its parameters calculated based on the
number of input and output neurons of that layer. This initialization technique helps prevent vanishing
or exploding gradients during training, which can occur when weights are initialized too small or too
large, respectively. By scaling the initial weights based on the number of input and output neurons, it
aims to maintain a more stable training process.
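The corresponding sketch for the Gaussian form of Xavier/Glorot initialization (again, the helper name is illustrative):

import numpy as np

def xavier_init(r_in, r_out, seed=0):
    rng = np.random.default_rng(seed)
    # Weights ~ N(0, 2 / (r_in + r_out)), balancing fan-in and fan-out.
    return rng.normal(0.0, np.sqrt(2.0 / (r_in + r_out)), size=(r_in, r_out))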
The Vanishing and Exploding Gradient Problems
Deep neural networks have several stability issues associated with training. In particular, networks
with many layers may be hard to train because of the way in which the gradients in earlier and later
layers are related. In order to understand this point, let us consider a very deep network that has a
single node in each layer. We assume that there are (m + 1) layers, including the noncomputational
input layer. The weights of the edges between the various layers are denoted by w1, w2,...wm.
Furthermore, assume that the sigmoid activation function Φ(·) is applied in each layer. Let x be the
input, h_1 ... h_{m−1} be the hidden values in the various layers, and o be the final output. Let Φ′(h_t) be
the derivative of the activation function in hidden layer t, and let ∂L/∂h_t be the derivative of the loss
function with respect to the hidden activation h_t. The architecture is a simple chain of single-unit layers.
It is relatively easy to use the backpropagation update to show the following relationship:

∂L/∂h_t = Φ′(h_{t+1}) · w_{t+1} · (∂L/∂h_{t+1})
The derivative of a sigmoid with output f ∈ (0, 1) is given by f(1 − f). This value takes on its
maximum at f = 0.5, and therefore the value of Φ′(h_t) is no more than 0.25. Since the absolute value
of w_{t+1} is typically no larger than 1 with standard initializations, each backpropagation step will
(typically) cause the value of ∂L/∂h_t to be less than 0.25 times that of ∂L/∂h_{t+1}. Therefore, after
moving backwards by about r layers, this value will typically be less than 0.25^r.

Vanishing gradient
However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied
together. Thus, the gradient decreases exponentially as we propagate down to the initial layers.
a_1 = σ(w_1 x) → a_2 = σ(w_2 a_1) → a_3 = σ(w_3 a_2)

So, most neural network architectures nowadays use the rectified linear unit (ReLU) rather than the logistic
sigmoid function as an activation function in the hidden layers.
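To see the effect numerically, here is a small sketch that multiplies the per-layer sigmoid derivatives along a chain of 20 layers (the choice of 20 layers, unit weights, and the input value 0.5 are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, grad = 0.5, 1.0
for t in range(20):                      # 20 sigmoid layers in a chain
    w = 1.0                              # each layer has weight 1 for illustration
    h = sigmoid(w * x)
    grad *= w * h * (1.0 - h)            # each factor is at most 0.25
    x = h
print(grad)                              # shrinks roughly like 0.25^20, i.e., to ~1e-12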
Optimization techniques - Gradient Descent (GD)
Gradient Descent (GD) Algorithm:
Gradient Descent is a deterministic optimization algorithm that computes the gradient of the cost function with
respect to the model parameters and updates the parameters in the direction of the negative gradient to minimize
the cost.

Initialize the model parameters θ randomly or with some initial values.


Set a learning rate (α), which is a hyperparameter that determines the step size for each update.
Choose a stopping criterion (e.g., a maximum number of iterations or a minimum change in the cost function).
Repeat the following steps until the stopping criterion is met:
a. Compute the gradient of the cost function with respect to the parameters: ∇J(θ).
b. Update the parameters using the gradient and the learning rate:
θ = θ - α * ∇J(θ)
c. Repeat steps a and b for a fixed number of iterations or until convergence.
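A minimal sketch of full-batch gradient descent for a least-squares cost (the specific cost function and step count are illustrative):

import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    # Full-batch GD for J(theta) = mean((X @ theta - y)^2).
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ theta - y) / len(y)   # gradient of the cost over ALL examples
        theta = theta - alpha * grad                   # step in the negative gradient direction
    return theta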

Stochastic GD

Stochastic Gradient Descent (SGD) Algorithm:


Stochastic Gradient Descent is a variation of Gradient Descent that updates the model parameters using a single
randomly chosen training example at each iteration, making it faster and more suitable for large datasets.

Initialize the model parameters θ randomly or with some initial values.


Set a learning rate (α), which is a hyperparameter that determines the step size for each update.
Choose a stopping criterion (e.g., a maximum number of epochs or a minimum change in the cost function).
Repeat the following steps until the stopping criterion is met:
a. Shuffle the training dataset randomly (to introduce randomness).
b. Iterate through the shuffled dataset one training example at a time:
i. Compute the gradient of the cost function for the current example: ∇J(θ, x, y), where x is the input and y is the
target.
ii. Update the parameters using the gradient and the learning rate:
θ = θ - α * ∇J(θ, x, y)
c. Repeat steps a and b for a fixed number of epochs or until convergence.
SGD introduces randomness and noise into the parameter updates, which can help escape local minima and
converge faster, especially in noisy or large datasets
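For comparison with the full-batch version above, a minimal SGD sketch for the same least-squares cost (one randomly chosen example per update; the epoch count is illustrative):

import numpy as np

def sgd(X, y, alpha=0.01, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):              # shuffle, then visit one example at a time
            grad = 2.0 * X[i] * (X[i] @ theta - y[i])  # gradient for a single (x, y) pair
            theta = theta - alpha * grad
    return theta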
GD with momentum
Momentum-based techniques recognize that zigzagging is a result of highly contradictory steps that
cancel out one another and reduce the effective size of the steps in the correct (long-term) direction.
In order to understand this point, consider a setting in which one is performing gradient descent with
respect to the parameter vector W. The normal updates for gradient descent with respect to a loss
function L (defined over a mini-batch of instances) are as follows:

W ⇐ W − α · (∂L/∂W)

Here, α is the learning rate. In momentum-based descent, the update is modified with a velocity vector
V that is maintained with exponential smoothing, where β ∈ (0, 1) is a smoothing parameter:

V ⇐ β · V − α · (∂L/∂W);  W ⇐ W + V
Larger values of β help the approach pick up a consistent velocity V in the correct direction. Setting β
= 0 specializes to straightforward mini-batch gradient-descent. The parameter β is also referred to as
the momentum parameter or the friction parameter.
With momentum-based descent, the learning is accelerated, because one is generally moving in a
direction that often points closer to the optimal solution and the useless “side ways” oscillations are
muted. The basic idea is to give greater preference to consistent directions over multiple steps, which
have greater importance in the descent.
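A single momentum update step, written as a sketch; grad_fn(W) is a hypothetical helper that returns the mini-batch gradient ∂L/∂W:

def momentum_step(W, V, grad_fn, alpha=0.01, beta=0.9):
    V = beta * V - alpha * grad_fn(W)   # exponentially smoothed velocity
    W = W + V                           # move along the accumulated direction
    return W, V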
GD with Nesterov momentum
The Nesterov momentum is a modification of the traditional momentum method in which the
gradients are computed at a point that would be reached after executing a β discounted version of the
previous step again (i.e., the momentum portion of the current step). This point is obtained by
multiplying the previous update vector V with the friction parameter β and then computing the
gradient at W + βV .
Therefore, one is using a certain amount of lookahead in computing the updates. Let us denote the
loss function by L(W) at the current solution W. In this case, it is important to explicitly denote the
argument of the loss function because of the way in which the gradient is computed. Therefore, the
update may be computed as follows:

V ⇐ β · V − α · ∂L(W + β · V)/∂W;  W ⇐ W + V
Note that the only difference from the standard momentum method is in terms of where the gradient is
computed. Using the value of the gradient a little further along the previous update can lead to faster
convergence. The Nesterov method works only in mini-batch gradient descent with modest batch
sizes; using very small batches is a bad idea.
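The only change relative to the momentum sketch above is where the gradient is evaluated; grad_fn is again a hypothetical mini-batch gradient helper:

def nesterov_step(W, V, grad_fn, alpha=0.01, beta=0.9):
    V = beta * V - alpha * grad_fn(W + beta * V)   # gradient at the lookahead point
    W = W + V
    return W, V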
Parameter-Specific Learning Rates
The basic idea in the momentum methods of the previous section is to leverage the consistency in the
gradient direction of certain parameters in order to speed up the updates. This goal can also be
achieved more explicitly by having different learning rates for different parameters. The idea is that
parameters with large partial derivatives are often oscillating and zigzagging, whereas parameters
with small partial derivatives tend to be more consistent but move in the same direction.
AdaGrad
In the AdaGrad algorithm, one keeps track of the aggregated squared magnitude of the partial
derivative with respect to each parameter over the course of the algorithm. The square-root of this
value is proportional to the root-mean-square slope for that parameter (although the absolute value
will increase with the number of epochs because of successive aggregation).
Let A_i be the aggregate value for the ith parameter. Therefore, in each iteration, the following update
is performed:

A_i ⇐ A_i + (∂L/∂w_i)^2   ∀i

The update for the ith parameter w_i is as follows:

w_i ⇐ w_i − (α / √A_i) · (∂L/∂w_i)   ∀i
If desired, one can use √A_i + ε in the denominator instead of √A_i to avoid ill-conditioning. Here, ε is
a small positive value such as 10^−8.
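A vectorized sketch of one AdaGrad step, where w, A, and grad are NumPy arrays of the same shape (the function name is illustrative):

import numpy as np

def adagrad_step(w, A, grad, alpha=0.01, eps=1e-8):
    A = A + grad ** 2                          # accumulate squared partial derivatives
    w = w - alpha * grad / np.sqrt(A + eps)    # per-parameter scaled step
    return w, A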
RMSProp
The RMSProp algorithm [194] uses a similar motivation as AdaGrad for performing the “signal-to-
noise” normalization with the absolute magnitude √ Ai of the gradients. However, instead of simply
adding the squared gradients to estimate Ai , it uses exponential averaging. Since one uses averaging
to normalize rather than aggregate values, the progress is not slowed prematurely by a constantly
increasing scaling factor A_i. The basic idea is to use a decay factor ρ ∈ (0, 1), and weight the squared
partial derivatives occurring t updates ago by ρ^t. Note that this can be achieved easily by multiplying
the current squared aggregate (i.e., the running estimate) by ρ and then adding (1 − ρ) times the current
(squared) partial derivative. The running estimate is initialized to 0. This causes some (undesirable)
bias in early iterations, which disappears over the longer term. Therefore, if A_i is the exponentially
averaged value for the ith parameter w_i, we have the following way of updating A_i:

A_i ⇐ ρ · A_i + (1 − ρ) · (∂L/∂w_i)^2   ∀i
The square root of this value for each parameter is used to normalize its gradient. Then, the following
update is used with (global) learning rate α:

w_i ⇐ w_i − (α / √A_i) · (∂L/∂w_i)   ∀i
Another advantage of RMSProp over AdaGrad is that the importance of ancient (i.e., stale) gradients
decays exponentially with time. Furthermore, it can benefit by incorporating concepts of momentum
within the computational algorithm.
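A sketch of one RMSProp step in the same style as the AdaGrad sketch above (only the accumulation rule changes):

import numpy as np

def rmsprop_step(w, A, grad, alpha=0.001, rho=0.9, eps=1e-8):
    A = rho * A + (1.0 - rho) * grad ** 2      # exponentially averaged squared gradient
    w = w - alpha * grad / np.sqrt(A + eps)
    return w, A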
Adam
The Adam algorithm uses a similar “signal-to-noise” normalization as AdaGrad and RMSProp;
however, it also exponentially smooths the first-order gradient in order to incorporate momentum into
the update. It also directly addresses the bias inherent in exponential smoothing when the running
estimate of a smoothed value is unrealistically initialized to 0. As in the case of RMSProp, let A_i be
the exponentially averaged value of the ith parameter w_i. This value is updated in the same way as in
RMSProp with the decay parameter ρ ∈ (0, 1):

A_i ⇐ ρ · A_i + (1 − ρ) · (∂L/∂w_i)^2   ∀i

At the same time, an exponentially smoothed value of the gradient is maintained, for which the ith
component is denoted by F_i. This smoothing is performed with a different decay parameter ρ_f:

F_i ⇐ ρ_f · F_i + (1 − ρ_f) · (∂L/∂w_i)   ∀i

This type of exponential smoothing of the gradient with ρ_f is a variation of the momentum method
discussed earlier (which is parameterized by a friction parameter β instead of ρ_f). Then, the following
update is used with learning rate α_t in the tth iteration:

w_i ⇐ w_i − (α_t / √A_i) · F_i   ∀i

There are two key differences from the RMSProp algorithm. First, the gradient is replaced with its
exponentially smoothed value in order to incorporate momentum. Second, the learning rate α_t now
depends on the iteration index t, and is defined as follows:

α_t = α · √(1 − ρ^t) / (1 − ρ_f^t)
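Putting the above rules together, one Adam step can be sketched as follows (arrays w, A, F, and grad share one shape; the iteration index t starts at 1):

import numpy as np

def adam_step(w, A, F, grad, t, alpha=0.001, rho=0.999, rho_f=0.9, eps=1e-8):
    A = rho * A + (1.0 - rho) * grad ** 2            # smoothed squared gradient (as in RMSProp)
    F = rho_f * F + (1.0 - rho_f) * grad             # smoothed gradient (momentum)
    alpha_t = alpha * np.sqrt(1.0 - rho ** t) / (1.0 - rho_f ** t)   # bias-corrected learning rate
    w = w - alpha_t * F / np.sqrt(A + eps)
    return w, A, F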
Regularization Techniques - L1 and L2 regularization


Penalty-based regularization is the most common approach for reducing overfitting. In order to
understand this point, let us consider the example of a polynomial of degree d. In this case, the
prediction ŷ for a given value of x is as follows:

ŷ = Σ_{i=0}^{d} w_i · x^i

It is possible to use a single-layer network with d inputs and a single bias neuron with weight w_0 in
order to model this prediction. The ith input is x^i. This neural network uses linear activations, and
the squared loss function for a set of training instances (x, y) from data set D can be defined as follows:

L = Σ_{(x,y)∈D} (y − ŷ)^2
A large value of d tends to increase overfitting. One possible solution to this problem is to reduce the
value of d. In other words, using a model with economy in parameters leads to a simpler model.
Instead of reducing the number of parameters in a hard way, one can use a soft penalty on the use of
parameters. Furthermore, large (absolute) values of the parameters are penalized more than small
values, because small values do not affect the prediction significantly.

The most common choice is L2-regularization, which is also referred to as Tikhonov regularization.
In such a case, the additional penalty is defined by the sum of squares of the values of the parameters.
Then, for the regularization parameter λ > 0, one can define the objective function as follows:

L = Σ_{(x,y)∈D} (y − ŷ)^2 + λ · Σ_{i=0}^{d} w_i^2
Increasing or decreasing the value of λ reduces or increases the softness of the penalty, respectively. One advantage of this type
of parameterized penalty is that one can tune this parameter for optimum performance on a portion of
the training data set that is not used for learning the parameters. This type of approach is referred to as
model validation.
However, it is possible to use other types of penalties on the parameters. A common approach is L1-
regularization in which the squared penalty is replaced with a penalty on the sum of the absolute
magnitudes of the coefficients. Therefore, the new objective function is as follows:

L = Σ_{(x,y)∈D} (y − ŷ)^2 + λ · Σ_{i=0}^{d} |w_i|
The main problem with this objective function is that it contains the term |w_i|, which is not
differentiable when w_i is exactly equal to 0. This requires some modification of the gradient descent
method when w_i is 0. For the case when w_i is non-zero, one can use the straightforward update
obtained by computing the partial derivative.
A question arises as to whether L1- or L2-regularization is desirable. From an accuracy point of view,
L2-regularization usually outperforms L1-regularization. This is the reason that L2-regularization is
almost always preferred over L1-regularization in most implementations. The performance gap is
small when the number of inputs and units is large.
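A sketch of the penalized loss and its (sub)gradient for a linear model, following the two objective functions above (the helper name and the use of a subgradient at w_i = 0 are illustrative choices):

import numpy as np

def regularized_loss_and_grad(w, X, y, lam=0.1, kind="l2"):
    # Squared loss plus an L1 or L2 penalty on the weights (bias handling omitted for brevity).
    err = X @ w - y
    loss = np.sum(err ** 2)
    grad = 2.0 * X.T @ err
    if kind == "l2":
        loss += lam * np.sum(w ** 2)
        grad += 2.0 * lam * w
    else:  # L1: use a subgradient, which is 0 where w_i is exactly 0
        loss += lam * np.sum(np.abs(w))
        grad += lam * np.sign(w)
    return loss, grad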

Early Stopping
When training neural networks, we typically employ gradient-descent methods aiming to converge,
and although they optimize training data loss, they may not always yield optimal results for the test
data due to overfitting in the final stages of training. One way to address this issue is by implementing
early stopping. In this method, a portion of the training data is held out as a validation set. The
backpropagation-based training is only applied to the portion of the training data that does not include
the validation set. At the same time, the error of the model on the validation set is continuously
monitored. At some point, this error begins to rise on the validation set, even though it continues to
reduce on the training set. This is the point at which further training causes overfitting. Therefore, this
point can be chosen for termination.
One advantage of early stopping is that it can be easily added to neural network training without
significantly changing the training procedure. Early stopping can be used in combination with other
regularizers in a relatively straightforward way. One way of understanding the bias-variance trade-off
is that the true loss function of an optimization problem can only be constructed if we have infinite
data. Shift in loss function caused by variance effects and the effect of early stopping. Because of the
differences in the true loss function and that on the training data, the error will begin to rise if gradient
descent is continued beyond a certain point. Here, we have shown a similar shape of the true and
training loss functions for simplicity, although this might not be the case in practice. Unfortunately,
the learning procedure can perform the gradient-descent only on the loss function defined on the
training data set, because the true loss function is unknown.
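A minimal early-stopping loop, as a sketch; train_one_epoch(model), validation_error(model), and model.copy() are hypothetical helpers assumed to be provided by the surrounding training code:

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=5):
    best_err, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)           # monitored on the held-out validation set
        if err < best_err:
            best_err, best_state, bad_epochs = err, model.copy(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:          # stop once validation error keeps rising
                break
    return best_state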
Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. In
practice, the amount of data we have is limited. One way to get around this problem is to create fake
data and add it to the training set. This approach is easiest for classification, like object recognition.
We can generate new (x, y) pairs easily just by transforming the x inputs in our training set.
Operations like translating the training images a few pixels in each direction can often greatly
improve generalization. Many other operations such as rotating the image or scaling the image have
also proven quite effective.
Dataset augmentation is effective for speech recognition tasks as well. Injecting noise in the input to a
neural network can also be seen as a form of data augmentation. Usually, operations that are
generally applicable (such as adding Gaussian noise to the input) are considered part of the machine
learning algorithm, while operations that are specific to one application domain (such as randomly
cropping an image) are considered to be separate pre-processing steps.
This approach is not as readily applicable to many other tasks. For example, it is difficult to generate
new fake data for a density estimation task. One must be careful not to apply transformations that
would change the correct class. For example, optical character recognition tasks require recognizing
the difference between ‘b’ and ‘d’ and the difference between ‘6’ and ‘9’, so horizontal flips and 180°
rotations are not appropriate ways of augmenting datasets for these tasks.

Parameter Tying and Parameter Sharing


Parameter tying and parameter sharing are techniques used in machine learning and neural networks
to control and reduce the number of model parameters, which can help improve training efficiency
and model generalization.

1. Parameter Tying:
- Parameter tying involves constraining or linking certain model parameters together, so they share
the same values during training and inference.
A common type of dependency that we often want to express is that certain parameters should be
close to one another. Consider the following scenario: we have two models performing the same
classification task (with the same set of classes) but with somewhat different input distributions.
Formally, we have model A with parameters w(A) and model B with parameters w(B). The two
models map the input to two different, but related, outputs: ŷ(A) = f(w(A), x) and ŷ(B) = g(w(B), x).
We can express the belief that these parameters should be close to one another by penalizing the
squared difference between the parameters of the two tasks, Ω = ||w(A) − w(B)||^2 (an L2 parameter-norm
penalty). Minimizing this penalty during training effectively makes the parameters for both tasks more alike.
- For example, in natural language processing, word embeddings can be tied for words with similar
meanings. Instead of having separate parameters for each word, similar words share the same
embedding vector. This can be useful for reducing the dimensionality of the model and improving
generalization.

2. Parameter Sharing:
- Parameter sharing is a broader concept where the same set of parameters is used in multiple parts
of a model.
- This sharing can occur across layers or even across different instances of a model.
- In convolutional neural networks (CNNs), parameter sharing is commonly used. In the
convolutional layers, a small set of filters (kernels) is applied to different spatial locations in the input.
These filters are the shared parameters, and this sharing allows the model to learn local patterns that
are applicable across the entire input space.
- Recurrent Neural Networks (RNNs) also use parameter sharing. In an RNN, the same set of
weights is applied at each time step, allowing the model to capture temporal dependencies.

The main advantage of parameter tying and sharing is that they can significantly reduce the number of
parameters in a model, which is important for models with large input data or when computational
resources are limited. Additionally, parameter sharing can lead to more efficient training and
improved generalization, as the model learns to recognize common patterns and features across
different parts of the data.

Ensemble Methods
Ensemble methods in deep learning involve combining predictions from multiple individual models to
produce a more robust and accurate final prediction. Ensemble methods are widely used in deep
learning to improve model performance and reduce overfitting. Ensemble methods derive their
inspiration from the bias-variance trade-off. Ensemble methods are used commonly in machine
learning, and two examples of such methods are bagging and boosting. The former is a method for
variance reduction, whereas the latter is a method for bias reduction. The goal of most ensemble
methods in the neural network setting is variance reduction (i.e., better generalization).
Bagging
Repeatedly create different training data sets and predict the same test instance using models built on
these data sets. The predictions across different data sets can then be averaged to yield the final
prediction. If a sufficient number of independent training data sets is used, the variance of the
prediction can be reduced to nearly zero, although the bias will still remain, depending on the choice of
model. Creating truly independent training sets in this way would require an unlimited supply of data.
Subsampling
Generate new training data sets from the single instance of the base data by sampling. The sampling
can be performed with or without replacement. The predictions on a particular test instance, which are
obtained from the models built with different training sets, are then averaged to create the final
prediction. One can average either the real-valued predictions (e.g., probability estimates of class
labels) or the discrete predictions.
The main difference between bagging and subsampling is in terms of whether or not replacement is
used in the creation of the sampled training data sets. The main challenge in directly using bagging for
neural networks is that one must construct multiple training models, which is highly inefficient.
However, the construction of different models can be fully parallelized, and therefore this type of
setting is a perfect candidate for training on multiple GPU processors.
Parametric Model Selection and Averaging
Hold out a portion of the training data and try different combinations of parameters and model
choices. The selection that provides the highest accuracy on the held-out portion of the training data is
then used for prediction. This is, of course, the standard approach used for parameter tuning in all
machine learning models, and is also referred to as model selection. Also called bucket-of-models
technique.
An additional approach that can be used to reduce the variance, is to select the k best configurations
and then average the predictions of these configurations. However, such an approach cannot be used
in very large-scale settings because each execution might require on the order of a few weeks.

Randomized Connection Dropping


The random dropping of connections between different layers in a multilayer neural network often
leads to diverse models in which different combinations of features are used to construct the hidden
variables. The dropping of connections between layers does tend to create less powerful models
because of the addition of constraints to the model-building process. The averaged prediction from
these different models is often highly accurate.
Dropout
Dropout is a method that uses node sampling instead of edge sampling in order to create a neural
network ensemble. If a node is dropped, then all incoming and outgoing connections from that node
need to be dropped as well. The nodes are sampled only from the input and hidden layers of the
network.
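A minimal sketch of applying dropout to one layer's activations; this uses the common "inverted dropout" variant (rescaling at training time), which is an illustrative choice rather than part of the original notes:

import numpy as np

def dropout_forward(h, p_drop=0.5, train=True, seed=0):
    if not train:
        return h                                   # no nodes are dropped at test time
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) >= p_drop)         # sample which nodes to keep
    return h * mask / (1.0 - p_drop)               # rescale so the expected activation is unchanged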

Data Perturbation Ensembles


A small amount of noise can be added to the input data, and the weights can be learned on the
perturbed data. This process can be repeated with multiple such additions, and the predictions of the
test point from different ensemble components can be averaged. This approach is used commonly in
the unsupervised setting with de-noising autoencoders.
One can also perform other types of data set augmentation. For example, an image instance can be
rotated or translated in order to add to the data set. Carefully designed data augmentation schemes can
often greatly improve the accuracy of a learner by increasing its generalization power. However,
strictly speaking such schemes are not perturbation schemes because the augmented examples are
created with a calibrated procedure and an understanding of the domain at hand. Data Perturbation
Ensembles are applied during model training and involve creating multiple models, each trained on a
perturbed version of the data.

Batch normalization

Batch normalization is a technique used in deep neural networks to stabilize and speed up the training
process. It works by normalizing the inputs of each layer in a mini-batch, typically just before
applying the activation function. Batch normalization is typically used in deep neural networks, such
as deep feedforward networks, convolutional neural networks (CNNs), and recurrent neural networks
(RNNs).
1. Normalization: In a neural network, the input to each layer can vary significantly during training,
which can slow down learning. Batch normalization addresses this by normalizing the inputs. For
each feature (neuron) in a layer, it subtracts the mean and divides by the standard deviation of that
feature within the mini-batch.
2. Scale and Shift: After normalizing, batch normalization allows the network to learn two additional
parameters, usually referred to as gamma (scale) and beta (shift). These parameters give the model
flexibility to adapt the normalized values back to the appropriate scale and location for the problem,
effectively preserving the representational capacity of the network.
In batch normalization, the idea is to add additional “normalization layers” between hidden layers that
resist this type of behavior by creating features with somewhat similar variance. Furthermore, each
unit in the normalization layers contains two additional parameters βi and γi that regulate the precise
level of normalization in the ith unit; these parameters are learned in a data-driven manner. The basic
idea is that the output of the ith unit will have a mean of βi and a standard deviation of γi over each
mini-batch of training instances. One might wonder whether it makes sense to simply set each βi to 0
and each γi to 1, but doing so reduces the representational power of the network. For example, if we
make this transformation, then the sigmoid units will be operating within their linear regions,
especially if the normalization is performed just before the activation.

Consider the case in which the input to the ith unit from the rth element of the batch is v_i(r). Each
v_i(r) is obtained by using the linear transformation defined by the coefficient vector W_i (and biases,
if any). For a particular batch of m instances, let the values of the m pre-activations be denoted by
v_i(1), v_i(2), ..., v_i(m). The first step is to compute the mean μ_i and standard deviation σ_i for the
ith hidden unit. These are then scaled using the parameters β_i and γ_i to create the outputs for the
next layer:

μ_i = (1/m) · Σ_{r=1}^{m} v_i(r)
σ_i^2 = (1/m) · Σ_{r=1}^{m} (v_i(r) − μ_i)^2 + ε
v̂_i(r) = (v_i(r) − μ_i) / σ_i
a_i(r) = γ_i · v̂_i(r) + β_i

Note that a_i(r) is the normalized pre-activation output of the ith node when the rth batch instance
passes through it. This value would otherwise have been set to v_i(r), if we had not applied batch
normalization.
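The forward pass above can be sketched directly in NumPy; V holds the pre-activations of one layer for a mini-batch (shape m × units), and gamma and beta are the learned per-unit parameters:

import numpy as np

def batchnorm_forward(V, gamma, beta, eps=1e-8):
    mu = V.mean(axis=0)                       # mean of each unit over the mini-batch
    sigma = np.sqrt(V.var(axis=0) + eps)      # standard deviation of each unit
    V_hat = (V - mu) / sigma                  # zero-mean, unit-variance pre-activations
    return gamma * V_hat + beta               # learned scale gamma_i and shift beta_i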
Benefits:
- Faster Training: Normalized inputs often lead to faster convergence during training because it
reduces internal covariate shift (changes in the distribution of inputs).
- Regularization: Batch normalization acts as a form of regularization, reducing the risk of
overfitting.
- Stability: It helps with gradient propagation, reducing the likelihood of vanishing or exploding
gradients.
- Enables Higher Learning Rates: It allows the use of higher learning rates, which can speed up
training without destabilizing it.

Important Questions

1. How does early stopping help prevent overfitting, and what considerations should be made
when implementing it?
2. Why is weight initialization crucial in neural network training?
3. Explain the two choices for applying batch normalization layer within a neural network
architecture.
4. Explain the weight update rule of Gradient Descent
5. How does the vanishing gradient problem impact the learning process in a neural network?
Illustrate this issue using a basic network diagram.
6. Explain parameter-specific optimization algorithms (Adam and RMSProp)
7. Consider a 1-dimensional time-series with values 2, 1, 3, 4, 7. Perform a convolution with a
1-dimensional filter 1, 0, 1 and zero padding.
8. Draw the architecture of AlexNet and explain its distinctive features.
9. Explain the Kaiming and Xavier weight initialization methods and their significance in deep
learning.
10. What is regularization? Discuss various regularization techniques.
11. Consider a two-input neuron that multiplies its two inputs x1 and x2 to obtain the output o.
Let L be the loss function that is computed at o. Suppose that you know that ∂L/∂o = 5, x1 =
2, and x2 = 3. Compute the values of ∂L/∂x1 and ∂L/∂x2 .
12. Differentiate between Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-
batch Gradient Descent in terms of their working principles.
