DLRL Module 1 Updated

Notes of AIML 2022 scheme 7th sem

Deep Learning & Reinforcement Learning- BAI701

Deep Learning and Reinforcement Learning


(BAI701)

Module-1

1.1 Introduction:

• Machine learning systems, with shallow or deep architectures, have the ability to learn
and improve with experience.

• The process of machine learning begins with the raw data which is used for
extracting useful information that helps in decision-making.

• The primary aim is to allow a machine to learn useful information just like humans
do.

• At an abstract level, machine learning can be carried out using the following approaches:


Supervised learning:

- Supervised learning adapts a system such that for a given input data it produces a
target output.

-The learning data is made up of tuples (attributes, label) where “attributes” represent
the input data and “label” represents the target output.

-The goal here is to adapt the system so that for a new input the system can predict
the target output.

-Supervised learning can use both continuous and discrete types of input data.

Unsupervised learning:

-Unsupervised learning involves data that consists of input vectors without any
target output.

-There are different objectives in unsupervised learning, such as clustering, density
estimation, and visualization.

Prof. Deepali A Dixit, Dept of AI &ML,RRCE 1



-The goal of clustering is to discover groups of similar data items on the basis of
measured or perceived similarities between the data items.

-The purpose of density estimation is to determine the distribution of the data within
the input space.

-In visualization, the data is projected down from a high-dimensional space to two or
three dimensions to view the similar data items.

Semi-supervised learning:

-Semi-supervised learning first uses unlabeled data to learn a feature representation of
the input data and then uses the learned feature representation to solve the supervised
task.

-The training dataset can be divided into two parts: the data samples with
corresponding labels and the data samples where the labels are not known.

-Learning can also take place without an explicit error signal at each step: only a
generalized reinforcement is received, indicating how the system should change its
behavior. This is sometimes referred to as reinforcement learning.

-Reinforcement learning has been successful in applications as diverse as autonomous
helicopter flight, robot legged locomotion, cell-phone network routing, marketing
strategy selection, factory control and efficient webpage indexing.

1.2 Shallow Learning:

-Shallow architectures are well understood and perform well on many common
machine learning problems, and they are still used in the vast majority of today’s
machine learning applications.

-However, there has been increased interest in deep architectures recently, in the
hope of finding ways to solve more complex real-world problems (e.g., image analysis
or natural language understanding) for which shallow architectures are unable to learn
adequate models.

1.3 Deep Learning:

-Deep learning is a relatively new area of machine learning which has gained
popularity in the recent past.

-Deep learning refers to the architectures which contain multiple hidden layers (deep
networks) to learn different features with multiple levels of abstraction.

-Deep learning algorithms seek to exploit the unknown structure in the input
distribution in order to discover good representations, often at multiple levels, with
higher level learned features defined in terms of lower-level features.


-Conventional machine learning techniques are limited in their ability to process
natural data in its raw form.

-For decades, constructing a pattern recognition or machine learning system required
considerable domain expertise and careful hand engineering to come up with a feature
extractor that transformed the raw data (such as pixel values of an image) into a suitable
internal representation or feature vector from which the learning system, such as a
classifier, could detect or classify patterns in the input.

-Deep learning allows inputting the raw data (pixels in case of image data) to the
learning algorithm without first extracting features or defining a feature vector.

-Deep learning algorithms can learn the right set of features, and they do this much
better than hand-coded feature extraction can.

-Instead of handcrafting a set of rules and algorithms to extract features from raw data,
deep learning involves learning these features automatically during the training
process.

-In deep learning, a problem is realized in terms of a hierarchy of concepts, with each
concept built on top of the others.

-The lower layers of the model encode some basic representation of the problem,
whereas higher level layers build upon these lower layers to form more complex
concepts.

-Given an image, the pixel intensity values are fed as inputs to the deep learning
system.

-A number of hidden layers then extract features from the input image.

-These hidden layers are built upon each other in a hierarchical fashion.

-At first, the lower-level layers of the network detect only edge-like regions.

-These edge regions are then used to define corners (where edges intersect) and
contours (outlines of objects).

-The layers in the higher level combine corners and contours to lead to more abstract
“object parts” in the next layer.

-The key aspect of deep learning is that these layers of features are not handcrafted
and designed by human engineers; rather, they are learnt from data gradually using a
general-purpose learning procedure.

-Finally, the output layer classifies the image and obtains the output class label—the
output obtained at the output layer is directly influenced by every other node available
in the network.


-This process can be viewed as hierarchical learning, as each layer in the network uses
the output of previous layers as “building blocks” to construct increasingly
complex concepts at the higher layers.

-Figure 1.1 compares the traditional machine learning approach based on handcrafted
features with the deep learning approach based on hierarchical representation learning.

Fig. 1.1 a Conventional machine learning using hand-designed feature extraction algorithms Vs
deep learning approach using hierarchy of representations

-The word “deep” refers to learning successive layers of increasingly meaningful
representations of input data.

-The number of layers used to model the data determines the depth of the model.

-Current deep learning often involves learning tens or even hundreds of successive
layers of representation from the training data automatically.

-The conventional approaches to machine learning often focus on learning only one or
two layers of representations of data; such approaches are often categorized as
shallow learning.

-Deep learning is a sub-field of machine learning, which in turn is a sub-field of
Artificial Intelligence (AI).

-Figure 1.2 illustrates the relationship between AI, machine learning, and deep
learning.

-In deep learning, the successive layers of representations may be learned via
sub-models, which are structured in the form of layers stacked on top of each other.


-As a deep learning network typically has more layers and parameters, it has the
potential to represent more complex inputs.

-Although deep learning has been around since the 1980s, it was relatively unpopular
for several years, as the computational infrastructure (both hardware and software)
was not adequate and the available datasets were quite small.

-With the decline in the popularity of the conventional neural networks, it was only
recently that deep networks made a big reappearance by achieving spectacular results
in speech recognition and computer vision tasks.


Some of the aspects that helped in the evolution of deep networks are listed below:
-Improved computational resources for processing massive amounts of data and
training much larger models.
-Automatic feature extraction.

The term “artificial neural network” borrows from neuroscience, but deep learning
networks are not models of the brain; they are formulated by drawing only inspiration
from our understanding of the biological brain.

Not all the components of deep models are inspired by neuroscience; some of them
come from empirical exploration, theory, and intuition.

The neural activity in our brains is far more complex than might be suggested by
simply studying artificial neurons.

The learning mechanisms used by deep learning models are in no way comparable to
the human brain, but can be described as a mathematical framework for learning
representations from data.

Figure 1.2 shows an example of a deep learning architecture that can be used for
character recognition. Figure 1.3 shows representations that are learned by the deep
learning network.

-The deep network uses several layers to transform the input image (here a digit) in
order to recognize what the digit is.

Each layer performs some transformations on the input that it receives from the
previous layers.

-The deep network transforms the digit image into representations that tend to capture
a higher level of abstraction.

-Each hidden layer transforms the input image into a representation that is
increasingly different from the original image and increasingly informative about the
final result.

-The learnt representations help to distinguish between different concepts, which in
turn helps to find similarities between them.

-A deep network can be thought of as a multistage information-distillation operation,
where layers apply multiple filters to the information to obtain an increasingly
transformed form of it (i.e., the information that is useful with regard to some
task).

-In summary, a deep learning network constructs features at multiple levels, with
higher features constructed as functions of lower ones.

-It is a fast-growing field that circumvents the problem of feature extraction which is
used as a prelude by conventional machine learning approaches.


-Deep learning is capable of learning the appropriate features by itself, requiring little
steering by the user.

1.4 Why to Use Deep Learning

-The choice of features that represent a given dataset has a profound impact on the
success of a machine learning system.

-Better results cannot be achieved without identifying which aspects of the problem
need to be included for feature extraction that would be more useful to the machine
learning algorithm.

-This requires a machine learning expert to collaborate with the domain expert in
order to obtain a useful feature set.

-A biological brain can easily determine which aspects of the problem it needs to
focus on with comparatively little guidance.

-This is not the case with the artificial agents, thereby making it difficult to create
computer learning systems that can respond to high-dimensional input and perform
hard AI tasks.

-Machine learning practitioners have spent a huge amount of time extracting
informative features from the data.

-By the time deep learning burst onto the scene, state-of-the-art machine learning
algorithms had already taken decades of human effort to accumulate the relevant set of
features required to classify the input.

-Deep learning has surpassed those conventional algorithms in accuracy as the
features are learnt from the data using a general-purpose learning procedure instead of
being designed by human engineers.

-Deep networks have demonstrated dramatic improvements in computer vision and
machine translation, and have taken off as an effective AI technique that can
recognize spoken words nearly as well as a person can.

-It has achieved not only the excellent accuracy in machine learning modeling, but it
has also demonstrated outstanding generalization power that has even attracted
scientists from other academic disciplines.

-It is now being used as a guide to make key decisions in fields like medicine, finance,
manufacturing, and beyond.

-Deep learning grew to prominence in 2007, with promising results on perceptual
problems such as hearing and seeing, problems that humans are very good at but that
have long been elusive for machines.


-It has enabled computer scientists to harness vast computational power and use
large volumes of data (audio, video) to teach computers how to do things that seem
natural and intuitive for humans, such as spotting objects in photos, recognizing
words or sentences, and translating a document into another language.

-It has made it possible for machines to produce a transcript from an audio clip
(speech recognition), to identify whether an email is spam, to estimate the likelihood
that a customer will repay a loan, and so on; as long as there is enough data to train
machines, the possibilities are endless.

-It has achieved state-of-the-art results on many applications, such as natural language
parsing, language modeling, image and character recognition, playing the challenging
game of Go, pixels-to-controls video game playing, and in other applications.

-Today, many large technology companies, including Facebook, Baidu, Amazon,
Microsoft, and Google, have commercially deployed deep learning applications.

-These companies have vast amounts of data, and deep learning works well whenever
there are vast volumes of data and complex problems to solve.

-Many companies are using deep learning to develop more helpful and realistic
customer service representatives in the form of chatbots.

-In particular, deep learning has had a strong impact in historically difficult areas of
machine learning:
• Near-human-level image classification;
• Near-human-level speech recognition;
• Near-human-level handwriting transcription;
• Improved self-driving cars;
• Digital assistants such as Google Now, Microsoft Cortana, Apple’s Siri, and
Amazon Alexa;
• Improved ad targeting, as used by Google, Baidu, and Bing;
• Improved search results on the web;
• Ability to answer natural language questions; and
• Superhuman Go, Shogi, and Chess playing.

-The exceptional performance of deep models can be mainly attributed to their
flexibility in representing a rich set of highly nonlinear functions as well as the
devised methods for efficient training of these powerful networks.

-Furthermore, employing various regularization techniques ensured that deep models
with huge numbers of free parameters are statistically desirable in the sense that they
will generalize well to unseen data.

-The automatic and generic approach of feature learning in deep models enables one
to use them across different applications (e.g., image classification, speech
recognition, language modeling, and information retrieval) with relatively few
adjustments.


-Therefore, deep models seem to be domain-oblivious in the sense that only a small
amount of domain-specific customization is required to use them across different
applications.

-Ideally, the domain-obliviousness of deep networks is advantageous, as having
access to a universal and generic model reduces the hassle of adapting to new
applications.

-Deep learning is still in its infancy, but it is likely to have many more successes in
the near future, as it requires little hand engineering and can thus take advantage of
vast amounts of data and computation power.

-Deep learning has succeeded on previously unsolved problems that were quite
difficult to resolve using conventional machine learning and shallow networks.

-The dramatic progress of deep learning has sparked such a burst of activity that
venture capitalists who did not even know what deep learning was all about some
years back, today are suspicious of the startups that do not have it.

-In the near future, deep learning may herald an age in which it assists humans in
software development, science, and many other areas.

-Integrating deep learning with the whole toolbox of other artificial intelligence
techniques may accomplish startling things that will have great impact in the field of
technology.

1.5 How Deep Learning Works

-Deep networks map input to target via a sequence of layered transformations, and
these layered transformations are learned by exposure to the training examples.

-The transformations that a layer applies to its input are determined by the layer’s
weights, which are basically a bunch of numbers.

-In other words, transformations implemented by a layer are parameterized by its
weights.
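
As a rough illustration of how a layer's weights parameterize its transformation, the following Python sketch passes an input through two stacked layers. The tanh nonlinearity, the layer sizes, the function names, and the hand-picked weight values are all assumptions made for this example; they do not come from the notes.

```python
import math

def dense_layer(inputs, weights, biases):
    """Apply one layer: a weighted sum per unit followed by a tanh nonlinearity."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(math.tanh(total))
    return outputs

def forward(inputs, layers):
    """Pass the input through a sequence of (weights, biases) layers."""
    activation = inputs
    for weights, biases in layers:
        activation = dense_layer(activation, weights, biases)
    return activation

# Hypothetical 2-3-1 network with hand-picked weights (illustration only).
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
output = forward([1.0, 2.0], layers)
```

Changing any number in `layers` changes the mapping from input to output, which is why learning amounts to searching for good weight values.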

-In this context, learning can be defined as the process of finding the values of the
weights of all layers in the network in such a manner that input examples can be
correctly mapped to their associated targets.

-A deep learning network contains thousands of parameters, and finding the right
values of these parameters is not an easy task, particularly when the value of one
parameter has an impact on the value of another parameter.

-In order to train a deep network, one needs to find out how far the calculated output
of the network is from the desired value.

-This measure is obtained by using a loss function, also called the objective function.
It gives a measure of how well the network has learnt a specific example.


-The objective of training is to find the values of the weights that minimize the
chosen loss function.

-The loss score is then used as a feedback signal to adjust the weights of the
network in such a way that the loss score for the current example is lowered.

-This adjustment is done by the optimizer, which relies on the backpropagation
algorithm, the central algorithm in deep learning.

-Training begins by assigning random values to the weights, so that the network
initially just implements a series of random transformations.

-Initially, the output obtained from the network can be far from what it should be, and
accordingly the loss score may be very high.

-With every example that is fed to the network, the weights are adjusted in the
direction that decreases the loss score.

-This process is repeated a number of times, until the weight values that minimize the
loss function are obtained.

-A network is said to have learned when the output values obtained from the network
are as close as they can be to the target values.
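
The training procedure described in this section can be sketched for the simplest possible case: a single linear unit trained with per-example gradient descent on a squared-error loss. For one unit, the backpropagated gradient reduces to the expressions in the comments. The data, learning rate, number of epochs, and function names are illustrative assumptions, not part of the notes.

```python
import random

def train_linear_unit(examples, learning_rate=0.05, epochs=200, seed=0):
    """Fit prediction = w * x + b by repeatedly lowering the per-example loss."""
    rng = random.Random(seed)
    # Start from random weights, so the unit initially computes a random mapping.
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    for _ in range(epochs):
        for x, target in examples:
            prediction = w * x + b
            error = prediction - target  # feedback signal
            # Gradient of the loss 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b

def mean_squared_loss(examples, w, b):
    """Average squared error of the unit over all examples."""
    return sum((w * x + b - t) ** 2 for x, t in examples) / len(examples)

# Hypothetical training data generated from target = 2x + 1.
examples = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = train_linear_unit(examples)
final_loss = mean_squared_loss(examples, w, b)
```

With this noiseless data the learned `w` and `b` approach 2 and 1, and `final_loss` approaches zero. A real deep network repeats the same loop with many layers and uses backpropagation to obtain the gradient for every weight.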

1.6 Deep Learning Challenges

-Deep learning networks have brought their own set of problems and challenges
which outweighed the benefits of deep architectures for several decades.

-Training these architectures for general use was impractically slow.

-With limited computational power, deep learning networks were already overtaken
by other approaches such as kernel methods.

-The significant growth in computational power (particularly in GPUs and
distributed computing) and access to large labeled datasets paved the way for their
return.

-However, despite the remarkable advances in this area, training deep models with a
huge number of free parameters is an intricate and ill-posed optimization problem.

-Many research works have been dedicated to creating efficient training methods for
deep architectures.

-The strategies reported in the literature that deal with the difficulties of training deep
networks include developing better optimizers, using well-designed initialization
strategies, using activation functions based on local competition and using skip
connections between layers with the aim to improve the flow of information.


-However, deep network training still faces problems which are caused by the
stacking of several nonlinear transformations and need to be addressed.

-Moreover, deep learning involves using large amounts of data to learn progressively.

-While large amounts of data are available in many applications, in some areas
copious amounts of data are rarely available.

-More flexible models are required to achieve an enhanced learning ability when only
a limited amount of data is available.

1.7 How Learning Differs from Pure Optimization

-Optimization algorithms used for training of deep models differ from traditional
optimization algorithms in several ways.

-Machine learning usually acts indirectly.

-In most machine learning scenarios, we care about some performance measure P, that
is defined with respect to the test set and may also be intractable.

-We therefore optimize P only indirectly. We reduce a different cost function J(θ) in
the hope that doing so will improve P.

-This is in contrast to pure optimization, where minimizing J is a goal in and of itself.

-Optimization algorithms for training deep models also typically include some
specialization on the specific structure of machine learning objective functions.

-Typically, the cost function can be written as an average over the training set, such as

$$J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} L(f(x;\theta), y) \tag{7.1}$$

-where L is the per-example loss function, f(x;θ) is the predicted output when the
input is x, and $\hat{p}_{\text{data}}$ is the empirical distribution.

In the supervised learning case, y is the target output.

Throughout this chapter, we develop the unregularized supervised case, where the
arguments to L are f(x;θ) and y.

However, it is trivial to extend this development, for example, to include θ or x as
arguments, or to exclude y as arguments, in order to develop various forms of
regularization or unsupervised learning.

Equation 7.1 defines an objective function with respect to the training set.


We would usually prefer to minimize the corresponding objective function where the
expectation is taken across the data generating distribution $p_{\text{data}}$ rather than just over
the finite training set:

$$J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} L(f(x;\theta), y) \tag{7.2}$$

1.7.1 Empirical Risk Minimization

-The goal of a machine learning algorithm is to reduce the expected generalization
error given by equation 7.2. This quantity is known as the risk.

-We emphasize here that the expectation is taken over the true underlying distribution
$p_{\text{data}}$.

-If we knew the true distribution $p_{\text{data}}(x, y)$, risk minimization would be an
optimization task solvable by an optimization algorithm.

-However, when we do not know $p_{\text{data}}(x, y)$ but only have a training set of samples,
we have a machine learning problem.

-The simplest way to convert a machine learning problem back into an optimization
problem is to minimize the expected loss on the training set.

-This means replacing the true distribution p(x, y) with the empirical distribution
$\hat{p}(x, y)$ defined by the training set.

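-We now minimize the empirical risk

$$\mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} [L(f(x;\theta), y)] = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)}) \tag{7.3}$$

where m is the number of training examples.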

-The training process based on minimizing this average training error is known as
empirical risk minimization.
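
The empirical risk is simply the average per-example loss over the m training examples. A minimal Python sketch follows, assuming a squared-error loss and a one-dimensional linear model; both are choices made for this example only, and the dataset is hypothetical.

```python
def squared_loss(prediction, target):
    """Per-example loss L, here chosen to be the squared error."""
    return (prediction - target) ** 2

def linear_model(x, params):
    """Prediction f(x; theta) for theta = (w, b)."""
    w, b = params
    return w * x + b

def empirical_risk(model, params, dataset, loss=squared_loss):
    """Average per-example loss over the m training examples."""
    m = len(dataset)
    return sum(loss(model(x, params), y) for x, y in dataset) / m

# Hypothetical training set lying exactly on y = 2x + 1.
dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
risk_good = empirical_risk(linear_model, (2.0, 1.0), dataset)
risk_bad = empirical_risk(linear_model, (0.0, 0.0), dataset)
```

Here `risk_good` is zero because the parameters reproduce every training target, while `risk_bad` is (1 + 9 + 25) / 3. Neither number says anything directly about the risk on unseen data.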

-In this setting, machine learning is still very similar to straightforward optimization.

-Rather than optimizing the risk directly, we optimize the empirical risk, and hope
that the risk decreases significantly as well.

-A variety of theoretical results establish conditions under which the true risk can be
expected to decrease by various amounts. However, empirical risk minimization is
prone to overfitting. Models with high capacity can simply memorize the training set.

-In many cases, empirical risk minimization is not really feasible.


-The most effective modern optimization algorithms are based on gradient descent,
but many useful loss functions, such as 0-1 loss, have no useful derivatives (the
derivative is either zero or undefined everywhere).

-These two problems mean that, in the context of deep learning, we rarely use
empirical risk minimization.

-Instead, we must use a slightly different approach, in which the quantity that we
actually optimize is even more different from the quantity that we truly want to
optimize.

1.7.2 Surrogate Loss Functions and Early Stopping

-Sometimes, the loss function we actually care about (say classification error) is not
one that can be optimized efficiently.

-For example, exactly minimizing expected 0-1 loss is typically intractable
(exponential in the input dimension), even for a linear classifier (Marcotte and Savard,
1992).

-In such situations, one typically optimizes a surrogate loss function instead, which
acts as a proxy but has advantages.

-For example, the negative log-likelihood of the correct class is typically used as a
surrogate for the 0-1 loss.

-The negative log-likelihood allows the model to estimate the conditional probability
of the classes, given the input, and if the model can do that well, then it can pick the
classes that yield the least classification error in expectation.

-In some cases, a surrogate loss function actually results in being able to learn more.

-For example, the test set 0-1 loss often continues to decrease for a long time after the
training set 0-1 loss has reached zero, when training using the log-likelihood surrogate.

-This is because even when the expected 0-1 loss is zero, one can improve the
robustness of the classifier by further pushing the classes apart from each other,
obtaining a more confident and reliable classifier, thus extracting more information
from the training data than would have been possible by simply minimizing the
average 0-1 loss on the training set.
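
The difference between the two losses can be seen in a small Python sketch. It assumes a one-parameter logistic classifier and a hypothetical, linearly separable dataset; the function names are illustrative. The 0-1 loss is already zero for every positive weight tried, while the negative log-likelihood keeps decreasing as the classes are pushed further apart.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def zero_one_loss(w, data):
    """Fraction of examples whose predicted class (p >= 0.5) is wrong."""
    errors = sum(1 for x, y in data if (sigmoid(w * x) >= 0.5) != (y == 1))
    return errors / len(data)

def negative_log_likelihood(w, data):
    """Average negative log-probability assigned to the correct class."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(data)

# Hypothetical separable data: negative inputs are class 0, positive are class 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
for w in (0.5, 1.0, 2.0):
    print(w, zero_one_loss(w, data), round(negative_log_likelihood(w, data), 4))
```

Because the 0-1 loss is flat across these weights, it gives a gradient-based optimizer nothing to follow, whereas the surrogate still provides a useful signal.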

-A very important difference between optimization in general and optimization as we
use it for training algorithms is that training algorithms do not usually halt at a local
minimum.

-Instead, a machine learning algorithm usually minimizes a surrogate loss function but
halts when a convergence criterion based on early stopping is satisfied.


-Typically, the early stopping criterion is based on the true underlying loss function,
such as 0-1 loss measured on a validation set, and is designed to cause the algorithm
to halt whenever overfitting begins to occur.

-Training often halts while the surrogate loss function still has large derivatives,
which is very different from the pure optimization setting, where an optimization
algorithm is considered to have converged when the gradient becomes very small.
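
One common way to implement such a criterion is sketched below in Python. The `patience` parameter (how many evaluations without improvement are tolerated) and the sequence of validation losses are assumptions made for this illustration; the notes do not prescribe a particular rule.

```python
def early_stopping_step(validation_losses, patience=2):
    """Return the index of the best step, stopping once the validation loss
    has failed to improve for `patience` consecutive evaluations."""
    best_loss = float("inf")
    best_step = -1
    waited = 0
    for step, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_step, waited = loss, step, 0
        else:
            waited += 1
            if waited >= patience:
                break  # halt even though the training loss could still be reduced
    return best_step

# Hypothetical validation 0-1 loss per epoch: it improves, then overfitting begins.
validation_losses = [0.40, 0.31, 0.25, 0.22, 0.24, 0.27, 0.21]
best = early_stopping_step(validation_losses)
```

Training halts after the two consecutive non-improving evaluations that follow epoch 3, so the later value 0.21 is never observed. The stopping decision depends only on the validation loss, not on the size of the surrogate's gradient.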

1.7.3 Batch and Minibatch Algorithms

-One aspect of machine learning algorithms that separates them from general
optimization algorithms is that the objective function usually decomposes as a sum
over the training examples.

-Optimization algorithms for machine learning typically compute each update to the
parameters based on an expected value of the cost function estimated using only a
subset of the terms of the full cost function.

-For example, maximum likelihood estimation problems, when viewed in log space,
decompose into a sum over each example.

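$$\theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}, y^{(i)}; \theta) \tag{7.4}$$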

Maximizing this sum is equivalent to maximizing the expectation over the
empirical distribution defined by the training set:

$$J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x, y; \theta) \tag{7.5}$$

Most of the properties of the objective function J used by most of our optimization
algorithms are also expectations over the training set. For example, the most
commonly used property is the gradient:

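$$\nabla_{\theta} J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} \nabla_{\theta} \log p_{\text{model}}(x, y; \theta) \tag{7.6}$$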

-Computing this expectation exactly is very expensive because it requires evaluating
the model on every example in the entire dataset.

-In practice, we can compute these expectations by randomly sampling a small
number of examples from the dataset, then taking the average over only those
examples.
-Recall that the standard error of the mean (equation 5.46) estimated from n samples
is given by σ/√n, where σ is the true standard deviation of the value of the samples.


-The denominator of √n shows that there are less than linear returns to using more
examples to estimate the gradient.

-Compare two hypothetical estimates of the gradient, one based on 100 examples and
another based on 10,000 examples.

-The latter requires 100 times more computation than the former, but reduces the
standard error of the mean only by a factor of 10.
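
The arithmetic behind this comparison can be checked with a few lines of Python. The value of σ is an arbitrary assumption; only the ratios matter.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean estimated from n samples."""
    return sigma / math.sqrt(n)

sigma = 1.0  # assumed true standard deviation of the per-example values
se_small = standard_error(sigma, 100)
se_large = standard_error(sigma, 10_000)

cost_ratio = 10_000 / 100          # how much more computation the larger estimate needs
error_ratio = se_small / se_large  # how much the standard error shrinks in return
```

The computation grows by a factor of 100 while the standard error shrinks only by a factor of √100 = 10, which is the less-than-linear return referred to in the text.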

-Most optimization algorithms converge much faster (in terms of total computation,
not in terms of number of updates) if they are allowed to rapidly compute
approximate estimates of the gradient rather than slowly computing the exact gradient.

-Another consideration motivating statistical estimation of the gradient from a small
number of samples is redundancy in the training set.

-In the worst case, all m samples in the training set could be identical copies of each
other.

-A sampling-based estimate of the gradient could compute the correct gradient with a
single sample, using m times less computation than the naive approach.

-In practice, we are unlikely to truly encounter this worst-case situation, but we may
find large numbers of examples that all make very similar contributions to the
gradient.
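
The worst-case redundancy argument can be made concrete with a short Python sketch. It assumes a one-parameter model with a squared-error loss and a hypothetical dataset consisting of m copies of the same example.

```python
def per_example_gradient(w, x, y):
    """Gradient of 0.5 * (w * x - y)**2 with respect to w."""
    return (w * x - y) * x

def full_gradient(w, dataset):
    """Exact gradient: the average of the per-example gradients over the whole set."""
    return sum(per_example_gradient(w, x, y) for x, y in dataset) / len(dataset)

m = 1000
dataset = [(2.0, 3.0)] * m  # m identical copies of one example
w = 0.5

exact = full_gradient(w, dataset)                 # touches all m examples
single = per_example_gradient(w, *dataset[0])     # touches one example
```

Both values are identical, yet `single` costs m times less computation. Real datasets are not this extreme, but many examples contribute nearly the same gradient, so a small sample often captures most of the information.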

-Optimization algorithms that use the entire training set are called batch or
deterministic gradient methods, because they process all of the training examples
simultaneously in a large batch.

-This terminology can be somewhat confusing because the word “batch” is also often
used to describe the minibatch used by minibatch stochastic gradient descent.

-Typically, the term “batch gradient descent” implies the use of the full training set,
while the use of the term “batch” to describe a group of examples does not.

-For example, it is very common to use the term “batch size” to describe the size of a
minibatch.

-Optimization algorithms that use only a single example at a time are sometimes
called stochastic or sometimes online methods.

-The term online is usually reserved for the case where the examples are drawn from a
stream of continually created examples rather than from a fixed-size training set over
which several passes are made.

-Most algorithms used for deep learning fall somewhere in between, using more than
one but less than all of the training examples.
-These were traditionally called minibatch or minibatch stochastic methods and it is
now common to simply call them stochastic methods.


-The canonical example of a stochastic method is stochastic gradient descent,
presented in detail in section 8.3.1.

-Several factors influence the choice of minibatch size:

-Larger batches yield more accurate gradient estimates, but with diminishing (less
than linear) returns.

-Very small batches underutilize multicore architectures, so below some minimum
batch size there is no reduction in the time needed to process a minibatch.

-Memory constraints often limit the batch size, because the memory required grows
with the number of examples processed in parallel.

-On hardware such as GPUs, power-of-2 batch sizes (commonly between 32 and 256) tend
to give better runtime.

-Small batches may provide a regularizing effect due to the noise they add, often
improving generalization error; generalization error is often best for a batch size
of 1. However, smaller batches require smaller learning rates for stability, and they
increase total runtime because more steps are needed to cover the dataset.

-Different algorithms use minibatch information differently. Methods that compute
updates from the gradient g alone are relatively robust and can handle smaller batch
sizes (around 100), while second-order methods, which also use the Hessian matrix H
and compute updates such as H⁻¹g, demand much larger batch sizes (around 10,000) to
reduce fluctuations in the estimates of H⁻¹g. If H or its inverse is poorly
conditioned, small errors in the estimate of g are amplified into large changes in
the update, causing instability.

-Minibatches must be selected randomly. Computing an unbiased estimate of the
expected gradient from a set of samples requires that those samples be independent.

-We also wish for two subsequent gradient estimates to be independent from each
other, so two subsequent minibatches of examples should also be independent from
each other.

-Many datasets are most naturally arranged in a way where successive examples are
highly correlated.

-For example, we might have a dataset of medical data with a long list of blood
sample test results.

-This list might be arranged so that first we have five blood samples taken at different
times from the first patient, then we have three blood samples taken from the second
patient, then the blood samples from the third patient, and so on.

-If we were to draw examples in order from this list, then each of our minibatches
would be extremely biased, because it would represent primarily one patient out of the
many patients in the dataset.

-In cases such as these where the order of the dataset holds some significance, it is
necessary to shuffle the examples before selecting minibatches.

-For very large datasets, for example datasets containing billions of examples in a data
centre, it can be impractical to sample examples truly uniformly at random every time
we want to construct a minibatch.

-Fortunately, in practice it is usually sufficient to shuffle the order of the dataset once
and then store it in shuffled fashion.


-This will impose a fixed set of possible minibatches of consecutive examples that all
models trained thereafter will use, and each individual model will be forced to reuse
this ordering every time it passes through the training data.

-However, this deviation from true random selection does not seem to have a
significant detrimental effect.

-Failing to ever shuffle the examples in any way can seriously reduce the
effectiveness of the algorithm.
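
The shuffle-once strategy described above can be sketched as follows. This is an illustrative sketch, not a prescribed implementation: the helper names `shuffle_once` and `minibatches` and the toy patient dataset are assumptions.

```python
import numpy as np

def shuffle_once(X, y, seed=0):
    """Return a copy of the dataset in a fixed random order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    return X[order], y[order]

def minibatches(X, y, batch_size):
    """Yield consecutive minibatches from the (already shuffled) dataset."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Toy dataset ordered by patient, so neighbouring rows are highly correlated.
patient_id = np.repeat(np.arange(10), 5)        # 10 patients, 5 samples each
X = patient_id.reshape(-1, 1).astype(float)
y = patient_id.astype(float)

# Shuffle a single time, then reuse the same order on every pass (epoch).
X_shuffled, y_shuffled = shuffle_once(X, y)
batches = list(minibatches(X_shuffled, y_shuffled, batch_size=8))
print(len(batches))
```

Without the shuffle, the first minibatch of 8 rows would contain only the first two patients, which is the kind of bias the text warns about.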

Parallelism and Generalization in Minibatch SGD


-Many optimization problems in machine learning decompose over examples well
enough that we can compute entire separate updates over different examples in
parallel.
-In other words, we can compute the update that minimizes J(X) for one minibatch of
examples X at the same time that we compute the update for several other
minibatches.

-Such approaches are known as asynchronous parallel distributed approaches.

Generalization Error and Online Learning in SGD


-An interesting motivation for minibatch stochastic gradient descent is that it follows
the gradient of the true generalization error (equation 8.2) so long as no examples are
repeated.

-Most implementations of minibatch stochastic gradient descent shuffle the dataset
once and then pass through it multiple times.

-On the first pass, each minibatch is used to compute an unbiased estimate of the true
generalization error.

-On the second pass, the estimate becomes biased because it is formed by re-sampling
values that have already been used, rather than obtaining new fair samples from the
data generating distribution.

-The fact that stochastic gradient descent minimizes generalization error is easiest to
see in the online learning case, where examples or minibatches are drawn from a
stream of data.

-In other words, instead of receiving a fixed-size training set, the learner is similar to a
living being who sees a new example at each instant, with every example (x, y)
coming from the data generating distribution pdata(x, y).

-In this scenario, examples are never repeated; every experience is a fair sample from
pdata.

-The equivalence is easiest to derive when both x and y are discrete. In this case, the
generalization error (equation 8.2) can be written as a sum.


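\[ J^*(\theta) = \sum_x \sum_y p_{\text{data}}(x, y)\, L(f(x; \theta), y) \tag{7.7} \]

with the exact gradient

\[ g = \nabla_\theta J^*(\theta) = \sum_x \sum_y p_{\text{data}}(x, y)\, \nabla_\theta L(f(x; \theta), y) \tag{7.8} \]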

-We have already seen the same fact demonstrated for the log-likelihood in equation
7.5 and equation 7.6; we observe now that this holds for other functions L besides the
likelihood.
-A similar result can be derived when x and y are continuous, under mild assumptions
regarding pdata and L.
-Hence, we can obtain an unbiased estimator of the exact gradient of the
generalization error by sampling a minibatch of examples {x(1), . . . x(m)} with
corresponding targets y(i) from the data generating distribution pdata, and computing
the gradient of the loss with respect to the parameters for that minibatch.

\[ \hat{g} = \frac{1}{m} \nabla_\theta \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}) \tag{7.9} \]

Updating θ in the direction of ĝ performs SGD on the generalization error.
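
The online setting can be sketched in code. The example below is a minimal sketch under stated assumptions: the data generating distribution is a noisy line with parameters `true_theta`, the loss L is squared error, and the model f is linear. None of these choices come from the notes, and all names are hypothetical. Every minibatch is drawn fresh, so no example is ever reused.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([2.0, 1.0])  # slope and intercept of the assumed data generator

def sample_minibatch(m):
    """Draw m fresh examples (x, y) from the assumed data generating distribution."""
    x = rng.normal(size=m)
    y = true_theta[0] * x + true_theta[1] + 0.1 * rng.normal(size=m)
    features = np.stack([x, np.ones(m)], axis=1)  # second column models the intercept
    return features, y

def minibatch_gradient(theta, features, y):
    """Minibatch estimate: (1/m) * gradient of the summed squared-error loss."""
    m = len(y)
    residual = features @ theta - y
    return (2.0 / m) * features.T @ residual

theta = np.zeros(2)
learning_rate = 0.1
for step in range(500):
    features, y = sample_minibatch(32)      # examples are never reused
    # Descent step: move against the estimated gradient.
    theta -= learning_rate * minibatch_gradient(theta, features, y)

print(theta)
```

Because each minibatch is a fair sample from the generator, every step uses an unbiased estimate of the gradient of the generalization error, and θ settles close to the generating parameters.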

-Of course, this interpretation only applies when examples are not reused.
Nonetheless, it is usually best to make several passes through the training set, unless
the training set is extremely large.

-When multiple such epochs are used, only the first epoch follows the unbiased
gradient of the generalization error, but of course, the additional epochs usually
provide enough benefit due to decreased training error to offset the harm they cause
by increasing the gap between training error and test error.

-With some datasets growing rapidly in size, faster than computing power, it is
becoming more common for machine learning applications to use each training
example only once or even to make an incomplete pass through the training set.

-When using an extremely large training set, overfitting is not an issue, so underfitting
and computational efficiency become the predominant concerns.

-See also Bottou and Bousquet (2008) for a discussion of the effect of computational
bottlenecks on generalization error, as the number of training examples grows.

1.8 Challenges in Neural Network Optimization

• Optimization, in general, is an extremely difficult task.


• Traditionally, machine learning has avoided the difficulty of general
optimization by carefully designing the objective function and constraints to
ensure that the optimization problem is convex.


• When training neural networks, we must confront the general non-convex case.
• Even convex optimization is not without its complications.
• This section summarizes several of the most prominent challenges involved in
optimization for training deep models.

1.8.1 Ill-Conditioning:

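-Ill-conditioning of the Hessian matrix H can cause SGD to get stuck, in the sense
that even very small steps increase the cost function. A second-order Taylor series
expansion predicts that a gradient descent step of −εg adds ½ε²g⊺Hg − εg⊺g to the
cost, so ill-conditioning becomes a problem when ½ε²g⊺Hg exceeds εg⊺g.
-In many cases, the gradient norm does not shrink significantly throughout learning,
but the g⊺Hg term grows by more than an order of magnitude.
-The result is that learning becomes very slow despite the presence of a strong
gradient, because the learning rate must be shrunk to compensate for even stronger
curvature.
-Figure 8.1 shows an example of the gradient increasing significantly during the
successful training of a neural network.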


-Though ill-conditioning is present in other settings besides neural network training,
some of the techniques used to combat it in other contexts are less applicable to neural
networks.

-For example, Newton’s method is an excellent tool for minimizing convex functions
with poorly conditioned Hessian matrices, but in the subsequent sections we will
argue that Newton’s method requires significant modification before it can be applied
to neural networks.
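
The effect of strong curvature on the usable learning rate can be sketched numerically. A second-order Taylor expansion predicts that a gradient step of −εg changes the cost by ½ε²g⊺Hg − εg⊺g. The example below evaluates that prediction on a toy quadratic cost with Hessian diag(1, 100); the cost function and the helper name `predicted_cost_change` are assumptions chosen for illustration.

```python
import numpy as np

def predicted_cost_change(g, H, eps):
    """Second-order Taylor prediction for the cost change of the step -eps * g."""
    return 0.5 * eps**2 * (g @ H @ g) - eps * (g @ g)

# Quadratic cost f(x) = 0.5 * x^T H x with a poorly conditioned Hessian.
H = np.diag([1.0, 100.0])            # condition number 100
cost = lambda x: 0.5 * x @ H @ x
x = np.array([0.0, 1.0])             # point on the high-curvature axis
g = H @ x                            # gradient of the quadratic at x

for eps in (0.01, 0.03):
    actual = cost(x - eps * g) - cost(x)
    # For a quadratic cost the Taylor prediction equals the actual change.
    print(eps, predicted_cost_change(g, H, eps), actual)
```

With ε = 0.01 the step lowers the cost, while with ε = 0.03 the curvature term dominates and the same gradient direction raises the cost, even though the gradient itself is large.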

1.8.2 Local Minima


A convex optimization problem can be reduced to the problem of finding a local
minimum, because any local minimum of a convex function is guaranteed to be a global
minimum. When optimizing a convex function, a good solution has been reached whenever
a critical point of any kind is found. Non-convex functions, such as neural networks,
can have many local minima, but this is not necessarily a major problem.
Neural networks, and any models with multiple equivalently parametrized latent
variables, have multiple local minima because of the model identifiability problem. A
model is identifiable if a sufficiently large training set can rule out all but one
setting of its parameters. Models with latent variables are often not identifiable,
because equivalent models can be obtained by exchanging latent variables with each
other. For example, swapping the incoming weight vectors of two units in a layer, and
likewise swapping their outgoing weight vectors, gives an equivalent model. With m
layers of n units each, there are (n!)^m ways of arranging the hidden units. This kind
of non-identifiability is called weight space symmetry.

In addition to weight space symmetry, many neural networks have other causes of
non-identifiability. For example, in rectified linear or maxout networks, we can scale
all the incoming weights and biases of a unit by α if we also scale all of its outgoing
weights by 1/α. This means that, if the cost function does not include terms such as
weight decay that depend directly on the weights, every local minimum of a rectified
linear or maxout network lies on an (m×n)-dimensional hyperbola of equivalent local
minima.
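
Both sources of non-identifiability can be verified on a small network. The sketch below assumes a one-hidden-layer rectified linear network with randomly chosen weights; the function name `relu_network` and all sizes are hypothetical. It checks that swapping two hidden units, or rescaling one unit by α on the way in and 1/α on the way out (with α > 0), leaves the output unchanged.

```python
import numpy as np

def relu_network(x, W1, b1, W2):
    """One-hidden-layer rectified linear network: W2 @ relu(W1 @ x + b1)."""
    hidden = np.maximum(0.0, W1 @ x + b1)
    return W2 @ hidden

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # incoming weights, one row per hidden unit
b1 = rng.normal(size=4)
W2 = rng.normal(size=(2, 4))   # outgoing weights, one column per hidden unit
x = rng.normal(size=3)
reference = relu_network(x, W1, b1, W2)

# Weight space symmetry: swap hidden units 0 and 1 (rows of W1, columns of W2).
perm = [1, 0, 2, 3]
swapped = relu_network(x, W1[perm], b1[perm], W2[:, perm])

# Rescaling: multiply unit 0's incoming weights and bias by alpha > 0,
# and divide its outgoing weights by alpha.
alpha = 3.0
W1_scaled, b1_scaled, W2_scaled = W1.copy(), b1.copy(), W2.copy()
W1_scaled[0] *= alpha
b1_scaled[0] *= alpha
W2_scaled[:, 0] /= alpha
rescaled = relu_network(x, W1_scaled, b1_scaled, W2_scaled)

print(np.allclose(reference, swapped), np.allclose(reference, rescaled))
```

The rescaling argument relies on relu(αz) = α·relu(z) for positive α, which is why it applies to rectified linear units but not to, say, sigmoid units.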

Model identifiability issues mean that a neural network cost function can have an
extremely large or even infinite number of local minima, but these are all equivalent
in cost and therefore not problematic. Local minima are problematic only if they have
higher cost than the global minimum, and small networks can have such local minima.
It is an open question whether many high-cost local minima exist in practical
networks. Most practitioners once believed local minima were a major problem, but
today experts suspect that in large networks most local minima have low cost, and that
it is not necessary to find a true global minimum. Many practitioners attribute
optimization difficulty to local minima, but this should be tested for the specific
problem. One test is to plot the norm of the gradient over time: if it does not shrink
to an insignificant size, the problem is neither local minima nor any other kind of
critical point. In high-dimensional spaces, it is difficult to establish that local
minima are the problem, since other structures can also have small gradients.
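
The gradient-norm test mentioned above amounts to logging one number per step. A minimal sketch, assuming plain gradient descent on a hypothetical toy objective (the function name `gradient_descent_with_norms` is also an assumption):

```python
import numpy as np

def gradient_descent_with_norms(grad_fn, theta0, learning_rate, steps):
    """Run plain gradient descent and record the gradient norm at every step."""
    theta = np.array(theta0, dtype=float)
    norms = []
    for _ in range(steps):
        g = grad_fn(theta)
        norms.append(np.linalg.norm(g))
        theta = theta - learning_rate * g
    return theta, norms

# Hypothetical toy objective f(theta) = sum(theta**2), with gradient 2 * theta.
theta, norms = gradient_descent_with_norms(lambda t: 2.0 * t, [3.0, -4.0], 0.1, 50)
# A gradient norm that shrinks toward zero is consistent with approaching a
# critical point; a norm that stays large rules out local minima as the cause.
print(norms[0], norms[-1])
```

In a real training run the recorded norms would be plotted against the step count alongside the training cost.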


1.8.3 Plateaus, Saddle Points and Other Flat Regions:


- **Saddle points vs. local minima**: In many high-dimensional non-convex
functions, local minima and maxima are rare compared to saddle points, which have
zero gradient but mixed positive and negative Hessian eigenvalues.

- **Nature of saddle points**: At a saddle point, the Hessian matrix has both positive
and negative eigenvalues, meaning the point is a local minimum along some
directions and a local maximum along others.

- **Dimensionality effect**: In low-dimensional spaces, local minima are common;
in higher dimensions, saddle points dominate. The expected ratio of saddle points to
local minima grows exponentially with the dimension \(n\).

- **Intuition via eigenvalues**: The Hessian at a local minimum has only positive
eigenvalues; at a saddle point, it has a mixture. Imagining eigenvalue signs as coin
tosses, the probability of all positive eigenvalues (all heads) decreases exponentially
with dimension.

- **Eigenvalues and cost regions**: Eigenvalues are more likely positive in low-cost
regions, meaning local minima tend to have low cost. High-cost critical points are
typically saddle points; very high-cost critical points are likely local maxima.

- **Neural networks and saddle points**: Shallow autoencoders without
nonlinearities have global minima and saddle points but no high-cost local minima.
Deeper linear networks share similar properties. Although the output of such networks
is a linear function of the input, they are useful models of nonlinear networks
because their loss functions are non-convex functions of the parameters.

- **Empirical and theoretical support**: Dauphin et al. (2014) experimentally showed
that real neural networks have many high-cost saddle points. Choromanska et al.
(2014) provided theoretical arguments supporting this for related random functions.

- **Implications for training algorithms**:

- For first-order methods (such as gradient descent) the situation is unclear: the
gradient can become very small near a saddle point, but empirically gradient descent
often escapes saddle points.
- Goodfellow et al. (2015) visualized gradient descent escaping saddle points (see
figure 8.2) and argued that continuous-time gradient descent is repelled by, rather
than attracted to, a nearby saddle point.
- Newton’s method is designed to solve for a point where the gradient is zero, so
without modification it can jump to a saddle point. This helps explain why
second-order methods are less favored in deep learning.

- **Saddle-free Newton method**: Dauphin et al. (2014) introduced a saddle-free
Newton method that improves over traditional Newton’s method, but scaling it to large
networks remains challenging.


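- **Other zero-gradient points**: Besides minima and saddle points, maxima also
have zero gradient. From the perspective of optimization they behave much like saddle
points: many algorithms are not attracted to them, but unmodified Newton’s method is.
Like minima, maxima become rare with increasing dimension.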

- **Flat regions**: There can be wide, flat regions where both gradient and Hessian
are zero. These degenerate areas cause major problems for optimization algorithms. In
convex problems, such flat regions correspond to global minima; in general cases,
they may correspond to high objective values.
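
The points above can be illustrated on the simplest saddle, f(x, y) = x² − y². The sketch below is illustrative only: the toy function, the starting point, and the helper name `classify_critical_point` are assumptions, and the last line treats eigenvalue signs as independent fair coin tosses, as in the intuition given earlier.

```python
import numpy as np

def classify_critical_point(hessian):
    """Classify a critical point from the signs of its Hessian eigenvalues."""
    eigenvalues = np.linalg.eigvalsh(hessian)
    if np.all(eigenvalues > 0):
        return "local minimum"
    if np.all(eigenvalues < 0):
        return "local maximum"
    if np.any(eigenvalues > 0) and np.any(eigenvalues < 0):
        return "saddle point"
    return "degenerate"  # zero eigenvalues, no mixed signs: test is inconclusive

# Toy function f(x, y) = x**2 - y**2 with a saddle point at the origin.
H = np.array([[2.0, 0.0], [0.0, -2.0]])
gradient = lambda p: H @ p

start = np.array([1.0, 1e-3])

# Unmodified Newton step: solves for a zero of the gradient, so it jumps
# straight to the saddle point.
newton_point = start - np.linalg.solve(H, gradient(start))

# Plain gradient descent: shrinks x but grows y, moving away from the saddle.
gd_point = start.copy()
for _ in range(50):
    gd_point = gd_point - 0.1 * gradient(gd_point)

# Coin-toss intuition: chance that all n eigenvalue signs are positive.
all_positive_probability = [0.5**n for n in (1, 10, 100)]
print(classify_critical_point(H), newton_point, gd_point, all_positive_probability)
```

On this toy function the objective is unbounded below, so gradient descent keeps moving once it leaves the saddle; the point of the sketch is only the contrast between the two update rules near a saddle.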


1.8.4 Cliffs and Exploding Gradients

Deep neural networks, especially those with many layers, often have extremely steep
regions in the cost function that resemble cliffs. These steep regions form due to
the multiplication of multiple large weights, as illustrated in Figure 8.3. When
gradient descent encounters such a cliff, the update step can become excessively
large, causing the parameters to overshoot and jump far from the region being
explored. This problem occurs regardless of whether the cliff is approached from
above or below. The gradient clipping heuristic (§10.11.1) mitigates the issue by
limiting the step size. The gradient indicates only the optimal direction within a
small region, not the optimal step size, so when a very large step is proposed,
clipping shrinks it and keeps the update within a reasonable range. The problem is
particularly prominent in recurrent neural networks (RNNs), because they involve
repeated multiplications over many time steps, which makes long sequences especially
prone to exploding gradients. With gradient clipping, training remains stable:
drastic parameter updates are avoided while the update still follows the general
descent direction.
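
One common form of gradient clipping rescales the gradient whenever its norm exceeds a threshold. The sketch below assumes this norm-based variant; the function name `clip_gradient_norm` and the threshold value are hypothetical, and other variants (such as clipping each element separately) exist.

```python
import numpy as np

def clip_gradient_norm(g, threshold):
    """Rescale g so its norm is at most threshold, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        return g * (threshold / norm)
    return g

steep_gradient = np.array([300.0, -400.0])   # norm 500, as might occur near a cliff
gentle_gradient = np.array([0.3, -0.4])      # norm 0.5, left untouched

clipped = clip_gradient_norm(steep_gradient, threshold=1.0)
unchanged = clip_gradient_norm(gentle_gradient, threshold=1.0)
print(clipped, unchanged)
```

The clipped gradient points in the same direction as the original, so the update still follows the descent direction but with a bounded step length.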

1.8.5 Long-Term Dependencies

Another difficulty that neural network optimization algorithms must overcome arises
when the computational graph becomes extremely deep. Feedforward networks with
many layers have such deep computational graphs. So do recurrent networks,
described in chapter 10, which construct very deep computational graphs by
repeatedly applying the same operation at each time step of a long temporal sequence.
Repeated application of the same parameters gives rise to especially pronounced
difficulties.
For example, suppose that a computational graph contains a path that consists of
repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying
by \(W^t\). Suppose that W has the eigendecomposition

\[ W = V \,\mathrm{diag}(\lambda)\, V^{-1}. \]

It then follows that

\[ W^t = \left(V \,\mathrm{diag}(\lambda)\, V^{-1}\right)^t = V \,\mathrm{diag}(\lambda)^t\, V^{-1}. \]

Any eigenvalues λi whose absolute value is not near 1 will either explode, if they are
greater than 1 in magnitude, or vanish, if they are less than 1 in magnitude. The
vanishing and exploding gradient problem refers to the fact that gradients through
such a graph are scaled in the same way. Repeated multiplication by W is similar to
the power method algorithm. Recurrent networks use the same matrix W at each time
step, while feedforward networks do not, so even very deep feedforward networks can
largely avoid the problem.
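
The identity for repeated multiplication can be checked numerically. The sketch below builds a hypothetical 2×2 matrix from a chosen eigendecomposition (the values of `V`, `lam`, and `t` are assumptions for illustration) and compares repeated multiplication with the eigenvalue formula.

```python
import numpy as np

# Hypothetical 2x2 example built from a chosen eigendecomposition W = V diag(lam) V^-1.
V = np.array([[1.0, 1.0], [0.0, 1.0]])
lam = np.array([1.1, 0.9])
W = V @ np.diag(lam) @ np.linalg.inv(V)

t = 50
direct = np.linalg.matrix_power(W, t)                 # W multiplied by itself t times
via_eig = V @ np.diag(lam**t) @ np.linalg.inv(V)      # V diag(lam)^t V^-1

print(np.allclose(direct, via_eig))
print(lam**t)   # the eigenvalue above 1 grows, the one below 1 shrinks
```

Even eigenvalues as close to 1 as 1.1 and 0.9 differ by several orders of magnitude after 50 steps, which is why long sequences are hard for recurrent networks.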

1.8.6 Inexact Gradients


Most optimization algorithms are designed with the assumption that we have access to
the exact gradient or Hessian matrix. In practice, we usually only have a noisy or even
biased estimate of these quantities. Nearly every deep learning algorithm relies on
sampling-based estimates at least insofar as using a minibatch of training examples to
compute the gradient. In other cases, the objective function we want to minimize is
actually intractable. When the objective function is intractable, typically its gradient is
intractable as well. In such cases we can only approximate the gradient.
For example, contrastive divergence gives a technique for approximating the
gradient of the intractable log-likelihood of a Boltzmann machine. Various neural
network optimization algorithms are designed to account for imperfections in the
gradient estimate. One can also avoid the problem by choosing a surrogate loss
function that is easier to approximate than the true loss.

1.8.7 Poor Correspondence between Local and Global Structure

• Optimization issues often arise from local properties such as poor conditioning,
cliffs, or saddle points, but even overcoming these at a single point may not lead
to a good solution if the direction of greatest local improvement does not point
toward distant regions of much lower cost.
• Much of training runtime is due to long trajectories, with learning paths tracing
wide arcs around obstacles (Goodfellow et al., 2015).
• Neural networks often do not converge to a global minimum, a local minimum, or a
saddle point, and may not even reach a region of small gradient.
• Some loss functions lack a true global minimum and only approach an optimal value
asymptotically.
• For softmax classifiers, the negative log-likelihood can approach zero but cannot
reach it; for Gaussian models with precision β, it can diverge to −∞ as β increases.
• Local optimization can fail to find good cost values even without local minima or
saddle points (fig. 8.4).
• Research often focuses on finding good initial points rather than on algorithms
with non-local moves.
• Neural network training relies on small, local moves, but the gradient is often
only approximated, with bias or variance in the estimate.
• Poor conditioning or discontinuous gradients can make the region where the
gradient is reliable very small.
• Local descent with step size ε may define a good path, but in practice we may only
be able to compute the descent direction with steps of a much smaller size δ ≪ ε.
• Local descent can incur high computational cost, fail in flat regions or at
critical points, or follow greedy or long paths that lead away from solutions.
• It is unclear which of these issues most affects neural network optimization; this
remains an active research area.
• These problems may be avoided if training starts in a well-behaved region
connected directly to a solution, which motivates research into choosing good
initial points for optimization algorithms.
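
The softmax case above can be made concrete. The sketch below assumes a two-class softmax in which only the gap between the correct-class logit and the other logit matters; the function name `two_class_nll` and the chosen gaps are hypothetical.

```python
import numpy as np

def two_class_nll(logit_gap):
    """Negative log-likelihood of the correct class in a two-class softmax.

    logit_gap is the correct-class logit minus the other logit, so the
    correct-class probability is 1 / (1 + exp(-logit_gap)).
    """
    return np.log1p(np.exp(-logit_gap))

gaps = np.array([1.0, 5.0, 10.0, 20.0])
losses = two_class_nll(gaps)
print(losses)   # keeps shrinking as the gap grows, but stays above zero
```

Since the loss can always be reduced a little further by enlarging the logits, there is no finite parameter setting that attains the minimum, and local descent keeps moving without arriving at a critical point.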

1.8.8 Theoretical Limits of Optimization

• Theoretical results show limits on optimization algorithms for neural networks,
but they have little bearing in practice.
• Most neural network units output smoothly increasing values, making
optimization via local search feasible.
• In practice, solutions can be found using larger networks even if finding one
for a smaller network is intractable.
• In training, the goal is not the exact minimum but sufficiently reducing the
function to get good generalization. Developing realistic bounds on
optimization performance is an important research goal.
