DLRL Module 1 Updated
Module-1
1.1 Introduction:
• Machine learning systems, with shallow or deep architectures, have the ability to learn
and improve with experience.
• The process of machine learning begins with the raw data which is used for
extracting useful information that helps in decision-making.
• The primary aim is to allow a machine to learn useful information just like humans
do.
• At an abstract level, machine learning can be carried out using the following approaches:
Supervised learning:
- Supervised learning adapts a system such that for a given input data it produces a
target output.
-The learning data is made up of tuples (attributes, label) where “attributes” represent
the input data and “label” represents the target output.
-The goal here is to adapt the system so that for a new input the system can predict
the target output.
-Supervised learning can use both continuous and discrete types of input data.
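To make the (attributes, label) setup above concrete, here is a minimal supervised-learning sketch; it assumes scikit-learn is available, and the particular dataset and estimator are illustrative choices rather than anything prescribed by these notes.

```python
# Minimal supervised-learning sketch (assumes scikit-learn is available).
# Each training tuple is (attributes, label); the fitted model predicts the
# label for new, unseen attribute vectors.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)            # attributes X, labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)    # a shallow, well-understood learner
model.fit(X_train, y_train)                  # adapt the system to (input, target) pairs
print("accuracy on new inputs:", model.score(X_test, y_test))
```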
Unsupervised learning:
-Unsupervised learning involves data that consists of input vectors without any
target output; typical tasks include clustering, density estimation, and visualization.
-The goal of clustering is to discover groups of similar data items on the basis of
measured or perceived similarities between the data items.
-The purpose of density estimation is to determine the distribution of the data within
the input space.
-In visualization, the data is projected down from a high-dimensional space to two or
three dimensions to view the similar data items.
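Clustering and visualization, as described above, can be sketched in the same style; the snippet below groups unlabeled points with k-means and projects them to two dimensions with PCA (scikit-learn, k-means, and PCA are illustrative assumptions, not requirements of the notes).

```python
# Minimal unsupervised-learning sketch (assumes scikit-learn is available).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Unlabeled data: three loose groups of 5-dimensional points, no target output.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 5)) for c in (0.0, 3.0, 6.0)])

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # group similar items
X_2d = PCA(n_components=2).fit_transform(X)   # project to 2-D for visualization
print(np.bincount(clusters), X_2d.shape)
```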
Semi-supervised learning:
-The training dataset can be divided into two parts: the data samples with
corresponding labels and the data samples where the labels are not known.
-Learning can also take place without an explicit error signal at each step; only a
generalized reinforcement is received, indicating how the system should change its
behavior. This is sometimes referred to as reinforcement learning.
-Shallow architectures are well understood and perform well on many common
machine learning problems, and they are still used in the vast majority of today’s
machine learning applications.
-However, there has been increased interest in deep architectures recently, in the
hope of solving more complex real-world problems (e.g., image analysis or natural
language understanding) for which shallow architectures are unable to learn adequate
models.
-Deep learning is a relatively new area of machine learning that has gained popularity
in the recent past.
-Deep learning refers to the architectures which contain multiple hidden layers (deep
networks) to learn different features with multiple levels of abstraction.
-Deep learning algorithms seek to exploit the unknown structure in the input
distribution in order to discover good representations, often at multiple levels, with
higher level learned features defined in terms of lower-level features.
-Conventional machine learning techniques are restricted in the way they process
natural data in its raw form.
-Deep learning allows inputting the raw data (pixels in case of image data) to the
learning algorithm without first extracting features or defining a feature vector.
-Deep learning algorithms can learn the right set of features, and it does this in a much
better way than extracting these features using hand-coding.
-Instead of handcrafting a set of rules and algorithms to extract features from raw data,
deep learning involves learning these features automatically during the training
process.
-In deep learning, a problem is realized in terms of hierarchy of concepts, with each
concept built on the top of the others.
-The lower layers of the model encode some basic representation of the problem,
whereas higher level layers build upon these lower layers to form more complex
concepts.
-Given an image, the pixel intensity values are fed as inputs to the deep learning
system.
-A number of hidden layers then extract features from the input image.
-These hidden layers are built upon each other in a hierarchical fashion.
-At first, the lower-level layers of the network detect only edge-like regions.
-These edge regions are then used to define corners (where edges intersect) and
contours (outlines of objects).
-The layers in the higher level combine corners and contours to lead to more abstract
“object parts” in the next layer.
-The key aspect of deep learning is that these layers of features are not handcrafted
and designed by human engineers; rather, they are learnt from data gradually using a
general-purpose learning procedure.
-Finally, the output layer classifies the image and obtains the output class label—the
output obtained at the output layer is directly influenced by every other node available
in the network.
-This process can be viewed as hierarchical learning, as each layer in the network uses
the output of previous layers as “building blocks” to construct increasingly complex
concepts at the higher layers; a small sketch of such a layer stack is given after Fig. 1.1.
Fig. 1.1 Conventional machine learning using hand-designed feature extraction algorithms vs.
the deep learning approach using a hierarchy of representations
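The hierarchy of layers described above can be written down directly as a stacked model; the sketch below assumes PyTorch (a library chosen here for illustration, not one specified in the notes) and a 28x28 grayscale digit input.

```python
# Minimal sketch of a deep (multi-hidden-layer) network for image inputs,
# assuming PyTorch; raw pixel intensities go in, class scores come out.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),              # raw 28x28 pixel intensities as the input vector
    nn.Linear(28 * 28, 256),   # lower layer: simple, edge-like features
    nn.ReLU(),
    nn.Linear(256, 128),       # higher layer: combinations of lower-level features
    nn.ReLU(),
    nn.Linear(128, 10),        # output layer: class scores for 10 digit classes
)

fake_image = torch.rand(1, 1, 28, 28)   # stand-in for a real image
print(model(fake_image).shape)          # torch.Size([1, 10])
```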
-The number of layers used to model the data determines the depth of the model.
-Current deep learning often involves learning tens or even hundreds of successive
layers of representation from the training data automatically.
-The conventional approaches to machine learning often focus on learning only one or
two layers of representations of data; such approaches are often categorized as
shallow learning.
-Deep learning and machine learning are sub-fields of Artificial Intelligence (AI).
-Figure 1.2 illustrates the relationship between AI, machine learning, and deep
learning.
-In deep learning, the successive layers of representations may be learned via sub-
models, which are structured in the form of layers stacked on the top of each other.
-As a deep learning network typically has more layers and parameters, it has the
potential to represent more complex inputs.
-Although deep learning has been around since the 1980s, it was relatively unpopular for
several years because the computational infrastructure (both hardware and software)
was not adequate and the available datasets were quite small.
-With the decline in the popularity of the conventional neural networks, it was only
recently that deep networks made a big reappearance by achieving spectacular results
in speech recognition and computer vision tasks.
Some of the aspects that helped in the evolution of deep networks are listed below:
-Improved computational resources for processing massive amounts of data and
training much larger models.
-Automatic feature extraction.
The term “artificial neural network” makes a reference to neuroscience, but deep learning
networks are not models of the brain; deep learning models are formulated by drawing
only inspiration from our understanding of the biological brain.
Not all the components of deep models are inspired by neuroscience; some of them
come from empirical exploration, theory, and intuition.
The neural activity in our brains is far more complex than might be suggested by
simply studying artificial neurons.
The learning mechanisms used by deep learning models are in no way comparable to
the human brain, but can be described as a mathematical framework for learning
representations from data.
Figure 1.2 shows an example of a deep learning architecture that can be used for
character recognition. Figure 1.3 shows representations that are learned by the deep
learning network.
-The deep network uses several layers to transform the input image (here a digit) in
order to recognize what the digit is.
Each layer performs some transformations on the input that it receives from the
previous layers.
-The deep network transforms the digit image into representations that tend to capture
a higher level of abstraction.
-Each hidden layer transforms the input image into a representation that is
increasingly different from the original image and increasingly informative about the
final result.
-In summary, a deep learning network constructs features at multiple levels, with
higher features constructed as functions of lower ones.
-It is a fast-growing field that circumvents the manual feature extraction used as a
prelude by conventional machine learning approaches.
-Deep learning is capable of learning the appropriate features by itself, requiring little
steering by the user.
-The choice of features that represent a given dataset has a profound impact on the
success of a machine learning system.
-Better results cannot be achieved without identifying which aspects of the problem
the extracted features need to capture in order to be most useful to the machine
learning algorithm.
-This requires a machine learning expert to collaborate with the domain expert in
order to obtain a useful feature set.
-A biological brain can easily determine which aspects of the problem it needs to
focus on with comparatively little guidance.
-This is not the case with the artificial agents, thereby making it difficult to create
computer learning systems that can respond to high-dimensional input and perform
hard AI tasks.
-Machine learning practitioners have spent a huge amount of time extracting informative
features from the data.
-By the time deep learning made its dramatic entrance, the state-of-the-art machine
learning algorithms had already taken decades of human effort to accumulate the relevant
sets of features required to classify their inputs.
-Deep learning has achieved not only excellent accuracy in machine learning modeling,
but it has also demonstrated outstanding generalization power that has even attracted
scientists from other academic disciplines.
-It is now being used as a guide to make key decisions in fields like medicine, finance,
manufacturing, and beyond.
-It has enabled computer scientists to harness vast computational power and large
volumes of data (audio, video, and more) to teach computers how to do things that
seem natural and intuitive for humans, such as spotting objects in photos,
recognizing words or sentences, and translating a document into another language.
-It has made it possible for machines to transcribe an audio clip (speech recognition),
identify whether an email is spam, estimate the likelihood that a customer will repay
a loan, and so on; as long as there is enough data to train machines, the possibilities
are endless.
-It has achieved state-of-the-art results on many applications, such as natural language
parsing, language modeling, image and character recognition, playing the challenging
game of Go, pixels-to-controls video game playing, and in other applications.
-Deep learning works well wherever there are vast volumes of data and complex
problems to solve, which is why companies with vast amounts of data have embraced it.
-Many companies are using deep learning to develop more helpful and realistic
customer service representatives—Chatbots.
-In particular, deep learning has made a strong impact in historically difficult areas of
machine learning:
• Near-human-level image classification;
• Near-human-level speech recognition;
• Near-human-level handwriting transcription;
• Improved self-driving cars;
• Digital assistants such as Google Now, Microsoft Cortana, Apple’s Siri, and
Amazon Alexa;
• Improved ad targeting, as used by Google, Baidu, and Bing;
• Improved search results on the web;
• Ability to answer natural language questions; and
• Superhuman Go, Shogi, and Chess playing.
-The automatic and generic approach of feature learning in deep models enables one
to use them across different applications (e.g., image classification, speech
recognition, language modeling, and information retrieval) with relatively few
adjustments.
-Therefore, deep models seem to be domain-oblivious in the sense that only a small
amount of domain-specific customization is required to use them across different
applications.
-Deep learning is still in its infancy, but it is likely that deep learning will have many
successes in the near future, as it requires little hand engineering and can thus take
advantage of vast amounts of data and computation power.
-Deep learning has succeeded in previously unsolved problems which were quite
difficult to resolve using machine learning as well as other shallow networks.
-The dramatic progress of deep learning has sparked such a burst of activity that
venture capitalists who did not even know what deep learning was a few years ago are
today wary of startups that do not use it.
-In the near future, deep learning may herald an age in which it assists humans in
software development, science, and many other areas.
-Integrating deep learning with the whole toolbox of other artificial intelligence
techniques may accomplish startling things that will have great impact in the field of
technology.
-Deep networks map input to target via a sequence of layered transformations, and
these transformations are learned by exposure to the training examples.
-The transformations that a layer applies to its input are determined by the layer’s
weights, which are basically a bunch of numbers.
-In this context, learning can be defined as the process of finding the values of the
weights of all layers in the network in such a manner that input examples can be
correctly mapped to their associated targets.
-A deep learning network contains thousands of parameters, and finding the right
values of these parameters is not an easy task, particularly when the value of one
parameter has an impact on the value of another parameter.
-In order to train a deep network, one needs to find out how far the calculated output
of the network is from the desired value.
-This measure is obtained by using a loss function, also called the objective function,
which gives a measure of how well the network has learnt a specific example.
-The objective of the training is to find the values for the weights that minimize the
chosen error function.
-The difference obtained is then used as a feedback signal to adjust the weights of the
network in such a way that the loss score for the current example is lowered.
-The backpropagation algorithm initially assigns random values to the weights, so that
the network just implements a series of random transformations.
-Initially, the output obtained from the network can be far from what it should be, and
accordingly the loss score may be very high.
-With every example that is fed to the network, the weights are adjusted in a
direction that makes the loss score decrease.
-This process is repeated a number of times, until the weight values that minimize the
loss function are obtained.
-A network is said to have learned when the output values obtained from the network
are as close as they can be to the target values.
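The training loop just described (random initial weights, a loss score as feedback, repeated weight adjustment via backpropagation) might look like the following PyTorch sketch; the toy data, optimizer, and learning rate are assumptions for illustration, not the notes' prescription.

```python
# Minimal training-loop sketch, assuming PyTorch: start from random weights,
# measure the loss, and adjust the weights so the loss score decreases.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 4)                       # toy inputs
y = (X.sum(dim=1) > 0).long()                # toy targets

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))  # random initial weights
loss_fn = nn.CrossEntropyLoss()              # the loss (objective) function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):
    loss = loss_fn(model(X), y)              # how far the output is from the target
    optimizer.zero_grad()
    loss.backward()                          # feedback signal via backpropagation
    optimizer.step()                         # adjust weights to lower the loss score
print("final loss:", loss.item())
```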
-Deep learning networks have brought their own set of problems and challenges
which outweighed the benefits of deep architectures for several decades.
-With the limited computational power available at the time, deep networks were
overtaken by other approaches such as kernel methods.
-However, despite the remarkable advances in this area, training deep models with a
huge number of free parameters is an intricate and ill-posed optimization problem.
-Many research works have been dedicated to creating efficient training methods for
deep architectures.
-The strategies reported in the literature that deal with the difficulties of training deep
networks include developing better optimizers, using well-designed initialization
strategies, using activation functions based on local competition and using skip
connections between layers with the aim to improve the flow of information.
-However, deep network training still faces problems which are caused by the
stacking of several nonlinear transformations and need to be addressed.
-Moreover, deep learning involves using large amounts of data to learn progressively.
-While large amounts of data are available in many applications, in some areas
copious amounts of data are rarely available.
-More flexible models are required to achieve an enhanced learning ability when only
a limited amount of data is available.
-Optimization algorithms used for training of deep models differ from traditional
optimization algorithms in several ways.
-In most machine learning scenarios, we care about some performance measure P,
which is defined with respect to the test set and may also be intractable.
-We therefore optimize P only indirectly. We reduce a different cost function J(θ) in
the hope that doing so will improve P.
-Optimization algorithms for training deep models also typically include some
specialization on the specific structure of machine learning objective functions.
-Typically, the cost function can be written as an average over the training set, such as
$J(\theta) = \mathbb{E}_{(x,y)\sim \hat{p}_{\mathrm{data}}}\, L(f(x;\theta), y)$    (7.1)
-where L is the per-example loss function, f(x;θ) is the predicted output when the
input is x, y is the target output, and p̂_data is the empirical distribution.
In what follows, we develop the unregularized supervised case, where the
arguments to L are f(x;θ) and y.
Equation 7.1 defines an objective function with respect to the training set.
We would usually prefer to minimize the corresponding objective function where the
expectation is taken across the data generating distribution pdata rather than just over
the finite training set.
$J^{*}(\theta) = \mathbb{E}_{(x,y)\sim p_{\mathrm{data}}}\, L(f(x;\theta), y)$    (7.2)
-We emphasize here that the expectation is taken over the true underlying distribution
pdata.
-If we knew the true distribution pdata(x, y), risk minimization would be an
optimization task solvable by an optimization algorithm.
-However, when we do not know pdata(x, y) but only have a training set of samples,
we have a machine learning problem.
-The simplest way to convert a machine learning problem back into an optimization
problem is to minimize the expected loss on the training set.
-This means replacing the true distribution p(x, y) with the empirical distribution
p̂(x, y) defined by the training set.
$\mathbb{E}_{(x,y)\sim \hat{p}_{\mathrm{data}}}[L(f(x;\theta), y)] = \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)})$    (7.3)
-The training process based on minimizing this average training error is known as
empirical risk minimization.
-In this setting, machine learning is still very similar to straightforward optimization.
-Rather than optimizing the risk directly, we optimize the empirical risk, and hope
that the risk decreases significantly as well.
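Numerically, Equation 7.3 says the empirical risk is just the average of the per-example losses over the training set; the NumPy sketch below computes it for a toy linear model with a squared-error loss (both the model and the loss are illustrative assumptions).

```python
# Empirical risk as the average per-example loss over the training set
# (NumPy sketch; squared error stands in for a generic per-example loss L).
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0])
X = rng.normal(size=(100, 2))                                # training inputs
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=100)   # training targets

def f(x, theta):                              # model prediction f(x; theta)
    return x @ theta

per_example_loss = (f(X, theta) - y) ** 2     # L(f(x_i; theta), y_i) for each example
empirical_risk = per_example_loss.mean()      # (1/m) * sum_i L(...)
print(empirical_risk)
```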
-A variety of theoretical results establish conditions under which the true risk can be
expected to decrease by various amounts. However, empirical risk minimization is
prone to overfitting. Models with high capacity can simply memorize the training set.
-The most effective modern optimization algorithms are based on gradient descent,
but many useful loss functions, such as 0-1 loss, have no useful derivatives (the
derivative is either zero or undefined everywhere).
-These two problems mean that, in the context of deep learning, we rarely use
empirical risk minimization.
-Instead, we must use a slightly different approach, in which the quantity that we
actually optimize is even more different from the quantity that we truly want to
optimize.
-Sometimes, the loss function we actually care about (say classification error) is not
one that can be optimized efficiently.
-In such situations, one typically optimizes a surrogate loss function instead, which
acts as a proxy but has advantages.
-For example, the negative log-likelihood of the correct class is typically used as a
surrogate for the 0-1 loss.
-The negative log-likelihood allows the model to estimate the conditional probability
of the classes, given the input, and if the model can do that well, then it can pick the
classes that yield the least classification error in expectation.
-In some cases, a surrogate loss function actually results in being able to learn more.
-For example, the test set 0-1 loss often continues to decrease for a long time after the
training set 0-1 loss has reached zero, when training using the log-likelihood surrogate.
-This is because even when the expected 0-1 loss is zero, one can improve the
robustness of the classifier by further pushing the classes apart from each other,
obtaining a more confident and reliable classifier, thus extracting more information
from the training data than would have been possible by simply minimizing the
average 0-1 loss on the training set.
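The difference between the 0-1 loss and its negative log-likelihood surrogate can be seen on a toy example; in the NumPy sketch below the predicted probabilities are made-up numbers chosen so that every example is already classified correctly, yet the surrogate still provides a training signal.

```python
# 0-1 loss vs. negative log-likelihood surrogate on toy predictions (NumPy sketch).
import numpy as np

probs = np.array([[0.6, 0.4],     # predicted class probabilities per example
                  [0.9, 0.1],
                  [0.3, 0.7]])
labels = np.array([0, 0, 1])      # correct classes

zero_one = (probs.argmax(axis=1) != labels).mean()           # not usefully differentiable
nll = -np.log(probs[np.arange(len(labels)), labels]).mean()  # smooth surrogate

print("0-1 loss:", zero_one)      # 0.0: every example already classified correctly
print("NLL surrogate:", nll)      # still > 0, so gradients keep pushing classes apart
```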
-Moreover, training algorithms do not usually halt at a local minimum; instead, a
machine learning algorithm usually minimizes a surrogate loss function but
halts when a convergence criterion based on early stopping is satisfied.
-Typically, the early stopping criterion is based on the true underlying loss function,
such as 0-1 loss measured on a validation set, and is designed to cause the algorithm
to halt whenever overfitting begins to occur.
-Training often halts while the surrogate loss function still has large derivatives,
which is very different from the pure optimization setting, where an optimization
algorithm is considered to have converged when the gradient becomes very small.
-One aspect of machine learning algorithms that separates them from general
optimization algorithms is that the objective function usually decomposes as a sum
over the training examples.
-Optimization algorithms for machine learning typically compute each update to the
parameters based on an expected value of the cost function estimated using only a
subset of the terms of the full cost function.
-For example, maximum likelihood estimation problems, when viewed in log space,
decompose into a sum over each example.
$\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\mathrm{model}}(x^{(i)}, y^{(i)}; \theta)$    (7.4)
$J(\theta) = \mathbb{E}_{(x,y)\sim \hat{p}_{\mathrm{data}}} \log p_{\mathrm{model}}(x, y; \theta)$    (7.5)
Most of the properties of the objective function J used by most of our optimization
algorithms are also expectations over the training set. For example, the most
commonly used property is the gradient:
$\nabla_{\theta} J(\theta) = \mathbb{E}_{(x,y)\sim \hat{p}_{\mathrm{data}}} \nabla_{\theta} \log p_{\mathrm{model}}(x, y; \theta)$    (7.6)
-The standard error of the mean estimated from n samples is σ/√n, where σ is the true
standard deviation of the samples; the denominator of √n shows that there are less than
linear returns to using more examples to estimate the gradient.
-Compare two hypothetical estimates of the gradient, one based on 100 examples and
another based on 10,000 examples.
-The latter requires 100 times more computation than the former, but reduces the
standard error of the mean only by a factor of 10.
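This scaling can be checked empirically; the NumPy sketch below estimates the standard error of a sample mean (standing in for one gradient component) with n = 100 versus n = 10,000 samples.

```python
# Standard error of a mean estimate shrinks only as 1/sqrt(n) (NumPy sketch).
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

def std_error_of_mean(n, trials=2000):
    # Empirical standard deviation of the sample mean over many repetitions.
    means = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
    return means.std()

print("n=100:   ", std_error_of_mean(100))     # ~ sigma / 10
print("n=10000: ", std_error_of_mean(10_000))  # ~ sigma / 100: 100x the work, 10x the gain
```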
-Most optimization algorithms converge much faster (in terms of total computation,
not in terms of number of updates) if they are allowed to rapidly compute
approximate estimates of the gradient rather than slowly computing the exact gradient.
-Consider the worst case, in which all m samples in the training set are identical copies
of each other: a sampling-based estimate of the gradient could then compute the correct
gradient with a single sample, using m times less computation than the naive approach.
-In practice, we are unlikely to truly encounter this worst-case situation, but we may
find large numbers of examples that all make very similar contributions to the
gradient.
-Optimization algorithms that use the entire training set are called batch or
deterministic gradient methods, because they process all of the training examples
simultaneously in a large batch.
-This terminology can be somewhat confusing because the word “batch” is also often
used to describe the minibatch used by minibatch stochastic gradient descent.
-Typically, the term “batch gradient descent” implies the use of the full training set,
while the use of the term “batch” to describe a group of examples does not.
-For example, it is very common to use the term “batch size” to describe the size of a
minibatch.
-Optimization algorithms that use only a single example at a time are sometimes
called stochastic or sometimes online methods.
-The term online is usually reserved for the case where the examples are drawn from a
stream of continually created examples rather than from a fixed-size training set over
which several passes are made.
-Most algorithms used for deep learning fall somewhere in between, using more than
one but less than all of the training examples.
-These were traditionally called minibatch or minibatch stochastic methods and it is
now common to simply call them stochastic methods.
Several key factors influence minibatch sizes in machine learning:
-Larger batches yield more accurate gradient estimates, but with less than linear
(diminishing) returns.
-Very small batches underutilize multicore architectures and do not reduce processing
time effectively.
-Memory constraints often limit batch size, especially on hardware such as GPUs, where
power-of-2 batch sizes (commonly between 32 and 256) tend to optimize runtime.
-Small batches may provide a regularizing effect due to the noise they add, often
improving generalization error, with a batch size of 1 being ideal in some cases.
-However, smaller batches require smaller learning rates for stability and increase total
runtime because more steps are needed to cover the dataset.
-Many datasets are most naturally arranged in a way where successive examples are
highly correlated.
-For example, we might have a dataset of medical data with a long list of blood
sample test results.
-This list might be arranged so that first we have five blood samples taken at different
times from the first patient, then we have three blood samples taken from the second
patient, then the blood samples from the third patient, and so on.
-If we were to draw examples in order from this list, then each of our minibatches
would be extremely biased, because it would represent primarily one patient out of the
many patients in the dataset.
-In cases such as these where the order of the dataset holds some significance, it is
necessary to shuffle the examples before selecting minibatches.
-For very large datasets, for example datasets containing billions of examples in a data
centre, it can be impractical to sample examples truly uniformly at random every time
we want to construct a minibatch.
-Fortunately, in practice it is usually sufficient to shuffle the order of the dataset once
and then store it in shuffled fashion.
-This will impose a fixed set of possible minibatches of consecutive examples that all
models trained thereafter will use, and each individual model will be forced to reuse
this ordering every time it passes through the training data.
-However, this deviation from true random selection does not seem to have a
significant detrimental effect.
-Failing to ever shuffle the examples in any way can seriously reduce the
effectiveness of the algorithm.
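Shuffling once up front and then drawing consecutive minibatches, as described above, can be sketched as follows; the array shapes and batch size are illustrative choices.

```python
# Shuffle the dataset once, then iterate over consecutive minibatches (NumPy sketch).
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(1000).reshape(100, 10).astype(float)   # 100 examples, possibly ordered (e.g., by patient)
y = np.arange(100)

perm = rng.permutation(len(X))                       # single up-front shuffle
X, y = X[perm], y[perm]

batch_size = 32
for start in range(0, len(X), batch_size):
    xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    # ... compute the minibatch gradient estimate on (xb, yb) here ...
print("last batch size:", len(xb))
```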
-Minibatch stochastic gradient descent follows the gradient of the true generalization
error as long as no examples are repeated: on the first pass, each minibatch is used to
compute an unbiased estimate of the true generalization error.
-On the second pass, the estimate becomes biased because it is formed by re-sampling
values that have already been used, rather than obtaining new fair samples from the
data generating distribution.
-The fact that stochastic gradient descent minimizes generalization error is easiest to
see in the online learning case, where examples or minibatches are drawn from a
stream of data.
-In other words, instead of receiving a fixed-size training set, the learner is similar to a
living being who sees a new example at each instant, with every example (x, y)
coming from the data generating distribution pdata(x, y).
-In this scenario, examples are never repeated; every experience is a fair sample from
pdata.
-The equivalence is easiest to derive when both x and y are discrete. In this case, the
generalization error (Equation 7.2) can be written as a sum
$J^{*}(\theta) = \sum_{x}\sum_{y} p_{\mathrm{data}}(x, y)\, L(f(x;\theta), y)$    (7.7)
with the exact gradient
$g = \nabla_{\theta} J^{*}(\theta) = \sum_{x}\sum_{y} p_{\mathrm{data}}(x, y)\, \nabla_{\theta} L(f(x;\theta), y)$    (7.8)
-We have already seen the same fact demonstrated for the log-likelihood in equation
7.5 and equation 7.6; we observe now that this holds for other functions L besides the
likelihood.
-A similar result can be derived when x and y are continuous, under mild assumptions
regarding pdata and L.
-Hence, we can obtain an unbiased estimator of the exact gradient of the
generalization error by sampling a minibatch of examples {x(1), . . . x(m)} with
corresponding targets y(i) from the data generating distribution pdata, and computing
the gradient of the loss with respect to the parameters for that minibatch.
$\hat{g} = \frac{1}{m} \nabla_{\theta} \sum_{i} L(f(x^{(i)};\theta), y^{(i)})$    (7.9)
-Of course, this interpretation only applies when examples are not reused.
Nonetheless, it is usually best to make several passes through the training set, unless
the training set is extremely large.
-When multiple such epochs are used, only the first epoch follows the unbiased
gradient of the generalization error, but of course, the additional epochs usually
provide enough benefit due to decreased training error to offset the harm they cause
by increasing the gap between training error and test error.
-With some datasets growing rapidly in size, faster than computing power, it is
becoming more common for machine learning applications to use each training
example only once or even to make an incomplete pass through the training set.
-When using an extremely large training set, overfitting is not an issue, so underfitting
and computational efficiency become the predominant concerns.
-See Bottou and Bousquet (2008) for a discussion of the effect of computational
bottlenecks on generalization error as the number of training examples grows.
• When training neural networks, we must confront the general non-convex case.
• Even convex optimization is not without its complications.
• This section summarizes several of the most prominent challenges involved in
optimization for training deep models.
1.8.1 Ill-Conditioning:
-Ill-conditioning of the Hessian matrix H can cause SGD to get “stuck” in the sense that
even very small steps increase the cost: the second-order Taylor expansion predicts that
a gradient descent step of −εg adds ½ε²g⊤Hg − εg⊤g to the cost, so ill-conditioning
becomes a problem when ½ε²g⊤Hg exceeds εg⊤g.
-In many cases, the gradient norm (g⊤g) does not shrink significantly throughout learning,
but the g⊤Hg term grows by more than an order of magnitude.
-The result is that learning becomes very slow despite the presence of a strong
gradient, because the learning rate must be shrunk to compensate for the even stronger
curvature.
-Figure 8.1 shows an example of the gradient increasing significantly during the
successful training of a neural network; a small sketch of this check appears below.
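The check mentioned above (monitoring the g⊤g and g⊤Hg terms) can be sketched on a toy quadratic problem; the PyTorch code below is one illustrative way to do it, using an autograd Hessian-vector product, and is not taken from the notes themselves.

```python
# Monitor eps * g^T g (expected first-order decrease) against
# 0.5 * eps^2 * g^T H g (curvature penalty) on a poorly conditioned quadratic.
import torch

A = torch.diag(torch.tensor([1.0, 500.0]))        # poorly conditioned Hessian
x = torch.tensor([1.0, 1.0], requires_grad=True)
eps = 3e-3

for step in range(3):
    loss = 0.5 * x @ A @ x
    (g,) = torch.autograd.grad(loss, x, create_graph=True)
    (Hg,) = torch.autograd.grad(g @ g.detach(), x)    # Hessian-vector product H g
    first_order = eps * (g @ g).item()
    curvature = 0.5 * eps ** 2 * (g @ Hg).item()
    # When the curvature term approaches the first-order term, learning slows
    # even though the gradient itself is large.
    print(f"step {step}: eps*g^Tg = {first_order:.3e}, 0.5*eps^2*g^THg = {curvature:.3e}")
    with torch.no_grad():
        x -= eps * g                                  # plain gradient descent step
```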
Though ill-conditioning is present in settings other than neural network training, some of
the techniques used to combat it in other contexts are less applicable to neural networks.
For example, Newton’s method is an excellent tool for minimizing convex functions with
poorly conditioned Hessian matrices, but it requires significant modification before it can
be applied to neural networks.
In addition to weight space symmetry, many neural networks have other causes of
non-identifiability. For example, in rectified linear or maxout networks, we can scale the
incoming weights and biases of a unit by α if the outgoing weights of that unit are scaled
by 1/α. This means that, if the cost function does not include terms like weight decay
that depend directly on the weights, every local minimum of a rectified linear or maxout
network lies on an (m × n)-dimensional hyperbola of equivalent local minima.
Model identifiability issues mean there can be extremely many or infinite local
minima in a neural network cost function, but these are equivalent in cost and not
problematic. Local minima are problematic if they have higher cost than the global
minimum, and small networks can have such local minima. It is an open question
whether many high-cost local minima exist in practical networks. Most practitioners
once believed local minima were a major problem, but today experts suspect that in
large networks most local minima have low cost, and it is not necessary to find a true
global minimum. Many practitioners attribute optimization difficulty to local minima,
but testing is needed. A test is to plot the norm of the gradient over time; if it does not
shrink, the problem is not local minima. In high-dimensional spaces, it is difficult to
prove local minima are the problem, since other structures can also have small
gradients.
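The gradient-norm test mentioned above takes only a few lines; the PyTorch sketch below records the global gradient norm at every step of a toy regression so it can be plotted over time (the model, data, and optimizer settings are all illustrative assumptions).

```python
# Track the gradient norm over training steps (PyTorch sketch); if it does not
# shrink, the difficulty is probably not convergence to a local minimum.
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(256, 8), torch.randn(256, 1)
model = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)

grad_norms = []
for step in range(200):
    loss = nn.functional.mse_loss(model(X), y)
    opt.zero_grad()
    loss.backward()
    total = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
    grad_norms.append(total.item())           # plot this list over time
    opt.step()
print(grad_norms[0], grad_norms[-1])
```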
- **Nature of saddle points**: At a saddle point, the Hessian matrix has both positive
and negative eigenvalues, meaning the point is a local minimum along some
directions and a local maximum along others.
- **Intuition via eigenvalues**: The Hessian at a local minimum has only positive
eigenvalues; at a saddle point, it has a mixture. Imagining eigenvalue signs as coin
tosses, the probability of all positive eigenvalues (all heads) decreases exponentially
with dimension.
- **Eigenvalues and cost regions**: Eigenvalues are more likely positive in low-cost
regions, meaning local minima tend to have low cost. High-cost critical points are
typically saddle points; very high-cost critical points are likely local maxima.
- **Other zero-gradient points**: Besides minima and saddle points, maxima also
have zero gradient and pose similar optimization challenges. Maxima become rare
with increasing dimension.
- **Flat regions**: There can be wide, flat regions where both gradient and Hessian
are zero. These degenerate areas cause major problems for optimization algorithms. In
convex problems, such flat regions correspond to global minima; in general cases,
they may correspond to high objective values.
Cliffs and exploding gradients are a further issue in deep neural networks, especially
those with many layers. These steep regions form in the cost function due to the
multiplication of multiple large weights, as illustrated in Figure 8.3. When gradient
descent encounters such a cliff, the update step can become excessively large, causing
parameters to overshoot and diverge from the optimal path. This problem occurs
regardless of whether the approach is from above or below the cliff. However,
the gradient clipping heuristic (§10.11.1) helps mitigate the issue by limiting the
step size. Since gradients only indicate the optimal direction—not the step
magnitude—clipping prevents excessively large updates, keeping the optimization
within a reasonable range. The problem is particularly prominent in recurrent neural
networks (RNNs) because they involve repeated multiplications over many time
steps, making long sequences especially prone to extreme gradient explosions. By
using gradient clipping, training remains stable, avoiding drastic parameter updates
while still following the general descent direction. In short, while cliffs are a significant
challenge in deep learning, especially in RNNs, practical techniques like gradient
clipping provide an effective solution.
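In PyTorch, the gradient clipping heuristic is typically a single call placed between the backward pass and the optimizer step, as in the sketch below; the model, data, and threshold of 1.0 are illustrative assumptions.

```python
# Gradient clipping sketch (PyTorch): cap the gradient norm before each update
# so a step taken near a "cliff" cannot catapult the parameters away.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
X, y = torch.randn(32, 10), torch.randn(32, 1)

loss = nn.functional.mse_loss(model(X), y)
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # keep the step size bounded
opt.step()
```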
Another difficulty that neural network optimization algorithms must overcome arises
when the computational graph becomes extremely deep. Feedforward networks with
many layers have such deep computational graphs. So do recurrent networks,
described in chapter 10, which construct very deep computational graphs by
repeatedly applying the same operation at each time step of a long temporal sequence.
Repeated application of the same parameters gives rise to especially pronounced
difficulties.
For example, suppose that a computational graph contains a path that consists of
repeatedly multiplying by a matrix W; after t steps, this is equivalent to multiplying by
W^t. If W has the eigendecomposition W = V diag(λ) V⁻¹, then
W^t = (V diag(λ) V⁻¹)^t = V diag(λ)^t V⁻¹.
Eigenvalues λi whose absolute value is not near 1 will either explode (|λi| > 1) or vanish
(|λi| < 1) as t grows, which gives rise to the vanishing and exploding gradient problem.
Repeated multiplication by W is similar to the power method algorithm. Recurrent
networks use the same matrix W at each time step, while feedforward networks do not,
so the difficulty is especially pronounced for recurrent networks; a small numerical
illustration follows.
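The effect of eigenvalues away from an absolute value of 1 can be checked directly by applying W many times; the NumPy sketch below constructs W from chosen eigenvalues via W = V diag(λ) V⁻¹ and applies it 100 times (the specific eigenvalues and basis are illustrative).

```python
# Repeated multiplication by W: eigenvalues with |lambda| > 1 explode,
# |lambda| < 1 vanish (NumPy sketch).
import numpy as np

rng = np.random.default_rng(0)
V = np.linalg.qr(rng.normal(size=(2, 2)))[0]          # an arbitrary eigenvector basis
for eigenvalues in ([1.1, 0.9], [0.5, 0.5]):
    W = V @ np.diag(eigenvalues) @ np.linalg.inv(V)   # W = V diag(lambda) V^-1
    h = np.ones(2)
    for t in range(100):                              # 100 repeated multiplications (time steps)
        h = W @ h
    print(eigenvalues, "->", np.linalg.norm(h))
```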
Optimization issues often arise from local properties like poor conditioning, cliffs,
or saddle points, but even overcoming these may not lead to good solutions if the
direction of improvement does not reach lower-cost regions.
Much of training runtime is due to long trajectories, with learning paths tracing
wide arcs around obstacles (Goodfellow et al., 2015).
Neural networks often do not converge to global minima, local minima, or saddle
points, and may not reach regions of small gradient.
Some loss functions lack true global minima and only asymptotically approach
optimal values.
For softmax classifiers, negative log-likelihood approaches zero but cannot reach it;
for Gaussian models, it can diverge to −∞ as β increases.
Local optimization can fail to find good cost values even without local minima or
saddle points (fig. 8.4).
Research often focuses on finding good initial points rather than algorithms with
non-local moves.
Neural network training relies on small, local moves, but gradients are often only
approximated with bias or variance.
Poor conditioning or discontinuous gradients make the region where gradients are
reliable very small.
Local descent with step size ε may define a good path, but in practice we can only
compute steps of much smaller size δ ≪ ε.
Local descent can incur high computational cost, fail in flat regions or critical
points, or follow greedy/long paths away from solutions.
It is unclear which of these issues most affects neural network optimization—this
remains an active research area.
These problems may be avoided if training starts in a well-behaved region
connected directly to a solution.
This motivates research into choosing good initial points for optimization
algorithms.