12 Optimization Algorithms
If you read the book in sequence up to this point you already used a number of optimization
algorithms to train deep learning models. They were the tools that allowed us to continue up-
dating model parameters and to minimize the value of the loss function, as evaluated on the
training set. Indeed, anyone content with treating optimization as a black box device to min-
imize objective functions in a simple setting might well content oneself with the knowledge
that there exists an array of incantations of such a procedure (with names such as “SGD” and
“Adam”).
To do well, however, some deeper knowledge is required. Optimization algorithms are im-
portant for deep learning. On the one hand, training a complex deep learning model can take
hours, days, or even weeks. The performance of the optimization algorithm directly affects
the model’s training efficiency. On the other hand, understanding the principles of differ-
ent optimization algorithms and the role of their hyperparameters will enable us to tune the
hyperparameters in a targeted manner to improve the performance of deep learning mod-
els.
In this chapter, we explore common deep learning optimization algorithms in depth. Almost
all optimization problems arising in deep learning are nonconvex. Nonetheless, the design and
analysis of algorithms in the context of convex problems have proven to be very instructive. It
is for that reason that this chapter includes a primer on convex optimization and the proof for
a very simple stochastic gradient descent algorithm on a convex objective function.
12.1 Optimization and Deep Learning
In this section, we will discuss the relationship between optimization and deep learning as well
as the challenges of using optimization in deep learning. For a deep learning problem, we will
usually define a loss function first. Once we have the loss function, we can use an optimization
algorithm in an attempt to minimize the loss. In optimization, a loss function is often referred
to as the objective function of the optimization problem. By tradition and convention most
optimization algorithms are concerned with minimization. If we ever need to maximize an
objective there is a simple solution: just flip the sign on the objective.
12.1.1 Goal of Optimization
Although optimization provides a way to minimize the loss function for deep learning, in
essence, the goals of optimization and deep learning are fundamentally different. The for-
mer is primarily concerned with minimizing an objective whereas the latter is concerned
with finding a suitable model, given a finite amount of data. In Section 3.6, we discussed the
difference between these two goals in detail. For instance, training error and generalization
error generally differ: since the objective function of the optimization algorithm is usually a
loss function based on the training dataset, the goal of optimization is to reduce the training
error. However, the goal of deep learning (or more broadly, statistical inference) is to reduce
the generalization error. To accomplish the latter we need to pay attention to overfitting in
addition to using the optimization algorithm to reduce the training error.
%matplotlib inline
import numpy as np
import torch
from mpl_toolkits import mplot3d
from d2l import torch as d2l
To illustrate the aforementioned different goals, let’s consider the empirical risk and the risk.
As described in Section 4.7.3, the empirical risk is an average loss on the training dataset
while the risk is the expected loss on the entire population of data. Below we define two
functions: the risk function f and the empirical risk function g. Suppose that we have only a
finite amount of training data. As a result, here g is less smooth than f.
def f(x):
    return x * torch.cos(np.pi * x)

def g(x):
    return f(x) + 0.2 * torch.cos(5 * np.pi * x)
The graph below illustrates that the minimum of the empirical risk on a training dataset may
be at a different location from the minimum of the risk (generalization error).
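The plotting code itself is not reproduced in this excerpt; a minimal sketch that produces such a figure (the annotate helper and the plotting range are assumptions) could be:

def annotate(text, xy, xytext):  # Helper to add an annotated arrow to the current plot
    d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
                           arrowprops=dict(arrowstyle='->'))

x = torch.arange(0.5, 1.5, 0.01)
d2l.set_figsize((4.5, 2.5))
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('min of risk', (1.1, -1.05), (0.95, -0.5))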
In this chapter, we are going to focus specifically on the performance of optimization algo-
rithms in minimizing the objective function, rather than a model’s generalization error. In
Section 3.1 we distinguished between analytical solutions and numerical solutions in opti-
mization problems. In deep learning, most objective functions are complicated and do not
have analytical solutions. Instead, we must use numerical optimization algorithms. The opti-
mization algorithms in this chapter all fall into this category.
12.1.2 Challenges in Deep Learning Optimization
There are many challenges in deep learning optimization. Some of the most vexing ones are
local minima, saddle points, and vanishing gradients. Let’s have a look at them.
Local Minima
For any objective function f (x), if the value of f (x) at x is smaller than the values of f (x)
at any other points in the vicinity of x, then f (x) could be a local minimum. If the value of
f (x) at x is the minimum of the objective function over the entire domain, then f (x) is the
global minimum.
Given the function f(x) = x·cos(πx) defined above, we can approximate its local minimum and global minimum.
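The corresponding plotting code is again omitted here; a sketch reusing the annotate helper from above (annotation coordinates are approximate) is:

x = torch.arange(-1.0, 2.0, 0.01)
d2l.plot(x, [f(x), ], 'x', 'f(x)')
annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
annotate('global minimum', (1.1, -0.95), (0.6, 0.8))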
The objective function of deep learning models usually has many local optima. When the nu-
merical solution of an optimization problem is near the local optimum, the numerical solution
obtained by the final iteration may only minimize the objective function locally, rather than
globally, since the gradient of the objective function approaches or becomes zero there.
Only some degree of noise might knock the parameter out of the local minimum. In fact,
this is one of the beneficial properties of minibatch stochastic gradient descent where the
natural variation of gradients over minibatches is able to dislodge the parameters from local
minima.
Saddle Points
Besides local minima, saddle points are another reason for gradients to vanish. A saddle point
is any location where all gradients of a function vanish but which is neither a global nor a local
minimum. Consider the function f(x) = x³. Its first and second derivatives vanish for x = 0.
Optimization might stall at this point, even though it is not a minimum.
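A short sketch that visualizes this (reusing the annotate helper; coordinates approximate):

x = torch.arange(-2.0, 2.0, 0.01)
d2l.plot(x, [x**3], 'x', 'f(x)')
annotate('saddle point', (0, -0.2), (-0.52, -5.0))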
Saddle points in higher dimensions are even more insidious, as the example below shows.
Consider the function f(x, y) = x² − y². It has its saddle point at (0, 0). This is a maximum
with respect to y and a minimum with respect to x. Moreover, it looks like a saddle, which
is where this mathematical property got its name.
x, y = torch.meshgrid(torch.linspace(-1.0, 1.0, 101),
                      torch.linspace(-1.0, 1.0, 101), indexing='ij')  # explicit indexing avoids a deprecation warning
z = x**2 - y**2

ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');
We assume that the input of a function is a k-dimensional vector and its output is a scalar,
so its Hessian matrix will have k eigenvalues. The solution of the function could be a local
minimum, a local maximum, or a saddle point at a position where the function gradient is
zero:
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all
positive, we have a local minimum for the function.
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all
negative, we have a local maximum for the function.
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position include
both negative and positive values, we have a saddle point for the function.
For high-dimensional problems the likelihood that at least some of the eigenvalues are negative
is quite high. This makes saddle points more likely than local minima. We will discuss some
exceptions to this situation in the next section when introducing convexity. In short, convex
functions are those where the eigenvalues of the Hessian are never negative. Sadly, though,
most deep learning problems do not fall into this category. Nonetheless it is a great tool to
study optimization algorithms.
Vanishing Gradients
Probably the most insidious problem to encounter is the vanishing gradient. Recall our commonly-
used activation functions and their derivatives in Section 5.1.2. For instance, assume that we
want to minimize the function f (x) = tanh(x) and we happen to get started at x = 4. As
we can see, the gradient of f is close to nil. More specifically, f ′ (x) = 1 − tanh2 (x) and
thus f ′ (4) = 0.0013. Consequently, optimization will get stuck for a long time before we
make progress. This turns out to be one of the reasons that training deep learning models was
quite tricky prior to the introduction of the ReLU activation function.
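We can check this numerically and visualize it; the snippet below is a sketch (the annotation coordinates are approximate):

x = torch.arange(-2.0, 5.0, 0.01)
d2l.plot(x, [torch.tanh(x)], 'x', 'f(x)')
annotate('vanishing gradient', (4, 1), (2, 0.0))
print(1 - torch.tanh(torch.tensor(4.0))**2)  # f'(4), approx. 0.0013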
As we saw, optimization for deep learning is full of challenges. Fortunately there exists a
robust range of algorithms that perform well and that are easy to use even for beginners. Fur-
thermore, it is not really necessary to find the best solution. Local optima or even approximate
solutions thereof are still very useful.
12.1.3 Summary
• Minimizing the training error does not guarantee that we find the best set of parameters to
minimize the generalization error.
• The problem may have even more saddle points, as generally the problems are not convex.
12.1.4 Exercises
1. Consider a simple MLP with a single hidden layer of, say, d dimensions in the hidden
layer and a single output. Show that for any local minimum there are at least d! equivalent
solutions that behave identically.
2. Assume that we have a symmetric random matrix M where the entries Mij = Mji are
each drawn from some probability distribution pij . Furthermore assume that pij (x) =
pij (−x), i.e., that the distribution is symmetric (see e.g., Wigner (1958) for details).
1. Prove that the distribution over eigenvalues is also symmetric. That is, for any eigenvec-
tor v the probability that the associated eigenvalue λ satisfies P (λ > 0) = P (λ < 0).
2. Why does the above not imply P (λ > 0) = 0.5?
3. What other challenges involved in deep learning optimization can you think of?
4. Assume that you want to balance a (real) ball on a (real) saddle.
1. Why is this hard?
2. Can you exploit this effect also for optimization algorithms?
Discussions 165
12.2 Convexity
Convexity plays a vital role in the design of optimization algorithms. This is largely due to
the fact that it is much easier to analyze and test algorithms in such a context. In other words,
if the algorithm performs poorly even in the convex setting, typically we should not hope
to see great results otherwise. Furthermore, even though the optimization problems in deep
learning are generally nonconvex, they often exhibit some properties of convex ones near
local minima. This can lead to exciting new optimization variants such as stochastic weight
averaging (Izmailov et al., 2018).
%matplotlib inline
import numpy as np
import torch
from mpl_toolkits import mplot3d
from d2l import torch as d2l
12.2.1 Definitions
Before convex analysis, we need to define convex sets and convex functions. They lead to
mathematical tools that are commonly applied to machine learning.
Convex Sets
Sets are the basis of convexity. Simply put, a set X in a vector space is convex if for any
a, b ∈ X the line segment connecting a and b is also in X . In mathematical terms this means
that for all λ ∈ [0, 1] we have

λa + (1 − λ)b ∈ X.    (12.2.1)
This sounds a bit abstract. Consider Fig.12.2.1. The first set is not convex since there exist
line segments that are not contained in it. The other two sets suffer no such problem.
Definitions on their own are not particularly useful unless you can do something with them.
In this case we can look at intersections as shown in Fig.12.2.2. Assume that X and Y are
convex sets. Then X ∩ Y is also convex. To see this, consider any a, b ∈ X ∩ Y. Since X and
Figure 12.2.1 The first set is nonconvex and the other two are convex.
Y are convex, the line segments connecting a and b are contained in both X and Y. Given
that, they also need to be contained in X ∩ Y, thus proving our theorem.
Figure 12.2.2 The intersection between two convex sets is convex.
We can strengthen this result with little effort: given convex sets Xi , their intersection ∩i Xi
is convex. To see that the converse is not true, consider two disjoint sets X ∩ Y = ∅. Now
pick a ∈ X and b ∈ Y. The line segment in Fig.12.2.3 connecting a and b needs to contain
some part that is neither in X nor in Y, since we assumed that X ∩ Y = ∅. Hence the line
segment is not in X ∪ Y either, thus proving that in general unions of convex sets need not
be convex.
Figure 12.2.3 The union of two convex sets need not be convex.
Typically the problems in deep learning are defined on convex sets. For instance, Rd , the set
of d-dimensional vectors of real numbers, is a convex set (after all, the line between any two
points in Rd remains in Rd ). In some cases we work with variables of bounded length, such
as balls of radius r as defined by {x|x ∈ Rd and ∥x∥ ≤ r}.
Convex Functions
Now that we have convex sets we can introduce convex functions f . Given a convex set X , a
function f : X → R is convex if for all x, x′ ∈ X and for all λ ∈ [0, 1] we have

λf(x) + (1 − λ)f(x′) ≥ f(λx + (1 − λ)x′).    (12.2.2)
To illustrate this let’s plot a few functions and check which ones satisfy the requirement.
Below we define a few functions, both convex and nonconvex.
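The definitions and the plotting call are not included in this excerpt; a sketch consistent with the discussion (and with the variables x and segment reused further below) is:

f = lambda x: 0.5 * x**2             # Convex
g = lambda x: torch.cos(np.pi * x)   # Nonconvex
h = lambda x: torch.exp(0.5 * x)     # Convex

x, segment = torch.arange(-2, 2, 0.01), torch.tensor([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
    d2l.plot([x, segment], [func(x), func(segment)], axes=ax)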
As expected, the cosine function is nonconvex, whereas the parabola and the exponential
function are. Note that the requirement that X is a convex set is necessary for the condition
to make sense. Otherwise the outcome of f (λx+(1−λ)x′ ) might not be well defined.
Jensen’s Inequality
Given a convex function f , one of the most useful mathematical tools is Jensen’s inequality.
It amounts to a generalization of the definition of convexity:
∑_i α_i f(x_i) ≥ f(∑_i α_i x_i)    and    E_X[f(X)] ≥ f(E_X[X]),    (12.2.3)

where the α_i are nonnegative real numbers such that ∑_i α_i = 1 and X is a random variable. In
other words, the expectation of a convex function is no less than the convex function of an
expectation, where the latter is usually a simpler expression. To prove the first inequality we
repeatedly apply the definition of convexity to one term in the sum at a time.
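As a quick numerical sanity check (not from the book), we can verify the second inequality for the convex function x² using random samples:

quad = lambda x: x ** 2          # a convex function
X = torch.randn(100000)          # samples of a zero-mean random variable
print(quad(X).mean())            # E[f(X)], close to Var(X) = 1
print(quad(X.mean()))            # f(E[X]), close to 0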
One of the common applications of Jensen’s inequality is to bound a more complicated ex-
pression by a simpler one. For example, it can be applied to the log-likelihood
of partially observed random variables. That is, we use

E_{Y∼P(Y)}[−log P(X | Y)] ≥ −log P(X),

since ∫ P(Y) P(X | Y) dY = P(X).
12.2.2 Properties
Convex functions have many useful properties. We describe a few commonly-used ones be-
low.
First and foremost, the local minima of convex functions are also the global minima. We can
prove it by contradiction as follows.
Consider a convex function f defined on a convex set X . Suppose that x∗ ∈ X is a local
minimum: there exists a small positive value p so that for x ∈ X that satisfies 0 < |x − x∗ | ≤
p we have f (x∗ ) < f (x).
Assume that the local minimum x∗ is not the global minimum of f: there exists x′ ∈ X
for which f(x′) < f(x∗). There also exists λ ∈ [0, 1), such as λ = 1 − p/|x∗ − x′|, so that
0 < |λx∗ + (1 − λ)x′ − x∗| ≤ p.

However, according to the definition of convex functions, we have

f(λx∗ + (1 − λ)x′) ≤ λf(x∗) + (1 − λ)f(x′) < λf(x∗) + (1 − λ)f(x∗) = f(x∗),

which contradicts our statement that x∗ is a local minimum. Therefore, there does not exist
x′ ∈ X for which f(x′) < f(x∗): the local minimum x∗ is also the global minimum.

For instance, the convex function f(x) = (x − 1)² attains its minimum at x = 1, as illustrated below.
f = lambda x: (x - 1) ** 2
d2l.set_figsize()
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')
The fact that the local minima for convex functions are also the global minima is very conve-
nient. It means that if we minimize functions we cannot “get stuck”. Note, though, that this
does not mean that there cannot be more than one global minimum, or even that a global
minimum exists at all. For instance, the function f(x) = max(|x| − 1, 0) attains its minimum value over
the interval [−1, 1]. Conversely, the function f (x) = exp(x) does not attain a minimum
value on R: for x → −∞ it asymptotes to 0, but there is no x for which f (x) = 0.
We can conveniently define convex sets via below sets of convex functions. Concretely, given
a convex function f defined on a convex set X, any below set

S_b := {x | x ∈ X and f(x) ≤ b}

is convex.
Let’s prove this quickly. Recall that for any x, x′ ∈ Sb we need to show that λx+(1−λ)x′ ∈
Sb as long as λ ∈ [0, 1]. Since f (x) ≤ b and f (x′ ) ≤ b, by the definition of convexity we
have

f(λx + (1 − λ)x′) ≤ λf(x) + (1 − λ)f(x′) ≤ b,

hence λx + (1 − λ)x′ ∈ S_b.
Convexity and Second Derivatives
Whenever the second derivative of a function f : Rⁿ → R exists, it is easy to check whether
f is convex: all we need to do is check whether its Hessian is positive semidefinite, ∇²f ⪰ 0.
In the one-dimensional case this amounts to checking whether f″ ≥ 0.
First, we need to prove the one-dimensional case. To see that convexity of f implies f″ ≥ 0
we use the fact that
(1/2) f(x + ϵ) + (1/2) f(x − ϵ) ≥ f((x + ϵ)/2 + (x − ϵ)/2) = f(x).    (12.2.8)
Since the second derivative is given by the limit over finite differences it follows that

f″(x) = lim_{ϵ→0} [f(x + ϵ) + f(x − ϵ) − 2f(x)] / ϵ² ≥ 0.    (12.2.9)
To see that f ′′ ≥ 0 implies that f is convex we use the fact that f ′′ ≥ 0 implies that f ′
is a monotonically nondecreasing function. Let a < x < b be three points in R, where
x = (1 − λ)a + λb and λ ∈ (0, 1). According to the mean value theorem, there exist
α ∈ [a, x] and β ∈ [x, b] such that
f′(α) = [f(x) − f(a)] / (x − a)    and    f′(β) = [f(b) − f(x)] / (b − x).    (12.2.10)
By monotonicity f ′ (β) ≥ f ′ (α), hence
[(x − a)/(b − a)] f(b) + [(b − x)/(b − a)] f(a) ≥ f(x).    (12.2.11)
Since x = (1 − λ)a + λb, we have

λf(b) + (1 − λ)f(a) ≥ f((1 − λ)a + λb),    (12.2.12)

thus proving convexity.

Second, we need a lemma before proving the multi-dimensional case: a function f : Rⁿ → R is
convex if and only if for all x, y ∈ Rⁿ

g(z) := f(zx + (1 − z)y), where z ∈ [0, 1],    (12.2.13)

is convex.
To prove that convexity of f implies that g is convex, we can show that for all a, b, λ ∈ [0, 1]
(thus 0 ≤ λa + (1 − λ)b ≤ 1)
g(λa + (1 − λ)b)
  = f((λa + (1 − λ)b) x + (1 − λa − (1 − λ)b) y)
  = f(λ(ax + (1 − a)y) + (1 − λ)(bx + (1 − b)y))    (12.2.14)
  ≤ λ f(ax + (1 − a)y) + (1 − λ) f(bx + (1 − b)y)
  = λ g(a) + (1 − λ) g(b).
To prove that convexity of g implies that f is convex, we can show that

f(λx + (1 − λ)y)
  = g(λ · 1 + (1 − λ) · 0)
  ≤ λ g(1) + (1 − λ) g(0)    (12.2.15)
  = λ f(x) + (1 − λ) f(y).
Finally, using the lemma above and the result of the one-dimensional case, the multi-dimensional
case can be proven as follows. A multi-dimensional function f : Rⁿ → R is convex if and
only if for all x, y ∈ Rⁿ the function g(z) := f(zx + (1 − z)y), where z ∈ [0, 1], is convex. Accord-
ing to the one-dimensional case, this holds if and only if g″ = (x − y)⊤ H (x − y) ≥ 0
(with H := ∇²f) for all x, y ∈ Rⁿ, which is equivalent to H ⪰ 0 per the definition of positive
semidefinite matrices.
12.2.3 Constraints
One of the nice properties of convex optimization is that it allows us to handle constraints
efficiently. That is, it allows us to solve constrained optimization problems of the form:
minimize_x f(x)    subject to c_i(x) ≤ 0 for all i ∈ {1, …, n},    (12.2.16)
where f is the objective and the functions ci are constraint functions. To see what this does
consider the case where c1 (x) = ∥x∥2 − 1. In this case the parameters x are constrained to
the unit ball. If a second constraint is c2 (x) = v⊤ x + b, then this corresponds to all x lying
in a half-space. Satisfying both constraints simultaneously amounts to selecting a slice of a
ball.
Lagrangian
Skipping over the derivation of the Lagrangian L, the above reasoning can be expressed via
the following saddle point optimization problem:
L(x, α₁, …, αₙ) = f(x) + ∑_{i=1}^n α_i c_i(x)    where α_i ≥ 0.    (12.2.17)
Here the variables αi (i = 1, . . . , n) are the so-called Lagrange multipliers that ensure that
constraints are properly enforced. They are chosen just large enough to ensure that ci (x) ≤ 0
for all i. For instance, for any x where ci (x) < 0 naturally, we’d end up picking αi = 0.
Moreover, this is a saddle point optimization problem where one wants to maximize L with
respect to all αi and simultaneously minimize it with respect to x. There is a rich body of
literature explaining how to arrive at the function L(x, α1 , . . . , αn ). For our purposes it is
sufficient to know that the saddle point of L is where the original constrained optimization
problem is solved optimally.
Penalties
In fact, we have been using this trick all along. Consider weight decay in Section 3.7. In it
we add (λ/2)∥w∥² to the objective function to ensure that w does not grow too large. From the
constrained optimization point of view we can see that this will ensure that ∥w∥2 − r2 ≤ 0
for some radius r. Adjusting the value of λ allows us to vary the size of w.
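As a rough sketch (hypothetical variable names, not the book's training code), the penalty simply becomes an extra term in the loss:

lambd = 0.01                             # hypothetical regularization constant
w = torch.randn(5, requires_grad=True)   # hypothetical weight vector
data_loss = torch.tensor(0.0)            # placeholder for the data-dependent loss term
loss = data_loss + (lambd / 2) * torch.sum(w ** 2)  # weight-decay penalty added to the objective
loss.backward()
print(w.grad)  # equals lambd * w: the penalty pulls w toward zero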
Projections
An alternative strategy for satisfying constraints is projections. We encountered them before,
e.g., when clipping gradients: there one ensures that a gradient g has length bounded by θ via
g ← g · min(1, θ/∥g∥). This turns out to be a projection of g onto the ball of radius θ. More
generally, a projection onto a convex set X is defined as

Proj_X(x) = argmin_{x′ ∈ X} ∥x − x′∥,    (12.2.19)

which is the closest point in X to x.
Figure 12.2.4 Convex Projections.
The mathematical definition of projections may sound a bit abstract. Fig.12.2.4 explains it
somewhat more clearly. In it we have two convex sets, a circle and a diamond. Points inside
both sets (yellow) remain unchanged during projections. Points outside both sets (black) are
projected to the points inside the sets (red) that are closest to the original points (black). While
for ℓ2 balls this leaves the direction unchanged, this need not be the case in general, as can
be seen in the case of the diamond.
One of the uses for convex projections is to compute sparse weight vectors. In this case we
project weight vectors onto an ℓ1 ball, which is a generalized version of the diamond case in
Fig.12.2.4.
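As a small sketch (hypothetical helper, not part of d2l), the ℓ2-ball projection from the gradient-clipping example can be written as:

def project_l2_ball(x, theta):
    """Return the point of the ball of radius theta closest to x."""
    norm = torch.norm(x)
    return x if norm <= theta else x * (theta / norm)

print(project_l2_ball(torch.tensor([3.0, 4.0]), 1.0))  # tensor([0.6000, 0.8000])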
12.2.4 Summary
In the context of deep learning the main purpose of convex functions is to motivate opti-
mization algorithms and help us understand them in detail. In the following we will see how
gradient descent and stochastic gradient descent can be derived accordingly.
• The expectation of a convex function is no less than the convex function of an expectation
(Jensen’s inequality).
• Convex constraints can be added via the Lagrangian. In practice we may simply add them
with a penalty to the objective function.
• Projections map to points in the convex set closest to the original points.
12.2.5 Exercises
1. Assume that we want to verify convexity of a set by drawing all lines between points within
the set and checking whether the lines are contained.
3. Given convex functions f and g, show that max(f, g) is convex, too. Show that min(f, g)
need not be convex.
4. Prove that the normalization of the softmax function is convex. More specifically prove
the convexity of f(x) = log ∑_i exp(x_i).
5. Prove that linear subspaces, i.e., X = {x|Wx = b}, are convex sets.
6. Prove that in the case of linear subspaces with b = 0 the projection ProjX can be written
as Mx for some matrix M.
Discussions 166
12.3 Gradient Descent
In this section we are going to introduce the basic concepts underlying gradient descent. Al-
though it is rarely used directly in deep learning, an understanding of gradient descent is key
to understanding stochastic gradient descent algorithms. For instance, the optimization prob-
lem might diverge due to an overly large learning rate. This phenomenon can already be seen
in gradient descent. Likewise, preconditioning is a common technique in gradient descent
and carries over to more advanced algorithms. Let’s start with a simple special case.
12.3.1 One-Dimensional Gradient Descent
Gradient descent in one dimension is an excellent example to explain why the gradient de-
scent algorithm may reduce the value of the objective function. Consider some continuously
differentiable real-valued function f : R → R. Using a Taylor expansion we obtain

f(x + ϵ) = f(x) + ϵ f′(x) + O(ϵ²).    (12.3.1)
That is, in first-order approximation f (x + ϵ) is given by the function value f (x) and the first
derivative f ′ (x) at x. It is not unreasonable to assume that for small ϵ moving in the direction
of the negative gradient will decrease f . To keep things simple we pick a fixed step size η > 0
and choose ϵ = −η f′(x). Plugging this into the Taylor expansion above we get

f(x − η f′(x)) = f(x) − η f′²(x) + O(η² f′²(x)).    (12.3.2)
If the derivative does not vanish, i.e., f′(x) ≠ 0, we make progress since η f′²(x) > 0. Moreover,
we can always choose η small enough for the higher-order terms to become irrelevant. Hence
we arrive at

f(x − η f′(x)) ⪅ f(x).    (12.3.3)

This means that, if we use

x ← x − η f′(x)    (12.3.4)

to iterate x, the value of the function f(x) might decline. Therefore, in gradient descent we first
choose an initial value x and a constant η > 0 and then use them to continuously iterate x
until the stop condition is reached, for example, when the magnitude of the gradient |f ′ (x)|
is small enough or the number of iterations has reached a certain value.
For simplicity we choose the objective function f (x) = x2 to illustrate how to implement
gradient descent. Although we know that x = 0 is the solution to minimize f (x), we still use
this simple function to observe how x changes.
%matplotlib inline
import numpy as np
import torch
from d2l import torch as d2l
Next, we use x = 10 as the initial value and assume η = 0.2. Using gradient descent to
iterate x for 10 times we can see that, eventually, the value of x approaches the optimal
solution.
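The definitions of f, f_grad, gd, and show_trace are not included in this excerpt; a minimal sketch consistent with the calls below (the exact plotting details are assumptions) is:

def f(x):  # Objective function
    return x ** 2

def f_grad(x):  # Gradient (derivative) of the objective function
    return 2 * x

def gd(eta, f_grad):  # Run 10 gradient descent steps from x = 10
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)
        results.append(float(x))
    print(f'epoch 10, x: {x:f}')
    return results

def show_trace(results, f):  # Plot the objective and the iterates on top of it
    n = max(abs(min(results)), abs(max(results)))
    f_line = torch.arange(-n, n, 0.01)
    d2l.set_figsize()
    d2l.plot([f_line, results], [[f(x) for x in f_line], [f(x) for x in results]],
             'x', 'f(x)', fmts=['-', '-o'])

results = gd(0.2, f_grad)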
show_trace(results, f)
Learning Rate
The learning rate η can be set by the algorithm designer. If we use a learning rate that is too
small, it will cause x to update very slowly, requiring more iterations to get a better solution.
To show what happens in such a case, consider the progress in the same optimization problem
for η = 0.05. As we can see, even after 10 steps we are still very far from the optimal
solution.
show_trace(gd(0.05, f_grad), f)
Conversely, if we use an excessively high learning rate, |ηf ′ (x)| might be too large for the
first-order Taylor expansion formula. That is, the term O(η 2 f ′2 (x)) in (12.3.2) might become
significant. In this case, we cannot guarantee that the iteration of x will be able to lower the
value of f (x). For example, when we set the learning rate to η = 1.1, x overshoots the
optimal solution x = 0 and gradually diverges.
show_trace(gd(1.1, f_grad), f)
Local Minima
To illustrate what happens for nonconvex functions consider the case of f (x) = x·cos(cx) for
some constant c. This function has infinitely many local minima. Depending on our choice of
the learning rate and depending on how well conditioned the problem is, we may end up with
one of many solutions. The example below illustrates how an (unrealistically) high learning
rate will lead to a poor local minimum.
c = torch.tensor(0.15 * np.pi)
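# For this example the objective and its gradient are redefined (a sketch
# consistent with the calls to gd and show_trace above):
def f(x):  # Objective function
    return x * torch.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return torch.cos(c * x) - c * x * torch.sin(c * x)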
show_trace(gd(2, f_grad), f)
12.3.2 Multivariate Gradient Descent
Now that we have a better intuition of the univariate case, let’s consider the situation where
x = [x₁, x₂, …, x_d]⊤. That is, the objective function f : Rᵈ → R maps vectors into scalars.
Correspondingly, its gradient is a vector consisting of d partial derivatives,

∇f(x) = [∂f(x)/∂x₁, ∂f(x)/∂x₂, …, ∂f(x)/∂x_d]⊤.    (12.3.5)

As in the univariate case, a Taylor expansion gives

f(x + ϵ) = f(x) + ϵ⊤ ∇f(x) + O(∥ϵ∥²).    (12.3.6)

In other words, up to second-order terms in ϵ the direction of steepest descent is given by the
negative gradient −∇f (x). Choosing a suitable learning rate η > 0 yields the prototypical
gradient descent algorithm:

x ← x − η ∇f(x).    (12.3.7)
To see how the algorithm behaves in practice let’s construct an objective function f(x) =
x₁² + 2x₂² with a two-dimensional vector x = [x₁, x₂]⊤ as input and a scalar as output. The
gradient is given by ∇f(x) = [2x₁, 4x₂]⊤. We will observe the trajectory of x by gradient
descent from the initial position [−5, −2].
To begin with, we need two more helper functions. The first uses an update function and
applies it 20 times to the initial value. The second helper visualizes the trajectory of x.
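These helpers are not reproduced in the excerpt; a sketch consistent with the calls below (the names f_2d, f_2d_grad, gd_2d, train_2d, and show_trace_2d match the code that follows, while the bodies are assumptions) might be:

def f_2d(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_2d_grad(x1, x2):  # Gradient of the objective function
    return (2 * x1, 4 * x2)

def gd_2d(x1, x2, s1, s2, f_grad):  # Gradient descent update (s1, s2 are unused state)
    g1, g2 = f_grad(x1, x2)
    return (x1 - eta * g1, x2 - eta * g2, 0, 0)

def train_2d(trainer, steps=20, f_grad=None):
    """Apply a customized update rule `trainer` for `steps` iterations from (-5, -2)."""
    x1, x2, s1, s2 = -5, -2, 0, 0
    results = [(x1, x2)]
    for i in range(steps):
        if f_grad:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad)
        else:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
        results.append((x1, x2))
    print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}')
    return results

def show_trace_2d(f, results):
    """Plot the trajectory of the 2D variables over a contour plot of f."""
    d2l.set_figsize()
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = torch.meshgrid(torch.arange(-5.5, 1.0, 0.1),
                            torch.arange(-3.0, 1.0, 0.1), indexing='ij')
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')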
Next, we observe the trajectory of the optimization variable x for learning rate η = 0.1. We
can see that after 20 steps the value of x approaches its minimum at [0, 0]. Progress is fairly
well-behaved albeit rather slow.
eta = 0.1
show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad))
12.3.3 Adaptive Methods
As we could see in Section 12.3.1, getting the learning rate η “just right” is tricky. If we
pick it too small, we make little progress. If we pick it too large, the solution oscillates and
in the worst case it might even diverge. What if we could determine η automatically or get
rid of having to select a learning rate at all? Second-order methods that look not only at the
value and gradient of the objective function but also at its curvature can help in this case.
While these methods cannot be applied to deep learning directly due to the computational
cost, they provide useful intuition into how to design advanced optimization algorithms that
mimic many of the desirable properties of the algorithms outlined below.
Newton’s Method
Reviewing the Taylor expansion of some function f : Rd → R there is no need to stop after
the first term. In fact, we can write it as
f(x + ϵ) = f(x) + ϵ⊤ ∇f(x) + (1/2) ϵ⊤ ∇²f(x) ϵ + O(∥ϵ∥³).    (12.3.8)
To avoid cumbersome notation we define H := ∇²f(x) to be the Hessian of f, which is
a d × d matrix. For small d and simple problems H is easy to compute. For deep neural
networks, on the other hand, H may be prohibitively large, due to the cost of storing O(d2 )
entries. Furthermore it may be too expensive to compute via backpropagation. For now let’s
ignore such considerations and look at what algorithm we would get.
After all, the minimum of f satisfies ∇f = 0. Following calculus rules in Section 2.4.3,
by taking derivatives of (12.3.8) with regard to ϵ and ignoring higher-order terms we arrive at

∇f(x) + H ϵ = 0 and hence ϵ = −H⁻¹ ∇f(x).    (12.3.9)
That is, we need to invert the Hessian H as part of the optimization problem.
As a simple example, for f(x) = (1/2)x² we have ∇f(x) = x and H = 1. Hence for any x
we obtain ϵ = −x. In other words, a single step is sufficient to converge perfectly without the
need for any adjustment! Alas, we got a bit lucky here: the Taylor expansion was exact since
f(x + ϵ) = (1/2)x² + ϵx + (1/2)ϵ².
Let’s see what happens in other problems. Given a convex hyperbolic cosine function f (x) =
cosh(cx) for some constant c, we can see that the global minimum at x = 0 is reached after
a few iterations.
c = torch.tensor(0.5)
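# Sketch of the objective, gradient, and Hessian assumed by the calls below:
def f(x):  # Objective function
    return torch.cosh(c * x)

def f_grad(x):  # Gradient of the objective function
    return c * torch.sinh(c * x)

def f_hess(x):  # Hessian of the objective function
    return c**2 * torch.cosh(c * x)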
def newton(eta=1):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x) / f_hess(x)
        results.append(float(x))
    print('epoch 10, x:', x)
    return results
show_trace(newton(), f)
Now let’s consider a nonconvex function, such as f (x) = x cos(cx) for some constant c.
After all, note that in Newton’s method we end up dividing by the Hessian. This means that
if the second derivative is negative we may walk into the direction of increasing the value of
f . That is a fatal flaw of the algorithm. Let’s see what happens in practice.
c = torch.tensor(0.15 * np.pi)
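# The objective, gradient, and Hessian are redefined for this nonconvex example
# (a sketch consistent with the calls to newton and show_trace):
def f(x):  # Objective function
    return x * torch.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return torch.cos(c * x) - c * x * torch.sin(c * x)

def f_hess(x):  # Hessian of the objective function
    return -2 * c * torch.sin(c * x) - x * c**2 * torch.cos(c * x)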
show_trace(newton(), f)
This went spectacularly wrong. How can we fix it? One way would be to “fix” the Hessian
by taking its absolute value instead. Another strategy is to bring back the learning rate. This
seems to defeat the purpose, but not quite. Having second-order information allows us to
be cautious whenever the curvature is large and to take longer steps whenever the objective
function is flatter. Let’s see how this works with a slightly smaller learning rate, say η = 0.5.
As we can see, we have quite an efficient algorithm.
show_trace(newton(0.5), f)
Convergence Analysis
We only analyze the convergence rate of Newton’s method for some convex and three times
differentiable objective function f , where the second derivative is nonzero, i.e., f ′′ > 0. The
multivariate proof is a straightforward extension of the one-dimensional argument below and
omitted since it does not help us much in terms of intuition.
Denote by x^(k) the value of x at the k-th iteration and let e^(k) := x^(k) − x* be the distance from
optimality at the k-th iteration. By Taylor expansion we have that the condition f′(x*) = 0
can be written as

0 = f′(x^(k) − e^(k)) = f′(x^(k)) − e^(k) f″(x^(k)) + (1/2)(e^(k))² f‴(ξ^(k)),    (12.3.10)
which holds for some ξ (k) ∈ [x(k) − e(k) , x(k) ]. Dividing the above expansion by f ′′ (x(k) )
yields
e^(k) − f′(x^(k)) / f″(x^(k)) = (1/2)(e^(k))² f‴(ξ^(k)) / f″(x^(k)).    (12.3.11)
Recall that we have the update x(k+1) = x(k) − f ′ (x(k) )/f ′′ (x(k) ). Plugging in this update
equation and taking the absolute value of both sides, we have
|e^(k+1)| = (1/2)(e^(k))² |f‴(ξ^(k)) / f″(x^(k))|.    (12.3.12)
Consequently, whenever we are in a region of bounded |f‴(ξ^(k))| / (2 f″(x^(k))) ≤ c, we
have a quadratically decreasing error

|e^(k+1)| ≤ c (e^(k))².    (12.3.13)
As an aside, optimization researchers call this linear convergence, whereas a condition such
as |e^(k+1)| ≤ α |e^(k)| would be called a constant rate of convergence. Note that this analysis
comes with a number of caveats. First, we do not really have much of a guarantee when
we will reach the region of rapid convergence. Instead, we only know that once we reach it,
convergence will be very quick. Second, this analysis requires that f is well-behaved up to
higher-order derivatives. It comes down to ensuring that f does not have any “surprising”
properties in terms of how it might change its values.
Preconditioning
Quite unsurprisingly computing and storing the full Hessian is very expensive. It is thus desir-
able to find alternatives. One way to improve matters is preconditioning. It avoids computing
the Hessian in its entirety but only computes the diagonal entries. This leads to update algorithms of the form

x ← x − η diag(H)⁻¹ ∇f(x).    (12.3.14)
While this is not quite as good as the full Newton’s method, it is still much better than not
using it. To see why this might be a good idea consider a situation where one variable de-
notes height in millimeters and the other one denotes height in kilometers. Assuming that
for both the natural scale is in meters, we have a terrible mismatch in parameterizations.
Fortunately, using preconditioning removes this. Effectively preconditioning with gradient
descent amounts to selecting a different learning rate for each variable (coordinate of vector
x). As we will see later, preconditioning drives some of the innovation in stochastic gradient
descent optimization algorithms.
Gradient Descent with Line Search
One of the key problems in gradient descent is that we might overshoot the goal or make
insufficient progress. A simple fix for the problem is to use line search in conjunction with
gradient descent. That is, we use the direction given by ∇f (x) and then perform binary
search as to which learning rate η minimizes f (x − η∇f (x)).
This algorithm converges rapidly (for an analysis and proof see e.g., Boyd and Vandenberghe
(2004)). However, for the purpose of deep learning this is not quite so feasible, since each
step of the line search would require us to evaluate the objective function on the entire dataset.
This is way too costly to accomplish.
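The following is a minimal sketch of this idea (with a crude search over a fixed grid of learning rates instead of a true binary search; all names are hypothetical):

def gd_line_search(x, f, f_grad, steps=10, etas=(0.01, 0.1, 0.5, 1.0)):
    for _ in range(steps):
        g = f_grad(x)
        # Among the candidate step sizes, keep the one that minimizes f along -g
        x = min((x - eta * g for eta in etas), key=f)
    return x

print(gd_line_search(10.0, lambda x: x ** 2, lambda x: 2 * x))  # converges to 0.0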
12.3.4 Summary
• Learning rates matter. Too large and we diverge, too small and we do not make progress.
• Newton’s method is a lot faster once it has started working properly in convex problems.
• Beware of using Newton’s method without any adjustments for nonconvex problems.
12.3.5 Exercises
1. Experiment with different learning rates and objective functions for gradient descent.
2. Implement line search to minimize a convex function in the interval [a, b].
1. Do you need derivatives for binary search, i.e., to decide whether to pick [a, (a + b)/2]
or [(a + b)/2, b]?
2. Use the absolute values of that rather than the actual (possibly signed) values.
Discussions 167
12.4 Stochastic Gradient Descent
In earlier chapters we kept using stochastic gradient descent in our training procedure without,
however, explaining why it works. To shed some light on it, we just described the basic
principles of gradient descent in Section 12.3. In this section, we go on to discuss stochastic
gradient descent in greater detail.
%matplotlib inline
import math
import torch
from d2l import torch as d2l
12.4.1 Stochastic Gradient Updates
In deep learning, the objective function is usually the average of the loss functions for each
example in the training dataset. Given a training dataset of n examples, we assume that fi (x)
is the loss function with respect to the training example of index i, where x is the parameter
vector. Then we arrive at the objective function

f(x) = (1/n) ∑_{i=1}^n f_i(x).    (12.4.1)

The gradient of the objective function at x is computed as

∇f(x) = (1/n) ∑_{i=1}^n ∇f_i(x).    (12.4.2)
If gradient descent is used, the computational cost for each independent variable iteration is
O(n), which grows linearly with n. Therefore, when the training dataset is larger, the cost of
gradient descent for each iteration will be higher.
Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each it-
eration of stochastic gradient descent, we uniformly sample an index i ∈ {1, . . . , n} for data
examples at random, and compute the gradient ∇f_i(x) to update x:

x ← x − η ∇f_i(x),    (12.4.3)
where η is the learning rate. We can see that the computational cost for each iteration drops
from O(n) of the gradient descent to the constant O(1). Moreover, we want to emphasize
that the stochastic gradient ∇fi (x) is an unbiased estimate of the full gradient ∇f (x) be-
cause
E_i ∇f_i(x) = (1/n) ∑_{i=1}^n ∇f_i(x) = ∇f(x).    (12.4.4)
This means that, on average, the stochastic gradient is a good estimate of the gradient.
Now, we will compare it with gradient descent by adding random noise with a mean of 0 and
a variance of 1 to the gradient to simulate a stochastic gradient descent.
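The definitions of f, f_grad, and sgd used below are not shown in the excerpt; a sketch consistent with the calls (the noise injection mirrors the description above) is:

def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate a noisy (stochastic) gradient
    g1 += torch.normal(0.0, 1, (1,)).item()
    g2 += torch.normal(0.0, 1, (1,)).item()
    eta_t = eta * lr()  # `eta` and `lr` are set below
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)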
def constant_lr():
    return 1
eta = 0.1
lr = constant_lr # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
As we can see, the trajectory of the variables in the stochastic gradient descent is much
more noisy than the one we observed in gradient descent in Section 12.3. This is due to the
stochastic nature of the gradient. That is, even when we arrive near the minimum, we are still
subject to the uncertainty injected by the instantaneous gradient via η∇fi (x). Even after 50
steps the quality is still not so good. Even worse, it will not improve after additional steps (we
encourage you to experiment with a larger number of steps to confirm this). This leaves us
with the only alternative: change the learning rate η. However, if we pick this too small, we
will not make any meaningful progress initially. On the other hand, if we pick it too large, we
will not get a good solution, as seen above. The only way to resolve these conflicting goals is
to reduce the learning rate dynamically as optimization progresses.
This is also the reason for adding a learning rate function lr into the sgd step function. In
the example above any functionality for learning rate scheduling lies dormant as we set the
associated lr function to be constant.
12.4.2 Dynamic Learning Rate
Replacing η with a time-dependent learning rate η(t) adds to the complexity of controlling
convergence of an optimization algorithm. In particular, we need to figure out how rapidly
η should decay. If it is too quick, we will stop optimizing prematurely. If we decrease it too
slowly, we waste too much time on optimization. The following are a few basic strategies that
are used in adjusting η over time (we will discuss more advanced strategies later):

η(t) = η_i if t_i ≤ t ≤ t_{i+1}      (piecewise constant)
η(t) = η_0 · e^{−λt}                 (exponential decay)
η(t) = η_0 · (βt + 1)^{−α}           (polynomial decay)    (12.4.5)
In the first piecewise constant scenario we decrease the learning rate, e.g., whenever progress
in optimization stalls. This is a common strategy for training deep networks. Alternatively
we could decrease it much more aggressively by an exponential decay. Unfortunately this
often leads to premature stopping before the algorithm has converged. A popular choice is
polynomial decay with α = 0.5. In the case of convex optimization there are a number of
proofs that show that this rate is well behaved.
def exponential_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))
As expected, the variance in the parameters is significantly reduced. However, this comes
at the expense of failing to converge to the optimal solution x = (0, 0). Even after 1000
iteration steps we are still very far away from the optimal solution. Indeed, the algorithm
fails to converge at all. On the other hand, if we use a polynomial decay where the learning
rate decays with the inverse square root of the number of steps, convergence gets better after
only 50 steps.
def polynomial_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
There exist many more choices for how to set the learning rate. For instance, we could start
with a small rate, then rapidly ramp up and then decrease it again, albeit more slowly. We
could even alternate between smaller and larger learning rates. There exists a large variety of
such schedules. For now let’s focus on learning rate schedules for which a comprehensive the-
oretical analysis is possible, i.e., on learning rates in a convex setting. For general nonconvex
problems it is very difficult to obtain meaningful convergence guarantees, since in general
minimizing nonlinear nonconvex problems is NP hard. For a survey see e.g., the excellent
lecture notes of Tibshirani 2015.
12.4.3 Convergence Analysis for Convex Objectives
The following convergence analysis of stochastic gradient descent for convex objective func-
tions is optional and primarily serves to convey more intuition about the problem. We limit
ourselves to one of the simplest proofs (Nesterov and Vial, 2000). Significantly more ad-
vanced proof techniques exist, e.g., whenever the objective function is particularly well be-
haved.
Suppose that the objective function f (ξ, x) is convex in x for all ξ. More concretely, we
consider the stochastic gradient descent update:

x_{t+1} = x_t − η_t ∂_x f(ξ_t, x_t),    (12.4.6)

where f(ξ_t, x) is the objective function with respect to the training example ξ_t drawn from
some distribution at step t and x is the model parameter. Denote by

R(x) = E_ξ[f(ξ, x)]    (12.4.7)

the expected risk and by R* its minimum with regard to x. Last let x* be the minimizer
(we assume that it exists within the domain where x is defined). In this case we can track
the distance between the current parameter xt at time t and the risk minimizer x∗ and see
whether it improves over time:
∥x_{t+1} − x*∥²
  = ∥x_t − η_t ∂_x f(ξ_t, x_t) − x*∥²    (12.4.8)
  = ∥x_t − x*∥² + η_t² ∥∂_x f(ξ_t, x_t)∥² − 2η_t ⟨x_t − x*, ∂_x f(ξ_t, x_t)⟩.
We assume that the ℓ2 norm of the stochastic gradient ∂_x f(ξ_t, x_t) is bounded by some constant
L, hence we have that

η_t² ∥∂_x f(ξ_t, x_t)∥² ≤ η_t² L².    (12.4.9)
We are mostly interested in how the distance between xt and x∗ changes in expectation.
In fact, for any specific sequence of steps the distance might well increase, depending on
whichever ξ t we encounter. Hence we need to bound the dot product. Since for any convex
function f it holds that f (y) ≥ f (x) + ⟨f ′ (x), y − x⟩ for all x and y, by convexity we
have
f (ξ t , x∗ ) ≥ f (ξ t , xt ) + ⟨x∗ − xt , ∂x f (ξ t , xt )⟩ . (12.4.10)
Plugging both inequalities (12.4.9) and (12.4.10) into (12.4.8) we obtain a bound on the
distance between parameters at time t + 1 as follows:

∥x_t − x*∥² − ∥x_{t+1} − x*∥² ≥ 2η_t (f(ξ_t, x_t) − f(ξ_t, x*)) − η_t² L².    (12.4.11)
This means that we make progress as long as the difference between current loss and the
optimal loss outweighs ηt L2 /2. Since this difference is bound to converge to zero it follows
that the learning rate ηt also needs to vanish.
Next we take expectations over (12.4.11). This yields

E[∥x_t − x*∥²] − E[∥x_{t+1} − x*∥²] ≥ 2η_t [E[R(x_t)] − R*] − η_t² L².    (12.4.12)

The last step involves summing over the inequalities for t ∈ {1, …, T}. Since the sum
telescopes and by dropping the lower term we obtain
∥x_1 − x*∥² ≥ 2 (∑_{t=1}^T η_t [E[R(x_t)] − R*]) − L² ∑_{t=1}^T η_t².    (12.4.13)
Note that we exploited that x1 is given and thus the expectation can be dropped. Last de-
fine

x̄ := (∑_{t=1}^T η_t x_t) / (∑_{t=1}^T η_t).    (12.4.14)
Since

E[(∑_{t=1}^T η_t R(x_t)) / (∑_{t=1}^T η_t)] = (∑_{t=1}^T η_t E[R(x_t)]) / (∑_{t=1}^T η_t),    (12.4.15)

by Jensen’s inequality (setting i = t, α_i = η_t / ∑_{t=1}^T η_t in (12.2.3)) and convexity of R it
follows that this weighted average of the E[R(x_t)] is no less than E[R(x̄)], thus

∑_{t=1}^T η_t E[R(x_t)] ≥ ∑_{t=1}^T η_t E[R(x̄)].    (12.4.16)

Plugging this into (12.4.13) and abbreviating r := ∥x_1 − x*∥ yields the bound

E[R(x̄)] − R* ≤ (r² + L² ∑_{t=1}^T η_t²) / (2 ∑_{t=1}^T η_t).    (12.4.17)

In particular, for a constant learning rate η_t = r/(L√T) the right-hand side becomes rL/√T,
i.e., stochastic gradient descent converges to the optimum at rate O(1/√T) for convex objectives.
12.4.4 Stochastic Gradients and Finite Samples
So far we have played a bit fast and loose when it comes to talking about stochastic gra-
dient descent. We posited that we draw instances xi , typically with labels yi from some
distribution p(x, y) and that we use this to update the model parameters in some man-
ner. In particular, for a finite sample size we simply argued that the discrete distribution
p(x, y) = (1/n) ∑_{i=1}^n δ_{x_i}(x) δ_{y_i}(y) for some functions δ_{x_i} and δ_{y_i} allows us to perform stochas-
tic gradient descent over it.
However, this is not really what we did. In the toy examples in the current section we simply
added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs (xi , yi ).
It turns out that this is justified here (see the exercises for a detailed discussion). More trou-
bling is that in all previous discussions we clearly did not do this. Instead we iterated over all
instances exactly once. To see why this is preferable consider the converse, namely that we
are sampling n observations from the discrete distribution with replacement. The probability
of choosing an element i at random is 1/n. Thus to choose it at least once is

P(choose i) = 1 − P(omit i) = 1 − (1 − 1/n)ⁿ ≈ 1 − e⁻¹ ≈ 0.63.    (12.4.18)
A similar reasoning shows that the probability of picking some sample (i.e., training example)
exactly once is given by
(n choose 1) · (1/n) · (1 − 1/n)^{n−1} = (n/(n − 1)) · (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.37.    (12.4.19)
Sampling with replacement leads to an increased variance and decreased data efficiency rel-
ative to sampling without replacement. Hence, in practice we perform the latter (and this is
the default choice throughout this book). Last note that repeated passes through the training
dataset traverse it in a different random order.
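A quick simulation (not from the book) confirms the roughly 63% / 37% split: when drawing n indices with replacement, only about 63% of the distinct training examples are visited at all.

n = 100000
idx = torch.randint(0, n, (n,))       # n draws with replacement
print(len(set(idx.tolist())) / n)     # fraction of distinct examples seen, approx. 0.63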
12.4.5 Summary
• For convex problems we can prove that for a wide choice of learning rates stochastic gra-
dient descent will converge to the optimal solution.
• For deep learning this is generally not the case. However, the analysis of convex problems
gives us useful insight into how to approach optimization, namely to reduce the learning
rate progressively, albeit not too quickly.
• Problems occur when the learning rate is too small or too large. In practice a suitable
learning rate is often found only after multiple experiments.
• When there are more examples in the training dataset, it costs more to compute each
iteration for gradient descent, so stochastic gradient descent is preferred in these cases.
• Optimality guarantees for stochastic gradient descent are in general not available in non-
convex cases since the number of local minima that require checking might well be
exponential.
12.4.6 Exercises
1. Experiment with different learning rate schedules for stochastic gradient descent and with
different numbers of iterations. In particular, plot the distance from the optimal solution
(0, 0) as a function of the number of iterations.
2. Prove that for the function f(x₁, x₂) = x₁² + 2x₂² adding normal noise to the gradient is
equivalent to minimizing a loss function f(x, w) = (x₁ − w₁)² + 2(x₂ − w₂)² where x
is drawn from a normal distribution.
3. Compare convergence of stochastic gradient descent when you sample from {(x1 , y1 ), . . . , (xn , yn )}
with replacement and when you sample without replacement.
4. How would you change the stochastic gradient descent solver if some gradient (or rather
some coordinate associated with it) was consistently larger than all the other gradients?
5. Assume that f (x) = x2 (1 + sin x). How many local minima does f have? Can you
change f in such a way that to minimize it one needs to evaluate all the local minima?
Discussions 169
12.5 Minibatch Stochastic Gradient Descent
So far we encountered two extremes in the approach to gradient-based learning: Section 12.3
uses the full dataset to compute gradients and to update parameters, one pass at a time.