12 Optimization Algorithms

If you read the book in sequence up to this point you already used a number of optimization
algorithms to train deep learning models. They were the tools that allowed us to continue up-
dating model parameters and to minimize the value of the loss function, as evaluated on the
training set. Indeed, anyone content with treating optimization as a black box device to min-
imize objective functions in a simple setting might well content oneself with the knowledge
that there exists an array of incantations of such a procedure (with names such as “SGD” and
“Adam”).

To do well, however, some deeper knowledge is required. Optimization algorithms are im-
portant for deep learning. On the one hand, training a complex deep learning model can take
hours, days, or even weeks. The performance of the optimization algorithm directly affects
the model’s training efficiency. On the other hand, understanding the principles of differ-
ent optimization algorithms and the role of their hyperparameters will enable us to tune the
hyperparameters in a targeted manner to improve the performance of deep learning mod-
els.

In this chapter, we explore common deep learning optimization algorithms in depth. Almost
all optimization problems arising in deep learning are nonconvex. Nonetheless, the design and
analysis of algorithms in the context of convex problems have proven to be very instructive. It
is for that reason that this chapter includes a primer on convex optimization and the proof for
a very simple stochastic gradient descent algorithm on a convex objective function.

12.1 Optimization and Deep Learning

In this section, we will discuss the relationship between optimization and deep learning as well
as the challenges of using optimization in deep learning. For a deep learning problem, we will
usually define a loss function first. Once we have the loss function, we can use an optimization
algorithm in attempt to minimize the loss. In optimization, a loss function is often referred
to as the objective function of the optimization problem. By tradition and convention most
optimization algorithms are concerned with minimization. If we ever need to maximize an
objective there is a simple solution: just flip the sign on the objective.

12.1.1 Goal of Optimization

Although optimization provides a way to minimize the loss function for deep learning, in
essence, the goals of optimization and deep learning are fundamentally different. The for-
mer is primarily concerned with minimizing an objective whereas the latter is concerned
with finding a suitable model, given a finite amount of data. In Section 3.6, we discussed the
difference between these two goals in detail. For instance, training error and generalization
error generally differ: since the objective function of the optimization algorithm is usually a
loss function based on the training dataset, the goal of optimization is to reduce the training
error. However, the goal of deep learning (or more broadly, statistical inference) is to reduce
the generalization error. To accomplish the latter we need to pay attention to overfitting in
addition to using the optimization algorithm to reduce the training error.

%matplotlib inline
import numpy as np
import torch
from mpl_toolkits import mplot3d
from d2l import torch as d2l

To illustrate the aforementioned different goals, let’s consider the empirical risk and the risk.
As described in Section 4.7.3, the empirical risk is an average loss on the training dataset
while the risk is the expected loss on the entire population of data. Below we define two
functions: the risk function f and the empirical risk function g. Suppose that we have only a
finite amount of training data. As a result, here g is less smooth than f.

def f(x):
    return x * torch.cos(np.pi * x)

def g(x):
    return f(x) + 0.2 * torch.cos(5 * np.pi * x)

The graph below illustrates that the minimum of the empirical risk on a training dataset may
be at a different location from the minimum of the risk (generalization error).

def annotate(text, xy, xytext):  #@save
    d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
                           arrowprops=dict(arrowstyle='->'))

x = torch.arange(0.5, 1.5, 0.01)
d2l.set_figsize((4.5, 2.5))
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('min of risk', (1.1, -1.05), (0.95, -0.5))

12.1.2 Optimization Challenges in Deep Learning

In this chapter, we are going to focus specifically on the performance of optimization algo-
rithms in minimizing the objective function, rather than a model’s generalization error. In
Section 3.1 we distinguished between analytical solutions and numerical solutions in opti-
mization problems. In deep learning, most objective functions are complicated and do not
have analytical solutions. Instead, we must use numerical optimization algorithms. The opti-
mization algorithms in this chapter all fall into this category.

There are many challenges in deep learning optimization. Some of the most vexing ones are
local minima, saddle points, and vanishing gradients. Let’s have a look at them.

Local Minima

For any objective function f (x), if the value of f (x) at x is smaller than the values of f (x)
at any other points in the vicinity of x, then f (x) could be a local minimum. If the value of
f (x) at x is the minimum of the objective function over the entire domain, then f (x) is the
global minimum.

For example, given the function

f (x) = x · cos(πx) for − 1.0 ≤ x ≤ 2.0, (12.1.1)

we can approximate the local minimum and global minimum of this function.

x = torch.arange(-1.0, 2.0, 0.01)


d2l.plot(x, [f(x), ], 'x', 'f(x)')
annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
annotate('global minimum', (1.1, -0.95), (0.6, 0.8))

The objective function of deep learning models usually has many local optima. When the nu-
merical solution of an optimization problem is near the local optimum, the numerical solution
obtained by the final iteration may only minimize the objective function locally, rather than
globally, as the gradient of the objective function’s solutions approaches or becomes zero.
Only some degree of noise might knock the parameter out of the local minimum. In fact,
this is one of the beneficial properties of minibatch stochastic gradient descent where the
natural variation of gradients over minibatches is able to dislodge the parameters from local
minima.

Saddle Points

Besides local minima, saddle points are another reason for gradients to vanish. A saddle point
is any location where all gradients of a function vanish but which is neither a global nor a local
minimum. Consider the function f(x) = x³. Its first and second derivative vanish for x = 0.
Optimization might stall at this point, even though it is not a minimum.

x = torch.arange(-2.0, 2.0, 0.01)


d2l.plot(x, [x**3], 'x', 'f(x)')
annotate('saddle point', (0, -0.2), (-0.52, -5.0))

Saddle points in higher dimensions are even more insidious, as the example below shows.
Consider the function f(x, y) = x² − y². It has its saddle point at (0, 0). This is a maximum
with respect to y and a minimum with respect to x. Moreover, it looks like a saddle, which
is where this mathematical property got its name.

x, y = torch.meshgrid(
    torch.linspace(-1.0, 1.0, 101), torch.linspace(-1.0, 1.0, 101))
z = x**2 - y**2

ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');

/home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2895.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

We assume that the input of a function is a k-dimensional vector and its output is a scalar,
so its Hessian matrix will have k eigenvalues. The solution of the function could be a local
minimum, a local maximum, or a saddle point at a position where the function gradient is
zero:
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all
positive, we have a local minimum for the function.
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all
negative, we have a local maximum for the function.
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are
negative and positive, we have a saddle point for the function.
For high-dimensional problems the likelihood that at least some of the eigenvalues are negative
is quite high. This makes saddle points more likely than local minima. We will discuss some
exceptions to this situation in the next section when introducing convexity. In short, convex
functions are those where the eigenvalues of the Hessian are never negative. Sadly, though,
most deep learning problems do not fall into this category. Nonetheless it is a great tool to
study optimization algorithms.
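
As a concrete illustration of the eigenvalue criterion above (a minimal sketch, not part of the original text; the helper torch.autograd.functional.hessian is used and the function name f_saddle is illustrative), we can compute the Hessian of f(x, y) = x² − y² at its zero-gradient point and inspect the signs of its eigenvalues.

from torch.autograd.functional import hessian

def f_saddle(v):  # f(x, y) = x^2 - y^2 written for a 2D tensor v = (x, y)
    return v[0]**2 - v[1]**2

# Hessian at the zero-gradient point (0, 0): one positive and one negative
# eigenvalue, so (0, 0) is a saddle point rather than a minimum or maximum
H = hessian(f_saddle, torch.zeros(2))
print(torch.linalg.eigvalsh(H))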

Vanishing Gradients

Probably the most insidious problem to encounter is the vanishing gradient. Recall our commonly-
used activation functions and their derivatives in Section 5.1.2. For instance, assume that we
want to minimize the function f (x) = tanh(x) and we happen to get started at x = 4. As
we can see, the gradient of f is close to nil. More specifically, f′(x) = 1 − tanh²(x) and
thus f ′ (4) = 0.0013. Consequently, optimization will get stuck for a long time before we
make progress. This turns out to be one of the reasons that training deep learning models was
quite tricky prior to the introduction of the ReLU activation function.

x = torch.arange(-2.0, 5.0, 0.01)


d2l.plot(x, [torch.tanh(x)], 'x', 'f(x)')
annotate('vanishing gradient', (4, 1), (2, 0.0))

As we saw, optimization for deep learning is full of challenges. Fortunately there exists a
robust range of algorithms that perform well and that are easy to use even for beginners. Fur-
thermore, it is not really necessary to find the best solution. Local optima or even approximate
solutions thereof are still very useful.

12.1.3 Summary

• Minimizing the training error does not guarantee that we find the best set of parameters to
minimize the generalization error.

• The optimization problems may have many local minima.

• The problem may have even more saddle points, as generally the problems are not convex.

• Vanishing gradients can cause optimization to stall. Often a reparameterization of the problem helps. Good initialization of the parameters can be beneficial, too.

12.1.4 Exercises

1. Consider a simple MLP with a single hidden layer of, say, d dimensions in the hidden
layer and a single output. Show that for any local minimum there are at least d! equivalent
solutions that behave identically.

2. Assume that we have a symmetric random matrix M where the entries Mij = Mji are
each drawn from some probability distribution pij . Furthermore assume that pij (x) =
pij (−x), i.e., that the distribution is symmetric (see e.g., Wigner (1958) for details).
1. Prove that the distribution over eigenvalues is also symmetric. That is, for any eigenvec-
tor v the probability that the associated eigenvalue λ satisfies P (λ > 0) = P (λ < 0).
2. Why does the above not imply P (λ > 0) = 0.5?
3. What other challenges involved in deep learning optimization can you think of?
4. Assume that you want to balance a (real) ball on a (real) saddle.
1. Why is this hard?
2. Can you exploit this effect also for optimization algorithms?
Discussions 165

12.2 Convexity

Convexity plays a vital role in the design of optimization algorithms. This is largely due to
the fact that it is much easier to analyze and test algorithms in such a context. In other words,
if the algorithm performs poorly even in the convex setting, typically we should not hope
to see great results otherwise. Furthermore, even though the optimization problems in deep
learning are generally nonconvex, they often exhibit some properties of convex ones near
local minima. This can lead to exciting new optimization variants such as (Izmailov et al.,
2018).

%matplotlib inline
import numpy as np
import torch
from mpl_toolkits import mplot3d
from d2l import torch as d2l

12.2.1 Definitions

Before convex analysis, we need to define convex sets and convex functions. They lead to
mathematical tools that are commonly applied to machine learning.

Convex Sets

Sets are the basis of convexity. Simply put, a set X in a vector space is convex if for any
a, b ∈ X the line segment connecting a and b is also in X . In mathematical terms this means
that for all λ ∈ [0, 1] we have

λa + (1 − λ)b ∈ X whenever a, b ∈ X . (12.2.1)

This sounds a bit abstract. Consider Fig.12.2.1. The first set is not convex since there exist
line segments that are not contained in it. The other two sets suffer no such problem.
Definitions on their own are not particularly useful unless you can do something with them.
In this case we can look at intersections as shown in Fig.12.2.2. Assume that X and Y are
convex sets. Then X ∩ Y is also convex. To see this, consider any a, b ∈ X ∩ Y. Since X and

Figure 12.2.1 The first set is nonconvex and the other two are convex.

Y are convex, the line segments connecting a and b are contained in both X and Y. Given
that, they also need to be contained in X ∩ Y, thus proving our theorem.

Figure 12.2.2 The intersection between two convex sets is convex.

We can strengthen this result with little effort: given convex sets Xi , their intersection ∩i Xi
is convex. To see that the converse is not true, consider two disjoint sets X ∩ Y = ∅. Now
pick a ∈ X and b ∈ Y. The line segment in Fig.12.2.3 connecting a and b needs to contain
some part that is neither in X nor in Y, since we assumed that X ∩ Y = ∅. Hence the line
segment is not in X ∪ Y either, thus proving that in general unions of convex sets need not
be convex.

Figure 12.2.3 The union of two convex sets need not be convex.

Typically the problems in deep learning are defined on convex sets. For instance, Rd , the set
of d-dimensional vectors of real numbers, is a convex set (after all, the line between any two
points in Rd remains in Rd ). In some cases we work with variables of bounded length, such
as balls of radius r as defined by {x|x ∈ Rd and ∥x∥ ≤ r}.
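
As a quick numerical sanity check of (12.2.1) (purely illustrative; the two points below are arbitrary), convex combinations of two points in the unit ball stay inside the ball:

# Minimal sketch: convex combinations of two points in {x : ||x|| <= 1}
# remain in the ball, cf. (12.2.1)
a, b = torch.tensor([0.6, -0.3]), torch.tensor([-0.5, 0.4])
for lam in torch.linspace(0, 1, 5):
    assert torch.norm(lam * a + (1 - lam) * b) <= 1.0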

Convex Functions

Now that we have convex sets we can introduce convex functions f . Given a convex set X , a
function f : X → R is convex if for all x, x′ ∈ X and for all λ ∈ [0, 1] we have

λf (x) + (1 − λ)f (x′ ) ≥ f (λx + (1 − λ)x′ ). (12.2.2)

To illustrate this let’s plot a few functions and check which ones satisfy the requirement.
Below we define a few functions, both convex and nonconvex.

f = lambda x: 0.5 * x**2  # Convex
g = lambda x: torch.cos(np.pi * x)  # Nonconvex
h = lambda x: torch.exp(0.5 * x)  # Convex

x, segment = torch.arange(-2, 2, 0.01), torch.tensor([-1.5, 1])

d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
    d2l.plot([x, segment], [func(x), func(segment)], axes=ax)

As expected, the cosine function is nonconvex, whereas the parabola and the exponential
function are. Note that the requirement that X is a convex set is necessary for the condition
to make sense. Otherwise the outcome of f (λx+(1−λ)x′ ) might not be well defined.

Jensen’s Inequality

Given a convex function f , one of the most useful mathematical tools is Jensen’s inequality.
It amounts to a generalization of the definition of convexity:
∑i αi f(xi) ≥ f(∑i αi xi)   and   EX[f(X)] ≥ f(EX[X]),        (12.2.3)

where αi are nonnegative real numbers such that ∑i αi = 1 and X is a random variable. In
other words, the expectation of a convex function is no less than the convex function of an
expectation, where the latter is usually a simpler expression. To prove the first inequality we
repeatedly apply the definition of convexity to one term in the sum at a time.
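
A small numerical illustration (not part of the original text; the sample and the name f_convex are illustrative) of the second inequality in (12.2.3) for the convex function f(x) = x²:

# Minimal sketch: E[f(X)] >= f(E[X]) for the convex function f(x) = x^2
f_convex = lambda x: x ** 2
X = torch.randn(10000)
print(f_convex(X).mean(), f_convex(X.mean()))  # the first value is the larger one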

One of the common applications of Jensen’s inequality is to bound a more complicated ex-
pression by a simpler one. For example, its application can be with regard to the log-likelihood
of partially observed random variables. That is, we use

EY ∼P (Y ) [− log P (X | Y )] ≥ − log P (X), (12.2.4)



since ∫ P(Y)P(X | Y) dY = P(X). This can be used in variational methods. Here Y is
typically the unobserved random variable, P (Y ) is the best guess of how it might be dis-
tributed, and P (X) is the distribution with Y integrated out. For instance, in clustering Y
might be the cluster labels and P (X | Y ) is the generative model when applying cluster
labels.

12.2.2 Properties

Convex functions have many useful properties. We describe a few commonly-used ones be-
low.

Local Minima Are Global Minima

First and foremost, the local minima of convex functions are also the global minima. We can
prove it by contradiction as follows.
Consider a convex function f defined on a convex set X . Suppose that x∗ ∈ X is a local
minimum: there exists a small positive value p so that for x ∈ X that satisfies 0 < |x − x∗ | ≤
p we have f (x∗ ) < f (x).
Assume that the local minimum x∗ is not the global minimum of f: there exists x′ ∈ X
for which f(x′) < f(x∗). There also exists λ ∈ [0, 1), such as λ = 1 − p/|x∗ − x′|, so that

0 < |λx∗ + (1 − λ)x′ − x∗| ≤ p.


However, according to the definition of convex functions, we have

f(λx∗ + (1 − λ)x′) ≤ λf(x∗) + (1 − λ)f(x′)
                   < λf(x∗) + (1 − λ)f(x∗)        (12.2.5)
                   = f(x∗),

which contradicts our statement that x∗ is a local minimum. Therefore, there does not
exist x′ ∈ X for which f(x′) < f(x∗). The local minimum x∗ is also the global minimum.
For instance, the convex function f(x) = (x − 1)² has a local minimum at x = 1, which is
also the global minimum.

f = lambda x: (x - 1) ** 2
d2l.set_figsize()
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')

The fact that the local minima for convex functions are also the global minima is very conve-
nient. It means that if we minimize functions we cannot “get stuck”. Note, though, that this
does not mean that there cannot be more than one global minimum, or even that one
exists at all. For instance, the function f(x) = max(|x| − 1, 0) attains its minimum value over
the interval [−1, 1]. Conversely, the function f (x) = exp(x) does not attain a minimum
value on R: for x → −∞ it asymptotes to 0, but there is no x for which f (x) = 0.

Below Sets of Convex Functions Are Convex

We can conveniently define convex sets via below sets of convex functions. Concretely, given
a convex function f defined on a convex set X , any below set

Sb := {x|x ∈ X and f (x) ≤ b} (12.2.6)

is convex.

Let’s prove this quickly. Recall that for any x, x′ ∈ Sb we need to show that λx+(1−λ)x′ ∈
Sb as long as λ ∈ [0, 1]. Since f (x) ≤ b and f (x′ ) ≤ b, by the definition of convexity we
have

f (λx + (1 − λ)x′ ) ≤ λf (x) + (1 − λ)f (x′ ) ≤ b. (12.2.7)

Convexity and Second Derivatives

Whenever the second derivative of a function f : Rn → R exists it is very easy to check
whether f is convex. All we need to do is check whether the Hessian of f is positive semidefinite:
∇²f ⪰ 0, i.e., denoting the Hessian matrix ∇²f by H, x⊤Hx ≥ 0 for all x ∈ Rn.
For instance, the function f(x) = ½∥x∥² is convex since ∇²f = 1, i.e., its Hessian is an
identity matrix.
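
We can also check this numerically (a minimal sketch, assuming torch.autograd.functional.hessian; the function name half_sq_norm is illustrative): the Hessian of f(x) = ½∥x∥² is the identity, so all of its eigenvalues are nonnegative.

from torch.autograd.functional import hessian

def half_sq_norm(x):  # f(x) = 0.5 * ||x||^2
    return 0.5 * torch.dot(x, x)

# All eigenvalues of the Hessian equal 1, so the Hessian is positive semidefinite
H = hessian(half_sq_norm, torch.randn(3))
print(torch.linalg.eigvalsh(H))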

Formally, a twice-differentiable one-dimensional function f : R → R is convex if and only
if its second derivative f′′ ≥ 0. For any twice-differentiable multi-dimensional function f :
Rn → R, it is convex if and only if its Hessian ∇²f ⪰ 0.

First, we need to prove the one-dimensional case. To see that convexity of f implies f ′′ ≥ 0
we use the fact that
½ f(x + ϵ) + ½ f(x − ϵ) ≥ f((x + ϵ)/2 + (x − ϵ)/2) = f(x).        (12.2.8)
Since the second derivative is given by the limit over finite differences it follows that

f′′(x) = lim_{ϵ→0} [f(x + ϵ) + f(x − ϵ) − 2f(x)] / ϵ² ≥ 0.        (12.2.9)
To see that f ′′ ≥ 0 implies that f is convex we use the fact that f ′′ ≥ 0 implies that f ′
is a monotonically nondecreasing function. Let a < x < b be three points in R, where
x = (1 − λ)a + λb and λ ∈ (0, 1). According to the mean value theorem, there exist
α ∈ [a, x] and β ∈ [x, b] such that
f′(α) = [f(x) − f(a)] / (x − a)   and   f′(β) = [f(b) − f(x)] / (b − x).        (12.2.10)
By monotonicity f′(β) ≥ f′(α), hence

[(x − a)/(b − a)] f(b) + [(b − x)/(b − a)] f(a) ≥ f(x).        (12.2.11)
Since x = (1 − λ)a + λb, we have

λf (b) + (1 − λ)f (a) ≥ f ((1 − λ)a + λb), (12.2.12)

thus proving convexity.

Second, we need a lemma before proving the multi-dimensional case: f : Rn → R is convex
if and only if for all x, y ∈ Rn

g(z) := f(zx + (1 − z)y)  where z ∈ [0, 1]        (12.2.13)

is convex.

To prove that convexity of f implies that g is convex, we can show that for all a, b, λ ∈ [0, 1]
(thus 0 ≤ λa + (1 − λ)b ≤ 1)

g(λa + (1 − λ)b)
=f ((λa + (1 − λ)b) x + (1 − λa − (1 − λ)b) y)
=f (λ (ax + (1 − a)y) + (1 − λ) (bx + (1 − b)y)) (12.2.14)
≤λf (ax + (1 − a)y) + (1 − λ)f (bx + (1 − b)y)
=λg(a) + (1 − λ)g(b).

To prove the converse, we can show that for all λ ∈ [0, 1]

f(λx + (1 − λ)y) = g(λ · 1 + (1 − λ) · 0)
                 ≤ λg(1) + (1 − λ)g(0)        (12.2.15)
                 = λf(x) + (1 − λ)f(y).

Finally, using the lemma above and the result of the one-dimensional case, the multi-dimensional
case can be proven as follows. A multi-dimensional function f : Rn → R is convex if and
only if for all x, y ∈ Rn the function g(z) := f(zx + (1 − z)y), where z ∈ [0, 1], is convex. According
to the one-dimensional case, this holds if and only if g′′ = (x − y)⊤H(x − y) ≥ 0
(with H := ∇²f) for all x, y ∈ Rn, which is equivalent to H ⪰ 0 per the definition of positive
semidefinite matrices.

12.2.3 Constraints

One of the nice properties of convex optimization is that it allows us to handle constraints
efficiently. That is, it allows us to solve constrained optimization problems of the form:

minimize_x f(x)   subject to   ci(x) ≤ 0 for all i ∈ {1, . . . , n},        (12.2.16)

where f is the objective and the functions ci are constraint functions. To see what this does
consider the case where c1 (x) = ∥x∥2 − 1. In this case the parameters x are constrained to
the unit ball. If a second constraint is c2 (x) = v⊤ x + b, then this corresponds to all x lying
on a half-space. Satisfying both constraints simultaneously amounts to selecting a slice of a
ball.

Lagrangian

In general, solving a constrained optimization problem is difficult. One way of addressing it
stems from physics with a rather simple intuition. Imagine a ball inside a box. The ball will
roll to the place that is lowest and the forces of gravity will be balanced out with the forces
that the sides of the box can impose on the ball. In short, the gradient of the objective function
(i.e., gravity) will be offset by the gradient of the constraint function (the ball needs to remain
inside the box by virtue of the walls “pushing back”). Note that some constraints may not be
active: the walls that are not touched by the ball will not be able to exert any force on the
ball.

Skipping over the derivation of the Lagrangian L, the above reasoning can be expressed via
the following saddle point optimization problem:

L(x, α1, . . . , αn) = f(x) + ∑_{i=1}^{n} αi ci(x)   where αi ≥ 0.        (12.2.17)

Here the variables αi (i = 1, . . . , n) are the so-called Lagrange multipliers that ensure that
constraints are properly enforced. They are chosen just large enough to ensure that ci (x) ≤ 0
for all i. For instance, for any x where ci (x) < 0 naturally, we’d end up picking αi = 0.
Moreover, this is a saddle point optimization problem where one wants to maximize L with
respect to all αi and simultaneously minimize it with respect to x. There is a rich body of
literature explaining how to arrive at the function L(x, α1 , . . . , αn ). For our purposes it is
sufficient to know that the saddle point of L is where the original constrained optimization
problem is solved optimally.

Penalties

One way of satisfying constrained optimization problems at least approximately is to adapt
the Lagrangian L. Rather than satisfying ci(x) ≤ 0 we simply add αi ci(x) to the objective
function f(x). This ensures that the constraints will not be violated too badly.

In fact, we have been using this trick all along. Consider weight decay in Section 3.7. In it
we add (λ/2)∥w∥² to the objective function to ensure that w does not grow too large. From the
constrained optimization point of view we can see that this will ensure that ∥w∥² − r² ≤ 0
for some radius r. Adjusting the value of λ allows us to vary the size of w.

In general, adding penalties is a good way of ensuring approximate constraint satisfaction.
In practice this turns out to be much more robust than exact satisfaction. Furthermore, for
nonconvex problems many of the properties that make the exact approach so appealing in the
convex case (e.g., optimality) no longer hold.
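
As a minimal sketch of the penalty idea (the quadratic objective, the value of λ, and the small training loop below are illustrative assumptions, not the book's code), adding the weight decay term (λ/2)∥w∥² pulls the minimizer toward the origin:

# Minimal sketch: minimize 0.5 * ||w - c||^2 + (lambda / 2) * ||w||^2 with
# plain gradient descent; the penalty keeps w from growing too large
c = torch.tensor([2.0, 2.0])   # minimizer of the unpenalized objective
lambd, eta = 4.0, 0.05         # illustrative penalty weight and step size
w = torch.zeros(2, requires_grad=True)
for _ in range(200):
    loss = 0.5 * torch.sum((w - c) ** 2) + lambd / 2 * torch.sum(w ** 2)
    loss.backward()
    with torch.no_grad():
        w -= eta * w.grad
        w.grad.zero_()
print(w)  # close to c / (1 + lambda) = [0.4, 0.4]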

Projections

An alternative strategy for satisfying constraints is projections. Again, we encountered them
before, e.g., when dealing with gradient clipping in Section 9.5. There we ensured that a
gradient has length bounded by θ via

g ← g · min(1, θ/∥g∥). (12.2.18)

This turns out to be a projection of g onto the ball of radius θ. More generally, a projection
on a convex set X is defined as
ProjX(x) = argmin_{x′ ∈ X} ∥x − x′∥,        (12.2.19)

which is the closest point in X to x.
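
The following small sketch (not the book's code; the function name project_l2_ball is illustrative) implements this projection for the ℓ2 ball of radius θ, matching (12.2.18): points inside the ball are unchanged and points outside are rescaled onto its boundary.

def project_l2_ball(x, theta):  # projection onto {x : ||x|| <= theta}
    norm = torch.norm(x)
    return x if norm <= theta else x * (theta / norm)

print(project_l2_ball(torch.tensor([0.3, 0.4]), 1.0))  # already inside: unchanged
print(project_l2_ball(torch.tensor([3.0, 4.0]), 1.0))  # rescaled to [0.6, 0.8]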

Figure 12.2.4 Convex Projections.

The mathematical definition of projections may sound a bit abstract. Fig.12.2.4 explains it
somewhat more clearly. In it we have two convex sets, a circle and a diamond. Points inside
both sets (yellow) remain unchanged during projections. Points outside both sets (black) are
projected to the points inside the sets (red) that are closest to the original points (black). While
for ℓ2 balls this leaves the direction unchanged, this need not be the case in general, as can
be seen in the case of the diamond.

One of the uses for convex projections is to compute sparse weight vectors. In this case we
project weight vectors onto an ℓ1 ball, which is a generalized version of the diamond case in
Fig.12.2.4.

12.2.4 Summary

In the context of deep learning the main purpose of convex functions is to motivate opti-
mization algorithms and help us understand them in detail. In the following we will see how
gradient descent and stochastic gradient descent can be derived accordingly.

• Intersections of convex sets are convex. Unions are not.

• The expectation of a convex function is no less than the convex function of an expectation
(Jensen’s inequality).

• A twice-differentiable function is convex if and only if its Hessian (a matrix of second derivatives) is positive semidefinite.

• Convex constraints can be added via the Lagrangian. In practice we may simply add them
with a penalty to the objective function.

• Projections map to points in the convex set closest to the original points.

12.2.5 Exercises

1. Assume that we want to verify convexity of a set by drawing all lines between points within
the set and checking whether the lines are contained.

1. Prove that it is sufficient to check only the points on the boundary.

2. Prove that it is sufficient to check only the vertices of the set.


2. Denote by Bp[r] := {x | x ∈ Rd and ∥x∥p ≤ r} the ball of radius r using the p-norm.
Prove that Bp[r] is convex for all p ≥ 1.

3. Given convex functions f and g, show that max(f, g) is convex, too. Prove that min(f, g)
is not convex.

4. Prove that the normalization of the softmax function is convex. More specifically prove
the convexity of f(x) = log ∑i exp(xi).

5. Prove that linear subspaces, i.e., X = {x|Wx = b}, are convex sets.

6. Prove that in the case of linear subspaces with b = 0 the projection ProjX can be written
as Mx for some matrix M.

7. Show that for twice-differentiable convex functions f we can write f(x + ϵ) = f(x) +
ϵf′(x) + ½ϵ²f′′(x + ξ) for some ξ ∈ [0, ϵ].
8. Given a convex set X and two vectors x and y, prove that projections never increase
distances, i.e., ∥x − y∥ ≥ ∥ProjX (x) − ProjX (y)∥.

Discussions 166

12.3 Gradient Descent

In this section we are going to introduce the basic concepts underlying gradient descent. Al-
though it is rarely used directly in deep learning, an understanding of gradient descent is key
to understanding stochastic gradient descent algorithms. For instance, the optimization prob-
lem might diverge due to an overly large learning rate. This phenomenon can already be seen
in gradient descent. Likewise, preconditioning is a common technique in gradient descent
and carries over to more advanced algorithms. Let’s start with a simple special case.

12.3.1 One-Dimensional Gradient Descent

Gradient descent in one dimension is an excellent example to explain why the gradient de-
scent algorithm may reduce the value of the objective function. Consider some continuously
differentiable real-valued function f : R → R. Using a Taylor expansion we obtain

f (x + ϵ) = f (x) + ϵf ′ (x) + O(ϵ2 ). (12.3.1)

That is, in first-order approximation f (x + ϵ) is given by the function value f (x) and the first
derivative f ′ (x) at x. It is not unreasonable to assume that for small ϵ moving in the direction
of the negative gradient will decrease f . To keep things simple we pick a fixed step size η > 0
and choose ϵ = −ηf ′ (x). Plugging this into the Taylor expansion above we get

f(x − ηf′(x)) = f(x) − ηf′²(x) + O(η²f′²(x)).        (12.3.2)

If the derivative f ′ (x) ̸= 0 does not vanish we make progress since ηf ′2 (x) > 0. Moreover,
we can always choose η small enough for the higher-order terms to become irrelevant. Hence
we arrive at

f (x − ηf ′ (x)) ⪅ f (x). (12.3.3)

This means that, if we use

x ← x − ηf ′ (x) (12.3.4)

to iterate x, the value of function f (x) might decline. Therefore, in gradient descent we first
choose an initial value x and a constant η > 0 and then use them to continuously iterate x
until the stop condition is reached, for example, when the magnitude of the gradient |f ′ (x)|
is small enough or the number of iterations has reached a certain value.

For simplicity we choose the objective function f(x) = x² to illustrate how to implement
gradient descent. Although we know that x = 0 is the solution to minimize f (x), we still use
this simple function to observe how x changes.

%matplotlib inline
import numpy as np
import torch
from d2l import torch as d2l

def f(x):  # Objective function
    return x ** 2

def f_grad(x):  # Gradient (derivative) of the objective function
    return 2 * x

Next, we use x = 10 as the initial value and assume η = 0.2. Using gradient descent to
iterate x 10 times, we can see that, eventually, the value of x approaches the optimal
solution.

def gd(eta, f_grad):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)
        results.append(float(x))
    print(f'epoch 10, x: {x:f}')
    return results

results = gd(0.2, f_grad)



epoch 10, x: 0.060466

The progress of optimizing over x can be plotted as follows.

def show_trace(results, f):
    n = max(abs(min(results)), abs(max(results)))
    f_line = torch.arange(-n, n, 0.01)
    d2l.set_figsize()
    d2l.plot([f_line, results], [[f(x) for x in f_line],
                                 [f(x) for x in results]], 'x', 'f(x)', fmts=['-', '-o'])

show_trace(results, f)

Learning Rate

The learning rate η can be set by the algorithm designer. If we use a learning rate that is too
small, it will cause x to update very slowly, requiring more iterations to get a better solution.
To show what happens in such a case, consider the progress in the same optimization problem
for η = 0.05. As we can see, even after 10 steps we are still very far from the optimal
solution.

show_trace(gd(0.05, f_grad), f)

epoch 10, x: 3.486784

Conversely, if we use an excessively high learning rate, |ηf ′ (x)| might be too large for the
first-order Taylor expansion formula. That is, the term O(η²f′²(x)) in (12.3.2) might become
significant. In this case, we cannot guarantee that the iteration of x will be able to lower the
value of f (x). For example, when we set the learning rate to η = 1.1, x overshoots the
optimal solution x = 0 and gradually diverges.

show_trace(gd(1.1, f_grad), f)

epoch 10, x: 61.917364

Local Minima

To illustrate what happens for nonconvex functions consider the case of f (x) = x·cos(cx) for
some constant c. This function has infinitely many local minima. Depending on our choice of
the learning rate and depending on how well conditioned the problem is, we may end up with
one of many solutions. The example below illustrates how an (unrealistically) high learning
rate will lead to a poor local minimum.

c = torch.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * torch.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return torch.cos(c * x) - c * x * torch.sin(c * x)

show_trace(gd(2, f_grad), f)

epoch 10, x: -1.528166

12.3.2 Multivariate Gradient Descent

Now that we have a better intuition of the univariate case, let’s consider the situation where
x = [x1 , x2 , . . . , xd ]⊤ . That is, the objective function f : Rd → R maps vectors into

scalars. Correspondingly its gradient is multivariate, too. It is a vector consisting of d partial
derivatives:

∇f(x) = [∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xd]⊤.        (12.3.5)
Each partial derivative element ∂f (x)/∂xi in the gradient indicates the rate of change of f at
x with respect to the input xi . As before in the univariate case we can use the corresponding
Taylor approximation for multivariate functions to get some idea of what we should do. In
particular, we have that

f (x + ϵ) = f (x) + ϵ⊤ ∇f (x) + O(∥ϵ∥2 ). (12.3.6)

In other words, up to second-order terms in ϵ the direction of steepest descent is given by the
negative gradient −∇f (x). Choosing a suitable learning rate η > 0 yields the prototypical
gradient descent algorithm:

x ← x − η∇f (x). (12.3.7)

To see how the algorithm behaves in practice let’s construct an objective function f (x) =
x1² + 2x2² with a two-dimensional vector x = [x1, x2]⊤ as input and a scalar as output. The
gradient is given by ∇f (x) = [2x1 , 4x2 ]⊤ . We will observe the trajectory of x by gradient
descent from the initial position [−5, −2].
To begin with, we need two more helper functions. The first uses an update function and
applies it 20 times to the initial value. The second helper visualizes the trajectory of x.

def train_2d(trainer, steps=20, f_grad=None):  #@save
    """Optimize a 2D objective function with a customized trainer."""
    # `s1` and `s2` are internal state variables that will be used in momentum,
    # adagrad, and RMSProp
    x1, x2, s1, s2 = -5, -2, 0, 0
    results = [(x1, x2)]
    for i in range(steps):
        if f_grad:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad)
        else:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
        results.append((x1, x2))
    print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}')
    return results

def show_trace_2d(f, results):  #@save
    """Show the trace of 2D variables during optimization."""
    d2l.set_figsize()
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = torch.meshgrid(torch.arange(-5.5, 1.0, 0.1),
                            torch.arange(-3.0, 1.0, 0.1))
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')

Next, we observe the trajectory of the optimization variable x for learning rate η = 0.1. We
can see that after 20 steps the value of x approaches its minimum at [0, 0]. Progress is fairly
well-behaved albeit rather slow.

def f_2d(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_2d_grad(x1, x2):  # Gradient of the objective function
    return (2 * x1, 4 * x2)

def gd_2d(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    return (x1 - eta * g1, x2 - eta * g2, 0, 0)

eta = 0.1
show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad))

epoch 20, x1: -0.057646, x2: -0.000073


/home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2895.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

12.3.3 Adaptive Methods

As we could see in Section 12.3.1, getting the learning rate η “just right” is tricky. If we
pick it too small, we make little progress. If we pick it too large, the solution oscillates and
in the worst case it might even diverge. What if we could determine η automatically or get
rid of having to select a learning rate at all? Second-order methods that look not only at the
value and gradient of the objective function but also at its curvature can help in this case.
While these methods cannot be applied to deep learning directly due to the computational
cost, they provide useful intuition into how to design advanced optimization algorithms that
mimic many of the desirable properties of the algorithms outlined below.

Newton’s Method

Reviewing the Taylor expansion of some function f : Rd → R there is no need to stop after
the first term. In fact, we can write it as
f(x + ϵ) = f(x) + ϵ⊤∇f(x) + ½ ϵ⊤∇²f(x)ϵ + O(∥ϵ∥³).        (12.3.8)
To avoid cumbersome notation we define H := ∇²f(x) to be the Hessian of f, which is
a d × d matrix. For small d and simple problems H is easy to compute. For deep neural
networks, on the other hand, H may be prohibitively large, due to the cost of storing O(d2 )
entries. Furthermore it may be too expensive to compute via backpropagation. For now let’s
ignore such considerations and look at what algorithm we would get.
After all, the minimum of f satisfies ∇f = 0. Following calculus rules in Section 2.4.3,
by taking derivatives of (12.3.8) with regard to ϵ and ignoring higher-order terms we arrive
at

∇f (x) + Hϵ = 0 and hence ϵ = −H−1 ∇f (x). (12.3.9)



That is, we need to invert the Hessian H as part of the optimization problem.
As a simple example, for f(x) = ½x² we have ∇f(x) = x and H = 1. Hence for any x
we obtain ϵ = −x. In other words, a single step is sufficient to converge perfectly without the
need for any adjustment! Alas, we got a bit lucky here: the Taylor expansion was exact since
f(x + ϵ) = ½x² + ϵx + ½ϵ².
Let’s see what happens in other problems. Given a convex hyperbolic cosine function f (x) =
cosh(cx) for some constant c, we can see that the global minimum at x = 0 is reached after
a few iterations.

c = torch.tensor(0.5)

def f(x):  # Objective function
    return torch.cosh(c * x)

def f_grad(x):  # Gradient of the objective function
    return c * torch.sinh(c * x)

def f_hess(x):  # Hessian of the objective function
    return c**2 * torch.cosh(c * x)

def newton(eta=1):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x) / f_hess(x)
        results.append(float(x))
    print('epoch 10, x:', x)
    return results

show_trace(newton(), f)

epoch 10, x: tensor(0.)

Now let’s consider a nonconvex function, such as f (x) = x cos(cx) for some constant c.
After all, note that in Newton’s method we end up dividing by the Hessian. This means that
if the second derivative is negative we may walk into the direction of increasing the value of
f . That is a fatal flaw of the algorithm. Let’s see what happens in practice.

c = torch.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * torch.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return torch.cos(c * x) - c * x * torch.sin(c * x)

def f_hess(x):  # Hessian of the objective function
    return - 2 * c * torch.sin(c * x) - x * c**2 * torch.cos(c * x)

show_trace(newton(), f)

epoch 10, x: tensor(26.8341)

This went spectacularly wrong. How can we fix it? One way would be to “fix” the Hessian
by taking its absolute value instead. Another strategy is to bring back the learning rate. This
seems to defeat the purpose, but not quite. Having second-order information allows us to
be cautious whenever the curvature is large and to take longer steps whenever the objective
function is flatter. Let’s see how this works with a slightly smaller learning rate, say η = 0.5.
As we can see, we have quite an efficient algorithm.

show_trace(newton(0.5), f)

epoch 10, x: tensor(7.2699)

Convergence Analysis

We only analyze the convergence rate of Newton’s method for some convex and three times
differentiable objective function f , where the second derivative is nonzero, i.e., f ′′ > 0. The
multivariate proof is a straightforward extension of the one-dimensional argument below and
omitted since it does not help us much in terms of intuition.
Denote by x^(k) the value of x at the kth iteration and let e^(k) := x^(k) − x∗ be the distance from
optimality at the kth iteration. By Taylor expansion we have that the condition f′(x∗) = 0
can be written as

0 = f′(x^(k) − e^(k)) = f′(x^(k)) − e^(k) f′′(x^(k)) + ½ (e^(k))² f′′′(ξ^(k)),        (12.3.10)
which holds for some ξ^(k) ∈ [x^(k) − e^(k), x^(k)]. Dividing the above expansion by f′′(x^(k))
yields

e^(k) − f′(x^(k)) / f′′(x^(k)) = ½ (e^(k))² f′′′(ξ^(k)) / f′′(x^(k)).        (12.3.11)
Recall that we have the update x^(k+1) = x^(k) − f′(x^(k))/f′′(x^(k)). Plugging in this update
equation and taking the absolute value of both sides, we have

|e^(k+1)| = ½ (e^(k))² |f′′′(ξ^(k))| / f′′(x^(k)).        (12.3.12)

Consequently, whenever we are in a region of bounded |f′′′(ξ^(k))| / (2f′′(x^(k))) ≤ c, we
have a quadratically decreasing error

|e^(k+1)| ≤ c (e^(k))².        (12.3.13)

As an aside, optimization researchers call this linear convergence, whereas a condition such
as |e^(k+1)| ≤ α |e^(k)| would be called a constant rate of convergence. Note that this analysis
comes with a number of caveats. First, we do not really have much of a guarantee when
we will reach the region of rapid convergence. Instead, we only know that once we reach it,
convergence will be very quick. Second, this analysis requires that f is well-behaved up to
higher-order derivatives. It comes down to ensuring that f does not have any “surprising”
properties in terms of how it might change its values.

Preconditioning

Quite unsurprisingly computing and storing the full Hessian is very expensive. It is thus desir-
able to find alternatives. One way to improve matters is preconditioning. It avoids computing
the Hessian in its entirety but only computes the diagonal entries. This leads to update algo-
rithms of the form

x ← x − η diag(H)⁻¹ ∇f(x).        (12.3.14)

While this is not quite as good as the full Newton’s method, it is still much better than not
using it. To see why this might be a good idea consider a situation where one variable de-
notes height in millimeters and the other one denotes height in kilometers. Assuming that
for both the natural scale is in meters, we have a terrible mismatch in parameterizations.
Fortunately, using preconditioning removes this. Effectively preconditioning with gradient
descent amounts to selecting a different learning rate for each variable (coordinate of vector
x). As we will see later, preconditioning drives some of the innovation in stochastic gradient
descent optimization algorithms.
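
A minimal sketch of (12.3.14) on a badly scaled quadratic (the objective and its scales are illustrative assumptions, not the book's code): dividing each coordinate of the gradient by the corresponding diagonal Hessian entry makes both coordinates shrink at the same rate despite the mismatch in parameterization.

# Minimal sketch: diagonally preconditioned gradient descent on the badly
# scaled quadratic f(x) = 0.5 * (100 * x1^2 + 0.01 * x2^2)
diag_h = torch.tensor([100.0, 0.01])   # diagonal of the (constant) Hessian
x = torch.tensor([1.0, 1.0])
eta = 0.9
for _ in range(20):
    grad = diag_h * x                  # gradient of the quadratic
    x = x - eta * grad / diag_h        # preconditioned update (12.3.14)
print(x)  # both coordinates have shrunk by the same factor (1 - eta)**20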

Gradient Descent with Line Search

One of the key problems in gradient descent is that we might overshoot the goal or make
insufficient progress. A simple fix for the problem is to use line search in conjunction with
gradient descent. That is, we use the direction given by ∇f (x) and then perform binary
search as to which learning rate η minimizes f (x − η∇f (x)).
This algorithm converges rapidly (for an analysis and proof see e.g., Boyd and Vandenberghe
(2004)). However, for the purpose of deep learning this is not quite so feasible, since each
step of the line search would require us to evaluate the objective function on the entire dataset.
This is way too costly to accomplish.
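
A minimal sketch of gradient descent with line search (the ternary search routine, the quadratic objective f_quad, and the search interval [0, 2] are illustrative assumptions, not the book's implementation):

# Minimal sketch: each step size is chosen by ternary search over eta in
# [0, 2] to approximately minimize f(x - eta * grad) along the ray
def line_search_step(x, f, f_grad, eta_max=2.0, tol=1e-6):
    g = f_grad(x)
    lo, hi = 0.0, eta_max
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(x - m1 * g) < f(x - m2 * g):
            hi = m2
        else:
            lo = m1
    return x - 0.5 * (lo + hi) * g

f_quad = lambda x: x[0] ** 2 + 2 * x[1] ** 2
f_quad_grad = lambda x: x * torch.tensor([2.0, 4.0])
x = torch.tensor([-5.0, -2.0])
for _ in range(5):
    x = line_search_step(x, f_quad, f_quad_grad)
print(x)  # close to the minimizer [0, 0] after only a few steps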

12.3.4 Summary

• Learning rates matter. Too large and we diverge, too small and we do not make progress.

• Gradient descent can get stuck in local minima.

• In high dimensions adjusting the learning rate is complicated.

• Preconditioning can help with scale adjustment.

• Newton’s method is a lot faster once it has started working properly in convex problems.

• Beware of using Newton’s method without any adjustments for nonconvex problems.

12.3.5 Exercises

1. Experiment with different learning rates and objective functions for gradient descent.

2. Implement line search to minimize a convex function in the interval [a, b].

1. Do you need derivatives for binary search, i.e., to decide whether to pick [a, (a + b)/2]
or [(a + b)/2, b].

2. How rapid is the rate of convergence for the algorithm?

3. Implement the algorithm and apply it to minimizing log(exp(x) + exp(−2x − 3)).

3. Design an objective function defined on R2 where gradient descent is exceedingly slow.


Hint: scale different coordinates differently.

4. Implement the lightweight version of Newton’s method using preconditioning:

1. Use diagonal Hessian as preconditioner.

2. Use the absolute values of that rather than the actual (possibly signed) values.

3. Apply this to the problem above.


5. Apply the algorithm above to a number of objective functions (convex or not). What
happens if you rotate coordinates by 45 degrees?

Discussions 167

12.4 Stochastic Gradient Descent

In earlier chapters we kept using stochastic gradient descent in our training procedure, how-
ever, without explaining why it works. To shed some light on it, we just described the basic
principles of gradient descent in Section 12.3. In this section, we go on to discuss stochastic
gradient descent in greater detail.

%matplotlib inline
import math
import torch
from d2l import torch as d2l

12.4.1 Stochastic Gradient Updates

In deep learning, the objective function is usually the average of the loss functions for each
example in the training dataset. Given a training dataset of n examples, we assume that fi (x)
is the loss function with respect to the training example of index i, where x is the parameter
vector. Then we arrive at the objective function

f(x) = (1/n) ∑_{i=1}^{n} fi(x).        (12.4.1)

The gradient of the objective function at x is computed as

∇f(x) = (1/n) ∑_{i=1}^{n} ∇fi(x).        (12.4.2)

If gradient descent is used, the computational cost for each independent variable iteration is
O(n), which grows linearly with n. Therefore, when the training dataset is larger, the cost of
gradient descent for each iteration will be higher.

Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each it-
eration of stochastic gradient descent, we uniformly sample an index i ∈ {1, . . . , n} for data
examples at random, and compute the gradient ∇fi (x) to update x:

x ← x − η∇fi (x), (12.4.3)

where η is the learning rate. We can see that the computational cost for each iteration drops
from O(n) of the gradient descent to the constant O(1). Moreover, we want to emphasize
that the stochastic gradient ∇fi (x) is an unbiased estimate of the full gradient ∇f (x) be-
cause

Ei ∇fi(x) = (1/n) ∑_{i=1}^{n} ∇fi(x) = ∇f(x).        (12.4.4)

This means that, on average, the stochastic gradient is a good estimate of the gradient.
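
A small numerical check of this unbiasedness (the synthetic per-example losses below are illustrative, not part of the book): averaging many uniformly sampled per-example gradients recovers the full gradient.

# Minimal sketch: per-example losses f_i(x) = (x - a_i)^2, so the full gradient
# is the average of the per-example gradients 2 * (x - a_i)
a = torch.randn(1000)                        # illustrative "training data"
x = torch.tensor(2.0)
full_grad = (2 * (x - a)).mean()
idx = torch.randint(0, 1000, (100000,))      # uniformly sampled indices
print(full_grad, (2 * (x - a[idx])).mean())  # the two values are close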

Now, we will compare it with gradient descent by adding random noise with a mean of 0 and
a variance of 1 to the gradient to simulate a stochastic gradient descent.

def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient
    g1 += torch.normal(0.0, 1, (1,))
    g2 += torch.normal(0.0, 1, (1,))
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)

def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))

epoch 50, x1: -0.095588, x2: 0.195790


/home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/numpy/core/shape_base.py:65: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or-ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  ary = asanyarray(ary)
/home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2895.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

As we can see, the trajectory of the variables in the stochastic gradient descent is much
more noisy than the one we observed in gradient descent in Section 12.3. This is due to the
stochastic nature of the gradient. That is, even when we arrive near the minimum, we are still
subject to the uncertainty injected by the instantaneous gradient via η∇fi (x). Even after 50
steps the quality is still not so good. Even worse, it will not improve after additional steps (we
encourage you to experiment with a larger number of steps to confirm this). This leaves us
with the only alternative: change the learning rate η. However, if we pick this too small, we
will not make any meaningful progress initially. On the other hand, if we pick it too large, we
will not get a good solution, as seen above. The only way to resolve these conflicting goals is
to reduce the learning rate dynamically as optimization progresses.

This is also the reason for adding a learning rate function lr into the sgd step function. In
the example above any functionality for learning rate scheduling lies dormant as we set the
associated lr function to be constant.

12.4.2 Dynamic Learning Rate

Replacing η with a time-dependent learning rate η(t) adds to the complexity of controlling
convergence of an optimization algorithm. In particular, we need to figure out how rapidly
η should decay. If it is too quick, we will stop optimizing prematurely. If we decrease it too
slowly, we waste too much time on optimization. The following are a few basic strategies that
are used in adjusting η over time (we will discuss more advanced strategies later):

η(t) = ηi   if ti ≤ t ≤ ti+1        (piecewise constant)
η(t) = η0 · e^(−λt)                  (exponential decay)        (12.4.5)
η(t) = η0 · (βt + 1)^(−α)            (polynomial decay)

In the first piecewise constant scenario we decrease the learning rate, e.g., whenever progress
in optimization stalls. This is a common strategy for training deep networks. Alternatively
we could decrease it much more aggressively by an exponential decay. Unfortunately this
often leads to premature stopping before the algorithm has converged. A popular choice is

polynomial decay with α = 0.5. In the case of convex optimization there are a number of
proofs that show that this rate is well behaved.

Let’s see what the exponential decay looks like in practice.

def exponential_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))

epoch 1000, x1: -0.760049, x2: -0.045451

As expected, the variance in the parameters is significantly reduced. However, this comes
at the expense of failing to converge to the optimal solution x = (0, 0). Even after 1000
iteration steps we are still very far away from the optimal solution. Indeed, the algorithm
fails to converge at all. On the other hand, if we use a polynomial decay where the learning
rate decays with the inverse square root of the number of steps, convergence gets better after
only 50 steps.

def polynomial_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))

epoch 50, x1: 0.063074, x2: 0.038695

There exist many more choices for how to set the learning rate. For instance, we could start
with a small rate, then rapidly ramp up and then decrease it again, albeit more slowly. We
could even alternate between smaller and larger learning rates. There exists a large variety of
such schedules. For now let’s focus on learning rate schedules for which a comprehensive the-
oretical analysis is possible, i.e., on learning rates in a convex setting. For general nonconvex
problems it is very difficult to obtain meaningful convergence guarantees, since in general
minimizing nonlinear nonconvex problems is NP hard. For a survey see e.g., the excellent
lecture notes 168 of Tibshirani 2015.

12.4.3 Convergence Analysis for Convex Objectives

The following convergence analysis of stochastic gradient descent for convex objective func-
tions is optional and primarily serves to convey more intuition about the problem. We limit
ourselves to one of the simplest proofs (Nesterov and Vial, 2000). Significantly more ad-
vanced proof techniques exist, e.g., whenever the objective function is particularly well be-
haved.
Suppose that the objective function f (ξ, x) is convex in x for all ξ. More concretely, we
consider the stochastic gradient descent update:

xt+1 = xt − ηt ∂x f (ξ t , x), (12.4.6)

where f (ξ t , x) is the objective function with respect to the training example ξ t drawn from
some distribution at step t and x is the model parameter. Denote by

R(x) = Eξ [f (ξ, x)] (12.4.7)

the expected risk and by R∗ its minimum with regard to x. Last let x∗ be the minimizer
(we assume that it exists within the domain where x is defined). In this case we can track
the distance between the current parameter xt at time t and the risk minimizer x∗ and see
whether it improves over time:
∥xt+1 − x∗∥² = ∥xt − ηt ∂x f(ξ t, x) − x∗∥²        (12.4.8)
             = ∥xt − x∗∥² + ηt² ∥∂x f(ξ t, x)∥² − 2ηt ⟨xt − x∗, ∂x f(ξ t, x)⟩.
We assume that the ℓ2 norm of stochastic gradient ∂x f (ξ t , x) is bounded by some constant
L, hence we have that

ηt² ∥∂x f(ξ t, x)∥² ≤ ηt² L².        (12.4.9)

We are mostly interested in how the distance between xt and x∗ changes in expectation.
In fact, for any specific sequence of steps the distance might well increase, depending on
whichever ξ t we encounter. Hence we need to bound the dot product. Since for any convex
function f it holds that f (y) ≥ f (x) + ⟨f ′ (x), y − x⟩ for all x and y, by convexity we
have

f (ξ t , x∗ ) ≥ f (ξ t , xt ) + ⟨x∗ − xt , ∂x f (ξ t , xt )⟩ . (12.4.10)

Plugging both inequalities (12.4.9) and (12.4.10) into (12.4.8) we obtain a bound on the
distance between parameters at time t + 1 as follows:

∥xt − x∗∥² − ∥xt+1 − x∗∥² ≥ 2ηt (f(ξ t, xt) − f(ξ t, x∗)) − ηt² L².        (12.4.11)

This means that we make progress as long as the difference between current loss and the
optimal loss outweighs ηt L2 /2. Since this difference is bound to converge to zero it follows
that the learning rate ηt also needs to vanish.

Next we take expectations over (12.4.11). This yields

E[∥xt − x∗∥²] − E[∥xt+1 − x∗∥²] ≥ 2ηt [E[R(xt)] − R∗] − ηt² L².        (12.4.12)

The last step involves summing over the inequalities for t ∈ {1, . . . , T }. Since the sum
telescopes and by dropping the lower term we obtain
∥x1 − x∗∥² ≥ 2 ∑_{t=1}^{T} ηt [E[R(xt)] − R∗] − L² ∑_{t=1}^{T} ηt².        (12.4.13)

Note that we exploited that x1 is given and thus the expectation can be dropped. Last de-
fine
x̄ := (∑_{t=1}^{T} ηt xt) / (∑_{t=1}^{T} ηt).        (12.4.14)

Since

E[(∑_{t=1}^{T} ηt R(xt)) / (∑_{t=1}^{T} ηt)] = (∑_{t=1}^{T} ηt E[R(xt)]) / (∑_{t=1}^{T} ηt) = E[R(xt)],        (12.4.15)

by Jensen's inequality (setting i = t, αi = ηt / ∑_{t=1}^{T} ηt in (12.2.3)) and convexity of R it
follows that E[R(xt)] ≥ E[R(x̄)], thus


∑_{t=1}^{T} ηt E[R(xt)] ≥ ∑_{t=1}^{T} ηt E[R(x̄)].        (12.4.16)

Plugging this into the inequality (12.4.13) yields the bound

E[R(x̄)] − R∗ ≤ (r² + L² ∑_{t=1}^{T} ηt²) / (2 ∑_{t=1}^{T} ηt),        (12.4.17)
where r² := ∥x1 − x∗∥² is a bound on the distance between the initial choice of parame-
ters and the final outcome. In short, the speed of convergence depends on how the norm of
stochastic gradient is bounded (L) and how far away from optimality the initial parameter
value is (r). Note that the bound is in terms of x̄ rather than xT . This is the case since x̄ is
a smoothed version of the optimization path. Whenever r, L, and T are known we can pick
the constant learning rate η = r/(L√T): plugging ηt = η into (12.4.17) gives the bound
(r² + L²Tη²)/(2Tη), which is minimized by this choice of η and yields the upper bound rL/√T.
That is, we converge with rate O(1/√T) to the optimal solution.

12.4.4 Stochastic Gradients and Finite Samples

So far we have played a bit fast and loose when it comes to talking about stochastic gra-
dient descent. We posited that we draw instances xi , typically with labels yi from some
distribution p(x, y) and that we use this to update the model parameters in some man-
p(x, y) = (1/n) ∑_{i=1}^{n} δxi(x) δyi(y) for some functions δxi and δyi allows us to perform stochas-
p(x, y) = n1 i=1 δxi (x)δyi (y) for some functions δxi and δyi allows us to perform stochas-
tic gradient descent over it.

However, this is not really what we did. In the toy examples in the current section we simply
added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs (xi , yi ).
It turns out that this is justified here (see the exercises for a detailed discussion). More trou-
bling is that in all previous discussions we clearly did not do this. Instead we iterated over all
instances exactly once. To see why this is preferable consider the converse, namely that we
are sampling n observations from the discrete distribution with replacement. The probability
of choosing an element i at random is 1/n. Thus to choose it at least once is

P(choose i) = 1 − P(omit i) = 1 − (1 − 1/n)^n ≈ 1 − e^(−1) ≈ 0.63.        (12.4.18)



A similar reasoning shows that the probability of picking some sample (i.e., training example)
exactly once is given by
(n choose 1) (1/n) (1 − 1/n)^(n−1) = (n/(n − 1)) (1 − 1/n)^n ≈ e^(−1) ≈ 0.37.        (12.4.19)
Sampling with replacement leads to an increased variance and decreased data efficiency rel-
ative to sampling without replacement. Hence, in practice we perform the latter (and this is
the default choice throughout this book). Last note that repeated passes through the training
dataset traverse it in a different random order.
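
A quick simulation (illustrative; n = 1000 and 100 trials) agrees with the two probabilities derived above:

# Minimal sketch: estimate the probabilities in (12.4.18) and (12.4.19) by
# drawing n indices with replacement and counting how often each one appears
n, trials = 1000, 100
at_least_once, exactly_once = 0.0, 0.0
for _ in range(trials):
    counts = torch.bincount(torch.randint(0, n, (n,)), minlength=n)
    at_least_once += (counts > 0).float().mean()
    exactly_once += (counts == 1).float().mean()
print(at_least_once / trials, exactly_once / trials)  # roughly 0.63 and 0.37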

12.4.5 Summary

• For convex problems we can prove that for a wide choice of learning rates stochastic gra-
dient descent will converge to the optimal solution.
• For deep learning this is generally not the case. However, the analysis of convex problems
gives us useful insight into how to approach optimization, namely to reduce the learning
rate progressively, albeit not too quickly.
• Problems occur when the learning rate is too small or too large. In practice a suitable
learning rate is often found only after multiple experiments.
• When there are more examples in the training dataset, it costs more to compute each
iteration for gradient descent, so stochastic gradient descent is preferred in these cases.
• Optimality guarantees for stochastic gradient descent are in general not available in non-
convex cases since the number of local minima that require checking might well be
exponential.

12.4.6 Exercises

1. Experiment with different learning rate schedules for stochastic gradient descent and with
different numbers of iterations. In particular, plot the distance from the optimal solution
(0, 0) as a function of the number of iterations.
2. Prove that for the function f(x1, x2) = x1² + 2x2² adding normal noise to the gradient is
equivalent to minimizing a loss function f(x, w) = (x1 − w1)² + 2(x2 − w2)² where x
is drawn from a normal distribution.
3. Compare convergence of stochastic gradient descent when you sample from {(x1 , y1 ), . . . , (xn , yn )}
with replacement and when you sample without replacement.
4. How would you change the stochastic gradient descent solver if some gradient (or rather
some coordinate associated with it) was consistently larger than all the other gradients?
5. Assume that f(x) = x²(1 + sin x). How many local minima does f have? Can you
change f in such a way that to minimize it one needs to evaluate all the local minima?
Discussions 169

12.5 Minibatch Stochastic Gradient Descent

So far we encountered two extremes in the approach to gradient-based learning: Section 12.3
uses the full dataset to compute gradients and to update parameters, one pass at a time.
