Lecture 2 (Wednesday, March 20)
▶ More properties of differentiable functions
Theorem 2.1 (Mean value theorem). Let f : [a, b] → R be continuous on [a, b] and differentiable on (a, b). Then there exists c ∈ (a, b) such that
\[ f'(c) = \frac{f(b) - f(a)}{b - a}. \]
Figure 1: Mean value theorem: the slope of the tangent at c, f′(c), equals the slope of the secant, [f(b) − f(a)]/(b − a).
Recall that a function is said to be monotone increasing (or nondecreasing) if
x1 < x2 ⇒ f (x1 ) ≤ f (x2 ),
and monotone decreasing (or nonincreasing) if
x1 < x2 ⇒ f (x1 ) ≥ f (x2 ).
We say strictly increasing or strictly decreasing if the inequalities on the right hand sides are
strict.
Corollary 2.2. If f is differentiable on (a, b), then
1. f ′ (x) ≥ 0 for all x ∈ (a, b) ⇒ f is monotone increasing;
2. f ′ (x) = 0 for all x ∈ (a, b) ⇒ f is constant;
3. f ′ (x) ≤ 0 for all x ∈ (a, b) ⇒ f is monotone decreasing.
Proof. Suppose x1, x2 ∈ (a, b) with x1 < x2. Then f is continuous on [x1, x2] and differentiable on (x1, x2), so by the mean value theorem, there exists c with x1 < c < x2 such that
\[ f'(c) = \frac{f(x_2) - f(x_1)}{x_2 - x_1}. \]
Since x2 > x1 and f ′ (c) ≥ 0, we have f (x2 ) − f (x1 ) ≥ 0, i.e., f (x2 ) ≥ f (x1 ). The other
statements are proved similarly.
Exercise 2.3. Deduce similar statements for f ′ (x) > 0 and f ′ (x) < 0.
The next two results help us ascertain whether a local optimizer is a minimizer or maximizer.
Theorem 2.4 (First order sufficient condition for local optimality). If f is continuous at x0
and differentiable in a neighborhood of x0 , then
• if f′(x) < 0 in a left neighborhood of x0 and f′(x) > 0 in a right neighborhood of x0, then x0 is a local minimizer;
• if f′(x) > 0 in a left neighborhood of x0 and f′(x) < 0 in a right neighborhood of x0, then x0 is a local maximizer.
For an illustration, see Figure ??: the point x2 is a local minimizer and the points x1 , x3
are local maximizers. But note that you do not need differentiability at x0 to apply the first
order sufficient condition. For example, we can deduce that f (x) = |x| has a minimizer at
x = 0 by applying Theorem 2.4.
Theorem 2.5 (Second order condition). Let f ∈ C 2 (X) and x0 be a local optimizer, so
f ′ (x0 ) = 0. Then
• if f′′(x0) > 0, then x0 is a local minimizer;
• if f′′(x0) < 0, then x0 is a local maximizer.
A common rookie mistake is to overestimate the usefulness of Theorem 2.5 and underestimate
that of Theorem 2.4, as you will see in Homework 1.
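A quick illustration (my own example, not from the lecture): f(x) = x⁴ has a strict local minimizer at x = 0, yet f′′(0) = 0, so Theorem 2.5 says nothing; Theorem 2.4 still applies, since f′(x) = 4x³ is negative to the left of 0 and positive to the right. A numeric sanity check in Python:

```python
# Illustrative check (assumed example): for f(x) = x**4 the second order
# test is inconclusive at 0, but the first order test identifies a minimizer.
def fp(x):  return 4 * x**3    # f'
def fpp(x): return 12 * x**2   # f''

print(fpp(0.0))                # 0.0 -> Theorem 2.5 gives no information
print(fp(-1e-3), fp(1e-3))     # negative, then positive -> Theorem 2.4: minimizer
```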
Exercise 2.6. Prove Theorems 2.4 and 2.5.
For most of our purposes, we need a stronger refinement of the mean value theorem: Taylor's theorem for n = 1 or n = 2.
Theorem 2.7 (Taylor). If f has continuous derivatives of order up to and including n + 1
in some neighborhood of x0 , then
\[ f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k + R_n(x), \]
where the remainder term is
\begin{align*}
R_n(x) &= \frac{1}{n!} \int_{x_0}^{x} f^{(n+1)}(t)\,(x - t)^n \, dt &&\text{(integral form)} \\
&= \frac{1}{n!} f^{(n+1)}(\xi)\, \frac{(x - x_0)^{n+1}}{n + 1} &&\text{(mean value form)},
\end{align*}
for some ξ between x0 and x.
Proof. Apply the Fundamental Theorem of Calculus and integrate by parts n times.
The integral form is usually much more useful in optimization, not least because it does not
involve an unknown ξ that we don’t have much control over.
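As a concrete illustration (my own sketch, with f = exp, x0 = 0, and n = 2 as assumed choices), one can check numerically that the integral form of the remainder matches the actual error of the Taylor polynomial:

```python
# Compare the actual Taylor error with the integral-form remainder for
# f = exp at x0 = 0, n = 2 (illustrative choices; exp is its own derivative).
import math

f = math.exp
x0, n, x = 0.0, 2, 0.5

# Degree-n Taylor polynomial of f at x0, evaluated at x.
taylor = sum(f(x0) / math.factorial(k) * (x - x0) ** k for k in range(n + 1))

# R_n(x) = (1/n!) * integral of f^{(n+1)}(t) (x - t)^n dt, by a midpoint rule.
N = 10_000
h = (x - x0) / N
R = h / math.factorial(n) * sum(
    f(x0 + (i + 0.5) * h) * (x - (x0 + (i + 0.5) * h)) ** n for i in range(N)
)

print(f(x) - taylor)  # actual error of the Taylor polynomial, ~0.023721
print(R)              # integral-form remainder; agrees to quadrature accuracy
```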
But really what is most important for us is a Taylor’s theorem for “n = 1.5” — a function
that is more than C 1 but less than C 2 , or more precisely, a function f ∈ C 1 (X) such that
f ′ ∈ Lipγ (X). We will discuss this in the next lecture.
▶ Lipschitz functions
Definition 2.8 (Lipschitz). A function f : X → R is said to be Lipschitz if there is some
γ < ∞ such that
|f (x) − f (y)| ≤ γ|x − y|
for all x, y ∈ X. We write f ∈ Lipγ (X).
Lemma 2.9. Let X ⊂ R be an interval and f : X → R a differentiable function. Then f′ is bounded if and only if f is Lipschitz on X.
Proof. Suppose |f ′ (x)| ≤ γ for all x, and let x1 , x2 ∈ X be distinct. By the Mean Value
Theorem, there exists ξ between x1 and x2 such that
\[ \frac{f(x_2) - f(x_1)}{x_2 - x_1} = f'(\xi). \]
Taking absolute values and rearranging,
\[ \frac{|f(x_2) - f(x_1)|}{|x_2 - x_1|} = |f'(\xi)| \le \gamma \quad\Rightarrow\quad |f(x_2) - f(x_1)| \le \gamma |x_2 - x_1|. \]
Hence f ∈ Lipγ (X). Conversely, if f ∈ Lipγ (X), then
|f (x) − f (x0 )| ≤ γ|x − x0 |
for every x and x0 in X. Rearranging,
\[ \left| \frac{f(x) - f(x_0)}{x - x_0} \right| \le \gamma. \]
The inequality is preserved in the limit as x → x0, so
\[ |f'(x_0)| = \lim_{x \to x_0} \left| \frac{f(x) - f(x_0)}{x - x_0} \right| \le \gamma. \]
Since x0 ∈ X was arbitrary, f′ is bounded.
In optimization, the Lipschitz condition usually arises in the derivative, not the function.
That is, f ′ ∈ Lipγ (X). This is a stronger condition than f ∈ C 1 (X) but weaker than
f ∈ C 2 (X), and it turns out to be the right level of regularity for optimization. If X is compact, then
f ∈ C 2 (X) ⇒ f ′ ∈ Lipγ (X) ⇒ f ∈ C 1 (X),
so the condition f ′ ∈ Lipγ (X) is a bit like saying that “f ∈ C 1.5 (X).”
Lemma 2.10. Let f : X → R be C 1 and f ′ ∈ Lipγ (X). Then for any x, y ∈ X,
\[ |f(y) - f(x) - f'(x)(y - x)| \le \frac{\gamma}{2} (y - x)^2. \]
Proof. By the Fundamental Theorem of Calculus,
\[ f(y) - f(x) = \int_x^y f'(z) \, dz. \]
Since f ′ (x) is constant in z,
\[ f(y) - f(x) - f'(x)(y - x) = \int_x^y \bigl( f'(z) - f'(x) \bigr) \, dz. \]
Applying the change of variables z = x + t(y − x), so that dz = (y − x) dt,
\[ \int_x^y \bigl( f'(z) - f'(x) \bigr) \, dz = \int_0^1 \bigl( f'(x + t(y - x)) - f'(x) \bigr) (y - x) \, dt. \]
Taking absolute values and applying the Lipschitz condition on f ′ ,
\begin{align*}
|f(y) - f(x) - f'(x)(y - x)| &= \left| \int_0^1 \bigl( f'(x + t(y - x)) - f'(x) \bigr)(y - x) \, dt \right| \\
&\le \int_0^1 \bigl| f'(x + t(y - x)) - f'(x) \bigr| \, |y - x| \, dt \\
&\le \int_0^1 \gamma \, |t(y - x)| \, |y - x| \, dt \\
&= \gamma |y - x|^2 \int_0^1 t \, dt \\
&= \frac{\gamma}{2} (y - x)^2.
\end{align*}
Note that this lemma is stronger than Taylor’s Theorem since we do not require f to be twice
differentiable. Why is this important? The lemma gives us a linear approximation of f :
f (y) = f (x) + f ′ (x)(y − x) + error,
where
\[ |\text{error}| \le \frac{\gamma}{2} |y - x|^2. \]
Optimization algorithms are all based on either linear or quadratic approximations. Quadratic:
\[ f(y) = f(x) + f'(x)(y - x) + \frac{f''(x)}{2} (y - x)^2 + \text{error}. \]
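To see the linear-approximation error bound in action (an assumed example, not from the lecture): f = sin has f′ = cos ∈ Lip₁(R), so by Lemma 2.10 the error should never exceed (y − x)²/2. A quick check:

```python
# Check Lemma 2.10 for f = sin: f' = cos is Lipschitz with gamma = 1,
# so |sin(y) - sin(x) - cos(x)*(y - x)| <= (y - x)**2 / 2 (illustrative).
import math, random

gamma = 1.0
random.seed(0)
for _ in range(5):
    x, y = random.uniform(-3, 3), random.uniform(-3, 3)
    err = abs(math.sin(y) - math.sin(x) - math.cos(x) * (y - x))
    bound = gamma / 2 * (y - x) ** 2
    print(err <= bound, round(err, 6), round(bound, 6))  # bound always holds
```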
▶ Newton’s method
We now turn to Newton’s method for univariate optimization. In the univariate case, it
is also called the Newton–Raphson method, where the goal is to find roots of a nonlinear
equation g(x) = 0. In the context of optimization, the function is usually some derivative,
i.e., g = f′. In the multivariate case, Newton's method is used to find the roots of a system of nonlinear equations:
\begin{align*}
g_1(x_1, \dots, x_n) &= 0 \\
g_2(x_1, \dots, x_n) &= 0 \\
&\;\,\vdots \\
g_n(x_1, \dots, x_n) &= 0.
\end{align*}
In the context of optimization, the functions are partial derivatives, i.e., gi = ∂f /∂xi .
Figure 2: Newton's method applied to find a root of g: tangent lines with slopes g′(x1), g′(x2), g′(x3) produce successive iterates x2, x3, x4 approaching the exact root.
The idea is to use tangent lines to approximate the zeroes of f : if ℓ is the tangent line to f
at x, then ℓ(y) = f (x) + f ′ (x)(y − x), and ℓ(y) = 0 when
\[ y = x - \frac{f(x)}{f'(x)}. \]
Iterating this procedure using y as the next best guess, we obtain
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \qquad n = 0, 1, 2, \dots. \]
Ideally, xn → x∗, where f(x∗) = 0. Newton's method can fail, for example, if f′ is unbounded or vanishes at an iterate. Thus we need to identify a class of functions, or impose conditions, under which we can find zeros.
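A minimal sketch of the iteration in Python (the function names and stopping rule are my own choices, not prescribed by the lecture):

```python
# Newton iteration for g(x) = 0: x_{n+1} = x_n - g(x_n)/g'(x_n).
# g, gp (= g'), tolerance, and iteration cap are illustrative choices.
def newton(g, gp, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            break
        d = gp(x)
        if d == 0:                 # horizontal tangent: the method breaks down
            raise ZeroDivisionError(f"g'({x}) = 0")
        x -= gx / d
    return x

# Example: the root of g(x) = x**2 - 17 starting from x0 = 4 is sqrt(17).
print(newton(lambda x: x**2 - 17, lambda x: 2 * x, 4.0))
```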
Example 2.11 (Square root). Let r ∈ R+; we compute √r, i.e., x∗ such that f(x) = x² − r = 0. Newton's method is:
\[ x_{n+1} = x_n - \frac{x_n^2 - r}{2 x_n} = \frac{1}{2} \left( x_n + \frac{r}{x_n} \right). \]
The number of correct digits doubles every iteration, a phenomenon known as quadratic
convergence. For example, with r = 17 and x0 = 4,
n xn
0 4.000000000000000
1 4.125000000000000
2 4.123106060606061
3 4.123105625617683
4 4.123105625617661
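The table can be reproduced in a few lines (a sketch using the same setup, r = 17 and x0 = 4):

```python
# Reproduce the sqrt(17) table: x_{n+1} = (x_n + r/x_n) / 2, starting at x_0 = 4.
r, x = 17.0, 4.0
print(0, format(x, ".15f"))
for n in range(1, 5):
    x = 0.5 * (x + r / x)
    print(n, format(x, ".15f"))
```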
▶ Non-rigorous analysis of Newton’s method
Here we provide a rough idea of why Newton's method converges quadratically; in the next lecture and in the homework we will look at rigorous versions of this convergence proof.
Assume f ∈ C 2 , and x0 ∈ (x∗ − η, x∗ + η) for some η > 0 to be determined later.
Let εn = xn − x∗ be the error at the nth step. Then
\[ \varepsilon_{n+1} = x_{n+1} - x_* = x_n - \frac{f(x_n)}{f'(x_n)} - x_* = \varepsilon_n - \frac{f(x_n)}{f'(x_n)} = \frac{\varepsilon_n f'(x_n) - f(x_n)}{f'(x_n)}. \]
By Taylor’s Theorem,
\[ 0 = f(x_*) = f(x_n - \varepsilon_n) = f(x_n) - \varepsilon_n f'(x_n) + \frac{1}{2} \varepsilon_n^2 f''(\xi_n) \]
for some ξn between xn and x∗ . Rearranging,
\[ \varepsilon_n f'(x_n) - f(x_n) = \frac{1}{2} f''(\xi_n) \, \varepsilon_n^2. \]
Thus
\[ |\varepsilon_{n+1}| = \frac{1}{2} \left| \frac{f''(\xi_n)}{f'(x_n)} \right| \varepsilon_n^2 \le c(\eta) \, \varepsilon_n^2, \]
where
\[ c(\eta) = \frac{1}{2} \, \frac{\max |f''(x)|}{\min |f'(x)|}, \]
with the maximum and minimum taken over all x such that |x − x∗| ≤ η. In fact,
\[ \lim_{\eta \to 0} c(\eta) = \frac{1}{2} \left| \frac{f''(x_*)}{f'(x_*)} \right| \]
since f ∈ C². So we may pick η small enough that
ρ = ηc(η) < 1.
Since x0 ∈ (x∗ − η, x∗ + η), we have |ε0| = |x0 − x∗| ≤ η; and since ξ0 lies between x0 and x∗, also |ξ0 − x∗| ≤ η. By definition of c(η),
\[ \frac{1}{2} \left| \frac{f''(\xi_0)}{f'(x_0)} \right| \le c(\eta). \]
It follows that
\begin{align*}
|\varepsilon_1| = |x_1 - x_*| &\le c(\eta) \, \varepsilon_0^2 \\
&= |\varepsilon_0| \cdot |\varepsilon_0| \cdot c(\eta) \\
&\le |\varepsilon_0| \cdot \eta \cdot c(\eta) \\
&= \rho |\varepsilon_0|.
\end{align*}
Since ρ < 1, we get |ε1| ≤ |ε0| ≤ η, so x1 also lies in (x∗ − η, x∗ + η). Repeating the argument gives
\[ |\varepsilon_2| \le \rho |\varepsilon_1| \le \rho^2 |\varepsilon_0|. \]
By induction, we obtain
\[ |\varepsilon_n| \le \rho^n |\varepsilon_0|. \]
Since 0 < ρ < 1, the error term converges to 0. That is, Newton's method converges, and
\[ |\varepsilon_{n+1}| \le c(\eta) \, \varepsilon_n^2, \]
so the sequence converges quadratically.
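The limiting constant can be checked numerically (an illustrative computation): for f(x) = x² − 17, the ratio |εn+1|/εn² should approach (1/2)|f′′(x∗)/f′(x∗)| = 1/(2√17) ≈ 0.1213.

```python
# Check |eps_{n+1}| / eps_n**2 -> (1/2)|f''(x*)/f'(x*)| = 1/(2*sqrt(17))
# for f(x) = x**2 - 17 (illustrative example, same setup as Example 2.11).
import math

xstar = math.sqrt(17.0)
x, eps = 4.0, 4.0 - math.sqrt(17.0)
for n in range(3):
    x_next = 0.5 * (x + 17.0 / x)
    eps_next = x_next - xstar
    print(abs(eps_next) / eps**2)  # 0.125, 0.12121..., ~0.121268 = 1/(2*sqrt(17))
    x, eps = x_next, eps_next
```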