Lecture 3 (Monday, March 25)
▶ Rates of convergence
Definition 3.1. Let $(x_n)_{n=1}^\infty$ be a sequence converging to $x^*$. We say $(x_n)_{n=1}^\infty$:
1. converges if
$$\lim_{n\to\infty} |x_n - x^*| = 0;$$
2. converges linearly if there is some $N \in \mathbb{N}$ and $0 < c < 1$ such that
$$|x_{n+1} - x^*| \le c\,|x_n - x^*|$$
whenever $n > N$;
3. converges superlinearly if there is some $N \in \mathbb{N}$ and $(c_n)_{n=1}^\infty$ converging to $0$ such that
$$|x_{n+1} - x^*| \le c_n\,|x_n - x^*|$$
whenever $n > N$;
4. converges sublinearly if there is some $N \in \mathbb{N}$ and $(c_n)_{n=1}^\infty$ converging to $1$ such that
$$|x_{n+1} - x^*| \le c_n\,|x_n - x^*|$$
whenever $n > N$;
5. converges quadratically if there is some $N \in \mathbb{N}$ and $c > 0$ such that
$$|x_{n+1} - x^*| \le c\,|x_n - x^*|^2$$
whenever $n > N$.
N.B. the terms “q-linearly convergent” and “q-superlinearly convergent” are occasionally used for what we called linear and superlinear, respectively. Higher-order convergence exists (e.g., cubic convergence, with “3” in place of “2” in the quadratic convergence definition), but in practice, quadratic convergence is good enough.
Here are canonical examples of sequences converging at different rates.
Example 3.2 (Convergence rates). Let $\gamma \in (0, 1)$. Then
• $(1/n)_{n=1}^\infty$ exhibits sublinear convergence;
• $(\gamma^n)_{n=1}^\infty$ exhibits linear convergence;
• $(\gamma^{n^2})_{n=1}^\infty$ exhibits superlinear convergence;
• $(\gamma^{2^n})_{n=1}^\infty$ exhibits quadratic convergence.
You will be asked to check these against their definitions in Homework 1.
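To see these rates numerically, here is a minimal Python sketch (the choice $\gamma = 0.5$ is an arbitrary illustration, not from the lecture) that prints successive error ratios $|x_{n+1}|/|x_n|$ for the four sequences of Example 3.2, all of which converge to $x^* = 0$:

```python
# Sketch: empirical error ratios |x_{n+1}| / |x_n| for the sequences of
# Example 3.2, all converging to x* = 0. Gamma = 0.5 is an arbitrary choice.
gamma = 0.5

sequences = {
    "1/n          (sublinear)":   lambda n: 1.0 / n,
    "gamma^n      (linear)":      lambda n: gamma**n,
    "gamma^(n^2)  (superlinear)": lambda n: gamma**(n**2),
    "gamma^(2^n)  (quadratic)":   lambda n: gamma**(2**n),
}

for name, x in sequences.items():
    ratios = [x(n + 1) / x(n) for n in range(1, 6)]
    print(name, " ".join(f"{r:.3e}" for r in ratios))
# The ratios tend to 1 (sublinear), stay at gamma (linear), and tend to 0
# (superlinear); for the quadratic sequence, |x_{n+1}| = |x_n|^2 exactly.
```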
▶ Rigorous analysis of Newton’s method
We are now ready to give a rigorous proof of Newton’s method convergence. Such proofs
depend very much on the assumptions. Here we give a set of practically useful assumptions
that extend readily to a multivariate scenario; in the homework we will see another version
with different assumptions (neater but not as realistic).
Theorem 3.3 (Newton's method). Let $f : (a, b) \to \mathbb{R}$ and $f' \in \mathrm{Lip}_\gamma(a, b)$. Suppose $|f'(x)| \ge \rho > 0$ for all $x \in (a, b)$. If $f(x) = 0$ has solution $x^* \in (a, b)$, then there exists some $\eta > 0$ such that if $|x_0 - x^*| < \eta$, then the sequence $(x_n)_{n=1}^\infty$ defined by
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
converges to $x^*$ quadratically. Furthermore,
$$|x_{n+1} - x^*| \le \frac{\gamma}{2\rho}\,|x_n - x^*|^2. \tag{3.4}$$
Proof. Let $0 < \tau < 1$ be fixed, and choose $\delta > 0$ such that $(x^* - \delta, x^* + \delta) \subset (a, b)$. Put $\eta = \min\{\delta, 2\rho\tau/\gamma\}$. We claim that
$$|x_{n+1} - x^*| \le \tau\,|x_n - x^*| < \eta$$
for all $n \in \mathbb{N}$. By assumption, we have $|x_0 - x^*| < \eta$. Next,
$$\begin{aligned}
|x_1 - x^*| &= \left| x_0 - x^* - \frac{f(x_0)}{f'(x_0)} \right| \\
&= \frac{1}{|f'(x_0)|} \cdot \bigl| x_0 f'(x_0) - x^* f'(x_0) - f(x_0) \bigr| \\
&= \frac{1}{|f'(x_0)|} \cdot \bigl| f(x^*) - f(x_0) - f'(x_0)(x^* - x_0) \bigr| && \text{(since $f(x^*) = 0$)} \\
&\le \frac{1}{|f'(x_0)|} \cdot \frac{\gamma}{2} |x^* - x_0|^2 && \text{(by the “$C^{1.5}$ Taylor's theorem” from earlier)} \\
&\le \frac{\gamma}{2\rho} |x^* - x_0|^2 && \text{(this gives (3.4) for $n = 0$)} \\
&\le \frac{\gamma}{2\rho} \cdot \eta\,|x^* - x_0| \\
&\le \frac{\gamma}{2\rho} \cdot \frac{2\rho\tau}{\gamma} \cdot |x^* - x_0| \\
&= \tau\,|x^* - x_0| \\
&\le \tau\eta < \eta.
\end{aligned}$$
Since $|x_1 - x^*| < \eta$, the same argument can be repeated for $|x_2 - x^*|$, etc., to finish. Note that the $\tau$ factor is needed to guarantee strict inequality $|x_n - x^*| < \eta$ at every step.
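As a numerical sanity check on (3.4), here is a minimal Newton iteration sketch in Python; the example $f(x) = x^2 - 2$ with $x^* = \sqrt{2}$ is an illustrative choice, not from the lecture. The printed ratios $|x_{n+1} - x^*| / |x_n - x^*|^2$ stay bounded, as the theorem predicts:

```python
import math

def newton(f, fprime, x0, steps=4):
    """Minimal Newton iteration sketch: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
        yield x

# Illustrative example (not from the lecture): f(x) = x^2 - 2, x* = sqrt(2).
xstar = math.sqrt(2.0)
errors = [abs(x - xstar) for x in newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)]
for e_prev, e_next in zip(errors, errors[1:]):
    if e_prev > 0:
        # The ratio |x_{n+1} - x*| / |x_n - x*|^2 should stay bounded, cf. (3.4).
        print(f"{e_next:.3e} / {e_prev:.3e}^2 = {e_next / e_prev**2:.3f}")
```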
Figure 1: Newton's method applied to $\arctan(x) = 0$.
The condition $|x_0 - x^*| < \eta$ means that Newton's method is guaranteed to converge quadratically only if we start sufficiently close to the solution $x^*$. In fact, the method may not converge at all without some kind of “sufficiently close” assumption.
For example, if we take $f = \arctan$, then the Newton iteration is
$$x_{n+1} = x_n - (x_n^2 + 1)\arctan x_n.$$
If we choose $x_0 = 3$, then $x_1 = -9.4904$, $x_2 = 124.0007$, $\ldots$, and $|x_n| \to \infty$.
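A minimal sketch of this divergence (the first iterates below reproduce the numbers in the text):

```python
import math

# Newton iteration for f(x) = arctan(x): x_{n+1} = x_n - (x_n^2 + 1) * arctan(x_n).
x = 3.0  # starting point from the text; |x_n| -> infinity from here
for n in range(1, 5):
    x = x - (x * x + 1.0) * math.atan(x)
    print(f"x_{n} = {x:.4f}")
# Prints x_1 = -9.4904, x_2 = 124.0007, ...; the iterates blow up because far
# from 0 the tangent lines of arctan are nearly flat, so each Newton step
# overshoots by a larger amount than the last.
```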
▶ Weaker variants of Newton’s method
Despite the simplicity of Newton’s method, it is not always possible to perform the procedure:
for example, the derivative may be unbounded, close to 0, or not feasible to compute. Here,
we discuss some weaker variants of Newton’s method and their advantages/disadvantages.
1. Secant method: avoids derivatives by using only function evaluations. Using the approximation
$$f'(x_n) \approx \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}},$$
modify Newton's method to
$$x_{n+1} = x_n - f(x_n)\,\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}.$$
It can be shown (and the proof is very similar to Theorem 3.3) that the secant method exhibits superlinear convergence: if $f \in C^2$ and $x_0, x_1$ are sufficiently close to $x^*$, then $|\varepsilon_{n+1}| \approx c\,|\varepsilon_n|^\alpha$, where $\alpha = (1 + \sqrt{5})/2 \approx 1.618$ is the golden ratio, i.e., the largest real root of $\alpha^2 - \alpha - 1 = 0$. The “sufficiently close” assumption is as crucial here as it is in Newton's method. Figure 2 shows a divergent instance of the secant method.
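Here is a minimal secant iteration sketch; $f(x) = x^2 - 2$ is again an illustrative choice, not from the lecture. Since $|\varepsilon_{n+1}| \approx c\,|\varepsilon_n|^\alpha$ implies $\log|\varepsilon_{n+1}| / \log|\varepsilon_n| \to \alpha$, the printed estimates should drift toward the golden ratio:

```python
import math

def secant(f, x0, x1, steps=6):
    """Secant iteration sketch: f'(x_n) is replaced by a difference quotient."""
    for _ in range(steps):
        x0, x1 = x1, x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        yield x1

# Illustrative example (not from the lecture): f(x) = x^2 - 2, x* = sqrt(2).
xstar = math.sqrt(2.0)
errs = [abs(x - xstar) for x in secant(lambda x: x * x - 2, 1.0, 2.0)]
for e0, e1 in zip(errs, errs[1:]):
    if 0 < e1 < e0 < 1:
        # If |e_{n+1}| ~ c |e_n|^alpha, then log(e_{n+1}) / log(e_n) -> alpha.
        print(f"estimated order: {math.log(e1) / math.log(e0):.2f}")
```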
Figure 2: The secant method can diverge. The secant line on the interval $[x_2, x_1]$ takes the next iterate $x_3$ outside the original interval $[x_0, x_1]$. But if we had used the secant line on the interval $[x_0, x_2]$ to generate $x_3$, we would have been fine; this observation leads to regula falsi.

There are other variants of the secant method, for example Muller's method, which uses the values of $f$ at three points $x_n$, $x_{n-1}$, $x_{n-2}$ to improve the convergence order to $\alpha \approx 1.84$, where $\alpha$ is the largest real root of $\alpha^3 - \alpha^2 - \alpha - 1 = 0$. The multivariate version of the secant method is known as Broyden's method. We shall see this when we discuss multivariate quasi-Newton methods.
2. Steffensen's method: avoids derivatives and preserves quadratic convergence using a “divided difference”:
$$x_{n+1} = x_n - \frac{f(x_n)^2}{f(x_n + f(x_n)) - f(x_n)}.$$
The multivariate analogue of Steffensen's divided difference is very expensive and should never be used. Even in the univariate case, the multiple function evaluations per step can be expensive.
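A minimal sketch of the iteration; the example $f(x) = \cos x - x$ with $x_0 = 0.5$ is an illustrative choice, not from the lecture:

```python
import math

def steffensen(f, x0, steps=4):
    """Steffensen iteration sketch:
    x_{n+1} = x_n - f(x_n)^2 / (f(x_n + f(x_n)) - f(x_n)).
    Derivative-free, at the cost of two evaluations of f per step."""
    x = x0
    for _ in range(steps):
        fx = f(x)
        x = x - fx * fx / (f(x + fx) - fx)
        yield x

# Illustrative example (not from the lecture): f(x) = cos(x) - x, x0 = 0.5.
for x in steffensen(lambda x: math.cos(x) - x, 0.5):
    print(f"{x:.15f}")  # converges to 0.739085...; digits roughly double per step
```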
In summary, the derivative-based methods discussed thus far take the form
$$x_{n+1} = x_n - \frac{f(x_n)}{g(x_n)},$$
where $g$ captures the local rate of change near $x^*$:
$$g(x_n) = f'(x_n) \quad \text{(Newton)}$$
$$g(x_n) = \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}} \quad \text{(secant)}$$
$$g(x_n) = \frac{f(x_n + f(x_n)) - f(x_n)}{f(x_n)} \quad \text{(Steffensen)}$$
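This template translates directly into code. Here is a minimal sketch (the helper name `fixed_g_iteration` and the test function $f(x) = x^2 - 2$ are illustrative choices, not from the lecture):

```python
def fixed_g_iteration(f, g, x0, steps=4):
    """Generic template x_{n+1} = x_n - f(x_n)/g(x_n), where g approximates
    the local rate of change of f near the root."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / g(x)
    return x

f = lambda x: x * x - 2
newton_g = lambda x: 2 * x                            # g = f'      (Newton)
steffensen_g = lambda x: (f(x + f(x)) - f(x)) / f(x)  # g as above  (Steffensen)
print(fixed_g_iteration(f, newton_g, 1.5))      # ~1.414213562373095
print(fixed_g_iteration(f, steffensen_g, 1.5))  # ~1.414213562373095
# The secant choice of g also fits this template, but it carries the extra
# state x_{n-1}, so it is not a function of x_n alone.
```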
Algorithm 1 Method of bisection
Require: f(a₀)f(b₀) < 0, ε > 0
1: procedure bisection(a₀, b₀, ε)
2:   k ← 0
3:   while |b_k − a_k| > ε do
4:     x_k ← (a_k + b_k)/2
5:     if f(a_k)f(x_k) < 0 then
6:       a_{k+1} ← a_k, b_{k+1} ← x_k
7:     else
8:       a_{k+1} ← x_k, b_{k+1} ← b_k
9:     end if
10:    k ← k + 1
11:  end while
12:  return x* := (a_k + b_k)/2
13: end procedure
3. Method of bisection: needs only the sign of $f(x)$, not its magnitude or value. It relies on:
Theorem 3.5 (Intermediate Value Theorem). If $f \in C([a, b])$ and $f(a)f(b) < 0$, then there is some $x^* \in (a, b)$ such that $f(x^*) = 0$.
The idea is as follows: set $c = (a + b)/2$ (i.e., bisect $[a, b]$) and evaluate the function at $c$ (if $f(c) = 0$, we are done).
• if $f(a)f(c) < 0$, then $x^* \in [a, c]$; set $b = c$
• if $f(c)f(b) < 0$, then $x^* \in [c, b]$; set $a = c$
Repeating this procedure, we obtain a sequence of nested intervals
$$[a, b] = [a_0, b_0] \supset [a_1, b_1] \supset [a_2, b_2] \supset \cdots \tag{3.6}$$
Note that
$$a_0 \le a_1 \le \cdots \le b_0,$$
so the sequence $(a_n)_{n=0}^\infty$ is monotone increasing and bounded above. Similarly, since
$$b_0 \ge b_1 \ge b_2 \ge \cdots \ge a_0,$$
the sequence $(b_n)_{n=0}^\infty$ is monotone decreasing and bounded below. Hence $\lim_{n\to\infty} a_n$ and $\lim_{n\to\infty} b_n$ exist. The length of each interval is $b_n - a_n$, and
$$b_n - a_n = \frac{1}{2}(b_{n-1} - a_{n-1}) = \cdots = \frac{1}{2^n}(b_0 - a_0),$$
so
$$\lim_{n\to\infty}(b_n - a_n) = (b_0 - a_0) \cdot \lim_{n\to\infty} \frac{1}{2^n} = 0.$$
Hence
$$\lim_{n\to\infty} a_n = \lim_{n\to\infty} b_n = x^*.$$
Let $x_n$ be the midpoint at the $n$th step. Then
$$|x_{n+1} - x^*| \le \frac{1}{2^n}\,|b - a|.$$
Figure 3: Pictorial view of the method of bisection.
If we want an answer up to $\varepsilon$-precision, i.e., $|x_{n+1} - x^*| < \varepsilon$, it suffices to take
$$n > \log_2 \frac{b - a}{\varepsilon}.$$
Hence the method of bisection is linearly convergent. The multivariate case is much harder since there is no $n$-dimensional intermediate value theorem (see Nelder–Mead, derivative-free optimization).
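A runnable Python version of Algorithm 1, which also checks the iteration count against the $\log_2\frac{b-a}{\varepsilon}$ bound; the test function $f(x) = x^2 - 2$ on $[1, 2]$ is an illustrative choice, not from the lecture:

```python
import math

def bisection(f, a, b, eps):
    """Method of bisection sketch; assumes f(a) * f(b) < 0."""
    assert f(a) * f(b) < 0
    k = 0
    while b - a > eps:
        x = (a + b) / 2
        if f(a) * f(x) < 0:
            b = x  # root lies in [a, x]
        else:
            a = x  # root lies in [x, b]
        k += 1
    return (a + b) / 2, k

# Illustrative example (not from the lecture): f(x) = x^2 - 2 on [1, 2].
root, k = bisection(lambda x: x * x - 2, 1.0, 2.0, 1e-6)
print(root, k, math.ceil(math.log2((2.0 - 1.0) / 1e-6)))  # k matches the bound
```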
Figure 4: Pictorial view of the method of regula falsi.
4. Regula falsi: The fact that the method of bisection requires only the sign of the function value is of course an advantage, provided we can find the sign much faster than the value. But if that is not the case, then it would be wasteful not to use the value of the function. This can be remedied by using
$$c_k = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)} \tag{3.7}$$
in place of the midpoint $c_k = (a_k + b_k)/2$ in the method of bisection. The resulting method is called regula falsi (meaning “false position”), a method that dates back to the Babylonians and Egyptians. The idea is that instead of taking the midpoint of an interval $[a_k, b_k]$, you take $c_k$ to be the $x$-intercept of the line through $(a_k, f(a_k))$ and $(b_k, f(b_k))$, i.e.,
$$y - f(b_k) = \frac{f(b_k) - f(a_k)}{b_k - a_k}(x - b_k).$$
Plugging in $x = c_k$ and $y = 0$ gives (3.7). Why use this instead of the secant method? Note that the divergent scenario in Figure 2 cannot happen here: regula falsi guarantees that the iterates lie in a sequence of nested intervals (3.6), as in the method of bisection.
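A minimal regula falsi sketch, which keeps the sign-change bracket exactly as bisection does but uses the chord intercept (3.7) as the new point; the test function $f(x) = x^2 - 2$ on $[1, 2]$ is an illustrative choice, not from the lecture:

```python
def regula_falsi(f, a, b, steps=10):
    """Regula falsi sketch: like bisection, but the new point is the
    x-intercept (3.7) of the chord through (a, f(a)) and (b, f(b))."""
    assert f(a) * f(b) < 0
    for _ in range(steps):
        c = (a * f(b) - b * f(a)) / (f(b) - f(a))  # equation (3.7)
        if f(a) * f(c) < 0:
            b = c  # root lies in [a, c]
        else:
            a = c  # root lies in [c, b]
    return (a * f(b) - b * f(a)) / (f(b) - f(a))

# Illustrative example (not from the lecture): f(x) = x^2 - 2 on [1, 2].
print(regula_falsi(lambda x: x * x - 2, 1.0, 2.0))  # ~1.414213562
```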