L. Vandenberghe EE236C (Spring 2016)
9. Accelerated proximal gradient methods
• Nesterov’s method
• analysis with fixed step size
• line search
9-1
Proximal gradient method
Results from lecture 6
• each proximal gradient iteration is a descent step:
      f(x^{(k)}) < f(x^{(k-1)}),        ‖x^{(k)} − x^⋆‖_2^2 ≤ c ‖x^{(k-1)} − x^⋆‖_2^2
with c = 1 − m/L
• suboptimality after k iterations is O(1/k):
      f(x^{(k)}) − f^⋆ ≤ \frac{L}{2k} ‖x^{(0)} − x^⋆‖_2^2
Accelerated proximal gradient methods
• to improve convergence, we add a momentum term
• we relax the descent property
• originated in work by Nesterov in the 1980s
Accelerated proximal gradient methods 9-2
Assumptions
we consider the same problem and make the same assumptions as in lecture 6:
minimize f (x) = g(x) + h(x)
• h is closed and convex (so that prox_{th} is well defined)
• g is differentiable with dom g = Rn
• there exist constants m ≥ 0 and L > 0 such that the functions
      g(x) − \frac{m}{2} x^T x,        \frac{L}{2} x^T x − g(x)
  are convex
• the optimal value f^⋆ is finite and attained at x^⋆ (not necessarily unique)
Accelerated proximal gradient methods 9-3
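A small numerical illustration of the curvature assumption above, for a convex quadratic g(x) = (1/2) x^T P x + q^T x (an assumed example, not from the slides): the constants can be taken as m = λ_min(P) and L = λ_max(P), since g(x) − (m/2) x^T x and (L/2) x^T x − g(x) are then convex.

    # hypothetical example: estimate m and L for a convex quadratic g
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 10))
    P = A.T @ A                      # symmetric positive semidefinite
    eigs = np.linalg.eigvalsh(P)
    m, L = eigs.min(), eigs.max()    # strong convexity and smoothness constants of g
    print(m, L)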
Nesterov’s method
Algorithm: choose x^{(0)} = v^{(0)} and γ_0 > 0; for k ≥ 1, repeat the steps
• define γ_k = θ_k^2 / t_k, where θ_k is the positive root of the quadratic equation
      θ_k^2 / t_k = (1 − θ_k) γ_{k-1} + m θ_k
• update x^{(k)} and v^{(k)} as follows:
      y = x^{(k-1)} + \frac{θ_k γ_{k-1}}{γ_{k-1} + m θ_k} (v^{(k-1)} − x^{(k-1)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
      v^{(k)} = x^{(k-1)} + \frac{1}{θ_k} (x^{(k)} − x^{(k-1)})
the step size t_k is fixed (t_k = 1/L) or obtained from line search
Accelerated proximal gradient methods 9-4
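A minimal NumPy sketch of the algorithm above with fixed step size t_k = 1/L. The names grad_g, prox_h, x0, gamma0 are assumed inputs (not part of the lecture); prox_h(u, t) is expected to return prox_{th}(u), and the default γ_0 = L is an arbitrary positive choice.

    import numpy as np

    def nesterov_apg(grad_g, prox_h, x0, L, m=0.0, gamma0=None, iters=100):
        t = 1.0 / L
        gamma = L if gamma0 is None else gamma0   # any gamma0 > 0 is allowed
        x = v = np.asarray(x0, dtype=float)
        for _ in range(iters):
            # theta: positive root of theta^2/t = (1 - theta)*gamma + m*theta
            b, c = (gamma - m) * t, -gamma * t
            theta = (-b + np.sqrt(b * b - 4 * c)) / 2
            y = x + theta * gamma / (gamma + m * theta) * (v - x)
            x_new = prox_h(y - t * grad_g(y), t)
            v = x + (x_new - x) / theta
            x, gamma = x_new, theta**2 / t
        return x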
Momentum interpretation
• the first iteration (k = 1) is a proximal gradient step at y = x^{(0)}
• the next iterations are proximal gradient steps at extrapolated points y:
      y = x^{(k-1)} + \frac{θ_k γ_{k-1}}{γ_{k-1} + m θ_k} (v^{(k-1)} − x^{(k-1)})
        = x^{(k-1)} + \frac{θ_k γ_{k-1}}{γ_{k-1} + m θ_k} \left( \frac{1}{θ_{k-1}} − 1 \right) (x^{(k-1)} − x^{(k-2)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
  (the second expression for y follows from the update v^{(k-1)} = x^{(k-2)} + (x^{(k-1)} − x^{(k-2)})/θ_{k-1})
[figure: the points x^{(k-2)}, x^{(k-1)}, and y on the extrapolation line]
Accelerated proximal gradient methods 9-5
Algorithm parameters
      θ_k^2 / t_k = (1 − θ_k) γ_{k-1} + m θ_k,        γ_k = θ_k^2 / t_k
• θ_k is the positive root of the quadratic equation
• θ_k < 1 if m t_k < 1
• if t_k is constant, the sequence θ_k is completely determined by the starting value γ_0
Example: L = 1, m = 0.1, t_k = 1/L
[figure: θ_k (left) and γ_k (right) versus k for γ_0 = m, 0.5m, 2m; θ_k approaches \sqrt{m/L} and γ_k approaches m]
Accelerated proximal gradient methods 9-6
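A short sketch that reproduces the parameter sequences in the example above (L = 1, m = 0.1, t_k = 1/L, several values of γ_0); it simply iterates the recursion for θ_k and γ_k and prints the limits they approach.

    import numpy as np

    L, m, t = 1.0, 0.1, 1.0
    for gamma0 in (m, 0.5 * m, 2 * m):
        gamma = gamma0
        for _ in range(25):
            # theta_k: positive root of theta^2/t = (1 - theta)*gamma + m*theta
            b, c = (gamma - m) * t, -gamma * t
            theta = (-b + np.sqrt(b * b - 4 * c)) / 2
            gamma = theta**2 / t
        # theta_k approaches sqrt(m/L) and gamma_k approaches m
        print(gamma0, theta, gamma, np.sqrt(m / L))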
FISTA
if we take m = 0 on page 9-4, the expression for y simplifies:
      y = x^{(k-1)} + θ_k (v^{(k-1)} − x^{(k-1)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
      v^{(k)} = x^{(k-1)} + \frac{1}{θ_k} (x^{(k)} − x^{(k-1)})
eliminating the variables v^{(k)} gives the equivalent iteration (for k ≥ 2)
      y = x^{(k-1)} + θ_k \left( \frac{1}{θ_{k-1}} − 1 \right) (x^{(k-1)} − x^{(k-2)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
this is known as FISTA (‘Fast Iterative Shrinkage-Thresholding Algorithm’)
Accelerated proximal gradient methods 9-7
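A minimal sketch of the FISTA iteration above, in its eliminated form with fixed step t = 1/L. grad_g, prox_h, x0, gamma0 are assumed inputs (prox_h(u, t) returns prox_{th}(u)); θ_k is computed from the recursion θ_k^2/t = (1 − θ_k) γ_{k-1}, γ_k = θ_k^2/t.

    import numpy as np

    def fista(grad_g, prox_h, x0, L, gamma0=None, iters=100):
        t = 1.0 / L
        gamma = L if gamma0 is None else gamma0
        x_prev = x = np.asarray(x0, dtype=float)
        theta_prev = 1.0
        for k in range(1, iters + 1):
            # theta_k: positive root of theta^2/t = (1 - theta)*gamma
            b = gamma * t
            theta = (-b + np.sqrt(b * b + 4 * b)) / 2
            # first iteration is a plain proximal gradient step at x^(0)
            y = x if k == 1 else x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
            x_prev, x = x, prox_h(y - t * grad_g(y), t)
            theta_prev, gamma = theta, theta**2 / t
        return x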
Example
      minimize   \log \sum_{i=1}^m \exp(a_i^T x + b_i)
• two randomly generated problems with m = 2000, n = 1000
• same fixed step size used for gradient method and FISTA
• figures show (f(x^{(k)}) − f^⋆)/f^⋆ versus k
[figure: two log-scale plots of the relative suboptimality (from 10^0 down to 10^{-6}) versus k (0 to 200) for the gradient method and FISTA, one per problem instance]
Accelerated proximal gradient methods 9-8
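A sketch of this example, reusing the fista function given after page 9-7: here h = 0, so prox_{th} is the identity, and ∇g(x) = A^T p(x) with p(x) the softmax of Ax + b. The data are randomly generated (smaller than on the slide to keep the run cheap), and L_est = ‖A‖_2^2 is an assumed, conservative Lipschitz bound for ∇g.

    import numpy as np

    rng = np.random.default_rng(0)
    m_rows, n = 200, 100
    A = rng.standard_normal((m_rows, n))
    b = rng.standard_normal(m_rows)

    def grad_g(x):
        z = A @ x + b
        w = np.exp(z - z.max())          # numerically stable softmax weights
        return A.T @ (w / w.sum())

    prox_h = lambda u, t: u              # h = 0, so the prox is the identity
    L_est = np.linalg.norm(A, 2) ** 2    # conservative bound: Hessian of g is ≼ A^T A
    x = fista(grad_g, prox_h, np.zeros(n), L_est, iters=200)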
Nesterov’s simplest method
• if m > 0 and we choose γ_0 = m, then
      γ_k = m,        θ_k = \sqrt{m t_k}   for all k ≥ 1
• the algorithm on p. 9-4 and p. 9-5 simplifies:
      y = x^{(k-1)} + \frac{\sqrt{t_k}}{\sqrt{t_{k-1}}} \, \frac{1 − \sqrt{m t_{k-1}}}{1 + \sqrt{m t_k}} (x^{(k-1)} − x^{(k-2)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
• with constant step size t_k = 1/L, the expression for y reduces to
      y = x^{(k-1)} + \frac{1 − \sqrt{m/L}}{1 + \sqrt{m/L}} (x^{(k-1)} − x^{(k-2)})
Accelerated proximal gradient methods 9-9
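A minimal sketch of this constant-momentum variant: fixed step t = 1/L and fixed extrapolation coefficient (1 − \sqrt{m/L})/(1 + \sqrt{m/L}). grad_g, prox_h, x0 are assumed inputs, with prox_h(u, t) returning prox_{th}(u).

    import numpy as np

    def nesterov_const_momentum(grad_g, prox_h, x0, L, m, iters=100):
        t = 1.0 / L
        beta = (1 - np.sqrt(m / L)) / (1 + np.sqrt(m / L))
        x_prev = x = np.asarray(x0, dtype=float)
        for k in range(1, iters + 1):
            # first iteration is a plain proximal gradient step at x^(0)
            y = x if k == 1 else x + beta * (x - x_prev)
            x_prev, x = x, prox_h(y - t * grad_g(y), t)
        return x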
Outline
• Nesterov’s method
• analysis with fixed step size
• line search
Overview
• we show that if t_k = 1/L, the following inequality holds at each iteration:
      f(x^{(k)}) − f^⋆ + \frac{γ_k}{2} ‖v^{(k)} − x^⋆‖_2^2
          ≤ (1 − θ_k) \left( f(x^{(k-1)}) − f^⋆ + \frac{γ_{k-1}}{2} ‖v^{(k-1)} − x^⋆‖_2^2 \right)
• therefore the rate of convergence is determined by λ_k = \prod_{i=1}^k (1 − θ_i):
      f(x^{(k)}) − f^⋆ ≤ f(x^{(k)}) − f^⋆ + \frac{γ_k}{2} ‖v^{(k)} − x^⋆‖_2^2
          ≤ λ_k \left( f(x^{(0)}) − f^⋆ + \frac{γ_0}{2} ‖x^{(0)} − x^⋆‖_2^2 \right)
(here we assume that x^{(0)} ∈ dom h = dom f)
Accelerated proximal gradient methods 9-10
Notation for one iteration
quantities in iteration i of the algorithm on page 9-4
• define t = t_i, θ = θ_i, γ = γ_{i-1}, and γ^+ = γ_i:
      γ^+ = (1 − θ)γ + mθ,        γ^+ = θ^2 / t
• define x = x^{(i-1)}, x^+ = x^{(i)}, v = v^{(i-1)}, and v^+ = v^{(i)}:
      y = \frac{1}{γ + mθ} (γ^+ x + θγ v)
      x^+ = y − t G_t(y)
      v^+ = x + \frac{1}{θ} (x^+ − x)
• v^+, v, and y are related as
      γ^+ v^+ = (1 − θ)γ v + mθ y − θ G_t(y)        (1)
Accelerated proximal gradient methods 9-11
Proof (last identity):
• combine the v and x updates and use γ^+ = θ^2/t:
      v^+ = x + \frac{1}{θ} (y − t G_t(y) − x)
          = \frac{1}{θ} (y − (1 − θ)x) − \frac{θ}{γ^+} G_t(y)
• multiply with γ^+ = γ + mθ − θγ:
      γ^+ v^+ = \frac{γ^+}{θ} (y − (1 − θ)x) − θ G_t(y)
              = \frac{1 − θ}{θ} \left( (γ + mθ)y − γ^+ x \right) + θm y − θ G_t(y)
              = (1 − θ)γ v + θm y − θ G_t(y)
Accelerated proximal gradient methods 9-12
Bounds on objective function
recall the results on the proximal gradient update (page 6-13):
• if 0 < t ≤ 1/L then g(x^+) = g(y − t G_t(y)) is bounded by
      g(x^+) ≤ g(y) − t ∇g(y)^T G_t(y) + \frac{t}{2} ‖G_t(y)‖_2^2        (2)
• if the inequality (2) holds, then mt ≤ 1 and, for all z,
      f(z) ≥ f(x^+) + \frac{t}{2} ‖G_t(y)‖_2^2 + G_t(y)^T (z − y) + \frac{m}{2} ‖z − y‖_2^2
• combine the inequalities for z = x and z = x^⋆:
      f(x^+) − f^⋆ ≤ (1 − θ)(f(x) − f^⋆) − G_t(y)^T ((1 − θ)x + θ x^⋆ − y)
                      − \frac{t}{2} ‖G_t(y)‖_2^2 − \frac{mθ}{2} ‖x^⋆ − y‖_2^2
Accelerated proximal gradient methods 9-13
Progress in one iteration
• the definition of γ^+ and (1) imply that
      \frac{γ^+}{2} \left( ‖x^⋆ − v^+‖_2^2 − ‖y − v^+‖_2^2 \right)
          = \frac{(1 − θ)γ}{2} \left( ‖x^⋆ − v‖_2^2 − ‖y − v‖_2^2 \right) + \frac{mθ}{2} ‖x^⋆ − y‖_2^2 + θ G_t(y)^T (x^⋆ − y)
• combining this with the last inequality on page 9-13 gives
      f(x^+) − f^⋆ + \frac{γ^+}{2} ‖x^⋆ − v^+‖_2^2
          ≤ (1 − θ) \left( f(x) − f^⋆ + \frac{γ}{2} ‖x^⋆ − v‖_2^2 − G_t(y)^T (x − y) − \frac{γ}{2} ‖y − v‖_2^2 \right)
             − \frac{t}{2} ‖G_t(y)‖_2^2 + \frac{γ^+}{2} ‖y − v^+‖_2^2
Accelerated proximal gradient methods 9-14
• the last term on the right-hand side is
      \frac{γ^+}{2} ‖y − v^+‖_2^2 = \frac{1}{2γ^+} ‖(1 − θ)γ (y − v) + θ G_t(y)‖_2^2
          = \frac{(1 − θ)^2 γ^2}{2γ^+} ‖y − v‖_2^2 + \frac{θ(1 − θ)γ}{γ^+} G_t(y)^T (y − v) + \frac{t}{2} ‖G_t(y)‖_2^2
          = (1 − θ) \left( \frac{γ(γ^+ − mθ)}{2γ^+} ‖y − v‖_2^2 + G_t(y)^T (x − y) \right) + \frac{t}{2} ‖G_t(y)‖_2^2
  the last step uses the definitions of γ^+ and y (chosen so that θγ(y − v) = γ^+(x − y))
• substituting this in the last inequality on page 9-14 gives the result on page 9-10
      f(x^+) − f^⋆ + \frac{γ^+}{2} ‖x^⋆ − v^+‖_2^2
          ≤ (1 − θ) \left( f(x) − f^⋆ + \frac{γ}{2} ‖x^⋆ − v‖_2^2 \right) − \frac{(1 − θ)γ mθ}{2γ^+} ‖y − v‖_2^2
          ≤ (1 − θ) \left( f(x) − f^⋆ + \frac{γ}{2} ‖x^⋆ − v‖_2^2 \right)
Accelerated proximal gradient methods 9-15
Analysis for fixed step size
the product λ_k = \prod_{i=1}^k (1 − θ_i) determines the rate of convergence (page 9-10)
• the sequence λ_k satisfies the following bound (proof on next page)
      λ_k ≤ \frac{4}{\left( 2 + \sqrt{γ_0} \sum_{i=1}^k \sqrt{t_i} \right)^2}
• for constant step size t_k = 1/L, we obtain
      λ_k ≤ \frac{4}{(2 + k \sqrt{γ_0/L})^2}
• combined with the inequality on p. 9-10, this shows the 1/k^2 convergence rate:
      f(x^{(k)}) − f^⋆ ≤ \frac{4}{(2 + k \sqrt{γ_0/L})^2} \left( f(x^{(0)}) − f^⋆ + \frac{γ_0}{2} ‖x^{(0)} − x^⋆‖_2^2 \right)
Accelerated proximal gradient methods 9-16
Proof.
• recall that γ_k and θ_k are defined by γ_k = (1 − θ_k)γ_{k-1} + θ_k m and γ_k = θ_k^2 / t_k
• we first note that λ_k ≤ γ_k / γ_0; this follows from
      λ_k = (1 − θ_k) λ_{k-1} = \frac{γ_k − θ_k m}{γ_{k-1}} λ_{k-1} ≤ \frac{γ_k}{γ_{k-1}} λ_{k-1}
• the inequality follows by combining from i = 1 to i = k the inequalities
      \frac{1}{\sqrt{λ_i}} − \frac{1}{\sqrt{λ_{i-1}}} ≥ \frac{λ_{i-1} − λ_i}{2 λ_{i-1} \sqrt{λ_i}}        (because λ_i ≤ λ_{i-1})
          = \frac{θ_i}{2 \sqrt{λ_i}}
          ≥ \frac{θ_i}{2 \sqrt{γ_i / γ_0}}
          = \frac{1}{2} \sqrt{γ_0 t_i}
Accelerated proximal gradient methods 9-17
Strongly convex functions
the following bound on λk is useful for strongly convex functions (m > 0)
• if γ_0 ≥ m then γ_k ≥ m for all k and
      λ_k ≤ \prod_{i=1}^k (1 − \sqrt{m t_i})
  (proof on next page)
• for constant step size t_k = 1/L, we obtain
      λ_k ≤ \left( 1 − \sqrt{m/L} \right)^k
• combined with the inequality on p. 9-10, this shows
      f(x^{(k)}) − f^⋆ ≤ \left( 1 − \sqrt{m/L} \right)^k \left( f(x^{(0)}) − f^⋆ + \frac{γ_0}{2} ‖x^{(0)} − x^⋆‖_2^2 \right)
Accelerated proximal gradient methods 9-18
Proof.
• if γ_{k-1} ≥ m, then
      γ_k = (1 − θ_k)γ_{k-1} + θ_k m ≥ m
• since γ_0 ≥ m, we have γ_k ≥ m for all k
• it follows that θ_i = \sqrt{γ_i t_i} ≥ \sqrt{m t_i} and
      λ_k = \prod_{i=1}^k (1 − θ_i) ≤ \prod_{i=1}^k (1 − \sqrt{m t_i})
Accelerated proximal gradient methods 9-19
Outline
• Nesterov’s method
• analysis with fixed step size
• line search
Line search
• the analysis for fixed step size starts with the inequality (2):
      g(y − t G_t(y)) ≤ g(y) − t ∇g(y)^T G_t(y) + \frac{t}{2} ‖G_t(y)‖_2^2
  this inequality is known to hold for 0 < t ≤ 1/L
• if L is not known, we can satisfy (2) by a backtracking line search:
start at some t := t̂ > 0 and backtrack (t := βt) until (2) holds
• step size selected by the line search satisfies t ≥ t_{\min} = min {t̂, β/L}
• for each tentative t_k we need to recompute θ_k, y, x^{(k)} in the algorithm on p. 9-4
• requires evaluations of ∇g, prox_{th}, and g (twice) per line search iteration
Accelerated proximal gradient methods 9-20
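A sketch of one iteration of the algorithm on p. 9-4 with this backtracking line search: start at t = t̂ and shrink t by the factor β until condition (2) holds, recomputing θ_k, y, and x^{(k)} for each tentative t. g, grad_g, prox_h and the state (x, v, gamma) are assumed inputs; prox_h(u, t) returns prox_{th}(u).

    import numpy as np

    def apg_step_with_line_search(g, grad_g, prox_h, x, v, gamma, m=0.0,
                                  t_hat=1.0, beta=0.5):
        t = t_hat
        while True:
            # theta: positive root of theta^2/t = (1 - theta)*gamma + m*theta
            b, c = (gamma - m) * t, -gamma * t
            theta = (-b + np.sqrt(b * b - 4 * c)) / 2
            y = x + theta * gamma / (gamma + m * theta) * (v - x)
            gy = grad_g(y)
            x_new = prox_h(y - t * gy, t)
            G = (y - x_new) / t                       # gradient mapping G_t(y)
            # condition (2): g(y - t G) <= g(y) - t grad_g(y)'G + (t/2)||G||^2
            if g(x_new) <= g(y) - t * (gy @ G) + 0.5 * t * (G @ G):
                break
            t *= beta                                 # backtrack
        v_new = x + (x_new - x) / theta
        return x_new, v_new, theta**2 / t, t          # new x, v, gamma, accepted step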
Analysis with line search
• from page 9-16:
      λ_k ≤ \frac{4}{\left( 2 + \sqrt{γ_0} \sum_{i=1}^k \sqrt{t_i} \right)^2} ≤ \frac{4}{\left( 2 + k \sqrt{γ_0 t_{\min}} \right)^2}
• from page 9-18, if γ_0 ≥ m:
      λ_k ≤ \prod_{i=1}^k (1 − \sqrt{m t_i}) ≤ \left( 1 − \sqrt{m t_{\min}} \right)^k
• therefore the results for fixed step size hold with 1/t_{\min} substituted for L
Accelerated proximal gradient methods 9-21
References
Accelerated gradient methods
• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004).
The material in the lecture is from §2.2 of this book.
• P. Tseng, On accelerated proximal gradient methods for convex-concave optimization (2008).
• S. Bubeck, Convex Optimization: Algorithms and Complexity, Foundations and Trends in
Machine Learning (2015), §3.7.
FISTA
• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse
problems, SIAM J. on Imaging Sciences (2009).
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y.
Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications
(2009).
Line search strategies
• FISTA papers by Beck and Teboulle.
• D. Goldfarb and K. Scheinberg, Fast first-order methods for composite convex optimization with
line search (2011).
• O. Güler, New proximal point algorithms for convex minimization, SIOPT (1992).
• Yu. Nesterov, Gradient methods for minimizing composite functions (2013).
Accelerated proximal gradient methods 9-22
Interpretation and insight
• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004), §2.2.
• W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov's accelerated
gradient method: theory and insights, NIPS (2014).
• H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization,
arXiv:1506.02186 (2015).
Implementation
• S. Becker, E.J. Candès, M. Grant, Templates for convex cone problems with applications to
sparse signal recovery, Mathematical Programming Computation (2011).
• B. O’Donoghue, E. Candès, Adaptive restart for accelerated gradient schemes, Foundations of
Computational Mathematics (2015).
• T. Goldstein, C. Studer, R. Baraniuk, A field guide to forward-backward splitting with a FASTA
implementation, arXiv:1411.3406 (2016).
Accelerated proximal gradient methods 9-23