L. Vandenberghe EE236C (Spring 2016)
9. Accelerated proximal gradient methods
• Nesterov’s method
• analysis with fixed step size
• line search
9-1
Proximal gradient method
Results from lecture 6
• each proximal gradient iteration is a descent step:
      f(x^{(k)}) < f(x^{(k-1)}),        ‖x^{(k)} − x^⋆‖_2^2 ≤ c ‖x^{(k-1)} − x^⋆‖_2^2
with c = 1 − m/L
• suboptimality after k iterations is O(1/k):
      f(x^{(k)}) − f^⋆ ≤ \frac{L}{2k} ‖x^{(0)} − x^⋆‖_2^2
Accelerated proximal gradient methods
• to improve convergence, we add a momentum term
• we relax the descent property
• originated in work by Nesterov in the 1980s
Accelerated proximal gradient methods 9-2
Assumptions
we consider the same problem and make the same assumptions as in lecture 6:
minimize f (x) = g(x) + h(x)
• h is closed and convex (so that prox_{th} is well defined)
• g is differentiable with dom g = Rn
• there exist constants m ≥ 0 and L > 0 such that the functions
      g(x) − \frac{m}{2} x^T x,        \frac{L}{2} x^T x − g(x)
  are convex
• the optimal value f^⋆ is finite and attained at x^⋆ (not necessarily unique)
Accelerated proximal gradient methods 9-3
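A small numerical illustration of the curvature assumption above, for a convex quadratic g(x) = (1/2) x^T P x + q^T x (an assumed example, not from the slides): the constants can be taken as m = λ_min(P) and L = λ_max(P), since g(x) − (m/2) x^T x and (L/2) x^T x − g(x) are then convex.

    # hypothetical example: estimate m and L for a convex quadratic g
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 10))
    P = A.T @ A                      # symmetric positive semidefinite
    eigs = np.linalg.eigvalsh(P)
    m, L = eigs.min(), eigs.max()    # strong convexity and smoothness constants of g
    print(m, L)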
Nesterov’s method
Algorithm: choose x^{(0)} = v^{(0)} and γ_0 > 0; for k ≥ 1, repeat the steps
• define γ_k = θ_k^2 / t_k, where θ_k is the positive root of the quadratic equation
      θ_k^2 / t_k = (1 − θ_k) γ_{k-1} + m θ_k
• update x^{(k)} and v^{(k)} as follows:
      y = x^{(k-1)} + \frac{θ_k γ_{k-1}}{γ_{k-1} + m θ_k} (v^{(k-1)} − x^{(k-1)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
      v^{(k)} = x^{(k-1)} + \frac{1}{θ_k} (x^{(k)} − x^{(k-1)})
the step size t_k is fixed (t_k = 1/L) or obtained from line search
Accelerated proximal gradient methods 9-4
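A minimal NumPy sketch of the algorithm above with fixed step size t_k = 1/L. The names grad_g, prox_h, x0, gamma0 are assumed inputs (not part of the lecture); prox_h(u, t) is expected to return prox_{th}(u), and the default γ_0 = L is an arbitrary positive choice.

    import numpy as np

    def nesterov_apg(grad_g, prox_h, x0, L, m=0.0, gamma0=None, iters=100):
        t = 1.0 / L
        gamma = L if gamma0 is None else gamma0   # any gamma0 > 0 is allowed
        x = v = np.asarray(x0, dtype=float)
        for _ in range(iters):
            # theta: positive root of theta^2/t = (1 - theta)*gamma + m*theta
            b, c = (gamma - m) * t, -gamma * t
            theta = (-b + np.sqrt(b * b - 4 * c)) / 2
            y = x + theta * gamma / (gamma + m * theta) * (v - x)
            x_new = prox_h(y - t * grad_g(y), t)
            v = x + (x_new - x) / theta
            x, gamma = x_new, theta**2 / t
        return x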
Momentum interpretation
• the first iteration (k = 1) is a proximal gradient step at y = x^{(0)}
• the next iterations are proximal gradient steps at extrapolated points y:
      y = x^{(k-1)} + \frac{θ_k γ_{k-1}}{γ_{k-1} + m θ_k} (v^{(k-1)} − x^{(k-1)})
        = x^{(k-1)} + \frac{θ_k γ_{k-1}}{γ_{k-1} + m θ_k} \left( \frac{1}{θ_{k-1}} − 1 \right) (x^{(k-1)} − x^{(k-2)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
  (the second expression for y follows from the update v^{(k-1)} = x^{(k-2)} + (x^{(k-1)} − x^{(k-2)})/θ_{k-1})
[figure: the points x^{(k-2)}, x^{(k-1)}, and y on the extrapolation line]
Accelerated proximal gradient methods 9-5
Algorithm parameters
      θ_k^2 / t_k = (1 − θ_k) γ_{k-1} + m θ_k,        γ_k = θ_k^2 / t_k
• θ_k is the positive root of the quadratic equation
• θ_k < 1 if m t_k < 1
• if t_k is constant, the sequence θ_k is completely determined by the starting value γ_0
Example: L = 1, m = 0.1, t_k = 1/L
[figure: θ_k (left) and γ_k (right) versus k for γ_0 = m, 0.5m, 2m; θ_k approaches \sqrt{m/L} and γ_k approaches m]
Accelerated proximal gradient methods 9-6
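A short sketch that reproduces the parameter sequences in the example above (L = 1, m = 0.1, t_k = 1/L, several values of γ_0); it simply iterates the recursion for θ_k and γ_k and prints the limits they approach.

    import numpy as np

    L, m, t = 1.0, 0.1, 1.0
    for gamma0 in (m, 0.5 * m, 2 * m):
        gamma = gamma0
        for _ in range(25):
            # theta_k: positive root of theta^2/t = (1 - theta)*gamma + m*theta
            b, c = (gamma - m) * t, -gamma * t
            theta = (-b + np.sqrt(b * b - 4 * c)) / 2
            gamma = theta**2 / t
        # theta_k approaches sqrt(m/L) and gamma_k approaches m
        print(gamma0, theta, gamma, np.sqrt(m / L))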
FISTA
if we take m = 0 on page 9-4, the expression for y simplifies:
      y = x^{(k-1)} + θ_k (v^{(k-1)} − x^{(k-1)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
      v^{(k)} = x^{(k-1)} + \frac{1}{θ_k} (x^{(k)} − x^{(k-1)})
eliminating the variables v^{(k)} gives the equivalent iteration (for k ≥ 2)
      y = x^{(k-1)} + θ_k \left( \frac{1}{θ_{k-1}} − 1 \right) (x^{(k-1)} − x^{(k-2)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
this is known as FISTA (‘Fast Iterative Shrinkage-Thresholding Algorithm’)
Accelerated proximal gradient methods 9-7
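A minimal sketch of the FISTA iteration above, in its eliminated form with fixed step t = 1/L. grad_g, prox_h, x0, gamma0 are assumed inputs (prox_h(u, t) returns prox_{th}(u)); θ_k is computed from the recursion θ_k^2/t = (1 − θ_k) γ_{k-1}, γ_k = θ_k^2/t.

    import numpy as np

    def fista(grad_g, prox_h, x0, L, gamma0=None, iters=100):
        t = 1.0 / L
        gamma = L if gamma0 is None else gamma0
        x_prev = x = np.asarray(x0, dtype=float)
        theta_prev = 1.0
        for k in range(1, iters + 1):
            # theta_k: positive root of theta^2/t = (1 - theta)*gamma
            b = gamma * t
            theta = (-b + np.sqrt(b * b + 4 * b)) / 2
            # first iteration is a plain proximal gradient step at x^(0)
            y = x if k == 1 else x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
            x_prev, x = x, prox_h(y - t * grad_g(y), t)
            theta_prev, gamma = theta, theta**2 / t
        return x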
Example
      minimize   \log \sum_{i=1}^m \exp(a_i^T x + b_i)
• two randomly generated problems with m = 2000, n = 1000
• same fixed step size used for gradient method and FISTA
• figures show (f(x^{(k)}) − f^⋆)/f^⋆ versus k
[figure: two log-scale plots of the relative suboptimality (from 10^0 down to 10^{-6}) versus k (0 to 200) for the gradient method and FISTA, one per problem instance]
Accelerated proximal gradient methods 9-8
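A sketch of this example, reusing the fista function given after page 9-7: here h = 0, so prox_{th} is the identity, and ∇g(x) = A^T p(x) with p(x) the softmax of Ax + b. The data are randomly generated (smaller than on the slide to keep the run cheap), and L_est = ‖A‖_2^2 is an assumed, conservative Lipschitz bound for ∇g.

    import numpy as np

    rng = np.random.default_rng(0)
    m_rows, n = 200, 100
    A = rng.standard_normal((m_rows, n))
    b = rng.standard_normal(m_rows)

    def grad_g(x):
        z = A @ x + b
        w = np.exp(z - z.max())          # numerically stable softmax weights
        return A.T @ (w / w.sum())

    prox_h = lambda u, t: u              # h = 0, so the prox is the identity
    L_est = np.linalg.norm(A, 2) ** 2    # conservative bound: Hessian of g is ≼ A^T A
    x = fista(grad_g, prox_h, np.zeros(n), L_est, iters=200)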
Nesterov’s simplest method
• if m > 0 and we choose γ_0 = m, then
      γ_k = m,        θ_k = \sqrt{m t_k}   for all k ≥ 1
• the algorithm on p. 9-4 and p. 9-5 simplifies:
      y = x^{(k-1)} + \frac{\sqrt{t_k}}{\sqrt{t_{k-1}}} \, \frac{1 − \sqrt{m t_{k-1}}}{1 + \sqrt{m t_k}} (x^{(k-1)} − x^{(k-2)})
      x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))
• with constant step size t_k = 1/L, the expression for y reduces to
      y = x^{(k-1)} + \frac{1 − \sqrt{m/L}}{1 + \sqrt{m/L}} (x^{(k-1)} − x^{(k-2)})
Accelerated proximal gradient methods 9-9
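A minimal sketch of this constant-momentum variant: fixed step t = 1/L and fixed extrapolation coefficient (1 − \sqrt{m/L})/(1 + \sqrt{m/L}). grad_g, prox_h, x0 are assumed inputs, with prox_h(u, t) returning prox_{th}(u).

    import numpy as np

    def nesterov_const_momentum(grad_g, prox_h, x0, L, m, iters=100):
        t = 1.0 / L
        beta = (1 - np.sqrt(m / L)) / (1 + np.sqrt(m / L))
        x_prev = x = np.asarray(x0, dtype=float)
        for k in range(1, iters + 1):
            # first iteration is a plain proximal gradient step at x^(0)
            y = x if k == 1 else x + beta * (x - x_prev)
            x_prev, x = x, prox_h(y - t * grad_g(y), t)
        return x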
Outline
• Nesterov’s method
• analysis with fixed step size
• line search
Overview
• we show that if t_k = 1/L, the following inequality holds at each iteration:
      f(x^{(k)}) − f^⋆ + \frac{γ_k}{2} ‖v^{(k)} − x^⋆‖_2^2
          ≤ (1 − θ_k) \left( f(x^{(k-1)}) − f^⋆ + \frac{γ_{k-1}}{2} ‖v^{(k-1)} − x^⋆‖_2^2 \right)
• therefore the rate of convergence is determined by λ_k = \prod_{i=1}^k (1 − θ_i):
      f(x^{(k)}) − f^⋆ ≤ f(x^{(k)}) − f^⋆ + \frac{γ_k}{2} ‖v^{(k)} − x^⋆‖_2^2
          ≤ λ_k \left( f(x^{(0)}) − f^⋆ + \frac{γ_0}{2} ‖x^{(0)} − x^⋆‖_2^2 \right)
(here we assume that x^{(0)} ∈ dom h = dom f)
Accelerated proximal gradient methods 9-10
Notation for one iteration
quantities in iteration i of the algorithm on page 9-4
• define t = t_i, θ = θ_i, γ = γ_{i-1}, and γ^+ = γ_i:
      γ^+ = (1 − θ)γ + mθ,        γ^+ = θ^2 / t
• define x = x^{(i-1)}, x^+ = x^{(i)}, v = v^{(i-1)}, and v^+ = v^{(i)}:
      y = \frac{1}{γ + mθ} (γ^+ x + θγ v)
      x^+ = y − t G_t(y)
      v^+ = x + \frac{1}{θ} (x^+ − x)
• v^+, v, and y are related as
      γ^+ v^+ = (1 − θ)γ v + mθ y − θ G_t(y)        (1)
Accelerated proximal gradient methods 9-11
Proof (last identity):
• combine the v and x updates and use γ^+ = θ^2/t:
      v^+ = x + \frac{1}{θ} (y − t G_t(y) − x)
          = \frac{1}{θ} (y − (1 − θ)x) − \frac{θ}{γ^+} G_t(y)
• multiply with γ^+ = γ + mθ − θγ:
      γ^+ v^+ = \frac{γ^+}{θ} (y − (1 − θ)x) − θ G_t(y)
              = \frac{1 − θ}{θ} \left( (γ + mθ)y − γ^+ x \right) + θm y − θ G_t(y)
              = (1 − θ)γ v + θm y − θ G_t(y)
Accelerated proximal gradient methods 9-12
Bounds on objective function
recall the results on the proximal gradient update (page 6-13):
• if 0 < t ≤ 1/L then g(x^+) = g(y − t G_t(y)) is bounded by
      g(x^+) ≤ g(y) − t ∇g(y)^T G_t(y) + \frac{t}{2} ‖G_t(y)‖_2^2        (2)
• if the inequality (2) holds, then mt ≤ 1 and, for all z,
      f(z) ≥ f(x^+) + \frac{t}{2} ‖G_t(y)‖_2^2 + G_t(y)^T (z − y) + \frac{m}{2} ‖z − y‖_2^2
• combine the inequalities for z = x and z = x^⋆:
      f(x^+) − f^⋆ ≤ (1 − θ)(f(x) − f^⋆) − G_t(y)^T ((1 − θ)x + θ x^⋆ − y)
                      − \frac{t}{2} ‖G_t(y)‖_2^2 − \frac{mθ}{2} ‖x^⋆ − y‖_2^2
Accelerated proximal gradient methods 9-13
Progress in one iteration
• the definition of γ^+ and (1) imply that
      \frac{γ^+}{2} \left( ‖x^⋆ − v^+‖_2^2 − ‖y − v^+‖_2^2 \right)
          = \frac{(1 − θ)γ}{2} \left( ‖x^⋆ − v‖_2^2 − ‖y − v‖_2^2 \right) + \frac{mθ}{2} ‖x^⋆ − y‖_2^2 + θ G_t(y)^T (x^⋆ − y)
• combining this with the last inequality on page 9-13 gives
      f(x^+) − f^⋆ + \frac{γ^+}{2} ‖x^⋆ − v^+‖_2^2
          ≤ (1 − θ) \left( f(x) − f^⋆ + \frac{γ}{2} ‖x^⋆ − v‖_2^2 − G_t(y)^T (x − y) − \frac{γ}{2} ‖y − v‖_2^2 \right)
             − \frac{t}{2} ‖G_t(y)‖_2^2 + \frac{γ^+}{2} ‖y − v^+‖_2^2
Accelerated proximal gradient methods 9-14
• the last term on the right-hand side is
      \frac{γ^+}{2} ‖y − v^+‖_2^2 = \frac{1}{2γ^+} ‖(1 − θ)γ (y − v) + θ G_t(y)‖_2^2
          = \frac{(1 − θ)^2 γ^2}{2γ^+} ‖y − v‖_2^2 + \frac{θ(1 − θ)γ}{γ^+} G_t(y)^T (y − v) + \frac{t}{2} ‖G_t(y)‖_2^2
          = (1 − θ) \left( \frac{γ(γ^+ − mθ)}{2γ^+} ‖y − v‖_2^2 + G_t(y)^T (x − y) \right) + \frac{t}{2} ‖G_t(y)‖_2^2
  the last step uses the definitions of γ^+ and y (chosen so that θγ(y − v) = γ^+(x − y))
• substituting this in the last inequality on page 9-14 gives the result on page 9-10
      f(x^+) − f^⋆ + \frac{γ^+}{2} ‖x^⋆ − v^+‖_2^2
          ≤ (1 − θ) \left( f(x) − f^⋆ + \frac{γ}{2} ‖x^⋆ − v‖_2^2 \right) − \frac{(1 − θ)γ mθ}{2γ^+} ‖y − v‖_2^2
          ≤ (1 − θ) \left( f(x) − f^⋆ + \frac{γ}{2} ‖x^⋆ − v‖_2^2 \right)
Accelerated proximal gradient methods 9-15
Analysis for fixed step size
the product λ_k = \prod_{i=1}^k (1 − θ_i) determines the rate of convergence (page 9-10)
• the sequence λ_k satisfies the following bound (proof on next page)
      λ_k ≤ \frac{4}{\left( 2 + \sqrt{γ_0} \sum_{i=1}^k \sqrt{t_i} \right)^2}
• for constant step size t_k = 1/L, we obtain
      λ_k ≤ \frac{4}{(2 + k \sqrt{γ_0/L})^2}
• combined with the inequality on p. 9-10, this shows the 1/k^2 convergence rate:
      f(x^{(k)}) − f^⋆ ≤ \frac{4}{(2 + k \sqrt{γ_0/L})^2} \left( f(x^{(0)}) − f^⋆ + \frac{γ_0}{2} ‖x^{(0)} − x^⋆‖_2^2 \right)
Accelerated proximal gradient methods 9-16
Proof.
• recall that γ_k and θ_k are defined by γ_k = (1 − θ_k)γ_{k-1} + θ_k m and γ_k = θ_k^2 / t_k
• we first note that λ_k ≤ γ_k / γ_0; this follows from
      λ_k = (1 − θ_k) λ_{k-1} = \frac{γ_k − θ_k m}{γ_{k-1}} λ_{k-1} ≤ \frac{γ_k}{γ_{k-1}} λ_{k-1}
• the inequality follows by combining from i = 1 to i = k the inequalities
      \frac{1}{\sqrt{λ_i}} − \frac{1}{\sqrt{λ_{i-1}}} ≥ \frac{λ_{i-1} − λ_i}{2 λ_{i-1} \sqrt{λ_i}}        (because λ_i ≤ λ_{i-1})
          = \frac{θ_i}{2 \sqrt{λ_i}}
          ≥ \frac{θ_i}{2 \sqrt{γ_i / γ_0}}
          = \frac{1}{2} \sqrt{γ_0 t_i}
Accelerated proximal gradient methods 9-17
Strongly convex functions
the following bound on λk is useful for strongly convex functions (m > 0)
• if γ_0 ≥ m then γ_k ≥ m for all k and
      λ_k ≤ \prod_{i=1}^k (1 − \sqrt{m t_i})
  (proof on next page)
• for constant step size t_k = 1/L, we obtain
      λ_k ≤ \left( 1 − \sqrt{m/L} \right)^k
• combined with the inequality on p. 9-10, this shows
      f(x^{(k)}) − f^⋆ ≤ \left( 1 − \sqrt{m/L} \right)^k \left( f(x^{(0)}) − f^⋆ + \frac{γ_0}{2} ‖x^{(0)} − x^⋆‖_2^2 \right)
Accelerated proximal gradient methods 9-18
Proof.
• if γ_{k-1} ≥ m, then
      γ_k = (1 − θ_k)γ_{k-1} + θ_k m ≥ m
• since γ_0 ≥ m, we have γ_k ≥ m for all k
• it follows that θ_i = \sqrt{γ_i t_i} ≥ \sqrt{m t_i} and
      λ_k = \prod_{i=1}^k (1 − θ_i) ≤ \prod_{i=1}^k (1 − \sqrt{m t_i})
Accelerated proximal gradient methods 9-19
Outline
• Nesterov’s method
• analysis with fixed step size
• line search
Line search
• the analysis for fixed step size starts with the inequality (2):
      g(y − t G_t(y)) ≤ g(y) − t ∇g(y)^T G_t(y) + \frac{t}{2} ‖G_t(y)‖_2^2
  this inequality is known to hold for 0 < t ≤ 1/L
• if L is not known, we can satisfy (2) by a backtracking line search:
start at some t := t̂ > 0 and backtrack (t := βt) until (2) holds
• step size selected by the line search satisfies t ≥ t_{\min} = min {t̂, β/L}
• for each tentative t_k we need to recompute θ_k, y, x^{(k)} in the algorithm on p. 9-4
• requires evaluations of ∇g, prox_{th}, and g (twice) per line search iteration
Accelerated proximal gradient methods 9-20
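A sketch of one iteration of the algorithm on p. 9-4 with this backtracking line search: start at t = t̂ and shrink t by the factor β until condition (2) holds, recomputing θ_k, y, and x^{(k)} for each tentative t. g, grad_g, prox_h and the state (x, v, gamma) are assumed inputs; prox_h(u, t) returns prox_{th}(u).

    import numpy as np

    def apg_step_with_line_search(g, grad_g, prox_h, x, v, gamma, m=0.0,
                                  t_hat=1.0, beta=0.5):
        t = t_hat
        while True:
            # theta: positive root of theta^2/t = (1 - theta)*gamma + m*theta
            b, c = (gamma - m) * t, -gamma * t
            theta = (-b + np.sqrt(b * b - 4 * c)) / 2
            y = x + theta * gamma / (gamma + m * theta) * (v - x)
            gy = grad_g(y)
            x_new = prox_h(y - t * gy, t)
            G = (y - x_new) / t                       # gradient mapping G_t(y)
            # condition (2): g(y - t G) <= g(y) - t grad_g(y)'G + (t/2)||G||^2
            if g(x_new) <= g(y) - t * (gy @ G) + 0.5 * t * (G @ G):
                break
            t *= beta                                 # backtrack
        v_new = x + (x_new - x) / theta
        return x_new, v_new, theta**2 / t, t          # new x, v, gamma, accepted step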
Analysis with line search
• from page 9-16:
      λ_k ≤ \frac{4}{\left( 2 + \sqrt{γ_0} \sum_{i=1}^k \sqrt{t_i} \right)^2} ≤ \frac{4}{\left( 2 + k \sqrt{γ_0 t_{\min}} \right)^2}
• from page 9-18, if γ_0 ≥ m:
      λ_k ≤ \prod_{i=1}^k (1 − \sqrt{m t_i}) ≤ \left( 1 − \sqrt{m t_{\min}} \right)^k
• therefore the results for fixed step size hold with 1/t_{\min} substituted for L
Accelerated proximal gradient methods 9-21
References
Accelerated gradient methods
• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004).
The material in the lecture is from §2.2 of this book.
• P. Tseng, On accelerated proximal gradient methods for convex-concave optimization (2008).
• S. Bubeck, Convex Optimization: Algorithms and Complexity, Foundations and Trends in
Machine Learning (2015), §3.7.
FISTA
• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse
problems, SIAM J. on Imaging Sciences (2009).
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y.
Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications
(2009).
Line search strategies
• FISTA papers by Beck and Teboulle.
• D. Goldfarb and K. Scheinberg, Fast first-order methods for composite convex optimization with
line search (2011).
• O. Güler, New proximal point algorithms for convex minimization, SIOPT (1992).
• Yu. Nesterov, Gradient methods for minimizing composite functions (2013).
Accelerated proximal gradient methods 9-22
Interpretation and insight
• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004), §2.2.
• W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov's accelerated
gradient method: theory and insights, NIPS (2014).
• H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization,
arXiv:1506.02186 (2015).
Implementation
• S. Becker, E.J. Candès, M. Grant, Templates for convex cone problems with applications to
sparse signal recovery, Mathematical Programming Computation (2011).
• B. O’Donoghue, E. Candès, Adaptive restart for accelerated gradient schemes, Foundations of
Computational Mathematics (2015).
• T. Goldstein, C. Studer, R. Baraniuk, A field guide to forward-backward splitting with a FASTA
implementation, arXiv:1411.3406 (2016).
Accelerated proximal gradient methods 9-23