Numerical Optimization
Lecture Notes #18
Quasi-Newton Methods: The BFGS Method
Peter Blomgren,
[email protected]
Department of Mathematics and Statistics
Dynamical Systems Group
Computational Sciences Research Center
San Diego State University
San Diego, CA 92182-7720
http://terminus.sdsu.edu/
Fall 2016
Outline
1 Quasi-Newton Methods
    Introduction
    The BFGS Method
2 BFGS Variants
    Limited-memory BFGS
Quasi-Newton Methods
Quasi-Newton methods require only the gradient of the objective (like steepest descent) to be computed at each iterate.
By successive measurements of the gradient, Quasi-Newton
methods build a quadratic model of the objective function which is
sufficiently good that superlinear convergence is achieved.
Quasi-Newton methods are much faster than steepest descent (and
coordinate descent) methods.
Since second derivatives (the Hessian) are not required,
quasi-Newton methods are sometimes more efficient (as measured by
total work / wall-clock computational time) than Newton methods,
especially when Hessian evaluation is slow/expensive.
The BFGS Method: Introduction 1 of 3
The BFGS method is named for its discoverers:
Broyden-Fletcher-Goldfarb-Shanno, and is the most popular
quasi-Newton method.
We derive the DFP (a close relative) and BFGS methods; and look
at some properties and practical implementation details.
The derivation starts with the quadratic model
$$ m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2}\, p^T B_k p $$
at the current iterate $x_k$. $B_k$ is a symmetric positive definite matrix (model Hessian) that will be updated in every iteration.
The BFGS Method: Introduction (Standard Stuff) 2 of 3
Given this convex quadratic model, we can write down the minimizer $p_k$ explicitly as
$$ p_k = -B_k^{-1} \nabla f(x_k). $$
We can compute the search direction $p_k$ using e.g. the Cholesky factorization, or a CG-iteration; once we have $p_k$ we find the new iterate
$$ x_{k+1} = x_k + \alpha_k p_k, $$
where we require that the step length $\alpha_k$ satisfies e.g. the Wolfe conditions:
$$ f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k, \quad c_1 \in (0, 1), $$
$$ \nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k, \quad c_2 \in (c_1, 1). $$
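For concreteness, here is a minimal numpy/scipy sketch (my own illustration, not part of the notes) of computing $p_k$ via a Cholesky solve and testing the Wolfe conditions for a candidate step length; the helper names are hypothetical:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def search_direction(B, grad):
    # Solve B p = -grad via a Cholesky factorization (B is assumed SPD).
    c, lower = cho_factor(B)
    return cho_solve((c, lower), -grad)

def wolfe_ok(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    # Check the (weak) Wolfe conditions for the candidate step length alpha.
    g0 = grad_f(x)
    sufficient_decrease = f(x + alpha * p) <= f(x) + c1 * alpha * (g0 @ p)
    curvature = grad_f(x + alpha * p) @ p >= c2 * (g0 @ p)
    return sufficient_decrease and curvature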
The BFGS Method: Introduction 3 of 3
So far we have not really done anything new; the key difference compared with the linesearch Newton method is that we are using an approximate Hessian $B_k \neq \nabla^2 f(x_k)$.
Instead of computing a completely new $B_k$ in each iteration, we will update
$$ B_{k+1} = B_k + \text{something}, $$
using information about the curvature at step #k. Thus we get a new model
$$ m_{k+1}(p) = f(x_{k+1}) + \nabla f(x_{k+1})^T p + \frac{1}{2}\, p^T B_{k+1} p. $$
Clearly, for this to make sense we must impose some conditions on the update.
The BFGS Method: Conditions on Bk+1 1 of 3
We impose two conditions on the new model $m_{k+1}(p)$:
[1,2] The gradient of $m_{k+1}(p)$ must match the gradient of the objective function at $x_k$ and $x_{k+1}$.
The second condition is satisfied by construction, since
$$ \nabla m_{k+1}(0) = \nabla f(x_{k+1}). $$
The first condition gives us
$$ \nabla m_{k+1}(-\alpha_k p_k) = \nabla f(x_{k+1}) - \alpha_k B_{k+1} p_k = \nabla f(x_k). $$
With a little bit of re-arrangement we get
$$ \alpha_k B_{k+1} p_k = \nabla f(x_{k+1}) - \nabla f(x_k). $$
The BFGS Method: Conditions on Bk+1 2 of 3
We clean up the notation by introducing
$$ s_k = x_{k+1} - x_k = \alpha_k p_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k). $$
We can now express the condition on $B_{k+1}$ in terms of $s_k$ and $y_k$:
$$ B_{k+1} s_k = y_k. $$
We will refer to this equation as the Secant Equation.
By pre-multiplying the secant equation by $s_k^T$ we see the curvature condition
$$ \underbrace{s_k^T B_{k+1} s_k}_{>\,0} = s_k^T y_k \quad\Rightarrow\quad s_k^T y_k > 0. $$
The BFGS Method: Conditions on Bk+1 3 of 3
If we impose the Wolfe, or strong Wolfe, conditions on the line search procedure, the curvature condition will always hold, since
$$ \nabla f(x_{k+1})^T s_k \ge c_2 \nabla f(x_k)^T s_k $$
by the (curvature) Wolfe condition, and therefore
$$ y_k^T s_k \ge (c_2 - 1)\,\alpha_k \nabla f(x_k)^T p_k, $$
where the right-hand-side is positive since $c_2 < 1$ and $p_k$ is a descent direction.
When the curvature condition is satisfied, the secant equation always has at least one solution $B_{k+1}$.
The BFGS Method: More Conditions on Bk+1 1 of 2
It turns out that there are infinitely many symmetric positive
definite matrices Bk+1 which satisfy the secant equation.
Degrees of freedom: $n(n+1)/2$ (symmetric matrix).
Conditions imposed: $n$ (the secant equation) + $n$ (principal minors positive, i.e. positive definiteness).
To determine $B_{k+1}$ uniquely we must impose additional conditions; we will select the $B_{k+1}$ that is closest to $B_k$ in some sense:
$$ B_{k+1} = \arg\min_{B} \|B - B_k\|_{\text{some-norm}}, \quad \text{subject to } B = B^T, \; B s_k = y_k. $$
The BFGS Method: More Conditions on Bk+1 2 of 2
Each choice of matrix norm in this matrix-minimization-problem
(MMP) gives rise to a different quasi-Newton method.
The weighted Frobenius norm
$$ \|A\|_W = \|W^{1/2} A W^{1/2}\|_F = \|C\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}^2} $$
allows easy solution of the MMP, and gives rise to a scale-invariant optimization method.
The matrix $W$ is chosen to be the inverse $G_k^{-1}$ of the average Hessian
$$ G_k = \int_0^1 \nabla^2 f(x_k + \tau\,\alpha_k p_k)\, d\tau. $$
The DFP Method
With this weighting matrix and norm, the unique solution of the MMP is
$$ B_{k+1} = \left(I - \rho_k\, y_k s_k^T\right) B_k \left(I - \rho_k\, s_k y_k^T\right) + \rho_k\, y_k y_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}. $$
Note that $\rho_k$ is a scalar, and $y_k s_k^T$, $y_k y_k^T$, and $s_k y_k^T$ are rank-one matrices.
This is the original Davidon-Fletcher-Powell (DFP) method, suggested by W.C. Davidon in 1959.
The original paper describing this revolutionary idea (the first quasi-Newton method) was not accepted for publication. It later appeared in 1991 in the first issue of the SIAM Journal on Optimization.
Fletcher and Powell demonstrated that this algorithm was much faster and more reliable than existing methods (at the time). This revolutionized the field of non-linear optimization.
The DFP Method: Cleaning Up 1 of 2
The inverse of $B_k$ is useful for the implementation of the method, since it allows the search direction $p_k$ to be computed using a simple matrix-vector product. We let $H_k = B_k^{-1}$ and use the
Sherman-Morrison-Woodbury formula:
If $A \in \mathbb{R}^{n \times n}$ is non-singular and $a, b \in \mathbb{R}^n$, and if
$$ B = A + a b^T, $$
then
$$ B^{-1} = A^{-1} - \frac{A^{-1} a\, b^T A^{-1}}{1 + b^T A^{-1} a}. $$
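As a quick sanity check, a small numpy sketch (mine, not from the notes) that applies the rank-one update formula and compares it against a direct inverse:

import numpy as np

def sherman_morrison(A_inv, a, b):
    # Inverse of (A + a b^T), given A^{-1}.
    Aa = A_inv @ a
    bA = b @ A_inv
    return A_inv - np.outer(Aa, bA) / (1.0 + b @ Aa)

rng = np.random.default_rng(0)
A = np.eye(4) + 0.1 * rng.standard_normal((4, 4))   # well-conditioned test matrix
a, b = rng.standard_normal(4), rng.standard_normal(4)
assert np.allclose(sherman_morrison(np.linalg.inv(A), a, b),
                   np.linalg.inv(A + np.outer(a, b)))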
The DFP Method: Cleaning Up 2 of 2
With a little bit of linear algebra we end up with
$$ H_{k+1} = H_k \;\underbrace{-\; \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k}}_{\text{Update \#1}} \;\underbrace{+\; \frac{s_k s_k^T}{y_k^T s_k}}_{\text{Update \#2}}. $$
Both update terms are rank-one matrices, so that $H_k$ undergoes a rank-2 modification in each iteration.
This is the fundamental idea of quasi-Newton updating: instead of recomputing the matrix(-inverse) from scratch each time around, we apply a simple modification which combines the most recently observed information about the objective with the existing knowledge embedded in the current Hessian approximation.
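For reference, a one-function numpy sketch of this DFP inverse update (illustrative; the name dfp_update is mine):

import numpy as np

def dfp_update(H, s, y):
    # DFP update of the inverse Hessian approximation H, given s_k and y_k.
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)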
Improving on DFP: The BFGS Method
The DFP method is quite effective, but once the quasi-Newton idea was accepted by the optimization community it was quickly superseded by the BFGS method.
BFGS updating is derived by imposing conditions directly on the inverses $H_k$, instead of imposing conditions on the Hessian approximations $B_k$.
The updated approximation must be symmetric positive definite, and must satisfy the secant equation in the form
$$ H_{k+1} y_k = s_k, \qquad \text{compare: } B_{k+1} s_k = y_k. $$
We get a slightly different matrix minimization problem...
The BFGS Matrix Minimization Problem
$$ H_{k+1} = \arg\min_{H} \|H - H_k\|_{\text{some-norm}}, \quad \text{subject to } H = H^T, \; H y_k = s_k. $$
If we again choose the weighted Frobenius norm (with the same weight), then we get the unique update
$$ H_{k+1} = \left(I - \rho_k\, s_k y_k^T\right) H_k \left(I - \rho_k\, y_k s_k^T\right) + \rho_k\, s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}, $$
which translated back to the Hessian approximation yields
$$ B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. $$
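Correspondingly, a minimal numpy sketch of the BFGS inverse update (illustrative; the name bfgs_update is mine):

import numpy as np

def bfgs_update(H, s, y):
    # BFGS update of the inverse Hessian approximation H, given s_k and y_k.
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)   # I - rho_k s_k y_k^T
    return V @ H @ V.T + rho * np.outer(s, s)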
BFGS vs. DFP Updates
BFGS:
$$ H_{k+1} = \left(I - \rho_k\, s_k y_k^T\right) H_k \left(I - \rho_k\, y_k s_k^T\right) + \rho_k\, s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}, $$
$$ B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. $$
DFP:
$$ B_{k+1} = \left(I - \rho_k\, y_k s_k^T\right) B_k \left(I - \rho_k\, s_k y_k^T\right) + \rho_k\, y_k y_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}, $$
$$ H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}. $$
The BFGS Method: Starting H0 = ???
The initial value for the iteration can be selected in different ways:
- A finite difference approximation at $x_0$.
- $H_0 = I$, the identity matrix.
- $H_0 = \mathrm{diag}(s_1, s_2, \ldots, s_n)$, where $s$ captures the scaling of the variables (if known).
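As a small illustration (a sketch assuming the variable scalings are known), the latter two choices in numpy:

import numpy as np

n = 4
H0_identity = np.eye(n)                     # H_0 = I
scales = np.array([1.0, 10.0, 0.1, 5.0])    # assumed known scalings of the variables
H0_scaled = np.diag(scales)                 # H_0 = diag(s_1, ..., s_n)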
The BFGS Method: Algorithm
Algorithm: The BFGS Method
Given starting point $x_0$, convergence tolerance $\epsilon > 0$, and initial inverse Hessian approximation $H_0$:
k = 0
while( $\|\nabla f(x_k)\| > \epsilon$ )
    $p_k = -H_k \nabla f(x_k)$
    $x_{k+1}$ = linesearch($p_k$, ...)
    $s_k = x_{k+1} - x_k$
    $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
    $\rho_k = 1 / (y_k^T s_k)$
    $H_{k+1} = (I - \rho_k s_k y_k^T)\, H_k\, (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$
    k = k + 1
end-while
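A compact Python sketch of the loop above (my own illustration; for simplicity it uses a plain Armijo backtracking line search instead of a full Wolfe line search, and skips the update whenever the curvature condition fails):

import numpy as np

def backtracking(f, grad, x, p, alpha=1.0, c1=1e-4, shrink=0.5):
    # Armijo backtracking; a production BFGS code would enforce both Wolfe conditions.
    slope = grad(x) @ p
    while f(x + alpha * p) > f(x) + c1 * alpha * slope:
        alpha *= shrink
    return alpha

def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    n = x0.size
    H = np.eye(n)                                # H_0 = I
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        p = -H @ g                               # p_k = -H_k grad f(x_k)
        alpha = backtracking(f, grad, x, p)
        x_new = x + alpha * p
        s, y = x_new - x, grad(x_new) - g
        if y @ s > 1e-12:                        # curvature condition s_k^T y_k > 0
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x = x_new
    return x

# Example: minimize the Rosenbrock function.
rosen = lambda z: (1 - z[0])**2 + 100 * (z[1] - z[0]**2)**2
rosen_grad = lambda z: np.array([-2 * (1 - z[0]) - 400 * z[0] * (z[1] - z[0]**2),
                                 200 * (z[1] - z[0]**2)])
print(bfgs(rosen, rosen_grad, np.array([-1.2, 1.0])))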
The BFGS Method: Summary
The cost per iteration is
- $O(n^2)$ arithmetic operations
- one function evaluation
- one gradient evaluation
The convergence rate is
- super-linear
Newton's method converges quadratically, but the cost per iteration is higher: it requires the solution of a linear system. In addition, Newton's method requires the calculation of second derivatives, whereas the BFGS method does not.
The BFGS Method: Stability and Self-Correction 1 of 2
If at some point $\rho_k = 1/(y_k^T s_k)$ becomes large, i.e. $y_k^T s_k \approx 0$, then from the update formula
$$ H_{k+1} = \left(I - \rho_k\, s_k y_k^T\right) H_k \left(I - \rho_k\, y_k s_k^T\right) + \rho_k\, s_k s_k^T $$
we see that $H_{k+1}$ becomes large.
If, for this or some other reason, $H_k$ becomes a poor approximation of $\left[\nabla^2 f(x_k)\right]^{-1}$ for some $k$, is there any hope of correcting it?
It has been shown that the BFGS method has self-correcting
properties. If Hk incorrectly estimates the curvature of the
objective function, and if this estimate slows down the iteration,
then the Hessian approximation will tend to correct itself within a
few steps.
The BFGS Method: Stability and Self-Correction 2 of 2
The self-correcting properties stand and fall with the quality of the line
search! The Wolfe conditions ensure that the model captures
appropriate curvature information.
The DFP method is less effective at self-correcting bad Hessian
approximations.
Practical Implementation Details:
- The linesearch should always test $\alpha = 1$ first, because this step length will eventually be accepted, thus creating super-linear convergence.
- The linesearch can be somewhat sloppy: $c_1 = 10^{-4}$ and $c_2 = 0.9$ are commonly used values in the Wolfe conditions.
- The initial matrix $H_0$ should not be too large; if $H_0 = \beta I$, then the first step is $p_0 = -\beta \nabla f(x_0)$, which may be too long if $\beta$ is large. Often $H_0$ is rescaled before the update $H_1$ is computed:
$$ H_0 \leftarrow \frac{y_k^T s_k}{y_k^T y_k}\, I. $$
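A one-line numpy sketch of that rescaling (illustrative; it assumes the first step has already produced the pair s, y):

import numpy as np

def rescale_H0(s, y, n):
    # H_0 <- (y^T s / y^T y) I, using the first (s, y) pair before computing H_1.
    return (y @ s) / (y @ y) * np.eye(n)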
L-BFGS
Forming the $n \times n$ dense matrix $H_k$ can be quite expensive for large problems. L-BFGS stores a limited history of the BFGS update vectors $s_k$ and $y_k$ (which are of size $n$), and uses these to implicitly form the matrix operations.
In standard BFGS, the current $H_k$ contains updates all the way back to the initial step, $\{s_j, y_j\}_{j=0}^{k-1}$, whereas L-BFGS only uses a limited number of recent updates, so that the action of $H_k$ is formed by application of $\{s_j, y_j\}_{j=k-m}^{k-1}$.
L-BFGS Two Loop Recursion
Given a local initial positive definite model for the (inverse) Hessian, $H_k^0$:
1. $v = \nabla f(x_k)$
2. $\alpha_j = \rho_j\, s_j^T v$, $\quad v = v - \alpha_j y_j$, $\qquad j = k-1, \ldots, k-m$.
3. $w = H_k^0\, v$
4. $\beta_j = \rho_j\, y_j^T w$, $\quad w = w + s_j(\alpha_j - \beta_j)$, $\qquad j = k-m, \ldots, k-1$
5. Now, use $p_k = -w$ $\;(\approx -H_k \nabla f(x_k))$.
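A direct Python transcription of the two-loop recursion (a minimal sketch; the list history is assumed to hold the m most recent (s_j, y_j) pairs, oldest first, and H0_diag stores a diagonal $H_k^0$ as a vector):

import numpy as np

def lbfgs_direction(grad, history, H0_diag):
    # Two-loop recursion: returns p_k = -H_k grad using the stored (s_j, y_j) pairs.
    v = grad.copy()
    alphas = []
    for s, y in reversed(history):               # j = k-1, ..., k-m
        rho = 1.0 / (y @ s)
        a = rho * (s @ v)
        v -= a * y
        alphas.append((a, rho))
    w = H0_diag * v                              # w = H_k^0 v
    for (s, y), (a, rho) in zip(history, reversed(alphas)):   # j = k-m, ..., k-1
        b = rho * (y @ w)
        w += s * (a - b)
    return -w                                    # p_k = -w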
References:
1. Matthies, H.; Strang, G. (1979). "The solution of nonlinear finite element equations." International Journal for Numerical Methods in Engineering 14(11): 1613-1626. doi:10.1002/nme.1620141104
2. Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage." Mathematics of Computation 35(151): 773-782. doi:10.1090/S0025-5718-1980-0572855-7