

Optimisation III

APP MATH 3014 and 7072.


Matthew Roughan
(based on notes produced by Nigel Bean and Liz Cousins)


Discipline of Applied Mathematics


School of Mathematical Sciences
University of Adelaide

2013
© 2013 Matthew Roughan
All rights reserved

Publishing Company, Adelaide, SA.

Printed in the World

The paper used in this publication may meet the minimum requirements of the
American National Standard for Information Sciences — Permanence of Paper for
Printed Library Materials, ANSI Z39.48–1984.
First edition: 12th March, 2010
CONTENTS

Contents i

List of Figures iv

Preface x

1 Introduction to Optimisation 1
1.1 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Examples seen in previous courses: . . . . . . . . . . . . 3
Some More Difficult Examples . . . . . . . . . . . . . . . 7
1.3 Notation and Conventions . . . . . . . . . . . . . . . . . 8
1.4 Questions to Ponder . . . . . . . . . . . . . . . . . . . . . 12
1.5 Expected background . . . . . . . . . . . . . . . . . . . . 13
1.6 What are we trying to teach? . . . . . . . . . . . . . . . . 14
1.7 An Outline of our Progress Towards Hard Problems . . 15

2 Single Variable Optimisation 17


2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Introduction to Fast Algorithms . . . . . . . . . . . . . . 20
2.2 Direct Search Algorithms . . . . . . . . . . . . . . . . . . 21

Bounded problems . . . . . . . . . . . . . . . . . . . . . 22
Unbounded problems . . . . . . . . . . . . . . . . . . . . 53
2.3 Quadratic Approximation Algorithms . . . . . . . . . . 55
2.4 Methods for Smooth Functions . . . . . . . . . . . . . . 67
Newton’s Method. . . . . . . . . . . . . . . . . . . . . . . 67
Secant Method . . . . . . . . . . . . . . . . . . . . . . . . 77

3 Unconstrained, Multi-Variable, Convex Optimisation 82


3.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Positive Definite Matrices . . . . . . . . . . . . . . . . . . 85
Quadratic functions . . . . . . . . . . . . . . . . . . . . . 86
Taylor Series . . . . . . . . . . . . . . . . . . . . . . . . . 89
Convex Functions . . . . . . . . . . . . . . . . . . . . . . 91
Conditions for Global Minimums . . . . . . . . . . . . . 98
3.3 Convex Optimisation Algorithms . . . . . . . . . . . . . 102
Descent Directions . . . . . . . . . . . . . . . . . . . . . 104
Steepest Descent . . . . . . . . . . . . . . . . . . . . . . . 106
Conjugate Directions . . . . . . . . . . . . . . . . . . . . 123
Conjugate Gradient Method . . . . . . . . . . . . . . . . 129
The Conjugate Gradient Algorithm for non-quadratic
functions – the Fletcher Reeves Algorithm. . . . 139

4 Constrained Convex Optimisation 144


4.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.2 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Some related problems and transforms between such . 154
Finding feasible points . . . . . . . . . . . . . . . . . . . 159
4.3 Background: Lagrange multipliers . . . . . . . . . . . . 160
4.4 KKT Conditions . . . . . . . . . . . . . . . . . . . . . . . 170
Proof of the KKT Conditions . . . . . . . . . . . . . . . . 175

KKT Conditions for other problems . . . . . . . . . . . . 186
KKT Conditions for Quadratic Programming . . . . . . 189
4.5 The Gradient Projection Algorithm . . . . . . . . . . . . 192
Background to the Gradient Projection Method . . . . . 194

5 Non-Convex Optimisation 221


5.1 Simulated Annealing . . . . . . . . . . . . . . . . . . . . 223
Acceptance functions . . . . . . . . . . . . . . . . . . . . 225
The annealing schedule . . . . . . . . . . . . . . . . . . . 227
The algorithm . . . . . . . . . . . . . . . . . . . . . . . . 229
5.2 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . 231
Chromosome Encoding . . . . . . . . . . . . . . . . . . . 235
Crossover . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
Selection algorithms . . . . . . . . . . . . . . . . . . . . . 239
Parameters of GAs . . . . . . . . . . . . . . . . . . . . . . 241
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.3 Other Approaches . . . . . . . . . . . . . . . . . . . . . . 243

6 Conclusion 245

Bibliography 246

Index 248

LIST OF FIGURES

1.1 Two, 2D optimisation problems. . . . . . . . . . . . . . . . . 6


1.2 The monkey saddle. . . . . . . . . . . . . . . . . . . . . . . . 6

2.1 Unimodality examples. . . . . . . . . . . . . . . . . . . . . . 22


2.2 Example 1: f(x) = 1 − e^{−x/2} ln(x/5). . . . . . . . . . . . . 27
2.3 Example 2: f(x) = e^{x/5} sin(x/5). . . . . . . . . . . . . . . 29
2.4 Finding bounds using a geometric search. . . . . . . . . . . 54
2.5 Quadratic approximation to a function f(x). . . . . . . . . 57
2.6 Method of tangents applied to g(x) = f′(x). . . . . . . . . . 73
2.7 A rough sketch of the type of convergence problems we
might see with Newton’s method. . . . . . . . . . . . . . . . 77

3.1 The defining property of a convex function. . . . . . . . . . 92


3.2 Convex and non-convex examples. . . . . . . . . . . . . . . 93
3.3 A convex function g (t ). . . . . . . . . . . . . . . . . . . . . . 95
3.4 The level curve of a function with tangent and steepest
    descent direction. . . . . . . . . . . . . . . . . . . . . . . . . 106

4.1 An example of a traffic matrix. . . . . . . . . . . . . . . . . . 146

4.2 An illustration of the least-square solution. The point shows
the gravity model solution, and the dashed line shows the
subspace specified by a single constraint equation. The
least-square solution is simply the point which satisfies the
equation which is closest to the gravity model solution. The
weighted least-squares solution gives different weight to
different unknowns. . . . . . . . . . . . . . . . . . . . . . . . 148
4.3 Convex and non-convex sets. . . . . . . . . . . . . . . . . . . 149
4.4 Two convex cones. . . . . . . . . . . . . . . . . . . . . . . . . 157
4.5 An illustration of Farkas’ Lemma (see Example 30). . . . . 158
4.6 We wish to maximise the area of the rectangle, subject to a
fixed perimeter constraint. . . . . . . . . . . . . . . . . . . . 161
4.7 We wish to maximise the area of the rectangle, subject to it
fitting inside a circle with diameter 1. Note that the diagonal
of the rectangle is just the diameter of the circle. . . . . . . 162
4.8 An illustration of the effect of a Lagrange multiplier
    using f(x) = 2x_1² + 2x_1x_2 + x_2² − 10x_1 − 10x_2 and constraint
    g(x) = x_1² + x_2² − 5 = 0. We can see that (4.14) is enforced
    here with λ = 1. Intuitively we can see why — the minimum
    occurs just at the point where the level curves of f(x) (the
    orange ellipses) just touch the feasible region curve g(x) = 0
    (shown in blue). The critical level curve is f(x) = 20 (shown
    in green). When they just touch, their tangents must be the
    same (as they are continuously differentiable) and hence,
    the normals defined by the gradients must be aligned. . . . 166
4.9 An illustration of the fact that ∇f(x*) = −Σ_{i∈I(x*)} λ_i ∇g_i(x*)
    for λ_i ≥ 0 for any global minimum. The shaded region
    indicates the feasible set. We know that −∇g_i(x*) must point
    into this set because the constraints are g_i(x*) ≤ 0. The
    green shaded sub-region indicates the convex cone of
    viable directions for ∇f. . . . . . . . . . . . . . . . . . . . . 172

4.10 An illustration of Example 38. The green region is the feasi-
ble set Ω. The orange lines are the elliptical level curves of
f (x). The four cases of solutions are illustrated by dots. . . 182
4.11 A closeup of Figure 4.10 showing the gradients, and the
     convex cone of {−∇g_1, −∇g_2} (in green) at the point x where
     I(x) = {1, 2}. As ∇f(x) is outside this cone, this can't be an
     optimal point. . . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.12 A closeup of Figure 4.11, with the green region showing fea-
sible descent directions from the point I (x) = {1, 2}. As there
are feasible descent directions, we can’t be at the minimum. 184
4.13 Decomposition of −∇f(x) into u + q, where u and q are
     chosen to be orthogonal, i.e., uᵀq = 0. . . . . . . . . . . . . 194
4.14 Orthogonal projection of the vector y into the space S(x)
using the projection matrix P S . . . . . . . . . . . . . . . . . . 196
4.15 A side on view of the orthogonal projection problem. . . . 197
4.16 The two cases where u = 0. . . . . . . . . . . . . . . . . . . . 207
4.17 Example 41. The feasible region is shown in green. Level
curves of f (x) are shown as the orange concentric circles. It
may seem unlikely that we would exactly hit a vertex, but
the following figure shows what would happen if we started
at x0 = (1.5, 4). . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4.18 Example 41 but with an alternative starting point x0 = (1.5, 4).
The feasible region is shown in green. Level curves of f (x)
are shown as the orange concentric circles. . . . . . . . . . 220

5.1 Non-convex optimisation problem. We can see that a descent
    algorithm would get caught in a local minimum unless
    we just happened to choose a good starting point. . . . . 222
5.2 Illustration of the properties of acceptance functions. As
    δf increases, or temperature decreases, the probability of
    acceptance decreases. . . . . . . . . . . . . . . . . . . . . . . 227

5.3 Examples of “genetic art”. . . . . . . . . . . . . . . . . . . . . 232
5.4 Selection methods. . . . . . . . . . . . . . . . . . . . . . . . . 240

LIST OF ALGORITHMS

1 The Dichotomous Search . . . . . . . . . . . . . . . . . . . 25


2 The Dichotomous Search (efficient version). . . . . . . . . 31
3 The Golden Ratio Search . . . . . . . . . . . . . . . . . . . 37
4 The Fibonacci Search . . . . . . . . . . . . . . . . . . . . . 48
5 Geometric Search for Bounds . . . . . . . . . . . . . . . . 55
6 Geometric Search for Three Bracketing Points. . . . . . . 62
7 The DSC algorithm. Note that changes to the h value inside
  Algorithm 6 should not change it here (side effects like
  this should be avoided using simple scope rules). Similarly
  the counter k used in this algorithm should be different
  from that inside Algorithm 6. If this comment makes no
  sense to you, then go and look up the idea of scope (and
  don't use global variables). . . . . . . . . . . . . . . . . . . 63
8 Newton’s Method. . . . . . . . . . . . . . . . . . . . . . . . 70
9 Secant Method. . . . . . . . . . . . . . . . . . . . . . . . . . 78

10 The Steepest Descent Algorithm. . . . . . . . . . . . . . . 107


11 The Conjugate Gradient Method. . . . . . . . . . . . . . . 130
12 The Fletcher Reeves Algorithm. . . . . . . . . . . . . . . . 142

13 The Gradient Projection Algorithm. Note that we have
   omitted checking for linear independence of the rows
   of M at each step for simplicity of exposition (this step is
   tricky because you don't want to do it every time as M is
   sometimes repeated). . . . . . . . . . . . . . . . . . . . . . 214

14 The Metropolis-Hastings Algorithm. The function random()
   is assumed to generate a uniform random variable on
   (0, 1), and P(accept | Δf) is given by (5.3). . . . . . . . . . 230
15 A Genetic Algorithm (or at least the shell of one). . . . . . 234

PREFACE

Optimisation is a task all human beings, indeed all living things,


do. It is central to any decision making task, i.e., in any task involving
choosing between alternatives. Choice is governed by wanting to make
the “best” decision:
• minimize the cost of producing a widget;

• shortest route to Hungry Jacks, or the bar; or

• getting greatest exam mark, given a limited amount of study


time.

All involve looking for the best solution to some objective, often subject
to some constraints.
We can even think of natural processes such as evolution as a form
of optimisation, and indeed genetic algorithms and evolutionary com-
puting deliberately exploit this metaphor to solve other optimisation
problems we may wish to solve.
This course is about solving complicated optimisation problems
involving continuous, real variables. They’re complicated because they
can involve many variables and constraints, and because the objective
and constraints can be non-linear.

Apart from teaching one how to solve such problems, the course is
concerned with a key part of all applied mathematics: translation of a
real-world problem into mathematics. In general, this is a challenging
task. It seems, from what you may have studied, that this is easy in
optimisation. After all, cost or profit sound like well defined things
that could be written mathematically. However, the challenge is that
although we might be able to write them, the resulting optimisation
problem may be intractable. Sometimes it isn’t even possible to write
down all the factors that go into an objective function, and constraints
are similarly complicated.
The job of a mathematician is often to approximate these factors
in a way that is both reasonable (from the point of view of solving the
real-world problem), and mathematically tractable. To do so, we have
to get to the nub of “what is hard” in optimisation, and that is at the
root of the structure of this course.


§1
INTRODUCTION TO OPTIMISATION

§1.1 OPTIMISATION
Optimisation is a task all human beings, indeed all living things,
do. It is central to any decision making task, i.e., in any task involving
choosing between alternatives. Choice is governed by wanting to make
the “best” decision:

• minimize the cost of producing a widget;

• shortest route to Hungry Jacks, or the bar; or

• getting greatest exam mark, given a limited amount of study


time.

All involve looking for the best solution to some objective, often subject
to some constraints.
We can even think of natural processes such as evolution as a form
of optimisation, and indeed genetic algorithms and evolutionary com-
puting deliberately exploit this metaphor to solve other optimisation
problems we may wish to solve.
In general, we define an optimisation problem in three pieces: the


variables – these are the choices we can make, or the things we can
control. In this course they will often be given in a vector, e.g., x,
of real numbers.

objective – this is the goal. It’s usually to minimise or maximize some


function of the variables, which we sometimes call the cost, or
utility, or profit function.

constraints – these are the limits on the possible choices we can make
(usually in the form of inequalities with respect to the variables).

Optimisation problems are categorised, or named, based on the
characteristics of these components. For instance:

• When the objective function is convex (for a minimization problem),
  and the constraints define a convex set, the problem is
  called a convex program.

• When the objective is a quadratic function, and the constraints
  are linear (equalities and inequalities), the problem is called a
  quadratic program. This is a special case of convex programming.

• When the constraints and objective are linear functions, the


problem is called a linear program. This is a special case of
convex programming (and of quadratic programming).

• When the variables take integer values, the problem is called an


integer programming problem. Integer programs are, in general,
non-convex, and much harder to solve.

Generally, we can see different optimisation problems as branching
out from general to specific: of particular relevance here is the generic
problem, which can be restricted to convex, then to quadratic, and
then to linear programming.
Note the use of “program” here. It doesn't mean a computer program¹;
it is using the older sense of “a plan, schedule, or procedure.”
Often the goal of one of our programs is to determine a schedule (it's
one of the oldest mathematically treated optimisation problems).
However, we will be using computer programs to solve our opti-
misation problems, so I will usually refer to them as problems, except
where the historical convention is very strong.

§1.2 EXAMPLES
One of the most challenging problems in Applied Mathematics is
translating a real-life optimisation problem into a mathematical formu-
lation. Real-life problems are stated in (vague and fuzzy) words, and
are complex (sometimes too complex for us to solve the real problem).
There is often a requirement to approximate the real problem by
something mathematically and computationally tractable.

EXAMPLES SEEN IN PREVIOUS COURSES:


1. Linear Programming Example:

   maximise   z = 3x_1 + 4x_2 − 5x_3
   such that  x_1 + x_2 + x_3 ≤ 7
              x_1 − x_3 ≤ 2
              x_1, x_2, x_3 ≥ 0

   Soln: z* = 26 at x* = (2, 5, 0).
¹ Incidentally, the term programming was used for optimisation at least in 1948
by Dantzig [1, 13], well before it was used for computer programming. For a nice,
quick history of optimisation see [8].

We know of at least one honours graduate from Adelaide who was


solving these types of problems for a living. He was consulting
for Carlton United Breweries, solving location and inventory
problems for shipping beverages around the Eastern states. This
was a large LP.

2. Differentiable functions of 1 variable: e.g., construct an open
   box of greatest volume from a sheet of cardboard a × b cm, by
   removing a square of side x from each corner.

   Mathematical formulation:

   maximise   V = (b − 2x)(a − 2x)x
   such that  0 ≤ x ≤ a/2
              0 ≤ x ≤ b/2

   Solve by putting V′(x) = 0 and checking V″(x) < 0 to ensure it is
   a maximum.
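   Although this example is easily solved by hand, it also makes a
   handy first test case for numerical methods. Below is a minimal
   Matlab sketch (our own illustration, not part of the original notes);
   the sheet dimensions a = 20 and b = 30 are assumed purely for
   illustration, and the built-in fminbnd is applied to −V to maximise V.

       % A sketch only: maximise V by minimising -V with fminbnd.
       a = 20; b = 30;                        % assumed sheet dimensions (cm)
       V = @(x) (b - 2*x).*(a - 2*x).*x;      % box volume as a function of x
       [xstar, negV] = fminbnd(@(x) -V(x), 0, min(a, b)/2);
       fprintf('cut x = %.4f cm gives volume %.2f cm^3\n', xstar, -negV)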

3. Differentiable functions of two variables:

   Let f: R² → R, and

       Δ = | f_xx  f_xy |
           | f_yx  f_yy |  = f_xx f_yy − f_xy f_yx .

   Then if f_x = f_y = 0 at a point (x, y), it is a stationary point, and

   a) If Δ > 0 and f_xx (or f_yy) < 0 then we have a maximum.
   b) If Δ > 0, and f_xx (or f_yy) > 0 then we have a minimum.
   c) If Δ < 0, we have a saddle point.
   d) If Δ = 0, then we need to investigate further (by looking at
      higher order derivatives).

For instance, consider f(x_1, x_2) = x_1² − x_2² + x_1³, which has a local
maximum at (−2/3, 0), and a saddle point at (0, 0), as shown in
Figure 1.1a.
However, that can make it seem simpler than it is. Figure 1.1b
shows the case where f(x_1, x_2) = r − ½r², where r² = x_1² + x_2². In
this case, the maxima occur on the curve r = 1, and a minimum
at r = 0.
This case also highlights the difference between a global maximum
(or minimum) and a local maximum (or minimum). The
local extrema can be defined by local conditions (on the derivatives),
but we can only find global extrema by examining all of
the alternatives (including the function values on the boundaries
of the region of allowed variable values).
(a) A 2D optimisation problem with a local maximum and a saddle point.
(b) A problem where the maxima occur on a ring.

Figure 1.1: Two, 2D optimisation problems.

Figure 1.2: The monkey saddle.

A third, yet more complicated case is that of the monkey saddle,
given by f(x_1, x_2) = x_2³ − 3x_1²x_2, shown in Figure 1.2, where the
stationary point at (0, 0) looks a little like a saddle designed for
a creature with a tail (e.g., a monkey). The monkey saddle is a
little strange compared to the conventional saddle point shown
in Figure 1.1a, because the conventional saddle point can be
thought of as a maximum in one direction, and a minimum in the
other. The monkey saddle, by contrast, looks like a maximum in
one direction, and a stationary point of inflection in the other.

SOME MORE DIFFICULT EXAMPLES


Now some examples that would be hard to solve using the techniques
you already know.

1. Quadratic Programming Example:

   maximise   z = 3x_1x_2 + 4x_2x_3 − 5x_3x_1
   such that  x_1 + x_2 + x_3 ≤ 7
              x_1 − x_3 ≤ 2
              x_1, x_2, x_3 ≥ 0

Here the objective function is non-linear so we can’t use simplex


to solve it, and looking for the stationary points of f (·) won’t
always help because the solution is likely to be on the boundary.

2. Unknown or undefined gradient: Consider a problem where we
   can calculate the function f(x), but we don't know its gradient,
   or its slope doesn't even exist, e.g., minimise the function |x|. At
   the minimum, the slope doesn't exist!

3. Optimal Chess Strategy: If we can describe a strategy by a set of


numbers (and we certainly can for some chess strategies), then
we could aim to find the optimal strategy. Here our objective
function might be the percentage of games we can win against

a certain set of opponents. The critical thing about this type of


problem is that the function we wish to optimise is difficult (i.e.,
time consuming) to calculate. One of our requirements, in this
case, is that we find the optimum strategy with as few function
evaluations as possible.

There are lots of other optimisation problems that cause trouble for
the optimisation techniques you have been taught so far, but the cases
above are the types of problems that we will tackle in this class.

§1.3 NOTATION AND CONVENTIONS


Throughout these notes we will try to use consistent notation.
Lower-case letters, e.g., x, will generally denote scalars; boldface
letters, e.g., x, will denote (column) vectors, i.e.,

    x = (x_1, x_2, . . . , x_n)ᵀ,

and we say x ∈ Rⁿ; and upper case letters, e.g., A, will denote matrices.
Optimisation problems will be described in the following form (details
will vary, but the general form remains the same):

Problem 1.1. Find the minimiser x of the function f: Rⁿ → R
such that x satisfies some set of constraints.

We talk of finding the minimiser of the function as the value of the


variables that minimises it, rather than just minimising the function
because we usually want to know x, not just f (x).
In most cases the function and constraints should be defined a little
more precisely, though we may not write f (·) explicitly, and we may
often omit the statement that we are searching for the minimiser: for
instance we might write the following.

Problem 1.2.

maximise   z = 3x_1 + 4x_2 − 5x_3
such that  x_1 + x_2 + x_3 ≤ 7
           x_1 − x_3 ≤ 2
           x_1, x_2, x_3 ≥ 0

We're interested, primarily, in single-valued, real functions of real
variables, i.e., functions f: Rⁿ → R, e.g.,

(i) f(x) = ‖x‖ = sqrt( Σ_{i=1}^n x_i² ), the Euclidean norm.

(ii) f(x) = xᵀAx.

     For example, when A = diag(a_1, a_2, . . . , a_n),
     f(x) = Σ_{i=1}^n a_i x_i² (a diagonalised quadratic).

(iii) f(x_1, x_2, x_3) = sin(x_1 + x_2) − x_3².

Other common notation used here includes

1. from Maths 1, the gradient of f is the vector

       ∇f(x) = (D_1 f(x), D_2 f(x), . . . , D_n f(x))ᵀ,

   where D_i denotes differentiation of f(x) with respect to the i-th
   component of x. That is, if the i-th component is x_i,

       D_i f(x) = ∂f/∂x_i   and   ∇f(x) = (∂f/∂x_1, ∂f/∂x_2, . . . , ∂f/∂x_n)ᵀ.

2. The Hessian of f, denoted H(x) or H_f(x), is the matrix with (i, j)-th
   component
       D_ij f = D_i (D_j f).
   If the i-th component of x is x_i then

       H_f(x) = ( ∂²f(x) / ∂x_i ∂x_j ) = ∇²f(x).          (1.1)

   This is a matrix of order n × n, if f: Rⁿ → R. If the second partial
   derivatives of f are continuous, then

       ∂²f/∂x_i ∂x_j = ∂²f/∂x_j ∂x_i,

   and hence Hᵀ = H, i.e., H is symmetric.



Examples.

(i) For f(x) = ‖x‖ = sqrt(x_1² + x_2² + · · · + x_n²),

        ∇f = x / ‖x‖,

    for all ‖x‖ ≠ 0, i.e., x ≠ 0.

(ii) For f(x) = 4x_1² + 2x_1x_2 − 3x_2² we can write

        f(x) = ½ (x_1, x_2) [ 8  2 ; 2  −6 ] (x_1 ; x_2) = ½ xᵀAx,

     and

        ∇f(x) = (8x_1 + 2x_2, 2x_1 − 6x_2)ᵀ = Ax,
        H(x) = [ 8  2 ; 2  −6 ] = A (= Aᵀ).

(iii) In fact, for any quadratic of the form

        f(x) = ½ xᵀAx + xᵀb + c   (where A = Aᵀ),

      we have

        ∇f(x) = Ax + b,
        H = A.

      Thus for quadratics, H is constant. This is not usually the
      case though!
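As a quick numerical sanity check of these formulas, the following
Matlab sketch (ours, not from the notes; it reuses the A of example (ii)
with an assumed b and c) compares a central-difference estimate of
the gradient against Ax + b:

    % Sketch: check that grad f = A*x + b for f(x) = (1/2)x'Ax + x'b + c.
    A = [8 2; 2 -6]; b = [1; -1]; c = 0;    % A from example (ii); b, c assumed
    f = @(x) 0.5*(x'*A*x) + x'*b + c;
    x = [1; 2]; h = 1e-6; g = zeros(2,1);
    for i = 1:2
        e = zeros(2,1); e(i) = h;
        g(i) = (f(x + e) - f(x - e))/(2*h); % central difference, direction i
    end
    disp([g, A*x + b])                      % the two columns should agree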

§1.4 QUESTIONS TO PONDER


(a) Does the minimum of f actually exist? Answer: not always!

1. Minimise f(x) = ln x with x ∈ (0, 1).

   [sketch: y = ln x on (0, 1), decreasing without bound as x → 0]

   The function isn't bounded below, so we can't find a minimum.

2. A discontinuous function:

   [sketch: a function with a jump at x = a, bounded below by 1]

   The minimum is not 1 since that value is never reached! f
   is bounded below (by 1) but the minimum Does Not Exist
   (DNE).

(b) Is a minimum local or global?

1. By definition every global minimiser (maximiser) is a local
   minimiser (maximiser), but not vice versa.
2. A global minimiser (maximiser) isn't necessarily unique,
   e.g., Figure 1.1b.
3. There may be very many local minimisers, so how do we
   find the global minimiser? How could we guarantee it is
   global without examining every local minimiser?

§1.5 EXPECTED BACKGROUND


This class assumes minimal background knowledge, aiming to get
students through to the important results, but nevertheless we must
make some assumptions.
A basic level of linear algebra, as we teach in 1st Year Maths at
Adelaide, is expected, including a knowledge of

• matrix and vector operations;

• linear independence;

• vector spaces, and subspaces;

• a set of basis vectors;

• eigenvalues and eigenvectors.

Multi-variable calculus is required for this course, primarily knowl-


edge of differentiation (integral calculus and differential equations will
not be used). For instance, students are expected to know how to take
partial derivatives, and the multivariable versions of

• the Chain rule;



• Taylor’s theorem.
Some coding will be required in this class. The computer lan-
guage/toolset of choice is Matlab. Students will be expected to be
able to create functions in Matlab, and use function handles.
Some instruction/revision in the above topics will be included in
the course, but students who are not already familiar with these topics
may struggle. If you don’t remember this material, please brush up on
it ASAP.
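For instance, a function handle packages a formula as a value that can
be passed to other functions; a small sketch (ours, not from the notes):

    % A small example of Matlab function handles (a sketch only).
    f = @(x) 1 - exp(-x/2).*log(x/5);  % f(x) = 1 - e^{-x/2} ln(x/5)
    f(6.5)                             % evaluate f at a point
    fplot(f, [2 10])                   % plot f over the interval [2,10]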

§1.6 WHAT ARE WE TRYING TO TEACH?


This course has multiple objectives.
Starting at the most general, a key part of all applied mathematics
is translation of a real-world problem into mathematics. In general,
this is a challenging task. It seems, from what you may have studied,
that this is easy in optimisation. After all, cost or profit sound like
well defined things that could be written mathematically. However,
the challenge is that although we might be able to write them, the
resulting optimisation problem may be intractable. Sometimes it isn’t
even possible to write down all the factors that go into an objective
function, and constraints are similarly complicated.
The job of a mathematician is often to approximate these factors
in a way that is both reasonable (from the point of view of solving the
real-world problem), and mathematically tractable.
A second objective is that of teaching algorithms. You will most
likely have already come across algorithms in your earlier subjects, but
in this course we shall pick them apart in detail, and look at how to
optimise our optimisation algorithms. We’ll also have a look at some
of the issues that surround implementation of algorithms to solve real
problems. How do you go from mathematical specification to working
code?

On the other hand, we also want to be able to prove the properties


of these algorithms, not just examine them empirically.
In the process, we will have to learn some more linear algebra and
multivariable calculus than you currently know, and that can only be
useful.
Finally, we hope to get to the nub of “what is hard” in optimisation.
That leads us back to tractability. You need to have an intuitive grasp of
what is possible, and what is hard, to be able to effectively approximate
real problems.

§1.7 AN OUTLINE OF OUR PROGRESS TOWARDS HARD PROBLEMS

The basic approach of the course is to start with the simplest prob-
lems and generalize. That has two advantages:

• we get to work up to complicated problems slowly (hopefully I


won’t scare you off that way); and

• we can use the techniques for simple problems as subcomponents
  of our more complicated methods for more general problems.

The approach we will take is to first look at problems in one
dimension. You may think you are familiar with these (I hope you are in
one sense), but what we have taught you in previous courses probably
hasn't considered the types of issues described in §1.2. What happens
if the derivative isn't defined, or is hard to calculate? What happens if
the function itself is expensive to calculate? We'll concentrate on solving
these simple problems very quickly as these will form a key component
of later algorithms.

We will then generalize the problems we consider to multiple di-


mensions, but without constraints. There are lots of different ways to
tackle such problems, some of which are targeted as simpler cases such
as quadratric or convex problems, and others that are more general.
We will then generalize to include constraints for convex optimisa-
tion problems. The critical concept here is the Kuhn-Tucker condition
for optimality, but we’ll also need algorithms to find solutions.
Finally, we will very briefly consider non-convex problems. This
is a hard class of problems in general; in fact it is impractical to
ever find the actual optima of any but the smallest of these problems,
but there are some clever modern heuristics for finding approximate
optima, and we will briefly consider these.

§2
SINGLE VARIABLE OPTIMISATION

We'll start by considering, perhaps, the simplest case covered in
this course: the case of searching for the minimiser¹ of a function of a
single variable, on some interval:

Problem 2.1. Find the minimiser x of the function

    f: [a, b] → R,   (a < b ∈ R).

We could try putting f′(x) = 0 as mentioned earlier, but . . .

1. Is f differentiable, even continuous? The simple
   function f(x) = |x| is not differentiable.
2. Can we solve f′(x*) = 0 for x*? If f(x) = e⁻ˣ ln x,
   f′(x) = e⁻ˣ (1/x − ln x) = 0 when x ln x = 1
   – we cannot solve this analytically, only numerically.
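To illustrate the numerical alternative, a one-line Matlab sketch
(ours, not from the notes; the starting guess of 2 is an arbitrary
assumption) using the built-in fzero:

    % Sketch: solve x*ln(x) = 1 numerically with fzero.
    xstar = fzero(@(x) x.*log(x) - 1, 2)   % approximately 1.7632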

¹ Throughout these notes we will focus on finding minima, but we can easily
convert all of our techniques to finding maxima simply by considering −f(·).


Thus already we need to consider differing cases for f:

1. direct methods, where no smoothness (differentiability)
   of f is assumed;
2. methods where f can be approximated by a smooth
   function (normally a quadratic or cubic); and
3. methods where functions are differentiable, e.g. the
   Newton–Raphson method.

We'll consider a number of approaches that can be used, depending
on which type of function we allow, but we'll first do a quick reprise of
the basics of minimisation of functions of one variable.

§2.1 BACKGROUND
If our function is differentiable, we look for stationary points (where
the derivative is zero) first. Why should we look for these?
Well, first let us make a formal definition of a local minimum:

Defn 2.1. An interior point x ∈ (a, b) is called a local minimum
of f(·) if there exists a δ > 0 with f(x) ≤ f(x̂) for all
x̂ ∈ (x − δ, x + δ).

Defn 2.1 is the definition of a weak local minimum because it allows
f(x) ≤ f(x̂). We could also define a strict local minimum by requiring
f(x) < f(x̂) for x̂ ≠ x. We can also make equivalent definitions for local
maxima.

Very quickly, we can see why minima must be stationary points by
considering the Taylor series approximation of the function f(x), i.e.,

    f(x + ε) = f(x) + ε df/dx + (ε²/2) d²f/dx² + · · · .

We know that if x is a local minimum, then for all small ε, we must
have f(x + ε) − f(x) ≥ 0, but note that for small ε we can approximate

    f(x + ε) − f(x) ≈ ε df/dx,

whose sign depends on ε, i.e., the only way that this can be non-negative
for both positive and negative ε is if df/dx = 0 at the minimum.
We can then classify these into local minima, maxima or stationary
points of inflection by considering the second derivative in the Taylor
series. If

• f″(x) > 0 we have a local minimum;

• f″(x) < 0 we have a local maximum; and

• f″(x) = 0 we may have a stationary point of inflection, depending
  on higher order derivatives, or it may still be a local
  extremum (e.g. x⁴).

The classification follows again directly from the Taylor series by
looking at the sign of the second-order term.
A more general result (also derived from Taylor's theorem, but this
time to higher order) is given below:

Theorem 2.1. Let f(x): [a, b] → R have derivatives of all
orders; then a necessary and sufficient condition for a local
minimum is that for some n

    f′(x) = f″(x) = · · · = f^{(2n−1)}(x) = 0   and   f^{(2n)}(x) > 0.

Proof. By Taylor's theorem, with x̂ − x = ε,

    f(x̂) = f(x) + ε f′(x) + · · · + ε^{2n−1}/(2n−1)! f^{(2n−1)}(x)
           + ε^{2n}/(2n)! f^{(2n)}(x) + O(ε^{2n+1}).

Then

    f(x̂) − f(x) = ε^{2n}/(2n)! f^{(2n)}(x) + O(ε^{2n+1})
                > 0 for small enough ε.
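To see the theorem in action, take f(x) = x⁴ at x = 0: the first nonzero
derivative there is the fourth, which is positive, so the theorem gives a
local minimum with n = 2. A small Matlab sketch (ours) checks this
using polynomial coefficient vectors:

    % Sketch: derivatives of f(x) = x^4 at x = 0, via polyder/polyval.
    p = [1 0 0 0 0];                 % coefficients of x^4
    for k = 1:4
        p = polyder(p);              % p now holds the k-th derivative
        fprintf('f^(%d)(0) = %g\n', k, polyval(p, 0));
    end
    % prints 0, 0, 0, 24: the first nonzero derivative has order 2n = 4
    % and is positive, so x = 0 is a local minimum (n = 2 in Theorem 2.1).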

Any of the local minima could be the global minimum. Moreover,
given the problem is set on a finite interval [a, b], the global extremum
could be at either end point as well, so we have to check these points.
The above holds true only for differentiable functions. All bets are
off if the function has breaks in its derivative (consider again |x|) or is
discontinuous, but we can still find minima in many of these cases.

INTRODUCTION TO FAST ALGORITHMS


The other issue to discuss before we get started is “what makes a good
algorithm?” In answer, we will judge algorithms here by their speed.
There are other criteria, for instance the amount of memory they
require to store variables; however, this is less of an issue in modern
computers. The amount of memory (and more importantly, high-speed
cached memory) on modern computers is large, and so for
any reasonably sensible algorithm (for line searches) the memory
requirements should not be a large issue.
However, speed is not quite as trivial to calculate as you may think.
There are two aspects to fast code:

1. minimising the number of operations in the code (the number
   of additions, subtractions, multiplications, and other basic
   operations), as part of the algorithm; and

2. minimising the number of times we need to evaluate the objec-


tive function itself.

The choice of which is most important depends on the “cost” of
evaluating the objective function. If it were, say |x|, then the cost is negligible,
and so we concentrate on reducing the number of other operations. If
we were solving the chess strategy example, where the time to evaluate
a particular strategy is very large, then this is the dominant factor. We’ll
see algorithms that tackle both aspects below.
There are other aspects of writing fast code, but we will discuss
them at the relevant point in our description.

§2.2 DIRECT SEARCH ALGORITHMS


Direct search techniques are reminiscent of the binary search in
computer science, but they work by bracketing the solution. That’s
easier if we start with a bracket, i.e., we start knowing that the solution
lies in the interval [a, b], so we’ll consider bounded problems first. We
then diverge a little to consider how to solve unbounded problems.

BOUNDED PROBLEMS
These are often referred to as bracketing methods, because we assume
we have an interval [a_0, b_0] that definitely contains the minimiser. We
start with absolutely minimal assumptions about the function to be
optimised. We do assume f(x) is unimodal over [a_0, b_0], i.e., it has
only one local minimiser in [a_0, b_0]. That's so that we know the local
minimiser will be the global minimiser². More formally,

Defn 2.2. A function f(x) is unimodal if f is monotonically
decreasing on [a_0, p) and monotonically increasing on
(p, b_0], where p is the local minimiser.

Figures 2.1a and 2.1b show examples of unimodality and multimodality.
Note that it is common in statistics for unimodal to mean a function
with a single peak, but we use it here to mean a single minimum.

(a) A unimodal function.   (b) A multimodal function.

Figure 2.1: Unimodality examples.

² We can weaken even that requirement as well, but we'll do so later.

Then our first real problem will be

Problem 2.2. Find the minimiser x of the unimodal function

    f: [a_0, b_0] → R,   (a_0 < b_0 ∈ R),

assuming nothing about differentiability of f.

Basic Idea: choose points in [a_0, b_0] at which to evaluate
f, and on the basis of these evaluations, eliminate
part of [a_0, b_0] from our search interval, thereby reducing
the size of the search interval at each iteration.

Suppose we choose any point p ∈ (a_0, b_0) and find f(p). Knowing
f(p) < f(a_0) and f(p) < f(b_0) does not tell us whether x* ∈ [a_0, p] or
[p, b_0]. However, if we choose 2 points p, q ∈ (a_0, b_0) with p < q,
then if f is unimodal, we have

    f(p) < f(q) ⇒ x* ∈ [a_0, q]
    f(p) > f(q) ⇒ x* ∈ [p, b_0]

and we can then search over this new (reduced) interval.
Question: how do we choose p and q?
We want to ensure we choose the points so that we reduce our
workload as much as possible. We will measure this by trying to get
a small interval containing x*, using as few function evaluations as
possible. Our first approach is analogous to the bisection search of
computer science³ in that at each step it tries to halve the region in
which the minimiser might lie.
³ Remember it isn't the same as the bisection search, though.

Dichotomous Search Methods

Here we halve the search interval at each step (dichotomy ≡ 2 pieces).
It works as follows.
Given a unimodal function f over [a_0, b_0], we divide the search
interval [a_0, b_0] into quarters at d, c, e as shown, and evaluate the
function at each of these points (later we'll see how to avoid repeating
evaluations by reusing these points).

    a_0 ---- d ---- c ---- e ---- b_0

Now:
  if f(d) < f(c), the min of f lies in [a_0, c] (set [a_1, b_1] = [a_0, c]);
  if f(d) ≥ f(c) and f(c) > f(e), the min of f lies in [c, b_0] (set [a_1, b_1] = [c, b_0]);
  if f(d) ≥ f(c) and f(c) ≤ f(e), the min of f lies in [d, e] (set [a_1, b_1] = [d, e]).

In each case, the new search interval is half the length of the old one.
Algorithm 1 shows how the Dichotomous Search is put together out of
successively following these steps.

input: The initial bracketing interval [a 0 , b 0 ] and ε which is the


maximum size of the final bracketing interval.
output: A final bracketed interval [a k , b k ] where b k − a k ≤ ε.
Initialisation: k = 0;
while b k − a k > ε do
ak + bk ak + ck bk + ck
Let c k = ; dk = ; ek = ;
2 2 2
Evaluate f (c k ), f (d k ) and f (e k );
if f (d k ) < f (c k ) then
a k+1 = a k ;
b k+1 = c k ;
else
if f (c k ) > f (e k ) then
a k+1 = c k ;
b k+1 = b k ;
else
a k+1 = d k ;
b k+1 = e k ;
end
end
Set k = k + 1 ;
end
Set N = k ;

Algorithm 1: The Dichotomous Search


At the end of Algorithm 1 we will have been through the while
loop N times. We call each time through this main loop an
iteration.
At the end we will know that the minimiser x* lies in the final
interval [a_N, b_N], and we know that the length of the interval
is no more than ε.

We know the minimiser x* ∈ [a_N, b_N], so it can be
approximated by the midpoint of the interval, (a_N + b_N)/2,
with an error of (less than) ε/2.
In each iteration, the bracketing interval is halved. So, after N
iterations, the length of the bracketing interval is

    ℓ_N = b_N − a_N = (b_0 − a_0) / 2^N.

We can use this to calculate N, given an initial value of ε and [a_0, b_0].
If we want b_N − a_N ≤ ε at completion, then we need

    (b_0 − a_0) / 2^N ≤ ε
    N ≥ log₂( (b_0 − a_0)/ε ).

The number of iterations N must be an integer, so we choose it using
the ceiling function, which we denote ⌈x⌉, i.e., we take

    N = ⌈ log₂( (b_0 − a_0)/ε ) ⌉.

However, let us make an important distinction here: the calculation
of the number of iterations shown here is to ensure that the interval
has length less than ε. If we aim instead to ensure that the error in the
final estimate is less than ε, then remember that the final error is at
most half the final interval, and so we need one less iteration! Many
students make this error in the exam!
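A minimal Matlab sketch of Algorithm 1 follows (our own illustration,
not the dichotomous.m handout mentioned later; the function name is
invented). Calling it as dichotomous_sketch(@(x) 1 - exp(-x/2).*log(x/5), 2, 10, 1)
should reproduce the final bracket [6, 7] of Example 1 below.

    function [a, b] = dichotomous_sketch(f, a, b, tol)
    % Sketch of Algorithm 1: shrink the bracket [a,b] around the
    % minimiser of a unimodal f until its length is at most tol.
        while b - a > tol
            c = (a + b)/2; d = (a + c)/2; e = (b + c)/2;  % quarter points
            if f(d) < f(c)
                b = c;                 % minimiser lies in [a, c]
            elseif f(c) > f(e)
                a = c;                 % minimiser lies in [c, b]
            else
                a = d; b = e;          % minimiser lies in [d, e]
            end
        end
    end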

Example 1. Determine the minimum of f(x) = 1 − e^{−x/2} ln(x/5) (see
Figure 2.2) over the interval [2, 10], with an error of no more than 1/2
(hence the final interval length must be less than 1).

Figure 2.2: Example 1: f(x) = 1 − e^{−x/2} ln(x/5).

Solution. We want a final interval of length ε ≤ 1, and the initial interval
b_0 − a_0 = 8, and so N = ⌈log₂(8)⌉ = 3.

k   a_k        d_k        c_k        e_k        b_k        ℓ_k
0   2          4          6          8          10         8
1   4          5          6          7          8          4
2   6          6.5        7          7.5        8          2
3   6                     6.5                   7          1

k   f(a_k)     f(d_k)     f(c_k)     f(e_k)     f(b_k)
0   1.337085   1.030199   0.990923   0.991392   0.99533
1   1.030199   1          0.990923   0.989839   0.991392
2   0.990923   0.989827   0.989839   0.990464   0.991392
3   0.990923              0.989827              0.989839

The estimate of the minimiser using a dichotomous search is 6.5, with
an error of no more than ε/2 = 0.5.

Example 2. Determine the minimum of f(x) = e^{x/5} sin(x/5) over the
interval [−10, 3], with an error of no more than 0.05 (hence the final
interval length must be less than 0.1). The graph of the function is
shown in Figure 2.3.

Figure 2.3: Example 2: f(x) = e^{x/5} sin(x/5).

k   a_k         d_k         c_k        e_k         b_k         ℓ_k
0   −10         −6.75       −3.5       −0.25       3           13
1   −6.75       −5.125      −3.5       −1.875      −0.25       6.5
2   −5.125      −4.3125     −3.5       −2.6875     −1.875      3.25
3   −5.125      −4.71875    −4.3125    −3.90625    −3.5        1.625
4   −4.3125     −4.109375   −3.90625   −3.703125   −3.5        0.8125
5   −4.109375   −4.007812   −3.90625   −3.804688   −3.703125   0.40625
6   −4.007812   −3.957031   −3.90625   −3.855469   −3.804688   0.203125
7   −3.957031   −3.931641   −3.90625   −3.880859   −3.855469   0.101562
8   −3.957031               −3.931641              −3.90625    0.050781

The estimate of the minimiser using a dichotomous search is −3.931641,
with an error of no more than ε/2 = 0.05. The number of times the
interval has been halved is N = 8.



input: As in Algorithm 1.
output: As in Algorithm 1.
Initialisation: N = ⌈ log₂( (b_0 − a_0)/ε ) ⌉;
for k = 0, 1, . . . , N − 1 do
    Let c_k = (a_k + b_k)/2; d_k = (a_k + c_k)/2;
    Evaluate (if needed) f_c = f(c_k) and f_d = f(d_k);
    if f_d < f_c then
        a_{k+1} = a_k;
        b_{k+1} = c_k;
        f_c = f_d;
    else
        Let e_k = (b_k + c_k)/2;
        Evaluate f_e = f(e_k);
        if f_c > f_e then
            a_{k+1} = c_k;
            b_{k+1} = b_k;
            f_c = f_e;
        else
            a_{k+1} = d_k;
            b_{k+1} = e_k;
            f_c = f_c (NB: nothing need be done here);
        end
    end
end

Algorithm 2: The Dichotomous Search (efficient version).



The previous description of the algorithm was aimed at making


the algorithm look simple, but in fact it was a little careless. There is a
smarter way to perform the algorithm, where we reduce the number of
operations needed by

1. calculating N at the start, so that we don’t have to check the size


of the interval at each step;

2. reusing previously calculated values when we can; and

3. only calculating the points we need.

To illustrate, look at the second version of the algorithm in Algorithm 2,
or at the Matlab program dichotomous.m, which I will hand out later in
the course.
or at the Matlab program dichotomous.m, that I will hand out later in
the course.
Given N , we can calculate the number of function evaluations
required by (the more efficient version of) the Dichotomous Search.
When performing such analyses, we usually consider the worst case,
i.e., the case where we make the most possible calculations. For the
Dichotomous Search this happens when we go through to the first
else statement for every interval.
In this case, we calculate the number of function evaluations
as follows:

• In the first iteration, we have to evaluate the function
  f(x) at c_0, d_0 and e_0, so we have 3 function evaluations.

• In subsequent iterations, we can recall one of the
  values from the prior iteration, and so we only need
  2 function evaluations.

So the total number of function evaluations M is given by

    M = 3 + 2(N − 1) = 2N + 1,

where the 3 comes from the first interval reduction, and the
2(N − 1) from the remaining N − 1 interval reductions.

We will compare this result with later algorithms with the goal of
reducing it. It may not be at all obvious how we might go about such a
reduction, but here are a couple of suggestions.
Variations:

1. Take (a + b)/2 − α, (a + b)/2 + α as 2 points near centre, where


α > 0 and very small.

2. One could, of course, split the interval into any predetermined
   number of equal length subintervals (e.g. 3 subintervals, 4
   subintervals) rather than 2, but it has been shown (Kiefer, 1957)
   that interval halving as here is the best among the methods using
   equal length subintervals.

But what if we remove the necessity for equal length?



Golden Ratio Search, or Golden Section Method.

In the previous case we searched for a better bracketing interval using
points distributed uniformly. In this section we allow them to be
distributed unevenly. Our problem is still Problem 2.2, but this time
we will seek to reduce the number of function evaluations we need to
make to reduce the size of the interval to ε. To do so we need to

• choose the locations for points — we will choose two this time,
  at arbitrary locations p_k < q_k, both ∈ (a_k, b_k) — that help reduce
  the interval as fast as possible; and

• choose the points so that we can reuse function calls from one
  iteration to the next (much as we could reuse f_c in Algorithm 2).

Given we have two points p_k and q_k at which we will evaluate the
function, we can deduce that

• if f(p_k) < f(q_k) then x* ∈ [a_k, q_k]; and

• if f(p_k) > f(q_k) then x* ∈ [p_k, b_k].

We can use these to reduce the size of the future interval. The problem
is symmetric, so we may as well choose p_k and q_k symmetrically, i.e.,
we can choose q_k to be the same distance from b_k as p_k is from a_k.

Suppose then at each iteration we choose p_k and q_k using

    q_k − a_k = γ(b_k − a_k)   and   b_k − p_k = γ(b_k − a_k).

To have p_k < q_k we need 1/2 < γ < 1, but the actual
value of γ is not known yet. However, we can see from
the rules above that at each iteration we would retain a
proportion γ of the old search interval.

So the situation in the k-th iteration looks as follows:

    a_k ---- p_k ---- q_k ---- b_k

where [a_k, q_k] and [p_k, b_k] each cover a proportion γ of the interval.
But what is γ? To determine γ, we need to look at the next iteration.
Now w.l.o.g. suppose that f(p_k) < f(q_k), and so our new search
interval is [a_{k+1}, b_{k+1}] = [a_k, q_k]. We wish to place 2 points in this
interval using the same rule, i.e.,

    q_{k+1} − a_{k+1} = γ(b_{k+1} − a_{k+1})   and   b_{k+1} − p_{k+1} = γ(b_{k+1} − a_{k+1}),

with the same value of γ. We also wish to reduce the number of
function evaluations, so it makes sense to choose one of our new
points to coincide with p_k, which belongs to (a_k, q_k) and
for which we already know the function evaluation!

    a_k ------ p_k ------ q_k ------ b_k          (q_k − a_k = γ(b_k − a_k))
    a_{k+1} -- p_{k+1} -- q_{k+1} -- b_{k+1}      (q_{k+1} − a_{k+1} = γ(b_{k+1} − a_{k+1}))

with q_{k+1} = p_k, so we can reuse this point.

Reusing the point requires q_{k+1} = p_k, from which

    q_{k+1} − a_{k+1} = p_k − a_k = (1 − γ)(b_k − a_k),

and also, by definition of q_{k+1} (with respect to γ), we have

    q_{k+1} − a_{k+1} = γ(q_k − a_k) = γ[γ(b_k − a_k)],

so

    γ²(b_k − a_k) = (1 − γ)(b_k − a_k),

and as b_k ≠ a_k, we can divide by (b_k − a_k) to get

    γ² + γ − 1 = 0.

The positive solution to this quadratic is

    γ = (√5 − 1)/2   (≈ 0.61803),

the Golden Ratio or Golden Section, hence the name of the search.
Thus the length of the search interval is reduced by a proportion
γ = (√5 − 1)/2 at each iteration. We repeat the iterations until the
interval is smaller than our required tolerance ε. The resulting
algorithm is shown in Algorithm 3. The algorithm can be made slightly
more efficient at the expense of making it a little more confusing, so I
leave this step to the student.

input: The initial bracketing interval [a_0, b_0] and ε, the
       maximum size of the final bracketing interval.
output: A final bracketed interval [a_k, b_k] where b_k − a_k ≤ ε.
Initialisation: k = 0, γ = (√5 − 1)/2;
p_0 = b_0 − γ(b_0 − a_0) and q_0 = a_0 + γ(b_0 − a_0);
f_p = f(p_0) and f_q = f(q_0);
while b_k − a_k > ε do
    if f_p < f_q then
        [a_{k+1}, b_{k+1}] = [a_k, q_k];
        q_{k+1} = p_k;
        f_q = f_p;
        p_{k+1} = a_{k+1} + b_{k+1} − q_{k+1};
        f_p = f(p_{k+1});
    else
        [a_{k+1}, b_{k+1}] = [p_k, b_k];
        p_{k+1} = q_k;
        f_p = f_q;
        q_{k+1} = a_{k+1} + b_{k+1} − p_{k+1};
        f_q = f(q_{k+1});
    end
    Set k = k + 1;
end
Set N = k; // or precalculate using (2.1)

Algorithm 3: The Golden Ratio Search

We can calculate the number of function calls by noting that

1. in the initialisation we call f(·) twice; and
2. in each subsequent iteration we call it once.

However, note that the function call in the last iteration is unnecessary
(we don't use it again), and so if we write our code carefully we can
omit that call (I haven't done that here to make the algorithm clearer).
So the total number of function evaluations M is

    M = 2 + 1·(N − 1) = N + 1,

where the 2 comes from the initialisation and the N − 1 from the
remaining interval reductions.
Notice that for N reduction steps, we reduce the search domain
to γ^N of the original interval, i.e., N + 1 function evaluations result
in a final interval which is a fraction γ^N of [a_0, b_0]. So the number
of iterations required to reduce the interval to length ≤ ε is N, where

    γ^N (b_0 − a_0) ≤ ε

or

    N = ⌈ ln(ε/(b_0 − a_0)) / ln γ ⌉.          (2.1)

Obviously, as in Algorithm 2, we could predetermine N, rather than
testing ℓ_k = (b_k − a_k) ≤ ε at each iteration.
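Below is a minimal Matlab sketch of Algorithm 3 (ours; the function
name is invented). It applies the reuse rule p_{k+1} = a_{k+1} + b_{k+1} − q_{k+1}
directly, so each iteration costs a single new function evaluation;
calling golden_sketch(@(x) 1 - exp(-x/2).*log(x/5), 2, 10, 1) should
reproduce the final bracket of Example 3 below.

    function [a, b] = golden_sketch(f, a, b, tol)
    % Sketch of Algorithm 3: Golden Section search, returning a bracket
    % of length at most tol around the minimiser of a unimodal f.
        g = (sqrt(5) - 1)/2;               % gamma, the golden ratio
        p = b - g*(b - a); q = a + g*(b - a);
        fp = f(p); fq = f(q);
        while b - a > tol
            if fp < fq
                b = q; q = p; fq = fp;     % keep [a, q]; old p is the new q
                p = a + b - q; fp = f(p);  % one new evaluation
            else
                a = p; p = q; fp = fq;     % keep [p, b]; old q is the new p
                q = a + b - p; fq = f(q);  % one new evaluation
            end
        end
    end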

Example 3. Let's repeat Example 1 using the Golden Section Search.
As a reminder, the function is f(x) = 1 − e^{−x/2} ln(x/5), over [2, 10], and
we search for a minimiser with an error of no more than 0.5.
Solution. We want a final interval of length ≤ ε = 1, and so the number
of iterations will be N = 5. From the data, a_0 = 2 and b_0 = 10, and so

    q_0 = 2 + 8γ = 6.944272
    p_0 = 10 − 8γ = 5.055728.

We can then start to calculate the points, as shown in the following two
tables:
k   a_k       p_k       q_k       b_k       ℓ_k = b_k − a_k
0   2         5.05573   6.94427   10        8
1   5.05573   6.94427   8.11146   10        4.94427
2   5.05573   6.22291   6.94427   8.11146   3.05573
3   6.22291   6.94427   7.3901    8.11146   1.88855
4   6.22291   6.66874   6.94427   7.3901    1.16719
5   6.22291                       6.94427   0.72135

k   f(p_k)       f(q_k)       f(p_k) < f(q_k)?
0   0.99911517   0.98980051   >
1   0.98980051   0.99161851   <
2   0.99025551   0.98980051   >
3   0.98980051   0.9902925    <
4   0.98973678   0.98980051   <

When N = 5, ℓ_5 = b_5 − a_5 = 0.72135 < 1 = ε, so we stop (as expected).
We estimate x* by

    x̂ = (a_5 + b_5)/2 = 6.58359214.

The function has to be called M = N + 1 = 6 times to make the calculation.

Example 4. Repeat Example 2, i.e., minimise f(x) = e^{x/5} sin(x/5) over
[−10, 3] with error < 0.05.
This means we want a final interval of length ε = 0.1.
Let x̂, the midpoint of the final interval, be our approximation to x*.

k    a_k        p_k        q_k        b_k        f(p_k)     f(q_k)
0    −10        −5.03444   −1.96556   3          −0.30879   −0.25855
1    −10        −6.93112   −5.03444   −1.96556   −0.24577   −0.30879
2    −6.93112   −5.03444   −3.86223   −1.96556   −0.30879   −0.32234
3    −5.03444   −3.86223   −3.13777   −1.96556   −0.32234   −0.31349
4    −5.03444   −4.30998   −3.86223   −3.13777   −0.32060   −0.32234
5    −4.30998   −3.86223   −3.58551   −3.13777   −0.32234   −0.32082
6    −4.30998   −4.03326   −3.86223   −3.58551   −0.32225   −0.32234
7    −4.03326   −3.86223   −3.75653   −3.58551   −0.32234   −0.32201
8    −4.03326   −3.92756   −3.86223   −3.75653   −0.32240   −0.32234
9    −4.03326   −3.96793   −3.92756   −3.86223   −0.32238   −0.32240
10   −3.96793   −3.92756   −3.90261   −3.86223   −0.32240   −0.32239
11   −3.96793                         −3.90261

The number of times the interval has been reduced is N = 11; the
function f(x) has been called M = 12 times; the length of the final
interval is 13γ¹¹ = 0.06532 ≤ ε. The estimate of the minimiser of f(x)
using the Golden Section Search is −3.935, the midpoint of the final
interval, with an error of no more than 0.0327.

Example 5. Use the Golden Section Method to find the minimiser of
f(x) = 1 − e^{−0.5x} ln x, over [1, 9], with an error of no more than 0.5.
Solution. Again |b_0 − a_0| = 8, and ε = 1, so N = 5 as above.
a_0 = 1, b_0 = 9 ⇒ q_0 = 1 + 8γ = 5.944272, p_0 = 9 − 8γ = 4.055728.

k   a_k      p_k        q_k        b_k        f(p_k)              f(q_k)
0   1        4.055728   5.944272   9          0.81571997   <   0.90875065
1   1        2.888542   4.055728   5.944272   0.74974954   <   0.81571997
2   1        2.16719    2.888542   4.055728   0.73828870   <   0.74974954
3   1        1.72135    2.16719    2.888542   0.77033200   >   0.73828870
4   1.7214   2.16719    2.4427     2.888542   0.73828870   >   0.73668448
5   2.1672              2.8885

The interval [2.1672, 2.8885] has a length of 0.72135 ≤ 1.

Estimate x* by x̂ = (a_5 + b_5)/2 = (2.1672 + 2.8885)/2 = 2.5279.
1   a=1       p=4.055    q=5.944   b=9
2   a=1       p=2.8885   q=4.055   b=5.944
3   a=1       p          q         b=4.055
4   a=1       p          q         b=2.8885
5   a=1.72    p          q         b=2.8885
6   a=2.1672                       b=2.8885

Comparison: we compare the Dichotomous and Golden Section
searches in Table 2.1. We can see that the Dichotomous search
requires fewer iterations N (it reduces the interval by a half rather than
γ each iteration), but it also needs more function calls, which may mean
that it is slower in the end.
We could also compare the size of the final interval: in Example 1,
the size after the Dichotomous search was 2⁻³ = 0.125, whereas after
the Golden Section Search it is γ⁵ = 0.09, which is smaller, and
therefore notionally better. However, this criterion is in the end
misleading, as all we required was that it be less than ε, and so it could
have turned out the other way around in other circumstances.

            Dichotomous    Golden Section
            N      M       N      M
Example 1   3      7       5      6
Example 2   8      17      11     12

Table 2.1: Comparison between Dichotomous and Golden Section
searches (NB: for the Dichotomous Search M is the worst case, not the
actual number needed, which is 6 for Example 1 and 15 for Example 2).

Fibonacci Search

In the two previous methods, each iteration is identical, and so they
can be continued ad infinitum. They are identical because in each
iteration we always select our test points in the same way. However,
if we loosened this requirement, and chose points in each iteration
differently, perhaps we could improve our searches.
The Fibonacci Search is the result of trying to do just that. As in the
Golden Section search we will choose two test points p_k < q_k at each
step. As in the Golden Section search we will reuse one of these two
in the subsequent iteration in order to reduce the number of function
calls. However, we shall no longer insist that the relative positions of
these points are the same at each iteration. We can think of this as
allowing a different value of γ at each iteration, i.e., γ_k. We'll still retain
symmetry, so a typical situation might look like:

    a_k ------ p_k ------ q_k ------ b_k              (q_k − a_k = γ_k(b_k − a_k))
    a_{k+1} -- p_{k+1} -- q_{k+1} -- b_{k+1}          (proportion γ_{k+1})
    a_{k+2} -- p_{k+2} -- q_{k+2} -- b_{k+2}          (proportion γ_{k+2})

Here, w.l.o.g., we are assuming that f(p_i) < f(q_i) for i = k and k + 1.


Note that
ℓ_k = b_k − a_k
ℓ_{k+1} = b_{k+1} − a_{k+1} = q_k − a_k
ℓ_{k+2} = b_{k+2} − a_{k+2} = q_{k+1} − a_{k+1} = p_k − a_k = b_k − q_k.
Hence ℓ_k = ℓ_{k+1} + ℓ_{k+2}.


We keep p_k ≤ q_k at each iteration, so the process will finish in a
finite number of steps once we can no longer symmetrically place distinct
test points, i.e., once we can't have p_k < q_k. Let us assume this takes
place after N − 1 iterations of the algorithm; the situation is then as
shown below.

aN-1 pN-1 = qN-1 bN-1


At this point we perform one final step to determine whether the
minimiser is in the left- or right-hand interval. The test points are the
same, p_{N−1} = q_{N−1}, so to do this we choose another test point
p′ = p_{N−1} − α (for α > 0 but very small), and

if f(p′) > f(p_{N−1}), x* ∈ [p_{N−1}, b_{N−1}];
if f(p′) ≤ f(p_{N−1}), x* ∈ [a_{N−1}, p_{N−1}],

giving us a final interval of length ℓ_N = ℓ_{N−1}/2 either way.


Let us define ∆ = ℓ_N ≠ 0; the goal of our algorithm is to ensure
that ∆ ≤ ε. Now, from the preceding step we know that ℓ_{N−1} = 2∆,
and we can then use the relationship ℓ_k = ℓ_{k+1} + ℓ_{k+2} to derive the
value of ℓ_k for all k (if we knew ∆ and N). The situation is illustrated
below:
Δ
Δ
aN bN
Δ Δ

aN-1 pN-1 bN-1
Δ Δ Δ

aN-2 pN-2 qN-2 bN-2
2Δ Δ 2Δ

aN-3 pN-3 qN-3 bN-3
3Δ 2Δ 3Δ

aN-4 pN-4 qN-4 bN-4
You should see a familiar series arising. We can see it visually by
noting that each interval is the size of the sum of the two previous
intervals. We can derive it formally by writing ℓ_{N−k+1} = F_k ∆. From the
diagram above F_1 = 1, F_2 = 2, F_3 = 3, F_4 = 5 and F_5 = 8, and we can
continue the series for all k using ℓ_k = ℓ_{k+1} + ℓ_{k+2}: substituting
k → N − k − 1 gives ℓ_{N−k−1} = ℓ_{N−k+1} + ℓ_{N−k}, and dividing by ∆ we get
F_{k+2} = F_{k+1} + F_k,
which results in the well-known Fibonacci sequence.

Defn 2.3. The sequence of positive integers {F_k : k = 1, 2, ...}
with F_1 = 1, F_2 = 2, F_{k+2} = F_{k+1} + F_k (k ≥ 1) is called the
Fibonacci Sequence (Leonardo de Pisa, 1202).

k    1  2  3  4  5   6   7   8   9   10  ···
F_k  1  2  3  5  8  13  21  34  55  89  ···

We have ℓ_{N−k+1} = F_k ∆, so we could now derive the size of each
interval b_k − a_k = ℓ_k = F_{N−k+1} ∆, where F_k is the Fibonacci Sequence,
if we knew ∆ and N. However, we don't know these a priori, so our next
step is to determine them.
We do so by noting that ℓ_0 = b_0 − a_0 = F_{N+1} ∆, and that we require
∆ ≤ ε. There is no trivial way to solve this, but it is very easy to calculate
the Fibonacci sequence and determine the first integer value N such
that
∆ = (b_0 − a_0)/F_{N+1} ≤ ε.
In doing so, we determine in advance the number of iterations (re-
member we perform the iteration N − 1 followed by the final step of
separating the left- and right-hand intervals).
For example: if seeking the minimum of f(x) over x ∈ [1, 9] with
final interval ≤ ε = 1, we look for the first N such that
F_{N+1} ≥ (9 − 1)/1 = 8,
and we know from above that this will be N = 4 as F_5 = 8.
Finally, we need to construct an algorithm, but this is almost trivial
as it will look exactly like the Golden Section Search, apart from the
initialisation step, and the final extra step. The interval reduction steps

don't have the same value of γ at each iteration, but nothing about the
way we coded the Golden Section Search made the assumption that
γ was constant (if you examine Algorithm 3 you'll see that γ isn't even
used after the first step).
The first step of the algorithm is different from the Golden Section
Search. We must first determine N, and then calculate the initial test
points p_0 and q_0, but we can again use ℓ_k = F_{N−k+1} ∆ to note that
γ_0 = ℓ_1/ℓ_0 = F_N/F_{N+1},
where we must have already calculated F_N and F_{N+1} in the process of
computing N. Once we know γ_0, we can calculate p_0 and q_0 as before
as
p_0 = b_0 − (F_N/F_{N+1})(b_0 − a_0),
q_0 = a_0 + (F_N/F_{N+1})(b_0 − a_0).
The resulting algorithm is given in Algorithm 4.
The number of function evaluations required is

• 2 in the initialisation;

• N − 2 in the while loop (as the function need not be evaluated in


the last run of the loop); and

• 1 additional evaluation in the last if/then statement.

Thus the total number of evaluations is M = N + 1, which is the same as for the


Golden Section Search. However, as we shall see, the Fibonacci Search
reduces the search interval faster than the Golden Section Search, typi-
cally resulting in fewer iterations, and hence fewer function calls.
End

input: As in Algorithm 1 and Algorithm 3.
output: As in Algorithm 1 and Algorithm 3.
Initialisation:
Determine the smallest integer N such that F_{N+1} ≥ (b_0 − a_0)/ε;
γ_0 = F_N/F_{N+1};
p_0 = b_0 − γ_0(b_0 − a_0) and q_0 = a_0 + γ_0(b_0 − a_0);
f_p = f(p_0) and f_q = f(q_0);
for k = 0, 1, ..., N − 2 do
    if f_p < f_q then
        [a_{k+1}, b_{k+1}] = [a_k, q_k];
        q_{k+1} = p_k;
        f_q = f_p;
        p_{k+1} = a_{k+1} + b_{k+1} − q_{k+1};
        f_p = f(p_{k+1}) ;  // not needed for k = N − 2
    else
        [a_{k+1}, b_{k+1}] = [p_k, b_k];
        p_{k+1} = q_k;
        f_p = f_q;
        q_{k+1} = a_{k+1} + b_{k+1} − p_{k+1};
        f_q = f(q_{k+1}) ;  // not needed for k = N − 2
    end
end
if f(p_{N−1} − α) ≤ f_p (for some small α > 0) then
    [a_N, b_N] = [a_{N−1}, p_{N−1}];
else
    [a_N, b_N] = [p_{N−1}, b_{N−1}];
end

Algorithm 4: The Fibonacci Search
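As a concrete illustration, here is a minimal Python sketch of Algorithm 4 (the function and variable names are our own choices, not part of the pseudocode; for simplicity it performs one function evaluation per reduction even on the final pass, where Algorithm 4 notes the evaluation can be skipped):

    def fibonacci_search(f, a, b, eps, alpha=0.01):
        # Build the Fibonacci sequence (F_1 = 1, F_2 = 2, as in Defn 2.3)
        # until F_{N+1} >= (b_0 - a_0) / eps.
        F = [1, 2]
        while F[-1] < (b - a) / eps:
            F.append(F[-1] + F[-2])
        N = len(F) - 1                      # F[-1] is F_{N+1}

        gamma0 = F[N - 1] / F[N]            # gamma_0 = F_N / F_{N+1}
        p, q = b - gamma0 * (b - a), a + gamma0 * (b - a)
        fp, fq = f(p), f(q)

        for _ in range(N - 1):              # the N-1 interval reductions
            if fp < fq:
                b, q, fq = q, p, fp
                p = a + b - q               # symmetric placement of the new point
                fp = f(p)
            else:
                a, p, fp = p, q, fq
                q = a + b - p
                fq = f(q)

        # Final step: p and q now coincide, so test a point just to the left.
        if f(p - alpha) <= fp:
            b = p
        else:
            a = p
        return (a + b) / 2                  # midpoint of the final interval

Applied to Example 6 below (f(x) = 1 − e^{−x/2} ln(x/5) on [2, 10] with ε = 1) this reproduces the final interval [6, 7] and returns its midpoint 6.5.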



The difference, however, is small. One piece of trivia about the
Fibonacci sequence is that
lim_{k→∞} γ_k = lim_{k→∞} F_k/F_{k+1} = γ,
so we can see that the Fibonacci Search for large N is almost the same
as the Golden Section Search. The only significant difference lies in the
final few iterations.
Example 6. Once again we shall solve Example 1, this time using the
Fibonacci Search. Remember we are looking for the minimiser of
f(x) = 1 − e^{−x/2} ln(x/5) over [2, 10], with ε = 1.
Solution. To find N, we need to find the smallest term
F_{N+1} ≥ (10 − 2)/1 = 8.
For N = 4 we get F_{N+1} = 8. Then
p_0 = b_0 − (F_4/F_5)(b_0 − a_0) = 10 − (5/8) × 8 = 5,
q_0 = a_0 + (F_4/F_5)(b_0 − a_0) = 2 + (5/8) × 8 = 7.
The table of the results follows (where we have used α = 0.01 in the final
step). You can see the Fibonacci sequence in the interval lengths ℓ_k.

k  a_k  p_k  q_k  b_k  f(p_k) <? f(q_k)  ℓ_k
0 2 5 7 10 1 > 0.9898394 8
1 5 7 8 10 0.9898394 < 0.9913900 5
2 5 6 7 8 0.9909200 > 0.9898394 3
3 6 7 7 8 0.9898394 = 0.9898394 2
4 p 4 −α = 6.99 f (p 4 − α) = 0.989831875 < f (p 4 ) 1
Example 7. Let us repeat Example 2 using the Fibonacci Search. Remember
we are looking for the minimiser of f(x) = e^{x/5} sin(x/5) over
[−10, 3] with ε = 0.1.

Solution. Solving F_{N+1} ≥ (3 − (−10))/0.1 = 130 we get N = 10
and F_{N+1} = 144.

k  a_k  p_k  q_k  b_k  f(p_k)  f(q_k)  ℓ_k
0 −10 −5.0347 −1.9653 3 −0.3088 −0.2585 13
1 −10 −6.9306 −5.0347 −1.9653 −0.2458 −0.3088 8.0347
2 −6.9306 −5.0347 −3.8611 −1.9653 −0.3088 −0.3224 4.9653
3 −5.0347 −3.8611 −3.1389 −1.9653 −0.3224 −0.3135 3.0694
4 −5.0347 −4.3125 −3.8611 −3.1389 −0.3206 −0.3224 1.8958
5 −4.3125 −3.8611 −3.5903 −3.1389 −0.3224 −0.3209 1.1736
6 −4.3125 −4.0417 −3.8611 −3.5903 −0.3223 −0.3224 0.7222
7 −4.0417 −3.8611 −3.7708 −3.5903 −0.3224 −0.3221 0.4514
8 −4.0417 −3.9514 −3.8611 −3.7708 −0.3224 −0.3224 0.2708
9 −4.0417 −3.9514 −3.9514 −3.8611 −0.3224 −0.3224 0.1806
10 p 10 + α = −3.95 −0.3224 < f (p 10 ) 0.0903

At the k = 9 stage, p_9 = q_9 = −3.95139. So for the 10th and final
reduction:
Test f(−3.95139 + α), e.g. f(−3.95), to determine which half of
[−4.04167, −3.86111] we need.
f(−3.95) = −0.322390135 < −0.32238929 = f(−3.95139). So we need
the top half of the interval, i.e. [−3.95139, −3.86111], which is of length
0.09028.
The estimate of the minimiser of f using the Fibonacci Search is the
midpoint, −3.90625, with an error of no more than ε/2 = 0.05.

Comparison: we compare the three searches we now know in
Table 2.2. As we already knew from Table 2.1, the Dichotomous search
requires fewer iterations N than the Golden Section search, but more
function evaluations.
Now, we can see that the Fibonacci Search is very similar to the
Golden Section Search but takes one less iteration in each case (and
consequently one less function evaluation). In fact, for a specified
number of function evaluations, the Fibonacci Search is optimal, in
that the best interval reduction is obtained for each function evaluation
[Kiefer (1953), Johnson (1955)].

            Dichotomous    Golden Section    Fibonacci
              N     M        N     M           N     M
Example 1     3     7        5     6           4     5
Example 2     8    17       11    12          10    11

Table 2.2: Comparison between three searches (NB: for the Dichotomous
Search M is the worst case, not the actual number needed, which
is 6 for Example 1 and 15 for Example 2).

We could do an alternative comparison: suppose we fix the allowed
number of function evaluations to be M; then we can calculate the
number of iterations available to us, and the consequent size of the final
interval ∆. Table 2.3 performs exactly such a comparison, first for a
general M, and then for a specific case with M = 21 and b_0 − a_0 = 1.
The main disadvantage of the Fibonacci Search is that the number of
interval reductions N (and hence function evaluations) is specified in
advance – you cannot change your mind halfway through. So, once
the final interval width is specified, you cannot later opt for something
smaller without restarting the whole process.

                      Dichotomous                Golden Section       Fibonacci
Iterations            (M − 1)/2                  M − 1                M − 1
∆                     (b_0 − a_0)/2^{(M−1)/2}    (b_0 − a_0)γ^{M−1}   (b_0 − a_0)/F_M
∆ (M = 21, ℓ_0 = 1)   9.7 × 10⁻⁴                 6.6 × 10⁻⁵           5.6 × 10⁻⁵

Table 2.3: Comparison between three searches for fixed M.

Advantages of Direct Search Techniques

1. For M function evaluations, we know the number of interval
reductions and how big the final search interval (∆) will be. Conversely,
given the upper bound ε on the length of the final interval
we can choose N such that ∆ ≤ ε.
It also means we know exactly when to stop to achieve our objective
with a given accuracy. This is an advantage highly atypical of
most optimisation problems (as we shall later see).

2. No differentiability is assumed, or needed.

3. The function does not even need to be continuous. Such search
methods can even be adapted to a discrete set of points, by selecting F_N
data points x_1, x_2, ..., x_{F_N} and then comparing f(x_p) and f(x_q):

k       1       2       3       ···   F_N
x_k     x_1     x_2     x_3     ···   x_{F_N}
f(x_k)  f(x_1)  f(x_2)  f(x_3)  ···   f(x_{F_N})

4. Only 4 sample points (a, p, q, b) need to be stored at each iteration
for the Golden Section and Fibonacci Searches, and 5 sample
points (a, d, c, e, b) for the Dichotomous Search.
End

UNBOUNDED PROBLEMS
The techniques above have been designed for the case when we start
with a known bracketing interval. What do we do when the initial
interval is unknown? For instance, consider the following problem:

Problem 2.3. Find the minimiser x of the unimodal function
f : R → R,
assuming nothing about differentiability of f.

One way to solve the problem is to first find a set of bounds. We’d
like these to be as tight as possible, but without any pre-knowledge, it
might be hard to achieve this. We’d also like to follow the same princi-
ples as in our previous work, namely reduce the number of function
evaluations required (using reuse where possible).
We present below a simple approach to finding bounds for the
minimiser for a problem such as Problem 2.3. Note that we are still
assuming the function is unimodal.
The basic idea is to start at a point x_0, and search for a set of three
points that together define a “dip” in the function, i.e., we wish to
find three points (x_i, f(x_i)) such that we have a situation similar to the
following:

x_{k−1} < x_k < x_{k+1} and f(x_{k−1}) ≥ f(x_k) < f(x_{k+1}).

We'll do so using a geometric search, i.e., a search where the spacing of
the points increases geometrically. The geometric increase guarantees
that even if the minimiser is far away from our initial guess x_0, the
search will find it quickly. The actual algorithm is given in Algorithm 5,
and it is illustrated in Figure 2.4.

[Figure 2.4: Finding bounds using a geometric search – the test points
x_0, x_1, x_2, ... step away geometrically until three of them bracket the
minimiser.]

The first step of the algorithm is to find the direction of the search.
Then we iterate, searching in that direction, until we find a dip in the
function. The algorithm requires an initial guess of the point of interest
x 0 . We also choose an initial step size h. A larger step size will result
in a faster search, but at the expense of a larger bracketing interval.
However, if we are to follow this by one of the algorithms above, say the
Dichotomous Search, the bracketing interval will decrease in factors of
2 and so we should not be too scared of a large initial bracket.

input: An initial point x_0 and initial step size h > 0.
output: An initial bracketing interval [a_0, b_0].
Initialise: k = 0; x_1 = x_0 + h;
if f(x_0) < f(x_1) ;  // f(x) is increasing at x_0
then
    h = −h ;  // move to the left
    x_1 = x_0 ;  // swap the first two points
    x_0 = x_0 − h ;
end
while f(x_{k+1}) ≤ f(x_k) do
    k = k + 1 ;
    x_{k+1} = x_0 + 2^k h ;
end
if h > 0 ;  // depends on direction of search
then
    [a_0, b_0] = [x_{k−1}, x_{k+1}] ;
else
    [a_0, b_0] = [x_{k+1}, x_{k−1}] ;
end

Algorithm 5: Geometric Search for Bounds
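A direct Python transcription of Algorithm 5 might look like this (a sketch; the names are ours, and for clarity it re-evaluates f rather than caching values):

    def bracket_minimum(f, x0, h):
        # Choose the search direction: if f is increasing at x0, search left.
        x1 = x0 + h
        if f(x0) < f(x1):
            h = -h
            x0, x1 = x1, x0           # swap the first two points

        xs = [x0, x1]
        k = 0
        while f(xs[k + 1]) <= f(xs[k]):
            k += 1
            xs.append(x0 + 2**k * h)  # geometric steps: h, 2h, 4h, ...

        # Now f(xs[k-1]) >= f(xs[k]) < f(xs[k+1]): a dip is bracketed.
        a, b = xs[k - 1], xs[k + 1]
        return (a, b) if h > 0 else (b, a)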

§2.3 QUADRATIC APPROXIMATION ALGORITHMS

The direct methods approach the problem as one of reducing a
bracket. They work well when we start with a bounded interval for our
minimiser, but if we don't start with such an interval what can we do?
One approach is to exploit our knowledge of simple functions, in
this case the quadratic. If our objective function were the quadratic

p(x) = a_0 + a_1x + a_2x²,

with a_2 > 0 (if a_2 ≤ 0 then no minimiser exists), then the minimiser of
p(x) is just

x̂ = −a_1/(2a_2).   (2.2)
In general, our objective function may NOT be quadratic, but if
we approximate f (x) by p(x) perhaps the minimiser of p will be close
to the minimiser of f ? To test this idea, we must first learn how to
effectively approximate a function with a quadratic.

Quadratic Approximation

We can approximate a function f (x) with a quadratic over some inter-


val [a, b] by taking three points: x 1 , x 2 , x 3 ∈ [a, b], with x 1 < x 2 < x 3 , and
then finding the quadratic that satisfies p(x i ) = f (x i ) for i = 1, 2, 3. The
situation is depicted in Figure 2.5.
There are no guarantees in such an approximation, but consider-
ation of the Taylor series approximation of a function suggests that if
these points were sufficiently close together, the approximation would
work well around the points. Implicitly this assumes that the function
we are dealing with is well behaved, i.e., it is continuous, and has two
well defined derivatives, but we shall see how this works later on.
Let us define E_i = f(x_i) for i = 1, 2, 3; then we seek to find the
quadratic that satisfies p(x_i) = a_0 + a_1x_i + a_2x_i² = E_i. We can rewrite
this condition in matrix form as

[ 1  x_1  x_1² ] [ a_0 ]   [ E_1 ]
[ 1  x_2  x_2² ] [ a_1 ] = [ E_2 ] .   (2.3)
[ 1  x_3  x_3² ] [ a_2 ]   [ E_3 ]

The matrix of powers of x_i is known, as are the E_i, and we seek to find
the coefficients a_i. The determinant of the coefficient matrix in (2.3) is
(x_1 − x_2)(x_2 − x_3)(x_3 − x_1) > 0, and so the system has a unique
solution (given we chose distinct values of x_i).

[Figure 2.5: Quadratic approximation to a function f(x) – the parabola
p(x) interpolates f at the three points x_1 < x_2 < x_3.]
We could try to solve the equations directly, by inverting the matrix,
but this (i) would be inefficient, and (ii) wouldn't tell us whether to
expect a minimum or a maximum from the quadratic. To answer that
we need to know the sign of a_2: for a minimum, we need a_2 > 0. Cramer's
Rule tells us that a system of n linear equations Ax = b has solutions
x_i = det(A_i)/det(A), where A_i is the matrix formed by replacing the i-th
column of A with b. In our case the rule tells us that
      | 1  x_1  E_1 |     | 1  x_1  x_1² |
a_2 = | 1  x_2  E_2 |  /  | 1  x_2  x_2² |
      | 1  x_3  E_3 |     | 1  x_3  x_3² |

    = − [(x_2 − x_3)E_1 + (x_3 − x_1)E_2 + (x_1 − x_2)E_3] / [(x_1 − x_2)(x_2 − x_3)(x_3 − x_1)],

which is > 0 when the numerator is negative, as our choice of x_1 < x_2 <
x_3 makes the denominator positive. Therefore, we need

(x_2 − x_3)E_1 + (x_3 − x_1)E_2 + (x_1 − x_2)E_3 < 0.   (2.4)
Thus we must choose 3 points (x_i, E_i), i = 1, 2, 3 (with E_i = f(x_i) and
x_1 < x_2 < x_3) satisfying (2.4) in order to approximate the minimiser of
f over [a, b] by the minimiser x̂ of p.
To find x̂, however, we do not need to find all the coefficients of p
(a_0, a_1, a_2). All we want is x̂ = −a_1/(2a_2).
Cramer's Rule gives us

      | 1  x_1  E_1 |     | 1  x_1  x_1² |             | 1  x_1²  E_1 |     | 1  x_1  x_1² |
a_2 = | 1  x_2  E_2 |  /  | 1  x_2  x_2² |  and  a_1 = −| 1  x_2²  E_2 |  /  | 1  x_2  x_2² | ,
      | 1  x_3  E_3 |     | 1  x_3  x_3² |             | 1  x_3²  E_3 |     | 1  x_3  x_3² |

and so

         | 1  x_1²  E_1 |     | 1  x_1  E_1 |
x̂ = (1/2)| 1  x_2²  E_2 |  /  | 1  x_2  E_2 |
         | 1  x_3²  E_3 |     | 1  x_3  E_3 |

  = (1/2) [(x_2² − x_3²)E_1 + (x_3² − x_1²)E_2 + (x_1² − x_2²)E_3] / [(x_2 − x_3)E_1 + (x_3 − x_1)E_2 + (x_1 − x_2)E_3],   (2.5)

which is guaranteed to be well-defined as we chose the points so that
the denominator is < 0. This formula lets us write the minimiser
of the approximating quadratic directly from the approximating points
(x_i, E_i) with a reduced number of operations (compared to inverting
the matrix).
However, we can do still better. We commonly consider problems
where the points x_i are evenly distributed. When x_1, x_2, x_3 are equally
spaced, i.e.,

x_1 = x_2 − s, and x_3 = x_2 + s,

for some s > 0, then (2.5) simplifies to

x̂ = x_2 + (s/2) · (E_1 − E_3)/(E_1 − 2E_2 + E_3),   (2.6)

and (2.4) reduces to

E_1 − 2E_2 + E_3 > 0.   (2.7)
This inequality will certainly hold if we choose our 3 points such that

E_1 ≥ E_2 and E_2 < E_3 (or, E_1 > E_2 and E_2 ≤ E_3),

i.e., there ∃ a “dip” in the middle.
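The formulas (2.5)–(2.7) translate directly into code; the following Python sketch (the helper names are our own) can be used to check the examples below:

    import math

    def quad_min(x1, x2, x3, E1, E2, E3):
        # Minimiser of the parabola through (x_i, E_i), from (2.5).
        # Requires condition (2.4): (x2-x3)E1 + (x3-x1)E2 + (x1-x2)E3 < 0.
        num = (x2**2 - x3**2) * E1 + (x3**2 - x1**2) * E2 + (x1**2 - x2**2) * E3
        den = (x2 - x3) * E1 + (x3 - x1) * E2 + (x1 - x2) * E3
        return 0.5 * num / den

    def quad_min_equal(x2, s, E1, E2, E3):
        # Equally spaced case x1 = x2 - s, x3 = x2 + s, from (2.6).
        # Requires condition (2.7): E1 - 2*E2 + E3 > 0.
        return x2 + 0.5 * s * (E1 - E3) / (E1 - 2 * E2 + E3)

    e = math.e
    print(quad_min(-1, 1, 2, e, e, e**4))   # 0.0, as in Example 8 a) below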


Example 8. f(x) = e^{x²}

a) Using x_1 = −1, x_2 = 1, x_3 = 2: 3 unequally spaced points.
Here E_1 = f(x_1) = e, E_2 = f(x_2) = e, E_3 = f(x_3) = e⁴,
and so inequality (2.4) holds.
Therefore,
x̂ = (1/2) · (−3E_1 + 3E_2)/(−E_1 + 3E_2 − 2E_3) = 0
is an approximation to x*.
Indeed x̂ = x* in this case!

b) Using x_1 = −2, x_2 = 1, x_3 = 3: 3 unequally spaced points.
Here E_1 = e⁴, E_2 = e, E_3 = e⁹, and so inequality (2.4)
holds.
Therefore
x̂ = (1/2) · (−8e⁴ + 5e + 3e⁹)/(−2e⁴ + 5e − 3e⁹) = −0.48937
is an approximation to x*.

c) Using the 3 equally spaced points −1, −0.75, −0.5 (s = 1/4):
Here E_1 − 2E_2 + E_3 = 0.492 > 0 as desired, and so a
suitable approximation to x* is

x̂ = −0.75 + (1/8) · (e¹ − e^{0.25})/(e¹ − 2e^{(−0.75)²} + e^{0.25}) = −0.386.

We can use the approximation and its analysis given above for many
purposes. In the next section we will use it in constructing a search
algorithm called the DSC algorithm after its creators: Davies, Swann
and Campey (1964).

End

The DSC Algorithm

Named after its inventors (Davies, Swann and Campey) the DSC algo-
rithm aims to use the quadratic approximations we discussed above to
find minimisers.
We are still considering minimising a unimodal function, but the
other condition (that we start with a known bound for the minimiser)
can be loosened here and we can start to look for a minimiser of a
function without any known bounds. So our new problem can be
expressed as in Problem 2.3.
In practice, convergence of our approach may depend on continuity
or differentiability of the function (see the discussion of Taylor's
series above), but we don't need any of these to try the approach (later
methods will need to calculate derivatives, and hence explicitly need
them to exist).
We need to be able to work with unbounded intervals, because our
new algorithm will only produce successive estimates of the minimiser.
It doesn't allow us to refine our bounds at each iteration. In addition, we
want three points at each step, in order to perform our approximation.
It's a lot easier if these points are uniformly spaced, so we shall extend
Algorithm 5 to find not just bounds, but a set of three uniformly-spaced
points.
Our extended algorithm is given in Algorithm 6. It works by taking
the three bracketing points x_{k−1}, x_k and x_{k+1} and supplementing them
with a fourth, x̄ = (x_k + x_{k+1})/2, so that the sequence x_{k−1}, x_k, x̄ and
x_{k+1} is uniformly spaced. We then restrict our attention to either
(x_{k−1}, x_k, x̄), or to (x_k, x̄, x_{k+1}), depending on which three bracket the
minimum, as illustrated in Figure 2.4. We need to take a little care over
the direction of the search (determined by testing the first two points),
and over ensuring we choose three bracketing points.
Once we have a method for determining three uniformly-spaced

input: An initial point x_0 and initial step size h > 0.
output: Three uniformly-spaced points z_1, z_2, and z_3 bracketing
the minimiser, and their spacing s.
Initialise: k = 0; x_1 = x_0 + h;
if f(x_1) > f(x_0) then
    h = −h;
    x_1 = x_0;
    x_0 = x_0 − h;
end
while f(x_{k+1}) ≤ f(x_k) do
    k = k + 1 ;
    x_{k+1} = x_0 + 2^k h ;
end
if k = 1 then
    (z_1, z_2, z_3) = (x_0, x_1, x_2) = (x_0, x_0 + h, x_0 + 2h);
    s = h ;
else
    x̄ = (x_k + x_{k+1})/2;
    if f(x_k) < f(x̄) then
        (z_1, z_2, z_3) = (x_{k−1}, x_k, x̄);
    else
        (z_1, z_2, z_3) = (x_k, x̄, x_{k+1});
    end
    s = 2^{k−2} h ;  // actually calculate by taking |z_3 − z_2|
end
if h < 0 then
    // reverse the ordering
    (z_1, z_2, z_3) = (z_3, z_2, z_1);
end

Algorithm 6: Geometric Search for Three Bracketing Points.



points that bracket the minimiser, we can use it iteratively. The ba-
sic idea is to find three points, use them to obtain an approximating
parabola, and from this derive an estimate of the minimiser. In choos-
ing the points the way we have, we guarantee that condition (2.7) is
satisfied, so we need only use (2.6) to find the minimiser. We then use
this estimate of the minimiser as the starting point for a new search
for bracketing points, but with a smaller value of h so that we hope-
fully have a tighter set of points, and hence a better estimate of the
minimiser. The resulting algorithm is given in Algorithm 7.

input: An initial point x_0, an initial step size h > 0, a step
contraction factor σ ∈ (0, 1) and a tolerance ε.
output: An estimate x̂ of the minimiser.
Initialise: k = 0;
repeat
    Calculate (z_1, z_2, z_3) and s using Algorithm 6, with inputs (x_k, h);
    E_i = f(z_i) for i = 1, 2, 3;
    x̂ = z_2 + (s/2) · (E_1 − E_3)/(E_1 − 2E_2 + E_3) ;  // From (2.6)
    x_{k+1} = x̂;
    h = σh ;
    k = k + 1 ;
until z_3 − z_1 ≤ ε;

Algorithm 7: The DSC algorithm. Note that changes to the value of h
inside Algorithm 6 should not change it here (side effects
like this should be avoided using simple scope rules). Similarly,
the counter k used in this algorithm should be different from that
inside Algorithm 6. If this comment makes no sense to you, then
go and look up the idea of scope (and don't use global variables).
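A minimal Python sketch of the whole DSC loop follows (the function names are ours; bracket3 condenses Algorithm 6, and the driver implements Algorithm 7 using (2.6); for brevity the bracketing routine re-evaluates f at points it has already seen, which a production version would cache):

    def bracket3(f, x0, h):
        # Algorithm 6: three equally spaced points bracketing the minimiser.
        x1 = x0 + h
        if f(x1) > f(x0):
            h = -h
            x0, x1 = x1, x0
        xs, k = [x0, x1], 0
        while f(xs[k + 1]) <= f(xs[k]):
            k += 1
            xs.append(x0 + 2**k * h)
        if k == 1:
            z = (xs[0], xs[1], xs[2])
        else:
            xbar = (xs[k] + xs[k + 1]) / 2
            if f(xs[k]) < f(xbar):
                z = (xs[k - 1], xs[k], xbar)
            else:
                z = (xs[k], xbar, xs[k + 1])
        return z if h > 0 else z[::-1]      # ascending order either way

    def dsc(f, x0, h, sigma=0.5, eps=0.15):
        # Algorithm 7: repeatedly fit a parabola through the bracketing triple.
        while True:
            z1, z2, z3 = bracket3(f, x0, h)
            E1, E2, E3 = f(z1), f(z2), f(z3)
            s = z3 - z2                     # spacing of the triple
            x0 = z2 + 0.5 * s * (E1 - E3) / (E1 - 2 * E2 + E3)   # from (2.6)
            h *= sigma
            if z3 - z1 <= eps:
                return x0

With f(x) = 1 − e^{−0.5x} ln x, x_0 = 1 and h = 0.25, this sketch reproduces the triples and the estimate x̂ ≈ 2.347 of Example 9 below.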

Example 9. Let us repeat Example 1 again: remember
f(x) = 1 − e^{−0.5x} ln x, and we will choose x_0 = 1, h = 0.25, σ = 0.5,
ε = 0.15.
Iteration 1:
Iteration 1:
k xk f (x k )
0 1 1
1 1.25 0.8806
2 1.5 0.8085
3 2 0.745
4 3 0.7549
Since f(x_4) > f(x_3), x* ∈ [x_2, x_4] = [1.5, 3] and x̄ = 2.5. Then 1.5, 2, 2.5, 3
is a set of four equidistant points straddling x*. f(x̄) = 0.737..., and so
f(3) > f(x̄) and f(x̄) < f(2), so x* ∈ [2, 3]. The equally spaced points

z_1 = 2, z_2 = 2.5, z_3 = 3

give x̂ = 2.401053944. (We do not really need this many decimal places,
but I am keeping them to prevent round-off errors here.)
The interval [2, 3] has a length of 1 > ε; put x_0 = x̂ = 2.401053944,
h = σh = 0.125 and begin again.
Iteration 2:
k xk f (x k )
0(1) 2.401053944 0.736320633
1(0) 2.526053944 0.737944057
2 2.276053944 0.736447531
Since f(x_0) < f(x_0 + h) we swap direction (and incidentally the first
two points), and then f(x_1) < f(x_2). Hence x* ∈ [2.2761, 2.5261], and
2.2761, 2.4011 and 2.5261 is a set of three equally spaced points straddling
x* (x̄ = x_1). We then get x̂ = 2.3476.
The interval [2.2761, 2.5261] has length 0.25 > ε; so put x_0 = x̂ = 2.3476,
h = σh = 0.0625 and begin again.

Iteration 3:
k     x_k           f(x_k)
0(1)  2.347616391   0.736139442
1(0)  2.410116391   0.73638377
2     2.285116391   0.736371413
Here, f(x_0) < f(x_0 + h) and f(x_1) < f(x_2), so it is similar to the previous
iteration. Thus x* ∈ [2.2851, 2.4102].
Since the length of this interval is < ε, we STOP.

Approximate x* by 2.347.
Note that at each iteration the new search interval is contained within
the preceding search interval. Is this provably true? No. It is easy to
find examples where the search interval “moves” along the number-line.
However, it is true that the length of the search interval never grows.
This is tricky and tedious to prove!
The advantage of the DSC method is that a function that can be well
approximated by a quadratic will lead to very fast convergence. How-
ever, if this is not true (say if the function were almost discontinuous),
convergence might be quite poor. A critical factor in its performance is
the choice of h and σ, but I personally don’t know how to do that well,
and even in some of the best cases I have tried, the performance of the
algorithm isn’t as good as the searches we performed above.

Extensions

An obvious extension to the DSC method is to use a higher-order
approximation. The method using 3rd-order polynomials is often called
Davidon's method (Davidon 1959).
Higher-order polynomials could be calculated using more points,
but that would increase the number of function calls. Instead, Davidon's
method uses derivatives at points to aid in the calculation. For
some functions this might be ideal, as calculating the derivative may
be easy given you are already calculating the function. In other cases,
it may increase the workload substantially, or even be impossible. In
any case such methods go beyond the scope of this course.

End

§2.4 METHODS FOR SMOOTH FUNCTIONS

All of our previous approaches have made minimal assumptions
about the function to be minimised. We have assumed it was unimodal
(so that a unique minimum existed), but we have not assumed much
more than continuity.
However, if we know more about our function, then we can take
advantage of this additional knowledge to (hopefully) speed up our
search. In this section, we consider what we can do if the function is
smooth in the sense that it has two well-defined derivatives. We start
with the (very) famous Newton's method. Though in many contexts this
is thought of as a method to find zeros of a function, we shall use it
here to find extrema.

NEWTON'S METHOD

Assume that the objective function f : R → R is at least twice
differentiable, i.e., f′ and f″ are well defined. So the problem of interest
here is of the form

Problem 2.4. Find the minimiser x of the unimodal function
f : R → R,
assuming f has two well-defined derivatives (that we can
compute).

Newton's method proceeds by finding a sequence of points x_0, x_1, x_2, ...
to successively approximate the minimiser⁴.
To do this we fit a quadratic q(x) to f(x) through the current point
(x_k, f(x_k)), matching the first and second derivatives of f(x) and q(x) at
that point. This contrasts with the DSC method, which used a quadratic
approximation based on three distinct points.
For intuition, we again resort to the Taylor series for f(x):

f(x) ≃ f(x_k) + f′(x_k)(x − x_k) + (f″(x_k)/2)(x − x_k)².

We can immediately see that this approximation is a quadratic in x,
and thus we can obtain our approximating parabola q(x) = a_0 + a_1(x −
x_k) + a_2(x − x_k)² just by matching the coefficients to the derivatives,
i.e.,
a_0 = f(x_k),
a_1 = f′(x_k),   (2.8)
a_2 = f″(x_k)/2.
Now the minimum of f can be approximated by the minimum of q,
which we know from (2.2) is at x̂ − x_k = −a_1/(2a_2), assuming a_2 > 0.
Substituting (2.8) and rearranging slightly we get

x̂ = x_k − f′(x_k)/f″(x_k)   (f″(x_k) ≠ 0).   (2.9)

If the function were quadratic, then this would immediately find the
required minimiser. As this is rare, we instead use this point as the next
estimate, i.e.,

x_{k+1} = x_k − f′(x_k)/f″(x_k),   (2.10)

⁴It is important, however, to understand that the points generated
by Newton's method don't always get closer to the minimiser at every step.

provided f″(x_k) ≠ 0 (we will consider this condition in more detail
in a minute). Beginning with an initial estimate x_0, we construct {x_k}
using (2.10). Note that we don't ever have to calculate f(x_k), but the two
derivatives must be calculable. The method is called Newton's Method,
and it is shown in Algorithm 8. Note that it is not actually searching
for a minimum, but only for a stationary point of the function (as we
aren't checking that a_2 > 0 at any point), though we could certainly do so
in a final step.
The actual algorithm is not much more complicated than so far
described, but we do need stopping criteria. Typically, we choose a
tolerance level ε and stop when | f 0 (x k )| < ε. We know that when the
derivative is small x k is close to being a stationary point of f (x k ), but we
must be careful, because “close to being stationary” is not always the
same as “close to the stationary point”. Hence there are other criteria
that are sometimes used.

• |x_k − x_{k−1}| < ε,

• |f(x_k) − f(x_{k−1})| < ε,

• |(f(x_k) − f(x_{k−1}))/(x_k − x_{k−1})| < ε, and

• |(f(x_k + ∆) − f(x_k))/∆| < ε for some small ∆ > 0.

The complicating factor is that no particular one of these is always
superior, so sometimes you will find code that uses some combination
of a set of criteria. Newton's method doesn't always converge (as we
shall see below) and so we need a stopping criterion for this case as
well. This criterion could be quite clever, but in the algorithm below we
just stop if the number of iterations gets too large.

input: An initial point x_0, and tolerance level ε.
output: An estimate x_k of the minimiser.
Initialise: k = 0;
repeat
    fd = f′(x_k);
    fdd = f″(x_k);
    if fdd ≠ 0 then
        x_{k+1} = x_k − fd/fdd ;  // From (2.10)
        k = k + 1 ;
    else
        start again at a different x_0 ;
    end
until |fd| < ε OR k > BIGNUM;

Algorithm 8: Newton's Method.
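In Python, Algorithm 8 is only a few lines (a sketch; the names are ours, and the derivative functions must be supplied by the caller):

    import math

    def newton_min(fd, fdd, x0, eps=1e-10, max_iter=100):
        # fd and fdd are callables computing f'(x) and f''(x) respectively.
        x = x0
        for _ in range(max_iter):
            d1, d2 = fd(x), fdd(x)
            if abs(d1) < eps:
                return x               # stationary point found
            if d2 == 0:
                raise ValueError("f''(x) = 0: restart from a different x0")
            x -= d1 / d2               # the Newton step (2.10)
        raise RuntimeError("iteration limit hit (possible oscillation or divergence)")

    # Using the derivatives worked out in Example 10 below:
    fd  = lambda x: math.exp(-0.5 * x) * (0.5 * math.log(x) - 1 / x)
    fdd = lambda x: math.exp(-0.5 * x) * (-0.25 * math.log(x) + 1 / x + 1 / x**2)
    print(newton_min(fd, fdd, 1.0))    # 2.34575075492277..., as in the table below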

Example 10. f(x) = 1 − e^{−0.5x} ln x for x ≥ 1.
Start with x_0 = 1.

f′(x) = e^{−0.5x}(0.5 ln x − 1/x),  so f′(1) = −e^{−0.5};

f″(x) = e^{−0.5x}(−0.5[0.5 ln x − 1/x] + 0.5/x + 1/x²)
      = e^{−0.5x}(−0.25 ln x + 1/x + 1/x²),  so f″(1) = 2e^{−0.5}.

So x_1 = x_0 − f′(x_0)/f″(x_0) = 1 − f′(1)/f″(1) = 1 − (−e^{−0.5})/(2e^{−0.5}) = 1.5.

Continuing, we have

x_2 = x_1 − f′(x_1)/f″(x_1) = 1.5 − f′(1.5)/f″(1.5) = 1.9594.

The extended table of values is:

k xk
0 1
1 1.5
2 1.95945678249834
3 2.24821027937209
4 2.33848545180889
5 2.34570813278122
6 2.34575075344906
7 2.34575075492277

From then on, at least 14 decimal places are consistent. To obtain a
solution with this degree of accuracy using the Fibonacci Search (i.e.,
with ε < 2 × 10⁻¹⁴), starting with the interval [1, 5], would require N = 70
iterations. We can see that, at least in this case, Newton's method is
very fast.
End

Notes on Newton’s Method

1. We can see from the above arguments that Newton's method is
not really trying to find the minimum, so much as trying to find
a zero of the derivative (i.e., a stationary point). That is, it tries
to solve f′(x) = 0. It is in this context that many people will have
seen Newton's method (as a zero finder).

2. Newton's method is often referred to as the method of tangents. If
we consider the previous comment, we know Newton's Method
is seeking a zero of the function g(x) = f′(x). Let's rewrite our
existing algorithm in terms of g(x). Then (2.10) becomes

x_{k+1} = x_k − g(x_k)/g′(x_k).

This is equivalent to drawing a tangent to the curve g(x) at the
point x_k and following it to the x-axis. If g(x) were a straight line
(i.e., f(x) were quadratic) we can see that this approach would find
the point of interest in one step. Otherwise the situation is illustrated
in Figure 2.6.

3. We need f not just to be twice differentiable, we also need to


be able to compute these derivatives (efficiently). In practice,
we may need to compute these approximately using multiple
evaluations of f , losing some of the benefit of the method.

4. Newton’s method naturally generalises to multiple variables. We


won’t consider it in detail here, but the derivation is a very simple
application of Taylor’s Theorem in multiple dimensions.

[Figure 2.6: Method of tangents applied to g(x) = f′(x) – the tangent
to g at x_k cuts the x-axis at x_{k+1}.]

Convergence of Newton’s Method

We saw that in our example Newton’s Method converged very quickly


indeed. The obvious question is “can we rely on this?” The answer is
sometimes.
Suppose that we have an interval [a, b] where f 0 (a) f 0 (b) < 0 and on
the interval f 00 (x) > 0 and f has a continuous third derivative. Then, if
x 0 is chosen sufficiently close5 to the minimiser the method will con-
verge to the minimiser. Otherwise, we can get oscillatory OR divergent
behaviour, as seen in the examples below.

⁵Sufficiently close translates to 0 ≤ 1 − (d/dx)(f′(x)/f″(x)) < 1, or,
0 ≤ f′(x)f‴(x)/f″(x)² < 1. Note the latter term is the derivative of
x − f′(x)/f″(x), the function defining the iterates.

Examples where things go wrong in Newton’s Method!

Example 11 (Oscillation). Take f(x) = x tan⁻¹x − (1/2) ln(1 + x²). Then the
derivatives are
f′(x) = tan⁻¹x,
f″(x) = 1/(1 + x²).
The following diagram showed f and f′, and the tangents to f′ at two
critical points.

[Figure: f(x) and f′(x) = tan⁻¹x, with the tangents to f′ at two
critical points bouncing between each other.]

When we start at the point x_0 = 1.39174520027074, then x_1 = −x_0,
x_2 = x_0, and so on.
The problem here is that
f′(1)f‴(1)/f″(1)² = −1.5708 < 0,
so the condition above is not satisfied.

Example 12 (Divergence). Take the function f(x) = −e^{−x²}. Its derivatives
are
f′(x) = 2x e^{−x²},
f″(x) = (2 − 4x²) e^{−x²}.

Starting at x_0 = 1, we note that f″(x) < 0 for all x ≥ x_0, so the tangents
will slope away from the minimum, and push the estimates to the right
as illustrated below.

In this case
f′(1)f‴(1)/f″(1)² = −2 < 0,
so the condition above is not satisfied. However, if we had started close
to the minimiser (at 0), we would have converged, and very quickly.

Thus the choice of x 0 is important!



Suppose f‴ is continuous and convergence does occur, i.e., x_k → x*.
If x_0 is sufficiently close to x*, convergence is normally fast.
Consider x_k → x*. Then

x_{k+1} − x* = {x_k − f′(x_k)/f″(x_k)} − x*
            = (x_k − x*) − [f′(x_k) − f′(x*)]/f″(x_k),

because f′(x*) = 0. Now by Taylor's Theorem

f′(x*) = f′(x_k) + (x* − x_k) f″(x_k) + ((x* − x_k)²/2!) f‴(ξ),

for some ξ between x* and x_k, and hence

x_{k+1} − x* = (1/2!) · (f‴(ξ)/f″(x_k)) · (x_k − x*)².   (2.11)

So if f‴ is continuous,

f‴(ξ)/f″(x_k) → f‴(x*)/f″(x*)   (since x_k → x* and f″, f‴ are continuous).

Also if a function g is continuous on [a, b] then g is bounded on (a, b), so

f‴ continuous and f″(x*) ≠ 0 ⇒ |f‴(ξ)/f″(x_k)| is bounded near x*,

i.e., |f‴(x*)/f″(x*)| < K for some K > 0. Consequently, if x_0 is
sufficiently close to x*,

|x_{k+1} − x*| ≤ (K/2)|x_k − x*|².   (†)

In general, if |x_{k+1} − x*| ≤ c|x_k − x*|^p for constants c and p, then we
say the order of convergence is p. In Newton's Method p = 2, so we say
that Newton's Method has quadratic convergence.
If f″(x_k) → 0 (i.e., f′ is flat-ish) then convergence can still be slow.
If f′(x) = 0 has multiple solutions, the method could converge to the
wrong (local, not global) solution (though this is a problem for all of
the methods we have considered).
Otherwise, Newton's Method's convergence is likely to be very fast,
and this is why it is such a common method, but it is not a method
without problems, as illustrated in Figure 2.7.

[Figure 2.7: A rough sketch of the type of convergence problems we might see
with Newton's method: depending on the starting point, the iterates may
converge to the stationary point a, converge to c, or diverge.]

SECANT METHOD

If information about the second derivative is NOT available (and
therefore we cannot use Newton's Method), we can approximate f″ using
information about f′:

f″(x_k) ≃ [f′(x_k) − f′(x_{k−1})]/(x_k − x_{k−1}).

Replacing f″(x_k) by this in Newton's Method gives the recursive
formula

x_{k+1} = x_k − f′(x_k) · (x_k − x_{k−1})/(f′(x_k) − f′(x_{k−1})),   (2.12)

for k ≥ 1, given x_0, x_1, and f′(x_k) ≠ f′(x_{k−1}).

input: Two initial points x_0 and x_1, and tolerance level ε.
output: An estimate x_k of the minimiser.
Initialise: k = 1, fd = f′(x_0);
repeat
    fd_old = fd ;  // f′(x_{k−1})
    fd = f′(x_k);
    fdd = (fd − fd_old)/(x_k − x_{k−1});
    if fdd ≠ 0 then
        x_{k+1} = x_k − fd/fdd ;  // From (2.12)
        k = k + 1 ;
    else
        start again with different x_0 and x_1 ;
    end
until |fd| < ε OR k > BIGNUM;

Algorithm 9: Secant Method.
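A matching Python sketch of Algorithm 9 (names are ours):

    def secant_min(fd, x0, x1, eps=1e-10, max_iter=100):
        # Only f' is needed; f'' is replaced by a difference quotient.
        fd0, fd1 = fd(x0), fd(x1)
        for _ in range(max_iter):
            if fd1 == fd0:
                raise ValueError("f'(x_k) = f'(x_{k-1}): restart")
            x0, x1 = x1, x1 - fd1 * (x1 - x0) / (fd1 - fd0)   # from (2.12)
            fd0, fd1 = fd1, fd(x1)
            if abs(fd1) < eps:
                return x1
        raise RuntimeError("no convergence")

With the f′ from Example 10 and starting points x_0 = 1, x_1 = 1.5, this converges to the same minimiser, typically taking somewhat more iterations than Newton's method (consistent with the lower order of convergence noted below).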

Alternative viewpoint: find the point x_{k+1} where the secant joining
(x_{k−1}, f′(x_{k−1})) to (x_k, f′(x_k)) cuts the x-axis. That is, again, we can
think of it as a method for solving g(x) = 0 where g = f′.

Advantage over Newton's method – we do not need f″.

Disadvantages:

1. We need 2 starting values, x_0 and x_1, and a poor
choice might lead to divergence, as shown in
Figures 2.8a and 2.8b.
[Figure 2.8: sketches of g(x) = f′(x) with starting points x_0, x_1 to
the right of x*. (a) Newton's and Secant Methods both diverge.
(b) Newton's Method converges, but the Secant Method diverges.]

2. The order of convergence is p = (√5 + 1)/2 = 1.618, which is
< 2 (the order for Newton's method).
(ref: D. G. Luenberger, Introduction to Linear and Nonlinear
Programming, Addison-Wesley, 1973)

In any case, we need to check quite carefully for pathological
conditions in both Newton's and the Secant Methods. The algorithms
presented in Algorithms 8 and 9 are just bare-bones approaches. Any
real code would spend more time checking for divergence and
oscillation.

SUMMARY OF LINE SEARCH METHODS

There are many other line searches possible. Some may have better
properties in certain situations.

For               Use Method
Non-smooth f      Dichotomous, Golden-Section, Fibonacci
Smooth f          DSC
Available f′      Davidon, Secant
Available f″      Newton's

Table 2.4: A summary of line search methods.

There are many alternatives, e.g., the method of regula falsi and Brent's
method, but in practice, line searches are mostly used after a search direction
has been found in multi-variable problems. Practical experience seems
to suggest that it is often better to allocate more computational time
to iterating the optimisation algorithm for the multi-variable problem
than to doing exact line searches.

End
§3 UNCONSTRAINED, MULTI-VARIABLE, CONVEX OPTIMISATION

In the previous chapter we concentrated on optimising functions of
one variable. However, it is much more normal to need to optimise over
many variables. Some problems may involve hundreds, thousands, or
these days even hundreds of thousands of variables.
In this chapter we extend ourselves to consider such multivariable
problems, but we will restrain ourselves from introducing any type
of constraints on the variables until the next chapter. Constraints
complicate the problem to a perhaps surprising degree.
We'll also make one more restriction here. We want to consider
unimodal problems, because this naturally leads to a unique minimum.
However, where unimodality might be obvious in a function of one
variable, it is rarely so clear (even to define) in general multivariable
functions. Thus we restrict ourselves to convex functions, which leads
to much the same results.
We will assume we have n real variables, and as before we shall look
for minima, with the search for maxima being almost identical (when
applied to −f). Thus the type of problems we shall consider can be
described as follows.


Problem 3.1. Find the minimiser x of the convex function

f : Rn → R.

One natural approach to such problems is to generalise Newton's
method, but we won't consider this in detail here because, although
it works well for unconstrained problems, it doesn't easily allow us to
build in constraints. Instead, we shall focus on descent methods, which
will (after some work) allow us to build in constraints. In any case,
let us first consider some brief examples of the type of problems we
consider.

§3.1 EXAMPLES
Let’s consider a couple of very simple example problems to moti-
vate the work of this chapter.
Example 13. Consider optimising the location of a set of recycling
plants to minimise the total distance travelled by users from several
locations. For example: take four urban centres A, B, C and D, and
choose the location of 2 recycling plants X and Y.
That is, given the locations of the urban centres A = (a_1, a_2), B =
(b_1, b_2), C = (c_1, c_2), D = (d_1, d_2), find X = (x_1, x_2), Y = (y_1, y_2) to
minimise the total distance travelled by the users, or the total road length
(assuming we could travel in straight lines).
[Diagram: the four urban centres A, B, C, D with the two plants X and Y
located between them.]

The problem is then to find the locations of X and Y that solve:

min f(x_1, x_2, y_1, y_2) = √((a_1 − x_1)² + (a_2 − x_2)²) + √((b_1 − x_1)² + (b_2 − x_2)²)
                          + √((c_1 − y_1)² + (c_2 − y_2)²) + √((d_1 − y_1)² + (d_2 − y_2)²)
                          + √((x_1 − y_1)² + (x_2 − y_2)²)

In this case there are four variables (the x and y positions of X and
Y ) and so f : R4 → R. This is a very simple operations research problem
where we seek to optimise the operations of some organisation. OR (as
it is called) is one of the major consumers of optimisation algorithms
and there are very many more such problems of interest.
Example 14. Another classic optimisation problem occurs in statistics.
Often, we wish to fit a curve of some type (in linear regression it's just
a linear function of the variables) to a set of data while minimising the
error in the fit. If we imagine a set of data (x_i, y_i) for i = 1, 2, ..., m,
and for the sake of argument we seek to fit a degree n − 1 polynomial
p(x) = a_0 + a_1x + a_2x² + ··· + a_{n−1}x^{n−1} to these, then we would aim to
find coefficients a_0, ..., a_{n−1} such that we minimise the sum of squares
of the errors

f(a_0, ..., a_{n−1}) = Σ_{i=1}^m (y_i − p(x_i))².
That is, we seek to minimise a function f : R^n → R. More importantly,
the function we want to minimise is a quadratic, and this type of function
is both common and highly tractable, so we shall devote some
attention specifically to these (and call them quadratic programs).
Many statistical problems can be written as optimisation problems, and
consequently we can view statistics as almost a branch of optimisation,
though the problems aren't often this simple. Signal and image processing
often use similar ideas to estimate or infer properties of their signals.

§3.2 BACKGROUND
Let us first begin with a quick refresher of notation and results that
you should have seen in previous courses.

POSITIVE DEFINITE MATRICES

Defn 3.1. A matrix H is

• positive definite if zᵀHz > 0, ∀z ≠ 0,

• positive semidefinite if zᵀHz ≥ 0, ∀z.

Example 15. If H = [ 2  4 ; 4  16 ], then

zᵀHz = 2z_1² + 8z_1z_2 + 16z_2²
     = 2[(z_1 + 2z_2)² + 4z_2²] ≥ 0,

and equals 0 only when z_1 = z_2 = 0. Thus H is a positive
definite matrix.
There are several ways to test if a matrix is positive definite:

• A matrix is positive definite iff every eigenvalue is greater than
zero, and positive semidefinite iff every eigenvalue is non-negative.

• A matrix is positive definite iff all the leading principal minors¹ in
the top-left corner of A are positive (but not necessarily semidefinite
if they are all non-negative). In other words

a_{11} > 0,   | a_{11} a_{12} ; a_{21} a_{22} | > 0,
| a_{11} a_{12} a_{13} ; a_{21} a_{22} a_{23} ; a_{31} a_{32} a_{33} | > 0,   ···

There are many others, but this should suffice for this course.

Example 16. A = [ 2  4 ; 4  16 ] has leading principal minors of 2 and
32 − 16 = 16. Both are positive, and so A is positive definite.
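In practice we rarely compute minors by hand; a quick numerical check is easy (a sketch using NumPy's eigvalsh, which is designed for symmetric matrices):

    import numpy as np

    def is_positive_definite(H):
        # Symmetric H is positive definite iff all its eigenvalues are > 0.
        return bool(np.all(np.linalg.eigvalsh(H) > 0))

    H = np.array([[2.0, 4.0],
                  [4.0, 16.0]])       # the matrix of Examples 15 and 16
    print(is_positive_definite(H))    # True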

QUADRATIC FUNCTIONS

We saw in one of our examples that we often seek to optimise quadratic
objective functions. These are a natural generalisation of the 1D
quadratic functions we have used already in this course.
A quadratic function can always be written in the form:

q(x) = (1/2)xᵀAx + bᵀx + c,   (x ∈ R^n),   (3.1)

where A is a symmetric matrix, i.e., Aᵀ = A. We can also define a quadratic
form, which is similar, but requires b = 0 and c = 0.
¹Principal minors are k × k sub-determinants formed by successively removing
an i-th row and column. Leading principal minors are those from the top left.
In our context, we are interested in quadratic functions with a positive
definite A matrix. In this case, the level curves of the function are
ellipsoids (ellipses in 2D), and the function has a single, well-defined
minimum.
To see this, consider completing the square, by putting the function
in the form:

q(x) = (1/2)(x − p)ᵀA(x − p) + d,   (3.2)

where

p = −A⁻¹b,   (3.3)
d = c − (1/2)pᵀAp.   (3.4)

Note that if A is positive definite and symmetric, then det A ≠ 0 and
so A⁻¹ must exist, and we can always calculate p for such a quadratic.
Then, because A is positive definite, we can see immediately that the
minimiser is x* = p and q(x*) = d. In fact, if we were able to invert the
matrix efficiently, this would be a fine way to find the minimiser, but
we will see that this is not always the fastest or most robust approach,
and it doesn't work for the more general class of convex functions.
Example 17. Consider the following quadratic function:

f(x_1, x_2, x_3) = (3/2)x_1² + 2x_2² + (3/2)x_3² + x_1x_3 + 2x_2x_3 − 3x_1 − x_3
               = (1/2)xᵀAx + bᵀx,  where

A = [ 3  0  1 ]        b = [ −3 ]
    [ 0  4  2 ] ,          [  0 ] .
    [ 1  2  3 ]            [ −1 ]

When we complete the square we get

f(x) = (1/2)(x − p)ᵀA(x − p) + d,  where p = (1, 0, 0)ᵀ and d = −3/2.
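This is easy to verify numerically: since p = −A⁻¹b, we can solve Ap = −b rather than forming the inverse (a short NumPy check):

    import numpy as np

    A = np.array([[3.0, 0.0, 1.0],
                  [0.0, 4.0, 2.0],
                  [1.0, 2.0, 3.0]])
    b = np.array([-3.0, 0.0, -1.0])

    p = np.linalg.solve(A, -b)    # p = -A^{-1} b, from (3.3)
    d = -0.5 * p @ A @ p          # d = c - p^T A p / 2, from (3.4), with c = 0
    print(p, d)                   # [1. 0. 0.] -1.5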

Example 18. A more practical example is that of fitting curves, as in
Example 14. Let's take the simplest case of fitting a line to a set of data,
based on minimising the squared error, i.e., we have a set of data
(x_i, y_i) and we want to find the line y = a_0 + a_1x that fits it best, where
“best” means it has the minimum squared error

f(a_0, a_1) = Σ_{i=1}^m (y_i − (a_0 + a_1x_i))².

We first need to write this in the conventional quadratic form above,
so we write

f(a) = Σ_{i=1}^m (a_0² + 2a_0a_1x_i + x_i²a_1² − 2y_ix_ia_1 − 2y_ia_0 + y_i²)
     = (1/2)aᵀAa + bᵀa + c,   (3.5)

where

A = 2 [ m        Σ_i x_i  ]                      (3.6)
      [ Σ_i x_i  Σ_i x_i² ] ,

b = 2 [ −Σ_i y_i    ]                            (3.7)
      [ −Σ_i y_ix_i ] ,

c = Σ_{i=1}^m y_i².                              (3.8)

The inverse of A is easy to calculate, and hence

â = −A⁻¹b = 1/(m Σ_i x_i² − (Σ_i x_i)²) [ Σ_i x_i²   −Σ_i x_i ] [ Σ_i y_i    ]
                                        [ −Σ_i x_i    m       ] [ Σ_i y_ix_i ] ,

which matches the conventional estimator. It should be obvious that
the above generalises both to the case where the x_i and y_i are vectors,
and to the case where we fit higher-order polynomials to the data, though
calculating appropriate inverses may be more difficult.
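As a sanity check of Example 18, here is a short sketch (the data values are invented purely for illustration):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])      # illustrative data only
    y = np.array([1.1, 2.9, 5.2, 6.8])

    m = len(x)
    A = 2 * np.array([[m, x.sum()], [x.sum(), (x**2).sum()]])   # from (3.6)
    b = -2 * np.array([y.sum(), (y * x).sum()])                 # from (3.7)

    a_hat = np.linalg.solve(A, -b)          # minimiser of (1/2)a^T A a + b^T a + c
    print(a_hat)                            # [a0, a1] = intercept and slope
    print(np.polyfit(x, y, 1)[::-1])        # NumPy's own fitter gives the same answer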
If we have a quadratic function, it's easy to calculate its derivatives:

q(x) = (1/2)xᵀAx + xᵀb + c,   (3.9)
∇q(x) = Ax + b,   (3.10)
H_q = A.   (3.11)

We can easily use this to confirm that x* = −A⁻¹b is a stationary point
by calculating ∇q(x*) using (3.10).

TAYLOR SERIES
We can naturally extend Taylor series to functions of multiple variables.
Consider the function

g(t) = f(x + tu),

where xᵀ = (x_1, x_2, ..., x_n) and uᵀ = (u_1, u_2, ..., u_n) are fixed, i.e., g is
indeed a function of a single variable t.
Example 19. If f(x) = sin(x_1 + 2x_2) + x_1²x_2 then with x =
(2, 3) and u = (1, −1),
g(t) = f((2, 3) + t(1, −1)) = f(2 + t, 3 − t)
     = sin(8 − t) + (2 + t)²(3 − t)
– a function of t alone!
Then, in general,
g(t) = f(x_1 + tu_1, x_2 + tu_2, ..., x_n + tu_n).
So
g′(t) = (d/dt) f(x + tu)
      = D_1f · (d/dt)(x_1 + tu_1) + D_2f · (d/dt)(x_2 + tu_2) + ···
      = u_1D_1f + u_2D_2f + ··· + u_nD_nf
      = uᵀ∇f(x + tu).
Similarly,
d²g/dt² = (d/dt)(u_1D_1f + u_2D_2f + ··· + u_nD_nf)
        = Σ_{j=1}^n (d/dt){u_jD_jf(x + tu)}
        = Σ_{i=1}^n Σ_{j=1}^n u_iu_jD_{ij}f(x + tu)
        = uᵀ H_f(x + tu) u,
where H = H_f(x + tu) is the Hessian matrix of f at the point (x + tu),
as defined in (1.1).
as defined in (1.1).
Recall the Taylor Series expansion of a function g(t) about c = 0:
g(t) = g(0) + tg′(0) + t²g″(0)/2 + ··· .
Taking t = 1 we get
g(1) = f(x + u),
g(0) = f(x),
g′(0) = uᵀ∇f(x),
g″(0) = uᵀH_f(x)u,
which gives us

f(x + u) = f(x) + uᵀ∇f(x) + (1/2)uᵀH_f(x)u + O(‖u‖³),   (3.12)

which is the Taylor Series for functions of several variables. The
corresponding “Remainder Form” is

f(x + u) = f(x) + uᵀ∇f(x) + (1/2)uᵀH_f(z)u,

where z = x + θu, for some 0 < θ < 1.

CONVEX FUNCTIONS

In multiple dimensions, we won't ask if our functions are unimodal
(as we often won't know in advance), but we can often restrict our
attention to convex functions. We know that in 1D a convex function
is one that lies below any chord drawn across it, as shown in
Figure 3.1, and we generalise this in Defn 3.2.

Defn 3.2. A function f defined on a convex set C ⊆ R^n is

1. convex if for all x, y ∈ C and α ∈ [0, 1],

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y);

2. strictly convex if ∀ x, y ∈ C and α ∈ (0, 1),

f(αx + (1 − α)y) < αf(x) + (1 − α)f(y).

f(x)

αf(x) + (1-α)f(y)

f(αx + (1-α)y)
x y
αx + (1-α)y
Figure 3.1: The defining property of a convex function.

Note.

1. An alternative form of definition writes the inequality as

f(x + α(y − x)) ≤ f(x) + α(f(y) − f(x)).

2. The set C on which the function has to be defined must be convex
in order to guarantee that x + α(y − x) ∈ C, but for the moment we
will ignore this constraint (as we are considering unconstrained
problems), and so consider convex functions on R^n. Note that if
f is convex on C, then f restricted to any convex subset of C is
also convex.

3. If f is linear then

f(x + α(y − x)) = f(x) + α(f(y) − f(x)),

so convexity in n-D forces the curve of f to lie below the line
joining (x, f(x)) and (y, f(y)), just as in a 1D convex function (as
in Figure 3.1). This is true for any such line, so we are saying that
any hyperplane cutting through the function will lie “above” the
function.
Examples of convex and non-convex2 functions are shown in Fig-
ure 3.2a and 3.2b.

[Figure 3.2: Convex and non-convex examples. (a) A convex function
f(x, y) = 6.3x⁴ + 6y². (b) A non-convex function (the “3-hump camel
function”).]

f is convex on C ⇔ for any x, y ∈ C, the function g(t) = f(x + t(y − x))
is convex on [0, 1], i.e., for t ∈ [0, 1].
Notice that g is a function of one variable, t, so the same old ideas of
convexity from 1-variable calculus apply now. This is useful for reducing
properties of convex functions in R^n to properties of convex functions
in 1D. For instance, a twice-differentiable function g is convex on [0, 1]
if and only if g″(t) ≥ 0 there. The result generalises to multiple
dimensions to give Theorem 3.1.
²Notice we don't say concave. The definitions are reversed, not complementary,
so a function may be neither convex nor concave.

Theorem 3.1. Provided the second partial derivatives of f
exist (i.e., H_f exists), f is

• strictly convex on the convex set C if H_f(x) is positive
definite on C (see Defn 3.1);

• convex on the convex set C iff H_f(x) is positive
semidefinite on C (see Defn 3.1).

(Note that the strict case is one-directional: f(x) = x⁴ is strictly
convex on R even though f″(0) = 0, so its Hessian is not positive
definite everywhere.)

Proof. We'll prove the second case; for the first, replacing ≥ with > in
part (ii) gives the result.

(i) ⇒ Given: f is convex on C.
Taylor's Theorem tells us

f(x + sz) = f(x) + s∇f(x)ᵀz + (s²/2!)zᵀH_f(x)z + O(s³),

so

zᵀH_f(x)z = lim_{s→0} (2/s²)[f(x + sz) − f(x) − s∇f(x)ᵀz + O(s³)].

But f is convex on C, so

f(x + sz) ≥ f(x) + s∇f(x)ᵀz   (by Theorem 3.2).

Since lim_{s→0} O(s³)/s² = 0, the limit above is that of a non-negative
quantity plus a vanishing term, and hence

zᵀH_f(x)z ≥ 0.

(ii) ⇐ Given: H_f(x) positive semidefinite, i.e.,

zᵀH_f(x)z ≥ 0   ∀z, ∀x ∈ C.   (3.13)

Let x, y ∈ C and

g(t) = f(y + t(x − y)),   t ∈ [0, 1].

Then

g″(t) = (x − y)ᵀ H_f(y + t(x − y)) (x − y) ≥ 0

by (3.13) (with x replaced by y + t(x − y) ∈ C, and z replaced by x − y).
But from calculus of one variable, for t ∈ [0, 1],

g″(t) ≥ 0 ⇒ g is convex on [0, 1].

Since g is convex, f is convex on C as we noted earlier.
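Theorem 3.1 turns convexity checking into a linear-algebra computation. As a quick illustration, for the convex function of Figure 3.2a (a sketch using SymPy):

    import sympy as sp

    x, y = sp.symbols('x y', real=True)
    f = sp.Rational(63, 10) * x**4 + 6 * y**2   # f(x, y) = 6.3 x^4 + 6 y^2

    H = sp.hessian(f, (x, y))
    print(H)              # Matrix([[378*x**2/5, 0], [0, 12]])
    print(H.eigenvals())  # {378*x**2/5: 1, 12: 1}

Both eigenvalues are ≥ 0 for every (x, y), so H_f is positive semidefinite everywhere and f is convex by Theorem 3.1. (On the line x = 0 the Hessian is only semidefinite, even though this f happens to be strictly convex – the x⁴ caveat noted above.)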

Similarly, forming g(t) = f(x + t(y − x)), we know that a convex
function g(t) lies above its tangents, e.g., g(1) ≥ g(0) + (1 − 0)g′(0), as
shown in Figure 3.3 (and the working below), and this result can be
used in Theorem 3.2.

[Figure 3.3: A convex function g(t) – the chord slope g(1) − g(0) exceeds
the tangent slope g′(0) at t = 0.]
The result can be derived from the definition of the derivative, i.e.,

g′(0) = lim_{h→0⁺} (g(0 + h) − g(0))/h,

and noting that from convexity

g(0 + h(1 − 0)) ≤ g(0) + h(g(1) − g(0))
g(h) ≤ g(0) + h(g(1) − g(0))
(g(h) − g(0))/h ≤ g(1) − g(0)
g′(0) ≤ g(1) − g(0).

Theorem 3.2. If the partial derivatives of f exist in C, then
f convex ⇔ f(x) + (y − x)ᵀ∇f(x) ≤ f(y), ∀x, y ∈ C.

Proof. (i) ⇒ Given f is convex in C.
Let x, y ∈ C. Then g is convex, where
g(t) = f(x + t(y − x)), t ∈ [0, 1].
But g is a function of one variable, therefore by the result
above,
g(1) − g(0) ≥ g′(0).
But g(0) = f(x), g(1) = f(y), and by the Chain Rule
g′(t) = (y − x)ᵀ∇f(x + t(y − x)),
∴ g′(0) = (y − x)ᵀ∇f(x).
Substitution gives
f(x) + (y − x)ᵀ∇f(x) ≤ f(y), ∀x, y ∈ C.

(ii) ⇐ Given:

f(x) + (y − x)ᵀ∇f(x) ≤ f(y), ∀x, y ∈ C.   (3.14)

Let
z = αx + (1 − α)y,   α ∈ [0, 1].   (3.15)
Then applying (3.14) at z, toward x and toward y, gives
f(z) + (x − z)ᵀ∇f(z) ≤ f(x),
f(z) + (y − z)ᵀ∇f(z) ≤ f(y).
Substituting z from (3.15) gives
f(z) + (1 − α)(x − y)ᵀ∇f(z) ≤ f(x),   (3.16)
f(z) + α(y − x)ᵀ∇f(z) ≤ f(y).   (3.17)
Since α and 1 − α are ≥ 0, we can take
α × (3.16) + (1 − α) × (3.17)
to yield
f(z) + α(1 − α)∇f(z)ᵀ(x − y + y − x) ≤ αf(x) + (1 − α)f(y),
and cancelling terms and substituting z from (3.15) gives
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y),   ∀x, y ∈ C.
Therefore f is convex in C.
The geometric interpretation of Theorem 3.2 is that f lies “above”
any tangent hyperplane to the graph, just as a convex function of one
variable g (t ) lies above any tangent line.

[Figure: the surface w = f(x), the tangent hyperplane at (x, f(x)), a point (y, f(y)) on the surface, the corresponding point (y, d) on the hyperplane, and the gradient ∇f(x).]

Suppose (y, d) is any point on the tangent hyperplane P to f at (x, f(x)). Then since the normal to the surface w = f(x) at (x, f(x)) is (∇f(x)ᵀ, −1), the equation of the tangent plane P at (x, f(x)) is

    d = f(x) + (y − x)ᵀ∇f(x).

What Theorem 3.2 says is

    f(y) ≥ f(x) + (y − x)ᵀ∇f(x),

i.e., f(y) ≥ d.

CONDITIONS FOR GLOBAL MINIMUMS


The reason we care about convex functions is summed up in the fol-
lowing Corollary to Theorem 3.2 and resulting theorem. They provide
a set of general conditions under which we can show that a stationary point is a global minimum.

Corollary 3.1. If a convex function f over a convex set C has a stationary point x* ∈ C, then x* is a global minimiser of f over C.

Proof. ∀x ∈ C,

    f(x) ≥ f(x*) + (x − x*)ᵀ∇f(x*)   (by Theorem 3.2)
         = f(x*),

since ∇f(x*) = 0 at a stationary point x*. Therefore x* is a global minimiser of f over C.

For example, we know from (3.10) that x* = −A⁻¹b is a stationary point of the quadratic q(x) = ½xᵀAx + xᵀb + c, and we also know that when A is positive definite the quadratic is convex, and therefore from Corollary 3.1 we know x* is a global minimum of the quadratic.
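As a quick numerical illustration of this fact (a sketch only; the matrix A and vector b below are made up), we can compute the stationary point of a quadratic directly and confirm it beats nearby points:

    import numpy as np

    # Hypothetical quadratic q(x) = 0.5 x^T A x + x^T b, A positive definite.
    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([-1.0, 2.0])
    q = lambda x: 0.5 * x @ A @ x + x @ b

    x_star = -np.linalg.solve(A, b)    # stationary point x* = -A^{-1} b
    print(np.linalg.eigvalsh(A) > 0)   # A positive definite, so q is convex
    # By Corollary 3.1, x* is the global minimiser; spot-check against random points:
    print(all(q(x_star) <= q(x_star + d) for d in np.random.randn(100, 2)))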
So the corollary is useful, but it only gives us a sufficient condition for a global minimum (of a convex function). That is, it tells us that if x* is a stationary point, we will have a global minimum, but it doesn't say that this is necessary. The following theorem provides both necessary and sufficient conditions. It should be immediately obvious that it encapsulates the result above, because when ∇f(x*) = 0, the conditions of the corollary are immediately satisfied, but the new theorem is more general than this in the case that constraints limit the possible set of x*.
Theorem 3.3. x* is a minimiser of the convex function f over the convex set C iff

    (y − x*)ᵀ∇f(x*) ≥ 0,   ∀y ∈ C.

Proof.

i) ⇐ i.e., if (y − x*)ᵀ∇f(x*) ≥ 0, ∀y ∈ C, then ∀y ∈ C,

    f(y) ≥ f(x*) + (y − x*)ᵀ∇f(x*)   (by Theorem 3.2)
         ≥ f(x*).

Therefore x* is a global minimiser of f over C.

ii) ⇒ i.e., if x* is the minimiser of f over the convex set C, then ∀t ∈ [0, 1] and y ∈ C, we have x* + t(y − x*) ∈ C, since C is a convex set. By Taylor's Theorem,

    f(x* + t(y − x*)) = f(x*) + t(y − x*)ᵀ∇f(x*) + O(t²),

or

    (y − x*)ᵀ∇f(x*) = [f(x* + t(y − x*)) − f(x*)]/t + O(t²)/t
                    = lim_{t→0⁺} [f(x* + t(y − x*)) − f(x*)]/t + 0
                    ≥ 0,
since the numerator is ≥ 0 because x* is the minimiser (so f(x*) ≤ f(x) for all x ∈ C), while the denominator is > 0 because t → 0⁺. Note that we can't take t → 0⁻, because then we cannot guarantee that x* + t(y − x*) ∈ C.

Theorem 3.3 is often used as a criterion for determining if x* is a minimiser of a convex function on a convex set. We can understand it a little better if we consider its 1-D version, i.e., take the convex function f : [a, b] → R. Theorem 3.3 says that x* is the minimiser of f : [a, b] → R iff

    (x − x*) f'(x*) ≥ 0,   ∀x ∈ [a, b].
There are three cases where this is true:

• if x* ∈ (a, b), then (x − x*) f'(x*) ≥ 0, ∀x ∈ [a, b] ⇒ f'(x*) = 0;

• if x* = a then f'(x*)(x − x*) ≥ 0 ⇒ f'(x*) ≥ 0 (f increasing at a);

• if x* = b then f'(x*)(x − x*) ≥ 0 ⇒ f'(x*) ≤ 0 (f decreasing at b).
In this chapter we consider unconstrained optimisation, so we
only really need to consider the first case, and so we shall be seeking
stationary points, but we will come back to these results when we
consider constrained optimisation in the next chapter.
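To make the 1-D criterion concrete, the following small sketch (with an invented convex function and interval) classifies the constrained minimiser using the condition of Theorem 3.3:

    import numpy as np

    # Hypothetical convex f(x) = (x + 1)^2 on [a, b] = [0, 2]; its
    # unconstrained minimum (x = -1) lies outside the interval.
    a, b = 0.0, 2.0
    fprime = lambda x: 2.0 * (x + 1.0)

    x_star = min(max(-1.0, a), b)   # clip the stationary point into [a, b]

    # Theorem 3.3's criterion: (x - x*) f'(x*) >= 0 for all x in [a, b].
    xs = np.linspace(a, b, 101)
    print(np.all((xs - x_star) * fprime(x_star) >= 0))  # True: x* = a, f'(a) > 0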
§3.3 CONVEX OPTIMISATION ALGORITHMS
In the previous section we built up the basic results we need in
order to go looking for minimisers. If we can perform analysis on our
function to find the stationary points, then we are all set. However, as
we have frequently noted, there are many cases where the stationary
points cannot be found through analysis alone.
In this section we shall start to apply the theorems we have just
learnt to build search algorithms for multidimensional convex optimi-
sation problems. We shall also analyse these algorithms, but typically
we can only do so in the limited context of minimising quadratic func-
tions (still without constraints) of the form given in (3.1), i.e.,

    q(x) = ½xᵀAx + bᵀx + c,   (3.18)

for a symmetric, positive-definite matrix A (note that c doesn't affect the location of the minimiser, only the function value at that point, and so we shall sometimes omit it from the problem).
We noted earlier that for a quadratic one way to find the minimiser would simply be to use (3.3), i.e., take x* = −A⁻¹b. We could approximate our problem by a quadratic (as in the 1-D Newton's Method search), then apply this to find an approximate minimum, and so on. The result would be a generalisation of Newton's method to multiple dimensions. The resulting technique is often used, but has flaws:
• All the requirements of the 1D Newton’s method apply, e.g., dif-
ferentiability is required.

• There is no guarantee that the Hessian of the approximating


quadratic is positive definite (even if the function we are approx-
imating is convex – it has to be strictly convex). In that case, we
may not always be able to invert the Hessian.
• The Hessian may be a very large matrix. Inverting such a matrix can require much more computation (per iteration) than the alternatives we will examine here, and what's more we may suffer (numerically) if any of the Hessian matrices are ill-conditioned.

One approach used to overcome this problem is to use Quasi-Newton methods, similar in some ways to the Secant method. One such method is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method. Quasi-Newton methods revolutionised nonlinear optimisation in the 1960s because they avoid costly computations of Hessian matrices and perform well in practice. Several kinds have been proposed, but since the 1970s the BFGS method has become more and more popular, and today it is accepted as the best Quasi-Newton method.
However, we shall look here at an alternative set of methods called
descent methods. Broadly speaking, these methods work by starting
at some point, selecting a descent direction (a direction which results
in a reduction of the objective function) and then searching in this
direction for a new point. More precisely, we define a descent direction
as follows:

Defn 3.3. A direction u is said to be a descent direction for f at x if there exists some ε > 0 such that

    f(x + λu) < f(x),   for all 0 < λ < ε.

Given a function f and a starting point x₀, we generate x_{k+1} from x_k (k = 0, 1, 2, . . .) as follows:

i) Select u_k, a descent direction at x_k.

ii) Find λ*_k that minimises g(λ) = f(x_k + λu_k), (λ ≥ 0).

iii) Put x_{k+1} = x_k + λ*_k u_k.

iv) Repeat until a termination condition is attained.

Note.

1. Finding u can always be done if ∇f(x) ≠ 0 (we'll see why in just a moment), but part of the problem is choosing a good descent direction. We will see several strategies for this below.

2. Finding λ_k in step ii) is doing a one-dimensional (line) search, as in Chapter 2! We know that the function g(λ) will have convexity properties similar to f(·), which in turn guarantee it is unimodal, and hence suitable for our line searches.

3. The termination condition might be, e.g., that ‖∇f(x_k)‖ is small enough (or that ‖x_k − x_{k−1}‖ is small enough). Others: |f(x_k) − f(x_{k−1})| / ‖x_k − x_{k−1}‖ < ε, u_kᵀ∇f(x_k) ≥ 0 (check!), f(x_k) > f(x_{k−1}), or simply that the maximum number of iterations is exceeded.

DESCENT DIRECTIONS
Our first task is to choose descent directions. The following lemma is useful for considering this choice (note it's really just the Chain Rule again).

Lemma 1.

    lim_{λ→0⁺} [f(x + λu) − f(x)] / λ = uᵀ∇f(x).
Proof. By Taylor's Theorem,

    f(x + λu) = f(x) + λuᵀ∇f(x) + O(λ²),

so

    [f(x + λu) − f(x)] / λ = uᵀ∇f(x) + O(λ²)/λ,

and

    lim_{λ→0⁺} [f(x + λu) − f(x)] / λ = uᵀ∇f(x).

Consequently if u is a descent direction for f at x, then

    lim_{λ→0⁺} [f(x + λu) − f(x)] / λ ≤ 0.

The converse, however, is not necessarily true, as we could have f(x + λu) = f(x) for all 0 < λ < ε. However, if we have

    lim_{λ→0⁺} [f(x + λu) − f(x)] / λ < 0,

then f(x + λu) − f(x) < 0 for sufficiently small λ > 0, and so u is a descent direction.
Therefore, by applying Lemma 1, if uT ∇ f (x) < 0 then u is a descent
direction at x, for any differentiable function f . As long as ∇ f (x) 6= 0
we can always find such a direction, for instance, take u = −∇ f (x). We
can visualise the result by considering the level curves of our objective
function, an example of which is shown in Figure 3.4 for illustration.
In this figure, ∇ f (x) is orthogonal to the level curve f (x) = c at x. If we
choose any vector pointing to the North-West of the tangent (i.e., the
angle between u and −∇ f (x) is in [0, π/2)), then we will be heading
downwards (at least for sufficiently small λ).
[Figure 3.4: The level curve f(x) = c of a function, with its tangent at x and the steepest descent direction −∇f(x).]

In fact, it turns out that u = −∇f(x) is the direction of steepest descent from x, i.e., for a given small displacement from x, the objective function f decreases more quickly in this direction than in any other. This then suggests that

    u_k = −∇f(x_k)

would be a good choice for a descent direction at x_k, and it is this we shall investigate next.

STEEPEST DESCENT
We have seen that uk = −∇ f (xk ) is the direction of steepest descent
from xk , so methods based on using this choice of uk are called steepest
descent methods, or sometimes gradient methods. The basic algorithm
is expressed in Algorithm 10.
The method will ideally converge to a minimum of the function. We
know that the method will always take steps that decrease the objective
function, and if this has a well-defined minimum (e.g., it is convex),
then we know that the objective function is bounded below. Hence, the
input: A starting point x₀ and a tolerance ε.
output: A final estimate of the minimiser x_k.
Initialisation: k = 0;
while ‖∇f(x_k)‖ ≥ ε do
    Let u_k = −∇f(x_k);
    Find λ*_k to minimise g_k(λ) = f(x_k + λu_k);
    Let x_{k+1} = x_k + λ*_k u_k;
    k = k + 1;
end

Algorithm 10: The Steepest Descent Algorithm.
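A direct Python transcription of Algorithm 10 might look like the sketch below. It is not a reference implementation: the inner line search uses scipy's bounded scalar minimiser over an assumed bracketing interval, and the test function is invented for illustration.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def steepest_descent(f, grad, x0, eps=1e-6, max_iter=500):
        x = np.asarray(x0, dtype=float)
        k = 0
        while np.linalg.norm(grad(x)) >= eps and k < max_iter:
            u = -grad(x)                               # u_k = -grad f(x_k)
            lam = minimize_scalar(lambda t: f(x + t * u),
                                  bounds=(0.0, 1.0),   # assumed bracket for lambda
                                  method='bounded').x
            x = x + lam * u
            k += 1
        return x, k

    # Hypothetical convex test function f(x) = x1^2 + 10 x2^2.
    f = lambda x: x[0]**2 + 10 * x[1]**2
    grad = lambda x: np.array([2 * x[0], 20 * x[1]])
    print(steepest_descent(f, grad, [5.0, 1.0]))       # converges towards (0, 0)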

method must converge. However, it isn’t obvious that it must converge


to the right thing. Theorem 3.4 provides a set of criteria that guarantee
convergence to the correct minimiser.

Theorem 3.4 (Minoux). Let f be a continuous, differentiable function, with f(x) → ∞ as ‖x‖ → ∞. Let x₀ be any starting point, and let {x_k} be the sequence of iterates found using the Steepest Descent Method above. Then {x_k} converges to a stationary point of f.
If, in addition, f is convex, then this stationary point will be a minimiser x* of f.

It can happen, though, that the convergence guaranteed by the


theorem can be very slow. Consider what is happening geometrically.
Consider the level set at a point x_k:

    {x : f(x) = c}   for c = f(x_k).

[Figure: the surface z = f(x), the level curve z = f(x_k) = c through the point (x_k, f(x_k)), and the direction −∇f(x_k) at x_k.]

Then x_k is on the level curve and, from Maths 1, ∇f(x_k) is orthogonal to the level curve at x_k.
[Figure: one steepest-descent step; −∇f(x_k) is orthogonal to the level curve f(x) = f(x_k), and the step −λ_k∇f(x_k) takes x_k to x_{k+1}.]

But it turns out that u_k = −∇f(x_k) and u_{k+1} = −∇f(x_{k+1}) are always orthogonal!
Reason: g_k(λ) = f(x_k + λu_k) is minimal when

    g_k'(λ) = 0,   i.e.,   u_kᵀ∇f(x_k + λu_k) = 0,   and then λ = λ_k.

But x_k + λ_k u_k = x_{k+1}, so u_kᵀ∇f(x_{k+1}) = 0, where −∇f(x_{k+1}) = u_{k+1}.
Thus, for Steepest Descent methods, u_k and u_{k+1} are orthogonal, i.e., the next search direction will be orthogonal to the present one. Also, x_{k+1} − x_k (equivalently u_k) is tangent to the level curve f(x) = f(x_{k+1}).
A typical sequence of search directions therefore zig-zags, each orthogonal to the last (see Chong & Zak, p.103).
If the method is applied to a function with a long narrow valley, it can become ineffective (see the diagram in Chong & Zak, p.110).
Example 20. Use the Steepest Descent Method to find the minimum of

    f(x₁, x₂) = x₁⁴ + 4x₁²x₂² − 2x₁² + 2x₂² − 1,

starting at (1, 1). Here,

    ∇f = (4x₁³ + 8x₁x₂² − 4x₁,  8x₁²x₂ + 4x₂)ᵀ.

Iteration 1. x₀ = (1, 1), so ∇f(x₀) = (8, 12) and u₀ = (−8, −12). Then

    g₀(λ) = f(x₀ + λu₀) = f(1 − 8λ, 1 − 12λ)
          = (1 − 8λ)⁴ + 4(1 − 8λ)²(1 − 12λ)² − 2(1 − 8λ)² + 2(1 − 12λ)² − 1.

Using one of the 1-D search methods from earlier (e.g., Newton's method), we get g₀'(λ) = 0 when λ₀ = 0.0700224. So x₁ = (0.439803, 0.159704).

[Figure: surface plot of f(x₁, x₂) = x₁⁴ + 4x₁²x₂² − 2x₁² + 2x₂² − 1 over [−2, 2] × [−2, 2].]
Iteration 2. We have x1 = (0.439803, 0.159704).
Then u1 = −∇ f (x1 ) = (1.329194, −0.885946).

g (λ) = f (x1 + λu1 ) is minimised when λ1 = 0.316159

This gives x2 = (0.860039646, −0.120395801).

k xk uk λk
0 (1, 1) (−8, −12) 0.07002
1 (0.439803, 0.159704) (1.329194, −0.885946) 0.31615
2 (0.860039646, −0.120395801) (0.79585148, 1.194006764) 0.11774
3 (0.953747974, 0.020193719) (0.341629584, −0.227726217) 0.1166
4 (0.993592233, −0.00636599) (0.050448349, 0.075741299) 0.0938
5 (0.99832681, 0.000742331) (0.013347542, −0.008888118) 0.1086
6 (0.999776887, −0.000223274) (0.001783907, 0.00267849)

The solution is converging to (1, 0), which is a strict local minimum of f, with f(1, 0) = −2. There also exists a saddle point at (0, 0) with f(0, 0) = −1, and another local minimum at (−1, 0), which also has f(−1, 0) = −2.

Quadratic Functions and Steepest Descent Method.

When we restrict our attention to the case of a quadratic objective function, we can explicitly calculate the convergence of the steepest descent method.
From (3.2) and (3.10) we see that for a quadratic f(x) = ½xᵀAx + xᵀb + c we have g_k(λ) = f(x_k + λu_k), where

    u_k = −∇f(x_k) = −A(x_k − p).
So

    g_k'(λ) = dg/dλ = u_kᵀ∇f(x_k + λu_k)
            = u_kᵀ{A(x_k + λu_k − p)}
            = λu_kᵀAu_k + u_kᵀA(x_k − p)
            = λu_kᵀAu_k − u_kᵀu_k,

since A(x_k − p) = −u_k. Therefore g_k'(λ) = 0 when λ = λ_k = ‖u_k‖²/(u_kᵀAu_k), where ‖u_k‖² = u_kᵀu_k.
The denominator is ≠ 0 unless u_k = 0, since A is positive definite (⇒ uᵀAu > 0, ∀u ≠ 0), and so we know that λ_k > 0.
Then

    x_{k+1} = x_k + λ_k u_k,
    f(x_{k+1}) = ½(x_k + λ_k u_k − p)ᵀA(x_k + λ_k u_k − p) + d
               = f(x_k) − λ_k‖u_k‖² + ½λ_k²u_kᵀAu_k
               = f(x_k) − ½λ_k‖u_k‖²
               < f(x_k),   since λ_k > 0.
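For quadratics the line search is therefore exact and cheap, and the whole method fits in a few lines of numpy (a sketch; the matrix A below corresponds to example (ii) that follows, namely f(x) = x₁²/5 + x₂²):

    import numpy as np

    def steepest_descent_quadratic(A, b, x0, eps=1e-10, max_iter=100):
        # Exact step lam_k = ||u_k||^2 / (u_k^T A u_k) for
        # f(x) = 0.5 x^T A x + b^T x, with A symmetric positive definite.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            u = -(A @ x + b)                 # u_k = -grad f(x_k)
            if np.linalg.norm(u) < eps:
                break
            x = x + (u @ u) / (u @ A @ u) * u
        return x

    A = np.diag([0.4, 2.0])                  # f(x) = x1^2/5 + x2^2 (example (ii))
    print(steepest_descent_quadratic(A, np.zeros(2), [5.0, 1.0]))  # -> near (0, 0)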

Examples:
(i) f(x) = x₁² + x₂², x = (x₁, x₂).

f(x) = ½xᵀ[2I]x, so A = 2I. Therefore ∇f(x) = 2x. Obviously x* = 0.
Start at any arbitrary initial point x₀:

    ∇f(x₀) = 2x₀;   u₀ = −2x₀;
    λ₀ = (u₀ᵀu₀)/(u₀ᵀAu₀) = (x₀ᵀx₀)/(x₀ᵀAx₀) = 1/2.

So

    x₁ = x₀ + λ₀u₀ = x₀ − (2/2)x₀ = 0,   and ∇f(x₁) = 0.

x₁ is the minimum, so we have reached x* in one step. The level curves are circles, x₁² + x₂² = k (> 0), and the gradient vector at any x₀ points at 0.
(ii) f(x₁, x₂) = x₁²/5 + x₂². Obviously, the minimum is at (0, 0).
In this case, convergence is very slow, as we get the narrow-valley effect mentioned earlier. Here,

    A = ( 2/5  0
          0    2 ),   so ∇f(x) = Ax = 2(x₁/5, x₂)ᵀ.

a) Starting at, say, x₀ = (5, 1)ᵀ ⇒ u₀ = −∇f(x₀) = −2(1, 1)ᵀ, on the level curve x₁²/5 + x₂² = 6.
Line search along (5 − 2λ, 1 − 2λ), λ ≥ 0.
Using the formula from before, we have

    λ₀ = (u₀ᵀu₀)/(u₀ᵀAu₀) = ((−2)² · 2) / ((−2)² · 2 · (1/5 + 1)) = 5/6.

Alternatively, use

    g(λ) = (1/5)(5 − 2λ)² + (1 − 2λ)²
∴   g'(λ) = −(4/5)(5 − 2λ) − 4(1 − 2λ)
⇒   −4 + (8/5)λ − 4 + 8λ = 0,   so λ = 5/6.

So the new point is

    x₁ = (5 − 5/3, 1 − 5/3)ᵀ = (10/3, −2/3)ᵀ,
which lies on the level curve x₁²/5 + x₂² = 8/3.
(b) The new direction is

    u₁ = −∇f(x₁) = −2(2/3, −2/3)ᵀ = −(4/3)(1, −1)ᵀ = k(−1, 1)ᵀ   (k > 0).

Note that u₀ᵀu₁ = 0, as we expect. Now line search along (10/3 − λ, −2/3 + λ), λ ≥ 0, to minimise

    g(λ) = (1/5)(10/3 − λ)² + (−2/3 + λ)²
    0 = g'(λ) = 2{ −(1/5)(10/3 − λ) + (−2/3 + λ) }
⇒   4/3 = (6/5)λ,   so λ = 10/9.

Note: if you use the formula for λ developed above for the general quadratic, then when you find the new x, remember to take x₁ + λ₁u₁, where u₁ = −(4/3)(1, −1)ᵀ . . . don't just use x₁ + λ₁(1, −1)ᵀ . . . as we did above. The formula for λ might cancel the k's, but remember that the argument relied on −A(x − p) = u, not some scalar multiple of this! In fact, if you do plug u₁ = (1, −1)ᵀ into the formula for λ, you will get the incorrect answer λ₁ = 5/6!

The new point is x₂ = (20/9, 4/9)ᵀ, which lies on the level curve

    x₁²/5 + x₂² = 32/27.
(c) The new direction is

    u₂ = −∇f(x₂) = −2(4/9, 4/9)ᵀ = k(−1, −1)ᵀ.

Note that u₁ᵀu₂ = 0, as expected. Line search along (20/9 − λ, 4/9 − λ), λ ≥ 0, to minimise

    g(λ) = (1/5)(20/9 − λ)² + (4/9 − λ)²
    0 = g'(λ) = −2{ (1/5)(20/9 − λ) + (4/9 − λ) }  ⇒  (6/5)λ = 8/9,   so λ = 20/27.

The new point is x₃ = (40/27, −8/27)ᵀ, which lies on the level curve

    x₁²/5 + x₂² = 128/243.
[Figure 3.2.2: Level curves for the Steepest Descent example, f(x₁, x₂) = x₁²/5 + x₂², starting at (5, 1) on the level curve x₁²/5 + x₂² = 6; the next iterate (10/3, −2/3) lies on x₁²/5 + x₂² = 8/3, and the path zig-zags towards the origin.]

Convergence of Steepest Descent Method for Quadratics

If the eigenvalues of the positive definite, symmetric matrix A are α₁, . . . , αₙ, sorted so that 0 < α₁ ≤ · · · ≤ αₙ, then we call

    r = αₙ/α₁ ≥ 1

the condition number of A. It is the ratio of the largest to the smallest eigenvalue³. For example

(i) A = 2I: the eigenvalues are 2, 2, and so r = 1.

³ You may come across other definitions of condition number, but in general they will serve a similar purpose to the one we describe here.
(ii) A = ( 2/5  0
           0    2 )   has eigenvalues 2/5, 2, and ∴ r = 5.

(iii) A = ( 4   −1
            −1   4 )   has eigenvalues 3, 5, and ∴ r = 5/3.
When f(x) is quadratic with matrix A and the condition number of A is large, we say that the problem is ill-conditioned, because convergence will be slow. We will show that:

a) If r = 1, the method of steepest descent converges in one step.

b) If r ≫ 1, then convergence is slow, because the level curves are "long and skinny".

When r is very large, our numerical technique may even fail due to numerical errors.
To determine the rate of convergence at the k-th step, consider

    R = [f(x_{k+1}) − f(x*)] / [f(x_k) − f(x*)].

For f(x) = ½(x − p)ᵀA(x − p), we know that x* = p, f(x*) = 0, and so

    R = f(x_{k+1})/f(x_k) ≥ 0,   since f(x) ≥ 0.
Recall f(x_{k+1}) = f(x_k) − ½λ_k‖u_k‖² = f(x_k) − ‖u_k‖⁴/(2u_kᵀAu_k). Therefore

    R = f(x_{k+1})/f(x_k) = 1 − (½λ_k‖u_k‖²)/f(x_k).
But we know that λ_k = ‖u_k‖²/(u_kᵀAu_k), so what do we know about f(x_k)?
u_k = −A(x_k − p), therefore x_k − p = −A⁻¹u_k (since A⁻¹ exists when A is positive definite). So

    f(x_k) = ½(−A⁻¹u_k)ᵀA(−A⁻¹u_k)
           = ½u_kᵀ(A⁻¹)ᵀAA⁻¹u_k = ½u_kᵀA⁻¹u_k,

since (A⁻¹)ᵀ = A⁻¹ because A is symmetric. Hence

    f(x_{k+1})/f(x_k) = 1 − ‖u_k‖⁴ / [(u_kᵀAu_k)(u_kᵀA⁻¹u_k)].
Now we use Kantorovich's Inequality, stated below:

Theorem 3.5 (Kantorovich's Inequality). Let pᵢ ≥ 0 and 0 < a ≤ xᵢ ≤ b for i = 1, 2, . . . , n. Then

    ( Σᵢ₌₁ⁿ pᵢxᵢ )( Σᵢ₌₁ⁿ pᵢ/xᵢ ) ≤ ((a + b)²/(4ab)) ( Σᵢ₌₁ⁿ pᵢ )².

The inequality is in fact a special case of the Cauchy-Schwarz Inequality, and can actually be made tighter, e.g., see Luenberger, Introduction to Linear and Nonlinear Programming, p.15, but we only need this form for our result here. We know that for a positive definite
matrix A, the eigenvalues are all positive, and so we write them in order as 0 < α₁ ≤ α₂ ≤ α₃ ≤ · · · ≤ αₙ. Then we use the eigenvalues as the xᵢ in Kantorovich's Inequality, and note that a positive definite, symmetric matrix can always be diagonalised (i.e., multiplied by a unitary⁴ matrix P such that D = PAPᵀ is a matrix with zeros everywhere except the diagonal, the eigenvalues of A on the diagonal, and similarly for its inverse). Thus we can write (taking y = Pu_k)

    (u_kᵀAu_k)(u_kᵀA⁻¹u_k) = (u_kᵀPᵀPAPᵀPu_k)(u_kᵀPᵀPA⁻¹PᵀPu_k)
                           = (yᵀDy)(yᵀD⁻¹y)
                           = ( Σᵢ αᵢyᵢ² )( Σᵢ yᵢ²/αᵢ ),   (3.19)

and

    ‖u_k‖⁴ = (u_kᵀu_k)² = (u_kᵀPᵀPu_k)² = (yᵀy)² = ( Σᵢ yᵢ² )².   (3.20)

Using the eigenvalues αᵢ as the xᵢ and pᵢ = yᵢ² ≥ 0 in Theorem 3.5, we get

    ‖u_k‖⁴ / [(u_kᵀAu_k)(u_kᵀA⁻¹u_k)] ≥ 4α₁αₙ/(α₁ + αₙ)².

So

    0 ≤ R = f(x_{k+1})/f(x_k) ≤ 1 − 4α₁αₙ/(α₁ + αₙ)² = ((α₁ − αₙ)/(α₁ + αₙ))² = ((1 − r)/(r + 1))²,

⁴ A unitary matrix is a square matrix such that U*U = UU* = I, where * denotes the conjugate transpose of the matrix. For a real matrix, U* = Uᵀ.
i.e.,

    f(x_{k+1})/f(x_k) ≤ ((1 − r)/(r + 1))²,

for f(x) = ½(x − p)ᵀA(x − p), (≥ 0).
Thus if r = 1 (i.e., αₙ/α₁ = 1), then f(x₁)/f(x₀) ≤ 0 for any x₀ (≠ p). Hence f(x₁) = 0, i.e., the minimum is reached in one step.
If r ≠ 1, then the recursive relation gives

    f(x_k) ≤ ((1 − r)/(r + 1))^{2k} f(x₀),

and we would expect that for r ≫ 1, convergence will be slow, because |(1 − r)/(r + 1)| ≈ 1.

Consider Example (ii) above, where r = 5. Therefore,

    f(x_k) ≤ ((1/5 − 1)/(1/5 + 1))^{2k} f(x₀) = (2/3)^{2k} f(x₀) = 6(2/3)^{2k}.

Now f(x₁) = 8/3 = 6(2/3)², f(x₂) = 32/27 = 6(2/3)⁴, and similarly f(x₃) = 128/243 = 6(2/3)⁶.
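These values are easy to reproduce numerically (a small self-contained numpy sketch; note that for this example the bound happens to hold with equality):

    import numpy as np

    A = np.diag([0.4, 2.0])                 # f(x) = x1^2/5 + x2^2
    f = lambda x: 0.5 * x @ A @ x
    eigs = np.linalg.eigvalsh(A)
    r = eigs[-1] / eigs[0]                  # condition number r = 5
    rho = ((1 - r) / (1 + r)) ** 2          # per-step factor = 4/9

    x = np.array([5.0, 1.0])                # f(x_0) = 6
    for k in range(1, 4):
        u = -A @ x
        x = x + (u @ u) / (u @ A @ u) * u
        print(f(x), 6 * rho ** k)           # 8/3, 32/27, 128/243 in both columns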

Properties of the Steepest Descent Method

• Reliable, even when the starting point is far from the minimum
point.
• Convergence can be extremely slow, especially near the minimum point, because the decrease in the f value is proportional to ‖∇f(x)‖², which approaches 0.

• The subsequent directions are orthogonal if the line search is exact:

    ∇f(x_k)ᵀ∇f(x_{k+1}) = 0.

• Often used as a starting or restarting direction for more sophisticated methods.

If we cannot guarantee fast convergence even for a quadratic, how


can we do so for other functions? Thus the question is: can we do
better than steepest descent?
We now look at a method which:

1. Determines the minimiser of a quadratic in n variables, in n steps.

2. Requires no evaluation of Hessians.

3. Requires no matrix inversion and no storage of an n × n matrix.

4. (Typically) performs better than the Steepest Descent Method, but not as well as Newton's Method.

CONJUGATE DIRECTIONS

Defn 3.4. If A is a real, symmetric matrix, then two vectors u₁, u₂ (≠ 0) are conjugate with respect to A if

    u₁ᵀAu₂ = u₂ᵀAu₁ = 0.

Example 21. The following u and v are conjugate WRT A:

    A = ( 2  1
          1  3 ),   u = (1, 0)ᵀ,   v = (1, −2)ᵀ.

Theorem 3.6. If u₁, u₂ are orthogonal eigenvectors of A, then u₁, u₂ are conjugate w.r.t. A.

Proof.

    u₁ᵀAu₂ = u₁ᵀλ₂u₂ = λ₂u₁ᵀu₂ = 0,

since u₁ and u₂ are orthogonal.

Corollary 3.2. If u₁, u₂ are eigenvectors of A corresponding to distinct eigenvalues, then u₁, u₂ are conjugate w.r.t. A.
Proof. From Maths 1, eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are orthogonal, and so Theorem 3.6 implies that u₁, u₂ are conjugate w.r.t. A.

Corollary 3.3. If A is an n × n real symmetric matrix, then there exists a set of n nonzero vectors that are mutually conjugate with respect to A.

Proof. From Maths 1, an n × n real symmetric matrix has n orthogonal eigenvectors (it is orthogonally diagonalisable). Therefore, Theorem 3.6 implies that these n orthogonal eigenvectors are also mutually conjugate with respect to A.

We have just seen that one way of establishing conjugate directions is by determining the eigenspaces of the symmetric matrix. However, determining the eigenvectors of a large matrix requires considerable computation. So long as the matrix is also positive definite, there is an easier way. First define the standard basis vectors eᵢ, for i = 1, 2, . . . , n, as

    eᵢ = (0, 0, . . . , 1, . . . , 0)ᵀ,   with the 1 in the i-th spot.   (3.21)

In fact we could use any orthonormal basis for Rⁿ, but this is the easiest to start with. Then we can use this to construct a set of conjugate vectors as follows.
Theorem 3.7. For any positive definite symmetric n × n matrix A, the vectors constructed as follows:

    u₁ = e₁,
    u_{k+1} = e_{k+1} − Σᵢ₌₁ᵏ ( (e_{k+1}ᵀAuᵢ)/(uᵢᵀAuᵢ) ) uᵢ,   k = 1, . . . , n − 1,

are a set of n mutually conjugate vectors with respect to A.

Proof. The proof is much like that for the Gram-Schmidt process, from Maths 1. The fact that A is positive definite ensures that u_{k+1} is well defined, ∀k = 1, . . . , n − 1, because the denominator is always > 0.
Then it is easy to prove by induction that

    span{u₁, u₂, . . . , u_k} ⊆ span{e₁, e₂, . . . , e_k},   ∀k ≤ n.

Also, since e_{k+1} ∉ span{e₁, e₂, . . . , e_k}, we can't write u_{k+1} as a linear combination of the previous eᵢ, and hence u_{k+1} ≠ 0, so that {u₁, . . . , uₙ} is a set of n nonzero vectors.
It can also be proved by induction that

    uᵢᵀAu_j = 0,   ∀i ≠ j,

i.e., the vectors are mutually conjugate with respect to A.
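The construction in Theorem 3.7 is mechanical enough to code directly; here is a minimal numpy sketch (using, for concreteness, the positive definite matrix that appears in Example 22 later in this section):

    import numpy as np

    def conjugate_directions(A):
        # Gram-Schmidt-like process of Theorem 3.7, starting from e_1.
        n = A.shape[0]
        us = [np.eye(n)[:, 0]]
        for k in range(1, n):
            e = np.eye(n)[:, k]
            u = e - sum((e @ A @ ui) / (ui @ A @ ui) * ui for ui in us)
            us.append(u)
        return us

    A = np.array([[3.0, 0.0, 1.0],
                  [0.0, 4.0, 2.0],
                  [1.0, 2.0, 3.0]])
    us = conjugate_directions(A)
    # All off-diagonal u_i^T A u_j should be (numerically) zero:
    print([round(us[i] @ A @ us[j], 12) for i in range(3) for j in range(i + 1, 3)])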
This provides us with a simple and computationally efficient method to construct a set of mutually conjugate vectors uᵢ. There are now a number of useful small results we can use to enhance our understanding of these conjugate vectors.
Lemma 2. If {u1 , . . . , uk } are k nonzero, mutually conjugate vectors with
respect to the positive definite matrix A, then u1 , . . . , uk are linearly
independent.

Proof. Let α₁, . . . , α_k be scalars such that

    α₁u₁ + α₂u₂ + · · · + α_ku_k = 0.

Premultiply by u_jᵀA for some j ∈ {1, . . . , k} to get

    α₁u_jᵀAu₁ + α₂u_jᵀAu₂ + · · · + α_ku_jᵀAu_k = 0.

By the assumption that the vectors uᵢ are mutually conjugate wrt A, we know that u_jᵀAuᵢ = 0 when i ≠ j, so

    α_j(u_jᵀAu_j) = 0.

But A is positive definite, so u_jᵀAu_j ≠ 0 (unless u_j = 0). So α_j = 0, and hence the vectors are linearly independent.
Lemma 3. If v ∈ Rⁿ and {u₁, . . . , uₙ} is a set of n nonzero mutually conjugate vectors with respect to the positive definite symmetric matrix A, then

    v = Σ_{k=1}ⁿ ( (u_kᵀAv)/(u_kᵀAu_k) ) u_k.

Proof. From Lemma 2, the vectors {u₁, . . . , uₙ} are linearly independent, and so they form a basis for Rⁿ (as there are n of them). Therefore they span the space, i.e., there exist scalars α₁, α₂, . . . , αₙ ∈ R such that

    v = Σ_{k=1}ⁿ α_ku_k.

Premultiply by u_jᵀA to get

    u_jᵀAv = Σ_{k=1}ⁿ α_k(u_jᵀAu_k) = α_ju_jᵀAu_j,

i.e., α_j = (u_jᵀAv)/(u_jᵀAu_j), and the result follows.
We could now construct an algorithm for minimising quadratic
objective functions. The basic idea would be to construct a set of
conjugate directions, and to use these as our search directions. The
idea is elegant, and provides several guarantees (e.g., that the method
will converge in at most n steps), but it has two main problems:
• calculating the conjugate directions in advance is wasteful – we
can do better, and

• it does not easily generalise to non-quadratic (but still convex)


objective functions.

What we need is a technique that calculates new conjugate directions at each step of the algorithm in an efficient manner. Such an approach exists, and we call it the Conjugate Gradient Method.
CONJUGATE GRADIENT METHOD
The conjugate gradient method is designed for a quadratic problem as
follows.

Problem 3.2. Find the minimiser x of the quadratic function

    f(x) = ½xᵀAx + bᵀx + c,   (x ∈ Rⁿ).

The method proceeds exactly as in our general descent algorithm (see p.104), where we calculate the descent direction by constructing a succession of mutually conjugate vectors WRT A. We present the approach in Algorithm 11, but it will take some explanation (to follow) to understand why it works.
The termination condition is similar to previous algorithms – we search for a point where the gradient is close to zero. In some mathematical descriptions of this algorithm, we might search for it to be exactly zero, as it should find the exact minimum for the quadratic, but it is good practice in all programming to accept that floating point calculations are never exact. We should never expect that we can test equality with zero, but only that the vector is very small.
In practice, we would also ensure that terms such as u_kᵀAu_k are only calculated once per iteration, to reduce the required arithmetic at each step.
The starting direction is just chosen using steepest descent. However, from then on, we use directions chosen to be as steeply descending as possible, while maintaining mutual conjugacy.
problem: Problem 3.2
input: A starting point x₀ and tolerance ε.
output: A final estimate of the minimiser x_k.
Initialisation: k = 0;
g₀ = ∇f(x₀) = Ax₀ + b;
u₀ = −g₀;
while ‖g_k‖ ≥ ε do
    Let λ*_k = −(g_kᵀu_k)/(u_kᵀAu_k);
    Let x_{k+1} = x_k + λ*_k u_k;
    Let g_{k+1} = ∇f(x_{k+1});
    Let β_k = (g_{k+1}ᵀAu_k)/(u_kᵀAu_k);
    Let u_{k+1} = −g_{k+1} + β_k u_k;
    k = k + 1;
end

Algorithm 11: The Conjugate Gradient Method.
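A direct numpy transcription of Algorithm 11 might look as follows (a sketch: the small test problem is invented, and a robust version would also cap the number of iterations):

    import numpy as np

    def conjugate_gradient(A, b, x0, eps=1e-10):
        # Algorithm 11 for f(x) = 0.5 x^T A x + b^T x, A symmetric positive definite.
        x = np.asarray(x0, dtype=float)
        g = A @ x + b                     # g_0
        u = -g
        while np.linalg.norm(g) >= eps:
            Au = A @ u                    # form A u_k once per iteration
            lam = -(g @ u) / (u @ Au)
            x = x + lam * u
            g = A @ x + b
            beta = (g @ Au) / (u @ Au)
            u = -g + beta * u
        return x

    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])            # hypothetical SPD matrix
    b = np.array([-1.0, -2.0])
    print(conjugate_gradient(A, b, [0.0, 0.0]))   # agrees with -A^{-1} b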

Notice also that we don't need to perform a line search at each iteration to find the optimal distance along our search direction. We calculate this directly using matrix arithmetic.
To see the mutual conjugacy, and optimality, of these steps, we need to demonstrate a little more theory. First note that for all k, we can write

    x_k = x₀ + Σ_{j=0}^{k−1} λ*_j u_j,   (3.22)

by simple recursion on the formula for x_{k+1} in Algorithm 11. We then provide the following useful lemma.
Lemma 4. If the vectors u_j for j ≤ k are mutually conjugate WRT A, then the vectors u_j and g_{k+1} defined by Algorithm 11 are orthogonal for all j ≤ k, i.e., u_jᵀg_{k+1} = 0.

Proof. Recall that for all k, g_{k+1} = Ax_{k+1} + b, so

    u_jᵀg_{k+1} = u_jᵀ(Ax_{k+1} + b)
                = u_jᵀ( A[ x₀ + Σᵢ₌₀ᵏ λ*ᵢuᵢ ] + b )
                = u_jᵀ(Ax₀ + b) + Σᵢ₌₀ᵏ λ*ᵢ u_jᵀAuᵢ,   (3.23)

from (3.22). Now we have assumed that the vectors u_j for j ≤ k are mutually conjugate WRT A, so all of the terms in the summation are zero except the j-th, and λ*_j = −u_jᵀg_j/(u_jᵀAu_j), so

    u_jᵀg_{k+1} = u_jᵀ(Ax₀ + b) + λ*_j u_jᵀAu_j
                = u_jᵀ(Ax₀ + b) − u_jᵀg_j.   (3.24)

Now, by definition, g_j = Ax_j + b, so

    u_jᵀg_{k+1} = u_jᵀ(Ax₀ + b) − u_jᵀ(Ax_j + b)
                = u_jᵀA(x₀ − x_j)
                = −u_jᵀA Σᵢ₌₀^{j−1} λ*ᵢuᵢ,   (3.25)

again using (3.22). The mutual conjugacy of the u_j implies that this summation is zero.
Theorem 3.8. The search directions u_k of Algorithm 11 are mutually conjugate WRT the matrix A.

Proof. (By induction.) Using the construction of u_{k+1} from Algorithm 11,

    u_{k+1}ᵀAu_j = (−g_{k+1} + β_ku_k)ᵀAu_j = −g_{k+1}ᵀAu_j + β_ku_kᵀAu_j.   (3.26)

In particular, taking j = k and using β_k from Algorithm 11, we get

    u_{k+1}ᵀAu_k = −g_{k+1}ᵀAu_k + ( (g_{k+1}ᵀAu_k)/(u_kᵀAu_k) ) u_kᵀAu_k = 0.   (3.27)

So, we note that u_{k+1} and u_k are mutually conjugate for all k, and in particular for k = 0, i.e., the first two vectors are mutually conjugate WRT A.
Now, assume u₀, u₁, . . . , u_k are mutually conjugate with respect to A. We know (from the equation above) that this is true for k = 1, and aim to show that given the assumption for k, it is also true for k + 1. Given the mutual conjugacy of these vectors, Lemma 4 implies that g_{k+1}ᵀu_j = 0, for all j = 0, 1, . . . , k. We also note that from the algorithm

    g_{k+1}ᵀu_j = g_{k+1}ᵀ(−g_j + β_{j−1}u_{j−1}) = −g_{k+1}ᵀg_j,

and consequently g_{k+1}ᵀg_j = 0, for all j = 0, 1, . . . , k. Using (3.22) again, we get

    Au_j = A(x_{j+1} − x_j)/λ*_j = (g_{j+1} − g_j)/λ*_j.
Therefore, for j ≤ k,

    u_{k+1}ᵀAu_j = −g_{k+1}ᵀ(g_{j+1} − g_j)/λ*_j + β_ku_kᵀAu_j = 0,

since g_{k+1}ᵀgᵢ = 0 for all i ≤ k, and u_k, u_j are conjugate with respect to A by the induction hypothesis, for j ≤ k. Therefore u₀, . . . , u_{k+1} are mutually conjugate with respect to A if u₀, . . . , u_k are, and so by induction all the search directions are mutually conjugate.

The theorem above shows that the algorithm generates mutually conjugate search directions, but this is not enough to demonstrate that the algorithm finds the minimiser. We need also that the distances λ*_k are actually optimal. We can see this is true by considering the search in question, namely, we aim to minimise f(x_{k+1}) = f(x_k + λ_ku_k). For the quadratic problem in question we can write this

    λ*_k = argmin_λ f(x_k + λu_k)
         = argmin_λ ½(x_k + λu_k)ᵀA(x_k + λu_k) + bᵀ(x_k + λu_k) + c
         = argmin_λ f(x_k) + λu_kᵀ(Ax_k + b) + (λ²/2)u_kᵀAu_k
         = argmin_λ f(x_k) + λu_kᵀg_k + ½λ²u_kᵀAu_k,   (3.28)

by the definition of g_k. This is a simple quadratic in λ, and minimising it gives exactly the value of λ*_k used in Algorithm 11. However, the following theorem demonstrates something more powerful, namely that we find the optimal search point at each step of the algorithm, given the search directions used so far.
Theorem 3.9. The search points x_k generated by Algorithm 11 satisfy

    f(x_{k+1}) = min_{λ₀,...,λ_k} f( x₀ + Σᵢ₌₀ᵏ λᵢuᵢ ),

given a quadratic objective function as in Problem 3.2. In other words, f(x_{k+1}) = min_{x∈V} f(x), where

    V = { x₀ + span{u₀, . . . , u_k} }.

Proof. Define the matrix U_k = (u₀ u₁ . . . u_k), i.e., the i-th column of U_k is u_{i−1}. Note that (3.22) can be written x_{k+1} = x₀ + U_kλ, where λ = (λ*₀, . . . , λ*_k)ᵀ. Hence x_{k+1} ∈ V.
Now for any vector x ∈ V there exists a vector a such that x = x₀ + U_ka. Let φ_k(a) = f(x₀ + U_ka). Now, I claim that φ_k(·) is a quadratic function with a unique minimiser. To prove this, write the quadratic in completed-square form (ignoring the constant, which has no effect on the location of the minimiser):

    φ_k(a) = f(x₀ + U_ka)
           = ½(x₀ + U_ka − p)ᵀA(x₀ + U_ka − p)
           = ½aᵀU_kᵀAU_ka + (x₀ − p)ᵀAU_ka + ½(x₀ − p)ᵀA(x₀ − p),

which is a quadratic equation in a with symmetric matrix U_kᵀAU_k.
In order to show that this has a minimum, we need to show that U_kᵀAU_k is positive definite, given that A is positive definite. Consider any vector
z ∈ R^{k+1}, z ≠ 0, and so for w = U_kz,

    zᵀU_kᵀAU_kz = wᵀAw > 0,   for all w ≠ 0,

since A is positive definite. Now, w ∈ Rⁿ, and w = 0 iff z = 0, as U_k consists of linearly independent columns. Therefore U_kᵀAU_k is positive definite, and hence φ_k(·) is a strictly convex function, and hence has a unique minimiser.
Now from the definition of φ_k(·), and the definition of g_{k+1},

    ∇φ_k(λ) = U_kᵀ∇f(x₀ + U_kλ)
            = U_kᵀ∇f(x_{k+1})
            = U_kᵀg_{k+1}
            = 0,   by Lemma 4.

So λ minimises φ_k(·); in other words,

    f(x_{k+1}) = min_a f(x₀ + U_ka) = min_{x∈V} f(x).

There is very little left for us to show to demonstrate the algorithm's accuracy. We know from the previous theorem that at each step the algorithm must find the optimal possible solution given the set of search directions, and we know that these are conjugate. We only have to show that the algorithm terminates and we are done. The following theorem demonstrates this final result.
Theorem 3.10. Algorithm 11 terminates in at most n steps.

Proof. From Lemma 3 there can be at most n mutually conjugate vectors in the space Rⁿ, and these will span the space. Therefore, by Theorem 3.9, after n steps the algorithm must have found the minimum over the whole space.

It should be clear that at the minimum g_k = ∇f(x_k) = 0 (to within numerical errors). Thus we use that as a termination criterion, and we allow the algorithm to terminate early (to save computations), but in fact there is nothing to stop us continuing. If g_k = 0 then β_k = 0, and so the subsequent steps achieve nothing, but it is convenient to consider this to be what happens in proving the results above.
Example 22. Minimise the following quadratic function

    f(x₁, x₂, x₃) = (3/2)x₁² + 2x₂² + (3/2)x₃² + x₁x₃ + 2x₂x₃ − 3x₁ − x₃
                  = ½xᵀAx + bᵀx,   where A = ( 3  0  1
                                               0  4  2
                                               1  2  3 ),   b = (−3, 0, −1)ᵀ,

starting at x₀ = 0.
First, ∇f(x) = Ax + b, so

    g(x) = ∇f(x) = (3x₁ + x₃ − 3, 4x₂ + 2x₃, x₁ + 2x₂ + 3x₃ − 1)ᵀ,

and we know that the minimum occurs when ∇f = 0, so x* = (1, 0, 0)ᵀ.
Algorithm:

k = 0:
    g₀ = ∇f(x₀) = (−3, 0, −1)ᵀ,
    u₀ = −g₀ = (3, 0, 1)ᵀ,
    λ*₀ = −(g₀ᵀu₀)/(u₀ᵀAu₀) = 5/18,
    x₁ = x₀ + λ*₀u₀ = (5/6, 0, 5/18)ᵀ.

k = 1:
    g₁ = ∇f(x₁) = (−2/9, 5/9, 2/3)ᵀ,
    β₀ = (g₁ᵀAu₀)/(u₀ᵀAu₀) = 13/162,
    u₁ = −g₁ + β₀u₀ = (1/162)(75, −90, −95)ᵀ,
    λ*₁ = −(g₁ᵀu₁)/(u₁ᵀAu₁) = 117/535,
    x₂ = x₁ + λ*₁u₁ = (0.9346, −0.1215, 0.1495)ᵀ.

k = 2:
    g₂ = ∇f(x₂) = (−0.04673, −0.1869, 0.1402)ᵀ,
    β₁ = (g₂ᵀAu₁)/(u₁ᵀAu₁) = 0.07075,
    u₂ = −g₂ + β₁u₁ = (0.07948, 0.1476, −0.1817)ᵀ,
    λ*₂ = −(g₂ᵀu₂)/(u₂ᵀAu₂) = 0.8231,
    x₃ = x₂ + λ*₂u₂ = (1, 0, 0)ᵀ.

Note that even though we made a foray into quite messy num-
bers, x3 = x∗ exactly, after exactly 3 steps – as we should expect for the
Conjugate Gradient Method.
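Example 22 is also easy to sanity-check numerically (a two-line numpy sketch; the conjugate gradient routine sketched after Algorithm 11 reproduces the iterates as well):

    import numpy as np

    A = np.array([[3.0, 0.0, 1.0],
                  [0.0, 4.0, 2.0],
                  [1.0, 2.0, 3.0]])
    b = np.array([-3.0, 0.0, -1.0])
    print(np.linalg.solve(A, -b))   # the minimiser satisfies Ax* + b = 0: [1. 0. 0.]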
If A is positive semi-definite (rather than positive definite), then

(i) f need not have a minimum, e.g., f(x₁, x₂) = x₁² + x₂ is unbounded, with f(0, −n) = −n → −∞ as n → ∞, and ∇f(x) = (2x₁, 1)ᵀ ≠ 0 for any x. Here

    A = ( 2  0
          0  0 )   is positive semidefinite.

(ii) f has a minimum x* iff Ax* + b = 0 has a solution, i.e., iff b ∈ Range(A), meaning that b can be written as a linear combination of the columns of A. (For example, in the example above, b = (0, 1)ᵀ ≠ λ₁(2, 0)ᵀ + λ₂(0, 0)ᵀ for any λ₁, λ₂.)

(iii) if b ∈ Range(A), then f has a minimum (not necessarily unique), and the conjugate gradient method will find one in ≤ n steps. (Note that if x* is a minimum, then so is x* + y, for all y such that Ay = 0.) (The reason that the method still works is based on the fact that u_kᵀAu_k > 0 still holds for u_k ≠ 0, as otherwise f would be unbounded below; to see this, consider equation (∗).)

(iv) This means that even if A is singular, we can still use the method – a distinct advantage over Newton's method.

We have shown that the conjugate gradient technique is a reason-


able approach to solving quadratic problems. Certainly, it is superior
to Steepest Descent for many problems. Newton’s method (which we
haven’t discussed in detail for the multivariable problem) is certainly
faster for quadratics (taking only one step when it works) but it requires
calculation of inverse Hessians, and that is a problem for a singular
matrix A, but more importantly, calculating Hessians can be difficult
for non-quadratic problems.
So what remains for us here is to show that the conjugate gradient
algorithm does generalise elegantly to non-quadratic problems, and
that is what we consider in the following section.

THE CONJUGATE GRADIENT ALGORITHM FOR NON-QUADRATIC FUNCTIONS – THE FLETCHER-REEVES ALGORITHM

So far, we have only considered f as a quadratic function. We can extend our approach by considering general nonlinear functions in terms of their second-order Taylor Series approximation,

    f(x) ≈ q(x) = ½(x − x₀)ᵀA(x − x₀) + gᵀ(x − x₀) + c,   (3.29)

where A = H_f(x₀) and g = ∇f(x₀). For a quadratic function the Hessian H = A is constant at each step. This is not true of a general nonlinear function, and hence the Hessian must be re-evaluated at each step. This can be very inefficient computationally, so an efficient implementation of a conjugate gradient method should eliminate the need for a new Hessian evaluation at each step. Note that in the Conjugate Gradient Method, the Hessian A appears only in the computation of λ*_k and β_k. So:

• We manipulate the formula for β_k to eliminate the A.

• We go back to computing λ*_k using a simple line search.


Our first result concerns the recalculation of β_k without the Hessian.

Theorem 3.11. For a general non-linear problem, the correct value of β_k to use is given by

    β_k = ‖∇f(x_{k+1})‖² / ‖∇f(x_k)‖².

Proof. As earlier,

    Au_k = (1/λ*_k)(g_{k+1} − g_k).   (1)

Then,

    g_{k+1}ᵀAu_k = (1/λ*_k)(g_{k+1}ᵀg_{k+1} − g_{k+1}ᵀg_k)
                 = (1/λ*_k)‖g_{k+1}‖²,   (2)

since g_{k+1}ᵀg_k = 0 from earlier, and

    u_kᵀAu_k = (1/λ*_k)u_kᵀ(g_{k+1} − g_k)   (from (1))
             = (1/λ*_k)(u_kᵀg_{k+1} − u_kᵀg_k)
             = (1/λ*_k)(0 − [−g_k + β_{k−1}u_{k−1}]ᵀg_k)
             = (1/λ*_k)‖g_k‖²,   (3)

since u_jᵀg_ℓ = 0, (j < ℓ).
So

    β_k = (g_{k+1}ᵀAu_k)/(u_kᵀAu_k) = ‖g_{k+1}‖²/‖g_k‖².

Theorem 3.11 is called the Fletcher-Reeves Formula. It avoids all


mention of Hessians and relies only on the objective function and
gradient values at each iteration.
We incorporate this change in the formulation for βk into an algo-
rithm called the Fletcher-Reeves Algorithm. We also need to realise
that the previous proofs that the algorithm will converge in n steps no
longer apply (they only worked for quadratic functions). Hence, we
apply the search, but when we have finished n steps, we start again
(assuming the minimum has not already been found). The resulting
algorithm is given in Algorithm 12.

Notes

(i) If f is convex, the algorithm is good for finding its minimum (if
it exists). If f is not convex, then the scheme will converge to a
solution of ∇ f (x) = 0, so we should ensure we start near x∗ and
assume f is convex near x∗ .

(ii) Convergence results are very technical, but if H_f(x*) is positive definite, then for some c > 0,

    ‖x_{(k+1)n} − x*‖ ≤ c‖x_{kn} − x*‖²,   k = 0, 1, 2, . . . ,

which is basically the same convergence rate as Newton's Method, but the method takes n steps rather than 1 to reach the minimiser of a quadratic function.
problem: Problem 3.1
input: A starting point x₀ and tolerance ε.
output: A final estimate of the minimiser x_k.
Initialisation: k = 0;
g₀ = ∇f(x₀);
u₀ = −g₀;
while ‖g_k‖ ≥ ε do
    if k ≠ n − 1 then
        Determine λ*_k = argmin_λ f(x_k + λu_k) using a line search;
        Let x_{k+1} = x_k + λ*_k u_k;
        Let g_{k+1} = ∇f(x_{k+1});
        Let β_k = ‖g_{k+1}‖²/‖g_k‖²;
        Let u_{k+1} = −g_{k+1} + β_k u_k;
        k = k + 1;
    else
        // restart
        k = 0;
        x₀ = x_n;
        g₀ = ∇f(x₀);
        u₀ = −g₀;
    end
end

Algorithm 12: The Fletcher-Reeves Algorithm.
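A compact Python sketch of Algorithm 12 follows. The line search uses scipy's bounded scalar minimiser over an assumed bracket, and the convex test function is the quadratic from the steepest-descent examples; both choices are illustrative rather than prescribed by the text.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def fletcher_reeves(f, grad, x0, eps=1e-6, max_restarts=100):
        n = len(x0)
        x = np.asarray(x0, dtype=float)
        for _ in range(max_restarts):       # each pass = n steps, then restart
            g = grad(x)
            u = -g
            for _ in range(n):
                if np.linalg.norm(g) < eps:
                    return x
                lam = minimize_scalar(lambda t: f(x + t * u),
                                      bounds=(0.0, 1.0),   # assumed bracket
                                      method='bounded').x
                x = x + lam * u
                g_new = grad(x)
                beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves formula
                u = -g_new + beta * u
                g = g_new
        return x

    f = lambda x: x[0]**2 / 5 + x[1]**2
    grad = lambda x: np.array([2 * x[0] / 5, 2 * x[1]])
    print(fletcher_reeves(f, grad, [5.0, 1.0]))    # converges towards (0, 0)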


(iii) Going back to u0 = −∇ f (x0 ) after n steps keeps us on track and
doesn’t allow us to wander too much (See Chong & Zak).

(iv) If f is not quadratic, the conjugacy of the u_k might get lost. Reasons: the line search to determine λ* might be inaccurate; the nature of the non-quadratic terms in f.
An alternative formula for β_k which seeks to overcome these problems is the Polak-Ribière form:

    β_{k−1} = ∇f(x_k)ᵀ(∇f(x_k) − ∇f(x_{k−1})) / ‖∇f(x_{k−1})‖² = g_kᵀ(g_k − g_{k−1}) / ‖g_{k−1}‖².

It has been found to help maintain conjugacy when f is non-quadratic. (See Kelley, "Iterative Methods in Optimisation", who comments (p.49) that "the F-R method has been observed to take long sequences of very small steps and virtually stagnate... the Polak-Ribière formula performs much better and is more commonly used but has a less satisfactory convergence theory.")
There are other ways around the problem of removing the dependence on the Hessian. For example, the BFGS (Broyden-Fletcher-Goldfarb-Shanno) update, where

    B₀ = I,   B_{k+1} = B_k − (B_ks_ks_kᵀB_k)/(s_kᵀB_ks_k) + (y_ky_kᵀ)/(y_kᵀs_k),

where y_k = g_{k+1} − g_k, and s_k = x_{k+1} − x_k.
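For completeness, here is a minimal numpy sketch of this update (the recursion for B_k only; a full quasi-Newton loop would use B_k to generate search directions, and in practice one would typically call a library routine such as scipy.optimize.minimize with method='BFGS' instead):

    import numpy as np

    def bfgs_update(B, s, y):
        # One BFGS update of the Hessian approximation B_k, with
        # s = x_{k+1} - x_k and y = g_{k+1} - g_k (requires y^T s > 0).
        Bs = B @ s
        return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

    B = np.eye(2)                                  # B_0 = I
    B = bfgs_update(B, s=np.array([0.1, -0.2]),    # made-up step and
                    y=np.array([0.3, 0.1]))        # gradient change
    print(B)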


§4 CONSTRAINED CONVEX OPTIMISATION

So far we have considered constrained searches in 1D, and uncon-


strained searches on convex functions. Now we want to put the two
concepts together. We’ll start considering problems of the form:

Problem 4.1. Find the minimiser x of the convex function f : C → R, where C ⊂ Rⁿ is a convex set.

The function is convex, and so also is the domain on which we perform our optimisation. We call such problems convex programs. As we saw earlier in Corollary 3.1, a convex program of this form is guaranteed to have a (relatively) easy-to-find global minimum, and this is why we restrict our attention (for the moment) to this case: i.e., if a local minimum exists, then it is a global minimum.


§4.1 EXAMPLES
Let’s consider a couple of very simple example problems to moti-
vate the work of this chapter.
Example 23. Imagine trying to find the deepest point of a deep, murky lake. We can't see the bottom, and let's assume sonar depth sounders haven't been invented, but we do have a boat, and it has a sounding line¹. We could just measure "every" point of the lake to find the deepest, but that would be a lot of work. The problem is to find the deepest point efficiently.
Obviously not all lakes are convex, but that may be a reasonable approximation in enough cases to make it worthwhile (and we'll talk about non-convex problems a little later).

¹ A sounding line is a long cable with a weight on the end (called a lead even if it's not made of lead), and markers at even intervals. You throw the weighted end into the water, and count the number of markers until it hits the bottom. The traditional markers were 6 feet apart, or 1 fathom. We get a lot of terms in English from nautical sources, and sounding is the source for some. For instance, deep six comes from leadsmen calling out "by the mark" or "by the deep" followed by a depth, e.g., six. Actually, the terminology is slightly more complicated, but you get the idea. Samuel Clemens (1835-1910) took his pseudonym Mark Twain from the riverboat leadsman jargon "by the mark twain", or two fathoms.
[Figure 4.1: An example of a traffic matrix: a network with nodes A, B, . . . , and the matrix

    X = ( x_AA  x_AB  x_AC  ···
          x_BA  x_BB  x_BC  ···
          ⋮     ⋮     ⋮        ). ]

Example 24. One of the problems in my work has been to infer a network traffic matrix from inadequate link traffic measurements. The traffic matrix (a matrix of the traffic from point A to point B, e.g., see Figure 4.1) is needed in operational IP networks for use in optimising the design and engineering of the network, but obtaining it is difficult. Here we use an optimisation technique to obtain the traffic matrix (which is then used in subsequent optimisation).
Currently, it is relatively easy to gather information on link traffic using a tool called SNMP (the Simple Network Management Protocol). SNMP is unique in that it is supported by essentially every device in an IP network. The key result we use is that the traffic on each link is related to the traffic matrix by the linear equations

    y = Ax,   (4.1)

where x = (x₁, x₂, . . . , x_M)ᵀ is the traffic matrix (written as a column vector), y = (y₁, y₂, . . . , y_L)ᵀ represents the link loads, and A = {A_ir} is the L × M routing matrix, where

    A_ir = { F_ir,   if traffic for r traverses link i,
             0,      otherwise,                            (4.2)

where F_ir is the fraction of traffic from source/destination pair r that traverses link i.
This equation must be inverted to find the traffic matrix from the link loads, but L ≪ M, so the equation is highly under-constrained, and so allows many solutions. So we must have a method which finds a "best" solution from all of the possibilities.
One approach is to start with a good guess (a prior) and try to refine it, and this approach has been called network tomography. We came up with a technique that AT&T (one of the world's biggest network operators) now uses. We start with a prior from a gravity model, and then refine it. Call the prior x_g; then the method works (approximately) by solving the optimisation problem

    min  ‖x − x_g‖²,
    s.t. Ax − y = 0,
         x ≥ 0,

where ‖·‖ is the L₂ norm of a vector (i.e., the Euclidean distance to the origin).
The solution is, in effect, just a least-squares solution, finding the minimum of the distance (or distance squared, which is a quadratic function) from the prior (the gravity model) to the constrained subspace defined by the measurement equations (4.1), as shown in Figure 4.2.
Other versions of the approach modify this, for instance by introducing weights, but the basic concept is the same: we use constrained, convex optimisation to infer the matrix from data.
This is a real problem, and the solution is really used by a big com-
pany (I wrote the code myself, except for the optimisation routine
which used PDSCO, a Matlab optimisation library). The number of
links being measured was in the thousands (and each measurement
represents 1 constraint) and the number of variables being estimated
was in the order of 10,000, so we can see that such problems can be
[Figure 4.2: An illustration of the least-squares solution. The point shows the gravity-model solution, and the dashed line shows the subspace specified by a single constraint equation. The least-squares solution is simply the point satisfying the equation which is closest to the gravity-model solution. The weighted least-squares solution gives different weights to different unknowns.]

quite large (at least compared to the test cases we have looked at in
class). We even have a patent on the idea:
U.S. Patent 7,574,506, “Traffic Matrix Estimation Method and Ap-
paratus”, N.G. Duffield, A.G. Greenberg, J.G. Klincewicz, M. Roughan,
and Y. Zhang, 2009.
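A toy version of this estimation problem can be sketched with off-the-shelf tools (this is illustrative only: a hypothetical 2-link, 3-route network, solved with scipy's SLSQP method to handle the equality constraints and non-negativity; the real system was vastly larger and used a specialised solver):

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical routing matrix (L = 2 links, M = 3 routes) and link loads.
    A = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0]])
    y = np.array([3.0, 4.0])
    xg = np.array([1.0, 1.5, 2.0])    # made-up gravity-model prior

    res = minimize(lambda x: np.sum((x - xg) ** 2), xg,
                   method='SLSQP',
                   constraints=[{'type': 'eq', 'fun': lambda x: A @ x - y}],
                   bounds=[(0, None)] * 3)
    print(res.x)                       # the prior projected onto {Ax = y, x >= 0}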
§4.2 CONSTRAINTS
A convex program is defined on a convex set. Recall from Maths 1
that a set is convex if any line between two points in the set remains
inside the set, i.e.,

Defn 4.1. A set C ⊂ Rⁿ is convex if for all x, y ∈ C and all α ∈ [0, 1],

    x + α(y − x) = (1 − α)x + αy ∈ C.

Some examples of convex and nonconvex sets are shown in Figure 4.3.

[Figure 4.3: Convex and non-convex sets.]
We have lots of ways of creating convex sets (they're easy to draw), but we need a method to mathematically guarantee that a set of constraints we are given (in the form of inequalities, or equalities) defines a convex set. To do so we use the following lemma:

Lemma 5. Given m convex functions g_i(x), the set Ω defined by

    Ω = { x ∈ Rⁿ | g_i(x) ≤ 0, for i = 1, 2, . . . , m }

is convex.

Proof. Start by considering the set Ω₁ = { x ∈ Rⁿ | g₁(x) ≤ 0 }. Take two points x and y in this set, then consider a point z = αx + (1 − α)y on the chord between them (α ∈ [0, 1]). As g₁(·) is convex, we can see that

    g₁(αx + (1 − α)y) ≤ αg₁(x) + (1 − α)g₁(y) ≤ 0,

where g₁(x) ≤ 0 and g₁(y) ≤ 0 because x, y ∈ Ω₁, and α and 1 − α ≥ 0. Hence z ∈ Ω₁, i.e., any point on a chord between two points in the set must also be in the set, so Ω₁ is convex. Likewise for the similarly defined Ωᵢ, for i = 1, 2, . . . , m.
We then note two properties of convex sets:

• The empty set and the whole vector space (in our case Rⁿ) are convex.

• The intersection of any collection of convex sets is convex.

Together these mean that Ω = ∩ᵢ₌₁ᵐ Ωᵢ must be convex.
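In code, testing membership of such an Ω is just a matter of evaluating the constraint functions (a trivial sketch; the gᵢ used here are those of Example 25 below):

    import numpy as np

    # g_i(x) <= 0 constraints, as in Example 25 below.
    gs = [lambda x: x[1] - x[0] - 1,
          lambda x: x[0]**2 / 2 - x[1] - 2]

    def in_omega(x):
        return all(g(x) <= 0 for g in gs)

    print(in_omega(np.array([0.0, 0.0])))   # True
    print(in_omega(np.array([0.0, 2.0])))   # False: violates g_1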

We will say our problem is in standard form when written:


Problem 4.2. Find the minimiser x of the convex function f : Ω → R, subject to x ∈ Ω, where

    Ω = { x ∈ Rⁿ | g_i(x) ≤ 0, for i = 1, 2, . . . , m },

for convex functions g_i(x).

The useful feature of the standard form is that it is often easy to


define a set of constraints in terms of functions, and it is easy to test if
these functions are convex.
In addition to assuming we have a convex programming problem,
we will also assume that the functions are continuously differentiable
(i.e., the derivatives exist and are continuous), however, note that this
does not mean that the region boundary is smooth as there can be
“corners” where different constraints meet. We also need to assume
that the interior of the convex set is non-empty.

Problem 4.3. Find the minimiser x of the convex function f : Ω → R, subject to

    x ∈ Ω = { x ∈ Rⁿ | g_i(x) ≤ 0, for i = 1, 2, . . . , m },

where the g_i(·) are convex, and we require ∇f and ∇g_i to exist and be continuous for all x in Ω, and that int(Ω) ≠ ∅.
A simple set of examples of such problems are the Linear Programs


(LP) that you may have studied in earlier optimisation courses. A
number of other examples are shown below.
Example 25. Convex constraints:

    x₂ − x₁ − 1 ≤ 0,
    x₁²/2 − x₂ − 2 ≤ 0.

[Figure: the corresponding feasible region in the (x₁, x₂) plane.]

Example 26. Convex constraints:

    x₁ − 7 ≤ 0,
    −x₁ + x₂²/4 + 4 ≤ 0.

[Figure: the corresponding feasible region in the (x₁, x₂) plane.]

Note that here g₂(x) is a convex function of x₁ and x₂, but it is not convex if we draw x₂ as a function of x₁, i.e., we have to be careful of our definition here.
Example 27. Non-convex (and unbounded) constraints:

    −x₁ − 3x₂ ≤ 0,
    −x₁ + 3x₂ ≤ 0,
    x₁ + x₂²/4 − 4 ≤ 0.

[Figure: the corresponding feasible region in the (x₁, x₂) plane.]

We should be able to see that in the first two cases a straight line between points of the set would stay within the set, whereas in the third it could pass outside. Notice also that the third set is unbounded, but this is OK. We don't require the set of interest to be bounded (see the previous chapter, for instance).
We refer to the interior of a set Ω as "int(Ω)" and its boundary as ∂Ω. We'll assume that in general our constraints define a non-empty set, i.e., Ω ≠ ∅. If this is not true, we call the problem infeasible, and then optimisation is futile unless we relax at least one constraint.
Given a point x ∈ Ω (as defined for Problem 4.2), we define the set I(x) by

    I(x) = { i | g_i(x) = 0 }.   (4.3)

In simple terms, I(x) tells us which constraints are tight, i.e., which constraints "bite". If I(x) = ∅, then x ∈ int(Ω); otherwise x ∈ ∂Ω.
SOME RELATED PROBLEMS AND TRANSFORMS BETWEEN SUCH
In practice, constraints may be given in various forms. For instance, in the traffic matrix estimation problem of Example 24, the constraints were given as a set of linear equalities, not as convex inequalities. We need to have some tools for changing problems into standard form (and back again if needed).

Problem 4.4. Find the minimiser x ∈ Rⁿ of the convex function f(·), i.e.,

    min  f(x)
    s.t. g_i(x) ≤ 0,   i = 1, 2, . . . , m,
         ℓ_j(x) = 0,   j = 1, 2, . . . , p,

where f and g_i are as in Problem 4.2 and the ℓ_j(·) are all linear.

The linear constraints can be expressed as

    ℓ_j(x) = c_jᵀx − d_j,   j = 1, 2, . . . , p,

for some set of vectors c_j, and the ℓ_j(·) are obviously convex functions, but the constraints are expressed as equalities, not as inequalities. The standard trick to deal with this is to replace each equality in Problem 4.4 with two inequalities, i.e.,

    ℓ_j(x) = 0   ⇔   { ℓ_j(x) ≤ 0  and  −ℓ_j(x) ≤ 0 }.
We can do this because, for a linear function, ℓ_j and −ℓ_j are both convex
functions, and the resulting pair of inequalities is equivalent to the
original equality.
We can transform from inequalities to equalities where required by
the inclusion of slack variables. When we can include non-negativity
constraints, inclusion of slack variables is easy: we simply replace
an inequality such as g_i(x) ≤ 0 with g_i(x) + s_i = 0 where s_i ≥ 0. If non-
negativity constraints must be avoided, we can still perform a similar
trick using the square of a new variable: g_i(x) + s_i^2 = 0.
While we are considering linear constraints, we should also mention
Farkas' Lemma, which will be important below:

Lemma 6 (Farkas' Lemma). Let A be a real-valued m × n matrix.
The system

Ax = b, x ≥ 0,    (4.4)

has a solution iff

whenever u ∈ R^m is such that A^T u ≥ 0, then b^T u ≥ 0 also.    (4.5)

Proof. Consider the primal-dual pair of linear programs:

(P)  max z = 0^T x        (D)  min w = b^T u
     s.t. Ax = b               s.t. A^T u ≥ 0,
          x ≥ 0                     u free.

Recall that (P) has an optimal solution iff (D) has an optimal solution,
and if (P) and (D) have feasible solutions, then w ≥ z.
⇒ If (4.4) has a solution, then a feasible solution of the
primal problem (P) exists. All such solutions are optimal
solutions because z is always zero, i.e., any x satisfying
Ax = b, x ≥ 0 is a maximiser (with z = 0 by definition).

In this case, feasible solutions u of the dual problem
(D) must exist, and for any such u we have A^T u ≥ 0 and

b^T u = w ≥ z = 0,

by the definition of (D). Therefore, any u satisfying A^T u ≥ 0
satisfies b^T u ≥ 0, and hence (4.5) holds.
⇐ The dual problem (D) always has feasible solutions,
since u = 0 satisfies the constraints of (D).
So if (4.5) holds, then (D) has an optimal solution,
since w is bounded below (specifically, w = b^T u ≥ 0
for every feasible u). Therefore (P) has an optimal
solution, and so (P) has a feasible solution.
In other words, (4.4) holds.

Geometric Interpretation: If we denote the columns of A to be
a_1, a_2, . . . , a_n ∈ R^m, then Ax = b means "b is a linear combination of
the columns of A", i.e.,

b = x_1 a_1 + x_2 a_2 + · · · + x_n a_n.

If we insist that the coefficients x_i ≥ 0, then the set of all such linear
combinations is the convex cone² of the vectors {a_1, a_2, . . . , a_n}. In 2D and
3D we can visualise this — Example 28 and Example 29 give a couple
of examples. Note that these convex cones don't really look like "cones".
Example 28. The convex cone of the vectors a_1 = (1, 0)^T and a_2 =
(1, 1)^T is the region shown in Figure 4.4a. It's the unbounded shaded
triangular shape.

² A convex cone is defined to be a subset of a vector space that is closed under
linear combinations with positive coefficients.

Example 29. The convex cone of the vectors a_1 = (1, 0, 0)^T, a_2 = (1, 1, 1)^T
and a_3 = (1, 0, 1)^T is the region shown in Figure 4.4b. We can see that it
isn't actually a "cone", but really an (unbounded) polyhedral shape.

[Figure 4.4: Two convex cones. (a) A convex cone of 2 vectors in R^2. (b) A
convex cone of 3 vectors in R^3.]

So Farkas’ Lemma insists that either

• b is in the convex cone of the columns of A, or

• there exists a vector u ∈ Rm such that aTi u ≥ 0 for i = 1, 2, . . . , n


and bT u < 0.

Interpreting the second statement, note that a_i^T u ≥ 0 implies that the
angle between u and a_i is at most 90°, and b^T u < 0 implies that the
angle between u and b is more than 90°. If we were to take the hyper-
plane normal to u, then it would therefore separate the vectors in the

convex cone of {a_1, a_2, . . . , a_n} from b. The following example extends
Example 28, showing examples of specific points.
Example 30. Given the vectors a_1 = (1, 0)^T and a_2 = (1, 1)^T (whose
convex cone was shown in Figure 4.4a), we could have one of the two cases
illustrated in Figure 4.5. Either b is inside the convex cone, in which
case it can be written as a linear combination with positive coefficients,
as for the point b = 0.6 a_1 + 0.7 a_2 shown in Figure 4.5a. Or b lies outside
the cone, as in Figure 4.5b (b = (1/5, −1/2)^T), and then we can draw a
separating hyperplane (in this case a line), with normal u = (1/2, 1)^T,
though there are an infinite number of other such lines. We can easily
see that a_1^T u = 1/2 > 0 and a_2^T u = 3/2 > 0, but b^T u = −4/10 < 0, as
required by the lemma.

[Figure 4.5: An illustration of Farkas' Lemma (see Example 30). (a) A point
b = 0.6 a_1 + 0.7 a_2 inside the convex cone of {a_1, a_2}. (b) A point
b = (1/5, −1/2)^T outside the convex cone, and u = (1/2, 1)^T.]


To further elaborate, Farkas' Lemma, as described above, can be
interpreted as saying that either the vector b is in the convex cone,
or there is a separating hyperplane between it and the cone (but not
both).
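
We can check both cases numerically. Below is a minimal Python/NumPy
sketch (not part of the original notes' toolchain), using the vectors of
Example 30 as data; the variable names are ours:

    import numpy as np

    # Columns of A are the cone generators a1 = (1,0)^T and a2 = (1,1)^T.
    A = np.array([[1.0, 1.0],
                  [0.0, 1.0]])

    # Case 1: b inside the cone -- A x = b has a solution with x >= 0.
    b_in = 0.6 * A[:, 0] + 0.7 * A[:, 1]
    x = np.linalg.solve(A, b_in)
    print(x, (x >= 0).all())            # [0.6 0.7] True

    # Case 2: b outside the cone -- exhibit u with A^T u >= 0 but b^T u < 0.
    b_out = np.array([1/5, -1/2])
    u = np.array([1/2, 1.0])
    print(A.T @ u, b_out @ u)           # [0.5 1.5] (both >= 0), -0.4 (< 0)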

FINDING FEASIBLE POINTS


Many methods of optimisation must start with a feasible point, i.e., a
point that satisfies the constraints. Naively, this seems an easy problem;
however, given a large problem with many variables and constraints,
it isn't even obvious that a feasible point will exist (i.e., that Ω ≠ ∅).
We already know how to do this for linear constraints (it was taught
in second year OR). The method taught is called Phase I of the Simplex
Algorithm. Loosely, it works as follows:

1. Construct an artificial problem from the original, for which there
   is an obvious feasible point.

2. Solve the artificial problem (via the Simplex Method).

3. If this finds a basic feasible solution, we can use this to construct
   the starting point. Otherwise the original problem is infeasible.

We would also like an approach to find a feasible starting point
for problems such as Problem 4.2, but this is somewhat harder, and we
won't cover it in this course, as the convex programming algorithm we
will examine here uses linear constraints.

§4.3 BACKGROUND: LAGRANGE MULTIPLIERS


We have introduced the idea of optimisation with constraints. One
of the most familiar methods for dealing with such problems is that
of Lagrange multipliers. They can be applied when we seek a minimum
(or maximum) under a set of equality constraints, e.g., as in
Problem 4.5.

Problem 4.5. Find the minimiser x ∈ R^n of the function
f(·), subject to the constraints

g_i(x) = 0,  i = 1, 2, . . . , m < n,

where f and g_i are differentiable functions.

Notice that we make no statements about convexity (or otherwise)
of the functions f(·) or g_i(·). The standard solution uses Lagrange
multipliers. For each constraint we introduce a new variable λ_i, and
then we solve the new problem, Problem 4.6.

Problem 4.6. Find the minimiser x ∈ R^n of the function
h(·), given by

h(x, λ) = f(x) + Σ_{i=1}^m λ_i g_i(x),

where f and g_i are differentiable functions.



The minimiser of Problem 4.6 will also be the minimiser of
Problem 4.5 (assuming a feasible solution exists), though remember we
now have n + m variables.
Example 31. Find the rectangle with fixed perimeter and maximum
area. The problem is illustrated in Figure 4.6a. Given that the sides of the
rectangle are (x_1, x_2), the problem is to maximise f(x_1, x_2) =
x_1 x_2 subject to x_1 + x_2 = 1, and x_1, x_2 > 0.

[Figure 4.6: We wish to maximise the area of the rectangle, subject to a fixed
perimeter constraint. (a) The rectangle. (b) The function being maximised, also
showing the restriction implied by the perimeter constraint.]

We instead maximise the new function

h(x_1, x_2, λ) = x_1 x_2 + λ(x_1 + x_2 − 1),

where λ is the Lagrange multiplier.



Setting the partial derivatives to zero,

∂h/∂x_1 = ∂h/∂x_2 = ∂h/∂λ = 0,

we get

∂h/∂x_1 = x_2 + λ = 0,
∂h/∂x_2 = x_1 + λ = 0,
∂h/∂λ = x_1 + x_2 − 1 = 0,

which has solution x_1 = x_2 = 1/2, λ = −1/2, as illustrated in Figure 4.6b.
It is noteworthy that the third equation (derived from the partial
derivative with respect to the Lagrange multiplier) is simply the original
constraint.
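
Such stationarity systems can also be solved symbolically to check our
working. A minimal sketch using Python's sympy (an addition of ours,
not part of the original notes):

    import sympy as sp

    x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
    h = x1*x2 + lam*(x1 + x2 - 1)

    # Set all partial derivatives of h to zero and solve.
    eqs = [sp.diff(h, v) for v in (x1, x2, lam)]
    print(sp.solve(eqs, [x1, x2, lam]))   # x1 = x2 = 1/2, lam = -1/2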
Example 32. Find the rectangle of largest area inscribed in a circle
with diameter 1 as illustrated in Figure 4.7.

[Figure 4.7: We wish to maximise the area of the rectangle, subject to it fitting
inside a circle with diameter 1. Note that the diagonal of the rectangle is just
the diameter of the circle.]

Mathematically, the problem is to maximise f(x_1, x_2) = x_1 x_2 subject
to x_1^2 + x_2^2 = 1 (i.e., the diagonal of the rectangle matches the diameter
of the circle), and x_1, x_2 > 0. Again, we create a new function using
a Lagrange multiplier λ:

h = x_1 x_2 + λ(x_1^2 + x_2^2 − 1),

and maximise this by setting its partial derivatives to zero:

∂h/∂x_1 = x_2 + 2λx_1,    (4.6)
∂h/∂x_2 = x_1 + 2λx_2,    (4.7)
∂h/∂λ = x_1^2 + x_2^2 − 1.    (4.8)
Subtracting 2λ × (4.6) from (4.7), we get

x_1 (1 − 4λ^2) = 0,

with solution λ = ±1/2. To satisfy x_1, x_2 > 0 we need λ = −1/2, and hence
x_1 = x_2. To satisfy the constraint,

x_1 = x_2 = 1/√2.

Hence the solution is a square.


Example 33. The previous examples are somewhat obvious, but if
we modify the previous example to "find the largest area rectangle
inscribed in an ellipse", the problem is somewhat more interesting.
Mathematically, it is: maximise f(x, y) = x y subject to x^2/a^2 + y^2/b^2 = 1
(where a and b are the semi-major and semi-minor axes of the ellipse),
and x, y > 0. As before we create a new function

h = x y + λ(x^2/a^2 + y^2/b^2 − 1),

and maximise this by setting its partial derivatives to zero:

∂h/∂x = y + 2λx/a^2,    (4.9)
∂h/∂y = x + 2λy/b^2,    (4.10)
∂h/∂λ = x^2/a^2 + y^2/b^2 − 1.    (4.11)

Subtracting 2λ/a^2 × (4.10) from (4.9), we get

y ( 1 − 4λ^2/(a^2 b^2) ) = 0.

So λ = ±ab/2. To satisfy x, y > 0 we need λ = −ab/2, and hence x = (a/b)y. To
satisfy the constraint,

x = a/√2,  y = b/√2.

The solution is no longer a square (for a non-circular ellipse).


Example 34. We can solve problems with multiple constraints in just
the same way. For instance: maximise f(x_1, x_2, x_3) = x_1 x_2 x_3 subject to
x_1 x_2 + x_1 x_3 + x_2 x_3 = 1, and x_1 + x_2 + x_3 = 3. Once again we create a new
function, but now with two Lagrange multipliers λ and μ, i.e.,

h = x_1 x_2 x_3 + λ(x_1 x_2 + x_1 x_3 + x_2 x_3 − 1) + μ(x_1 + x_2 + x_3 − 3),

whose partial derivatives give the equations:

∂h/∂x_1 = x_2 x_3 + λ(x_2 + x_3) + μ = 0,
∂h/∂x_2 = x_1 x_3 + λ(x_1 + x_3) + μ = 0,
∂h/∂x_3 = x_1 x_2 + λ(x_1 + x_2) + μ = 0,
∂h/∂λ = x_1 x_2 + x_1 x_3 + x_2 x_3 − 1 = 0,
∂h/∂μ = x_1 + x_2 + x_3 − 3 = 0.

We can immediately see that although the Lagrange multiplier pro-


vides us with a clear method for solving the problem, the resulting
problem may still be non-linear, and quite complex to solve (hence
necessitating the types of algorithms taught in this course).
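
For instance, the system of Example 34 can be handed to a numerical
root-finder. A sketch using SciPy (our addition; the starting guess was
chosen by inspection, and other roots exist):

    import numpy as np
    from scipy.optimize import fsolve

    def stationarity(v):
        x1, x2, x3, lam, mu = v
        return [x2*x3 + lam*(x2 + x3) + mu,
                x1*x3 + lam*(x1 + x3) + mu,
                x1*x2 + lam*(x1 + x2) + mu,
                x1*x2 + x1*x3 + x2*x3 - 1.0,
                x1 + x2 + x3 - 3.0]

    root = fsolve(stationarity, [0.2, 0.2, 2.6, -0.2, 0.0])
    print(root)   # one root: x ~ (0.18, 0.18, 2.63), lam ~ -0.18, mu ~ 0.03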

Why do we use Lagrange multipliers? The first reason is simple: in
the examples above, when we take the partial derivative with respect
to the Lagrange multiplier, we simply get the constraint. This is always
the case, which we can see as follows. Take the new objective function

h(x, λ) = f(x) + Σ_{i=1}^m λ_i g_i(x).    (4.12)

To minimise (or maximise) this, we need to set the partial derivatives
to zero, and those for the new Lagrange multipliers just look like

∂h/∂λ_i = g_i(x),

so in setting these to zero, we just rewrite our old constraint g_i(x) = 0.
However, it's doing something more than just this. To understand,
let's consider the case of a single constraint: e.g., minimise f(x) subject
to g(x) = 0. The new objective function is

h(x, λ) = f(x) + λ g(x),

and then the condition ∂h/∂x_i = 0 implies that, for all i = 1, 2, . . . , n,

∂f/∂x_i = −λ ∂g/∂x_i,    (4.13)

or more concisely

∇f = −λ∇g.    (4.14)

The intuition of this can be seen in Figure 4.8.

Figure 4.8: An illustration of the effect of a Lagrange multiplier using f(x) =
2x_1^2 + 2x_1 x_2 + x_2^2 − 10x_1 − 10x_2 and constraint g(x) = x_1^2 + x_2^2 − 5 = 0. We can
see that (4.14) is enforced here with λ = 1. Intuitively we can see why — the
minimum occurs just at the point where the level curves of f(x) (the orange
ellipses) just touch the feasible region curve g(x) = 0 (shown in blue). The
critical level curve is f(x) = 20 (shown in green). When they just touch, their
tangents must be the same (as they are continuously differentiable) and hence
the normals defined by the gradients must be aligned.
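
We can verify the figure's claim directly. A quick numerical check (our
sketch; the point x* = (1, 2) and λ = 1 are derived in Example 38 below):

    import numpy as np

    x = np.array([1.0, 2.0])                     # the constrained minimiser
    grad_f = np.array([4*x[0] + 2*x[1] - 10,     # grad f at x* = (-2, -4)
                       2*x[0] + 2*x[1] - 10])
    grad_g = 2 * x                               # grad g = (2x1, 2x2) = (2, 4)
    print(grad_f + 1.0 * grad_g)                 # [0. 0.], so (4.14) holds with lambda = 1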

Mathematically, we assume that x is a minimiser (which satisfies
the constraint); then we can perform a Taylor series expansion of both
the function f(·) and the constraint g(·) around this point, i.e.,

f(x + δx) = f(x) + δx^T ∇f + O(δx^2),    (4.15)
g(x + δx) = g(x) + δx^T ∇g + O(δx^2).    (4.16)

Now the critical trick is to realise that if we enforce the constraint, then
for any feasible point

g(x + δx) = 0,

including the minimiser itself, g(x) = 0, and so from (4.16) we get (for
small δx) that

δx^T ∇g(x) ≈ 0.

Then from (4.14) we know that δx^T ∇f(x) ≈ 0, and hence for small
enough δx we get

f(x + δx) = f(x) + O(δx^2).    (4.17)

Hence we can see that f(x) is a local minimum, given the constraint.
Note also that if we had a constraint g(x) = d, then the constant is
dropped when taking derivatives w.r.t. x_i, and so such constants are
only needed when checking the constraint itself, and can sometimes
just be left out of the function h(·).
Intuitively, you should realise by now that (4.14) expresses something
quite important. Remember from the previous chapter that ∇g(x)
is a direction at right angles to the level curve g(x) = 0, and that −∇f(x) is
the direction of steepest descent of f(·), so the equality enforced
by (4.14) says that at the minimiser the constraint curve's normal will
point in the direction of steepest descent of the function f(x) at that
point. In other words, we require that the directional derivatives of f(·), in
any direction along the constraint curve, be zero (but not in every
arbitrary direction).

Another way to intuitively think about Lagrange multipliers is that


in creating
h(x) = f (x) + λg (x),
we are proposing to optimise a weighted sum of two objectives. Given
a fixed value of λ, this would be like saying we have two costs, and
perhaps one is more or less important, and so gets a larger or smaller
weight λ. So given a fixed value of λ we minimise the weighted sum.
However, we don’t know λ in advance, so we determine both the
optimum of the weighted sum, and the optimal value of λ to go with
it. A bigger value of λ means that the “cost” corresponding to the
constraint g (x) = 0 was more important, and so had a higher weight.
A small value means it was relatively unimportant. So another way to
think about the Lagrange multiplier is “how hard do we have to pull to
ensure the constraint holds?”. As a result Lagrange multipliers often
turn out to have interesting meanings. In physics they often form what
are called co-state variables, one of which is momentum. We can think
of momentum as how hard a particle resists forces applied to it.
Another example is shadow prices in economics, which is the rate
at which the optimal value of the function f (·) changes if you change
the (economic) constraint. We can see this by simply calculating ∂h/∂c
for some constraint g (x) − c = 0. Clearly, we will get ∂h/∂c = λ.
One interesting trick is to take μ = 1/λ (assuming λ ≠ 0), and to
scale the objective by a constant to optimise

μ h(x) = μ f(x) + g(x),

which we recognise as the Lagrange multiplier form of the problem:
minimise g(x) (our constraint function) subject to f(x) − c = 0. That
is, we have flipped the roles of the constraint and objective function.
From the new formulation, we can see that both problems have the

same form of solution. This is a type of duality that can sometimes be
exploited in optimisation.
It is somewhat trickier to use Lagrange multipliers to solve problems
with inequality constraints (such as in the standard form of our
problems, Problem 4.2). However, for one constraint, it's easy and instructive.
We just find the unconstrained minimum or maximum, and then check the
constraint. If it's satisfied, then we are done; if not, we look on the
boundary g(x) = 0, and so solve the equality-constrained problem.
For example, in Example 32 we were looking for the largest area
rectangle we can inscribe in a circle. We assumed that x^2 + y^2 = 1, so
that the diagonal of the rectangle matched the diameter of the circle, but
really the question says that x^2 + y^2 ≤ 1. However, the maximum area
(without the constraint) is clearly unbounded, and so doesn't satisfy
the constraint, so we look for the rectangle that lies on the boundary
g(x, y) = x^2 + y^2 − 1 = 0, which we solve (as before) to get x = y = 1/√2.
One way to think of this is that a constraint that "bites" requires
a non-zero Lagrange multiplier, but when we are not on the edge of
the constraint, the Lagrange multiplier can be zero (eliminating that
constraint from the function h(·)).
Unfortunately, it is somewhat harder to consider problems with
multiple inequality constraints, simply because we have a combinatorial
problem if we want to test all of the possible combinations of
the constraints that could apply in a particular problem. Nevertheless, it is
that problem we will address in the following section.

§4.4 KKT CONDITIONS


The conditions necessary to establish an optimal solution for
Problem 4.3 are called the Karush-Kuhn-Tucker (KKT) conditions, or just
the Kuhn-Tucker conditions (the first paper on the topic was by Kuhn
and Tucker, but Karush described necessary conditions in an earlier
Master's thesis).
The conditions are expressed in the theorem below. Note that it
requires convexity, but this is to ensure we find a global minimum. If
convexity doesn't hold (but some other regularity conditions do), then
the KKT conditions still give us a local minimum.

Theorem 4.1 (KKT (Slater's condition version)). Let f :
R^n → R be a convex, continuously differentiable function
on the set Ω = { x ∈ R^n | g_i(x) ≤ 0, ∀i = 1, . . . , m }, where the
g_i : R^n → R are convex and continuously differentiable for
all i = 1, . . . , m, such that int(Ω) ≠ ∅.
A necessary and sufficient condition that x* ∈ Ω minimises
f over Ω is that there exist λ_i ≥ 0 such that

∇f(x*) + Σ_{i=1}^m λ_i ∇g_i(x*) = 0,    (4.18)

where

λ_i g_i(x*) = 0, for all i = 1, 2, . . . , m.    (4.19)

The condition (4.19) is sometimes called a complementary slackness
condition because it expresses that either the constraint is tight
(and the Lagrange multiplier is potentially positive) or the Lagrange
multiplier is zero (and the constraint need not be tight).
The conditions (4.18) and (4.19) are called the KKT conditions. They
can be equivalently replaced with

∇f(x*) + Σ_{i ∈ I(x*)} λ_i ∇g_i(x*) = 0,    (4.20)

where implicitly λ_i = 0 for all i ∉ I(x*).


We will prove it in a moment, but first note that it is an intuitive
extension of the Lagrange multiplier idea. That is, if we say that the
constraints I(x*) are tight³, then we would write a new objective function

h(x) = f(x) + Σ_{i ∈ I(x*)} λ_i g_i(x),

including Lagrange multipliers as in (4.12) for the tight constraints, and
then ∇h is just the left-hand side of condition (4.20). So we are simply
solving a problem including Lagrange multipliers to enforce the right
set of constraints.
The only additional information we get from Theorem 4.1 is that
the Lagrange multipliers λ_i must be non-negative. The intuition behind
the positivity requirement comes from the following: imagine that
the constraints I(x*) are tight; then the point x* is at the intersection
of the curves g_i(x) = 0, for i ∈ I(x*). We know that the feasible region
is inside the set Ω, i.e., that g_i(x) ≤ 0, and also that g_i(x) is increasing
in the direction ∇g_i(x). Therefore, if g_i(x) = 0, then ∇g_i(x) will point
outside the set Ω, or equivalently −∇g_i(x) points inside. Likewise, if x*
is a minimum, the objective function must increase in any direction
pointing into the region. However, what the theorem says is even
stronger. It says that

∇f(x*) = − Σ_{i ∈ I(x*)} λ_i ∇g_i(x*),

³ If I(x*) = ∅ then we are not on any boundary, and the KKT condition just reverts
to the familiar condition for an unconstrained problem, i.e., ∇f(x*) = 0.

for λi ≥ 0. In other words, ∇ f (x∗ ) is in the convex cone of −∇g i (x) for
i ∈ I (x∗ ). The condition is illustrated in 2D in Figure 4.9.


Figure 4.9: An illustration of the fact that ∇f(x*) = −Σ_{i ∈ I(x*)} λ_i ∇g_i(x*), with
λ_i ≥ 0, at any global minimum. The shaded region indicates the feasible set.
We know that −∇g_i(x*) must point into this set because the constraints are
g_i(x*) ≤ 0. The green shaded sub-region indicates the convex cone of viable
directions for ∇f.

We will prove this result below, but that will take some work, and
a few examples should help with our intuition; before then, let us
also define a feasible direction.

Defn 4.2. A feasible direction at x ∈ Ω is a vector d ∈ R^n,
d ≠ 0, such that there exists δ > 0 where x + εd ∈ Ω for all
0 ≤ ε ≤ δ.

The following examples extend Examples 25 to 27, using the
objective function f(x) = x_1^2 + x_2^2, which has ∇f(x) = (2x_1, 2x_2)^T.

Example 35. From Example 25,

g_1(x) = x_2 − x_1 − 1,
g_2(x) = x_1^2/2 − x_2 − 2.

The minimum occurs at the origin, where I(x) = ∅, and so the
KKT conditions simply require that ∇f(x) = 0, which indeed holds
there.

[Figure: the feasible region, with level curves of f.]

Example 36. From Example 26,

g_1(x) = x_1 − 7,
g_2(x) = −x_1 + x_2^2/4 + 4.

The concentric rings (in the figure) show level curves of f(x) = x_1^2 + x_2^2.
The minimum is obtained where these rings just touch the feasible region,
at x* = (4, 0), where I(x*) = {2}, so the KKT conditions are

∇f + λ_2 ∇g_2 = 0.

At x* = (4, 0)^T, we have ∇f(x*) = (8, 0)^T and ∇g_2(x*) = (−1, x_2*/2)^T =
(−1, 0)^T, so we choose λ_1 = 0 and λ_2 = 8 to satisfy the KKT conditions.

[Figure: the feasible region, with level curves of f.]
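
A direct numerical check of this stationarity condition (a sketch of
ours, with the example's data):

    import numpy as np

    x = np.array([4.0, 0.0])               # candidate minimiser, I(x*) = {2}
    grad_f  = 2 * x                        # f(x) = x1^2 + x2^2, so grad f = (8, 0)
    grad_g2 = np.array([-1.0, x[1] / 2])   # g2(x) = -x1 + x2^2/4 + 4, so (-1, 0)

    lam2 = 8.0                             # solves grad_f + lam2 * grad_g2 = 0
    print(grad_f + lam2 * grad_g2)         # [0. 0.] with lam2 >= 0: KKT satisfied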

Example 37. From Example 27,

g_1(x) = −x_1 − 3x_2,
g_2(x) = −x_1 + 3x_2,
g_3(x) = x_1 + x_2^2/4 − 4.

The concentric rings (in the figure) show level curves of f(x) = x_1^2 + x_2^2.
The constraints are non-convex, but we can still see that the minimum
is obtained where these rings just touch the feasible region, at
x_2* = ±(−6 + √52) ≈ ±1.21.

The solution, therefore, is at the points x* ≈ (3.63, ±1.21), where I(x*) =
{1, 3} or I(x*) = {2, 3}. Taking the former, we get ∇f(x*) = 2(x_1*, x_2*)^T =
2(3x_2*, −x_2*)^T, and

∇g_1(x*) = (−1, −3)^T,
∇g_3(x*) = (1, x_2*/2)^T,

so the KKT conditions become

6x_2* − λ_1 + λ_3 = 0,
2x_2* − 3λ_1 + x_2* λ_3/2 = 0;

we obtain this solution when λ_1 ≈ 0.8261, λ_2 = 0 and λ_3 ≈ 8.0927.



PROOF OF THE KKT CONDITIONS


Repeating Theorem 4.1 for reference:

Theorem 4.1 (KKT (Slater's condition version)). Let f :
R^n → R be a convex, continuously differentiable function
on the set Ω = { x ∈ R^n | g_i(x) ≤ 0, ∀i = 1, . . . , m }, where the
g_i : R^n → R are convex and continuously differentiable for
all i = 1, . . . , m, such that int(Ω) ≠ ∅.
A necessary and sufficient condition that x* ∈ Ω minimises
f over Ω is that there exist λ_i ≥ 0 such that

∇f(x*) + Σ_{i=1}^m λ_i ∇g_i(x*) = 0,    (4.18)

where

λ_i g_i(x*) = 0, for all i = 1, 2, . . . , m.    (4.19)

Proof. (⇐) (Sufficiency): i.e., assume that there exists a point x*, and
a set of λ_i, that satisfy (4.18) and (4.19). The objective function f is
convex on Ω, and x* ∈ Ω. By Theorem 3.2 we know that

f(x) ≥ f(x*) + (x − x*)^T ∇f(x*),    (4.21)

for all x ∈ Ω. The KKT condition (4.18) substituted into (4.21) implies

f(x) ≥ f(x*) + (x − x*)^T ( − Σ_{i=1}^m λ_i ∇g_i(x*) )
     = f(x*) − Σ_{i=1}^m λ_i (x − x*)^T ∇g_i(x*).    (4.22)

Since Ω is convex, for all x ∈ Ω and all α ∈ [0, 1],

x* + α(x − x*) = (1 − α)x* + αx ∈ Ω,

and as x, x* ∈ Ω we know g_i(x) ≤ 0 and g_i(x*) ≤ 0, and
g_i(·) is convex, so

g_i( (1 − α)x* + αx ) ≤ (1 − α) g_i(x*) + α g_i(x) ≤ 0,

for all i = 1, 2, . . . , m. The KKT theorem requires λ_i ≥ 0
and (4.19), i.e., λ_i g_i(x*) = 0, so

λ_i { g_i( x* + α(x − x*) ) − g_i(x*) } ≤ 0,    (4.23)

for all i = 1, 2, . . . , m. Taking the limit as α → 0+ (from
first principles) we get

λ_i (x − x*)^T ∇g_i(x*) = lim_{α→0+} λ_i { g_i( x* + α(x − x*) ) − g_i(x*) } / α ≤ 0.

Therefore

Σ_{i=1}^m λ_i (x − x*)^T ∇g_i(x*) ≤ 0,

and thus for any x ∈ Ω, (4.22) gives

f(x) ≥ f(x*) − (something nonpositive) ≥ f(x*),

and so x* is a global minimiser of f over Ω.
Notice that we didn’t require that int (Ω) 6= 0 yet – we only need that
in the necessity proof.

Proof. (⇒) (Necessity): i.e., assume that f(x*) is the global minimum
over Ω. From Theorem 3.3 we know that

(y − x*)^T ∇f(x*) ≥ 0, ∀y ∈ Ω.    (4.24)

Now consider the definition of a feasible search direction, Defn 4.2. For
any such direction d we can choose y = x* + εd ∈ Ω for some ε > 0, and
likewise for any point y ∈ Ω we can choose a search direction d = y − x*,
and so (4.24) is equivalent to saying that

d^T ∇f(x*) ≥ 0,    (4.25)

for every feasible direction d.


It can also be proved that if x∗ minimises f over Ω,

whenever ∇g i (x∗ )T d ≤ 0, for all i ∈ I (x∗ ), then ∇ f (x∗ )T d ≥ 0. (4.26)

(We do not prove this here as it is the trickiest step!)


Now, let A be the matrix with column vectors −∇g i (x∗ )
for i ∈ I (x∗ ) and let b = ∇ f (x∗ ). Then statement (4.26) is
equivalent to
whenever A T d ≥ 0, then bT d ≥ 0. (4.27)
This is just the condition in Farkas’ Lemma, and hence
(4.27) is equivalent to saying there exists x ≥ 0 such that
Ax = b. We take x = λ = (λi : i ∈ I (x∗ )), and then
λi ∇g i (x∗ ) = b,
X
Ax = Aλ = −
i ∈I (x∗ )

where b = ∇ f (x∗ ), and hence the KKT condition (4.20),


or equivalently (4.18) and (4.19).

Example 38. Minimise f(x_1, x_2) = 2x_1^2 + 2x_1 x_2 + x_2^2 − 10x_1 − 10x_2

s.t.  x_1^2 + x_2^2 ≤ 5   ⇒  g_1(x) = x_1^2 + x_2^2 − 5 ≤ 0,
      3x_1 + x_2 ≤ 6      ⇒  g_2(x) = 3x_1 + x_2 − 6 ≤ 0.

Then

∇f(x) = ( 4x_1 + 2x_2 − 10
          2x_1 + 2x_2 − 10 ),   ∇g_1(x) = ( 2x_1
                                            2x_2 ),   ∇g_2(x) = ( 3
                                                                  1 ),

and the Hessian is

H_f = ( 4 2
        2 2 ),

which is positive definite, and so f is convex. Also, g_2 is
linear and g_1 ≤ 0 is a disc (a circle plus its interior), so Ω is convex.

The KKT conditions, written in full, are

1) ∇f(x) + λ_1 ∇g_1(x) + λ_2 ∇g_2(x) = 0, i.e.,

   ( 4x_1 + 2x_2 − 10
     2x_1 + 2x_2 − 10 ) + λ_1 ( 2x_1
                                2x_2 ) + λ_2 ( 3
                                               1 ) = 0;

2) λ_1, λ_2 ≥ 0;

3) λ_1 (x_1^2 + x_2^2 − 5) = 0,
   λ_2 (3x_1 + x_2 − 6) = 0;

4) to ensure feasibility, g_i(x) ≤ 0, i.e.,

   x_1^2 + x_2^2 − 5 ≤ 0,
   3x_1 + x_2 − 6 ≤ 0.

We need to check all choices for I(x). Here there are 2 constraints, and
so I(x) is one of ∅, {1}, {2} or {1, 2}.

1. I(x) = ∅, so λ_1 = λ_2 = 0 by 3), since g_1(x) < 0 and
   g_2(x) < 0. Therefore, from 1),

   4x_1 + 2x_2 = 10
   2x_1 + 2x_2 = 10   ⇒  x_1 = 0, x_2 = 5.

   But then g_1(0, 5) > 0, so this point is not feasible.

2. I(x) = {1}, so g_2 is not active and λ_2 = 0 by 3). So 1)
   and 3) require us to solve

   4x_1 + 2x_2 − 10 + 2λ_1 x_1 = 0
   2x_1 + 2x_2 − 10 + 2λ_1 x_2 = 0
   x_1^2 + x_2^2 = 5   (g_1 active)

   giving x_1 = 1, x_2 = 2 and λ_1 = 1.
   All the Kuhn-Tucker conditions are satisfied, and so
   we can conclude that the minimum is x* = (1, 2);
   but let's check the others for completeness.

3. I(x) = {2}, and so λ_1 = 0 and g_2(x) = 0.
   So 1) and 3) require us to solve

   4x_1 + 2x_2 − 10 + 3λ_2 = 0
   2x_1 + 2x_2 − 10 + λ_2 = 0
   3x_1 + x_2 − 6 = 0,

   giving x_1 = 0.4, x_2 = 4.8, and λ_2 = −0.4 < 0. However,
   this violates 2), and so the Kuhn-Tucker conditions
   are not satisfied.

4. I(x) = {1, 2}, and so g_1(x) = 0 and g_2(x) = 0. So first find
   the solution to

   x_1^2 + x_2^2 = 5,
   3x_1 + x_2 = 6,

   which is

   x_1 = (18 ± √14)/10,   x_2 = (3/10)(2 ∓ √14).

   Now we need to solve 1), from which we find that

   λ_1 = −(1 + √14)/2 < 0   or   λ_1 = (√14 − 1)/2 (in which case λ_2 < 0),

   so either way 2) is violated. Again, the Kuhn-Tucker conditions are
   not satisfied.

So we are left with case (2), where I(x) = {1} and x* = (1, 2), as our final
solution, as illustrated in Figure 4.10. From (3.3) we know that the
unconstrained optimum of the problem is at p = −A^{−1} b = (0, 5)^T here,
because

A = ( 4 2
      2 2 )   and   b = −( 10
                           10 ).

We can see that A is positive definite, so the function is convex, and the
level curves are ellipses (with centre at p). The figure also shows these

elliptical level curves of f(·), as well as the potential solution points for
each set of constraints. We can easily see the first and third solutions
are infeasible, and that the second solution is better than the fourth.
Interestingly, the fourth solution is feasible, but results in a negative
Lagrange multiplier λ_1, which is not allowed, ruling this solution out.
The reason lies in the fact that ∇f is not in the convex cone of −∇g_1 and
−∇g_2, as illustrated in Figure 4.11. The third figure, Figure 4.12, shows a
closer view yet, illustrating that there is a feasible search direction that is
also a descent direction from the point x with I(x) = {1, 2}.
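
As a sanity check on the case analysis, a generic constrained solver
finds the same point. A sketch of ours using SciPy's SLSQP (note that
scipy's 'ineq' convention is fun(x) ≥ 0, so each g_i ≤ 0 is entered negated):

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2 - 10*x[0] - 10*x[1]
    cons = ({'type': 'ineq', 'fun': lambda x: 5 - x[0]**2 - x[1]**2},  # -g1 >= 0
            {'type': 'ineq', 'fun': lambda x: 6 - 3*x[0] - x[1]})      # -g2 >= 0

    res = minimize(f, x0=np.zeros(2), constraints=cons, method='SLSQP')
    print(res.x)   # approximately [1. 2.], matching case (2)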

Figure 4.10: An illustration of Example 38. The green region is the feasible
set Ω. The orange lines are the elliptical level curves of f (x). The four cases of
solutions are illustrated by dots.

Figure 4.11: A closeup of Figure 4.10 showing the gradients, and the convex
cone of {−∇g 1 , −∇g 2 } (in green) at the point x where I (x) = {1, 2}. As ∇ f (x) is
outside this cone, this can’t be an optimal point.

Figure 4.12: A closeup of Figure 4.11, with the green region showing feasible
descent directions from the point where I(x) = {1, 2}. As there are feasible
descent directions, we can't be at the minimum.

Comments:

1. If there are m constraint functions g_i, then there are 2^m possible
   choices for I(x). Obviously this is a large number of possibilities
   to check.

2. Each possibility requires us to solve (potentially) hard sets of
   non-linear equations (unless f is at most quadratic and the g_i
   are linear, a case which we shall consider below).

KKT CONDITIONS FOR OTHER PROBLEMS


The conditions given above are a particular case of the conditions for
the KKT theorem, often called the Slater conditions (that the functions
f(x) and g_i(x) be convex and continuously differentiable, and that the
interior of Ω be non-empty). The conditions can be extended in many
ways, to consider other types of constraints and so on.
Perhaps the most useful extension is to include equality constraints. In
general, we might like to include constraints h_j(x) = 0, but this can become
complex unless the functions are linear, in which case the Slater conditions
naturally generalise through the technique shown in §4.2, where
we wrote the linear equalities as pairs of inequalities (which are still
convex because of linearity).
More precisely, we replace Problem 4.4 with Problem 4.7 below.

Problem 4.7. Find the minimiser x ∈ Rn of the convex


function f (·), i.e.,

min f (x)
s.t . g i (x) ≤ 0, i = 1, 2, . . . , m,
` j (x) ≤ 0, j = 1, 2, . . . , p,
−` j (x) ≤ 0, j = 1, 2, . . . , p.

where f and g i are as in Problem 4.2 and the ` j (·) are all
linear.

We can now apply the KKT conditions directly, and we get

∇f(x*) + Σ_{i=1}^m λ_i ∇g_i(x*) + Σ_{j=1}^p μ_j^+ ∇ℓ_j(x*) − Σ_{j=1}^p μ_j^− ∇ℓ_j(x*) = 0,    (4.28)

where we use the λ_i as Lagrange multipliers with respect to the
constraints g_i(x) ≤ 0, and μ_j^± as Lagrange multipliers for the corresponding
linear inequalities. As before, we require λ_i ≥ 0 and μ_j^± ≥ 0, but note
that we can create new variables

μ_j = μ_j^+ − μ_j^−,

and using these the KKT condition (4.28) becomes

∇f(x*) + Σ_{i=1}^m λ_i ∇g_i(x*) + Σ_{j=1}^p μ_j ∇ℓ_j(x*) = 0,    (4.29)

but now the variables μ_j have no restriction on sign. Note that we also
require, as before in (4.19), that

λ_i g_i(x*) = 0, for all i = 1, 2, . . . , m,    (4.30)

but the linear (equality) constraints are automatically tight (as defined in
the problem), and so we need no equivalent condition for the μ_j.
As noted, there are other ways to extend the theory, even to non-convex
problems, provided they satisfy some other type of regularity
conditions. For instance, one alternative is to insist that the gradients
of the active constraints are linearly independent at the minimiser.

Defn 4.3 (regular point). Let x* satisfy g_i(x*) ≤ 0 ∀i (i.e.,
x* ∈ Ω), and let I(x*) be the active constraint set at x*. Then
x* is a regular point if the set of vectors { ∇g_i(x*) | i ∈ I(x*) }
is linearly independent.

Below is the (1st order) necessary condition for a point x∗ to be a


local minimiser of f .

Theorem 4.2 (Karush-Kuhn-Tucker Theorem).
Let f, g_i (i = 1, . . . , m) be sufficiently differentiable (that is,
∇f and ∇g_i, i = 1, . . . , m, exist).
Let x* be a regular point and a local minimiser of f(x) s.t.
g_i(x) ≤ 0, ∀i = 1, . . . , m.
Then ∃ λ_i ∈ R, i = 1, . . . , m, such that

1. ∇f(x*) + Σ_{i=1}^m λ_i ∇g_i(x*) = 0,
2. λ_i g_i(x*) = 0, ∀i = 1, . . . , m, and
3. λ_i ≥ 0, ∀i = 1, . . . , m.

Proof. Omitted.

We already saw an example of this in Example 37, where the KKT
conditions correctly gave the local minima, even though the constraints
resulted in a non-convex set.

The theorem can be further extended to give 2nd-order necessary
and sufficient conditions for minimising problems with inequality
constraints, i.e., conditions in terms of the Hessians, not just the
gradients. Hence they can separate minima, maxima and saddle points (for
non-convex problems where such things are important).

KKT CONDITIONS FOR QUADRATIC PROGRAMMING


A common form of optimisation problem is a quadratic program,
where the objective function is quadratic (as in (3.1)) and the
constraints are linear (we can include linear inequalities and equalities,
as shown in §4.2).

Problem 4.8. Find the minimiser x ∈ R^n of the quadratic
function

q(x) = (1/2) x^T A x + b^T x + c,

where A is a positive definite, symmetric matrix, and the
problem is constrained to the space Ω defined by

E x ≤ f,
G x = h,

for m × n and p × n matrices E and G, and length-m and
length-p vectors f and h, respectively.

We earlier looked at solving such problems when the problem was
unconstrained (see §3.2 for the analysis; we examined convergence of
the Steepest Descent Algorithm in §3.3). Now we introduce constraints

on the region of optimisation. Obviously, if these are not tight (active),
then the problem is unchanged, but if they are, then we must apply the
KKT conditions from (4.29).
Note first that we can rewrite the constraints in the form

e_i^T x − f_i ≤ 0,
g_j^T x − h_j = 0,

where e_i and g_j are the rows of E and G, respectively.


We know the gradient of a quadratic from (3.10), and

∇q(x) = Ax + b,

and it is likewise easy to derive the gradients of the constraints, each of


which takes the form

∇g i (x) = ei ,
∇h j (x) = g j .

Thus the KKT condition (4.29) is

Ax + b + Σ_{i=1}^m λ_i e_i + Σ_{j=1}^p μ_j g_j = Ax + b + E^T λ + G^T μ = 0,    (4.31)

where λ_i ≥ 0, with λ_i = 0 for the inactive constraints, and the μ_j are
unconstrained. This represents a set of linear equations to be solved
for each possible set of active constraints.
However, we can still see that there is the potential for 2^m cases, and
hence even testing these simple cases might be fraught. Instead, we will
need an algorithm that searches through the possibilities effectively.
However, when the problem has pure equality constraints, such
as in Example 24, then we don't need to worry about finding which
are tight (they all are), and so the solution is straightforward. We
simply rewrite the equations (4.31) (with E = 0), in combination with
the constraint Gx = h, in the form

( A   G^T ) ( x )     ( −b )
( G    0  ) ( μ )  =  (  h ).    (4.32)

In principle, solving this simply requires linear algebra, but life
is never that simple. The problem can get quite large (the matrix on
the left-hand side is (n + p) × (n + p)) and may have poor properties
for numerical solution, e.g., the matrix is not guaranteed to be
positive definite, even if A is. So even in this relatively simple problem
(quadratic programming) considerable effort can be applied to finding
robust and efficient solvers.
In my example (traffic matrix inference, as in Example 24), we used
a Matlab library called PDSCO (written by M. A. Saunders and B. Kim
at Stanford University), which actually works on more general convex
problems. Perhaps their later version, PDCO, would be a better starting
point now.
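
For a small problem, (4.32) really is just a single linear solve. A minimal
sketch of ours (the data here are made up, reusing Example 38's objective
with its linear constraint treated as an equality):

    import numpy as np

    A = np.array([[4.0, 2.0],      # Hessian of q (positive definite)
                  [2.0, 2.0]])
    b = np.array([-10.0, -10.0])
    G = np.array([[3.0, 1.0]])     # single equality constraint: 3x1 + x2 = 6
    h = np.array([6.0])

    n, p = A.shape[0], G.shape[0]
    KKT = np.block([[A, G.T],
                    [G, np.zeros((p, p))]])
    sol = np.linalg.solve(KKT, np.concatenate([-b, h]))
    x, mu = sol[:n], sol[n:]
    print(x, mu)   # x ~ [0.4 4.8], mu ~ [-0.4]; cf. case (3) of Example 38,
                   # where the sign restriction on an inequality ruled this out

Note that with a pure equality constraint, the negative multiplier is
perfectly acceptable.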

§4.5 THE GRADIENT PROJECTION ALGORITHM


In the previous section we saw that we could derive elegant conditions
for the extrema of a convex function (with convex constraints),
but that this didn't solve the problem for us, simply because there are
too many possible sets of active constraints to consider by simple
enumeration.
We need a way to attack such problems algorithmically, and in this
section we will present the gradient projection algorithm, which is just
that.
Here we will restrict our attention to problems with convex objective
functions and linear (inequality) constraints⁴. So the problem can
be described by Problem 4.9.

Problem 4.9. Find the minimiser x ∈ R^n of the convex,
continuously differentiable function f(·), i.e.,

min f(x)
s.t. Ax ≤ b,

for an m × n matrix A and vector b ∈ R^m.

We can write the linear constraints as a separate set of constraints

g_i(x) = c_i^T x − b_i ≤ 0,  where c_i^T = the i-th row of A.

⁴ Using linear constraints may seem a large restriction, but remember that the
constraint functions we are using are probably an inexact representation of the real
problem constraints, and that we can approximate them by a set of linear constraints
with hopefully not too much impact on the final solution (though this is certainly not
always true).

The algorithms we will consider are similar to gradient algorithms for
unconstrained problems. However, for constrained problems, −∇f(x)
isn't always a feasible direction, so we need to work out how to find
one.
We start by taking

S(x) = { u | u^T ∇g_i(x) = 0, ∀i ∈ I(x) }
     = { u | u^T c_i = 0, ∀i ∈ I(x) }.

The set S(x) defines the set of directions from x such that we stay
on the boundary ∂Ω of the feasible region. That is, if we move an
infinitesimal distance in the direction u, the same set of constraints
I(x) that are currently active will remain so. The result is easy to prove,
as the constraints are linear, so

g_i(x + εu) = g_i(x) + ε u^T c_i.

The first term on the right-hand side is zero by the definition of I(x),
and the second is zero by the definition of S(x) (for small enough ε such
that we don't run into another boundary). Thus, any u ∈ S(x), u ≠ 0, is a
feasible direction at x.
If we can find u ≠ 0, u ∈ S(x), such that

u^T ∇f(x) < 0,

then u is both a feasible and a descent direction at x, and so our
approach seeks to find such directions. We can then move along the
boundary of the region until we get to the solution⁵.
We have to be a bit careful, though, because it's easy to get to a
point (a vertex) where S(x) = {0}, and we need some way to detect this
(efficiently) and get away from such points.

⁵ Remember, if the minimum is not on the boundary, the solution is simply found
via unconstrained optimisation techniques.

BACKGROUND TO THE GRADIENT PROJECTION METHOD


Suppose −∇f(x) is not a feasible direction at x, so −∇f(x) ∉ S(x). Then
let

S⊥(x) = { v ∈ R^n | v^T u = 0, ∀u ∈ S(x) },

called "S(x) perp" (the set of all vectors orthogonal to S(x)). Then
∃ u ∈ S(x) and q ∈ S⊥(x) such that

u + q = −∇f(x),    (4.33)

as illustrated in Figure 4.13. We call such a u the orthogonal projection
of −∇f onto S.
of −∇ f onto S.

Figure 4.13: Decomposition of −∇f(x) into u + q, where u and q are chosen to
be orthogonal, i.e., u^T q = 0.

If u = 0 then q = −∇ f (x), and ∇ f (x) is orthogonal to the constraints.


Hence there is no descent direction within S(x). So for the moment

assume u ≠ 0. We can use (4.33) as follows:

−u^T ∇f(x) = u^T (u + q)
           = u^T u + u^T q
           = ‖u‖^2
           > 0,    (4.34)

because u and q were chosen to be orthogonal. Thus u^T ∇f(x) < 0, and
hence u is a feasible descent direction.
Thus a basic algorithm would be:

1. At x ∈ Ω, compute u as in (4.33).

2. If u = 0, we need to do something special;
   else u ≠ 0 (so u is a feasible descent direction), so search in
   direction u, i.e., minimise

   f(x + λu),  λ > 0,  x + λu ∈ Ω.

   Find the new x = x + λ* u and repeat the process.

Question: how do we find u, the orthogonal projection of −∇f(x) onto
S(x)? We know (from linear algebra) that we can create new vectors
from old by matrix multiplication, so we will write this projection
operation using the notation

u = P_S(−∇f(x)) = −P_S ∇f(x),    (4.35)

where P_S is called the projection matrix. Our aim is to find P_S. It turns
out that this is an optimisation problem itself, and so we can use the
rules we have already discovered to help us find the projection matrix.

First rewrite the set S(x) using the matrix M whose
rows are c_i^T, i ∈ I(x):

S(x) = { u ∈ R^n : c_i^T u = 0, ∀i ∈ I(x) }
     = { u ∈ R^n : M u = 0 },

or more concisely write S = {u | Mu = 0}, where we implicitly define M
in terms of the current point x and the set of constraints active at that
point, I(x).
Ignore, for the moment, the context of the problem, and consider
the general problem of orthogonal projection of an arbitrary vector
y into a set S = {u | Mu = 0}, where M is p × n, say. For y ∉ S, we want
to find ŷ = P_S y, the orthogonal projection of y on S, as shown in
Figure 4.14.

Figure 4.14: Orthogonal projection of the vector y into the space S(x) using the
projection matrix P_S.

If we look at the problem from "side on", as in Figure 4.15, we can see
that one way to think about the question is to take a set of expanding
hyperspheres around the point y until the spheres are just large enough
to touch the hyperplane S. As they are just touching, the hyperplane
forms a tangent to the hypersphere, and so the radius of the sphere
at the point of contact will be orthogonal to S. The point of contact
gives us our orthogonal projection. The length of the radius at this
point will be ‖y − P_S y‖, and so we aim to minimise this, but it is easier
to minimise its square (the two are equivalent as (·)² is a monotonically
increasing function). That is, we solve the problem given in Problem 4.10,
which is just a quadratic program.

Figure 4.15: A side-on view of the orthogonal projection problem.

Problem 4.10 (Orthogonal Projection). Find the minimiser
ŷ = P_S y of the convex, quadratic function

‖y − ŷ‖^2 = Σ_{i=1}^n ( ŷ_i − y_i )^2,

such that ŷ ∈ S, i.e., M ŷ = 0.



Theorem 4.3 (Orthogonal Projection). The solution to
Problem 4.10, i.e., the ŷ that minimises

‖ŷ − y‖^2 = Σ_{i=1}^n ( ŷ_i − y_i )^2

such that M ŷ = 0, is given by

ŷ = P_S y,

where the orthogonal projection matrix is

P_S = I − M^T (M M^T)^{−1} M,

presuming that M M^T is invertible.

Proof. Consider that the problem is a convex optimisation problem,
with a quadratic objective function and linear equality constraints;
hence we can write it in the form of Problem 4.8, i.e., with objective
q(x) = (1/2) x^T A x + b^T x + c, by expanding

‖ŷ − y‖^2 = Σ_{i=1}^n ( ŷ_i − y_i )^2
          = Σ_{i=1}^n ( ŷ_i^2 − 2 ŷ_i y_i + y_i^2 )    (4.36)
          = ŷ^T ŷ − 2 y^T ŷ + y^T y,    (4.37)

and so we take

A = 2I,    (4.38)
b = −2y,    (4.39)

and the constraint matrix G from Problem 4.8 is just G = M, and there
are no inequality constraints, so E = 0. We find the solution using the
KKT conditions, which in this case are given by (4.31):

Ax + b + G^T μ = 2ŷ − 2y + M^T μ = 0,

where the μ are some set of (unconstrained) Lagrange multipliers.
Thus (if there are p active constraints)

ŷ = y − (1/2) M^T μ,

where M is the p × n matrix with rows c_1^T, . . . , c_p^T, so that
M^T = (c_1 · · · c_p) is n × p.
To find μ, use the constraints, which require that

0 = M ŷ = M ( y − (1/2) M^T μ ),

giving

M M^T μ = 2 M y.

Since M M^T is a square p × p matrix, if M M^T is invertible
we then have

μ = 2 (M M^T)^{−1} M y

and so

ŷ = y − (1/2) M^T μ
  = ( I − M^T (M M^T)^{−1} M ) y
  = P_S y,    (4.40)

where P_S is as defined above.
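
The projection matrix is easy to form numerically, and its defining
properties are easy to test. A sketch of ours (the matrix M here is
arbitrary full-row-rank data):

    import numpy as np

    def projection_matrix(M):
        # P_S = I - M^T (M M^T)^{-1} M, assuming M M^T is invertible.
        n = M.shape[1]
        return np.eye(n) - M.T @ np.linalg.solve(M @ M.T, M)

    M = np.array([[2.0, 1.0, 1.0, 4.0],
                  [1.0, 1.0, 2.0, 1.0]])
    P = projection_matrix(M)
    y = np.random.rand(4)
    print(np.allclose(P @ P, P))        # True: a projector is idempotent
    print(np.allclose(M @ (P @ y), 0))  # True: projected vectors satisfy M u = 0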
Theorem 4.3 explains how to find an orthogonal projection, so now
we apply it to our original problem of finding the orthogonal projection
of −∇f(x), and we get

u = −P_S ∇f(x) = −( I − M^T (M M^T)^{−1} M ) ∇f(x),

where M is the matrix with rows c_i^T, ∀i ∈ I(x), i.e., its rows are the
(transposed) gradient vectors of the active constraints at x. Given the
conditions specified above, u is guaranteed to be a descent direction
and a feasible search direction (one that preserves the same set of active
constraints I(x)).
Example 39. Minimise f(x) = x_1^2 + x_2^2 + x_3^2 + x_4^2 − 2x_1 − 3x_4 such that

2x_1 + x_2 + x_3 + 4x_4 ≤ 7,    (4.41)
x_1 + x_2 + 2x_3 + x_4 ≤ 6,    (4.42)
x_i ≥ 0.    (4.43)

The gradients of the first two constraints are

c_1 = ∇g_1(x) = (2, 1, 1, 4)^T,    (4.44)
c_2 = ∇g_2(x) = (1, 1, 2, 1)^T.    (4.45)

The last set of (non-negativity) constraints actually have to be written
as −x_i ≤ 0, and so their corresponding gradients are

c_i = ∇g_i = (0, . . . , −1, . . . , 0)^T,  for i = 3 to 6,

with the −1 in the (i − 2)-th spot.

Imagine we started with x_0^T = (2, 2, 1, 0); the active constraints are
I(x_0) = {1, 2, 6} (which we can see by testing whether each inequality
is tight). Hence the matrix M is a 3 × 4 matrix whose rows are the
constraint vectors c_1, c_2 and c_6, i.e.,

M = ( 2  1  1  4
      1  1  2  1
      0  0  0 −1 ),

giving

M M^T = ( 22  9 −4
           9  7 −1
          −4 −1  1 ),

and hence

P_S = I − M^T (M M^T)^{−1} M = (1/11) (  1 −3  1  0
                                        −3  9 −3  0
                                         1 −3  1  0
                                         0  0  0  0 ).

We can also write

∇f(x) = (2x_1 − 2, 2x_2, 2x_3, 2x_4 − 3)^T,

and at the point x_0^T = (2, 2, 1, 0) this is

∇f(x_0) = (2, 4, 2, −3)^T.

Hence, we choose the direction

u = −P_S ∇f(x_0) = (1/11) (8, −24, 8, 0)^T.

This is just a direction, so we can scale it (to make the numbers easier),
taking u_0 = (1, −3, 1, 0)^T. Then x_1 = x_0 + λ u_0, where λ minimises
g(λ) = f(x_0 + λ u_0), subject to x_0 + λ u_0 ∈ Ω, λ > 0. Here,

x_1 = (2, 2, 1, 0)^T + λ (1, −3, 1, 0)^T = (2 + λ, 2 − 3λ, 1 + λ, 0)^T,  λ > 0.

For x_1 to be feasible, it must satisfy constraints (4.41)-(4.43):

(4.41) ⇒ 2(2 + λ) + (2 − 3λ) + (1 + λ) ≤ 7 ⇒ 7 ≤ 7, OK for any λ;
(4.42) ⇒ (2 + λ) + (2 − 3λ) + 2(1 + λ) ≤ 6 ⇒ 6 ≤ 6, OK for any λ;
(4.43)(i = 1) ⇒ 2 + λ ≥ 0, i.e., λ ≥ −2, so OK for any λ > 0;
(4.43)(i = 2) ⇒ 2 − 3λ ≥ 0, i.e., λ ≤ 2/3;
(4.43)(i = 3) ⇒ 1 + λ ≥ 0, i.e., λ ≥ −1, so OK for any λ > 0;
(4.43)(i = 4) ⇒ 0 ≥ 0, always OK.

So we have to choose a λ ∈ [0, 2/3] that minimises f(x_1); that is,

g(λ) = (2 + λ)^2 + (2 − 3λ)^2 + (1 + λ)^2 − 2(2 + λ),
g′(λ) = 2(2 + λ) − 6(2 − 3λ) + 2(1 + λ) − 2 = 0 when λ = 8/22 = 4/11 (< 2/3, note).

Since λ = 4/11 minimises g(λ) = f(x_0 + λ u_0), and λ = 4/11 ∈ [0, 2/3], we
accept λ* = 4/11, giving x_1 = (26/11, 10/11, 15/11, 0).
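
The projected-gradient step above can be reproduced in a few lines. A
sketch of ours, with the numbers as in the example:

    import numpy as np

    grad_f = lambda x: np.array([2*x[0] - 2, 2*x[1], 2*x[2], 2*x[3] - 3])

    M = np.array([[2.0, 1.0, 1.0, 4.0],    # rows: gradients of the active
                  [1.0, 1.0, 2.0, 1.0],    # constraints I(x0) = {1, 2, 6}
                  [0.0, 0.0, 0.0, -1.0]])

    x0 = np.array([2.0, 2.0, 1.0, 0.0])
    P = np.eye(4) - M.T @ np.linalg.solve(M @ M.T, M)
    u = -P @ grad_f(x0)
    print(u * 11)    # [8. -24. 8. 0.], i.e., direction (1, -3, 1, 0) after scaling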


The above illustrates one step of our proposed approach. The idea
is to use the gradient of f(·), orthogonally projected into the active
constraint set S, in order to find a feasible descent direction. However,
there are two cases where it won't work.
The first occurs because the above assumes that M M^T is invertible.
What do we do if M M^T is not invertible? The projection matrix P_S is
still uniquely defined, but we need to alter the working above.
If M M^T isn't invertible, then its rows are linearly dependent, and this
tells us the rows of M are linearly dependent⁶.
So we are looking at the case where the rows of M are linearly
dependent, i.e., some constraints are linear combinations of the other
constraints. In this case, some of the constraints are redundant, and
can effectively be dropped.
There are several ways of finding what we can drop. One approach
is to row-reduce M to echelon form and let B be the matrix made up of
the non-zero rows of this reduced form of M. This has the advantage of
simplifying the constraints as well. Then

S = {u | Mu = 0} = {u | Bu = 0},

but now we know that B B^T is invertible. Hence we can now form the
projection matrix

P_S = I − B^T (B B^T)^{−1} B.
⁶ If M M^T isn't invertible then its determinant is zero, i.e., |M M^T| = 0. However,
we cannot conclude |M| = 0, because M is not even necessarily square. However, if
|M M^T| = 0, then the matrix can't be positive (or negative) definite, and hence there is
some α ≠ 0 with α^T M M^T α = 0. Equivalently, we can write (M^T α)^T M^T α = ‖M^T α‖^2 = 0,
and the only vector with zero norm is the zero vector, so M^T α = 0. Writing this in full
is just Σ_i (α_i × row i of M) = 0, which is equivalent to saying that the rows of M are
linearly dependent (from the definition).

Example 40. Suppose the active constraints at x = (1, 0) were

(1) x_1 + x_2 = 1,
(2) 2x_1 − x_2 = 2,
(3) x_1 = 1.

The third constraint is a linear combination of the first two, (3) = ((1) +
(2))/3, so they are linearly dependent. At x = (1, 0) we have I(x) =
{1, 2, 3}, so

M = ( 1  1
      2 −1
      1  0 ),   M^T = ( 1  2  1
                        1 −1  0 ),   and   M M^T = ( 2  1  1
                                                     1  5  2
                                                     1  2  1 ),

so |M M^T| = 0, and hence the matrix is not invertible. If instead we use
just the first two rows of M, which are linearly independent, then

B = ( 1  1
      2 −1 ),   so   B B^T = ( 2  1
                               1  5 ),

which has determinant |B B^T| = 9 ≠ 0, and hence B B^T is invertible, and
can be used to construct a suitable projection matrix P_S. The situation
is illustrated above. The set of points satisfying all three constraints is
the single point (1, 0) (so that S(x) = {0}), and one of the constraints is
redundant in the point's specification.
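
The rank deficiency is easy to detect and repair numerically. A sketch
of ours for Example 40:

    import numpy as np

    M = np.array([[1.0,  1.0],     # constraint gradients from Example 40
                  [2.0, -1.0],
                  [1.0,  0.0]])

    print(np.linalg.matrix_rank(M))                 # 2 < 3: rows are dependent
    print(np.isclose(np.linalg.det(M @ M.T), 0.0))  # True: M M^T is singular

    B = M[:2]                                       # keep two independent rows
    P = np.eye(2) - B.T @ np.linalg.solve(B @ B.T, B)
    print(np.round(P, 12))                          # the zero matrix: S(x) = {0}

That the resulting projector is the zero matrix reflects the fact that the
only direction preserving all the active constraints here is u = 0.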

This solution is OK for small problems; for large problems, the idea
of a generalised inverse can be used, but that is beyond the scope of
this course.
The second problem occurs when the search direction u = 0. In
Example 39, if we continue from x_1 to find a new search direction u_1,
we will have, as before, I(x_1) = {1, 2, 6}, and so the projection matrix P_S
will remain unchanged, but ∇f(x_1) will be different, with the result that

u_1 = −P_S ∇f(x_1) = 0.

In these cases we don't have a search direction, and must consider the
problem more carefully.
Recall that if u = −P_S ∇f(x) ≠ 0, then u is a feasible descent direction,
with

M u = 0 (because u ∈ S), and u^T ∇f(x) < 0.
Suppose that in our method u = 0, so we haven't found a feasible
direction in which to search. Now

u = 0 ⇔ ( I − M^T (M M^T)^{−1} M ) ∇f(x) = 0

(M not necessarily square, but p × n); or

∇f(x) − M^T (M M^T)^{−1} M ∇f(x) = 0,

or

∇f(x) + M^T ω = 0,    (4.46)

where M^T ω is a linear combination of the columns of M^T, i.e., of the
rows of M, and

ω = (M M^T)^{−1} M (−∇f(x)) = (ω_1, ω_2, . . . , ω_p)^T, say,

so for this choice of ω,

∇f(x) + Σ_{i ∈ I(x)} ω_i ∇g_i(x) = 0.    (4.47)

Recall the KKT conditions in Theorem 4.1: if ω_i ≥ 0 for all i ∈ I(x), and
these satisfy (4.47), then we have found the optimal solution, i.e.,
x* = x.
But are ω_i ≥ 0 for all i ∈ I(x)?

In Example 39, if we continue from x_1, then

x_1 = (26/11, 10/11, 15/11, 0)   and   I(x_1) = {1, 2, 6},

and we get

ω = (M M^T)^{−1} M (−∇f(x_1)) = (−10/11, −10/11, −83/11)^T,

where the entries correspond to the active constraints i = 1, 2 and 6 of
I(x_1), respectively. These don't satisfy the KKT conditions, so we have not
yet found the optimal point x*. So what can we do now?
We need to think about what is really going on. If we find that the
projection of −∇ f (x) onto S(x) is 0 and yet we have not satisfied the
KKT-conditions, then we know that either

1. S(x) = {0}, i.e., the current point x is the only point that satisfies
all of the active constraints I (x); or

2. −∇ f (x) is perpendicular to S(x) and −∇ f (x) must point “inwards”


in some sense (as otherwise x would be the minimum).

An excellent example of the former case can be seen in Example 38, the
final figure of which is repeated in Figure 4.16a below. In this case,
at the point x = (x_1, x_2) the active constraint set is I(x) = {1, 2}, but we
can see that only one point⁷ satisfies the constraints, and so S(x) = {0}.
However, there are feasible descent directions leading away from x if
we relax the second constraint (the one in red). Obviously, this example
includes non-linear constraints, but the principle still holds.

[Figure 4.16: The two cases where u = 0. (a) A repeat of Figure 4.12. We can
see that there is an isolated point satisfying both constraints, so that S(x) = {0}
at this point, but the green region shows the feasible descent directions. (b) The
case of a single active constraint I(x) = {1}, but where the minimiser x* is in
the interior of the constraint set, as indicated by the fact that −∇f(x) points to
the interior.]

The second case is, perhaps, best illustrated by the simple situation
where we have a single constraint g_1(x) = 0 active, but the optimal
point is in the interior. In this case, the descent direction must point
(orthogonally) away from the constraint, as in Figure 4.16b.

⁷ Actually two points satisfy the constraints, but they are isolated from one another.

In either case, the active set of constraints has somehow become
too restrictive, but we can alleviate this by dropping one or more of
the active constraints. We are using ω_i < 0, for some i, as an indicator
that we can't have the correct active set. Thus, we will exploit the ω_i to
determine which constraint to drop.
Remove the constraint which has the most negative
value ω_i (we could choose any with a negative ω_i, but
we need to choose one, so let's choose the most negative).
In practice, this means removing the j-th row of M,
where j = argmin_i(ω_i), but remember that the j-th row
corresponds to the j-th active constraint (the j-th element
of I(x)), not to g_j(x) = 0.
Returning to Example 39,

    ω = (−10/11, −10/11, −83/11)^T,

so we take j = 3, and remove the 3rd row of M (i.e., we make the 6th constraint inactive). Then our new constraint matrix M′ is

    M′ = [ 2  1  1  4 ]
         [ 1  1  2  1 ]

and the new projection matrix is

    P_{S′} = I − M′^T (M′ M′^T)^{−1} M′ = (1/73) ×
        [  59   −9  −13  −24 ]
        [  −9   62  −24   −5 ]
        [ −13  −24   14    9 ]
        [ −24   −5    9   11 ],

which gives u1 = −(1/803)(1992, 415, −747, −913)^T, so we have a nonzero search direction, and we continue then as before.
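To make the constraint-dropping step concrete, here is a minimal NumPy sketch (the helper names are our own, not from any library); it reproduces the (1/73) projection matrix above for the reduced M′ of this example:

```python
import numpy as np

def projector(M):
    """Projection onto the null space of M: P = I - M^T (M M^T)^{-1} M."""
    return np.eye(M.shape[1]) - M.T @ np.linalg.inv(M @ M.T) @ M

def drop_most_negative(M, grad_f):
    """Given that u = -P_S grad_f = 0, compute the multipliers omega and
    remove the row of M with the most negative one."""
    omega = np.linalg.inv(M @ M.T) @ M @ (-grad_f)
    j = int(np.argmin(omega))      # row to drop (should have omega_j < 0)
    return np.delete(M, j, axis=0), omega

# Check the new projection matrix for Example 39's reduced constraint matrix:
M_new = np.array([[2., 1., 1., 4.],
                  [1., 1., 2., 1.]])
print(np.round(73 * projector(M_new)))   # matches the (1/73) matrix above
```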
This approach seems intuitive (get rid of the “offending” constraint), but why does it actually work? The following definitions and lemma provide the formal justification.
Defn 4.4 (Orthogonal Sum). The vector space Z is said to be the orthogonal sum of subspaces U and V, written

    Z = U ⊕ V,

if any vector z ∈ Z can be written uniquely as the sum of two vectors u ∈ U and v ∈ V, i.e., z = u + v, where u^T v = 0.

Defn 4.5 (Null space). The null space (or kernel) of a matrix A is the vector space N(A) of all vectors u such that Au = 0.

Defn 4.6 (Range). The range of a matrix A is the vector space R(A) of all vectors v that can be written as v = Aω for some vector ω. It is also called the column space of the matrix, because it is the space of linear combinations of the columns of A.

Lemma 7. If M is a p × n (p < n) matrix of rank p, N(M) is the kernel (or null space) of M, and R(M^T) is the range of M^T, then

    R^n = N(M) ⊕ R(M^T).

Proof. The lemma states that any vector z in R^n can be written as the sum of two vectors, z = u + v, where v^T u = 0 and

    u ∈ N(M),     i.e., M u = 0,
    v ∈ R(M^T),   i.e., v = M^T ω for some ω ∈ R^p.

We chose M to be a p × n (p < n) matrix of rank p, and hence its rows are all linearly independent, so there cannot be any α ≠ 0 such that Σ_i α_i × (row i) = M^T α = 0. Therefore the only element in common between N(M) and R(M^T) is the zero vector: if z = M^T α also satisfies M z = 0, then ‖M^T α‖² = α^T M M^T α = α^T (M z) = 0, so z = M^T α = 0.
Moreover, M^T α ≠ 0 for all non-zero α, so

    α^T (M M^T) α = (M^T α)^T M^T α = ‖M^T α‖² > 0,

and hence M M^T is positive-definite. Hence M M^T is invertible.


Let P = I − M^T (M M^T)^{−1} M, and ω = (M M^T)^{−1} M z. Then

    u = P z,                                               (4.48)
    v = M^T ω,                                             (4.49)

satisfy the required properties:

• u ∈ N(M) since M u = M P z = (M − (M M^T)(M M^T)^{−1} M) z = (M − M) z = 0,

• v ∈ R(M^T) by virtue of its definition,

• z = u + v because P z + M^T ω = (I − M^T (M M^T)^{−1} M) z + M^T (M M^T)^{−1} M z = z,

• u^T v = u^T (M^T ω) = (M u)^T ω = 0^T ω = 0.
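The decomposition in Lemma 7 is easy to sanity-check numerically. A short sketch (our own construction, using a random full-rank M):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 2, 4
M = rng.standard_normal((p, n))        # a p x n matrix; rank p almost surely
z = rng.standard_normal(n)

Minv = np.linalg.inv(M @ M.T)
u = (np.eye(n) - M.T @ Minv @ M) @ z   # u = P z, the N(M) component (4.48)
v = M.T @ (Minv @ M @ z)               # v = M^T omega, the R(M^T) part (4.49)

print(np.allclose(M @ u, 0))           # True: u lies in the null space of M
print(np.allclose(u + v, z))           # True: z = u + v
print(np.isclose(u @ v, 0))            # True: the two parts are orthogonal
```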
Notice that in our case, N(M) = S(x), by deliberate construction. So our normal approach to find a search direction is to project −∇f(x) into this space. However, we can see from Lemma 7 that we are actually doing something more powerful. We are decomposing the gradient into two parts. If we take z = −∇f(x) in Lemma 7, then we can see from the proof that we are breaking it into

    −∇f(x) = u + v,

where

    u = P_S(−∇f(x)) ∈ N(M),                               (4.50)
    v = M^T ω ∈ R(M^T),                                   (4.51)

for P_S = I − M^T (M M^T)^{−1} M, and ω = (M M^T)^{−1} M (−∇f(x)).
By (4.46), when our projected gradient u = −P_S ∇f(x) = 0, we have

    −∇f(x) = v = M^T ω,

so in this case −∇f(x) ∈ R(M^T).
So now take M′ = M^{(j)}, which is the matrix M with the jth row removed, where ω_j < 0, and we form

    S′(x) = { x | M′ x = 0 }.

Define a new projection matrix P_{S′} = I − M′^T (M′ M′^T)^{−1} M′ for this space, and vectors

    u′ = P_{S′}(−∇f(x))   and   v′ = M′^T ω′,   where ω′ = (M′ M′^T)^{−1} M′ (−∇f(x)).

From Lemma 7 we can once again decompose the gradient into two parts,

    −∇f(x) = u′ + v′,

but this time u′ ≠ 0.
To see that u′ must be non-zero, we will assume that u′ = 0. Then

    M^T ω = −∇f(x) = M′^T ω′.

Now, ω_j < 0 and hence ω_j ≠ 0, and so

    Σ_i ω_i c_i = Σ_{i ≠ j} ω′_i c_i
    ω_j c_j = Σ_{i ≠ j} (ω′_i − ω_i) c_i
    c_j = (1/ω_j) Σ_{i ≠ j} (ω′_i − ω_i) c_i.

However, we assumed that the rows of M (the c_i) were linearly independent (the rank of M is p), so this is a contradiction, and hence u′ ≠ 0.
This new direction u′ is also a descent direction, because we constructed it (as before) from a projection of the steepest descent direction −∇f(x). The fact that it is a projection into S′ instead of S is irrelevant (it would be true of any projection as long as u′ ≠ 0). The decomposition above allows us to show this formally, rather elegantly, simply by taking

    u′^T ∇f(x) = −u′^T (u′ + v′) = −‖u′‖² < 0,

because u′^T v′ = 0 by construction.
We can also see that the new direction will be feasible. For u′ to be feasible (i.e., inward pointing) we need, for all i ∈ I(x), that

    u′^T ∇g_i(x) ≤ 0,                                     (4.52)

i.e.,

    u′^T c_i ≤ 0,                                         (4.53)

where the c_i (for i ∈ I(x)) are the rows of M. By construction

    M′ u′ = 0,

and so c_i^T u′ = 0 for all i ∈ I(x) except for i = j. So we only need to check that c_j^T u′ ≤ 0. Now since u = 0, we know ∇f(x) = −M^T ω, and u′ is a descent direction, so

    0 > u′^T ∇f(x) = u′^T (−M^T ω)
                   = u′^T c_j (−ω_j) + Σ_{i ∈ I(x), i ≠ j} u′^T c_i (−ω_i)
                   = u′^T c_j (−ω_j) + 0,

and as ω_j < 0, c_j^T u′ < 0 as well.


Note: if we deleted row k with ωk > 0 this wouldn’t work as the
final step could result in cTj u0 > 0, and hence the direction u0 would
not be feasible. But the u0 obtained by projection into the space with
one constraint corresponding to ω j < 0 removed is a feasible descent
direction.
If we include this step into our proposed algorithm to deal with
the case that u = 0 then the resulting algorithm is called the gradient
projection algorithm and is specified in detail in Algorithm 13 below.
problem: Problem 4.9
input: A feasible starting point x0.
output: A final estimate of the minimiser xk.
Initialisation: k = 0;
ω = −1;
Find I_0 = I(x_0), the set of active constraints at x_0;
while not ω ≥ 0 do
    Let M = (c_j^T, j ∈ I_k);
    Let P_S = I − M^T (M M^T)^{−1} M;
    Let u_k = −P_S ∇f(x_k);
    if u_k ≠ 0 then
        Let β = max{λ | x_k + λ u_k ∈ Ω};
        Choose λ_k to minimise g(λ) = f(x_k + λ u_k) on (0, β];
        Let x_{k+1} = x_k + λ_k u_k;
        Let I_{k+1} = I(x_{k+1});
    else
        Let ω = −(M M^T)^{−1} M ∇f(x_k);
        Let l = argmin_i{ω_i};
        Let j be the lth element of I_k;
        Let I_{k+1} = I_k − {j};
        Let x_{k+1} = x_k;
    end
    k = k + 1;
end

Algorithm 13: The Gradient Projection Algorithm. Note that, for simplicity of exposition, we have omitted checking for linear independence of the rows of M at each step (this check is tricky because you don’t want to do it every time, as M is sometimes repeated from one iteration to the next).
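To make Algorithm 13 concrete, here is a minimal Python/NumPy sketch for the special case f(x) = (1/2)‖x‖² (so ∇f(x) = x) with linear constraints written as b + Cx ≤ 0, whose rows c_i are the constraint gradients. It is our own illustrative code, not a robust implementation: the exact line search is hard-wired for this f, and (as in the text) the linear-independence check is omitted.

```python
import numpy as np

def gradient_projection(x, C, b, tol=1e-10, max_iter=50):
    """Sketch of Algorithm 13 for min (1/2)||x||^2 s.t. b + C x <= 0."""
    n = len(x)
    I = [i for i in range(len(b)) if abs(b[i] + C[i] @ x) < tol]   # I(x_0)
    for _ in range(max_iter):
        if I:
            M = C[I]
            P = np.eye(n) - M.T @ np.linalg.inv(M @ M.T) @ M
        else:
            P = np.eye(n)          # no active constraints: pure steepest descent
        u = -P @ x                 # u_k = -P_S grad f(x_k)
        if np.linalg.norm(u) > tol:
            # beta = largest step with b_i + c_i.(x + lam*u) <= 0 for all i
            cu, slack = C @ u, -(b + C @ x)
            beta = min((slack[i] / cu[i] for i in range(len(b)) if cu[i] > tol),
                       default=np.inf)
            lam = min(-(x @ u) / (u @ u), beta)   # exact line search on (0, beta]
            x = x + lam * u
            I = [i for i in range(len(b)) if abs(b[i] + C[i] @ x) < tol]
        else:
            if not I:
                return x           # unconstrained stationary point
            omega = -np.linalg.inv(M @ M.T) @ M @ x   # multipliers, as in (4.47)
            if np.all(omega >= -tol):
                return x           # KKT conditions hold: x is the minimiser
            I.pop(int(np.argmin(omega)))   # drop the most negative multiplier
    return x
```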
Example 41. Minimise f(x_1, x_2) = (1/2)(x_1² + x_2²) s.t.

    g_1(x_1, x_2) = 1 − x_1 ≤ 0,            c_1 = (−1, 0)^T,
    g_2(x_1, x_2) = 4 − 2x_1 − x_2 ≤ 0,     c_2 = (−2, −1)^T.

Note, therefore, that ∇f(x) = x.
Start at x0 = (2, 4)^T, which is feasible.

Iteration 1.
I(x0) = ∅, therefore P = I = [1 0; 0 1].
Note that I(x0) = ∅ implies x0 ∈ int Ω, and so the best search direction is just the Steepest Descent direction; that is, you’d expect u = −∇f(x0).
Check: from the algorithm, u0 = −P ∇f(x0) = −∇f(x0) = −x0 = (−2, −4)^T.
x1 = x0 + λu0 = (1 − λ)x0 = (2(1 − λ), 4(1 − λ))^T is feasible when

    1 − 2(1 − λ) ≤ 0         i.e.,  2λ ≤ 1,
    4 − (4 + 4)(1 − λ) ≤ 0   i.e.,  8λ ≤ 4,

i.e., when λ ≤ 1/2 = β.
g(λ) = f(x0 + λu0) = (1/2)(1 − λ)²(2² + 4²) is minimised at λ = 1. However, we have constrained the search such that λ ∈ (0, 1/2], and so we must choose λ0 = 1/2. Therefore,

    x1 = x0 + λ0 u0 = x0 − (1/2)x0 = (1/2)x0 = (1, 2)^T   and   ∇f(x1) = (1, 2)^T.
Iteration 2.
I(x1) = {1, 2}, so M = A = [−1 0; −2 −1], and

    P = I − M^T (M M^T)^{−1} M = I − I = [0].

(As M is square and invertible, (M M^T)^{−1} = (M^T)^{−1} M^{−1}, giving M^T (M M^T)^{−1} M = I.)
So u1 = P(−∇f(x1)) = 0.
Note that

    M M^T = [−1 0; −2 −1][−1 −2; 0 −1] = [1 2; 2 5],    (M M^T)^{−1} = [5 −2; −2 1].

Let

    ω = −(M M^T)^{−1} M ∇f(x1)
      = −[5 −2; −2 1][−1 0; −2 −1](1, 2)^T = −[5 −2; −2 1](−1, −4)^T = (−3, 2)^T.

Since ω = (ω_1, ω_2)^T has ω_1 = −3 < 0, delete 1 from I(x1), i.e., delete row 1 from M, giving the new M = (−2 −1).
Then M M^T = 5 and

    P = I − M^T · (1/5) · M = [1 0; 0 1] − (1/5)(−2, −1)^T(−2 −1)
      = [1 0; 0 1] − [4/5 2/5; 2/5 1/5] = [1/5 −2/5; −2/5 4/5],

    u1 = −P ∇f(x1) = −(1/5)[1 −2; −2 4](1, 2)^T = (1/5)(3, −6)^T.
Note that u1^T ∇f(x1) = (1/5)(3, −6)(1, 2)^T = (1/5)(3 − 12) < 0, and therefore u1 represents a descent direction.
Put

    x2 = x1 + λu1 = (1, 2)^T + (λ/5)(3, −6)^T = (1 + 3λ/5, 2 − 6λ/5)^T,

which is feasible when

    1 − (1 + 3λ/5) ≤ 0                   i.e., for all λ ≥ 0,
    4 − 2(1 + 3λ/5) − (2 − 6λ/5) ≤ 0     i.e., for all λ ≥ 0.

So β = ∞.
g(λ) = f(x2) = f(x1 + λu1) is minimised when (1 + 3λ/5)² + (2 − 6λ/5)² is minimised, i.e., when

    (3/5)(1 + 3λ/5) − (6/5)(2 − 6λ/5) = 0
    3λ = 3,   or   λ = 1.

So

    x2 = x1 + u1 = (8/5, 4/5)^T,
    ∇f(x2) = x2 = (8/5, 4/5)^T.

Iteration 3. I(x2) = {2}; M = (−2, −1) and, as earlier,

    P = (1/5)[1 −2; −2 4],    u2 = −P ∇f(x2) = −(1/5)[1 −2; −2 4](8/5, 4/5)^T = (0, 0)^T.
Find

    ω = −(M M^T)^{−1} M ∇f(x2) = −(1/5)(−2 −1)(8/5, 4/5)^T = (1/5)(16/5 + 4/5) = 20/25 > 0.

As no entry of ω is negative, the algorithm has finished, and the minimiser is x∗ = (8/5, 4/5)^T.
The diagram below shows the search directions, etc. Level curves of f are circles, centred at (0, 0). Since x0 = (2, 4) is an interior point, we just search in the steepest descent direction, −∇f(x0), reaching x1 = (1, 2). Since (1, 2) sits on the intersection of the two constraints, we are projecting −∇f(x1) at (1, 2) back into the feasible region. But S(x1) = {u : c_1^T u = 0, c_2^T u = 0} = {0} and so we don’t move (that is, u = 0). We can see that at (1, 2), g_1 is irrelevant in terms of the steepest descent . . . we really want to project onto the surface g_2(x) = 0. Removing row 1 from M has this effect.
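As a check, running the sketch given after Algorithm 13 on this example reproduces the iterates above:

```python
import numpy as np

# Example 41 written as b + C x <= 0:
C = np.array([[-1.,  0.],    # g1(x) = 1 - x1 <= 0
              [-2., -1.]])   # g2(x) = 4 - 2 x1 - x2 <= 0
b = np.array([1., 4.])

print(gradient_projection(np.array([2., 4.]), C, b))   # [1.6 0.8] = (8/5, 4/5)
```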
(a) Constraints. (b) Starting point x0. (c) Second point. Note that −∇f(x1) now points outside the feasible region, so it isn’t a feasible descent direction. (d) The third point x2 = x1, but with a different set of active constraints, and the last point x3 = x∗.
Figure 4.17: Example 41. The feasible region is shown in green. Level curves of f(x) are shown as the orange concentric circles. It may seem unlikely that we would exactly hit a vertex, but the following figure shows what would happen if we started at x0 = (1.5, 4).
(a) Alternative starting point x0. (b) Second point. Note that −∇f(x1) now points outside the feasible region, so it isn’t a feasible descent direction. (c) The third point x2 is the same as the vertex we came to in the previous example. (d) The last step is just as before (though the indexes have changed to reflect the extra step).
Figure 4.18: Example 41 but with an alternative starting point x0 = (1.5, 4). The feasible region is shown in green. Level curves of f(x) are shown as the orange concentric circles.
§5
NON-CONVEX OPTIMISATION

Many applications of optimisation result in non-convex problems. They are inherently hard because we lose our guarantees of a unique, global minimiser. Most of the algorithms we have looked at during this course will still converge (to a local minimiser) on a large range of non-convex problems, but without a guarantee that this is unique, it is unlikely to be the global minimiser.
However, there are many approaches to help find minimisers of such problems. We often call these heuristic solutions, because they lack the guarantees of the previous sections, but they have been used with great success over more than half a century.
The area of non-convex optimisation is large: it includes integer programming (where the variables are restricted to be integers), and other problems where the variables are constrained to a non-convex set. It also includes problems where the objective function is non-convex, and it is these we will concentrate on here, i.e., we will consider problems defined on a convex set, but with a non-convex objective. So the problem of interest will be as given in Problem 5.1.
Problem 5.1. Find the minimiser x ∈ R^n of the function f(·), subject to the constraints

    g_i(x) ≤ 0,   i = 1, 2, . . . , m < n,

where the g_i(·) are convex, differentiable functions, but we make little restriction on f(·).

There are many potential examples.

Example 42. If we revisit Example 23, where we were sounding a lake for its deepest point, then we might realise that few lakes have a truly convex bottom.

Example 43. Our second small example is just a visual example to convey the problems we are dealing with. Figure 5.1 shows a simple 1D function, but one which has many local minima.

Figure 5.1: Non-convex optimisation problem (a 1D cost function C(x) on x ∈ [−2, 2] with several local minima and one global minimum). We can see that a descent algorithm would get caught in a local minimum unless we just happened to choose a good starting point.
§5.1 SIMULATED ANNEALING


We have seen how gradient descent algorithms can become stuck
in a local minimum. Simulated annealing [7, 9] allows our search to
“bounce” out of such a point, by including some randomisation in its
search. We present here, very briefly, the Metropolis algorithm for
simulated annealing [7, 9].
The basic idea is to exploit a rough descent algorithm, but to allow
some steps that ascend.
Simulated annealing is based on an analogy: in statistical mechanics and chemistry, annealing is a process for obtaining low-energy states of a solid. One heats a material until it melts (e.g., glass or a metal), and then reduces the temperature gradually. If the temperature reduction is slow enough (particularly near the freezing point of the material), then the system will form large crystals, which represent a low-energy state. If the temperature reduction is too fast, then the system will be out of equilibrium, and we will get flawed crystals. These correspond to a local minimum in the energy state of the solid. The analogy to non-convex optimisation should be obvious.
The problem of interest is given in Problem 5.1 and, in addition to notation already defined, we write

    ∆x = x_{i+1} − x_i,                                   (5.1)
    ∆f = f(x_i + ∆x) − f(x_i),                            (5.2)

and we refer to the variable T as the “temperature”. In more detail, the analogy works as in the following simple overview.
Physical annealing: An atom in a heat bath is given a small random displacement, with a resultant change ∆E in energy.
Optimisation: A solution to the optimisation problem is changed slightly to give a neighbouring solution, with a change in the cost function of ∆C = new cost − old cost.

Physical annealing: If ∆E ≤ 0, accept the displacement and start again.
Optimisation: If ∆C ≤ 0, accept the new solution and start again.

Physical annealing: If ∆E > 0, sometimes accept/sometimes reject the new displacement on the basis of some probability measure.
Optimisation: If ∆C > 0, sometimes accept/sometimes reject the new solution on the basis of some probability measure.

Physical annealing: Either reiterate at this temperature or drop the temperature.
Optimisation: Either reiterate at this temperature or drop the temperature.

This sort of method has proved successful in many applications of optimisation, e.g.,

• TSP

• Job Shop Scheduling

• Graph Partitioning

• minimum spanning trees in communications networks

• scheduling of 4th year exams, and so on.

The core components of the approach are simply the variables


x and objective function f (x), which sounds obvious. However, in
the world of non-convex problems, there are often different ways the
system can be represented, and the choice may make a difference! So
care and often some experimentation must be exercised even here.
We then need three other components:
CHAPTER 5. NON-CONVEX OPTIMISATION 225

a random move generator: this creates a random direction u and distance λ for us to change the current configuration of the system. It doesn’t need to be a descent direction, but a good choice of the size of jump is important.

an annealing schedule: the concept of temperature is included via a control parameter T, which simulates the temperature changes in a physical annealing process. The schedule determines the choice of temperatures at each step of the algorithm.

an acceptance function: which describes when we should accept a new solution. The decision may take into account whether our search direction is a descent direction, but should not rule out all others, and so the function often uses randomisation.

The random move generator is perhaps easy (at least conceptually),


but the other two components require more discussion, so we will
consider them in detail below, starting with the acceptance function.

ACCEPTANCE FUNCTIONS
In a pure descent algorithm, our acceptance function looks like

    ∆f ≤ 0:  accept,
    ∆f > 0:  reject.

That is, we accept a move if it results in an improvement in the objective. However, in general, the acceptance function is not really a function, but actually a random variable, dependent on ∆f. In this case, we can rewrite the above in terms of the probability of acceptance, P(accept | ∆f), which in this case would be given by

    P(accept | ∆f) = { 1,  ∆f ≤ 0,
                     { 0,  ∆f > 0.
But we want an acceptance function that will sometimes allow cost-increasing solutions. There are many possibilities, so to whittle these down to something reasonable, let’s look at the required properties:

• We want P(accept | ∆f) = 1 for ∆f ≤ 0, which simply says that descent directions will always be accepted¹.

• For ∆f > 0:

– P(accept | ∆f) should decrease as ∆f increases, to penalise steps with large increases in the objective, i.e., really bad directions.

– P(accept | ∆f) should decrease as T decreases, as this mimics the role of the real temperature in annealing.

Figure 5.2 illustrates the properties that we require from the function. The illustration shows continuous functions, which are not necessary, but certainly make sense.
A commonly used acceptance function incorporates the Boltzmann factor, derived from the statistical mechanics of particles (say in a fluid),

    exp(−E(x)/(kT)),

which describes the relative likelihood of a configuration x with energy E(x), where k is Boltzmann’s constant. We incorporate it by using its meaning as a probability of a configuration to define the acceptance probability

    P(accept | ∆f) = { 1,               ∆f ≤ 0,
                     { exp(−∆f/(kT)),  ∆f > 0.

¹ The case ∆f = 0 is a special case, and we will see why we set the acceptance probability to 1 for this case in a minute.
Figure 5.2: Illustration of the properties of acceptance functions (P(accept | ∆f) plotted against ∆f, for temperatures T1 and T2 < T1). As ∆f increases, or temperature decreases, the probability of acceptance decreases.

In optimisation, the temperature is arbitrary, so we may omit the constant k. A concise way of writing the acceptance function is then

    P(accept | ∆f) = min{ 1, exp(−∆f/T) }.                (5.3)
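A direct transcription of (5.3) into Python (a sketch; the function names are our own):

```python
import math
import random

def p_accept(delta_f, T):
    """Acceptance probability (5.3): improvements are always accepted;
    increases are accepted with probability exp(-delta_f / T)."""
    return 1.0 if delta_f <= 0 else math.exp(-delta_f / T)

def accept(delta_f, T):
    """Randomised acceptance decision based on (5.3)."""
    return random.random() <= p_accept(delta_f, T)
```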

THE ANNEALING SCHEDULE
In the physical analogy, temperature is reduced slowly over time, which allows the system to stay approximately in equilibrium as the temperature decreases. We need to do something analogous here, but we have complete control over T. The annealing schedule specifies exactly what the behaviour of T will be.
In general, we want the temperature to decrease over time. In optimisation terms this means that the “negative” jumps will become less likely as the algorithm progresses.
However, we have lots of choices to make. For instance, should the reductions be

• inhomogeneous: where we decrease the temperature at each step; or

• homogeneous: where we run the algorithm for a while, then reduce the temperature, and then repeat.

The latter is more in line with the physical process, because it gives the system some time to reach “equilibrium”, but it may be slower.
We then need to choose the initial temperature, and how much to decrease the temperature by. The analogy breaks down here a little, as we have no scale equivalent to Celsius, and no notion of a “boiling” or “freezing” point to set our numbers by. The initial temperature should be, in some sense, high enough for “melting”, but the right temperature isn’t at all obvious. We might set it to make our initial steps reasonably likely (say P(accept | ∆f) = 0.5 or 0.8) for any likely value of ∆f, but even this is a hard criterion to use for a particular problem without some experimentation.
The temperature reductions bring in even more complexity. We could choose to build a full table of these reductions, but it is more common to simply use a geometric decrease, i.e.,

    T_{i+1} = α T_i,

where α is usually in [0.75, 0.95]. One advantage of a geometric schedule is that we can set the initial temperature to be quite high, as reductions then occur rapidly.
The other determining factor is when to stop. As we may be stuck in a local minimum, examining gradients isn’t useful, and step sizes may also be misleading. One solution is to use the analogy, which would have us stop when the temperature is 0 (or close to zero), where non-descent moves become impossible. However, we might instead stop if nothing changes for some time.

THE ALGORITHM
Given an acceptance function and an annealing schedule, the algorithm is as in Algorithm 14. It is often called the Metropolis–Hastings algorithm; technically that name belongs to an algorithm intended for a different purpose (Markov Chain Monte Carlo), but we will use it for the sake of ease, as many others have.
The algorithm itself is given by Algorithm 14, given initial and final temperatures T_init and T_end, a decrease factor α, and an inhomogeneous, geometric decrease schedule. We keep track of the best solution “so far” in the variable x_best (the strict SA algorithm doesn’t do this, but it is an obvious improvement). Obviously, there are many ways to improve this algorithm by, for instance, storing function values in temporary variables, but the underlying algorithm should be clear.
However, this oversimplifies. Generation of random moves is non-trivial. The moves must be to “neighbouring” states, but what exactly does that mean? And they must be random, while still enforcing any constraints on the problem, and they must be generated efficiently.
Another issue to determine, for a homogeneous approach, is how many times we should run the inner loop before changing the temperature. Ideally it is long enough to explore the regions of the search space that should be reasonably populated, but determining this may take some trial and error, as it is problem dependent.
There are also many much more sophisticated modifications of the approach in the literature, e.g., [12], and these may well be needed because of the complexities of real problems.
problem: Problem 5.1
input: A feasible starting point x0.
output: A final estimate of the minimiser x_best.
Initialisation: k = 0;
T = T_init;
while T > T_end do
    Choose ∆x randomly;
    Let ∆f = f(x_k + ∆x) − f(x_k);
    if ∆f ≤ 0 then
        Let x_{k+1} = x_k + ∆x;
    else
        if random() ≤ P(accept | ∆f) then
            Let x_{k+1} = x_k + ∆x;
        else
            Let x_{k+1} = x_k;
        end
    end
    if f(x_{k+1}) < f(x_best) then
        Let x_best = x_{k+1};
    end
    T = αT;
    k = k + 1;
end

Algorithm 14: The Metropolis–Hastings Algorithm. The function random() is assumed to generate a uniform random variable on (0, 1), and P(accept | ∆f) is given by (5.3).
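A minimal Python sketch of Algorithm 14, for an unconstrained 1D objective, follows. The move generator (a Gaussian perturbation of scale `step`) and the test function are our own illustrative choices; a real application would also have to respect the constraints of Problem 5.1.

```python
import math
import random

def simulated_annealing(f, x0, T_init=10.0, T_end=1e-3, alpha=0.9, step=0.5):
    """Sketch of Algorithm 14 with a geometric (inhomogeneous) schedule."""
    x = x_best = x0
    T = T_init
    while T > T_end:
        dx = random.gauss(0.0, step)            # random move
        delta_f = f(x + dx) - f(x)
        if delta_f <= 0 or random.random() <= math.exp(-delta_f / T):
            x = x + dx                          # accept the move
        if f(x) < f(x_best):
            x_best = x                          # track the best solution so far
        T *= alpha                              # cool according to the schedule
    return x_best

# A 1D function with several local minima, in the spirit of Figure 5.1:
f = lambda x: x**2 + math.sin(8.0 * x)
print(simulated_annealing(f, x0=1.5))
```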
§5.2 GENETIC ALGORITHMS


Genetic Algorithms (GAs), also called evolutionary computing, are a set of randomised algorithms which derive their behaviour from a metaphor of Darwin’s theory of evolution. We generate populations of solutions, and allow them to “evolve” towards fit solutions (solutions that minimise our objective). GAs have advantages in flexibility: they can even be applied when the objective function isn’t known.
The idea was pioneered by John Holland in the 1960s (see [6]), and as with simulated annealing there have been very many applications, e.g., [5, 10].
The key advantage of GAs is the ease with which they can handle
arbitrary kinds of constraints and objectives:

• we only have to be able to compute them;

• we don’t need gradients; and

• we don’t even need to be able to express them mathematically.
That makes them highly applicable where

• the search space is complex or poorly understood;

• expert knowledge is difficult to encode to narrow the search


space; and

• mathematical analysis is not available.

For instance, one example I like concerns using a GA to design a computer algorithm that produces art. The pictures in Figure 5.3 were generated using a computer algorithm which was tuned by selection (see http://www.geneticart.org/). The selection process was manual: people used a web interface to say which picture they liked, and this was used to select and improve the algorithm, producing new generations of pictures that were more appealing.
It’s a perfect example of a problem where we can’t express the objective function in mathematical form, and can’t calculate it quickly. In other words, the GA might say “I don’t know much about art, but I know what I like.”

Figure 5.3: Examples of “genetic art”.

The terminology used in GAs is based on biological terminology:

• Living things are built using a plan described in our chromo-


somes.

• Chromosomes are strings of DNA and serve as a model for the


whole organism.

• A chromosome’s DNA is grouped into blocks called genes, which


have a location called a locus.
CHAPTER 5. NON-CONVEX OPTIMISATION 233

• Notionally, a gene codes for a particular trait:

– e.g., blue or brown eyes,

– where the possible settings for a trait are called alleles.

• A complete set of genetic material (all the chromosomes) is called


a genotype, which we can think of as the code for the organism.

• The expression of the genotype produces a phenotype, namely,


the organism.

In biological evolution, successful parents are those that rear young.


During reproduction, recombination occurs where genetic material
from both parents is incorporated, so the young have a resemblance
to both parents. In this context we call this process, where genes from
parents combine to give genes for offspring, crossover.
We also see mutation where some genes are randomly changed.
The theory of evolution is sometimes called “survival of the fittest”. The fitness of an organism is measured by the success of the organism in its reproduction: fitter organisms are more likely to reproduce, and so propagate their genes further. We can see here that the “objective” function isn’t even defined a priori, but rather is a function of competition, and this is ideal in the context of some optimisation problems (e.g., consider the chess program we discussed earlier).
There are all sorts of variations on the theory, but that’s about all we need to start making progress on an algorithm. The simple, general version of a GA is given in Algorithm 15. It’s really just the shell of the algorithm, because there are many details to decide.
We need to choose component algorithms for all of these steps, as
well as a method to encode the “solution” in a chromosome x, in a way
that is compatible with these. Hence we need to choose
problem: Problem 5.1


input: GA parameters: N =number of generations,
P =population size, ...
output: A final estimate of the minimiser xbest .
Initialisation: Create (randomly) a set of solutions called the
population, P ;
for k = 0, 1, . . . , N do
Evaluate the fitness f (x) of each x ∈ P ;
Select parents from the population biased by fitness;
Perform crossover and mutation;
Replace old population with the next generation;
end

Algorithm 15: A Genetic Algorithm (or at least the shell of one).

• chromosome encoding,

• crossover method,

• mutation method,

• selection method,

• fitness criteria.

We will discuss each of these below, but note that we didn’t include fitness function evaluation, which could be explicit, as in most of the previous problems, or the fitness could be implicit in the result of a competition between the members of the population.
CHROMOSOME ENCODING
There are various ways to encode information. We could just store our optimisation variables as numbers in a vector x. However, mutation is usually thought of at the bit level: we might mutate a single bit, not randomise a whole integer. Moreover, we will see that the various operations of a GA place some emphasis on the locus of information. Genes that are close together are less likely to be split up than those far apart. So the order of the elements of the vector may matter in ways that it doesn’t in any of the algorithms we have considered so far.
Thus, in any GA, the first issue to consider is how to represent the state variables of the problem in a chromosome. There are quite a few valid approaches; some simple ones are:

• Value Encoding: encode values directly, as a vector of numbers;

• Binary Encoding:

Chromosome A 1101100100110110

– each bit could represent some characteristic, or


– a string of bits can represent a number.

• Permutation Encoding: each chromosome is a permutation,


e.g., representing a potential ordering:

Chromosome A 1 5 3 2 6 4 7 9 8
Chromosome B 8 5 6 7 2 3 1 4 9

• Tree Encoding: used for representing a more complicated data


structure such as an algorithm.
Binary encoding can be done in several ways. One appealing approach is a Gray code [2, 4]. This represents each number in the sequence of integers {0, . . . , 2^m − 1} as a binary string of length m. We already have conventional binary representations, but they have the problem that flipping different bits in the string (say by random mutation) has smaller or larger effects depending on the location of the bit being flipped.
In a Gray code, numbers are encoded in such an order that adjacent integers have Gray code representations that differ in only one bit position. Marching through the integer sequence therefore requires flipping just one bit at a time. For example, the codings of the numbers {0, . . . , 7} as binary strings of length m = 3 are shown in Table 5.1.

numbers 0 1 2 3 4 5 6 7
binary coding 000 001 010 011 100 101 110 111
Gray coding 000 001 011 010 110 111 101 100
Table 5.1: Binary codes for the integers {0, ..., 7}.
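Conversion between standard binary and the reflected Gray code is a one-line bit trick; the following sketch reproduces Table 5.1 (the particular construction, b XOR (b >> 1), is the standard reflected code):

```python
def binary_to_gray(b):
    """Reflected Gray code of the integer b."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """Inverse map: XOR-accumulate the successively shifted bits."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

print([format(binary_to_gray(i), '03b') for i in range(8)])
# ['000', '001', '011', '010', '110', '111', '101', '100'] -- as in Table 5.1
```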

CROSSOVER
Cross-over is the way we select genes from parents’ chromosomes to
create offspring’s genes. Typically it is performed between two parents,
though we can easily generalise.
One simple approach is a single crossover point:

• randomly choose a crossover point,

• copy first chromosome up to the crossover point, and


• copy second chromosome after the crossover point.

For example:

Parent 1:  1101100100110110
Parent 2:  1111111000011110
Offspring: 1101111000011110
(crossover point after the fourth bit)
There are other ways to perform crossover

• multiple crossover points: more than one random crossover


point is chosen

• random crossover: randomly select genes from each parent

• arithmetic crossover: some arithmetic operation is performed


to make a new offspring

Different types of crossovers work better for different problems, and


for different chromosome representations. For instance, crossover for
a permutation encoding must be different because randomly chosen
segments of two strings won’t create a valid permutation chromosome.
A reasonable (single point) crossover is as follows:

• one crossover point is selected,

• copy the chromosome from the first parent up to the crossover,


and

• then the other parent is scanned and if the number is not yet in
the offspring, it is added (in order of scanning).

For example:
Parent 1:  1 2 3 4 5 6 7 8 9
Parent 2:  4 5 3 6 8 9 7 2 1
Offspring: 1 2 3 4 5 6 8 9 7
(crossover point after the sixth element)
Some more illustrations of encoding and crossover can be found at
http://cs.felk.cvut.cz/~xobitko/ga/cromu.html
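Both crossover variants above take only a few lines; a sketch (the helper names are our own), which reproduces the two worked examples:

```python
def single_point_crossover(p1, p2, point):
    """Copy p1 up to the crossover point, then p2 after it."""
    return p1[:point] + p2[point:]

def permutation_crossover(p1, p2, point):
    """Copy p1 up to the point, then scan p2 and append any value not
    already present, so the offspring is still a valid permutation."""
    child = list(p1[:point])
    child += [v for v in p2 if v not in child]
    return child

print(single_point_crossover("1101100100110110", "1111111000011110", 4))
# -> 1101111000011110
print(permutation_crossover([1,2,3,4,5,6,7,8,9], [4,5,3,6,8,9,7,2,1], 6))
# -> [1, 2, 3, 4, 5, 6, 8, 9, 7]
```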

MUTATION
Mutation is an important component of both real evolution and evo-
lutionary computing. It is necessary to prevent all solutions in the
population falling into a local optimum (crossover by itself may not be
able to escape). Think of it as ensuring genetic diversity.
Mutation is simulated by introducing random changes to the off-
spring, but the meaning of random must be carefully defined. If we are
using binary encoding (with Gray codes) then a common approach is
to flip each bit with a small probability p. For example:
Original offspring: 1101100100110110
Mutated offspring: 1101000100111110
There are other possibilities: for instance, if a group of bits encode
a gene, we could mutate whole genes at each step.
However, other codes may require different strategies: for instance
in permutation encoding we may want to swap two randomly chosen
elements so that the result is still a valid permutation. For example:
Original offspring: 1 2 3 4 5 6 8 9 7
Mutated offspring:  1 8 3 4 5 6 2 9 7
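The corresponding mutation operators, again as a short sketch:

```python
import random

def bitflip_mutation(bits, p=0.01):
    """Flip each bit of a binary string independently with probability p."""
    return ''.join(('1' if b == '0' else '0') if random.random() < p else b
                   for b in bits)

def swap_mutation(perm):
    """Swap two randomly chosen positions; the result is still a permutation."""
    perm = list(perm)
    i, j = random.sample(range(len(perm)), 2)
    perm[i], perm[j] = perm[j], perm[i]
    return perm
```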
SELECTION ALGORITHMS
We also need a method to select parents. Obviously we could simply choose the fittest, but this quickly reduces diversity in the population, and isn’t the best long-term strategy.
In general, we select parents stochastically, but with some bias towards fitter parents. Examples of approaches are:

• Roulette Wheel Selection: select randomly, based on the fitness function. The probability of selection of x_i is

    p_i = f(x_i) / Σ_{j ∈ P} f(x_j).

• Rank Selection: rank the population in order, so that f(x_(1)) ≤ f(x_(2)) ≤ · · · ≤ f(x_(N)). The probability of selection of x_(i) is

    p_i = i / Σ_{j=1}^{N} j = 2i / (N(N + 1)).

In more detail, Roulette Wheel Selection is based on the following:

• Parents are selected according to their fitness

• The better the genotype is, the more chances it has to be selected

• Imagine a roulette wheel where all the genotypes in the popula-


tion are placed.

• The size of the section in the roulette wheel for an individual is proportional to its fitness.

This is illustrated in Figure 5.4a.

(a) Roulette Wheel Selection. (b) Rank-based selection.

Figure 5.4: Selection methods.

However, Roulette Wheel Selection has a problem when there are


big differences in fitness values — one individual may completely dom-
inate selection, and become the only parent. This is simply because
fitness values may vary wildly, depending on how they are calculated.
Rank selection ranks the population from 1, . . . , N , and selection
probabilities are linear in the potential parent’s rank:

• the worst will have probability 2/[N (N + 1)],

• the best will have probability 2N /[N (N + 1)].

The approach is illustrated in Figure 5.4b. We can see that it still favours
the fittest, but not by as much as a pure Roulette Wheel approach.
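Both selection rules are easy to code; a sketch follows (the helper names are our own). Note one caveat for our setting: roulette selection as written assumes the fitness is positive and that larger is better, so for a minimisation problem one would first transform the objective (e.g., fitness = 1/(1 + f), or −f suitably shifted).

```python
import random

def roulette_select(population, fitness):
    """Select one parent with probability proportional to its fitness."""
    weights = [fitness(x) for x in population]
    return random.choices(population, weights=weights)[0]

def rank_select(population, fitness):
    """Select with probability p_i = 2i / (N(N+1)) for the individual of
    rank i, where rank N is the fittest."""
    ranked = sorted(population, key=fitness)     # worst first, best last
    N = len(ranked)
    weights = [2 * (i + 1) / (N * (N + 1)) for i in range(N)]
    return random.choices(ranked, weights=weights)[0]
```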
Another problem (similar to the problem of a pure simulated annealing algorithm) is that there is no history kept and, as the process is random, we may lose a good solution and never find it again. We can avoid this using elitism. In this case, we keep a small number of the best of each generation, and put them directly into the next generation without crossover or mutation.
Elitism can dramatically improve the performance of a GA, with little overhead, simply by ensuring that the best past solution is always included in the current population, so that the algorithm never goes backwards.
PARAMETERS OF GAs
The components above, once chosen, specify a particular GA from the space of possible algorithms, but we also need to set the parameters of the algorithm. These choices have a significant effect on performance, so some care should be taken.
Typical parameters that need to be set are:

crossover probability: if there is no crossover, offspring are exact copies of their parents. This doesn’t mean the population is unchanged, as the distribution of individuals will still change, but the genetic diversity will generally decrease over time. Crossover is made in the hope that new chromosomes will contain good parts of old chromosomes, and therefore that the new chromosomes will be better, so the crossover rate should generally be high: about 80%–95% is a rule of thumb (though it can vary).

mutation probability: if there is 0% mutation, offspring are generated immediately after crossover, but there is the possibility that the population will get caught in a local minimum, because there is no way to jump out. However, mutation should not occur very often, because then the GA would just be a random search. A rule of thumb might be about 0.5%–1%.

population size: if the population is too small, there are few opportunities to perform crossover, and only a small part of the search space is searched in each generation. If the population is too large, the GA slows down, and at some point we reach diminishing returns. The best size depends on the problem (and the size of the search space), but a rule of thumb might be 20–30, though 50–100 is reported as best in some cases. Some research also shows that the best population size depends on the size of the chromosomes: e.g., for chromosomes with 32 bits, the population should be larger than for chromosomes with 16 bits.

number of generations: the number of generations directly determines how long the algorithm will take. We typically see diminishing returns as the number of generations increases, but often this number needs to be quite high to reach a point relatively close to optimal.

elitism number: as described earlier, elitism keeps some of the best members of each generation around for the next generation, to prevent backsliding. It can work with only a single “elite”, but sometimes it is helpful to keep the genes of the best two or three around, to keep injecting them into the population.

EXAMPLES
GAs have been used in many settings, but we mention below a few of the more beautiful. For instance, GAs have been used in generating artificial life: Torsten Reil created realistic animations of stick figures, by adding “muscles” to them and using distance walked as fitness², and these ideas have been used in the computer-generated effects of the Lord of the Rings films. Karl Sims used a GA to evolve artificial creatures [11].

NOTES
More information (at various levels) on GAs can be obtained from
http://www.obitko.com/tutorials/genetic-algorithms/
http://www.rennard.org/alife/english/gavintrgb.html

² http://www.democraticunderground.com/discuss/duboard.php?az=view_all&address=389x3582335
http://www.cs.cmu.edu/Groups/AI/html/faqs/ai/genetic/top.html
http://www.genetic-programming.com/published/scientificamerican1096.html
http://www.trnmag.com/Stories/2003/060403/Artificial_beings_evolve_realistically_060403.html

§5.3 OTHER APPROACHES


It is important to understand one critical feature of both simulated annealing and GAs. They use randomness, but are not random. They are still directed searches, which use features of the constraints and objective to guide the search, but they use randomness to help them avoid being caught in local minima. However, the result is highly non-random!
The other key factor in these algorithms is that they make few assumptions about the objective function: they don’t require it to be differentiable, or continuous, or even expressible as a “function”. They just require us to be able to calculate the function.
There are other randomisation algorithms available to us.

• ants: metaphor is a colony of ants (simple agents) running sim-


ple rules, to achieve highly organized collective behaviour (also
called Swarm Intelligence)
http://www.merlotti.com/EngHome/Computing/AntsSim/ants.htm
http://www.codeproject.com/cpp/GeneticandAntAlgorithms.asp

• tabu search: iteratively try to find solutions to the problem, but keep a short list of previously found solutions and avoid “re-finding” those solutions in subsequent iterations. Basically, if you try a solution, it becomes tabu in future tries [3].
These approaches all perform well in certain settings, but there is no “one best” algorithm, and often it requires some experimentation to find the one (and its parameters) that works best for your application.
There are other non-convex problems where we don’t need to use randomised algorithms. For instance, in certain “sparse” problems where the solution has particular structure, we can often exchange the original non-convex problem for an equivalent convex problem (sometimes even a linear program), and solve this instead. Such problems are a current hot topic in optimisation, often discussed under the heading of “compressive sensing” (from the application in which they are typically used).

§6
CONCLUSION

We should probably say something to sum up the course here, but I can’t really do that any better than saying: just go through the notes carefully again, and good luck with your exam!

BIBLIOGRAPHY

[1] DANTZIG, G. B. Programming of interdependent activities: II Mathematical model. Econometrica 17, 3/4 (1949), 200–211. This is a revised version of a paper that appeared at the Cleveland Meeting of the Econometric Society on December 27, 1948.

[2] GARDNER, M. Mathematical games. Scientific American, 2 (August 1972), 106.

[3] GLOVER, F. Tabu search: A tutorial. Interfaces 20, 4 (1990), 74–94.

[4] GRAY, F. Pulse code communication. U.S. Patent 2 632 058, March 17 1953.

[5] GRENFENSTETTE, J. J., Ed. Proceedings of the First International Conference on Genetic Algorithms and Their Applications. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1985.

[6] HOLLAND, J. H. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI, 1975.

[7] KIRKPATRICK, S., GELATT JR., C. D., AND VECCHI, M. Optimization by simulated annealing. Science 220 (1983), 671–680.

[8] LEONE, R. D. The origin of operations research and linear programming. http://globopt.dsi.unifi.it/gol/Seminari/DeLeone2008.pdf, 2008.

[9] METROPOLIS, N., ROSENBLUTH, A., ROSENBLUTH, M., TELLER, A., AND TELLER, E. Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 6 (1953), 1087–1092.

[10] SCHAFFER, J. D., Ed. Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann Publishers, Inc., 1989.

[11] SIMS, K. Evolving virtual creatures. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques (New York, NY, USA, 1994), SIGGRAPH ’94, ACM, pp. 15–22.

[12] WANG, L., ZHANG, H., AND ZHENG, X. Inter-domain routing based on simulated annealing algorithm in optical mesh networks. Opt. Express 12 (2004), 3095–3107. http://www.opticsexpress.org/abstract.cfm?URI=OPEX-12-14-3095.

[13] WOOD, M. K., AND DANTZIG, G. B. Programming of interdependent activities: I General discussion. Econometrica 17, 3/4 (1949), 193–199. This is a revised version of a paper that appeared at the Madison Meeting of the Econometric Society on September 9, 1948.
INDEX

alleles, 233
bracketing, 21
chromosomes, 232
column space, 209
condition number, 117
Conjugate Gradient Method, 128
convex, 2, 82, 91
convex cone, 156, 172
convex program, 2
convex set, 2
crossover, 233
descent, 83
descent direction, 103
descent methods, 103
DNA, 232
duality, 169
elitism, 240
evolutionary computing, x, 1
feasible direction, 172, 177
feasible point, 159
generalised inverse, 204
genes, 232
genetic algorithms, x, 1
genotype, 233
global maximum, 5
Golden Ratio, 36
Golden Section, 36
gradient projection algorithm, 192, 213
heuristic, 221
ill-conditioned, 118
infeasible, 153
iteration, 26
Kantorovich’s Inequality, 119
KKT conditions, 170
Lagrange multiplier, 160
linear program, 2
local maximum, 5
locus, 232
minimiser, 9
monkey saddle, 5
mutation, 233
null space, 209
operations research, 84
order of convergence, 77
orthogonal projection, 194
orthogonal sum, 209
phenotype, 233
population, 234
principal minors, 86
projection matrix, 195
quadratic convergence, 77
quadratic form, 86
quadratic program, 2, 85, 189
range, 209
saddle point, 5
scope, 63
slack, 155
smooth, 67
sounding line, 145
standard basis vectors, 124
standard form, 150
stationary, 5
stationary points, 18
steepest descent, 106
tight, 153
tolerance level, 69
unimodal, 22
unitary, 120
worst case, 32