Optimisation III

Matthew Roughan
(based on notes produced by Nigel Bean and Liz Cousins)

2013

© 2013 Matthew Roughan
All rights reserved
First edition: 12th March, 2010
Contents

Contents i
List of Figures iv
Preface x
1 Introduction to Optimisation 1
1.1 Optimisation 1
1.2 Examples 3
Examples seen in previous courses 3
Some More Difficult Examples 7
1.3 Notation and Conventions 8
1.4 Questions to Ponder 12
1.5 Expected background 13
1.6 What are we trying to teach? 14
1.7 An Outline of our Progress Towards Hard Problems 15
Bounded problems 22
Unbounded problems 53
2.3 Quadratic Approximation Algorithms 55
2.4 Methods for Smooth Functions 67
Newton's Method 67
Secant Method 77
KKT Conditions for other problems 186
KKT Conditions for Quadratic Programming 189
4.5 The Gradient Projection Algorithm 192
Background to the Gradient Projection Method 194
6 Conclusion 245
Bibliography 246
Index 248
List of Figures
4.2 An illustration of the least-square solution. The point shows the gravity model solution, and the dashed line shows the subspace specified by a single constraint equation. The least-square solution is simply the point which satisfies the equation which is closest to the gravity model solution. The weighted least-squares solution gives different weight to different unknowns. 148
4.3 Convex and non-convex sets. 149
4.4 Two convex cones. 157
4.5 An illustration of Farkas' Lemma (see Example 30). 158
4.6 We wish to maximise the area of the rectangle, subject to a fixed perimeter constraint. 161
4.7 We wish to maximise the area of the rectangle, subject to it fitting inside a circle with diameter 1. Note that the diagonal of the rectangle is just the diameter of the circle. 162
4.8 An illustration of the effect of a Lagrange multiplier using f(x) = 2x₁² + 2x₁x₂ + x₂² − 10x₁ − 10x₂ and constraint g(x) = x₁² + x₂² − 5 = 0. We can see that (4.14) is enforced here with λ = 1. Intuitively we can see why: the minimum occurs just at the point where the level curves of f(x) (the orange ellipses) just touch the feasible region curve g(x) = 0 (shown in blue). The critical level curve is f(x) = 20 (shown in green). When they just touch, their tangents must be the same (as they are continuously differentiable) and hence the normals defined by the gradients must be aligned. 166
4.9 An illustration of the fact that $\nabla f(\mathbf{x}^*) = -\sum_{i \in I(\mathbf{x}^*)} \lambda_i \nabla g_i(\mathbf{x}^*)$ for λᵢ ≥ 0 for any global minimum. The shaded region indicates the feasible set. We know that −∇gᵢ(x*) must point into this set because the constraints are gᵢ(x*) ≤ 0. The green shaded sub-region indicates the convex cone of viable directions for ∇f. 172
4.10 An illustration of Example 38. The green region is the feasible set Ω. The orange lines are the elliptical level curves of f(x). The four cases of solutions are illustrated by dots. 182
4.11 A closeup of Figure 4.10 showing the gradients, and the convex cone of {−∇g₁, −∇g₂} (in green) at the point x where I(x) = {1, 2}. As ∇f(x) is outside this cone, this can't be an optimal point. 183
4.12 A closeup of Figure 4.11, with the green region showing feasible descent directions from the point I(x) = {1, 2}. As there are feasible descent directions, we can't be at the minimum. 184
4.13 Decomposition of −∇f(x) into u + q, where u and q are chosen to be orthogonal, i.e., uᵀq = 0. 194
4.14 Orthogonal projection of the vector y into the space S(x) using the projection matrix P_S. 196
4.15 A side-on view of the orthogonal projection problem. 197
4.16 The two cases where u = 0. 207
4.17 Example 41. The feasible region is shown in green. Level curves of f(x) are shown as the orange concentric circles. It may seem unlikely that we would exactly hit a vertex, but the following figure shows what would happen if we started at x₀ = (1.5, 4). 219
4.18 Example 41 but with an alternative starting point x₀ = (1.5, 4). The feasible region is shown in green. Level curves of f(x) are shown as the orange concentric circles. 220
5.3 Examples of "genetic art". 232
5.4 Selection methods. 240
List of Algorithms
13 The Gradient Projection Algorithm. Note that we have omitted checking for linear independence of the rows of M at each step for simplicity of exposition (this step is tricky because you don't want to do it every time as M is sometimes repeated). 214
Preface
All involve looking for the best solution to some objective, often subject to some constraints.
We can even think of natural processes such as evolution as a form
of optimisation, and indeed genetic algorithms and evolutionary com-
puting deliberately exploit this metaphor to solve other optimisation
problems we may wish to solve.
This course is about solving complicated optimisation problems
involving continuous, real variables. They’re complicated because they
can involve many variables and constraints, and because the objective
and constraints can be non-linear.
Apart from teaching one how to solve such problems, the course is
concerned with a key part of all applied mathematics: translation of a
real-world problem into mathematics. In general, this is a challenging
task. It seems, from what you may have studied, that this is easy in optimisation. After all, cost or profit sound like well-defined things that could be written mathematically. However, the challenge is that
although we might be able to write them, the resulting optimisation
problem may be intractable. Sometimes it isn’t even possible to write
down all the factors that go into an objective function, and constraints
are similarly complicated.
The job of a mathematician is often to approximate these factors
in a way that is both reasonable (from the point of view of solving the
real-world problem), and mathematically tractable. To do so, we have
to get to the nub of “what is hard” in optimisation, and that is at the
root of the structure of this course.
§1 Introduction to Optimisation

§1.1 Optimisation
Optimisation is a task all human beings, indeed all living things,
do. It is central to any decision making task, i.e., in any task involving
choosing between alternatives. Choice is governed by wanting to make
the “best” decision:
All involve looking for the best solution to some objective, often subject to some constraints.
We can even think of natural processes such as evolution as a form
of optimisation, and indeed genetic algorithms and evolutionary com-
puting deliberately exploit this metaphor to solve other optimisation
problems we may wish to solve.
In general, we define an optimisation problem in three pieces: the

variables – these are the choices we can make, or the things we can control. In this course they will often be given in a vector, e.g., x, of real numbers.

constraints – these are the limits on the possible choices we can make (usually in the form of inequalities with respect to the variables).

objective – the function of the variables whose value we seek to minimise or maximise.
§1.2 Examples
One of the most challenging problems in Applied Mathematics is translating a real-life optimisation problem into a mathematical formulation. Real-life problems are stated in (vague and fuzzy) words, and are complex (sometimes too complex for us to solve the real problem). There is often a requirement to approximate the real problem by something mathematically and computationally tractable.
(Figure: (a) a 2D optimisation problem with a local maximum and a saddle point; (b) a problem where the maxima occur on a ring.)

Mathematical formulation:

maximise z = 3x₁x₂ + 4x₂x₃ − 5x₃x₁
such that x₁ + x₂ + x₃ ≤ 7
          x₁ − x₃ ≤ 2
          x₁, x₂, x₃ ≥ 0
There are lots of other optimisation problems that cause trouble for
the optimisation techniques you have been taught so far, but the cases
above are the types of problems that we will tackle in this class.
and we say x ∈ ℝⁿ; upper case letters, e.g., A, will denote matrices. Optimisation problems will be described in the following form (details will vary, but the general form remains the same):
Problem 1.2.

maximise z = 3x₁ + 4x₂ − 5x₃
such that x₁ + x₂ + x₃ ≤ 7
          x₁ − x₃ ≤ 2
          x₁, x₂, x₃ ≥ 0
(iii) f(x₁, x₂, x₃) = sin(x₁ + x₂) − x₃².
\[
D_i f(\mathbf{x}) = \frac{\partial f}{\partial x_i}
\quad\text{and}\quad
\nabla f(\mathbf{x}) = \begin{pmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \\ \vdots \\ \partial f/\partial x_n \end{pmatrix}.
\]

\[
H_f(\mathbf{x}) = \left( \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j} \right) = \nabla^2 f(\mathbf{x}). \tag{1.1}
\]

\[
\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i},
\]
so the Hessian matrix is symmetric (when the mixed second partial derivatives are continuous).
Examples.

(i) For $f(\mathbf{x}) = \|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$,
\[
\nabla f = \frac{1}{\sqrt{x_1^2 + \cdots + x_n^2}} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \frac{\mathbf{x}}{\|\mathbf{x}\|},
\]
and
\[
\nabla f(\mathbf{x}) = \begin{pmatrix} 8x_1 + 2x_2 \\ 2x_1 - 6x_2 \end{pmatrix} = A\mathbf{x},
\qquad
H(\mathbf{x}) = \begin{pmatrix} 8 & 2 \\ 2 & -6 \end{pmatrix} = A \ (= A^T).
\]
For
\[
f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{x}^T \mathbf{b} + c, \quad (\text{where } A = A^T),
\]
we have
\[
\nabla f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}, \qquad H = A.
\]
(Figure: the curve y = ln x.)
• linear independence,
• Taylor’s theorem.
Some coding will be required in this class. The computer lan-
guage/toolset of choice is Matlab. Students will be expected to be
able to create functions in Matlab, and use function handles.
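For example, a minimal sketch of the sort of Matlab usage expected (the function here is Example 1 from Chapter 2; names are illustrative):

f = @(x) 1 - exp(-x/2).*log(x/5);   % an anonymous function; f is a function handle
f(7)                                % evaluate it like an ordinary function
fminbnd(f, 2, 10)                   % pass the handle to Matlab's built-in 1-D minimiser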
Some instruction/revision in the above topics will be included in
the course, but students who are not already familiar with these topics
may struggle. If you don’t remember this material, please brush up on
it ASAP.
The basic approach of the course is to start with the simplest prob-
lems and generalize. That has two advantages:
§2 Single Variable Optimisation
¹Throughout these notes we will focus on finding minima, but we can easily convert all of our techniques to finding maxima simply by considering −f(·).
§2.1 Background
If our function is differentiable, we look for stationary points –
where the derivative is zero — first. Why should we look for these?
Well, first let us make a formal definition of a local minimum:
\[
f(x + \varepsilon) = f(x) + \varepsilon \frac{df}{dx} + \frac{\varepsilon^2}{2}\frac{d^2 f}{dx^2} + \cdots .
\]
We know that if x is a local minimum, then for some small ε, we must have f(x + ε) − f(x) > 0, but note that for small ε we can approximate
\[
f(x + \varepsilon) - f(x) \simeq \varepsilon \frac{df}{dx},
\]
whose sign depends on ε, i.e., the only way that this can be greater than zero for both positive and negative ε is if df/dx = 0 at the minimum.
We can then classify these into local minima, maxima or stationary points of inflection by considering the second derivative in the Taylor series. If
Bounded problems
These are often referred to as bracketing methods, because we assume we have an interval [a₀, b₀] that definitely contains the minimiser. We start with absolutely minimal assumptions about the function to be optimised. We do assume f(x) is unimodal over [a₀, b₀], i.e., it has only one local minimiser in [a₀, b₀]. That's so that we know the local minimiser will be the global minimiser². More formally,
(Figure: (a) a unimodal function; (b) a multimodal function.)

²We can weaken even that requirement as well, but we'll do so later.
(Diagram: the five points a₀ < d < c < e < b₀, with c the midpoint of [a₀, b₀], d the midpoint of [a₀, c], and e the midpoint of [c, b₀].)

Now:
if f(d) < f(c), the min of f lies in [a₀, c] (set [a₁, b₁] = [a₀, c]);
if f(d) ≥ f(c) and f(c) > f(e), the min of f lies in [c, b₀] (set [a₁, b₁] = [c, b₀]);
if f(d) ≥ f(c) and f(c) ≤ f(e), the min of f lies in [d, e] (set [a₁, b₁] = [d, e]).
In each case, the new search interval is half the length of the old one.
Algorithm 1 shows how the Dichotomous Search is put together out of
successively following these steps.
At the end of Algorithm 1 we will have been through the while loop N = n times. We call each time through this main loop an iteration.
At the end we will know that the minimiser x* lies in the final interval [a_N, b_N] = [a_k, b_k], and we know that the length of the interval is no more than ε.

We know the minimiser x* ∈ [a_N, b_N], so it can be approximated by the midpoint of the interval, (a_N + b_N)/2, with an error of (less than) ε/2.
In each iteration, the bracketing interval is halved. So, after N
iterations, the length of the bracketing interval is
\[
\ell_N = b_N - a_N = \frac{b_0 - a_0}{2^N}.
\]
We can use this to calculate N, given an initial value of ε and [a₀, b₀]. If we want b_N − a_N ≤ ε at completion, then we need
\[
\frac{b_0 - a_0}{2^N} \le \varepsilon
\quad\Longrightarrow\quad
N \ge \log_2\left(\frac{b_0 - a_0}{\varepsilon}\right).
\]
The number of iterations N must be an integer, so we choose it using the ceiling function, which we denote ⌈x⌉, i.e., we take
\[
N = \left\lceil \log_2\left(\frac{b_0 - a_0}{\varepsilon}\right) \right\rceil.
\]
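For instance, taking [a₀, b₀] = [2, 10] and ε = 1 (as in Example 1 below) gives N = ⌈log₂(8/1)⌉ = 3 iterations.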
However, let us make an important distinction here: the calculation of the number of iterations shown here is to ensure that the interval has length less than ε. If we aim to ensure that the error in the final estimate is less than ε, then remember that the final error is at most half the final interval, and so we need one less iteration! Many students make this error in the exam!
Figure 2.2: Example 1: f(x) = 1 − e^{−x/2} ln(x/5).
k   a_k   d_k   c_k   e_k   b_k   ℓ_k
0   2     4     6     8     10    8
1   4     5     6     7     8     4
2   6     6.5   7     7.5   8     2
3   6     6.5   7                 1

k   f(a_k)     f(d_k)     f(c_k)     f(e_k)     f(b_k)
0   1.337085   1.030199   0.990923   0.991392   0.99533
1   1.030199   1          0.990923   0.989839   0.991392
2   0.990923   0.989827   0.989839   0.990464   0.991392
3   0.990923   0.989827   0.989839
k   a_k         d_k         c_k        e_k         b_k         ℓ_k
0   −10         −6.75       −3.5       −0.25       3           13
1   −6.75       −5.125      −3.5       −1.875      −0.25       6.5
2   −5.125      −4.3125     −3.5       −2.6875     −1.875      3.25
3   −5.125      −4.71875    −4.3125    −3.90625    −3.5        1.625
4   −4.3125     −4.109375   −3.90625   −3.703125   −3.5        0.8125
5   −4.109375   −4.007812   −3.90625   −3.804688   −3.703125   0.40625
6   −4.007812   −3.957031   −3.90625   −3.855469   −3.804688   0.203125
7   −3.957031   −3.931641   −3.90625   −3.880859   −3.855469   0.101562
8   −3.957031   −3.931641   −3.90625                           0.050781
input: As in Algorithm 1.
output: As in Algorithm 1.
Initialisation: N = ⌈log₂((b₀ − a₀)/ε)⌉;
for k = 0, 1, ..., N − 1 do
    Let c_k = (a_k + b_k)/2; d_k = (a_k + c_k)/2;
    Evaluate (if needed) f_c = f(c_k) and f_d = f(d_k);
    if f_d < f_c then
        a_{k+1} = a_k;
        b_{k+1} = c_k;
        f_c = f_d;
    else
        Let e_k = (b_k + c_k)/2;
        Evaluate f_e = f(e_k);
        if f_c > f_e then
            a_{k+1} = c_k;
            b_{k+1} = b_k;
            f_c = f_e;
        else
            a_{k+1} = d_k;
            b_{k+1} = e_k;
            f_c = f_c (NB: nothing need be done here);
        end
    end
end
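As a concrete illustration, here is a minimal Matlab sketch of this search (function and variable names are ours, not part of the notes); it carries the midpoint value f_c across iterations so at most two new function evaluations are needed per step:

function [a, b] = dichotomous_search(f, a, b, epsilon)
% Halve the bracketing interval [a,b] each iteration, reusing the
% midpoint function value where possible (cf. the algorithm above).
N = ceil(log2((b - a) / epsilon));
c = (a + b) / 2;  fc = f(c);          % midpoint of current interval
for k = 1:N
    d = (a + c) / 2;  fd = f(d);
    if fd < fc
        b = c;  c = d;  fc = fd;      % min in [a, c]; d is its midpoint
    else
        e = (b + c) / 2;  fe = f(e);
        if fc > fe
            a = c;  c = e;  fc = fe;  % min in [c, b]; e is its midpoint
        else
            a = d;  b = e;            % min in [d, e]; c stays the midpoint
        end
    end
end
end

For Example 1: [a, b] = dichotomous_search(@(x) 1 - exp(-x/2).*log(x/5), 2, 10, 1).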
We will compare this result with later algorithms with the goal of
reducing it. It may not be at all obvious how we might go about such a
reduction, but here are a couple of suggestions.
Variations:
• choose the locations for points — we will choose two this time, at arbitrary locations p_k < q_k, both ∈ (a_k, b_k) — that help reduce the interval as fast as possible; and
• choose the points so that we can reuse function calls from one iteration to the next (much as we could reuse f_c in Algorithm 2).
• if f(p_k) > f(q_k) then x* ∈ [p_k, b_k].
We can use these to reduce the size of the future interval. The problem is symmetric, so we may as well choose p_k and q_k symmetrically, i.e., we can choose q_k to be the same distance from b_k as p_k is from a_k.

Suppose then at each iteration we choose p_k and q_k using
\[
q_k - a_k = \gamma(b_k - a_k) \quad\text{and}\quad b_k - p_k = \gamma(b_k - a_k).
\]
To have p_k < q_k we need 1/2 < γ < 1, but the actual value of γ is not known yet. However, we can see from the rules above that at each iteration we would retain a proportion γ of the old search interval.
\[
q_{k+1} - a_{k+1} = \gamma(b_{k+1} - a_{k+1}) \quad\text{and}\quad b_{k+1} - p_{k+1} = \gamma(b_{k+1} - a_{k+1}),
\]
and the same value of γ. We also wish to reduce the number of function evaluations, so it makes sense to choose one of our new points to coincide with p_k, which belongs to (a_k, q_k) and for which we already know the function evaluation!
so
\[
\gamma^2 (b_k - a_k) = (1 - \gamma)(b_k - a_k),
\]
and as b_k ≠ a_k, we can divide by (b_k − a_k) to get
\[
\gamma^2 + \gamma - 1 = 0.
\]
The positive solution to this quadratic is
\[
\gamma = \frac{\sqrt{5} - 1}{2} \ (\approx 0.61803),
\]
the Golden Ratio or Golden Section, hence the name of the search. Thus the length of the search interval is reduced by a proportion γ = (√5 − 1)/2 at each iteration. We repeat the iterations until the interval is smaller than our required tolerance ε. The resulting algorithm is shown in Algorithm 3. The algorithm can be made slightly more efficient at the expense of making it a little more confusing, so I leave this step to the student.
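The following is a minimal Matlab sketch of the Golden Section Search (our own rendering of Algorithm 3, with illustrative names); note how one of the two interior points, and its function value, is reused at every iteration:

function [a, b, M] = golden_section(f, a, b, epsilon)
% Golden Section Search: retain a proportion gamma of the interval
% each iteration, reusing one interior function value.
gamma = (sqrt(5) - 1) / 2;
p = b - gamma * (b - a);  fp = f(p);
q = a + gamma * (b - a);  fq = f(q);
M = 2;                                 % function evaluations so far
while (b - a) > epsilon
    if fp > fq                         % minimiser in [p, b]
        a = p;
        p = q;  fp = fq;               % old q becomes the new p
        q = a + gamma * (b - a);  fq = f(q);
    else                               % minimiser in [a, q]
        b = q;
        q = p;  fq = fp;               % old p becomes the new q
        p = b - gamma * (b - a);  fp = f(p);
    end
    M = M + 1;
end
end

This simple version may make one extra function evaluation on the final iteration (the more efficient variant is the exercise mentioned above). For Example 1, [a, b, M] = golden_section(@(x) 1 - exp(-x/2).*log(x/5), 2, 10, 1) should stop with b − a = 0.72135, as in the tables below.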
q₀ = 2 + 8γ = 6.944272,
p₀ = 10 − 8γ = 5.055728.
We can then start to calculate the points, as shown in the following two
tables:
k   a_k       p_k       q_k       b_k       ℓ_k = b_k − a_k
0   2         5.05573   6.94427   10        8
1   5.05573   6.94427   8.11146   10        4.94427
2   5.05573   6.22291   6.94427   8.11146   3.05573
3   6.22291   6.94427   7.3901    8.11146   1.88855
4   6.22291   6.66874   6.94427   7.3901    1.16719
5   6.22291             6.94427             0.72135

k   f(p_k)       f(q_k)       f(p_k) < f(q_k)?
0   0.99911517   0.98980051   >
1   0.98980051   0.99161851   <
2   0.99025551   0.98980051   >
3   0.98980051   0.9902925    <
4   0.98973678   0.98980051   <
5
When N = 5, ℓ₅ = b₅ − a₅ = 0.72135 < 1 = ε, so we stop (as expected). We estimate x* by
\[
\hat{x} = \frac{a_5 + b_5}{2} = 6.58359214.
\]
The function has to be called M = N + 1 = 6 times to make the calculation.
k    a_k        p_k        q_k        b_k        f(p_k)     f(q_k)
0    −10        −5.03444   −1.96556   3          −0.30879   −0.25855
1    −10        −6.93112   −5.03444   −1.96556   −0.24577   −0.30879
2    −6.93112   −5.03444   −3.86223   −1.96556   −0.30879   −0.32234
3    −5.03444   −3.86223   −3.13777   −1.96556   −0.32234   −0.31349
4    −5.03444   −4.30998   −3.86223   −3.13777   −0.32060   −0.32234
5    −4.30998   −3.86223   −3.58551   −3.13777   −0.32234   −0.32082
6    −4.30998   −4.03326   −3.86223   −3.58551   −0.32225   −0.32234
7    −4.03326   −3.86223   −3.75653   −3.58551   −0.32234   −0.32201
8    −4.03326   −3.92756   −3.86223   −3.75653   −0.32240   −0.32234
9    −4.03326   −3.96793   −3.92756   −3.86223   −0.32238   −0.32240
10   −3.96793   −3.92756   −3.90261   −3.86223   −0.32240   −0.32239
11   −3.96793              −3.90261
The number of times the interval has been reduced is N = 11; the function f(x) has been called M = 12 times; the length of the final interval is 13γ¹¹ = 0.06532 ≤ ε. The estimate of the minimiser of f(x) using the Golden Section Search is −3.935, the midpoint of the final interval, with an error of no more than 0.0327.
k   a_k      p_k        q_k        b_k        f(p_k)          f(q_k)
0   1        4.055728   5.944272   9          0.81571997   <  0.90875065
1   1        2.888542   4.055728   5.944272   0.74974954   <  0.81571997
2   1        2.16719    2.888542   4.055728   0.73828870   <  0.74974954
3   1        1.72135    2.16719    2.888542   0.77033200   >  0.73828870
4   1.7214   2.16719    2.4427     2.888542   0.73828870   >  0.73668448
5   2.1672                         2.8885
(Diagram: the successive search intervals, with the interior points p and q marked: k = 3: [1, 4.055]; k = 4: [1, 2.8885]; k = 5: [1.72, 2.8885]; k = 6: [2.1672, 2.8885].)
Fibonacci Search
Note that
\begin{align*}
\ell_k &= b_k - a_k \\
\ell_{k+1} &= b_{k+1} - a_{k+1} = q_k - a_k \\
\ell_{k+2} &= b_{k+2} - a_{k+2} = q_{k+1} - a_{k+1} = p_k - a_k = b_k - q_k.
\end{align*}
if f(p₀) > f(p_{N−1}), x* ∈ [p_{N−1}, b_{N−1}];
if f(p₀) ≤ f(p_{N−1}), x* ∈ [a_{N−1}, p_{N−1}].
2Δ, and we can then use the relationship ℓ_k = ℓ_{k+1} + ℓ_{k+2} to derive the value of ℓ for all n (if we knew Δ and N). The situation is illustrated below:

ℓ_N = Δ:      [a_N, b_N]
ℓ_{N−1} = 2Δ: [a_{N−1}, b_{N−1}], with p_{N−1} splitting it into segments Δ, Δ
ℓ_{N−2} = 3Δ: [a_{N−2}, b_{N−2}], with p_{N−2}, q_{N−2} splitting it into Δ, Δ, Δ
ℓ_{N−3} = 5Δ: [a_{N−3}, b_{N−3}], with p_{N−3}, q_{N−3} splitting it into 2Δ, Δ, 2Δ
ℓ_{N−4} = 8Δ: [a_{N−4}, b_{N−4}], with p_{N−4}, q_{N−4} splitting it into 3Δ, 2Δ, 3Δ
You should see a familiar series arising. We can see it visually by noting that each interval is the size of the sum of the two previous intervals. We can derive it formally by taking ℓ_{N−k+1} = F_k Δ. From the diagram above F₁ = 1, F₂ = 2, F₃ = 3, F₄ = 5 and F₅ = 8, but we can continue the series for all n using ℓ_k = ℓ_{k+1} + ℓ_{k+2}.

Substituting n = N − k − 1 we get ℓ_{N−k−1} = ℓ_{N−k+1} + ℓ_{N−k}, and dividing by Δ we get
\[
F_{k+2} = F_k + F_{k+1},
\]
which results in the well-known Fibonacci sequence.
don’t have the same value of γ at each interval, but nothing about the
way we coded the Golden Section Search made the assumption that
γ was constant (if you examine Algorithm 3 you’ll see that γ isn’t even
used after the first step).
The first step of the algorithm is different from the Golden Section Search. We must first determine N, and then calculate the initial test points p₀ and q₀, but we can again use ℓ_k = F_{N−k+1}Δ to note that
\[
\gamma_0 = \frac{\Delta_1}{\Delta_0} = \frac{F_N}{F_{N+1}},
\]
where we must have already calculated F_N and F_{N+1} in the process of computing N. Once we know γ₀, we can calculate p₀ and q₀ as before as
\begin{align*}
p_0 &= b_0 - \frac{F_N}{F_{N+1}}(b_0 - a_0), \\
q_0 &= a_0 + \frac{F_N}{F_{N+1}}(b_0 - a_0).
\end{align*}
The resulting algorithm is given in Algorithm 4.
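In Matlab this initialisation might look like the following minimal sketch (variable names ours; a0, b0 and epsilon assumed given):

% build the Fibonacci numbers F_1 = 1, F_2 = 2, ... until
% F_{N+1} >= (b0 - a0)/epsilon
F = [1 2];
while F(end) < (b0 - a0) / epsilon
    F(end+1) = F(end) + F(end-1);
end
N = length(F) - 1;                    % F(end) is F_{N+1}
p0 = b0 - F(N)/F(N+1) * (b0 - a0);    % initial test points
q0 = a0 + F(N)/F(N+1) * (b0 - a0);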
The number of function evaluations required is
• 2 in the initialisation;
so we can see that the Fibonacci Search for large N is almost the same
as the Golden Section Search. The only significant difference lies in
final few iterations.
Example 6. Once again we shall solve Example 1, this time using the Fibonacci Search. Remember we are looking for the minimiser of f(x) = 1 − e^{−x/2} ln(x/5) over [2, 10], with ε = 1.
Solution. To find N, we need to find the smallest term F_{N+1} ≥ (10 − 2)/1 = 8. For N = 4 we get F_{N+1} = 8. Then
\begin{align*}
p_0 &= b_0 - \frac{F_4}{F_5}(b_0 - a_0) = 10 - \frac{5}{8} \times 8 = 5, \\
q_0 &= a_0 + \frac{F_4}{F_5}(b_0 - a_0) = 2 + \frac{5}{8} \times 8 = 7.
\end{align*}
The table of the results follows (where we have used α = 0.01 in the final step). You can see the Fibonacci sequence in the interval lengths ℓ_k.

k   a_k   p_k   q_k   b_k   f(p_k) <? f(q_k)               ℓ_k
0   2     5     7     10    1         >  0.9898394         8
1   5     7     8     10    0.9898394 <  0.9913900         5
2   5     6     7     8     0.9909200 >  0.9898394         3
3   6     7     7     8     0.9898394 =  0.9898394         2
4   p₄ − α = 6.99, f(p₄ − α) = 0.989831875 < f(p₄)         1
Example 7. Let us repeat Example 2 using the Fibonacci Search. Remember we are looking for the minimiser of f(x) = e^{x/5} sin(x/5) over [−10, 3] with ε = 0.1.
Solution. Solving F_{N+1} ≥ (3 − (−10))/ε = 130 we get N = 10 and F_{N+1} = 144.
k    a_k       p_k       q_k       b_k       f(p_k)    f(q_k)    ℓ_k
0    −10       −5.0347   −1.9653   3         −0.3088   −0.2585   13
1    −10       −6.9306   −5.0347   −1.9653   −0.2458   −0.3088   8.0347
2    −6.9306   −5.0347   −3.8611   −1.9653   −0.3088   −0.3224   4.9653
3    −5.0347   −3.8611   −3.1389   −1.9653   −0.3224   −0.3135   3.0694
4    −5.0347   −4.3125   −3.8611   −3.1389   −0.3206   −0.3224   1.8958
5    −4.3125   −3.8611   −3.5903   −3.1389   −0.3224   −0.3209   1.1736
6    −4.3125   −4.0417   −3.8611   −3.5903   −0.3223   −0.3224   0.7222
7    −4.0417   −3.8611   −3.7708   −3.5903   −0.3224   −0.3221   0.4514
8    −4.0417   −3.9514   −3.8611   −3.7708   −0.3224   −0.3224   0.2708
9    −4.0417   −3.9514   −3.9514   −3.8611   −0.3224   −0.3224   0.1806
10   p₁₀ + α = −3.95, −0.3224 < f(p₁₀)                           0.0903
k        1       2       3       ⋯   F_N
x_k      x₁      x₂      x₃      ⋯   x_{F_N}
f(x_k)   f(x₁)   f(x₂)   f(x₃)   ⋯   f(x_{F_N})
Unbounded problems
The techniques above have been designed for the case when we start
with a known bracketing interval. What do we do when the initial
interval is unknown? For instance, consider the following problem:
One way to solve the problem is to first find a set of bounds. We’d
like these to be as tight as possible, but without any pre-knowledge, it
might be hard to achieve this. We’d also like to follow the same princi-
ples as in our previous work, namely reduce the number of function
evaluations required (using reuse where possible).
We present below a simple approach to finding bounds for the
minimiser for a problem such as Problem 2.3. Note that we are still
assuming the function is unimodal.
The basic idea is to start at a point x₀, and search for a set of three points that together define a "dip" in the function, i.e., we wish to find three points (xᵢ, f(xᵢ)) such that we have a situation similar to the following:
Figure 2.4: Finding bounds using a geometric search.
The first step of the algorithm is to find the direction of the search.
Then we iterate, searching in that direction, until we find a dip in the
function. The algorithm requires an initial guess of the point of interest
x 0 . We also choose an initial step size h. A larger step size will result
in a faster search, but at the expense of a larger bracketing interval.
However, if we are to follow this by one of the algorithms above, say the
Dichotomous Search, the bracketing interval will decrease in factors of
2 and so we should not be too scared of a large initial bracket.
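To make the idea concrete, here is a minimal Matlab sketch of such a bracketing search (our own rendering of the idea, with illustrative names; it assumes f is unimodal):

function [x1, x2, x3] = find_bracket(f, x0, h)
% Search geometrically from x0 for three points with
% f(x2) < f(x1) and f(x2) < f(x3), i.e., a "dip" that brackets
% a local minimiser.
if f(x0 + h) > f(x0)
    h = -h;                          % step in the descent direction
end
x1 = x0;
x2 = x0 + h;    f2 = f(x2);
x3 = x2 + 2*h;  f3 = f(x3);
while f3 < f2                        % still going downhill
    x1 = x2;
    x2 = x3;  f2 = f3;
    h = 2*h;                         % double the step each time
    x3 = x2 + h;  f3 = f(x3);
end
% if h was negated, the points come out in decreasing order
end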
p(x) = a₀ + a₁x + a₂x²,
Quadratic Approximation

\[
\begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}
=
\begin{bmatrix} E_1 \\ E_2 \\ E_3 \end{bmatrix}. \tag{2.3}
\]
Figure 2.5: Quadratic approximation p(x) to a function f(x).
\[
a_2 = \frac{\begin{vmatrix} 1 & x_1 & E_1 \\ 1 & x_2 & E_2 \\ 1 & x_3 & E_3 \end{vmatrix}}
           {\begin{vmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \end{vmatrix}}
= -\left\{ \frac{(x_2 - x_3)E_1 + (x_3 - x_1)E_2 + (x_1 - x_2)E_3}{(x_1 - x_2)(x_2 - x_3)(x_3 - x_1)} \right\},
\]
which is > 0 when the numerator is negative, as our choice of x₁ < x₂ < x₃ makes the denominator positive. Therefore, we need
\[
(x_2 - x_3)E_1 + (x_3 - x_1)E_2 + (x_1 - x_2)E_3 < 0. \tag{2.4}
\]
Thus we must choose 3 points (xᵢ, Eᵢ), i = 1, 2, 3 (with Eᵢ = f(xᵢ) and x₁ < x₂ < x₃) satisfying (2.4) in order to approximate the minimiser of f over [a, b] by the minimiser x̂ of p.

To find x̂, however, we do not need to find all coefficients of p (a₀, a₁, a₂). All we want is x̂ = −a₁/(2a₂).
Cramer's Rule gives us
\[
a_2 = \frac{\begin{vmatrix} 1 & x_1 & E_1 \\ 1 & x_2 & E_2 \\ 1 & x_3 & E_3 \end{vmatrix}}
           {\begin{vmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \end{vmatrix}}
\quad\text{and}\quad
a_1 = \frac{-\begin{vmatrix} 1 & x_1^2 & E_1 \\ 1 & x_2^2 & E_2 \\ 1 & x_3^2 & E_3 \end{vmatrix}}
           {\begin{vmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \end{vmatrix}},
\]
and so
\[
\hat{x} = \begin{vmatrix} 1 & x_1^2 & E_1 \\ 1 & x_2^2 & E_2 \\ 1 & x_3^2 & E_3 \end{vmatrix}
\Big/ \, 2\begin{vmatrix} 1 & x_1 & E_1 \\ 1 & x_2 & E_2 \\ 1 & x_3 & E_3 \end{vmatrix}
= \frac{1}{2}\left\{ \frac{(x_2^2 - x_3^2)E_1 + (x_3^2 - x_1^2)E_2 + (x_1^2 - x_2^2)E_3}
                          {(x_2 - x_3)E_1 + (x_3 - x_1)E_2 + (x_1 - x_2)E_3} \right\}, \tag{2.5}
\]
If
\[
x_1 = x_2 - s \quad\text{and}\quad x_3 = x_2 + s,
\]
for some s > 0, then (2.5) simplifies to
\[
\hat{x} = x_2 + \frac{s}{2}\left\{ \frac{E_1 - E_3}{E_1 - 2E_2 + E_3} \right\}, \tag{2.6}
\]
and (2.4) reduces to
\[
E_1 - 2E_2 + E_3 > 0. \tag{2.7}
\]
This inequality will certainly hold if we choose our 3 points such that

\[
\hat{x} = -0.75 + \frac{1}{8}\left( \frac{e^{1} - e^{0.25}}{e^{1} - 2e^{(-0.75)^2} + e^{0.25}} \right) = -0.386.
\]
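As a quick numerical check of (2.6) in Matlab (assuming, consistently with the values above, points x₂ = −0.75, s = 0.25 and Eᵢ = e^{xᵢ²}):

x2 = -0.75;  s = 0.25;
E1 = exp((x2 - s)^2);  E2 = exp(x2^2);  E3 = exp((x2 + s)^2);
xhat = x2 + s/2 * (E1 - E3) / (E1 - 2*E2 + E3)   % gives approx -0.386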
We can use the approximation and its analysis given above for many
purposes. In the next section we will use it in constructing a search
algorithm called the DSC algorithm after its creators: Davies, Swann
and Campey (1964).
Named after its inventors (Davies, Swann and Campey) the DSC algo-
rithm aims to use the quadratic approximations we discussed above to
find minimisers.
We are still considering minimising a unimodal function, but the
other condition (that we start with a known bound for the minimiser)
can be loosened here and we can start to look for a minimiser of a
function without any known bounds. So our new problem can be
expressed as in Problem 2.3.
In practice, convergence of our approach may depend on continuity or differentiability of the function (see the discussion of Taylor's series above), but we don't need any of these to try our approach (later methods will need to calculate derivatives, and hence explicitly need them to exist).
We need to be able to work with unbounded intervals, because our new algorithm will only produce successive estimates of the minimiser. It doesn't allow us to refine our bounds at each iteration. In addition, we want three points at each step, in order to perform our approximation. It's a lot easier if these points are uniformly spaced, so we shall extend Algorithm 5 to find not just bounds, but a set of three uniformly-spaced points.
Our extended algorithm is given in Algorithm 6. It works by taking the three bracketing points x_{k−1}, x_k and x_{k+1} and supplementing them with a fourth, x̄ = (x_k + x_{k+1})/2, so that the sequence x_{k−1}, x_k, x̄ and x_{k+1} is uniformly spaced. We then restrict our attention to either (x_{k−1}, x_k, x̄), or to (x_k, x̄, x_{k+1}), depending on which three bracket the minimum, as illustrated in Figure 2.4. We need to take a little care over the direction of the search (determined by testing the first two points), and ensuring we choose three bracketing points.
Once we have a method for determining three uniformly-spaced
points that bracket the minimiser, we can use it iteratively. The ba-
sic idea is to find three points, use them to obtain an approximating
parabola, and from this derive an estimate of the minimiser. In choos-
ing the points the way we have, we guarantee that condition (2.7) is
satisfied, so we need only use (2.6) to find the minimiser. We then use
this estimate of the minimiser as the starting point for a new search
for bracketing points, but with a smaller value of h so that we hope-
fully have a tighter set of points, and hence a better estimate of the
minimiser. The resulting algorithm is given in Algorithm 7.
z₁ = 2, z₂ = 2.5, z₃ = 3

give x̂ = 2.401053944. (We do not really need this many decimal places, but I am keeping them to prevent round-off errors here.)

The interval [2, 3] has a length of 1 > ε; put x₀ = x̂ = 2.401053944, h = σh = 0.125 and begin again.
Iteration 2:

k      x_k           f(x_k)
0(1)   2.401053944   0.736320633
1(0)   2.526053944   0.737944057
2      2.276053944   0.736447531

Since f(x₀) < f(x₀ + h), we swap direction (and, incidentally, the first two points), and then f(x₁) < f(x₂). Hence x* ∈ [2.2761, 2.5261] and 2.2761, 2.4011 and 2.5261 is a set of three equally spaced points straddling x* (x̄ = x₁). We then get x̂ = 2.3476.

The interval [2.2761, 2.5261] has length 0.25 > ε; so put x₀ = x̂ = 2.3476, h = σh = 0.0625 and begin again.
Iteration 3:

k      x_k           f(x_k)
0(1)   2.347616391   0.736139442
1(0)   2.410116391   0.73638377
2      2.285116391   0.736371413

Here, f(x₀) < f(x₀ + h) and f(x₁) < f(x₂), so it's similar to the previous iteration. Thus x* ∈ [2.2851, 2.4102].

Since the length of this interval is < ε, we STOP.

Approximate x* by 2.347.
Note that at each iteration the new search interval is contained within the preceding search interval. Is this provably true? No. It is easy to find examples where the search interval "moves" along the number-line. However, it is true that the length of the search interval never grows. This is tricky and tedious to prove!
The advantage of the DSC method is that a function that can be well
approximated by a quadratic will lead to very fast convergence. How-
ever, if this is not true (say if the function were almost discontinuous),
convergence might be quite poor. A critical factor in its performance is
the choice of h and σ, but I personally don’t know how to do that well,
and even in some of the best cases I have tried, the performance of the
algorithm isn’t as good as the searches we performed above.
Extensions
Newton's Method
Assume that the objective function f : ℝ → ℝ is at least twice differentiable, i.e., f′, f″ are well defined. So the problem of interest here is of the form
\[
\hat{x} = x_k - \frac{f'(x_k)}{f''(x_k)} \qquad (f''(x_k) \neq 0). \tag{2.9}
\]
If the function were quadratic, then this would immediately find the required minimiser. As this is rare, we instead use this point as the next estimate⁴, i.e.,
\[
x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}, \tag{2.10}
\]

⁴It is important, however, to understand that the sequence of points generated by Newton's method doesn't always get closer to the minimiser at every step.
• |x_k − x_{k−1}| < ε,
• |f(x_k) − f(x_{k−1})| < ε,
• |(f(x_k) − f(x_{k−1}))/(x_k − x_{k−1})| < ε, and
• |(f(x_k + Δ) − f(x_k))/Δ| < ε for some small Δ > 0.
So
\[
x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 1 - \frac{f'(1)}{f''(1)} = 1 - \frac{-e^{-0.5}}{2e^{-0.5}} = 1.5.
\]
Continuing, we have
\[
x_2 = x_1 - \frac{f'(x_1)}{f''(x_1)} = 1.5 - \frac{f'(1.5)}{f''(1.5)} = 1.9594
\]
k   x_k
0   1
1   1.5
2   1.95945678249834
3   2.24821027937209
4   2.33848545180889
5   2.34570813278122
6   2.34575075344906
7   2.34575075492277
\[
x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}.
\]
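A minimal Matlab sketch of the iteration (our own code; df and d2f are handles to f′ and f″):

function x = newton_min(df, d2f, x, tol, maxit)
% Newton's method for minimisation: iterate (2.10) until the step
% is smaller than tol (one of the stopping criteria listed above).
for k = 1:maxit
    step = df(x) / d2f(x);
    x = x - step;
    if abs(step) < tol
        break
    end
end
end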
(Figure: Newton's method as root-finding: the tangent to g(x) at x_k cuts the x-axis at x_{k+1}.)

(Figure: f(x), f′(x), and the tangent to f′.)
Example 12 (Divergence). Take the function f(x) = −e^{−x²}. Its derivatives are
\begin{align*}
f'(x) &= 2x e^{-x^2}, \\
f''(x) &= (2 - 4x^2) e^{-x^2}.
\end{align*}
In this case
\[
\frac{f'(1)\, f'''(1)}{f''(1)^2} = -1 < 0,
\]
so the condition above is not satisfied. However, if we had started close to the minimiser (at 0), we would have converged, and very quickly.
\[
x_{k+1} - x^* = \frac{1}{2!}\, \frac{f'''(\xi)}{f''(x_k)}\, (x_k - x^*)^2 \tag{2.11}
\]
So if f‴ is continuous,
\[
\frac{f'''(\xi)}{f''(x_k)} \to \frac{f'''(x^*)}{f''(x^*)} \quad \text{(since } x_k \to x^* \text{ and } f'', f''' \text{ are continuous)}.
\]
Also if a function g is continuous on [a, b] then this implies g is bounded on (a, b), so
\[
f''' \text{ continuous and } f''(x^*) \neq 0 \;\Rightarrow\; \left| \frac{f'''(\xi)}{f''(x_k)} \right| \text{ is bounded near } x^*,
\]
i.e., |f‴(x*)/f″(x*)| < K for some K > 0. Consequently, if x₀ is sufficiently close to x*,
\[
|x_{k+1} - x^*| \le \frac{K}{2}\, |x_k - x^*|^2. \tag{†}
\]
Figure 2.7: A rough sketch of the type of convergence problems we might see with Newton's method: depending on the starting point, the iterates (on g(x) = f′(x)) may converge to a, possibly converge to c, or diverge.
Secant Method
If information about the second derivative is NOT available (and therefore we cannot use Newton's Method), we can approximate f″ using information about f′:
\[
f''(x_k) \simeq \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}.
\]
Thus replacing f″(x_k) by this in Newton's Method gives the recursive formula
\[
x_{k+1} = x_k - f'(x_k) \left\{ \frac{x_k - x_{k-1}}{f'(x_k) - f'(x_{k-1})} \right\}, \tag{2.12}
\]
for k ≥ 1, given x₀, x₁, and f′(x_k) ≠ f′(x_{k−1}).
Alternative viewpoint: find the point x_{k+1} where the secant joining (x_{k−1}, f′(x_{k−1})) to (x_k, f′(x_k)) cuts the x-axis, i.e., again, we can think of it as a method for solving g(x) = 0 where g = f′.
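A minimal Matlab sketch of the secant iteration (our code; df is a handle to f′):

function x1 = secant_min(df, x0, x1, tol, maxit)
% Secant method for minimisation: solve f'(x) = 0 using only f',
% approximating f'' by a difference quotient (equation (2.12)).
for k = 1:maxit
    d0 = df(x0);  d1 = df(x1);
    x2 = x1 - d1 * (x1 - x0) / (d1 - d0);   % secant step
    x0 = x1;  x1 = x2;
    if abs(x1 - x0) < tol
        break
    end
end
end

For instance, for Example 1, f′(x) = e^{−x/2}(½ ln(x/5) − 1/x), so one might call secant_min(@(x) exp(-x/2).*(0.5*log(x/5) - 1./x), 6, 7, 1e-8, 50).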
(Figure: (a) Newton's and Secant Methods both diverge. (b) Newton's Method converges, but the Secant method diverges.)
2. the order of convergence is p = (√5 + 1)/2 = 1.618 < 2 (as in Newton's method).
(ref: D. G. Luenberger, Introduction to Linear and Nonlinear Programming, Addison-Wesley, 1973)
There are many alternatives, such as the method of regula falsi and Brent's method. In practice, line searches are used after a search direction has been found in multi-variable problems, and practical experience seems to suggest that it is often better to allocate more computational time to iterating the optimisation algorithm for the multi-variable problem than to doing exact line searches.
§3 Unconstrained, Multi-Variable, Convex Optimisation
f : ℝⁿ → ℝ.
§3.1 Examples
Let’s consider a couple of very simple example problems to moti-
vate the work of this chapter.
Example 13. Consider optimising the location of a set of recycling
plants to minimise the total distance travelled by users from several
locations. For example: take four urban centres A, B , C and D, and
choose the location of 2 recycling plants X and Y .
That is, given the locations of the urban centres A = (a₁, a₂), B = (b₁, b₂), C = (c₁, c₂), D = (d₁, d₂), find X = (x₁, x₂), Y = (y₁, y₂) to minimise the total distance travelled by the users, or the total road length (assuming we could travel in straight lines).
(Figure: the four urban centres A, B, C, D, and the two recycling plants X and Y, with connecting roads.)
The problem is then to find the locations of X and Y that solve:
\begin{align*}
\min f(x_1, x_2, y_1, y_2) = {} & \sqrt{(a_1 - x_1)^2 + (a_2 - x_2)^2} + \sqrt{(b_1 - x_1)^2 + (b_2 - x_2)^2} \\
& + \sqrt{(c_1 - y_1)^2 + (c_2 - y_2)^2} + \sqrt{(d_1 - y_1)^2 + (d_2 - y_2)^2} \\
& + \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}
\end{align*}
In this case there are four variables (the x and y positions of X and
Y ) and so f : R4 → R. This is a very simple operations research problem
where we seek to optimise the operations of some organisation. OR (as
it is called) is one of the major consumers of optimisation algorithms
and there are very many more such problems of interest.
Example 14. Another classic optimisation problem occurs in statistics. Often, we wish to fit a curve of some type (in linear regression it's just a linear function of the variables) to a set of data while minimising the error in the fit. If we imagine a set of data (xᵢ, yᵢ) for i = 1, 2, ..., m, and for the sake of argument we seek to fit a degree n − 1 polynomial p(x) = a₀ + a₁x + a₂x² + ⋯ + a_{n−1}x^{n−1} to these, then we would aim to find coefficients a₀, ..., a_{n−1} such that we minimise the sum of squares of the errors
\[
f(a_0, \ldots, a_{n-1}) = \sum_{i=1}^{m} \bigl( y_i - p(x_i) \bigr)^2.
\]
That is, we seek to minimise a function f : ℝⁿ → ℝ. More importantly, the function we want to minimise is a quadratic, and this type of function is both common and highly tractable, so we shall devote some attention specifically to these (and call them quadratic programs).
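In Matlab such a fit is a one-liner with the backslash operator; a minimal sketch (data and names illustrative):

x = (0:0.5:5)';                    % sample data
y = 2 - x + 0.5*x.^2 + 0.1*randn(size(x));
n = 3;                             % fit a degree n-1 = 2 polynomial
V = x.^(0:n-1);                    % Vandermonde matrix [1, x, x.^2]
a = V \ y;                         % minimises sum((y - V*a).^2)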
Many statistical problems can be written as optimisations, and consequently we can regard statistics as almost a branch of optimisation, though the problems aren't often this simple. Signal and image processing often use similar ideas to estimate or infer properties of their signals.
§3.2 Background
Let us first begin with a quick refresher of notation and results that
you should have seen in previous courses.
Example 15. If
\[
H = \begin{pmatrix} 2 & 4 \\ 4 & 16 \end{pmatrix},
\]
then
\[
\mathbf{z}^T H \mathbf{z} = 2z_1^2 + 8z_1 z_2 + 16z_2^2 = 2\left[ (z_1 + 2z_2)^2 + 4z_2^2 \right] \ge 0
\]
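Such a condition is also easy to check numerically; e.g., in Matlab:

H = [2 4; 4 16];
eig(H)    % both eigenvalues are positive, so z'*H*z >= 0 for all z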
Quadratic Functions
We saw in one of our examples that we often seek to optimise quadratic
objective functions. These are a natural generalisation of the 1D-
quadratic functions we have used already in this course.
A quadratic function can always be written in the form:
\[
q(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{b}^T \mathbf{x} + c, \quad (\mathbf{x} \in \mathbb{R}^n), \tag{3.1}
\]

\begin{align}
\mathbf{p} &= -A^{-1}\mathbf{b}, \tag{3.3} \\
d &= c - \frac{1}{2}\mathbf{p}^T A \mathbf{p}. \tag{3.4}
\end{align}
Note that if A is positive definite and symmetric, then det A ≠ 0 and so A⁻¹ must exist, and we can always calculate p for such a quadratic.
Then, because A is positive definite, we can see immediately that the
minimiser is x∗ = p and f (x∗ ) = d . In fact, if we were able to invert the
matrix efficiently, this would be a fine way to find the minimiser, but
we will see that this is not always the fastest or most robust approach,
and it doesn’t work for the more general class of convex functions.
Example 17. Consider the following quadratic function:
\begin{align*}
f(x_1, x_2, x_3) &= \frac{3}{2}x_1^2 + 2x_2^2 + \frac{3}{2}x_3^2 + x_1 x_3 + 2x_2 x_3 - 3x_1 - x_3 \\
&= \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{b}^T \mathbf{x}
\quad\text{where}\quad
A = \begin{pmatrix} 3 & 0 & 1 \\ 0 & 4 & 2 \\ 1 & 2 & 3 \end{pmatrix},
\quad
\mathbf{b} = \begin{pmatrix} -3 \\ 0 \\ -1 \end{pmatrix}.
\end{align*}
\begin{align}
q(\mathbf{x}) &= \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{x}^T \mathbf{b} + c, \tag{3.9} \\
\nabla q(\mathbf{x}) &= A\mathbf{x} + \mathbf{b}, \tag{3.10} \\
H_q &= A. \tag{3.11}
\end{align}
Taylor Series
We can naturally extend Taylor series to functions of multiple variables through simple extension. Consider the function
\[
g(t) = f(\mathbf{x} + t\mathbf{u}),
\]
– a function of t alone! Then
\[
g(t) = f(x_1 + t u_1, x_2 + t u_2, \ldots, x_n + t u_n).
\]
So
\begin{align*}
g'(t) &= \frac{d}{dt} f(\mathbf{x} + t\mathbf{u}) \\
&= u_1 D_1 f(\mathbf{x} + t\mathbf{u}) + u_2 D_2 f(\mathbf{x} + t\mathbf{u}) + \cdots + u_n D_n f(\mathbf{x} + t\mathbf{u}) \\
&= \mathbf{u}^T \nabla f(\mathbf{x} + t\mathbf{u}).
\end{align*}
Similarly,
\begin{align*}
\frac{d^2 g}{dt^2} &= \frac{d}{dt}\left( u_1 D_1 f + u_2 D_2 f + \cdots + u_n D_n f \right) \\
&= \sum_{j=1}^{n} \frac{d}{dt}\left\{ u_j D_j f(\mathbf{x} + t\mathbf{u}) \right\} \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} u_i u_j D_{ij} f(\mathbf{x} + t\mathbf{u}) \\
&= \mathbf{u}^T H_f(\mathbf{x} + t\mathbf{u})\, \mathbf{u},
\end{align*}
where H = H_f(x + tu) is the Hessian matrix of f at the point (x + tu), as defined in (1.1).
Recall the Taylor Series expansion of a function g(t) about c = 0:
\[
g(t) = g(0) + t g'(0) + t^2 \frac{g''(0)}{2} + \cdots .
\]
Taking t = 1 we get
\begin{align*}
g(1) &= f(\mathbf{x} + \mathbf{u}) \\
g(0) &= f(\mathbf{x}) \\
g'(0) &= \mathbf{u}^T \nabla f(\mathbf{x}) \\
g''(0) &= \mathbf{u}^T H_f(\mathbf{x})\,\mathbf{u}
\end{align*}
which gives us
\[
f(\mathbf{x} + \mathbf{u}) = f(\mathbf{x}) + \mathbf{u}^T \nabla f(\mathbf{x}) + \frac{1}{2}\mathbf{u}^T H_f(\mathbf{x})\, \mathbf{u} + O(\|\mathbf{u}\|^3), \tag{3.12}
\]
which is the Taylor Series for functions of several variables. The corresponding "Remainder Form" is
\[
f(\mathbf{x} + \mathbf{u}) = f(\mathbf{x}) + \mathbf{u}^T \nabla f(\mathbf{x}) + \frac{1}{2}\mathbf{u}^T H_f(\mathbf{z})\,\mathbf{u},
\]
where z = x + θu, for some 0 < θ < 1.
Convex Functions
In multiple dimensions, we won’t ask if our functions are unimodal
(as we often won’t know in advance), but we can often restrict our
attention to convex functions. We know in 1D, that a convex function
is one such that the function will lie below a chord drawn across the
function, as shown in Figure 3.1, and we generalise this in Defn 3.2.
Figure 3.1: The defining property of a convex function: the chord value αf(x) + (1−α)f(y) lies above the function value f(αx + (1−α)y).
Note.
3. If f is linear then
(Figure: (a) a convex function f(x, y) = 6.3x⁴ + 6y²; (b) a non-convex function (the "3-hump camel function").)
Proof. We'll prove the second case, the first being simply a matter of replacing ≥ with > in the appropriate places.

(i) ⇒ Given: f is convex on C.
Taylor's Theorem tells us
\[
f(\mathbf{x} + s\mathbf{z}) = f(\mathbf{x}) + s\nabla f(\mathbf{x})^T \mathbf{z} + \frac{s^2}{2!}\mathbf{z}^T H_f(\mathbf{x})\mathbf{z} + O(s^3)
\]
\[
\therefore\quad \mathbf{z}^T H_f(\mathbf{x})\mathbf{z} = \lim_{s \to 0} \frac{2}{s^2}\left[ f(\mathbf{x} + s\mathbf{z}) - f(\mathbf{x}) - s\nabla f(\mathbf{x})^T \mathbf{z} + O(s^3) \right]
\]
But f is convex on C.
So since lim_{s→0} O(s³)/s² = 0, we have
\[
\mathbf{z}^T H_f(\mathbf{x})\mathbf{z} \ge 0,
\quad\text{since it} = \lim_{s \to 0}\left\{ \frac{2}{s^2}(\text{something} \ge 0) + (\text{term} \to 0) \right\}.
\]
\[
g(t) = f(\mathbf{y} + t(\mathbf{x} - \mathbf{y})), \qquad t \in [0, 1].
\]
Then
\[
g''(t) = (\mathbf{x} - \mathbf{y})^T H_f(\mathbf{y} + t(\mathbf{x} - \mathbf{y}))(\mathbf{x} - \mathbf{y}) \ge 0
\]
by (3.13) (with x replaced by y + t(x − y) ∈ C, and z replaced by x − y).
But from calculus of one variable, for t ∈ [0, 1],
\[
g''(t) \ge 0 \;\Rightarrow\; g \text{ is convex on } [0, 1].
\]
Figure 3.3: A convex function g(t), showing the chord from (0, g(0)) to (1, g(1)), the slope g(1) − g(0), and the tangent slope g′(0).
The result can be derived from the definition of the derivative, i.e.,
\[
g'(0) = \lim_{h \to 0^+} \frac{g(h + 0) - g(0)}{h},
\]
and noting that from convexity
(ii) Given:
Let
\[
\mathbf{z} = \alpha\mathbf{x} + (1 - \alpha)\mathbf{y}, \qquad \alpha \in [0, 1]. \tag{3.15}
\]
Then applying (3.14) to x, z and z, y gives
\begin{align*}
f(\mathbf{z}) + (\mathbf{x} - \mathbf{z})^T \nabla f(\mathbf{z}) &\le f(\mathbf{x}) \\
f(\mathbf{z}) + (\mathbf{y} - \mathbf{z})^T \nabla f(\mathbf{z}) &\le f(\mathbf{y})
\end{align*}
Substitute z from (3.15) to give
\begin{align}
f(\mathbf{z}) + (1 - \alpha)(\mathbf{x} - \mathbf{y})^T \nabla f(\mathbf{z}) &\le f(\mathbf{x}), \tag{3.16} \\
f(\mathbf{z}) + \alpha(\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{z}) &\le f(\mathbf{y}). \tag{3.17}
\end{align}
Since α and 1 − α ≥ 0, we can take α × (3.16) + (1 − α) × (3.17) to yield
\[
f(\mathbf{z}) + \alpha(1 - \alpha)\nabla f(\mathbf{z})^T(\mathbf{x} - \mathbf{y} + \mathbf{y} - \mathbf{x}) \le \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}),
\]
and cancelling terms and substituting z from (3.15) gives
\[
f(\alpha\mathbf{x} + (1 - \alpha)\mathbf{y}) \le \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}) \qquad \forall\, \mathbf{x}, \mathbf{y} \in C.
\]
Therefore f is convex on C.
The geometric interpretation of Theorem 3.2 is that f lies “above”
any tangent hyperplane to the graph, just as a convex function of one
variable g (t ) lies above any tangent line.
(Figure: the graph w = f(x) lies above the tangent hyperplane at (x, f(x)), whose normal involves ∇f(x); the point (y, d) on the tangent plane sits below (y, f(y)).)
Proof. ∀x ∈ C,
\[
f(\mathbf{x}) \ge f(\mathbf{x}^*) + \underbrace{(\mathbf{x} - \mathbf{x}^*)^T \nabla f(\mathbf{x}^*)}_{=0} \quad\text{by Theorem 3.2,}
\]
so f(x) ≥ f(x*), since ∇f(x*) = 0 at a stationary point x*. Therefore x* is a global minimiser of f over C.
\[
(\mathbf{y} - \mathbf{x}^*)^T \nabla f(\mathbf{x}^*) \ge 0, \qquad \forall\, \mathbf{y} \in C.
\]
Proof.
i) ⇐ i.e., if (y − x*)ᵀ∇f(x*) ≥ 0, ∀y ∈ C.
Then ∀y ∈ C,
Note.
Descent Directions
Our first task is to choose descent directions. The following lemma is useful for considering this choice (note it's really just the Chain Rule again).

Lemma 1.
\[
\lim_{\lambda \to 0^+} \frac{f(\mathbf{x} + \lambda\mathbf{u}) - f(\mathbf{x})}{\lambda} = \mathbf{u}^T \nabla f(\mathbf{x}).
\]
Proof. By Taylor's Theorem,
\[
\lim_{\lambda \to 0^+} \frac{f(\mathbf{x} + \lambda\mathbf{u}) - f(\mathbf{x})}{\lambda} \le 0.
\]
The converse, however, is not necessarily true as we could have f(x + λu) = f(x) for all 0 < λ < ε. However, if we have
\[
\lim_{\lambda \to 0^+} \frac{f(\mathbf{x} + \lambda\mathbf{u}) - f(\mathbf{x})}{\lambda} < 0
\]
then f(x + λu) − f(x) < 0 for sufficiently small λ > 0 and so u is a descent direction.
Therefore, by applying Lemma 1, if uᵀ∇f(x) < 0 then u is a descent direction at x, for any differentiable function f. As long as ∇f(x) ≠ 0 we can always find such a direction; for instance, take u = −∇f(x). We can visualise the result by considering the level curves of our objective function, an example of which is shown in Figure 3.4 for illustration. In this figure, ∇f(x) is orthogonal to the level curve f(x) = c at x. If we choose any vector pointing to the North-West of the tangent (i.e., the angle between u and −∇f(x) is in [0, π/2)), then we will be heading downwards (at least for sufficiently small λ).
Figure 3.4: The level curve f(x) = c of a function, with the tangent at x and the steepest descent direction −∇f(x).
Steepest Descent
We have seen that u_k = −∇f(x_k) is the direction of steepest descent from x_k, so methods based on using this choice of u_k are called steepest descent methods, or sometimes gradient methods. The basic algorithm is expressed in Algorithm 10.
The method will ideally converge to a minimum of the function. We
know that the method will always take steps that decrease the objective
function, and if this has a well-defined minimum (e.g., it is convex),
then we know that the objective function is bounded below. Hence, the
(Figure: one step of steepest descent on z = f(x): from x_k on the level curve f(x_k) = c, step −λ_k∇f(x_k) to reach x_{k+1}.)
Using one of the 1-D search methods from earlier (e.g., Newton's method), we find g′(λ) = 0 when λ₀ = 0.0700224. So x₁ = (0.439803, 0.159704).
Iteration 2. We have x₁ = (0.439803, 0.159704). Then u₁ = −∇f(x₁) = (1.329194, −0.885946).
k   x_k                            u_k                           λ_k
0   (1, 1)                         (−8, −12)                     0.07002
1   (0.439803, 0.159704)           (1.329194, −0.885946)         0.31615
2   (0.860039646, −0.120395801)    (0.79585148, 1.194006764)     0.11774
3   (0.953747974, 0.020193719)     (0.341629584, −0.227726217)   0.1166
4   (0.993592233, −0.00636599)     (0.050448349, 0.075741299)    0.0938
5   (0.99832681, 0.000742331)      (0.013347542, −0.008888118)   0.1086
6   (0.999776887, −0.000223274)    (0.001783907, 0.00267849)
\[
\mathbf{u}_k = -\nabla f(\mathbf{x}_k) = -A(\mathbf{x}_k - \mathbf{p}).
\]
So
\begin{align*}
g_k'(\lambda) = \frac{dg}{d\lambda} &= \mathbf{u}_k^T \nabla f(\mathbf{x}_k + \lambda\mathbf{u}_k) \\
&= \mathbf{u}_k^T \left\{ A(\mathbf{x}_k + \lambda\mathbf{u}_k - \mathbf{p}) \right\} \\
&= \lambda\, \mathbf{u}_k^T A \mathbf{u}_k + \mathbf{u}_k^T \underbrace{A(\mathbf{x}_k - \mathbf{p})}_{-\mathbf{u}_k},
\end{align*}
and setting g_k′(λ) = 0 gives the exact step
\[
\lambda_k = \frac{\mathbf{u}_k^T \mathbf{u}_k}{\mathbf{u}_k^T A \mathbf{u}_k} = \frac{\|\mathbf{u}_k\|^2}{\mathbf{u}_k^T A \mathbf{u}_k}.
\]
\[
\mathbf{x}_{k+1} = \mathbf{x}_k + \lambda_k \mathbf{u}_k,
\]
\begin{align*}
f(\mathbf{x}_{k+1}) &= \frac{1}{2}(\mathbf{x}_k + \lambda_k\mathbf{u}_k - \mathbf{p})^T A (\mathbf{x}_k + \lambda_k\mathbf{u}_k - \mathbf{p}) + d \\
&= f(\mathbf{x}_k) - \lambda_k \|\mathbf{u}_k\|^2 + \frac{1}{2}\lambda_k^2\, \mathbf{u}_k^T A \mathbf{u}_k \\
&= f(\mathbf{x}_k) - \frac{1}{2}\lambda_k \|\mathbf{u}_k\|^2 \\
&< f(\mathbf{x}_k) \quad \text{since } \lambda_k > 0.
\end{align*}
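For the quadratic case this is simple enough to code directly; a minimal Matlab sketch (our own, with illustrative names):

function [x, k] = steepest_descent_quad(A, b, x, tol)
% Steepest descent for f(x) = (1/2)x'Ax + b'x (A symmetric positive
% definite) with the exact line search lambda = ||u||^2/(u'Au).
k = 0;
u = -(A*x + b);                   % u_k = -grad f(x_k)
while norm(u) > tol
    lambda = (u'*u) / (u'*A*u);   % exact step along u_k
    x = x + lambda * u;
    u = -(A*x + b);
    k = k + 1;
end
end

For example, [x, k] = steepest_descent_quad(diag([2/5, 2]), [0; 0], [5; 1], 1e-6) reproduces the zig-zagging iterates of example (ii) below.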
Examples:
(i) f(x) = x₁² + x₂², x = (x₁, x₂).
Here A = 2I, so u₀ = −2x₀ and λ₀ = 1/2, giving
x₁ = x₀ + λ₀u₀ = x₀ − x₀ = 0 and ∇f(x₁) = 0.
x₁ is the minimum, so we have reached x* in one step.
Level curves are circles: x₁² + x₂² = k (> 0), and the steepest descent direction at any x₀ points straight at 0.
(ii) f(x₁, x₂) = x₁²/5 + x₂². Obviously, the minimum is at (0, 0).
In this case, convergence is very slow as we get the narrow valley
effect mentioned earlier.
Here,
\[
A = \begin{pmatrix} \frac{2}{5} & 0 \\ 0 & 2 \end{pmatrix},
\quad\text{so}\quad
\nabla f(\mathbf{x}) = A\mathbf{x} = \begin{pmatrix} 2x_1/5 \\ 2x_2 \end{pmatrix}.
\]

(a) Starting at, say, x₀ = (5, 1)ᵀ ⇒ u₀ = −∇f(x₀) = −2(1, 1)ᵀ, with x₀ on the level curve x₁²/5 + x₂² = 6.
Line search along (5 − 2λ, 1 − 2λ), λ ≥ 0.
Using the formula from before, we have
\[
\lambda_0 = \frac{\mathbf{u}_0^T \mathbf{u}_0}{\mathbf{u}_0^T A \mathbf{u}_0}
= \frac{(-2)^2 \cdot 2}{(-2)^2 \cdot 2 \cdot \left(\frac{1}{5} + 1\right)} = \frac{5}{6}.
\]
Alternatively, use
\[
g(\lambda) = \frac{1}{5}(5 - 2\lambda)^2 + (1 - 2\lambda)^2
\]
\[
\therefore\quad g'(\lambda) = -\frac{4}{5}(5 - 2\lambda) - 4(1 - 2\lambda)
\]
\[
\Rightarrow\quad -4 + \frac{8}{5}\lambda - 4 + 8\lambda = 0 \quad\text{so}\quad \lambda = \frac{5}{6}.
\]
So the new point is
\[
\mathbf{x}_1 = \left( 5 - \frac{5}{3},\; 1 - \frac{5}{3} \right)^T = \left( \frac{10}{3},\; -\frac{2}{3} \right)^T
\]
which lies on the level curve x₁²/5 + x₂² = 8/3.
(b) The new direction is
\[
\mathbf{u}_1 = -\nabla f(\mathbf{x}_1) = -2\begin{pmatrix} 2/3 \\ -2/3 \end{pmatrix} = -\frac{4}{3}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = k\begin{pmatrix} -1 \\ 1 \end{pmatrix} \quad (k > 0).
\]
\[
\frac{x_1^2}{5} + x_2^2 = \frac{32}{27}.
\]
(c) The new direction is
\[
\mathbf{u}_2 = -\nabla f(\mathbf{x}_2) = -2\begin{pmatrix} 4/9 \\ 4/9 \end{pmatrix} = k\begin{pmatrix} -1 \\ -1 \end{pmatrix}.
\]
\[
\frac{x_1^2}{5} + x_2^2 = \frac{128}{243}.
\]
(Figure: level curves for the steepest descent example, f(x, y) = x²/5 + y², starting at (5, 1); the iterates (5, 1), (10/3, −2/3), … zig-zag between the level curves x²/5 + y² = 6 and x²/5 + y² = 8/3 towards the origin.)
When r is very large, our numerical technique may even fail due to
numerical errors.
To determine the rate of convergence at the k-th step, consider
\[
R = \frac{f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*)}{f(\mathbf{x}_k) - f(\mathbf{x}^*)}.
\]
\[
R = \frac{f(\mathbf{x}_{k+1})}{f(\mathbf{x}_k)} \ge 0, \quad\text{since } f(\mathbf{x}) \ge 0.
\]
Recall
\[
f(\mathbf{x}_{k+1}) = f(\mathbf{x}_k) - \frac{1}{2}\lambda_k\|\mathbf{u}_k\|^2 = f(\mathbf{x}_k) - \frac{\|\mathbf{u}_k\|^4}{2\,\mathbf{u}_k^T A \mathbf{u}_k}.
\]
Therefore
\[
R = \frac{f(\mathbf{x}_{k+1})}{f(\mathbf{x}_k)} = 1 - \frac{1}{2}\,\frac{\lambda_k \|\mathbf{u}_k\|^2}{f(\mathbf{x}_k)}.
\]
But we know that λ_k = ‖u_k‖²/(u_kᵀAu_k), so what do we know about f(x_k)?

u_k = −A(x_k − p), therefore x_k − p = −A⁻¹u_k (since A⁻¹ exists when A is positive definite). So
\begin{align*}
f(\mathbf{x}_k) &= \frac{1}{2}(-A^{-1}\mathbf{u}_k)^T A (-A^{-1}\mathbf{u}_k) \\
&= \frac{1}{2}\mathbf{u}_k^T (A^{-1})^T A A^{-1} \mathbf{u}_k = \frac{1}{2}\mathbf{u}_k^T A^{-1} \mathbf{u}_k,
\end{align*}
since (A⁻¹)ᵀ = A⁻¹ because A is symmetric. Hence
\[
\frac{f(\mathbf{x}_{k+1})}{f(\mathbf{x}_k)} = 1 - \frac{\|\mathbf{u}_k\|^4}{(\mathbf{u}_k^T A \mathbf{u}_k)(\mathbf{u}_k^T A^{-1} \mathbf{u}_k)}.
\]
and
\[
\|\mathbf{u}_k\|^4 = (\mathbf{u}_k^T \mathbf{u}_k)^2 = (\mathbf{u}_k^T P^T P \mathbf{u}_k)^2 = (\mathbf{y}^T \mathbf{y})^2 = \left( \sum_i y_i^2 \right)^2. \tag{3.20}
\]
So
\[
0 \le R = \frac{f(\mathbf{x}_{k+1})}{f(\mathbf{x}_k)} \le 1 - \frac{4\alpha_1\alpha_n}{(\alpha_1 + \alpha_n)^2} = \left( \frac{\alpha_1 - \alpha_n}{\alpha_1 + \alpha_n} \right)^2 = \left( \frac{1 - r}{r + 1} \right)^2,
\]

⁴A unitary matrix is a square matrix such that U*U = UU* = I, where * denotes the conjugate transpose of the matrix. For a real matrix U* = Uᵀ.
i.e.,
\[
\frac{f(\mathbf{x}_{k+1})}{f(\mathbf{x}_k)} \le \left( \frac{1 - r}{r + 1} \right)^2,
\]
for f(x) = ½(x − p)ᵀA(x − p), (≥ 0).

Thus if r = 1 (i.e., αₙ/α₁ = 1), f(x₁)/f(x₀) ≤ 0 for any x₀ (≠ p). Hence f(x₁) = 0, i.e., the minimum is reached in one step.
If r ≠ 1, then the recursive relation gives
\[
f(\mathbf{x}_k) \le \left( \frac{1 - r}{r + 1} \right)^{2k} f(\mathbf{x}_0),
\]

\[
f(\mathbf{x}_k) \le \left( \frac{1/5 - 1}{1/5 + 1} \right)^{2k} f(\mathbf{x}_0) = \left( \frac{2}{3} \right)^{2k} f(\mathbf{x}_0) = 6\left( \frac{2}{3} \right)^{2k}.
\]
Now f(x₁) = 8/3 = 6(2/3)², f(x₂) = 32/27 = 6(2/3)⁴, and similarly f(x₃) = 128/243 = 6(2/3)⁶.
• Reliable, even when the starting point is far from the minimum
point.
• Convergence can be extremely slow, especially near the mini-
mum point because the decrease in f value is proportional to
||∇ f (x)||2 , which approaches 0.
Conjugate Directions
Proof.
\[
\mathbf{u}_1^T A \mathbf{u}_2 = \mathbf{u}_1^T \lambda_2 \mathbf{u}_2 = \lambda_2 \mathbf{u}_1^T \mathbf{u}_2 = 0,
\]
since u₁ and u₂ are orthogonal.
In fact we could use any orthonormal basis for ℝⁿ, but this is the easiest to start with. Then we can use this to construct a set of conjugate vectors as follows.
\begin{align*}
\mathbf{u}_1 &= \mathbf{e}_1 \\
\mathbf{u}_{k+1} &= \mathbf{e}_{k+1} - \sum_{i=1}^{k} \left( \frac{\mathbf{e}_{k+1}^T A \mathbf{u}_i}{\mathbf{u}_i^T A \mathbf{u}_i} \right) \mathbf{u}_i, \qquad k = 1, \ldots, n - 1.
\end{align*}
\[
\mathbf{u}_i^T A \mathbf{u}_j = 0, \qquad \forall\, i \neq j.
\]

\[
\alpha_j (\mathbf{u}_j^T A \mathbf{u}_j) = 0,
\]
i.e.,
\[
\alpha_j = \frac{\mathbf{u}_j^T A \mathbf{v}}{\mathbf{u}_j^T A \mathbf{u}_j},
\]
and the result follows.
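The construction above is easy to carry out numerically; a minimal Matlab sketch (our code, using the matrix A from Example 17):

A = [3 0 1; 0 4 2; 1 2 3];       % symmetric positive definite
n = size(A, 1);
U = zeros(n);                    % columns will hold u_1, ..., u_n
U(:, 1) = [1; zeros(n-1, 1)];    % u_1 = e_1
for k = 1:n-1
    e = zeros(n, 1);  e(k+1) = 1;    % e_{k+1}
    u = e;
    for i = 1:k                      % subtract the A-projections
        u = u - (e'*A*U(:,i)) / (U(:,i)'*A*U(:,i)) * U(:,i);
    end
    U(:, k+1) = u;
end
U' * A * U    % off-diagonal entries are (numerically) zero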
We could now construct an algorithm for minimising quadratic
objective functions. The basic idea would be to construct a set of
conjugate directions, and to use these as our search directions. The
idea is elegant, and provides several guarantees (e.g., that the method
will converge in at most n steps), but it has two main problems:
• calculating the conjugate directions in advance is wasteful – we
can do better, and
from (3.22). Now, we have assumed that the vectors u_j for j ≤ k are mutually conjugate with respect to A, so all of the terms in the summation are zero except the j-th, and λ*_j = −u_jᵀg_j/(u_jᵀAu_j), so

Now, by definition, g_j = Ax_j + b, so
\[
= \mathbf{u}_j^T A (\mathbf{x}_0 - \mathbf{x}_j)
= \mathbf{u}_j^T A \sum_{i=0}^{j-1} \lambda_i^* \mathbf{u}_i, \tag{3.25}
\]
again using (3.22). The mutual conjugacy of the u_j implies that this summation is zero.
\[
\mathbf{u}_{k+1}^T A \mathbf{u}_k = -\mathbf{g}_{k+1}^T A \mathbf{u}_k + \frac{\mathbf{g}_{k+1}^T A \mathbf{u}_k}{\mathbf{u}_k^T A \mathbf{u}_k}\, \mathbf{u}_k^T A \mathbf{u}_k = 0. \tag{3.27}
\]
So, we note that u_{k+1} and u_k are mutually conjugate for all k, and in particular for k = 0, i.e., the first two vectors are mutually conjugate with respect to A.

Now, assume u₀, u₁, ..., u_k are mutually conjugate with respect to A. We know (from the equation above) that this is true for k = 1, and aim to show that, given the assumption for k, it is also true for k + 1. Given the mutual conjugacy of these vectors, Lemma 4 implies that g_{k+1}ᵀu_j = 0, for all j = 0, 1, ..., k. We also note that from the algorithm
\begin{align*}
f(x_1, x_2, x_3) &= \frac{3}{2}x_1^2 + 2x_2^2 + \frac{3}{2}x_3^2 + x_1 x_3 + 2x_2 x_3 - 3x_1 - x_3 \\
&= \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{b}^T \mathbf{x}
\quad\text{where}\quad
A = \begin{pmatrix} 3 & 0 & 1 \\ 0 & 4 & 2 \\ 1 & 2 & 3 \end{pmatrix},
\quad
\mathbf{b} = \begin{pmatrix} -3 \\ 0 \\ -1 \end{pmatrix},
\end{align*}
starting at x₀ = 0.
First, ∇f(x) = Ax + b, so

k = 0:
\begin{align*}
\mathbf{g}_0 &= \nabla f(\mathbf{x}_0) = (-3, 0, -1)^T, \\
\mathbf{u}_0 &= -\mathbf{g}_0 = (3, 0, 1)^T, \\
\lambda_0^* &= -\frac{\mathbf{g}_0^T \mathbf{u}_0}{\mathbf{u}_0^T A \mathbf{u}_0} = \frac{5}{18}, \\
\mathbf{x}_1 &= \mathbf{x}_0 + \lambda_0^* \mathbf{u}_0 = \left( \tfrac{5}{6}, 0, \tfrac{5}{18} \right)^T.
\end{align*}

k = 1:
\begin{align*}
\mathbf{g}_1 &= \nabla f(\mathbf{x}_1) = (-2/9, 5/9, 2/3)^T, \\
\beta_0 &= \frac{\mathbf{g}_1^T A \mathbf{u}_0}{\mathbf{u}_0^T A \mathbf{u}_0} = \frac{13}{162}, \\
\mathbf{u}_1 &= -\mathbf{g}_1 + \beta_0 \mathbf{u}_0 = \tfrac{1}{162}(75, -90, -95)^T, \\
\lambda_1^* &= -\frac{\mathbf{g}_1^T \mathbf{u}_1}{\mathbf{u}_1^T A \mathbf{u}_1} = \frac{117}{535}, \\
\mathbf{x}_2 &= \mathbf{x}_1 + \lambda_1^* \mathbf{u}_1 = (0.9346, -0.1215, 0.1495)^T.
\end{align*}

k = 2:
Note that even though we made a foray into quite messy num-
bers, x3 = x∗ exactly, after exactly 3 steps – as we should expect for the
Conjugate Gradient Method.
If A is positive semi-definite (rather than positive definite) then

f(x₁, x₂) = x₁² + x₂ is unbounded, with f(0, −n) = −n → −∞ as n → ∞;
∇f(x) = (2x₁, 1)ᵀ ≠ 0 for any x; and
\[
A = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}
\]
is positive semidefinite.
(iv) This means that even if A is singular, we can still use the method
– a distinct advantage over Newton’s method.
\[
f(\mathbf{x}) \simeq q(\mathbf{x}) = \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^T A (\mathbf{x} - \mathbf{x}_0) + \mathbf{g}^T(\mathbf{x} - \mathbf{x}_0) + c, \tag{3.29}
\]
where A = H f (x0 ) and g = ∇ f (x0 ). For a quadratic function the Hessian
H = A is constant at each step. This is not true of a general nonlinear
function and hence the Hessian must be re-evaluated at each step. This
can be very inefficient computationally, so any efficient implementa-
tion of a conjugate gradient method should eliminate the need to find
a new Hessian evaluation at each step. Note that in the Conjugate
Gradient Method, the Hessian A appears only in computation of λ∗k
and βk . So:
\[
\beta_k = \frac{\|\nabla f(\mathbf{x}_{k+1})\|^2}{\|\nabla f(\mathbf{x}_k)\|^2}.
\]
Proof. As earlier,
\[
A\mathbf{u}_k = \frac{1}{\lambda_k^*}(\mathbf{g}_{k+1} - \mathbf{g}_k).
\]
Then,
\[
\mathbf{g}_{k+1}^T A \mathbf{u}_k = \frac{1}{\lambda_k^*}\left( \mathbf{g}_{k+1}^T \mathbf{g}_{k+1} - \mathbf{g}_{k+1}^T \mathbf{g}_k \right)
= \frac{1}{\lambda_k^*}\|\mathbf{g}_{k+1}\|^2,
\]
since g_{k+1}ᵀg_k = 0 from earlier, and
\begin{align*}
\mathbf{u}_k^T A \mathbf{u}_k &= \mathbf{u}_k^T \frac{1}{\lambda_k^*}(\mathbf{g}_{k+1} - \mathbf{g}_k) \\
&= \frac{1}{\lambda_k^*}\left( \mathbf{u}_k^T \mathbf{g}_{k+1} - \mathbf{u}_k^T \mathbf{g}_k \right) \\
&= \frac{1}{\lambda_k^*}\left( 0 - [-\mathbf{g}_k + \beta_{k-1}\mathbf{u}_{k-1}]^T \mathbf{g}_k \right) \\
&= \frac{1}{\lambda_k^*}\|\mathbf{g}_k\|^2, \quad\text{since } \mathbf{u}_j^T \mathbf{g}_\ell = 0 \ (j < \ell).
\end{align*}
So
\[
\beta_k = \frac{\mathbf{g}_{k+1}^T A \mathbf{u}_k}{\mathbf{u}_k^T A \mathbf{u}_k} = \frac{\|\mathbf{g}_{k+1}\|^2}{\|\mathbf{g}_k\|^2}.
\]
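Putting the pieces together, here is a minimal Matlab sketch of the Conjugate Gradient Method for a quadratic (our own code, using the Fletcher-Reeves form of β_k just derived):

function x = conjugate_gradient(A, b, x)
% Conjugate Gradient Method for f(x) = (1/2)x'Ax + b'x, with A
% symmetric positive definite; converges in at most n steps.
g = A*x + b;                         % g_0 = grad f(x_0)
u = -g;                              % u_0 = -g_0
for k = 1:length(b)
    lambda = -(g'*u) / (u'*A*u);     % exact line search
    x = x + lambda * u;
    gnew = A*x + b;
    beta = (gnew'*gnew) / (g'*g);    % ||g_{k+1}||^2 / ||g_k||^2
    u = -gnew + beta * u;
    g = gnew;
end
end

Calling conjugate_gradient([3 0 1; 0 4 2; 1 2 3], [-3; 0; -1], zeros(3,1)) should reproduce x₃ = x* from the example above.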
Notes
(i) If f is convex, the algorithm is good for finding its minimum (if
it exists). If f is not convex, then the scheme will converge to a
solution of ∇ f (x) = 0, so we should ensure we start near x∗ and
assume f is convex near x∗ .
\[
\|\mathbf{x}_{(k+1)n} - \mathbf{x}^*\| \le c\, \|\mathbf{x}_{kn} - \mathbf{x}^*\|^2, \qquad k = 0, 1, 2, \ldots,
\]
(iv) If f is not quadratic, the conjugacy of the uk might get lost. Rea-
sons: the line search to determine λ∗ might be inaccurate; the
nature of the non-quadratic terms in f .
An alternative formula for β_k which seeks to overcome these problems is the Polak-Ribiere form:
\[
\beta_{k-1} = \frac{\nabla f(\mathbf{x}_k)^T \left( \nabla f(\mathbf{x}_k) - \nabla f(\mathbf{x}_{k-1}) \right)}{\|\nabla f(\mathbf{x}_{k-1})\|^2}
= \frac{\mathbf{g}_k^T(\mathbf{g}_k - \mathbf{g}_{k-1})}{\|\mathbf{g}_{k-1}\|^2}
\]
\[
B_0 = I, \qquad B_{k+1} = B_k - \frac{B_k \mathbf{s}_k \mathbf{s}_k^T B_k}{\mathbf{s}_k^T B_k \mathbf{s}_k} + \frac{\mathbf{y}_k \mathbf{y}_k^T}{\mathbf{y}_k^T \mathbf{s}_k}
\]
§4 Constrained Convex Optimisation
f : C → ℝ.
§4.1 Examples
Let’s consider a couple of very simple example problems to moti-
vate the work of this chapter.
Example 23. Imagine trying to find the deepest point of a deep, murky lake. We can't see the bottom, and let's assume sonar depth sounders haven't been invented, but we do have a boat, and it has a sounding line¹. We could just measure "every" point of the lake to find the deepest, but that would be a lot of work. The problem is to find the deepest point efficiently.
Obviously not all lakes are convex, but that may be a reasonable
approximation in enough cases to make it worthwhile (and we’ll talk
about non-convex problems a little later).
¹A sounding line is a long cable with a weight on the end (called a lead even if it's not made of lead), and markers at even intervals. You throw the weighted end into the water, and count the number of markers until it hits the bottom. The traditional markers were 6 feet apart, or 1 fathom. We get a lot of terms in English from nautical sources, and sounding is a source for some. For instance, deep six comes from leadsmen calling out “by the mark” or “by the deep” followed by a depth, e.g., six. Actually, the terminology is slightly more complicated, but you get the idea. Samuel Clemens (1835–1910) took his pseudonym Mark Twain from the riverboat leadsman's jargon “by the mark twain”, meaning two fathoms.
\[
X = \begin{pmatrix}
x_{AA} & x_{AB} & x_{AC} & \cdots \\
x_{BA} & x_{BB} & x_{BC} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\]
Figure 4.1: An example of a traffic matrix.
Example 24. One of the problems in my work has been to infer a network traffic matrix from inadequate link traffic measurements. The traffic matrix (a matrix of the traffic from point A to point B, e.g., see Figure 4.1) is needed in operational IP networks for use in optimising the design and engineering of the network, but obtaining it is difficult. Here we use an optimisation technique to obtain the traffic matrix (which is then used in subsequent optimisation).

Currently, it is relatively easy to gather information on link traffic using a tool called SNMP (the Simple Network Management Protocol). SNMP is unique in that it is supported by essentially every device in an IP network. The key result we use is that the traffic on each link is related to the traffic matrix by the linear equations
y = Ax,   (4.1)
where x = (x_1, x_2, . . . , x_M)^T is the traffic matrix (written as a column vector), y = (y_1, y_2, . . . , y_L)^T represents the link loads, and A = {A_{ir}} is the L × M routing matrix, where
\[
A_{ir} = \begin{cases} F_{ir}, & \text{if traffic for } r \text{ traverses link } i, \\ 0, & \text{otherwise,} \end{cases} \tag{4.2}
\]
where F_{ir} is the fraction of traffic from source/destination pair r that traverses link i.
This equation must be inverted to find the traffic matrix from the link loads, but L ≪ M, so the equation is highly under-constrained, and so allows many solutions. So we must have a method which finds a “best” solution from all of the possibilities.

One approach is to start with a good guess (a prior) and try to refine it, and this approach has been called network tomography. We came up with a technique that AT&T (one of the world's biggest network operators) now uses. We start with a prior from a gravity model, and then refine it. Call the prior x_g; then the method works (approximately) by solving the optimisation problem
min ‖x − x_g‖²,
s.t. Ax − y = 0,
     x ≥ 0.
Figure 4.2: An illustration of the least-squares solution. The point shows the gravity model solution, and the dashed line shows the subspace specified by a single constraint equation. The least-squares solution is simply the point satisfying the equation which is closest to the gravity model solution. The weighted least-squares solution gives different weight to different unknowns.
Real instances of this problem are quite large (at least compared to the test cases we have looked at in class). We even have a patent on the idea:
U.S. Patent 7,574,506, “Traffic Matrix Estimation Method and Apparatus”, N.G. Duffield, A.G. Greenberg, J.G. Klincewicz, M. Roughan, and Y. Zhang, 2009.
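The following is a toy sketch of the idea (emphatically not the patented method itself); the tiny A, y and x_g are made-up illustrative data, and SciPy's SLSQP solver stands in for the real algorithm:

import numpy as np
from scipy.optimize import minimize

# Refine a gravity-model prior xg subject to link constraints A x = y, x >= 0.
A = np.array([[1.0, 1.0, 0.0],      # L = 2 links, M = 3 routes, so L << M
              [0.0, 1.0, 1.0]])
xg = np.array([2.0, 1.0, 2.0])      # gravity-model prior
y = np.array([3.0, 4.0])            # observed link loads

res = minimize(lambda x: np.sum((x - xg) ** 2),
               x0=xg,
               method="SLSQP",
               bounds=[(0, None)] * 3,
               constraints=[{"type": "eq", "fun": lambda x: A @ x - y}])
print(res.x)                        # the feasible traffic matrix closest to the prior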
§4.2 Constraints

A convex program is defined on a convex set. Recall from Maths 1 that a set C is convex if any line segment between two points in the set remains inside the set, i.e., for all x, y ∈ C and α ∈ [0, 1],
x + α(y − x) = (1 − α)x + αy ∈ C.
Some examples of convex and nonconvex sets are shown in Figure 4.3.
Figure 4.3: Convex and non-convex sets.
If the g_i are convex functions, then the set
Ω = {x ∈ R^n | g_i(x) ≤ 0, for i = 1, 2, . . . , m}
is convex. Note also that:
• The empty set and the whole vector-space (in our case R^n) are convex.
Our standard problem (Problem 4.2) is to minimise a convex function f : Ω → R subject to x ∈ Ω, where
Ω = {x ∈ R^n | g_i(x) ≤ 0, for i = 1, 2, . . . , m}.
As examples, consider the sets defined by the following constraints.

First,
x_2 − x_1 − 1 ≤ 0,
x_1²/2 − x_2 − 2 ≤ 0.

Second,
x_1 − 7 ≤ 0,
−x_1 + x_2²/4 + 4 ≤ 0.
Note that here g_2(x) is a convex function of x_1 and x_2, but the boundary is not the graph of a convex function if we draw x_2 as a function of x_1, i.e., we have to be careful of our definition here.

Third,
−x_1 + 3x_2 ≤ 0,
x_1 + x_2²/4 − 4 ≤ 0.

We should be able to see that in the first two cases, a straight line between two points of the set stays within the set, whereas in the third, it could pass outside. Notice also that the third set is unbounded, but this is OK. We don't require the set of interest to be bounded (see the previous chapter, for instance).
We refer to the interior of a set Ω as int(Ω) and its boundary as ∂Ω. We'll assume that in general our constraints define a non-empty set, i.e., Ω ≠ ∅. If this is not true, we call the problem infeasible, and then optimisation is futile unless we relax at least one constraint.
Given a point x ∈ Ω (as defined for Problem 4.2) we define the set I(x) by
I(x) = {i | g_i(x) = 0}.   (4.3)
In simple terms, I(x) tells us which constraints are tight, i.e., which constraints “bite”. If I(x) = ∅, then x ∈ int(Ω); otherwise x ∈ ∂Ω.
min f(x)
s.t. g_i(x) ≤ 0, i = 1, 2, . . . , m,
     ℓ_j(x) = 0, j = 1, 2, . . . , p,
where f and g_i are as in Problem 4.2 and the ℓ_j(·) are all linear. That is,
ℓ_j(x) = c_j^T x − d_j, j = 1, 2, . . . , p,
for some set of vectors c_j, and the ℓ_j(·) are obviously convex functions, but the constraints are expressed as equalities, not as inequalities. The standard trick to deal with this is to replace each equality in Problem 4.4 with two inequalities, i.e.,
\[
\ell_j(\mathbf{x}) = 0 \quad\Leftrightarrow\quad
\begin{cases} \ell_j(\mathbf{x}) \le 0, \\ -\ell_j(\mathbf{x}) \le 0. \end{cases}
\]
We can do this because, for a linear function, ℓ_j and −ℓ_j are both convex functions, and the resulting pair of inequalities is equivalent to the original equality.
We can transform from inequalities to equalities where required by the inclusion of slack variables. When we can include non-negativity constraints, inclusion of slack variables is easy, i.e., we simply replace an inequality such as g_i(x) ≤ 0 with g_i(x) + s_i = 0, where s_i ≥ 0. If non-negativity constraints must be avoided, we can still perform a similar trick using the square of the variable: g_i(x) + s_i² = 0.
While we are considering linear constraints, we should also mention Farkas' Lemma, which will be important below:

Lemma 6 (Farkas' Lemma). Let A be a real-valued m × n matrix. The system
Ax = b, x ≥ 0,   (4.4)
has a solution iff
b^T u ≥ 0 for every u such that A^T u ≥ 0.   (4.5)

Proof. Consider the primal–dual pair of linear programs
(P): max z = 0^T x s.t. Ax = b, x ≥ 0,
(D): min w = b^T u s.t. A^T u ≥ 0 (u unrestricted in sign).
Recall that (P) has an optimal solution iff (D) has an optimal solution, and if (P) and (D) have feasible solutions, then w ≥ z.
(⇒) If (4.4) has a solution, then a feasible solution of the primal problem (P) exists. All such solutions are optimal, because the objective z is identically zero, i.e., any x satisfying Ax = b, x ≥ 0 is a maximiser (with z = 0 by definition). Hence (D) also has an optimal solution, and for any feasible u of (D), weak duality gives
b^T u = w ≥ z = 0.
Therefore, any u satisfying A^T u ≥ 0 satisfies b^T u ≥ 0, and hence (4.5) holds.

(⇐) The dual problem (D) always has feasible solutions, since u = 0 satisfies the constraints of (D). So if (4.5) holds, then (D) has an optimal solution, since w is bounded below (specifically, w = b^T u ≥ 0). Therefore (P) has an optimal solution, and so (P) has a feasible solution. In other words, (4.4) holds.
Interpreting the first statement (4.4): writing the columns of A as a_1, . . . , a_n, the equation Ax = b says that
b = x_1 a_1 + x_2 a_2 + · · · + x_n a_n.
If we insist that the coefficients x_i ≥ 0, then the set of all such linear combinations is a convex cone² of the vectors {a_1, a_2, . . . , a_n}. In 2D and 3D we can visualise this — Example 28 and Example 29 give a couple of examples. Note that these convex cones don't really look like “cones”.

Example 28. The convex cone of the vectors a_1 = (1, 0)^T and a_2 = (1, 1)^T is the region shown in Figure 4.4a. It's the unbounded shaded triangular shape.

²A convex cone is defined to be a subset of a vector space that is closed under linear combinations with positive coefficients.
Example 29. The convex cone of the vectors a_1 = (1, 0, 0)^T, a_2 = (1, 1, 1)^T and a_3 = (1, 0, 1)^T is the region shown in Figure 4.4b. We can see that it isn't actually a “cone”, but really an (unbounded) polyhedral shape.
Figure 4.4: Two convex cones: (a) the 2D cone of Example 28; (b) the 3D cone of Example 29.
Interpreting the second statement (4.5), note that a_i^T u ≥ 0 implies that the angle between u and a_i is at most 90°, and b^T u < 0 implies that the angle between u and b is more than 90°. If we were to take the hyperplane normal to u, then it would therefore separate the vectors a_i from b.

Figure 4.5: An illustration of Farkas' Lemma. (a) A point b = 0.6a_1 + 0.7a_2 inside the convex cone of {a_1, a_2}. (b) A point b = (1/5, −1/2)^T outside the convex cone, and u = (1/2, 1)^T.
Consider now the problem of minimising f(x) subject to equality constraints
g_i(x) = 0, i = 1, 2, . . . , m < n.
Example 31. Find the rectangle of maximum area, subject to a fixed perimeter, as illustrated in Figure 4.6.

Figure 4.6: We wish to maximise the area of the rectangle, subject to a fixed perimeter constraint. (a) The rectangle, with sides x_1 and x_2. (b) The function being maximised, also showing the restriction implied by the perimeter constraint.

We maximise the area f(x) = x_1 x_2 subject to the (scaled) perimeter constraint g(x) = x_1 + x_2 − 1 = 0, by forming
h(x, λ) = x_1 x_2 + λ(x_1 + x_2 − 1).
Setting
∂h/∂x_1 = ∂h/∂x_2 = ∂h/∂λ = 0,
we get
∂h/∂x_1 = x_2 + λ = 0,
∂h/∂x_2 = x_1 + λ = 0,
∂h/∂λ = x_1 + x_2 − 1 = 0,
which has solution x_1 = x_2 = 1/2, λ = −1/2, as illustrated in Figure 4.6b. It is noteworthy that the third equation (derived from the partial derivative with respect to the Lagrange multiplier) is simply the original constraint.
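A quick check of this example — a sketch using SymPy (not part of the original notes):

import sympy as sp

# Maximise x1*x2 subject to x1 + x2 - 1 = 0 via the function h above.
x1, x2, lam = sp.symbols("x1 x2 lam")
h = x1*x2 + lam*(x1 + x2 - 1)
# Setting all partials of h to zero reproduces the constraint as well.
print(sp.solve([sp.diff(h, v) for v in (x1, x2, lam)], [x1, x2, lam], dict=True))
# [{x1: 1/2, x2: 1/2, lam: -1/2}]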
Example 32. Find the rectangle of largest area inscribed in a circle
with diameter 1 as illustrated in Figure 4.7.
Figure 4.7: We wish to maximise the area of the rectangle, subject to it fitting inside a circle with diameter 1. Note that the diagonal of the rectangle is just the diameter of the circle.
Here we maximise f(x) = x_1 x_2 subject to g(x) = x_1² + x_2² − 1 = 0 (the diagonal of the rectangle equals the diameter), so we form
h = x_1 x_2 + λ(x_1² + x_2² − 1).
Setting ∂h/∂x_1 = x_2 + 2λx_1 = 0 and ∂h/∂x_2 = x_1 + 2λx_2 = 0, and eliminating x_2, gives
x_1(1 − 4λ²) = 0,
so λ = ±1/2, and (taking x_1, x_2 > 0) the solution is the square with x_1 = x_2 = 1/√2.
The same approach handles a rectangle inscribed in an ellipse: maximise xy subject to x²/a² + y²/b² = 1, with
h = xy + λ(x²/a² + y²/b² − 1).
The stationarity conditions are y + 2λx/a² = 0 and x + 2λy/b² = 0, and eliminating x gives
y(1 − 4λ²/(a²b²)) = 0.
So λ = ±ab/2. To satisfy x, y > 0, λ = −ab/2, and hence x = (a/b)y. To satisfy the constraint,
x = a/√2, y = b/√2.
In general,
∂h/∂λ_i = g_i(x),
so in setting these to zero, we just recover our old constraint g_i(x) = 0. However, it's doing something more than just this. To understand, let's consider the case of a single constraint: e.g., minimise f(x) subject to g(x) = 0. The new objective function is
h(x) = f(x) + λg(x),
and setting its gradient with respect to x to zero gives ∇f(x) + λ∇g(x) = 0, or more concisely
∇f = −λ∇g.   (4.14)
The intuition of this can be seen in Figure 4.8.
Figure 4.8: An illustration of the effect of a Lagrange multiplier using f(x) = 2x_1² + 2x_1x_2 + x_2² − 10x_1 − 10x_2 and constraint g(x) = x_1² + x_2² − 5 = 0. We can see that (4.14) is enforced here with λ = 1. Intuitively we can see why — the minimum occurs just at the point where the level curves of f(x) (the orange ellipses) just touch the feasible region curve g(x) = 0 (shown in blue). The critical level curve is f(x) = 20 (shown in green). When they just touch, their tangents must be the same (as they are continuously differentiable) and hence the normals defined by the gradients must be aligned.
Now the critical trick is to realise that if we enforce the constraint, then for any feasible point
g(x + δx) = 0,
including the minimiser itself, g(x) = 0, and so from (4.16) we get (for small δx) that
δx^T ∇g(x) ≃ 0.
Then from (4.14) we know that δx^T ∇f(x) ≃ 0, and hence for small enough δx we get
f(x + δx) = f(x) + O(‖δx‖²).   (4.17)
Hence we can see that f(x) is a local minimum, given the constraint.
Note also that if we had a constraint g(x) = d, then the constant is dropped when taking derivatives w.r.t. x_i, and so such constants are only needed when checking the constraint itself, and can sometimes just be left out of the function h(·).
Intuitively, you should realise by now that (4.14) expresses something quite important. Remember from the previous chapter that ∇g(x) is a direction at right angles to the level curve g(x) = 0, and that −∇f(x) is the direction of steepest descent of f(·), so the equality enforced by (4.14) says that at the minimiser, the normal to the constraint curve points along the direction of steepest descent of f(x) at that point. In other words, we require that the directional derivatives of f(·) in any direction along the constraint curve are zero (but not in any arbitrary direction).
There exist λ_i ≥ 0 such that
\[
\nabla f(\mathbf{x}^*) + \sum_{i=1}^m \lambda_i \nabla g_i(\mathbf{x}^*) = 0, \tag{4.18}
\]
where
λ_i g_i(x∗) = 0, for all i = 1, 2, . . . , m.   (4.19)
Condition (4.19) (complementary slackness) means that λ_i = 0 for any inactive constraint, so the gradient condition can be written over the active set alone:
\[
\nabla f(\mathbf{x}^*) + \sum_{i \in I(\mathbf{x}^*)} \lambda_i \nabla g_i(\mathbf{x}^*) = 0. \tag{4.20}
\]
Equivalently, if we write
\[
h(\mathbf{x}) = f(\mathbf{x}) + \sum_{i \in I(\mathbf{x}^*)} \lambda_i g_i(\mathbf{x}),
\]
then (4.20) says ∇h(x∗) = 0, i.e.,³
\[
\nabla f(\mathbf{x}^*) = -\sum_{i \in I(\mathbf{x}^*)} \lambda_i \nabla g_i(\mathbf{x}^*),
\]
for λ_i ≥ 0. In other words, ∇f(x∗) is in the convex cone of the −∇g_i(x∗) for i ∈ I(x∗). The condition is illustrated in 2D in Figure 4.9.

³If I(x∗) = ∅ then we are not on any boundary, and the KKT condition just reverts to the familiar condition for an unconstrained problem, i.e., ∇f(x∗) = 0.
Figure 4.9: An illustration of the KKT conditions: at a global minimum, ∇f(x∗) must lie in the convex cone of the vectors −∇g_i(x∗), i ∈ I(x∗), with λ_i ≥ 0. The shaded region indicates the feasible set. We know that −∇g_i(x∗) must point into this set because the constraints are g_i(x∗) ≤ 0. The green shaded sub-region indicates the convex cone of viable directions for ∇f.
We will prove this result below, but that will take some work, and a few examples should help with our intuition; before then, let us also define a feasible direction.
For the first of our example constraint sets, the minimum occurs at the origin, where I(x) = ∅, and so the KKT conditions simply require that ∇f(x) = 0, which indeed it does.
Next consider the set defined by
g_1(x) = x_1 − 7 ≤ 0,
g_2(x) = −x_1 + x_2²/4 + 4 ≤ 0.
The concentric rings show level curves of f(x) = x_1² + x_2², and the minimum is obtained where these rings first touch the feasible set. Only the second constraint is active there, so the KKT conditions require
∇f + λ_2 ∇g_2 = 0.
At x∗ = (4, 0)^T we have ∇f(x∗) = (8, 0)^T and ∇g_2(x∗) = (−1, x_2∗/2)^T = (−1, 0)^T, so we choose λ_1 = 0 and λ_2 = 8 to satisfy the KKT conditions.
Finally, consider the set defined by
g_1(x) = −x_1 − 3x_2 ≤ 0,
g_2(x) = −x_1 + 3x_2 ≤ 0,
g_3(x) = x_1 + x_2²/4 − 4 ≤ 0.
The concentric rings show level curves of f(x) = x_1² + x_2². The constraints are non-convex, but we can still see that the minimum is obtained where these rings first touch the feasible set. The solution, therefore, is at the points x∗ ≃ (3.63, ±1.21)^T, where I(x∗) = {1, 3} or I(x∗) = {2, 3}. Taking the former (so that x_1∗ = −3x_2∗), we get ∇f(x∗) = 2(x_1∗, x_2∗)^T = 2(−3x_2∗, x_2∗)^T, and the condition ∇f + λ_1∇g_1 + λ_3∇g_3 = 0 gives
−6x_2∗ − λ_1 + λ_3 = 0,
2x_2∗ − 3λ_1 + x_2∗λ_3/2 = 0.
where
λ_i g_i(x∗) = 0, for all i = 1, 2, . . . , m.   (4.19)
Proof. (⇐) (Sufficiency): i.e., assume that there exists a point x∗, and a set of λ_i, that satisfy (4.18) and (4.19). The objective function f is convex on Ω, and x∗ ∈ Ω. By Theorem 3.2 we know that
f(x) ≥ f(x∗) + (x − x∗)^T ∇f(x∗),   (4.21)
for all x ∈ Ω. The KKT condition (4.18) substituted into (4.21) implies
\[
f(\mathbf{x}) \ge f(\mathbf{x}^*) + (\mathbf{x} - \mathbf{x}^*)^T \Bigl( -\sum_{i=1}^m \lambda_i \nabla g_i(\mathbf{x}^*) \Bigr)
= f(\mathbf{x}^*) - \sum_{i=1}^m \lambda_i (\mathbf{x} - \mathbf{x}^*)^T \nabla g_i(\mathbf{x}^*). \tag{4.22}
\]
Proof. (⇒) (Necessity): i.e., assume that f(x∗) is the global minimum over Ω. From Theorem 3.3 we know that
(y − x∗)^T ∇f(x∗) ≥ 0, ∀y ∈ Ω.   (4.24)
Now consider the definition of a feasible search direction, Defn 4.2. For any such direction d we can choose y = x∗ + εd ∈ Ω for some ε > 0, and likewise for any point y ∈ Ω we can choose a search direction d = y − x∗, and so (4.24) is equivalent to saying
d^T ∇f(x∗) ≥ 0.   (4.25)
Example 38. Minimise f(x) = 2x_1² + 2x_1x_2 + x_2² − 10x_1 − 10x_2
\[
\text{s.t.}\quad
\begin{cases} x_1^2 + x_2^2 \le 5 \\ 3x_1 + x_2 \le 6 \end{cases}
\;\Rightarrow\;
\begin{cases} g_1(\mathbf{x}) = x_1^2 + x_2^2 - 5 \le 0 \\ g_2(\mathbf{x}) = 3x_1 + x_2 - 6 \le 0. \end{cases}
\]
Then
\[
\nabla f(\mathbf{x}) = \begin{pmatrix} 4x_1 + 2x_2 - 10 \\ 2x_1 + 2x_2 - 10 \end{pmatrix},
\quad
\nabla g_1(\mathbf{x}) = \begin{pmatrix} 2x_1 \\ 2x_2 \end{pmatrix},
\quad
\nabla g_2(\mathbf{x}) = \begin{pmatrix} 3 \\ 1 \end{pmatrix},
\]
and
\[
H_f = \begin{pmatrix} 4 & 2 \\ 2 & 2 \end{pmatrix},
\]
which is positive definite, so f is convex. We need to check all choices for I(x). Here, there are 2 constraints, and so I(x) is one of ∅, {1}, {2} or {1, 2}.
1. I(x) = ∅: then all λ_i = 0, and we need ∇f(x) = 0, giving the unconstrained minimum (0, 5).
2. I(x) = {1}: then λ_2 = 0 and g_1(x) = 0, giving x_1 = 1, x_2 = 2 and λ_1 = 1. All the Kuhn–Tucker conditions are satisfied, and so we can conclude that the minimum is x∗ = (1, 2), but let's check the others for completeness.
3. I(x) = {2}: then λ_1 = 0 and g_2(x) = 0, which requires us to solve
4x_1 + 2x_2 − 10 + 3λ_2 = 0,
2x_1 + 2x_2 − 10 + λ_2 = 0,
3x_1 + x_2 − 6 = 0,
giving x = (2/5, 24/5), which turns out to be infeasible (it violates g_1 ≤ 0).
4. I(x) = {1, 2}: then we solve g_1(x) = g_2(x) = 0 together with the gradient condition; the resulting point is feasible, but has λ_1 < 0 (see below).
So we are left with case (2), where I(x) = {1} and x∗ = (1, 2), as our final solution, as illustrated in Figure 4.10. From (3.3) we know that the unconstrained optimum of the problem is at
\[
\mathbf{p} = -A^{-1}\mathbf{b} = (0, 5)^T \quad\text{here, because}\quad
A = \begin{pmatrix} 4 & 2 \\ 2 & 2 \end{pmatrix}
\quad\text{and}\quad
\mathbf{b} = -\begin{pmatrix} 10 \\ 10 \end{pmatrix}.
\]
We can see that A is positive definite, so the function is convex, and the level curves are ellipses (with centre at p). The figure also shows these elliptical level curves of f(·), as well as the potential solution points for each set of constraints. We can easily see the first and third solutions are infeasible, and that the second solution is better than the fourth. Interestingly, the fourth solution is feasible, but results in a negative Lagrange multiplier λ_1, which is not allowed, ruling this solution out. The reason lies in the fact that ∇f is not in the convex cone of −∇g_1 and −∇g_2, as illustrated in Figure 4.11. The third figure, Figure 4.12, shows a closer view yet, illustrating that there is a viable search direction that is also a descent direction from the point x with I(x) = {1, 2}.
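The case-by-case search above is mechanical enough to automate. The following sketch (using SymPy; not from the notes, and all names are illustrative) enumerates the four active sets of Example 38, solving the stationarity and active-constraint equations, and checking feasibility and multiplier signs:

import sympy as sp

x1, x2, l1, l2 = sp.symbols("x1 x2 l1 l2", real=True)
f = 2*x1**2 + 2*x1*x2 + x2**2 - 10*x1 - 10*x2
g = [x1**2 + x2**2 - 5, 3*x1 + x2 - 6]
lam = [l1, l2]

for I in [[], [0], [1], [0, 1]]:
    L = f + sum(lam[i]*g[i] for i in I)            # Lagrangian over active set I
    eqs = [sp.diff(L, x1), sp.diff(L, x2)] + [g[i] for i in I]
    for s in sp.solve(eqs, [x1, x2] + [lam[i] for i in I], dict=True):
        feasible = all(gi.subs(s) <= 0 for gi in g)
        dual_ok = all(s.get(lam[i], 0) >= 0 for i in I)
        print(I, s, feasible and dual_ok)          # True only for I = [0], at (1, 2)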
Figure 4.10: An illustration of Example 38. The green region is the feasible set Ω. The orange lines are the elliptical level curves of f(x). The four cases of solutions are illustrated by dots.
Figure 4.11: A closeup of Figure 4.10 showing the gradients, and the convex cone of {−∇g_1, −∇g_2} (in green) at the point x where I(x) = {1, 2}. As ∇f(x) is outside this cone, this can't be an optimal point.
Figure 4.12: A closeup of Figure 4.11, with the green region showing feasible descent directions from the point where I(x) = {1, 2}. As there are feasible descent directions, we can't be at the minimum.
Comment: to handle linear equality constraints we can apply the KKT conditions to the doubled-up problem
min f(x)
s.t. g_i(x) ≤ 0, i = 1, 2, . . . , m,
     ℓ_j(x) ≤ 0, j = 1, 2, . . . , p,
     −ℓ_j(x) ≤ 0, j = 1, 2, . . . , p,
where f and g_i are as in Problem 4.2 and the ℓ_j(·) are all linear. The two copies of each linear constraint receive multipliers μ_j^+ ≥ 0 and μ_j^− ≥ 0, which enter the gradient condition only through the combination
μ_j = μ_j^+ − μ_j^−,
but now the variables μ_j have no restriction on sign. Note that we also require, as before in (4.19), that
λ_i g_i(x∗) = 0, for all i = 1, 2, . . . , m,   (4.30)
but the linear constraints are automatically tight (as defined in the problem), and so we need no equivalent condition for the μ_j.
As noted, there are other ways to extend the theory, even to non-convex problems, provided they satisfy some other type of regularity condition. For instance, one alternative is to insist that the gradients of the active constraints are linearly independent at the minimiser.
Proof. Omitted.

Consider the quadratic programming problem (Problem 4.8) of minimising q(x) = ½x^T Ax + b^T x subject to
Ex ≤ f,
Gx = h,
i.e., with inequality constraints g_i(x) = e_i^T x − f_i ≤ 0 and equality constraints h_j(x) = g_j^T x − h_j = 0, where the e_i^T are the rows of E and the g_j^T are the rows of G. The gradients we need are
∇q(x) = Ax + b,
∇g_i(x) = e_i,
∇h_j(x) = g_j.
When there are no inequality constraints (E = 0) we don't need to determine which constraints are tight (they all are), and so the solution is straightforward. We simply rewrite the equations (4.31) (with E = 0) in combination with the constraint Gx = h in the form
\[
\begin{pmatrix} A & G^T \\ G & 0 \end{pmatrix}
\begin{pmatrix} \mathbf{x} \\ \boldsymbol{\mu} \end{pmatrix}
=
\begin{pmatrix} -\mathbf{b} \\ \mathbf{h} \end{pmatrix}. \tag{4.32}
\]
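Solving (4.32) is a single linear solve. A sketch in NumPy (the data are borrowed from case 3 of Example 38, treating its linear constraint as an equality; not from the notes):

import numpy as np

# min 0.5 x^T A x + b^T x  s.t.  G x = h, via the KKT system (4.32).
A = np.array([[4.0, 2.0], [2.0, 2.0]])
b = np.array([-10.0, -10.0])
G = np.array([[3.0, 1.0]])          # one equality constraint: 3x1 + x2 = 6
h = np.array([6.0])

n, p = len(b), len(h)
KKT = np.block([[A, G.T], [G, np.zeros((p, p))]])
sol = np.linalg.solve(KKT, np.concatenate([-b, h]))
x, mu = sol[:n], sol[n:]
print(x, mu)                        # x = (0.4, 4.8), multiplier mu = -0.4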
Now consider the special case of linear inequality constraints:
min f(x)
s.t. Ax ≤ b,
where the rows of A are the vectors c_i^T, so that the constraints are g_i(x) = c_i^T x − b_i ≤ 0. Given x ∈ Ω, define
S(x) = {u | u^T c_i = 0, ∀i ∈ I(x)}.
The set S(x) defines the set of directions from x such that we stay on the boundary ∂Ω of the feasible region. That is, if we move an infinitesimal distance in the direction u, the same set of constraints I(x) that are currently active will remain so. The result is easy to prove, as the constraints are linear, so
g_i(x + εu) = g_i(x) + εu^T c_i.
The first term on the right-hand side is zero by the definition of I(x), and the second is zero by the definition of S(x) (for small enough ε such that we don't run into another boundary). Thus, any u ∈ S(x), u ≠ 0, is a feasible direction at x.
If we can find u ≠ 0, u ∈ S(x), such that
u^T ∇f(x) < 0,
then u is both a feasible and a descent direction at x, and so our approach seeks to find such directions. We can then move along the boundary of the region until we get to the solution⁵.
We have to be a bit careful, though, because it's easy to get to a point (a vertex) where S(x) = {0}, and we need some way to detect this (efficiently) and get away from such points.

⁵Remember, if the minimum is not on the boundary, the solution is simply found via unconstrained optimisation techniques.
The idea is to decompose the steepest-descent direction −∇f(x) into a component u ∈ S(x) and a remainder q orthogonal to it:
u + q = −∇f(x),   (4.33)
as illustrated in Figure 4.13. We call such a u the orthogonal projection of −∇f onto S.

Figure 4.13: Decomposition of −∇f(x) into u + q, where u and q are chosen to be orthogonal, i.e., u^T q = 0.

Provided u ≠ 0, we have
−u^T ∇f(x) = u^T(u + q) = u^T u + u^T q = ‖u‖² > 0,   (4.34)
because u and q were chosen to be orthogonal. Thus u^T ∇f(x) < 0, and hence u is a feasible descent direction.
Thus a basic algorithm would be:
1. At x ∈ Ω, compute u as in (4.33).
2. Perform a line search: minimise f(x + λu) over λ > 0 such that x + λu ∈ Ω, and move to the minimising point.
3. Repeat until no further progress can be made.
Figure 4.14: Orthogonal projection of the vector y into the space S(x) using the projection matrix P_S.
If we look at the problem from “side on”, as in Figure 4.15, we can see that one way to think about the question is to take a set of expanding hyperspheres around the point y until the spheres are just large enough to touch S: the touching point is then the closest point in S to y, namely P_S y.

Figure 4.15: A side-on view of the orthogonal projection problem, showing the distance ‖y − P_S y‖.
Writing ŷ = P_S y for the projected point, we seek the ŷ ∈ S = {u | Mu = 0} that minimises
\[
\|\hat{\mathbf{y}} - \mathbf{y}\|^2 = \sum_{i=1}^n (\hat{y}_i - y_i)^2 = \sum_{i=1}^n (\hat{y}_i^2 - 2\hat{y}_i y_i + y_i^2) \tag{4.36}
\]
\[
= \hat{\mathbf{y}}^T\hat{\mathbf{y}} - 2\mathbf{y}^T\hat{\mathbf{y}} + \mathbf{y}^T\mathbf{y}, \tag{4.37}
\]
which is a quadratic program in ŷ,
and so we take
A = 2I,   (4.38)
b = −2y,   (4.39)
and the constraint matrix G from Problem 4.8 is just G = M, and there are no inequality constraints, so E = 0. We find its solution using the KKT conditions, which in this case are given in (4.31):
Ax + b + G^T μ = 2ŷ − 2y + M^T μ = 0.
Applying the constraint Mŷ = 0 gives μ = 2(MM^T)^{−1}My, and so
ŷ = y − ½M^T μ = (I − M^T(MM^T)^{−1}M) y = P_S y,   (4.40)
which defines the projection matrix P_S.
Theorem 4.3 explains how to find an orthogonal projection, so now we apply it to our original problem of finding the orthogonal projection of −∇f(x), and we get
u = −P_S ∇f(x) = −(I − M^T(MM^T)^{−1}M) ∇f(x),
where M is the matrix with rows c_i^T, ∀i ∈ I(x), i.e., whose rows are the transposed gradient vectors of the active constraints at x. Given the conditions specified above, u is guaranteed to be a descent direction, and a feasible search direction (one that preserves the same set of active constraints I(x)).
Example 39. Minimise f(x) = x_1² + x_2² + x_3² + x_4² − 2x_1 − 3x_4 such that
2x_1 + x_2 + x_3 + 4x_4 ≤ 7,   (4.41)
x_1 + x_2 + 2x_3 + x_4 ≤ 6,   (4.42)
x_i ≥ 0.   (4.43)
The gradients of the first two constraints are
c_1 = ∇g_1(x) = (2, 1, 1, 4)^T,   (4.44)
c_2 = ∇g_2(x) = (1, 1, 2, 1)^T.   (4.45)
The last set of (non-negativity) constraints actually have to be written as −x_i ≤ 0, and so their corresponding gradients are
c_i = ∇g_i = (0, . . . , −1, . . . , 0)^T, for i = 3 to 6, with the −1 in the (i − 2)th spot.
Imagine we started with x_0^T = (2, 2, 1, 0); the active constraints are I(x_0) = {1, 2, 6} (which we can see by testing whether each inequality is tight). Hence the matrix M is a 3 × 4 matrix whose rows are the constraint vectors c_1, c_2 and c_6, i.e.,
\[
M = \begin{pmatrix} 2 & 1 & 1 & 4 \\ 1 & 1 & 2 & 1 \\ 0 & 0 & 0 & -1 \end{pmatrix},
\]
giving
\[
MM^T = \begin{pmatrix} 22 & 9 & -4 \\ 9 & 7 & -1 \\ -4 & -1 & 1 \end{pmatrix},
\]
and hence
\[
P_S = I - M^T(MM^T)^{-1}M = \frac{1}{11}\begin{pmatrix} 1 & -3 & 1 & 0 \\ -3 & 9 & -3 & 0 \\ 1 & -3 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
\]
Now ∇f(x) = (2x_1 − 2, 2x_2, 2x_3, 2x_4 − 3)^T, so ∇f(x_0) = (2, 4, 2, −3)^T and
u_0 = −P_S ∇f(x_0) = (8/11)(1, −3, 1, 0)^T.
This is just a direction, so we can scale it (to make the numbers easier), taking u_0 = (1, −3, 1, 0)^T. Then x_1 = x_0 + λu_0, where λ minimises
g(λ) = f(x_0 + λu_0), subject to x_0 + λu_0 ∈ Ω, λ > 0.
Here, x_0 + λu_0 = (2 + λ, 2 − 3λ, 1 + λ, 0)^T, so
g(λ) = 11λ² − 8λ + 5,
which is minimised at λ = 4/11. Moving along u_0 keeps the active constraints tight, and the remaining non-negativity constraints only require λ ≤ 2/3, so we can take λ = 4/11, giving
x_1 = (26/11, 10/11, 15/11, 0)^T.
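The projection step itself is only a few lines of linear algebra. A sketch of this step for Example 39 (NumPy; not from the notes, names illustrative):

import numpy as np

# One gradient-projection step at x0 = (2, 2, 1, 0) in Example 39.
def grad_f(x):
    return np.array([2*x[0] - 2, 2*x[1], 2*x[2], 2*x[3] - 3])

M = np.array([[2.0, 1, 1, 4],       # rows: gradients of the active
              [1.0, 1, 2, 1],       # constraints I(x0) = {1, 2, 6}
              [0.0, 0, 0, -1]])
x0 = np.array([2.0, 2.0, 1.0, 0.0])

P = np.eye(4) - M.T @ np.linalg.inv(M @ M.T) @ M   # the projector P_S
u = -P @ grad_f(x0)                                # projected descent direction
print(u * 11 / 8)                                  # proportional to (1, -3, 1, 0)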
The above illustrates one step of our proposed approach. The idea
is to use the gradient of f (·) orthogonally projected into the active
constraint set S in order to find a feasible descent direction. However,
there are two cases when it won’t work.
The first occurs because the above assumes that M M T is invertible.
What do we do if M M T is not invertible? The projection matrix P S is
still uniquely defined, but we need to alter the working above.
If MM^T isn't invertible, then its rows are linearly dependent, and this tells us the rows of M are linearly dependent.⁶
So we are looking at the case where the rows of M are linearly
dependent, i.e., some constraints are linear combinations of the other
constraints. In this case, some of the constraints are redundant, and
can effectively be dropped.
There are several ways of finding what we can drop. One approach
is to row reduce M to echelon form and let B be the matrix made up of
the nonzero rows of this reduced form of M . This has the advantage of
simplifying the constraints as well. Then
S = {u | M u = 0} = {u | B u = 0},
but now we know that B B T is invertible. Hence we can now form the
projection matrix
P S = I − B T (B B T )−1 B .
⁶If MM^T isn't invertible then its determinant is zero, i.e., |MM^T| = 0. We cannot conclude |M| = 0, because M is not even necessarily square. However, if |MM^T| = 0, then the matrix can't be positive definite, and hence there is some α ≠ 0 with α^T MM^T α = 0. Equivalently, we can write (M^Tα)^T M^Tα = ‖M^Tα‖² = 0, and the only vector with zero norm is the zero vector, so M^Tα = 0. Writing this in full is just Σ_i (α_i × row i of M) = 0, which says that the rows of M are linearly dependent (from the definition).
The third constraint is a linear combination of the first two, (3) = ((1) + (2))/3, so they are linearly dependent. At x = (1, 0), we have I(x) = {1, 2, 3}, so
\[
M = \begin{pmatrix} 1 & 1 \\ 2 & -1 \\ 1 & 0 \end{pmatrix},
\quad
M^T = \begin{pmatrix} 1 & 2 & 1 \\ 1 & -1 & 0 \end{pmatrix},
\quad\text{and}\quad
MM^T = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 5 & 2 \\ 1 & 2 & 1 \end{pmatrix},
\]
which is singular.
This solution is OK for small problems; for large problems, the idea
of a generalised inverse can be used, but these are beyond the scope of
this course.
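For completeness, a sketch of the generalised-inverse alternative, using NumPy's Moore–Penrose pseudo-inverse (illustrative only, not a method developed in these notes):

import numpy as np

# The projector onto {u | M u = 0} can be formed even when M M^T is
# singular, via the Moore-Penrose pseudo-inverse.
M = np.array([[1.0, 1.0],
              [2.0, -1.0],
              [1.0, 0.0]])          # row 3 = (row 1 + row 2)/3
P = np.eye(2) - np.linalg.pinv(M) @ M
print(P)                            # here the rows of M span R^2, so P = 0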
The second problem occurs when the search direction u = 0. In Example 39, if we continue from x_1 to find a new search direction u_1, we find u_1 = 0, even though x_1 is not the minimiser. In general,
u = 0 ⇔ (I − M^T(MM^T)^{−1}M)∇f(x) = 0 (with M not necessarily square; it is p × n), i.e.,
∇f(x) − M^T(MM^T)^{−1}M∇f(x) = 0,
or
∇f(x) + M^T ω = 0,   (4.46)
where M^T ω is a linear combination of the columns of M^T, i.e., of the rows of M, with
ω = (MM^T)^{−1}M(−∇f(x)) = (ω_1, ω_2, . . . , ω_p)^T, say,
so that for this choice of ω,
\[
\nabla f(\mathbf{x}) + \sum_{i \in I(\mathbf{x})} \omega_i \nabla g_i(\mathbf{x}) = 0. \tag{4.47}
\]
Recall the KKT conditions in Theorem 4.1: if there exist ω_i ≥ 0 for all i ∈ I(x) satisfying (4.47), then we have found the optimal solution, i.e., x∗ = x.
Are the ω_i ≥ 0 for all i ∈ I(x), though? Continuing Example 39 at the point x_1, we get
ω = (MM^T)^{−1}M(−∇f(x_1)) = (−10/11, −10/11, −83/11)^T.
These don't satisfy the KKT conditions, so we have not yet found the optimal point x∗. So what can we do now?
We need to think about what is really going on. If we find that the projection of −∇f(x) onto S(x) is 0, and yet we have not satisfied the KKT conditions, then we know that either
1. S(x) = {0}, i.e., the current point x is the only point that satisfies all of the active constraints I(x); or
2. the zero projection arises because some active constraint should not be active at the minimiser, and progress requires moving off that constraint.
An excellent example of the former case can be seen in Example 38, the final figure of which is repeated in Figure 4.16a, below. In this case,
at the point x = (x 1 , x 2 ) the active constraint set is I (x) = {1, 2}, but we
can see that only one point7 satisfies the constraints and so S(x) = {0}.
However, there are feasible descent directions leading away from x if
we relax the second constraint (the one in red). Obviously, this example
includes non-linear constraints, but the principle still holds.
Figure 4.16: The two cases where u = 0. (a) A repeat of Figure 4.12: we can see that there is an isolated point satisfying both constraints, so that S(x) = {0} at this point, but the green region shows the feasible descent directions. (b) The case of a single active constraint I(x) = {1}, but where the minimiser x∗ is in the interior of the constraint set, as indicated by the fact that −∇f(x) points to the interior.
The second case is, perhaps, best illustrated by the simple case
where we have a single constraint g 1 (x) = 0 active, but the optimal
point is in the interior. In this case, the descent direction must point
(orthogonally) away from the constraint, as in Figure 4.16b.
⁷Actually two points satisfy the constraints, but they are isolated from each other.
Recall that a vector space Z is the orthogonal sum of subspaces U and V, written
Z = U ⊕ V,
if every z ∈ Z can be written uniquely as z = u + v, with u ∈ U, v ∈ V, and u^T v = 0.

Defn 4.5 (Null space). The null space (or kernel) of a matrix A is the vector space N(A) of all vectors u such that Au = 0.

We also write R(A) for the range of a matrix A, the vector space {Ax}. Then we have:

Lemma 7.
R^n = N(M) ⊕ R(M^T).
Proof. The lemma states that any vector z in R^n can be written as the sum of two vectors, z = u + v, where v^T u = 0 and
u ∈ N(M), i.e., Mu = 0,
v ∈ R(M^T), i.e., M^T ω = v for some ω ∈ R^p.
Take
u = Pz,   (4.48)
v = M^T ω, where ω = (MM^T)^{−1}Mz.   (4.49)
Then:
• u ∈ N(M), since
Mu = MPz = (M − (MM^T)(MM^T)^{−1}M)z = (M − M)z = 0;
• z = u + v, because
Pz + M^T ω = (I − M^T(MM^T)^{−1}M)z + M^T(MM^T)^{−1}Mz = z;
• u^T v = u^T(M^T ω) = (Mu)^T ω = 0^T ω = 0.
−∇f(x) = u + v,
where
u = P_S(−∇f(x)) ∈ N(M),   (4.50)
v = M^T ω ∈ R(M^T).   (4.51)
If u = 0, then
−∇f(x) = v = M^T ω,
and, as we saw above, some ω_j < 0. Form M′ by deleting the j-th row (the constraint with negative multiplier) from M, and let
S′(x) = {u | M′u = 0}.
From Lemma 7 we can once again decompose the gradient into two parts
−∇f(x) = u′ + v′,
but this time u′ ≠ 0.
To see why, suppose instead that u′ = 0. Then we would have
M^T ω = −∇f(x) = M′^T ω′,
i.e.,
\[
\sum_i \omega_i \mathbf{c}_i = \sum_{i \ne j} \omega'_i \mathbf{c}_i,
\]
so that
\[
\omega_j \mathbf{c}_j = \sum_{i \ne j} (\omega'_i - \omega_i)\mathbf{c}_i,
\quad\text{and hence}\quad
\mathbf{c}_j = \frac{1}{\omega_j} \sum_{i \ne j} (\omega'_i - \omega_i)\mathbf{c}_i,
\]
which contradicts the linear independence of the rows of M; so u′ ≠ 0. Setting
ω′ = (M′M′^T)^{−1}M′(−∇f(x)),
we have from Lemma 7, by letting z = −∇f(x), that
−∇f(x) = u′ + M′^T ω′.
We can also see that the new direction will be feasible. For u′ to be feasible (i.e., inward pointing) we need c_i^T u′ ≤ 0 for all i ∈ I(x). Now
M′u′ = 0,
and so c_i^T u′ = 0 for all i ∈ I(x) except for i = j. So we only need to check that c_j^T u′ ≤ 0. Now since u = 0, we know ∇f(x) = −M^T ω, and u′ is a descent direction, so
0 > u′^T ∇f(x) = −u′^T M^T ω = −(Mu′)^T ω = −ω_j c_j^T u′.
Since ω_j < 0, this gives c_j^T u′ < 0, as required.
Example 41. Minimise f(x_1, x_2) = ½(x_1² + x_2²) s.t.
g_1(x_1, x_2) = 1 − x_1 ≤ 0, with c_1 = (−1, 0)^T,
g_2(x_1, x_2) = 4 − 2x_1 − x_2 ≤ 0, with c_2 = (−2, −1)^T,
starting from x_0 = (2, 4)^T.

Iteration 1. Note that I(x_0) = ∅ implies x_0 ∈ int Ω, and so the best search direction is just the Steepest Descent direction; that is, you'd expect u = −∇f(x_0).
Check: from the algorithm,
u_0 = −P∇f(x_0) = −∇f(x_0) = −x_0 = (−2, −4)^T.
x_1 = x_0 + λu_0 = (1 − λ)x_0 = (2(1 − λ), 4(1 − λ))^T is feasible when
1 − 2(1 − λ) ≤ 0 and 4 − (4 + 4)(1 − λ) ≤ 0, i.e., 2λ ≤ 1 and 8λ ≤ 4,
i.e., when λ ≤ 1/2 = β.
g(λ) = f(x_0 + λu_0) = ½(1 − λ)²(2² + 4²) is minimised at λ = 1. However, we have constrained the search such that λ ∈ (0, 1/2], and so we must choose λ_0 = 1/2. Therefore,
x_1 = x_0 + λ_0 u_0 = x_0 − ½x_0 = ½x_0 = (1, 2)^T, and ∇f(x_1) = (1, 2)^T.
Iteration 2.
I(x_1) = {1, 2}, so
\[
M = A = \begin{pmatrix} -1 & 0 \\ -2 & -1 \end{pmatrix},
\qquad
P = I - M^T(MM^T)^{-1}M = I - I = 0.
\]
So u = 0, and we compute
ω = (MM^T)^{−1}M(−∇f(x_1)) = (−3, 2)^T.
Since ω_1 = −3 < 0, delete 1 from I(x_1), i.e., delete row 1 from M, giving the new M = (−2 −1).
Then MM^T = 5 and
\[
P = I - M^T \tfrac{1}{5} M
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \frac{1}{5}\begin{pmatrix} 4 & 2 \\ 2 & 1 \end{pmatrix}
= \begin{pmatrix} 1/5 & -2/5 \\ -2/5 & 4/5 \end{pmatrix},
\]
\[
\mathbf{u}_1 = -P\nabla f(\mathbf{x}_1)
= -\begin{pmatrix} 1/5 & -2/5 \\ -2/5 & 4/5 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix}
= \frac{1}{5}\begin{pmatrix} 3 \\ -6 \end{pmatrix}.
\]
Note
u_1^T ∇f(x_1) = (1/5)(3, −6)(1, 2)^T = −9/5 < 0,
and therefore u_1 represents a descent direction.
Put
x_2 = x_1 + λu_1 = (1, 2)^T + (λ/5)(3, −6)^T = (1 + 3λ/5, 2 − 6λ/5)^T,
which is feasible when
1 − (1 + 3λ/5) ≤ 0, i.e., for all λ ≥ 0, and
4 − 2(1 + 3λ/5) − (2 − 6λ/5) ≤ 0, i.e., for all λ ≥ 0.
So β = ∞.
g(λ) = f(x_1 + λu_1) is minimised when (1 + 3λ/5)² + (2 − 6λ/5)² is minimised, i.e., when
(3/5)(1 + 3λ/5) − (6/5)(2 − 6λ/5) = 0,
i.e., 3λ = 3, or λ = 1. So
x_2 = x_1 + u_1 = (8/5, 4/5)^T,
∇f(x_2) = x_2 = (8/5, 4/5)^T.
Find
ω = −(MM^T)^{−1}M∇f(x_2) = −(1/5)(−2, −1)(8/5, 4/5)^T = 20/25 = 4/5 > 0.
Since ω > 0, the KKT conditions are satisfied, and we have found the minimiser x∗ = x_2 = (8/5, 4/5)^T.
Figure 4.17: Example 41, starting from x_0 = (2, 4). The feasible region (g_1(x) = 1 − x_1 ≤ 0, g_2(x) = 4 − 2x_1 − x_2 ≤ 0) is shown in green; level curves of f(x) are shown as the orange concentric circles. (a) Constraints. (b) Starting point x_0. (c) Second point x_1 = (1, 2), where I(x_1) = {1, 2} and S(x_1) = {0}: note that −∇f(x_1) now points outside the feasible region, so it isn't a feasible descent direction. (d) The third point x_2 = x_1, but with a different set of active constraints (I(x_2) = {2}, S(x_2) = {ν(1, −2)}), and the last point x_3 = x∗. It may seem unlikely that we would exactly hit a vertex, but the following figure shows what would happen if we started at x_0 = (1.5, 4).
Figure 4.18: Example 41 but with an alternative starting point x_0 = (1.5, 4). The feasible region is shown in green; level curves of f(x) are shown as the orange concentric circles. (a) Alternative starting point x_0. (b) Second point x_1, where I(x_1) = {1} and S(x_1) = {ν(0, 1)}: note that −∇f(x_1) now points outside the feasible region, so it isn't a feasible descent direction. (c) The third point x_2 is the same as the vertex we came to in the previous example (I(x_2) = {1, 2}, S(x_2) = {(0, 0)}). (d) The last step is just as before (though the indexes have changed to reflect the extra step): I(x_3) = {2}, S(x_3) = {ν(1, −2)}, ending at x_4.
§5 Non-Convex Optimisation
We now drop the convexity requirements, and consider minimising a general function f(x) subject to constraints
g_i(x) ≤ 0, i = 1, 2, . . . , m < n.
Figure 5.1: A non-convex optimisation problem, showing the objective C(x) with a local minimum and the global minimum. We can see that a descent algorithm would get caught in a local minimum unless we just happened to choose a good starting point.
In simulated annealing, at each step we consider a random move
∆x = x_{i+1} − x_i,   (5.1)
with corresponding change in the objective
∆f = f(x_i + ∆x) − f(x_i).   (5.2)
In the physical analogy, we either reiterate at this temperature or drop the temperature; in optimisation terms, we either reiterate at this cost or drop the cost. Classic applications include combinatorial problems such as:
• the TSP (Travelling Salesman Problem), and
• Graph Partitioning.
Acceptance Functions
In a pure descent algorithm, our acceptance function looks like:
∆f ≤ 0: accept,
∆f > 0: reject.
That is, we accept a move if it results in an improvement in the objective. However, in general, the acceptance function is not really a function, but actually a random variable, dependent on ∆f. In this case, we can rewrite the above in terms of a probability of acceptance P(accept | ∆f), which in this case would be given by
\[
P(\text{accept} \mid \Delta f) = \begin{cases} 1, & \Delta f \le 0, \\ 0, & \Delta f > 0. \end{cases}
\]
• For ∆f > 0, the probability of acceptance should decrease as ∆f increases, and decrease as the temperature decreases.
Figure 5.2 illustrates the properties that we require from the function. The illustration shows continuous functions, which are not necessary, but certainly make sense.
A commonly used acceptance function incorporates the Boltzmann factor, derived from the statistical mechanics of particles (say in a fluid),
exp(−E(x)/kT),
which describes the relative likelihood of a configuration x with energy E(x), where k is Boltzmann's constant and T the temperature. We incorporate it by using its meaning as a probability of a configuration to define the acceptance probability¹
\[
P(\text{accept} \mid \Delta f) = \begin{cases} 1, & \Delta f \le 0, \\ \exp(-\Delta f / kT), & \Delta f > 0. \end{cases}
\]
¹The case ∆f = 0 is a special case, and we will see in a minute why we set the acceptance probability to 1 for this case.
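In code, the acceptance rule is a one-liner. A sketch in Python (taking k = 1, so that only the ratio ∆f/T matters; not from the notes):

import math, random

def accept(delta_f, T):
    if delta_f <= 0:
        return True                               # always accept improvements
    return random.random() < math.exp(-delta_f / T)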
Figure 5.2: Illustration of the properties of acceptance functions, shown for two temperatures T_2 < T_1. As ∆f increases, or the temperature decreases, the probability of acceptance P(∆f) decreases.
The Annealing Schedule
In the physical analogy, temperature is reduced slowly over time, which allows the system to stay approximately in equilibrium as the temperature decreases. We need to do something analogous here, but we have complete control over T. The annealing schedule specifies exactly what the behaviour of T will be.
In general, we want the temperature to decrease over time. In optimisation terms this means that the “negative” jumps become less likely as the algorithm progresses.
The latter is more in line with the physical process, because it gives the system some time to reach “equilibrium”, but it may be slower.
We then need to choose the initial temperature, and how much to decrease the temperature by. The analogy breaks down here a little, as we have no scale equivalent to Celsius, and no notion of a “boiling” or “freezing” point to set our numbers by. The initial temperature should be in some sense high enough for “melting”, but the right temperature isn't at all obvious. In some sense we might set it to make our initial steps reasonably likely (say P(accept | ∆f) = 0.5 or 0.8) for any likely value of ∆f, but even this is a hard criterion to use for a particular problem, without some experimentation.
The temperature reductions bring in even more complexity. We could choose to build a full table of these reductions, but it is more common to simply use a geometric decrease, i.e.,
T_{i+1} = αT_i,
for some α ∈ (0, 1). Finally, we need a stopping criterion. A literal reading of the physical analogy would have us stop when the temperature is 0 (or close to zero), where non-descent moves become impossible. However, we might instead stop if nothing changes for some time.
The Algorithm
Given an acceptance function and an annealing schedule, the approach is as given in Algorithm 14. It's often called the Metropolis–Hastings algorithm, though technically that name belongs to an algorithm intended for a different purpose (Markov Chain Monte Carlo); we will use it for the sake of ease, as many others have.
The algorithm itself is given by Algorithm 14, given initial and final temperatures T_init and T_end, a decrease factor α, and an inhomogeneous, geometric decrease schedule. We keep track of the best solution “so far” in the variable x_best (the strict SA algorithm doesn't do this, but it is an obvious improvement). Obviously, there are many ways to improve this algorithm by, for instance, storing function values in temporary variables, but the underlying algorithm should be clear.
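Since Algorithm 14 is not reproduced here, the following Python sketch gives the flavour of the loop just described (geometric cooling, keeping the best-so-far solution); the function names are illustrative:

import math, random

def anneal(f, x0, neighbour, T_init, T_end, alpha):
    """Simulated annealing with geometric cooling T <- alpha*T."""
    x, x_best, T = x0, x0, T_init
    while T > T_end:
        x_new = neighbour(x)                    # random move to a nearby state
        delta_f = f(x_new) - f(x)
        if delta_f <= 0 or random.random() < math.exp(-delta_f / T):
            x = x_new                           # accept the move
        if f(x) < f(x_best):
            x_best = x                          # track the best so far
        T *= alpha
    return x_best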
However, this oversimplifies. Generation of random moves is non-trivial. The moves must be to “neighbouring” states, but what exactly does that mean? And they must be random, while still enforcing any constraints on the problem, and they must be generated efficiently.
Another issue, for a homogeneous approach, is determining how many times we should run the inner loop before changing the temperature. Ideally it's long enough to explore the regions of search space that should be reasonably populated, but determining this may take some trial and error, as it is problem dependent.
There are also many much more sophisticated modifications of the approach in the literature, e.g. [12], and these may well be needed because of the complexities of real problems.
was used to select and improve the algorithm, producing new generations of pictures that were more appealing.
It's a perfect example of a problem where we can't express the objective function in mathematical form, and can't calculate it quickly. In other words, the GA might say “I don't know much about art, but I know what I like.”
• chromosome encoding,
• crossover method,
• mutation method,
• selection method,
• fitness criteria.
We will discuss each of these below, but note that we didn't include fitness function evaluation, which could be explicit as in most of the previous problems, or the fitness could be implicit in the result of a competition between the members of the population.
Chromosome Encoding
There are various ways to encode information. We could just store our optimisation variables as numbers in a vector x. However, mutation is usually thought of at the bit level – we might mutate a single bit, not randomise an integer. Moreover, we will see that the various operations of a GA place some emphasis on the locus of information. Genes that are close together are less likely to be split up than those far apart. So the order of elements of the vector may matter in ways that it doesn't in any of the algorithms we have considered so far.
Thus, in any GA, the first issue to consider is how to represent the state variables of the problem in a chromosome. There are quite a few valid approaches; some simple ones are:
• Binary Encoding, e.g.,
  Chromosome A: 1101100100110110
• Permutation Encoding (useful for ordering problems such as the TSP), e.g.,
  Chromosome A: 1 5 3 2 6 4 7 9 8
  Chromosome B: 8 5 6 7 2 3 1 4 9

numbers       | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7
binary coding | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
Gray coding   | 000 | 001 | 011 | 010 | 110 | 111 | 101 | 100

Table 5.1: Binary codes for the integers {0, . . . , 7}.
Crossover
Cross-over is the way we select genes from parents’ chromosomes to
create offspring’s genes. Typically it is performed between two parents,
though we can easily generalise.
One simple approach is a single crossover point:
For example:
Parent 1: 1101100100110110
Parent 2: 1111111000011110
Offspring: 1101111000011110
⇑
crossover
There are other ways to perform crossover. For permutation encoding, for instance:
• one crossover point is selected, and the permutation is copied from the first parent up to this point;
• then the other parent is scanned and if the number is not yet in the offspring, it is added (in order of scanning).
For example:
Parent 1: 1 2 3 4 5 6 7 8 9
Parent 2: 4 5 3 6 8 9 7 2 1
Offspring: 1 2 3 4 5 6 8 9 7
⇑
crossover
Some more illustrations of encoding and crossover can be found at
http://cs.felk.cvut.cz/~xobitko/ga/cromu.html
Mutation
Mutation is an important component of both real evolution and evo-
lutionary computing. It is necessary to prevent all solutions in the
population falling into a local optimum (crossover by itself may not be
able to escape). Think of it as ensuring genetic diversity.
Mutation is simulated by introducing random changes to the off-
spring, but the meaning of random must be carefully defined. If we are
using binary encoding (with Gray codes) then a common approach is
to flip each bit with a small probability p. For example:
Original offspring: 1101100100110110
Mutated offspring: 1101000100111110
There are other possibilities: for instance, if a group of bits encode
a gene, we could mutate whole genes at each step.
However, other codes may require different strategies: for instance
in permutation encoding we may want to swap two randomly chosen
elements so that the result is still a valid permutation. For example:
Permutation mutation example
Original offspring: 1 2 3 4 5 6 8 9 7
Mutated offspring: 1 8 3 4 5 6 2 9 7
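The operators above are easy to express in code. A sketch (Python; the names are illustrative, and this is not part of the original notes) of single-point crossover and the two mutation styles:

import random

def crossover(p1, p2):
    """Single-point crossover for binary-string chromosomes."""
    c = random.randrange(1, len(p1))
    return p1[:c] + p2[c:]

def crossover_permutation(p1, p2):
    """Copy p1 up to the crossover point, then scan p2 for missing genes."""
    c = random.randrange(1, len(p1))
    head = p1[:c]
    return head + [g for g in p2 if g not in head]

def mutate_binary(chrom, p=0.01):
    """Flip each bit independently with small probability p."""
    return "".join((("1" if b == "0" else "0") if random.random() < p else b)
                   for b in chrom)

def mutate_permutation(perm):
    """Swap two randomly chosen elements, keeping a valid permutation."""
    i, j = random.sample(range(len(perm)), 2)
    perm = perm[:]
    perm[i], perm[j] = perm[j], perm[i]
    return perm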
Selection Algorithms
We also need a method to select parents. Obviously we could simply choose the fittest, but this quickly reduces diversity in the population, and isn't the best long-term strategy.
In general, we select parents stochastically, but with some bias towards fitter parents. Examples of approaches are:
• Roulette wheel selection, where parent i is chosen with probability proportional to its fitness:
\[
p_i = \frac{f(\mathbf{x}_i)}{\sum_{i \in P} f(\mathbf{x}_i)}.
\]
• Rank selection, where the population of size N is sorted by fitness, and the parent of rank i is chosen with probability
\[
p_i = \frac{i}{\sum_{i \in P} i} = \frac{2i}{N(N+1)}.
\]
The better the genotype is, the more chances it has to be selected. The approach is illustrated in Figure 5.4b. We can see that it still favours the fittest, but not by as much as a pure Roulette Wheel approach.
Another problem (similar to the problem of a pure simulated annealing algorithm) is that there is no history kept and, as the process is random, we may lose a good solution and never find it again. We can avoid this using elitism. In this case, we keep a small number of the best of each generation, and put them directly into the next generation without crossover or mutation.
Elitism can dramatically improve the performance of a GA, with little overhead, simply by ensuring that the best past solution is always included in the current population, so that the algorithm never goes backwards. A code sketch of the selection rules above appears below.
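Here is that sketch (Python; `pop` and `fit` are an assumed population list and fitness list, with larger fitness being better — not from the notes):

import random

def roulette(pop, fit):
    """Select with probability proportional to fitness."""
    return random.choices(pop, weights=fit, k=1)[0]

def rank_select(pop, fit):
    """Sort by fitness and select with probability proportional to rank."""
    ranked = [p for _, p in sorted(zip(fit, pop), key=lambda t: t[0])]
    return random.choices(ranked, weights=range(1, len(pop) + 1), k=1)[0]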
Parameters of GAs
The components above, once chosen, specify a particular GA from the space of possible algorithms, but we also need to set the parameters of the algorithm. The choices have a significant effect on performance, so some care should be taken.
Typical parameters that need to be set are:
population size: if the population is too small, there are few opportunities to perform crossover, and only a small part of the search space is searched in each generation. If the population is too large, the GA slows down, and at some point we reach diminishing returns. The best size depends on the problem (and the size of the search space), but a rule of thumb might be 20–30, though 50–100 is reported as the best in some cases. Some research also shows that the best population size depends on the size of the chromosome.
Examples
GAs have been used in many settings, but we mention below a few of the more beautiful. For instance, GAs have been used in generating artificial life: Torsten Reil created realistic animations of stick figures by adding “muscles” to them and using the distance walked as the fitness², and these ideas have been used in The Lord of the Rings' computer-generated effects. Karl Sims used GAs to evolve artificial creatures [11].
Notes
More information (at various levels) on GAs can be obtained from:
http://www.obitko.com/tutorials/genetic-algorithms/
http://www.rennard.org/alife/english/gavintrgb.html
http://www.cs.cmu.edu/Groups/AI/html/faqs/ai/genetic/top.html
http://www.genetic-programming.com/published/scientificamerican1096.html
http://www.trnmag.com/Stories/2003/060403/Artificial_beings_evolve_realistically_060403.html

²http://www.democraticunderground.com/discuss/duboard.php?az=view_all&address=389x3582335
§6 Conclusion
Bibliography

[2] Gardner, M. Mathematical games. Scientific American, 2 (August 1972), 106.
[3] Glover, F. Tabu search: A tutorial. Interfaces 20, 4 (1990), 74–94.
[4] Gray, F. Pulse code communication. U.S. Patent 2,632,058, March 17, 1953.
[5] Grefenstette, J. J., Ed. Proceedings of the First International Conference on Genetic Algorithms and Their Applications. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1985.
[7] Kirkpatrick, S., Gelatt Jr., C. D., and Vecchi, M. Optimization by simulated annealing. Science 220 (1983), 671–680.
[8] Leone, R. D. The origin of operations research and linear programming. http://globopt.dsi.unifi.it/gol/Seminari/DeLeone2008.pdf, 2008.
[9] Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 6 (1953), 1087–1092.
[10] Schaffer, J. D., Ed. Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann Publishers, Inc., 1989.
[11] Sims, K. Evolving virtual creatures. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques (New York, NY, USA, 1994), SIGGRAPH '94, ACM, pp. 15–22.
[12] Wang, L., Zhang, H., and Zheng, X. Inter-domain routing based on simulated annealing algorithm in optical mesh networks. Opt. Express 12 (2004), 3095–3107. http://www.opticsexpress.org/abstract.cfm?URI=OPEX-12-14-3095.
[13] Wood, M. K., and Dantzig, G. B. Programming of interdependent activities: I general discussion. Econometrica 17, 3/4 (1949), 193–199. This is a revised version of a paper that appeared at the Madison Meeting of the Econometric Society on September 9, 1948.
Index
operations research, 84
order of convergence, 77
orthogonal projection, 194
orthogonal sum, 209
phenotype, 233
population, 234
Principal minors, 86
projection matrix, 195
range, 209
saddle point, 5
scope, 63
slack, 155
smooth, 67
sounding line, 145
standard basis vectors, 124
standard form, 150
stationary, 5
stationary points, 18
steepest descent, 106
tight, 153
tolerance level, 69
unimodal, 22
unitary, 120
worst case, 32