Optimization:
Principles and Algorithms
Michel Bierlaire
EPFL Press
Every engineer and decision scientist must have a good mastery of optimization, an essential
element in their toolkit. Thus, this articulate introductory textbook will certainly be welcomed
by students and practicing professionals alike. Drawing from his vast teaching experience,
the author skillfully leads the reader through a rich choice of topics in a coherent, fluid, and
tasteful blend of models and methods anchored in the underlying mathematical notions
(the only prerequisites: first-year calculus and linear algebra). Topics range from the classics to
some of the most recent developments in smooth unconstrained and constrained optimization,
such as descent methods, conjugate gradients, Newton and quasi-Newton methods, linear
programming and the simplex method, and trust region and interior point methods.
Furthermore, elements of discrete and combinatorial optimization, such as network optimization,
integer programming, and heuristic local search methods, are also presented.

This book presents optimization as a modeling tool that, beyond supporting problem formulation
as well as the design and implementation of efficient algorithms, is also a language suited for
interdisciplinary human interaction. Readers further become aware that, while the roots of
mathematical optimization go back to the work of giants like Newton, Lagrange, Cauchy,
Euler, and Gauss, it did not become a discipline of its own until World War Two. Its present
momentum really resulted from its symbiosis with modern computers, which make it
possible to routinely solve problems with millions of variables and constraints.

With his witty, entertaining, yet precise style, Michel Bierlaire captivates his readers and
awakens their desire to try out the presented material in a creative mode. One of the
outstanding assets of this book is the unified, clear, and concise rendering of the various
algorithms, which makes them easily readable and translatable into any high-level
programming language. This is an addictive book that I am very pleased to recommend.
MICHEL BIERLAIRE holds a PhD in mathematics from the University of Namur, Belgium. He is a full professor at the
School of Architecture, Civil and Environmental Engineering at the Ecole Polytechnique Fédérale de Lausanne,
Switzerland. He has been teaching optimization and operations research at EPFL since 1998.
Cover Illustrations: Abstract cubes background, © ProMotion and Architectural detail of modern
roof structure, © Lucian Milasan – Fotolia.com
This book is published under the editorial direction of
Professor Robert Dalang (EPFL).
The EPFL Press is the English-language imprint of the Foundation of the Presses polytechniques et
universitaires romandes (PPUR). The PPUR publishes mainly works of teaching and research of the Ecole
polytechnique fédérale de Lausanne (EPFL), of universities and other institutions of higher education.
www.epflpress.org
To Patricia
Preface
Optimization algorithms are important tools for engineers, but difficult to use. In
fact, none of them is universal, and a good understanding of the different methods is
necessary in order to identify the most appropriate one in the context of specific appli-
cations. Designed to teach undergraduate engineering students about optimization,
this book also provides professionals employing optimization methods with significant
elements to identify the methods that are appropriate for their applications, and to
understand the possible failures of certain methods on their problem. The content
is meant to be formal, in the sense that the results presented are proven in detail,
and all described algorithms have been implemented and tested by the author. In
addition, the many numerical and graphical illustrations constitute a significant basis for
understanding the methods.
The book features eight parts. The first part focuses on the formulation and the
analysis of the optimization problem. It describes the modeling process that leads to
an optimization problem, as well as the transformations of the problem into an equiv-
alent formulation. The properties of the problem and corresponding hypotheses are
discussed independently of the algorithms. Subsequently, the optimality conditions,
the theoretical foundations that are essential for properly mastering the algorithms,
are analyzed in detail in Part II. Before explaining the methods for unconstrained
continuous optimization in Part IV, algorithms for solving systems of non linear
equations, based on Newton’s method, are described in Part III. The algorithms for
constrained continuous optimization constitute the fifth part. Part VI addresses op-
timization problems based on network structures, elaborating more specifically on
the shortest path problem and the maximum flow problem. Discrete optimization
problems, where the variables are constrained to take integer values, are introduced
in Part VII, where both exact methods and heuristics are presented. The last part is
an appendix containing the definitions and theoretical results used in the book.
Several chapters include exercises. Chapters related to algorithms also propose
projects involving an implementation. It is advisable to use a mathematical programming
language, such as Octave (Eaton, 1997) or Matlab (Moler, 2004). If a language
such as C, C++, or Fortran is preferred, a library managing the linear algebra, such
as LAPACK (Anderson et al., 1999), can be useful. When time limits do not allow
a full implementation of the algorithms by the students, the teaching assistant may
prepare the general structure of the program, including the implementation of opti-
mization problems (objective function, constraints, and derivatives) in order for the
students to focus on the key points of the algorithms. The examples described in
detail in this book enable the implementations to be verified.
Optimization is an active research field that is permanently stimulated by the
needs of modern applications. Many aspects are absent from this book. “Every choice
entails the rejection of what might have been better,” said André Gide.
There are many books on optimization. Within the vast literature, we may cite
the following books in English: Beck (2014), Calafiore and El Ghaoui (2014),
Ben-Tal et al. (2009), Boyd and Vandenberghe (2004), Ben-Tal and Nemirovski (2001),
Conn et al. (2000), Kelley (1999), Kelley (1995), Birge and Louveaux (1997),
Dennis and Schnabel (1996), Axelsson (1994), Polyak (1987), Scales (1985), Coleman
(1984), McCormick (1983), Gill et al. (1981), Fletcher (1980), Fletcher (1981),
Ortega and Rheinboldt (1970), and Fiacco and McCormick (1968). Several books are
also available in French. Among them, we can cite Korte et al. (2010), Dodge (2006),
Cherruault (1999), Breton and Haurie (1999), Hiriart-Urruty (1998), Bonnans et al.
(1997), and Gauvin (1992).
The bibliographic source for the biographies of Jacobi, Hesse, Lagrange, Fermat,
Newton, Al Khwarizmi, Cauchy, and Lipschitz is Gillispie (1990). The information
about Tucker (Gass, 2004), Dantzig (Gass, 2003), Little (Larson, 2004), Fulkerson
(Bland and Orlin, 2005), and Gomory (Johnson, 2005) come from the series IFORS’
Operational Research Hall of Fame. The source of information for Euler is his
biography by Finkel (1897). Finally, the information on Davidon was taken from his
web page
www.haverford.edu/math/wdavidon.html
and from Nocedal and Wright (1999). The selection of persons described in this work
is purely arbitrary. Many other mathematicians have contributed significantly to the
field of optimization and would deserve a place herein. I encourage the reader to read,
in particular, the articles of the series IFORS’ Operational Research Hall of Fame
published in International Transactions in Operational Research, expressly
dealing with Morse, Bellman, Kantorovich, Erlang, and Kuhn.
Online material
The book has a companion website:
www.optimizationprinciplesalgorithms.com
The algorithms presented in the book are coded in GNU Octave, a high-level in-
terpreted language (www.gnu.org/software/octave), primarily intended for numerical
computations. The code for the algorithms, as well as examples of optimization prob-
lems, are provided. All the examples have been run on GNU Octave, version 3.8.1,
on a MacBook Pro running OS X Yosemite 10.10.2. If you use these codes, Michel
Bierlaire, the author, grants you a nonexclusive license to run, display, reproduce, dis-
tribute and prepare derivative works of this code. The code has not been thoroughly
tested under all conditions. The author, therefore, does not guarantee or imply its
reliability, serviceability, or function. The author provides no program services for
the code.
Acknowledgments
I would like to thank EPFL Press for their support in the translation of the French
version and the publication of this book. The financial support of the School of
Architecture, Civil and Environmental Engineering at EPFL is highly appreciated.
I am grateful to my PhD advisor, Philippe Toint, who passed on his passion for
optimization to me. Among the many things he taught me, the use of geometric
interpretations of certain concepts, especially algorithmic ones, proved particularly
useful for my research and my teaching. I hope that they will now benefit the readers
of this book.
I also thank Thomas Liebling who put his trust in me by welcoming me into his
group at EPFL back in 1998. Among other things, he asked me to take care of the
optimization and operations research courses, which year after year have enabled me
to build the material that is gathered in this book. But I am especially grateful for
his friendship, and all the things that he has imparted to me.
I would like to thank my doctoral students, teaching assistants, and postdocs, not
only for the pleasure of working with them, but also for their valuable help in the
optimization course over the years, and their comments on various versions of this
book.
Earlier versions of the manuscript were carefully read by the members of the Trans-
port and Mobility Laboratory at the EPFL, and in particular Stefan Binder, Anna
Fernandez Antolin, Flurin Hänseler, Yousef Maknoon, Iliya Markov, Marija Nikolic,
Tomas Robenek, Riccardo Scarinci, and Shadi Sharif. Thomas Liebling provided a
great deal of comments, with excellent suggestions to improve the style and the content
of the book. He caught several errors and imprecisions. This greatly improved the
quality of the text. If there are still mistakes (and there are always some that escape
scrutiny), I clearly take full responsibility. Errata will be published on the website
as mistakes are caught.
I am so proud of my children, Aria and François, who have recently started the
difficult challenge of obtaining a university degree. Finally, I dedicate this book to
Patricia. I am really lucky to know such a wonderful person. Her love is a tremendous
source of joy and energy for me.
Contents
2 Objective function 29
2.1 Convexity and concavity . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Differentiability: the first order . . . . . . . . . . . . . . . . . . . . . . 31
2.3 Differentiability: the second order . . . . . . . . . . . . . . . . . . . . 39
2.4 Linearity and non linearity . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5 Conditioning and preconditioning . . . . . . . . . . . . . . . . . . . . . 45
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3 Constraints 51
3.1 Active constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Linear independence of the constraints . . . . . . . . . . . . . . . . . . 56
3.3 Feasible directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.1 Convex constraints . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.2 Constraints defined by equations-inequations . . . . . . . . . . 62
3.4 Elimination of constraints . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.5 Linear constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5.1 Polyhedron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5.2 Basic solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.5.3 Basic directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4 Introduction to duality 93
4.1 Constraint relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 Duality in linear optimization . . . . . . . . . . . . . . . . . . . . . . . 102
4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
VI Networks 487
21 Introduction and definitions 491
21.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
21.2 Cuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
21.3 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
21.4 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
27 Heuristics 647
27.1 Greedy heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
27.1.1 The knapsack problem . . . . . . . . . . . . . . . . . . . . . . . 648
27.1.2 The traveling salesman problem . . . . . . . . . . . . . . . . . 649
27.2 Neighborhood and local search . . . . . . . . . . . . . . . . . . . . . . 656
27.2.1 The knapsack problem . . . . . . . . . . . . . . . . . . . . . . . 662
27.2.2 The traveling salesman problem . . . . . . . . . . . . . . . . . 665
27.3 Variable neighborhood search . . . . . . . . . . . . . . . . . . . . . . . 669
27.3.1 The knapsack problem . . . . . . . . . . . . . . . . . . . . . . . 670
27.3.2 The traveling salesman problem . . . . . . . . . . . . . . . . . 672
27.4 Simulated annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
27.4.1 The knapsack problem . . . . . . . . . . . . . . . . . . . . . . . 677
27.4.2 The traveling salesman problem . . . . . . . . . . . . . . . . . 679
27.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
27.6 Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
B Definitions 689
C Theorems 695
D Projects 699
D.1 General instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 699
D.2 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
References 703
Part I
Formulation and analysis of the problem
Chapter 1
Formulation
Contents
1.1 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Projectile . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.2 Swisscom . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.3 Château Laupt-Himum . . . . . . . . . . . . . . . . . . . 9
1.1.4 Euclid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.5 Agent 007 . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.6 Indiana Jones . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.7 Geppetto . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Problem transformations . . . . . . . . . . . . . . . . . . . 16
1.2.1 Simple transformations . . . . . . . . . . . . . . . . . . . 16
1.2.2 Slack variables . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4 Problem definition . . . . . . . . . . . . . . . . . . . . . . . 21
1.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.1 Modeling
The need to optimize is a direct result of the need to organize. Optimizing consists in
identifying an optimal configuration or an optimum in a system in the broadest sense
of the term. We use here Definition 1.1, given by Oxford University Press (2013).
Definition 1.1 (Optimum). (In Latin optimum, the best). The most favorable
or advantageous condition, value, or amount, especially under a particular set of
circumstances.
As part of a scientific approach, this definition requires some details. How can we
judge that the condition is favorable, and how can we formally describe the set of
circumstances?
The modeling process relies on three main elements.
1. The selection of the decision variables describing the configuration of the system.
In practice, this step is probably the most complicated and the most important. The
most complicated because only experience in modeling and a good knowledge of
the specific problem can guide the selection. The most important because the rest
of the process depends on it. An inappropriate selection of decision variables can
generate an optimization problem that is too complicated and impossible to solve.
2. The description of the method to assess the state of the system in question, given
a set of decision variables. In this book, we assume that the person performing
the modeling is able to identify a formula, a function, providing a measure of the
state of the system, a value that she wants to make the smallest or largest possible.
This function, called the objective function, is denoted by f, and the aforementioned
measure obtained for the decision variables x is a real number denoted by f(x).
3. The mathematical description of the circumstances or constraints, specifying the
values that the decision variables can take.
The modeling process is both exciting and challenging. Indeed, there is no uni-
versal recipe, and the number of possible models for a given problem is only limited
by the imagination of the modeler. However, it is essential to master optimization
tools and to understand the underlying assumptions in order to develop the adequate
model for the analysis in question. In this chapter, we provide some simple examples
of modeling exercises. In each case, we present one possible model.
1.1.1 Projectile
We start with a simple problem. A projectile is launched vertically at a rate of 50
meters per second, in the absence of wind. After how long and at which altitude does
it start to fall?
1 See Appendix A about the mathematical notations used throughout the book.
The modeling process consists of three steps.
Decision variables A single decision variable is used. Denoted by x, it represents
the number of seconds from the launch of the projectile. Note that, in this case,
the decision variables represent a state of the system that the analyst wants to
calculate.
Objective function We seek to identify the maximum altitude reached by the ob-
ject. We must thus express the altitude as a function of the decision variable. Since
we are dealing with the uniformly accelerating movement of an object subjected
to gravity, we have
f(x) = −(g/2) x² + v0 x + x0 = −(9.81/2) x² + 50 x ,
where g = 9.81 is the acceleration experienced by the projectile, v0 = 50 is its
initial velocity, and x0 = 0 is its initial altitude.
Constraints Time only goes forward. Therefore, we impose x ≥ 0.
We obtain the optimization problem
max_{x∈R} −(9.81/2) x² + 50 x ,  (1.1)
subject to (s.t.)
x ≥ 0. (1.2)
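The book codes its algorithms in GNU Octave; as a quick companion sketch (in Python, not taken from the book), the concave quadratic (1.1) can be maximized analytically by setting its derivative to zero:

```python
# Sketch (not from the book): solving the projectile problem (1.1)-(1.2).
# f is a concave quadratic, so the stationary point of f'(x) = -g*x + v0
# is the maximizer; it also satisfies the constraint x >= 0.
g, v0, x0 = 9.81, 50.0, 0.0

def f(x):
    return -0.5 * g * x**2 + v0 * x + x0

x_star = v0 / g            # where f'(x_star) = 0
max_altitude = f(x_star)   # equals v0**2 / (2 * g)
print(round(x_star, 2), round(max_altitude, 2))
```

The projectile thus starts to fall after roughly 5.1 seconds, at an altitude of roughly 127.4 meters.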
1.1.2 Swisscom
The company Swisscom would like to install an antenna to connect four important
new customers to its network. This antenna must be as close as possible to each
client, giving priority to the best customers. However, to avoid the proliferation of
telecommunication antennas, the company is not allowed to install the new antenna
at a distance of less than 10 km from two existing antennas, respectively located at
coordinates (−5, 10) and (5, 0) and represented by the symbol ✪ in Figure 1.1. The
coordinates are expressed in kilometers from Swisscom’s headquarters. Swisscom
knows the geographic situation of each customer as well as the number of hours of
communication that the customer is supposed to consume per month. This data is
listed in Table 1.1. At which location should Swisscom install the new antenna?
The modeling process consists of three steps.
Decision variables Swisscom must identify the ideal location for the antenna, i.e.,
the coordinates of that location. We define two decision variables x1 and x2
representing these coordinates in a given reference system.
Objective function The distance di (x1 , x2 ) between a customer i located at the
coordinates (ai , bi ) and the antenna is given by
di(x1, x2) = √((x1 − ai)² + (x2 − bi)²) .  (1.3)
[Figure 1.1: the four customers (numbered 1 to 4), the two existing antennas (✪), and the candidate location (x1, x2) of the new antenna.]
To take into account the communication time, we measure the sum of the distances
weighted by the number of consumed hours:
f(x1, x2) = 200 d1(x1, x2) + 150 d2(x1, x2) + 200 d3(x1, x2) + 300 d4(x1, x2) .  (1.4)
Constraints The constraints on the distances to the existing antennas can be expressed
as
√((x1 + 5)² + (x2 − 10)²) ≥ 10  (1.5)
and
√((x1 − 5)² + x2²) ≥ 10 .  (1.6)
We can combine the various stages of modeling to obtain the following optimization
problem:
min_{x∈R²} f(x1, x2) = 200 √((x1 − 5)² + (x2 − 10)²)
                     + 150 √((x1 − 10)² + (x2 − 5)²)
                     + 200 √(x1² + (x2 − 12)²)
                     + 300 √((x1 − 12)² + x2²)   (1.7)
subject to (s.t.)
√((x1 + 5)² + (x2 − 10)²) ≥ 10
√((x1 − 5)² + x2²) ≥ 10 .   (1.8)
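The objective and feasibility test of problem (1.7)-(1.8) are easy to code. The sketch below (Python, not from the book) reads the customer coordinates and weights off (1.7), since Table 1.1 is not reproduced here:

```python
import math

# Sketch: objective (1.7) and constraints (1.8) of the Swisscom problem.
# Customer data (coordinates, weight in hours) is read off (1.7) and
# assumed to match Table 1.1 of the book.
customers = [((5, 10), 200), ((10, 5), 150), ((0, 12), 200), ((12, 0), 300)]
antennas = [(-5, 10), (5, 0)]   # existing antennas to stay 10 km away from

def objective(x1, x2):
    """Weighted sum of distances from the new antenna to the customers."""
    return sum(w * math.hypot(x1 - a, x2 - b) for (a, b), w in customers)

def feasible(x1, x2):
    """True if the candidate location is at least 10 km from both antennas."""
    return all(math.hypot(x1 - a, x2 - b) >= 10 for (a, b) in antennas)
```

Any numerical method for constrained optimization can then be wrapped around these two functions.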
1.1.3 Château Laupt-Himum
The Château Laupt-Himum produces rosé wine and red wine by buying grapes from
local producers. This year they can buy up to one ton of Pinot (a red grape) from a
winegrower, paying €3 per kilo. They can then vinify the grapes in two ways: either
as a white wine to obtain a rosé wine, or as a red wine to get Pinot Noir, a full-bodied
red wine. The vinification of the rosé wine costs €2 per kilo of grapes, while that of
the Pinot Noir costs €3.50 per kilo of grapes.
In order to take into account economies of scale, the Château wants to adjust the
price of its wine to the quantity produced. The price for one liter of the rosé is €15
minus a rebate of €2 per hundred liters produced. Thus, if they produce 100 liters
of rosé, they sell it for €13 per liter. If they produce 200, they sell it for €11 per
liter. Similarly, they sell the Pinot Noir at a price of €23 per liter, minus a rebate of
€1 per hundred liters produced.
How should the Château Laupt-Himum be organized in order to maximize its
profit, when a kilo of grapes produces 1 liter of wine?
The modeling process consists of three steps.
Decision variables Three decision variables are used: x1, the number of liters of
rosé wine produced; x2, the number of liters of Pinot Noir produced; and x3, the
number of kilos of grapes purchased.
Objective function Taking the rebate into account, the selling price (in €) of one
liter of rosé is
15 − (2/100) x1 ,
and that of one liter of Pinot Noir is
23 − (1/100) x2 .
The revenues corresponding to the production of x1 liters of rosé wine and x2
liters of Pinot Noir are equal to
x1 (15 − (2/100) x1) + x2 (23 − (1/100) x2) .
It costs 3 x3 to purchase the grapes. To produce a liter of wine, they need one kilo
of vinified grapes, which costs €2 for the rosé and €3.50 for the Pinot Noir. The
total costs are
2x1 + 3.5 x2 + 3x3 .
The objective function that the Château Laupt-Himum should maximize is
f(x) = x1 (15 − (2/100) x1) + x2 (23 − (1/100) x2) − (2 x1 + 3.5 x2 + 3 x3) .
Constraints The Château cannot buy more than 1 ton of grapes from the wine-
grower, i.e.,
x3 ≤ 1,000 .
Moreover, they cannot produce more wine than is possible with the amount of
grapes purchased. As one kilo of grapes produces one liter of wine, we have
x1 + x2 ≤ x3 .
It is necessary to add constraints which are, although apparently trivial at the
application level, essential to the validity of the mathematical model. These con-
straints specify the nature of the decision variables. In the case of Château Laupt-
Himum, negative values of these variables would have no valid interpretation. It
is necessary to impose
x1 ≥ 0 , x2 ≥ 0 , x3 ≥ 0 . (1.9)
We combine the modeling steps to obtain the following optimization problem:
max_{x∈R³} f(x) = x1 (15 − (2/100) x1) + x2 (23 − (1/100) x2) − (2 x1 + 3.5 x2 + 3 x3)  (1.10)
subject to
x1 + x2 ≤ x3
x3 ≤ 1,000
x1 ≥ 0 (1.11)
x2 ≥ 0
x3 ≥ 0 .
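At any optimum of (1.10)-(1.11), x3 = x1 + x2, since buying grapes that are not vinified only adds cost. Under that assumption, a brute-force enumeration over whole liters (a Python sketch, not the book's method) locates the best production plan:

```python
# Sketch: brute force the Château problem (1.10)-(1.11) over whole liters,
# fixing x3 = x1 + x2 (buying unused grapes only decreases the profit).
def profit(x1, x2):
    x3 = x1 + x2
    revenue = x1 * (15 - 2 * x1 / 100) + x2 * (23 - x2 / 100)
    cost = 2 * x1 + 3.5 * x2 + 3 * x3
    return revenue - cost

best = max(
    ((profit(x1, x2), x1, x2)
     for x1 in range(1001)
     for x2 in range(1001 - x1)),   # enforces x1 + x2 <= 1000
    key=lambda t: t[0],
)
print(best)   # (profit, liters of rosé, liters of Pinot Noir)
```

On this instance the search returns 225 liters of rosé and 775 liters of Pinot Noir, for a profit of €8,018.75.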
1.1.4 Euclid
In about 300 BC, the Greek mathematician Euclid was interested in the following
geometry problem: what is the rectangle with the greatest area among the rectangles
with given perimeter L? This is considered one of the first known optimization
problems. Denoting by x1 and x2 the lengths of the sides of the rectangle, we
obtain the following optimization problem:
max_{x∈R²} f(x1, x2) = x1 x2  (1.14)
subject to
2x1 + 2x2 = L
x1 ≥ 0  (1.15)
x2 ≥ 0.
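A numerical sanity check (a Python sketch, using the substitution x2 = L/2 − x1 from the perimeter constraint): among all feasible rectangles, the square x1 = x2 = L/4 has the largest area.

```python
# Sketch: Euclid's problem for perimeter L, after eliminating x2.
def area(x1, L):
    x2 = L / 2 - x1        # from 2 x1 + 2 x2 = L
    return x1 * x2

L = 20.0
# coarse grid over the feasible interval 0 <= x1 <= L/2
best = max(area(i * L / 2 / 1000, L) for i in range(1001))
print(best)   # close to (L/4)**2, i.e., 25.0 for L = 20
```

The grid maximum is attained at x1 = L/4 = 5, the square, with area (L/4)² = 25.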
1.1.5 Agent 007
[Figure 1.2: the position of Agent 007, the distance x along the 80 m side, and the 100 m side.]
0 ≤ x ≤ 80 .
We can combine the different steps of the modeling to obtain the following opti-
mization problem:
min_{x∈R} f(x) = (18/100) x + 0.6 √(100² + (80 − x)²)  (1.19)
subject to
x ≥ 0
x ≤ 80 .  (1.20)
This example is inspired by Walker (1999).
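Problem (1.19)-(1.20) has a single variable, so a standard derivative-free bracketing method such as golden-section search already solves it. A Python sketch (not the book's code):

```python
import math

# Sketch: golden-section search for the one-dimensional problem (1.19)-(1.20).
def f(x):
    return 18 / 100 * x + 0.6 * math.sqrt(100**2 + (80 - x)**2)

def golden_section(f, a, b, tol=1e-9):
    """Minimize a unimodal f on [a, b] by shrinking the bracket."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d              # the minimum lies in [a, d]
        else:
            a = c              # the minimum lies in [c, b]
    return (a + b) / 2

x_star = golden_section(f, 0.0, 80.0)
print(round(x_star, 3))
```

Setting the derivative of (1.19) to zero gives x = 80 − 30/√0.91 ≈ 48.55, which the search reproduces.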
1.1.6 Indiana Jones
Indiana Jones must cross a room whose floor is covered with snakes. Given the phobia he
has for these reptiles, it is impossible for him to wade through them, and he considers
passing over them. However, the roof is not strong enough, so he cannot walk on it.
Ever ingenious, he places the end of a ladder on the ground, blocked by a boulder,
leans it on the wall, and uses it to reach the other end of the room (Figure 1.3). Once
there, he uses his whip to get down to the floor on the other side of the snake room.
Where on the floor must he place the end of the ladder, so that the length used is as
small as possible, and the ladder thus less likely to break under his weight? We write
the optimization problem that would help our hero.
[Figure 1.3: the ladder across the snake room: x1 and x2 are the distances of its two ends, the room being ℓ = 10 m long and its walls h = 5 m high.]
The geometry of Figure 1.3 requires the ladder to pass through the point (ℓ, h), that is,
ℓ/x1 + h/x2 = 1, or
x1 x2 − hx1 − ℓx2 = 0 .
Finally, the ends of the ladder must be outside the room, and
x1 ≥ ℓ and x2 ≥ h .
We can combine the modeling steps to obtain the following optimization problem:
min_{x∈R²} √(x1² + x2²)  (1.21)
subject to
x1 x2 − hx1 − ℓx2 = 0
x1 ≥ ℓ (1.22)
x2 ≥ h .
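The equality constraint in (1.22) can be solved for x2, reducing (1.21) to one variable. A grid-search sketch in Python (with h = 5 and ℓ = 10 as in Figure 1.3):

```python
import math

# Sketch: the ladder problem (1.21)-(1.22) with h = 5, l = 10.
# The constraint x1*x2 - h*x1 - l*x2 = 0 gives x2 = h*x1 / (x1 - l),
# so the ladder length becomes a function of x1 > l alone.
h, l = 5.0, 10.0

def length(x1):
    x2 = h * x1 / (x1 - l)
    return math.hypot(x1, x2)

# grid search on the interval (l, 40]
best = min(length(l + i / 1000) for i in range(1, 30001))
print(round(best, 3))
```

The grid search agrees with the classical closed form (h^(2/3) + ℓ^(2/3))^(3/2) ≈ 20.81 m for the minimal ladder length.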
1.1.7 Geppetto
The company Geppetto Inc. produces wooden toys. It specializes in the manufacture
of soldiers and trains. Each soldier is sold for €27 and costs €24 in raw material. To
produce a soldier, 1 hour of carpentry labor is required, as well as 2 hours of finishing.
Each train is sold for €21 and costs €19 in raw material. To produce a train, it takes
1 hour of carpentry labor and 1 hour of finishing. Geppetto has two carpenters and
two finishing specialists each working 40 hours per week. He himself puts in 20 hours
per week on finishing work. The trains are popular, and he knows that he can always
sell his entire production. However, he is not able to sell more than 40 soldiers per
week. What should Geppetto do to optimize his income?
The modeling process is organized in three steps.
Decision variables Geppetto’s strategy is to determine the number of soldiers and
trains to produce per week. Thus, we define two decision variables:
• x1 is the number of soldiers to produce per week,
• x2 is the number of trains to produce per week.
Objective function The objective of Geppetto is to make as much money as pos-
sible. The quantity to maximize is the profit. Geppetto’s gains consist of the
income from the sales of the toys, minus the cost of the raw material. Geppetto’s
income is the sum of the income from the sales of the soldiers and the sales of the
trains. The income (in €) from the sales of the soldiers is the number of soldiers
sold multiplied by the selling price of one soldier, i.e., 27 x1 . The income from
the sales of the trains is 21 x2 . The total income for Geppetto is 27 x1 + 21 x2 .
Similarly, we evaluate the material costs to 24 x1 + 19 x2 . The gain is
f(x) = 27 x1 + 21 x2 − (24 x1 + 19 x2) ,  (1.23)
or
f(x) = 3x1 + 2x2 .  (1.24)
Constraints The production is subject to three main constraints. First, the available
labor does not allow more than 100 finishing hours (performed by two workers and
Geppetto) and 80 hours of carpentry (performed by two carpenters). Furthermore,
to avoid unsold objects, they should not produce more than 40 soldiers per week.
The number of hours per week of finishing is the number of hours of finishing
for the soldiers, multiplied by the number of soldiers produced, plus the hours of
finishing for the trains, multiplied by the number of trains produced. This number
may not exceed 100, and we can express the following constraint:
2 x1 + x2 ≤ 100 .  (1.25)
Similarly, the number of carpentry hours may not exceed 80:
x1 + x2 ≤ 80 .  (1.26)
Finally, the constraint to avoid unsold products can simply be written as:
x1 ≤ 40 . (1.27)
At this stage, it seems that all the constraints of the problem have been described
mathematically. However, it is necessary to add constraints that are, although
apparently trivial at the application level, essential to the validity of the mathe-
matical model. These constraints specify the nature of the decision variables. In
the case of Geppetto, it is not conceivable to produce parts of trains or soldiers.
The decision variables must absolutely take integer values, so that
x1 ∈ N , x2 ∈ N . (1.28)
We combine the different stages of the modeling to obtain the following optimiza-
tion problem:
max_{x∈N²} f(x) = 3 x1 + 2 x2  (1.29)
subject to
2x1 + x2 ≤ 100
x1 + x2 ≤ 80  (1.30)
x1 ≤ 40.
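Problem (1.29)-(1.30) is small enough for exhaustive enumeration of the integer plans (a Python sketch; the book treats such discrete problems systematically in Part VII):

```python
# Sketch: enumerate all integer production plans satisfying (1.30)
# and keep the one maximizing the profit (1.29).
def best_plan():
    best = (0, 0, 0)                       # (profit, soldiers, trains)
    for x1 in range(41):                   # x1 <= 40
        for x2 in range(81 - x1):          # x1 + x2 <= 80
            if 2 * x1 + x2 <= 100:         # finishing hours
                profit = 3 * x1 + 2 * x2
                if profit > best[0]:
                    best = (profit, x1, x2)
    return best

print(best_plan())
```

The optimal plan produces 20 soldiers and 60 trains, for a weekly profit of €180.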
1.2 Problem transformations
Even though the modeling step is completed, we are not yet out of the woods. Indeed,
there are many ways to write a given problem mathematically. Algorithms and
software often require a special formulation based on which they solve the problem.
In this chapter, we study techniques that help us comply with these requirements.
Obtaining the mathematical formulation of a problem does not necessarily end the
modeling process. In fact, the obtained formulation may not be adequate. In particu-
lar, the software capable of solving optimization problems often require the problems
to be formulated in a specific way, not necessarily corresponding to the result of
the approach described in Section 1.1. We review some rules for transforming an
optimization problem into another equivalent problem.
1.2.1 Simple transformations
1. Applying a strictly increasing function to the objective does not change the optimal
solutions. Consider the problem
min_{x∈X⊆R^n} f(x) ,
where X is a subset of R^n. Consider a function g : R → R that is strictly
increasing on Im(f) = {z | ∃x ∈ X such that z = f(x)}, i.e., for any z1, z2 ∈ Im(f),
g(z1) > g(z2) if and only if z1 > z2. Thus,
argmin_{x∈X⊆R^n} f(x) = argmin_{x∈X⊆R^n} g(f(x)) .  (1.33)
Similarly, if the function f generates only positive values, taking the logarithm of
the objective function or taking its square does not change its solution:
argminx∈X ⊆ Rn f(x) = argminx∈X ⊆ Rn log f(x) , (1.34)
Ce document est la propriété exclusive de Kavyaa Kannan ([email protected]) - jeudi 18 avril 2024 à 07h48
and
argminx∈X⊆Rn f(x) = argminx∈X⊆Rn f(x)² , (1.35)
as g(x) = x² is strictly increasing for x ≥ 0. Note that the log transformation is
typically used in the context of maximum likelihood estimation of unknown parameters
in statistics, where the objective function is a probability and, therefore,
is positive. The square transformation is relevant in particular when f(x) is expressed
as a square root (see the example on Indiana Jones in Section 1.1.6). In this case, the
square root can be omitted.
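These invariance properties are easy to verify numerically. A small sketch (using an arbitrary positive function chosen for illustration, not one from the text) compares the minimizer of f, of log f, and of f² on a grid:

```python
import math

# Grid search showing that a strictly increasing transformation of a
# positive objective (here log and square) leaves the argmin unchanged.
def f(x):
    return (x - 3.0) ** 2 + 1.0   # positive everywhere, minimum at x = 3

grid = [i / 100.0 for i in range(-500, 501)]  # x in [-5, 5]

argmin_f   = min(grid, key=f)
argmin_log = min(grid, key=lambda x: math.log(f(x)))
argmin_sq  = min(grid, key=lambda x: f(x) ** 2)

print(argmin_f, argmin_log, argmin_sq)  # all three equal 3.0
```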
2. A maximization problem whose objective function is f(x) is equivalent to a mini-
mization problem whose objective function is −f(x):

argmaxx f(x) = argminx −f(x) , (1.36)

and

maxx f(x) = − minx −f(x) . (1.37)

Similarly, we have

argminx f(x) = argmaxx −f(x) , (1.38)

and

minx f(x) = − maxx −f(x) . (1.39)
x = x+ − x− . (1.42)
In this case, we can meet the requirements of the software and impose x+ ≥ 0 and
x− ≥ 0, without loss of generality.
x = x̃ + a , with x̃ ≥ 0 . (1.43)
Consider, for illustration, the problem

max sin(y) − x² (1.45)
x, y

subject to

6x − y² ≥ 1 (1.46)
x² + y² = 3 (1.47)
x ≥ 2 (1.48)
y ∈ R , (1.49)

and transform it in such a way as to obtain a minimization problem, in which all the
decision variables are non negative, and all constraints are defined by lower inequalities.
We get the following problem:
min (x̃ + 2)² − sin(y+ − y− ) (1.50)
x̃, y+ , y−

subject to

−6(x̃ + 2) + (y+ − y− )² + 1 ≤ 0 (1.51)
(x̃ + 2)² + (y+ − y− )² − 3 ≤ 0 (1.52)
−(x̃ + 2)² − (y+ − y− )² + 3 ≤ 0 (1.53)
x̃ ≥ 0 (1.54)
y+ ≥ 0 (1.55)
y− ≥ 0 , (1.56)
where
(1.50) is obtained by applying (1.37) to (1.45),
(1.51) is obtained by applying (1.40) to (1.46),
(1.52) and (1.53) are obtained by applying (1.41) to (1.47),
(1.54) is obtained by applying (1.43) to (1.48),
(1.55) and (1.56) are obtained by applying (1.42) to (1.49).
Note that the transformed problem has more decision variables (3 instead of 2 for the
original problem) and more constraints (6 instead of 3 for the original problem).
When the solution (x̃∗ , y+∗ , y−∗ ) of (1.50)–(1.56) is available, it is easy to obtain
the solution to the original problem (1.45)–(1.49) by applying the inverse transfor-
mations, i.e.,

x∗ = x̃∗ + 2
(1.57)
y∗ = y+∗ − y−∗ .
• The slack variable y is introduced directly in the specification, and its value is
explicitly restricted to be non negative:

g(x) ≤ 0 ⇐⇒ g(x) + y = 0 , y ≥ 0 . (1.58)
The above specification does not completely eliminate inequality constraints. How-
ever, it simplifies considerably the nature of these constraints.
• The slack variable is introduced indirectly using a specification enforcing its non
negativity. For example, the slack variable can be defined as y = z2 , and
g(x) ≤ 0 ⇐⇒ g(x) + z2 = 0. (1.59)
• The slack variable can also be introduced indirectly using an exponential, that is,
y = exp(z), and
g(x) ≤ 0 ⇐⇒ g(x) + ez = 0. (1.60)
The limitation of this approach is that there is no value of z such that g(x)+ez = 0
when the constraint is active, that is when g(x) = 0. Strictly speaking, the two
specifications are not equivalent. However, the slack variable exp(z) can be made
as close to zero as desired by decreasing the value of z, as
lim ez = 0. (1.61)
z→−∞
So, loosely speaking, we can say that the two specifications are “asymptotically
equivalent.”
The slack variable (z2 , ez or y) thus introduced measures the distance between the
constraint and the point x. The special status of such variables can be effectively
exploited for solving optimization problems.
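The three ways of introducing a slack variable can be checked on a toy constraint; the choice g(x) = x − 4 below is an arbitrary illustration, not taken from the text.

```python
import math

# Three ways to turn g(x) <= 0 into an equality with a slack, checked
# numerically for g(x) = x - 4 at a strictly feasible point x = 1.
x = 1.0
g = x - 4.0            # g(x) = -3 < 0: the constraint is inactive

# (1.58): explicit non-negative slack y
y = -g
assert g + y == 0 and y >= 0

# (1.59): slack written as a square, y = z**2
z = math.sqrt(-g)
assert abs(g + z ** 2) < 1e-12

# (1.60): slack written as an exponential, y = exp(z); only possible
# when g(x) < 0 strictly, since exp(z) > 0 for every z
z = math.log(-g)
assert abs(g + math.exp(z)) < 1e-12
print("all three slack formulations agree")
```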
subject to

sin x1 ≤ π/2 (1.63)
ln(ex1 + ex2 ) ≥ √e (1.64)
x1 − x2 ≤ 100 . (1.65)
We introduce the slack variable z1 for the constraint (1.63), the slack variable z2
for the constraint (1.64), and the slack variable y3 for the constraint (1.65). The
obtained optimization problem is
subject to

sin x1 + z1² = π/2 (1.67)
ln(ex1 + ex2 ) − ez2 = √e (1.68)
x1 − x2 + y3 = 100 (1.69)
y3 ≥ 0 . (1.70)
Note that the objective function is not affected by the introduction of slack variables.
1.3 Hypotheses
The methods and algorithms presented in this book are not universal. Each approach
is subject to assumptions about the structure of the underlying problem. Specific
assumptions are discussed for each method. However, there are three important
assumptions that concern (almost) the whole book: the continuity hypothesis, the
differentiability hypothesis, and the determinism hypothesis.
The continuity hypothesis consists in only considering problems for which the
objective to optimize and the constraints are modeled by continuous functions of
decision variables. This hypothesis excludes problems with integer variables (see
discussion in Section 1.1.7). Such problems are treated by discrete optimization
or integer programming.3 An introduction to discrete optimization is provided in
Part VII of this book. We also refer the reader to Wolsey (1998) for an introduction to
combinatorial optimization, as well as to Schrijver (2003), Bertsimas and Weismantel
(2005), and Korte and Vygen (2007).
The differentiability hypothesis (which obviously implies the continuity hypoth-
esis) also requires that the functions involved in the model are differentiable. Non
differentiable optimization is the subject of books such as Boyd and Vandenberghe
(2004), Bonnans et al. (2006), and Dem’Yanov et al. (2012).
The determinism hypothesis consists in ignoring possible errors in the data for the
problem. Measurement errors, as well as modeling errors, can have a non negligible
impact on the outcome. Stochastic optimization (Birge and Louveaux, 1997) enables
the use of models in which some pieces of data are represented by random variables.
Robust optimization (see Ben-Tal and Nemirovski, 2001 and Ben-Tal et al., 2009)
produces solutions that are barely modified by slight disturbances in the data of the
problem.
3 The term “programming”, used in the sense of optimization, was introduced during the Second
World War.
min f(x) (1.71)
x

subject to

h(x) = 0 (1.72)
g(x) ≤ 0 (1.73)

and

x ∈ X , (1.74)
Definition 1.5 (Local minimum). Let Y = {x ∈ Rn | h(x) = 0 , g(x) ≤ 0 and x ∈ X}
be the feasible set, that is the set of vectors satisfying all constraints. The vector
x∗ ∈ Y is called a local minimum of the problem (1.71)–(1.74) if there exists ε > 0
such that

f(x∗ ) ≤ f(x) , ∀x ∈ Y such that ‖x − x∗ ‖ < ε . (1.75)
Definition 1.6 (Strict local minimum). Let Y = {x ∈ Rn | h(x) = 0 , g(x) ≤
0 and x ∈ X} be the feasible set, that is the set of vectors satisfying all the constraints.
The vector x∗ ∈ Y is called a strict local minimum of the problem (1.71)–(1.74) if
there exists ε > 0 such that

f(x∗ ) < f(x) , ∀x ∈ Y such that x ≠ x∗ and ‖x − x∗ ‖ < ε . (1.76)
Definition 1.7 (Global minimum). Let Y = {x ∈ Rn | h(x) = 0 , g(x) ≤ 0 and x ∈
X} be the feasible set, that is the set of vectors satisfying all the constraints. The
vector x∗ ∈ Y is called a global minimum of the problem (1.71)–(1.74) if

f(x∗ ) ≤ f(x) , ∀x ∈ Y . (1.77)
Definition 1.8 (Strict global minimum). Let Y = {x ∈ Rn | h(x) = 0 , g(x) ≤
0 and x ∈ X} be the feasible set, that is the set of vectors satisfying all the constraints.
The vector x∗ ∈ Y is called a strict global minimum of the problem (1.71)–(1.74) if

f(x∗ ) < f(x) , ∀x ∈ Y such that x ≠ x∗ . (1.78)
The notions of local maximum, strict local maximum, global maximum and strict
global maximum are defined in a similar manner.
Example 1.9 (Local minimum and maximum). Figure 1.4 illustrates these defini-
tions for the function

f(x) = −(5/6) x³ + (7/2) x² − (11/3) x + 3 . (1.79)

The point x∗ ≃ 0.6976 is a local minimum of f. Indeed, there is an interval
(x∗ − 1/2 , x∗ + 1/2), represented by the dotted lines, such that f(x∗ ) ≤ f(x) for any
x in the interval. Similarly, the point x̃∗ ≃ 2.1024 is a local maximum. Indeed, there
exists an interval (x̃∗ − 1/2 , x̃∗ + 1/2), represented by the dotted lines, such that
f(x̃∗ ) ≥ f(x) for any x in the interval.
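The two points of Example 1.9 can be recovered by solving f′(x) = 0 for the cubic (1.79); a short sketch:

```python
import math

# Critical points of f(x) = -(5/6)x**3 + (7/2)x**2 - (11/3)x + 3 from
# Example 1.9, found by solving f'(x) = -(5/2)x**2 + 7x - 11/3 = 0.
a, b, c = -5.0 / 2.0, 7.0, -11.0 / 3.0
disc = math.sqrt(b * b - 4 * a * c)
x_min = (-b + disc) / (2 * a)   # smaller root: local minimum
x_max = (-b - disc) / (2 * a)   # larger root: local maximum

def f2(x):                      # second derivative f''(x) = -5x + 7
    return -5.0 * x + 7.0

print(round(x_min, 4), round(x_max, 4))  # 0.6976 2.1024
assert f2(x_min) > 0 and f2(x_max) < 0   # a minimum, then a maximum
```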
hi (x) = xi (1 − xi ) = 0 , i = 1, . . . , n .
Figure 1.4: Local minimum and maximum for Example 1.9
We thus get n differentiable constraints and the hypotheses are satisfied. However,
since each feasible point is isolated, each of them is a local minimum of the problem.
Therefore, algorithms for continuous optimization designed to identify local minima
are useless for this type of problem.
f(x) ≥ M , ∀x ∈ Y . (1.80)
and
inf f(y) = 0 .
y∈Y
This example shows that an optimization problem is not always well defined, and
that no solution may exist. Theorem 1.14 identifies cases where the problem is well-
defined, and where the infimum of a function on a set is reached by at least one
point of the set. In this case, the point is a global minimum of the corresponding
optimization problem.
If Y is compact, it is bounded and the sequence has at least one limit point x∗ . If f is
coercive (Definition B.22), the sequence (xk )k is bounded and has at least one limit
point x∗ . In both cases, due to the lower semi-continuity of f in x∗ , we have

f(x∗ ) ≤ lim inf f(xk ) = inf f(y) ,
k→∞ y∈Y

and, as x∗ belongs to Y,

f(x∗ ) = inf f(y) .
y∈Y
Note that some functions may include local minima and have no global minimum,
as for Example 1.9, shown in Figure 1.4, where the function (1.79) is unbounded from
below, as

lim (−(5/6) x³ + (7/2) x² − (11/3) x + 3) = −∞ .
x→+∞
In some cases, an optimization problem with constraints can be simplified and
the constraints ignored. Before detailing such simplifications in Section 3.1, we state
now a general theoretical result when the optimum is an interior point of the set of
constraints.
min f(x).
x∈Y2
1.5 Exercises
Exercise 1.1 (Geometry). We want to determine a parallelepiped with unit volume
and minimal surface area.
1. Formulate this problem as an optimization problem by determining
(a) the decision variables,
(b) the objective function,
(c) the constraint(s).
2. Formulate this problem as a minimization problem with only lower inequality
constraints.
Exercise 1.2 (Finance). A bank wants to determine how to invest its assets for the
year to come. Currently, the bank has a million euros that it can invest in bonds, real
estate loans, leases or personal loans. The annual interest rates of the different
investment types are 6% for bonds, 10% for real estate loans, 8% for leases, and 13%
for personal loans.
To minimize risk, the portfolio selected by the bank must satisfy the following
restrictions:
• The amount allocated to personal loans must not exceed half of that invested in
bonds.
• The amount allocated to real estate loans must not exceed that allocated to leases.
• At most 20% of the total invested amount can be allocated to personal loans.
1. Formulate this problem as an optimization problem by determining
(a) the decision variables,
(b) the objective function,
(c) the constraint(s).
2. Formulate this problem as a minimization problem with only lower inequality
constraints.
3. Formulate this problem as a maximization problem with equality constraints and
with non negativity constraints on the decision variables.
Exercise 1.3 (Stock management). The company Daubeliou sells oil and wants to
optimize the management of its stock. The annual demand is estimated at 100,000
liters and is assumed to be homogeneous throughout the year. The cost of storage
is e 40 per thousand liters per year. When the company orders oil to replenish its
stock, this costs e 80. Assuming that the order arrives instantly, how many orders
must the company Daubeliou place each year to satisfy demand and minimize costs?
Formulate this problem as an optimization problem by determining
1. the decision variables,
2. the objective function,
3. the constraint(s).
Advice:
• Set the amount of oil to order each time as a decision variable.
• Represent in a graph the evolution of the stock as a function of time.
Exercise 1.4 (Measurement errors). Mr. O. Beese is obsessed with his weight. He
owns 10 scales and weighs himself each morning on each of them. This morning he
got the following measurement results:
He wishes to determine an estimate of his weight while minimizing the sum of the
squares of measurement errors from the 10 scales. Formulate the optimization prob-
lem that he needs to solve.
Exercise 1.5 (Congestion). Every day, 10,000 people commute from Divonne to
Geneva. By train, the journey takes 40 minutes. By road, the travel time depends on
the level of congestion. It takes 20 minutes when the highway is completely deserted.
When there is traffic, travel time increases by 5 minutes per thousand people using
the highway (assuming that there is one person per car). If 4,000 people take the car
and 6,000 take the train, the travel time by road is equal to 20 + 5 × 4 = 40 minutes,
which is identical to the travel time by train. In this situation, it would be of no
interest to change one’s mode of transport. We say that the system is at equilibrium.
However, from the point of view of the average travel time per person, is this situation
optimal? Formulate an optimization problem to answer the question.
Chapter 2
Objective function
Contents
2.1 Convexity and concavity . . . . . . . . . . . . . . . . . . . 29
2.2 Differentiability: the first order . . . . . . . . . . . . . . . 31
2.3 Differentiability: the second order . . . . . . . . . . . . . 39
2.4 Linearity and non linearity . . . . . . . . . . . . . . . . . . 42
2.5 Conditioning and preconditioning . . . . . . . . . . . . . 45
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Definition 2.1 is illustrated in Figure 2.1. The line segment connecting the points
(x, f(x)) and (y, f(y)) is never below the graph of f. The point z = λx + (1 − λ)y is
somewhere between x and y when 0 ≤ λ ≤ 1. The point with coordinates (z, λf(x) +
(1 − λ)f(y)) is on the line segment between the points (x, f(x)) and (y, f(y)). In order
for the function to be convex, this point must never (i.e., for all x, y and 0 ≤ λ ≤ 1)
be below the graph of the function.
Figure 2.1: Illustration of Definition 2.1
Note that convexity and concavity are not complementary properties. A function
may be neither convex nor concave. This is the case of the function represented in
Figure 2.2.
If the partial derivatives ∂f(x)/∂xi exist for all i, the gradient of f is defined as
follows.

∇f(x) = ( ∂f(x)/∂x1 , . . . , ∂f(x)/∂xn )T . (2.5)
The gradient plays a key role in the development and analysis of optimization
algorithms.
Example 2.6 (Gradient). Consider f(x1 , x2 , x3 ) = ex1 + x1² x3 − x1 x2 x3 . The gradient
of f is given by

∇f(x1 , x2 , x3 ) = ( ex1 + 2x1 x3 − x2 x3 , −x1 x3 , x1² − x1 x2 )T . (2.6)
The analysis of the behavior of the function in certain directions is also important
for optimization methods. We introduce the concept of a directional derivative.
∇f(x)T d . (2.8)
this book, we deal with continuous differentiable functions for which a distinction is
unnecessary.
Example 2.9 (Directional derivative). Consider f(x1 , x2 , x3 ) = ex1 + x1² x3 − x1 x2 x3 ,
and

d = ( d1 , d2 , d3 )T . (2.9)

The directional derivative of f in the direction d is

( d1 d2 d3 ) ∇f(x1 , x2 , x3 ) = d1 (ex1 + 2x1 x3 − x2 x3 ) − d2 x1 x3 + d3 (x1² − x1 x2 ) , (2.10)
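Numerically, (2.10) agrees with the one-sided slope (f(x + αd) − f(x))/α for small α; the point and direction below are arbitrary choices for illustration:

```python
import math

# The directional derivative of Example 2.9 equals grad(f)(x)^T d; it is
# compared here with the slope (f(x + a*d) - f(x)) / a for a small step a.
def f(x1, x2, x3):
    return math.exp(x1) + x1 ** 2 * x3 - x1 * x2 * x3

x1, x2, x3 = 1.0, 2.0, 3.0
d1, d2, d3 = 0.5, -1.0, 2.0               # an arbitrary direction

exact = (d1 * (math.exp(x1) + 2 * x1 * x3 - x2 * x3)
         - d2 * x1 * x3
         + d3 * (x1 ** 2 - x1 * x2))      # formula (2.10)

alpha = 1e-6
slope = (f(x1 + alpha * d1, x2 + alpha * d2, x3 + alpha * d3)
         - f(x1, x2, x3)) / alpha
assert abs(slope - exact) < 1e-3
print(exact)
```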
Proof. We use (C.1) from Taylor's theorem (Theorem C.1) to evaluate the function
in x + αd by employing the information in x. We have

f(x + αd) = f(x) + αdT ∇f(x) + o(α‖d‖) (Taylor)
= f(x) + αdT ∇f(x) + o(α) (‖d‖ does not depend on α).
The result follows from the fact that dT ∇f(x) < 0 by assumption, and that o(α) can
be made as small as needed. More formally, according to the definition of the Landau
notation o( · ) (Definition B.17), for any ε > 0, there exists η such that
|o(α)|/α < ε , ∀ 0 < α ≤ η .

Choosing ε = −dT ∇f(x) > 0 then yields

o(α)/α ≤ |o(α)|/α < −dT ∇f(x) , ∀ 0 < α ≤ η , (2.15)
and f(x + αd) < f(x), ∀0 < α ≤ η. The result (2.14) is obtained in a similar manner,
by choosing ε = (β − 1)∇f(x)T d in the definition of the Landau notation. We have
ε > 0 because β < 1 and ∇f(x)T d < 0.
Among all directions d from a point x, the one in which the slope is the steepest
is the direction of the gradient ∇f(x). To show this, we consider all directions d that
have the same norm (the same length) as the gradient and compare the directional
derivative for each of them.
= d∗T ∇f(x) (definition of d∗ ).

Since d∗T ∇f(x) = ‖∇f(x)‖² ≥ 0, the function increases in the direction d∗ , which
corresponds to the steepest ascent.
If the direction of the gradient corresponds to the steepest ascent of the function
at x, we need only consider the direction opposite the gradient, −∇f(x), to find the
steepest descent.
and the direction opposite the gradient is that in which the function has its
steepest descent.
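A sampling experiment illustrates this: among directions of equal norm on a circle, the directional derivative is largest in the direction of the gradient. The function x1²/2 + 2x2² and the point (1, 1)T, whose gradient is (1, 4)T, are those used in Figure 2.3.

```python
import math

# Among all directions of equal norm, the directional derivative
# d^T grad f(x) is maximal for d = grad f(x): a sampling check for
# f(x1, x2) = x1**2 / 2 + 2 * x2**2 at x = (1, 1), where grad f = (1, 4).
g = (1.0, 4.0)
norm_g = math.hypot(*g)

best_angle, best_slope = None, -float("inf")
for k in range(3600):                       # directions on a circle
    t = 2 * math.pi * k / 3600
    d = (norm_g * math.cos(t), norm_g * math.sin(t))
    slope = d[0] * g[0] + d[1] * g[1]       # directional derivative
    if slope > best_slope:
        best_angle, best_slope = t, slope

t_grad = math.atan2(g[1], g[0])             # angle of grad f(x)
assert abs(best_angle - t_grad) < 2 * math.pi / 3600 + 1e-12
print(best_slope)                            # close to ||grad f||^2 = 17
```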
Figure 2.3: Shape of the function x1²/2 + 2x2² at point (1, 1)T in several directions
(d1 = ∇f(x), d2 = (1, 1)T , d3 = (−1, 3)T )
Figure 2.4: Shape of the function x1²/2 + 2x2² at point (1, 1)T in several directions
(d1 = −∇f(x), d2 = (−1, −1)T , d3 = (1, −3)T )
Proof. Necessary condition. We first show that (2.18) is a necessary condition. Let
x, y ∈ X be arbitrary and let us consider d = y − x. We evaluate the directional
derivative of f in the direction d and obtain
the right term is nothing else than the equation of the hyperplane that is tangent to
the function f at point x. In this case, we see that a function is convex if and only
if its graph is never below the tangent hyperplane. Figure 2.5 illustrates this idea in
the case of a function of one variable.
We conclude this section by generalizing the notion of the gradient to functions
from Rn to Rm . In this case, the various partial derivatives are arranged in a
matrix called the gradient matrix. Each column of this matrix is the gradient of the
corresponding component of f:

           | ∂f1/∂x1   ∂f2/∂x1   · · ·   ∂fm/∂x1 |
  ∇f(x) =  |    ..        ..       ..       ..   |    (2.24)
           | ∂f1/∂xn   ∂f2/∂xn   · · ·   ∂fm/∂xn |
The gradient matrix is often used in its transposed form and is then called the
Jacobian matrix of f:

J(x) = ∇f(x)T = ( ∇f1 (x)T ; . . . ; ∇fm (x)T ) , (2.25)

i.e., the ith row of J(x) is ∇fi (x)T .
is defined by

              | ∂²f(x)/∂x1²       ∂²f(x)/∂x1 ∂x2   · · ·   ∂²f(x)/∂x1 ∂xn |
              | ∂²f(x)/∂x2 ∂x1    ∂²f(x)/∂x2²      · · ·   ∂²f(x)/∂x2 ∂xn |
  ∇²f(x) =    |      ..                 ..           ..          ..       |    (2.27)
              | ∂²f(x)/∂xn ∂x1    ∂²f(x)/∂xn ∂x2   · · ·   ∂²f(x)/∂xn²    |
Note that the Hessian of f is the gradient matrix and the Jacobian matrix of ∇f.
Example 2.20 (Hessian). Consider f(x1 , x2 , x3 ) = ex1 + x1² x3 − x1 x2 x3 . The Hessian
of f is given by

                       | ex1 + 2x3    −x3    2x1 − x2 |
  ∇²f(x1 , x2 , x3 ) = |    −x3        0       −x1    | .    (2.28)
                       | 2x1 − x2     −x1       0     |
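The Hessian (2.28) can be checked column by column as the finite-difference Jacobian of the gradient; the evaluation point is arbitrary:

```python
import math

# Finite-difference check of the Hessian in Example 2.20, obtained as
# the Jacobian of the gradient of f(x1,x2,x3) = exp(x1) + x1^2 x3 - x1 x2 x3.
def grad(x):
    x1, x2, x3 = x
    return [math.exp(x1) + 2 * x1 * x3 - x2 * x3, -x1 * x3, x1 ** 2 - x1 * x2]

def hess(x):
    x1, x2, x3 = x
    return [[math.exp(x1) + 2 * x3, -x3, 2 * x1 - x2],
            [-x3, 0.0, -x1],
            [2 * x1 - x2, -x1, 0.0]]

x, h = [1.0, 2.0, 3.0], 1e-6
H = hess(x)
for j in range(3):
    xp = list(x); xp[j] += h
    xm = list(x); xm[j] -= h
    col = [(gp - gm) / (2 * h) for gp, gm in zip(grad(xp), grad(xm))]
    for i in range(3):
        assert abs(col[i] - H[i][j]) < 1e-4
        assert H[i][j] == H[j][i]            # the Hessian is symmetric
print("Hessian matches finite differences of the gradient")
```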
Just like the gradient, the Hessian gives us information about the convexity of the
function.
f(y) = f(x) + (y − x)T ∇f(x) + (1/2) dT ∇²f(x + αd) d . (2.29)
If 0 < α ≤ 1, x+αd ∈ X by convexity (Definition B.2) of X, and the matrix ∇2 f(x+αd)
is positive semidefinite (Definition B.8). Therefore,

dT ∇²f(x + αd) d ≥ 0 . (2.30)

Then, we have

f(y) ≥ f(x) + (y − x)T ∇f(x) . (2.31)
We now need only invoke Theorem 2.16 to prove the convexity of f. The strict
convexity is proven in a similar manner.
Since the second derivative of a function of one variable gives us curvature informa-
tion, (2.34) provides us with information about the curvature of the function f in
the direction d. In particular, when α = 0, this expression gives information on the
curvature of f in x. To avoid the length of the direction influencing the notion of
curvature, it is important to normalize. We obtain the following definition.
dT ∇²f(x) d / dT d (2.35)

represents the curvature of the function f in x in the direction d.
In linear algebra, the quantity (2.35) is often called the Rayleigh quotient of ∇²f(x)
in the direction d. One should immediately note that the curvature of the function
in x in the direction −d is exactly the same as in the direction d.
Example 2.23 (Curvature). Consider f(x) = x1²/2 + 2x2² , and x = (1, 1)T , as in
Examples 2.14 and 2.15. The curvature of the function in different directions is given
below:

d             −d             dT ∇²f(x)d / dT d
( 1 4 )T      ( −1 −4 )T     3.8235
( 1 1 )T      ( −1 −1 )T     2.5
( −1 3 )T     ( 1 −3 )T      3.7
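The table can be reproduced directly from the Rayleigh quotient with ∇²f(x) = diag(1, 4):

```python
# Curvature (Rayleigh quotient) of f(x1,x2) = x1**2/2 + 2*x2**2, whose
# Hessian is the constant matrix diag(1, 4), in the directions of Example 2.23.
def curvature(d):
    d1, d2 = d
    num = 1.0 * d1 * d1 + 4.0 * d2 * d2   # d^T H d with H = diag(1, 4)
    return num / (d1 * d1 + d2 * d2)      # divided by d^T d

print(round(curvature((1, 4)), 4))    # 3.8235
print(curvature((1, 1)))              # 2.5
print(curvature((-1, 3)))             # 3.7
assert curvature((1, 4)) == curvature((-1, -4))   # same in d and -d
```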
When a constant term is added to a linear function, the result is said to be affine.
Note that minimizing (2.38) is equivalent to minimizing (2.36). Note that all
linear functions are affine. By abuse of language, a non linear function is actually a
function that is not affine.
Definition 2.26 (Non linear function). Any function that is not affine is said to be
non linear.
The set of non linear functions is vast, and one needs to be a little more precise
in their characterization. Intuitively, the function shown in Figure 2.8 seems more
non linear than the one in Figure 2.9. The slope of the former changes quickly with
x, which is not the case for the latter. This is formally captured by the Lipschitz
continuity (Definition B.16) of the gradient.
Figure 2.8: Example of a non linear function, f(x) = e^cos(e^(−x))
Figure 2.9: Example of another non linear function, f(x) = x²/100 + 3x + 1
Intuitively, the definition says that the slopes of the function at two close points
are close as well, all the more so when M is small. Actually, when f is linear, the
slope is the same at any point, and (2.40) is verified with M = 0. The value of M
for the function represented in Figure 2.9 is low, while it is large for the function
illustrated in Figure 2.8, where the slope varies dramatically with small modifications
of x.
The constant M can be interpreted as an upper bound on the curvature of the
function. The greater M is, the larger the curvature may be. If M = 0, the curvature
is zero, and the function is linear. Note that this constant is essentially theoretical
and that it is generally difficult to obtain a value for it.
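Although M is rarely known exactly, sampling the ratio ‖∇f(x) − ∇f(y)‖ / ‖x − y‖ gives a lower bound on it. For the one-dimensional function of Figure 2.9, whose derivative x/50 + 3 has constant slope, the bound is attained: M = 1/50.

```python
# A sampled lower bound on the Lipschitz constant M of the gradient,
# for the function f(x) = x**2/100 + 3*x + 1 of Figure 2.9, whose
# derivative f'(x) = x/50 + 3 changes at the constant rate 1/50.
def fprime(x):
    return x / 50.0 + 3.0

pairs = [(-10.0 + i, -10.0 + j) for i in range(21) for j in range(21) if i != j]
M_est = max(abs(fprime(x) - fprime(y)) / abs(x - y) for x, y in pairs)
print(M_est)   # approximately 0.02 = 1/50: a small M, the function is "almost linear"
```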
Among the non linear functions, quadratic functions play an important role in
optimization algorithms.
Ad1 = λ1 d1 . (2.45)
λ1 = dT1 A d1 / dT1 d1 . (2.46)
quadratic function with two dimensions (Figure 2.11(a)), this translates into nearly
circular level curves (Figure 2.11(b)).
Example 2.31 (Conditioning). The quadratic function
i.e.,

x1 = (1/2) x1′ , x2 = (√2/6) x2′ . (2.50)

We obtain

f(x1′ , x2′ ) = (1/2) x1′² + (1/2) x2′² , (2.51)

for which the Hessian is the identity matrix, and the condition number is 1, for all
(x1′ , x2′ ).
f̃(x′ ) = f(M−1 x′ ) ,
∇f̃(x′ ) = M−T ∇f(M−1 x′ ) , (2.52)
∇²f̃(x′ ) = M−T ∇²f(M−1 x′ ) M−1 = M−T ∇²f(x) M−1 .
Figure: (a) the function; (b) its level curves
In the context of optimization, the matrix for the change of variables should be
positive definite in order to preserve the nature of the problem.
Figure: (a) the function; (b) its level curves
x′ = LT x ⇐⇒ x = L−T x′ . (2.54)

In this case, ∇²f̃(x′ ) = L−1 ∇²f(x) L−T = I, and the conditioning of the function f̃
in x′ is 1. According to Definition 2.29, κ2 (∇²f̃(x′ )) ≥ 1, so that the obtained
conditioning is the best possible. The best preconditioning of f in x consists in defining
the change of variables based on the Cholesky factorization ∇²f(x) = LLT .
Example 2.34 (Preconditioning). Consider

f(x1 , x2 ) = (1/2) x1² + (25/2) x2² + 3 x1 x2 − 12 x1 − √π x2 − 6 . (2.55)

We have

∇f(x1 , x2 ) = ( x1 + 3x2 − 12 , 3x1 + 25x2 − √π )T (2.56)

and

  ∇²f(x1 , x2 ) = | 1   3 |  =  | 1  0 | | 1  0 |T .    (2.57)
                  | 3  25 |     | 3  4 | | 3  4 |

We define

x1′ = x1 + 3x2 , x2′ = 4x2 , (2.58)

i.e.,

x1 = x1′ − (3/4) x2′ , x2 = (1/4) x2′ . (2.59)

We obtain

f̃(x1′ , x2′ ) = (1/2) (x1′ − (3/4) x2′ )² + (25/2) ((1/4) x2′ )²
+ 3 (x1′ − (3/4) x2′ ) (1/4) x2′ − 12 (x1′ − (3/4) x2′ ) − √π (1/4) x2′ − 6 (2.60)
= (1/2) x1′² + (1/2) x2′² − 12 x1′ + (9 − √π/4) x2′ − 6 .
It is easy to verify that ∇2 f̃(x1′ , x2′ ) = I. Note that there are no longer any crossed
terms in x1′ x2′ .
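A direct numerical check of Example 2.34 (the 2×2 matrices are small enough to handle by hand):

```python
# Preconditioning of Example 2.34: the Hessian H factors as L L^T
# (Cholesky), and the change of variables x' = L^T x turns the Hessian
# of the transformed function into the identity: L^{-1} H L^{-T} = I.
H = [[1.0, 3.0], [3.0, 25.0]]
L = [[1.0, 0.0], [3.0, 4.0]]              # Cholesky factor of H

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# check H = L L^T
LT = [[L[j][i] for j in range(2)] for i in range(2)]
assert matmul(L, LT) == H

# inverse of the lower-triangular L (2x2 case, computed by hand)
Linv = [[1.0, 0.0], [-3.0 / 4.0, 1.0 / 4.0]]
LinvT = [[Linv[j][i] for j in range(2)] for i in range(2)]

I = matmul(matmul(Linv, H), LinvT)        # Hessian after the change of variables
print(I)                                   # [[1.0, 0.0], [0.0, 1.0]]
```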
2.6 Exercises
Exercise 2.1. Among the following functions, which are convex and which are con-
cave? Justify your answer.
1. f(x) = 1 − x² .
2. f(x) = x² − 1 .
3. f(x1 , x2 ) = √(x1² + x2²) .
4. f(x) = x³ .
5. f(x1 , x2 , x3 ) = sin(a) x1 + cos(b) x2 + e−c x3 , a, b, c ∈ R .
Exercise 2.2. For each of the following functions:
• Calculate the gradient.
• Calculate the Hessian.
• Specify (and justify) whether the function is convex, concave, or neither.
• Calculate the curvature of the function in a direction d at the specified point x̄.
• Make a change of variables to precondition the function, using the Hessian at the
specified point x̄. Please note that the matrix for a change of variables must be
positive definite.
1. f(x1 , x2 ) = (1/2) x1² + (9/2) x2² , x̄ = (0, 0)T .
2. f(x1 , x2 ) = (1/3) x1³ + x2³ − x1 − x2 , x̄ = (9, 1)T .
3. f(x1 , x2 ) = (x1 − 2)⁴ + (x1 − 2)² x2² + (x2 + 1)² , x̄ = (2, −1)T .
Chapter 3
Constraints
Life would be easier without constraints. Or would it? In this chapter, we investigate
ways to remove some of them, or even all of them. And when some remain, they need
to be properly understood in order to be verified. As algorithms need to move along
directions that are compatible with the constraints, such directions are characterized
in various contexts. We put a special emphasis on linear constraints, for which the
analysis simplifies significantly.
Contents
3.1 Active constraints . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Linear independence of the constraints . . . . . . . . . . 56
3.3 Feasible directions . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.1 Convex constraints . . . . . . . . . . . . . . . . . . . . . . 60
3.3.2 Constraints defined by equations-inequations . . . . . . . 62
3.4 Elimination of constraints . . . . . . . . . . . . . . . . . . 75
3.5 Linear constraints . . . . . . . . . . . . . . . . . . . . . . . 78
3.5.1 Polyhedron . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5.2 Basic solutions . . . . . . . . . . . . . . . . . . . . . . . . 83
3.5.3 Basic directions . . . . . . . . . . . . . . . . . . . . . . . . 87
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
min x2 (3.1)
x∈R
subject to
x ≤ 4
(3.2)
x ≥ −10.
It is illustrated in Figure 3.1, where the inequality constraints are represented by
vertical lines, associated with an arrow pointed towards the feasible domain. The
solution to this problem is x∗ = 0. In fact, one could also choose to ignore the
constraints and still obtain the same solution. We say that the constraints are inactive
at the solution. When using the notation (1.73), the problem can be written as
min x2 (3.3)
x∈R
subject to
g1 (x) = x−4 ≤ 0
(3.4)
g2 (x) = −x − 10 ≤ 0.
We have g1 (x∗ ) = −4 and g2 (x∗ ) = −10, and g1 (x∗ ) < 0 and g2 (x∗ ) < 0. The fact
that the inequality constraints are strictly verified characterizes inactive constraints.
Figure 3.1: The constraints of the problem (3.3)–(3.4) are inactive at the solution
x∗ = 0
Consider now the problem

min x² (3.5)
x∈R

subject to

x ≤ 4
(3.6)
x ≥ 1.
It is illustrated in Figure 3.2, where the inequality constraints are represented by
vertical lines, associated with an arrow pointed towards the feasible domain. The
solution to this problem is x∗ = 1. In this case, we can ignore the constraint x ≤ 4.
However, the constraint x ≥ 1 cannot be ignored without modifying the solution. It
is said to be active. Using the notation (1.73), the problem can be written as
min x2 (3.7)
x∈R
subject to
g1 (x) = x−4 ≤ 0
(3.8)
g2 (x) = 1−x ≤ 0.
We have g1 (x∗ ) = −3 and g2 (x∗ ) = 0, and g1 (x∗ ) < 0 and g2 (x∗ ) = 0. The first
constraint is verified with strict inequality and is inactive. The second constraint is
verified with equality and is active.
Figure 3.2: The constraint x ≥ 1 of the problem (3.7)–(3.8) is active at the solution
x∗ = 1
gi (x) ≤ 0 (3.9)

is said to be active in x∗ if

gi (x∗ ) = 0 , (3.10)

and inactive in x∗ if

gi (x∗ ) < 0 . (3.11)

By extension, an equality constraint

hi (x) = 0 (3.12)

is active in any feasible point x∗ , as

hi (x∗ ) = 0 . (3.13)
Consider the optimization problem P1

min f(x) (3.14)
x

subject to

g(x) ≤ 0 , (3.15)
x ∈ Y ⊆ Rn , (3.16)

where g : Rn −→ Rm is continuous and Y is a subset of Rn . If x∗ is feasible, i.e.,
g(x∗ ) ≤ 0, and if A(x∗ ) ⊆ {1, . . . , m} is the set of indices of the active constraints
in x∗ , i.e.,

A(x∗ ) = {i | gi (x∗ ) = 0} , (3.17)

we consider the following optimization problem P2

min f(x) (3.18)
x

subject to

gi (x) = 0 , i ∈ A(x∗ ) , (3.19)
x ∈ Y ⊆ Rn . (3.20)

Here, x∗ is a local minimum of P1 if and only if x∗ is a local minimum of P2 .
as illustrated in Figure 3.3. Consider two feasible neighborhoods around x∗ . The first
one is defined as
and contains neighbors of x∗ that are feasible for the problem P1 . The second is
defined as
We show that

Y2 (ε̃) ⊆ Y1 (ε̂) . (3.25)

Indeed, take any x in Y2 (ε̃). In order to show that it belongs to Y1 (ε̂), we need to
show that g(x) ≤ 0, x ∈ Y, and ‖x − x∗ ‖ ≤ ε̂. Since ε̃ ≤ ε̂, we have ‖x − x∗ ‖ < ε̃ ≤ ε̂,
and the third condition is immediately verified. The second condition (x ∈ Y) is
inherited from the definition of Y2 (ε̃). We need only to demonstrate that g(x) ≤ 0.
To do this, consider the constraints of A(x∗ ) separately from the others. By definition
of Y2 , we have gi (x) = 0 for i ∈ A(x∗ ), which implies gi (x) ≤ 0. Take j ∉ A(x∗ ).
Since ε̃ ≤ εj , we have gj (x) < 0 according to (3.21), which implies gj (x) ≤ 0. This
completes the proof that g(x) ≤ 0, so that x ∈ Y1 (ε̂).
As a result of (3.25), since x∗ is the best element (in the sense of the objective
function) of Y1 (^ε) (according to Definition 1.5 of the local minimum), and since x∗
belongs to Y2 (ε̃), it is also the best element of this set, and a local minimum of P2 .
Necessary condition. Let X1 be the set of feasible points of P1 and X2 the set of
feasible points of P2 . We have X2 ⊆ X1 . Let x∗ be a local minimum of P2 . We assume
by contradiction that it is not a local minimum of P1 . Then, for any ε > 0, there
exists x ∈ X1 such that ‖x − x∗ ‖ ≤ ε and f(x) < f(x∗ ). Since x∗ is a local minimum
of P2 , x cannot be feasible for P2 , and x ∉ X2 ⊆ X1 , leading to the contradiction.
Figure 3.3: Neighborhood of radius εj around x∗ in which the inactive constraint gj
remains strictly negative
subject to
Ax = b (3.27)
x ≥ 0 (3.28)
Since the inequality constraints are simple, we analyze in more detail the system
of equations Ax = b. As for any linear system, three possibilities may arise:
• the system is incompatible, and there is no x such that Ax = b;
• the system is underdetermined, and there is an infinite number of x such that
Ax = b;
• the system is non singular, and there is a unique x that satisfies Ax = b.
In an optimization context, incompatible and non singular systems have little rele-
vance because they leave no degree of freedom to optimize any objective. We thus
only consider underdetermined systems.
If the system is compatible, the rank of A (Definition B.29) gives us information
on the relevance of the various constraints. If the rank is deficient, this means that
certain rows of A form a linear combination of the others, and the corresponding
constraints are redundant. This is formalized by Theorem 3.6 and illustrated by
Example 3.7.
Ãx = b̃ ⇐⇒ Ax = b, (3.29)
Proof. Since the rank of A is r, this signifies that m − r rows of A can be written as
linear combinations of r other rows. Without loss of generality, we can assume that
the last m − r rows are linear combinations of the first r rows. By denoting aTk the
kth row of A, we have
ak = ∑j=1,...,r λjk aj , k = r + 1, . . . , m, and there exists j such that λjk ≠ 0. (3.30)
Denote à the matrix composed of the first r rows of A, and b̃ the vector composed
of the first r components of b. Then, ℓi = i, i = 1, . . . , r.
=⇒ Consider x such that Ãx = b̃. According to the definition of Ã, we have that x
satisfies the first r equations of the system, i.e., aTi x = bi for i = 1, . . . , r. Select
k as an arbitrary index between r + 1 and m, and demonstrate that x satisfies the
kth equation as well:
aTk x = ∑j=1,...,r λjk aTj x = ∑j=1,...,r λjk bj = bk according to (3.31).
x1 + x2 + x3 = 1
x1 − x2 + x4 = 1 (3.32)
x1 − 5x2 − 2x3 + 3x4 = 1
i.e.,
A = [1 1 1 0; 1 −1 0 1; 1 −5 −2 3] , b = [1; 1; 1]. (3.33)
The rank of A is equal to 2, but there are 3 rows (the determinant of any square
submatrix of dimension 3 is zero). This means that one of the rows is a linear
combination of the others. Since the system is compatible (for instance, x1 = 2/3,
x2 = 0, x3 = 1/3, x4 = 1/3 is feasible), one of the constraints must be redundant. In
this case, if aTi represents the ith row of A, we have
a3 = −2a1 + 3a2 , (3.34)
so that the third constraint is redundant and the system (3.32) is equivalent to
x1 + x2 + x3 = 1
(3.35)
x1 − x2 + x4 = 1
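The rank computation and the redundancy of the third constraint can be checked numerically. A minimal sketch with numpy (not part of the text), using the matrix and vector of (3.33):

```python
import numpy as np

# Matrix and right hand side of the system (3.33).
A = np.array([[1, 1, 1, 0],
              [1, -1, 0, 1],
              [1, -5, -2, 3]], dtype=float)
b = np.array([1.0, 1.0, 1.0])

# rank(A) = 2 < 3: one of the three rows is a linear combination of the others.
rank = np.linalg.matrix_rank(A)
print(rank)  # 2

# The third row is a3 = -2 a1 + 3 a2, so the third constraint is redundant.
print(np.allclose(A[2], -2 * A[0] + 3 * A[1]))  # True

# A point satisfying the system also satisfies the reduced system (3.35).
x = np.array([2 / 3, 0.0, 1 / 3, 1 / 3])
print(np.allclose(A @ x, b))  # True
```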
or by
∇h(x+ )T x = ∇h(x+ )T x+ − h(x+ ).
Therefore, the gradients of equality constraints play a similar role as the rows of
the matrix A in (3.27). As for the inequality constraints, we saw in Section 3.1 that
those that are inactive at x+ can be ignored, and that those that are active can be
considered equality constraints. Consequently, we can define the condition of linear
independence as follows.
We have
∇g(x) = (2x1 , 2x2 − 2)T and ∇h(x) = (−2x1 , 1)T .
Figure 3.4: Linear independence of the constraints (the curve h(x) = x2 − x21 = 0, with the points xa and xb )
Consider the point xa = (1, 1)T , that is feasible, and for which the constraint
(3.36) is active. We have
∇g(xa ) = (2, 0)T and ∇h(xa ) = (−2, 1)T .
These two vectors are linearly independent, and the linear independence of the con-
straints is satisfied in xa . Figure 3.4 represents the normalized vectors
∇g(xa )/‖∇g(xa )‖ = (1, 0)T and ∇h(xa )/‖∇h(xa )‖ = (−2√5/5, √5/5)T .
Consider the point xb = (0, 0)T , that is also feasible, and for which the constraint
(3.36) is active. We have
∇g(xb ) = (0, −2)T and ∇h(xb ) = (0, 1)T .
These vectors are represented as normalized in Figure 3.4. They are linearly depen-
dent because ∇g(xb ) = −2∇h(xb ), and the linear independence of the constraints is
not satisfied in xb .
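The check carried out at xa and xb amounts to a rank computation. A minimal numpy sketch (not part of the text; the helper names are ours, and g is taken as x1² + (x2 − 1)² − 1, an assumption consistent with the gradients above):

```python
import numpy as np

def grad_g(x):
    # Gradient of g(x) = x1^2 + (x2 - 1)^2 - 1 (assumed form, consistent with the text).
    return np.array([2 * x[0], 2 * x[1] - 2])

def grad_h(x):
    # Gradient of h(x) = x2 - x1^2.
    return np.array([-2 * x[0], 1.0])

def gradients_independent(x):
    # The two gradients are linearly independent iff the matrix they form has rank 2.
    return np.linalg.matrix_rank(np.column_stack([grad_g(x), grad_h(x)])) == 2

print(gradients_independent(np.array([1.0, 1.0])))  # True:  independence holds at xa
print(gradients_independent(np.array([0.0, 0.0])))  # False: grad g(xb) = -2 grad h(xb)
```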
In short, it is a direction that can be followed, at least a little bit, while staying
within the feasible set. Some examples are provided in Figure 3.5, where the feasible
set is the polygon represented by thin lines, feasible directions are represented with
thick plain lines, and infeasible directions with thick dashed lines.
Theorem 3.11 (Feasible direction in a convex set). Let X be a convex set, and
consider x, y ∈ X, y 6= x. The direction d = y − x is a feasible direction in x and
x + αd = x + α(y − x) is feasible for any 0 ≤ α ≤ 1.
(Illustration: the direction d = y − x from x to another point y of the convex set X.)
Proof. According to the definition of an interior point (Definition 1.15), there exists
a (convex) neighborhood V = {z such that ‖z − x‖ ≤ ε} such that V ⊆ X and ε > 0.
Consider an arbitrary direction d, and let y = x + εd/‖d‖ be the point where the
direction intersects the border of the neighborhood. Since ‖y − x‖ = ε, then y ∈ V.
Since V is convex, Theorem 3.11 is invoked to demonstrate that d is feasible.
This result is particularly important. The fact that all directions are feasible at
an interior point gives freedom to algorithms in the selection of the direction. This
is what motivates the method of interior points, as described in Chapter 18.
Theorem 3.13 (Feasible directions: linear case). Consider the optimization prob-
lem (3.26)–(3.28) min f(x) subject to Ax = b and x ≥ 0, and let x+ be a feasible
point. A direction d is feasible in x+ if and only if
1. Ad = 0, and
2. di ≥ 0 for each i such that x+i = 0.
If we choose α ≤ η, we have
α ≤ η ≤ −x+i /di ∀i such that x+i > 0 and di < 0,
and therefore
x+i + αdi ≥ 0
because di < 0.
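The two conditions of Theorem 3.13 translate directly into a numerical test. A minimal numpy sketch (not part of the text; the helper name is ours, and the data are the polyhedron used in the examples of this chapter):

```python
import numpy as np

def is_feasible_direction(A, x, d, tol=1e-9):
    # Condition 1 of Theorem 3.13: the direction stays on the manifold Ax = b.
    keeps_equalities = np.allclose(A @ d, 0.0, atol=tol)
    # Condition 2: d_i >= 0 for every i such that x_i = 0.
    respects_bounds = bool(np.all(d[np.abs(x) <= tol] >= -tol))
    return keeps_equalities and respects_bounds

A = np.array([[1, 1, 1, 0],
              [1, -1, 0, 1]], dtype=float)
x = np.array([0.0, 1.0, 0.0, 2.0])      # feasible: Ax = (1, 1) and x >= 0

d1 = np.array([1.0, -1.0, 0.0, -2.0])   # A d1 = 0, and d1 >= 0 on the zero components
d2 = np.array([-1.0, 1.0, 0.0, 0.0])    # violates both conditions
print(is_feasible_direction(A, x, d1))  # True
print(is_feasible_direction(A, x, d2))  # False
```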
is a feasible direction.
Proof. The two conditions of Theorem 3.13 are trivially satisfied for d.
To switch to the non linear case, we must use the gradients of the constraints.
Before this, we propose to interpret Theorem 3.13 in terms of gradients.
The first condition of the theorem concerns equality constraints. We have seen
that the rows of the matrix A are the gradients of the constraints, i.e. ∇hi (x) = ai ,
with
hi (x) = aTi x − bi = 0.
∇hi (x)T d = 0 i = 1, . . . , m.
gi (x) = −xi ≤ 0 i = 1, . . . , p,
and
∇gi (x) = (0, . . . , 0, −1, 0, . . . , 0)T ,
where the −1 is in the ith position, so that ∇gi (x)T d = −di .
The second condition of the theorem can be expressed as follows: “If the constraint
gi (x) is active at x+ , then ∇gi (x+ )T d ≤ 0.” We should also note that if an inequality
constraint is not active at x+ , it does not involve any condition on the direction for
the latter to be feasible.
Unfortunately, the generalization of these results to the non linear case is not
trivial. We develop it in two steps. We first see how to characterize feasible directions
for an inequality constraint. We treat the equality constraint later.
We observe that the gradient of an inequality constraint at a point where it is
active points toward the outside of the constraints, as shown by Example 3.15.
(Illustration: the feasible set defined by ½(x1 − 1)2 + ½(x2 − 1)2 ≤ ½ in the (x1 , x2 ) plane, with the constraint gradient pointing toward the outside at points where the constraint is active.)
Intuitively, the gradient direction and the directions that form an acute angle with
it cannot be considered as feasible directions.
In this case, nothing can guarantee that there exists α such that x+ + αd is feasible.
However, we can make the infeasibility as little as desired by choosing a sufficiently
small α.
We now analyze the feasible directions for an equality constraint h(x) = 0, which
can equivalently be written as the two inequality constraints
h(x) ≤ 0 and −h(x) ≤ 0.
Since the equality constraints pose a problem, we address the problem in the other
direction. Instead of positioning ourselves on a feasible point x+ and wondering how
to reach another one, we attempt to identify the ways to reach x+ while remaining
feasible. To do this, we introduce the concept of feasible sequences.
h(x) = x21 − x2 = 0,
and the point x+ = (0, 0)T . The sequence defined by xk = (1/k, 1/k2 )T
satisfies the three conditions of Definition 3.17 and belongs to S(x+ ). It is illustrated
in Figure 3.8.
Figure 3.8: Example of a feasible sequence
Given that the sequence (xk )k is feasible in x+ , we consider the directions con-
necting xk to x+ by normalizing them:
dk = (xk − x+ )/‖xk − x+ ‖. (3.40)
We should keep in mind that these directions are generally not feasible directions.
We are looking at what happens at the limit.
Example 3.19 (Feasible direction at the limit). We consider once again Exam-
ple 3.18. We have xk − x+ = xk ,
‖xk − x+ ‖ = √(k2 + 1)/k2 ,
and
dk = (k/√(k2 + 1), 1/√(k2 + 1))T .
At the limit, we obtain
d = lim k→∞ dk = (1, 0)T .
These directions are illustrated in Figure 3.9. Note that
∇h(x+ ) = (0, −1)T and ∇h(x+ )T d = 0.
Figure 3.9: Example of a feasible direction at the limit
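The convergence of the directions dk in Example 3.19 can be observed numerically. A minimal sketch (not part of the text), using the feasible sequence xk = (1/k, 1/k2 )T of Example 3.18:

```python
import numpy as np

def d_k(k):
    # Normalized direction (3.40) for the feasible sequence x_k = (1/k, 1/k^2).
    xk = np.array([1.0 / k, 1.0 / k ** 2])
    return xk / np.linalg.norm(xk)

for k in (1, 10, 1000):
    print(k, d_k(k))  # approaches (1, 0) as k grows

# The limit d = (1, 0) is orthogonal to grad h(x+) = (0, -1).
grad_h = np.array([0.0, -1.0])
print(abs(grad_h @ d_k(10 ** 6)))  # essentially zero
```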
Example 3.20 (Feasible direction at the limit). As for Example 3.18, we consider
the constraint in R2 :
h(x) = x21 − x2 = 0,
and the point x+ = (0, 0)T . The sequence defined by
xk = ((−1)k /k, 1/k2 )T
satisfies the three conditions of Definition 3.17 and belongs to S(x+ ). The calculation
of the directions gives
dk = ((−1)k k/√(k2 + 1), 1/√(k2 + 1))T ,
and limk→∞ dk does not exist. However, if we consider the subsequences defined by
the even and odd indices, respectively, we obtain
d′ = lim k→∞ d2k = (1, 0)T
and
d′′ = lim k→∞ d2k+1 = (−1, 0)T ,
Figure 3.10: Example of a feasible sequence
We can now formally define the notion of a feasible sequence at the limit.
Definition 3.21 (Feasible direction at the limit). Consider the optimization problem
(1.71)–(1.74), and a feasible point x+ ∈ Rn . Let (xk )k∈N be a feasible sequence in
x+ . The direction d 6= 0 is a feasible direction at the limit in x+ for the sequence
(xk )k∈N if there exists a subsequence (xki )i∈N such that
d/‖d‖ = lim i→∞ (xki − x+ )/‖xki − x+ ‖. (3.41)
Note that any feasible direction d is also a feasible direction at the limit: just
take the feasible sequence xk = x+ + (1/k)d in Definition 3.21.
Definition 3.22 (Tangent cone). A feasible direction at the limit is also called a
tangent direction. The set of all tangent directions in x+ is called the tangent cone
and denoted by T (x+ ).
We can now make the connection between this concept and the constraint gradient.
According to Theorem 3.16 and the associated comments, we consider all directions
that form an obtuse angle with the active inequality constraint gradients and those
that are orthogonal to the equality constraint gradients.
that is, the directions d such that
dT ∇gi (x+ ) ≤ 0, ∀i ∈ A(x+ ), (3.42)
and
dT ∇hi (x+ ) = 0, i = 1, . . . , m, (3.43)
and all their multiples. This set is called the linearized cone and is denoted by D(x+ ).
Theorem 3.24 (Feasible directions at the limit). Consider the optimization prob-
lem (1.71)–(1.74), and a feasible point x+ ∈ Rn . Any feasible direction at the
limit in x+ belongs to the linearized cone in x+ , that is
Proof. Let d be a normalized feasible direction at the limit, and xk a feasible sequence
such that
d = lim k→∞ (xk − x+ )/‖xk − x+ ‖. (3.46)
gi (x+ ) = 0.
Consider now an equality constraint, hi (x+ ) = 0, which can be written as the two
inequality constraints
hi (x+ ) ≤ 0 and −hi (x+ ) ≤ 0,
which are both active at x+ . According to the first point already demonstrated,
we have 0 ≤ dT ∇hi (x+ ) and 0 ≤ −dT ∇hi (x+ ), and obtain (3.43).
Theorem 3.24 does not yet provide a characterization of feasible directions at the
limit. Nevertheless, such a characterization is important, especially since the notion
of linearized cone is easier to handle than the concept of feasible direction at the limit.
Unfortunately, such a characterization does not exist in the general case. Therefore,
it is useful to assume that any element of the linearized cone is a feasible direction at
the limit. This hypothesis is called a constraint qualification.1
1 Several constraint qualifications have been proposed in the literature. This one is sometimes
called the Abadie Constraint Qualification, from the work by Abadie (1967).
Figure 3.11: Gradient and feasible directions at the limit
that is
T (x+ ) = D(x+ ). (3.48)
Proof. Theorem 3.24 shows that T (x+ ) ⊆ D(x+ ). To demonstrate that D(x+ ) ⊆
T (x+ ), consider a normalized direction2 d that belongs to the linearized cone D(x+ ).
We need to create a feasible sequence (xk )k , such that (3.41) is satisfied.
For each inequality constraint i active at x+ , we have
Following the arguments developed in Theorem 3.5, there exists ε such that all
constraints that are inactive at x+ are also inactive at any point of the sphere of
radius ε centered at x+ . Consider the sequence (xk )k with
xk = x+ + (ε/k) d, k = 1, 2, . . . (3.51)
Each xk is situated in the sphere mentioned above, and satisfies the inequality con-
straints that are inactive at x+ . For the inequality constraints that are active at x+ ,
we have
Since d is in the linearized cone at x+ , ∇gi (x+ )T d = aTi d ≤ 0, and xk is feasible for
any inequality constraint. For the equality constraint, we obtain in a similar manner
hi (xk ) = (ε/k) āTi d.
Finally, we now need only deduce from ‖d‖ = 1 that, for any k,
(xk − x+ )/‖xk − x+ ‖ = ((ε/k)d)/(ε/k) = d,
Proof. Theorem 3.24 shows that T (x+ ) ⊆ D(x+ ). To demonstrate that D(x+ ) ⊆
T (x+ ), consider a normalized direction d that belongs to the linearized cone D(x+ ).
We create a feasible sequence (xk )k , such that (3.41) is satisfied. We create it implic-
itly and not explicitly.
To simplify the notations, we first assume that all constraints are equality con-
straints. We consider the Jacobian matrix of constraints in x+ , ∇h(x+ )T ∈ Rm×n ,
for which the rows are constraint gradients in x+ (see Definition 2.18). Since the
constraints are linearly independent, the Jacobian matrix is of full rank. Consider a
matrix Z ∈ Rn×(n−m) for which the columns form a basis of the kernel of ∇h(x+ )T ,
i.e., such that ∇h(x+ )T Z = 0. We apply the theorem of implicit functions (Theo-
rem C.6) to the parameterized function F : R × Rn → Rn defined by
F(µ, x) = [ h(x) − µ∇h(x+ )T d ; ZT (x − x+ − µd) ]. (3.53)
∇x F(µ, x) = (∇h(x) Z)
is non singular since the columns of Z are orthogonal to those of ∇h(x), and since
the two submatrices are of full rank. Then, we have a function φ such that
x+ = φ(0) (3.54)
Since d is in the linearized cone, we deduce from the first part of (3.55)
and φ(µ) is feasible. We use φ to build a feasible sequence. To do so, we show that
φ(µ) 6= x+ when µ 6= 0. Assume by contradiction that φ(µ) = x+ . In this case,
F(µ, x+ ) = [ h(x+ ) − µ∇h(x+ )T d ; ZT (x+ − x+ − µd) ] = [ −µ∇h(x+ )T d ; −µZT d ] = 0. (3.57)
If µ 6= 0, and since the matrices ∇h(x+ )T and Z are of full rank, we deduce that d = 0,
which is impossible since kdk = 1. Then, if µ 6= 0, we necessarily have φ(µ) 6= x+ .
We are now able to generate a feasible sequence. To do so, we consider a sequence
(µk )k such that limk→∞ µk = 0. Then, the sequence
xk = φ(µk ) (3.58)
satisfies all conditions to be a feasible sequence (see Definition 3.17). We now need
to demonstrate that d is a feasible direction at the limit.
For a sufficiently large k, such that µk is sufficiently close to zero, we use a Taylor
series of h around x+ ,
h(xk ) = h(x+ ) + ∇h(x+ )T (xk − x+ ) + o(‖xk − x+ ‖),
in (3.55) to obtain
0 = F(µk , xk )
= [ ∇h(x+ )T (xk − x+ ) + o(‖xk − x+ ‖) − µk ∇h(x+ )T d ; ZT (xk − x+ − µk d) ]
= [ ∇h(x+ )T (xk − x+ − µk d) + o(‖xk − x+ ‖) ; ZT (xk − x+ − µk d) ]
= [ ∇h(x+ )T ; ZT ] (xk − x+ − µk d) + o(‖xk − x+ ‖).
Dividing by ‖xk − x+ ‖, we obtain
0 = [ ∇h(x+ )T ; ZT ] ( (xk − x+ )/‖xk − x+ ‖ − (µk /‖xk − x+ ‖) d ) + o(‖xk − x+ ‖)/‖xk − x+ ‖.
Then, since the matrices ∇h(x+ )T and ZT are of full rank, we have
lim k→∞ ( (xk − x+ )/‖xk − x+ ‖ − (µk /‖xk − x+ ‖) d ) = 0. (3.59)
Define
d̃ = lim k→∞ (xk − x+ )/‖xk − x+ ‖ and c = lim k→∞ µk /‖xk − x+ ‖,
so that (3.59) gives d̃ = c d. Since ‖d̃‖ = ‖d‖ = 1, we have c = 1 and
lim k→∞ (xk − x+ )/‖xk − x+ ‖ = d.
from which we deduce the feasibility of φ(µ). Inactive constraints do not pose a prob-
lem, since there is a sphere around x+ such that all elements satisfy these constraints.
Since Definition 3.17 is asymptotic, we can always choose k sufficiently large such
that xk belongs to this sphere.
Feasible directions at the limit are an extension of the concept of a feasible direction:
they enable us to identify the directions in which an infinitesimal displacement remains
feasible. Unfortunately, the definition is too complex to be operational. The
linearized cone, based on the constraint gradients, is directly accessible to the calcu-
lation. We usually assume that the constraint qualification is satisfied.
x3 = 1 − x1 − x2
(3.62)
x4 = 1 − x1 + x2 .
Thus, the optimization problem can be rewritten so as to depend only on two variables
x1 and x2 :
without constraint.
Ax = b (3.64)
AP = (B N) (3.65)
where B ∈ Rm×m contains the m first columns of AP, and N ∈ Rm×(n−m) contains
the n − m last ones. Recalling that PPT = I, we write (3.64) in the following manner
The variables xB are called basic variables, and the variables xN non basic variables.
Example 3.30 (Elimination of variables – II). In Example 3.29, we have
A = [1 1 1 0; 1 −1 0 1] , b = [1; 1].
The variables to eliminate are x3 and x4 . They correspond to the last two columns
of the constraint matrix, and we choose the permutation matrix to make them the
first two, that is
P = [0 0 1 0; 0 0 0 1; 1 0 0 0; 0 1 0 0] ,
to obtain
AP = (B|N) = [1 0 1 1; 0 1 1 −1] , B = [1 0; 0 1] , N = [1 1; 1 −1] ,
and
xB = (x3 , x4 )T = B−1 (b − NxN ) = [1 0; 0 1] ( [1; 1] − [1 1; 1 −1][x1 ; x2 ] ) = [1 − x1 − x2 ; 1 − x1 + x2 ].
It is relatively easy to remove linear equality constraints. However, note that the
calculation of the matrix B−1 can be tedious, especially when m is large, and it is
sometimes preferable to explicitly maintain the constraints for the problem.
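The computation of Example 3.30 takes a few lines of numpy. A minimal sketch (not part of the text); in line with the remark above, we solve a system with B rather than forming B−1 explicitly:

```python
import numpy as np

A = np.array([[1, 1, 1, 0],
              [1, -1, 0, 1]], dtype=float)
b = np.array([1.0, 1.0])

# Put the variables to eliminate (x3, x4) first: columns 2 and 3 of A form B.
B = A[:, [2, 3]]
N = A[:, [0, 1]]

def basic_variables(xN):
    # x_B = B^{-1} (b - N x_N); solving the linear system avoids forming B^{-1}.
    return np.linalg.solve(B, b - N @ xN)

x1, x2 = 0.3, 0.5
x3, x4 = basic_variables(np.array([x1, x2]))
print(np.isclose(x3, 1 - x1 - x2), np.isclose(x4, 1 - x1 + x2))  # True True
```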
The elimination of non linear constraints can be problematic. An interesting
example, proposed as an exercise by Fletcher (1983) and again by Nocedal and Wright
(1999), illustrates this difficulty.
Example 3.31 (Elimination of non linear constraints). Consider the problem
min x21 + x22
subject to
(x1 − 1)3 = x22
Figure 3.12: The problem in Example 3.31
Eliminating x2 by substituting x22 = (x1 − 1)3 into the objective yields f̃(x1 ) = x21 + (x1 − 1)3 ,
as shown in Figure 3.13. The problem is that the substitution can only be performed
if x1 ≥ 1, since x22 must necessarily be non negative. This implicit constraint in the
original problem should be explicitly incorporated in the problem after elimination.
It plays a crucial role since it is active at the solution.
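The pitfall is easy to exhibit numerically. A minimal sketch (not part of the text), assuming, consistently with f̃ in Figure 3.13, that the objective is x21 + x22 : after substitution, the unconstrained function is unbounded below unless the implicit constraint x1 ≥ 1 is kept:

```python
import numpy as np

def f_tilde(x1):
    # Objective x1^2 + x2^2 after substituting x2^2 = (x1 - 1)^3 (assumed objective).
    return x1 ** 2 + (x1 - 1) ** 3

# Without the implicit constraint x1 >= 1, the function is unbounded below:
print([f_tilde(t) for t in (-10.0, -100.0, -1000.0)])  # decreases without bound

# With x1 >= 1 enforced, the minimum is at x1 = 1, i.e., the solution (1, 0):
grid = np.linspace(1.0, 3.0, 2001)
print(grid[np.argmin(f_tilde(grid))])  # 1.0
```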
Figure 3.13: The problem without constraint in Example 3.31
3.5.1 Polyhedron
We analyze the linear constraints (3.27)–(3.28) from a geometrical point of view. The
central concept in this context is the polyhedron.
{x ∈ Rn |Ax = b, x ≥ 0} , (3.72)
Note that according to Theorem 3.6, the matrix A is assumed to be of full rank
without loss of generality.
The identification of vertices or extreme points of a polyhedron is possible thanks
to the technique of elimination of variables described above. We begin by formally
defining a vertex.
AP = (B|N) (3.74)
Proof. Without loss of generality, and to simplify the notations in the proof, we
assume that the m columns chosen are the m first ones, such that the permutation
matrix P = I. We assume by contradiction that there exists y, z ∈ P, y 6= x, z 6= x
and 0 ≤ λ ≤ 1 such that
x = λy + (1 − λ)z. (3.76)
After decomposition, we obtain
xB = λyB + (1 − λ)zB , (3.77)
and
xN = λyN + (1 − λ)zN . (3.78)
Since y and z are in P, we have yN ≥ 0 and zN ≥ 0. Since 0 ≤ λ ≤ 1, the only way
for xN = 0 is that yN = zN = 0. Then,
yB = B−1 (b − NyN ) = B−1 b = xB , (3.79)
and
zB = B−1 (b − NzN ) = B−1 b = xB . (3.80)
We obtain x = y = z, which contradicts the fact that y and z are different from x,
and proves the result.
It then appears that the vertices can be characterized by the set of active con-
straints.
αi = −x∗i /di ,
α∗ = min i|di >0 αi ,
and j an index for which the minimum is reached, i.e., α∗ = αj . The point x1 = x0 +
α∗ d belongs to the polyhedron by construction. Moreover, (x1 )j = 0 and (x0 )j > 0.
Then, the dimension of L(x1 ) is strictly smaller than that of L(x0 ). We now need
only repeat the procedure to obtain, after a certain number of iterations k at most
equal to the dimension of L(x0 ), a point xk such that dim L(xk ) = 0. According to
Theorem 3.36, xk is a vertex of P. This proof is illustrated in Figure 3.15, where
the linear manifold {x|Ax = b} is shown. In this example, L(x0 ) is the represented
plane, L(x1 ) is the straight line corresponding to the second coordinate axis, and
L(x2 ) = {x2 }.
(Figure 3.15: the points x0 , x1 , and x2 = xk on the linear manifold {x|Ax = b}.)
with
A = [1 1 1 0; 1 −1 0 1] , b = [1; 1]. (3.83)
(Figure 3.16: the constraints x1 + x2 = 1 and x1 − x2 = 1 in the (x1 , x2 ) plane, with the feasible basic solutions and the basic directions.)
This basic solution is feasible and also corresponds to point 2 in Figure 3.16.
3. Basic solution with x1 and x4 in the basis (j1 = 1, j2 = 4).
B = [1 0; 1 1] ; B−1 = [1 0; −1 1] ; xB = B−1 b = [1; 0] ; x = (1, 0, 0, 0)T .
This basic solution is feasible and also corresponds to point 2 in Figure 3.16.
4. Basic solution with x2 and x3 in the basis (j1 = 2, j2 = 3).
B = [1 1; −1 0] ; B−1 = [0 −1; 1 1] ; xB = B−1 b = [−1; 2] ; x = (0, −1, 2, 0)T .
This basic solution is not feasible, since x2 = −1 < 0.
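All the basic solutions of this example can be enumerated by trying every choice of two columns of A. A minimal sketch (not part of the text); it recovers, among others, the infeasible solution x = (0, −1, 2, 0)T :

```python
import numpy as np
from itertools import combinations

A = np.array([[1, 1, 1, 0],
              [1, -1, 0, 1]], dtype=float)
b = np.array([1.0, 1.0])
m, n = A.shape

solutions = []
for basis in combinations(range(n), m):
    B = A[:, basis]
    if np.linalg.matrix_rank(B) < m:
        continue                       # the chosen columns do not form a basis
    x = np.zeros(n)
    x[list(basis)] = np.linalg.solve(B, b)
    feasible = bool(np.all(x >= -1e-9))
    solutions.append((basis, x, feasible))
    print(basis, x, "feasible" if feasible else "infeasible")

print(sum(1 for _, _, f in solutions if not f))  # 1: only the basis {x2, x3} is infeasible
```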
The notion of a basic solution (Definition 3.38) enables us to analyze the poly-
hedron in terms of active constraints of the optimization problem (Definition 3.4).
Let x be a feasible basic solution such that xB = B−1 b > 0. We say that it is non
degenerate. In this case, there are exactly n active constraints at x: the m equality
constraints and the n − m non negativity constraints corresponding to the non basic
variables, which are zero.
Theorem 3.40 (Equivalence between vertices and feasible basic solutions). Let P =
{x ∈ Rn |Ax = b, x ≥ 0} be a polyhedron. The point x∗ ∈ P is a vertex of P if and
only if it is a feasible basic solution.
xB = B−1 (b − NxN ).
Since x∗ is not a feasible basic solution, there exists at least one component k of
xN that is not zero. We construct the direction d for which the basic component
is
dB = −B−1 Ak ,
where Ak is the kth column of A, and the non basic component for all zero
components, except the kth one which equals 1. Therefore,
Ad = BdB + NdN = −BB−1 Ak + ∑j non basic Aj dj = −Ak + Ak = 0.
It is important to note that Theorem 3.40 does not guarantee a bijective rela-
tionship between the vertices of the polyhedron and the feasible basic solutions in all
cases. Indeed, when some of the components of xB = B−1 b are zero, there are more
than n active constraints, and the feasible basic solution is defined by more than n
equations in a space of n dimensions. We say in this case that we are dealing with a
degenerate feasible basic solution.
that is,
dB = −B−1 Ap . (3.88)
i.e., that all the elements of dNp are zero, except the one corresponding to the variable
p, which is 1.
Note that these directions are not always feasible, as discussed later. But first, we
illustrate the concept with Example 3.39.
Example 3.43 (Basic directions). Consider the polygon in Example 3.39, and the
feasible basic solution where x2 and x4 are in the basis. Then
x = (0, 1, 0, 2)T and P = [0 0 1 0; 1 0 0 0; 0 0 0 1; 0 1 0 0] .
The basic direction corresponding to the non basic variable x1 is
d1 = P (−B−1 A1 , 1, 0)T = P (−1, −2, 1, 0)T = (1, −1, 0, −2)T ,
and the basic direction corresponding to the non basic variable x3 is
d3 = P (−B−1 A3 , 0, 1)T = P (−1, −1, 0, 1)T = (0, −1, 1, −1)T .
These two directions are shown in Figure 3.16, from the feasible basic solution 3.
We now consider the feasible basic solution where x1 and x4 are in the basis (point
2 in Figure 3.16). Then,
x = (1, 0, 0, 0)T and P = [1 0 0 0; 0 0 1 0; 0 0 0 1; 0 1 0 0] .
These directions are not represented in Figure 3.16. Finally, consider the feasible
basic solution where x1 and x2 are in the basis. Then,
x = (1, 0, 0, 0)T and P = I.
These two directions are shown in Figure 3.16. Note that d̂4 is a feasible direction,
whereas d̂3 is not.
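The basic directions of Example 3.43 can be recomputed directly from Definition 3.42. A minimal numpy sketch (not part of the text; the helper name is ours):

```python
import numpy as np

A = np.array([[1, 1, 1, 0],
              [1, -1, 0, 1]], dtype=float)

def basic_direction(A, basis, p):
    # d_B = -B^{-1} A_p, as in (3.88); d_p = 1; all other non basic components are 0.
    d = np.zeros(A.shape[1])
    d[list(basis)] = -np.linalg.solve(A[:, basis], A[:, p])
    d[p] = 1.0
    return d

# Basis {x2, x4} (indices 1 and 3), the feasible basic solution 3 of Figure 3.16.
d1 = basic_direction(A, [1, 3], 0)
d3 = basic_direction(A, [1, 3], 2)
print(d1)  # [ 1. -1.  0. -2.]
print(d3)  # [ 0. -1.  1. -1.]

# Basic directions satisfy the first condition of Theorem 3.13: A d = 0.
print(np.allclose(A @ d1, 0.0), np.allclose(A @ d3, 0.0))  # True True
```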
in x.
where N is the set of indices of the non basic variables, dj ∈ Rn the jth basic
direction, and (d)j ∈ R, the jth component of d.
Proof. Consider a feasible direction d, and assume without loss of generality that the
basic variables are the m first ones. According to Theorem 3.13, we have
Ad = BdB + NdN = 0
and
dB = −B−1 NdN = − ∑j=m+1,...,n (d)j B−1 Aj , (3.92)
where ek ∈ Rn−m is a vector for which all the components are zero, except the kth
one, which is 1. According to Definition 3.42, (3.92) and (3.93) are written as
d = [dB ; dN ] = ∑j=m+1,...,n (d)j [−B−1 Aj ; ej−m ] = ∑j=m+1,...,n (d)j dj , (3.94)
The proof of Theorems 3.27 and 3.28 is inspired by Nocedal and Wright (1999).
That of Theorems 3.36 and 3.37 is inspired by de Werra et al. (2003).
3.6 Exercises
x1 + x2 ≤ 3
x1 + x3 ≤ 7
x1 ≥ 0
x2 ≥ 0
x3 ≥ 0.
−x1 + x2 ≤1
x1 + 2x2 ≤ 4.
Exercise 3.5. Consider the feasible set defined by the following constraints:
x1 − x2 ≥ −2
2x1 + x2 ≤ 8
x1 + x2 ≤ 5
x1 + 2x2 ≤ 10
x1 ≥ 0
x2 ≥ 0
Chapter 4
Introduction to duality
There are those who are subject to constraints and others who impose them. We
now take the point of view of the second category in order to analyze an optimization
problem from a different angle.
Contents
4.1 Constraint relaxation
4.2 Duality in linear optimization
4.3 Exercises
Everest, culminating at 8,848 meters (Figure 4.1(b)). The billionaire thus decides to
set the fine at €4,041. In this case, climbing Everest would give the mountaineer
8,848 − 4,041 = €4,807, which is exactly the same amount as if he decided to climb
Mont Blanc. Therefore, the mountaineer has no interest in violating the constraint,
and the final solution to the problem is the same as with the constraint.
We can model the problem of the climber by calling his position (longitude/latitude)
x and the corresponding altitude f(x). The first problem is a constrained optimization
problem:
max f(x)
x
subject to
x ∈ Alps .
The fine imposed by the billionaire is denoted by a(x), and depends on the position
x. In particular, a(x) = 0 if x ∈ Alps. The optimization problem is now without
constraints and can be expressed as
max x f(x) − a(x).
min x∈R2 2x1 + x2 (4.1)
subject to
1 − x1 − x2 = 0
x1 ≥ 0 (4.2)
x2 ≥ 0
for which the solution is x∗ = (0, 1)T with an optimal value 1. We now relax the
constraint 1 − x1 − x2 = 0 and introduce a fine that is proportional to the violation
of the constraint, with a proportionality factor λ. This way, the fine is zero when the
constraint is satisfied. We obtain the following problem:
min x∈R2 2x1 + x2 + λ(1 − x1 − x2 ) (4.3)
subject to
x1 ≥ 0
(4.4)
x2 ≥ 0 .
• If λ = 0, (4.3) becomes 2x1 + x2 and the solution to the problem is x∗ = (0, 0)T
with an optimal value of 0 (Figure 4.3). This solution violates the constraint of
the original problem, and the optimal value is lower. It is a typical case where
the penalty value is ill-suited, and where it becomes interesting to violate the
constraint.
(Surface plots of the relaxed objective (4.3): the function 2x1 + x2 obtained with λ = 0 (Figure 4.3), the function 2 − x2 obtained with λ = 2, and the function 1 + x1 obtained with λ = 1.)
As we did in Example 4.2, we can minimize the Lagrangian function for each fixed
value of the parameters λ and µ. Indeed, the Lagrangian function now depends only
on x. The function that associates a set of parameters to the optimal value of the
associated problem is called a dual function.
is the dual function of the problem (1.71)–(1.73). The parameters λ and µ are called
dual variables. In this context, the variables x are called primal variables.
If we take Example 4.1, −q(λ, µ) represents the mountaineer’s prize1 if the bil-
lionaire imposes a fine for violation of the constraints λT h(x) + µT g(x).
For inequality constraints, since only non negative values of g(x) should be avoided
and result in a fine, it is essential that µ ≥ 0. Indeed, the term µT g(x) is non negative,
and thus penalizing, only when g(x) > 0.
Theorem 4.5 (Bound from dual function). Let x∗ be the solution to the opti-
mization problem (1.71)–(1.73), and let q(λ, µ) be the dual function to the same
problem. Consider λ ∈ Rm and µ ∈ Rp , µ ≥ 0. Then,
q(λ, µ) ≤ f(x∗ ),
and the dual function provides lower bounds on the optimal value of the problem.
Proof. Since q(λ, µ) is the minimum of the Lagrangian over x, we have
q(λ, µ) ≤ L(x∗ , λ, µ) = f(x∗ ) + λT h(x∗ ) + µT g(x∗ ) ≤ f(x∗ ),
because h(x∗ ) = 0, g(x∗ ) ≤ 0, and µ ≥ 0.
Corollary 4.6 (Objective functions of the primal and dual). Let x be a feasible
solution of the optimization problem (1.71)–(1.73), and let q(λ, µ) be the dual
function to the same problem. Consider λ ∈ Rm and µ ∈ Rp , µ ≥ 0. Then,
q(λ, µ) ≤ f(x).
Proof. Denote x∗ the optimal solution of the primal problem. As x is primal feasible,
we have f(x∗ ) ≤ f(x). The result follows from Theorem 4.5.
If we take the point of view of the billionaire, the problem is to define these fines
in such a manner that the mountaineer wins as little as possible with the new system.
He tries to optimize the dual function, ensuring that the considered parameters λ and
µ ≥ 0 do not generate an unbounded problem. This optimization problem is called
the dual problem.
1 The sign of q is changed because the problem with the mountaineer is one of maximization and
not minimization.
subject to
µ≥0 (4.11)
and
(λ, µ) ∈ Xq (4.12)
is the dual problem of the problem (1.71)–(1.73). In this context, the original problem
(1.71)–(1.73) is called the primal problem.
min x∈R2 2x1 + x2 (4.13)
subject to
h1 (x) = 1 − x1 − x2 = 0 (λ)
g1 (x) = − x1 ≤0 (µ1 ) (4.14)
g2 (x) = − x2 ≤ 0 (µ2 ) .
The Lagrangian function of this problem is
L(x, λ, µ) = 2x1 + x2 + λ(1 − x1 − x2 ) − µ1 x1 − µ2 x2 = (2 − λ − µ1 )x1 + (1 − λ − µ2 )x2 + λ.
In order for the dual function to be bounded, the coefficients of x1 and x2 have to be
zero, and
2 − λ − µ1 = 0 , 1 − λ − µ2 = 0 ,
or
µ1 = 2 − λ , µ2 = 1 − λ . (4.15)
The dual function is then q(λ) = λ, and the dual problem consists in maximizing λ
subject to
λ ≤ 1,
for which the solution is λ∗ = 1. According to the equalities (4.15), we have µ∗1 = 1
and µ∗2 = 0.
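The dual function of Example 4.8 can also be evaluated numerically by minimizing the Lagrangian over x ≥ 0. A minimal sketch (not part of the text); here only the equality constraint is relaxed, and the minimization is performed over a large box, which suffices for this linear problem:

```python
import numpy as np

def q(lam, radius=1e3, n=401):
    # Dual function: minimize 2 x1 + x2 + lam (1 - x1 - x2) over a large grid of x >= 0.
    t = np.linspace(0.0, radius, n)
    x1, x2 = np.meshgrid(t, t)
    return float(np.min(2 * x1 + x2 + lam * (1 - x1 - x2)))

for lam in (0.0, 0.5, 1.0):
    print(lam, q(lam))       # q(lam) = lam as long as lam <= 1

print(q(2.0))                # large and negative: q(2) tends to -infinity with the box
# The dual optimum is lam* = 1 with q(1) = 1, the optimal value of the primal.
```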
As a direct consequence of Theorem 4.5, the optimal value of this problem can
never exceed the optimal value of the original problem. This result is called the weak
duality theorem.
Theorem 4.9 (Weak duality). Let x∗ be the optimal solution to the primal prob-
lem (1.71)–(1.73) and let (λ∗ , µ∗ ) be the optimal solution to the associated dual
problem (4.10)–(4.12). Then
q(λ∗ , µ∗ ) ≤ f(x∗ ). (4.16)
Corollary 4.10 (Duality and feasibility). Consider the primal problem (1.71)–
(1.73) and the associated dual problem (4.10)–(4.12).
• If the primal problem is unbounded, then the dual problem is not feasible.
• If the dual problem is unbounded, then the primal problem is not feasible.
Proof. If the optimal value of the primal problem is −∞, there is no dual variable
(λ, µ) that satisfies (4.16) and the dual problem is not feasible. Similarly, if the
optimal value of the dual problem is +∞, there is no primal variable x that satisfies
(4.16) and the primal problem is not feasible.
Corollary 4.11 (Optimality of the primal and the dual). Let x∗ be a feasible so-
lution of the primal problem (1.71)–(1.73) and let (λ∗ , µ∗ ) be a feasible solution
of the associated dual problem (4.10)–(4.12). If q(λ∗ , µ∗ ) = f(x∗ ), then x∗ is
optimal for the primal, and (λ∗ , µ∗ ) is optimal for the dual.
Proof. Consider any x feasible for the primal. From Theorem 4.5, we have
f(x) ≥ q(λ∗, µ∗) = f(x∗) ,
proving the optimality of x∗. Similarly, consider any (λ, µ) feasible for the dual. From the same theorem, we have
q(λ, µ) ≤ f(x∗) = q(λ∗, µ∗) ,
proving the optimality of (λ∗, µ∗).
Corollary 4.12 (Duality and feasibility (II)). Consider the primal problem (1.71)–
(1.73) and the associated dual problem (4.10)–(4.12).
• If the primal problem is infeasible, then the dual problem is either unbounded
or infeasible.
• If the dual problem is infeasible, then the primal problem is either unbounded
or infeasible.
Proof. We show the contrapositive. If the dual problem is bounded and feasible, it has an optimal solution. From Corollary 4.11, the primal problem also has an optimal solution and is therefore feasible. The second statement is shown in a similar way.
The dual problem has interesting geometric properties. Indeed, the objective function to maximize is concave, and the domain Xq is convex. To see it, take γ = (λ, µ), γ̄ = (λ̄, µ̄) and 0 ≤ α ≤ 1. As the Lagrangian is linear in λ and µ, and as the infimum of a sum is not smaller than the sum of the infima, we have
inf_x L(x, αγ + (1 − α)γ̄) ≥ α inf_x L(x, γ) + (1 − α) inf_x L(x, γ̄) ,
or
q(αγ + (1 − α)γ̄) ≥ αq(γ) + (1 − α)q(γ̄) , (4.18)
which demonstrates the concavity of q (Definition 2.3). Since γ and γ̄ are in Xq, then q(γ) > −∞ and q(γ̄) > −∞. According to (4.18), we also have q(αγ + (1 − α)γ̄) > −∞ and this way αγ + (1 − α)γ̄ is in Xq, proving the convexity of Xq (Definition B.2).
Consider the linear optimization problem
min_x cT x (4.19)
subject to
Ax = b
x ≥ 0 (4.20)
and we have
h(x) = b − Ax and g(x) = −x .
The Lagrangian function is
L(x, λ, µ) = cT x + λT (b − Ax) − µT x = (c − AT λ − µ)T x + λT b . (4.21)
In this case, the dual function is q(λ, µ) = λT b and the dual problem is written as
max_{λ,µ} λT b (4.22)
subject to
µ≥0
(4.23)
µ = c − AT λ .
By eliminating µ, renaming λ as x, and changing the maximization to a minimization, we obtain
min_x −bT x (4.24)
subject to
AT x ≤ c . (4.25)
This is another linear optimization problem. We calculate its dual problem. Since there are no equality constraints, we have
L(x, µ) = −bT x + µT (AT x − c) = (Aµ − b)T x − µT c ,
and the dual problem is
max_µ −µT c
subject to
µ ≥ 0
Aµ = b .
By replacing µ by x, and transforming the maximization into a minimization, we
obtain the original problem (4.19)–(4.20). The dual of the dual problem is the primal
problem. We can now generalize these results.
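This primal–dual pair can be observed on a tiny instance. The sketch below uses illustrative data, not data from the book: with a single equality constraint x1 + x2 + x3 = 1 and x ≥ 0, the primal vertices are the unit vectors, and the dual max bT λ subject to AT λ ≤ c reduces to max λ subject to λ ≤ ci for all i.

```python
# Illustrative instance of the pair (4.19)-(4.20) / (4.22)-(4.23):
# primal: min c^T x  s.t.  x1 + x2 + x3 = 1, x >= 0
# dual:   max lam    s.t.  lam <= c_i for all i   (A = [1 1 1], b = 1)
c = [2.0, 1.0, 3.0]

# The primal vertices are the unit vectors e_i, with objective value c_i.
primal_opt = min(c)

# The dual maximizes lam subject to lam <= c_i for all i, so lam* = min(c).
dual_opt = min(c)

# Strong duality: the two optimal values coincide.
assert primal_opt == dual_opt == 1.0

# Weak duality: every dual feasible lam bounds every primal vertex value below.
for lam in [0.0, 0.5, 1.0]:
    for fx in c:
        assert lam <= fx
```
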
Theorem 4.14 (Dual of a linear problem). Consider the following linear problem:
min_x c1T x1 + c2T x2 + c3T x3 (4.26)
subject to
A1 x1 + B1 x2 + C1 x3 = b1
A2 x1 + B2 x2 + C2 x3 ≤ b2
A3 x1 + B3 x2 + C3 x3 ≥ b3
(4.27)
x1 ≥ 0
x2 ≤ 0
x3 ∈ Rn3 ,
where x1 ∈ Rn1, x2 ∈ Rn2, x3 ∈ Rn3, b1 ∈ Rm, b2 ∈ Rpi and b3 ∈ Rps. The matrices Ai, Bi, Ci, i = 1, 2, 3, have appropriate dimensions. The dual of this problem is
max_γ γT b = γ1T b1 + γ2T b2 + γ3T b3 (4.28)
subject to
γ1 ∈ Rm
γ2 ≤ 0
γ3 ≥ 0
A1T γ1 + A2T γ2 + A3T γ3 = AT γ ≤ c1
B1T γ1 + B2T γ2 + B3T γ3 = BT γ ≥ c2 (4.29)
C1T γ1 + C2T γ2 + C3T γ3 = CT γ = c3 ,
with γ = (γ1T γ2T γ3T)T ∈ Rm+pi+ps and A = (A1T A2T A3T)T ∈ R(m+pi+ps)×n1. The matrices B and C are defined in a similar manner.
Proof. Introducing the multipliers µx1 ≥ 0 and µx2 ≥ 0 associated with the constraints x1 ≥ 0 and x2 ≤ 0, the conditions for the dual function to be bounded are
µx1 = c1 − AT γ
µx2 = BT γ − c2
CT γ = c3 .
We now need only use µx1 ≥ 0 and µx2 ≥ 0 to obtain the result.
Note that the problem (4.26)–(4.27) combines all the possibilities of writing the constraints of a linear problem: equality, lower inequality, upper inequality, nonpositivity, and nonnegativity. The result can be summarized as follows:
• For each constraint of the primal, there is a dual variable.
Theorem 4.15 (The dual of the dual is the primal). Consider a (primal) linear
optimization problem. If the dual is converted into a minimization problem, and
we calculate its dual, we obtain a problem equivalent to the primal problem.
Proof. Applying the change of variables x = −γ and converting the maximization into a minimization, the dual (4.28)–(4.29) is written as
min_x b1T x1 + b2T x2 + b3T x3
subject to
x1 ∈ Rm
x2 ≥ 0
x3 ≤ 0
A1T x1 + A2T x2 + A3T x3 ≥ −c1
B1T x1 + B2T x2 + B3T x3 ≤ −c2
C1T x1 + C2T x2 + C3T x3 = −c3 .
According to Theorem 4.14, the dual of this problem is
max_{γ1,γ2,γ3} −c1T γ1 − c2T γ2 − c3T γ3
subject to
A1 γ1 + B1 γ2 + C1 γ3 = b1
A2 γ1 + B2 γ2 + C2 γ3 ≤ b2
A3 γ1 + B3 γ2 + C3 γ3 ≥ b3
γ1 ≥ 0
γ2 ≤ 0
γ3 ∈ Rn3 .
We now need only replace γ by x, and convert the maximization into a minimization
to obtain (4.26)–(4.27) and prove the result.
Example 4.16 (Dual of a linear problem). Consider the linear optimization problem
min_{x∈R3} x1 + 2x2 + 3x3
subject to
−x1 + 3x2 = 5
2x1 − x2 + 3x3 ≥ 6
x3 ≤ 4
x1 ≥ 0
x2 ≤ 0
x3 ∈ R .
Its dual problem is
max_γ 5γ1 + 6γ2 + 4γ3
subject to
γ1 ∈ R
γ2 ≥ 0
γ3 ≤ 0
−γ1 + 2γ2 ≤1
3γ1 − γ2 ≥2
3γ2 + γ3 = 3 .
This is also a linear problem. We write it as a minimization problem and rename the
variables x.
min −5x1 − 6x2 − 4x3
subject to
x1 ∈ R
x2 ≥ 0
x3 ≤ 0
x1 − 2x2 ≥ −1
−3x1 + x2 ≤ −2
−3x2 − x3 = −3 .
The dual of this problem is
max_γ −γ1 − 2γ2 − 3γ3
subject to
γ1 − 3γ2 = −5
−2γ1 + γ2 − 3γ3 ≤ −6
−γ3 ≥ −4
γ1 ≥ 0
γ2 ≤ 0
γ3 ∈ R .
It is easy to verify that this problem is equivalent to the original problem.
Theorem 4.17 (Strong duality). Consider a linear optimization problem and its dual. If one problem has an optimal solution, so does the other one, and the optimal values of their objective functions are the same.
that is
Ay − rb = 0 , (4.30)
and
yT c + r(−bT λ∗ − ε) < 0 ,
that is
cT y − r bT λ∗ − rε < 0 . (4.31)
We distinguish two cases.
r = 0. In this case, (4.30) is Ay = 0 and (4.31) is cT y < 0. Applying Farkas' lemma to the compatible system AT λ ≤ c (it is verified at least by λ∗), we obtain that cT y ≥ 0 for each y ≥ 0 such that Ay = 0, contradicting (4.30)–(4.31). Therefore r ≠ 0.
r > 0. Divide (4.30) by r, and define x∗ = y/r to obtain
Ax∗ − b = 0. (4.32)
As y ≥ 0 and r > 0, x∗ is feasible for the primal problem. Dividing also (4.31) by
r, we obtain
cT x∗ − bT λ∗ − ε < 0. (4.33)
Denote δ = cT x∗ − bT λ∗. By Corollary 4.6, as x∗ is primal feasible and λ∗ dual feasible, we know that δ = cT x∗ − bT λ∗ ≥ 0. Therefore, (4.33) is written as
0 ≤ δ < ε .
As ε > 0 is arbitrary, we obtain δ = 0, that is, cT x∗ = bT λ∗.
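Strong duality can be observed on a small standard-form instance. The sketch below uses illustrative data and a brute-force enumeration of bases (not the book's algorithm): it computes a primal optimum of min cT x subject to Ax = b, x ≥ 0, recovers dual multipliers λ = B⁻T c_B from the optimal basis, and checks that bT λ = cT x∗.

```python
import itertools
import numpy as np

# Standard-form LP (illustrative data): min c^T x  s.t.  A x = b, x >= 0.
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, -1.0, 0.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([2.0, 1.0, 3.0, 1.0])

# Brute-force enumeration of the basic feasible solutions (vertices).
best = None
for cols in itertools.combinations(range(4), 2):
    idx = list(cols)
    B = A[:, idx]
    if abs(np.linalg.det(B)) < 1e-12:
        continue
    xB = np.linalg.solve(B, b)
    if (xB >= -1e-12).all():
        val = c[idx] @ xB
        if best is None or val < best[0]:
            best = (val, idx)

primal_opt, idx = best
# Dual variables from the optimal basis: lambda = B^{-T} c_B. For this
# instance the first optimal basis found happens to be dual feasible too.
lam = np.linalg.solve(A[:, idx].T, c[idx])
assert (A.T @ lam <= c + 1e-9).all()      # dual feasibility: A^T lambda <= c
assert abs(b @ lam - primal_opt) < 1e-9   # strong duality: b^T lambda = c^T x*
```
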
4.3 Exercises
Exercise 4.1. Consider the optimization problem
Exercise 4.2. Same questions as for Exercise 4.1 for the problem
min_{x∈R2} (1/2)(x1² + x2²) subject to x1 ≥ 1 .
Exercise 4.3. Consider a matrix A ∈ Rn×n such that AT = −A, a vector c ∈ Rn, and the optimization problem
min_{x∈Rn} cT x
subject to
Ax ≥ −c
x ≥ 0 .
Demonstrate that this problem is self-dual, i.e., that the dual problem is equivalent
to the primal problem.
Exercise 4.4. Consider the optimization problem
subject to
x1 − x2 ≤ 2
−x1 + x2 ≤ −3
x1 , x2 ≥ 0.
1. Write the Lagrangian.
2. Write the dual function.
3. Write the dual problem.
4. Represent graphically the feasible set of the primal problem.
5. Represent graphically the feasible set of the dual problem.
Exercise 4.5. Same questions for the following problem.
min_{x∈R2} −x1 − x2
subject to
−x1 + x2 ≤ 1
(1/2) x1 − x2 ≤ 0
x1, x2 ≥ 0 .
Part II
Optimality conditions
Albert Einstein
Chapter 5
Unconstrained optimization
Contents
5.1 Necessary optimality conditions . . . . . . . . . . . . . . . 115
5.2 Sufficient optimality conditions . . . . . . . . . . . . . . . 120
5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
∇f(x∗) = 0 . (5.1)
Proof. We recall that −∇f(x∗) is the direction of the steepest descent in x∗ (Theorem 2.13) and assume by contradiction that ∇f(x∗) ≠ 0. We can then use Theorem 2.11 with the descent direction d = −∇f(x∗) to obtain η such that
f(x∗ − α∇f(x∗)) < f(x∗) , ∀α ∈ ]0, η] ,
which contradicts the fact that x∗ is a local minimum.
For the second-order condition, take the Taylor expansion of f around x∗ along an arbitrary direction d. Using ∇f(x∗) = 0, we have
f(x∗ + αd) − f(x∗) = (1/2) α² dT ∇²f(x∗) d + o(‖αd‖²)   from (5.1)
= (1/2) α² dT ∇²f(x∗) d + o(α²)   ‖d‖ does not depend on α
≥ 0   x∗ is optimal .
Dividing by α², we obtain
(1/2) dT ∇²f(x∗) d + o(α²)/α² ≥ 0 . (5.3)
Intuitively, as the second term can be made as small as desired, the result must hold.
More formally, let us assume by contradiction that dT ∇2 f(x∗ )d is negative and that
its value is −2ε, with ε > 0. According to the Landau notation o(·) (Definition B.17),
for all ε > 0, there exists η such that
o(α²)/α² < ε , ∀ 0 < α ≤ η ,
and
(1/2) dT ∇²f(x∗) d + o(α²)/α² ≤ (1/2) dT ∇²f(x∗) d + |o(α²)/α²| < −ε + ε = 0 ,
which contradicts (5.3) and proves that dT ∇2 f(x∗ )d ≥ 0. Since d is an arbitrary
direction, ∇2 f(x∗ ) is positive semidefinite (Definition B.8).
From a geometrical point of view, the second-order condition means that f is locally convex at x∗ (Theorem 2.21).
Example 5.2 (Affine function). Consider an affine function (see Definition 2.25):
f(x) = cT x + d, (5.4)
illustrated in Figure 5.1 (see Section 11.6 for a discussion of this function). The point
The point (1, 1)T is a local minimum of the function. We have
∇f(x1, x2) = ( 400x1³ − 400x1x2 + 2x1 − 2 , 200x2 − 200x1² )T ,
which is indeed zero in (1, 1)T. Moreover,
∇²f(x1, x2) = ( 1200x1² − 400x2 + 2 , −400x1 ; −400x1 , 200 ) ,
which, in (1, 1)T, is
∇²f(1, 1) = ( 802 , −400 ; −400 , 200 ) ,
for which the eigenvalues are positive (0.39936 and 1,001.6), so that the Hessian matrix is positive definite. Note that the conditioning of f in (1, 1)T is high (2,508) and that the function is ill-conditioned at the solution (Section 2.5).
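These computations can be reproduced numerically. The sketch below is not part of the original text and assumes the objective is f(x) = 100(x2 − x1²)² + (x1 − 1)², which is consistent with the gradient and Hessian displayed above.

```python
import numpy as np

# Assumed objective, consistent with the displayed gradient and Hessian:
# f(x) = 100 (x2 - x1^2)^2 + (x1 - 1)^2.
def grad(x1, x2):
    return np.array([400*x1**3 - 400*x1*x2 + 2*x1 - 2,
                     200*x2 - 200*x1**2])

def hessian(x1, x2):
    return np.array([[1200*x1**2 - 400*x2 + 2, -400*x1],
                     [-400*x1, 200.0]])

assert np.allclose(grad(1.0, 1.0), 0.0)            # necessary condition (5.1)
H = hessian(1.0, 1.0)
assert np.allclose(H, [[802.0, -400.0], [-400.0, 200.0]])

eig = np.linalg.eigvalsh(H)                        # both eigenvalues positive
assert (eig > 0).all()
assert 2500 < eig.max() / eig.min() < 2520         # ill-conditioned minimum
```
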
illustrated in Figure 5.2. The point (0, 0)T satisfies the necessary optimality conditions. Indeed,
∇f(x1, x2) = ( −4x1³ , −4x2³ )T
T
is zero in (0, 0)T. Moreover,
∇²f(x1, x2) = ( −12x1² , 0 ; 0 , −12x2² )
is positive semidefinite in (0, 0)T. However, this is not a local minimum. To demonstrate this, consider a nonzero arbitrary direction d = (d1, d2)T and take a step α > 0 of any length from the point (0, 0)T. We have
f(αd) = −α⁴ (d1⁴ + d2⁴) < 0 = f(0, 0) ,
so that every neighborhood of (0, 0)T contains points with a strictly lower value of f.
is zero in (0, 0)T. Moreover,
∇²f(x1, x2) = ( 100 , 0 ; 0 , −6x2 )
is positive semidefinite in (0, 0)T. However, it is not a local minimum. To show this, consider the direction d = (0, 1)T and take any step α > 0 from the point (0, 0)T. We have
f(αd) = f(0, α) = −α³ < 0 = f(0, 0) , ∀α > 0 .
Proof. We assume by contradiction that there exists a direction d and η > 0 such
that, for any 0 < α ≤ η, f(x∗ + αd) < f(x∗ ). With an identical approach to the proof
of Theorem 5.1, we have
o(α²)/α² < ε , ∀α, 0 < α ≤ η̄ ,
and then, for any α ≤ min(η, η̄), we have
−o(α²)/α² ≤ |o(α²)/α²| < ε ,
such that
(1/2) dT ∇²f(x∗) d < −o(α²)/α² < ε .
As ε can be chosen arbitrarily small, dT ∇²f(x∗) d ≤ 0, which contradicts the fact that ∇²f(x∗) is positive definite.
f(x1, x2) = (1/2) x1² + x1 cos x2 ,
illustrated in Figure 5.4. We use the optimality conditions to identify the minima of this function. The gradient is
∇f(x1, x2) = ( x1 + cos x2 , −x1 sin x2 )T .
This gradient is zero for x∗k = ( (−1)k+1, kπ )T, k ∈ Z, and for x̄k = ( 0, π/2 + kπ )T, k ∈ Z, as illustrated in Figure 5.5. We also have
∇²f(x1, x2) = ( 1 , −sin x2 ; −sin x2 , −x1 cos x2 ) .
In x∗k, this matrix is the identity and is positive definite, so that each point x∗k satisfies the sufficient optimality conditions and is a local minimum of the function.
By evaluating the Hessian matrix in x̄k, we get for any k
∇²f(x̄k) = ( 1 , (−1)k+1 ; (−1)k+1 , 0 ) .
Its determinant is −1, so that this matrix is indefinite and the points x̄k are not local minima.
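These critical points can be checked numerically for f(x1, x2) = (1/2)x1² + x1 cos x2 (a sketch, not part of the original text):

```python
import numpy as np

def grad(x1, x2):
    return np.array([x1 + np.cos(x2), -x1 * np.sin(x2)])

def hessian(x1, x2):
    return np.array([[1.0, -np.sin(x2)],
                     [-np.sin(x2), -x1 * np.cos(x2)]])

for k in range(-2, 3):
    # x*_k = ((-1)^(k+1), k*pi): gradient vanishes, Hessian is the identity.
    xk = ((-1)**(k + 1), k * np.pi)
    assert np.allclose(grad(*xk), 0.0, atol=1e-12)
    assert np.allclose(hessian(*xk), np.eye(2), atol=1e-12)

    # xbar_k = (0, pi/2 + k*pi): gradient vanishes, but det(Hessian) = -1 < 0,
    # so the Hessian is indefinite and xbar_k is a saddle point.
    xbar = (0.0, np.pi/2 + k * np.pi)
    assert np.allclose(grad(*xbar), 0.0, atol=1e-12)
    assert abs(np.linalg.det(hessian(*xbar)) + 1.0) < 1e-12
```
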
Figure 5.5: The points x∗k and x̄k
It means that any point strictly between x∗ and x+ is also strictly better than x∗. Consider an arbitrary ε > 0, and demonstrate that Definition 1.5 of the local minimum is contradicted. If ε ≥ ‖x∗ − x+‖, (1.75) is not satisfied for x = x+, when taking α = 1 in (5.7). If ε < ‖x∗ − x+‖, consider 0 < η < 1 such that ‖ηx∗ + (1 − η)x+ − x∗‖ = ε. In this case, (1.75) is not satisfied for x = αx∗ + (1 − α)x+ with η ≤ α < 1, according to (5.7). Since η < 1, such an α always exists.
We now consider a strictly convex function, and assume that x∗ and y∗ are two distinct global minima, so that x∗ ≠ y∗ and f(x∗) = f(y∗). According to Definition 2.2, we have
f( αx∗ + (1 − α)y∗ ) < αf(x∗) + (1 − α)f(y∗) = f(x∗) = f(y∗) , ∀α ∈ ]0, 1[ ,
which contradicts the fact that x∗ is a global minimum.
We conclude this chapter with a discussion of the optimality conditions for quadratic problems (Definition 2.28). If Q is positive definite, the quadratic function f(x) = (1/2) xT Q x + gT x has a unique global minimum, given by
x∗ = −Q−1 g . (5.9)
The gradient is
∇f(y) = ( Λr yr + gr ; gn−r ) .
If gn−r 6= 0, the gradient is different from zero for any value of y, and the necessary
optimality condition is never verified. Now, if gn−r = 0, the variables yn−r do
not affect the objective function. We fix yn−r to any arbitrary value and solve
the problem for yr :
min f(yr) = (1/2) yrT Λr yr + grT yr . (5.16)
As Λr is positive definite, the first result of this theorem applies, and
yr∗ = −Λr−1 gr . (5.17)
The last result of Theorem 5.10 has a geometric interpretation. The Schur de-
composition actually identifies one subspace where the quadratic function is strictly
convex (the subspace corresponding to the positive eigenvalues), and one subspace
where the function is linear (the subspace corresponding to zero eigenvalues). In this
latter subspace, in order to guarantee that the function is bounded, the linear part
must be constant, which corresponds to the condition gn−r = 0 (see Example 5.2).
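The case analysis of Theorem 5.10 can be sketched in code. The function below is a hypothetical helper, written under the assumption that Q is symmetric positive semidefinite: it diagonalizes Q, detects the gn−r ≠ 0 case (unbounded), and otherwise solves in the strictly convex subspace as in (5.17).

```python
import numpy as np

def quadratic_min(Q, g, tol=1e-10):
    """min (1/2) x^T Q x + g^T x with Q symmetric positive semidefinite.
    Returns a minimizer, or None if the problem is unbounded below."""
    eigval, V = np.linalg.eigh(Q)          # Q = V diag(eigval) V^T
    gy = V.T @ g                           # gradient coordinates in the eigenbasis
    zero = eigval < tol                    # subspace where f is linear
    if np.any(np.abs(gy[zero]) > tol):     # g_{n-r} != 0: unbounded below
        return None
    y = np.zeros_like(g)
    y[~zero] = -gy[~zero] / eigval[~zero]  # y_r* = -Lambda_r^{-1} g_r, cf. (5.17)
    return V @ y

# Strictly convex case: x* = -Q^{-1} g.
Q = np.diag([2.0, 4.0]); g = np.array([-2.0, 4.0])
assert np.allclose(quadratic_min(Q, g), [1.0, -1.0])

# Zero eigenvalue with g_{n-r} = 0: a minimizer exists (not unique).
Q = np.diag([2.0, 0.0]); g = np.array([-2.0, 0.0])
assert np.allclose(quadratic_min(Q, g), [1.0, 0.0])

# Zero eigenvalue with g_{n-r} != 0: unbounded below.
Q = np.diag([2.0, 0.0]); g = np.array([-2.0, 1.0])
assert quadratic_min(Q, g) is None
```
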
5.3 Exercises
For the following optimization problems:
1. Calculate the gradient and the Hessian of the objective function.
2. Identify the critical points.
3. Eliminate those that do not satisfy the necessary optimality conditions.
4. Identify those that satisfy the sufficient optimality conditions.
Exercise 5.1. min_{x∈R2} x1² + x2² .
Exercise 5.2. min_{x∈R2} (1/3) x1³ + x2³ − x1 − x2 .
Exercise 5.3. min_{x∈R} x² + 1/(x − 2) .
Exercise 5.4. min_{x∈R2} x1⁶ − 3x1⁴x2² + 3x1²x2⁴ − x2⁶ .
Exercise 5.5. min_{x∈R2} f(x), where f is defined by one of the functions of Exercise 2.2.
Chapter 6
Constrained optimization
Contents
6.1 Convex constraints . . . . . . . . . . . . . . . . . . . . . . 128
6.2 Lagrange multipliers: necessary conditions . . . . . . . . 133
6.2.1 Linear constraints . . . . . . . . . . . . . . . . . . . . . . 133
6.2.2 Equality constraints . . . . . . . . . . . . . . . . . . . . . 137
6.2.3 Equality and inequality constraints . . . . . . . . . . . . . 142
6.3 Lagrange multipliers: sufficient conditions . . . . . . . . 152
6.3.1 Equality constraints . . . . . . . . . . . . . . . . . . . . . 153
6.3.2 Inequality constraints . . . . . . . . . . . . . . . . . . . . 154
6.4 Sensitivity analysis . . . . . . . . . . . . . . . . . . . . . . 159
6.5 Linear optimization . . . . . . . . . . . . . . . . . . . . . . 165
6.6 Quadratic optimization . . . . . . . . . . . . . . . . . . . . 171
6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
∇f(x∗)T d ≥ 0 . (6.1)
Since d = limk(xk − x∗), and ∇f(x∗)T d < 0, there exists an index K such that (xk − x∗)T ∇f(x∗) < 0 for all k ≥ K. In addition, the term o(‖xk − x∗‖) can be made as small as desired by making k sufficiently large (see Theorem 2.11 or Theorem 5.1 for a more formal analysis of this result). Therefore, there exists an index k large enough that f(xk) < f(x∗), which contradicts the local optimality of x∗.
This general result does not take into account a possible structure in the con-
straints. We now propose optimality conditions for specific problems.
min f(x) ,
x∈X
Proof. We assume by contradiction that (6.3) is not satisfied. In this case, according
to Definition 2.10, the direction d = x − x∗ is a descent direction. According to
Theorem 2.11, there exists η > 0 such that
an acute angle with the gradient, as illustrated in Figure 6.1. When the convex set
has a particular structure, the necessary optimality conditions can be simplified, as
shown in Example 6.4.
Figure 6.1: Illustration of the necessary optimality condition (6.3)
min_{x∈X⊂Rn} f(x)
with
X = { x | ℓi ≤ xi ≤ ui , i = 1, . . . , n } , (6.5)
the condition (6.3) applied to points x that differ from x∗ only in component i gives
(∂f(x∗)/∂xi) (xi − x∗i) ≥ 0 . (6.6)
∂f(x∗)/∂xi ≥ 0 .
2. x∗i = ui. If we choose
xi = ui − (ui − ℓi)/2 = (ui + ℓi)/2 ,
x is feasible. Moreover, xi − x∗i = −(ui − ℓi)/2 < 0. The condition (6.6) implies that
∂f(x∗)/∂xi ≤ 0 .
3. ℓi < x∗i < ui. If we choose
xi = (ui + x∗i)/2 ,
x is feasible. Moreover, xi − x∗i = (ui − x∗i)/2 > 0 because x∗i < ui. The condition (6.6) implies that
∂f(x∗)/∂xi ≥ 0 .
If we choose
xi = (ℓi + x∗i)/2 ,
x is feasible. Moreover, xi − x∗i = (ℓi − x∗i)/2 < 0 because ℓi < x∗i. The condition (6.6) implies that
∂f(x∗)/∂xi ≤ 0 .
By combining these two results, we get
∂f(x∗)/∂xi = 0 .
Then, in the case of bound constraints defined by (6.5), the necessary optimality conditions can be written as
∂f(x∗)/∂xi ≥ 0 if x∗i = ℓi ,
∂f(x∗)/∂xi ≤ 0 if x∗i = ui ,
∂f(x∗)/∂xi = 0 if ℓi < x∗i < ui ,
for any i such that ℓi < ui . Finally, let us note that, in the case where ℓi = ui , each
feasible x is such that xi = ℓi = ui = x∗i and the condition (6.6) is trivially satisfied,
regardless of the value of ∂f(x∗ )/∂xi .
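These componentwise conditions are straightforward to check mechanically. A small sketch (the helper function is hypothetical, not from the book):

```python
def satisfies_bound_conditions(grad, x, lower, upper, tol=1e-9):
    """First-order conditions for min f(x) over {x | l_i <= x_i <= u_i}."""
    for gi, xi, li, ui in zip(grad, x, lower, upper):
        if li == ui:                  # fixed variable: condition trivially holds
            continue
        if abs(xi - li) <= tol:       # active lower bound: derivative >= 0
            if gi < -tol:
                return False
        elif abs(xi - ui) <= tol:     # active upper bound: derivative <= 0
            if gi > tol:
                return False
        elif abs(gi) > tol:           # strictly inside: derivative must vanish
            return False
    return True

# Example below: x* = (0.7, 0), grad f(x*) = (1.4, 0) on [0.7, 2] x [-1, 1].
assert satisfies_bound_conditions([1.4, 0.0], [0.7, 0.0], [0.7, -1.0], [2.0, 1.0])
# A negative derivative at an active lower bound violates the conditions.
assert not satisfies_bound_conditions([-1.4, 0.0], [0.7, 0.0], [0.7, -1.0], [2.0, 1.0])
```
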
Consider the problem
min_{x∈R2} x1² + x2²
subject to
0.7 ≤ x1 ≤ 2
−1 ≤ x2 ≤ 1 .
The solution is x∗ = (0.7, 0)T and ∇f(x∗) = (1.4, 0)T. Since x∗1 = ℓ1 = 0.7, we have ∂f(x∗)/∂x1 ≥ 0. Since ℓ2 < x∗2 < u2, we have ∂f(x∗)/∂x2 = 0.
Figure 6.2(a) illustrates the problem
min_{x∈R2} x1² + x2²
subject to
−2 ≤ x1 ≤ −0.7
−1 ≤ x2 ≤ 1 .
The solution is x∗ = (−0.7, 0)T and ∇f(x∗) = (−1.4, 0)T. Since x∗1 = u1 = −0.7, we have ∂f(x∗)/∂x1 ≤ 0. Since ℓ2 < x∗2 < u2, we have ∂f(x∗)/∂x2 = 0.
Figure 6.2: Illustration of the necessary optimality condition for bound constraints
Theorem 6.5 (Sufficient optimality conditions for convex constraints – I). Consider
the optimization problem
min f(x) ,
x∈X
f(x) − f(x∗ ) ≥ 0 , ∀x ∈ X ,
min_x f(x) = (1/2) (x − z)T (x − z) subject to x ∈ X .
Since f is convex and ∇f(x) = x − z, a necessary and sufficient condition for x∗ to be the projection of z on X is
(x∗ − z)T (x − x∗) ≥ 0 , ∀x ∈ X . (6.7)
Theorem 6.7 (Optimality conditions for convex constraints – II). Consider the
optimization problem
min f(x) ,
x∈X
Proof. Consider z(α) = x∗ − α∇f(x∗). According to (6.7), we have PX(z(α)) = x∗ for all α > 0 if and only if
( x∗ − z(α) )T (x − x∗) ≥ 0 , ∀x ∈ X , ∀α > 0 ,
or
( x∗ − x∗ + α∇f(x∗) )T (x − x∗) ≥ 0 , ∀x ∈ X , ∀α > 0 .
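For bound constraints, the projection is a componentwise clipping, and the fixed-point characterization can be tested directly. A sketch (the quadratic objective f(x) = x1² + x2² is an assumption for illustration; the bounds are those of the earlier example):

```python
import numpy as np

def project_box(z, lower, upper):
    # The projection on X = {x | l <= x <= u} is componentwise clipping.
    return np.clip(z, lower, upper)

# Assumed objective f(x) = x1^2 + x2^2 on X = [0.7, 2] x [-1, 1].
lower = np.array([0.7, -1.0])
upper = np.array([2.0, 1.0])
grad = lambda x: 2.0 * x

x_star = np.array([0.7, 0.0])
for alpha in [0.1, 1.0, 10.0]:
    # Fixed point: x* = P_X(x* - alpha grad f(x*)) for every alpha > 0.
    assert np.allclose(project_box(x_star - alpha * grad(x_star), lower, upper), x_star)

# An interior point with a nonzero gradient is not a fixed point.
x_bad = np.array([1.0, 0.0])
assert not np.allclose(project_box(x_bad - 0.1 * grad(x_bad), lower, upper), x_bad)
```
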
to handle. The most common cases, where the linearized cone is equivalent to the
tangent cone, are optimization problems with linear constraints (Theorem 3.27) and
linearly independent constraints (Theorem 3.28). Therefore, we present the results for
only these cases. It is also possible to develop optimality conditions by considering the
linearized cone from a general point of view. The details are described in Mangasarian
(1979) and Nocedal and Wright (1999). The necessary optimality conditions are
generally called Karush-Kuhn-Tucker conditions or KKT conditions. In fact, for
many years, they were called Kuhn-Tucker conditions, following the article by Kuhn
and Tucker (1951). It later turned out that Karush (1939) had already formulated
them independently. John (1948) proposed a generalization a decade later (Theorem
6.12).
Note that the theory of Lagrange multipliers extends beyond the optimality conditions presented in this book and that it can also be adapted to nondifferentiable optimization. We refer the interested reader to Bertsekas (1982) and Rockafellar (1993).
In this text, we adopt the approach of Bertsekas (1999), who first presents these
conditions in the case of linear constraints, and then for problems including linearly
independent equality constraints. The proof provides intuitions that are reused in
the development of algorithms. Subsequently, we generalize the result for problems
that also include inequality constraints.
min_{x∈Rn} f(x)
subject to
Ax = b (6.10)
Proof. We employ the technique for the elimination of constraints described in Section
3.4 to convert the optimization problem with constraints into an optimization problem
without constraint (3.70). To simplify the proof, we can assume that the variables
are arranged in a way that the m variables to eliminate are the m first ones. Then,
P = I in (3.70) and the minimization problem is
min_{xN∈Rn−m} g(xN) = f( B−1(b − NxN) ; xN ) . (6.13)
If x∗ = (x∗B , x∗N ) is a local minimum of the problem with constraints, then x∗N is a
local minimum of the problem without constraints and the necessary condition (5.1)
applies to (6.13). By using chain rule differentiation (see (C.6) of Theorem C.3) we
obtain:
∇g(x∗N ) = −NT B−T ∇B f(x∗ ) + ∇N f(x∗ ) = 0 , (6.14)
where ∇B f(x∗ ) and ∇N f(x∗ ) represent the gradient of f with regard to the variables
xB and xN, respectively. If we define
λ∗ = −B−T ∇Bf(x∗) , (6.15)
we obtain
∇Bf(x∗) + BT λ∗ = 0 , (6.16)
and, from (6.14),
∇Nf(x∗) + NT λ∗ = 0 . (6.17)
Equations (6.16) and (6.17) form (6.11), which proves the first-order result.
To demonstrate the second-order result, we first derive (6.11) to obtain
Then,
Consider the problem
min_{x∈R4} x1² + x2² + x3² + x4²
subject to
x1 + x2 + x3 = 1
x1 − x2 + x4 = 1 .
The solution to this problem is
x∗ = ( 2/3 , 0 , 1/3 , 1/3 )T and ∇f(x∗) = ( 4/3 , 0 , 2/3 , 2/3 )T .
By decomposing
A = ( B N ) = ( 1 1 1 0 ; 1 −1 0 1 ) ,
we obtain
λ∗ = −B−T ∇Bf(x∗) = −( 1/2 1/2 ; 1/2 −1/2 ) ( 4/3 ; 0 ) = ( −2/3 ; −2/3 ) ,
and (6.11) is expressed as
( 4/3 ; 0 ; 2/3 ; 2/3 ) + ( 1 1 ; 1 −1 ; 1 0 ; 0 1 ) ( −2/3 ; −2/3 ) = ( 4/3 ; 0 ; 2/3 ; 2/3 ) + ( −4/3 ; 0 ; −2/3 ; −2/3 ) = ( 0 ; 0 ; 0 ; 0 ) .
= 3y3² + 3y4² ≥ 0 .
min_x Fk(x) = f(x) + (k/2) ‖h(x)‖² + (α/2) ‖x − x∗‖² (6.28)
subject to
x ∈ Sε (6.29)
has a solution in Sε , denoted by xk . One should keep in mind that the problem (6.28)–
(6.29) is subject to constraints. Nevertheless, we demonstrate that, for a sufficiently
1 Since the constraints are linearly independent, the constraint qualification is satisfied and the
linearized cone corresponds to the tangent cone.
large k, the solution lies strictly inside Sε and is therefore a solution to the problem
(6.28) without constraint (according to Theorem 3.5). The role of Sε is to ensure that
no local minima other than x∗ are found.
We have
Fk (xk ) ≤ f(x∗ ) . (6.30)
Indeed,
Fk(x∗) = f(x∗) + (k/2) ‖h(x∗)‖² + (α/2) ‖x∗ − x∗‖² = f(x∗) ,
because h(x∗ ) = 0. Then, when k → ∞, the value of Fk (xk ) remains bounded. We
show by contradiction that this implies that
lim_{k→∞} h(xk) = 0 . (6.31)
Indeed, if (6.31) is not satisfied, then the term (k/2) ‖h(xk)‖² tends towards +∞. Since Fk(xk) remains bounded, this signifies that either f(xk) or ‖xk − x∗‖² tends towards −∞. However f(xk) is bounded from below by f(x∗) over Sε (according to (6.27)) and ‖xk − x∗‖² is positive, which leads to a contradiction and proves (6.31).
Let x̂ be a limit point of the sequence (xk)k (Definition B.20). According to (6.31), we have h(x̂) = 0, so that x̂ is feasible for the original problem and thus f(x∗) ≤ f(x̂). Moreover, according to (6.30),
lim_{k→∞} Fk(xk) = f(x̂) + (α/2) ‖x̂ − x∗‖² ≤ f(x∗) . (6.32)
Then,
f(x̂) + (α/2) ‖x̂ − x∗‖² ≤ f(x̂) . (6.33)
As a result, (α/2) ‖x̂ − x∗‖² = 0 and x̂ = x∗. The sequence (xk)k converges to x∗. According to Definition B.19, there exists k̂ such that
‖xk − x∗‖ < ε , ∀k ≥ k̂ ,
where ε is the radius of the sphere Sε involved in the definition of the local minimum
(6.27). The point xk is inside Sε when k is sufficiently large.
According to Theorem 1.16, xk is a local minimum of the unconstrained prob-
lem (6.28). We can apply the necessary optimality conditions of an unconstrained
problem, given by Theorem 5.1:
Since the constraints are linearly independent by hypothesis, the matrix ∇h(x∗)T ∇h(x∗) is of full rank and invertible. By continuity of ∇h (indeed, h is continuously differentiable), there exists a k sufficiently large such that ∇h(xk)T ∇h(xk) is also invertible. By multiplying (6.37) by ( ∇h(xk)T ∇h(xk) )−1 ∇h(xk)T, we obtain
k h(xk) = −( ∇h(xk)T ∇h(xk) )−1 ∇h(xk)T ( ∇f(xk) + α(xk − x∗) ) . (6.38)
When k → ∞, we define
λ∗ = lim_{k→∞} k h(xk) . (6.39)
We also have
lim_{k→∞} yk = y − ∇h(x∗) ( ∇h(x∗)T ∇h(x∗) )−1 ∇h(x∗)T y = y , (6.42)
because ∇h(x∗)T y = 0. Then, from (6.36),
ykT ∇²F(xk) yk = ykT ( ∇²f(xk) + k Σ_{i=1}^m hi(xk) ∇²hi(xk) + k ∇h(xk)∇h(xk)T ) yk + α ykT yk .
As ∇h(xk)T yk = 0,
ykT ∇²F(xk) yk = ykT ( ∇²f(xk) + k Σ_{i=1}^m hi(xk) ∇²hi(xk) ) yk + α ykT yk .
If (6.24) is not satisfied, (6.44) is not valid for all α > 0. Indeed, if yT ∇²xxL(x∗, λ∗) y < 0, (6.44) is not satisfied for the values of α such that
α < − yT ∇²xxL(x∗, λ∗) y / yT y .
Since α can be arbitrarily chosen, this concludes the proof.
Note that for linear constraints, h(x) = Ax − b, ∇h(x) = AT and (6.23) is written as
( ∇f(x∗) + AT λ∗ ; Ax∗ − b ) = 0 ,
which is equivalent to (6.11) of Theorem 6.8.
Example 6.11 (Karush-Kuhn-Tucker: equality constraints). Consider the optimiza-
tion problem
min x1 + x2 (6.45)
x∈R2
subject to
x21 + (x2 − 1)2 − 1
h(x) = = 0. (6.46)
−x21 + x2
The set of constraints is illustrated in Figure 6.5. We have
∇f(x) = ( 1 ; 1 ) and ∇h(x) = ( 2x1 , −2x1 ; 2x2 − 2 , 1 ) . (6.47)
The Lagrangian function of the problem is
L(x, λ) = x1 + x2 + λ1( x1² + (x2 − 1)² − 1 ) + λ2( −x1² + x2 ) (6.48)
and
∇L(x, λ) = ( 1 + 2λ1x1 − 2λ2x1 ; 1 + 2λ1(x2 − 1) + λ2 ; x1² + (x2 − 1)² − 1 ; −x1² + x2 ) . (6.49)
Figure 6.5: Illustration of the KKT conditions
The point xa = (1, 1)T is a local minimum of the problem. The constraints are linearly independent in xa because the matrix
∇h(xa) = ( 2 , −2 ; 0 , 1 ) (6.50)
is of full rank. For λ∗ = ( −3/2 , −1 )T, the condition ∇L(xa, λ∗) = 0 is satisfied. Note that the linearized cone is empty in xa and that the second-order condition is trivially satisfied.
The point xb = (0, 0)T is also a local minimum (in fact, we are dealing with a global minimum of the problem). We have
∇L(xb, λ) = ( 1 ; 1 − 2λ1 + λ2 ; 0 ; 0 ) , (6.52)
which cannot be zero for any λ. The necessary condition is not satisfied in this case. Indeed, the constraints are not linearly independent in xb because the matrix
∇h(xb) = ( 0 , 0 ; −2 , 1 ) (6.53)
is not of full rank.
We now present the result of John (1948) for equality constraints, which consti-
tutes a generalization of Theorem 6.10.
Proof. In the case where the constraints are linearly independent, Theorem 6.10 applies and (6.54) is trivially obtained with µ∗0 = 1. In the case where the constraints are linearly dependent, then there exist λ∗1, . . . , λ∗m, not all zero, such that
Σ_{i=1}^m λ∗i ∇hi(x∗) = 0 ,
and (6.54) is satisfied with µ∗0 = 0.
µ∗j ≥ 0 , j = 1, . . . , p , (6.56)
and
µ∗j gj (x∗ ) = 0 , j = 1, . . . , p , (6.57)
where L is the Lagrangian function (Definition 4.3). If f, h and g are twice
differentiable, then
yT ∇²xxL(x∗, λ∗, µ∗) y ≥ 0 , ∀y ≠ 0 such that
yT ∇hi(x∗) = 0 , i = 1, . . . , m , (6.58)
yT ∇gi(x∗) = 0 , i = 1, . . . , p such that gi(x∗) = 0 .
Proof. We consider the active inequality constraints at the solution as equality constraints and ignore the other constraints in order to obtain the problem min_{x∈Rn} f(x) subject to h(x) = 0, gi(x) = 0, for any i ∈ A(x∗), where A(x∗) is the set of active constraints in x∗ (Definition 3.4). According to Theorem 3.5 where Y = { x | h(x) = 0 }, x∗ is a local minimum of the optimization problem with equality constraints, and the first-order condition of Theorem 6.10 applies,
with µ∗i = 0 if i ∉ A(x∗). We thus get (6.55). Similarly, the second-order condition of Theorem 6.10 implies that
and (6.58) is satisfied. We note that (6.57) is trivially satisfied. Indeed, if the con-
straint gj (x∗ ) ≤ 0 is active, we have gj (x∗ ) = 0. If on the other hand it is not, we
have µ∗j = 0. We now need only demonstrate (6.56).
We take the same proof as Theorem 6.10, by defining the penalty function for inequality constraints with
g+i(x) = max( 0, gi(x) ) , i = 1, . . . , p . (6.62)
Since (g+j(x))² is differentiable and
∇(g+j(x))² = 2 g+j(x) ∇gj(x) , (6.64)
we can invoke the same development as in the proof of Theorem 6.10. As in (6.39), we have
µ∗j = lim_{k→∞} k g+j(xk) , j = 1, . . . , p ,
and since g+i(x) ≥ 0, we get (6.56).
Figure 6.6 represents the feasible set defined by the constraints
g1(x) = (x1 − 3)² + x2² − 9 ≤ 0
g2(x) = x1² + (x2 − 3)² − 9 ≤ 0
g3(x) = x1² − 1 − x2 ≤ 0 .
Figure 6.6: Karush-Kuhn-Tucker optimality conditions
The point x∗ = (0, 0)T is a local minimum of this problem. The constraints g1(x) ≤ 0 and g2(x) ≤ 0 are active in x∗, whereas the constraint g3(x) ≤ 0 is not. The point x∗ is also a local minimum of the problem where the two active inequality constraints are replaced by equality constraints, and the inactive constraint is ignored, that is
min x1 + x2 (6.70)
x∈R2
subject to
(x1 − 3)2 + x22 = 9
(6.71)
x21 + (x2 − 3)2 = 9 ,
or, equivalently
h1 (x) = x21 − 6x1 + x22 = 0
(6.72)
h2 (x) = x21 − 6x2 + x22 = 0 .
The gradient of the constraints is written as
∇h(x) = ( 2x1 − 6 , 2x1 ; 2x2 , 2x2 − 6 ) . (6.73)
According to the KKT conditions (Theorem 6.10), λ∗ is given by (6.25) and we have
λ∗ = −( ( −6 , 0 ; 0 , −6 )T ( −6 , 0 ; 0 , −6 ) )−1 ( −6 , 0 ; 0 , −6 )T ( 1 ; 1 ) = ( 1/6 ; 1/6 ) .
Then, the necessary first-order KKT condition for the initial problem is written as
∇xL(x∗, µ∗) = ( 1 − 6µ∗1 + 2(µ∗1 + µ∗2 + µ∗3)x∗1 ; 1 − 6µ∗2 + 2(µ∗1 + µ∗2)x∗2 − µ∗3 ) = 0 ,
where L is defined by (6.69), x∗ = (0, 0)T and
µ∗ = ( λ∗1 ; λ∗2 ; 0 ) = ( 1/6 ; 1/6 ; 0 ) .
Since the linearized cone is empty in x∗ , the necessary second-order KKT condition
is trivially satisfied.
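The first-order KKT conditions of this example can be verified numerically. In the sketch below, the constraint g3(x) = x1² − 1 − x2 is reconstructed from Figure 6.6; the rest follows the displayed data.

```python
import numpy as np

# Constraints of the example; g3 is reconstructed from Figure 6.6.
g = [lambda x: (x[0] - 3)**2 + x[1]**2 - 9,
     lambda x: x[0]**2 + (x[1] - 3)**2 - 9,
     lambda x: x[0]**2 - 1 - x[1]]
grad_g = [lambda x: np.array([2*x[0] - 6, 2*x[1]]),
          lambda x: np.array([2*x[0], 2*x[1] - 6]),
          lambda x: np.array([2*x[0], -1.0])]
grad_f = np.array([1.0, 1.0])

x_star = np.array([0.0, 0.0])
mu = np.array([1/6, 1/6, 0.0])

# Stationarity: grad f(x*) + sum_j mu_j grad g_j(x*) = 0.
residual = grad_f + sum(m * dg(x_star) for m, dg in zip(mu, grad_g))
assert np.allclose(residual, 0.0)

# Dual feasibility and complementarity: mu_j >= 0 and mu_j g_j(x*) = 0.
assert (mu >= 0).all()
assert all(abs(m * gj(x_star)) < 1e-12 for m, gj in zip(mu, g))
```
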
L(x, µ) = (1/2) x1² − (1/2) x2² + µ(x2 − 1) .
The point x∗ = (0, 1)T is a local minimum of the problem. The constraint is active in this point. The first-order condition (6.55) is expressed as
∇xL(x∗, µ∗) = ( x∗1 ; −x∗2 + µ∗ ) = ( 0 ; −1 + µ∗ ) = 0
and is satisfied with µ∗ = 1. Note that if we choose a feasible direction y, for
instance y = (0, −1)ᵀ, the condition (6.76) is not satisfied. It is only valid for y
orthogonal to the active constraints.
Example 6.16 (Physical interpretation of KKT conditions). We consider the opti-
mization problem
min_{x∈R²} x1
subject to
h1(x) = x1 − sin x2 = 0 .
The equality constraint is represented in Figure 6.8(a). The point xa = (−1, 3π/2)ᵀ
is a local minimum of the problem. We have
∇f(x) = ( 1, 0 )ᵀ , ∇h(x) = ( 1, − cos x2 )ᵀ .
[Figure 6.8: (a) the equality constraint h(x) = x1 − sin x2 = 0, with the points xa
and xb and the multiplier λ∗ = −1; (b) the inequality constraint
g(x) = x1 − sin x2 ≤ 0, with the point xa and the direction −∇f(xa).]
Using the fact that −∇f(xa ) is the direction with the steepest descent (Theorem
2.13), we can interpret this condition as an equilibrium between the “force” −∇f(xa ),
which drives the solution to lower values of the objective function, and the force
−λ1 ∇h(xa ), which maintains the solution on the constraint. If xa is optimal, this
signifies that the forces are balanced and that their sum is zero. In our example, since
the “force” −∇f(xa ) acts in the same direction as −∇h(xa ), the multiplier λ∗1 should
be negative so that the two “forces” can compensate each other.
If we now take the point xb = (sin 3, 3)ᵀ, the “forces” −∇f(xb) and −λ∇h(xb)
are not balanced. This is not only the case when λ = λ1∗, as shown in Figure 6.8(a),
but for all λ.
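The force-balance interpretation is easy to check numerically; a short sketch in plain Python (an illustration, not the book's material):

```python
import math

# "Force balance" for min x1 s.t. h(x) = x1 - sin(x2) = 0:
# grad f = (1, 0) and grad h(x) = (1, -cos(x2)).
def grad_h(x):
    return (1.0, -math.cos(x[1]))

# At xa = (-1, 3*pi/2): cos(3*pi/2) = 0, so grad h(xa) = (1, 0), and
# lambda = -1 balances the forces: grad f + lambda * grad h = 0.
xa = (-1.0, 3 * math.pi / 2)
ga = grad_h(xa)
lam = -1.0
residual_a = (1.0 + lam * ga[0], 0.0 + lam * ga[1])

# At xb = (sin 3, 3), the first component forces lambda = -1, but then
# the second component equals -cos(3) != 0: no lambda balances them.
xb = (math.sin(3.0), 3.0)
gb = grad_h(xb)
residual_b = (1.0 + (-1.0) * gb[0], 0.0 + (-1.0) * gb[1])

print(residual_a)  # ~ (0, 0): the condition holds at xa
print(residual_b)  # second component != 0: forces unbalanced at xb
```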
Consider now the problem
min_{x∈R²} x1
subject to
g1(x) = x1 − sin x2 ≤ 0 .
The inequality constraint is represented in Figure 6.8(b). The point xa = (−1, 3π/2)ᵀ
is not a local minimum of the problem. In fact, the direction −∇f(xa) is feasible and
the associated “force” drives the solution towards feasible values.
Note that the constraint is active in xa and that the equation
∇f(xa) + µ1 ∇g1(xa) = 0
is satisfied for µ1 = −1. The interpretation of the forces signifies that the “force”
−µ1 ∇g1(xa) drives the solution to the right and prevents it from going inside the
feasible domain, which is incompatible with the definition of the inequality constraint.
The condition (6.56) in Theorem 6.13, µ∗ ≥ 0, signifies that, for an inequality con-
straint, the “force” can only act in a single direction, so as to prevent the points from
leaving the feasible domain, but not from going inside. For the equality constraints,
the “force” can act in both directions and there is no condition on the sign of the
multipliers.
Consider problem (P1):
min_{x∈R^n} f(x)
subject to
gi(x) ≤ 0, i = 1, . . . , m. (6.78)
The second derivatives of the Lagrangian are
∂²L/∂xj∂xk (x, µ) = ∂²f/∂xj∂xk (x) + Σ_{i=1}^m µi ∂²gi/∂xj∂xk (x), (6.81)
for j, k = 1, . . . , n.
Let x∗ be a local optimum of problem P1. Therefore, the first order necessary
optimality (KKT) conditions say that, under appropriate assumptions, there is a
unique µ∗ ∈ R^m, µ∗ ≥ 0, such that
∂L/∂xj (x∗, µ∗) = ∂f/∂xj (x∗) + Σ_{i=1}^m µi∗ ∂gi/∂xj (x∗) = 0 , j = 1, . . . , n,
and
µi∗ gi(x∗) = 0 , i = 1, . . . , m.
Assume that the first p constraints are active at x∗ , and the others not, that is
gi (x∗ ) = 0, i = 1, . . . , p, (6.83)
and
gi (x∗ ) < 0, i = p + 1, . . . , m. (6.84)
Therefore, we obtain
∂L/∂xj (x∗, µ∗) = ∂f/∂xj (x∗) + Σ_{i=1}^p µi∗ ∂gi/∂xj (x∗) = 0, (6.85)
and
µi∗ ≥ 0, i = 1, . . . , p,
µi∗ = 0, i = p + 1, . . . , m. (6.86)
Moreover, for each d ∈ R^n such that, for each i = 1, . . . , p,
Σ_{k=1}^n dk ∂gi/∂xk (x∗) = 0, (6.87)
we have
Σ_{j=1}^n Σ_{k=1}^n ( ∂²f/∂xj∂xk (x∗) + Σ_{i=1}^p µi∗ ∂²gi/∂xj∂xk (x∗) ) dj dk ≥ 0. (6.88)
Consider now problem (P2), obtained from problem (P1) by transforming the
inequality constraints into equality constraints using slack variables, as suggested in
Section 1.2.2:
min_{x∈R^n, y∈R^m} f(x) (6.89)
subject to
hi(x, y) = gi(x) + yi² = 0, i = 1, . . . , m. (6.90)
For each i = 1, . . . , m, the first derivatives of the constraint are
∂hi/∂xk = ∂gi/∂xk , k = 1, . . . , n, (6.91)
∂hi/∂yi = 2yi , (6.92)
and
∂hi/∂yℓ = 0, ℓ = 1, . . . , m, ℓ ≠ i. (6.93)
The Lagrangian of (P2) is
L(x, y, λ) = f(x) + Σ_{i=1}^m λi ( gi(x) + yi² ) ,
so that
∂L/∂xj (x, y, λ) = ∂f/∂xj (x) + Σ_{i=1}^m λi ∂gi/∂xj (x) ,
for j = 1, . . . , n, and
∂L/∂yi (x, y, λ) = 2 λi yi , (6.96)
for i = 1, . . . , m. The second derivatives are
∂²L/∂xj∂xk (x, y, λ) = ∂²f/∂xj∂xk (x) + Σ_{i=1}^m λi ∂²gi/∂xj∂xk (x), (6.97)
for j, k = 1, . . . , n,
∂²L/∂yi² (x, y, λ) = 2 λi , (6.98)
for i = 1, . . . , m,
∂²L/∂yi∂yℓ (x, y, λ) = 0, (6.99)
for i, ℓ = 1, . . . , m, i ≠ ℓ, and
∂²L/∂xj∂yi (x, y, λ) = 0, (6.100)
for j = 1, . . . , n, i = 1, . . . , m.
Let x∗ and y∗ be local optima of problem P2. Therefore, the first order necessary
optimality (KKT) conditions say that there is a unique λ∗ ∈ Rm such that
∂L/∂xj (x∗, y∗, λ∗) = ∂f/∂xj (x∗) + Σ_{i=1}^m λi∗ ∂gi/∂xj (x∗) = 0, (6.101)
and
2 λi∗ yi∗ = 0 , i = 1, . . . , m. (6.102)
Moreover, for each
d = ( dxᵀ , dyᵀ )ᵀ ∈ R^{n+m}
such that, for each i = 1, . . . , m,
Σ_{k=1}^n (dx)k ∂gi/∂xk (x∗) + 2 (dy)i yi∗ = 0, (6.103)
we have
Σ_{j=1}^n Σ_{k=1}^n ( ∂²f/∂xj∂xk (x∗) + Σ_{i=1}^m λi∗ ∂²gi/∂xj∂xk (x∗) ) (dx)j (dx)k
+ 2 Σ_{i=1}^m λi∗ (dy)i² ≥ 0. (6.104)
Now, assume that the constraints are numbered so that yi∗ = 0 for i = 1, . . . , p,
and yi∗ ≠ 0 for i = p + 1, . . . , m. We have
Σ_{j=1}^n Σ_{k=1}^n ( ∂²f/∂xj∂xk (x∗) + Σ_{i=1}^p λi∗ ∂²gi/∂xj∂xk (x∗) ) (dx)j (dx)k
+ 2 Σ_{i=1}^p λi∗ (dy)i² ≥ 0. (6.108)
In particular, select k between 1 and p, set dx = 0 and (dy)i = 0 for each i ≠ k, and
(dy)k = 1. Therefore, (6.109) implies λk∗ ≥ 0, for any k = 1, . . . , p.
Based on these results, we can prove that (x∗, µ∗) verifies the KKT conditions
of problem P1, if and only if (x∗, y∗, λ∗) verifies the KKT conditions of problem P2,
where λ∗ = µ∗ and yi∗ = √(−gi(x∗)), i = 1, . . . , m.
P1 =⇒ P2 Consider (x∗, µ∗) that verifies the KKT conditions of problem P1, such
that gi(x∗) = 0 for i = 1, . . . , p and gi(x∗) < 0 for i = p + 1, . . . , m. Define y∗
such that
yi∗ = 0 , i = 1, . . . , p,
yi∗ = √(−gi(x∗)) , i = p + 1, . . . , m, (6.110)
and define λ∗ = µ∗. Then (x∗, y∗, λ∗) verifies the KKT conditions of problem P2.
Σ_{j=1}^n Σ_{k=1}^n ( ∂²f/∂xj∂xk (x∗) + Σ_{i=1}^p λi∗ ∂²gi/∂xj∂xk (x∗) ) (dx)j (dx)k ≥ 0. (6.111)
Now, (6.108) results from (6.111) and (6.86), which says that λi∗ ≥ 0, for each i.
P2 =⇒ P1 Consider (x∗, y∗, λ∗) that verifies the KKT conditions of problem P2, such
that yi∗ = 0 for i = 1, . . . , p and yi∗ ≠ 0 for i = p + 1, . . . , m. Then (x∗, µ∗), where
µ∗ = λ∗, verifies the KKT conditions of problem P1.
• The constraints (6.78) are direct consequences of (6.90).
• The first order conditions (6.105) are exactly the same as (6.85).
• The conditions (6.86) on the Lagrange multipliers are verified, as for P2, λ∗i ≥ 0,
for i = 1, . . . , p and λ∗i = 0, i = p + 1, . . . , m (see the discussion above).
• Consider d that verifies (6.87). Define dx = d, and define dy such that (dy)i =
0, i = 1, . . . , p, and
(dy)i = − ( 1 / 2yi∗ ) Σ_{k=1}^n (dx)k ∂gi/∂xk (x∗) , i = p + 1, . . . , m. (6.112)
The augmented Lagrangian is
Lc(x, λ) = f(x) + λᵀ h(x) + (c/2) ‖h(x)‖² . (6.113)
The idea of the objective function (6.113) is to penalize points x that violate the
constraints, hence the name penalty parameter for c.
Proof. We first note that any solution to the augmented problem (6.113) is also a
solution to the original problem. We go back to a problem of unconstrained
optimization thanks to the augmented Lagrangian function, by showing that x∗ is a
strict local minimum of the problem
min_{x∈R^n} Lc(x, λ∗) . (6.117)
Similarly, we obtain
∇²xx Lc(x∗, λ∗) = ∇²xx L(x∗, λ∗) + c ∇h(x∗) ∇h(x∗)ᵀ . (6.118)
By applying the theorem for the formation of a positive definite matrix (Theorem
C.18), there exists ĉ such that (6.118) is positive definite for all c > ĉ. According to
Theorem 5.7, x∗ is a strict local minimum of the unconstrained problem (6.117) for
sufficiently large c.
According to Definition 1.6, there exists ε > 0 such that
Notes
• Theorem 6.10 can be demonstrated by using the same constraint elimination tech-
nique as for Theorem 6.8. The logic behind the demonstration is the same, but
the proof is more technical (see Bertsekas, 1999).
• No constraint qualifications appear in the sufficient optimality conditions, neither
linear independence nor any other.
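The mechanism in the proof above — adding c ∇h(x∗)∇h(x∗)ᵀ to a possibly indefinite Hessian makes it positive definite once c exceeds some ĉ — can be observed on a tiny instance. The sketch below (an illustration, not taken from the book) uses the toy problem min ½(x1² − x2²) subject to h(x) = x2 − 1 = 0, for which ∇²xx L = diag(1, −1) is indefinite, ∇h = (0, 1)ᵀ, and ĉ = 1:

```python
# Hessian of the augmented Lagrangian for min (x1^2 - x2^2)/2
# s.t. h(x) = x2 - 1 = 0 (a toy instance, not from the book):
# hess L = diag(1, -1), grad h = (0, 1), so
# hess L_c = diag(1, -1) + c * grad h grad h^T = diag(1, c - 1).
def hessian_Lc(c):
    return [[1.0, 0.0],
            [0.0, -1.0 + c]]

def is_positive_definite_2x2(M):
    # Sylvester's criterion for a symmetric 2x2 matrix
    return M[0][0] > 0 and M[0][0] * M[1][1] - M[0][1] * M[1][0] > 0

print(is_positive_definite_2x2(hessian_Lc(0.5)))  # False: c below c_hat = 1
print(is_positive_definite_2x2(hessian_Lc(2.0)))  # True: c above c_hat = 1
```

The threshold ĉ = 1 is exactly where the penalty curvature along ∇h overtakes the negative curvature of the Lagrangian.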
∇x L(x∗, λ∗, µ∗) = 0 (6.121)
h(x∗) = 0 (6.122)
g(x∗) ≤ 0 (6.123)
µ∗ ≥ 0 (6.124)
µj∗ gj(x∗) = 0 , j = 1, . . . , p (6.125)
µj∗ > 0 , ∀j ∈ A(x∗) (6.126)
Proof. We use slack variables (Definition 1.4) to obtain the following optimization
problem with equality constraints
min_{x∈R^n} f(x) (6.128)
subject to
hi(x) = 0 , i = 1, . . . , m,
gi(x) + zi² = 0 , i = 1, . . . , p , (6.129)
and let us define
zi∗ = √(−gi(x∗)) , (6.130)
such that gi(x∗) + (zi∗)² = 0 is trivially satisfied.
The Lagrangian function of this problem is
L̂(x, z, λ, µ) = f(x) + Σ_{i=1}^m λi hi(x) + Σ_{i=1}^p µi ( gi(x) + zi² ) (6.131)
and
∇x L̂(x, z, λ, µ) = ∇f(x) + Σ_{i=1}^m λi ∇hi(x) + Σ_{i=1}^p µi ∇gi(x) = ∇x L(x, λ, µ) (6.132)
and
∂L̂/∂zi (x, z, λ, µ) = 2 µi zi , i = 1, . . . , p . (6.133)
Moreover, by writing x̂ = (x, z)ᵀ, we have
∇²x̂x̂ L̂(x, z, λ, µ) = | ∇²xx L(x, λ, µ)    0                       |
                      | 0                  diag(2µ1, . . . , 2µp) | . (6.134)
From the hypothesis (6.125) and as per (6.130), we have µi∗ zi∗ = 0. Moreover,
from the hypothesis (6.121), we have
∇x̂ L̂(x∗, z∗, λ∗, µ∗) = 0 . (6.135)
and
yᵀ ∇gi(x∗) + 2 zi∗ wi = 0 , i = 1, . . . , p . (6.137)
Note that if i ∈ A(x∗), then zi∗ = 0 and (6.137) is written as yᵀ ∇gi(x∗) = 0. The
vector y always corresponds to the conditions of the hypothesis (6.127). We have
( yᵀ wᵀ ) ∇²x̂x̂ L̂(x∗, z∗, λ∗, µ∗) ( y , w )ᵀ
= yᵀ ∇²xx L(x∗, λ∗, µ∗) y + 2 Σ_{i=1}^p µi∗ wi²            from (6.134)
= yᵀ ∇²xx L(x∗, λ∗, µ∗) y + 2 Σ_{i∈A(x∗)} µi∗ wi²          from (6.125) .
From the hypothesis (6.127), the first term is positive if y 6= 0. From the hypothesis
(6.124), each term of the sum of the second term is non negative. If y = 0, there is
necessarily an i such that wi 6= 0. In order for (6.137) to be satisfied, we have that
z∗i = 0 and then i ∈ A(x∗ ). From (6.126), the corresponding term µ∗i w2i is positive.
The sufficient optimality conditions of Theorem 6.19 are satisfied for x∗, z∗, λ∗
and µ∗, and (x∗, z∗)ᵀ is a strict local minimum of (6.128)–(6.129). Consequently, x∗
is a strict local minimum of the initial problem.
The condition (6.126) is called the strict complementarity condition. The following
example illustrates its importance.
Example 6.21 (Importance of the strict complementarity condition). Consider the
problem
min_{x∈R²} ½ (x1² − x2²) (6.138)
subject to
x2 ≤ 0 (6.139)
for which the objective function is illustrated in Figure 6.7. We demonstrate that all
sufficient optimality conditions except (6.126) are satisfied for x∗ = (0, 0)ᵀ and
µ∗ = 0. We have
L(x, µ) = ½ (x1² − x2²) + µ x2 . (6.140)
Then,
∇x L(x, µ) = ( x1 , −x2 + µ )ᵀ
and (6.121) is satisfied when x∗ = (0, 0)ᵀ and µ∗ = 0. The other conditions are
trivially satisfied, and (6.126) is not satisfied because the constraint is active in x∗
and µ∗ = 0.
We now consider the point (0, −α), with α > 0. It is feasible and the objective
function is −½ α², which is strictly smaller than the value in x∗, which is therefore
not a local minimum. The Lagrangian of the problem written with a slack variable is
L̂(x, z, µ) = ½ (x1² − x2²) + µ (x2 + z²) . (6.144)
The Hessian ∇²x̂x̂ L̂ used in the proof is
∇²x̂x̂ L̂(x, z, µ) = diag( 1, −1, 2µ )
and
∇²x̂x̂ L̂(x∗, z∗, µ∗) = diag( 1, −1, 0 ) ,
and it is singular. As in the proof, let us take a vector belonging to the linearized
cone at (x∗, z∗)ᵀ of the problem with equality constraints, i.e.,
( y , w )ᵀ = ( 0, 0, γ )ᵀ .
For all γ, we have
( 0 0 γ ) diag( 1, −1, 0 ) ( 0, 0, γ )ᵀ = 0 ,
and the sufficient condition for the problem with equality constraints is not satisfied,
which prevents us from proving Theorem 6.20.
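Both failures in Example 6.21 are easy to check numerically; a short illustration in plain Python (not from the book):

```python
# Example 6.21: at x* = (0, 0) with mu* = 0, the Hessian used in the
# proof is diag(1, -1, 0). On the linearized-cone direction (0, 0, gamma)
# the quadratic form vanishes, so the sufficient condition fails.
H = [[1.0, 0.0, 0.0],
     [0.0, -1.0, 0.0],
     [0.0, 0.0, 0.0]]

def quad_form(H, v):
    return sum(v[i] * H[i][j] * v[j] for i in range(3) for j in range(3))

gamma = 7.0  # any value of gamma gives zero
print(quad_form(H, [0.0, 0.0, gamma]))  # -> 0.0

# Meanwhile, feasible points (0, -alpha) strictly improve the objective:
f = lambda x1, x2: 0.5 * (x1 ** 2 - x2 ** 2)
alpha = 0.1
print(f(0.0, -alpha) < f(0.0, 0.0))  # -> True: x* is not a local minimum
```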
We conclude this section with two examples of the use of optimality conditions
to identify critical points. They both lead to the resolution of a system of equations,
the topic of the next section of this book.
Example 6.22 (Identification of critical points – I). Consider the problem
min_{x∈R²} 3x1² + x2²
subject to
2x1 + x2 = 1 .
The necessary optimality condition (6.23) is expressed as
6x1 + 2λ = 0
2x2 + λ = 0
2x1 + x2 − 1 = 0 .
This is a system of three linear equations in three unknowns, for which the
solution is
x1∗ = 2/7 , x2∗ = 3/7 , λ∗ = −6/7 .
We now need only ensure that this point satisfies the sufficient second-order conditions
in order to determine that it is indeed a solution.
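The system can be solved exactly by substitution; a short check with Python's exact rational arithmetic (an illustration, not from the book):

```python
from fractions import Fraction as F

# Solving the 3x3 linear system of Example 6.22 by substitution:
# 6*x1 + 2*lam = 0, 2*x2 + lam = 0, 2*x1 + x2 - 1 = 0.
# From the first two equations, x1 = -lam/3 and x2 = -lam/2;
# substituting into the constraint gives (-2/3 - 1/2) * lam = 1.
lam = F(1) / (F(-2, 3) - F(1, 2))   # = -6/7
x1 = -lam / 3
x2 = -lam / 2

print(x1, x2, lam)  # -> 2/7 3/7 -6/7
```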
[Figure: the constraint 2x1 + x2 = 1 in the (x1, x2) plane, with the solution
x∗ = (2/7, 3/7) marked.]
Example 6.23 (Identification of critical points – II). Consider the problem
min_{x∈R²} 3x1² + x2²
subject to
x1² + 4x1 − x2 + 3 = 0 ,
illustrated in Figure 6.10. The necessary optimality condition (6.23) is expressed as
6x1 + 2λx1 + 4λ = 0
2x2 − λ = 0
x1² + 4x1 + 3 = 0 .
This is a system of three non linear equations, with three unknowns. One solution is
x1 = −1, x2 = 1.5, and λ = 3. It is not necessarily straightforward to calculate it.
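A standard way to attack such a system numerically is Newton's method, the topic of Part III. A hand-rolled sketch in plain Python (an illustration of the general technique, not the book's algorithm):

```python
# Newton iterations for the nonlinear system of Example 6.23.
def F(v):
    x1, x2, lam = v
    return [6 * x1 + 2 * lam * x1 + 4 * lam,
            2 * x2 - lam,
            x1 ** 2 + 4 * x1 + 3]

def J(v):
    # Jacobian of F, derived by hand
    x1, x2, lam = v
    return [[6 + 2 * lam, 0.0, 2 * x1 + 4],
            [0.0, 2.0, -1.0],
            [2 * x1 + 4, 0.0, 0.0]]

def solve3(A, b):
    # Gaussian elimination with partial pivoting on a 3x3 system
    A = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for k in range(3):
        p = max(range(k, 3), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, 3):
            m = A[i][k] / A[k][k]
            for j in range(k, 4):
                A[i][j] -= m * A[k][j]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (A[i][3] - sum(A[i][j] * x[j] for j in range(i + 1, 3))) / A[i][i]
    return x

v = [0.0, 0.0, 0.0]                         # starting point
for _ in range(30):
    d = solve3(J(v), [-fi for fi in F(v)])  # Newton step: J d = -F
    v = [vi + di for vi, di in zip(v, d)]

print([round(vi, 6) for vi in v])  # -> [-1.0, 1.5, 3.0]
```

From this starting point the iterates converge to the solution stated in the text; a different starting point may converge to another critical point, or not at all.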
[Figure 6.10: Problem of Example 6.23 — the constraint curve in the (x1, x2) plane,
with the solution x∗ marked.]
Theorem 6.24. Consider the problem
min_{x∈R^n} f(x)
subject to
h(x) = 0 .
Moreover, let x∗ be a local minimum and let λ∗ satisfy the sufficient optimality
conditions (6.115) and (6.116), such that the constraints are linearly independent
at x∗. Consider the perturbed problem
min_{x∈R^n} f(x)
subject to
h(x) = δ .
There thus exists a sphere S ⊂ R^m centered in 0 such that if δ ∈ S, there exist x(δ)
and λ(δ) satisfying the sufficient optimality conditions of the perturbed problem.
The functions x : R^m → R^n : δ ↦ x(δ) and λ : R^m → R^m : δ ↦ λ(δ) are
continuously differentiable in S, with x(0) = x∗ and λ(0) = λ∗. Moreover, for all
δ ∈ S, we have
∇p(δ) = −λ(δ) , (6.145)
where
p(δ) = f( x(δ) ) . (6.146)
is non singular. We assume by contradiction that this is not the case. There then
exist y ∈ R^n and z ∈ R^m, non zero, such that
∇γ F(δ, γ∗) ( y , z )ᵀ = 0 ,
i.e.,
Since the sufficient optimality condition (6.116) is satisfied, ∇²xx L(x∗, λ∗) is positive
definite and y = 0. Then, according to (6.148), ∇h(x∗) z = 0. By assumption, the
constraints are linearly independent at x∗ and the matrix ∇h(x∗) is of full rank.
Then, z = 0, which contradicts the fact that y and z are non zero. The matrix
∇γ F(δ, γ∗) is indeed non singular and we can apply the theorem of implicit functions
(Theorem C.6): there exist neighborhoods V0 ⊆ R^m around δ = 0 and Vγ∗ ⊆ R^{n+m}
around γ∗, as well as a continuous function
φ : V0 → Vγ∗ : δ ↦ γ(δ) = ( x(δ) , λ(δ) )ᵀ
such that
F( δ, γ(δ) ) = 0 , ∀δ ∈ V0 ,
i.e.,
( ∇f( x(δ) ) + ∇h( x(δ) ) λ(δ) , h( x(δ) ) − δ )ᵀ = 0 . (6.150)
We now demonstrate that, for δ sufficiently close to 0, the sufficient optimality
conditions of the perturbed problem are satisfied. Assuming (by contradiction) that
this is not the case, there exists a sequence (δk)k such that lim_{k→∞} δk = 0 and a
sequence (yk)k, with yk ∈ R^n, ‖yk‖ = 1 and ∇h( x(δk) )ᵀ yk = 0, for all k, such that
ykᵀ ∇²xx L( x(δk), λ(δk) ) yk ≤ 0 , ∀k .
(d/dδ) f( x(δ) ) = 2δ + 2 = −λ(δ) .
The quantity ∇p(δ) represents the marginal modification of the objective function
for a perturbation δ of the constraints. When δ is small, we use Taylor’s theorem
(Theorem C.1) to obtain
p(δ) = p(0) + δᵀ ∇p(0) + o( ‖δ‖ ) .
subject to
2x1 + x2 ≤ 10
x1 + 3x2 ≤ 15
x1 , x2 ≥ 0 .
We omit the non negativity constraints to maintain a simple formulation (these con-
straints are inactive at the solution). We first express the problem in the form (1.71)–
(1.72) by changing the maximization problem into a minimization problem, and by
including slack variables (see Section 1.2):
subject to
h1(x) = 2x1 + x2 + x3² − 10 = 0
h2(x) = x1 + 3x2 + x4² − 15 = 0 ,
where x3 and x4 are the slack variables. The solution to the problem is
x∗ = ( 3, 4, 0, 0 )ᵀ and λ∗ = ( 13/5, 4/5 )ᵀ, enabling the company to bring in € 38,000
per day.
[Figure: the feasible region of the perturbed problem, with the solution x∗ = (3, 4)
marked: (a) δ = 5; (b) δ = 16.]
to obtain
f( x(δ) ) = f(x∗) − δᵀ λ∗ = −38 − (4/5) δ . (6.153)
Therefore, the purchase of 5 tons of raw material per day would enable the company
to bring in an additional € 4,000. If this purchase costs less than € 4,000, it is worth
going through with the investment. Otherwise, the investment is of no use.
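The computation can be replayed numerically. Since the objective of the original problem is not shown in this extract, the sketch below assumes it is max 6x1 + 5x2 (in thousands of euros), which is consistent with the stated optimum f(x∗) = 38 at x∗ = (3, 4) and with λ∗ = (13/5, 4/5); the raw-material constraint x1 + 3x2 ≤ 15 is the one perturbed:

```python
from fractions import Fraction as F

# Sensitivity check for the production example, with assumed profit
# 6*x1 + 5*x2. Perturb the second constraint to x1 + 3*x2 = 15 + delta
# while both constraints stay active.
def solve_active(delta):
    # Solve 2*x1 + x2 = 10, x1 + 3*x2 = 15 + delta:
    # the inverse of [[2, 1], [1, 3]] is (1/5) * [[3, -1], [-1, 2]].
    b1, b2 = F(10), F(15) + delta
    x1 = (3 * b1 - b2) / 5
    x2 = (-b1 + 2 * b2) / 5
    return x1, x2

def profit(x1, x2):
    return 6 * x1 + 5 * x2

base = profit(*solve_active(F(0)))
perturbed = profit(*solve_active(F(5)))
print(base, perturbed - base)  # -> 38 4: the gain is lambda_2 * delta = (4/5)*5
```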
Note that this result is only valid for small values of δ. If δ = 16, so that
31 tons of raw material are available each day, the company no longer has enough
machine-hours to use up all of the raw material. Therefore, the second constraint
is no longer active (here, x4 is positive). The company should thus produce only
the second product, which consumes half as many machine-hours. In this case, the
purchase of additional raw material would not enable the company to earn more, and
the company should thus rather invest in new machines.
The inspiration for this example came from de Werra et al. (2003).
subject to
h(x) = 0
g(x) ≤ 0 .
Moreover, let x∗ be a local minimum and let λ∗, µ∗ satisfy the sufficient optimality
conditions. Consider the perturbed problem
min_{x∈R^n} f(x)
subject to
h(x) = δh
g(x) ≤ δg .
Then, there exists a sphere S ⊂ R^{m+p} centered in 0 such that if δ = ( δhᵀ , δgᵀ )ᵀ ∈
S, there exist x(δ), λ(δ) and µ(δ) satisfying the sufficient optimality conditions
of the perturbed problem. The functions x : R^{m+p} → R^n, λ : R^{m+p} → R^m and
µ : R^{m+p} → R^p are continuously differentiable in S, with x(0, 0) = x∗, λ(0, 0) = λ∗
and µ(0, 0) = µ∗. Moreover, for all δ ∈ S, we have
∇p(δ) = −( λ(δ)ᵀ , µ(δ)ᵀ )ᵀ , (6.154)
with
p(δ) = f( x(δ) ) . (6.155)
Proof. From Theorem 3.5, x∗ , λ∗ and µ∗ satisfy the optimality conditions of the
problem
min f(x)
x∈Rn
subject to
h(x) = 0
gi (x) = 0 , ∀i ∈ A(x∗ ) .
From Theorem 6.24, there exist x(δ), λ(δ) and µ(δ) satisfying the sufficient optimality
conditions of the problem
minn f(x) (6.156)
x∈R
subject to
h(x) = δh (6.157)
gi(x) = (δg)i , ∀i ∈ A(x∗) . (6.158)
We now need only verify the result for inequality constraints that are inactive in
x∗ , i.e.,
gi (x∗ ) < 0 .
In this case, if δi is sufficiently close to 0, gi( x(δ) ) < δi, and the constraint i is also
inactive at the solution to the perturbed problem (by continuity of gi). Then, according
to Theorem 3.5, the perturbed problem is equivalent to the problem (6.156)–(6.158).
If we take µi(δ) = 0 for all i ∉ A(x∗), then x(δ), λ(δ) and µ(δ) satisfy the sufficient
optimality conditions. Moreover, regardless of the value of δi (small enough for the
constraint of the problem to remain inactive), the value of x(δ) remains constant,
since it is determined by the problem (6.156)–(6.158), which does not depend on δi
if i ∉ A(x∗). Therefore,
∂p/∂δi = Σ_{j=1}^n ( ∂f/∂xj ) ( ∂xj/∂δi ) = 0 = −µi(δ) ,
Consider the linear optimization problem
min_{x∈R^n} cᵀ x (6.159)
subject to
Ax = b
x ≥ 0 , (6.160)
where A ∈ R^{m×n}, b ∈ R^m, c ∈ R^n, for which the Lagrangian function is
L(x, λ, µ) = cᵀ x + λᵀ (b − Ax) − µᵀ x .
The first-order optimality conditions are
∇x L(x, λ, µ) = c − Aᵀ λ − µ = 0
µ ≥ 0 . (6.162)
These conditions represent exactly the constraints (4.23) of the dual problem de-
scribed in Section 4.2. The second-order conditions are trivial, because ∇2xx L(x, λ, µ) =
0, for all (x, λ, µ). The complementarity condition (6.57) is simply written as
µi xi = 0 , i = 1, . . . , n, (6.163)
or, equivalently,
( ci − Σ_{j=1}^m aji λj ) xi = 0 , i = 1, . . . , n. (6.164)
We show below that this condition happens to also be sufficient for optimality.
We can also utilize the necessary conditions of Theorem 5.1. In particular, if we
consider the jth basic direction dj (Definition 3.42), the directional derivative of the
objective function in the direction dj is given by
cᵀ dj = cj − cBᵀ B⁻¹ Aj . (6.165)
In the context of linear optimization, this quantity is often called reduced cost.
Definition 6.28 (Reduced costs). Consider the linear optimization problem (6.159)–
(6.160) and let x be a feasible basic solution of the constraint polyhedron. The reduced
cost of xj is
c̄j = cj − cBᵀ B⁻¹ Aj , j = 1, . . . , n . (6.166)
In matrix form, we have
c̄ = c − Aᵀ B⁻ᵀ cB . (6.167)
The reduced costs can be decomposed into their basic and non basic components,
as follows:
c̄B = cB − Bᵀ B⁻ᵀ cB = 0, (6.168)
and
c̄N = cN − Nᵀ B⁻ᵀ cB . (6.169)
Therefore, for any basis B, the basic components of the reduced costs are always
0. This, together with the geometric interpretation of the non basic components, is
formalized in the next theorem.
Consider the linear optimization problem (6.159)–(6.160) and let x be a basic
solution of the constraint polyhedron. When j is the index of a non basic variable,
the jth reduced cost is the directional derivative of the objective function in the jth
basic direction. When j is the index of a basic variable, the jth reduced cost is zero.
Proof. In the case where j is non basic, the proof can be found above (see (6.165)).
For basic j, we see that B−1 B = I and B−1 Aj is the jth column of the identity matrix.
Then, cTB B−1 Aj = cj and the reduced cost is zero.
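The reduced-cost computation can be illustrated on the production example of the previous section, assuming (as above, since the maximization objective is not shown in this extract) the standard-form cost vector c = (−6, −5, 0, 0):

```python
from fractions import Fraction as F

# Reduced costs for the production example in standard (min) form,
# with assumed objective min -6*x1 - 5*x2 and slack columns 3 and 4.
# Columns 1 and 2 form the basis B.
A = [[F(2), F(1), F(1), F(0)],
     [F(1), F(3), F(0), F(1)]]
c = [F(-6), F(-5), F(0), F(0)]

# B = [[2, 1], [1, 3]], so B^{-1} = (1/5) * [[3, -1], [-1, 2]]
Binv = [[F(3, 5), F(-1, 5)], [F(-1, 5), F(2, 5)]]
cB = [c[0], c[1]]

def reduced_cost(j):
    # c_bar_j = c_j - cB^T B^{-1} A_j, equation (6.166)
    Aj = [A[0][j], A[1][j]]
    BinvAj = [sum(Binv[i][k] * Aj[k] for k in range(2)) for i in range(2)]
    return c[j] - sum(cB[i] * BinvAj[i] for i in range(2))

print([reduced_cost(j) for j in range(4)])
# The basic reduced costs are 0; the non basic ones, 13/5 and 4/5, are
# non negative, so the basis is optimal, and they coincide with the
# multipliers lambda* = (13/5, 4/5) reported for this example.
```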
The concept of reduced costs now enables us to state the optimality conditions
for linear optimization.
Proof. Consider the basic direction dk . According to Theorem 3.44, the non degen-
eracy of x∗ ensures that dk is feasible. Therefore, given the convexity of the set of
constraints, the necessary condition of Theorem 6.3 applies and
∇f(x∗)ᵀ dk = c̄k ≥ 0 ,
by using (6.166) and Theorem 6.29.
Proof. Let y be an arbitrary feasible point and w = y − x∗ . Since the feasible set
is convex, w is a feasible direction (Theorem 3.11). If dj is the jth basic direction2
(Definition 3.42), we have
cᵀ w = Σ_{j∈N} (w)j cᵀ dj                      from Theorem 3.45
     = Σ_{j∈N} (w)j ( −cBᵀ B⁻¹ Aj + cj )       from Definition 3.42
     = Σ_{j∈N} (w)j c̄j                        from Definition 6.28.
2 In this proof, dj is a vector of Rn , while (w)j is a scalar, representing the jth entry of the vector
w.
cᵀ y − cᵀ x∗ = cᵀ w = Σ_{j∈N} (w)j c̄j ≥ 0 , and x∗ is optimal.
Note that the sufficient conditions do not assume that x∗ is non degenerate. To
understand why the necessary conditions may not hold when x∗ is degenerate, we go
back to the example illustrated in Figure 3.16. In this example, the vertex number
2 corresponds to a degenerate solution, and the basic direction d̂3 is not feasible.
Therefore, if this vertex happens to be the optimal solution of an optimization
problem, it is not necessary for the basic direction d̂3 to correspond to an ascent
direction. It does not matter if it is a descent direction, as the direction is not feasible
anyway. If it happens to be a descent direction, the reduced cost is negative, and the
necessary condition is not verified although we are at the optimal solution.
We now characterize the optimal solution of the dual problem given the optimal
solution of the primal. We obtain an important result for linear optimization, called
strong duality: the optimal value of the primal coincides with the optimal value
of the dual. Moreover, this result provides a direct link between the reduced costs
and the dual variables.
Corollary 6.32 (Optimality of the dual). Consider the primal linear problem
(6.159)–(6.160) and let B be a basis such that B⁻¹ b ≥ 0 and c̄ ≥ 0. Consider
also the dual problem
max_{λ∈R^m} λᵀ b (6.170)
subject to
Aᵀ λ ≤ c . (6.171)
Then, the primal vector x∗ with basic variables
xB∗ = B⁻¹ b (6.172)
and non basic variables xN∗ = 0, is optimal for the primal problem, the dual
vector
λ∗ = B⁻ᵀ cB (6.173)
is optimal for the dual problem, and the objective functions are equal, that is
(λ∗)ᵀ b = cᵀ x∗ . (6.174)
Proof. From (6.172) and (6.173), cᵀ x∗ = cBᵀ B⁻¹ b = (λ∗)ᵀ b, proving (6.174).
The vector λ∗ is feasible for the dual. Indeed, from (6.167), we have
AT λ∗ = AT B−T cB = c − c̄.
As c̄ ≥ 0, we obtain (6.171).
Consider now any dual feasible λ. By the weak duality theorem (Theorem 4.9),
λᵀ b ≤ cᵀ x∗ = (λ∗)ᵀ b ,
and λ∗ is optimal for the dual.
min_{x∈R^n} cᵀ x
subject to
Ax = b
x ≥ 0 ,
where A ∈ R^{m×n}, b ∈ R^m, c ∈ R^n, and the dual problem
max_{λ∈R^m} λᵀ b
subject to
Aᵀ λ ≤ c .
If either the primal or the dual problem has an optimal solution, so does the
other, and the optimal objective values are equal.
Proof. From the discussion above, if x∗ is a solution to the primal problem, there
exists a basis B such that x∗B = B−1 b ≥ 0 and c̄ = c − AT B−T cB ≥ 0. Therefore,
Corollary 6.32 applies, the optimal solution of the dual is B−T cB , and the objective
functions are equal.
If λ∗ is a solution to the dual problem, the fact that the primal problem is the
dual of the dual (Theorem 4.15) is used to prove the result.
Note that another proof of this result, exploiting Farkas’ lemma, is presented in
Theorem 4.17. Finally, we show that the complementarity condition (6.164) is a
sufficient and necessary optimality condition.
min_{x∈R^n} cᵀ x
subject to
Ax = b
x ≥ 0 ,
where A ∈ R^{m×n}, b ∈ R^m, c ∈ R^n, and the dual problem
max_{λ∈R^m} λᵀ b
subject to
Aᵀ λ ≤ c .
Consider x∗ primal feasible and λ∗ dual feasible. x∗ is optimal for the primal
and λ∗ is optimal for the dual if and only if
( ci − Σ_{j=1}^m aji λj∗ ) xi∗ = 0 , i = 1, . . . , n. (6.175)
Proof. Conditions (6.175) are KKT necessary optimality conditions (see Theorem 6.13
and the discussion at the beginning of the section). To show that they are sufficient,
consider the equation
( c − Aᵀ λ∗ )ᵀ x∗ = Σ_{i=1}^n ( ci − Σ_{j=1}^m aji λj∗ ) xi∗ = 0.
Therefore,
cT x∗ = (λ∗ )T Ax∗ .
As x∗ is primal feasible, we have Ax∗ = b and
cT x∗ = (λ∗ )T b.
Consequently, the objective function of the primal at x∗ equals the objective function
of the dual at λ∗ . We apply Theorem 4.11 to prove the optimality of x∗ and λ∗ .
Conditions (6.175) are called complementarity slackness conditions because the
activity of the constraints must be complementary. At the optimal solution, if a
primal variable is positive, that is if xi > 0, the corresponding dual constraint must
be active, that is ci = Σ_{j=1}^m aji λj . Symmetrically, if a dual constraint is inactive,
that is if ci > Σ_{j=1}^m aji λj , the corresponding primal variable must be equal to 0.
min_{x∈R^n} ½ xᵀ Q x + cᵀ x (6.176)
subject to
Ax = b , (6.177)
where Q ∈ R^{n×n}, c ∈ R^n, A ∈ R^{m×n} and b ∈ R^m. The Lagrangian function is
L(x, λ) = ½ xᵀ Q x + cᵀ x + λᵀ (b − Ax) , (6.178)
with λ ∈ R^m. By directly applying Theorem 6.13, the necessary first-order optimality
condition is written as
∇x L(x, λ) = Qx + c − Aᵀ λ = 0 . (6.179)
yᵀ Zᵀ Q Z y = 0 .
The equation is written as
Qx − Aᵀ λ = −Aᵀ λ = 0 .
Since A is of full rank, then λ = 0.
We calculate the analytical solution to this problem. The optimality condition (6.179)
is written as
x∗ − Aᵀ λ∗ = −c . (6.183)
Multiplying by A, we obtain
A x∗ − A Aᵀ λ∗ = −A c .
Since A x∗ = 0 and A is of full rank, we obtain (6.182). We now need only introduce
(6.182) in (6.183) to obtain (6.181).
Consider now the problem
min_x ½ xᵀ x + cᵀ x
subject to
Ax = b ,
and let x0 be such that A x0 = b. The change of variables y = x − x0 yields,
up to a constant, the problem
min_y ½ yᵀ y + (c + x0)ᵀ y
subject to
Ay = 0 .
According to Theorem 6.36, the solution to this problem is
y∗ = Aᵀ ( A Aᵀ )⁻¹ A (c + x0) − (c + x0)
λ∗ = ( A Aᵀ )⁻¹ A (c + x0) .
We now need only use A x0 = b and y∗ = x∗ − x0 to obtain the result.
and
λ∗ = ( A Q⁻¹ Aᵀ )⁻¹ ( A Q⁻¹ c + b ) . (6.187)
Proof. Let Z ∈ R^{n×(n−m)} be a matrix whose columns form a basis of the null
space of A, i.e., such that AZ = 0. Since Q is positive definite, then so is Zᵀ Q Z,
and Theorem 6.35 applies to demonstrate the non singularity of the system and the
unicity of the solution. Let L be a lower triangular matrix such that Q = L Lᵀ and
let us take y = Lᵀ x. The problem (6.176)–(6.177) is thus written as
min_y ½ yᵀ L⁻¹ L Lᵀ L⁻ᵀ y + cᵀ L⁻ᵀ y = ½ yᵀ y + cᵀ L⁻ᵀ y
subject to
A L⁻ᵀ y = b .
The solution to this problem is given by Theorem 6.37, by replacing c with L⁻¹ c and
A with A L⁻ᵀ. Then,
λ∗ = ( A L⁻ᵀ L⁻¹ Aᵀ )⁻¹ ( A L⁻ᵀ L⁻¹ c + b ) = ( A Q⁻¹ Aᵀ )⁻¹ ( A Q⁻¹ c + b )
and
y∗ = L⁻¹ Aᵀ λ∗ − L⁻¹ c .
We now need only take y∗ = Lᵀ x∗ to obtain the result.
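The closed-form expressions can be verified on a tiny instance with made-up data, chosen so that Q is diagonal and A has a single row, which reduces the matrix algebra to scalars:

```python
from fractions import Fraction as F

# Checking the analytic solution of the equality-constrained QP
# min (1/2) x^T Q x + c^T x  s.t.  A x = b, on a hypothetical instance
# with Q = diag(2, 4), c = (-1, 0), A = [1 1], b = 1.
Qdiag = [F(2), F(4)]
c = [F(-1), F(0)]
a = [F(1), F(1)]   # the single constraint row
b = F(1)

# lambda* = (A Q^{-1} A^T)^{-1} (A Q^{-1} c + b)  -- equation (6.187)
AQinvAT = sum(a[i] * a[i] / Qdiag[i] for i in range(2))
AQinvc = sum(a[i] * c[i] / Qdiag[i] for i in range(2))
lam = (AQinvc + b) / AQinvAT

# x* = Q^{-1} (A^T lambda* - c), from the stationarity condition (6.179)
x = [(a[i] * lam - c[i]) / Qdiag[i] for i in range(2)]

print(x, lam)  # -> x = [5/6, 1/6], lam = 2/3
```

Both optimality conditions can be confirmed by hand: Q x∗ + c − Aᵀ λ∗ = 0 and A x∗ = b.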
The presentation of the proof of Theorems 6.10, 6.13, 6.19, 6.20, and 6.24 was
inspired by Bertsekas (1999). That of the proof of Theorem 6.31 was inspired by
Bertsimas and Tsitsiklis (1997).
6.7 Exercises
Exercise 6.1.
Identify the local optima of the following optimization problems, and verify the op-
timality conditions.
1. min_{x∈R^n} ‖x‖₂² , subject to Σ_{i=1}^n xi = 1.
2. min_{x∈R^n} Σ_{i=1}^n xi , subject to ‖x‖₂² = 1.
3. min_{x∈R²} −x1² − x2² , subject to (x1/2)² + (x2/2)² ≤ 1 (Hint: plot the level
   curves and the constraints).
4. min_{x∈R²} −x1² − x2² , subject to −x1² + x2² ≤ 1, and −5 ≤ x1 ≤ 5 (Hint: plot
   the level curves and the constraints).
5. The Indiana Jones problem (Section 1.1.6): min_{x∈R²} x1² + x2² , subject to
   x1 x2 − h x1 − ℓ x2 = 0, x1 ≥ ℓ, x2 ≥ h.
Exercise 6.2.
An electricity company must supply a town that consumes 100 MWh daily. Three
plants are used to generate the energy: a gas plant, producing at the cost of € 800/MWh,
a coal plant, producing at the cost of € 1,500/MWh, and a hydroelectric plant
producing at the cost of € 300/MWh. The amount of available water limits the
production of the latter plant to 40 MWh per day. Moreover, due to ecological
concerns, the two other plants are each limited to produce no more than 80 MWh per day.
1. Formulate a linear optimization problem that would optimize the costs of the
company.
2. Formulate the dual problem.
3. Prove, using the optimality conditions, that the optimal solution is to produce 60
MWh per day with the gas plant, 40 MWh per day with the hydroelectric plant,
and not to use the coal plant.
4. Deduce the optimal values of the dual variables.
5. Use sensitivity analysis to propose profitable investments to the company.
∂f(x∗)/∂xi ≤ ∂f(x∗)/∂xj , ∀j. (6.188)
Consider problem (P1)
min_{x∈R^n} f(x)
subject to
gi(x) ≤ 0, i = 1, . . . , m,
and problem (P2)
min_{x∈R^n, y∈R^m} f(x)
subject to
gi(x) + yi = 0, i = 1, . . . , m,
yi ≥ 0, i = 1, . . . , m.
1. Write the necessary optimality conditions (KKT) for problem (P1), both first and
second order.
2. Write the necessary optimality conditions (KKT) for problem (P2), both first and
second order.
3. Prove that (x∗ , µ∗ ) verifies the KKT conditions of problem P1, if and only if
(x∗ , y∗ , λ∗ ) verifies the KKT conditions of problem P2, where λ∗ = µ∗ and y∗i =
−gi (x∗ ), i = 1, . . . , m (Hint: refer to Example 6.16).
Part III
Solving equations
We have seen that the necessary optimality conditions enable us to identify the critical
points (Definition 5.6) that are candidates for the solution to an optimization problem.
In the case of optimization without constraint, we use condition (5.1). For constrained
optimization, we use conditions (6.11), (6.23), and (6.55)–(6.57). One way to address
the problem is to solve the system of equations defined by these conditions. This
is how the problem in Example 5.8 was solved, as well as Example 6.22. In these
two cases, the system of equations is easily solved. This is not always the case, as
illustrated in Example 6.23.
We now consider numerical methods enabling us to solve such systems of non
linear equations. Even though these are not directly utilized for optimization, they
are the basis for the main algorithms.
Chapter 7
Newton’s method
Newton’s method plays a crucial role in the context of solving non linear equations
and, by extension, in that of non linear optimization. Isaac Newton was inspired
by a method from Vieta, and the method was later improved by Raphson (it is
sometimes called the “Newton-Raphson” method). We refer the reader to Deuflhard (2012) for
a historical perspective of the method. We introduce it for the simple problem of
solving one equation of one unknown, i.e., by deriving numerical methods to find
x ∈ R such that F(x) = 0.
Consider, for illustration, the function
F(x) = x² − 2
and its linear model at x̂ = 2,
m(x̂ + d) = 2 + 4d .
Defining x = x̂ + d, we get
m(x) = 2 + 4(x − 2) = 4x − 6 .
The function and the model are presented in Figure 7.2(a). The zoom in Figure 7.2(b)
illustrates the good agreement between the model and the function around x̂ = 2.
Figure 7.2: The function F(x) = x² − 2 and its linear model at x̂ = 2: (a) global view; (b) zoom.
We can now provide a general definition of the linear model of a non linear function.
From a first approximation x̂, the main idea of Newton's method to find a root of
the function F consists in
1. calculating the linear model at x̂,
2. calculating the root x+ of this linear model,
3. if x+ is not a root of F, considering x+ as a new approximation and starting
over.
According to Definition 7.2, the root of the linear model is the solution to
F(x̂) + F′(x̂)(x − x̂) = 0 , (7.2)
i.e., if F′(x̂) ≠ 0,
x+ = x̂ − F(x̂) / F′(x̂) , (7.3)
which summarizes the first two steps presented above.
We also need to specify the third step. How do we conclude that x+ is a root
of the function, i.e., F(x+ ) = 0, and that we can stop the iterations? Seemingly
innocuous, this question is far from simple to answer. Indeed, computers operating
in finite arithmetic are not capable of representing all real numbers (that represent an
uncountable infinity). Therefore, it is possible, and even common, that the method
never generates a point x+ such that F(x+ ) = 0 exactly. Then, we must often settle for
a solution x+ such that F(x+ ) is “sufficiently close” to 0. In practice, the user provides
a measure of this desired proximity, denoted by ε and the algorithm is interrupted
when
|F(x+)| ≤ ε . (7.4)
A typical value for ε is √εM , where εM is the machine epsilon, that is, an upper
bound on the relative error due to rounding in floating point arithmetic. A simple
way to compute εM is Algorithm 7.1. The loop stops when εM is so small that, when
added to 1, the result is also 1.
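In Python, the loop of Algorithm 7.1 can be sketched as follows (a minimal version; the function name is ours, and the value 2⁻⁵² assumes IEEE 754 double precision):

```python
def machine_epsilon():
    """Algorithm 7.1: halve eps until adding it to 1.0 no longer
    changes the result in floating point arithmetic."""
    eps = 1.0
    while 1.0 + eps / 2.0 != 1.0:
        eps /= 2.0
    return eps

eps_m = machine_epsilon()      # 2**-52 in IEEE 754 double precision
stopping_tol = eps_m ** 0.5    # the typical value sqrt(eps_M) for (7.4)
```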
We now have all the elements in order to write Newton’s algorithm to solve an
equation with one unknown (Algorithm 7.2).
in the Baghdad area. Al Khwarizmi was an astronomer in the House of Wisdom (Dar
al-Hikma) of caliph Abd Allah al Mahmoun. He is primarily known for his treatise al
Kitab almukhtasar fi hisab al-jabr w’al muqabala (that can be translated as “The
Compendious Book on Calculation by Completion and Balancing”), which provides
the origin of the word algebra (al-Jabr, used in the sense of transposition, became
algebra). He explained in Arabic the system of Indian decimal digits applied to
arithmetic operations. The Latin translation of this work, entitled Algoritmi de
numero Indorum gave rise to the word algorithm. Al Khwarizmi died after AD 847.
Figure 7.3: Al Khwarizmi
Example 7.3 (Newton’s method: one variable – I). Take the equation
F(x) = x2 − 2 = 0 .
We have F′(x) = 2x. The iterations of Newton's method, started from x0 = 2 with ε = 10−15, are reported in Table 7.1.
Table 7.1: Iterations of Newton's method for Example 7.3
k xk F(xk ) F ′ (xk )
0 +2.00000000E+00 +2.00000000E+00 +4.00000000E+00
1 +1.50000000E+00 +2.50000000E-01 +3.00000000E+00
2 +1.41666667E+00 +6.94444444E-03 +2.83333333E+00
3 +1.41421569E+00 +6.00730488E-06 +2.82843137E+00
4 +1.41421356E+00 +4.51061410E-12 +2.82842712E+00
5 +1.41421356E+00 +4.44089210E-16 +2.82842712E+00
Figure 7.4: The first two iterations of Newton's method for Example 7.3: (a) first iteration; (b) second iteration.
According to Example 7.3, Newton’s method seems quite fast, as only 5 iterations
were necessary to converge. We characterize this speed below. Before that, however,
we illustrate by other examples that the method does not always work that well.
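A minimal Python sketch of Algorithm 7.2 reproduces the iterations of Example 7.3 (the function and variable names are ours):

```python
def newton(F, dF, x0, eps=1e-15, maxiter=100):
    """Newton's method for one equation F(x) = 0 (Algorithm 7.2)."""
    x = x0
    for k in range(maxiter):
        if abs(F(x)) <= eps:      # stopping criterion (7.4)
            return x, k
        x = x - F(x) / dF(x)      # Newton iterate (7.3)
    return x, maxiter

# Example 7.3: F(x) = x**2 - 2, F'(x) = 2x, started from x0 = 2
root, iters = newton(lambda x: x**2 - 2, lambda x: 2 * x, 2.0)
```

Here root equals √2 to machine precision after 5 iterations; replacing F by x − sin x (Example 7.4) makes the same code much slower, since F′ vanishes at the root.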
Example 7.4 (Newton’s method: one variable – II). Take the equation
F(x) = x − sin x = 0 .
We have F ′ (x) = 1 − cos x. We apply Newton’s method (Algorithm 7.2) with x0 = 1
and ε = 10−15 . The iterations are listed in Table 7.2. The number of iterations is
much larger than for the previous example. Note how the derivative F ′ (xk ) is getting
closer and closer to 0 as the iterations proceed. Actually, the root of this equation is
x∗ = 0, and the value of the derivative at the root is 0. As Newton’s method divides
by F ′ (xk ) at each iteration, the fact that F ′ (x∗ ) = 0 is the source of the slow behavior
of the method. The first two iterations are portrayed in Figure 7.5(a). The linear
model at the starting point x0 = 1 is represented by a dotted line and intersects the
x-axis at 0.65, which is the first iterate.
Figure 7.5: The first iterations of Newton's method for Example 7.4: (a) two iterations; (b) zoom.
The linear model at that point is also represented by a dotted line, and intersects
the x-axis at 0.43, which is the second iterate. Figure 7.5(b) is a zoom on the same
figure.
Even though Newton's method has managed to provide the desired precision in
5 iterations for Example 7.3, more than 5 times as many iterations are necessary for
Example 7.4. In the following example, we see that the method may sometimes not
work at all.
Example 7.5 (Newton’s method: one variable – III). Take the equation
F(x) = arctan x = 0 .
We apply Newton's method with x0 = 1.5.
Table 7.3: The first ten iterations of Newton's method for Example 7.5
k xk F(xk ) F ′ (xk )
0 +1.50000000E+00 +9.82793723E-01 +3.07692308E-01
1 -1.69407960E+00 -1.03754636E+00 +2.58404230E-01
2 +2.32112696E+00 +1.16400204E+00 +1.56552578E-01
3 -5.11408784E+00 -1.37769453E+00 +3.68271300E-02
4 +3.22956839E+01 +1.53984233E+00 +9.57844131E-04
5 -1.57531695E+03 -1.57016153E+00 +4.02961851E-07
6 +3.89497601E+06 +1.57079607E+00 +6.59159364E-14
7 -2.38302890E+13 -1.57079633E+00 +1.76092712E-27
8 +8.92028016E+26 +1.57079633E+00 +1.25673298E-54
9 -1.24990460E+54 -1.57079633E+00 +6.40097701E-109
10 +2.45399464E+108 +1.57079633E+00 +1.66055315E-217
We now analyze in detail the aspects that influence the efficiency of the method.
The main result can be stated as follows:
• if the function is not too non linear,
• if the derivative of F at the solution is not too close to 0,
• if x0 is not too far from the root,
• then Newton’s method converges quickly toward the solution.
The central idea of the analysis is to measure the error that is committed when
the non linear function is replaced by the linear model. Intuitively, if the function
is almost linear, the error is small, while if the function is highly non linear, the
error is more significant. We use here the Lipschitz continuity of the derivative of
F to characterize the non linearity, as discussed in Section 2.4 (Definition 2.27).
Figure 7.6: Newton's method for Example 7.5

Theorem 7.6 considers a linear model at x̂ and provides an upper bound on the error
at a point x+. This bound depends on the distance between x̂ and x+, and on the
Lipschitz constant that characterizes the non linearity of the function.
Theorem 7.6 (Error of the linear model: one variable). Consider an open interval
X ⊆ R and a function F for which the derivative is Lipschitz continuous over X,
where M is the Lipschitz constant. Then, for all x̂, x+ ∈ X,
|F(x+) − mx̂(x+)| ≤ M (x+ − x̂)² / 2 . (7.5)
Proof. We have
∫_{x̂}^{x+} ( F′(z) − F′(x̂) ) dz = ∫_{x̂}^{x+} F′(z) dz − ∫_{x̂}^{x+} F′(x̂) dz   (linearity of the integral)
= F(x+) − F(x̂) − F′(x̂)(x+ − x̂)
= F(x+) − mx̂(x+) .
We take z = x̂ + t(x+ − x̂) and dz = (x+ − x̂) dt to obtain
F(x+) − mx̂(x+) = ∫_0^1 ( F′(x̂ + t(x+ − x̂)) − F′(x̂) ) (x+ − x̂) dt .
Therefore,
|F(x+) − mx̂(x+)| ≤ ∫_0^1 | F′(x̂ + t(x+ − x̂)) − F′(x̂) | |x+ − x̂| dt   from Theorem C.12
= |x+ − x̂| ∫_0^1 | F′(x̂ + t(x+ − x̂)) − F′(x̂) | dt
≤ |x+ − x̂| ∫_0^1 M |t(x+ − x̂)| dt   from Definition 2.27
= M (x+ − x̂)² ∫_0^1 t dt
= (M/2) (x+ − x̂)² .
We now use this bound on the error to demonstrate the convergence of Newton’s
method.
Theorem 7.7 (Convergence of Newton's method: one variable). Consider an open
interval X ⊆ R, a function F for which the derivative is Lipschitz continuous over X
with constant M, and a constant ρ > 0 such that
|F′(x)| ≥ ρ , ∀x ∈ X . (7.6)
Assume that there exists x∗ ∈ X such that F(x∗) = 0. There then exists η > 0
such that, if
|x0 − x∗| < η (7.7)
with x0 ∈ X, the sequence (xk)k defined by
xk+1 = xk − F(xk) / F′(xk) , k = 0, 1, . . . , (7.8)
is well defined and converges toward x∗.
Proof. We have
x1 − x∗ = x0 − F(x0)/F′(x0) − x∗   from (7.8)
= x0 − x∗ − (F(x0) − F(x∗))/F′(x0)   because F(x∗) = 0
= (1/F′(x0)) (F(x∗) − mx0(x∗))   from (7.1) .
Then,
|x1 − x∗| ≤ (1/|F′(x0)|) |F(x∗) − mx0(x∗)|
≤ (M / (2 |F′(x0)|)) |x0 − x∗|²   from (7.5)
≤ (M / (2ρ)) |x0 − x∗|²   from (7.6) .
Take 0 < τ < 1 and η such that
|x0 − x∗| ≤ η ≤ (2ρ/M) τ (7.11)
and
|x1 − x∗| ≤ (M/(2ρ)) |x0 − x∗|² ≤ (M/(2ρ)) (2ρ/M) τ |x0 − x∗| = τ |x0 − x∗| < η ,
where the last inequality is the result of the fact that τ < 1 and |x0 − x∗ | < η. Since
|x1 − x∗ | < η, we also have that |x1 − x∗ | < r (according to (7.10)) and x1 ∈ X. x1
thus satisfies the same assumptions as x0 . We can now apply the recurrence using
the same arguments for x2 , x3 , and so forth.
We now comment on a summarized version of the result of Theorem 7.7.
If the function is not too non linear: this assumption is related to the Lipschitz
continuity. The closer M is to 0, the less non linear the function is.
If the derivative of F is not too close to 0: this is hypothesis (7.6). If this
assumption is not satisfied, the method may not be well defined (division by zero),
may not converge, or may converge slowly, as illustrated by Example 7.4.
If x0 is not too far from the root: this is hypothesis (7.7). If x0 is too far from
the root, the method may not converge, as shown in Example 7.5.
Definition 7.8 (q-quadratic convergence). Take a sequence (xk)k in Rn that con-
verges toward x∗. The sequence is said to converge q-quadratically toward x∗ if there
exist c ≥ 0 and k̂ ∈ N such that
‖xk+1 − x∗‖ ≤ c ‖xk − x∗‖² , ∀k ≥ k̂ . (7.12)
In Definition 7.8, the prefix q signifies quotient. In practice, the other types of
convergence are rarely used, and the prefix could be omitted. More details can be
found in Ortega and Rheinboldt (1970).
As in the case with one variable, we determine a bound for the error committed
when replacing the function F by the linear model, and find a result similar to that
of Theorem 7.6. The proof is essentially the same.
Proof. The structure of the proof is identical to that of Theorem 7.6.
Newton’s method for systems of equations is also essentially the same as for a
single variable. It is described by Algorithm 7.3. System (7.16) solved at step 13 of
the algorithm is often called the Newton equations. Note that we have intentionally
not written this step as
dk+1 = −J(xk)⁻¹ F(xk) .
Indeed, from a numerical point of view, the calculation of dk+1 must be performed
by solving the system of linear equations, not by inverting the Jacobian matrix.
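The remark on step 13 can be illustrated in Python with NumPy: the direction dk+1 comes from a linear solve, never from forming J(xk)⁻¹. The small test system below (with a root at (1, 2)) is our own choice, not an example from the text:

```python
import numpy as np

def F(x):
    # Hypothetical 2x2 test system, with a root at x = (1, 2)
    return np.array([x[0] + x[1] - 3.0, x[0] ** 2 - x[1] + 1.0])

def J(x):
    # Jacobian: row i is equation i, column j is variable j
    return np.array([[1.0, 1.0], [2.0 * x[0], -1.0]])

def newton_system(x0, eps=1e-12, maxiter=50):
    """Newton's method for systems of equations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        Fx = F(x)
        if np.linalg.norm(Fx) <= eps:
            break
        d = np.linalg.solve(J(x), -Fx)   # solve the Newton equations; do not invert J
        x = x + d
    return x

x_star = newton_system([0.5, 2.5])
```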
Algorithm 7.3: Newton's method: n variables
1 Objective
2 Find (an approximation of) a solution of the system of equations
F(x) = 0 . (7.15)
3 Input
4 The function F : Rn → Rn .
5 The Jacobian matrix of the function J : Rn → Rn×n .
6 A first approximation of the solution x0 ∈ Rn .
7 The required precision ε ∈ R, ε > 0.
8 Output
9 An approximation x∗ ∈ Rn of the solution.
10 Initialization
11 k = 0.
12 Repeat
13 Calculate dk+1, the solution of the Newton equations
J(xk) dk+1 = −F(xk) , (7.16)
and update
xk+1 := xk + dk+1 .
14 k := k + 1.
15 Until ‖F(xk)‖ ≤ ε
16 x∗ = xk
and
J(x0) = [ 4  2
          e  3 ] .
The iterations of Newton's method are described in Table 7.4, with ε = 10−15, where
the first column reports the iteration number, the second column the current iterate,
the third the value of the function at the current iterate, and the last its norm. The
quadratic convergence of the method is well illustrated in this example. Indeed, the
value of xk converges rapidly to the solution (0, 1)ᵀ, and the values of ‖F(xk)‖ decrease
quadratically.
We now analyze the impact of the starting point on the solution of Newton’s
method.
Example 7.12 (Newton fractal). Consider the system of equations
F(x) = ( x1³ − 3x1x2² − 1
         x2³ − 3x1²x2     ) = 0 .
Figure 7.7: Newton’s method: relation between the starting point and the solution
The result is presented in Figure 7.7(a), where the three roots are represented
by a + sign. We see that there is no direct relationship between the position of
the starting point and the root identified by the method. For example, look at the
gray areas at the bottom right of Figure 7.7(a). Although these starting points are
closer to the roots x∗ (b) and x∗ (w), Newton’s method converges towards x∗ (g) when
started from these areas. But the most noticeable feature of this figure is the shape
of the borders between each region. This type of configuration is called a fractal
(see Mandelbrot, 1982). The zoom presented in Figure 7.7(b) shows that two points
that are very close may be colored differently. This is an illustration of a chaotic
system, which exhibits a significantly different outcome when the starting conditions
are perturbed just a little bit.
We now generalize Theorem 7.7 for the case of n equations and n variables.
x0 ∈ B(x∗, η) , (7.19)
the sequence defined by
xk+1 = xk − J(xk)⁻¹ F(xk) , k = 0, 1, . . . , (7.20)
is well defined, converges toward x∗, and satisfies
‖xk+1 − x∗‖ ≤ (M/ρ) ‖xk − x∗‖² . (7.21)
Proof. In order for the sequence to be well defined, the matrix J(xk ) always has to
be invertible. By assumption, it is the case at x∗ . We choose η such that J(x) is
invertible for all x in a sphere B(x∗ , η) of radius η around x∗ . We take
η = min( r , ρ/(2M) ) . (7.22)
We first demonstrate that J(x0 ) is invertible, by using the theorem about the inverse
of a perturbed matrix (Theorem C.16), with A = J(x∗ ) and B = J(x0 ). The hypothesis
(C.28) on which the theorem is based is satisfied. Indeed,
‖J(x∗)⁻¹ (J(x0) − J(x∗))‖ ≤ ‖J(x∗)⁻¹‖ ‖J(x0) − J(x∗)‖
≤ (1/ρ) ‖J(x0) − J(x∗)‖   from (7.18)
≤ (M/ρ) ‖x0 − x∗‖   Lipschitz
≤ (M/ρ) η   from (7.19)
≤ 1/2   from (7.22) .
Therefore, J(x0 ) is invertible, and x1 given by (7.20) is well defined. By using this
result, Theorem C.16 and by noting that if y ≤ 1/2, then 1/(1 − y) ≤ 2, we obtain
‖J(x0)⁻¹‖ ≤ ‖J(x∗)⁻¹‖ / ( 1 − ‖J(x∗)⁻¹ (J(x0) − J(x∗))‖ ) (7.23)
≤ 2 ‖J(x∗)⁻¹‖ ≤ 2/ρ .
We have
x1 − x∗ = J(x0)⁻¹ ( F(x∗) − mx0(x∗) ) .
Consequently,
‖x1 − x∗‖ ≤ (2/ρ) (M/2) ‖x0 − x∗‖²   from (7.14) ,
7.3 Project
The general organization of the projects is described in Appendix D.
Objective
To analyze the impact of the starting point on the convergence of Newton’s method,
with inspiration taken from Example 7.12.
Approach
To create drawings similar to those of Figure 7.7, we use the following convention:
• Associate a specific color for each solution. For instance, when we have three
solutions, the RGB codes (255, 0, 0), (0, 255, 0) and (0, 0, 255) could be utilized.
Algorithm
Algorithm 7.3.
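A minimal Python sketch of the project for the system of Example 7.12, classifying each starting point of a grid by the index of the root it leads to (an index can then be mapped to an RGB color; the grid bounds, resolution, and tolerances are our choices):

```python
import math

# The three roots of the system of Example 7.12
ROOTS = [(1.0, 0.0), (-0.5, math.sqrt(3.0) / 2.0), (-0.5, -math.sqrt(3.0) / 2.0)]

def newton_basin(x, y, maxiter=50, eps=1e-10):
    """Run Newton's method from (x, y); return the index of the root
    reached, or -1 if the Jacobian becomes singular or no root is reached."""
    for _ in range(maxiter):
        f1 = x**3 - 3.0 * x * y**2 - 1.0
        f2 = y**3 - 3.0 * x**2 * y
        if f1 * f1 + f2 * f2 <= eps * eps:
            break
        # Jacobian entries: rows are equations, columns are variables
        a, b = 3.0 * x**2 - 3.0 * y**2, -6.0 * x * y
        c, d = -6.0 * x * y, 3.0 * y**2 - 3.0 * x**2
        det = a * d - b * c
        if det == 0.0:
            return -1
        # One Newton step: solve the 2x2 system J d = -F by Cramer's rule
        x -= (d * f1 - b * f2) / det
        y -= (a * f2 - c * f1) / det
    for i, (rx, ry) in enumerate(ROOTS):
        if (x - rx) ** 2 + (y - ry) ** 2 < 1e-6:
            return i
    return -1

# Root index for each starting point on a grid over [-2, 2] x [-2, 2]
n = 50
grid = [[newton_basin(-2.0 + 4.0 * i / (n - 1), -2.0 + 4.0 * j / (n - 1))
         for j in range(n)] for i in range(n)]
```

Mapping each index to a color reproduces the fractal borders of Figure 7.7.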
Problems
Exercise 7.1. The system
x2 = x21
x21 + (x2 − 2)2 = 4
has three roots: (0, 0)ᵀ, (−√3, 3)ᵀ, and (√3, 3)ᵀ. Note that there are three
intersections between a circle and a parabola (draw the sketch). Note also that the
Jacobian is singular when x1 = 0.
Exercise 7.2. The system
3x21 + 2x22 = 35
4x21 − 3x22 = 24
has four solutions: (−3, −2)ᵀ, (−3, 2)ᵀ, (3, −2)ᵀ, and (3, 2)ᵀ. Note that
there are four intersections between an ellipse and a hyperbola (draw the sketch).
Exercise 7.3. The system
x21 − x1 x2 + x22 = 21
x21 + 2x1 x2 − 8x22 = 0
has four solutions: (−2√7, −√7)ᵀ, (2√7, √7)ᵀ, (−4, 1)ᵀ, and (4, −1)ᵀ.
Warning: when implementing these systems, one must not confuse the Jacobian
and the gradient matrix. Each row of the Jacobian corresponds to an equation and
each column to a variable.
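As a sanity check on these exercises, Newton's method for two equations can be written in pure Python; each row of the Jacobian is one equation, as the warning requires. The starting point below is our choice, and leads to the solution (3, 2)ᵀ of Exercise 7.2:

```python
def newton_2x2(F, J, x0, eps=1e-12, maxiter=50):
    """Newton's method for two equations in two unknowns."""
    x1, x2 = x0
    for _ in range(maxiter):
        f1, f2 = F(x1, x2)
        if (f1 * f1 + f2 * f2) ** 0.5 <= eps:
            break
        (a, b), (c, d) = J(x1, x2)     # row i of J = equation i
        det = a * d - b * c            # assumed nonzero along the iterations
        # Solve J d = -F for the 2x2 case (Cramer's rule); do not invert J
        x1 -= (d * f1 - b * f2) / det
        x2 -= (a * f2 - c * f1) / det
    return x1, x2

# Exercise 7.2: 3 x1^2 + 2 x2^2 = 35 and 4 x1^2 - 3 x2^2 = 24
F = lambda x1, x2: (3 * x1**2 + 2 * x2**2 - 35.0, 4 * x1**2 - 3 * x2**2 - 24.0)
J = lambda x1, x2: ((6 * x1, 4 * x2), (8 * x1, -6 * x2))
sol = newton_2x2(F, J, (2.0, 1.0))
```

Starting from the other quadrants yields the other three solutions by symmetry.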
Chapter 8
Quasi-Newton methods
“You cannot have your cake and eat it too,” says a popular proverb. In this chapter,
however, we try! The method developed here has an effectiveness close to that of
Newton’s method, without requiring the calculation of derivatives.
Contents
8.1 Equation with one unknown
8.2 Systems of equations with multiple unknowns
8.3 Project
Sister Caasi Newton, or Quasi Newton, is the twin sister of Sir Isaac
Newton. Caasi Newton tried to follow in the footsteps of her illustri-
ous brother, but was never able to understand the complex concept of
derivatives. Her striking resemblance to her brother and the complete
absence of any writings cast doubt on her existence.
as(x) = ( F(x + s) − F(x) ) / s . (8.2)
Geometrically, the derivative at x is the slope of the tangent to the function at x. The
above approximation replaces the tangent by a secant intersecting the function at x
and x + s, as illustrated in Figure 8.2. The model obtained from this approximation
is therefore called the secant linear model of the function.
Figure 8.2: The secant linear model mx̂;s intersects the function F at x̂ and x̂ + s.
Definition 8.1 (Secant linear model of a function with one variable). Let F : R → R
be a differentiable function. The secant linear model of F at x̂ is a function mx̂;s :
R → R defined by
mx̂;s(x) = F(x̂) + ( (F(x̂ + s) − F(x̂)) / s ) (x − x̂) , (8.3)
where s ≠ 0.
We can now utilize the same principle as in Newton's method, replacing the deriva-
tive in (7.3) by its secant approximation, and obtain
x+ = x̂ − F(x̂) / as(x̂) . (8.4)
To obtain an algorithm, we now need only define the value of s. As said above, a
natural idea is to choose s small so as to obtain a good approximation of the derivative.
For example, s can be defined as
s = τx̂   if |x̂| ≥ 1 ,
s = τ    otherwise , (8.5)
where τ is small, for instance equal to 10−7. For a more sophisticated calculation of τ,
taking into account the machine epsilon and the precision obtained when calculating F,
we refer the reader to Dennis and Schnabel (1996, Algorithm A5.6.3). The algorithm
based on this definition of s is called the finite difference Newton’s method and is
presented as Algorithm 8.1.
Algorithm 8.1: Finite difference Newton's method: one variable
1 Objective
2 Find (an approximation of) a solution of the equation
F(x) = 0 .
3 Input
4 The function F : R → R.
5 A first approximation of the solution x0 ∈ R.
6 A parameter τ > 0.
7 The required precision ε ∈ R, ε > 0.
8 Output
9 An approximation of the solution x∗ ∈ R.
10 Initialization
11 k := 0.
12 Repeat
13 if |xk | ≥ 1 then
14 s := τxk
15 else
16 s := τ
17 xk+1 := xk − s F(xk) / ( F(xk + s) − F(xk) ) .
18 k := k + 1.
19 Until |F(xk)| ≤ ε
20 x∗ = xk .
The iterations of this algorithm applied to Example 7.3, with τ = 10−7 , are
described in Table 8.1. The difference with the iterations of Newton’s method (Ta-
ble 7.1) are almost imperceptible. What is even more interesting is that a higher value
of τ may still enable the algorithm to converge, even if this convergence is slow. Ta-
ble 8.2 contains the iterations of the algorithm applied to Example 7.3, with τ = 0.1.
The first two iterations of this algorithm are illustrated in Figure 8.3. Intuitively, we
expect it to work well, with a relatively large s, when the function is not too non linear.
Figure 8.3: Finite difference Newton’s method for Example 7.3
Table 8.1: Iterations for finite difference Newton’s method (τ = 10−7 ) for Example
7.3
k xk F(xk )
0 +2.00000000E+00 +2.00000000E+00
1 +1.50000003E+00 +2.50000076E-01
2 +1.41666667E+00 +6.94446047E-03
3 +1.41421569E+00 +6.00768206E-06
4 +1.41421356E+00 +4.81081841E-12
5 +1.41421356E+00 +4.44089210E-16
In practice, there is no reason to take τ = 0.1 because, even if this choice provides
results, it slows down the convergence. The only motivation to take a larger s would
be to save on function evaluations. This is the idea of the secant method, that uses
a step based on the last two iterates, that is
s = xk−1 − xk ,
in such a way that (8.2) is written as
as(xk) = ( F(xk−1) − F(xk) ) / ( xk−1 − xk ) .
Therefore, no additional evaluation of the function is required, because F(xk−1 ) has
already been calculated during the previous iteration. The secant method is described
Table 8.2: Iterations for finite difference Newton’s method (τ = 0.1) for Example 7.3
k xk F(xk )
0 +2.00000000E+00 +2.00000000E+00
1 +1.52380952E+00 +3.21995465E-01
2 +1.42318594E+00 +2.54582228E-02
3 +1.41466775E+00 +1.28485582E-03
4 +1.41423526E+00 +6.13706622E-05
5 +1.41421460E+00 +2.92283950E-06
6 +1.41421361E+00 +1.39183802E-07
7 +1.41421356E+00 +6.62780186E-09
8 +1.41421356E+00 +3.15609761E-10
9 +1.41421356E+00 +1.50284230E-11
10 +1.41421356E+00 +7.15427717E-13
11 +1.41421356E+00 +3.41948692E-14
12 +1.41421356E+00 +1.33226763E-15
13 +1.41421356E+00 +4.44089210E-16
as Algorithm 8.2. Note that this technique does not work at the first iteration (k = 0),
as xk−1 is not defined. For the first iteration, an arbitrary value for a0 is therefore
selected.
Table 8.3 shows the iterations of the secant method, with a0 = 1. Figure 8.4
illustrates the first two iterations of the method for this example. At the iterate
x0 = 2, a first arbitrary linear model, with slope 1, is first considered. It intersects
the x-axis at 0, which becomes the next iterate x1 . Then the secant method can start.
The secant intersecting the function at x0 and x1 is considered. It intersects the x-
axis at 1, which becomes iterate x2 . The next iteration is illustrated in Figure 8.5.
The secant intersecting the function at x1 and x2 , crosses the x-axis at x3 = 2.
Interestingly, by coincidence, it happens to be the same value as x0 . But it does
not mean that iteration 3 is the same as iteration 0. Indeed, between the two, the
algorithm has collected information about the function, and accumulated it into ak .
While a0 = 1 is an arbitrary value, not containing any information about F, the value a3 = 3
used for the next secant model has been calculated using explicit measures of the
function F. As a consequence, the secant intersecting the function at x2 and x3
crosses the x-axis at x4 , that happens not to be too far from the zero of the function.
Therefore, the convergence of the method from iteration 4 is pretty fast, as can be
seen in Table 8.3. Indeed, during the last iterations, xk and xk−1 are closer and closer
and s = xk−1 − xk is smaller and smaller. Geometrically, it means that the secant is
closer and closer to the actual tangent, and the method becomes similar to the finite
difference Newton’s method. The rate of convergence is fast, and is characterized as
superlinear.
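A Python sketch of the secant method as just described (Algorithm 8.2); the guard against a stagnating iterate in finite arithmetic is our addition:

```python
def secant(F, x0, a0=1.0, eps=1e-12, maxiter=100):
    """Secant method for one equation F(x) = 0 (Algorithm 8.2)."""
    x, a = x0, a0                  # a approximates the derivative
    for k in range(maxiter):
        if abs(F(x)) <= eps:
            return x, k
        x_new = x - F(x) / a       # step of the secant model
        if x_new == x:             # no progress in finite arithmetic
            return x, k
        a = (F(x) - F(x_new)) / (x - x_new)   # slope of the new secant
        x = x_new
    return x, maxiter

# Example 7.3 again: F(x) = x**2 - 2, from x0 = 2 with a0 = 1
root, iters = secant(lambda x: x**2 - 2, 2.0)
```

The first iterates are x1 = 0, x2 = 1, and x3 = 2, exactly as in Figures 8.4 and 8.5, after which the convergence becomes superlinear.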
Algorithm 8.2: Secant method: one variable
1 Objective
2 Find (an approximation of) a solution of the equation
F(x) = 0 .
3 Input
4 The function F : R → R.
5 A first approximation of the solution x0 ∈ R.
6 A first approximation of the derivative a0 (by default: a0 = 1).
7 The required precision ε ∈ R, ε > 0.
8 Output
9 An approximation of the solution x∗ ∈ R.
10 Initialization
11 k := 0.
12 Repeat
13 Update the current iterate:
xk+1 := xk − F(xk) / ak .
14 Update the approximation of the derivative:
ak+1 := ( F(xk) − F(xk+1) ) / ( xk − xk+1 ) .
15 k := k + 1.
16 Until |F(xk)| ≤ ε
17 x∗ = xk
Definition 8.2 (Superlinear convergence). Consider a sequence (xk)k in Rn that
converges toward x∗. The sequence is said to converge superlinearly toward x∗ if
lim_{k→∞} ‖xk+1 − x∗‖ / ‖xk − x∗‖ = 0 . (8.6)
Table 8.3: Iterations for the secant method (a0 = 1) for Example 7.3
k xk F(xk ) as (xk )
0 +2.00000000E+00 +2.00000000E+00 +1.00000000E+00
1 +0.00000000E+00 -2.00000000E+00 +2.00000000E+00
Figure 8.4: Iterations 0 and 1 of the secant method for Example 7.3
Figure 8.5: Iterations 2 and 3 of the secant method for Example 7.3
Algorithm 8.3: Finite difference Newton's method: n variables
1 Objective
2 Find (an approximation of) a solution of the system of equations
F(x) = 0 . (8.7)
3 Input
4 The function F : Rn → Rn .
5 A first approximation of the solution x0 ∈ Rn .
6 A parameter τ > 0.
7 The required precision ε ∈ R, ε > 0.
8 Output
9 An approximation of the solution x∗ ∈ Rn .
10 Initialization
11 k := 0.
12 Repeat
13 for j = 1, . . . , n do
14 if |(xk)j| ≥ 1 then
15 sj := τ (xk)j
16 else if 0 ≤ (xk)j ≤ 1 then
17 sj := τ
18 else
19 sj := −τ
21 Calculate the jth column of Ak:
(Ak)j := ( F(xk + sj ej) − F(xk) ) / sj , j = 1, . . . , n ,
where (Ak)j is the jth column of Ak, and ej ∈ Rn is the jth canonical
vector, composed of 0, except at the jth place containing 1 instead.
22 Calculate dk+1 solution of Ak dk+1 = −F(xk ).
23 xk+1 := xk + dk+1 .
24 k := k + 1.
25 Until ‖F(xk)‖ ≤ ε
26 x∗ = xk .
Definition 8.3 (Linear secant model for a function with n variables). Let F : Rn →
Rm be a Lipschitz continuous function and A ∈ Rm×n a matrix. The linear secant
model of F at x̂ is a function mx̂;A : Rn → Rm defined by
mx̂;A(x) = F(x̂) + A(x − x̂) . (8.8)
Table 8.4: Iterations for the finite difference Newton’s method for Example 7.11
(τ = 10−7 )
k xk F(xk ) ‖F(xk)‖
0 +1.00000000e+00 +3.00000000e+00 +3.45723769e+00
+1.00000000e+00 +1.71828183e+00
1 +1.52359228e-01 +7.56629845e-01 +1.15470878e+00
+1.19528158e+00 +8.72274977e-01
2 -1.08376852e-02 +5.19684806e-02 +1.14042632e-01
+1.03611119e+00 +1.01513541e-01
3 -8.89667761e-04 +1.29445824e-03 +3.94234579e-03
+1.00153532e+00 +3.72377069e-03
4 -1.37016733e-06 +3.13751967e-06 +8.08060994e-06
+1.00000294e+00 +7.44662523e-06
5 -5.68344146e-12 +1.09472431e-11 +2.98662028e-11
+1.00000000e+00 +2.77875500e-11
6 -9.93522913e-17 +0.00000000e+00 +4.44089210e-16
+1.00000000e+00 +4.44089210e-16
Table 8.5: Iterations for the finite difference Newton’s method for Example 7.11
(τ = 0.1)
k xk F(xk ) ‖F(xk)‖
0 +1.00000000e+00 +3.00000000e+00 +3.45723769e+00
+1.00000000e+00 +1.71828183e+00
1 +1.64629659e-01 +8.02103265e-01 +1.21852778e+00
+1.20238971e+00 +9.17300554e-01
2 -1.45741083e-02 +8.85985792e-02 +1.88972898e-01
+1.05713499e+00 +1.66916290e-01
3 -5.72356301e-03 +8.21459268e-03 +2.52536228e-02
+1.00976678e+00 +2.38802414e-02
4 -4.76896360e-04 +1.48824845e-03 +3.51842725e-03
+1.00122016e+00 +3.18817297e-03
..
.
14 -2.17152295e-13 +5.45341550e-13 +1.36591792e-12
+1.00000000e+00 +1.25233157e-12
15 -2.49137961e-14 +6.26165786e-14 +1.56919411e-13
+1.00000000e+00 +1.43884904e-13
16 -2.79620466e-15 +7.10542736e-15 +1.79018084e-14
+1.00000000e+00 +1.64313008e-14
17 -2.35536342e-16 +8.88178420e-16 +1.98602732e-15
+1.00000000e+00 +1.77635684e-15
18 -5.13007076e-17 +0.00000000e+00 +0.00000000e+00
+1.00000000e+00 +0.00000000e+00
or
Ak (xk − xk−1) = F(xk) − F(xk−1) .
Definition 8.4 (Secant equation). A linear model satisfies the secant equation in xk
and xk−1 if the matrix A defining it is such that
A (xk − xk−1) = F(xk) − F(xk−1) . (8.10)
By taking
dk−1 = xk − xk−1 and yk−1 = F(xk) − F(xk−1) , (8.11)
it is written as
A dk−1 = yk−1 . (8.12)
Given xk , xk−1 , F(xk ) and F(xk−1 ), the linear secant model is based on a ma-
trix A satisfying the system of equations (8.10) or (8.12). This system of n linear
Quasi-Newton methods 211
equations has n2 unknowns (the elements of A). Therefore, when n > 1, it is always
underdetermined and has an infinite number of solutions. From a geometrical point
of view, there are infinitely many hyperplanes passing through the two points.
The idea proposed by Broyden (1965) is to choose, among the infinite number of
linear models verifying the secant equation, the one that is the closest to the model
established during the previous iteration, thereby preserving, to the largest possible
extent, what has already been calculated. We now calculate the difference between two
successive models, that is mxk−1;Ak−1(x), the model of the function in the previous
iterate xk−1, and mxk;Ak(x), the model of the function in the current iterate xk.
Lemma 8.5. Let mxk ;Ak (x) and mxk−1 ;Ak−1 (x) be linear secant models of a func-
tion F : Rn → Rn in xk and xk−1 , respectively. If these models satisfy the secant
equation, we can characterize the difference between the two models by
mxk ;Ak (x) − mxk−1 ;Ak−1 (x) = (Ak − Ak−1 )(x − xk−1 ) . (8.13)
Proof. The proof exploits the definition (8.8) of the secant model, and the secant
equation (8.10).
mxk ;Ak (x) − mxk−1 ;Ak−1 (x) = F(xk ) + Ak (x − xk )
− F(xk−1 ) − Ak−1 (x − xk−1 ) from (8.8)
= F(xk ) + Ak (x − xk )
− F(xk−1 ) − Ak−1 (x − xk−1 )
+ Ak xk−1 − Ak xk−1
= F(xk ) − F(xk−1 ) − Ak (xk − xk−1 )
+ (Ak − Ak−1 )(x − xk−1 )
= (Ak − Ak−1 )(x − xk−1 ) from (8.10) .
Theorem 8.6 (Broyden update). Let mxk−1;Ak−1(x) be the linear secant model of
a function F : Rn → Rn in xk−1 and let us take xk ∈ Rn, xk ≠ xk−1. The linear
secant model of F in xk that satisfies the secant equation (8.10) and is as close
as possible to mxk−1;Ak−1(x) is mxk;Ak(x), (8.14)
with
Ak = Ak−1 + ( (yk−1 − Ak−1 dk−1) dk−1ᵀ ) / ( dk−1ᵀ dk−1 ) , (8.15)
where dk−1 = xk − xk−1 and yk−1 = F(xk ) − F(xk−1 ).
Proof. According to Lemma 8.5, the difference between the two linear models is
(Ak − Ak−1 )(x − xk−1 ) . (8.16)
The secant equation imposes the behavior of the linear model solely in the direction
dk−1 . Therefore, the degrees of freedom should be explored in the directions that are
min_{A∈S} ‖A − Ak−1‖F .
For any A ∈ S, we have A dk−1 = yk−1, so that
‖Ak − Ak−1‖F = ‖ (yk−1 − Ak−1 dk−1) dk−1ᵀ ‖F / ( dk−1ᵀ dk−1 )
= ‖ (A − Ak−1) dk−1 dk−1ᵀ ‖F / ( dk−1ᵀ dk−1 )
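The Broyden update (8.15) can be sketched in Python with NumPy. The 2 × 2 test system (root at (1, 2)) and the starting data are our own choices; A0 is taken as the Jacobian at x0:

```python
import numpy as np

def broyden(F, x0, A0, eps=1e-10, maxiter=100):
    """Quasi-Newton method with the Broyden update (8.15)."""
    x = np.asarray(x0, dtype=float)
    A = np.asarray(A0, dtype=float)
    for _ in range(maxiter):
        Fx = F(x)
        if np.linalg.norm(Fx) <= eps:
            break
        d = np.linalg.solve(A, -Fx)          # secant model step: A d = -F(x)
        x_new = x + d
        y = F(x_new) - Fx
        # Broyden update: A_k = A_{k-1} + (y - A d) d^T / (d^T d)
        A = A + np.outer(y - A @ d, d) / (d @ d)
        x = x_new
    return x

def F(x):
    # Hypothetical test system with root (1, 2)
    return np.array([x[0] + x[1] - 3.0, x[0] ** 2 - x[1] + 1.0])

x0 = [0.9, 2.1]
A0 = [[1.0, 1.0], [2.0 * 0.9, -1.0]]   # Jacobian of F at x0
x_star = broyden(F, x0, A0)
```

Note that only function values are used after the initialization: no derivatives of F appear in the loop.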