Convex Optimization:
Algorithms and Complexity
Sébastien Bubeck
Theory Group, Microsoft Research
[email protected]
Boston — Delft
Foundations and Trends® in Machine Learning
Published, sold and distributed by:
now Publishers Inc.
PO Box 1024
Hanover, MA 02339
United States
Tel. +1-781-985-4510
www.nowpublishers.com
[email protected]
Outside North America:
now Publishers Inc.
PO Box 179
2600 AD Delft
The Netherlands
Tel. +31-6-51115274
The preferred citation for this publication is
S. Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and
Trends® in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.
This Foundations and Trends® issue was typeset in LaTeX using a class file designed
by Neal Parikh. Printed on acid-free paper.
ISBN: 978-1-60198-861-4
© 2015 S. Bubeck
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, mechanical, photocopying, recording
or otherwise, without prior written permission of the publishers.
Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The 'services' for users can be found on the internet at: www.copyright.com
For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1 781 871 0245; www.nowpublishers.com;
[email protected]
now Publishers Inc. has an exclusive license to publish this material worldwide. Permission
to use this content must be obtained from the copyright license holder. Please apply to
now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com;
e-mail: [email protected]
Foundations and Trends® in Machine Learning
Volume 8, Issue 3-4, 2015
Editorial Board
Editor-in-Chief
Michael Jordan
University of California, Berkeley
United States
Editors
Peter Bartlett, UC Berkeley
Yoshua Bengio, University of Montreal
Avrim Blum, CMU
Craig Boutilier, University of Toronto
Stephen Boyd, Stanford University
Carla Brodley, Tufts University
Inderjit Dhillon, UT Austin
Jerome Friedman, Stanford University
Kenji Fukumizu, ISM, Japan
Zoubin Ghahramani, University of Cambridge
David Heckerman, Microsoft Research
Tom Heskes, Radboud University
Geoffrey Hinton, University of Toronto
Aapo Hyvarinen, HIIT, Finland
Leslie Pack Kaelbling, MIT
Michael Kearns, UPenn
Daphne Koller, Stanford University
John Lafferty, University of Chicago
Michael Littman, Brown University
Gabor Lugosi, Pompeu Fabra University
David Madigan, Columbia University
Pascal Massart, University of Paris-Sud
Andrew McCallum, UMass Amherst
Marina Meila, University of Washington
Andrew Moore, CMU
John Platt, Microsoft Research
Luc de Raedt, University of Freiburg
Christian Robert, U Paris-Dauphine
Sunita Sarawagi, IIT Bombay
Robert Schapire, Princeton University
Bernhard Schoelkopf, MPI Tübingen
Richard Sutton, University of Alberta
Larry Wasserman, CMU
Bin Yu, UC Berkeley
Editorial Scope
Topics
Foundations and Trends® in Machine Learning publishes survey and
tutorial articles on the theory, algorithms and applications of machine
learning, including the following topics:
• Adaptive control and signal processing
• Applications and case studies
• Behavioral, cognitive, and neural learning
• Bayesian learning
• Classification and prediction
• Clustering
• Data mining
• Dimensionality reduction
• Evaluation
• Game theoretic learning
• Graphical models
• Independent component analysis
• Inductive logic programming
• Kernel methods
• Markov chain Monte Carlo
• Model choice
• Nonparametric methods
• Online learning
• Optimization
• Reinforcement learning
• Relational learning
• Robustness
• Spectral methods
• Statistical learning theory
• Variational inference
• Visualization
Information for Librarians
Foundations and Trends® in Machine Learning, 2015, Volume 8, 6 issues.
ISSN paper version 1935-8237. ISSN online version 1935-8245. Also available
as a combined paper and online subscription.
Foundations and Trends® in Machine Learning
Vol. 8, No. 3-4 (2015) 231–357
© 2015 S. Bubeck
DOI: 10.1561/2200000050
Convex Optimization: Algorithms and
Complexity
Sébastien Bubeck
Theory Group, Microsoft Research
[email protected]
Contents
1 Introduction
  1.1 Some convex optimization problems in machine learning
  1.2 Basic properties of convexity
  1.3 Why convexity?
  1.4 Black-box model
  1.5 Structured optimization
  1.6 Overview of the results and disclaimer

2 Convex optimization in finite dimension
  2.1 The center of gravity method
  2.2 The ellipsoid method
  2.3 Vaidya’s cutting plane method
  2.4 Conjugate gradient

3 Dimension-free convex optimization
  3.1 Projected subgradient descent for Lipschitz functions
  3.2 Gradient descent for smooth functions
  3.3 Conditional gradient descent, aka Frank-Wolfe
  3.4 Strong convexity
  3.5 Lower bounds
  3.6 Geometric descent
  3.7 Nesterov’s accelerated gradient descent

4 Almost dimension-free convex optimization in non-Euclidean spaces
  4.1 Mirror maps
  4.2 Mirror descent
  4.3 Standard setups for mirror descent
  4.4 Lazy mirror descent, aka Nesterov’s dual averaging
  4.5 Mirror prox
  4.6 The vector field point of view on MD, DA, and MP

5 Beyond the black-box model
  5.1 Sum of a smooth and a simple non-smooth term
  5.2 Smooth saddle-point representation of a non-smooth function
  5.3 Interior point methods

6 Convex optimization and randomness
  6.1 Non-smooth stochastic optimization
  6.2 Smooth stochastic optimization and mini-batch SGD
  6.3 Sum of smooth and strongly convex functions
  6.4 Random coordinate descent
  6.5 Acceleration by randomization for saddle points
  6.6 Convex relaxation and randomized rounding
  6.7 Random walk based methods

Acknowledgements

References
Abstract
This monograph presents the main complexity theorems in convex optimization and their corresponding algorithms. Starting from the fundamental theory of black-box optimization, the material progresses towards recent advances in structural optimization and stochastic optimization. Our presentation of black-box optimization, strongly influenced by Nesterov’s seminal book and Nemirovski’s lecture notes, includes the analysis of cutting plane methods, as well as (accelerated) gradient descent schemes. We also pay special attention to non-Euclidean settings (relevant algorithms include Frank-Wolfe, mirror descent, and dual averaging) and discuss their relevance in machine learning. We provide a gentle introduction to structural optimization with FISTA (to optimize a sum of a smooth and a simple non-smooth term), saddle-point mirror prox (Nemirovski’s alternative to Nesterov’s smoothing), and a concise description of interior point methods. In stochastic optimization we discuss stochastic gradient descent, mini-batches, random coordinate descent, and sublinear algorithms. We also briefly touch upon convex relaxation of combinatorial problems and the use of randomness to round solutions, as well as random walk based methods.
S. Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015. DOI: 10.1561/2200000050.
1 Introduction
The central objects of our study are convex functions and convex sets in R^n.

Definition 1.1 (Convex sets and convex functions). A set X ⊂ R^n is said to be convex if it contains all of its segments, that is

    ∀(x, y, γ) ∈ X × X × [0, 1],  (1 − γ)x + γy ∈ X.

A function f : X → R is said to be convex if it always lies below its chords, that is

    ∀(x, y, γ) ∈ X × X × [0, 1],  f((1 − γ)x + γy) ≤ (1 − γ)f(x) + γf(y).
We are interested in algorithms that take as input a convex set X
and a convex function f and output an approximate minimum of f
over X . We write compactly the problem of finding the minimum of f
over X as
min. f (x)
s.t. x ∈ X .
In the following we will make more precise how the set of constraints X
and the objective function f are specified to the algorithm. Before that
we proceed to give a few important examples of convex optimization
problems in machine learning.
1.1 Some convex optimization problems in machine learning
Many fundamental convex optimization problems in machine learning
take the following form:
    min_{x ∈ R^n}  ∑_{i=1}^m f_i(x) + λ R(x),        (1.1)
where the functions f_1, . . . , f_m, R are convex and λ ≥ 0 is a fixed parameter. The interpretation is that f_i(x) represents the cost of using x on the ith element of some data set, and R(x) is a regularization term which enforces some "simplicity" in x. We discuss now major instances of (1.1). In all cases one has a data set of the form (w_i, y_i) ∈ R^n × Y, i = 1, . . . , m and the cost function f_i depends only on the pair (w_i, y_i). We
refer to Hastie et al. [2001], Schölkopf and Smola [2002], Shalev-Shwartz
and Ben-David [2014] for more details on the origin of these important
problems. The mere objective of this section is to expose the reader to a
few concrete convex optimization problems which are routinely solved.
In classification one has Y = {−1, 1}. Taking f_i(x) = max(0, 1 − y_i x^⊤ w_i) (the so-called hinge loss) and R(x) = ‖x‖_2^2 one obtains the SVM problem. On the other hand taking f_i(x) = log(1 + exp(−y_i x^⊤ w_i)) (the logistic loss) and again R(x) = ‖x‖_2^2 one obtains the (regularized) logistic regression problem.
In regression one has Y = R. Taking f_i(x) = (x^⊤ w_i − y_i)^2 and R(x) = 0 one obtains the vanilla least-squares problem, which can be rewritten in vector notation as

    min_{x ∈ R^n}  ‖Wx − Y‖_2^2,

where W ∈ R^{m×n} is the matrix with w_i^⊤ on the ith row and Y = (y_1, . . . , y_m)^⊤. With R(x) = ‖x‖_2^2 one obtains the ridge regression problem, while with R(x) = ‖x‖_1 this is the LASSO problem Tibshirani [1996].
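The least-squares and ridge problems are the rare cases with a closed-form solution; the following numpy sketch (illustrative data and names, assuming the ridge normal equations (W^⊤W + λI)x = W^⊤Y) computes both. The LASSO has no such closed form, which is one motivation for the first-order methods studied in the later chapters.

```python
import numpy as np

def ridge_solution(W, Y, lam):
    """Minimizer of ||W x - Y||_2^2 + lam * ||x||_2^2 via its normal equations."""
    n = W.shape[1]
    return np.linalg.solve(W.T @ W + lam * np.eye(n), W.T @ Y)

rng = np.random.default_rng(1)
W = rng.standard_normal((50, 10))
Y = rng.standard_normal(50)
x_ridge = ridge_solution(W, Y, lam=1.0)
x_ls = np.linalg.lstsq(W, Y, rcond=None)[0]   # vanilla least squares (R = 0)
print(np.linalg.norm(x_ls - x_ridge))
```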
Our last two examples are of a slightly different flavor. In particular
the design variable x is now best viewed as a matrix, and thus we
denote it by a capital letter X. The sparse inverse covariance estimation
problem can be written as follows, given some empirical covariance
matrix Y ,
    min.  Tr(XY) − logdet(X) + λ‖X‖_1
    s.t.  X ∈ R^{n×n}, X^⊤ = X, X ⪰ 0.

Intuitively the above problem is simply a regularized maximum likelihood estimator (under a Gaussian assumption).
Finally we introduce the convex version of the matrix completion problem. Here our data set consists of observations of some of the entries of an unknown matrix Y, and we want to "complete" the unobserved entries of Y in such a way that the resulting matrix is "simple" (in the sense that it has low rank). After some massaging (see Candès and Recht [2009]) the (convex) matrix completion problem can be formulated as follows:

    min.  Tr(X)
    s.t.  X ∈ R^{n×n}, X^⊤ = X, X ⪰ 0, X_{i,j} = Y_{i,j} for (i, j) ∈ Ω,

where Ω ⊂ [n]^2 and (Y_{i,j})_{(i,j)∈Ω} are given.
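As an aside, this formulation can be handed directly to an off-the-shelf modeling tool. The sketch below assumes the cvxpy package (an assumption; it is not mentioned in the text) and builds a small illustrative instance; the observed index set Omega and the matrix Y are made up.

```python
import numpy as np
import cvxpy as cp

n = 5
rng = np.random.default_rng(2)
G = rng.standard_normal((n, 2))
Y = G @ G.T                                        # low-rank PSD matrix to be completed
Omega = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (2, 2)]

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0] + [X[i, j] == Y[i, j] for (i, j) in Omega]
problem = cp.Problem(cp.Minimize(cp.trace(X)), constraints)
problem.solve()
print(problem.status, problem.value)
```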
1.2 Basic properties of convexity
A basic result about convex sets that we shall use extensively is the
Separation Theorem.
Theorem 1.1 (Separation Theorem). Let X ⊂ R^n be a closed convex set, and x_0 ∈ R^n \ X. Then, there exists w ∈ R^n and t ∈ R such that

    w^⊤ x_0 < t,  and  ∀x ∈ X, w^⊤ x ≥ t.

Note that if X is not closed then one can only guarantee that w^⊤ x_0 ≤ w^⊤ x, ∀x ∈ X (and w ≠ 0). This immediately implies the Supporting Hyperplane Theorem (∂X denotes the boundary of X, that is the closure without the interior):
Theorem 1.2 (Supporting Hyperplane Theorem). Let X ⊂ R^n be a convex set, and x_0 ∈ ∂X. Then, there exists w ∈ R^n, w ≠ 0 such that

    ∀x ∈ X,  w^⊤ x ≥ w^⊤ x_0.
We introduce now the key notion of subgradients.
Definition 1.2 (Subgradients). Let X ⊂ R^n, and f : X → R. Then g ∈ R^n is a subgradient of f at x ∈ X if for any y ∈ X one has

    f(x) − f(y) ≤ g^⊤ (x − y).

The set of subgradients of f at x is denoted ∂f(x).

To put it differently, for any x ∈ X and g ∈ ∂f(x), f is above the linear function y ↦ f(x) + g^⊤(y − x). The next result shows (essentially) that a convex function always admits subgradients.
Proposition 1.1 (Existence of subgradients). Let X ⊂ R^n be convex, and f : X → R. If ∀x ∈ X, ∂f(x) ≠ ∅ then f is convex. Conversely if f is convex then for any x ∈ int(X), ∂f(x) ≠ ∅. Furthermore if f is convex and differentiable at x then ∇f(x) ∈ ∂f(x).
Before going to the proof we recall the definition of the epigraph of
a function f : X → R:
epi(f ) = {(x, t) ∈ X × R : t ≥ f (x)}.
It is obvious that a function is convex if and only if its epigraph is a
convex set.
Proof. The first claim is almost trivial: let g ∈ ∂f((1 − γ)x + γy), then by definition one has

    f((1 − γ)x + γy) ≤ f(x) + γ g^⊤ (y − x),
    f((1 − γ)x + γy) ≤ f(y) + (1 − γ) g^⊤ (x − y),

which clearly shows that f is convex by adding the two (appropriately rescaled) inequalities.

Now let us prove that a convex function f has subgradients in the interior of X. We build a subgradient by using a supporting hyperplane to the epigraph of the function. Let x ∈ X. Then clearly (x, f(x)) ∈ ∂epi(f), and epi(f) is a convex set. Thus by using the Supporting Hyperplane Theorem, there exists (a, b) ∈ R^n × R such that

    a^⊤ x + b f(x) ≥ a^⊤ y + b t,  ∀(y, t) ∈ epi(f).        (1.2)
Clearly, by letting t tend to infinity, one can see that b ≤ 0. Now let us assume that x is in the interior of X. Then for ε > 0 small enough, y = x + εa ∈ X, which implies that b cannot be equal to 0 (recall that if b = 0 then necessarily a ≠ 0, which allows to conclude by contradiction). Thus rewriting (1.2) for t = f(y) one obtains

    f(x) − f(y) ≤ (1/|b|) a^⊤ (x − y).

Thus a/|b| ∈ ∂f(x), which concludes the proof of the second claim.
Finally let f be a convex and differentiable function. Then by definition:

    f(y) ≥ [f((1 − γ)x + γy) − (1 − γ)f(x)] / γ
         = f(x) + [f(x + γ(y − x)) − f(x)] / γ
         → f(x) + ∇f(x)^⊤ (y − x)   as γ → 0,

which shows that ∇f(x) ∈ ∂f(x).
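As a quick numerical illustration of Definition 1.2 (not part of the original text), the check below samples random pairs of points and verifies the subgradient inequality for f(x) = ‖x‖_1, whose subgradients include the coordinate-wise sign vector.

```python
import numpy as np

f = lambda x: np.abs(x).sum()        # f(x) = ||x||_1
rng = np.random.default_rng(3)

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    g = np.sign(x)                   # a subgradient of ||.||_1 at x
    assert f(x) - f(y) <= g @ (x - y) + 1e-12   # the inequality of Definition 1.2
print("subgradient inequality verified on all sampled pairs")
```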
In several cases of interest the set of constraints can have an empty interior, in which case the above proposition does not yield any information. However it is easy to replace int(X) by ri(X), the relative interior of X, which is defined as the interior of X when we view it as a subset of the affine subspace it generates. Other notions of convex analysis will prove to be useful in some parts of this text. In particular the notion of closed convex functions is convenient to exclude pathological cases: these are the convex functions with closed epigraphs. Sometimes it is also useful to consider the extension of a convex function f : X → R to a function from R^n to R by setting f(x) = +∞ for x ∉ X. In convex analysis one uses the term proper convex function to denote a convex function with values in R ∪ {+∞} such that there exists x ∈ R^n with f(x) < +∞. From now on all convex functions will be closed, and if necessary we consider also their proper extension. We refer the reader to Rockafellar [1970] for an extensive discussion of these notions.
1.3 Why convexity?
The key to the algorithmic success in minimizing convex functions is
that these functions exhibit a local to global phenomenon. We have
already seen one instance of this in Proposition 1.1, where we showed
that ∇f (x) ∈ ∂f (x): the gradient ∇f (x) contains a priori only local
information about the function f around x, while the subdifferential ∂f(x) gives global information in the form of a linear lower bound on the entire function. Another instance of this local to global phenomenon
is that local minima of convex functions are in fact global minima:
Proposition 1.2 (Local minima are global minima). Let f be convex. If x
is a local minimum of f then x is a global minimum of f . Furthermore
this happens if and only if 0 ∈ ∂f (x).
Proof. Clearly 0 ∈ ∂f (x) if and only if x is a global minimum of f .
Now assume that x is a local minimum of f. Then for γ small enough
one has for any y,
f (x) ≤ f ((1 − γ)x + γy) ≤ (1 − γ)f (x) + γf (y),
which implies f (x) ≤ f (y) and thus x is a global minimum of f .
The nice behavior of convex functions will allow for very fast algo-
rithms to optimize them. This alone would not be sufficient to justify
the importance of this class of functions (after all constant functions
are pretty easy to optimize). However it turns out that surprisingly
many optimization problems admit a convex (re)formulation. The ex-
cellent book Boyd and Vandenberghe [2004] describes in great detail
the various methods that one can employ to uncover the convex aspects
of an optimization problem. We will not repeat these arguments here,
but we have already seen that many famous machine learning problems
(SVM, ridge regression, logistic regression, LASSO, sparse covariance
estimation, and matrix completion) are formulated as convex problems.
We conclude this section with a simple extension of the optimality
condition “0 ∈ ∂f (x)” to the case of constrained optimization. We state
this result in the case of a differentiable function for the sake of simplicity.
Proposition 1.3 (First order optimality condition). Let f be convex and X a closed convex set on which f is differentiable. Then

    x∗ ∈ argmin_{x∈X} f(x),

if and only if one has

    ∇f(x∗)^⊤ (x∗ − y) ≤ 0,  ∀y ∈ X.

Proof. The "if" direction is trivial by using that a gradient is also a subgradient. For the "only if" direction it suffices to note that if ∇f(x)^⊤(y − x) < 0, then f is locally decreasing around x on the line to y (simply consider h(t) = f(x + t(y − x)) and note that h′(0) = ∇f(x)^⊤(y − x)).
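A small numerical illustration (illustrative data, not from the text): minimizing f(x) = ‖x − c‖_2^2 over the Euclidean unit ball with c outside the ball, the minimizer is x∗ = c/‖c‖_2, and the condition of Proposition 1.3 can be checked at random points of the ball.

```python
import numpy as np

c = np.array([3.0, -2.0, 1.0])
x_star = c / np.linalg.norm(c)               # projection of c onto the unit ball
grad = 2.0 * (x_star - c)                    # gradient of ||x - c||^2 at x_star
rng = np.random.default_rng(4)

for _ in range(1000):
    y = rng.standard_normal(3)
    y /= max(1.0, np.linalg.norm(y))         # an arbitrary point of the unit ball
    assert grad @ (x_star - y) <= 1e-12      # first order optimality condition
print("optimality condition verified on all sampled points")
```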
1.4 Black-box model
We now describe our first model of “input" for the objective function
and the set of constraints. In the black-box model we assume that
we have unlimited computational resources, the set of constraints X is
known, and the objective function f : X → R is unknown but can be
accessed through queries to oracles:
• A zeroth order oracle takes as input a point x ∈ X and outputs
the value of f at x.
• A first order oracle takes as input a point x ∈ X and outputs a
subgradient of f at x.
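To fix ideas, here is a minimal sketch of what such oracle access could look like in code (an illustration only; the hidden function, a maximum of affine functions, and all names are made up). An algorithm in the black-box model interacts with f exclusively through the two query methods.

```python
import numpy as np

class FirstOrderOracle:
    """Black-box access to the convex function f(x) = max_i (a_i^T x + b_i)."""

    def __init__(self, A, b):
        self._A, self._b = A, b
        self.calls = 0

    def value(self, x):                       # zeroth order oracle
        self.calls += 1
        return float(np.max(self._A @ x + self._b))

    def subgradient(self, x):                 # first order oracle
        self.calls += 1
        i = int(np.argmax(self._A @ x + self._b))
        return self._A[i]                     # gradient of the active affine piece

rng = np.random.default_rng(5)
oracle = FirstOrderOracle(rng.standard_normal((8, 3)), rng.standard_normal(8))
x = np.zeros(3)
print(oracle.value(x), oracle.subgradient(x), oracle.calls)
```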
In this context we are interested in understanding the oracle complexity
of convex optimization, that is how many queries to the oracles are
necessary and sufficient to find an ε-approximate minimum of a convex function. To show an upper bound on the oracle complexity we need to
propose an algorithm, while lower bounds are obtained by information
theoretic reasoning (we need to argue that if the number of queries is
“too small" then we don’t have enough information about the function
to identify an ε-approximate solution).
From a mathematical point of view, the strength of the black-box
model is that it will allow us to derive a complete theory of convex op-
timization, in the sense that we will obtain matching upper and lower
bounds on the oracle complexity for various subclasses of interesting
convex functions. While the model by itself does not limit our compu-
tational resources (for instance any operation on the constraint set X is
allowed) we will of course pay special attention to the algorithms’ com-
putational complexity (i.e., the number of elementary operations that
the algorithm needs to do). We will also be interested in the situation
where the set of constraints X is unknown and can only be accessed through a separation oracle: given x ∈ R^n, it outputs either that x is in X, or if x ∉ X then it outputs a separating hyperplane between x and X.
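For instance, a separation oracle for the Euclidean ball is straightforward to write down; the sketch below (illustrative, not from the text) returns a separating hyperplane w^⊤ z ≤ t whenever the query point lies outside the ball.

```python
import numpy as np

def ball_separation_oracle(x, radius=1.0):
    """Separation oracle for {z : ||z||_2 <= radius}: None if x is feasible,
    otherwise a pair (w, t) with w^T x > t and w^T z <= t for all z in the ball."""
    norm_x = np.linalg.norm(x)
    if norm_x <= radius:
        return None
    w = x / norm_x                            # outward unit normal
    t = radius                                # sup of w^T z over the ball
    return w, t

print(ball_separation_oracle(np.array([0.3, 0.2])))    # inside: None
print(ball_separation_oracle(np.array([3.0, 4.0])))    # outside: separating hyperplane
```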
The black-box model was essentially developed in the early days
of convex optimization (in the Seventies) with Nemirovski and Yudin
[1983] being still an important reference for this theory (see also Ne-
mirovski [1995]). In recent years this model and the corresponding
algorithms have regained a lot of popularity, essentially for two reasons:
• It is possible to develop algorithms with dimension-free oracle
complexity which is quite attractive for optimization problems in
very high dimension.
• Many algorithms developed in this model are robust to noise
in the output of the oracles. This is especially interesting for
stochastic optimization, and very relevant to machine learning
applications. We will explore this in detail in Chapter 6.
Chapter 2, Chapter 3 and Chapter 4 are dedicated to the study of
the black-box model (noisy oracles are discussed in Chapter 6). We do
not cover the setting where only a zeroth order oracle is available, also
called derivative free optimization, and we refer to Conn et al. [2009],
Audibert et al. [2011] for further references on this.
1.5 Structured optimization
The black-box model described in the previous section seems extremely
wasteful for the applications we discussed in Section 1.1. Consider for
instance the LASSO objective: x ↦ ‖Wx − y‖_2^2 + ‖x‖_1. We know this function globally, and assuming that we can only make local queries through oracles seems like an artificial constraint for the design of algorithms. Structured optimization tries to address this observation. Ultimately one would like to take into account the global structure of both f and X in order to propose the most efficient optimization procedure. An extremely powerful hammer for this task is the class of Interior Point Methods. We will describe this technique in Chapter 5 alongside other more recent techniques such as FISTA or Mirror Prox.
We briefly describe now two classes of optimization problems for
which we will be able to exploit the structure very efficiently, these
are the LPs (Linear Programs) and SDPs (Semi-Definite Programs).
Ben-Tal and Nemirovski [2001] describe a more general class of Conic
Programs but we will not go in that direction here.
The class LP consists of problems where f(x) = c^⊤ x for some c ∈ R^n, and X = {x ∈ R^n : Ax ≤ b} for some A ∈ R^{m×n} and b ∈ R^m.
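A toy LP in exactly this form can be solved with standard software; the sketch below assumes scipy's linprog routine, and the data c, A, b is made up purely to illustrate the format min c^⊤x subject to Ax ≤ b.

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])
A = np.array([[-1.0, 0.0],
              [0.0, -1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])                 # encodes x >= 0 and x_1 + x_2 <= 1

res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None), (None, None)])
print(res.x, res.fun)
```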
The class SDP consists of problems where the optimization variable is a symmetric matrix X ∈ R^{n×n}. Let S^n be the space of n × n symmetric matrices (respectively S^n_+ is the space of positive semi-definite matrices), and let ⟨·, ·⟩ be the Frobenius inner product (recall that it can be written as ⟨A, B⟩ = Tr(A^⊤ B)). In the class SDP the problems are of the following form: f(X) = ⟨X, C⟩ for some C ∈ R^{n×n}, and X = {X ∈ S^n_+ : ⟨X, A_i⟩ ≤ b_i, i ∈ {1, . . . , m}} for some A_1, . . . , A_m ∈ R^{n×n} and b ∈ R^m. Note that the matrix completion problem described in Section 1.1 is an example of an SDP.
1.6 Overview of the results and disclaimer
The overarching aim of this monograph is to present the main complexity theorems in convex optimization and the corresponding algorithms. We focus on five major results in convex optimization which give the overall structure of the text: the existence of efficient cutting-plane methods with optimal oracle complexity (Chapter 2), a complete characterization of the relation between first order oracle complexity and curvature in the objective function (Chapter 3), first order methods beyond Euclidean spaces (Chapter 4), the fact that non-black box methods (such as interior point methods) can give a quadratic improvement in the number of iterations with respect to optimal black-box methods (Chapter 5), and finally the noise robustness of first order methods (Chapter 6). Table 1.1 can be used as a quick reference to the results proved in Chapter 2 to Chapter 5, as well as some of the results of Chapter 6 (this last chapter is the most relevant to machine learning but the results are also slightly more specific, which makes them harder to summarize).
An important disclaimer is that the above selection leaves out meth-
ods derived from duality arguments, as well as the two most popular
research avenues in convex optimization: (i) using convex optimization
in non-convex settings, and (ii) practical large-scale algorithms. Entire
books have been written on these topics, and new books have yet to be
written on the impressive collection of new results obtained for both
(i) and (ii) in the past five years.
A few of the blatant omissions regarding (i) include (a) the theory of submodular optimization (see Bach [2013]), (b) convex relaxations of combinatorial problems (a short example is given in Section 6.6), and (c) methods inspired by convex optimization for non-convex problems such as low-rank matrix factorization (see e.g. Jain et al. [2013] and references therein), neural network optimization, etc.
With respect to (ii) the most glaring omissions include (a) heuris-
tics (the only heuristic briefly discussed here is the non-linear conjugate
gradient in Section 2.4), (b) methods for distributed systems, and (c)
adaptivity to unknown parameters. Regarding (a) we refer to Nocedal
and Wright [2006] where the most practical algorithms are discussed in
great detail (e.g., quasi-Newton methods such as BFGS and L-BFGS,
primal-dual interior point methods, etc.). The recent survey Boyd
et al. [2011] discusses the alternating direction method of multipliers
(ADMM) which is a popular method to address (b). Finally (c) is a
subtle and important issue. In the entire monograph the emphasis
is on presenting the algorithms and proofs in the simplest way, and
thus for sake of convenience we assume that the relevant parameters
describing the regularity and curvature of the objective function
(Lipschitz constant, smoothness constant, strong convexity parameter)
are known and can be used to tune the algorithm’s own parameters.
Line search is a powerful technique to replace the knowledge of these
parameters and it is heavily used in practice, see again Nocedal and
Wright [2006]. We observe however that from a theoretical point of
view (c) is only a matter of logarithmic factors as one can always
run in parallel several copies of the algorithm with different guesses
for the values of the parameters¹. Overall the attitude of this text
with respect to (ii) is best summarized by a quote of Thomas Cover:
“theory is the first term in the Taylor series of practice”, Cover [1992].
Notation. We always denote by x∗ a point in X such that f(x∗) = min_{x∈X} f(x) (note that the optimization problem under consideration will always be clear from the context). In particular we always assume that x∗ exists. For a vector x ∈ R^n we denote by x(i) its ith coordinate. The dual of a norm ‖·‖ (defined later) will be denoted either ‖·‖_* or ‖·‖^* (depending on whether the norm already comes with a subscript). Other notation is standard (e.g., I_n for the n × n identity matrix, ⪰ for the positive semi-definite order on matrices, etc.).
¹ Note that this trick does not work in the context of Chapter 6.
f | Algorithm | Rate | # Iter | Cost/iter
non-smooth | center of gravity | exp(−t/n) | n log(1/ε) | 1 ∇, 1 n-dim ∫
non-smooth | ellipsoid method | (R/r) exp(−t/n²) | n² log(R/(rε)) | 1 ∇, mat-vec ×
non-smooth | Vaidya | (Rn/r) exp(−t/n) | n log(Rn/(rε)) | 1 ∇, mat-mat ×
quadratic | CG | exact; exp(−t/κ) | n; κ log(1/ε) | 1 ∇
non-smooth, Lipschitz | PGD | RL/√t | R²L²/ε² | 1 ∇, 1 proj.
smooth | PGD | βR²/t | βR²/ε | 1 ∇, 1 proj.
smooth | AGD | βR²/t² | R√(β/ε) | 1 ∇
smooth (any norm) | FW | βR²/t | βR²/ε | 1 ∇, 1 LP
strong. conv., Lipschitz | PGD | L²/(αt) | L²/(αε) | 1 ∇, 1 proj.
strong. conv., smooth | PGD | R² exp(−t/κ) | κ log(R²/ε) | 1 ∇, 1 proj.
strong. conv., smooth | AGD | R² exp(−t/√κ) | √κ log(R²/ε) | 1 ∇
f + g, f smooth, g simple | FISTA | βR²/t² | R√(β/ε) | 1 ∇ of f, prox of g
max_{y∈Y} ϕ(x, y), ϕ smooth | SP-MP | βR²/t | βR²/ε | MD on X, MD on Y
linear, X with F ν-self-conc. | IPM | ν exp(−t/√ν) | √ν log(ν/ε) | Newton step on F
non-smooth | SGD | BL/√t | B²L²/ε² | 1 stoch. ∇, 1 proj.
non-smooth, strong. conv. | SGD | B²/(αt) | B²/(αε) | 1 stoch. ∇, 1 proj.
f = (1/m) ∑ f_i, f_i smooth, strong. conv. | SVRG | – | (m + κ) log(1/ε) | 1 stoch. ∇

Table 1.1: Summary of the results proved in Chapter 2 to Chapter 5 and some of the results in Chapter 6.
References
A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums.
Arxiv preprint arXiv:1410.0723, 2014.
Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of
gradient and mirror descent. Arxiv preprint arXiv:1407.1537, 2014.
K. M. Anstreicher. Towards a practical volumetric cutting plane method for
convex programming. SIAM Journal on Optimization, 9(1):190–206, 1998.
J.-Y. Audibert, S. Bubeck, and R. Munos. Bandit view on noisy optimization.
In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine
Learning. MIT press, 2011.
J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial
optimization. Mathematics of Operations Research, 39:31–45, 2014.
F. Bach. Learning with submodular functions: A convex optimization per-
spective. Foundations and Trends® in Machine Learning, 6(2-3):145–373,
2013.
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approxi-
mation with convergence rate O(1/n). In Advances in Neural Information
Processing Systems (NIPS), 2013.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with
sparsity-inducing penalties. Foundations and Trends® in Machine Learn-
ing, 4(1):1–106, 2012.
B. Barak. Sum of squares upper bounds, lower bounds, and open questions.
Lecture Notes, 2014.
A. Beck and M. Teboulle. Mirror Descent and nonlinear projected subgradient
methods for convex optimization. Operations Research Letters, 31(3):167–
175, 2003.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for
linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202,
2009.
A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization:
analysis, algorithms, and engineering applications. Society for Industrial
and Applied Mathematics (SIAM), 2001.
D. Bertsimas and S. Vempala. Solving convex programs by random walks.
Journal of the ACM, 51:540–556, 2004.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University
Press, 2004.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed opti-
mization and statistical learning via the alternating direction method of
multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122,
2011.
S. Bubeck. Introduction to online optimization. Lecture Notes, 2011.
S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochas-
tic multi-armed bandit problems. Foundations and Trends® in Machine
Learning, 5(1):1–122, 2012.
S. Bubeck and R. Eldan. The entropic barrier: a simple and optimal universal
self-concordant barrier. Arxiv preprint arXiv:1412.1587, 2014.
S. Bubeck, R. Eldan, and J. Lehec. Sampling from a log-concave distribu-
tion with projected Langevin Monte Carlo. Arxiv preprint arXiv:1507.02564,
2015a.
S. Bubeck, Y.-T. Lee, and M. Singh. A geometric alternative to Nesterov's
accelerated gradient descent. Arxiv preprint arXiv:1506.08187, 2015b.
E. Candès and B. Recht. Exact matrix completion via convex optimization.
Foundations of Computational Mathematics, 9(6):717–772, 2009.
A. Cauchy. Méthode générale pour la résolution des systemes d’équations
simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge
University Press, 2006.
A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex
problems with applications to imaging. Journal of Mathematical Imaging
and Vision, 40(1):120–145, 2011.
K. Clarkson, E. Hazan, and D. Woodruff. Sublinear optimization for machine
learning. Journal of the ACM, 2012.
A. Conn, K. Scheinberg, and L. Vicente. Introduction to Derivative-Free Op-
timization. Society for Industrial and Applied Mathematics (SIAM), 2009.
T. M. Cover. 1990 Shannon lecture. IEEE Information Theory Society Newsletter, 42(4), 1992.
A. d’Aspremont. Smooth optimization with approximate gradient. SIAM
Journal on Optimization, 19(3):1171–1183, 2008.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient
method with support for non-strongly convex composite objectives. In
Advances in Neural Information Processing Systems (NIPS), 2014.
O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed on-
line prediction using mini-batches. Journal of Machine Learning Research,
13:165–202, 2012.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective
mirror descent. In Proceedings of the 23rd Annual Conference on Learning
Theory (COLT), 2010.
J. C. Dunn and S. Harshbarger. Conditional gradient algorithms with open
loop step size rules. Journal of Mathematical Analysis and Applications, 62
(2):432–444, 1978.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval
research logistics quarterly, 3(1-2):95–110, 1956.
M. P. Friedlander and P. Tseng. Exact regularization of convex programs.
SIAM Journal on Optimization, 18(4):1326–1350, 2007.
M. Goemans and D. Williamson. Improved approximation algorithms for
maximum cut and satisfiability problems using semidefinite programming.
Journal of the ACM, 42(6):1115–1145, 1995.
M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized ap-
proximation algorithm for matrix games. Operations Research Letters, 18:
53–58, 1995.
B. Grünbaum. Partitions of mass-distributions and of convex bodies by hy-
perplanes. Pacific J. Math, 10(4):1257–1261, 1960.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learn-
ing. Springer, 2001.
E. Hazan. The convex optimization approach to regret minimization. In
S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learn-
ing, pages 287–303. MIT press, 2011.
M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization.
In Proceedings of the 30th International Conference on Machine Learning
(ICML), pages 427–435, 2013.
P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using
alternating minimization. In Proceedings of the Forty-fifth Annual ACM
Symposium on Theory of Computing, STOC ’13, pages 665–674, 2013.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using pre-
dictive variance reduction. In Advances in Neural Information Processing
Systems (NIPS), 2013.
L. K. Jones. A simple lemma on greedy approximation in Hilbert space
and convergence rates for projection pursuit regression and neural network
training. Annals of Statistics, pages 608–613, 1992.
A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex
large-scale optimization, i: General purpose methods. In S. Sra, S. Nowozin,
and S. Wright, editors, Optimization for Machine Learning, pages 121–147.
MIT press, 2011a.
A. Juditsky and A. Nemirovski. First-order methods for nonsmooth con-
vex large-scale optimization, ii: Utilizing problem’s structure. In S. Sra,
S. Nowozin, and S. Wright, editors, Optimization for Machine Learning,
pages 149–183. MIT press, 2011b.
N. Karmarkar. A new polynomial-time algorithm for linear programming.
Combinatorica, 4:373–395, 1984.
S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining
an O(1/t) convergence rate for the projected stochastic subgradient method.
arXiv preprint arXiv:1212.2002, 2012.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with
an exponential convergence rate for strongly-convex optimization with fi-
nite training sets. In Advances in Neural Information Processing Systems
(NIPS), 2012.
Y.-T. Lee and A. Sidford. Path finding I: Solving linear programs with Õ(sqrt(rank)) linear system solves. Arxiv preprint arXiv:1312.6677, 2013.
Y.-T. Lee, A. Sidford, and S. C.-W Wong. A faster cutting plane
method and its implications for combinatorial and convex optimization.
Arxiv preprint arXiv:1508.04874, 2015.
A. Levin. On an algorithm for the minimization of convex functions. In Soviet
Mathematics Doklady, volume 160, pages 1244–1247, 1965.
L. Lovász. Hit-and-run mixes fast. Math. Prog., 86:443–461, 1998.
G. Lugosi. Comment on: ℓ1-penalization for mixture regression models. Test,
19(2):259–263, 2010.
N. Maculan and G. G. de Paula. A linear-time median-finding algorithm for
projecting a vector on the simplex of R^n. Operations Research Letters, 8(4):
219–222, 1989.
A. Nemirovski. Orth-method for smooth convex optimization. Izvestia AN
SSSR, Ser. Tekhnicheskaya Kibernetika, 2, 1982.
A. Nemirovski. Information-based complexity of convex programming. Lecture
Notes, 1995.
A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational
inequalities with Lipschitz continuous monotone operators and smooth
convex-concave saddle point problems. SIAM Journal on Optimization,
15(1):229–251, 2004a.
A. Nemirovski. Interior point polynomial time methods in convex program-
ming. Lecture Notes, 2004b.
A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in
Optimization. Wiley Interscience, 1983.
Y. Nesterov. A method of solving a convex programming problem with con-
vergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
Y. Nesterov. Quality of semidefinite relaxation for nonconvex quadratic op-
timization. CORE Discussion Papers 1997019, Université catholique de
Louvain, Center for Operations Research and Econometrics (CORE), 1997.
Y. Nesterov. Introductory lectures on convex optimization: A basic course.
Kluwer Academic Publishers, 2004a.
Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical
programming, 103(1):127–152, 2004b.
Y. Nesterov. Gradient methods for minimizing composite objective function.
Core discussion papers, Université catholique de Louvain, Center for Op-
erations Research and Econometrics (CORE), 2007.
Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimiza-
tion problems. SIAM Journal on Optimization, 22:341–362, 2012.
Y. Nesterov and A. Nemirovski. Interior-point polynomial algorithms in con-
vex programming. Society for Industrial and Applied Mathematics (SIAM),
1994.
D. Newman. Location of the maximum on unimodal surfaces. Journal of the
ACM, 12(3):395–398, 1965.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends® in
Optimization, 1(3):123–231, 2013.
A. Rakhlin. Lecture notes on online learning. 2009.
J. Renegar. A mathematical view of interior-point methods in convex opti-
mization, volume 3. SIAM, 2001.
P. Richtárik and M. Takác. Parallel coordinate descent methods for big data
optimization. Arxiv preprint arXiv:1212.0873, 2012.
H. Robbins and S. Monro. A stochastic approximation method. Annals of
Mathematical Statistics, 22:400–407, 1951.
R. Rockafellar. Convex Analysis. Princeton University Press, 1970.
M. Rudelson. Random vectors in the isotropic position. Journal of Functional
Analysis, 164:60–72, 1999.
M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-
gradient methods for convex optimization. In Advances in neural informa-
tion processing systems, pages 1458–1466, 2011.
B. Schölkopf and A. Smola. Learning with kernels. MIT Press, 2002.
S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From
Theory to Algorithms. Cambridge University Press, 2014.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods
for regularized loss minimization. Journal of Machine Learning Research,
14:567–599, 2013a.
S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual co-
ordinate ascent. In Advances in Neural Information Processing Systems
(NIPS), 2013b.
W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's
accelerated gradient method: Theory and insights. In Advances in Neural
Information Processing Systems (NIPS), 2014.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society. Series B (Methodological), 58(1):267–288,
1996.
P. Tseng. On accelerated proximal gradient methods for convex-concave op-
timization. 2008.
A. Tsybakov. Optimal rates of aggregation. In Conference on Learning Theory
(COLT), pages 303–313. 2003.
P. M. Vaidya. A new algorithm for minimizing convex functions over convex
sets. In 30th Annual Symposium on Foundations of Computer Science, pages 338–343, 1989.
P. M. Vaidya. A new algorithm for minimizing convex functions over convex
sets. Mathematical programming, 73(3):291–341, 1996.
S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction
by separable approximation. IEEE Transactions on Signal Processing, 57
(7):2479–2493, 2009.
L. Xiao. Dual averaging methods for regularized stochastic learning and online
optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regular-
ized empirical risk minimization. Arxiv preprint arXiv:1409.3257, 2014.