Convex Optimization:
Algorithms and Complexity
Sébastien Bubeck
Theory Group, Microsoft Research
[email protected]
Boston — Delft
Foundations and Trends® in Machine Learning
Published, sold and distributed by:
now Publishers Inc.
PO Box 1024
Hanover, MA 02339
United States
Tel. +1-781-985-4510
www.nowpublishers.com
[email protected]
Outside North America:
now Publishers Inc.
PO Box 179
2600 AD Delft
The Netherlands
Tel. +31-6-51115274
The preferred citation for this publication is
S. Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and
Trends® in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.
This Foundations and Trends® issue was typeset in LaTeX using a class file designed
by Neal Parikh. Printed on acid-free paper.
ISBN: 978-1-60198-861-4
© 2015 S. Bubeck
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, mechanical, photocopying, recording
or otherwise, without prior written permission of the publishers.
Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The 'services' for users can be found on the internet at: www.copyright.com
For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1 781 871 0245; www.nowpublishers.com;
[email protected]
now Publishers Inc. has an exclusive license to publish this material worldwide. Permission
to use this content must be obtained from the copyright license holder. Please apply to
now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com;
e-mail: [email protected]
Foundations and Trends® in Machine Learning
Volume 8, Issue 3-4, 2015
Editorial Board
Editor-in-Chief
Michael Jordan
University of California, Berkeley
United States
Editors
Peter Bartlett, UC Berkeley
Yoshua Bengio, University of Montreal
Avrim Blum, CMU
Craig Boutilier, University of Toronto
Stephen Boyd, Stanford University
Carla Brodley, Tufts University
Inderjit Dhillon, UT Austin
Jerome Friedman, Stanford University
Kenji Fukumizu, ISM, Japan
Zoubin Ghahramani, University of Cambridge
David Heckerman, Microsoft Research
Tom Heskes, Radboud University
Geoffrey Hinton, University of Toronto
Aapo Hyvarinen, HIIT, Finland
Leslie Pack Kaelbling, MIT
Michael Kearns, UPenn
Daphne Koller, Stanford University
John Lafferty, University of Chicago
Michael Littman, Brown University
Gabor Lugosi, Pompeu Fabra University
David Madigan, Columbia University
Pascal Massart, University of Paris-Sud
Andrew McCallum, UMass Amherst
Marina Meila, University of Washington
Andrew Moore, CMU
John Platt, Microsoft Research
Luc de Raedt, University of Freiburg
Christian Robert, U Paris-Dauphine
Sunita Sarawagi, IIT Bombay
Robert Schapire, Princeton University
Bernhard Schoelkopf, MPI Tübingen
Richard Sutton, University of Alberta
Larry Wasserman, CMU
Bin Yu, UC Berkeley
Editorial Scope
Topics
Foundations and Trends® in Machine Learning publishes survey and
tutorial articles on the theory, algorithms and applications of machine
learning, including the following topics:
• Adaptive control and signal processing
• Applications and case studies
• Behavioral, cognitive, and neural learning
• Bayesian learning
• Classification and prediction
• Clustering
• Data mining
• Dimensionality reduction
• Evaluation
• Game theoretic learning
• Graphical models
• Independent component analysis
• Inductive logic programming
• Kernel methods
• Markov chain Monte Carlo
• Model choice
• Nonparametric methods
• Online learning
• Optimization
• Reinforcement learning
• Relational learning
• Robustness
• Spectral methods
• Statistical learning theory
• Variational inference
• Visualization
Information for Librarians
Foundations and Trends® in Machine Learning, 2015, Volume 8, 6 issues.
ISSN paper version 1935-8237. ISSN online version 1935-8245. Also available
as a combined paper and online subscription.
Foundations and Trends® in Machine Learning
Vol. 8, No. 3-4 (2015) 231–357
© 2015 S. Bubeck
DOI: 10.1561/2200000050
Convex Optimization: Algorithms and
Complexity
Sébastien Bubeck
Theory Group, Microsoft Research
[email protected]
Contents
1 Introduction
  1.1 Some convex optimization problems in machine learning
  1.2 Basic properties of convexity
  1.3 Why convexity?
  1.4 Black-box model
  1.5 Structured optimization
  1.6 Overview of the results and disclaimer

2 Convex optimization in finite dimension
  2.1 The center of gravity method
  2.2 The ellipsoid method
  2.3 Vaidya’s cutting plane method
  2.4 Conjugate gradient

3 Dimension-free convex optimization
  3.1 Projected subgradient descent for Lipschitz functions
  3.2 Gradient descent for smooth functions
  3.3 Conditional gradient descent, aka Frank-Wolfe
  3.4 Strong convexity
  3.5 Lower bounds
  3.6 Geometric descent
  3.7 Nesterov’s accelerated gradient descent

4 Almost dimension-free convex optimization in non-Euclidean spaces
  4.1 Mirror maps
  4.2 Mirror descent
  4.3 Standard setups for mirror descent
  4.4 Lazy mirror descent, aka Nesterov’s dual averaging
  4.5 Mirror prox
  4.6 The vector field point of view on MD, DA, and MP

5 Beyond the black-box model
  5.1 Sum of a smooth and a simple non-smooth term
  5.2 Smooth saddle-point representation of a non-smooth function
  5.3 Interior point methods

6 Convex optimization and randomness
  6.1 Non-smooth stochastic optimization
  6.2 Smooth stochastic optimization and mini-batch SGD
  6.3 Sum of smooth and strongly convex functions
  6.4 Random coordinate descent
  6.5 Acceleration by randomization for saddle points
  6.6 Convex relaxation and randomized rounding
  6.7 Random walk based methods

Acknowledgements

References
Abstract
This monograph presents the main complexity theorems in convex optimization and their corresponding algorithms. Starting from the fundamental theory of black-box optimization, the material progresses towards recent advances in structural optimization and stochastic optimization. Our presentation of black-box optimization, strongly influenced by Nesterov’s seminal book and Nemirovski’s lecture notes, includes the analysis of cutting plane methods, as well as (accelerated) gradient descent schemes. We also pay special attention to non-Euclidean settings (relevant algorithms include Frank-Wolfe, mirror descent, and dual averaging) and discuss their relevance in machine learning. We provide a gentle introduction to structural optimization with FISTA (to optimize a sum of a smooth and a simple non-smooth term), saddle-point mirror prox (Nemirovski’s alternative to Nesterov’s smoothing), and a concise description of interior point methods. In stochastic optimization we discuss stochastic gradient descent, mini-batches, random coordinate descent, and sublinear algorithms. We also briefly touch upon convex relaxation of combinatorial problems and the use of randomness to round solutions, as well as random walk based methods.
S. Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015. DOI: 10.1561/2200000050.
1 Introduction
The central objects of our study are convex functions and convex sets in R^n.

Definition 1.1 (Convex sets and convex functions). A set X ⊂ R^n is said to be convex if it contains all of its segments, that is

    ∀(x, y, γ) ∈ X × X × [0, 1],  (1 − γ)x + γy ∈ X.

A function f : X → R is said to be convex if it always lies below its chords, that is

    ∀(x, y, γ) ∈ X × X × [0, 1],  f((1 − γ)x + γy) ≤ (1 − γ)f(x) + γf(y).
We are interested in algorithms that take as input a convex set X
and a convex function f and output an approximate minimum of f
over X . We write compactly the problem of finding the minimum of f
over X as
min. f (x)
s.t. x ∈ X .
In the following we will make more precise how the set of constraints X
and the objective function f are specified to the algorithm. Before that
we proceed to give a few important examples of convex optimization
problems in machine learning.
1.1 Some convex optimization problems in machine learning
Many fundamental convex optimization problems in machine learning
take the following form:
    min_{x ∈ R^n}  ∑_{i=1}^m f_i(x) + λ R(x),        (1.1)
where the functions f_1, . . . , f_m, R are convex and λ ≥ 0 is a fixed parameter. The interpretation is that f_i(x) represents the cost of using x on the ith element of some data set, and R(x) is a regularization term which enforces some "simplicity" in x. We discuss now major instances of (1.1). In all cases one has a data set of the form (w_i, y_i) ∈ R^n × Y, i = 1, . . . , m and the cost function f_i depends only on the pair (w_i, y_i). We
refer to Hastie et al. [2001], Schölkopf and Smola [2002], Shalev-Shwartz
and Ben-David [2014] for more details on the origin of these important
problems. The mere objective of this section is to expose the reader to a
few concrete convex optimization problems which are routinely solved.
In classification one has Y = {−1, 1}. Taking f_i(x) = max(0, 1 − y_i x^⊤ w_i) (the so-called hinge loss) and R(x) = ‖x‖_2^2 one obtains the SVM problem. On the other hand taking f_i(x) = log(1 + exp(−y_i x^⊤ w_i)) (the logistic loss) and again R(x) = ‖x‖_2^2 one obtains the (regularized) logistic regression problem.
In regression one has Y = R. Taking f_i(x) = (x^⊤ w_i − y_i)^2 and R(x) = 0 one obtains the vanilla least-squares problem, which can be rewritten in vector notation as

    min_{x ∈ R^n}  ‖Wx − Y‖_2^2,

where W ∈ R^{m×n} is the matrix with w_i^⊤ on the ith row and Y = (y_1, . . . , y_m)^⊤. With R(x) = ‖x‖_2^2 one obtains the ridge regression problem, while with R(x) = ‖x‖_1 this is the LASSO problem Tibshirani [1996].
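The least-squares and ridge problems are the rare cases with a closed-form solution; the following numpy sketch (illustrative data and names, assuming the ridge normal equations (W^⊤W + λI)x = W^⊤Y) computes both. The LASSO has no such closed form, which is one motivation for the first-order methods studied in the later chapters.

```python
import numpy as np

def ridge_solution(W, Y, lam):
    """Minimizer of ||W x - Y||_2^2 + lam * ||x||_2^2 via its normal equations."""
    n = W.shape[1]
    return np.linalg.solve(W.T @ W + lam * np.eye(n), W.T @ Y)

rng = np.random.default_rng(1)
W = rng.standard_normal((50, 10))
Y = rng.standard_normal(50)
x_ridge = ridge_solution(W, Y, lam=1.0)
x_ls = np.linalg.lstsq(W, Y, rcond=None)[0]   # vanilla least squares (R = 0)
print(np.linalg.norm(x_ls - x_ridge))
```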
Our last two examples are of a slightly different flavor. In particular
the design variable x is now best viewed as a matrix, and thus we
denote it by a capital letter X. The sparse inverse covariance estimation
problem can be written as follows, given some empirical covariance
matrix Y ,
    min.  Tr(XY) − logdet(X) + λ‖X‖_1
    s.t.  X ∈ R^{n×n}, X^⊤ = X, X ⪰ 0.

Intuitively the above problem is simply a regularized maximum likelihood estimator (under a Gaussian assumption).
Finally we introduce the convex version of the matrix completion problem. Here our data set consists of observations of some of the entries of an unknown matrix Y, and we want to "complete" the unobserved entries of Y in such a way that the resulting matrix is "simple" (in the sense that it has low rank). After some massaging (see Candès and Recht [2009]) the (convex) matrix completion problem can be formulated as follows:

    min.  Tr(X)
    s.t.  X ∈ R^{n×n}, X^⊤ = X, X ⪰ 0, X_{i,j} = Y_{i,j} for (i, j) ∈ Ω,

where Ω ⊂ [n]^2 and (Y_{i,j})_{(i,j)∈Ω} are given.
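As an aside, this formulation can be handed directly to an off-the-shelf modeling tool. The sketch below assumes the cvxpy package (an assumption; it is not mentioned in the text) and builds a small illustrative instance; the observed index set Omega and the matrix Y are made up.

```python
import numpy as np
import cvxpy as cp

n = 5
rng = np.random.default_rng(2)
G = rng.standard_normal((n, 2))
Y = G @ G.T                                        # low-rank PSD matrix to be completed
Omega = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (2, 2)]

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0] + [X[i, j] == Y[i, j] for (i, j) in Omega]
problem = cp.Problem(cp.Minimize(cp.trace(X)), constraints)
problem.solve()
print(problem.status, problem.value)
```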
1.2 Basic properties of convexity
A basic result about convex sets that we shall use extensively is the
Separation Theorem.
Theorem 1.1 (Separation Theorem). Let X ⊂ R^n be a closed convex set, and x_0 ∈ R^n \ X. Then, there exists w ∈ R^n and t ∈ R such that

    w^⊤ x_0 < t,  and  ∀x ∈ X, w^⊤ x ≥ t.

Note that if X is not closed then one can only guarantee that w^⊤ x_0 ≤ w^⊤ x, ∀x ∈ X (and w ≠ 0). This immediately implies the Supporting Hyperplane Theorem (∂X denotes the boundary of X, that is the closure without the interior):
Theorem 1.2 (Supporting Hyperplane Theorem). Let X ⊂ R^n be a convex set, and x_0 ∈ ∂X. Then, there exists w ∈ R^n, w ≠ 0 such that

    ∀x ∈ X,  w^⊤ x ≥ w^⊤ x_0.
We introduce now the key notion of subgradients.
Definition 1.2 (Subgradients). Let X ⊂ R^n, and f : X → R. Then g ∈ R^n is a subgradient of f at x ∈ X if for any y ∈ X one has

    f(x) − f(y) ≤ g^⊤ (x − y).

The set of subgradients of f at x is denoted ∂f(x).

To put it differently, for any x ∈ X and g ∈ ∂f(x), f is above the linear function y ↦ f(x) + g^⊤(y − x). The next result shows (essentially) that a convex function always admits subgradients.
Proposition 1.1 (Existence of subgradients). Let X ⊂ R^n be convex, and f : X → R. If ∀x ∈ X, ∂f(x) ≠ ∅ then f is convex. Conversely if f is convex then for any x ∈ int(X), ∂f(x) ≠ ∅. Furthermore if f is convex and differentiable at x then ∇f(x) ∈ ∂f(x).
Before going to the proof we recall the definition of the epigraph of
a function f : X → R:
epi(f ) = {(x, t) ∈ X × R : t ≥ f (x)}.
It is obvious that a function is convex if and only if its epigraph is a
convex set.
Proof. The first claim is almost trivial: let g ∈ ∂f((1 − γ)x + γy), then by definition one has

    f((1 − γ)x + γy) ≤ f(x) + γ g^⊤ (y − x),
    f((1 − γ)x + γy) ≤ f(y) + (1 − γ) g^⊤ (x − y),

which clearly shows that f is convex by adding the two (appropriately rescaled) inequalities.

Now let us prove that a convex function f has subgradients in the interior of X. We build a subgradient by using a supporting hyperplane to the epigraph of the function. Let x ∈ X. Then clearly (x, f(x)) ∈ ∂epi(f), and epi(f) is a convex set. Thus by using the Supporting Hyperplane Theorem, there exists (a, b) ∈ R^n × R such that

    a^⊤ x + b f(x) ≥ a^⊤ y + b t,  ∀(y, t) ∈ epi(f).        (1.2)
Clearly, by letting t tend to infinity, one can see that b ≤ 0. Now let us assume that x is in the interior of X. Then for ε > 0 small enough, y = x + εa ∈ X, which implies that b cannot be equal to 0 (recall that if b = 0 then necessarily a ≠ 0, which allows to conclude by contradiction). Thus rewriting (1.2) for t = f(y) one obtains

    f(x) − f(y) ≤ (1/|b|) a^⊤ (x − y).

Thus a/|b| ∈ ∂f(x), which concludes the proof of the second claim.
Finally let f be a convex and differentiable function. Then by definition:

    f(y) ≥ [f((1 − γ)x + γy) − (1 − γ)f(x)] / γ
         = f(x) + [f(x + γ(y − x)) − f(x)] / γ
         → f(x) + ∇f(x)^⊤ (y − x)   as γ → 0,

which shows that ∇f(x) ∈ ∂f(x).
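As a quick numerical illustration of Definition 1.2 (not part of the original text), the check below samples random pairs of points and verifies the subgradient inequality for f(x) = ‖x‖_1, whose subgradients include the coordinate-wise sign vector.

```python
import numpy as np

f = lambda x: np.abs(x).sum()        # f(x) = ||x||_1
rng = np.random.default_rng(3)

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    g = np.sign(x)                   # a subgradient of ||.||_1 at x
    assert f(x) - f(y) <= g @ (x - y) + 1e-12   # the inequality of Definition 1.2
print("subgradient inequality verified on all sampled pairs")
```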
In several cases of interest the set of constraints can have an empty interior, in which case the above proposition does not yield any information. However it is easy to replace int(X) by ri(X), the relative interior of X, which is defined as the interior of X when we view it as a subset of the affine subspace it generates. Other notions of convex analysis will prove to be useful in some parts of this text. In particular the notion of closed convex functions is convenient to exclude pathological cases: these are the convex functions with closed epigraphs. Sometimes it is also useful to consider the extension of a convex function f : X → R to a function from R^n to R by setting f(x) = +∞ for x ∉ X. In convex analysis one uses the term proper convex function to denote a convex function with values in R ∪ {+∞} such that there exists x ∈ R^n with f(x) < +∞. From now on all convex functions will be closed, and if necessary we consider also their proper extension. We refer the reader to Rockafellar [1970] for an extensive discussion of these notions.
1.3 Why convexity?
The key to the algorithmic success in minimizing convex functions is
that these functions exhibit a local to global phenomenon. We have
already seen one instance of this in Proposition 1.1, where we showed
that ∇f (x) ∈ ∂f (x): the gradient ∇f (x) contains a priori only local
information about the function f around x, while the subdifferential ∂f(x) gives global information in the form of a linear lower bound on the entire function. Another instance of this local to global phenomenon
is that local minima of convex functions are in fact global minima:
Proposition 1.2 (Local minima are global minima). Let f be convex. If x
is a local minimum of f then x is a global minimum of f . Furthermore
this happens if and only if 0 ∈ ∂f (x).
Proof. Clearly 0 ∈ ∂f (x) if and only if x is a global minimum of f .
Now assume that x is a local minimum of f. Then for γ small enough
one has for any y,
f (x) ≤ f ((1 − γ)x + γy) ≤ (1 − γ)f (x) + γf (y),
which implies f (x) ≤ f (y) and thus x is a global minimum of f .
The nice behavior of convex functions will allow for very fast algo-
rithms to optimize them. This alone would not be sufficient to justify
the importance of this class of functions (after all constant functions
are pretty easy to optimize). However it turns out that surprisingly
many optimization problems admit a convex (re)formulation. The ex-
cellent book Boyd and Vandenberghe [2004] describes in great detail
the various methods that one can employ to uncover the convex aspects
of an optimization problem. We will not repeat these arguments here,
but we have already seen that many famous machine learning problems
(SVM, ridge regression, logistic regression, LASSO, sparse covariance
estimation, and matrix completion) are formulated as convex problems.
We conclude this section with a simple extension of the optimality
condition “0 ∈ ∂f (x)” to the case of constrained optimization. We state
this result in the case of a differentiable function for the sake of simplicity.
Proposition 1.3 (First order optimality condition). Let f be convex and X a closed convex set on which f is differentiable. Then

    x∗ ∈ argmin_{x∈X} f(x),

if and only if one has

    ∇f(x∗)^⊤ (x∗ − y) ≤ 0,  ∀y ∈ X.

Proof. The "if" direction is trivial by using that a gradient is also a subgradient. For the "only if" direction it suffices to note that if ∇f(x)^⊤(y − x) < 0, then f is locally decreasing around x on the line to y (simply consider h(t) = f(x + t(y − x)) and note that h′(0) = ∇f(x)^⊤(y − x)).
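A small numerical illustration (illustrative data, not from the text): minimizing f(x) = ‖x − c‖_2^2 over the Euclidean unit ball with c outside the ball, the minimizer is x∗ = c/‖c‖_2, and the condition of Proposition 1.3 can be checked at random points of the ball.

```python
import numpy as np

c = np.array([3.0, -2.0, 1.0])
x_star = c / np.linalg.norm(c)               # projection of c onto the unit ball
grad = 2.0 * (x_star - c)                    # gradient of ||x - c||^2 at x_star
rng = np.random.default_rng(4)

for _ in range(1000):
    y = rng.standard_normal(3)
    y /= max(1.0, np.linalg.norm(y))         # an arbitrary point of the unit ball
    assert grad @ (x_star - y) <= 1e-12      # first order optimality condition
print("optimality condition verified on all sampled points")
```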
1.4 Black-box model
We now describe our first model of “input" for the objective function
and the set of constraints. In the black-box model we assume that
we have unlimited computational resources, the set of constraints X is
known, and the objective function f : X → R is unknown but can be
accessed through queries to oracles:
• A zeroth order oracle takes as input a point x ∈ X and outputs
the value of f at x.
• A first order oracle takes as input a point x ∈ X and outputs a
subgradient of f at x.
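To fix ideas, here is a minimal sketch of what such oracle access could look like in code (an illustration only; the hidden function, a maximum of affine functions, and all names are made up). An algorithm in the black-box model interacts with f exclusively through the two query methods.

```python
import numpy as np

class FirstOrderOracle:
    """Black-box access to the convex function f(x) = max_i (a_i^T x + b_i)."""

    def __init__(self, A, b):
        self._A, self._b = A, b
        self.calls = 0

    def value(self, x):                       # zeroth order oracle
        self.calls += 1
        return float(np.max(self._A @ x + self._b))

    def subgradient(self, x):                 # first order oracle
        self.calls += 1
        i = int(np.argmax(self._A @ x + self._b))
        return self._A[i]                     # gradient of the active affine piece

rng = np.random.default_rng(5)
oracle = FirstOrderOracle(rng.standard_normal((8, 3)), rng.standard_normal(8))
x = np.zeros(3)
print(oracle.value(x), oracle.subgradient(x), oracle.calls)
```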
In this context we are interested in understanding the oracle complexity
of convex optimization, that is how many queries to the oracles are
necessary and sufficient to find an ε-approximate minimum of a convex function. To show an upper bound on the oracle complexity we need to
propose an algorithm, while lower bounds are obtained by information
theoretic reasoning (we need to argue that if the number of queries is
“too small" then we don’t have enough information about the function
to identify an ε-approximate solution).
From a mathematical point of view, the strength of the black-box
model is that it will allow us to derive a complete theory of convex op-
timization, in the sense that we will obtain matching upper and lower
bounds on the oracle complexity for various subclasses of interesting
convex functions. While the model by itself does not limit our compu-
tational resources (for instance any operation on the constraint set X is
allowed) we will of course pay special attention to the algorithms’ com-
putational complexity (i.e., the number of elementary operations that
the algorithm needs to do). We will also be interested in the situation
where the set of constraints X is unknown and can only be accessed through a separation oracle: given x ∈ R^n, it outputs either that x is in X, or if x ∉ X then it outputs a separating hyperplane between x and X.
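For instance, a separation oracle for the Euclidean ball is straightforward to write down; the sketch below (illustrative, not from the text) returns a separating hyperplane w^⊤ z ≤ t whenever the query point lies outside the ball.

```python
import numpy as np

def ball_separation_oracle(x, radius=1.0):
    """Separation oracle for {z : ||z||_2 <= radius}: None if x is feasible,
    otherwise a pair (w, t) with w^T x > t and w^T z <= t for all z in the ball."""
    norm_x = np.linalg.norm(x)
    if norm_x <= radius:
        return None
    w = x / norm_x                            # outward unit normal
    t = radius                                # sup of w^T z over the ball
    return w, t

print(ball_separation_oracle(np.array([0.3, 0.2])))    # inside: None
print(ball_separation_oracle(np.array([3.0, 4.0])))    # outside: separating hyperplane
```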
The black-box model was essentially developed in the early days
of convex optimization (in the Seventies) with Nemirovski and Yudin
[1983] being still an important reference for this theory (see also Ne-
mirovski [1995]). In recent years this model and the corresponding
algorithms have regained a lot of popularity, essentially for two reasons:
• It is possible to develop algorithms with dimension-free oracle
complexity which is quite attractive for optimization problems in
very high dimension.
• Many algorithms developed in this model are robust to noise
in the output of the oracles. This is especially interesting for
stochastic optimization, and very relevant to machine learning
applications. We will explore this in detail in Chapter 6.
Chapter 2, Chapter 3 and Chapter 4 are dedicated to the study of
the black-box model (noisy oracles are discussed in Chapter 6). We do
not cover the setting where only a zeroth order oracle is available, also
called derivative free optimization, and we refer to Conn et al. [2009],
Audibert et al. [2011] for further references on this.
1.5 Structured optimization
The black-box model described in the previous section seems extremely
wasteful for the applications we discussed in Section 1.1. Consider for
instance the LASSO objective: x ↦ ‖Wx − y‖_2^2 + ‖x‖_1. We know this function globally, and assuming that we can only make local queries through oracles seems like an artificial constraint for the design of algorithms. Structured optimization tries to address this observation. Ultimately one would like to take into account the global structure of both f and X in order to propose the most efficient optimization procedure. An extremely powerful hammer for this task is the class of Interior Point Methods. We will describe this technique in Chapter 5 alongside other more recent techniques such as FISTA or Mirror Prox.
We briefly describe now two classes of optimization problems for
which we will be able to exploit the structure very efficiently, these
are the LPs (Linear Programs) and SDPs (Semi-Definite Programs).
Ben-Tal and Nemirovski [2001] describe a more general class of Conic
Programs but we will not go in that direction here.
The class LP consists of problems where f(x) = c^⊤ x for some c ∈ R^n, and X = {x ∈ R^n : Ax ≤ b} for some A ∈ R^{m×n} and b ∈ R^m.
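A toy LP in exactly this form can be solved with standard software; the sketch below assumes scipy's linprog routine, and the data c, A, b is made up purely to illustrate the format min c^⊤x subject to Ax ≤ b.

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])
A = np.array([[-1.0, 0.0],
              [0.0, -1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])                 # encodes x >= 0 and x_1 + x_2 <= 1

res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None), (None, None)])
print(res.x, res.fun)
```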
The class SDP consists of problems where the optimization variable is a symmetric matrix X ∈ R^{n×n}. Let S^n be the space of n × n symmetric matrices (respectively S^n_+ is the space of positive semi-definite matrices), and let ⟨·, ·⟩ be the Frobenius inner product (recall that it can be written as ⟨A, B⟩ = Tr(A^⊤ B)). In the class SDP the problems are of the following form: f(X) = ⟨X, C⟩ for some C ∈ R^{n×n}, and X = {X ∈ S^n_+ : ⟨X, A_i⟩ ≤ b_i, i ∈ {1, . . . , m}} for some A_1, . . . , A_m ∈ R^{n×n} and b ∈ R^m. Note that the matrix completion problem described in Section 1.1 is an example of an SDP.
1.6 Overview of the results and disclaimer
The overarching aim of this monograph is to present the main complexity theorems in convex optimization and the corresponding algorithms. We focus on five major results in convex optimization which give the overall structure of the text: the existence of efficient cutting-plane methods with optimal oracle complexity (Chapter 2), a complete characterization of the relation between first order oracle complexity and curvature in the objective function (Chapter 3), first order methods beyond Euclidean spaces (Chapter 4), the fact that non-black box methods (such as interior point methods) can give a quadratic improvement in the number of iterations with respect to optimal black-box methods (Chapter 5), and finally the noise robustness of first order methods (Chapter 6). Table 1.1 can be used as a quick reference to the results proved in Chapter 2 to Chapter 5, as well as some of the results of Chapter 6 (this last chapter is the most relevant to machine learning but the results are also slightly more specific, which makes them harder to summarize).
An important disclaimer is that the above selection leaves out meth-
ods derived from duality arguments, as well as the two most popular
research avenues in convex optimization: (i) using convex optimization
in non-convex settings, and (ii) practical large-scale algorithms. Entire
books have been written on these topics, and new books have yet to be
written on the impressive collection of new results obtained for both
(i) and (ii) in the past five years.
A few of the blatant omissions regarding (i) include (a) the theory of submodular optimization (see Bach [2013]), (b) convex relaxations of combinatorial problems (a short example is given in Section 6.6), and (c) methods inspired by convex optimization for non-convex problems such as low-rank matrix factorization (see e.g. Jain et al. [2013] and references therein), neural network optimization, etc.
With respect to (ii) the most glaring omissions include (a) heuris-
tics (the only heuristic briefly discussed here is the non-linear conjugate
gradient in Section 2.4), (b) methods for distributed systems, and (c)
adaptivity to unknown parameters. Regarding (a) we refer to Nocedal
and Wright [2006] where the most practical algorithms are discussed in
great detail (e.g., quasi-Newton methods such as BFGS and L-BFGS,
primal-dual interior point methods, etc.). The recent survey Boyd
et al. [2011] discusses the alternating direction method of multipliers
(ADMM) which is a popular method to address (b). Finally (c) is a
subtle and important issue. In the entire monograph the emphasis
is on presenting the algorithms and proofs in the simplest way, and
thus for sake of convenience we assume that the relevant parameters
describing the regularity and curvature of the objective function
(Lipschitz constant, smoothness constant, strong convexity parameter)
are known and can be used to tune the algorithm’s own parameters.
Line search is a powerful technique to replace the knowledge of these
parameters and it is heavily used in practice, see again Nocedal and
Wright [2006]. We observe however that from a theoretical point of
view (c) is only a matter of logarithmic factors as one can always
run in parallel several copies of the algorithm with different guesses
for the values of the parameters¹. Overall the attitude of this text
with respect to (ii) is best summarized by a quote of Thomas Cover:
“theory is the first term in the Taylor series of practice”, Cover [1992].
Notation. We always denote by x∗ a point in X such that f(x∗) = min_{x∈X} f(x) (note that the optimization problem under consideration will always be clear from the context). In particular we always assume that x∗ exists. For a vector x ∈ R^n we denote by x(i) its ith coordinate. The dual of a norm ‖·‖ (defined later) will be denoted either ‖·‖_* or ‖·‖^* (depending on whether the norm already comes with a subscript). Other notation is standard (e.g., I_n for the n × n identity matrix, ⪰ for the positive semi-definite order on matrices, etc.).
¹ Note that this trick does not work in the context of Chapter 6.
f | Algorithm | Rate | # Iter | Cost/iter
non-smooth | center of gravity | exp(−t/n) | n log(1/ε) | 1 ∇, 1 n-dim ∫
non-smooth | ellipsoid method | (R/r) exp(−t/n²) | n² log(R/(rε)) | 1 ∇, mat-vec ×
non-smooth | Vaidya | (Rn/r) exp(−t/n) | n log(Rn/(rε)) | 1 ∇, mat-mat ×
quadratic | CG | exact; exp(−t/κ) | n; κ log(1/ε) | 1 ∇
non-smooth, Lipschitz | PGD | RL/√t | R²L²/ε² | 1 ∇, 1 proj.
smooth | PGD | βR²/t | βR²/ε | 1 ∇, 1 proj.
smooth | AGD | βR²/t² | R√(β/ε) | 1 ∇
smooth (any norm) | FW | βR²/t | βR²/ε | 1 ∇, 1 LP
strong. conv., Lipschitz | PGD | L²/(αt) | L²/(αε) | 1 ∇, 1 proj.
strong. conv., smooth | PGD | R² exp(−t/κ) | κ log(R²/ε) | 1 ∇, 1 proj.
strong. conv., smooth | AGD | R² exp(−t/√κ) | √κ log(R²/ε) | 1 ∇
f + g, f smooth, g simple | FISTA | βR²/t² | R√(β/ε) | 1 ∇ of f, prox of g
max_{y∈Y} ϕ(x, y), ϕ smooth | SP-MP | βR²/t | βR²/ε | MD on X, MD on Y
linear, X with F ν-self-conc. | IPM | ν exp(−t/√ν) | √ν log(ν/ε) | Newton step on F
non-smooth | SGD | BL/√t | B²L²/ε² | 1 stoch. ∇, 1 proj.
non-smooth, strong. conv. | SGD | B²/(αt) | B²/(αε) | 1 stoch. ∇, 1 proj.
f = (1/m) ∑ f_i, f_i smooth, strong. conv. | SVRG | – | (m + κ) log(1/ε) | 1 stoch. ∇

Table 1.1: Summary of the results proved in Chapter 2 to Chapter 5 and some of the results in Chapter 6.
References
A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums.
Arxiv preprint arXiv:1410.0723, 2014.
Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of
gradient and mirror descent. Arxiv preprint arXiv:1407.1537, 2014.
K. M. Anstreicher. Towards a practical volumetric cutting plane method for
convex programming. SIAM Journal on Optimization, 9(1):190–206, 1998.
J.-Y. Audibert, S. Bubeck, and R. Munos. Bandit view on noisy optimization.
In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine
Learning. MIT press, 2011.
J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial
optimization. Mathematics of Operations Research, 39:31–45, 2014.
F. Bach. Learning with submodular functions: A convex optimization per-
spective. Foundations and Trends® in Machine Learning, 6(2-3):145–373,
2013.
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approxi-
mation with convergence rate O(1/n). In Advances in Neural Information
Processing Systems (NIPS), 2013.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with
sparsity-inducing penalties. Foundations and Trends® in Machine Learn-
ing, 4(1):1–106, 2012.
B. Barak. Sum of squares upper bounds, lower bounds, and open questions.
Lecture Notes, 2014.
A. Beck and M. Teboulle. Mirror Descent and nonlinear projected subgradient
methods for convex optimization. Operations Research Letters, 31(3):167–
175, 2003.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for
linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202,
2009.
A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization:
analysis, algorithms, and engineering applications. Society for Industrial
and Applied Mathematics (SIAM), 2001.
D. Bertsimas and S. Vempala. Solving convex programs by random walks.
Journal of the ACM, 51:540–556, 2004.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University
Press, 2004.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed opti-
mization and statistical learning via the alternating direction method of
multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122,
2011.
S. Bubeck. Introduction to online optimization. Lecture Notes, 2011.
S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochas-
tic multi-armed bandit problems. Foundations and Trends® in Machine
Learning, 5(1):1–122, 2012.
S. Bubeck and R. Eldan. The entropic barrier: a simple and optimal universal
self-concordant barrier. Arxiv preprint arXiv:1412.1587, 2014.
S. Bubeck, R. Eldan, and J. Lehec. Sampling from a log-concave distribu-
tion with projected Langevin Monte Carlo. Arxiv preprint arXiv:1507.02564,
2015a.
S. Bubeck, Y.-T. Lee, and M. Singh. A geometric alternative to Nesterov's
accelerated gradient descent. Arxiv preprint arXiv:1506.08187, 2015b.
E. Candès and B. Recht. Exact matrix completion via convex optimization.
Foundations of Computational Mathematics, 9(6):717–772, 2009.
A. Cauchy. Méthode générale pour la résolution des systemes d’équations
simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge
University Press, 2006.
A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex
problems with applications to imaging. Journal of Mathematical Imaging
and Vision, 40(1):120–145, 2011.
K. Clarkson, E. Hazan, and D. Woodruff. Sublinear optimization for machine
learning. Journal of the ACM, 2012.
A. Conn, K. Scheinberg, and L. Vicente. Introduction to Derivative-Free Op-
timization. Society for Industrial and Applied Mathematics (SIAM), 2009.
T. M. Cover. 1990 Shannon lecture. IEEE Information Theory Society Newsletter, 42(4), 1992.
A. d’Aspremont. Smooth optimization with approximate gradient. SIAM
Journal on Optimization, 19(3):1171–1183, 2008.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient
method with support for non-strongly convex composite objectives. In
Advances in Neural Information Processing Systems (NIPS), 2014.
O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed on-
line prediction using mini-batches. Journal of Machine Learning Research,
13:165–202, 2012.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective
mirror descent. In Proceedings of the 23rd Annual Conference on Learning
Theory (COLT), 2010.
J. C. Dunn and S. Harshbarger. Conditional gradient algorithms with open
loop step size rules. Journal of Mathematical Analysis and Applications, 62
(2):432–444, 1978.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval
research logistics quarterly, 3(1-2):95–110, 1956.
M. P. Friedlander and P. Tseng. Exact regularization of convex programs.
SIAM Journal on Optimization, 18(4):1326–1350, 2007.
M. Goemans and D. Williamson. Improved approximation algorithms for
maximum cut and satisfiability problems using semidefinite programming.
Journal of the ACM, 42(6):1115–1145, 1995.
M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized ap-
proximation algorithm for matrix games. Operations Research Letters, 18:
53–58, 1995.
B. Grünbaum. Partitions of mass-distributions and of convex bodies by hy-
perplanes. Pacific J. Math, 10(4):1257–1261, 1960.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learn-
ing. Springer, 2001.
E. Hazan. The convex optimization approach to regret minimization. In
S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learn-
ing, pages 287–303. MIT press, 2011.
M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization.
In Proceedings of the 30th International Conference on Machine Learning
(ICML), pages 427–435, 2013.
P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using
alternating minimization. In Proceedings of the Forty-fifth Annual ACM
Symposium on Theory of Computing, STOC ’13, pages 665–674, 2013.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using pre-
dictive variance reduction. In Advances in Neural Information Processing
Systems (NIPS), 2013.
L. K. Jones. A simple lemma on greedy approximation in Hilbert space
and convergence rates for projection pursuit regression and neural network
training. Annals of Statistics, pages 608–613, 1992.
A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex
large-scale optimization, i: General purpose methods. In S. Sra, S. Nowozin,
and S. Wright, editors, Optimization for Machine Learning, pages 121–147.
MIT press, 2011a.
A. Juditsky and A. Nemirovski. First-order methods for nonsmooth con-
vex large-scale optimization, ii: Utilizing problem’s structure. In S. Sra,
S. Nowozin, and S. Wright, editors, Optimization for Machine Learning,
pages 149–183. MIT press, 2011b.
N. Karmarkar. A new polynomial-time algorithm for linear programming.
Combinatorica, 4:373–395, 1984.
S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining
an O(1/t) convergence rate for the projected stochastic subgradient method.
arXiv preprint arXiv:1212.2002, 2012.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with
an exponential convergence rate for strongly-convex optimization with fi-
nite training sets. In Advances in Neural Information Processing Systems
(NIPS), 2012.
Y.-T. Lee and A. Sidford. Path finding I: Solving linear programs with Õ(sqrt(rank)) linear system solves. Arxiv preprint arXiv:1312.6677, 2013.
Y.-T. Lee, A. Sidford, and S. C.-W Wong. A faster cutting plane
method and its implications for combinatorial and convex optimization.
Arxiv preprint arXiv:1508.04874, 2015.
A. Levin. On an algorithm for the minimization of convex functions. In Soviet
Mathematics Doklady, volume 160, pages 1244–1247, 1965.
L. Lovász. Hit-and-run mixes fast. Math. Prog., 86:443–461, 1998.
G. Lugosi. Comment on: ℓ1-penalization for mixture regression models. Test,
19(2):259–263, 2010.
N. Maculan and G. G. de Paula. A linear-time median-finding algorithm for
projecting a vector on the simplex of R^n. Operations Research Letters, 8(4):
219–222, 1989.
A. Nemirovski. Orth-method for smooth convex optimization. Izvestia AN
SSSR, Ser. Tekhnicheskaya Kibernetika, 2, 1982.
A. Nemirovski. Information-based complexity of convex programming. Lecture
Notes, 1995.
A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational
inequalities with Lipschitz continuous monotone operators and smooth
convex-concave saddle point problems. SIAM Journal on Optimization,
15(1):229–251, 2004a.
A. Nemirovski. Interior point polynomial time methods in convex program-
ming. Lecture Notes, 2004b.
A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in
Optimization. Wiley Interscience, 1983.
Y. Nesterov. A method of solving a convex programming problem with con-
vergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
Y. Nesterov. Quality of semidefinite relaxation for nonconvex quadratic op-
timization. CORE Discussion Papers 1997019, Université catholique de
Louvain, Center for Operations Research and Econometrics (CORE), 1997.
Y. Nesterov. Introductory lectures on convex optimization: A basic course.
Kluwer Academic Publishers, 2004a.
Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical
programming, 103(1):127–152, 2004b.
Y. Nesterov. Gradient methods for minimizing composite objective function.
Core discussion papers, Université catholique de Louvain, Center for Op-
erations Research and Econometrics (CORE), 2007.
Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimiza-
tion problems. SIAM Journal on Optimization, 22:341–362, 2012.
Y. Nesterov and A. Nemirovski. Interior-point polynomial algorithms in con-
vex programming. Society for Industrial and Applied Mathematics (SIAM),
1994.
D. Newman. Location of the maximum on unimodal surfaces. Journal of the
ACM, 12(3):395–398, 1965.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends® in
Optimization, 1(3):123–231, 2013.
A. Rakhlin. Lecture notes on online learning. 2009.
J. Renegar. A mathematical view of interior-point methods in convex opti-
mization, volume 3. SIAM, 2001.
P. Richtárik and M. Takác. Parallel coordinate descent methods for big data
optimization. Arxiv preprint arXiv:1212.0873, 2012.
H. Robbins and S. Monro. A stochastic approximation method. Annals of
Mathematical Statistics, 22:400–407, 1951.
R. Rockafellar. Convex Analysis. Princeton University Press, 1970.
M. Rudelson. Random vectors in the isotropic position. Journal of Functional
Analysis, 164:60–72, 1999.
M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-
gradient methods for convex optimization. In Advances in neural informa-
tion processing systems, pages 1458–1466, 2011.
B. Schölkopf and A. Smola. Learning with kernels. MIT Press, 2002.
S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From
Theory to Algorithms. Cambridge University Press, 2014.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods
for regularized loss minimization. Journal of Machine Learning Research,
14:567–599, 2013a.
S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual co-
ordinate ascent. In Advances in Neural Information Processing Systems
(NIPS), 2013b.
W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's
accelerated gradient method: Theory and insights. In Advances in Neural
Information Processing Systems (NIPS), 2014.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society. Series B (Methodological), 58(1):267–288,
1996.
P. Tseng. On accelerated proximal gradient methods for convex-concave op-
timization. 2008.
A. Tsybakov. Optimal rates of aggregation. In Conference on Learning Theory
(COLT), pages 303–313. 2003.
P. M. Vaidya. A new algorithm for minimizing convex functions over convex
sets. In 30th Annual Symposium on Foundations of Computer Science, pages 338–343, 1989.
P. M. Vaidya. A new algorithm for minimizing convex functions over convex
sets. Mathematical programming, 73(3):291–341, 1996.
S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction
by separable approximation. IEEE Transactions on Signal Processing, 57
(7):2479–2493, 2009.
L. Xiao. Dual averaging methods for regularized stochastic learning and online
optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regular-
ized empirical risk minimization. Arxiv preprint arXiv:1409.3257, 2014.