0% found this document useful (0 votes)

15 views33 pages

Frank-Wolfe and Friends A Journey Into Projection

This document reviews the Frank-Wolfe method, a first-order optimization technique that has gained popularity due to its efficiency in handling structured constraints in data science applications. It discusses the method's historical background, its variants, and recent advancements that improve its speed and applicability across various optimization problems. The paper serves as an advanced tutorial rather than a comprehensive literature survey, highlighting the versatility and power of the Frank-Wolfe method in modern optimization contexts.

Uploaded by

techiesarpitkaurav01

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

15 views33 pages

Frank-Wolfe and Friends A Journey Into Projection

Uploaded by

techiesarpitkaurav01

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Annals of Operations Research (2024) 343:607–638

https://doi.org/10.1007/s10479-024-06251-7

ORIGINAL RESEARCH

Frank–Wolfe and friends: a journey into projection-free

first-order optimization methods

Immanuel. M. Bomze1 · Francesco Rinaldi2 · Damiano Zeﬃro2

Received: 9 April 2024 / Accepted: 30 July 2024 / Published online: 16 September 2024
© The Author(s) 2024

Abstract
Invented some 65 years ago in a seminal paper by Marguerite Straus-Frank and Philip Wolfe,
the Frank–Wolfe method recently enjoys a remarkable revival, fuelled by the need of fast
and reliable first-order optimization methods in Data Science and other relevant application
areas. This review tries to explain the success of this approach by illustrating versatility and
applicability in a wide range of contexts, combined with an account on recent progress in
variants, both improving on the speed and efficiency of this surprisingly simple principle of
first-order optimization.

Keywords First-order methods · Projection-free methods · Structured optimization ·

Conditional gradient · Sparse optimization

1 Introduction

In their seminal work (Frank & Wolfe, 1956), Marguerite Straus-Frank and Philip Wolfe
introduced a first-order algorithm for the minimization of convex quadratic objectives over
polytopes, now known as Frank–Wolfe (FW) method. The main idea of the method is simple:
to generate a sequence of feasible iterates by moving at every step towards a minimizer of a
linearized objective, the so-called FW vertex. Subsequent works, partly motivated by appli-
cations in optimal control theory (see Dunn (1979) for references), generalized the method to
smooth (possibly non-convex) optimization over closed subsets of Banach spaces admitting
a linear minimization oracle (see Demyanov and Rubinov (1970), Dunn and Harshbarger
(1978)).
Furthermore, while the O(1/k) rate in the original article was proved to be optimal when
the solution lies on the boundary of the feasible set (Canon & Cullum, 1968), improved rates
This is an updated version of the paper that appeared in 4OR, Vol. 19, pp. 313–345 (2021).

B Immanuel. M. Bomze
[email protected]
Francesco Rinaldi
[email protected]
Damiano Zeffiro
[email protected]
1 Faculty of Mathematics & ds:UniVie, Universität Wien, Wien, Austria
2 Dipartimento di Matematica “Tullio Levi-Civita”, Università di Padova, Padova, Italy

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
608 Annals of Operations Research (2024) 343:607–638

were given in a variety of different settings. In Levitin and Polyak (1966) and Demyanov
and Rubinov (1970), a linear convergence rate was proved over strongly convex domains
assuming a lower bound on the gradient norm, a result then extended in Dunn (1979) under
more general gradient inequalities. In Guelat and Marcotte (1986), linear convergence of the
method was proved for strongly convex objectives with the minimum obtained in the relative
interior of the feasible set.
The slow convergence behaviour for objectives with solution on the boundary motivated
the introduction of several variants, the most popular being Wolfe’s away step Wolfe 1970.
Wolfe’s idea was to move away from bad vertices, in case a step of the FW method moving
towards good vertices did not lead to sufficient improvement on the objective. This idea was
successfully applied in several network equilibrium problems, where linear minimization
can be achieved by solving a min-cost flow problem (see Fukushima (1984) and references
therein). In Guelat and Marcotte (1986), some ideas already sketched by Wolfe were for-
malized to prove linear convergence of the Wolfe’s away step method and identification of
the face containing the solution in finite time, under some suitable strict complementarity
assumptions.
In recent years, the FW method has regained popularity thanks to its ability to handle
the structured constraints appearing in machine learning and data science applications effi-
ciently. Examples include LASSO, SVM training, matrix completion, minimum enclosing
ball, density mixture estimation, cluster detection, to name just a few (see Sect. 3 for further
details).
One of the main features of the FW algorithm is its ability to naturally identify sparse and
structured (approximate) solutions. For instance, if the optimization domain is the simplex,
then after k steps the cardinality of the support of the last iterate generated by the method is at
most k +1. Most importantly, in this setting every vertex added to the support at every iteration
must be the best possible in some sense, a property that connects the method with many greedy
optimization schemes (Clarkson, 2010). This makes the FW method pretty efficient on the
abovementioned problem class. Indeed, the combination of structured solutions with often
noisy data makes the sparse approximations found by the method possibly more desirable
than high precision solutions generated by a faster converging approach. In some cases, like
in cluster detection (see, e.g., Bomze (1997)), finding the support of the solution is actually
enough to solve the problem independently from the precision achieved.
Another important feature is that the linear minimization used in the method is often
cheaper than the projections required by projected-gradient methods. It is important to notice
that, even when these two operations have the same complexity, constants defining the related
bounds can differ significantly (see Combettes and Pokutta (2021) for some examples and
tests). When dealing with large scale problems, the FW method hence has a much smaller
per-iteration cost with respect to projected-gradient methods. For this reason, FW methods
fall into the category of projection-free methods (Lan, 2020). Furthermore, the method can
be used to approximately solve quadratic subproblems in accelerated schemes, an approach
usually referred to as conditional gradient sliding (see, e.g., Lan and Zhou (2016)).

1.1 Organisation of the paper

The present review, which extends the work in Bomze et al. (2021), is not intended to provide
an exhaustive literature survey, but rather as an advanced tutorial demonstrating versatility
and power of this approach. The article is structured as follows: in Sect. 2, we introduce the
classic FW method, together with a general scheme for all the methods we consider. In Sect. 3,

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 609

we present applications from classic optimization to more recent machine learning problems.
In Sect. 4, we review some important stepsizes for first order methods. In Sect. 5, we discuss
the main theoretical results about the FW method and the most popular variants, including the
O(1/k) convergence rate for convex objectives, affine invariance, the sparse approximation
property, and support identification. In Sect. 6 we illustrate some recent improvements on the
O(1/k) convergence rate.
In Sect. 7, we describe a generalization of the classic FW to the composite non-smooth
optimization setting, and in particular its correspondence with mirror descent via Fenchel
duality. In Sect. 8 we present recent FW variants fitting different optimization frameworks,
in particular block coordinate, distributed, accelerated, and trace norm optimization. We
highlight that all the proofs reported in the paper are either seminal, or simplified versions of
proofs reported in published papers, and we believe they might give some useful technical
insights to the interested reader.

1.2 Notation

For any integers a and b, denote by [a : b] = {x integer : a ≤ x ≤ b} the integer range

between them. For a set V , the power set V
V 2 denotes the system all subsets of V , whereas
for any positive integer s ∈ N we set s := {S ∈ 2V : |S| = s}, with |S| denoting the
number of elements in S. Matrices are denoted by capital sans-serif letters (e.g., the zero
matrix O, or the n × n identity matrix In with columns
ei the length of which should be clear
from the context). The all-ones vector is e := i ei ∈ Rn . Generally, vectors are always

√ sans-serif letters x, and their transpose by x . The Euclidean norm of
denoted by boldface

x is then x := x x whereas the general p-norm is denoted by x p for any p ≥ 1 (so
x2 = x). By contrast, the so-called zero-norm simply counts the number of nonzero
entries:
x0 := |{i ∈ [1 : n] : xi = 0}| .
For a vector d we denote as d := d
1
d its normalization, with the convention
d = o if
d = o. Here o denotes the zero vector. In context of symmetric matrices, “psd” abbreviates
“positive-semidefinite”.

2 Problem and general scheme

We consider the following problem:

min f (x) (1)
x∈C

where, unless specified otherwise, C is a convex and compact (i.e. bounded and closed)
subset of Rn and f is a differentiable function having Lipschitz continuous gradient with
constant L > 0:
∇ f (x) − ∇ f (y) ≤ Lx − y for all {x, y} ⊂ C .
This is a central property required in the analysis of first-order methods. Such a property
indeed implies (and for a convex function is equivalent to) the so-called Descent Lemma (see,
e.g., Bertsekas (2015), Proposition 6.1.2), which provides a quadratic upper approximation
to the function f . Throughout the article, we denote by x∗ a (global) solution to (1) and use
the symbol f ∗ := f (x∗ ) as a shorthand for the corresponding optimal value.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
610 Annals of Operations Research (2024) 343:607–638

The general scheme of the first-order methods we consider for problem (1), reported in
Algorithm 1, is based upon a set F(x, g) of directions feasible at x using first-order local
information on f around x, in the smooth case g = ∇ f (x). From this set, a particular
d ∈ F(x, g) is selected, with the maximal stepsize α max possibly dependent from auxiliary
information available to the method (at iteration k, we thus write αkmax ), and not always equal
to the maximal feasible stepsize.

Algorithm 1 First-order method

1 Choose a point x0 ∈
2 For k = 0, . . .
3 If xk satisfies some specific condition, then STOP
4 Choose dk ∈ F(xk , ∇ f (xk ))
5 Set xk+1 = xk + αk dk , with αk ∈ (0, αkmax ] a suitably chosen stepsize
6 End for

2.1 The classical Frank–Wolfe method

The classical FW method for minimization of a smooth objective f generates a sequence of

feasible points {xk } following the scheme of Algorithm 2. At the iteration k it moves toward
a vertex i.e., an extreme point, of the feasible set minimizing the scalar product with the
current gradient ∇ f (xk ). It therefore makes use of a linear minimization oracle (LMO) for
the feasible set C
LMOC (g) ∈ arg min g z , (2)
z∈C

defining the descent direction as

dk = dkF W := sk − xk , sk ∈ LMOC (∇ f (xk )) . (3)
In particular, the update at step 6 can be written as
xk+1 = xk + αk (sk − xk ) = αk sk + (1 − αk )xk (4)
Since αk ∈ [0, 1], by induction xk+1 can be written as a convex combination of elements
in the set Sk+1 := {x0 } ∪ {si }0≤i≤k . When C = conv(A) for a set A of points with some
common property, usually called "elementary atoms", if x 0 ∈ A then xk can be written as a
convex combination of k + 1 elements in A. Note that due to Caratheodory’s theorem, we
can even limit the number of occurring atoms to min{k, n} + 1. In the rest of the paper the
primal gap at iteration k is defined as h k = f (xk ) − f ∗ .

3 Examples

FW methods and variants are a natural choice for constrained optimization on convex sets
admitting a linear minimization oracle significantly faster than computing a projection. We
present here in particular the traffic assignment problem, submodular optimization, LASSO
problem, matrix completion, adversarial attacks, minimum enclosing ball, SVM training,
maximal clique search in graphs, sparse optimization.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 611

Algorithm 2 Frank–Wolfe method

1 Choose a point x0 ∈
2 For k = 0, . . .
3 If xk satisfies some specific condition, then STOP
4 Compute sk ∈ LMOC (∇ f (xk ))
5 Set dkF W = sk − xk
6 Set xk+1 = xk + αk dkF W , with αk ∈ (0, 1] a suitably chosen stepsize
7 End for

3.1 Traffic assignment

Finding a traffic pattern satisfying the equilibrium conditions in a transportation network is

a classic problem in optimization that dates back to Wardrop’s paper (Wardrop, 1952). Let
G be a network with set of nodes [1 : n]. Let {D(i, j)}i= j be demand coefficients, modeling
the amount of goods with destination j and origin i. For any i, j with i = j let furthermore
f i j : R → R be the non-linear (convex) cost functions, and xisj be the flow on link (i, j) with
destination s. The traffic assignment problem can be modeled as the following non-linear
multicommodity network problem (Fukushima, 1984):
⎧ ⎫
⎨ ⎬
min fi j xisj : s
xi − x sj = D(, s) , all = s, xisj ≥ 0 . (5)
⎩ ⎭
i, j s i j

Then the linearized optimization subproblem necessary to compute the FW vertex takes the
form
⎧ ⎫
⎨ ⎬
min ci j xisj : s
xi − x sj = D(, s), = s, xisj ≥ 0 (6)
⎩ ⎭
s i, j i j

and can be split in n shortest paths subproblems, each of the form

⎧ ⎫
⎨ ⎬
min ci j xisj : s
xi − x sj = D(, s), = s, xisj ≥ 0 (7)
⎩ ⎭
i, j i j

for a fixed s ∈ [1 : n], with ci j the first-order derivative of f i j (see Fukushima (1984) for
further details). A number of FW variants were proposed in the literature for efficiently
handling this kind of problems (see, e.g., Bertsekas (2015), Fukushima (1984), LeBlanc et
al. (1975), Mitradjieva and Lindberg (2013), Weintraub et al. (1985) and references therein
for further details). Some of those variants represent a good (if not the best) choice when low
or medium precision is required in the solution of the problem (Perederieieva et al., 2015).
In the more recent work (Joulin et al., 2014) a FW variant also solving a shortest path
subproblem at each iteration was applied to image and video co-localization.

3.2 Submodular optimization

Given a finite set V , a function r : 2V → R is said to be submodular if for every A, B ⊂ V

r (A) + r (B) ≥ r (A ∪ B) + r (A ∩ B) . (8)

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
612 Annals of Operations Research (2024) 343:607–638

As is common practice in the optimization literature (see e.g. Bach (2013), Section 2.1),
here we always assume s(∅) = 0. A number of machine learning problems, including image
segmentation and sensor placement, can be cast as minimization of a submodular function
(see, e.g., Bach (2013), Chakrabarty et al. (2014) and references therein for further details):

min r (A) . (9)

A⊆V

Submodular optimization can also be seen as a more general way to relate combinatorial
problems to convexity, for example for structured sparsity (Bach, 2013; Jaggi, 2013). By a
theorem from Fujishige (1980), problem (9) can be in turn reduced to an minimum norm
point problem over the base polytope

B(G) = {s ∈ RV : sa ≤ r (A) for all A ⊆ V , sa = r (V )} . (10)

a∈A a∈V

For this polytope, linear optimization can be achieved with a simple greedy algorithm. More
precisely, consider the LP

max w s .
s∈B(F)

Then if the objective vecor w has a negative component, the problem is clearly unbounded.
Otherwise, a solution to the LP can be obtained by ordering w in decreasing manner as
w j1 ≥ w j2 ≥ ... ≥ w jn , and setting

s jk := r ({ j1 , ..., jk }) − r ({ j1 , ..., jk−1 }) , (11)

for k ∈ [1 : n]. We thus have a LMO with a O(n log n) cost. This is the reason why FW
variants are widely used in the context of submodular optimization; further details can be
found in, e.g., (Bach 2013; Jaggi 2013).

3.3 LASSO problem

The LASSO, proposed by Tibshirani (1996), is a popular tool for sparse linear regression.
Given the training set

T = {(ri , bi ) ∈ Rn × R : i ∈ [1 : m]} ,

where ri are the rows of an m × n matrix A, the goal is finding a sparse linear model (i.e.,
a model with a small number of non-zero parameters) describing the data. This problem is
strictly connected with the Basis Pursuit Denoising (BPD) problem in signal analysis (see,
e.g., Chen et al. (2001)). In this case, given a discrete-time input signal b, and a dictionary

{a j ∈ Rm j ∈ [1 : n]}

of elementary discrete-time signals, usually called atoms (here a j are the columns of a matrix
A), the goal is finding a sparse linear combination of the atoms that approximate the real signal.
From a purely formal point of view, LASSO and BPD problems are equivalent, and both can
be formulated as follows:
min f (x) := Ax − b22
x∈Rn (12)
s.t. x1 ≤ τ ,

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 613

where the parameter τ controls the amount of shrinkage that is applied to the model (related
to sparsity, i.e., the number of nonzero components in x). The feasible set is
C = {x ∈ Rn : x1 ≤ τ } = conv{±τ ei : i ∈ [1 : n]} .
Thus we have the following LMO in this case:
LMOC (∇ f (xk )) = sign(−∇ik f (xk )) · τ eik ,
with i k ∈ arg max |∇i f (xk )|. It is easy to see that the FW per-iteration cost is then O(n). The
i
peculiar structure of the problem makes FW variants well-suited for its solution. This is the
reason why LASSO/BPD problems were considered in a number of FW-related papers (see,
e.g., Jaggi (2011), Jaggi (2013), Lacoste-Julien and Jaggi (2015), Locatello et al. (2017)).

3.4 Matrix completion

Matrix completion is a widely studied problem that comes up in many areas of science and
engineering, including collaborative filtering, machine learning, control, remote sensing,
and computer vision (just to name a few; see also Candès and Recht (2009) and references
therein). The goal is to retrieve a low rank matrix X ∈ Rn 1 ×n 2 from a sparse set of observed
matrix entries {Ui j }(i, j)∈J with J ⊂ [1 : n 1 ] × [1 : n 2 ]. Thus the problem can be formulated
as follows Freund et al. (2017):
min f (X) := (X i j − Ui j )2
X∈Rn 1 ×n 2 (13)
(i, j)∈J
s.t. rank(X) ≤ δ,
where the function f is given by the squared loss over the observed entries of the matrix
and δ > 0 is a parameter representing the assumed belief about the rank of the reconstructed
matrix we want to get in the end. In practice, the low rank constraint is relaxed with a nuclear
norm ball constraint, where we recall that the nuclear norm X∗ of a matrix X is equal the
sum of its singular values. Thus we get the following convex optimization problem:
min (X i j − Ui j )2
X∈Rn 1 ×n 2 (14)
(i, j)∈J
s.t. X∗ ≤ δ .
The feasible set is the convex hull of rank-one matrices:
C = {X ∈ Rn 1 ×n 2 : X∗ ≤ δ}
= conv{δuv : u ∈ Rn 1 , v ∈ Rn 2 , u = v = 1} .
If we indicate with A J the matrix that coincides with A on the indices J and is zero otherwise,
then we can write ∇ f (X) = 2 (X − U) J . Thus we have the following LMO in this case:
LMOC (∇ f (Xk )) ∈ arg min{tr(∇ f (Xk ) X) : X∗ ≤ δ} , (15)

which boils down to computing the gradient, and the rank-one matrix δu1 v1 , with u1 , v1 right
and left singular vectors corresponding to the top singular value of −∇ f (Xk ). Consequently,
the FW method at a given iteration approximately reconstructs the target matrix as a sparse
combination of rank-1 matrices. Furthermore, as the gradient matrix is sparse (it only has |J |
non-zero entries) storage and approximate singular vector computations can be performed
much more efficiently than for dense matrices.1 A number of FW variants has hence been
1 Details related to the LMO cost can be found in, e.g., Jaggi (2013).

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
614 Annals of Operations Research (2024) 343:607–638

proposed in the literature for solving this problem (see, e.g., Freund et al. (2017), Jaggi
(2011), Jaggi (2013)).

3.5 Adversarial attacks in machine learning

Adversarial examples are maliciously perturbed inputs designed to mislead a properly trained
learning machine at test time. An adversarial attack hence consists in taking a correctly
classified data point x0 and slightly modifying it to create a new data point that leads the
considered model to misclassification (see, e.g., Carlini and Wagner (2017), Chen et al.
(2017), Goodfellow et al. (2014) for further details). A possible formulation of the problem
(see, e.g., Chen et al. (2020), Goodfellow et al. (2014)) is given by the so called maximum
allowable p -norm attack that is,
min f (x0 + x)
x∈Rn (16)
s.t. x p ≤ ε ,
where f is a suitably chosen attack loss function, x0 is a correctly classified data point, x
represents the additive noise/perturbation, ε > 0 denotes the magnitude of the attack, and
p ≥ 1. It is easy to see that the LMO has a cost O(n). If x0 is a feature vector of a dog
image correctly classified by our learning machine, our adversarial attack hence suitably
perturbs the feature vector (using the noise vector x), thus getting a new feature vector x0 + x
classified, e.g., as a cat. In case a target adversarial class is specified by the attacker, we have
a targeted attack. In some scenarios, the goal may not be to push x0 to a specific target class,
but rather push it away from its original class. In this case we have a so called untargeted
attack. The attack function f will hence be chosen depending on the kind of attack we aim to
perform over the considered model. Due to its specific structure, problem (16) can be nicely
handled by means of tailored FW variants. Some FW frameworks for adversarial attacks
were recently described in, e.g., (Chen et al. 2020; Kazemi et al. 2021; Sahu and Kar 2020).

3.6 Minimum enclosing ball

Given a set of points P = {p1 , . . . , pn } ⊂ Rd , the minimum enclosing ball problem (MEB,
see, e.g., Clarkson 2010; Yıldırım 2008) consists in finding the smallest ball containing
P. Such a problem models numerous important applications in clustering, nearest neigh-
bor search, data classification, machine learning, facility location, collision detection, and
computer graphics, to name just a few. We refer the reader to Kumar et al. (2003) and the
√
references therein for further details. Denoting by c ∈ Rd the center and by γ (with γ ≥ 0)
the radius of the ball, a convex quadratic formulation for this problem is
min γ (17)
(c,γ )∈Rd ×R

s.t. pi − c2 ≤ γ , all i ∈ [1 : n] . (18)

This problem can be formulated via Lagrangian duality as a convex Standard Quadratic
Optimization Problem (StQP, see, e.g. Bomze and de Klerk (2002))

min x A Ax − b x : x ∈ n−1 (19)

with A = [p1 , ..., pn ] and b = [p1 p1 , . . . , pn pn ]. The feasible set is the standard simplex
n−1 := {x ∈ Rn+ : e x = 1} = conv{ei : i ∈ [1 : n]} ,

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 615

and the LMO is defined as follows:

LMO n−1 (∇ f (xk )) = eik ,

with i k ∈ arg mini ∇i f (xk ). It is easy to see that cost per iteration is O(n). When applied to
(19), the FW method can find an ε-cluster in O( 1ε ), where an ε-cluster is a subset P of P
such that the MEB of P dilated by 1 + ε contains P (Clarkson, 2010). The set P is given by
the atoms in P selected by the LMO in the first O( 1ε ) iterations. Further details related to the
connections between FW methods and MEB problems can be found in, e.g., (Ahipaşaoğlu
et al., 2008; Ahipaşaoğlu & Todd, 2013; Clarkson, 2010) and references therein.

3.7 Training linear Support Vector Machines

Support Vector Machines (SVMs) represent a very important class of machine learning tools
(see, e.g., Vapnik (2013) for further details). Given a labeled set of data points, usually called
training set:

T S = {(pi , yi ), pi ∈ Rd , yi ∈ {−1, 1}, i = 1, . . . , n},

the linear SVM training problem consists in finding a linear classifier w ∈ Rd such that
the label yi can be deduced with the "highest possible confidence" from w pi . A convex
quadratic formulation for this problem is the following (Clarkson, 2010):
w2
min ρ+ 2
w∈Rd ,ρ∈R (20)
s.t. ρ + yi w pi ≥ 0 , all i ∈ [1 : n] ,

where the slack variable ρ stands for the negative margin and we can have ρ < 0 if and only
if there exists an exact linear classifier, i.e. w such that w pi = sign(yi ). The dual of (20) is
again an StQP:

min x A Ax : x ∈ n−1 (21)

with A = [y1 p1 , ..., yn pn ]. Notice that problem (21) is equivalent to an MNP problem on
conv{yi pi : i ∈ [1 : n]}, see Sect. 8.3 below. Some FW variants (like, e.g., the Pairwise Frank–
Wolfe) are closely related to classical working set algorithms, such as the SMO algorithm
used to train SVMs (Lacoste-Julien & Jaggi, 2015). Further details on FW methods for SVM
training problems can be found in, e.g., Clarkson (2010), Jaggi (2011).

3.8 Finding maximal cliques in graphs

In the context of network analysis the clique model, dating back at least to the work of Luce
and Perry (1949) about social networks, refers to subsets with every two elements in a direct
relationship. The problem of finding maximal cliques has numerous applications in domains
including telecommunication networks, biochemistry, financial networks, and scheduling
(see, e.g., Bomze et al. (1999), Wu and Hao (2015)). Let G = (V , E) be a simple undirected
graph with V and E set of vertices and edges, respectively. A clique in G is a subset C ⊆ V
such that (i, j) ∈ E for each (i, j) ∈ C, with i = j. The goal in finding a clique C such that
|C| is maximal (i.e., it is not contained in any strictly larger clique). This corresponds to find
a local minimum for the following equivalent (this time non-convex) StQP (see, e.g., Bomze

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
616 Annals of Operations Research (2024) 343:607–638

(1997), Bomze et al. (1999), Hungerford and Rinaldi (2019) for further details):

1
max x AG x + x2 : x ∈ n−1 (22)
2
where AG is the adjacency matrix of G. Due to the peculiar structure of the problem, FW
methods can be fruitfully used to find maximal cliques (see, e.g., Hungerford and Rinaldi
(2019)).

3.9 Finding sparse points in a set

Given a non-empty polyhedron P ⊂ Rn , the goal is finding a sparse point x ∈ P (i.e., a point
with as many zero components as possible). This sparse optimization problem can be used
to model a number of real-world applications in fields like, e.g., machine learning, pattern
recognition and signal processing (see Rinaldi et al. (2010) and references therein). Ideally,
what we would like to get is an optimal solution for the following problem:

min {x0 : x ∈ P} . (23)

Since the zero norm is non-smooth, a standard procedure is to replace the original formulation
(23) with an equivalent concave optimization problem of the form:
n
min φ(yi ) : x ∈ P, −y ≤ x ≤ y , (24)
i=1

where φ : [0 , +∞ [ → R is a suitably chosen smooth concave univariate function bounded

from below, like, e.g.,

φ(t) = (1 − e−αt ) ,

with α a large enough positive parameter (see, e.g., Mangasarian (1996), Rinaldi et al. (2010)
for further details). The LMO in this case gives a vertex solution for the linear programming
problem:

min ck y : x ∈ P, −y ≤ x ≤ y ,

with (ck )i the first-order derivative of φ calculated in (yk )i . Variants of the unit-stepsize FW
method have been proposed in the literature (see, e.g., Mangasarian (1996), Rinaldi et al.
(2010)) to tackle the smooth equivalent formulation (24).

4 Stepsizes

Popular rules for determining the stepsize are:

– unit stepsize:

αk = 1,

mainly used when the problem has a concave objective function. Finite convergence can
be proved, under suitable assumptions, both for the unit-stepsize FW and some of its
variants described in the literature (see, e.g., Rinaldi et al. (2010) for further details).

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 617

– diminishing stepsize:
2
αk = , (25)
k+2
mainly used for the classic FW (see, e.g., Freund and Grigas (2016), Jaggi (2013)).
– exact line search:

αk = min arg min ϕ(α) with ϕ(α) := f (xk + α dk ) , (26)

α∈[0,αkmax ]

where we pick the smallest minimizer of the function ϕ for the sake of being well-defined
even in rare cases of ties (see, e.g., Bomze et al. (2020), Lacoste-Julien and Jaggi (2015)).
– Armijo line search: the method iteratively shrinks the step size in order to guarantee a
sufficient reduction of the objective function. It represents a good way to replace exact
line search in cases when it gets too costly. In practice, we fix parameters δ ∈ (0, 1)
and γ ∈ (0, 21 ), then try steps α = δ m αkmax with m ∈ {0, 1, 2, . . . } until the sufficient
decrease inequality

f (xk + α dk ) ≤ f (xk ) + γ α ∇ f (xk ) dk (27)

holds, and set αk = α (see, e.g., Bomze et al. (2019) and references therein).
– Lipschitz constant dependent step size:

∇ f (xk ) dk max
αk = αk (L) := min − , α , (28)
Ldk 2 k

with L the Lipschitz constant of ∇ f (see, e.g., Bomze et al. (2020), Pedregosa et al.
(2020)).

The Lipschitz constant dependent step size can be seen as the minimizer of the quadratic
model m k (·; L) overestimating f along the line xk + α dk :

Lα 2
m k (α; L) = f (xk ) + α ∇ f (xk ) dk + dk 2 ≥ f (xk + α dk ) , (29)
2
where the inequality follows by the standard Descent Lemma.
In case L is unknown, it is even possible to approximate L using a backtracking line search
(see, e.g., Kerdreux et al. (2021), Pedregosa et al. (2020)).
We now report a lower bound for the improvement on the objective obtained with the
stepsize (28), often used in the convergence analysis.

Lemma 1 If αk is given by (28) and αk < αkmax then

1
f (xk+1 ) ≤ f (xk ) − (∇ f (xk )
dk )2 . (30)
2L
Proof We have
Lαk2
f (xk + αk dk ) ≤ f (xk ) + αk ∇ f (xk ) dk + 2 dk 2
(∇ f (xk ) dk )2
(31)
= f (xk ) − 2Ldk 2
= f (xk ) − 2L (∇
1
f (xk ) dk )2 ,

where we used the standard Descent Lemma in the inequality.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
618 Annals of Operations Research (2024) 343:607–638

5 Properties of the FW method and its variants

5.1 The FW gap

A key parameter often used as a measure of convergence is the FW gap

G(x) = max −∇ f (x) (s − x) , (32)
s∈C

which is always nonnegative and equal to 0 only in first order stationary points. This gap is,
by definition, readily available during the algorithm. If f is convex, using that ∇ f (x) is a
subgradient we obtain
G(x) ≥ −∇ f (x) (x∗ − x) ≥ f (x) − f ∗ , (33)
so that G(x) is an upper bound on the optimality gap at x. Furthermore, G(x) is a special
case of the Fenchel duality gap (Lacoste-Julien et al., 2013).
If C = n−1 is the simplex, then G is related to the Wolfe dual as defined in Clarkson
(2010). Indeed, this variant of Wolfe’s dual reads
max f (x) + λ(e x − 1) − u x
s.t. ∇i f (x) − u i + λ = 0 , i ∈ [1 : n] , (34)
(x, u, λ) ∈ Rn × Rn+ × R
and for a fixed x ∈ Rn , the optimal values of (u, λ) are
λx = − min ∇ j f (x) , u i (x) := ∇i f (x) − min ∇ j f (x) ≥ 0 .
j j

Performing maximization in problem (34) iteratively, first for (u, λ) and then for x, this
implies that (34) is equivalent to
maxx∈Rn [ f (x) + λx (e x − 1) − u(x) x]
(35)
= maxx∈Rn f (x) − max j (e j − x) ∇ f (x) = maxx∈Rn [ f (x) − G(x)] .
Furthermore, since Slater’s condition is satisfied, strong duality holds by Slater’s theorem
(Boyd & Vandenberghe, 2004), resulting in G(x∗ ) = 0 for every solution x∗ of the primal
problem.
The FW gap is related to several other measures of convergence (see e.g. Lan 2020, Section
7.5.1). First, consider the projected gradient

gk := πC (xk − ∇ f (xk )) − xk . (36)
with π B the projection on a convex and closed subset B ⊆ Rn . We have
gk = 0 if and
only if xk is stationary, with

gk 2 =
gk ≤
gk gk [(xk − ∇ f (xk )) − πC (xk − ∇ f (xk ))] +
gk
gk

gk ∇ f (xk ) = −(πC (xk − ∇ f (xk )) − xk ) ∇ f (xk )
= − (37)
≤ max −(y − xk ) ∇ f (xk ) = G(xk ) ,
y∈C

where we used [y − πC (x)] [x − πC (x)] ≤ 0 in the first inequality, with x = xk − ∇ f (xk )

and y = xk .
Let now NC (x) denote the normal cone to C at a point x ∈ C:
NC (x) := {r ∈ Rn : r (y − x) ≤ 0 for all y ∈ C} . (38)

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 619

First-order stationarity conditions are equivalent to −∇ f (x) ∈ NC (x), or

dist(NC (x), −∇ f (x)) = − ∇ f (x) − π NC (x) (−∇ f (x)) = 0 .
The FW gap provides a lower bound on the distance from the normal cone
dist(NC (x), −∇ f (x)), inflated by the diameter D > 0 of C, as follows:
G(xk ) = −(sk − xk ) ∇ f (xk )
= (sk − xk ) [π NC (xk ) (−∇ f (xk )) − (π NC (xk ) (−∇ f (xk )) + ∇ f (xk ))]
(39)
≤ sk − xk π NC (xk ) (−∇ f (xk )) + ∇ f (xk )
≤ D dist(NC (xk ), −∇ f (xk )) ,
where in the first inequality we used (sk − xk ) [π NC (xk ) (−∇ f (xk ))] ≤ 0 together with the
Cauchy-Schwarz inequality, and sk − xk ≤ D in the second.

5.2 O(1/k) rate for convex objectives

√
If f is non-convex, it is possible to prove a O(1/ k) rate for mini∈[1:k] G(xi ) (see, e.g.,
Lacoste-Julien 2016). On the other hand, if f is convex, we have an O(1/k) rate on the
optimality gap (see, e.g., Frank and Wolfe (1956), Levitin and Polyak (1966)) for all the
stepsizes discussed in Sect. 4. Here we include a proof for the Lipschitz constant dependent
stepsize αk given by (28).
Theorem 1 If f is convex and the stepsize is given by (28), then for every k ≥ 1
2L D 2
f (xk ) − f ∗ ≤ . (40)
k+2
Before proving the theorem we prove a lemma concerning the decrease of the objective in
the case of a full FW step, that is a step with dk = dkF W and with αk equal to 1, the maximal
feasible stepsize.
Lemma 2 If αk = 1 and dk = dkF W then
1
f (xk+1 ) − f ∗ ≤ min Ldk 2 , f (xk ) − f ∗ . (41)
2
Proof If αk = 1 = αkmax then by Definitions (3) and (32)

G(xk ) = −∇ f (xk ) dk ≥ Ldk 2 , (42)

the last inequality following by Definition (28) and the assumption that αk = 1. By the
standard Descent Lemma it also follows
L
f (xk+1 ) − f ∗ = f (xk + dk ) − f ∗ ≤ f (xk ) − f ∗ + ∇ f (xk ) dk + dk 2 . (43)
2
Considering the definition of dk and convexity of f , we get
f (xk ) − f ∗ + ∇ f (xk ) dk ≤ f (xk ) − f ∗ + ∇ f (xk ) (x∗ − xk ) ≤ 0 ,
so that (43) entails f (xk+1 ) − f ∗ ≤ L
2 dk 2 . To conclude, it suffices to apply to the RHS
of (43) the inequality
f (xk )− f ∗
f (xk ) − f ∗ + ∇ f (xk ) dk + L
2 dk 2 ≤ f (xk ) − f ∗ − 1
2 G(xk ) ≤ 2 (44)
where we used (42) in the first inequality and G(xk ) ≥ f (xk ) − f∗ in the second.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
620 Annals of Operations Research (2024) 343:607–638

We can now proceed with the proof of the main result.

Proof (Theorem 1)
For k = 0 and α0 = 1 then by Lemma 2
Ld0 2 L D2
f (x1 ) − f ∗ ≤ ≤ . (45)
2 2
If α0 < 1 then
f (x0 ) − f ∗ ≤ G(x0 ) < Ld0 2 ≤ L D 2 . (46)
Therefore in both cases (30) holds for k = 0.
Reasoning by induction, if (40) holds for k with αk = 1, then the claim is clear by (41).
On the other hand, if αk < αkmax = 1 then by Lemma 1, we have

f (xk+1 ) − f ∗ ≤ f (xk ) − f ∗ − 2
2L (∇ f (x k ) dk )
1

f (xk ) − f ∗ − (∇ f 2L(xk ) dk ) 2
≤ D2
( f (xk )− f ∗ )2
(47)
≤ ∗
f (xk ) − f − 2L D 2
f∗
= ( f (xk ) − f ∗ )(1 − f (x2Lk )−
D2
) ≤ 2L D2
k+3 ,
where we used dk ≤ D in the second inequality, ∇ f (xk ) dk = G(xk ) ≥ f (xk ) − f ∗ in
the third inequality; and the last inequality follows by induction hypothesis.

As can be easily seen from above argument, the convergence rate of O(1/k) is true also in
more abstract normed spaces than Rn , e.g. when C is a convex and weakly compact subset of
a Banach space (see, e.g., Demyanov and Rubinov (1970), Dunn and Harshbarger (1978)).
A generalization for some unbounded sets is given in Ferreira and Sosa (2021). The bound
is tight due to a zigzagging behaviour of the method near solutions on the boundary, leading
to a rate of (1/k 1+δ ) for every δ > 0 (see Canon and Cullum (1968) for further details),
when the objective is a strictly convex quadratic function and the domain is a polytope.
Also the minimum FW gap mini∈[0:k] G(xi ) converges at a rate of O(1/k) (see Jaggi
(2013); Freund and Grigas (2016)). In Freund and Grigas (2016), a broad class of stepsizes
is examined, including αk = k+1 1
and αk = ᾱ constant. For these stepsizes a convergence

ln(k)
rate of O k is proved.

5.3 Variants

Active set FW variants mostly aim to improve over the O(1/k) rate and also ensure support
identification in finite time. They generate a sequence of active sets {Ak }, such that xk ∈
conv(Ak ), and define alternative directions making use of these active sets.
For the pairwise FW (PFW) and the away step FW (AFW) (see Clarkson (2010); Lacoste-
Julien and Jaggi (2015)) we have that Ak must always be a subset of Sk , with xk a convex
combination of the elements in Ak . The away vertex vk is then defined by
vk ∈ arg max ∇ f (xk ) y . (48)
y∈Ak

The AFW direction, introduced in Wolfe (1970), is hence given by

dkAS = xk − vk
(49)
dk ∈ arg max{−∇ f (xk ) d : d ∈ {dkAS , dkF W }} ,

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 621

Table 1 FW method and variants covered in this review

Variant Direction Active set

FW dk = dkF W = sk − xk , sk ∈ arg max{∇ f (xk ) x : x ∈ C} −

AFW dk ∈ arg max{−∇ f (xk ) d : d ∈ {xk − vk , dkF W }, vk ∈ Ak } Ak+1 ⊆ Ak ∪ {sk }
PFW dk = sk − vk , vk ∈ arg max{∇ f (xk ) vk : vk ∈ Ak } Ak+1 ⊆ Ak ∪ {sk }
EFW dk = yk − xk , yk ∈ arg min{ f (y) : y ∈ conv(Ak )} Ak+1 ⊆ Ak ∪ {sk }
FDFW dk ∈ arg max{−∇ f (xk ) d : d ∈ {xk − vk , dkF W }, vk ∈ Ak } Ak = F (xk )

while the PFW direction, as defined in Lacoste-Julien and Jaggi (2015) and inspired by the
early work (Mitchell et al., 1974), is

dkP F W = dkF W + dkAS = sk − vk , (50)

with sk defined in (3).

The FW method with in-face directions (FDFW) (see Freund et al. (2017), Guelat and
Marcotte (1986)), also known as Decomposition invariant Conditional Gradient (DiCG)
when applied to polytopes (Bashiri & Zhang, 2017), is defined exactly as the AFW, but
with the minimal face F (xk ) of C containing xk as the active set. The extended FW (EFW)
was introduced in Holloway (1974) and is also known as simplicial decomposition (Von
Hohenbalken, 1977). At every iteration the method minimizes the objective in the current
active set Ak+1

xk+1 ∈ arg min f (y) , (51)

y∈conv(Ak+1 )

where Ak+1 ⊆ Ak ∪ {sk } (see, e.g., Clarkson 2010, Algorithm 4.2). A more general version
of the EFW, only approximately minimizing on the current active set, was introduced in
Lacoste-Julien and Jaggi (2015) under the name of fully corrective FW. In Table 1, we report
the main features of the classic FW and of the variants under analysis.

5.4 Sparse approximation properties

As discussed in the previous section, for the classic FW method and the AFW, PFW, EFW
variants xk can always be written as a convex combination of elements in Ak ⊂ Sk , with
|Ak | ≤ k + 1. Even for the FDFW we still have the weaker property that xk must be an affine
combination of elements in Ak ⊂ A with |Ak | ≤ k + 1. It turns out that the convergence rate
of methods with this property is ( k1 ) in high dimension. More precisely, if C = conv(A)
with A compact, the O(1/k) rate of the classic FW method is worst case optimal given the
sparsity constraint

xk ∈ aff(Ak ) with Ak ⊂ A, |Ak | ≤ k + 1 . (52)

An example where the O(1/k) rate is tight was presented in Jaggi (2013). Let C = n−1
and f (x) = x − n1 e2 . Clearly, f ∗ = 0 with x∗ = n1 e. Then it is easy to see that
min{ f (x) − f ∗ : x0 ≤ k + 1} ≥ k+1 1
− n1 for every k ∈ N, so that in particular under (52)
with Ak = {ei : i ∈ [1 : n]}, the rate of any FW variant must be ( k1 ).

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
622 Annals of Operations Research (2024) 343:607–638

5.5 Affine invariance

The FW method and the AFW, PFW, EFW are affine invariant (Jaggi, 2013). More precisely,
let P be a linear transformation, fˆ be such that fˆ(Px) = f (x) and Ĉ = P(C). Then for every
sequence {xk } generated by the methods applied to ( f , C), the sequence {yk } := {Pxk } can
be generated by the FW method with the same stepsizes applied to ( fˆ, Ĉ). As a corollary,
considering the special case where P is the matrix collecting the elements of A as columns, one
can prove results on C = |A|−1 and generalize them to Ĉ := conv(A) by affine invariance.
An affine invariant convergence rate bound for convex objectives can be given using the
curvature constant

f (x) (y−x)
κ f ,C := sup 2 f (αy+(1−α)x)− fα(x)−α∇
2 : {x, y} ⊂ C, α ∈ (0, 1] . (53)

It is easy to prove that κ f ,C ≤ L D 2 if D is the diameter of C. In the special case where

C = n−1 and f (x) = x Ã Ãx + b x, then κ f ,C ≤ diam(A n−1 )2 for A = [Ã , b]; see
(Clarkson, 2010).
When the method uses the stepsize sequence (25), it is possible to give the following affine
invariant convergence rate bounds (see Freund and Grigas 2016):
2κ f ,C
f (xk ) − f ∗ ≤ ,
k+4
(54)
9κ f ,C
min G(xi ) ≤ ,
i∈]0:k] 2k
thus in particular slightly improving the rate we gave in Theorem 1 since we have that
κ f ,C ≤ L D 2 .

5.6 Support identification for the AFW

It is a classic result that the AFW under some strict complementarity conditions and for
strongly convex objectives identifies in finite time the face containing the solution (Guelat &
Marcotte, 1986). Here we report some explicit bounds for this property proved in Bomze et
al. (2020). We first assume that C = n−1 , and introduce the multiplier functions

λi (x) = ∇ f (x) (ei − x) (55)

for i ∈ [1 : n]. Let x∗ be a stationary point for f , with the objective f not necessarily convex.
It is easy to check that {λi (x∗ )}i∈[1:n] coincide with the Lagrangian multipliers. Furthermore,
by complementarity conditions we have xi∗ λi (x∗ ) = 0 for every i ∈ [1 : n]. It follows that
the set

I (x∗ ) := {i ∈ [1 : n] : λi (x∗ ) = 0}

contains the support of x∗ ,

supp(x∗ ) := {i ∈ [1 : n] : xi∗ > 0} .

The next lemma uses λi , and the Lipschitz constant L of ∇ f , to give a lower bound of the
so-called active set radius r∗ , defining a neighborhood of x∗ . Starting the algorithm in this
neighbourhood, the active set (the minimal face of C containing x∗ ) is identified in a limited
number of iterations.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 623

Lemma 3 Let x∗ be a stationary point for f on the boundary of n−1 , δmin =

mini:λi (x∗ )>0 λi (x∗ ) and
δmin
r∗ = . (56)
δmin + 2L
Let. Assume that for every k for which dk = dA k holds, the step size αk is not smaller than
the stepsize given by (28), αk (L) ≤ αk . If xk − x∗ 1 < r∗ , then for some

j ≤ min{n − |I (x∗ )|, | supp(xk )| − 1}

we have supp(xk+ j ) ⊆ I (x∗ ) and xk+ j − x∗ 1 < r∗ .

Proof Follows from (Bomze et al. (2020), Theorem 3.3), since under the assumptions the
AFW sets one variable in supp(xk )\ I (x∗ ) to zero at every step without increasing the 1-norm
distance from x∗ .

The above lemma does not require convexity and was applied in Bomze et al. (2020) to derive
active set identification bounds in several convex and non-convex settings. Here we focus on
the case where the domain C = conv(A) with |A| < +∞ is a generic polytope, and where
f is μ-strongly convex for some μ > 0, i.e.
μ
f (y) ≥ f (x) + ∇ f (x) (y − x) + x − y2 for all {x, y} ⊂ C . (57)
2
Let E C (x∗ ) be the face of C exposed by ∇ f (x ∗ ):

E C (x∗ ) := arg min ∇ f (x∗ ) x , (58)

x∈C

Let then θ A be the Hoffman constant (see Beck and Shtern 2017) related to [Ā , In , e, −e] ,
with Ā the matrix having as columns the elements in A. Finally, consider the function f A (y) :=
f (Āy) on |A|−1 , and let L A be the Lipschitz constant of ∇ f A as well as
δmin
δmin := min ∇ f (x∗ ) (a − x∗ ) and r∗ (x∗ ) := .
a∈A\E C (x∗ ) δmin + 2L A
Using linearity of AFW convergence for strongly convex objectives (see Sect. 6.1), we have
the following result:

Theorem 2 The sequence {xk } generated by the AFW with x0 ∈ A enters E C (x∗ ) for

ln(h 0 ) − ln(μ A r∗ (x∗ )2 /2)
k ≥ max 2 ,0 , (59)
ln(1/q)
μ
where μ A = nθ A2
and q ∈ (0, 1) is the constant related to the linear convergence rate of the
AFW, i.e. h k ≤ qk h 0 for all k.

Proof (sketch) We present an argument in the case C = n−1 , A = {ei }i∈[1:n] which can be
easily extended by affine invariance to the general case (see Bomze et al. (2020) for details).
In this case θ A ≥ 1 and we can define μ̄ := μ/n ≥ μ A .
To start with, the number of steps needed to reach the condition
μ μ̄
hk ≤ r∗ (x∗ )2 = r∗ (x∗ )2 (60)
2n 2

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
624 Annals of Operations Research (2024) 343:607–638

is at most

ln(h 0 ) − ln(μ̄r∗ (x ∗ )2 /2)
k̄ = max ,0 .
ln(1/q)
Now we combine n · ≥ · 1 with strong convexity and relation (60) to obtain
xk − x∗ 1 ≤ r∗ (x∗ ), hence in particular xk − x∗ 1 ≤ r∗ (x∗ ) for every k ≥ k̄. Since
x0 is a vertex of the simplex, and at every step at most one coordinate is added to the support
of the current iterate, | supp(xk̄ )| ≤ k̄ + 1. The claim follows by applying Lemma 3.

Additional bounds under a quadratic growth condition weaker than strong convexity and
strict complementarity are reported in Garber (2020).
Convergence and finite time identification for the PFW and the AFW are proved in Bomze
et al. (2019) for a specific class of non-convex minimization problems over the standard
simplex, under the additional assumption that the sequence generated has a finite set of limit
points. In another line of work, active set identification strategies combined with FW variants
have been proposed in Cristofari et al. (2020) and Sun (2020).

5.7 Inexact linear oracle

In many real-world applications, linear subproblems can only be solved approximately. This
is the reason why the convergence of FW variants is often analyzed under some error term
for the linear minimization oracle (see, e.g., Braun et al. (2019), Braun et al. (2017), Freund
and Grigas (2016), Jaggi (2013), Konnov (2018)). A common assumption, relaxing the FW
vertex exact minimization property, is to have access to a point (usually a vertex) s̃k such
that
∇ f (xk ) (s̃k − xk ) ≤ min ∇ f (xk ) (s − xk ) + δk , (61)
s∈C

for a sequence {δk } of non negative approximation errors.

If the sequence {δk } is constant and equal to some δ > 0, then trivially the lowest possible
approximation error achieved by the FW method is δ. At the same time, (Freund and Grigas
(2016), Theorem 5.1) implies a rate of O( k1 + δ) if the stepsize αk = k+2 2
is used.
The O(1/k) rate can be instead retrieved by assuming that {δk } converges to 0 quickly
enough, and in particular if
δκ f ,C
δk = (62)
k+2
for a constant δ > 0. Under (62), in Jaggi (2013) a convergence rate of
2κ f ,C
f (xk ) − f ∗ ≤ (1 + δ) (63)
k+2
was proved for the FW method with αk given by exact line search or equal to k+2 2
, as well
as for the EFW.
A linearly convergent variant making use of an approximated linear oracle recycling
previous solutions to the linear minimization subproblem is studied in Braun et al. (2019). In
Freund and Grigas (2016), Hogan (1971), the analysis of the classic FW method is extended to
the case of inexact gradient information. In particular in Freund and Grigas (2016), assuming
the availability of the (δ, L) oracle introduced in Devolder et al. (2014), a convergence rate
of O(1/k + δk) is proved.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 625

Table 2 Known convergence rates for the FW method and the variants covered in this review
Method Objective Domain Assumptions Rate
√
FW NC Generic − O(1/ k)
FW C Generic − O(1/k)
FW SC Generic x∗ ∈ ri() Linear
Variants SC Polytope − Linear
FW SC Strongly convex − O(1/k 2 )
FW SC Strongly convex min ∇ f (x) > 0 Linear
NC, C and SC stand for non-convex, convex and strongly convex respectively

6 Improved rates for strongly convex objectives

See Table 2

6.1 Linear convergence under an angle condition

In the rest of this section we assume that f is μ-strongly convex (57). We also assume that
the stepsize is given by exact linesearch or by (28).
Under this assumption, an asymptotic linear convergence rate for the FDFW on polytopes
was given in the early work (Guelat & Marcotte, 1986). Furthermore, in Garber and Hazan
(2016) a linearly convergent variant was proposed, making use however of an additional local
linear minimization oracle.
Recent works obtain linear convergence rates by proving the angle condition
τ
−∇ f (xk )
dk ≥ ∇ f (xk ) (xk − x∗ ) (64)
xk − x∗

for some τ > 0 and some x∗ ∈ arg minx∈C f (x). As we shall see in the next lemma, under
(64) it is not difficult to prove linear convergence rates in the number of good steps. These
are FW steps with αk = 1 and steps in any descent direction with αk < 1.

Lemma 4 If the step k is a good step and (64) holds, then

h k+1 ≤ max 21 , 1 − τ Lμ h k .
2
(65)

Proof If the step k is a full FW step then Lemma 2 entails h k+1 ≤ 1

2 h k . In the remaining
case, first observe that by strong convexity

f ∗ = f (x∗ ) ≥ f (xk ) + ∇ f (xk ) (x∗ − xk ) + μ2 xk − x∗ 2

≥ min f (xk ) + α∇ f (xk ) (x∗ − xk ) + α 2μ xk − x∗ 2
2
(66)
α∈R
2
= f (xk ) − 2μx 1−x∗ 2 ∇ f (xk ) (xk − x∗ ) ,
k

which means
1 2
hk ≤ ∗
∇ f (xk ) (xk − x∗ ) . (67)
2μxk − x 2

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
626 Annals of Operations Research (2024) 343:607–638

We can then proceed using the bound (30) from Lemma 1 in the following way:
2
h k+1 = f (xk+1 ) − f ∗ ≤ f (xk ) − f ∗ − 2L
1
∇ f (xk )
dk
2
≤ h k − 2Lxτ−x∗ 2 ∇ f (xk ) (xk − x∗ )
2
(68)
k
≤ h k 1 − τ Lμ ,
2

where we used (64) in the second inequality and (67) in the third one.

As a corollary, under (64) we have the rate

γ (k)
τ 2μ
f (xk ) − f ∗ = h k ≤ max 2,1 − L
1
h0 (69)

for any method with non increasing { f (xk )} and following Algorithm 1, with γ (k) ≤ k an
integer denoting the number of good steps until step k. It turns out that for all the variants
we introduced in this review we have γ (k) ≥ T k for some constant T > 0. When x∗ is in
the relative interior of C, the FW method satisfies (64) and we have the following result (see
Guelat and Marcotte (1986), Lacoste-Julien and Jaggi (2015)):

Theorem 3 If x∗ ∈ ri(C), then

! "2 #k
∗ μ dist(x∗ , ∂C)
f (xk ) − f ≤ 1 − ( f (x0 ) − f ∗ ) . (70)
L D

Proof We can assume for simplicity int(C) = ∅, since otherwise we can restrict ourselves to
the affine hull of C. Let δ = dist(x∗ , ∂C) and g = −∇ f (xk ). First, by assumption we have
x∗ + δg ∈ C. Therefore
g dkF W ≥ g ((x∗ + δ
g) − x) = δg
g + g (x∗ − x) ≥ δg + f (x) − f ∗ ≥ δg ,(71)
where we used x∗ + δg ∈ C in the first inequality and convexity in the second. We can
conclude
! "
dk
FW FW
dk δ δ xk − x ∗
g ≥g ≥ g ≥ g . (72)
dkF W D D D xk − x∗
dist(x∗ ,∂C)
The thesis follows by Lemma 4, noticing that for τ = D ≤ 1
2 we have 1 − τ 2 μL > 21 .

In Lacoste-Julien and Jaggi (2015), the authors proved that directions generated by the
AFW and the PFW on polytopes satisfy condition (64), with τ = PWidth(A)/D and
PWidth(A), pyramidal width of A. While PWidth(A) was originally defined with a rather
complex minmax expression, in Peña and Rodriguez (2018) it was then proved
PWidth(A) = min dist(F, conv(A \ F)) . (73)
F∈faces(C)

This quantity can√be explicitly computed in a few special cases. For A = {0, 1}n we have
PWidth(A) = 1/ n, while for A = {ei }i∈[1:n] (so that C is the n − 1 dimensional simplex)
2
√ if n is even
n
PWidth(A) = 2 (74)
√ if n is odd.
n−1/n

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 627

Angle conditions like (64) with τ dependent on the number of vertices used to represent
xk as a convex combination were given in Bashiri and Zhang (2017) and Beck and Shtern
(2017) for the FDFW and the PFW respectively. In particular, in Beck and Shtern (2017) a
geometric constant C called vertex-facet distance was defined as

C = min{dist(v, H ) : v ∈ V (C) , H ∈ H(C), v ∈

/ H} , (75)

with V (C) the set of vertices of C, and H(C) the set of supporting hyperplanes of C (contain-
ing a facet of C). Then condition (64) is satisfied for τ = C /s, with dk the PFW direction
and s the number of points used in the active set Ak .
In Bashiri and Zhang (2017), a geometric constant Hs was defined depending on the
minimum number s of vertices needed to represent the current point x k , as well as on the

proper2 inequalities qi x ≤ bi , i ∈ [1 : m], appearing in a description of C. For each of these
inequalities the second gap gi was defined as

gi = max qi v − secondmax qi v , i ∈ [1 : m] , (76)
v∈V (C) v∈V (C)

with the secondmax function giving the second largest value achieved by the argument. Then
Hs is defined as
! "
n ai j 2 [1:m]
Hs := max gi :S∈ s . (77)
j=1 i∈S

The arguments used in the paper imply that (64) holds with τ = 2D √ 1
Hs
if dk is a FDFW
direction and xk the convex combination of at most s vertices. We refer the reader to Peña
and Rodriguez (2018) and Rademacher and Shu (2022) for additional results on these and
related constants.
The linear convergence results for strongly convex objectives are extended to compositions
of strongly convex objectives with affine transformations in Beck and Shtern (2017), Lacoste-
Julien and Jaggi (2015), Peña and Rodriguez (2018). In Gutman and Pena (2020), the linear
convergence results for the AFW and the FW method with minimum in the interior are
extended with respect to a generalized condition number L f ,C,D /μ f ,C,D , with D a distance
function on C.
For the AFW, the PFW and the FDFW, linear rates with no bad steps (γ (k) = k) are given
in Rinaldi and Zeffiro (2023) for non-convex objectives satisfying a Kurdyka-Łojasiewicz
inequality. In Rinaldi and Zeffiro (2020), condition (64) was proved for the FW direction
and orthographic retractions on some convex sets with smooth boundary. Extensions to
block-coordinate variants are reported in Bomze et al. (2024), Bomze et al. (2022). The
work (Combettes & Pokutta, 2020) introduces a new FW variant using a subroutine to align
the descent direction with the projection on the tangent cone of the negative gradient, thus
implicitly maximizing τ in (64).

6.2 Strongly convex domains

When C is strongly convex we have a O(1/k 2 ) rate (see, e.g., Garber and Hazan (2015), Ker-
dreux et al. (2021)) for the classic FW method. Furthermore, when C is βC -strongly convex
and ∇ f (x) ≥ c > 0, then we have the linear convergence rate (see Demyanov and Rubinov

2 i.e., those inequalities strictly satisfied for some x ∈ C.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
628 Annals of Operations Research (2024) 343:607–638

(1970), Dunn (1979), Kerdreux et al. (2021), Levitin and Polyak (1966))

h k+1 ≤ max 21 , 1 − 2cβ L
C
hk . (78)

Finally, it is possible to interpolate between the O(1/k 2 ) rate of the strongly convex setting
and the O(1/k) rate of the general convex one by relaxing strong convexity of the objective
with Hölderian error bounds Xu and Yang (2018) and also by relaxing strong convexity of
the domain with uniform convexity (Kerdreux et al., 2021).

7 Generalized FW for composite non-smooth optimization

Consider the following composite non-smooth optimization problem, generalizing (1):

min [ f (x) + g(x)] (79)
x∈Rn

with ∇ f being L-Lipschitz and g convex l.s.c. with compact domain C. Notice that problem
(1) corresponds to the special case where g is the indicator function of C. In the work (Bredies
et al., 2009), the following generalization of the FW direction was proposed for problem (79)
(originally referred to as "generalized conditional gradient"; the reader is referred to (Beck
(2017), Chapter 13) for a thorough treatment of the subject):

dkG F W = sk − xk with sk ∈ arg min (s − xk ) ∇ f (xk ) + g(s) . (80)
s∈Rn

Of course when g(x) is the indicator function of C we retrieve the classic FW direction. The
FW gap can be also extended as follows:
G(x) = g(x) + maxn [−∇ f (x) (s − x) − g(s)]
s∈R (81)
= g(x) + x ∇ f (x) + maxn [−∇ f (x) s − g(s)] ,
s∈R

maintaining its optimality measure properties (see Beck (2017), Theorem 13.6).
We focus here on the case where f is convex, and discuss two important properties of the
generalized conditional gradient method connected with the Fenchel dual of problem (79)
max − f ∗ (y) − g ∗ (−y) , (82)
y∈Rn

with f ∗ and g ∗ convex conjugates of f and g.

Let h(x) = f (x) + g(x) be the primal objective and h ∗ (y) = f ∗ (y) + g ∗ (−y) be the
(negative) dual objective. The first property is that the gap G turns out to be equal to the
Fenchel duality gap h(x) + h ∗ (y) when taking y = ∇ f (x) as dual variable. This is a direct
extension of the analogous property for the classic FW gap mentioned in Sect. 5. It can be
deduced directly from, e.g., (Beck (2017), Lemma 13.5). We report here a short and self-
contained argument.

Theorem 4 If f , g are convex and f is differentiable then

G(x) = h(x) + h ∗ (∇ f (x)) . (83)

Proof We have

f ∗ (∇ f (x)) = maxn z ∇ f (x) − f (z) = x ∇ f (x) − f (x) (84)
z∈R

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 629

and therefore, using (81) and (84),

h ∗ (∇ f (x)) = f ∗ (∇ f (x)) + maxn [−s ∇ f (x) − g(s)]
s∈R
= x ∇ f (x) − f (x) + G(x) − g(x) − x ∇ f (x) (85)
= G(x) − h(x) ,
which establishes the claim.

The above result also extends to non-differentiable objectives f , but in this case the gap also
depends on the particular subgradient considered in ∂ f (x).
The second property, described below in detail and proved in Bach (2015), is that the
generalized FW is equivalent to the mirror gradient descent applied to the Fenchel dual, when
considering the gradient as dual variable. Recall
that the mirror
descent method applied to
the composite optimization problem miny∈Rn h 1 (y) + q(y) has updates of the form
1
yk+1 ∈ arg min ∇h 1 (yk ) (y − yk ) + Dq (yk , y) (86)
y∈Rn αk

with Dq (x, y) the Bregman divergence associated to q:

Dq (x, y) = q(y) − q(x) − ∇q(x) (y − x) . (87)

The result proved in Bach (2015) states that if

xk+1 = xk + αk dkG F W ,

then a mirror gradient descent update as given in (86) on the Fenchel dual starting with
yk = ∇ f (xk ) gives

yk+1 = ∇ f (xk+1 ) .

As a corollary, a convergence rate of O(1/k) under the stepsize αk = 2/(k + 2) was also
derived in Bach (2015) for the generalized FW method.
We finish this section mentioning briefly some additional results. In the general nonconvex
√ with stepsize αk = 2/(k + 2), it was proved that the method converges with rate
case and
O(1/ k) (see, e.g., Beck (2017)). The analysis of the generalized FW was extended to the
case of inexact oracles in Yu et al. (2017), Yurtsever et al. (2018). We refer the reader to Yu et
al. (2017) for a more abstract analysis in Banach spaces, as well as for additional results for g
equal to a gauge function (see Yu et al. (2017), Section 2). In Nesterov (2018) parametrized
convergence rates were given under the Hölder condition for the gradient, and a trust region
version was introduced. In the special case where g is a norm, an adaption of the generalized
FW was descrbied in Harchaoui et al. (2015) (see also Pierucci et al. (2014) for a version
using smoothing techniques).

8 Extensions

8.1 Block coordinate Frank–Wolfe method

The block coordinate FW (BCFW) was introduced in Lacoste-Julien et al. (2013) for block
product domains of the form C = C (1) × ... × C (m) ⊆ Rn 1 +...+n m , and applied to structured
SVM training. The algorithm operates by selecting a random block and performing a FW

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
630 Annals of Operations Research (2024) 343:607–638

step in that block. Formally, for s ∈ Rm i let s(i) ∈ Rn be the vector with all blocks equal to
o except for the i-th block equal to s. We can write the direction of the BCFW as
(i)
dk = sk − xk
(88)
sk ∈ arg min ∇ f (xk ) s(i)
s∈C (i)

for a random index i ∈ [1 : n].

In Lacoste-Julien et al. (2013), a convergence rate of
2K m
E[ f (xk )] − f ∗ ≤ (89)
k + 2m
is proved, for K = h 0 + κ ⊗ ⊗
f , with κ f the product domain curvature constant, defined as
⊗,i ⊗,i
κ⊗f = κ f where κ f are the curvature constants of the objective fixing the blocks
outside C (i) :
(i)

f (x) d(i)
κ ⊗,i
f := sup 2 f (x+αd )− f (x)−α∇
α 2 : d ∈ C − x, x ∈ C, α ∈ (0, 1] . (90)

An asynchronous and parallel generalization for this method was given in Wang et al. (2016).
This version assumes that a cloud oracle is available, modeling a set of worker nodes each
sending information to a server at different times. This information consists of an index i and
the following LMO on C (i) :
s(i) ∈ arg min ∇ f (xk ) s(i) . (91)
s∈C (i)

The algorithm is called asynchronous because k can be smaller than k, modeling a delay in
the information sent by the node. Once the server has collected a minibatch S of τ distinct
indexes (overwriting repetitions), the descent direction is defined as
(i)
dk = s(i) , (92)
i∈S

If the indices sent by the nodes are i.i.d., then under suitable assumptions on the delay, a
convergence rate of
2m K τ
E[ f (xk )] − f ∗ ≤ (93)
τ 2 k + 2m
can be proved, where K τ = mκ ⊗ f ,τ (1 + δ) + h 0 for δ depending on the delay error, with
⊗
κ f ,τ the average curvature constant in a minibatch keeping all the components not in the
minibatch fixed.
In Osokin et al. (2016), several improvements are proposed for the BCFW, including an
adaptive criterion to prioritize blocks based on their FW gap, and block coordinate versions
of the AFW and the PFW variants.
In Shah et al. (2015), a multi plane BCFW approach is proposed in the specific case
of the structured SVM, based on caching supporting planes in the primal, corresponding
to block linear minimizers in the dual. In Berrada et al. (2019), the duality for structured
SVM between BCFW and stochastic subgradient descent is exploited to define a learning
rate schedule for neural networks based only on one hyper parameter. The block coordinate
approach is extended to the generalized FW in Beck et al. (2015), with coordinates however
picked in a cyclic order.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 631

8.2 Conditional gradient sliding

The conditional gradient sliding (CGS) method, introduced in Lan and Zhou (2016), is based
on the application of the classic FW method as a subroutine to compute projections in the
accelerated projected gradient (APG) method. More precisely, let g be the current negative
gradient, in general different from −∇ f (xk ) for the APG method. Since for a stepsize η > 0
we have
$ %
1
πC (xk + ηg) = arg min g (xk − y) + y − xk 2 , (94)
y∈C 2η
the FW method can be applied to
1
V̄g,xk ,η (y) := g (xk − y) + y − xk 2 , (95)
2η
(k) (k)
generating an auxiliary sequence {ut } with starting point u0 = xk . The stopping criterion
is based on the FW gap for V̄ :
(k) (k)
max (β(xk − ut ) + g) (x − ut ) ≤ εk , (96)
x∈C

with εk a decreasing sequence converging to 0. Notice how both the gradient and exact
linesearch are readily available for the projection objective V̄ , so that the main costs of the
projection subroutine come from the extra linear minimization oracle calls. In Lan and Zhou
(2016), convergence rates of O(1/k 2 ) and O(1/k) in the number of gradient computations and
LMO calls respectively were proved for convex objectives, thus achieving both the optimal
O(1/k 2 ) rate for gradient descent methods (see e.g. Nesterov 1998) and the optimal O(1/k)
L D2
in FW steps. The parameters used to achieve this rate were ηk = k+1 3L
and εk = k(k+1) . In the
strongly convex case, a rate linear in the number of gradient computations can be proved.
We refer the reader to (Lan (2020), sections 7.2, 7.3) for a comprehensive study of CGS,
including also stochastic and finite sum objectives. The method was also studied in the
composite setting in Cheung and Lou (2015) and in the smooth non-convex setting in Qu
et al. (2018) and (with no acceleration) in Thekumparampil et al. (2020). A variant for
unbounded domains was given in Gonçalves et al. (2020). In a different line of work, in
Diakonikolas et al. (2020) a strategy switching between the AFW and accelerated gradient
descent (roughly speaking, after support
& identification) was proved to achieve the optimal
asymptotic convergence rate of O( L
μ ln(1/ε)) in the strongly convex case.

8.3 Variants for the min-norm point problem

Consider the min-norm point (MNP) problem

min x∗ , (97)
x∈C

with C a closed convex subset of Rn and · ∗ a norm on Rn . In Wolfe (1976), a FW variant

is introduced to solve the problem when C is a polytope and · ∗ is the standard Euclidean
norm · . Similarly to the variants introduced in Sect. 5.3, it generates a sequence of active
sets {Ak } with sk ∈ Ak+1 . At the step k the norm is minimized on the affine hull aff(Ak ) of
the current active set Ak , that is
vk = arg min y . (98)
y∈aff(Ak )

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
632 Annals of Operations Research (2024) 343:607–638

The descent direction dk is then defined as

dk = vk − xk , (99)

and the stepsize is given by a tailored linesearch that allows to remove some of the atoms in
the set Ak (see, e.g. Lacoste-Julien and Jaggi (2015), Wolfe (1976)). Whenever xk+1 is in the
relative interior of conv(Ak ), the FW vertex is added to the active set (that is, sk ∈ Ak+1 ).
Otherwise, at least one of the vertices not appearing in a convex representation of xk is
removed. This scheme converges linearly when applied to generic smooth strongly convex
objectives (see, e.g., Lacoste-Julien and Jaggi 2015).
In Harchaoui et al. (2015), a FW variant is proposed for minimum norm problems of the
form

min{x∗ : f (x) ≤ 0, x ∈ K } (100)

with K a convex cone, f convex with L-Lipschitz gradient. In particular, the optimization
domain is C = {x ∈ Rn : f (x) ≤ 0} ∩ K . The technique proposed in the article applies the
standard FW method to the problems

min{ f (x) : x∗ ≤ δk , x ∈ K } ,

with {δk } an increasing sequence convergent to the optimal value δ̄ of the problem (100). Let
C(δ) = {x ∈ Rn : x∗ ≤ δ} ∩ K for δ ≥ 0, and let

LM(r) ∈ arg min r x ,

x∈C(1)

so that by homogeneity for every k the linear minimization oracle on C(δk ) is given by

LMOC(δk ) (r) = δk LM(r) . (101)

For every k, applying the FW method with suitable stopping conditions an approximate
minimizer xk of f (x) over C(δk ) is generated, with an associated lower bound on the objective,
an affine function in y:

f k (y) := f (xk ) + ∇ f (xk ) (y − xk ) . (102)

Then the function

k (δ) := min f k (y) = f k (δLM(gk )) with gk = ∇ f (xk ) (103)

y∈C(δ)

is decreasing and affine in δ and satisfies

k (δ) = min f k (y) ≤ F(δ) := min f (y) . (104)

y∈C(δ) y∈C(δ)

Therefore, for

¯k (δ) = max i (δ) ≤ F(δ)

i∈[1:k]

the quantity δk+1 can be defined as min{δ ≥ 0 : ¯k (δ) ≤ 0}, hence F(δk+1 ) ≥ 0. A
complexity bound of O( 1ε ln( 1ε )) was given to achieve precision ε applying this method, with
O(1/ε) iterations per subproblem and length of the sequence {δk } at most O(ln(1/ε)) (see
Harchaoui et al. (2015), Theorem 2 for details).

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 633

8.4 Variants for optimization over the trace norm ball

The FW method has found many applications for optimization problems over the trace
norm ball. In this case, as explained in Example 3.4, linear optimization can be obtained
by computing the top left and right singular vectors of the matrix −∇ f (Xk ), an operation
referred to as 1-SVD (see Allen-Zhu et al. (2017)).
In the work (Freund et al., 2017), the FDFW is applied to the matrix completion problem
(13), thus generating a sequence of matrices {Xk } with Xk ∗ ≤ δ for every k. The method
can be implemented efficiently exploiting the fact that for X on the boundary of the nuclear
norm ball, there is a simple expression for the face F (X). For X ∈ Rm×n with rank(X) = k
let UDV be the thin SVD of X, so that D ∈ Rk×k is the diagonal matrix of non zero singolar
values for X, with corresponding left and right singular vectors in the columns of U ∈ Rm×k
and V ∈ Rn×k respectively. If X∗ = δ then the minimal face of the domain containing X is
the set
F (X) = {X ∈ Rm×n : X = UMV for M = M psd with M∗ = δ} . (105)
It is not difficult to see that we have rank(Xk ) ≤ k + 1 for every k ∈ N, as well. Furthermore,
the thin SVD of the current iterate Xk can be updated efficiently both after FW steps and after
in face steps. The convergence rate of the FDFW in this setting is still O(1/k).
In the recent work (Wang et al., 2022), an unbounded variant of the FW method is applied
to solve a generalized version of the trace norm ball optimization problem:
min { f (X) : PXQ∗ ≤ δ} (106)
X∈Rm×n

with P, Q singular matrices. The main idea of the method is to decompose the domain in
the sum S + T between the kernel T of the linear function ϕP,Q (X) = PXQand a bounded
set S ⊂ T ⊥ . Then gradient descent steps in the unbounded component T are alternated to
FW steps in the bounded component S. The authors apply this approach to the generalized
LASSO as well, using the AFW for the bounded component.
In Allen-Zhu et al. (2017), a variant of the classic FW using k-SVD (computing the top k
left and right singular vectors for the SVD) is introduced, and it is proved that it converges
linearly for strongly convex objectives when the solution has rank at most k. In Mu et al.
(2016), the FW step is combined with a proximal gradient step for a quadratic problem on the
product of the nuclear norm ball with the 1 ball. Approaches using an equivalent formulation
on the spectrahedron introduced in Jaggi and Sulovský (2010) are analyzed in Ding et al.
(2020), Garber (2023).

9 Conclusions

While the concept of the FW method is quite easy to understand, its advantages, witnessed
by a multitude of related work, may not be apparent to someone not closely familiar with the
subject. Therefore we considered, in Sect. 3, several motivating applications, ranging from
classic optimization to more recent machine learning problems. As in any line search-based
method, the proper choice of stepsize is an important ingredient to achieve satisfactory per-
formance. In Sect. 4, we review several options for stepsizes in first order methods, which
are closely related both to the theoretical analysis as well as to practical implementation
issues, guaranteeing fast convergence. This scope was investigated in more detail in Sect. 5
covering main results about the FW method and its most popular variants, including the

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
634 Annals of Operations Research (2024) 343:607–638

O(1/k) convergence rate for convex objectives, affine invariance, the sparse approximation
property, and support identification. The account is complemented by a report on recent
progress in improving on the O(1/k) convergence rate in Sect. 6. Versatility and efficiency
of this approach was also illustrated in the final Sect. 8 describing present recent FW vari-
ants fitting different optimization frameworks and computational environments, in particular
block coordinate, distributed, accelerated, and trace norm optimization. For sure many other
interesting and relevant aspects of FW and friends could not find their way into this review
because of space and time limitations, but the authors hope to have convinced readers that
FW merits a consideration even by non-experts in first-order optimization.
Funding Open access funding provided by University of Vienna.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included in the
article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is
not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
Ahipaşaoğlu, S. D., Sun, P., & Todd, M. J. (2008). Linear convergence of a modified Frank–Wolfe algorithm
for computing minimum-volume enclosing ellipsoids. Optimisation Methods and Software, 23(1), 5–19.
Ahipaşaoğlu, S. D., & Todd, M. J. (2013). A modified Frank–Wolfe algorithm for computing minimum-area
enclosing ellipsoidal cylinders: Theory and algorithms. Computational Geometry, 46(5), 494–519.
Allen-Zhu, Z., Hazan, E., Hu, W., & Li, Y. (2017). Linear convergence of a Frank–Wolfe type algorithm over
trace-norm balls. Advances in Neural Information Processing Systems, 2017, 6192–6201.
Bach, F. (2013). Learning with submodular functions: A convex optimization perspective. Foundations and
Trends in Machine Learning, 6(2–3), 145–373.
Bach, F. (2015). Duality between subgradient and conditional gradient methods. SIAM Journal on Optimiza-
tion, 25(1), 115–129.
Bashiri, M.A., & Zhang, X. (2017). Decomposition-invariant conditional gradient for general polytopes with
line search. Advances in Neural Information Processing Systems, pp. 2690–2700
Beck, A. (2017). First-order methods in optimization. Philadelphia: SIAM.
Beck, A., Pauwels, E., & Sabach, S. (2015). The cyclic block conditional gradient method for convex opti-
mization problems. SIAM Journal on Optimization, 25(4), 2024–2049.
Beck, A., & Shtern, S. (2017). Linearly convergent away-step conditional gradient for non-strongly convex
functions. Mathematical Programming, 164(1–2), 1–27.
Berrada, L., Zisserman, A., & Kumar, M.P. (2019). Deep Frank-Wolfe for neural network optimization. In:
International conference on learning representations
Bertsekas, D. P. (2015). Convex optimization algorithms. Athena Scientific
Bomze, I. M. (1997). Evolution towards the maximum clique. Journal of Global Optimization, 10(2), 143–164.
Bomze, I. M., Budinich, M., Pardalos, P. M., Pelillo, M. (1999). The maximum clique problem. In: Handbook
of combinatorial optimization, pp. 1–74. Springer
Bomze, I. M., & de Klerk, E. (2002). Solving standard quadratic optimization problems via linear, semidefinite
and copositive programming. Journal of Global Optimization, 24(2), 163–185.
Bomze, I. M., Rinaldi, F., & Rota Bulò, S. (2019). First-order methods for the impatient: Support identification
in finite time with convergent Frank–Wolfe variants. SIAM Journal on Optimization, 29(3), 2211–2226.
Bomze, I. M., Rinaldi, F., & Zeffiro, D. (2020). Active set complexity of the away-step Frank–Wolfe algorithm.
SIAM Journal on Optimization, 30(3), 2470–2500.
Bomze, I. M., Rinaldi, F., Zeffiro, D. (2021). Frank-Wolfe and friends: A journey into projection-free first-order
optimization methods. 4OR 19(3), 313–345
Bomze, I. M., Rinaldi, F., & Zeffiro, D. (2022). Fast cluster detection in networks by first order optimization.
SIAM Journal on Mathematics of Data Science, 4(1), 285–305.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 635

Bomze, I. M., Rinaldi, F., & Zeffiro, D. (2024). Projection free methods on product domains. Computational
Optimization and Applications. https://doi.org/10.1007/s10589-024-00585-5
Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Braun, G., Pokutta, S., Tu, D., & Wright, S. (2019). Blended conditonal gradients. In: International conference
on machine learning, pp. 735–743. PMLR
Braun, G., Pokutta, S., & Zink, D. (2017). Lazifying conditional gradient algorithms. In: ICML, pp. 566–575
Bredies, K., Lorenz, D. A., & Maass, P. (2009). A generalized conditional gradient method and its connection
to an iterative shrinkage method. Computational Optimization and Applications, 42, 173–193.
Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Compu-
tational Mathematics, 9(6), 717–772.
Canon, M. D., & Cullum, C. D. (1968). A tight upper bound on the rate of convergence of Frank–Wolfe
algorithm. SIAM Journal on Control, 6(4), 509–516.
Carlini, N., & Wagner, D. (2017). Towards evaluating the robustness of neural networks. In: 2017 IEEE
symposium on security and privacy (sp), pp. 39–57. IEEE
Chakrabarty, D., Jain, P., & Kothari, P. (2014). Provable submodular minimization using Wolfe’s algorithm.
Advances in Neural Information Processing Systems, 27, 802–809.
Chen, J., Zhou, D., Yi, J., & Gu, Q. (2020). A Frank–Wolfe framework for efficient and effective adversarial
attacks. In: Proceedings of the AAAI conference on artificial intelligence, Vol. 34 no. 04, pp. 3486–3494
Chen, P. Y., Zhang, H., Sharma, Y., Yi, J., & Hsieh, C. J. (2017). ZOO: Zeroth order optimization based
black-box attacks to deep neural networks without training substitute models. In: Proceedings of the
10th ACM workshop on artificial intelligence and security, pp. 15–26
Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review,
43(1), 129–159.
Cheung, Y., & Lou, J. (2015). Efficient generalized conditional gradient with gradient sliding for composite
optimization. In: Twenty-fourth international joint conference on artificial intelligence
Clarkson, K. L. (2010). Coresets, sparse greedy approximation, and the Frank–Wolfe algorithm. ACM Trans-
actions on Algorithms, 6(4), 1–30.
Combettes, C., & Pokutta, S. (2020) Boosting frank-wolfe by chasing gradients. In: International conference
on machine learning, pp. 2111–2121. PMLR
Combettes, C. W., & Pokutta, S. (2021). Complexity of linear minimization and projection on some sets.
Operations Research Letters, 49(4), 565–571.
Cristofari, A., De Santis, M., Lucidi, S., & Rinaldi, F. (2020). An active-set algorithmic framework for non-
convex optimization problems over the simplex. Computational Optimization and Applications, 77,
57–89.
Demyanov, V. F., & Rubinov, A. M. (1970). Approximate methods in optimization problems. American
Elsevier
Devolder, O., Glineur, F., & Nesterov, Y. (2014). First-order methods of smooth convex optimization with
inexact oracle. Mathematical Programming, 146(1), 37–75.
Diakonikolas, J., Carderera, A., & Pokutta, S. (2020). Locally accelerated conditional gradients. In: Interna-
tional conference on artificial intelligence and statistics, pp. 1737–1747. PMLR
Ding, L., Fei, Y., Xu, Q., & Yang, C. (2020). Spectral Frank-Wolfe algorithm: Strict complementarity and
linear convergence. In: International conference on machine learning, pp. 2535–2544. PMLR
Dunn, J. C. (1979). Rates of convergence for conditional gradient algorithms near singular and nonsingular
extremals. SIAM Journal on Control and Optimization, 17(2), 187–211.
Dunn, J. C., & Harshbarger, S. (1978). Conditional gradient algorithms with open loop step size rules. Journal
of Mathematical Analysis and Applications, 62(2), 432–444.
Ferreira, O., & Sosa, W. (2021). On the Frank–Wolfe algorithm for non-compact constrained optimization
problems. Optimization pp. 1–15
Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly,
3(1–2), 95–110.
Freund, R. M., & Grigas, P. (2016). New analysis and results for the Frank–Wolfe method. Mathematical
Programming, 155(1–2), 199–230.
Freund, R. M., Grigas, P., & Mazumder, R. (2017). An extended Frank–Wolfe method with in-face directions,
and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1), 319–346.
Fujishige, S. (1980). Lexicographically optimal base of a polymatroid with respect to a weight vector. Math-
ematics of Operations Research, 5(2), 186–196.
Fukushima, M. (1984). A modified Frank–Wolfe algorithm for solving the traffic assignment problem. Trans-
portation Research Part B: Methodological, 18(2), 169–177.
Garber, D. (2020). Revisiting Frank–Wolfe for polytopes: Strict complementarity and sparsity. Advances in
Neural Information Processing Systems 33

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
636 Annals of Operations Research (2024) 343:607–638

Garber, D. (2023). Linear convergence of Frank–Wolfe for rank-one matrix recovery without strong convexity.
Mathematical Programming, 199(1), 87–121.
Garber, D., & Hazan, E. (2015). Faster rates for the Frank–Wolfe method over strongly-convex sets. ICML,
15, 541–549.
Garber, D., & Hazan, E. (2016). A linearly convergent variant of the conditional gradient algorithm under
strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization,
26(3), 1493–1528.
Gonçalves, M. L., Melo, J. G., & Monteiro, R. D. (2020) Projection-free accelerated method for convex
optimization. Optimization Methods and Software pp. 1–27
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y.
(2014). Generative adversarial nets. Advances in Neural Information Processing Systems, pp. 2672–2680
Guelat, J., & Marcotte, P. (1986). Some comments on Wolfe’s away step. Mathematical Programming, 35(1),
110–119.
Gutman, D. H., & Pena, J. F. (2020). The condition number of a function relative to a set. Mathematical
Programming pp. 1–40
Harchaoui, Z., Juditsky, A., & Nemirovski, A. (2015). Conditional gradient algorithms for norm-regularized
smooth convex optimization. Mathematical Programming, 152(1), 75–112.
Hogan, W. W. (1971). Convergence results for some extensions of the Frank-Wolfe method. Tech. rep.: Cali-
fornia Univ Los Angeles Western Management Science Inst.
Holloway, C. A. (1974). An extension of the Frank and Wolfe method of feasible directions. Mathematical
Programming, 6(1), 14–27.
Hungerford, J. T., & Rinaldi, F. (2019). A general regularized continuous formulation for the maximum clique
problem. Mathematics of Operations Research, 44(4), 1161–1173.
Jaggi, M. (2011). Sparse convex optimization methods for machine learning. Ph.D. thesis, ETH Zurich
Jaggi, M. (2013). Revisiting Frank–Wolfe: Projection-free sparse convex optimization. ICML, 1, 427–435.
Jaggi, M., Sulovský, M. (2010). A simple algorithm for nuclear norm regularized problems. In: ICML, pp.
471–478
Joulin, A., Tang, K., & Fei-Fei, L. (2014). Efficient image and video co-localization with Frank-Wolfe algo-
rithm. In: European conference on computer vision, pp. 253–268. Springer
Kazemi, E., Kerdreux, T., & Wang, L. (2021). Generating structured adversarial attacks using Frank–Wolfe
method. arXiv:2102.07360
Kerdreux, T., d’Aspremont, A., & Pokutta, S. (2021). Projection-free optimization on uniformly convex sets.
In: International conference on artificial intelligence and statistics, pp. 19–27. PMLR
Kerdreux, T., Liu, L., Lacoste-Julien, S., & Scieur, D. (2021). Affine invariant analysis of Frank–Wolfe on
strongly convex sets. International conference on machine learning, pp. 5398–5408
Konnov, I. (2018). Simplified versions of the conditional gradient method. Optimization, 67(12), 2275–2290.
Kumar, P., Mitchell, J. S., & Yıldırım, E. A. (2003). Approximate minimum enclosing balls in high dimensions
using core-sets. Journal of Experimental Algorithmics, 8, 1–1.
Lacoste-Julien, S. (2016). Convergence rate of Frank–Wolfe for non-convex objectives. arXiv:1607.00345
Lacoste-Julien, S., Jaggi, M. (2015). On the global linear convergence of Frank–Wolfe optimization variants.
In: Advances in neural information processing systems, pp. 496–504
Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P. (2013). Block-coordinate Frank-Wolfe optimization for
structural SVMs. In: S. Dasgupta, D. McAllester (eds.) Proceedings of the 30th international conference
on machine learning, Proceedings of Machine Learning Research, Vol. 28, pp. 53–61. PMLR, Atlanta,
Georgia, USA
Lan, G. (2020). First-order and stochastic optimization methods for machine learning. Springer
Lan, G., & Zhou, Y. (2016). Conditional gradient sliding for convex optimization. SIAM Journal on Optimiza-
tion, 26(2), 1379–1409.
LeBlanc, L. J., Morlok, E. K., & Pierskalla, W. P. (1975). An efficient approach to solving the road network
equilibrium traffic assignment problem. Transportation Research, 9(5), 309–318.
Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. USSR Computational Mathematics
and Mathematical Physics, 6(5), 1–50.
Locatello, F., Khanna, R., Tschannen, M., & Jaggi, M. (2017). A unified optimization view on generalized
matching pursuit and Frank–Wolfe. In: Artificial intelligence and statistics, pp. 860–868. PMLR
Luce, R. D., & Perry, A. D. (1949). A method of matrix analysis of group structure. Psychometrika, 14(2),
95–116.
Mangasarian, O.: Machine learning via polyhedral concave minimization. In: Applied mathematics and parallel
computing, pp. 175–188. Springer
Mitchell, B., Demyanov, V. F., & Malozemov, V. (1974). Finding the point of a polyhedron closest to the
origin. SIAM Journal on Control, 12(1), 19–26.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Annals of Operations Research (2024) 343:607–638 637

Mitradjieva, M., & Lindberg, P. O. (2013). The stiff is moving—conjugate direction Frank–Wolfe methods
with applications to traffic assignment. Transportation Science, 47(2), 280–293.
Mu, C., Zhang, Y., Wright, J., & Goldfarb, D. (2016). Scalable robust matrix recovery: Frank–Wolfe meets
proximal methods. SIAM Journal on Scientific Computing, 38(5), A3291–A3317.
Nesterov, Y. (1998). Introductory lectures on convex [p]rogramming Volume I: Basic course. Lecture notes
Nesterov, Y. (2018). Complexity bounds for primal-dual methods minimizing the model of objective function.
Mathematical Programming, 171(1), 311–330.
Osokin, A., Alayrac, J. B., Lukasewitz, I., Dokania, P., & Lacoste-Julien, S. (2016). Minding the gaps for
block Frank–Wolfe optimization of structured svms. In: International conference on machine learning,
pp. 593–602. PMLR
Peña, J., & Rodriguez, D. (2018). Polytope conditioning and linear convergence of the Frank–Wolfe algorithm.
Mathematics of Operartions Research, 44(1), 1–18.
Pedregosa, F., Negiar, G., Askari, A., & Jaggi, M. (2020). Linearly convergent Frank–Wolfe with backtracking
line-search. In: International conference on artificial intelligence and statistics, pp. 1–10. PMLR
Perederieieva, O., Ehrgott, M., Raith, A., & Wang, J. Y. (2015). A framework for and empirical study of
algorithms for traffic assignment. Computers & Operations Research, 54, 90–107.
Pierucci, F., Harchaoui, Z., & Malick, J. (2014). A smoothing approach for composite conditional gradient
with nonsmooth loss. Tech. rep., RR-8662, INRIA Grenoble
Qu, C., Li, Y., & Xu, H. (2018). Non-convex conditional gradient sliding. In: International conference on
machine learning, pp. 4208–4217. PMLR
Rademacher, L., & Shu, C. (2022). The smoothed complexity of Frank–Wolfe methods via conditioning of
random matrices and polytopes. Mathematical Statistics and Learning, 5(3), 273–310.
Rinaldi, F., Schoen, F., & Sciandrone, M. (2010). Concave programming for minimizing the zero-norm over
polyhedral sets. Computational Optimization and Applications, 46(3), 467–486.
Rinaldi, F., & Zeffiro, D. (2020). A unifying framework for the analysis of projection-free first-order methods
under a sufficient slope condition. arXiv:2008.09781
Rinaldi, F., & Zeffiro, D. (2023). Avoiding bad steps in Frank–Wolfe variants. Computational Optimization
and Applications, 84(1), 225–264.
Sahu, A. K., & Kar, S. (2020). Decentralized zeroth-order constrained stochastic optimization algorithms:
Frank–Wolfe and variants with applications to black-box adversarial attacks. Proceedings of the IEEE,
108(11), 1890–1905.
Shah, N., Kolmogorov, V., & Lampert, C. H. (2015). A multi-plane block-coordinate Frank–Wolfe algorithm
for training structural svms with a costly max-oracle. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 2737–2745
Sun, Y. (2020). Safe screening for the generalized conditional gradient method. Image 1, 2
Thekumparampil, K. K., Jain, P., Netrapalli, P., Oh, S. (2020). Projection efficient subgradient
method and optimal nonsmooth Frank–Wolfe method. In: H. Larochelle, M. Ranzato, R. Had-
sell, M. F. Balcan, H. Lin (eds.) Advances in neural information processing systems, vol. 33,
pp. 12211–12224. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/
8f468c873a32bb0619eaeb2050ba45d1-Paper.pdf
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society:
Series B (Methodological), 58(1), 267–288.
Vapnik, V. (2013). The Nature of Statistical Learning Theory. Springer
Von Hohenbalken, B. (1977). Simplicial decomposition in nonlinear programming algorithms. Mathematical
Programming, 13(1), 49–68.
Wang, H., Lu, H., & Mazumder, R. (2022). Frank-Wolfe methods with an unbounded feasible region and
applications to structured learning. SIAM Journal on Optimization, 32(4), 2938–2968.
Wang, Y. X., Sadhanala, V., Dai, W., Neiswanger, W., Sra, S., Xing, E. (2016). Parallel and distributed block-
coordinate Frank–Wolfe algorithms. In: International conference on machine learning, pp. 1548–1557.
PMLR
Wardrop, J. G. (1952). Road paper some theoretical aspects of road traffic research. Proceedings of the
Institution of Civil Engineers, 1(3), 325–362.
Weintraub, A., Ortiz, C., & González, J. (1985). Accelerating convergence of the Frank-Wolfe algorithm.
Transportation Research Part B: Methodological, 19(2), 113–122.
Wolfe, P. (1970). Convergence theory in nonlinear programming. In: J. Abadie (ed.) Integer and nonlinear
programming, pp. 1–36. North Holland
Wolfe, P. (1976). Finding the nearest point in a polytope. Mathematical Programming, 11(1), 128–149.
Wu, Q., & Hao, J. K. (2015). A review on algorithms for maximum clique problems. European Journal of
Operational Research, 242(3), 693–709.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
638 Annals of Operations Research (2024) 343:607–638

Xu, Y., Yang, T. (2018). Frank-Wolfe method is automatically adaptive to error bound condition.
arXiv:1810.04765
Yıldırım, E. A. (2008). Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimiza-
tion, 19(3), 1368–1391.
Yu, Y., Zhang, X., & Schuurmans, D. (2017). Generalized conditional gradient for sparse estimation. Journal
of Machine Learning Research, 18(144), 1–46.
Yurtsever, A., Fercoq, O., Locatello, F., Cevher, V. (2018). A conditional gradient framework for composite
convex minimization with applications to semidefinite programming. In: International conference on
machine learning, pp. 5727–5736. PMLR

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center
GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers
and authorised users (“Users”), for small-scale personal, non-commercial use provided that all
copyright, trade and service marks and other proprietary notices are maintained. By accessing,
sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of
use (“Terms”). For these purposes, Springer Nature considers academic use (by researchers and
students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and
conditions, a relevant site licence or a personal subscription. These Terms will prevail over any
conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription (to
the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of
the Creative Commons license used will apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may
also use these personal data internally within ResearchGate and Springer Nature and as agreed share
it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not otherwise
disclose your personal data outside the ResearchGate or the Springer Nature group of companies
unless we have your permission as detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial
use, it is important to note that Users may not:

1. use such content for the purpose of providing other users with access on a regular or large scale
basis or as a means to circumvent access control;
2. use such content where to do so would be considered a criminal or statutory offence in any
jurisdiction, or gives rise to civil liability, or is otherwise unlawful;
3. falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association
unless explicitly agreed to by Springer Nature in writing;
4. use bots or other automated methods to access the content or redirect messages
5. override any security feature or exclusionary protocol; or
6. share the content in order to create substitute for Springer Nature products or services or a
systematic database of Springer Nature journal content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a
product or service that creates revenue, royalties, rent or income from our content or its inclusion as
part of a paid for service or for other commercial gain. Springer Nature journal content cannot be
used for inter-library loans and librarians may not upload Springer Nature journal content on a large
scale into their, or any other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not
obligated to publish any information or content on this website and may remove it or features or
functionality at our sole discretion, at any time with or without notice. Springer Nature may revoke
this licence to you at any time and remove access to any copies of the Springer Nature journal content
which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or
guarantees to Users, either express or implied with respect to the Springer nature journal content and
all parties disclaim and waive any implied warranties or warranties imposed by law, including
merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published
by Springer Nature that may be licensed from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a
regular basis or in any other manner not expressly permitted by these Terms, please contact Springer
Nature at

[email protected]

Advanced Vector Optimization Techniques
No ratings yet
Advanced Vector Optimization Techniques
20 pages
Numerical Optimization For Inverse Problems - 10 Lectures On Inverse Problems and Imaging
No ratings yet
Numerical Optimization For Inverse Problems - 10 Lectures On Inverse Problems and Imaging
15 pages
Unconstrained and Constrained Optimization Techniques
No ratings yet
Unconstrained and Constrained Optimization Techniques
25 pages
Article: Projection-Based Curve Pattern Search For Black-Box Optimization Over Smooth Convex Sets
No ratings yet
Article: Projection-Based Curve Pattern Search For Black-Box Optimization Over Smooth Convex Sets
20 pages
Optimization for Convex Functions
No ratings yet
Optimization for Convex Functions
31 pages
Clnote Sept24
No ratings yet
Clnote Sept24
24 pages
ECOM 6302: Engineering Optimization: Chapter Three
100% (1)
ECOM 6302: Engineering Optimization: Chapter Three
56 pages
On The Online Frank-Wolfe Algorithms For Convex and Non-Convex Optimizations
No ratings yet
On The Online Frank-Wolfe Algorithms For Convex and Non-Convex Optimizations
28 pages
Chương 9
No ratings yet
Chương 9
12 pages
Levenberg-Marquardt Methods With Strong Local Convergence Properties For Solving Nonlinear Equations With Convex Constraints
No ratings yet
Levenberg-Marquardt Methods With Strong Local Convergence Properties For Solving Nonlinear Equations With Convex Constraints
25 pages
Solving Underdetermined Nonlinear Equations by Newton-Like Method
No ratings yet
Solving Underdetermined Nonlinear Equations by Newton-Like Method
22 pages
Process Optimization
100% (1)
Process Optimization
70 pages
Iterative Algorithms I PDF
No ratings yet
Iterative Algorithms I PDF
445 pages
Lecture 12
No ratings yet
Lecture 12
16 pages
Stats 102B Cheat Sheet
No ratings yet
Stats 102B Cheat Sheet
4 pages
CAAM 454 554 1lvazxx
No ratings yet
CAAM 454 554 1lvazxx
422 pages
Global Optimization of Noisy Functions
No ratings yet
Global Optimization of Noisy Functions
18 pages
Solving Nonlinear Equations
No ratings yet
Solving Nonlinear Equations
18 pages
Nonlinear Optimization CO 367
No ratings yet
Nonlinear Optimization CO 367
105 pages
Optimization Lecture Notes
No ratings yet
Optimization Lecture Notes
13 pages
Optimizing With Ga
No ratings yet
Optimizing With Ga
27 pages
CHAPTER 3: Basic Methods, Basic Concepts
No ratings yet
CHAPTER 3: Basic Methods, Basic Concepts
44 pages
Construction and Analysis of Explicit Adaptive One-Step Methods For Solving Stiff Problems
No ratings yet
Construction and Analysis of Explicit Adaptive One-Step Methods For Solving Stiff Problems
14 pages
Hauser Lecture2
No ratings yet
Hauser Lecture2
26 pages
Num1 Skript
No ratings yet
Num1 Skript
104 pages
Finite Difference Methods For Ordinary and Partial 21krgbzlct
No ratings yet
Finite Difference Methods For Ordinary and Partial 21krgbzlct
7 pages
Optimization Algorithms Without Derivatives
No ratings yet
Optimization Algorithms Without Derivatives
12 pages
1 Non-Negative Matrix Factorization (NMF) : K A A A
No ratings yet
1 Non-Negative Matrix Factorization (NMF) : K A A A
7 pages
Nelder Mead 2D
No ratings yet
Nelder Mead 2D
4 pages
CS-6777 Liu Abs
100% (1)
CS-6777 Liu Abs
103 pages
2019 MultiNewton-SIOPT WHYLY
No ratings yet
2019 MultiNewton-SIOPT WHYLY
37 pages
04 Nonlinear Systems and Optimization
No ratings yet
04 Nonlinear Systems and Optimization
74 pages
Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction
No ratings yet
Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction
86 pages
NumProg2 - 2020-07-12
No ratings yet
NumProg2 - 2020-07-12
110 pages
MUR8 A Multirate Extension of The Eighth Order Do - 1997 - Applied Numerical Ma
No ratings yet
MUR8 A Multirate Extension of The Eighth Order Do - 1997 - Applied Numerical Ma
8 pages
EE364a Homework 7 Solutions
No ratings yet
EE364a Homework 7 Solutions
16 pages
Nonlinear Optimization Course Overview
No ratings yet
Nonlinear Optimization Course Overview
75 pages
Wolfe Acm
No ratings yet
Wolfe Acm
23 pages
Advanced Numerical Analysis Prof. Daniel Kressner
No ratings yet
Advanced Numerical Analysis Prof. Daniel Kressner
25 pages
5165 Test 2 Cheating
No ratings yet
5165 Test 2 Cheating
7 pages
Unconstrained Numerical Optimization An Introduction For Econometricians
100% (1)
Unconstrained Numerical Optimization An Introduction For Econometricians
32 pages
Solodov - Siam Control Optim
No ratings yet
Solodov - Siam Control Optim
12 pages
NM 2 PDF
No ratings yet
NM 2 PDF
6 pages
Mathematics: Study of Local Convergence and Dynamics of A King-Like Two-Step Method With Applications
No ratings yet
Mathematics: Study of Local Convergence and Dynamics of A King-Like Two-Step Method With Applications
12 pages
Lecture 5
No ratings yet
Lecture 5
6 pages
Local Search in Smooth Convex Sets: CX Ax B A I A A A A A A O D X Ax B X CX CX O A I J Z O Opt D X X C A B P CX
No ratings yet
Local Search in Smooth Convex Sets: CX Ax B A I A A A A A A O D X Ax B X CX CX O A I J Z O Opt D X X C A B P CX
9 pages
Solutions To Selected Exercises and Additional Examples For My Book Numerical Methods For Evolutionary Differential Equations
No ratings yet
Solutions To Selected Exercises and Additional Examples For My Book Numerical Methods For Evolutionary Differential Equations
19 pages
MATH 6193 - Lecture Notes
No ratings yet
MATH 6193 - Lecture Notes
143 pages
Numerical Solution of Parabolic Equation
No ratings yet
Numerical Solution of Parabolic Equation
222 pages
Jiyue Zeng Honors Thesis
No ratings yet
Jiyue Zeng Honors Thesis
59 pages
Solving Ordinary Differential Equations II Stiff A
No ratings yet
Solving Ordinary Differential Equations II Stiff A
9 pages
Introduction To Optimization - Jean-François Aujol
No ratings yet
Introduction To Optimization - Jean-François Aujol
51 pages
A Concise Introduction To Numerical Analysis
100% (1)
A Concise Introduction To Numerical Analysis
304 pages
Ee227c Notes PDF
No ratings yet
Ee227c Notes PDF
122 pages
EE227C Course Notes: Convex Optimization
No ratings yet
EE227C Course Notes: Convex Optimization
122 pages
Chapter 7 - Economic Organization
No ratings yet
Chapter 7 - Economic Organization
10 pages
Psychology of Hannibal Lecter
No ratings yet
Psychology of Hannibal Lecter
13 pages
IIT Bombay Guide For Visitors: Formalities&Essentials
No ratings yet
IIT Bombay Guide For Visitors: Formalities&Essentials
4 pages
Information To Candidate & Declaration Updated As On 24th Apr'23
No ratings yet
Information To Candidate & Declaration Updated As On 24th Apr'23
2 pages
CBT - Ocd
No ratings yet
CBT - Ocd
11 pages
Grade 9 Certificate q1
No ratings yet
Grade 9 Certificate q1
14 pages
Comparative Education: India & Pakistan
No ratings yet
Comparative Education: India & Pakistan
9 pages
Advanced Math Practice Sheet SEO
No ratings yet
Advanced Math Practice Sheet SEO
26 pages
2 PT Global Business Foundation With All Majors and Elective Module Outline
No ratings yet
2 PT Global Business Foundation With All Majors and Elective Module Outline
3 pages
English as a Global Language
No ratings yet
English as a Global Language
3 pages
Secure AI Lifecycle
No ratings yet
Secure AI Lifecycle
41 pages
2023/2024 School Fees Receipt
No ratings yet
2023/2024 School Fees Receipt
2 pages
Compound Microscope and Focusing Specimen DLP
67% (3)
Compound Microscope and Focusing Specimen DLP
5 pages
E9c0 029d 4865 b904 ff8fcd4c43cc
No ratings yet
E9c0 029d 4865 b904 ff8fcd4c43cc
2 pages
TP Feedback Sheet Bita
No ratings yet
TP Feedback Sheet Bita
2 pages
Postgraduate Institute of Science
No ratings yet
Postgraduate Institute of Science
34 pages
Data Structures and Algorithms in C Goodrich Solution Manual PDF
No ratings yet
Data Structures and Algorithms in C Goodrich Solution Manual PDF
2 pages
B.Com Admissions Guide 2018-19
No ratings yet
B.Com Admissions Guide 2018-19
2 pages
LetterLand Student 39 Student Book
88% (8)
LetterLand Student 39 Student Book
84 pages
District CampBSP-GSP-PERMIT-DO-2025
No ratings yet
District CampBSP-GSP-PERMIT-DO-2025
1 page
Cambridge IGCSE ™: Mathematics 0580/12
No ratings yet
Cambridge IGCSE ™: Mathematics 0580/12
10 pages
A Cross-Cultural Comparison Between United Kingdom and Malaysian Academics' Satisfaction With Academic Research
No ratings yet
A Cross-Cultural Comparison Between United Kingdom and Malaysian Academics' Satisfaction With Academic Research
12 pages
Viii Leyte 303371 Granja-Kalinawan National High School: Ricah Mae Caaya Cabaltera
No ratings yet
Viii Leyte 303371 Granja-Kalinawan National High School: Ricah Mae Caaya Cabaltera
4 pages
Grade 12 Creative Writing Exam Guide
No ratings yet
Grade 12 Creative Writing Exam Guide
5 pages
Earthquake and Volcano Safety Lesson Plan
No ratings yet
Earthquake and Volcano Safety Lesson Plan
3 pages
Part A: Multiple Choice: Answer With The Best Choice. Make Sure That You Clearly Circle The
No ratings yet
Part A: Multiple Choice: Answer With The Best Choice. Make Sure That You Clearly Circle The
8 pages
Introducing The Art of Statistics How To Learn Fro
No ratings yet
Introducing The Art of Statistics How To Learn Fro
6 pages
AOE 4065-4066 Class Syllabus, Design (Aircraft), Fall 2009-Spring 2010
No ratings yet
AOE 4065-4066 Class Syllabus, Design (Aircraft), Fall 2009-Spring 2010
1 page
Est 230715 132158
No ratings yet
Est 230715 132158
4 pages
Terms and Conditions To Join SmartData Enterprises - 2024
No ratings yet
Terms and Conditions To Join SmartData Enterprises - 2024
2 pages

Frank-Wolfe and Friends A Journey Into Projection

Uploaded by

Frank-Wolfe and Friends A Journey Into Projection

Uploaded by

Annals of Operations Research (2024) 343:607–638

Frank–Wolfe and friends: a journey into projection-free

Immanuel. M. Bomze1 · Francesco Rinaldi2 · Damiano Zeﬃro2

Keywords First-order methods · Projection-free methods · Structured optimization ·

1.1 Organisation of the paper

For any integers a and b, denote by [a : b] = {x integer : a ≤ x ≤ b} the integer range

2 Problem and general scheme

We consider the following problem:

Algorithm 1 First-order method

2.1 The classical Frank–Wolfe method

The classical FW method for minimization of a smooth objective f generates a sequence of

defining the descent direction as

Algorithm 2 Frank–Wolfe method

3.1 Traffic assignment

Finding a traffic pattern satisfying the equilibrium conditions in a transportation network is

and can be split in n shortest paths subproblems, each of the form

3.2 Submodular optimization

Given a finite set V , a function r : 2V → R is said to be submodular if for every A, B ⊂ V

r (A) + r (B) ≥ r (A ∪ B) + r (A ∩ B) . (8)

min r (A) . (9)

B(G) = {s ∈ RV : sa ≤ r (A) for all A ⊆ V , sa = r (V )} . (10)

s jk := r ({ j1 , ..., jk }) − r ({ j1 , ..., jk−1 }) , (11)

3.3 LASSO problem

3.4 Matrix completion

3.5 Adversarial attacks in machine learning

3.6 Minimum enclosing ball

s.t. pi − c2 ≤ γ , all i ∈ [1 : n] . (18)

and the LMO is defined as follows:

LMO n−1 (∇ f (xk )) = eik ,

3.7 Training linear Support Vector Machines

T S = {(pi , yi ), pi ∈ Rd , yi ∈ {−1, 1}, i = 1, . . . , n},

3.8 Finding maximal cliques in graphs

3.9 Finding sparse points in a set

min {x0 : x ∈ P} . (23)

where φ : [0 , +∞ [ → R is a suitably chosen smooth concave univariate function bounded

Popular rules for determining the stepsize are:

αk = min arg min ϕ(α) with ϕ(α) := f (xk + α dk ) , (26)

f (xk + α dk ) ≤ f (xk ) + γ α ∇ f (xk ) dk (27)

Lemma 1 If αk is given by (28) and αk < αkmax then

where we used the standard Descent Lemma in the inequality. 

5 Properties of the FW method and its variants

5.1 The FW gap

A key parameter often used as a measure of convergence is the FW gap

where we used [y − πC (x)] [x − πC (x)] ≤ 0 in the first inequality, with x = xk − ∇ f (xk )

First-order stationarity conditions are equivalent to −∇ f (x) ∈ NC (x), or

5.2 O(1/k) rate for convex objectives

G(xk ) = −∇ f (xk ) dk ≥ Ldk 2 , (42)

We can now proceed with the proof of the main result.

The AFW direction, introduced in Wolfe (1970), is hence given by

Table 1 FW method and variants covered in this review

FW dk = dkF W = sk − xk , sk ∈ arg max{∇ f (xk ) x : x ∈ C} −

dkP F W = dkF W + dkAS = sk − vk , (50)

with sk defined in (3).

xk+1 ∈ arg min f (y) , (51)

5.4 Sparse approximation properties

xk ∈ aff(Ak ) with Ak ⊂ A, |Ak | ≤ k + 1 . (52)

5.5 Affine invariance

It is easy to prove that κ f ,C ≤ L D 2 if D is the diameter of C. In the special case where

5.6 Support identification for the AFW

λi (x) = ∇ f (x) (ei − x) (55)

contains the support of x∗ ,

supp(x∗ ) := {i ∈ [1 : n] : xi∗ > 0} .

Lemma 3 Let x∗ be a stationary point for f on the boundary of n−1 , δmin =

j ≤ min{n − |I (x∗ )|, | supp(xk )| − 1}

we have supp(xk+ j ) ⊆ I (x∗ ) and xk+ j − x∗ 1 < r∗ .

E C (x∗ ) := arg min ∇ f (x∗ ) x , (58)

5.7 Inexact linear oracle

for a sequence {δk } of non negative approximation errors.

6 Improved rates for strongly convex objectives

6.1 Linear convergence under an angle condition

Lemma 4 If the step k is a good step and (64) holds, then

Proof If the step k is a full FW step then Lemma 2 entails h k+1 ≤ 1

f ∗ = f (x∗ ) ≥ f (xk ) + ∇ f (xk ) (x∗ − xk ) + μ2 xk − x∗ 2

As a corollary, under (64) we have the rate

Theorem 3 If x∗ ∈ ri(C), then

C = min{dist(v, H ) : v ∈ V (C) , H ∈ H(C), v ∈

where we used the standard Descent Lemma in the inequality.