
Computational Optimal Transport

Gabriel Peyré (CNRS and DMA, ENS)
Marco Cuturi (CREST, ENSAE)
Contents

1 Introduction
1.1 Notations

2 Theoretical Foundations
2.1 Histograms and Measures
2.2 Assignment and Monge Problem
2.3 Kantorovich relaxation
2.4 Metric Properties of Optimal Transport
2.5 Dual Problem
2.6 Special Cases

3 Algorithmic Foundations
3.1 The Kantorovich Linear Programs
3.2 C-transforms
3.3 Complementary Slackness
3.4 Vertices of the Transportation Polytope
3.5 A Heuristic Description of the Network Simplex
3.6 Matching problems

4 Entropic Regularization of Optimal Transport
4.1 Entropic Regularization
4.2 Sinkhorn's Algorithm and its Convergence
4.3 Speeding-up Sinkhorn's Iterations
4.4 Regularized Dual and Log-domain Computations
4.5 Regularized Approximations of the Optimal Transport Cost
4.6 Generalized Sinkhorn

5 Semi-discrete Optimal Transport
5.1 c-transform and c̄-transform
5.2 Semi-discrete Formulation
5.3 Entropic Semi-discrete Formulation
5.4 Stochastic Optimization Methods

6 W1 Optimal Transport
6.1 W1 on Metric Spaces
6.2 W1 on Euclidean Space
6.3 W1 on a Graph

7 Dynamic Formulations
7.1 Continuous Formulation
7.2 Discretization on Uniform Staggered Grids
7.3 Proximal Solvers
7.4 Dynamical Unbalanced OT
7.5 More General Mobility Functionals
7.6 Dynamic Formulation over the Paths Space

8 Statistical Divergences
8.1 ϕ-Divergences
8.2 Integral Probability Metrics
8.3 Wasserstein Spaces are not Hilbertian
8.4 Empirical Estimators for OT, MMD and ϕ-divergences
8.5 Entropic Regularization: between OT and MMD

9 Variational Wasserstein Problems
9.1 Differentiating the Wasserstein Loss
9.2 Wasserstein Barycenters, Clustering and Dictionary Learning
9.3 Gradient Flows
9.4 Minimum Kantorovitch Estimators

10 Extensions of Optimal Transport
10.1 Multi-marginal Problems
10.2 Unbalanced Optimal Transport
10.3 Problems with Extra Constraints on the Couplings
10.4 Sliced Wasserstein Distance and Barycenters
10.5 Transporting Vectors and Matrices
10.6 Gromov-Wasserstein Distances

References
Abstract

Optimal Transport (OT) is a mathematical gem at the interface be-


tween probability, analysis and optimization. The goal of that theory
is to define geometric tools that are useful to compare probability dis-
tributions. Let us briefly sketch some key ideas using a vocabulary that
was first introduced by Monge two centuries ago: a probability distribu-
tion can be thought of as a pile of sand. Peaks indicate where likely ob-
servations are to appear. Given a pair of probability distributions—two
different piles of sand—there are, in general, multiple ways to morph,
transport or reshape the first pile so that it matches the second. To
every such transport we associate an a “global” cost, using the “local”
consideration of how much it costs to move a single grain of sand from
one location to another. The goal of optimal transport is to find the
least costly transport, and use it to derive an entire geometric toolbox
for probability distributions.
Despite this relatively abstract description, optimal transport the-
ory answers many basic questions related to the way our economy
works: In the “mines and factories” problem, the sand is distributed
across an entire country, each grain of sand represents a unit of a use-
ful raw resource; the target pile indicates where those resources are
needed, typically in factories, where they are meant to be processed. In
that scenario, one seeks the least costly way to move all these resources,
knowing the entire logistic cost matrix needed to ship resources from
any storage point to any factory.
Transporting optimally two abstract distributions is also extremely
relevant for mathematicians, in the sense that it defines a rich geometric
structure on the space of probability distributions. That structure is
canonical in the sense that it borrows, in arguably the most natural way,
key geometric properties of the underlying “ground” space on which
these distributions are defined. For instance, when the underlying space
is Euclidean, key concepts such as interpolation, barycenters, convexity
or gradients of functions extend very naturally to distributions when
endowed with an optimal transport geometry. OT has a rich and varied
history. Earlier contributions originated from Monge’s work in the 18th
century, to be later rediscovered under a different formalism by Tolstoi

in the 1920’s, Kantorovich, Hitchcock and Koopmans in the 1940’s.


The problem was solved numerically by Dantzig in 1949 and others
in the 1950’s within the framework of linear programming, paving the
way for major industrial applications in the second half of the 20th
century. OT was later rediscovered under a different light by analysts
in the 90’s, following important work by Brenier and others, as well as
in the computer vision/graphics fields under the name of earth mover’s
distances. Recent years have witnessed yet another revolution in the
spread of OT, thanks to the emergence of approximate solvers that
can scale to sizes and dimensions that are relevant to data sciences.
Thanks to this newfound scalability, OT is being increasingly used to
unlock various problems in imaging sciences (such as color or texture
processing), computer vision and graphics (for shape manipulation)
or machine learning (for regression, classification and density fitting).
This paper reviews OT with a bias toward numerical methods and
their applications in data sciences, and sheds light on the theoretical
properties of OT that make it particularly useful for some of these
applications. Our focus is on the recent wave of efficient algorithms
that have helped translate attractive theoretical properties onto elegant
and scalable tools for a wide variety of applications. We also give a
prominent place to the many generalizations of OT that have been
proposed in but a few years, and connect them with related approaches
originating from statistical inference, kernel methods and information
theory. A companion website 1 provides bibliographical and numerical
resources, and in particular gives access to all the open source software
needed to reproduce the figures of this article.

1 Introduction

Optimal Transport (OT) has a long and rich history, initiated by Monge
in the 18th century [Monge, 1781], then stated in its modern form
by Kantorovich [1942], and revitalized in the 90s by a flurry of major
mathematical results such as that of Brenier [1991]. Several reference
books have been written on this topic, including the two monographs
by Villani (2003, 2009), those by Rachev and Rüschendorf [1998a,
1998b] and more recently by Santambrogio [2015]. As exemplified by
these books, the more formal and abstract concepts in that theory
deserve in and by themselves several hundred pages. Now that opti-
mal transport has gradually established itself as an applied tool (for
instance in economics, see Galichon [2016]), we have tried to balance
that rich literature with a computational viewpoint, centered on appli-
cations to data science, notably imaging sciences and machine learning.
We follow in that sense the motivation of the recent review by Kolouri
et al. [2017], while trying to cover more ground. Ultimately, our goal is to
present an overview of the main theoretical insights that support the
practical effectiveness of OT, and to explain how to turn these insights
into fast computational schemes.
The main body of Chapters 2, 3, 4, 9 and 10 is devoted solely to the


study of the geometry defined by OT on discrete histograms. Targeting more advanced readers, we also give in the same chapters a deeper
and more mathematical exposition using discrete measures in light gray
boxes. This corresponds to introducing the support (positions) associ-
ated with the bins of the histogram, giving a second important degree
of freedom when comparing probability measures (not only variable
weights but also variable locations for such weights). Lastly, the third
and most technical layer of exposition is indicated in dark gray boxes,
and deals with arbitrary measures that need not be discrete (including
in particular those with a density w.r.t. a base measure). This is tra-
ditionally the default setting for most classic textbooks on OT theory.
Chapters 5 to 8 deal with arbitrary measures, and are thus targeting a
more mathematically-inclined audience.

1.1 Notations

• 1_{n,m}: matrix of R^{n×m} with all entries identically set to 1; 1_n: vector of ones.

• I_n: identity matrix of size n × n.

• For u ∈ R^n, diag(u) is the n × n matrix with diagonal u and zero elsewhere.

• Σ_n: probability simplex with n bins, namely the set of probability vectors in R^n_+.

• (a, b): histograms in the simplices Σ_n × Σ_m.

• (α, β): measures, defined on spaces (X, Y).

• ρ_α = dα/dx: density of a measure α with respect to the Lebesgue measure.

• (α = Σ_i a_i δ_{x_i}, β = Σ_j b_j δ_{y_j}): discrete measures supported on x_1, ..., x_n ∈ X and y_1, ..., y_m ∈ Y.

• c(x, y): ground cost, with associated pairwise cost matrix C_{i,j} = (c(x_i, y_j))_{i,j} evaluated on the supports of α, β.

• π: coupling measure between α and β, namely such that for any A ⊂ X, π(A × Y) = α(A), and for any subset B ⊂ Y, π(X × B) = β(B). For discrete measures, π = Σ_{i,j} P_{i,j} δ_{(x_i, y_j)}.

• U(α, β): set of coupling measures; for discrete measures, U(a, b).

• T : X → Y: Monge map, typically such that T]α = β.

• R(α, β): set of admissible dual potentials; for discrete measures, R(a, b).

• (α_t)_{t=0}^1: dynamic measures, with α_{t=0} = α_0 and α_{t=1} = α_1.

• v: speed for Benamou–Brenier formulations; J = αv: momentum.

• (f, g): dual potentials; for discrete measures, (f, g) are dual variables.

• (u, v) = (e^{f/ε}, e^{g/ε}): Sinkhorn scalings.

• K := e^{−C/ε}: Gibbs kernel for Sinkhorn.

• s: flow for W_1-like problems (optimization under divergence constraints).

• L_C(a, b) and L_c(α, β): value of the optimization problem associated to the OT with cost C (histograms) and c (arbitrary measures).

• W_p(a, b) and W_p(α, β): p-Wasserstein distance associated to a ground distance matrix D (histograms) and a distance d (arbitrary measures).

• λ ∈ Σ_S: weight vector used to compute the barycenters of S measures.

• ⟨·, ·⟩: the usual Euclidean dot product between vectors; for two matrices of the same size A and B, ⟨A, B⟩ := tr(A^T B) is the Frobenius dot product.

• f ⊕ g(x, y) := f(x) + g(y), for two functions f : X → R, g : Y → R; this defines f ⊕ g : X × Y → R.

• α ⊗ β: product measure on X × Y, i.e. ∫_{X×Y} g(x, y) d(α ⊗ β)(x, y) := ∫_{X×Y} g(x, y) dα(x) dβ(y).

• a ⊗ b := ab^T ∈ R^{n×m}.

• f ⊕ g := f 1_m^T + 1_n g^T ∈ R^{n×m}, for two vectors f ∈ R^n, g ∈ R^m.

• u ⊙ v := (u_i v_i)_i ∈ R^n for (u, v) ∈ (R^n)².


2 Theoretical Foundations

This chapter describes the basics of optimal transport, introducing first


the notion of optimal couplings between probability vectors (a, b), then
relating this computation to the transport between discrete measures
(α, β) defined in embedding spaces (X , Y), and lastly covering the gen-
eral setting of arbitrary measures. On a first reading, one can focus only
on computations between probability vectors, namely histograms,
which is the only requisite to implement the algorithms detailed in Chapters 3 and 4. More experienced and math-inclined readers will be able
to grasp more intuition and a more general formulation (e.g. in order to
move the positions of point clouds or handle measures with continuous
densities) using the more general measure setting.

2.1 Histograms and Measures

We will use interchangeably the terms histogram and probability vector for
any element a ∈ Σ_n that belongs to the probability simplex

    Σ_n := { a ∈ R^n_+ : Σ_{i=1}^n a_i = 1 }.


A large part of this work focuses exclusively on the study of the geom-
etry induced by optimal transport on the simplex. For more advanced
readers, we give a deeper and more mathematical exposition using the
formalism of discrete measures, which corresponds to handling both the
weights contained in a probability vector and positions (the support of
the measure) associated with the bins of the histogram. This formal-
ism can be used implicitly to handle histograms of arbitrary size n and
with possibly varying positions. Lastly, the third and most technical
layer of exposition deals with arbitrary measures (i.e. which need not
be discrete, typically continuous with respect to a base measure).

Remark 2.1 (Discrete measures). A discrete measure with weights


a and locations x_1, ..., x_n ∈ X reads

    α = Σ_{i=1}^n a_i δ_{x_i},     (2.1)

where δx is the Dirac at position x, intuitively a unit of mass which


is infinitely concentrated at location x. Such a measure describes
a probability measure if, additionally, a ∈ Σ_n, and more generally
a positive measure if each of the “weights” in the vector a is nonnegative.

Remark 2.2 (General measures). A convenient feature of OT is


that it can deal with discrete and continuous “objects” within the
same framework. Such objects only need to be modelled as mea-
sures. This corresponds to the notion of Radon measures M(X )
on the space X . The formal definition of that set requires that X is
equipped with a distance, usually denoted d, because one can only
access a measure by “testing” (integrating) it against continuous
functions, denoted f ∈ C(X ).
Integration of f ∈ C(X) against a discrete measure α computes a sum:

    ∫_X f(x) dα(x) = Σ_{i=1}^n a_i f(x_i).

Figure 2.1: Schematic display of discrete distributions α = Σ_{i=1}^n a_i δ_{x_i} (red corresponds to the empirical uniform distribution a_i = 1/n, and blue to arbitrary distributions) and densities dα(x) = ρ_α(x)dx (in violet), in both 1-D and 2-D. Discrete distributions in 1-D are displayed using vertical segments (with length equal to a_i) and in 2-D using point clouds (radius equal to a_i). Panels: discrete d = 1, discrete d = 2, density d = 1, density d = 2.

More general measures, for instance on X = R^d (where d ∈ N* is the dimension), can have a density dα(x) = ρ_α(x)dx w.r.t. the Lebesgue measure, often denoted ρ_α = dα/dx, which means that

    ∀ h ∈ C(R^d),   ∫_{R^d} h(x) dα(x) = ∫_{R^d} h(x) ρ_α(x) dx.

An arbitrary measure α ∈ M(X) (which need not have a density nor be a sum of Diracs) is defined by the fact that it can be integrated against any continuous function f ∈ C(X) to obtain ∫_X f(x) dα(x) ∈ R. If X is not compact, one should also impose that f has compact support or at least a 0 limit at infinity. Measures are thus in some sense “less regular” than functions, but more regular than distributions (which are dual to smooth functions). For instance, the derivative of a Dirac is not a measure. We denote by M_+(X) the set of all positive measures on X. The set of probability measures is denoted M1+(X), which means that any α ∈ M1+(X) is positive and that α(X) = ∫_X dα = 1. Figure 2.1 offers a visualization of the different classes of measures, beyond histograms, considered in this work.

2.2 Assignment and Monge Problem

Given a cost matrix (C_{i,j})_{i∈⟦n⟧, j∈⟦m⟧}, assuming n = m, the optimal
assignment problem seeks a bijection σ in the set Perm(n) of permutations of n elements solving

    min_{σ ∈ Perm(n)}  (1/n) Σ_{i=1}^n C_{i,σ(i)}.     (2.2)

Figure 2.2: (left) blue dots from measure α and red dots from measure β are pairwise equidistant. Hence, either matching σ = (1, 2) (full line) or σ = (2, 1) (dotted line) is optimal. (right) a Monge map can associate the blue measure α to the red measure β. The weights a_i are displayed proportionally to the area of the disk marked at each location. The mapping here is such that T(x_1) = T(x_2) = y_2, T(x_3) = y_3, whereas for 4 ≤ i ≤ 7 we have T(x_i) = y_1.

One could naively evaluate the cost function above using all permuta-
tions in the set Perm(n). However, that set has size n!, which is gigantic
even for small n. Consider for instance that such a set has more than
10^100 elements [Dantzig, 1983] when n is as small as 70. That problem
can therefore only be solved if there exist efficient algorithms to opti-
mize that cost function over the set of permutations, which will be the
subject of §3.6.
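To make the size issue concrete, here is a minimal numerical sketch (assuming NumPy and SciPy are available; the point clouds and cost matrix are arbitrary illustrations, not taken from the text) comparing a brute-force enumeration of Perm(n) with a polynomial-time assignment solver of the kind discussed in §3.6.

```python
import numpy as np
from itertools import permutations
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 6                                      # small enough for brute force
x = rng.normal(size=(n, 2))                # support of the first point cloud
y = rng.normal(size=(n, 2))                # support of the second point cloud
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean costs

# Brute force over the n! permutations: only tractable for tiny n.
brute = min(C[np.arange(n), list(sigma)].mean()
            for sigma in permutations(range(n)))

# Hungarian-type solver (polynomial time), cf. the dedicated algorithms of §3.6.
rows, cols = linear_sum_assignment(C)
fast = C[rows, cols].mean()

print(np.isclose(brute, fast))             # True: both reach the optimum of (2.2)
```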

Remark 2.3 (Uniqueness). Note that the optimal assignment problem


may have several optimal solutions. Suppose for instance that n = m =
2 and that the matrix C is the pairwise distance matrix between the
4 corners of a 2-dimensional square of side length 1, as represented in
the left plot in Figure 2.2. In that case only two assignments exist, and
they share the same cost.

Remark 2.4 (Monge problem between discrete measures). For discrete measures

    α = Σ_{i=1}^n a_i δ_{x_i}   and   β = Σ_{j=1}^m b_j δ_{y_j},     (2.3)

the Monge problem [Monge, 1781] seeks a map that associates to
each point x_i a single point y_j, and which must push the mass
of α toward the mass of β, which is to say that such a map T :
{x_1, ..., x_n} → {y_1, ..., y_m} must verify

    ∀ j ∈ ⟦m⟧,   b_j = Σ_{i : T(x_i) = y_j} a_i,     (2.4)

which we write in compact form as T]α = β. This map should
minimize some transportation cost, which is parameterized by a
function c(x, y) defined for points (x, y) ∈ X × Y:

    min_T  { Σ_i c(x_i, T(x_i)) : T]α = β }.     (2.5)

Such a map between discrete points can of course be encoded,
assuming all x's and y's are distinct, using indices σ : ⟦n⟧ → ⟦m⟧
so that j = σ(i), and the mass conservation is written as

    Σ_{i ∈ σ^{-1}(j)} a_i = b_j.

In the special case when n = m and all weights are uniform, that is
ai = bj = 1/n, then the mass conservation constraint implies that
T is a bijection, such that T (xi ) = yσ(i) , and the Monge problem is
equivalent to the optimal matching problem (2.2) where the cost
matrix is
C_{i,j} := c(x_i, y_j).
When n ≠ m, note that, optimality aside, Monge maps may not
even exist from one empirical measure to another. This happens
when their weight vectors are not compatible, which is always the
case when the target measure has more points than the source

measure. For instance, the right plot in Figure 2.2 shows an (op-
timal) Monge map between α and β, but there is no Monge map
from β to α.

Remark 2.5 (Push-forward operator). For some continuous map


T : X → Y, we define the push-forward operator T] : M(X) → M(Y). For discrete measures (2.1), the push-forward operation consists simply in moving the positions of all the points in the support of the measure:

    T]α := Σ_i a_i δ_{T(x_i)}.
For more general measures, for instance those with a density,
the notion of push-forward plays a fundamental role in describing spatial
modifications of probability measures. The formal definition reads
as follows.

Definition 2.1 (Push-forward). For T : X → Y, the push forward


measure β = T]α ∈ M(Y) of some α ∈ M(X) reads

    ∀ h ∈ C(Y),   ∫_Y h(y) dβ(y) = ∫_X h(T(x)) dα(x).     (2.6)

Equivalently, for any measurable set B ⊂ Y, one has

β(B) = α({x ∈ X : T (x) ∈ B}). (2.7)

Note that T] preserves positivity and total mass, so that if α ∈


M1+ (X ) then T] α ∈ M1+ (Y).

Intuitively, a measurable map T : X → Y can be interpreted


as a function “moving” a single point from a measurable space
to another. The more general extension T] can now “move” an
entire probability measure on X towards a new probability measure
on Y. The operator T] “pushes forward” each elementary mass
of a measure α on X by applying the map T to obtain then an
elementary mass in Y, to build on aggregate a new measure on
Y, written T]α. Note that such a push-forward T] : M1+(X) →

M1+ (Y) is a linear operator between measures in the sense that for
two measures α1 , α2 on X , T] (α1 + α2 ) = T] α1 + T] α2 .
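As a small illustration, here is a hedged NumPy sketch of the discrete push-forward: the map T and the test function h below are arbitrary choices of ours, and the final check is just Definition 2.1 specialized to discrete measures.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=(n, 2))          # support of a discrete measure alpha
a = np.full(n, 1.0 / n)              # its weights

def T(points):                       # an arbitrary (here affine) map X -> Y
    return 2.0 * points + np.array([1.0, -3.0])

# Push-forward of a discrete measure: weights unchanged, support moved.
x_pushed, a_pushed = T(x), a.copy()

# Check (2.6) on a test function h: integrating h against T]alpha equals
# integrating h o T against alpha.
h = lambda p: np.cos(p[:, 0]) + p[:, 1] ** 2
assert np.isclose(np.sum(a_pushed * h(x_pushed)), np.sum(a * h(T(x))))
```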

Remark 2.6 (Push-forward for densities). Explicitly doing the


change of variable in formula (2.6) for measures with densities
(ρα , ρβ ) on Rd (assuming T is smooth and a bijection) shows that
a push-forward acts on densities linearly as a change of variables
in the integration formula, indeed

ρα (x) = | det(T 0 (x))|ρβ (T (x)) (2.8)

where T 0 (x) ∈ Rd×d is the Jacobian matrix of T (the matrix formed


by taking the gradient of each coordinate of T ). This implies, de-
noting y = T(x),

    |det(T′(x))| = ρ_α(x) / ρ_β(y).
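A one-dimensional sanity check of (2.8), assuming SciPy's scipy.stats is available; the affine map and the Gaussian densities below are illustrative choices of ours, not taken from the text.

```python
import numpy as np
from scipy.stats import norm

# Push a standard Gaussian alpha through T(x) = 2x + 1, so T'(x) = 2 and
# beta = T]alpha is the Gaussian N(1, 2^2).
T, dT = (lambda x: 2.0 * x + 1.0), 2.0
rho_alpha = norm(loc=0.0, scale=1.0).pdf       # density of alpha
rho_beta = norm(loc=1.0, scale=2.0).pdf        # density of T]alpha

x = np.linspace(-3.0, 3.0, 7)
# (2.8): rho_alpha(x) = |det T'(x)| * rho_beta(T(x))
assert np.allclose(rho_alpha(x), abs(dT) * rho_beta(T(x)))
```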

Remark 2.7 (Monge problem between arbitrary measures). The Monge
problem (2.5) is extended to the setting of two arbitrary probability measures (α, β) on two spaces (X, Y) as finding a map
T : X → Y that minimizes

    min_T  { ∫_X c(x, T(x)) dα(x) : T]α = β }.     (2.9)

The constraint T] α = β means that T pushes forward the mass of


α to β, and makes use of the push-forward operator (2.6).

Remark 2.8 (Push-forward vs. pull-back). The push-forward T] of
measures should not be confused with the pull-back of functions
T^] : C(Y) → C(X), which corresponds to the “warping” of functions. It is the linear map defined, for g ∈ C(Y), by T^]g = g ∘ T.
Push-forward and pull-back are actually adjoint to one another, in the sense that

    ∀ (α, g) ∈ M(X) × C(Y),   ∫_Y g d(T]α) = ∫_X (T^]g) dα.
Figure 2.3: Comparison of the push-forward T] (of measures) and the pull-back T^] (of functions).

It is important to realize that even if (α, β) have densities (ρ_α, ρ_β),
T]α is not equal to T^]ρ_β, because of the presence of the Jacobian
in (2.8). This explains why OT should be used with caution to
perform image registration, because it does not operate as an image
warping method. Figure 2.3 illustrates the distinction between the
push-forward and pull-back operators.

Remark 2.9 (Measures and random variables). Radon measures


can also be viewed as representing the distributions of random
variables. A random variable X on X is actually a map X : Ω → X
from some abstract (often un-specified) probabilized space (Ω, P),
and its distribution α is the Radon measure α ∈ M1+(X) such that
P(X ∈ A) = α(A) = ∫_A dα(x). Equivalently, it is the push-forward
of P by X, α = X]P. Applying another push-forward β = T]α
for T : X → Y, following (2.6), is equivalent to defining another
random variable Y = T (X) : ω ∈ Ω → T (X(ω)) ∈ Y , so that β is
the distribution of Y . Drawing a random sample y from Y is thus
simply achieved by computing y = T (x) where x is drawn from X.

2.3 Kantorovich relaxation

Limitations of the Monge Problem The assignment problem has sev-


eral limitations in practical settings, also encountered when using the
Monge problem. Indeed, because the assignment problem is formulated
as a permutation problem, it can only be used to compare two point
clouds of the same size. A direct generalization to discrete measures
with non-uniform weights can be carried out using Monge's formalism
of push-forward maps, but that formulation may also be degenerate if
there are no feasible solutions satisfying the mass conservation
constraint (2.4) (see the end of Remark 2.4). Additionally, the assignment problem (2.5) is combinatorial, whereas the feasible set for the
Monge problem (2.9), consisting of all maps T whose push-forward satisfies the mass conservation constraint T]α = β, is non-convex. Both are therefore
difficult to solve in their original formulation.

Kantorovich’s relaxation The key idea of Kantorovich [1942] is to


relax the deterministic nature of transportation, namely the fact that
a source point x_i can only be assigned to another point, or transported to
one and only one location T(x_i). Kantorovich proposes instead that
the mass at any point x_i be potentially dispatched across several locations. Kantorovich moves away from the idea that mass transportation should be “deterministic” to consider instead a “probabilistic” (or
“fuzzy”) transportation, which allows what is now commonly known as
“mass splitting” from a source towards several targets. This flexibility
is encoded using, in place of a permutation σ or a map T, a coupling
matrix P ∈ R^{n×m}_+, where P_{i,j} describes the amount of mass flowing
from bin i towards bin j (or from point x_i towards point y_j in
the formalism of discrete measures (2.3)). Admissible couplings admit a
far simpler characterization than Monge maps:

    U(a, b) := { P ∈ R^{n×m}_+ : P 1_m = a  and  P^T 1_n = b },     (2.10)

where we used the following matrix-vector notation:

    P 1_m = ( Σ_j P_{i,j} )_i ∈ R^n   and   P^T 1_n = ( Σ_i P_{i,j} )_j ∈ R^m.

The set of matrices U(a, b) is bounded, defined by n + m equality


constraints, and therefore a convex polytope (the convex hull of a finite
set of matrices).
Additionally, whereas the Monge formulation (as illustrated in the
right plot of Figure 2.2) was intrinsically asymmetric, Kantorovich's relaxed formulation is always symmetric, in the sense that a coupling P
is in U(a, b) if and only if P^T is in U(b, a).
Kantorovich’s optimal transport problem now reads

    L_C(a, b) := min_{P ∈ U(a,b)} ⟨C, P⟩,   where ⟨C, P⟩ := Σ_{i,j} C_{i,j} P_{i,j}.     (2.11)

This is a linear program (see Chapter 3), and as is usually the case
with such programs, its solutions are not necessarily unique.
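Since (2.11) is a finite linear program, it can be handed to any generic LP solver. The sketch below is a minimal illustration only (assuming NumPy and SciPy; the helper name ot_lp and the toy data are ours, not from the text); the dedicated network solvers described in Chapter 3 are far more efficient in practice.

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(a, b, C):
    """Solve the Kantorovich problem (2.11) with a generic LP solver."""
    n, m = C.shape
    # Equality constraints P 1_m = a and P^T 1_n = b on the row-major vectorization of P.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums equal a
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column sums equal b
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None),
                  method="highs")
    return res.fun, res.x.reshape(n, m)

# Small random example (arbitrary illustration).
rng = np.random.default_rng(0)
n, m = 4, 5
a = np.full(n, 1.0 / n)
b = rng.random(m); b /= b.sum()
C = rng.random((n, m))
cost, P = ot_lp(a, b, C)
print(cost, P.sum(axis=1), P.sum(axis=0))  # optimal cost; marginals ≈ a and b
```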

Remark 2.10 (Mines and Factories). The Kantorovich problem finds


a very natural illustration in the following resource allocation problem
(see also Hitchcock [1941]). Suppose that an operator runs n warehouses
and m factories. Each warehouse contains a valuable raw material that
is needed by the factories to run properly. More precisely, each ware-
house is indexed with an integer i and contains ai units of the raw
material. These raw materials must be all moved to the factories, with
a prescribed quantity bj needed at factory j to function properly. To
transfer resources from a warehouse i to a factory j, the operator can
use a transportation company that will charge Ci,j to move a single
unit of the resource from location i to location j. We assume that the
transportation company has the monopoly to transport goods, and ap-
plies the same linear pricing scheme to all actors of the economy: the
cost of shipping a units of the resource from i to j is equal to a × Ci,j .
Faced with the problem described above, the operator chooses to
solve the linear program described in Equation (2.11) to obtain a trans-
portation plan P⋆ that quantifies, for each pair i, j, the amount of goods
P_{i,j} that must be transported from warehouse i to factory j. The operator
pays on aggregate a total of ⟨P⋆, C⟩ to the transportation company to
execute that plan.

Permutation matrices as couplings For a permutation σ ∈ Perm(n),
we write P_σ for the corresponding permutation matrix,

    ∀ (i, j) ∈ ⟦n⟧²,   (P_σ)_{i,j} = 1/n if j = σ_i, and 0 otherwise.     (2.12)

One can check that in that case

    ⟨C, P_σ⟩ = (1/n) Σ_{i=1}^n C_{i,σ_i},

which shows that the assignment problem (2.2) can be recast as a
Kantorovich problem (2.11) where the couplings P are restricted to be
exactly permutation matrices:

    min_{σ ∈ Perm(n)} (1/n) Σ_{i=1}^n C_{i,σ(i)} = min_{σ ∈ Perm(n)} ⟨C, P_σ⟩.

Next, one can easily check that the set of permutation matrices is
strictly included in the so-called Birkhoff polytope U(1_n/n, 1_n/n). Indeed, for any permutation σ we have P_σ 1_n = 1_n/n and P_σ^T 1_n = 1_n/n,
whereas 1_n 1_n^T / n² is a valid coupling but not a permutation matrix.
Therefore, one has naturally that

    L_C(1_n/n, 1_n/n) ≤ min_{σ ∈ Perm(n)} ⟨C, P_σ⟩.

The following proposition shows that these problems result in fact


in the same optimum, namely that one can always find a permuta-
tion matrix that minimizes Kantorovich’s problem (2.11) between two
uniform measures a = b = 1n /n, which shows that the Kantorovich
relaxation is tight when considered on assignment problems. Figure 2.4
shows on the left a 2-D example of optimal matching corresponding to
this special case.

Proposition 2.1 (Kantorovich for matching). If m = n and a = b =


1_n/n, then there exists an optimal solution for Problem (2.11), P_{σ⋆},
which is a permutation matrix associated to an optimal permutation
σ⋆ ∈ Perm(n) for Problem (2.2).

Figure 2.4: Comparison of optimal matching and generic couplings. A black segment between x_i and y_j indicates a non-zero element P_{i,j} in the displayed optimal coupling solving (2.11). Left: optimal matching, corresponding to the setting of Proposition 2.1 (empirical measures with the same number n = m of points). Right: these two weighted point clouds cannot be matched; instead a Kantorovich coupling can be used to associate two arbitrary discrete measures.

Proof. Birkhoff’s theorem [1946] states that the set of extremal points
of U(1n /n, 1n /n) is equal to the set of permutation matrices. A funda-
mental theorem of linear programming [Bertsimas and Tsitsiklis, 1997,
Theorem 2.7] states that the minimum of a linear objective in a non-
empty polyhedron, if finite, is reached at an extremal point of the
polyhedron.
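The tightness stated in Proposition 2.1 is easy to observe numerically. The following sketch assumes the hypothetical ot_lp helper from the sketch after (2.11) is in scope, and uses SciPy's linear_sum_assignment for the assignment problem (2.2).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n = 6
C = rng.random((n, n))
a = b = np.full(n, 1.0 / n)

lp_cost, P = ot_lp(a, b, C)              # Kantorovich problem (2.11)
rows, cols = linear_sum_assignment(C)    # assignment problem (2.2)
assign_cost = C[rows, cols].mean()

print(np.isclose(lp_cost, assign_cost))  # True: the relaxation is tight
print(np.count_nonzero(P > 1e-9))        # generically n non-zero entries, each ≈ 1/n
```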

Remark 2.11 (Kantorovich problem between discrete measures).


For discrete measures α, β of the form (2.3), we store in the
matrix C all pairwise costs between points in the supports of α, β,
namely C_{i,j} := c(x_i, y_j), to define

    L_c(α, β) := L_C(a, b).     (2.13)

Therefore, the Kantorovich formulation of optimal transport be-


tween discrete measures is the same as the problem between their
associated probability weight vectors a, b except that the cost ma-
trix C depends on the support of α and β. The notation Lc (α, β)
is however useful in some situations, because it makes explicit the

dependency with respect to both probability weights and supporting points, the latter being exclusively considered through the cost function c.

Figure 2.5: Schematic view of the input measures (α, β) and couplings U(α, β) encountered in the three main scenarios for Kantorovich OT: discrete, semi-discrete and continuous. Chapter 5 is dedicated to the semi-discrete setup.

Remark 2.12 (Applications of optimal assignment and couplings).


The optimal transport itself (either as a coupling P or a Monge-
map T when it exists) has found many applications in data
sciences, and in particular image processing. It has for instance
been used for contrast equalization [Delon, 2004] and texture synthesis [Gutierrez et al., 2017]. A significant part of the applications of
OT to imaging sciences is for image matching [Zhu et al., 2007,
Wang et al., 2013, Museyko et al., 2009, Li et al., 2013], image
fusion [Courty et al., 2016], medical imaging [Wang et al., 2011],
shape registration [Makihara and Yagi, 2010, Lai and Zhao,
2014, Su et al., 2015] and image watermarking [Mathon et al., 2014].
In astrophysics, OT has been used for reconstructing the early
universe [Frisch et al., 2002]. Optimal transport has been used for
music transcription [Flamary et al., 2016]. It also finds numerous

applications in economics to interpret matching data [Galichon,


2016]. Lastly, let us note that the computation of transportation
maps using OT techniques (or inspired by them) is
also useful to perform sampling [Reich, 2013, Oliver, 2014] and
Bayesian inference [Kim et al., 2013, El Moselhy and Marzouk,
2012].

Remark 2.13 (Kantorovich problem between arbitrary measures).


The definition of Lc in (2.13) can be extended to arbitrary
measures by considering couplings π ∈ M1+ (X × Y) which are
joint distributions over the product space. The discrete case is a
special situation where one imposes this product measure to be
of the form π = Σ_{i,j} P_{i,j} δ_{(x_i, y_j)}. In the general case, the mass
conservation constraint (2.10) should be rewritten as a marginal
constraint on joint probability distributions:

    U(α, β) := { π ∈ M1+(X × Y) : PX]π = α  and  PY]π = β }.     (2.14)
Here PX] and PY] are the push-forwards (see Definition 2.1) by
the projections PX(x, y) = x and PY(x, y) = y. Figure 2.5 shows
a schematic visualization of the coupling constraints for different
classes of problems (discrete measures and densities). Using (2.7),
these marginal constraints are equivalent to imposing that π(A ×
Y) = α(A) and π(X × B) = β(B) for sets A ⊂ X and B ⊂ Y. The
Kantorovich problem (2.11) is then generalized as

    L_c(α, β) := min_{π ∈ U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y).     (2.15)

This is an infinite-dimensional linear program over a space of mea-


sures. If (X , Y) are compact spaces and c is continuous, then it is
easy to show that it always has solutions. Indeed, U(α, β) is compact
for the weak topology of measures (see Remark 2.2), π ↦ ∫ c dπ
is a continuous function for this topology, and the constraint set
is non-empty (for instance α ⊗ β ∈ U(α, β)). Figure 2.6 shows examples of discrete and continuous optimal couplings solving (2.15).
Figure 2.6: Left: “continuous” coupling π solving (2.15) between two 1-D measures with density. The coupling is localized along the graph of the Monge map (x, T(x)) (displayed in black). Right: “discrete” coupling P solving (2.11) between two discrete measures of the form (2.3). The non-zero entries P_{i,j} are displayed as black disks at position (i, j) with radius proportional to P_{i,j}.

Figure 2.7: Four simple examples of optimal couplings between 1-D distributions, represented as maps above (arrows) and couplings below. Inspired by Levy and Schwindt [2017].

Figure 2.7 shows other examples of optimal 1-D couplings, involv-


ing discrete and continuous marginals.

Remark 2.14 (Probabilistic interpretation). Kantorovich's problem can be re-interpreted through the prism of random variables,
following Remark 2.9. Indeed, problem (2.15) is equivalent to

    min_{(X,Y)} { E_{(X,Y)}(c(X, Y)) : X ∼ α, Y ∼ β },

where (X, Y ) is a couple of random variables over X ×Y and X ∼ α


(resp Y ∼ β) means that the law of X (resp. Y ), represented as

a measure, must be α (resp. β). The law of the couple (X, Y ) is


then π ∈ U(α, β) over the product space X × Y.

2.4 Metric Properties of Optimal Transport

An important feature of OT is that it defines a distance between his-


tograms and probability measures as soon as the cost matrix satisfies
certain suitable properties. Indeed, OT can be understood as a canon-
ical way to lift a ground distance between points to a distance between
histograms or measures.
We first consider the case where, using a term first introduced
by Rubner et al. [2000], the “ground metric” matrix C is fixed, rep-
resenting substitution costs between bins, and shared across several
histograms we would like to compare. The following proposition states
that OT provides a meaningful distance between histograms supported
on these bins.

Proposition 2.2. We suppose n = m, and that for some p ≥ 1, C = D^p = (D_{i,j}^p)_{i,j} ∈ R^{n×n}, where D ∈ R^{n×n}_+ is a distance on ⟦n⟧, i.e.

(i) D is symmetric;

(ii) D_{i,j} = 0 if and only if i = j;

(iii) ∀ (i, j, k) ∈ ⟦n⟧³, D_{i,k} ≤ D_{i,j} + D_{j,k}.

Then

    W_p(a, b) := L_{D^p}(a, b)^{1/p}     (2.16)

(note that W_p depends on D) defines the p-Wasserstein distance on
Σ_n, i.e. W_p is symmetric, positive, W_p(a, b) = 0 if and only if a = b,
and it satisfies the triangle inequality

    ∀ a, b, c ∈ Σ_n,   W_p(a, c) ≤ W_p(a, b) + W_p(b, c).

Proof. Symmetry and definiteness of the distance are easy to prove:


since C = D^p has a null diagonal, W_p(a, a) = 0, with corresponding
optimal transport matrix P⋆ = diag(a); by the positivity of all off-diagonal elements of D^p, W_p(a, b) > 0 whenever a ≠ b (because in
this case an admissible coupling necessarily has a non-zero element
outside the diagonal); by symmetry of D^p, W_p is itself a
symmetric function.
To prove the triangle inequality of Wasserstein distances for arbitrary measures, Villani [2003, Theorem 7.3] uses the gluing lemma,
which stresses the existence of couplings with a prescribed structure.
In the discrete setting, the explicit construction of this glued coupling
is simple. Let a, b, c ∈ Σ_n. Let P and Q be two optimal solutions of
the transport problems between a and b, and b and c respectively. We
define b̄_j := b_j if b_j > 0 and b̄_j := 1 otherwise (or actually any
other value). We then define

    S := P diag(1/b̄) Q ∈ R^{n×n}_+.

We remark that S ∈ U(a, c) because

    S 1_n = P diag(1/b̄) Q 1_n = P (b/b̄) = P 1_{Supp(b)} = a,

where we denoted by 1_{Supp(b)} the indicator of the support of b, and we
used the fact that P 1_{Supp(b)} = P 1_n = a because necessarily P_{i,j} = 0 for
j ∉ Supp(b). Similarly one verifies that S^T 1_n = c.
The triangle inequality follows from

    W_p(a, c) = min_{P ∈ U(a,c)} ⟨P, D^p⟩^{1/p} ≤ ⟨S, D^p⟩^{1/p}
              = ( Σ_{i,k} D_{i,k}^p Σ_j P_{i,j} Q_{j,k} / b̄_j )^{1/p}
              ≤ ( Σ_{i,j,k} (D_{i,j} + D_{j,k})^p P_{i,j} Q_{j,k} / b̄_j )^{1/p}
              ≤ ( Σ_{i,j,k} D_{i,j}^p P_{i,j} Q_{j,k} / b̄_j )^{1/p} + ( Σ_{i,j,k} D_{j,k}^p P_{i,j} Q_{j,k} / b̄_j )^{1/p}
              = ( Σ_{i,j} D_{i,j}^p P_{i,j} Σ_k Q_{j,k} / b̄_j )^{1/p} + ( Σ_{j,k} D_{j,k}^p Q_{j,k} Σ_i P_{i,j} / b̄_j )^{1/p}
              = ( Σ_{i,j} D_{i,j}^p P_{i,j} )^{1/p} + ( Σ_{j,k} D_{j,k}^p Q_{j,k} )^{1/p}
              = W_p(a, b) + W_p(b, c).



The first inequality is due to the suboptimality of S, the second is the


usual triangle inequality for the distance D, and the third comes from
Minkowski’s inequality.
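The glued coupling S of the proof can be formed explicitly in a few lines. The sketch below (assuming the hypothetical ot_lp helper from the sketch after (2.11); supports and weights are arbitrary choices of ours) checks that S is feasible and that its suboptimal cost certifies the triangle inequality.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 2
a = rng.random(n); a /= a.sum()
b = rng.random(n); b /= b.sum()
c = rng.random(n); c /= c.sum()
x = rng.normal(size=(n, 2))                       # shared support
D = np.sqrt(((x[:, None] - x[None, :]) ** 2).sum(-1))   # ground metric

_, P = ot_lp(a, b, D ** p)        # optimal coupling between a and b
_, Q = ot_lp(b, c, D ** p)        # optimal coupling between b and c

b_bar = np.where(b > 0, b, 1.0)
S = P @ np.diag(1.0 / b_bar) @ Q  # glued coupling S = P diag(1/b̄) Q

# S is a feasible coupling between a and c ...
assert np.allclose(S.sum(axis=1), a) and np.allclose(S.sum(axis=0), c)
# ... and its (suboptimal) cost certifies the triangle inequality.
Wac = ot_lp(a, c, D ** p)[0] ** (1 / p)
Wab = ot_lp(a, b, D ** p)[0] ** (1 / p)
Wbc = ot_lp(b, c, D ** p)[0] ** (1 / p)
assert Wac <= (S * D ** p).sum() ** (1 / p) + 1e-9
assert Wac <= Wab + Wbc + 1e-9
```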

Remark 2.15 (The cases 0 < p ≤ 1). Note that if 0 < p ≤ 1, then Dp is
itself a distance. This implies that while for p ≥ 1, W_p(a, b) is a distance,
in the case p ≤ 1, it is actually Wp (a, b)p which defines a distance on
the simplex.
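For completeness, here is a minimal helper computing (2.16) on top of the hypothetical ot_lp sketch given after (2.11); the three-bin example is an arbitrary illustration of ours.

```python
import numpy as np

def wasserstein_p(a, b, D, p):
    """p-Wasserstein distance (2.16) between histograms for a ground metric D."""
    cost, _ = ot_lp(a, b, D ** p)       # assumes the ot_lp sketch is in scope
    return cost ** (1.0 / p)

# Three bins on the real line: half the mass moves by 1, half by 2.
positions = np.array([0.0, 1.0, 3.0])
D = np.abs(positions[:, None] - positions[None, :])
a = np.array([0.5, 0.5, 0.0])
b = np.array([0.0, 0.5, 0.5])
print(wasserstein_p(a, b, D, p=1))       # 1.5
```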

Remark 2.16 (Applications of Wasserstein distances). The fact


that the OT distance automatically “lifts” a ground metric to
a metric between histograms makes it a method of choice for
applications in computer vision and machine learning where one
needs to compare histograms. In these fields, a classical approach
is to “pool” local features (for instance image descriptors) and
compute a histogram of the empirical distribution of features
(a so-called bag of features) to perform retrieval, clustering or
classification, see for instance [Oliva and Torralba, 2001]. In a
similar line of ideas, OT distances can be used over some lifted
feature spaces to perform signal and image analysis [Thorpe et al.,
2017]. Applications to retrieval and clustering were initiated by the
landmark paper [Rubner et al., 2000], with renewed applications
following faster approximations relying on simplifications of ma-
trix C such as thresholding [Pele and Werman, 2008, 2009]. More
recent applications stress the use of the EMD for bags-of-words,
either to carry out dimensionality reduction [Rolet et al., 2016],
to classify texts [Kusner et al., 2015, Huang et al., 2016], or to
define an alternative loss to train multi-class classifiers that output
bags-of-words [Frogner et al., 2015]. The review paper by Kolouri
et al. [2017] presents an overview of other applications in signal
processing and machine learning.

Remark 2.17 (Wasserstein distance between measures).


Proposition 2.2 generalizes from histograms to arbitrary mea-
sures that need not be discrete.

Proposition 2.3. We assume X = Y, and that for some p ≥ 1,
c(x, y) = d(x, y)^p, where d is a distance on X, i.e.

(i) d(x, y) = d(y, x) ≥ 0;

(ii) d(x, y) = 0 if and only if x = y;

(iii) ∀ (x, y, z) ∈ X³, d(x, z) ≤ d(x, y) + d(y, z).

Then

    W_p(α, β) := L_{d^p}(α, β)^{1/p}     (2.17)

(note that W_p depends on d) defines the p-Wasserstein distance
on M1+(X), i.e. W_p is symmetric, positive, W_p(α, β) = 0 if and only if
α = β, and it satisfies the triangle inequality

    ∀ (α, β, γ) ∈ M1+(X)³,   W_p(α, γ) ≤ W_p(α, β) + W_p(β, γ).

Proof. The proof follows the same approach as that for Proposition 2.2 and relies on the existence of a coupling between
(α, γ) obtained by “gluing” optimal couplings between (α, β) and
(β, γ).

Remark 2.18 (Geometric intuition and weak convergence). The


Wasserstein distance W p has many important properties, the
most important one being that it is a weak distance, i.e. it
allows one to compare singular distributions (for instance discrete
ones) and to quantify spatial shifts between the supports of the
distributions. In particular, “classical” distances (or divergences)
are not even defined between discrete distributions (the L² norm
can only be applied to continuous measures with a density with
respect to a base measure, and the discrete ℓ² norm requires the
positions (x_i, y_j) to be fixed to work). In sharp contrast, one
has that for any p > 0, W_p(δ_x, δ_y) = d(x, y). Indeed, it suffices
to notice that U(δ_x, δ_y) = {δ_{(x,y)}}; the Kantorovich problem
then has only one feasible solution, so W_p(δ_x, δ_y) is necessarily
(d(x, y)^p)^{1/p} = d(x, y). This shows that W_p(δ_x, δ_y) → 0 if x → y.
This property corresponds to the fact that W_p is a way to quantify
weak convergence, as we now define.

Definition 2.2 (Weak convergence). (αk )k converges weakly to α


in M1+(X) (denoted α_k ⇀ α) if and only if for any continuous
function g ∈ C(X), ∫_X g dα_k → ∫_X g dα. This notion of weak convergence corresponds to the convergence in law of random vectors.

This convergence can be shown to be equivalent to


W p (αk , α) → 0 [Villani, 2009, Theorem 6.8] (together with a
convergence of the moments up to order p for unbounded metric
spaces).

Remark 2.19 (Translation invariance). A nice feature of the
Wasserstein distance over a Euclidean space X = R^d for the
ground cost c(x, y) = ‖x − y‖² is that one can factor out translations; indeed, denoting by T_τ : x ↦ x − τ the translation, one has

    W_2(T_{τ]}α, T_{τ′]}β)² = W_2(α, β)² − 2⟨τ − τ′, m_α − m_β⟩ + ‖τ − τ′‖²,

where m_α := ∫_X x dα(x) ∈ R^d is the mean of α. In particular, this
implies the nice decomposition of the distance as

    W_2(α, β)² = W_2(ᾱ, β̄)² + ‖m_α − m_β‖²,

where (ᾱ, β̄) are the “centered” zero-mean measures, ᾱ = T_{m_α]}α.
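This decomposition also holds verbatim for discrete measures and is easy to verify numerically. A hedged sketch, again assuming the hypothetical ot_lp helper from the sketch after (2.11) is in scope:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 7
x = rng.normal(size=(n, 2)); a = np.full(n, 1.0 / n)        # measure alpha
y = rng.normal(size=(m, 2)) + 4.0; b = np.full(m, 1.0 / m)   # measure beta

def w2_squared(x, a, y, b):
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
    return ot_lp(a, b, C)[0]

m_alpha = a @ x                      # mean of alpha
m_beta = b @ y                       # mean of beta
full = w2_squared(x, a, y, b)
centered = w2_squared(x - m_alpha, a, y - m_beta, b)
assert np.isclose(full, centered + ((m_alpha - m_beta) ** 2).sum())
```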

Remark 2.20 (The case p = +∞). Informally, the limit of W_p as
p → +∞ is

    W_∞(α, β) := min_{π ∈ U(α,β)}  sup_{(x,y) ∈ Supp(π)} d(x, y),     (2.18)

where the sup should be understood as the essential supremum


according to the measure π on X 2 . In contrast to the cases p < +∞,
this is a non-convex optimization problem, which is difficult to
solve numerically and to study theoretically. The W ∞ distance is
related to the Hausdorff distance between the supports of (α, β),
see Section 10.6.1. We refer to [Champion et al., 2008] for details.

2.5 Dual Problem

The Kantorovich problem (2.11) is a constrained convex minimization


problem, and as such, it can be naturally paired with a so-called dual
problem, which is a constrained concave maximization problem. The
following fundamental proposition explains the relationship between
the primal and dual problems.

Proposition 2.4. One has

    L_C(a, b) = max_{(f,g) ∈ R(a,b)} ⟨f, a⟩ + ⟨g, b⟩,     (2.19)

where the set of admissible dual potentials is

    R(a, b) := { (f, g) ∈ R^n × R^m : ∀ (i, j) ∈ ⟦n⟧ × ⟦m⟧, f_i + g_j ≤ C_{i,j} }.     (2.20)

Proof. This result is a direct consequence of the more general result on
strong duality for linear programs [Bertsimas and Tsitsiklis, 1997,
p. 148, Theorem 4.4]. The easier part of that result, namely that the right-hand side of Equation (2.19) is a lower bound on L_C(a, b), is discussed
in §3.2. For the sake of completeness, let us derive this dual problem
using Lagrangian duality. The Lagrangian associated with (2.11)
reads

    min_{P ≥ 0}  max_{(f,g) ∈ R^n × R^m}  ⟨C, P⟩ + ⟨a − P 1_m, f⟩ + ⟨b − P^T 1_n, g⟩.     (2.21)

For linear programs, one can always exchange the min and the max and
get the same value, and one thus considers

    max_{(f,g) ∈ R^n × R^m}  ⟨a, f⟩ + ⟨b, g⟩ + min_{P ≥ 0} ⟨C − f 1_m^T − 1_n g^T, P⟩.

We conclude by remarking that

    min_{P ≥ 0} ⟨Q, P⟩ = 0 if Q ≥ 0, and −∞ otherwise,

so that the constraint reads C − f 1_m^T − 1_n g^T = C − f ⊕ g ≥ 0.

The primal-dual optimality relation for the Lagrangian (2.21) allows
one to locate the support of the optimal transport plan:

    Supp(P) ⊂ { (i, j) ∈ ⟦n⟧ × ⟦m⟧ : f_i + g_j = C_{i,j} }.     (2.22)
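The dual (2.19) is itself a small linear program. The sketch below is a hedged illustration only (assuming NumPy/SciPy; ot_dual_lp is our own illustrative helper, and the strong-duality check reuses the hypothetical ot_lp sketch from after (2.11)); it also verifies the support condition (2.22).

```python
import numpy as np
from scipy.optimize import linprog

def ot_dual_lp(a, b, C):
    """Solve the dual problem (2.19): maximize <f,a> + <g,b> s.t. f_i + g_j <= C_ij."""
    n, m = C.shape
    obj = -np.concatenate([a, b])          # linprog minimizes, so negate
    A_ub = np.zeros((n * m, n + m))        # one inequality per pair (i, j)
    for i in range(n):
        for j in range(m):
            A_ub[i * m + j, i] = 1.0
            A_ub[i * m + j, n + j] = 1.0
    res = linprog(obj, A_ub=A_ub, b_ub=C.ravel(), bounds=(None, None),
                  method="highs")
    f, g = res.x[:n], res.x[n:]
    return -res.fun, f, g

rng = np.random.default_rng(0)
n, m = 4, 5
a = np.full(n, 1.0 / n)
b = rng.random(m); b /= b.sum()
C = rng.random((n, m))
dual_val, f, g = ot_dual_lp(a, b, C)
primal_val, P = ot_lp(a, b, C)                       # assumes ot_lp is in scope
print(np.isclose(primal_val, dual_val))              # True: strong duality
# Complementary slackness (2.22): P is supported where f_i + g_j = C_{i,j}.
print(np.all(np.abs(f[:, None] + g[None, :] - C)[P > 1e-9] < 1e-6))
```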
Remark 2.21. Following the interpretation given to the Kantorovich
problem in Remark 2.10, we follow with an intuitive presentation of the
dual. Recall that in that setup, an operator wishes to move at the least
possible cost an overall amount of resources from warehouses to facto-
ries. The operator can do so by solving (2.11) to follow the instructions
set out in P⋆, and pay ⟨P⋆, C⟩ to the transportation company.

Outsourcing logistics. Suppose that the operator does not have the
computational means to solve the linear program (2.11). He decides
instead to outsource that task to a vendor. The vendor chooses the
following pricing scheme: she will ask money both when collecting a
single unit of the resource at a warehouse, and later ask a bit more
when delivering that unit of resource to a factory. More precisely, the
vendor will apply a collection price fi to collect a unit of resource at
each warehouse i (no matter where that unit is sent to), and a price
gj to deliver a unit of resource to factory j (no matter from which
warehouse that unit comes from). On aggregate, since there are exactly
ai units at warehouse i and bj needed at factory j, the vendor asks as
a consequence of that pricing scheme a price of ⟨f, a⟩ + ⟨g, b⟩ to solve
the operator’s logistic problem.

Agreement on prices. Note that the pricing system used by the ven-
dor allows quite naturally for arbitrarily negative prices. Indeed, if the
vendor applies a price vector f for warehouses and a price vector g
for factories, then the total bill will not be changed by simultaneously
decreasing all entries in f by a large number and increasing all entries
of g by that same number, since the total amount of resources in all
warehouses is equal to those that have to be delivered to the factories.
In other words, the vendor can give the illusion of giving an extremely
good deal to the operator by paying him to collect some of his goods,
but compensate that loss by simply charging him more for delivering
them.

Of course, the vendor wishes to charge as much as they can for


that service. In the absence of another competing vendor, the operator
must therefore think of a quick way to check that the vendor’s prices
are reasonable. A possible way to do so would be for the operator
to compute the price LC (a, b) of the most efficient plan by solving
problem (2.11) and check if the vendor’s offer is at least smaller than
that amount. However, recall that the operator cannot afford such a
lengthy computation in the first place.
Luckily, there is a far more efficient way for the operator to check
whether the vendor has a competitive offer. Recall that fi is the price
charged by the vendor for picking a unit at i and gj to deliver one
at j. Therefore, the vendor’s pricing scheme implies that transferring
one unit of the resource from i to j costs exactly fi + gj . Yet, the
operator also knows that the cost of shipping one unit from i to j
by the transporting company is Ci,j . Therefore, if for any pair i, j the
aggregate price f_i + g_j is strictly larger than C_{i,j}, the vendor is charging
more than the fair price charged by the transportation company for that
transport, and the operator should refuse the vendor’s offer.

Optimal prices as a dual problem. It is therefore in the interest of


the operator to check that for all pairs i, j the prices offered by the
vendor verify fi + gj ≤ Ci,j . Suppose that the operator does check that
the vendor has provided price vectors that do comply with these n × m
inequalities. Can he conclude that the vendor’s proposal is attractive?
Doing a quick back-of-the-envelope calculation, the operator does indeed
conclude that it is in his interest to accept that offer. Indeed, since any
of his transportation plans P would have a cost ⟨P, C⟩ = Σ_{i,j} P_{i,j} C_{i,j},
the operator can conclude, applying these n × m inequalities, that for
any transport plan P (including the optimal one P⋆), the marginal
constraints imply

    Σ_{i,j} P_{i,j} C_{i,j} ≥ Σ_{i,j} P_{i,j} (f_i + g_j) = Σ_i f_i ( Σ_j P_{i,j} ) + Σ_j g_j ( Σ_i P_{i,j} ) = ⟨f, a⟩ + ⟨g, b⟩,



and therefore observe that any attempt at doing the job by himself
would necessarily be more expensive than the price proposed by the
vendor.

Figure 2.8: Consider in the left plot the optimal transport problem between two discrete measures α and β, represented respectively by blue dots and red squares. The area of these markers is proportional to the weight at each location. That plot also displays the optimal transport P⋆ using a quadratic Euclidean cost. The corresponding dual (Kantorovich) potentials f⋆ and g⋆ for that configuration are displayed on the right plot. Since there is a “price” f⋆_i for each point in α (and conversely g⋆ for β), the color at that point represents the obtained value using the color map on the right. These potentials can be interpreted as relative prices, in the sense that they indicate the individual cost, under the best possible transport scheme, to move away a mass at each location in α, or on the contrary to send a mass towards any point in β. The optimal transport cost is therefore equal to the sum of the squared lengths of all the arcs on the left weighted by their thickness, or, alternatively, using the dual formulation, to the sum of the values (encoded with colors) multiplied by the area of each marker on the right plot.
Knowing this, the vendor must therefore find a set of prices f, g
that maximize ⟨f, a⟩ + ⟨g, b⟩ but that must satisfy at the very least
for all i, j, the basic inequality that fi + gj ≤ Ci,j for his offer to be
accepted, which results in problem (2.19). One can show, as we do later
in §3.1, that the best price obtained by the vendor is in fact exactly
equal to the best possible cost the operator would obtain by computing
LC (a, b).
Figure 2.8 illustrates this problem. On the left, blue dots represent
warehouses and red dots stand for factories; the areas of these dots
stand for the probability weights a, b. On the right are pictured the
price values obtained by the vendor as a result of optimizing prob-
lem (2.19). Prices have been chosen so that their mean is equal to 0.
One can clearly see that highest relative prices come from collecting

goods at an isolated warehouse on the lower left of the figure, as well


as delivering goods at the factory located in the upper right area.

Remark 2.22 (Dual problem between arbitrary measures). To ex-


tend this primal-dual construction to arbitrary measures, it
is important to realize that measures are naturally paired in
duality with continuous functions (a measure can only be accessed
through integration against continuous functions). The duality
is formalized in the following proposition, which boils down to
Proposition 2.4 when dealing with discrete measures.

Proposition 2.5. One has

    L_c(α, β) = sup_{(f,g) ∈ R(c)}  ∫_X f(x) dα(x) + ∫_Y g(y) dβ(y),     (2.23)

where the set of admissible dual potentials is

    R(c) := { (f, g) ∈ C(X) × C(Y) : ∀ (x, y), f(x) + g(y) ≤ c(x, y) }.     (2.24)
Here, (f, g) is a pair of continuous functions, often called
“Kantorovich potentials”.

The discrete case (2.19) corresponds to the dual vectors being


samples of the continuous potentials, i.e. (fi , gj ) = (f (xi ), g(yj )).
The primal-dual optimality conditions allow one to track the support
of the optimal plan, and (2.22) is generalized as

Supp(π) ⊂ {(x, y) ∈ X × Y : f (x) + g(y) = c(x, y)} . (2.25)

Note that in contrast to the primal problem (2.15), showing the


existence of solutions to (2.23) is non-trivial, because the constraint
set R(c) is not compact and the function to minimize non-coercive.
Using the machinery of c-transform detailed in Section 5.1, one can
however show that optimal (f, g) are necessarily Lipschitz regular,
which enables one to replace the constraint set by a compact one.

Remark 2.23 (Monge-Kantorovich Equivalence – Brenier Theorem).


The following celebrated theorem of Brenier [1991] ensures that
in R^d for p = 2, if at least one of the two input measures has a
density, then the Kantorovich and Monge problems are equivalent.

Theorem 2.1 (Brenier). In the case X = Y = R^d and c(x, y) =
‖x − y‖², if at least one of the two input measures (denoted α)
has a density ρ_α with respect to the Lebesgue measure, then the
optimal π in the Kantorovich formulation (2.15) is unique, and is
supported on the graph (x, T(x)) of a “Monge map” T : R^d → R^d.
This means that π = (Id, T)]α, i.e.

    ∀ h ∈ C(X × Y),   ∫_{X×Y} h(x, y) dπ(x, y) = ∫_X h(x, T(x)) dα(x).     (2.26)

Furthermore, this map T is uniquely defined as the gradient of a
convex function ϕ, T(x) = ∇ϕ(x), where ϕ is the unique (up to
an additive constant) convex function such that (∇ϕ)]α = β. This
convex function is related to the dual potential f solving (2.23) as
ϕ(x) = ‖x‖²/2 − f(x).

Proof. We sketch the main ingredients of the proof; more details
can be found for instance in [Santambrogio, 2015]. We remark that
∫ c dπ = C_{α,β} − 2 ∫ ⟨x, y⟩ dπ(x, y), where the constant is C_{α,β} =
∫ ‖x‖² dα(x) + ∫ ‖y‖² dβ(y). Instead of solving (2.15), one can thus
consider the problem

    max_{π ∈ U(α,β)}  ∫_{X×Y} ⟨x, y⟩ dπ(x, y),

whose dual reads

    min_{(ϕ,ψ)}  { ∫_X ϕ dα + ∫_Y ψ dβ : ∀ (x, y), ϕ(x) + ψ(y) ≥ ⟨x, y⟩ }.     (2.27)
The relation between these variables and those of (2.24) is (ϕ, ψ) =
(‖·‖²/2 − f, ‖·‖²/2 − g). One can replace the constraint by

    ∀ y,   ψ(y) ≥ ϕ*(y) := sup_x ⟨x, y⟩ − ϕ(x).     (2.28)

Here ϕ* is the Legendre transform of ϕ and is a convex function as
a supremum of linear forms (see also (4.50)). Since the objective
appearing in (2.27) is linear and the integrating measures positive,
one can minimize explicitly with respect to ψ and set ψ = ϕ* in
order to consider the unconstrained problem

    min_ϕ  ∫_X ϕ dα + ∫_Y ϕ* dβ;     (2.29)

see also Section 5.1 for a generalization of this idea to generic costs
c(x, y). By iterating this argument twice, one can replace ϕ by
ϕ**, which is a convex function, and thus impose in (2.29) that ϕ
is convex. Condition (2.25) shows that an optimal π is supported
on {(x, y) : ϕ(x) + ϕ*(y) = ⟨x, y⟩}, which shows that such an x
is optimal in the supremum (2.28) defining the Legendre transform,
whose optimality condition reads y ∈ ∂ϕ(x). Since ϕ is convex,
it is differentiable almost everywhere, and since α has a density,
it is also differentiable α-almost everywhere. This shows that for
each x, the associated y is uniquely defined α-almost everywhere
as y = ∇ϕ(x), and that necessarily π = (Id, ∇ϕ)]α.

This results shows that in the setting of W 2 with non-singular


densities, the Monge problem (2.9) and its Kantorovich relax-
ation (2.15) are equal (the relaxation is tight). This is the con-
tinuous analog of Proposition 2.1 for the assignment case (2.1),
which states that the minimum of the optimal transport problem
is achieved, when the marginals are equal and uniform, at a per-
mutation matrix (a discrete map). Brenier’s theorem, stating that
an optimal transport map must be the gradient of a convex func-
tion, should be examined under the light that a convex function
is the natural generalization of the notion of increasing functions
in dimension more than one. Optimal transport can thus plays
34 Theoretical Foundations

an important role to define quantile functions in arbitrary dimen-


sions, which in turn is useful for applications to quantile regression
problems Carlier et al. [2016].
Note also that this theorem can be extended in many direc-
tions. The condition that α has a density can be weakened to the
condition that it does not give mass to “small sets” having Haus-
dorff dimension smaller than d − 1 (e.g. hypersurfaces). One can
also consider costs of the form c(x, y) = h(x − y) where h is a
strictly convex function.

Remark 2.24 (Monge-Ampère equation). For measures with den-


sities, using (2.8), one obtains that ϕ is the unique (up to the
addition of a constant) convex function which solves the following
Monge-Ampère-type equation

det(∂ 2 ϕ(x))ρβ (∇ϕ(x)) = ρα (x) (2.30)

where ∂ 2 ϕ(x) ∈ Rd×d is the hessian of ϕ. The Monge-Ampère


operator det(∂ 2 ϕ(x)) can be understood as a non-linear degenerate
Laplacian. In the limit of small displacements, ϕ = Id + εϕ, one
indeed recovers the Laplacian ∆ as a linearization since for smooth
maps
det(∂ 2 ϕ(x)) = 1 + ε∆ϕ(x) + o(ε).
The convexity constraint forces det(∂ 2 ϕ(x)) ≥ 0 and is necessary
for this equation to have a solution. There is a large body of lit-
erature on the theoretical analysis of the Monge-Ampère equa-
tion, and in particular the regularity of its solution, see for in-
stance [Gutiérrez, 2016], we refer to the review paper by Caffarelli
[2003]. A major difficulty is that in full generality, solutions need
not be smooth, and one has to resort to the machinery of vis-
cosity solution to capture singularity, and even Alexandrov solu-
tions when the input measures are arbitrary (e.g. Dirac masses).
Many solvers have been proposed in the simpler case of the Monge-
Ampère equation det(ϕ00 (x)) = f (x) for a fixed right-hand side f ,
see for instance Benamou et al. [2016b] and the references therein.
2.6. Special Cases 35

In particular, capturing anisotropic convex functions requires a


special care, and usual finite differences can be inaccurate. For op-
timal transport, where f actually depends on ∇ϕ, the discretiza-
tion of the equation (2.30), and the boundary condition result in
technical challenges outlined in [Benamou et al., 2014] and the
references therein. Note also that related solvers based on fixed
points iterations have been applied to image registration Haker
et al. [2004].

2.6 Special Cases

In general, computing OT distances is numerically involved. Before


detailing in Sections 3,4,7 3 different numerical solvers, we first review
special favorable cases where the resolution of the OT problem is easy.
Remark 2.25 (Binary Cost Matrix and 1-Norm). One can easily check
that when the cost matrix C is zero on the diagonal and 1 elsewhere,
namely when C = 1n×n −In , the OT distance between a and b is equal
to the 1-norm of their difference, LC (a, b) = ka − bk1 .

Remark 2.26 (Kronecker Cost Function and Total Variation). In


addition to Remark 2.25 above, one can also easily check that this
result extends to discrete and discrete measures in the case where
c(x, y) is 0 if x = y and 1 when x 6= y. The OT distance between
two discrete measures α and β is equal to their total variation
distance.

Remark 2.27 (1-D case – Empirical measures). Here X = R. As-


suming α = n1 ni=1 δxi and β = n1 nj=1 δyj , and assuming (with-
P P

out loss of generality) that the points are ordered, i.e. x1 ≤ x2 ≤


. . . ≤ xn and y1 ≤ y2 ≤ . . . ≤ yn , then one has the simple formula
p
X
p
W p (α, β) = |xi − yi |p , (2.31)
i=1

i.e. locally (if one assumes distinct points), W p (α, β) is the `p norm
between two vectors of ordered values of α and β. That statement
36 Theoretical Foundations

Figure 2.9: 1-D optimal couplings: each arrow xi → yj indicate a non-zero Pi,j in
the optimal coupling. Top: empirical measures with same number of points (optimal
matching). Bottom: generic case. This corresponds to monotone rearrangements, if
xi ≤ xi0 are such that Pi,j 6= 0, Pi0 ,j 0 6= 0, then necessarily yj ≤ yj 0 .

is only valid locally, in the sense that the order (and those vector
representations) might change whenever some of the values change.
That formula is a simple consequence of the more general remark
given below. Figure 2.9, top row, illustrates the 1-D transporta-
tion map between empirical measures with the same number of
points. The bottom row shows how this monotone map generalizes
to arbitrary discrete measures. It is possible to leverage this 1-D
computation to also compute efficiently OT on the circle, see Delon
et al. [2010]. Note that in the case of concave cost of the distance,
for instance when p < 1, the behaviour of the optimal transport
plan is very different, see Delon et al. [2012], which describes an
efficient solver in this case.

Remark 2.28 (1-D case – Generic case). For a measure α on R, we


introduce the cumulative function
Z x
def.
∀ x ∈ R, Cα (x) = dα, (2.32)
−∞

which is a function Cα : R → [0, 1], and its pseudo-inverse Cα−1 :


[0, 1] → R ∪ {−∞}

∀ r ∈ [0, 1],Cα−1 (r) = min {x ∈ R ∪ {−∞} : Cα (x) ≥ r} .


x
(2.33)
That function is also called the generalized quantile function of α.
2.6. Special Cases 37

For any p ≥ 1, one has


Z 1
p
p
W p (α, β) = Cα−1 − Cβ−1 p = |Cα−1 (r) − Cβ−1 (r)|p dr.
L ([0,1]) 0
(2.34)
This means that through the map α 7→ Cα−1 ,
the Wasserstein dis-
tance is isometric to a linear space equipped with the Lp norm,
or, equivalently, that the Wasserstein distance for measures on the
real line is a Hilbertian metric. This makes the geometry of 1-D
optimal transport very simple, but also very different from its ge-
ometry in higher dimensions, which is not Hilbertian as discussed
in Proposition 8.1 and more generally in §8.3. For p = 1, one even
has the simpler formula
Z
W 1 (α, β) = kCα − Cβ kL1 (R) = |Cα (x) − Cβ (x)|dx (2.35)
R
Z Z x
= d(α − β) dx. (2.36)
R −∞

which shows that W 1 is a norm (see §6.2 for the generalization to


arbitrary dimensions). An optimal Monge map T such that T] α =
β is then defined by
T = Cβ−1 ◦ Cα . (2.37)
Figure 2.10 illustrates the computation of 1-D OT through cu-
mulative functions. It also displays displacement interpolations,
computed as detailed in (7.7), see also Remark 9.5. For a detailed
survey of the properties of optimal transport in 1-D, we refer the
reader to [Santambrogio, 2015, Chapter 2].

Remark 2.29 (Distance between Gaussians). If α = N (mα , Σα )


and β = N (mβ , Σβ ) are two Gaussians in Rd , then one can show
that the following map

T : x 7→ mβ + A(x − mα ), (2.38)
38 Theoretical Foundations

α β (tT + (1 − t)Id)] α
1 1 1
Cµ C-1
µ
T
Cν T-1
C-1
ν

0.5 0.5 0.5 0.5

0 0 0
0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1

(Cα , Cβ ) (Cα−1 , Cβ−1 ) (T, T −1 ) (1 − t)Cα−1 + tCβ−1

Figure 2.10: Computation of OT and displacement interpolation between two 1-D


measures, using cumulant function as detailed in (2.37).

where
1 1 1 1 1
− 2 −
A = Σα 2 Σα2 Σβ Σα2 Σα 2 = AT ,
is such that T] ρα = ρβ . Indeed, one simply has to notice that the
change of variables formula (2.8) is satisfied since
1
ρβ (T (x)) = det(2πΣβ )− 2 exp(−hT (x) − mβ , Σ−1
β (T (x) − mβ )i)
1
= det(2πΣβ )− 2 exp(−hx − mα , AT Σ−1
β A(x − mα )i)
1
= det(2πΣβ )− 2 exp(−hx − mα , Σ−1
α (x − mα )i),

and since T is a linear map we have that


1
det Σβ

0 2
| det T (x)| = det A =
det Σα
and we therefore recover ρα = | det T 0 |ρβ meaning T] α = β. No-
tice now that T is the gradient of the convex function ψ : x 7→
1
2 hx − mα , A(x − mα )i + hmβ , xi to conclude, using Brenier’s the-
orem [1991] (see Remark 2.23) that T is optimal. Both that map
2.6. Special Cases 39

T and the corresponding potential ψ are illustrated in Figures 2.11


and 2.12
With additional calculations involving first and second order
moments of ρα , we obtain that the transport cost of that map is

W 22 (α, β) = kmα − mβ k2 + B(Σα , Σβ )2 (2.39)

where B is the so-called Bures metric [1969] between positive def-


inite matrices (see also Forrester and Kieburg [2016]),
 
B(Σα , Σβ )2 = tr Σα + Σβ − 2(Σ1/2 1/2 1/2
def.
α Σβ Σα ) , (2.40)

where Σ1/2 is the matrix square root. One can show that B is a
distance on covariance matrices, and that B 2 is convex with respect
to both its arguments. In the case where Σα = diag(ri )i and
Σβ = diag(si )i are diagonals, the Bures metric is the Hellinger
distance
√ √
B(Σα , Σβ ) = r− s 2.
For 1-D Gaussians,
√ W 2 is thus the Euclidean distance on the 2-D
plane (m, Σ), as illustrated in Figure 2.13. For a detailed treat-
ment of the Wasserstein geometry of Gaussian distributions, we
refer to Takatsu [2011].

Remark 2.30 (Distance between Elliptically Contoured Distributions).


Gelbrich provides a more general result than that provided in
Remark 2.29: the Bures metric between Gaussians extends
more generally to elliptically contoured distributions [1990]. In
a nutshell, one can first show that for two measures with given
mean and covariance matrices, the distance between the two
Gaussians with these respective parameters is a lower bound of
the Wasserstein distance between the two measures (Theorem
2.1 in [1990]). Additionally, the closed form (2.39) extends to
families of elliptically contoured densities: If two densities ρα and
ρβ belong to such a family, namely when ρα and ρβ can be written
for any point x as
40 Theoretical Foundations

0
ρβ
-1
ρα
-2

-3
-4 -2 0 2 4 6

Figure 2.11: Two Gaussians ρα and ρβ , represented using the contour plots of
their densities, with respective mean and variance  matrices mα = (−2, 0), Σα =
1 1 1 1 1
2
1 − 2
; − 2
1 and m β = (3, 1), Σβ = 2, ;
2 2
, 1 . The arrows originate at random
points x taken on the plane and end at the corresponding mappings of those points
T (x) = mβ + A(x − mα ).

Figure 2.12: Same Gaussians ρα and ρβ as defined in Figure 2.11, represented


this time as surfaces. The surface above is the Brenier potential ψ defined up to an
additive constant (here +50) such that T = ∇ψ. For visual purposes, both Gaussian
densities have been multiplied by a 100 factor.
2.6. Special Cases 41

m
Figure 2.13: Computation of displacement interpolation between two 1-D Gaus-
(x−m)2
def. 1 −
sians. Denoting Gm,σ (x) = √2πs e 2s2 the Gaussian density, it thus shows the
interpolation G(1−t)m0 +tm1 ,(1−t)σ0 +tσ1 .

1
ρα (x) = p h(hx − mα , Σα (x − mα )i)
det(Σα )
1
ρβ (x) = q h(hx − mβ , Σβ (x − mβ )i),
det(Σβ )
respectively, for the same positive valued function h such that
the integral Z
h(hx, xi)dx = 1,
Rd
then their optimal transport map is also the linear map (2.38) and
their Wasserstein distance is also given by the expression (2.39).
This includes therefore as interesting special cases uniform distri-
butions on ellipses, namely ellipctic shapes.
3
Algorithmic Foundations

This chapter describes algorithmic tools from combinatorial optimiza-


tion that can be used to solve optimal transport numerically. These
tools can only be used on the discrete formulation of optimal trans-
port, as described in the primal problem (2.11) or alternatively its
dual (2.19).
The origins of these algorithms can be traced back to the period
immediately before [Tolstoı, 1930] and during world war 2, when Hitch-
cock [1941] and Kantorovich [1942] formalized the generic problem of
dispatching available resources towards consumption sites in an opti-
mal way. Both of these formulations, as well as the later contribution
by Koopmans [1949], fell short of providing a provably correct algo-
rithm to solve the problem they had helped define (although the cycle
violation method was already hinted by Tolstoı [1939]). One had to
wait until the field of linear programming fully blossomed, with the
proposal of the simplex method, to be at last able to solve rigorously
optimal transport problems.
The goal of linear programming is to solve optimization problems
whose objective is linear and whose constraints are linear (in)equalities
in the variables of interest. The optimal transport problem fits that

42
3.1. The Kantorovich Linear Programs 43

description and is therefore a particular case of that wider class of


problems. One can however argue that optimal transport is truly spe-
cial among all linear program: First, Dantzig’s early motivation to solve
linear programs was greatly related to that of solving transportation
problems [Dantzig, 1949, p.210]. Second, despite being only a particular
case, the optimal transport problem stayed in the spotlight of optimiza-
tion, because it was understood shortly after that optimal transport
problems were related, and, in fact, equivalent to, an important class
of linear programs known as minimum cost network flows [Korte and
Vygen, 2012, p.213, Lemma 9.3] thanks to a result by [Ford and Fulker-
son, 1962]. As such, the OT problem has been the subject of particular
attention, ever since the birth of mathematical programming [Dantzig,
1951], and is still widely used to introduce more generally optimization
to a new audience [Nocedal and Wright, 1999, §1,p.4] because of its
intuitive description.

3.1 The Kantorovich Linear Programs

We have already introduced in Equation (2.11) the primal optimal


transport problem:
X
LC (a, b) = min Ci,j Pi,j . (3.1)
P∈U(a,b)
i∈JnK,j∈JmK

To make the link with the linear programming literature, one can cast
the equation above as a linear program in standard form, that is a linear
program with: a linear objective; equality constraints defined with a
matrix and a constant vector; nonnegative constraints on variables.
Let In stand for the identity matrix of size n and ⊗ be Kronecker’s
product. The following (n + m) × nm matrix,
" #
1 T ⊗ Im
A= n ∈ R(n+m)×nm ,
In ⊗ 1m T
can be used to encode the row-sum and column-sum constraints that
need to be satisfied for any P to be in U(a, b). To do so, simply cast
a matrix P ∈ Rn×m as a vector p ∈ Rnm such that the i + n(j − 1)’s
element of p is equal to Pij (P is enumerated column-wise) to obtain
44 Algorithmic Foundations

the following equivalence:


a
P ∈ Rn×m ∈ U(a, b) ⇔ p ∈ Rnm
+ , Ap = b .

Therefore we can write the original optimal transport problem as:

LC (a, b) = min cT p, (3.2)


p∈Rnm
+a 
Ap= b

where the nm-dimensional vector c is equal to the stacked columns


contained in the cost matrix C.

Remark 3.1. Note that one of the n + m constraints described above


is redundant, or that, in other words, the line vectors of matrix A
are not linearly independent. Indeed, summing all n first lines and
the subsequent m lines results in the same vector (namely A 01m
 n
=
 0n  T
A 1m = 1nm ). One can show that removing a line in A and the
corresponding entry in ba yields a properly defined linear system. For
 

simplicity, and to avoid treating asymmetrically a and b we retain


in what follows our original (redundant) formulation, keeping in mind
that degeneracy will pop up in some of our computations.

The dual problem corresponding to Equation (3.2) is, following the


rules of linear programming [Bertsimas and Tsitsiklis, 1997, p.143] de-
fined as:
 T
LC (a, b) = max ba h (3.3)
h∈Rn+m
AT h≤c
Note that this program is exactly equivalent to that presented in
Equation 2.4.

Remark 3.2. We provide a simple derivation of the duality result


above, which can be seen as a rigorous formulation of the arguments
developed in Remark 2.21 to introduce duality. Strong duality, namely
the fact that the optima of both primal (3.2) and dual (3.3) problems
do indeed coincide, requires a longer proof. We refer the interested
reader to [Bertsimas and Tsitsiklis, 1997, §4.10]. To simplify notations,
let us write q = ba . Consider now a relaxed primal problem of the
 

optimal transport problem, where the constraint Ap = q is no longer


3.2. C-transforms 45

necessarily enforced, but bears a cost hT (Ap − q) parameterized by


an arbitrary cost vector h ∈ Rn+m . This relaxation, whose optimum
depends directly on the cost vector h, can be written as:
cT p − hT (Ap − q).
def.
H(h) = min
nm p∈R+

Note first that this relaxed problem has no marginal constraints on


p. Because that minimization allows for many more p solutions, we
expect H(h) to be smaller than z̄ = LC (a, b). Indeed, writing p? for
any optimal solution of the primal Problem (3.1), we obtain
min cT p − hT (Ap − q) ≤ cT p? − hT (Ap? − q) = cT p? = z̄.
p∈Rnm
+

The approach above defines therefore a problem which can be used to


compute an optimal upper bound for the original Problem (3.1), for
any cost vector h; that function is called the Lagrange dual function of
L. The goal of duality theory is now to compute the best lower bound
z by maximizing H over any cost vector h, namely
!
T T T
z = max H(h) = max h q + min
nm
(c − A h) p .
h h p∈R+

The second term involving a minimization on p can be easily shown to


be −∞ if any coordinate of cT −AT h is negative. Indeed, if for instance
for a given index i ≤ n + m we have ci − (AT h)i < 0 then it suffices
to take for p the canonical vector ei multiplied by any arbitrary large
positive value to obtain an unbounded value. When trying to maximize
the lower bound H(h) it therefore makes sense to restrict vectors h to
be such that AT h ≤ c, in which case the best possible lower bound
becomes
z = max hT q.
h∈Rn+m
AT h≤c
We have therefore proved that z ≤ z̄, a result usually known as weak
duality.

3.2 C-transforms

We present in this section an interesting property of the dual optimal


transport problem (3.3) which takes a more important meaning when
46 Algorithmic Foundations

used for the semi-discrete optimal transport problem in §5.1. This sec-
tion builds upon the original formulation (2.19) that splits row and
column sum constraints:
LC (a, b) = max hf, ai + hg, bi (3.4)
(f,g)∈R(a,b)

Consider any dual feasible pair (f, g). If we “freeze” the value of f,
we can notice that there is not better vector solution for g than the
C-transform vector of f, denoted fC ∈ Rm and defined as
(f C )j = min Cij − fi ,
i∈JnK

since it is indeed easy to prove that (f, f C ) ∈ R(a, b) and that f C is the
largest possible vector such that this constraint is insured. We therefore
have that
hf, ai + hg, bi ≤ hf, ai + hf C , bi.
This result allows first to reformulate the dual problem as a piecewise
affine concave maximization problem expressed in a single variable f as

LC (a, b) = max
n
hf, ai + hf C , bi. (3.5)
f∈R
Putting that result aside, the same reasoning applies of course if we
now “freeze” the value of g and consider instead the C̄-transform of g,
namely vector gC̄ ∈ Rm defined as
(gC̄ )i = min Cij − fj ,
j∈JmK

with a different increase in objective


hf, ai + hg, bi ≤ hgC̄ , ai + hg, bi.
Starting from a given f, it is therefore tempting to alternate C and C̄
transforms several times to improve f. Indeed, we have the sequence of
inequalities
hf, ai + hf C , bi ≤ hf CC̄ , ai + hf C , bi ≤ hf CC̄ , ai + hf CC̄C , bi ≤ . . .
One may hope for a strict increase in the objective at each of these
iterations. However, this does not work because C/C̄ quickly hit a
plateau:
3.3. Complementary Slackness 47

Proposition 3.1. The following identities hold:

(i) f ≤ f 0 ⇒ f C ≥ f 0 C .

(ii) f CC̄ ≥ f, gC̄C ≥ g.

(iii) f CC̄C = f C .

Proof. The first inequality follows from the definition of C-transforms.


Expanding the definition of f CC̄ we have:
 
f CC̄ = min Cij − fjC = min Cij − min
0
Ci0 j − fi0 .
i j∈JmK j∈JmK i ∈JnK

Now, since − mini0 ∈JnK Ci0 j − fi0 ≥ −(Cij − fi ), we recover:


 
f CC̄ ≥ min Cij − Cij + fi = fi .
i j∈JmK

The relation gC̄C ≥ g is obtained in the same way. Now, set g = f C .


Then, gC̄ = f CC̄ ≥ f. Therefore, using result (i) we have f CC̄C ≤ f C .
Result (ii) yields f CC̄C ≥ f C , proving the equality.

3.3 Complementary Slackness

Primal (3.2) and dual (3.3), (2.19) problems can be solved indepen-
dently to obtain optimal primal P? and dual (f? , g? ) solutions. The
following proposition characterizes their relationship.

Proposition 3.2. Let P? and f? , g? be optimal solutions for the pri-


mal (2.23) and dual (2.11) problems, respectively. Then, for any pair
(i, j) ∈ JnK × JmK, P?i,j (Ci,j − f?i + g?j ) = 0 holds. In other words, if
P?i,j > 0 then necessarily f?i + g?j = Ci,j ; if f?i + g?j < Ci,j then neces-
sarily P?i,j = 0.

Proof. We have by strong duality that hP? , Ci = hf? , ai + hg? , bi.


Recall that P? 1m = a and P? T 1n = b, therefore

hf? , ai + hg? , bi = hf? , P? 1m i + hg? , P? T 1n i


= hf? 1m T , P? i + h1n g? T , P? i,
48 Algorithmic Foundations

which results in
hP? , C − f? ⊕ g? i = 0.
Because (f? , g? ) belongs to the polyhedron of dual constraints (2.20),
each entry of the matrix C − f? ⊕ g? is necessarily non-negative. There-
fore, since all the entries of P are nonnegative, the constraint that the
dot-product above is equal to 0 enforces that, for any pair of indices
(i, j) such that Pi,j > 0, Ci,j − (fi + gj ) must be zero, and for any pair
of indices (i, j) such that Ci,j > fi + gj that Pi,j = 0.

3.4 Vertices of the Transportation Polytope

A linear program with a non-empty and bounded feasible set attains


its minimum at an extremal point of the feasible set [Bertsimas and
Tsitsiklis, 1997, p.65, Theo.2.7]. Since the feasible set U(a, b) of the
primal optimal transport problem (3.2) is bounded, one can restrict the
search for an optimal P to the set of extreme points of the polytope
U(a, b). Matrices P that are extremal in U(a, b) have an interesting
structure that has been the subject of extensive research [Brualdi, 2006,
§8]. That structure requires describing the transport problem using the
formalism of bipartite graphs.

3.4.1 Tree Structure of the Support of all Vertices of U(a, b)


Let V = (1, 2, . . . , n) and V 0 = (10 , 20 , . . . , m0 ) be two sets of nodes.
Consider their union V ∪ V 0 , with n + m nodes, and the set E of all
nm undirected edges {{i, j 0 }, i ∈ JnK, j ∈ JmK} between them. To each
edge {i, j 0 } we associate the corresponding cost value Cij . The complete
bipartite graph G between V and V 0 is (V ∪ V 0 , E). A transport plan is
a flow on that graph satisfying source and sink constraints, as described
informally in Figure 3.1. An extremal point in U(a, b) has the following
property [Brualdi, 2006, p.338,Theo. 8.1.2].
Proposition 3.3 (Extremal Solutions). Let P be an extremal point of
the polytope U(a, b). Let F (P) ⊂ E be the subset of undirected edges
{{i, j 0 }, i ∈ JnK, j ∈ JmK such that Pij > 0}. Then the graph G(P) =
def.

(V ∪ V 0 , F (P)) has no cycles. In particular, P cannot have more than


n + m − 1 non-zero entries.
3.4. Vertices of the Transportation Polytope 49

10 0.3
10 0.2
1 1

20 0.5
20 0.16
2 2

30 0.2
30 0.08
3 3

40 40 0.56

Figure 3.1: The optimal transport problem as a bipartite network flow problem.
Here n = 3, m = 4. All coordinates of the source histogram, a, are depicted as
source nodes on the left labeled 1, 2, 3 whereas all coordinates of the target his-
togram b are labeled as nodes 10 , 20 , 30 , 40 . The graph is bipartite in the sense that
all source nodes are connected to all target nodes, with no additional edges. Each
edge {i, j 0 } is attributed a cost corresponding to the entry Cij . A feasible flow is
represented on the right. Notice that, however, in light of the result provided in
Proposition 3.3, that flow is not extremal since it has at least one cycle given by
((1, 10 ), (10 , 2), (2, 40 ), (40 , 1)).

10 10 10
1 1 1
P220 20 P220 + " 20 P220 " 20
2 P320 2 P320 " 2 P320 + "
P230 P230 " P 230 +"
30 30 30
3 P330 3 P330 + " 3 P330 "

n n n
P m0 Q m0 R m0

Figure 3.2: A solution P with a cycle in the graph of its support can be perturbed
to obtain two feasible solutions Q and R such that P is their average, therefore
disproving that P is extremal.
50 Algorithmic Foundations

Proof. Suppose that P is an extremal point of the polytope U(a, b)


and that its corresponding set F (P) of edges, denoted F for short,
is such that the graph G = (V ∪ V 0 , F ) contains a cycle, namely
there exists k > 1 and a sequence of distinct indices i1 , . . . , ik−1 ∈
JnK and j1 , . . . , jk−1 ∈ JmK such that the set of edges H =
{{i1 , j10 } , {j10 , i2 } , {i2 , j20 } , . . . , {ik , jk0 } , {jk0 , i1 }} forms a subset of F .
We construct two feasible matrices Q and R such that P = (Q + R)/2.
To do so, we consider a directed cycle H̄ corresponding to H, namely
the sequence of pairs i1 → j10 , j10 → i2 , i2 → j20 , . . . , ik → jk0 , jk0 → i1 , as
well as the elementary amount of flow ε < min{i,j 0 }∈F Pij . We now form
a perturbation E matrix whose (i, j) entry is equal to ε if i → j 0 ∈ H̄,
−ε if j → i0 ∈ H̄ and zero otherwise. We now define two matrices
Q = P + E and R = P − E. Because ε small enough, all elements in
Q and R are nonnegative. By construction, E has either lines (resp.
columns) with all entries equal to 0 or exactly one entry equal to ε and
another equal to −ε for those indexed by i1 , . . . , ik (resp. j1 , . . . , jk ).
Therefore, E is such that E1m = 0n and E T 1n = 0m , and we have that
Q and R have the same marginals as P. Finally P = (Q + R)/2 which,
since Q, R 6= P, contradicts the fact that P is an extremal point. Since
a graph with k nodes and no cycles cannot have more than k − 1 edges,
we conclude that F (P) cannot have more than n + m − 1 edges, and
therefore P cannot have more than n + m − 1 non-zero entries.

3.4.2 The North-West Corner Rule

The North-West (NW) corner rule is a heuristic that produces a vertex


of the polytope U(a, b) in up to n + m operations. This heuristic can
play a role to initialize any algorithm, such as the network simplex
outlined in the next section.
The rule starts by giving the highest possible value to P1,1 by set-
ting it to min(a1 , b1 ). At each step, the entry Pi,j is chosen to saturate
either the row constraint at i, the row constraint at j, or both if possi-
ble. The counters i, j are then updated as follows: i is incremented in
the first case, j is in the second, both i and j are in the third case. The
rule proceeds until Pn,m has received a value.
Formally, the algorithm works as follows: i and j are initialized to
3.4. Vertices of the Transportation Polytope 51

1, r ← a1 , c ← b1 . While i ≤ n and j ≤ m, set t ← min(r, c), Pi,j ← t,


r ← r − t, c ← s − t; If r = 0 then increment i, and update r ← ai if
i ≤ n; If c = 0 then increment j, and update c ← bj if j ≤ n; repeat.
Here is an example of this sequence assuming a = [0.2, 0.5, 0.3] and
b = [0.5, 0.1, 0.4]:
     
• 0 0 0.2 0 0 0.2 0 0
0 0 0 →  • 0 0 → 0.3 • 0
     

0 0 0 0 0 0 0 0 0
     
0.2 0 0 0.2 0 0 0.2 0 0
→ 0.3 0.1 • → 0.3 0.1 0.1 → 0.3 0.1 0.1
     

0 0 0 0 0 • 0 0 0.3

We write NW(a, b) for the unique plan that can be obtained through
this heuristic.
Note that, there is, however, a much larger number of NW cor-
ner solutions that can be obtained by permuting arbitrarily the or-
der of a and b first, computing the corresponding NW corner ta-
ble, and recovering a table of U(a, b) by inverting again the or-
der of columns and rows: setting σ = (3, 1, 2), σ 0 = (3, 2, 1) gives
aσ = [0.3, 0.2, 0.5], bσ0 = [0.4, 0.1, 0.5] and σ −1 = (2, 3, 1), σ 0 = (3, 2, 1).
Observe that:
 
0.3 0 0
NW(aσ , bσ0 ) = 0.1 0.1 0  ∈ U(aσ , bσ0 ),
 

0 0 0.5
 
0 0.1 0.1
NWσ−1 σ0−1 (aσ , bσ ) = 0.5 0 0  ∈ U(a, b).
 
0

0 0 0.3

Let N (a, b) be the set of all North-West corner solutions that can
be produced this way:

N (a, b) = {NWσ−1 σ0−1 (rσ , cσ0 ), σ, σ 0 ∈ Sd }.


def.

Note that all NW corner solutions only have by construction up to


n + m − 1 nonzero elements. The NW corner rule produces a table
which is by construction unique for a and a, but there is an exponential
52 Algorithmic Foundations

number of pairs or row/column permutations (σ, σ 0 ) that may share


the same table [Stougie, 2002, p.2]. N (a, b) forms a subset of (usually
strictly included in) the set of extreme points of U(a, b) [Brualdi, 2006,
Corollary 8.1.4].

3.5 A Heuristic Description of the Network Simplex

Consider a feasible matrix P whose graph G(P) = (V ∪ V 0 , F (P)) has


no cycles. P has therefore no more than n + m − 1 non-zero entries, and
is a vertex of U(a, b) by Proposition 3.3. We start this section with a
simple way to check whether P is optimal. More precisely, we consider
the optimality of a feasible primal-dual pair:
Proposition 3.4. Let P and (f, g) be feasible solutions for the pri-
mal (2.23) and dual (2.11) problems. If for any {i, j 0 } ∈ F (P) one has
that fi + gj = Ci,j , then P and (f, g) are both primal and dual optimal.
Proof. By weak duality, we have that
LC (a, b) ≤ hP, Ci = hP, f ⊕ gi = ha, fi + hb, gi ≤ LC (a, b)
and therefore P and (f, g) are respectively primal and dual optimal.

Given a feasible matrix P, it is therefore sufficient to obtain a dual


solution (f, g) which is feasible and complementary to P, in the sense
that C − f ⊕ g has non-negative entries and pairs of indices {i, j 0 } in
F (P) are such that Ci,j = fi +gj , to prove that P is optimal. Exhibiting
whether such a pair (f, g) exists, and, if not, modify P to reach that
goal is the gist of the network simplex.

3.5.1 Obtaining a Dual Pair Complementary to P


The simplex proceeds by associating first to any extremal solution P a
pair of (f, g) complementary dual variables. This is simply carried out
by finding two vectors f and g such that for any {i, j 0 } in F (P), fi + gj
is equal to Ci,j . Note that this, in itself, does not guarantee that (f, g)
is feasible, namely that C − f ⊕ g ≥ 0.
Let s be the cardinal of F (P). Because P is extremal, s ≤ n+m−1.
Because G(P) has no cycles, G(P) is either a tree or a forest (a union
3.5. A Heuristic Description of the Network Simplex 53

10 0.1
0.16 1
0 F (P) = {1, 10 }, {1, 20 }, {2, 20 }, {2, 30 },
2 0.16
0.4 2 {3, 40 }, {4, 40 }, {4, 50 }, {5, 60 }
30 0.3
0.06 3 n
40 0.1 G(P) = {1, 2, 10 , 20 , 30 }, {1, 10 }, {1, 20 }, {2, 20 }, {2, 30 } ,
0.24 4
50
0.2 {3, 4, 40 , 50 }, {3, 40 }, {4, 40 }, {4, 50 } ,
0.14 5 o
0 0
{5, 6 }, {5, 6 }
60 0.14

Figure 3.3: A feasible transport P and its corresponding set of edges F (P)
and graph G(P). As can be seen in the picture above, the graph G(P) =
({1, . . . , 5, 10 , . . . , 60 }, F (P)) is a forest, meaning that it can be expressed as the
union of tree graphs, three in this case.

of trees), as illustrated in Figure 3.3. Aiming for a pair (f, g) that is


complementary to P, we consider the following set of s linear equality
constraints on n + m variables:
f i1 + gj1 = Ci1 ,j1
f i2 + gj1 = Ci2 ,j1
.. .. (3.6)
. = .
f is + gjs = Cis ,js ,

where the elements of F (P) are enumerated as {i1 , j10 }, . . . , {is , js0 }.
Since s ≤ n+m−1 < n+m, the linear system (3.6) above is always
undetermined. This degeneracy can be interpreted in part because the
parameterization of U(a, b) with n + m constraints results in n + m
dual variables. A more careful formulation, outlined in Remark 3.1,
would have resulted in an equivalent formulation with only n + m − 1
constraints and therefore n + m − 1 dual variables. However, s can also
be strictly smaller than n + m − 1: This happens when G(P) is the
disjoint union of two or more trees. For instance, there are 5 + 6 = 11
dual variables (one for each node) in Figure 3.3, but only 8 edges among
these 11 nodes, namely 8 linear equations to define (f, g). Therefore,
there will be as many undetermined dual variables under that setting
as there will be connected components in G(P).
Consider a tree among those listed in G(P). Suppose that tree has
k nodes i1 , . . . , ik among source nodes and l nodes j10 , . . . , jl0 among
def.
target nodes, resulting in r = k + l, and r − 1 edges, corresponding to
54 Algorithmic Foundations

1 30 g3 := C2,3 f2
f1 := 0 10
g1 := C1,1 f1 2
f2 := C2,1 g1 20 g2 := C2,2 f2

Figure 3.4: The 5 dual variables f1 , f2 , g1 , g2 , g3 corresponding to the 5 nodes


appearing in the first tree of the graph G(P) illustrated in Figure 3.3 are linked
through 4 linear equations that involve corresponding entries in the cost matrix C.
Because that system is degenerate, we choose a root in that tree (node 1 in this
example) and set its corresponding variable to 0 and proceed then by traversing
the tree (either breadth-first or depth-first) from the root to obtain iteratively the
values of the 4 remaining dual variables.

k variables in f and l variables in g, linked with r − 1 linear equations.


To lift an indetermination, we can choose arbitrarily a root node in that
tree, and assign the value 0 to its corresponding dual variable. From
then, we can traverse the tree using a breadth-first or depth-first search
to obtain a sequence of simple variable assignments that determine the
values of all other dual variables in that tree, as illustrated in Figure 3.4.
That procedure can then be repeated for all trees in the graph of P to
obtain a pair of dual variables (f, g) that is complementary to P.

3.5.2 Network Simplex Update

The dual pair (f, g) obtained previously might be feasible, in the sense
that for all i, j we have fi +gj ≤ Ci,j , in which case we have reached the
optimum by Proposition 3.4. When that is not the case, namely when
there exists i, j such that fi + gj > Ci,j , the network simplex algorithm
kicks in. We first initialize a graph G to be equal to the graph G(P)
corresponding to the feasible solution P and carry next an iteration of
the algorithm itself, which consists first in adding the violating edge
{i, j 0 } to G. Two cases can then arise:

(a) G is (still) a forest, which can happen if {i, j 0 } links two existing
subtrees. The approach outlined in §3.5.1 can be used on graph
G recover a new complementary dual vector (f, g). Note that this
addition simply removes an indetermination among the n + m
dual variables, and does not result in any change in the primal
3.5. A Heuristic Description of the Network Simplex 55

variable P. That update is usually called degenerate in the sense


that {i, j 0 } has now entered graph G although Pi,j remains 0.
G(P) is, however contained in G.

(b) G has now a cycle. In that case, we need to remove an edge


in G to ensure that G is still a forest, yet also modify P so
that P is feasible and G(P) remains included in G. These op-
erations can be all carried out by increasing the value of Pi,j
and modifying the other entries of P appearing in the de-
tected cycle, in a manner very similar to the one we used to
prove Proposition 3.3. To be more precise, let us write that cy-
cle (i1 , j10 ), (j10 , i2 ), (i2 , j20 ), . . . , (il , jl0 ), (jl0 , il+1 ) with the conven-
tion that i1 = il+1 = i to ensure that the path is a cycle that
starts and ends at i, whereas j1 = j to highlight the fact that
the cycle starts with the added edge {i, j}, going in the right
direction. Increase now the flow of all “positive” edges (ik , jk0 )
(for k ≤ l), and decrease that of “negative” edges (jk0 , ik+1 ) (for
k ≤ l), to obtain an updated primal solution Pn , equal to P for
all but the following entries:

∀k ≤ l, Pnik ,jk := Pik ,jk + θ; Pnik+1 ,jk := Pik+1 ,jk − θ.

Here, θ is the largest possible increase at index i, j using that


cycle. The value of θ is controlled by the smallest flow negatively
impacted by the cycle, namely mink Pik+1 ,jk . That update is il-
lustrated in Figure 3.5. Let k ? be an index that achieves that
minimum. We then close the update by removing {ik? +1 , jk? }
from G, to compute new dual variables (f, g) using the approach
outlined in §3.5.1

3.5.3 Improvement of the Primal Solution

Although this was not necessarily our initial motivation, one can show
that the manipulation above can only improve the cost of P. If the
added edge has not created as cycle, case (a) above, the primal solution
remains unchanged. When a cycle is created, case (b), P is updated to
56 Algorithmic Foundations

Pn , and the following equality holds:

l l
!
X X
n
hP , Ci − hP, Ci = θ Cik ,jk − Cik+1 ,jk .
k=1 k=1

We now use the dual vectors (f, g) computed at the end of the previous
iteration. They are such that fik +gik = Cik ,jk and fik+1 +gik = Cik+1 ,jk
for all edges initially in G, resulting in the identity

l
X l
X l
X l
X
Cik ,jk − Cik+1 ,jk = Ci,j + fik + gjk − fik+1 + gjk
k=1 k=1 k=2 k=1
= Ci,j − (fi + gj ).

That term is, by definition, negative, since i, j where chosen because


Ci,j < fi − gj . Therefore, if θ > 0, we have that

hPn , Ci = hP, Ci + θ (Ci,j − (fi − fg )) < hP, Ci.

If θ = 0, which can happen if G and G(P) differ, the graph G is simply


changed, but P is not.
The network simplex algorithm can therefore be summarized as
follows: initialize the algorithm with an extremal solution P, given for
instance by the north-west corner rule as covered in § 3.4.2. Initialize
the graph G with G(P). Compute a pair of dual variables (f, g) that
are complementary to P using the linear system solve using the tree
structure(s) in G as described in §3.5.1. (i) Look for a violating pair
of indices to the constraint C − f? ⊕ g ≥ 0; If none, P is optimal and
stop. If there is a violating pair (i, j 0 ), (ii) add the edge {i, j 0 } to G. If
G still has no cycles, update (f, g) accordingly; if there is a cycle, direct
it making sure (i, j 0 ) is labeled as positive, and remove a negative edge
in that cycle with the smallest flow value, updating P, G as illustrated
in Figure 3.5, build then a complementary pair f, g accordingly; return
to (i). Some of the operations above require graph operations (cycle
detection, tree traversals) which can be implemented efficiently in this
context, as described in [Bertsekas, 1998, §5].
3.6. Matching problems 57

10 0.1 0.06 10 0.1 0 10 0.1


0.06 1 0.06 1 0.04 0.06 1 0.1
0
2 0.12 0 20 0.12 20
0.12
0.46 2 0.46 2 0.46 2 0.06
30 0.3 0.3 30 0.3 0.24 30 0.3
0.06 3 0.06 3 0.06 3
0 0 0
4 0.1 4 0.1 4 0.1
0.24 4 0.24 4 0.24 4
50 0.2 50 0.2 50 0.2
0.14 5 0.14 5 0.14 5
60 0.14 60 0.14 60 0.14

(a) {3, 30 } added (b.1) {1, 30 } added (b.2) {1, 10 } removed

Figure 3.5: Adding an edge {i, j} to the graph G(P) can result in either: (a) the
graph remains a forest after this addition, in which case f, g can be recomputed
following the approach outlined in §3.5.1; (b.1) the addition of that edge creates
a cycle, from which we can define a directed path. (b.2) the path can be used to
increase the value of Pi,j , and propagate that change along the cycle to maintain
the flow feasibility constraints, until the flow of one of the edges that is negatively
impacted by the cycle is decreased to 0. This removes the cycle and updates P.

3.6 Matching problems

The network simplex is meant to be used in the general case where


marginals a, b have arbitrary values. It is, however, easy to notice that
when both a and b are equal and both uniform, namely a = 1n /n,
any extremal solution P is a permutation matrix through Birkhoff’s
theorem [1946], as pointed out in Proposition 2.1. In that case G(P) is
a disjoint set of n atoms, and many of the updates outlined above result
in degenerate moves, making the network simplex approach above too
general and inefficient to exploit this setting. We introduce two faster
alternatives in that case.

3.6.1 Hungarian Algorithm

The Hungarian algorithm precedes the network simplex by quite a few


decades, since it can be traced back to work by Jacobi Borchardt and
Jocobi [1865] and later König and Egerváry, as recounted by Kuhn
[1955]. The Hungarian algorithm is a particular case of a more general
class of network flow optimization algorithms known as dual ascent
methods, in the sense that they maintain a candidate of dual feasible
solutions that is progressively improved. We follow the presentation
of [Bertsimas and Tsitsiklis, 1997, §7.7] and start with the following
58 Algorithmic Foundations

proposition

Proposition 3.5. A dual pair f, g is either optimal for Problem (3.4) or


there exists ε > 0 and S ⊂ JnK, S 0 ⊂ JmK such that 1S T a − 1S 0 T b > 0,
where 1S is the vector in Rn of zeros except for ones at the indices
enumerated in S, and likewise for the vector 1S 0 in Rm with indices S 0 .

3.6.2 Auction Algorithm


The auction algorithm is an alternative to the Hungarian algorithm
for the assignment problem that was originally proposed by Bertsekas
[1981] and later refined in [Bertsekas and Eckstein, 1988]. Several eco-
nomic interpretations of this algorithm have been proposed (see e.g.
Bertsekas [1992]).

Complementary Slackness. Notice that in the optimal assignemnt


problem, the primal-dual conditions presented for the optimal trans-
port problem become easier to formulate, because any extremal so-
lution P is necessarily a permutation matrix Pσ for a given σ (see
Equation (3.4)). Given primal Pσ? and dual f? , g? optimal solutions we
necessarily have that
f?i + g?σ? = Ci,σi?
i

Recall also that, because of the principle of C-transforms enunciated


in §3.2, that one can choose f? to be equal to fC̄ . We therefore have
that
Ci,σi? − g?σi = min Ci,j − g?j (3.7)
j

On the contrary, it is easy to show that if there exists a vector g and


a permutation σ such that

Ci,σi − gσi = min Ci,j − gj (3.8)


j

holds, then they are both optimal, in the sense that σ is an optimal
assignment and gC̄ , g is an optimal dual pair.

Partial Assignments and ε-Complementary Slackness. The goal of


the auction algorithm is to modify iteratively a triplet S, ξ, g, where S
3.6. Matching problems 59

is a subset of JnK, ξ a partial assignment vector, namely an injective


map from S to JnK, and g a dual vector. The dual vector is meant to
converge towards a solution satisfying an approximate complementary
slackness property (3.8), whereas S grows to cover JnK as ξ describes a
permutation. The algorithm works by maintaining the three following
properties after each iteration:

(a) ∀i ∈ S, Ci,ξi − gξi ≤ ε + minj Ci,j − gj (ε-CS).

(b) the size of S can only increase at each iteration.

(c) there exists an index i such that gi decreases by at least ε.

Auction Algorithm Updates Given a point j the auction algorithm


uses not only the optimum appearing in the usual C-transform, but
also a second best:

ji1 ∈ argminj Ci,j − gj , ji2 ∈ argminj6=j 1 Ci,j − gj ,


i

to define the following updates on g for an index i ∈


/ S, as well as on
S and ξ

1. update g : Remove to the ji1 -th entry of g the sum of ε and the
difference between the second lowest and lowest adjusted cost
{Ci,j − gj }j :
 
gj 1 ← gj 1 − (Ci,j 2 − gj 2 ) − (Ci,j 1 − gj 1 ) + ε
i i i i i i
| {z }
≥ε>0 (3.9)
= Ci,j 1 − (Ci,j 2 − gj 2 ) − ε
i i i

2. update S and ξ : If there exists an index i0 ∈ S such that


ξi0 = ji1 , remove it by updating S ← S \ {i0 }. Set ξi = ji1 and add
i to S, S ← S ∪ {i}.

Algorithmic Properties The algorithm proceeds by starting from an


empty set of assigned points S = ∅ with no assignment and empty
partial assignment vector ξ, and g = 0n , terminates when S = JnK,
and loops through both steps above until it terminates. The fact that
60 Algorithmic Foundations

properties (b) and (c) are valid after each iteration is made obvious
by the nature of the updates (it suffices to look at Equation (3.9)).
ε-complementary slackness is easy to satisfy at the first iteration since
in that case S = ∅. The fact that iterations preserve that property is
shown by the following proposition:
Proposition 3.6. The auction algorithm maintains ε-complementary
slackness at each iteration.
Proof. Let g, ξ, S be the three variables at the beginning of a given
iteration. We therefore assume that for any i0 ∈ S the relationship
Ci,ξi0 − gξi0 ≤ ε + min Ci0 ,j − gj
j

holds. Consider now the particular i ∈ / S considered in an iteration.


Three updates happen: g, ξ, S are updated to gn , ξ n , S n using indices
ji1 and ji2 . More precisely, gn is equal to g except for element ji1 , whose
value is equal to
 
gnj 1 = gj 1 − (Ci,j 2 − gj 2 ) − (Ci,j 1 − gj 1 ) − ε ≤ gj 1 − ε
i i i i i i i

ξn is equal to ξ except for its i-th element equal to and S n is equal ji1 ,
to the union of {i} with S (with possibly one element removed). The
update of gn can be rewritten
gnj 1 = Ci,j 1 − (Ci,j 2 − gj 2 ) − ε,
i i i i

therefore we have
Ci,j 1 − gnj 1 = ε + (Ci,j 2 − gj 2 ) = ε + min(Ci,j − gj )
i i i i j6=ji1

Since −g ≤ −gn this implies that


Ci,j 1 − gnj 1 = ε + min(Ci,j − gj ) ≤ ε + min(Ci,j − gnj ),
i i j6=ji1 j6=ji1

and since the inequality is also obviously true for j = ji1 we there-
fore obtain the ε-complementary slackness property for index i. For
other indices i0 6= i, we have again that since gn ≤ g the sequence of
inequalities holds
Ci,ξn0 − gnξn0 = Ci,ξi0 − gξi0 ≤ ε + min Ci0 ,j − gj ≤ ε + min Ci0 ,j − gnj
i i j j
3.6. Matching problems 61

Proposition 3.7. The number of steps of the auction algorithm is at


most N = nkCk∞ /ε.

Proof. Suppose that the algorithm has not stopped after T > N steps.
Then there exists an index j which is not in the image of ξ, namely
whose price coordinate gj has never been updated and is still gj = 0.
In that case, there cannot exist an index j 0 such that gj 0 was updated
n times with n > kCk∞ /ε. Indeed, if that were the case then for any
index i
gj 0 ≤ −nε < −kCk∞ ≤ −Ci,j = gj − Ci,j ,

which would result in, for all i

Ci,j 0 − gj 0 > Ci,j + (Ci,j − gj ),

which contradicts ε-CS. Therefore, since there cannot be more than


kCk∞ /ε updates for each variable, T cannot be larger than nkCk∞ /ε =
N.

Remark 3.3. Note that this result yields a naive number of opera-
tions of N 3 kCk∞ /ε for the algorithm to terminate. That complexity
can be reduced to N 3 log kCk∞ when using a clever method known as
ε-scaling, designed to decrease the value of ε with each iteration [Bert-
sekas, 1998, p.264].

Proposition 3.8. The auction algorithm finds an assignment whose


cost is nε suboptimal.

Proof. Let σ, g? be the primal and dual optimal solutions of the assign-
ment problem of matrix C, with optimum
X X X
t? = Ci,σi = min Ci,j − g?j + g?j .
j
i j

Let ξ, g be the solutions outputted by the auction algorithm upon ter-


mination. The ε-CS conditions yield that for any i ∈ S,

min Ci,j − gj ≥ Ci,ξi − gξi − ε.


j
62 Algorithmic Foundations

Therefore by simple suboptimality of g we first have


X  X
?
t ≥ min Ci,j − gj + gj
j
i j
X   X X
≥ −ε + Ci,ξi − gξi + gj = −nε + Ci,ξj ≥ −nε + t? .
i j i

where the second inequality comes from ε-CS, the equality next by
cancellation of the sum of terms in gξi and gj , and the last inequality
by the suboptimality of ξ as a permutation.

The auction algorithm can therefore be regarded as an alternative


way to use the machinery of C-transforms. We explore next another
approach grounded on regularization, the so-called Sinkhorn algorithm,
which also bears similarities with the auction algorithm as discussed
in [Schmitzer, 2016b].
Note finally that, on low-dimensional regular grids in Euclidean
space, it is possible to couple these classical linear solvers with mul-
tiscale strategies, to obtain a significant speed-up [Schmitzer, 2016a,
Oberman and Ruan, 2015].
4
Entropic Regularization of Optimal Transport

This chapter introduces a family of numerical schemes to approxi-


mate solutions to Kantorovich formulation of optimal transport and
its many generalizations. It operates by adding an entropic regular-
ization penalty to the original problem. This regularization has several
important advantages, but a few stand out particularly: The minimiza-
tion of the regularized problem can be solved using a simple alternate
minimization scheme; that scheme translates into iterations that are
simple matrix products, making them particularly suited to execution
of GPU; the resulting approximate distance is smooth with respect to
input histogram weights and positions of the Diracs.

4.1 Entropic Regularization

The discrete entropy of a coupling matrix is defined as


def.
X
H(P) = − Pi,j (log(Pi,j ) − 1), (4.1)
i,j

with an analogous definition for vectors, with the convention that


H(a) = −∞ if one of the entries aj is 0 or negative. The function H is
1-strongly concave, because its hessian is ∂ 2 H(P ) = − diag(1/Pi,j ) and

63
64 Entropic Regularization of Optimal Transport

c
P"

"
Figure 4.1: Impact of ε on the optimization of a linear function on the simplex,
solving Pε = argminP∈Σ3 hC, Pi − εH(P) for a varying ε.

Pi,j ≤ 1. The idea of the entropic regularization of optimal transport is


to use −H as a regularizing function to obtain approximate solutions
to the original transport problem (2.11):

LεC (a, b) =
def.
min hP, Ci − εH(P). (4.2)
P∈U(a,b)

Since the objective is a ε-strongly convex function, problem 4.2 has a


unique optimal solution. The idea to regularize the optimal transport
problem by an entropic term can be traced back to modeling ideas
in transportation theory [Wilson, 1969]: Actual traffic patterns in a
network do not agree with those predicted by the solution of the optimal
transport problem. Indeed, the former are more diffuse than the latter,
which tend to rely on a few routes as a result of the sparsity of optimal
couplings to the solution of 2.11. To balance for that, researchers in
transportation proposed a model, called the “gravity” model [Erlander,
1980], that is able to form a more “blurred” traffic prediction.
Figure 4.1 illustrates the effect of the entropy to regularize a linear
program over the simples Σ3 (which can thus be visualized as a tri-
angle in 2-D). Note how the entropy pushes the original LP solution
away from the boundary of the triangle. The optimal Pε progressively
moves toward an “entropic center” of the triangle. This is further de-
tailed in the proposition below. The convergence of the solution of that
regularized problem towards an optimal solution of the original linear
program has been studied by Cominetti and San Martín [1994].

Proposition 4.1 (Convergence with ε). The unique solution Pε of (4.2)


converges to the optimal solution with maximal entropy within the set
4.1. Entropic Regularization 65

of all optimal solutions of the Kantorovich problem, namely


ε→0
Pε −→ argmin {−H(P) : P ∈ U(a, b), hP, Ci = LC (a, b)} (4.3)
P

so that in particular
ε→0
LεC (a, b) −→ LC (a, b).

One has
ε→∞
Pε −→ abT = (ai bj )i,j . (4.4)

Proof. We consider a sequence (ε` )` such that ε` → 0 and ε` > 0. We


denote P` the solution of (4.2) for ε = ε` . Since U(a, b) is bounded, we
can extract a sequence (that we do not relabel for sake of simplicity)
such that P` → P? . Since U(a, b) is closed, P? ∈ U(a, b). We consider
any P such that hC, Pi = LC (a, b). By optimality of P and P` for
their respective optimization problems (for ε = 0 and ε = ε` ), one has

0 ≤ hC, P` i − hC, Pi ≤ ε` (H(P` ) − H(P)). (4.5)

Since H is continuous, taking the limit ` → +∞ in this expression


shows that hC, P? i = hC, Pi so that P? is a feasible point of (4.3).
Furthermore, dividing by ε` in (4.5) and taking the limit shows that
H(P) ≤ H(P? ), which shows that P? is a solution of (4.3). Since the
solution P?0 to this program is unique by strict convexity of −H, one has
P? = P?0 , and the whole sequence is converging. In the limit ε → +∞,
a similar proof shows that one should rather consider the problem

min − H(P)
P∈U(a,b)

the solution of which is a ⊗ b.

Formula (4.3) states that for low regularization, the solution con-
verges to the maximum entropy optimal transport coupling. In sharp
contrast, (4.4) shows that for large regularization, the solution con-
verges to the coupling with maximal entropy between two prescribed
marginals a, b, namely the joint probability between two independent
random variables with prescribed distributions. A refined analysis of
this convergence is performed in Cominetti and San Martín [1994],
66 Entropic Regularization of Optimal Transport

ε = 10 ε=1 ε = 10−1 ε = 10−2

Figure 4.2: Impact of ε on the couplings between two 1-D densities, illustrating
Proposition 4.1. Top row: between two 1-D densities. Bottom row: between two 2-D
discrete empirical densities with same number n = m of points (only entries of the
optimal (Pi,j )i,j above a small threshold are displayed as segments between xi and
yj ).

including a first order expansion in ε (resp. 1/ε) near ε = 0 (resp


ε = +∞). Figures 4.2 and 4.3 shows visually the effect of these two
convergence. A key insight is that, as ε increases, the optimal coupling
becomes less and less sparse (in the sense of having entries larger than
a prescribed thresholds), which in turn as the effect of both accelerat-
ing computational algorithms (as we study in §4.2) but also leading to
faster statistical convergence (as exposed in §8.5).
Defining the Kullback-Leibler divergence between couplings as
!
def.
X Pi,j
KL(P|K) = Pi,j log − Pi,j + Ki,j , (4.6)
i,j
Ki,j

the unique solution Pε of (4.2) is a projection onto U(a, b) of the Gibbs


kernel associated to the cost matrix C as
Ci,j
Ki,j = e−
def.
ε
4.1. Entropic Regularization 67

"
Figure 4.3: Impact of ε on coupling between two 2-D discrete empirical densities
with same number n = m of points (only entries of the optimal (Pi,j )i,j above a
small threshold are displayed as segments between xi and yj ).

Indeed one has that using the definition above


Pε = ProjKL
def.
U(a,b) (K) = argmin KL(P|K). (4.7)
P∈U(a,b)

Remark 4.1 (Entropic regularization between discrete measures).


For discrete measures of the form (2.1), the definition of regularized
transport extends naturally to

Lεc (α, β) = LεC (a, b),


def.
(4.8)

with cost Ci,j = c(xi , yj ), to emphasize the dependency with re-


spect to the positions (xi , yj ) supporting the input measures.

Remark 4.2 (General formulation). One can consider arbitrary


measures by replacing the discrete entropy by the relative entropy
def.
with respect to the product measure dα ⊗ dβ(x, y) = dα(x)dβ(y),
and propose a regularized counterpart to (2.15) using
Z
Lεc (α, β)
def.
= min c(x, y)dπ(x, y) + ε KL(π|α ⊗ β) (4.9)
π∈U (α,β) X×Y

where the relative entropy is a generalization of the discrete


Kullback-Leibler divergence (4.6)
Z  dπ 
def.
KL(π|ξ) = log (x, y) dπ(x, y)+
X ×Y dξ
Z (4.10)
(dξ(x, y) − dπ(x, y)),
X ×Y
68 Entropic Regularization of Optimal Transport

and by convention KL(π|ξ) = +∞ if π does not have a density



dξ with respect to ξ. It is important to realize that the reference
measure α ⊗ β chosen in (4.9) to define the entropic regularizing
term KL(·|α ⊗ β) plays no specific role, only its support matters,
as noted by the following proposition.

Proposition 4.2. For any π ∈ U(α, β), and for any (α0 , β 0 ) with
the same support as (α, β) (so that they have both densities with
respect to one another) one has

KL(π|α ⊗ β) = KL(π|α0 ⊗ β 0 ) + KL(α0 ⊗ β 0 |α ⊗ β).

This proposition shows that choosing KL(·|α0 ⊗ β 0 ) in place of


KL(·|α ⊗ β) in (4.9) results in the same solution.
Formula (4.9) can be re-factored as a projection problem

min KL(π|K) (4.11)


π∈U (α,β)

c(x,y)
where K is the Gibbs distributions dK(x, y) = e− ε dα(x)dβ(y).
def.

This problem is often referred to as the “static Schrödinger


problem” [Léonard, 2014, Rüschendorf and Thomsen, 1998],
since it was initially considered by Schrödinger in statistical
physics [Schrödinger, 1931]. As ε → 0, the unique solution to (4.11)
converges to the maximum entropy solution to (2.15), see [Léonard,
2012, Carlier et al., 2017]. §7.6 details an alternate “dynamic”
formulation of the Schrödinger problem over the space of paths
connecting the points of two measures.

Remark 4.3 (Independence and couplings). A coupling


π ∈ U(α, β) describes the distribution of a couple of ran-
dom variables (X, Y ) defined on (X , Y), where X (resp. Y )
has law α (resp. β). Proposition 4.1 caries over for generic
(non-necessary discrete) measures, so that the solution πε of (4.9)
converges to the tensor product coupling α ⊗ β as ε → +∞. This
coupling α ⊗ β corresponds to the random variables (X, Y ) being
independent. In contrast, as ε → 0, πε convergence to a solution
4.2. Sinkhorn’s Algorithm and its Convergence 69

π0 of the OT problem (2.15). On $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$, if α and β have densities with respect to the Lebesgue measure, as detailed in Remark 2.23, then π0 is unique and supported on the graph of a bijective Monge map $T : \mathbb{R}^d \to \mathbb{R}^d$. In this case, (X, Y) are in some sense fully dependent, since Y = T(X) and X = T⁻¹(Y). In the simple 1-D case d = 1, a convenient way to visualize the dependency structure between X and Y is to use the copula ξπ associated to the joint distribution π. The cumulative function defined in (2.32) is extended to couplings as
$$\forall (x,y) \in \mathbb{R}^2, \quad \mathcal{C}_\pi(x,y) \stackrel{\text{def.}}{=} \int_{-\infty}^{x}\int_{-\infty}^{y} \mathrm{d}\pi.$$

The copula is then defined as
$$\forall (s,t) \in [0,1]^2, \quad \xi_\pi(s,t) \stackrel{\text{def.}}{=} \mathcal{C}_\pi\big(\mathcal{C}_\alpha^{-1}(s), \mathcal{C}_\beta^{-1}(t)\big),$$
where the pseudo-inverse of a cumulative function is defined in (2.33). For independent variables, ε = +∞, i.e. π = α ⊗ β, one has $\xi_{\pi_{+\infty}}(s,t) = st$. In contrast, for fully dependent variables, ε = 0, one has $\xi_{\pi_0}(s,t) = \min(s,t)$. Figure 4.4 shows how entropic regularization generates copulas $\xi_{\pi_\varepsilon}$ interpolating between these two extreme cases.

4.2 Sinkhorn’s Algorithm and its Convergence

The following proposition shows that the solution of (4.2) has a specific
form, which can be parameterized using n + m variables. That param-
eterization is therefore essentially dual, in the sense that a coupling P
in U(a, b) has nm variables but n + m constraints.
Proposition 4.3. The solution to (4.2) is unique and has the form
$$\forall (i,j) \in \llbracket n\rrbracket \times \llbracket m\rrbracket, \quad P_{i,j} = u_i K_{i,j} v_j \qquad (4.12)$$
for two (unknown) scaling variables $(u, v) \in \mathbb{R}^n_+ \times \mathbb{R}^m_+$.

Proof. Introducing two dual variables f ∈ Rn , g ∈ Rm for each marginal


constraint, the Lagrangian of (4.2) reads
$$\mathcal{E}(P, f, g) = \langle P, C\rangle - \varepsilon H(P) - \langle f, P 1_m - a\rangle - \langle g, P^\top 1_n - b\rangle.$$

Figure 4.4: Top: evolution with ε of the solution πε of (4.9), for ε ∈ {10, 1, 0.5·10⁻¹, 10⁻¹, 10⁻³}. Bottom: evolution of the corresponding copula function $\xi_{\pi_\varepsilon}$.

Considering first order conditions, we have
$$\frac{\partial \mathcal{E}(P, f, g)}{\partial P_{i,j}} = C_{i,j} + \varepsilon \log(P_{i,j}) - f_i - g_j.$$
Setting this derivative to zero for an optimal coupling P of the regularized problem yields the expression $P_{i,j} = e^{f_i/\varepsilon} e^{-C_{i,j}/\varepsilon} e^{g_j/\varepsilon}$, which can be rewritten in the form provided in the proposition using nonnegative vectors u and v.

Regularized OT as Matrix Scaling The factorization of the optimal solution exhibited in Equation (4.12) can be conveniently rewritten in matrix form as P = diag(u) K diag(v). The vectors u, v must therefore satisfy the following nonlinear equations, which correspond to the mass conservation constraints inherent to U(a, b):
$$\mathrm{diag}(u)\, K\, \mathrm{diag}(v)\, 1_m = a \quad\text{and}\quad \mathrm{diag}(v)\, K^\top \mathrm{diag}(u)\, 1_n = b. \qquad (4.13)$$
These two equations can be further simplified, since diag(v)1_m is simply v, and the multiplication of diag(u) against Kv is $u \odot (Kv)$; they thus read
$$u \odot (K v) = a \quad\text{and}\quad v \odot (K^\top u) = b, \qquad (4.14)$$

where ⊙ corresponds to entry-wise multiplication of vectors. That problem is known in the numerical analysis community as the matrix scaling problem (see [Nemirovski and Rothblum, 1999] and references therein). An intuitive way to solve these equations is to address them iteratively, by modifying first u so that it satisfies the left-hand side of Equation (4.14) and then v to satisfy its right-hand side. These two updates define Sinkhorn’s algorithm:
$$u^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{a}{K v^{(\ell)}} \quad\text{and}\quad v^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{b}{K^\top u^{(\ell+1)}}, \qquad (4.15)$$
initialized with an arbitrary positive vector $v^{(0)} = 1_m$. The division operator used above between two vectors is to be understood entry-wise. Note that a different initialization will likely lead to a different solution for u, v, since u, v are only defined up to a multiplicative constant (if u, v satisfy (4.13) then so do λu, v/λ for any λ > 0). It turns out however that these iterations converge (see Remark 4.7 for a justification using iterative projections, and Remark 4.12 for a strict contraction result) and all result in the same optimal coupling diag(u) K diag(v). Figure 4.5, top row, shows the evolution of the coupling $\mathrm{diag}(u^{(\ell)}) K \mathrm{diag}(v^{(\ell)})$ computed by Sinkhorn iterations. It evolves from the Gibbs kernel K towards the optimal coupling solving (4.2) by progressively shifting the mass away from the diagonal.
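As an illustration, here is a minimal NumPy sketch of the iterations (4.15); the histograms a, b, the cost matrix C, the regularization ε and the number of iterations are assumed to be supplied by the user, and a practical solver would rather monitor the marginal violation discussed in Remark 4.12 to decide when to stop.

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=1000):
    """Return the entropic-regularized coupling diag(u) K diag(v)."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    v = np.ones_like(b)                  # initialization v^(0) = 1_m
    for _ in range(n_iter):
        u = a / (K @ v)                  # enforce the row marginals
        v = b / (K.T @ u)                # enforce the column marginals
    return u[:, None] * K * v[None, :]

# Small usage example on two histograms supported on a 1-D grid in [0, 1].
n = 50
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :]) ** 2
a = np.full(n, 1 / n)
b = np.random.default_rng(0).random(n); b /= b.sum()
P = sinkhorn(a, b, C, eps=1e-2)
print(np.abs(P.sum(axis=1) - a).max())   # residual violation of the first marginal
```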

Remark 4.4 (Historical Perspective). This algorithm was originally in-


troduced with a proof of convergence by Sinkhorn [1964] with later
contributions [Sinkhorn and Knopp, 1967, Sinkhorn, 1967]. It was used
earlier as a heuristic to scale a matrix so that it fits desired marginals
(typically uniform) under the name of Iterative proportional fitting
(IPFP) Deming and Stephan [1940] and RAS Bacharach [1965] meth-
ods [Idel, 2016], and later extended in infinite dimensions by Ruschen-
dorf [1995]. It was adopted very early as well in the field of eco-
nomics, precisely to obtain approximate solutions to optimal transport
problems, under the name of gravity models [Wilson, 1969, Erlander,
1980, Erlander and Stewart, 1990]. It was rebranded as “softassign”
by Kosowsky and Yuille [1994] in the assignment case, namely when
a = b = 1n /n, and used to solve matching problems in economics

Figure 4.5: Top: evolution of the coupling $\pi_\varepsilon^{(\ell)} = \mathrm{diag}(u^{(\ell)}) K \mathrm{diag}(v^{(\ell)})$ computed at iteration ℓ of Sinkhorn’s iterations, for 1-D densities on X = [0, 1], c(x, y) = |x − y|², and ε = 0.1. Bottom: impact of ε ∈ {10, 0.1, 10⁻³} on the convergence rate of Sinkhorn, as measured in terms of the marginal constraint violation $\log(\|\pi_\varepsilon^{(\ell)} 1_m - b\|_1)$ as a function of ℓ.

more recently by Galichon and Salanié [2009]. This regularization has


received recently renewed attention in data sciences (including ma-
chine learning, vision, graphics and imaging) following [Cuturi, 2013],
who showed that Sinkhorn’s algorithm provides a scalable way to ap-
proximate optimal transport, thanks to seamless parallelization when
solving several OT problems simultaneously (notably on GPUs, see
Remark 4.14) and that it defines, unlike the linear programming for-
mulation, a differentiable loss function (see §4.5). There exist countless
extensions and generalizations of the Sinkhorn algorithm (see for in-
stance §4.6). For instance, when a = b, one can use averaged projection
iterations to maintain symmetry Knight et al. [2014].

Remark 4.5 (Overall complexity). By doing a careful convergence analysis (assuming n = m for the sake of simplicity), Altschuler et al. [2017] showed that, by setting $\varepsilon = \frac{\tau}{4\log(n)}$, $O(\|C\|_\infty^3 \log(n)\,\tau^{-3})$ Sinkhorn iterations (with an additional rounding step to compute a valid coupling $\hat{P} \in U(a,b)$) are enough to ensure that $\langle \hat{P}, C\rangle \leq \mathrm{L}_C(a,b) + \tau$. This implies that Sinkhorn computes a τ-approximate solution of the unregularized OT problem in $O(n^2 \log(n)\tau^{-3})$ operations. The rounding scheme consists, given two vectors $u \in \mathbb{R}^n$, $v \in \mathbb{R}^m$, in carrying out the following updates [Altschuler et al., 2017, Alg. 2]:
$$u' \stackrel{\text{def.}}{=} u \odot \min\Big(\frac{a}{u \odot (Kv)}, 1_n\Big), \qquad v' \stackrel{\text{def.}}{=} v \odot \min\Big(\frac{b}{v \odot (K^\top u')}, 1_m\Big),$$
$$\Delta_a \stackrel{\text{def.}}{=} a - u' \odot (Kv'), \qquad \Delta_b \stackrel{\text{def.}}{=} b - v' \odot (K^\top u'),$$
$$\hat{P} \stackrel{\text{def.}}{=} \mathrm{diag}(u')\, K\, \mathrm{diag}(v') + \Delta_a (\Delta_b)^\top / \|\Delta_a\|_1.$$
This yields a matrix $\hat{P} \in U(a,b)$ such that the $\ell^1$ norm between $\hat{P}$ and diag(u) K diag(v) is controlled by the marginal violations of diag(u) K diag(v), namely
$$\big\|\hat{P} - \mathrm{diag}(u) K \mathrm{diag}(v)\big\|_1 \leq \|a - u \odot (Kv)\|_1 + \big\|b - v \odot (K^\top u)\big\|_1.$$
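In NumPy, this rounding step can be sketched as follows (a direct transcription of the updates above, assuming Δ_a ≠ 0; u, v are the scalings produced by Sinkhorn and K the Gibbs kernel):

```python
import numpy as np

def round_coupling(u, v, K, a, b):
    """Project diag(u) K diag(v) onto U(a, b), following [Altschuler et al., 2017, Alg. 2]."""
    u1 = u * np.minimum(a / (u * (K @ v)), 1.0)
    v1 = v * np.minimum(b / (v * (K.T @ u1)), 1.0)
    P = u1[:, None] * K * v1[None, :]
    delta_a = a - P.sum(axis=1)          # equals a - u' * (K v')
    delta_b = b - P.sum(axis=0)          # equals b - v' * (K^T u')
    return P + np.outer(delta_a, delta_b) / np.abs(delta_a).sum()
```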

Remark 4.6 (Numerical Stability of Sinkhorn Iterations). As we discuss in Remarks 4.12 and 4.13, the convergence of Sinkhorn’s algorithm deteriorates as ε → 0. In numerical practice, however, that slowdown is rarely observed, for a simpler reason: Sinkhorn’s algorithm will often fail to terminate as soon as some of the elements of the kernel K become too negligible to be stored in memory as positive numbers, and are instead rounded to zero. This can then result in a matrix product Kv or K⊤u with ever smaller entries that become null and result in a division by 0 in the Sinkhorn update of Equation (4.15). Such issues can be partly resolved by carrying out computations on the multipliers u and v in the log domain. That approach is carefully presented in Remark 4.22, and related to a direct resolution of the dual of problem (4.2).

Remark 4.7 (Relation with iterative projections). Denoting
$$\mathcal{C}_a^1 \stackrel{\text{def.}}{=} \{P : P 1_m = a\} \quad\text{and}\quad \mathcal{C}_b^2 \stackrel{\text{def.}}{=} \{P : P^\top 1_n = b\}$$
the rows and columns constraints, one has $U(a,b) = \mathcal{C}_a^1 \cap \mathcal{C}_b^2$. One can

use Bregman iterative projections [Bregman, 1967],
$$P^{(\ell+1)} \stackrel{\text{def.}}{=} \mathrm{Proj}^{\mathrm{KL}}_{\mathcal{C}_a^1}(P^{(\ell)}) \quad\text{and}\quad P^{(\ell+2)} \stackrel{\text{def.}}{=} \mathrm{Proj}^{\mathrm{KL}}_{\mathcal{C}_b^2}(P^{(\ell+1)}). \qquad (4.16)$$
Since the sets $\mathcal{C}_a^1$ and $\mathcal{C}_b^2$ are affine, these iterations are known to converge to the solution of (4.7); see [Bregman, 1967]. These iterates are equivalent to Sinkhorn iterations (4.15) since, defining
$$P^{(2\ell)} \stackrel{\text{def.}}{=} \mathrm{diag}(u^{(\ell)})\, K\, \mathrm{diag}(v^{(\ell)}),$$
one has
$$P^{(2\ell+1)} \stackrel{\text{def.}}{=} \mathrm{diag}(u^{(\ell+1)})\, K\, \mathrm{diag}(v^{(\ell)}) \quad\text{and}\quad P^{(2\ell+2)} \stackrel{\text{def.}}{=} \mathrm{diag}(u^{(\ell+1)})\, K\, \mathrm{diag}(v^{(\ell+1)}).$$
In practice, however, one should prefer using (4.15), which only requires manipulating scaling vectors and multiplying against a Gibbs kernel, operations that can often be accelerated (see Remarks 4.15 and 4.17 below).

Remark 4.8 (Other regularizations). It is possible to replace the entropic


term −H(P) in (4.2) by any strictly convex penalty R(P), as detailed
for instance in [Dessein et al., 2016]. A typical example is the squared
$\ell^2$ norm
$$R(P) = \sum_{i,j} P_{i,j}^2 + \iota_{\mathbb{R}_+}(P_{i,j}), \qquad (4.17)$$

see [Essid and Solomon, 2017]. Another example is the family of Tsal-
lis entropies [Muzellec et al., 2017]. Note however that if the penalty
function is defined even when entries of P are non-positive, which is
for instance the case for a quadratic regularization (4.17), then one
must add back a non-negativity constraint P ≥ 0, in addition to the
marginal constraints P1m = a and P> 1n = b (one can afford to ignore
the nonnegativity constraint using entropy because that penalty incor-
porates a logarithmic term which forces the entries of P to stay in the
positive orthant). This implies that the set of constraints is no longer
affine and iterative Bregman projections do not converge anymore to
the solution. A workaround is to use instead Dykstra’s algorithm [Dyk-
stra, 1983, 1985, Bauschke and Lewis, 2000], as detailed in [Benamou
et al., 2015]. This algorithm uses projections according to the Bregman

Figure 4.6: Comparison of entropic regularization R = −H (top row, ε ∈ {10, 1, 0.5·10⁻¹, 10⁻¹, 10⁻³}) and quadratic regularization R = ‖·‖² + ι_{ℝ₊} (bottom row, ε ∈ {5·10³, 10³, 10², 10, 1}). The (α, β) marginals are the same as for Figure 4.4.

divergence associated to R. We refer to Remark 8.1 for more details


regarding Bregman divergences. An issue is that in general these pro-
jections cannot be computed explicitly. For the squared norm (4.17),
this corresponds to computing the Euclidean projection onto $(\mathcal{C}_a^1, \mathcal{C}_b^2)$ (with the extra positivity constraints), which can be solved efficiently using projection algorithms on simplices [Condat, 2015]. The main advantage of the quadratic regularization over entropy is that it produces sparse approximations of the optimal coupling, yet this comes at the expense of a slower algorithm that cannot be parallelized as efficiently as Sinkhorn to compute several optimal transports simultaneously (as discussed in Remark 4.14). Figure 4.6 contrasts the approximation achieved
by entropic and quadratic regularizers.

Remark 4.9 (Barycentric projection). The Kantorovitch formula-


tion (2.11) and its entropic regularization (4.2) both yield an opti-
mal coupling P ∈ U(a, b). In order to define a transportation map
T : X → Y, in the case where Y = Rd , one can define the so-called

Figure 4.7: Left: the Hilbert metric $d_H$ is a distance over rays in cones (here positive vectors). Right: visualization of the contraction induced by the iteration of a positive matrix K.

barycentric projection map
$$T : x_i \in \mathcal{X} \longmapsto \frac{1}{a_i}\sum_{j} P_{i,j}\, y_j \in \mathcal{Y}, \qquad (4.18)$$
where here the input measures are discrete of the form (2.3). Note that this map is only defined for points $(x_i)_i$ of the support of α. In the case where P is a permutation matrix (as detailed in Proposition 2.1), then T is equal to a Monge map, and as ε → 0, the barycentric projection progressively converges to that map if it is unique. For arbitrary (not necessarily discrete) measures, solving (2.15) or its regularized version (4.9) defines a coupling π ∈ U(α, β). Note that for ε > 0, because of the entropic regularization, this coupling π has a density $\frac{\mathrm{d}\pi(x,y)}{\mathrm{d}\alpha(x)\mathrm{d}\beta(y)}$ with respect to α ⊗ β. A map can thus be retrieved by the formula
$$T : x \in \mathcal{X} \longmapsto \int_{\mathcal{Y}} y\, \frac{\mathrm{d}\pi(x,y)}{\mathrm{d}\alpha(x)\mathrm{d}\beta(y)}\, \mathrm{d}\beta(y). \qquad (4.19)$$
In the case where, for ε = 0, π is supported on the graph of the Monge map (see Remark 2.23), then using ε > 0 produces a smooth approximation of this map. Such a barycentric projection is useful to apply the OT Monge map to solve problems in imaging; see Figure 9.6 for an application to color modification. It has also been used to compute approximations of Principal Geodesic Analysis (PGA) in the space of probability measures endowed with the Wasserstein metric, see [Seguy and Cuturi, 2015].
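For discrete inputs, the barycentric projection (4.18) is a one-liner; the sketch below assumes P is the (n, m) coupling returned by a solver and y an (m, d) array holding the support of β.

```python
import numpy as np

def barycentric_projection(P, y):
    """Map each source point x_i to (1/a_i) * sum_j P_ij y_j."""
    a = P.sum(axis=1, keepdims=True)     # first marginal of the coupling
    return (P @ y) / a
```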

Remark 4.10 (Hilbert metric). As initially explained by [Franklin and Lorenz, 1989], the global convergence analysis of Sinkhorn is greatly simplified using the Hilbert projective metric on $\mathbb{R}^n_{+,*}$ (positive vectors), defined as
$$\forall (u, u') \in (\mathbb{R}^n_{+,*})^2, \quad d_H(u, u') \stackrel{\text{def.}}{=} \log \max_{i,j} \frac{u_i u'_j}{u_j u'_i}.$$
It can be shown to be a distance on the projective cone $\mathbb{R}^n_{+,*}/\!\sim$, where $u \sim u'$ means that $\exists r > 0,\, u = r u'$ (the vectors are equal up to rescaling, hence the naming “projective”). This means that $d_H$ satisfies the triangle inequality and $d_H(u, u') = 0$ if and only if $u \sim u'$. This is a projective version of Hilbert’s original distance on bounded open convex sets Hilbert [1895]. The projective cone $\mathbb{R}^n_{+,*}/\!\sim$ is a complete metric space for this distance. By a logarithmic change of variables, the Hilbert metric on the rays of the positive cone is isometric to the variation norm between vectors that are defined up to an additive constant:
$$d_H(u, u') = \|\log(u) - \log(u')\|_{\mathrm{var}} \quad\text{where}\quad \|f\|_{\mathrm{var}} \stackrel{\text{def.}}{=} (\max_i f_i) - (\min_i f_i). \qquad (4.20)$$
The Hilbert metric was introduced independently by [Birkhoff, 1957] and [Samelson et al., 1957]. They proved the following fundamental theorem, which shows that a positive matrix is a strict contraction on the cone of positive vectors.

Theorem 4.1. Let $K \in \mathbb{R}^{n\times m}_{+,*}$. Then for $(v, v') \in (\mathbb{R}^m_{+,*})^2$,
$$d_H(Kv, Kv') \leq \lambda(K)\, d_H(v, v') \quad\text{where}\quad
\begin{cases}
\lambda(K) \stackrel{\text{def.}}{=} \dfrac{\sqrt{\eta(K)}-1}{\sqrt{\eta(K)}+1} < 1,\\[2mm]
\eta(K) \stackrel{\text{def.}}{=} \max\limits_{i,j,k,\ell} \dfrac{K_{i,k}\,K_{j,\ell}}{K_{j,k}\,K_{i,\ell}}.
\end{cases}$$
Figure 4.7 illustrates this theorem.
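The following NumPy sketch evaluates $d_H$ through the variation-norm formula (4.20) and checks Theorem 4.1 numerically on a random positive kernel (the matrix sizes and random draws are arbitrary choices for illustration).

```python
import numpy as np

def hilbert_metric(u, up):
    """d_H(u, u') = variation norm of log(u) - log(u'), cf. (4.20)."""
    r = np.log(u) - np.log(up)
    return r.max() - r.min()

rng = np.random.default_rng(0)
K = rng.random((40, 30)) + 1e-3                     # positive matrix
# eta(K) = max over (i, j, k, l) of K[i,k] K[j,l] / (K[j,k] K[i,l])
eta = (K[:, None, :, None] * K[None, :, None, :]
       / (K[None, :, :, None] * K[:, None, None, :])).max()
lam = (np.sqrt(eta) - 1) / (np.sqrt(eta) + 1)       # contraction factor of Theorem 4.1
v, vp = rng.random(30) + 1e-3, rng.random(30) + 1e-3
print(hilbert_metric(K @ v, K @ vp) <= lam * hilbert_metric(v, vp))  # expected: True
```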

Remark 4.11 (Perron-Frobenius). A typical application of Theo-


rem 4.1 is to provide a quantitative proof of Perron-Frobenius

Figure 4.8: Evolution of $K^\ell \Sigma_3 \rightarrow \{p^\star\}$, the invariant probability distribution of $K \in \mathbb{R}^{3\times 3}_{+,*}$ with $K^\top 1_3 = 1_3$.

theorem, which, as explained in Remark 4.13, is linked to a local linearization of Sinkhorn’s iterates. A matrix $K \in \mathbb{R}^{n\times n}_{+}$ with $K^\top 1_n = 1_n$ maps $\Sigma_n$ into $\Sigma_n$. If furthermore K > 0, then according to Theorem 4.1 it is a strict contraction for the metric $d_H$, hence there exists a unique invariant probability distribution $p^\star \in \Sigma_n$ with $K p^\star = p^\star$. Furthermore, for any $p_0 \in \Sigma_n$, $d_H(K^\ell p_0, p^\star) \leq \lambda(K)^\ell\, d_H(p_0, p^\star)$, i.e. one has linear convergence of the iterates of the matrix toward $p^\star$. This is illustrated in Figure 4.8.

Remark 4.12 (Global convergence). The following theorem, proved by [Franklin and Lorenz, 1989], makes use of Theorem 4.1 to show the linear convergence of Sinkhorn’s iterations.

Theorem 4.2. One has $(u^{(\ell)}, v^{(\ell)}) \to (u^\star, v^\star)$ and
$$d_H(u^{(\ell)}, u^\star) = O(\lambda(K)^{2\ell}), \qquad d_H(v^{(\ell)}, v^\star) = O(\lambda(K)^{2\ell}). \qquad (4.21)$$
One also has
$$d_H(u^{(\ell)}, u^\star) \leq \frac{d_H(P^{(\ell)} 1_m, a)}{1 - \lambda(K)}, \qquad d_H(v^{(\ell)}, v^\star) \leq \frac{d_H(P^{(\ell),\top} 1_n, b)}{1 - \lambda(K)}, \qquad (4.22)$$

where we denoted $P^{(\ell)} \stackrel{\text{def.}}{=} \mathrm{diag}(u^{(\ell)})\, K\, \mathrm{diag}(v^{(\ell)})$. Lastly, one has
$$\|\log(P^{(\ell)}) - \log(P^\star)\|_\infty \leq d_H(u^{(\ell)}, u^\star) + d_H(v^{(\ell)}, v^\star), \qquad (4.23)$$
where $P^\star$ is the unique solution of (4.2).

Proof. One notices that for any $(v, v') \in (\mathbb{R}^m_{+,*})^2$, one has
$$d_H(v, v') = d_H(v/v', 1_m) = d_H(1_m/v, 1_m/v').$$
This shows that
$$d_H(u^{(\ell+1)}, u^\star) = d_H\Big(\frac{a}{K v^{(\ell)}}, \frac{a}{K v^\star}\Big) = d_H(K v^{(\ell)}, K v^\star) \leq \lambda(K)\, d_H(v^{(\ell)}, v^\star),$$
where we used Theorem 4.1. This shows (4.21). One also has, using the triangle inequality,
$$d_H(u^{(\ell)}, u^\star) \leq d_H(u^{(\ell+1)}, u^{(\ell)}) + d_H(u^{(\ell+1)}, u^\star)
\leq d_H\Big(\frac{a}{K v^{(\ell)}}, u^{(\ell)}\Big) + \lambda(K)\, d_H(u^{(\ell)}, u^\star)
= d_H\big(a,\, u^{(\ell)} \odot (K v^{(\ell)})\big) + \lambda(K)\, d_H(u^{(\ell)}, u^\star),$$
which gives the first part of (4.22) since $u^{(\ell)} \odot (K v^{(\ell)}) = P^{(\ell)} 1_m$ (the second part being similar). The proof of (4.23) follows from [Franklin and Lorenz, 1989, Lemma 3].

The bound (4.22) shows that error measures on the marginal constraint violations, for instance $\|P^{(\ell)} 1_m - a\|_1$ and $\|P^{(\ell)\top} 1_n - b\|_1$, are useful stopping criteria to monitor the convergence. Note that, thanks to (4.20), these Hilbert metric rates on the scaling variables $(u^{(\ell)}, v^{(\ell)})$ give a linear rate on the dual variables $(f^{(\ell)}, g^{(\ell)}) \stackrel{\text{def.}}{=} (\varepsilon \log(u^{(\ell)}), \varepsilon \log(v^{(\ell)}))$ for the variation norm $\|\cdot\|_{\mathrm{var}}$.
Figure 4.5, bottom row, highlights this linear rate on the constraint violation, and shows how this rate degrades as ε → 0. These results are proved in [Franklin and Lorenz, 1989] and are tightly

connected to nonlinear Perron-Frobenius theory [Lemmens and Nussbaum, 2012]. Perron-Frobenius theory corresponds to the linearization of the iterations; see (4.24). This convergence analysis is extended in [Linial et al., 1998], who show that each iteration of Sinkhorn increases the permanent of the scaled coupling matrix.

Remark 4.13 (Local convergence). The global linear rate (4.23) is often quite pessimistic, typically on $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$ for cases where there exists a Monge map when ε = 0 (see Remark 2.7). The global rate is in contrast rather sharp for more difficult situations where the cost matrix C is close to being random; in these cases the rate scales exponentially badly with ε, namely $1 - \lambda(K) \sim e^{-1/\varepsilon}$. To obtain a finer asymptotic analysis of the convergence (i.e. if one is interested in a high-precision solution and performs a large number of iterations), one usually rather studies the local convergence rate. One can write the Sinkhorn update as the iteration of a fixed-point map, $f^{(\ell+1)} = \Phi(f^{(\ell)})$, where
$$\Phi \stackrel{\text{def.}}{=} \Phi_2 \circ \Phi_1 \quad\text{where}\quad
\begin{cases}
\Phi_1(f) = \varepsilon \log(b) - \varepsilon \log\big(K^\top(e^{f/\varepsilon})\big),\\
\Phi_2(g) = \varepsilon \log(a) - \varepsilon \log\big(K(e^{g/\varepsilon})\big).
\end{cases}$$
For an optimal (f, g) solving (4.29), denoting $P = \mathrm{diag}(e^{f/\varepsilon}) K \mathrm{diag}(e^{g/\varepsilon})$ the optimal coupling solving (4.2), one has the following Jacobian:
$$\partial\Phi(f) = \mathrm{diag}(a)^{-1} \circ P \circ \mathrm{diag}(b)^{-1} \circ P^\top. \qquad (4.24)$$
This Jacobian is a positive matrix with $\partial\Phi(f) 1_n = 1_n$, and thus by the Perron-Frobenius theorem it has a single dominant eigenvector $1_n$ with associated eigenvalue 1. Since f is defined up to a constant, it is actually the second eigenvalue $1 - \kappa < 1$ which governs the local linear rate, and this shows that for ℓ large enough, $\|f^{(\ell)} - f\| = O((1 - \kappa)^\ell)$. Numerically, in “simple cases” (such as when there exists a smooth Monge map for ε = 0), this rate scales like κ ∼ ε. We refer to [Knight, 2008] for more details in the bistochastic (assignment) case.

4.3 Speeding-up Sinkhorn’s Iterations

The main computational bottleneck of Sinkhorn’s iterations is the


vector-matrix multiplication against kernels K and K> , with complex-
ity O(nm) if implemented naively. We now detail several important
cases where the complexity can be improved significantly.

Remark 4.14 (Parallel and GPU Friendly Computation). The simplicity


of Sinkhorn’s algorithm yields an extremely efficient approach to com-
pute simultaneously several regularized Wasserstein distances between
pairs of histograms. Let N be an integer, a1 , · · · , aN be histograms
in Σn and b1 , · · · , bN histograms in Σm . We seek to compute all N
approximate distances LεC (a1 , b1 ), . . . , LεC (aN , bN ). In that case, writ-
ing A = [a1 , . . . , aN ] and B = [b1 , . . . , bN ] for the n × N and m × N
matrices storing all histograms, one can notice that all Sinkhorn itera-
tions for all these N pairs can be carried out in parallel, by setting for
instance
$$U^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{A}{K V^{(\ell)}} \quad\text{and}\quad V^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{B}{K^\top U^{(\ell+1)}}, \qquad (4.25)$$
initialized with $V^{(0)} = 1_{m\times N}$. Here the fraction bar corresponds to entry-wise division


of matrices. One can further check that, upon convergence of V and
U, the (row) vector of regularized distances simplifies to

$$(U \odot \log U)^\top (K \odot C)\, V + (V \odot \log V)^\top (K \odot C)^\top U \in \mathbb{R}^N.$$

Note that the basic Sinkhorn iterations described in Equation (4.15)


are intrinsically GPU friendly, since they only consist in matrix-vector
products, and this was exploited for instance to solve matching prob-
lems in Slomp et al. [2011]. However, the matrix-matrix operations
presented in Equation (4.25) present even better opportunities for par-
allelism, which explains the success of Sinkhorn’s algorithm to compute
OT distances between histograms at large scale.
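A possible NumPy sketch of the batched updates (4.25) is given below; A and B store the N histograms column-wise and all pairs share the same Gibbs kernel K.

```python
import numpy as np

def sinkhorn_batch(A, B, K, n_iter=500):
    """Run N Sinkhorn problems in parallel with matrix-matrix products."""
    V = np.ones_like(B)                  # V^(0) = 1_{m x N}
    for _ in range(n_iter):
        U = A / (K @ V)
        V = B / (K.T @ U)
    return U, V
```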

Remark 4.15 (Speed-up for separable kernels). If the indices are multi-indices $i = (i_k)_{k=1}^d$, $j = (j_k)_{k=1}^d$, and the kernel factorizes as
$$K_{i,j} = \prod_{k=1}^d K^k_{i_k, j_k}, \qquad (4.26)$$
then one can compute Ku by applying each $K^k$ along the corresponding “slice” of the array u. If n = m, the complexity is $O(n^{1+1/d})$ in place of $O(n^2)$. A typical example is when $c(x,y) = \|x-y\|^2$ for $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$ on a regular grid $x_i = (i_1/n_0, \ldots, i_d/n_0)$ with $n = n_0^d$: then $K^k_{i_k, j_k} = \exp(-|i_k/n_0 - j_k/n_0|^2/\varepsilon)$, and applying $K^k$ is a 1-D Gaussian convolution.
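For instance, on a 2-D grid of n₀ × n₀ points with the squared Euclidean cost, the following sketch applies the Gibbs kernel through two 1-D Gaussian products, without ever forming the full n₀² × n₀² matrix (n₀ and ε are illustrative values):

```python
import numpy as np

n0, eps = 100, 1e-2
t = np.linspace(0, 1, n0)
K1 = np.exp(-(t[:, None] - t[None, :]) ** 2 / eps)   # 1-D Gaussian kernel

def apply_kernel(v):
    """Apply the separable kernel to v, an (n0, n0) array indexed by the grid."""
    return K1 @ v @ K1.T                              # act on each axis ("slice") in turn
```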

Remark 4.16 (Convolutions approximation). The main computational


bottleneck of Sinkhorn iterations (4.15) is the matrix multiplication
Kv (and also its adjoint). Beside using separability (4.26), it is also
possible to make use of other special structure of the kernel. The sim-
plest case is for translation invariant kernels Ki,j = ki−j , which is typi-
cally the case when discretizing the measure on a fixed uniform grid in
Euclidean space X = Rd . Then Kv = k ? v is a convolution, and there
are several algorithms to approximate the convolution in nearly linear
time. The most usual one is by Fourier transform F, assuming for sim-
plicity periodic boundary conditions, because F(k ? v) = F(k) F(v).
This leads however to unstable computations, and is often unacceptable
for small ε. Another popular way to speed up the computation is to approximate the convolution using a succession of autoregressive filters. The best-known approach is the Deriche filtering method Deriche
[1993], and we refer to Getreuer [2013] for a comparison of various fast
filtering methods.

Remark 4.17 (Geodesic in heat approximation). For non-planar domains, the kernel K is not a convolution, but in the case where the cost is $C_{i,j} = d_{\mathcal{M}}(x_i, y_j)^p$, where $d_{\mathcal{M}}$ is a geodesic distance on a surface $\mathcal{M}$ (or a more general manifold), it is also possible to perform fast approximations of the application of $K = e^{-\frac{d_{\mathcal{M}}}{\varepsilon}}$ to a vector. Indeed, Varadhan’s formulas Varadhan [1967] assert that this kernel is close to the Laplacian kernel (for p = 1) and the heat kernel (for p = 2). The first formula of Varadhan states
$$-\sqrt{2t}\,\log(\mathcal{P}_t(x,y)) = d_{\mathcal{M}}(x,y) + o(t) \quad\text{where}\quad \mathcal{P}_t \stackrel{\text{def.}}{=} (\mathrm{Id} - t\Delta_{\mathcal{M}})^{-1}, \qquad (4.27)$$
where $\Delta_{\mathcal{M}}$ is the Laplace-Beltrami operator associated to the manifold $\mathcal{M}$ (which is negative semi-definite), so that $\mathcal{P}_t$ is an integral kernel and

$g = \int_{\mathcal{M}} \mathcal{P}_t(x,y) f(y)\,\mathrm{d}y$ is the solution of $g - t\Delta_{\mathcal{M}} g = f$. The second formula of Varadhan states
$$\sqrt{-4t\,\log(\mathcal{H}_t(x,y))} = d_{\mathcal{M}}(x,y) + o(t), \qquad (4.28)$$
where $\mathcal{H}_t$ is the integral kernel defined so that $g_t = \int_{\mathcal{M}} \mathcal{H}_t(x,y) f(y)\,\mathrm{d}y$ is the solution at time t of the heat equation
$$\frac{\partial g_t(x)}{\partial t} = (\Delta_{\mathcal{M}} g_t)(x).$$
The convergence in these formulas (4.27) and (4.28) is uniform on com-
pact manifolds. Numerically, the domain M is discretized (for instance
using finite elements) and ∆M is approximated by a discrete Lapla-
cian matrix L. A typical example is when using piecewise linear finite
elements, so that L is the celebrated cotangent Laplacian (see [Botsch
et al., 2010] for a detailed account of this construction). These formulas can be used to approximate efficiently the multiplication by the Gibbs kernel $K_{i,j} = e^{-\frac{d(x_i,y_j)^p}{\varepsilon}}$. Equation (4.27) suggests, for the case p = 1, to use $\varepsilon = \sqrt{2t}$ and to replace the multiplication by K by the multiplication by $(\mathrm{Id} - tL)^{-1}$, which necessitates the resolution of a positive symmetric linear system. Equation (4.28), coupled with R steps of implicit Euler for the stable resolution of the heat flow, suggests for p = 2 to trade the multiplication by K for the multiplication by $(\mathrm{Id} - \frac{t}{R}L)^{-R}$ with $4t = \varepsilon$,
which in turn necessitates R resolutions of linear systems. Fortunately,
since these linear systems are supposed to be solved many times during
Sinkhorn iterations, one can solve them efficiently by pre-computing a
sparse Cholesky factorization. By performing a re-ordering of the rows
and columns of the matrix [George and Liu, 1989], one obtains a nearly
linear sparsity for 2-D manifolds and thus each iteration of Sinkhorn
has linear complexity (the performance degrades with the dimension of
the manifold). The use of Varadhan’s formula to approximate geodesic
distances was initially proposed in [Crane et al., 2013], and its use in
conjunction with Sinkhorn iteration in [Solomon et al., 2015].
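As an illustration of this strategy, the sketch below (for p = 1, on a 1-D grid with a standard three-point Laplacian, used purely as a stand-in for a cotangent Laplacian on a mesh) replaces each product Kv inside the Sinkhorn loop by a pre-factorized sparse solve:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n, t = 200, 1e-3
L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n))   # discrete Laplacian (illustrative)
solver = splu((sp.identity(n) - t * L).tocsc())             # factorize (Id - tL) once

def apply_kernel(v):
    """Stand-in for K @ v inside the Sinkhorn updates."""
    return solver.solve(v)
```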

Remark 4.18 (Extrapolation acceleration). Since the Sinkhorn algorithm is a fixed-point algorithm (as shown in Remark 4.13), one can use standard linear or even nonlinear extrapolation schemes to enhance the conditioning of the fixed-point mapping near the solution and improve the linear convergence rate. This is similar to the successive over-relaxation method (see for instance [Hadjidimos, 2000]), so that the local linear rate of convergence is improved from $O((1-\kappa)^\ell)$ to $O((1-\sqrt{\kappa})^\ell)$ for some κ > 0 (see Remark 4.13). We refer to [Peyré et al., 2017] for more details.

4.4 Regularized Dual and Log-domain Computations

The following proposition details the dual problem associated to (4.2).

Proposition 4.4. One has
$$\mathrm{L}^\varepsilon_C(a,b) = \max_{f\in\mathbb{R}^n,\, g\in\mathbb{R}^m} \langle f, a\rangle + \langle g, b\rangle - \varepsilon\,\langle e^{f/\varepsilon}, K e^{g/\varepsilon}\rangle. \qquad (4.29)$$
The optimal (f, g) are linked to the scalings (u, v) appearing in (4.12) through
$$(u, v) = (e^{f/\varepsilon}, e^{g/\varepsilon}). \qquad (4.30)$$

Proof. We start from the end of the proof of Proposition 4.3, which links the optimal primal solution P and dual multipliers f and g for the marginal constraints as $P_{i,j} = e^{f_i/\varepsilon} e^{-C_{i,j}/\varepsilon} e^{g_j/\varepsilon}$. Substituting in the Lagrangian $\mathcal{E}(P, f, g)$ of Equation (4.2) the optimal P as a function of f and g, we obtain that the Lagrange dual function equals
$$f, g \mapsto \langle e^{f/\varepsilon}, (K \odot C)\, e^{g/\varepsilon}\rangle - \varepsilon H\big(\mathrm{diag}(e^{f/\varepsilon}) K \mathrm{diag}(e^{g/\varepsilon})\big). \qquad (4.31)$$
The entropy of P scaled by ε, namely $\varepsilon\langle P, \log P - 1_{n\times m}\rangle$, can be stated explicitly as a function of f, g, C:
$$\big\langle \mathrm{diag}(e^{f/\varepsilon}) K \mathrm{diag}(e^{g/\varepsilon}),\; f 1_m^\top + 1_n g^\top - C - \varepsilon 1_{n\times m}\big\rangle
= -\langle e^{f/\varepsilon}, (K \odot C)\, e^{g/\varepsilon}\rangle + \langle f, a\rangle + \langle g, b\rangle - \varepsilon\langle e^{f/\varepsilon}, K e^{g/\varepsilon}\rangle;$$
therefore the first term in (4.31) cancels out with the first term in the entropy above. The remaining terms are those displayed in (4.29).

Remark 4.19 (Dual for generic measures). For generic (not necessarily discrete) input measures (α, β), the dual problem (4.29) reads
$$\sup_{(f,g)\in\mathcal{C}(\mathcal{X})\times\mathcal{C}(\mathcal{Y})} \int_{\mathcal{X}} f(x)\mathrm{d}\alpha(x) + \int_{\mathcal{Y}} g(y)\mathrm{d}\beta(y) - \varepsilon \int_{\mathcal{X}\times\mathcal{Y}} e^{\frac{-c(x,y)+f(x)+g(y)}{\varepsilon}} \mathrm{d}\alpha(x)\mathrm{d}\beta(y).$$
This corresponds to a smoothing of the constraint $\mathcal{R}(c)$ appearing in the original problem (2.23), which is retrieved in the limit ε → 0. Proving existence (i.e. that the sup is actually a max) of these Kantorovich potentials (f, g) in the case of entropic transport is less easy than for classical OT (because one cannot use the c-transform and potentials are not automatically Lipschitz). Existence can be proved using the convergence of Sinkhorn iterations; see Chizat et al. [2017] for more details.

Remark 4.20 (Sinkhorn as a Block Coordinate Ascent on the Dual Problem). A simple approach to solve the unconstrained maximization problem (4.29) is to use an exact block coordinate ascent strategy, namely to update alternatively f and g so as to cancel the respective gradients of the objective of (4.29). Indeed, one can easily notice that, writing Q(f, g) for the objective of (4.29),
$$\nabla_{f} Q(f, g) = a - e^{f/\varepsilon} \odot \big(K e^{g/\varepsilon}\big), \qquad (4.32)$$
$$\nabla_{g} Q(f, g) = b - e^{g/\varepsilon} \odot \big(K^\top e^{f/\varepsilon}\big). \qquad (4.33)$$
Block coordinate ascent can therefore be implemented in closed form by applying successively the following updates, starting from any arbitrary $g^{(0)}$, for ℓ ≥ 0:
$$f^{(\ell+1)} = \varepsilon \log a - \varepsilon \log\big(K e^{g^{(\ell)}/\varepsilon}\big), \qquad (4.34)$$
$$g^{(\ell+1)} = \varepsilon \log b - \varepsilon \log\big(K^\top e^{f^{(\ell+1)}/\varepsilon}\big). \qquad (4.35)$$
Such iterations are mathematically equivalent to the Sinkhorn iterations (4.15) when considering the primal-dual relations highlighted in (4.30). Indeed, we recover that at any iteration
$$(f^{(\ell)}, g^{(\ell)}) = \varepsilon\,\big(\log(u^{(\ell)}), \log(v^{(\ell)})\big).$$



Remark 4.21 (Soft-min rewriting). Iterations (4.34) and (4.35) can be given an alternative interpretation, using the following notation. Given a vector z of real numbers we write $\min_\varepsilon z$ for the soft-minimum of its coordinates, namely
$$\min{}_\varepsilon z = -\varepsilon \log \sum_i e^{-z_i/\varepsilon}.$$
Note that $\min_\varepsilon(z)$ converges to $\min z$ for any vector z as ε → 0. Indeed, $\min_\varepsilon$ can be interpreted as a differentiable approximation of the min function. Using these notations, Equations (4.34) and (4.35) can be rewritten
$$(f^{(\ell+1)})_i = \min{}_\varepsilon \big(C_{ij} - g^{(\ell)}_j\big)_j + \varepsilon \log a_i, \qquad (4.36)$$
$$(g^{(\ell+1)})_j = \min{}_\varepsilon \big(C_{ij} - f^{(\ell+1)}_i\big)_i + \varepsilon \log b_j. \qquad (4.37)$$
Here the term $\min_\varepsilon (C_{ij} - g^{(\ell)}_j)_j$ denotes the soft-minimum of all values of the i-th row of the matrix $(C - 1_n (g^{(\ell)})^\top)$. To simplify notations, we introduce an operator that takes a matrix as input and outputs the column vector of the soft-minimum values of its rows or columns. Namely, for any matrix $A \in \mathbb{R}^{n\times m}$, we define
$$\mathrm{Min}^{\mathrm{row}}_\varepsilon(A) \stackrel{\text{def.}}{=} \big(\min{}_\varepsilon (A_{i,j})_j\big)_i \in \mathbb{R}^n, \qquad \mathrm{Min}^{\mathrm{col}}_\varepsilon(A) \stackrel{\text{def.}}{=} \big(\min{}_\varepsilon (A_{i,j})_i\big)_j \in \mathbb{R}^m.$$
Note that these operations are equivalent to the entropic c-transform introduced in §5.3 (see in particular (5.11)). Using these notations, Sinkhorn’s iterates read
$$f^{(\ell+1)} = \mathrm{Min}^{\mathrm{row}}_\varepsilon\big(C - 1_n\, g^{(\ell)\top}\big) + \varepsilon \log a, \qquad (4.38)$$
$$g^{(\ell+1)} = \mathrm{Min}^{\mathrm{col}}_\varepsilon\big(C - f^{(\ell+1)}\, 1_m^\top\big) + \varepsilon \log b. \qquad (4.39)$$
Note that as ε → 0, $\min_\varepsilon$ converges to min, but the iterations do not converge anymore in the limit ε = 0, because alternate minimization does not converge for constrained problems (which is the case for the unregularized dual (2.19)).

Remark 4.22 (Log-domain Sinkhorn). While mathematically equivalent to the Sinkhorn updates (4.15), iterations (4.36) and (4.37) suggest using the log-sum-exp stabilization trick to avoid underflow for small values of ε. Writing $\underline{z} = \min z$, that trick suggests evaluating $\min_\varepsilon z$ as
$$\min{}_\varepsilon z = \underline{z} - \varepsilon \log \sum_i e^{-(z_i - \underline{z})/\varepsilon}. \qquad (4.40)$$
Instead of subtracting $\underline{z}$ to stabilize the log-domain iterations as in (4.40), one can actually subtract the previously computed scalings. This leads to the following stabilized iterations:
$$f^{(\ell+1)} = \mathrm{Min}^{\mathrm{row}}_\varepsilon\big(S(f^{(\ell)}, g^{(\ell)})\big) + f^{(\ell)} + \varepsilon \log(a), \qquad (4.41)$$
$$g^{(\ell+1)} = \mathrm{Min}^{\mathrm{col}}_\varepsilon\big(S(f^{(\ell+1)}, g^{(\ell)})\big) + g^{(\ell)} + \varepsilon \log(b), \qquad (4.42)$$
where we defined
$$S(f, g) \stackrel{\text{def.}}{=} \big(C_{i,j} - f_i - g_j\big)_{i,j}.$$
In contrast to the original iterations (4.15), these log-domain iterations (4.41) and (4.42) are stable for arbitrary ε > 0, because the quantity S(f, g) stays bounded during the iterations. The downside is that they require nm computations of exp at each step. Computing a $\mathrm{Min}^{\mathrm{row}}_\varepsilon$ or $\mathrm{Min}^{\mathrm{col}}_\varepsilon$ is typically substantially slower than a matrix multiplication, and requires computing soft-minima of the matrix S line by line. There is therefore no efficient way to parallelize the application of Sinkhorn maps for several marginals simultaneously. In Euclidean domains of small dimension, it is possible to develop efficient multiscale solvers with a decaying-ε strategy to significantly speed up the computation using sparse grids [Schmitzer, 2016b].
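A NumPy/SciPy sketch of these stabilized iterations, using scipy's logsumexp for the soft-min, could look as follows (the inputs a, b, C, ε and the iteration count are assumed to be given):

```python
import numpy as np
from scipy.special import logsumexp

def softmin(S, eps, axis):
    """Soft-minimum of the rows (axis=1) or columns (axis=0) of S."""
    return -eps * logsumexp(-S / eps, axis=axis)

def sinkhorn_log(a, b, C, eps, n_iter=200):
    f, g = np.zeros_like(a), np.zeros_like(b)
    for _ in range(n_iter):
        f = softmin(C - f[:, None] - g[None, :], eps, axis=1) + f + eps * np.log(a)
        g = softmin(C - f[:, None] - g[None, :], eps, axis=0) + g + eps * np.log(b)
    P = np.exp(-(C - f[:, None] - g[None, :]) / eps)   # recover the primal coupling
    return f, g, P
```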

4.5 Regularized Approximations of the Optimal Transport Cost

The entropic dual (4.29) is a smooth unconstrained concave maximiza-


tion problem, which approximates the original Kantorovich dual (2.19),
as detailed in the following proposition.
Proposition 4.5. Any pair of optimal solutions $(f^\star, g^\star)$ to (4.29) is such that $(f^\star, g^\star) \in \mathcal{R}(a,b)$, the set of feasible Kantorovich potentials defined in (2.20). As a consequence, we have, for any ε,
$$\langle f^\star, a\rangle + \langle g^\star, b\rangle \leq \mathrm{L}_C(a, b).$$

A chief advantage of the regularized transportation cost $\mathrm{L}^\varepsilon_C$ defined in (4.2) is that it is smooth and convex, which makes it a perfect fit for integration as a loss function in variational problems (see Chapter 9).

Proposition 4.6. $\mathrm{L}^\varepsilon_C(a, b)$ is a jointly convex function of a and b, with gradient equal to
$$\nabla \mathrm{L}^\varepsilon_C(a, b) = \begin{bmatrix} f^\star \\ g^\star \end{bmatrix},$$
where $f^\star$ and $g^\star$ are the optimal solutions of Equation (4.29), chosen so that their coordinates sum to 0.

In [Cuturi, 2013], lower and upper bounds to approximate the


Wasserstein distance between two histograms were proposed. These
bounds consist in evaluating the primal and dual objectives at the so-
lutions provided by the Sinkhorn algorithm.

Definition 4.1 (Sinkhorn Divergences). Let $f^\star$ and $g^\star$ be optimal solutions to (4.29). The Wasserstein distance is approximated using the following primal and dual Sinkhorn divergences:
$$\mathrm{P}^\varepsilon_C(a, b) \stackrel{\text{def.}}{=} \langle C, P^\star\rangle = \big\langle e^{\frac{f^\star}{\varepsilon}}, (K \odot C)\, e^{\frac{g^\star}{\varepsilon}}\big\rangle, \qquad \mathrm{D}^\varepsilon_C(a, b) \stackrel{\text{def.}}{=} \langle f^\star, a\rangle + \langle g^\star, b\rangle,$$
where ⊙ stands for the elementwise product of matrices, and where $P^\star$ is the solution to (4.2).

Proposition 4.7. The following relationships hold:
$$\mathrm{D}^\varepsilon_C(a, b) \leq \mathrm{L}_C(a, b) \leq \mathrm{P}^\varepsilon_C(a, b).$$
Furthermore,
$$\mathrm{P}^\varepsilon_C(a, b) - \mathrm{D}^\varepsilon_C(a, b) = \varepsilon\,\big(H(P^\star) - 1\big). \qquad (4.43)$$
Proof. Equation (4.43) is obtained by writing that the primal and dual problems have the same value at the optimum (see (4.29)), and hence
$$\mathrm{L}^\varepsilon_C(a, b) = \mathrm{P}^\varepsilon_C(a, b) - \varepsilon H(P^\star) = \mathrm{D}^\varepsilon_C(a, b) - \varepsilon\,\langle e^{f^\star/\varepsilon}, K e^{g^\star/\varepsilon}\rangle,$$
and remarking that $\langle e^{f^\star/\varepsilon}, K e^{g^\star/\varepsilon}\rangle = 1$.

The relationships given above are only valid upon convergence to


the actual solutions of Equation (4.29). We consider next the prac-
tical case where the Sinkhorn iterations are terminated after a pre-
determined number of L iterations, and used to evaluate DεC using
iterates f(L) and g(L) instead of optimal solutions f? and g? .
Using notations appearing in Equations (4.41) and (4.42), we thus introduce the following finite-step approximation of $\mathrm{L}^\varepsilon_C$:
$$\mathrm{D}^{(L)}_C(a, b) \stackrel{\text{def.}}{=} \langle f^{(L)}, a\rangle + \langle g^{(L)}, b\rangle. \qquad (4.44)$$
This “algorithmic” Sinkhorn functional lower bounds the regularized cost function.
Proposition 4.8 (Finite Sinkhorn Divergences). The following relationship holds:
$$\mathrm{D}^{(L)}_C(a, b) \leq \mathrm{L}^\varepsilon_C(a, b).$$
Note however that, unlike the regularized expression $\mathrm{L}^\varepsilon_C$ in (4.29), the finite Sinkhorn divergence $\mathrm{D}^{(L)}_C(a, b)$ is not, in general, a convex function of its arguments (this can be easily checked numerically). $\mathrm{D}^{(L)}_C(a, b)$ is, however, a differentiable function which can be differentiated using automatic differentiation techniques (see Remark 9.1.3) with respect to any of its arguments, notably C, a or b.
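Given the scalings (or the dual iterates) produced by Sinkhorn, both divergences are cheap to evaluate; a small sketch:

```python
import numpy as np

def sinkhorn_divergences(u, v, K, C, a, b, eps):
    """Primal value <C, P> and dual value <f, a> + <g, b> from the scalings (u, v)."""
    P = u[:, None] * K * v[None, :]
    primal = np.sum(P * C)
    f, g = eps * np.log(u), eps * np.log(v)
    dual = f @ a + g @ b
    return primal, dual
```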

4.6 Generalized Sinkhorn

The regularized OT problem (4.2) is a special case of a structured convex optimization problem of the form
$$\min_{P} \sum_{i,j} C_{i,j} P_{i,j} - \varepsilon H(P) + F(P 1_m) + G(P^\top 1_n). \qquad (4.45)$$
Indeed, defining $F = \iota_{\{a\}}$ and $G = \iota_{\{b\}}$, where the indicator function of a closed convex set $\mathcal{C}$ is
$$\iota_{\mathcal{C}}(x) = \begin{cases} 0 & \text{if } x \in \mathcal{C},\\ +\infty & \text{otherwise}, \end{cases} \qquad (4.46)$$
one retrieves the hard marginal constraints defining U(a, b). The proof of Proposition 4.3 carries over to this more general problem (4.45), so that the unique solution of (4.45) also has the form (4.12).

As shown in [Peyré, 2015, Chizat et al., 2017, Karlsson and Ringh, 2016], Sinkhorn iterations (4.15) can hence be extended to this problem, and read
$$u \leftarrow \frac{\mathrm{Prox}^{\mathrm{KL}}_{F}(K v)}{K v} \quad\text{and}\quad v \leftarrow \frac{\mathrm{Prox}^{\mathrm{KL}}_{G}(K^\top u)}{K^\top u}, \qquad (4.47)$$
where the proximal operator for the KL divergence is
$$\forall u \in \mathbb{R}^N_+, \quad \mathrm{Prox}^{\mathrm{KL}}_{F}(u) \stackrel{\text{def.}}{=} \operatorname*{argmin}_{u' \in \mathbb{R}^N_+} \mathrm{KL}(u'|u) + F(u'). \qquad (4.48)$$
For some functions F, G it is possible to prove a linear rate of convergence for iterations (4.47), and these schemes can be generalized to arbitrary measures; see Chizat et al. [2017] for more details.
Iterations (4.47) are thus interesting in the cases where $\mathrm{Prox}^{\mathrm{KL}}_{F}$ and $\mathrm{Prox}^{\mathrm{KL}}_{G}$ can be computed in closed form or very efficiently. This is in particular the case for separable functions of the form $F(u) = \sum_i F_i(u_i)$, since then
$$\mathrm{Prox}^{\mathrm{KL}}_{F}(u) = \big(\mathrm{Prox}^{\mathrm{KL}}_{F_i}(u_i)\big)_i.$$
Computing each $\mathrm{Prox}^{\mathrm{KL}}_{F_i}$ is usually simple since it is a scalar optimization problem. Note that, similarly to the initial Sinkhorn algorithm, it is also possible to stabilize the computation using log-domain computations; see Chizat et al. [2017].
This algorithm can be used to approximate the solution to various generalizations of OT, and in particular unbalanced OT problems of the form (10.7) (see §10.2 and in particular iterations (10.9)) and gradient flow problems of the form (9.25) (see §9.3).
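For instance, for the soft marginal penalties F = λ KL(·|a) and G = λ KL(·|b) used in unbalanced OT (§10.2), the proximal steps admit a closed form and the resulting scheme reduces to the sketch below (the exponent λ/(λ+ε) is the one derived in Chizat et al. [2017]; the code is only an illustration of iterations (4.47) under these assumptions):

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps, lam, n_iter=500):
    """Sinkhorn-like scaling iterations with soft (KL) marginal constraints."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** (lam / (lam + eps))
        v = (b / (K.T @ u)) ** (lam / (lam + eps))
    return u[:, None] * K * v[None, :]
```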

Remark 4.23 (Duality and Legendre transform). The dual problem to (4.45) reads
$$\max_{f, g}\; -F^*(f) - G^*(g) - \varepsilon \sum_{i,j} e^{\frac{f_i + g_j - C_{i,j}}{\varepsilon}}, \qquad (4.49)$$
so that $(u, v) = (e^{f/\varepsilon}, e^{g/\varepsilon})$ are the associated scalings appearing in (4.12). Here, $F^*$ and $G^*$ are the Fenchel-Legendre conjugates, which are convex functions defined as
$$\forall f \in \mathbb{R}^n, \quad F^*(f) \stackrel{\text{def.}}{=} \max_{a \in \mathbb{R}^n} \langle f, a\rangle - F(a). \qquad (4.50)$$
The generalized Sinkhorn iterates (4.47) are a special case of Dykstra’s algorithm [Dykstra, 1983, 1985] (extended to Bregman divergences [Bauschke and Lewis, 2000, Censor and Reich, 1998]; see also Remark 8.1), and correspond to an alternate maximization scheme on the dual problem (4.49).

The formulation (4.45) can be further generalized to more than two functions and more than a single coupling:
$$\min_{(P_s)_{s=1}^S} \sum_{s} \Big(\sum_{i,j} C_{s,i,j} P_{s,i,j} - \varepsilon H(P_s)\Big) + \sum_k F_k\big((P_s 1_m)_s, (P_s^\top 1_n)_s\big),$$
where $\{C_s\}_s$ are given cost matrices. The solution has the form
$$\forall (s,i,j) \in \llbracket 1, S\rrbracket \times \llbracket n\rrbracket \times \llbracket m\rrbracket, \quad P_{s,i,j} = K_{s,i,j} \prod_k u_{k,i} \prod_\ell v_{\ell,j}, \qquad (4.51)$$
where $K_s = e^{-C_s/\varepsilon}$ is a Gibbs kernel, and $(u_k, v_\ell)_{k,\ell}$ are scaling vectors that need to be computed using generalized Sinkhorn iterations similar to (4.47); see Chizat et al. [2017]. In the case where the functions $F_k$ are indicator functions, one retrieves the Sinkhorn algorithm (10.2) for the multi-marginal problem, as detailed in §10.1. It is also possible to rewrite the regularized barycenter problem (9.14) this way, and the iterations (9.17) are in fact a special case of this generalized Sinkhorn.
5
Semi-discrete Optimal Transport

This chapter studies methods to tackle the optimal transport prob-


lem where one of the two input measures is discrete (a sum of Dirac
masses) and the other one is arbitrary (including most importantly
the case where it has a density with respect to the Lebesgue measure).
When the ambiant space has low dimensions, this problem has a strong
geometrical flavor because when representing transport from a contin-
uous density towards a discrete one, one obtains that the support of
the density must be split into disjoint cells, which are each mapped
to one among all Dirac masses of the discrete measure. When the cost
is the squared Euclidean distance, these cells corresponds to an im-
portant concept from computational geometry, the so-called Laguerre
cells, which are weighted Voronoi cells. This connection allows to bor-
row tools from computational geometry to obtain fast computational
schemes. In high dimensions, the semi-descrite formulation can also be
studied as a stochastic programming problem, which can also benefit
from a bit of regularization, extending therefore the scope of applica-
tions of the entropic regularization scheme presented in Chapter 4. All
these constructions rely heavily on the notion of c-transform, this time
for general cost functions and not only matrices as in §3.2, which is


a generalization of the Legendre transform from convex analysis, and


plays a pivotal role in the theory and algorithms for OT.

5.1 c-transform and c̄-transform

Recall that the dual OT problem (2.23) reads
$$\max_{(f,g)} \mathcal{E}(f, g) \stackrel{\text{def.}}{=} \int_{\mathcal{X}} f(x)\mathrm{d}\alpha(x) + \int_{\mathcal{Y}} g(y)\mathrm{d}\beta(y) + \iota_{\mathcal{R}(c)}(f, g),$$
where we used the useful indicator function notation (4.46). Doing an alternate minimization on either f or g leads to the important notion of c-transform:
$$\forall y \in \mathcal{Y}, \quad f^{c}(y) \stackrel{\text{def.}}{=} \inf_{x \in \mathcal{X}} c(x,y) - f(x), \qquad (5.1)$$
$$\forall x \in \mathcal{X}, \quad g^{\bar{c}}(x) \stackrel{\text{def.}}{=} \inf_{y \in \mathcal{Y}} c(x,y) - g(y), \qquad (5.2)$$
where we denoted $\bar{c}(y,x) \stackrel{\text{def.}}{=} c(x,y)$. Indeed, one can check that
$$f^{c} \in \operatorname*{argmax}_{g} \mathcal{E}(f, g) \quad\text{and}\quad g^{\bar{c}} \in \operatorname*{argmax}_{f} \mathcal{E}(f, g). \qquad (5.3)$$

Note that these partial minimizations define maximizers on the sup-


port of respectively α and β, while the definitions (5.1) actually define
functions on the whole spaces X and Y. This is thus a way to extend in
a canonical way solutions of (2.23) on the whole spaces. When X = Rd
and c(x, y) = kx − ykp , then the c-transform (5.1) f c is the so-called
inf-convolution between −f and k·kp . The definition of f c is also often
referred to as a “Hopf-Lax formula”.
The map (f, g) ∈ C(X ) × C(Y) 7→ (g c̄ , f c ) ∈ C(X ) × C(Y) replaces
dual potentials by “better” ones (improving the dual objective E). Func-
tions that can be written in the form f c and g c̄ are called c-concave and
c̄-concave functions. In the special case c(x, y) = hx, yi in X = Y = Rd ,
this definition coincides with the usual notion of concave functions.
Extending naturally Proposition 3.1 to a continuous case, one has the
property that
f cc̄c = f c and g c̄cc̄ = g c̄
where we denoted $f^{c\bar{c}} = (f^c)^{\bar{c}}$. This invariance property shows that one can only “improve” the dual potentials once in this way. Alternatively, this

means that alternate maximization does not converge (it immediately


enters a cycle), which is classical for functionals involving a non-smooth
(a constraint) coupling of the optimized variables. This is in sharp con-
trast with entropic regularization of OT as exposed in Chapter 4. In
this case, because of the regularization, the dual objective (4.29) is
smooth, and alternate maximization corresponds to Sinkhorn itera-
tions (4.41) and (4.42). These iterates, written over the dual variables,
define entropically-smoothed versions of the c-transform, where min
operations are replaced by a “soft-min”.
Using (5.3), one can reformulate (2.23) as an unconstrained convex
program over a single potential
Z Z
Lc (α, β) = max f (x)dα(x) + f c (y)dβ(y), (5.4)
f ∈C(X ) X Y
Z Z
= max g c̄ (x)dα(x) + g(y)dβ(y). (5.5)
g∈C(Y) X Y

Since one can iterate the map (f, g) 7→ (g c̄ , f c ), it is possible to add in


these optimization problems the constraint that f is c̄-concave and g
is c-concave, which is important to ensure enough regularity on these
potentials and show for instance existence of solutions to (2.23).

5.2 Semi-discrete Formulation


P
A case of particular interest is when β = j bj δyj is discrete (of course
the same construction applies if α is discrete by exchanging the role of
(α, β)). One can adapt the definition of the c̄ transform (5.1) to this
setting by restricting the minimization to the support (yj )j of β
∀ g ∈ Rm , ∀ x ∈ X , gc̄ (x) = min c(x, yj ) − gj .
def.
(5.6)
j∈JmK

This transform maps a vector g to a continuous function gc̄ ∈ C(X ).


Note that this definition coincides with (5.1) when imposing that the
space X is equal to the support of β. Figure 5.1 shows some example
of such discrete c̄-transforms in 1-D and 2-D.
Using this discrete c̄-transform, in this semi-discrete case, (5.4) is
equivalent to the following finite dimensional optimization
Z X
gc̄ (x)dα(x) +
def.
Lc (α, β) = max
m
E(g) = gy bj (5.7)
g∈R X

Figure 5.1: Top: examples of semi-discrete c̄-transforms $g^{\bar{c}}$ in 1-D, for ground cost c(x, y) = |x − y|ᵖ for varying p ∈ {1/2, 1, 3/2, 2}. The red points are at locations $(y_j, -g_j)_j$. Bottom: examples of semi-discrete c̄-transforms $g^{\bar{c}}$ in 2-D, for ground cost c(x, y) = ‖x − y‖ᵖ for varying p. The red points are at locations $y_j \in \mathbb{R}^2$, and their size is proportional to $g_j$. The regions delimited by bold black curves are the Laguerre cells $(\mathbb{L}_{g}(y_j))_j$ associated to these points $(y_j)_j$.

The Laguerre cells associated to the dual weights g,
$$\mathbb{L}_{g}(y_j) \stackrel{\text{def.}}{=} \big\{x \in \mathcal{X} : \forall j' \neq j,\; c(x, y_j) - g_j \leq c(x, y_{j'}) - g_{j'}\big\},$$
induce a disjoint decomposition of $\mathcal{X} = \bigcup_j \mathbb{L}_{g}(y_j)$. When g is constant, the Laguerre cell decomposition corresponds to the Voronoi diagram partition of the space. Figure 5.1, bottom row, shows examples of Laguerre cell segmentations in 2-D.
This allows one to conveniently rewrite the optimized energy as
$$\mathcal{E}(g) = \sum_{j=1}^{m} \int_{\mathbb{L}_{g}(y_j)} \big(c(x, y_j) - g_j\big)\mathrm{d}\alpha(x) + \langle g, b\rangle. \qquad (5.8)$$
The gradient of this function can be computed as follows:
$$\forall j \in \llbracket m\rrbracket, \quad \nabla\mathcal{E}(g)_j = -\int_{\mathbb{L}_{g}(y_j)} \mathrm{d}\alpha(x) + b_j.$$

Once the optimal g is computed, the optimal transport map T from α to β maps any $x \in \mathbb{L}_{g}(y_j)$ toward $y_j$; it is thus piecewise constant.
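In practice, when α can only be sampled, both E(g) and its gradient can be estimated by Monte Carlo; the sketch below (for the squared Euclidean cost, with x an array of samples from α and y the support of β) assigns each sample to its Laguerre cell and compares the empirical cell masses to b:

```python
import numpy as np

def energy_and_grad(g, x, y, b):
    """Monte Carlo estimate of E(g) in (5.8) and of its gradient."""
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # c(x_k, y_j)
    scores = sq_dist - g[None, :]                              # c(x_k, y_j) - g_j
    j_star = scores.argmin(axis=1)                             # Laguerre cell of each sample
    energy = scores.min(axis=1).mean() + g @ b
    cell_mass = np.bincount(j_star, minlength=len(g)) / len(x)
    return energy, b - cell_mass                               # gradient estimate
```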
In the special case c(x, y) = kx − yk2 , the decomposition in La-
guerre cells is also known as “power diagram”. It can be computed
efficiently using computational geometry algorithms, see [Aurenham-
mer, 1987]. The most widely used algorithm relies on the fact that
the power diagram of points in Rd is equal to the projection on Rd
of the convex hull of the set of points $((y_j, \|y_j\|^2 - g_j))_{j=1}^m \subset \mathbb{R}^{d+1}$. There are numerous algorithms to compute convex hulls; for instance, that of Chan [1996] in 2-D and 3-D has complexity $O(P\log(Q))$, where P is the number of input points and Q is the number of vertices of the convex hull.
The initial idea of a semi-discrete solver for Monge-Ampère equa-
tions was proposed by Oliker and Prussner [1989], and its relation to
the dual variational problem exposed by Aurenhammer et al. [1998].
A theoretical analysis and its application to the reflector problem in
optics is detailed in [Caffarelli et al., 1999]. The semi-discrete formu-
lation was used in [Carlier et al., 2010] in conjunction with a contin-
uation approach based on Knothe’s transport. The recent revival of
these methods in various fields is due to Mérigot [2011], who proposed a

Figure 5.2: Iterations ℓ ∈ {1, 3, 50, 100} of the semi-discrete OT algorithm minimizing (5.8) (here a simple gradient descent is used), together with the resulting matching (last panel). The support $(y_j)_j$ of the discrete measure β is indicated by the red points, while the continuous measure α is the uniform measure on a square. The blue cells display the Laguerre partition $(\mathbb{L}_{g^{(\ell)}}(y_j))_j$, where $g^{(\ell)}$ is the discrete dual potential computed at iteration ℓ.

quasi-Newton solver and clarified the link with concepts from compu-
tational geometry. We refer to [Levy and Schwindt, 2017] for a recent
overview. The use of a Newton solver which is applied to sampling in
computer graphics is proposed in [De Goes et al., 2012], see also [Lévy,
2015] for applications to 3-D volume and surface processing. An impor-
tant area of application of semi-discrete methods is the resolution of incompressible fluid dynamics (Euler’s equations) using Lagrangian methods de Goes et al. [2015], Gallouët and Mérigot [2017]. The semi-discrete OT solver enforces incompressibility at each iteration by imposing that the (possibly weighted) point cloud approximates a uniform density inside the domain. The convergence (with linear rate) of damped
Newton iterations is proved in Mirebeau [2015] for the Monge-Ampère
equation, and is refined in Kitagawa et al. [2016] for optimal transport.
Semi-discrete OT finds important applications to illumination design,
see Mérigot et al. [2017].

5.3 Entropic Semi-discrete Formulation

The dual of the entropic-regularized problem between arbitrary measures (4.9) is
$$\mathcal{L}^\varepsilon_c(\alpha, \beta) \stackrel{\text{def.}}{=} \max_{(f,g)\in\mathcal{C}(\mathcal{X})\times\mathcal{C}(\mathcal{Y})} \int_{\mathcal{X}} f(x)\mathrm{d}\alpha(x) + \int_{\mathcal{Y}} g(y)\mathrm{d}\beta(y) - \varepsilon\int_{\mathcal{X}\times\mathcal{Y}} e^{\frac{f(x)+g(y)-c(x,y)}{\varepsilon}}\mathrm{d}\alpha(x)\mathrm{d}\beta(y). \qquad (5.9)$$

This is a smooth unconstrained optimization problem. Similarly to the unregularized problem (5.1), one can minimize explicitly with respect to f or g in (5.9), and introduce a smoothed c-transform
$$\forall y \in \mathcal{Y}, \quad f^{c,\varepsilon}(y) \stackrel{\text{def.}}{=} -\varepsilon \log\Big(\int_{\mathcal{X}} e^{\frac{-c(x,y)+f(x)}{\varepsilon}}\mathrm{d}\alpha(x)\Big),$$
$$\forall x \in \mathcal{X}, \quad g^{\bar{c},\varepsilon}(x) \stackrel{\text{def.}}{=} -\varepsilon \log\Big(\int_{\mathcal{Y}} e^{\frac{-c(x,y)+g(y)}{\varepsilon}}\mathrm{d}\beta(y)\Big).$$
In the case of a discrete measure $\beta = \sum_{j=1}^m b_j \delta_{y_j}$, one can restrict one's attention to discrete dual potentials $g \in \mathbb{R}^m$, and this corresponds to
$$\forall x \in \mathcal{X}, \quad g^{\bar{c},\varepsilon}(x) \stackrel{\text{def.}}{=} -\varepsilon \log\Big(\sum_{j=1}^m e^{\frac{-c(x,y_j)+g_j}{\varepsilon}}\, b_j\Big). \qquad (5.10)$$
One defines similarly $f^{c,\varepsilon}$ in the case of a discrete measure α. Note that the rewriting (4.38) and (4.39) of Sinkhorn using the soft-min operator $\min_\varepsilon$ corresponds to the alternate computation of entropically smoothed c-transforms:
$$f_i^{(\ell+1)} = g^{\bar{c},\varepsilon}(x_i) \quad\text{and}\quad g_j^{(\ell+1)} = f^{c,\varepsilon}(y_j). \qquad (5.11)$$

Instead of maximizing (5.9), one can thus solve the following finite-dimensional optimization problem:
$$\max_{g \in \mathbb{R}^m} \mathcal{E}^\varepsilon(g) \stackrel{\text{def.}}{=} \int_{\mathcal{X}} g^{\bar{c},\varepsilon}(x)\mathrm{d}\alpha(x) + \langle g, b\rangle. \qquad (5.12)$$
Note that this optimization problem is still valid even in the unregularized case ε = 0, and in this case $g^{\bar{c},\varepsilon=0} = g^{\bar{c}}$ is the c̄-transform defined in (5.6), so that (5.12) is in fact (5.8). The gradient of this functional reads
$$\forall j \in \llbracket m\rrbracket, \quad \nabla\mathcal{E}^\varepsilon(g)_j = -\int_{\mathcal{X}} \chi_j^\varepsilon(x)\mathrm{d}\alpha(x) + b_j, \qquad (5.13)$$
where $\chi_j^\varepsilon$ is a smoothed version of the indicator $\chi_j^0$ of the Laguerre cell $\mathbb{L}_{g}(y_j)$:
$$\chi_j^\varepsilon(x) = \frac{e^{\frac{-c(x,y_j)+g_j}{\varepsilon}}}{\sum_\ell e^{\frac{-c(x,y_\ell)+g_\ell}{\varepsilon}}}.$$

Note once again that this formula (5.13) is still valid for ε = 0. Note also that the family of functions $(\chi_j^\varepsilon)_j$ is a partition of unity, i.e. $\sum_j \chi_j^\varepsilon = 1$ and $\chi_j^\varepsilon \geq 0$. Figure 5.3, bottom row, illustrates this.

Remark 5.1 (Second order methods and connection with logistic regres-
sion). A crucial aspect of the smoothed semi-dual formulation (5.12) is
that it corresponds to the minimization of a smooth function. Indeed,
as shown in [Genevay et al., 2016], the Hessian of E ε is upper-bounded
by 1/ε, so that $\nabla\mathcal{E}^\varepsilon$ is (1/ε)-Lipschitz continuous. In fact, that problem is
very closely related to a multi-class logistic regression problem (see Fig-
ure 5.3 for a display of the resulting fuzzy classification boundary) and
enjoys the same favorable properties, see [Hosmer Jr et al., 2013], which
are generalizations of self-concordance, see [Bach, 2010]. In particular,
Newton method converges quadratically, and one can use in practice
quasi-Newton techniques, such as L-BFGS, as advocated in [Cuturi
and Peyré, 2016]. Note that [Cuturi and Peyré, 2016] studies the more
general barycenter problem detailed in §9.2, but is equivalent to this
semi-discrete setting when considering only a pair of inputs measures.
Note that the use of second order schemes (Newton or L-BFGS) is also
advocated in the unregularized case ε = 0 by [Mérigot, 2011, De Goes
et al., 2012, Lévy, 2015]. In [Kitagawa et al., 2016, Theorem 5.1], the
Hessian of E 0 (g) is shown to be uniformly bounded as long as the
volume of the Laguerre cells are bounded by below and α has a contin-
uous density. This allows these authors to show the linear convergence
of a damped Newton algorithm with a backtracking to ensure that the
Laguerre cells never vanish during the optimization. This result justi-
fies the use of second order methods even in the unregularized case.
The intuition is that, while the conditioning of the entropic regularized
problem scales like 1/ε, when ε = 0, this conditioning is rather driven
by m, the number of samples of the discrete distribution (which con-
trols the size of the Laguerre cells). One can also see [Knight and Ruiz,
2013, Sugiyama et al., 2017, Cohen et al., 2017, Allen-Zhu et al., 2017]
for alternate methods using second order schemes.

Remark 5.2 (Legendre transforms of OT cost functions). As stated in


Proposition 4.6, LεC (a, b) is a convex function of (a, b) (which is also

Figure 5.3: Top: examples of entropic semi-discrete c̄-transforms $g^{\bar{c},\varepsilon}$ in 1-D, for ground cost c(x, y) = |x − y| and varying ε ∈ {0, 0.01, 0.1, 0.3}. The red points are at locations $(y_j, -g_j)_j$. Bottom: examples of entropic semi-discrete c̄-transforms $g^{\bar{c},\varepsilon}$ in 2-D, for ground cost c(x, y) = ‖x − y‖ and varying ε. The black curves are the level sets of the function $g^{\bar{c},\varepsilon}$, while the colors indicate the smoothed indicator functions of the Laguerre cells $\chi_j^\varepsilon$. The red points are at locations $y_j \in \mathbb{R}^2$, and their size is proportional to $g_j$.

true in the unregularized case ε = 0). It is thus possible to compute its Legendre-Fenchel transform, which is defined in (4.50). Denoting $F_a(b) \stackrel{\text{def.}}{=} \mathrm{L}^\varepsilon_C(a, b)$, one has, for a fixed a, following Cuturi and Peyré [2016],
$$F_a^*(g) = -\varepsilon H(a) + \sum_i a_i\, g^{\bar{c},\varepsilon}(x_i).$$
Here $g^{\bar{c},\varepsilon}$ is the entropically smoothed c-transform introduced in (5.10). In the unregularized case ε = 0, and for generic measures, following Carlier et al. [2015], one has, denoting $F_\beta(\alpha) \stackrel{\text{def.}}{=} \mathcal{L}_c(\alpha, \beta)$,
$$\forall f \in \mathcal{C}(\mathcal{X}), \quad F_\beta^*(f) = \int_{\mathcal{Y}} f^{c}(y)\mathrm{d}\beta(y),$$
where the c-transform $f^c \in \mathcal{C}(\mathcal{Y})$ of f is defined in §5.1. Note that here, since $\mathcal{M}(\mathcal{X})$ is in duality with $\mathcal{C}(\mathcal{X})$, the Legendre transform is a function of continuous functions. Denoting now $G(a, b) \stackrel{\text{def.}}{=} \mathrm{L}^\varepsilon_C(a, b)$, one has, following Cuturi and Peyré [2016],
$$\forall (f, g) \in \mathbb{R}^n \times \mathbb{R}^m, \quad G^*(f, g) = -\varepsilon \log \sum_{i,j} e^{\frac{-C_{i,j}+f_i+g_j}{\varepsilon}},$$
which can be seen as a smoothed version of the Legendre transform of $\mathcal{G}(\alpha, \beta) \stackrel{\text{def.}}{=} \mathcal{L}_c(\alpha, \beta)$:
$$\forall (f, g) \in \mathcal{C}(\mathcal{X}) \times \mathcal{C}(\mathcal{Y}), \quad \mathcal{G}^*(f, g) = \inf_{(x,y)\in\mathcal{X}\times\mathcal{Y}} c(x, y) - f(x) - g(y).$$

5.4 Stochastic Optimization Methods

The semi-discrete formulation (5.8) and its smoothed version (5.12) are appealing because the energies to be optimized are written as an expectation with respect to the probability distribution α:
$$\mathcal{E}^\varepsilon(g) = \int_{\mathcal{X}} E^\varepsilon(g, x)\mathrm{d}\alpha(x) = \mathbb{E}_X\big(E^\varepsilon(g, X)\big) \quad\text{where}\quad E^\varepsilon(g, x) \stackrel{\text{def.}}{=} g^{\bar{c},\varepsilon}(x) + \langle g, b\rangle,$$
and where X denotes a random vector distributed on $\mathcal{X}$ according to α. Note that the gradient of each of the involved functionals reads
$$\nabla_g E^\varepsilon(g, x) = \big(b_j - \chi_j^\varepsilon(x)\big)_{j=1}^m \in \mathbb{R}^m.$$

One can thus use stochastic optimization methods to perform the max-
imization, as initially proposed in Genevay et al. [2016]. This allows one to obtain provably convergent algorithms without the need to resort to
some discretization of α as usually done (either approximating α us-
ing sums of Diracs or using quadrature formula for the integrals). The
measure α is used as a black box from which one can draw indepen-
dent samples, which is a natural computational setup for many high
dimensional applications in statistics and machine learning. This class
of methods has been generalized to the computation of Wasserstein
barycenters (as described in Section (9.2)) in Staib et al. [2017].

Stochastic gradient descent (SGD). Initializing $g^{(0)} = 0_m$, stochastic gradient descent (SGD, used here as a maximization method) draws, at step ℓ, a point $x_\ell \in \mathcal{X}$ according to the distribution α (all the $(x_\ell)_\ell$ being independent) and then updates
$$g^{(\ell+1)} \stackrel{\text{def.}}{=} g^{(\ell)} + \tau_\ell\, \nabla_g E^\varepsilon(g^{(\ell)}, x_\ell). \qquad (5.14)$$
The step size $\tau_\ell$ should decay fast enough to zero in order to ensure that the “noise” created by using $\nabla_g E^\varepsilon(g, x_\ell)$ as a proxy for the “true” gradient $\nabla\mathcal{E}^\varepsilon(g)$ is cancelled in the limit. A typical choice of schedule is
$$\tau_\ell \stackrel{\text{def.}}{=} \frac{\tau_0}{1 + \ell/\ell_0}, \qquad (5.15)$$
where $\ell_0$ indicates roughly the number of iterations serving as a “warmup” phase. One can prove the following convergence result:
$$\mathcal{E}^\varepsilon(g^\star) - \mathbb{E}\big(\mathcal{E}^\varepsilon(g^{(\ell)})\big) = O\Big(\frac{1}{\sqrt{\ell}}\Big),$$
where g? is a solution of (5.12) and where E indicates an expectation
with respect to the i.i.d. sampling of (x` )` performed at each itera-
tion. Figure 5.4 shows the evolution of the algorithm on a simple 2-D
example, where α is the uniform distribution on [0, 1]2 .
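A direct NumPy transcription of these updates, for α the uniform distribution on the unit square and the squared Euclidean cost (all parameter values below are illustrative), reads:

```python
import numpy as np

def sgd_semidiscrete(y, b, n_iter=10000, tau0=1.0, l0=100, eps=0.0):
    """SGD (5.14)-(5.15) on the semi-discrete dual; eps=0 gives the unregularized case."""
    rng = np.random.default_rng(0)
    g = np.zeros(len(b))
    for l in range(n_iter):
        x = rng.random(2)                                  # one fresh sample of alpha
        s = ((x - y) ** 2).sum(axis=1) - g                 # c(x, y_j) - g_j
        if eps == 0:
            chi = np.zeros(len(b)); chi[s.argmin()] = 1.0  # indicator of the Laguerre cell
        else:
            w = np.exp(-(s - s.min()) / eps); chi = w / w.sum()   # smoothed indicator (5.13)
        g += tau0 / (1 + l / l0) * (b - chi)               # ascent step (5.14)
    return g
```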

Stochastic Gradient Descent with Averaging (SGA). Stochastic


gradient descent is slow because of the fast decay of τ` toward zero.

Figure 5.4: Evolution of the energy $\mathcal{E}^\varepsilon(g^{(\ell)})$, for ε = 0 (no regularization), during the SGD iterations (5.14). Each colored curve shows a different randomized run. The images display the evolution of the Laguerre cells $(\mathbb{L}_{g^{(\ell)}}(y_j))_j$ through the iterations.

To somewhat improve the convergence speed, it is possible to average the past iterates, i.e. to run a “classical” SGD on auxiliary variables $(\tilde{g}^{(\ell)})_\ell$,
$$\tilde{g}^{(\ell+1)} \stackrel{\text{def.}}{=} \tilde{g}^{(\ell)} + \tau_\ell\, \nabla_g E^\varepsilon(\tilde{g}^{(\ell)}, x_\ell),$$
where $x_\ell$ is drawn according to α (all the $(x_\ell)_\ell$ being independent), and to output as the estimated weight vector the average
$$g^{(\ell)} \stackrel{\text{def.}}{=} \frac{1}{\ell}\sum_{k=1}^{\ell} \tilde{g}^{(k)}.$$
This defines the stochastic gradient descent with averaging (SGA) algorithm. Note that it is possible to avoid explicitly storing all the iterates by simply updating a running average as follows:
$$g^{(\ell+1)} = \frac{1}{\ell+1}\tilde{g}^{(\ell+1)} + \frac{\ell}{\ell+1} g^{(\ell)}.$$
In this case, a typical choice of decay is rather of the form
$$\tau_\ell \stackrel{\text{def.}}{=} \frac{\tau_0}{1 + \sqrt{\ell/\ell_0}}.$$

Notice that the step size now decays to 0 much more slowly than in (5.15), at rate ℓ^{−1/2}. Bach [2014] proves that SGA leads to a faster convergence (the constants involved are smaller) than SGD, since, in contrast to SGD, SGA is adaptive to the local strong convexity (or concavity for maximization problems) of the functional.
6
W 1 Optimal Transport

This chapter focuses on the setting of optimal transport in which the


ground cost is equal to a distance. Historically, this corresponds to the
original problem posed by Monge in 1781; it also appears in some of the
earliest applications of optimal transport as “earth mover’s distances”
in computer vision [Rubner et al., 2000].
Unlike the case where transportation cost equals a distance squared
(studied in particular in Chapter 7), transport with metric costs is more
difficult to analyze theoretically. Contrasting with Remark 2.23 for
transport with squared ground distances, generally there is no unique
optimal Kantorovich coupling when the cost is the ground distance
itself. Hence, in this regime it is often impossible to recover a uniquely-
defined Monge map, making this class of problems ill-suited for inter-
polation. We refer to works by Trudinger and Wang [2001], Caffarelli
et al. [2002], Sudakov [1979] for proofs of existence of optimal W 1 trans-
portation plans and detailed analyses of their geometric structure.
Nevertheless, transport with linear ground distance is useful to com-
pare histograms, since a non-squared loss is more robust to outliers
in noisy data than a quadratic cost. Furthermore, this problem ad-
mits an elegant dual reformulation involving local flow or divergence

105
106 W 1 Optimal Transport

constraints, suggesting cheaper numerical algorithms that align with


minimum-cost flow methods over networks in graph theory. This set-
ting is also popular because the associated OT distances define a norm
that can compare arbitrary distributions, even if they are not positive;
this property is shared by a larger class of so-called dual norms (see
§8.2 and Remark 10.6 for more details).

6.1 W 1 on Metric Spaces

Here we assume that d is a distance on X = Y, and we solve the OT


problem with the ground cost c(x, y) = d(x, y). The following proposi-
tion highlights key properties of the c-transform (5.1) in this setup. In
the following, we denote the Lipschitz constant of a function f ∈ C(X )
as
Lip(f) def.= sup { |f(x) − f(y)| / d(x, y) : (x, y) ∈ X², x ≠ y }.
We define Lipschitz functions to be those functions f satisfying
Lip(f ) < +∞; they form a convex subset of C(X ).
Proposition 6.1. Suppose X = Y and c(x, y) = d(x, y). Then, there exists g such that f = g^c if and only if Lip(f) ≤ 1. Furthermore, if Lip(f) ≤ 1, then f^c = −f.
Proof. First, suppose f = g^c. Then, for x, y ∈ X,

|f(x) − f(y)| = | inf_{z∈X} [d(x, z) − g(z)] − inf_{z∈X} [d(y, z) − g(z)] | ≤ sup_{z∈X} |d(x, z) − d(y, z)| ≤ d(x, y).

The first equality follows from the definition of g^c, the next inequality from the identity |inf f − inf g| ≤ sup |f − g|, and the last one from the triangle inequality. This shows that Lip(f) ≤ 1.
Now, suppose Lip(f) ≤ 1, and define g def.= −f. By the Lipschitz property, for all x, y ∈ X, f(y) − d(x, y) ≤ f(x) ≤ f(y) + d(x, y). Applying these inequalities,

g^c(y) = inf_{x∈X} [d(x, y) + f(x)] ≥ inf_{x∈X} [d(x, y) + f(y) − d(x, y)] = f(y),
g^c(y) = inf_{x∈X} [d(x, y) + f(x)] ≤ inf_{x∈X} [d(x, y) + f(y) + d(x, y)] = f(y).

Hence, f = g^c with g = −f. Using the same inequalities shows

f^c(y) = inf_{x∈X} [d(x, y) − f(x)] ≥ inf_{x∈X} [d(x, y) − f(y) − d(x, y)] = −f(y),
f^c(y) = inf_{x∈X} [d(x, y) − f(x)] ≤ inf_{x∈X} [d(x, y) − f(y) + d(x, y)] = −f(y).

This shows f^c = −f.

Starting from the single potential formulation (5.4), one can iterate
the construction and replace the couple (g, g c ) by (g c , (g c )c ). The last
proposition shows that one can thus use (g c , −g c ), which in turn is
equivalent to any pair (f, −f ) such that Lip(f ) ≤ 1. This leads to the
following alternative expression for the W 1 distance
W_1(α, β) = max_f { ∫_X f(x) (dα(x) − dβ(x)) : Lip(f) ≤ 1 }.   (6.1)

This expression shows that W_1 is actually a norm, i.e., W_1(α, β) = ‖α − β‖_{W_1}, and that it is still valid for any measures (not necessarily positive) as long as ∫_X dα = ∫_X dβ. This norm is often called the Kantorovich–Rubinshtein norm [Kantorovich and Rubinshtein, 1958].
For discrete measures of the form (2.1), writing α − β = Σ_k m_k δ_{z_k} with z_k ∈ X and Σ_k m_k = 0, the optimization (6.1) can be rewritten as

W_1(α, β) = max_{(f_k)_k} { Σ_k f_k m_k : ∀ (k, ℓ), |f_k − f_ℓ| ≤ d(z_k, z_ℓ) }   (6.2)

which is a finite-dimensional convex program with quadratic-cone con-


straints. It can be solved using interior point methods, or as we detail
next for a similar problem, using proximal methods.
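As an illustration, the sketch below solves (6.2) directly as a linear program, splitting each constraint |f_k − f_ℓ| ≤ d(z_k, z_ℓ) into two linear inequalities. This is only practical for small n; the use of scipy's linprog and of a Euclidean ground metric are choices made for the example, not prescribed by the text.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def w1_dual_lp(z, m):
    """W1 between discrete measures via the dual LP (6.2) (sketch).

    z: (n, d) support points z_k of alpha - beta
    m: (n,)   signed masses m_k, with m.sum() == 0
    """
    n = len(m)
    D = cdist(z, z)                        # ground distances d(z_k, z_l)
    rows, rhs = [], []
    for k in range(n):
        for l in range(n):
            if k == l:
                continue
            row = np.zeros(n)
            row[k], row[l] = 1.0, -1.0     # f_k - f_l <= d(z_k, z_l)
            rows.append(row)
            rhs.append(D[k, l])
    res = linprog(-np.asarray(m), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * n, method="highs")
    return -res.fun                        # max of <f, m> under the Lipschitz constraints
```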
When using d(x, y) = |x−y| with X = R, we can reduce the number
of constraints by ordering the zk ’s via z1 ≤ z2 ≤ . . .. In this case, we
only have to solve
W_1(α, β) = max_{(f_k)_k} { Σ_k f_k m_k : ∀ k, |f_{k+1} − f_k| ≤ z_{k+1} − z_k },

which is a linear program. Note that furthermore, in this 1-D case,


a closed form expression for W 1 using cumulative functions is given
in (2.35).

Remark 6.1 (W_p with 0 < p ≤ 1). If 0 < p ≤ 1, then d̃(x, y) def.= d(x, y)^p satisfies the triangle inequality, and hence d̃ is itself a distance. One can thus apply the results and algorithms detailed above for W_1 to compute W_p by simply using d̃ in place of d. This is equivalent to stating that W_p is the dual of p-Hölder functions {f : Lip_p(f) ≤ 1}, where

Lip_p(f) def.= sup { |f(x) − f(y)| / d(x, y)^p : (x, y) ∈ X², x ≠ y }.

6.2 W 1 on Euclidean Space

In the special case of Euclidean spaces X = Y = R^d, using c(x, y) = ‖x − y‖, the global Lipschitz constraint appearing in (6.1) can be made local as a uniform bound on the gradient of f,

W_1(α, β) = max_f { ∫_{R^d} f(x) (dα(x) − dβ(x)) : ‖∇f‖_∞ ≤ 1 }.   (6.3)

Here the constraint ‖∇f‖_∞ ≤ 1 signifies that the norm of the gradient of f at any point x is upper bounded by 1, i.e. ‖∇f(x)‖_2 ≤ 1 for any x.
Considering the dual problem to (6.3), one obtains an optimization problem under a fixed divergence constraint,

W_1(α, β) = min_s { ∫_{R^d} ‖s(x)‖_2 dx : div(s) = α − β }.   (6.4)

Here the vectorial function s(x) ∈ R^d can be interpreted as a flow field, describing locally the movement of mass. Outside the support of the two input measures, div(s) = 0, which is the conservation of mass constraint. Once properly discretized using finite elements, problems (6.3) and (6.4) become non-smooth convex optimization problems. It is possible to use off-the-shelf interior-point quadratic-cone optimization solvers, but as advocated in §7.3, large-scale problems require the use of simpler but better-adapted first-order methods. One can thus use for instance DR iterations (7.14) or the related ADMM method. Note that on a uniform grid, projecting on the divergence constraint is conveniently handled using the Fast Fourier Transform. We refer to Solomon et al. [2014a] for a detailed account of these approaches and applications to OT on triangulated meshes. See also Li et al. [2016] for similar techniques.

6.3 W 1 on a Graph

The previous formulations (6.3) and (6.4) of W 1 can be generalized to


the setting where X is a geodesic space, i.e. c(x, y) = d(x, y), where d
is a geodesic distance. We refer to Feldman and McCann [2002] for a
theoretical analysis in the case of X being a Riemannian manifold. In
the discrete setting, this corresponds to graphs, where X = J1, nK is a
finite set of indexes, and (i, j) ∈ E ⊂ X 2 is an edge (here the graph is
assumed to be undirected) equipped with some weight (length) wi,j .
In this case, the geodesic distance is
(K−1 )
def.
X
Di,j = min wik ,ik+1 : ∀ k ∈ J1, K − 1K, (ik , ik+1 ) ∈ E .
K≥0,(ik )k :i→j
k=1

where i → j indicates that i0 = i and iK = j.


We consider two vectors (a, b) ∈ (R^n)² defining discrete probabilities on the graph X, such that Σ_i a_i = Σ_i b_i (they do not even need to be positive). The goal is now to compute W_1(a, b), as introduced in (2.16) for p = 1, when the ground metric is such a geodesic distance, without resorting to the computation of a “full” coupling P of size n × n, relying instead on local operators that exploit the underlying connectivity of the graph. These operators are discrete equivalents of the gradient and divergence differential operators.
A discrete dual Kantorovich potential f ∈ R^n is defined on the vertices of the graph. The gradient operator ∇ : R^n → R^E is defined as

∀ (i, j) ∈ E,   (∇f)_{i,j} def.= f_i − f_j.

A flow s = (s_{i,j})_{i,j} is defined on the edges, and the divergence operator div : R^E → R^n, which is the adjoint of the gradient ∇, maps flows to vectors defined on the vertices; it is defined as

∀ i ∈ ⟦1, n⟧,   div(s)_i def.= Σ_{j:(i,j)∈E} (s_{i,j} − s_{j,i}).
j:(i,j)∈E

Figure 6.1: Example of computation of W1 (a, b) on a planar graph with uniform


weights wi,j = 1. Left: potential f solution of (6.5) (increasing value from red to
blue). The green color of the edges is proportional to |(∇f)i,j |. Right: flow s solution
of (6.6), where bold black edges display non-zero si,j , which saturate to wi,j = 1.
These saturating flow edges on the right match the light green edges on the left where |(∇f)_{i,j}| = 1.

The analogue of formula (6.3) in the graph setting reads

W_1(a, b) = max_{f∈R^n} { Σ_{i=1}^n f_i (a_i − b_i) : ∀ (i, j) ∈ E, |(∇f)_{i,j}| ≤ w_{i,j} }.   (6.5)

The associated dual formula, which is the analogue of formula (6.4), becomes in this setting

W_1(a, b) = min_{s∈R^E_+} { Σ_{(i,j)∈E} w_{i,j} s_{i,j} : div(s) = a − b }.   (6.6)

This is a linear program, which is a typical instance of a min-cost flow problem. Highly efficient dedicated simplex solvers have been devised to solve it; see for instance Ling and Okada [2007]. Figure 6.1 shows an example of primal and dual solutions. Formulation (6.6) is the so-called Beckmann formulation [Beckmann, 1952], and has been used and extended to define and study traffic congestion models; see for instance Carlier et al. [2008].
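As an illustration of (6.6), the following sketch solves the min-cost flow linear program on a small graph with a generic LP solver rather than a dedicated network simplex; the use of scipy and the splitting of each undirected edge into two nonnegative arcs are implementation choices made only for this example.

```python
import numpy as np
from scipy.optimize import linprog

def w1_graph(edges, w, a, b):
    """W1(a, b) on a graph via the min-cost flow LP (6.6) (sketch).

    edges: list of undirected edges (i, j), vertices labeled 0..n-1
    w:     list of edge weights (lengths) w_ij
    a, b:  mass vectors with a.sum() == b.sum()
    """
    n = len(a)
    # each undirected edge yields two directed arcs carrying a nonnegative flow
    arcs = [(i, j) for (i, j) in edges] + [(j, i) for (i, j) in edges]
    costs = np.concatenate([w, w])
    # vertex-arc incidence matrix encoding div(s)_i = sum_j s_ij - s_ji
    A = np.zeros((n, len(arcs)))
    for k, (i, j) in enumerate(arcs):
        A[i, k] += 1.0
        A[j, k] -= 1.0
    res = linprog(costs, A_eq=A, b_eq=np.asarray(a) - np.asarray(b), method="highs")
    return res.fun
```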
7
Dynamic Formulations

This chapter presents the geodesic (also called dynamic) point of view
of optimal transport when the cost is a squared geodesic distance. This
describes the optimal transport between two measures as a curve in the
space of measures minimizing a total length. The dynamic point of view
offers an alternate and intuitive interpretation of optimal transport, which not only makes it possible to draw links with fluid dynamics, but also results in an efficient numerical tool to compute OT in small dimensions when
interpolating between two densities. The drawback of that approach
is that it cannot scale to large-scale sparse measures, and only works
in low dimensions on regular domains (because one needs to grid the
space) with a squared geodesic cost.
In this chapter, we use the notation (α0 , α1 ) in place of (α, β) in
agreement with the idea that we start from one measure to reach an-
other one.

7.1 Continuous Formulation

In the case X = Y = R^d and c(x, y) = ‖x − y‖², the optimal transport distance W_2²(α, β) = L_c(α, β), as defined in (2.15), can be computed by


α0 α1/4 α1/2 α3/4 α1

Figure 7.1: Displacement interpolation αt satisfying (7.2). Top: for two measures
(α0 , α1 ) with densities with respect to the Lebesgue measure. Bottom: for two dis-
crete empirical measures with the same number of points (bottom).

looking for a minimal length path (α_t)_{t=0}^1 between these two measures. This path is described by advecting the measure using a vector field v_t defined at each instant. The vector field v_t and the path α_t must satisfy the conservation of mass formula, resulting in

∂α_t/∂t + div(α_t v_t) = 0   and   α_{t=0} = α_0, α_{t=1} = α_1,   (7.1)

where the equation above should be understood in the sense of distributions on R^d. The infinitesimal length of such a vector field is measured using the L² norm associated to the measure α_t, defined as

‖v_t‖_{L²(α_t)} = ( ∫_{R^d} ‖v_t(x)‖² dα_t(x) )^{1/2}.

This definition leads to the following minimal-path reformulation of W_2, originally introduced by Benamou and Brenier [2000]:

W_2²(α_0, α_1) = min_{(α_t, v_t)_t sat. (7.1)} ∫_0^1 ∫_{R^d} ‖v_t(x)‖² dα_t(x) dt   (7.2)

where α_t is a scalar-valued measure and v_t a vector-valued measure.


Figure 7.1 shows two examples of such paths of measures.
The formulation (7.2) is a non-convex formulation in the variables
(αt , vt )t because of the constraint (7.1) involving the product αt vt . In-

troducing a vector-valued measure (often called the “momentum”)

J_t def.= α_t v_t,

Benamou and Brenier showed in their landmark paper [2000] that it is instead convex in the variable (α_t, J_t)_t when writing

W_2²(α_0, α_1) = min_{(α_t, J_t)_t ∈ C(α_0, α_1)} ∫_0^1 ∫_{R^d} θ(α_t(x), J_t(x)) dx dt   (7.3)

where we define the set of constraints as

C(α_0, α_1) def.= { (α_t, J_t) : ∂α_t/∂t + div(J_t) = 0, α_{t=0} = α_0, α_{t=1} = α_1 },   (7.4)

and where θ : R_+ × R^d → R_+ ∪ {+∞} is the following lower semicontinuous convex function:

∀ (a, b) ∈ R_+ × R^d,   θ(a, b) = { ‖b‖²/a if a > 0;  0 if (a, b) = (0, 0);  +∞ otherwise. }   (7.5)

This definition might seem complicated, but it is crucial to impose that the momentum J_t(x) should vanish when α_t(x) = 0. Note also that (7.3) is written in an informal way, as if the measures (α_t, J_t) had densities, but this is acceptable because θ is a 1-homogeneous function, which can thus be extended in an unambiguous way from densities to measures.

Remark 7.1 (Links with McCann’s Interpolation). In the case (see


Equation (2.26)) where there exists an optimal Monge map T :
Rd → Rd with T] α0 = α1 , then αt is equal to McCann’s interpola-
tion
αt = ((1 − t)Id + tT )] α0 . (7.6)
In the 1-D case, using Remark 2.28, this interpolation can be com-
puted thanks to the relation

C_{α_t}^{−1} = (1 − t) C_{α_0}^{−1} + t C_{α_1}^{−1},   (7.7)

see Figure 2.10. We refer to Gangbo and McCann [1996] for a


detailed review on the Riemannian geometry of the Wasserstein space.

α0 α1/5 α2/5 α3/5 α4/5 α1

Figure 7.2: Comparison of displacement interpolation (7.8) of discrete measures.


Top: point clouds (empirical measures (α0 , α1 ) with the same number of points).
Bottom: same but with weight. For 0 < t < 1, the top example corresponds to an
empirical measure interpolation αt with N points, while the bottom one defines a
measure supported on 2N − 1 points.

In the case where there is only a coupling π (not necessarily


supported on a Monge map), one can compute this interpolant as

α_t = P_{t♯} π   where   P_t : (x, y) ∈ R^d × R^d ↦ (1 − t)x + ty.   (7.8)

For instance, in the discrete setup (2.3), denoting P a solution


to (2.11), an interpolation is defined as

α_t = Σ_{i,j} P_{i,j} δ_{(1−t)x_i + t y_j}.   (7.9)

Such an interpolation is typically supported on n + m − 1 points,


which is the maximum number of nonzero elements of P. Figure 7.2 shows two examples of such displacement interpolations of
discrete measures. This construction can be generalized to geodesic
spaces X by replacing Pt by the interpolation along geodesic paths.
McCann interpolation finds many applications, for instance color,
shape and illumination interpolations in computer graphics [Bon-
neel et al., 2011].
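The discrete interpolation (7.9) is straightforward to implement once a coupling has been computed; the following sketch simply collects the non-zero entries of P and displaces their support points (the function name and array layout are illustrative).

```python
import numpy as np

def displacement_interpolation(P, x, y, t):
    """Discrete displacement interpolation (7.9) from a coupling (sketch).

    P: (n, m) optimal coupling between the two empirical measures
    x: (n, d), y: (m, d) support points
    t: time in [0, 1]
    Returns the support points and weights of the interpolated measure alpha_t.
    """
    i, j = np.nonzero(P)                   # at most n + m - 1 entries for a vertex solution
    points = (1 - t) * x[i] + t * y[j]     # (1 - t) x_i + t y_j
    weights = P[i, j]
    return points, weights
```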

7.2 Discretization on Uniform Staggered Grids

For simplicity, we describe the numerical scheme in dimension d = 2; the extension to higher dimensions is straightforward. We follow the discretization method introduced by Papadakis et al. [2014], which is inspired by the staggered grid techniques commonly used in fluid dynamics. We discretize time as t_k = k/T ∈ [0, 1], and assume the space is uniformly discretized at points x_i = (i_1/n_1, i_2/n_2) ∈ X = [0, 1]². We use a staggered grid representation, so that α_t is represented using a ∈ R^{(T+1)×n_1×n_2} associated to half grid points in time, whereas J is represented using J = (J^1, J^2), where J^1 ∈ R^{T×(n_1+1)×n_2} and J^2 ∈ R^{T×n_1×(n_2+1)} are stored at half grid points in each space direction. Using this representation, for (k, i_1, i_2) ∈ ⟦1, T⟧ × ⟦1, n_1⟧ × ⟦1, n_2⟧,
the time derivative is computed as

(∂_t a)_{k,i} def.= a_{k+1,i} − a_{k,i},

and the spatial divergence as

div(J)_{k,i} def.= J^1_{k,i_1+1,i_2} − J^1_{k,i_1,i_2} + J^2_{k,i_1,i_2+1} − J^2_{k,i_1,i_2},   (7.10)

which are both defined at grid points, thus forming arrays in R^{T×n_1×n_2}.
In order to evaluate the functional to be optimized, one needs interpolation operators from mid-grid points to grid points: for all (k, i_1, i_2) ∈ ⟦1, T⟧ × ⟦1, n_1⟧ × ⟦1, n_2⟧,

I_a(a)_{k,i} def.= I(a_{k+1,i}, a_{k,i}),
I_J(J)_{k,i} def.= ( I(J^1_{k,i_1+1,i_2}, J^1_{k,i_1,i_2}), I(J^2_{k,i_1,i_2+1}, J^2_{k,i_1,i_2}) ).

The simplest choice is the linear operator I(r, s) = (r + s)/2, which is the one we consider next.
The discrete counterpart to (7.3) reads

min_{(a,J)∈C(a_0,a_1)} Θ(I_a(a), I_J(J))   (7.11)

where   Θ(ã, J̃) def.= Σ_{k=1}^T Σ_{i_1=1}^{n_1} Σ_{i_2=1}^{n_2} θ(ã_{k,i}, J̃_{k,i}),

and where the constraint now reads

C(a_0, a_1) def.= { (a, J) : ∂_t a + div(J) = 0, (a_{0,·}, a_{T,·}) = (a_0, a_1) },

where a ∈ R^{(T+1)×n_1×n_2} and J = (J^1, J^2) with J^1 ∈ R^{T×(n_1+1)×n_2}, J^2 ∈ R^{T×n_1×(n_2+1)}. Figure 7.3 shows an example of evolution (α_t)_t approximated using this discretization scheme.
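A minimal NumPy sketch of these staggered-grid finite-difference operators is given below; the array shapes follow the conventions above, and the midpoint interpolation I(r, s) = (r + s)/2 is the linear choice mentioned in the text.

```python
import numpy as np

def time_derivative(a):
    # a: (T+1, n1, n2) staggered in time; returns (partial_t a) of shape (T, n1, n2)
    return a[1:] - a[:-1]

def divergence(J1, J2):
    # J1: (T, n1+1, n2), J2: (T, n1, n2+1); returns div(J) of shape (T, n1, n2), cf. (7.10)
    return (J1[:, 1:, :] - J1[:, :-1, :]) + (J2[:, :, 1:] - J2[:, :, :-1])

def interp_time(a):
    # midpoint interpolation I(r, s) = (r + s)/2 from time-staggered to centered grid
    return 0.5 * (a[1:] + a[:-1])

def interp_space(J1, J2):
    # same midpoint interpolation applied to each staggered spatial component
    return 0.5 * (J1[:, 1:, :] + J1[:, :-1, :]), 0.5 * (J2[:, :, 1:] + J2[:, :, :-1])
```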
Remark 7.2 (Dynamic Formulation on Graphs). In the case where X is a graph and c(x, y) = d_X(x, y)², it is possible to derive faithful discretization methods, which use a discrete divergence associated to the graph structure in place of the uniform grid discretization (7.10). In order to ensure that the heat equation has a gradient flow structure (see §9.3 for more details about gradient flows) for the corresponding dynamic Wasserstein distance, Maas [2011] and Erbar et al. [2017] propose to use a logarithmic mean I(r, s) (see also Solomon et al. [2016b]).

7.3 Proximal Solvers

The discretized dynamic OT problem (7.11) is challenging to solve because it requires minimizing a non-smooth objective under affine constraints. Indeed, the function θ is convex but non-smooth for measures with vanishing mass a_{k,i}. When interpolating between two compactly supported inputs (a_0, a_1), one typically expects the mass of the interpolated measures (a_k)_{k=1}^T to vanish as well, and the difficult part of the optimization process is indeed to track this evolution of the support. In particular, it is not possible to use standard smooth optimization techniques.
There are several ways to recast (7.11) into a quadratic-cone program, either by considering the dual problem, or simply by replacing the functional θ(ã_{k,i}, J̃_{k,i}) by a linear function under constraints,

Θ(ã, J̃) = min_{z̃} { Σ_{k,i} z̃_{k,i} : ∀ (k, i), (z̃_{k,i}, ã_{k,i}, J̃_{k,i}) ∈ L },

which thus requires the introduction of an extra variable z̃. Here L def.= { (z, a, J) ∈ R × R_+ × R^d : ‖J‖² ≤ z a } is a rotated Lorentz quadratic cone. With this extra variable, it is thus possible to solve the discretized

problem using standard interior-point solvers for quadratic-cone programs [Nesterov and Nemirovskii, 1994]. These solvers have fast convergence rates, and are thus capable of computing solutions with very high precision. Unfortunately, each iteration is very costly and requires the resolution of a linear system whose dimension scales with the number of discretization points. They are thus not applicable to the large-scale, multi-dimensional problems encountered in imaging applications.
An alternative to these high-precision solvers is the class of low-precision first-order methods, which are well suited for non-smooth but highly structured problems such as (7.11). While this class of solvers is not new, it has recently been revitalized in the fields of imaging and machine learning, because these methods are a perfect fit for such applications, where numerical precision is not the driving goal. We refer for instance to the monograph of Bauschke and Combettes [2011] for a detailed account of these solvers and their use for large-scale applications. We here concentrate on a specific solver, but of course many more can be used, and we refer to Papadakis et al. [2014] for a study of several such approaches for dynamical OT. Note that the idea of using first-order schemes for dynamical OT was initially proposed in Benamou and Brenier [2000].
The Douglas–Rachford (DR) algorithm [Lions and Mercier, 1979] is specifically tailored to solve non-smooth structured problems of the form

min_{x∈H} F(x) + G(x)   (7.12)

where H is some Euclidean space, and where F, G : H → R ∪ {+∞} are two closed convex functions for which one can “easily” (i.e. in closed form or using a rapidly converging scheme) compute the so-called proximal operator

∀ x ∈ H,   Prox_{τF}(x) def.= argmin_{x′∈H} (1/2) ‖x − x′‖² + τ F(x′)   (7.13)

for a parameter τ > 0. Note that this corresponds to the proximal map for the Euclidean metric, and that this definition can be extended to more general Bregman divergences in place of ‖x − x′‖²; see (4.48) for an example using the KL divergence. The iterations of the DR algorithm
example using the KL divergence. The iterations of the DR algorithm

define a sequence (x^(ℓ), w^(ℓ)) ∈ H² from an initial (x^(0), w^(0)) ∈ H² via

w^(ℓ+1) def.= w^(ℓ) + α ( Prox_{γF}(2x^(ℓ) − w^(ℓ)) − x^(ℓ) ),
x^(ℓ+1) def.= Prox_{γG}(w^(ℓ+1)).   (7.14)

If 0 < α < 2 and γ > 0, one can show that x^(ℓ) → z⋆, a solution of (7.12); see Combettes and Pesquet [2007] for more details. This algorithm is closely related to another popular method, the Alternating Direction Method of Multipliers (ADMM) [Gabay and Mercier, 1976, Glowinski and Marroco, 1975] (see also Boyd et al. [2011] for a review), which can be retrieved by applying DR to a dual problem; see Papadakis et al. [2014] for more details on the equivalence between the two, initially proved by Eckstein and Bertsekas [1992].
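The generic iterations (7.14) are easy to implement once the two proximal maps are available; a minimal sketch follows, in which prox_F and prox_G are placeholders for the problem-specific operators described below.

```python
def douglas_rachford(prox_F, prox_G, x0, gamma=1.0, alpha=1.0, n_iter=500):
    """Generic Douglas-Rachford iterations (7.14) (sketch).

    prox_F, prox_G: callables x -> Prox_{gamma F}(x) and x -> Prox_{gamma G}(x)
    x0: initial point (any object supporting +, -, and scalar multiplication)
    """
    w = x0
    x = prox_G(w)
    for _ in range(n_iter):
        w = w + alpha * (prox_F(2 * x - w) - x)   # update of the auxiliary variable
        x = prox_G(w)                             # current estimate of the solution
    return x
```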
There are many ways to recast problem (7.11) in the form (7.12), and we refer to Papadakis et al. [2014] for an exploration of several possibilities. A simple way to achieve this is by setting x = (a, J, ã, J̃) and letting

F(x) def.= Θ(ã, J̃) + ι_{C(a_0,a_1)}(a, J)   and   G(x) def.= ι_D(a, J, ã, J̃)

where D def.= { (a, J, ã, J̃) : ã = I_a(a), J̃ = I_J(J) }.
The proximal operators of these two functions can be computed efficiently. Indeed,

Prox_{τF}(x) = ( Prox_{τΘ}(ã, J̃), Proj_{C(a_0,a_1)}(a, J) ).

The proximal operator Prox_{τΘ} is computed by solving a cubic polynomial equation at each grid position. The orthogonal projection on the affine constraint C(a_0, a_1) involves the resolution of a Poisson equation, which can be achieved in O(N log(N)) operations using the Fast Fourier Transform, where N = T n_1 n_2 is the number of grid points. Lastly, the proximal operator Prox_{τG} is a linear projector, which requires the inversion of a small linear system. We refer to Papadakis et al. [2014] for more details on these computations. Figure 7.3 shows an example of application of this method to compute a dynamical interpolation inside a complicated planar domain. This class of proximal methods for dynamical OT has also been used to solve related problems such as mean field games [Benamou and Carlier, 2015].

Figure 7.3: Solution αt of dynamic OT computed with a proximal splitting scheme.

7.4 Dynamical Unbalanced OT

In order to be able to match input measures with different masses α_0(X) ≠ α_1(X) (the so-called “unbalanced” setting, a terminology introduced in Benamou [2003]), and also to cope with local mass variations, several normalizations or relaxations have been proposed, in particular by relaxing the fixed marginal constraints; see §10.2.
A general methodology consists in introducing a source term s_t(x) in the continuity equation (7.4). We thus consider

C̄(α_0, α_1) def.= { (α_t, J_t, s_t) : ∂α_t/∂t + div(J_t) = s_t, α_{t=0} = α_0, α_{t=1} = α_1 }.

The crucial question is how to measure the cost associated to this source term and introduce it in the original dynamic formulation (7.3). Several different proposals have been made in the literature, for instance using an L² cost [Piccoli and Rossi, 2014]. In order to avoid having “teleportation of mass” (mass which travels at infinite speed and suddenly grows in a region where there was no mass before), the associated cost should be infinite. It turns out that this can be achieved in a simple convex way, by also allowing s_t to be an arbitrary measure (i.e. using a 1-homogeneous cost) and by penalizing s_t in the same way as

the momentum J_t,

WFR²(α_0, α_1) = min_{(α_t, J_t, s_t)_t ∈ C̄(α_0, α_1)} Θ(α, J, s)   (7.15)

where   Θ(α, J, s) def.= ∫_0^1 ∫_{R^d} ( θ(α_t(x), J_t(x)) + τ θ(α_t(x), s_t(x)) ) dx dt,

where θ is the convex 1-homogeneous function introduced in (7.5), and τ is a weight controlling the tradeoff between mass transportation and mass creation/destruction. This formulation was proposed independently by several authors [Liero et al., 2016, Chizat et al., 2016, Kondratyev et al., 2016]. This “dynamic” formulation has a “static” counterpart; see Remark 10.5. The convex optimization problem (7.15) can be solved using methods similar to those detailed in §7.3. This dynamic formulation resembles “metamorphosis” models for shape registration [Trouvé and Younes, 2005], and a more precise connection is detailed in Maas et al. [2015, 2016].
As τ → 0, and if α_0(X) = α_1(X), one retrieves the classical OT, WFR(α_0, α_1) → W(α_0, α_1). In contrast, as τ → +∞, this distance approaches the Hellinger metric over densities,

(1/τ) WFR(α_0, α_1)²  →  ∫_X | √(ρ_{α_0}(x)) − √(ρ_{α_1}(x)) |² dx  =  ∫_X | 1 − √( dα_1/dα_0 (x) ) |² dα_0(x)   as τ → +∞.

7.5 More General Mobility Functionals

It is possible to generalize the dynamic formulation (7.3) by considering other “mobility functions” θ in place of the one defined in (7.5). A possible choice for this mobility functional is proposed in Dolbeault et al. [2009],

∀ (a, b) ∈ R_+ × R^d,   θ(a, b) = a^{s−p} ‖b‖^p,   (7.16)

where the parameters should satisfy p ≥ 1 and s ∈ [1, p] in order for θ to be convex. Note that this definition should be handled with care in the case 1 < s ≤ p, because θ is no longer 1-homogeneous, so that

Figure 7.4: Comparison of Hellinger, Wasserstein, and Wasserstein dynamic in-


terpolations with unbalanced OT.

solutions to (7.3) must be constrained to have a density with respect to the Lebesgue measure.
The case s = 1 corresponds to classical OT, and the optimal value of (7.3) defines W_p(α, β). In this case, θ is 1-homogeneous, so that solutions to (7.3) can be arbitrary measures. The case (s = 1, p = 2) is the initial setup considered in (7.3) to define W_2.
The limiting case s = p is also interesting, because it corresponds to a dual Sobolev norm W^{−1,p}, and the value of (7.3) is then equal to

‖α − β‖_{W^{−1,p}(R^d)} = max_f { ∫_{R^d} f d(α − β) : ∫_{R^d} ‖∇f(x)‖^q dx ≤ 1 }

for 1/q + 1/p = 1. In the limit p = s → 1, one recovers the W_1 norm. The case s = p = 2 corresponds to the Sobolev H^{−1}(R^d) Hilbert norm defined in (8.16).

7.6 Dynamic Formulation over the Paths Space

There is a natural dynamical formulation of both classical and entropic-regularized OT (as exposed in Chapter 4), which is based on studying abstract optimization problems on the space X̄ of all possible paths γ : [0, 1] → X (i.e. curves) on the space X. For simplicity, we assume X = R^d, but this extends to more general spaces such as geodesic spaces and graphs. Informally, the dynamics of “particles” between two input measures α_0, α_1 at times t = 0, 1 is described by a probability distribution π̄ ∈ M_+^1(X̄). Such a distribution should satisfy that the distributions of starting and end points match (α_0, α_1), which is formally written using the push-forward as

Ū(α_0, α_1) def.= { π̄ ∈ M_+^1(X̄) : P̄_{0♯} π̄ = α_0, P̄_{1♯} π̄ = α_1 },

where, for any path γ ∈ X̄, P̄_0(γ) = γ(0) and P̄_1(γ) = γ(1).

OT over the space of paths. The dynamical version of classical OT (2.15), formulated over the space of paths, then reads

W_2(α_0, α_1)² = min_{π̄ ∈ Ū(α_0, α_1)} ∫_{X̄} L(γ)² dπ̄(γ)   (7.17)

where L(γ) is the length of a path γ. The connection between the optimal couplings π⋆ and π̄⋆, solving respectively (2.15) and (7.17), is that π̄⋆ only gives mass to geodesics, joining pairs of points in proportions prescribed by π⋆. In the particular case of discrete measures, this means that

π⋆ = Σ_{i,j} P_{i,j} δ_{(x_i, y_j)}   and   π̄⋆ = Σ_{i,j} P_{i,j} δ_{γ_{x_i, y_j}},

where γ_{x_i, y_j} is the geodesic between x_i and y_j. Furthermore, the measures defined by the distribution of the curve points γ(t) at time t, where γ is drawn following π̄⋆, i.e.

t ∈ [0, 1] ↦ α_t def.= P_{t♯} π̄⋆   where   P_t(γ) = γ(t) ∈ X,   (7.18)

define a solution to the dynamical formulation (7.3), i.e. the displacement interpolation. In the discrete case, one recovers (7.9).

Entropic OT over the space of paths. We now turn to the reinterpretation of entropic OT, defined in Chapter 4, using the space of paths. Similarly to (4.11), this is defined using a Kullback–Leibler projection, but this time of a reference measure K̄ over the space of paths, which is the distribution of a reversible Brownian motion with a uniform distribution at the initial and final times:

min_{π̄ ∈ Ū(α_0, α_1)} KL(π̄ | K̄).   (7.19)

ε=0 ε = .05 ε = 0.2 ε=1

Figure 7.5: Samples from Brownian bridge paths associated to the Schrödinger
entropic interpolation (7.20) over path space. Blue corresponds to t = 0 and red to
t = 1.

We refer to the review paper of Léonard [2014] for an overview of this problem and a historical account of the work of Schrödinger [1931]. One can show that the (unique) solution π̄_ε⋆ to (7.19) converges to a solution of (7.17) as ε → 0. Furthermore, this solution is linked to the solution of the static entropic OT problem (4.9) through Brownian bridges γ̄^ε_{x,y} ∈ M_+^1(X̄) (which are similar to fuzzy geodesics, and converge to δ_{γ_{x,y}} as ε → 0). In the discrete setting, this means that

π_ε⋆ = Σ_{i,j} P⋆_{ε,i,j} δ_{(x_i, y_j)}   and   π̄_ε⋆ = Σ_{i,j} P⋆_{ε,i,j} γ̄^ε_{x_i, y_j},   (7.20)

where P⋆_{ε,i,j} can be computed using Sinkhorn’s algorithm. Similarly


to (7.18), one then can define an entropic interpolation as

αε,t = Pt] π̄ε? .


def.

ε of the position at time t along a Brownian bridge


Since the law Pt] γ̄x,y
is a Gaussian Gt(1−t)ε2 (· − γx,y (t)) of variance t(1 − t)ε2 centered at
γx,y (t), one can deduce that αε,t is a Gaussian blurring of a set of
traveling diracs
X
αε,t = P?ε,i,j Gt(1−t)ε2 (· − γxi ,yj (t)).
i,j

Another way to describe this entropic interpolation (αt )t is using


a regularization of the Benamou-Brenier dynamic formulation (7.2),
124 Dynamic Formulations

namely
Z 1Z
ε
 
min kvt (x)k2 + k∇ log(αt )(x)k2 dαt (x)dt,
(αt ,vt )t sat. (7.1) 0 Rd 4
(7.21)
see Gentil et al. [2015], Chen et al. [2016a].
8
Statistical Divergences

§8.1 and 8.2 first review two important classes of “divergences” (not
necessarily distances) defined between probability distributions (and
sometimes more general classes of measures). A divergence D typically
satisfies D(α, β) ≥ 0 and D(α, β) = 0 if and only if α = β, but it does
not need to be symmetric or satisfy the triangle inequality. These divergences are useful as loss functions to tackle variational problems with applications as diverse as imaging or machine learning; see Chapter 9. §8.4 details how to approximate D(α, β) from discrete samples (x_i)_i and (y_j)_j drawn from α and β. This is an important problem for applications in statistics and machine learning.

8.1 ϕ-Divergences

Before detailing in the following section “weak” norms, whose construction shares similarities with W_1, let us detail a generic construction of so-called divergences between measures, which can then be called (and used as) “loss” or “fidelity” functions.
so-called divergences between measures, which can then be called (and
used as) “loss” or “fidelity” functions. Such divergences compare two
input measures by comparing their mass pointwise, without introducing
any notion of mass transportation. Divergences are functionals which,


by looking at the pointwise “ratio” between two measures, give a sense


of how close they are. They have nice analytical and computational
properties and are built from entropy functions.

Definition 8.1 (Entropy function). A function ϕ : R → R ∪ {∞} is an entropy function if it is lower semicontinuous, convex, dom ϕ ⊂ [0, ∞[, and satisfies the following feasibility condition: dom ϕ ∩ ]0, ∞[ ≠ ∅. The speed of growth of ϕ at ∞ is described by

ϕ′_∞ = lim_{x→+∞} ϕ(x)/x ∈ R ∪ {∞}.

If ϕ′_∞ = ∞, then ϕ grows faster than any linear function and ϕ is said to be superlinear. Any entropy function ϕ induces a ϕ-divergence (also known as a Csiszár divergence [Csiszár, 1967, Ali and Silvey, 1966] or f-divergence) as follows.

Definition 8.2 (ϕ-Divergences). Let ϕ be an entropy function. For α, β ∈ M(X), let (dα/dβ) β + α^⊥ be the Lebesgue decomposition¹ of α with respect to β. The divergence D_ϕ is defined by

D_ϕ(α|β) def.= ∫_X ϕ(dα/dβ) dβ + ϕ′_∞ α^⊥(X)   (8.1)

if α, β are nonnegative and ∞ otherwise.

In definition (8.1), the additional term ϕ0∞ α⊥ (X ) is important to


ensure that Dϕ defines a continuous functional (for the weak topology
of measures) even if ϕ has a linear growth at infinity (as this is for
instance the case for the absolute value (8.8) defining the TV norm). If
ϕ has a superlinear growth (as for instance for the usual entropy (8.4)),
then ϕ0∞ = +∞ so that Dϕ (α|β) = +∞ if α does not have a density
with respect to β.
In the discrete setting, assuming

α = Σ_i a_i δ_{x_i}   and   β = Σ_i b_i δ_{x_i}   (8.2)

¹The Lebesgue decomposition theorem asserts that, given β, α admits a unique decomposition as the sum of two measures α^s + α^⊥ such that α^s is absolutely continuous with respect to β and α^⊥ and β are singular.

Figure 8.1: Examples of entropy functionals (KL, TV, Hellinger, χ²).

are supported on the same set of n points (x_i)_{i=1}^n ⊂ X, (8.1) defines a divergence on Σ_n,

D_ϕ(a|b) = Σ_{i∈Supp(b)} ϕ(a_i/b_i) b_i + ϕ′_∞ Σ_{i∉Supp(b)} a_i   (8.3)

where Supp(b) def.= {i ∈ ⟦n⟧ : b_i ≠ 0}.
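A direct implementation of the discrete divergence (8.3) is sketched below; the generic phi and phi_inf arguments and the KL instantiation are illustrative choices, not notation from the text.

```python
import numpy as np

def phi_divergence(a, b, phi, phi_inf):
    """Discrete phi-divergence (8.3) between histograms (sketch).

    phi:     entropy function applied to the ratios a_i / b_i
    phi_inf: its asymptotic slope phi'_inf (np.inf for superlinear entropies)
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    supp = b > 0
    val = np.sum(phi(a[supp] / b[supp]) * b[supp])
    mass_out = np.sum(a[~supp])            # mass of a outside the support of b
    if mass_out > 0:
        val += phi_inf * mass_out
    return val

# Kullback-Leibler entropy (8.4): phi(s) = s log s - s + 1, phi'_inf = +inf
phi_kl = lambda s: np.where(s > 0, s * np.log(np.maximum(s, 1e-300)) - s + 1, 1.0)
# example: kl = phi_divergence(a, b, phi_kl, np.inf)
```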
The proof of the following Proposition can be found in [Liero et al.,
2015, Thm 2.7].
Proposition 8.1. If ϕ is an entropy function, then Dϕ is jointly 1-
homogeneous, convex and weakly* lower semicontinuous in (α, β).
We now review a few popular instances of this framework. Figure 8.1
displays the associated entropy functionals, while Figure 8.2 reviews the
relationship between them.
Example 8.1 (Kullback–Leibler divergence). The Kullback–Leibler divergence KL def.= D_{ϕ_KL}, also known as the relative entropy, was already introduced in (4.10) and (4.6). It is the divergence associated to the Shannon–Boltzmann entropy function ϕ_KL, given by

ϕ_KL(s) = { s log(s) − s + 1 for s > 0;  1 for s = 0;  +∞ otherwise. }   (8.4)

Remark 8.1 (Bregman divergence). The discrete KL divergence, KL def.= D_{ϕ_KL}, has the unique property of being both a ϕ-divergence and a

[Diagram with nodes χ², KL, d_H, TV and W_1, whose edges indicate bounds such as KL ≤ log(1 + χ²), d_H ≤ √(2 TV), TV ≤ d_H, W_1 ≤ d_max TV and TV ≤ W_1/d_min.]

Figure 8.2: Diagram of relationships between divergences (inspired by Gibbs and Su [2002]). For X a metric space with ground distance d, d_max def.= sup_{(x,x′)} d(x, x′) is the diameter of X. When X is discrete, d_min def.= min_{x≠x′} d(x, x′).

Bregman divergence. For discrete vectors in R^n, a Bregman divergence [Bregman, 1967] associated to a smooth strictly convex function ψ : R^n → R is defined as

B_ψ(a|b) def.= ψ(a) − ψ(b) − ⟨∇ψ(b), a − b⟩,   (8.5)

where ⟨·, ·⟩ is the canonical inner product on R^n. Note that B_ψ(a|b) is a convex function of a and a linear function of ψ. Similarly to ϕ-divergences, a Bregman divergence satisfies B_ψ(a|b) ≥ 0 and B_ψ(a|b) = 0 if and only if a = b. The KL divergence is the Bregman divergence for minus the entropy ψ = −H defined in (9.32), i.e. KL = B_{−H}. A Bregman divergence is locally a squared Euclidean distance, since

B_ψ(a + ε|a + η) = ⟨∂²ψ(a)(ε − η), ε − η⟩ + o(‖ε − η‖²),

and the set of separating points {a : B_ψ(a|b) = B_ψ(a|b′)} is a hyperplane between b and b′. These properties make Bregman divergences suitable to replace Euclidean distances in first-order optimization methods. The best-known example is mirror gradient descent [Beck and Teboulle, 2003], which is an explicit descent step of the form (9.31). Bregman divergences are also important in convex optimization, and can be used for instance to derive Sinkhorn iterations and study their convergence; see Remark 4.7. While KL is both a ϕ-divergence and a Bregman divergence, these are radically different concepts. In particular, to generalize formula (8.5) to non-discrete measures on some space X, one
needs to fix a reference measure ξ and consider only measures α, β with square-integrable densities dα/dξ, dβ/dξ ∈ L²(dξ). For some strictly convex ψ : L²(dξ) → R, one then defines the Bregman divergence as

B_{ψ,ξ}(α|β) def.= ψ(dα/dξ) − ψ(dβ/dξ) − ∫_X η(x) ( dα/dξ(x) − dβ/dξ(x) ) dξ(x)

where η = ∇ψ(dβ/dξ) ∈ L²(dξ) is the gradient [Rao and Nayak, 1985, Jones and Byrne, 1990].
Jones and Byrne, 1990].

Remark 8.2 (Hyperbolic geometry of KL). It is interesting to contrast the geometry of the Kullback–Leibler divergence to that defined by quadratic optimal transport when comparing Gaussians. As detailed for instance in Costa et al. [2015], the Kullback–Leibler divergence has a closed form for Gaussian densities. In the univariate case, d = 1, if α = N(m_α, σ_α²) and β = N(m_β, σ_β²), one has

KL(α|β) = (1/2) ( σ_α²/σ_β² + log(σ_β²/σ_α²) + |m_α − m_β|²/σ_β² − 1 ).   (8.6)

This expression shows that the divergence between α and β diverges to infinity as σ_β diminishes to 0 and β becomes a Dirac mass. In that sense, one can say that singular Gaussians are infinitely far from all other Gaussians in the KL geometry. That geometry is thus useful when one wants to avoid dealing with singular covariances. To simplify the analysis, one can look at the infinitesimal geometry of KL, which is obtained by performing a Taylor expansion at order 2,

KL( N(m + δ_m, (σ + δ_σ)²) | N(m, σ²) ) = (1/σ²) ( (1/2) δ_m² + δ_σ² ) + o(δ_m², δ_σ²).

This local Riemannian metric, the so-called Fisher metric, expressed over (m/√2, σ) ∈ R × R_{+,*}, matches exactly that of the hyperbolic Poincaré half-plane. Geodesics over this space are half circles centered along the σ = 0 line, and are traversed at an exponential speed (i.e. they only reach the limit σ = 0 after an infinite time). Note in particular that if σ_α = σ_β but m_α ≠ m_β, then the Gaussian-constrained geodesic between (α, β) over this hyperbolic half-plane does not have a constant standard deviation.


Figure 8.3: Comparisons of interpolation between Gaussians using KL (hyperbolic)


and OT (Euclidean) geometries.

The KL hyperbolic geometry over the space of Gaussian parameters


(m, σ) should be contrasted with the Euclidean geometry associated to
OT as described in Remark 2.29, since in the univariate case

W_2²(α, β) = |m_α − m_β|² + |σ_α − σ_β|².   (8.7)

Figure 8.3 shows a visual comparison of these two geometries and their
respective geodesics. This interesting comparison was suggested to us
by Jean Feydy.
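Both closed forms are easy to evaluate; the sketch below simply implements (8.6) and (8.7) for univariate Gaussians (the function names are illustrative).

```python
import numpy as np

def kl_gauss_1d(m_a, s_a, m_b, s_b):
    """Closed-form KL divergence (8.6) between univariate Gaussians (sketch)."""
    return 0.5 * (s_a**2 / s_b**2 + np.log(s_b**2 / s_a**2)
                  + (m_a - m_b)**2 / s_b**2 - 1)

def w2_sq_gauss_1d(m_a, s_a, m_b, s_b):
    """Closed-form squared W2 distance (8.7) between univariate Gaussians."""
    return (m_a - m_b)**2 + (s_a - s_b)**2
```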

Example 8.2 (Total variation). The total variation distance TV def.= D_{ϕ_TV} is the divergence associated to

ϕ_TV(s) = { |s − 1| for s ≥ 0;  +∞ otherwise. }   (8.8)

It actually defines a norm on the full space of measures M(X),

TV(α|β) = ‖α − β‖_TV   where   ‖α‖_TV = |α|(X) = ∫_X d|α|(x).   (8.9)

If α has a density ρ_α on X = R^d, then the TV norm is the L¹ norm on functions, ‖α‖_TV = ∫_X |ρ_α(x)| dx = ‖ρ_α‖_{L¹}. If α is discrete as in (8.2), then the TV norm is the ℓ¹ norm of vectors in R^n, ‖α‖_TV = Σ_i |a_i| = ‖a‖_{ℓ¹}.
Remark 8.1 (Strong vs. weak topology). The total variation norm (8.9) defines the so-called “strong” topology on the space of measures. On a compact domain X of radius R, one has

W_1(α, β) ≤ 2R ‖α − β‖_TV,

so that this strong notion of convergence implies the weak convergence metrized by Wasserstein distances. The converse is however not true, since δ_x does not converge strongly to δ_y as x → y (note that ‖δ_x − δ_y‖_TV = 2 if x ≠ y). A chief advantage is that the weak topology (once again on a compact ground space X) is compact: from any sequence of measures (α_k)_k, one can always extract a converging subsequence, which makes it a suitable space for several optimization problems, such as those considered in Chapter 9.
Example 8.3 (Hellinger). The Hellinger distance h def.= D_{ϕ_H}^{1/2} is the square root of the divergence associated to

ϕ_H(s) = { |√s − 1|² for s ≥ 0;  +∞ otherwise. }

As its name suggests, h is a distance on M_+(X), which metrizes the strong topology as ‖·‖_TV does. If (α, β) have densities (ρ_α, ρ_β) on X = R^d, then h(α, β) = ‖√ρ_α − √ρ_β‖_{L²}. If (α, β) are discrete as in (8.2), then h(α, β) = ‖√a − √b‖. Considering ϕ_{L^p}(s) = |s^{1/p} − 1|^p generalizes the Hellinger (p = 2) and total variation (p = 1) distances, and D_{ϕ_{L^p}}^{1/p} is a distance which metrizes the strong convergence for 0 < p < +∞.
Example 8.4 (Jensen–Shannon distance). The KL divergence is not symmetric and, while being a Bregman divergence (which are locally quadratic norms), it is not the square of a distance. The Jensen–Shannon distance JS(α, β), defined as

JS(α, β)² def.= (1/2) ( KL(α|ξ) + KL(β|ξ) )   where   ξ = (α + β)/2,

is indeed a distance [Endres and Schindelin, 2003, Österreicher and Vajda, 2003]. In sharp contrast with KL, JS(α, β) is always bounded,

and similarly to the TV norm and the Hellinger distance, it metrizes


the strong convergence.
Example 8.5 (χ²). The χ²-divergence χ² def.= D_{ϕ_{χ²}} is the divergence associated to

ϕ_{χ²}(s) = { |s − 1|² for s ≥ 0;  +∞ otherwise. }

If (α, β) are discrete as in (8.2) and have the same support, then

χ²(α|β) = Σ_i (a_i − b_i)² / b_i.

8.2 Integral Probability Metrics

Formulation (6.3) is a special case of “dual norms”, which are a convenient way to design “weak” norms that can deal with arbitrary measures. For a symmetric convex set B ⊂ C(X) of continuous functions, one defines

‖α‖_B def.= max_f { ∫_X f(x) dα(x) : f ∈ B }.   (8.10)

These dual norms are often called “Integral Probability Metrics” (IPM); see [Sriperumbudur et al., 2012].

Example 8.6 (Total Variation). The total variation norm (Example 8.2)
is a dual norm associated to the whole space of continuous functions

B = {f ∈ C(X ) : kf k∞ ≤ 1} .

The total variation distance is the only non-trivial divergence that is


also a dual norm, see [Sriperumbudur et al., 2009].

By using smaller “balls” B, one defines weaker dual norms, which


are thus able to metrize the weak convergence (Definition 2.2). Fig-
ure 8.4 displays a comparison of several such dual norms, which we
now detail.

Figure 8.4: Comparison of dual norms ‖α − β‖_B as a function of t, for (α, β) = (δ_0, δ_t) (left) and (α, β) = (δ_0, (1/2)(δ_{−t/2} + δ_{t/2})) (right); the displayed norms are the Energy, Gauss, W_1 and Flat norms.

8.2.1 W 1 and Flat Norm

If the set B is bounded, then k·kB is a norm on the whole space M(X )
of measures. This is not the case of W 1 , which is only defined for α
R
such that X dα = 0 (otherwise kαkB = +∞). This can be alleviated
by imposing a bound on the value of the potential f , in order to define
for instance the flat norm.

Example 8.7 (W_1 norm). W_1, as defined in (6.3), is a special case of


dual norm (8.10), using

B = {f : Lip(f ) ≤ 1}

the set of 1-Lipschitz functions.

Example 8.8 (Flat norm and Dudley metric). The flat norm is defined
using
B = {f : k∇f k∞ ≤ 1 and kf k∞ ≤ 1} . (8.11)

It metrizes the weak convergence on the whole space M(X ). For-


mula (6.2) is extended to compute the flat norm by adding the con-
straint |fk | ≤ 1. The flat norm is sometimes called “Kantorovich–
Rubinstein” norm [Hanin, 1992] and has been used as a fidelity term
for inverse problems in imaging [Lellmann et al., 2014]. The flat norm
is similar to the Dudley metric, which uses

B = {f : k∇f k∞ + kf k∞ ≤ 1} .

8.2.2 Dual RKHS Norms and the Maximum Mean Discrepancy


It is also possible to define “Euclidean” norms (built using quadratic
functionals) on measures using the machinery of kernel methods
and more specifically reproducing kernel Hilbert spaces (RKHS,
see [Schölkopf and Smola, 2002] for a survey of their applications in
data sciences), of which we recall first some basic definitions.

Definition 8.3. A symmetric function k (resp. ϕ) defined on a set X × X is said to be positive (resp. negative) definite if for any n ≥ 0, any family x_1, . . . , x_n ∈ X and any vector r ∈ R^n (resp. such that r^T 1_n = 0), the following inequality holds:

Σ_{i,j=1}^n r_i r_j k(x_i, x_j) ≥ 0   ( resp. Σ_{i,j=1}^n r_i r_j ϕ(x_i, x_j) ≤ 0 ).   (8.12)

In this case, one sets

B = {f : ‖f‖_k ≤ 1}   (8.13)

where ‖·‖_k is a Hilbertian norm, defined using a positive definite kernel as

‖f‖²_k def.= ∫_{X×X} k(x, y) f(x) f(y) dρ(x) dρ(y),

where ρ is some reference measure, typically Lebesgue’s measure on X = R^d. The resulting dual RKHS norm is often referred to as the “Maximum Mean Discrepancy” (MMD) (see Gretton et al. [2007]), and can be revisited through the prism of Kernel Mean Embeddings (see Muandet et al. [2017] for a review).
(see Muandet et al. [2017] for a review).
The dual norm (8.10) for such an RKHS ball B can be shown to also be defined using a dual kernel k*, i.e.

‖α‖²_B = ‖α‖²_{k*} def.= ∫_{X×X} k*(x, y) dα(x) dα(y).   (8.14)

This dual kernel k* is in some sense the “inverse” of the primal kernel k. For instance, for translation-invariant (i.e. convolution) kernels k on X = R^d, k(x, y) = k_0(x − y), one has k*(x, y) = k_0^*(x − y) where formally k̂_0^*(ω) = k̂_0(ω)^{−1}, k̂_0 being the Fourier transform of k_0.

Expression (8.14) can be rephrased by introducing two independent random vectors (X, X′) on X distributed following α:

‖α‖²_{k*} = E_{X,X′}( k*(X, X′) ).

In the special case where α is a discrete measure of the form (2.3), one thus has

‖α‖²_{k*} = Σ_{i=1}^n Σ_{i′=1}^n a_i a_{i′} k*_{i,i′} = ⟨k* a, a⟩   where   k*_{i,i′} def.= k*(x_i, x_{i′}).

In particular, when α = Σ_{i=1}^n a_i δ_{x_i} and β = Σ_{i=1}^n b_i δ_{x_i} are supported on the same set of points, ‖α − β‖²_{k*} = ⟨k*(a − b), a − b⟩, so that ‖·‖_{k*} is a Euclidean norm (associated to the positive definite matrix k*) on the simplex Σ_n. To compute the discrepancy between two discrete measures of the form (2.3), one can use

‖α − β‖²_{k*} = Σ_{i,i′} a_i a_{i′} k*(x_i, x_{i′}) + Σ_{j,j′} b_j b_{j′} k*(y_j, y_{j′}) − 2 Σ_{i,j} a_i b_j k*(x_i, y_j).   (8.15)
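In code, (8.15) amounts to three kernel evaluations and three quadratic forms; the sketch below takes the dual kernel as a callable, and the Gaussian and energy-distance kernels given afterwards are illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mmd_sq(x, a, y, b, kernel):
    """Squared MMD (8.15) between two discrete measures (sketch).

    x: (n, d), a: (n,) and y: (m, d), b: (m,) are supports and weights;
    kernel(u, v) returns the matrix (k*(u_i, v_j))_{i,j}.
    """
    return (a @ kernel(x, x) @ a
            + b @ kernel(y, y) @ b
            - 2 * a @ kernel(x, y) @ b)

# example kernels (illustrative choices)
gauss = lambda u, v, sigma=0.5: np.exp(-cdist(u, v) ** 2 / (2 * sigma ** 2))
energy = lambda u, v: -cdist(u, v)   # energy-distance kernel (8.20) with p = 1
```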

Example 8.9 (Gaussian RKHS). One of the most popular kernels is the Gaussian one, in which case both primal and dual norms have closed-form expressions, provided here up to normalization constants:

k(x, y) = e^{−σ²(x−y)²/2}   and   k*(x, y) = e^{−(x−y)²/(2σ²)}.

An attractive feature of the Gaussian kernel is that it is separable as


a product of 1-D kernels, which facilitates computations when working
on regular grids (see also Remark 4.15). However, an important issue
that arises when using the Gaussian kernel is that one needs to select
the bandwidth parameter σ. This bandwidth should match the “typical
scale” between observations in the measures to be compared. If the
measures have multiscale features (some regions may be very dense,
others very sparsely populated), a Gaussian kernel is thus not well
adapted, and one should consider a “scale free” kernel as we detail
next. Another issue with such kernels is that they are global (have slow
polynomial decay) which makes them typically computationally more
expensive (no compact support approximation is possible).

Example 8.10 (H^{−1}(R^d)). Another important dual norm is H^{−1}(R^d), the dual (i.e. over distributions) of the Sobolev space H^1(R^d) of functions having derivatives in L²(R^d). It is defined using the primal norm ‖f‖²_k = ‖∇f‖²_{L²(R^d)}. It is not defined for singular measures (e.g. Diracs) unless d = 1, because functions in the Sobolev space H^1(R^d) are in general not continuous. This H^{−1} norm (defined on the space of zero-mean measures with densities) can also be formulated in divergence form,

‖α − β‖²_{H^{−1}(R^d)} = min_s { ∫_{R^d} ‖s(x)‖²_2 dx : div(s) = α − β },   (8.16)

which should be contrasted with (6.4), where an L¹ norm of the vector field s was used in place of the L² norm used here. The “weighted” version of this Sobolev dual norm,

‖ρ‖²_{H^{−1}(α)} = min_{div(s)=ρ} ∫_{R^d} ‖s(x)‖²_2 dα(x),

can be interpreted as the natural “linearization” of the Wasserstein W_2 norm, in the sense that the Benamou–Brenier dynamic formulation can be interpreted infinitesimally as

W_2(α, α + ερ) = ε ‖ρ‖_{H^{−1}(α)} + o(ε).   (8.17)

The functionals W_2(α, β) and ‖α − β‖_{H^{−1}(α)} can be shown to be equivalent [Peyre, 2011]. The issue is that ‖α − β‖_{H^{−1}(α)} is not a norm (because of the weighting by α), and one cannot in general replace it by ‖α − β‖_{H^{−1}(R^d)} unless (α, β) have densities. In this case, if α and β have densities on the same support bounded from below by a > 0 and from above by b < +∞, then

b^{−1/2} ‖α − β‖_{H^{−1}(R^d)} ≤ W_2(α, β) ≤ a^{−1/2} ‖α − β‖_{H^{−1}(R^d)};   (8.18)

see [Santambrogio, 2015, Theorem 5.34], and also [Peyre, 2011] for sharp constants.

Example 8.11 (Negative Sobolev Spaces). One can generalize this construction by considering Sobolev spaces H^{−r}(R^d) of arbitrary negative index, which are the duals of the functional Sobolev spaces H^r(R^d) of functions having r derivatives (in the sense of distributions) in L²(R^d). In order to metrize the weak convergence, one needs functions in H^r(R^d) to be continuous, which is the case when r > d/2. As the dimension d increases, one thus needs to consider higher regularity. For arbitrary r (not necessarily an integer), these spaces are defined using the Fourier transform: for a measure α with Fourier transform α̂(ω) (written here as a density with respect to the Lebesgue measure dω),

‖α‖²_{H^{−r}(R^d)} def.= ∫_{R^d} ‖ω‖^{−2r} |α̂(ω)|² dω.

This corresponds to a dual RKHS norm with a convolutive kernel k*(x, y) = k_0^*(x − y) with k̂_0^*(ω) = ±‖ω‖^{−2r}. Taking the inverse Fourier transform, one sees that (up to a constant) one has

∀ x ∈ R^d,   k_0^*(x) = { 1/‖x‖^{d−2r} if d/2 < r < d;  −‖x‖^{2r−d} if r > d. }   (8.19)
Example 8.12 (Energy distance). The energy distance (or Cramér distance when d = 1) [Székely and Rizzo, 2004] associated to a distance d is defined as

‖α − β‖²_{ED(X, d^p)} def.= ‖α − β‖²_{k*_ED}   where   k*_ED(x, y) = −d(x, y)^p   (8.20)

for 0 < p < 2. It is a valid MMD norm over measures if d is negative definite (see Definition 8.3), a typical example being the Euclidean distance d(x, y) = ‖x − y‖. For X = R^d and d(x, y) = ‖x − y‖, using (8.19), one sees that the energy distance is a Sobolev norm,

‖·‖_{ED(R^d, ‖·‖^p)} = ‖·‖_{H^{−(d+p)/2}(R^d)}.

A chief advantage of the energy distance over more usual kernels such as the Gaussian (Example 8.9) is that it is scale-free and does not depend on a bandwidth parameter σ. More precisely, one has the following scaling behaviour on X = R^d, denoting f_s(x) = sx the dilation by a factor s > 0:

‖f_{s♯}(α − β)‖_{ED(R^d, ‖·‖^p)} = s^{p/2} ‖α − β‖_{ED(R^d, ‖·‖^p)},

while the Wasserstein distance exhibits a perfect linear scaling,

W_p(f_{s♯}α, f_{s♯}β) = s W_p(α, β).

(α, β) ED(R2 , k·k) (G, .005) (G, .02) (G, .05)

Figure 8.5: Top row: display of ψ such that ‖α − β‖_k = ‖ψ ⋆ (α − β)‖_{L²(R²)}, formally defined over the Fourier domain as ψ̂(ω) = (k̂_0^*(ω))^{1/2} where k*(x, x′) = k_0^*(x − x′). Bottom row: display of ψ ⋆ (α − β). (G, σ) stands for the Gaussian kernel of variance σ². The kernel for ED(R², ‖·‖) is ψ(x) = 1/‖x‖.

Note however that for the energy distance, the parameter p must satisfy 0 < p < 2, and that for p = 2 it degenerates to the distance between the means,

‖α − β‖_{ED(R^d, ‖·‖²)} = ‖ ∫_{R^d} x (dα(x) − dβ(x)) ‖,

so it is no longer a norm. This shows that it is not possible to get the same linear scaling under f_{s♯} with the energy distance as for the Wasserstein distance.

8.3 Wasserstein Spaces are not Hilbertian

Some of the special cases of the Wasserstein geometry outlined earlier


in §2.6 have highlighted the fact that the optimal transport distance
can be sometimes computed in closed form. They also illustrate that in
such cases the optimal transport distance is a Hilbertian metric between
probability measures, in the sense that there exists a map φ from the
space of input measures onto a Hilbert space, as defined below.

Definition 8.4. A distance d defined on a set Z × Z is said to be


Hilbertian if there exists a Hilbert space H and a mapping φ : Z → H
such that for any pair z, z 0 in Z we have that d(z, z 0 ) = kφ(z)−φ(z 0 )kH .

For instance, Remark 2.28 shows that the Wasserstein metric is a


Hilbert norm between univariate distributions, simply by defining φ
to be the map that associates to a measure its generalized quantile
function. Remark 2.29 shows that for univariate Gaussians, as written
in (8.7) in this chapter, the Wasserstein distance between two univariate
Gaussians is simply the Euclidean distance between their mean and
standard deviation.
Hilbertian distances have many favorable properties when used in a data analysis context [Dattorro, 2017]. First, they can be easily cast as radial basis function (RBF) kernels: for any Hilbertian distance d, it is indeed known that e^{−d^p/t} is a positive definite kernel for any value 0 ≤ p ≤ 2 and any positive scalar t, as shown in [Berg et al., 1984, Corr. §3.3.3, Prop. §3.2.7]. The Gaussian (p = 2) and Laplace (p = 1) kernels are simple applications of that result using the usual Euclidean distance. The entire field of kernel methods [Hofmann et al., 2008] builds upon the positive definiteness of a kernel function to define convex learning algorithms operating on positive definite kernel matrices. Points living in a Hilbertian space can also be efficiently embedded in lower dimensions with low distortion factors ([Johnson and Lindenstrauss, 1984], [Barvinok, 2002, §V.6.2]) using simple methods such as multidimensional scaling [Borg and Groenen, 2005].
Because Hilbertian distances have such properties, one might hope
that the Wasserstein distance remains Hilbertian in more general set-
tings than those outlined above, notably when the dimension of X is
2 and more. This can be disproved using the following fundamental
result:

Proposition 8.1. A distance d is Hilbertian if and only if d2 is negative


definite.

Proof. If a distance is Hilbertian, then d² is trivially negative definite. Indeed, given n points in Z, the sum Σ r_i r_j d²(z_i, z_j) can be rewritten as Σ r_i r_j ‖φ(z_i) − φ(z_j)‖²_H, which can be expanded, taking advantage of the fact that Σ r_i = 0, into −2 Σ r_i r_j ⟨φ(z_i), φ(z_j)⟩_H, which is negative by definition of a Hilbert dot product. If, on the contrary, d² is negative definite, then the fact that d is Hilbertian proceeds from a key result by Schoenberg [1938] outlined in [Berg et al., 1984, p. 82, Prop. 3.2].

It is therefore sufficient to show that the squared Wasserstein dis-


tance is not negative definite to show that it is not Hilbertian, as stated
in the following proposition:
Proposition 8.2. If X = Rd with d ≥ 2 and the ground cost is set to
d(x, y) = kx − yk2 , then the p-Wasserstein distance is not Hilbertian
for p = 1, 2.
Proof. It suffices to prove the result for d = 2, since any counter-example in that dimension suffices to obtain a counter-example in any higher dimension. We provide a non-random counterexample which works using measures supported on 4 vectors x_1, x_2, x_3, x_4 ∈ R² defined as follows: x_1 = [0, 0], x_2 = [1, 0], x_3 = [0, 1], x_4 = [1, 1]. We now consider all points on the regular grid on the simplex of 4 dimensions, with increments of 1/4. There are (4+4−1 choose 4) = 35 such points in the simplex. Each probability vector a^i on that grid is such that for j ≤ 4, a^i_j is in the set {0, 1/4, 1/2, 3/4, 1}, and Σ_{j=1}^4 a^i_j = 1. For a given p, the 35 × 35 pairwise Wasserstein distance matrix D_p between these histograms can be computed. The elementwise square D_p^{(2)} fails to be negative definite if and only if J D_p^{(2)} J has positive eigenvalues, where J is the centering matrix J = I_n − (1/n) 1_{n,n}; this is indeed the case, as illustrated in Figure 8.6.
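The eigenvalue test used in this proof is a one-liner to reproduce numerically; the sketch below checks whether a matrix of pairwise distances has a negative definite elementwise square, in the sense of Definition 8.3 (the function name and tolerance are arbitrary).

```python
import numpy as np

def squared_distances_negative_definite(D, tol=1e-10):
    """Test negative definiteness of the elementwise square of a distance matrix (sketch).

    D: (n, n) matrix of pairwise distances. Following the proof above, D**2 is
    negative definite iff J @ (D**2) @ J has no positive eigenvalue, with J the
    centering matrix.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    eigs = np.linalg.eigvalsh(J @ (D ** 2) @ J)
    return eigs.max() <= tol
```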

8.3.1 Embeddings and Distortion


An important body of work quantifies the hardness of approximating
Wasserstein distances using Hilbertian embeddings. It has been shown
that embedding measures `2 spaces incurs necessarily an important
distortion Naor and Schechtman [2007], Andoni et al. [2017] as soon as
X = Rd with d ≥ 2.
It is possible to embed quasi-isometrically p-Wasserstein spaces for
0 < p ≤ 1 in `1 , see [Indyk and Thaper, 2003, Andoni et al., 2008, Do Ba
8.3. Wasserstein Spaces are not Hilbertian 141

1.6

Centered Distance Matrix


1.4

1.2

1
Max. Eig. of

0.8

0.6

0.4

0.2
1 1.5 2 2.5 3 3.5 4
p parameter to define p-Wasserstein

Figure 8.6: Maximal eigenvalue of JD2p J for varying values of p.

et al., 2011], but the equivalence constant between the distances growth
fast with the dimension d. Note also that also for p = 1 the embedding
is only true for discrete measures (i.e. the embedding constant depends
on the minimum distance between the spikes). A closely related em-
bedding technique consists in using the characterization of W 1 as dual
of Lipschitz functions f (see §6.2) and approximate the Lipschitz con-
straint k∇f k1 ≤ 1 by a weighted `1 ball over the Wavelets coefficients,
see [Shirdhonkar and Jacobs, 2008]. This weighted `1 ball of wavelet
coefficients defines a so-called Besov space of negative index [Leeb and
Coifman, 2016]. These embedding results are also similar to the bound
on the Wasserstein distance obtained using dyadic partitions, see [Weed
and Bach, 2017, Prop. 1] and also Fournier and Guillin [2015]. This also
provides a quasi-isometric embedding in `1 (this embedding being given
by rescaled wavelet coefficients), and comes with the advantage that
this embedding can be computed approximately in linear time when
the input measures are discretized on uniform grids. We refer to Mallat
[2008] for more details on Wavelets. Note that the idea of using multi-
scale embedding to compute Wasserstein-like distances has been used
extensively in computer vision, see for instance Ling and Okada [2006],
Grauman and Darrell [2005], Cuturi and Fukumizu [2007], Lazebnik
et al. [2006].
142 Statistical Divergences

8.3.2 Negative/Positive Definite Variants of Optimal Transport


§10.4 shows that the sliced approximation to Wasserstein distances, es-
sentially a sum of 1-D directional transportation distance computed on
random push-forwards of measures projected on lines, is negative defi-
nite as the sum of negative definite functions [Berg et al., 1984, §3.1.11].
This result can be used to define a positive definite kernel [Kolouri et al.,
2016].

8.4 Empirical Estimators for OT, MMD and ϕ-divergences

In an applied setting, given two input measures (α, β) ∈ M1+ (X )2 , an


important statistical problem is to approximate the (usually unknown)
divergence D(α, β) using only samples (xi )ni=1 from α and (yj )m
j=1 from
β. These samples are assumed to be independently identically dis-
tributed from their respective distributions.

8.4.1 Empirical Estimators for OT and MMD


For the Wasserstein distance W p (see (2.17)) and MMD norms (see 8.2),
a straightforward estimator is the distance itself between the empirical
measures
(
α̂n = n1 i δxi
def. P
D(α, β) ≈ D(α̂n , β̂m ) where def. 1 P
β̂m = m j δ yj .

Note that here both α̂ and β̂ are random measures, so D(α̂, β̂) is a
random number. For simplicity, we assume that X is compact (handling
unbounded domain requires extra constraint on the moments of the
input measures).
For such a dual distance that metrizes the weak convergence (see
Definition 2.2), since there is the weak convergence α̂n → α, one has
D(α̂n , β̂n ) → D(α, β) as n → +∞. But an important question is the
speed of convergence of D(α̂n , β̂n ) toward D(α, β), and this rate is often
called the “sample complexity” of D.
Note that for D(α, β) = k·kTV , since the TV norm does not metrize
the weak convergence, kα̂n − β̂n kTV is not a consistent estimator,
it does not converges toward kα − βkTV , indeed, with probability 1,
8.4. Empirical Estimators for OT, MMD and ϕ-divergences 143

kα̂n − β̂n kTV = 2 since the support of the discrete measures do not
overlap. Similar issues arise with other ϕ-divergences, which cannot be
estimated using divergences between empirical distributions.

Rates for OT. For X = Rd and measure supported on bounded do-


main, it is shown by [Dudley, 1969] that for d > 2, and 1 ≤ p < +∞,
1
E(| W p (α̂n , β̂n ) − W p (α, β)|) = O(n− d )

where the expectation E is taken with respect to the random samples


(xi , yi )i . This rate is tight in Rd if one of the two measure has a den-
sity with respect to the Lebesgue measure. This result was proved for
general metric spaces [Dudley, 1969] using the notion of covering num-
bers, and later refined, in particular for X = Rd in [Dereich et al., 2013,
Fournier and Guillin, 2015]
Rate for Wasserstein [Weed and Bach, 2017] with measure sup-
ported on low-dimensional sub-domains: the rate depends on the intrin-
sic dimensionality of the support. Also studies non-asymptotic behav-
ior, such as for measure which are approximated discrete (e.g. mixture
of Gaussians with small variances). It is also possible to prove con-
centration of W p (α̂n , β̂n ) around its mean W p (α, β), see [Bolley et al.,
2007, Boissard, 2011, Weed and Bach, 2017].

Rates for MMD. For weak norms k·k2B = k·k2k∗ which are dual of
RKHS norms (also called MMD), as defined in (8.13), and on contrary
to Wasserstein distances, the sample complexity does not depends on
the ambient dimension
1
E(|kα̂n − β̂n kk∗ − kα − βkk∗ |) = O(n− 2 ),

see [Sriperumbudur et al., 2012]. Note however that kα̂n − β̂n k2k∗ is a
slightly biased estimate of kα − βk2k∗ . In order to define an unbiased
estimator, and thus to be able to use for instance stochastic gradient
descent when minimizing such losses, one should rather introduce use
144 Statistical Divergences

Energy distance k·kH −1 W2


0
Figure 8.7: Decay of log10 (D(α̂n , α̂n )) as a function of log10 (n) for D being the
energy distance D = k·kH −1 (i.e. the H −1 norm) as defined in Example 8.12 (left)
0
and the Wasserstein distance D = W 2 (right). Here (α̂n , α̂n ) are two independent
empirical distributions of α, the uniform distribution on the unit cube [0, 1]d , tested
for several value of d ∈ {2, 3, 5}. The shaded bar displays the confidence interval at
± the standard deviation of log(D(α̂n , α)).

the unbiased estimator


1 1
k ∗ (xi , xi0 ) + k ∗ (yj , yj 0 )
X X
MMDk∗ (α̂n , β̂n )2 =
def.

n(n − 1) i,i0 n(n − 1) j,j 0


1 X ∗
−2 k (xi , yj ),
n2 i,j

which should be compared to (8.15). It satisfies E(MMDk∗ (α̂n , β̂n )2 ) =


kα − βk2k∗ , see [Gretton et al., 2012].

8.4.2 Empirical Estimators for ϕ-divergences


Is it not possible to approximate Dϕ (α|β) (as defined in (8.2)) from
discrete samples using Dϕ (α̂n |β̂n ). Indeed, this quantity is either +∞
(for instance for the KL divergence) or is not converging to Dϕ (α|β) as
n → +∞ (for instance for the TV norm). Instead, it is required to use a
density estimator to somehow smooth the discrete empirical measures
and replace them by densities, see [Silverman, 1986]. On an Euclidean
space X = Rd , introducing hσ = h(·/σ) with a smooth windowing
function and a bandwidth σ > 0, a density estimator for α is defined
using a convolution against this kernel,
1X
α̂n ? hσ = hσ (· − xi ). (8.21)
n i
8.4. Empirical Estimators for OT, MMD and ϕ-divergences 145

σ = 2.5 · 10−3 σ = 15 · 10−3 σ = 25 · 10−3

k=1 k = 50 k = 100

Figure 8.8: Comparison of kernel density estimation α̂n ?hσ (top, using a Gaussian
kernel h) and k-nearest neighbors estimation ρkα̂n (bottom) for n = 200 samples from
a mixture of two Gaussians.

One can then approximate the ϕ divergence using


n
!
hσ (yj − xi )
P
1X
Dϕσ (α̂n |β̂n ) ϕ Pi
def.
=
n j=1 j 0 hσ (yj − yj 0 )

where σ should be adapted to the number n of samples and to the


dimension d. It is also possible to devise non-parametric estimators
which avoids the need to choose a fixed bandwidth σ and instead selects
a number k of nearest neighbor. These methods typically make use of
the distance between nearest neighbors [Loftsgaarden and Quesenberry,
1965], which is similar to locally adapting the bandwidth σ to the local
sampling density. Denoting ∆k (x) the distance between x ∈ Rd and its
k th nearest neighbors among the (xi )ni=1 , a density estimator is defined
as
k/n
ρkα̂n (x) =
def.
(8.22)
|Bd |∆k (x)r

where |Bd | is the volume of the unit ball in Rd . Instead of somehow


“counting” the number of sample falling in an area of width σ in (8.21),
this formula (8.22) estimate the radius required to encapsulate k sam-
ples. Figure 8.8 compares the estimators (8.21) and (8.22). A typical
example of application is detailed in (9.32) for the entropy functional,
which is the KL divergence with respect to the Lebesgue measure. We
refer to [Moon and Hero, 2014] for more details.
146 Statistical Divergences

8.5 Entropic Regularization: between OT and MMD

Following Proposition 4.7, we recall that a Sinkhorn divergence is de-


fined as
f? g?
PεC (a, b) = hP? , Ci = he ε , (K C)e ε i,
def.

where P? is the solution of (4.2) while (f? , g? ) are solutions of (4.29).


Assuming Ci,j = d(xi , xj )p for some distance d on X , for two discrete
probability distribution of the form (2.3), this defines a regularized
Wasserstein cost
W p,ε (α, β)p = PεC (a, b).
def.

This definition is generalized to any input distribution (not necessarily


discrete) as Z
p def.
W p,ε (α, β) = d(x, y)p dπ ? (x, y)
X ×X
where π? is the solution of (4.9).
In order to cancel the bias introduced by the regularization (in
particular, W p,ε (α, α) 6= 0), we introduce a corrected regularized diver-
gence

W̃ p,ε (α, β)p = 2 W p,ε (α, β)p − W p,ε (α, α)p − W p,ε (β, β)p .
def.

The following proposition, whose proof can be found in Ramdas et al.


[2017], shows that this regularized divergence interpolates between the
Wasserstein distance and the energy distance defined in Example 8.12.

Proposition 8.3. One has


ε→0 ε→+∞
W̃ p,ε (α, β) −→ W p (α, β) and W̃ p,ε (α, β)p −→ kα − βk2ED(X ,d) ,

where k·kED(X ,d) is defined in (8.20).

Note that it is possible to define other families of divergence on top


of the Sinkhorn’s methods, see in particular Amari et al. [2017] for an
alternative technique.
8.5. Entropic Regularization: between OT and MMD 147

d=2 d=5
0
Figure 8.9: Decay of E(log10 (W̃ p,ε (α̂n , α̂n ))), for p = 3/2 for various ε, as a
function of log10 (n) where α is the same as in Figure 8.7.
9
Variational Wasserstein Problems

In data analysis, common divergences between probability measures


(e.g. Euclidean, total variation, Hellinger, Kullback-Leibler) are often
used to measure an error or a loss in parameter estimation problems.
Up to this chapter, we have made the case that the optimal transport
geometry has a unique ability, not shared with other information di-
vergences, to leverage physical ideas (mass displacement) and geometry
(a cost between observations or bins) to compare measures. These two
facts combined make it thus very tempting to use the Wasserstein dis-
tance as a loss function. This idea was recently explored for various ap-
plied problems; The main technical challenge in these approaches lies in
approximating and differentiating efficiently the Wasserstein distance.
In image processing, using the Wasserstein distance as a loss was
used to synthetize textures [Tartavel et al., 2016], where the Wasser-
stein loss was used to account for the discrepancy between statistics
of the synthesized and the input exemplar. It was also used for im-
age segmentation to account for statistical homogeneity of image re-
gions [Swoboda and Schnörr, 2013, Rabin and Papadakis, 2015, Peyré
et al., 2012, Ni et al., 2009, Schmitzer and Schnörr, 2013b]. The Wasser-
stein distance is also a very natural fidelity term for inverse problems

148
9.1. Differentiating the Wasserstein Loss 149

when the measurements are probability measures, for instance image


restoration [Lellmann et al., 2014], tomographic inversion [Abraham
et al., 2017], density regularization [Burger et al., 2012], particle image
velocimetry [Saumier et al., 2015], sparse recovery and compressed sens-
ing [Indyk and Price, 2011] and seismic inversion [Métivier et al., 2016].
Distances between measures (mostly kernel-based as exposed in §8.2.2)
are routinely used for shape matching (represented as measures over
a lifted space, often called currents) in computational anatomy [Vail-
lant and Glaunès, 2005], but OT distances offer an interesting alterna-
tive [Feydy et al., 2017]. To reduce the dimensionality of a dataset of
histograms, Lee and Seung have shown that the nonnegative matrix fac-
torization problem can be cast using the Kullback-Leibler divergence to
quantify a reconstruction loss [1999]. When prior information is avail-
able, the Wasserstein distance can be used instead, with markedly dif-
ferent results [Sandler and Lindenbaum, 2011, Zen et al., 2014, Rolet
et al., 2016].
Optimization problems that involve Wasserstein distances typically
require that we have access to their gradients of approximations thereof.
We start this section by presenting methods to approximate such gra-
dients, and follow with three important applications that can be cast
as variational Wasserstein problems.

9.1 Differentiating the Wasserstein Loss

In statistics, text processing or imaging, one must usually compare


a probability distribution β arising from measurements to a model,
namely a parameterized family of distributions {αθ , θ ∈ Θ} where Θ
is a subset of an Euclidean space. Such a comparison is done through
a “loss” or a “fidelity” term, which, in this section, is the Wasserstein
distance. In the simplest scenario, the computation of a suitable pa-
rameter θ is obtained by minimizing directly
def.
min E(θ) = Lc (αθ , β). (9.1)
θ∈Θ
Of course, one can consider more complicated problems: for instance,
the barycenter problem described in §9.2 consists in a sum of such
terms. However, most of these more advanced problems can be usually
150 Variational Wasserstein Problems

solved by adapting tools defined for basic case: either using the chain
rule to compute explicitly derivatives, or using automatic differentiation
as advocated in §9.1.3.

Convexity The Wasserstein distance between two histograms or two


densities is convex with respect to these inputs, as shown by (2.19)
and (2.23) respectively. Therefore, when the parameter θ is itself a
histogram, namely Θ = Σn and αθ = θ, or more generally when θ de-
scribes K weights in the simplex, Θ = ΣK , and αθ = K
P
i=1 θi αi is a
convex combination of known atoms α1 , . . . , αK in ΣN , Problem (9.1)
remains convex (the first case corresponds to the barycenter problem,
the second to one iteration of the dictionary learning problem with a
Wasserstein loss Rolet et al. [2016]). However, for more general param-
eterizations θ 7→ αθ , Problem (9.1) is in general not convex.

Simple cases For those simple cases where the Wasserstein distance
has a closed form, such as univariate (see §2.28) or elliptically con-
toured (see §2.29) distributions, simple workarounds exist. They con-
sist mostly in casting the Wasserstein distance as a simpler distance
between suitable representations of these distributions (Euclidean on
quantile functions for univariate measures, Bures metric for covariance
matrices for elliptically contoured distributions of the same family) and
solving Problem (9.1) directly on such representations.
In most cases however, one has to resort to a careful discretization
of αθ to compute a local minimizer for Problem (9.1). Two approaches
can be envisioned: Eulerian or Lagrangian. Figure 9.1 illustrates the
difference between these two fundamental discretization schemes. At
the risk of oversimplifying this argument, one may say that a Eulerian
discretization is the most suitable when measures are supported on a
low-dimensional space (as when dealing with shapes or color spaces),
or for intrinsically discrete problems (such as those arising from string
or text analysis). When applied to fitting problems in statistics and
machine learning, or when dealing with high-dimensional data, a La-
grangian perspective is usually the only tractable alternative.
9.1. Differentiating the Wasserstein Loss 151

Figure 9.1: Increasing fine discretization of a continuous P distribution having a


1
density (violet, left) using
P a Lagrangian representation n
δ (blue, top) and an
i xi
Eulerian representation i
a i δx i with xi representing cells on a grid of increasing
size (red, bottom). The Eulerian perspective starts from a pixelated image down to
one with such fine resolution that it almost matches the original density. Weights
ai are directly proportional to each pixel-cell’s intensity.

9.1.1 Eulerian Discretization


A fist way to discretize the problem is to suppose that both distribu-
tions β = m
P Pn
j=1 bj δyj and αθ = i=1 a(θ)i δxi are discrete distributions
defined on fixed locations (xi )i and (yj )j . Such locations might stand
for cells dividing the entire space of observations in a grid, or a finite
subset of points of interest in a continuous space (such as a family of
vector embeddings for all words in a given dictionary [Kusner et al.,
2015, Rolet et al., 2016]). The parameterized measure αθ is in that
case entirely represented through the weight vector a : θ 7→ a(θ) ∈ Σn ,
which, in practice, might be very sparse if the grid is large. This setting
corresponds to the so-called class of Eulerian discretization methods.
(9.1) is not differentiable In order to obtain a smooth minimization
problem, we use the entropic regularized OT and approximate (9.1)
using

min EE (θ) = LεC (a(θ), b)


def. def.
where Ci,j = c(xi , yj ).
θ∈Θ

We recall here Proposition 4.6 which shows that the entropic loss func-
tion is differentiable and convex with respect to the input histograms,
and give the expression for its gradient.
152 Variational Wasserstein Problems

Proposition 9.1 (Derivative with respect to histograms). For ε > 0,


(a, b) 7→ LεC (a, b) is convex and differentiable. Its gradient reads
∇LεC (a, b) = (f, g) (9.2)
P P
where (f, g) is the unique solution to (4.29) such that i fi = j gj = 0.
For ε = 0, this formula defines the elements of the sub-differential, and
the function is differentiable if they are unique.
Note that the zero mean condition on (f, g) is important when using
a gradient descent to guarantee conservation of mass.
Using the chain rule, one thus obtains that EE is smooth, and that
its gradient is
∇EE (θ) = [∂a(θ)]> (f ) (9.3)
where ∂a(θ) ∈ Rn×dim(Θ) is the Jacobian (differential) of the map a(θ),
and where f ∈ Rn is the dual potential vector associated to the dual
entropic OT (4.29) between a(θ) and b for the cost matrix C (which
is fixed and independent on θ). This result enables a simple gradient
descent approach to minimize locally EE .

9.1.2 Lagrangian Discretization


A different approach consists in using instead fixed (typically con-
stant) weights and discretize the measure as empirical measures αθ =
1 P
n i δx(θ)i for a point-cloud parameterization function x : θ 7→ x(θ) =
(x(θ)i )ni=1 ∈ X n , where we assume here that X is Euclidean. Prob-
lem (9.1) is thus approximated as
min EL (θ) = LεC(x(θ)) (1n /n, b)
def. def.
where C(x)i,j = c(xi , yj ). (9.4)
θ

Note that here the cost matrix C(x(θ)) now depends on θ since the
support of αθ is intended to move. The following proposition shows
that the entropic OT loss is a smooth function of the cost matrix, and
gives the expression of its gradient.
Proposition 9.2 (Derivative with respect to spikes positions). For fixed
def.
input histograms (a, b), for ε > 0, the mapping C 7→ R(C) = LεC (a, b)
is convex and smooth, and
∇R(C) = P, (9.5)
9.1. Differentiating the Wasserstein Loss 153

where P is the unique optimal solution of (4.2). For ε = 0, this formula


defines the set of sub-gradients.

Assuming (X , Y) are (possibly subsets of) Rd , for discrete measures


(α, β) of the form (2.3), one obtains using the chain rule that x =
def.
(xi )ni=1 ∈ X n 7→ F(x) = LC(x) (1n /n, b) is smooth, and that
 n
m
X
∇F(x) =  Pi,j ∇1 c(xi , yj ) ∈ Xn (9.6)
j=1 i=1

where ∇1 c is the gradient with respect to the first variable. For instance,
for X = Y = Rd , for c(s, t) = ks − tk2 on X = Y = Rd , one has
 n
m
X
∇F(x) = 2 ai xi − Pi,j yj  , (9.7)
j=1 i=1

where ai = 1/n here. Note that, up to a constant, this gradient is Id−T


where T is the barycentric projection defined in (4.18). Using the chain
rule, one thus obtains that the Lagrangian discretized problem (9.4) is
smooth and its gradient is

∇EL (θ) = [∂x(θ)]> (∇F(x(θ))) (9.8)

where ∂x(θ) ∈ Rdim(Θ)×(nd) is the Jacobian of the map x(θ), and where
∇F is implemented as in (9.6) or (9.7) using for P the optimal coupling
matrix between αθ and β. One can thus implement a gradient descent
to compute a local minimizer of EL .

9.1.3 Automatic Differentiation


The difficulty when applying formulas (9.3) and (9.8) is that one needs
to compute the optimal solutions f or P for them to be valid, which
can only be achieved with acceptable precision using a very large num-
ber of Sinkhorn iterates. In challenging situations in which the size
and the quantity of histograms to be compared is large, the compu-
tational budget to compute a single Wasserstein distance is usually
limited, therefore allowing only for a few Sinkhorn iterations. In that
case, and rather than approximating the gradient (4.29) using the value
154 Variational Wasserstein Problems

obtained at a given iterate, it is usually better to differentiate directly


the output of Sinkhorn’s algorithm, using reverse mode automatic dif-
ferentiation. This corresponds to using the “algorithmic” Sinkhorn di-
vergences as introduced in (4.44), rather than the quantity LεC in (4.2)
which incorporates the entropy of the regularized optimal transport.
The cost for computing the gradient of functionals involving Sinkhorn
divergences is the same as that of computation the functional itself,
see for instance [Bonneel et al., 2016, Genevay et al., 2017a] for some
applications of this approach. We also refer to Adams and Zemel [2011]
for any early work on differentiating Sinkhorn iterations with respect
to the cost matrix (as it is done in the Lagrangian framework), with
applications to machine learning over ranking and permutations. We
refer to [Griewank and Walther, 2008, Rall, 1981, Neidinger, 2010] for
more details on automatic differentiation, and in particular the “re-
verse mode” which is the fastest way to compute gradients, the efficient
computation of Hessian being much more involved. In terms of imple-
mentation, all recent “deep-learning” Python frameworks feature state
of the art reverse-mode differentiation and support for GPU computa-
tions [Al-Rfou et al., 2016, Abadi et al., 2016, pyt, 2017], they should be
adopted for any large scale application of Sinkhorn losses. We strongly
encourage the use of such automatic differentiation techniques, since
they are always at least as fast of the formula (9.3) and (9.8), these
formula being mostly useful to obtain a theoretical understanding of
what automatic differentation is computing. The only downside is that
reverse mode automatic differentation is memory intensive (the mem-
ory grows proportionally with the number of iteration). There exists
however subsampling strategies that mitigates this problem [Griewank,
1992].

9.2 Wasserstein Barycenters, Clustering and Dictionary


Learning

A basic problem in unsupervised learning is to compute the “mean” or


“barycenter” of several data points. A classical way to define such a
weighted mean of points (xs )Ss=1 ∈ X S living in a metric space (X , d)
(where d is a distance or more generally a divergence) is by solving a
9.2. Wasserstein Barycenters, Clustering and Dictionary Learning 155

variational problem
S
X
min λs d(x, xs )p , (9.9)
x∈X
s=1

for weights (λs )s ∈ ΣS , where p is often set to p = 2. When X = Rd


and d(x, y) = kx − yk2 , this leads to the usual definition of the linear
P
average x = s λs xs for p = 2, and the more evolved median point
when p = 1. One can retrieve various notions of means (e.g. harmonic
or geometric means over X = R+ ) using this formalism. This process is
often referred to as “Fréchet” or “Karcher” mean (see Karcher [2014]
for an historical account). For a generic distance d, problem (9.9) is
usually a difficult non-convex optimization problem. Fortunately, in
the case of optimal transport distances, the problem can be formulated
as a convex program for which existence can be proved and efficient
numerical schemes exist.

Fréchet means over the Wasserstein space. Given input histogram


{bs }Ss=1 , where bs ∈ Σns , and weights λ ∈ ΣS , a Wasserstein barycenter
is computed by minimizing
S
X
min λs LCs (a, bs ) (9.10)
a∈Σn
s=1

where the cost matrices Cs ∈ Rn×ns need to be specified. A typical


setup is “Eulerian”, so that all the barycenters are defined on the same
grid, ns = n, Cs = C = Dp is set to be a distance matrix, so that one
solves
S
X
min λs Wpp (a, bs ).
a∈Σn
s=1

This barycenter problem (9.10) was originally introduced by Agueh


and Carlier [2011] following earlier ideas of Carlier and Ekeland [2010].
They proved in particular uniqueness of the barycenter for c(x, y) =
kx − yk2 over X = Rd , if one of the input measure has a density with
respect to the Lebesgue measure (and more generally under the same
hypothesis as the one guaranteeing the existence of a Monge map, see
Remark 2.23).
156 Variational Wasserstein Problems

The barycenter problem for histograms (9.10) is in fact a linear


program, since one can look for the S couplings (Ps )s between each
input and the barycenter itself
( S )
∀ s, P> a, P>
X
min λs hPs , Cs i : s 1ns = s 1n = bs .
a∈Σn ,(Ps ∈Rn×ns )s
s=1

Although this problem is an LP, its scale forbids the use generic solvers
for medium scale problems. One can therefore resort to using first order
methods such as subgradient descent on the dual [Carlier et al., 2015].
The computation of Wasserstein barycenters has found numerous
applications in image processing [Rabin et al., 2011], computer graph-
ics [Solomon et al., 2015], statistics [Boissard et al., 2015], Bayesian
inference [Srivastava et al., 2015b,a] and machine learning [Cuturi and
Doucet, 2014]. For instance, the ability of computing barycenters is the
workhorse of clustering methods such as the K-means algorithm [del
Barrio et al., 2016, Ho et al., 2017].

Remark 9.1 (Barycenter of arbitrary measures). Given a set of in-


put measure (βs )s defined on some space X , the barycenter prob-
lem becomes
S
X
min λs Lc (α, βs ). (9.11)
α∈M1+ (X ) s=1

In the case where X = Rd and c(x, y) = kx − yk2 , Agueh and


Carlier [2011] shows that if one of the input measures has a density,
then this barycenter is unique. Problem (9.11) can be viewed as a
generalization of the problem of computing barycenters of points
(xs )Ss=1 ∈ X S to arbitrary measures. Indeed, if βs = δxs is a single
Dirac mass, then a solution to (9.11) is δx? where x? is a Fréchet
mean solving (9.9). Note that for c(x, y) = kx − yk2 , the mean of
the barycenter α? is necessarily the barycenter of the mean, i.e.
Z X Z
?
xdα (x) = λs xdαs (x),
X s X

and the support of α? is located in the convex hull of the sup-


ports of the (αs )s . The consistency of the approximation of the
9.2. Wasserstein Barycenters, Clustering and Dictionary Learning 157

infinite dimensional optimization (9.11) when approximating the


input distribution using discrete ones (and thus solving (9.10) in
place) is studied in Carlier et al. [2015]. Let us also note that it
is possible to re-cast (9.11) as a multi-marginal OT problem, see
Remark 10.2.

Remark 9.2 (Distribution of distributions and consistency). It is


possible to generalize (9.11) to a possibly infinite collection of
measures. This problem is described by considering a probability
distribution M over the space M1+ (X ) of probability distributions,
i.e. M ∈ M1+ (M1+ (X )). A barycenter is then a solution of
Z
min EM (Lc (α, β)) = Lc (α, β)dM (β), (9.12)
α∈M1+ (X ) M1+ (X )

where β is a random measure distributed according to M . Drawing


uniformly at random a finite number S of input measures (βs )Ss=1
according to M , one can then define β̂S as being a solution of (9.11)
for uniform weights λs = 1/S (note that here β̂S is itself a ran-
dom measure). Problem (9.11) corresponds to the special case of
P
a “discrete” measure M = s λs δβs . The convergence (in expecta-
tion or with high probability) of Lc (β̂S , α) to zero (where α is the
unique solution to (9.12)) corresponds to the consistency of the
barycenters, and is proved in [Bigot and Klein, 2012a, Le Gouic
and Loubes, 2016, Bigot and Klein, 2012b]. This can be inter-
preted as a law of large numbers over the Wasserstein space. The
extension of this result to a central limit theorem is an important
problem, see [Agueh and Carlier, 2017] for a formulation of this
problem and its solution in particular cases (1-D distributions and
Gaussian measures).

Remark 9.3 (Fixed point map). When dealing with the Euclidean
space X = Rd with ground cost c(x, y) = kx − yk2 , it is possible to
study the barycenter problem using transportation maps. Indeed,
if α has a density, according to Remark 2.23, one can define optimal
transportation maps Ts between α and αs , i.e. in particular such
158 Variational Wasserstein Problems

that Ts,] α = αs . The average map


S
X
T (α) =
def.
λs Ts
s=1

(the notation above makes explicit the dependence of this map


(α)
on α) is itself an optimal map between α and T] α (a positive
combination of optimal maps is equal by Brenier’s theorem (Re-
mark 2.23) to the sum of gradients of convex functions, equal to
the gradient of a sum of convex functions, therefore optimal by
Brenier’s theorem again). As shown in [Agueh and Carlier, 2011],
first order optimality conditions of the barycenter problem (9.12)
?
actually reads T (α ) = IdRd (the identity map) at the optimal
measure α? (the barycenter), and it is shown in [Álvarez-Esteban
et al., 2016] that the barycenter α? is the unique solution to the
fixed-point equation
def. (α)
G(α) = α where G(α) = T] α. (9.13)

Under mild conditions on the input measures, it is shown


in Álvarez-Esteban et al. [2016] that α 7→ G(α) strictly decreases
the objective function of (9.12) if α is not the barycenter, and
def.
that the fixed point iterations α(`+1) = G(α(`) ) converge to the
barycenter α? . This fixed point algorithm can be used in cases
where the optimal transportation maps are known in closed form
(e.g. for Gaussians). Adapting this algorithm for empirical mea-
sures of the same size results in computing optimal assignments
in place of Monge maps. For more general discrete measures of
arbitrary size the scheme can also be adapted Cuturi and Doucet
[2014] using barycentric projections (4.18).

Special cases. In general, solving (9.10) or (9.11) is not straightfor-


ward, but there exist some special cases for which solutions are explicit
or simple.
9.2. Wasserstein Barycenters, Clustering and Dictionary Learning 159

Remark 9.4 (Barycenter of Gaussians). It is shown in [Agueh and


Carlier, 2011] that the barycenter of Gaussians distributions αs =
N (ms , Σs ), for the squared Euclidean cost c(x, y) = kx − yk2 , is
itself a Gaussian N (m? , Σ? ). Making use of (2.39), one sees that
the barycenter mean is the mean of the inputs
X
m? = λs ms
s

while the covariance minimizes


X
min λs B(Σ, Σs )2
Σ
s

where B is the Bure metric (2.40). As studied in [Agueh and Car-


lier, 2011], the first order optimality condition of this convex prob-
lem shows that Σ? is the unique positive definite fixed point of the
following map
X 1 1 1
Σ? = Ψ(Σ? ) where Ψ(Σ) =
def.
λs (Σ 2 Σs Σ 2 ) 2
s
1
where Σ 2 is the square-root of PSD matrices. This result was
known from [Knott and Smith, 1994, Rüschendorf and Uckelmann,
2002] and is proved in [Agueh and Carlier, 2011]. While Ψ is not
strictly contracting, iterating this fixed point map, i.e. defining
def.
Σ(`+1) = Ψ(Σ(`) ) can be shown to converge to the solution Σ?
thanks to the more general result provided in [Álvarez-Esteban
et al., 2016]. This is because the fixed point map G defined in (9.13)
preserves Gaussian distributions, and in fact,

G(N (m, Σ)) = N (m? , Ψ(Σ)).

This method has been used for application to texture synthesis


in [Xia et al., 2014]. Figure 9.2 shows two examples of computations
of barycenters between four 2-D Gaussians.
160 Variational Wasserstein Problems

Figure 9.2: Barycenters between 4 Gaussian distributions in 2-D. Each Gaussian


is displayed using an ellipse aligned with the principal axes of the covariance, and
with elongations proportional to the corresponding eigenvalues.

Remark 9.5 (1-D cases). For 1-D distributions, the W p barycen-


ter can be computed almost in closed form using the fact that
the transport is the monotone re-arrangement, as detailed in Re-
mark 2.28. The simplest case is for empirical measures with n
points, i.e. βs = n1 ni=1 δys,i , where the points are assumed to
P

be sorted ys,1 ≤ ys,2 ≤ . . . Using (2.31) the barycenters αλ is also


an empirical measure on n points
n
1X
αλ = δx where xλ,i = Aλ (xs,i )s ,
n i=1 λ,i

where Aλ is the barycentric map


S
X
λs |x − xs |p .
def.
Aλ (xs )s = argmin
x∈R s=1

For instance, for p = 2, one has xλ,i = Ss=1 λs xs,i . In the general
P

case, one needs to use the cumulative functions as defined in (2.32),


and using (2.34), one has

∀ r ∈ [0, 1], Cα−1


λ
(r) = Aλ (Cα−1
s
(r))Ss=1 ,

which can be used for instance to compute barycenters between


discrete measures supported on less than n points in O(n log(n))
operations, using a simple sorting procedure.
9.2. Wasserstein Barycenters, Clustering and Dictionary Learning 161

Remark 9.6 (Simple cases). Denoting by Tr,u : x 7→ rx + u a scal-


ing and translation, and assuming that αs = Trs ,us ,] α0 is obtained
by scaling and translating an initial template measure, then a
barycenter αλ is also obtained using scaling and translation
(
r? = ( s λs /rs )−1 ,
P
αλ = Tr? ,u? ,] α0 where
u? = s λs us .
P

Remark 9.7 (Case S = 2). In the case where X = Rd and c(x, y) =


kx − yk2 (this can be extended to geodesic spaces), the barycenter
between S = 2 measures (α0 , α1 ) is the McCann interpolant as
already introduced in (7.6). Denoting T] α0 = α1 the Monge map,
one has that the barycenter αλ reads αλ = (λ1 Id + λ2 T )] α0 . For-
mula (7.9) explains how to perform the computation in the discrete
case.

Entropic approximation of barycenters. One can use entropic


smoothing and approximate the solution of (9.10) using
S
X
min λs LεCs (a, bs ) (9.14)
a∈Σn
s=1

for some ε > 0. This is a smooth convex minimization problem, which


can be tackled using gradient descent [Cuturi and Doucet, 2014]. An
alternative is to use descent method (typically quasi-Newton) on the
semi-dual Cuturi and Peyré [2016], which is useful to integrate addi-
tional regularizations on the barycenter (e.g. to impose some smooth-
ness). A simple but effective approach, as remarked in Benamou et al.
[2015] is to rewrite (9.14) as a (weighted) KL projection problem
( )
X
T
min λs εKL(Ps |Ks ) : ∀ s, Ps 1m = bs , P1 11 = . . . = PS 1S
(Ps )s s
(9.15)
where we denoted Ks = e−Cs /ε . Here, the barycenter a is implicitly
def.

encoded in the row marginals of all the couplings Ps ∈ Rn×ns as a =


P1 11 = . . . = PS 1S . As detailed in Benamou et al. [2015], one can
generalize Sinkhorn to this problem, which also corresponds to iterative
162 Variational Wasserstein Problems

projection. This can also be seen as a special case of the generalized


Sinkhorn detailed in §4.6. The optimal couplings (Ps )s solving (9.15)
are computed in scaling form as

Ps = diag(us )K diag(vs ), (9.16)

and the scalings are sequentially updated as


bs
v(`+1)
def.
∀ s ∈ J1, SK, s = (`)
, (9.17)
KT
s us
a(`+1)
u(`+1)
def.
∀ s ∈ J1, SK, s = (`+1)
, (9.18)
Ks vs
Y
where a(`+1) (Ks v(`+1) )λs .
def.
= s (9.19)
s

An alternative way to derive these iterations is to perform alternate


minimization on the variables of a dual problem, which is detailed in
the following proposition.

Proposition 9.1. The optimal (us , vs ) appearing in (9.16) can be writ-


ten as (us , vs ) = (efs /ε , egs /ε ) where (fs , gs )s are the solutions of the
following program (whose value matches the one of (9.14))
( )
 
gs /ε fs /ε
X X
max λs hgs , bs i − εhKs e ,e i : λs fs = 0 . (9.20)
(fs ,gs )s s s

Proof. Introducing Lagrange multipliers in (9.15) leads to


X 
min max λs εKL(Ps |Ks ) + ha − Ps 1m , fs i
(Ps )s ,a (fs ,gs )s s

+hbs − Ps T 1m , gs i .

Strong duality holds, so that one can exchange the min and the max,
to obtain
X  
max λs hgs , bs i + min εKL(Ps |Ks ) − hPs , fs ⊕ gs i
(fs ,gs )s s Ps
X
+ min h λs fs , ai.
a
s
9.2. Wasserstein Barycenters, Clustering and Dictionary Learning 163

P
The explicit minimization on a gives the constraint s λs fs = 0 to-
gether with
fs ⊕ gs
 

X
max λs hgs , bs i − εKL |Ks
(fs ,gs )s s ε

where KL∗ (·|Ks ) is the Legendre transform (4.50) of the function


KL∗ (·|Ks ). This Legendre transform reads

KL∗ (U|K) = Ki,j (eUi,j − 1),


X
(9.21)
i,j

which shows the desired formula. To show (9.21), since this function is
separable, one needs to compute

∀ (u, k) ∈ R2+ , KL∗ (u|k) = max ur − (r log(r/k) − r + k)


def.

whose optimality condition reads u = log(r/k), i.e. r = keu , hence the


result.

Minimizing (9.20) with respect to each gs , while keeping all the


other variable fixed, is obtained in closed form by (9.17). Minimiz-
ing (9.20) with respect to all the (fs )s requires to solve for a using (9.19)
and leads to the expression (9.18).
Figures 9.3 and 9.4 show applications to 2-D and 3-D shapes inter-
polation. Figure 9.5 shows a computation of barycenters on a surface,
where the ground cost is the square of the geodesic distance. For this
figure, the computations are performed using the geodesic in heat ap-
proximation detailed in Remark 4.17. We refer to [Solomon et al., 2015]
for more details and other applications to computer graphics and imag-
ing sciences.
Barycenters have found many applications outside the field of shape
analysis. They have been be used for image processing, in particular
color modification [Solomon et al., 2015] (see Figure 9.6); Bayesian com-
putations [Srivastava et al., 2015b] to summarize measures; non-linear
dimensionality reduction, to express an input measure as a Wasser-
stein barycenter of other known measures [Bonneel et al., 2016]. All of
these problems result in involved non-convex objective functions which
164 Variational Wasserstein Problems

Figure 9.3: Barycenters between 4 input 2-D shapes using entropic regulariza-
tion (9.14). To display a binary shape, the displayed images shows a thresholded
density. The weights (λs )s are bilinear with respect to the four corners of the square.

Figure 9.4: Barycenters between 4 input 3-D shapes using entropic regulariza-
tion (9.14). The weights (λs )s are bilinear with respect to the four corners of the
square. Shapes are represented as measures that are uniform within the boundaries
of the shape and null outside.
9.2. Wasserstein Barycenters, Clustering and Dictionary Learning 165

Figure 9.5: Barycenters interpolation between two input measures on surfaces,


computed using the geodesic in heat fast kernel approximation (see Remark 4.17).
Extracted from [Solomon et al., 2015].

Figure 9.6: Interpolation between the two 3-D color empirical histograms of two
input images (here only the 2-D chromatic projection is visualized for simplicity. The
modified histogram is then applied to the input images using barycentric projection
as detailed in Remark 4.9. Extracted from [Solomon et al., 2015].

can be accurately optimized using automatic differentiation (see Re-


mark 9.1.3). Problems closely related to the computation of barycen-
ters include the computation of principal components analyses over
the Wasserstein space, see for instance [Seguy and Cuturi, 2015, Bigot
et al., 2017], and the statistical estimation of template models [Boissard
et al., 2015].

Remark 9.8 (Wasserstein Propagation). As studied in Solomon


et al. [2014b], it is possible to generalize the barycenter prob-
lem (9.10), where one looks for distributions (bu )u∈U at some given
166 Variational Wasserstein Problems

set U of nodes in a graph G given a set of fixed input distributions


(bv )v∈V on the complementary set V of the nodes. The unknown
are determined by minimizing the overall transportation distance
between all pairs of nodes (r, s) ∈ G forming edges in the graph
X
min LCr,s (br , bs ) (9.22)
(bu ∈Σnu )u∈U
(r,s)∈G

where the cost matrices Cr,s ∈ Rnr ×ns needs to be specified by


the user. The barycenter problem (9.10) is a special case of this
problem where the considered graph G is “star shaped” where U
is a single vertex connected to all the other vertices V (the weight
λs associated to bs can be absorbed in the cost matrix). Introduc-
ing explicitly a coupling Pr,s ∈ U(br , bs ) for each edge (r, s) ∈ G,
and using entropy regularization one can rewrite this problem sim-
ilarly as in (9.15), and one extends Sinkhorn iterations (9.17) to
this problem (this can also be derived by re-casting this problem in
the form of the generalized Sinkhorn algorithm detailed in §4.6).
This discrete variational problem (9.22) on a graph can be gener-
alized to define a Dirichlet energy when replacing the graph by a
continuous domain [Solomon et al., 2013]. This in turn leads to the
definition of measure-valued harmonic functions which finds appli-
cation in image and surface processing. We refer also to Lavenant
[2017] for a theoretical analysis, and to Vogt and Lellmann [2017]
for extensions to non-quadratic (total-variation) functionals and
applications to imaging.

9.3 Gradient Flows

Given a smooth function a 7→ F (a), one can use the standard gradient
descent
a(`+1) = a(`) − τ ∇F (a(`) )
def.
(9.23)
where τ is a small enough step size. This corresponds to a so-called
“explicit” minimization scheme, and only applies for smooth func-
tions F . For non-smooth functions, one can use instead an “implicit”
scheme, which is also called the proximal-point algorithm (see for in-
9.3. Gradient Flows 167

stance Bauschke and Combettes [2011])


k·k 1 2
a(`+1) = Proxτ F (a(`) ) = argmin a − a(`)
def. def.
+ τ F (a). (9.24)
a 2
Note that this corresponds to the Euclidean proximal operator, already
encountered in (7.13). The update (9.23) can be understood as iterating
the explicit operator Id − τ ∇F , while (9.24) makes use of the implicit
operator (Id+τ ∇F )−1 . For convex F , iterations (9.24) always converge,
for any value of τ > 0.
Instead of using the `2 norm k·k in (9.24), if the function F is defined
on the simplex of histograms Σn , then it makes sense to rather use an
optimal transport metric, in order to solve

a(`+1) = argmin Wp (a, a(`) )p + τ F (a).


def.
(9.25)
a

Remark 9.9 (Wasserstein gradient flows). Equation (9.25) can be


generalized to arbitrary measures by defining the iteration

α(`+1) = argmin W p (α, α(`) )p + τ F (α)


def.
(9.26)
α

for some function F defined on M1+ (X ). This implicit time step-


ping is a useful tool to construct continuous flows, by formally
taking the limit τ → 0 and introducing the time t = τ `, so that
α(`) is intended to approximate a continuous flow t ∈ R+ 7→ αt .
For the special case p = 2 and X = Rd , a formal calculus shows
that αt is expected to solve a PDE of the form
∂αt
= div(αt ∇(F 0 (αt ))) (9.27)
∂t
where F 0 (α) denotes the derivative of the function F in the sense
that it is a continuous function F 0 (α) ∈ C(X ) such that
Z
F (α + εξ) = F (α) + ε F 0 (α)dξ(x) + o(ε).
X

A typical example is when using F = −H, where H(α) =


KL(α|LRd ) is the the relative entropy with respect to the Lebesgue
168 Variational Wasserstein Problems

measure LRd on X = Rd
Z
H(α) = − ρα (x)(log(ρα (x)) − 1)dx (9.28)
Rd

(setting H(α) = −∞ when α does not have a density), then (9.27)


shows that the gradient flow of this neg-entropy is the linear heat
diffusion
∂αt
= ∆αt (9.29)
∂t
where ∆ is the spatial Laplacian. The heat diffusion can therefore
be interpreted either as the “classical” Euclidian flow (somehow
performing “vertical” movements with respect to mass amplitudes)
R 2
of the Dirichlet energy Rd ∇ρα(x) dx, or, alternatively, as the
entropy for the optimal transport flow (somehow an “horizontal”
movement with respect to mass positions). Interest in Wasserstein
gradient flows was sparked by the seminal paper of Jordan, Kinder-
lehrer and Otto [Jordan et al., 1998], and these evolutions are often
called “JKO flows” following their work. As shown in details in the
monograph by Ambrosio et al. [2008], JKO flows are a special case
of gradient flows in metric spaces. We also refer to the recent sur-
vey paper [Santambrogio, 2017]. JKO flows can be used to study
in particular non-linear evolution equations such as the porous
medium equation [Otto, 2001], total variation flows [Carlier and
Poon, 2017], quantum drifts [Gianazza et al., 2009], or heat evo-
lutions on manifolds [Erbar, 2010]. Their flexible formalism allows
for constraints on the solution, such as the congestion constraint
(an upper bound on the density at any point) that Maury et al.
used to model crowd motion [2010] (see also the review paper [San-
tambrogio, 2018]).

Remark 9.10 (Gradient flows in metric spaces). The implicit step-


ping (9.26) is a special case of a general formalism to define gradi-
ent flows over general metric spaces (X , d), where d is a distance, as
detailed in Ambrosio et al. [2008]. For some function F (x) defined
for x ∈ X , the implicit discrete minmization step is then defined
9.3. Gradient Flows 169

p=1 p=1

1
2
p= p=

+
1

=
+

p
=
p

Explicit Implicit

Figure 9.7: Comparison of explicit and implicit gradient flow to minimize the
function f (x) = kxk2 on X = R2 for the distances d(x, y) = kx − ykp for several
values of p.

as
x(`+1) ∈ argmin d(x(`) , x)2 + τ F (x). (9.30)
x∈X
The JKO step (9.26) corresponds to the use of the Wasserstein
distance on the space of probability distributions. In some cases,
one can show that (9.30) admits a continuous flow limit xt as τ → 0
and kτ = t. In case that X also has a Euclidean structure, an
explicit stepping is defined by linearizing F

x(`+1) = argmin d(x(`) , x)2 + τ h∇F (x(`) ), xi. (9.31)


x∈X

In sharp contrast to the implicit formula (9.30) it is usually


straightforward to compute but can be unstable. The implicit step
is always stable, is also defined for non-smooth F , but is usu-
ally not accessible in closed form. Figure 9.7 illustrates this con-
cept on the function F (x) = kxk2 on X = R2 for the distances
1
d(x, y) = kx − ykp = (|x1 − y1 |p + |x2 − y2 |p ) p for several values of
p. The explicit scheme (9.31) is unstable for p = 1 and p = +∞,
and for p = 1 it gives axis-aligned steps (coordinate wise descent).
In contrast, the implicit scheme (9.30) is stable. Note in particular
how, for p = 1, when the two coordinates are equal, the following
step operates in the diagonal direction.
170 Variational Wasserstein Problems

Remark 9.11 (Lagrangian discretization). The finite-dimensional


problem in (9.25) can be interpreted as the Eulerian discretization
of a flow over the space of measures (9.26). An alternative way
to discretize the problem, using the so-called Lagrangian method,
would be to parameterize instead the solution as a (discrete)
empirical measure moving with time, where the locations of that
measure (and not its weights) become the variables of interest.
In practice, one can consider a dynamic point cloud of particles
αt = n1 ni=1 δxi (t) indexed with time. The initial problem (9.25) is
P

then replaced by a set of n coupled ODE prescribing the dynamic


of the points X(t) = (xi (t))i ∈ X n . If the energy F is finite for dis-
crete measures, then one can simply define F(X) = F ( n1 ni=1 δxi ).
P
R
Typical examples are linear functions F (α) = X V (x)dα(x) and
R
quadratic interactions F (α) = X 2 W (x, y)dα(x)dα(y), in which
case one can use respectively
1X 1 X
F(X) = V (xi ) and F(X) = W (xi , xj ).
n i n2 i,j

For functions such as generalized entropy, which are only finite for
measures having densities, one should apply a density estimator to
convert the point cloud into a density, which allows to also define
function F(x) consistent with F as n → +∞. A typical example
is for the entropy F (α) = H(α) defined in (9.28), for which a
consistent estimator is (up to a constant term) can be obtained by
summing the logarithms of the distances to nearest neighbors
1X
F(X) = log(dX (xi )) where dX (x) = 0 min0 x − x0 ,
n i x ∈X,x 6=x
(9.32)
see Beirlant et al. [1997] for a review of non-parametric entropy
estimators. For small enough step sizes τ , assuming X = Rd , the
Wasserstein distance W 2 matches the Euclidean distance on the
points, i.e. if |t − t0 | is small enough, W2 (αt , αt0 ) = kX(t) − X(t0 )k.
The gradient flow is thus equivalent to the Euclidean flow on po-
sitions X 0 (t) = −∇F(X(t)), which is discretized for times tk = τ k
9.3. Gradient Flows 171

t=0 t = 0.2 t = 0.4 t = 0.6 t = 0.8

Figure 9.8: Example ofRgradient flow evolutions using a Lagrangian discretization,


for the function F (α) = V dα − H(α), for V (x) = kxk2 . The entropy is discretized
using (9.32). The limiting stationary distribution is a Gaussian.

similarly to (9.23) using explicit Euler steps

X (`+1) = X (`) − τ ∇F(X (`) ).


def.

Figure 9.8 shows an example of such a discretized explicit evolu-


tion for a linear plus entropy functional, so resulting in a discretized
version of a Fokker-Planck equation. Note that for this particular
case of linear Fokker-Planck equation, it is possible also to resort to
stochastic PDEs methods, and can be approximated numerically
by evolving a single random particle with a Gaussian drift. The
convergence of these schemes (so-called Langevin Monte-Carlo) to
the stationary distribution can in turn be quantified in term of
Wasserstein distance, see for instance [Dalalyan and Karagulyan,
2017]. If the function F is not smooth, one should discretize sim-
ilarly to (9.24) using implicit Euler steps, i.e. consider
k·k 1 2
X (`+1) = Proxτ F (X (`) ) = argmin Z − X (`)
def. def.
+ τ F(Z).
Z∈X n 2
R
In the simplest case of a linear function F (α) = X V (x)dα(x),
then the flow operates independently over each particule xi (t) and
corresponds to a usual Euclidean flow for the function V , x0i (t) =
−∇V (xi (t)) (and is an advection PDEs of the density along the
integral curves of the flow).
172 Variational Wasserstein Problems

Remark 9.12 (Geodesic convexity). An important concept related


to gradient flows is the convexity of the functional F with re-
spect to the Wasserstein-2 geometry, i.e. the convexity of F along
Wasserstein geodesics (i.e. displacement interpolations as exposed
in Remark 7.1). The Wasserstein gradient flow (with a continuous
time) for such a function exists, is unique and is the limit of the
discrete stepping (9.26) as τ → 0. It converges to a fixed station-
ary distribution as t → +∞. The entropy is a typical example
of geodesically convex function, and so are linear functions of the
R
form F (α) = X V (x)dα(x) and quadratic interaction functions
R
F (α) = X ×X W (x, y)dα(x)dα(y) for convex functions V : X → R,
W : X × X → R. Note that while linear functions are convex in the
classical sense, quadratic interaction functions might fail to be. A
typical example is W (x, y) = kx − yk2 which is a negative semi-
definite kernel (see Definition 8.3) and thus corresponds to F (α)
being a concave function in the usual sense (while it is geodesically
convex). An important result of McCann [1997] is that generalized
“entropy” functions of the form F (α) = Rd ϕ(ρα (x))dx on X = Rd
R

are geodesically convex if ϕ is convex, with ϕ(0) = 0, ϕ(t)/t → +∞


as t → +∞ and such that s 7→ sd ϕ(s−d ) is convex decaying.

There is an important literature on the numerical resolution of the


resulting discretized flow, and we only give a few representative pub-
lications. For 1-D problems, very precise solvers have been developed
because OT is a quadratic functional in the inverse cumulative func-
tion (see Remark 2.28) Kinderlehrer and Walkington [1999], Blanchet
et al. [2008], Agueh and Bowles [2013], Matthes and Osberger [2014],
Blanchet and Carlier [2015]. In higher dimension, it can be tackled using
finite elements and finite volume schemes Carrillo et al. [2015], Burger
et al. [2010]. Alternative solvers are obtained using Lagrangian schemes
(i.e. particles systems) Carrillo and Moll [2009], Benamou et al. [2016a],
Westdickenberg and Wilkening [2010]. Another direction is to look for
discrete flows (typically on discrete grids or graphs) which maintains
the properties of their continuous counterparts, and in particular such
that the gradient flow of the entropy is the heat equation, see Mielke
[2013], Erbar and Maas [2014], Chow et al. [2012], Maas [2011].
9.4. Minimum Kantorovitch Estimators 173

An approximate approach to solve the Eulerian discretized prob-


lem (9.23) makes use of the entropic regularization. This was initially
proposed in Peyré [2015] and refined in Chizat et al. [2017], and theo-
retically analyzed in Carlier et al. [2017]. Adding an entropic regular-
ization penalty, the problem (9.25) has the form (4.45) when using the
identification G = ιa(`) and F should be replace by τ F . One can thus
use the iterations (4.47) to approximate a(`+1) as proposed initially
in Peyré [2015]. The convergence of this scheme as ε → 0 is proved
in Carlier et al. [2017]. Figure 9.9 shows example of evolution com-
puter with this method. An interesting application of gradient flows
is in machine learning, in order to learning the underlying function F
that best model some dynamical model of density. This learning can be
achieved by solving a smooth non-convex optimization using entropic
regularized transport, and automatic differentiation (as advocated in
Remark 9.1.3), see Hashimoto et al. [2016].
It should be noted that analyzing the convergence of gradient flows
discretized in both time and space is difficult. Due to the polyhedral
nature of the linear program defining the distance, using too small step
sizes leads to a “locking” phenomena (the distribution is stuck and does
no evolve, so that the step size should be not too small, as discussed
in [Maury and Preux, 2017].
It is also possible to compute gradient flows for unbalanced opti-
mal transport distances as detailed in §10.2. This results in evolutions
allowing mass creation or destruction, which is crucial to model many
physical, biological or chemical phenomena. Figure 9.10 shows an exam-
ple of gradient flow corresponding to the celebrated Hele-Shaw model
for cells growth [Perthame et al., 2014], which is studied theoretically
in [Gallouët and Monsaingeon, 2017, Di Marino and Chizat, 2017].
Such an unbalanced gradient flow can be also approximated using the
generalized Sinkhorn algorithm [Chizat et al., 2017].

9.4 Minimum Kantorovitch Estimators

Given some discrete samples (xi )ni=1 ⊂ X from some unknown distri-
bution, the goal is to fit a parametric model θ 7→ αθ ∈ M(X ) to the
174 Variational Wasserstein Problems

cos(w) t=0 t=5 t = 10 t = 20

Figure 9.9: Examples of gradient flows evolutions,


R with drift V and congestion
terms (from Peyré [2015]), so that F (α) = X V (x)dα(x) + ι≤κ (ρα ).

1.2

1.0

0.8

0.6
ρkτ

0.4

0.2

0.0
0.0 0.2 0.4 0.6 0.8 1.0
X

Figure 9.10: Unbalanced OT gradient flow to solve the Hele-Shaw equation


(from [Chizat et al., 2017]).
9.4. Minimum Kantorovitch Estimators 175

g✓ ↵✓
⇣ x
z
Z X

Figure 9.11: Schematic display of the density fitting problem 9.33.

observed empirical input measure β


1X
min L(αθ , β) where β = δxi , (9.33)
θ∈Θ n i

where L is some “loss” function between a discrete and a “continuous”


(arbitrary) distribution (see Figure 9.11).
def.
In the case where αθ as a densify ρθ = ραθ with respect to the
Lebesgue measure (or any other fixed reference measure), the maximum
likelihood estimator (MLE) is obtained by solving
def.
X
min LMLE (αθ , β) = − log(ρθ (xi )).
θ
i

This corresponds to using an empirical counterpart of a Kullback-


Leibler loss since, assuming the xi are i.i.d. samples of some β̄, then
n→+∞
LMLE (α, β) −→ KL(α|β̄)

This MLE approach is known to lead to optimal estimation proce-


dures in many cases (see for instance Owen [2001]). However, it fails
to work when estimating singular distributions, typically when the αθ
does not has a density (so that LMLE (αθ , β) = +∞) or when (xi )i are
samples from some singular β̄ (so that the αθ should share the same
support as β for KL(αθ |β̄) to be finite, but this support is usually un-
known). Another issue is that in several cases of practical interest, the
density ρθ is inaccessible (or too hard to compute).
A typical setup where both problems (singular and unknown den-
sities) occur is for so-called generative models, where the paramet-
ric measure is written as a push-forward of a fixed reference measure
ζ ∈ M(Z)
$\alpha_\theta = h_{\theta,\sharp}\zeta$ where $h_\theta : \mathcal{Z} \to \mathcal{X}$

where the push-forward operator is introduced in Definition 2.1. The


space Z is usually low-dimensional, so that the support of αθ is localized
along a low-dimensional “manifold” and the resulting density is highly
singular (it does not have a density with respect to Lebesgue measure).
Furthermore, computing this density is usually intractable, while gen-
erating i.i.d. samples from αθ is achieved by computing xi = hθ (zi )
where (zi )i are i.i.d. samples from ζ.
In order to cope with such a difficult scenario, one has to use weak
metrics in place of the MLE functional LMLE , which needs to be written
in dual form as
$$ L(\alpha,\beta) \stackrel{\text{def.}}{=} \max_{(f,g)\in\mathcal{C}(\mathcal{X})^2} \left\{ \int_{\mathcal{X}} f(x)\,\mathrm{d}\alpha(x) + \int_{\mathcal{X}} g(x)\,\mathrm{d}\beta(x) \;:\; (f,g)\in\mathcal{R} \right\}. \qquad (9.34) $$
Dual norms exposed in §8.2 correspond to imposing R =
{(f, −f ) : f ∈ B}, while optimal transport (2.23) sets R = R(c) as
defined in (2.24).
For a fixed θ, evaluating the energy to be minimized in (9.33) us-
ing such a loss function corresponds to solving a semi-discrete optimal
transport, which is the focus of Chapter 5. Minimizing the energy with
respect to θ is much more involved, and is typically highly non-convex.
Denoting fθ a solution to (9.34) when evaluating E(θ) = L(αθ , β),
a sub-gradient is obtained using the formula
$$ \nabla E(\theta) = \int_{\mathcal{X}} [\partial h_\theta(x)]^\top \nabla f_\theta(x)\,\mathrm{d}\alpha_\theta(x), \qquad (9.35) $$

where ∂hθ (x) ∈ Rdim(Θ)×d is the differential (with respect to θ) of


θ ∈ Rdim(Θ) 7→ hθ (x), while ∇fθ (x) is the gradient (with respect to
x) of $f_\theta$. This formula is hard to use numerically, first because it re-
quires computing a continuous function $f_\theta$, which is a solution to
a semi-discrete problem. As exposed in §8.5, for OT loss, this can be
achieved using stochastic optimization, but this is hardly applicable in
high dimension. Another option is to impose a parametric form for this
potential, for instance expansion in a RKHS Genevay et al. [2016] or a
deep-network approximation [Arjovsky et al., 2017]. This, however, leads
to important approximation errors that are not yet analyzed theoreti-
cally. A last issue is that it is unstable numerically because it requires

the computation of the gradient ∇fθ of the dual potential fθ .


For the OT loss, an alternative gradient formula is obtained when
one rather computes a primal optimal coupling for the following equiv-
alent problem
$$ L_c(\alpha_\theta,\beta) = \min_{\gamma\in\mathcal{M}(\mathcal{Z}\times\mathcal{X})} \left\{ \int_{\mathcal{Z}\times\mathcal{X}} c(h_\theta(z),x)\,\mathrm{d}\gamma(z,x) \;:\; \gamma\in\mathcal{U}(\zeta,\beta) \right\}. \qquad (9.36) $$
Note that in the semi-discrete case considered here, the objective to be
minimized can actually be decomposed as
$$ \min_{(\gamma_i)_{i=1}^n} \sum_{i=1}^n \int_{\mathcal{Z}} c(h_\theta(z),x_i)\,\mathrm{d}\gamma_i(z) \quad\text{where}\quad \sum_{i=1}^n \gamma_i = \zeta, \quad \int_{\mathcal{Z}} \mathrm{d}\gamma_i(z) = \frac{1}{n}, \qquad (9.37) $$
where each γi ∈ M+ (Z). Once an optimal (γθ,i )i solving (9.37) is
obtained, the gradient of $E(\theta)$ is computed as
$$ \nabla E(\theta) = \sum_{i=1}^n \int_{\mathcal{Z}} [\partial h_\theta(z)]^\top \nabla_1 c(h_\theta(z), x_i)\,\mathrm{d}\gamma_i(z), $$

where ∇1 c(x, y) ∈ Rd is the gradient of x 7→ c(x, y). Note that as


opposed to (9.35), this formula does not involve computing the gradient
of the potentials being solutions of the dual OT problem.
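To make this pipeline concrete, the following sketch (not the authors' code) fits a simple generative model by minimizing an entropic OT loss through automatic differentiation, in the spirit of Remark 9.1.3; the network h_theta, the value of ε, the iteration counts and all function names are illustrative assumptions.

```python
# Illustrative sketch: Minimum Kantorovich Estimator via entropic OT + autodiff.
# Gradients w.r.t. theta are obtained by differentiating through the unrolled
# Sinkhorn loop, so the dual potential f_theta is never formed explicitly.
import torch

def sinkhorn_loss(x, y, eps=0.1, n_iter=100):
    C = torch.cdist(x, y, p=2) ** 2            # squared Euclidean cost
    n, m = C.shape
    log_a = torch.log(torch.full((n,), 1.0 / n))
    log_b = torch.log(torch.full((m,), 1.0 / m))
    f, g = torch.zeros(n), torch.zeros(m)
    for _ in range(n_iter):                    # log-domain Sinkhorn iterations
        f = -eps * torch.logsumexp((g[None, :] + eps * log_b[None, :] - C) / eps, dim=1)
        g = -eps * torch.logsumexp((f[:, None] + eps * log_a[:, None] - C) / eps, dim=0)
    P = torch.exp((f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :])
    return torch.sum(P * C)                    # transport cost <P, C>

torch.manual_seed(0)
y = torch.randn(200, 2) + torch.tensor([3.0, 0.0])       # observed samples x_i
h_theta = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 2))     # generator z -> x
opt = torch.optim.Adam(h_theta.parameters(), lr=1e-2)
for it in range(200):
    z = torch.randn(200, 2)                  # samples from the reference measure zeta
    loss = sinkhorn_loss(h_theta(z), y)      # E(theta), differentiated by autograd
    opt.zero_grad(); loss.backward(); opt.step()
```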
The class of estimators obtained using L = Lc , often called
“Minimum Kantorovitch Estimators” (MKE), was initially introduced
in [Bassetti et al., 2006], see also [Canas and Rosasco, 2012]. It has
been used in the context of generative models by [Montavon et al.,
2016] to train Restricted Boltzmann Machines, and in [Bernton et al.,
2017] in conjunction with Approximate Bayesian Computations. Ap-
proximations of these computations using deep networks are used to
train deep generative models for both GAN [Arjovsky et al., 2017] and
VAE [Bousquet et al., 2017], see also [Genevay et al., 2017a,b].

Remark 9.13 (Metric learning and transfer learning). Let us insist


on the fact that, for applications in machine learning, the suc-
cess of OT-related methods very much depends on the choice of
an adapted cost c(x, y) which captures the geometry of the data.
While it is possible to embed many kinds of data in Euclidean

spaces (see for instance [Mikolov et al., 2013] for word embed-
dings), in many cases, some sort of adaptation or optimization of the
metric is needed. Metric learning for supervised tasks is a classical
problem (see for instance [Kulis, 2012, Weinberger and Saul, 2009])
and it has been extended to the learning of the ground metric
c(x, y) when some OT distance is used in a learning pipeline [Cu-
turi and Avis, 2014] (see also [Zen et al., 2014, Wang and Guibas,
2012]). Let us also mention the related inverse problem of learn-
ing the cost matrix from the observations of an optimal coupling
P, which can be regularized using a low-rank prior [Dupuy et al.,
2016]. Another related problem is transfer learning [Pan and Yang,
2010] and domain adaptation [Glorot et al., 2011], where one wants
to transfer some trained machine learning pipeline to adapt it to
some new dataset. This problem can be modelled and solved using
OT techniques, see [Courty et al., 2017b,a].
10
Extensions of Optimal Transport

This chapter details several variational problems that are related to


(and share the same structure as) the Kantorovitch formulation of op-
timal transport. The goal is to extend optimal transport to more gen-
eral settings: several input histograms, un-normalized histograms, more
general classes of measures, and transport between measures living in
different metric spaces and for which no suitable cost function between
the spaces can be defined.

10.1 Multi-marginal Problems

Instead of coupling two input histograms using the Kantorovitch for-


mulation (2.11), one can couple S histograms (as )Ss=1 , where as ∈ Σns ,
by solving the following multi-marginal problem
$$ \min_{\mathbf{P}\in\mathcal{U}((\mathbf{a}_s)_s)} \langle \mathbf{C}, \mathbf{P}\rangle \stackrel{\text{def.}}{=} \sum_{i_1,\dots,i_S} \mathbf{C}_{i_1,\dots,i_S}\, \mathbf{P}_{i_1,\dots,i_S}, \qquad (10.1) $$

where the set of valid couplings is
$$ \mathcal{U}((\mathbf{a}_s)_s) = \Big\{ \mathbf{P} \in \mathbb{R}_+^{n_1\times\dots\times n_S} \;:\; \forall\, s,\ \forall\, i_s, \ \sum_{(i_\ell)_{\ell\neq s}} \mathbf{P}_{i_1,\dots,i_S} = \mathbf{a}_{s,i_s} \Big\}. $$


The entropic regularization scheme (4.2) naturally extends to this set-


ting:
$$ \min_{\mathbf{P}\in\mathcal{U}((\mathbf{a}_s)_s)} \langle \mathbf{P}, \mathbf{C}\rangle - \varepsilon \mathbf{H}(\mathbf{P}), $$
and one can then apply Sinkhorn's algorithm to compute the optimal
$\mathbf{P}$ in scaling form,
$$ \forall\, i=(i_1,\dots,i_S), \quad \mathbf{P}_i = \mathbf{K}_i \prod_{s=1}^S \mathbf{u}_{s,i_s} \quad\text{where}\quad \mathbf{K} \stackrel{\text{def.}}{=} e^{-\mathbf{C}/\varepsilon}, $$
where the $\mathbf{u}_s \in \mathbb{R}_+^{n_s}$ are (unknown) scaling vectors, which are iteratively
updated, by cycling repeatedly through $s=1,\dots,S$,
$$ \mathbf{u}_{s,i_s} \leftarrow \frac{\mathbf{a}_{s,i_s}}{\sum_{(i_\ell)_{\ell\neq s}} \mathbf{K}_i \prod_{r\neq s} \mathbf{u}_{r,i_r}}. \qquad (10.2) $$
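As an illustration of these updates, the following NumPy sketch runs the cyclic scaling iterations (10.2) for S = 3 marginals with a dense cost tensor; the sizes, the choice of cost and of ε are arbitrary assumptions, and the code stores the full $n_1\times n_2\times n_3$ tensor, so it is a didactic sketch rather than a scalable solver.

```python
# Illustrative multi-marginal Sinkhorn (S = 3), cycling through the updates (10.2).
import numpy as np

rng = np.random.default_rng(0)
n = (5, 6, 7)
x = [rng.standard_normal(ni) for ni in n]        # three 1-D point clouds
a = [np.full(ni, 1.0 / ni) for ni in n]          # uniform marginals a_s
# pairwise-sum cost C_{ijk} = |x1_i-x2_j|^2 + |x2_j-x3_k|^2 + |x1_i-x3_k|^2
C = (np.subtract.outer(x[0], x[1])[:, :, None] ** 2
     + np.subtract.outer(x[1], x[2])[None, :, :] ** 2
     + np.subtract.outer(x[0], x[2])[:, None, :] ** 2)
eps = 0.1
K = np.exp(-C / eps)
u = [np.ones(ni) for ni in n]
for _ in range(200):
    for s in range(3):
        # K times the product of the *other* scalings, then summed over the other axes
        scal = np.einsum('ijk,i,j,k->ijk', K, u[0], u[1], u[2]) / u[s].reshape(
            [-1 if t == s else 1 for t in range(3)])
        axes = tuple(t for t in range(3) if t != s)
        u[s] = a[s] / scal.sum(axis=axes)
P = np.einsum('ijk,i,j,k->ijk', K, u[0], u[1], u[2])   # coupling tensor
print(P.sum(axis=(1, 2)) - a[0])                        # ~0 after convergence
```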

Remark 10.1 (General measures). The discrete multimarginal


problem (10.1) is generalized to measures (αs )s on spaces
$(\mathcal{X}_1,\dots,\mathcal{X}_S)$ by computing a coupling measure
$$ \min_{\pi\in\mathcal{U}((\alpha_s)_s)} \int_{\mathcal{X}_1\times\dots\times\mathcal{X}_S} c(x_1,\dots,x_S)\,\mathrm{d}\pi(x_1,\dots,x_S) \qquad (10.3) $$

where the set of couplings is
$$ \mathcal{U}((\alpha_s)_s) \stackrel{\text{def.}}{=} \big\{ \pi \in \mathcal{M}_+^1(\mathcal{X}_1\times\dots\times\mathcal{X}_S) \;:\; \forall\, s=1,\dots,S, \ P_{s,\sharp}\pi = \alpha_s \big\} $$

where Ps : X1 × . . . × XS → Xs is the projection on the sth com-


ponent, Ps (x1 , . . . , xS ) = xs , see for instance Gangbo and Swiech
[1998a]. We refer to [Pass, 2015, 2012] for a review of the main prop-
erties of the multi-marginal OT problem. A typical application of
multi-marginal OT is to compute approximations of solutions to
quantum chemistry problems, and in particular, for the so-called
Density Functional Theory [Cotar et al., 2013]. This problem is
obtained when considering the singular Coulomb interaction cost
$$ c(x_1,\dots,x_S) = \sum_{i\neq j} \frac{1}{\|x_i - x_j\|}. $$

Remark 10.2 (Multi-marginal formulation of the barycenter). It is


possible to re-cast the linear program optimization (9.11) as an
optimization over a single coupling over X S+1 where the first
marginal is the barycenter and the other ones are the input mea-
sures $(\alpha_s)_{s=1}^S$:
$$ \min_{\bar\pi\in\mathcal{M}_+^1(\mathcal{X}^{S+1})} \int_{\mathcal{X}^{S+1}} \sum_{s=1}^S \lambda_s\, c(x, x_s)\,\mathrm{d}\bar\pi(x_1,\dots,x_S, x) \qquad (10.4) $$
subject to $P_{s,\sharp}\bar\pi = \alpha_s$ for all $s=1,\dots,S$.
This stems from the “gluing lemma”, which states that given cou-
plings (πs )Ss=1 where πs ∈ U(αs , α), one can construct a higher-
dimensional coupling $\bar\pi \in \mathcal{M}_+^1(\mathcal{X}^{S+1})$ with marginals $\pi_s$, i.e. such
that $Q_{s,\sharp}\bar\pi = \pi_s$ where $Q_s(x_1,\dots,x_S,x) \stackrel{\text{def.}}{=} (x_s, x) \in \mathcal{X}^2$. By ex-
plicitly minimizing in (10.4) with respect to the first marginal (as-
sociated to x ∈ X ), one obtains that solutions α of the barycenter
problem (9.11) can be computed as α = Aλ,] π where Aλ is the
“barycentric map” defined as
$$ A_\lambda : (x_1,\dots,x_S) \in \mathcal{X}^S \;\mapsto\; \operatorname*{argmin}_{x\in\mathcal{X}} \sum_s \lambda_s\, c(x, x_s) $$

(assuming this map is single-valued), where π is any solution of


the multi-marginal problem (10.3) with cost
$$ c(x_1,\dots,x_S) = \sum_\ell \lambda_\ell\, c(x_\ell, A_\lambda(x_1,\dots,x_S)). \qquad (10.5) $$

For instance, for c(x, y) = kx − yk2 , one has, removing the constant
squared terms,
$$ c(x_1,\dots,x_S) = -\sum_{r\leq s} \lambda_r \lambda_s \langle x_r, x_s\rangle, $$

which is a problem studied in Gangbo and Swiech [1998b]. We refer


to Agueh and Carlier [2011] for more details. This formula shows
that if all the input measures are discrete $\beta_s = \sum_{i_s=1}^{n_s} \mathbf{a}_{s,i_s}\,\delta_{x_{s,i_s}}$,

then the barycenter α is also discrete, and is obtained using the



formula
$$ \alpha = \sum_{(i_1,\dots,i_S)} \pi_{(i_1,\dots,i_S)}\, \delta_{A_\lambda(x_{i_1},\dots,x_{i_S})}, $$
where $\pi$ is an optimal solution of (10.1) with cost matrix $\mathbf{C}_{i_1,\dots,i_S} = c(x_{i_1},\dots,x_{i_S})$ as defined in (10.5). Since $\pi$ is a nonnegative tensor of $\prod_s n_s$ dimensions obtained as the solution of a linear program with $\sum_s n_s - S + 1$ equality constraints, an optimal solution $\pi$ with up to $\sum_s n_s - S + 1$ non-zero values can be obtained. A barycenter $\alpha$ with a support of up to $\sum_s n_s - S + 1$ points can therefore be obtained. This direct fact and other considerations in the discrete


case can be found in Anderes et al. [2016].

Remark 10.3 (Relaxation of Euler equations). A convex relaxation


of Euler equations of incompressible fluid dynamics has been
proposed by Brenier [Brenier, 2008, 1990, 1993, 1999, Ambrosio
and Figalli, 2009]. Similarly to the setting exposed in §7.6, it
corresponds to the problem of finding a probability distribution
π̄ ∈ M1+ (X̄ ) over the set X̄ of all paths γ : [0, 1] → X , which
describes the movement of particles in the fluid. This is a relaxed
version of the initial partial differential equation model because,
as in the Kantorovitch formulation of OT, mass can be split. The
evolution with time does not necessarily define anymore a diffeomor-
phism of the underlying space $\mathcal{X}$. The dynamic of the fluid is ob-
tained by minimizing as in (7.17) the energy $\int_0^1 \|\gamma'(t)\|^2\,\mathrm{d}t$ of each

path. The difference with OT over the space of paths is the addi-
tional incompressibility of the fluid. This incompressibility is taken
care of by imposing that the density of particles should be uni-
form at any time t ∈ [0, 1] (and not just imposed at initial and final
times t ∈ {0, 1} as in classical OT). Assuming X is compact and
denoting ρX the uniform distribution on X , this reads P̄t,] π̄ = ρX
where P̄t : γ ∈ X̄ → γ(t) ∈ X . One can discretize this problem by
replacing a continuous path (γ(t))t∈[0,1] by a sequence of S points
$(x_{i_1}, x_{i_2}, \dots, x_{i_S})$ on a grid $(x_k)_{k=1}^n \subset \mathcal{X}$, and $\bar\pi$ is represented by
an $S$-way coupling $\mathbf{P} \in \mathcal{U}((\mathbf{a}_s)_s) \subset \mathbb{R}^{n^S}$ where the marginals are uni-
form as = n−1 1n . The cost of the corresponding multi-marginal

problem is then
$$ \mathbf{C}_{i_1,\dots,i_S} = \sum_{s=1}^{S-1} \|x_{i_s} - x_{i_{s+1}}\|^2 + R\, \|x_{\sigma(i_1)} - x_{i_S}\|^2. \qquad (10.6) $$

Here R is a large enough penalization constant, which is here to


enforce the movement of particles between initial and final times,
which is prescribed by a permutation $\sigma : \llbracket n\rrbracket \to \llbracket n\rrbracket$. The result-
ing multi-marginal problem is implemented efficiently in conjunc-
tion with Sinkhorn iterations (10.2) using the special structure of
the cost, as detailed in Benamou et al. [2015]. Indeed, in place of
the $O(n^S)$ cost required to compute the denominator appearing
in (10.2), one can decompose it as a succession of $S$ matrix-vector
multiplications, hence with a low cost of $Sn^2$. Note that other solvers
have been proposed, for instance using the semi-discrete framework
exposed in §5.2, see [de Goes et al., 2015, Gallouët and Mérigot,
2017].

10.2 Unbalanced Optimal Transport

A major bottleneck of “classical” optimal transport is that it requires


the two input measures (α, β) to have the same total mass. While
many workarounds have been proposed (including re-normalizing the
input measures, or using dual norms such as detailed in §8.2), it is only
recently that a satisfying unifying theory has been developed. We only
sketch here a simple but important particular case.
Following Liero et al. [2015], to account for arbitrary positive his-
tograms (a, b) ∈ Rn+ ×Rm + , the initial Kantorovitch formulation (2.11) is
“relaxed” by only penalizing marginal deviation using some divergence
Dϕ (defined in (8.3)). This equivalently corresponds to minimizing an
OT distance between approximate measures.

$$ L_{\mathbf{C}}^\tau(\mathbf{a},\mathbf{b}) = \min_{\tilde{\mathbf{a}},\tilde{\mathbf{b}}} L_{\mathbf{C}}(\tilde{\mathbf{a}},\tilde{\mathbf{b}}) + \tau_1 D_\varphi(\mathbf{a},\tilde{\mathbf{a}}) + \tau_2 D_\varphi(\mathbf{b},\tilde{\mathbf{b}}) \qquad (10.7) $$
$$ \phantom{L_{\mathbf{C}}^\tau(\mathbf{a},\mathbf{b})} = \min_{\mathbf{P}\in\mathbb{R}_+^{n\times m}} \langle \mathbf{C}, \mathbf{P}\rangle + \tau_1 D_\varphi(\mathbf{P}\mathbb{1}_m|\mathbf{a}) + \tau_2 D_\varphi(\mathbf{P}^\top\mathbb{1}_n|\mathbf{b}). \qquad (10.8) $$

where (τ1 , τ2 ) controls how much mass variations are penalized as op-

posed to transportation of the mass. In the limit $\tau_1 = \tau_2 \to +\infty$, as-
suming $\sum_i \mathbf{a}_i = \sum_j \mathbf{b}_j$ (the “balanced” case), one recovers the original
optimal transport formulation with hard marginal constraint (2.11).
This formalism recovers many different previous works, for instance
introducing for $D_\varphi$ an $\ell^2$ norm [Benamou, 2003] or an $\ell^1$ norm as in
partial transport [Figalli, 2010, Caffarelli and McCann, 2010]. A case
of particular importance is when using for $D_\varphi$ the Kullback-Leibler
divergence KL, as detailed in Remark 10.5. For this cost, in the limit $\tau_1 =
\tau_2 \to 0$, one obtains the so-called squared Hellinger distance (see also
Example 8.3)
$$ L_{\mathbf{C}}^\tau(\mathbf{a},\mathbf{b}) \xrightarrow{\,\tau\to 0\,} \mathfrak{h}^2(\mathbf{a},\mathbf{b}) = \sum_i \big(\sqrt{\mathbf{a}_i} - \sqrt{\mathbf{b}_i}\big)^2. $$

Sinkhorn’s iterations (4.15) can be adapted to this problem by mak-


ing use of the generalized algorithm detailed in §4.6. This means that
the solution has the form (4.12) and that the scalings are updated as
$$ \mathbf{u} \leftarrow \left(\frac{\mathbf{a}}{\mathbf{K}\mathbf{v}}\right)^{\frac{\tau_1}{\tau_1+\varepsilon}} \quad\text{and}\quad \mathbf{v} \leftarrow \left(\frac{\mathbf{b}}{\mathbf{K}^\top\mathbf{u}}\right)^{\frac{\tau_2}{\tau_2+\varepsilon}}. \qquad (10.9) $$
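A minimal NumPy sketch of these unbalanced updates is given below; the data, the values of $(\tau_1,\tau_2)$ and $\varepsilon$, and the number of iterations are illustrative assumptions.

```python
# Illustrative unbalanced Sinkhorn with KL marginal penalties, updates (10.9).
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(30), rng.standard_normal(40)
a = np.full(30, 1.0 / 30)             # input histograms (total masses may differ)
b = np.full(40, 1.5 / 40)
C = (x[:, None] - y[None, :]) ** 2
eps, tau1, tau2 = 0.05, 1.0, 1.0
K = np.exp(-C / eps)
u, v = np.ones(30), np.ones(40)
for _ in range(500):
    u = (a / (K @ v)) ** (tau1 / (tau1 + eps))
    v = (b / (K.T @ u)) ** (tau2 / (tau2 + eps))
P = u[:, None] * K * v[None, :]       # relaxed coupling
print(P.sum(), np.abs(P.sum(axis=1) - a).sum())   # marginals only approximately match
```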

Remark 10.4 (Generic measures). For $(\alpha,\beta)$ two arbitrary measures,
the unbalanced version (also called “log-entropic”) of (2.15) reads
$$ L_c^\tau(\alpha,\beta) \stackrel{\text{def.}}{=} \min_{\pi\in\mathcal{M}_+(\mathcal{X}\times\mathcal{Y})} \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\mathrm{d}\pi(x,y) + \tau D_\varphi(P_{1,\sharp}\pi|\alpha) + \tau D_\varphi(P_{2,\sharp}\pi|\beta), $$

where divergences Dϕ between measures are defined in (8.1). In


the special case c(x, y) = kx − yk2 , Dϕ = KL, Lτc (α, β)1/2 is the
Gaussian-Hellinger distance [Liero et al., 2015], and it is shown to
be a distance on M1+ (Rd ).

Remark 10.5 (Wasserstein-Fisher-Rao). For the particular choice


of cost
c(x, y) = − log cos(min(d(x, y)/κ, π/2))

where $\kappa$ is some cutoff distance, and using $D_\varphi = \mathrm{KL}$, then
$$ \mathrm{WFR}(\alpha,\beta) \stackrel{\text{def.}}{=} L_c^\tau(\alpha,\beta)^{1/2} $$
is the so-called Wasserstein-Fisher-Rao or Hellinger-Kantorovitch


distance. In the special case X = Rd , this static (Kantorovitch-like)
formulation matches its dynamical counterparts (7.15), as proved
independently by Liero et al. [2015], Chizat et al. [2015]. This dy-
namical formulation is detailed in §7.4.

The barycenter problem (9.11) can be generalized to handle an


unbalanced setting by replacing Lc with Lτc . Figure (10.1) shows the
resulting interpolation, providing a good illustration of the usefulness
of the relaxation parameter τ . The input measures are mixtures of
two Gaussians with unequal mass. Classical OT requires the leftmost
bump to be split in two and gives a non-regular interpolation. In sharp
contrast, unbalanced OT allows the mass to vary during interpolation,
so that the bumps are not split and local modes of the distributions are
smoothly matched. Using finite values for τ (recall that OT is equivalent
to τ = ∞) is thus important to prevent irregular interpolations that
arise because of mass splitting, which happens because of a “hard” mass
conservation constraint. The resulting optimization problem can be
tackled numerically using entropic regularization and the generalized
Sinkhorn algorithm detailed in §4.6.
In practice, unbalanced OT techniques seem to outperform classical
OT for applications (such as in imaging or machine learning) where the
input data is noisy or not perfectly known. They are also crucial when
the signal strength of a measure, as measured by its total mass, must be
accounted for, or when normalization is not meaningful. This was the
original motivation of Frogner et al. [2015], whose goal was to compare
sets of word labels used to describe images. Unbalanced OT and the
corresponding Sinkhorn iterations have also been used for application
to the dynamics of cells in Schiebinger et al. [2017].

Remark 10.6 (Connection with dual norms). A particularly simple setup


to account for mass variation is to use dual norms, as detailed in §8.2.
By choosing a compact set B ⊂ C(X ) one obtains a norm defined on the

Figure 10.1: Influence of the relaxation parameter τ on unbalanced barycenters: classical OT (τ = +∞) vs. unbalanced OT (0 < τ < +∞). Top to bottom displays the evolution of the barycenter between two input measures.

whole space M(X ) (in particular, the measures do not need to be posi-
tive). A particular instance of this setting is the flat norm (8.11), which
is recovered as a special instance of unbalanced transport, when using
Dϕ (α|α0 ) = kα − α0 kTV to be the total variation norm (8.9), see for in-
stance Hanin [1992], Lellmann et al. [2014]. We also refer to Schmitzer
and Wirth [2017] for a general framework to define Wasserstein-1 un-
balanced transport.

10.3 Problems with Extra Constraints on the Couplings

Many other OT-like problems have been proposed in the literature.


They typically correspond to adding extra constraints C on the initial
OT problem (2.15)
$$ \min_{\pi\in\mathcal{U}(\alpha,\beta)} \left\{ \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\mathrm{d}\pi(x,y) \;:\; \pi\in\mathcal{C} \right\}. \qquad (10.10) $$

Let us give two representative examples. The optimal transport with


capacity constraint [Korman and McCann, 2015] corresponds to im-

posing that the density ρπ (for instance with respect to the Lebesgue
measure) is upper bounded
C = {π : ρπ ≤ κ} (10.11)
for some κ > 0. This constraints rules out any singular and thin cou-
pling that would be localized on a Monge map. The martingale trans-
port problem (see for instance Galichon et al. [2014], Dolinsky and
Soner [2014], Tan and Touzi [2013], Beiglböck et al. [2013]), which
finds many applications in finance, imposes the so-called martingale
constraint on the conditional mean of the coupling, when X = Y = Rd :

$$ \mathcal{C} = \left\{ \pi \;:\; \forall\, x\in\mathbb{R}^d, \ \int_{\mathbb{R}^d} y\, \frac{\mathrm{d}\pi(x,y)}{\mathrm{d}\alpha(x)\mathrm{d}\beta(y)}\, \mathrm{d}\beta(y) = x \right\}. \qquad (10.12) $$
Note that this constrains any admissible coupling to be such that its
barycentric projection map (4.19) be equal to the identity. For arbitrary
(α, β), this set C is typically empty, but necessary and sufficient condi-
tions exist ($\alpha$ and $\beta$ should be in “convex order”) to ensure $\mathcal{C} \neq \emptyset$ so
that (α, β) satisfy a martingale constraint. This constraint can be dif-
ficult to enforce numerically when discretizing an existing problem. It
also forbids the solution to concentrate on a single Monge-map, and can
lead to couplings concentrated on the union of several graphs (a “multi-
valued” Monge-map), or even more complicated support sets. Using an
entropic penalization as in (4.9), one can solve approximately (10.10)
using the Dykstra algorithm as explained in Benamou et al. [2015],
which is a generalization of Sinkhorn’s algorithm exposed in §4.2. This
requires computing the projection onto $\mathcal{C}$ for the KL divergence, which
is straightforward for (10.11), but cannot be done in closed form for (10.12)
and thus necessitates sub-iterations, see Guo and Obloj [2017] for more
details.

10.4 Sliced Wasserstein Distance and Barycenters

One can define a distance between two measures (α, β) defined on Rd


by aggregating 1-D Wasserstein distances between their projections on
1-D lines. This defines
$$ \mathrm{SW}(\alpha,\beta)^2 \stackrel{\text{def.}}{=} \int_{\mathbb{S}_d} W_2(P_{\theta,\sharp}\alpha, P_{\theta,\sharp}\beta)^2\, \mathrm{d}\theta \qquad (10.13) $$

where $\mathbb{S}_d = \{\theta\in\mathbb{R}^d : \|\theta\|=1\}$ is the $d$-dimensional sphere, and $P_\theta : x\in\mathbb{R}^d \mapsto \langle x,\theta\rangle \in \mathbb{R}$ is the projection. This approach is detailed in Bonneel et al.
[2015], following ideas from Marc Bernot. It is related to the problem
of Radon inversion over measure spaces Abraham et al. [2017].

Lagrangian discretization and stochastic gradient descent. The ad-


vantage of this functional is that 1-D Wasserstein distances are simple
to compute, as detailed in §2.6. In the specific case where $m = n$ and
$$ \alpha = \sum_{i=1}^N \delta_{x_i} \quad\text{and}\quad \beta = \sum_{i=1}^N \delta_{y_i}, \qquad (10.14) $$

this is achieved by simply sorting points,
$$ \mathrm{SW}(\alpha,\beta)^2 = \int_{\mathbb{S}_d} \left( \sum_{i=1}^n |\langle x_{\sigma_\theta(i)} - y_{\kappa_\theta(i)}, \theta\rangle|^2 \right) \mathrm{d}\theta $$

where $\sigma_\theta, \kappa_\theta \in \mathrm{Perm}(n)$ are the permutations ordering in increasing


order respectively (hxi , θi)i and (hyi , θi)i .
Fixing the vector $y$, the function $E_\beta(x) \stackrel{\text{def.}}{=} \mathrm{SW}(\alpha,\beta)^2$ is smooth,
and one can use this function to define a mapping by gradient descent,
$$ x \leftarrow x - \tau \nabla E_\beta(x) \quad\text{where}\quad \nabla E_\beta(x)_i = 2\int_{\mathbb{S}_d} \langle x_i - y_{\kappa_\theta\circ\sigma_\theta^{-1}(i)}, \theta\rangle\, \theta\, \mathrm{d}\theta. \qquad (10.15) $$

To make the method tractable, one can use a stochastic gradient de-
scent (SGD), replacing this integral with a discrete sum against ran-
domly drawn directions θ ∈ Sd (see §5.4 for more details on SGD).
The flow (10.15) can be understood as (a Lagrangian implementation
of) a Wasserstein gradient flow (in the sense of §9.3) of the function
α 7→ SW(α, β)2 . Numerically, one finds that this flow has no local
minimizer and that it thus converges to α = β. The usefulness of the
Lagrangian solver is that, at convergence, it defines a matching (simi-
lar to a Monge map) between the two distributions. This method has
been used successfully for color transfer and texture synthesis in [Ra-
bin et al., 2011] and is related to the alternate minimization approach
detailed in Pitié et al. [2007].
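The following NumPy sketch illustrates such a stochastic Lagrangian flow: at each step, a few random directions θ are drawn, the 1-D projections are sorted, and the particles are moved by an approximate gradient step (10.15); the step size, number of directions and iteration count are arbitrary assumptions made for this example.

```python
# Illustrative stochastic sliced-Wasserstein flow (Lagrangian particles).
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 2
x = rng.standard_normal((N, d))                          # particles of alpha (moved)
y = rng.standard_normal((N, d)) + np.array([4.0, 0.0])   # target samples of beta
tau, n_dir = 0.2, 20
for it in range(300):
    grad = np.zeros_like(x)
    for _ in range(n_dir):                      # Monte Carlo estimate of (10.15)
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        px, py = x @ theta, y @ theta           # 1-D projections
        sx, sy = np.argsort(px), np.argsort(py) # sorting = 1-D optimal matching
        diff = np.zeros(N)
        diff[sx] = px[sx] - py[sy]              # i-th smallest paired with i-th smallest
        grad += 2 * diff[:, None] * theta[None, :]
    x -= tau * grad / n_dir                     # gradient step on E_beta
print(np.mean(x, axis=0), np.mean(y, axis=0))   # the two means should roughly agree
```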

Figure 10.2: Example of sliced barycenter computation using the Radon transform (as defined in (10.20)), shown at t ∈ {0, 1/4, 1/2, 3/4, 1}. Top: barycenters αt for S = 2 input measures and weights (λ1, λ2) = (1 − t, t). Bottom: their Radon transforms R(αt) (the horizontal axis being the orientation angle θ).

It is now simple to extend this Lagrangian scheme to compute ap-


proximate “sliced” barycenters of measures, by mimicking the Frechet
definition of Wasserstein barycenters (9.11) and minimizing
$$ \min_{\alpha\in\mathcal{M}_+^1(\mathcal{X})} \sum_{s=1}^S \lambda_s\, \mathrm{SW}(\alpha,\beta_s)^2, \qquad (10.16) $$

given a set $(\beta_s)_{s=1}^S$ of fixed input measures. Using a Lagrangian dis-
cretization of the form (10.14) for both $\alpha$ and the $(\beta_s)_s$, one can perform
the non-convex minimization over the positions $x = (x_i)_i$,
$$ \min_x\, E(x) \stackrel{\text{def.}}{=} \sum_s \lambda_s E_{\beta_s}(x), \quad\text{and}\quad \nabla E(x) = \sum_s \lambda_s \nabla E_{\beta_s}(x), \qquad (10.17) $$

by gradient descent using formula (10.15) to compute ∇Eβs (x) (coupled


with a random sampling of the direction θ).

Eulerian discretization and Radon transform. A related way to com-


pute approximate sliced barycenters, without resorting to an iterative
minimization scheme, is to use the fact that (10.13) computes a dis-
tance between the Radon transforms $\mathcal{R}(\alpha)$ and $\mathcal{R}(\beta)$ where
$$ \mathcal{R}(\alpha) \stackrel{\text{def.}}{=} (P_{\theta,\sharp}\alpha)_{\theta\in\mathbb{S}_d}. $$

A crucial point is that the Radon transform is invertible, and that


its inverse can be computed using a filtered backprojection formula.
Given a collection of measures $\rho = (\rho_\theta)_{\theta\in\mathbb{S}_d}$, one defines the filtered
backprojection operator as
$$ \mathcal{R}^+(\rho) = C_d\, \Delta^{\frac{d-1}{2}} \mathcal{B}(\rho) \qquad (10.18) $$

where $\xi = \mathcal{B}(\rho) \in \mathcal{M}(\mathbb{R}^d)$ is the measure defined through the relation
$$ \forall\, g\in\mathcal{C}(\mathbb{R}^d), \quad \int_{\mathbb{R}^d} g(x)\,\mathrm{d}\xi(x) = \int_{\mathbb{S}_d} \int_{\mathbb{R}^{d-1}} \int_{\mathbb{R}} g(r\theta + U_\theta z)\,\mathrm{d}\rho_\theta(r)\,\mathrm{d}z\,\mathrm{d}\theta \qquad (10.19) $$
where $U_\theta$ is any orthogonal basis of $\theta^\perp$, and where $C_d\in\mathbb{R}$ is a normaliz-
ing constant which depends on the dimension. Here $\Delta^{\frac{d-1}{2}}$ is a fractional
Laplacian, which is the high-pass filter defined over the Fourier domain
as $\widehat{\Delta^{\frac{d-1}{2}}}(\omega) = \|\omega\|^{d-1}$. The definition of the backprojection (10.19) adds
up the contribution of all the measures (ρθ )θ by extending each one as
being constant in the directions orthogonal to θ. One then has the left-
inverse relation R+ ◦ R = IdM(Rd ) , so that R+ is a valid reconstruction
formula.
In order to compute barycenters of input densities, it makes sense
to replace formula (9.11) by its equivalent using Radon transform, and
thus consider independently for each $\theta$ the 1-D barycenter problem
$$ \rho_\theta^\star \in \operatorname*{argmin}_{\rho_\theta\in\mathcal{M}_+^1(\mathbb{R})} \sum_{s=1}^S \lambda_s\, W_2(\rho_\theta, P_{\theta,\sharp}\beta_s)^2. \qquad (10.20) $$

Each 1-D barycenter problem is easily computed using the monotone


rearrangement as detailed in Remark 9.5. The Radon approximation
$\alpha^R \stackrel{\text{def.}}{=} \mathcal{R}^+(\rho^\star)$ of a sliced barycenter solving (9.11) is then obtained
by the inverse Radon transform R+ . Note that in general, αR is not a
solution to (9.11) because the Radon transform is not surjective, so that
ρ? , which is obtained as a barycenter of the Radon transforms R(βs )
does not necessarily belong to the range of R. But numerically it seems
in practice to be almost the case [Bonneel et al., 2015]. Numerically,
this Radon transform formulation is very effective for input measures
and barycenters discretized on a fixed grid (e.g. a uniform grid for
images), and $\mathcal{R}$ as well as $\mathcal{R}^+$ are computed approximately on this

Figure 10.3: Comparison of barycenters computed using the Radon transform (10.20) (Eulerian discretization), Lagrangian discretization (10.17), and Wasserstein OT (computed using Sinkhorn iterations (9.17)).

grid using fast algorithms (see for instance Averbuch et al. [2001]).
Figure 10.2 illustrates this computation of barycenters (and highlights
the way the Radon transforms are interpolated), while Figure 10.3 shows
a comparison of the Radon barycenters (10.20) and the ones obtained
by Lagrangian discretization (10.17).

Sliced Wasserstein Kernels. Beside its computational simplicity, an-
other advantage of the sliced Wasserstein distance is that it is isometric
to a Euclidean distance (it is thus a “Hilbertian” metric), as detailed in
Remark 2.28 (see in particular formula (2.34)). As highlighted in §8.3,
this should be contrasted with the Wasserstein distance $W_2$ on $\mathbb{R}^d$,
which is not Hilbertian in dimension $d \geq 2$. It is thus possible to use
this sliced distance to equip the space of distributions $\mathcal{M}_+^1(\mathbb{R}^d)$ with a
reproducing kernel Hilbert space structure (as detailed in §8.3). One
can for instance use the exponential and energy distance kernels
$$ k(\alpha,\beta) = e^{-\frac{\mathrm{SW}(\alpha,\beta)^p}{2\sigma^p}} \quad\text{and}\quad k(\alpha,\beta) = -\mathrm{SW}(\alpha,\beta)^p, $$
for $1 \leq p \leq 2$ for the exponential kernels, and $0 < p < 2$ for the energy
distance kernels. This means that for any collection $(\alpha_i)_i$ of input mea-
sures, the matrix $(k(\alpha_i,\alpha_j))_{i,j}$ is symmetric positive semi-definite. It is
possible to use these kernels to perform a variety of machine learning
tasks using the “kernel trick”, for instance in regression, classification
(SVM and logistic), clustering ($K$-means) and dimensionality reduc-
tion (PCA) [Hofmann et al., 2008]. We refer to Kolouri et al. [2016] for
details and applications.
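As an illustration, the sketch below builds a Gaussian-type sliced-Wasserstein kernel matrix between a few point clouds, estimating SW by Monte Carlo over random directions; the function names, the value of σ and the sampling choices are assumptions made for the example.

```python
# Illustrative sliced-Wasserstein kernel matrix (p = 2, Gaussian-type kernel).
import numpy as np

def sliced_w2(x, y, n_dir=100):
    # Monte Carlo estimate of SW(alpha, beta)^2 for two empirical measures with
    # the same number of points, using sorting along random directions.
    rng = np.random.default_rng(0)   # fixed directions so the estimate is symmetric
    d, total = x.shape[1], 0.0
    for _ in range(n_dir):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        total += np.mean((np.sort(x @ theta) - np.sort(y @ theta)) ** 2)
    return total / n_dir

rng = np.random.default_rng(1)
clouds = [rng.standard_normal((100, 2)) + shift for shift in (0.0, 1.0, 3.0)]
sigma = 1.0
D2 = np.array([[sliced_w2(xi, xj) for xj in clouds] for xi in clouds])
K = np.exp(-D2 / (2 * sigma ** 2))     # exponential sliced-Wasserstein kernel
print(np.linalg.eigvalsh(K))           # eigenvalues should be nonnegative (PSD)
```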

10.5 Transporting Vectors and Matrices

Real valued measures α ∈ M(X ) are easily generalized to vector valued


measures α ∈ M(X ; V) where V is some vector space. For notational
simplicity, we assume V is Euclidean and equipped with some inner
product h·, ·i (typically V = Rd and the inner product is the canonical
one). Thanks to this inner product, vector valued measures are identi-
fied with the dual of continuous functions g : X → V, i.e. for any such
g, one defines its integration against the measure as
$$ \int_{\mathcal{X}} g(x)\,\mathrm{d}\alpha(x) \in \mathbb{R} \qquad (10.21) $$

which is a linear operation on g and α. A discrete measure has the form


$\alpha = \sum_i \mathbf{a}_i \delta_{x_i}$ where $(x_i, \mathbf{a}_i) \in \mathcal{X}\times V$, and the integration formula (10.21)
simply reads
$$ \int_{\mathcal{X}} g(x)\,\mathrm{d}\alpha(x) = \sum_i \langle \mathbf{a}_i, g(x_i)\rangle \in \mathbb{R}. $$

Equivalently, if V = Rd , then such a α can be viewed as a collection


$(\alpha_s)_{s=1}^d$ of $d$ “classical” real valued measures (its coordinates), writing
$$ \int_{\mathcal{X}} g(x)\,\mathrm{d}\alpha(x) = \sum_{s=1}^d \int_{\mathcal{X}} g_s(x)\,\mathrm{d}\alpha_s(x), $$

where g(x) = (gs (x))ds=1 are the coordinates of g in the canonical basis.

Dual norms. It is non-trivial, and in fact in general impossible, to


extend OT distances to such a general setting. Even coping with real
valued measures taking both positive and negative values is difficult.
The only simple option is to consider dual norms, as defined in §8.2.
Indeed, formula (6.3) readily extends to M(X ; V) by considering B to
be a subset of C(X ; V). So in particular, W1 , the flat norm and MMD
norms can be computed for vector valued measures.

OT over cone-valued measures. It is possible to define more ad-


vanced OT distances when $\alpha$ is restricted to belong to a subset $\mathcal{M}(\mathcal{X};\mathcal{V}) \subset \mathcal{M}(\mathcal{X};V)$. The set $\mathcal{V}$ should be a positively 1-homogeneous convex cone
of $V$,
$$ \mathcal{V} \stackrel{\text{def.}}{=} \{ \lambda u \;:\; \lambda\in\mathbb{R}_+,\ u \in V_0 \} $$

where V0 is a compact convex set. A typical example is the set of pos-


itive measures where V = Rd+ . Dynamical convex formulations of OT
over such a cone have been proposed, see Zinsl and Matthes [2015].
This has been applied to model the distribution of chemical compo-
nents. Another important example is the set of positive symmetric ma-
trices $\mathcal{V} = \mathcal{S}_+^d \subset \mathbb{R}^{d\times d}$. It is of course possible to use dual norms over

this space, by treating matrices as vectors, see for instance Ning and
Georgiou [2014]. Dynamical convex formulations for OT over such a
cone have been provided Chen et al. [2016b], Jiang et al. [2012]. Some
static (Kantorovitch-like) formulations have been also proposed Ning
et al. [2015], Peyré et al. [2017], but a mathematically sound theoreti-
cal framework is still missing. In particular, it is unclear if these static
approaches define distances for vector valued measures, and if they
relate to some dynamical formulation. Figure 10.4 shows an example of tensor
interpolation obtained using the method detailed in Peyré et al. [2017],
which proposes a generalization of Sinkhorn algorithms using Quantum
relative entropy (10.22) to deal with tensor fields.

OT over positive matrices. A related but quite different setting is to


replace discrete measures, i.e. histograms a ∈ Σn by positive matrices
with unit trace A ∈ Sn+ such that tr(A) = 1. The rationale is that the
eigenvalues $\lambda(A) \in \Sigma_n$ of $A$ play the role of a histogram, but one
has also to take care of the rotations of the eigenvectors, so that this
problem is more complicated.
One can extend several divergences introduced in §8.1 to this set-
ting. For instance, the Bures metric (2.40) is a generalization of the
Hellinger distance (defined in Remark 8.3), since they are equal on
positive diagonal matrices. The Kullback-Leibler divergence (4.6) (see
also Remark 8.1) is likewise generalized to positive matri-


Figure 10.4: Interpolations between two input fields of positive semidefinite ma-
trices (displayed at times t ∈ {0, 1} using ellipses) on some domain (here, a 2-D
planar square and a surface mesh), using the method detailed in Peyré et al. [2017].
Unlike linear interpolation schemes, this OT-like method transports the “mass” of
the tensors (size of the ellipses) as well as their anisotropy and orientation.

ces as
$$ \mathrm{KL}(A|B) \stackrel{\text{def.}}{=} \operatorname{tr}\big(A\log(A) - A\log(B) - A + B\big) \qquad (10.22) $$

where log(·) is the matrix logarithm. This matrix KL is jointly convex in


both of its arguments.
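For illustration, the quantum relative entropy (10.22) can be evaluated directly with a matrix logarithm, as in the following short sketch (random positive definite inputs, purely illustrative):

```python
# Illustrative evaluation of the matrix KL divergence (10.22).
import numpy as np
from scipy.linalg import logm

def quantum_kl(A, B):
    # KL(A|B) = tr(A log A - A log B - A + B) for symmetric positive definite A, B
    return np.trace(A @ logm(A) - A @ logm(B) - A + B).real

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
A = X @ X.T + 0.1 * np.eye(3)    # random positive definite matrices
B = Y @ Y.T + 0.1 * np.eye(3)
print(quantum_kl(A, B), quantum_kl(A, A))   # the second value is ~0
```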
It is possible to solve convex dynamic formulations to define OT
distances between such matrices Carlen and Maas [2014], Chen et al.
[2016b, 2017]. There also exists an equivalent of Sinkhorn’s algorithm,
which is due to Gurvits [2004] and has been extensively studied
in Georgiou and Pavon [2015], see also the review paper Idel [2016]. It is
known to converge only in some cases, but seems empirically to always
work.

10.6 Gromov-Wasserstein Distances

Optimal transport relies on some ground cost in order to compare his-


tograms or measures. A typical setup corresponds to measures defined
on the same metric space. We describe here an approach which deals
with measures defined on different spaces, but which requires solving a
more difficult non-convex optimization problem.


Figure 10.5: Computation of the Hausdorff distance in R2 .

10.6.1 Hausdorff distance

The Hausdorff distance between two sets $A, B \subset \mathcal{Z}$, for some metric $d_{\mathcal{Z}}$, is
$$ \mathcal{H}_{\mathcal{Z}}(A,B) \stackrel{\text{def.}}{=} \max\Big( \sup_{a\in A}\,\inf_{b\in B} d_{\mathcal{Z}}(a,b),\ \sup_{b\in B}\,\inf_{a\in A} d_{\mathcal{Z}}(a,b) \Big). $$

This defines a distance between compact sets K(Z) of Z, and if Z is


compact, then (K(Z), HZ ) is itself compact, see [Burago et al., 2001].
Following [Mémoli, 2011], one remarks that this distance between
sets (A, B) can be defined similarly to the Wasserstein distance between
measures (which should be somehow understood as “weighted” sets).
One replaces the measure couplings (2.14) by set couplings
$$ \mathcal{R}(A,B) \stackrel{\text{def.}}{=} \Big\{ R \subset \mathcal{X}\times\mathcal{Y} \;:\; \forall\, a\in A,\ \exists b\in B,\ (a,b)\in R \ \text{ and }\ \forall\, b\in B,\ \exists a\in A,\ (a,b)\in R \Big\}. $$

With respect to Kantorovitch problem (2.15), one should replace inte-


gration (since one does not have access to measures) by maximization,
and one has

$$ \mathcal{H}_{\mathcal{Z}}(A,B) = \inf_{R\in\mathcal{R}(A,B)} \ \sup_{(a,b)\in R} d_{\mathcal{Z}}(a,b). \qquad (10.23) $$

Note that the support of a measure coupling π ∈ U(α, β) is a set cou-


pling between the supports, i.e. Supp(π) ∈ R(Supp(α), Supp(β)). The
Hausdorff distance is thus connected to the ∞-Wasserstein distance and
one has $\mathcal{H}(A,B) \leq W_\infty(\alpha,\beta)$ for any measures $(\alpha,\beta)$ whose supports
are (A, B).
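For finite point clouds, the definition above can be evaluated directly; the following short NumPy sketch (with arbitrary sample sizes) computes the Hausdorff distance from the pairwise distance matrix.

```python
# Illustrative Hausdorff distance between two finite point clouds in R^2.
import numpy as np

def hausdorff(A, B):
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    return max(D.min(axis=1).max(),    # sup_a inf_b d(a, b)
               D.min(axis=0).max())    # sup_b inf_a d(a, b)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 2))
B = rng.standard_normal((80, 2)) + np.array([2.0, 0.0])
print(hausdorff(A, B))
```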


Figure 10.6: The GH approach to compare two metric spaces.

10.6.2 Gromov-Hausdorff distance


The Gromov-Hausdorff distance [Gromov, 2007] (see also [Edwards,
1975]) is a way to compare two metric spaces $(\mathcal{X}, d_{\mathcal{X}})$, $(\mathcal{Y}, d_{\mathcal{Y}})$ by
quantifying how far they are from being isometric to each other. It is
defined as the minimum Hausdorff distance between every possible
isometric embedding of the two spaces into a third one:
$$ \mathrm{GH}(d_{\mathcal{X}}, d_{\mathcal{Y}}) \stackrel{\text{def.}}{=} \inf_{\mathcal{Z}, f, g} \Big\{ \mathcal{H}_{\mathcal{Z}}(f(\mathcal{X}), g(\mathcal{Y})) \;:\; f : \mathcal{X} \xrightarrow{\text{isom}} \mathcal{Z},\ g : \mathcal{Y} \xrightarrow{\text{isom}} \mathcal{Z} \Big\}. $$

Here, the constraint is that f must be an isometric embedding, meaning


that dZ (f (x), f (x0 )) = dX (x, x0 ) for any (x, x0 ) ∈ X 2 (similarly for g).
One can show that GH defines a distance between compact metric
spaces up to isometry, so that in particular $\mathrm{GH}(d_{\mathcal{X}}, d_{\mathcal{Y}}) = 0$ if and
only if there exists an isometry $h : \mathcal{X} \to \mathcal{Y}$, i.e. $h$ is bijective and
dY (h(x), h(x0 )) = dX (x, x0 ) for any (x, x0 ) ∈ X 2 .
Similarly to (10.23) and as explained in [Mémoli, 2011], it is possible
to re-write equivalently the Gromov-Hausdorff distance using couplings
as follows:
$$ \mathrm{GH}(d_{\mathcal{X}}, d_{\mathcal{Y}}) = \frac{1}{2} \inf_{R\in\mathcal{R}(\mathcal{X},\mathcal{Y})} \ \sup_{((x,y),(x',y'))\in R^2} |d_{\mathcal{X}}(x,x') - d_{\mathcal{Y}}(y,y')|. $$

For discrete spaces $\mathcal{X}=(x_i)_{i=1}^n$, $\mathcal{Y}=(y_j)_{j=1}^m$ represented using distance
matrices $\mathbf{D}=(d_{\mathcal{X}}(x_i,x_{i'}))_{i,i'} \in \mathbb{R}^{n\times n}$, $\mathbf{D}'=(d_{\mathcal{Y}}(y_j,y_{j'}))_{j,j'} \in \mathbb{R}^{m\times m}$,
one can re-write this optimization using binary matrices $R\in\{0,1\}^{n\times m}$
indicating the support of the set couplings $\mathcal{R}$ as follows:
$$ \mathrm{GH}(\mathbf{D},\mathbf{D}') = \frac{1}{2} \inf_{R\mathbb{1}>0,\ R^\top\mathbb{1}>0} \ \max_{(i,i',j,j')} R_{i,j}\, R_{i',j'}\, |\mathbf{D}_{i,i'} - \mathbf{D}'_{j,j'}|. \qquad (10.24) $$


Figure 10.7: GH limit of sequences of metric spaces.

The initial motivation of the GH distance is to define and study limits of


metric spaces, as illustrated in Figure 10.7, and we refer to [Burago et al.,
2001] for details. There is an explicit description of the geodesics for
the GW distance [Chowdhury and Mémoli, 2016], which is very similar
to the one of Gromov-Wasserstein space detailed in Remark 10.8.
The underlying optimization problem (10.24) is highly non-convex,
and computing the global minimum is intractable. It has been ap-
proached numerically using approximation schemes, and has found
applications in vision and graphics for shape matching [Mémoli and
Sapiro, 2005, Bronstein et al., 2006].
It is often desirable to “smooth” the definition of the Hausdorff dis-
tance by replacing the maximization by an integration. This in turn
necessitates the introduction of measures, and it is one of the motiva-
tions for the definition of the Gromov-Wasserstein distance in the next
section.

10.6.3 Gromov-Wasserstein Distance

Optimal transport needs a ground cost C to compare histograms (a, b),


it can thus not be used if the histograms are not defined on the same
underlying space, or if one cannot pre-register these spaces to define
a ground cost. To address this issue, one can instead only assume a
weaker assumption, namely that one has at its disposal two matrices
D ∈ Rn×n and D0 ∈ Rm×m that represent some relationship between
the points on which the histograms are defined. A typical scenario is
when these matrices are (power of) distance matrices. The Gromov-


Figure 10.8: The GW approach to compare two metric measure spaces.

Wasserstein problem reads
$$ \mathrm{GW}((\mathbf{a},\mathbf{D}),(\mathbf{b},\mathbf{D}'))^2 \stackrel{\text{def.}}{=} \min_{\mathbf{P}\in\mathcal{U}(\mathbf{a},\mathbf{b})} \mathcal{E}_{\mathbf{D},\mathbf{D}'}(\mathbf{P}) \stackrel{\text{def.}}{=} \sum_{i,j,i',j'} |\mathbf{D}_{i,i'} - \mathbf{D}'_{j,j'}|^2\, \mathbf{P}_{i,j}\mathbf{P}_{i',j'}. \qquad (10.25) $$
This problem is similar to the GH problem (10.24) when replacing
maximization by a sum, and set couplings by measure couplings. This is
a non-convex problem, which can be recast as a Quadratic Assignment
Problem (QAP) [Loiola et al., 2007] and is in full generality NP-hard to
solve for arbitrary inputs. It is in fact equivalent to a graph matching
problem [Lyzinski et al., 2016] for a particular cost.
One can show that GW satisfies the triangular inequality, and in
fact it defines a distance between metric spaces equipped with a prob-
ability distribution (here assumed to be discrete in definition (10.25))
up to isometries preserving the measures. This distance was introduced
and studied in detail in [Mémoli, 2011]. An in-depth math-
ematical exposition (in particular, its geodesic structure and gradient
flows) is given in Sturm [2012]. See also Schmitzer and Schnörr [2013a]
for applications in computer vision. This distance is also tightly con-
nected with the Gromov-Hausdorff distance [Gromov, 2001] between
metric spaces, which have been used for shape matching Mémoli [2007],
Bronstein et al. [2010].

Remark 10.7 (Gromov-Wasserstein distance). The general setting


corresponds to computing couplings between metric measure
spaces (X , dX , αX ) and (Y, dY , αY ) where (dX , dY ) are distances
10.6. Gromov-Wasserstein Distances 199

and (αX , αY ) are measures on their respective spaces. One defines

$$ \mathrm{GW}((\alpha_{\mathcal{X}}, d_{\mathcal{X}}),(\alpha_{\mathcal{Y}}, d_{\mathcal{Y}}))^2 \stackrel{\text{def.}}{=} \min_{\pi\in\mathcal{U}(\alpha_{\mathcal{X}},\alpha_{\mathcal{Y}})} \int_{\mathcal{X}^2\times\mathcal{Y}^2} |d_{\mathcal{X}}(x,x') - d_{\mathcal{Y}}(y,y')|^2\, \mathrm{d}\pi(x,y)\,\mathrm{d}\pi(x',y'). \qquad (10.26) $$
GW defines a distance between metric measure spaces up to isome-
tries, where one says that (X , αX , dX ) and (Y, αY , dY ) are isometric
if there exists a bijection ϕ : X → Y such that ϕ] αX = αY and
dY (ϕ(x), ϕ(x0 )) = dX (x, x0 ).

Remark 10.8 (Gromov-Wasserstein geodesics). The space of met-


ric spaces (up to isometries) endowed with this GW dis-
tance (10.26) has a geodesic structure. Sturm [2012] shows that
the geodesic between (X0 , dX0 , α0 ) and (X1 , dX1 , α1 ) can be chosen
to be t ∈ [0, 1] 7→ (X0 × X1 , dt , π ? ) where π ? is a solution of (10.26)
and for all ((x0 , x1 ), (x00 , x01 )) ∈ (X0 × X1 )2 ,

$$ d_t((x_0,x_1),(x_0',x_1')) \stackrel{\text{def.}}{=} (1-t)\, d_{\mathcal{X}_0}(x_0,x_0') + t\, d_{\mathcal{X}_1}(x_1,x_1'). $$
This formula allows one to define and analyze gradient flows which
minimize functionals involving metric spaces, see Sturm [2012].
It is however difficult to handle numerically, because it involves
computations over the product space X0 ×X1 . A heuristic approach
is used in Peyré et al. [2016] to define geodesics and barycenters
of metric measure spaces while imposing the cardinality of the
involved spaces and making use of the entropic smoothing (10.27)
detailed below.

10.6.4 Entropic Regularization


To approximate the computation of GW, and to help convergence of
minimization schemes to better minima, one can consider the entropic
regularized variant
$$ \min_{\mathbf{P}\in\mathcal{U}(\mathbf{a},\mathbf{b})} \mathcal{E}_{\mathbf{D},\mathbf{D}'}(\mathbf{P}) - \varepsilon \mathbf{H}(\mathbf{P}). \qquad (10.27) $$

Figure 10.9: Iterations ℓ = 1, 2, 3, 4 of the entropic GW algorithm (10.28) between two shapes (xi)i and (yj)j in R², initialized with P(0) = a ⊗ b. The distance matrices are Di,i′ = ‖xi − xi′‖ and D′j,j′ = ‖yj − yj′‖. Top row: coupling P(ℓ) displayed as a 2-D image. Bottom row: matching induced by P(ℓ) (each point xi is connected to the 3 yj with the 3 largest values among {P(ℓ)i,j}j). The shapes have the same size, but for displaying purposes, the inner shape (xi)i has been reduced in size.

As proposed initially in [Gold and Rangarajan, 1996, Rangarajan et al.,


1999], and later revisited in [Solomon et al., 2016a] for applications in
graphics, one can iteratively use Sinkhorn's algorithm to progressively
compute a stationary point of (10.27). Indeed, successive linearizations
of the objective function lead one to consider the succession of updates
$$ \mathbf{P}^{(\ell+1)} \stackrel{\text{def.}}{=} \operatorname*{argmin}_{\mathbf{P}\in\mathcal{U}(\mathbf{a},\mathbf{b})} \langle \mathbf{P}, \mathbf{C}^{(\ell)}\rangle - \varepsilon\mathbf{H}(\mathbf{P}) \quad\text{where}\quad \mathbf{C}^{(\ell)} \stackrel{\text{def.}}{=} \nabla\mathcal{E}_{\mathbf{D},\mathbf{D}'}(\mathbf{P}^{(\ell)}) = -\mathbf{D}\mathbf{P}^{(\ell)}\mathbf{D}', \qquad (10.28) $$

which can be interpreted as a mirror-descent scheme [Solomon et al.,


2016a]. Each update can thus be solved using Sinkhorn iterations (4.15)
with cost C(`) . Figure 10.9 displays the evolution of the algorithm.
Figure (10.10) illustrates the use of this entropic Gromov-Wasserstein
to compute soft maps between domains.
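The following NumPy sketch illustrates this scheme: each outer iteration forms the linearized cost $\mathbf{C}^{(\ell)} = -\mathbf{D}\mathbf{P}^{(\ell)}\mathbf{D}'$ and solves an entropic OT problem with a basic Sinkhorn loop; the value of ε, the iteration counts and the random test shapes are assumptions made for the example.

```python
# Illustrative entropic Gromov-Wasserstein via successive linearizations (10.28).
import numpy as np

def sinkhorn(C, a, b, eps, n_iter=200):
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
x = rng.standard_normal((30, 2))          # first shape
y = rng.standard_normal((40, 3))          # second shape, in a different space
D = np.linalg.norm(x[:, None] - x[None, :], axis=2)
Dp = np.linalg.norm(y[:, None] - y[None, :], axis=2)
a, b = np.full(30, 1 / 30), np.full(40, 1 / 40)
P = np.outer(a, b)                        # initialization P^(0) = a (x) b
eps = 0.2
for _ in range(20):                       # mirror-descent / linearization steps
    C = -D @ P @ Dp                       # linearized cost C^(l), as in (10.28)
    C = C - C.min()                       # constant shift, harmless for Sinkhorn
    P = sinkhorn(C, a, b, eps)
print(P.sum(axis=1) - a)                  # row marginals approximately match a
```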

Figure 10.10: Example of fuzzy correspondences computed by solving GW prob-


lem (10.27) with Sinkhorn iterations (10.28). Extracted from [Solomon et al., 2016a].
References

Pytorch library. [Link] 2017.


Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin,
et al. Tensorflow: Large-scale machine learning on heterogeneous dis-
tributed systems. arXiv preprint arXiv:1603.04467, 2016.
Isabelle Abraham, Romain Abraham, Maıtine Bergounioux, and Guillaume
Carlier. Tomographic reconstruction from a few views: a multi-marginal
optimal transport approach. Applied Mathematics & Optimization, 75(1):
55–73, 2017.
Ryan Prescott Adams and Richard S Zemel. Ranking via sinkhorn propaga-
tion. arXiv preprint arXiv:1106.1925, 2011.
Martial Agueh and Malcolm Bowles. One-dimensional numerical algorithms
for gradient flows in the p-Wasserstein spaces. Acta Applicandae Mathe-
maticae, 125(1):121–134, 2013.
Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space.
SIAM J. on Mathematical Analysis, 43(2):904–924, 2011.
Martial Agueh and Guillaume Carlier. Vers un théorème de la limite centrale
dans l’espace de Wasserstein? Comptes Rendus Mathematique, 2017.
Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermüller,
Dzmitry Bahdanau, Nicolas Ballas, ..., and Ying Zhang. Theano: A python
framework for fast computation of mathematical expressions. CoRR,
abs/1605.02688, 2016.


Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of


divergence of one distribution from another. Journal of the Royal Statistical
Society. Series B (Methodological), pages 131–142, 1966.
Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much
faster algorithms for matrix scaling. arXiv preprint arXiv:1704.02315, 2017.
Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time ap-
proximation algorithms for optimal transport via Sinkhorn iteration. arXiv
preprint arXiv:1705.09634, 2017.
Pedro C Álvarez-Esteban, E del Barrio, JA Cuesta-Albertos, and C Matrán.
A fixed-point approach to barycenters in Wasserstein space. Journal of
Mathematical Analysis and Applications, 441(2):744–762, 2016.
Shun-ichi Amari, Ryo Karakida, and Masafumi Oizumi. Information geometry
connecting Wasserstein distance and Kullback-Leibler divergence via the
entropy-relaxed transportation problem. arXiv preprint arXiv:1709.10219,
2017.
Luigi Ambrosio and Alessio Figalli. Geodesics in the space of measure-
preserving maps and plans. Arch. Ration. Mech. Anal., 194(2):421–462,
2009. ISSN 0003-9527.
Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric
spaces and in the space of probability measures. Springer Science & Business
Media, 2008.
Ethan Anderes, Steffen Borgwardt, and Jacob Miller. Discrete Wasserstein
barycenters: optimal transport for discrete data. Mathematical Methods of
Operations Research, 84(2):389–409, 2016.
Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Earth mover dis-
tance over high-dimensional spaces. In Proceedings of the nineteenth annual
ACM-SIAM symposium on Discrete algorithms, pages 343–352. Society for
Industrial and Applied Mathematics, 2008.
Alexandr Andoni, Assaf Naor, and Ofer Neiman. Snowflake universality of
Wasserstein spaces. Annales scientifiques de l’École normale supérieure,
2017.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv
preprint arXiv:1701.07875, 2017.
Franz Aurenhammer. Power diagrams: properties, algorithms and applica-
tions. SIAM Journal on Computing, 16(1):78–96, 1987.
Franz Aurenhammer, Friedrich Hoffmann, and Boris Aronov. Minkowski-type
theorems and least-squares clustering. Algorithmica, 20(1):61–76, 1998.

Amir Averbuch, RR Coifman, DL Donoho, Moshe Israeli, and Johan Walden.


Fast slant stack: A notion of radon transform for data in a cartesian grid
which is rapidly computible, algebraically exact, geometrically faithful and
invertible. SIAM Scientific Computing, 2001.
Francis Bach. Self-concordant analysis for logistic regression. Electronic Jour-
nal of Statistics, 4:384–414, 2010.
Francis R Bach. Adaptivity of averaged stochastic gradient descent to lo-
cal strong convexity for logistic regression. Journal of Machine Learning
Research, 15(1):595–627, 2014.
Michael Bacharach. Estimating nonnegative matrices from marginal data.
International Economic Review, 6(3):294–310, 1965.
Alexander Barvinok. A Course in Convexity. Graduate studies in mathemat-
ics. American Mathematical Society, 2002. ISBN 9780821829684.
Federico Bassetti, Antonella Bodini, and Eugenio Regazzini. On minimum
kantorovich distance estimators. Statistics & probability letters, 76(12):
1298–1302, 2006.
Heinz H Bauschke and Patrick L Combettes. Convex Analysis and Monotone
Operator Theory in Hilbert Spaces. Springer-Verlag, New York, 2011.
Heinz H Bauschke and Adrian S Lewis. Dykstra’s algorithm with Bregman
projections: a convergence proof. Optimization, 48(4):409–427, 2000.
Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected sub-
gradient methods for convex optimization. Operations Research Letters, 31
(3):167–175, 2003.
Martin Beckmann. A continuous model of transportation. Econometrica, 20:
643–660, 1952.
Mathias Beiglböck, Pierre Henry-Labordère, and Friedrich Penkner. Model-
independent bounds for option prices: a mass transport approach. Finance
and Stochastics, 17(3):477–501, 2013.
Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C Van der
Meulen. Nonparametric entropy estimation: An overview. International
Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.
Jean-David Benamou. Numerical resolution of an “unbalanced” mass trans-
port problem. ESAIM: Mathematical Modelling and Numerical Analysis,
37(05):851–868, 2003.
Jean-David Benamou and Yann Brenier. A computational fluid mechanics
solution to the Monge-Kantorovich mass transfer problem. Numerische
Mathematik, 84(3):375–393, 2000.

Jean-David Benamou and Guillaume Carlier. Augmented lagrangian meth-


ods for transport optimization, mean field games and degenerate elliptic
equations. Journal of Optimization Theory and Applications, 167(1):1–26,
2015.
Jean-David Benamou, Brittany D Froese, and Adam M Oberman. Numerical
solution of the optimal transportation problem using the Monge–Ampere
equation. Journal of Computational Physics, 260:107–126, 2014.
Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and
Gabriel Peyré. Iterative Bregman projections for regularized transportation
problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138,
2015.
Jean-David Benamou, Guillaume Carlier, Quentin Mérigot, and Edouard
Oudet. Discretization of functionals involving the Monge–Ampère oper-
ator. Numerische Mathematik, 134(3):611–636, 2016a.
Jean-David Benamou, Francis Collino, and Jean-Marie Mirebeau. Monotone
and consistent discretization of the Monge-Ampere operator. Mathematics
of computation, 85(302):2743–2775, 2016b.
Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic
Analysis on Semigroups. Number 100 in Graduate Texts in Mathematics.
Springer Verlag, 1984.
Espen Bernton, Pierre E Jacob, Mathieu Gerber, and Christian P Robert. In-
ference in generative models using the Wasserstein distance. arXiv preprint
arXiv:1701.05146, 2017.
Dimitri P Bertsekas. A new algorithm for the assignment problem. Mathe-
matical Programming, 21(1):152–171, 1981.
Dimitri P Bertsekas. Auction algorithms for network flow problems: A tuto-
rial introduction. Computational optimization and applications, 1(1):7–66,
1992.
Dimitri P Bertsekas. Network optimization: continuous and discrete models.
Athena Scientific Belmont, 1998.
Dimitri P Bertsekas and Jonathan Eckstein. Dual coordinate step methods for
linear network flow problems. Mathematical Programming, 42(1):203–243,
1988.
Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization.
Athena Scientific, 1997.
Jérémie Bigot and Thierry Klein. Consistent estimation of a population
barycenter in the Wasserstein space. Preprint arXiv:1212.2562, 2012a.

Jérémie Bigot and Thierry Klein. Characterization of barycenters in the


Wasserstein space by averaging optimal transport maps. arXiv preprint
arXiv:1212.2562, 2012b.
Jérémie Bigot, Raúl Gouet, Thierry Klein, and Alfredo López. Geodesic PCA
in the Wasserstein space by convex pca. In Annales de l’Institut Henri
Poincaré, Probabilités et Statistiques, volume 53, pages 1–26. Institut Henri
Poincaré, 2017.
Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tu-
cumán Rev. Ser. A, 5:147–151, 1946.
Garrett Birkhoff. Extensions of jentzsch’s theorem. Transactions of the Amer-
ican Mathematical Society, 85(1):219–227, 1957.
Adrien Blanchet and Guillaume Carlier. Optimal transport and Cournot-Nash
equilibria. Mathematics of Operations Research, 41(1):125–145, 2015.
Adrien Blanchet, Vincent Calvez, and José A Carrillo. Convergence of the
mass-transport steepest descent scheme for the subcritical Patlak-Keller-
Segel model. SIAM Journal on Numerical Analysis, 46(2):691–721, 2008.
Emmanuel Boissard. Simple bounds for the convergence of empirical and oc-
cupation measures in 1-Wasserstein distance. Electronic Journal of Proba-
bility, 16:2296–2333, 2011.
Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribu-
tion’s template estimate with Wasserstein metrics. Bernoulli, 21(2):740–
759, 2015.
François Bolley, Arnaud Guillin, and Cédric Villani. Quantitative concentra-
tion inequalities for empirical measures on non-compact spaces. Probability
Theory and Related Fields, 137(3):541–593, 2007.
Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Hei-
drich. Displacement interpolation using lagrangian mass transport. ACM
Transactions on Graphics (TOG), 30(6):158, 2011.
Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced
and Radon Wasserstein barycenters of measures. Journal of Mathematical
Imaging and Vision, 51(1):22–45, 2015.
Nicolas Bonneel, Gabriel Peyré, and Marco Cuturi. Wasserstein barycentric
coordinates: Histogram regression using optimal transport. ACM Transac-
tions on Graphics (Proc. SIGGRAPH 2016), 35(4):71:1–71:10, 2016. .
CW Borchardt and CGJ Jacobi. De investigando ordine systematis aequa-
tionum differentialium vulgarium cujuscunque. Journal für die reine und
angewandte Mathematik, 64:297–320, 1865.

Ingwer Borg and Patrick JF Groenen. Modern multidimensional scaling: The-


ory and applications. Springer Science & Business Media, 2005.
Mario Botsch, Leif Kobbelt, Mark Pauly, Pierre Alliez, and Bruno Lévy. Poly-
gon Mesh Processing. Taylor & Francis, 2010.
Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel,
and Bernhard Schoelkopf. From optimal transport to generative modeling:
the VEGAN cookbook. Preprint 1705.07642, Arxiv, 2017.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eck-
stein. Distributed optimization and statistical learning via the alternating
direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122,
January 2011.
Lev M Bregman. The relaxation method of finding the common point of
convex sets and its application to the solution of problems in convex pro-
gramming. USSR computational mathematics and mathematical physics, 7
(3):200–217, 1967.
Yann Brenier. The least action principle and the related concept of generalized
flows for incompressible perfect fluids. J. of the AMS, 2:225–255, 1990.
Yann Brenier. Polar factorization and monotone rearrangement of vector-
valued functions. Comm. Pure Appl. Math., 44(4):375–417, 1991.
Yann Brenier. The dual least action problem for an ideal, incompressible fluid.
Archive for Rational Mechanics and Analysis, 122(4):323–351, 1993. ISSN
0003-9527.
Yann Brenier. Minimal geodesics on groups of volume-preserving maps and
generalized solutions of the Euler equations. Communications on Pure and
Applied Mathematics, 52(4):411–452, 1999. ISSN 1097-0312.
Yann Brenier. Generalized solutions and hydrostatic approximation of the
Euler equations. Phys. D, 237(14-17):1982–1988, 2008. ISSN 0167-2789.
Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Generalized
multidimensional scaling: a framework for isometry-invariant partial surface
matching. Proceedings of the National Academy of Sciences, 103(5):1168–
1172, 2006.
Alexander M Bronstein, Michael M Bronstein, Ron Kimmel, Mona Mah-
moudi, and Guillermo Sapiro. A Gromov-Hausdorff framework with dif-
fusion geometry for topologically-robust non-rigid shape matching. Inter.
Journal on Computer Vision, 89(2-3):266–286, 2010.
Richard A Brualdi. Combinatorial matrix classes, volume 108. Cambridge
University Press, 2006.
Dmitri Burago, Yuri Burago, and Sergei Ivanov. A course in metric geometry,
volume 33. American Mathematical Society Providence, RI, 2001.
Donald Bures. An extension of Kakutani’s theorem on infinite product mea-
sures to the tensor product of semifinite w∗-algebras. Transactions of the
American Mathematical Society, 135:199–212, 1969.
Martin Burger, José Antonio Carrillo de la Plata, and Marie-Therese Wolfram.
A mixed finite element method for nonlinear diffusion equations. Kinetic
and Related Models, 3(1):59–83, 2010.
Martin Burger, Marzena Franek, and Carola-Bibiane Schönlieb. Regularised
regression and density estimation based on optimal transport. Appl. Math.
Res. Express, 2:209–253, 2012.
Luis Caffarelli. The Monge-Ampere equation and optimal transportation, an
elementary review. Lecture Notes in Mathematics, Springer-Verlag, pages
1–10, 2003.
Luis Caffarelli, Mikhail Feldman, and Robert McCann. Constructing optimal
maps for Monge’s transport problem as a limit of strictly convex costs.
Journal of the American Mathematical Society, 15(1):1–26, 2002.
Luis A Caffarelli and Robert J McCann. Free boundaries in optimal transport
and Monge-Ampère obstacle problems. Ann. of Math., 171(2):673–730,
2010. ISSN 0003-486X.
Luis A Caffarelli, Sergey A Kochengin, and Vladimir I Oliker. Problem of re-
flector design with given far-field scattering data. In Monge Ampère Equa-
tion: Applications to Geometry and Optimization: NSF-CBMS Conference
on the Monge Ampère Equation, Applications to Geometry and Optimiza-
tion, July 9-13, 1997, Florida Atlantic University, volume 226, page 13.
American Mathematical Soc., 1999.
Guillermo Canas and Lorenzo Rosasco. Learning probability measures with
respect to optimal transport metrics. In F. Pereira, C. J. C. Burges, L. Bot-
tou, and K. Q. Weinberger, editors, Advances in Neural Information Pro-
cessing Systems 25, pages 2492–2500. Curran Associates, Inc., 2012.
Eric A Carlen and Jan Maas. An analog of the 2-Wasserstein metric in non-
commutative probability under which the fermionic Fokker–Planck equa-
tion is gradient flow for the entropy. Communications in Mathematical
Physics, 331(3):887–926, 2014.
Guillaume Carlier and Ivar Ekeland. Matching for teams. Econom. Theory,
42(2):397–418, 2010. ISSN 0938-2259. .
Guillaume Carlier and Clarice Poon. On the total variation Wasserstein gradi-
ent flow and the TV-JKO scheme. arXiv preprint arXiv:1703.00243, 2017.
Guillaume Carlier, Chloé Jimenez, and Filippo Santambrogio. Optimal trans-
portation with traffic congestion and Wardrop equilibria. SIAM J. Control
Optim., 47(3):1330–1350, 2008.
Guillaume Carlier, Alfred Galichon, and Filippo Santambrogio. From Knothe’s
transport to Brenier’s map and a continuation method for optimal trans-
port. SIAM Journal on Mathematical Analysis, 41(6):2554–2576, 2010.
Guillaume Carlier, Adam Oberman, and Edouard Oudet. Numerical methods
for matching for teams and Wasserstein barycenters. ESAIM: Mathematical
Modelling and Numerical Analysis, 49(6):1621–1642, 2015.
Guillaume Carlier, Victor Chernozhukov, and Alfred Galichon. Vec-
tor quantile regression beyond correct specification. arXiv preprint
arXiv:1610.06833, 2016.
Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer.
Convergence of entropic schemes for optimal transport and gradient flows.
SIAM Journal on Mathematical Analysis, 49(2):1385–1418, 2017.
José A Carrillo and J Salvador Moll. Numerical simulation of diffusive and
aggregation phenomena in nonlinear continuity equations by evolving dif-
feomorphisms. SIAM Journal on Scientific Computing, 31(6):4305–4329,
2009.
José A Carrillo, Alina Chertock, and Yanghong Huang. A finite-volume
method for nonlinear nonlocal equations with a gradient flow structure.
Communications in Computational Physics, 17:233–258, 1 2015. ISSN 1991-
7120.
Yair Censor and Simeon Reich. The Dykstra algorithm with Bregman pro-
jections. Communications in Applied Analysis, 2:407–419, 1998.
Thierry Champion, Luigi De Pascale, and Petri Juutinen. The ∞-Wasserstein
distance: Local solutions and existence of optimal transport maps. SIAM
Journal on Mathematical Analysis, 40(1):1–20, 2008.
Timothy M Chan. Optimal output-sensitive convex hull algorithms in two
and three dimensions. Discrete & Computational Geometry, 16(4):361–368,
1996.
Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. On the relation
between optimal transport and Schrödinger bridges: A stochastic control
viewpoint. Journal of Optimization Theory and Applications, 169(2):671–
691, 2016a.
Yongxin Chen, Tryphon T Georgiou, and Allen Tannenbaum. Matrix
optimal mass transport: A quantum mechanical approach. Preprint
arXiv:1610.03041, 2016b.
Yongxin Chen, Wilfrid Gangbo, Tryphon T Georgiou, and Allen Tannenbaum.
On the matrix Monge-Kantorovich problem. Preprint arXiv:1701.02826,
2017.
Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier
Vialard. Unbalanced optimal transport: Geometry and Kantorovich for-
mulation. Preprint 1508.05216, Arxiv, 2015.
Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier
Vialard. An interpolating distance between optimal transport and Fisher–
Rao metrics. Foundations of Computational Mathematics, pages 1–44, 2016.
Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier
Vialard. Scaling algorithms for unbalanced transport problems. to appear
in Mathematics of Computation, 2017.
Shui-Nee Chow, Wen Huang, Yao Li, and Haomin Zhou. Fokker-Planck equa-
tions for a free energy functional or Markov process on a graph. Archive for
Rational Mechanics and Analysis, 203(3):969–1008, 2012. ISSN 0003-9527.
Samir Chowdhury and Facundo Mémoli. Constructing geodesics on the space
of compact metric spaces. arXiv preprint arXiv:1603.02385, 2016.
Imre Csiszár. Information-type measures of difference of probability distri-
butions and indirect observations. Studia Sci. Math. Hungar., 2:299–318,
1967.
Michael B Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu.
Matrix scaling and balancing via box constrained Newton’s method and
interior point methods. arXiv preprint arXiv:1704.02310, 2017.
Patrick L Combettes and Jean-Christophe Pesquet. A Douglas-Rachford split-
ting approach to nonsmooth convex variational signal recovery. IEEE Jour-
nal of Selected Topics in Signal Processing, 1(4):564 –574, 2007. ISSN 1932-
4553. .
Roberto Cominetti and Jaime San Martín. Asymptotic analysis of the expo-
nential penalty trajectory in linear programming. Mathematical Program-
ming, 67(1-3):169–187, 1994.
Laurent Condat. Fast projection onto the simplex and the ℓ1 ball. Proc. Math.
Program., Ser. A, pages 1–11, 2015.
Sueli IR Costa, Sandra A Santos, and João E Strapasson. Fisher information
distance: a geometrical reading. Discrete Applied Mathematics, 197:59–69,
2015.
Codina Cotar, Gero Friesecke, and Claudia Klüppelberg. Density functional
theory and optimal transportation with Coulomb cost. Communications
on Pure and Applied Mathematics, 66(4):548–599, 2013. ISSN 1097-0312.
Nicolas Courty, Rémi Flamary, Devis Tuia, and Thomas Corpetti. Optimal
transport for data fusion in remote sensing. In Geoscience and Remote
Sensing Symposium (IGARSS), 2016 IEEE International, pages 3571–3574.
IEEE, 2016.
Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy.
Joint distribution optimal transportation for domain adaptation. arXiv
preprint arXiv:1705.08848, 2017a.
Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Opti-
mal transport for domain adaptation. IEEE transactions on pattern anal-
ysis and machine intelligence, 39(9):1853–1865, 2017b.
Keenan Crane, Clarisse Weischedel, and Max Wardetzky. Geodesics in heat:
A new approach to computing distance based on heat flow. ACM Trans.
Graph., 32(5):152:1–152:11, October 2013.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal trans-
port. In Advances in Neural Information Processing Systems (NIPS) 26,
pages 2292–2300, 2013.
Marco Cuturi and David Avis. Ground metric learning. The Journal of
Machine Learning Research, 15:533–564, 2014.
Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycen-
ters. In Proceedings of ICML, volume 32, pages 685–693, 2014.
Marco Cuturi and Kenji Fukumizu. Kernels on structured objects through
nested histograms. In P. B. Schölkopf, J. C. Platt, and T. Hoffman, editors,
Advances in Neural Information Processing Systems 19, pages 329–336.
MIT Press, 2007.
Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational
Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343,
2016. .
Arnak S Dalalyan and Avetik G Karagulyan. User-friendly guarantees
for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint
arXiv:1710.00095, 2017.
George B. Dantzig. Programming of interdependent activities: II mathemat-
ical model. Econometrica, 17(3/4):200–211, 1949.
George B Dantzig. Application of the simplex method to a transportation
problem. Activity analysis of production and allocation, 13:359–373, 1951.
George B. Dantzig. Reminiscences About the Origins of Linear Programming,
pages 78–86. Springer Berlin Heidelberg, 1983.
George B. Dantzig. Linear programming. In J. K. Lenstra, A. H. G. Rinnooy
Kan, and A. Schrijver, editors, History of Mathematical Programming: A
Collection of Personal Reminiscences, pages 257–282. Elsevier Science Pub-
lishers, 1991.
Jon Dattorro. Convex optimization & Euclidean distance geometry. Meboo
publishing, 2017.
Fernando De Goes, Katherine Breeden, Victor Ostromoukhov, and Mathieu
Desbrun. Blue noise through optimal transport. ACM Transactions on
Graphics (TOG), 31(6):171, 2012.
Fernando de Goes, Corentin Wallez, Jin Huang, Dmitry Pavlov, and Mathieu
Desbrun. Power particles: An incompressible fluid solver based on power
diagrams. ACM Trans. Graph., 34(4):50:1–50:11, July 2015.
Eustasio del Barrio, JA Cuesta-Albertos, C Matrán, and A Mayo-Íscar. Ro-
bust clustering tools based on optimal transportation. arXiv preprint
arXiv:1607.01179, 2016.
Julie Delon. Midway image equalization. Journal of Mathematical Imaging
and Vision, 21(2):119–134, 2004.
Julie Delon, Julien Salomon, and Andrei Sobolevski. Fast transport optimiza-
tion for Monge costs on the circle. SIAM Journal on Applied Mathematics,
70(7):2239–2258, 2010.
Julie Delon, Julien Salomon, and Andrei Sobolevski. Local matching indica-
tors for transport problems with concave costs. SIAM Journal on Discrete
Mathematics, 26(2):801–827, 2012.
Edwards Deming and Frederick F Stephan. On a least squares adjustment of
a sampled frequency table when the expected marginal totals are known.
Annals Mathematical Statistics, 11(4):427–444, 1940.
Steffen Dereich, Michael Scheutzow, and Reik Schottstedt. Constructive quan-
tization: Approximation by empirical measures. In Annales de l’Institut
Henri Poincaré, Probabilités et Statistiques, volume 49, pages 1183–1203.
Institut Henri Poincaré, 2013.
Rachid Deriche. Recursively implementating the Gaussian and its derivatives.
PhD thesis, INRIA, 1993.
Arnaud Dessein, Nicolas Papadakis, and Jean-Luc Rouas. Regularized optimal
transport and the Rot mover’s distance. arXiv preprint arXiv:1610.06447,
2016.
Simone Di Marino and Lenaic Chizat. A tumor growth model of Hele-Shaw
type as a gradient flow. Arxiv, 2017.
Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sub-
linear time algorithms for earth mover’s distance. Theory of Computing
Systems, 48(2):428–442, 2011.
Jean Dolbeault, Bruno Nazaret, and Giuseppe Savaré. A new class of trans-
port distances between measures. Calculus of Variations and Partial Dif-
ferential Equations, 34(2):193–231, 2009.
Yan Dolinsky and H Mete Soner. Martingale optimal transport and robust
hedging in continuous time. Probability Theory and Related Fields, 160
(1-2):391–427, 2014.
Richard M. Dudley. The speed of mean Glivenko-Cantelli convergence. The
Annals of Mathematical Statistics, 40(1):40–50, 1969.
Arnaud Dupuy, Alfred Galichon, and Yifei Sun. Estimating matching affinity
matrix under low-rank constraints. Arxiv:1612.09585, 2016.
Richard L Dykstra. An algorithm for restricted least squares regression. J.
Amer. Stat., 78(384):839–842, 1983.
Richard L Dykstra. An iterative procedure for obtaining I-projections onto
the intersection of convex sets. Ann. Probab., 13(3):975–984, 1985. ISSN
0091-1798.
Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas-Rachford split-
ting method and the proximal point algorithm for maximal monotone op-
erators. Mathematical Programming, 55:293–318, 1992.
David A Edwards. The structure of superspace. In Studies in topology, pages
121–133. Elsevier, 1975.
Tarek A El Moselhy and Youssef M Marzouk. Bayesian inference with optimal
maps. Journal of Computational Physics, 231(23):7815–7850, 2012.
Dominik Maria Endres and Johannes E Schindelin. A new metric for proba-
bility distributions. IEEE Transactions on Information theory, 49(7):1858–
1860, 2003.
Matthias Erbar. The heat equation on manifolds as a gradient flow in the
Wasserstein space. Annales de l’Institut Henri Poincaré, Probabilités et
Statistiques, 46(1):1–23, 2010.
Matthias Erbar and Jan Maas. Gradient flow structures for discrete porous
medium equations. Discrete Contin. Dyn. Syst., 34(4):1355–1374, 2014.
Matthias Erbar, Martin Rumpf, Bernhard Schmitzer, and Stefan Simon. Com-
putation of optimal transport on discrete metric measure spaces. arXiv
preprint arXiv:1707.06859, 2017.
Sven Erlander. Optimal Spatial Interaction and the Gravity Model, volume
173. Springer-Verlag, 1980.
Sven Erlander and Neil F Stewart. The gravity model in transportation anal-
ysis: theory and extensions. VSP, 1990.
Montacer Essid and Justin Solomon. Quadratically-regularized optimal trans-
port on graphs. arXiv preprint arXiv:1704.08200, 2017.
Mikhail Feldman and Robert McCann. Monge’s transport problem on a Rie-
mannian manifold. Trans. AMS, 354(4):1667–1697, 2002.
Jean Feydy, Benjamin Charlier, Francois-Xavier Vialard, and Gabriel Peyré.
Optimal transport for diffeomorphic registration. In Proc. MICCAI’17,
2017.
Alessio Figalli. The optimal partial transport problem. Archive for rational
mechanics and analysis, 195(2):533–560, 2010.
Rémi Flamary, Cédric Févotte, Nicolas Courty, and Valentin Emiya. Op-
timal spectral transportation with application to music transcription. In
Advances in Neural Information Processing Systems, pages 703–711, 2016.
Lester Randolph Ford and Delbert Ray Fulkerson. Flows in Networks. Prince-
ton University Press, 1962.
Peter J Forrester and Mario Kieburg. Relating the Bures measure to the
Cauchy two-matrix model. Communications in Mathematical Physics, 342
(1):151–187, 2016.
Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasser-
stein distance of the empirical measure. Probability Theory and Related
Fields, 162(3-4):707–738, 2015.
Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices.
Linear Algebra and its applications, 114:717–735, 1989.
Uriel Frisch, Sabino Matarrese, Roya Mohayaee, and Andrei Sobolevski. A
reconstruction of the initial conditions of the universe by optimal mass
transportation. Nature, 417(6886):260–262, 2002.
Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and
Tomaso A Poggio. Learning with a Wasserstein loss. In Advances in Neural
Information Processing Systems, pages 2053–2061, 2015.
Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of
nonlinear variational problems via finite element approximation. Computers
& Mathematics with Applications, 2(1):17–40, 1976.
Alfred Galichon. Optimal Transport Methods in Economics. Princeton Uni-
versity Press, 2016.
Alfred Galichon and Bernard Salanié. Matching with trade-offs: Revealed pref-
erences over competing characteristics. Technical report, Preprint SSRN-
1487307, 2009.
Alfred Galichon, Pierre Henry-Labordere, and Nizar Touzi. A stochastic con-
trol approach to no-arbitrage bounds given marginals, with an application
to lookback options. The Annals of Applied Probability, 24(1):312–336,
2014.
Thomas O Gallouët and Quentin Mérigot. A Lagrangian scheme à la Brenier
for the incompressible Euler equations. Foundations of Computational
Mathematics, pages 1–31, 2017.
Thomas O Gallouët and Leonard Monsaingeon. A JKO splitting scheme for
Kantorovich–Fisher–Rao gradient flows. SIAM Journal on Mathematical
Analysis, 49(2):1100–1130, 2017.
Wilfrid Gangbo and Robert J McCann. The geometry of optimal transporta-
tion. Acta Mathematica, 177(2):113–161, 1996.
Wilfrid Gangbo and Andrzej Swiech. Optimal maps for the multidimensional
Monge-Kantorovich problem. Communications on pure and applied math-
ematics, 51(1):23–45, 1998a.
Wilfrid Gangbo and Andrzej Swiech. Optimal maps for the multidimensional
Monge-Kantorovich problem. Communications on pure and applied math-
ematics, 51(1):23–45, 1998b.
Matthias Gelbrich. On a formula for the L2 Wasserstein metric between mea-
sures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):
185–203, 1990.
Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochas-
tic optimization for large-scale optimal transport. In Advances in Neural
Information Processing Systems, pages 3440–3448, 2016.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models
with Sinkhorn divergences. Preprint 1706.00292, Arxiv, 2017a.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. GAN and VAE from an
optimal transport point of view. Preprint 1706.01807, Arxiv, 2017b.
Ivan Gentil, Christian Léonard, and Luigia Ripani. About the anal-
ogy between optimal transport and minimal entropy. arXiv preprint
arXiv:1510.08230, 2015.
Alan George and Joseph WH Liu. The evolution of the minimum degree
ordering algorithm. SIAM Review, 31(1):1–19, 1989.
Tryphon T Georgiou and Michele Pavon. Positive contraction mappings
for classical and quantum Schrödinger systems. Journal of Mathematical
Physics, 56(3):033301, 2015.
Pascal Getreuer. A survey of Gaussian convolution algorithms. Image Pro-
cessing On Line, 2013:286–310, 2013.
Ugo Gianazza, Giuseppe Savaré, and Giuseppe Toscani. The Wasserstein
gradient flow of the Fisher information and the quantum drift-diffusion
equation. Archive for Rational Mechanics and Analysis, 194(1):133–220,
2009. ISSN 0003-9527. .
Alison L Gibbs and Francis Edward Su. On choosing and bounding probability
metrics. International statistical review, 70(3):419–435, 2002.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for
large-scale sentiment classification: A deep learning approach. In Proceed-
ings of the 28th international conference on machine learning (ICML-11),
pages 513–520, 2011.
Roland Glowinski and A. Marroco. Sur l’approximation, par éléments fi-
nis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de
problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and
Numerical Analysis - Modélisation Mathématique et Analyse Numérique, 9
(R2):41–76, 1975.
Steven Gold and Anand Rangarajan. A graduated assignment algorithm for
graph matching. IEEE Trans. on PAMI, 18(4):377–388, April 1996.
Kristen Grauman and Trevor Darrell. The pyramid match kernel: Discrimi-
native classification with sets of image features. In Computer Vision, 2005.
ICCV 2005. Tenth IEEE International Conference on, volume 2, pages
1458–1465. IEEE, 2005.
Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf,
and Alex J Smola. A kernel method for the two-sample-problem. In Ad-
vances in neural information processing systems, pages 513–520, 2007.
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf,
and Alexander Smola. A kernel two-sample test. Journal of Machine Learn-
ing Research, 13(Mar):723–773, 2012.
Andreas Griewank. Achieving logarithmic growth of temporal and spatial
complexity in reverse automatic differentiation. Optimization Methods and
software, 1(1):35–54, 1992.
Andreas Griewank and Andrea Walther. Evaluating derivatives: principles
and techniques of algorithmic differentiation. SIAM, 2008.
Mikhail Gromov. Metric Structures for Riemannian and Non-Riemannian
Spaces. Progress in Mathematics. Birkhäuser, 2001.
Mikhail Gromov. Metric structures for Riemannian and non-Riemannian
spaces. Springer Science & Business Media, 2007.
Gaoyue Guo and Jan Obloj. Computational methods for martingale optimal
transport problems. arXiv preprint arXiv:1710.07911, 2017.
Leonid Gurvits. Classical complexity and quantum entanglement. Journal of
Computer and System Sciences, 69(3):448–484, 2004.
Cristian E Gutiérrez. The Monge-Ampere equation. Springer, 2016.
Jorge Gutierrez, Julien Rabin, Bruno Galerne, and Thomas Hurtut. Optimal
patch assignment for statistically constrained texture synthesis. In Inter-
national Conference on Scale Space and Variational Methods in Computer
Vision, pages 172–183. Springer, 2017.
A Hadjidimos. Successive overrelaxation (SOR) and related methods. Journal
of Computational and Applied Mathematics, 123(1):177–199, 2000.
Steven Haker, Lei Zhu, Allen Tannenbaum, and Sigurd Angenent. Optimal
mass transport for registration and warping. International Journal of com-
puter vision, 60(3):225–240, 2004.
Leonid G Hanin. Kantorovich-Rubinstein norm and its application in the the-
ory of Lipschitz spaces. Proceedings of the American Mathematical Society,
115(2):345–352, 1992.
Tatsunori Hashimoto, David Gifford, and Tommi Jaakkola. Learning
population-level diffusions with generative RNNs. In International Confer-
ence on Machine Learning, pages 2417–2426, 2016.
David Hilbert. Über die gerade Linie als kürzeste Verbindung zweier Punkte.
Mathematische Annalen, 46(1):91–96, 1895.
Frank L Hitchcock. The distribution of a product from several sources to
numerous localities. Studies in Applied Mathematics, 20(1-4):224–230, 1941.
Nhat Ho, XuanLong Nguyen, Mikhail Yurochkin, Hung Hai Bui, Viet Huynh,
and Dinh Phung. Multilevel clustering via Wasserstein means. arXiv
preprint arXiv:1706.03883, 2017.
Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel meth-
ods in machine learning. Ann. Statist., 36(3):1171–1220, 2008.
David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied
logistic regression, volume 398. John Wiley & Sons, 2013.
Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q
Weinberger. Supervised word mover’s distance. In Advances in Neural
Information Processing Systems, pages 4862–4870, 2016.
Martin Idel. A review of matrix scaling and Sinkhorn’s normal form for
matrices and positive maps. arXiv preprint arXiv:1609.06349, 2016.
Piotr Indyk and Eric Price. K-median clustering, model-based compressive
sensing, and sparse recovery for earth mover distance. In Proceedings of the
forty-third annual ACM symposium on Theory of computing, pages 627–
636. ACM, 2011.
Piotr Indyk and Nitin Thaper. Fast image retrieval via embeddings. In
3rd International Workshop on Statistical and Computational Theories of
Vision (at ICCV), 2003.
Xianhua Jiang, Lipeng Ning, and Tryphon T Georgiou. Distances and Rie-
mannian metrics for multivariate spectral densities. IEEE Transactions on
Automatic Control, 57(7):1723–1735, July 2012. ISSN 0018-9286. .
William B Johnson and Joram Lindenstrauss. Extensions of Lipschitz map-
pings into a Hilbert space. In Conference in modern analysis and probability
(New Haven, Conn., 1982), volume 26 of Contemp. Math., pages 189–206.
Amer. Math. Soc., Providence, RI, 1984.
Lee K Jones and Charles L Byrne. General entropy criteria for inverse prob-
lems, with applications to data compression, pattern classification, and clus-
ter analysis. IEEE transactions on Information Theory, 36(1):23–30, 1990.
Richard Jordan, David Kinderlehrer, and Felix Otto. The variational for-
mulation of the Fokker-Planck equation. SIAM journal on mathematical
analysis, 29(1):1–17, 1998.
Leonid Kantorovich. On the transfer of masses (in Russian). Doklady Akademii
Nauk, 37(2):227–229, 1942.
LV Kantorovich and G.S. Rubinshtein. On a space of totally additive func-
tions. Vestn Lening. Univ., 13:52–59, 1958.
Hermann Karcher. Riemannian center of mass and so called Karcher mean.
arXiv preprint arXiv:1407.2087, 2014.
Johan Karlsson and Axel Ringh. Generalized Sinkhorn iterations for reg-
ularizing inverse problems using optimal mass transport. arXiv preprint
arXiv:1612.02273, 2016.
Sanggyun Kim, Rui Ma, Diego Mesa, and Todd P Coleman. Efficient Bayesian
inference methods via convex optimization and optimal transport. In In-
formation Theory Proceedings (ISIT), 2013 IEEE International Symposium
on, pages 2259–2263. IEEE, 2013.
David Kinderlehrer and Noel J Walkington. Approximation of parabolic equa-
tions using the Wasserstein metric. ESAIM: Mathematical Modelling and
Numerical Analysis, 33(04):837–852, 1999.
Jun Kitagawa, Quentin Mérigot, and Boris Thibert. A Newton algorithm for
semi-discrete optimal transport. arXiv preprint arXiv:1603.05579, 2016.
Philip A Knight. The Sinkhorn–Knopp algorithm: convergence and applica-
tions. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275,
2008.
Philip A Knight and Daniel Ruiz. A fast algorithm for matrix balancing. IMA
Journal of Numerical Analysis, 33(3):1029–1047, 2013.
Philip A Knight, Daniel Ruiz, and Bora Uccar. A symmetry preserving algo-
rithm for matrix scaling. SIAM journal on Matrix Analysis and Applica-
tions, 35(3):931–955, 2014.
Martin Knott and Cyril S Smith. On a generalization of cyclic monotonicity
and distances among random vectors. Linear algebra and its applications,
199:363–371, 1994.
Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced Wasserstein kernels
for probability distributions. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 5258–5267, 2016.
Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K
Rohde. Optimal mass transport: Signal processing and machine-learning
applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.
Stanislav Kondratyev, Léonard Monsaingeon, and Dmitry Vorotnikov. A new
optimal transport distance on the space of finite Radon measures. Advances
in Differential Equations, 21(11/12):1117–1164, 2016.
Tjalling C Koopmans. Optimum utilization of the transportation system.
Econometrica: Journal of the Econometric Society, pages 136–146, 1949.
Jonathan Korman and Robert McCann. Optimal transportation with capacity
constraints. Transactions of the American Mathematical Society, 367(3):
1501–1521, 2015.
Bernhard Korte and Jens Vygen. Combinatorial optimization. Springer, 2012.
JJ Kosowsky and Alan L Yuille. The invisible hand algorithm: Solving the
assignment problem with statistical physics. Neural networks, 7(3):477–490,
1994.
Harold W. Kuhn. The Hungarian method for the assignment problem. Naval
Research Logistics Quarterly, 2:83–97, 1955.
Brian Kulis. Metric learning: A survey. Foundations and Trends in Machine
Learning, 5(4):287–364, 2012.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word
embeddings to document distances. In International Conference on Ma-
chine Learning, pages 957–966, 2015.
Rongjie Lai and Hongkai Zhao. Multi-scale non-rigid point cloud registration
using robust sliced-Wasserstein distance via Laplace-Beltrami eigenmap.
arXiv preprint arXiv:1406.3758, 2014.
Hugo Lavenant. Harmonic mappings valued in the Wasserstein space. Preprint
cvgmt 3649, 2017.
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of fea-
tures: Spatial pyramid matching for recognizing natural scene categories.
In Computer vision and pattern recognition, 2006 IEEE computer society
conference on, volume 2, pages 2169–2178. IEEE, 2006.
Thibaut Le Gouic and Jean-Michel Loubes. Existence and consistency of
Wasserstein barycenters. Probability Theory and Related Fields, pages 1–
17, 2016.
Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-
negative matrix factorization. Nature, 401(6755):788–791, 1999.
William Leeb and Ronald Coifman. Hölder–Lipschitz norms and their duals
on spaces with semigroups, with applications to earth mover’s distance.
Journal of Fourier Analysis and Applications, 22(4):910–953, 2016.
Jan Lellmann, Dirk A Lorenz, Carola Schönlieb, and Tuomo Valkonen. Imag-
ing with Kantorovich–Rubinstein discrepancy. SIAM Journal on Imaging
Sciences, 7(4):2833–2859, 2014.
Bas Lemmens and Roger Nussbaum. Nonlinear Perron-Frobenius Theory,
volume 189. Cambridge University Press, 2012.
Christian Léonard. From the Schrödinger problem to the Monge–Kantorovich
problem. Journal of Functional Analysis, 262(4):1879–1920, 2012.
Christian Léonard. A survey of the Schrödinger problem and some of its
connections with optimal transport. Discrete Contin. Dyn. Syst. A, 34(4):
1533–1574, 2014.
Bruno Lévy. A numerical algorithm for L2 semi-discrete optimal transport
in 3D. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):
1693–1715, 2015.
Bruno Levy and Erica Schwindt. Notions of optimal transport theory and
how to implement them on a computer. arXiv:1710.02634, 2017.
Peihua Li, Qilong Wang, and Lei Zhang. A novel earth mover’s distance
methodology for image matching with Gaussian mixture models. In Pro-
ceedings of the IEEE International Conference on Computer Vision, pages
1689–1696, 2013.
Wuchen Li, Penghang Yin, and Stanley Osher. A fast algorithm for unbal-
anced L1 Monge–Kantorovich problem. CAM report, 2016.
Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal entropy-
transport problems and a new Hellinger-Kantorovich distance between pos-
itive measures. ArXiv e-prints, 2015.
Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal trans-
port in competition with reaction: The Hellinger–Kantorovich distance and
geodesic curves. SIAM Journal on Mathematical Analysis, 48(4):2869–2911,
2016.
Haibin Ling and Kazunori Okada. Diffusion distance for histogram compar-
ison. In Computer Vision and Pattern Recognition, 2006 IEEE Computer
Society Conference on, volume 1, pages 246–253. IEEE, 2006.
Haibin Ling and Kazunori Okada. An efficient earth mover’s distance algo-
rithm for robust histogram comparison. IEEE Tr. on PAMI, 29(5):840–853,
2007.
Nathan Linial, Alex Samorodnitsky, and Avi Wigderson. A deterministic
strongly polynomial algorithm for matrix scaling and approximate perma-
nents. In Proceedings of the thirtieth annual ACM symposium on Theory
of computing, pages 644–652. ACM, 1998.
Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of
two nonlinear operators. SIAM J. Numer. Anal., 16:964–979, 1979.
Don O Loftsgaarden and Charles P Quesenberry. A nonparametric estimate
of a multivariate density function. The Annals of Mathematical Statistics,
36(3):1049–1051, 1965.
Eliane Maria Loiola, Nair Maria Maia de Abreu, Paulo Oswaldo Boaventura-
Netto, Peter Hahn, and Tania Querido. A survey for the quadratic assign-
ment problem. European J. Operational Research, 176(2):657–690, 2007.
Vince Lyzinski, Donniell E Fishkind, Marcelo Fiori, Joshua T Vogelstein,
Carey E Priebe, and Guillermo Sapiro. Graph matching: Relax at your
own risk. IEEE transactions on pattern analysis and machine intelligence,
38(1):60–73, 2016.
Jan Maas. Gradient flows of the entropy for finite Markov chains. Journal of
Functional Analysis, 261(8):2250–2292, 2011. .
Jan Maas, Martin Rumpf, Carola Schönlieb, and Stefan Simon. A generalized
model for optimal transport of images including dissipation and density
modulation. ESAIM: Mathematical Modelling and Numerical Analysis, 49
(6):1745–1769, 2015.
Jan Maas, Martin Rumpf, and Stefan Simon. Generalized optimal transport
with singular sources. arXiv preprint arXiv:1607.01186, 2016.
Yasushi Makihara and Yasushi Yagi. Earth mover’s morphing: Topology-free
shape morphing using cluster-based EMD flows. In Asian Conference on
Computer Vision, pages 202–215. Springer, 2010.
Stephane Mallat. A wavelet tour of signal processing: the sparse way. Aca-
demic press, 2008.
Benjamin Mathon, Francois Cayre, Patrick Bas, and Benoit Macq. Optimal
transport for secure spread-spectrum watermarking of still images. IEEE
Transactions on Image Processing, 23(4):1694–1705, 2014.
Daniel Matthes and Horst Osberger. Convergence of a variational Lagrangian
scheme for a nonlinear drift diffusion equation. ESAIM: Mathematical
Modelling and Numerical Analysis, 48(3):697–726, 2014.
Bertrand Maury and Anthony Preux. Pressureless Euler equations with max-
imal density constraint: a time-splitting scheme. Topological Optimization
and Optimal Transport: In the Applied Sciences, 17:333, 2017.
Bertrand Maury, Aude Roudneff-Chupin, and Filippo Santambrogio. A
macroscopic crowd motion model of gradient flow type. Mathematical Mod-
els and Methods in Applied Sciences, 20(10):1787–1821, 2010.
Robert J McCann. A convexity principle for interacting gases. Advances in
mathematics, 128(1):153–179, 1997.
Facundo Mémoli. On the use of Gromov–Hausdorff distances for shape com-
parison. In Symposium on Point Based Graphics, pages 81–90. 2007.
Facundo Mémoli. Gromov–Wasserstein distances and the metric approach
to object matching. Foundations of Computational Mathematics, 11(4):
417–487, 2011.
Facundo Mémoli and Guillermo Sapiro. A theoretical and computational
framework for isometry invariant recognition of point cloud data. Founda-
tions of Computational Mathematics, 5(3):313–347, 2005.
Quentin Mérigot. A multiscale approach to optimal transport. Comput.
Graph. Forum, 30(5):1583–1592, 2011.
Quentin Mérigot, Jocelyn Meyron, and Boris Thibert. Light in power: A
general and parameter-free algorithm for caustic design. arXiv preprint
arXiv:1708.04820, 2017.
Ludovic Métivier, Romain Brossier, Quentin Merigot, Edouard Oudet, and
Jean Virieux. An optimal transport approach for seismic tomography: Ap-
plication to 3D full waveform inversion. Inverse Problems, 32(11):115008,
2016.
Alexander Mielke. Geodesic convexity of the relative entropy in reversible
Markov chains. Calculus of Variations and Partial Differential Equations,
48(1-2):1–31, 2013.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient
estimation of word representations in vector space. arXiv preprint
arXiv:1301.3781, 2013.
Jean-Marie Mirebeau. Discretization of the 3D Monge-Ampere operator, be-
tween wide stencils and power diagrams. ESAIM: Mathematical Modelling
and Numerical Analysis, 49(5):1511–1523, 2015.
Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire
de l’Académie Royale des Sciences, pages 666–704, 1781.
Grégoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein
training of restricted Boltzmann machines. In D. D. Lee, M. Sugiyama,
U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural
Information Processing Systems 29, pages 3718–3726. 2016.
Kevin Moon and Alfred Hero. Multivariate f -divergence estimation with
confidence. In Advances in Neural Information Processing Systems, pages
2420–2428, 2014.
Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard
Schölkopf. Kernel mean embedding of distributions: A review and beyond.
Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
Oleg Museyko, Michael Stiglmayr, Kathrin Klamroth, and Günter Leugering.
On the application of the Monge–Kantorovich problem to image registra-
tion. SIAM Journal on Imaging Sciences, 2(4):1068–1097, 2009.
Boris Muzellec, Richard Nock, Giorgio Patrini, and Frank Nielsen. Tsallis
regularized optimal transport and ecological inference. In AAAI, pages
2387–2393, 2017.
Assaf Naor and Gideon Schechtman. Planar earthmover is not in L1. SIAM
J. Comput., 37(3):804–826, 2007.
Richard D Neidinger. Introduction to automatic differentiation and MATLAB
object-oriented programming. SIAM Review, 52(3):545–563, 2010.
Arkadi Nemirovski and Uriel Rothblum. On complexity of matrix scaling.
Linear Algebra and its Applications, 302:435–460, 1999.
Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms
in convex programming, volume 13. SIAM, 1994.
Kangyu Ni, Xavier Bresson, Tony Chan, and Selim Esedoglu. Local histogram
based segmentation using the Wasserstein distance. International journal
of computer vision, 84(1):97–111, 2009.
Lipeng Ning and Tryphon T Georgiou. Metrics for matrix-valued measures
via test functions. In 53rd IEEE Conference on Decision and Control, pages
2642–2647. IEEE, 2014.
Lipeng Ning, Tryphon T Georgiou, and Allen Tannenbaum. On matrix-valued
Monge–Kantorovich optimal mass transport. IEEE transactions on auto-
matic control, 60(2):373–382, 2015.
Jorge Nocedal and Stephen J Wright. Numerical Optimization. Springer-
Verlag, 1999.
Adam M Oberman and Yuanlong Ruan. An efficient linear programming
method for optimal transportation. arXiv preprint arXiv:1509.03668, 2015.
Vladimir Oliker and Laird D Prussner. On the numerical solution of the
equation (∂²z/∂x²)(∂²z/∂y²) − (∂²z/∂x∂y)² = f and its discretizations, I.
Numerische Mathematik, 54(3):271–293, 1989.
Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic
representation of the spatial envelope. International Journal of Computer
Vision, 42(3):145–175, 2001.
Dean S Oliver. Minimization for conditional simulation: Relationship to op-
timal transport. Journal of Computational Physics, 265:1–15, 2014.
Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on
probability spaces and its applicability in statistics. Annals of the Institute
of Statistical Mathematics, 55(3):639–653, 2003.
Felix Otto. The geometry of dissipative evolution equations: the porous
medium equation. Communications in partial differential equations, 26
(1-2):101–174, 2001.
Art B Owen. Empirical likelihood. Wiley Online Library, 2001.
Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE
Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
Nicolas Papadakis, Gabriel Peyré, and Edouard Oudet. Optimal transport
with proximal splitting. SIAM Journal on Imaging Sciences, 7(1):212–238,
2014.
Brendan Pass. On the local structure of optimal measures in the multi-
marginal optimal transportation problem. Calc. Var. Partial Differential
Equations, 43(3-4):529–536, 2012. ISSN 0944-2669.
Brendan Pass. Multi-marginal optimal transport: theory and applications.
ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):1771–1790,
2015.
Ofir Pele and Michael Werman. A linear time histogram metric for improved
sift matching. Computer Vision–ECCV 2008, pages 495–508, 2008.
Ofir Pele and Michael Werman. Fast and robust earth mover’s distances.
In Computer vision, 2009 IEEE 12th international conference on, pages
460–467. IEEE, 2009.
Benoît Perthame, Fernando Quirós, and Juan Luis Vázquez. The Hele-Shaw
asymptotics for mechanical models of tumor growth. Archive for Rational
Mechanics and Analysis, 212(1):93–127, 2014. ISSN 0003-9527.
Gabriel Peyré. Entropic approximation of Wasserstein gradient flows. SIAM
Journal on Imaging Sciences, 8(4):2323–2351, 2015.
Gabriel Peyré, Jalal Fadili, and Julien Rabin. Wasserstein active contours.
In Image Processing (ICIP), 2012 19th IEEE International Conference on,
pages 2541–2544. IEEE, 2012.
Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-Wasserstein av-
eraging of kernel and distance matrices. In International Conference on
Machine Learning, pages 2664–2672, 2016.
Gabriel Peyré, Lenaic Chizat, Francois-Xavier Vialard, and Justin Solomon.
Quantum entropic regularization of matrix-valued optimal transport. to
appear in European Journal of Applied Mathematics, 2017.
Rémi Peyre. Comparison between W2 distance and H−1 norm, and localisation
of Wasserstein distance. arXiv preprint arXiv:1104.4631, 2011.
Benedetto Piccoli and Francesco Rossi. Generalized Wasserstein distance and
its application to transport equations with source. Archive for Rational
Mechanics and Analysis, 211(1):335–358, 2014.
François Pitié, Anil C Kokaram, and Rozenn Dahyot. Automated colour
grading using colour distribution transfer. Computer Vision and Image
Understanding, 107(1):123–137, 2007.
Julien Rabin and Nicolas Papadakis. Convex color image segmentation with
optimal transport distances. In Proc. SSVM’15, pages 256–269, 2015.
Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein
barycenter and its application to texture mixing. In International Confer-
ence on Scale Space and Variational Methods in Computer Vision, pages
435–446. Springer, 2011.
Svetlozar T Rachev and Ludger Rüschendorf. Mass Transportation Problems:
Volume I: Theory, volume 1. Springer Science & Business Media, 1998a.
Svetlozar T Rachev and Ludger Rüschendorf. Mass Transportation Problems:
Volume II: Applications, volume 2. Springer Science & Business Media,
1998b.
Louis B Rall. Automatic differentiation: Techniques and applications.
Springer, 1981.
Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On Wasserstein
two-sample testing and related families of nonparametric tests. Entropy,
19(2):47, 2017.
Anand Rangarajan, Alan L Yuille, Steven Gold, and Eric Mjolsness. Conver-
gence properties of the softassign quadratic assignment algorithm. Neural
Comput., 11(6):1455–1474, August 1999.
C Rao and T Nayak. Cross entropy, dissimilarity measures, and characteriza-
tions of quadratic entropy. IEEE Transactions on Information Theory, 31
(5):589–593, 1985.
Sebastian Reich. A nonparametric ensemble transform method for Bayesian
inference. SIAM Journal on Scientific Computing, 35(4):A2013–A2024,
2013.
Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast dictionary learning
with a smoothed Wasserstein loss. In Proceedings of the 19th International
Conference on Artificial Intelligence and Statistics, volume 51 of Proceed-
ings of Machine Learning Research, pages 630–638, 2016.
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s
distance as a metric for image retrieval. International journal of computer
vision, 40(2):99–121, 2000.
Ludger Rüschendorf. Convergence of the iterative proportional fitting proce-
dure. The Annals of Statistics, 23(4):1160–1174, 1995.
Ludger Rüschendorf and Wolfgang Thomsen. Closedness of sum spaces and
the generalized Schrodinger problem. Theory of Probability and its Appli-
cations, 42(3):483–494, 1998.
Ludger Rüschendorf and Ludger Uckelmann. On the n-coupling problem.
Journal of multivariate analysis, 81(2):242–258, 2002.
Hans Samelson et al. On the Perron-Frobenius theorem. The Michigan Math-
ematical Journal, 4(1):57–59, 1957.
Roman Sandler and Michael Lindenbaum. Nonnegative matrix factorization
with earth mover’s distance metric for image analysis. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 33(8):1590–1602, 2011.
Filippo Santambrogio. Optimal Transport for Applied Mathematicians.
Birkhauser, 2015.
Filippo Santambrogio. {Euclidean, metric, and Wasserstein} gradient flows:
an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.
Filippo Santambrogio. Crowd motion and population dynamics under density
constraints. GMT preprint 3728, 2018.
Louis-Philippe Saumier, Boualem Khouider, and Martial Agueh. Optimal
transport for particle image velocimetry. Communications in Mathematical
Sciences, 13(1):269–296, 2015.
Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Sub-
ramanian, Aryeh Solomon, Siyan Liu, Stacie Lin, Peter Berube, Lia Lee,
et al. Reconstruction of developmental landscapes by optimal-transport
analysis of single-cell gene expression sheds light on cellular reprogram-
ming. bioRxiv, page 191056, 2017.
Bernhard Schmitzer. A sparse multiscale algorithm for dense optimal trans-
port. Journal of Mathematical Imaging and Vision, 56(2):238–259, 2016a.
Bernhard Schmitzer. Stabilized sparse scaling algorithms for entropy regular-
ized transport problems. arXiv preprint arXiv:1610.06519, 2016b.
Bernhard Schmitzer and Christoph Schnörr. Modelling convex shape pri-
ors and matching based on the Gromov-Wasserstein distance. Journal of
mathematical imaging and vision, 46(1):143–159, 2013a.
Bernhard Schmitzer and Christoph Schnörr. Object segmentation by shape
matching with Wasserstein modes. In International Workshop on Energy
Minimization Methods in Computer Vision and Pattern Recognition, pages
123–136. Springer, 2013b.
Bernhard Schmitzer and Benedikt Wirth. A framework for Wasserstein-1-type
metrics. arXiv preprint arXiv:1701.01945, 2017.
Isaac J Schoenberg. Metric spaces and positive definite functions. Transac-
tions of the American Mathematical Society, 44:522–536, 1938.
Bernhard Schölkopf and Alexander J Smola. Learning with kernels: Support
vector machines, regularization, optimization, and beyond. the MIT Press,
2002.
Erwin Schrödinger. Über die Umkehrung der Naturgesetze. Sitzungsberichte
Preuss. Akad. Wiss. Berlin. Phys. Math., 144:144–153, 1931.
Vivien Seguy and Marco Cuturi. Principal geodesic analysis for probability
measures under the optimal transport metric. In Advances in Neural Infor-
mation Processing Systems 28, pages 3294–3302. Curran Associates, Inc.,
2015.
Sameer Shirdhonkar and David W Jacobs. Approximate earth mover’s dis-
tance in linear time. In Computer Vision and Pattern Recognition, 2008.
CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
Bernard W Silverman. Density estimation for statistics and data analysis,
volume 26. CRC press, 1986.
Richard Sinkhorn. A relationship between arbitrary positive matrices and
doubly stochastic matrices. Ann. Math. Statist., 35:876–879, 1964.
Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and
column sums. Amer. Math. Monthly, 74:402–405, 1967.
Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and
doubly stochastic matrices. Pacific J. Math., 21:343–348, 1967.
Marcos Slomp, Michihiro Mikamo, Bisser Raytchev, Toru Tamaki, and Kazu-
fumi Kaneda. GPU-based softassign for maximizing image utilization in
photomosaics. International Journal of Networking and Computing, 1(2):
211–229, 2011.
Justin Solomon, Leonidas Guibas, and Adrian Butscher. Dirichlet energy
for analysis and synthesis of soft maps. In Computer Graphics Forum,
volume 32, pages 197–206. Wiley Online Library, 2013.
Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher.
Earth mover’s distances on discrete surfaces. Transactions on Graphics
(proc. SIGGRAPH), 33(4), 2014a.
Justin Solomon, Raif Rustamov, Guibas Leonidas, and Adrian Butscher.
Wasserstein propagation for semi-supervised learning. In Proceedings of
the 31st International Conference on Machine Learning (ICML-14), pages
306–314, 2014b.
Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian
Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional
Wasserstein distances: Efficient optimal transportation on geometric do-
mains. ACM Transactions on Graphics (Proc. SIGGRAPH 2015), 34(4):
66:1–66:11, 2015. .
Justin Solomon, Gabriel Peyré, Vladimir G Kim, and Suvrit Sra. Entropic
metric alignment for correspondence problems. ACM Transactions on
Graphics (Proc. SIGGRAPH 2016), 35(4):72:1–72:13, 2016a.
Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher.
Continuous-flow graph transportation distances. arXiv preprint
arXiv:1603.06927, 2016b.
Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard
Schölkopf, and Gert RG Lanckriet. On integral probability metrics, ϕ-
divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.
Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard
Schölkopf, and Gert RG Lanckriet. On the empirical estimation of integral
probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP:
Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence
and Statistics, pages 912–920, 2015a.
Sanvesh Srivastava, Cheng Li, and David B Dunson. Scalable Bayes via
barycenter in Wasserstein space. arXiv preprint arXiv:1508.05880, 2015b.
Matthew Staib, Sebastian Claici, Justin Solomon, and Stefanie Jegelka. Par-
allel streaming Wasserstein barycenters. arXiv preprint arXiv:1705.07443,
2017.
Leen Stougie. A polynomial bound on the diameter of the transportation
polytope. Technical report, TU/e, Technische Universiteit Eindhoven, De-
partment of Mathematics and Computing Science, 2002.
Karl-Theodor Sturm. The space of spaces: curvature bounds and gradient
flows on the space of metric measure spaces. Preprint 1208.0434, arXiv,
2012.
Zhengyu Su, Yalin Wang, Rui Shi, Wei Zeng, Jian Sun, Feng Luo, and Xi-
anfeng Gu. Optimal mass transport for shape matching and comparison.
IEEE transactions on pattern analysis and machine intelligence, 37(11):
2246–2259, 2015.
Vladimir N Sudakov. Geometric problems in the theory of infinite-dimensional
probability distributions. Number 141. American Mathematical Soc., 1979.
Mahito Sugiyama, Hiroyuki Nakahara, and Koji Tsuda. Tensor balancing on
statistical manifold. arXiv preprint arXiv:1702.08142, 2017.
Paul Swoboda and Christoph Schnörr. Convex variational image restoration
with histogram priors. SIAM Journal on Imaging Sciences, 6(3):1719–1735,
2013.
Gábor J Székely and Maria L Rizzo. Testing for equal distributions in high
dimension. InterStat, 5(16.10), 2004.
Asuka Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal
of Mathematics, 48(4):1005–1026, 2011.
Xiaolu Tan and Nizar Touzi. Optimal transportation under controlled stochas-
tic dynamics. Ann. Probab., 41(5):3201–3240, 2013.
Guillaume Tartavel, Gabriel Peyré, and Yann Gousseau. Wasserstein loss for
image synthesis and restoration. SIAM Journal on Imaging Sciences, 9(4):
1726–1755, 2016.
Matthew Thorpe, Serim Park, Soheil Kolouri, Gustavo K Rohde, and De-
jan Slepčev. A transportation lp distance for signal analysis. Journal of
Mathematical Imaging and Vision, 59(2):187–210, 2017.
AN Tolstoi. Methods of finding the minimal total kilometrage in cargo trans-
portation planning in space. TransPress of the National Commissariat of
Transportation, pages 23–55, 1930.
AN Tolstoi. Metody ustraneniya neratsional’nykh perevozok pri planirovanii
[Russian; methods of removing irrational transportation in planning].
Sotsialisticheskii Transport, 9:28–51, 1939.
Alain Trouvé and Laurent Younes. Metamorphoses through Lie group ac-
tion. Foundations of Computational Mathematics, 5(2):173–198, 2005. ISSN
1615-3375.
Neil S Trudinger and Xu-Jia Wang. On the Monge mass transfer problem. Cal-
culus of Variations and Partial Differential Equations, 13(1):19–31, 2001.
Marc Vaillant and Joan Glaunès. Surface matching via currents. In Informa-
tion processing in medical imaging, pages 1–5. Springer, 2005.
Sathamangalam R Srinivasa Varadhan. On the behavior of the fundamental
solution of the heat equation with variable coefficients. Comm. on Pure
and Applied Math., 20(2):431–455, 1967.
Cédric Villani. Topics in Optimal Transportation. Graduate Studies in Mathe-
matics Series. American Mathematical Society, 2003. ISBN 9780821833124.
Cédric Villani. Optimal transport: old and new, volume 338. Springer Verlag,
2009.
Thomas Vogt and Jan Lellmann. Measure-valued variational models with ap-
plications to diffusion-weighted imaging. arXiv preprint arXiv:1710.00798,
2017.
Fan Wang and Leonidas J Guibas. Supervised earth mover’s distance learning
and its computer vision applications. ECCV2012, pages 442–455, 2012.
Wei Wang, John A Ozolek, Dejan Slepcev, Ann B Lee, Cheng Chen, and Gus-
tavo K Rohde. An optimal transportation approach for nuclear structure-
based pathology. IEEE transactions on medical imaging, 30(3):621–631,
2011.
Wei Wang, Dejan Slepčev, Saurav Basu, John A Ozolek, and Gustavo K
Rohde. A linear optimal transportation framework for quantifying and
visualizing variations in sets of images. International journal of computer
vision, 101(2):254–269, 2013.
Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of
convergence of empirical measures in Wasserstein distance. arXiv preprint
arXiv:1707.00087, 2017.
Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large
margin nearest neighbor classification. The Journal of Machine Learning
Research, 10:207–244, 2009.
Michael Westdickenberg and Jon Wilkening. Variational particle schemes
for the porous medium equation and for the system of isentropic Euler
equations. ESAIM: Mathematical Modelling and Numerical Analysis, 44
(1):133–166, 2010.
Alan Geoffrey Wilson. The use of entropy maximizing models, in the the-
ory of trip distribution, mode split and route split. Journal of Transport
Economics and Policy, pages 108–126, 1969.
Gui-Song Xia, Sira Ferradans, Gabriel Peyré, and Jean-François Aujol. Syn-
thesizing and mixing stationary Gaussian texture models. SIAM Journal
on Imaging Sciences, 7(1):476–508, 2014.
Gloria Zen, Elisa Ricci, and Nicu Sebe. Simultaneous ground metric learning
and matrix factorization with earth mover’s distance. Proc. ICPR’14, pages
3690–3695, 2014.
Lei Zhu, Yan Yang, Steven Haker, and Allen Tannenbaum. An image morph-
ing technique based on optimal mass preserving mapping. IEEE Transac-
tions on Image Processing, 16(6):1481–1495, 2007.
Jonathan Zinsl and Daniel Matthes. Transport distances and geodesic con-
vexity for systems of degenerate diffusion equations. Calculus of Variations
and Partial Differential Equations, 54(4):3397–3438, 2015.