Markov Random Fields for Vision and Image Processing



edited by Andrew Blake, Pushmeet Kohli, and Carsten Rother
The MIT Press
Cambridge, Massachusetts
London, England
© 2011 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means
(including photocopying, recording, or information storage and retrieval) without permission in writing from the
publisher.
For information about special quantity discounts, please email special_sales@[Link]
This book was set in Syntax and Times New Roman by Westchester Books Composition. Printed and bound in
the United States of America.
Library of Congress Cataloging-in-Publication Data
Markov random fields for vision and image processing / edited by Andrew Blake, Pushmeet Kohli, and Carsten
Rother.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-262-01577-6 (hardcover : alk. paper)
1. Image processing--Mathematics. 2. Computer graphics--Mathematics. 3. Computer vision--Mathematics.
4. Markov random fields. I. Blake, Andrew, 1956-. II. Kohli, Pushmeet. III. Rother, Carsten.
TA1637.M337 2011
006.3/70151--dc22
2010046702
10 9 8 7 6 5 4 3 2 1
Contents
1 Introduction to Markov Random Fields 1
Andrew Blake and Pushmeet Kohli
I Algorithms for Inference of MAP Estimates for MRFs 29
2 Basic Graph Cut Algorithms 31
Yuri Boykov and Vladimir Kolmogorov
3 Optimizing Multilabel MRFs Using Move-Making Algorithms 51
Yuri Boykov, Olga Veksler, and Ramin Zabih
4 Optimizing Multilabel MRFs with Convex and Truncated Convex Priors 65
Hiroshi Ishikawa and Olga Veksler
5 Loopy Belief Propagation, Mean Field Theory, and Bethe Approximations 77
Alan Yuille
6 Linear Programming and Variants of Belief Propagation 95
Yair Weiss, Chen Yanover, and Talya Meltzer
II Applications of MRFs, Including Segmentation 109
7 Interactive Foreground Extraction: Using Graph Cut 111
Carsten Rother, Vladimir Kolmogorov, Yuri Boykov, and Andrew Blake
8 Continuous-Valued MRF for Image Segmentation 127
Dheeraj Singaraju, Leo Grady, Ali Kemal Sinop, and René Vidal
9 Bilayer Segmentation of Video 143
Antonio Criminisi, Geoffrey Cross, Andrew Blake, and Vladimir Kolmogorov
10 MRFs for Superresolution and Texture Synthesis 155
William T. Freeman and Ce Liu
11 A Comparative Study of Energy Minimization Methods for MRFs 167
Richard Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir
Kolmogorov, Aseem Agarwala, Marshall F. Tappen, and Carsten Rother
III Further Topics: Inference, Parameter Learning, and Continuous Models 183
12 Convex Relaxation Techniques for Segmentation, Stereo, and
Multiview Reconstruction 185
Daniel Cremers, Thomas Pock, Kalin Kolev, and Antonin Chambolle
13 Learning Parameters in Continuous-Valued Markov Random Fields 201
Marshall F. Tappen
14 Message Passing with Continuous Latent Variables 215
Michael Isard
15 Learning Large-Margin Random Fields Using Graph Cuts 233
Martin Szummer, Pushmeet Kohli, and Derek Hoiem
16 Analyzing Convex Relaxations for MAP Estimation 249
M. Pawan Kumar, Vladimir Kolmogorov, and Philip H. S. Torr
17 MAP Inference by Fast Primal-Dual Linear Programming 263
Nikos Komodakis
18 Fusion-Move Optimization for MRFs with an Extensive Label Space 281
Victor Lempitsky, Carsten Rother, Stefan Roth, and Andrew Blake
IV Higher-Order MRFs and Global Constraints 295
19 Field of Experts 297
Stefan Roth and Michael J. Black
20 Enforcing Label Consistency Using Higher-Order Potentials 311
Pushmeet Kohli, Lubor Ladicky, and Philip H. S. Torr
21 Exact Optimization for Markov Random Fields with Nonlocal Parameters 329
Victor Lempitsky, Andrew Blake, and Carsten Rother
22 Graph Cut-Based Image Segmentation with Connectivity Priors 347
Sara Vicente, Vladimir Kolmogorov, and Carsten Rother
V Advanced Applications of MRFs 363
23 Symmetric Stereo Matching for Occlusion Handling 365
Jian Sun, Yin Li, Sing Bing Kang, and Heung-Yeung Shum
24 Steerable Random Fields for Image Restoration 377
Stefan Roth and Michael J. Black
25 Markov Random Fields for Object Detection 389
John Winn and Jamie Shotton
26 SIFT Flow: Dense Correspondence across Scenes and Its Applications 405
Ce Liu, Jenny Yuen, Antonio Torralba, and William T. Freeman
27 Unwrap Mosaics: A Model for Deformable Surfaces in Video 419
Alex Rav-Acha, Pushmeet Kohli, Carsten Rother, and Andrew Fitzgibbon
Bibliography 433
Contributors 457
Index 459
1 Introduction to Markov Random Fields
Andrew Blake and Pushmeet Kohli
This book sets out to demonstrate the power of the Markov random field (MRF) in vision. It treats the MRF both as a tool for modeling image data and, coupled with a set of recently developed algorithms, as a means of making inferences about images. The inferences concern underlying image and scene structure to solve problems such as image reconstruction, image segmentation, 3D vision, and object labeling. This chapter is designed to present some of the main concepts used in MRFs, both as a taster and as a gateway to the more detailed chapters that follow, as well as a stand-alone introduction to MRFs.
The unifying ideas in using MRFs for vision are the following:

• Images are dissected into an assembly of nodes that may correspond to pixels or agglomerations of pixels.

• Hidden variables associated with the nodes are introduced into a model designed to explain the values (colors) of all the pixels.

• A joint probabilistic model is built over the pixel values and the hidden variables.

• The direct statistical dependencies between hidden variables are expressed by explicitly grouping hidden variables; these groups are often pairs, depicted as edges in a graph.
These properties of MRFs are illustrated in figure 1.1. The graphs corresponding to such MRF problems are predominantly gridlike, but may also be irregular, as in figure 1.1(c). Exactly how graph connectivity is interpreted in terms of probabilistic conditional dependency is discussed a little later.
The notation for image graphs is that the graph G = (V, E) consists of vertices V = (1, 2, ..., i, ..., N) corresponding, for example, to the pixels of the image, and a set of edges E where a typical edge is (i, j), i, j ∈ V, and edges are considered to be undirected, so that (i, j) and (j, i) refer to the same edge. In the superpixel graph of figure 1.1(c), the nodes are superpixels, and a pair of superpixels forms an edge in E if the two superpixels share a common boundary.
Figure 1.1
Graphs for Markov models in vision. (a) Simple 4-connected grid of image pixels. (b) Grids with greater connectivity can be useful, for example, to achieve better geometrical detail (see discussion later), as here with the 8-connected pixel grid. (c) Irregular grids are also useful. Here a more compact graph is constructed in which the nodes are superpixels: clusters of adjacent pixels with similar colors.

The motivation for constructing such a graph is to connect the hidden variables associated with the nodes. For example, for the task of segmenting an image into foreground and background, each node i (pixel or superpixel) has an associated random variable X_i that
may take the value 0 or 1, corresponding to foreground or background, respectively. In order
to represent the tendency of matter to be coherent, neighboring sites are likely to have the same label. So where (i, j) ∈ E, some kind of probabilistic bias needs to be associated with the edge (i, j) such that X_i and X_j tend to have the same label: both 0 or both 1. In fact, any pixels that are nearby, not merely adjacent, are likely to have the same label. On the other hand, explicitly linking all the pixels in a typical image, whose foreground/background labels have correlations, would lead to a densely connected graph. That in turn would result in computationally expensive algorithms. Markov models explicitly represent only the associations between relatively few pairs of pixels: those pixels that are defined as neighbors because of sharing an edge in E. The great attraction of Markov models is that they leverage a knock-on effect: explicit short-range linkages give rise to implied long-range correlations. Thus correlations over long ranges, on the order of the diameters of typical objects, can be obtained without undue computational cost. The goal of this chapter is to investigate probabilistic models that exploit this powerful Markov property.
1.1 Markov Chains: The Simplest Markov Models
In a Markov chain a sequence of random variables X = (X_1, X_2, ...) has a joint distribution specified by the conditionals P(X_i | X_{i-1}, X_{i-2}, ..., X_1). The classic tutorial example [381, sec. 6.2] is the weather, so that X_i ∈ L = {sunny, rainy}. The weather on day i can be influenced by the weather many days previous, but in the simplest form of Markov chain, the dependence of today's weather is linked explicitly only to yesterday's weather. It is also linked implicitly, as a knock-on effect, to all previous days. This is a first-order Markov assumption, that

P(X_i | X_{i-1}, X_{i-2}, ..., X_1) = P(X_i | X_{i-1}).   (1.1)
This is illustrated in figure 1.2. The set of conditional probabilities P(X_i | X_{i-1}) is in fact a 2 × 2 matrix. For example:
Figure 1.2
A simple first-order Markov chain for weather forecasting. (a) A directed graph is used to represent the conditional dependencies of a Markov chain. (b) In more detail, the state transition diagram completely specifies the probabilistic process of the evolving weather states. (c) A Markov chain can alternatively be expressed as an undirected graphical model; see text for details.
                     Yesterday (X_{i-1})
                     Rain    Sun
  Today (X_i)  Rain  0.4     0.8
               Sun   0.6     0.2
An interesting and commonly used special case is the stationary Markov chain, in which the matrix

M_i(x, x') = P(X_i = x | X_{i-1} = x')   (1.2)

is independent of time i, so that M_i(·, ·) = M_{i-1}(·, ·). In the weather example this corresponds to the assumption that the statistical dependency of weather is a fixed relationship, the same on any day.
We will not dwell on the simple example of the Markov chain, but a few comments may be useful. First, the first-order explicit structure implicitly carries longer-range dependencies, too. For instance, the conditional dependency across three successive days is obtained by multiplying together the matrices for two successive pairs of days:

P(X_i = x | X_{i-2} = x'') = Σ_{x' ∈ L} M_i(x, x') M_{i-1}(x', x'').   (1.3)
Thus the Markov chain shares the elegance of Markov models generally, which will recur later with models for images, that long-range dependencies can be captured for the price of explicitly representing just the immediate dependencies between neighbors. Second, higher-order Markov chains, where the explicit dependencies go back farther than immediate neighbors, can also be useful. A famous example is predictive text, in which probable letters in a typical word are characterized in terms of the two preceding letters; taking just the one preceding letter does not give enough practical predictive power. Predictive text, then, is a second-order Markov chain.
The directed graph in figure 1.2(a) is a graphical representation of the fact that, for a Markov chain, the joint density can be decomposed as a product of conditional densities:

P(x) = P(x_N | x_{N-1}) ... P(x_i | x_{i-1}) ... P(x_2 | x_1) P(x_1),   (1.4)
where for simplicity, in a popular abuse of notation, P(x) denotes P(X = x) and, similarly, P(x_i | x_{i-1}) denotes P(X_i = x_i | X_{i-1} = x_{i-1}). This convention is used frequently throughout the book. An alternative formalism that is commonly used is the undirected graphical model. Markov chains can also be represented in this way (figure 1.2(c)), corresponding to a factorized decomposition:
P(x) = Φ_{N,N-1}(x_N, x_{N-1}) ... Φ_{i,i-1}(x_i, x_{i-1}) ... Φ_{2,1}(x_2, x_1),   (1.5)

where Φ_{i,i-1} is a factor of the joint density. It is easy to see, in this simple case of the
Markov chain, how the directed form (1.4) can be reexpressed in the undirected form (1.5).
However, it is not the case in general, and in particular in 2D images, that models expressed
in one form can easily be expressed in the other. Many of the probabilistic models used in
computer vision are most naturally expressed using the undirected formalism, so it is the
undirected graphical models that dominate in this book. For details on directed graphical
models see [216, 46].
1.2 The Hidden Markov Model (HMM)
Markov models are particularly useful as prior models for state variables X_i that are to be inferred from a corresponding set of measurements or observations z = (z_1, z_2, ..., z_i, ..., z_N). The observations z are themselves considered to be instantiations of a random
variable Z representing the full space of observations that can arise. This is the classical
situation in speech analysis [381, sec. 6.2], where z_i represents the spectral content of a fragment of an audio signal, and X_i represents a state in the time course of a particular word or phoneme. It leads naturally to an inference problem in which the posterior distribution for the possible states X, given the observations z, is computed via Bayes's formula as

P(X = x | Z = z) ∝ P(Z = z | X = x) P(X = x).   (1.6)
Here P(X = x) is the prior distribution over states, that is, what is known about states X
in the absence of any observations. As before, (1.6) is abbreviated, for convenience, to
P(x | z) ∝ P(z | x) P(x).   (1.7)

The omitted constant of proportionality would be fixed to ensure that Σ_x P(x | z) = 1.
Often multiple models are considered simultaneously, and in that case this is denoted

P(x | z, θ) ∝ P(z | x, θ) P(x | θ),   (1.8)

where the model parameters θ may determine the prior model or the observation model or both. The constant of proportionality in this relation would of course depend on z and on θ.
The prior of an HMM is itself represented as a Markov chain, which in the first-order case was decomposed as a product of conditional distributions (1.4). The term P(z | x) is the likelihood of the observations, which is essentially a measure of the quality of the measurements. The more precise and unambiguous the measuring instrument, the more the likelihood will be compressed into a single, narrow peak. This captures the fact that a more precise instrument produces more consistent responses z, under a given condition represented by the state X = x. It is often assumed, and this is true of the models used in many of the chapters of this book, that observations are independent across sites. The observation at site i depends only on the corresponding state. In other words:

P(z | x) = P(z_N | x_N) P(z_{N-1} | x_{N-1}) ... P(z_1 | x_1).   (1.9)
Figure 1.3
A first-order hidden Markov model (HMM). (a) A directed graph is used to represent the dependencies of a first-order HMM, with its Markov chain prior, and a set of independently uncertain observations. (b) Alternatively the HMM can be represented as an undirected graphical model (see text).

The directed graphical model for the conditional dependencies of such a first-order HMM is given in figure 1.3(a). The figure captures the conditional dependencies both of the underlying
Markov chain and of the independence of the observations. Alternatively, an HMM can be
expressed as an undirected graphical model, as depicted in figure 1.3(b), in which the prior is decomposed as in (1.5), and the likelihood is

P(z | x) = Φ_N(x_N) Φ_{N-1}(x_{N-1}) ... Φ_1(x_1),   (1.10)

where trivially Φ_i(x_i) = P(z_i | x_i).
Discrete HMMs, with a finite label set L, are largely tractable. Rabiner and Juang [382] set out three canonical problems for HMMs, and algorithms to solve them. The problems are the following:

1. Evaluating the observation probability P(z | θ)  In this problem there is no explicit state dependence, because it has been marginalized out by summation over states:

P(z | θ) = Σ_{x ∈ L^N} P(z | x, θ) P(x | θ).   (1.11)

The main application of this evaluation is to determine which of a set of known models fits the data best:

max_θ P(z | θ).   (1.12)

The quantity P(z | θ) is also known as the evidence [328] for the model from the data z.
2. MAP estimation  Given a model θ and a set of data z, estimate the most probable (maximum a posteriori) sequence x of states as the mode of the posterior distribution (1.8).

3. Parameter estimation  Given a set of data z, estimate the parameters θ ∈ Θ, a continuous parameter space, that best fit the data. This is the problem that must be solved to build a particular model from training data. It is closely related to the model selection problem above, in that both maximize P(z | θ), the difference being that the model space is, respectively, discrete or continuous.
These three problems are essentially solved by using two algorithms and variants of them. The first problem requires the forward algorithm, which computes a marginal distribution at node i from the distribution at the previous node i − 1:

P(x_i, z_1, ..., z_i | θ) = P(z_i | x_i, θ) Σ_{x_{i-1}} P(x_i | x_{i-1}, θ) P(x_{i-1}, z_1, ..., z_{i-1} | θ).   (1.13)
This is a special case of Belief Propagation (BP) that will be discussed later in this chapter and in various subsequent chapters in the book. In fact there are two forms of BP [367, 46], and this one is an example of sum-product BP. (The name derives from the summation and product steps in (1.13).) The other form is described shortly. In the case of the HMM, where the underlying prior model is simply a Markov chain, sum-product belief propagation is quite straightforward and is an exact algorithm for computing the marginal posteriors. After one complete forward pass, the final marginal distribution is P(x_N, z | θ), and so finally

P(z | θ) = Σ_{x_N} P(x_N, z | θ)   (1.14)
can be computed as the evidence for a known model θ that solves problem 1 above. The forward pass (1.13) constitutes half of BP, the remaining part being a backward pass that recurs from node N back to node 1 (details omitted here, but see [382]). Using the forward and backward passes together, the full set of marginal posterior distributions

P(x_i | z, θ), i = 1, ..., N   (1.15)

can be computed. This is required for problem 3 above, in order to compute the expected values of the sufficient statistics that are needed to estimate the parameters θ by expectation maximization [121], also known in the speech analysis literature as the Baum-Welch method [381].
The second algorithm is the Viterbi algorithm, a dynamic programming optimization algorithm applied to the state sequence x. It is also equivalent to a special case of max-product belief propagation, which also is mentioned quite frequently in the book. The aim is to solve the second problem above, computing the MAP estimate of the state vector as

x̂ = arg max_x P(x | z, θ)   (1.16)

via a forward recursion:

P_i(x_i) = P(z_i | x_i, θ) max_{x_{i-1}} P(x_i | x_{i-1}, θ) P_{i-1}(x_{i-1}),   (1.17)
where P_i is defined by

P_i(x_i) ≡ max_{x_1, ..., x_{i-1}} P(x_1, ..., x_i, z_1, ..., z_i | θ).   (1.18)
Each forward step of the recursion can be viewed as a message-passing operation. In step i − 1 of the computation, node i − 1 sends the message P_i(x_i) to node i.
After the forward steps are complete, the final component x̂_N of the MAP solution x̂ can be computed as

x̂_N = arg max_{x_N} P_N(x_N).   (1.19)
This is followed by a backward recursion

x̂_{i-1} = arg max_{x_{i-1}} P(x̂_i | x_{i-1}, θ) P_{i-1}(x_{i-1}),   (1.20)

after which all components of x̂ are determined.
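As an illustration, here is a minimal Viterbi sketch following (1.17)-(1.20), assuming NumPy and the same kind of toy transition and emission tables used in the previous snippet (illustrative values, not from the text).

```python
import numpy as np

def viterbi(prior, trans, emit, obs):
    """Max-product (Viterbi) decoding of the MAP state sequence, (1.16)-(1.20)."""
    n_states, N = len(prior), len(obs)
    P = np.zeros((N, n_states))               # P[i, x], in the spirit of (1.18)
    back = np.zeros((N, n_states), dtype=int)

    P[0] = prior * emit[:, obs[0]]
    for i in range(1, N):
        # scores[x, x'] = P(x_i = x | x_{i-1} = x') * P_{i-1}(x')
        scores = trans * P[i - 1]
        back[i] = scores.argmax(axis=1)               # best predecessor of each state
        P[i] = emit[:, obs[i]] * scores.max(axis=1)   # forward recursion (1.17)

    x_hat = np.zeros(N, dtype=int)
    x_hat[-1] = P[-1].argmax()                        # equation (1.19)
    for i in range(N - 1, 0, -1):                     # backward pass, as in (1.20)
        x_hat[i - 1] = back[i, x_hat[i]]
    return x_hat

print(viterbi(np.array([0.5, 0.5]),
              np.array([[0.4, 0.8], [0.6, 0.2]]),
              np.array([[0.9, 0.1], [0.2, 0.8]]),
              [0, 0, 1]))
```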
The purpose of the discussion in this section has been largely to explain the nature of
hidden variables in simple Markov models, as a precursor to later discussion of hidden
variables in the more complex, two-dimensional kinds of models that are used in vision.
However, even in vision the discrete HMM structure has some direct applications. It has
proved useful for representing temporal problems that are somewhat analogous to speech
analysis, but in which the audio input is replaced by a time sequence of visual features.
Well-known examples include the recognition of American Sign Language [449] and the
recognition of hand gestures for command and control [522]. This book deals mainly with discrete Markov models, that is, ones in which the states of each X_i belong to a finite set L. However, in vision by far the greater application of timelike HMMs employs continuous state-space to represent position, attitude, and shape in visual tracking [333, 477, 50, 253, 209, 124]. In such continuous settings the HMM becomes a classical or nonclassical form of Kalman filter. Both exact solutions to the estimation problems that arise, and efficient approximate solutions, are much studied, but are outside the scope of this book.
1.3 Markov Models on Trees
In the following section 1.4, Markov Random Fields (MRFs) are defined as probabilistic models over undirected graphs. On the way there, we now consider undirected models on trees as intermediate in complexity between the linear graphs (chains and HMMs) of section 1.2, and graphs of unrestricted connectivity. Clearly the HMM graph (figure 1.3b) is a special case of an undirected model on a tree. Trees appear to be of intermediate complexity but, perhaps surprisingly, turn out to be closer to HMMs, in that inference can be performed exactly. The Viterbi and forward-backward algorithms for HMMs generalize to two different kinds of message passing on trees. However, once the nodes on two leaves of a tree are coalesced into a single leaf (for example, leaves b and d in figure 1.4), a circuit may be formed in the resulting graph, and message-passing algorithms are no longer an exact solution to the problems of inference.
As with Markov chains and HMMs, in undirected trees the topological structure conveys two aspects of the underlying probabilistic model. First are the conditional independence properties, that:

P(x_i | {x_j, j ≠ i}) = P(x_i | {x_j, (i, j) ∈ E}).   (1.21)

The set B_i = {j : (i, j) ∈ E} is known as the Markov blanket of i, its neighbors in the tree (or generally graph) G. The second aspect is the decomposition of the joint distribution, the generalization of (1.5) and (1.10). How can a distribution with this independence property be constructed? The answer is, as a distribution that is factorized over the edges of the tree:

P(x) = ∏_{(i,j)∈E} F_{i,j}(x_i, x_j).   (1.22)
Figure 1.4
Message passing for MAP inference in tree-structured graphical models. A graphical model containing five nodes (a, b, c, d, and e) connected in a tree structure.
1.3.1 Inference on Trees: Belief Propagation (BP)
The message-passing formulation of the Viterbi algorithm can be generalized to find marginal distributions over individual variables and the MAP estimate in a tree-structured graphical model. The resulting algorithm is known as belief propagation [367] and has two variants: max-product, for computing the MAP solution, and sum-product (mentioned earlier), which allows computation of marginals of individual random variables.
Max-product message passing is similar in spirit to dynamic programming algorithms. Like the Viterbi algorithm, it works by passing messages between tree nodes in two stages. In the first stage, messages are passed from the leaf nodes to their parents, which in turn pass messages to their parents, and so on until the messages reach the root node. The message m_{i→j} from a node i to its parent j is computed as

m_{i→j}(x_j) = max_{x_i} P(x_j, x_i) ∏_{k∈N_c(i)} m_{k→i}(x_i)   (1.23)
where N_c(i) is the set of all children of node i. The MAP label of the variable at the root r of the tree can be computed as

x̂_r = arg max_{x_r} ∏_{k∈N_c(r)} m_{k→r}(x_r).   (1.24)
Given the MAP label x̂_p of a variable X_p, the label of any of its children i can be found as

x̂_i = arg max_{x_i} P(x̂_p, x_i) ∏_{k∈N_c(i)} m_{k→i}(x_i).   (1.25)
1.3.2 Example: Max-Product BP on a Five-Node Model
Consider the undirected, tree-structured graphical model shown in figure 1.4. The joint distribution factorizes as

P(x) = P(x_a, x_b) P(x_a, x_c) P(x_c, x_e) P(x_c, x_d).   (1.26)
The messages computed by max-product are

m_{d→c}(x_c) = max_{x_d} P(x_c, x_d)   (1.27)
m_{e→c}(x_c) = max_{x_e} P(x_c, x_e)   (1.28)
m_{c→a}(x_a) = max_{x_c} P(x_a, x_c) m_{e→c}(x_c) m_{d→c}(x_c)   (1.29)
m_{b→a}(x_a) = max_{x_b} P(x_a, x_b).   (1.30)
The MAP labels can be found as

x̂_a = arg max_{x_a} m_{b→a}(x_a) m_{c→a}(x_a)   (1.31)
x̂_b = arg max_{x_b} P(x̂_a, x_b)   (1.32)
x̂_c = arg max_{x_c} P(x̂_a, x_c) m_{e→c}(x_c) m_{d→c}(x_c)   (1.33)
x̂_d = arg max_{x_d} P(x̂_c, x_d)   (1.34)
x̂_e = arg max_{x_e} P(x̂_c, x_e).   (1.35)
The sum-product BP algorithm computes the marginal distributions P(x_i) for all variables X_i. It essentially works in a way similar to max-product BP (1.23), except that rather than taking the max, a sum is performed over the different labels:

m_{i→j}(x_j) = Σ_{x_i} P(x_j, x_i) ∏_{k∈N_c(i)} m_{k→i}(x_i)   (1.36)
where N_c(i) is the set of all children of node i. The marginal P(x_i) can be computed by taking the product of the messages sent to the root node i:

P(x_i) = ∏_{k∈N_c(i)} m_{k→i}(x_i).   (1.37)

Now, by successively rearranging the tree so that each node i in turn becomes the root node, P(x_i) can be computed for all nodes i.
1.4 MRFs: Markov Models on Graphs
At the start of the chapter, the choice of graphs for image processing was motivated by the need to establish dependencies between pixels that are nearby in an image. The graphs that were proposed for that purpose are not trees, but contain cycles of edges, typically many cycles, as in figure 1.1. The representation of independence in undirected graphs follows the methodology given above for trees (1.21). The random variable at a node is dependent directly only on random variables in its Markov blanket B_i. The Hammersley-Clifford theorem [98] gives the general form for a distribution with these Markov properties:

P(x) = ∏_{c∈C} F_c(x)   (1.38)
where C is the set of maximal cliques of G, defined to be the set of maximal subgraphs of G that are fully connected in E, and maximal in the sense that no further node can be added to a clique without removing the full connectedness property. Note that for a tree the cliques are simply the edges of the graph, and so the decomposition (1.38) simplifies to the decomposition (1.22) for trees, above. Usually the decomposition is expressed in terms of an energy or cost function E(x):

P(x) ∝ exp(−E(x))   where   E(x) = Σ_{c∈C} ψ_c(x).   (1.39)
More generally, there may be a dependence on parameters θ, so that

P(x) = (1/Z(θ)) exp(−E(x, θ))   where   E(x, θ) = Σ_{c∈C} ψ_c(x_c, θ),   (1.40)

and now the partition function

Z(θ) = Σ_x exp(−E(x, θ))   (1.41)

is included to maintain the normalization condition for the distribution, Σ_x P(x) = 1.
An alternative representation of the undirected graph in figure 1.1 is the factor graph [276]. The undirected graph (or Markov network) makes conditional dependencies explicit, but factorization properties are somewhat implicit: they need to be computed in terms of maximal cliques. Factor graphs are a little more complex in that they introduce a second type of node, the function node, in addition to the nodes for variables, but have the advantage of making factorization explicit. In many of the cases used in this book, for example, figure 1.1a, with 4-way connectivity between pixels, the factorization structure is straightforward. Each edge in that example is a maximal clique, and therefore the factors are functions of the two variables on the nodes belonging to each edge. On the other hand, figure 1.1b, with its 8-way connectivity, has a more complex set of statistical dependencies, and the factors may not simply correspond to edges. The most general factor structure for the Markov model with the statistical dependencies denoted by that graph is a function of the four variables in each of the 2 × 2 squares of variables, which are the maximal cliques of the graph. In computer vision, it is usual to define models directly in terms of their factors, in contrast to normal practice in statistical modeling, where models are defined first in terms of their Markov properties, with factors specified subsequently over maximal cliques, as in (1.38). In the more complex cases of factors involving more than two variables at a time, factor graphs are useful, and are mentioned in chapters 21 and 24. For the most part, though, pairwise factors and simple undirected graphs suffice.
1.4.1 Example: Pseudo-Boolean Energy on a 4-Connected Graph of Pixels
A standard example of an MRF, just about the simplest one that is interesting, is the Ising model with the single parameter θ = {γ}, whose origins are in statistical physics [523]. The state-space consists of Boolean variables x_i ∈ L = {0, 1}, and the energy function E(x) is termed a Pseudo-Boolean Function (PBF), because its input is Boolean but its output is not (the energy is real valued). As for the graph, the maximal cliques are the horizontal and vertical edges of the rectangular graph of pixels shown in figure 1.5a.

Figure 1.5
Simulating a simple model: the Ising model. (a) The horizontal and vertical edges of a rectangular grid form the cliques of the Ising model. (b) Simulations of typical probable states of the Ising model for various values of the coherence parameter γ.

All cliques in this
case are of size 2, containing two nodes (pixels). The clique potentials, referred to in this kind of model as pairwise potentials because the cliques are pairs of pixels, are

ψ_{ij}(x_i, x_j) = γ |x_i − x_j|.   (1.42)

This represents a penalty that increases the energy E wherever adjacent x_i and x_j have different values, and so reduces the joint probability P(x) by a factor e^{−γ}. This enhances the probability of configurations x in which there is large-scale agreement between the values of adjacent pixels. In fact, a moment's thought will make it clear that the total penalty Σ_{(i,j)∈C} ψ_{ij} is simply γ times the total perimeter (in the Manhattan metric) of boundaries separating regions of value 1 from regions of value 0. Thus the distribution P(x) favors configurations x in which that total perimeter is relatively small. The simulations in figure 1.5b show how higher values of γ indeed tend to favor larger regions of 1s and 0s.
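As a quick illustration of (1.42), the following sketch (assuming NumPy; gamma stands in for the coherence parameter γ) evaluates the Ising energy of a binary image as γ times the total boundary perimeter.

```python
import numpy as np

def ising_energy(x, gamma):
    """Ising energy (1.42) of a binary image x: gamma times the number of
    4-connected neighbor pairs with differing labels (the boundary perimeter)."""
    x = np.asarray(x, dtype=int)
    vertical = np.abs(np.diff(x, axis=0)).sum()    # disagreements between vertical neighbors
    horizontal = np.abs(np.diff(x, axis=1)).sum()  # disagreements between horizontal neighbors
    return gamma * (vertical + horizontal)

x = np.zeros((6, 6), dtype=int)
x[2:4, 2:4] = 1                       # a 2x2 block of 1s
print(ising_energy(x, gamma=0.5))     # perimeter 8, so energy 4.0
```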
It is worth saying at this stage, and this will come out repeatedly later, that algorithms for inference with general graphs that contain loops (many loops in the case of the example above) are hard. Even simulation from the Ising model is hard, compared with the equivalent simulation for a chain or tree, which is straightforward and can be done exactly. The simulations of the Ising model above were done using a form of Markov chain Monte Carlo, adapted specifically for the Ising model [165], the Swendsen-Wang iterative sampler [25].
1.5 Hidden MRF Models
A Markov random field P(X = x), as in the previous section, can act as a prior model for a set of hidden random variables X, under a set of observations z, in direct analogy to the HMM model of section 1.2. As with HMMs, the observations are most simply modeled as random variables that are independent when conditioned on the hidden variables X. This is illustrated in figure 1.6, in which the simple 4-connected graph of figure 1.5a appears as a layer of hidden variables x with observations z distributed across sites. Each individual observation z_i depends statistically just on the state x_i of the corresponding pixel. Now the posterior for the state x of the pixels is obtained from Bayes's formula, just as it was for HMMs (1.6), as

P(x | z, θ) ∝ P(z | x, θ) P(x | θ),   (1.43)

with the observation likelihood P(z | x) factorized across sites/pixels as it was earlier (1.9), and including the possibility of multiple models as before. It is common also to express this posterior MRF in terms of a sum of energies, generalizing the prior MRF (1.40) to include terms for the observation likelihood:

P(x | z, θ) = (1/Z(z, θ)) exp(−E(x, z, θ)),   (1.44)

where

E(x, z, θ) = Σ_{c∈C} ψ_c(x, θ) + Σ_i φ_i(x_i, z_i).   (1.45)
Figure 1.6
Two-dimensional hidden Markov model. An MRF on a regular grid, as in figure 1.5, serves here as the prior over hidden variables in a model that is coupled to an array z of observations.
Here Z(z, θ) is the partition function for the posterior MRF. Unlike the HMM (1.8), for which Z can in fact be computed quite easily, computing the partition function Z(z, θ) for the posterior MRF is intractable.
The most common form of inference over the posterior MRF in vision and image-processing problems is Maximum A Posteriori (MAP) estimation. MAP inference of x is done in principle by computing x̂ = arg max P(x | z), or equivalently by minimizing energy:

x̂ = arg min_x E(x, z, θ).   (1.46)

Note that this does not require knowledge of the partition function Z(z, θ), which is just as well, given its intractability. The energy functions for many commonly used Markov models (see examples below) can be written as a sum of unary and pairwise terms:

E(x, z, θ) = Σ_{i∈V} φ_i(x_i, z_i, θ) + Σ_{(i,j)∈E} ψ_{ij}(x_i, x_j, θ).   (1.47)

Algorithms for computing the MAP are discussed in the following sections. Suffice it to say that the MAP can in fact be computed exactly in a time that is polynomial with respect to the size N of the image array, using graph cut, at least when the prior P(x) is an Ising distribution.
1.5.1 Example: Segmentation on a 4-Connected Graph of Pixels
Here we give the simplest useful example of a hidden MRF model, for segmenting an image into foreground and background components. The state-space is Boolean, so x_i ∈ {0, 1} denotes background/foreground labels, respectively. The model, originated by Boykov and Jolly [66], has a number of variants. For tutorial purposes the simplest of them is illustrated here. It uses the Ising model as a prior to encourage the foreground and background components to be as coherent as possible. Thus the terms in the hidden MRF model (1.45) are the ones from the Ising prior (1.42). The likelihood terms (for details, see chapter 7 on MRF models for segmentation) can be specified by constructing histograms h_F(z) and h_B(z) in color space for foreground and background, respectively (taking care to avoid zeros), and setting

φ_i(z_i) = log h_F(z_i) − log h_B(z_i).   (1.48)

The resulting model specifies a posterior, which is maximized to obtain the estimated segmentation x̂, and the resulting method is demonstrably effective, as figure 1.7 shows.
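By way of illustration, here is a minimal sketch of likelihood terms built from histograms in the spirit of (1.48); it assumes NumPy, 8-bit grayscale values for simplicity, and hypothetical arrays fg_samples and bg_samples of labeled pixel values (all names are illustrative, not from the text).

```python
import numpy as np

def histogram_unaries(fg_samples, bg_samples, z, n_bins=32, eps=1e-6):
    """Per-pixel likelihood terms from normalized foreground/background
    histograms h_F and h_B (eps avoids zero bins), in the spirit of (1.48)."""
    bins = np.linspace(0, 256, n_bins + 1)
    h_F, _ = np.histogram(fg_samples, bins=bins)
    h_B, _ = np.histogram(bg_samples, bins=bins)
    h_F = (h_F + eps) / (h_F + eps).sum()
    h_B = (h_B + eps) / (h_B + eps).sum()
    idx = np.clip(np.digitize(z, bins) - 1, 0, n_bins - 1)
    return np.log(h_F[idx]) - np.log(h_B[idx])   # one value per pixel

# Hypothetical labeled samples and a tiny image patch.
fg_samples = np.array([200, 210, 220, 215, 205])
bg_samples = np.array([30, 40, 35, 45, 50])
z = np.array([[210, 40], [55, 205]])
print(histogram_unaries(fg_samples, bg_samples, z))
```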
Figure 1.7
MRF model for bilevel segmentation. (a) An image to be segmented. (b) Foreground and background regions of the image are marked, so x_i in those regions is no longer hidden but observed. The problem is to infer foreground/background labels in the remaining unlabeled region of the trimap. (c) Using simply a color likelihood model learned from the labeled regions, without the Ising prior, the inferred labeling is noisy. (d) Also introducing a pairwise Ising term, and calculating the MAP estimate for the inferred labels, deals substantially with the noise and missing data. (Results of the CRF variant of the Ising term, described below, are illustrated here.)
Figure 1.8
MRF model for image reconstruction. (a) An image with added noise, and a portion masked out. (b) Introducing a truncated quadratic prior and calculating (approximately) the MAP estimate of the hidden variables deals much more effectively with the noise.
1.5.2 Example: Image Reconstruction
A classic example of a multilabel MRF problem is image reconstruction, in which a noisy and/or damaged image requires repair. An example is shown in figure 1.8. In this case the state-space may have x_i ∈ {0, ..., 255}, corresponding to the possible gray values of pixels in the reconstructed image. A suitable model for reconstruction is to choose the unary as follows:

φ_i(x_i) = (x_i − z_i)^2   if z_i is observed,
           0               otherwise.   (1.49)
The pairwise term is chosen to encourage smoothness in the reconstructed image, but not so strongly as to blur across genuine edges. A suitable choice is a truncated quadratic prior of the form

ψ(x_i, x_j) = min((x_i − x_j)^2, ψ_max).   (1.50)

(More detail is given in later chapters, particularly chapter 11.) The special case ψ_max = 1 gives the classic Potts model, which penalizes any nonzero difference between x_i and x_j equally, regardless of magnitude.
1.5.3 Continuous Valued MRFs
So far we have seen examples of MRFs with discrete label values, either Boolean or multivalued (for example, integer). Of course it is natural in many cases to regard hidden variables as continuous. For example, the underlying image in an image reconstruction problem is a physical property of the world, and its values are most naturally regarded as continuous. Visual reconstruction is therefore often cast in terms of continuous variables [172, 54], and in this book, chapters 8, 12, and 13 deal with MRFs over continuous variables. One direct approach is to define Gaussian distributions over the variables, so-called GMRFs, as in chapter 13. Nonlinear variations on that basic theme also have interesting and useful properties.
1.5.4 Conditional Random Field
A Conditional Random Field (CRF) is a form of MRF that defines a posterior for variables x given data z, as with the hidden MRF above. Unlike the hidden MRF, however, the factorization into the data distribution P(z|x) and the prior P(x) is not made explicit [288]. This allows complex dependencies of x on z to be written directly in the posterior distribution, without the factorization being made explicit. (Given P(x|z), such factorizations always exist, however, infinitely many of them, in fact, so there is no suggestion that the CRF is more general than the hidden MRF, only that it may be more convenient to deal with.) One common application of the CRF formalism in vision is in Boolean (foreground/background) segmentation, where it is natural to think of a modification of the Ising prior (1.42) to incorporate some data dependency [65]:

ψ_{ij}(x_i, x_j, z) = γ f(z) |x_i − x_j|,   (1.51)

in which the additional term f(z) ≤ 1 weakens the penalty wherever the image data suggest the presence of a segmentation boundary, for instance, where image contrast is high. This is described in detail in chapter 7. CRFs are also used in scene labeling [284], for example, to label areas of a scene as natural or as man-made. Other applications of CRFs appear in chapters 9, 11, 24, 25, and 27.
1.6 Inference: MAP/Marginals
Given a hidden MRF for some posterior distribution P(x | z), as in (1.44) but omitting the parameters θ, the common inference problems are to estimate either the most probable state

arg max_x P(x | z)   (1.52)

or the marginal distributions at each pixel, P(x_i | z), i = 1, ..., N. This is in direct analogy to MAP inference and inference of marginals for HMMs, as discussed in section 1.2, but closed form algorithms are no longer available for two-dimensional problems. Dropping the explicit mention of the data z, the inference problems are equivalent to estimating the mode and marginals of an MRF P(x), as in (1.39). The remainder of this section outlines various approaches to estimation using this posterior distribution.
1.6.1 Gibbs Sampling
Gibbs sampling is a procedure introduced by Geman and Geman [161] for sampling fairly from an MRF. At successive visits to each site i, the variable x_i is sampled from the local, conditional distribution P(x_i | {x_j, (i, j) ∈ E, j ≠ i}), and all sites are visited, arbitrarily often, in some random order. Asymptotically, after many visits to each site, the set of samples (x_1, ..., x_N) settles down to be a fair sample from the MRF. However, this burn-in process may happen very slowly, and this is a problem for practical application. Further details of Gibbs sampling are given in chapter 5.
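For concreteness, here is a minimal Gibbs sampler sketch for the Ising prior (1.42) on a 4-connected grid, assuming NumPy; it resamples one site at a time from its local conditional, as described above (sites are visited in raster order here for simplicity; a random order also works).

```python
import numpy as np

def gibbs_sweep(x, gamma, rng):
    """One sweep of Gibbs sampling for the Ising prior (1.42).

    For each site, P(x_i = 1 | neighbors) is proportional to
    exp(-gamma * number of neighbors disagreeing with label 1)."""
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            nbrs = []
            if i > 0: nbrs.append(x[i - 1, j])
            if i < H - 1: nbrs.append(x[i + 1, j])
            if j > 0: nbrs.append(x[i, j - 1])
            if j < W - 1: nbrs.append(x[i, j + 1])
            e1 = gamma * sum(1 - n for n in nbrs)   # local energy if x_ij = 1
            e0 = gamma * sum(nbrs)                  # local energy if x_ij = 0
            p1 = np.exp(-e1) / (np.exp(-e0) + np.exp(-e1))
            x[i, j] = int(rng.random() < p1)
    return x

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(32, 32))
for _ in range(50):              # burn-in sweeps
    x = gibbs_sweep(x, gamma=0.9, rng=rng)
```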
1.6.2 Mean Field Approximation
Mean field approximation is a form of variational approximation in which the distribution P(x) is approximated as a product of factors,

P(x) ≈ ∏_i b_i(x_i),   (1.53)

and the factors become the approximations to the marginals P(x_i) of the distribution. Of course this factorized form cannot represent the posterior exactly, but mean field algorithms choose the b_i in such a way as to approximate the posterior as closely as possible. This is done by minimizing KL divergence, a single numerical measure of difference between the true and approximate distributions. Full details of this important approach are given in chapter 5.
Classical Algorithms for MAP Inference  The MAP solution of a hidden Markov model can be found by minimizing an energy function (1.45). A number of algorithms exist for solving these minimization problems defined over discrete or continuous random variables. Some of the best-known of these algorithms will be reviewed in this section. Comparisons of their performance on various test problems will be given in chapter 11.
1.6.3 Iterated Conditional Modes (ICM)
Iterated Conditional Modes (ICM) is one of the oldest and simplest MAP inference algorithms [38, 235]. It belongs to the family of local (also called move-making) algorithms that start with an initial solution. At each step these algorithms explore a space of possible changes (also called a move space) that can be made to the current solution x^c, and choose a change (move) that leads to a new solution x^n having the lowest energy. This move is referred to as the optimal move. The algorithm is said to converge when no solution with a lower energy can be found.
ICM works in a coordinate descent fashion. At each step it chooses a variable X_i, i ∈ V. Keeping the values of all other variables fixed, it finds the label assignment for X_i that leads to the lowest energy:

x^n_i = arg min_{x_i ∈ L} E({x^c_j : j ≠ i}, x_i).   (1.54)

This process is repeated for other variables until the energy cannot be decreased further (i.e., all variables are assigned labels that are locally optimal). Note that the descent step is efficient in that it need involve only variables in the Markov blanket of i.
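A minimal ICM sketch for a pairwise grid energy of the form (1.47) is shown below, assuming NumPy; unary is an H x W x K array of unary costs and pairwise(a, b) is any pairwise cost function (both illustrative names, not from the text).

```python
import numpy as np

def icm(unary, pairwise, n_iters=10):
    """Iterated Conditional Modes on a 4-connected grid.

    unary[i, j, k]  = cost of assigning label k to pixel (i, j)
    pairwise(a, b)  = cost of labels a, b on neighboring pixels
    """
    H, W, K = unary.shape
    x = unary.argmin(axis=2)               # initialize with unary-only labels
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                costs = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        costs += np.array([pairwise(k, x[ni, nj]) for k in range(K)])
                x[i, j] = costs.argmin()   # coordinate descent step, as in (1.54)
    return x

# Tiny demo: 2-label denoising with a Potts-style pairwise cost.
rng = np.random.default_rng(1)
noisy = (rng.random((8, 8)) < 0.5).astype(int)
unary = np.stack([(noisy != 0).astype(float), (noisy != 1).astype(float)], axis=2)
labels = icm(unary, lambda a, b: 0.5 * (a != b))
```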
1.6.4 Simulated Annealing
The ICM algorithm makes changes to the solution only if they lead to solutions having lower energy. This greedy characteristic makes it prone to getting stuck in local minima. Intuitively, an energy minimization algorithm that can jump out of local minima will do well in problems with a large number of local optima. Simulated annealing (SA) is one such class of algorithms. Developed as a general optimization strategy by Kirkpatrick et al. [234], simulated annealing allows changes to the solution that lead to a higher energy with a certain probability. This probability is controlled by a parameter T that is called the temperature. A higher value of T implies a high probability of accepting changes to the solution that lead to a higher energy. The algorithm starts with a high value of T and reduces it to 0 as it proceeds. When T = 0 the algorithm accepts only changes that lead to a decrease in energy, as, for example, ICM does. Details are given in chapter 5.
1.6.5 Loopy Belief Propagation
We have seen how the sum-product and max-product message-passing algorithms can be used to perform inference in tree-structured graphs. Although these algorithms are guaranteed only to find the optimal solutions in tree-structured graphs, they have been shown to be effective heuristics for MAP/marginal estimation even in graphs with loops, such as the grid graphs found in computer vision.
The max-product algorithm can be used to minimize general energy functions approximately. Since an energy function is the negative log of posterior probability, it is necessary only to replace the product operation with a sum operation, and the max operation with a min. For the pairwise energy function (4.1) the message m_{j→i} from a variable X_j to any neighboring variable X_i is computed as

m_{j→i}(x_i) = min_{x_j} [ φ_j(x_j) + ψ_{ij}(x_i, x_j) + Σ_{k∈N(j)\{i}} m_{k→j}(x_j) ].   (1.55)
These messages can also be used to compute a (min-sum) analogue of the belief, called the min-marginal:

M(x_i) = min_{x \ {x_i}} E(x).   (1.56)
The min-marginal can be computed from the messages as

M_i(x_i) = φ_i(x_i) + Σ_{k∈N(i)} m_{k→i}(x_i).   (1.57)
The min-marginal can be used to find an alternative estimate to the MAP assignment of a variable, as follows:

x̂_i = arg min_{x_i} M_i(x_i).   (1.58)
Chapter 10 will describe an application of Bayesian belief propagation to quickly find approximate solutions to two-dimensional vision problems, such as image superresolution using patch priors, and for the problem of transferring the style of a picture (e.g., a painting) to another picture (e.g., a photograph). Those results have inspired a lot of work on the theoretical properties of the way in which BP operates in a general probabilistic model [517]. For instance, it was shown that when sum-product BP converges, the solution is a local minimum of the Bethe approximation [539] of the free energy defined by the Markov model. More recently, a number of researchers have shown the relationship between message-passing algorithms and linear programming relaxation-based methods for MAP inference [504, 536]. This relationship is the subject of chapter 6.
1.7 MAP Inference in Discrete Models
Many problems in vision and machine learning give rise to energies defined over discrete random variables. The problem of minimizing a function of discrete variables is in general NP-hard, and has been well studied in the discrete optimization and operations research communities. A number of greedy and approximate techniques have been proposed for solving these discrete optimization problems. An introduction to these methods is given in this section.
Although minimizing a function of discrete variables is NP-hard in general, there are families of energy functions for which it can be done in polynomial time. Submodular functions are one such well-known family [155, 324]. In some respects they are analogous to the convex functions encountered in continuous optimization, and have played a big part in the development of efficient algorithms for estimating the MAP solution of many image labeling problems. For instance, the Ising energy described earlier (1.42) is an important example in vision and image processing of a submodular function.
It is worth reflecting for a moment that this approach gives an exact solution to inference problems at image scale. This is quite remarkable, given how rare it is that realistic scale information problems in the general area of machine intelligence admit exact solutions. Graph cut was first used to give exact solutions for Boolean MRF problems over images by Greig et al. [176], at a time when the method was so slow that it could be used only to benchmark faster, approximate algorithms. Since then, the progress in understanding and developing graph cut algorithms in general, and for images specifically, allows these algorithms to be regarded as highly practical. Even real-time operation is possible, in which several million pixels in an image or video are processed each second.
1.7.1 Submodular Pseudo-Boolean Functions
The minimization of a Pseudo-Boolean Function (PBF) E: {0, 1}^n → R is a well-studied problem in combinatorial optimization [215] and operations research [61]. A PBF f: {0, 1}^n → R is submodular if and only if, for all label assignments x_A, x_B ∈ {0, 1}^n, the function satisfies the condition

f(x_A) + f(x_B) ≥ f(x_A ∨ x_B) + f(x_A ∧ x_B),   (1.59)

where ∨ and ∧ are componentwise OR and AND, respectively. From the above definition it can easily be seen that all PBFs of arity 1 are submodular. Similarly, any PBF of arity 2 is submodular if and only if

f(1, 0) + f(0, 1) ≥ f(1, 1) + f(0, 0).   (1.60)
Another interesting property of submodular functions is that the set of submodular functions is closed under addition (i.e., the sum of two or more submodular functions is another submodular function). This condition implies that the energy of a pairwise MRF (1.47) is submodular if all the pairwise potentials ψ_{ij} are submodular. For example, the Ising model pairwise potential consists of terms ψ_{ij}(x_i, x_j) = γ |x_i − x_j| (1.42), which are submodular, as is apparent from substituting f(x, x') = |x − x'| into (1.60). Hence the entire Ising potential function is submodular.
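A tiny sketch of the arity-2 condition (1.60) in Python, checking the Ising pairwise term and a deliberately nonsubmodular table (the function names are illustrative):

```python
def is_submodular_pairwise(f):
    """Check condition (1.60) for a pairwise pseudo-Boolean function f(x_i, x_j)."""
    return f(1, 0) + f(0, 1) >= f(1, 1) + f(0, 0)

gamma = 0.7
ising = lambda a, b: gamma * abs(a - b)              # pairwise term of (1.42)
print(is_submodular_pairwise(ising))                 # True

attract_opposites = lambda a, b: gamma * (a == b)    # rewards disagreement instead
print(is_submodular_pairwise(attract_opposites))     # False
```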
1.7.2 Minimizing Submodular Pseudo-Boolean Functions Using Graph Cut
The first polynomial time algorithm for minimizing submodular functions [215, 427] had high practical runtime complexity. Although recent work has been partly successful in reducing the complexity of algorithms for general submodular function minimization, they are still quite expensive computationally and cannot practically be used for large problems. For instance, one of the best algorithms for general submodular function minimization has a worst case complexity O(n^5 Q + n^6), where Q is the time taken to evaluate the function [358]. This is certainly too expensive for use in computer vision, where n is frequently of the order of millions (of pixels).
However, there is one important subclass of submodular functions that can be optimized much more efficiently: the submodular functions of arity at most 2, corresponding to MRFs with factors that are functions of at most two variables. This is the common case that was illustrated in the graphs in figure 1.1. Optimization over this class is known as Quadratic Pseudo-Boolean Optimization (QPBO). It turns out to be equivalent to finding the minimum cost cut in a certain graph [61, 68, 211, 421], the so-called s-t min-cut problem. This is described in detail in chapter 2. It leads to efficient algorithms for finding the MAP solution of many important pairwise MRF models for vision problems, for example, foreground/background image segmentation problems like the one illustrated in figure 1.7.
Figure 1.9
Energy minimization using graph cut. The figure shows how individual unary and pairwise terms of an energy function taking two binary variables are represented and combined in the graph. Multiple edges between the same nodes are merged into a single edge by adding their weights. For instance, the cost w_1 of the edge (s, x_a) in the final graph is equal to w_1 = θ_{a;0} + θ_{ab;00}. The cost of an s-t cut in the final graph is equal to the energy E(x) of the configuration x the cut induces. The minimum cost s-t cut induces the least-energy configuration x for the energy function.

A general QPB function can be expressed as:

E(x) = θ_const + Σ_{i∈V} (θ_{i;1} x_i + θ_{i;0} x̄_i) + Σ_{(i,j)∈E} (θ_{ij;11} x_i x_j + θ_{ij;01} x̄_i x_j + θ_{ij;10} x_i x̄_j + θ_{ij;00} x̄_i x̄_j),   (1.61)

where x̄_i defines the complementary variable x̄_i = 1 − x_i. Parameter θ_{i;a} is the penalty for assigning label a to latent variable x_i, and θ_{ij;ab} is the penalty for assigning labels a and b
to the latent variables x_i and x_j, respectively. To minimize this energy with s-t min-cut, the individual unary and pairwise terms of the energy function are represented by weighted edges in the graph. Multiple edges between the same nodes are merged into a single edge by adding their weights. The graph construction in the simple case of a two-variable energy function is shown in figure 1.9. In this graph an s-t cut is defined to be a curve that separates the source node s from the terminal node t. The cost of such a cut is defined to be the sum of the weights of edges traversed by the cut. The s-t cut with the minimum cost provides the minimum solution x* by disconnecting a node x_i either from s, representing the assignment x_i = 0, or from t, representing x_i = 1. The cost of that cut corresponds to the energy of the solution E(x*) modulo the constant term θ_const. Algorithms for finding the s-t min-cut require that all edges in the graph have nonnegative weights. This condition results in a restriction that the energy function E be submodular. This is all explained in detail in chapter 2.
1.8 Solving Multilabel Problems Using Graph Cut
Many computer vision problems involve latent variables that can take values in integer spaces. The energy functions corresponding to these problems are defined over multistate variables. Graph cut-based algorithms for minimizing such functions can be divided into two broad categories: transformation methods and move-making algorithms. Transformation methods transform the energy minimization problem defined over multistate variables X_i ∈ L = {l_1, ..., l_k}, k ≥ 2, to one that is defined over k binary variables per multistate variable. This is done by encoding different states of the multistate variables by the assignments of a set of Boolean variables; a full discussion is given in chapter 4. Move-making algorithms, on the other hand, work by decomposing the problem into a set of problems defined over Boolean random variables, which, as we have seen, can be solved efficiently using s-t min-cut algorithms. In practice, move making has been the dominant approach for multilabel problems.
1.8.1 Graph Cut-Based Move-Making Algorithms
The key characteristic of any move-making algorithm is the number of possible changes it can make in any step (also called the size of the move space). A large move space means that extensive changes can be made to the current solution. This makes the algorithm less prone to getting stuck in local minima and also results in a faster rate of convergence. Boykov et al. [72] proposed two move-making algorithms, α-expansion and αβ-swap, whose move space size increases exponentially with the number of variables involved in the energy function. The moves of the expansion and swap algorithms can be encoded as a vector of binary variables t = {t_i, ∀i ∈ V}. The transformation function T(x^c, t) of a move algorithm takes the current labeling x^c and a move t, and returns the new labeling x^n that has been induced by the move.
The expansion algorithm has one possible move per label α ∈ L. An α-expansion move allows any random variable either to retain its current label or to take the label α. The transformation function T_α() for an α-expansion move transforms the current label x^c_i of any random variable X_i as

x^n_i = T_α(x^c_i, t_i) = α       if t_i = 0,
                          x^c_i   if t_i = 1.   (1.62)

One iteration of the algorithm involves making moves for all α in L successively in some order.
The swap algorithm has one possible move for every pair of labels α, β ∈ L. An αβ-swap move allows a random variable whose current label is α or β to take either label α or label β. The transformation function T_{αβ}() for an αβ-swap transforms the current label x^c_i of a random variable x_i as

x^n_i = T_{αβ}(x^c_i, t_i) = α   if x^c_i = α or β and t_i = 0,
                             β   if x^c_i = α or β and t_i = 1.   (1.63)

One iteration of the algorithm involves performing swap moves for all α and β in L successively in some order.
The energy of a move t is the energy of the labeling x^n that the move t induces, that is, E_m(t) = E(T(x^c, t)). The move energy is a pseudo-Boolean function (E_m : {0, 1}^n → R) and will be denoted by E_m(t). At each step of the expansion and swap-move algorithms, the optimal move t* (i.e., the move decreasing the energy of the labeling by the greatest amount) is computed. This is done by minimizing the move energy, that is, t* = arg min_t E(T(x^c, t)).
24 1 Introduction to Markov Random Fields
energy is submodular, and hence the optimal move t

can be computed in polynomial time


using s-t min-cut algorithms. More details on the expansion and swap algorithms can be
found in chapter 3.
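To make the structure of these algorithms concrete, the following Python sketch shows the α-expansion outer loop built around the transformation function (1.62). It is only an illustration: the helper optimal_expansion_move is a hypothetical placeholder for a solver that minimizes the binary move energy with a graph cut, and the energy callable is assumed to evaluate E(x) for a labeling stored as a NumPy array.

import numpy as np

def apply_expansion_move(x_cur, t, alpha):
    """Transformation function T_alpha of (1.62): t_i = 0 takes label alpha,
    t_i = 1 keeps the current label."""
    x_new = x_cur.copy()
    x_new[t == 0] = alpha
    return x_new

def alpha_expansion(x_init, labels, energy, optimal_expansion_move, max_iters=10):
    """Iterate expansion moves over all labels until the energy stops decreasing."""
    x = x_init.copy()
    best_energy = energy(x)
    for _ in range(max_iters):
        improved = False
        for alpha in labels:
            # t* = argmin_t E(T_alpha(x, t)); assumed to be computed by a
            # graph cut on the (submodular) binary move energy.
            t = optimal_expansion_move(x, alpha)
            x_candidate = apply_expansion_move(x, t, alpha)
            e = energy(x_candidate)
            if e < best_energy:
                x, best_energy, improved = x_candidate, e, True
        if not improved:
            break
    return x, best_energy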
1.9 MAP Inference Using Linear Programming
Linear programming (LP) is a popular method for solving discrete optimization problems.
It has also been extensively used for MAP inference in discrete Markov models. To convert
the energy minimization problem into an LP, we will first need to formulate it as an integer
program (IP). This is illustrated for the pairwise energy function (4.1). The energy function
is first linearized using binary indicator variables y_{i;a} and y_{ij;ab} for all i, j ∈ V and l_a,
l_b ∈ L. The indicator variable y_{i;a} = 1 if x_i = l_a, and y_{i;a} = 0 otherwise. Similarly, the
variables y_{ij;ab} indicate the label assignment x_i = l_a, x_j = l_b. The resulting IP can be
written as
Minimize  ∑_{i∈V, l_a∈L} φ_i(l_a) y_{i;a} + ∑_{(i,j)∈E, l_a,l_b∈L} ψ_ij(l_a, l_b) y_{ij;ab},      (1.64)

subject to

∑_{l_a∈L} y_{ij;ab} = y_{j;b}      ∀(i, j) ∈ E, ∀l_b ∈ L,      (1.65)

∑_{l_b∈L} y_{ij;ab} = y_{i;a}      ∀(i, j) ∈ E, ∀l_a ∈ L,      (1.66)

∑_{l_a∈L} y_{i;a} = 1      ∀i ∈ V,      (1.67)

y_{i;a}, y_{ij;ab} ∈ {0, 1}      ∀i ∈ V, ∀(i, j) ∈ E, ∀l_a, l_b ∈ L.      (1.68)
The constraint (1.65) enforces consistency over marginalization, and constraint (1.67)
makes sure that each variable is assigned exactly one label.
Relaxing the integrality constraints (1.68) of the IP leads to an LP problem that can be
solved in polynomial time using general-purpose LP solvers. These solvers are relatively
computationally expensive, which makes this approach inappropriate for vision problems
that involve a large number of variables. A number of message-passing and maximum flow-
based methods have recently been developed to efficiently solve the LP defined above [270,
277, 504, 520]. See chapters 6 and 17 for more details, including discussion of when the
solution to the LP is also a solution to the IP.
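As a small illustration of this relaxation, the sketch below sets up the LP (1.64)-(1.68) with the integrality constraints (1.68) relaxed to the interval [0, 1], for a single pair of variables with two labels, and hands it to a general-purpose solver. It assumes a recent SciPy is available; the numeric potentials are invented purely for the example. Because this toy instance is tree-structured, the relaxation is tight, so an integral optimum exists.

import numpy as np
from scipy.optimize import linprog

# variable order: [y_i0, y_i1, y_j0, y_j1, y_00, y_01, y_10, y_11]
phi_i = np.array([0.0, 2.0])                 # unary potentials of node i
phi_j = np.array([1.0, 0.0])                 # unary potentials of node j
psi = np.array([[0.0, 1.0], [1.0, 0.0]])     # Potts-like pairwise term

c = np.concatenate([phi_i, phi_j, psi.ravel()])

A_eq = np.array([
    # (1.65): sum over a of y_{ij;ab} equals y_{j;b}
    [0, 0, -1, 0, 1, 0, 1, 0],   # b = 0
    [0, 0, 0, -1, 0, 1, 0, 1],   # b = 1
    # (1.66): sum over b of y_{ij;ab} equals y_{i;a}
    [-1, 0, 0, 0, 1, 1, 0, 0],   # a = 0
    [0, -1, 0, 0, 0, 0, 1, 1],   # a = 1
    # (1.67): each node takes exactly one label
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
], dtype=float)
b_eq = np.array([0, 0, 0, 0, 1, 1], dtype=float)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8, method="highs")
print(res.x, res.fun)   # optimal value 1.0 for this toy instance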
1.9.1 Nonsubmodular Problems in Vision
We have seen how submodular pseudo-Boolean functions can be minimized using algo-
rithms for computing the s-t min-cut in weighted graphs. Although this is an extremely
useful method, energy functions corresponding to many computer vision problems are not
submodular. We give two examples here by way of motivation. The first is the fusion move
described in chapter 18, in which two trial solutions to a problem such as optimal image
denoising are available, and a final solution is to be constructed by combining the two,
pixel by pixel. A binary array is constructed that switches each pixel between its values in
the first and second solutions. The original denoising functional now induces a functional
over the Boolean array that will not, in general, be submodular (see chapter 18 for details).
A second example of an important nonsubmodular problem arises in the general case of
multilabel optimization, as discussed in the previous section. (Details are given in chapter 3.)
Minimizing a general nonsubmodular energy function is an NP-hard problem. We also
saw earlier how general energy minimization problems can be formulated in terms of an IP.
Relaxing the integrality constraints of the IP leads to an LP problem, and this is attractive
because there is no requirement for the IP energy to be submodular. Chapter 2 will explain
a particular relaxation of QPBO called the roof dual [61, 261], which leads to an LP that
can be solved efficiently using s-t min-cut algorithms.
1.10 Parameter Learning for MRFs
Learning the parameters θ of an MRF (or CRF) P(x | z, θ) from labeled training data is an
important problem. The alternative is to attempt to select the parameters by hand or by
experiment; this is known to be difficult, and quite impractical if the dimensionality of θ is
at all large. It also proves to be a challenging problem. Note that here the learning problem is
rather different from the parameter learning problem for HMMs discussed earlier. It is a little
easier in that the data are labeled (that is, the x-values are known for each z in the training
set), so the previously hidden variables x in the model may become observed variables in
the context of learning. In contrast, in HMM parameter learning, the state values x were
unknown. But it is a harder problem in that the estimation is done over a graph that is not
merely a tree, so the underlying inference problem is not exactly tractable.
1.10.1 Maximum Likelihood
A standard approach to parameter learning is via Maximum Likelihood Estimation (MLE),
that is, by solving

max_θ L(θ)   where   L(θ) = P(x | z, θ).      (1.69)

Now, from (1.40),

log L(θ) = −log Z(θ) − ∑_{c∈C} ψ_c(x_c, θ),      (1.70)

and differentiating this to maximize the likelihood L w.r.t. θ is entirely feasible for the
second term, as it is decomposed into a sum of local terms, but generally intractable for the
first term, the log partition function that was defined earlier (1.41). This is well known, and
is fully described in standard texts on Markov random fields [523, 309].
As a result, it is necessary to approximate the likelihood function in order to maximize
it. Probably the best-known classical approximation is pseudo likelihood, in which L is
replaced by

L̃(θ) = ∏_i P(x_i | {x_j, (i, j) ∈ E}),      (1.71)

a product of local, conditional densities. Its log is a sum of local functions, which can
tractably be differentiated in order to maximize with respect to parameters. This pseudo
likelihood function does not itself approximate the likelihood L, but its maximum is known
to approximate the maximum of L under certain conditions [523].
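The appeal of the pseudo likelihood (1.71) is that each factor involves only one variable and its neighbors, so the objective can be evaluated (and differentiated with respect to θ) node by node. The sketch below evaluates the log pseudo likelihood of a labeling under a pairwise model; the callables phi and psi, standing in for the unary and pairwise potentials, are assumptions made purely for this illustration.

import math

def log_pseudo_likelihood(x, labels, edges, phi, psi):
    """Log of (1.71) for a labeling x (list), with energies -log P up to normalization."""
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, []).append(j)
        neighbors.setdefault(j, []).append(i)
    total = 0.0
    for i in range(len(x)):
        # local conditional energy of each candidate label given the neighbors
        energies = [phi(i, a) + sum(psi(i, j, a, x[j]) for j in neighbors.get(i, []))
                    for a in labels]
        e_obs = phi(i, x[i]) + sum(psi(i, j, x[i], x[j]) for j in neighbors.get(i, []))
        log_z_i = math.log(sum(math.exp(-e) for e in energies))
        total += -e_obs - log_z_i
    return total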
Alternative approximate learning schemes have been proposed by others. For example,
Kumar et al. [283] propose three different schemes to approximate the gradients of the log
likelihood function in the case of pairwise energy functions, such as the one used for image
segmentation (1.47). The derivatives of the log likelihood are easily shown to be a function
of the expectations ⟨x_i⟩_{θ;z}. If we were given the true marginal distributions P_i(x_i | z, θ), we
could compute the exact expectations:

⟨x_i⟩_{θ;z} = ∑_{x_i} x_i P_i(x_i | z, θ).      (1.72)
This is generally infeasible, so the three proposed approximations are as follows.
1. Pseudo Marginal Approximation (PMA) Pseudo marginals obtained from loopy belief
propagation are used instead of the true marginals for computing the expectations.
2. Saddle Point Approximation (SPA) The label of the random variable in the MAP solu-
tion is taken as the expected value. This is equivalent to assuming that all the mass of the
distribution P_i(x_i | z, θ) is on the MAP label.
3. Maximum Marginal Approximation (MMA) In this approximation the label having the
maximum value under the pseudo marginal distribution obtained from BP is assumed to be
the expected value.
1.10.2 Max-Margin Learning
An alternative approach to parameter learning that has become very popular in recent years is
max-margin learning. Similar to ML estimation, max-margin learning also uses inference in
order to compute a gradient with respect to the parameters; however, only the MAP labeling
need be inferred, rather than the full marginals.
Most margin-based learning algorithms proposed and used in computer vision are inspired
by the structured support vector machine (SVMstruct) framework of Tsochantaridis
et al. [484] and the maximum margin network learning of Taskar et al. [14, 473]. The goal
of these methods is to learn the parameters so that the ground truth has the lowest energy
by the largest possible margin or, if that is not possible, that the energy of the ground
truth is as close as possible to that of the minimum energy solution. More formally,

max_{θ : ||θ||=1} γ   such that      (1.73)

E(x, z; θ) ≥ E(x̄, z; θ) + γ   ∀x ≠ x̄,      (1.74)

where x̄ is the MAP estimate for x. A detailed discussion of max-margin learning for MRFs
is given in chapter 15.
This concludes the introduction to Markov random fields. We have tried to set out the
main ideas as a preparation for the more detailed treatment that follows. In part I of the book,
some of the main algorithms for performing inference with MRFs are reviewed. Then in part
II, to reward the reader for hard work on the algorithms, there is a collection of some of the
most successful applications of MRFs, including segmentation, superresolution, and image
restoration, together with an experimental comparison of various optimization methods on
several test problems in vision. Part III discusses some more advanced algorithmic topics,
including the learning of parameter values to tune MRF models, and some approaches to
learning and inference with continuous-valued MRFs. Part IV addresses some of the limita-
tions of the strong locality assumptions in the small-neighborhood MRFs discussed in this
introduction and in many of the earlier chapters of the book. This includes going beyond
pairwise functions in the MRF factorization, to ternary functions and higher, and to models
that, though sparse, do not restrict neighborhoods to contain pixels that are nearby in the
image array. Finally, part V is a showcase of some applications that use MRFs in more
complex ways, as components in bigger systems or with multiterm energy functions, each
term of which enforces a different regularity of the problem: for instance, spatial and
temporal terms in multiview video, or acting over multiple layers of hidden variables for
simultaneous recognition and segmentation of objects.
1.11 Glossary of Notation
Some of the basic notation for Markov random fields is listed below. It is the notation used
in this introductory chapter, and also is used frequently in later chapters of the book.
Symbol : Meaning
G = ⟨V, E⟩ : graph of MRF nodes (sites) V and edges E
i ∈ V : index for nodes/sites
(i, i′) ∈ E or (i, j) ∈ E : edges linking nodes of G
c ∈ C : cliques of G, so each c ⊂ V
z = (z_1, . . . , z_i, . . . , z_N) : image data, pixel value (monochrome or color) at site i
X = (X_1, . . . , X_i, . . . , X_N) : random state variables at site (pixel) i
x = (x_1, . . . , x_i, . . . , x_N) : values taken by state variables, that is, X_i = x_i
L = {l_1, . . . , l_k, . . . , l_K} : label values for discrete MRF, so x_i ∈ L
P(X | z) : posterior probability distribution for state given data
P(x | z) : shorthand for P(X = x | z)
P(x | z) = (1/Z(z)) exp −E(x, z) : Gibbs energy form of the MRF
Z(θ) or Z(θ, z) : partition function in Gibbs form of MRF
E(x) = U(x) + V(x) : Gibbs energy: unary and neighbor terms
E(x, z) = U(x, z) + V(x, z) : Gibbs energy where dependence on data z is explicit
θ, P(x | z, θ), E(x, z, θ) : parameters for MRF model
φ_i(x_i) : unary potential for MRF, so U(x) = ∑_i φ_i(x_i)
ψ_ij(x_i, x_j) : pairwise potentials, so V(x) = ∑_{ij} ψ_ij(x_i, x_j)
ψ_c(x_c) : higher-order potentials
V(x) = ∑_{c∈C} ψ_c(x_c) : general form of Gibbs term as a sum over cliques
y_i : auxiliary Boolean variables, for example, for label expansion schemes
I Algorithms for Inference of MAP Estimates for MRFs
Part I of this book deals with some of the fundamental issues concerning inference in the
kinds of MRFs that are used in vision. In machine learning generally, inference is often
taken to refer to inference of an entire probability distribution P(x | z) or projections of that
distribution as marginal distributions [46] over subsets of variables x_i. This is a general view
of inference that allows the output of the inference process to be treated as an intermediate
result capable of refinement by fusion with further information as it subsequently may
become available. In principle this approach can be taken with images where x is an array of
pixels or some other high-dimensional array of image-related variables such as superpixels.
In practice, for images the dimension of x is so high that it is quite infeasible to represent
the full posterior distribution. One approach is to use Monte Carlo samplers, as described
in chapter 1, but it is unlikely that burn-in can be achieved on any practical timescale. For
low-dimensional information extracted from images, such as curves, it may be practical
to represent the full posterior, either exactly or via samplers [51]. This book, however, is
mostly concerned with inferences over the whole image, and the most that can practically
be done is to infer (approximate) marginals over individual pixels or, conceivably, small
sets of pixels. This idea is mentioned in chapter 1 and, in more detail, in chapter 5. There
are some possible applications for pixelwise marginals in parameter learning (discussed in
part III), but on the whole they have not been greatly used in vision. This discussion is all
by way of explaining why part I, and indeed much of the book, restricts inference to MAP
estimation.
As pointed out in chapter 1, maximizing the posterior P(x | z) is equivalent to minimizing
energy E(x, z), so part I begins with chapter 2 explaining in detail the idea, introduced in
chapter 1, of solving pseudo-Boolean optimization exactly, using a graph cut algorithm. The
basic graph cut mechanism that gives exact solutions for the minimization of submodular
pseudo-Boolean functions can also give exact solutions when even the Boolean domain
is replaced by a multilabel domain such as the integers. The conditions under which this
can be done place tight restrictions on the objective function E(x), and are detailed in
chapter 4. Under less strict conditions on E(x), pseudo-Boolean graph cut can be used as an
exact partial optimization step, a move (see chapter 1), to solve a multilabel optimization
problem. However, the overall solution will be only approximately optimal, as chapter 3
explains.
Graph cut is one of the dominant approaches for inference in MRFs for vision; two others
are Loopy Belief Propagation (LBP) and mean field approximation, which were introduced
briefly in chapter 1. Chapter 5 describes both methods and explains that they are related to
energy minimization, but with a different kind of objective function known as a free energy.
As an alternative to these deterministic methods, stochastic estimation using Markov Chain
Monte Carlo (MCMC) is also explained. The final chapter in part I, chapter 6, expands on
the idea of Linear Programming (LP) relaxation as a means of embedding an integer-valued
optimization problem on a graph in a more tractable, continuous-valued optimization. Inter-
estingly, it turns out that this approach is closely related to BP and that variants of BP can
be used as efficient means of solving the LP.
2 Basic Graph Cut Algorithms
Yuri Boykov and Vladimir Kolmogorov
This chapter describes efficient methods for solving an inference problem for pairwise
MRF models (equation 1.55) in a simple special case when the state variables x_i are binary.
Many problems in computer vision can be represented by such binary models (see part III).
One basic example is segmentation of an image into object and background pixels, as
in chapter 7. The inference problem can be formulated as finding a binary labeling x in
which the energy E defined in chapter 1 (equation 1.47) achieves its minimum for some given
observations z_i and for some fixed parameter θ. This section describes efficient opti-
mization techniques for binary pairwise models using standard graph cut algorithms from
combinatorial optimization.
This chapter uses a different representation of the energy E that is more convenient for
binary variables x. This representation is common in the combinatorial optimization literature,
in which algorithms for binary pairwise energies have been actively studied for forty years;
for example, see the survey in [61]. Without loss of generality, assume that the binary state
variables x_i take two values, 0 and 1. Then the energy E can be written as a quadratic pseudo-
Boolean function, as in (2.1):
E(x) = θ_const + ∑_{i∈V} (θ_{i;1} x_i + θ_{i;0} x̄_i)
     + ∑_{(i,j)∈E} (θ_{ij;11} x_i x_j + θ_{ij;01} x̄_i x_j + θ_{ij;10} x_i x̄_j + θ_{ij;00} x̄_i x̄_j),      (2.1)
where x̄_i = 1 − x_i denotes the negation of variable x_i ∈ {0, 1}. "Boolean" refers to the
fact that variables x_i can take only two values, and "quadratic" means that there are
only unary (e.g., θ_{i;1} x_i) and quadratic (e.g., θ_{ij;11} x_i x_j) terms. For binary-valued labelings
the energy function (equation 1.47) is equivalent to (2.1) with constants θ_{i;1} = φ_i(1, z_i, θ),
θ_{i;0} = φ_i(0, z_i, θ), θ_{ij;11} = ψ_ij(1, 1, z_i, z_j, θ), θ_{ij;01} = ψ_ij(0, 1, z_i, z_j, θ), θ_{ij;10} =
ψ_ij(1, 0, z_i, z_j, θ), θ_{ij;00} = ψ_ij(0, 0, z_i, z_j, θ), assuming that observations z and parameters θ
are known. Note that the energy (2.1) can also be written as

E(x) = θ_const + ∑_{i∈V} θ_{i;x_i} + ∑_{(i,j)∈E} θ_{ij;x_i x_j}.      (2.1′)
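For several of the constructions in this chapter we will illustrate the steps with small Python fragments. The following minimal example evaluates the energy (2.1′) for a binary labeling; the dictionary-based representation of θ (theta_u for unary terms, theta_p for pairwise terms, plus a constant) is our own convention for these sketches, not notation from the text, and the numbers in the example are made up.

def energy(x, theta_const, theta_u, theta_p):
    """Evaluate E(x | theta) of (2.1') for a binary labeling x (list of 0/1)."""
    value = theta_const
    for i, u in theta_u.items():          # unary terms theta_{i; x_i}
        value += u[x[i]]
    for (i, j), t in theta_p.items():     # pairwise terms theta_{ij; x_i x_j}
        value += t[(x[i], x[j])]
    return value

# Example: a two-pixel "object/background" model with a smoothness term.
theta_u = {0: {0: 1.0, 1: 4.0}, 1: {0: 5.0, 1: 2.0}}
theta_p = {(0, 1): {(0, 0): 0.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 0.0}}
print(energy([0, 1], 0.0, theta_u, theta_p))   # 1 + 2 + 3 = 6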
In this chapter we will see that for an important class of pseudo-Boolean functions the
minimization problem (equation 1.1) can be reduced to a min-cut/max-flow problem on
graphs. This is a classical combinatorial optimization problem with many applications even
outside of computer vision. It can be solved in polynomial time, and many efficient algo-
rithms have been developed since 1956 [146]. New fast algorithms for the min-cut/max-flow
problem are actively being researched in the combinatorial optimization community [11].
Existing algorithms vary in their theoretical complexity and empirical efficiency on differ-
ent types of graphs. Some efficient algorithms were developed specifically for sparse grids
common in computer vision [68, 120].
This chapter describes the general s-t min-cut problem and standard algorithms for solv-
ing it, and shows how they can be used to minimize pseudo-Boolean functions like (2.1).
The structure is as follows:
• Section 2.1 covers terminology and other basic background material on the general s-t
min-cut problem required to understand this and several later chapters. This section also
gives an overview of standard combinatorial algorithms for the s-t min-cut problem,
focusing on methods known to be practical for regular grid graphs common in vision.
• Section 2.2 describes how graph cut algorithms can be used to globally minimize a
certain class of pseudo-Boolean functions (2.1). This class is characterized by a certain
submodularity condition on the parameters of the energy (2.1). Many interesting problems in
vision, such as object/background segmentation (as in chapter 7), are based on submodular
binary models.
• Section 2.3 concludes this chapter by describing more general graph cut techniques that
can be applied to nonsubmodular binary models. Such methods are not guaranteed to find the
solution for all variables, but those variables x_i whose values are determined are guaranteed
to be a part of some globally optimal vector x. In many practical problems, such partial
solutions cover most of the state variables.
2.1 Algorithms for Min-Cut/Max-Flow Problems
2.1.1 Background on Directed Graphs
Chapter 1 described undirected graphs consisting of a set of vertices V, typically corre-
sponding to image pixels, and a set of undirected arcs or edges E, corresponding to a 4, 8,
or any other neighborhood system (see figure 1.1). In this chapter we will use the corre-
sponding directed weighted graphs (Ṽ, Ẽ, w), including two additional terminal vertices,
the source s and the sink t,

Ṽ := V ∪ {s, t},      (2.2)
Figure 2.1
Example of a directed, capacitated graph: (a) a graph G; (b) a cut on G. Edge costs are represented by their
thickness. Such grid graphs are common in computer vision.
and a larger set of directed edges,

Ẽ := {(s→i), (i→t) | i ∈ V} ∪ {(i→j), (j→i) | (i, j) ∈ E},      (2.3)

whose weights (capacities) are nonnegative: w_ij ≥ 0 for (i→j) ∈ Ẽ.¹
In the context of vision, terminals often correspond to labels that can be assigned to
pixels. In figure 2.1a we show a simple example of a two-terminal directed graph that can
be used for optimizing the posterior distribution (equation 1.6) on a 3 × 3 image in the case
of two labels. The structure of graphs representing different problems in computer vision
may vary. However, many of them use 2D or 3D grid graphs like the one in figure 2.1a. This
is a simple consequence of the fact that graph nodes often represent regular image pixels
or voxels.
All directed edges in the graph are assigned some weight or capacity. The cost w_ij of
directed edge (i→j) may differ from the cost w_ji of the reverse edge (j→i). In fact,
the ability to assign different edge weights for (i→j) and (j→i) is important for many
applications in vision.
It is common to distinguish two types of edges: n-links and t-links. The n-links connect
pairs of neighboring pixels or voxels:

{(i→j), (j→i) | (i, j) ∈ E}      (n-links).

1. We assume throughout the chapter that the set E does not have parallel edges, and that (i, j) ∈ E implies
(j, i) ∉ E.
Thus, they represent a neighborhood system in the image. In the context of computer vision,
the cost of n-links may correspond to penalties for discontinuities between pixels. t-links
connect pixels with terminals (labels):

{(s→i), (i→t) | i ∈ V}      (t-links).

The cost of a t-link connecting a pixel and a terminal may correspond to a penalty for
assigning the corresponding label to the pixel.
2.1.2 Min-Cut and Max-Flow Problems
An s-t cut C on a graph with two terminals s, t is a partitioning of the nodes in the graph
into two disjoint subsets S and T such that the source s is in S and the sink t is in T. For
simplicity, throughout this chapter we refer to s-t cuts as just cuts. Figure 2.1b shows an
example of a cut. In combinatorial optimization the cost of a cut C = (S, T) is defined
as the sum of the costs of boundary edges (i→j) where i ∈ S and j ∈ T. Note that
the cut cost is directed, as it sums weights of directed edges specifically from S to T.
The minimum cut problem on a graph is to find a cut that has the minimum cost among
all cuts.
One of the fundamental results in combinatorial optimization is that the minimum s-t cut
problem can be solved by finding a maximum flow from the source s to the sink t. Loosely
speaking, maximum flow is the maximum "amount of water" that can be sent from the
source to the sink by interpreting graph edges as directed "pipes" with capacities equal to
edge weights. The theorem of Ford and Fulkerson [146] states that a maximum flow from s
to t saturates a set of edges in the graph, dividing the nodes into two disjoint parts (S, T)
corresponding to a minimum cut. Thus, min-cut and max-flow problems are equivalent. In
fact, the maximum flow value is equal to the cost of the minimum cut. The close relationship
between maximum flow and minimum cut problems is illustrated in figure 2.2 in the context
of image segmentation. The max-flow displayed in figure 2.2a saturates the edges in the
min-cut boundary in figure 2.2b. It turns out that max-flow and min-cut are dual problems,
as explained in the appendix (section 2.5).
We can intuitively show how min-cut (or max-flow) on a graph may help with energy
minimization over image labelings. Consider the example in figure 2.1. The graph corre-
sponds to a 3 × 3 image. Any s-t cut partitions the nodes into disjoint groups, each containing
exactly one terminal. Therefore, any cut corresponds to some assignment of pixels (nodes)
to labels (terminals). If edge weights are appropriately set based on the parameters of an energy,
a minimum cost cut will correspond to a labeling with the minimum value of this energy.
2.1.3 Standard Algorithms in Combinatorial Optimization
Figure 2.2
Graph cut/flow example in the context of interactive image segmentation (see chapter 7): original image, (a) maximum
flow, (b) minimum cut. Object and background seeds (white and black circles in a and b) are hardwired to the source
s and the sink t, correspondingly, by t-links of infinite cost. The cost of n-links between the pixels (graph nodes) is
low in places with high intensity contrast. Thus, cuts along object boundaries in the image are cheaper. Weak edges
also work as bottlenecks for a flow. In (a) we show a maximum flow from s to t. This flow saturates graph edges
corresponding to a minimum cut (black/white contour) shown in (b).

An important fact in combinatorial optimization is that there are polynomial algorithms
for min-cut/max-flow problems on directed weighted graphs with two terminals. Most
well-known algorithms belong to one of the following groups: Ford-Fulkerson-style
augmenting paths [146, 126], the network simplex approach [170], and Goldberg-Tarjan-
style push-relabel methods [171].
Standard augmenting path-based algorithms [146, 126] work by pushing flow along
nonsaturated paths from the source to the sink until the maximum flow in the graph
G = (Ṽ, Ẽ, w) is reached. A typical augmenting path algorithm stores information about the
distribution of the current s→t flow f among the edges of G, using a residual graph G_f.
The topology of G_f is identical to G, but the capacity of an edge in G_f reflects the residual
capacity of the same edge in G, given the amount of flow already in the edge. At the ini-
tialization there is no flow from the source to the sink (f = 0), and edge capacities in the
residual graph G_0 are equal to the original capacities in G. At each new iteration the algo-
rithm finds the shortest s→t path along nonsaturated edges of the residual graph. If a path
is found, then the algorithm augments it by pushing the maximum possible flow Δf that
saturates at least one of the edges in the path. The residual capacities of edges in the path
are reduced by Δf, while the residual capacities of the reverse edges are increased by Δf.
Each augmentation increases the total flow from the source to the sink, f := f + Δf. The
maximum flow is reached when any s→t path crosses at least one saturated edge in the
residual graph G_f.
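The following compact sketch illustrates the augmenting path idea just described: repeatedly find a shortest nonsaturated s→t path by breadth-first search and push the bottleneck flow along it (the Edmonds-Karp style of search). It is written for clarity rather than efficiency, and the dictionary-of-dictionaries representation of residual capacities is an assumption of the example; every node must appear as a key (possibly with an empty dictionary).

from collections import deque

def max_flow(cap, s, t):
    """cap[u][v] is the residual capacity of arc u -> v."""
    flow = 0
    while True:
        # breadth-first search for a shortest nonsaturated s-t path
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow              # no augmenting path left: flow is maximum
        # collect the path and its bottleneck residual capacity
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        delta = min(cap[u][v] for u, v in path)
        # augment: reduce forward residual capacities, increase reverse ones
        for u, v in path:
            cap[u][v] -= delta
            cap[v][u] = cap[v].get(u, 0) + delta
        flow += delta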
The Dinic algorithm [126] uses a breadth-first search to find the shortest paths from
s to t on the residual graph G_f. After all shortest paths of a fixed length k are saturated,
the algorithm starts the breadth-first search for s→t paths of length k + 1 from scratch.
Note that the use of shortest paths is an important factor that improves theoretical running
time complexities for algorithms based on augmenting paths. The worst case running time
complexity for the Dinic algorithm is O(mn²), where n is the number of nodes and m is the
number of edges in the graph. In practice, the blocking flow approach of Dinic is known to
outperform max-flow algorithms based on network simplex [170].
Push-relabel algorithms [171] use quite a different approach to the max-flow/min-cut
problem. They do not maintain a valid flow during the operation; there are active nodes
that have a positive flow excess. Instead, the algorithms maintain a labeling of nodes
giving a lower bound estimate on the distance to the sink along nonsaturated edges. The algo-
rithms attempt to push excess flows toward nodes with a smaller estimated distance to
the sink. Typically, the push operation is applied to active nodes with the largest distance
(label) or is based on a FIFO selection strategy. The distances (labels) progressively increase
as edges are saturated by push operations. Undeliverable flows are eventually drained back to
the source. We recommend our favorite textbook on basic graph theory and algorithms [101]
for more details on push-relabel and augmenting path methods.
Note that the most interesting applications of graph cut to vision use directed N-D grids
with locally connected nodes. It is also typical that a large portion of the nodes is connected to
the terminals. Unfortunately, these conditions rule out many specialized min-cut/max-flow
algorithms that are designed for some restricted classes of graphs. Examples of interesting
but inapplicable methods include randomized techniques for dense undirected graphs [226]
and methods for planar graphs assuming a small number of terminal connections [340, 188],
among others.
2.1.4 The BK Algorithm
This section describes an algorithm developed as an attempt to improve the empirical perfor-
mance of standard augmenting path techniques on graphs in vision [68]. Normally (see
section 2.1.3) augmenting path-based methods start a new breadth-first search for s→t
paths as soon as all paths of a given length are exhausted. In the context of graphs in com-
puter vision, building a breadth-first search tree typically involves scanning the majority
of image pixels. Practically speaking, it could be a very expensive operation if it has to
be performed too often. Indeed, real-data experiments in vision confirm that rebuilding a
search tree on graphs makes standard augmenting path techniques perform poorly in prac-
tice [68]. Several ideas were developed in [68] that improve the empirical performance of
augmenting path techniques on sparse grid graphs in computer vision.
The BK (Boykov-Kolmogorov) algorithm [68] belongs to the group of algorithms based
on augmenting paths. Similarly to Dinic [126], it builds search trees for detecting aug-
menting paths. In fact, BK builds two search trees, one from the source and the other from
the sink. The other difference is that BK reuses these trees and never starts building them
from scratch. The drawback of BK is that the augmenting paths found are not necessarily
the shortest augmenting paths; thus the time complexity bound based on shortest augmenting
paths is no longer valid. The trivial upper bound on the number of augmentations for the BK
algorithm is the cost of the minimum cut |C|, which results in the worst case complexity
O(mn²|C|). Theoretically speaking, this is worse than the complexities of the standard
algorithms discussed in section 2.1.3. However, experimental comparison conducted in
[68, 11] shows that on many typical problem instances in vision (particularly for 2D cases)
BK can significantly outperform standard max-flow algorithms.

Figure 2.3
Example of search trees S (nodes with dense dots) and T (nodes with sparse dots) at the end of the growth stage
when a path (gray (yellow) line) from the source s to the sink t is found. Active and passive nodes are labeled A
and P, correspondingly. Free nodes are white.
Figure 2.3 illustrates BK's basic ideas. The algorithm maintains two non-overlapping
search trees S and T with roots at the source s and the sink t, correspondingly. In tree S
all edges from each parent node to its children are nonsaturated, while in tree T edges
from children to their parents are nonsaturated. The nodes that are not in S or T are free.
We have

S ⊂ Ṽ, s ∈ S,   T ⊂ Ṽ, t ∈ T,   S ∩ T = ∅.

The nodes in the search trees S and T can be either active or passive. The active nodes
represent the outer border in each tree, and the passive nodes are internal. The point is that
active nodes allow trees to grow by acquiring new children (along nonsaturated edges)
from a set of free nodes. The passive nodes cannot grow because they are completely
blocked by other nodes from the same tree. It is also important that active nodes may come
in contact with the nodes from the other tree. An augmenting path is found as soon as an
active node in one of the trees detects a neighboring node that belongs to the other tree.
The algorithm iteratively repeats the following three stages:
• Growth stage: search trees S and T grow until they touch, giving an s→t path.
• Augmentation stage: the found path is augmented and the search tree(s) break into forest(s).
• Adoption stage: trees S and T are restored.
At the growth stage the search trees expand. The active nodes explore adjacent nonsaturated
edges and acquire new children from a set of free nodes. The newly acquired nodes become
active members of the corresponding search trees. As soon as all neighbors of a given
active node are explored, the active node becomes passive. The growth stage terminates if
an active node encounters a neighboring node that belongs to the opposite tree. In this case
BK detects a path from the source to the sink, as shown in figure 2.3.
The augmentation stage augments the path found at the growth stage. Since we push
through the largest flow possible, some edge(s) in the path become(s) saturated. Thus, some
of the nodes in the trees S and T may become "orphans," that is, the edges linking them to
their parents are no longer valid (they are saturated). In fact, the augmentation phase may
split the search trees S and T into forests. The source s and the sink t are still roots of two
of the trees, and orphans form roots of all other trees.
The goal of the adoption stage is to restore the single-tree structure of sets S and T with
roots in the source and the sink. At this stage we try to find a new valid parent for each
orphan. A new parent should belong to the same set, S or T, as the orphan. A parent should
also be connected through a nonsaturated edge. If there is no qualifying parent, we remove
the orphan from S or T and make it a free node. The algorithm also declares all its former
children as orphans. The stage terminates when no orphans are left and, thus, the search
tree structures of S and T are restored. Since some orphan nodes may become free, the
adoption stage results in contraction of sets S and T.
After the adoption stage is completed, the algorithm returns to the growth stage. The
algorithm terminates when the search trees S and T cannot grow (no active nodes) and
the trees are separated by saturated edges. This implies that a maximum flow is achieved.
The corresponding minimum cut C = {S, T} is defined as follows: nodes in the source
search tree S form subset S and nodes in the sink tree T form subset T. More details on
implementing BK can be found in [68]. The code for the BK algorithm can be downloaded
for research purposes from the authors' Web pages.
2.1.5 Further Comments on Time and Memory Efficiency
The ability to compute globally optimal solutions for many large problems in computer
vision is one of the main advantages of graph cut methods. Max-flow/min-cut algorithms
outlined in the previous sections normally take only a few seconds to solve problems on
graphs corresponding to typical 2D images [68, 11]. However, these algorithms may take
several minutes to solve bigger 3D problems such as segmentation of medical volumetric
data or multiview volumetric reconstruction.
Efficiency of max-flow/min-cut algorithms for graphs in computer vision is actively stud-
ied, and one should be aware that significant improvements are possible in specific situa-
tions. For example, one can use flow recycling [244] or cut recycling [221] techniques in
dynamic applications (e.g., in video) where max-flow has to be computed for similar graphs
corresponding to different time frames. Such methods are shown to work well when graphs
and solutions change very little from one instance to the next. Methods for GPU-based
acceleration of the push-relabel approach were also proposed [69]. Given fast advance-
ments of GPU technology, such methods may become very effective for grid graphs
corresponding to 2D images.
Limited memory resources could be a serious problem for large N-D problems, as stan-
dard max-flow/min-cut algorithms practically do not work as soon as the whole graph does
not fit into available RAM. This issue was addressed in [120], where a scalable version of
the push-relabel method was developed specifically for large regular N-D graphs (grids or
complexes).
Note that the speed and memory efficiency of different max-flow/min-cut algorithms
can be tested on problems in computer vision using a database of graphs (in the standard
DIMACS format) that has recently become available.² The posted graphs are regular grids or
complexes corresponding to specific examples in segmentation, stereo, multiview recon-
struction, and other applications.
2.2 Max-flow Algorithm as an Energy Minimization Tool
Section 2.1 described the min-cut/max-flow problem and several max-flow algorithms. As
already mentioned, they have a close relationship with the problem of minimizing pseudo-
Boolean functions of the form (2.1). To see this, consider an undirected graph (V, E) and
the corresponding directed weighted graph (Ṽ, Ẽ, w) defined in section 2.1.1. Any s-t cut
(S, T) can be viewed as a binary labeling x of the set of nodes V defined as follows: x_i = 0
if i ∈ S, and x_i = 1 if i ∈ T for node i ∈ V. The energy of labeling x (i.e., the cost of the
corresponding cut (S, T)) equals

E(x) = ∑_{i∈V} (w_si x_i + w_it x̄_i) + ∑_{(i,j)∈E} (w_ij x̄_i x_j + w_ji x_i x̄_j).      (2.4)
There is a 1:1 correspondence between binary labelings and cuts; therefore, computing
a minimum cut in (Ṽ, Ẽ, w) will yield a global minimum of function (2.4). Note that
standard max-flow/min-cut algorithms reviewed in sections 2.1.3 and 2.1.4 work only for
graphs with nonnegative weights w ≥ 0. Thus, equation (2.4) implies that such algorithms
can minimize quadratic pseudo-Boolean functions (2.1) with coefficients ({θ_{i;a}}, {θ_{ij;ab}})
satisfying the following conditions:

θ_{i;0} ≥ 0,   θ_{i;1} ≥ 0      (2.5a)

θ_{ij;00} = 0,   θ_{ij;01} ≥ 0,   θ_{ij;10} ≥ 0,   θ_{ij;11} = 0      (2.5b)

for all nodes i ∈ V and edges (i, j) ∈ E.
Can we use a max-flow algorithm for a more general class of quadratic pseudo-Boolean
functions? In the next section we define operations on coefficients θ that do not change
the energy E(x); such operations are called reparameterizations. We will then explore
which functions can be reparameterized to satisfy (2.5), arriving at the class of submodular
functions. In section 2.2.2 we will see that any submodular QPB (quadratic pseudo-Boolean)
function can be transformed to (2.5), and thus can be minimized in polynomial time via a
max-flow algorithm.

2. [Link] Special thanks to Andrew Delong for preparing most of the data sets and for setting up the site.
The idea of such a database was proposed by Andrew Goldberg.
Unfortunately, this transformation does not work for nonsubmodular functions, which
should not be surprising, given that minimizing such functions is an NP-hard problem (it
includes, e.g., the max-cut problem, which is NP-hard [156]). Still, the max-flow algorithm
can be quite useful for certain nonsubmodular functions (e.g., it can identify a part of an
optimal labeling). We will review some known results in section 2.3.
2.2.1 Reparameterization and Normal Form
Recall that energy (2.1) depends on vector θ, which is a concatenation of all coefficients
θ_const, θ_{i;a}, and θ_{ij;ab}. To emphasize this dependence, we will often write the energy defined
by θ as E(x | θ) instead of E(x).
Given some node i ∈ V and some real number δ ∈ R, consider an operation transforming
vector θ as follows:

θ_{i;0} := θ_{i;0} − δ,   θ_{i;1} := θ_{i;1} − δ,   θ_const := θ_const + δ,      (2.6)
where := denotes the assignment operator as used in programming languages. It is easy
to see that this transformation does not change the function E(x | θ): the cost of any
labeling x stays the same. This follows from the fact that x_i + x̄_i = 1. This motivates the
following definition:
Definition 2.1 Vector θ′ is called a reparameterization of vector θ if they define the same
energy function, that is, E(x | θ′) = E(x | θ) for any labeling x. In this case we may also
write θ′ ≡ θ.
Consider an edge (i, j) ∈ E. The identities x_j = x_j(x_i + x̄_i) and x̄_j = x̄_j(x_i + x̄_i) yield two
more reparameterization operations:

θ_{ij;00} := θ_{ij;00} − δ,   θ_{ij;10} := θ_{ij;10} − δ,   θ_{j;0} := θ_{j;0} + δ      (2.7a)

θ_{ij;01} := θ_{ij;01} − δ,   θ_{ij;11} := θ_{ij;11} − δ,   θ_{j;1} := θ_{j;1} + δ.      (2.7b)

Similarly, the identities x_i = x_i(x_j + x̄_j) and x̄_i = x̄_i(x_j + x̄_j) give

θ_{ij;00} := θ_{ij;00} − δ,   θ_{ij;01} := θ_{ij;01} − δ,   θ_{i;0} := θ_{i;0} + δ      (2.7c)

θ_{ij;10} := θ_{ij;10} − δ,   θ_{ij;11} := θ_{ij;11} − δ,   θ_{i;1} := θ_{i;1} + δ.      (2.7d)

Operation 2.7a is illustrated in figure 2.4b. It can be shown that any possible reparameter-
ization can be obtained as a combination of operations (2.6 and 2.7), assuming that graph
(V, E) is connected (e.g., [520]).
Figure 2.4
(a) Convention for displaying parameters θ_{i;a}, θ_{ij;ab}, θ_{j;b}. (b) Example of a reparameterization operation (equa-
tion 2.7a). (c) Normal form. Dotted lines denote links with zero cost. The first term is submodular, the second is
supermodular. Unary parameters must satisfy min{θ_{i;0}, θ_{i;1}} = min{θ_{j;0}, θ_{j;1}} = min{θ_{k;0}, θ_{k;1}} = 0.
Definition 2.2 Vector θ is in a normal form if each node i ∈ V satisfies

min{θ_{i;0}, θ_{i;1}} = 0      (2.8)

and each edge (i, j) ∈ E satisfies either (2.9a) or (2.9b) below:

θ_{ij;00} = 0,   θ_{ij;01} ≥ 0,   θ_{ij;10} ≥ 0,   θ_{ij;11} = 0      (2.9a)

θ_{ij;00} ≥ 0,   θ_{ij;01} = 0,   θ_{ij;10} = 0,   θ_{ij;11} ≥ 0.      (2.9b)
Figure 2.4c illustrates conditions (2.9a) and (2.9b) on edges (i, j) and (j, k), correspond-
ingly. Note that (2.9a) agrees with (2.5b).
It is not difficult to verify that any quadratic pseudo-Boolean function E(x | θ) can be
reparameterized to a normal form in linear time. For example, this can be done by the
following simple algorithm:
1. For each edge (i, j) ∈ E do the following:
• Make θ_{ij;ab} nonnegative: compute δ = min_{a,b∈{0,1}} θ_{ij;ab} and set
θ_{ij;ab} := θ_{ij;ab} − δ  ∀a, b ∈ {0, 1},   θ_const := θ_const + δ.
• For each label b ∈ {0, 1} compute δ = min{θ_{ij;0b}, θ_{ij;1b}} and set
θ_{ij;0b} := θ_{ij;0b} − δ,   θ_{ij;1b} := θ_{ij;1b} − δ,   θ_{j;b} := θ_{j;b} + δ.
• For each label a ∈ {0, 1} compute δ = min{θ_{ij;a0}, θ_{ij;a1}} and set
θ_{ij;a0} := θ_{ij;a0} − δ,   θ_{ij;a1} := θ_{ij;a1} − δ,   θ_{i;a} := θ_{i;a} + δ.
2. For each node i compute δ = min{θ_{i;0}, θ_{i;1}} and set
θ_{i;0} := θ_{i;0} − δ,   θ_{i;1} := θ_{i;1} − δ,   θ_const := θ_const + δ.
The first step of this algorithm performs a fixed number of operations for each edge, and
the second step performs a fixed number of operations for each node, giving linear overall
complexity O(|V| + |E|).
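A direct transcription of this reparameterization algorithm, using the dictionary representation introduced earlier, might look as follows. It is only a sketch: θ is modified in place, and every pairwise table is assumed to store all four entries (0,0), (0,1), (1,0), (1,1).

def to_normal_form(theta_const, theta_u, theta_p):
    """Reparameterize (theta_const, theta_u, theta_p) into normal form (2.8)-(2.9)."""
    for (i, j), t in theta_p.items():
        # make pairwise terms nonnegative
        delta = min(t.values())
        for ab in t:
            t[ab] -= delta
        theta_const += delta
        # push slack from each column (fixed b) into the unary term of j
        for b in (0, 1):
            delta = min(t[(0, b)], t[(1, b)])
            t[(0, b)] -= delta
            t[(1, b)] -= delta
            theta_u[j][b] += delta
        # push slack from each row (fixed a) into the unary term of i
        for a in (0, 1):
            delta = min(t[(a, 0)], t[(a, 1)])
            t[(a, 0)] -= delta
            t[(a, 1)] -= delta
            theta_u[i][a] += delta
    # normalize unary terms
    for i, u in theta_u.items():
        delta = min(u[0], u[1])
        u[0] -= delta
        u[1] -= delta
        theta_const += delta
    return theta_const, theta_u, theta_p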
In general, the normal form is not unique for a given vector θ. For example, in the appendix it
is shown that each augmentation of a standard max-flow algorithm can be interpreted as
a reparameterization of energy E(x | θ) from one normal form to another. The set of all
reparameterizations of θ that are in a normal form will be denoted by Ω(θ):

Ω(θ) = {θ′ ≡ θ : θ′ is in normal form}.
2.2.2 Submodularity
It is easy to check that the reparameterization operations (2.7) for edge (i, j) preserve the
quantity

λ_ij = (θ_{ij;00} + θ_{ij;11}) − (θ_{ij;01} + θ_{ij;10}).      (2.10)

Thus, λ_ij is an invariant that can be used to classify each edge (i, j).
Definition 2.3
(a) Edge (i, j) is called submodular with respect to a given energy function E(x | θ) if
λ_ij ≤ 0. It is called supermodular if λ_ij ≥ 0.
(b) A quadratic pseudo-Boolean function E(x | θ) is called submodular if all its edges
(i, j) are submodular, that is, they satisfy

θ_{ij;00} + θ_{ij;11} ≤ θ_{ij;01} + θ_{ij;10}.      (2.11)
We have already seen a definition of submodularity for general pseudo-Boolean functions
in chapter 1 (1.59). It is not difficult to show that for quadratic pseudo-Boolean functions
the two definitions are equivalent. We leave this proof as an exercise.
Equation 2.11 encodes some notion of smoothness. It says that the combined cost of
homogeneous labelings (0, 0) and (1, 1) is smaller than the combined cost of discontinuous
labelings (0, 1) and (1, 0). The smoothness assumption is natural in vision problems where
nearby points are likely to have similar labels (object category, disparity, etc.).
Note that for an energy E(x | θ) in a normal form, condition (2.9a) describes submodular
edges and condition (2.9b) describes supermodular edges (see also figure 2.4c).
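In the same dictionary representation used earlier, checking definition 2.3 amounts to computing the invariant (2.10) for each edge, for example:

def edge_gap(t):
    """The invariant (2.10): (theta_00 + theta_11) - (theta_01 + theta_10)."""
    return (t[(0, 0)] + t[(1, 1)]) - (t[(0, 1)] + t[(1, 0)])

def is_submodular(theta_p):
    """True if every pairwise term satisfies condition (2.11)."""
    return all(edge_gap(t) <= 0 for t in theta_p.values())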
Now we can return to the question of what energy functions E(x | θ) can be reparam-
eterized to the form of equations (2.5). Any energy can be transformed to a normal form
by the simple reparameterization algorithm described in section 2.2.1. Condition (2.8) will
satisfy (2.5a). Since submodularity is preserved under reparameterization, all submodular
edges will satisfy (2.9a) and, thus, (2.5b). This proves the following.
Theorem 2.1 Global minima x = arg min_x E(x | θ) of any submodular quadratic pseudo-
Boolean function can be obtained in polynomial time, using the following steps:
1. Reparameterize θ to a normal form as described in section 2.2.1.
2. Construct the directed weighted graph G̃ = (Ṽ, Ẽ, w) as described in section 2.1.1 with
the following nonnegative arc capacities:

w_si = θ_{i;1},   w_it = θ_{i;0},   w_ij = θ_{ij;01},   w_ji = θ_{ij;10}.

3. Compute a minimum s-t cut in G̃ and the corresponding labeling x.
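A minimal rendering of these three steps, assuming the energy has already been put into normal form (e.g., with to_normal_form above) and assuming the PyMaxflow package as the max-flow back end, could look as follows; treat the exact call signatures as an assumption to be checked against the package documentation.

import maxflow

def minimize_submodular_qpb(num_nodes, theta_u, theta_p):
    """Return a minimizing 0/1 labeling for a submodular QPB energy in normal form."""
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(num_nodes)
    for i in range(num_nodes):
        # t-links: w_si = theta_{i;1}, w_it = theta_{i;0}
        g.add_tedge(nodes[i], theta_u[i][1], theta_u[i][0])
    for (i, j), t in theta_p.items():
        # n-links: w_ij = theta_{ij;01}, w_ji = theta_{ij;10}
        g.add_edge(nodes[i], nodes[j], t[(0, 1)], t[(1, 0)])
    g.maxflow()
    # source side corresponds to x_i = 0, sink side to x_i = 1
    return [g.get_segment(nodes[i]) for i in range(num_nodes)]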
Note that supermodularity is also preserved under reparameterization. After conversion
to a normal form any (strictly) supermodular edge will satisfy (2.9b), which is inconsistent
with (2.5b). Therefore, nonsubmodular functions with one or more supermodular edges
cannot be converted to (2.5) and, in general, they cannot be minimized by standard max-
flow algorithms. Optimization of nonsubmodular functions is known to be an NP-hard prob-
lem. However, section 2.3 describes one approach that may work for some cases of non-
submodular functions.
2.3 Minimizing Nonsubmodular Functions
Let us now consider the case when not all edges of the function E(x) satisfy the submodular-
ity condition (2.11). Minimizing such a function is an NP-hard problem, so there is little
hope that there exists a polynomial time algorithm for solving arbitrary instances. It does
not mean, however, that all instances that occur in practice are equally hard. This section
describes a linear programming relaxation approach that can solve many of the instances
that occur in computer vision applications, as can be seen in several other chapters of
the book.
Relaxation is a general technique applicable to many optimization problems. It can be
described as follows. Suppose that we want to minimize a function E(x) over a set X ⊂ Z^n
containing a finite number of integer-valued points. First, we extend the function E to a larger
domain X̃, X ⊂ X̃, so that X̃ is a convex subset of R^n. Assuming that E is a convex function
over X̃, one can usually compute a global minimizer x̄ of function E over X̃, using one of
many efficient methods for convex optimization problems. (More information on relaxation
techniques can be found in chapter 16.)
For some "easy" instances it may happen that the minimizer x̄ is integer-valued and lies in
the original domain X; then we know that we have solved the original problem. In general,
however, the vector x̄ may have fractional components. In that case there are several possi-
bilities. We may try to round x̄ to an integer-valued solution so that the objective function
E does not increase too much. (This scheme is used for a large number of approximation
algorithms in combinatorial optimization.) Another option is to use the minimum value
min_{x∈X̃} E(x) as a lower bound on the original optimization problem in a branch-and-bound
framework.
This section reviews the LP relaxation of energy (2.1) introduced earlier, in section 1.9
of chapter 1:

Minimize  θ_const + ∑_{i∈V, a∈{0,1}} θ_{i;a} y_{i;a} + ∑_{(i,j)∈E, a,b∈{0,1}} θ_{ij;ab} y_{ij;ab}      (2.12a)

s.t.  y_{ij;0b} + y_{ij;1b} = y_{j;b}      ∀(i, j) ∈ E, b ∈ {0, 1},      (2.12b)

      y_{ij;a0} + y_{ij;a1} = y_{i;a}      ∀(i, j) ∈ E, a ∈ {0, 1},      (2.12c)

      y_{i;0} + y_{i;1} = 1      ∀i ∈ V,      (2.12d)

      y_{i;a}, y_{ij;ab} ∈ [0, 1]      ∀i ∈ V, (i, j) ∈ E, a, b ∈ {0, 1}.      (2.12e)
Note that forcing variables y_{i;a}, y_{ij;ab} to be integral would make (2.12) equivalent to the
minimization problem (2.1); y_{i;a} and y_{ij;ab} would be the indicator variables of events x_i = a
and (x_i, x_j) = (a, b), respectively. Thus, (2.12) is indeed a relaxation of (2.1). It is known
as the roof duality relaxation [179, 61].
Equation (2.12) is a linear program. (Interestingly, it can be shown to be the LP dual of
problem (DUAL) formulated in the appendix to this chapter. Thus, there is a strong duality
between (2.17) and LP relaxation (2.12), but there may be a duality gap between (2.17) and
the original integer problem (2.1) in a general nonsubmodular case.) It can be shown [179]
that the extreme points of this LP are half-integral. That is, an optimal solution ȳ may have
components ȳ_{i;a}, ȳ_{ij;ab} belonging to {0, 1, ½}. Two important questions arise:
• Does ȳ give any information about minimizers of function (2.1)? This is discussed in
section 2.3.1.
• How can ȳ be computed? Since (2.12) is a linear program, one could use a number
of generic LP solvers. However, there exist specialized combinatorial algorithms for solv-
ing (2.12). The method in [62] based on the max-flow algorithm is perhaps the most efficient;
it is reviewed in section 2.3.2. We will refer to it as the BHS algorithm.
2.3.1 Properties of Roof Duality Relaxation
Let ȳ be a half-integral optimal solution of (2.12). It is convenient to define a partial labeling
x̄ as follows: x̄_i = ȳ_{i;1} if ȳ_{i;1} ∈ {0, 1}, and x̄_i = ∅ if ȳ_{i;1} = ½. In the former case we say
node i is labeled; otherwise it is unlabeled.
An important property of roof duality relaxation is that x̄ gives a part of an optimal
solution. In other words, function E has a global minimum x* such that x*_i = x̄_i for all
labeled nodes i. This property is known as persistency or partial optimality [179, 61, 261].
Clearly, the usefulness of roof duality relaxation depends on how many nodes are labeled,
and this depends heavily on the application. If the number of nonsubmodular terms is small
and they are relatively weak compared with unary terms, then we can expect most nodes to
be labeled. In other situations all nodes can remain unlabeled. Refer to [403] and chapter 18
for some computational experiments in computer vision.
An important question is what to do with remaining unlabeled nodes. If the number
of such nodes is small or the remaining graph has low tree width, then one could use a
junction tree algorithm to compute a global minimum. Another option is to use the PROBE
technique [63]. The idea is to fix unlabeled nodes to a particular value (0 or 1) and apply roof
duality relaxation to the modified problem. After this operation more nodes may become
labeled, giving further information about minimizers of function E. This information is
used to simplify function E, and the process is repeated until we cannot infer any new
constraints on minimizers of E. For certain functions the PROBE procedure labels many
more nodes compared with the basic roof duality approach [63, 403].³
In computer vision applications QPB functions are often used inside iterative move-
making algorithms, such as expansion moves (chapter 3) or fusion moves (chapter 18). In
that case the roof duality approach can be used as follows. Suppose that the current config-
uration is represented by a binary labeling z′, and solving the LP relaxation gives partial
labeling x̄. Let us replace z′ with the following labeling z: if node i is labeled, then z_i = x̄_i,
otherwise z_i = z′_i. In such an iterative optimization the energy can never increase. This
follows from the autarky property that says E(z) ≤ E(z′) [179, 61].
2.3.2 Solving Roof Duality Relaxation via Max-flow: The BHS Algorithm
The LP relaxation (2.12) can be solved in many different ways, including generic LP solvers.
We review the method in [62], which is perhaps the most efficient, and we call this the BHS
algorithm.⁴ At a high level, it can be described as follows. The original energy (2.1) is
relaxed to a symmetric submodular function using binary indicator variables x_{i;1} and x_{i;0},
introduced for each variable x_i and its negation x̄_i = 1 − x_i, respectively. This submodular
formulation can be solved by max-flow algorithms. The corresponding result can also be
shown to solve LP (2.12).

3. The code can be found online on V. Kolmogorov's Web page. This includes the BHS algorithm, the PROBE
procedure, and the heuristic IMPROVE technique (see [403]).
4. BHS is an abbreviation of the inventors' names. Sometimes it has also been called the QPBO algorithm in the
computer vision literature, as in [403]. This is, however, misleading; the optimization problem is already called
QPBO (chapter 1).
Below we provide more details. The first step is to reparameterize the energy E(x | θ)
into a normal form, as described in section 2.2.1. Then submodular edges will have the
form (2.9a) and supermodular edges will have the form (2.9b). The second step is to construct
a new energy function Ē({x_{i;1}}, {x_{i;0}}) by transforming terms of the original function (2.1)
as follows:

θ_{i;0} x̄_i  →  (θ_{i;0}/2) [x̄_{i;1} + x_{i;0}]      (2.13a)

θ_{i;1} x_i  →  (θ_{i;1}/2) [x_{i;1} + x̄_{i;0}]      (2.13b)

θ_{ij;01} x̄_i x_j  →  (θ_{ij;01}/2) [x̄_{i;1} x_{j;1} + x_{i;0} x̄_{j;0}]      (2.13c)

θ_{ij;10} x_i x̄_j  →  (θ_{ij;10}/2) [x_{i;1} x̄_{j;1} + x̄_{i;0} x_{j;0}]      (2.13d)

θ_{ij;00} x̄_i x̄_j  →  (θ_{ij;00}/2) [x_{i;0} x̄_{j;1} + x̄_{i;1} x_{j;0}]      (2.13e)

θ_{ij;11} x_i x_j  →  (θ_{ij;11}/2) [x_{i;1} x̄_{j;0} + x̄_{i;0} x_{j;1}].      (2.13f)

The constant term θ_const remains unmodified.
The constructed function Ē has several important properties. First, it is equivalent to the
original function if we impose the constraint x_{i;0} = 1 − x_{i;1} for all nodes i:

Ē(x, x̄) = E(x)   ∀x.      (2.14)

Second, function Ē is submodular. (Note that all pairwise terms are of the form c x_{i;a} x̄_{j;b},
where c ≥ 0 and x_{i;a}, x_{j;b} are binary variables.) This means that we can minimize this func-
tion by computing a maximum flow in an appropriately constructed graph (see section 2.2).
This graph will have twice as many nodes and edges compared with the graph needed for
minimizing submodular functions E(x).
After computing a global minimum ({x_{i;1}}, {x_{i;0}}) of function Ē, we can easily obtain a
solution ȳ of the relaxation (2.12). The unary components for node i ∈ V are given by

ȳ_{i;1} = ½ [x_{i;1} + x̄_{i;0}]      (2.15a)

ȳ_{i;0} = ½ [x_{i;0} + x̄_{i;1}].      (2.15b)
Clearly, we have ȳ_{i;1} + ȳ_{i;0} = 1 and ȳ_{i;1}, ȳ_{i;0} ∈ {0, 1, ½}. It also is not difficult to derive
the pairwise components ȳ_{ij;ab}: we just need to minimize (2.12) over {ȳ_{ij;ab}} with fixed unary
components {ȳ_{i;a}}. We omit the formulas because they are rarely used in practice.
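The construction (2.13) is mechanical enough to sketch in the same dictionary representation used earlier: the doubled energy Ē is again a quadratic pseudo-Boolean function, now over 2n variables with only submodular pairwise terms, so it can be handed to the submodular solver of theorem 2.1. The index convention (node k < n stores x_{k;1}, node k ≥ n stores x_{(k−n);0}) and the recovery of ȳ_{i;1} via (2.15a) are choices made for this illustration; the input θ is assumed to be in normal form.

def roof_duality_labels(n, theta_u, theta_p):
    """Return y_{i;1} for each original variable: 0, 1, or 0.5 (unlabeled)."""
    node = lambda i, a: i if a == 1 else n + i
    u = {k: {0: 0.0, 1: 0.0} for k in range(2 * n)}
    p = {}

    def add_pair(ka, va, kb, vb, c):
        # accumulate a term c * [x_{ka} = va] * [x_{kb} = vb]
        if ka > kb:
            ka, kb, va, vb = kb, ka, vb, va
        t = p.setdefault((ka, kb), {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0})
        t[(va, vb)] += c

    for i in range(n):
        u[node(i, 1)][0] += theta_u[i][0] / 2.0   # (2.13a)
        u[node(i, 0)][1] += theta_u[i][0] / 2.0
        u[node(i, 1)][1] += theta_u[i][1] / 2.0   # (2.13b)
        u[node(i, 0)][0] += theta_u[i][1] / 2.0
    for (i, j), t in theta_p.items():
        add_pair(node(i, 1), 0, node(j, 1), 1, t[(0, 1)] / 2.0)   # (2.13c)
        add_pair(node(i, 0), 1, node(j, 0), 0, t[(0, 1)] / 2.0)
        add_pair(node(i, 1), 1, node(j, 1), 0, t[(1, 0)] / 2.0)   # (2.13d)
        add_pair(node(i, 0), 0, node(j, 0), 1, t[(1, 0)] / 2.0)
        add_pair(node(i, 0), 1, node(j, 1), 0, t[(0, 0)] / 2.0)   # (2.13e)
        add_pair(node(i, 1), 0, node(j, 0), 1, t[(0, 0)] / 2.0)
        add_pair(node(i, 1), 1, node(j, 0), 0, t[(1, 1)] / 2.0)   # (2.13f)
        add_pair(node(i, 0), 0, node(j, 1), 1, t[(1, 1)] / 2.0)

    x = minimize_submodular_qpb(2 * n, u, p)      # x[k] in {0, 1}
    # recover y_{i;1} via (2.15a); the value 1/2 marks an unlabeled node
    return [(x[node(i, 1)] + (1 - x[node(i, 0)])) / 2.0 for i in range(n)]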
A complete proof that the algorithm above indeed solves problem (2.12) is a bit involved,
and we do not give it here. The original proof can be found in [61]. Below we provide a sketch
of an alternative proof that follows the general approach in [195]. The idea is to establish
a correspondence between the LP relaxation (2.12) of the original nonsubmodular energy
E and the same type of relaxation for the submodular energy Ē. To be concise, we will refer
to these two relaxation problems as LP and L̄P. Assuming that F(LP) and F(L̄P) are
the sets of feasible solutions of the two problems, it is possible to define linear mappings
from F(LP) to F(L̄P) and from F(L̄P) to F(LP) that preserve the cost of the solutions.
This implies that LP and L̄P have the same minimum value, and an optimal solution of one problem
yields an optimal solution for the other one. Relaxation L̄P corresponds to a submodular
energy, and therefore it can be solved via a max-flow algorithm. Applying the second mapping to
the optimal solution of L̄P yields formulas (2.15).
2.4 Conclusions
This chapter described standard optimization techniques for binary pairwise models. A large
number of problems in the following chapters are represented using such quadratic pseudo-
Boolean functions. Equation (2.1) is a different representation of energy convenient for
binary variables x. This representation is common in the combinatorial optimization literature,
where optimization of pseudo-Boolean functions has been actively studied for more than
forty years.
This chapter studied an important class of pseudo-Boolean functions whose minimization
can be reduced to min-cut/max-flow problems on graphs. Such functions are characterized
by a submodularity condition. We also reviewed several classical polynomial complex-
ity algorithms (augmenting paths [146, 126] and push-relabel [171]) for solving min-cut/
max-flow problems. Efficient versions of such algorithms were also developed speci-
fically for sparse grids common in computer vision [68, 120]. This chapter reviewed the BK
algorithm [68], which is widely used in imaging.
Section 2.3 concluded this chapter by describing more general graph cut techniques that
can be applied to nonsubmodular binary models. Such general quadratic pseudo-Boolean
optimization (QPBO) methods are based on the roof duality relaxation of the integer program-
ming problem. They are not guaranteed to find the globally optimal solution, but they can
find partial solutions.
2.5 Appendix
2.5.1 Max-flow as a Reparameterization
We showed how to minimize any submodular QPB function using a max-flow algorithm.
At this point the reader may ask why we solve a maximization problem (computing a flow of
maximum value) for minimizing a function. As was mentioned in section 2.2, the maximiza-
tion and minimization problems are dual to one another. In this section we will illustrate
this duality using the notion of reparameterization.
First, let us formulate the dual problem in the context of energy minimization. Without
loss of generality, we will assume that the energy E(x | θ) has already been converted to a normal
form (2.8 and 2.9), as described in section 2.2.1. It is easy to see that for any vector θ with
nonnegative components ({θ_{i;a}}, {θ_{ij;ab}}) the constant term θ_const is a lower bound on the
function E(x | θ) defined by (2.1), that is,

θ_const ≤ min_x E(x | θ).      (2.16)
In order to obtain the tightest possible bound, one can solve the following maximization
problem:
(DUAL) Given θ, find its reparameterization θ′ with nonnegative components ({θ′_{i;a}},
{θ′_{ij;ab}}) such that the lower bound θ′_const is maximized.
It is easy to check that the solution to (DUAL) can be found among normal form reparam-
eterizations of θ, that is, (DUAL) is equivalent to

max_{θ′∈Ω(θ)} θ′_const.      (2.17)

It is not difficult to show that problems (DUAL) and (2.17) correspond to some linear
programs.⁵ Also note that inequality (2.16) and problems (DUAL) and (2.17) make sense
for both submodular and nonsubmodular functions. In any case, θ_const is a lower bound
on min_x E(x | θ) as long as vector θ is nonnegative (e.g., in normal form).
When the energy function $E(x \mid \theta)$ is submodular, (2.17) is the exact maximization problem solved by max-flow algorithms on the graph described in theorem 2.1. For example, consider an augmenting path-style algorithm from section 2.1.3. Pushing flow through an arc changes the residual capacities of this arc and its reverse arc, thus changing vector $\theta$.^6 If we send $\Delta$ units of flow from the source to the sink via arcs $(s \to i_1), (i_1 \to i_2), \ldots, (i_{k-1} \to i_k), (i_k \to t)$, then the corresponding transformation of vector $\theta$ is
$$
\begin{aligned}
\theta_{i_1;1} &:= \theta_{i_1;1} - \Delta\\
\theta_{i_1 i_2;01} &:= \theta_{i_1 i_2;01} - \Delta, \qquad \theta_{i_1 i_2;10} := \theta_{i_1 i_2;10} + \Delta\\
&\ \ \vdots\\
\theta_{i_{k-1} i_k;01} &:= \theta_{i_{k-1} i_k;01} - \Delta, \qquad \theta_{i_{k-1} i_k;10} := \theta_{i_{k-1} i_k;10} + \Delta\\
\theta_{i_k;0} &:= \theta_{i_k;0} - \Delta\\
\theta_{\mathrm{const}} &:= \theta_{\mathrm{const}} + \Delta.
\end{aligned}
\qquad (2.18)
$$
5. The definition of reparameterization involves an exponential number of linear constraints. But the fact that any reparameterization $\bar\theta$ can be expressed via operations (2.6 and 2.7) allows polynomial size LP formulations for (DUAL) and (2.17).
6. Recall that the reduction given in theorem 2.1 associates each arc of the graph with one component of vector $\theta$. Thus, we can define a 1:1 mapping between residual capacities of the graph and components of vector $\theta$.
It can be checked that this transformation of $\theta$ is indeed a reparameterization obtained by combining operations (2.6 and 2.7) for the sequence of edges above. Furthermore, it keeps vector $\theta$ in a normal form and increases the lower bound by $\Delta$. Thus, each augmentation greedily improves the objective function of the maximization problem (2.17). This relationship between reparameterization and augmenting path flows has been used to design dynamic algorithms for energy minimization [66, 221, 244].
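To make the bookkeeping in (2.18) concrete, here is a minimal Python sketch of the reparameterization update applied for one augmenting path. The dictionary-based encoding of the vector $\theta$ and the function name are our own illustrative assumptions, not part of the construction above.

```python
def augmenting_path_reparameterization(theta, path, delta):
    """Apply the update (2.18) for `delta` units of flow pushed along the path
    s -> i_1 -> ... -> i_k -> t, where `path` = [i_1, ..., i_k].

    `theta` is assumed to be a dict with keys
      ('const',)           : the constant term theta_const,
      ('unary', i, a)      : the unary component theta_{i;a}, a in {0, 1},
      ('pair', i, j, a, b) : the pairwise component theta_{ij;ab}.
    """
    i1, ik = path[0], path[-1]
    theta[('unary', i1, 1)] -= delta          # source arc (s -> i_1)
    for i, j in zip(path[:-1], path[1:]):     # interior arcs (i_m -> i_{m+1})
        theta[('pair', i, j, 0, 1)] -= delta
        theta[('pair', i, j, 1, 0)] += delta
    theta[('unary', ik, 0)] -= delta          # sink arc (i_k -> t)
    theta[('const',)] += delta                # each augmentation raises the bound
    return theta
```

As noted above, each such update keeps $\theta$ in normal form and increases the lower bound $\theta_{\mathrm{const}}$ by the amount of flow pushed.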
Upon termination the max-flow algorithm yields a minimum s-t cut $(S, T)$ corresponding to some labeling $\bar{x}$ and a maximum flow corresponding to some reparameterization $\bar\theta$. Using the property that all arcs from $S$ to $T$ have zero residual capacity, one can check that $\bar\theta_{i;\bar{x}_i} = 0$ for all nodes $i$ and $\bar\theta_{ij;\bar{x}_i \bar{x}_j} = 0$ for all edges $(i, j)$. Therefore, from equation (2.1) we get
$$E(\bar{x} \mid \bar\theta) = \bar\theta_{\mathrm{const}} + \sum_{i \in V} \bar\theta_{i;\bar{x}_i} + \sum_{(i,j) \in E} \bar\theta_{ij;\bar{x}_i \bar{x}_j} = \bar\theta_{\mathrm{const}}, \qquad (2.19)$$
that is, the lower bound $\bar\theta_{\mathrm{const}}$ in (2.16) equals the cost of labeling $\bar{x}$. This confirms the optimality of $\bar{x}$ and $\bar\theta$. Moreover, (2.19) implies that for submodular energies $E(x \mid \theta)$ there is no duality gap, that is,
$$\max_{\bar\theta} \bar\theta_{\mathrm{const}} = \min_x E(x \mid \theta), \qquad (2.20)$$
where the maximum is over normal form reparameterizations $\bar\theta$ of $\theta$; this is also known as a strong duality relationship.
Note that inequality (2.16) holds for arbitrary quadratic pseudo-Boolean functions $E(x \mid \theta)$ in a normal form, including any combination of submodular and supermodular terms. The presence of supermodular terms, however, may result in the optimal lower bound in problem (DUAL) being smaller than the minimum of $E(x \mid \theta)$. In other words, we may have a weak duality relationship $\max_{\bar\theta} \bar\theta_{\mathrm{const}} < \min_x E(x \mid \theta)$ instead of the strong duality (2.20). In the next chapter we will study some optimization algorithms for general nonsubmodular quadratic pseudo-Boolean functions that may include supermodular terms.
3
Optimizing Multilabel MRFs Using Move-Making Algorithms
Yuri Boykov, Olga Veksler, and Ramin Zabih
Chapter 2 addresses minimization of energies with binary variables, that is, pseudo-Boolean optimization. Many problems encountered in vision, however, require multilabel variables. Unfortunately, most of these are NP-hard to minimize, making it necessary to resort to approximations. One approach to approximate optimization is the move-making algorithms. Interestingly, this approach reduces to solving a certain sequence of pseudo-Boolean problems, that is, the problems discussed in chapter 2. At each iteration a move-making algorithm makes a proposal (move) for each site i either to keep its old label or to switch to a new label. This process can be encoded as binary optimization. For example, x_i = 0 means that site i keeps its old label, and x_i = 1 means that site i switches to the proposed label. The goal is then to find the optimal (i.e., giving the largest energy decrease) subset of sites to switch to the proposed labels. Depending on the nature of the new proposal (move), such binary optimization can be submodular, and therefore solvable exactly with graph cuts (see chapter 2). The algorithm considers a new move at each iteration until there is no available move that decreases the energy. This chapter explains in detail two widely used move-making algorithms: expansion and swap. The classes of energy functions for which expansion and swap moves are submodular are characterized. The effectiveness of the approach is tested on the applications of image restoration and stereo correspondence. In particular for stereo, the expansion algorithm advanced the state of the art by bringing to the forefront the advantages of global optimization methods over the local methods that were commonly used at the time (see a detailed comparison of optimization methods in chapter 11). In the last decade, the swap and expansion algorithms have been widely used in numerous practical applications in vision.
3.1 Introduction
The energy function for many commonly used Markov random fields can be written as
$$E(x) = \sum_{i \in V} \theta_i(x_i) + \sum_{(i,j) \in E} \theta_{ij}(x_i, x_j), \qquad (3.1)$$
where, as in chapter 1, $V$ and $E$ are the sets of vertices and edges corresponding to the MRF, $x_i$ is the label of vertex (or site) $i$, the $\theta_i(x_i)$ are the unary terms, and the $\theta_{ij}(x_i, x_j)$ are the binary terms. Chapter 2 addresses optimization of the energy in (3.1) when the label set $L$ has size 2 (i.e., pseudo-Boolean optimization). This chapter addresses the general multilabel case.
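As a concrete point of reference, the following small Python sketch evaluates the energy (3.1) for a labeling on a 4-connected grid. The array layout and the assumption of a single pairwise function shared by all edges are illustrative simplifications of our own.

```python
import numpy as np

def mrf_energy(labels, unary, pairwise):
    """Evaluate E(x) from (3.1) on a 4-connected grid.

    labels   : (H, W) integer array, the labeling x
    unary    : (H, W, L) array with unary[i, j, a] = theta_i(a)
    pairwise : function (a, b) -> theta_ij(a, b), assumed identical on all edges
    """
    H, W = labels.shape
    e = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    # Right and down neighbors enumerate every edge of the grid exactly once.
    for p, q in [(labels[:, :-1], labels[:, 1:]),    # horizontal edges
                 (labels[:-1, :], labels[1:, :])]:   # vertical edges
        e += sum(pairwise(int(a), int(b)) for a, b in zip(p.ravel(), q.ravel()))
    return e
```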
The choice of $\theta_i$ does not affect the optimization of the energy in (3.1). The choice of $\theta_{ij}$, however, determines the difficulty of optimizing this energy. Some choices lead to tractable, polynomially solvable problems [421, 116] (see also chapter 4), and others to NP-hard problems [72]. Unfortunately, many useful $\theta_{ij}$, and in particular many discontinuity-preserving [54] $\theta_{ij}$, lead to NP-hard energies, and therefore approximation techniques are needed. This chapter describes two popular optimization algorithms, the expansion and the swap, which are based on the common idea of move making. These algorithms were originally developed in [70, 71, 72].
The idea of move-making algorithms is to permit, for a given labeling $x^c$, switching to another labeling $x^n$ only if this switch belongs to a certain set of allowed moves. Ideally, the set of allowed moves is designed in such a way that an optimal move, that is, the move resulting in the largest energy decrease, can be found efficiently. A move-making algorithm starts from an initial labeling and proceeds by making a series of allowed moves that lead to solutions having lower (or equal) energy. The algorithm converges when no lower energy solution can be found.
There are two algorithms that approximately minimize the energy $E(x)$ under two fairly general classes of binary terms $\theta_{ij}$. In particular, the swap algorithm can be used whenever the following condition is satisfied:
$$\theta_{ij}(\alpha, \alpha) + \theta_{ij}(\beta, \beta) \le \theta_{ij}(\alpha, \beta) + \theta_{ij}(\beta, \alpha) \quad \forall \alpha, \beta \in L. \qquad (3.2)$$
The condition for the expansion algorithm is more restrictive:
$$\theta_{ij}(\alpha, \alpha) + \theta_{ij}(\beta, \gamma) \le \theta_{ij}(\beta, \alpha) + \theta_{ij}(\alpha, \gamma) \quad \forall \alpha, \beta, \gamma \in L. \qquad (3.3)$$
It is important to note that move-making algorithms, in particular the expansion and swap algorithms, can still be used even if the move found is not the optimal one and the conditions above do not hold. Chapter 18 discusses such move-making methods in detail and shows that many applications successfully use nonoptimal moves (e.g., [7, 404]).
In order to connect (3.3) and (3.2) to familiar concepts, recall the definitions from metric spaces. $\theta_{ij}$ is a metric if, for all $\alpha, \beta, \gamma \in L$,
$$\theta_{ij}(\alpha, \beta) \ge 0, \qquad (3.4)$$
$$\theta_{ij}(\alpha, \beta) = 0 \iff \alpha = \beta, \qquad (3.5)$$
$$\theta_{ij}(\alpha, \beta) = \theta_{ij}(\beta, \alpha), \qquad (3.6)$$
$$\theta_{ij}(\alpha, \beta) \le \theta_{ij}(\alpha, \gamma) + \theta_{ij}(\gamma, \beta). \qquad (3.7)$$
There are useful generalizations of a metric, such as a quasi-metric, for which symmetry (i.e., (3.6)) is dropped, and a semi-metric, for which the triangle inequality (i.e., (3.7)) is dropped. Clearly, (3.2) is satisfied when $\theta_{ij}$ is a metric, quasi-metric, or semi-metric; (3.3) is satisfied when $\theta_{ij}$ is a metric or a quasi-metric.
Note that both semi-metrics and metrics include important cases of discontinuity-preserving $\theta_{ij}$. Informally, a discontinuity-preserving binary term $\theta_{ij}$ should be bounded from above. This avoids overpenalizing sharp jumps between the labels of neighboring sites; see [523, 308] and the experimental results in section 3.5. Examples of discontinuity-preserving binary terms for a one-dimensional label set $L$ include the truncated quadratic $\theta_{ij}(\alpha, \beta) = \min(K, |\alpha - \beta|^2)$ (a semi-metric) and the truncated absolute distance $\theta_{ij}(\alpha, \beta) = \min(K, |\alpha - \beta|)$ (a metric), where $K$ is some constant. If $L$ is multidimensional, one can replace $|\alpha - \beta|$ by any norm (e.g., $\|\alpha - \beta\|_{L_2}$). These models encourage labelings consisting of several regions where sites in the same region have similar labels; therefore one informally calls them piecewise smooth models.
Another important discontinuity-preserving function is given by the Potts model $\theta_{ij}(\alpha, \beta) = K_{ij}\, T(\alpha \neq \beta)$ (a metric), where $T(\cdot)$ is 1 if its argument is true and 0 otherwise. This model encourages labelings consisting of several regions where sites in the same region have equal labels; therefore one informally calls it a piecewise constant model.
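The potentials above, and the conditions (3.2) and (3.3), are easy to check numerically. The following brute-force sketch is illustrative only (the function names and parameter values are ours); it reports, for a given label set, whether a binary term satisfies the swap condition (3.2) and the expansion condition (3.3).

```python
def truncated_quadratic(a, b, K=50):
    return min(K, (a - b) ** 2)      # a semi-metric

def truncated_abs(a, b, K=5):
    return min(K, abs(a - b))        # a metric

def potts(a, b, K=1):
    return K * (a != b)              # a metric

def swap_ok(phi, labels):
    # condition (3.2)
    return all(phi(a, a) + phi(b, b) <= phi(a, b) + phi(b, a)
               for a in labels for b in labels)

def expansion_ok(phi, labels):
    # condition (3.3)
    return all(phi(a, a) + phi(b, c) <= phi(b, a) + phi(a, c)
               for a in labels for b in labels for c in labels)

L = range(8)
print(swap_ok(truncated_quadratic, L), expansion_ok(truncated_quadratic, L))  # True False
print(swap_ok(truncated_abs, L), expansion_ok(truncated_abs, L))              # True True
print(swap_ok(potts, L), expansion_ok(potts, L))                              # True True
```

As expected, the truncated quadratic admits swap moves but not, in general, expansion moves, whereas the truncated absolute distance and the Potts model admit both.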
Both expansion and swap algorithms find a local minimum of the energy in (3.1) with respect to an exponentially large set of moves (see section 3.2.1). Unlike ordinary local minima, these come with optimality guarantees. In particular, the local minimum found by the expansion algorithm is guaranteed to be within a multiplicative factor of the global minimum. This factor is as low as 2 for the Potts model.
In this chapter the effectiveness of the expansion and swap algorithms is demonstrated on the energies arising in the problems of image restoration and stereo correspondence. In particular for stereo, the expansion algorithm advanced the state of the art: it significantly surpassed the accuracy of the local correlation-based methods that were common at the time [419]. Almost all of the top-performing stereo correspondence methods nowadays are based on energy minimization.^1
3.2 Overview of the Swap and Expansion Algorithms
This section is an overview of the swap and expansion algorithms. Section 3.2.1 discusses local optimization in a discrete setting. Section 3.2.2 defines expansion and swap moves. Section 3.2.3 describes the algorithms and lists their basic properties.
1. See [Link]
3.2.1 Local Minima in Discrete Optimization
Due to the inefficiency of computing the global minimum, many authors have opted for a local minimum. In general, a labeling $\hat{x}$ is a local minimum of the energy $E$ if
$$E(\hat{x}) \le E(x') \quad \text{for any } x' \text{ near to } \hat{x}. \qquad (3.8)$$
In discrete labeling, the labelings near $\hat{x}$ are those that lie within a single move of $\hat{x}$. Many local optimization techniques use so-called standard moves, where only one site can change its label at a time (see figure 3.2(b)). For standard moves, (3.8) can be read as follows: if you are at a local minimum with respect to standard moves, then you cannot decrease the energy by changing a single site's label. In fact, this is a very weak condition. As a result, optimization schemes using standard moves frequently generate low-quality solutions. For instance, consider the local minimum with respect to standard moves in figure 3.1(c). Some example methods using standard moves are Iterated Conditional Modes (ICM) [38] and simulated annealing [161] (see chapter 1).
Not surprisingly, in general a local minimum can be arbitrarily far from the optimum. It thus may not convey any of the global image properties that were encoded in the energy function. In such cases it is difficult to determine the cause of an algorithm's failures. When an algorithm gives unsatisfactory results, it may be due either to a poor choice of the energy function or to the fact that the answer is far from the global minimum. There is no obvious way to tell which of these is the problem.^2 Another common issue is that local minimization techniques are naturally sensitive to the initial estimate.
The NP-hardness result [72] for optimizing the energy in (3.1) effectively forces computing a local minimum. However, the methods described in this chapter generate a local minimum with respect to very large moves. This approach overcomes many of the problems associated with local minima. The algorithms introduced in this section generate a labeling that is a local minimum of the energy in (3.1) for two types of large moves: expansion and swap. In contrast to the standard moves described above, the swap and expansion moves allow a large number of sites to change their labels simultaneously. This makes the set of labelings within a single move of a locally optimal $\hat{x}$ exponentially large, and the condition in (3.8) very demanding. For example, expansion moves are so strong that one is able to prove that any labeling locally optimal with respect to these moves is within a known factor of the global minimum.^3 Figure 3.1 compares local minima for standard moves (c) and for expansion moves (d) obtained from the same initial solution (b). The experiments (see section 3.5)
show that for many problems the solutions do not change significantly when varying the initial labelings. In most cases, starting from a constant labeling gives good results.
2. In special cases where the global minimum can be rapidly computed, it is possible to separate these issues. For example, [176] points out that the global minimum of a special case of the Ising energy function is not necessarily the desired solution for image restoration; [49, 176] analyze the performance of simulated annealing in cases with a known global minimum.
3. In practice the actual solution is most often much closer to the global minimum than the theoretical bound (see chapter 11).
Figure 3.1
Comparison of local minima with respect to standard and large moves in image restoration: (a) original image; (b) observed image; (c) local minimum w.r.t. standard moves; (d) local minimum w.r.t. expansion moves. The energy (3.1) is used with a quadratic $\theta_i$ penalizing deviations from the observed intensities and a truncated $L_2$ metric $\theta_{ij}$. Both local minima (c) and (d) were obtained using labeling (b) as the initial solution.
3.2.2 Definitions of Swap and Expansion Moves
For every pair of labels $\alpha, \beta \in L$, the swap algorithm has one possible move type, an $\alpha$-$\beta$ swap. Given $\alpha, \beta \in L$, a move from a labeling $x$ to another labeling $x'$ is called an $\alpha$-$\beta$ swap if $x_i \neq x'_i$ implies that $x_i, x'_i \in \{\alpha, \beta\}$. This means that the only difference between $x$ and $x'$ is that some sites that were labeled $\alpha$ in $x$ are now labeled $\beta$ in $x'$, and some sites that were labeled $\beta$ in $x$ are now labeled $\alpha$ in $x'$.
For each $\alpha \in L$, the expansion algorithm has one possible move type, an $\alpha$-expansion. Given a label $\alpha$, a move from a labeling $x$ to another labeling $x'$ is called an $\alpha$-expansion if $x_i \neq x'_i$ implies $x'_i = \alpha$. In other words, an $\alpha$-expansion move allows any set of sites to change their labels to $\alpha$. An example of an $\alpha$-expansion is shown in figure 3.2d.
Recall that ICM and annealing use standard moves allowing only one site to change its label. An example of a standard move is given in figure 3.2b. Note that a move that assigns a given label $\alpha$ to a single site is both an $\alpha$-$\beta$ swap (for $\beta$ equal to the site's old label) and an $\alpha$-expansion. As a consequence, a standard move is a special case of both an $\alpha$-$\beta$ swap and an $\alpha$-expansion.
Figure 3.2
Examples of standard and large moves from a given labeling (a): (b) standard move, (c) and (d) large moves. The number of labels is $|L| = 3$. A standard move (b) changes the label of a single site (in the circled area). Strong moves (c and d) allow a large number of sites to change their labels simultaneously.
3.2.3 Algorithms and Properties
The swap and expansion algorithms find a local minimum when swap or expansion moves are allowed, respectively. Finding such a local minimum is not a trivial task. Given a labeling $x$, there is an exponential number of swap and expansion moves. Therefore, even checking for a local minimum requires exponential time if performed naively. In contrast, checking for a local minimum when only standard moves are allowed takes linear time. A large move space allows extensive changes to the current solution. This makes the algorithm less prone to getting stuck in local minima and also results in a faster rate of convergence.
Thus the key step is a method to find the optimal $\alpha$-$\beta$ swap or $\alpha$-expansion from a given labeling $x$. This step is developed in section 3.3. Once these are available, it is easy to design variants of the "fastest descent" technique that efficiently find the corresponding local minima. The algorithms are summarized in figure 3.3.
The two algorithms are quite similar in their structure. A single execution of steps 3.1 and 3.2 is called an iteration, and an execution of steps 2-4 is a cycle. In each cycle the algorithm performs an iteration for every label (expansion algorithm) or for every pair of labels (swap algorithm), in a certain order that can be fixed or random. A cycle is successful if a strictly better labeling is found at any iteration. The algorithms stop after the first unsuccessful cycle, since no further improvement is then possible. A cycle in the swap algorithm takes $|L|^2$ iterations, and a cycle in the expansion algorithm takes $|L|$ iterations.
Figure 3.3
The swap algorithm (top) and expansion algorithm (bottom).
These algorithms are guaranteed to terminate in a finite number of cycles. Under the very reasonable assumption that $\theta_i$ and $\theta_{ij}$ in (3.1) are constants independent of the size of the vertex set $V$, one can easily prove termination in $O(|V|)$ cycles [494]. However, in the experiments reported in section 3.5, the algorithms stop after a few cycles, and most of the improvements occur during the first cycle.
Pseudo-Boolean optimization (see chapter 2) based on a graph cut is used to efficiently find the optimal move in the key step 3.1 of each algorithm. At each iteration the corresponding graph has $O(|V|)$ vertices. The exact number of vertices, the topology of the graph, and its edge weights vary from iteration to iteration. The optimization details of step 3.1 for the expansion and swap algorithms are given in section 3.3.
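A high-level sketch of the control flow just described (cycles over labels until an unsuccessful cycle) is given below for the expansion algorithm. The routine `best_expansion_move`, standing in for the graph cut computation of step 3.1 developed in section 3.3, is assumed to be supplied by the caller; the swap algorithm differs only in iterating over pairs of labels.

```python
import random

def expansion_algorithm(x_init, labels, energy, best_expansion_move):
    """Outer loop of the expansion algorithm (figure 3.3, sketched).

    energy(x)                     -> value of E(x) in (3.1)
    best_expansion_move(x, alpha) -> lowest-energy alpha-expansion move from x
                                     (step 3.1; assumed to be provided)
    """
    x = x_init
    success = True
    while success:                    # one pass of this loop is a cycle
        success = False
        order = list(labels)
        random.shuffle(order)         # the label order may be fixed or random
        for alpha in order:           # one iteration per label
            x_new = best_expansion_move(x, alpha)
            if energy(x_new) < energy(x):
                x, success = x_new, True
    return x                          # local minimum w.r.t. expansion moves
```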
3.3 Finding the Optimal Swap and Expansion Moves
For a given labeling, the number of expansion and swap moves is exponential. The key idea for finding the optimal swap and expansion moves efficiently is to convert the problem of finding an optimal move into binary energy optimization. In [72] the authors formulate the problem as that of finding a minimum cut on a certain graph, describe the corresponding graph construction, and prove that it is correct. Because of the work on pseudo-Boolean optimization [61, 266] (see also chapter 2), here one just needs to formulate the pseudo-Boolean energy function corresponding to the swap and expansion moves and prove that it is submodular. Once this is shown, the submodular pseudo-Boolean energies can be efficiently minimized, for example, by finding a minimum cut on a certain graph (see chapter 2).
The moves of the expansion and swap algorithms can be encoded as a vector of binary variables $t = \{t_i \mid i \in V\}$. The transformation function $T(x^c, t)$ of a move algorithm takes the current labeling $x^c$ and a move $t$ and returns the new labeling $x^n$ induced by the move.
Recall that an $\alpha$-$\beta$ swap allows a random variable whose current label is $\alpha$ or $\beta$ to switch to $\alpha$ or $\beta$. The transformation function $T_{\alpha\beta}(\cdot)$ for an $\alpha$-$\beta$ swap transforms the current label $x^c_i$ of a random variable $X_i$ as
$$x^n_i = T_{\alpha\beta}(x^c_i, t_i) = \begin{cases} x^c_i & \text{if } x^c_i \neq \alpha \text{ and } x^c_i \neq \beta, \\ \alpha & \text{if } x^c_i = \alpha \text{ or } \beta \text{ and } t_i = 0, \\ \beta & \text{if } x^c_i = \alpha \text{ or } \beta \text{ and } t_i = 1. \end{cases} \qquad (3.9)$$
Recall that an $\alpha$-expansion allows a random variable either to retain its current label or to switch to $\alpha$. The transformation function $T_{\alpha}(\cdot)$ for an $\alpha$-expansion transforms the current label $x^c_i$ of a random variable $X_i$ as
$$x^n_i = T_{\alpha}(x^c_i, t_i) = \begin{cases} \alpha & \text{if } t_i = 0, \\ x^c_i & \text{if } t_i = 1. \end{cases} \qquad (3.10)$$
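The transformation functions (3.9) and (3.10) translate directly into code. The short sketch below (with our own helper names) makes explicit how a move is encoded by the binary vector $t$; the move energy, defined next in the text, is then simply the energy (3.1) evaluated on the transformed labeling.

```python
def apply_swap_move(x_c, t, alpha, beta):
    """alpha-beta swap transformation T_{alpha beta}(x^c, t) from (3.9)."""
    x_n = list(x_c)
    for i, (xi, ti) in enumerate(zip(x_c, t)):
        if xi in (alpha, beta):              # only sites labeled alpha or beta may move
            x_n[i] = alpha if ti == 0 else beta
    return x_n

def apply_expansion_move(x_c, t, alpha):
    """alpha-expansion transformation T_alpha(x^c, t) from (3.10)."""
    return [alpha if ti == 0 else xi for xi, ti in zip(x_c, t)]
```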
The energy of a move $t$ is the energy of the labeling $x^n$ that the move $t$ induces, that is, $E_m(t) = E(T(x^c, t))$. The move energy is a pseudo-Boolean function ($E_m : \{0, 1\}^n \to \mathbb{R}$) and will be denoted by $E_m(t)$. At step 3.1 (see figure 3.3) of the expansion and swap move algorithms, the optimal move $t^*$ (the move decreasing the energy of the labeling by the greatest amount) is computed. This is done by minimizing the move energy, that is, $t^* = \arg\min_t E(T(x^c, t))$.
The pseudo-Boolean energy corresponding to a swap move is
$$E(T_{\alpha\beta}(x^c, t)) = \sum_{i \in V} \theta_i\big(T_{\alpha\beta}(x^c_i, t_i)\big) + \sum_{(i,j) \in E} \theta_{ij}\big(T_{\alpha\beta}(x^c_i, t_i),\, T_{\alpha\beta}(x^c_j, t_j)\big), \qquad (3.11)$$
where $x^c$ is fixed and $t$ is the unknown variable. The submodularity condition from chapter 2 and the definition in (3.9) imply that the energy function in (3.11) is submodular, and therefore can be optimized with a graph cut, if
$$\theta_{ij}(\alpha, \alpha) + \theta_{ij}(\beta, \beta) \le \theta_{ij}(\alpha, \beta) + \theta_{ij}(\beta, \alpha), \qquad (3.12)$$
as in (3.2).
The pseudo-Boolean energy corresponding to an expansion move is
$$E(T_{\alpha}(x^c, t)) = \sum_{i \in V} \theta_i\big(T_{\alpha}(x^c_i, t_i)\big) + \sum_{(i,j) \in E} \theta_{ij}\big(T_{\alpha}(x^c_i, t_i),\, T_{\alpha}(x^c_j, t_j)\big). \qquad (3.13)$$
The submodularity condition from chapter 2 and the definition in (3.10) imply that the energy function in (3.13) is submodular, and therefore can be optimized with a graph cut, if
$$\theta_{ij}(\alpha, \alpha) + \theta_{ij}(x^c_i, x^c_j) \le \theta_{ij}(x^c_i, \alpha) + \theta_{ij}(\alpha, x^c_j), \qquad (3.14)$$
as in (3.3), assuming that $x^c_i$ can take any value in $L$.
3.4 Optimality Properties
This section addresses the optimality properties of the expansion algorithm. The bounds obtained in this section are usually much larger than those observed in practice (see chapter 11). Nevertheless, they are of theoretical interest.
3.4.1 The Expansion Algorithm
Assuming a metric $\theta_{ij}$, a local minimum when expansion moves are allowed is within a known factor of the global minimum. This factor, which can be as small as 2, depends on $\theta_{ij}$. Specifically, let
$$c = \max_{(i,j) \in E} \left( \frac{\max_{\alpha \neq \beta \in L} \theta_{ij}(\alpha, \beta)}{\min_{\alpha \neq \beta \in L} \theta_{ij}(\alpha, \beta)} \right)$$
be the ratio of the largest nonzero value of $\theta_{ij}$ to the smallest nonzero value of $\theta_{ij}$. Note that $c$ is well defined, since $\theta_{ij}(\alpha, \beta) \neq 0$ for $\alpha \neq \beta$ because $\theta_{ij}$ is assumed to be a metric.
Theorem 3.1 Let $\hat{x}$ be a local minimum when expansion moves are allowed and $x^*$ be the globally optimal solution. Then $E(\hat{x}) \le 2c\,E(x^*)$.
For the proof of the theorem, see [72]. Even though in practice the results are very close to the global minima, artificial examples can be constructed that come as close as one wishes to the bound in the theorem above (see figure 3.4). Note that [236] develops an algorithm for minimizing $E$ that also has a bound of 2 for the Potts $\theta_{ij}$. For a general metric $\theta_{ij}$, it has a bound of $O(\log k \log\log k)$, where $k$ is the number of labels. However, that algorithm uses linear programming, which is impractical for a large number of variables.
3.4.2 Approximating a Semi-metric
A local minimum when swap moves are allowed can be arbitrarily far from the global minimum. This is illustrated in figure 3.5. In fact, one can use the expansion algorithm to get an answer within a factor of $2c$ from the optimum of energy (3.1) even when $\theta_{ij}$ is a semi-metric. Here $c$ is the same as in theorem 3.1; it is still well defined for a semi-metric. Suppose that $\theta_{ij}$ is a semi-metric. Let $r$ be any real number in the interval $[m, M]$, where
$$m = \min_{\alpha \neq \beta \in L} \theta_{ij}(\alpha, \beta) \quad \text{and} \quad M = \max_{\alpha \neq \beta \in L} \theta_{ij}(\alpha, \beta).$$
Figure 3.4
Example of the expansion algorithm getting stuck in a local minimum far from the global optimum. Here $L = \{0, 1, 2, 3\}$. The top row shows the data terms for these labels in consecutive order: white means zero cost and black means an infinitely high data cost. The Potts model is used, with $K_{ij}$ illustrated in the left image of the bottom row. All the costs are infinite except those along the arcs and segments outlined in black and white. The accumulated cost of cutting along these arcs and segments is shown; for example, cutting along the top part of the rectangle costs $c$ in total, and cutting along the diagonal of the rectangle costs $a$ in total. Here $b = \epsilon$ and $2c = a - \epsilon$. The expansion algorithm is initialized with all pixels labeled 0. Expansion proceeds on labels 1 and 2, the results of which are shown in the second picture, bottom row. Expansion on label 3 results in the solution shown in the third picture, bottom row, at which point the algorithm converges to a local minimum with cost $C_{\mathrm{sub}} = 4b + 4c$. The optimum is shown in the last picture, bottom row, and its cost is $C_{\mathrm{opt}} = 4b + a$. The ratio $C_{\mathrm{sub}}/C_{\mathrm{opt}} \to 2$ as $\epsilon \to 0$.
Figure 3.5
(a) Local minimum; (b) global minimum; (c) values of $\theta_i$. The image consists of three sites $V = \{1, 2, 3\}$. There are two pairs of neighbors $E = \{\{1, 2\}, \{2, 3\}\}$. The set of labels is $L = \{a, b, c\}$. $\theta_i$ is shown in (c). $\theta_{ij}(a, b) = \theta_{ij}(b, c) = K/2$ and $\theta_{ij}(a, c) = K$, where $K$ is a suitably large value. It is easy to see that the configuration in (a) is a local minimum with energy $K$, while the optimal configuration (b) has energy 4.
Define a new energy based on the Potts model:
$$E_P(x) = \sum_{i \in V} \theta_i(x_i) + \sum_{(i,j) \in E} r\, T(x_i \neq x_j).$$
Theorem 3.2 If $\hat{x}$ is a local minimum of $E_P$ under the expansion moves and $x^*$ is the global minimum of $E(x)$, then $E(\hat{x}) \le 2c\,E(x^*)$.
For the proof of the theorem, see [72]. Thus, to find a solution within a fixed factor of the global minimum for a semi-metric $\theta_{ij}$, one can take a local minimum $\hat{x}$ under the expansion moves for $E_P$ as defined above. Note that such an $\hat{x}$ is not, in general, a local minimum of $E(x)$ under expansion moves. In practice, however, the local minimum under swap moves gives empirically better results than using $\hat{x}$. In fact, the estimate $\hat{x}$ can be used as a good starting point for the swap algorithm. In this case the swap algorithm will also generate a local minimum whose energy is within a known factor of the global minimum.
3.5 Experimental Results
This section presents experimental results on image restoration and stereo correspondence. For an extensive experimental evaluation of the swap and expansion algorithms as energy optimization techniques, see chapter 11. The minimum cut is computed using the max-flow algorithm in [68].
3.5.1 Image Restoration
Image restoration is a classical problem for the evaluation of an energy optimization method. The task is to estimate the original image from a noisy one. The labels are all possible intensities. The restored intensity should be close to the observed one, and intensities are expected to vary smoothly everywhere except at object boundaries. Here the unary term is $\theta_i(x_i) = \min(|x_i - I_i|^2, \mathrm{const})$, where $I_i \in [0, 255]$ is the intensity observed at pixel $i$. The parameter const is set to 400, and it is used to make the data penalty more robust against outliers.
Figure 3.6 shows an image consisting of several regions with constant intensities after it was corrupted by $N(0, 100)$ noise. Figure 3.6b shows image restoration results for the truncated absolute difference model $\theta_{ij}(x_i, x_j) = 80 \cdot \min(3, |x_i - x_j|)$, which is discontinuity-preserving. Since it is a metric, the expansion algorithm is used. For comparison, figure 3.6c shows the result for the absolute difference model $\theta_{ij}(x_i, x_j) = 15 \cdot |x_i - x_j|$, which is not discontinuity-preserving. For the absolute difference model one can find the exact solution using the method in chapter 4. For both models the parameters that minimize the average absolute error from the original image were chosen. The average errors were 0.34 for the truncated model and 1.8 for the absolute difference model, and the running times were 2.4 and 5.6 seconds, respectively. The image size is 100 x 100. This example illustrates a well-known fact [54] about the importance of using a discontinuity-preserving $\theta_{ij}$. The results in figures 3.6b and c were histogram equalized to reveal the oversmoothing in (c), which does not happen in (b). Similar oversmoothing occurs in stereo correspondence (see [494, 42]).
Figure 3.6
Image restoration: (a) noisy image; (b) truncated absolute difference model; (c) absolute difference model. The results in (b) and (c) are histogram equalized to reveal the oversmoothing in (c), which does not happen in (b).
3.5.2 Stereo Correspondence
In stereo correspondence two images are taken at the same time from different viewpoints. For most pixels in the first image there is a corresponding pixel in the second image that is a projection along the line of sight of the same real-world scene element. The difference in the coordinates of the corresponding points is called the disparity, which one can assume to be one-dimensional because of image rectification [182]. The labels are the set of discretized disparities, and the task is to estimate the disparity label for each pixel in the first image.^4 Note that here $V$ contains the pixels of the first image.
The disparity varies smoothly everywhere except at object boundaries, and corresponding points are expected to have similar intensities. Therefore one could set $\theta_i(l) = (I_i - I'_{i+l})^2$, where $I$ and $I'$ are the first and second images, respectively, and $i + l$ is pixel $i$ shifted by disparity $l$ in $I'$. Instead, the sampling-insensitive unary term from [44] is used, which is a better model and is only slightly more complex.
Figure 3.7 shows results for a real stereo pair with known ground truth, provided by Dr. Y. Ohta and Dr. Y. Nakamura from the University of Tsukuba (see also [419]). The left image is in figure 3.7a, and the ground truth is in 3.7b. The label set is $\{0, 1, \ldots, 14\}$, and $\theta_{ij}(x_i, x_j) = K_{ij}\, T(x_i \neq x_j)$. $K_{ij}$ depends inversely on the gradient strength between pixels $i$ and $j$: specifically, $K_{ij} = 40$ if $|I_i - I_j| \le 5$ and $K_{ij} = 20$ otherwise.
The results in figures 3.7c and d are clearly superior to those of simulated annealing and normalized correlation in figures 3.7e and f. Note that results such as those in figures 3.7e and f used to be close to the state of the art before the expansion and swap algorithms were developed.
4. For a symmetric approach to stereo based on expansion moves, see [264].
Figure 3.7
Real imagery with ground truth: (a) left image (384 x 288, 15 labels); (b) ground truth; (c) swap algorithm; (d) expansion algorithm; (e) normalized correlation; (f) simulated annealing.
For normalized correlation, the parameters that give the best statistics were chosen. To give it a good starting point, simulated annealing was initialized with the correlation results. In contrast, the swap and expansion algorithms are not sensitive to the initialization: averaging over 100 random initializations, the results differed by less than 1 percent of the pixels.
Figure 3.8 summarizes the errors, running times, and energies. The expansion and swap algorithms are clearly superior to the other two methods. Compared with one another, they perform similarly; the observed difference in errors is less than 1%. At each cycle the label order is chosen randomly, so another run of the algorithm may give slightly different results; on average about 1% of pixels change their labels between runs. The expansion algorithm converges 1.4 times faster than the swap algorithm, on average.
algorithm        all err.   err. > ±1   time (sec)   energy
exp. (1 iter.)   7.9        2.7         0.98         254700
exp. (conv.)     7.2        2.1         6.9          253700
swap (conv.)     7.0        2.0         11.1         251990
sim. anneal.     20.3       5.0         600          442000
norm. corr.      24.7       10.0        0.89         N/A
Figure 3.8
Comparison of accuracy (in %), running time (in sec), and energy.
3.6 Conclusion
Since their invention, the swap and expansion algorithms have proved useful in a wide variety of applications. Compared with the expansion algorithm, the swap algorithm can be applied to a wider set of energy functions, since (3.2) is less restrictive. However, the expansion algorithm usually performs better in practice [464] (see also chapter 11), both in running time and in accuracy, and therefore it is more popular. Both algorithms have been used for such diverse applications as image restoration [70], stereo correspondence [70, 45], motion segmentation [45, 521, 532, 366], texture synthesis [287], motion magnification [317], object recognition and segmentation [525, 198] (see also chapter 25), digital photomontage [7], digital tapestry [404], image generation [349], computational photography [304], image completion [183], and digital panoramas [6].
4
Optimizing Multilabel MRFs with Convex and Truncated
Convex Priors
Hiroshi Ishikawa and Olga Veksler
Usually, submodular pseudo-Boolean energies as discussed in chapters 1 and 2 arise from optimizing binary (i.e., two-label) MRFs. Interestingly, it is possible to encode a multilabel MRF with a pseudo-Boolean energy. The resulting energy is submodular, and can therefore be optimized exactly, when the $\theta_{ij}(x_i, x_j)$ terms are convex. This chapter addresses that case. Though convex $\theta_{ij}(x_i, x_j)$ are useful for problems with a small label set $L$, they are less suitable as the label set grows larger, due to the non-discontinuity-preserving nature of convex $\theta_{ij}(x_i, x_j)$. To ensure discontinuity preservation, MRFs with truncated convex $\theta_{ij}(x_i, x_j)$ are commonly used in computer vision. Interestingly, the construction presented in this chapter for MRFs with convex $\theta_{ij}(x_i, x_j)$ is also useful for approximately optimizing MRFs with truncated convex $\theta_{ij}(x_i, x_j)$. To demonstrate the effectiveness of the proposed methods, experimental evaluations on image restoration, image inpainting, and stereo correspondence are presented.
4.1 Introduction
It is rarely possible to solve a large combinatorial optimization problem exactly. Yet there are exceptional circumstances in which one can use a known method to find a global optimum. For instance, when the state of the problem is described as a string of linearly ordered local states, as in an HMM, dynamic programming can be used (see chapter 1). Another case that can be solved exactly is a certain binary MRF (see chapter 2). The present chapter points out yet another such instance and describes a method that can be used: a method to solve a first-order MRF with $\theta_{ij}(x_i, x_j)$ that is convex in terms of a linearly ordered label set. This method was originally presented in [211].
As in chapter 1, the following energy function is minimized:
$$E(x) = \sum_{i \in V} \theta_i(x_i) + \sum_{(i,j) \in E} \theta_{ij}(x_i, x_j), \qquad (4.1)$$
where $V$ and $E$ are the set of sites and edges for the MRF, $x_i$ is the label of site $i$, the $\theta_i(x_i)$ are the unary terms, and the $\theta_{ij}(x_i, x_j)$ are the binary terms.
Chapter 3 addresses approximation algorithms for a relatively general multilabel case. The present chapter first addresses the case of a convex term $\theta_{ij}$, which can be solved exactly. Two common examples are the absolute linear $\theta_{ij}(x_i, x_j) = w_{ij}\,|x_i - x_j|$ and the quadratic $\theta_{ij}(x_i, x_j) = w_{ij}\,(x_i - x_j)^2$. These are convex functions of the absolute difference $|x_i - x_j|$ of the labels. As in chapter 3, there are no restrictions on the unary terms $\theta_i(x_i)$.
A function $f(x)$ is said to be convex if
$$f(t x + (1-t) y) \le t f(x) + (1-t) f(y) \quad (0 \le t \le 1). \qquad (4.2)$$
Consider figure 4.1a. It is easy to see that for singly connected graphs, a $\theta_{ij}$ that is a strictly convex function of $|x_i - x_j|$ encourages smaller changes of labels spread over more edges. To encourage fewer but larger changes, a truncated concave $\theta_{ij}$ (see chapter 3) can be used. In between, the absolute linear potential is neutral with respect to this choice.
For general graphs, the biases observed in the singly connected case are still present, though they are less straightforward due to the constraints imposed by a more general graph topology. For example, consider the 2D labeling in figure 4.1b. Assume that the absolute linear $\theta_{ij}$ is used. A site that is outlined with a thick line strongly prefers the label printed inside it (notice that these labels are the same for all three labelings). A site that is outlined with a thin line considers all labels equally likely. The three labelings in figure 4.1b are ordered by their total cost (from smaller to larger) under the absolute linear $\theta_{ij}$. Notice that the absolute linear $\theta_{ij}$ is not neutral now with respect to whether it prefers a few large changes or many small ones. The left (most preferred) labeling has fewer but larger jumps compared with the middle labeling, which has more but smaller jumps. The rightmost labeling has fewer jumps than the middle one but a higher cost.
Assume now that the unary terms $\theta_i$ are such that the labeling on the right is preferred. Depending on the relative weight of the unary and binary terms, the middle labeling may be the optimal one, since it has a lower cost under the binary terms. In this case oversmoothing will occur. Similarly, if the unary terms indicate the labeling in the middle but the binary terms $\theta_{ij}$ have more weight, the labeling on the left may be the optimal one, resulting in an "undersmoothed" result.
Figure 4.1
(a) At the top are two configurations, each with four nodes. Below is the sum of truncated convex, linear, and convex potentials. The truncated convex potential favors the left configuration; the convex potential, the right one. The linear potential gives the same energy for both. (b) 2D example with nine nodes; the cut is shown as a dashed line. Assume that the linear $\theta_{ij}$ is used and the data term fixes the labels at the thick squares. This $\theta_{ij}$ favors the labeling on the left the most, then the middle one, and disfavors the right one. When the data term indicates the labeling on the right but the result is the middle one, it is oversmoothing. When the middle labeling is indicated by the data term but the left one is the result, it is undersmoothing.
Recall that labels correspond to some property to be estimated at each site. In vision, while undersmoothing can be an issue, for most problems oversmoothing is more likely to occur, since the property to be estimated changes sharply at the boundaries of objects. Thus, to counter this bias, a potential that actively encourages concentration of changes is usually preferred. Such a $\theta_{ij}$ limits the growth of its value as a function of the label difference. In particular, convex or linear functions can be truncated to give $\theta_{ij}(x_i, x_j) = w_{ij} \min\{T, |x_i - x_j|\}$ and $\theta_{ij}(x_i, x_j) = w_{ij} \min\{T, (x_i - x_j)^2\}$. These energies are commonly used in vision; for example, the energy evaluation study in chapter 11 includes them. In general, any convex binary potential can be truncated for the sake of robustness in the presence of discontinuities.
Unfortunately, the energy in (4.1) is NP-hard to optimize with truncated convex $\theta_{ij}$ [72]. The swap and expansion algorithms from chapter 3 can be used for approximate optimization, but according to the energy evaluation study in chapter 11, they do not perform very well for the truncated quadratic.
Interestingly, the exact method for convex $\theta_{ij}$ developed here can also be used as the basis of a move-making algorithm (see chapter 3) for approximate optimization of truncated convex $\theta_{ij}$. These moves were originally developed in [495], and they are called range moves. Range moves are able to give a pixel a choice of several labels, and they significantly outperform the swap and expansion algorithms for truncated convex $\theta_{ij}$.
4.2 Optimizing for Convex $\theta_{ij}$
The rest of the chapter assumes that the labels can be represented as consecutive integers in the range $\{0, 1, \ldots, k\}$. Integer labels rule out direct use of the methods in this chapter for problems such as motion estimation, since in motion, labels are 2D. However, there are componentwise optimization approaches to motion, that is, fixing one component of a motion vector and letting the other one vary [406].
This chapter also assumes that $\theta_{ij}(l_1, l_2)$ is a function of the label difference $|l_1 - l_2|$. First a definition of a convex function in a discrete setting is given. A binary term $\theta_{ij}(l_1, l_2) = w_{ij}\, g(l_1 - l_2)$ is said to be convex if and only if for any integer $y$, $g(y+1) - 2g(y) + g(y-1) \ge 0$. It is assumed that $g(y)$ is symmetric;^1 otherwise it can be replaced with $[g(y) + g(-y)]/2$ without changing the optimal labeling. Convex $\theta_{ij}$ include the absolute and squared difference functions as special cases. Note that [421] extends the notion of convexity to a more general one.
1. A function is symmetric if $g(y) = g(-y)$.
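The discrete convexity condition $g(y+1) - 2g(y) + g(y-1) \ge 0$ is easy to test numerically; the small helper below is an illustrative check of our own, not part of the construction.

```python
def is_discrete_convex(g, y_range):
    """Check g(y+1) - 2*g(y) + g(y-1) >= 0 on a range of integers."""
    return all(g(y + 1) - 2 * g(y) + g(y - 1) >= 0 for y in y_range)

print(is_discrete_convex(abs, range(-10, 11)))                       # True: absolute difference
print(is_discrete_convex(lambda y: y * y, range(-10, 11)))           # True: squared difference
print(is_discrete_convex(lambda y: min(abs(y), 3), range(-10, 11)))  # False: truncation breaks convexity
```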
Figure 4.2
Part of the graph construction for sites $(i, j) \in E$.
This chapter follows a simplified version of the original approach described in [211], which is based on computing a minimum cut in a certain graph.
We are now ready to describe the graph construction. Before proceeding further, the reader who is unfamiliar with graph cuts is encouraged to read chapter 2. Part of the construction is illustrated in figure 4.2. There are two special nodes in the graph, the source $s$ and the sink $t$. For each $x_i$, $i \in V$, a set of nodes $p_i^0, p_i^1, \ldots, p_i^{k+1}$ is created. Node $p_i^0$ is identified with the source $s$, and $p_i^{k+1}$ is identified with the sink $t$. Node $p_i^m$ is connected to node $p_i^{m+1}$ with a directed edge $e_i^m$ for $m = 0, 1, \ldots, k$. In addition, for $m = 0, 1, \ldots, k$, node $p_i^{m+1}$ is connected to $p_i^m$ with a directed edge of infinite weight. This ensures that for each $i$, only one of the edges $e_i^m$, $m = 0, 1, \ldots$ will be in the minimum cut. If an edge $e_i^m$ is cut, then variable $x_i$ is assigned label $m$. Thus a cut $C$ of finite cost corresponds to a unique labeling $x^C$.
Furthermore, for any $(i, j) \in E$, an edge $e^{ij}_{ml}$ that connects $p_i^m$ to $p_j^l$ is created for $m = 0, \ldots, k+1$ and $l = 0, \ldots, k+1$. For reasons explained later, the edge weights are set to
$$w(e^{ij}_{ml}) = \frac{w_{ij}}{2}\,\big[g(m-l+1) - 2g(m-l) + g(m-l-1)\big]. \qquad (4.3)$$
The weight in (4.3) is nonnegative because $g(y)$ is convex. This is crucial, since min-cut algorithms require nonnegative edge weights.
Let $C$ be a finite cost cut and let $(i, j) \in E$. If edges $e_i^m$ and $e_j^l$ are in $C$, then all the edges in the set $S^{ml}_{ij}$, defined below, also have to be in $C$:
$$S^{ml}_{ij} = \big\{ e^{ij}_{qr} \mid 0 \le q \le m,\; l+1 \le r \le k+1 \big\} \;\cup\; \big\{ e^{ij}_{qr} \mid m+1 \le q \le k+1,\; 0 \le r \le l \big\}.$$
When summing over $S^{ml}_{ij}$, most weights cancel out, and one gets
$$\sum_{e \in S^{ml}_{ij}} w(e) = w_{ij}\,\big[g(m-l) + g(k+2) + h(m) + h(l)\big],$$
where $h(m) = \frac{1}{2}\,[g(k+1-m) + g(m+1)]$. Recall that the cut $C$ corresponds to a labeling $x^C$. Except for some extra terms, the sum above is almost exactly $\theta_{ij}(m, l) = \theta_{ij}(x^C_i, x^C_j) = w_{ij}\, g(m-l)$. The term $g(k+2)$ can be ignored, since it is a constant and does not change the minimum cut, just its cost. The terms $h(m)$ and $h(l)$ can be subtracted from the costs of edges $e_i^m$ and $e_j^l$. Therefore one sets
$$w(e_i^m) = \theta_i(m) - \sum_{j \in E_i} w_{ij}\, h(m),$$
where $E_i$ is the set of neighbors of site $i$. Under this assignment of edge weights, the cost of any finite cut $C$ is exactly $E(x^C)$ plus a constant. Therefore the minimum cut gives the optimal labeling.
For the absolute linear $\theta_{ij}$ this construction leads to a graph with $O(|V| \cdot |L|)$ vertices and edges, assuming a 4-connected grid (here $|S|$ denotes the size of a set $S$). This is because the edges $e^{ij}_{ml}$ have zero weight unless $m = l$. For a more general $\theta_{ij}$, for example the squared difference, the number of vertices is still $O(|V| \cdot |L|)$, but the number of edges is $O(|V| \cdot |L|^2)$.
Note that [262] develops an algorithm for minimizing energies with convex $\theta_{ij}$ that is more memory- and time-efficient. However, it can be used only when the $\theta_i$ are also convex.
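The construction just described can be assembled with any max-flow/min-cut solver. The sketch below uses networkx purely for illustration; the node naming, the per-site constant added to keep the column capacities nonnegative, and the accumulation of parallel capacities are our own implementation choices, not part of the chapter's construction, and no attempt is made at efficiency.

```python
import networkx as nx

def ishikawa_convex_mrf(unary, edges, w, g, k):
    """Exact minimization of (4.1) for a convex, symmetric g via one min-cut
    (the construction of section 4.2, sketched).

    unary : dict mapping site i -> list of theta_i(m) for m = 0..k
    edges : list of site pairs (i, j)
    w     : dict mapping (i, j) -> w_ij (same keys as `edges`)
    g     : symmetric convex function of the label difference
    k     : largest label, so the label set is {0, ..., k}
    """
    s, t = 's', 't'
    node = lambda i, m: s if m == 0 else (t if m == k + 1 else ('p', i, m))
    h = lambda m: 0.5 * (g(k + 1 - m) + g(m + 1))
    G = nx.DiGraph()

    def add_cap(u, v, c):
        # Parallel capacities add up; an edge without a capacity attribute is infinite.
        if G.has_edge(u, v):
            if 'capacity' in G[u][v]:
                G[u][v]['capacity'] += c
        else:
            G.add_edge(u, v, capacity=c)

    nbr_weight = {i: 0.0 for i in unary}
    for (i, j) in edges:
        nbr_weight[i] += w[(i, j)]
        nbr_weight[j] += w[(i, j)]

    for i in unary:
        # Column edges e_i^m; cutting e_i^m assigns label m to site i.
        cost = [unary[i][m] - nbr_weight[i] * h(m) for m in range(k + 1)]
        shift = max(0.0, -min(cost))      # exactly one column edge per site is cut,
        for m in range(k + 1):            # so a per-site shift leaves the argmin unchanged
            add_cap(node(i, m), node(i, m + 1), cost[m] + shift)
            G.add_edge(node(i, m + 1), node(i, m))   # reverse edge, infinite capacity

    for (i, j) in edges:
        for m in range(k + 2):
            for l in range(k + 2):
                c = 0.5 * w[(i, j)] * (g(m - l + 1) - 2 * g(m - l) + g(m - l - 1))
                u, v = node(i, m), node(j, l)
                if c > 0 and u != v:
                    add_cap(u, v, c)      # edge e^{ij}_{ml} with weight (4.3)
                    add_cap(v, u, c)      # same weight in the reverse direction (g is symmetric)

    _, (S, _) = nx.minimum_cut(G, s, t)
    # Label of site i = index m of the unique column edge e_i^m crossing the cut.
    return {i: next(m for m in range(k + 1) if node(i, m + 1) not in S)
            for i in unary}
```

For the absolute linear $g$ the weight (4.3) is nonzero only when $m = l$, which is consistent with the $O(|V| \cdot |L|)$ edge count mentioned above.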
4.3 Optimizing for Truncated Convex $\theta_{ij}(x_i, x_j)$
This section describes range moves, an approximation approach for the case when $\theta_{ij}$ is a truncated convex function. It is assumed, again, that $\theta_{ij}(x_i, x_j)$ depends only on the label difference $x_i - x_j$. A binary term $\theta_{ij}$ is defined to be truncated convex if there exists a symmetric function $g(y)$ such that $g(y+1) - 2g(y) + g(y-1) \ge 0$ and
$$\theta_{ij}(l_1, l_2) = w_{ij} \min\{g(l_1 - l_2),\; T\}. \qquad (4.4)$$
In this section it is assumed that $\theta_{ij}$ in (4.1) is defined by (4.4).
As mentioned in section 4.1, swap and expansion moves do not perform as well with truncated convex $\theta_{ij}$ as with the Potts model. The reason is that these moves are essentially binary, giving each site a choice of two labels to switch between. For the Potts model the optimal labeling is expected to be piecewise constant. Thus there are large groups of sites that prefer the same label, and binary moves perform well. For the truncated convex model, there may be no large group of sites that prefer the same label, since the optimal labeling is expected to be piecewise smooth. For example, in figure 4.3 there is no large group of sites that prefer exactly the same label. There are, however, large groups of sites that prefer similar labels; one of them is outlined in dark gray (red) in figure 4.3. Thus, to get a better approximation, moves that allow each site a choice of several labels are needed. There is a theorem in [495] that supports this intuition.
Figure 4.3
Illustration of range moves. Left: initial labeling. Right: range move from the labeling on the left. The allowed range of labels $\{\alpha, \alpha+1, \ldots, \beta\}$ is shown at the top. Sites participating in the move are outlined in dark gray (red).
4.3.1 $\alpha$-$\beta$ Range Moves
Recall that the label set is $L = \{0, 1, \ldots, k\}$. Let $L_{\alpha\beta} = \{\alpha, \alpha+1, \ldots, \beta\}$, where $\alpha < \beta \in L$. Given a labeling $x$, $x'$ is an $\alpha$-$\beta$ range move from $x$ if $x'_i \neq x_i$ implies $x_i, x'_i \in L_{\alpha\beta}$. The set of all $\alpha$-$\beta$ range moves from $x$ is denoted by $M_{\alpha\beta}(x) = \{x' \mid x'_i \neq x_i \Rightarrow x'_i, x_i \in L_{\alpha\beta}\}$.
One can view $\alpha$-$\beta$ range moves as a generalization of $\alpha$-$\beta$ swap moves. An $\alpha$-$\beta$ swap move reassigns the labels $\alpha, \beta$ among sites that are currently labeled $\alpha$ and $\beta$. An $\alpha$-$\beta$ range move reassigns the labels in the range $\{\alpha, \alpha+1, \ldots, \beta\}$ among the sites that currently have labels in the range $\{\alpha, \alpha+1, \ldots, \beta\}$. An illustration of an $\alpha$-$\beta$ range move is given in figure 4.3.
Of course, if one knew how to find the best $\alpha$-$\beta$ range move for $\alpha = 0$ and $\beta = k$, one would find a global optimum, which is impossible because the problem is, in general, NP-hard. However, it is possible to find the best $\alpha$-$\beta$ range move if $|\beta - \alpha| \le T$.
4.3.2 $\alpha$-$\beta$ Range Moves for $|\beta - \alpha| \le T$
For simplicity, it is assumed here that $|\beta - \alpha| = T$; extending to the case $|\beta - \alpha| < T$ is trivial. Suppose that a labeling $x$ is given and one wishes to find the optimal $\alpha$-$\beta$ range move, where $|\beta - \alpha| = T$. The graph construction is similar to that in section 4.2. Let $\mathcal{T} = \{i \mid \alpha \le x_i \le \beta\}$. Notice that the truncated convex terms $\theta_{ij}$ become convex when $i, j \in \mathcal{T}$, since for any $i, j \in \mathcal{T}$, $\theta_{ij}(x_i, x_j) = w_{ij}\, g(x_i - x_j) \le w_{ij}\, T$.
Label set $L_{\alpha\beta}$ is identified with label set $\{0, 1, \ldots, T\}$ and the construction in section 4.2 is employed, but only on the sites in the subset $\mathcal{T}$ and with a correction for the boundary of $\mathcal{T}$.
Essentially, the problem is that the construction in section 4.2 does not account for the effect of the terms $\theta_{ij}$ on the boundary of $\mathcal{T}$, that is, those $\theta_{ij}$ for which $|\{i, j\} \cap \mathcal{T}| = 1$. This boundary problem is easy to fix. For each site $i \in \mathcal{T}$, if there is a neighboring site $j \notin \mathcal{T}$, an additional cost $\theta_{ij}(m, x_j)$ is added to the weight of edge $e_i^m$, for all $m = 0, 1, \ldots, k$. Recall that the label set $\{\alpha, \alpha+1, \ldots, \beta\}$ is identified with the label set $\{0, 1, \ldots, T\}$; therefore $\theta_{ij}(m, x_j) = \theta_{ij}(m + \alpha, x_j)$. This additional weight on the edges $e_i^m$ makes sure that the terms $\theta_{ij}$ on the boundary of $\mathcal{T}$ are accounted for. This corrected construction finds an optimal $\alpha$-$\beta$ range move (see [495] for a precise proof).
Just as with an $\alpha$-$\beta$ swap, the algorithm starts at some labeling $x$. It then iterates over a set of label ranges $\{\alpha, \ldots, \beta\}$ with $|\beta - \alpha| = T$, finding the best $\alpha$-$\beta$ range move $x'$ and switching the current labeling to $x'$.
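As with the swap algorithm, the outer loop simply sweeps over label ranges. The sketch below assumes a routine `best_range_move(x, alpha, beta)` implementing the graph construction of this section; it is an illustrative outline only.

```python
def range_move_algorithm(x_init, k, T, energy, best_range_move):
    """Outer loop of the alpha-beta range move algorithm (sketched).

    k, T            : largest label and the truncation parameter of (4.4)
    best_range_move : function (x, alpha, beta) -> optimal alpha-beta range move
                      from x for |beta - alpha| <= T (assumed to be provided)
    """
    x = x_init
    improved = True
    while improved:
        improved = False
        for alpha in range(0, k - T + 1):   # sweep the ranges {alpha, ..., alpha + T}
            x_new = best_range_move(x, alpha, alpha + T)
            if energy(x_new) < energy(x):
                x, improved = x_new, True
    return x
```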
4.3.3 $\alpha$-$\beta$ Generalized Range Moves
It is possible to slightly generalize the construction in the previous section as follows. As previously, let $|\beta - \alpha| = T$ (the case of $|\beta - \alpha| < T$ is basically identical) and, as before, let $\mathcal{T} = \{i \mid \alpha \le x_i \le \beta\}$. Let
$$L^t_{\alpha\beta} = \{\alpha - t, \alpha - t + 1, \ldots, \beta + t - 1, \beta + t\} \cap L,$$
that is, $L^t_{\alpha\beta}$ extends the range of $L_{\alpha\beta}$ by $t$ in each direction, making sure that the resulting range is still a valid range of labels in $L$. Let
$$M^t_{\alpha\beta}(x) = \{x' \mid x'_i \neq x_i \Rightarrow x_i \in L_{\alpha\beta},\; x'_i \in L^t_{\alpha\beta}\}.$$
That is, $M^t_{\alpha\beta}(x)$ is the set of moves that change pixel labels from $L_{\alpha\beta}$ to labels in $L^t_{\alpha\beta}$. Notice that $M_{\alpha\beta}(x) \subseteq M^t_{\alpha\beta}(x)$. It is actually not possible to find the optimal move in $M^t_{\alpha\beta}(x)$, but one can find $\hat{x} \in M^t_{\alpha\beta}(x)$ such that $E(\hat{x}) \le E(x^*)$, where $x^*$ is the optimal move in $M_{\alpha\beta}(x)$. Thus labeling $\hat{x}$ is not worse, and can be significantly better, than the optimal move $x^*$ in $M_{\alpha\beta}(x)$.
Almost the same construction as in section 4.3.2 is used. A graph for the pixels in $\mathcal{T} = \{i \mid \alpha \le x_i \le \beta\}$ is constructed. However, the label range is now $L^t_{\alpha\beta}$ and, as before, it is identified with the label set $\{0, 1, \ldots, |L^t_{\alpha\beta}| - 1\}$. The rest is identical to section 4.3.2.
Suppose $\hat{x}$ is the optimum labeling found by the construction above. This construction may overestimate, but not underestimate, the actual energy of $\hat{x}$. Also, its (possibly overestimated) energy is not higher than the energy of $x^*$, the optimal range move in $M_{\alpha\beta}(x)$, since $M_{\alpha\beta}(x) \subseteq M^t_{\alpha\beta}(x)$. Thus $E(\hat{x}) \le E(x^*)$. See [495] for a precise proof.
In practice it is enough to set $t$ to a small constant, because the larger the value of $t$, the more the graph construction overestimates the energy, and it becomes less likely that the labels at the ends of the range will be assigned. Using a small $t$ saves computational time, especially for the truncated quadratic model, since the size of the graph is quadratic in the number of labels.
4.4 Experimental Results
The generalized range moves (section 4.3.3) were used for all the experiments presented in this section, with $t = 3$. For max-flow computation, the algorithm in [68] was used. The performance is evaluated on image restoration, image inpainting, and stereo correspondence.
Figure 4.4
Comparison of optimization algorithms applied to image restoration: (a) original image; (b) with added noise $N(0, 16)$; (c) expansion; (d) range moves. Note the stripey artifacts in (c) that are absent in (d).
4.4.1 Image Restoration
In image restoration the task is to reconstruct the original image from a noisy one. $V$ is the set of all pixels and $L = \{0, 1, \ldots, 255\}$. The unary and binary terms were $\theta_i(x_i) = (I_i - x_i)^2$, where $I_i$ is the intensity of pixel $i$ in the observed image, and $\theta_{ij}(x_i, x_j) = 8 \min\{(x_i - x_j)^2, 50\}$, respectively.
Figure 4.4a shows an artificial image with smoothly varying intensities inside the circle, the square, and the background. Figure 4.4b shows the image in (a) corrupted by $N(0, 16)$ noise. Figures 4.4c and 4.4d show the results of the expansion and range moves algorithms, respectively. The results of the swap algorithm are visually similar to those of the expansion algorithm and are omitted here. The energies of the ground truth, the expansion algorithm, and range moves are, respectively, $4.1 \times 10^5$, $4.5 \times 10^5$, and $3.8 \times 10^5$. Notice that the range moves algorithm not only produces an answer with a significantly lower energy, but also gives an answer that looks smoother. The expansion algorithm tends to assign the same label to a subset of pixels that is too large, and the resulting answer looks piecewise constant as opposed to piecewise smooth. This is because the expansion moves algorithm seeks to change a large subset of pixels to the same label, whereas the range moves algorithm can change a subset of pixels to a smooth range of labels. The range moves results are also much closer to the ground truth. The average absolute error (compared with the ground truth in figure 4.4a) for range moves is 0.82, for the swap algorithm it is 1.35, and for the expansion algorithm it is 1.38. Range moves take about twice as long to run as the expansion algorithm on this example; running times were approximately 80 and 40 seconds, respectively.
Figure 4.5
Energies on the Middlebury database. All numbers were divided by $10^3$.
algorithm      Tsukuba   Venus   Teddy   Cones
Range moves    1758      2671    6058    7648
Swap           1805      2702    6099    7706
Expansion      1765      2690    6124    7742
4.4.2 Image Inpainting
Image inpainting is similar to image restoration, except that some pixels have been occluded and therefore have no preference for any label; that is, for an occluded pixel $i$, $\theta_i(l) = 0$ for all $l \in L$. Here, an example from chapter 11 was used. The energy terms are $\theta_{ij}(x_i, x_j) = 25 \min\{(x_i - x_j)^2, 200\}$. The expansion algorithm gives a labeling with energy $1.61 \times 10^7$, the labeling from the swap algorithm has energy $1.64 \times 10^7$, and for range moves the energy is $1.53 \times 10^7$, which is significantly lower. The energies that the swap and expansion algorithms achieve in this implementation are slightly different from those in chapter 11, probably because the iteration over labels is performed in random order, and different runs of the swap and expansion algorithms give slightly different results. The TRW-S algorithm gives better results than range moves (see chapter 11); the optimal energy with TRW-S is $1.5100 \times 10^7$. It is worth noting that for stereo, graph cut performs better than TRW-S when longer-range interactions are present; see [260].
4.4.3 Stereo Correspondence
This section presents results on stereo correspondence for the Middlebury database images [419].^2 Here $V$ is the set of all pixels in the left image and $L$ is the set of all possible disparities. The disparities are discretized at subpixel precision, in quarter-pixel steps. That is, if $|x_i - x_j| = 1$, then the disparities of pixels $i$ and $j$ differ by 0.25 pixel. Let $d_l$ stand for the actual disparity corresponding to the integer label $l$; for example, label 2 corresponds to disparity 0.5. The data costs are
$$\theta_i(l) = \big| I_L(i) - \big[ I_R(i - \lceil d_l \rceil)\,(d_l - \lfloor d_l \rfloor) + I_R(i - \lfloor d_l \rfloor)\,(\lceil d_l \rceil - d_l) \big] \big|,$$
where $\lfloor x \rfloor$ stands for rounding down, $\lceil x \rceil$ stands for rounding up, and $i - d$ stands for the pixel that has the coordinates of pixel $i$ shifted to the left by $d$. The truncated quadratic $\theta_{ij}(x_i, x_j) = 10 \min\{(x_i - x_j)^2, 16\}$ is used.
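The data cost above is simply a linear interpolation of the right image between the two integer shifts that bracket $d_l$. A small sketch for a single rectified scanline is given below; the array conventions and the handling of the integer-disparity case are our own simplifications, and image boundaries are ignored.

```python
import math

def stereo_data_cost(I_left, I_right, i, l, step=0.25):
    """theta_i(l): interpolated matching cost of pixel i for integer label l.

    I_left, I_right : 1D intensity arrays for one rectified scanline
    step            : disparity quantization (label l corresponds to d_l = l * step)
    """
    d = l * step
    lo, hi = math.floor(d), math.ceil(d)
    if hi == lo:
        # Integer disparity: both interpolation weights vanish, so sample directly.
        interp = I_right[i - lo]
    else:
        interp = I_right[i - hi] * (d - lo) + I_right[i - lo] * (hi - d)
    return abs(I_left[i] - interp)
```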
Figure 4.5 compares the energies obtained with range moves to those obtained with the swap and expansion algorithms. The accuracy of the labelings is summarized in figure 4.6.
2. The images were obtained from [Link]/stereo.
Figure 4.6
Accuracy on the Middlebury database.
algorithm      Tsukuba   Venus   Teddy   Cones
Range moves    6.7       3.25    15.1    6.79
Swap           7.47      4.04    15.8    7.64
Expansion      7.14      4.19    16.0    7.81
Figure 4.7
Zooming in on the details of the expansion algorithm (a) and range moves (b); see text.
Each number in figure 4.6 gives the percentage of pixels whose disparity is away from the ground truth by more than 0.5 pixel. Tsukuba, Venus, Teddy, and Cones are four different scenes in the Middlebury database. The range moves algorithm performs better not only in terms of energy but also in terms of ground truth accuracy. The improvement in accuracy is slight but consistent across all scenes. Figure 4.7 shows zoomed detail from the Cones sequence. Range moves produce smoother results than the expansion algorithm.
4.5 Further Developments
Since the original work [211] on optimization with convex $\theta_{ij}$, many extensions have been developed. In [421, 116] the notion of convexity is extended to the more general one of submodularity with respect to the order of labels. In particular, $\theta_{ij}$ does not have to be symmetric, which is useful for some applications [322]. In [420] it is shown that if there exists an ordering of the label set that makes a given energy submodular, it can be found in polynomial time. The idea of encoding multiple labels by binary ones has also been examined further. In [384] the space of possible encodings of multilabel variables by two or more binary variables is examined in general, including the higher-order cases (i.e., when terms that depend on more than two pixels are allowed in the energy).
Other extensions improve the time and memory requirements. In [262] it is shown how to reduce memory, which in practice also leads to a reduction in running time. The approach is based on computing several minimum cuts on smaller graphs. The limitation is that the $\theta_i$ terms are also required to be convex. In [85, 117] similar ideas for improving computational time are presented; again, $\theta_i$ is restricted to being convex. The approach in [82] improves the time and memory requirements without the restriction to convex $\theta_i$. However, there are no optimality guarantees. The approach is similar in spirit to range moves, without requiring the label range to be consecutive. In [373, 547] a spatially continuous formulation of the energy function in section 4.2 is proposed. Its benefits are a reduction of metrication errors and improved time and memory efficiency.
The underlying idea of the range move algorithm is to design multilabel moves (as opposed to binary moves) in order to improve optimization. Several related works explore this idea. In [277] it is shown how to modify the range moves algorithm to achieve a multiplicative approximation bound. The algorithm can be considered a generalization of the expansion algorithm, that is, expansion on a range of labels as opposed to expansion on a single label (as in the original expansion algorithm). In [496] various generalizations of the range move algorithm are considered, including moves that can be viewed as generalizations of the expansion algorithm. The authors of [322] develop multilabel moves for a very specific energy function with strong ordering constraints on the labels. In [173] approximation algorithms based on multilabel moves are developed for general binary terms $\theta_{ij}$, not just truncated convex ones.
5 Loopy Belief Propagation, Mean Field Theory, and Bethe Approximations
Alan Yuille
This chapter describes methods for estimating the marginals and maximum a posteriori (MAP) estimates of probability distributions defined over graphs by approximate methods including mean field theory (MFT), variational methods, and belief propagation. These methods typically formulate this problem in terms of minimizing a free energy function of pseudo marginals. They differ by the design of the free energy and the choice of algorithm to minimize it. These algorithms can often be interpreted in terms of message passing. In many cases the free energy has a dual formulation and the algorithms are defined over the dual variables (e.g., the messages in belief propagation). The quality of performance depends on the types of free energies used, specifically how well they approximate the log partition function of the probability distribution, and whether there are suitable algorithms for finding their minima. Section 5.1 introduces two types of Markov field models that are often used in computer vision. I proceed to define MFT/variational methods in section 5.2, whose free energies are lower bounds of the log partition function, and describe how inference can be done by expectation maximization, steepest descent, or discrete iterative algorithms. Section 5.3 describes message-passing algorithms, such as belief propagation and its generalizations, which can be related to free energy functions (and dual variables). Finally, in section 5.4 I describe how these methods relate to Markov Chain Monte Carlo (MCMC) approaches, which gives a different way to think of these methods and can lead to novel algorithms.
5.1 Two Models
Two important probabilistic vision models are presented that will be used to motivate the algorithms described in the rest of the section.

The first type of model is formulated as a standard Markov Random Field (MRF) with input z and output x. We will describe two vision applications for this model. The first application is image labeling, where $z = \{z_i : i \in D\}$ specifies the intensity values $z_i \in \{0, \ldots, 255\}$ on the image lattice D, and $x = \{x_i : i \in D\}$ is a set of image labels $x_i \in L$; see figure 5.1. The nature of the labels will depend on the problem. For edge detection, $|L| = 2$
Figure 5.1
Graphs for different MRFs. Conventions (far left), basic MRF graph (middle left), MRF graph with inputs $z_i$ (middle right), and graph with line processes $y_{ij}$ (far right).
and the labels $l_1, l_2$ will correspond to edge and non-edge. For labeling the MSRC data set [437], $|L| = 23$ and the labels $l_1, \ldots, l_{23}$ include sky, grass, and so on. A second application is binocular stereo (see figure 5.2), where the input is the pair of images from the left and right cameras, $z = (z_L, z_R)$, and the output is a set of disparities x that specify the relative displacements between corresponding pixels in the two images and hence determine the depth (see figure 5.2 and also chapter 23).
We can model these two applications by a posterior probability distribution P(x | z), a conditional random field [288]. This distribution is defined on a graph G = (V, E), where the set of nodes V is the set of image pixels D and the edges E are between neighboring pixels (see figure 5.1). The $x = \{x_i : i \in V\}$ are random variables specified at each node of the graph. P(x | z) is a Gibbs distribution specified by an energy function E(x, z) that contains unary potentials $U(x, z) = \sum_{i \in V} \theta_i(x_i, z)$ and pairwise potentials $V(x, x) = \sum_{ij \in E} \theta_{ij}(x_i, x_j)$. The unary potentials $\theta_i(x_i, z)$ depend only on the label/disparity at node/pixel i, and the dependence on the input z will depend on the application: (1) for the labeling application $\theta_i(x_i, z) = g(z)_i$, where g(.) is a nonlinear filter that can be obtained by an algorithm like AdaBoost [500] and evaluated in a local image window surrounding pixel i; (2) for binocular stereo we can set $\theta_i(x_i, z_L, z_R) = |f(z_L)_i - f(z_R)_{i + x_i}|$, where f(.) is a vector-valued filter and $|\cdot|$ is the L1 norm, so that $\theta_i(.)$ takes small values at the disparities $x_i$ for which the filter responses are similar on the two images. The pairwise potentials impose prior assumptions about the local context of the labels and disparities. These models typically assume that neighboring pixels will tend to have similar labels/disparities (see figure 5.2).
In summary, the first type of model is specified by a distribution P(x | z) defined over discrete-valued random variables $x = \{x_i : i \in V\}$ defined on a graph G = (V, E):

$$P(x \mid z) = \frac{1}{Z(z)} \exp\Big\{ -\sum_{i \in V} \theta_i(x_i, z) - \sum_{ij \in E} \theta_{ij}(x_i, x_j) \Big\}. \tag{5.1}$$

The goal will be to estimate properties of the distribution such as the MAP estimator and the marginals (which relate to each other, as discussed in subsection 5.2.4):
Figure 5.2
Stereo. The geometry of stereo (left). A point P in 3D space is projected onto points $P_L$, $P_R$ in the left and right images. The projection is specified by the focal points $O_L$, $O_R$ and the directions of gaze of the cameras (the camera geometry). The geometry of stereo enforces that points in the plane specified by $P, O_L, O_R$ must be projected onto corresponding lines $E_L$, $E_R$ in the two images (the epipolar line constraint). If we can find the correspondence between the points on epipolar lines, then we can use trigonometry to estimate their depth, which is (roughly) inversely proportional to the disparity, which is the relative displacement of the two images. Finding the correspondence is usually ill-posed and requires making assumptions about the spatial smoothness of the disparity (and hence of the depth). Current models impose weak smoothness priors on the disparity (center). Earlier models assumed that the disparity was independent across epipolar lines, which leads to similar graphical models (right) where inference could be done by dynamic programming.
$$x^* = \arg\max_x P(x \mid z), \quad \text{the MAP estimate}, \qquad p_i(x_i) = \sum_{x_{/i}} P(x \mid z), \;\; \forall i \in V, \quad \text{the marginals}. \tag{5.2}$$
The second type of model has applications to image segmentation, image denoising, and depth smoothing. Called the weak membrane model, it was proposed independently by Geman and Geman [161] and Blake and Zisserman [54]. This model has additional hidden variables y, which are used to explicitly label discontinuities. It is also a generative model that specifies a likelihood function and a prior probability (in contrast to conditional random fields, which specify only the posterior distribution). This type of model can be extended by using more sophisticated hidden variables to perform tasks such as long-range motion correspondence [544], object alignment [96], and the detection of particle tracks in high-energy physics experiments [355].
The input to the weak membrane model is the set of intensity (or depth) values $z = \{z_i : i \in D\}$, and the output is $x = \{x_i : i \in D\}$ defined on a corresponding output lattice. (Formally we should specify two different lattices, say $D_1$ and $D_2$, but this makes the notation too cumbersome.) We define a set of edges E that connect neighboring pixels on the output lattice and define the set of line processes $y = \{y_{ij} : ij \in E\}$ with $y_{ij} \in \{0, 1\}$ over these edges (see figure 5.1). The weak membrane is a generative model, so it is specified by two probability distributions: (1) the likelihood function P(z | x), which specifies how the observed image z is a corrupted version of the image x, and (2) the prior distribution P(x, y), which imposes a weak membrane by requiring that neighboring pixels take similar values except at places where the line process is activated.
The simplest version of the weak membrane model is specified by the distributions

$$P(z \mid x) \propto \prod_{i \in D} \exp\{-(z_i - x_i)^2\}, \qquad P(x, y) \propto \exp\{-E(x, y)\},$$
$$\text{with } E(x, y) = A \sum_{(i,j) \in E} (x_i - x_j)^2 (1 - y_{ij}) + B \sum_{(i,j) \in E} y_{ij}. \tag{5.3}$$
In this model the intensity variables $x_i, z_i$ are continuous-valued and the line process variables $y_{ij} \in \{0, 1\}$, where $y_{ij} = 1$ means that there is an (image) edge at $ij \in E$. The likelihood function P(z | x) assumes independent zero-mean Gaussian noise (for other noise models, such as shot noise, see Geiger and Yuille [159] and Black and Rangarajan [47]). The prior P(x, y) encourages neighboring pixels i, j to have similar intensity values $x_i \approx x_j$ except if there is an edge $y_{ij} = 1$. This prior imposes piecewise smoothness, or weak smoothness, which is justified by statistical studies of intensities and depth measurements (see Zhu and Mumford [550], Roth and Black [397]). More advanced variants of this model will introduce higher-order coupling terms of the form $y_{ij} y_{kl}$ into the energy E(x, y) to encourage edges to group into longer segments that may form closed boundaries. The weak membrane model leads to a particularly hard inference problem since it requires estimating continuous and discrete variables, x and y, from $P(x, y \mid z) \propto P(z \mid x) P(x, y)$.
5.2 Mean Field Theory and Variational Methods
Mean Field Theory (MFT), also known as a variational method, offers a strategy to design inference algorithms for MRF models. The approach has several advantages: (1) It takes optimization problems defined over discrete variables and converts them into problems defined in terms of continuous variables. This enables us to compute gradients of the energy and to use optimization techniques that depend on them, such as steepest descent. In particular, we can take hybrid problems defined in terms of both discrete and continuous variables, such as the weak membrane, and convert them into continuous optimization problems. (2) We can use deterministic annealing methods to develop continuation methods in which we define a one-parameter family of optimization problems indexed by a temperature parameter T. We can solve the problems for large values of T (for which the optimization is simple) and track the solutions to low values of T (where the optimization is hard; see section 5.2.4). (3) We can show that MFT gives a fast deterministic approximation to Markov Chain Monte Carlo (MCMC) stochastic sampling methods, as described in section 5.4, and hence can be more efficient than stochastic sampling. (4) MFT methods can give bounds for quantities, such as the partition function log Z, that are useful for model selection problems, as described in [46].
5.2.1 Mean Field Free Energies
The basic idea of MFT is to approximate a distribution P(x | z) by a simpler distribution $B^*(x \mid z)$ that is chosen so that it is easy to estimate the MAP estimate of P(.), and any other estimator, from the approximate distribution $B^*(.)$. This requires specifying a class of approximating distributions {B(.)}, a measure of similarity between distributions B(.) and P(.), and an algorithm for finding the $B^*(.)$ that minimizes the similarity measure.
In MFT the class of approximating distributions is chosen to be factorizable, so that $B(x) = \prod_{i \in V} b_i(x_i)$, where $b = \{b_i(x_i)\}$ are pseudo marginals that obey $b_i(x_i) \geq 0\ \forall i, x_i$ and $\sum_{x_i} b_i(x_i) = 1\ \forall i$. This means that the MAP estimate of $x = (x_1, \ldots, x_N)$ can be approximated by $x_i^* = \arg\max_{x_i} b_i^*(x_i)$ once we have determined $B^*(x)$. But note that MFT can be extended to structured mean field theory, which allows more structure to the {B(.)} (see [46]). The similarity measure is specified by the Kullback-Leibler divergence $KL(B, P) = \sum_x B(x) \log \frac{B(x)}{P(x)}$, which has the property that $KL(B, P) \geq 0$ with equality only if B(.) = P(.). It can be shown (see section 5.2.2) that this is equivalent to a mean field free energy $F = \sum_x B(x) E(x) + \sum_x B(x) \log B(x)$.
For the first type of model we define the mean field free energy $F_{MFT}(b)$ by

$$F_{MFT}(b) = \sum_{ij \in E} \sum_{x_i, x_j} b_i(x_i)\, b_j(x_j)\, \theta_{ij}(x_i, x_j) + \sum_{i \in V} \sum_{x_i} b_i(x_i)\, \theta_i(x_i, z) + \sum_{i \in V} \sum_{x_i} b_i(x_i) \log b_i(x_i). \tag{5.4}$$
The first two terms are the expectation of the energy E(x, z) with respect to the distribution b(x), and the third term is the negative entropy of b(x). If the labels can take only two values (i.e., $x_i \in \{0, 1\}$), then this term can be written as $\sum_{i \in V} \{b_i \log b_i + (1 - b_i) \log(1 - b_i)\}$, where $b_i = b_i(x_i = 1)$. If the labels take a set of values $l = 1, \ldots, M$, then we can express it as $\sum_{i \in V} \sum_{l=1}^{M} b_{il} \log b_{il}$, where $b_{il} = b_i(x_i = l)$; hence the $\{b_{il}\}$ satisfy the constraint $\sum_{l=1}^{M} b_{il} = 1\ \forall i$.
For the second (weak membrane) model we use pseudo marginals b(y) only for the line processes y. This leads to a free energy $F_{MFT}(b, x)$ specified by

$$F_{MFT}(b, x) = \sum_{i \in V} (x_i - z_i)^2 + A \sum_{ij \in E} (1 - b_{ij})(x_i - x_j)^2 + B \sum_{ij \in E} b_{ij} + \sum_{ij \in E} \{ b_{ij} \log b_{ij} + (1 - b_{ij}) \log(1 - b_{ij}) \}, \tag{5.5}$$

where $b_{ij} = b_{ij}(y_{ij} = 1)$ (the derivation uses the fact that $\sum_{y_{ij}=0}^{1} b_{ij}(y_{ij})\, y_{ij} = b_{ij}$). As described below, this free energy is exact and involves no approximations.
5.2.2 Mean Field Free Energy and Variational Bounds
I now describe in more detail the justifications for the mean field free energies. For the first type of models the simplest derivations are based on the Kullback-Leibler divergence, which was introduced into the machine learning literature by Saul and Jordan [414]. (Note that the mean field free energies can also be derived by related statistical physics techniques [365], and there were early applications to neural networks [201], vision [239], and machine learning [369].)
Substituting $P(x) = \frac{1}{Z} \exp\{-E(x)\}$ and $B(x) = \prod_{i \in V} b_i(x_i)$ into the Kullback-Leibler divergence KL(B, P) gives

$$KL(B, P) = \sum_x B(x) E(x) + \sum_x B(x) \log B(x) + \log Z = F_{MFT}(B) + \log Z. \tag{5.6}$$

Hence minimizing $F_{MFT}(B)$ with respect to B gives (1) the best factorized approximation to P(x) and (2) a lower bound to the partition function, $\log Z \geq -\min_B F_{MFT}(B)$, which can be useful to assess model evidence [46].
For the weak membrane model the free energy follows from Neal and Hinton's variational formulation of the Expectation Maximization (EM) algorithm [348]. The goal of EM is to estimate x from $P(x \mid z) = \sum_y P(x, y \mid z)$ after treating the y as nuisance variables that should be summed out [46]. This can be expressed [348] in terms of minimizing the free energy function:

$$F_{EM}(B, x) = -\sum_y B(y) \log P(x, y \mid z) + \sum_y B(y) \log B(y). \tag{5.7}$$
The equivalence of minimizing $F_{EM}[B, x]$ and estimating $x^* = \arg\max_x P(x \mid z)$ can be verified by reexpressing

$$F_{EM}[B, x] = -\log P(x \mid z) + \sum_y B(y) \log \frac{B(y)}{P(y \mid x, z)}, \tag{5.8}$$

from which it follows that the global minimum occurs at $x^* = \arg\min_x \{-\log P(x \mid z)\}$ and $B(y) = P(y \mid x^*, z)$ (because the second term in (5.8) is the Kullback-Leibler divergence, which is minimized by setting $B(y) = P(y \mid x, z)$).
The EM algorithm minimizes $F_{EM}[B, x]$ with respect to B and x alternately, which gives the E-step and the M-step, respectively. For the basic weak membrane model, both steps of the algorithm can be performed simply. The M-step requires minimizing a quadratic function, which can be performed by linear algebra, and the E-step can be computed analytically:

$$\text{Minimize wrt } x: \quad \sum_i (x_i - z_i)^2 + A \sum_{(i,j) \in E} (1 - b_{ij})(x_i - x_j)^2, \tag{5.9}$$

$$B(y) = \prod_{(i,j) \in E} b_{ij}(y_{ij}), \qquad b_{ij} = \frac{1}{1 + \exp\{-A(x_i - x_j)^2 + B\}}. \tag{5.10}$$
The EM algorithm is only guaranteed to converge to a local minimum of the free energy, so good choices of initial conditions are needed. A natural initialization for the weak membrane model is to set x = z, perform the E-step, then the M-step, and so on. Observe that the M-step corresponds to performing a weighted smoothing of the data z, where the smoothing weights are determined by the current probabilities B(y) for the edges. The E-step estimates the probabilities B(y) for the edges, given the current estimates for the x.
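The following is a minimal sketch of this alternation on a 1D signal, assuming the energy of (5.3); it is an illustration, not code from the book. The constants A and B, the Jacobi sweeps standing in for an exact linear solve, and the 1D setting are all assumptions.

```python
import numpy as np

def weak_membrane_em(z, A=1.0, B=4.0, iters=20, sweeps=10):
    """Alternating minimization of F_EM(b, x) for a 1D signal z.

    b[i] approximates P(y_{i,i+1} = 1 | x), the probability of an edge
    between neighboring sites i and i+1.
    """
    x = np.asarray(z, dtype=float).copy()
    n = len(x)
    b = np.zeros(n - 1)
    for _ in range(iters):
        # E-step: closed-form update of the edge probabilities, as in (5.10)
        dx2 = (x[:-1] - x[1:]) ** 2
        b = 1.0 / (1.0 + np.exp(-A * dx2 + B))
        # M-step: weighted smoothing of z (a quadratic problem in x);
        # a few Jacobi sweeps stand in for an exact linear solve
        w = A * (1.0 - b)
        for _ in range(sweeps):
            left = np.concatenate([[0.0], w])     # weight to the left neighbor
            right = np.concatenate([w, [0.0]])    # weight to the right neighbor
            xl = np.concatenate([[0.0], x[:-1]])
            xr = np.concatenate([x[1:], [0.0]])
            x = (z + left * xl + right * xr) / (1.0 + left + right)
    return x, b
```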
Notice that the EM free energy does not enforce any constraints on the form of the distribution B, yet the algorithm results in a factorized distribution (see (5.10)). This arises naturally because the variables that are being summed out, the y variables, are conditionally independent (i.e., there are no terms in the energy E(x, y) that couple $y_{ij}$ with its neighbors). In addition we can compute
$$P(x \mid z) = \sum_y P(x, y \mid z) \tag{5.11}$$

analytically to obtain

$$P(x \mid z) = \frac{1}{Z} \exp\Big\{ -\sum_{i \in D} (x_i - z_i)^2 - \sum_{ij \in E} g(x_i - x_j) \Big\}, \tag{5.12}$$

where

$$g(x_i - x_j) = -\log\big\{ \exp\{-A(x_i - x_j)^2\} + \exp\{-B\} \big\}. \tag{5.13}$$
The function $g(x_i - x_j)$ penalizes $x_i - x_j$ quadratically for small $x_i - x_j$ but tends to a finite value asymptotically for large $|x_i - x_j|$.
Suppose, however, that we consider a modified weak membrane model that includes interactions between the line processes: terms in the energy such as $C \sum_{(ij)(kl) \in E_y} y_{ij} y_{kl}$ that encourage lines to be continuous. It is now impossible either (a) to solve for B(y) in closed form for the E-step of EM or (b) to compute P(x | z) analytically. Instead, we use the mean field approximation by requiring that B is factorizable, $B(y) = \prod_{ij \in E} b_{ij}(y_{ij})$. This gives a free energy:
$$F_{MFT}(b, x) = \sum_{i \in V} (x_i - z_i)^2 + A \sum_{ij \in E} (1 - b_{ij})(x_i - x_j)^2 + B \sum_{ij \in E} b_{ij} + C \sum_{(ij)(kl) \in E_y} b_{ij} b_{kl} + \sum_{ij \in E} \{ b_{ij} \log b_{ij} + (1 - b_{ij}) \log(1 - b_{ij}) \}. \tag{5.14}$$
5.2.3 Minimizing the Free Energy by Steepest Descent and Its Variants
The mean field free energies are functions of continuous variables (since discrete variables have been replaced by continuous probability distributions), which enables us to compute gradients of the free energy. This allows us to use steepest descent algorithms and their many variants. Suppose we take the MFT free energy from (5.4), restrict $x_i \in \{0, 1\}$, and set $b_i = b_i(x_i = 1)$; then basic steepest descent can be written as

$$\frac{db_i}{dt} = -\frac{\partial F_{MFT}}{\partial b_i} = -2 \sum_{x_j} \theta_{ij}(x_i, x_j)\, b_j - \theta_i(x_i) - \log\frac{b_i}{1 - b_i}. \tag{5.15}$$
The MFT free energy decreases monotonically because

$$\frac{dF_{MFT}}{dt} = \sum_i \frac{\partial F_{MFT}}{\partial b_i} \frac{db_i}{dt} = -\sum_i \Big( \frac{\partial F_{MFT}}{\partial b_i} \Big)^2.$$
(Note that the energy decreases very slowly for small gradients because the square of a small number is very small.) The negative entropy term $\{b_i \log b_i + (1 - b_i) \log(1 - b_i)\}$ is guaranteed to keep the values of $b_i$ within the range [0, 1], since the gradient of the negative entropy equals $\log \frac{b_i}{1 - b_i}$, which becomes infinitely large in magnitude as $b_i \to 0$ and $b_i \to 1$.
In practice, we must replace (5.15) with a discrete approximation of the form

$$b_i^{t+1} = b_i^t - \Delta\, \frac{\partial F_{MFT}}{\partial b_i},$$

where $b_i^t$ is the state at time t. But the choice of the step size $\Delta$ is critical. If it is too large, the algorithm will fail to converge, and if it is too small, the algorithm will converge very slowly. (Refer to Press et al. [378] for a detailed discussion of variants of steepest descent and their numerical stability and convergence properties.) A simple variant that has often been used for mean field theory applications to vision [239, 542] is to multiply the free energy gradient $\partial F_{MFT} / \partial b_i$ in (5.15) by a positive function (ensuring that the free energy still decreases monotonically). A typical choice of function is $b_i(1 - b_i)$, which, interestingly, gives dynamics that are identical to models of artificial neural networks [201].
There is a related class of discrete iterative algorithms that can be expressed in the form $b^{t+1} = f(b^t)$ for some function f(.). They have two advantages over steepest descent algorithms: (1) they are guaranteed to decrease the free energy monotonically (i.e., $F_{MFT}(b^{t+1}) \leq F_{MFT}(b^t)$), and (2) they are nonlocal, so that $b^{t+1}$ may be distant from $b^t$, which can enable them to escape some of the local minima that can trap steepest descent. Algorithms of this type can be derived using principles such as variational bounding [408, 220], majorization [119], and CCCP [546]. It can be shown that many existing discrete iterative algorithms (e.g., EM, generalized iterative scaling, Sinkhorn's algorithm) can be derived using the CCCP principle [546]. For a recent discussion and entry point into this literature, see [447].
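As a concrete illustration of such a fixed-point scheme, the sketch below runs damped mean field updates for a binary pairwise MRF with Ising-style potentials (the same form used later, in section 5.4). It is a hypothetical example, not code from the chapter; the coupling-matrix representation and the damping constant are assumptions.

```python
import numpy as np

def mean_field_binary(theta, W, eta=0.5, iters=100):
    """Damped mean field updates for a binary pairwise MRF with energy
    E(x) = sum_i theta[i] x_i + sum_{i<j} W[i, j] x_i x_j,  x_i in {0, 1}.

    W is a symmetric coupling matrix with zero diagonal (zero where there
    is no edge); b[i] approximates the marginal probability that x_i = 1.
    The undamped update b_i <- sigmoid(-theta[i] - sum_j W[i, j] b_j) is
    the stationarity condition of the mean field free energy.
    """
    n = len(theta)
    b = np.full(n, 0.5)
    for _ in range(iters):
        for i in range(n):
            field = theta[i] + W[i].dot(b)          # local "mean field" at site i
            new_bi = 1.0 / (1.0 + np.exp(field))    # sigmoid(-field)
            b[i] = (1.0 - eta) * b[i] + eta * new_bi
    return b
```

Taking eta = 1 gives the plain coordinate-wise fixed-point update; smaller values damp it and often stabilize convergence.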
5.2.4 Temperature and Deterministic Annealing
So far we have concentrated on using MFT to estimate the marginal distributions. I now describe how MFT can attempt to estimate the most probable states of the probability distribution, $x^* = \arg\max_x P(x)$. The strategy is to introduce a temperature parameter T and a family of probability distributions related to P(x). (This strategy is also used in chapter 6.)

More precisely, we define a one-parameter family of distributions $\propto \{P(x)\}^{1/T}$, where T is a temperature parameter (the constant of proportionality is the normalization constant; see figure 5.3). This is equivalent to specifying Gibbs distributions $P(x; T) = \frac{1}{Z(T)} \exp\{-E(x)/T\}$, where the default distribution P(x) occurs at T = 1. The key observation is that as $T \to 0$, the distribution becomes strongly peaked about the state $x^* = \arg\min_x E(x)$ with lowest energy (or states, if there are two or more global minima). Conversely, as $T \to \infty$ all states become equally likely and P(x; T) tends to the uniform distribution.

Introducing this temperature parameter modifies the free energies by multiplying the entropy term by T. For example, we modify (5.4) to be
$$F_{MFT}(b; T) = \sum_{ij \in E} \sum_{x_i, x_j} b_i(x_i)\, b_j(x_j)\, \theta_{ij}(x_i, x_j) + \sum_{i \in V} \sum_{x_i} b_i(x_i)\, \theta_i(x_i, z) + T \sum_{i \in V} \sum_{x_i} b_i(x_i) \log b_i(x_i). \tag{5.16}$$
Figure 5.3
The probability distribution P(x; T) peaks sharply as $T \to 0$ and tends to a uniform distribution for large T (left; curves shown for T = 0.5, 2.0, 10.0). The mean field free energy F is convex for large T and becomes less smooth as T decreases (right). This motivates simulated annealing and deterministic annealing, which are related to graduated nonconvexity. For some models there are phase transitions where the minima of the free energy change drastically at a critical temperature $T_c$.
Observe that for large T, the convex entropy term will dominate the free energy, causing it to become convex. But for small T the remaining terms dominate. In general, we expect that the landscape of the free energy will become smoother as T increases, and in some cases it is possible to compute a temperature $T_c$ above which the free energy has an obvious solution [131]. This motivates a continuation approach known as deterministic annealing, which involves minimizing the free energy at high temperatures and using the result to provide initial conditions for minimizing the free energies at lower temperatures. In practice, the best results often require introducing temperature dependence into the parameters [131]. At sufficiently low temperatures the global minima of the free energy can approach the MAP estimates, but technical conditions need to be enforced (see [545]).

Deterministic annealing was motivated by simulated annealing [234], which performs stochastic sampling (see section 5.4) from the distribution P(x; T), gradually reducing T so that eventually the samples come from P(x; T = 0) and, hence, correspond to the global minimum $x^* = \arg\min_x E(x)$. This approach is guaranteed to converge [161], but the theoretically guaranteed rate of convergence is impractically slow, so in practice rates are chosen heuristically. Deterministic annealing is also related to the continuation techniques described by Blake and Zisserman [54] to obtain solutions to the weak membrane model.
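A minimal sketch of the deterministic annealing loop described above follows. It assumes some routine `minimize_free_energy(b, T)` (for example, the mean field updates of section 5.2.3 applied to the temperature-weighted free energy (5.16)); the geometric cooling schedule is an arbitrary choice, not one prescribed by the chapter.

```python
def deterministic_annealing(minimize_free_energy, b_init, T_start=10.0,
                            T_stop=0.1, decay=0.8):
    """Continuation over temperature: minimize F(b; T) at a high T, then
    reuse the solution as the starting point at the next, lower T.

    minimize_free_energy(b, T) is any routine that (locally) minimizes a
    temperature-weighted free energy such as (5.16) starting from b.
    """
    b, T = b_init, T_start
    while T > T_stop:
        b = minimize_free_energy(b, T)   # warm-start from the previous solution
        T *= decay                       # geometric cooling schedule
    return b
```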
5.3 Belief Propagation and Bethe Free Energy
I now present a different approach to estimating (approximate) marginals and MAPs of an MRF. This is called belief propagation (BP). It was originally proposed as a method for doing inference on trees (i.e., graphs without closed loops) [367], for which it is guaranteed to converge to the correct solution (and is related to dynamic programming). But empirical studies showed that belief propagation will often yield good approximate results on graphs that do have closed loops [337].

To illustrate the advantages of belief propagation, consider the binocular stereo problem, which can be addressed by using the first type of model. For binocular stereo there is the epipolar line constraint, which means that, provided we know the camera geometry, we can reduce the problem to one-dimensional matching (see figure 5.2). We can impose weak smoothness in this dimension only, and then use dynamic programming to solve the problem [158]. But a better approach is to impose weak smoothness in both directions, which can be solved (approximately) using belief propagation [457] (see figure 5.2).

Surprisingly, the fixed points of belief propagation algorithms correspond to the extrema of the Bethe free energy [540]. This free energy (see (5.22)) appears to be better than the mean field theory free energy because it includes pairwise pseudo marginal distributions and reduces to the MFT free energy if those distributions are replaced by the product of unary marginals. But, except for graphs without closed loops (or with a single closed loop), there are no theoretical results showing that the Bethe free energy yields a better approximation
Figure 5.4
Message passing (left) is guaranteed to converge to the correct solution on graphs without closed loops (center) but gives only approximations on graphs with a limited number of closed loops (right).
than mean field theory (see figure 5.4). There is also no guarantee that BP will converge for general graphs, and it can oscillate widely.
5.3.1 Message Passing
BP is defined in terms of messages $m_{ij}(x_j)$ from i to j, and is specified by the sum-product update rule:

$$m_{ij}^{t+1}(x_j) = \sum_{x_i} \exp\{-\theta_{ij}(x_i, x_j) - \theta_i(x_i)\} \prod_{k \in N(i) \setminus j} m_{ki}^t(x_i). \tag{5.17}$$
The unary and binary pseudo marginals are related to the messages by

$$b_i^t(x_i) \propto \exp\{-\theta_i(x_i)\} \prod_{k \in N(i)} m_{ki}^t(x_i), \tag{5.18}$$

$$b_{kj}^t(x_k, x_j) \propto \exp\{-\theta_{kj}(x_k, x_j) - \theta_k(x_k) - \theta_j(x_j)\} \prod_{\tau \in N(k) \setminus j} m_{\tau k}^t(x_k) \prod_{l \in N(j) \setminus k} m_{lj}^t(x_j). \tag{5.19}$$
The update rule for BP is not guaranteed to converge to a fixed point for general graphs and can sometimes oscillate wildly. It can be partially stabilized by adding a damping term to (5.17), for example, by multiplying the right-hand side by $(1 - \lambda)$ and adding a term $\lambda\, m_{ij}^t(x_j)$ for some damping parameter $\lambda \in [0, 1)$.
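The following sketch implements the damped updates (5.17) and (5.18) for a small pairwise MRF. It is an illustration under assumed data-structure choices (potentials stored per undirected edge, messages on directed edges), not code from the chapter.

```python
import numpy as np

def loopy_bp(theta_unary, theta_pair, edges, n_labels, damp=0.5, iters=50):
    """Damped sum-product BP for a pairwise MRF with energy
    E(x) = sum_i theta_unary[i][x_i] + sum_{(i,j) in edges} theta_pair[(i,j)][x_i, x_j].
    """
    n = len(theta_unary)
    directed = [(i, j) for (i, j) in edges] + [(j, i) for (i, j) in edges]
    nbrs = {i: [] for i in range(n)}
    for (i, j) in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    m = {(i, j): np.ones(n_labels) / n_labels for (i, j) in directed}
    for _ in range(iters):
        new_m = {}
        for (i, j) in directed:
            pot = theta_pair[(i, j)] if (i, j) in theta_pair else theta_pair[(j, i)].T
            others = [m[(k, i)] for k in nbrs[i] if k != j]
            incoming = np.prod(others, axis=0) if others else np.ones(n_labels)
            # sum over x_i of exp(-theta_ij - theta_i) times incoming messages
            msg = np.exp(-pot).T.dot(np.exp(-theta_unary[i]) * incoming)
            msg /= msg.sum()
            new_m[(i, j)] = damp * m[(i, j)] + (1.0 - damp) * msg
        m = new_m
    beliefs = []
    for i in range(n):
        b = np.exp(-theta_unary[i]) * np.prod([m[(k, i)] for k in nbrs[i]], axis=0)
        beliefs.append(b / b.sum())
    return beliefs
```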
To understand the convergence of BP, observe that the pseudo marginals b satisfy the admissibility constraint:

$$\frac{\prod_{ij} b_{ij}(x_i, x_j)}{\prod_i b_i(x_i)^{n_i - 1}} \propto \exp\Big\{ -\sum_{ij} \theta_{ij}(x_i, x_j) - \sum_i \theta_i(x_i) \Big\} \propto P(x), \tag{5.20}$$
where $n_i$ is the number of edges that connect to node i. This means that the algorithm reparameterizes the distribution from an initial specification in terms of the $\theta$ to one in terms of the pseudo marginals b. For a tree this reparameterization is exact (i.e., the pseudo marginals become the true marginals of the distribution); we can, for example, represent a one-dimensional distribution either by $P(x) = \frac{1}{Z} \big[ \prod_{i=1}^{N-1} \psi(x_i, x_{i+1}) \prod_{i=1}^{N} \psi_i(x_i) \big]$ or by $\prod_{i=1}^{N-1} p(x_i, x_{i+1}) / \prod_{i=2}^{N-1} p(x_i)$.
It follows from the message updating equations (5.17), (5.19) that at convergence the b's satisfy the consistency constraints:

$$\sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i), \qquad \sum_{x_i} b_{ij}(x_i, x_j) = b_j(x_j). \tag{5.21}$$

This follows from the fixed point conditions on the messages, $m_{kj}(x_j) = \sum_{x_k} \exp\{-\theta_k(x_k)\} \exp\{-\theta_{jk}(x_j, x_k)\} \prod_{l \neq j} m_{lk}(x_k)$ $\forall k, j, x_j$.
In general, the admissibility and consistency constraints characterize the fixed points of belief propagation. This has an elegant interpretation within the framework of information geometry [206].
5.3.2 The Bethe Free Energy
The Bethe free energy [128] differs from the MFT free energy by including pairwise pseudo marginals $b_{ij}(x_i, x_j)$:

$$F[b; \theta] = \sum_{ij} \sum_{x_i, x_j} b_{ij}(x_i, x_j)\, \theta_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i)\, \theta_i(x_i) + \sum_{ij} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log b_{ij}(x_i, x_j) - \sum_i (n_i - 1) \sum_{x_i} b_i(x_i) \log b_i(x_i). \tag{5.22}$$
But we must also impose consistency and normalization constraints by Lagrange multipliers $\{\lambda_{ij}(x_j)\}$ and $\{\gamma_i\}$, giving the additional terms

$$\sum_{i,j} \sum_{x_j} \lambda_{ij}(x_j) \Big\{ \sum_{x_i} b_{ij}(x_i, x_j) - b_j(x_j) \Big\} + \sum_{i,j} \sum_{x_i} \lambda_{ji}(x_i) \Big\{ \sum_{x_j} b_{ij}(x_i, x_j) - b_i(x_i) \Big\} + \sum_i \gamma_i \Big\{ \sum_{x_i} b_i(x_i) - 1 \Big\}. \tag{5.23}$$
It is straightforward to verify that the extrema of the Bethe free energy also obey the admissibility and consistency constraints. Hence the fixed points of belief propagation correspond to extrema of the Bethe free energy.

If the goal of belief propagation is to minimize the Bethe free energy, then why not use direct methods such as steepest descent or discrete iterative algorithms instead? One disadvantage is that these methods require working with pseudo marginals that have higher dimensions than the messages (contrast $b_{ij}(x_i, x_j)$ with $m_{ij}(x_j)$). Discrete iterative algorithms (DIA) have been proposed [543, 191] that are more stable than belief propagation and can reach lower values of the Bethe free energy. But these DIA must have an inner loop to deal with the consistency constraints, and hence take longer to converge than belief propagation. The difference between these direct algorithms and belief propagation can also be given an elegant geometric interpretation in terms of information geometry [206].
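For reference, the Bethe free energy (5.22) is easy to evaluate for given pseudo marginals; the sketch below does so directly. It is an illustration under an assumed dictionary-based representation, not code from the chapter.

```python
import numpy as np

def bethe_free_energy(b_i, b_ij, theta_i, theta_ij, edges):
    """Evaluate the Bethe free energy (5.22) for given pseudo marginals.

    b_i: list of unary pseudo marginals (1D arrays); b_ij: dict of pairwise
    pseudo marginals keyed by (i, j); theta_*: the corresponding potentials.
    """
    eps = 1e-12                                    # guard the logarithms
    degree = {i: 0 for i in range(len(b_i))}
    for (i, j) in edges:
        degree[i] += 1
        degree[j] += 1
    energy = sum((b_ij[e] * theta_ij[e]).sum() for e in edges)
    energy += sum((b_i[i] * theta_i[i]).sum() for i in range(len(b_i)))
    neg_entropy = sum((b_ij[e] * np.log(b_ij[e] + eps)).sum() for e in edges)
    neg_entropy -= sum((degree[i] - 1) * (b_i[i] * np.log(b_i[i] + eps)).sum()
                       for i in range(len(b_i)))
    return energy + neg_entropy
```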
5.3.3 Where Do the Messages Come From? The Dual Formulation
Where do the messages in belief propagation come from? At first glance, they do not appear directly in the Bethe free energy. But observe that the consistency constraints are imposed by Lagrange multipliers $\lambda_{ij}(x_j)$ that have the same dimensions as the messages.

We can think of the Bethe free energy as specifying a primal problem defined over primal variables b and dual variables $\lambda$. The goal is to minimize $F[b; \lambda]$ with respect to the primal variables and maximize it with respect to the dual variables. There is a corresponding dual problem that can be obtained by minimizing $F[b; \lambda]$ with respect to b to get solutions $b(\lambda)$ and substituting them back to obtain $\hat{F}_d[\lambda] = F[b(\lambda); \lambda]$. Extrema of the dual problem correspond to extrema of the primal problem (and vice versa).
It is straightforward to show that minimizing F with respect to the b's gives the equations

$$b_i^t(x_i) \propto \exp\Big\{ -\frac{1}{n_i - 1} \Big[ \sum_j \lambda_{ji}(x_i) + \theta_i(x_i) \Big] \Big\}, \tag{5.24}$$

$$b_{ij}^t(x_i, x_j) \propto \exp\big\{ -\theta_{ij}(x_i, x_j) - \lambda_{ij}^t(x_j) - \lambda_{ji}^t(x_i) \big\}. \tag{5.25}$$
Observe the similarity between these equations and those specified by belief propagation (see (5.17)). They become identical if we identify the messages with a function of the $\lambda$'s:

$$\lambda_{ji}(x_i) = -\sum_{k \in N(i) \setminus j} \log m_{ki}(x_i). \tag{5.26}$$
There are, however, two limitations of the Bethe free energy. First, it does not provide a bound on the partition function (unlike MFT), and so it is not possible to use bounding arguments to claim that Bethe is "better" than MFT (i.e., it is not guaranteed to give a tighter bound). Second, Bethe is nonconvex (except on trees), which has unfortunate consequences for the dual problem: the maximum of the dual is not guaranteed to correspond to the minimum of the primal. Both problems can be avoided by an alternative approach, described in chapter 6, which gives convex upper bounds on the partition function and specifies convergent (single-loop) algorithms.
5.4 Stochastic Inference
Stochastic sampling methods, Markov Chain Monte Carlo (MCMC), can also be applied to obtain samples from an MRF that can be used to estimate states. For example, Geman and Geman [161] used simulated annealing (MCMC with changing temperature) to perform inference on the weak membrane model. As I describe, stochastic sampling is closely related to MFT and BP. Indeed, both can be derived as deterministic approximations to MCMC.
5.4.1 MCMC
MCMC is a stochastic method for obtaining samples from a probability distribution P(x). It requires choosing a transition kernel $K(x \mid x')$ that obeys the fixed point condition $P(x) = \sum_{x'} K(x \mid x') P(x')$. In practice the kernel is usually chosen to satisfy the stronger detailed balance condition $P(x) K(x' \mid x) = K(x \mid x') P(x')$ (the fixed point condition is recovered by summing over $x'$). In addition, the kernel must satisfy $K(x \mid x') \geq 0$ and $\sum_x K(x \mid x') = 1\ \forall x'$, and for any pair of states $x, x'$ it must be possible to find a trajectory $\{x_i : i = 0, \ldots, N\}$ such that $x = x_0$, $x' = x_N$, and $K(x_{i+1} \mid x_i) > 0$ (i.e., there is a nonzero probability of moving between any two states by a finite number of transitions).
This defines a random sequence $x_0, x_1, \ldots, x_n$, where $x_0$ is specified and $x_{i+1}$ is sampled from $K(x_{i+1} \mid x_i)$. It can be shown that $x_n$ will tend to a sample from P(x) as $n \to \infty$, independent of the initial state $x_0$, and the convergence is exponential, at a rate depending on the magnitude of the second largest eigenvalue of $K(\cdot \mid \cdot)$. Unfortunately, this eigenvalue can almost never be calculated, and in practice, tests must be used to determine if the MCMC has converged to a sample from P(x) (see [315]).
I now introduce the two most popular types of transition kernels $K(x \mid x')$: the Gibbs sampler and the Metropolis-Hastings sampler. Both satisfy the detailed balance condition and are straightforward to sample from (i.e., they do not depend on quantities that are hard to compute, such as the normalization constant Z of P(x)). To specify these kernels compactly, I use the notation that r denotes a set of graph nodes with state $x_r$, and /r denotes the remaining graph nodes with state $x_{/r}$. For example, for the image labeling problem with MRF given by (5.1), r can label a point i on the image lattice, $x_r$ would be the label $x_i$ of that lattice point, and $x_{/r}$ would be the labels of all the other pixels, $x_{/r} = \{x_j : j \in V,\ j \neq i\}$. But it is important to realize that these kernels can be extended to cases where r represents a set of points on the image lattice, for example, two neighboring points i, j where $ij \in E$, in which case $x_r$ is $(x_i, x_j)$.
The Gibbs sampler is one of the most popular MCMC methods, partly because it is so simple. It has transition kernel $K(x \mid x') = \sum_r \rho(r) K_r(x \mid x')$, where $\rho(r)$ is a distribution on the lattice sites $r \in V$. The default choice for $\rho(\cdot)$ is the uniform distribution, but other choices may be better, depending on the specific application. The $K_r(x \mid x')$ are specified by

$$K_r(x \mid x') = P(x_r \mid x'_{N(r)})\; \delta_{x_{/r},\, x'_{/r}}, \tag{5.27}$$
where $\delta_{a,b}$ is the delta function (i.e., $\delta_{a,b} = 1$ for a = b and 0 otherwise). $P(x_r \mid x'_{N(r)})$ is the conditional distribution that, as illustrated below, takes a simple form for MRFs, which makes it easy to sample from. Each $K_r(\cdot \mid \cdot)$ satisfies the detailed balance condition, and hence so does $K(\cdot \mid \cdot)$ by linearity. Note that we require $\rho(r) > 0$ for all r; otherwise we will not be able to move between any pair of states $x, x'$ in a finite number of moves.
The Gibbs sampler proceeds by first picking a lattice site (or sites) at random from $\rho(\cdot)$ and then sampling the state $x_r$ of the site from the conditional distribution $P(x_r \mid x'_{N(r)})$. The conditional distribution will take a simple form for MRFs, and so sampling from it is usually straightforward. For example, consider the binary-valued case with $x_i \in \{0, 1\}$ and with potentials $\theta_{ij}(x_i, x_j) = \theta_{ij} x_i x_j$ and $\theta_i(x_i) = \theta_i x_i$. The Gibbs sampler samples $x_i^{t+1}$ from the distribution

$$P(x_i = 1 \mid x_{/i}) = \frac{1}{1 + \exp\big\{ \sum_j \theta_{ij} x_j + \theta_i \big\}}. \tag{5.28}$$
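A minimal sketch of this Gibbs sampler for the binary model just described follows. It is an illustration only; the symmetric coupling matrix W with zero diagonal is an assumed representation.

```python
import numpy as np

def gibbs_sample_binary(theta, W, sweeps=100, rng=None):
    """Gibbs sampling for the binary MRF of (5.28):
    E(x) = sum_i theta[i] x_i + sum_{i<j} W[i, j] x_i x_j,  x_i in {0, 1}.

    W is symmetric with zero diagonal. Each sweep visits every site once
    and resamples x_i from its conditional P(x_i | x_{/i}).
    """
    rng = rng or np.random.default_rng()
    n = len(theta)
    x = rng.integers(0, 2, size=n)
    for _ in range(sweeps):
        for i in rng.permutation(n):
            field = theta[i] + W[i].dot(x)        # excludes i since W[i, i] = 0
            p1 = 1.0 / (1.0 + np.exp(field))      # P(x_i = 1 | rest), as in (5.28)
            x[i] = 1 if rng.random() < p1 else 0
    return x
```

Replacing the random draw by the probability p1 itself recovers the mean field update of section 5.2.3, which is the relation stated in the next paragraph.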
In fact, the updates for Gibbs sampling are similar to the updates for MFT. A classic result, described in [13], shows that MFT can be obtained by taking the expectation of the update for the Gibbs sampler. Surprisingly, belief propagation can also be derived as the expectation of a more sophisticated variant of the Gibbs sampler that updates pairs of states simultaneously, where r denotes neighboring lattice sites i, j (for details, see [392]).
The Metropolis-Hastings sampler is the most general transition kernel that satisfies the detailed balance conditions. It has the form

$$K(x \mid x') = q(x \mid x') \min\Big\{ 1, \frac{p(x)\, q(x' \mid x)}{p(x')\, q(x \mid x')} \Big\}, \quad \text{for } x \neq x'. \tag{5.29}$$
Here $q(x \mid x')$ is a proposal probability (that depends on the application and usually takes a simple form). The sampler proceeds by selecting a possible transition $x' \to x$ from the proposal probability $q(x \mid x')$ and accepting this transition with probability $\min\big\{1, \frac{p(x)\, q(x' \mid x)}{p(x')\, q(x \mid x')}\big\}$. A key advantage of this approach is that it involves evaluating only the ratios of the probabilities P(x) and P(x'), which are typically simple quantities to compute (see the examples below).
In many cases the proposal probability $q(\cdot \mid \cdot)$ is selected to be a uniform distribution over a set of possible states. For example, for the first type of model we let the proposal probability choose at random a site i and a new state value $x'_i$ (from uniform distributions), which proposes a new state $x'$. We always accept this proposal if $E(x') \leq E(x)$, and we accept it with probability $\exp\{E(x) - E(x')\}$ if $E(x') > E(x)$. Hence each iteration of the algorithm usually decreases the energy, but there is also the possibility of going uphill in energy space, which means it can escape the local minima that can trap steepest descent methods. But it must be realized that an MCMC algorithm converges to samples from the distribution P(x) and not to a fixed state, unless we perform annealing by sampling from the distribution $\frac{1}{Z[T]} P(x)^{1/T}$ and letting T tend to zero. As discussed in section 5.2.4, annealing rates must be determined by trial and error because the theoretical bounds are too slow.
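A sketch of this single-site Metropolis sampler at a fixed temperature T follows. It is an illustration only: recomputing the full energy at every proposal is wasteful (in practice one evaluates only the local change), but it keeps the example short.

```python
import numpy as np

def metropolis_hastings(energy, x_init, n_labels, iters=10000, T=1.0, rng=None):
    """Single-site Metropolis sampler for a discrete MRF at temperature T.

    energy(x) evaluates E(x); the proposal picks a site and a new label
    uniformly at random, so q is symmetric and the acceptance probability
    reduces to exp{-(E(x') - E(x)) / T}.
    """
    rng = rng or np.random.default_rng()
    x = np.array(x_init)
    E = energy(x)
    for _ in range(iters):
        i = rng.integers(len(x))
        x_new = x.copy()
        x_new[i] = rng.integers(n_labels)
        E_new = energy(x_new)
        if E_new <= E or rng.random() < np.exp(-(E_new - E) / T):
            x, E = x_new, E_new          # accept; otherwise keep the current state
    return x
```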
In general, MCMC can be slow unless problem-specific knowledge is used. Gibbs sampling is popular because it is very simple and easy to program, but it can exploit only a limited amount of knowledge about the application being addressed. Most practical applications use Metropolis-Hastings with proposal probabilities that exploit knowledge of the problem. In computer vision, data-driven Markov chain Monte Carlo (DDMCMC) [486, 485] shows how effective proposal probabilities can be, but this requires sophisticated proposal probabilities and is beyond the scope of this chapter. For a detailed introduction to MCMC methods, see [315].
5.5 Discussion
This chapter described mean field theory and belief propagation techniques for performing inference of marginals on MRF models. It discussed how these methods could be formulated in terms of minimizing free energies, such as mean field free energies and the Bethe free energies. (See [540] for extensions to the Kikuchi free energy, and chapter 6 for discussion of convex free energies.) It described a range of algorithms that can be used to perform minimization, including steepest descent, discrete iterative algorithms, and message passing. It showed how belief propagation can be described as dynamics in the dual space of the primal problem specified by the Bethe free energy. It introduced a temperature parameter that enables inference methods to obtain MAP estimates and also motivates continuation methods, such as deterministic annealing. It briefly described stochastic MCMC methods, such as Gibbs and Metropolis-Hastings sampling, and showed that mean field algorithms and belief propagation can both be thought of as deterministic approximations to Gibbs sampling.
There have been many extensions to the basic methods described in this chapter. (Refer to [46] for an entry into the literature on structured mean field methods, expectation maximization, and the trade-offs between these approaches.) Other recent variants of mean field theory methods are described in [392]. Recently CCCP algorithms have been shown to be useful for learning structural SVMs with latent variables [541]. Work by Felzenszwalb and Huttenlocher [141] shows how belief propagation methods can be made extremely fast by taking advantage of properties of the potentials and the multiscale properties of many vision problems. Researchers in the UAI community have discovered ways to derive generalizations of BP starting from the perspective of efficient exact inference [91]. Convex free energies introduced by Wainwright et al. [506] have nicer theoretical properties than the Bethe free energy and have led to alternatives to BP, such as TRW and provably convergent algorithms (see chapter 6). Stochastic sampling techniques such as MCMC remain a very active area of research. (See [315] for an advanced introduction to techniques such as particle filtering that have had important applications to tracking [209].) The relationship between sampling techniques and deterministic methods is an interesting area of research, and there are successful algorithms that combine both aspects. For example, there are recent nonparametric approaches that combine particle filters with belief propagation to do inference on graphical models where the variables are continuous-valued [453, 208]. It is unclear, however, whether the deterministic methods described in this chapter can be extended to perform the types of inference that advanced techniques like data-driven MCMC can perform [486, 485].
Acknowledgments
I would like to thank George Papandreou, Xingyao Ye, and Xuming He for giving detailed feedback on this chapter. George also kindly drew the figures. This work was partially supported by the NSF with grant number 0613563.
6 Linear Programming and Variants of Belief Propagation
Yair Weiss, Chen Yanover, and Talya Meltzer
6.1 Introduction
The basic problem of energy minimization in an MRF comes up in many application domains, ranging from statistical physics [22] to error-correcting codes [137] and protein folding [537]. Linear programming (LP) relaxations are a standard method for approximating combinatorial optimization problems in computer science [34] and have been used for energy minimization problems for some time [23, 137, 233]. They have an advantage over other energy minimization schemes in that they come with an optimality guarantee: if the LP relaxation is tight (i.e., the solution to the linear program is integer), then it is guaranteed to give the global optimum of the energy.

Despite this advantage there have been very few applications of LP relaxations for solving MRF problems in vision. This can be traced to the computational complexity of LP solvers: the number of constraints and equations in LP relaxations of vision problems is simply too large. Instead, the typical algorithms used in MRF minimization for vision problems are based on either message passing (in particular belief propagation (BP) and the tree-reweighted version of belief propagation (TRW)) or graph cut (or variants of it).

Since 2005, however, an intriguing connection has emerged among message passing algorithms, graph cut algorithms, and LP relaxation. In this chapter we give a short, introductory treatment of this intriguing connection (focusing on message passing algorithms). Specifically, we show that BP and its variants can be used to solve LP relaxations that arise from vision problems, sometimes far more efficiently than using off-the-shelf LP software packages. Furthermore, we show that BP and its variants can give additional information that allows one to provably find the global minimum even when the LP relaxation is not tight.
6.1.1 Energy Minimization and Its Linear Programming Relaxation
The energy minimization problem and its LP relaxation were described in chapter 1, and we briefly define them again here in a slightly different notation (that will make the connection to BP more transparent).
Figure 6.1
(a) A single frame from a stereo pair. Finding the disparity is often done by minimizing an energy function. (b) The results of graph cut using a Potts model energy function. (c) The results of ordinary BP, using the same energy function.
For simplicity, we discuss only MRFs with pairwise cliques in this chapter, but all statements can be generalized to higher-order cliques [519].

We work with MRFs of the form

$$P(x) = \frac{1}{Z} \prod_i \exp(-\theta_i(x_i)) \prod_{<i,j>} \exp(-\theta_{ij}(x_i, x_j)), \tag{6.1}$$

where $<i, j>$ denotes the set of pairwise cliques that was denoted in earlier chapters by $(i, j) \in E$.
We wish to find the most probable configuration $x^*$ that maximizes P(x) or, equivalently, the one that minimizes the energy:

$$x^* = \arg\min_x \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j). \tag{6.2}$$
For concreteness, let's focus on the stereo vision problem (figure 6.1). Here $x_i$ will denote the disparity at a pixel i, $\theta_i(x_i)$ will be the local data term in the energy function, and $\theta_{ij}(x_i, x_j)$ is the pairwise smoothness term in the energy function. As shown in [71], for many widely used smoothness terms (e.g., the Potts model) exact minimization is NP-hard.

Figure 6.1b,c shows the results of graph cut and ordinary belief propagation on the Potts model energy function. In this display the lighter a pixel is, the farther it is calculated to be from the camera. Note that both graph cut and BP calculate the depth of the hat and shirt to have "holes": there are pixels inside the hat and the shirt whose disparities are calculated to be larger than the rest of the hat. Are these mistakes due to the energy function or the approximate minimization?
We can convert the minimization into an integer program by introducing binary indicator variables $q_i(x_i)$ for each pixel and $q_{ij}(x_i, x_j)$ for any pair of connected pixels. We can then rewrite the minimization problem as

$$\{q_{ij}^*, q_i^*\} = \arg\min \sum_i \sum_{x_i} q_i(x_i)\, \theta_i(x_i) + \sum_{<i,j>} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\, q_{ij}(x_i, x_j). \tag{6.3}$$
The minimization is done subject to the following constraints:

$$q_{ij}(x_i, x_j) \in \{0, 1\}, \qquad \sum_{x_i, x_j} q_{ij}(x_i, x_j) = 1, \qquad \sum_{x_i} q_{ij}(x_i, x_j) = q_j(x_j),$$

where the last equation enforces the consistency of the pairwise indicator variables with the singleton indicator variables.
This integer program is completely equivalent to the original MAP problem, and hence is computationally intractable. We can obtain the linear programming relaxation by allowing the indicator variables to take on noninteger values. This leads to the following problem:

The LP Relaxation of Pairwise Energy Minimization  Minimize

$$J(\{q\}) = \sum_{<i,j>} \sum_{x_i, x_j} q_{ij}(x_i, x_j)\, \theta_{ij}(x_i, x_j) + \sum_i \sum_{x_i} q_i(x_i)\, \theta_i(x_i) \tag{6.4}$$

subject to

$$q_{ij}(x_i, x_j) \in [0, 1] \tag{6.5}$$
$$\sum_{x_i, x_j} q_{ij}(x_i, x_j) = 1 \tag{6.6}$$
$$\sum_{x_i} q_{ij}(x_i, x_j) = q_j(x_j). \tag{6.7}$$
This is now a linear program (the cost and the constraints are linear). It can therefore be solved in polynomial time, and we have the following guarantee:

Observation  If the solutions $\{q_{ij}(x_i, x_j), q_i(x_i)\}$ to the MAP LP relaxation are all integer, that is, $q_{ij}(x_i, x_j), q_i(x_i) \in \{0, 1\}$, then $x_i^* = \arg\max_{x_i} q_i(x_i)$ is the MAP solution.
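To make the relaxation explicit, the sketch below assembles (6.4)-(6.7) for a tiny MRF and hands it to a generic LP solver. This is an illustration only: the variable ordering, the data structures, and the use of scipy's linprog are assumptions, and both marginalizations of each q_ij are tied to the unary variables so that the node normalizations are implied.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def lp_relaxation(theta_unary, theta_pair, edges, n_labels):
    """Build and solve the LP relaxation (6.4)-(6.7) for a small pairwise MRF.

    Variables: first the q_i(x_i), then q_ij(x_i, x_j) edge by edge.
    Only suitable for tiny problems; it is meant to make the LP explicit.
    """
    n = len(theta_unary)
    n_un = n * n_labels
    def idx_u(i, xi):
        return i * n_labels + xi
    def idx_p(e, xi, xj):
        return n_un + e * n_labels * n_labels + xi * n_labels + xj
    n_var = n_un + len(edges) * n_labels * n_labels
    c = np.zeros(n_var)
    for i in range(n):
        for xi in range(n_labels):
            c[idx_u(i, xi)] = theta_unary[i][xi]
    for e, (i, j) in enumerate(edges):
        for xi, xj in product(range(n_labels), repeat=2):
            c[idx_p(e, xi, xj)] = theta_pair[(i, j)][xi, xj]
    A_eq, b_eq = [], []
    for e, (i, j) in enumerate(edges):
        row = np.zeros(n_var)                       # normalization, eq. (6.6)
        for xi, xj in product(range(n_labels), repeat=2):
            row[idx_p(e, xi, xj)] = 1.0
        A_eq.append(row); b_eq.append(1.0)
        for xj in range(n_labels):                  # marginalization onto q_j, eq. (6.7)
            row = np.zeros(n_var)
            for xi in range(n_labels):
                row[idx_p(e, xi, xj)] = 1.0
            row[idx_u(j, xj)] = -1.0
            A_eq.append(row); b_eq.append(0.0)
        for xi in range(n_labels):                  # symmetric marginalization onto q_i
            row = np.zeros(n_var)
            for xj in range(n_labels):
                row[idx_p(e, xi, xj)] = 1.0
            row[idx_u(i, xi)] = -1.0
            A_eq.append(row); b_eq.append(0.0)
    res = linprog(c, A_eq=np.vstack(A_eq), b_eq=np.array(b_eq), bounds=(0.0, 1.0))
    return res.x[:n_un].reshape(n, n_labels)        # the relaxed q_i(x_i)
```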
6.1.2 The Need for Special-Purpose LP Solvers
Having converted the energy minimization problem to a linear program (LP), it may seem that all we need to do is use off-the-shelf LP solvers and apply them to computer vision problems. However, by relaxing the problem we have increased its size tremendously: there are many more variables in the LP than there are nodes in the original graph.

Formally, denote by $k_i$ the number of possible states of node i. The number of variables and constraints in the LP relaxation is given by

$$N_{\text{variables}} = \sum_i k_i + \sum_{<i,j>} k_i k_j, \qquad N_{\text{constraints}} = \sum_{<i,j>} (k_i + k_j + 1).$$

The additional $\sum_{<i,j>} 2 k_i k_j$ bound constraints, derived from (6.5), are usually not considered part of the constraint matrix.
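As a quick check of these counts, the snippet below evaluates them for the 200 x 200, thirty-disparity example discussed next, assuming a 4-connected grid (the grid connectivity is an assumption, not stated in the text).

```python
# Quick check of the counts above for a 200 x 200 image with 30 disparities,
# assuming a 4-connected grid.
h = w = 200
k = 30
nodes = h * w
edges = h * (w - 1) + (h - 1) * w          # horizontal + vertical neighbor pairs
n_variables = nodes * k + edges * k * k    # sum_i k_i + sum_<i,j> k_i k_j
n_constraints = edges * (2 * k + 1)        # sum_<i,j> (k_i + k_j + 1)
print(n_variables, n_constraints)          # roughly 72.8 million and 4.9 million
```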
Figure 6.2 shows the number of variables and constraints as a function of image size for a stereo problem with thirty disparities. If the image is a modest 200 x 200 pixels and each disparity can take on thirty discrete values, then the LP relaxation will have over seventy-two million variables and four million constraints. The vertical line shows the largest size image that could be solved using a powerful commercial LP solver (CPLEX 9.0) on a desktop machine with 4GB of memory in [535]. Obviously, we need a solver that can somehow take advantage of the problem structure in order to deal with such a large-scale problem.

Figure 6.2
The number of variables and constraints in a stereo problem with thirty disparities as a function of image size. Even modestly sized images have millions of variables and constraints. The largest image that could be solved with commercial LP software on a machine with 4GB of memory in [535] is approximately 50 x 50.
6.2 Ordinary Sum-Product Belief Propagation and Linear Programming
The sum-product belief propagation (BP) algorithm was introduced by Pearl [367] as a method for performing exact probabilistic calculations on singly connected MRFs. The algorithm receives as input a graph G and the functions $F_{ij}(x_i, x_j) = \exp(-\theta_{ij}(x_i, x_j))$, $F_i(x_i) = \exp(-\theta_i(x_i))$. At each iteration a node $x_i$ sends a message $m_{ij}(x_j)$ to $x_j$, its neighbor in the graph. The messages are updated as follows:

$$m_{ij}(x_j) \leftarrow \alpha_{ij} \sum_{x_i} F_{ij}(x_i, x_j)\, F_i(x_i) \prod_{k \in N_i \setminus j} m_{ki}(x_i), \tag{6.8}$$

where $N_i \setminus j$ refers to all neighbors of node $x_i$ except $x_j$. The constant $\alpha_{ij}$ is a normalization constant, typically chosen so that the messages sum to 1 (the normalization has no influence on the final beliefs). Given the messages, each node can form an estimate of its local belief, defined as
$$b_i(x_i) \propto F_i(x_i) \prod_{j \in N_i} m_{ji}(x_i), \tag{6.9}$$

and every pair of nodes can calculate its pairwise belief:

$$b_{ij}(x_i, x_j) \propto F_i(x_i)\, F_j(x_j)\, F_{ij}(x_i, x_j) \prod_{k \in N_i \setminus j} m_{ki}(x_i) \prod_{k \in N_j \setminus i} m_{kj}(x_j). \tag{6.10}$$
Pearl [367] showed that when the MRF graph is singly connected, the algorithm will converge and these pairwise beliefs and singleton beliefs will exactly equal the correct marginals of the MRF (i.e., $b_i(x_i) = P(x_i)$, $b_{ij}(x_i, x_j) = P(x_i, x_j)$). But when there are cycles in the graph, neither convergence nor correctness of the beliefs is guaranteed. Somewhat surprisingly, however, for any graph (with or without cycles) there is a simple relationship between the BP beliefs and the LP relaxation.
In order to show this relationship, we need to define the BP algorithm at temperature T. This is exactly the same algorithm defined in (6.8); the only difference is the definition of the local functions $F_{ij}(x_i, x_j)$, $F_i(x_i)$. The new definition depends both on the energy function parameters $\theta_{ij}(x_i, x_j)$, $\theta_i(x_i)$ and on a new parameter T that we call temperature:

$$F_{ij}(x_i, x_j) = \exp\Big(-\frac{1}{T}\, \theta_{ij}(x_i, x_j)\Big) \tag{6.11}$$
$$F_i(x_i) = \exp\Big(-\frac{1}{T}\, \theta_i(x_i)\Big). \tag{6.12}$$

Observation  For any MRF, as $T \to 0$ there exists a fixed point of ordinary BP at temperature T whose beliefs approach the LP solution.
This observation follows directly from the connection between BP and the Bethe free energy [539]. As explained in chapter 5, there is a 1:1 correspondence between the fixed points of BP at temperature T and stationary points of the following problem.

The Bethe Free Energy Minimization Problem  Minimize

$$G(\{b\}; T) = \sum_{<i,j>} \sum_{x_i, x_j} b_{ij}(x_i, x_j)\, \theta_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i)\, \theta_i(x_i) - T \Big[ \sum_{ij} H(b_{ij}) + \sum_i (1 - \deg_i) H(b_i) \Big], \tag{6.13}$$
subject to

$$b_{ij}(x_i, x_j) \in [0, 1] \tag{6.14}$$
$$\sum_{x_i, x_j} b_{ij}(x_i, x_j) = 1 \tag{6.15}$$
$$\sum_{x_i} b_{ij}(x_i, x_j) = b_j(x_j), \tag{6.16}$$

where $H(b_i)$ is the Shannon entropy of the belief, $H(b_i) = -\sum_{x_i} b_i(x_i) \ln b_i(x_i)$, and $\deg_i$ is the degree of node i in the graph.
Comparing the Bethe free energy minimization problem and the LP relaxation problem, we see that the constraints are the same, and the first term in the objective is also the same. The only difference is the existence of additional entropy terms in the Bethe free energy. But these terms are multiplied by T, so as $T \to 0$ the two problems coincide (recall that the Shannon entropy is bounded).

Figure 6.3 illustrates the convergence of the Bethe free energy to the LP relaxation. We consider a graphical model corresponding to a toroidal grid. The nodes are binary, and all the pairwise potentials are of the form

$$F = \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}.$$

These potentials correspond to an Ising model with a uniform external field: nodes prefer to be similar to their neighbors, and there is a preference for one state over the other. In order to visualize the approximate free energies, we consider beliefs that are symmetric and identical for all pairs of nodes:

$$b_{ij} = \begin{pmatrix} x & y \\ y & 1 - (x + 2y) \end{pmatrix}.$$
Figure 6.3
Contour plots of the Bethe free energy (top) and a convex free energy (bottom) for a 2D Ising model with uniform external field at different temperatures (T = 5, T = 2, T = 0.1). The stars indicate local stationary points. Both free energies approach the LP as temperature is decreased, but for the Bethe free energy a local minimum is present even for arbitrarily low temperatures.

Note that the MAP (and the optimum of the LP) occurs at x = 1, y = 0, in which case all nodes are in their preferred state. Figure 6.3 shows the Bethe free energy (top) for this problem at different temperatures. At high temperature the minimization problems are quite different, but as the temperature is decreased, the Bethe free energy is dominated by the linear term and becomes equivalent to the LP relaxation.

Note, however, that the convergence of the Bethe free energy problem to the LP relaxation does not guarantee that any BP fixed point will solve the LP relaxation as the temperature approaches zero. It only guarantees that there exists a good fixed point; there may be other fixed points as well. The stars in figure 6.3 indicate the local stationary points of the Bethe free energy. A bad local minimum exists for low temperatures at x = 0, y = 0. This corresponds to a solution where all nodes have the same state, but it is not the preferred state.
6.2.1 Convex BP and the LP Relaxation
In order to avoid local minima, we need a version of the Bethe free energy that has a unique
stationary point. The question of when the Bethe free energy has a unique stationary point
is surprisingly delicate (see, e.g., [190, 391, 412, 475]) and can depend nontrivially on the
graph and the energy function. Perhaps the simplest condition that guarantees a unique
stationary point is convexity. As illustrated in gure 6.4a, a 1D function is convex if its
second derivative is always positive, and this guarantees that it has a unique stationary
point. Convexity is a sufcient but not necessary condition for uniqueness of stationary
points. Figure 6.4b shows a 1D function that has a unique stationary point but is not
102 6 Linear Programming and Variants of Belief Propagation
Figure 6.4
Two 1D functions defined on the range [0, 1]. The function in (a) is the negative Shannon entropy and is convex. The function in (b) (an inverted Gaussian) has a unique stationary point but is not convex. In order to guarantee uniqueness of BP fixed points, we seek free energies that are convex.
convex. Nevertheless, the easiest way to guarantee uniqueness of BP fixed points is to
require convexity of the free energy.
In dimensions larger than 1, the definition of convexity simply requires positivity of the Hessian. This means that the convexity of the free energy does not depend on the terms in the energy function $\theta_{ij}(x_i, x_j)$, $\theta_i(x_i)$. These terms only change the linear term in the free energy and do not influence the Hessian. Thus the free energy will be convex if the sum of entropy terms is convex. This sum of entropies, $H_{\mathrm{Bethe}} = \sum_{ij} H(b_{ij}) + \sum_i (1-\deg_i) H(b_i)$,
is called the Bethe entropy approximation. The negative Bethe entropy can be shown to be
convex when the graph is a tree or has a single cycle. However, when the graph has multiple
cycles, as in the toroidal grid discussed earlier, the Bethe negative entropy is not convex
and hence BP can have many fixed points.
We can avoid this problem by "convexifying" the Bethe entropy. We consider a family of entropy approximations of the form
$\tilde{H} = \sum_{ij} c_{ij} H(b_{ij}) + \sum_i c_i H(b_i).$   (6.17)
Heskes [190] has shown that a sufficient condition for such an approximate entropy to be convex is that it can be rewritten as a positive combination of three types of terms: (1) pairwise entropies (e.g., $H(b_{ij})$), (2) singleton entropies (e.g., $H(b_i)$), and (3) conditional entropies (e.g., $H(b_{ij}) - H(b_i)$). Thus the Bethe entropy for a chain of three nodes will be convex because it can be written $H_{\mathrm{Bethe}} = H_{12} + H_{23} - H_2$, which is a positive combination of a pairwise entropy $H_{12}$ and a conditional entropy $H_{23} - H_2$. However, for the toroidal grid discussed above, the Bethe entropy cannot be written in such a fashion. To see this, note that a $3 \times 3$ toroidal grid has nine nodes with degree 4 and eighteen edges. This means that the Bethe entropy will have twenty-seven negative entropy terms (i.e., nine times we will subtract $3H_i$). However, the maximum number of negative terms we can create with conditional
entropies is eighteen (the number of edges), so we have more negative terms than we can
create with conditional entropies. In contrast, the entropy approximation $\sum_{\langle i,j \rangle} (H_{ij} - H_i)$
is convex, since it is the sum of eighteen conditional entropies.
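The counting argument above is easy to mechanize. The small helper below (an illustrative addition, with names of our own choosing) compares the number of negative singleton-entropy terms in the Bethe entropy, $\sum_i (\deg_i - 1)$, with the number of edges; since at most one negative singleton term can be absorbed per edge by a conditional entropy, a count that exceeds the number of edges rules out a positive combination.

def bethe_negative_term_check(degrees, num_edges):
    # Negative singleton-entropy terms in the Bethe entropy: sum_i (deg_i - 1).
    negatives = sum(d - 1 for d in degrees)
    return negatives, num_edges, negatives <= num_edges

# 3x3 toroidal grid: nine degree-4 nodes and eighteen edges.
print(bethe_negative_term_check([4] * 9, 18))   # (27, 18, False)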
Given an approximate entropy that satisfies the convexity conditions, we can replace the Bethe entropy with this new convex entropy and obtain a convex free energy. But how can we minimize it? It turns out that a slight modification of the BP update rules gives a new algorithm whose fixed points are stationary points of any approximate free energy. The algorithm defines an extra scalar variable for each node i, $\rho_i = 1/(c_i + \sum_{j \in N_i} c_{ij})$, and one for each edge, $\rho_{ij} = \rho_j c_{ij}$ (note that $\rho_{ij}$ may differ from $\rho_{ji}$). Using these extra scalars, the update
equations are
$m_{ij}(x_j) \propto \sum_{x_i} F_{ij}^{1/c_{ij}}(x_i, x_j)\, F_i^{\rho_i}(x_i) \prod_{k \in N_i \setminus j} m_{ki}^{\rho_{ki}}(x_i)\; m_{ji}^{\rho_{ji}-1}(x_i)$   (6.18)

$b_i(x_i) = F_i^{\rho_i}(x_i) \prod_{j \in N(i)} m_{ji}^{\rho_{ji}}(x_i)$   (6.19)

$b_{ij}(x_i, x_j) = F_{ij}(x_i, x_j)^{1/c_{ij}}\, F_i^{\rho_i}(x_i)\, F_j^{\rho_j}(x_j) \prod_{k \in N_i \setminus j} m_{ki}^{\rho_{ki}}(x_i)\, m_{ji}^{\rho_{ji}-1}(x_i) \prod_{k \in N_j \setminus i} m_{kj}^{\rho_{kj}}(x_j)\, m_{ij}^{\rho_{ij}-1}(x_j).$   (6.20)
Note that this algorithm is very similar to ordinary BP, so it requires very little modification of an existing implementation of BP. In particular, we can use algorithms for efficient calculations of BP messages for certain energy functions (e.g., [141]). Note that for the Bethe free energy, $c_{ij} = 1$ and $c_i = 1 - \deg_i$ (and thus $\rho_i, \rho_{ij} = 1$), so the above update equation reduces to ordinary BP. However, by choosing $c_{ij}, c_i$ so that the approximate free energy is convex, we can guarantee that this modified algorithm will have a single, unique fixed point at any temperature.
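As a minimal sketch (in Python, using this chapter's notation; the function and argument names are ours), one message update of rule (6.18) can be written as below. Setting $c_{ij} = 1$ and all $\rho$ exponents to 1 recovers the ordinary sum-product update, as stated above.

import numpy as np

def convex_bp_message(F_ij, F_i, rho_i, c_ij, incoming, m_ji, rho_ji):
    # One message m_{ij}(x_j) of the convexified sum-product rule (6.18).
    # F_ij: (Li, Lj) pairwise potential; F_i: (Li,) unary potential of node i;
    # incoming: list of (m_ki, rho_ki) for k in N_i \ j; m_ji, rho_ji: reverse message.
    prod = F_i ** rho_i
    for m_ki, rho_ki in incoming:
        prod = prod * m_ki ** rho_ki
    prod = prod * m_ji ** (rho_ji - 1.0)
    msg = (F_ij ** (1.0 / c_ij) * prod[:, None]).sum(axis=0)   # sum over x_i
    return msg / msg.sum()                                     # normalize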
Returning to the toroidal grid we discussed earlier, figure 6.3 (bottom row) shows the convexified free energy (with an entropy approximation of the form $18H_{12} - 18H_1$) for this problem for different temperatures. As was the case for the Bethe free energy, at high temperature the minimization problems are quite different, but as temperature is decreased, the free energy is dominated by the linear term and becomes equivalent to the LP relaxation. However, unlike the Bethe free energy, the convex free energy always has a unique minimum (indicated by the star), so the fixed point of the generalized BP algorithm is guaranteed to give the LP solution.
An important special case of a convex free energy is the class of tree-reweighted (TRW) free energies. In these free energies the entropy is approximated as a linear combination of entropies over trees, $\tilde{H} = \sum_{\tau} \rho_{\tau} H_{\tau}$. For this free energy the generalized BP algorithm reduces to the TRW algorithm (since $\rho_i = 1$ and $\rho_{ij} = \rho_{ji}$ in this case).
Figure 6.5
(a) The solution to the LP obtained by running convex BP. Using convex BP, we can solve the LP relaxation for full-sized images. Pixels for which the LP solution is fractional are shown in black. (b) A binary image indicating in white the pixels for which the LP solution is fractional.
To summarize, by choosing a convex free energy and running the generalized BP algorithm at low temperature, we can approximate the LP solution. Returning to the stereo problem depicted in figure 6.1, even though standard LP solvers fail on such a large problem, convex BP solved it in less than two hours. The results are shown in figure 6.5a,b. Figure 6.5a displays the disparity encoded by the LP solution. If the LP solution was indeed integer, we display the disparity for which the LP solution was nonzero. If the LP solution was fractional, that pixel is shown in black. In figure 6.5b we show a binary mask indicating which pixels had noninteger values in the LP solution. The existence of such pixels means that the LP relaxation is not tight.
6.3 Convex Max-Product BP
Although we have shown that one can use sum-product convex BP to solve the linear program, one needs to be able to run the sum-product algorithm at sufficiently low temperatures. There are two problems with this approach. First, running the algorithm at low temperatures requires defining $F_{ij} = \exp(-\theta_{ij}(x_i, x_j)/T)$, $F_i(x_i) = \exp(-\theta_i(x_i)/T)$ for low T. Note that as $T \to 0$ we are dividing by a number that approaches zero, and this can cause numerical problems. In the appendix to this chapter, we discuss how to implement the algorithm in log space (i.e., by working with the logarithms of the potentials and messages). This can greatly increase the numerical precision at low temperatures.
A second problem, however, is that it is not obvious how to choose the temperature so that it is sufficiently low. As evident in the discussion in the previous section, we need the temperature to be low enough so that the entropy contribution is negligible relative to the average energy. Thus the requirement of the temperature being sufficiently low
is problem dependent: as we change terms in the energy function, the scale of the average energy may change as well, requiring a different temperature.
In order to avoid choosing a sufficiently low temperature, we can work with the zero-temperature limit of the convex BP algorithm. This algorithm, called the max-product convex BP algorithm, is exactly the same as (6.18) but with the sum operator replaced with a max. It is easy to show that as the temperature T approaches zero, the update equations of the sum-product algorithm at temperature T approach those of the max-product algorithm at T = 1. Formally, if a set of messages forms a fixed point of sum-product at temperature T, then as $T \to 0$ these same messages, raised to the $1/T$ power, will form a fixed point of the max-product algorithm. This proof follows from the fact that the $\ell_p$ norm approaches the max norm as $p \to \infty$ [519, 184].
Despite this direct connection to the sum-product algorithm, the max-product algorithm is more difficult to analyze. In particular, even for a convex free energy approximation, the max-product algorithm may have multiple fixed points even though the sum-product algorithm has a unique fixed point. Thus one cannot guarantee that any fixed point of convex max-product BP will solve the linear programming relaxation.
An important distinction in analyzing the fixed points of max-product BP is the notion of ties. We say that a belief at a node has a tie if it does not have a unique maximizing value. Thus a belief of the form (0.7, 0.2, 0.1) has no ties and the belief (0.4, 0.4, 0.2) has a tie. Max-product fixed points without ties can easily be shown to correspond to a limit of the sum-product algorithm at zero temperature. This leads to the following result.
Claim If max-product convex BP converges to a fixed point without ties, then the assignment $x_i^{*} = \arg\max_{x_i} b_i(x_i)$ is the global minimum of the energy function.
This result is analogous to the claim on the LP relaxation. Only if the LP relaxation ends up being integral can we say that it corresponds to the global minimum of the energy. Unfortunately, in many vision problems neither is the LP all integer nor are the max-product beliefs all free of ties. A typical example is shown in figure 6.6a, where we have indicated in black the pixels for which ties exist (these same pixels are, not coincidentally, the pixels where the LP solution is nonintegral). In all the nontied pixels we have shown the disparity that maximizes the beliefs at the fixed point. It can be seen that a small number of pixels are black (see also the mask of black pixels in figure 6.6b) so that we cannot guarantee optimality of the solution. Yet the disparities at the nontied pixels seem reasonable. Under what conditions can we trust the values in the nontied pixels?
In recent years a number of results have been obtained that allow us still to prove partial optimality of an assignment obtained by maximizing the belief at a nontied node after running max-product convex BP. Partial optimality means that we can fix the values at the nontied nodes and optimize only over the remaining, tied, nodes. Under certain conditions this procedure can still be guaranteed to find the global optimum.
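Detecting ties is a one-line computation; the helper below (an illustration, not the authors' code) marks tied nodes up to a numerical tolerance and fixes the assignment at all others, which is the starting point for the partial-optimality tests listed below.

import numpy as np

def partial_assignment(beliefs, tol=1e-6):
    # beliefs: (N, L) array of max-product beliefs, one row per node.
    # A node is tied if its two largest belief values are (numerically) equal.
    top2 = np.sort(beliefs, axis=1)[:, -2:]
    tied = (top2[:, 1] - top2[:, 0]) < tol
    labels = np.argmax(beliefs, axis=1)
    labels[tied] = -1          # -1 marks an unresolved (tied) node
    return labels, tied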
Figure 6.6
(a) Results of using convex max-product BP on the stereo problem shown in figure 6.1a. Pixels for which there are ties in the belief are shown in black. (b) A binary image indicating which pixels had ties. Note that these are exactly the same pixels for which the LP solution had noninteger values (see figure 6.5). (c) The global optimum found by resolving the tied pixels and verifying that the conditions for optimality hold. Note that this solution is not much better than the local optima found before (figure 6.1). Both the hat and the shirt of the foreground person are calculated to have holes.
We list here some results on partial optimality and refer the reader to [519, 535] for more exact definitions and proofs.
• When each node has only two possible states, partial optimality holds.
• Consider the subgraph formed by looking only at the tied nodes. If this graph is a tree, then partial optimality holds.
• Consider the subgraph formed by looking only at the tied nodes. Define its boundary as those nodes in the subgraph that are connected to other nodes. If the beliefs at the boundary nodes are uniform, then partial optimality holds.
• Consider the subgraph formed by looking only at the tied nodes. Define a new energy function on this subgraph and find the assignment in the tied nodes that minimizes this energy function. If that assignment does not contradict the beliefs at the boundary of the subgraph, then partial optimality holds.
Note that verifying that partial optimality holds may require additional computation after running max-product convex BP. Yet in many vision problems we have found that this verification can be done efficiently, and this allows us to provably find the global optimum of the energy function. Code implementing these verification steps is available at [Link]/talyam/[Link]. Figure 6.6c shows the global optimum of the energy function for the image shown in figure 6.1. Although this is the global optimum for the energy function, it still suffers from mistakes. In particular, the calculated depth for the hat and the shirt still has holes. This indicates that a crucial part of stereo research is choosing a good energy function to minimize.
6.4 Discussion
Despite the NP-hardness of energy minimization in many computer vision problems, it is actually possible to find the global optimum of the energy in many instances. Theoretically, this could be done by relaxing the problem into a linear program. However, the large number of variables and constraints makes this linear program unsuitable for standard LP solvers. In this chapter we have reviewed how variants of belief propagation can be used to solve the LP relaxation. Furthermore, we have shown how the max-product convex BP algorithm can be used to find the global optimum even if the LP relaxation is not tight.
Though we have focused on the connection between BP and the LP relaxation, it can also be shown that the alpha expansion graph cut algorithm is intimately connected to the same LP relaxation. In particular, Komodakis and Tziritas [269] have shown that the alpha expansion algorithm can be seen as an iterative primal-dual algorithm (with integer primal solutions) for solving the LP relaxation (see chapter 17). Thus the graph cut algorithm and BP, which are often seen as competing algorithms, are actually closely related. One important conclusion from this relationship is that both algorithms are not expected to work well when the LP relaxation is loose. Indeed, despite the success recounted here in finding global optima for some energy functions in stereo vision, for other energy functions the number of tied pixels is far too large for the methods described here to be successful. Understanding the conditions under which energy minimization problems in computer vision have a tight LP relaxation is a promising direction for future research.
6.5 Appendix: Implementation Details
6.5.1 Implementation in Log Space
To be able to run the algorithm with a low temperature T, we use the log space. That is, we work directly with the costs/energies $\theta_i, \theta_{ij}$ instead of the potentials $F_i, F_{ij}$, and a set of messages $n_{ji}(x_i) = -T \log m_{ji}(x_i)$.
Yet when running sum-product, we need to sum over terms that are the exponents of the log-terms calculated, and then take the log again. Thus rewriting the generalized BP updates in log space gives
$n_{ji}(x_i) = -T \log \sum_{x_j} \exp\Big\{ -\Big[ \theta_j(x_j) + \frac{\theta_{ij}(x_i, x_j)}{\rho_{ij}} + \sum_{k \in N_j \setminus i} \rho_{jk}\, n_{kj}(x_j) - (1 - \rho_{ij})\, n_{ij}(x_j) \Big] \Big/ T \Big\}.$   (6.21)
For efficiency and numerical stability we use the equality
$\log(\exp(x) + \exp(y)) = x + \log(1 + \exp(y - x))$   (6.22)
for $x \ge y$. In particular, when x is much greater than y, we can ignore the second term and avoid exponentiating or taking the logarithm during the message update.
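In code, the whole inner operation of (6.21), $-T \log \sum_{x_j} \exp(-v_{x_j}/T)$ for a cost vector v, can be wrapped as below (a sketch; the function name is ours). It applies (6.22) in array form by factoring out the smallest cost, so only non-positive values are ever exponentiated.

import numpy as np

def neg_T_log_sum_exp(v, T):
    # Numerically stable -T * log( sum_j exp(-v_j / T) ) for a cost vector v.
    v_min = v.min()
    return v_min - T * np.log(np.exp(-(v - v_min) / T).sum())

Terms with $v_j$ much larger than $v_{\min}$ simply underflow to zero inside the sum, which is exactly the regime in which (6.22) says they can be ignored.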
6.5.2 Efficient Message Computation for Potts Model
Calculating the vector $m_{ji}(x_i)$ is actually performing a matrix-vector multiplication:
$m_{ji} = A_{ij}\, y_j$   (6.23)
where $A_{ij}$ is a matrix and $y_j$ is a vector. In the case of the generalized BP update, the matrix and vector are given by $A_{ij}(x_i, x_j) = F_{ij}^{1/\rho_{ij}}(x_i, x_j)$ and $y_j(x_j) = F_j(x_j) \prod_{k \in N_j \setminus i} m_{kj}^{\rho_{jk}}(x_j)\, m_{ij}^{\rho_{ij}-1}(x_j)$, respectively.
We consider the case where the pairwise potentials are of the Potts model, and thus $A_{ij} = (a_{ij} - b_{ij})\, I + b_{ij}$. We then obtain
$A_{ij}\, y_j = (a_{ij} - b_{ij})\, y_j + b_{ij} \sum_{x_j} y_j(x_j).$   (6.24)
Note that this way, we could compute the outgoing messages vector $m_{ji}$ in $O(|X_j| + |X_i|)$ complexity: one loop of $O(|X_j|)$ for computing the sum $S_j = \sum_{x_j} y_j(x_j)$, and another loop of $O(|X_i|)$ for computing the value
$m_{ji}(x_i) = (a_{ij} - b_{ij})\, y_j(x_i) + b_{ij}\, S_j$   (6.25)
for each assignment $x_i$.
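As a concrete illustration (not from the chapter; names are ours), the two-pass computation of (6.25) is:

import numpy as np

def potts_message(y_j, a_ij, b_ij):
    # m_ji = A_ij y_j for a Potts-structured matrix A_ij = (a_ij - b_ij) I + b_ij,
    # computed in O(|X_j| + |X_i|) as in (6.24)-(6.25); label sets are assumed equal.
    S_j = y_j.sum()                              # first pass, over x_j
    return (a_ij - b_ij) * y_j + b_ij * S_j      # second pass, over x_i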
Acknowledgments
This work was supported by the Israeli Science Foundation. We thank Danny Rosenberg
for his help in generating the figures.
II
Applications of MRFs, including Segmentation
Part II of the book shows that relatively simple, pairwise MRF models, as introduced in
chapter 1, paired with powerful inference engines (as introduced in part I of the book), are
capable of producing impressive results for a large number of applications. These solutions
sometimes even represent the state of the art in the field and are part of commercial products.
The first two chapters utilize a binary (two-label) MRF where each pixel is probabilisti-
cally dependent only on a few, here four or eight, neighboring pixels. The application is the
separation of foreground and background in still images, given additional hints from the
user. Chapter 7 introduces in detail the most simple realization of such an MRF model, and
then builds a more complex, higher-order MRF model. Chapter 8 revisits the low-connected,
pairwise MRF model of chapter 7 and casts it as a continuous-valued MRF. In this way a
continuum of models, including well-known models such as random walker and shortest
path, can be expressed in a unified framework. This makes it possible to study different
trade-offs of these models, such as sensitivity to metrication artifacts. Chapter 9 increases
the level of difficulty by performing foreground and background separation in a monocular
video stream in real time. The main visual cue is that the object and the background move
differently. This is achieved by introducing for each pixel a second-order Hidden Markov
model with respect to time (as explained in chapter 1), which is then paired with standard 2D
MRFs for each frame. Chapter 10 considers the important applications of superresolution
(up-scaling an image) and texture synthesis (creating a large image from a small exemplar).
The basic idea is to assemble a new image, from a large collection of exemplar patches,
in such a way that the synthesized image is spatially coherent. This problem is expressed
in the form of a 4-connected, pairwise MRF and is used as the basic building block in
several state-of-the-art approaches in this domain. Chapter 11 considers pairwise, multilabeled MRF models for stereo vision, panoramic stitching, image segmentation, inpainting,
and denoising. Its focus is on the comparison of existing inference techniques, introduced
in part I of the book, with respect to these applications.
There are many other application scenarios that are not covered in the book but have
been realized with low-connected, pairwise MRF or CRF models. Examples in low-level
vision are registration, matching and optical flow (examples in [482] and chapter 18), 3D reconstruction (examples in chapter 12), photo and video summarization or collages (e.g., [399]), image manipulation [368], and image and video retargeting [17]. MRFs also
play an essential role in large vision-based systems and for high-level vision tasks, such as
object recognition, which is the subject of part V.
7
Interactive Foreground Extraction: Using Graph Cut
Carsten Rother, Vladimir Kolmogorov, Yuri Boykov, and Andrew Blake
Interactive image segmentation has received considerable attention in the computer vision community in the last decade. Today this topic is mature, and commercial products exist that feature advanced research solutions. This means that interactive image segmentation is probably one of the most widespread computer vision technologies. In this chapter we review one class of interactive segmentation techniques that use discrete optimization and a regional selection interface. We begin the chapter by explaining the seminal work of Boykov and Jolly [66]. After that, the GrabCut technique [401] is introduced, which extends [66] by additionally estimating the appearance model. The joint estimation of segmentation and appearance model parameters can significantly simplify accurate object extraction. GrabCut is the underlying algorithm for the Background Removal tool in Microsoft Office 2010 products. In the third section of the chapter many interesting features and details are explained that are part of the product. In this process several recent research articles are reviewed. Finally, the Background Removal tool, as well as [66, 401], are evaluated in different ways on publicly available databases. This includes static and dynamic user inputs. (An extended version of this chapter is available at [402].)^1
7.1 Introduction
This chapter addresses the problem of extracting an object in an image, given additional
hints from the user. This is different from the long-standing research topic of automatically
partitioning an image into the objects present in the scene (e.g., [434]). First, user interaction
1. A historical note. The Background Removal tool was fully developed at the end of 2004 by Carsten Rother
and Vladimir Kolmogorov. It was part of an external release of Microsoft Expression Acrylic Graphics Designer
(technology preview) in June 2005 (called "smart select"). This included engineering solutions to many practi-
cally interesting problems (see section 7.4) that were not addressed in [401]. For some problems our solutions
are in fact very similar to recent work [323, 321]. Some of these practical problems motivated our recent arti-
cles on the following topics: initialization and optimality [499], connectivity [497], bounding-box prior [298],
and segmentation-based matting [390]. To fine-tune the Background Removal tool we employed the robot user
(see section 7.5.2), which was also used in [178] and motivated our very recent work on learning interactive
segmentation systems [350].
is needed to specify the object of interest. Second, quite often the user wants to select only
part of an object (e.g., head of a person, or an arbitrary region of interest). The intrinsically
interactive nature of this problem makes it both very attractive and challenging. Hence, it
has been a fruitful research topic for more than two decades, with some work concentrating
more on theoretical aspects, such as model and optimization, and other work more on user
aspects.
The question of what is the best interactive segmentation system today is hard to answer.
Many factors have to be considered: (1) What is the user group (e.g., novice or advanced
users)? (2) What is the user interface? (3) How should the user involvement (such as total
interaction time or number of user hints) be measured?
Approaches for interactive image segmentation have influenced many related tasks. One
example is the problem of joint object recognition and segmentation, as in the TextonBoost
framework [437] or the OBJCut system [281]. Another example is the Web-based retrieval
system for classical vases [286], which automatically runs segmentation technology in the
background.
Note that the focus of this chapter is on the binary segmentation problem. Each pixel belongs to either foreground or background. This is a simplified view of the problem, since some pixels, especially those close to the object boundary, are semitransparent: a mix of foreground and background colors. A brief discussion of this issue in the context of a practical segmentation system is given in [402].
The chapter is organized as follows. After a brief literature review in section 7.1.1,
three systems are presented, in order of increased model complexity: the Boykov and Jolly
approach (section 7.2), the GrabCut system (section 7.3), and the commercial Background
Removal tool (section 7.4). In section 7.5 the methods are compared with other state-of-
the-art techniques in two experiments.
7.1.1 Interactive Image Segmentation: A Brief Review
In the following we categorize different approaches to interactive image segmentation by
their methodology and user interfaces. Our brief review is not meant to be comprehensive.
Magic Wand This is probably the simplest technique. Given a user-specified seed point
(or region), a set of pixels is computed that is connected to the seed point, where all pixels
in the set deviate from the color of the seed point only within a given tolerance. Figure 7.1a
shows the result using Magic Wand in Adobe Photoshop 7 [3]. Because the distributions in
color space of foreground and background pixels have a considerable overlap, a satisfactory
segmentation cannot be achieved.
Intelligent Scissors This approach (a.k.a. Live Wire or Magnetic Lasso) [343] allows a
user to choose a minimum cost contour by roughly tracing the object's boundary with the
mouse. As the mouse moves, the minimum cost path from the cursor position back to the
Figure 7.1
Foreground extraction with four different systems: (a) Magic Wand, (b) Intelligent Scissors, (c) graph cut [66], and (d) GrabCut [401], with segmentation result in the bottom row and user interaction in the top row (image colors were changed for better visualization; see original color image in figure 1.7(a) in chapter 1). While the results (b)-(d) are all visually acceptable, GrabCut needs fewest user interactions (two clicks). Note that result (d) is the final result of GrabCut, including semitransparency using border matting.
last seed point is shown. If the computed path deviates from the desired one, additional
user-specified seed points are necessary. In figure 7.1b the Magnetic Lasso of Photoshop 7
was used in which a large number of seed points (here nineteen) were needed, since both
foreground and background are highly textured. One problem is that this technique is not
effective for objects with a long boundary (e.g., a tree with many branches).
Segmentation in the Discrete Domain Boykov and Jolly were the first to formulate a simple generative MRF model in the discrete domain for the task of binary image segmentation [66]. This basic model can be used for interactive segmentation. Given some user constraints in the form of foreground and background brushes (i.e., regional constraints), the optimal solution is computed very efficiently with graph cut (see an example in figure 7.1c). The main benefits of this approach are global optimality, practical efficiency, numerical robustness, ability to fuse a wide range of visual cues and constraints, unrestricted topological properties of segments, and applicability to N-D problems. For these reasons this approach inspired many other methods for various applications in computer vision (see, e.g., chapter 9, on bilayer segmentation in video). It also inspired the GrabCut system [401, 52], which is the main focus of this chapter. GrabCut solves a more challenging problem, namely, the joint optimization of segmentation and estimation of global properties of the segments. The benefit is a simpler user interface in the form of a bounding box (see the example in figure 1.7d). Note that such joint optimization has been done in other contexts. An example
is depth estimation in stereo images [45] where the optimal partitioning of the stereo images
and the global properties (affine warping) of each segment are optimized jointly.
Since the work of Boykov and Jolly, many articles on interactive segmentation using
graph cut and a brush interface have been published; a few are [312, 175, 129, 499, 19,
440, 321, 350, 178], which we will discuss in more detail later. We would like to refer to
chapter 8, where the discrete labeling problem is relaxed to a continuous one, which gives
a common framework for explaining and comparing three popular approaches: random
walker [175], graph cut [66], and geodesic distance [19]. Another interesting set of discrete
functionals is based on ratio (e.g., area over boundary length; see [105, 217, 257]).
Segmentation in the Continuous Domain There are very close connections between the
spatially discrete MRFs, as mentioned above, and variational formulations in the continu-
ous domain. The first continuous formulations were expressed in terms of snakes [227] and geodesic active contours [84], related to the well-known Mumford-Shah functional [344]. The goal is to find a segmentation that minimizes a boundary (surface) under some metric, typically an image-based Riemannian metric. Traditionally, techniques such as level sets were used; however, they are only guaranteed to find a local optimum. Recently many of these functionals were reformulated using convex relaxation (i.e., the solution lives in the [0, 1] domain), which allows achievement of global optimality and bounds in some practical cases (see chapter 12). An example of interactive segmentation with a brush interface is [489], where the optimal solution of a weighted TV norm is computed efficiently. Instead
of using convex relaxation techniques, the continuous problem can be approximated on a
discrete grid and solved globally optimally, using graph cut. This can be done for a large set
of useful metrics (see [67]). Theoretically, the discrete approach is inferior because the
connectivity of the graph has to be large in order to avoid metrication artifacts. In practice,
however, artifacts are rarely visible when using a geodesic distance (see, e.g., figure 7.1d) with an underlying 8-connected graph. In section 7.3.1 we will show another relationship between the continuous Chan-Vese functional [88] and the discrete GrabCut functional.
Paint Selection Conceptually the brush interface and the so-called "paint selection" interface [321] are very similar. The key difference is that a new segmentation is visualized after
each mouse movement (i.e., instant feedback while drawing a stroke). Section 7.4 gives a
more detailed comparison with the Background Removal tool.
7.2 Basic Graph Cut Model for Image Segmentation
Boykov and Jolly [66] addressed the problem of interactive image segmentation based on
an initial trimap $T = \{T_F, T_B, T_U\}$. The trimap partitions the image into three sets: $T_F$ and $T_B$ comprise pixels selected by the user as either foreground or background, respectively, and $T_U$ is the remaining set of unknown pixels. The image is an array $z = (z_1, \ldots, z_n, \ldots, z_N)$
of intensities (gray, color, or any other n-dimensional values), indexed by the integer n. The unknown segmentation of the image is expressed as an array of opacity variables $x = (x_1, \ldots, x_N)$, one at each pixel. In general, $0 \le x_n \le 1$ (e.g., in $\alpha$-matting), but [66] uses discrete-valued (hard) segmentation variables $x_n \in \{0, 1\}$, with 0 for background and 1 for foreground. The parameter $\theta$ describes the distributions for foreground and background intensities. The basic approach in [66] assumes that such distributions (intensity models or histograms $\theta = \{h_B(z), h_F(z)\}$ for background and foreground) are either known a priori or are assembled directly from labeled pixels in the respective trimap regions $T_B, T_F$. Histograms are normalized to sum to 1 over the range of intensities, i.e., $\sum_z h_F(z) = 1$. This means that the histograms represent the observation likelihood, that is, $P(z_n \mid x_n = 0) = h_B(z_n)$ and $P(z_n \mid x_n = 1) = h_F(z_n)$.
The segmentation task addressed in [66] is to infer the unknown opacity variables x from the given model and image data z. For this, an energy function E is defined so that its minimum should correspond to a good segmentation, in the sense that it is guided by both the given foreground and background intensity histograms and that the opacity is coherent, reflecting a tendency to solidity of objects. This is captured by a Gibbs energy of the form
$E(x, \theta, z) = U(x, \theta, z) + V(x, z).$   (7.1)
The data term U evaluates the fit of the segmentation x to the data z, given the model $\theta$, and is defined for all pixels in $T_U$ as
$U(x, \theta, z) = \sum_{n \in T_U} \big( -\log h_B(z_n)\,[x_n = 0] - \log h_F(z_n)\,[x_n = 1] \big) + \sum_{n \in T_F \cup T_B} H(x_n, n)$   (7.2)
where $[\phi]$ denotes the indicator function taking values 0, 1 for a predicate $\phi$, and the term $H(x_n, n)$ constrains certain variables to belong to foreground or background, respectively, that is, $H(x_n, n) = \gamma\,([x_n = 0][n \in T_F] + [x_n = 1][n \in T_B])$, where $\gamma$ is a large enough constant. The smoothness term can be written as
$V(x, z) = \sum_{(m,n) \in N} \mathrm{dis}(m, n)^{-1}\,\big( \lambda_1 + \lambda_2 \exp\{-\beta \|z_m - z_n\|^2\} \big)\,[x_n \ne x_m],$   (7.3)
where N is the set of pairs of neighboring pixels and $\mathrm{dis}(\cdot)$ is the Euclidean distance of neighboring pixels. This energy encourages coherence in regions of similar intensity level. In practice, good results are obtained by defining pixels to be neighbors if they are adjacent either horizontally/vertically or diagonally (8-way connectivity). The factor $\mathrm{dis}(\cdot)$ and larger neighborhoods help the smoothness term $V(x, z)$ to better approximate a geometric length of the segmentation boundary according to some continuous metric (see [67]). This reduces geometric artifacts. When the constant $\lambda_2$ is set to 0, the smoothness term is simply the well-known Ising prior, encouraging smoothness everywhere. Practically, however, as shown
in [66], it is far more effective to set $\lambda_2 > 0$, as this relaxes the tendency to smoothness in regions of high contrast. The constant $\beta$ is chosen to be $\beta = (2\langle (z_m - z_n)^2 \rangle)^{-1}$, where $\langle \cdot \rangle$ denotes expectation over an image sample. This choice of $\beta$ ensures that the exponential term in (7.3) switches appropriately between high and low contrast (see [53]). The constants $\lambda_1$ and $\lambda_2$ should be learned from a large corpus of training data. Various learning approaches have been suggested in the past, ranging from simple cross validation [53] over max-margin learning [466] and, very recently, parameter estimation in an interactive setting [350]. In most of our experiments the values were fixed to the reasonable choice of $\lambda_1 = 5$ and $\lambda_2 = 50$.
Now that energy (7.1) is fully defined, the Boykov-Jolly [66] model for binary segmentation can be formulated as an estimation of a global minimum
$\hat{x} = \arg\min_x E(x, \theta, z).$   (7.4)
Exact global minima can be found using a standard minimum-cut/maximum-flow algorithm [68] (see chapter 2). Since the desired results are often not achieved with the initial trimap, additional user interactions are necessary. The maximum-flow computation for these additional interactions can be made very efficient by reusing flow from the previous computation (see details in [66]).
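The sketch below (an illustration, not the code of [66]; the variable names are ours) shows how the terms of energy (7.1) translate into the arrays that a min-cut solver from chapter 2 consumes: per-pixel costs from the data term (7.2) and contrast-sensitive edge weights from the smoothness term (7.3) on a 4-connected grid. The min-cut step itself is delegated to any standard max-flow implementation.

import numpy as np

def boykov_jolly_capacities(z, h_fg, h_bg, lam1=5.0, lam2=50.0, eps=1e-8):
    # z: HxW intensity image; h_fg, h_bg: HxW per-pixel likelihoods h_F(z_n), h_B(z_n).
    dv = z[1:, :] - z[:-1, :]                  # vertical neighbor differences
    dh = z[:, 1:] - z[:, :-1]                  # horizontal neighbor differences
    # beta = (2 <(z_m - z_n)^2>)^{-1}, expectation over neighboring pairs
    beta = 1.0 / (2.0 * np.mean(np.concatenate([dv.ravel(), dh.ravel()]) ** 2) + eps)
    # data term (7.2): cost of labeling pixel n background / foreground
    cost_bg = -np.log(h_bg + eps)
    cost_fg = -np.log(h_fg + eps)
    # smoothness term (7.3): contrast-sensitive Ising weight for each grid edge
    w_v = lam1 + lam2 * np.exp(-beta * dv ** 2)
    w_h = lam1 + lam2 * np.exp(-beta * dh ** 2)
    return cost_fg, cost_bg, w_v, w_h

Hard constraints for pixels in $T_F$ and $T_B$ would be added on top by setting the corresponding cost to a very large constant, as in (7.2); for 8-connectivity the diagonal edge weights are additionally scaled by $\mathrm{dis}(m,n)^{-1} = 1/\sqrt{2}$.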
7.3 GrabCut+: Image Segmentation Using Iterative Graph Cut
The following description of GrabCut contains additional details, and a few minor modifications, compared with the original version [401, 52]; hence the name GrabCut+ is used. The algorithm described in section 7.2 often gives good results in practice, as shown in figure 7.1c; however, it relies on the user to define the color distributions for foreground and background. One problem is that it does not exploit the information given by the unlabeled data to help learn the unknown parameter $\theta$. In the following we describe one approach that makes use of the unlabeled data. The simple idea is to find the optimal settings for the model parameter $\theta$ and segmentation x jointly. This is done as before, by minimizing the functional in (7.1) subject to the given user constraints. Note that by optimizing $\theta$ we implicitly assume that both foreground and background are represented well by compact distributions. The implications of this assumption are discussed later in detail. By exploiting the unlabeled data we are able to achieve good results with fewer user inputs compared with the approach in section 7.2. In particular we show that it is sometimes sufficient simply to specify the object with a bounding box, so that the set $T_F$ is empty (see the example in figure 1.7d). Unfortunately, optimizing this energy with respect to both unknowns, $\theta$ and x, is a very challenging problem. In fact it is NP-hard [499], as discussed later. Hence, in this section and the following one, questions concerning different optimization procedures, optimality, and initialization are addressed.
7.3.1 The GrabCut Model
The first modification of the basic segmentation model, as described in section 7.2, is done by switching from an explicit representation of intensity distributions via histograms to a parametric representation via Gaussian Mixture Models (GMMs) [401, 53, 409]. This more compact form of representation is particularly helpful in the case of RGB colors (n-dimensional intensities).
Foreground and background are modeled separately, each with K full-covariance Gaussians (here K = 7). In order to deal with the GMM tractably in the optimization framework, an additional vector $k = \{k_1, \ldots, k_n, \ldots, k_N\}$ is introduced with $k_n \in \{1, \ldots, K\}$, assigning each pixel a unique GMM component, one component from either the foreground or the background model, according to $x_n = 1$ or 0.^2 This means that the unknown parameter $\theta$ comprises the variables
$\theta = \{k, \pi_F(k), \mu_F(k), \Sigma_F(k), \pi_B(k), \mu_B(k), \Sigma_B(k),\; k = 1 \ldots K\},$
with $\pi$ as mixture-weighting coefficients that sum to 1, and $\mu(k), \Sigma(k)$ as mean and covariance matrix for each Gaussian k.^3 It is important to note that fitting a GMM model is, strictly speaking, an ill-posed problem since an infinite likelihood (energy of minus infinity) can be obtained when the variance is 0 (see [46], section 9.2.1). Hence, the covariance matrix is restricted to have a minimum variance of $1/255^2$.
Using the same energy (7.1), the GrabCut model is defined as the joint optimization (estimation) for segmentation x and parameters $\theta$:
$\hat{x} = \arg\min_x \min_{\theta} E(x, \theta, z).$   (7.5)
The key difference between the Boykov-Jolly model (7.4) and the GrabCut model (7.5) is that in (7.5) the minimization is also done with respect to $\theta$. It is worth noting that the GrabCut model and the functional of Chan-Vese [88, 299] in the continuous domain, related to the Mumford-Shah functional [344], share some properties. In both models the key problem is the joint optimization of segmentation and global properties of the segmented regions.
7.3.2 The Optimization Procedure
The pseudo code for GrabCut+ is given in figure 7.2. The user starts by defining the initial trimap, using either a bounding box or a lasso interface. This means that $T_B$ is outside the marked region, $T_U$ is inside the marked region, and $T_F$ is an empty set. As suggested
2. Using soft assignments of probabilities for each component to a given pixel would give a significant additional computational expense for a negligible practical benefit.
3. An efficient variant for using GMMs in a segmentation framework has been suggested in [437]. A different GMM model with 2K Gaussians is fitted first to the whole image. This gives a fixed assignment vector k, which is not updated during the optimization of $\theta$ and x.
Figure 7.2
The pseudo code for GrabCut+ with bounding box or lasso input.
in [401], results improve if $T_B$ comprises only pixels that are inside a strip around the outside of the marked region.^4 The intuition is that the relevant background training data are often close to the object. In fact, pixels outside this strip are ignored throughout the whole optimization procedure. The trimap T uniquely defines the segmentation x, which is used to initialize the unknown color model $\theta$. Though this initialization step was not discussed in [401], it is quite important. One choice is a random initialization for k, but the following "smart" initialization works better. Consider the set of background pixels, $x_n = 0$. The first principal axis is computed from the image data $z_n$ of this set. Then the data are projected onto this axis and sorted accordingly. Finally, the sorted set is divided into K groups that for each pixel n define the assignment variable $k_n$. Given k, a standard EM-style procedure for GMM fitting can be invoked. This means that in the M step the Gaussians (i.e., $\pi, \mu, \Sigma$) are fitted in a standard way; see [401] for details. In the E step k is optimized by enumerating all components $k \in \{1, \ldots, K\}$ and choosing the one with lowest energy. In practice these two steps are executed four times. The foreground pixels are processed similarly.
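A sketch of this "smart" initialization (illustrative only, not the product code; array and function names are ours) is:

import numpy as np

def smart_init_assignments(pixels, K=7):
    # pixels: (N, 3) array of colors z_n from one side (background or foreground).
    centered = pixels - pixels.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    proj = centered @ eigvecs[:, -1]           # project onto the first principal axis
    order = np.argsort(proj)                   # sort along that axis
    k = np.empty(len(pixels), dtype=int)
    for g, chunk in enumerate(np.array_split(order, K)):
        k[chunk] = g                           # K groups of (roughly) equal size
    return k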
The main procedure alternates the following two steps: (1) given $\theta$, the segmentation x is inferred with graph cut as in section 7.2; (2) given the segmentation x, the unknown model $\theta$ is inferred using the above EM-style procedure for GMM fitting. In each step the total energy E is guaranteed not to increase. The method can be run until a local minimum is found, but in practice we simply stop it after five iterations. Figure 7.3 shows the power of running the iterative GrabCut+ procedure. Finally, the user can update the trimap and the main procedure is run again. As in the previous section, reusing of flow gives a speedup.
Note that a small modification of this procedure lets us apply GrabCut+ for a standard brush interface [66], as done in the experiments. For that, step 1 in figure 7.2 is initialized as $x_n = 0$ for $n \in T_B$, as $x_n = 1$ for $n \in T_F$, and as $x_n = ?$ for $n \in T_U$, where the label ?
4. The width of the strip is chosen as a small fraction of the bounding box dimensions. In the experiments, section
7.5.1, the width is set to ten pixels, as in [298].
Figure 7.3
(a) The energy of the GrabCut+ model decreases over twelve iterations. (b) Initial result after the first iteration of the GrabCut+ algorithm, where initialization is as in figure 7.1d. (c) The GMMs, here K = 5, for foreground (bright/blue) and background (dark/red) overlap considerably (visualized RG-slice). (d) and (e) The final result
for segmentation and GMMs.
means "unlabeled" and those pixels are ignored in step 2 of the algorithm. Other interesting
user inputs, such as no user intervention (as in [286]), are discussed in [402].
7.3.3 Properties of the GrabCut Model
In the GrabCut model there is the freedom to choose appropriate foreground and background color distributions $\theta$. It is straightforward to see that distributions that are more compact and model the data well give a lower energy.^5 This means that it is implicitly assumed that both foreground and background segments are more likely to be represented by a compact distribution in color space.
One important question is whether the GrabCut model has an implicit bias toward certain segmentations. This was analyzed in Vicente et al. [499] by disregarding the smoothing term V in (7.3). They first showed that by using a histogram representation in RGB space, it is possible to write the term $\min_{\theta} E(x, \theta, z)$ in (7.5) explicitly in the form of a new energy $E'(x, z)$ with higher-order cliques on x. This higher-order energy $E'$ has two types of terms: (1) a convex function over $\sum_n x_n$, and (2) a concave function, for each histogram bin k, over $\sum_n x_n [n \in \mathrm{Bin}(k)]$, that is, over all pixels that are assigned to bin k. The convex function (first type) has lowest energy if exactly half of the pixels in the image are assigned to foreground and half to background, respectively. Hence, this is a bias toward balanced segmentations. The concave function (second type) has lowest energy if all pixels that are assigned to the same bin have the same label, either 0 or 1. Note that these two types of terms often counterbalance each other in practice, so that the optimal segmentation is not a degenerate solution (one where the undefined pixels are all foreground or all background). In an extreme case the bias toward balanced segmentation is more prominent. This is when all pixels are assigned to unique histogram bins. Then all concave terms are constants, so that the energy consists of the convex part only. At the other extreme, however, when all
5. For example, assume all foreground pixels have the same color; then the lowest unary term is achieved by
modeling the foreground distribution with one (or many identical) Gaussian, with minimum variance.
pixels are assigned to the same histogram bin, the bias disappears, since then concave and
convex terms cancel each other out. An interesting observation was made in [499] that
results improve considerably when choosing the weight of this bias individually for each
image.
Unfortunately, optimizing the higher-order energy $E'$ with respect to x is an NP-hard problem [499]. An optimization procedure for $E'$ was suggested in [499], based on dual decomposition ([35] and chapter 22), which also provides a lower bound for the energy. This procedure achieved global optimality in 61% of test cases for the GrabCut database of forty-nine images [203] and a bounding-box input. However, for the remaining 39% of test cases, the dual decomposition approach performed rather poorly. Many of those test cases were "camouflage" images, where foreground and background colors overlap considerably (example in [499], figure 2). In comparison, the iterative GrabCut+ procedure (figure 7.2) performs quite well for all types of images and, hence, achieves a lower total error rate (8.1%) than the dual decomposition approach (10.5%); more details are in [402, 499].
7.4 Background Removal: Image Segmentation in MS Office 2010
Designing a product based on GrabCut+ (section 7.3) meant that many interesting practical
and theoretical issues had to be addressed. This section discusses some of the topics on
which the Background Removal tool and the GrabCut+ tool differ. In [402] further details
are given, including a discussion on how to deal with large images and semitransparency.
We begin by revisiting the optimization procedure and then examine additional model
constraints.
7.4.1 Initialization and the Bounding-Box Prior
It turns out that choosing the initial segmentation x, based on the user trimap T (i.e., step 1 in figure 7.2), is crucial for performance. In the following, three initialization schemes are compared for the bounding-box input. All approaches have in common that $x_n = 0$ for $n \in T_B$; they differ, though, in the treatment of pixels in $T_U$. The first approach, called InitFullBox, is described in section 7.3.2, and sets $x_n = 1$ for all $n \in T_U$. The second method, called InitThirds, was suggested in [298]. First, a background color model is trained from pixels in $T_B$. Then the probabilities of all pixels in $T_U$ are evaluated under the background GMM. One third of the pixels with lowest probability are set to foreground, and one third with highest probability are set to background.^6 The remaining pixels are set to ?, and are ignored in step 2 of figure 7.2. The last approach, called InitParametric, is implemented in the Background Removal tool and is similar to the InitThirds procedure but in addition considers the smoothing term V of the energy. For this a new parametric energy is introduced, $E_{\lambda}(x, \theta, z) = E(x, \theta, z) + \lambda \sum_n x_n$, with E as defined in section 7.3.1. Here $\theta$ is chosen
6. The choice of using a third as the threshold is arbitrary and should be learned from data.
such that the background GMM is trained from pixels in $T_B$, and the distribution for the foreground is constant and uniform. The global optimum of $E_{\lambda}$ for all continuous values of $\lambda$ can be computed efficiently using parametric max-flow [257].^7 From the set of all solutions for $E_{\lambda}$, one solution x is selected using the following heuristic. The segmentation with smallest foreground area is selected that also meets the criterion that the maximum distance of the largest connected component to any of the four sides of the box is smaller than a fixed threshold (e.g., 25% of a side of the bounding box).^8 The inspiration for the InitParametric initialization procedure is the following. First, the user is more likely to select a bounding box that is tight around the object. We refer to this idea as the bounding box prior; it motivated the work in [298]. Second, as in InitThirds, the foreground segmentation is often far away in feature space from the given background distribution. However, in contrast to InitThirds, the foreground segmentation is spatially coherent with this procedure.
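The InitThirds scheme is easy to express in a few lines. The sketch below assumes scikit-learn's GaussianMixture for the background model; the mask and variable names are of our own choosing.

import numpy as np
from sklearn.mixture import GaussianMixture

def init_thirds(colors, in_strip, in_box, K=7):
    # colors: (N, 3) pixel colors; in_strip: mask of T_B (strip) pixels;
    # in_box: mask of unknown pixels T_U inside the bounding box.
    bg = GaussianMixture(n_components=K).fit(colors[in_strip])
    loglik = bg.score_samples(colors[in_box])        # log p(z_n | background GMM)
    lo, hi = np.percentile(loglik, [100.0 / 3.0, 200.0 / 3.0])
    x = np.full(int(in_box.sum()), -1)               # -1 = '?', ignored in the GMM refit
    x[loglik <= lo] = 1                              # least background-like third -> foreground
    x[loglik >= hi] = 0                              # most background-like third -> background
    return x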
An experiment gives an indication of the quality of results that can be achieved with these three methods. For this test the GrabCut data set (fifty images) is used together with the bounding boxes from [298] (see online [203]).^9 It has to be stressed that the following error rates must be taken with care, since the data set is rather small,^10 and parameters were not trained.^11 The exact settings for GrabCut+ are as defined in section 7.3: $\lambda_1 = 5$, $\lambda_2 = 50$, K = 7. Results are as follows. Initializing GrabCut+ with InitFullBox gave an error rate of 9.0%.^12 It seems that the initialization of InitThirds is clearly a better choice, since the error rate dropped from 9.0% to 5.0%, which is the same conclusion as in [298]. Running the Background Removal tool, which uses InitParametric, gave a slightly higher error rate of 5.95%.
Note that this experiment did not enforce the very sensible bounding box prior, which
ensures that the final segmentation is close to the user-selected bounding box. Indeed, by
enforcing this prior the error rate can be reduced to 3.7%. The algorithm to achieve this is
described in [298] and runs graph cut iteratively while forcing certain pixels to belong to
the foreground. We refer the interested reader to [298] for several alternative procedures for
7. Since we were not aware of parametric max-flow in 2004, a simple iterative procedure is utilized that reuses flow.
8. This is the same criterion as the "weak tightness" condition defined in [298].
9. Note, the bounding boxes from [298] deviate slightly from the original set of bounding boxes. The bounding boxes that are different touch the image boundary, while the object does not. This modification simplifies the segmentation task and also removes the problem that in one image the original bounding box was of the same size as the image itself.
10. A larger (e.g., 1000+ images) data set with high-quality ground truth is needed in the future. Note that for product
testing, a medium-sized data set was created that includes images that are not photographs (e.g., graphics and
hand-drawn sketches).
11. As discussed in detail later, image segmentation is an interactive process; hence parameters have to be trained
anyway with the user in the loop (as in section 7.5.2).
12. One small modification did reduce the error rate to 7.1 percent. This was done by choosing a random initialization for k in step 2 in figure 7.2. It shows that initialization is very important and can affect the final result
considerably, and that the data set may be too small.
enforcing the bounding box prior. An interesting direction of future work could be to use
InitThirds as the initialization procedure and to exploit the parametric max-flow approach
of InitParametric to enforce the bounding box prior.
7.4.2 Modeling User Intention
The ultimate goal of a segmentation system is that a user achieves the desired result in as
short a time as possible. For this goal the energy defined in section 7.3.1 might not be the
optimal model. One problem with the above model is that it is agnostic to the sequence of
user interactions. By exploiting this sequence the intention of the user can be modeled in
a better way. Two simple ideas, which are realized in the Background Removal tool, will
be discussed.^13 Other ideas for modeling the user's intention are presented in recent works
[509, 321] that investigate alternative models and different user interfaces.
The first idea is to avoid the so-called "fluctuation" effect [321]. Consider a current imperfect
segmentation where the user has placed the (latest) foreground brushstroke. Two effects
are undesirable: (1) pixels change label from foreground to background and (2) pixels that
are spatially far away change label from background to foreground, since the user may not
notice it. We enforce the sensible constraint that the change in the segmentation must be,
in this case, from background to foreground and also 4-connected to the latest foreground
brushstroke. Achieving connectivity is in general an NP-hard problem (see chapter 22
and [497]); hence we solve it with a simple postprocessing step.^14 The same procedure is applied for a background brushstroke. With this connectivity prior, parameters in the GrabCut+ model may be chosen quite differently (see [350]).^15
Also, many systems that
do not use an explicit unary term, such as [19] and random walker [175], are guaranteed to
satisfy this connectivity property.^16
The second idea achieves the desired property that the latest brushstroke always has a
noticeable effect on the current segmentation, which is related to the "progressive labeling"
concept in [321]. Consider the case where the current segmentation has a dominant color
of red in the background and a small region of the true foreground that is also red but
is currently incorrectly labeled. A foreground brushstroke in this small region may fail to
13. These ideas were realized in Microsoft Expression Acrylic Graphics Designer (technology preview). Unfortunately, due to constraints in the Microsoft Office product, it was not possible to realize them exactly as described below. The difference is that in the MS Office version all user foreground and background brushstrokes are treated as one (latest) foreground or background brushstroke, respectively.
14. Note that other automatic techniques can be used in the future, such as [390], based on [497], or the geodesic star convexity prior of [178].
15. The reason is that one of the main effects of the Ising prior in an MRF (weight $\lambda_1$ in (7.3)) is to smooth out
wrongly labeled isolated regions. By enforcing connectivity these isolated regions are not permitted in the solution
space; hence a different weight for the Ising prior might perform better.
16. In this context we also postprocess the initial segmentation, which is the result of a bounding box (or lasso)
input, such that only one 4-connected foreground component is present. No constraint on the background label is
enforced, since many objects do have holes.
select the whole region, since the unary terms in the region strongly favor the background label, that is, red being background. The underlying problem is that the general global color model as defined in section 7.3.1 is not always appropriate for modeling objects. A practical solution to overcome this problem is simply to give pixels that are in the latest foreground brushstroke a much higher weight (e.g., 80%), and all other foreground pixels a lower weight (e.g., 20%). The same procedure is applied for a (latest) background brushstroke.^17 This aggressive color modeling procedure works only in conjunction with the above idea of connectivity with the latest brushstroke.
7.5 Evaluation and Future Work
As mentioned in section 7.1, the question of what is the best interactive segmentation sys-
tem today is hard to answer. To cast some light on this question, two experiments were
performed: first, with a static user input and second, with a so-called "robot user" that sim-
ulates a simple novice user. The robot user is used to train and compare different systems
in a truly interactive setting.
7.5.1 Static User Input
A plausible type of user interaction, for objects with boundaries that are not excessively
long, is that the user draws with a "fat pen" around the boundary of the object, which
produces a relatively tight trimap. Obviously a method should exploit the fact that the true
boundary is more likely to be in the middle of the user-drawn band. As above, the GrabCut
database (fifty images) is used; it provides such a trimap, derived by simply eroding the ground truth segmentation (see the example in figure 1.7(b) in chapter 1, and online in [203]). Several articles have reported error statistics for this data set, using the percentage of misclassified pixels within the unknown trimap region $T_U$.^18 As above, the following
error rates have to be taken with care, since the database is small and the methods were
not properly trained. Also, all parameter settings for GrabCut+ (and variants of it) are as
described above.
Applying simple graph cut without any global color modeling (i.e., energy in (7.1) without
unary term U) gives an error rate of 9.0%. As discussed in detail in chapter 8, one bias
of graph cut is the so-called "shrinking bias": segmentations with a short boundary are preferred. In contrast, random walker [175] has less of a shrinking bias and instead has a "proximity bias" toward segmentations that are equally far from the given foreground and
background trimap, respectively. Obviously, this is a better bias for this data set; hence
17. An improvement could be to model, in the case of a foreground brushstroke, the background color model in
a different way, such that it is more representative for this segmentation task.
18. Asmall fraction of pixels in the ground truth are unlabeled, due to transparency effects. These pixels are not
counted when computing the error rate.
the error rate for random walker is just 5.4% (see [129]) (footnote 19). In [440] (see also chapter 8) a continuum of solutions is presented that vary with respect to the proximity bias. As is to be expected, the setting p = ∞ in [440], which exploits the proximity bias most, is the best. On this note, a simple baseline method that ignores the image data achieves quite a low error rate of 4.5%. The baseline method simply classifies each pixel according to the Euclidean distance to the foreground and background regions, respectively (footnote 20). It is worth noting that there is a method that beats the baseline: random walker with an adaptive thresholding, which better exploits the proximity bias (see [129]).
Finally, the effect of adding global color models is investigated. The error rate of graph cut decreases from 9.0% to 6.6% when using the GrabCut+ algorithm in figure 7.2 with one sweep (step 3). Multiple iterations of GrabCut+ reduce the error rate further to 5.6%, and the Background Removal tool achieves basically the same error rate, 5.8% (footnote 21). To conclude, global color models do considerably help graph cut-based techniques, and they may also help the best-performing methods for this data set, as also conjectured in [129].
7.5.2 Dynamic User Input
The setup is described in detail in Gulshan et al. [178], so only some aspects are mentioned here. To measure the amount of user interaction in an interactive system, we invented the so-called robot user [350]. It can be used for both learning and evaluating segmentation systems (see [350]). Given a new image, it starts with an initial set of brushstrokes (footnote 22) and computes a segmentation. It then places a circular brushstroke in the largest connected component of the segmentation error area, at a point farthest from the boundary of the component. The process is repeated up to twenty times, generating a sequence of twenty simulated user strokes that is different for each algorithm. Figure 7.4b shows an example (see also the video in [203]). From the sequence of interactions, one number for the amount of user interaction is derived that measures, roughly speaking, the average number of brushstrokes necessary to achieve a good-quality result (details in [178]). A small user study confirmed that the interaction effort of the robot user indeed correlates reasonably well with the true effort of a novice user (see [350]).
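For concreteness, the following is a minimal sketch of the robot user's stroke-placement rule described above. It assumes binary NumPy masks; the function name and interface are illustrative and not taken from [350] or [203].

```python
# Hypothetical sketch of the robot user's brush placement (illustrative only).
import numpy as np
from scipy import ndimage

def place_next_brush(segmentation, ground_truth):
    """Return (centre, label) for the next circular brushstroke, or None.

    The stroke is placed in the largest connected component of the error
    area, at the point farthest from that component's boundary (approximated
    here with a Euclidean distance transform).
    """
    error = segmentation != ground_truth
    if not error.any():
        return None
    labels, num = ndimage.label(error)                        # 4-connected error components
    sizes = ndimage.sum(error, labels, index=range(1, num + 1))
    largest = labels == (int(np.argmax(sizes)) + 1)
    dist = ndimage.distance_transform_edt(largest)            # distance to the component boundary
    centre = np.unravel_index(int(np.argmax(dist)), dist.shape)
    return centre, bool(ground_truth[centre])                 # paint with the true label there
```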
The data set for this experiment consists of 151 images with ground truth segmentations, and is a mix of existing data sets including GrabCut and VOC09 (see [203]). The free
19. In [129] some variations of the random walker formulation are given, which, however, produce quantitatively the same results.
20. The baseline method does not treat thin structures well; hence a skeletonization approach might perform even better.
21. Note that by additionally exploiting the idea that the relevant training data for the background GMM are more likely in a strip around the unknown region (as provided with the data set), the error rate of GrabCut+ drops from 5.6% to 5.3%.
22. They were chosen manually, with one stroke for the foreground and three strokes for the background.
[Figure 7.4 panels: (a) input image; (b) BR result; (c) Geo result; (d) plot of error (log scale) vs. number of strokes for the Geo, GC, RW, BR, and GSC systems.]
Figure 7.4
(a) Input image (original in color). (b) Result of the robot user employing the BR system, with 0.85% of misclassified pixels (image colors adapted for better visualization). The segmentation is outlined with a black-white line, and the robot user inputs are white and black circles for the foreground and background, respectively (long strokes are initial manual user strokes). (c) Result from the Geo system, which is considerably worse (error 1.61%). (d) Performance of five systems utilizing the robot user (error in log scale).
parameters for all systems were trained using cross validation, a 75/76 split repeated ten times.
Table 7.1 presents the results for five different methods, and figure 7.4d depicts the average error rate for the sequence of robot user interactions. The Geo method of [19], based on geodesic distance, performed worst. This is not surprising, since the method does not regularize the boundary and is sensitive to the exact locations of brushstrokes (see chapter 8). Figure 7.4c gives an example. Graph cut GC (section 7.2) performed considerably better (footnote 23). The main difference of graph cut compared with the following three systems is that it does not impose any shape prior. Random walker (RW) [175], for instance, guarantees connectivity of the segmentation with respect to brushstrokes (e.g., a pixel with label 0 is 4/8-connected to a background brushstroke). Hence, even without global color models,
23. Here the implementation of [203] was used. It includes an important additional trick that mixes the GMMs with a uniform distribution.
Table 7.1
Comparison of five different systems in terms of the average number of brushstrokes needed by the robot user to achieve good results

Method   Geo [20]   GC (sec. 7.2)   RW [176]   BR (sec. 7.3)   GSC [179]
Effort   15.14      12.35           12.31      10.82           9.63
random walker performs as well as graph cut. The Background Removal tool (BR) is the second-best-performing system (example in figure 7.4b). It exploits the same connectivity property as random walker, but additionally utilizes global color models. Finally, the best-performing method, GSC [178], imposes a strong geodesic star convexity prior on top of the simple graph cut system (GC). This convexity prior is even more restrictive than a connectivity prior (see [178], figure 14), which seems to be an advantage in practice (footnote 24).
7.5.3 Future Work
Many ideas for future work were mentioned above. Certainly, the model for Background Removal can be improved further by using stronger shape priors (e.g., [178]), improved local MRF modeling (e.g., flux [390]), or better global color models (e.g., [178]). Apart from improving the model, we believe that further improvements may be achieved by focusing more on user aspects.
Acknowledgments
Many people helped with discussions, experiments, and great new ideas on the topic of interactive image segmentation. We would like to gratefully thank Toby Sharp, Varun Gulshan, Victor Lempitsky, Pushmeet Kohli, Sara Vicente, Antonio Criminisi, Christoph Rhemann, Dheeraj Singaraju, Hannes Nickisch, and Patrick Pérez.
24. This is supported by testing against another variant of GC, which performed slightly worse (effort 10.66). It removes, in a postprocessing step, all foreground islands that are not connected to a foreground brushstroke.
8
Continuous-Valued MRF for Image Segmentation
Dheeraj Singaraju, Leo Grady, Ali Kemal Sinop, and René Vidal
Research on image segmentation has focused on algorithms that automatically determine how to group pixels into different regions on the basis of homogeneity of intensity, color, texture, or other features [434, 100, 138, 533]. However, since images generally contain several objects that are surrounded by clutter, it is often not possible to define a unique segmentation. In many such cases, different users may be interested in obtaining different segmentations of an image. Hence, recent research in segmentation has focused on interactive methods that allow different users to interact with the system and to segment different objects of interest from the same image.
One genre of interactive segmentation algorithms offers the user a scribble interface to label two disjoint sets of pixels as belonging to the object of interest and to the background. The algorithm's goal is then to output a label for each unmarked pixel into one of these two categories. The labeling is typically obtained by minimizing an energy function defined on a weighted combinatorial graph. In general, this can be done using several methods, such as graph cut [66, 72], random walker [175], shortest path [18, 113], region growing [2], fuzzy connectivity [487], seeded watershed [24], and many more examples given in chapter 7. This genre has become very popular, notably due to the availability of numerical solvers that efficiently produce the global optimizer of the defined energy function.
This chapter discusses a generalized graph-theoretic algorithm that estimates the segmentation via a continuous-valued optimization, as opposed to the traditional view of segmentation as a discrete-valued optimization, as in chapter 7. The algorithm proceeds by associating a continuous-valued variable with each node in the graph. An energy function is then defined by considering the p-norm of the difference between these variables at neighboring nodes. The minimizer of this energy function is thresholded to produce a binary segmentation. This formulation includes algorithms such as graph cut [66], random walker [175], and shortest path [18] as special cases for specific values of the p-norm (i.e., p = 1, 2, and ∞, respectively). Due to the choices of the p-norm, these algorithms have their characteristic disadvantages. Three such concerns that will be discussed in detail later are metrication artifacts (blockiness of the segmentation due to the underlying grid structure), proximity bias (bleeding of the segmentation due to sensitivity to the location of user interaction),
and shrinking bias (shortcutting of the segmentation boundary due to bias toward shorter boundaries).
The use of an intermediate p-norm for segmentation might compensate for these drawbacks. However, the optimization of intermediate p-norms has been somewhat neglected, due to the focus on fast dedicated solvers for the cases of p = 1, p = 2, and p = ∞ (e.g., max-flow for p = 1, a linear system solver for p = 2, and Dijkstra's shortest path algorithm for p = ∞). The lack of a general solver precludes the ability to merge these algorithms or employ the generalized algorithm with an intermediate p-value. For this purpose, the present chapter discusses the use of iterative reweighted least squares (IRLS) techniques to find the segmentation for any arbitrary p-norm (1 < p < 3) by solving a series of ℓ_2 optimizations. The use of IRLS hence allows one to find segmentation algorithms that are proper hybrids of existing segmentation algorithms such as graph cut, random walker, and shortest path.
8.1 A Generalized Image Segmentation Algorithm
A given image is represented by a weighted graph G = (V, E). The nodes V represent the pixels in the image and the edges E represent the choice of neighborhood structure. The weight of an edge e_ij ∈ E is denoted by w_ij, and the weights are assumed here to be symmetric and nonnegative (i.e., w_ij = w_ji ≥ 0).
Since this chapter assumes a scribble interface, it is assumed that some pixels in the image have been labeled as foreground and some others have been labeled as background. Let M ⊂ V contain the locations of the nodes marked by the user and let U ⊂ V contain the locations of the unmarked nodes. The set M is further divided into the sets F ⊆ M and B ⊆ M that contain the locations of the nodes labeled as the foreground object and the background, respectively. By construction, M ∩ U = ∅, M ∪ U = V, F ∩ B = ∅, and F ∪ B = M.
A Bernoulli random variable y_i ∈ {0, 1} is defined for each node i ∈ V, to indicate its binary segmentation as object (y_i = 1) or background (y_i = 0). A continuous-valued random variable x_i is introduced at each node to define the success probability for the distribution of the random variable y_i, that is, p(y_i = 1 | x_i). For example, [143] uses logistic regression on a real-valued variable x_i to define the success probability as p(y_i = 1 | x_i) = e^{x_i} / (1 + e^{x_i}). In this chapter the success probability of y_i at node i is defined as

p(y_i = 1 | x_i) = max{min{x_i, 1}, 0} = { 1 if x_i > 1;  x_i if 0 ≤ x_i ≤ 1;  0 if x_i < 0 }.    (8.1)

For notational convenience, define vectors x ∈ R^{|V|} and y ∈ R^{|V|}, whose ith entries are given by x_i and y_i, respectively. Now the goal is to infer the hidden variables, x and y, from
the observed quantity (i.e., the image I). These hidden parameters can be estimated in a Bayesian framework by considering the following posterior probability model:

p(x, y | I) ∝ p(x) p(y | x) p(I | y) = p(x) ∏_{i∈V} p(y_i | x_i) ∏_{i∈V} p(I_i | y_i).    (8.2)

The term p(x) is the prior term that encodes constraints on how the parameters of the Bernoulli variables vary spatially. Unlike most of the discussions in this book, in this chapter the smoothness constraints are enforced on the hidden variables x rather than on the segmentation y itself. The spatial smoothness prior is explicitly parameterized as

p(x) ∝ exp( −λ Σ_{e_ij∈E} (w_ij |x_i − x_j|)^p ),    (8.3)

where λ > 0 and the weights w_ij are positive (i.e., ∀ e_ij ∈ E, w_ij > 0). Different choices for the p-norm result in different priors on x. For example, p = 1 gives a Laplacian prior and p = 2 gives a Gaussian prior. The term p(y_i | x_i) at each node is given completely by (8.1), where p(y_i = 0 | x_i) is defined as 1 − p(y_i = 1 | x_i). The term p(I_i | y_i) is the standard likelihood term as used in the rest of this book.
One of the drawbacks of the model discussed so far is that the edge weights for the pairwise terms, {w_ij}, do not depend on the image. Specifically, the edge weights are used as the parameters of the spatial prior model p(x) and hence cannot depend on the image. However, as discussed in other chapters, in practice it is preferable to use contrast-sensitive edge weights, such as w_ij = e^{−(I_i − I_j)^2}, to align the segmentation boundary with the edges in the image. However, modifying the probabilistic model to accommodate contrast-sensitive weights is not straightforward. In the case of discrete MRFs, a modification of the likelihood term that better accommodates contrast-sensitive weights was proposed by Blake et al. [53]. However, it is unclear how such results would be applicable to the formulation of this chapter, which considers both continuous and discrete variables.
To this effect, this chapter follows an alternative formulation, which directly models the posterior, rather than attempting to decompose it into a likelihood and a prior term. Specifically, the posterior distribution of the hidden variables x and y is modeled as

p(x, y | I) ∝ p(y | x, I) p(x | I) = p(y | x) p(x | I)
  = ∏_{i∈V} ( x_i^{y_i} (1 − x_i)^{1 − y_i} ) ∏_{i∈V} ( x_i^{H(x_i − 0.5)} (1 − x_i)^{1 − H(x_i − 0.5)} )
    × exp( −λ Σ_{e_ij∈E} (w_ij |x_i − x_j|)^p ) exp( −Σ_{i∈V} w_i0^p |x_i − 0|^p − Σ_{i∈V} w_i1^p |x_i − 1|^p ),    (8.4)
where H(·) is the Heaviside function and, ∀i ∈ V, w_i0 ≥ 0 and w_i1 ≥ 0. The reduction of the term p(y | x, I) to p(y | x) in the first line comes from the assumption that y is conditionally independent of I, given x. The terms introduced in the bottom row of (8.4) act as the unary terms, and the weights w_i0 and w_i1 bias the parameters x_i toward 0 and 1. First, these unary terms serve the purpose of encoding the user's interaction. If a node i ∈ M is labeled as object or background, the algorithm sets the corresponding unary terms as (w_i0, w_i1) = (0, ∞) or (w_i0, w_i1) = (∞, 0), respectively. It can be verified that this is equivalent to hardcoding the value of x_i at the marked nodes i ∈ M as ∀i ∈ F, x_i = 1 and ∀i ∈ B, x_i = 0. The unary terms may also be used to encode the extent to which the appearance (color, texture, etc.) of a node i obeys an appearance model for the object or the background.
Given the expression (8.4), the goal is now to estimate the hidden variables x and y as argmax_{x,y} p(x, y | I). It can be verified that estimating ŷ = argmax_y p(x, y | I) gives, for each node i ∈ V, y_i = 1 if x_i ≥ 0.5 and y_i = 0 otherwise. It can also be verified that estimating the optimal value of x as argmax_x p(x, y | I) is equivalent to estimating x as the minimizer of the energy function E(x), where E(x) is defined as

E(x) = Σ_{i∈V} w_i0^p |x_i − 0|^p + Σ_{i∈V} w_i1^p |x_i − 1|^p + λ Σ_{e_ij∈E} (w_ij |x_i − x_j|)^p,    (8.5)

where λ > 0 is the same parameter as in (8.4), which accounts for the trade-off between the unary terms and the pairwise terms.
For notational convenience, one can introduce two auxiliary nodes for the foreground and the background: f and b, respectively. The parameters x_i at these nodes are hardcoded as x_f = 1 and x_b = 0. The unary terms can then be rewritten as w_i0^p |x_i − 0|^p = w_i0^p |x_i − x_b|^p and w_i1^p |x_i − 1|^p = w_i1^p |x_i − x_f|^p. Hence, without loss of generality, E(x) can be rewritten in terms of pairwise interactions only, as

E(x) = Σ_{e_ij∈E} (w_ij |x_i − x_j|)^p,    (8.6)

where, with abuse of notation, the set E is modified to include the original set of edges E defined in (8.5), as well as the additional edges introduced by representing the unary terms as pairwise interactions.
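As an illustration of the auxiliary-node construction behind (8.6), the sketch below folds the unary weights w_i0 and w_i1 into extra edges toward two terminal nodes b and f; the edge-list representation is an assumption made purely for illustration.

```python
# Illustrative sketch: representing the unary terms of (8.5) as pairwise
# edges to auxiliary background/foreground nodes, as in (8.6).
def add_terminal_edges(edges, w0, w1):
    """edges: list of (i, j, w_ij); w0[i], w1[i]: unary weights of node i.

    Returns the extended edge list plus the indices of the auxiliary
    background node b and foreground node f, whose values are hardcoded
    to x_b = 0 and x_f = 1 by the solver.
    """
    n = len(w0)
    b, f = n, n + 1
    extended = list(edges)
    for i in range(n):
        if w0[i] > 0:
            extended.append((i, b, w0[i]))   # encodes w_i0^p |x_i - x_b|^p
        if w1[i] > 0:
            extended.append((i, f, w1[i]))   # encodes w_i1^p |x_i - x_f|^p
    return extended, b, f
```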
Now, note that E(x) is parameterized by a finite-valued p-norm, p < ∞. The limiting case p = ∞ is admitted by generalizing E(x) as

E_p(x) = ( Σ_{e_ij∈E} (w_ij |x_i − x_j|)^p )^{1/p}.    (8.7)
Due to the monotonicity of the (·)^{1/p} operator, (8.6) and (8.7) have the same optimum for a finite 0 < p < ∞. As shown later, this generalization allows the shortest path segmentation algorithm to be admitted as a special case of the generalized algorithm for p = ∞.
Therefore, the problem of computing the segmentation is recast as the problem of computing the optimal x that minimizes E_p(x), subject to the constraints enforced by the user's interaction, as

x = arg min_x E_p(x)   s.t.   x_i = 1 if i ∈ F, and x_i = 0 if i ∈ B.    (8.8)

It is shown later that the solution to (8.8) naturally satisfies the constraint that ∀i ∈ U, 0 ≤ x_i ≤ 1. We can redefine y to be the hard segmentation produced from the real-valued x by thresholding at x = 0.5. This is equivalent to obtaining the segmentation of node i as y_i = arg max_{y_i} p(y_i | x_i). This segmentation procedure is summarized in algorithm 8.1. This generalized segmentation algorithm is referred to as the p-brush algorithm, due to the dependence of the solution on the p-norm. It will be shown later that the graph cut, random walker, and shortest path segmentation algorithms can be viewed as special instances of this p-brush algorithm when p = 1, 2, and ∞, respectively.
Algorithm 8.1 (p-Brush: A Generalized Image Segmentation Algorithm)
Given:
• Two sets of pixels marked for the foreground object (F ⊂ V) and the background (B ⊂ V).
• A norm p ≥ 1 for the energy function E_p = ( Σ_{e_ij∈E} (w_ij |x_i − x_j|)^p )^{1/p} that includes unary terms as well as the spatial prior.
Compute: x = arg min_x E_p(x), s.t. x_i = 1 if i ∈ F and x_i = 0 if i ∈ B.
Output: Segmentation y defined as y_i = 1 if x_i ≥ 1/2 and y_i = 0 if x_i < 1/2.
8.2 Solutions to the p-Brush Problem
This section initially discusses some interesting properties of the solutions of the p-brush
algorithm. The remaining discussion focuses on efcient solvers for particular values of the
p-norm.
An important property of E
p
(x) is its convexity for all values of p 1. Therefore, any
solution of (8.8) must be a global minimizer of E
p
(x). In what follows, a global minimizer
of E
p
(x) is denoted as x
p
and its ith entry is denoted as x
p,i
. It was shown in [440] that
these minimizers have interesting properties.
Extremum Value Property  The value of x_{p,i} at every node i ∈ V is bounded by the values of x_{p,j} at the marked nodes j ∈ M. Formally, this can be written as ∀i ∈ V, min_{j∈M} x_{p,j} ≤ x_{p,i} ≤ max_{j∈M} x_{p,j}.
Now recall that, by construction, the entries of x_p are fixed at the marked nodes as ∀i ∈ F, x_{p,i} = 1, and ∀i ∈ B, x_{p,i} = 0. Hence, the extremum value property can be used to conclude that the value of x_{p,i} at each unmarked node i ∈ U lies in [0, 1]. As a result, the set of solutions for the entries of x_p at the unmarked nodes U is [0, 1]^{|U|}, which is a compact and convex set. This result, coupled with the fact that the energy function E_p(x) is convex in x, implies that any descent algorithm can be used to calculate the global minimizer of E_p(x).
Right Continuity in the p-Norm  This property characterizes the continuity of the solutions of the p-brush algorithm as a function of the p-norm. In particular, it can be shown that if x_{p+ε} is a minimizer of E_{p+ε}(x), where ε ≥ 0, then x_{p+ε} is right continuous in p, that is, lim_{ε→0+} x_{p+ε} = x_p. The significance of this property will be illustrated later while discussing the IRLS algorithm for estimating the solutions of the p-brush algorithm for the range 1 < p < 3.
8.2.1 Special Cases of the p-Brush Algorithm
Before studying the case for a general p, it is of interest to study the instances of the p-brush algorithm resulting from the choices of p = 1, 2, and ∞, since they correspond to existing segmentation algorithms.
The p = 1 Case: Graph Cut  After substituting p = 1 in the p-brush algorithm, the second step of the algorithm requires the solution to the problem

min_x Σ_{e_ij∈E} w_ij |x_i − x_j|,   s.t.   x_i = 1 if i ∈ F, and x_i = 0 if i ∈ B.    (8.9)

It is known that the problem in (8.9) admits a purely binary solution, x_i ∈ {0, 1}, due to the totally unimodular property of the min-cut problem [363]. This is precisely the solution that is produced by the graph cut algorithm using the min-cut/max-flow solver [66]. Notice that (8.9) provides a continuous-valued interpretation of the graph cut algorithm, as opposed to the traditional discrete-valued interpretation.
Although (8.9) admits a purely binary solution, the solution may not be unique and there may be continuous-valued nonbinary solutions to (8.9). A result in [85] can be used to obtain a purely binary-valued minimizer from any continuous-valued minimizer. Specifically, a binary-valued minimizer x^B_1 ∈ {0, 1}^{|V|} can be produced from a continuous-valued minimizer x^C_1 ∈ [0, 1]^{|V|} of (8.9) by thresholding its entries at any value θ ∈ (0, 1), that is, ∀i ∈ V: x^B_{1,i} = 1 if x^C_{1,i} ≥ θ, and x^B_{1,i} = 0 otherwise. It was shown in [85] that both solutions, x^B_1 and x^C_1, are minimizers of (8.9). Hence, thresholding any solution to (8.9)
at θ = 0.5 produces a valid minimum cut. It is interesting that, although this model deals with continuous-valued solutions as opposed to discrete MRF models (e.g., chapter 7), the thresholded solution is indeed equivalent to the solution of the discrete model.
The p = 2 Case: Random Walker  For the case p = 2, the second step of the p-brush algorithm requires the solution to the problem

min_x Σ_{e_ij∈E} w_ij^2 (x_i − x_j)^2,   s.t.   x_i = 1 if i ∈ F, and x_i = 0 if i ∈ B.    (8.10)

This is exactly the optimization problem solved by the random walker algorithm in [175] (for the case of two labels). A random walk is defined on the graph such that the probability that a random walker at node i ∈ V moves to a neighboring node j ∈ N_i is given as w_ij / Σ_{k∈N_i} w_ik. The random walk is terminated when the random walker reaches any of the marked nodes. [175] showed that the solution of (8.10), x_2, satisfies the property that x_{2,i} corresponds to the probability that a random walker starting from node i ∈ V will reach a node labeled as foreground before a node labeled as background. These probabilities are thresholded at x = 0.5 to obtain the segmentation. Hence, the random walker algorithm with two labels gives the same solution as the p-brush algorithm when p = 2.
The p = ∞ Case: Shortest Path  When p → ∞, the limit of E_p(x) is given as

lim_{p→∞} E_p(x) = ( max_{e_ij∈E} w_ij |x_i − x_j| ) · lim_{p→∞} ( Σ_{e_ij∈E} ( w_ij |x_i − x_j| / φ(x) )^p )^{1/p} = φ(x),    (8.11)

where φ(x) is defined as φ(x) = max_{e_ij∈E} w_ij |x_i − x_j| and the second factor tends to 1. Now the optimization problem in (8.11) can be rewritten as

min_x ( max_{e_ij∈E} w_ij |x_i − x_j| ),   s.t.   x_i = 1 if i ∈ F, and x_i = 0 if i ∈ B.    (8.12)
This optimization problem may be viewed as a combinatorial formulation of the minimal Lipschitz extension problem [15]. It has been shown that the solution to (8.12) is not unique in general [15]. Theorem 8.1 provides one possible interesting construction to minimize E_∞(x).
Theorem 8.1 (Infinity-Norm Optimization)  Define the distance between neighboring nodes i and j as d_ij = 1/w_ij. Denote the shortest path lengths from node i ∈ V to a node marked foreground and to a node marked background as d^F_i and d^B_i, respectively. The vector x_∞, defined as ∀i ∈ U: x_{∞,i} = d^B_i / (d^B_i + d^F_i), is a solution to (8.12).
134 8 Continuous-Valued MRF for Image Segmentation
Proof  Given in the appendix to this chapter.
Note that a node i ∈ V is assigned to the foreground if x_{∞,i} > 0.5 (i.e., d^F_i < d^B_i). This implies that the segmentation given by the shortest path algorithm [18] is a valid solution to the p-brush algorithm for the case p = ∞. Hence x_∞ may be computed efficiently using Dijkstra's shortest path algorithm. However, as mentioned earlier, this is not the only solution to (8.12). One could introduce other constructions and additional regularizers, as in [441], to obtain a unique solution.
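The construction of Theorem 8.1 lends itself to a very short implementation. The sketch below uses SciPy's Dijkstra routine with edge lengths d_ij = 1/w_ij; the edge-list interface is an illustrative assumption, not an interface from [440] or [441].

```python
# Sketch of the p = infinity solution of Theorem 8.1 via shortest paths.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def p_brush_infinity(num_nodes, edges, F, B):
    i, j, w = (np.asarray(a, dtype=float) for a in zip(*edges))
    i, j = i.astype(int), j.astype(int)
    lengths = 1.0 / w                                           # d_ij = 1 / w_ij
    graph = coo_matrix((np.concatenate([lengths, lengths]),
                        (np.concatenate([i, j]), np.concatenate([j, i]))),
                       shape=(num_nodes, num_nodes))
    dF = dijkstra(graph, indices=F).min(axis=0)                 # d^F_i: distance to nearest foreground seed
    dB = dijkstra(graph, indices=B).min(axis=0)                 # d^B_i: distance to nearest background seed
    x = dB / (dB + dF)                                          # x_{inf,i} = d^B_i / (d^B_i + d^F_i)
    return x, x > 0.5                                           # foreground where d^F_i < d^B_i
```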
8.2.2 Segmentation with an Arbitrary p-Norm
In the special cases discussed so far, a specific solver is used for each case due to the properties of the employed p-norm. For any arbitrary finite p ∈ (1, 3), algorithm 8.2 can be used to estimate x_p, employing iterative reweighted least squares (IRLS). Each iteration of the algorithm has two steps. The first step, reweighting, involves the update of the weights based on the current estimate of x (see (8.13)). The second step, least squares estimation, involves updating the value of x by solving a least squares problem with the updated weights (see (8.14)). IRLS can also be thought of as an iterative random walker algorithm with the weights being updated at each iteration. The rationale behind the algorithm is as follows. For p > 1, the function (E_p(x))^p is differentiable. In this case, algorithm 8.2 is equivalent to performing a Newton descent of (E_p(x))^p with step size (p − 1) [440]. This is because, when ∀e_ij ∈ E, x^(n)_i ≠ x^(n)_j, the matrix W^(n), whose (i, j) entry is given by w^(n)_ij, is exactly the Hessian (say H_p(x)) of (E_p(x))^p evaluated at x = x^(n). If x_i = x_j for some e_ij ∈ E, then
Algorithm 8.2 (Estimation of x_p for Any 1 < p < 3, Using IRLS)
1. Set n = 0 and choose a small constant ε > 0 and a stopping criterion δ > 0.
2. Initialize the membership vector x^(0) as ∀i ∈ F: x^(0)_i = 1, ∀i ∈ B: x^(0)_i = 0, and ∀i ∈ U: x^(0)_i = 0.5.
3. For each edge e_ij ∈ E, define the edge weight w^(n)_ij:

w^(n)_ij = w_ij^p |x^(n)_i − x^(n)_j|^{p−2} if x^(n)_i ≠ x^(n)_j, and w^(n)_ij = ε^{p−2} if x^(n)_i = x^(n)_j.    (8.13)

4. Calculate x^(n+1) as the solution of

arg min_x Σ_{e_ij∈E} w^(n)_ij (x_i − x_j)^2,   s.t.   x_i = 1 if i ∈ F and x_i = 0 if i ∈ B.    (8.14)

5. If ||x^(n+1)_U − x^(n)_U|| > δ, update n = n + 1 and go to step 3.
H_p(x) does not exist for 1 < p < 2. This is resolved by approximating H_p(x) using the weights defined in (8.13). It can be verified that IRLS still produces a descent direction for updating x^(n) at each step.
For p = 1, E_1(x) is not differentiable. However, recall from the properties of x_p discussed earlier in this section that the minimizers of E_p(x) are right continuous with respect to the p-norm. Therefore, the minimizer of (E_{1+ε}(x))^{1+ε} can be calculated using IRLS and used as an approximation of x_1 with a desired accuracy by choosing ε to be sufficiently small. In general, IRLS is provably convergent only for 1 < p < 3 [359]. However, solutions for p ≥ 3 can be obtained by using Newton descent with an adaptive step size rather than p − 1.
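The following is a minimal sketch of algorithm 8.2, with the constrained least squares step (8.14) solved through the weighted graph Laplacian. The interface mirrors the earlier sketches and is illustrative only, not an implementation from [440].

```python
# IRLS sketch for the p-brush problem with 1 < p < 3 (algorithm 8.2).
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import spsolve

def p_brush_irls(num_nodes, edges, F, B, p=1.5, eps=1e-6, tol=1e-4, max_iter=50):
    F, B = np.asarray(F), np.asarray(B)
    marked = np.concatenate([F, B])
    unmarked = np.setdiff1d(np.arange(num_nodes), marked)
    x = np.full(num_nodes, 0.5)                                 # initialization, step 2
    x[F], x[B] = 1.0, 0.0
    i, j, w = (np.asarray(a, dtype=float) for a in zip(*edges))
    i, j = i.astype(int), j.astype(int)
    for _ in range(max_iter):
        diff = np.maximum(np.abs(x[i] - x[j]), eps)             # guard for x_i == x_j, cf. (8.13)
        wn = w**p * diff**(p - 2)                               # reweighting, step 3
        W = coo_matrix((np.concatenate([wn, wn]),
                        (np.concatenate([i, j]), np.concatenate([j, i]))),
                       shape=(num_nodes, num_nodes)).tocsr()
        L = diags(np.asarray(W.sum(axis=1)).ravel()) - W        # weighted graph Laplacian
        A = L[unmarked][:, unmarked].tocsc()                    # least squares step (8.14):
        rhs = -L[unmarked][:, marked] @ x[marked]               # solve only for the unmarked nodes
        x_new = spsolve(A, rhs)
        done = np.max(np.abs(x_new - x[unmarked])) <= tol       # stopping criterion, step 5
        x[unmarked] = x_new
        if done:
            break
    return x, x >= 0.5
```

As discussed above, choosing p close to 1 makes the reweighting poorly conditioned, which is where the ε guard and the right-continuity property come into play.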
8.3 Experiments
This section evaluates the performance of the p-brush algorithm on synthetic, medical, and natural images. The experiments aim to highlight common problems in the segmentation results and to analyze how they might depend on the choice of the p-norm. The results first present the image to be segmented and then show the segmentation results obtained for various values of p. The segmentation boundary is superimposed on the image either as a dashed line or as a bold line to ensure visibility. The user interaction for the object and background is superimposed in different shades of gray to distinguish them.
A practical segmentation system has many components that can significantly affect its performance, such as unary terms and neighborhood. Most existing algorithms rely on good unary terms that, on their own, produce near-perfect segmentations. When the unary terms are uninformative, the segmentation relies primarily on the spatial prior. Therefore, unary terms are ignored in the following evaluation to isolate and analyze the effect of the p-norm on the spatial regularization of the segmentation boundary. It should, however, be noted that unary terms have been employed in existing literature for p = 1 [67], p = 2 [174], and p = ∞ [18]. Though it is not within the goals of this chapter, it is of future interest to analyze how the unary terms behave for general values of p. Along the same lines, a simple 4-connected lattice is used to define the neighborhood in order to inspect metrication artifacts that might otherwise be eliminated by considering higher connected neighborhoods. The contrast-sensitive weight for an edge e_ij ∈ E is defined as w_ij = e^{−β ||I_i − I_j||}, where β > 0 and I_i is the grayscale intensity or RGB color of pixel i.
8.3.1 Metrication Artifacts
The artifacts correspond to blocky segmentations that follow the topology of the neighborhood structure. They occur in the cases of p = 1 and p = ∞, due to the neighborhood structure [67]. In contrast, the case p = 2 corresponds to a finite differences discretization of a continuous (inhomogeneous) Laplace equation. Chapter 12 shows that discretization of continuous formulations can reduce metrication errors. Hence, it may be conjectured that metrication artifacts are reduced in the case p = 2.
[Figure 8.1 panels: (a) image; (b) p = 1; (c) p = 1.33; (d) p = 1.67; (e) p = 2; (f) p = 2.33; (g) p = 2.67; (h) p = ∞.]
Figure 8.1
Analysis of metrication artifacts in the segmentation of the synthetic image shown in (a). Light and dark marks (squares in this example) indicate user labelings of foreground and background, respectively. The segmentations obtained for various p-norms are shown in (b)-(h). Blocky artifacts are present for p = 1 but not for higher values of p.
To understand this better, consider the experiments in figure 8.1, where the goal is to segment the octagon in figure 8.1a. Graph cut (p = 1) produces a squared-off effect at the portion where the octagon's boundary has been erased. This is the metrication artifact. The artifacts are absent as p increases from 1 to 2.67. Although they are avoided for p = ∞ in this synthetic example, they are exhibited for p = ∞ in the examples on real images. Figure 8.2 shows metrication artifacts in the segmentation of an aneurysm in an MR image. The metrication artifacts are clearly visible for p = 1. They seem to reduce as p increases, but not as drastically as in the previous example. Specifically, for p = 1.25, the segmentation boundary to the left of the aneurysm is less blocky compared with the result for p = 1. However, the boundary is still blocky at the top and bottom of the aneurysm. These artifacts reduce as p increases from 1 to 2.75, but they reappear for p = ∞. The same trend can be seen in the results in figure 8.5.
In general, the metrication artifacts may be reduced by choosing a higher connected neighborhood [67]. For example, an 8-neighborhood instead of a 4-neighborhood would have given the desired result for p = 1 in the example in figure 8.1. However, increasing the connectivity is equivalent to increasing the number of edges in the graph, and hence comes at the expense of higher memory usage and higher computation.
8.3.2 Proximity Bias
This bias corresponds to the sensitivity of the segmentation to the location of the user's interaction. The proximity bias is best understood for p = ∞, since the segmentation of an unmarked pixel depends on its distance from the marked pixels. It was shown in [175]
[Figure 8.2 panels: (a) image; (b) p = 1; (c) p = 1.25; (d) p = 1.75; (e) p = 2; (f) p = 2.25; (g) p = 2.75; (h) p = ∞.]
Figure 8.2
Analysis of metrication artifacts in the segmentation of the medical image shown in (a). The segmentations obtained for various p-norms are shown in (b)-(h). The artifacts reduce as p increases from 1 to 2.75, but are present for p = ∞.
that for p = 2, the segmentation of an unmarked pixel depends on the distances of all the parallel paths from that pixel to the marked pixels, thus reducing dependence on a single path. No such interpretation is known for p = 1.
To understand this, consider the experiments in figure 8.3, where the goal is to segment the image into equal halves along the vertical dark line. The user's scribbles are not symmetric with respect to the line. If the scribbles were symmetric, the segmentation computed for the various p-norms would be as desired. In this case the desired segmentation is obtained for p = 1 and p = 1.33. As p increases, the segmentation begins to leak through the portions where the dark line has been erased. This is the proximity bias.
This bias is further explored in the results in figure 8.4. In contrast to the user interaction in figure 8.2, the one for the aneurysm has been erased toward the bottom right. This is done to analyze the effect on the segmentation boundary near the weak edge at the bottom right of the aneurysm. It can be seen that for p = 1 and p = 1.1, the segmentation boundary is correctly aligned with the aneurysm. However, as p increases, the segmentation begins to leak and the boundary is incorrectly aligned with the aneurysm's interior.
In general, the proximity bias can be reduced by further user interaction to correct any errors. However, this might require greater levels of interaction from the user, which can prove to be a burden for real-time applications. Moreover, additional user interaction might not be possible in unsupervised applications, where the user interaction is automatically generated by the algorithm, based on appearance models learned a priori for the object and background.
8.3.3 Shrinking Bias
The shrinking bias corresponds to the bias toward segmentation boundaries with shorter
length. This can be understood better for p = 1 (i.e., graph cut), because it has been shown
[Figure 8.3 panels: (a) image; (b) p = 1; (c) p = 1.33; (d) p = 1.67; (e) p = 2; (f) p = 2.33; (g) p = 2.67; (h) p = ∞.]
Figure 8.3
Analysis of proximity bias in the segmentation of the synthetic image shown in (a). The segmentations obtained for various p-norms are shown in (b)-(h). The desired segmentation is produced for p = 1 and p = 1.33. As p increases, the segmentation boundary leaks through the erased portions of the vertical line and gradually moves toward the distance-based segmentation produced when p = ∞.
[Figure 8.4 panels: (a) image; (b) p = 1; (c) p = 1.1; (d) p = 1.25; (e) p = 1.75; (f) p = 2; (g) p = 2.75; (h) p = ∞.]
Figure 8.4
Analysis of proximity bias in the segmentation of the image shown in (a). The segmentation boundary does not leak for p = 1 and p = 1.1. However, as p increases, the proximity bias increases and the segmentation boundary leaks.
[Figure 8.5 panels: (a) image; (b) p = 1; (c) p = 1.1; (d) p = 1.25; (e) p = 1.75; (f) p = 2; (g) p = 2.75; (h) p = ∞.]
Figure 8.5
Analysis of shrinking bias in the segmentation of a banana in the natural image shown in (a). The segmentations obtained for various p-norms are shown in (b)-(h). The tip of the banana is cut off for p = 1, 1.1, 1.25, and 1.75, with the portion being cut off gradually reducing. As p increases, the shrinking bias seems to reduce and the segmentation boundary aligns with the banana. Also, the metrication artifacts along the right boundary decrease as p increases.
that the smoothness prior, that is, p(x) in (8.3), can be viewed as a penalty on the length of the segmentation boundary [67]. For higher values of p, there is no such known relationship between the boundary length and the prior. Due to this bias, the segmentation boundary might collapse, thereby resulting in a shortcutting of the segmentation boundary.
In order to appreciate this, consider the results in figure 8.5, where the goal is to segment the top portion of a banana. The top of the banana is cut off for p = 1, 1.1, 1.25, and 1.75, with the portion being cut off gradually reducing. Also, the segmentation boundary toward the central left portion of the banana incorrectly aligns with the edges inside the banana. As p increases, the segmentation begins to align correctly with the boundary of the banana. The shrinking bias reduces as p increases.
This trend can also be seen in figure 8.6, where the goal is to segment the ultrasound image shown in figure 8.6a along the interface of the dark and bright portions of the image. The shortcutting effect (i.e., the shrinking bias) reduces as p increases.
In general, such errors caused by shortcutting of the segmentation boundary can be resolved by further user interaction. However, as mentioned earlier, higher levels of user interaction are not preferable for real-time segmentation or unsupervised segmentation.
[Figure 8.6 panels: (a) image; (b) p = 1; (c) p = 1.1; (d) p = 1.25; (e) p = 1.75; (f) p = 2; (g) p = 2.75; (h) p = ∞.]
Figure 8.6
Analysis of shrinking bias in the segmentation of the medical image shown in (a). The goal is to segment the interface between the bright and the dark regions. The segmentations obtained for various p-norms are shown in (b)-(h). As p increases, the shrinking bias seems to reduce.
8.4 Conclusion
The p-brush segmentation algorithm provides a generalized framework that includes existing algorithms such as graph cut, random walker, and shortest path as special cases for the particular values p = 1, 2, and ∞, respectively. Due to the nature of their cost functions, these algorithms have specific efficient solvers and characteristic drawbacks.
The experiments suggest that there is a correlation between the discussed biases and the p-norm. Specifically, the proximity bias increases and the shrinking bias decreases as p increases. Metrication artifacts are observed for p = 1 and p = ∞, but not for p = 2. Since x_p is continuous in p, it is conjectured that these artifacts reduce as p increases from 1 to 2, but reappear beyond some value p > 2. Due to the interplay of these issues, it cannot be determined beforehand which p-norm would be the best for a general segmentation system. However, if the system is to be used for a specific segmentation goal, an optimal value for p may be learned through a user study or by training on representative exemplar images. IRLS may be used to obtain the segmentation for several values of p with the aim of obtaining a trade-off among the properties of these three special cases. However, IRLS may be potentially slower than the efficient solvers for the cases of p = 1, 2, and ∞.
8.5 Appendix: Proof of the Infinity-Norm Optimization Theorem
In order to verify that x_∞ as defined in the hypothesis is indeed a solution of (8.12), it must satisfy the following two conditions.
1. x_{∞,i} = 1 or x_{∞,i} = 0 if the pixel i is labeled as foreground (i ∈ F) or as background (i ∈ B), respectively.
This condition can be verified very easily. Note that d^F_i = 0 and d^B_i > 0 for all the pixels i labeled as belonging to the foreground. This implies that ∀i ∈ F: x_{∞,i} = d^B_i / (d^B_i + 0) = 1. Since d^B_i = 0 and d^F_i > 0 for all the pixels i labeled as belonging to the background, this implies that ∀i ∈ B: x_{∞,i} = 0 / (0 + d^F_i) = 0.
2. If x̂_∞ is a solution to (8.12), then φ(x̂_∞) = φ(x_∞).
In order to verify this, it shall be proved that φ(x̂_∞) ≥ φ(x_∞). Since x_∞ is feasible, the definition of x̂_∞ as a solution to (8.12) gives φ(x̂_∞) ≤ φ(x_∞); together these would imply that φ(x̂_∞) = φ(x_∞), and the proof would be complete.
Since φ(x_∞) is defined as max_{e_ij∈E} (w_ij |x_{∞,i} − x_{∞,j}|), it is sufficient to show that ∀e_ij ∈ E: w_ij |x_{∞,i} − x_{∞,j}| ≤ φ(x̂_∞) in order to prove that φ(x_∞) ≤ φ(x̂_∞).
Now, for any edge e_ij ∈ E, one can conclude from the triangle inequality that |d^B_i − d^B_j| ≤ w_ij^{−1} and |d^F_i − d^F_j| ≤ w_ij^{−1}. This can be used to derive the following inequalities:

w_ij |x_{∞,i} − x_{∞,j}|
  = w_ij |d^B_i (d^F_j − d^F_i) + d^F_i (d^B_i − d^B_j)| / [ (d^B_j + d^F_j)(d^F_i + d^B_i) ]    (rearranging the terms)
  ≤ w_ij [ d^B_i |d^F_j − d^F_i| + d^F_i |d^B_i − d^B_j| ] / [ (d^B_j + d^F_j)(d^F_i + d^B_i) ]    (using the triangle inequality)
  ≤ (d^B_i + d^F_i) / [ (d^B_j + d^F_j)(d^F_i + d^B_i) ]    (since |d^B_i − d^B_j| ≤ w_ij^{−1} and |d^F_i − d^F_j| ≤ w_ij^{−1})
  = 1 / (d^B_j + d^F_j).    (8.15)
In order to complete the proof, a result from [441] will be used to show that ∀k ∈ V, 1/(d^B_k + d^F_k) ≤ φ(x̂_∞). Specifically, let π : u ⇝ v denote a path in G from node u ∈ V to v ∈ V. It was shown in [441] that

φ(x̂_∞) ≥ ( Σ_{e_ij∈π} w_ij^{−1} )^{−1},   ∀π : f ⇝ b, where f ∈ F and b ∈ B.    (8.16)

Now, consider any node k ∈ V and denote the marked nodes labeled as foreground and background that are closest to this node k as f_k ∈ F and b_k ∈ B, respectively. Denote the shortest path from f_k to k as π_{f_k,k} : f_k ⇝ k and the shortest path from k to b_k as π_{k,b_k} : k ⇝ b_k. Now, consider the path π_{f_k,b_k} : f_k ⇝ b_k from f_k to b_k that is obtained by traversing from f_k to k along π_{f_k,k} and then from k to b_k along π_{k,b_k}. By using (8.16), it can
be seen that φ(x̂_∞) ≥ ( Σ_{e_ij∈π_{f_k,b_k}} w_ij^{−1} )^{−1} = (d^F_k + d^B_k)^{−1}. Since this holds true for every node k ∈ V, it can be seen that

∀k ∈ V,  1/(d^B_k + d^F_k) ≤ φ(x̂_∞).    (8.17)

Hence it can be concluded from (8.15) and (8.17) that ∀e_ij ∈ E: w_ij |x_{∞,i} − x_{∞,j}| ≤ φ(x̂_∞). The proof is complete with this result.
Acknowledgments
The authors Dheeraj Singaraju and René Vidal would like to thank the Office of Naval Research, USA, for supporting this work through the grant ONR YIP N00014-09-1-0839. The authors would also like to thank Donald Geman for his helpful discussions about the probability model for the p-brush algorithm.
9
Bilayer Segmentation of Video
Antonio Criminisi, Geoffrey Cross, Andrew Blake, and Vladimir Kolmogorov
This chapter presents another application of Markov random fields: accurately extracting a foreground layer from video in real time. A prime application is live background substitution in teleconferencing. This demands layer separation to near computer graphics quality, including transparency determination as in video matting [95, 94], but with computational efficiency sufficient to attain live streaming speed (footnote 1).
9.1 Introduction
Layer extraction from images or sequences has long been an active area of research [31, 219, 278, 480, 510, 521, 531]. The challenge addressed here is to segment the foreground layer efficiently, without restrictions on appearance, motion, camera viewpoint, or shape, yet with sufficient accuracy for use in background substitution and other synthesis applications. Frequently, motion-based segmentation has been achieved by estimating optical flow (i.e., pixel velocities) and then grouping pixels into regions according to predefined motion models [510]. Spatial priors can also be imposed by means of graph cut [66, 266, 278, 521, 531]. However, the grouping principle generally requires some assumption about the nature of the underlying motion (translational, affine, etc.), which is restrictive. Furthermore, regularization to constrain ill-posed optical flow solutions tends to introduce undesirable inaccuracies along layer boundaries. Last, accurate estimation of optical flow is computationally expensive, requiring an extensive search in the neighborhood of each point. In our approach, explicit estimation of pixel velocities is altogether avoided. Instead, an efficient discriminative model, to separate motion from stasis by using spatiotemporal derivatives, is learned from labeled data.
Recently, interactive segmentation techniques exploiting color/contrast cues have been demonstrated to be very effective for static images [66, 401] (see also chapter 7). Segmentation based on color/contrast alone is nonetheless beyond the capability of fully automatic methods. This suggests a robust approach that fuses a variety of cues; for example, stereo,
1. This chapter is based on [112].
color, contrast, and spatial priors [258] are known to be effective and comfortably computable in real time. This chapter shows that comparable segmentation accuracy can be achieved monocularly, avoiding the need for stereo cameras with their inconvenient necessity for calibration. Efficiency with motion in place of stereo is actually enhanced, in that stereo match likelihoods no longer need to be evaluated, and the other significant computational costs remain approximately the same. Additionally, temporal consistency is imposed for increased segmentation accuracy, and temporal transition probabilities are modeled with reduction of flicker artifacts and explicit detection of temporal occlusions.
Notation and Image Observables  Given an input sequence of images, a frame is represented as an array z = (z_1, z_2, . . . , z_n, . . . , z_N) of pixels in YUV color space, indexed by the single index n. The frame at time t is denoted z^t. Temporal derivatives are denoted

ż = (ż_1, ż_2, . . . , ż_n, . . . , ż_N),    (9.1)

and, at each time t, are computed as ż^t_n = |G(z^t_n) − G(z^{t−1}_n)|, with G(·) a Gaussian kernel at the scale of σ_t pixels. Then, spatial gradients

g = (g_1, g_2, . . . , g_n, . . . , g_N), where g_n = |∇z_n|,    (9.2)

also are computed by convolving the images with first-order derivatives of Gaussian kernels with standard deviation σ_s. Here we use σ_s = σ_t = 0.8, approximating a Nyquist sampling filter. Spatiotemporal derivatives are computed on the Y color space channel only. Motion observables are denoted m = (g, ż) and are used as the raw image features for discrimination between motion and stasis.
Segmentation is expressed as an array of opacity values α = (α_1, α_2, . . . , α_n, . . . , α_N). We focus on binary segmentation (i.e., α ∈ {F, B}), with F and B denoting foreground and background, respectively. Fractional opacities are discussed briefly in section 9.3.
9.2 Probabilistic Segmentation Model
This section describes the probabilistic model for foreground/background segmentation in an energy minimization framework. This extends previous energy models for segmentation [66, 258, 401] by the addition of a second-order, temporal Markov chain prior and an observation likelihood for image motion. The posterior model is a conditional random field (CRF) [288] with a factorization that contains some recognizably generative structure, and this is used to determine the precise algebraic forms of the factors. Various parameters are then set discriminatively [284]. The CRF is denoted as

p(α^1, . . . , α^t | z^1, . . . , z^t, m^1, . . . , m^t) ∝ exp( −Σ_{t'=1}^{t} E^{t'} ),    (9.3)

where E^t = E(α^t, α^{t−1}, α^{t−2}, z^t, m^t).    (9.4)
Note the second-order temporal dependence in the Markov model (to be discussed more fully later). The aim is to estimate α^1, . . . , α^t, given the image and motion data, and in principle this would be done by joint maximization of the posterior or, equivalently, minimization of energy:

(α̂^1, . . . , α̂^t) = arg min Σ_{t'=1}^{t} E^{t'}.    (9.5)

However, such batch computation is of no interest for real-time applications because of the causality constraint: each α̂^{t'} must be delivered on the evidence from its past, without using any evidence from the future. Therefore estimation will be done by separate minimization of each term E^t (details are given later).
9.2.1 Conditional Random Field Energy Terms
The energy E^t associated with time t is a sum of terms in which likelihood and prior are not entirely separated, and so does not represent a pure generative model, although some of the terms have clearly generative interpretations. The energy decomposes as a sum of four terms:

E(α^t, α^{t−1}, α^{t−2}, z^t, m^t) = V_T(α^t, α^{t−1}, α^{t−2}) + V_S(α^t, z^t) + U_C(α^t, z) + U_M(α^t, α^{t−1}, m^t),    (9.6)

in which the first two terms are priorlike and the second two are observation likelihoods. Briefly, the roles of the four terms are the following:
• Temporal prior term V_T(. . .) is a second-order Markov chain that imposes a tendency to temporal continuity of segmentation labels.
• Spatial prior term V_S(. . .) is an Ising term that imposes a tendency to spatial continuity of labels, and is inhibited by high contrast.
• Color likelihood term U_C(. . .) evaluates the evidence for pixel labels based on color distributions in foreground and background.
• Motion likelihood term U_M(. . .) evaluates the evidence for pixel labels based on the expectation of stasis in the background and frequently occurring motion in the foreground. Motion m^t is explained in terms of the labeling at both the current frame α^t and the previous one α^{t−1}.
This energy resembles a spatiotemporal hidden Markov model (HMM), and this is illustrated graphically in figure 9.1. Details of the prior and likelihood factors are given in the remainder of this section.
Figure 9.1
Spatiotemporal hidden Markov model. This graphical model illustrates the color and motion likelihoods together
with the spatial and temporal priors. The same temporal chain is repeated at each pixel position. Spatial
dependencies are illustrated for a 4-neighborhood system.
9.2.2 Temporal Prior Term
Figure 9.2 illustrates the four kinds of temporal transitions a pixel can undergo in a bilayer scene, based on a two-frame analysis. For instance, a foreground pixel may remain in the foreground (pixels labeled FF in figure 9.2c) or move to the background (pixels labeled FB), and so on. The critical point here is that a first-order Markov chain is inadequate to convey the nature of temporal coherence in this problem; a second-order Markov chain is required. For example, a pixel that was in the background at time t − 2 and is in the foreground at time t − 1 is far more likely to remain in the foreground at time t than to go back to the background. Note that BF and FB transitions correspond to temporal occlusion and disocclusion events, and that a pixel cannot change layer without going through an occlusion event.
These intuitions are captured probabilistically and incorporated into our energy minimization framework by means of a second-order Markov chain, as illustrated in the graphical model of figure 9.1. The temporal transition priors are learned from labeled data and then stored in a table, as in figure 9.3. Despite there being eight (2^3) possible transitions, due to probabilistic normalization (p(α^t = B | α^{t−1}, α^{t−2}) = 1 − p(α^t = F | α^{t−1}, α^{t−2})) the temporal prior table has only four degrees of freedom, represented by the four parameters β_FF, β_FB, β_BF, β_BB. This leads to the following joint temporal prior term:

V_T(α^t, α^{t−1}, α^{t−2}) = η Σ_{n=1}^{N} [ −log p(α^t_n | α^{t−1}_n, α^{t−2}_n) ],    (9.7)
Figure 9.2
Temporal transitions at a pixel. (a, b) An object moves toward the right from frame t − 2 to frame t − 1. (c) Between the two frames pixels may remain in their own foreground or background layer (denoted F and B, respectively) or change layer, thus defining four kinds of temporal transitions: B→B, F→B, F→F, B→F. Those transitions influence the label that a pixel is going to assume at frame t.
Figure 9.3
Learned priors for temporal transitions. The background probabilities are the complement of the foreground ones. See text for details.
in which η < 1 is a discount factor to allow for multiple counting across nonindependent pixels. The optimal value of η (as well as of the other CRF parameters) is trained discriminatively from ground truth.
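A tiny sketch of how the learned transition table can be used to evaluate the temporal prior (9.7) is given below; the table layout and argument names are assumptions made for illustration only.

```python
# Illustrative evaluation of the temporal prior term (9.7).
import numpy as np

def temporal_prior(alpha_t, alpha_t1, alpha_t2, p_fg_table, eta=0.5):
    """alpha_*: boolean maps (True = foreground) at frames t, t-1, t-2.

    p_fg_table[a1, a2] holds the learned p(alpha_t = F | alpha_{t-1} = a1,
    alpha_{t-2} = a2); the background probability is its complement.
    """
    p_fg = p_fg_table[alpha_t1.astype(int), alpha_t2.astype(int)]
    p = np.where(alpha_t, p_fg, 1.0 - p_fg)     # normalization: p(B | ..) = 1 - p(F | ..)
    return eta * np.sum(-np.log(p))
```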
9.2.3 Ising Spatial Energy
There is a natural tendency for segmentation boundaries to align with contours of high image contrast. Similarly to [66, 401] and chapter 7, this is represented by an energy term of the form

V_S(α, z) = γ Σ_{(m,n)∈C} [α_m ≠ α_n] ( ε + e^{−μ ||z_m − z_n||^2} ) / (1 + ε),    (9.8)

where (m, n) index neighboring pixel pairs and C is the set of pairs of neighboring pixels. The contrast parameter is chosen to be μ = ( 2 ⟨ ||z_m − z_n||^2 ⟩ )^{−1}, where ⟨·⟩ denotes expectation over all pairs of neighbors in an image sample. The energy term V_S(α, z) represents a combination of an Ising prior for labeling coherence with a contrast likelihood that acts to partially discount the coherence terms. The constant γ is a strength parameter for the coherence prior and also for the contrast likelihood. The constant ε is a dilution constant for contrast; previously [66] set ε = 0 for pure color segmentation. However, multiple-cue experiments with color and stereo [258] have suggested ε = 1 as a more appropriate value.
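For reference, a compact sketch of this contrast-sensitive Ising term on a 4-connected grid follows; the value of γ is an arbitrary placeholder, ε = 1 follows the text, and the expectation defining μ is taken per neighbor direction here purely for brevity.

```python
# Sketch of the spatial prior V_S of (9.8) on a 4-connected lattice.
import numpy as np

def ising_energy(alpha, z, gamma=10.0, eps=1.0):
    """alpha: boolean label map (H, W); z: image as (H, W) or (H, W, 3) floats."""
    energy = 0.0
    for axis in (0, 1):                                    # vertical and horizontal neighbours
        dz = np.diff(z, axis=axis)
        dz2 = np.sum(dz**2, axis=-1) if z.ndim == 3 else dz**2
        mu = 1.0 / (2.0 * dz2.mean())                      # mu = (2 <||z_m - z_n||^2>)^-1, per direction
        disagree = np.diff(alpha.astype(int), axis=axis) != 0
        energy += gamma * np.sum(disagree * (eps + np.exp(-mu * dz2)) / (1.0 + eps))
    return energy
```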
9.2.4 Likelihood for Color
The term U_C(·) in (9.6) is the log of the color likelihood. In [258, 401] color likelihoods were modeled in terms of Gaussian mixture models (GMMs) in RGB, where foreground and background mixtures were learned via expectation maximization (EM). However, we have found that issues with the initialization of EM and with local minima affect the discrimination power of the final likelihood ratio. Instead, here we model the foreground and background color likelihoods nonparametrically, as histograms in the YUV color space. The color term U_C(·) is defined as

U_C(α, z) = −Σ_{n=1}^{N} log p(z_n | α_n).    (9.9)

Probabilistic normalization requires that Σ_z p(z | α = F) = 1, and similarly for the background likelihood. This nonparametric representation negates the need to set the number of GMM components, as well as to wait for EM convergence.
The foreground color likelihood model is learned adaptively over successive frames, similarly to [258], based on data from the segmented foreground in the previous frame. The likelihoods are then stored in 3D lookup tables constructed from the raw color histograms with a modest degree of smoothing to avoid overfitting. The background color distribution is constructed from an initial extended observation of the background, rather as in [405, 450], to build in variability of appearance. The distribution is then static over time. It is also shared by the entire background, to give additional robustness against camera shake, and studies suggest that the loss of precision in segmentation, compared with pixelwise color models (such as those used in [458]), should not be very great [258]. The distribution is represented as a smoothed histogram, rather than as a GMM, to avoid initialization problems.
9.2.5 Likelihood for Motion
The treatment of motion could have been addressed via an intermediate computation of optical flow. However, reliable computation of flow is expensive and beset with difficulties concerning the aperture problem and regularization. Those difficulties can be finessed in the segmentation application by bypassing flow and directly modeling the characteristics of the features normally used to compute flow: the spatial and temporal derivatives m = (g, ż). The motion likelihood therefore captures the characteristics of those features under foreground and background conditions, respectively.
However, the nature of our generative model suggests an approach to motion likelihood modeling that should capture even richer information about segmentation. Referring back to figure 9.2, the immediate history of the segmentation of a pixel falls into one of four
[Figure 9.4: two 2D histograms over spatial gradient magnitude (horizontal axis) and temporal derivative magnitude (vertical axis).]
Figure 9.4
Learned motion likelihoods. Two of the four learned likelihoods of motion data, conditioned on the segmentation in the two previous frames. Bright indicates high density, and dark, low density. The distributions are modeled as normalized histograms.
classes: FF, BB, FB, BF. We model the observed image motion features m_n^t = (g_n^t, ż_n^t), at
time t and for pixel n, as conditioned on those combinations of the segmentation labels
α_n^{t−1} and α_n^t. This is a natural model because the temporal derivative ż_n^t is computed from
frames t−1 and t, so clearly it should depend on segmentations of those frames. The
joint distributions learned for each of the four label combinations are shown in figure 9.4.
Empirically, the BB distribution reflects the relative constancy of the background state:
temporal derivatives are small in magnitude. The FF distribution reflects larger temporal
change and, as expected, that is somewhat correlated with spatial gradient magnitude. Transitional
FB and BF distributions show the largest temporal changes because the temporal
samples at times t−1 and t straddle an object boundary. The distributions for BF and FB
are distinct in shape from those for BB and FF [112], and this is one indication that the
second-order model does indeed capture additional motion information, compared with a
first-order model. (The first-order model would be conditioned on just F and B, for which
the likelihoods are essentially identical to those for FF and BB.)
The four motion likelihoods are learned from labeled ground truth data and then
stored (smoothed) as 2D histograms to use in likelihood evaluation. The likelihoods are
evaluated as part of the total energy, in the term

U^M(α^t, α^{t−1}, m^t) = −Σ_n log p(m_n^t | α_n^t, α_n^{t−1}).   (9.10)
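As a rough illustration of how such a table-driven motion energy can be evaluated, the sketch below looks up −log p(m | ·, ·) in one 2D histogram per label pair; the bin layout, the value range, and the dummy (uniform) tables are assumptions for illustration only, since the real tables are learned from labeled data.

import numpy as np

# One smoothed 2D histogram per label pair, over (|grad|, |temporal derivative|) bins.
# Dummy uniform tables stand in for the learned, smoothed histograms described above.
BINS = 64
tables = {pair: -np.log(np.full((BINS, BINS), 1.0 / BINS**2))
          for pair in ["FF", "BB", "FB", "BF"]}

def motion_energy(grad_mag, temp_deriv, prev_labels, cur_labels, max_val=70.0):
    """U^M = sum_n -log p(m_n^t | labels at t-1 and t), via table lookup."""
    gi = np.clip((grad_mag / max_val * BINS).astype(int), 0, BINS - 1)
    ti = np.clip((temp_deriv / max_val * BINS).astype(int), 0, BINS - 1)
    energy = 0.0
    for n in range(len(cur_labels)):
        pair = prev_labels[n] + cur_labels[n]   # e.g. "F" + "B" -> "FB"
        energy += tables[pair][gi[n], ti[n]]
    return energy

# toy usage on five pixels
g = np.array([3.0, 40.0, 10.0, 55.0, 2.0])
dt = np.array([1.0, 30.0, 5.0, 60.0, 0.5])
print(motion_energy(g, dt, ["B", "F", "B", "F", "B"], ["B", "F", "B", "B", "B"]))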
Illustrating the Motion Likelihoods Figure 9.5 shows the results of a likelihood ratio test
using the likelihood ratio R of the FF model versus the BB model, applied to two frames
of the VK test sequence. Motion and nonmotion events are accurately separated in textured
Figure 9.5
Testing the motion classifier. (a) A frame from the VK test sequence. (b) The corresponding motion likelihood
map as output of a likelihood ratio test. Bright pixels indicate motion (likelihood ratio R > 1) and dark ones,
stasis (R < 1). Thanks to our joint motion likelihood, strong stationary edges are assigned a lower (more negative,
darker) value of R than stationary textureless areas.
areas. In fact, moving edges are clearly marked with bright pixels (R > 1) and stationary
edges are marked with dark pixels (R < 1). However, textureless regions remain ambiguous
and are automatically assigned a likelihood ratio close to unity (mid-gray in the figure). This
suggests that motion alone is not sufficient for an accurate segmentation. Fusing motion and
color with CRF spatial and Markov chain temporal priors as in (21.5) is expected to help
fill the remaining gaps. In stereo, as opposed to motion segmentation [258], it is known that
good segmentation can be achieved even without the temporal model. However, as we show
later, the gaps in the motion likelihood also demand the temporal model for satisfactory
filling in.
9.2.6 Inference by Energy Minimization
At the beginning of this section the principal aim of estimation was stated to be the maximization
of the joint posterior (9.5). However, it was also plain that the constraints of causality
in real-time systems do not allow that. Under causality, having estimated α̂^1, . . . , α̂^{t−1}, one
way to estimate α^t would be

α̂^t = arg min_{α^t} E(α^t, α̂^{t−1}, α̂^{t−2}, z^t, m^t).   (9.11)

Freezing all estimators before generating α̂^t is an extreme approach, and better results are
obtained by acknowledging the variability in at least the immediately previous time step.
Therefore, the energy in (9.11) is replaced by the expected energy

E_{α^{t−1} | α̂^{t−1}} [E(α^t, α^{t−1}, α̂^{t−2}, z^t, m^t)],   (9.12)

where E indicates expectation and the conditional density for time t−1 is modeled as

p(α^{t−1} | α̂^{t−1}) = ∏_n p(α_n^{t−1} | α̂_n^{t−1}), and   (9.13)

p(α_n^{t−1} | α̂_n^{t−1}) = ε + (1 − ε) δ(α_n^{t−1}, α̂_n^{t−1}),   (9.14)
(a) input sequence (b) automatic background substitution in three frames
Figure 9.6
An example of automatic bilayer segmentation in monocular image sequences. The foreground person is accurately
extracted from the sequence and then composited free of aliasing upon a different background, a useful tool in
videoconferencing applications. The sequences and ground truth data used throughout this chapter are available
from [Link]
and ε (with ε ∈ [0, 1]) is the degree to which the binary segmentation at time t−1 is softened
to give a segmentation distribution. In practice, allowing ε > 0 (typically ε = 0.1)
prevents the segmentation from becoming erroneously stuck in either the foreground or the
background state.
This factorization of the segmentation distribution across pixels makes the expectation
computation (9.12) entirely tractable. The alternative of fully representing uncertainty in
segmentation is computationally too costly. Finally, the segmentation α^t is computed by
binary graph cut [66].
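The sketch below illustrates how a per-pixel term that depends on the previous label can be averaged under a softened previous segmentation before running graph cut; the simple two-point weighting used here is an illustrative normalization of (9.14), not the exact implementation.

import numpy as np

def expected_temporal_unary(e_prev_F, e_prev_B, prev_hat_is_F, eps=0.1):
    """Expected energy of a per-pixel term that depends on the previous label.

    With probability (1 - eps) the previous label equals its hard estimate,
    with probability eps it is the opposite label (illustrative weighting)."""
    p_F = np.where(prev_hat_is_F, 1.0 - eps, eps)
    return p_F * e_prev_F + (1.0 - p_F) * e_prev_B

# toy usage: three pixels, energies assuming "previous label was F" vs "was B"
e_F = np.array([0.2, 1.5, 0.7])
e_B = np.array([1.1, 0.3, 0.6])
prev = np.array([True, False, True])   # hard estimate at t-1
print(expected_temporal_unary(e_F, e_B, prev))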
9.3 Experimental Results
This section validates the proposed segmentation algorithm through comparison both with
stereo-based segmentation and with hand-labeled ground truth.²
Bilayer Segmentation and Background Substitution
In figure 9.6 foreground and background of a video sequence have been separated automatically.
After an initial period when the subject is almost stationary, the segmentation
converges to a good solution. Real-time border matting [401] has been used to compute
fractional opacities along the boundary, and this is used for anti-aliased compositing onto
a new background. Segmentation and background substitution for another test sequence
are demonstrated in figure 9.7. Notice that good segmentation is achieved even in frames
containing rapid motion, as in figures 9.1b and 9.7d. In figure 9.7 fusion of color with
motion enables correct filling of large, textureless foreground (the black jumper).
Detecting Temporal Occlusions
Figure 9.8 shows examples of temporal occlusion detection for the JM sequence, made
possible by the spatiotemporal priors. Pixels transitioning from foreground to background
are marked with a checkerboard pattern.
2. Ground truth sequences available at [Link]
Figure 9.7
Bilayer segmentation and background substitution. (a) A frame from the MS test sequence; (b–d) foreground
extraction and anti-aliased background substitution, over several frames. (d) The algorithm is capable of handling
complex motions.
Figure 9.8
Segmentation and occlusion detection. Two frames from the JM test sequence. Pixels undergoing an F → B
transition are marked with a checkerboard pattern.
Quantitative Evaluation and Comparisons
Following [258], error rates are measured as a percentage of misclassified pixels, with
respect to ground truth segmentation.³ Figure 9.9 presents the results for four of the six
Microsoft test sequences. The error rates obtained monocularly (blue) are compared with
those obtained by layered graph cut (LGC) stereo segmentation [258]. While monocular
segmentation cannot be expected to perform better than stereo, its accuracy is compara-
ble with that of LGC segmentation. Figure 9.9 provides an objective measure of visual
accuracy, while videos (on our Web site) offer a subjective impression that is hard to cap-
ture numerically. Despite some flicker artifacts, the quality of monocular segmentation is
generally convincing.
Finally, the contribution of the temporal model is evaluated. Figure 9.10 compares
error rates for the following three cases: (1) no temporal modeling, (2) first-order HMM,
(3) second-order HMM (including both the second-order temporal prior and the two-frame
motion likelihood). Error is computed for the AC test sequence with model parameters fully
optimized for best performance. Color information is omitted to avoid confounding factors
in the comparison. From figure 9.10 it is clear that the second-order HMM model achieves
3. The published ground truth based on motion rather than that based on depth.
[Figure 9.9 plots: mean error rate (%) versus frame number for Subject AC and Subject VK. Legend: Stereo + Col./Contrast + Priors [11]; Motion + Priors (no color); Motion + Col./Contrast + Priors.]
Figure 9.9
Accuracy of segmentation. Error rates for the AC and VK sequences, respectively. The thick dashed curve indicates
the error rates obtained by LGC stereo segmentation [258]. The solid curve indicates errors obtained by the proposed
monocular algorithm. In both sequences an initial period of stasis prevents correct segmentation, but after a few
frames the errors drop to a value close to that of LGC stereo. After the model is burned in, it can tolerate periods
of stasis. Omitting the color component of the model increases the error (thin dotted line).
[Figure 9.10 plot: mean error rate (%) versus frame number, comparing no HMM, 1st-order HMM, and 2nd-order HMM.]
Figure 9.10
The advantage of the second-order temporal model. Error plots for different orders of temporal HMM, for the AC
test sequence. Crosses indicate error averaged over all frames. Averages were computed from frame 10 onward
to exclude the burn-in period. The second-order model achieves the lowest error.
the lowest error, followed by the first-order model, with highest error occurring when the
temporal model is entirely absent.
Robustness to Photometric Variations
Accurate segmentation of all six test sequences in the Microsoft repository has proved difficult
in view of particularly large photometric variability in some sequences. The variations
have been found to be due mostly to camera AGC (automatic gain control). We found
that the IJ and IU sequences exhibit illumination variation about an order of magnitude
higher than in the remaining four sequences [112]. While stereo-based segmentation is
relatively immune to such problems [258], monocular algorithms are more prone to be dis-
turbed. However, such large levels of photometric variation are easily avoided in practice
by switching off the AGC facility. More examples, results, and comparisons are reported
in [112].
9.4 Conclusions
This chapter has presented an algorithm for the accurate segmentation of videos by proba-
bilistic fusion of motion, color, and contrast cues together with spatial and temporal priors.
The model forms a conditional random field, and its parameters are trained discriminatively.
The motion component of the model avoids the computation of optical flow, and instead
uses a novel and effective likelihood model based on spatiotemporal derivatives and condi-
tioned on frame pairs. Spatiotemporal coherence is exploited via a contrast-sensitive Ising
energy combined with a second-order temporal Markov chain.
In terms of efficiency our algorithm compares favorably with existing real-time
stereo techniques [258] and achieves comparable levels of accuracy. Computationally
intensive evaluation of stereo match scores is replaced by efficient motion likelihood and
color model evaluation, using efficient lookup tables.
Quantitative evaluation has confirmed the validity of the proposed approach and has
highlighted advantages and limitations with respect to stereo-based segmentation. Finally,
combining the proposed motion likelihoods and second-order temporal model with stereo
matching information may well, in the future, lead to greater levels of accuracy and robust-
ness than either motion or stereo alone.
Acknowledgments
The authors acknowledge helpful discussions with C. Rother, M. Cohen, C. Zhang and
R. Zabih.
10
MRFs for Superresolution and Texture Synthesis
William T. Freeman and Ce Liu
Suppose we want to digitally enlarge a photograph. The input is a single low-resolution
image, and the desired output is an estimate of the high-resolution version of that image.
This problem can be phrased as one of image interpolation: interpolating the pixel values
between observed samples. Image interpolation is sometimes called superresolution, since it
estimates data at a resolution beyond that of the image samples. In contrast with multi-image
superresolution methods, where a high-resolution image is inferred from a video sequence,
here high-resolution images are estimated from a single low-resolution example [150].
There are many analytic methods for image interpolation, including pixel replication,
linear and cubic spline interpolation [378], and sharpened Gaussian interpolation [426].
When interpolating in resolution by a large factor, such as four or more in each dimension,
these analytic methods typically suffer from a blurred appearance. Following a simple rule,
they tend to make conservative, smooth guesses for image appearance.
This problem can be addressed with two techniques. The first is to use an example-
based representation to handle the many expected special cases. Second, a graphical model
framework can be used to reason about global structure. The superresolution problem
has a structure similar to other low-level vision tasks: accumulate local evidence (which
may be ambiguous) and propagate it across space. An MRF is an appropriate structure
for this: local evidence terms can be modeled by unary potentials φ_i(x_i) at a node i
with states x_i. Spatial propagation occurs through pairwise potentials ψ_ij(x_i, x_j) between
nodes i and j, or through higher-order potentials. The joint probability then has the factorized
form

P(x) = (1/Z) ∏_i φ_i(x_i) ∏_{(i,j)∈E} ψ_ij(x_i, x_j),   (10.1)
where E is the set of edges in the MRF denoted by the neighboring nodes i and j, and Z
is a normalization constant such that the probabilities sum to 1 [309]. The local statistical
relationships allow information to propagate long distances over an image.
10.1 Image Prefiltering
The superresolution algorithm first specifies the desired model of subsampling and image
degradation that needs to be undone. For the examples in this chapter the degradation
is assumed to be low-pass filtering, followed by subsampling by a factor of 4 in each
dimension, to obtain the observed low-resolution image. The low-pass filter is a 7 × 7 pixel
Gaussian filter, normalized to have unit sum, of standard deviation 1 pixel. Starting from
a high-resolution image, it is blurred and subsampled to generate the corresponding low-
resolution image. This model is applied to a set of training images, in order to generate
some number of paired examples of high-resolution and low-resolution image patch pairs.
It is convenient to handle the high- and low-resolution images at the same sampling
rate, that is, the same number of pixels. After creating the low-resolution image, we perform an
initial interpolation up to the sampling rate of the full-resolution image. Usually this is done
with cubic spline interpolation, to create the up-sampled low-resolution image.
We want to exploit whatever invariances we can, to let the training data generalize beyond
the training examples. Two heuristics are used to try to extend the reach of the examples.
First, we do not believe that all spatial frequencies of the low-resolution image are needed
to predict the missing high-frequency image components, and storing a different example
patch for each possible value of the low-frequency components of the low-resolution patch
is undesirable. Thus a low-pass filter is applied to the up-sampled low-resolution image in
order to divide it into two spatial frequency bands. We call the output of the low-pass filter
the low-band, L; the up-sampled low-resolution image minus the low-band image gives
the mid-band, M. The difference between the up-sampled low-resolution image and the
original image is the high-band, H.
A second operation to increase the scope of the examples is contrast normalization. It is
assumed that the relationship of the mid-band data to the high-band data is independent of
the local contrast level, so the contrast of the mid- and high-band images is normalized in
the following way:
[M̃, H̃] = [M, H] / (std(M) + ε),   (10.2)

where std(·) is the standard deviation operator, and ε is a small value that sets the local
contrast level below which we do not adjust the contrast. Typically, ε = 0.0001 for images
that range over 0 to 1.
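A minimal sketch of (10.2), assuming patches are stored as small arrays and that mid- and high-band patches are normalized jointly:

import numpy as np

def contrast_normalize(mid_patch, high_patch, eps=1e-4):
    """Jointly normalize a mid-band / high-band patch pair as in (10.2):
    both are divided by (std(mid_patch) + eps), so the relationship between
    them is preserved while the local contrast level is factored out."""
    scale = np.std(mid_patch) + eps
    return mid_patch / scale, high_patch / scale

# toy usage on random 9x9 patches (images assumed to range over 0 to 1)
rng = np.random.default_rng(0)
m, h = rng.random((9, 9)) - 0.5, rng.random((9, 9)) - 0.5
m_n, h_n = contrast_normalize(m, h)
print(np.std(m), np.std(m_n))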
10.2 Representation of the Unknown State
There is a choice about what is estimated at the nodes of the MRF. If the variable to be
estimated at each node is a single pixel, then the dimensionality of the unknown state at a
node is low, which is good. However, it may not be feasible to draw valid conclusions about
Figure 10.1
Top: input patch (mid-band band-pass filtered, contrast-normalized). We seek the high-resolution patch associated
with this. Middle: Nearest neighbors from database to the input patch. These patches match this input patch
reasonably well. Bottom: The corresponding high-resolution patches associated with each of the retrieved mid-
band band-pass patches. These show more variability than the mid-band patches, indicating that more information
than simply the local image matches is needed to select the proper high-resolution image estimate. Since the
resolution requirements for the color components are lower than for luminance, we use an example-based approach
for the luminance, and interpolate the color information by a conventional cubic spline interpolation.
single pixel states simply by performing computations between pairs of pixels. That may
place an undue burden on the MRF inference. That burden can be removed if a large patch
of estimated pixels is assigned to one node, but then the state dimensionality at a node may
be unmanageably high.
This is addressed by working with entire image patches at each node, to provide sufficient
local evidence, but using other means to constrain the state dimensionality at a node. First,
restrict the solution patch to be one of some number of exemplars, typically image examples
from some training set. In addition, take advantage of local image evidence to further
constrain the choice of exemplars to be from some smaller set of candidates from the training
set. The result is an unknown state dimension of twenty to forty states per node.
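A brute-force sketch of this candidate selection step, assuming the mid-band training patches are stored as rows of a matrix; the database size and patch dimensions below are toy values, not those of the actual training set:

import numpy as np

def candidate_states(input_patch, db_mid, k=30):
    """Return indices of the k training patches whose (contrast-normalized,
    band-passed) mid-band appearance is closest to the input patch."""
    diffs = db_mid - input_patch.ravel()          # (N, D) broadcasted differences
    dists = np.einsum('nd,nd->n', diffs, diffs)   # squared L2 distances
    return np.argsort(dists)[:k]

# toy usage: a database of 1000 random mid-band patches of dimension 7x7
rng = np.random.default_rng(1)
db_mid = rng.standard_normal((1000, 49))
query = rng.standard_normal((7, 7))
print(candidate_states(query, db_mid, k=5))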
Figure 10.1 illustrates this representation. The top row shows an input patch from the
(band-passed, contrast-normalized) low-resolution input image. The next two rows show
the thirty nearest-neighbor examples from a database of 658,788 image patches extracted
from forty-one images. The low-res patches are of dimension 25 × 25, and the high-res
patches are of dimension 9 × 9. The bottom two rows of figure 10.2 show the corresponding
high-resolution image patches for each of those thirty nearest neighbors. The mid-band
images look approximately the same as each other and as the input patch, while the high-resolution
patches look considerably different from each other. This tells us that the local
information from the patch by itself is not sufficient to determine the missing high-resolution
information, and some other source of information must be used to resolve the ambiguity.
The state representation is then an index into a collection of exemplars, indicating which of
the unknown high-resolution image patches is the correct one, as illustrated in figure 10.2.
The resulting MRF is shown in figure 10.3.
[Figure 10.2 sketch: an observed image patch and its underlying candidate scene patches, each of which renders to the image patch.]
Figure 10.2
The state to be estimated at each node. Using the local evidence, at each node we have a small collection of image
candidates, selected from our database. We use belief propagation to select among the candidates, based on
compatibility information.
[Figure 10.3 diagram: a grid of hidden nodes x_i (high-band patches) linked to observations y_i (mid-band patches) by φ(x_i, y_i), and to neighboring hidden nodes by ψ(x_i, x_j).]
Figure 10.3
Patch-based MRF for low-level vision. The observations y_i are patches from the mid-band image data. The states
to be estimated are indices into a data set of high-band patches.
10.3 MRF Parameterization
We can define a local evidence term and pairwise potentials of the Markov random field if
we make assumptions about the probability of encountering a training set exemplar in the
test image. We assume that any of our image exemplars can appear in the input image with
equal probability. We account for differences between the input and training set patches
as independent, identically distributed Gaussian noise added to every pixel. Then the local
evidence for a node being in sample state x_i depends on the amount of noise needed to
translate from the low-resolution patch corresponding to state x_i to the observed mid-band
image patch, p. If we denote the band-passed, contrast-normalized mid-band training patch
associated with state x_i as M(x_i), then

φ_i(x_i) = exp(−|p − M(x_i)|² / (2σ²)),   (10.3)

where we write 2D image patches as rasterized vectors.
To construct the compatibility term ψ_ij(x_i, x_j), we assume we have overlapping high-band
patches that should agree with their neighbors in their regions of overlap. Any disagreements
are again attributed to a Gaussian noise process. If we denote the band-passed, contrast-normalized
high-band training patch associated with state x_i as H(x_i), and introduce an
operator O_ij that extracts as a rasterized vector the pixels of the overlap region between
patches i and j (with the ordering compatible for neighboring patches), then we have

ψ_ij(x_i, x_j) = exp(−|O_ij(H(x_i)) − O_ji(H(x_j))|² / (2σ²)).   (10.4)

In the examples below, we used a mid-band and high-band patch size of 9 × 9 pixels, and
a patch overlap region of size 3 pixels.
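The following sketch evaluates (10.3) and (10.4) for one node and one horizontally adjacent pair, assuming patches are stored as arrays, a noise level σ, and a fixed 3-pixel overlap on the shared border; the arrangement of the overlap region is an illustrative assumption:

import numpy as np

SIGMA = 1.0          # noise level; an illustrative value
PATCH, OVERLAP = 9, 3

def local_evidence(p, mid_candidates):
    """phi_i(x_i) of (10.3): agreement of each candidate's mid-band patch M(x_i)
    with the observed mid-band patch p (patches rasterized as vectors)."""
    d2 = ((mid_candidates - p.ravel()) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * SIGMA ** 2))

def compatibility(high_i, high_j):
    """psi_ij(x_i, x_j) of (10.4) for a horizontally adjacent pair: patch j lies
    to the right of patch i, so i's last OVERLAP columns must match j's first."""
    o_i = high_i[:, -OVERLAP:].ravel()
    o_j = high_j[:, :OVERLAP].ravel()
    d2 = ((o_i - o_j) ** 2).sum()
    return np.exp(-d2 / (2 * SIGMA ** 2))

# toy usage
rng = np.random.default_rng(2)
obs = rng.standard_normal((PATCH, PATCH))
cands = rng.standard_normal((30, PATCH * PATCH))   # 30 candidate mid-band patches
print(local_evidence(obs, cands).shape)
print(compatibility(rng.standard_normal((PATCH, PATCH)),
                    rng.standard_normal((PATCH, PATCH))))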
10.4 Loopy Belief Propagation
We have set up the Markov random field such that each possible selection of states at
each node corresponds to a high-resolution image interpretation of the input low-resolution
image. The MRF probability, derived from all the local evidence and pairwise potentials in
the MRF, assigns a probability to each possible selection of states according to (10.1). Each
configuration of states specifies an estimated high-band image, and we seek the high-band
image that is most favored by the MRF we have specified. This is the task of finding a point
estimate from a posterior probability distribution.
In Bayesian decision theory [32] the optimal point estimate depends on the loss function
used, that is, the penalty for guessing wrong. With a penalty proportional to the square of the error,
the best estimate (MMSE) is the mean of the posterior. However, if all deviations from the
true value are equally penalized, then the best estimate is the maximum of the posterior.
Using belief propagation [367], both estimates can be calculated exactly for an MRF that
is a tree. The use of BP on a graph with loops or cycles in this way has been described fully
in chapters 1 and 5. As described earlier, the BP update and marginal probability equations
are applied, and give an approximation to the marginals. The message updates are run until
convergence or for a xed number of iterations (here, we used thirty iterations). Fixed points
of these iterative update rules correspond to stationary points of a well-known approxima-
tion used in statistical physics, the Bethe approximation [538]. Good empirical results have
been obtained with that approximation [153, 150], and we use it here.
To approximate the MMSE estimate, take the mean (weighted by the marginals from
(5.18)) of the candidate patches at a node. It is also possible to approximate the MAP
estimate by replacing the summation operator of (5.17) with max, then selecting the patch
maximizing the resulting max-marginal given in (5.18). These solutions are often sharper,
but with more artifacts, than the MMSE estimate.
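A small sketch of both point estimates at a single node, assuming the (max-)marginals over the candidate patches have already been computed by belief propagation:

import numpy as np

def mmse_patch(marginals, high_candidates):
    """Approximate MMSE estimate at one node: mean of the candidate high-band
    patches, weighted by the (approximate) BP marginals."""
    w = marginals / marginals.sum()
    return np.tensordot(w, high_candidates, axes=1)

def map_patch(max_marginals, high_candidates):
    """Approximate MAP estimate: the single candidate with the largest
    max-marginal (max-product BP)."""
    return high_candidates[np.argmax(max_marginals)]

# toy usage: 30 candidate 9x9 high-band patches at a node
rng = np.random.default_rng(3)
cands = rng.standard_normal((30, 9, 9))
b = rng.random(30)
print(mmse_patch(b, cands).shape, map_patch(b, cands).shape)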
To piece together the final image, undo the contrast normalization of each patch, average
neighboring patches in regions where they overlap, and add in the low- and mid-band images and
the analytically interpolated chrominance information. Figure 10.4 summarizes the steps in
Figure 10.4
Images showing example-based superresolution processing. (a) Input image of resolution 120 × 80. (b) Cubic
spline interpolation up to a factor of 4 higher resolution in each dimension. (c) Extract the luminance component
for example-based processing (and use cubic spline interpolation for the chrominance components). (d) High-pass
filtering of this image gives the mid-band output, shown here. (e) Display of the contrast-normalized mid-band.
The contrast normalization extends the utility of the training database samples beyond the contrast value of each
particular training example. (f) The high frequencies corresponding to the nearest neighbor of each local low-frequency
patch. (g) After one iteration of belief propagation, much of the choppy, high-frequency detail of
(f) is removed. (h) Converged high-resolution estimates. (i) Image (c) added to image (h): the estimated high
frequencies added back to the mid- and low-frequencies. (j) Color components added back in. (k) Comparison
with ground truth. (l) True high-frequency components.
the algorithm, and figure 10.5 shows other results. The perceived sharpness is significantly
improved, and the belief propagation iterations significantly reduce the artifacts that would
result from estimating the high-resolution image based on local image information alone.
(Figure 10.6 provides enlargements of cropped regions from figures 10.4 and 10.5.) The
code used to generate the images in this section is available online.¹
10.5 Texture Synthesis
This same example-based MRF machinery can be applied to other low-level vision tasks
[150]. Another application involving image patches in Markov random fields is texture
synthesis. Here, the input is a small sample of a texture to be synthesized. The output is a
1. Download at [Link]
Figure 10.5
Other example-based superresolution outputs. (a) Input low-resolution images. (b) Bicubic interpolation
(4× resolution increase). (c) Belief propagation output. (d) The true high-resolution images.
Figure 10.6
The close-ups of figures 10.4 and 10.5. (a) Input low-res images. (b) Bicubic interpolation (4× resolution increase).
(c) Nearest neighbor output. (d) Belief propagation output. (e) The true high-resolution images.
larger portion of that texture, having the same appearance but not made by simply repeating
the input texture.
Nonparametric texture methods have revolutionized texture synthesis. Notable examples
include Heeger and Bergen [187], De Bonet [59], and Efros and Leung [133]. However,
these methods can be slow. To speed them up and address some image quality issues, Efros
and Freeman [132] developed a nonparametric patch-based method; a related method was
developed independently by Liang et al. [313]. This is another example of the patch-based,
nonparametric Markov random field machinery described above for the superresolution
problem.
[Figure 10.7 panels: an input texture block; (a) random placement of blocks; (b) neighboring blocks constrained by overlap; (c) minimum error boundary cut.]
Figure 10.7
Patch samples of an input texture can be composited to form a larger texture in a number of different ways. (a) A
random placement of texture samples gives strong patch boundary artifacts. (b) Selecting only patches that match
well with neighbors in an overlap region leaves some boundary artifacts in the composite image. (c) Selecting the
best seam through the boundary region of neighboring patches removes most artifacts. Figure reprinted from [132].
For texture synthesis the idea is to draw patch samples from random positions within the
source texture, then piece them together seamlessly. Figure 10.7, from [132], tells the story.
In (a), random samples are drawn from the source texture and placed in the synthesized
texture. With random selection the boundaries between adjacent texture blocks are quite
visible. (b) shows texture synthesis with overlapping patches selected from the input texture
to match the left and top borders of the texture region that has been synthesized so far. The
border artifacts are greatly suppressed, yet some are still visible. (c) shows the result of
adding an additional step to the processing of (b): select an optimal ragged boundary using
image quilting (described below).
There is an MRF implied by the model above, with the same ψ_ij compatibility term
between neighboring patches as for the superresolution problem. For this texture synthesis
problem there is no local evidence term.² This makes it nearly impossible to solve the
problem using belief propagation, since there is no small list of candidate patches available
at each node. The state dimension cannot be reduced to a manageable level.
As an alternative, a greedy algorithm, described in detail in [132], approximates the
optimal assignment of training patch to MRF node. The image is processed in a raster
scan fashion, top-to-bottom in rows, left-to-right within each row. Except at the image
boundaries, there are two borders with patches filled in for any patch we seek to select. To
add a patch, randomly select a patch from the source texture from the top five matches to
the top and left boundary values. This algorithm can be thought of as a particularly simple,
approximate method to find the patch assignments that maximize the MRF of (10.1), where
2. A related problem, texture transfer [132], includes local evidence constraints.
Figure 10.8
A collection of source (small image) and corresponding synthesized textures made using the patch-based image-
quilting method. Figure reprinted from [132].
the pairwise compatibilities ψ_ij(x_i, x_j) are as for superresolution, but there are no local
evidence terms φ_i(x_i). Figure 10.8 shows nine examples of textures synthesized from input
examples (shown in the smaller images to the left of each synthesis example). The examples
exhibit the perceptual appearance of the smaller patches but are synthesized in a realistic
nonrepeating pattern.
10.5.1 Image Quilting by Dynamic Programming
To return to the goal of finding the optimal ragged boundary between two patches, we seek
the optimal tear to minimize the visibility of artifacts caused by differences between the
neighboring patches. The algorithm for finding the optimal tear in a vertical region has an
obvious extension to a horizontal tear. Let the difference between two adjacent patches in
the region of overlap be d(i, j), where i and j are horizontal and vertical pixel coordinates,
respectively. For each row, seek the column q(j) of an optimal path of tearing between the
two patches. This optimal path should follow a contour of small difference values between
the two patches. Now minimize

q = arg min_{q(j)} Σ_{j=1}^{K} d(q(j), j)²   (10.5)

under the constraint that the tear forms a continuous line, |q(j) − q(j−1)| ≤ 1.
This optimal path problem has a well-known solution through dynamic programming
[36], which has been exploited in various vision and graphics applications [430, 132]. It
is equivalent to finding the maximum posterior probability through max-product belief
propagation. Here is a summary of the algorithm:
Initialization:
p(i,1) = d(i,1)
for j = 2:N
    p(i,j) = d(i,j) + min_k p(k,j-1)
end
where the values considered for the minimization over k are i and i ± 1. Using an auxiliary
set of pointers indicating the optimal value of the min_k operation at each iteration, the path
q(j) can be found from the values of p(i, j). This method has also been used to hide image
seams in seam carving [17].
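For concreteness, here is a compact, runnable version of this dynamic program (a sketch, not the code of [132]); it assumes the overlap-difference image is stored with rows indexed by j and columns by i:

import numpy as np

def min_error_boundary_cut(d):
    """Minimum-cost vertical tear through an overlap-difference image d (rows =
    j, columns = i), with |q(j) - q(j-1)| <= 1. Returns the column index q(j)
    of the tear in each row."""
    rows, cols = d.shape
    cost = d[0].astype(float).copy()
    back = np.zeros((rows, cols), dtype=int)
    for j in range(1, rows):
        new_cost = np.empty(cols)
        for i in range(cols):
            lo, hi = max(i - 1, 0), min(i + 2, cols)
            k = lo + np.argmin(cost[lo:hi])     # best predecessor among i-1, i, i+1
            back[j, i] = k
            new_cost[i] = d[j, i] + cost[k]
        cost = new_cost
    q = np.empty(rows, dtype=int)
    q[-1] = int(np.argmin(cost))
    for j in range(rows - 2, -1, -1):
        q[j] = back[j + 1, q[j + 1]]
    return q

# toy usage: squared differences between two overlapping 3-pixel-wide strips
rng = np.random.default_rng(4)
overlap_diff = rng.standard_normal((9, 3)) ** 2
print(min_error_boundary_cut(overlap_diff))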
10.6 Some Related Applications
Markov random fields have been used extensively in image processing and computer vision.
Geman and Geman brought Markov random fields to the attention of the vision community
and showed how to use MRFs as image priors in restoration applications [161]. Poggio
et al. used MRFs in a framework unifying different computer vision modules [374].
The example-based approach has been built on by others. This method has been used in
combination with a resolution enhancement model specific to faces [20] to achieve excellent
results in hallucinating details of faces [316]. Huang and Ma have proposed finding a linear
combination of the candidate patches to fit the input data, then applying the same regression
to the output patches, simulating a better fit to the input [533]. (A related approach was used
in [151].)
Optimal seams for image transitions were found in a 2D framework, using graph cut, by
Kwatra et al. [287]. Example-based image priors were used for image-based rendering in
the work of Fitzgibbon et al. [145]. Fattal used edge models for image up-sampling [135].
Glasner et al. also used an example-based approach for superresolution, relying on self-
similarity within a single image [167].
Acknowledgments
William T. Freeman thanks the co-authors of the original research papers which are described
in this chapter: Egon Pasztor, Alexi Efros, and Owen Carmichael. He acknowledges support
from Royal Dutch/Shell Group, NGA NEGI-1582-04-0004, MURI Grant N00014-06-1-
0734 and gifts from Microsoft, Adobe, and Texas Instruments.
11
A Comparative Study of Energy Minimization Methods for MRFs
Richard Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir
Kolmogorov, Aseem Agarwala, Marshall F. Tappen, and Carsten Rother
Many of the previous chapters have addressed the task of expressing computer vision prob-
lems, such as depth or texture computation, as discrete labeling problems in the form of a
Markov Random Field. The goal of finding the best solution to the problem at hand then
refers to the task of optimizing an energy function over a discrete domain with discrete binary
or multilabel variables. Many different optimization, or inference, techniques have been
discussed in previous chapters (see part I of the book). The focus of this chapter is to analyze
the trade-offs among different energy minimization algorithms for different application sce-
narios. Three promising recent optimization techniques are investigated: graph cut, LBP,
and tree-reweighted message passing (TRW), in addition to the well-known older iterated
conditional modes (ICM) algorithm. The main part of this chapter investigates applica-
tions where the MRF has a 4-connected grid graph structure and multilabeled variables.
The applications are stereo without occlusion, image stitching, interactive segmentation,
and denoising. The study is based on [464], for which benchmarks, code, images, and results
are available online at [Link] The main conclusion of the study
is that some existing techniques, in particular graph cut and TRW, work very well for many
applications. At the end of this chapter a different study is briefly discussed that considers
the application scenario of stereo with occlusions. Here, the conclusion is different, since
the resulting highly connected MRF with multilabeled variables is hard to optimize, and
only graph cut-based techniques work well.
11.1 Introduction
Over the last few years energy minimization approaches have had a renaissance, primarily
due to powerful new optimization algorithms such as graph cut [72, 266] and loopy belief
propagation (LBP) [540]. The results, especially in stereo, have been dramatic; according
to the widely used Middlebury stereo benchmarks [419], almost all the top-performing
stereo methods rely on graph cut or LBP. Moreover, these methods give substantially more
accurate results than were previously possible. Simultaneously, the range of applications
of pixel-labeling problems has also expanded dramatically (see part II and part V);
examples are image restoration [38], texture modeling [162], image labeling [93], stereo
matching [26, 72], interactive photo segmentation [66, 401], and the automatic placement
of seams in digital photomontages [7].
Relatively little attention has been paid, however, to the relative performances of various
optimization algorithms. Among the few exceptions are [68, 468, 260, 403]. In [68] the
efficiency of several different max-flow algorithms for graph cut is compared, and [468]
compares graph cut with LBP for stereo on a 4-connected grid graph. The main part of this
chapter also considers problems on 2D grid graphs, since they occur in many real-world
applications. In [260] a model with a more complex graph topology is considered: the
problem of stereo with occlusion, which can be expressed as a highly connected graph. The
main conclusions of this study will be summarized here. The study in [403] considers appli-
cations with a rather different type of Markov random field. In contrast to stereo, where
pairwise terms of neighboring nodes encode a smoothness prior, that is, prefer to have the
same labels, several problems in computer vision are of a different nature.
In domains such as new view synthesis, superresolution, or image deconvolution, pair-
wise terms are often of a repulsive nature, that is, neighboring nodes prefer to have different
labels. The study in [403] considered a range of such problems, with the limitation of allow-
ing only binary labels. These problems are in general NP-hard, which means that graph
cut-based techniques are not directly applicable and more recent methods, such as the graph
cut-based BHS algorithm (referred to as QPBO in [403]), have to be used instead (see details
in chapter 2). Probably the most interesting insight of the study in [403] is that the relative
performance of different techniques depends heavily on the connectivity of the graph. For
low-connectivity graphs, advanced methods such as BP and the BHS algorithm perform
very well. However, for highly connected graphs, where each node is connected to fifteen
(up to eighty) other nodes, older techniques such as simulated annealing perform extremely
well, even outperforming all other techniques for one example. This is in sharp contrast to
the study presented in this chapter and also motivates a future large-scale comparison of
arbitrary, multilabeled MRFs.
The chapter is organized as follows. Section 11.2 defines the energy function and evaluation
methodology. In section 11.3 different optimization techniques are discussed. Section
11.4 presents the benchmark problems. Section 11.5 provides the experimental comparisons
of different energy minimization methods. Section 11.6 introduces the more advanced model
of stereo with occlusion, and an experimental comparison between the two models, stereo with
and without occlusion, is given.
11.2 Problem Formulation and Experimental Infrastructure
11.2.1 Energy Model
We define a pixel-labeling problem as assigning to every pixel i a label that we write as l_i.
The collection of all pixel label assignments is denoted by l, the number of pixels is N,
and the number of labels is K. Using the same notation as in earlier chapters, the energy
function E, which can also be viewed as the negative log-likelihood of the posterior distribution
of a Markov random field [161, 308], is composed of a data energy U and a smoothness
energy V (i.e., E = U + λV). The data energy is the sum of a set of per-pixel data costs
θ_i(l_i), that is, U = Σ_i θ_i(l_i), which typically comes from the (negative) log-likelihood
of the measurement noise.
We assume that pixels form a 2D grid, so that each i can also be written in terms
of its coordinates i = (p, q). We use the standard 4-connected neighborhood system, so
that the smoothness energy is the sum of spatially varying horizontal and vertical nearest-neighbor
smoothness costs. If we let N denote the set of all such neighboring pixel pairs,
the smoothness energy is V = Σ_{{i,j}∈N} θ_ij(l_i, l_j). Here {i, j} stands for an unordered set.
In the MRF framework the smoothness energy typically comes from the negative log
of the prior. In this chapter we consider a general form of the smoothness costs, where
different pairings of adjacent labels can lead to different costs. This is important in a number
of applications, for example, image stitching and texture quilting [7, 132, 287].
A more restricted form of the smoothness energy is

V = Σ_{{i,j}∈N} w_ij ρ(|l_i − l_j|),   (11.1)

where the smoothness terms are the product of spatially varying, per-pairing weights w_ij and
a nondecreasing function of the label difference ρ(Δl) = ρ(|l_i − l_j|). Such energy functions
typically arise in stereo matching [72] and image denoising. Though we could represent
ρ using a K-valued lookup table, for simplicity we instead parameterize ρ using a simple
clipped monomial form ρ(Δl) = min(|Δl|^r, ρ_max), with r ∈ {1, 2}. If we set ρ_max = 1.0,
we get the Potts model, ρ(Δl) = 1 − δ(Δl), which penalizes any pair of different labels
uniformly (δ is the unit impulse function).
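A small sketch of this family of smoothness costs; the function and parameter names mirror the notation above but are otherwise illustrative:

import numpy as np

def smoothness_cost(li, lj, w_ij=1.0, r=1, rho_max=2.0):
    """Clipped monomial smoothness term: w_ij * min(|li - lj|**r, rho_max).
    With rho_max = 1.0 this reduces to the Potts model."""
    return w_ij * min(abs(li - lj) ** r, rho_max)

# truncated linear (r=1), truncated quadratic (r=2), and Potts examples
print(smoothness_cost(3, 7, r=1, rho_max=2.0))   # 2.0 (clipped)
print(smoothness_cost(3, 4, r=2, rho_max=7.0))   # 1.0
print(smoothness_cost(3, 7, r=1, rho_max=1.0))   # 1.0 -> Potts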
Depending on the choice of ρ and the number of labels, a number of important special
cases exist with fast and exact algorithms (see chapters 1, 2, 4). In this chapter, however,
the class of energy functions is quite broad and, in general, NP-hard to optimize. Also,
not all energy minimization methods can handle the entire class. For example, acceleration
techniques based on distance transforms [141] can significantly speed up message-passing
algorithms such as LBP or TRW, yet these methods are applicable only to certain smoothness
costs ρ. Other algorithms, such as graph cut, have good theoretical guarantees only for
certain choices of ρ (see chapter 3). In this chapter we assume that any algorithm can run on
any benchmark problem; this can generally be ensured by reverting to a weaker or slower
version of the algorithm, if necessary, for a particular benchmark.
11.2.2 Evaluation Methodology and Software Interface
The algorithms were implemented in C or C++ and all experiments were run on the
same machine (Pentium 4, 3.4 GHz, 2 GB RAM). A standard software interface (API) was
designed that allows a user to specify an energy function E and to easily call a variety of
energy minimization methods to minimize E. (For details see [464].)
11.3 Energy Minimization Algorithms
Most optimization methods used in this study are discussed in other chapters. In particular,
Iterated Conditional Modes (ICM) [38] is explained in chapter 1. The graph cut-based swap-
move and expansion-move algorithms [72] are described in chapter 3. Max-Product Loopy
belief propagation (LBP) is explained in chapters 1 and 5, based on [367, 154, 150]. For a
detailed explanation of Tree-Reweighted Message Passing (TRW) the reader is referred to
[504, 255] and chapters 5 and 6. In the following, some aspects of the exact implementation
of each method are outlined (see [464] for a more general description).
11.3.1 Iterated Conditional Mode (ICM)
It is well known that ICM, that is, coordinate descent, is extremely sensitive to the initial
estimate, especially in high-dimensional spaces with nonconvex energies (such as arise in
vision), due to the huge number of local minima. In this study ICM was initialized in a
winner-take-all manner by assigning each pixel the label with the lowest data cost.
11.3.2 Graph Cut-Based Move-Making Methods
For graph cut-based techniques, it can occur that the energy does not obey the submodularity
constraint (see details in chapter 3). In short, for the expansion-move algorithm, the
submodularity constraint holds if for all labels α, β, and γ

θ_ij(α, α) + θ_ij(β, γ) ≤ θ_ij(α, γ) + θ_ij(β, α).   (11.2)
If the constraint is violated, the truncation procedure of [404] is performed; however,
it is no longer guaranteed that the optimal labeling is found. Note that for the energy
functions used in this study, only the expansion-move, and not the swap-move, algorithm
sometimes requires truncation. In practice, this technique seems to work well, probably
because relatively few terms have to be truncated (see details in section 11.5).
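The check in (11.2) is easy to state in code. The sketch below tests a pairwise cost table for a given expansion label; it is a toy illustration, not the truncation procedure of [404] itself:

import numpy as np

def expansion_is_submodular(theta, alpha):
    """Check (11.2) for every pair (beta, gamma): an alpha-expansion move is
    submodular if theta[a,a] + theta[b,g] <= theta[a,g] + theta[b,a] for all
    labels b, g, where a is the expansion label alpha."""
    K = theta.shape[0]
    a = alpha
    for b in range(K):
        for g in range(K):
            if theta[a, a] + theta[b, g] > theta[a, g] + theta[b, a] + 1e-12:
                return False
    return True

# truncated linear costs satisfy the constraint; a random table usually does not
labels = np.arange(4)
trunc_linear = np.minimum(np.abs(labels[:, None] - labels[None, :]), 2)
print(expansion_is_submodular(trunc_linear, alpha=0))        # True
rng = np.random.default_rng(5)
print(expansion_is_submodular(rng.random((4, 4)), alpha=0))  # very likely False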
The main computational cost of graph cut lies in computing the minimum cut, which is
done via max-flow. The implementation used in this work is described in detail in chapter 2,
based on [68]. This algorithm is designed specically for the graphs that arise in vision
applications, and in [68] it is shown to perform particularly well for such graphs.
11.3.3 Max-Product Loopy Belief Propagation (LBP)
Two different variants of LBP were implemented: BP-M, an updated version of the max-
product LBP implementation of [468], and BP-S, an LBP implementation derived from
the TRW-S implementation described below. The most significant difference between
the two implementations is in the schedules for passing messages on grids. In the BP-M
implementation, messages are passed along rows, then along columns. When a row or
column is processed, the algorithm starts at the rst node and passes messages in one
direction, similar to the forward-backward algorithm for Hidden Markov Models. Once
the algorithm reaches the end of a row or column, messages are passed backward along
the same row or column. In the BP-S implementation, the nodes are processed in scan-line
order, with a forward and a backward pass. In the forward pass each node sends messages
to its right and bottom neighbors. In the backward pass messages are sent to the left and
upper neighbors. Another difference between our LBP implementations is how the label-
ing is computed. In BP-M each pixel independently chooses the label with the highest
belief, while in BP-S the labeling is computed from messages (as described in the next sec-
tion). Based on some experiments, we do not believe that the performance of BP-S would be
improved by adopting the label computing technique of BP-M. Note that BP-S uses integers
for messages, to provide additional efficiency. The performance of the two versions differs
by a surprisingly large margin (see section 11.5). For both methods the distance transform
method described in [141] is used when applicable, that is, when the label set is ordered.
This significantly reduces the running time of the algorithm. In the latest benchmark the BP
implementation of [141] was included for comparison.
11.3.4 Tree-Reweighted Message Passing (TRW)
Tree-reweighted message passing [504] is a message-passing algorithm similar, on the sur-
face, to LBP. Let M^t_{ij} be the message that pixel i sends to its neighbor j at iteration t; this
is a vector of size K (the number of labels). The message update rule is

M^t_{ij}(l_j) = min_{l_i} { c_ij [ θ_i(l_i) + Σ_{s∈N(i)} M^{t−1}_{si}(l_i) ] − M^{t−1}_{ji}(l_i) + θ_ij(l_i, l_j) }.   (11.3)

The coefficients c_ij are determined in the following way. First, a set of trees from the
neighborhood graph (a 2D grid in our case) is chosen so that each edge is in at least one tree.
A probability distribution over the set of trees is then chosen. Finally, c_ij is set to the
probability that a tree chosen randomly under that distribution contains edge (i, j), given that
it contains i. Note that if c_ij were set to 1, the update rule would be identical to that of
standard LBP.
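A sketch of a single message update of (11.3) for one directed edge, assuming the incoming messages have already been summed; setting c_ij = 1 recovers the standard min-sum LBP update. The final normalization step is a common numerical-stability convention, not part of (11.3):

import numpy as np

def trw_message(theta_i, theta_ij, incoming, msg_ji, c_ij=1.0):
    """One update of the message M_{i->j} from (11.3).

    theta_i   : (K,) data costs at pixel i
    theta_ij  : (K, K) pairwise costs, indexed [l_i, l_j]
    incoming  : (K,) sum of messages M_{s->i} over all neighbors s of i
    msg_ji    : (K,) previous message from j to i
    """
    h = c_ij * (theta_i + incoming) - msg_ji    # (K,) term in l_i
    m = (h[:, None] + theta_ij).min(axis=0)     # minimize over l_i
    return m - m.min()                          # normalize for numerical stability

# toy usage with K = 3 labels
rng = np.random.default_rng(6)
K = 3
print(trw_message(rng.random(K), rng.random((K, K)), rng.random(K), rng.random(K)))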
An interesting feature of the TRW algorithm is that it computes a lower bound on the
energy. We use this lower bound in our experimental results (section 11.5 below) to assess
the quality of the solutions. The best solutions are typically within 1% of the maximum lower
bound.
The original TRW algorithm does not necessarily converge and does not, in fact, guar-
antee that the lower bound always increases with time. In this chapter we use an improved
version of TRW due to [255], which is called sequential TRW (TRW-S). In this version the
lower bound estimate is guaranteed not to decrease, which results in certain convergence
properties. In TRW-S we first select an arbitrary pixel-ordering function S(i). The messages
are updated in order of increasing S(i) and at the next iteration in the reverse order. Trees
are constrained to be chains that are monotonic with respect to S(i). The algorithm can be
implemented using half as much memory as some versions of BP since it needs to store one
message per edge.
Given messages M, we compute labeling l as described in [255]: we go through pixels
in the order S(i) and choose the label l_i that minimizes

θ_i(l_i) + Σ_{S(j)<S(i)} θ_ij(l_i, l_j) + Σ_{S(j)>S(i)} M_ji(l_i).
Note that this rule is heuristic, and there is no guarantee that the energy might not actually
increase with time; it is guaranteed only that the lower bound does not decrease. In practice
the energy sometimes starts to oscillate. To deal with this issue, one could keep track of the
lowest energy to date and return that state when the algorithm is terminated.
11.4 Benchmark Problems
For the benchmark a representative set of low-level energy minimization problems was
created, drawn from a range of different applications. The input images for each benchmark
are shown in figure 11.1.
11.4.1 Stereo Matching
For stereo matching, a simple energy function, as in [70, 468], is applied to images from the
widely used Middlebury stereo data set [419]. The labels are the disparities, and the data
costs are the absolute color differences between corresponding pixels for each disparity.
Here, the cost variant by Birchfield and Tomasi [44] is used, for increased robustness to
image sampling.
To make the optimization problems more varied, different smoothness costs for the
different image pairs were used (see the introduction of the energy model in 11.2.1). For Tsukuba
with K = 16 labels, a truncated linear cost is used (r = 1, ρ_max = 2) with λ = 20. For
Venus with K = 20 labels, a truncated quadratic cost is used (r = 2, ρ_max = 7) with λ = 50.
Since this smoothness term is not a metric, applying the expansion-move algorithm requires
truncation. For Teddy with K = 60 labels, the Potts model (r = 1, ρ_max = 1) with λ = 10
is used. The default local smoothness weight is w_ij = 1 at all pixels. For Tsukuba and Teddy,
the weights are increased at locations where the intensity gradient ∇_ij in the left image is
small: w_ij = 2 if |∇_ij| ≤ 8 for Tsukuba, and w_ij = 3 if |∇_ij| ≤ 10 for Teddy.
11.4.2 Photomontage
The Photomontage system [7] seamlessly stitches together multiple photographs for dif-
ferent photo merging applications. Here, two such applications are considered: panoramic
Figure 11.1
Some images used for the benchmark (see [464] for the full set). (a) Stereo matching: Tsukuba, and Teddy, left
images and true disparity. (b) Photomontage Panorama. (c) Photomontage Family. (d) Binary image segmentation:
Sponge and Person. (e) Denoising and inpainting: Penguin and House.
stitching and group photo merging. The input is a set of aligned images S_1, S_2, . . . , S_K of
equal dimension; the labels are the image indices, and the final output image is formed by
copying colors from the input images according to the computed labeling. If two neighbors
i and j are assigned to the same input image, that is, l_i = l_j, they should appear natural
in the composite, and thus θ_ij(l_i, l_j) = 0. If l_i ≠ l_j, that is, a seam exists between i and j,
then θ_ij measures how visually noticeable the seam is in the composite. The data term θ_i(l_i)
is 0 if pixel i is in the field of view of image S_{l_i}, and ∞ otherwise.
The first benchmark, Panorama, automatically stitches together the panorama in
figure 11.1(b) ([7], figure 8). The smoothness energy, derived from [287], is

θ_ij(l_i, l_j) = |S_{l_i}(i) − S_{l_j}(i)| + |S_{l_i}(j) − S_{l_j}(j)|.   (11.4)

This energy function is suitable for the expansion-move algorithm without truncation.
The second benchmark, Family, stitches together five group photographs and is shown
in figure 11.1(c) ([7], figure 1). The best depiction of each person is to be included in a
composite. Photomontage itself is interactive, but to make the benchmark repeatable, the
user strokes are saved into a data file. For any pixel i underneath a drawn stroke, θ_i(l_i) = 0
if l_i equals the user-indicated source image, and ∞ otherwise. For pixels i not underneath
any strokes, θ_i(l_i) = 0 for all labels. The smoothness terms are modified from the
first benchmark to encourage seams along strong edges. More precisely, the right-hand side
of (11.4) is divided by |∇_ij S_{l_i}| + |∇_ij S_{l_j}|, where ∇_ij I is the gradient between pixels i and j
in image I. The expansion-move algorithm is applicable to this energy only after truncating
certain terms.
11.4.3 Binary Image Segmentation
Binary MRFs are widely used in medical image segmentation [66], stereo matching using
minimal surfaces [80, 443], and video segmentation using stereo disparity cues [259]. For
the natural Ising model smoothness cost, the global minimum can be computed rapidly
via graph cut [176]; this result has been generalized to other smoothness costs by [266].
Nevertheless, such energy functions still form an interesting benchmark, since there may
well be other heuristic algorithms that perform faster while achieving nearly the same level
of performance.
Our benchmark consists of three segmentation problems inspired by the interactive seg-
mentation algorithm of [66, 401] (see chapter 7). As above, this application requires user
interaction, which is handled by saving the user interactions to a file and using them to
derive the data costs.
The data cost is the negative log-likelihood of a pixel belonging to either the foreground
or the background, and is modeled as two separate Gaussian mixture models, as in [401].
The smoothness term is a standard Potts model that is contrast-sensitive:
w_ij = λ_1 exp(−β ||S(i) − S(j)||²) + λ_2,   (11.5)
where λ_1 = 50, λ_2 = 1/5, and S(i), S(j) are the RGB colors of two neighboring pixels i, j.¹
The quantity β is set to (2⟨||S(i) − S(j)||²⟩)^{−1}, where the expectation ⟨·⟩ denotes an average
over the image, as motivated in [401]. The purpose of λ_2 is to remove small and isolated
areas that have high contrast.
11.4.4 Image Denoising and Inpainting
For the denoising and inpainting benchmark, the Penguin image is used ([141], figure 8),
along with the House image, a standard test image in the denoising literature. Random
noise is added to each pixel, and we also obscure a portion of each image (figure 11.1e).
The labels are intensities (K = 256), and the data cost θ_i for each pixel is the squared difference
between the label and the observed intensity, except in the obscured portions, where
θ_i(l_i) = 0 for all intensities. For the Penguin image a truncated quadratic smoothness cost
is used (r = 2, ρ_max = 200) with λ = 25. For the House image a nontruncated quadratic
cost is used (r = 2, ρ_max = ∞) with λ = 5. In both cases the expansion-move algorithm
requires truncation. Unlike all of the other benchmarks, the House example is a convex
minimization problem amenable to quadratic solvers, since both data and smoothness costs
are quadratics. The implications of this are discussed in the next section.
11.5 Experimental Results
Experimental results from running the different optimization algorithms on these benchmarks
are given in figure 11.2 (stereo), figure 11.3 (Photomontage), figure 11.4 (segmentation),
and figure 11.5 (denoising and inpainting). (Note that this is a selection and that all
plots can be found online and in [464].) In each plot the x-axis shows the running times in
seconds on a log-scale, and the y-axis shows the energy. In some of these figures the right
figure is a zoomed-in version of the left figure. As mentioned before, instead of showing
absolute energy values, the energy is divided by the best lower bound computed by TRW-S.²
Note that the lower bound increases monotonically [255] and is included in the plots.
Let us first summarize the main insights from these experiments. On all benchmarks the
best methods achieve results that are extremely close to the global minimum, with less than
1% error in all cases, and often less than 0.1%. For example, on Tsukuba, TRW-S gets to
within 0.02% of the optimum, and on Panorama, expansion-move is within 0.9%.
These statistics may actually slightly understate the performance of the methods since they
are based on the TRW-S lower bound rather than on the global minimum, which is unknown.
This means that most theoretical guarantees on bounds of certain methods are in practice
irrelevant, at least for these examples. For instance, expansion-move is guaranteed, at best, to be
within a factor of 2 of the global minimum (note that a 1% error corresponds to a factor of
1.01; see chapter 3).
1. Note that in [464] this setting was incorrect, with λ_2 = 10.
2. It was checked that for all applications the lower bound is positive.
[Figure 11.2 plots: energy (as a percentage of the TRW-S lower bound) versus running time on a log scale (0.1s–1000s). Legend: ICM, BP-S, BP-M, Swap, Expansion, TRW-S, lowerBound.]
Figure 11.2
Results on the stereo matching benchmarks. (a) Tsukuba energy, with truncated linear smoothness cost ρ. (b) Venus
energy, with truncated quadratic smoothness cost ρ. (c) Teddy energy, with the Potts model for ρ. In each row
the right figure is a zoomed-in version of the left figure. The legend for all plots is in (b). Some of the plots may
not contain the poorer-performing algorithms (e.g., ICM). Further plots are available online. The plots in other
figures are generated in the same manner.
[Plots for figure 11.3 omitted: energy versus running time, with the same axes and legend as figure 11.2.]
Figure 11.3
Results on the Photomontage benchmark. (a) Panorama. (b) Family.
[Plots for figure 11.4 omitted: energy versus running time, with the same axes and legend as figure 11.2.]
Figure 11.4
Results of binary segmentation benchmarks Sponge (left) and Person (right). Here the global minimum can be
computed rapidly by graph cut or TRW-S. See the legend in figure 11.2b.
[Plots for figure 11.5 omitted: energy versus running time, with the same axes as figure 11.2; see the caption below for the additional BP-P and HBF entries.]
Figure 11.5
Results on the denoising and inpainting benchmarks. (a) Penguin energy, with truncated quadratic smoothness cost. (b) House energy, with nontruncated quadratic smoothness cost. On these benchmarks the LBP implementation of [141] is included, labeled as BP-P (blue triangle); otherwise see the legend in figure 11.2b. The House example is a convex energy minimization problem amenable to quadratic solvers; we include the results of the fast algorithm of [462], labeled as HBF (diamond on the left-hand side).
As one can see, there is a dramatic difference in performance among the various energy
minimization methods. On the Photomontage benchmark, expansion-move performs best,
which provides some justification for the fact that this algorithm is used by various image-
stitching applications [7, 8]. On the stereo benchmark the two best methods seem to be
TRW-S and expansion-move. There are also some obvious paired comparisons; for instance,
there never seems to be any reason to use swap-move instead of expansion-move. In terms
of runtime, expansion-move is clearly the winner among the competitive methods (i.e., all
except ICM), but it should be noted that not all methods have been equally optimized
for speed. Concerning the two different implementations of BP, BP-M and BP-S, for most
examples there is a large gap between their relative performances. BP-S is on average faster
but typically converges to a higher energy.
Figure 11.6
Results on Panorama benchmark, with swap-move at left and expansion-move at right. The solution of swap-
move has higher energy and is also visually inferior (e.g., person in the center is sliced). Larger versions of
these images are online.
In terms of visual quality of the resulting labelings (note that all results are available
online), the ICM results look noticeably worse, but the others are difficult to distinguish on most of the benchmarks. The major exceptions are the Photomontage benchmarks. Here ICM, BP-S, and swap-move all make some major errors, leaving slices of people floating in the air, while the other methods produce the fewest noticeable seams, as shown in figure 11.6.
The individual plots show some further interesting features.
On the stereo benchmarks (figure 11.2) expansion-move finds near-optimal solutions
quickly and is the overall winner for Teddy. Though slower, TRW-S does extremely well,
eventually beating all other methods on Tsukuba and Venus. The Venus results are particu-
larly interesting: expansion-move does much worse here than on the other stereo images,
presumably due to the quadratic smoothness term and the slanted surfaces present in this
data set.
The Photomontage benchmarks (figure 11.3), with their label-dependent smoothness
costs, seem to present the largest challenge for the energy minimization methods, many of
which come nowhere near the best solution. The exception is expansion-move, which finds
solutions with less than 1% error on both benchmarks in a few iterations. TRW-S oscillates
wildly but eventually beats expansion-move on Family, though not on Panorama.
On the binary image segmentation benchmarks (figure 11.4), graph cut-based methods are guaranteed to compute the global minimum, as is TRW-S (but not the original TRW [504]).
Both LBP implementations come extremely close (under 0.1% error in all cases), but never
actually attain the global minimum.
In the final benchmark, denoising and inpainting (figure 11.5), two different cost functions were used: (1) a nonconvex (truncated quadratic) cost for the Penguin data set, and (2) a convex (quadratic) cost for the House data set. Since in the latter case both the data and the smoothness energies are quadratic, this is a Gaussian MRF for which a real-valued solution can be found in closed form. On this problem, hierarchically preconditioned conjugate gradient descent [462], labeled HBF in figure 11.5b, outperforms all other algorithms by a large margin, requiring only five iterations and a fraction of a second. However, since the resulting floating-point solution is rounded to integers, the resulting energy is slightly higher than
the integer solution found by TRW-S. The graph cut variants do not perform well, perhaps because of the nonmetric smoothness terms and because there are no constant-valued regions to propagate over large distances. On these benchmarks, results for the popular
LBP (BP-P) implementation of [141] were also included, which performs comparably to
the other two BP implementations.
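For the convex House case, the closed-form real-valued solution mentioned above can be written down directly. The following sketch (our own illustration using scipy's sparse solver, not the HBF algorithm of [462]) minimizes a quadratic data term, disabled in the obscured region, plus a quadratic smoothness term with weight lam by solving the corresponding linear system.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def gaussian_mrf_closed_form(observed, mask, lam):
    """Sketch of the closed-form (real-valued) solution for the Gaussian
    MRF case: quadratic data cost plus quadratic smoothness cost on a
    4-connected grid.  `observed` is an HxW float array, `mask` is True
    where the image was obscured (data cost disabled), and `lam` weights
    the smoothness term; all names are ours.  Minimizing
        sum_i w_i (x_i - o_i)^2 + lam * sum_(i,j) (x_i - x_j)^2
    gives the linear system (W + lam * L) x = W o, with W the diagonal
    data-term mask and L the graph Laplacian of the grid."""
    h, w = observed.shape
    n = h * w

    def idx(y, x):
        return y * w + x

    # Build the 4-connected grid Laplacian L from its edge list.
    rows, cols, vals = [], [], []
    for y in range(h):
        for x in range(w):
            i = idx(y, x)
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < h and nx < w:
                    j = idx(ny, nx)
                    rows += [i, j, i, j]
                    cols += [i, j, j, i]
                    vals += [1.0, 1.0, -1.0, -1.0]
    L = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

    W = sp.diags(np.where(mask.ravel(), 0.0, 1.0))
    b = W @ observed.ravel()
    x = spla.spsolve((W + lam * L).tocsc(), b)
    return x.reshape(h, w)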
A final issue deserving investigation is the impact of truncation when using the expansion-move algorithm. It turns out that the total number of pairwise terms that require truncation is very low, typically a fraction of 1%. Thus, it is unlikely that truncation strongly affects the
performance of the expansion-move algorithm. Furthermore, the BHS algorithm described
in chapter 2 could resolve this problem (see an example in [259]).
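To check the impact of truncation empirically, one can simply count the pairwise terms that violate the submodularity condition of an expansion move, θ(l_i, l_j) + θ(α, α) ≤ θ(l_i, α) + θ(α, l_j) (see chapters 2 and 3). The following sketch (our own, with hypothetical names) returns the fraction of such terms for one move:

def fraction_needing_truncation(labels, edges, theta, alpha):
    """Count how many pairwise terms are non-submodular, and would
    therefore be truncated, in a single alpha-expansion.  `edges` is an
    iterable of (i, j) index pairs, `labels` the current labeling, and
    `theta(a, b)` the pairwise cost; all names are ours."""
    bad = 0
    total = 0
    for i, j in edges:
        li, lj = labels[i], labels[j]
        total += 1
        # Submodularity: theta(li, lj) + theta(a, a) <= theta(li, a) + theta(a, lj)
        if theta(li, lj) + theta(alpha, alpha) > theta(li, alpha) + theta(alpha, lj):
            bad += 1
    return bad / total if total else 0.0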
11.6 Experimental Comparison: Stereo with and Without Occlusion
Let us first briefly explain the model of stereo with occlusion. (A detailed explanation and the corresponding energy can be found in [265].) The model can be seen as an extension of the simple 4-connected stereo model introduced in section 11.4, since the unary matching costs and pairwise smoothness costs are of the same form. The two main differences of the stereo with occlusion model are that (1) the left and right stereo images are labeled simultaneously, and (2) a large number of pairwise terms are added across the two images. These pairwise terms encode the occlusion property and have costs 0 or ∞; the ∞ cost ensures that two pixels that are matched are not occluded by any other pixel match (i.e., 3D voxel). While stereo without occlusion is a simple 4-connected MRF, this model corresponds to a more complex, highly connected MRF. In particular, each node has K + 4 pairwise terms, where K is the number of disparities.
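To make this connectivity concrete, the following sketch (our own simplification, not the exact construction of [265]) enumerates the pairwise terms attached to a single left-image pixel: four smoothness terms to its grid neighbors plus one occlusion term per candidate disparity, giving K + 4 terms per node.

def neighbor_terms_left_pixel(x, y, width, height, num_disparities):
    """Illustrative only: list the pairwise terms attached to one
    left-image pixel in a stereo-with-occlusion model of the kind
    described above.  Indexing details are our own simplification."""
    terms = []
    # Smoothness terms to the 4-connected neighbors in the same image.
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            terms.append(("smoothness", "left", nx, ny))
    # Occlusion terms (cost 0 or infinity) to the K possible matches
    # in the right image, one per disparity.
    for d in range(num_disparities):
        if 0 <= x - d < width:
            terms.append(("occlusion", "right", x - d, y))
    return terms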
The study in [260] on stereo with occlusion compared three optimization techniques (TRW-S, BP-S, and alpha-expansion) on six different benchmarks (Tsukuba, Venus, Teddy,
Sawtooth, Map, Cones). Figure 11.7 shows plots for Tsukuba and Teddy, with K = 16 and
K = 54 levels of disparities, respectively (see more plots in [260]). In contrast to the 4-
connected stereo model, alpha-expansion is in this case the clear winner. On all examples it
reaches a slightly lower energy than TRW-S. More important, alpha-expansion converges
in about 10-100 sec (depending on the number of disparities), and TRW-S typically needs
several hours to reach a solution that is close to the graph cut one. The relative performance
between BP-S and TRW-S depends on the number of disparities, with BP-S performing
better when the number of disparities is large. The potential reason for graph cut outper-
forming message-passing techniques is that in each expansion-move, the corresponding
binary energy has considerably fewer pairwise terms for each node than the full number of
pairwise terms (i.e., K +4).
For comparing stereo with and without occlusion, we used the stereo without occlusion
model of [254]. It corresponds to a 4-connected MRF, as in section 11.4, but uses exactly
the same matching costs as the above stereo with occlusion model [265]. As a check, the
[Plots for figure 11.7 omitted: absolute energy versus time in seconds on a linear scale; curves for GC, TRW, BP, and the lower bound.]
Figure 11.7
Comparison of optimization methods for stereo with occlusion [260], with Tsukuba (K=16 disparities) benchmark
(left) and Teddy (K=54 disparities) benchmark (right). The optimization methods are expansion-move (labeled
GC), TRW-S (labeled TRW), and BP-S (labeled BP).
energies of stereo with and without occlusion differ only by a factor of 3 for the Tsukuba
data set, that is, they are of the same order of magnitude. As in the previous section, we
normalize the energies with respect to the optimal bound of the respective TRW-S result (see
details in [260]). For the four data sets the normalized energies for stereo without occlusion
are Tsukuba (100.0037%), Map (100.055%), Sawtooth (100.096%), and Venus (100.014%).
The respective normalized energies for stereo with occlusion are Tsukuba (103.09%), Map
(103.28%), Sawtooth (101.27%), and Venus (102.26%). This means that the difference from
100% is on average two to three orders of magnitude larger for the more complex model.
Consequently, we may conclude that stereo with occlusion is much harder to optimize
than stereo without occlusion. This is also supported by the fact that Meltzer et al. [338]
have shown that for a 4-connected stereo model, it is possible to achieve global optimality
for many examples, including Tsukuba and Venus, using a combination of TRW and the
junction tree algorithm. (The global optimum for Tsukuba³ is shown in figure 4 of [338].)
The global optimum for a stereo example with a highly connected MRF is so far unknown.
It is quite obvious that properly modeling occlusion leads to a better stereo model, as can
be seen from the ranking on the Middlebury stereo benchmark. Figure 11.8 shows the best
results of the two models, where a clear difference is visible. Another way of evaluating
the quality of a model is to compute the energy of the ground truth labeling. For the stereo
without occlusion model (section 11.4), the energy of the ground truth is very high (e.g.,
144% for Venus), whereas most optimization methods reach an energy close to 100% (e.g., TRW-S achieves 100.02%). In contrast, for stereo with occlusion [265] the ground truth
labeling for Venus has a lower energy of 115%, where the best method (expansion-move)
3. They may use a slightly different energy function (e.g., matching cost) than in this study.
Figure 11.8
The results of the best optimization methods for stereo without occlusion (left and middle) and with occlusion
(right). The result on the right is visually superior because it is smoother and more accurate (see ground truth in
figure 11.1a). The result on the left uses the model described in section 11.4; the result in the middle is based on
the model in [254]; and the result on the right uses the model of [265]. The results in the middle and right use the
identical matching cost function.
reaches an energy of 102.26%. These numbers have to be taken with care, since (1) the
models are quite different (e.g., different matching costs), and (2) the lower bound of TRW-
S for stereo with occlusion might not be very tight. As expected, if only the labelings of the
nonoccluded ground truth pixels are considered, the ground truth energy drops to 113% for
the stereo without occlusion model. Further examples can be found in table 1 of [464].
11.7 Conclusions
The strongest impression one gets from the study on 4-connected grid graphs is how much
better modern energy minimization methods are than ICM, and how close they come to
computing the global minimum. We do not believe that this is purely due to flaws in ICM, but simply reflects the fact that the algorithms used until the late 1990s performed poorly.
(As additional evidence, [72] compared the energy produced by graph cut with simulated
annealing, and obtained a similarly large improvement.)
Our study has also shown that there is a demand for better MRF models, such as stereo
with occlusion handling, as discussed in section 11.6. However, this highly connected MRF
model clearly poses a challenge for existing optimization techniques.