

Principles of
Distributed Computing

Roger Wattenhofer

Spring 2015
Contents

1 Vertex Coloring  5
  1.1 Problem & Model  5
  1.2 Coloring Trees  8

2 Tree Algorithms  15
  2.1 Broadcast  15
  2.2 Convergecast  17
  2.3 BFS Tree Construction  18
  2.4 MST Construction  19

3 Leader Election  23
  3.1 Anonymous Leader Election  23
  3.2 Asynchronous Ring  24
  3.3 Lower Bounds  26
  3.4 Synchronous Ring  29

4 Distributed Sorting  33
  4.1 Array & Mesh  33
  4.2 Sorting Networks  36
  4.3 Counting Networks  39

5 Shared Memory  47
  5.1 Model  47
  5.2 Mutual Exclusion  48
  5.3 Store & Collect  51
    5.3.1 Problem Definition  51
    5.3.2 Splitters  52
    5.3.3 Binary Splitter Tree  53
    5.3.4 Splitter Matrix  55

6 Shared Objects  59
  6.1 Centralized Solutions  59
  6.2 Arrow and Friends  60
  6.3 Ivy and Friends  65

7 Maximal Independent Set  71
  7.1 MIS  71
  7.2 Original Fast MIS  73
  7.3 Fast MIS v2  76
  7.4 Applications  80

8 Locality Lower Bounds  85
  8.1 Model  85
  8.2 Locality  86
  8.3 The Neighborhood Graph  88

9 Social Networks  93
  9.1 Small World Networks  94
  9.2 Propagation Studies  100

10 Synchronization  105
  10.1 Basics  105
  10.2 Synchronizer α  106
  10.3 Synchronizer β  107
  10.4 Synchronizer γ  108
  10.5 Network Partition  110
  10.6 Clock Synchronization  112

11 Hard Problems  119
  11.1 Diameter & APSP  119
  11.2 Lower Bound Graphs  121
  11.3 Communication Complexity  124
  11.4 Distributed Complexity Theory  129

12 Wireless Protocols  133
  12.1 Basics  133
  12.2 Initialization  135
    12.2.1 Non-Uniform Initialization  135
    12.2.2 Uniform Initialization with CD  135
    12.2.3 Uniform Initialization without CD  137
  12.3 Leader Election  137
    12.3.1 With High Probability  137
    12.3.2 Uniform Leader Election  138
    12.3.3 Fast Leader Election with CD  139
    12.3.4 Even Faster Leader Election with CD  139
    12.3.5 Lower Bound  142
    12.3.6 Uniform Asynchronous Wakeup without CD  142
  12.4 Useful Formulas  143

13 Stabilization  147
  13.1 Self-Stabilization  147
  13.2 Advanced Stabilization  152

14 Labeling Schemes  157
  14.1 Adjacency  157
  14.2 Rooted Trees  159
  14.3 Road Networks  160

15 Fault-Tolerance & Paxos  165
  15.1 Client/Server  165
  15.2 Paxos  168

17 Byzantine Agreement  185
  17.1 Validity  186
  17.2 How Many Byzantine Nodes?  187
  17.3 The King Algorithm  189
  17.4 Lower Bound on Number of Rounds  190
  17.5 Asynchronous Byzantine Agreement  191

18 Authenticated Agreement  195
  18.1 Agreement with Authentication  195
  18.2 Zyzzyva  197

19 Quorum Systems  207
  19.1 Load and Work  208
  19.2 Grid Quorum Systems  209
  19.3 Fault Tolerance  211
  19.4 Byzantine Quorum Systems  214

20 Eventual Consistency & Bitcoin  219
  20.1 Consistency, Availability and Partitions  219
  20.2 Bitcoin  221
  20.3 Smart Contracts  227
  20.4 Weak Consistency  229

21 Distributed Storage  233
  21.1 Consistent Hashing  233
  21.2 Hypercubic Networks  234
  21.3 DHT & Churn  240

22 Game Theory  247
  22.1 Introduction  247
  22.2 Prisoner's Dilemma  247
  22.3 Selfish Caching  249
  22.4 Braess' Paradox  251
  22.5 Rock-Paper-Scissors  252
  22.6 Mechanism Design  253

23 Peer-to-Peer Computing  257
  23.1 Introduction  257
  23.2 Architecture Variants  258
  23.3 Hypercubic Networks  259
  23.4 DHT & Churn  265
  23.5 Storage and Multicast  268

24 All-to-All Communication  275

25 Dynamic Networks  283
  25.1 Synchronous Edge-Dynamic Networks  283
  25.2 Problem Definitions  284
  25.3 Basic Information Dissemination  285
  25.4 Small Messages  288
    25.4.1 k-Verification  288
    25.4.2 k-Committee Election  289
  25.5 More Stable Graphs  291

26 Consensus  295
  26.1 Impossibility of Consensus  295
  26.2 Randomized Consensus  300

27 Multi-Core Computing  305
  27.1 Introduction  305
    27.1.1 The Current State of Concurrent Programming  305
  27.2 Transactional Memory  307
  27.3 Contention Management  308

28 Dominating Set  317
  28.1 Sequential Greedy Algorithm  318
  28.2 Distributed Greedy Algorithm  319

29 Routing  327
  29.1 Array  327
  29.2 Mesh  328
  29.3 Routing in the Mesh with Small Queues  329
  29.4 Hot-Potato Routing  330
  29.5 More Models  332

30 Routing Strikes Back  335
  30.1 Butterfly  335
  30.2 Oblivious Routing  336
  30.3 Offline Routing  337
Introduction

What is Distributed Computing?

In the last few decades, we have experienced an unprecedented growth in the area of distributed systems and networks. Distributed computing now encompasses many of the activities occurring in today's computer and communications world. Indeed, distributed computing appears in quite diverse application areas: the Internet, wireless communication, cloud or parallel computing, multi-core systems, mobile networks, but also an ant colony, a brain, or even human society can be modeled as distributed systems.

These applications have in common that many processors or entities (often called nodes) are active in the system at any moment. The nodes have certain degrees of freedom: they have their own hard- and software. Nevertheless, the nodes may share common resources and information, and, in order to solve a problem that concerns several (or maybe even all) nodes, coordination is necessary.

Despite these commonalities, a human brain is of course very different from a quadcore processor. Due to such differences, many different models and parameters are studied in the area of distributed computing. In some systems the nodes operate synchronously, in other systems they operate asynchronously. There are simple homogeneous systems, and heterogeneous systems where different types of nodes, potentially with different capabilities, objectives etc., need to interact. There are different communication techniques: nodes may communicate by exchanging messages, or by means of shared memory. Occasionally the communication infrastructure is tailor-made for an application, sometimes one has to work with any given infrastructure. The nodes in a system often work together to solve a global task, occasionally the nodes are autonomous agents that have their own agenda and compete for common resources. Sometimes the nodes can be assumed to work correctly, at times they may exhibit failures. In contrast to a single-node system, distributed systems may still function correctly despite failures as other nodes can take over the work of the failed nodes. There are different kinds of failures that can be considered: nodes may just crash, or they might exhibit an arbitrary, erroneous behavior, maybe even to a degree where it cannot be distinguished from malicious (also known as Byzantine) behavior. It is also possible that the nodes do follow the rules, but tweak the parameters to get the most out of the system; in other words, the nodes act selfishly.

Apparently, there are many models (and even more combinations of models) that can be studied. We will not discuss them in detail now, but simply define them when we use them. Towards the end of the course a general picture should emerge, hopefully!

Course Overview

This course introduces the basic principles of distributed computing, highlighting common themes and techniques. In particular, we study some of the fundamental issues underlying the design of distributed systems:

• Communication: Communication does not come for free; often communication cost dominates the cost of local processing or storage. Sometimes we even assume that everything but communication is free.

• Coordination: How can you coordinate a distributed system so that it performs some task efficiently? How much overhead is inevitable?

• Fault-tolerance: A major advantage of a distributed system is that even in the presence of failures the system as a whole may survive.

• Locality: Networks keep growing. Luckily, global information is not always needed to solve a task; often it is sufficient if nodes talk to their neighbors. In this course, we will address whether a local solution is possible.

• Parallelism: How fast can you solve a task if you increase your computational power, e.g., by increasing the number of nodes that can share the workload? How much parallelism is possible for a given problem?

• Symmetry breaking: Sometimes some nodes need to be selected to orchestrate computation or communication. This is achieved by a technique called symmetry breaking.

• Synchronization: How can you implement a synchronous algorithm in an asynchronous environment?

• Uncertainty: If we need to agree on a single term that fittingly describes this course, it is probably "uncertainty". As the whole system is distributed, the nodes cannot know what other nodes are doing at this exact moment, and the nodes are required to solve the tasks at hand despite the lack of global knowledge.

Finally, there are also a few areas that we will not cover in this course, mostly because these topics have become so important that they deserve their own courses. Examples for such topics are distributed programming or security/cryptography.

In summary, in this class we explore essential algorithmic ideas and lower bound techniques, basically the "pearls" of distributed computing and network algorithms. We will cover a fresh topic every week.

Have fun!
Chapter Notes

Many excellent text books have been written on the subject. The book closest to this course is by David Peleg [Pel00], as it shares about half of the material. A main focus of Peleg's book is network partitions, covers, decompositions, and spanners – an interesting area that we will only touch in this course. There exist a multitude of other text books that overlap with one or two chapters of this course, e.g., [Lei92, Bar96, Lyn96, Tel01, AW04, HKP+05, CLRS09, Suo12]. Another related course is by James Aspnes [Asp] and one by Jukka Suomela [Suo14].

Some chapters of this course have been developed in collaboration with (former) Ph.D. students, see chapter notes for details. Many students have helped to improve exercises and script. Thanks go to Philipp Brandes, Raphael Eidenbenz, Roland Flury, Klaus-Tycho Förster, Stephan Holzer, Barbara Keller, Fabian Kuhn, Christoph Lenzen, Thomas Locher, Remo Meier, Thomas Moscibroda, Regina O'Dell, Yvonne-Anne Pignolet, Jochen Seidel, Stefan Schmid, Johannes Schneider, Jara Uitto, Pascal von Rickenbach (in alphabetical order).

Bibliography

[Asp] James Aspnes. Notes on Theory of Distributed Systems.

[AW04] Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations and Advanced Topics (2nd edition). John Wiley Interscience, March 2004.

[Bar96] Valmir C. Barbosa. An Introduction to Distributed Algorithms. MIT Press, Cambridge, MA, USA, 1996.

[CLRS09] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms (3rd edition). MIT Press, 2009.

[HKP+05] Juraj Hromkovic, Ralf Klasing, Andrzej Pelc, Peter Ruzicka, and Walter Unger. Dissemination of Information in Communication Networks – Broadcasting, Gossiping, Leader Election, and Fault-Tolerance. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2005.

[Lei92] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.

[Lyn96] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1996.

[Pel00] David Peleg. Distributed Computing: a Locality-Sensitive Approach. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

[Suo12] Jukka Suomela. Deterministic Distributed Algorithms, 2012.

[Suo14] Jukka Suomela. Distributed Algorithms. Online textbook, 2014.

[Tel01] Gerard Tel. Introduction to Distributed Algorithms. Cambridge University Press, New York, NY, USA, 2nd edition, 2001.
Chapter 1

Vertex Coloring

Vertex coloring is an infamous graph theory problem. It is also a useful toy example to see the style of this course already in the first lecture. Vertex coloring does have quite a few practical applications, for example in the area of wireless networks where coloring is the foundation of so-called TDMA MAC protocols. Generally speaking, vertex coloring is used as a means to break symmetries, one of the main themes in distributed computing. In this chapter we will not really talk about vertex coloring applications, but treat the problem abstractly. By the end of the class you will probably have learned the fastest algorithm ever! Let us start with some simple definitions and observations.

1.1 Problem & Model

Problem 1.1 (Vertex Coloring). Given an undirected graph G = (V, E), assign a color cv to each vertex v ∈ V such that the following holds: e = (v, w) ∈ E ⇒ cv ≠ cw.

Remarks:

• Throughout this course, we use the terms vertex and node interchangeably.

• The application often asks us to use few colors! In a TDMA MAC protocol, for example, fewer colors immediately imply higher throughput. However, in distributed computing we are often happy with a solution which is suboptimal. There is a tradeoff between the optimality of a solution (efficacy), and the work/time needed to compute the solution (efficiency).

Assumption 1.2 (Node Identifiers). Each node has a unique identifier, e.g., its IP address. We usually assume that each identifier consists of only log n bits if the system has n nodes.

Figure 1.1: 3-colorable graph with a valid coloring.

Remarks:

• Sometimes we might even assume that the nodes exactly have identifiers 1, . . . , n.

• It is easy to see that node identifiers (as defined in Assumption 1.2) solve the coloring problem 1.1, but using n colors is not exciting. How many colors are needed is a well-studied problem:

Definition 1.3 (Chromatic Number). Given an undirected graph G = (V, E), the chromatic number χ(G) is the minimum number of colors to solve Problem 1.1.

To get a better understanding of the vertex coloring problem, let us first look at a simple non-distributed ("centralized") vertex coloring algorithm:

Algorithm 1 Greedy Sequential
1: while there is an uncolored vertex v do
2:   color v with the minimal color (number) that does not conflict with the already colored neighbors
3: end while

Definition 1.4 (Degree). The number of neighbors of a vertex v, denoted by δ(v), is called the degree of v. The maximum degree vertex in a graph G defines the graph degree ∆(G) = ∆.

Theorem 1.5. Algorithm 1 is correct and terminates in n "steps". The algorithm uses at most ∆ + 1 colors.

Proof: Since each node has at most ∆ neighbors, there is always at least one color free in the range {1, . . . , ∆ + 1}.

Remarks:

• In Definition 1.7 we will see what is meant by "step".

• Sometimes χ(G) ≪ ∆ + 1.

Definition 1.6 (Synchronous Distributed Algorithm). In a synchronous distributed algorithm, nodes operate in synchronous rounds. In each round, each node executes the following steps:

1. Send messages to neighbors in graph (of reasonable size).
2. Receive messages (that were sent by neighbors in step 1 of the same round).
3. Do some local computation (of reasonable complexity).
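The greedy procedure of Algorithm 1 is easy to check against Theorem 1.5 in code. Below is a minimal Python sketch; the function name and the example graph are our own illustration, not part of the script.

```python
# Sketch of Algorithm 1 (Greedy Sequential).
# The graph is a symmetric adjacency dict: vertex -> list of neighbors.

def greedy_color(adj):
    """Color vertices one by one with the smallest non-conflicting color."""
    color = {}
    for v in adj:                      # "while there is an uncolored vertex v"
        taken = {color[w] for w in adj[v] if w in color}
        c = 1
        while c in taken:              # minimal color not used by colored neighbors
            c += 1
        color[v] = c
    return color

# Example: a 5-cycle with one chord; maximum degree Delta = 3.
adj = {1: [2, 5], 2: [1, 3, 5], 3: [2, 4], 4: [3, 5], 5: [4, 1, 2]}
coloring = greedy_color(adj)
assert all(coloring[v] != coloring[w] for v in adj for w in adj[v])  # valid coloring
assert max(coloring.values()) <= 3 + 1  # at most Delta + 1 colors (Theorem 1.5)
```

Note that the number of colors actually used depends on the visiting order; the ∆ + 1 bound of Theorem 1.5 holds for every order.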
Remarks:

• Any other step ordering is fine.

• What does "reasonable" mean in this context? We are somewhat flexible here, and different model variants exist. Generally, we will deal with algorithms that only do very simple computations (a comparison, an addition, etc.). Exponential-time computation is usually considered cheating in this context. Similarly, sending a message with a node ID, or a value is considered okay, whereas sending really long messages is fishy. We will have more exact definitions later, when we need them.

• We can build a distributed version of Algorithm 1:

Algorithm 2 Reduce
1: Assume that initially all nodes have IDs
2: Each node v executes the following code:
3: node v sends its ID to all neighbors
4: node v receives IDs of neighbors
5: while node v has an uncolored neighbor with higher ID do
6:   node v sends "undecided" to all neighbors
7:   node v receives new decisions from neighbors
8: end while
9: node v chooses the smallest admissible free color
10: node v informs all its neighbors about its choice

Figure 1.2: Vertex 100 receives the lowest possible color.

Definition 1.7 (Time Complexity). For synchronous algorithms (as defined in 1.6) the time complexity is the number of rounds until the algorithm terminates. The algorithm terminates when the last node terminates.

Theorem 1.8. Algorithm 2 is correct and has time complexity n. The algorithm uses at most ∆ + 1 colors.

Proof. Nodes choose colors that are different from their neighbors, and no two neighbors choose concurrently. In each round at least one node chooses a color, so we are done after at most n rounds.

Remarks:

• In the worst case, this algorithm is still not better than sequential.

• Moreover, it seems difficult to come up with a fast algorithm.

• Maybe it's better to first study a simple special case, a tree, and then go from there.

1.2 Coloring Trees

Lemma 1.9. χ(Tree) ≤ 2.

Proof. Call some node the root of the tree. If the distance of a node to the root is odd (even), color it 1 (0). An odd node has only even neighbors and vice versa.

Remarks:

• If we assume that each node knows its parent (root has no parent) and children in a tree, this constructive proof gives a very simple algorithm:

Algorithm 3 Slow Tree Coloring
1: Color the root 0, root sends 0 to its children
2: Each node v concurrently executes the following code:
3: if node v receives a message cp (from parent) then
4:   node v chooses color cv = 1 − cp
5:   node v sends cv to its children (all neighbors except parent)
6: end if

Theorem 1.10. Algorithm 3 is correct. If each node knows its parent and its children, the time complexity is the tree height, which is bounded by the diameter of the tree.

Remarks:

• How can we determine a root in a tree if it is not already given? We will figure that out later.

• The time complexity of the algorithm is the height of the tree.

• Nice trees, e.g., balanced binary trees, have logarithmic height, that is, we have a logarithmic time complexity.

• However, if the tree has a degenerated topology, the time complexity may again be up to n, the number of nodes.

• This algorithm is not very exciting. Can we do better than logarithmic?
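Algorithm 3 can be simulated round by round; a child is colored in the round after its parent. The following Python sketch (the function name and the example tree are ours, not from the script) also checks the height bound of Theorem 1.10.

```python
# Round-by-round sketch of Algorithm 3 (Slow Tree Coloring).
# 'children' encodes a rooted tree: node -> list of children.

def slow_tree_coloring(children, root):
    color = {root: 0}            # color the root 0
    frontier = [root]            # nodes whose color message is "in flight"
    rounds = 0
    while frontier:
        nxt = []
        for v in frontier:       # each child receives c_p from its parent
            for w in children.get(v, []):
                color[w] = 1 - color[v]   # c_v = 1 - c_p
                nxt.append(w)
        if nxt:
            rounds += 1          # one synchronous round per tree level
        frontier = nxt
    return color, rounds

# A tree of height 3: 0 -> 1, 2 ; 1 -> 3 ; 3 -> 4
children = {0: [1, 2], 1: [3], 3: [4]}
color, rounds = slow_tree_coloring(children, 0)
assert rounds == 3               # time complexity = height of the tree
assert all(color[w] != color[v] for v in children for w in children[v])
```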
Here is the idea of the algorithm: We start with color labels that have log n bits. In each round we compute a new label with exponentially smaller size than the previous label, still guaranteeing to have a valid vertex coloring! The algorithm terminates in log∗ n time. Log-Star?! That's the number of logarithms (to the base 2) you need to take to get down to 2. Formally:

Definition 1.11 (Log-Star).

∀x ≤ 2 : log∗ x := 1    ∀x > 2 : log∗ x := 1 + log∗(log x)

Remarks:

• Log-star is an amazingly slowly growing function. Log-star of all the atoms in the observable universe (estimated to be 10^80) is 5. So log-star increases indeed very slowly! There are functions which grow even more slowly, such as the inverse Ackermann function, however, the inverse Ackermann function of all the atoms is already 4.

Algorithm 4 "6-Color"
1: Assume that initially the nodes have IDs of size log n bits
2: The root assigns itself the label 0
3: Each other node v executes the following code
4: send own color cv to all children
5: repeat
6:   receive color cp from parent
7:   interpret cv and cp as bit-strings
8:   let i be the index of the smallest bit where cv and cp differ
9:   the new label is i (as bitstring) followed by the ith bit of cv
10:  send cv to all children
11: until cw ∈ {0, . . . , 5} for all nodes w

Example: Algorithm 4 executed on the following part of a tree:

Grand-parent  0010110000 → 10010 → ...
Parent        1010010000 → 01010 → 111
Child         0110010000 → 10001 → 001

Theorem 1.12. Algorithm 4 terminates in log∗ n + k time, where k is a constant independent of n.

Proof. We need to show that parent p and child c always have different colors. Initially, this is true, since all nodes start out with their unique ID. In a round, let i be the smallest index where child c has a different bit from parent p. If parent p differs in a different index bit j ≠ i from its own parent, parent and child will compute different colors in that round. On the other hand, if j = i, the symmetry is broken by p having a different bit at index i.

Regarding runtime, note that the size of the largest color shrinks dramatically in each round, apart from the symmetry-breaking bit, exactly as a logarithmic function. With some (tedious and boring) machinery, one can show that indeed every node will have a color in the range {0, . . . , 5} in log∗ n + k rounds.

Remarks:

• Let us have a closer look at the end game of the algorithm. Colors 11∗ (in binary notation, i.e., 6 or 7 in decimal notation) will not be chosen, because the node will then do another round. This gives a total of 6 colors (i.e., colors 0, . . . , 5).

• What about that last line of the loop? How do the nodes know that all nodes now have a color in the range {0, . . . , 5}? The answer to this question is surprisingly complex. One may hardwire the number of rounds into the until statement, such that all nodes execute the loop for exactly the same number of rounds. However, in order to do so, all nodes need to know n, the number of nodes, which is ugly. There are (non-trivial) solutions where nodes do not need to know n, see exercises.

• Can one reduce the number of colors? Note that Algorithm 2 does not work (since the degree of a node can be much higher than 6)! For fewer colors we need to have siblings monochromatic!

Algorithm 5 Shift Down
1: Each other node v concurrently executes the following code:
2:   Recolor v with the color of parent
3: Root chooses a new (different) color from {0, 1, 2}

Lemma 1.13. Algorithm 5 preserves coloring legality; also siblings are monochromatic.

Now Algorithm 2 can be used to reduce the number of used colors from 6 to 3.

Algorithm 6 Six-2-Three
1: Each node v concurrently executes the following code:
2: for x = 5, 4, 3 do
3:   Perform subroutine Shift Down (Algorithm 5)
4:   if cv = x then
5:     choose the smallest admissible new color cv ∈ {0, 1, 2}
6:   end if
7: end for

Theorem 1.14. Algorithms 4 and 6 color a tree with three colors in time O(log∗ n).
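The label-reduction step of Algorithm 4 is easy to misread, so here is a small Python sketch of one execution on a rooted path. The code is our own illustration (not the script's); it simulates the synchronous rounds with a plain loop and checks after every round that the coloring stays legal, i.e., the invariant from the proof of Theorem 1.12.

```python
# Sketch of Algorithm 4 ("6-Color") on a rooted tree given as: node -> parent.
# Labels start as unique IDs; the root assigns itself the label 0 and keeps it.

def cv_step(c_v, c_p, bits):
    """New label: index i of the lowest bit where c_v and c_p differ,
    followed by bit i of c_v (encoded as the number 2*i + bit)."""
    i = next(k for k in range(bits) if (c_v >> k) & 1 != (c_p >> k) & 1)
    return (i << 1) | ((c_v >> i) & 1)

def six_color(parent, ids):
    label = dict(ids)
    root = next(v for v in parent if parent[v] is None)
    label[root] = 0                            # the root assigns itself the label 0
    while any(c > 5 for c in label.values()):
        bits = max(label.values()).bit_length()
        old = dict(label)                      # all nodes update simultaneously
        for v in parent:
            if parent[v] is not None:
                label[v] = cv_step(old[v], old[parent[v]], bits)
        # the coloring stays legal after every round (Theorem 1.12)
        assert all(p is None or label[v] != label[p] for v, p in parent.items())
    return label

# A path rooted at node 0, with (already distinct) initial IDs as labels.
parent = {0: None, 1: 0, 2: 1, 3: 2, 4: 3}
ids = {0: 0, 1: 9, 2: 20, 3: 13, 4: 6}
final = six_color(parent, ids)
assert all(c in range(6) for c in final.values())   # six colors suffice
```

Running Algorithms 5 and 6 on top of this output would bring the six colors down to three; they follow the same kind of per-round loop.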
Figure 1.3: Possible execution of Algorithm 6.

Remarks:

• The term O() used in Theorem 1.12 is called "big O" and is often used in distributed computing. Roughly speaking, O(f) means "in the order of f, ignoring constant factors and smaller additive terms." More formally, for two functions f and g, it holds that f ∈ O(g) if there are constants x0 and c so that |f(x)| ≤ c|g(x)| for all x ≥ x0. For an elaborate discussion on the big O notation we refer to other introductory math or computer science classes, or Wikipedia.

• A fast tree-coloring with only 2 colors is more than exponentially more expensive than coloring with 3 colors. In a tree degenerated to a list, nodes far away need to figure out whether they are an even or odd number of hops away from each other in order to get a 2-coloring. To do that one has to send a message to these nodes. This costs time linear in the number of nodes.

• The idea of this algorithm can be generalized, e.g., to a ring topology. Also a general graph with constant degree ∆ can be colored with ∆ + 1 colors in O(log∗ n) time. The idea is as follows: In each step, a node compares its label to each of its neighbors, constructing a logarithmic difference-tag as in Algorithm 4. Then the new label is the concatenation of all the difference-tags. For constant degree ∆, this gives a 3∆-label in O(log∗ n) steps. Algorithm 2 then reduces the number of colors to ∆ + 1 in 2^{3∆} (this is still a constant for constant ∆!) steps.

• Unfortunately, coloring a general graph is not yet possible with this technique. We will see another technique for that in Chapter 7. With this technique it is possible to color a general graph with ∆ + 1 colors in O(log n) time.

• A lower bound shows that many of these log-star algorithms are asymptotically (up to constant factors) optimal. We will see that later.

Chapter Notes

The basic technique of the log-star algorithm is by Cole and Vishkin [CV86]. The technique can be generalized and extended, e.g., to a ring topology or to graphs with constant degree [GP87, GPS88, KMW05]. Using it as a subroutine, one can solve many problems in log-star time. For instance, one can color so-called growth bounded graphs (a model which includes many natural graph classes, for instance unit disk graphs) asymptotically optimally in O(log∗ n) time [SW08]. Actually, Schneider et al. show that many classic combinatorial problems beyond coloring can be solved in log-star time in growth bounded and other restricted graphs.

In a later chapter we learn an Ω(log∗ n) lower bound for coloring and related problems [Lin92]. Linial's paper also contains a number of other results on coloring, e.g., that any algorithm for coloring d-regular trees of radius r that runs in time at most 2r/3 requires at least Ω(√d) colors.

For general graphs, later we will learn fast coloring algorithms that use maximal independent sets as a base. Since coloring exhibits a trade-off between efficacy and efficiency, many different results for general graphs exist, e.g., [PS96, KSOS06, BE09, Kuh09, SW10, BE11b, KP11, BE11a, BEPS12, PS13, CPS14, BEK14].

Some parts of this chapter are also discussed in Chapter 7 of [Pel00], e.g., the proof of Theorem 1.12.

Bibliography

[BE09] Leonid Barenboim and Michael Elkin. Distributed (delta+1)-coloring in linear (in delta) time. In 41st ACM Symposium On Theory of Computing (STOC), 2009.

[BE11a] Leonid Barenboim and Michael Elkin. Combinatorial Algorithms for Distributed Graph Coloring. In 25th International Symposium on DIStributed Computing, 2011.

[BE11b] Leonid Barenboim and Michael Elkin. Deterministic Distributed Vertex Coloring in Polylogarithmic Time. J. ACM, 58(5):23, 2011.
BIBLIOGRAPHY 13 14 CHAPTER 1. VERTEX COLORING

[BEK14] Leonid Barenboim, Michael Elkin, and Fabian Kuhn. Distributed (∆+1)-coloring in linear (in ∆) time. SIAM J. Comput., 43(1):72–95, 2014.

[BEPS12] Leonid Barenboim, Michael Elkin, Seth Pettie, and Johannes Schneider. The locality of distributed symmetry breaking. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 321–330, 2012.

[CPS14] Kai-Min Chung, Seth Pettie, and Hsin-Hao Su. Distributed algorithms for the Lovász local lemma and graph coloring. In ACM Symposium on Principles of Distributed Computing, pages 134–143, 2014.

[CV86] R. Cole and U. Vishkin. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In 18th annual ACM Symposium on Theory of Computing (STOC), 1986.

[GP87] Andrew V. Goldberg and Serge A. Plotkin. Parallel (∆+1)-coloring of constant-degree graphs. Inf. Process. Lett., 25(4):241–245, June 1987.

[GPS88] Andrew V. Goldberg, Serge A. Plotkin, and Gregory E. Shannon. Parallel Symmetry-Breaking in Sparse Graphs. SIAM J. Discrete Math., 1(4):434–446, 1988.

[KMW05] Fabian Kuhn, Thomas Moscibroda, and Roger Wattenhofer. On the Locality of Bounded Growth. In 24th ACM Symposium on the Principles of Distributed Computing (PODC), Las Vegas, Nevada, USA, July 2005.

[KP11] Kishore Kothapalli and Sriram V. Pemmaraju. Distributed graph coloring in a few rounds. In 30th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC), 2011.

[KSOS06] Kishore Kothapalli, Christian Scheideler, Melih Onus, and Christian Schindelhauer. Distributed coloring in O(√log n) Bit Rounds. In 20th international conference on Parallel and Distributed Processing (IPDPS), 2006.

[Kuh09] Fabian Kuhn. Weak graph colorings: distributed algorithms and applications. In 21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2009.

[Lin92] N. Linial. Locality in Distributed Graph Algorithms. SIAM Journal on Computing, 21(1):193–201, February 1992.

[Pel00] David Peleg. Distributed Computing: a Locality-Sensitive Approach. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

[PS96] Alessandro Panconesi and Aravind Srinivasan. On the Complexity of Distributed Network Decomposition. J. Algorithms, 20(2):356–374, 1996.

[PS13] Seth Pettie and Hsin-Hao Su. Fast distributed coloring algorithms for triangle-free graphs. In Automata, Languages, and Programming - 40th International Colloquium, ICALP, pages 681–693, 2013.

[SW08] Johannes Schneider and Roger Wattenhofer. A Log-Star Distributed Maximal Independent Set Algorithm for Growth-Bounded Graphs. In 27th ACM Symposium on Principles of Distributed Computing (PODC), Toronto, Canada, August 2008.

[SW10] Johannes Schneider and Roger Wattenhofer. A New Technique For Distributed Symmetry Breaking. In 29th Symposium on Principles of Distributed Computing (PODC), Zurich, Switzerland, July 2010.
Chapter 2

Tree Algorithms

In this chapter we learn a few basic algorithms on trees, and how to construct trees in the first place so that we can run these (and other) algorithms. The good news is that these algorithms have many applications, the bad news is that this chapter is a bit on the simple side. But maybe that's not really bad news?!

2.1 Broadcast

Definition 2.1 (Broadcast). A broadcast operation is initiated by a single node, the source. The source wants to send a message to all other nodes in the system.

Definition 2.2 (Distance, Radius, Diameter). The distance between two nodes u and v in an undirected graph G is the number of hops of a minimum path between u and v. The radius of a node u is the maximum distance between u and any other node in the graph. The radius of a graph is the minimum radius of any node in the graph. The diameter of a graph is the maximum distance between two arbitrary nodes.

Remarks:

• Clearly there is a close relation between the radius R and the diameter D of a graph, such as R ≤ D ≤ 2R.

Definition 2.3 (Message Complexity). The message complexity of an algorithm is determined by the total number of messages exchanged.

Theorem 2.4 (Broadcast Lower Bound). The message complexity of broadcast is at least n − 1. The source's radius is a lower bound for the time complexity.

Proof: Every node must receive the message.

Remarks:

• You can use a pre-computed spanning tree to do broadcast with tight message complexity. If the spanning tree is a breadth-first search spanning tree (for a given source), then the time complexity is tight as well.

Definition 2.5 (Clean). A graph (network) is clean if the nodes do not know the topology of the graph.

Theorem 2.6 (Clean Broadcast Lower Bound). For a clean network, the number of edges m is a lower bound for the broadcast message complexity.

Proof: If you do not try every edge, you might miss a whole part of the graph behind it.

Definition 2.7 (Asynchronous Distributed Algorithm). In the asynchronous model, algorithms are event driven ("upon receiving message . . . , do . . . "). Nodes cannot access a global clock. A message sent from one node to another will arrive in finite but unbounded time.

Remarks:

• The asynchronous model and the synchronous model (Definition 1.6) are the cornerstone models in distributed computing. As they do not necessarily reflect reality there are several models in between synchronous and asynchronous. However, from a theoretical point of view the synchronous and the asynchronous model are the most interesting ones (because every other model is in between these extremes).

• Note that in the asynchronous model, messages that take a longer path may arrive earlier.

Definition 2.8 (Asynchronous Time Complexity). For asynchronous algorithms (as defined in 2.7) the time complexity is the number of time units from the start of the execution to its completion in the worst case (every legal input, every execution scenario), assuming that each message has a delay of at most one time unit.

Remarks:

• You cannot use the maximum delay in the algorithm design. In other words, the algorithm has to be correct even if there is no such delay upper bound.

• The clean broadcast lower bound (Theorem 2.6) directly brings us to the well known flooding algorithm.

Algorithm 7 Flooding
1: The source (root) sends the message to all neighbors.
2: Each other node v upon receiving the message the first time forwards the message to all (other) neighbors.
3: Upon later receiving the message again (over other edges), a node can discard the message.
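The flooding algorithm above can be sketched as a small, centralized simulation — this is an illustration, not a distributed implementation, and the function and variable names are made up:

```python
from collections import deque

def flood(adj, source):
    """Simulate Algorithm 7 on graph `adj` (dict: node -> neighbor list).
    Returns the parent relation (the spanning tree T) and the number of
    messages sent; every edge is traversed, matching Theorem 2.6."""
    parent = {source: None}
    frontier = deque([source])          # nodes forwarding the message this round
    messages = 0
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            messages += 1               # one message per incident edge
            if v not in parent:         # first reception: remember the parent
                parent[v] = u
                frontier.append(v)
    return parent, messages

# Ring of 6 nodes: every node gets a parent, and 2m = 12 messages are sent.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
parent, msgs = flood(ring, 0)
```

Because the simulation processes nodes in breadth-first order, the resulting tree is a BFS tree, as in a synchronous execution; an asynchronous scheduler could produce a different spanning tree.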
Remarks:

• If node v receives the message first from node u, then node v calls node u parent. This parent relation defines a spanning tree T. If the flooding algorithm is executed in a synchronous system, then T is a breadth-first search spanning tree (with respect to the root).

• More interestingly, also in asynchronous systems the flooding algorithm terminates after R time units, R being the radius of the source. However, the constructed spanning tree may not be a breadth-first search spanning tree.

2.2 Convergecast

Convergecast is the same as broadcast, just reversed: Instead of a root sending a message to all other nodes, all other nodes send information to a root (starting from the leaves, i.e., the tree T is known). The simplest convergecast algorithm is the echo algorithm:
Algorithm 8 Echo
1: A leaf sends a message to its parent.
2: If an inner node has received a message from each child, it sends a message to the parent.

Remarks:

• Usually the echo algorithm is paired with the flooding algorithm, which is used to let the leaves know that they should start the echo process; this is known as flooding/echo.

• One can use convergecast for termination detection, for example. If a root wants to know whether all nodes in the system have finished some task, it initiates a flooding/echo; the message in the echo algorithm then means "This subtree has finished the task."

• Message complexity of the echo algorithm is n − 1, but together with flooding it is O(m), where m = |E| is the number of edges in the graph.

• The time complexity of the echo algorithm is determined by the depth of the spanning tree (i.e., the radius of the root within the tree) generated by the flooding algorithm.

• The flooding/echo algorithm can do much more than collecting acknowledgements from subtrees. One can for instance use it to compute the number of nodes in the system, or the maximum ID, or the sum of all values stored in the system, or a route-disjoint matching.

• Moreover, by combining results one can compute even fancier aggregations, e.g., with the number of nodes and the sum one can compute the average. With the average one can compute the standard deviation. And so on . . .

2.3 BFS Tree Construction

In synchronous systems the flooding algorithm is a simple yet efficient method to construct a breadth-first search (BFS) spanning tree. However, in asynchronous systems the spanning tree constructed by the flooding algorithm may be far from BFS. In this section, we implement two classic BFS constructions—Dijkstra and Bellman-Ford—as asynchronous algorithms.

We start with the Dijkstra algorithm. The basic idea is to always add the "closest" node to the existing part of the BFS tree. We need to parallelize this idea by developing the BFS tree layer by layer. The algorithm proceeds in phases. In phase p the nodes with distance p to the root are detected. Let T_p be the tree in phase p.

Algorithm 9 Dijkstra BFS
1: We start with T_1 which is the root plus all direct neighbors of the root. We start with phase p = 1:
2: repeat
3: The root starts phase p by broadcasting "start p" within T_p.
4: When receiving "start p" a leaf node u of T_p (that is, a node that was newly discovered in the last phase) sends a "join p + 1" message to all quiet neighbors. (A neighbor v is quiet if u has not yet "talked" to v.)
5: A node v receiving the first "join p + 1" message replies with "ACK" and becomes a leaf of the tree T_{p+1}.
6: A node v receiving any further "join" message replies with "NACK".
7: The leaves of T_p collect all the answers of their neighbors; then the leaves start an echo algorithm back to the root.
8: When the echo process terminates at the root, the root increments the phase.
9: until there was no new node detected

Theorem 2.9. The time complexity of Algorithm 9 is O(D^2), the message complexity is O(m + nD), where D is the diameter of the graph, n the number of nodes, and m the number of edges.

Proof: A broadcast/echo algorithm in T_p needs at most time 2D. Finding new neighbors at the leaves costs 2 time units. Since the BFS tree height is bounded by the diameter, we have D phases, giving a total time complexity of O(D^2). Each node participating in broadcast/echo only receives (broadcasts) at most 1 message and sends (echoes) at most once. Since there are D phases, the cost is bounded by O(nD). On each edge there are at most 2 "join" messages. Replies to a "join" request are answered by 1 "ACK" or "NACK", which means that we have at most 4 additional messages per edge. Therefore the message complexity is O(m + nD).
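The layer-by-layer growth of Algorithm 9 can be sketched centrally, ignoring the broadcast/echo machinery inside the tree; this is a sketch only, and all names are made up:

```python
def dijkstra_bfs_layers(adj, root):
    """Centralized sketch of the phase structure of Algorithm 9: in phase p
    the nodes at distance p join the tree T_p. Returns each node's level."""
    level = {root: 0}
    leaves = [root]                      # nodes newly discovered in the last phase
    p = 0
    while leaves:
        p += 1
        next_leaves = []
        for u in leaves:
            for v in adj[u]:             # "join p" messages to quiet neighbors
                if v not in level:       # the first join is ACKed, later ones NACKed
                    level[v] = p
                    next_leaves.append(v)
        leaves = next_leaves
    return level

# 2x3 grid graph: the computed levels equal the BFS distances from the root.
grid = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
level = dijkstra_bfs_layers(grid, 0)
```

The number of iterations of the outer loop corresponds to the number of phases, which is bounded by the diameter D, matching the O(D^2) time bound once each phase is charged its broadcast/echo cost.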
Remarks:

• The time complexity is not very exciting, so let's try Bellman-Ford!

The basic idea of Bellman-Ford is even simpler, and heavily used in the Internet, as it is a basic version of the omnipresent border gateway protocol (BGP). The idea is to simply keep the distance to the root accurate. If a
neighbor has found a better route to the root, a node might also need to update its distance.

Algorithm 10 Bellman-Ford BFS
1: Each node u stores an integer d_u which corresponds to the distance from u to the root. Initially d_root = 0, and d_u = ∞ for every other node u.
2: The root starts the algorithm by sending "1" to all neighbors.
3: if a node u receives a message "y" with y < d_u from a neighbor v then
4: node u sets d_u := y
5: node u sends "y + 1" to all neighbors (except v)
6: end if

Theorem 2.10. The time complexity of Algorithm 10 is O(D), the message complexity is O(nm), where D, n, m are defined as in Theorem 2.9.

Proof: We can prove the time complexity by induction. We claim that a node at distance d from the root has received a message "d" by time d. The root knows by time 0 that it is the root. A node v at distance d has a neighbor u at distance d − 1. Node u by induction sends a message "d" to v at time d − 1 or before, which is then received by v at time d or before. Message complexity is easier: A node can reduce its distance at most n − 1 times; each of these times it sends a message to all its neighbors. If all nodes do this, then we have O(nm) messages.
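Algorithm 10 tolerates arbitrary message orderings. The following sketch simulates it with a randomized scheduler that delivers any in-transit message next — an illustration under assumed helper names, not the distributed protocol itself:

```python
import random

def bellman_ford_bfs(adj, root, seed=1):
    """Sketch of Algorithm 10 under an adversarial/random scheduler:
    messages carry distance estimates and may be delivered in any order;
    the distances still converge to the BFS distances from the root."""
    dist = {v: float("inf") for v in adj}
    dist[root] = 0
    in_transit = [(v, 1) for v in adj[root]]   # root sends "1" to all neighbors
    rng = random.Random(seed)
    while in_transit:
        # the scheduler picks an arbitrary in-transit message to deliver next
        u, y = in_transit.pop(rng.randrange(len(in_transit)))
        if y < dist[u]:                        # better route found: update
            dist[u] = y
            in_transit += [(w, y + 1) for w in adj[u]]
    return dist

# Ring of 5 nodes: distances from node 0 are 0, 1, 2, 2, 1.
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
dist = bellman_ford_bfs(ring, 0)
```

Since distance estimates only ever decrease and are bounded below by the true distances, the simulation terminates with the correct BFS distances for every delivery order, mirroring the proof of Theorem 2.10.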
Remarks:

• Algorithm 9 has the better message complexity and Algorithm 10 has the better time complexity. The currently best algorithm (optimizing both) needs O(m + n log^3 n) messages and O(D log^3 n) time. This "trade-off" algorithm is beyond the scope of this chapter, but we will later learn the general technique.

2.4 MST Construction

There are several types of spanning trees, each serving a different purpose. A particularly interesting spanning tree is the minimum spanning tree (MST). The MST only makes sense on weighted graphs, hence in this section we assume that each edge e is assigned a weight ω_e.

Definition 2.11 (MST). Given a weighted graph G = (V, E, ω), the MST of G is a spanning tree T minimizing ω(T), where ω(G′) = Σ_{e∈G′} ω_e for any subgraph G′ ⊆ G.

Remarks:

• In the following we assume that no two edges of the graph have the same weight. This simplifies the problem as it makes the MST unique; however, this simplification is not essential as one can always break ties by adding the IDs of adjacent vertices to the weight.

• Obviously we are interested in computing the MST in a distributed way. For this we use a well-known lemma:

Definition 2.12 (Blue Edges). Let T be a spanning tree of the weighted graph G and T′ ⊆ T a subgraph of T (also called a fragment). Edge e = (u, v) is an outgoing edge of T′ if u ∈ T′ and v ∉ T′ (or vice versa). The minimum weight outgoing edge b(T′) is the so-called blue edge of T′.

Lemma 2.13. For a given weighted graph G (such that no two weights are the same), let T denote the MST, and T′ be a fragment of T. Then the blue edge of T′ is also part of T, i.e., T′ ∪ b(T′) ⊆ T.

Proof: For the sake of contradiction, suppose that in the MST T there is edge e ≠ b(T′) connecting T′ with the remainder of T. Adding the blue edge b(T′) to the MST T we get a cycle including both e and b(T′). If we remove e from this cycle, then we still have a spanning tree, and since by the definition of the blue edge ω_e > ω_{b(T′)}, the weight of that new spanning tree is less than the weight of T. We have a contradiction.

Remarks:

• In other words, the blue edges seem to be the key to a distributed algorithm for the MST problem. Since every node itself is a fragment of the MST, every node directly has a blue edge! All we need to do is to grow these fragments! Essentially this is a distributed version of Kruskal's sequential algorithm.

• At any given time the nodes of the graph are partitioned into fragments (rooted subtrees of the MST). Each fragment has a root, the ID of the fragment is the ID of its root. Each node knows its parent and its children in the fragment. The algorithm operates in phases. At the beginning of a phase, nodes know the IDs of the fragments of their neighbor nodes.

Remarks:

• Algorithm 11 was stated in pseudo-code, with a few details not really explained. For instance, it may be that some fragments are much larger than others, and because of that some nodes may need to wait for others, e.g., if node u needs to find out whether neighbor v also wants to merge over the blue edge b = (u, v). The good news is that all these details can be solved. We can for instance bound the asynchronicity by guaranteeing that nodes only start the new phase after the last phase is done, similarly to the phase-technique of Algorithm 9.

Theorem 2.14. The time complexity of Algorithm 11 is O(n log n), the message complexity is O(m log n).

Proof: Each phase mainly consists of two flooding/echo processes. In general, the cost of flooding/echo on a tree is O(D) time and O(n) messages. However, the diameter D of the fragments may turn out to be not related to the diameter of the graph because the MST may meander, hence it really is O(n) time. In addition, in the first step of each phase, nodes need to learn the fragment ID of their neighbors; this can be done in 2 steps but costs O(m) messages. There are a few more steps, but they are cheap. Altogether a phase costs O(n) time and
Algorithm 11 GHS (Gallager–Humblet–Spira)
1: Initially each node is the root of its own fragment. We proceed in phases:
2: repeat
3: All nodes learn the fragment IDs of their neighbors.
4: The root of each fragment uses flooding/echo in its fragment to determine the blue edge b = (u, v) of the fragment.
5: The root sends a message to node u; while forwarding the message on the path from the root to node u all parent-child relations are inverted {such that u is the new temporary root of the fragment}
6: node u sends a merge request over the blue edge b = (u, v).
7: if node v also sent a merge request over the same blue edge b = (v, u) then
8: either u or v (whichever has the smaller ID) is the new fragment root
9: the blue edge b is directed accordingly
10: else
11: node v is the new parent of node u
12: end if
13: the newly elected root node informs all nodes in its fragment (again using flooding/echo) about its identity
14: until all nodes are in the same fragment (i.e., there is no outgoing edge)

O(m) messages. So we only have to figure out the number of phases: Initially all fragments are single nodes and hence have size 1. In a later phase, each fragment merges with at least one other fragment, that is, the size of the smallest fragment at least doubles. In other words, we have at most log n phases. The theorem follows directly.

Chapter Notes

Trees are one of the oldest graph structures, already appearing in the first book about graph theory [Koe36]. Broadcasting in distributed computing is younger, but not that much [DM78]. Overviews about broadcasting can be found for example in Chapter 3 of [Pel00] and Chapter 7 of [HKP+05]. For an introduction to centralized tree-construction, see e.g. [Eve79] or [CLRS09]. Overviews for the distributed case can be found in Chapter 5 of [Pel00] or Chapter 4 of [Lyn96]. The classic papers on routing are [For56, Bel58, Dij59]. In a later chapter, we will learn a general technique to derive algorithms with an almost optimal time and message complexity.

Algorithm 11 is called "GHS" after Gallager, Humblet, and Spira, three pioneers in distributed computing [GHS83]. Their algorithm won the prestigious Edsger W. Dijkstra Prize in Distributed Computing in 2004, among other reasons because it was one of the first non-trivial asynchronous distributed algorithms. As such it can be seen as one of the seeds of this research area. We presented a simplified version of GHS. The original paper featured an improved message complexity of O(m + n log n). Later, Awerbuch managed to further improve the GHS algorithm to get O(n) time and O(m + n log n) message complexity, both asymptotically optimal [Awe87].

Bibliography

[Awe87] B. Awerbuch. Optimal distributed algorithms for minimum weight spanning tree, counting, leader election, and related problems. In Proceedings of the nineteenth annual ACM symposium on Theory of computing, STOC '87, pages 230–240, New York, NY, USA, 1987. ACM.

[Bel58] Richard Bellman. On a Routing Problem. Quarterly of Applied Mathematics, 16:87–90, 1958.

[CLRS09] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms (3. ed.). MIT Press, 2009.

[Dij59] E. W. Dijkstra. A Note on Two Problems in Connexion with Graphs. Numerische Mathematik, 1(1):269–271, 1959.

[DM78] Y.K. Dalal and R.M. Metcalfe. Reverse path forwarding of broadcast packets. Communications of the ACM, 12:1040–1048, 1978.

[Eve79] S. Even. Graph Algorithms. Computer Science Press, Rockville, MD, 1979.

[For56] Lester R. Ford. Network Flow Theory. The RAND Corporation Paper P-923, 1956.

[GHS83] R. G. Gallager, P. A. Humblet, and P. M. Spira. Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Transactions on Programming Languages and Systems, 5(1):66–77, January 1983.

[HKP+05] Juraj Hromkovic, Ralf Klasing, Andrzej Pelc, Peter Ruzicka, and Walter Unger. Dissemination of Information in Communication Networks - Broadcasting, Gossiping, Leader Election, and Fault-Tolerance. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2005.

[Koe36] Denes Koenig. Theorie der endlichen und unendlichen Graphen. Teubner, Leipzig, 1936.

[Lyn96] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1996.

[Pel00] David Peleg. Distributed Computing: a Locality-Sensitive Approach. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
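Closing the chapter, the fragment growing of Algorithm 11 — every fragment selecting its blue edge and merging along it, justified by Lemma 2.13 — can be sketched centrally. This is a sketch of the blue-edge rule only (using a union-find structure), not of the distributed protocol, and all names are made up:

```python
def mst_by_blue_edges(n, edges):
    """Centralized sketch of the GHS phase structure: in every phase each
    fragment selects its minimum weight outgoing (blue) edge, and fragments
    are merged along all selected edges. `edges` is a list of (weight, u, v)
    tuples with pairwise distinct weights, as assumed in Section 2.4."""
    frag = list(range(n))                # fragment representative per node

    def find(x):                         # union-find with path halving
        while frag[x] != x:
            frag[x] = frag[frag[x]]
            x = frag[x]
        return x

    mst = set()
    while len(mst) < n - 1:
        blue = {}                        # fragment -> its blue edge
        for w, u, v in sorted(edges):    # cheapest outgoing edge comes first
            fu, fv = find(u), find(v)
            if fu != fv:
                blue.setdefault(fu, (w, u, v))
                blue.setdefault(fv, (w, u, v))
        for w, u, v in blue.values():    # merge fragments along blue edges
            fu, fv = find(u), find(v)
            if fu != fv:
                frag[max(fu, fv)] = min(fu, fv)
                mst.add((w, u, v))
    return mst

# 4-cycle with a chord: the MST consists of the three lightest edges.
edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 3, 0), (5, 0, 2)]
mst = mst_by_blue_edges(4, edges)
```

Because each fragment merges with at least one other fragment per phase, the outer loop runs at most log n times — the same doubling argument used in the proof of Theorem 2.14.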
Chapter 3

Leader Election

Some algorithms (e.g. the slow tree coloring Algorithm 3) ask for a special node, a so-called "leader". Computing a leader is a very simple form of symmetry breaking. Algorithms based on leaders do generally not exhibit a high degree of parallelism, and therefore often suffer from poor time complexity. However, sometimes it is still useful to have a leader to make critical decisions in an easy (though non-distributed!) way.

3.1 Anonymous Leader Election

The process of choosing a leader is known as leader election. Although leader election is a simple form of symmetry breaking, there are some remarkable issues that allow us to introduce notable computational models.

In this chapter we concentrate on the ring topology. Many interesting challenges in distributed computing already reveal the root of the problem in the special case of the ring. Paying attention to the ring also makes sense from a practical point of view as some real world systems are based on a ring topology, e.g., the antiquated token ring standard.

Problem 3.1 (Leader Election). Each node eventually decides whether it is a leader or not, subject to the constraint that there is exactly one leader.

Remarks:

• More formally, nodes are in one of three states: undecided, leader, not leader. Initially every node is in the undecided state. When leaving the undecided state, a node goes into a final state (leader or not leader).

Definition 3.2 (Anonymous). A system is anonymous if nodes do not have unique identifiers.

Definition 3.3 (Uniform). An algorithm is called uniform if the number of nodes n is not known to the algorithm (to the nodes, if you wish). If n is known, the algorithm is called non-uniform.

Whether a leader can be elected in an anonymous system depends on whether the network is symmetric (ring, complete graph, complete bipartite graph, etc.) or asymmetric (star, single node with highest degree, etc.). We will now show that non-uniform anonymous leader election for synchronous rings is impossible. The idea is that in a ring, symmetry can always be maintained.

Lemma 3.4. After round k of any deterministic algorithm on an anonymous ring, each node is in the same state s_k.

Proof by induction: All nodes start in the same state. A round in a synchronous algorithm consists of the three steps sending, receiving, local computation (see Definition 1.6). All nodes send the same message(s), receive the same message(s), do the same local computation, and therefore end up in the same state.

Theorem 3.5 (Anonymous Leader Election). Deterministic leader election in an anonymous ring is impossible.

Proof (with Lemma 3.4): If one node ever decides to become a leader (or a non-leader), then every other node does so as well, contradicting the problem specification 3.1 for n > 1. This holds for non-uniform algorithms, and therefore also for uniform algorithms. Furthermore, it holds for synchronous algorithms, and therefore also for asynchronous algorithms.

Remarks:

• Sense of direction is the ability of nodes to distinguish neighbor nodes in an anonymous setting. In a ring, for example, a node can distinguish the clockwise and the counterclockwise neighbor. Sense of direction does not help in anonymous leader election.

• Theorem 3.5 also holds for other symmetric network topologies (e.g., complete graphs, complete bipartite graphs, . . . ).

• Note that Theorem 3.5 does generally not hold for randomized algorithms; if nodes are allowed to toss a coin, some symmetries can be broken.

• However, more surprisingly, randomization does not always help. A randomized uniform anonymous algorithm can for instance not elect a leader in a ring. Randomization does not help to decide whether the ring has n = 3 or n = 6 nodes: Every third node may generate the same random bits, and as a result the nodes cannot distinguish the two cases. However, an approximation of n which is strictly better than a factor 2 will help.
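The induction of Lemma 3.4 can be checked experimentally: whatever deterministic transition function the nodes run, an anonymous ring stays perfectly symmetric. The sketch below uses an arbitrary, made-up transition rule — any deterministic rule exhibits the same behavior:

```python
def simulate_anonymous_ring(n, rounds, transition):
    """Sketch of Lemma 3.4: on an anonymous ring every node starts in the
    same state and runs the same deterministic
    transition(state, left_msg, right_msg), so after every synchronous round
    all nodes are still in the same state -- no node can become the leader."""
    states = [0] * n                     # identical initial states
    for _ in range(rounds):
        msgs = list(states)              # each node sends its state to both neighbors
        states = [transition(states[i], msgs[(i - 1) % n], msgs[(i + 1) % n])
                  for i in range(n)]
        assert len(set(states)) == 1     # symmetry is maintained every round
    return states

# An arbitrary deterministic rule: the states evolve, but stay identical.
final = simulate_anonymous_ring(n=5, rounds=10,
                                transition=lambda s, l, r: s + l + r + 1)
```

The inner assertion is exactly the inductive step of Lemma 3.4: equal states produce equal messages, which produce equal next states.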
3.2 Asynchronous Ring

We first concentrate on the asynchronous model from Definition 2.7. Throughout this section we assume non-anonymity; each node has a unique identifier. Having ID's seems to lead to a trivial leader election algorithm, as we can simply elect the node with, e.g., the highest ID.

Theorem 3.6. Algorithm 12 is correct. The time complexity is O(n). The message complexity is O(n^2).
Algorithm 12 Clockwise Leader Election
1: Each node v executes the following code:
2: v sends a message with its identifier (for simplicity also v) to its clockwise neighbor.
3: v sets m := v, the largest identifier seen so far
4: if v receives a message w with w > m then
5: v forwards message w to its clockwise neighbor and sets m := w
6: v decides not to be the leader, if it has not done so already.
7: else if v receives its own identifier v then
8: v decides to be the leader
9: end if

Proof: Let node z be the node with the maximum identifier. Node z sends its identifier in clockwise direction, and since no other node can swallow it, eventually a message will arrive at z containing it. Then z declares itself to be the leader. Every other node will declare non-leader at the latest when forwarding message z. Since there are n identifiers in the system, each node will at most forward n messages, giving a message complexity of at most n^2. We start measuring the time when the first node that "wakes up" sends its identifier. For asynchronous time complexity (Definition 2.8) we assume that each message takes at most one time unit to arrive at its destination. After at most n − 1 time units the message therefore arrives at node z, waking z up. Routing the message z around the ring takes at most n time units. Therefore node z decides no later than at time 2n − 1. Every other node decides before node z.
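A round-based sketch of the clockwise election idea, run on a hypothetical ring of identifiers (a synchronous simulation for illustration — Algorithm 12 itself is asynchronous, and all names below are made up):

```python
def clockwise_election(ids):
    """Sketch of Algorithm 12: each node forwards only identifiers larger than
    any it has seen; the owner of the maximum identifier sees its own ID come
    back around the ring and becomes the leader."""
    n = len(ids)
    seen = list(ids)                     # m_v: largest identifier seen so far
    forwarded = list(ids)                # initially every node sends its own ID
    leader, messages = None, 0
    for _ in range(n):                   # n synchronous rounds suffice here
        incoming = [forwarded[(i - 1) % n] for i in range(n)]
        forwarded = [None] * n
        for i, w in enumerate(incoming):
            if w is None:
                continue                 # the predecessor swallowed its message
            messages += 1
            if w == ids[i]:
                leader = ids[i]          # own identifier made it around the ring
            elif w > seen[i]:
                seen[i] = w
                forwarded[i] = w         # forward only larger identifiers
    return leader, messages

leader, messages = clockwise_election([4, 7, 2, 9, 1])
```

On this example the maximum identifier 9 wins; in the worst case (identifiers sorted counterclockwise) the simulation exhibits the quadratic message count of Theorem 3.6.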
Remarks:

• Note that in Algorithm 12 nodes distinguish between clockwise and counterclockwise neighbors. This is not necessary: It is okay to simply send your own identifier to any neighbor, and forward a message to the neighbor you did not receive the message from. So nodes only need to be able to distinguish their two neighbors.

• Careful analysis shows that, while having worst-case message complexity of O(n^2), Algorithm 12 has an average message complexity of O(n log n). Can we improve this algorithm?

Theorem 3.7. Algorithm 13 is correct. The time complexity is O(n). The message complexity is O(n log n).

Algorithm 13 Radius Growth
1: Each node v does the following:
2: Initially all nodes are active. {all nodes may still become leaders}
3: Whenever a node v sees a message w with w > v, then v decides to not be a leader and becomes passive.
4: Active nodes search in an exponentially growing neighborhood (clockwise and counterclockwise) for nodes with higher identifiers, by sending out probe messages. A probe message includes the ID of the original sender, a bit whether the sender can still become a leader, and a time-to-live number (TTL). The first probe message sent by node v includes a TTL of 1.
5: Nodes (active or passive) receiving a probe message decrement the TTL and forward the message to the next neighbor; if their ID is larger than the one in the message, they set the leader bit to zero, as the probing node does not have the maximum ID. If the TTL is zero, probe messages are returned to the sender using a reply message. The reply message contains the ID of the receiver (the original sender of the probe message) and the leader-bit. Reply messages are forwarded by all nodes until they reach the receiver.
6: Upon receiving the reply message: If there was no node with higher ID in the search area (indicated by the bit in the reply message), the TTL is doubled and two new probe messages are sent (again to the two neighbors). If there was a better candidate in the search area, then the node becomes passive.
7: If a node v receives its own probe message (not a reply) v decides to be the leader.

Proof: Correctness is as in Theorem 3.6. The time complexity is O(n) since the node with maximum identifier z sends messages with round-trip times 2, 4, 8, 16, . . . , 2 · 2^k with k ≤ log(n + 1). (Even if we include the additional wake-up overhead, the time complexity stays linear.) Proving the message complexity is slightly harder: if a node v manages to survive round r, no other node in distance 2^r (or less) survives round r. That is, node v is the only node in its 2^r-neighborhood that remains active in round r + 1. Since this is the same for every node, less than n/2^r nodes are active in round r + 1. Being active in round r costs 2 · 2 · 2^r messages. Therefore, round r costs at most 2 · 2 · 2^r · n/2^(r−1) = 8n messages. Since there are only logarithmically many possible rounds, the message complexity follows immediately.

Remarks:

• This algorithm is asynchronous and uniform as well.

• The question may arise whether one can design an algorithm with an even lower message complexity. We answer this question in the next section.

3.3 Lower Bounds

Lower bounds in distributed computing are often easier than in the standard centralized (random access machine, RAM) model because one can argue about messages that need to be exchanged. In this section we present a first difficult lower bound. We show that Algorithm 13 is asymptotically optimal.

Definition 3.8 (Execution). An execution of a distributed algorithm is a list of events, sorted by time. An event is a record (time, node, type, message), where type is "send" or "receive".
3.3. LOWER BOUNDS 27 28 CHAPTER 3. LEADER ELECTION

Remarks:

• We assume throughout this course that no two events happen at exactly the same time (or one can break ties arbitrarily).

• An execution of an asynchronous algorithm is generally not only determined by the algorithm but also by a "god-like" scheduler. If more than one message is in transit, the scheduler can choose which one arrives first.

• If two messages are transmitted over the same directed edge, then it is sometimes required that the message first transmitted will also be received first ("FIFO").

For our lower bound, we assume the following model:

• We are given an asynchronous ring, where nodes may wake up at arbitrary times (but at the latest when receiving the first message).

• We only accept uniform algorithms where the node with the maximum identifier can be the leader. Additionally, every node that is not the leader must know the identity of the leader. These two requirements can be dropped when using a more complicated proof; however, this is beyond the scope of this course.

• During the proof we will "play god" and specify which message in transmission arrives next in the execution. We respect the FIFO conditions for links.

Definition 3.9 (Open Schedule). A schedule is an execution chosen by the scheduler. An open (undirected) edge is an edge where no message traversing the edge has been received so far. A schedule for a ring is open if there is an open edge in the ring.

The proof of the lower bound is by induction. First we show the base case:

Lemma 3.10. Given a ring R with two nodes, we can construct an open schedule in which at least one message is received. The nodes cannot distinguish this schedule from one on a larger ring with all other nodes being where the open edge is.

Proof: Let the two nodes be u and v with u < v. Node u must learn the identity of node v, thus receive at least one message. We stop the execution of the algorithm as soon as the first message is received. (If the first message is received by v, bad luck for the algorithm!) Then the other edge in the ring (on which the received message was not transmitted) is open. Since the algorithm needs to be uniform, maybe the open edge is not really an edge at all, nobody can tell. We could use this to glue two rings together, by breaking up this imaginary open edge and connecting the two rings by two edges. An example can be seen in Figure 3.1.

Lemma 3.11. By gluing together two rings of size n/2 for which we have open schedules, we can construct an open schedule on a ring of size n. If M(n/2) denotes the number of messages already received in each of these schedules, at least 2M(n/2) + n/4 messages have to be exchanged in order to solve leader election.

[Figure 3.1: The rings R1, R2 are glued together at their open edge.]

Proof by induction: We divide the ring into two sub-rings R1 and R2 of size n/2. These sub-rings cannot be distinguished from rings with n/2 nodes if no messages are received from "outsiders". We can ensure this by not scheduling such messages until we want to. Note that executing both given open schedules on R1 and R2 "in parallel" is possible because we control not only the scheduling of the messages, but also when nodes wake up. By doing so, we make sure that 2M(n/2) messages are sent before the nodes in R1 and R2 learn anything of each other!

Without loss of generality, R1 contains the maximum identifier. Hence, each node in R2 must learn the identity of the maximum identifier, thus at least n/2 additional messages must be received. The only problem is that we cannot connect the two sub-rings with both edges since the new ring needs to remain open. Thus, only messages over one of the edges can be received. We look into the future: we check what happens when we close only one of these connecting edges.

Since we know that n/2 nodes have to be informed in R2, there must be at least n/2 messages that must be received. Closing both edges must inform n/2 nodes, thus for one of the two edges there must be a node in distance n/4 which will be informed upon creating that edge. This results in n/4 additional messages. Thus, we pick this edge and leave the other one open, which yields the claim.

Lemma 3.12. Any uniform leader election algorithm for asynchronous rings has at least message complexity M(n) ≥ n/4 · (log n + 1).

Proof by induction: For the sake of simplicity we assume n being a power of 2. The base case n = 2 works because of Lemma 3.10, which implies that M(2) ≥ 1 = 2/4 · (log 2 + 1). For the induction step, using Lemma 3.11 and the induction hypothesis we have

  M(n) = 2 · M(n/2) + n/4
       ≥ 2 · (n/8 · (log(n/2) + 1)) + n/4
       = n/4 · log n + n/4 = n/4 · (log n + 1).
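The closed form in Lemma 3.12 can be sanity-checked by unrolling the recurrence of Lemma 3.11 numerically. A small Python sketch (assuming n a power of 2, as in the proof):

```python
import math

def M(n):
    # Recurrence from Lemmas 3.11/3.12, n a power of 2.
    if n == 2:
        return 1
    return 2 * M(n // 2) + n // 4

for k in range(1, 11):
    n = 2 ** k
    closed_form = n / 4 * (math.log2(n) + 1)
    assert M(n) == closed_form, (n, M(n), closed_form)
print("M(n) = n/4 (log n + 1) holds for n = 2, 4, ..., 1024")
```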
Remarks:

• To hide the ugly constants we use the "big Omega" notation, the lower bound equivalent of O(). A function f is in Ω(g) if there are constants x0 and c > 0 such that |f(x)| ≥ c|g(x)| for all x ≥ x0.

• In addition to the already presented parts of the "big O" notation, there are 3 additional ones. Remember that a function f is in O(g) if f grows at most as fast as g. A function f is in o(g) if f grows slower than g.

• An analogous small letter notation exists for Ω. A function f is in ω(g) if f grows faster than g.

• Last but not least, we say that a function f is in Θ(g) if f grows as fast as g, i.e., f ∈ O(g) and f ∈ Ω(g).

• Again, we refer to standard text books for formal definitions.

Theorem 3.13 (Asynchronous Leader Election Lower Bound). Any uniform leader election algorithm for asynchronous rings has Ω(n log n) message complexity.

3.4 Synchronous Ring

The lower bound relied on delaying messages for a very long time. Since this is impossible in the synchronous model, we might get a better message complexity in this case. The basic idea is very simple: In the synchronous model, not receiving a message is information as well! First we make some additional assumptions:

• We assume that the algorithm is non-uniform (i.e., the ring size n is known).

• We assume that every node starts at the same time.

• The node with the minimum identifier becomes the leader; identifiers are integers.

Algorithm 14 Synchronous Leader Election
1: Each node v concurrently executes the following code:
2: The algorithm operates in synchronous phases. Each phase consists of n time steps. Node v counts phases, starting with 0.
3: if phase = v and v did not yet receive a message then
4:   v decides to be the leader
5:   v sends the message "v is leader" around the ring
6: end if

Remarks:

• Message complexity is indeed n.

• But the time complexity is huge! If m is the minimum identifier it is m · n.

• The synchronous start and the non-uniformity assumptions can be dropped by using a wake-up technique (upon receiving a wake-up message, wake up your clockwise neighbors) and by letting messages travel slowly.

• There are several lower bounds for the synchronous model: comparison-based algorithms or algorithms where the time complexity cannot be a function of the identifiers have message complexity Ω(n log n) as well.

• In general graphs, efficient leader election may be tricky. While time-optimal leader election can be done by parallel flooding-echo (see Chapter 2), bounding the message complexity is more difficult.

Chapter Notes

[Ang80] was the first to mention the now well-known impossibility result for anonymous rings and other networks, even when using randomization. The first algorithm for asynchronous rings was presented in [Lan77], which was improved to the presented clockwise algorithm in [CR79]. Later, [HS80] found the radius growth algorithm, which decreased the worst case message complexity. Algorithms for the unidirectional case with runtime O(n log n) can be found in [DKR82, Pet82]. The Ω(n log n) message complexity lower bound for comparison based algorithms was first published in [FL87]. In [Sch89] an algorithm with constant error probability for anonymous networks is presented. General results about limitations of computer power in synchronous rings are in [ASW88, AS88].

Bibliography

[Ang80] Dana Angluin. Local and global properties in networks of processors (Extended Abstract). In 12th ACM Symposium on Theory of Computing (STOC), 1980.

[AS88] Hagit Attiya and Marc Snir. Better Computing on the Anonymous Ring. In Aegean Workshop on Computing (AWOC), 1988.

[ASW88] Hagit Attiya, Marc Snir, and Manfred K. Warmuth. Computing on an anonymous ring. J. ACM, 35(4):845–875, 1988.

[CR79] Ernest Chang and Rosemary Roberts. An improved algorithm for decentralized extrema-finding in circular configurations of processes. Commun. ACM, 22(5):281–283, May 1979.

[DKR82] Danny Dolev, Maria M. Klawe, and Michael Rodeh. An O(n log n) Unidirectional Distributed Algorithm for Extrema Finding in a Circle. J. Algorithms, 3(3):245–260, 1982.
[FL87] Greg N. Frederickson and Nancy A. Lynch. Electing a leader in a synchronous ring. J. ACM, 34(1):98–115, 1987.

[HS80] D. S. Hirschberg and J. B. Sinclair. Decentralized extrema-finding in circular configurations of processors. Commun. ACM, 23(11):627–628, November 1980.

[Lan77] Gérard Le Lann. Distributed Systems - Towards a Formal Approach. In International Federation for Information Processing (IFIP) Congress, 1977.

[Pet82] Gary L. Peterson. An O(n log n) Unidirectional Algorithm for the Circular Extrema Problem. ACM Trans. Program. Lang. Syst., 4(4):758–762, 1982.

[Sch89] B. Schieber. Calling names on nameless networks. In Proceedings of the eighth annual ACM Symposium on Principles of distributed computing, PODC '89, pages 319–328, New York, NY, USA, 1989. ACM.
Chapter 4

Distributed Sorting

"Indeed, I believe that virtually every important aspect of programming arises somewhere in the context of sorting [and searching]!"
– Donald E. Knuth, The Art of Computer Programming

In this chapter we study a classic problem in computer science—sorting—from a distributed computing perspective. In contrast to an orthodox single-processor sorting algorithm, no node has access to all data, instead the to-be-sorted values are distributed. Distributed sorting then boils down to:

Definition 4.1 (Sorting). We choose a graph with n nodes v1, . . . , vn. Initially each node stores a value. After applying a sorting algorithm, node vk stores the kth smallest value.

Remarks:

• What if we route all values to the same central node v, let v sort the values locally, and then route them to the correct destinations?! According to the message passing model studied in the first few chapters this is perfectly legal. With a star topology sorting finishes in O(1) time!

Definition 4.2 (Node Contention). In each step of a synchronous algorithm, each node can only send and receive O(1) messages containing O(1) values, no matter how many neighbors the node has.

Remarks:

• Using Definition 4.2 sorting on a star graph takes linear time.

4.1 Array & Mesh

To get a better intuitive understanding of distributed sorting, we start with two simple topologies, the array and the mesh. Let us begin with the array:

Algorithm 15 Odd/Even Sort
1: Given an array of n nodes (v1, . . . , vn), each storing a value (not sorted).
2: repeat
3:   Compare and exchange the values at nodes i and i + 1, i odd
4:   Compare and exchange the values at nodes i and i + 1, i even
5: until done

Remarks:

• The compare and exchange primitive in Algorithm 15 is defined as follows: Let the value stored at node i be vi. After the compare and exchange, node i stores value min(vi, vi+1) and node i + 1 stores value max(vi, vi+1).

• How fast is the algorithm, and how can we prove correctness/efficiency?

• The most interesting proof uses the so-called 0-1 Sorting Lemma. It allows us to restrict our attention to an input of 0's and 1's only, and works for any "oblivious comparison-exchange" algorithm. (Oblivious means: Whether you exchange two values must only depend on the relative order of the two values, and not on anything else.)

Lemma 4.3 (0-1 Sorting Lemma). If an oblivious comparison-exchange algorithm sorts all inputs of 0's and 1's, then it sorts arbitrary inputs.

Proof. We prove the opposite direction (does not sort arbitrary inputs ⇒ does not sort 0's and 1's). Assume that there is an input x = x1, . . . , xn that is not sorted correctly. Then there is a smallest value k such that the value at node vk after running the sorting algorithm is strictly larger than the kth smallest value x(k). Define an input x*_i = 0 ⇔ xi ≤ x(k), x*_i = 1 else. Whenever the algorithm compares a pair of 1's or 0's, it is not important whether it exchanges the values or not, so we may simply assume that it does the same as on the input x. On the other hand, whenever the algorithm exchanges some values x*_i = 0 and x*_j = 1, this means that xi ≤ x(k) < xj. Therefore, in this case the respective compare-exchange operation will do the same on both inputs. We conclude that the algorithm will order x* the same way as x, i.e., the output with only 0's and 1's will also not be correct.

Theorem 4.4. Algorithm 15 sorts correctly in n steps.

Proof. Thanks to Lemma 4.3 we only need to consider an array with 0's and 1's. Let j1 be the node with the rightmost (highest index) 1. If j1 is odd (even) it will move in the first (second) step. In any case it will move right in every following step until it reaches the rightmost node vn. Let jk be the node with the kth rightmost 1. We show by induction that jk is not "blocked" anymore (constantly moves until it reaches its destination!) after step k. We have already anchored the induction at k = 1. Since jk−1 moves after step k − 1, jk gets a right 0-neighbor for each step after step k. (For matters of presentation we omitted a couple of simple details.)
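Algorithm 15 and the n-step bound of Theorem 4.4 are easy to simulate. Below is a Python sketch of a round-based sequential stand-in for the n nodes (our own formulation; with 0-based indices, the 1-based pairs "(i, i+1), i odd" of Algorithm 15 become the even-indexed pairs):

```python
def odd_even_sort(values):
    a = list(values)
    n = len(a)
    for step in range(n):  # n steps suffice (Theorem 4.4)
        # 1-based nodes (i, i+1), i odd, are 0-based pairs (0,1), (2,3), ...
        # 1-based nodes (i, i+1), i even, are 0-based pairs (1,2), (3,4), ...
        start = 0 if step % 2 == 0 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:  # compare and exchange: min left, max right
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```

Since the algorithm is an oblivious comparison-exchange algorithm, checking it on all 0-1 inputs of a fixed length already certifies it for arbitrary inputs of that length, by Lemma 4.3.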
Remarks:

• Linear time is not very exciting, maybe we can do better by using a different topology? Let's try a mesh (a.k.a. grid) topology first.

Algorithm 16 Shearsort
1: We are given a mesh with m rows and m columns, m even, n = m^2.
2: The sorting algorithm operates in phases, and uses the odd/even sort algorithm on rows or columns.
3: repeat
4:   In the odd phases 1, 3, . . . we sort all the rows, in the even phases 2, 4, . . . we sort all the columns, such that:
5:   Columns are sorted such that the small values move up.
6:   Odd rows (1, 3, . . . , m − 1) are sorted such that small values move left.
7:   Even rows (2, 4, . . . , m) are sorted such that small values move right.
8: until done

Theorem 4.5. Algorithm 16 sorts n values in √n (log n + 1) time in snake-like order.

Proof. Since the algorithm is oblivious, we can use Lemma 4.3. We show that after a row and a column phase, half of the previously unsorted rows will be sorted. More formally, let us call a row with only 0's (or only 1's) clean; a row with 0's and 1's is dirty. At any stage, the rows of the mesh can be divided into three regions. In the north we have a region of all-0 rows, in the south all-1 rows, in the middle a region of dirty rows. Initially all rows can be dirty. Since neither row nor column sort will touch already clean rows, we can concentrate on the dirty rows.

First we run an odd phase. Then, in the even phase, we run a peculiar column sorter: We group two consecutive dirty rows into pairs. Since odd and even rows are sorted in opposite directions, two consecutive dirty rows look as follows:

00000 . . . 11111
11111 . . . 00000

Such a pair can be in one of three states. Either we have more 0's than 1's, or more 1's than 0's, or an equal number of 0's and 1's. Column-sorting each pair will give us at least one clean row (and two clean rows if "|0| = |1|"). Then move the cleaned rows north/south and we will be left with half the dirty rows.

At first glance it appears that we need such a peculiar column sorter. However, any column sorter sorts the columns in exactly the same way (we are very grateful to have Lemma 4.3!).

All in all we need 2 log m = log n phases to remain with only 1 dirty row in the middle, which will be sorted (not cleaned) with the last row-sort.

Remarks:

• There are algorithms that sort in 3m + o(m) time on an m by m mesh (by dividing the mesh into smaller blocks). This is asymptotically optimal, since a value might need to move 2m times.

• Such a √n-sorter is cute, but we are more ambitious. There are non-distributed sorting algorithms such as quicksort, heapsort, or mergesort that sort n values in (expected) O(n log n) time. Using our n-fold parallelism effectively we might therefore hope for a distributed sorting algorithm that sorts in time O(log n)!

4.2 Sorting Networks

In this section we construct a graph topology which is carefully manufactured for sorting. This is a deviation from previous chapters where we always had to work with the topology that was given to us. In many application areas (e.g. peer-to-peer networks, communication switches, systolic hardware) it is indeed possible (in fact, crucial!) that an engineer can build the topology best suited for her application.

Definition 4.6 (Sorting Networks). A comparator is a device with two inputs x, y and two outputs x′, y′ such that x′ = min(x, y) and y′ = max(x, y). We construct so-called comparison networks that consist of wires that connect comparators (the output port of a comparator is sent to an input port of another comparator). Some wires are not connected to comparator outputs, and some are not connected to comparator inputs. The first are called input wires of the comparison network, the second output wires. Given n values on the input wires, a sorting network ensures that the values are sorted on the output wires. We will also use the term width to indicate the number of wires in the sorting network.

Remarks:

• The odd/even sorter explained in Algorithm 15 can be described as a sorting network.

• Often we will draw all the wires on n horizontal lines (n being the "width" of the network). Comparators are then vertically connecting two of these lines.

• Note that a sorting network is an oblivious comparison-exchange network. Consequently we can apply Lemma 4.3 throughout this section. An example sorting network is depicted in Figure 4.1.

Definition 4.7 (Depth). The depth of an input wire is 0. The depth of a comparator is the maximum depth of its input wires plus one. The depth of an output wire of a comparator is the depth of the comparator. The depth of a comparison network is the maximum depth (of an output wire).

Definition 4.8 (Bitonic Sequence). A bitonic sequence is a sequence of numbers that first monotonically increases, and then monotonically decreases, or vice versa.
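Definition 4.8 admits a quick mechanical test: a sequence is bitonic (in the sense defined here, i.e., without allowing rotations) exactly if the direction of travel between consecutive distinct elements changes at most once. A Python sketch (function name is our own):

```python
def is_bitonic(seq):
    """Definition 4.8: first monotonically increasing then decreasing,
    or vice versa; equivalently, at most one direction change."""
    directions = []
    for a, b in zip(seq, seq[1:]):
        if a != b:                    # ignore plateaus of equal values
            directions.append(a < b)  # True = increasing step
    changes = sum(1 for d, e in zip(directions, directions[1:]) if d != e)
    return changes <= 1

print(is_bitonic([1, 4, 6, 8, 3, 2]))  # True
print(is_bitonic([9, 6, 2, 3, 5, 4]))  # False
```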
[Figure 4.1: A sorting network.]

Remarks:

• < 1, 4, 6, 8, 3, 2 > or < 5, 3, 2, 1, 4, 8 > are bitonic sequences.

• < 9, 6, 2, 3, 5, 4 > or < 7, 4, 2, 5, 9, 8 > are not bitonic.

• Since we restrict ourselves to 0's and 1's (Lemma 4.3), bitonic sequences have the form 0^i 1^j 0^k or 1^i 0^j 1^k for i, j, k ≥ 0.

Algorithm 17 Half Cleaner
1: A half cleaner is a comparison network of depth 1, where we compare wire i with wire i + n/2 for i = 1, . . . , n/2 (we assume n to be even).

Lemma 4.9. Feeding a bitonic sequence into a half cleaner (Algorithm 17), the half cleaner cleans (makes all-0 or all-1) either the upper or the lower half of the n wires. The other half is bitonic.

Proof. Assume that the input is of the form 0^i 1^j 0^k for i, j, k ≥ 0. If the midpoint falls into the 0's, the input is already clean/bitonic and will stay so. If the midpoint falls into the 1's, the half cleaner acts as Shearsort with two adjacent rows, exactly as in the proof of Theorem 4.5. The case 1^i 0^j 1^k is symmetric.

Algorithm 18 Bitonic Sequence Sorter
1: A bitonic sequence sorter of width n (n being a power of 2) consists of a half cleaner of width n, and then two bitonic sequence sorters of width n/2 each.
2: A bitonic sequence sorter of width 1 is empty.

Lemma 4.10. A bitonic sequence sorter (Algorithm 18) of width n sorts bitonic sequences. It has depth log n.

Proof. The proof follows directly from Algorithm 18 and Lemma 4.9.

Remarks:

• Clearly we want to sort arbitrary and not only bitonic sequences! To do this we need one more concept, merging networks.

Algorithm 19 Merging Network
1: A merging network of width n is a merger of width n followed by two bitonic sequence sorters of width n/2. A merger is a depth-one network where we compare wire i with wire n − i + 1, for i = 1, . . . , n/2.

Remarks:

• Note that a merging network is a bitonic sequence sorter where we replace the (first) half cleaner by a merger.

Lemma 4.11. A merging network of width n (Algorithm 19) merges two sorted input sequences of length n/2 each into one sorted sequence of length n.

Proof. We have two sorted input sequences. Essentially, a merger does to two sorted sequences what a half cleaner does to a bitonic sequence, since the lower part of the input is reversed. In other words, we can use the same argument as in Theorem 4.5 and Lemma 4.9: Again, after the merger step either the upper or the lower half is clean, the other is bitonic. The bitonic sequence sorters complete sorting.

Remarks:

• How do you sort n values when you are able to merge two sorted sequences of size n/2? Piece of cake, just apply the merger recursively.

Algorithm 20 Batcher's "Bitonic" Sorting Network
1: A batcher sorting network of width n consists of two batcher sorting networks of width n/2 followed by a merging network of width n. (See Figure 4.2.)
2: A batcher sorting network of width 1 is empty.

Theorem 4.12. A sorting network (Algorithm 20) sorts an arbitrary sequence of n values. It has depth O(log^2 n).

Proof. Correctness is immediate: at recursive stage k (k = 1, 2, 3, . . . , log n) we merge 2^k sorted sequences into 2^(k−1) sorted sequences. The depth d(n) of the sorting network of level n is the depth of a sorting network of level n/2 plus the depth m(n) of a merging network with width n. The depth of a sorter of level 1 is 0 since the sorter is empty. Since a merging network of width n has the same depth as a bitonic sequence sorter of width n, we know by Lemma 4.10 that m(n) = log n. This gives a recursive formula for d(n) which solves to d(n) = (1/2) log^2 n + (1/2) log n.
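The recursive definitions of Algorithms 18–20 translate directly into code. The following Python sketch (our own formulation, with 0-based wires) builds Batcher's network as a list of comparator pairs, verifies it on all 0-1 inputs of width 8 (which suffices by Lemma 4.3), and confirms the depth d(8) = (1/2)·9 + (1/2)·3 = 6 from Theorem 4.12.

```python
from itertools import product

def bitonic_sorter(lo, n):
    # Algorithm 18: half cleaner, then two bitonic sorters of width n/2.
    if n == 1:
        return []
    half = [(lo + i, lo + i + n // 2) for i in range(n // 2)]
    return half + bitonic_sorter(lo, n // 2) + bitonic_sorter(lo + n // 2, n // 2)

def merging(lo, n):
    # Algorithm 19: merger stage, then two bitonic sorters of width n/2.
    merger = [(lo + i, lo + n - 1 - i) for i in range(n // 2)]
    return merger + bitonic_sorter(lo, n // 2) + bitonic_sorter(lo + n // 2, n // 2)

def batcher(lo, n):
    # Algorithm 20: two half-width sorting networks, then a merging network.
    if n == 1:
        return []
    return batcher(lo, n // 2) + batcher(lo + n // 2, n // 2) + merging(lo, n)

def run(network, values):
    a = list(values)
    for i, j in network:  # each comparator puts min on the upper wire
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a

net = batcher(0, 8)
# By the 0-1 Sorting Lemma, checking all 0-1 inputs suffices.
assert all(run(net, bits) == sorted(bits) for bits in product([0, 1], repeat=8))

# Depth per Definition 4.7: max depth over comparators.
depth = [0] * 8
for i, j in net:
    d = max(depth[i], depth[j]) + 1
    depth[i] = depth[j] = d
print(max(depth))  # 6
```

The comparator list is emitted in dependency order, so applying the comparators sequentially is equivalent to evaluating the network layer by layer.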
[Figure 4.2: A batcher sorting network: a sorting network B[w] consists of two B[w/2] networks followed by a merging network M[w].]

Remarks:

• Simulating Batcher's sorting network on an ordinary sequential computer takes time O(n log^2 n). As said, there are sequential sorting algorithms that sort in asymptotically optimal time O(n log n). So a natural question is whether there is a sorting network with depth O(log n). Such a network would have some remarkable advantages over sequential asymptotically optimal sorting algorithms such as heapsort. Apart from being highly parallel, it would be completely oblivious, and as such perfectly suited for a fast hardware solution. In 1983, Ajtai, Komlos, and Szemeredi presented a celebrated O(log n) depth sorting network. (Unlike Batcher's sorting network, the constant hidden in the big-O of the "AKS" sorting network is too large to be practical, however.)

• It can be shown that Batcher's sorting network and similarly others can be simulated by a Butterfly network and other hypercubic networks, see next chapter.

• What if a sorting network is asynchronous?!? Clearly, using a synchronizer we can still sort, but it is also possible to use it for something else. Check out the next section!

4.3 Counting Networks

In this section we address distributed counting, a distributed service which can for instance be used for load balancing.

Definition 4.13 (Distributed Counting). A distributed counter is a variable that is common to all processors in a system and that supports an atomic test-and-increment operation. The operation delivers the system's counter value to the requesting processor and increments it.

Remarks:

• A naive distributed counter stores the system's counter value with a distinguished central node. When other nodes initiate the test-and-increment operation, they send a request message to the central node and in turn receive a reply message with the current counter value. However, with a large number of nodes operating on the distributed counter, the central processor will become a bottleneck. There will be a congestion of request messages at the central processor; in other words, the system will not scale.

• Is a scalable implementation (without any kind of bottleneck) of such a distributed counter possible, or is distributed counting a problem which is inherently centralized?!?

• Distributed counting could for instance be used to implement a load balancing infrastructure, i.e. by sending the job with counter value i (modulo n) to server i (out of n possible servers).

Definition 4.14 (Balancer). A balancer is an asynchronous flip-flop which forwards messages that arrive on the left side to the wires on the right, the first to the upper, the second to the lower, the third to the upper, and so on.

Algorithm 21 Bitonic Counting Network.
1: Take Batcher's bitonic sorting network of width w and replace all the comparators with balancers.
2: When a node wants to count, it sends a message to an arbitrary input wire.
3: The message is then routed through the network, following the rules of the asynchronous balancers.
4: Each output wire is completed with a "mini-counter."
5: The mini-counter of wire k replies the value "k + i · w" to the initiator of the ith message it receives.

Definition 4.15 (Step Property). A sequence y0, y1, . . . , y_{w−1} is said to have the step property, if 0 ≤ yi − yj ≤ 1, for any i < j.

Remarks:

• If the output wires have the step property, then with r requests, exactly the values 1, . . . , r will be assigned by the mini-counters. All we need to show is that the counting network has the step property. For that we need some additional facts...

Facts 4.16. For a balancer, we denote the number of consumed messages on the ith input wire with xi, i = 0, 1. Similarly, we denote the number of sent messages on the ith output wire with yi, i = 0, 1. A balancer has these properties:

(1) A balancer does not generate output-messages; that is, x0 + x1 ≥ y0 + y1 in any state.

(2) Every incoming message is eventually forwarded. In other words, if we are in a quiescent state (no message in transit), then x0 + x1 = y0 + y1.
(3) The number of messages sent to the upper output wire is at most one higher than the number of messages sent to the lower output wire: in any state y0 = ⌈(y0 + y1)/2⌉ (thus y1 = ⌊(y0 + y1)/2⌋).

Facts 4.17. If a sequence y0, y1, . . . , y_{w−1} has the step property,

(1) then all its subsequences have the step property.

(2) then its even and odd subsequences satisfy

  Σ_{i=0}^{w/2−1} y_{2i} = ⌈(1/2) Σ_{i=0}^{w−1} y_i⌉ and Σ_{i=0}^{w/2−1} y_{2i+1} = ⌊(1/2) Σ_{i=0}^{w−1} y_i⌋.

Facts 4.18. If two sequences x0, x1, . . . , x_{w−1} and y0, y1, . . . , y_{w−1} have the step property,

(1) and Σ_{i=0}^{w−1} x_i = Σ_{i=0}^{w−1} y_i, then x_i = y_i for i = 0, . . . , w − 1.

(2) and Σ_{i=0}^{w−1} x_i = Σ_{i=0}^{w−1} y_i + 1, then there exists a unique j (j = 0, 1, . . . , w − 1) such that x_j = y_j + 1, and x_i = y_i for i = 0, . . . , w − 1, i ≠ j.

Remarks:

• An alternative representation of Batcher's network has been introduced in [AHS94]. It is isomorphic to Batcher's network, and relies on a Merger Network M[w] which is defined inductively: M[w] consists of two M[w/2] networks (an upper and a lower one) whose output is fed to w/2 balancers. The upper M[w/2] merges the even subsequence x0, x2, . . . , x_{w−2}, while the lower M[w/2] merges the odd subsequence x1, x3, . . . , x_{w−1}. Call the outputs of these two M[w/2], z and z′ respectively. The final stage of the network combines z and z′ by sending each pair of wires z_i and z′_i into a balancer whose outputs yield y_{2i} and y_{2i+1}.

• It is enough to prove that a merger network M[w] preserves the step property.

Lemma 4.19. Let M[w] be a merger network of width w. In a quiescent state (no message in transit), if the inputs x0, x1, . . . , x_{w/2−1} resp. x_{w/2}, x_{w/2+1}, . . . , x_{w−1} have the step property, then the output y0, y1, . . . , y_{w−1} has the step property.

Proof. By induction on the width w.

For w = 2: M[2] is a balancer, and a balancer's output has the step property (Fact 4.16.3).

For w > 2: Let z resp. z′ be the output of the upper respectively lower M[w/2] subnetwork. Since x0, x1, . . . , x_{w/2−1} and x_{w/2}, x_{w/2+1}, . . . , x_{w−1} both have the step property by assumption, their even and odd subsequences also have the step property (Fact 4.17.1). By induction hypothesis, the outputs of both M[w/2] subnetworks have the step property. Let Z := Σ_{i=0}^{w/2−1} z_i and Z′ := Σ_{i=0}^{w/2−1} z′_i. From Fact 4.17.2 we conclude that Z = ⌈(1/2) Σ_{i=0}^{w/2−1} x_i⌉ + ⌊(1/2) Σ_{i=w/2}^{w−1} x_i⌋ and Z′ = ⌊(1/2) Σ_{i=0}^{w/2−1} x_i⌋ + ⌈(1/2) Σ_{i=w/2}^{w−1} x_i⌉. Since ⌈a⌉ + ⌊b⌋ and ⌊a⌋ + ⌈b⌉ differ by at most 1, we know that Z and Z′ differ by at most 1.

If Z = Z′, Fact 4.18.1 implies that z_i = z′_i for i = 0, . . . , w/2 − 1. Therefore, the output of M[w] is y_i = z_⌊i/2⌋ for i = 0, . . . , w − 1. Since z0, . . . , z_{w/2−1} has the step property, so does the output of M[w], and the lemma follows.

If Z and Z′ differ by 1, Fact 4.18.2 implies that z_i = z′_i for i = 0, . . . , w/2 − 1, except a unique j such that z_j and z′_j differ by only 1, for j = 0, . . . , w/2 − 1. Let l := min(z_j, z′_j). Then, the output y_i (with i < 2j) is l + 1. The output y_i (with i > 2j + 1) is l. The outputs y_{2j} and y_{2j+1} are balanced by the final balancer, resulting in y_{2j} = l + 1 and y_{2j+1} = l. Therefore M[w] preserves the step property.

A bitonic counting network is constructed to fulfill Lemma 4.19, i.e., the final output comes from a merger whose upper and lower inputs are recursively merged. Therefore, the following theorem follows immediately.

Theorem 4.20 (Correctness). In a quiescent state, the w output wires of a bitonic counting network of width w have the step property.

Remarks:

• Is every sorting network also a counting network? No. But surprisingly, the other direction is true!

Theorem 4.21 (Counting vs. Sorting). If a network is a counting network then it is also a sorting network, but not vice versa.

Proof. There are sorting networks that are not counting networks (e.g. odd/even sort, or insertion sort). For the other direction, let C be a counting network and I(C) be the isomorphic network, where every balancer is replaced by a comparator. Let I(C) have an arbitrary input of 0's and 1's; that is, some of the input wires have a 0, all others have a 1. There is a message at C's ith input wire if and only if I(C)'s ith input wire is 0. Since C is a counting network, all messages are routed to the upper output wires. I(C) is isomorphic to C, therefore a comparator in I(C) will receive a 0 on its upper (lower) wire if and only if the corresponding balancer receives a message on its upper (lower) wire. Using an inductive argument, the 0's and 1's will be routed through I(C) such that all 0's exit the network on the upper wires whereas all 1's exit the network on the lower wires. Applying Lemma 4.3 shows that I(C) is a sorting network.

Remarks:

• We claimed that the counting network is correct. However, it is only correct in a quiescent state.

Definition 4.22 (Linearizable). A system is linearizable if the order of the values assigned reflects the real-time order in which they were requested. More formally, if there is a pair of operations o1, o2, where operation o1 terminates before operation o2 starts, and the logical order is "o2 before o1", then a distributed system is not linearizable.

Lemma 4.23 (Linearizability). The bitonic counting network is not linearizable.
43 44 Counting Networks


Proof. Consider the bitonic counting network with width 4 in Figure 4.3: Assume that two inc operations were initiated and the corresponding messages entered the network on wires 0 and 2 (both in light gray color). After having passed the second resp. the first balancer, these traversing messages "fall asleep"; in other words, both messages take an unusually long time before they are received by the next balancer. Since we are in an asynchronous setting, this may be the case.

Figure 4.3: Linearizability Counter Example.

In the meantime, another inc operation (medium gray) is initiated and enters the network on the bottom wire. The message leaves the network on wire 2, and the inc operation is completed.
Strictly afterwards, another inc operation (dark gray) is initiated and enters the network on wire 1. After having passed all balancers, the message will leave the network on wire 0. Finally (and not depicted in Figure 4.3), the two light gray messages reach the next balancer and will eventually leave the network on wires 1 resp. 3. Because the dark gray and the medium gray operation conflict with Definition 4.22, the bitonic counting network is not linearizable.

Remarks:

• Note that the example in Figure 4.3 behaves correctly in the quiescent state: Finally, exactly the values 0, 1, 2, 3 are allotted.

• It was shown that linearizability comes at a high price (the depth grows linearly with the width).

Chapter Notes

The technique used for the famous lower bound of comparison-based sequential sorting first appeared in [FJ59]. Comprehensive introductions to the vast field of sorting can certainly be found in [Knu73]. Knuth also presents the 0/1 principle in the context of sorting networks, supposedly as a special case of a theorem for decision trees of W. G. Bouricius, and includes a historic overview of sorting network research.
Using a rather complicated proof not based on the 0/1 principle, [Hab72] first presented and analyzed Odd/Even sort on arrays. Shearsort for grids first appeared in [SSS86] as a sorting algorithm both easy to implement and to prove correct. Later it was generalized to meshes with higher dimension in [SS89]. A bubble sort based algorithm is presented in [SI86]; it takes time O(n log n), but is fast in practice. Nevertheless, already [TK77] presented an asymptotically optimal algorithm for grid networks which runs in 3n + O(n^{2/3} log n) rounds for an n × n grid. A simpler algorithm was later found by [SS86] using 3n + O(n^{3/4}) rounds.
Batcher presents his famous O(log² n) depth sorting network in [Bat68]. It took until [AKS83] to find a sorting network with asymptotically optimal depth O(log n). Unfortunately, the constants hidden in the big-O notation render it rather impractical.
The notion of counting networks was introduced in [AHS91], and shortly afterward the notion of linearizability was studied by [HSW91]. Follow-up work in [AHS94] presents bitonic counting networks and studies contention in the counting network. An overview of research on counting networks can be found in [BH98].

Bibliography

[AHS91] James Aspnes, Maurice Herlihy, and Nir Shavit. Counting networks and multi-processor coordination. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, STOC '91, pages 348–358, New York, NY, USA, 1991. ACM.

[AHS94] James Aspnes, Maurice Herlihy, and Nir Shavit. Counting networks. J. ACM, 41(5):1020–1048, September 1994.

[AKS83] Miklós Ajtai, János Komlós, and Endre Szemerédi. An O(n log n) sorting network. In Proceedings of the fifteenth annual ACM symposium on Theory of computing, STOC '83, pages 1–9, New York, NY, USA, 1983. ACM.

[Bat68] Kenneth E. Batcher. Sorting networks and their applications. In Proceedings of the April 30–May 2, 1968, spring joint computer conference, AFIPS '68 (Spring), pages 307–314, New York, NY, USA, 1968. ACM.

[BH98] Costas Busch and Maurice Herlihy. A Survey on Counting Networks. In WDAS, pages 13–20, 1998.

[FJ59] Lester R. Ford and Selmer M. Johnson. A Tournament Problem. The American Mathematical Monthly, 66(5):387–389, 1959.

[Hab72] Nico Habermann. Parallel neighbor-sort (or the glory of the induction principle). Paper 2087, Carnegie Mellon University, Computer Science Department, 1972.

[HSW91] M. Herlihy, N. Shavit, and O. Waarts. Low contention linearizable counting. In Foundations of Computer Science, 1991. Proceedings., 32nd Annual Symposium on, pages 526–535, Oct 1991.

[Knu73] Donald E. Knuth. The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, 1973.
[SI86] Kazuhiro Sado and Yoshihide Igarashi. Some parallel sorts on a mesh-
connected processor array and their time efficiency. Journal of Parallel
and Distributed Computing, 3(3):398–410, 1986.

[SS86] Claus Peter Schnorr and Adi Shamir. An optimal sorting algorithm
for mesh connected computers. In Proceedings of the eighteenth annual
ACM symposium on Theory of computing, STOC ’86, pages 255–263,
New York, NY, USA, 1986. ACM.

[SS89] Isaac D. Scherson and Sandeep Sen. Parallel sorting in two-dimensional VLSI models of computation. Computers, IEEE Transactions on, 38(2):238–249, Feb 1989.
[SSS86] Isaac Scherson, Sandeep Sen, and Adi Shamir. Shear sort – A true
two-dimensional sorting technique for VLSI networks. 1986 Interna-
tional Conference on Parallel Processing, 1986.

[TK77] Clark David Thompson and Hsiang Tsung Kung. Sorting on a mesh-
connected parallel computer. Commun. ACM, 20(4):263–271, April
1977.
Chapter 5

Shared Memory

In distributed computing, various different models exist. So far, the focus of the course was on loosely-coupled distributed systems such as the Internet, where nodes asynchronously communicate by exchanging messages. The "opposite" model is a tightly-coupled parallel computer where nodes access a common memory totally synchronously—in distributed computing such a system is called a Parallel Random Access Machine (PRAM).

5.1 Model

A third major model is somehow between these two extremes, the shared memory model. In a shared memory system, asynchronous processes (or processors) communicate via a common memory area of shared variables or registers:

Definition 5.1 (Shared Memory). A shared memory system is a system that consists of asynchronous processes that access a common (shared) memory. A process can atomically access a register in the shared memory through a set of predefined operations. An atomic modification appears to the rest of the system instantaneously. Apart from this shared memory, processes can also have some local (private) memory.

Remarks:

• Various shared memory systems exist. A main difference is how they allow processes to access the shared memory. All systems can atomically read or write a shared register R. Most systems do allow for advanced atomic read-modify-write (RMW) operations, for example:
  – test-and-set(R): t := R; R := 1; return t
  – fetch-and-add(R, x): t := R; R := R + x; return t
  – compare-and-swap(R, x, y): if R = x then R := y; return true; else return false; endif;
  – load-link(R)/store-conditional(R, x): Load-link returns the current value of the specified register R. A subsequent store-conditional to the same register will store a new value x (and return true) only if no updates have occurred to that register since the load-link. If any updates have occurred, the store-conditional is guaranteed to fail (and return false), even if the value read by the load-link has since been restored.

• The power of RMW operations can be measured with the so-called consensus-number: The consensus-number k of a RMW operation defines whether one can solve consensus for k processes. Test-and-set for instance has consensus-number 2 (one can solve consensus with 2 processes, but not 3), whereas the consensus-number of compare-and-swap is infinite. It can be shown that the power of a shared memory system is determined by the consensus-number ("universality of consensus"). This insight has a remarkable theoretical and practical impact. In practice for instance, after this was known, hardware designers stopped developing shared memory systems supporting weak RMW operations.

• Many of the results derived in the message passing model have an equivalent in the shared memory model. Consensus for instance is traditionally studied in the shared memory model.

• Whereas programming a message passing system is rather tricky (in particular if fault-tolerance has to be integrated), programming a shared memory system is generally considered easier, as programmers are given access to global variables directly and do not need to worry about exchanging messages correctly. Because of this, even distributed systems which physically communicate by exchanging messages can often be programmed through a shared memory middleware, making the programmer's life easier.

• We will most likely find the general spirit of shared memory systems in upcoming multi-core architectures. As for programming style, the multi-core community seems to favor an accelerated version of shared memory, transactional memory.

• From a message passing perspective, the shared memory model is like a bipartite graph: On one side you have the processes (the nodes) which pretty much behave like nodes in the message passing model (asynchronous, maybe failures). On the other side you have the shared registers, which just work perfectly (no failures, no delay).

5.2 Mutual Exclusion

A classic problem in shared memory systems is mutual exclusion. We are given a number of processes which occasionally need to access the same resource. The resource may be a shared variable, or a more general object such as a data structure or a shared printer. The catch is that only one process at a time is allowed to access the resource. More formally:

Definition 5.2 (Mutual Exclusion). We are given a number of processes, each executing the following code sections:
<Entry> → <Critical Section> → <Exit> → <Remaining Code>

A mutual exclusion algorithm consists of code for entry and exit sections, such that the following holds:

• Mutual Exclusion: At all times at most one process is in the critical section.

• No deadlock: If some process manages to get to the entry section, later some (possibly different) process will get to the critical section.

Sometimes we in addition ask for

• No lockout: If some process manages to get to the entry section, later the same process will get to the critical section.

• Unobstructed exit: No process can get stuck in the exit section.

Using RMW primitives one can build mutual exclusion algorithms quite easily. Algorithm 22 shows an example with the test-and-set primitive.

Algorithm 22 Mutual Exclusion: Test-and-Set
Input: Shared register R := 0
<Entry>
1: repeat
2: r := test-and-set(R)
3: until r = 0
<Critical Section>
4: . . .
<Exit>
5: R := 0
<Remainder Code>
6: . . .

Theorem 5.3. Algorithm 22 solves the mutual exclusion problem as in Definition 5.2.

Proof. Mutual exclusion follows directly from the test-and-set definition: Initially R is 0. Let pi be the ith process to successfully execute the test-and-set, where successfully means that the result of the test-and-set is 0. This happens at time ti. At time t′i process pi resets the shared register R to 0. Between ti and t′i no other process can successfully test-and-set, hence no other process can enter the critical section concurrently.
Proving no deadlock works similarly: One of the processes loitering in the entry section will successfully test-and-set as soon as the process in the critical section has exited.
Since the exit section only consists of a single instruction (no potential infinite loops) we have unobstructed exit.

Remarks:

• No lockout, on the other hand, is not given by this algorithm. Even with only two processes there are asynchronous executions where always the same process wins the test-and-set.

• Algorithm 22 can be adapted to guarantee fairness (no lockout), essentially by ordering the processes in the entry section in a queue.

• A natural question is whether one can achieve mutual exclusion with only reads and writes, that is without advanced RMW operations. The answer is yes!

Our read/write mutual exclusion algorithm is for two processes p0 and p1 only. In the remarks we discuss how it can be extended. The general idea is that process pi has to mark its desire to enter the critical section in a "want" register Wi by setting Wi := 1. Only if the other process is not interested (W1−i = 0) is access granted. This however is too simple since we may run into a deadlock. This deadlock (and at the same time also lockout) is resolved by adding a priority variable Π. See Algorithm 23.

Algorithm 23 Mutual Exclusion: Peterson's Algorithm
Initialization: Shared registers W0, W1, Π, all initially 0.
Code for process pi, i ∈ {0, 1}
<Entry>
1: Wi := 1
2: Π := 1 − i
3: repeat until Π = i or W1−i = 0
<Critical Section>
4: . . .
<Exit>
5: Wi := 0
<Remainder Code>
6: . . .

Remarks:

• Note that line 3 in Algorithm 23 represents a "spinlock" or "busy-wait", similarly to lines 1–3 in Algorithm 22.

Theorem 5.4. Algorithm 23 solves the mutual exclusion problem as in Definition 5.2.

Proof. The shared variable Π elegantly grants priority to the process that passes line 2 first. If both processes are competing, only process pΠ can access the critical section because of Π. The other process p1−Π cannot access the critical section because WΠ = 1 (and Π ≠ 1 − Π). The only other reason to access the critical section is because the other process is in the remainder code (that is, not interested). This proves mutual exclusion!
No deadlock comes directly with Π: Process pΠ gets direct access to the critical section, no matter what the other process does.
Since the exit section only consists of a single instruction (no potential infinite loops) we have unobstructed exit.
Thanks to the shared variable Π also no lockout (fairness) is achieved: If a process pi loses against its competitor p1−i in line 2, it will have to wait until the competitor resets W1−i := 0 in the exit section. If process pi is unlucky it will not check W1−i = 0 early enough before process p1−i sets W1−i := 1 again in line 1. However, as soon as p1−i hits line 2, process pi gets the priority due to Π, and can enter the critical section.

Remarks:

• Extending Peterson's Algorithm to more than 2 processes can be done by a tournament tree, like in tennis. With n processes every process needs to win log n matches before it can enter the critical section. More precisely, each process starts at the bottom level of a binary tree, and proceeds to the parent level if winning. Once winning the root of the tree it can enter the critical section. Thanks to the priority variables Π at each node of the binary tree, we inherit all the properties of Definition 5.2.

5.3 Store & Collect

5.3.1 Problem Definition

In this section, we will look at a second shared memory problem that has an elegant solution. Informally, the problem can be stated as follows. There are n processes p1, . . . , pn. Every process pi has a read/write register Ri in the shared memory where it can store some information that is destined for the other processes. Further, there is an operation by which a process can collect (i.e., read) the values of all the processes that stored some value in their register.
We say that an operation op1 precedes an operation op2 iff op1 terminates before op2 starts. An operation op2 follows an operation op1 iff op1 precedes op2.

Definition 5.5 (Collect). There are two operations: A store(val) by process pi sets val to be the latest value of its register Ri. A collect operation returns a view, a partial function V from the set of processes to a set of values, where V(pi) is the latest value stored by pi, for each process pi. For a collect operation cop, the following validity properties must hold for every process pi:

• If V(pi) = ⊥, then no store operation by pi precedes cop.

• If V(pi) = v ≠ ⊥, then v is the value of a store operation sop of pi that does not follow cop, and there is no store operation by pi that follows sop and precedes cop.

Hence, a collect operation cop should not read from the future or miss a preceding store operation sop.
We assume that the read/write register Ri of every process pi is initialized to ⊥. We define the step complexity of an operation op to be the number of accesses to registers in the shared memory. There is a trivial solution to the collect problem as shown by Algorithm 24.

Algorithm 24 Collect: Simple (Non-Adaptive) Solution
Operation store(val) (by process pi):
1: Ri := val
Operation collect:
2: for i := 1 to n do
3: V(pi) := Ri // read register Ri
4: end for

Remarks:

• Algorithm 24 clearly works. The step complexity of every store operation is 1, the step complexity of a collect operation is n.

• At first sight, the step complexities of Algorithm 24 seem optimal. Because there are n processes, there clearly are cases in which a collect operation needs to read all n registers. However, there are also scenarios in which the step complexity of the collect operation seems very costly. Assume that there are only two processes pi and pj that have stored a value in their registers Ri and Rj. In this case, a collect in principle only needs to read the registers Ri and Rj and can ignore all the other registers.

• Assume that up to a certain time t, k ≤ n processes have finished or started at least one operation. We call an operation op at time t adaptive to contention if the step complexity of op only depends on k and is independent of n.

• In the following, we will see how to implement adaptive versions of store and collect.

5.3.2 Splitters

Algorithm 25 Splitter Code
Shared Registers: X : {⊥} ∪ {1, . . . , n}; Y : boolean
Initialization: X := ⊥; Y := false

Splitter access by process pi:
1: X := i;
2: if Y then
3: return right
4: else
5: Y := true
6: if X = i then
7: return stop
8: else
9: return left
10: end if
11: end if
Figure 5.1: A Splitter. (k processors enter; at most 1 obtains stop, at most k − 1 obtain left, and at most k − 1 obtain right.)

To obtain adaptive collect algorithms, we need a synchronization primitive, called a splitter.

Definition 5.6 (Splitter). A splitter is a synchronization primitive with the following characteristic. A process entering a splitter exits with either stop, left, or right. If k processes enter a splitter, at most one process exits with stop and at most k − 1 processes exit with left and right, respectively.

Hence, it is guaranteed that if a single process enters the splitter, then it obtains stop, and if two or more processes enter the splitter, then there is at most one process obtaining stop and there are two processes that obtain different values (i.e., either there is exactly one stop or there is at least one left and at least one right). For an illustration, see Figure 5.1. The code implementing a splitter is given by Algorithm 25.

Lemma 5.7. Algorithm 25 correctly implements a splitter.

Proof. Assume that k processes enter the splitter. Because the first process that checks whether Y = true in line 2 will find that Y = false, not all processes return right. Next, assume that i is the last process that sets X := i. If i does not return right, it will find X = i in line 6 and therefore return stop. Hence, there is always a process that does not return left. It remains to show that at most 1 process returns stop. For the sake of contradiction, assume pi and pj are two processes that return stop and assume that pi sets X := i before pj sets X := j. Both processes need to check whether Y is true before one of them sets Y := true. Hence, they both complete the assignment in line 1 before the first one of them checks the value of X in line 6. Hence, by the time pi arrives at line 6, X ≠ i (pj and maybe some other processes have overwritten X by then). Therefore, pi does not return stop and we get a contradiction to the assumption that both pi and pj return stop.

5.3.3 Binary Splitter Tree

Assume that we are given 2^n − 1 splitters and that for every splitter S, there is an additional shared variable ZS : {⊥} ∪ {1, . . . , n} that is initialized to ⊥ and an additional shared variable MS : boolean that is initialized to false. We call a splitter S marked if MS = true. The 2^n − 1 splitters are arranged in a complete binary tree of height n − 1. Let S(v) be the splitter associated with a node v of the binary tree. The store and collect operations are given by Algorithm 26.

Algorithm 26 Adaptive Collect: Binary Tree Algorithm
Operation store(val) (by process pi):
1: Ri := val
2: if first store operation by pi then
3: v := root node of binary tree
4: α := result of entering splitter S(v);
5: MS(v) := true
6: while α ≠ stop do
7: if α = left then
8: v := left child of v
9: else
10: v := right child of v
11: end if
12: α := result of entering splitter S(v);
13: MS(v) := true
14: end while
15: ZS(v) := i
16: end if
Operation collect:
Traverse marked part of binary tree:
17: for all marked splitters S do
18: if ZS ≠ ⊥ then
19: i := ZS; V(pi) := Ri // read value of process pi
20: end if
21: end for // V(pi) = ⊥ for all other processes

Theorem 5.8. Algorithm 26 correctly implements store and collect. Let k be the number of participating processes. The step complexity of the first store of a process pi is O(k), the step complexity of every additional store of pi is O(1), and the step complexity of collect is O(k).

Proof. Because at most one process can stop at a splitter, it is sufficient to show that every process stops at some splitter at depth at most k − 1 ≤ n − 1 when invoking the first store operation to prove correctness. We prove that at most k − i processes enter a subtree at depth i (i.e., a subtree where the root has distance i to the root of the whole tree). By definition of k, the number of processes entering the splitter at depth 0 (i.e., at the root of the binary tree) is k. For i > 1, the claim follows by induction because of the at most k − i processes entering the splitter at the root of a depth i subtree, at most k − i − 1 obtain left and right, respectively. Hence, at the latest when reaching depth k − 1, a process is the only process entering a splitter and thus obtains stop. It thus also follows that the step complexity of the first invocation of store is O(k).
To show that the step complexity of collect is O(k), we first observe that the marked nodes of the binary tree are connected, and therefore can be traversed by only reading the variables MS associated to them and their neighbors. Hence, showing that at most 2k − 1 nodes of the binary tree are marked is sufficient. Let xk be the maximum number of marked nodes in a tree,
where k processes access the root. We claim that xk ≤ 2k − 1, which is true for k = 1 because a single process entering a splitter will always compute stop. Now assume the inequality holds for 1, . . . , k − 1. Not all k processes may exit the splitter with left (or right), i.e., kl ≤ k − 1 processes will turn left and kr ≤ min{k − kl, k − 1} turn right. The left and right children of the root are the roots of their subtrees, hence the induction hypothesis yields

x_k = x_{kl} + x_{kr} + 1 ≤ (2kl − 1) + (2kr − 1) + 1 ≤ 2k − 1,

concluding induction and proof.

Remarks:

• The step complexities of Algorithm 26 are very good. Clearly, the step complexity of the collect operation is asymptotically optimal. In order for the algorithm to work, we however need to allocate the memory for the complete binary tree of depth n − 1. The space complexity of Algorithm 26 therefore is exponential in n. We will next see how to obtain a polynomial space complexity at the cost of a worse collect step complexity.

5.3.4 Splitter Matrix

Instead of arranging splitters in a binary tree, we arrange n² splitters in an n × n matrix as shown in Figure 5.2. The algorithm is analogous to Algorithm 26. The matrix is entered at the top left. If a process receives left, it next visits the splitter in the next row of the same column. If a process receives right, it next visits the splitter in the next column of the same row. Clearly, the space complexity of this algorithm is O(n²). The following theorem gives bounds on the step complexities of store and collect.

Figure 5.2: 5 × 5 Splitter Matrix.

Theorem 5.9. Let k be the number of participating processes. The step complexity of the first store of a process pi is O(k), the step complexity of every additional store of pi is O(1), and the step complexity of collect is O(k²).

Proof. Let the top row be row 0 and the left-most column be column 0. Let xi be the number of processes entering a splitter in row i. By induction on i, we show that xi ≤ k − i. Clearly, x0 ≤ k. Let us therefore consider the case i > 0. Let j be the largest column such that at least one process visits the splitter in row i − 1 and column j. By the properties of splitters, not all processes entering the splitter in row i − 1 and column j obtain left. Therefore, not all processes entering a splitter in row i − 1 move on to row i. Because at least one process stays in every row, we get that xi ≤ k − i. Similarly, the number of processes entering column j is at most k − j. Hence, every process stops at the latest in row k − 1 and column k − 1 and the number of marked splitters is at most k². Thus, the step complexity of collect is at most O(k²). Because the longest path in the splitter matrix is 2k, the step complexity of store is O(k).

Remarks:

• With a slightly more complicated argument, it is possible to show that the number of processes entering the splitter in row i and column j is at most k − i − j. Hence, it suffices to only allocate the upper left half (including the diagonal) of the n × n matrix of splitters.

• The binary tree algorithm can be made space efficient by using a randomized version of a splitter. Whenever returning left or right, a randomized splitter returns left or right with probability 1/2. With high probability, it then suffices to allocate a binary tree of depth O(log n).

• Recently, it has been shown that with a considerably more complicated deterministic algorithm, it is possible to achieve O(k) step complexity and O(n²) space complexity.

Chapter Notes

Already in 1965 Edsger Dijkstra gave a deadlock-free solution for mutual exclusion [Dij65]. Later, Maurice Herlihy suggested consensus-numbers [Her91], where he proved the "universality of consensus", i.e., the power of a shared memory system is determined by the consensus-number. For this work, Maurice Herlihy was awarded the Dijkstra Prize in Distributed Computing in 2003. Peterson's Algorithm is due to [PF77, Pet81], and adaptive collect was studied in the sequence of papers [MA95, AFG02, AL05, AKP+06].

Bibliography

[AFG02] Hagit Attiya, Arie Fouren, and Eli Gafni. An adaptive collect algorithm with applications. Distributed Computing, 15(2):87–96, 2002.

[AKP+ 06] Hagit Attiya, Fabian Kuhn, C. Greg Plaxton, Mirjam Wattenhofer,
and Roger Wattenhofer. Efficient adaptive collect using randomiza-
tion. Distributed Computing, 18(3):179–188, 2006.

[AL05] Yehuda Afek and Yaron De Levie. Space and Step Complexity Effi-
cient Adaptive Collect. In DISC, pages 384–398, 2005.
[Dij65] Edsger W. Dijkstra. Solution of a problem in concurrent program-
ming control. Commun. ACM, 8(9):569, 1965.

[Her91] Maurice Herlihy. Wait-Free Synchronization. ACM Trans. Program. Lang. Syst., 13(1):124–149, 1991.
[MA95] Mark Moir and James H. Anderson. Wait-Free Algorithms for Fast,
Long-Lived Renaming. Sci. Comput. Program., 25(1):1–39, 1995.

[Pet81] J.L. Peterson. Myths About the Mutual Exclusion Problem. Infor-
mation Processing Letters, 12(3):115–116, 1981.
[PF77] G.L. Peterson and M.J. Fischer. Economical solutions for the crit-
ical section problem in a distributed system. In Proceedings of the
ninth annual ACM symposium on Theory of computing, pages 91–97.
ACM, 1977.
Chapter 6

Shared Objects

Assume that there is a common resource (e.g. a common variable or data structure), which different nodes in a network need to access from time to time. If the nodes are allowed to change the common object when accessing it, we need to guarantee that no two nodes have access to the object at the same time. In order to achieve this mutual exclusion, we need protocols that allow the nodes of a network to store and manage access to such a shared object.

6.1 Centralized Solutions

A simple and obvious solution is to store the shared object at a central location (see Algorithm 27).

Algorithm 27 Shared Object: Centralized Solution
Initialization: Shared object stored at root node r of a spanning tree of the network graph (i.e., each node knows its parent in the spanning tree).
Accessing Object: (by node v)
1: v sends request up the tree
2: request processed by root r (atomically)
3: result sent down the tree to node v

Remarks:

• Instead of a spanning tree, one can use routing.

• Algorithm 27 works, but it is not very efficient. Assume that the object is accessed by a single node v repeatedly. Then we get a high message/time complexity. Instead v could store the object, or at least cache it. But then, in case another node w accesses the object, we might run into consistency problems.

• Alternative idea: The accessing node should become the new master of the object. The shared object then becomes mobile. There exist several variants of this idea. The simplest version is a home-based solution like in Mobile IP (see Algorithm 28).

Algorithm 28 Shared Object: Home-Based Solution
Initialization: An object has a home base (a node) that is known to every node. All requests (accesses to the shared object) are routed through the home base.
Accessing Object: (by node v)
1: v acquires a lock at the home base, receives object.

Remarks:

• Home-based solutions suffer from the triangular routing problem. If two close-by nodes take turns to access the object, all the traffic is routed through the potentially far away home base.

6.2 Arrow and Friends

We will now look at a protocol (called the Arrow algorithm) that always moves the shared object to the node currently accessing it without creating the triangular routing problem of home-based solutions. The protocol runs on a precomputed spanning tree. Assume that the spanning tree is rooted at the current position of the shared object. When a node u wants to access the shared object, it sends out a find request towards the current position of the object. While searching for the object, the edges of the spanning tree are redirected such that in the end, the spanning tree is rooted at u (i.e., the new holder of the object). The details of the algorithm are given by Algorithm 29. For simplicity, we assume that a node u only starts a find request if u is not currently the holder of the shared object and if u has finished all previous find requests (i.e., it is not currently waiting to receive the object).

Remarks:

• The parent pointers in Algorithm 29 are only needed for the find operation. Sending the variable to u in line 13 or to w.successor in line 23 is done using routing (on the spanning tree or on the underlying network).

• When we draw the parent pointers as arrows, in a quiescent moment (where no "find" is in motion), the arrows all point towards the node currently holding the variable (i.e., the tree is rooted at the node holding the variable).

• What is really great about the Arrow algorithm is that it works in a completely asynchronous and concurrent setting (i.e., there can be many find requests at the same time).

Theorem 6.1. (Arrow, Analysis) In an asynchronous and concurrent setting, a "find" operation terminates with message and time complexity D, where D is the diameter of the spanning tree.
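The pointer-flipping idea behind a find request can be illustrated with a small sequential sketch (one request at a time, no concurrency; the names Node and find are chosen for illustration and are not part of the algorithm):

```python
# Sequential sketch of the Arrow redirection: a find request follows the
# parent arrows towards the current holder while reversing them, so that
# afterwards the tree is rooted at the requester.
class Node:
    def __init__(self, name):
        self.name = name
        self.parent = self       # provisional; set up below

def find(u):
    """u requests the object: forward along parent arrows, flipping them."""
    v, prev = u.parent, u
    u.parent = u                 # u becomes the new root
    while v is not prev:         # until the old root (self-parent) is reached
        nxt = v.parent
        v.parent = prev          # flip the arrow back towards the requester
        prev, v = v, nxt

# Build a path 0 - 1 - 2 - 3 rooted at node 3 (node 3 holds the object).
nodes = [Node(i) for i in range(4)]
for i in range(3):
    nodes[i].parent = nodes[i + 1]
nodes[3].parent = nodes[3]

find(nodes[0])                   # node 0 requests the object
roots = [n.name for n in nodes if n.parent is n]
print(roots)                     # [0]: the tree is now rooted at node 0
```

Running find again from another node re-roots the tree there; the Arrow algorithm performs the same redirection with messages, atomically at each node, and also works when many find requests are in flight concurrently.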
6.2. ARROW AND FRIENDS 61 62 CHAPTER 6. SHARED OBJECTS

Algorithm 29 Shared Object: Arrow Algorithm

Initialization: As for Algorithm 27, we are given a rooted spanning tree. Each node has a pointer to its parent, the root r is its own parent. The variable is initially stored at r. For all nodes v, v.successor := null, v.wait := false.

Start Find Request at Node u:
1: do atomically
2:   u sends “find by u” message to parent node
3:   u.parent := u
4:   u.wait := true
5: end do

Upon w Receiving “Find by u” Message from Node v:
6: do atomically
7:   if w.parent ≠ w then
8:     w sends “find by u” message to parent
9:     w.parent := v
10:  else
11:    w.parent := v
12:    if not w.wait then
13:      send variable to u // w holds var. but does not need it any more
14:    else
15:      w.successor := u // w will send variable to u a.s.a.p.
16:    end if
17:  end if
18: end do

Upon w Receiving Shared Object:
19: perform operation on shared object
20: do atomically
21:   w.wait := false
22:   if w.successor ≠ null then
23:     send variable to w.successor
24:     w.successor := null
25:   end if
26: end do

Before proving Theorem 6.1, we prove the following lemma.

Lemma 6.2. An edge {u, v} of the spanning tree is in one of four states:
1.) Pointer from u to v (no message on the edge, no pointer from v to u)
2.) Message on the move from u to v (no pointer along the edge)
3.) Pointer from v to u (no message on the edge, no pointer from u to v)
4.) Message on the move from v to u (no pointer along the edge)

Proof. W.l.o.g., assume that initially the edge {u, v} is in state 1. With a message arrival at u (or if u starts a “find by u” request), the edge goes to state 2. When the message is received at v, v directs its pointer to u and we are therefore in state 3. A new message at v (or a new request initiated by v) then brings the edge back to state 1.

Proof of Theorem 6.1. Since the “find” message will only travel on a static tree, it suffices to show that it will not traverse an edge twice. Suppose for the sake of contradiction that there is a first “find” message f that traverses an edge e = {u, v} for the second time, and assume that e is the first edge that is traversed twice by f. Assume that e is first traversed from u to v. Since we are on a tree, the second time e must be traversed from v to u. Because e is the first edge to be traversed twice, f must re-visit e before visiting any other edges. Right before f reaches v, the edge e is in state 2 (f is on the move) and in state 3 (it will immediately return with the pointer from v to u). This is a contradiction to Lemma 6.2.

Remarks:

• Finding a good tree is an interesting problem. We would like to have a tree with low stretch, low diameter, low degree, etc.

• It seems that the Arrow algorithm works especially well when lots of “find” operations are initiated concurrently. Most of them will find a “close-by” node, thus having low message/time complexity. For the sake of simplicity we analyze a synchronous system.

Theorem 6.3. (Arrow, Concurrent Analysis) Let the system be synchronous. Initially, the system is in a quiescent state. At time 0, a set S of nodes initiates a “find” operation. The message complexity of all “find” operations is O(log |S| · m*) where m* is the message complexity of an optimal (with global knowledge) algorithm on the tree.

Proof Sketch. Let d be the minimum distance of any node in S to the root. There will be a node u1 at distance d from the root that reaches the root in d time steps, turning all the arrows on the path to the root towards u1. A node u2 that finds (is queued behind) u1 cannot distinguish the system from a system where there was no request u1, and instead the root was initially located at u1. The message cost of u2 is consequently the distance between u1 and u2 on the spanning tree. By induction, the total message complexity is exactly as if a collector starts at the root and then “greedily” collects tokens located at the nodes in S (greedily in the sense that the collector always goes towards the closest token). Greedily collecting the tokens is not a good strategy in general because it will traverse the same edge more than twice in the worst case. An asymptotically optimal algorithm can also be translated into a depth-first-search collecting paradigm, traversing each edge at most twice. In another area of computer science, we would call the Arrow algorithm a nearest-neighbor TSP heuristic (without returning to the start/root though), and the optimal algorithm TSP-optimal. It was shown that nearest-neighbor has a logarithmic overhead, which concludes the proof.

Remarks:

• An average request set S on a not-too-bad tree usually gives a much better bound. However, there is an almost tight log |S|/ log log |S| worst-case example.

• It was recently shown that Arrow can do as well in a dynamic setting (where nodes are allowed to initiate requests at any time). In particular, the message complexity of the dynamic analysis can be shown to have a log D overhead only, where D is the diameter of the spanning tree (note that for logarithmic trees, the overhead becomes log log n).

• What if the spanning tree is a star? Then with Theorem 6.1, each find will terminate in 2 steps! Since also an optimal algorithm has message cost 1, the algorithm is 2-competitive. . . ? Yes, but because of its high degree the star center experiences contention. . . It can be shown that the contention overhead is at most proportional to the largest degree ∆ of the spanning tree.

• Thought experiment: Assume a balanced binary spanning tree—by Theorem 6.1, the message complexity per operation is log n. Because a binary tree has maximum degree 3, the time per operation therefore is at most 3 log n.

• There are better and worse choices for the spanning tree. The stretch of an edge {u, v} is defined as the distance between u and v in the spanning tree. The maximum stretch of a spanning tree is the maximum stretch over all edges. A few years ago, it was shown how to construct spanning trees that are O(log n)-stretch-competitive.

What if most nodes just want to read the shared object? Then it does not make sense to acquire a lock every time. Instead we can use caching (see Algorithm 30).

Algorithm 30 Shared Object: Read/Write Caching

• Nodes can either read or write the shared object. For simplicity we first assume that reads or writes do not overlap in time (access to the object is sequential).

• Nodes store three items: a parent pointer pointing to one of the neighbors (as with Arrow), a cache bit for each edge, plus (potentially) a copy of the object.

• Initially the object is stored at a single node u; all the parent pointers point towards u, all the cache bits are false.

• When initiating a read, a message follows the arrows (this time: without inverting them!) until it reaches a cached version of the object. Then a copy of the object is cached along the path back to the initiating node, and the cache bits on the visited edges are set to true.

• A write at u writes the new value locally (at node u), then searches (following the parent pointers and reversing them towards u) for a first node with a copy. Delete the copy and follow (in parallel, by flooding) all edges that have the cache flag set. Point the parent pointers towards u, and remove the cache flags.

Theorem 6.4. Algorithm 30 is correct. More surprisingly, the message complexity is 3-competitive (at most a factor 3 worse than the optimum).

Proof. Since the accesses do not overlap by definition, it suffices to show that between two writes, we are 3-competitive. The sequence of accessing nodes is w0, r1, r2, . . . , rk, w1. After w0, the object is stored at w0 and not cached anywhere else. All reads cost at most twice the smallest subtree T spanning the write w0 and all the reads, since each read only goes to the first copy. The write w1 costs T plus the path P from w1 to T. Since any data management scheme must use an edge in T and P at least once, and our algorithm uses edges in T at most 3 times (and in P at most once), the theorem follows.

Remarks:

• Concurrent reads are not a problem; also multiple concurrent reads and one write work just fine.

• What about concurrent writes? To achieve consistency, writes need to invalidate the caches before writing their value. It is claimed that the strategy then becomes 4-competitive.

• Is the algorithm also time competitive? Well, not really: The optimal algorithm that we compare to is usually offline. This means it knows the whole access sequence in advance. It can then cache the object before the request even appears!

• Algorithms on trees are often simpler, but have the disadvantage that they introduce the extra stretch factor. In a ring, for example, any tree has stretch n − 1; so there is always a bad request pattern.
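The sequential behavior of the Arrow algorithm is easy to reproduce in code. The following Python sketch (class and helper names are ours, not part of the protocol) simulates Algorithm 29 for non-overlapping find requests on a fixed spanning tree and counts, per find, the number of messages sent — which by Theorem 6.1 never exceeds the diameter D of the tree.

```python
from collections import deque

class ArrowSim:
    """Sequential simulation of the Arrow algorithm (Algorithm 29) on a
    fixed spanning tree; finds are assumed not to overlap in time."""

    def __init__(self, tree_edges, root):
        self.adj = {}
        for u, v in tree_edges:
            self.adj.setdefault(u, set()).add(v)
            self.adj.setdefault(v, set()).add(u)
        self.holder = root                 # node currently holding the variable
        self.parent = {root: root}         # orient all pointers towards root (BFS)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in self.adj[u]:
                if v not in self.parent:
                    self.parent[v] = u
                    queue.append(v)

    def find(self, u):
        """Node u acquires the object; returns the number of messages sent."""
        messages, v, sender = 0, u, u
        while self.parent[v] != v:         # follow the arrows to the current root
            nxt = self.parent[v]
            self.parent[v] = sender        # each visited node points back to the sender
            v, sender = nxt, v
            messages += 1
        self.parent[v] = sender            # old root now points towards u ...
        self.holder = u                    # ... and the object travels to u
        return messages

def diameter(adj):
    """Diameter of a tree via double BFS."""
    def farthest(s):
        dist, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        w = max(dist, key=dist.get)
        return w, dist[w]
    a, _ = farthest(next(iter(adj)))
    _, d = farthest(a)
    return d

# Path 0-1-...-7 (diameter 7): every find costs at most D messages.
sim = ArrowSim([(i, i + 1) for i in range(7)], root=0)
D = diameter(sim.adj)
for u in [7, 3, 0, 5, 5, 2]:
    assert sim.find(u) <= D and sim.holder == u
```

Note that the sketch is sequential; the point of the Arrow algorithm is that the very same pointer discipline remains safe under fully concurrent finds (Lemma 6.2).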

6.3 Ivy and Friends

In the following we study algorithms that do not restrict communication to a tree. Of particular interest is the special case of a complete graph (clique). A simple solution for this case is given by Algorithm 31.

Algorithm 31 Shared Object: Pointer Forwarding

Initialization: Object is stored at root r of a precomputed spanning tree T (as in the Arrow algorithm, each node has a parent pointer pointing towards the object).
Accessing Object: (by node u)
1: follow parent pointers to current root r of T
2: send object from r to u
3: r.parent := u; u.parent := u // u is the new root

Remarks:

• If the graph is not complete, routing can be used to find the root.

• Assume that the nodes line up in a linked list. If we always choose the first node of the linked list to acquire the object, we have message/time complexity n. The new topology is again a linear linked list. Pointer forwarding is therefore bad in the worst case.

• If edges are not FIFO, it can even happen that the number of steps is unbounded for a node having bad luck. An algorithm with such a property is named “not fair,” or “not wait-free.” (Example: Initially we have the list 4 → 3 → 2 → 1; 4 starts a find; when the message of 4 passes 3, 3 itself starts a find. The message of 3 may arrive at 2 and then 1 earlier, thus the new end of the list is 2 → 1 → 3; once the message of 4 passes 2, the game re-starts.)

There seems to be a natural improvement of the pointer forwarding idea. Instead of simply redirecting the parent pointer from the old root to the new root, we can redirect all the parent pointers of the nodes on the path visited during a find message to the new root. The details are given by Algorithm 32. Figure 6.1 shows how the pointer redirecting affects a given tree (the right tree results from a find request started at node x0 on the left tree).

Algorithm 32 Shared Object: Ivy

Initialization: Object is stored at root r of a precomputed spanning tree T (as before, each node has a parent pointer pointing towards the object). For simplicity, we assume that accesses to the object are sequential.
Start Find Request at Node u:
1: u sends “find by u” message to parent node
2: u.parent := u
Upon v receiving “Find by u” Message:
3: if v.parent = v then
4:   send object to u
5: else
6:   send “find by u” message to v.parent
7: end if
8: v.parent := u // u will become the new root

[Figure 6.1: Reversal of the path x0, x1, x2, x3, x4, x5.]

Remarks:

• Also with Algorithm 32, we might have a bad linked list situation. However, if the start of the list acquires the object, the linked list turns into a star. As the following theorem shows, the search paths are not long on average. Since paths sometimes can be bad, we will need amortized analysis.

Theorem 6.5. If the initial tree is a star, a find request of Algorithm 32 needs at most log n steps on average, where n is the number of processors.

Proof. All logarithms in the following proof are to base 2. We assume that accesses to the shared object are sequential. We use a potential function argument. Let s(u) be the size of the subtree rooted at node u (the number of nodes in the subtree including u itself). We define the potential Φ of the whole tree T as (V is the set of all nodes)

    Φ(T) = Σ_{u∈V} log s(u) / 2.

Assume that the path traversed by the ith operation has length ki, i.e., the ith operation redirects ki pointers to the new root. Clearly, the number of steps of the ith operation is proportional to ki. We are interested in the cost of m consecutive operations, Σ_{i=1}^{m} ki.

Let T0 be the initial tree and let Ti be the tree after the ith operation. Further, let ai = ki − Φ(Ti−1) + Φ(Ti) be the amortized cost of the ith operation. We have

    Σ_{i=1}^{m} ai = Σ_{i=1}^{m} (ki − Φ(Ti−1) + Φ(Ti)) = Σ_{i=1}^{m} ki − Φ(T0) + Φ(Tm).

For any tree T, we have Φ(T) ≥ log(n)/2. Because we assume that T0 is a star, we also have Φ(T0) = log(n)/2. We therefore get that

    Σ_{i=1}^{m} ai ≥ Σ_{i=1}^{m} ki.

Hence, it suffices to upper bound the amortized cost of every operation. We thus analyze the amortized cost ai of the ith operation. Let x0, x1, x2, . . . , x_{ki} be the path that is reversed by the operation. Further, for 0 ≤ j ≤ ki, let sj be the size of the subtree rooted at xj before the reversal. The size of the subtree rooted at x0 after the reversal is s_{ki}, and the size of the one rooted at xj after the reversal, for 1 ≤ j ≤ ki, is sj − s_{j−1} (see Figure 6.1). For all other nodes, the sizes of their subtrees are the same, therefore the corresponding terms cancel out in the amortized cost ai. We can thus write ai as

    ai = ki − ( Σ_{j=0}^{ki} (1/2) log sj ) + ( (1/2) log s_{ki} + Σ_{j=1}^{ki} (1/2) log(sj − s_{j−1}) )
       = ki + (1/2) · Σ_{j=0}^{ki−1} ( log(s_{j+1} − sj) − log sj )
       = ki + (1/2) · Σ_{j=0}^{ki−1} log( (s_{j+1} − sj) / sj ).

For 0 ≤ j ≤ ki − 1, let αj = s_{j+1}/sj. Note that s_{j+1} > sj and thus αj > 1. Further note that (s_{j+1} − sj)/sj = αj − 1. We therefore have that

    ai = ki + (1/2) · Σ_{j=0}^{ki−1} log(αj − 1)
       = Σ_{j=0}^{ki−1} ( 1 + (1/2) log(αj − 1) ).

For α > 1, it can be shown that 1 + log(α − 1)/2 ≤ log α (see Lemma 6.6). From this inequality, we obtain

    ai ≤ Σ_{j=0}^{ki−1} log αj = Σ_{j=0}^{ki−1} log(s_{j+1}/sj) = Σ_{j=0}^{ki−1} (log s_{j+1} − log sj)
       = log s_{ki} − log s0 ≤ log n,

because s_{ki} = n and s0 ≥ 1. This concludes the proof.

Lemma 6.6. For α > 1, 1 + log(α − 1)/2 ≤ log α.

Proof. The claim can be verified by the following chain of reasoning:

    0 ≤ (α − 2)²
    0 ≤ α² − 4α + 4
    4(α − 1) ≤ α²
    log₂(4(α − 1)) ≤ log₂(α²)
    2 + log₂(α − 1) ≤ 2 log₂ α
    1 + (1/2) log₂(α − 1) ≤ log₂ α.

Remarks:

• Systems guys (the algorithm is called Ivy because it was used in a system with the same name) have some fancy heuristics to improve performance even more: For example, the root every now and then broadcasts its name so that paths will be shortened.

• What about concurrent requests? It works with the same argument as in Arrow. Also for Ivy an argument including congestion is missing (and more pressing, since the dynamic topology of a tree cannot be chosen to have low degree and thus low congestion as in Arrow).

• Sometimes the type of accesses allows several accesses to be combined into one to reduce congestion higher up the tree. Let the tree in Algorithm 27 be a balanced binary tree. If the access to a shared variable for example is “add value x to the shared variable”, two or more accesses that accidentally meet at a node can be combined into one. Clearly accidental meeting is rare in an asynchronous model. We might be able to use synchronizers (or maybe some other timing tricks) to help meeting a little bit.

Chapter Notes

The Arrow protocol was designed by Raymond [Ray89]. There are real life implementations of the Arrow protocol, such as the Aleph Toolkit [Her99]. The performance of the protocol under high loads was tested in [HW99], and other implementations and variations of the protocol were given in, e.g., [PR99, HTW00]. It has been shown that the find operations of the protocol do not backtrack, i.e., the time and message complexities are O(D) [DH98], and that the Arrow protocol is fault tolerant [HT01]. Given a set of concurrent requests, Herlihy et al. [HTW01] showed that the time and message complexities are within factor log R from the optimal, where R is the number of requests. Later, this analysis was extended to long-lived and asynchronous systems. In particular, Herlihy et al. [HKTW06] showed that the competitive ratio in this asynchronous concurrent setting is O(log D). Thanks to the lower bound of the greedy TSP heuristic, this is almost tight.

The Ivy system was introduced in [Li88, LH89]. On the theory side, it was shown by Ginat et al. [GST89] that the amortized cost of a single request of the Ivy protocol is Θ(log n). Closely related work to the Ivy protocol on the practical side is research on virtual memory and parallel computing on loosely coupled multiprocessors. For example, [BB81, LSHL82, FR86] contain studies on variations of the network models, limitations on data sharing between processes, and different approaches.

Later, the research focus shifted towards systems where most data operations were read operations, i.e., efficient caching became one of the main objects of study, e.g., [MMVW97].

Bibliography

[BB81] Thomas J. Buckholtz and Helen T. Buckholtz. Apollo Domain Architecture. Technical report, Apollo Computer, Inc., 1981.

[DH98] Michael J. Demmer and Maurice Herlihy. The Arrow Distributed Directory Protocol. In Proceedings of the 12th International Symposium on Distributed Computing (DISC), 1998.

[FR86] Robert Fitzgerald and Richard F. Rashid. The Integration of Virtual Memory Management and Interprocess Communication in Accent. ACM Transactions on Computer Systems, 4(2):147–177, 1986.

[GST89] David Ginat, Daniel Sleator, and Robert Tarjan. A Tight Amortized Bound for Path Reversal. Information Processing Letters, 31(1):3–5, 1989.

[Her99] Maurice Herlihy. The Aleph Toolkit: Support for Scalable Distributed Shared Objects. In Proceedings of the Third International Workshop on Network-Based Parallel Computing: Communication, Architecture, and Applications (CANPC), pages 137–149, 1999.

[HKTW06] Maurice Herlihy, Fabian Kuhn, Srikanta Tirthapura, and Roger Wattenhofer. Dynamic Analysis of the Arrow Distributed Protocol. In Theory of Computing Systems, Volume 39, Number 6, November 2006.

[HT01] Maurice Herlihy and Srikanta Tirthapura. Self Stabilizing Distributed Queuing. In Proceedings of the 15th International Conference on Distributed Computing (DISC), pages 209–223, 2001.

[HTW00] Maurice Herlihy, Srikanta Tirthapura, and Roger Wattenhofer. Ordered Multicast and Distributed Swap. In Operating Systems Review, Volume 35/1, 2001. Also in PODC Middleware Symposium, Portland, Oregon, July 2000.

[HTW01] Maurice Herlihy, Srikanta Tirthapura, and Roger Wattenhofer. Competitive Concurrent Distributed Queuing. In Twentieth ACM Symposium on Principles of Distributed Computing (PODC), August 2001.

[HW99] Maurice Herlihy and Michael Warres. A Tale of Two Directories: Implementing Distributed Shared Objects in Java. In Proceedings of the ACM 1999 Conference on Java Grande (JAVA), pages 99–108, 1999.

[LH89] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):312–359, November 1989.

[Li88] Kai Li. IVY: Shared Virtual Memory System for Parallel Computing. In International Conference on Parallel Processing, 1988.

[LSHL82] Paul J. Leach, Bernard L. Stumpf, James A. Hamilton, and Paul H. Levine. UIDs as Internal Names in a Distributed File System. In Proceedings of the First ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC), pages 34–41, 1982.

[MMVW97] B. Maggs, F. Meyer auf der Heide, B. Voecking, and M. Westermann. Exploiting Locality for Data Management in Systems of Limited Bandwidth. In IEEE Symposium on Foundations of Computer Science (FOCS), 1997.

[PR99] David Peleg and Eilon Reshef. A Variant of the Arrow Distributed Directory Protocol with Low Average Complexity. In Proceedings of the 26th International Colloquium on Automata, Languages and Programming (ICALP), pages 615–624, 1999.

[Ray89] Kerry Raymond. A Tree-based Algorithm for Distributed Mutual Exclusion. ACM Transactions on Computer Systems, 7:61–77, 1989.

Chapter 7

Maximal Independent Set

In this chapter we present a highlight of this course, a fast maximal independent set (MIS) algorithm. The algorithm is the first randomized algorithm that we study in this class. In distributed computing, randomization is a powerful and therefore omnipresent concept, as it allows for relatively simple yet efficient algorithms. As such the studied algorithm is archetypal.

A MIS is a basic building block in distributed computing; some other problems pretty much follow directly from the MIS problem. At the end of this chapter, we will give two examples: matching and vertex coloring (see Chapter 1).

7.1 MIS

Definition 7.1 (Independent Set). Given an undirected graph G = (V, E), an independent set is a subset of nodes U ⊆ V, such that no two nodes in U are adjacent. An independent set is maximal if no node can be added without violating independence. An independent set of maximum cardinality is called maximum.

[Figure 7.1: Example graph with 1) a maximal independent set (MIS) and 2) a maximum independent set (MaxIS).]

Remarks:

• Computing a maximum independent set (MaxIS) is a notoriously difficult problem. It is equivalent to maximum clique on the complementary graph. Both problems are NP-hard, in fact not approximable within n^{1/2−ε} in polynomial time.

• In this course we concentrate on the maximal independent set (MIS) problem. Please note that MIS and MaxIS can be quite different, e.g. on a star graph there exists an MIS that is Θ(n) smaller than the MaxIS (cf. Figure 7.1).

• Computing a MIS sequentially is trivial: Scan the nodes in arbitrary order. If a node u does not violate independence, add u to the MIS. If u violates independence, discard u. So the only question is how to compute a MIS in a distributed way.

Algorithm 33 Slow MIS

Require: Node IDs
Every node v executes the following code:
1: if all neighbors of v with larger identifiers have decided not to join the MIS then
2:   v decides to join the MIS
3: end if

Remarks:

• Not surprisingly the slow algorithm is not better than the sequential algorithm in the worst case, because there might be one single point of activity at any time. Formally:

Theorem 7.2 (Analysis of Algorithm 33). Algorithm 33 features a time complexity of O(n) and a message complexity of O(m).

Remarks:

• This is not very exciting.

• There is a relation between independent sets and node coloring (Chapter 1), since each color class is an independent set, however, not necessarily a MIS. Still, starting with a coloring, one can easily derive a MIS algorithm: In the first round all nodes of the first color join the MIS and notify their neighbors. Then, all nodes of the second color which do not have a neighbor that is already in the MIS join the MIS and inform their neighbors. This process is repeated for all colors. Thus the following corollary holds:

Corollary 7.3. Given a coloring algorithm that runs in time T and needs C colors, we can construct a MIS in time T + C.
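The sequential scan mentioned in the remarks above takes only a few lines of code. The following sketch (the function name is ours) computes a MIS greedily and, on a star graph scanned center-first, returns only the center — illustrating the Θ(n) gap between a MIS and the MaxIS.

```python
def greedy_mis(n, edges):
    """Sequential MIS: scan nodes in increasing-ID order; add a node
    unless one of its neighbors was already added (which would violate
    independence)."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    mis = set()
    for u in range(n):
        if not (adj[u] & mis):      # no neighbor is in the set yet
            mis.add(u)
    return mis

# Star with center 0: scanned first, the center alone is a (small) MIS,
# while the n-1 leaves would form the MaxIS.
star = [(0, i) for i in range(1, 6)]
assert greedy_mis(6, star) == {0}
```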

Remarks:

• Using Theorem 1.14 and Corollary 7.3 we get a distributed deterministic MIS algorithm for trees (and for bounded degree graphs) with time complexity O(log* n).

• With a lower bound argument one can show that this deterministic MIS algorithm is asymptotically optimal for rings.

• There have been attempts to extend Algorithm 4 to more general graphs, however, so far without much success. Below we present a radically different approach that uses randomization.

7.2 Original Fast MIS

Algorithm 34 Fast MIS

The algorithm operates in synchronous rounds, grouped into phases.
A single phase is as follows:
1) Each node v marks itself with probability 1/(2d(v)), where d(v) is the current degree of v.
2) If no higher degree neighbor of v is also marked, node v joins the MIS. If a higher degree neighbor of v is marked, node v unmarks itself again. (If the neighbors have the same degree, ties are broken arbitrarily, e.g., by identifier).
3) Delete all nodes that joined the MIS and their neighbors, as they cannot join the MIS anymore.

Remarks:

• Correctness in the sense that the algorithm produces an independent set is relatively simple: Steps 1 and 2 make sure that if a node v joins the MIS, then v’s neighbors do not join the MIS at the same time. Step 3 makes sure that v’s neighbors will never join the MIS.

• Likewise the algorithm eventually produces a MIS, because the node with the highest degree will mark itself at some point in Step 1.

• So the only remaining question is how fast the algorithm terminates. To understand this, we need to dig a bit deeper.

Lemma 7.4 (Joining MIS). A node v joins the MIS in Step 2 with probability p ≥ 1/(4d(v)).

Proof: Let M be the set of marked nodes in Step 1 and MIS be the set of nodes that join the MIS in Step 2. Let H(v) be the set of neighbors of v with higher degree, or same degree and higher identifier. Using independence of the random choices of v and nodes in H(v) in Step 1 we get

    P[v ∉ MIS | v ∈ M] = P[there is a node w ∈ H(v), w ∈ M | v ∈ M]
                       = P[there is a node w ∈ H(v), w ∈ M]
                       ≤ Σ_{w∈H(v)} P[w ∈ M] = Σ_{w∈H(v)} 1/(2d(w))
                       ≤ Σ_{w∈H(v)} 1/(2d(v)) ≤ d(v)/(2d(v)) = 1/2.

Then

    P[v ∈ MIS] = P[v ∈ MIS | v ∈ M] · P[v ∈ M] ≥ (1/2) · 1/(2d(v)).

Lemma 7.5 (Good Nodes). A node v is called good if

    Σ_{w∈N(v)} 1/(2d(w)) ≥ 1/6,

where N(v) is the set of neighbors of v. Otherwise we call v a bad node. A good node will be removed in Step 3 with probability p ≥ 1/36.

Proof: Let node v be good. Intuitively, good nodes have lots of low-degree neighbors, thus chances are high that one of them goes into the independent set, in which case v will be removed in Step 3 of the algorithm.

If there is a neighbor w ∈ N(v) with degree at most 2 we are done: With Lemma 7.4 the probability that node w joins the MIS is at least 1/8, and our good node will be removed in Step 3.

So all we need to worry about is the case that all neighbors have at least degree 3: For any neighbor w of v we have 1/(2d(w)) ≤ 1/6. Since Σ_{w∈N(v)} 1/(2d(w)) ≥ 1/6, there is a subset of neighbors S ⊆ N(v) such that

    1/6 ≤ Σ_{w∈S} 1/(2d(w)) ≤ 1/3.

We can now bound the probability that node v will be removed. Let therefore R be the event of v being removed. Again, if a neighbor of v joins the MIS in Step 2, node v will be removed in Step 3. We have

    P[R] ≥ P[there is a node u ∈ S, u ∈ MIS]
         ≥ Σ_{u∈S} P[u ∈ MIS] − Σ_{u,w∈S; u≠w} P[u ∈ MIS and w ∈ MIS].

For the last inequality we used the inclusion-exclusion principle truncated after the second order terms. Let M again be the set of marked nodes after

Step 1. Using P[u ∈ M] ≥ P[u ∈ MIS] we get

    P[R] ≥ Σ_{u∈S} P[u ∈ MIS] − Σ_{u,w∈S; u≠w} P[u ∈ M and w ∈ M]
         ≥ Σ_{u∈S} P[u ∈ MIS] − Σ_{u∈S} Σ_{w∈S} P[u ∈ M] · P[w ∈ M]
         ≥ Σ_{u∈S} 1/(4d(u)) − Σ_{u∈S} Σ_{w∈S} 1/(2d(u)) · 1/(2d(w))
         ≥ Σ_{u∈S} 1/(2d(u)) · ( 1/2 − Σ_{w∈S} 1/(2d(w)) ) ≥ (1/6) · (1/2 − 1/3) = 1/36.

Remarks:

• We would be almost finished if we could prove that many nodes are good in each phase. Unfortunately this is not the case: In a star-graph, for instance, only a single node is good! We need to find a work-around.

Lemma 7.6 (Good Edges). An edge e = (u, v) is called bad if both u and v are bad; else the edge is called good. The following holds: At any time at least half of the edges are good.

Proof: For the proof we construct a directed auxiliary graph: Direct each edge towards the higher degree node (if both nodes have the same degree direct it towards the higher identifier). Now we need a little helper lemma before we can continue with the proof.

Lemma 7.7. A bad node has outdegree (number of edges pointing away from the bad node) at least twice its indegree (number of edges pointing towards the bad node).

Proof: For the sake of contradiction, assume that a bad node v does not have outdegree at least twice its indegree. In other words, at least one third of the neighbor nodes (let’s call them S) have degree at most d(v). But then

    Σ_{w∈N(v)} 1/(2d(w)) ≥ Σ_{w∈S} 1/(2d(w)) ≥ Σ_{w∈S} 1/(2d(v)) ≥ (d(v)/3) · 1/(2d(v)) = 1/6,

which means v is good, a contradiction.

Continuing the proof of Lemma 7.6: According to Lemma 7.7 the number of edges directed into bad nodes is at most half the number of edges directed out of bad nodes. Thus, the number of edges directed into bad nodes is at most half the number of edges. Thus, at least half of the edges are directed into good nodes. Since these edges are not bad, they must be good.

Theorem 7.8 (Analysis of Algorithm 34). Algorithm 34 terminates in expected time O(log n).

Proof: With Lemma 7.5 a good node (and therefore a good edge!) will be deleted with constant probability. Since at least half of the edges are good (Lemma 7.6), a constant fraction of edges will be deleted in each phase.

More formally: With Lemmas 7.5 and 7.6 we know that at least half of the edges will be removed with probability at least 1/36. Let R be the number of edges to be removed in a certain phase. Using linearity of expectation (cf. Theorem 7.9) we know that E[R] ≥ m/72, m being the total number of edges at the start of the phase. Now let p := P[R ≤ E[R]/2]. Bounding the expectation yields

    E[R] = Σ_r P[R = r] · r ≤ P[R ≤ E[R]/2] · E[R]/2 + P[R > E[R]/2] · m
         = p · E[R]/2 + (1 − p) · m.

Solving for p we get

    p ≤ (m − E[R]) / (m − E[R]/2) < (m − E[R]/2) / m ≤ 1 − 1/144.

In other words, with probability at least 1/144 at least m/144 edges are removed in a phase. After expected O(log m) phases all edges are deleted. Since m ≤ n² and thus O(log m) = O(log n), the Theorem follows.

Remarks:

• With a bit more math one can even show that Algorithm 34 terminates in time O(log n) “with high probability”.

7.3 Fast MIS v2

Algorithm 35 Fast MIS 2

The algorithm operates in synchronous rounds, grouped into phases.
A single phase is as follows:
1) Each node v chooses a random value r(v) ∈ [0, 1] and sends it to its neighbors.
2) If r(v) < r(w) for all neighbors w ∈ N(v), node v enters the MIS and informs its neighbors.
3) If v or a neighbor of v entered the MIS, v terminates (v and all edges adjacent to v are removed from the graph), otherwise v enters the next phase.

Remarks:

• Correctness in the sense that the algorithm produces an independent set is simple: Steps 1 and 2 make sure that if a node v joins the MIS, then v’s neighbors do not join the MIS at the same time. Step 3 makes sure that v’s neighbors will never join the MIS.

• Likewise the algorithm eventually produces a MIS, because the node with the globally smallest value will always join the MIS, hence there is progress.

• So the only remaining question is how fast the algorithm terminates. To understand this, we need to dig a bit deeper.
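Before diving into the analysis, Algorithm 35 can be sanity-checked in simulation. The sketch below (a centralized, synchronous rendering of the phases; all names are ours) runs the algorithm on a random graph and verifies that the output is indeed a maximal independent set.

```python
import random

def fast_mis_2(adj, rng):
    """Centralized simulation of Algorithm 35: per phase, every surviving
    node draws r(v) in [0,1]; nodes that are local minima join the MIS,
    then they and their neighbors are removed from the graph."""
    alive = set(adj)
    mis = set()
    while alive:
        r = {v: rng.random() for v in alive}               # Step 1
        joined = {v for v in alive
                  if all(r[v] < r[w] for w in adj[v] if w in alive)}
        mis |= joined                                      # Step 2: local minima enter
        removed = set(joined)
        for v in joined:                                   # Step 3: winners + neighbors leave
            removed |= adj[v] & alive
        alive -= removed
    return mis

# Random graph; the output must be independent and maximal.
rng = random.Random(1)
n = 40
adj = {v: set() for v in range(n)}
for u in range(n):
    for v in range(u + 1, n):
        if rng.random() < 0.1:
            adj[u].add(v)
            adj[v].add(u)
mis = fast_mis_2(adj, rng)
assert all(w not in mis for v in mis for w in adj[v])                    # independent
assert all(v in mis or any(w in mis for w in adj[v]) for v in range(n))  # maximal
```

Note that a node without surviving neighbors is vacuously a local minimum and joins immediately, and that ties among the r(v) occur with probability 0, mirroring the remarks on correctness above.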

• Our proof will rest on a simple, yet powerful observation about expected values of random variables that may not be independent:

Theorem 7.9 (Linearity of Expectation). Let Xi, i = 1, . . . , k denote random variables, then

    E[ Σ_i Xi ] = Σ_i E[Xi].

Proof. It is sufficient to prove E[X + Y] = E[X] + E[Y] for two random variables X and Y, because then the statement follows by induction. Since

    P[(X, Y) = (x, y)] = P[X = x] · P[Y = y | X = x]
                       = P[Y = y] · P[X = x | Y = y]

we get that

    E[X + Y] = Σ_{(X,Y)=(x,y)} P[(X, Y) = (x, y)] · (x + y)
             = Σ_{X=x} Σ_{Y=y} P[X = x] · P[Y = y | X = x] · x
               + Σ_{Y=y} Σ_{X=x} P[Y = y] · P[X = x | Y = y] · y
             = Σ_{X=x} P[X = x] · x + Σ_{Y=y} P[Y = y] · y
             = E[X] + E[Y].

Remarks:

• How can we prove that the algorithm only needs O(log n) phases in expectation? It would be great if this algorithm managed to remove a constant fraction of nodes in each phase. Unfortunately, it does not.

• Instead we will prove that the number of edges decreases quickly. Again, it would be great if any single edge was removed with constant probability in Step 3. But again, unfortunately, this is not the case.

• Maybe we can argue about the expected number of edges to be removed in one single phase? Let’s see: A node v enters the MIS with probability 1/(d(v) + 1), where d(v) is the degree of node v. By doing so, not only are v’s edges removed, but indeed all the edges of v’s neighbors as well – generally these are much more than d(v) edges. So there is hope, but we need to be careful: If we do this the most naive way, we will count the same edge many times.

Lemma 7.10 (Edge Removal). In a single phase, we remove at least half of the edges in expectation.

Proof. To simplify the notation, at the start of our phase, the graph is simply G = (V, E). In addition, to ease presentation, we replace each undirected edge {v, w} by the two directed edges (v, w) and (w, v).

Suppose that a node v joins the MIS in this phase, i.e., r(v) < r(w) for all neighbors w ∈ N(v). If in addition we have r(v) < r(x) for all neighbors x of a neighbor w of v, we call this event (v → w). The probability of event (v → w) is at least 1/(d(v) + d(w)), since d(v) + d(w) is the maximum number of nodes adjacent to v or w (or both). As v joins the MIS, all (directed) edges (w, x) with x ∈ N(w) will be removed; there are d(w) of these edges.

We now count the removed edges. Whether we remove the edges adjacent to w because of event (v → w) is a random variable X_{(v→w)}. If event (v → w) occurs, X_{(v→w)} has the value d(w); if not, it has the value 0. For each undirected edge {v, w} we have two such variables, X_{(v→w)} and X_{(w→v)}. Due to Theorem 7.9, the expected value of the sum X of all these random variables is at least

    E[X] = Σ_{{v,w}∈E} ( E[X_{(v→w)}] + E[X_{(w→v)}] )
         = Σ_{{v,w}∈E} ( P[Event (v → w)] · d(w) + P[Event (w → v)] · d(v) )
         ≥ Σ_{{v,w}∈E} ( d(w)/(d(v) + d(w)) + d(v)/(d(w) + d(v)) )
         = Σ_{{v,w}∈E} 1 = |E|.

In other words, in expectation |E| directed edges are removed in a single phase! Note that we did not double count any edge removals, as a directed edge (w, x) can only be removed by an event (v → w). The event (v → w) inhibits a concurrent event (v′ → w) since r(v) < r(v′) for all v′ ∈ N(w). We may have counted an undirected edge at most twice (once in each direction). So, in expectation at least half of the undirected edges are removed.

Remarks:

• This makes it easy to derive a bound on the expected running time of Algorithm 35.

Theorem 7.11 (Expected running time of Algorithm 35). Algorithm 35 terminates after at most 3 log_{4/3} m + 1 ∈ O(log n) phases in expectation.

Proof: The probability that in a single phase at least a quarter of all edges are removed is at least 1/3. For the sake of contradiction, assume not. Then
• How can we fix this? The nice observation is that it is enough to with probability less than 1/3 we may be lucky and many (potentially all) edges
count just some of the removed edges. Given a new MIS node v and are removed. With probability more than 2/3 less than 1/4 of the edges are
a neighbor w ∈ N (v), we count the edges only if r(v) < r(x) for all removed. Hence the expected fraction of removed edges is strictly less than
x ∈ N (w). This looks promising. In a star graph, for instance, only 1/3 · 1 + 2/3 · 1/4 = 1/2. This contradicts Lemma 7.10.
the smallest random value can be accounted for removing all the edges Hence, in expectation at least every third phase is “good” and removes at
of the star. least a quarter of the edges. To get rid of all but two edges we need log4/3 m
good phases in expectation. The last two edges will certainly be removed in the next phase. Hence a total of 3 log_{4/3} m + 1 phases are enough in expectation.

Remarks:

• Sometimes one expects a bit more of an algorithm: Not only should the expected time to terminate be good, but the algorithm should always terminate quickly. As this is impossible in randomized algorithms (after all, the random choices may be "unlucky" all the time!), researchers often settle for a compromise, and just demand that the probability that the algorithm does not terminate in the specified time can be made absurdly small. For our algorithm, this can be deduced from Lemma 7.10 and another standard tool, namely Chernoff's Bound.

Definition 7.12 (W.h.p.). We say that an algorithm terminates w.h.p. (with high probability) within O(t) time if it does so with probability at least 1 − 1/n^c for any choice of c ≥ 1. Here c may affect the constants in the Big-O notation because it is considered a "tunable constant" and is usually kept small.

Definition 7.13 (Chernoff's Bound). Let X = Σ_{i=1}^{k} Xi be the sum of k independent 0−1 random variables. Then Chernoff's bound states that w.h.p.

|X − E[X]| ∈ O(log n + √(E[X] log n)).

Corollary 7.14 (Running Time of Algorithm 35). Algorithm 35 terminates w.h.p. in O(log n) time.

Proof: In Theorem 7.11 we used that independently of everything that happened before, in each phase we have a constant probability p that a quarter of the edges are removed. Call such a phase good. For some constants C1 and C2, let us check after C1 log n + C2 ∈ O(log n) phases, in how many phases at least a quarter of the edges have been removed. In expectation, these are at least p(C1 log n + C2) many. Now we look at the random variable X = Σ_{i=1}^{C1 log n + C2} Xi, where the Xi are independent 0−1 variables being one with exactly probability p. Certainly, if X is at least x with some probability, then the probability that we have x good phases can only be larger (if no edges are left, certainly "all" of the remaining edges are removed). To X we can apply Chernoff's bound. If C1 and C2 are chosen large enough, they will overcome the constants in the Big-O from Chernoff's bound, i.e., w.h.p. it holds that |X − E[X]| ≤ E[X]/2, implying X ≥ E[X]/2. Choosing C1 large enough, we will have w.h.p. sufficiently many good phases, i.e., the algorithm terminates w.h.p. in O(log n) phases.

Remarks:

• The algorithm can be improved. Drawing random real numbers in each phase, for instance, is not necessary. One can achieve the same by sending only a total of O(log n) random (and as many non-random) bits over each edge.

• One of the main open problems in distributed computing is whether one can beat this logarithmic time, or at least achieve it with a deterministic algorithm.

• Let's turn our attention to applications of MIS next.

7.4 Applications

Definition 7.15 (Matching). Given a graph G = (V, E) a matching is a subset of edges M ⊆ E, such that no two edges in M are adjacent (i.e., where no node is adjacent to two edges in the matching). A matching is maximal if no edge can be added without violating the above constraint. A matching of maximum cardinality is called maximum. A matching is called perfect if each node is adjacent to an edge in the matching.

Remarks:

• In contrast to MaxIS, a maximum matching can be found in polynomial time, and is also easy to approximate, since any maximal matching is a 2-approximation.

• An independent set algorithm is also a matching algorithm: Let G = (V, E) be the graph for which we want to construct the matching. The so-called line graph G′ is defined as follows: for every edge in G there is a node in G′; two nodes in G′ are connected by an edge if their respective edges in G are adjacent. A (maximal) independent set in the line graph G′ is a (maximal) matching in the original graph G, and vice versa. Using Algorithm 35 directly produces an O(log n) bound for maximal matching.

• More importantly, our MIS algorithm can also be used for vertex coloring (Problem 1.1):

Algorithm 36 General Graph Coloring
1: Given a graph G = (V, E) we virtually build a graph G′ = (V′, E′) as follows:
2: Every node v ∈ V clones itself d(v) + 1 times (v0, . . . , v_{d(v)} ∈ V′), d(v) being the degree of v in G.
3: The edge set E′ of G′ is as follows:
4: First, all clones are in a clique: (vi, vj) ∈ E′, for all v ∈ V and all 0 ≤ i < j ≤ d(v).
5: Second, all ith clones of neighbors in the original graph G are connected: (ui, vi) ∈ E′, for all (u, v) ∈ E and all 0 ≤ i ≤ min(d(u), d(v)).
6: Now we simply run (simulate) the fast MIS Algorithm 35 on G′.
7: If node vi is in the MIS in G′, then node v gets color i.

Theorem 7.16 (Analysis of Algorithm 36). Algorithm 36 (∆ + 1)-colors an arbitrary graph in O(log n) time, with high probability, ∆ being the largest degree in the graph.

Proof: Thanks to the clique among the clones, at most one clone is in the MIS. And because of the d(v) + 1 clones of node v, every node will get a free color! The running time remains logarithmic since G′ has O(n²) nodes and the exponent becomes a constant factor when applying the logarithm.
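A sequential sketch of the clone-graph construction makes this concrete. The sketch below builds G′, simulates the randomized MIS phases in one process (rather than distributedly), and reads off the colors; all helper names are our own:

```python
import random
from itertools import combinations

def maximal_independent_set(adj):
    """Sequentially simulate the randomized MIS phases: repeat the
    'local random minimum joins' step until no nodes remain."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    mis = set()
    while adj:
        r = {v: random.random() for v in adj}
        joined = {v for v in adj if all(r[v] < r[w] for w in adj[v])}
        mis |= joined
        removed = joined | {w for v in joined for w in adj[v]}
        adj = {v: {w for w in nbrs if w not in removed}
               for v, nbrs in adj.items() if v not in removed}
    return mis

def color_via_mis(adj):
    """Algorithm 36: clone each node d(v)+1 times, connect each node's
    clones into a clique, connect the i-th clones of neighbors, compute
    an MIS of the clone graph, and use the MIS clone index as the color."""
    d = {v: len(adj[v]) for v in adj}
    clone_adj = {(v, i): set() for v in adj for i in range(d[v] + 1)}
    for v in adj:                          # clique among the clones of v
        for i, j in combinations(range(d[v] + 1), 2):
            clone_adj[(v, i)].add((v, j))
            clone_adj[(v, j)].add((v, i))
    for u in adj:                          # i-th clones of neighbors
        for v in adj[u]:
            for i in range(min(d[u], d[v]) + 1):
                clone_adj[(u, i)].add((v, i))
    mis = maximal_independent_set(clone_adj)
    return {v: i for (v, i) in mis}

random.seed(0)
# small test graph: a 5-cycle with a chord
g = {0: [1, 4, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 4], 4: [3, 0]}
colors = color_via_mis(g)
assert all(colors[u] != colors[v] for u in g for v in g[u])  # proper coloring
assert max(colors.values()) <= max(len(n) for n in g.values())  # at most Δ+1 colors
print(colors)
```

The asserts reflect Theorem 7.16: the clone cliques force at most one clone per node into the MIS, the d(v) + 1 clones guarantee one, and the clone index never exceeds d(v) ≤ ∆.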
Remarks:

• This solves our open problem from Chapter 1.1!

• Together with Corollary 7.3 we get quite close ties between (∆ + 1)-coloring and the MIS problem.

• Computing a MIS also solves another graph problem on graphs of bounded independence.

Definition 7.17 (Bounded Independence). G = (V, E) is of bounded independence, if for every node v ∈ V the largest independent set in the neighborhood N(v) is bounded by a constant.

Definition 7.18 ((Minimum) Dominating Sets). A dominating set is a subset of the nodes such that each node is in the set or adjacent to a node in the set. A minimum dominating set is a dominating set containing the least possible number of nodes.

Remarks:

• In general, finding a dominating set less than a factor log n larger than a minimum dominating set is NP-hard.

• Any MIS is a dominating set: if a node was not covered, it could join the independent set.

• In general a MIS and a minimum dominating set do not have much in common (think of a star). For graphs of bounded independence, this is different.

Corollary 7.19. On graphs of bounded independence, a constant-factor approximation to a minimum dominating set can be found in time O(log n) w.h.p.

Proof: Denote by M a minimum dominating set and by I a MIS. Since M is a dominating set, each node from I is in M or adjacent to a node in M. Since the graph is of bounded independence, no node in M is adjacent to more than constantly many nodes from I. Thus, |I| ∈ O(|M|). Therefore, we can compute a MIS with Algorithm 35 and output it as the dominating set, which takes O(log n) rounds w.h.p.

Chapter Notes

The fast MIS algorithm is a simplified version of an algorithm by Luby [Lub86]. Around the same time there have been a number of other papers dealing with the same or related problems, for instance by Alon, Babai, and Itai [ABI86], or by Israeli and Itai [II86]. The analysis presented in Section 7.2 takes elements of all these papers, and from other papers on distributed weighted matching [WW04]. The analysis in the book [Pel00] by David Peleg is different, and only achieves O(log² n) time. The new MIS variant (with the simpler analysis) of Section 7.3 is by Métivier, Robson, Saheb-Djahromi and Zemmari [MRSDZ11]. With some adaptations, the algorithms [Lub86, MRSDZ11] only need to exchange a total of O(log n) bits per node, which is asymptotically optimal, even on unoriented trees [KSOS06]. However, the distributed time complexity for MIS is still somewhat open, as the strongest lower bounds are Ω(√log n) or Ω(log ∆) [KMW04]. Recent research regarding the MIS problem focused on improving the O(log n) time complexity for special graph classes, for instance growth-bounded graphs [SW08] or trees [LW11]. There are also results that depend on the degree of the graph [BE09, Kuh09]. Deterministic MIS algorithms are still far from the lower bounds, as the best deterministic MIS algorithm takes 2^{O(√log n)} time [PS96]. The maximum matching algorithm mentioned in the remarks is the blossom algorithm by Jack Edmonds.

Bibliography

[ABI86] Noga Alon, László Babai, and Alon Itai. A Fast and Simple Randomized Parallel Algorithm for the Maximal Independent Set Problem. J. Algorithms, 7(4):567–583, 1986.

[BE09] Leonid Barenboim and Michael Elkin. Distributed (delta+1)-coloring in linear (in delta) time. In 41st ACM Symposium On Theory of Computing (STOC), 2009.

[II86] Amos Israeli and Alon Itai. A Fast and Simple Randomized Parallel Algorithm for Maximal Matching. Inf. Process. Lett., 22(2):77–80, 1986.

[KMW04] F. Kuhn, T. Moscibroda, and R. Wattenhofer. What Cannot Be Computed Locally! In Proceedings of the 23rd ACM Symposium on Principles of Distributed Computing (PODC), July 2004.

[KSOS06] Kishore Kothapalli, Christian Scheideler, Melih Onus, and Christian Schindelhauer. Distributed coloring in O(√log n) Bit Rounds. In 20th International Conference on Parallel and Distributed Processing (IPDPS), 2006.

[Kuh09] Fabian Kuhn. Weak graph colorings: distributed algorithms and applications. In 21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2009.

[Lub86] Michael Luby. A Simple Parallel Algorithm for the Maximal Independent Set Problem. SIAM J. Comput., 15(4):1036–1053, 1986.

[LW11] Christoph Lenzen and Roger Wattenhofer. MIS on trees. In PODC, pages 41–48, 2011.

[MRSDZ11] Yves Métivier, John Michael Robson, Nasser Saheb-Djahromi, and Akka Zemmari. An optimal bit complexity randomized distributed MIS algorithm. Distributed Computing, 23(5-6):331–340, 2011.

[Pel00] David Peleg. Distributed Computing: a Locality-Sensitive Approach. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[PS96] Alessandro Panconesi and Aravind Srinivasan. On the Complexity of Distributed Network Decomposition. J. Algorithms, 20(2):356–374, 1996.

[SW08] Johannes Schneider and Roger Wattenhofer. A Log-Star Distributed Maximal Independent Set Algorithm for Growth-Bounded Graphs. In 27th ACM Symposium on Principles of Distributed Computing (PODC), Toronto, Canada, August 2008.

[WW04] Mirjam Wattenhofer and Roger Wattenhofer. Distributed Weighted Matching. In 18th Annual Conference on Distributed Computing (DISC), Amsterdam, Netherlands, October 2004.
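To round off the chapter, the line-graph reduction behind the maximal-matching remark in Section 7.4 is easy to make concrete. In the sketch below, a sequential greedy sweep stands in for the distributed Algorithm 35 (any maximal independent set works); all names are our own:

```python
from itertools import combinations

def line_graph(edges):
    """Line graph G': one node per edge of G; two nodes adjacent iff the
    corresponding edges of G share an endpoint."""
    nodes = [frozenset(e) for e in edges]
    adj = {e: set() for e in nodes}
    for e, f in combinations(nodes, 2):
        if e & f:
            adj[e].add(f)
            adj[f].add(e)
    return adj

def greedy_mis(adj):
    """Any maximal independent set will do; a greedy sweep stands in here
    for the distributed Algorithm 35."""
    mis, blocked = set(), set()
    for v in adj:
        if v not in blocked:
            mis.add(v)
            blocked |= adj[v]
    return mis

# 4-cycle with a chord
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
matching = greedy_mis(line_graph(edges))
# no two matched edges share a node ...
nodes_used = [v for e in matching for v in e]
assert len(nodes_used) == len(set(nodes_used))
# ... and the matching is maximal: every edge of G touches a matched node
assert all(any(v in set(nodes_used) for v in e) for e in edges)
print(sorted(tuple(sorted(e)) for e in matching))  # → [(0, 1), (2, 3)]
```

An independent set of G′ is exactly a set of pairwise non-adjacent edges of G, so maximality in G′ translates directly into maximality of the matching.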
Chapter 8

Locality Lower Bounds

In Chapter 1, we looked at distributed algorithms for coloring. In particular, we saw that rings and rooted trees can be colored with 3 colors in log* n + O(1) rounds.

8.1 Model

In this chapter, we will reconsider the distributed coloring problem. We will look at a classic lower bound that shows that the result of Chapter 1 is tight: Coloring rings (and rooted trees) indeed requires Ω(log* n) rounds. In particular, we will prove a lower bound for coloring in the following setting:

• We consider deterministic, synchronous algorithms.

• Message size and local computations are unbounded.

• We assume that the network is a directed ring with n nodes.

• Nodes have unique labels (identifiers) from 1 to n.

Remarks:

• A generalization of the lower bound to randomized algorithms is possible.

• Except for restricting to deterministic algorithms, all the conditions above make a lower bound stronger: Any lower bound for synchronous algorithms certainly also holds for asynchronous ones. A lower bound that is true if message size and local computations are not restricted is clearly also valid if we require a bound on the maximal message size or the amount of local computations. Similarly, also assuming that the ring is directed and that node labels are from 1 to n (instead of choosing IDs from a more general domain) strengthens the lower bound.

• Instead of directly proving that 3-coloring a ring needs Ω(log* n) rounds, we will prove a slightly more general statement. We will consider deterministic algorithms with time complexity r (for arbitrary r) and derive a lower bound on the number of colors that are needed if we want to properly color an n-node ring with an r-round algorithm. A 3-coloring lower bound can then be derived by taking the smallest r for which an r-round algorithm needs 3 or fewer colors.

8.2 Locality

Let us for a moment look at distributed algorithms more generally (i.e., not only at coloring and not only at rings). Assume that initially, all nodes only know their own label (identifier) and potentially some additional input. As information needs at least r rounds to travel r hops, after r rounds, a node v can only learn about other nodes at distance at most r. If message size and local computations are not restricted, it is in fact not hard to see that in r rounds, a node v can exactly learn all the node labels and inputs up to distance r. As shown by the following lemma, this allows us to transform every deterministic r-round synchronous algorithm into a simple canonical form.

Algorithm 37 Synchronous Algorithm: Canonical Form
1: In r rounds: send complete initial state to nodes at distance at most r
2: // do all the communication first
3: Compute output based on complete information about r-neighborhood
4: // do all the computation in the end

Lemma 8.1. If message size and local computations are not bounded, every deterministic, synchronous r-round algorithm can be transformed into an algorithm of the form given by Algorithm 37 (i.e., it is possible to first communicate for r rounds and then do all the computations in the end).

Proof. Consider some r-round algorithm A. We want to show that A can be brought to the canonical form given by Algorithm 37. First, we let the nodes communicate for r rounds. Assume that in every round, every node sends its complete state to all of its neighbors (remember that there is no restriction on the maximal message size). By induction, after i rounds, every node knows the initial state of all other nodes at distance at most i. Hence, after r rounds, a node v has the combined initial knowledge of all the nodes in its r-neighborhood. We want to show that this suffices to locally (at node v) simulate enough of Algorithm A to compute all the messages that v receives in the r communication rounds of a regular execution of Algorithm A.

Concretely, we prove the following statement by induction on i. For all nodes at distance at most r − i + 1 from v, node v can compute all messages of the first i rounds of a regular execution of A. Note that this implies that v can compute all the messages it receives from its neighbors during all r rounds. Because v knows the initial state of all nodes in the r-neighborhood, v can clearly compute all messages of the first round (i.e., the statement is true for i = 1). Let us now consider the induction step from i to i + 1. By the induction hypothesis, v can compute the messages of the first i rounds of all nodes in its (r − i + 1)-neighborhood. It can therefore compute all messages that are received by nodes in the (r − i)-neighborhood in the first i rounds. This is of
course exactly what is needed to compute the messages of round i + 1 of nodes in the (r − i)-neighborhood.

Remarks:

• It is straightforward to generalize the canonical form to randomized algorithms: Every node first computes all the random bits it needs throughout the algorithm. The random bits are then part of the initial state of a node.

Definition 8.2 (r-hop view). We call the collection of the initial states of all nodes in the r-neighborhood of a node v the r-hop view of v.

Remarks:

• Assume that initially, every node knows its degree, its label (identifier) and potentially some additional input. The r-hop view of a node v then includes the complete topology of the r-neighborhood (excluding edges between nodes at distance r) and the labels and additional inputs of all nodes in the r-neighborhood.

Based on the definition of an r-hop view, we can state the following corollary of Lemma 8.1.

Corollary 8.3. A deterministic r-round algorithm A is a function that maps every possible r-hop view to the set of possible outputs.

Proof. By Lemma 8.1, we know that we can transform Algorithm A to the canonical form given by Algorithm 37. After r communication rounds, every node v knows exactly its r-hop view. This information suffices to compute the output of node v.

Remarks:

• Note that the above corollary implies that two nodes with equal r-hop views have to compute the same output in every r-round algorithm.

• For coloring algorithms, the only input of a node v is its label. The r-hop view of a node therefore is its labeled r-neighborhood.

• If we only consider rings, r-hop neighborhoods are particularly simple. The labeled r-neighborhood of a node v (and hence its r-hop view) in an oriented ring is simply a (2r + 1)-tuple (ℓ−r, ℓ−r+1, . . . , ℓ0, . . . , ℓr) of distinct node labels where ℓ0 is the label of v. Assume that for i > 0, ℓi is the label of the ith clockwise neighbor of v and ℓ−i is the label of the ith counterclockwise neighbor of v. A deterministic coloring algorithm for oriented rings therefore is a function that maps (2r + 1)-tuples of node labels to colors.

• Consider two r-hop views Vr = (ℓ−r, . . . , ℓr) and V′r = (ℓ′−r, . . . , ℓ′r). If ℓ′i = ℓi+1 for −r ≤ i ≤ r − 1 and if ℓ′r ≠ ℓi for −r ≤ i ≤ r, the r-hop view V′r can be the r-hop view of a clockwise neighbor of a node with r-hop view Vr. Therefore, every algorithm A that computes a valid coloring needs to assign different colors to Vr and V′r. Otherwise, there is a ring labeling for which A assigns the same color to two adjacent nodes.

8.3 The Neighborhood Graph

We will now make the above observations concerning colorings of rings a bit more formal. Instead of thinking of an r-round coloring algorithm as a function from all possible r-hop views to colors, we will use a slightly different perspective. Interestingly, the problem of understanding distributed coloring algorithms can itself be seen as a classical graph coloring problem.

Definition 8.4 (Neighborhood Graph). For a given family of network graphs G, the r-neighborhood graph Nr(G) is defined as follows. The node set of Nr(G) is the set of all possible labeled r-neighborhoods (i.e., all possible r-hop views). There is an edge between two labeled r-neighborhoods Vr and V′r if Vr and V′r can be the r-hop views of two adjacent nodes.

Lemma 8.5. For a given family of network graphs G, there is an r-round algorithm that colors graphs of G with c colors iff the chromatic number of the neighborhood graph is χ(Nr(G)) ≤ c.

Proof. We have seen that a coloring algorithm is a function that maps every possible r-hop view to a color. Hence, a coloring algorithm assigns a color to every node of the neighborhood graph Nr(G). If two r-hop views Vr and V′r can be the r-hop views of two adjacent nodes u and v (for some labeled graph in G), every correct coloring algorithm must assign different colors to Vr and V′r. Thus, specifying an r-round coloring algorithm for a family of network graphs G is equivalent to coloring the respective neighborhood graph Nr(G).

Instead of directly defining the neighborhood graph for directed rings, we define directed graphs Bk that are closely related to the neighborhood graph. The node set of Bk contains all k-tuples of increasing node labels ([n] = {1, . . . , n}):

V[Bk] = {(α1, . . . , αk) : αi ∈ [n], i < j → αi < αj}    (8.1)

For α = (α1, . . . , αk) and β = (β1, . . . , βk) there is a directed edge from α to β iff

∀i ∈ {1, . . . , k − 1} : βi = αi+1.    (8.2)

Lemma 8.6. Viewed as an undirected graph, the graph B2r+1 is a subgraph of the r-neighborhood graph of directed n-node rings with node labels from [n].

Proof. The claim follows directly from the observations regarding r-hop views of nodes in a directed ring from Section 8.2. The set of k-tuples of increasing node labels is a subset of the set of k-tuples of distinct node labels. Two nodes of B2r+1 are connected by a directed edge iff the two corresponding r-hop views are connected by a directed edge in the neighborhood graph. Note that if there is an edge between α and β in Bk, then α1 ≠ βk because the node labels in α and β are increasing.

To determine a lower bound on the number of colors an r-round algorithm needs for directed n-node rings, it therefore suffices to determine a lower bound on the chromatic number of B2r+1. To obtain such a lower bound, we need the following definition.
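For tiny n, the graphs Bk are small enough to build explicitly, which lets one sanity-check the recursion Bk+1 = DL(Bk) of Lemma 8.8 below (DL being the diline graph of Definition 8.7) and the base case χ(B1) = n of Lemma 8.10. The sketch below does both; the brute-force chromatic number only works at this scale, and all helper names are our own:

```python
from itertools import combinations, product

def build_B(n, k):
    """B_k per (8.1)/(8.2): nodes are the increasing k-tuples over [n];
    directed edge alpha -> beta iff beta_i = alpha_{i+1} for i < k."""
    nodes = list(combinations(range(1, n + 1), k))
    edges = [(a, b) for a in nodes for b in nodes
             if a != b and a[1:] == b[:k - 1]]
    return nodes, edges

def diline(edges):
    """DL(G): one node per directed edge; edge (w,x) -> (y,z) iff x == y."""
    return list(edges), [(e, f) for e in edges for f in edges if e[1] == f[0]]

def chromatic_number(nodes, edges):
    """Brute force: smallest c admitting a proper coloring (tiny graphs only)."""
    for c in range(1, len(nodes) + 1):
        for assignment in product(range(c), repeat=len(nodes)):
            col = dict(zip(nodes, assignment))
            if all(col[a] != col[b] for a, b in edges):
                return c

n = 5
# chi(B_1) = n: B_1 is a complete graph on n nodes
assert chromatic_number(*build_B(n, 1)) == n

# B_{k+1} = DL(B_k), checked for k = 2 via the gamma-tuple representation
# of Lemma 8.8: the edge (alpha, beta) corresponds to alpha + (beta_k,)
nodes2, edges2 = build_B(n, 2)
dl_nodes, dl_edges = diline(edges2)
gamma = {(a, b): a + (b[-1],) for (a, b) in edges2}
nodes3, edges3 = build_B(n, 3)
assert sorted(gamma.values()) == sorted(nodes3)
assert sorted((gamma[e], gamma[f]) for (e, f) in dl_edges) == sorted(edges3)
print("checks passed for n =", n)
```

The gamma mapping is exactly the (k + 1)-tuple representation used in the proof of Lemma 8.8; the two sorted comparisons verify that it is a bijection on nodes that also preserves edges.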
Definition 8.7 (Diline Graph). The directed line graph (diline graph) DL(G) of a directed graph G = (V, E) is defined as follows. The node set of DL(G) is V[DL(G)] = E. There is a directed edge ((w, x), (y, z)) between (w, x) ∈ E and (y, z) ∈ E iff x = y, i.e., if the first edge ends where the second one starts.

Lemma 8.8. If n > k, the graph Bk+1 can be defined recursively as follows: Bk+1 = DL(Bk).

Proof. The edges of Bk are pairs of k-tuples α = (α1, . . . , αk) and β = (β1, . . . , βk) that satisfy Conditions (8.1) and (8.2). Because the last k − 1 labels in α are equal to the first k − 1 labels in β, the pair (α, β) can be represented by a (k + 1)-tuple γ = (γ1, . . . , γk+1) with γ1 = α1, γi = βi−1 = αi for 2 ≤ i ≤ k, and γk+1 = βk. Because the labels in α and the labels in β are increasing, the labels in γ are increasing as well. The two graphs Bk+1 and DL(Bk) therefore have the same node sets. There is an edge between two nodes (α¹, β¹) and (α², β²) of DL(Bk) if β¹ = α². This is equivalent to requiring that the two corresponding (k + 1)-tuples γ¹ and γ² are neighbors in Bk+1, i.e., that the last k labels of γ¹ are equal to the first k labels of γ².

The following lemma establishes a useful connection between the chromatic numbers of a directed graph G and its diline graph DL(G).

Lemma 8.9. For the chromatic numbers χ(G) and χ(DL(G)) of a directed graph G and its diline graph, it holds that

χ(DL(G)) ≥ log₂(χ(G)).

Proof. Given a c-coloring of DL(G), we show how to construct a 2^c-coloring of G. The claim of the lemma then follows because this implies that χ(G) ≤ 2^{χ(DL(G))}.

Assume that we are given a c-coloring of DL(G). A c-coloring of the diline graph DL(G) can be seen as a coloring of the edges of G such that no two adjacent edges have the same color. For a node v of G, let Sv be the set of colors of its outgoing edges. Let u and v be two nodes such that G contains a directed edge (u, v) from u to v, and let x be the color of (u, v). Clearly, x ∈ Su because (u, v) is an outgoing edge of u. Because adjacent edges have different colors, no outgoing edge (v, w) of v can have color x. Therefore x ∉ Sv. This implies that Su ≠ Sv. We can therefore use these color sets to obtain a vertex coloring of G, i.e., the color of u is Su and the color of v is Sv. Because the number of possible subsets of [c] is 2^c, this yields a 2^c-coloring of G.

Let log^(i) x be the i-fold application of the base-2 logarithm to x:

log^(1) x = log₂ x,    log^(i+1) x = log₂(log^(i) x).

Remember from Chapter 1 that

log* x = 1 if x ≤ 2,    log* x = 1 + min{i : log^(i) x ≤ 2}.

For the chromatic number of Bk, we obtain

Lemma 8.10. For all n ≥ 1, χ(B1) = n. Further, for n ≥ k ≥ 2, χ(Bk) ≥ log^(k−1) n.

Proof. For k = 1, Bk is the complete graph on n nodes with a directed edge from node i to node j iff i < j. Therefore, χ(B1) = n. For k ≥ 2, the claim follows by induction and Lemmas 8.8 and 8.9.

This finally allows us to state a lower bound on the number of rounds needed to color a directed ring with 3 colors.

Theorem 8.11. Every deterministic, distributed algorithm to color a directed ring with 3 or fewer colors needs at least (log* n)/2 − 1 rounds.

Proof. Using the connection between Bk and the neighborhood graph for directed rings, it suffices to show that χ(B2r+1) > 3 for all r < (log* n)/2 − 1. From Lemma 8.10, we know that χ(B2r+1) ≥ log^(2r) n. To obtain log^(2r) n ≤ 2, we need r ≥ (log* n)/2 − 1. Because log₂ 3 < 2, we therefore have log^(2r) n > 3 if r < (log* n)/2 − 1.

Corollary 8.12. Every deterministic, distributed algorithm to compute an MIS of a directed ring needs at least log* n/2 − O(1) rounds.

Remarks:

• It is straightforward to see that also for a constant c > 3, the number of rounds needed to color a ring with c or fewer colors is log* n/2 − O(1).

• There basically (up to additive constants) is a gap of a factor of 2 between the log* n + O(1) upper bound of Chapter 1 and the log* n/2 − O(1) lower bound of this chapter. It is possible to show that the lower bound is tight, even for undirected rings (for directed rings, this will be part of the exercises).

• Alternatively, the lower bound can also be presented as an application of Ramsey's theory. Ramsey's theory is best introduced with an example: Assume you host a party, and you want to invite people such that there are no three people who mutually know each other, and no three people who are mutual strangers. How many people can you invite? This is an example of Ramsey's theorem, which says that for any given integer c, and any given integers n1, . . . , nc, there is a Ramsey number R(n1, . . . , nc), such that if the edges of a complete graph with R(n1, . . . , nc) nodes are colored with c different colors, then for some color i the graph contains some complete subgraph of color i of size ni. The special case in the party example is looking for R(3, 3).

• Ramsey theory is more general, as it deals with hyperedges. A normal edge is essentially a subset of two nodes; a hyperedge is a subset of k nodes. The party example can be explained in this context: We have (hyper)edges of the form {i, j}, with 1 ≤ i, j ≤ n. Choosing n sufficiently large, coloring the edges with two colors must exhibit a set S of 3 edges {i, j} ⊂ {v1, v2, v3}, such that all edges in S have the same color. To prove our coloring lower bound using Ramsey theory, we form all hyperedges of size k = 2r + 1, and color them with 3 colors. Choosing n sufficiently large, there must be a set S = {v1, . . . , vk+1}
of k + 1 identifiers, such that all k + 1 hyperedges consisting of k nodes from S have the same color. Note that both {v1, . . . , vk} and {v2, . . . , vk+1} are in the set S, hence there will be two neighboring views with the same color. Ramsey theory shows that in this case n will grow as a power tower (tetration) in k. Thus, if n is so large that k is smaller than some function growing like log* n, the coloring algorithm cannot be correct.

• The neighborhood graph concept can be used more generally to study distributed graph coloring. It can for instance be used to show that with a single round (every node sends its identifier to all neighbors) it is possible to color a graph with (1 + o(1))∆² ln n colors, and that every one-round algorithm needs at least Ω(∆²/log² ∆ + log log n) colors.

• One may also extend the proof to other problems; for instance, one may show that a constant approximation of the minimum dominating set problem on unit disk graphs costs at least log-star time.

• Using r-hop views and the fact that nodes with equal r-hop views have to make the same decisions is the basic principle behind almost all locality lower bounds (in fact, we are not aware of a locality lower bound that does not use this principle). Using this basic technique (but a completely different proof otherwise), it is for instance possible to show that computing an MIS (and many other problems) in a general graph requires at least Ω(√log n) and Ω(log ∆) rounds.

Chapter Notes

The lower bound proof in this chapter is by Linial [Lin92], proving asymptotic optimality of the technique of Chapter 1. This proof can also be found in Chapter 7.5 of [Pel00]. An alternative proof that omits the neighborhood graph construction is presented in [LS14]. The lower bound is also true for randomized algorithms [Nao91]. Recently, this lower bound technique was adapted to other problems [CHW08, LW08]. In some sense, Linial's seminal work raised the question of what can be computed in O(1) time [NS93], essentially starting distributed complexity theory.

More recently, using a different argument, Kuhn et al. [KMW04] managed to show more substantial lower bounds for a number of combinatorial problems including minimum vertex cover (MVC), minimum dominating set (MDS), maximal matching, or maximal independent set (MIS). More concretely, Kuhn et al. showed that all these problems need polylogarithmic time (for a polylogarithmic approximation, in case of approximation problems such as MVC and MDS). For recent surveys regarding locality lower bounds we refer to e.g. [KMW10, Suo12].

Ramsey theory was started by Frank P. Ramsey with his 1930 article called

Bibliography

[CHW08] A. Czygrinow, M. Hańćkowiak, and W. Wawrzyniak. Fast Distributed Approximations in Planar Graphs. In Proceedings of the 22nd International Symposium on Distributed Computing (DISC), 2008.

[KMW04] F. Kuhn, T. Moscibroda, and R. Wattenhofer. What Cannot Be Computed Locally! In Proceedings of the 23rd ACM Symposium on Principles of Distributed Computing (PODC), July 2004.

[KMW10] Fabian Kuhn, Thomas Moscibroda, and Roger Wattenhofer. Local Computation: Lower and Upper Bounds. CoRR, abs/1011.5470, 2010.

[Lin92] N. Linial. Locality in Distributed Graph Algorithms. SIAM Journal on Computing, 21(1):193–201, February 1992.

[LR03] Bruce M. Landman and Aaron Robertson. Ramsey Theory on the Integers. American Mathematical Society, 2003.

[LS14] Juhana Laurinharju and Jukka Suomela. Brief Announcement: Linial's Lower Bound Made Easy. In Proceedings of the 2014 ACM Symposium on Principles of Distributed Computing, PODC '14, pages 377–378, New York, NY, USA, 2014. ACM.

[LW08] Christoph Lenzen and Roger Wattenhofer. Leveraging Linial's Locality Limit. In 22nd International Symposium on Distributed Computing (DISC), Arcachon, France, September 2008.

[Nao91] Moni Naor. A Lower Bound on Probabilistic Algorithms for Distributive Ring Coloring. SIAM J. Discrete Math., 4(3):409–412, 1991.

[NR90] Jaroslav Nesetril and Vojtech Rodl, editors. Mathematics of Ramsey Theory. Springer Berlin Heidelberg, 1990.

[NS93] Moni Naor and Larry Stockmeyer. What can be Computed Locally? In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, STOC '93, pages 184–193, New York, NY, USA, 1993. ACM.

[Pel00] David Peleg. Distributed Computing: a Locality-Sensitive Approach. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

[Ram30] F. P. Ramsey. On a Problem of Formal Logic. Proc. London Math. Soc. (3), 30:264–286, 1930.

[Suo12] Jukka Suomela. Survey of Local Algorithms. http://www.cs.helsinki.fi/local-survey/, 2012.
“On a problem of formal logic” [Ram30]. For an introduction to Ramsey theory
we refer to e.g. [NR90, LR03].
Chapter 9

Social Networks

Distributed computing is applicable in various contexts. This lecture exemplarily
studies one of these contexts, social networks, an area of study whose origins
date back a century. To give you a first impression, consider Figure 9.1.

[Figure 9.1: Zachary's karate club graph [Zachary 1977]. The figure's annotations
note that the interactions were recorded in a karate club over two years, that
during the observation an administrator/instructor conflict developed and the
club broke into two clubs, and that the split follows the administrator/instructor
minimum cut.]

Figure 9.1: This graph shows the social relations between the members of a
karate club, studied by anthropologist Wayne Zachary in the 1970s. Two people
(nodes) stand out, the instructor and the administrator of the club, both happen
to have many friends among club members. At some point, a dispute caused
the club to split into two. Can you predict how the club partitioned? (If not,
just search the Internet for Zachary and Karate.)

9.1 Small World Networks

Back in 1929, Frigyes Karinthy published a volume of short stories that postulated
that the world was "shrinking" because human beings were connected
more and more. Some claim that he was inspired by radio network pioneer
Guglielmo Marconi's 1909 Nobel Prize speech. Despite physical distance, the
growing density of human "networks" renders the actual social distance smaller
and smaller. As a result, it is believed that any two individuals can be connected
through at most five (or so) acquaintances, i.e., within six hops.

The topic was hot in the 1960s. For instance, in 1964, Marshall McLuhan
coined the metaphor "Global Village". He wrote: "As electrically contracted,
the globe is no more than a village". He argued that due to the almost instantaneous
reaction times of new ("electric") technologies, each individual inevitably
feels the consequences of his actions and thus automatically deeply participates
in the global society. McLuhan understood what we now can directly observe –
real and virtual world are moving together. He realized that the transmission
medium, rather than the transmitted information, is at the core of change, as
expressed by his famous phrase "the medium is the message".

This idea was followed ardently in the 1960s by several sociologists, first by
Michael Gurevich, later by Stanley Milgram. Milgram wanted to know the
average path length between two "random" humans, by using various experiments,
generally using randomly chosen individuals from the US Midwest
as starting points, and a stockbroker living in a suburb of Boston as target.
The starting points were given name, address, occupation, plus some personal
information about the target. They were asked to send a letter to the target.
However, they were not allowed to send the letter directly; rather, they had to
pass it to somebody they knew on a first-name basis and whom they thought to
have a higher probability to know the target person. This process was repeated
until somebody knew the target person and could deliver the letter. Shortly
after starting the experiment, the first letters were received. Most letters were
lost during the process, but if they arrived, the average path length was about 5.5.
The observation that the entire population is connected by short acquaintance
chains was later popularized by the terms "six degrees of separation" and "small
world".

Statisticians tried to explain Milgram's experiments, essentially by giving
network models that allow for short diameters, i.e., each node is connected
to each other node by only a few hops. Until today there is a thriving research
community in statistical physics that tries to understand network properties
that allow for "small world" effects.

The world is often fascinated by graphs with a small radius. For example,
movie fanatics study the who-acted-with-whom-in-the-same-movie graph. For
this graph it has long been believed that the actor Kevin Bacon has a particularly
small radius. The number of hops from Bacon even got a name, the
Bacon Number. In the meantime, however, it has been shown that there are
"better" centers in the Hollywood universe, such as Sean Connery, Christopher
Lee, Rod Steiger, Gene Hackman, or Michael Caine. The center of other social
networks has also been explored; Paul Erdös for instance is well known in the
math community.
One of the keywords in this area is power-law graphs, networks where node
degrees are distributed according to a power-law distribution, i.e., the number
of nodes with degree δ is proportional to δ^(−α), for some α > 1. Such power-law
graphs have been witnessed in many application areas, apart from social
networks also in the web, or in biology or physics.

Obviously, two power-law graphs might look and behave completely differently,
even if α and the number of edges are exactly the same.

One well-known model towards this end is the Watts-Strogatz model. Watts
and Strogatz argued that social networks should be modeled by a combination of
two networks: As the basis we take a network that has a large cluster coefficient
...

Definition 9.1. The cluster coefficient of a network is defined by the probability
that two friends of a node are likely to be friends as well, averaged over all the
nodes.

. . . , then we augment such a graph with random links: every node for instance
points to a constant number of other nodes, chosen uniformly at random.
This augmentation represents acquaintances that connect nodes to parts of the
network that would otherwise be far away.

Remarks:

• Without further information, knowing the cluster coefficient is of questionable
  value: Assume we arrange the nodes in a grid. Technically,
  if we connect each node to its four closest neighbors, the graph has
  cluster coefficient 0, since there are no triangles; if we instead connect
  each node with its eight closest neighbors, the cluster coefficient is 3/7.
  The cluster coefficient is quite different, even though both networks
  have similar characteristics.

This is interesting, but not enough to really understand what is going on. For
Milgram's experiments to work, it is not sufficient to connect the nodes in a
certain way. In addition, the nodes themselves need to know how to forward
a message to one of their neighbors, even though they cannot know whether
that neighbor is really closer to the target. In other words, nodes are not just
following physical laws, but they make decisions themselves.

Let us consider an artificial network with nodes on a grid topology, plus some
additional random links per node. In a quantitative study it was shown that the
random links need a specific distance distribution to allow for efficient greedy
routing. This distribution marks the sweet spot for any navigable network.

Definition 9.2 (Augmented Grid). We take n = m² nodes (i, j) ∈ V =
{1, . . . , m}² that are identified with the lattice points on an m × m grid. We
define the distance between two nodes (i, j) and (k, ℓ) as d((i, j), (k, ℓ)) =
|k − i| + |ℓ − j|, i.e., their distance on the m × m lattice. The network
is modeled using a parameter α ≥ 0. Each node u has a directed edge to every
lattice neighbor. These are the local contacts of a node. In addition, each
node also has an additional random link (the long-range contact). For all u
and v, the long-range contact of u points to node v with probability proportional
to d(u, v)^(−α), i.e., with probability d(u, v)^(−α) / Σ_{w∈V\{u}} d(u, w)^(−α).
Figure 9.2 illustrates the model.

Figure 9.2: Augmented grid with m = 6

Remarks:

• The network model has the following geographic interpretation: nodes
  (individuals) live on a grid and know their neighbors on the grid.
  Further, each node has some additional acquaintances throughout the
  network.

• The parameter α controls how the additional neighbors are distributed
  across the grid. If α = 0, long-range contacts are chosen uniformly at
  random (as in the Watts-Strogatz model). As α increases, long-range
  contacts become shorter on average. In the extreme case, if α → ∞,
  all long-range contacts are to immediate neighbors on the grid.

• It can be shown that as long as α ≤ 2, the diameter of the resulting
  graph is polylogarithmic in n (polynomial in log n) with high probability.
  In particular, if the long-range contacts are chosen uniformly
  at random (α = 0), the diameter is O(log n).

Since the augmented grid contains random links, we do not know anything
for sure about how the random links are distributed. In theory, all links could
point to the same node! However, this is almost certainly not the case. Formally
this is captured by the term with high probability.
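Before making this formal, note that the model of Definition 9.2 is straightforward
to construct explicitly. The sketch below (function name, grid size, and seed are
our own choices, not part of the definition) draws each long-range contact with
probability proportional to d(u, v)^(−α); the O(n²) construction is fine for small m:

```python
import random

def augmented_grid(m, alpha, seed=0):
    # Build the augmented grid of Definition 9.2: every node gets directed
    # edges to its lattice neighbors (local contacts) plus one long-range
    # contact, drawn with probability proportional to d(u, v)^(-alpha).
    rng = random.Random(seed)
    nodes = [(i, j) for i in range(1, m + 1) for j in range(1, m + 1)]

    def d(u, v):  # lattice (Manhattan) distance
        return abs(u[0] - v[0]) + abs(u[1] - v[1])

    local, longrange = {}, {}
    for u in nodes:
        local[u] = [v for v in nodes if d(u, v) == 1]
        targets = [v for v in nodes if v != u]
        weights = [d(u, v) ** (-alpha) for v in targets]
        longrange[u] = rng.choices(targets, weights=weights)[0]
    return local, longrange
```

For α = 0 the long-range contact is uniform over all other nodes; larger α biases
it toward nearby nodes, exactly as described in the remarks above.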
Definition 9.3 (With High Probability). Some probabilistic event is said to
occur with high probability (w.h.p.), if it happens with a probability p ≥ 1 −
1/n^c, where c is a constant. The constant c may be chosen arbitrarily, but it is
considered constant with respect to Big-O notation.

Remarks:

• For instance, a running time bound of c log n or e^c log n + 5000c with
  probability at least 1 − 1/n^c would be O(log n) w.h.p., but a running
  time of n^c would not be O(n) w.h.p. since c might also be 50.

• This definition is very powerful, as any polynomial (in n) number
  of statements that hold w.h.p. also holds w.h.p. at the same time,
  regardless of any dependencies between random variables!

Theorem 9.4. The diameter of the augmented grid with α = 0 is O(log n) with
high probability.

Proof Sketch. For simplicity, we will only show that we can reach a target node
t starting from some source node s. However, it can be shown that (essentially)
each of the intermediate claims holds with high probability, which then by means
of the union bound yields that all of the claims hold simultaneously with high
probability for all pairs of nodes (see exercises).

Let Ns be the ⌈log n⌉-hop neighborhood of source s on the grid, containing
Ω(log² n) nodes. Each of the nodes in Ns has a random link, probably leading
to distant parts of the graph. As long as we have reached only o(n) nodes, any
new random link will with probability 1 − o(1) lead to a node for which none of
its grid neighbors has been visited yet. Thus, in expectation we find almost |Ns|
new nodes whose neighbors are "fresh". Using their grid links, we will reach
(4 − o(1))|Ns| more nodes within one more hop. If bad luck strikes, it could still
happen that many of these links lead to a few nodes, already visited nodes, or
nodes that are very close to each other. But that is very unlikely, as we have
lots of random choices! Indeed, it can be shown that not only in expectation,
but with high probability (5 − o(1))|Ns| many nodes are reached this way (see
exercises).

Because all the new nodes have (so far unused) random links, we can repeat
this reasoning inductively, implying that the number of nodes grows by (at least)
a constant factor for every two hops. Thus, after O(log n) hops, we will have
reached n/log n nodes (which is still small compared to n). Finally, consider the
expected number of links from these nodes that enter the (log n)-neighborhood
of some target node t with respect to the grid. Since this neighborhood consists
of Ω(log² n) nodes, in expectation Ω(log n) links come close enough to target
t. This is large enough to almost guarantee that this happens (see exercises).
Summing everything up, we still used merely O(log n) hops in total to get from
s to t.

This shows that for α = 0 (and in fact for all α ≤ 2), the resulting network
has a small diameter. Recall however that we also wanted the network to be
navigable. For this, we consider a simple greedy routing strategy (Algorithm 38).

Lemma 9.5. In the augmented grid, Algorithm 38 finds a routing path of length
at most 2(m − 1) ∈ O(√n).

Algorithm 38 Greedy Routing
1: while not at destination do
2:    go to a neighbor which is closest to the destination (considering grid distance only)
3: end while

Proof. Because of the grid, there is always a neighbor which is closer to the
destination. Since with each hop we reduce the distance to the target by at least
one in one of the two grid dimensions, we will reach the destination within
2(m − 1) steps.

This is not really what Milgram's experiment promises. We want to know
how much the additional random links speed up the process. To this end, we
first need to understand how likely it is that the random link of node u points
to node v, in terms of their grid distance d(u, v), the number of nodes n, and
the constant parameter α.

Lemma 9.6. Node u's random link points to a node v with probability

• Θ(1/(d(u, v)^α m^(2−α))) if α < 2,

• Θ(1/(d(u, v)² log n)) if α = 2,

• Θ(1/d(u, v)^α) if α > 2.

Moreover, if α > 2, the probability to see a link of length at least d is in
Θ(1/d^(α−2)).

Proof. For a constant α ≠ 2, we have that

   Σ_{w∈V\{u}} 1/d(u, w)^α  ∈  Σ_{r=1}^{m} Θ(r)/r^α  =  Θ( ∫_1^m 1/r^(α−1) dr )  =  Θ( [r^(2−α)/(2−α)]_1^m ).

If α < 2, this gives Θ(m^(2−α)); if α > 2, it is in Θ(1). If α = 2, we get

   Σ_{w∈V\{u}} 1/d(u, w)^α  ∈  Σ_{r=1}^{m} Θ(r)/r²  =  Θ(1) · Σ_{r=1}^{m} 1/r  =  Θ(log m) = Θ(log n).

Multiplying with d(u, v)^α yields the first three bounds. For the last statement,
compute

   Σ_{v∈V, d(u,v)≥d} Θ(1/d(u, v)^α)  =  Θ( ∫_d^m r/r^α dr )  =  Θ( [r^(2−α)/(2−α)]_d^m )  =  Θ(1/d^(α−2)).
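The three regimes in the proof of Lemma 9.6 can be checked numerically. The
sketch below (grid sizes and the choice of a corner node are arbitrary) evaluates
the normalization sum Σ_{w≠u} d(u, w)^(−α) and compares it against the predicted
growth rates Θ(m^(2−α)), Θ(log n), and Θ(1):

```python
import math

def norm_const(alpha, m):
    # Normalization sum  sum_{w != u} d(u, w)^(-alpha)  of Definition 9.2,
    # evaluated for the corner node u = (1, 1) of the m x m grid.  The
    # number of nodes at lattice distance r from u is Theta(r), which is
    # what drives the three regimes of Lemma 9.6.
    s = 0.0
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            if (i, j) != (1, 1):
                s += (abs(i - 1) + abs(j - 1)) ** (-alpha)
    return s

if __name__ == "__main__":
    for m in (32, 64, 128):
        n = m * m
        print(m,
              round(norm_const(1.0, m) / m, 3),            # alpha < 2: Theta(m^(2-alpha))
              round(norm_const(2.0, m) / math.log(n), 3),  # alpha = 2: Theta(log n)
              round(norm_const(3.0, m), 3))                # alpha > 2: Theta(1)
```

As m doubles, each printed ratio stays roughly constant, matching the integral
estimates in the proof.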
Remarks:

• If α > 2, according to the lemma, the probability to see a random link
  of length at least d = m^(1/(α−1)) is Θ(1/d^(α−2)) = Θ(1/m^((α−2)/(α−1))).
  In expectation we have to take Θ(m^((α−2)/(α−1))) hops until we see a
  random link of length at least d. When just following links of length
  less than d, it takes more than m/d = m/m^(1/(α−1)) = m^((α−2)/(α−1))
  hops. In other words, in expectation, either way we need at least
  m^((α−2)/(α−1)) = m^(Ω(1)) hops to the destination.

• If α < 2, there is a (slightly more complicated) argument. First we
  draw a border around the nodes in distance m^((2−α)/3) to the target.
  Within this border there are about m^(2(2−α)/3) many nodes in the target
  area. Assume that the source is outside the target area. Starting
  at the source, the probability to find a random link that leads
  directly inside the target area is according to the lemma at most
  m^(2(2−α)/3) · Θ(1/m^(2−α)) = Θ(1/m^((2−α)/3)). In other words, until we
  find a random link that leads into the target area, in expectation,
  we have to do Θ(m^((2−α)/3)) hops. This is too slow, and our greedy
  strategy is probably faster, as thanks to having α < 2 there are many
  long-range links. However, it means that we will probably enter the
  border of the target area on a regular grid link. Once inside the target
  area, again the probability of short-cutting our trip by a random
  long-range link is Θ(1/m^((2−α)/3)), so we probably just follow grid links,
  m^((2−α)/3) = m^(Ω(1)) many of them.

• In summary, if α ≠ 2, our greedy routing algorithm takes m^(Ω(1)) =
  n^(Ω(1)) expected hops to reach the destination. This is polynomial in
  the number of nodes n, and the social network can hardly be called a
  "small world".

• Maybe we can get a polylogarithmic bound on n if we set α = 2?

Definition 9.7 (Phase). Consider routing from source s to target t and assume
that we are at some intermediate node w. We say that we are in phase j at node
w if the lattice distance d(w, t) to the target node t satisfies 2^j < d(w, t) ≤ 2^(j+1).

Remarks:

• Enumerating the phases in decreasing order is useful, as notation becomes
  less cumbersome.

• There are ⌈log m⌉ ∈ O(log n) phases.

Lemma 9.8. Assume that we are in phase j at node w when routing from s
to t. The probability for getting (at least) to phase j − 1 in one step is at least
Ω(1/log n).

Proof. Let Bj be the set of nodes x with d(x, t) ≤ 2^j. We get from phase j to
(at least) phase j − 1 if the long-range contact of node w points to some node
in Bj. Note that we always make progress while following the greedy routing
path. Therefore, we have not seen node w before and the long-range contact of
w points to a random node that is independent of anything seen on the path
from s to w.

For all nodes x ∈ Bj, we have d(w, x) ≤ d(w, t) + d(x, t) ≤ 2^(j+1) + 2^j < 2^(j+2).
Hence, for each node x ∈ Bj, the probability that the long-range contact of w
points to x is Ω(1/(2^(2j+4) log n)). Further, the number of nodes in Bj is at least
(2^j)²/2 = 2^(2j−1). Hence, the probability that some node in Bj is the long-range
contact of w is at least

   Ω( |Bj| · 1/(2^(2j+4) log n) )  =  Ω( 2^(2j−1)/(2^(2j+4) log n) )  =  Ω(1/log n).

Theorem 9.9. Consider the greedy routing path from a node s to a node t on
an augmented grid with parameter α = 2. The expected length of the path is
O(log² n).

Proof. We already observed that the total number of phases is O(log n) (the
distance to the target is halved when we go from phase j to phase j − 1). At
each point during the routing process, the probability of proceeding to the next
phase is at least Ω(1/log n). Let Xj be the number of steps in phase j. Because
the probability for ending the phase is Ω(1/log n) in each step, in expectation
we need O(log n) steps to proceed to the next phase, i.e., E[Xj] ∈ O(log n). Let
X = Σ_j Xj be the total number of steps of the routing process. By linearity of
expectation, we have

   E[X] = Σ_j E[Xj] ∈ O(log² n).

Remarks:

• One can show that the O(log² n) result also holds w.h.p.

• In real world social networks, the parameter α was evaluated experimentally.
  The assumption is that you are connected to the geographically
  closest nodes, and then have some random long-range contacts.
  For Facebook grandpa LiveJournal it was shown that α is not really
  2, but rather around 1.25.

9.2 Propagation Studies

In networks, nodes may influence each other's behavior and decisions. There are
many applications where nodes influence their neighbors, e.g., they may impact
their opinions, or they may bias what products they buy, or they may pass on
a disease.

On a beach (modeled as a line segment), it is best to place an ice cream
stand right in the middle of the segment, because you will be able to "control"
the beach most easily. What about the second stand, where should it settle?
The answer generally depends on the model, but assuming that people will buy
ice cream from the stand that is closer, it should go right next to the first stand.

Rumors can spread surprisingly fast through social networks. Traditionally
this happens by word of mouth, but with the emergence of the Internet
and its possibilities new ways of rumor propagation are available. People write
email, use instant messengers or publish their thoughts in a blog. Many factors
influence the dissemination of rumors. It is especially important where in a network
a rumor is initiated and how convincing it is. Furthermore, the underlying
network structure decides how fast the information can spread and how many
people are reached. More generally, we can speak of diffusion of information in
networks. The analysis of these diffusion processes can be useful for viral marketing,
e.g., to target a few influential people to initiate marketing campaigns.
A company may wish to distribute the rumor of a new product via the most
influential individuals in popular social networks such as Facebook. A second
company might want to introduce a competing product and hence has to select
where to seed the information to be disseminated. Rumor spreading is quite
similar to our ice cream stand problem.

More formally, we may study propagation problems in graphs: given a graph
and two players, let the first player choose a seed node u1; afterwards
let the second player choose a seed node u2, with u2 ≠ u1. The goal of the game
is to maximize the number of nodes that are closer to one's own seed node.
In many graphs it is an advantage to choose first. In a star graph for instance
the first player can choose the center node of the star, controlling all but one
node. In some other graphs, the second player can at least score even. But is
there a graph where the second player has an advantage?

Theorem 9.10. In a two player rumor game where both players select one node
to initiate their rumor in the graph, the first player does not always win.

Proof. See Figure 9.3 for an example where the second player will always win,
regardless of the decision of the first player. If the first player chooses the node x0
in the center, the second player can select x1. Choice x1 will be outwitted by x2,
and x2 itself can be answered by z1. All other strategies are either symmetric,
or even less promising for the first player.

[Figure 9.3: the counterexample graph on the nodes x0, x1, x2, y1, y2, z1, z2.]

Figure 9.3: Counter example.

Chapter Notes

A simple form of a social network is the famous stable marriage problem [DS62]
in which a stable matching bipartite graph has to be found. There exist a great
many variations which are based on this initial problem, e.g., [KC82, KMV94,
EO06, FKPS10, Hoe11]. Social networks like Facebook, Twitter and others have
grown very fast in the last years and hence spurred interest to research them.
How users influence other users has been studied both from a theoretical point
of view [KKT03] and in practice [CHBG10]. The structure of these networks
can be measured and studied [MMG+07]. More than half of the users in social
networks share more information than they expect to [LGKM11].

The small world phenomenon that we presented in this chapter is analyzed
by Kleinberg [Kle00]. A general overview is in [DJ10].

This chapter has been written in collaboration with Michael Kuhn.

Bibliography

[CHBG10] Meeyoung Cha, Hamed Haddadi, Fabrício Benevenuto, and P. Krishna
         Gummadi. Measuring User Influence in Twitter: The Million
         Follower Fallacy. In ICWSM, 2010.

[DJ10]   David Easley and Jon Kleinberg. Networks, Crowds, and Markets:
         Reasoning About a Highly Connected World. Cambridge University
         Press, New York, NY, USA, 2010.

[DS62]   D. Gale and L.S. Shapley. College Admission and the Stability of
         Marriage. American Mathematical Monthly, 69(1):9–15, 1962.

[EO06]   Federico Echenique and Jorge Oviedo. A theory of stability in
         many-to-many matching markets. Theoretical Economics, 1(2):233–273,
         2006.

[FKPS10] Patrik Floréen, Petteri Kaski, Valentin Polishchuk, and Jukka
         Suomela. Almost Stable Matchings by Truncating the Gale-Shapley
         Algorithm. Algorithmica, 58(1):102–118, 2010.

[Hoe11]  Martin Hoefer. Local Matching Dynamics in Social Networks.
         Automata, Languages and Programming, pages 113–124, 2011.

[Kar29]  Frigyes Karinthy. Chain-Links, 1929.

[KC82]   Alexander S. Kelso and Vincent P. Crawford. Job Matching, Coalition
         Formation, and Gross Substitutes. Econometrica, 50(6):1483–1504, 1982.

[KKT03]  David Kempe, Jon M. Kleinberg, and Éva Tardos. Maximizing the
         spread of influence through a social network. In KDD, 2003.

[Kle00]  Jon M. Kleinberg. The small-world phenomenon: an algorithm
         perspective. In STOC, 2000.

[KMV94]  Samir Khuller, Stephen G. Mitchell, and Vijay V. Vazirani. On-line
         algorithms for weighted bipartite matching and stable marriages.
         Theoretical Computer Science, 127:255–267, May 1994.
[LGKM11] Yabing Liu, Krishna P. Gummadi, Balachander Krishnamurthy,
         and Alan Mislove. Analyzing Facebook privacy settings: User
         expectations vs. reality. In Proceedings of the 11th ACM/USENIX
         Internet Measurement Conference (IMC'11), Berlin, Germany,
         November 2011.

[McL64]  Marshall McLuhan. Understanding media: The extensions of man.
         McGraw-Hill, New York, 1964.

[Mil67]  Stanley Milgram. The Small World Problem. Psychology Today,
         2:60–67, 1967.

[MMG+07] Alan Mislove, Massimiliano Marcon, P. Krishna Gummadi, Peter
         Druschel, and Bobby Bhattacharjee. Measurement and analysis of
         online social networks. In Internet Measurement Conference, 2007.

[WS98]   Duncan J. Watts and Steven H. Strogatz. Collective dynamics of
         "small-world" networks. Nature, 393(6684):440–442, Jun 1998.

[Zac77]  W. W. Zachary. An information flow model for conflict and fission in
         small groups. Journal of Anthropological Research, 33(4):452–473, 1977.
Chapter 10

Synchronization

So far, we have mainly studied synchronous algorithms. Generally, asynchronous
algorithms are more difficult to obtain. Also it is substantially harder
to reason about asynchronous algorithms than about synchronous ones. For instance,
computing a BFS tree (Chapter 2) efficiently requires much more work in
an asynchronous system. However, many real systems are not synchronous, and
we therefore have to design asynchronous algorithms. In this chapter, we will
look at general simulation techniques, called synchronizers, that allow running
synchronous algorithms in asynchronous environments.

10.1 Basics

A synchronizer generates sequences of clock pulses at each node of the network
satisfying the condition given by the following definition.

Definition 10.1 (valid clock pulse). We call a clock pulse generated at a node
v valid if it is generated after v received all the messages of the synchronous
algorithm sent to v by its neighbors in the previous pulses.

Given a mechanism that generates the clock pulses, a synchronous algorithm
is turned into an asynchronous algorithm in an obvious way: As soon as the ith
clock pulse is generated at node v, v performs all the actions (local computations
and sending of messages) of round i of the synchronous algorithm.

Theorem 10.2. If all generated clock pulses are valid according to Definition
10.1, the above method provides an asynchronous algorithm that behaves exactly
the same way as the given synchronous algorithm.

Proof. When the ith pulse is generated at a node v, v has sent and received
exactly the same messages and performed the same local computations as in
the first i − 1 rounds of the synchronous algorithm.

The main problem when generating the clock pulses at a node v is that v cannot
know what messages its neighbors are sending to it in a given synchronous
round. Because there are no bounds on link delays, v cannot simply wait "long
enough" before generating the next pulse. In order to satisfy Definition 10.1, nodes
have to send additional messages for the purpose of synchronization. The total
complexity of the resulting asynchronous algorithm depends on the overhead
introduced by the synchronizer. For a synchronizer S, let T(S) and M(S) be
the time and message complexities of S for each generated clock pulse. As we
will see, some of the synchronizers need an initialization phase. We denote the
time and message complexities of the initialization by Tinit(S) and Minit(S),
respectively. If T(A) and M(A) are the time and message complexities of the
given synchronous algorithm A, the total time and message complexities Ttot
and Mtot of the resulting asynchronous algorithm then become

   Ttot = Tinit(S) + T(A) · (1 + T(S))   and   Mtot = Minit(S) + M(A) + T(A) · M(S),

respectively.

Remarks:

• Because the initialization only needs to be done once for each network,
  we will mostly be interested in the overheads T(S) and M(S) per
  round of the synchronous algorithm.

Definition 10.3 (Safe Node). A node v is safe with respect to a certain clock
pulse if all messages of the synchronous algorithm sent by v in that pulse have
already arrived at their destinations.

Lemma 10.4. If all neighbors of a node v are safe with respect to the current
clock pulse of v, the next pulse can be generated for v.

Proof. If all neighbors of v are safe with respect to a certain pulse, v has received
all messages of the given pulse. Node v therefore satisfies the condition of
Definition 10.1 for generating a valid next pulse.

Remarks:

• In order to detect safety, we require that all algorithms send acknowledgements
  for all received messages. As soon as a node v has received
  an acknowledgement for each message that it has sent in a certain
  pulse, it knows that it is safe with respect to that pulse. Note that
  sending acknowledgements does not increase the asymptotic time and
  message complexities.

10.2 The Local Synchronizer α

Algorithm 39 Synchronizer α (at node v)
1: wait until v is safe
2: send SAFE to all neighbors
3: wait until v receives SAFE messages from all neighbors
4: start new pulse

Synchronizer α is very simple. It does not need an initialization. Using
acknowledgements, each node eventually detects that it is safe. It then reports
this fact directly to all its neighbors. Whenever a node learns that all its neighbors
are safe, a new pulse is generated. Algorithm 39 formally describes the
synchronizer α.
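The behavior of synchronizer α can be illustrated with a small event-driven
simulation. The sketch below simulates the simplified variant from the remarks
(send the round-i message to all neighbors, wait for round-i messages from all
neighbors); the graph, delays, and function names are our own illustrative choices.
Despite arbitrary link delays, every pulse is valid in the sense of Definition 10.1
by construction of the wait condition:

```python
import heapq
import random

def simulate_alpha(adj, num_rounds, seed=1):
    # Event-driven simulation of the simplified synchronizer alpha: every
    # node sends its round-i message to all neighbors and generates the
    # next clock pulse once it holds a round-i message from each neighbor.
    # Link delays are random, so messages can arrive far out of step.
    rng = random.Random(seed)
    n = len(adj)
    rnd = [0] * n                        # current pulse number of each node
    latest = [dict() for _ in range(n)]  # latest[v][u]: highest round received from u
    events = []                          # priority queue of (time, sender, receiver, round)
    sent = 0

    def broadcast(v, now):
        nonlocal sent
        for u in adj[v]:
            heapq.heappush(events, (now + rng.uniform(0.1, 5.0), v, u, rnd[v]))
            sent += 1

    for v in range(n):                   # every node starts round 0 at time 0
        broadcast(v, 0.0)
    while events:
        now, u, v, r = heapq.heappop(events)
        latest[v][u] = max(latest[v].get(u, -1), r)
        # Generate pulses as long as all round-rnd[v] messages have arrived.
        while rnd[v] < num_rounds and all(
                latest[v].get(w, -1) >= rnd[v] for w in adj[v]):
            rnd[v] += 1
            if rnd[v] < num_rounds:
                broadcast(v, now)
    return rnd, sent

if __name__ == "__main__":
    # 6-node graph with 7 edges: a 6-cycle plus the chord (1, 4)
    adj = [[1, 5], [0, 2, 4], [1, 3], [2, 4], [1, 3, 5], [0, 4]]
    rounds, sent = simulate_alpha(adj, num_rounds=10)
    print(rounds, sent)
```

Every node completes all rounds, and the message count is exactly 2m per
round (one message per edge direction), matching the O(m) bound of the
framework version.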
Theorem 10.5. The time and message complexities of synchronizer α per synchronous round are

T(α) = O(1) and M(α) = O(m).

Proof. Communication is only between neighbors. As soon as all neighbors of a node v become safe, v knows of this fact after one additional time unit. For every clock pulse, synchronizer α sends at most four additional messages over every edge: each of the two nodes may have to acknowledge a message and report safety. □

Remarks:

• Synchronizer α was presented in a framework, mostly set up to have a common standard to discuss different synchronizers. Without the framework, synchronizer α can be explained more easily:

1. Send a message to all neighbors, including the round information i and the actual data of round i (if any).
2. Wait for the messages of round i from all neighbors, and go to the next round.

• Although synchronizer α allows for simple and fast synchronization, it produces awfully many messages. Can we do better? Yes.

10.3 The Global Synchronizer β

Algorithm 40 Synchronizer β (at node v)
1: wait until v is safe
2: wait until v receives SAFE messages from all its children in T
3: if v ≠ ℓ then
4: send SAFE message to parent in T
5: wait until PULSE message received from parent in T
6: end if
7: send PULSE message to children in T
8: start new pulse

Synchronizer β needs an initialization that computes a leader node ℓ and a spanning tree T rooted at ℓ. As soon as all nodes are safe, this information is propagated to ℓ by a convergecast. The leader then broadcasts this information to all nodes. The details of synchronizer β are given in Algorithm 40.

Theorem 10.6. The time and message complexities of synchronizer β per synchronous round are

T(β) = O(diameter(T)) ≤ O(n) and M(β) = O(n).

The time and message complexities for the initialization are

T_init(β) = O(n) and M_init(β) = O(m + n log n).

Proof. Because the diameter of T is at most n − 1, the convergecast and the broadcast together take at most 2n − 2 time units. Per clock pulse, the synchronizer sends at most 2n − 2 synchronization messages (one in each direction over each edge of T).

With the improved variant of the GHS algorithm (Algorithm 11) mentioned in Chapter 2, it is possible to construct an MST in time O(n) with O(m + n log n) messages in an asynchronous environment. Once the tree is computed, the tree can be made rooted in time O(n) with O(n) messages. □

Remarks:

• Now that we have a time-efficient synchronizer (α) and a message-efficient synchronizer (β), it is only natural to ask whether we can have the best of both worlds. And, indeed, we can. How is that synchronizer called? Quite obviously: γ.

10.4 The Hybrid Synchronizer γ

Figure 10.1: A cluster partition of a network: the dashed cycles specify the clusters, cluster leaders are black, the solid edges are the edges of the intracluster trees, and the bold solid edges are the intercluster edges.

Synchronizer γ can be seen as a combination of synchronizers α and β. In the initialization phase, the network is partitioned into clusters of small diameter. In each cluster, a leader node is chosen and a BFS tree rooted at this leader node is computed. These trees are called the intracluster trees. Two clusters C_1 and C_2 are called neighboring if there are nodes u ∈ C_1 and v ∈ C_2 for which (u, v) ∈ E. For every two neighboring clusters, an intercluster edge is chosen, which will serve for communication between these clusters. Figure 10.1 illustrates this partitioning into clusters. We will discuss the details of how to construct such a partition in the next section. We say that a cluster is safe if all its nodes are safe.
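Choosing one intercluster edge per pair of neighboring clusters is straightforward once a cluster assignment is available. The helper below is our own sketch (the function name and the dictionary representation are assumptions, not part of the notes):

```python
def intercluster_edges(adj, cluster):
    """Keep exactly one edge for every pair of neighboring clusters.

    `adj` is an adjacency list and `cluster[v]` the cluster id of node v.
    Returns a dict mapping each unordered pair of cluster ids to the one
    edge chosen for communication between those two clusters.
    """
    chosen = {}
    for u in adj:
        for v in adj[u]:
            cu, cv = cluster[u], cluster[v]
            if cu != cv:
                pair = (min(cu, cv), max(cu, cv))
                chosen.setdefault(pair, (u, v))  # first crossing edge wins
    return chosen

# On a 4-cycle split into clusters {0,1} and {2,3}, two edges cross the
# cut, but only one is kept as the intercluster edge.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(intercluster_edges(ring, {0: 0, 1: 0, 2: 1, 3: 1}))  # → {(0, 1): (0, 3)}
```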
Synchronizer γ works in two phases. In a first phase, synchronizer β is applied separately in each cluster by using the intracluster trees. Whenever the leader of a cluster learns that its cluster is safe, it reports this fact to all the nodes in the cluster as well as to the leaders of the neighboring clusters. Now, the nodes of the cluster enter the second phase, where they wait until all the neighboring clusters are known to be safe and then generate the next pulse. Hence, we essentially apply synchronizer α between clusters. A detailed description is given by Algorithm 41.

Algorithm 41 Synchronizer γ (at node v)
1: wait until v is safe
2: wait until v receives SAFE messages from all children in intracluster tree
3: if v is not cluster leader then
4: send SAFE message to parent in intracluster tree
5: wait until CLUSTERSAFE message received from parent
6: end if
7: send CLUSTERSAFE message to all children in intracluster tree
8: send NEIGHBORSAFE message over all intercluster edges of v
9: wait until v receives NEIGHBORSAFE messages from all adjacent intercluster edges and all children in intracluster tree
10: if v is not cluster leader then
11: send NEIGHBORSAFE message to parent in intracluster tree
12: wait until PULSE message received from parent
13: end if
14: send PULSE message to children in intracluster tree
15: start new pulse

Theorem 10.7. Let m_C be the number of intercluster edges and let k be the maximum cluster radius (i.e., the maximum distance of a leaf to its cluster leader). The time and message complexities of synchronizer γ are

T(γ) = O(k) and M(γ) = O(n + m_C).

Proof. We ignore acknowledgements, as they do not affect the asymptotic complexities. Let us first look at the number of messages. Over every intracluster tree edge, exactly one SAFE message, one CLUSTERSAFE message, one NEIGHBORSAFE message, and one PULSE message is sent. Further, one NEIGHBORSAFE message is sent over every intercluster edge. Because there are less than n intracluster tree edges, the total message complexity therefore is at most 4n + 2m_C = O(n + m_C).

For the time complexity, note that the depth of each intracluster tree is at most k. On each intracluster tree, two convergecasts (the SAFE and NEIGHBORSAFE messages) and two broadcasts (the CLUSTERSAFE and PULSE messages) are performed. The time complexity for this is at most 4k. One more time unit is needed to send the NEIGHBORSAFE messages over the intercluster edges. The total time complexity therefore is at most 4k + 1 = O(k). □

10.5 Network Partition

We will now look at the initialization phase of synchronizer γ. Algorithm 42 describes how to construct a partition into clusters that can be used for synchronizer γ. In Algorithm 42, B(v, r) denotes the ball of radius r around v, i.e., B(v, r) = {u ∈ V : d(u, v) ≤ r}, where d(u, v) is the hop distance between u and v. The algorithm has a parameter ρ > 1. The clusters are constructed sequentially. Each cluster is started at an arbitrary node that has not been included in a cluster. Then the cluster radius is grown as long as the cluster grows by a factor of more than ρ.

Algorithm 42 Cluster construction
1: while unprocessed nodes do
2: select an arbitrary unprocessed node v;
3: r := 0;
4: while |B(v, r + 1)| > ρ|B(v, r)| do
5: r := r + 1
6: end while
7: makeCluster(B(v, r)) // all nodes in B(v, r) are now processed
8: end while

Remarks:

• The algorithm allows a trade-off between the cluster diameter k (and thus the time complexity) and the number of intercluster edges m_C (and thus the message complexity). We will quantify the possibilities in the next section.

• Two very simple partitions would be to make a cluster out of every single node, or to make one big cluster that contains the whole graph. We then get synchronizers α and β as special cases of synchronizer γ.

Theorem 10.8. Algorithm 42 computes a partition of the network graph into clusters of radius at most log_ρ n. The number of intercluster edges is at most (ρ − 1) · n.

Proof. The radius of a cluster is initially 0 and only grows as long as the cluster grows by a factor larger than ρ. Since there are only n nodes in the graph, this can happen at most log_ρ n times.

To count the number of intercluster edges, observe that an edge can only become an intercluster edge if it connects a node at the boundary of a cluster with a node outside a cluster. Consider a cluster C of size |C|. We know that C = B(v, r) for some v ∈ V and r ≥ 0. Further, we know that |B(v, r + 1)| ≤ ρ · |B(v, r)|. The number of nodes adjacent to cluster C is therefore at most |B(v, r + 1) \ B(v, r)| ≤ ρ · |C| − |C|. Because there is only one intercluster edge connecting two clusters by definition, the number of intercluster edges adjacent to C is at most (ρ − 1) · |C|. Summing over all clusters, we get that the total number of intercluster edges is at most (ρ − 1) · n. □

Corollary 10.9. Using ρ = 2, Algorithm 42 computes a clustering with cluster radius at most log_2 n and with at most n intercluster edges.
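Algorithm 42 is easy to prototype. In the sketch below (our own illustration), each ball is taken in the subgraph induced by the still-unprocessed nodes, so that the clusters come out disjoint; the pseudocode leaves this bookkeeping implicit, and the helper names are ours.

```python
from collections import deque

def ball(adj, v, r):
    """B(v, r): all nodes at hop distance at most r from v (plain BFS)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def cluster_construction(adj, rho):
    """Algorithm 42: grow each ball while it still grows by a factor > rho."""
    remaining = set(adj)
    clusters = []
    while remaining:
        # restrict to the subgraph induced by the unprocessed nodes
        sub = {u: [w for w in adj[u] if w in remaining] for u in remaining}
        v = min(remaining)                  # "arbitrary" unprocessed node
        r = 0
        while len(ball(sub, v, r + 1)) > rho * len(ball(sub, v, r)):
            r += 1
        c = ball(sub, v, r)
        clusters.append((v, r, c))          # makeCluster(B(v, r))
        remaining -= c
    return clusters

# A 16-node cycle with rho = 2: every cluster radius is at most
# log_2(16) = 4, matching the bound of Theorem 10.8.
cycle = {i: [(i - 1) % 16, (i + 1) % 16] for i in range(16)}
for leader, radius, members in cluster_construction(cycle, 2.0):
    assert radius <= 4
```

The bound of Theorem 10.8 on the number of neighboring cluster pairs, at most (ρ − 1) · n, can be checked the same way by counting the distinct cluster pairs joined by an edge.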
Corollary 10.10. Using ρ = n^(1/k), Algorithm 42 computes a clustering with cluster radius at most k and at most O(n^(1+1/k)) intercluster edges.

Remarks:

• Algorithm 42 describes a centralized construction of the partitioning of the graph. For ρ ≥ 2, the clustering can be computed by an asynchronous distributed algorithm in time O(n) with O(m + n log n) (reasonably sized) messages (showing this will be part of the exercises).

• It can be shown that the trade-off between the cluster radius and the number of intercluster edges of Algorithm 42 is asymptotically optimal. There are graphs for which every clustering into clusters of radius at most k requires n^(1+c/k) intercluster edges for some constant c.

The above remarks lead to a complete characterization of the complexity of synchronizer γ.

Corollary 10.11. The time and message complexities of synchronizer γ per synchronous round are

T(γ) = O(k) and M(γ) = O(n^(1+1/k)).

The time and message complexities for the initialization are

T_init(γ) = O(n) and M_init(γ) = O(m + n log n).

Remarks:

• In Chapter 2, you have seen that by using flooding, there is a very simple synchronous algorithm to compute a BFS tree in time O(D) with message complexity O(m). If we use synchronizer γ to make this algorithm asynchronous, we get an algorithm with time complexity O(n + D log n) and message complexity O(m + n log n + D · n) (including initialization).

• The synchronizers α, β, and γ achieve global synchronization, i.e., every node generates every clock pulse. The disadvantage of this is that nodes that do not participate in a computation also have to participate in the synchronization. In many computations (e.g., in a BFS construction), many nodes only participate for a few synchronous rounds. In such scenarios, it is possible to achieve time and message complexity O(log^3 n) per synchronous round (without initialization).

• It can be shown that if all nodes in the network need to generate all pulses, the trade-off of synchronizer γ is asymptotically optimal.

• Partitions of networks into clusters of small diameter and coverings of networks with clusters of small diameter come in many variations and have various applications in distributed computations. In particular, apart from synchronizers, algorithms for routing, the construction of sparse spanning subgraphs, distributed data structures, and even computations of local structures such as a MIS or a dominating set are based on some kind of network partition or cover.

10.6 Clock Synchronization

"A man with one clock knows what time it is – a man with two is never sure."

Synchronizers can directly be used to give nodes in an asynchronous network a common notion of time. In wireless networks, for instance, many basic protocols need an accurate time. Sometimes a common time in the whole network is needed; often it is enough to synchronize neighbors. The purpose of the time division multiple access (TDMA) protocol is to use the common wireless channel as efficiently as possible, i.e., interfering nodes should never transmit at the same time (on the same frequency). If we use synchronizer β to give the nodes a common notion of time, every single clock cycle costs D time units!

Often, each (wireless) node is equipped with an internal clock. Using this clock, it should be possible to divide time into slots, and make each node send (or listen, or sleep, respectively) in the appropriate slots according to the media access control (MAC) layer protocol used.

However, as it turns out, synchronizing clocks in a network is not trivial. As nodes' internal clocks are not perfect, they will run at speeds that are time-dependent. For instance, variations in temperature or supply voltage will affect this clock drift. For standard clocks, the drift is in the order of parts per million, i.e., within a second, it will accumulate to a couple of microseconds. Wireless TDMA protocols account for this by introducing guard times. Whenever a node knows that it is about to receive a message from a neighbor, it powers up its radio a little bit earlier to make sure that it does not miss the message even when clocks are not perfectly synchronized. If nodes are badly synchronized, messages of different slots might collide.

In the clock synchronization problem, we are given a network (graph) with n nodes. The goal for each node is to have a logical clock such that the logical clock values are well synchronized, and close to real time. Each node is equipped with a hardware clock that ticks more or less in real time, i.e., the time between two pulses is arbitrary within [1 − ε, 1 + ε], for a constant ε ≪ 1. Similarly as in our asynchronous model, we assume that messages sent over the edges of the graph have a delivery time within [0, 1]. In other words, we have a bounded but variable drift on the hardware clocks and an arbitrary jitter in the delivery times. The goal is to design a message-passing algorithm that ensures that the logical clock skew of adjacent nodes is as small as possible at all times.

Theorem 10.12. The global clock skew (the logical clock difference between any two nodes in the graph) is Ω(D), where D is the diameter of the graph.

Proof. For a node u, let t_u be the logical time of u and let (u → v) denote a message sent from u to a node v. Let t(m) be the time delay of a message m and let u and v be neighboring nodes. First consider a case where the message delays between u and v are 1/2. Then all the messages sent by u and v at time i according to the clock of the sender arrive at time i + 1/2 according to the clock of the receiver.

Then consider the following cases

• t_u = t_v + 1/2, t(u → v) = 1, t(v → u) = 0,
• t_u = t_v − 1/2, t(u → v) = 0, t(v → u) = 1,
where the message delivery time is always fast for one node and slow for the other, and the logical clocks are off by 1/2. In both scenarios, the messages sent at time i according to the clock of the sender arrive at time i + 1/2 according to the logical clock of the receiver. Therefore, for nodes u and v, both cases with clock drift seem the same as the case with perfectly synchronized clocks. Furthermore, in a linked list of D nodes, the left- and rightmost nodes l, r cannot distinguish t_l = t_r + D/2 from t_l = t_r − D/2. □

Remarks:

• From Theorem 10.12, it directly follows that all the clock synchronization algorithms we studied have a global skew of Ω(D).

• Many natural algorithms manage to achieve a global clock skew of O(D).

As both the message jitter and the hardware clock drift are bounded by constants, it feels like we should be able to get a constant drift between neighboring nodes. As synchronizer α pays most attention to the local synchronization, we take a look at a protocol inspired by synchronizer α. A pseudo-code representation of the clock synchronization protocol α is given in Algorithm 43.

Algorithm 43 Clock synchronization α (at node v)
1: repeat
2: send logical time t_v to all neighbors
3: if logical time t_u with t_u > t_v is received from any neighbor u then
4: t_v := t_u
5: end if
6: until done

Lemma 10.13. The clock synchronization protocol α has a local skew of Ω(n).

Proof. Let the graph be a linked list of D nodes. We denote the nodes by v_1, v_2, . . . , v_D from left to right and the logical clock of node v_i by t_i. Apart from the left-most node v_1, all hardware clocks run with speed 1 (real time). Node v_1 runs at maximum speed, i.e., the time between two pulses is not 1 but 1 − ε. Assume that initially all message delays are 1. After some time, node v_1 will start to speed up v_2, and after some more time v_2 will speed up v_3, and so on. At some point of time, we will have a clock skew of 1 between any two neighbors. In particular, t_1 = t_D + D − 1.

Now we start playing around with the message delays. Let t_1 = T. First we set the delay between v_1 and v_2 to 0. Now node v_2 immediately adjusts its logical clock to T. After this event (which is instantaneous in our model) we set the delay between v_2 and v_3 to 0, which results in v_3 setting its logical clock to T as well. We perform this successively for all pairs of nodes up to v_{D−2} and v_{D−1}. Now node v_{D−1} sets its logical clock to T, which indicates that the difference between the logical clocks of v_{D−1} and v_D is T − (T − (D − 1)) = D − 1. □

Remarks:

• The introduced examples may seem cooked-up, but examples like this exist in all networks, and for all algorithms. Indeed, it was shown that any natural clock synchronization algorithm must have a bad local skew. In particular, a protocol that averages between all neighbors is even worse than the introduced α algorithm. This algorithm has a clock skew of Ω(D^2) in the linked list, at all times.

• It was shown that the local clock skew is Θ(log D), i.e., there is a protocol that achieves this bound, and there is a proof that no algorithm can be better than this bound!

• Note that these are worst-case bounds. In practice, clock drift and message delays may not be the worst possible; typically the speed of hardware clocks changes at a comparatively slow pace, and the message transmission times follow a benign probability distribution. If we assume this, better protocols do exist.

Chapter Notes

The idea behind synchronizers is quite intuitive, and as such, synchronizers α and β were implicitly used in various asynchronous algorithms [Gal76, Cha79, CL85] before being proposed as separate entities. The general idea of applying synchronizers to run synchronous algorithms in asynchronous networks was first introduced by Awerbuch [Awe85a]. His work also formally introduced the synchronizers α and β. Improved synchronizers that exploit inactive nodes or hypercube networks were presented in [AP90, PU87].

Naturally, as synchronizers are motivated by practical difficulties with local clocks, there are plenty of real life applications. Studies regarding applications can be found in, e.g., [SM86, Awe85b, LTC89, AP90, PU87]. Synchronizers in the presence of network failures have been discussed in [AP88, HS94].

It has been known for a long time that the global clock skew is Θ(D) [LL84, ST87]. The problem of synchronizing the clocks of nearby nodes was introduced by Fan and Lynch in [LF04]; they proved a surprising lower bound of Ω(log D/ log log D) for the local skew. The first algorithm providing a non-trivial local skew of O(√D) was given in [LW06]. Later, matching upper and lower bounds of Θ(log D) were given in [LLW10]. The problem has also been studied in a dynamic setting [KLO09, KLLO10].

Clock synchronization is a well-studied problem in practice, for instance regarding the global clock skew in sensor networks, e.g., [EGE02, GKS03, MKSL04, PSJ04]. One more recent line of work is focussing on the problem of minimizing the local clock skew [BvRW07, SW09, LSW09, FW10, FZTS11].

Bibliography

[AP88] Baruch Awerbuch and David Peleg. Adapting to Asynchronous Dynamic Networks with Polylogarithmic Overhead. In 24th ACM Symposium on Foundations of Computer Science (FOCS), pages 206–220, 1988.
[AP90] Baruch Awerbuch and David Peleg. Network Synchronization with Polylogarithmic Overhead. In Proceedings of the 31st IEEE Symposium on Foundations of Computer Science (FOCS), 1990.

[Awe85a] Baruch Awerbuch. Complexity of Network Synchronization. Journal of the ACM (JACM), 32(4):804–823, October 1985.

[Awe85b] Baruch Awerbuch. Reducing Complexities of the Distributed Max-flow and Breadth-first-search Algorithms by Means of Network Synchronization. Networks, 15:425–437, 1985.

[BvRW07] Nicolas Burri, Pascal von Rickenbach, and Roger Wattenhofer. Dozer: Ultra-Low Power Data Gathering in Sensor Networks. In International Conference on Information Processing in Sensor Networks (IPSN), Cambridge, Massachusetts, USA, April 2007.

[Cha79] E. J. H. Chang. Decentralized Algorithms in Distributed Systems. PhD thesis, University of Toronto, 1979.

[CL85] K. Mani Chandy and Leslie Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 1:63–75, 1985.

[EGE02] Jeremy Elson, Lewis Girod, and Deborah Estrin. Fine-grained Network Time Synchronization Using Reference Broadcasts. ACM SIGOPS Operating Systems Review, 36:147–163, 2002.

[FW10] Roland Flury and Roger Wattenhofer. Slotted Programming for Sensor Networks. In International Conference on Information Processing in Sensor Networks (IPSN), Stockholm, Sweden, April 2010.

[FZTS11] Federico Ferrari, Marco Zimmerling, Lothar Thiele, and Olga Saukh. Efficient Network Flooding and Time Synchronization with Glossy. In Proceedings of the 10th International Conference on Information Processing in Sensor Networks (IPSN), pages 73–84, 2011.

[Gal76] Robert Gallager. Distributed Minimum Hop Algorithms. Technical report, Lab. for Information and Decision Systems, 1976.

[GKS03] Saurabh Ganeriwal, Ram Kumar, and Mani B. Srivastava. Timing-sync Protocol for Sensor Networks. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems (SenSys), 2003.

[HS94] M. Harrington and A. K. Somani. Synchronizing Hypercube Networks in the Presence of Faults. IEEE Transactions on Computers, 43(10):1175–1183, 1994.

[KLLO10] Fabian Kuhn, Christoph Lenzen, Thomas Locher, and Rotem Oshman. Optimal Gradient Clock Synchronization in Dynamic Networks. In 29th Symposium on Principles of Distributed Computing (PODC), Zurich, Switzerland, July 2010.

[KLO09] Fabian Kuhn, Thomas Locher, and Rotem Oshman. Gradient Clock Synchronization in Dynamic Networks. In 21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), Calgary, Canada, August 2009.

[LF04] Nancy Lynch and Rui Fan. Gradient Clock Synchronization. In Proceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing (PODC), 2004.

[LL84] Jennifer Lundelius and Nancy Lynch. An Upper and Lower Bound for Clock Synchronization. Information and Control, 62:190–204, 1984.

[LLW10] Christoph Lenzen, Thomas Locher, and Roger Wattenhofer. Tight Bounds for Clock Synchronization. Journal of the ACM, 57(2), January 2010.

[LSW09] Christoph Lenzen, Philipp Sommer, and Roger Wattenhofer. Optimal Clock Synchronization in Networks. In 7th ACM Conference on Embedded Networked Sensor Systems (SenSys), Berkeley, California, USA, November 2009.

[LTC89] K. B. Lakshmanan, K. Thulasiraman, and M. A. Comeau. An Efficient Distributed Protocol for Finding Shortest Paths in Networks with Negative Weights. IEEE Transactions on Software Engineering, 15:639–644, 1989.

[LW06] Thomas Locher and Roger Wattenhofer. Oblivious Gradient Clock Synchronization. In 20th International Symposium on Distributed Computing (DISC), Stockholm, Sweden, September 2006.

[MKSL04] Miklós Maróti, Branislav Kusy, Gyula Simon, and Ákos Lédeczi. The Flooding Time Synchronization Protocol. In Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems (SenSys), 2004.

[PSJ04] Santashil PalChaudhuri, Amit Kumar Saha, and David B. Johnson. Adaptive Clock Synchronization in Sensor Networks. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks (IPSN), 2004.

[PU87] David Peleg and Jeffrey D. Ullman. An Optimal Synchronizer for the Hypercube. In Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 77–85, 1987.

[SM86] Baruch Shieber and Shlomo Moran. Slowing Sequential Algorithms for Obtaining Fast Distributed and Parallel Algorithms: Maximum Matchings. In Proceedings of the 5th Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 282–292, 1986.

[ST87] T. K. Srikanth and S. Toueg. Optimal Clock Synchronization. Journal of the ACM, 34:626–645, 1987.

[SW09] Philipp Sommer and Roger Wattenhofer. Gradient Clock Synchronization in Wireless Sensor Networks. In 8th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), San Francisco, USA, April 2009.
Chapter 11

Hard Problems

This chapter is on "hard" problems in distributed computing. In sequential computing, there are NP-hard problems, which are conjectured to take exponential time. Is there something similar in distributed computing? Using flooding/echo (Algorithms 7, 8) from Chapter 2, everything so far was solvable basically in O(D) time, where D is the diameter of the network.

11.1 Diameter & APSP

But how do we compute the diameter itself!?! With flooding/echo, of course!

Algorithm 44 Naive Diameter Construction
1: all nodes compute their radius by synchronous flooding/echo
2: all nodes flood their radius on the constructed BFS tree
3: the maximum radius a node sees is the diameter

Remarks:

• Since all these phases only take O(D) time, nodes know the diameter in O(D) time, which is asymptotically optimal.

• However, there is a problem! Nodes are now involved in n parallel flooding/echo operations, thus a node may have to handle many and big messages in one single time step. Although this is not strictly illegal in the message passing model, it still feels like cheating! A natural question is whether we can do the same by just sending short messages in each round.

• In Definition 1.6 of Chapter 1 we postulated that nodes should send only messages of "reasonable" size. In this chapter we strengthen the definition a bit, and require that each message should have at most O(log n) bits. This is generally enough to communicate a constant number of IDs or values to neighbors, but not enough to communicate everything a node knows!

• A simple way to avoid large messages is to split them into small messages that are sent using several rounds. This can cause messages to be delayed in some nodes but not in others. The floodings might then not use the edges of a BFS tree anymore, and hence might not compute correct distances anymore! On the other hand, we know that the maximal message size in Algorithm 44 is O(n log n). So we could just simulate each of these "big message" rounds by n "small message" rounds using small messages. This yields a runtime of O(nD), which is not desirable. A third possible approach, starting the flooding/echo processes one after another, results in O(nD) in the worst case as well.

• So let us fix the above algorithm! The key idea is to arrange the flooding-echo processes in a more organized way: start the flooding processes in a certain order and prove that at any time, each node is only involved in one flooding. This is realized in Algorithm 45.

Definition 11.1 (BFS_v). Performing a breadth first search at node v produces spanning tree BFS_v (see Chapter 2). This takes time O(D) using small messages.

Remarks:

• A spanning tree of a graph G can be traversed in time O(n) by sending a pebble over an edge in each time slot.

• This can be done using, e.g., a depth first search (DFS): start at the root of a tree, and recursively visit all nodes in the following way. If the current node still has an unvisited child, then the pebble always visits that child first. Return to the parent only when all children have been visited.

• Algorithm 45 works as follows: given a graph G, first a leader l computes its BFS tree BFS_l. Then we send a pebble P to traverse tree BFS_l. Each time pebble P enters a node v for the first time, P waits one time slot, and then starts a breadth first search (BFS) – using edges in G – from v with the aim of computing the distances from v to all other nodes. Since we start a BFS_v from every node v, each node u learns its distance to all these nodes v during the corresponding execution of BFS_v. There is no need for an echo process at the end of BFS_u.

Remarks:

• Having all distances is nice, but how do we get the diameter? Well, as before, each node could just flood its radius (its maximum distance) into the network. However, messages are small now and we need to modify this slightly. In each round a node only sends the maximal distance that it is aware of to its neighbors. After D rounds each node will know the maximum distance among all nodes.

Lemma 11.2. In Algorithm 45, at no time is a node w simultaneously active for both BFS_u and BFS_v.
Algorithm 45 Computes APSP on G.
1: Assume we have a leader node l (if not, compute one first)
2: compute BFS_l of leader l
3: send a pebble P to traverse BFS_l in a DFS way;
4: while P traverses BFS_l do
5: if P visits a new node v then
6: wait one time slot; // avoid congestion
7: start BFS_v from node v; // compute all distances to v
8: // the depth of node u in BFS_v is d(u, v)
9: end if
10: end while

Proof. Assume a BFS_u is started at time t_u at node u. Then node w will be involved in BFS_u at time t_u + d(u, w). Now, consider a node v whose BFS_v is started at time t_v > t_u. According to the algorithm, this implies that the pebble visits v after u and took some time to travel from u to v. In particular, the time to get from u to v is at least d(u, v); in addition, at least node v is visited for the first time (which involves waiting at least one time slot), and we have t_v ≥ t_u + d(u, v) + 1. Using this and the triangle inequality, we get that node w is involved in BFS_v strictly after being involved in BFS_u, since t_v + d(v, w) ≥ (t_u + d(u, v) + 1) + d(v, w) ≥ t_u + d(u, w) + 1 > t_u + d(u, w). □

Theorem 11.3. Algorithm 45 computes APSP (all pairs shortest path) in time O(n).

Proof. Since the previous lemma holds for any pair of vertices, no two BFS "interfere" with each other, i.e., all messages can be sent on time without congestion. Hence, all BFS stop at most D time slots after they were started. We conclude that the runtime of the algorithm is determined by the time O(D) we need to build tree BFS_l, plus the time O(n) that P needs to traverse BFS_l, plus the time O(D) needed by the last BFS that P initiated. Since D ≤ n, this is all in O(n). □

Remarks:

• All of a sudden our algorithm needs O(n) time, and possibly n ≫ D. We should be able to do better, right?!

• Unfortunately not! One can show that computing the diameter of a network needs Ω(n/ log n) time.

• Note that one can check whether a graph has diameter 1 by exchanging some specific information such as the degree with the neighbors. However, already checking diameter 2 is difficult.

11.2 Lower Bound Graphs

We define a family G of graphs that we use to prove a lower bound on the number of rounds needed to compute the diameter. To simplify our analysis, we assume that (n − 2) can be divided by 8. We start by defining four sets of nodes, each consisting of q = q(n) := (n − 2)/4 nodes. Throughout this chapter we write [q] as a short version of {1, . . . , q} and define:

L_0 := {l_i | i ∈ [q]}   // upper left in Figure 11.1
L_1 := {l_i' | i ∈ [q]}   // lower left
R_0 := {r_i | i ∈ [q]}   // upper right
R_1 := {r_i' | i ∈ [q]}   // lower right

Figure 11.1: The above skeleton G_0 contains n = 10 nodes, such that q = 2.

We add node c_L and connect it to all nodes in L_0 and L_1. Then we add node c_R, connected to all nodes in R_0 and R_1. Furthermore, nodes c_L and c_R are connected by an edge. For i ∈ [q] we connect l_i to r_i and l_i' to r_i'. Also we add edges such that nodes in L_0 are a clique, nodes in L_1 are a clique, nodes in R_0 are a clique, and nodes in R_1 are a clique. The resulting graph is called G_0. Graph G_0 is the skeleton of any graph in family G.

More formally, skeleton G_0 = (V_0, E_0) is:

V_0 := L_0 ∪ L_1 ∪ R_0 ∪ R_1 ∪ {c_L, c_R}

E_0 := ∪_{v ∈ L_0 ∪ L_1} {(v, c_L)}   // connections to c_L
    ∪ ∪_{v ∈ R_0 ∪ R_1} {(v, c_R)}   // connections to c_R
    ∪ ∪_{i ∈ [q]} {(l_i, r_i), (l_i', r_i')} ∪ {(c_L, c_R)}   // connects left to right
    ∪ ∪_{S ∈ {L_0, L_1, R_0, R_1}} ∪_{u ≠ v ∈ S} {(u, v)}   // clique edges

To simplify our arguments, we partition G_0 into two parts: Part L is the subgraph induced by nodes L_0 ∪ L_1 ∪ {c_L}. Part R is the subgraph induced by nodes R_0 ∪ R_1 ∪ {c_R}.
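To make the construction concrete, the following sketch (our own illustration; node encodings are assumptions) builds the skeleton G_0 for a given q and computes its diameter with one BFS per node, the sequential analogue of the distributed APSP of Algorithm 45.

```python
from collections import deque
from itertools import combinations

def skeleton(q):
    """Build the skeleton G_0 for a given q (n = 4q + 2 nodes).

    l_i, l_i', r_i, r_i' are encoded as ('l', i), ('l2', i), ('r', i),
    ('r2', i); the two center nodes are 'cL' and 'cR'.
    """
    L0 = [('l', i) for i in range(q)]
    L1 = [('l2', i) for i in range(q)]
    R0 = [('r', i) for i in range(q)]
    R1 = [('r2', i) for i in range(q)]
    adj = {v: set() for v in L0 + L1 + R0 + R1 + ['cL', 'cR']}

    def add(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for v in L0 + L1:
        add(v, 'cL')                   # connections to c_L
    for v in R0 + R1:
        add(v, 'cR')                   # connections to c_R
    add('cL', 'cR')
    for i in range(q):
        add(L0[i], R0[i])              # l_i -- r_i
        add(L1[i], R1[i])              # l_i' -- r_i'
    for S in (L0, L1, R0, R1):         # the four cliques
        for u, v in combinations(S, 2):
            add(u, v)
    return adj

def diameter(adj):
    """Maximum eccentricity, via one BFS per node."""
    best = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

G0 = skeleton(2)                # n = 10, the graph of Figure 11.1
print(len(G0), diameter(G0))    # → 10 3
```

The bare skeleton has diameter 3 (witnessed by pairs l_i, r_j'); adding all edges between L_0 and L_1 brings the diameter down to 2, in line with Lemma 11.4 below.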
11.2. LOWER BOUND GRAPHS 123 124 CHAPTER 11. HARD PROBLEMS

Family G contains any graph G that is derived from G′ by adding any combination of edges of the form (li, l′j) resp. (ri, r′j) with li ∈ L0, l′j ∈ L1, ri ∈ R0, and r′j ∈ R1.

Figure 11.2: The above graph G has n = 10 and is a member of family G. What is the diameter of G?

Lemma 11.4. The diameter of a graph G = (V, E) ∈ G is 2 if and only if: For each tuple (i, j) with i, j ∈ [q], there is either edge (li, l′j) or edge (ri, r′j) (or both edges) in E.

Proof. Note that the distance between most pairs of nodes is at most 2. In particular, the radius of cL resp. cR is 2. Thanks to cL resp. cR, the distance between any two nodes within Part L resp. within Part R is at most 2. Because of the cliques L0, L1, R0, R1, the distance between li and rj resp. l′i and r′j is at most 2.

The only interesting case is between a node li ∈ L0 and a node r′j ∈ R1 (or, symmetrically, between l′j ∈ L1 and a node ri ∈ R0). If either edge (li, l′j) or edge (ri, r′j) is present, then this distance is 2, since the path (li, l′j, r′j) or the path (li, ri, r′j) exists. If neither of the two edges exists, then the neighborhood of li consists of {cL, ri}, all nodes in L0, and some nodes in L1 \ {l′j}, and the neighborhood of r′j consists of {cR, l′j}, all nodes in R1, and some nodes in R0 \ {ri} (see for example Figure 11.3 with i = 2 and j = 2). Since the two neighborhoods do not share a common node, the distance between li and r′j is (at least) 3.

Figure 11.3: Nodes in the neighborhood of l2 are cyan, the neighborhood of r′2 is white. Since these neighborhoods do not intersect, the distance of these two nodes is d(l2, r′2) > 2. If edge (l2, l′2) was included, their distance would be 2.

Remarks:

• Each part contains up to q² ∈ Θ(n²) edges not belonging to the skeleton.

• There are 2q + 1 ∈ Θ(n) edges connecting the left and the right part. Since in each round we can transmit O(log n) bits over each edge (in each direction), the bandwidth between Part L and Part R is O(n log n).

• If we transmit the information of the Θ(n²) edges in a naive way with a bandwidth of O(n log n), we need Ω(n/log n) time. But maybe we can do better?!? Can an algorithm be smarter and only send the information that is really necessary to tell whether the diameter is 2?

• It turns out that any algorithm needs Ω(n/log n) rounds, since the information that is really necessary to tell that the diameter is larger than 2 contains basically Θ(n²) bits.

11.3 Communication Complexity

To prove the last remark formally, we can use arguments from two-party communication complexity. This area essentially deals with a basic version of distributed computation: two parties are given some input each and want to solve a task on this input.

We consider two students (Alice and Bob) at two different universities connected by a communication channel (e.g., via email), and we assume this channel to be reliable. Now Alice and Bob want to check whether they received the same problem set for homework (we assume their professors are lazy and wrote it on the blackboard instead of putting a nicely prepared document online). Do Alice and Bob really need to type the whole problem set into their emails? In a more formal way: Alice receives a k-bit string x and Bob another k-bit string y, and the goal is for both of them to compute the equality function.

Definition 11.5. (Equality.) We define the equality function EQ to be:

EQ(x, y) := 1 if x = y, and 0 if x ≠ y.
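The characterization of Lemma 11.4 above can be checked exhaustively for small q. The following Python sketch (added here for illustration; the integer node encoding is a choice of this sketch) builds every graph of family G for q = 2 and compares its BFS diameter against the covering condition of the lemma:

```python
from collections import deque
from itertools import combinations, product

def family_graph(q, x, y):
    """G_{x,y}: nodes 0..q-1 are l_i, q..2q-1 are l'_i, 2q..3q-1 are r_i,
    3q..4q-1 are r'_i, node 4q is cL, node 4q+1 is cR. Entry x[i][j] = 1
    adds edge (l_i, l'_j); entry y[i][j] = 1 adds edge (r_i, r'_j)."""
    adj = {v: set() for v in range(4 * q + 2)}

    def add(u, v):
        adj[u].add(v)
        adj[v].add(u)

    cL, cR = 4 * q, 4 * q + 1
    add(cL, cR)
    for i in range(q):
        add(i, cL); add(q + i, cL)               # cL -- L0, L1
        add(2 * q + i, cR); add(3 * q + i, cR)   # cR -- R0, R1
        add(i, 2 * q + i)                        # l_i -- r_i
        add(q + i, 3 * q + i)                    # l'_i -- r'_i
    for base in (0, q, 2 * q, 3 * q):            # the four cliques
        for u, v in combinations(range(base, base + q), 2):
            add(u, v)
    for i, j in product(range(q), repeat=2):
        if x[i][j]:
            add(i, q + j)                        # (l_i, l'_j)
        if y[i][j]:
            add(2 * q + i, 3 * q + j)            # (r_i, r'_j)
    return adj

def diameter(adj):
    """Largest BFS distance over all start nodes."""
    best = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

q = 2
for bits in product((0, 1), repeat=2 * q * q):
    x = (bits[0:q], bits[q:2 * q])
    y = (bits[2 * q:3 * q], bits[3 * q:4 * q])
    covered = all(x[i][j] or y[i][j] for i in range(q) for j in range(q))
    # diameter is 2 exactly when every tuple (i, j) is covered (Lemma 11.4)
    assert (diameter(family_graph(q, x, y)) == 2) == covered
print("Lemma 11.4 checked for all graphs with q = 2")
```

For q = 2 this enumerates all 2^8 = 256 members of the family; the assertion in the loop is exactly the statement of Lemma 11.4.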

Remarks:

• In a more general setting, Alice and Bob are interested in computing a certain function f : {0, 1}^k × {0, 1}^k → {0, 1} with the least amount of communication between them. Of course they can always succeed by having Alice send her whole k-bit string to Bob, who then computes the function, but the idea here is to find clever ways of calculating f with less than k bits of communication. We measure how clever they can be as follows:

Definition 11.6. (Communication complexity CC.) The communication complexity of protocol A for function f is CC(A, f) := minimum number of bits exchanged between Alice and Bob in the worst case when using A. The communication complexity of f is CC(f) := min{CC(A, f) | A solves f}. That is the minimal number of bits that the best protocol needs to send in the worst case.

Definition 11.7. For a given function f, we define a 2^k × 2^k matrix M^f representing f. That is M^f_{x,y} := f(x, y).

Example 11.8. For EQ, in case k = 3, matrix M^EQ looks like this:

         000 001 010 011 100 101 110 111   ← x
   000    1   0   0   0   0   0   0   0
   001    0   1   0   0   0   0   0   0
   010    0   0   1   0   0   0   0   0
   011    0   0   0   1   0   0   0   0
   100    0   0   0   0   1   0   0   0
   101    0   0   0   0   0   1   0   0
   110    0   0   0   0   0   0   1   0
   111    0   0   0   0   0   0   0   1
    ↑ y

As a next step we define a (combinatorial) monochromatic rectangle. These are “submatrices” of M^f which contain the same entry.

Definition 11.9. (monochromatic rectangle.) A set R ⊆ {0, 1}^k × {0, 1}^k is called a monochromatic rectangle, if

• whenever (x1, y1) ∈ R and (x2, y2) ∈ R then (x1, y2) ∈ R.

• there is a fixed z such that f(x, y) = z for all (x, y) ∈ R.

Example 11.10. The first three of the following rectangles are monochromatic, the last one is not:

R1 = {011} × {011}                              Example 11.8: light gray
R2 = {011, 100, 101, 110} × {000, 001}          Example 11.8: gray
R3 = {000, 001, 101} × {011, 100, 110, 111}     Example 11.8: dark gray
R4 = {000, 001} × {000, 001}                    Example 11.8: boxed

Each time Alice and Bob exchange a bit, they can eliminate columns/rows of the matrix M^f and a combinatorial rectangle is left. They can stop communicating when this remaining rectangle is monochromatic. However, maybe there is a more efficient way to exchange information about a given bit string than just naively transmitting contained bits? In order to cover all possible ways of communication, we need the following definition:

Definition 11.11. (fooling set.) A set S ⊂ {0, 1}^k × {0, 1}^k fools f if there is a fixed z such that

• f(x, y) = z for each (x, y) ∈ S

• For any (x1, y1) ≠ (x2, y2) ∈ S, the rectangle {x1, x2} × {y1, y2} is not monochromatic: either f(x1, y2) ≠ z, or f(x2, y1) ≠ z, or both.

Example 11.12. Consider S = {(000, 000), (001, 001)}. Take a look at the non-monochromatic rectangle R4 in Example 11.10. Verify that S is indeed a fooling set for EQ!

Remarks:

• Can you find a larger fooling set for EQ?

• We assume that Alice and Bob take turns in sending a bit. This results in 2 possible actions (send 0/1) per round and in 2^t action patterns during a sequence of t rounds.

Lemma 11.13. If S is a fooling set for f, then CC(f) = Ω(log |S|).

Proof. We prove the statement via contradiction: fix a protocol A and assume that it needs t < log(|S|) rounds in the worst case. Then there are 2^t possible action patterns, with 2^t < |S|. Hence for at least two elements of S, let us call them (x1, y1), (x2, y2), protocol A produces the same action pattern P. Naturally, the action pattern on the alternative inputs (x1, y2), (x2, y1) will be P as well: in the first round Alice and Bob have no information on the other party’s string and send the same bit that was sent in P. Based on this, they determine the second bit to be exchanged, which will be the same as the second one in P since they cannot distinguish the cases. This continues for all t rounds. We conclude that after t rounds, Alice does not know whether Bob’s input is y1 or y2 and Bob does not know whether Alice’s input is x1 or x2. By the definition of fooling sets, either

• f(x1, y2) ≠ f(x1, y1), in which case Alice (with input x1) does not know the solution yet,

or

• f(x2, y1) ≠ f(x1, y1), in which case Bob (with input y1) does not know the solution yet.

This contradicts the assumption that A leads to a correct decision for all inputs after t rounds. Therefore at least log(|S|) rounds are necessary.

Theorem 11.14. CC(EQ) = Ω(k).

Proof. The set S := {(x, x) | x ∈ {0, 1}^k} fools EQ and has size 2^k. Now apply Lemma 11.13.

Definition 11.15. Denote the negation of a string z by z̄ and by x ◦ y the concatenation of strings x and y.
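The fooling set used in the proof of Theorem 11.14 can be verified mechanically. This Python snippet (an illustration added here, not from the notes) checks both conditions of Definition 11.11 for S = {(x, x)} with k = 3:

```python
from itertools import combinations, product

def eq(x, y):
    """The equality function EQ of Definition 11.5."""
    return 1 if x == y else 0

k = 3
S = [(x, x) for x in product("01", repeat=k)]  # the fooling set of Theorem 11.14
z = 1

# Condition 1: f(x, y) = z for each (x, y) in S.
assert all(eq(x, y) == z for x, y in S)

# Condition 2: every pair of distinct elements spans a non-monochromatic
# rectangle, i.e., eq(x1, y2) != z or eq(x2, y1) != z.
for (x1, y1), (x2, y2) in combinations(S, 2):
    assert eq(x1, y2) != z or eq(x2, y1) != z

# |S| = 2^k, so Lemma 11.13 gives CC(EQ) = Omega(log 2^k) = Omega(k).
print(len(S))  # 8
```

Both conditions hold trivially here: every diagonal entry of M^EQ is 1, and for distinct x1, x2 both off-diagonal entries of the spanned rectangle are 0.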

Lemma 11.16. Let x, y be k-bit strings. Then x ≠ y if and only if there is an index i ∈ [2k] such that the i-th bit of x ◦ x̄ and the i-th bit of ȳ ◦ y are both 0.

Proof. If x ≠ y, there is a j ∈ [k] such that x and y differ in the j-th bit. Therefore either the j-th bit of both x and ȳ is 0, or the j-th bit of both x̄ and y is 0. For this reason, there is an i ∈ [2k] such that x ◦ x̄ and ȳ ◦ y are both 0 at position i.

If x = y, then for any i ∈ [2k] it is always the case that either the i-th bit of x ◦ x̄ is 1 or the i-th bit of ȳ ◦ y (which is the negation of x ◦ x̄ in this case) is 1.

Remarks:

• With these insights we get back to the problem of computing the diameter of a graph and relate this problem to EQ.

Definition 11.17. Using the parameter q defined before, we define a bijective map between all pairs x, y of q²-bit strings and the graphs in G: each pair of strings x, y is mapped to graph Gx,y ∈ G that is derived from skeleton G′ by adding

• edge (li, l′j) to Part L if and only if the (j + q · (i − 1))-th bit of x is 1.

• edge (ri, r′j) to Part R if and only if the (j + q · (i − 1))-th bit of y is 1.

Remarks:

• Clearly, Part L of Gx,y depends on x only and Part R depends on y only.

Lemma 11.18. Let x and y be q²/2-bit strings¹ given to Alice and Bob. Then graph G := G_{x◦x̄, ȳ◦y} ∈ G has diameter 2 if and only if x = y.

¹That’s why we need that n − 2 can be divided by 8.

Proof. By Lemma 11.16 and the construction of G, there is neither edge (li, l′j) nor edge (ri, r′j) in E(G) for some (i, j) if and only if x ≠ y. Applying Lemma 11.4 yields: G has diameter 2 if and only if x = y.

Theorem 11.19. Any distributed algorithm A that decides whether a graph G has diameter 2 needs Ω(n/log n + D) time.

Proof. Computing D for sure needs time Ω(D). It remains to prove Ω(n/log n). Assume there is a distributed algorithm A that decides whether the diameter of a graph is 2 in time o(n/log n). When Alice and Bob are given q²/2-bit inputs x and y, they can simulate A to decide whether x = y as follows: Alice constructs Part L of G_{x◦x̄, ȳ◦y} and Bob constructs Part R. As we remarked, both parts are independent of each other such that Part L can be constructed by Alice without knowing y and Part R can be constructed by Bob without knowing x. Furthermore, G_{x◦x̄, ȳ◦y} has diameter 2 if and only if x = y (Lemma 11.18).

Now Alice and Bob simulate the distributed algorithm A round by round: In the first round, they determine which messages the nodes in their part of G would send. Then they use their communication channel to exchange all 2(2q + 1) ∈ Θ(n) messages that would be sent over edges between Part L and Part R in this round while executing A on G. Based on this Alice and Bob determine which messages would be sent in round two and so on. For each round simulated by Alice and Bob, they only need to communicate O(n log n) bits: O(log n) bits for each of O(n) messages. Since A makes a decision after o(n/log n) rounds, this yields a total communication of o(n²) bits. On the other hand, Theorem 11.14 states that to decide whether x equals y, Alice and Bob need to communicate at least Ω(q²/2) = Ω(n²) bits. A contradiction.

Remarks:

• Until now we only considered deterministic algorithms. Can one do better using randomness?

Algorithm 46 Randomized evaluation of EQ.
1: Alice and Bob use public randomness. That is, they both have access to the same random bit string z ∈ {0, 1}^k
2: Alice sends bit a := Σ_{i∈[k]} x_i · z_i mod 2 to Bob
3: Bob sends bit b := Σ_{i∈[k]} y_i · z_i mod 2 to Alice
4: if a ≠ b then
5: we know x ≠ y
6: end if

Lemma 11.20. If x ≠ y, Algorithm 46 discovers x ≠ y with probability at least 1/2.

Proof. Note that if x = y we have a = b for sure.

If x ≠ y, Algorithm 46 may not reveal inequality. For instance, for k = 2, if x = 01, y = 10 and z = 11 we get a = b = 1. In general, let I be the set of indices where x_i ≠ y_i, i.e., I := {i ∈ [k] | x_i ≠ y_i}. Since x ≠ y, we know that |I| > 0. We have

|a − b| ≡ Σ_{i∈I} z_i (mod 2),

and since all z_i with i ∈ I are random, we get that a ≠ b with probability at least 1/2.

Remarks:

• By excluding the vector z = 0^k we can even get a discovery probability strictly larger than 1/2.

• Repeating Algorithm 46 with different random strings z, the error probability can be reduced arbitrarily.

• Does this imply that there is a fast randomized algorithm to determine the diameter? Unfortunately not!
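Algorithm 46 is easy to simulate. The following Python sketch (illustration only; the seed, the input strings, and the trial count are choices of this sketch) implements one round with public randomness and estimates the discovery probability from Lemma 11.20:

```python
import random

def randomized_eq_round(x, y, rng):
    """One round of Algorithm 46: both parties see the same random string z."""
    z = [rng.randrange(2) for _ in x]              # public randomness
    a = sum(xi * zi for xi, zi in zip(x, z)) % 2   # Alice's bit
    b = sum(yi * zi for yi, zi in zip(y, z)) % 2   # Bob's bit
    return a != b                                  # True: we know x != y

rng = random.Random(1)
x, y = [0, 1, 0, 1], [1, 0, 0, 1]                  # differ in |I| = 2 positions

# If x = y, the round never reports inequality (a = b for sure).
assert not any(randomized_eq_round(x, x, rng) for _ in range(1000))

# If x != y, inequality is discovered with probability at least 1/2.
trials = 20000
rate = sum(randomized_eq_round(x, y, rng) for _ in range(trials)) / trials
print(rate)  # close to 0.5
```

For this particular pair the discovery probability is exactly 1/2: a ≠ b happens precisely when an odd number of the two z-bits on the differing positions are 1.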

• Sometimes public randomness is not available, but private randomness is. Here Alice has her own random string and Bob has his own random string. A modified version of Algorithm 46 also works with private randomness at the cost of the runtime.

• One can prove an Ω(n/log n) lower bound for any randomized distributed algorithm that computes the diameter. To do so one considers the disjointness function DISJ instead of equality. Here, Alice is given a subset X ⊆ [k] and Bob is given a subset Y ⊆ [k], and they need to determine whether Y ∩ X = ∅. (X and Y can be represented by k-bit strings x, y.) The reduction is similar to the one presented above but uses graph Gx,y instead of G_{x◦x̄, ȳ◦y}. However, the lower bound for the randomized communication complexity of DISJ is more involved than the lower bound for CC(EQ).

• Since one can compute the diameter given a solution for APSP, an Ω(n/log n) lower bound for APSP is implied. As such, our simple Algorithm 45 is almost optimal!

• Many prominent functions allow for a low communication complexity. For instance, CC(PARITY) = 2. What is the Hamming distance (number of different entries) of two strings? It is known that CC(HAM ≥ d) = Ω(d). Also, CC(decide whether “HAM ≥ k/2 + √k” or “HAM ≤ k/2 − √k”) = Ω(k), even when using randomness. This problem is known as the Gap-Hamming-Distance.

• Lower bounds in communication complexity have many applications. Apart from getting lower bounds in distributed computing, one can also get lower bounds regarding circuit depth or query times for static data structures.

• In the distributed setting with limited bandwidth we showed that computing the diameter has about the same complexity as computing all pairs shortest paths. In contrast, in sequential computing, it is a major open problem whether the diameter can be computed faster than all pairs shortest paths. No nontrivial lower bounds are known, only that Ω(n²) steps are needed – partly due to the fact that there can be n² edges/distances in a graph. On the other hand the currently best algorithm uses fast matrix multiplication and terminates after O(n^2.3727) steps.

11.4 Distributed Complexity Theory

We conclude this chapter with a short overview on the main complexity classes of distributed message passing algorithms. Given a network with n nodes and diameter D, we managed to establish a rich selection of upper and lower bounds regarding how much time it takes to solve or approximate a problem. Currently we know five main distributed complexity classes:

• Strictly local problems can be solved in constant O(1) time, e.g., a constant approximation of a dominating set in a planar graph.

• Just a little bit slower are problems that can be solved in log-star O(log* n) time, e.g., many combinatorial optimization problems in special graph classes such as growth bounded graphs. 3-coloring a ring takes O(log* n).

• A large body of problems is polylogarithmic (or pseudo-local), in the sense that they seem to be strictly local but are not, as they need O(polylog n) time, e.g., the maximal independent set problem.

• There are problems which are global and need O(D) time, e.g., to count the number of nodes in the network.

• Finally there are problems which need polynomial O(poly n) time, even if the diameter D is a constant, e.g., computing the diameter of the network.

Chapter Notes

The linear time algorithm for computing the diameter was discovered independently by [HW12, PRT12]. The presented matching lower bound is by Frischknecht et al. [FHW12], extending techniques by [DHK+11].

Due to its importance in network design, shortest path problems in general and the APSP problem in particular were among the earliest studied problems in distributed computing. Developed algorithms were immediately used, e.g., as early as in 1969 in the ARPANET (see [Lyn96], p.506). Routing messages via shortest paths was extensively discussed to be beneficial in [Taj77, MS79, MRR80, SS80, CM82] and in many other papers. It is not surprising that there is plenty of literature dealing with algorithms for distributed APSP, but most of them focused on secondary targets such as trading time for message complexity. E.g., papers [AR78, Tou80, Che82] obtain a communication complexity of roughly O(n · m) bits/messages and still require superlinear runtime. Also a lot of effort was spent to obtain fast sequential algorithms for various versions of computing APSP or related problems such as the diameter problem, e.g., [CW90, AGM91, AMGN92, Sei95, SZ99, BVW08]. These algorithms are based on fast matrix multiplication such that currently the best runtime is O(n^2.3727) due to [Wil12].

The problem sets in which one needs to distinguish diameter 2 from 4 are inspired by a combinatorial (multiplicative) 3/2-approximation in a sequential setting by Aingworth et al. [ACIM99]. The main idea behind this approximation is to distinguish diameter 2 from 4. This part was transferred to the distributed setting in [HW12].

Two-party communication complexity was introduced by Andy Yao in [Yao79]. Later, Yao received the Turing Award. A nice introduction to communication complexity covering techniques such as fooling sets is the book by Nisan and Kushilevitz [KN97].

This chapter was written in collaboration with Stephan Holzer.

Bibliography

[ACIM99] D. Aingworth, C. Chekuri, P. Indyk, and R. Motwani. Fast Estimation of Diameter and Shortest Paths (Without Matrix Multiplication). SIAM Journal on Computing (SICOMP), 28(4):1167–1181, 1999.

[AGM91] N. Alon, Z. Galil, and O. Margalit. On the exponent of the all pairs shortest path problem. In Proceedings of the 32nd Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 569–575, 1991.

[AMGN92] N. Alon, O. Margalit, Z. Galil, and M. Naor. Witnesses for Boolean Matrix Multiplication and for Shortest Paths. In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science (FOCS), pages 417–426. IEEE Computer Society, 1992.

[AR78] J.M. Abram and I.B. Rhodes. A decentralized shortest path algorithm. In Proceedings of the 16th Allerton Conference on Communication, Control and Computing (Allerton), pages 271–277, 1978.

[BVW08] G.E. Blelloch, V. Vassilevska, and R. Williams. A New Combinatorial Approach for Sparse Graph Problems. In Proceedings of the 35th International Colloquium on Automata, Languages and Programming, Part I (ICALP), pages 108–120. Springer-Verlag, 2008.

[Che82] C.C. Chen. A distributed algorithm for shortest paths. IEEE Transactions on Computers (TC), 100(9):898–899, 1982.

[CM82] K.M. Chandy and J. Misra. Distributed computation on graphs: Shortest path algorithms. Communications of the ACM (CACM), 25(11):833–837, 1982.

[CW90] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation (JSC), 9(3):251–280, 1990.

[DHK+11] A. Das Sarma, S. Holzer, L. Kor, A. Korman, D. Nanongkai, G. Pandurangan, D. Peleg, and R. Wattenhofer. Distributed Verification and Hardness of Distributed Approximation. Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC), 2011.

[FHW12] S. Frischknecht, S. Holzer, and R. Wattenhofer. Networks Cannot Compute Their Diameter in Sublinear Time. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1150–1162, January 2012.

[HW12] Stephan Holzer and Roger Wattenhofer. Optimal Distributed All Pairs Shortest Paths and Applications. In PODC, page to appear, 2012.

[KN97] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.

[Lyn96] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1996.

[MRR80] J. McQuillan, I. Richer, and E. Rosen. The new routing algorithm for the ARPANET. IEEE Transactions on Communications (TC), 28(5):711–719, 1980.

[MS79] P. Merlin and A. Segall. A failsafe distributed routing protocol. IEEE Transactions on Communications (TC), 27(9):1280–1287, 1979.

[PRT12] David Peleg, Liam Roditty, and Elad Tal. Distributed Algorithms for Network Diameter and Girth. In ICALP, page to appear, 2012.

[Sei95] R. Seidel. On the all-pairs-shortest-path problem in unweighted undirected graphs. Journal of Computer and System Sciences (JCSS), 51(3):400–403, 1995.

[SS80] M. Schwartz and T. Stern. Routing techniques used in computer communication networks. IEEE Transactions on Communications (TC), 28(4):539–552, 1980.

[SZ99] A. Shoshan and U. Zwick. All pairs shortest paths in undirected graphs with integer weights. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 605–614. IEEE, 1999.

[Taj77] W.D. Tajibnapis. A correctness proof of a topology information maintenance protocol for a distributed computer network. Communications of the ACM (CACM), 20(7):477–485, 1977.

[Tou80] S. Toueg. An all-pairs shortest-paths distributed algorithm. Tech. Rep. RC 8327, IBM TJ Watson Research Center, Yorktown Heights, NY 10598, USA, 1980.

[Wil12] V.V. Williams. Multiplying Matrices Faster Than Coppersmith-Winograd. Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC), 2012.

[Yao79] A.C.C. Yao. Some complexity questions related to distributive computing. In Proceedings of the 11th Annual ACM Symposium on Theory of Computing (STOC), pages 209–213. ACM, 1979.

Chapter 12

Wireless Protocols

Wireless communication was one of the major success stories of the last decades. Today, different wireless standards such as wireless local area networks (WLAN) are omnipresent. In some sense, from a distributed computing viewpoint wireless networks are quite simple, as they cannot form arbitrary network topologies. Simplistic models of wireless networks include geometric graph models such as the so-called unit disk graph. Modern models are more robust: The network graph is restricted, e.g., the total number of neighbors of a node which are not adjacent is likely to be small. This observation is hard to capture with purely geometric models, and motivates more advanced network connectivity models such as bounded growth or bounded independence.

However, on the other hand, wireless communication is also more difficult than standard message passing, as for instance nodes are not able to transmit a different message to each neighbor at the same time. And if two neighbors are transmitting at the same time, they interfere, and a node may not be able to decipher anything.

In this chapter we deal with the distributed computing principles of wireless communication: We make the simplifying assumption that all n nodes are in the communication range of each other, i.e., the network graph is a clique. Nodes share a synchronous time, in each time slot a node can decide to either transmit or receive (or sleep). However, two or more nodes transmitting in the same time slot will cause interference. Transmitting nodes are never aware if there is interference because they cannot simultaneously transmit and receive.

12.1 Basics

The basic communication protocol in wireless networks is the medium access control (MAC) protocol. Unfortunately it is difficult to claim that one MAC protocol is better than another, because it all depends on the parameters, such as the network topology, the channel characteristics, or the traffic pattern. When it comes to the principles of wireless protocols, we usually want to achieve much simpler goals. One basic and important question is the following: How long does it take until one node can transmit successfully, without interference? This question is often called the wireless leader election problem (Chapter 2), with the node transmitting alone being the leader.

Clearly, we can use node IDs to solve leader election, e.g., a node with ID i transmits in time slot i. However, this may be incredibly slow. There are better deterministic solutions, but by and large the best and simplest algorithms are randomized.

Throughout this chapter, we use a random variable X to denote the number of nodes transmitting in a given slot.

Algorithm 47 Slotted Aloha
1: Every node v executes the following code:
2: repeat
3: transmit with probability 1/n
4: until one node has transmitted alone

Theorem 12.1. Using Algorithm 47 allows one node to transmit alone (become a leader) after expected time e.

Proof. The probability for success, i.e., only one node transmitting, is

Pr[X = 1] = n · (1/n) · (1 − 1/n)^(n−1) ≈ 1/e,

where the last approximation is a result from Theorem 12.23 for sufficiently large n. Hence, if we repeat this process e times, we can expect one success.

Remarks:

• The origin of the name is the ALOHAnet which was developed at the University of Hawaii.

• How does the leader know that it is the leader? One simple solution is a “distributed acknowledgment”. The nodes just continue Algorithm 47, including the ID of the leader in their transmission. So the leader learns that it is the leader.

• One more problem?! Indeed, node v which managed to transmit the acknowledgment (alone) is the only remaining node which does not know that the leader knows that it is the leader. We can fix this by having the leader acknowledge v’s successful acknowledgment.

• One can also imagine an unslotted time model. In this model two messages which overlap partially will interfere and no message is received. As everything in this chapter, Algorithm 47 also works in an unslotted time model, with a factor 2 penalty, i.e., the probability for a successful transmission will drop from 1/e to 1/(2e). Essentially, each slot is divided into t small time slots with t → ∞ and the nodes start a new t-slot long transmission with probability 1/(2nt).
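A quick simulation illustrates Theorem 12.1. In this Python sketch (added here; n = 50, the seed, and the number of runs are arbitrary choices) the average number of slots until a lone transmission is close to e:

```python
import random

def slotted_aloha(n, rng):
    """Run Algorithm 47: count slots until exactly one of n nodes transmits."""
    slots = 0
    while True:
        slots += 1
        transmitters = sum(rng.random() < 1.0 / n for _ in range(n))
        if transmitters == 1:          # one node transmitted alone: leader
            return slots

rng = random.Random(42)
n = 50
runs = [slotted_aloha(n, rng) for _ in range(3000)]
print(sum(runs) / len(runs))           # close to e ~ 2.72
```

The number of slots is geometrically distributed with success probability n · (1/n) · (1 − 1/n)^(n−1), so the empirical mean approaches roughly e for large n.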

12.2 Initialization

Sometimes we want the n nodes to have the IDs {1, 2, . . . , n}. This process is called initialization. Initialization can for instance be used to allow the nodes to transmit one by one without any interference.

12.2.1 Non-Uniform Initialization

Theorem 12.2. If the nodes know n, we can initialize them in O(n) time slots.

Proof. We repeatedly elect a leader using, e.g., Algorithm 47. The leader gets the next free number and afterwards leaves the process. We know that this works with probability 1/e. The expected time to finish is hence e · n.

Remarks:

• But this algorithm requires that the nodes know n in order to give them IDs from 1, . . . , n! For a more realistic scenario we need a uniform algorithm, i.e., the nodes do not know n.

12.2.2 Uniform Initialization with CD

Definition 12.3 (Collision Detection, CD). Two or more nodes transmitting concurrently is called interference. In a system with collision detection, a receiver can distinguish interference from nobody transmitting. In a system without collision detection, a receiver cannot distinguish the two cases.

The main idea of the algorithm is to partition nodes iteratively into sets. Each set is identified by a label (a bitstring), and by storing one such bitstring, each node knows in which set it currently is. Initially, all nodes are in a single set, identified by the empty bitstring. This set is then partitioned into two non-empty sets, identified by ‘0’ and ‘1’. In the same way, all sets are iteratively partitioned into two non-empty sets, as long as a set contains more than one node. If a set contains only a single node, this node receives the next free ID. The algorithm terminates once every node is alone in its set. Note that this partitioning process iteratively creates a binary tree which has exactly one node in the set at each leaf, and thus has n leaves.

Algorithm 48 Initialization with Collision Detection
1: Every node v executes the following code:
2: nextId := 0
3: myBitstring := ‘’ // initialize to empty string
4: bitstringsToSplit := [‘’] // a queue with sets to split
5: while bitstringsToSplit is not empty do
6: b := bitstringsToSplit.pop()
7: repeat
8: if b = myBitstring then
9: choose r uniformly at random from {0, 1}
10: in the next two time slots:
11: transmit in slot r, and listen in other slot
12: else
13: it is not my bitstring, just listen in both slots
14: end if
15: until there was at least 1 transmission in both slots
16: if b = myBitstring then
17: myBitstring := myBitstring + r // append bit r
18: end if
19: for r ∈ {0, 1} do
20: if some node u transmitted alone in slot r then
21: node u becomes ID nextId and becomes passive
22: nextId := nextId + 1
23: else
24: bitstringsToSplit.push(b + r)
25: end if
26: end for
27: end while

Remarks:

• In line 20 a transmitting node needs to know whether it was the only one transmitting. This is achievable in several ways, for instance by adding an acknowledgement round. To notify a node v that it has transmitted alone in round r, every node that was silent in round r sends an acknowledgement in round r + 1, while v is silent. If v hears a message or interference in r + 1, it knows that it transmitted alone in round r.

Theorem 12.4. Algorithm 48 correctly initializes n nodes in expected time O(n).

Proof. A successful split is defined as a split in which both subsets are non-empty. We know that there are exactly n − 1 successful splits because we have a binary tree with n leaves and n − 1 inner nodes. Let us now calculate the probability for creating two non-empty sets from a set of size k ≥ 2 as

Pr[1 ≤ X ≤ k − 1] = 1 − Pr[X = 0] − Pr[X = k] = 1 − 1/2^k − 1/2^k ≥ 1/2.
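The splitting process of Algorithm 48 can be simulated directly. In this Python sketch (not part of the notes; collision detection is replaced by simply inspecting the two groups, and the seed is arbitrary) each iteration of the inner loop stands for the two time slots of one split attempt:

```python
import random
from collections import deque

def initialize_with_cd(n, rng):
    """Simulate Algorithm 48: assign IDs 0..n-1, returning (slots used, IDs)."""
    members = {"": list(range(n))}     # bitstring label -> nodes in that set
    to_split = deque([""])
    ids, next_id, slots = {}, 0, 0
    while to_split:
        b = to_split.popleft()
        group = members.pop(b)
        while True:                    # repeat until both slots are non-silent
            slots += 2
            in_slot = {0: [], 1: []}
            for v in group:            # each node picks r uniformly at random
                in_slot[rng.randrange(2)].append(v)
            if in_slot[0] and in_slot[1]:
                break
        for r in (0, 1):
            if len(in_slot[r]) == 1:   # node transmitted alone: gets next ID
                ids[in_slot[r][0]] = next_id
                next_id += 1
            else:                      # still more than one node: split again
                members[b + str(r)] = in_slot[r]
                to_split.append(b + str(r))
    return slots, ids

rng = random.Random(7)
slots, ids = initialize_with_cd(16, rng)
print(slots, sorted(ids.values()))     # IDs are exactly 0..15
```

Since a binary tree with n leaves has n − 1 inner nodes and each split attempt succeeds with probability at least 1/2, the simulated slot count stays linear in n, matching Theorem 12.4.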

Thus, in expectation we need O(n) splits. Proof. The probability for not electing a leader after c · log n time slots, i.e.,
c log n slots without a successful transmission is
Remarks:  c ln n  e·c0 ln n
1 1 1 1
1− = 1− ≤ = c0 .
• What if we do not have collision detection? e e eln n·c0 n

12.2.3 Uniform Initialization without CD


Let us assume that we have a special node ` (leader) and let S denote the set of
Remarks:
nodes which want to transmit. We now split every time slot from Algorithm 48
into two time slots and use the leader to help us distinguish between silence and
• What about uniform algorithms, i.e. the number of nodes n is not
noise. In the first slot every node from the set S transmits, in the second slot
known?
the nodes in S ∪ {`} transmit. This gives the nodes sufficient information to
distinguish the different cases (see Table 12.1).
12.3.2 Uniform Leader Election
nodes in S transmit nodes in S ∪ {`} transmit
|S| = 0 7 4
|S| = 1, S = {`} 4 4 Algorithm 49 Uniform leader election
|S| = 1, S 6= {`} 4 7 1: Every node v executes the following code:
|S| ≥ 2 7 7 2: for k = 1, 2, 3, . . . do
3: for i = 1 to ck do
Table 12.1: Using a leader to distinguish between noise and silence: 7 represents 4: transmit with probability p := 1/2k
noise/silence, 4 represents a successful transmission. 5: if node v was the only node which transmitted then
6: v becomes the leader
7: break
Remarks: 8: end if
9: end for
• As such, Algorithm 48 works also without CD, with only a factor 2
10: end for
overhead.

• More generally, a leader immediately brings CD to any protocol.


Theorem 12.7. By using Algorithm 49 it is possible to elect a leader w.h.p. in
• This protocol has an important real life application, for instance when O(log2 n) time slots if n is not known.
checking out a shopping cart with items which have RFID tags.

• But how do we determine such a leader? And how long does it take Proof. Let us briefly describe the algorithm. The nodes transmit with prob-
until we are “sure” that we have one? Let us repeat the notion of with ability p = 2−k for ck time slots for k = 1, 2, . . .. At first p will be too high
high probability. and hence there will be a lot of interference. But after log n phases, we have
k ≈ log n and thus the nodes transmit with probability ≈ n1 . For simplicity’s
sake, let us assume that n is a power of 2. Using the approach outlined above,
12.3 Leader Election we know that after log n iterations, we have p = n1 . Theorem 12.6 yields that we
can elect a leader w.h.p. in O(log n) slots. Since we have to try log n estimates
12.3.1 With High Probability until k ≈ n, the total runtime is O(log2 n).

Definition 12.5 (With High Probability). Some probabilistic event is said to


occur with high probability (w.h.p.), if it happens with a probability p ≥ 1 − Remarks:
1/nc , where c is a constant. The constant c may be chosen arbitrarily, but it is
considered constant with respect to Big-O notation. • Note that our proposed algorithm has not used collision detection.
Can we solve leader election faster in a uniform setting with collision
Theorem 12.6. Algorithm 47 elects a leader w.h.p. in O(log n) time slots. detection?
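The phase structure of Algorithm 49 can be watched in a small simulation. The sketch below is a centralized toy model, not code from the text; the constant c = 4, the function name, and the test sizes are arbitrary choices of this sketch. It counts the time slots until some node transmits alone, for several values of n that the simulated nodes themselves do not know:

```python
import random

def uniform_leader_election(n, c=4, rng=random):
    """Algorithm 49, centralized toy simulation: in phase k every one of
    the n nodes transmits with probability 1/2^k, for c*k slots; the
    first slot with exactly one transmitter elects the leader."""
    slots, k = 0, 0
    while True:
        k += 1
        for _ in range(c * k):
            slots += 1
            transmitters = sum(rng.random() < 0.5 ** k for _ in range(n))
            if transmitters == 1:
                return slots  # exactly one node on the channel: leader found

random.seed(1)
for n in (4, 32, 256):  # n is unknown to the nodes themselves
    avg = sum(uniform_leader_election(n) for _ in range(20)) / 20
    print("n =", n, "avg slots =", round(avg, 1))
```

Growing n by a constant factor adds roughly a constant number of phases, consistent with the O(log² n) bound of Theorem 12.7.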
12.3.3 Fast Leader Election with CD

Algorithm 50 Uniform leader election with CD
1: Every node v executes the following code:
2: repeat
3:   transmit with probability 1/2
4:   if at least one node transmitted then
5:     all nodes that did not transmit quit the protocol
6:   end if
7: until one node transmits alone

Theorem 12.8. With collision detection we can elect a leader using Algorithm
50 w.h.p. in O(log n) time slots.

Proof. The number of active nodes k is monotonically decreasing and always
greater than 1 which yields the correctness. A slot is called successful if at most
half the active nodes transmit. We can assume that k ≥ 2 since otherwise we
would have already elected a leader. We can calculate the probability that a
time slot is successful as

    Pr[1 ≤ X ≤ ⌈k/2⌉] = Pr[X ≤ ⌈k/2⌉] − Pr[X = 0] ≥ 1/2 − 1/2^k ≥ 1/4.

Since the number of active nodes at least halves in every successful time slot,
log n successful time slots are sufficient to elect a leader. Now let Y be a random
variable which counts the number of successful time slots after 8 · c · log n time
slots. The expected value is E[Y] ≥ 8 · c · log n · 1/4 ≥ 2 · c · log n. Since all those
time slots are independent from each other, we can apply a Chernoff bound (see
Theorem 12.22) with δ = 1/2 which states

    Pr[Y < (1 − δ)E[Y]] ≤ e^{−δ²/2 · E[Y]} ≤ e^{−1/8 · 2c log n} ≤ n^{−α}

for any constant α. □

Remarks:

• Can we be even faster?

12.3.4 Even Faster Leader Election with CD

Let us first briefly describe an algorithm for this. In the first phase the nodes
transmit with probability 1/2^{2^0}, 1/2^{2^1}, 1/2^{2^2}, . . . until no node transmits. This
yields a first approximation on the number of nodes. Afterwards, a binary search
is performed to determine an even better approximation of n. Finally, the third
phase finds a constant approximation of n using a biased random walk. The
algorithm stops in any case as soon as only one node is transmitting, which will
become the leader.

Algorithm 51 Fast uniform leader election
 1: i := 1
 2: repeat
 3:   i := 2 · i
 4:   transmit with probability 1/2^i
 5: until no node transmitted
    {End of Phase 1}
 6: l := 2^{i−2}
 7: u := 2^i
 8: while l + 1 < u do
 9:   j := ⌈(l + u)/2⌉
10:   transmit with probability 1/2^j
11:   if no node transmitted then
12:     u := j
13:   else
14:     l := j
15:   end if
16: end while
    {End of Phase 2}
17: k := u
18: repeat
19:   transmit with probability 1/2^k
20:   if no node transmitted then
21:     k := k − 1
22:   else
23:     k := k + 1
24:   end if
25: until exactly one node transmitted

Lemma 12.9. If j > log n + log log n, then Pr[X > 1] ≤ 1/log n.

Proof. The nodes transmit with probability 1/2^j < 1/2^{log n + log log n} = 1/(n log n).
The expected number of nodes transmitting is E[X] = n/(n log n) = 1/log n. Using
Markov's inequality (see Theorem 12.21) yields Pr[X > 1] ≤ Pr[X > E[X] · log n] ≤
1/log n. □

Lemma 12.10. If j < log n − log log n, then P[X = 0] ≤ 1/n.

Proof. The nodes transmit with probability 1/2^j > 1/2^{log n − log log n} = log n/n.
Thus, the probability that a node is silent is at most 1 − log n/n. Hence, the
probability for a silent time slot, i.e., Pr[X = 0], is at most (1 − log n/n)^n =
e^{−log n} = 1/n. □

Corollary 12.11. If i > 2 log n, then Pr[X > 1] ≤ 1/log n.

Proof. This follows from Lemma 12.9 since the deviation in this corollary is
even larger. □

Corollary 12.12. If i < 1/2 · log n, then P[X = 0] ≤ 1/n.

Proof. This follows from Lemma 12.10 since the deviation in this corollary is
even larger. □
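The halving argument behind Theorem 12.8 can also be simulated. The following sketch (again a centralized toy model; all names and parameters are assumptions of this sketch, not of the text) tracks the set of active nodes of Algorithm 50; the number of slots used grows roughly logarithmically in n:

```python
import random

def leader_election_with_cd(n, rng=random):
    """Algorithm 50, toy simulation with collision detection: every
    active node transmits with probability 1/2; on a non-silent slot
    the silent nodes quit. Returns the number of slots used."""
    active = n
    slots = 0
    while True:
        slots += 1
        transmitted = sum(rng.random() < 0.5 for _ in range(active))
        if transmitted == 1:
            return slots          # one node transmitted alone: leader
        if transmitted >= 1:
            active = transmitted  # collision: non-transmitters quit
        # transmitted == 0: silence, everybody stays active

random.seed(2)
print([leader_election_with_cd(1 << e) for e in (3, 6, 9, 12)])
```

Each non-silent slot replaces the active set by the transmitters, which is at most half the active nodes in a "successful" slot, exactly as in the proof.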
Lemma 12.13. Let v be such that 2^{v−1} < n ≤ 2^v, i.e., v ≈ log n. If k > v + 2,
then Pr[X > 1] ≤ 1/4.

Proof. Markov's inequality yields

    Pr[X > 1] = Pr[X > (2^k/n) · E[X]] < Pr[X > (2^k/2^v) · E[X]] < Pr[X > 4E[X]] < 1/4. □

Lemma 12.14. If k < v − 2, then P[X = 0] ≤ 1/4.

Proof. A similar analysis is possible to upper bound the probability that a
transmission fails if our estimate is too small. We know that k ≤ v − 2 and thus

    Pr[X = 0] = (1 − 1/2^k)^n < e^{−n/2^k} < e^{−2^{v−1}/2^k} < e^{−2} < 1/4. □

Lemma 12.15. If v − 2 ≤ k ≤ v + 2, then the probability that exactly one node
transmits is constant.

Proof. The transmission probability is p = 1/2^{v±Θ(1)} = Θ(1/n), and the lemma
follows with a slightly adapted version of Theorem 12.1. □

Lemma 12.16. With probability 1 − 1/log n we find a leader in phase 3 in O(log log n)
time.

Proof. For any k, because of Lemmas 12.13 and 12.14, the random walk of the
third phase is biased towards the good area. One can show that in O(log log n)
steps one gets Ω(log log n) good transmissions. Let Y denote the number of
times exactly one node transmitted. With Lemma 12.15 we obtain E[Y] =
Ω(log log n). Now a direct application of a Chernoff bound (see Theorem 12.22)
yields that these transmissions elect a leader with probability 1 − 1/log n. □

Theorem 12.17. Algorithm 51 elects a leader with probability of at least
1 − log log n/log n in time O(log log n).

Proof. From Corollary 12.11 we know that after O(log log n) time slots, the
first phase terminates. Since we perform a binary search on an interval of size
O(log n), the second phase also takes at most O(log log n) time slots. For the
third phase we know that O(log log n) slots are sufficient to elect a leader with
probability 1 − 1/log n by Lemma 12.16. Thus, the total runtime is O(log log n).
  Now we can combine the results. We know that the error probability for
every time slot in the first two phases is at most 1/log n. Using a union bound (see
Theorem 12.20), we can upper bound the probability that an error occurred by
log log n/log n. Thus, we know that after phase 2 our estimate is at most log log n away
from log n with probability of at least 1 − log log n/log n. Hence, we can apply Lemma
12.16 and thus successfully elect a leader with probability of at least 1 − log log n/log n
(again using a union bound) in time O(log log n). □

Remarks:

• Tightening this analysis a bit more, one can elect a leader with prob-
ability 1 − 1/log n in time log log n + o(log log n).

• Can we be even faster?

12.3.5 Lower Bound

Theorem 12.18. Any uniform protocol that elects a leader with probability of
at least 1 − (1/2)^t must run for at least t time slots.

Proof. Consider a system with only 2 nodes. The probability that exactly one
transmits is at most

    Pr[X = 1] = 2p · (1 − p) ≤ 1/2.

Thus, after t time slots the probability that a leader was elected is at most
1 − (1/2)^t. □

Remarks:

• Setting t = log log n shows that Algorithm 51 is almost tight.

12.3.6 Uniform Asynchronous Wakeup without CD

Until now we have assumed that all nodes start the algorithm in the same time
slot. But what happens if this is not the case? How long does it take to elect
a leader if we want a uniform and anonymous (nodes do not have an identifier
and thus cannot base their decision on it) algorithm?

Theorem 12.19. If nodes wake up in an arbitrary (worst-case) way, any al-
gorithm may take Ω(n/log n) time slots until a single node can successfully
transmit.

Proof. Nodes must transmit at some point, or they will surely never successfully
transmit. With a uniform protocol, every node executes the same code. We
focus on the first slot where nodes may transmit. No matter what the protocol
is, this happens with probability p. Since the protocol is uniform, p must be a
constant, independent of n.
  The adversary wakes up w = (c/p) ln n nodes in each time slot with some con-
stant c. All nodes woken up in the first time slot will transmit with probability
p. We study the event E₁ that exactly one of them transmits in that first time
slot. Using the inequality (1 + t/n)^n ≤ e^t from Theorem 12.23 we get
    Pr[E₁] = w · p · (1 − p)^{w−1}
           = c ln n · (1 − p)^{(1/p)(c ln n − p)}
           ≤ c ln n · e^{−c ln n + p}
           = c ln n · n^{−c} · e^p
           = n^{−c} · O(log n)
           < 1/n^{c−1} = 1/n^{c′}.

In other words, w.h.p. that time slot will not be successful. Since the nodes
cannot distinguish noise from silence, the same argument applies to every set of
nodes which wakes up. Let E_α be the event that all n/w time slots will not be
successful. Using the inequality 1 − p ≤ (1 − p/k)^k from Theorem 12.24 we get

    Pr[E_α] = (1 − Pr(E₁))^{n/w} > (1 − 1/n^{c′})^{Θ(n/log n)} > 1 − 1/n^{c″}.

In other words, w.h.p. it takes more than n/w time slots until some node can
transmit alone. □

12.4 Useful Formulas

In this chapter we have used several inequalities in our proofs. For simplicity's
sake we list all of them in this section.

Theorem 12.20. Boole's inequality or union bound: For a countable set of
events E₁, E₂, E₃, . . ., we have

    Pr[⋃ᵢ Eᵢ] ≤ Σᵢ Pr[Eᵢ].

Theorem 12.21. Markov's inequality: If X is any random variable and a > 0,
then

    Pr[|X| ≥ a] ≤ E[X]/a.

Theorem 12.22. Chernoff bound: Let Y₁, . . . , Yₙ be independent Bernoulli
random variables and let Y := Σᵢ Yᵢ. For any 0 ≤ δ ≤ 1 it holds

    Pr[Y < (1 − δ)E[Y]] ≤ e^{−δ²/2 · E[Y]}

and for δ > 0

    Pr[Y ≥ (1 + δ) · E[Y]] ≤ e^{−min{δ,δ²}/3 · E[Y]}.

Theorem 12.23. We have

    e^t (1 − t²/n) ≤ (1 + t/n)^n ≤ e^t

for all n ∈ ℕ, |t| ≤ n. Note that

    lim_{n→∞} (1 + t/n)^n = e^t.

Theorem 12.24. For all p, k such that 0 < p < 1 and k ≥ 1 we have

    1 − p ≤ (1 − p/k)^k.

Chapter Notes

The Aloha protocol is presented and analyzed in [Abr70, BAK+75, Abr85]; the
basic technique that unslotted protocols are twice as bad as slotted protocols is
from [Rob75]. The idea to broadcast in a packet radio network by building a
tree was first presented in [TM78, Cap79]. This idea is also used in [HNO99]
to initialize the nodes. Willard [Wil86] was the first that managed to elect
a leader in O(log log n) time in expectation. Looking more carefully at the
success rate, it was shown that one can elect a leader with probability 1 − 1/log n
in time log log n + o(log log n) [NO98]. Finally, approximating the number of
nodes in the network is analyzed in [JKZ02, CGK05]. The lower bound for
probabilistic wake-up is published in [JS02]. In addition to single-hop networks,
multi-hop networks have been analyzed, e.g. broadcast [BYGI92, KM98, CR06],
or deployment [MvRW06].
  This chapter was written in collaboration with Philipp Brandes.

Bibliography

[Abr70] Norman Abramson. THE ALOHA SYSTEM: another alternative for computer communications. In Proceedings of the November 17-19, 1970, fall joint computer conference, pages 281–285, 1970.

[Abr85] Norman M. Abramson. Development of the ALOHANET. IEEE Transactions on Information Theory, 31(2):119–123, 1985.

[BAK+75] R. Binder, Norman M. Abramson, Franklin Kuo, A. Okinaka, and D. Wax. ALOHA packet broadcasting: a retrospect. In American Federation of Information Processing Societies National Computer Conference (AFIPS NCC), 1975.

[BYGI92] Reuven Bar-Yehuda, Oded Goldreich, and Alon Itai. On the Time-Complexity of Broadcast in Multi-hop Radio Networks: An Exponential Gap Between Determinism and Randomization. J. Comput. Syst. Sci., 45(1):104–126, 1992.

[Cap79] J. Capetanakis. Tree algorithms for packet broadcast channels. IEEE Trans. Inform. Theory, 25(5):505–515, 1979.

[CGK05] Ioannis Caragiannis, Clemente Galdi, and Christos Kaklamanis. Basic Computations in Wireless Networks. In International Symposium on Algorithms and Computation (ISAAC), 2005.
[CR06] Artur Czumaj and Wojciech Rytter. Broadcasting algorithms in radio networks with unknown topology. J. Algorithms, 60(2):115–143, 2006.

[HNO99] Tatsuya Hayashi, Koji Nakano, and Stephan Olariu. Randomized Initialization Protocols for Packet Radio Networks. In 13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP), 1999.

[JKZ02] Tomasz Jurdzinski, Miroslaw Kutylowski, and Jan Zatopianski. Energy-Efficient Size Approximation of Radio Networks with No Collision Detection. In Computing and Combinatorics (COCOON), 2002.

[JS02] Tomasz Jurdzinski and Grzegorz Stachowiak. Probabilistic Algorithms for the Wakeup Problem in Single-Hop Radio Networks. In International Symposium on Algorithms and Computation (ISAAC), 2002.

[KM98] Eyal Kushilevitz and Yishay Mansour. An Omega(D log (N/D)) Lower Bound for Broadcast in Radio Networks. SIAM J. Comput., 27(3):702–712, 1998.

[MvRW06] Thomas Moscibroda, Pascal von Rickenbach, and Roger Wattenhofer. Analyzing the Energy-Latency Trade-off during the Deployment of Sensor Networks. In 25th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), Barcelona, Spain, April 2006.

[NO98] Koji Nakano and Stephan Olariu. Randomized O(log log n)-Round Leader Election Protocols in Packet Radio Networks. In International Symposium on Algorithms and Computation (ISAAC), 1998.

[Rob75] Lawrence G. Roberts. ALOHA packet system with and without slots and capture. SIGCOMM Comput. Commun. Rev., 5(2):28–42, April 1975.

[TM78] B. S. Tsybakov and V. A. Mikhailov. Slotted multiaccess packet broadcasting feedback channel. Problemy Peredachi Informatsii, 14:32–59, October–December 1978.

[Wil86] Dan E. Willard. Log-Logarithmic Selection Resolution Protocols in a Multiple Access Channel. SIAM J. Comput., 15(2):468–477, 1986.
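Closing the wireless chapter, the inequalities collected in Section 12.4 lend themselves to a quick numerical sanity check. The script below (sample sizes and all parameter choices are arbitrary) asserts Markov's inequality and the lower-tail Chernoff bound on a Bernoulli sum, and checks Theorems 12.23 and 12.24 pointwise:

```python
import math
import random

random.seed(0)
trials, n, p = 5000, 50, 0.3
# Y is a sum of n independent Bernoulli(p) variables, sampled `trials` times.
y = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
ey = p * n                                          # E[Y] = 15
delta = 0.5
emp = sum(v < (1 - delta) * ey for v in y) / trials
assert emp <= math.exp(-delta * delta / 2 * ey)     # Chernoff, Theorem 12.22
a = 30
assert sum(v >= a for v in y) / trials <= ey / a    # Markov, Theorem 12.21
for m, t in [(10, 3.0), (100, 50.0), (1000, -7.0)]:
    lhs = math.exp(t) * (1 - t * t / m)             # Theorem 12.23
    assert lhs <= (1 + t / m) ** m <= math.exp(t) + 1e-9
for q in (0.1, 0.5, 0.9):
    for k in (1, 2, 10):
        assert 1 - q <= (1 - q / k) ** k + 1e-12    # Theorem 12.24
print("all inequality checks passed")
```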
Chapter 13

Stabilization

A large branch of research in distributed computing deals with fault-tolerance.
Being able to tolerate a considerable fraction of failing or even maliciously be-
having ("Byzantine") nodes while trying to reach consensus (on e.g. the output
of a function) among the nodes that work properly is crucial for building reli-
able systems. However, consensus protocols require that a majority of the nodes
remains non-faulty all the time.
  Can we design a distributed system that survives transient (short-lived)
failures, even if all nodes are temporarily failing? In other words, can we build
a distributed system that repairs itself?

13.1 Self-Stabilization

Definition 13.1 (Self-Stabilization). A distributed system is self-stabilizing if,
starting from an arbitrary state, it is guaranteed to converge to a legitimate
state. If the system is in a legitimate state, it is guaranteed to remain there,
provided that no further faults happen. A state is legitimate if the state satisfies
the specifications of the distributed system.

Remarks:

• What kind of transient failures can we tolerate? An adversary can
crash nodes, or make nodes behave Byzantine. Indeed, temporarily
an adversary can do harm in even worse ways, e.g. by corrupting the
volatile memory of a node (without the node noticing – not unlike
the movie Memento), or by corrupting messages on the fly (without
anybody noticing). However, as all failures are transient, eventually
all nodes must work correctly again, that is, crashed nodes get res-
urrected, Byzantine nodes stop being malicious, messages are being
delivered reliably, and the memory of the nodes is secure.

• Clearly, the read only memory (ROM) must be taboo at all times for
the adversary. No system can repair itself if the program code itself or
constants are corrupted. The adversary can only corrupt the variables
in the volatile random access memory (RAM).

Definition 13.2 (Time Complexity). The time complexity of a self-stabilizing
system is the time that passed after the last (transient) failure until the system
has converged to a legitimate state again, staying legitimate.

Remarks:

• Self-stabilization enables a distributed system to recover from a tran-
sient fault regardless of its nature. A self-stabilizing system does not
have to be initialized as it eventually (after convergence) will behave
correctly.

• One of the first self-stabilizing algorithms was Dijkstra's token ring
network. A token ring is an early form of a local area network where
nodes are arranged in a ring, communicating by a token. The sys-
tem is correct if there is exactly one token in the ring. Let's have
a look at a simple solution. Given an oriented ring, we simply call
the clockwise neighbor parent (p), and the counterclockwise neigh-
bor child (c). Also, there is a leader node v0. Every node v is in a
state S(v) ∈ {0, 1, . . . , n}, perpetually informing its child about its
state. The token is implicitly passed on by nodes switching state.
Upon noticing a change of the parent state S(p), node v executes the
following code:

Algorithm 52 Self-stabilizing Token Ring
1: if v = v0 then
2:   if S(v) = S(p) then
3:     S(v) := S(v) + 1 (mod n + 1)
4:   end if
5: else
6:   S(v) := S(p)
7: end if

Theorem 13.3. Algorithm 52 stabilizes correctly.

Proof: As long as some nodes or edges are faulty, anything can happen. In self-
stabilization, we only consider the system after all faults already have happened
(at time t0, however starting in an arbitrary state).
  Every node apart from leader v0 will always attain the state of its parent.
It may happen that one node after the other will learn the current state of the
leader. In this case the system stabilizes after the leader increases its state at
most n time units after time t0. It may however be that the leader increases its
state even if the system is not stable, e.g. because its parent or parent's parent
accidentally had the same state at time t0.
  The leader will increase its state possibly multiple times without reaching
stability, however, at some point the leader will reach state s, a state that no
other node had at time t0. (Since there are n nodes but n + 1 states, this will
eventually happen.) At this point the system must stabilize because the leader
cannot push for s + 1 (mod n + 1) until every node (including its parent) has s.
  After stabilization, there will always be only one node changing its state,
i.e., the system remains in a legitimate state. □
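Algorithm 52 is easy to simulate. The sketch below runs the ring synchronously from an arbitrary corrupted state; "legitimate" is checked as "exactly one node is enabled", i.e., exactly one token. The synchronous scheduling, the helper names, and the use of n + 1 states (one more state than nodes, which is what the fresh-state argument in the proof needs) are choices of this sketch:

```python
import random

def enabled(states):
    """Number of nodes that would change state under Algorithm 52: the
    leader v0 iff S(v0) = S(parent), any other node iff S(v) != S(parent).
    Exactly one enabled node means exactly one token: a legitimate state."""
    n = len(states)
    e = sum(states[v] != states[v - 1] for v in range(1, n))
    return e + (states[0] == states[-1])

def step(states):
    """One synchronous round of the ring; node v's parent is v-1 (mod n).
    States live in {0, ..., n}, one more state than there are nodes."""
    n = len(states)
    new = [states[v - 1] for v in range(n)]   # everybody copies its parent
    new[0] = states[0]
    if states[0] == states[-1]:               # leader sees its own value
        new[0] = (states[0] + 1) % (n + 1)    # ... and injects a new token
    return new

random.seed(0)
n = 8
states = [random.randrange(n + 1) for _ in range(n)]  # arbitrary corruption
t = 0
while enabled(states) != 1 and t < 10 * n * n:
    states, t = step(states), t + 1
print("stabilized after", t, "rounds")
for _ in range(3 * n):          # once legitimate, the system stays legitimate
    states = step(states)
    assert enabled(states) == 1
```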
Remarks:

• Although one might think the time complexity of the algorithm is
quite bad, it is asymptotically optimal.

• It can be a lot of fun designing self-stabilizing algorithms. Let us try
to build a system, where the nodes organize themselves as a maximal
independent set (MIS, Chapter 7):

Algorithm 53 Self-stabilizing MIS
Require: Node IDs
Every node v executes the following code:
1: do atomically
2:   Leave MIS if a neighbor with a larger ID is in the MIS
3:   Join MIS if no neighbor with larger ID joins MIS
4:   Send (node ID, MIS or not MIS) to all neighbors
5: end do

Remarks:

• Note that the main idea of Algorithm 53 is from Algorithm 33, Chap-
ter 7.

• As long as some nodes are faulty, anything can happen: Faulty nodes
may for instance decide to join the MIS, but report to their neighbors
that they did not join the MIS. Similarly messages may be corrupted
during transport. As soon as the system (nodes, messages) is correct,
however, the system will converge to a MIS. (The arguments are the
same as in Chapter 7.)

• Self-stabilizing algorithms always run in an infinite loop, because tran-
sient failures can hit the system at any time. Without the infinite loop,
an adversary can always corrupt the solution "after" the algorithm
terminated.

• The problem of Algorithm 53 is its time complexity, which may be
linear in the number of nodes. This is not very exciting. We need
something better! Since Algorithm 53 was just the self-stabilizing
variant of the slow MIS Algorithm 33, maybe we can hope to "self-
stabilize" some of our fast algorithms from Chapter 7?

• Yes, we can! Indeed there is a general transformation that takes any
local algorithm (efficient but not fault-tolerant) and turns it into a self-
stabilizing algorithm, keeping the same level of efficiency and efficacy.
We present the general transformation below.

Theorem 13.4 (Transformation). We are given a deterministic local algorithm
A that computes a solution of a given problem in k synchronous communication
rounds. Using our transformation, we get a self-stabilizing system with time
complexity k. In other words, if the adversary does not corrupt the system for k
time units, the solution is stable. In addition, if the adversary does not corrupt
any node or message closer than distance k from a node u, node u will be stable.

Proof: In the proof, we present the transformation. First, however, we need to
be more formal about the deterministic local algorithm A. In A, each node of
the network computes its decision in k phases. In phase i, node u computes
its local variables according to its local variables and received messages of the
earlier phases. Then node u sends its messages of phase i to its neighbors.
Finally node u receives the messages of phase i from its neighbors. The set of
local variables of node u in phase i is given by L_u^i. (In the very first phase, node
u initializes its local variables with L_u^1.) The message sent from node u to node
v in phase i is denoted by m_{u,v}^i. Since the algorithm A is deterministic, node u
can compute its local variables L_u^i and messages m_{u,*}^i of phase i from its state
of earlier phases, by simply applying functions f_L and f_m. In particular,

    L_u^i = f_L(u, L_u^{i−1}, m_{*,u}^{i−1}), for i > 1, and    (13.1)
    m_{u,v}^i = f_m(u, v, L_u^i), for i ≥ 1.                    (13.2)

The self-stabilizing algorithm needs to simulate all the k phases of the local
algorithm A in parallel. Each node u stores its local variables L_u^1, . . . , L_u^k as well
as all messages received m_{*,u}^1, . . . , m_{*,u}^k in two tables in RAM. For simplicity,
each node u also stores all the sent messages m_{u,*}^1, . . . , m_{u,*}^k in a third table. If
a message or a local variable for a particular phase is unknown, the entry in the
table will be marked with a special value ⊥ ("unknown"). Initially, all entries
in the table are ⊥.
  Clearly, in the self-stabilizing model, an adversary can choose to change
table values at all times, and even reset these values to ⊥. Our self-stabilizing
algorithm needs to constantly work against this adversary. In particular, each
node u runs these two procedures constantly:

• For all neighbors: Send each neighbor v a message containing the complete
row of messages of algorithm A, that is, send the vector (m_{u,v}^1, . . . , m_{u,v}^k) to
neighbor v. Similarly, if neighbor u receives such a vector from neighbor
v, then neighbor u replaces neighbor v's row in the table of incoming
messages by the received vector (m_{v,u}^1, . . . , m_{v,u}^k).

• Because of the adversary, node u must constantly recompute its local
variables (including the initialization) and outgoing message vectors using
Functions (13.1) and (13.2) respectively.

The proof is by induction. Let N^i(u) be the i-neighborhood of node u (that
is, all nodes within distance i of node u). We assume that the adversary has not
corrupted any node in N^k(u) since time t0. At time t0 all nodes in N^k(u) will
check and correct their initialization. Following Equation (13.2), at time t0 all
nodes in N^k(u) will send the correct message entry for the first round (m_{*,*}^1) to
all neighbors. Asynchronous messages take at most 1 time unit to be received
at a destination. Hence, using the induction with Equations (13.1) and (13.2)
it follows that at time t0 + i, all nodes in N^{k−i}(u) have received the correct
messages m_{*,*}^1, . . . , m_{*,*}^i. Consequently, at time t0 + k node u has received all
messages of local algorithm A correctly, and will compute the same result value
as in A. □
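Algorithm 53 can likewise be simulated. In the sketch below (a synchronous toy model; the round structure, the random test graph, and all names are assumptions of this sketch, not of the text) the leave and join rules collapse into one update: a node is in the MIS iff no larger-ID neighbor is. A fixed point of this rule is exactly a maximal independent set:

```python
import random

def is_mis(adj, flags):
    """Check independence and maximality of the flagged node set."""
    ind = all(not (flags[u] and flags[v]) for u in adj for v in adj[u])
    maxi = all(flags[u] or any(flags[v] for v in adj[u]) for u in adj)
    return ind and maxi

def stabilize_mis(adj, flags):
    """Synchronous sketch of Algorithm 53: in every round each node sets
    its MIS flag to True iff no neighbor with a larger ID claims to be in
    the MIS. Starts from arbitrary (corrupted) flags."""
    nodes = sorted(adj)
    for r in range(len(nodes) + 2):
        new = {u: not any(v > u and flags[v] for v in adj[u]) for u in nodes}
        if new == flags:
            return r, flags   # fixed point = maximal independent set
        flags = new
    return None, flags

random.seed(1)
n = 12
adj = {u: set() for u in range(n)}
for u in range(n):
    for v in range(u + 1, n):
        if random.random() < 0.3:          # random test graph
            adj[u].add(v); adj[v].add(u)
flags = {u: random.random() < 0.5 for u in range(n)}  # corrupted start
rounds, mis = stabilize_mis(adj, flags)
print("fixed point after", rounds, "rounds; MIS =",
      sorted(u for u in mis if mis[u]))
```

Convergence follows by induction on IDs in decreasing order: the highest-ID node is correct and stable after one round, and each next node only reads larger-ID neighbors, so at most n rounds are needed, which matches the linear time complexity mentioned in the remarks.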
Remarks:

• Using our transformation (also known as "local checking"), designing
self-stabilizing algorithms just turned from art to craft.

• As we have seen, many local algorithms are randomized. This brings
two additional problems. Firstly, one may not exactly know how long
the algorithm will take. This is not really a problem since we can
simply send around all the messages needed, until the algorithm is
finished. The transformation of Theorem 13.4 works also if nodes
just send all messages that are not ⊥. Secondly, we must be careful
about the adversary. In particular we need to restrict the adversary
such that a node can produce a reproducible sufficiently long string
of random bits. This can be achieved by storing the sufficiently long
string along with the program code in the read only memory (ROM).
Alternatively, the algorithm might not store the random bit string in
its ROM, but only the seed for a random bit generator. We need this in
order to keep the adversary from reshuffling random bits until the bits
become "bad", and the expected (or with high probability) efficacy
or efficiency guarantees of the original local algorithm A cannot be
guaranteed anymore.

• Since most local algorithms have only a few communication rounds,
and only exchange small messages, the memory overhead of the trans-
formation is usually bearable. In addition, information can often be
compressed in a suitable way so that for many algorithms message
size will remain polylogarithmic. For example, the information of the
fast MIS algorithm (Algorithm 35) consists of a series of random val-
ues (one for each round), plus two boolean values per round. These
boolean values represent whether the node joins the MIS, or whether
a neighbor of the node joins the MIS. The order of the values tells in
which round a decision is made. Indeed, the series of random bits can
even be compressed just into the random seed value, and the neighbors
can compute the random values of each round themselves.

• There is hope that our transformation as well gives good algorithms
for mobile networks, that is for networks where the topology of the
network may change. Indeed, for deterministic local approximation
algorithms, this is true: If the adversary does not change the topology
of a node's k-neighborhood in time k, the solution will locally be stable
again.

• For randomized local approximation algorithms however, this is not
that simple. Assume for example, that we have a randomized local al-
gorithm for the dominating set problem. An adversary can constantly
switch the topology of the network, until it finds a topology for which
the random bits (which are not really random because these random
bits are in ROM) give a solution with a bad approximation ratio. By
defining a weaker adversarial model, we can fix this problem. Essen-
tially, the adversary needs to be oblivious, in the sense that it cannot
see the solution. Then it will not be possible for the adversary to
restart the random computation if the solution is "too good".

• Self-stabilization is the original approach, and self-organization may
be the general theme, but new buzzwords pop up every now and
then, e.g. self-configuration, self-management, self-regulation, self-
repairing, self-healing, self-optimization, self-adaptivity, or self-protection.
Generally all these are summarized as "self-*". One computing giant
coined the term "autonomic computing" to reflect the trend of self-
managing distributed systems.

13.2 Advanced Stabilization

We finish the chapter with a non-trivial example beyond self-stabilization, show-
ing the beauty and potential of the area: In a small town, every evening each
citizen calls all his (or her) friends, asking them whether they will vote for the
Democratic or the Republican party at the next election.¹ In our town citizens
listen to their friends, and everybody re-chooses his or her affiliation according
to the majority of friends.² Is this process going to "stabilize" (in one way or
another)?

¹We are in the US, and as we know from The Simpsons, you "throw your vote away" if
you vote for somebody else. As a consequence our example has two parties only.
²Assume for the sake of simplicity that everybody has an odd number of friends.

Remarks:

• Is eventually everybody voting for the same party? No.

• Will each citizen eventually stay with the same party? No.

• Will citizens that stayed with the same party for some time, stay with
that party forever? No.

• And if their friends also constantly root for the same party? No.

• Will this beast stabilize at all?!? Yes!

Eventually every citizen will either stay with the same party for the rest of her
life, or switch her opinion every day.

Theorem 13.5 (Dems & Reps). Eventually every citizen is rooting for the
same party every other day.

Proof: To prove that the opinions eventually become fixed or cycle every other
day, think of each friendship as a pair of (directed) edges, one in each direction.
Let us say an edge is currently bad if the party of the advising friend differs
from the next-day's party of the advised friend. In other words, the edge is bad
if the advised friend did not follow the advisor's opinion (which means that the
advisor was in the minority). An edge that is not bad, is good.
  Consider the out-edges of citizen u on day t, during which (say) u roots for
the Democrats. Assume that on day t, g out-edges of u are good, and b out-
edges are bad. Note that g + b is the degree of u. Since g out-edges are good, g
friends of u root for the Democrats on day t + 1. Likewise, b friends of u root
for the Republicans on day t + 1. In other words, on the evening of day t + 1
citizen u will receive g recommendations for Democrats, and b for Republicans.
We distinguish two cases:
• g > b: In this case, citizen u will again root for the Democrats on day t + 2. Note that this means, on day t + 1, exactly g in-edges of u are good, and exactly b in-edges are bad. In other words, the number of bad out-edges on day t is exactly the number of bad in-edges on day t + 1.

• g < b: In this case, citizen u will root for the Republicans on day t + 2. Please note that on day t + 1, exactly b in-edges of u are good, and exactly g in-edges are bad. In other words, the number of bad out-edges on day t was exactly the number of good in-edges on day t + 1 (and vice versa). This means that the number of bad out-edges on day t is strictly larger than the number of bad in-edges on day t + 1.

We can summarize these two cases by the following observation. If a citizen u votes for the same party on day t as on day t + 2, the number of her bad out-edges on day t is the same as the number of her bad in-edges on day t + 1. If a citizen u votes for different parties on the days t and t + 2, the number of her bad out-edges on day t is strictly larger than the number of her bad in-edges on day t + 1.

We now account for the total number of bad edges. We denote the total number of bad out-edges on day t with BOt, and the total number of bad in-edges on day t with BIt. Using the analysis of the two cases, and summing up for all citizens, we know that BOt ≥ BIt+1. Moreover, each out-edge of a citizen is an in-edge for another citizen, hence BOt = BIt. In fact, if any citizen switches her party from day t to day t + 2, we know that the total number of bad edges strictly decreases, i.e., BOt+1 = BIt+1 < BOt. But BO cannot decrease forever. Once BOt+1 = BOt, every citizen u votes for the same party on day t + 2 as u voted on day t, and the system stabilizes in the sense that every citizen will either stick with his or her party forever or flip-flop every day.

Remarks:

• The model can be generalized considerably by, for example, adding weights to vertices (meaning some citizens' opinions are more important than others), adding weights to edges (meaning the influence between some citizens is stronger than between others), allowing loops (citizens who consider their own current opinions as well), allowing tie-breaking mechanisms, and even allowing different thresholds for party changes.

• How long does it take until the system stabilizes?

• Some may be reminded of Conway's Game of Life: We are given an infinite two-dimensional grid of cells, each of which is in one of two possible states, dead or alive. Every cell interacts with its eight neighbors. In each round, the following transitions occur: Any live cell with fewer than two live neighbors dies, as if caused by loneliness. Any live cell with more than three live neighbors dies, as if by overcrowding. Any live cell with two or three live neighbors lives on to the next generation. Any dead cell with exactly three live neighbors is "born" and becomes a live cell. The initial pattern constitutes the "seed" of the system. The first generation is created by applying the above rules simultaneously to every cell in the seed; births and deaths happen simultaneously, and the discrete moment at which this happens is sometimes called a tick. (In other words, each generation is a pure function of the one before.) The rules continue to be applied repeatedly to create further generations. John Conway figured that these rules were enough to generate interesting situations, including "breeders" which create "guns" which in turn create "gliders". As such Life in some sense answers an old question by John von Neumann, whether there can be a simple machine that can build copies of itself. In fact Life is Turing complete, that is, as powerful as any computer.

Figure 13.1: A "glider gun". . .

Figure 13.2: . . . in action.

Chapter Notes

Self-stabilization was first introduced in a paper by Edsger W. Dijkstra in 1974 [Dij74], in the context of a token ring network. It was shown that the ring stabilizes in time Θ(n). For his work Dijkstra received the 2002 ACM PODC Influential Paper Award. Shortly after receiving the award he passed away. With Dijkstra being such an eminent person in distributed computing (e.g. concurrency, semaphores, mutual exclusion, deadlock, finding shortest paths in graphs, fault-tolerance, self-stabilization), the award was renamed the Edsger W. Dijkstra Prize in Distributed Computing. In 1991 Awerbuch et al. showed that any algorithm can be modified into a self-stabilizing algorithm that stabilizes in the same time that is needed to compute the solution from scratch [APSV91]. The Republicans vs. Democrats problem was popularized by Peter Winkler, in his column "Puzzled" [Win08]. Goles et al. already proved in [GO80] that any configuration of any such system with symmetric edge weights will end up in a situation where each citizen votes for the same party every second day.
Winkler additionally proved that the time such a system takes to stabilize is bounded by O(n²). Frischknecht et al. constructed a worst case graph which takes Ω(n²/log² n) rounds to stabilize [FKW13]. Keller et al. generalized these results in [KPW14], showing that a graph with symmetric edge weights stabilizes in O(W(G)), where W(G) is the sum of edge weights in graph G. They also constructed a weighted graph with exponential stabilization time. Closely related to this puzzle is the well known Game of Life which was described by the mathematician John Conway and made popular by Martin Gardner [Gar70]. In the Game of Life cells can be either dead or alive and change their states according to the number of alive neighbors.
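The period-1-or-2 behavior proved in this chapter (and in [GO80]) is easy to observe empirically. The following Python sketch is illustrative only; the circulant graph and all parameters are invented. It runs the synchronous majority dynamics on a symmetric graph with odd degrees (so there are never ties) and checks that the global state ends up fixed or flip-flopping every day:

```python
import random

def step(votes, neighbors):
    """One synchronous day: every citizen adopts the majority opinion of
    her neighbors (all degrees are odd, so there is never a tie)."""
    return tuple(
        1 if 2 * sum(votes[w] for w in neighbors[v]) > len(neighbors[v]) else 0
        for v in range(len(votes))
    )

def period(votes, neighbors, max_days=10_000):
    """Iterate until the global state repeats; return the cycle length."""
    seen = {votes: 0}
    for day in range(1, max_days + 1):
        votes = step(votes, neighbors)
        if votes in seen:
            return day - seen[votes]
        seen[votes] = day
    return None

# symmetric graph with odd degree 5: a circulant graph on 30 nodes
n = 30
neighbors = [[(v + d) % n for d in (1, -1, 2, -2, 15)] for v in range(n)]
random.seed(1)
initial = tuple(random.randint(0, 1) for _ in range(n))  # 1 = Democrats
assert period(initial, neighbors) in (1, 2)  # stick forever, or flip-flop daily
```

Note that symmetry of the edges matters here: on general directed graphs, longer cycles are possible.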

Bibliography
[APSV91] Baruch Awerbuch, Boaz Patt-Shamir, and George Varghese. Self-
Stabilization By Local Checking and Correction. In In Proceedings
of IEEE Symposium on Foundations of Computer Science (FOCS),
1991.
[Dij74] Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Communications of the ACM, 17(11):643–644, November 1974.
[FKW13] Silvio Frischknecht, Barbara Keller, and Roger Wattenhofer. Conver-
gence in (Social) Influence Networks. In 27th International Sympo-
sium on Distributed Computing (DISC), Jerusalem, Israel, October
2013.

[Gar70] M. Gardner. Mathematical Games: The fantastic combinations of John Conway's new solitaire game Life. Scientific American, 223:120–123, October 1970.
[GO80] E. Goles and J. Olivos. Periodic behavior of generalized threshold
functions. Discrete Mathematics, 30:187–189, 1980.

[KPW14] Barbara Keller, David Peleg, and Roger Wattenhofer. How even Tiny
Influence can have a Big Impact! In 7th International Conference
on Fun with Algorithms (FUN), Lipari Island, Italy, July 2014.
[Win08] P. Winkler. Puzzled. Communications of the ACM, 51(9):103–103,
August 2008.
Chapter 14

Labeling Schemes

Imagine you want to repeatedly query a huge graph, e.g., a social or a road network. For example, you might need to find out whether two nodes are connected, or what the distance between two nodes is. Since the graph is so large, you distribute it among multiple servers in your data center.

14.1 Adjacency

Theorem 14.1. It is possible to assign labels of size 2 log n bits to nodes in a tree so that for every pair u, v of nodes, it is easy to tell whether u is adjacent to v by just looking at u and v's labels.

Proof. Choose a root in the tree arbitrarily so that every non-root node has a parent. The label of each node u consists of two parts: The ID of u (from 1 to n), and the ID of u's parent (or nothing if u is the root).

Remarks:

• What we have constructed above is called a labeling scheme, more precisely a labeling scheme for adjacency in trees. Formally, a labeling scheme is defined as follows.

Definition 14.2. A labeling scheme consists of an encoder e and a decoder d. The encoder e assigns to each node v a label e(v). The decoder d receives the labels of the nodes in question and returns an answer to some query. The largest size (in bits) of a label assigned to a node is called the label size of the labeling scheme.

Remarks:

• In Theorem 14.1, the decoder receives two node labels e(u) and e(v), and its answer is Yes or No, depending on whether u and v are adjacent or not. The label size is 2 log n.

• The label size is the complexity measure we are going to focus on in this chapter. The run-time of the encoder and the decoder are two other complexity measures that are studied in the literature.

• There is an interesting connection between labeling schemes for adjacency and so-called induced-universal graphs: Let F be a family of graphs. The graph U(n) is called n-induced-universal for F if all G ∈ F with at most n nodes appear as a node-induced subgraph in U(n). (A node-induced subgraph of U(n) = (V, E) is any graph that can be obtained by taking a subset V′ of V and all edges from E which have both endpoints in V′.)

• In the movie Good Will Hunting, the big open question was to find all graphs of the family of homeomorphically irreducible (non-isomorphic, no node with degree 2) trees with 10 nodes, T10. What is the smallest induced-universal graph for T10?

• If a graph family F allows a labeling scheme for adjacency with label size f(n), then there are n-induced-universal graphs for F so that the size of U(n) is at most 2^f(n). Since the size of U(n) is exponential in f it is interesting to study the label size carefully: If f is log n, the size of U(n) is n, whereas if f is 2 log n the size of U(n) becomes n²!

• What about adjacency in general graphs?

Theorem 14.3. Any labeling scheme for adjacency in general graphs has a label size of at least Ω(n) bits.

Proof. Let Gn denote the family of graphs with n nodes, and assume there is a labeling scheme for adjacency in graphs from Gn with label size s. First, we argue that the encoder e must be injective on Gn: Since the labeling scheme is for adjacency, e cannot assign the same labels to two different graphs.

There are 2^s possible labels for any node, and for every G ∈ Gn we can choose n of them (with repetition). Thus, we obtain that

    |Gn| ≤ (2^s + n − 1 choose n).

Moreover, a graph in Gn can have at most (n choose 2) edges, and thus |Gn| ≥ 2^(n choose 2)/n! when taking into account that the order of the nodes is irrelevant. Canceling out the n! term and taking the logarithm on both sides of the inequality we conclude that s ∈ Ω(n).

Remarks:

• The lower bound for general graphs is a bit discouraging; we wanted to use labeling schemes for queries on large graphs!

• The situation is less dire if the graph is not arbitrary. For instance, in degree-bounded graphs, in planar graphs, and in trees, the bounds change to Θ(log n) bits.

• What about other queries, e.g., distance?

• Next, we will focus on rooted trees.
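Before doing so, note that the adjacency scheme from Theorem 14.1 is easy to prototype. The sketch below is illustrative Python (the encoding as a pair of integer IDs follows the proof; the tree itself is invented):

```python
def encode(parent):
    """parent[v] is v's parent ID, or None for the root.
    Label of v: (ID of v, ID of v's parent) -- 2 log n bits."""
    return {v: (v, p) for v, p in parent.items()}

def adjacent(label_u, label_v):
    """Decoder: u and v are adjacent iff one is the parent of the other."""
    (u, pu), (v, pv) = label_u, label_v
    return pu == v or pv == u

# toy tree: 1 is the root, 2 and 3 are children of 1, 4 is a child of 3
parent = {1: None, 2: 1, 3: 1, 4: 3}
labels = encode(parent)
assert adjacent(labels[1], labels[3])      # parent/child
assert adjacent(labels[4], labels[3])
assert not adjacent(labels[2], labels[4])  # cousins are not adjacent
```

The decoder really only sees the two labels, never the tree, which is the defining feature of a labeling scheme.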
14.2 Rooted Trees

Theorem 14.4. There is a 2 log n labeling scheme for ancestry, i.e., for two nodes u and v, find out if u is an ancestor of v in the rooted tree T.

Proof. Traverse the tree with a depth first search, and consider the obtained pre-ordering of the nodes, i.e., enumerate the nodes in the order in which they are first visited. For a node u denote by l(u) the index in the pre-order. Our encoder assigns labels e(u) = (l(u), r(u)) to each node u, where r(u) is the largest value l(v) that appears at any node v in the sub-tree rooted at u. With the labels assigned in this manner, we can find out whether u is an ancestor of v by checking if l(v) is contained in the interval (l(u), r(u)].

Algorithm 54 Naïve-Distance-Labeling(T)
1: Let l be the label of the root r of T
2: Let T1, . . . , Tδ be the sub-trees rooted at each of the δ children of r
3: for i = 1, . . . , δ do
4: The root of Ti gets the label obtained by appending i to l
5: Naïve-Distance-Labeling(Ti)
6: end for

Theorem 14.5. There is an O(n log n) labeling scheme for distance in trees.

Proof. Apply the encoder algorithm Naïve-Distance-Labeling(T) to label the tree T. The encoder assigns to every node v a sequence (l1, l2, . . . ). The length of a sequence e(v) is at most n, and each entry in the sequence requires at most log n bits. A label (l1, . . . , lk) of a node v corresponds to a path from r to v in T, and the nodes on the path are labeled (l1), (l1, l2), (l1, l2, l3) and so on. The distance between u and v in T is obtained by reconstructing the paths from e(u) and e(v).

Remarks:

• We can assign the labels more carefully to obtain a smaller label size. For that, we use the following heavy-light decomposition.

Algorithm 55 Heavy-Light-Decomposition(T)
1: Node r is the root of T
2: Let T1, . . . , Tδ be the sub-trees rooted at each of the δ children of r
3: Let Tmax be a largest tree in {T1, . . . , Tδ} in terms of number of nodes
4: Mark the edge (r, Tmax) as heavy
5: Mark all edges to other children of r as light
6: Assign the names 1, . . . , δ − 1 to the light edges of r
7: for i = 1, . . . , δ do
8: Heavy-Light-Decomposition(Ti)
9: end for

Theorem 14.6. There is an O(log² n) labeling scheme for distance in trees.

Proof. For our proof, use Heavy-Light-Decomposition(T) to partition T's edges into heavy and light edges. All heavy edges form a collection of paths, called the heavy paths. Moreover, every node is reachable from the root through a sequence of heavy paths connected with light edges. Instead of storing the whole path to reach a node, we only store the information about heavy paths and light edges that were taken to reach a node from the root.

For instance, if node u can be reached by first using 2 heavy edges, then the 7th light edge, then 3 heavy edges, and then the light edges 1 and 4, then we assign to u the label (2, 7, 3, 1, 4). For any node u, the path p(u) from the root to u is now specified by the label. The distance between any two nodes can be computed using the paths.

Since every parent has at most ∆ < n children, the name of a light edge has at most log n bits. The size (number of nodes in the sub-tree) of a light child is at most half the size of its parent, so a path can have at most log n light edges. Between any two light edges, there could be a heavy path, so we can have up to log n heavy paths in a label. The length of such a heavy path can be described with log n bits as well, since no heavy path has more than n nodes. Altogether we therefore need at most O(log² n) bits.

Remarks:

• One can show that any labeling scheme for distance in trees needs to use labels of size at least Ω(log² n).

• The distance encoder from Theorem 14.6 also supports decoders for other queries. To check for ancestry, it therefore suffices to check if p(u) is a prefix of p(v) or vice versa.

• The nearest common ancestor is the last node that is on both p(u) and p(v), and the separation level is the length of the path to that node.

• Two nodes are siblings if their distance is 2 but they are not ancestors.

• The heavy-light decomposition can be used to shave off a few bits in other labeling schemes, e.g., ancestry or adjacency.

14.3 Road Networks

Labeling schemes are used to quickly find shortest paths in road networks.

Remarks:

• A naïve approach is to store at every node u the shortest paths to all other nodes v. This requires an impractical amount of memory. For example, the road network for Western Europe has 18 million nodes and 44 million directed edges, and the USA road network has 24 million nodes and 58 million directed edges.

• What if we only store the next node on the shortest path to all targets? In a worst case this still requires Ω(n) bits per node. Moreover, answering a single query takes many invocations of the decoder.
• For simplicity, let us focus on answering distance queries only. Even if we only want to know the distance, storing the full table of n² distances costs more than 1000 TB, too much for storing it in RAM.

• The idea for the encoder is to compute a set S of hub nodes that lie on many shortest paths. We then store at each node u only the distance to the hub nodes that appear on shortest paths originating or ending in u.

• Given two labels e(u) and e(v), let H(u, v) denote the set of hub nodes that appear in both labels. The decoder now simply returns d(u, v) = min{dist(u, h) + dist(h, v) : h ∈ H(u, v)}, all of which can be computed from the two labels.

• The key in finding a good labeling scheme now lies in finding good hub nodes.

Algorithm 56 Naïve-Hub-Labeling(G)
1: Let P be the set of all n² shortest paths
2: while P ≠ ∅ do
3: Let h be a node which is on a maximum number of paths in P
4: for all paths p = (u, . . . , v) ∈ P do
5: if h is on p then
6: Add h with the distance dist(u, h) to the label of u
7: Add h with the distance dist(h, v) to the label of v
8: Remove p from P
9: end if
10: end for
11: end while

Remarks:

• Unfortunately, Algorithm 56 takes a prohibitively long time to compute.

• Another approach computes the set S as follows. The encoder (Algorithm 57) first constructs so-called shortest path covers. The node set Si is a shortest path cover if Si contains a node on every shortest path of length between 2^(i−1) and 2^i. At node v only the hub nodes in Si that are within the ball of radius 2^i around v (denoted by B(v, 2^i)) are stored.

Algorithm 57 Hub-Labeling(G)
1: for i = 1, . . . , log D do
2: Compute the shortest path cover Si
3: end for
4: for all v ∈ V do
5: Let Fi(v) be the set Si ∩ B(v, 2^i)
6: Let F(v) be the set F1(v) ∪ F2(v) ∪ . . .
7: The label of v consists of the nodes in F(v), with their distance to v
8: end for

Remarks:

• The size of the shortest path covers will determine how space efficient the solution will be. It turns out that real-world networks allow for small shortest path covers: The parameter h, the so-called highway dimension of G, is defined as h = max_{i,v} |Fi(v)|, and h is conjectured to be small for road networks.

• Computing Si with a minimal number of hubs is NP-hard, but one can compute an O(log n) approximation of Si in polynomial time. Consequently, the label size is at most O(h log n log D). By ordering the nodes in each label by their ID, the decoder can scan through both node lists in parallel in time O(h log n log D).

• While this approach yields good theoretical bounds, the encoder is still too slow in practice. Therefore, before computing the shortest path covers, the graph is contracted by introducing shortcuts first.

• Based on this approach a distance query on a continent-sized road network can be answered in less than 1 µs on current hardware, orders of magnitude faster than a single random disk access. Storing all the labels requires roughly 20 GB of RAM.

• The method can be extended to support shortest path queries, e.g., by storing the path to/from the hub nodes, or by recursively querying for nodes that lie on the shortest path to the hub.

Chapter Notes

Adjacency labelings were first studied by Breuer and Folkman [BF67]. The log n + O(log* n) upper bound for trees is due to [AR02] using a clustering technique. In contrast, it was shown that for general graphs the size of universal graphs is at least 2^((n−1)/2)! Since graphs of arboricity d can be decomposed into d forests [NW61], the labeling scheme from [AR02] can be used to label graphs of arboricity d with d log n + O(log n) bit labels. For a thorough survey on labeling schemes for rooted trees please check [AHR].

Universal graphs were studied already by Ackermann [Ack37], and later by Erdős, Rényi, and Rado [ER63, Rad64]. The connection between labeling schemes and universal graphs [KNR88] was investigated thoroughly. Our adjacency lower bound follows the presentation in [AKTZ14], which also summarizes recent results in this field of research.

Distance labeling schemes were first studied by Peleg [Pel00]. The notion of highway dimension was introduced by [AFGW10] in an attempt to explain the good performance of many heuristics to speed up shortest path computations, e.g., Transit Node Routing [BFSS07]. Their suggestions to modify the SHARC heuristic [BD08] led to the hub labeling scheme and were implemented and evaluated [ADGW11], and later refined [DGSW14]. The Ω(n) label size lower bound for routing (shortest paths) with stretch smaller than 3 is due to [GG01].

This chapter was written in collaboration with Jochen Seidel. Thanks to Noy Rotbart for suggesting the topic.
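To make the hub-label decoder from Section 14.3 concrete, here is a toy sketch. The four-node path graph and its labels are invented for illustration; a real encoder would compute shortest path covers as in Algorithm 57. Each label is a list of (hub ID, distance) pairs sorted by hub ID, so both lists can be scanned in parallel:

```python
def decode_distance(label_u, label_v):
    """Return min over common hubs h of dist(u,h) + dist(h,v),
    scanning both hub lists (sorted by hub ID) in parallel."""
    i, j, best = 0, 0, float("inf")
    while i < len(label_u) and j < len(label_v):
        hu, du = label_u[i]
        hv, dv = label_v[j]
        if hu == hv:
            best = min(best, du + dv)
            i += 1
            j += 1
        elif hu < hv:
            i += 1
        else:
            j += 1
    return best

# path graph 1 - 2 - 3 - 4, with nodes 2 and 3 chosen as hubs;
# every shortest path contains at least one of them
label = {
    1: [(2, 1), (3, 2)],
    2: [(2, 0), (3, 1)],
    3: [(2, 1), (3, 0)],
    4: [(2, 2), (3, 1)],
}
assert decode_distance(label[1], label[4]) == 3
assert decode_distance(label[2], label[3]) == 1
```

The parallel scan is exactly the O(h log n log D) decoder mentioned in the remarks: the decoder only touches the two labels, never the graph.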
Bibliography
[Ack37] Wilhelm Ackermann. Die Widerspruchsfreiheit der allgemeinen
Mengenlehre. Mathematische Annalen, 114(1):305–315, 1937.
[ADGW11] Ittai Abraham, Daniel Delling, Andrew V. Goldberg, and Renato
Fonseca F. Werneck. A hub-based labeling algorithm for shortest
paths in road networks. In SEA, 2011.
[AFGW10] Ittai Abraham, Amos Fiat, Andrew V. Goldberg, and Renato Fon-
seca F. Werneck. Highway dimension, shortest paths, and provably
efficient algorithms. In SODA, 2010.

[AHR] Stephen Alstrup, Esben Bistrup Halvorsen, and Noy Rotbart. A survey on labeling schemes for trees. To appear.
[AKTZ14] Stephen Alstrup, Haim Kaplan, Mikkel Thorup, and Uri Zwick.
Adjacency labeling schemes and induced-universal graphs. CoRR,
abs/1404.3391, 2014.
[AR02] Stephen Alstrup and Theis Rauhe. Small induced-universal graphs
and compact implicit graph representations. In FOCS, 2002.
[BD08] Reinhard Bauer and Daniel Delling. SHARC: fast and robust uni-
directional routing. In ALENEX, 2008.

[BF67] Melvin A. Breuer and Jon Folkman. An unexpected result in coding the vertices of a graph. Journal of Mathematical Analysis and Applications, 20(3):583–600, 1967.
[BFSS07] Holger Bast, Stefan Funke, Peter Sanders, and Dominik Schultes.
Fast routing in road networks with transit nodes. Science,
316(5824):566, 2007.
[DGSW14] Daniel Delling, Andrew V. Goldberg, Ruslan Savchenko, and Re-
nato F. Werneck. Hub labels: Theory and practice. In SEA, 2014.
[ER63] P. Erdős and A. Rényi. Asymmetric graphs. Acta Mathematica
Academiae Scientiarum Hungarica, 14(3-4):295–315, 1963.
[GG01] Cyril Gavoille and Marc Gengler. Space-efficiency for routing
schemes of stretch factor three. J. Parallel Distrib. Comput.,
61(5):679–687, 2001.

[KNR88] Sampath Kannan, Moni Naor, and Steven Rudich. Implicit repre-
sentation of graphs. In STOC, 1988.
[NW61] C. St. J. A. Nash-Williams. Edge-disjoint spanning trees of finite
graphs. J. London Math. Soc., 36:445–450, 1961.
[Pel00] David Peleg. Proximity-preserving labeling schemes. Journal of
Graph Theory, 33(3):167–176, 2000.
[Rad64] Richard Rado. Universal graphs and universal functions. Acta
Arith., 9:331–340, 1964.
Chapter 15

Fault-Tolerance & Paxos

How do you create a fault-tolerant distributed system? In this chapter we start out with simple questions, and, step by step, improve our solutions until we arrive at a system that works even under adverse circumstances, Paxos.

15.1 Client/Server

Definition 15.1 (node). We call a single actor in the system a node. In a computer network the computers are the nodes, in the classical client-server model both the server and the client are nodes, and so on. If not stated otherwise, the total number of nodes in the system is n.

Model 15.2 (message passing). In the message passing model we study distributed systems that consist of a set of nodes. Each node can perform local computations, and can send messages to every other node.

Remarks:

• We start with two nodes, the smallest number of nodes in a distributed system. We have a client node that wants to "manipulate" data (e.g., store, update, . . . ) on a remote server node.

Algorithm 58 Naïve Client-Server Algorithm
1: Client sends commands one at a time to server

Model 15.3 (message loss). In the message passing model with message loss, for any specific message, it is not guaranteed that it will arrive safely at the receiver.

Remarks:

• A related problem is message corruption, i.e., a message is received but the content of the message is corrupted. In practice, in contrast to message loss, message corruption can be handled quite well, e.g. by including additional information in the message, such as a checksum.

• Algorithm 58 does not work correctly if there is message loss, so we need a little improvement.

Algorithm 59 Client-Server Algorithm with Acknowledgments
1: Client sends commands one at a time to server
2: Server acknowledges every command
3: If the client does not receive an acknowledgment within a reasonable time, the client resends the command

Remarks:

• Sending commands "one at a time" means that when the client sent command c, the client does not send any new command c′ until it received an acknowledgment for c.

• Since not only messages sent by the client can be lost, but also acknowledgments, the client might resend a message that was already received and executed on the server. To prevent multiple executions of the same command, one can add a sequence number to each message, allowing the receiver to identify duplicates.

• This simple algorithm is the basis of many reliable protocols, e.g. TCP.

• The algorithm can easily be extended to work with multiple servers: The client sends each command to every server, and once the client received an acknowledgment from each server, the command is considered to be executed successfully.

• What about multiple clients?

Model 15.4 (variable message delay). In practice, messages might experience different transmission times, even if they are being sent between the same two nodes.

Remarks:

• Throughout this chapter, we assume the variable message delay model.

Theorem 15.5. If Algorithm 59 is used with multiple clients and multiple servers, the servers might see the commands in different order, leading to an inconsistent state.

Proof. Assume we have two clients u1 and u2, and two servers s1 and s2. Both clients issue a command to update a variable x on the servers, initially x = 0. Client u1 sends command x = x + 1 and client u2 sends x = 2 · x.

Let both clients send their message at the same time. With variable message delay, it can happen that s1 receives the message from u1 first, and s2 receives the message from u2 first.1 Hence, s1 computes x = (0 + 1) · 2 = 2 and s2 computes x = (0 · 2) + 1 = 1.

1 For example, u1 and s1 are (geographically) located close to each other, and so are u2 and s2.
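The two interleavings in the proof of Theorem 15.5 can be replayed in a few lines of plain Python, using the same numbers as above (the helper function is only for illustration):

```python
def run(server_order, x=0):
    """Apply the two commands in the order a server happens to receive them."""
    commands = {"u1": lambda x: x + 1, "u2": lambda x: 2 * x}
    for client in server_order:
        x = commands[client](x)
    return x

s1 = run(["u1", "u2"])  # s1 sees u1's command first: (0 + 1) * 2
s2 = run(["u2", "u1"])  # s2 sees u2's command first: (0 * 2) + 1
assert s1 == 2 and s2 == 1  # the two replicas diverge
```

Because the two commands do not commute, any mechanism that lets servers apply them in different orders breaks replica consistency, which motivates the next definition.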
Definition 15.6 (state replication). A set of nodes achieves state replication, if all nodes execute a (potentially infinite) sequence of commands c1, c2, c3, . . . , in the same order.

Remarks:

• State replication is a fundamental property for distributed systems.

• For people working in the financial tech industry, state replication is often synonymous with the term blockchain. The Bitcoin blockchain we will discuss in Chapter 20 is indeed one way to implement state replication. However, as we will see in all the other chapters, there are many alternative concepts that are worth knowing, with different properties.

• Since state replication is trivial with a single server, we can designate a single server as a serializer. By letting the serializer distribute the commands, we automatically order the requests and achieve state replication!

Algorithm 60 State Replication with a Serializer
1: Clients send commands one at a time to the serializer
2: Serializer forwards commands one at a time to all other servers
3: Once the serializer received all acknowledgments, it notifies the client about the success

Remarks:

• This idea is sometimes also referred to as master-slave replication.

• What about node failures? Our serializer is a single point of failure!

• Can we have a more distributed approach of solving state replication? Instead of directly establishing a consistent order of commands, we can use a different approach: We make sure that there is always at most one client sending a command; i.e., we use mutual exclusion, respectively locking.

Algorithm 61 Two-Phase Protocol
Phase 1
1: Client asks all servers for the lock
Phase 2
2: if client receives lock from every server then
3: Client sends command reliably to each server, and gives the lock back
4: else
5: Client gives the received locks back
6: Client waits, and then starts with Phase 1 again
7: end if

Remarks:

• This idea appears in many contexts and with different names, usually with slight variations, e.g. two-phase locking (2PL).

• Another example is the two-phase commit (2PC) protocol, typically presented in a database environment. The first phase is called the preparation of a transaction, and in the second phase the transaction is either committed or aborted. The 2PC process is not started at the client but at a designated server node that is called the coordinator.

• It is often claimed that 2PL and 2PC provide better consistency guarantees than a simple serializer if nodes can recover after crashing. In particular, alive nodes might be kept consistent with crashed nodes, for transactions that started while the crashed node was still running. This benefit was even improved in a protocol that uses an additional phase (3PC).

• The problem with 2PC or 3PC is that they are not well-defined if exceptions happen.

• Does Algorithm 61 really handle node crashes well? No! In fact, it is even worse than the simple serializer approach (Algorithm 60): Instead of having only one node which must be available, Algorithm 61 requires all servers to be responsive!

• Does Algorithm 61 also work if we only get the lock from a subset of servers? Is a majority of servers enough?

• What if two or more clients concurrently try to acquire a majority of locks? Do clients have to abandon their already acquired locks, in order not to run into a deadlock? How? And what if they crash before they can release the locks? Do we need a slightly different concept?

15.2 Paxos

Definition 15.7 (ticket). A ticket is a weaker form of a lock, with the following properties:

• Reissuable: A server can issue a ticket, even if previously issued tickets have not yet been returned.

• Ticket expiration: If a client sends a message to a server using a previously acquired ticket t, the server will only accept t, if t is the most recently issued ticket.
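Both ticket properties can be realized with a simple per-server counter. The following minimal single-server sketch is illustrative only (class and method names are invented):

```python
class TicketServer:
    """A ticket is just the current counter value: new tickets can be
    issued at any time (reissuable), and only the most recently issued
    ticket is accepted (ticket expiration)."""

    def __init__(self):
        self.counter = 0

    def issue_ticket(self):
        self.counter += 1
        return self.counter

    def use_ticket(self, t):
        return t == self.counter  # expired tickets are rejected

server = TicketServer()
t1 = server.issue_ticket()
t2 = server.issue_ticket()   # issuing t2 invalidates t1
assert not server.use_ticket(t1)
assert server.use_ticket(t2)
```

Unlike a lock, a client that crashes while holding a ticket blocks nobody: the server simply issues a fresh ticket to the next client.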
Remarks:

• There is no problem with crashes: If a client crashes while holding a ticket, the remaining clients are not affected, as servers can simply issue new tickets.

• Tickets can be implemented with a counter: Each time a ticket is requested, the counter is increased. When a client tries to use a ticket, the server can determine if the ticket is expired.

• What can we do with tickets? Can we simply replace the locks in Algorithm 61 with tickets? We need to add at least one additional phase, as only the client knows if a majority of the tickets have been valid in Phase 2.

Algorithm 62 Naïve Ticket Protocol
Phase 1
1: Client asks all servers for a ticket
Phase 2
2: if a majority of the servers replied then
3: Client sends command together with ticket to each server
4: Server stores command only if ticket is still valid, and replies to client
5: else
6: Client waits, and then starts with Phase 1 again
7: end if
Phase 3
8: if client hears a positive answer from a majority of the servers then
9: Client tells servers to execute the stored command
10: else
11: Client waits, and then starts with Phase 1 again
12: end if

Remarks:

• There are problems with this algorithm: Let u1 be the first client that successfully stores its command c1 on a majority of the servers. Assume that u1 becomes very slow just before it can notify the servers (Line 9), and a client u2 updates the stored command in some servers to c2. Afterwards, u1 tells the servers to execute the command. Now some servers will execute c1 and others c2!

• How can this problem be fixed? We know that every client u2 that updates the stored command after u1 must have used a newer ticket than u1. As u1's ticket was accepted in Phase 2, it follows that u2 must have acquired its ticket after u1 already stored its value in the respective server.

• Idea: What if a server, instead of only handing out tickets in Phase 1, also notifies clients about its currently stored command? Then, u2 learns that u1 already stored c1 and instead of trying to store c2, u2 could support u1 by also storing c1. As both clients try to store and execute the same command, the order in which they proceed is no longer a problem.

• But what if not all servers have the same command stored, and u2 learns multiple stored commands in Phase 1. Which command should u2 support?

• Observe that it is always safe to support the most recently stored command. As long as there is no majority, clients can support any command. However, once there is a majority, clients need to support this value.

• So, in order to determine which command was stored most recently, servers can remember the ticket number that was used to store the command, and afterwards tell this number to clients in Phase 1.

• If every server uses its own ticket numbers, the newest ticket does not necessarily have the largest number. This problem can be solved if clients suggest the ticket numbers themselves!
15.2. PAXOS 171 172 CHAPTER 15. FAULT-TOLERANCE & PAXOS

Algorithm 63 Paxos

Initialization:
  Client (Proposer):
    c ⊳ command to execute
    t = 0 ⊳ ticket number to try
  Server (Acceptor):
    Tmax = 0 ⊳ largest issued ticket
    C = ⊥ ⊳ stored command
    Tstore = 0 ⊳ ticket used to store C

Phase 1:
  Client:
    1: t = t + 1
    2: Ask all servers for ticket t
  Server:
    3: if t > Tmax then
    4:   Tmax = t
    5:   Answer with ok(Tstore, C)
    6: end if

Phase 2:
  Client:
    7: if a majority answers ok then
    8:   Pick (Tstore, C) with largest Tstore
    9:   if Tstore > 0 then
    10:    c = C
    11:  end if
    12:  Send propose(t, c) to same majority
    13: end if
  Server:
    14: if t = Tmax then
    15:  C = c
    16:  Tstore = t
    17:  Answer success
    18: end if

Phase 3:
  Client:
    19: if a majority answers success then
    20:  Send execute(c) to every server
    21: end if

Remarks:

• Unlike previously mentioned algorithms, there is no step where a client explicitly decides to start a new attempt and jumps back to Phase 1. Note that this is not necessary, as a client can decide to abort the current attempt and start a new one at any point in the algorithm. This has the advantage that we do not need to be careful about selecting "good" values for timeouts, as correctness is independent of the decisions when to start new attempts.

• The performance can be improved by letting the servers send negative replies in Phases 1 and 2 if the ticket expired.

• The contention between different clients can be alleviated by randomizing the waiting times between consecutive attempts.

Lemma 15.8. We call a message propose(t,c) sent by clients on Line 12 a proposal for (t,c). A proposal for (t,c) is chosen if it is stored by a majority of servers (Line 15). For every issued propose(t′,c′) with t′ > t, it holds that c′ = c if there was a chosen propose(t,c).

Proof. Observe that there can be at most one proposal for every ticket number τ, since clients only send a proposal if they received a majority of the tickets for τ (Line 7). Hence, every proposal is uniquely identified by its ticket number τ.

Assume that there is at least one propose(t′,c′) with t′ > t and c′ ≠ c; of such proposals, consider the proposal with the smallest ticket number t′. Since both this proposal and also the propose(t,c) have been sent to a majority of the servers, we can denote by S the non-empty intersection of servers that have been involved in both proposals. Recall that since propose(t,c) has been chosen, at least one server s ∈ S must have stored command c; thus, when the command was stored, the ticket number t was still valid. Hence, s must have received the request for ticket t′ after it already stored propose(t,c), as the request for ticket t′ invalidates ticket t.

Therefore, the client that sent propose(t′,c′) must have learned from s that a client already stored propose(t,c). Since a client adapts its proposal to the command that is stored with the highest ticket number so far (Line 8), the client must have proposed c as well. There is only one possibility that would lead to the client not adapting c: if the client received the information from a server that some client stored propose(t*,c*), with c* ≠ c and t* > t. But in that case, a client must have sent propose(t*,c*) with t < t* < t′, which contradicts the assumption that t′ is the smallest ticket number of a proposal issued after t.

Theorem 15.9. If a command c is executed by some servers, all servers (eventually) execute c.

Proof. From Lemma 15.8 we know that once a proposal for c is chosen, every subsequent proposal is for c. As there is exactly one first propose(t,c) that is chosen, it follows that all successful proposals will be for the command c. Thus, only proposals for a single command c can be chosen, and since clients only tell servers to execute a command when it is chosen (Line 20), each client will eventually tell every server to execute c.

Remarks:

• If the client with the first successful proposal does not crash, it will directly tell every server to execute c.

• However, if the client crashes before notifying any of the servers, the servers will execute the command only once the next client is successful. Once a server has received a request to execute c, it can inform every client that arrives later that there is already a chosen command, so that the client does not waste time with the proposal process.
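The two roles of Algorithm 63 can be sketched in a few lines of Python. This is a minimal, failure-free, single-instance simulation (class and function names are our own), with message passing replaced by direct method calls:

```python
# Sketch of one Paxos instance following Algorithm 63. Asynchrony, crashes
# and retries are not modeled; each call to run_proposer is one attempt.
class Acceptor:  # "server" in the text
    def __init__(self):
        self.t_max = 0      # largest issued ticket
        self.c = None       # stored command C (initially ⊥)
        self.t_store = 0    # ticket used to store C

    def ticket(self, t):                       # Phase 1, lines 3-6
        if t > self.t_max:
            self.t_max = t
            return ("ok", self.t_store, self.c)
        return None

    def propose(self, t, c):                   # Phase 2, lines 14-18
        if t == self.t_max:
            self.c, self.t_store = c, t
            return "success"
        return None

def run_proposer(command, acceptors, t):       # "client" in the text
    # Phase 1: ask everybody for ticket t, keep the ok answers.
    oks = [(s, s.ticket(t)) for s in acceptors]
    oks = [(s, r) for s, r in oks if r is not None]
    if len(oks) <= len(acceptors) // 2:
        return None  # no majority; a real client retries with a larger t
    # Phase 2, lines 8-11: adopt the command stored with the largest Tstore.
    best = max((r for _, r in oks), key=lambda r: r[1])
    if best[1] > 0:
        command = best[2]
    successes = [s for s, _ in oks if s.propose(t, command) == "success"]
    if len(successes) > len(acceptors) // 2:
        return command  # Phase 3: chosen; now send execute(command)
    return None

acceptors = [Acceptor() for _ in range(3)]
print(run_proposer("A", acceptors, t=1))  # A
print(run_proposer("B", acceptors, t=2))  # A -- B must support the choice
```

The second call illustrates Lemma 15.8: a later proposer with a higher ticket learns the stored command in Phase 1 and adopts it instead of its own.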

• Note that Paxos cannot make progress if half (or more) of the servers crash, as clients cannot achieve a majority anymore.

• The original description of Paxos uses three roles: proposers, acceptors and learners. Learners have a trivial role: they do nothing, they just learn from other nodes which command was chosen.

• We assigned every node only one role. In some scenarios, it might be useful to allow a node to have multiple roles. For example, in a peer-to-peer scenario nodes need to act as both client and server.

• Clients (proposers) must be trusted to follow the protocol strictly. However, in many scenarios this is not a reasonable assumption. In such scenarios, the role of the proposer can be executed by a set of servers, and clients need to contact proposers to propose values in their name.

• So far, we only discussed how a set of nodes can reach a decision on a single command with the help of Paxos. We call such a single decision an instance of Paxos.

• If we want to execute multiple commands, we can extend each instance with an instance number that is sent around with every message. Once a command is chosen, any client can decide to start a new instance with the next number. If a server did not realize that the previous instance came to a decision, the server can ask other servers about the decisions to catch up.

Chapter Notes

Two-phase protocols have been around for a long time, and it is unclear if there is a single source of this idea. One of the earlier descriptions of this concept can be found in the book of Gray [Gra78].

Leslie Lamport introduced Paxos in 1989. But why is it called Paxos? Lamport described the algorithm as the solution to a problem of the parliament of a fictitious Greek society on the island Paxos. He liked this idea so much that he even gave some lectures in the persona of an Indiana-Jones-style archaeologist! When the paper was submitted, many readers were so distracted by the descriptions of the activities of the legislators that they did not understand the meaning and purpose of the algorithm. The paper was rejected. But Lamport refused to rewrite the paper, and he later wrote that he "was quite annoyed at how humorless everyone working in the field seemed to be". A few years later, when the need for a protocol like Paxos arose again, Lamport simply took the paper out of the drawer and gave it to his colleagues. They liked it. So Lamport decided to submit the paper (in basically unaltered form!) again, 8 years after he wrote it – and it got accepted! But as this paper [Lam98] is admittedly hard to read, he had mercy, and later wrote a simpler description of Paxos [Lam01].

This chapter was written in collaboration with David Stolz.

Bibliography

[Gra78] James N. Gray. Notes on data base operating systems. Springer, 1978.

[Lam98] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, 1998.

[Lam01] Leslie Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, 2001.

Chapter 16

Consensus

This chapter is the first to deal with fault tolerance, one of the most fundamental aspects of distributed computing. Indeed, in contrast to a system with a single processor, having a distributed system may permit getting away with failures and malfunctions of parts of the system. This line of research was motivated by the basic question whether, e.g., putting two (or three?) computers into the cockpit of a plane will make the plane more reliable. Clearly fault-tolerance often comes at a price, as having more than one decision-maker often complicates decision-making.

16.1 Impossibility of Consensus

Imagine two cautious generals who want to attack a common enemy.¹ Their only means of communication are messengers. Unfortunately, the route of these messengers leads through hostile enemy territory, so there is a chance that a messenger does not make it. Only if both generals attack at the very same time can the enemy be defeated. Can we devise a protocol such that the two generals can agree on an attack time? Clearly general A can send a message to general B asking to, e.g., "attack at 6am". However, general A cannot be sure that this message will make it, so she asks for a confirmation. The problem is that general B, getting the message, cannot be sure that her confirmation will reach general A. If the confirmation message indeed is destroyed, general A cannot distinguish this case from the case where general B did not even get the attack information. So, to be safe, general B herself will ask for a confirmation of her confirmation. Taking again the position of general A, we can similarly derive that she cannot be sure unless she also gets a confirmation of the confirmation of the confirmation...

¹ If you don't fancy the martial tone of this classic example, feel free to think about something else, for instance two friends trying to make plans for dinner over instant messaging software, or two lecturers sharing the teaching load of a course trying to figure out who is in charge of the next lecture.

To make things worse, different approaches also do not seem to work. In fact, it can be shown that this two generals problem cannot be solved; in other words, there is no finite protocol which lets the two generals find consensus! To show this, we need to be a bit more formal:

Definition 16.1 (Consensus). Consider a distributed system with n nodes. Each node i has an input xi. A solution of the consensus problem must guarantee the following:

• Termination: Every non-faulty node eventually decides.

• Agreement: All non-faulty nodes decide on the same value.

• Validity: The decided value must be the input of at least one node.

Remarks:

• The validity condition implies that if all nodes have the same input x, then the nodes need to decide on x. Please note that consensus is not democratic; it may well be that the nodes decide on an input value promoted by a small minority.

• Whether consensus is possible depends on many parameters of the distributed system, in particular whether the system is synchronous or asynchronous, or what "faulty" means. In the following we study some simple variants to get a feeling for the problem.

• Consensus is a powerful primitive. With established consensus almost everything can be computed in a distributed system, e.g. a leader.

Given a distributed asynchronous message passing system with n ≥ 2 nodes. All nodes can communicate directly with all other nodes, simply by sending a message. In other words, the communication graph is the complete graph. Can the consensus problem be solved? Yes!

Algorithm 64 Trivial Consensus
1: Each node has an input
2: We have a leader, e.g. the node with the highest ID
3: if node v is the leader then
4:   the leader shall simply decide on its own input
5: else
6:   send message to the leader asking for its input
7:   wait for answer message by leader, and decide on that
8: end if

Remarks:

• This algorithm is quite simple, and at first sight seems to work perfectly, as all three consensus conditions of Definition 16.1 are fulfilled.

• However, the algorithm is not fault-tolerant at all. If the leader crashes before being able to answer all requests, there are nodes which will never terminate, and hence violate the termination condition. Is there a deterministic protocol that can achieve consensus in an asynchronous system, even in the presence of failures? Let's first try something slightly different.


Definition 16.2 (Reliable Broadcast). Consider an asynchronous distributed system with n nodes that may crash. Any two nodes can exchange messages, i.e., the communication graph is complete. We want node v to send a reliable broadcast to the n − 1 other nodes. Reliable means that either nobody receives the message, or everybody receives the message.

Remarks:

• This seems to be quite similar to consensus, right?

• The main problem is that the sender may crash while sending the message to the n − 1 other nodes, such that some of them get the message and the others do not. We need a technique that deals with this case:

Algorithm 65 Reliable Broadcast
1: if node v is the source of message m then
2:   send message m to each of the n − 1 other nodes
3:   upon receiving m from any other node: broadcast succeeded!
4: else
5:   upon receiving message m for the first time:
6:   send message m to each of the n − 1 other nodes
7: end if

Theorem 16.3. Algorithm 65 solves reliable broadcast as in Definition 16.2.

Proof. First we should note that we do not care about nodes that crash during the execution: whether or not they receive the message is irrelevant since they crashed anyway. If a single non-faulty node u received the message (no matter how, it may be that it received it through a path of crashed nodes), all non-faulty nodes will receive the message through u. If no non-faulty node receives the message, we are fine as well!

Remarks:

• While it is clear that we could also solve reliable broadcast by means of a consensus protocol (first send message, then agree on having received it), the opposite seems more tricky!

• No wonder, it cannot be done!! For the presentation of this impossibility result we use the read/write shared memory model introduced in Chapter 5. Not only was the proof originally conceived in the shared memory model, it is also cleaner.

Definition 16.4 (Univalent, Bivalent). A distributed system is called x-valent if the outcome of a computation will be x. An x-valent system is also called univalent. If, depending on the execution, more than one possible outcome is still feasible, the system is called multivalent. If exactly two outcomes are still possible, the system is called bivalent.

Theorem 16.5. In an asynchronous shared memory system with n > 1 nodes, and node crash failures (but no memory failures!), consensus as in Definition 16.1 cannot be achieved by a deterministic algorithm.

Proof. Let us simplify the proof by setting n = 2. We have processes u and v, with input values xu and xv. Further, let the input values be binary, either 0 or 1.

First we have to make sure that there are input values such that initially the system is bivalent. If xu = 0 and xv = 0, the system is 0-valent, because of the validity condition (Definition 16.1). Even in the case where process v immediately crashes, the system remains 0-valent. Similarly, if both input values are 1 and process u immediately crashes, the system is 1-valent. If xu = 0 and xv = 1 and v immediately crashes, process u cannot distinguish this from both nodes having input 0; equivalently, if u immediately crashes, process v cannot distinguish this from both having 1. Hence the system is bivalent!

In order to solve consensus an algorithm needs to terminate. All non-faulty processes need to decide on the same value x (agreement condition of Definition 16.1); in other words, at some instant this value x must be known to the system as a whole, meaning that no matter what the execution is, the system will be x-valent. In other words, the system needs to change from bivalent to univalent. We may ask ourselves what can cause this change in a deterministic asynchronous shared memory algorithm. We need an element of non-determinism; if everything happened deterministically, the system would have been x-valent even after initialization, which we proved to be impossible already.

The only nondeterministic elements in our model are the asynchrony of accessing the memory and crashing processes. Initially and after every memory access, each process decides what to do next: read or write a memory cell, or terminate with a decision. We take control of the scheduling, either choosing which request is served next or making a process crash. Now we hope for a critical bivalent state with more than one memory request, where depending on which memory request is served next the system is going to switch from bivalent to univalent. More concretely, if process u is being served next the system is going x-valent; if process v (with v ≠ u) is served next the system is going y-valent (with y ≠ x). We have several cases:

• If the operations of processes u and v target different memory cells, the processes cannot distinguish which memory request was executed first. Hence the local states of the processes are identical after serving both operations, and the state cannot be critical.

• The same argument holds if both processes want to read the same register. Nobody can distinguish which read was first, and the state cannot be critical.

• If process u reads memory cell c, and process v writes memory cell c, the scheduler first executes u's read. Now process v cannot distinguish whether that read of u did or did not happen before its write. If it did happen, v should decide on x; if it did not happen, v should decide y. But since v does not know which one is true, it needs to be informed about it by u. We prevent this by making u crash. Thus the state can only be univalent if v never decides, violating the termination condition!

• Also if both processes write the same memory cell, we have the same issue, since the second writer will immediately overwrite the first writer, and hence the second writer cannot know whether the first write happened at all. Again, the state cannot be critical.

Hence, if we are unlucky (and in a worst case, we are!) there is no critical state. In other words, the system will remain bivalent forever, and consensus is impossible.

Remarks:

• The proof presented is a variant of a proof by Michael Fischer, Nancy Lynch and Michael Paterson, a classic result in distributed computing. The proof was motivated by the problem of committing transactions in distributed database systems, but is sufficiently general that it directly implies the impossibility of a number of related problems, including consensus. The proof is also pretty robust with regard to different communication models.

• The FLP (Fischer, Lynch, Paterson) paper won the 2001 PODC Influential Paper Award, which later was renamed the Dijkstra Prize.

• One might argue that FLP destroys all the fun in distributed computing, as it makes so many things impossible! For instance, it seems impossible to have a distributed database where the nodes can reach consensus on whether to commit a transaction or not.

• So are two-phase-commit (2PC), three-phase-commit (3PC) et al. wrong?! No, not really, but sometimes they just do not commit!

• What about turning some other knobs of the model? Can we have consensus in a message passing system? No. Can we have consensus in synchronous systems? Yes, even if all but one node fails!

• Can we have consensus in synchronous systems even if some nodes are mischievous, behave much worse than simply crashing, and for example send contradicting information to different nodes? This is known as Byzantine behavior. Yes, this is also possible, as long as the Byzantine nodes are strictly less than a third of all the nodes. This was shown by Marshall Pease, Robert Shostak, and Leslie Lamport in 1980. Their work won the 2005 Dijkstra Prize, and is one of the cornerstones not only of distributed computing but also of information security. Indeed, this work was motivated by the "fault-tolerance in planes" example. Pease, Shostak, and Lamport noticed that the computers they were given to implement a fault-tolerant fighter plane at times behaved strangely. Before crashing, these computers would start behaving quite randomly, sending out weird messages. At some point Pease et al. decided that a malicious behavior model would be the most appropriate to be on the safe side. Being able to allow strictly less than a third Byzantine nodes is quite counterintuitive; even today many systems are built with three copies. In light of the result of Pease et al. this is a serious mistake! If you want to be tolerant against a single Byzantine machine, you need four copies, not three!

• Finally, FLP only prohibits deterministic algorithms! So can we solve consensus if we use randomization? The answer again is yes! We will study this in the remainder of this chapter.

16.2 Randomized Consensus

Can we solve consensus if we allow randomization? Yes. The following algorithm solves consensus even in the face of Byzantine errors, i.e., malicious behavior of some of the nodes. To simplify arguments we assume that at most f nodes will fail (crash) with n > 9f, and that we only solve binary consensus, that is, the input values are 0 and 1. The general idea is that nodes try to push their input value; if other nodes do not follow, they will try to push one of the suggested values randomly. The full algorithm is given as Algorithm 66.

Algorithm 66 Randomized Consensus
1: node u starts with input bit xu ∈ {0, 1}, round := 1
2: broadcast BID(xu, round)
3: repeat
4:   wait for n − f BID messages of current round
5:   if at least n − f messages have value x then
6:     xu := x; decide on x
7:   else if at least n − 2f messages have value x then
8:     xu := x
9:   else
10:    choose xu randomly, with Pr[xu = 0] = Pr[xu = 1] = 1/2
11:  end if
12:  round := round + 1
13:  broadcast BID(xu, round)
14: until decided

Theorem 16.6. Algorithm 66 solves consensus as in Definition 16.1 even if up to f < n/9 nodes exhibit Byzantine failures.

Proof. First note that it is not a problem to wait for n − f BID messages in Line 4, since at most f nodes are corrupt. If all nodes have the same input value x, then all (except the f Byzantine nodes) will bid for the same value x. Thus, every node receives at least n − 2f BID messages containing x, deciding on x in the first round already. We have consensus!

If the nodes have different (binary) input values, the validity condition becomes trivial as any result is fine. What about agreement? Let u be one of the first nodes to decide on value x (in Line 6). It may happen that due to asynchronicity another node v received messages from a different subset of the nodes; however, at most f senders may be different. Taking into account that Byzantine nodes may lie, i.e., send different BIDs to different nodes, f additional BID messages received by v may differ from those received by u. Since node u had at least n − 2f BID messages with value x, node v has at least n − 4f BID messages with x. Hence every correct node will bid for x in the next round, and then decide on x.

So we only need to worry about termination! We have already seen that as soon as one correct node terminates (in Line 6), everybody terminates in the next round. So what are the chances that some node u terminates in Line 6? Well, if push comes to shove, we can still hope that all correct nodes randomly propose the same value (in Line 10). Maybe there are some nodes not choosing at random (i.e., entering Line 8), but they unanimously propose either 0 or 1: For the sake of contradiction, assume that both 0 and 1 are proposed in Line 8. This means that both 0 and 1 had been proposed by at least n − 5f correct nodes. In other words, we have a total of 2(n − 5f) + f = n + (n − 9f) > n nodes. Contradiction!

Thus, at worst all n − f correct nodes need to randomly choose the same bit, which happens with probability 2^(−(n−f)). If so, all will send the same BID, and the algorithm terminates. So the expected running time is smaller than 2^n.

Remarks:

• The presentation of Algorithm 66 is a simplification of the typical presentation in textbooks.

• What about an algorithm that allows for crashes only, but can manage more failures? Good news! Slightly changing the presented algorithm will do that for f < n/4! See exercises.

• Unfortunately, Algorithm 66 is still impractical as termination is awfully slow. In expectation about the same number of nodes choose 1 or 0 in Line 10. Termination would be much more efficient if all nodes chose the same random value in Line 10! So why not simply replace Line 10 with "choose xu := 1"?!? The problem is that a majority of nodes may see a majority of 0 bids, hence proposing 0 in the next round. Without randomization it is impossible to get out of this equilibrium. (Moreover, this approach is deterministic, contradicting Theorem 16.5.)

• The idea is to replace Line 10 with a subroutine where all nodes compute a so-called shared (or common, or global) coin. A shared coin is a random variable that is 0 with constant probability and 1 with constant probability. Sounds like magic, but it isn't! We assume at most f < n/3 nodes may crash:

Theorem 16.7. If f < n/3 nodes crash, Algorithm 67 implements a shared coin.

Proof. Since only f nodes may crash, each node sees at least n − f coins and sets in Lines 4 and 7, respectively. Thanks to the reliable broadcast protocol, each node eventually sees all the coins in the other sets. In other words, the algorithm terminates in O(1) time.

The general idea is that a third of the coins are being seen by everybody. If there is a 0 among these coins, everybody will see that 0. If not, chances are high that there is no 0 at all! Here are the details:

Algorithm 67 Shared Coin (code for node u)
1: set local coin xu := 0 with probability 1/n, else xu := 1
2: use reliable broadcast to tell everybody about your local coin xu
3: memorize all coins you get from others in the set cu
4: wait for exactly n − f coins
5: copy these coins into your local set su (but keep learning coins)
6: use reliable broadcast to tell everybody about your set su
7: wait for exactly n − f sets sv (which satisfy sv ⊆ cu)
8: if seen at least a single coin 0 then
9:   return 0
10: else
11:  return 1
12: end if

Let u be the first node to terminate (satisfy Line 7). For u we draw a matrix of all the seen sets sv (columns) and all coins cu seen by node u (rows). Here is an example with n = 7, f = 2, n − f = 5:

       s1   s3   s5   s6   s7
  c1   X    X    X    X    X
  c2             X    X    X
  c3   X    X    X    X    X
  c5   X    X         X    X
  c6   X    X    X         X
  c7   X    X    X    X

Note that there are exactly (n − f)² X's in this matrix, as node u has seen exactly n − f sets (Line 7), each having exactly n − f coins (Lines 4 to 6). We need two little helper lemmas:

Lemma 16.8. There are at least f + 1 rows that have at least f + 1 X's.

Proof. Assume (for the sake of contradiction) that this is not the case. Then at most f rows have all n − f X's, and all other rows (at most n − f) have at most f X's. In other words, the number of total X's is bounded by

|X| ≤ f · (n − f) + (n − f) · f = 2f(n − f).

Using n > 3f we get n − f > 2f, and hence |X| ≤ 2f(n − f) < (n − f)². This is a contradiction to having exactly (n − f)² X's!

Lemma 16.9. Let W be the set of local coins for which the corresponding matrix row has more than f X's. All local coins in the set W are seen by all nodes that terminate.

Proof. Let w ∈ W be such a local coin. By definition of W we know that w is in at least f + 1 seen sets. Since each node must see at least n − f seen sets before terminating, each node has seen at least one of these sets, and hence w is seen by everybody terminating.

Continuing the proof of Theorem 16.7: With probability (1 − 1/n)^n ≈ 1/e ≈ 0.37, all nodes chose their local coin equal to 1, and 1 is decided. With probability 1 − (1 − 1/n)^|W| there is at least one 0 in W. With Lemma 16.8 we know that |W| ≈ n/3, hence this probability is about 1 − (1 − 1/n)^(n/3) ≈ 1 − (1/e)^(1/3) ≈ 0.28. With Lemma 16.9 this 0 is seen by all, and hence everybody will decide 0. So indeed we have a shared coin.
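The probability estimate above can be checked numerically. The following Python sketch (our own simulation, assuming no crashes so that every node sees all n local coins) estimates the probability that the shared coin comes up 1:

```python
import random

# Crash-free simulation of the Shared Coin algorithm above: every node sees
# all n local coins, so the outcome is 0 iff at least one node flipped a 0.
def shared_coin(n):
    coins = [0 if random.random() < 1 / n else 1 for _ in range(n)]
    return 0 if 0 in coins else 1

n, trials = 30, 100_000
p_one = sum(shared_coin(n) for _ in range(trials)) / trials
# The proof's estimate: Pr[coin = 1] = (1 - 1/n)^n, close to 1/e ≈ 0.37.
print(round(p_one, 2), round((1 - 1 / n) ** n, 2))
```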
Theorem 16.10. Plugging Algorithm 67 into Algorithm 66 we get a randomized consensus algorithm which finishes in a constant expected number of rounds.

Remarks:
• If some nodes go into Line 8 of Algorithm 66, the others still have a constant probability to guess the same shared coin.
• For crash failures there exists an improved constant expected time
algorithm which tolerates f failures with 2f < n.

• For Byzantine failures there exists a constant expected time algorithm which tolerates f failures with 3f < n.
• Similar algorithms have been proposed for the shared memory model.
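To see the pieces fit together, here is a hedged Python sketch (our own simplification: no failures, synchronous rounds, and the shared coin computed centrally) of the randomized consensus loop with the shared coin substituted into Line 10:

```python
import random

# Crash-free, synchronous-round simulation of the randomized consensus
# algorithm above, with Line 10 replaced by a shared coin: all undecided
# nodes adopt one common random bit per round. The sampling of n - f bids
# stands in for the protocol's message passing.
def randomized_consensus(inputs, f):
    n = len(inputs)
    x = list(inputs)            # current bid of every node
    decided = [None] * n
    while any(d is None for d in decided):
        bids = list(x)
        # shared coin: 0 if any node flips 0 (probability 1/n each), else 1
        coin = 0 if any(random.random() < 1 / n for _ in range(n)) else 1
        for u in range(n):
            if decided[u] is not None:
                continue                            # keep bidding old value
            received = random.sample(bids, n - f)   # wait for n - f bids
            for v in (0, 1):
                if received.count(v) >= n - f:      # Lines 5-6: decide
                    x[u] = v
                    decided[u] = v
                    break
                if received.count(v) >= n - 2 * f:  # Lines 7-8: adopt
                    x[u] = v
                    break
            else:                                   # Line 10: shared coin
                x[u] = coin
    return decided

# With identical inputs, every node decides in the very first round.
print(randomized_consensus([1] * 10, f=1))  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```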

Chapter Notes
See [Lam82, FLP85, PLS83, Sim88].

Bibliography
[FLP85] Michael J. Fischer, Nancy A. Lynch, and Mike Paterson. Impossibility
of Distributed Consensus with One Faulty Process. J. ACM, 32(2):374–
382, 1985.
[Lam82] L. Lamport. The Byzantine generals problem. ACM Transactions on
Programming Languages and Systems, 4:382–401, 1982.
[PLS83] Robert L. Probert, Nancy A. Lynch, and Nicola Santoro, editors. Pro-
ceedings of the Second Annual ACM SIGACT-SIGOPS Symposium on
Principles of Distributed Computing, Montreal, Quebec, Canada, Au-
gust 17-19, 1983. ACM, 1983.

[Sim88] Janos Simon, editor. Proceedings of the 20th Annual ACM Sympo-
sium on Theory of Computing, May 2-4, 1988, Chicago, Illinois, USA.
ACM, 1988.

Chapter 17

Byzantine Agreement

In order to make flying safer, researchers studied possible failures of various sensors and machines used in airplanes. While trying to model the failures, they were confronted with the following problem: failing machines did not just crash; instead, they sometimes showed arbitrary behavior before stopping completely. With these insights, researchers modeled failures as arbitrary failures, not restricted to any patterns.

Definition 17.1 (Byzantine). A node which can have arbitrary behavior is called byzantine. This includes "anything imaginable", e.g., not sending any messages at all, or sending different and wrong messages to different neighbors, or lying about the input value.

Remarks:

• Byzantine behavior also includes collusion, i.e., all byzantine nodes are being controlled by the same adversary.

• We assume that any two nodes communicate directly, and that no node can forge an incorrect sender address. This is a requirement, such that a single byzantine node cannot simply impersonate all nodes!

• We call non-byzantine nodes correct nodes.

Definition 17.2 (Byzantine Agreement). Finding consensus as in Definition 16.1 in a system with byzantine nodes is called byzantine agreement. An algorithm is f-resilient if it still works correctly with f byzantine nodes.

Remarks:

• As for consensus (Definition 16.1) we also need agreement, termination and validity. Agreement and termination are straightforward, but what about validity?

17.1 Validity

Definition 17.3 (Any-Input Validity). The decision value must be the input value of any node.

Remarks:

• This is the validity definition we implicitly used for consensus, in Definition 16.1.

• Does this definition still make sense in the presence of byzantine nodes? What if byzantine nodes lie about their inputs?

• We would wish for a validity definition which differentiates between byzantine and correct inputs.

Definition 17.4 (Correct-Input Validity). The decision value must be the input value of a correct node.

Remarks:

• Unfortunately, implementing correct-input validity does not seem to be easy, as a byzantine node following the protocol but lying about its input value is indistinguishable from a correct node. Here is an alternative.

Definition 17.5 (All-Same Validity). If all correct nodes start with the same input v, the decision value must be v.

Remarks:

• If the decision values are binary, then correct-input validity is induced by all-same validity.

• If the input values are not binary, but for example from sensors that deliver values in R, all-same validity is in most scenarios not really useful.

Definition 17.6 (Median Validity). If the input values are orderable, e.g. v ∈ R, byzantine outliers can be prevented by agreeing on a value close to the median of the correct input values, where close is a function of the number of byzantine nodes f.

Remarks:

• Is byzantine agreement possible? If yes, with what validity condition?

• Let us try to find an algorithm which tolerates a single byzantine node, first restricting ourselves to the so-called synchronous model.

Model 17.7 (synchronous). In the synchronous model, nodes operate in synchronous rounds. In each round, each node may send a message to the other nodes, receive the messages sent by the other nodes, and do some local computation.

Definition 17.8 (synchronous runtime). For algorithms in the synchronous model, the runtime is simply the number of rounds from the start of the execution to its completion in the worst case (every legal input, every execution scenario).
185
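To see what median validity buys, here is a small illustration in Python. This is our own sketch, not part of the text; the sensor readings and the outlier values are invented for the example.

```python
def median(values):
    # Median as the middle element of the sorted list (odd length).
    s = sorted(values)
    return s[len(s) // 2]

correct = [9.8, 9.9, 10.0, 10.0, 10.1, 10.2, 10.3]  # n - f = 7 correct sensor inputs
outliers = [1000.0, -1000.0]                        # f = 2 byzantine values

m = median(correct + outliers)
# The f outliers shift the median by at most f positions in sorted
# order, so it stays close to the median of the correct inputs.
print(m)  # 10.0
```

However extreme the two byzantine values are, they can only push the overall median a couple of positions along the sorted correct readings, which is exactly the "close to the median" guarantee of Definition 17.6.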
17.2 How Many Byzantine Nodes?

Algorithm 68 Byzantine Agreement with f = 1.
1: Code for node u, with input value x:
Round 1
2: Send tuple(u, x) to all other nodes
3: Receive tuple(v, y) from all other nodes v
4: Store all received tuple(v, y) in a set Su
Round 2
5: Send set Su to all other nodes
6: Receive sets Sv from all nodes v
7: T = set of tuple(v, y) seen in at least two sets Sv, including own Su
8: Let tuple(v, y) ∈ T be the tuple with the smallest value y
9: Decide on value y

Remarks:

• Byzantine nodes may not follow the protocol and send syntactically
incorrect messages. Such messages can easily be detected and discarded.
It is worse if byzantine nodes send syntactically correct messages, but
with a bogus content, e.g., they send different messages to different nodes.

• Some of these mistakes cannot easily be detected: For example, if a
byzantine node sends different values to different nodes in the first round,
such values will be put into Su. However, some mistakes can and must be
detected: Observe that all nodes only relay information in Round 2, and
do not say anything about their own value. So, if a byzantine node sends
a set Sv which contains a tuple(v, y), this tuple must be removed by u
from Sv upon receiving it (Line 6).

• Recall that we assumed that nodes cannot forge their source address;
thus, if a node receives tuple(v, y) in Round 1, it is guaranteed that this
message was sent by v.

Lemma 17.9. If n ≥ 4, all correct nodes have the same set T.

Proof. With f = 1 and n ≥ 4 we have at least 3 correct nodes. A correct node
will see every correct value at least twice, once directly from another correct
node, and once through the third correct node. So all correct values are in T.
If the byzantine node sends the same value to at least 2 other (correct) nodes,
all correct nodes will see the value twice, so all add it to set T. If the byzantine
node sends all different values to the correct nodes, none of these values will
end up in any set T.

Theorem 17.10. Algorithm 68 reaches byzantine agreement if n ≥ 4.

Proof. We need to show agreement, any-input validity and termination. With
Lemma 17.9 we know that all correct nodes have the same set T, and therefore
agree on the same minimum value. The nodes agree on a value proposed by
any node, so any-input validity holds. Moreover, the algorithm terminates after
two rounds.

Remarks:

• If n > 4 the byzantine node can put multiple values into T.

• The idea of this algorithm can be generalized for any f and n > 3f. In
the generalization, every node sends in each of f + 1 rounds all information
it learned so far to all other nodes. In other words, message size increases
exponentially with f.

• Does Algorithm 68 also work with n = 3?

Theorem 17.11. Three nodes cannot reach byzantine agreement with all-same
validity if one node among them is byzantine.

Proof. We have three nodes u, v, w. In order to achieve all-same validity, a
correct node must decide on its own value if another node supports that value.
The third node might disagree, but that node could be byzantine. If correct
node u has input 0 and correct node v has input 1, the byzantine node w can
fool them by telling u that its value is 0 and simultaneously telling v that its
value is 1. This leads to u and v deciding on their own values, which results in
violating the agreement condition. Even if u talks to v, and they figure out that
they have different assumptions about w’s value, u cannot distinguish whether
w or v is byzantine.

Theorem 17.12. A network with n nodes cannot reach byzantine agreement
with f ≥ n/3 byzantine nodes.

Proof. Let us assume (for the sake of contradiction) that there exists an
algorithm A that reaches byzantine agreement for n nodes with f ≥ n/3
byzantine nodes. With A, we can solve byzantine agreement with 3 nodes. For
simplicity, we call the 3 nodes u, v, w supernodes.
Each supernode simulates n/3 nodes, either ⌊n/3⌋ or ⌈n/3⌉, if n is not
divisible by 3. Each simulated node starts with the input of its supernode. Now
the three supernodes simulate algorithm A. The single byzantine supernode
simulates ⌈n/3⌉ byzantine nodes. As algorithm A promises to solve byzantine
agreement for f ≥ n/3, A has to be able to handle ⌈n/3⌉ byzantine nodes.
Algorithm A guarantees that the correct nodes simulated by the correct two
supernodes will achieve byzantine agreement. So the two correct supernodes can
just take the value of their simulated nodes (these values have to be the same
by the agreement property), and we have achieved agreement for three
supernodes, one of them byzantine. This contradicts Theorem 17.11, hence
algorithm A cannot exist.
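The Round-1 lying scenario from the remarks can be replayed in a few lines. The following is our own toy simulation of Algorithm 68 for n = 4 and f = 1 (all node identifiers and values are invented): node 3 lies differently to every correct node in Round 1 and stays silent in Round 2.

```python
inputs = {0: 5, 1: 7, 2: 9}              # the three correct nodes
lies = {0: 100, 1: 200, 2: 300}          # what byzantine node 3 tells each

# Round 1: node u stores every received tuple(v, y) in its set S[u].
S = {u: {(v, y) for v, y in inputs.items() if v != u} | {(3, lies[u])}
     for u in inputs}

def decide(u):
    # Round 2: u receives the sets of the other correct nodes; tuples in
    # which a sender reports its own value are discarded (Line 6). The
    # byzantine node stays silent in this round.
    received = [S[u]] + [{t for t in S[v] if t[0] != v} for v in inputs if v != u]
    # T: tuples seen in at least two sets, including u's own set Su.
    T = {t for t in set().union(*received) if sum(t in s for s in received) >= 2}
    return min(y for _, y in T)          # decide on the smallest value

print([decide(u) for u in inputs])  # [5, 5, 5]: the lies never make it into T
```

Each of the three inconsistent lies appears in only one set, so none of them reaches the two-sets threshold of Line 7, matching the case analysis in the proof of Lemma 17.9.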
17.3 The King Algorithm

Algorithm 69 King Algorithm (for f < n/3)
1: x = my input value
2: for phase = 1 to f + 1 do
Round 1
3: Broadcast value(x)
Round 2
4: if some value(y) received at least n − f times then
5: Broadcast propose(y)
6: end if
7: if some propose(z) received more than f times then
8: x = z
9: end if
Round 3
10: Let node vi be the predefined king of this phase i
11: The king vi broadcasts its current value w
12: if received strictly less than n − f propose(x) then
13: x = w
14: end if
15: end for

Lemma 17.13. Algorithm 69 fulfills the all-same validity.

Proof. If all correct nodes start with the same value, all correct nodes propose
it in Round 2. All correct nodes will receive at least n − f proposals, i.e., all
correct nodes will stick with this value, and never change it to the king’s value.
This holds for all phases.

Lemma 17.14. If a correct node proposes x, no other correct node proposes y,
with y ≠ x, if n > 3f.

Proof. Assume (for the sake of contradiction) that a correct node proposes value
x and another correct node proposes value y. Since a good node only proposes
a value if it heard at least n − f value messages, we know that both nodes must
have received their value from at least n − 2f distinct correct nodes (as at most
f nodes can behave byzantine and send x to one node and y to the other one).
Hence, there must be a total of at least 2(n − 2f) + f = 2n − 3f nodes in the
system. Using 3f < n, we have 2n − 3f > n nodes, a contradiction.

Lemma 17.15. There is at least one phase with a correct king.

Proof. There are f + 1 phases, each with a different king. As there are only f
byzantine nodes, one king must be correct.

Lemma 17.16. After a round with a correct king, the correct nodes will not
change their values v anymore, if n > 3f.
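The all-same-validity argument of Lemma 17.13 can be traced in a straight-line sketch. This is our own illustration (not from the text) of one phase with n = 4 and f = 1: the three correct nodes share input 1, the byzantine node stays silent, and the counts show why even a byzantine king could not flip x.

```python
n, f = 4, 1
correct = [0, 1, 2]                  # node 3 is byzantine (silent here)
x = {u: 1 for u in correct}          # all correct nodes start with input 1

# Round 1: broadcast value(x); every correct node sees value 1 from
# all three correct nodes.
received_value_1 = len(correct)      # = 3 = n - f

# Round 2: value 1 was seen at least n - f times, so every correct
# node broadcasts propose(1); each node then receives 3 > f proposals.
proposes_for_1 = len(correct) if received_value_1 >= n - f else 0
for u in correct:
    if proposes_for_1 > f:
        x[u] = 1

# Round 3: each node received proposes_for_1 = 3 >= n - f propose(x)
# messages, so nobody adopts the king's value, even if the king is
# byzantine and broadcasts 0.
king_value = 0
for u in correct:
    if proposes_for_1 < n - f:
        x[u] = king_value

print("phase ends with x =", x)
```

The n − f proposal threshold in Round 3 is what protects a unanimous start from a byzantine king; the same counts repeat in every later phase.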
Proof. If all correct nodes change their values to the king’s value, all correct
nodes have the same value. If some correct node does not change its value to
the king’s value, it received a proposal at least n − f times, therefore at least
n − 2f correct nodes broadcasted this proposal. Thus, all correct nodes received
it at least n − 2f > f times (using n > 3f), therefore all correct nodes set
their value to the proposed value, including the correct king. Note that only
one value can be proposed more than f times, which follows from Lemma 17.14.
With Lemma 17.13, no node will change its value after this round.

Theorem 17.17. Algorithm 69 solves byzantine agreement.

Proof. The king algorithm reaches agreement as either all correct nodes start
with the same value, or they agree on the same value at the latest after the
phase where a correct node was king, according to Lemmas 17.15 and 17.16.
Because of Lemma 17.13 we know that they will stick with this value.
Termination is guaranteed after 3(f + 1) rounds, and all-same validity is proved
in Lemma 17.13.

Remarks:

• Algorithm 69 requires f + 1 predefined kings. We assume that the kings
(and their order) are given. Finding the kings indeed would be a byzantine
agreement task by itself, so this must be done before the execution of the
King algorithm.

• Do algorithms exist which do not need predefined kings? Yes, see Section
17.5.

• Can we solve byzantine agreement (or at least consensus) in less than
f + 1 rounds?

17.4 Lower Bound on Number of Rounds

Theorem 17.18. A synchronous algorithm solving consensus in the presence
of f crashing nodes needs at least f + 1 rounds, if nodes decide for the minimum
value seen.

Proof. Let us assume (for the sake of contradiction) that some algorithm A
solves consensus in f rounds. Some node u1 has the smallest input value x, but
in the first round u1 can send its information (including information about its
value x) to only some other node u2 before u1 crashes. Unfortunately, in the
second round, the only witness u2 of x also sends x to exactly one other node
u3 before u2 crashes. This will be repeated, so in round f only node uf+1 knows
about the smallest value x. As the algorithm terminates in round f, node uf+1
will decide on value x, all other surviving (correct) nodes will decide on values
larger than x.
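The counting step in the proof of Lemma 17.19 can be checked mechanically. This is our own sanity sketch of the arithmetic: if both 0 and 1 were chosen in Line 10, each was proposed by at least n − 5f correct nodes (n − 4f proposals, at most f of them from byzantine senders), which forces more than n nodes once n > 9f.

```python
def nodes_needed(n, f):
    # Two disjoint groups of n - 5f correct nodes plus f byzantine ones.
    return 2 * (n - 5 * f) + f

for f in range(1, 20):
    n = 9 * f + 1                            # smallest n with n > 9f
    assert nodes_needed(n, f) == n + (n - 9 * f)   # the identity in the proof
    assert nodes_needed(n, f) > n                  # the contradiction
print("Lemma 17.19 counting checked for f = 1..19")
```

The identity 2(n − 5f) + f = n + (n − 9f) is exactly the one used in the proof; the check confirms it exceeds n whenever n > 9f.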
If the correct nodes have different (binary) input values, the validity condition
becomes trivial as any result is fine.
What about agreement? Let u be the first node to decide on value x (in
Line 8). Due to asynchrony another node v received messages from a different
subset of the nodes, however, at most f senders may be different. Taking
into account that byzantine nodes may lie (send different propose messages to
different nodes), f additional propose messages received by v may differ from
those received by u. Since node u had at least n − 2f propose messages with
value x, node v has at least n − 4f propose messages with value x. Hence every
correct node will propose x in the next round, and then decide on x.
So we only need to worry about termination: We have already seen that as
soon as one correct node terminates (Line 8) everybody terminates in the next
round. So what are the chances that some node u terminates in Line 8? Well,
we can hope that all correct nodes randomly propose the same value (in Line
12). Maybe there are some nodes not choosing at random (entering Line 10
instead of 12), but according to Lemma 17.19 they will all propose the same.
Thus, at worst all n − f correct nodes need to randomly choose the same bit,
which happens with probability 2^(−(n−f)+1). If so, all correct nodes will send
the same propose message, and the algorithm terminates. So the expected
running time is exponential in the number of nodes n.

Remarks:

• This Algorithm is a proof of concept that asynchronous byzantine
agreement can be achieved. Unfortunately this algorithm is not useful in
practice, because of its runtime.

• For a long time, there was no algorithm with subexponential runtime.
The currently fastest algorithm has an expected runtime of O(n^2.5) but
only tolerates f ≤ n/500 byzantine nodes. This algorithm works along
the lines of the shared coin algorithm; additionally nodes try to detect
which nodes are byzantine.

Chapter Notes

The project which started the study of byzantine failures was called SIFT and
was funded by NASA [WLG+78], and the research regarding byzantine
agreement started to get significant attention with the results by Pease,
Shostak, and Lamport [PSL80, LSP82]. In [PSL80] they presented the
generalized version of Algorithm 68 and also showed that byzantine agreement
is unsolvable for n ≤ 3f. The algorithm presented in that paper is nowadays
called Exponential Information Gathering (EIG), due to the exponential size of
the messages.
There are many algorithms for the byzantine agreement problem. For
example the Queen Algorithm [BG89] which has a better runtime than the
King algorithm [BGP89], but tolerates less failures. That byzantine agreement
requires at least f + 1 many rounds was shown by Dolev and Strong [DS83],
based on a more complicated proof from Fischer and Lynch [FL82].
While many algorithms for the synchronous model have been around for a
long time, the asynchronous model is a lot harder. The only results were by
Ben-Or and Bracha. Ben-Or [Ben83] was able to tolerate f < n/5. Bracha
[BT85] improved this tolerance to f < n/3. The first algorithm with a
polynomial expected runtime was found by King and Saia [KS13] just recently.
Nearly all developed algorithms only satisfy all-same validity. There are a
few exceptions, e.g., correct-input validity [FG03], available if the initial values
are from a finite domain, or median validity [SW15] if the input values are
orderable.
Before the term byzantine was coined, the terms Albanian Generals or
Chinese Generals were used in order to describe malicious behavior. When the
involved researchers met people from these countries they moved – for obvious
reasons – to the historic term byzantine [LSP82].
This chapter was written in collaboration with Barbara Keller.

Bibliography

[Ben83] Michael Ben-Or. Another advantage of free choice (extended abstract):
Completely asynchronous agreement protocols. In Proceedings of the second
annual ACM symposium on Principles of distributed computing, pages 27–30.
ACM, 1983.

[BG89] Piotr Berman and Juan A. Garay. Asymptotically optimal distributed
consensus. Springer, 1989.

[BGP89] Piotr Berman, Juan A. Garay, and Kenneth J. Perry. Towards
optimal distributed consensus (extended abstract). In 30th Annual Symposium
on Foundations of Computer Science, Research Triangle Park, North Carolina,
USA, 30 October – 1 November 1989, pages 410–415, 1989.

[BT85] Gabriel Bracha and Sam Toueg. Asynchronous consensus and
broadcast protocols. Journal of the ACM (JACM), 32(4):824–840, 1985.

[DS83] Danny Dolev and H. Raymond Strong. Authenticated algorithms for
byzantine agreement. SIAM Journal on Computing, 12(4):656–666, 1983.

[FG03] Matthias Fitzi and Juan A. Garay. Efficient player-optimal protocols
for strong and differential consensus. In Proceedings of the twenty-second
annual symposium on Principles of distributed computing, pages 211–220.
ACM, 2003.

[FL82] Michael J. Fischer and Nancy A. Lynch. A lower bound for the time to
assure interactive consistency. Information Processing Letters, 14(4):183–186,
June 1982.

[KS13] Valerie King and Jared Saia. Byzantine agreement in polynomial
expected time (extended abstract). In Proceedings of the forty-fifth annual
ACM symposium on Theory of computing, pages 401–410. ACM, 2013.

[LSP82] Leslie Lamport, Robert E. Shostak, and Marshall C. Pease. The
byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382–401,
1982.

[PSL80] Marshall C. Pease, Robert E. Shostak, and Leslie Lamport. Reaching
agreement in the presence of faults. J. ACM, 27(2):228–234, 1980.

[SW15] David Stolz and Roger Wattenhofer. Byzantine Agreement with
Median Validity. In 19th International Conference on Principles of Distributed
Systems (OPODIS), Rennes, France, 2015.

[WLG+78] John H. Wensley, Leslie Lamport, Jack Goldberg, Milton W.
Green, Karl N. Levitt, P. M. Melliar-Smith, Robert E. Shostak, and Charles B.
Weinstock. SIFT: Design and analysis of a fault-tolerant computer for aircraft
control. In Proceedings of the IEEE, pages 1240–1255, 1978.
Chapter 18

Authenticated Agreement

Byzantine nodes are able to lie about their inputs as well as received messages.
Can we detect certain lies and limit the power of byzantine nodes? Possibly,
the authenticity of messages may be validated using signatures?

18.1 Agreement with Authentication

Definition 18.1 (Signature). If a node never signs a message, then no correct
node ever accepts that message. We denote a message msg(x) signed by node u
with msg(x)u.

Remarks:

• Algorithm 71 shows an agreement protocol for binary inputs relying on
signatures. We assume there is a designated “primary” node p. The goal
is to decide on p’s value.

Algorithm 71 Byzantine Agreement using Authentication
Code for primary p:
1: if input is 1 then
2: broadcast value(1)p
3: decide 1 and terminate
4: else
5: decide 0 and terminate
6: end if
Code for all other nodes v:
7: for all rounds i ∈ {1, . . . , f + 1} do
8: S is the set of accepted messages value(1)u.
9: if |S| ≥ i and value(1)p ∈ S then
10: broadcast S ∪ {value(1)v}
11: decide 1 and terminate
12: end if
13: end for
14: decide 0 and terminate

Theorem 18.2. Algorithm 71 can tolerate f < n byzantine failures while
terminating in f + 1 rounds.

Proof. Assuming that the primary p is not byzantine and its input is 1, then
p broadcasts value(1)p in the first round, which will trigger all correct nodes
to decide for 1. If p’s input is 0, there is no signed message value(1)p, and no
node can decide for 1.
If primary p is byzantine, we need all correct nodes to decide for the same
value for the algorithm to be correct. Let us assume that p convinces a correct
node v that its value is 1 in round i with i < f + 1. We know that v received
i signed messages for value 1. Then, v will broadcast i + 1 signed messages for
value 1, which will trigger all correct nodes to also decide for 1. If p tries to
convince some node v late (in round i = f + 1), v must receive f + 1 signed
messages. Since at most f nodes are byzantine, at least one correct node u
signed a message value(1)u in some round i < f + 1, which puts us back to the
previous case.

Remarks:

• The algorithm only takes f + 1 rounds, which is optimal as described in
Theorem 17.18.

• Using signatures, Algorithm 71 solves consensus for any number of
failures! Does this contradict Theorem 17.11? Recall that in the proof of
Theorem 17.11 we assumed that a byzantine node can distribute
contradictory information about its own input. If messages are signed,
correct nodes can detect such behavior – a node u signing two
contradicting messages proves to all nodes that node u is byzantine.

• Does Algorithm 71 satisfy any of the validity conditions introduced
in Section 17.1? No! A byzantine primary can dictate the decision
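The signed-message chains of Algorithm 71 can be mimicked in a toy run. This is our own sketch (not from the text): signatures are modelled as plain (node, "value(1)") tuples, which only works because this toy adversary never forges them; the run uses an honest primary, n = 3 and f = 1.

```python
f = 1
primary, others = 0, [1, 2]          # with signatures, n = 3 suffices

def run(primary_input):
    if primary_input == 0:
        # p never signs value(1)_p, so (absent forgery) nobody decides 1.
        return {u: 0 for u in [primary] + others}
    decisions = {primary: 1}
    S = {u: set() for u in others}                       # accepted messages
    inbox = {u: {(primary, "value(1)")} for u in others} # round-1 broadcast
    for i in range(1, f + 2):                            # rounds 1 .. f + 1
        outbox = {u: set() for u in others}
        for u in others:
            S[u] |= inbox[u]
            # Line 9: at least i signatures, including the primary's.
            if u not in decisions and len(S[u]) >= i and (primary, "value(1)") in S[u]:
                decisions[u] = 1
                for v in others:
                    if v != u:       # Line 10: relay S plus own signature
                        outbox[v] |= S[u] | {(u, "value(1)")}
        inbox = outbox
    return {u: decisions.get(u, 0) for u in [primary] + others}

print(run(1))  # {0: 1, 1: 1, 2: 1}
print(run(0))  # {0: 0, 1: 0, 2: 0}
```

With an honest primary everything finishes in the first round, as the remark above notes; the i-signatures threshold only matters when a byzantine primary reveals value(1)p late.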
value. Can we modify the algorithm such that the correct-input validity
condition is satisfied? Yes! We can run the algorithm in parallel for 2f + 1
primary nodes. Either 0 or 1 will occur at least f + 1 times, which means
that one correct process had to have this value in the first place. In this
case, we can only handle f < n/2 byzantine nodes.

• In reality, a primary will usually be correct. If so, Algorithm 71 only
needs two rounds! Can we make it work with arbitrary inputs? Also,
relying on synchrony limits the practicality of the protocol. What if
messages can be lost or the system is asynchronous?

• Zyzzyva uses authenticated messages to achieve state replication, as in
Definition 15.6. It is designed to run fast when nodes run correctly, and
it will slow down to fix failures!

18.2 Zyzzyva

Definition 18.3 (View). A view V describes the current state of a replicated
system, enumerating the 3f + 1 replicas. The view V also marks one of the
replicas as the primary p.

Definition 18.4 (Command). If a client wants to update (or read) data, it
sends a suitable command c in a Request message to the primary p. Apart
from the command c itself, the Request message also includes a timestamp t.
The client signs the message to guarantee authenticity.

Definition 18.5 (History). The history h is a sequence of commands c1, c2, . . .
in the order they are executed by Zyzzyva. We denote the history up to ck with
hk.

Remarks:

• In Zyzzyva, the primary p is used to order commands submitted by
clients to create a history h.

• Apart from the globally accepted history, node u may also have a local
history, which we denote as hu or huk.

Definition 18.6 (Complete command). If a command completes, it will remain
in its place in the history h even in the presence of failures.

Remarks:

• As long as clients wait for the completion of their commands, clients can
treat Zyzzyva like one single computer even if there are up to f failures.

In the Absence of Failures

Algorithm 72 Zyzzyva: No failures
1: At time t client u wants to execute command c
2: Client u sends request R = Request(c,t)u to primary p
3: Primary p appends c to its local history, i.e., hp = (hp, c)
4: Primary p sends OR = OrderedRequest(hp, c, R)p to all replicas
5: Each replica r appends command c to local history hr = (hr, c) and checks
whether hr = hp
6: Each replica r runs command ck and obtains result a
7: Each replica r sends Response(a,OR)r to client u
8: Client u collects the set S of received Response(a,OR)r messages
9: Client u checks if all histories hr are consistent
10: if |S| = 3f + 1 then
11: Client u considers command c to be complete
12: end if

Remarks:

• Since the client receives 3f + 1 consistent responses, all correct replicas
have to be in the same state.

• Only three communication rounds are required for the command c to
complete.

• Note that replicas have no idea which commands are considered complete
by clients! How can we make sure that commands that are considered
complete by a client are actually executed? We will see in Theorem 18.15.

• Commands received from clients should be ordered according to
timestamps to preserve the causal order of commands.

• There is a lot of optimization potential. For example, including the entire
command history in most messages introduces prohibitively large
overhead. Rather, old parts of the history that are agreed upon can be
truncated. Also, sending a hash value of the remainder of the history is
enough to check its consistency across replicas.

• What if a client does not receive 3f + 1 Response(a,OR)r messages? A
byzantine replica may omit sending anything at all! In practice, clients
set a timeout for the collection of Response messages. Does this mean
that Zyzzyva only works in the synchronous model? Yes and no. We will
discuss this in Lemma 18.18 and Lemma 18.19.
Byzantine Replicas

Algorithm 73 Zyzzyva: Byzantine Replicas (append to Algorithm 72)
1: if 2f + 1 ≤ |S| < 3f + 1 then
2: Client u sends Commit(S)u to all replicas
3: Each replica r replies with a LocalCommit(S)r message to u
4: Client u collects at least 2f + 1 LocalCommit(S)r messages and considers
c to be complete
5: end if

Remarks:

• If replicas fail, a client u may receive less than 3f + 1 consistent responses
from the replicas. Client u can only assume command c to be complete if
all correct replicas r eventually append command c to their local history
hr.

Definition 18.7 (Commit Certificate). A commit certificate S contains 2f + 1
consistent and signed Response(a,OR)r messages from 2f + 1 different replicas
r.

Remarks:

• The set S is a commit certificate which proves the execution of the
command on 2f + 1 replicas, of which at least f + 1 are correct. This
commit certificate S must be acknowledged by 2f + 1 replicas before the
client considers the command to be complete.

• Why do clients have to distribute this commit certificate to 2f + 1
replicas? We will discuss this in Theorem 18.13.

• What if |S| < 2f + 1, or what if the client receives 2f + 1 messages but
some have inconsistent histories? Since at most f replicas are byzantine,
the primary itself must be byzantine! Can we resolve this?

Byzantine Primary

Definition 18.8 (Proof of Misbehavior). Proof of misbehavior of some node
can be established by a set of contradicting signed messages.

Remarks:

• For example, if a client u receives two Response(a,OR)r messages that
contain inconsistent OR messages signed by the primary, client u can
prove that the primary misbehaved. Client u broadcasts this proof of
misbehavior to all replicas r, which initiate a view change by broadcasting
an IHatePrimaryr message to all replicas.

Algorithm 74 Zyzzyva: Byzantine Primary (append to Algorithm 73)
1: if |S| < 2f + 1 then
2: Client u sends the original R = Request(c,t)u to all replicas
3: Each replica r sends a ConfirmRequest(R)r message to p
4: if primary p replies with OR then
5: Replica r forwards OR to all replicas
6: Continue as in Algorithm 72, Line 5
7: else
8: Replica r initiates view change by broadcasting IHatePrimaryr to all
replicas
9: end if
10: end if

Remarks:

• A faulty primary can slow down Zyzzyva by not sending out the
OrderedRequest messages in Algorithm 72, repeatedly escalating to
Algorithm 74.

• Line 5 in the Algorithm is necessary to ensure liveness. We will discuss
this in Theorem 18.19.

• Again, there is potential for optimization. For example, a replica might
already know about a command that is requested by a client. In that
case, it can answer without asking the primary. Furthermore, the primary
might already know the message R requested by the replicas. In that
case, it sends the old OR message to the requesting replica.

Safety

Definition 18.9 (Safety). We call a system safe if the following condition
holds: If a command with sequence number j and a history hj completes, then
for any command that completed earlier (with a smaller sequence number
i < j), the history hi is a prefix of history hj.

Remarks:

• In Zyzzyva a command can only complete in two ways, either in
Algorithm 72 or in Algorithm 73.

• If a system is safe, complete commands cannot be reordered or dropped.
So is Zyzzyva so far safe?

Lemma 18.10. Let ci and cj be two different complete commands. Then ci
and cj must have different sequence numbers.

Proof. If a command c completes in Algorithm 72, 3f + 1 replicas sent a
Response(a,OR)r to the client. If the command c completed in Algorithm 73,
at least 2f + 1 replicas sent a Response(a,OR)r message to the client. Hence, a
client has to receive at least 2f + 1 Response(a,OR)r messages.
Both ci and cj are complete. Therefore there must be at least 2f + 1 replicas
that responded to ci with a Response(a,OR)r message. But there are also at least
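The 2f + 1 threshold recurring in these remarks and in the proof of Lemma 18.10 rests on a quorum-overlap argument that is easy to check mechanically. This is our own arithmetic sketch, not part of the text.

```python
# Any two quorums of size 2f + 1 among 3f + 1 replicas overlap in at
# least f + 1 replicas (pigeonhole: |A| + |B| - n), so the overlap
# always contains strictly more replicas than the f byzantine ones,
# i.e., at least one correct replica.

for f in range(1, 10):
    n, q = 3 * f + 1, 2 * f + 1
    min_overlap = 2 * q - n
    assert min_overlap == f + 1
print("any two (2f+1)-quorums among 3f+1 replicas share >= f+1 replicas")
```

This is why a commit certificate acknowledged by 2f + 1 replicas cannot coexist with a conflicting one: the guaranteed correct replica in the overlap would have had to sign both.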
18.2. ZYZZYVA 201 202 CHAPTER 18. AUTHENTICATED AGREEMENT

2f + 1 replicas that responded to cj with a Response(a,OR)r message. Because Remarks:


there are only 3f + 1 replicas, there is at least one correct replica that sent a
Response(a,OR)r message for both ci and cj . A correct replica only sends one • The f + 1 IHatePrimaryr messages in set H prove that at least one
Response(a,OR)r message for each sequence number, hence the two commands correct replica initiated a view change. This proof is broadcast to all
must have different sequence numbers. replicas to make sure that once the first correct replica stopped acting
in the current view, all other replicas will do so as well.
Lemma 18.11. Let ci and cj be two complete commands with sequence numbers
i < j. The history hi is a prefix of hj . • Slr is the most recent commit certificate that the replica obtained
in the ending view as described in Algorithm 73. Slr will be used
to recover the correct history before the new view starts. The local
Proof. As in the proof of Lemma 18.10, there has to be at least one correct
histories hr are included in the ViewChange(H r ,hr ,Slr )r message such
replica that sent a Response(a,OR)r message for both ci and cj. A correct replica r that sent a Response(a,OR)r message for ci will only accept cj if the history for cj provided by the primary is consistent with the local history of replica r, including ci.

Remarks:

• A byzantine primary can cause the system to never complete any command, either by never sending any messages or by inconsistently ordering client requests. In this case, replicas have to replace the primary.

View Changes

Definition 18.12 (View Change). In Zyzzyva, a view change is used to replace a byzantine primary with another (hopefully correct) replica. View changes are initiated by replicas sending IHatePrimaryr to all other replicas. This only happens if a replica obtains a valid proof of misbehavior from a client or after a replica fails to obtain an OR message from the primary in Algorithm 74.

Remarks:

• How can we safely decide to initiate a view change, i.e., demote a byzantine primary? Note that byzantine nodes should not be able to trigger a view change!

Algorithm 75 Zyzzyva: View Change Agreement
1: All replicas continuously collect the set H of IHatePrimaryr messages
2: if a replica r received |H| > f messages or a valid ViewChange message then
3: Replica r broadcasts ViewChange(Hr, hr, Slr)r
4: Replica r stops participating in the current view
5: Replica r switches to the next primary “p = p + 1”
6: end if

• ... that commands that completed after a correct client received 3f + 1 responses from replicas can be recovered as well.

• In Zyzzyva, a byzantine primary starts acting as a normal replica after a view change. In practice, all machines eventually break and rarely fix themselves after that. Instead, one could consider replacing a byzantine primary with a fresh replica that was not in the previous view.

Algorithm 76 Zyzzyva: View Change Execution
1: The new primary p collects the set C of ViewChange(Hr, hr, Slr)r messages
2: if new primary p collected |C| ≥ 2f + 1 messages then
3: New primary p sends NewView(C)p to all replicas
4: end if
5: if a replica r received a NewView(C)p message then
6: Replica r recovers new history hnew as shown in Algorithm 77
7: Replica r broadcasts ViewConfirm(hnew)r message to all replicas
8: end if
9: if a replica r received 2f + 1 ViewConfirm(hnew)r messages then
10: Replica r accepts hr = hnew as the history of the new view
11: Replica r starts participating in the new view
12: end if

Remarks:

• Analogously to Lemma 18.11, commit certificates are ordered. For two commit certificates Si and Sj with sequence numbers i < j, the history hi certified by Si is a prefix of the history hj certified by Sj.

• Zyzzyva collects the most recent commit certificate and the local history of 2f + 1 replicas. This information is distributed to all replicas, and used to recover the history for the new view hnew.

• If a replica does not receive the NewView(C)p or the ViewConfirm(hnew)r message in time, it triggers another view change by broadcasting IHatePrimaryr to all other replicas.
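The trigger logic of Algorithm 75 can be sketched in a few lines of Python. This is a simplified sketch: message transport, signatures, and the proof-of-misbehavior check are abstracted away, and the class/method names and the `broadcast` callback are our own, not part of the protocol.

```python
class Replica:
    """Sketch of Algorithm 75: a replica joins a view change once it has
    collected more than f IHatePrimary messages, or seen a valid ViewChange."""

    def __init__(self, f, broadcast):
        self.f = f                  # upper bound on the number of byzantine replicas
        self.view = 0               # current view; the primary is derived from it
        self.hate = set()           # senders of IHatePrimary messages (the set H)
        self.in_view = True         # still participating in the current view?
        self.broadcast = broadcast  # callback that sends a message to all replicas

    def on_i_hate_primary(self, sender):
        self.hate.add(sender)
        # |H| > f guarantees at least one correct replica observed misbehavior
        if len(self.hate) > self.f and self.in_view:
            self.start_view_change()

    def on_view_change(self, valid):
        # a valid ViewChange message alone also triggers the switch
        if valid and self.in_view:
            self.start_view_change()

    def start_view_change(self):
        self.broadcast(("ViewChange", frozenset(self.hate), self.view))
        self.in_view = False        # stop participating in the current view
        self.view += 1              # switch to the next primary "p = p + 1"
```

Since byzantine nodes can send at most f IHatePrimary messages, the `len(self.hate) > self.f` check ensures they cannot trigger a view change on their own.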
18.2. ZYZZYVA 203 204 CHAPTER 18. AUTHENTICATED AGREEMENT

• How is the history recovered exactly? It seems that the set of histories included in C can be messy. How can we be sure that complete commands are not reordered or dropped?

• Can we be sure that all commands that completed at a correct client are carried over into the new view?

[Figure 18.1: diagram not reproduced. It depicts the local histories reported in C by f + 1 correct replicas, f byzantine replicas, and the other correct replicas: the commands up to Sl (consistent commands with commit certificate), commands with ≥ f + 1 consistent histories (kept in hnew), and commands with < f + 1 consistent histories (inconsistent or missing commands, discarded).]

Figure 18.1: The structure of the data reported by different replicas in C. Commands up to the last commit certificate Sl were completed in either Algorithm 72 or Algorithm 73. After the last commit certificate Sl there may be commands that completed at a correct client in Algorithm 72. Algorithm 77 shows how the new history hnew is recovered such that no complete commands are lost.

Algorithm 77 Zyzzyva: History Recovery
1: C = set of 2f + 1 ViewChange(Hr, hr, Sr)r messages in NewView(C)p
2: R = set of replicas included in C
3: Sl = most recent commit certificate Slr reported in C
4: hnew = history hl contained in Sl
5: k = l + 1, next sequence number
6: while command ck exists in C do
7: if ck is reported by at least f + 1 replicas in R then
8: Remove replicas from R that do not support ck
9: hnew = (hnew, ck)
10: end if
11: k = k + 1
12: end while
13: return hnew

Remarks:

• Commands up to Sl are included into the new history hnew.

• If at least f + 1 replicas share a consistent history after the last commit certificate Sl, also the commands after that are included.

• Even if f + 1 correct replicas consistently report a command c after the last commit certificate Sl, c may not be considered complete by a client, e.g., because one of the responses to the client was lost. Such a command is included in the new history hnew. When the client retries executing c, the replicas will be able to identify the same command c using the timestamp included in the client’s request, and avoid duplicate execution of the command.

Lemma 18.13. The globally most recent commit certificate Sl is included in C.

Proof. Any two sets of 2f + 1 replicas share at least one correct replica. Hence, at least one correct replica which acknowledged the most recent commit certificate Sl also sent a LocalCommit(Sl)r message that is in C.

Lemma 18.14. Any command and its history that completes after Sl has to be reported in C at least f + 1 times.

Proof. A command c can only complete in Algorithm 72 after Sl. Hence, 3f + 1 replicas sent a Response(a,OR)r message for c. C includes the local histories of 2f + 1 replicas of which at most f are byzantine. As a result, c and its history is consistently found in at least f + 1 local histories in C.

Lemma 18.15. If a command c is considered complete by a client, command c remains in its place in the history during view changes.

Proof. We have shown in Lemma 18.13 that the most recent commit certificate is contained in C, and hence any command that terminated in Algorithm 73 is included in the new history after a view change. Every command that completed before the last commit certificate Sl is included in the history as a result. Commands that completed in Algorithm 72 after the last commit certificate are supported by at least f + 1 correct replicas as shown in Lemma 18.14. Such commands are added to the new history as described in Algorithm 77. Algorithm 77 adds commands sequentially until the histories become inconsistent. Hence, complete commands are not lost or reordered during a view change.

Theorem 18.16. Zyzzyva is safe even during view changes.

Proof. Complete commands are not reordered within a view as described in Lemma 18.11. Also, no complete command is lost or reordered during a view change as shown in Lemma 18.15. Hence, Zyzzyva is safe.

Remarks:

• So Zyzzyva correctly handles complete commands even in the presence of failures. We also want Zyzzyva to make progress, i.e., commands issued by correct clients should complete eventually.

• If the network is broken or introduces arbitrarily large delays, commands may never complete.

• Can we be sure commands complete in periods in which delays are bounded?

Definition 18.17 (Liveness). We call a system live if every command eventually completes.
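The core loop of Algorithm 77 can be sketched as follows. This is a hedged sketch under simplifying assumptions: the commit-certificate extraction and signature checks are omitted, each replica's history after Sl is modeled as a plain list of commands, and the names `recover_history` and `tails` are ours.

```python
def recover_history(h_l, tails, f):
    """Sketch of Algorithm 77: extend the certified history h_l with the
    commands after S_l, as long as at least f + 1 reported histories agree.
    `tails` maps a replica id to its local history after S_l."""
    h_new = list(h_l)
    replicas = set(tails)   # R: replicas whose reports are still consistent
    k = 0                   # offset of the next command after S_l
    while True:
        # candidate commands reported at position k by the remaining replicas
        votes = {}
        for r in replicas:
            if k < len(tails[r]):
                votes.setdefault(tails[r][k], set()).add(r)
        # keep a command only if at least f + 1 replicas support it
        supported = [(c, rs) for c, rs in votes.items() if len(rs) >= f + 1]
        if not supported:
            return h_new    # histories became inconsistent: stop
        c, rs = supported[0]
        replicas = rs       # remove replicas that do not support c
        h_new.append(c)
        k += 1
```

For example, with f = 1 and the reported tails `["b", "c"]`, `["b", "c"]`, `["b", "x"]`, the recovered history extends h_l by `b` and `c`; the inconsistent `x` is discarded, matching Figure 18.1.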

Lemma 18.18. Zyzzyva is live during periods of synchrony if the primary is correct and a command is requested by a correct client.

Proof. The client receives a Response(a,OR)r message from all correct replicas. If it receives 3f + 1 messages, the command completes immediately in Algorithm 72. If the client receives fewer than 3f + 1 messages, it will at least receive 2f + 1, since there are at most f byzantine replicas. All correct replicas will answer the client’s Commit(S)u message with a correct LocalCommit(S)r message after which the command completes in Algorithm 73.

Lemma 18.19. If, during a period of synchrony, a request does not complete in Algorithm 72 or Algorithm 73, a view change occurs.

Proof. If a command does not complete for a sufficiently long time, the client will resend the R = Request(c,t)u message to all replicas. After that, if a replica’s ConfirmRequest(R)r message is not answered in time by the primary, it broadcasts an IHatePrimaryr message. If a correct replica gathers f + 1 IHatePrimaryr messages, the view change is initiated. If no correct replica collects more than f IHatePrimaryr messages, at least one correct replica received a valid OrderedRequest(hp, c, R)p message from the primary which it forwards to all other replicas. In that case, the client is guaranteed to receive at least 2f + 1 Response(a,OR)r messages from the correct replicas and can complete the command by assembling a commit certificate.

Remarks:

• If the newly elected primary is byzantine, the view change may never terminate. However, we can detect if the new primary does not assemble C correctly as all contained messages are signed. If the primary refuses to assemble C, replicas initiate another view change after a timeout.

Chapter Notes

Algorithm 71 was introduced by Dolev et al. [DFF+82] in 1982. Byzantine fault tolerant state machine replication (BFT) is a problem that gave rise to many different protocols. Castro and Liskov [MC99] introduced the Practical Byzantine Fault Tolerance (PBFT) protocol in 1999, and applications like Farsite [ABC+02] followed. This triggered the development of systems like Q/U [AEMGG+05] and HQ [CML+06], which are quorum-based protocols. Zyzzyva [KAD+07] improved on performance especially in the case of no failures, while Aardvark [CWA+09] improves performance in the presence of failures. Guerraoui et al. [GKQV10] developed a modular system which makes it easier to develop BFT protocols that match specific applications in terms of robustness or best case performance.

Bibliography

[ABC+02] Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon Howell, Jacob R. Lorch, Marvin Theimer, and Roger P. Wattenhofer. Farsite: Federated, available, and reliable storage for an incompletely trusted environment. SIGOPS Oper. Syst. Rev., 36(SI):1–14, December 2002.

[AEMGG+05] Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter, and Jay J. Wylie. Fault-scalable byzantine fault-tolerant services. ACM SIGOPS Operating Systems Review, 39(5):59–74, 2005.

[CML+06] James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba Shrira. HQ replication: A hybrid quorum protocol for byzantine fault tolerance. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, pages 177–190, Berkeley, CA, USA, 2006. USENIX Association.

[CWA+09] Allen Clement, Edmund L. Wong, Lorenzo Alvisi, Michael Dahlin, and Mirco Marchetti. Making byzantine fault tolerant systems tolerate byzantine faults. In NSDI, volume 9, pages 153–168, 2009.

[DFF+82] Danny Dolev, Michael J. Fischer, Rob Fowler, Nancy A. Lynch, and H. Raymond Strong. An efficient algorithm for byzantine agreement without authentication. Information and Control, 52(3):257–274, 1982.

[GKQV10] Rachid Guerraoui, Nikola Knežević, Vivien Quéma, and Marko Vukolić. The next 700 BFT protocols. In Proceedings of the 5th European Conference on Computer Systems, pages 363–376. ACM, 2010.

[KAD+07] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. Zyzzyva: speculative byzantine fault tolerance. In ACM SIGOPS Operating Systems Review, volume 41, pages 45–58. ACM, 2007.

[MC99] Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance. In OSDI, volume 99, pages 173–186, 1999.

Chapter 19

Quorum Systems

What happens if a single server is no longer powerful enough to service all your customers? The obvious choice is to add more servers and to use the majority approach (e.g. Paxos, Chapter 15) to guarantee consistency. However, even if you buy one million servers, a client still has to access more than half of them per request! While you gain fault-tolerance, your efficiency can at most be doubled. Do we have to give up on consistency?

Let us take a step back: We used majorities because majority sets always overlap. But are majority sets the only sets that guarantee overlap? In this chapter we study the theory behind overlapping sets, known as quorum systems.

Definition 19.1 (quorum, quorum system). Let V = {v1, . . . , vn} be a set of nodes. A quorum Q ⊆ V is a subset of these nodes. A quorum system S ⊂ 2^V is a set of quorums s.t. every two quorums intersect, i.e., Q1 ∩ Q2 ≠ ∅ for all Q1, Q2 ∈ S.

Remarks:

• When a quorum system is being used, a client selects a quorum, acquires a lock (or ticket) on all nodes of the quorum, and when done releases all locks again. The idea is that no matter which quorum is chosen, its nodes will intersect with the nodes of every other quorum.

• What can happen if two quorums try to lock their nodes at the same time?

• The simplest quorum system imaginable consists of just one quorum, which in turn just consists of one server. It is known as Singleton.

• In the Majority quorum system, every quorum has ⌊n/2⌋ + 1 nodes.

• Can you think of other simple quorum systems?

19.1 Load and Work

Definition 19.2 (access strategy). An access strategy Z defines the probability PZ(Q) of accessing a quorum Q ∈ S s.t. Σ_{Q∈S} PZ(Q) = 1.

Definition 19.3 (load).

• The load of access strategy Z on a node vi is LZ(vi) = Σ_{Q∈S; vi∈Q} PZ(Q).

• The load induced by access strategy Z on a quorum system S is the maximal load induced by Z on any node in S, i.e., LZ(S) = max_{vi∈S} LZ(vi).

• The load of a quorum system S is L(S) = min_Z LZ(S).

Definition 19.4 (work).

• The work of a quorum Q ∈ S is the number of nodes in Q, W(Q) = |Q|.

• The work induced by access strategy Z on a quorum system S is the expected number of nodes accessed, i.e., WZ(S) = Σ_{Q∈S} PZ(Q) · W(Q).

• The work of a quorum system S is W(S) = min_Z WZ(S).

Remarks:

• Note that you cannot choose different access strategies Z for work and load, you have to pick a single Z for both.

• We illustrate the above concepts with a small example. Let V = {v1, v2, v3, v4, v5} and S = {Q1, Q2, Q3, Q4}, with Q1 = {v1, v2}, Q2 = {v1, v3, v4}, Q3 = {v2, v3, v5}, Q4 = {v2, v4, v5}. If we choose the access strategy Z s.t. PZ(Q1) = 1/2 and PZ(Q2) = PZ(Q3) = PZ(Q4) = 1/6, then the node with the highest load is v2 with LZ(v2) = 1/2 + 1/6 + 1/6 = 5/6, i.e., LZ(S) = 5/6. Regarding work, we have WZ(S) = 1/2 · 2 + 1/6 · 3 + 1/6 · 3 + 1/6 · 3 = 15/6.

• Can you come up with a better access strategy for S?

• If every quorum Q in a quorum system S has the same number of elements, S is called uniform.

• What is the minimum load a quorum system can have?

• A quorum system S is called minimal if ∀Q1, Q2 ∈ S : Q1 ⊄ Q2.

Primary Copy vs. Majority                      Singleton    Majority
How many nodes need to be accessed? (Work)         1          > n/2
What is the load of the busiest node? (Load)       1          > 1/2

Table 19.1: First comparison of the Singleton and Majority quorum systems. Note that the Singleton quorum system can be a good choice when the failure probability of every single node is > 1/2.
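The definitions above can be checked mechanically. The following sketch (helper names are our own) verifies the intersection property of Definition 19.1 and recomputes the worked example from the remarks:

```python
from itertools import combinations

def is_quorum_system(quorums):
    """Check Definition 19.1: every two quorums intersect."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))

def load_and_work(quorums, prob):
    """Load L_Z(S) and work W_Z(S) induced by the access strategy Z = prob."""
    nodes = set().union(*quorums)
    # load of a node: total probability of the quorums containing it
    load = max(sum(p for q, p in zip(quorums, prob) if v in q) for v in nodes)
    # work: expected number of nodes accessed
    work = sum(p * len(q) for q, p in zip(quorums, prob))
    return load, work

# The example from the remarks above (nodes v1..v5 as integers 1..5):
S = [{1, 2}, {1, 3, 4}, {2, 3, 5}, {2, 4, 5}]
Z = [1/2, 1/6, 1/6, 1/6]
assert is_quorum_system(S)
load, work = load_and_work(S, Z)   # load = 5/6 (node v2), work = 15/6
```

Note that a single strategy Z determines both quantities, as the remark points out: changing Z to lower the load of v2 also changes the expected work.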


Theorem 19.5. Let S be a quorum system. Then L(S) ≥ 1/√n holds.

Proof. Let Q = {v1, . . . , vq} be a quorum of minimal size in S, with sizes |Q| = q and |S| = s. Let Z be an access strategy for S. Every other quorum in S intersects in at least one element with this quorum Q. Each time a quorum is accessed, at least one node in Q is accessed as well, yielding a lower bound of LZ(vi) ≥ 1/q for some vi ∈ Q.

Furthermore, as Q is minimal, at least q nodes need to be accessed, yielding W(S) ≥ q. Thus, LZ(vi) ≥ q/n for some vi ∈ Q, as each time q nodes are accessed, the load of the most accessed node is at least q/n.

Combining both ideas leads to LZ(S) ≥ max(1/q, q/n) ⇒ LZ(S) ≥ 1/√n. Thus, L(S) ≥ 1/√n, as Z can be any access strategy.

Remarks:

• Can we achieve this load?

19.2 Grid Quorum Systems

Definition 19.6 (Basic Grid quorum system). Assume √n ∈ N, and arrange the n nodes in a square matrix with side length of √n, i.e., in a grid. The basic Grid quorum system consists of √n quorums, with each containing the full row i and the full column i, for 1 ≤ i ≤ √n.

Figure 19.1: The basic version of the Grid quorum system, where each quorum Qi with 1 ≤ i ≤ √n uses row i and column i. The size of each quorum is 2√n − 1 and two quorums overlap in exactly two nodes. Thus, when the access strategy Z is uniform (i.e., the probability of each quorum is 1/√n), the work is 2√n − 1, and the load of every node is in Θ(1/√n).

Remarks:

• Consider the right picture in Figure 19.1: The two quorums intersect in two nodes. If both quorums were to be accessed at the same time, it is not guaranteed that at least one quorum will lock all of its nodes, as they could enter a deadlock!

• In the case of just two quorums, one could solve this by letting the quorums just intersect in one node, see Figure 19.2. However, already with three quorums the same situation could occur again, progress is not guaranteed!

Figure 19.2: There are other ways to choose quorums in the grid s.t. pairwise different quorums only intersect in one node. The size of each quorum is between √n and 2√n − 1, i.e., the work is in Θ(√n). When the access strategy Z is uniform, the load of every node is in Θ(1/√n).

Algorithm 78 Sequential Locking Strategy for a Quorum Q
1: Attempt to lock the nodes one by one, ordered by their identifiers
2: Should a node be already locked, release all locks and start over

• However, by deviating from the “access all at once” strategy, we can guarantee progress if the nodes are totally ordered!

Theorem 19.7. If each quorum is accessed by Algorithm 78, at least one quorum will obtain a lock for all of its nodes.

Proof. We prove the theorem by contradiction. Assume no quorum can make progress, i.e., for every quorum we have: At least one of its nodes is locked by another quorum. Let v be the node with the highest identifier that is locked by some quorum Q. Observe that Q already locked all of its nodes with a smaller identifier than v, otherwise Q would have restarted. As all nodes with a higher identifier than v are not locked, Q either has locked all of its nodes or can make progress – a contradiction. As the set of nodes is finite, one quorum will eventually be able to lock all of its nodes.

Remarks:

• But now we are back to sequential accesses in a distributed system? Let’s do it concurrently with the same idea, i.e., resolving conflicts by the ordering of the nodes. Then, a quorum that locked the highest identifier so far can always make progress!

Theorem 19.8. If the nodes and quorums use Algorithm 79, at least one quorum will obtain a lock for all of its nodes.
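The Grid construction and the sequential locking strategy of Algorithm 78 can be simulated. The sketch below is our own simplified model: nodes are numbered 0..n−1, the quorums take one lock attempt per round-robin turn, and the simulation stops as soon as one quorum holds all of its nodes (which Theorem 19.7 guarantees to happen).

```python
import math

def basic_grid(n):
    """Basic Grid quorum system (Definition 19.6): quorum i is row i plus
    column i of a sqrt(n) x sqrt(n) grid; nodes are numbered 0..n-1."""
    d = math.isqrt(n)
    assert d * d == n, "n must be a square number"
    return [
        {i * d + c for c in range(d)} | {r * d + i for r in range(d)}
        for i in range(d)
    ]

def sequential_locking(quorums):
    """Sketch of Algorithm 78: each quorum locks its nodes in identifier
    order; on a conflict it releases everything and starts over."""
    locks = {}                          # node -> index of quorum holding it
    held = {i: [] for i in range(len(quorums))}
    plans = {i: sorted(q) for i, q in enumerate(quorums)}
    while True:
        for i, plan in plans.items():
            if len(held[i]) == len(plan):
                return i                # quorum i locked all of its nodes
            nxt = plan[len(held[i])]    # next node in identifier order
            if locks.get(nxt) is None:
                locks[nxt] = i
                held[i].append(nxt)
            else:                       # already locked: release and restart
                for v in held[i]:
                    del locks[v]
                held[i] = []

S = basic_grid(16)
assert all(len(q) == 2 * 4 - 1 for q in S)   # |Q_i| = 2*sqrt(n) - 1 = 7
# two different quorums overlap in exactly two nodes
assert all(len(S[i] & S[j]) == 2 for i in range(4) for j in range(i + 1, 4))
winner = sequential_locking(S)
assert len(S[winner]) == 7
```

The termination argument mirrors the proof of Theorem 19.7: the quorum holding the globally highest locked identifier never restarts, because any node it still needs has an even higher identifier and is therefore free.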

Algorithm 79 Concurrent Locking Strategy for a Quorum Q
Invariant: Let vQ ∈ Q be the highest identifier of a node locked by Q s.t. all nodes vi ∈ Q with vi < vQ are locked by Q as well. Should Q not have any lock, then vQ is set to 0.
1: repeat
2: Attempt to lock all nodes of the quorum Q
3: for each node v ∈ Q that was not able to be locked by Q do
4: exchange vQ and vQ′ with the quorum Q′ that locked v
5: if vQ > vQ′ then
6: Q′ releases lock on v and Q acquires lock on v
7: end if
8: end for
9: until all nodes of the quorum Q are locked

Proof. The proof is analogous to the proof of Theorem 19.7: Assume for contradiction that no quorum can make progress. However, at least the quorum with the highest vQ can always make progress – a contradiction! As the set of nodes is finite, at least one quorum will eventually be able to acquire a lock on all of its nodes.

Remarks:

• What if a quorum locks all of its nodes and then crashes? Is the quorum system dead now? This issue can be prevented by, e.g., using leases instead of locks: leases have a timeout, i.e., a lock is released eventually.

19.3 Fault Tolerance

Definition 19.9 (resilience). If any f nodes from a quorum system S can fail s.t. there is still a quorum Q ∈ S without failed nodes, then S is f-resilient. The largest such f is the resilience R(S).

Theorem 19.10. Let S be a Grid quorum system where each of the √n quorums consists of a full row and a full column. S has a resilience of √n − 1.

Proof. If all √n nodes on the diagonal of the grid fail, then every quorum will have at least one failed node. Should less than √n nodes fail, then there is a row and a column without failed nodes.

Definition 19.11 (failure probability). Assume that every node works with a fixed probability p (in the following we assume concrete values, e.g. p > 1/2). The failure probability Fp(S) of a quorum system S is the probability that at least one node of every quorum fails.

Remarks:

• The asymptotic failure probability is Fp(S) for n → ∞.

Facts 19.12. One version of a Chernoff bound states the following:
Let x1, . . . , xn be independent Bernoulli-distributed random variables with Pr[xi = 1] = pi and Pr[xi = 0] = 1 − pi = qi. Then for X := Σ_{i=1}^{n} xi and μ := E[X] = Σ_{i=1}^{n} pi the following holds:

for all 0 < δ < 1: Pr[X ≤ (1 − δ)μ] ≤ e^{−μδ²/2}.

Theorem 19.13. The asymptotic failure probability of the Majority quorum system is 0.

Proof. In a Majority quorum system each quorum contains exactly ⌊n/2⌋ + 1 nodes and each subset of nodes with cardinality ⌊n/2⌋ + 1 forms a quorum. The Majority quorum system fails, if only ⌊n/2⌋ nodes work. Otherwise there is at least one quorum available. In order to calculate the failure probability we define the following random variables:

xi = 1 if node i works, which happens with probability p; xi = 0 if node i fails, which happens with probability q = 1 − p,

and X := Σ_{i=1}^{n} xi, with μ = np, whereas X corresponds to the number of working nodes. To estimate the probability that the number of working nodes is less than ⌊n/2⌋ + 1 we will make use of the Chernoff inequality from above. By setting δ = 1 − 1/(2p) we obtain Fp(S) = Pr[X ≤ ⌊n/2⌋] ≤ Pr[X ≤ n/2] = Pr[X ≤ (1 − δ)μ].

With δ = 1 − 1/(2p) we have 0 < δ ≤ 1/2 due to 1/2 < p ≤ 1. Thus, we can use the Chernoff bound and get Fp(S) ≤ e^{−μδ²/2} ∈ e^{−Ω(n)}.

Theorem 19.14. The asymptotic failure probability of the Grid quorum system is 1.

Proof. Consider the n = d · d nodes to be arranged in a d × d grid. A quorum always contains one full row. In this estimation we will make use of the Bernoulli inequality which states that for all n ∈ N, x ≥ −1: (1 + x)^n ≥ 1 + nx.

The system fails, if in each row at least one node fails (which happens with probability 1 − p^d for a particular row, as all nodes work with probability p^d). Therefore we can bound the failure probability from below with:

Fp(S) ≥ Pr[at least one failure per row] = (1 − p^d)^d ≥ 1 − dp^d → 1 for n → ∞.

Remarks:

• Now we have a quorum system with optimal load (the Grid) and one with fault-tolerance (Majority), but what if we want both?

Definition 19.15 (B-Grid quorum system). Consider n = dhr nodes, arranged in a rectangular grid with h · r rows and d columns. Each group of r rows is a band, and r elements in a column restricted to a band are called a mini-column. A quorum consists of one mini-column in every band and one element from each mini-column of one band; thus every quorum has d + hr − 1 elements. The B-Grid quorum system consists of all such quorums.

Theorem 19.16. The asymptotic failure probability of the B-Grid quorum system is 0.
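The opposite asymptotic behavior of the Majority and Grid systems (Theorems 19.13 and 19.14) can be observed numerically. The sketch below (function names are ours) computes the exact Majority failure probability as a binomial tail and the Grid lower bound (1 − p^d)^d from the proof above:

```python
from math import comb

def majority_fail(n, p):
    """Majority fails iff at most floor(n/2) nodes work (Theorem 19.13):
    exact binomial tail probability."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(n // 2 + 1))

def grid_fail_lower_bound(d, p):
    """The d x d Grid fails (at least) whenever every row contains a
    failed node, which happens with probability (1 - p^d)^d."""
    return (1 - p**d) ** d

p = 0.9
# Majority: the failure probability vanishes as n grows
assert majority_fail(49, p) < majority_fail(25, p)
# Grid: the lower bound tends to 1 as the grid grows
assert grid_fail_lower_bound(100, p) > 0.99
```

Even with very reliable nodes (p = 0.9), a 100 × 100 Grid almost surely has a failed node in every row, while a Majority system of comparable size almost never loses all quorums.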

Figure 19.3: A B-Grid quorum system with n = 100 nodes, d = 10 columns, h · r = 10 rows, h = 5 bands, and r = 2. The depicted quorum has d + hr − 1 = 10 + 5 · 2 − 1 = 19 nodes. If the access strategy Z is chosen uniformly, then we have a work of d + hr − 1 and a load of (d + hr − 1)/n. By setting d = √n and r = log n, we obtain a work of Θ(√n) and a load of Θ(1/√n).

Proof. Suppose n = dhr and the elements are arranged in a grid with d columns and h · r rows. The B-Grid quorum system fails if in each band a complete mini-column fails, because then it is not possible to choose a band where in each mini-column an element is still working. It also fails if in a band an element in each mini-column fails. Those events may not be independent of each other, but with the help of the union bound, we can upper bound the failure probability with the following equation:

Fp(S) ≤ Pr[in every band a complete mini-column fails] + Pr[in a band at least one element of every mini-column fails]
≤ (d(1 − p)^r)^h + h(1 − p^r)^d

We use d = √n, r = ln d, and 0 ≤ (1 − p) ≤ 1/3. Using n^{ln x} = x^{ln n}, we have d(1 − p)^r ≤ d · d^{ln 1/3} ≈ d^{−0.1}, and hence for large enough d the whole first term is bounded from above by d^{−0.1h} ≪ 1/d² = 1/n.

Regarding the second term, we have p ≥ 2/3, and h = d/ln d < d. Hence we can bound the term from above by d(1 − d^{ln 2/3})^d ≈ d(1 − d^{−0.4})^d. Using (1 + t/n)^n ≤ e^t, we get (again, for large enough d) an upper bound of d(1 − d^{−0.4})^d = d(1 − d^{0.6}/d)^d ≤ d · e^{−d^{0.6}} = d^{(−d^{0.6}/ln d)+1} ≪ d^{−2} = 1/n. In total, we have Fp(S) ∈ O(1/n).

                  Singleton    Majority    Grid         B-Grid*
Work                  1         > n/2      Θ(√n)        Θ(√n)
Load                  1         > 1/2      Θ(1/√n)      Θ(1/√n)
Resilience            0         < n/2      Θ(√n)        Θ(√n)
Failure Prob.**     1 − p        → 0        → 1          → 0

Table 19.2: Overview of the different quorum systems regarding resilience, work, load, and their asymptotic failure probability. The best entries in each row are set in bold.
* Setting d = √n and r = log n
** Assuming probability q = (1 − p) is constant but significantly less than 1/2

19.4 Byzantine Quorum Systems

While failed nodes are bad, they are still easy to deal with: just access another quorum where all nodes can respond! Byzantine nodes make life more difficult, however, as they can pretend to be a regular node, i.e., one needs more sophisticated methods to deal with them. We need to ensure that the intersection of two quorums always contains a non-byzantine (correct) node and, furthermore, the byzantine nodes should not be allowed to infiltrate every quorum. In this section we study three counter-measures of increasing strength, and their implications on the load of quorum systems.

Definition 19.17 (f-disseminating). A quorum system S is f-disseminating if (1) the intersection of two different quorums always contains f + 1 nodes, and (2) for any set of f byzantine nodes, there is at least one quorum without byzantine nodes.

Remarks:

• Thanks to (2), even with f byzantine nodes, the byzantine nodes cannot stop all quorums by just pretending to have crashed. At least one quorum will survive. We will also keep this assumption for the upcoming more advanced byzantine quorum systems.

• Byzantine nodes can also do something worse than crashing – they could falsify data! Nonetheless, due to (1), there is at least one non-byzantine node in every quorum intersection. If the data is self-verifying by, e.g., authentication, then this one node is enough.

• If the data is not self-verifying, then we need another mechanism.

Definition 19.18 (f-masking). A quorum system S is f-masking if (1) the intersection of two different quorums always contains 2f + 1 nodes, and (2) for any set of f byzantine nodes, there is at least one quorum without byzantine nodes.

Remarks:

• Note that except for the second condition, an f-masking quorum system is the same as a 2f-disseminating system. The idea is that the non-byzantine nodes (at least f + 1) can outvote the byzantine ones (at most f), but only if all non-byzantine nodes are up-to-date!

• This raises an issue not covered yet in this chapter. If we access some quorum and update its values, this change still has to be disseminated to the other nodes in the byzantine quorum system. Opaque quorum systems deal with this issue, which are discussed at the end of this section.

• f-disseminating quorum systems need more than 3f nodes and f-masking quorum systems need more than 4f nodes. Essentially, the quorums may not contain too many nodes, and the different intersection properties lead to the different bounds.
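The 2f + 1 intersection condition of Definition 19.18 can be checked for a concrete construction. The sketch below builds one particular family of f-masking Grid quorums (column i plus the f + 1 cyclically consecutive rows starting at row i — one valid choice of quorums, not the only one; the helper name is ours):

```python
from itertools import combinations

def f_masking_grid(d, f):
    """A set of f-masking Grid quorums (Definition 19.21) on a d x d grid:
    quorum i is column i plus rows i, i+1, ..., i+f (mod d).
    Requires 2f + 1 <= d, i.e., 2f + 1 <= sqrt(n)."""
    assert 2 * f + 1 <= d
    quorums = []
    for i in range(d):
        col = {(r, i) for r in range(d)}
        rows = {(r % d, c) for r in range(i, i + f + 1) for c in range(d)}
        quorums.append(col | rows)
    return quorums

d, f = 5, 2                      # matches Figure 19.4: f + 1 = 3 rows
S = f_masking_grid(d, f)
# every two different quorums intersect in at least 2f + 1 nodes, so the
# f + 1 correct, up-to-date nodes can outvote the (at most f) byzantine ones
assert all(len(q1 & q2) >= 2 * f + 1 for q1, q2 in combinations(S, 2))
```

With this particular choice any two quorums even share a full row (two windows of f + 1 consecutive rows out of 2f + 1 must overlap), which is stronger than the 2f + 2 overlap via columns intersecting rows described in Figure 19.4.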

Theorem 19.19. Let S be an f-disseminating quorum system. Then L(S) ≥ √((f + 1)/n) holds.

Theorem 19.20. Let S be an f-masking quorum system. Then L(S) ≥ √((2f + 1)/n) holds.

Proofs of Theorems 19.19 and 19.20. The proofs follow the proof of Theorem 19.5, by observing that now not just one element is accessed from a minimal quorum, but f + 1 or 2f + 1, respectively.

Definition 19.21 (f-masking Grid quorum system). The f-masking Grid quorum system is constructed as the grid quorum system, but each quorum contains one full column and f + 1 rows of nodes, with 2f + 1 ≤ √n.

Figure 19.4: An example how to choose a quorum in the f-masking Grid with f = 2, i.e., 2 + 1 = 3 rows. The load is in Θ(f/√n) when the access strategy is chosen to be uniform. Two quorums overlap by their columns intersecting each other’s rows, i.e., they overlap in at least 2f + 2 nodes.

Remarks:

• The f-masking Grid nearly hits the lower bound for the load of f-masking quorum systems, but not quite. A small change and we will be optimal asymptotically.

Definition 19.22 (M-Grid quorum system). The M-Grid quorum system is constructed as the grid quorum as well, but each quorum contains √(f + 1) rows and √(f + 1) columns of nodes, with f ≤ (√n − 1)/2.

Figure 19.5: An example how to choose a quorum in the M-Grid with f = 3, i.e., 2 rows and 2 columns. The load is in Θ(√(f/n)) when the access strategy is chosen to be uniform. Two quorums overlap with each row intersecting each other’s column, i.e., in 2(√(f + 1))² = 2f + 2 nodes.

Corollary 19.23. The f-masking Grid quorum system and the M-Grid quorum system are f-masking quorum systems.

Remarks:

• We achieved nearly the same load as without byzantine nodes! However, as mentioned earlier, what happens if we access a quorum that is not up-to-date, except for the intersection with an up-to-date quorum? Surely we can fix that as well without too much loss?

• This property will be handled in the last part of this chapter by opaque quorum systems. It will ensure that the number of correct up-to-date nodes accessed will be larger than the number of out-of-date nodes combined with the byzantine nodes in the quorum (cf. (19.24.1)).

Definition 19.24 (f-opaque quorum system). A quorum system S is f-opaque if the following properties hold for any set of f byzantine nodes F and any two different quorums Q1, Q2:

|(Q1 ∩ Q2) \ F| > |(Q2 ∩ F) ∪ (Q2 \ Q1)|   (19.24.1)

(F ∩ Q) = ∅ for some Q ∈ S   (19.24.2)

Figure 19.6: Intersection properties of an opaque quorum system. Equation (19.24.1) ensures that the set of non-byzantine nodes in the intersection of Q1, Q2 is larger than the set of out-of-date nodes, even if the byzantine nodes “team up” with those nodes. Thus, the correct up-to-date value can always be recognized by a majority voting.

Theorem 19.25. Let S be an f-opaque quorum system. Then, n > 5f.

Proof. Due to (19.24.2), there exists a quorum Q1 with size at most n − f. With (19.24.1), |Q1| > f holds. Let F1 be a set of f (byzantine) nodes F1 ⊂ Q1, and with (19.24.2), there exists a Q2 ⊂ V \ F1. Thus, |Q1 ∩ Q2| ≤ n − 2f. With (19.24.1), |Q1 ∩ Q2| > f holds. Thus, one could choose f (byzantine) nodes F2 with F2 ⊂ (Q1 ∩ Q2). Using (19.24.1) one can bound n − 3f from below:

n − 3f > |(Q2 ∩ Q1)| − |F2| ≥ |(Q2 ∩ Q1) ∪ (Q1 ∩ F2)| ≥ |F1| + |F2| = 2f.

Remarks:

• One can extend the Majority quorum system to be f-opaque by setting the size of each quorum to be ⌈(2n + 2f)/3⌉. Then its load is 1/n · ⌈(2n + 2f)/3⌉ ≈ 2/3 + 2f/3n ≥ 2/3.

• Can we do much better? Sadly, no...

Theorem 19.26. Let S be an f-opaque quorum system. Then L(S) ≥ 1/2 holds.

Proof. Equation (19.24.1) implies that for Q1, Q2 ∈ S, the intersection of both Q1, Q2 is at least half their size, i.e., |(Q1 ∩ Q2)| ≥ |Q1|/2. Let S consist of quorums Q1, Q2, . . . . The load induced by an access strategy Z on Q1 is:

Σ_{v∈Q1} Σ_{Qi: v∈Qi} LZ(Qi) = Σ_{Qi} Σ_{v∈(Q1∩Qi)} LZ(Qi) ≥ (|Q1|/2) Σ_{Qi} LZ(Qi) = |Q1|/2.

Using the pigeonhole principle, there must be at least one node in Q1 with load of at least 1/2.

Chapter Notes

Historically, a quorum is the minimum number of members of a deliberative body necessary to conduct the business of that group. Their use has inspired the introduction of quorum systems in computer science since the late 1970s/early 1980s. Early work focused on Majority quorum systems [Lam78, Gif79, Tho79], with the notion of minimality introduced shortly after [GB85]. The Grid quorum system was first considered in [Mae85], with the B-Grid being introduced in [NW94]. The latter article and [PW95] also initiated the study of load and resilience.

The f-masking Grid quorum system and opaque quorum systems are from [MR98], and the M-Grid quorum system was introduced in [MRW97]. Both papers also mark the start of the formal study of Byzantine quorum systems. The f-masking and the M-Grid have asymptotic failure probabilities of 1; more complex systems with better values can be found in these papers as well.

Quorum systems have also been extended to cope with nodes dynamically leaving and joining, see, e.g., the dynamic paths quorum system in [NW05]. For a further overview on quorum systems, we refer to the book by Vukolić [Vuk12] and the article by Merideth and Reiter [MR10].

This chapter was written in collaboration with Klaus-Tycho Förster.

Bibliography

[GB85] Hector Garcia-Molina and Daniel Barbará. How to assign votes in a distributed system. J. ACM, 32(4):841–860, 1985.

[Gif79] David K. Gifford. Weighted voting for replicated data. In Michael D. Schroeder and Anita K. Jones, editors, Proceedings of the Seventh Symposium on Operating System Principles, SOSP 1979, Asilomar Conference Grounds, Pacific Grove, California, USA, 10-12, Decem-

[Lam78] Leslie Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks, 2:95–114, 1978.

[Mae85] Mamoru Maekawa. A √N algorithm for mutual exclusion in decentralized systems. ACM Trans. Comput. Syst., 3(2):145–159, 1985.

[MR98] Dahlia Malkhi and Michael K. Reiter. Byzantine quorum systems. Distributed Computing, 11(4):203–213, 1998.

[MR10] Michael G. Merideth and Michael K. Reiter. Selected results from the latest decade of quorum systems research. In Bernadette Charron-Bost, Fernando Pedone, and André Schiper, editors, Replication: Theory and Practice, volume 5959 of Lecture Notes in Computer Science, pages 185–206. Springer, 2010.

[MRW97] Dahlia Malkhi, Michael K. Reiter, and Avishai Wool. The load and availability of byzantine quorum systems. In James E. Burns and Hagit Attiya, editors, Proceedings of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing, Santa Barbara, California, USA, August 21-24, 1997, pages 249–257. ACM, 1997.

[NW94] Moni Naor and Avishai Wool. The load, capacity and availability of quorum systems. In 35th Annual Symposium on Foundations of Computer Science, Santa Fe, New Mexico, USA, 20-22 November 1994, pages 214–225. IEEE Computer Society, 1994.

[NW05] Moni Naor and Udi Wieder. Scalable and dynamic quorum systems. Distributed Computing, 17(4):311–322, 2005.

[PW95] David Peleg and Avishai Wool. The availability of quorum systems. Inf. Comput., 123(2):210–223, 1995.

[Tho79] Robert H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. Database Syst., 4(2):180–209, 1979.

[Vuk12] Marko Vukolić. Quorum Systems: With Applications to Storage and Consensus. Synthesis Lectures on Distributed Computing Theory. Morgan & Claypool Publishers, 2012.
ber 1979, pages 150–162. ACM, 1979.
Chapter 20

Eventual Consistency & Bitcoin

How would you implement an ATM? Does the following implementation work
satisfactorily?

Algorithm 80 Naïve ATM
1: ATM makes withdrawal request to bank
2: ATM waits for response from bank
3: if balance of customer sufficient then
4: ATM dispenses cash
5: else
6: ATM displays error
7: end if

Remarks:

• A connection problem between the bank and the ATM may block
  Algorithm 80 in Line 2.

• A network partition is a failure where a network splits into at least
  two parts that cannot communicate with each other. Intuitively any
  non-trivial distributed system cannot proceed during a partition and
  maintain consistency. In the following we introduce the tradeoff
  between consistency, availability and partition tolerance.

• There are numerous causes for partitions to occur, e.g., physical
  disconnections, software errors, or incompatible protocol versions.
  From the point of view of a node in the system, a partition is similar
  to a period of sustained message loss.

20.1 Consistency, Availability and Partitions

Definition 20.1 (Consistency). All nodes in the system agree on the current
state of the system.

Definition 20.2 (Availability). The system is operational and instantly
processing incoming requests.

Definition 20.3 (Partition Tolerance). Partition tolerance is the ability of a
distributed system to continue operating correctly even in the presence of a
network partition.

Theorem 20.4 (CAP Theorem). It is impossible for a distributed system to
simultaneously provide Consistency, Availability and Partition Tolerance. A
distributed system can satisfy any two of these but not all three.

Proof. Assume two nodes, sharing some state. The nodes are in different
partitions, i.e., they cannot communicate. Assume a request wants to update the
state and contacts a node. The node may either: 1) update its local state,
resulting in inconsistent states, or 2) not update its local state, i.e., the
system is no longer available for updates. □

Algorithm 81 Partition tolerant and available ATM
1: if bank reachable then
2: Synchronize local view of balances between ATM and bank
3: if balance of customer insufficient then
4: ATM displays error and aborts user interaction
5: end if
6: end if
7: ATM dispenses cash
8: ATM logs withdrawal for synchronization

Remarks:

• Algorithm 81 is partition tolerant and available since it continues to
  process requests even when the bank is not reachable.

• The ATM’s local view of the balances may diverge from the balances as
  seen by the bank, therefore consistency is no longer guaranteed.

• The algorithm will synchronize any changes it made to the local
  balances back to the bank once connectivity is re-established. This is
  known as eventual consistency.

Definition 20.5 (Eventual Consistency). If no new updates to the shared state
are issued, then eventually the system is in a quiescent state, i.e., no more
messages need to be exchanged between nodes, and the shared state is
consistent.

Remarks:

• Eventual consistency is a form of weak consistency.

• Eventual consistency guarantees that the state is eventually agreed
  upon, but the nodes may disagree temporarily.

• During a partition, different updates may semantically conflict with
  each other. A conflict resolution mechanism is required to resolve the
  conflicts and allow the nodes to eventually agree on a common state.
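As a toy illustration of Algorithm 81's behavior, the sketch below keeps the ATM available during a partition by dispensing and logging, and replays the log against the bank once it is reachable again. The class and method names are ours, not part of the lecture notes.

```python
# Sketch of Algorithm 81: an available, partition tolerant ATM that is
# only eventually consistent with the bank. Names are illustrative.

class Bank:
    def __init__(self, balances):
        self.balances = dict(balances)

class ATM:
    def __init__(self, bank):
        self.bank = bank
        self.online = True
        self.local = dict(bank.balances)  # ATM's local view of balances
        self.log = []                     # withdrawals awaiting synchronization

    def withdraw(self, customer, amount):
        if self.online:
            self.local = dict(self.bank.balances)  # synchronize local view
            if self.local.get(customer, 0) < amount:
                return "error"                     # abort user interaction
        # Available and partition tolerant: dispense even while offline.
        self.local[customer] = self.local.get(customer, 0) - amount
        self.log.append((customer, amount))
        if self.online:
            self.sync()
        return "cash"

    def sync(self):
        # Replay logged withdrawals at the bank (eventual consistency).
        for customer, amount in self.log:
            self.bank.balances[customer] -= amount
        self.log.clear()

bank = Bank({"alice": 100})
atm = ATM(bank)
atm.online = False                # network partition: bank unreachable
print(atm.withdraw("alice", 80))  # cash, dispensed despite the partition
atm.online = True
atm.sync()                        # connectivity restored: log is replayed
print(bank.balances["alice"])     # 20
```

Note that while the partition lasts, the ATM's local view and the bank's balances diverge; the replayed log is exactly the conflict-free case of the conflict resolution mentioned in the remarks.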
• One example of eventual consistency is the Bitcoin cryptocurrency
  system.

20.2 Bitcoin

Definition 20.6 (Bitcoin Network). The Bitcoin network is a randomly connected
overlay network of a few thousand nodes, controlled by a variety of owners. All
nodes perform the same operations, i.e., it is a homogeneous network without
central control.

Remarks:

• The lack of structure is intentional: it ensures that an attacker
  cannot strategically position itself in the network and manipulate the
  information exchange. Information is exchanged via a simple
  broadcasting protocol.

Definition 20.7 (Address). Users may generate any number of private keys, from
which a public key is then derived. An address is derived from a public key and
may be used to identify the recipient of funds in Bitcoin. The private/public
key pair is used to uniquely identify the owner of funds of an address.

Remarks:

• The terms public key and address are often used interchangeably, since
  both are public information. The advantage of using an address is that
  its representation is shorter than the public key.

• It is hard to link addresses to the user that controls them, hence
  Bitcoin is often referred to as being pseudonymous.

• Not every user needs to run a fully validating node, and end-users will
  likely use a lightweight client that only temporarily connects to the
  network.

• The Bitcoin network collaboratively tracks the balance in bitcoins of
  each address.

• The address is composed of a network identifier byte, the hash of the
  public key and a checksum. It is commonly stored in base 58 encoding, a
  custom encoding similar to base 64 with some ambiguous symbols removed,
  e.g., the lowercase letter “l” since it is similar to the number “1”.

• The hashing algorithm produces addresses of size 20 bytes. This means
  that there are 2^160 distinct addresses. It might be tempting to brute
  force a target address, however at one billion trials per second one
  still requires approximately 2^45 years in expectation to find a
  matching private/public key pair. Due to the birthday paradox the odds
  improve if instead of brute forcing a single address we attempt to
  brute force any address. While the odds of a successful trial increase
  with the number of addresses, lookups become more costly.

Definition 20.8 (Output). An output is a tuple consisting of an amount of
bitcoins and a spending condition. Most commonly the spending condition
requires a valid signature associated with the private key of an address.

Remarks:

• Spending conditions are scripts that offer a variety of options. Apart
  from a single signature, they may include conditions that require the
  result of a simple computation, or the solution to a cryptographic
  puzzle.

• Outputs exist in two states: unspent and spent. Any output can be spent
  at most once. The address balance is the sum of bitcoin amounts in
  unspent outputs that are associated with the address.

• The set of unspent transaction outputs (UTXO) and some additional
  global parameters is the shared state of Bitcoin. Every node in the
  Bitcoin network holds a complete replica of that state. Local replicas
  may temporarily diverge, but consistency is eventually re-established.

Definition 20.9 (Input). An input is a tuple consisting of a reference to a
previously created output and arguments (signature) to the spending condition,
proving that the transaction creator has the permission to spend the referenced
output.

Definition 20.10 (Transaction). A transaction is a datastructure that describes
the transfer of bitcoins from spenders to recipients. The transaction consists
of a number of inputs and new outputs. The inputs result in the referenced
outputs being spent (removed from the UTXO), and the new outputs being added to
the UTXO.

Remarks:

• Inputs reference the output that is being spent by an (h, i)-tuple,
  where h is the hash of the transaction that created the output, and i
  specifies the index of the output in that transaction.

• Transactions are broadcast in the Bitcoin network and processed by
  every node that receives them.

• Note that the effect of a transaction on the state is deterministic. In
  other words, if all nodes receive the same set of transactions in the
  same order (Definition 15.6), then the state across nodes is
  consistent.

• The outputs of a transaction may assign less than the sum of inputs, in
  which case the difference is called the transaction’s fee. The fee is
  used to incentivize other participants in the system (see Definition
  20.15).

• Notice that so far we only described a local acceptance policy. Nothing
  prevents nodes from locally accepting different transactions that spend
  the same output.
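The spend-once semantics of Definitions 20.8–20.10 can be sketched in a few lines of Python. Signature checking is omitted and all names are illustrative; the point is only that a transaction consumes (h, i)-referenced outputs and creates new ones, and may not create value.

```python
# Toy UTXO model: outputs are amounts keyed by the (h, i)-tuple that
# created them. A transaction is rejected if it references a missing
# (already spent) output or assigns more value than it consumes.

def apply_transaction(utxo, txid, inputs, outputs):
    # inputs: list of (h, i) references; outputs: list of amounts.
    if any(ref not in utxo for ref in inputs):
        return False                      # spent or unknown output
    if sum(utxo[ref] for ref in inputs) < sum(outputs):
        return False                      # outputs exceed inputs
    for ref in inputs:
        del utxo[ref]                     # referenced outputs become spent
    for i, amount in enumerate(outputs):
        utxo[(txid, i)] = amount          # new outputs join the UTXO
    return True

utxo = {("tx0", 0): 50}
print(apply_transaction(utxo, "tx1", [("tx0", 0)], [45]))  # True, fee of 5
print(apply_transaction(utxo, "tx2", [("tx0", 0)], [10]))  # False: doublespend
```

The second call fails precisely because the referenced output has already been removed from the UTXO, which is the local acceptance policy described in the last remark above.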
Algorithm 82 Node Receives Transaction
1: Receive transaction t
2: for each input (h, i) in t do
3: if output (h, i) is not in local UTXO or signature invalid then
4: Drop t and stop
5: end if
6: end for
7: if sum of values of inputs < sum of values of new outputs then
8: Drop t and stop
9: end if
10: for each input (h, i) in t do
11: Remove (h, i) from local UTXO
12: end for
13: Append t to local history
14: Forward t to neighbors in the Bitcoin network

Remarks:

• Transactions are in one of two states: unconfirmed or confirmed.
  Incoming transactions from the broadcast are unconfirmed and added to a
  pool of transactions called the memory pool.

Definition 20.11 (Doublespend). A doublespend is a situation in which multiple
transactions attempt to spend the same output. Only one transaction can be
valid since outputs can only be spent once. When nodes accept different
transactions in a doublespend, the shared state becomes inconsistent.

Remarks:

• Doublespends may occur naturally, e.g., if outputs are co-owned by
  multiple users. However, often doublespends are intentional – we call
  these doublespend-attacks: in a transaction, an attacker pretends to
  transfer an output to a victim, only to doublespend the same output in
  another transaction back to itself.

• Doublespends can result in an inconsistent state since the validity of
  transactions depends on the order in which they arrive. If two
  conflicting transactions are seen by a node, the node considers the
  first to be valid, see Algorithm 82. The second transaction is invalid
  since it tries to spend an output that is already spent. The order in
  which transactions are seen may not be the same for all nodes, hence
  the inconsistent state.

• If doublespends are not resolved, the shared state diverges. Therefore
  a conflict resolution mechanism is needed to decide which of the
  conflicting transactions is to be confirmed (accepted by everybody), to
  achieve eventual consistency.

Definition 20.12 (Proof-of-Work). Proof-of-Work (PoW) is a mechanism that
allows a party to prove to another party that a certain amount of computational
resources has been utilized for a period of time. A function F_d(c, x) →
{true, false}, where difficulty d is a positive number, while challenge c and
nonce x are usually bit-strings, is called a Proof-of-Work function if it has
the following properties:

1. F_d(c, x) is fast to compute if d, c, and x are given.

2. For fixed parameters d and c, finding x such that F_d(c, x) = true is
   computationally difficult but feasible. The difficulty d is used to
   adjust the time to find such an x.

Definition 20.13 (Bitcoin PoW function). The Bitcoin PoW function is given by

    F_d(c, x) → SHA256(SHA256(c | x)) < 2^224 / d.

Remarks:

• This function concatenates the challenge c and nonce x, and hashes them
  twice using SHA256. The output of SHA256 is a cryptographic hash with a
  numeric value in {0, . . . , 2^256 − 1} which is compared to a target
  value 2^224 / d, which gets smaller with increasing difficulty.

• SHA256 is a cryptographic hash function with pseudorandom output. No
  better algorithm is known to find a nonce x such that the function
  F_d(c, x) returns true than simply iterating over possible inputs. This
  is by design to make it difficult to find such an input, but simple to
  verify the validity once it has been found.

• If the PoW functions of all nodes had the same challenge, the fastest
  node would always win. However, as we will see in Definition 20.15,
  each node attempts to find a valid nonce for a node-specific challenge.

Definition 20.14 (Block). A block is a datastructure used to communicate
incremental changes to the local state of a node. A block consists of a list of
transactions, a reference to a previous block and a nonce. A block lists some
transactions the block creator (“miner”) has accepted to its memory-pool since
the previous block. A node finds and broadcasts a block when it finds a valid
nonce for its PoW function.

Algorithm 83 Node Finds Block
1: Nonce x = 0, challenge c, difficulty d, previous block b_{t−1}
2: repeat
3: x = x + 1
4: until F_d(c, x) = true
5: Broadcast block b_t = (memory-pool, b_{t−1}, x)

Remarks:

• With their reference to a previous block, the blocks build a tree,
  rooted in the so-called genesis block.

• The primary goal for using the PoW mechanism is to adjust the rate at
  which blocks are found in the network, giving the network time to
  synchronize on the latest block. Bitcoin sets the difficulty so that
  globally a block is created about every 10 minutes in expectation.
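A minimal sketch of Definition 20.13 and the mining loop of Algorithm 83 follows. The fractional difficulty is chosen only so that the loop terminates after a handful of iterations; Bitcoin's actual difficulty is vastly larger, and its challenge/nonce encoding differs from this toy string encoding.

```python
# Sketch of F_d(c, x) = SHA256(SHA256(c|x)) < 2^224 / d and the mining
# loop of Algorithm 83, at a toy difficulty far below Bitcoin's.
import hashlib

def pow_valid(d, c: bytes, x: int) -> bool:
    # Double SHA256 of challenge|nonce, compared to the target 2^224 / d.
    h = hashlib.sha256(hashlib.sha256(c + str(x).encode()).digest()).digest()
    return int.from_bytes(h, "big") < int(2 ** 224 / d)

def mine(d, c: bytes) -> int:
    # No better known strategy than iterating over possible nonces.
    x = 0
    while not pow_valid(d, c, x):
        x += 1
    return x

c = b"previous-block-hash"
x = mine(1 / 2 ** 28, c)          # toy difficulty: target is 2^252
assert pow_valid(1 / 2 ** 28, c, x)
```

Verification is a single double-hash while finding x takes many attempts, which is exactly the asymmetry Definition 20.12 asks for.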
• Finding a block allows the finder to impose the transactions in its
  local memory pool on all other nodes. Upon receiving a block, all nodes
  roll back any local changes since the previous block and apply the new
  block’s transactions.

• Transactions contained in a block are said to be confirmed by that
  block.

Definition 20.15 (Reward Transaction). The first transaction in a block is
called the reward transaction. The block’s miner is rewarded for confirming
transactions by allowing it to mint new coins. The reward transaction has a
dummy input, and the sum of outputs is determined by a fixed subsidy plus the
sum of the fees of transactions confirmed in the block.

Remarks:

• A reward transaction is the sole exception to the rule that the sum of
  inputs must be at least the sum of outputs.

• The number of bitcoins that are minted by the reward transaction and
  assigned to the miner is determined by a subsidy schedule that is part
  of the protocol. Initially the subsidy was 50 bitcoins for every block,
  and it is being halved every 210,000 blocks, or 4 years in expectation.
  Due to the halving of the block reward, the total amount of bitcoins in
  circulation never exceeds 21 million bitcoins.

• It is expected that the cost of performing the PoW to find a block, in
  terms of energy and infrastructure, is close to the value of the reward
  the miner receives from the reward transaction in the block.

Definition 20.16 (Blockchain). The longest path from the genesis block, i.e.,
root of the tree, to a leaf is called the blockchain. The blockchain acts as a
consistent transaction history on which all nodes eventually agree.

Remarks:

• The path length from the genesis block to block b is the height h_b.

• Only the longest path from the genesis block to a leaf is a valid
  transaction history, since branches may contradict each other because
  of doublespends.

• Since only transactions in the longest path are agreed upon, miners
  have an incentive to append their blocks to the longest chain, thus
  agreeing on the current state.

• The mining incentives quickly increased the difficulty of the PoW
  mechanism: initially miners used CPUs to mine blocks, but CPUs were
  quickly replaced by GPUs, FPGAs and even application specific
  integrated circuits (ASICs) as bitcoins appreciated. This results in an
  equilibrium today in which only the most cost efficient miners, in
  terms of hardware supply and electricity, make a profit in expectation.

• If multiple blocks are mined more or less concurrently, the system is
  said to have forked. Forks happen naturally because mining is a
  distributed random process and two new blocks may be found at roughly
  the same time.

Algorithm 84 Node Receives Block
1: Receive block b
2: For this node the current head is block b_max at height h_max
3: Connect block b in the tree as child of its parent p at height h_b = h_p + 1
4: if h_b > h_max then
5: h_max = h_b
6: b_max = b
7: Compute UTXO for the path leading to b_max
8: Cleanup memory pool
9: end if

Remarks:

• Algorithm 84 describes how a node updates its local state upon
  receiving a block. Notice that, like Algorithm 82, this describes the
  local policy and may also result in node states diverging, i.e., by
  accepting different blocks at the same height as current head.

• Unlike extending the current path, switching paths may result in
  confirmed transactions no longer being confirmed, because the blocks in
  the new path do not include them. Switching paths is referred to as a
  reorg.

• Cleaning up the memory pool involves 1) removing transactions that were
  confirmed in a block in the current path, 2) removing transactions that
  conflict with confirmed transactions, and 3) adding transactions that
  were confirmed in the previous path, but are no longer confirmed in the
  current path.

• In order to avoid having to recompute the entire UTXO at every new
  block being added to the blockchain, all current implementations use
  datastructures that store undo information about the operations applied
  by a block. This allows efficient switching of paths and updates of the
  head by moving along the path.

Theorem 20.17. Forks are eventually resolved and all nodes eventually agree on
which is the longest blockchain. The system therefore guarantees eventual
consistency.

Proof. In order for the fork to continue to exist, pairs of blocks need to be
found in close succession, extending distinct branches, otherwise the nodes on
the shorter branch would switch to the longer one. The probability of branches
being extended almost simultaneously decreases exponentially with the length of
the fork, hence there will eventually be a time when only one branch is being
extended, becoming the longest branch. □
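The head-selection rule of Algorithm 84 (extend the head, or reorg only to a strictly longer branch) can be sketched as follows. This is a toy model with string block ids; a full node would additionally recompute the UTXO and clean its memory pool when switching heads.

```python
# Sketch of the block tree and head selection from Algorithm 84.

class Node:
    def __init__(self):
        self.parent = {"genesis": None}   # block -> parent block
        self.height = {"genesis": 0}
        self.head = "genesis"

    def receive_block(self, block, parent):
        self.parent[block] = parent
        self.height[block] = self.height[parent] + 1
        if self.height[block] > self.height[self.head]:
            self.head = block             # extend, or reorg to longer branch
            # (a full node would recompute the UTXO and clean the pool here)

    def chain(self):
        # The path from the genesis block to the current head.
        path, b = [], self.head
        while b is not None:
            path.append(b)
            b = self.parent[b]
        return list(reversed(path))

n = Node()
n.receive_block("a1", "genesis")
n.receive_block("b1", "genesis")   # fork: same height, head stays at a1
print(n.head)                      # a1
n.receive_block("b2", "b1")        # b-branch is now strictly longer: reorg
print(n.chain())                   # ['genesis', 'b1', 'b2']
```

Two nodes that see a1 and b1 in different orders keep different heads until one branch grows, mirroring the fork resolution of Theorem 20.17.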
20.3 Smart Contracts

Definition 20.18 (Smart Contract). A smart contract is an agreement between
two or more parties, encoded in such a way that the correct execution is
guaranteed by the blockchain.

Remarks:

• Contracts allow business logic to be encoded in Bitcoin transactions
  which mutually guarantee that an agreed upon action is performed. The
  blockchain acts as conflict mediator, should a party fail to honor an
  agreement.

• The use of scripts as spending conditions for outputs enables smart
  contracts. Scripts, together with some additional features such as
  timelocks, allow encoding complex conditions, specifying who may spend
  the funds associated with an output and when.

Definition 20.19 (Timelock). Bitcoin provides a mechanism to make transactions
invalid until some time in the future: timelocks. A transaction may specify a
locktime: the earliest time, expressed in either a Unix timestamp or a
blockchain height, at which it may be included in a block and therefore be
confirmed.

Remarks:

• Transactions with a timelock are not released into the network until
  the timelock expires. It is the responsibility of the node receiving
  the transaction to store it locally until the timelock expires and then
  release it into the network.

• Transactions with future timelocks are invalid. Blocks may not include
  transactions with timelocks that have not yet expired, i.e., they are
  mined before their expiry timestamp or in a lower block than specified.
  If a block includes an unexpired transaction it is invalid. Upon
  receiving invalid transactions or blocks, nodes discard them
  immediately and do not forward them to their peers.

• Timelocks can be used to replace or supersede transactions: a
  timelocked transaction t1 can be replaced by another transaction t0,
  spending some of the same outputs, if the replacing transaction t0 has
  an earlier timelock and can be broadcast in the network before the
  replaced transaction t1 becomes valid.

Definition 20.20 (Singlesig and Multisig Outputs). When an output can be
claimed by providing a single signature it is called a singlesig output. In
contrast the script of multisig outputs specifies a set of m public keys and
requires k-of-m (with k ≤ m) valid signatures from distinct matching public
keys from that set in order to be valid.

Remarks:

• Most smart contracts begin with the creation of a 2-of-2 multisig
  output, requiring a signature from both parties. Once the transaction
  creating the multisig output is confirmed in the blockchain, both
  parties are guaranteed that the funds of that output cannot be spent
  unilaterally.

Algorithm 85 Parties A and B create a 2-of-2 multisig output o
1: B sends a list I_B of inputs with c_B coins to A
2: A selects its own inputs I_A with c_A coins
3: A creates transaction t_s {[I_A, I_B], [o = c_A + c_B → (A, B)]}
4: A creates timelocked transaction t_r {[o], [c_A → A, c_B → B]} and signs it
5: A sends t_s and t_r to B
6: B signs both t_s and t_r and sends them to A
7: A signs t_s and broadcasts it to the Bitcoin network

Remarks:

• t_s is called a setup transaction and is used to lock in funds into a
  shared account. If t_s were signed and broadcast immediately, one of
  the parties could refuse to collaborate in spending the multisig
  output, and the funds would become unspendable. To avoid a situation
  where the funds cannot be spent, the protocol also creates a timelocked
  refund transaction t_r which guarantees that, should the funds not be
  spent before the timelock expires, the funds are returned to the
  respective party. At no point in time does one of the parties hold a
  fully signed setup transaction without the other party holding a fully
  signed refund transaction, guaranteeing that funds are eventually
  returned.

• Both transactions require the signature of both parties. In the case of
  the setup transaction this is because it has two inputs, from A and B
  respectively, which require individual signatures. In the case of the
  refund transaction the single input spending the multisig output
  requires both signatures, o being a 2-of-2 multisig output.

Algorithm 86 Simple Micropayment Channel from S to R with capacity c
1: c_S = c, c_R = 0
2: S and R use Algorithm 85 to set up output o with value c from S
3: Create settlement transaction t_f {[o], [c_S → S, c_R → R]}
4: while channel open and c_R < c do
5: In exchange for good with value δ
6: c_R = c_R + δ
7: c_S = c_S − δ
8: Update t_f with outputs [c_R → R, c_S → S]
9: S signs and sends t_f to R
10: end while
11: R signs last t_f and broadcasts it
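The balance bookkeeping of Algorithm 86 in miniature: signatures and the multisig setup of Algorithm 85 are omitted, and the class below is an illustrative model, not Bitcoin's actual data format.

```python
# Sketch of a simple micropayment channel: each payment of value delta
# shifts delta from the spender's side to the recipient's side of the
# settlement transaction; only the last settlement is ever broadcast.

class Channel:
    def __init__(self, capacity):
        self.c_s, self.c_r = capacity, 0      # spender / recipient balances
        self.settlements = []                 # every version S has signed

    def pay(self, delta):
        if delta > self.c_s:
            raise ValueError("channel capacity exhausted")
        self.c_s -= delta
        self.c_r += delta
        self.settlements.append((self.c_s, self.c_r))  # S signs, sends to R

    def close(self):
        # R broadcasts the settlement with maximum payout for her,
        # which is always the most recent one.
        return max(self.settlements, key=lambda t: t[1])

ch = Channel(100)
ch.pay(30)
ch.pay(25)
print(ch.close())   # (45, 55)
```

The `close` rule makes the unidirectionality discussed below concrete: since R picks the settlement maximizing her payout, any update lowering c_r would simply never be broadcast.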
Remarks:

• Algorithm 86 implements a Simple Micropayment Channel, a smart contract
  that is used for rapidly adjusting micropayments from a spender to a
  recipient. Only two transactions are ever broadcast and inserted into
  the blockchain: the setup transaction t_s and the last settlement
  transaction t_f. There may have been any number of updates to the
  settlement transaction, transferring ever more of the shared output to
  the recipient.

• The number of bitcoins c used to fund the channel is also the maximum
  total that may be transferred over the simple micropayment channel.

• At any time the recipient R is guaranteed to eventually receive the
  bitcoins, since she holds a fully signed settlement transaction, while
  the spender only has partially signed ones.

• The simple micropayment channel is intrinsically unidirectional. Since
  the recipient may choose any of the settlement transactions in the
  protocol, she will use the one with the maximum payout for her. If we
  were to transfer bitcoins back, we would be reducing the amount paid
  out to the recipient, hence she would choose not to broadcast that
  transaction.

20.4 Weak Consistency

Eventual consistency is only one form of weak consistency. A number of
different tradeoffs between partition tolerance and consistency exist in the
literature.

Definition 20.21 (Monotonic Read Consistency). If a node u has seen a
particular value of an object, any subsequent accesses of u will never return
any older values.

Remarks:

• Users are annoyed if they receive a notification about a comment on an
  online social network, but are unable to reply because the web
  interface does not show the same notification yet. In this case the
  notification acts as the first read operation, while looking up the
  comment on the web interface is the second read operation.

Definition 20.22 (Monotonic Write Consistency). A write operation by a node on
a data item is completed before any successive write operation by the same node
(i.e., the system guarantees to serialize writes by the same node).

Remarks:

• The ATM must replay all operations in order, otherwise it might happen
  that an earlier operation overwrites the result of a later operation,
  resulting in an inconsistent final state.

Definition 20.23 (Read-Your-Write Consistency). After a node u has updated a
data item, any later reads from node u will never see an older value.

Definition 20.24 (Causal Relation). The following pairs of operations are said
to be causally related:

• Two writes by the same node to different variables.

• A read followed by a write of the same node.

• A read that returns the value of a write from any node.

• Two operations that are transitively related according to the above
  conditions.

Remarks:

• The first rule ensures that writes by a single node are seen in the
  same order. For example, a node may write a value in one variable and
  then signal that it has written the value by writing in another
  variable. Another node could then read the signalling variable but
  still read the old value from the first variable if the two writes were
  not causally related.

Definition 20.25 (Causal Consistency). A system provides causal consistency if
operations that potentially are causally related are seen by every node of the
system in the same order. Concurrent writes are not causally related, and may
be seen in different orders by different nodes.

Chapter Notes

The CAP theorem was first introduced by Fox and Brewer [FB99], although it is
commonly attributed to a talk by Eric Brewer [Bre00]. It was later proven by
Gilbert and Lynch [GL02] for the asynchronous model. Gilbert and Lynch also
showed how to relax the consistency requirement in a partially synchronous
system to achieve availability and partition tolerance.

Bitcoin was introduced in late 2008 by Satoshi Nakamoto [Nak08]. Nakamoto is
thought to be a pseudonym used by either a single person or a group of people;
it is still unknown who invented Bitcoin, giving rise to speculation and
conspiracy theories. Among the plausible theories are noted cryptographers Nick
Szabo [Big13] and Hal Finney [Gre14]. The first Bitcoin client was published
shortly after the paper and the first block was mined on January 3, 2009. The
genesis block contained the headline of the release date’s The Times issue “The
Times 03/Jan/2009 Chancellor on brink of second bailout for banks”, which
serves as proof that the genesis block has indeed been mined on that date, and
that no one had mined before that date. The quote in the genesis block is also
thought to be an ideological hint: Bitcoin was created in a climate of
financial crisis, induced by rampant manipulation by the banking sector, and
Bitcoin quickly grew in popularity in anarchic and libertarian circles. The
original client is nowadays maintained by a group of independent core
developers and remains the most used client in the Bitcoin network.

Central to Bitcoin is the resolution of conflicts due to doublespends, which is
solved by waiting for transactions to be included in the blockchain. This
however introduces large delays for the confirmation of payments which are
undesirable in some scenarios in which an immediate confirmation is required.
Karame et al. [KAC12] show that accepting unconfirmed transactions leads to a
non-negligible probability of being defrauded as a result of a doublespending
attack. This is facilitated by information eclipsing [DW13], i.e., that nodes
do not forward conflicting transactions, hence the victim does not see both
transactions of the doublespend. Bamert et al. [BDE+13] showed that the odds of
detecting a doublespending attack in real-time can be improved by connecting to
a large sample of nodes and tracing the propagation of transactions in the
network.

Bitcoin does not scale very well due to its reliance on confirmations in the
blockchain. A copy of the entire transaction history is stored on every node in
order to bootstrap joining nodes, which have to reconstruct the transaction
history from the genesis block. Simple micropayment channels were introduced by
Hearn and Spilman [HS12] and may be used to bundle multiple transfers between
two parties, but they are limited to transferring the funds locked into the
channel once. Recently Duplex Micropayment Channels [DW15] and the Lightning
Network [PD15] have been proposed to build bidirectional micropayment channels
in which the funds can be transferred back and forth an arbitrary number of
times, greatly increasing the flexibility of Bitcoin transfers and enabling a
number of features, such as micropayments and routing payments between any two
endpoints.

This chapter was written in collaboration with Christian Decker.

Bibliography

[BDE+13] Tobias Bamert, Christian Decker, Lennart Elsen, Samuel Welten, and
Roger Wattenhofer. Have a snack, pay with bitcoin. In IEEE International
Conference on Peer-to-Peer Computing (P2P), Trento, Italy, 2013.

[Big13] John Biggs. Who is the real Satoshi Nakamoto? One researcher may have
found the answer. http://on.tcrn.ch/l/R0vA, 2013.

[Bre00] Eric A. Brewer. Towards robust distributed systems. In Symposium on
Principles of Distributed Computing (PODC). ACM, 2000.

[DW13] Christian Decker and Roger Wattenhofer. Information propagation in the
bitcoin network. In IEEE International Conference on Peer-to-Peer Computing
(P2P), Trento, Italy, September 2013.

[DW15] Christian Decker and Roger Wattenhofer. A fast and scalable payment
network with Bitcoin duplex micropayment channels. In Symposium on
Stabilization, Safety, and Security of Distributed Systems (SSS), 2015.

[FB99] Armando Fox and Eric Brewer. Harvest, yield, and scalable tolerant
systems. In Hot Topics in Operating Systems. IEEE, 1999.

[GL02] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility
of consistent, available, partition-tolerant web services. SIGACT News, 2002.

[Gre14] Andy Greenberg. Nakamoto’s neighbor: My hunt for bitcoin’s creator led
to a paralyzed crypto genius. http://onforb.es/1rvyecq, 2014.

[HS12] Mike Hearn and Jeremy Spilman. Contract: Rapidly adjusting
micro-payments. https://en.bitcoin.it/wiki/Contract, 2012. Last accessed on
November 11, 2015.

[KAC12] G.O. Karame, E. Androulaki, and S. Capkun. Two bitcoins at the price
of one? Double-spending attacks on fast payments in Bitcoin. In Conference on
Computer and Communication Security, 2012.

[Nak08] Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system.
https://bitcoin.org/bitcoin.pdf, 2008.

[PD15] Joseph Poon and Thaddeus Dryja. The Bitcoin Lightning Network, 2015.
Chapter 21

Distributed Storage

"Indeed, I believe that virtually every important aspect of programming arises
somewhere in the context of [sorting and] searching!"

– Donald E. Knuth, The Art of Computer Programming

How do you store 1M movies, each with a size of about 1GB, on 1M nodes,
each equipped with a 1TB disk? Simply store the movies on the nodes,
arbitrarily, and memorize (with a global index) which movie is stored on which
node. What if the set of movies or nodes changes over time, and you do not
want to change your global index too often?

21.1 Consistent Hashing

Several variants of hashing will do the job, e.g. consistent hashing:

Algorithm 87 Consistent Hashing
1: Hash the unique file name of each movie m with a known set of hash
   functions h_i(m) → [0, 1), for i = 1, . . . , k
2: Hash the unique name (e.g., IP address and port number) of each node with
   the same set of hash functions h_i, for i = 1, . . . , k
3: Store a copy of a movie x on node u if h_i(x) ≈ h_i(u), for any i. More
   formally, store movie x on node u if

   |h_i(x) − h_i(u)| = min_m {|h_i(m) − h_i(u)|}, for any i

Theorem 21.1 (Consistent Hashing). In expectation, Algorithm 87 stores each
movie kn/m times.

Proof. While it is possible that some movie does not hash closest to a node for
any of its hash functions, this is highly unlikely: For each node (n) and each
hash function (k), each movie has about the same probability (1/m) to be
stored. By linearity of expectation, a movie is stored kn/m times, in
expectation.

Remarks:

• Let us do a back-of-the-envelope calculation. We have m = 1M movies,
  n = 1M nodes, each node has storage for 1TB/1GB = 1K movies, i.e., we use
  k = 1K hash functions. Theorem 21.1 shows that each movie is stored about
  1K times. With a bit more math one can show that it is highly unlikely that
  a movie is stored much less often than its expected value.

• Instead of storing movies directly on nodes as in Algorithm 87, we can
  also store the movies on any nodes we like. The nodes of Algorithm 87 then
  simply store forward pointers to the actual movie locations.

• In this chapter we want to push unreliability to the extreme. What if the
  nodes are so unreliable that on average a node is only available for 1
  hour? In other words, nodes exhibit a high churn, they constantly join and
  leave the distributed system.

• With such a high churn, hundreds or thousands of nodes will change every
  second. No single node can have an accurate picture of what other nodes
  are currently in the system. This is remarkably different to classic
  distributed systems, where a single unavailable node may already be a
  minor disaster: all the other nodes have to get a consistent view
  (Definition 18.3) of the system again. In high churn systems it is
  impossible to have a consistent view at any time.

• Instead, each node will just know about a small subset of 100 or less
  other nodes ("neighbors"). This way, nodes can withstand high churn
  situations.

• On the downside, nodes will not directly know which node is responsible
  for what movie. Instead, a node searching for a movie might have to ask a
  neighbor node, which in turn will recursively ask another neighbor node,
  until the correct node storing the movie (or a forward pointer to the
  movie) is found. The nodes of our distributed storage system form a
  virtual network, also called an overlay network.

21.2 Hypercubic Networks

In this section we present a few overlay topologies of general interest.

Definition 21.2 (Topology Properties). Our virtual network should have the
following properties:

• The network should be (somewhat) homogeneous: no node should play a
  dominant role, no node should be a single point of failure.

• The nodes should have IDs, and the IDs should span the universe [0, 1),
  such that we can store data with hashing, as in Algorithm 87.

• Every node should have a small degree, if possible polylogarithmic in n,
  the number of nodes. This will allow every node to maintain a persistent
  connection with each neighbor, which will help us to deal with churn.
• The network should have a small diameter, and routing should be easy. If a
  node does not have the information about a data item, then it should know
  which neighbor to ask. Within a few (polylogarithmic in n) hops, one
  should find the node that has the correct information.

Figure 21.1: The structure of a fat tree.

Remarks:

• Some basic network topologies used in practice are trees, rings, grids or
  tori. Many other suggested networks are simply combinations or derivatives
  of these.

• The advantage of trees is that the routing is very easy: for every
  source-destination pair there is only one path. However, since the root of
  a tree is a bottleneck, trees are not homogeneous. Instead, so-called fat
  trees should be used. Fat trees have the property that every edge
  connecting a node v to its parent u has a capacity that is proportional to
  the number of leaves of the subtree rooted at v. See Figure 21.1 for a
  picture.

• Fat trees belong to a family of networks that require edges of non-uniform
  capacity to be efficient. Networks with edges of uniform capacity are
  easier to build. This is usually the case for grids and tori. Unless
  explicitly mentioned, we will treat all edges in the following to be of
  capacity 1.

Definition 21.3 (Torus, Mesh). Let m, d ∈ N. The (m, d)-mesh M(m, d) is a
graph with node set V = [m]^d and edge set

E = { {(a_1, . . . , a_d), (b_1, . . . , b_d)} | a_i, b_i ∈ [m],
      Σ_{i=1}^{d} |a_i − b_i| = 1 },

where [m] means the set {0, . . . , m − 1}. The (m, d)-torus T(m, d) is a
graph that consists of an (m, d)-mesh and additionally wrap-around edges from
nodes (a_1, . . . , a_{i−1}, m − 1, a_{i+1}, . . . , a_d) to nodes
(a_1, . . . , a_{i−1}, 0, a_{i+1}, . . . , a_d) for all i ∈ {1, . . . , d} and
all a_j ∈ [m] with j ≠ i. In other words, we take the expression a_i − b_i in
the sum modulo m prior to computing the absolute value. M(m, 1) is also called
a path, T(m, 1) a cycle, and M(2, d) = T(2, d) a d-dimensional hypercube.
Figure 21.2 presents a linear array, a torus, and a hypercube.

Figure 21.2: The structure of M(m, 1), T(4, 2), and M(2, 3).

Remarks:

• Routing on a mesh, torus, or hypercube is trivial. On a d-dimensional
  hypercube, to get from a source bitstring s to a target bitstring t one
  only needs to fix each "wrong" bit, one at a time; in other words, if the
  source and the target differ by k bits, there are k! routes with k hops.

• If you put a dot in front of the d-bit ID of each node, the nodes exactly
  span the d-bit IDs [0, 1).

• The Chord architecture is a close relative of the hypercube, basically a
  less rigid hypercube. The hypercube connects every node with an ID in
  [0, 1) with every node in exactly distance 2^{−i}, i = 1, 2, . . . , d in
  [0, 1). Chord instead connects nodes with approximately distance 2^{−i}.

• The hypercube has many derivatives, the so-called hypercubic networks.
  Among these are the butterfly, cube-connected-cycles, shuffle-exchange,
  and de Bruijn graph. We start with the butterfly, which is basically a
  "rolled out" hypercube.

Definition 21.4 (Butterfly). Let d ∈ N. The d-dimensional butterfly BF(d) is
a graph with node set V = [d + 1] × [2]^d and an edge set E = E_1 ∪ E_2 with

E_1 = { {(i, α), (i + 1, α)} | i ∈ [d], α ∈ [2]^d }

and

E_2 = { {(i, α), (i + 1, β)} | i ∈ [d], α, β ∈ [2]^d, α and β differ
        only at the i-th position, i.e., |α − β| = 2^i }.

A node set {(i, α) | α ∈ [2]^d} is said to form level i of the butterfly. The
d-dimensional wrap-around butterfly W-BF(d) is defined by taking the BF(d)
and having (d, α) = (0, α) for all α ∈ [2]^d.
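The size and degree of the butterfly can be checked mechanically. The following Python sketch (illustrative, not part of the original notes) builds BF(d) from the definition, encoding each bitstring α as an integer in [0, 2^d) and adding a cross edge whenever α and β differ exactly at bit i:

```python
def butterfly(d):
    """Build BF(d): nodes (i, a) with level i in [d+1] and bitstring a
    encoded as an integer in [0, 2^d)."""
    nodes = [(i, a) for i in range(d + 1) for a in range(2 ** d)]
    edges = set()
    for i in range(d):
        for a in range(2 ** d):
            edges.add(((i, a), (i + 1, a)))             # straight edge (E1)
            edges.add(((i, a), (i + 1, a ^ (1 << i))))  # cross edge (E2): flip bit i
    return nodes, edges

d = 3
nodes, edges = butterfly(d)
deg = {v: 0 for v in nodes}
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

assert len(nodes) == (d + 1) * 2 ** d   # (d + 1) * 2^d nodes
assert len(edges) == 2 * d * 2 ** d     # 2d * 2^d edges
assert max(deg.values()) == 4           # constant degree (4 on inner levels)
```

For d = 3 this confirms 32 nodes, 48 edges, and maximum degree 4.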
Remarks:

• Figure 21.3 shows the 3-dimensional butterfly BF(3). The BF(d) has
  (d + 1)2^d nodes, 2d · 2^d edges and degree 4. It is not difficult to
  check that combining the node sets {(i, α) | i ∈ [d]} for all α ∈ [2]^d
  into a single node results in the hypercube.

• Butterflies have the advantage of a constant node degree over hypercubes,
  whereas hypercubes feature more fault-tolerant routing.

• You may have seen butterfly-like structures before, e.g. sorting networks,
  communication switches, data center networks, fast Fourier transform
  (FFT). The Benes network (telecommunication) is nothing but two
  back-to-back butterflies. The Clos network (data centers) is a close
  relative of butterflies too. Actually, merging the 2^i nodes on level i
  that share the first d − i bits into a single node, the butterfly becomes
  a fat tree. Every year there are new applications for which hypercubic
  networks are the perfect solution!

• Next we define the cube-connected-cycles network. It only has a degree of
  3 and it results from the hypercube by replacing the corners by cycles.

Figure 21.3: The structure of BF(3).

Definition 21.5 (Cube-Connected-Cycles). Let d ∈ N. The
cube-connected-cycles network CCC(d) is a graph with node set
V = {(a, p) | a ∈ [2]^d, p ∈ [d]} and edge set

E = { {(a, p), (a, (p + 1) mod d)} | a ∈ [2]^d, p ∈ [d] }
  ∪ { {(a, p), (b, p)} | a, b ∈ [2]^d, p ∈ [d], a = b except for a_p }.

Figure 21.4: The structure of CCC(3).

Remarks:

• Two possible representations of a CCC can be found in Figure 21.4.

• The shuffle-exchange is yet another way of transforming the hypercubic
  interconnection structure into a constant degree network.

Definition 21.6 (Shuffle-Exchange). Let d ∈ N. The d-dimensional
shuffle-exchange SE(d) is defined as an undirected graph with node set
V = [2]^d and an edge set E = E_1 ∪ E_2 with

E_1 = { {(a_1, . . . , a_d), (a_1, . . . , ā_d)} | (a_1, . . . , a_d) ∈ [2]^d,
        ā_d = 1 − a_d }

and

E_2 = { {(a_1, . . . , a_d), (a_d, a_1, . . . , a_{d−1})} |
        (a_1, . . . , a_d) ∈ [2]^d }.

Figure 21.5 shows the 3- and 4-dimensional shuffle-exchange graph.

Figure 21.5: The structure of SE(3) and SE(4).

Definition 21.7 (DeBruijn). The b-ary DeBruijn graph of dimension d,
DB(b, d), is an undirected graph G = (V, E) with node set V = {v ∈ [b]^d}
and edge set E that contains all edges {v, w} with the property that
w ∈ {(x, v_1, . . . , v_{d−1}) : x ∈ [b]}, where v = (v_1, . . . , v_d).
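The DeBruijn edges follow directly from the definition by prepending one symbol and dropping the last. A small illustrative Python sketch (not from the original notes; self-loops such as 000–000 are dropped):

```python
from itertools import product

def debruijn(b, d):
    """DB(b, d): node v = (v1, ..., vd) is adjacent to every
    w = (x, v1, ..., v_{d-1}) with x in [b]."""
    V = list(product(range(b), repeat=d))
    E = set()
    for v in V:
        for x in range(b):
            w = (x,) + v[:-1]           # shift in symbol x, drop the last
            if w != v:                  # drop self-loops
                E.add(frozenset((v, w)))
    return V, E

V, E = debruijn(2, 3)
assert len(V) == 8                      # the 2^3 nodes of DB(2, 3)
# shifting in one symbol gives at most b neighbors per direction,
# so the degree is bounded by 2b = 4
assert all(sum(1 for e in E if v in e) <= 4 for v in V)
```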
Figure 21.6: The structure of DB(2, 2) and DB(2, 3).

Remarks:

• Two examples of a DeBruijn graph can be found in Figure 21.6.

• There are some data structures which also qualify as hypercubic networks.
  An example of a hypercubic network is the skip list, the balanced binary
  search tree for the lazy programmer:

Definition 21.8 (Skip List). The skip list is an ordinary ordered linked list
of objects, augmented with additional forward links. The ordinary linked list
is the level 0 of the skip list. In addition, every object is promoted to
level 1 with probability 1/2. As for level 0, all level 1 objects are
connected by a linked list. In general, every object on level i is promoted
to the next level with probability 1/2. A special start-object points to the
smallest/first object on each level.

Remarks:

• Search, insert, and delete can be implemented in O(log n) expected time in
  a skip list, simply by jumping from higher levels to lower ones when
  overshooting the searched position. Also, the amortized memory cost of
  each object is constant, as on average an object only has two forward
  links.

• The randomization can easily be discarded, by deterministically promoting
  a constant fraction of objects of level i to level i + 1, for all i. When
  inserting or deleting, object o simply checks whether its left and right
  level i neighbors are being promoted to level i + 1. If none of them is,
  promote object o itself. Essentially we establish a maximal independent
  set (MIS) on each level, hence at least every third and at most every
  second object is promoted.

• There are obvious variants of the skip list, e.g., the skip graph. Instead
  of promoting only half of the nodes to the next level, we always promote
  all the nodes, similarly to a balanced binary tree: All nodes are part of
  the root level of the binary tree. Half the nodes are promoted left, and
  half the nodes are promoted right, on each level. Hence on level i we have
  2^i lists (or, if we connect the last element again with the first: rings)
  of about n/2^i objects. The skip graph features all the properties of
  Definition 21.2.

• More generally, how are degree and diameter of Definition 21.2 related?
  The following theorem gives a general lower bound.

Theorem 21.9. Every graph of maximum degree d > 2 and size n must have a
diameter of at least ⌈(log n)/(log(d − 1))⌉ − 2.

Proof. Suppose we have a graph G = (V, E) of maximum degree d and size n.
Start from any node v ∈ V. In a first step at most d other nodes can be
reached. In two steps at most d · (d − 1) additional nodes can be reached.
Thus, in general, in at most k steps at most

1 + Σ_{i=0}^{k−1} d · (d − 1)^i = 1 + d · ((d − 1)^k − 1)/((d − 1) − 1)
                                ≤ d · (d − 1)^k / (d − 2)

nodes (including v) can be reached. This has to be at least n to ensure that
v can reach all other nodes in V within k steps. Hence,

(d − 1)^k ≥ ((d − 2) · n)/d  ⇔  k ≥ log_{d−1}((d − 2) · n/d).

Since log_{d−1}((d − 2)/d) > −2 for all d > 2, this is true only if
k ≥ ⌈(log n)/(log(d − 1))⌉ − 2.

Remarks:

• In other words, constant-degree hypercubic networks feature an
  asymptotically optimal diameter.

• Other hypercubic graphs manage to have a different tradeoff between node
  degree and diameter. The pancake graph, for instance, minimizes the
  maximum of these with d = k = Θ(log n/ log log n). The ID of a node u in
  the pancake graph of dimension d is an arbitrary permutation of the
  numbers 1, 2, . . . , d. Two nodes u, v are connected by an edge if one
  can get the ID of node v by taking the ID of node u, and reversing
  (flipping) the first i numbers of u's ID. For example, in dimension d = 4,
  nodes u = 2314 and v = 1324 are neighbors.

• There are a few other interesting graph classes which are not hypercubic
  networks, but nevertheless seem to relate to the properties of Definition
  21.2. Small-world graphs (a popular representation for social networks)
  also have small diameter, however, in contrast to hypercubic networks,
  they are not homogeneous and feature nodes with large degrees.

• Expander graphs (an expander graph is a sparse graph which has good
  connectivity properties, that is, from every not too large subset of nodes
  you are connected to an even larger set of nodes) are homogeneous, have a
  low degree and small diameter. However, expanders are often not routable.

21.3 DHT & Churn

Definition 21.10 (Distributed Hash Table (DHT)). A distributed hash table
(DHT) is a distributed data structure that implements a distributed storage.
A DHT should support at least (i) a search (for a key) and (ii) an insert
(key, object) operation, possibly also (iii) a delete (key) operation.
Remarks:

• A DHT has many applications beyond storing movies, e.g., the Internet
  domain name system (DNS) is essentially a DHT.

• A DHT can be implemented as a hypercubic overlay network with nodes having
  identifiers such that they span the ID space [0, 1).

• A hypercube can directly be used for a DHT. Just use a globally known set
  of hash functions h_i, mapping movies to bit strings with d bits.

• Other hypercubic structures may be a bit more intricate when used as a
  DHT: The butterfly network, for instance, may directly use the d + 1
  layers for replication, i.e., all the d + 1 nodes are responsible for the
  same ID.

• Other hypercubic networks, e.g. the pancake graph, might need a bit of
  twisting to find appropriate IDs.

• We assume that a joining node knows a node which already belongs to the
  system. This is known as the bootstrap problem. Typical solutions are: If
  a node has been connected with the DHT previously, just try some of these
  previous nodes. Or the node may ask some authority for a list of IP
  addresses (and ports) of nodes that are regularly part of the DHT.

• Many DHTs in the literature are analyzed against an adversary that can
  crash a fraction of random nodes. After crashing a few nodes the system is
  given sufficient time to recover again. However, this seems unrealistic.
  The scheme sketched in this section significantly differs from this in two
  major aspects.

• First, we assume that joins and leaves occur in a worst-case manner. We
  think of an adversary that can remove and add a bounded number of nodes;
  the adversary can choose which nodes to crash and how nodes join.

• Second, the adversary does not have to wait until the system is recovered
  before it crashes the next batch of nodes. Instead, the adversary can
  constantly crash nodes, while the system is trying to stay alive. Indeed,
  the system is never fully repaired but always fully functional. In
  particular, the system is resilient against an adversary that continuously
  attacks the "weakest part" of the system. The adversary could for example
  insert a crawler into the DHT, learn the topology of the system, and then
  repeatedly crash selected nodes, in an attempt to partition the DHT. The
  system counters such an adversary by continuously moving the remaining or
  newly joining nodes towards the areas under attack.

• Clearly, we cannot allow the adversary to have unbounded capabilities. In
  particular, in any constant time interval, the adversary can at most add
  and/or remove O(log n) nodes, n being the total number of nodes currently
  in the system. This model covers an adversary which repeatedly takes down
  nodes by a distributed denial of service attack, however only a
  logarithmic number of nodes at each point in time. The algorithm relies on
  messages being delivered in a timely manner, in at most constant time
  between any pair of operational nodes, i.e., the synchronous model. Using
  the trivial synchronizer this is not a problem. We only need bounded
  message delays in order to have a notion of time, which is needed for the
  adversarial model. The duration of a round is then proportional to the
  propagation delay of the slowest message.

Algorithm 88 DHT
1: Given: a globally known set of hash functions h_i, and a hypercube (or
   any other hypercubic network)
2: Each hypercube virtual node ("hypernode") consists of Θ(log n) nodes.
3: Nodes have connections to all other nodes of their hypernode and to nodes
   of their neighboring hypernodes.
4: Because of churn, some of the nodes have to change to another hypernode
   such that up to constant factors, all hypernodes own the same number of
   nodes at all times.
5: If the total number of nodes n grows or shrinks above or below a certain
   threshold, the dimension of the hypercube is increased or decreased by
   one, respectively.

Remarks:

• Having a logarithmic number of hypercube neighbors, each with a
  logarithmic number of nodes, means that each node has Θ(log^2 n)
  neighbors. However, with some additional bells and whistles one can
  achieve Θ(log n) neighbor nodes.

• The balancing of nodes among the hypernodes can be seen as a dynamic token
  distribution problem on the hypercube. Each hypernode has a certain number
  of tokens, the goal is to distribute the tokens along the edges of the
  graph such that all hypernodes end up with the same or almost the same
  number of tokens. While tokens are moved around, an adversary constantly
  inserts and deletes tokens. See also Figure 21.7.

• In summary, the storage system builds on two basic components: (i) an
  algorithm which performs the described dynamic token distribution and (ii)
  an information aggregation algorithm which is used to estimate the number
  of nodes in the system and to adapt the dimension of the hypercube
  accordingly:

Theorem 21.11 (DHT with Churn). We have a fully scalable, efficient
distributed storage system which tolerates O(log n) worst-case joins and/or
crashes per constant time interval. As in other storage systems, nodes have
O(log n) overlay neighbors, and the usual operations (e.g., search, insert)
take time O(log n).
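The token-balancing step can be illustrated with the classic dimension-exchange heuristic: in round j, every hypernode splits its tokens evenly with its neighbor across dimension j. This Python sketch is a simplification for illustration, not the (adversary-resilient) algorithm actually analyzed in the literature:

```python
def dimension_exchange(tokens, d):
    """One sweep over the d dimensions of a hypercube with 2^d hypernodes;
    tokens[a] is the load of hypernode a."""
    for j in range(d):
        for a in range(2 ** d):
            b = a ^ (1 << j)
            if a < b:                   # visit each neighbor pair once
                total = tokens[a] + tokens[b]
                tokens[a], tokens[b] = total - total // 2, total // 2
    return tokens

# 2^3 = 8 hypernodes, all 16 tokens initially on hypernode 0;
# a single sweep levels the load completely in this example
result = dimension_exchange([16, 0, 0, 0, 0, 0, 0, 0], 3)
assert result == [2] * 8
```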
Figure 21.7: A simulated 2-dimensional hypercube with four hypernodes, each
consisting of several nodes. Also, all the nodes are either in the core or in
the periphery of a node. All nodes within the same hypernode are completely
connected to each other, and additionally, all nodes of a hypernode are
connected to the core nodes of the neighboring nodes. Only the core nodes
store data items, while the peripheral nodes move between the nodes to
balance biased adversarial churn.

Remarks:

• Indeed, handling churn is only a minimal requirement to make a distributed
  storage system work. Advanced studies proposed more elaborate
  architectures which can also handle other security issues, e.g., privacy
  or Byzantine attacks.

Chapter Notes

The ideas behind distributed storage were laid during the peer-to-peer (P2P)
file sharing hype around the year 2000, so a lot of the seminal research in
this area is labeled P2P. The paper of Plaxton, Rajaraman, and Richa [PRR97]
laid out a blueprint for many so-called structured P2P architecture
proposals, such as Chord [SMK+01], CAN [RFH+01], Pastry [RD01], Viceroy
[MNR02], Kademlia [MM02], Koorde [KK03], SkipGraph [AS03], SkipNet [HJS+03],
or Tapestry [ZHS+04]. Also the paper of Plaxton et al. was standing on the
shoulders of giants. Some of its eminent precursors are: linear and
consistent hashing [KLL+97], locating shared objects [AP90, AP91], compact
routing [SK85, PU88], and even earlier: hypercubic networks, e.g. [AJ75,
Wit81, GS81, BA84].

Furthermore, the techniques in use for prefix-based overlay structures are
related to a proposal called LAND, a locality-aware distributed hash table
proposed by Abraham et al. [AMD04].

More recently, a lot of P2P research focused on security aspects, describing
for instance attacks [LMSW06, SENB07, Lar07], and provable countermeasures
[KSW05, AS09, BSS09]. Another topic currently garnering interest is using P2P
to help distribute live streams of video content on a large scale [LMSW07].
There are several recommendable introductory books on P2P computing, e.g.
[SW05, SG05, MS07, KW08, BYL08].

Some of the figures in this chapter have been provided by Christian
Scheideler.

Bibliography

[AJ75] George A. Anderson and E. Douglas Jensen. Computer Interconnection
Structures: Taxonomy, Characteristics, and Examples. ACM Comput. Surv.,
7(4):197–213, December 1975.

[AMD04] Ittai Abraham, Dahlia Malkhi, and Oren Dobzinski. LAND: stretch
(1 + epsilon) locality-aware networks for DHTs. In Proceedings of the
fifteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '04, pages
550–559, Philadelphia, PA, USA, 2004. Society for Industrial and Applied
Mathematics.

[AP90] Baruch Awerbuch and David Peleg. Efficient Distributed Construction
of Sparse Covers. Technical report, The Weizmann Institute of Science, 1990.

[AP91] Baruch Awerbuch and David Peleg. Concurrent Online Tracking of Mobile
Users. In SIGCOMM, pages 221–233, 1991.

[AS03] James Aspnes and Gauri Shah. Skip Graphs. In SODA, pages 384–393.
ACM/SIAM, 2003.

[AS09] Baruch Awerbuch and Christian Scheideler. Towards a Scalable and
Robust DHT. Theory Comput. Syst., 45(2):234–260, 2009.

[BA84] L. N. Bhuyan and D. P. Agrawal. Generalized Hypercube and Hyperbus
Structures for a Computer Network. IEEE Trans. Comput., 33(4):323–333, April
1984.

[BSS09] Matthias Baumgart, Christian Scheideler, and Stefan Schmid. A
DoS-resilient information system for dynamic data management. In Proceedings
of the twenty-first annual symposium on Parallelism in algorithms and
architectures, SPAA '09, pages 300–309, New York, NY, USA, 2009. ACM.

[BYL08] John Buford, Heather Yu, and Eng Keong Lua. P2P Networking and
Applications. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.

[GS81] J.R. Goodman and C.H. Sequin. Hypertree: A Multiprocessor
Interconnection Topology. Computers, IEEE Transactions on, C-30(12):923–933,
December 1981.

[HJS+03] Nicholas J. A. Harvey, Michael B. Jones, Stefan Saroiu, Marvin
Theimer, and Alec Wolman. SkipNet: a scalable overlay network with practical
locality properties. In Proceedings of the 4th conference on USENIX Symposium
on Internet Technologies and Systems - Volume 4, USITS'03, pages 9–9,
Berkeley, CA, USA, 2003. USENIX Association.
[KK03] M. Frans Kaashoek and David R. Karger. Koorde: A Simple
Degree-Optimal Distributed Hash Table. In M. Frans Kaashoek and Ion Stoica,
editors, IPTPS, volume 2735 of Lecture Notes in Computer Science, pages
98–107. Springer, 2003.

[KLL+97] David R. Karger, Eric Lehman, Frank Thomson Leighton, Rina
Panigrahy, Matthew S. Levine, and Daniel Lewin. Consistent Hashing and Random
Trees: Distributed Caching Protocols for Relieving Hot Spots on the World
Wide Web. In Frank Thomson Leighton and Peter W. Shor, editors, STOC, pages
654–663. ACM, 1997.

[KSW05] Fabian Kuhn, Stefan Schmid, and Roger Wattenhofer. A Self-Repairing
Peer-to-Peer System Resilient to Dynamic Adversarial Churn. In 4th
International Workshop on Peer-To-Peer Systems (IPTPS), Cornell University,
Ithaca, New York, USA, Springer LNCS 3640, February 2005.

[KW08] Javed I. Khan and Adam Wierzbicki. Introduction: Guest editors'
introduction: Foundation of peer-to-peer computing. Comput. Commun.,
31(2):187–189, February 2008.

[Lar07] Erik Larkin. Storm Worm's virulence may change tactics.
http://www.networkworld.com/news/2007/080207-black-hat-storm-worms-virulence.html,
August 2007. Last accessed on June 11, 2012.

[LMSW06] Thomas Locher, Patrick Moor, Stefan Schmid, and Roger Wattenhofer.
Free Riding in BitTorrent is Cheap. In 5th Workshop on Hot Topics in Networks
(HotNets), Irvine, California, USA, November 2006.

[LMSW07] Thomas Locher, Remo Meier, Stefan Schmid, and Roger Wattenhofer.
Push-to-Pull Peer-to-Peer Live Streaming. In 21st International Symposium on
Distributed Computing (DISC), Lemesos, Cyprus, September 2007.

[MM02] Petar Maymounkov and David Mazières. Kademlia: A Peer-to-Peer
Information System Based on the XOR Metric. In Revised Papers from the First
International Workshop on Peer-to-Peer Systems, IPTPS '01, pages 53–65,
London, UK, 2002. Springer-Verlag.

[MNR02] Dahlia Malkhi, Moni Naor, and David Ratajczak. Viceroy: a scalable
and dynamic emulation of the butterfly. In Proceedings of the twenty-first
annual symposium on Principles of distributed computing, PODC '02, pages
183–192, New York, NY, USA, 2002. ACM.

[MS07] Peter Mahlmann and Christian Schindelhauer. Peer-to-Peer Networks.
Springer, 2007.

[PRR97] C. Greg Plaxton, Rajmohan Rajaraman, and Andréa W. Richa. Accessing
Nearby Copies of Replicated Objects in a Distributed Environment. In SPAA,
pages 311–320, 1997.

[PU88] David Peleg and Eli Upfal. A tradeoff between space and efficiency
for routing tables. In Proceedings of the twentieth annual ACM symposium on
Theory of computing, STOC '88, pages 43–52, New York, NY, USA, 1988. ACM.

[RD01] Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized
object location and routing for large-scale peer-to-peer systems. In IFIP/ACM
International Conference on Distributed Systems Platforms (Middleware), pages
329–350, November 2001.

[RFH+01] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and
Scott Shenker. A scalable content-addressable network. SIGCOMM Comput.
Commun. Rev., 31(4):161–172, August 2001.

[SENB07] Moritz Steiner, Taoufik En-Najjary, and Ernst W. Biersack.
Exploiting KAD: possible uses and misuses. SIGCOMM Comput. Commun. Rev.,
37(5):65–70, October 2007.

[SG05] Ramesh Subramanian and Brian D. Goodman. Peer to Peer Computing: The
Evolution of a Disruptive Technology. IGI Publishing, Hershey, PA, USA, 2005.

[SK85] Nicola Santoro and Ramez Khatib. Labelling and Implicit Routing in
Networks. Comput. J., 28(1):5–8, 1985.

[SMK+01] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and
Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet
applications. SIGCOMM Comput. Commun. Rev., 31(4):149–160, August 2001.

[SW05] Ralf Steinmetz and Klaus Wehrle, editors. Peer-to-Peer Systems and
Applications, volume 3485 of Lecture Notes in Computer Science. Springer,
2005.

[Wit81] L. D. Wittie. Communication Structures for Large Networks of
Microcomputers. IEEE Trans. Comput., 30(4):264–273, April 1981.

[ZHS+04] Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D.
Joseph, and John Kubiatowicz. Tapestry: a resilient global-scale overlay for
service deployment. IEEE Journal on Selected Areas in Communications,
22(1):41–53, 2004.
Chapter 22

Game Theory

"Game theory is a sort of umbrella or 'unified field' theory for the rational
side of social science, where 'social' is interpreted broadly, to include
human as well as non-human players (computers, animals, plants)."

– Robert Aumann, 1987

22.1 Introduction

In this chapter we look at a distributed system from a different perspective.
Nodes no longer have a common goal, but are selfish. The nodes are not
byzantine (actively malicious), instead they try to benefit from a
distributed system – possibly without contributing. Game theory attempts to
mathematically capture behavior in strategic situations, in which an
individual's success depends on the choices of others.

Remarks:

• Examples of potentially selfish behavior are file sharing or TCP. If a
  packet is dropped, then most TCP implementations interpret this as a
  congested network and alleviate the problem by reducing the speed at which
  packets are sent. What if a selfish TCP implementation does not reduce its
  speed, but instead transmits each packet twice?

• We start with one of the most famous games to introduce some definitions
  and concepts of game theory.

22.2 Prisoner's Dilemma

Two prisoners (players u and v) are being questioned by the police. They are
both held in solitary confinement and cannot talk to each other. The
prosecutors offer a bargain to each prisoner: snitch on the other prisoner to
reduce your prison sentence.

                 Player u
                 Cooperate        Defect
Player v
  Cooperate      u: 1, v: 1       u: 0, v: 3
  Defect         u: 3, v: 0       u: 2, v: 2

Table 22.1: The prisoner's dilemma game as a matrix; each cell lists the
costs (years in prison) of players u and v.

• If both of them stay silent (cooperate), both will be sentenced to one
  year of prison on a lesser charge.

• If both of them testify against their fellow prisoner (defect), the police
  have a stronger case and they will be sentenced to two years each.

• If player u defects and player v cooperates, then player u will go free
  (snitching pays off) and player v will have to go to jail for three years;
  and vice versa.

• This two player game can be represented as a matrix, see Table 22.1.

Definition 22.1 (game). A game requires at least two rational players, and
each player can choose from at least two options (strategies). In every
possible outcome (strategy profile) each player gets a certain payoff (or
cost). The payoff of a player depends on the strategies of the other players.

Definition 22.2 (social optimum). A strategy profile is called social optimum
(SO) if and only if it minimizes the sum of all costs (or maximizes payoff).

Remarks:

• The social optimum for the prisoner's dilemma is when both players
  cooperate – the corresponding cost sum is 2.

Definition 22.3 (dominant). A strategy is dominant if a player is never worse
off by playing this strategy. A dominant strategy profile is a strategy
profile in which each player plays a dominant strategy.

Remarks:

• The dominant strategy profile in the prisoner's dilemma is when both
  players defect – the corresponding cost sum is 4.

Definition 22.4 (Nash Equilibrium). A Nash Equilibrium (NE) is a strategy
profile in which no player can improve by unilaterally (the strategies of the
other players do not change) changing its strategy.

247
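The definitions above can be checked mechanically on the matrix of Table 22.1. The following Python snippet is our own illustration (not part of the lecture notes); it brute-forces the social optimum and all Nash Equilibria of the prisoner's dilemma, with strategy 0 standing for cooperate and 1 for defect.

```python
# cost[(su, sv)] = (cost of player u, cost of player v), as in Table 22.1
cost = {
    (0, 0): (1, 1),  # both cooperate: one year each
    (0, 1): (3, 0),  # u cooperates, v defects
    (1, 0): (0, 3),  # u defects, v cooperates
    (1, 1): (2, 2),  # both defect
}

def is_nash(su, sv):
    """No player can lower its cost by unilaterally changing its strategy."""
    cu, cv = cost[(su, sv)]
    return all(cost[(s, sv)][0] >= cu for s in (0, 1)) and \
           all(cost[(su, s)][1] >= cv for s in (0, 1))

profiles = [(su, sv) for su in (0, 1) for sv in (0, 1)]
social_opt = min(profiles, key=lambda p: sum(cost[p]))
nash = [p for p in profiles if is_nash(*p)]

print(social_opt)  # (0, 0): both cooperate, cost sum 2
print(nash)        # [(1, 1)]: both defect is the only Nash Equilibrium
```

The only Nash Equilibrium (both defect, cost sum 4) is strictly worse than the social optimum (both cooperate, cost sum 2) – the dilemma in a nutshell.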
Remarks:

• A game can have multiple Nash Equilibria.

• In the prisoner's dilemma both players defecting is the only Nash
Equilibrium.

• If every player plays a dominant strategy, then this is by definition a
Nash Equilibrium.

• Nash Equilibria and dominant strategy profiles are so called solution
concepts. They are used to analyze a game. There are more solution
concepts, e.g. correlated equilibria or best response.

• The best response is the best strategy given a belief about the strategy
of the other players. In this game the best response to both strategies
of the other player is to defect. If one strategy is the best response to
any strategy of the other players, it is a dominant strategy.

• If two players play the prisoner's dilemma repeatedly, it is called iterated
prisoner's dilemma. It is a dominant strategy to always defect. To see
this, consider the final game. Defecting is a dominant strategy. Thus,
it is fixed what both players do in the last game. Now the penultimate
game is the last game and by induction always defecting is a dominant
strategy.

• Game theorists were invited to come up with a strategy for 200 iterations
of the prisoner's dilemma to compete in a tournament. Each strategy
had to play against every other strategy and accumulated points
throughout the tournament. The simple Tit4Tat strategy (cooperate in
the first game, then copy whatever the other player did in the previous
game) won. One year later, after analyzing each strategy, another
tournament (with new strategies) was held. Tit4Tat won again.

• We now look at a distributed system game.

22.3 Selfish Caching

Computers in a network want to access a file regularly. Each node v ∈ V, with
V being the set of nodes and n = |V|, has a demand dv for the file and wants to
minimize the cost for accessing it. In order to access the file, node v can either
cache the file locally which costs 1 or request the file from another node u which
costs cv←u. If a node does not cache the file, the cost it incurs is the minimal
cost to access the file remotely. Note that if no node caches the file, then every
node incurs cost ∞. There is an example in Figure 22.1.

Remarks:

• We will sometimes depict this game as a graph. The cost cv←u for
node v to access the file from node u is equivalent to the length of the
shortest path times the demand dv.

• Note that in undirected graphs cu←v > cv←u if and only if du > dv.
We assume that the graphs are undirected for the rest of the chapter.

Figure 22.1: In this example we assume du = dv = dw = 1. Either the nodes u
and w cache the file. Then neither of the three nodes has an incentive to change
its behavior. The costs are 1, 1/2, and 1 for the nodes u, v, w, respectively.
Alternatively, only node v caches the file. Again, neither of the three nodes has
an incentive to change its behavior. The costs are 1/2, 1, and 3/4 for the nodes
u, v, w, respectively.

Algorithm 89 Nash Equilibrium for Selfish Caching
1: S = {} // set of nodes that cache the file
2: repeat
3:   Let v be a node with maximum demand dv in set V
4:   S = S ∪ {v}, V = V \ {v}
5:   Remove every node u from V with cu←v < 1
6: until V = {}

Theorem 22.5. Algorithm 89 computes a Nash Equilibrium for Selfish Caching.

Proof. Let u be a node that is not caching the file. Then there exists a node v
for which cu←v ≤ 1. Hence, node u has no incentive to cache.
Let u be a node that is caching the file. We now consider any other node v
that is also caching the file. First, we consider the case where v cached the file
before u did. Then it holds that cu←v > 1 by construction.
It could also be that v started caching the file after u did. Then it holds
that du ≥ dv and therefore cu←v ≥ cv←u. Furthermore, we have cv←u > 1 by
construction. Combining these implies that cu←v ≥ cv←u > 1.
In either case, node u has no incentive to stop caching.

Definition 22.6 (Price of Anarchy). Let NE− denote the Nash Equilibrium
with the highest cost (smallest payoff). The Price of Anarchy (PoA) is defined
as

    PoA = cost(NE−) / cost(SO).

Definition 22.7 (Optimistic Price of Anarchy). Let NE+ denote the Nash
Equilibrium with the smallest cost (highest payoff). The Optimistic Price of
Anarchy (OPoA) is defined as

    OPoA = cost(NE+) / cost(SO).
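Algorithm 89 is straightforward to express in code. The following Python sketch is our own (function and variable names are hypothetical, and the shortest-path distances `dist` are assumed precomputed, e.g. by Floyd–Warshall); it mirrors the pseudocode: repeatedly pick a remaining node of maximum demand, add it to the caching set, and drop every node that can now access the file for less than the caching cost of 1.

```python
def selfish_caching_ne(nodes, demand, dist):
    """Algorithm 89: greedily build a caching set S that is a Nash Equilibrium."""
    S = set()
    V = set(nodes)
    while V:
        # line 3: a node with maximum demand (ties broken by name for determinism)
        v = max(sorted(V), key=lambda x: demand[x])
        S.add(v)       # line 4: S = S u {v}
        V.discard(v)   #         V = V \ {v}
        # line 5: remove every node u with c_{u<-v} = dist(u, v) * d_u < 1
        V = {u for u in V if dist[u][v] * demand[u] >= 1}
    return S

# The three-node example of Figure 22.1 (all demands 1, path lengths 1/2 and 3/4):
nodes = ["u", "v", "w"]
demand = {"u": 1, "v": 1, "w": 1}
length = {("u", "v"): 0.5, ("v", "w"): 0.75, ("u", "w"): 1.25}
dist = {a: {b: 0 if a == b else length.get((a, b), length.get((b, a)))
            for b in nodes} for a in nodes}
print(selfish_caching_ne(nodes, demand, dist))  # the caching set {'u', 'w'}
```

With this tie-breaking the algorithm returns {u, w}, the first equilibrium described in Figure 22.1; breaking the tie in favor of v instead yields the second equilibrium {v}.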
Figure 22.2: A network with a Price of Anarchy of Θ(n).

Remarks:

• The Price of Anarchy measures how much a distributed system degrades
because of selfish nodes.

• We have PoA ≥ OPoA ≥ 1.

Theorem 22.8. The (Optimistic) Price of Anarchy of Selfish Caching can be
Θ(n).

Proof. Consider a network as depicted in Figure 22.2. Every node v has demand
dv = 1. Note that if any node caches the file, no other node has an incentive
to cache the file as well since the cost to access the file is at most 1 − ε. Wlog
let us assume that a node v on the left caches the file, then it is cheaper for
every node on the right to access the file remotely. Hence, the total cost of this
solution is 1 + (n/2) · (1 − ε). In the social optimum one node from the left and
one node from the right cache the file. This reduces the cost to 2. Hence, the
Price of Anarchy is

    (1 + (n/2) · (1 − ε)) / 2  →  1/2 + n/4 = Θ(n)   for ε → 0.

22.4 Braess' Paradox

Consider the graph in Figure 22.3, it models a road network. Let us assume
that there are 1000 drivers (each in their own car) that want to travel from node
s to node t. Traveling along the road from s to u (or v to t) always takes 1
hour. The travel time from s to v (or u to t) depends on the traffic and increases
by 1/1000 of an hour per car, i.e., when there are 500 cars driving, it takes 30
minutes to use this road.

Lemma 22.9. Adding a super fast road (delay is 0) between u and v can
increase the travel time from s to t.

Proof. Since the drivers act rationally, they want to minimize the travel time.
In the Nash Equilibrium, 500 drivers first drive to node u and then to t and 500
drivers first to node v and then to t. The travel time for each driver is 1 + 500
/ 1000 = 1.5.
To reduce congestion, a super fast road (delay is 0) is built between nodes u
and v. This results in the following Nash Equilibrium: every driver now drives
from s to v to u to t. The total cost is now 2 > 1.5.

Figure 22.3: Braess' Paradox, where d denotes the number of drivers using an
edge. (a) The road network without the shortcut. (b) The road network with
the shortcut.

Remarks:

• There are physical systems which exhibit similar properties. Some
famous ones employ a spring. YouTube has some fascinating videos
about this. Simply search for "Braess Paradox Spring".

• We will now look at another famous game that will allow us to deepen
our understanding of game theory.

22.5 Rock-Paper-Scissors

There are two players, u and v. Each player simultaneously chooses one of three
options: rock, paper, or scissors. The rules are simple: paper beats rock, rock
beats scissors, and scissors beat paper. A matrix representation of this game is
in Table 22.2.

                               Player u
                       Rock      Paper     Scissors
          Rock        0 / 0     −1 / 1     1 / −1
Player v  Paper       1 / −1     0 / 0    −1 / 1
          Scissors   −1 / 1      1 / −1    0 / 0

Table 22.2: Rock-Paper-Scissors as a matrix. Each cell lists the payoff of
player v / the payoff of player u.
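Table 22.2 is small enough to check the upcoming claims by brute force. This Python snippet is our own illustration: it encodes u's payoff (the game is zero-sum, so v's payoff is the negation), verifies that no pure strategy profile is a Nash Equilibrium, and computes the expected payoff when both players mix uniformly.

```python
R, P, S = range(3)  # rock, paper, scissors
# payoff_u[sv][su]: payoff of player u when v plays sv and u plays su
payoff_u = [[ 0,  1, -1],
            [-1,  0,  1],
            [ 1, -1,  0]]

def pure_nash(sv, su):
    """Neither player can raise its payoff by unilaterally deviating."""
    pu, pv = payoff_u[sv][su], -payoff_u[sv][su]
    u_ok = all(payoff_u[sv][s] <= pu for s in (R, P, S))
    v_ok = all(-payoff_u[s][su] <= pv for s in (R, P, S))
    return u_ok and v_ok

# No pure Nash Equilibrium exists:
assert not any(pure_nash(sv, su) for sv in (R, P, S) for su in (R, P, S))

# Expected payoff of u when both players choose each strategy with probability 1/3:
expected = sum(payoff_u[sv][su] for sv in (R, P, S) for su in (R, P, S)) / 9
print(expected)  # 0.0
```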
Remarks:

• None of the three strategies is a Nash Equilibrium. Whatever player
u chooses, player v can always switch her strategy such that she wins.

• This is highlighted in the best response concept. The best response
to e.g. scissors is to play rock. The other player switches to paper.
And so on.

• Is this a game without a Nash Equilibrium? John Nash answered this
question in 1950. By choosing each strategy with a certain probability,
we can obtain a so called mixed Nash Equilibrium. Indeed:

Theorem 22.10. Every game has a mixed Nash Equilibrium.

Remarks:

• The Nash Equilibrium of this game is if both players choose each
strategy with probability 1/3. The expected payoff is 0.

• Any strategy (or mix of them) is a best response to a player choosing
each strategy with probability 1/3.

• In a pure Nash Equilibrium, the strategies are chosen deterministically.
Rock-Paper-Scissors does not have a pure Nash Equilibrium.

• Unfortunately, game theory does not always model problems accurately.
Many real world problems are too complex to be captured by a game.
And as you may know, humans (not only politicians) are often not
rational.

• In distributed systems, players can be servers, routers, etc. Game
theory can tell us whether systems and protocols are prone to selfish
behavior.

22.6 Mechanism Design

Whereas game theory analyzes existing systems, there is a related area that
focuses on designing games – mechanism design. The task is to create a game
where nodes have an incentive to behave "nicely".

Definition 22.11 (auction). One good is sold to a group of bidders in an
auction. Each bidder vi has a secret value zi for the good and tells his bid bi to
the auctioneer. The auctioneer sells the good to one bidder for a price p.

Remarks:

• For simplicity, we assume that no two bids are the same, and that
b1 > b2 > b3 > . . .

Definition 22.12 (truthful). An auction is truthful if no player vi can gain
anything by not stating the truth, i.e., bi = zi.

Algorithm 90 First Price Auction
1: every bidder vi submits his bid bi
2: the good is allocated to the highest bidder v1 for the price p = b1

Theorem 22.13. A First Price Auction (Algorithm 90) is not truthful.

Proof. Consider an auction with two bidders, with bids b1 and b2. By not stating
the truth and decreasing his bid to b1 − ε > b2, player one could pay less and
thus gain more. Thus, the first price auction is not truthful.

Algorithm 91 Second Price Auction
1: every bidder vi submits his bid bi
2: the good is allocated to the highest bidder v1 for p = b2

Theorem 22.14. Truthful bidding is a dominant strategy in a Second Price
Auction.

Proof. Let zi be the truthful value of node vi and bi his bid. Let bmax =
max_{j≠i} bj be the largest bid from other nodes but vi. The payoff for node vi is
zi − bmax if bi > bmax and 0 else. Let us consider overbidding first, i.e., bi > zi:

• If bmax < zi < bi, then both strategies win and yield the same payoff
(zi − bmax).

• If zi < bi < bmax, then both strategies lose and yield a payoff of 0.

• If zi < bmax < bi, then overbidding wins the auction, but the payoff
(zi − bmax) is negative. Truthful bidding loses and yields a payoff of 0.

Likewise underbidding, i.e. bi < zi:

• If bmax < bi < zi, then both strategies win and yield the same payoff
(zi − bmax).

• If bi < zi < bmax, then both strategies lose and yield a payoff of 0.

• If bi < bmax < zi, then truthful bidding wins and yields a positive payoff
(zi − bmax). Underbidding loses and yields a payoff of 0.

Hence, truthful bidding is a dominant strategy for each node vi.

Remarks:

• Let us use this for Selfish Caching. We need to choose a node that is
the first to cache the file. But how? By holding an auction. Every
node says for which price it is willing to cache the file. We choose the
node with the lowest offer and pay it the second lowest offer to ensure
truthful offers.

• Since a mechanism designer can manipulate incentives, she can implement
a strategy profile by making all the strategies in this profile
dominant.
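The case analysis in the proof of Theorem 22.14 can also be replayed numerically. Below is our own sketch (hypothetical names, not from the notes) of Algorithm 91 together with a brute-force check that, for one concrete instance, no alternative bid beats truthful bidding.

```python
def second_price(bids):
    """Algorithm 91: the good goes to the highest bidder at the second highest bid.
    (Equal bids, excluded by assumption in the text, are broken by bidder index.)"""
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    return order[0], bids[order[1]]

def payoff(z, my_bid, other_bids):
    """Payoff of bidder 0 with true value z when submitting my_bid."""
    winner, price = second_price([my_bid] + other_bids)
    return z - price if winner == 0 else 0

z, others = 7, [5, 3]
truthful = payoff(z, z, others)
print(truthful)  # 2, i.e. z minus the second highest bid
# Truthful bidding is never beaten by any over- or underbid:
assert all(payoff(z, b, others) <= truthful for b in range(0, 15))
```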
Theorem 22.15. Any Nash Equilibrium of Selfish Caching can be implemented
for free.

Proof. If the mechanism designer wants the nodes from the caching set S of
the Nash Equilibrium to cache, then she can offer the following deal to every
node not in S: If not every node from set S caches the file, then I will ensure
a positive payoff for you. Thus, all nodes not in S prefer not to cache since
this is a dominant strategy for them. Consider now a node v ∈ S. Since S is a
Nash Equilibrium, node v incurs cost of at least 1 if it does not cache the file.
For nodes that incur cost of exactly 1, the mechanism designer can even issue a
penalty if the node does not cache the file. Thus, every node v ∈ S caches the
file.

Remarks:

• Mechanism design assumes that the players act rationally and want to
maximize their payoff. In real-world distributed systems some players
may be not selfish, but actively malicious (byzantine).

• What about P2P file sharing? To increase the overall experience,
BitTorrent suggests that peers offer better upload speed to peers who
upload more. This idea can be exploited. By always claiming to have
nothing to trade yet, the BitThief client downloads without uploading.
In addition to that, it connects to more peers than the standard client
to increase its download speed.

• Many techniques have been proposed to limit such free riding behavior,
e.g., tit-for-tat trading: I will only share something with you if you
share something with me. To solve the bootstrap problem ("I don't
have anything yet"), nodes receive files or pieces of files whose hash
match their own hash for free. One can also imagine indirect trading.
Peer u uploads to peer v, who uploads to peer w, who uploads to peer
u. Finally, one could imagine using virtual currencies or a reputation
system (a history of who uploaded what). Reputation systems suffer
from collusion and Sybil attacks. If one node pretends to be many
nodes who rate each other well, it will have a good reputation.

Chapter Notes

Game theory was started by a proof for mixed-strategy equilibria in two-person
zero-sum games by John von Neumann [Neu28]. Later, von Neumann and
Morgenstern introduced game theory to a wider audience [NM44]. In 1950 John
Nash proved that every game has a mixed Nash Equilibrium [Nas50]. The
Prisoner's Dilemma was first formalized by Flood and Dresher [Flo52]. The
iterated prisoner's dilemma tournament was organized by Robert Axelrod [AH81].
The Price of Anarchy definition is from Koutsoupias and Papadimitriou [KP99].
This allowed the creation of the Selfish Caching Game [CCW+04], which we
used as a running example in this chapter. Braess' paradox was discovered by
Dietrich Braess in 1968 [Bra68]. A generalized version of the second-price auction
is the VCG auction, named after three successive papers from first Vickrey,
then Clarke, and finally Groves [Vic61, Cla71, Gro73]. One popular example
of selfishness in practice is BitThief – a BitTorrent client that successfully
downloads without uploading [LMSW06]. Using game theory economists try to
understand markets and predict crashes. Apart from John Nash, the Sveriges
Riksbank Prize (Nobel Prize) in Economics has been awarded many times to
game theorists. For example in 2007 Hurwicz, Maskin, and Myerson received
the prize "for having laid the foundations of mechanism design theory".
This chapter was written in collaboration with Philipp Brandes.

Bibliography

[AH81] Robert Axelrod and William Donald Hamilton. The evolution of
cooperation. Science, 211(4489):1390–1396, 1981.

[Bra68] Dietrich Braess. Über ein Paradoxon aus der Verkehrsplanung.
Unternehmensforschung, 12(1):258–268, 1968.

[CCW+04] Byung-Gon Chun, Kamalika Chaudhuri, Hoeteck Wee, Marco Barreno,
Christos H. Papadimitriou, and John Kubiatowicz. Selfish caching in
distributed systems: a game-theoretic analysis. In Proceedings of the
twenty-third annual ACM symposium on Principles of distributed
computing, pages 21–30. ACM, 2004.

[Cla71] Edward H. Clarke. Multipart pricing of public goods. Public Choice,
11(1):17–33, 1971.

[Flo52] Merrill M. Flood. Some experimental games. Management Science,
5(1):5–26, 1952.

[Gro73] Theodore Groves. Incentives in teams. Econometrica: Journal of the
Econometric Society, pages 617–631, 1973.

[KP99] Elias Koutsoupias and Christos Papadimitriou. Worst-case equilibria.
In STACS 99, pages 404–413. Springer, 1999.

[LMSW06] Thomas Locher, Patrick Moor, Stefan Schmid, and Roger Wattenhofer.
Free Riding in BitTorrent is Cheap. In 5th Workshop on Hot Topics in
Networks (HotNets), Irvine, California, USA, November 2006.

[Nas50] John F. Nash. Equilibrium points in n-person games. Proc. Nat. Acad.
Sci. USA, 36(1):48–49, 1950.

[Neu28] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische
Annalen, 100(1):295–320, 1928.

[NM44] John von Neumann and Oskar Morgenstern. Theory of Games and
Economic Behavior. Princeton University Press, 1944.

[Vic61] William Vickrey. Counterspeculation, auctions, and competitive sealed
tenders. The Journal of Finance, 16(1):8–37, 1961.
Chapter 23

Peer-to-Peer Computing

"Indeed, I believe that virtually every important aspect of
programming arises somewhere in the context of [sorting and] searching!"
– Donald E. Knuth, The Art of Computer Programming

23.1 Introduction

Unfortunately, the term peer-to-peer (P2P) is ambiguous, used in a variety of
different contexts, such as:

• In popular media coverage, P2P is often synonymous to software or protocols
that allow users to "share" files, often of dubious origin. In the early
days, P2P users mostly shared music, pictures, and software; nowadays
books, movies or tv shows have caught on. P2P file sharing is immensely
popular, currently at least half of the total Internet traffic is due to P2P!

• In academia, the term P2P is used mostly in two ways. A narrow view
essentially defines P2P as the "theory behind file sharing protocols". In
other words, how do Internet hosts need to be organized in order to deliver
a search engine to find (file sharing) content efficiently? A popular term
is "distributed hash table" (DHT), a distributed data structure that
implements such a content search engine. A DHT should support at least
a search (for a key) and an insert (key, object) operation. A DHT has
many applications beyond file sharing, e.g., the Internet domain name
system (DNS).

• A broader view generalizes P2P beyond file sharing: Indeed, there is a
growing number of applications operating outside the juridical gray area,
e.g., P2P Internet telephony à la Skype, P2P mass player games on video
consoles connected to the Internet, P2P live video streaming as in Zattoo
or StreamForge, or P2P social storage such as Wuala. So, again, what is
P2P?! Still not an easy question... Trying to account for the new applications
beyond file sharing, one might define P2P as a large-scale distributed
system that operates without a central server bottleneck. However, with
this definition almost everything we learn in this course is P2P! Moreover,
according to this definition early-day file sharing applications such as
Napster (1999) that essentially made the term P2P popular would not
be P2P! On the other hand, the plain old telephone system or the world
wide web do fit the P2P definition...

• From a different viewpoint, the term P2P may also be synonymous for
privacy protection, as various P2P systems such as Freenet allow publishers
of information to remain anonymous and uncensored. (Studies show
that these freedom-of-speech P2P networks do not feature a lot of content
against oppressive governments; indeed the majority of text documents
seem to be about illicit drugs, not to speak about the type of content in
audio or video files.)

In other words, we cannot hope for a single well-fitting definition of P2P, as
some of them even contradict. In the following we mostly employ the academic
viewpoints (second and third definition above). In this context, it is generally
believed that P2P will have an influence on the future of the Internet. The P2P
paradigm promises to give better scalability, availability, reliability, fairness,
incentives, privacy, and security, just about everything researchers expect from
a future Internet architecture. As such it is not surprising that new "clean slate"
Internet architecture proposals often revolve around P2P concepts.

One might naively assume that for instance scalability is not an issue in
today's Internet, as even most popular web pages are generally highly available.
However, this is not really because of our well-designed Internet architecture,
but rather due to the help of so-called overlay networks: The Google website for
instance manages to respond so reliably and quickly because Google maintains a
large distributed infrastructure, essentially a P2P system. Similarly companies
like Akamai sell "P2P functionality" to their customers to make today's user
experience possible in the first place. Quite possibly today's P2P applications
are just testbeds for tomorrow's Internet architecture.

23.2 Architecture Variants

Several P2P architectures are known:

• Client/Server goes P2P: Even though Napster is known to be the first P2P
system (1999), by today's standards its architecture would not deserve the
label P2P anymore. Napster clients accessed a central server that managed
all the information of the shared files, i.e., which file was to be found on
which client. Only the downloading process itself was between clients
("peers") directly, hence peer-to-peer. In the early days of Napster the
load of the server was relatively small, so the simple Napster architecture
made a lot of sense. Later on, it became clear that the server would
eventually be a bottleneck, and more so an attractive target for an attack.
Indeed, eventually a judge ruled the server to be shut down, in other
words, he conducted a juridical denial of service attack.

• Unstructured P2P: The Gnutella protocol is the anti-thesis of Napster,
as it is a fully decentralized system, with no single entity having a global
picture. Instead each peer would connect to a random sample of other
peers, constantly changing the neighbors of this virtual overlay network
by exchanging neighbors with neighbors of neighbors. (In such a system
it is part of the challenge to find a decentralized way to even discover
a first neighbor; this is known as the bootstrap problem. To solve it,
usually some random peers of a list of well-known peers are contacted
first.) When searching for a file, the request was being flooded in the
network (Algorithm 7 in Chapter 2). Indeed, since users often turn off
their client once they downloaded their content there usually is a lot of
churn (peers joining and leaving at high rates) in a P2P system, so
selecting the right "random" neighbors is an interesting research problem
by itself. However, unstructured P2P architectures such as Gnutella have
a major disadvantage, namely that each search will cost m messages, m
being the number of virtual edges in the architecture. In other words,
such an unstructured P2P architecture will not scale.

• Hybrid P2P: The synthesis of client/server architectures such as Napster
and unstructured architectures such as Gnutella are hybrid architectures.
Some powerful peers are promoted to so-called superpeers (or, similarly,
trackers). The set of superpeers may change over time, and taking down
a fraction of superpeers will not harm the system. Search requests are
handled on the superpeer level, resulting in much less messages than in
flat/homogeneous unstructured systems. Essentially the superpeers
together provide a more fault-tolerant version of the Napster server, all
regular peers connect to a superpeer. As of today, almost all popular
P2P systems have such a hybrid architecture, carefully trading off
reliability and efficiency, but essentially not using any fancy algorithms
and techniques.

• Structured P2P: Inspired by the early success of Napster, the academic
world started to look into the question of efficient file sharing. The
proposal of hypercubic architectures lead to many so-called structured
P2P architecture proposals, such as Chord, CAN, Pastry, Tapestry,
Viceroy, Kademlia, Koorde, SkipGraph, SkipNet, etc. In practice
structured P2P architectures are not yet popular, apart from the Kad
(from Kademlia) architecture which comes for free with the eMule client.

23.3 Hypercubic Networks

In this section we will introduce some popular families of network topologies.
These topologies are used in countless application domains, e.g., in classic
parallel computers or telecommunication networks, or more recently (as said
above) in P2P computing. Similarly to Chapter 4 we employ an All-to-All
communication model, i.e., each node can set up direct communication links to
arbitrary other nodes. Such a virtual network is called an overlay network, or
in this context, P2P architecture. In this section we present a few overlay
topologies of general interest.

The most basic network topologies used in practice are trees, rings, grids
or tori. Many other suggested networks are simply combinations or derivatives
of these. The advantage of trees is that the routing is very easy: for every
source-destination pair there is only one possible simple path. However, since
the root of a tree is usually a severe bottleneck, so-called fat trees have been
used. These trees have the property that every edge connecting a node v to its
parent u has a capacity that is equal to all leaves of the subtree rooted at v.
See Figure 23.1 for an example.

Figure 23.1: The structure of a fat tree.

Remarks:

• Fat trees belong to a family of networks that require edges of non-uniform
capacity to be efficient. Easier to build are networks with
edges of uniform capacity. This is usually the case for grids and tori.
Unless explicitly mentioned, we will treat all edges in the following to
be of capacity 1. In the following, [x] means the set {0, . . . , x − 1}.

Definition 23.1 (Torus, Mesh). Let m, d ∈ N. The (m, d)-mesh M(m, d) is a
graph with node set V = [m]^d and edge set

    E = { {(a1, . . . , ad), (b1, . . . , bd)} | ai, bi ∈ [m], Σ_{i=1}^{d} |ai − bi| = 1 } .

The (m, d)-torus T(m, d) is a graph that consists of an (m, d)-mesh and
additionally wrap-around edges from nodes (a1, . . . , ai−1, m − 1, ai+1, . . . , ad)
to nodes (a1, . . . , ai−1, 0, ai+1, . . . , ad) for all i ∈ {1, . . . , d} and all aj ∈ [m]
with j ≠ i. In other words, we take the expression ai − bi in the sum modulo m
prior to computing the absolute value. M(m, 1) is also called a line, T(m, 1) a
cycle, and M(2, d) = T(2, d) a d-dimensional hypercube. Figure 23.2 presents
a linear array, a torus, and a hypercube.

Remarks:

• Routing on mesh, torus, and hypercube is trivial. On a d-dimensional
hypercube, to get from a source bitstring s to a target bitstring t one
only needs to fix each "wrong" bit, one at a time; in other words, if
the source and the target differ by k bits, there are k! routes with k
hops.
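The bit-fixing idea in the last remark is only a few lines of code. The following Python sketch (our own, not from the notes) routes greedily on the d-dimensional hypercube by flipping one differing bit per hop; fixing the k differing bits in any of the k! possible orders yields the other shortest routes.

```python
def hypercube_route(s, t, d):
    """One shortest path from node s to node t in the d-dimensional hypercube,
    fixing the differing bits in order of increasing dimension."""
    route, cur = [s], s
    for i in range(d):
        if (cur ^ t) >> i & 1:   # bit i still differs from the target
            cur ^= 1 << i        # traverse the dimension-i edge
            route.append(cur)
    return route

path = hypercube_route(0b000, 0b101, 3)
print([format(x, "03b") for x in path])  # ['000', '001', '101']: 2 differing bits, 2 hops
```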
Figure 23.2: The structure of M(m, 1), T(4, 2), and M(2, 3).

Remarks:

• The hypercube can directly be used for a structured P2P architecture.
It is trivial to construct a distributed hash table (DHT): We have n
nodes, n for simplicity being a power of 2, i.e., n = 2^d. As in the
hypercube, each node gets a unique d-bit ID, and each node connects
to d other nodes, i.e., the nodes that have IDs differing in exactly one
bit. Now we use a globally known hash function f, mapping file names
to long bit strings; SHA-1 is popular in practice, providing 160 bits.
Let fd denote the first d bits (prefix) of the bitstring produced by f. If
a node is searching for file name X, it routes a request message f(X)
to node fd(X). Clearly, node fd(X) can only answer this request if all
files with hash prefix fd(X) have been previously registered at node
fd(X).

• There are a few issues which need to be addressed before our DHT
works, in particular churn (nodes joining and leaving without notice).
To deal with churn the system needs some level of replication, i.e.,
a number of nodes which are responsible for each prefix such that
failure of some nodes will not compromise the system. We give some
more details in Section 23.4. In addition there are other issues (e.g.,
security, efficiency) which can be addressed to improve the system.
These issues are beyond the scope of this lecture.

• The hypercube has many derivatives, the so-called hypercubic networks.
Among these are the butterfly, cube-connected-cycles, shuffle-exchange,
and de Bruijn graph. We start with the butterfly, which is basically a
"rolled out" hypercube (hence directly providing replication!).

Definition 23.2 (Butterfly). Let d ∈ N. The d-dimensional butterfly BF(d)
is a graph with node set V = [d + 1] × [2]^d and an edge set E = E1 ∪ E2 with

    E1 = { {(i, α), (i + 1, α)} | i ∈ [d], α ∈ [2]^d }

and

    E2 = { {(i, α), (i + 1, β)} | i ∈ [d], α, β ∈ [2]^d,
           α and β differ only at the ith position } .

A node set {(i, α) | α ∈ [2]^d} is said to form level i of the butterfly. The
d-dimensional wrap-around butterfly W-BF(d) is defined by taking the BF(d)
and identifying level d with level 0.

Remarks:

• Figure 23.3 shows the 3-dimensional butterfly BF(3). The BF(d) has
(d + 1)2^d nodes, 2d · 2^d edges and degree 4. It is not difficult to check
that combining the node sets {(i, α) | i ∈ [d]} into a single node results
in the hypercube.

• Butterflies have the advantage of a constant node degree over hypercubes,
whereas hypercubes feature more fault-tolerant routing.

• The structure of a butterfly might remind you of sorting networks
from Chapter 4. Although butterflies are used in the P2P context
(e.g. Viceroy), they have been used decades earlier for communication
switches. The well-known Benes network is nothing but two back-to-back
butterflies. And indeed, butterflies (and other hypercubic networks)
are even older than that; students familiar with fast fourier transform
(FFT) will recognize the structure without doubt. Every year there is
a new application for which a hypercubic network is the perfect
solution!

• Indeed, hypercubic networks are related. Since all structured P2P
architectures are based on hypercubic networks, they in turn are all
related.

• Next we define the cube-connected-cycles network. It only has a degree
of 3 and it results from the hypercube by replacing the corners by
cycles.

Figure 23.3: The structure of BF(3).

Definition 23.3 (Cube-Connected-Cycles). Let d ∈ N. The cube-connected-cycles
network CCC(d) is a graph with node set V = {(a, p) | a ∈ [2]^d, p ∈ [d]}
and edge set

    E = {{(a, p), (a, (p + 1) mod d)} | a ∈ [2]^d, p ∈ [d]}
        ∪ {{(a, p), (b, p)} | a, b ∈ [2]^d, p ∈ [d], a = b except for ap}.

Figure 23.4: The structure of CCC(3).

Remarks:

• Two possible representations of a CCC can be found in Figure 23.4.

• The shuffle-exchange is yet another way of transforming the hypercubic interconnection structure into a constant degree network.

Definition 23.4 (Shuffle-Exchange). Let d ∈ N. The d-dimensional shuffle-exchange SE(d) is defined as an undirected graph with node set V = [2]^d and an edge set E = E1 ∪ E2 with

    E1 = {{(a1, . . . , ad), (a1, . . . , ād)} | (a1, . . . , ad) ∈ [2]^d, ād = 1 − ad}

and

    E2 = {{(a1, . . . , ad), (ad, a1, . . . , ad−1)} | (a1, . . . , ad) ∈ [2]^d}.

Figure 23.5 shows the 3- and 4-dimensional shuffle-exchange graph.

Figure 23.5: The structure of SE(3) and SE(4).

Definition 23.5 (DeBruijn). The b-ary DeBruijn graph of dimension d, DB(b, d), is an undirected graph G = (V, E) with node set V = {v ∈ [b]^d} and edge set E that contains all edges {v, w} with the property that w ∈ {(x, v1, . . . , vd−1) : x ∈ [b]}, where v = (v1, . . . , vd).

Figure 23.6: The structure of DB(2, 2) and DB(2, 3).

Remarks:

• Two examples of a DeBruijn graph can be found in Figure 23.6. The DeBruijn graph is the basis of the Koorde P2P architecture.

• There are some data structures which also qualify as hypercubic networks. An obvious example is the Chord P2P architecture, which uses a slightly different hypercubic topology. A less obvious (and therefore good) example is the skip list, the balanced binary search tree for the lazy programmer:

Definition 23.6 (Skip List). The skip list is an ordinary ordered linked list of objects, augmented with additional forward links. The ordinary linked list is the level 0 of the skip list. In addition, every object is promoted to level 1 with probability 1/2. As for level 0, all level 1 objects are connected by a linked list. In general, every object on level i is promoted to the next level with probability 1/2. A special start-object points to the smallest/first object on each level.

Remarks:

• Search, insert, and delete can be implemented in O(log n) expected time in a skip list, simply by jumping from higher levels to lower ones when overshooting the searched position. Also, the amortized memory cost of each object is constant, as on average an object only has two forward pointers.

• The randomization can easily be discarded, by deterministically promoting a constant fraction of objects of level i to level i + 1, for all i. When inserting or deleting, object o simply checks whether its left and right level i neighbors are being promoted to level i + 1. If none of them is, promote object o itself. Essentially we establish a MIS on each level, hence at least every third and at most every second object is promoted.

• There are obvious variants of the skip list, e.g., the skip graph. Instead of promoting only half of the nodes to the next level, we always promote all the nodes, similarly to a balanced binary tree: All nodes are part of the root level of the binary tree. Half the nodes are promoted left, and half the nodes are promoted right, on each level. Hence on
level i we have 2^i lists (or, more symmetrically: rings) of about n/2^i objects. This is pretty much what we need for a nice hypercubic P2P architecture.

• One important goal in choosing a topology for a network is that it has a small diameter. The following theorem presents a lower bound for this.

Theorem 23.7. Every graph of maximum degree d > 2 and size n must have a diameter of at least ⌈(log n)/(log(d − 1))⌉ − 2.

Proof. Suppose we have a graph G = (V, E) of maximum degree d and size n. Start from any node v ∈ V. In a first step at most d other nodes can be reached. In two steps at most d · (d − 1) additional nodes can be reached. Thus, in general, in at most k steps at most

    1 + Σ_{i=0}^{k−1} d · (d − 1)^i = 1 + d · ((d − 1)^k − 1)/((d − 1) − 1) ≤ d · (d − 1)^k/(d − 2)

nodes (including v) can be reached. This has to be at least n to ensure that v can reach all other nodes in V within k steps. Hence,

    (d − 1)^k ≥ ((d − 2) · n)/d  ⇔  k ≥ log_{d−1}((d − 2) · n/d).

Since log_{d−1}((d − 2)/d) > −2 for all d > 2, this is true only if k ≥ ⌈(log n)/(log(d − 1))⌉ − 2.

Remarks:

• In other words, constant-degree hypercubic networks feature an asymptotically optimal diameter.

• There are a few other interesting graph classes, e.g., expander graphs (an expander graph is a sparse graph which has high connectivity properties, that is, from every not too large subset of nodes you are connected to a larger set of nodes), or small-world graphs (popular representations of social networks). At first sight hypercubic networks seem to be related to expanders and small-world graphs, but they are not.

23.4 DHT & Churn

As written earlier, a DHT essentially is a hypercubic structure with nodes having identifiers such that they span the ID space of the objects to be stored. We described the straightforward way how the ID space is mapped onto the peers for the hypercube. Other hypercubic structures may be more complicated: The butterfly network, for instance, may directly use the d + 1 layers for replication, i.e., all the d + 1 nodes with the same ID are responsible for the same hash prefix. For other hypercubic networks, e.g., the pancake graph (see exercises), assigning the object space to peer nodes may be more difficult.

In general a DHT has to withstand churn. Usually, peers are under control of individual users who turn their machines on or off at any time. Such peers join and leave the P2P system at high rates ("churn"), a problem that does not exist in orthodox distributed systems, hence P2P systems fundamentally differ from old-school distributed systems where it is assumed that the nodes in the system are relatively stable. In traditional distributed systems a single unavailable node is a minor disaster: all the other nodes have to get a consistent view of the system again, essentially they have to reach consensus which nodes are available. In a P2P system there is usually so much churn that it is impossible to have a consistent view at any time.

Most P2P systems in the literature are analyzed against an adversary that can crash a fraction of random peers. After crashing a few peers the system is given sufficient time to recover again. However, this seems unrealistic. The scheme sketched in this section significantly differs from this in two major aspects. First, we assume that joins and leaves occur in a worst-case manner. We think of an adversary that can remove and add a bounded number of peers; it can choose which peers to crash and how peers join. We assume that a joining peer knows a peer which already belongs to the system. Second, the adversary does not have to wait until the system is recovered before it crashes the next batch of peers. Instead, the adversary can constantly crash peers, while the system is trying to stay alive. Indeed, the system is never fully repaired but always fully functional. In particular, the system is resilient against an adversary that continuously attacks the "weakest part" of the system. The adversary could for example insert a crawler into the P2P system, learn the topology of the system, and then repeatedly crash selected peers, in an attempt to partition the P2P network. The system counters such an adversary by continuously moving the remaining or newly joining peers towards the sparse areas.

Clearly, we cannot allow the adversary to have unbounded capabilities. In particular, in any constant time interval, the adversary can at most add and/or remove O(log n) peers, n being the total number of peers currently in the system. This model covers an adversary which repeatedly takes down machines by a distributed denial of service attack, however only a logarithmic number of machines at each point in time. The algorithm relies on messages being delivered timely, in at most constant time between any pair of operational peers, i.e., the synchronous model. Using the trivial synchronizer this is not a problem. We only need bounded message delays in order to have a notion of time which is needed for the adversarial model. The duration of a round is then proportional to the propagation delay of the slowest message.

In the remainder of this section, we give a sketch of the system: For simplicity, the basic structure of the P2P system is a hypercube. Each peer is part of a distinct hypercube node; each hypercube node consists of Θ(log n) peers. Peers have connections to other peers of their hypercube node and to peers of the neighboring hypercube nodes.¹ Because of churn, some of the peers have to change to another hypercube node such that up to constant factors, all hypercube nodes own the same number of peers at all times. If the total number of peers grows or shrinks above or below a certain threshold, the dimension of the hypercube is increased or decreased by one, respectively.

¹ Having a logarithmic number of hypercube neighbor nodes, each with a logarithmic number of peers, means that each peer has Θ(log^2 n) neighbor peers. However, with some additional bells and whistles one can achieve Θ(log n) neighbor peers.
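The counting argument in the proof of Theorem 23.7 can also be checked numerically. The following Python sketch (an added illustration, not part of the original notes; the sample values of d and n are arbitrary) computes the maximum number of nodes reachable within k hops and verifies that the smallest sufficient k respects the stated bound:

```python
import math

def max_reachable(d, k):
    # At most 1 + d + d(d-1) + ... + d(d-1)^(k-1) nodes are reachable
    # within k hops from any node of a graph with maximum degree d.
    return 1 + sum(d * (d - 1) ** i for i in range(k))

def smallest_sufficient_k(d, n):
    # Smallest k such that a degree-d graph could possibly reach n nodes.
    k = 0
    while max_reachable(d, k) < n:
        k += 1
    return k

# Theorem 23.7: the diameter is at least ceil(log n / log(d-1)) - 2.
for d, n in [(3, 1000), (4, 10**6), (10, 10**6)]:
    bound = math.ceil(math.log(n) / math.log(d - 1)) - 2
    assert smallest_sufficient_k(d, n) >= bound
```

For d = 3 and n = 1000, for instance, at least 9 hops are needed, while the closed-form bound of the theorem gives 8; the theorem loses at most the additive constant 2.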
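The degree count in the footnote is easy to make concrete. Below is a toy Python sketch (the dimension and the number of peers per hypercube node are illustrative assumptions, not values from the text): hypercube nodes are labeled by bit strings, and neighboring nodes differ in exactly one of the d bits.

```python
def node_neighbors(node_id, d):
    # In a d-dimensional hypercube, the neighbors of a node are obtained
    # by flipping exactly one of its d bits.
    return [node_id ^ (1 << i) for i in range(d)]

# Toy values (illustrative only): dimension d, and p peers hosted
# by every simulated hypercube node.
d, p = 4, 5
assert node_neighbors(0b0000, d) == [0b0001, 0b0010, 0b0100, 0b1000]

# A peer knows the other p - 1 peers of its own node plus the p peers in
# each of the d neighboring nodes -- Theta(log^2 n) neighbor peers when
# both d and p are Theta(log n), as the footnote observes.
naive_peer_degree = (p - 1) + d * p
assert naive_peer_degree == 24
```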
The balancing of peers among the hypercube nodes can be seen as a dynamic token distribution problem on the hypercube. Each node of the hypercube has a certain number of tokens, the goal is to distribute the tokens along the edges of the graph such that all nodes end up with the same or almost the same number of tokens. While tokens are moved around, an adversary constantly inserts and deletes tokens. See also Figure 23.7.

Figure 23.7: A simulated 2-dimensional hypercube with four nodes, each consisting of several peers. Also, all the peers are either in the core or in the periphery of a node. All peers within the same node are completely connected to each other, and additionally, all peers of a node are connected to the core peers of the neighboring nodes. Only the core peers store data items, while the peripheral peers move between the nodes to balance biased adversarial changes.

In summary, the P2P system builds on two basic components: i) an algorithm which performs the described dynamic token distribution and ii) an information aggregation algorithm which is used to estimate the number of peers in the system and to adapt the dimension of the hypercube accordingly:

Theorem 23.8 (DHT with Churn). We have a fully scalable, efficient P2P system which tolerates O(log n) worst-case joins and/or crashes per constant time interval. As in other P2P systems, peers have O(log n) neighbors, and the usual operations (e.g., search, insert) take time O(log n).

Remarks:

• Indeed, handling churn is only a minimal requirement to make a P2P system work. Later studies proposed more elaborate architectures which can also handle other security issues, e.g., privacy or Byzantine attacks.

• It is surprising that unstructured (in fact, hybrid) P2P systems dominate structured P2P systems in the real world. One would think that structured P2P systems have advantages, in particular their efficient logarithmic data lookup. On the other hand, unstructured P2P networks are simpler, in particular in light of non-exact queries.

23.5 Storage and Multicast

As seen in the previous section, practical implementations often incorporate some non-rigid (flexible) part. In a system called Pastry, prefix-based overlay structures similar to hypercubes are used to implement a DHT. Peers maintain connections to other peers in the overlay according to the lengths of the shared prefixes of their respective identifiers, where each peer carries a d-bit peer identifier. Let β denote the number of bits that can be fixed at a peer to route any message to an arbitrary destination. For i ∈ {0, β, 2β, 3β, . . .}, a peer chooses, if possible, 2^β − 1 neighbors whose identifiers are equal in the i most significant bits and differ in the subsequent β bits by one of 2^β − 1 possibilities. If peer identifiers are chosen uniformly at random, the length of the longest shared prefix is bounded by O(log n) in an overlay containing n peers; thus, only O((log n)(2^β − 1)/β) connections need to be maintained. Moreover, every peer reaches every other peer in O((log n)/β) hops by repetitively selecting the next hop to fix β more bits toward the destination peer identifier, yielding a logarithmic overlay diameter.

The advantage of prefix-based over more rigid DHT structures is that there is a large choice of neighbors for most prefixes. Peers are no longer bound to connect to peers exactly matching a given identifier. Instead peers are enabled to connect to any peer matching a desired prefix, regardless of subsequent identifier bits. In particular, for a shared prefix of length 0, any peer among half of all peers can be chosen. The flexibility of such a neighbor policy allows the optimization of secondary criteria. Peers may favor peers with a low latency and select multiple neighbors for the same prefix to gain resilience against churn. Regardless of the choice of neighbors, the overlay always remains connected with a bounded degree and diameter.

Such overlay structures are not limited to distributed storage. Instead, they are equally well suited for the distribution of content, such as multicasting of radio stations or television channels. In a basic multicasting scheme, a source with identifier 00...0 may forward new data blocks to two peers having identifiers starting with 0 and 1. They in turn forward the content to peers having identifiers starting with 00, 01, 10, and 11. The recursion finishes once all peers are reached. This basic scheme has the subtle shortcoming that data blocks may pass by multiple times at a single peer because a predecessor can match a prefix further down in its distribution branch.

The subsequent multicasting scheme M avoids this problem by modifying the topology and using a different routing scheme. For simplicity, the neighbor selection policy is presented for the case β = 1. In order to use M, the peers must store links to a different set of neighbors. A peer v with the identifier b0 b1 . . . bd−1 stores links to peers whose identifiers start with b0 b1 . . . bi−1 b̄i bi+1 and b0 b1 . . . bi−1 b̄i b̄i+1 for all i ∈ {0, . . . , d − 2}, where b̄i denotes the complement of bit bi. For example, the peer with the identifier 0000 has to maintain connections to peers whose identifiers start with the prefixes 10, 11, 010, 011, 0010, and 0011. Pseudo-code for the algorithm is given in Algorithm 92.

The parameters are the length π of the prefix that is not to be modified and at most one critical predecessor vc. If β = 1, any node v tries to forward the data block to two peers v1 and v2. The procedure is called at the source v0 with arguments π := 0 and vc := ∅, resulting in the two messages forward(1, v0) to v1 and forward(1, ∅) to v2. The peer v1 is chosen locally such that the prefix its
Algorithm 92 M: forward(π, vc) at peer v.
 1: S := {v′ ∈ Nv | ℓ(v′, v) ≥ π + 1}
 2: choose v1 ∈ S: ℓ(v1, v) ≤ ℓ(ṽ, v) ∀ṽ ∈ S
 3: if v1 ≠ ∅ then
 4:   forward(ℓ(v1, v), v) to v1
 5: end if
 6: if vc ≠ ∅ then
 7:   choose v2 ∈ Nv: ℓ(v2, vc) = π + 1
 8:   if v2 = ∅ then
 9:     v2 := getNext(v) from vc
10:   end if
11:   if v2 ≠ ∅ then
12:     forward(ℓ(v2, vc), vc) to v2
13:   end if
14: else
15:   choose v2 ∈ Nv: ℓ(v2, v) = π
16:   if v2 ≠ ∅ then
17:     forward(π + 1, vc) to v2
18:   end if
19: end if

identifier shares with the identifier of v is the shortest among all those whose shared prefix length is at least π + 1. This value ℓ(v1, v) and v itself are the parameters included in the forward message to peer v1, if such a peer exists. The second peer is chosen similarly, but with respect to vc and not v itself. If no suitable peer is found in the routing table, the peer vc is queried for a candidate using the subroutine getNext which is described in Algorithm 93. This step is required because node v cannot deduce from its routing table whether a peer v2 with the property ℓ(v2, vc) ≥ π + 1 exists. In the special case when vc = ∅, v2 is chosen locally, if possible, such that ℓ(v2, v) = π. In Figure 23.8, a sample spanning tree resulting from the execution of M is depicted.

Algorithm 93 getNext(vs) at peer v
1: S := {v′ ∈ Nv | ℓ(v′, v) > ℓ(vs, v)}
2: choose vr ∈ S: ℓ(vr, v) ≤ ℓ(ṽ, v) ∀ṽ ∈ S
3: send vr to vs

The presented multicasting scheme M has the property that, at least in a static setting, wherein peers neither join nor leave the overlay, all peers can be reached and each peer receives a data block exactly once, as summarized by the following theorem:

Theorem 23.9. In a static overlay, algorithm M has the following properties:

(a) It does not induce any duplicate messages (loop-free), and

(b) all peers are reached (complete).

Figure 23.8: The spanning tree induced by a forward message initiated at peer v0 is shown. The fixed prefix is underlined at each peer, whereas prefixes in bold print indicate that the parent peer has been constrained to forward the packet to peers with these prefixes.

Remarks:

• The multicast scheme M benefits from the same overlay properties as DHTs; there is a bounded diameter and peer degree. Peers can maintain backup neighbors and favor low-latency, high-bandwidth peers as neighbors. Most importantly, intermediate peers have the possibility to choose among multiple (backup) neighbors to forward incoming data blocks. This, in turn, allows peers to quickly adapt to changing network conditions such as churn and congestion. It is not necessary to rebuild the overlay structure after failures. In doing so, a system can gain both robustness and efficiency.

• In contrast, for more rigid data structures, such as trees, data blocks are forced to travel along fixed data paths, rendering them susceptible to any kind of failure.

• Conversely, unstructured and more random overlay networks lack the structure to immediately forward incoming data blocks. Instead, such systems have to rely on the exchange of periodic notifications about available data blocks and requests and responses for the download of missing blocks, significantly increasing distribution delays. Furthermore, the lack of structure makes it hard to maintain connectivity among all peers. If the neighbor selection is not truly random, but based on other criteria such as latency and bandwidth, clusters may form that disconnect themselves from the remaining overlay.

There is a variety of further flavors and optimizations for prefix-based overlay structures. For example, peers have a logarithmic number of neighbors in the presented structure. For 100,000 and more peers, peers have at least 20 neighbors. Selecting a backup neighbor doubles the number of neighbors to 40. Using M further doubles their number to 80. A large number of neighbors incurs substantial maintenance costs. The subsequent variation limits the number of neighbors with a slight adjustment of the overlay structure. It organizes peers into disjoint groups G0, G1, . . . , Gm of about equal size. The introduction of groups is motivated by the fact that they will enable peers to have neighboring connections for a subset of all shared prefixes while maintaining the favorable overlay properties. The source, feeding blocks into the overlay, joins group G0.
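The neighbor prefixes of the multicast scheme M described above are mechanical to derive. The following Python sketch (an added illustration, not code from the original notes) reproduces the example from the text, where the peer with identifier 0000 maintains the prefixes 10, 11, 010, 011, 0010, and 0011:

```python
def m_neighbor_prefixes(identifier):
    # For a peer with identifier b0 b1 ... b(d-1), scheme M stores links to
    # peers whose identifiers start with b0 ... b(i-1) b̄i b(i+1) and
    # b0 ... b(i-1) b̄i b̄(i+1), for all i in {0, ..., d-2}.
    d = len(identifier)
    flip = lambda b: '1' if b == '0' else '0'
    prefixes = []
    for i in range(d - 1):
        head = identifier[:i] + flip(identifier[i])
        prefixes.append(head + identifier[i + 1])        # keep bit i+1
        prefixes.append(head + flip(identifier[i + 1]))  # flip bit i+1 too
    return prefixes

# The example from the text: peer 0000.
assert m_neighbor_prefixes("0000") == ["10", "11", "010", "011", "0010", "0011"]
```

Every peer thus stores 2(d − 1) prefixes, two per position i, matching the doubling of the neighbor count mentioned above.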
The other peers randomly join groups. Let g(v) denote the function that assigns each peer v to a group, i.e., v ∈ Gg(v).

Peers select neighboring peers based not solely on shared prefixes but also on group membership. A peer v with the identifier b0 b1 . . . bd−1 stores links to neighboring peers whose identifiers start with b0 b1 . . . bi−1 b̄i and belong to group g(v) + 1 mod m for all i ∈ {g(v), g(v) + m, g(v) + 2m, g(v) + 3m, . . .}. Furthermore, let f denote the first index i where no such peer exists. As fallback, peer v stores further links to peers from arbitrary groups whose identifiers start with b0 b1 . . . bk−1 b̄k for all k ≥ f − m + 1. The fallback connections allow a peer to revert to the regular overlay structure for the longest shared prefixes where only few peers exist.

As an example, a scenario with m = 4 groups is considered. A peer with identifier 00...0 belonging to group G2 has to maintain connections to peers from group G3 that share the prefixes 001, 0000001, 00000000001, etc. In an overlay with 100 peers, the peer is unlikely to find a neighbor for a prefix length larger than log(100), such as prefix 00000000001. Instead, it further maintains fallback connections to peers from arbitrary groups having identifiers starting with the prefixes 00000001, 000000001, 0000000001, etc. (if such peers exist).

Remarks:

• By applying the presented grouping mechanism, the total number of neighbors is reduced to 2(log n)/m + c with constant c for fallback connections. (Note that peers have both outgoing neighbors to the next group and incoming neighbors from the previous group, doubling the number of neighbors.)

• Setting the number of groups m = log n gives a constant number of neighbors regardless of the overlay size.

Chapter Notes

The paper of Plaxton, Rajaraman, and Richa [PRR97] laid out a blueprint for many so-called structured P2P architecture proposals, such as Chord [SMK+01], CAN [RFH+01], Pastry [RD01], Viceroy [MNR02], Kademlia [MM02], Koorde [KK03], SkipGraph [AS03], SkipNet [HJS+03], or Tapestry [ZHS+04]. Also the paper of Plaxton et al. was standing on the shoulders of giants. Some of its eminent precursors are: linear and consistent hashing [KLL+97], locating shared objects [AP90, AP91], compact routing [SK85, PU88], and even earlier: hypercubic networks, e.g. [AJ75, Wit81, GS81, BA84].

Furthermore, the techniques in use for prefix-based overlay structures are related to a proposal called LAND, a locality-aware distributed hash table proposed by Abraham et al. [AMD04].

More recently, a lot of P2P research focused on security aspects, describing for instance attacks [LMSW06, SENB07, Lar07], and provable countermeasures [KSW05, AS09, BSS09]. Another topic currently garnering interest is using P2P to help distribute live streams of video content on a large scale [LMSW07]. There are several recommendable introductory books on P2P computing, e.g. [SW05, SG05, MS07, KW08, BYL08].

Some of the figures in this chapter have been provided by Christian Scheideler.

Bibliography

[AJ75] George A. Anderson and E. Douglas Jensen. Computer Interconnection Structures: Taxonomy, Characteristics, and Examples. ACM Comput. Surv., 7(4):197–213, December 1975.

[AMD04] Ittai Abraham, Dahlia Malkhi, and Oren Dobzinski. LAND: stretch (1 + epsilon) locality-aware networks for DHTs. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '04, pages 550–559, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.

[AP90] Baruch Awerbuch and David Peleg. Efficient Distributed Construction of Sparse Covers. Technical report, The Weizmann Institute of Science, 1990.

[AP91] Baruch Awerbuch and David Peleg. Concurrent Online Tracking of Mobile Users. In SIGCOMM, pages 221–233, 1991.

[AS03] James Aspnes and Gauri Shah. Skip graphs. In SODA, pages 384–393, 2003.

[AS09] Baruch Awerbuch and Christian Scheideler. Towards a Scalable and Robust DHT. Theory Comput. Syst., 45(2):234–260, 2009.

[BA84] L. N. Bhuyan and D. P. Agrawal. Generalized Hypercube and Hyperbus Structures for a Computer Network. IEEE Trans. Comput., 33(4):323–333, April 1984.

[BSS09] Matthias Baumgart, Christian Scheideler, and Stefan Schmid. A DoS-resilient information system for dynamic data management. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, SPAA '09, pages 300–309, New York, NY, USA, 2009. ACM.

[BYL08] John Buford, Heather Yu, and Eng Keong Lua. P2P Networking and Applications. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.

[GS81] J. R. Goodman and C. H. Sequin. Hypertree: A Multiprocessor Interconnection Topology. IEEE Transactions on Computers, C-30(12):923–933, December 1981.

[HJS+03] Nicholas J. A. Harvey, Michael B. Jones, Stefan Saroiu, Marvin Theimer, and Alec Wolman. SkipNet: a scalable overlay network with practical locality properties. In Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4, USITS'03, pages 9–9, Berkeley, CA, USA, 2003. USENIX Association.

[KK03] M. Frans Kaashoek and David R. Karger. Koorde: A Simple Degree-Optimal Distributed Hash Table. In IPTPS, pages 98–107, 2003.
[KLL+97] David R. Karger, Eric Lehman, Frank Thomson Leighton, Rina Panigrahy, Matthew S. Levine, and Daniel Lewin. Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In STOC, pages 654–663, 1997.

[KSW05] Fabian Kuhn, Stefan Schmid, and Roger Wattenhofer. A Self-Repairing Peer-to-Peer System Resilient to Dynamic Adversarial Churn. In 4th International Workshop on Peer-To-Peer Systems (IPTPS), Cornell University, Ithaca, New York, USA, Springer LNCS 3640, February 2005.

[KW08] Javed I. Khan and Adam Wierzbicki. Introduction: Guest editors' introduction: Foundation of peer-to-peer computing. Comput. Commun., 31(2):187–189, February 2008.

[Lar07] Erik Larkin. Storm Worm's virulence may change tactics. http://www.networkworld.com/news/2007/080207-black-hat-storm-worms-virulence.html, August 2007. Last accessed on June 11, 2012.

[LMSW06] Thomas Locher, Patrick Moor, Stefan Schmid, and Roger Wattenhofer. Free Riding in BitTorrent is Cheap. In 5th Workshop on Hot Topics in Networks (HotNets), Irvine, California, USA, November 2006.

[LMSW07] Thomas Locher, Remo Meier, Stefan Schmid, and Roger Wattenhofer. Push-to-Pull Peer-to-Peer Live Streaming. In 21st International Symposium on Distributed Computing (DISC), Lemesos, Cyprus, September 2007.

[MM02] Petar Maymounkov and David Mazières. Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In Revised Papers from the First International Workshop on Peer-to-Peer Systems, IPTPS '01, pages 53–65, London, UK, 2002. Springer-Verlag.

[MNR02] Dahlia Malkhi, Moni Naor, and David Ratajczak. Viceroy: a scalable and dynamic emulation of the butterfly. In Proceedings of the twenty-first annual symposium on Principles of distributed computing, PODC '02, pages 183–192, New York, NY, USA, 2002. ACM.

[MS07] Peter Mahlmann and Christian Schindelhauer. Peer-to-Peer Networks. Springer, 2007.

[PRR97] C. Greg Plaxton, Rajmohan Rajaraman, and Andréa W. Richa. Accessing Nearby Copies of Replicated Objects in a Distributed Environment. In SPAA, pages 311–320, 1997.

[PU88] David Peleg and Eli Upfal. A tradeoff between space and efficiency for routing tables. In Proceedings of the twentieth annual ACM symposium on Theory of computing, STOC '88, pages 43–52, New York, NY, USA, 1988. ACM.

[RD01] Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, November 2001.

[RFH+01] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content-addressable network. SIGCOMM Comput. Commun. Rev., 31(4):161–172, August 2001.

[SENB07] Moritz Steiner, Taoufik En-Najjary, and Ernst W. Biersack. Exploiting KAD: possible uses and misuses. SIGCOMM Comput. Commun. Rev., 37(5):65–70, October 2007.

[SG05] Ramesh Subramanian and Brian D. Goodman. Peer to Peer Computing: The Evolution of a Disruptive Technology. IGI Publishing, Hershey, PA, USA, 2005.

[SK85] Nicola Santoro and Ramez Khatib. Labelling and Implicit Routing in Networks. Comput. J., 28(1):5–8, 1985.

[SMK+01] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM Comput. Commun. Rev., 31(4):149–160, August 2001.

[SW05] Ralf Steinmetz and Klaus Wehrle, editors. Peer-to-Peer Systems and Applications, volume 3485 of Lecture Notes in Computer Science. Springer, 2005.

[Wit81] L. D. Wittie. Communication Structures for Large Networks of Microcomputers. IEEE Trans. Comput., 30(4):264–273, April 1981.

[ZHS+04] Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John Kubiatowicz. Tapestry: a resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications, 22(1):41–53, 2004.
Chapter 24

All-to-All Communication

In the previous chapters, we have mostly considered communication on a particular graph G = (V, E), where any two nodes u and v can only communicate directly if {u, v} ∈ E. This is however not always the best way to model a network. In the Internet, for example, every machine (node) is able to "directly" communicate with every other machine via a series of routers. If every node in a network can communicate directly with all other nodes, many problems can be solved easily. For example, assume we have n servers, each hosting an arbitrary number of (numeric) elements. If all servers are interested in obtaining the maximum of all elements, all servers can simultaneously, i.e., in one communication round, send their local maximum element to all other servers. Once these maxima are received, each server knows the global maximum.

Note that we can again use graph theory to model this all-to-all communication scenario: The communication graph is simply the complete graph Kn := (V, (V choose 2)). If each node can send its entire local state in a single message, then all problems could be solved in 1 communication round in this model! Since allowing unbounded messages is not realistic in most practical scenarios, we restrict the message size: Assuming that all node identifiers and all other variables in the system (such as the numeric elements in the example above) can be described using O(log n) bits, each node can only send a message of size O(log n) bits to all other nodes (messages to different neighbors can be different). In other words, only a constant number of identifiers (and elements) can be packed into a single message. Thus, in this model, the limiting factor is the amount of information that can be transmitted in a fixed amount of time. This is fundamentally different from the model we studied before where nodes are restricted to local information about the network graph.

In this chapter, we study one particular problem in this model, the com-

Remarks:

• Since we have a complete communication graph, the graph has (n choose 2) edges in the beginning.

• As in Chapter 2, we assume that no two edges of the graph have the same weight. Recall that this assumption ensures that the MST is unique. Recall also that this simplification is not essential as one can always break ties by using the IDs of adjacent vertices.

For simplicity, we assume that we have a synchronous model (as we are only interested in the time complexity, our algorithm can be made asynchronous using synchronizer α at no additional cost, cf. Chapter 10). As usual, in every round, every node can send a (potentially different) message to each of its neighbors. In particular, note that the message delay is 1 for every edge e independent of the weight ωe. As mentioned before, every message can contain a constant number of node IDs and edge weights (and O(log n) additional bits).

Remarks:

• Note that for graphs of arbitrary diameter D, if there are no bounds on the number of messages sent, on the message size, and on the amount of local computations, there is a straightforward generic algorithm to compute an MST in time D: In every round, every node sends its complete state to all its neighbors. After D rounds, every node knows the whole graph and can compute any graph structure locally without any further communication.

• In general, the diameter D is also an obvious lower bound for the time needed to compute an MST. In a weighted ring, e.g., it takes time D to find the heaviest edge. In fact, on the ring, time D is required to compute any spanning tree.

In this chapter, we are not concerned with lower bounds, we want to give an algorithm that computes the MST as quickly as possible instead! We again use the following lemma that is proven in Chapter 2.

Lemma 24.2. For a given graph G let T be an MST, and let T′ ⊆ T be a subgraph (also known as a fragment) of the MST. Edge e = (u, v) is an outgoing edge of T′ if u ∈ T′ and v ∉ T′ (or vice versa). Let the minimum weight outgoing edge of the fragment T′ be the so-called blue edge b(T′). Then T′ ∪ b(T′) ⊆ T.

Lemma 24.2 leads to a straightforward distributed MST algorithm. We start with an empty graph, i.e., every node is a fragment of the MST. The algorithm
putation of a minimum spanning tree (MST), i.e., we will again look at the consists of phases. In every phase, we add the blue edge b(T 0 ) of every existing
construction of a basic network structure. Let us first review the definition of a fragment T 0 to the MST. Algorithm 94 shows how the described simple MST
minimum spanning tree from Chapter 2. We assume that each edge e is assigned construction can be carried out in a network of diameter 1.
a weight ωe .
Theorem 24.3. On a complete graph, Algorithm 94 computes an MST in time
Definition 24.1 (MST). Given a weighted graph G = (V, E, O(log n).
P ω). The MST
of G is a spanning tree T minimizing ω(T ), where ω(H) = e∈H ωe for any
subgraph H ⊆ G. Proof. The algorithm is correct because of Lemma 24.2. Every node only needs
to send a single message to all its neighbors in every phase (line 4). All other
computations can be done locally without sending other messages. In particular,

275
the blue edge of a given fragment is the lightest edge sent by any node of that fragment. Because every node always knows the current MST (and all current fragments), lines 5 and 6 can be performed locally.

In every phase, every fragment connects to at least one other fragment. The minimum fragment size therefore at least doubles in every phase. Thus, the number of phases is at most log2 n.

Algorithm 94 Simple MST Construction (at node v)
1: // all nodes always know all current MST edges and thus all MST fragments
2: while v has a neighbor u in a different fragment do
3:   find the lowest-weight edge e between v and a node u in a different fragment
4:   send e to all nodes
5:   determine the blue edges of all fragments
6:   add the blue edges of all fragments to the MST, update fragments
7: end while

Remarks:

• Algorithm 94 does essentially the same thing as the GHS algorithm (Algorithm 11) discussed in Chapter 2. Because we now have a complete graph and thus every node can communicate with every other node, things get simpler (and also much faster).

• Algorithm 94 does not make use of the fact that a node can send different messages to different nodes. Making use of this possibility will allow us to significantly reduce the running time of the algorithm.

Our goal is now to improve Algorithm 94. We assume that every node has a unique identifier. By sending its own identifier to all other nodes, every node knows the identifiers of all other nodes after one round. Let ℓ(F) be the node with the smallest identifier in fragment F. We call ℓ(F) the leader of fragment F. In order to improve the running time of Algorithm 94, we need to be able to connect every fragment to more than one other fragment in a single phase. Algorithm 95 shows how the nodes can learn about the k = |F| lightest outgoing edges of each fragment F (in constant time!).

Algorithm 95 Fast MST construction (at node v)
1: // all nodes always know all current MST edges and thus all MST fragments
2: repeat
3:   F := fragment of v
4:   ∀F′ ≠ F, compute the min-weight edge eF′ connecting v to F′
5:   ∀F′ ≠ F, send eF′ to ℓ(F′)
6:   if v = ℓ(F) then
7:     ∀F′ ≠ F, determine the min-weight edge eF,F′ between F and F′
8:     k := |F|
9:     E(F) := k lightest edges among the eF,F′ for F′ ≠ F
10:    send each edge in E(F) to a different node in F
       // for simplicity assume that v also sends an edge to itself
11:  end if
12:  send the edge received from ℓ(F) to all nodes
13:  // the following operations are performed locally by each node
14:  E′ := edges received from other nodes
15:  AddEdges(E′)
16: until all nodes are in the same fragment

Given this set E′ of edges, each node can locally decide which edges can safely be added to the constructed tree by calling the subroutine AddEdges (Algorithm 96). Note that the set of received edges E′ in line 14 is the same for all nodes. Since all nodes know all current fragments, all nodes add the same set of edges!

Algorithm 96 uses the lightest outgoing edge that connects two fragments (into a larger super-fragment) as long as it is safe to add this edge, i.e., as long as it is clear that this edge is a blue edge. A (super-)fragment that has outgoing edges in E′ that are surely blue edges is called safe. As we will see, a super-fragment F is safe if all the original fragments that make up F are still incident to at least one edge in E′ that has not yet been considered. In order to determine whether all lightest outgoing edges in E′ that are incident to a certain fragment F have been processed, a counter c(F) is maintained (see line 2). If an edge incident to two (distinct) fragments Fi and Fj is processed, both c(Fi) and c(Fj) are decremented by 1 (see line 8).

An edge connecting two distinct super-fragments F′ and F′′ is added if at least one of the two super-fragments is safe. In this case, the two super-fragments are merged into one (new) super-fragment. The new super-fragment is safe if and only if both original super-fragments are safe and the processed edge e is not the last edge in E′ incident to any of the two fragments Fi and Fj that are incident to e, i.e., both counters c(Fi) and c(Fj) are still positive (see line 12). The considered edge e may not be added for one of two reasons. It is possible that both F′ and F′′ are not safe. Since a super-fragment cannot become safe again, nothing has to be done in this case. The second reason is that F′ = F′′. In this case, this single fragment may become unsafe if e reduced either c(Fi) or c(Fj) to zero (see line 18).

Lemma 24.4. The algorithm only adds MST edges.

Proof. We have to prove that at the time we add an edge e in line 9 of Algorithm 96, e is the blue edge of some (super-)fragment. By definition, e is the lightest edge that has not been considered and that connects two distinct super-fragments F′ and F′′. Since e is added, we know that either safe(F′) or safe(F′′) is true. Without loss of generality, assume that F′ is safe. According to the definition of safe, this means that from each fragment F in the super-fragment F′ we know at least the lightest outgoing edge, which implies that we also know the lightest outgoing edge, i.e., the blue edge, of F′. Since e is the lightest edge that connects any two super-fragments, it must hold that e is exactly the blue edge of F′. Thus, whenever an edge is added, it is an MST edge.

Theorem 24.5. Algorithm 95 computes an MST in time O(log log n).

Proof. Let βk denote the size of the smallest fragment after phase k of Algorithm 95. We first show that every fragment merges with at least βk other fragments in each phase. Since the size of each fragment after phase k is at
least βk by definition, we get that the size of each fragment after phase k + 1 is at least βk(βk + 1). Assume that a fragment F, consisting of at least βk nodes, does not merge with βk other fragments in phase k + 1 for any k ≥ 0. Note that F cannot be safe, because being safe implies that there is at least one edge in E′ that has not been considered yet and that is the blue edge of F. Hence, the phase cannot be completed in this case. On the other hand, if F is not safe, then at least one of its sub-fragments has used up all its βk edges to other fragments. However, such an edge is either used to merge two fragments or it must have been dropped because the two fragments already belong to the same fragment because another edge connected them (in the same phase). In either case, we get that any fragment, and in particular F, must merge with at least βk other fragments.

Given that the minimum fragment size grows (quickly) in each phase and that only edges belonging to the MST are added according to Lemma 24.4, we conclude that the algorithm correctly computes the MST. The fact that βk+1 ≥ βk(βk + 1) implies that βk ≥ 2^(2^(k−1)) for any k ≥ 1. Therefore, after 1 + log2 log2 n phases, the minimum fragment size is n, and thus all nodes are in the same fragment.

Algorithm 96 AddEdges(E′): Given the set of edges E′, determine which edges are added to the MST
1: Let F1, . . . , Fr be the initial fragments
2: ∀Fi ∈ {F1, . . . , Fr}, c(Fi) := # incident edges in E′
3: Let F̄1 := F1, . . . , F̄r := Fr be the initial super-fragments
4: ∀F̄i ∈ {F̄1, . . . , F̄r}, safe(F̄i) := true
5: while E′ ≠ ∅ do
6:   e := lightest edge in E′ between the original fragments Fi and Fj
7:   E′ := E′ \ {e}
8:   c(Fi) := c(Fi) − 1, c(Fj) := c(Fj) − 1
9:   if e connects super-fragments F′ ≠ F′′ and (safe(F′) or safe(F′′)) then
10:    add e to MST
11:    merge F′ and F′′ into one super-fragment Fnew
12:    if safe(F′) and safe(F′′) and c(Fi) > 0 and c(Fj) > 0 then
13:      safe(Fnew) := true
14:    else
15:      safe(Fnew) := false
16:    end if
17:  else if F′ = F′′ and (c(Fi) = 0 or c(Fj) = 0) then
18:    safe(F′) := false
19:  end if
20: end while

Chapter Notes

There is a considerable amount of work on distributed MST construction. Table 24.1 lists the most important results for various network diameters D. In the above text we focus only on D = 1.

Upper Bounds

Graph Class        Time Complexity        Authors
General Graphs     O(D + √n · log* n)     Kutten, Peleg [KP95]
Diameter 2         O(log n)               Lotker, Patt-Shamir, Peleg [LPSP06]
Diameter 1         O(log log n)           Lotker, Patt-Shamir, Pavlov, Peleg [LPPSP03]

Lower Bounds

Graph Class        Time Complexity        Authors
Diameter Ω(log n)  Ω(D + √(n/log n))      Das Sarma, Holzer, Kor, Korman, Nanongkai, Pandurangan, Peleg, Wattenhofer [SHK+12]
Diameter 4         Ω((n/log n)^(1/3))     Das Sarma et al. [SHK+12]
Diameter 3         Ω((n/log n)^(1/4))     Das Sarma et al. [SHK+12]

Table 24.1: Time complexity of distributed MST construction

We want to remark that the above lower bounds remain true for randomized algorithms. We cannot even hope for a better randomized approximation algorithm for the MST as long as the approximation factor is bounded polynomially in n. On the other hand, it is not known whether the O(log log n) time complexity of Algorithm 95 is optimal. In fact, no lower bounds are known for MST construction on graphs of diameter 1 and 2. Algorithm 95 makes use of the fact that it is possible to send different messages to different nodes. If we assume that every node always has to send the same message to all other nodes, Algorithm 94 is the best algorithm that is known. Also for this simpler case, no lower bound is known.

Bibliography

[KP95] Shay Kutten and David Peleg. Fast distributed construction of k-dominating sets and applications. In Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, pages 238–251. ACM, 1995.

[LPPSP03] Zvi Lotker, Elan Pavlov, Boaz Patt-Shamir, and David Peleg. MST construction in O(log log n) communication rounds. In Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 94–100. ACM, 2003.

[LPSP06] Zvi Lotker, Boaz Patt-Shamir, and David Peleg. Distributed MST for constant diameter graphs. Distributed Computing, 18(6):453–460, 2006.

[SHK+12] Atish Das Sarma, Stephan Holzer, Liah Kor, Amos Korman, Danupon Nanongkai, Gopal Pandurangan, David Peleg, and Roger Wattenhofer. Distributed verification and hardness of distributed approximation. SIAM Journal on Computing, 41(5):1235–1265, 2012.
Chapter 25

Dynamic Networks

Many large-scale distributed systems and networks are dynamic. In some networks, e.g., peer-to-peer, nodes participate only for a short period of time, and the topology can change at a high rate. In wireless ad-hoc networks, nodes are mobile and move around. In this chapter, we will study how to solve some basic tasks if the network is dynamic. Under what conditions is it possible to compute an accurate estimate of the size or some other property of the system? How efficiently can information be disseminated reliably in the network? To what extent does stability in the communication graph help solve these problems?

There are various reasons why networks can change over time, and as a consequence, there also is a wide range of possible models for dynamic networks. Nodes might join or leave a distributed system. Some components or communication links may fail in different ways. Especially if the network devices are mobile, the connectivity between them can change. Dynamic changes can occur constantly, or they might be infrequent enough so that the system can adapt to each change individually.

We will look at a synchronous dynamic network model in which the graph can change from round to round in a worst-case manner. To simplify things (and to make the problems we study well-defined), we assume that the set of nodes in the network is fixed and does not change. However, we will make almost no assumptions about how the set of edges changes over time. We require some guarantees about the connectivity; apart from this, in each round, the communication graph is chosen in a worst-case manner by an adversary.

25.1 Synchronous Edge-Dynamic Networks

We model a synchronous dynamic network by a dynamic graph G = (V, E), where V is a static set of nodes, and E : N0 → (V choose 2) is a function mapping a round number r ∈ N0 to a set of undirected edges E(r). Here (V choose 2) := {{u, v} | u, v ∈ V} is the set of all possible undirected edges over V.

Definition 25.1 (T-Interval Connectivity). A dynamic graph G = (V, E) is said to be T-interval connected for T ∈ N if for all r ∈ N, the static graph Gr,T := (V, ∩_{i=r}^{r+T−1} E(i)) is connected. If G is 1-interval connected, we say that G is always connected.

For simplicity, we restrict ourselves to deterministic algorithms. Nodes communicate with each other using anonymous broadcast. At the beginning of round r, each node u decides what message to broadcast based on its internal state; at the same time (and independently), the adversary chooses a set E(r) of edges for the round. As in standard synchronous message passing, all nodes v for which {u, v} ∈ E(r) receive the message broadcast by node u in round r, and each node can perform arbitrary local computations upon receiving the messages from its neighbors. We assume that all nodes in the network have a unique identifier (ID). In most cases, we will assume that messages are restricted to O(log n) bits. In these cases, we assume that node IDs can be represented using O(log n) bits, so that a constant number of node IDs and some additional information can be transmitted in a single message. We refer to the special case where all nodes are woken up at once as synchronous start and to the general case as asynchronous start.

We assume that each node in the network starts an execution of the protocol in an initial state which contains its own ID and its input. Additionally, nodes know nothing about the network, and initially cannot distinguish it from any other network.

25.2 Problem Definitions

In the context of this chapter, we study the following problems.

Counting. An algorithm is said to solve the counting problem if whenever it is executed in a dynamic graph comprising n nodes, all nodes eventually terminate and output n.

k-verification. Closely related to counting, the k-verification problem requires nodes to determine whether or not n ≤ k. All nodes begin with k as their input, and must eventually terminate and output "yes" or "no". Nodes must output "yes" if and only if there are at most k nodes in the network.

k-token dissemination. An instance of k-token dissemination is a pair (V, I), where I : V → P(T) assigns a set of tokens from some domain T to each node, and |∪_{u∈V} I(u)| = k. An algorithm solves k-token dissemination if for all instances (V, I), when the algorithm is executed in any dynamic graph G = (V, E), all nodes eventually terminate and output ∪_{u∈V} I(u). We assume that each token in the nodes' input is represented using O(log n) bits. Nodes may or may not know k, depending on the context. Of particular interest is all-to-all token dissemination, a special case where k = n and each node initially knows exactly one token, i.e., |I(u)| = 1 for all nodes u.

k-committee election. As a useful step towards solving counting and token dissemination, we consider a problem called k-committee election. In this problem, nodes must partition themselves into sets, called committees, such that

a) the size of each committee is at most k, and

b) if k ≥ n, then there is just one committee containing all nodes.
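These two conditions can be checked mechanically from a global view. As an illustration, the following sketch (Python; the function name and the dictionary representation are our own, not from the text) validates a proposed assignment of nodes to committees against conditions a) and b):

```python
from collections import Counter

def is_valid_k_committee(committee, k):
    """committee maps each node to its committee ID.
    Condition a): every committee has size at most k.
    Condition b): if k >= n, all nodes are in one committee."""
    n = len(committee)
    sizes = Counter(committee.values())
    if any(size > k for size in sizes.values()):
        return False  # violates condition a)
    if k >= n and len(sizes) != 1:
        return False  # violates condition b)
    return True

# n = 3 nodes; with k = 2, a split into committees {u, v} and {w} is valid:
print(is_valid_k_committee({"u": 1, "v": 1, "w": 3}, k=2))  # True
# with k = 3 >= n, the same split violates condition b):
print(is_valid_k_committee({"u": 1, "v": 1, "w": 3}, k=3))  # False
```

Of course, no single node has this global view; the point of the distributed protocol in Section 25.4.2 is to reach such an assignment using only local communication.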
Each committee has a unique committee ID, and the goal is for all nodes to eventually terminate and output a committee ID such that the two conditions are satisfied.

25.3 Basic Information Dissemination

To start, let us study how a single piece of information is propagated through a dynamic network. We assume that we have a dynamic network graph G with n nodes such that G is always connected (G is 1-interval connected as defined in Definition 25.1). Further assume that there is a single piece of information (token), which is initially known by a single node.

Theorem 25.2. Assume that there is a single token in the network. Further assume that at time 0 at least one node knows the token and that once they know the token, all nodes broadcast it in every round. In a 1-interval connected graph G = (V, E) with n nodes, after r ≤ n − 1 rounds, at least r + 1 nodes know the token. Hence, in particular after n − 1 rounds, all nodes know the token.

Proof. We prove the theorem by induction on r. Let T(r) be the set of nodes that know the token after r rounds. We need to show that for all r ≥ 0, |T(r)| ≥ min{r + 1, n}. Because we assume that at time 0 at least one node knows the token, clearly, |T(0)| ≥ 1. For the induction step, assume that after r rounds, |T(r)| ≥ min{r + 1, n}. If T(r) = V, we have |T(r + 1)| ≥ |T(r)| = n and we are done. Otherwise, we have V \ T(r) ≠ ∅. Therefore, by the 1-interval connectivity assumption, there must be two nodes u ∈ T(r) and v ∈ V \ T(r) such that {u, v} ∈ E(r + 1). Hence, in round r + 1, node v gets the token, and therefore |T(r + 1)| ≥ |T(r)| + 1 ≥ min{r + 2, n}.

Remarks:

• Note that Theorem 25.2 only shows that after n − 1 rounds all nodes know the token. If the nodes do not know n or an upper bound on n, they do not know if all nodes know the token.

• We can apply the above technique also if there is more than one token in the network, provided that the tokens form a totally-ordered set and nodes forward the smallest (or biggest) token they know. It is then guaranteed that the smallest (resp. biggest) token in the network will be known by all nodes after at most n − 1 rounds. Note, however, that in this case nodes do not know when they know the smallest or biggest token.

The next theorem shows that essentially, for the general asynchronous start case, 1-interval connectivity does not suffice to obtain anything better than what is stated by the above theorem. If nodes do not know n or an upper bound on n initially, they cannot find n.

Theorem 25.3. Counting is impossible in 1-interval connected graphs with asynchronous start.

Proof. Suppose by way of contradiction that A is a protocol for counting which requires at most t(n) rounds in 1-interval connected graphs of size n. Let n′ = max{t(n) + 1, n + 1}. We will show that the protocol cannot distinguish a static line of length n from a dynamically changing line of length n′.

Given a sequence A = a1 ◦ . . . ◦ am, let shift(A, r) denote the cyclic left-shift of A in which the first r symbols (r ≥ 0) are removed from the beginning of the sequence and appended to the end. Consider an execution in a dynamic line of length n′, where the line in round r is composed of two adjacent sections A ◦ B(r), where A = 0 ◦ . . . ◦ (n − 1) remains static throughout the execution, and B(r) = shift(n ◦ . . . ◦ (n′ − 1), r) is left-shifted by one in every round. The computation is initiated by node 0, and all other nodes are initially asleep. We claim that the execution of the protocol in the dynamic graph G = A ◦ B(r) is indistinguishable in the eyes of nodes 0, . . . , n − 1 from an execution of the protocol in the static line of length n (that is, the network comprising section A alone). This is proven by induction on the round number, using the fact that throughout rounds 0, . . . , t(n) − 1 none of the nodes in section A ever receives a message from a node in section B: although one node in section B is awakened in every round, this node is immediately removed and attached at the end of section B, where it cannot communicate with the nodes in section A. Thus, the protocol cannot distinguish the dynamic graph A from the dynamic graph A ◦ B(r), and it produces the wrong output in one of the two graphs.

Remark:

• The above impossibility result extends to all problems introduced in Section 25.2 as long as we do not assume that the nodes know n or an upper bound on n.

In light of the impossibility result of Theorem 25.3, let us now first consider the synchronous start case where all nodes start the protocol at time 0 (with round 1). We first look at the case where there is no bound on the message size and describe a simple linear-time protocol for counting (and token dissemination). The protocol is extremely simple, but it demonstrates some of the ideas used in the later algorithms, where we eliminate the large messages using a stability assumption (T-interval connectivity) which allows nodes to communicate with at least one of their neighbors for at least T rounds.

In the simple protocol, all nodes maintain a set A containing all the IDs they have collected so far. In every round, each node broadcasts A and adds any IDs it receives. Nodes terminate when they first reach a round r in which |A| ≤ r.

A ← {self};
for r = 1, 2, . . . do
  broadcast A;
  receive B1, . . . , Bs from neighbors;
  A ← A ∪ B1 ∪ . . . ∪ Bs;
  if |A| ≤ r then terminate and output |A|;
end
Algorithm 1: Counting in linear time using large messages

Before analyzing Algorithm 1, let us fix some notation that will help to argue about the algorithms we will study. If x is a variable of an algorithm, let xu(r)
be the value of the variable x at node u after r rounds (immediately before the broadcast operation of round r + 1). For instance in Algorithm 1, Au(r) denotes the set of IDs of node u at the end of the rth iteration of the for-loop.

Lemma 25.4. Assume that we are given a 1-interval connected graph G = (V, E) and that all nodes in V execute Algorithm 1. If all nodes together start at time 0, we have |Au(r)| ≥ r + 1 for all u ∈ V and r < n.

Proof. We prove the lemma by induction on r. We clearly have |Au(0)| = 1 for all u because initially each node includes its own ID in A. Hence, the lemma is true for r = 0.

For the induction step, assume that the claim of the lemma is true for some given r < n − 1 for all dynamic graphs G. Let A′u(r + 1) be the set of identifiers known by node u if all nodes start the protocol at time 1 (instead of 0) and run it for r rounds. By the induction hypothesis, we have |A′u(r + 1)| ≥ r + 1. If the algorithm is started at time 0 instead of time 1, the set of identifiers in Au(r + 1) is exactly the union of all the identifiers known by the nodes in A′u(r + 1) after the first round (at time 1). This includes all the nodes in A′u(r + 1) as well as their neighbors in the first round. If |A′u(r + 1)| ≥ r + 2, we also have |Au(r + 1)| ≥ r + 2 and we are done. Otherwise, by 1-interval connectivity, there must be at least one node v ∈ V \ A′u(r + 1) for which there is an edge to a node in A′u(r + 1) in round 1. We therefore have |Au(r + 1)| ≥ |A′u(r + 1)| + 1 ≥ r + 2.

Theorem 25.5. In a 1-interval connected graph G, Algorithm 1 terminates at all nodes after n rounds and outputs n.

Proof. Follows directly from Lemma 25.4. For all nodes u, |Au(r)| ≥ r + 1 > r for all r < n, and |Au(n)| = |Au(n − 1)| = n.

Lemma 25.6. Assume that we are given a 2-interval connected graph G = (V, E) and that all nodes in V execute Algorithm 1. If node u is woken up and starts the algorithm at time t, it holds that |Au(t + 2r)| ≥ r + 1 for all 0 ≤ r < n.

Proof. The proof follows along the same lines as the proof of Lemma 25.4 (see exercises).

Remarks:

• Because we did not bound the maximal message size and because every node receives information (an identifier) from each other node, Algorithm 1 can be used to solve all the problems defined in Section 25.2. For the token dissemination problem, the nodes also need to attach a list of all known tokens to all messages.

• As a consequence of Theorem 25.3, 1-interval connectivity does not suffice to compute the number of nodes n in a dynamic network if nodes start asynchronously. It turns out that in this case, we need a slightly stronger connectivity assumption. If the network is 2-interval connected instead of 1-interval connected, up to a constant factor in the time complexity, the above results can also be obtained in the asynchronous start case (see exercises).

• For the remainder of the chapter, we will only consider the simpler synchronous start case. For T ≥ 2, all discussed results that hold for T-interval connected networks with synchronous start also hold for asynchronous start with the same asymptotic bounds.

25.4 Small Messages

We now switch to the more interesting (and more realistic) case where in each round, each node can only broadcast a message of O(log n) bits. We will first show how to use k-committee election to solve counting. We first describe how to obtain a good upper bound on n. We will then see that the same algorithm can also be used to find n exactly and to solve token dissemination.

25.4.1 k-Verification

The counting algorithm works by successive doubling: at each point the nodes have a guess k for the size of the network, and attempt to verify whether or not k ≥ n. If it is discovered that k < n, the nodes double k and repeat; if k ≥ n, the nodes halt and output the count.

Suppose that nodes start out in a state that represents a solution to k-committee election: each node has a committee ID, such that no more than k nodes have the same ID, and if k ≥ n, then all nodes have the same committee ID. The problem of checking whether k ≥ n is then equivalent to checking whether there is more than one committee: if k ≥ n there must be one committee only, and if k < n there must be more than one. Nodes can therefore check if k ≥ n by executing a simple k-round protocol that checks if there is more than one committee in the graph.

The k-verification protocol. Each node has a local variable x, which is initially set to 1. While xu = 1, node u broadcasts its committee ID. If it hears from some neighbor a different committee ID from its own, or the special value ⊥, it sets xu ← 0 and broadcasts ⊥ in all subsequent rounds. After k rounds, all nodes output the value of their x variable.

Lemma 25.7. If the initial state of the execution represents a solution to k-committee election, at the end of the k-verification protocol each node outputs 1 iff k ≥ n.

Proof. First suppose that k ≥ n. In this case there is only one committee in the graph; no node ever hears a committee ID different from its own. After k rounds all nodes still have x = 1, and all output 1.

Next, suppose k < n. We can show that after the ith round of the protocol, at least i nodes in each committee have x = 0. In any round of the protocol, consider a cut between the nodes that belong to a particular committee and still have x = 1, and the rest of the nodes, which either belong to a different committee or have x = 0. From 1-interval connectivity, there is an edge in the cut, and some node u in the committee that still has xu = 1 hears either a different committee ID or ⊥. Node u then sets xu ← 0, and the number of nodes in the committee that still have x = 1 decreases by at least one. Since
each committee initially contains at most k nodes, after k rounds all nodes in all committees have x = 0, and all output 0.

25.4.2 k-Committee Election

We can solve k-committee election in O(k²) rounds as follows. Each node u stores two local variables, committee_u and leader_u. A node that has not yet joined a committee is called active, and a node that has joined a committee is inactive. Once nodes have joined a committee, they do not change their choice.

Initially all nodes consider themselves leaders, but throughout the protocol, any node that hears an ID smaller than its own adopts that ID as its leader. The protocol proceeds in k cycles, each consisting of two phases, polling and selection.

1. Polling phase: for k − 1 rounds, all nodes propagate the ID of the smallest active node of which they are aware.

2. Selection phase: in this phase, each node that considers itself a leader selects the smallest ID it heard in the previous phase and invites that node to join its committee. An invitation is represented as a pair (x, y), where x is the ID of the leader that issued the invitation, and y is the ID of the invited node. All nodes propagate the smallest invitation of which they are aware for k − 1 rounds (invitations are sorted in lexicographic order, so the invitations issued by the smallest node in the network will win out over other invitations. It turns out, though, that this is not necessary for correctness; it is sufficient for each node to forward an arbitrary invitation from among those it received). At the end of the selection phase, a node that receives an invitation to join its leader's committee does so and becomes inactive. (Invitations issued by nodes that are not the current leader can be accepted or ignored; this, again, does not affect correctness.)

At the end of the k cycles, any node u that has not been invited to join a committee outputs committee_u = u. The details are given in Algorithm 2.

Lemma 25.8. Algorithm 2 solves the k-committee problem in O(k²) rounds in 1-interval connected networks.

Proof. The time complexity is immediate. To prove correctness, we show that after the protocol ends, the values of the local committee_u variables constitute a valid solution to k-committee.

1. In each cycle, each node invites at most one node to join its committee. After k cycles at most k nodes have joined any committee. Note that the first node invited by a leader u to join u's committee is always u itself. Thus, if after k cycles node u has not been invited to join a committee, it follows that u did not invite any other node to join its committee; when it forms its own committee in the last line of the algorithm, the committee's size is 1.

2. Suppose that k ≥ n, and let u be the node with the smallest ID in the network. Following the polling phase of the first cycle, all nodes v have

leader ← self;
committee ← ⊥;
for i = 0, . . . , k do
  // Polling phase
  if committee = ⊥ then
    min_active ← self;  // The node nominates itself for selection
  else
    min_active ← ⊥;
  end
  for j = 0, . . . , k − 1 do
    broadcast min_active;
    receive x1, . . . , xs from neighbors;
    min_active ← min{min_active, x1, . . . , xs};
  end
  // Update leader
  leader ← min{leader, min_active};
  // Selection phase
  if leader = self then
    // Leaders invite the smallest ID they heard
    invitation ← (self, min_active);
  else
    // Non-leaders do not invite anybody
    invitation ← ⊥;
  end
  for j = 0, . . . , k − 1 do
    broadcast invitation;
    receive y1, . . . , ys from neighbors;
    invitation ← min{invitation, y1, . . . , ys};  // (in lexicographic order)
  end
  // Join the leader's committee, if invited
  if invitation = (leader, self) then
    committee ← leader;
  end
end
if committee = ⊥ then
  committee ← self;
end
Algorithm 2: k-committee in always-connected graphs
network. Following the polling phase of the first cycle, all nodes v have
leader_v = u for the remainder of the protocol. Thus, throughout the
execution, only node u issues invitations, and all nodes propagate u's
invitations. Since k ≥ n rounds are sufficient for u to hear the ID of the
minimal active node in the network, in every cycle node u successfully
identifies this node and invites it to join u's committee. After k cycles, all
nodes will have joined.

Remark:

• The protocol can be modified easily to solve all-to-all token dissemination
  if k ≥ n. Let t_u be the token node u received in its input (or ⊥ if node u
  did not receive a token). Nodes attach their tokens to their IDs, and send
  pairs of the form (u, t_u) instead of just u. Likewise, invitations now
  contain the token of the invited node, and have the structure
  (leader, (u, t_u)). The min operation disregards the token and applies only
  to the ID. At the end of each selection phase, nodes extract the token of
  the invited node, and add it to their collection. By the end of the protocol
  every node has been invited to join the committee, and thus all nodes have
  seen all tokens.

25.5 More Stable Graphs

In this section we show that in T-interval connected graphs the computation
can be sped up by a factor of T. To do this we employ a neat pipelining effect,
using the temporarily stable subgraphs that T-interval connectivity guarantees;
this allows us to disseminate information more quickly. Basically, because we
are guaranteed that some edges and paths persist for T rounds, it suffices to
send a particular ID or token only once in T rounds to guarantee progress.
Other rounds can then be used for different tokens. For convenience we assume
that the graph is 2T-interval connected for some T ≥ 1.

    S ← ∅;
    for i = 0, . . . , ⌈k/T⌉ − 1 do
        for r = 0, . . . , 2T − 1 do
            if S ≠ A then
                t ← min (A \ S);
                broadcast t;
                S ← S ∪ {t}
            end
            receive t_1, . . . , t_s from neighbors;
            A ← A ∪ {t_1, . . . , t_s}
        end
        S ← ∅
    end
    return A
          Procedure disseminate(A, T, k)

Procedure disseminate gives an algorithm for exchanging at least T pieces
of information in n rounds when the dynamic graph is 2T-interval connected.
The procedure takes three arguments: a set of tokens A, the parameter T, and
a guess k for the size of the graph. If k ≥ n, each node is guaranteed to learn
the T smallest tokens that appeared in the input to all the nodes.
The execution of procedure disseminate is divided into ⌈k/T⌉ phases, each
consisting of 2T rounds. During each phase, each node maintains the set A of
tokens it has already learned and a set S of tokens it has already broadcast
in the current phase (initially empty). In each round of the phase, the node
broadcasts the smallest token it has not yet broadcast in the current phase,
then adds that token to S.
We refer to each iteration of the inner loop as a phase. Since a phase lasts
2T rounds and the graph is 2T-interval connected, there is some connected
subgraph that exists throughout the phase. Let G'_i be a connected subgraph
that exists throughout phase i, for i = 0, . . . , ⌈k/T⌉ − 1. We use dist_i(u, v) to
denote the distance between nodes u, v ∈ V in G'_i.
Let K_t(r) denote the set of nodes that know token t by the beginning of
round r, that is, K_t(r) = {u ∈ V | t ∈ A_u(r)}. In addition, let I be the set of
T smallest tokens in ⋃_{u∈V} A_u(0). Our goal is to show that when the protocol
terminates we have K_t(r) = V for all t ∈ I.
For a node u ∈ V, a token t ∈ P, and a phase i, we define tdist_i(u, t) to be
the distance of u from the nearest node in G'_i that knows t at the beginning of
phase i:

    tdist_i(u, t) := min {dist_i(u, v) | v ∈ K_t(2T · i)} .

Here and in the sequel, we use the convention that min ∅ := ∞. For convenience,
we use S_u^i(r) := S_u(2T · i + r) to denote the value of S_u in round r of phase
i. Similarly we denote A_u^i(r) := A_u(2T · i + r) and K_t^i(r) := K_t(2T · i + r).
Correctness hinges on the following property.

Lemma 25.9. For any node u ∈ V, token t ∈ ⋃_{v∈V} A_v(0), and round r such
that tdist_i(u, t) ≤ r ≤ 2T, either t ∈ S_u^i(r + 1) or S_u^i(r + 1) includes at least
(r − tdist_i(u, t)) tokens that are smaller than t.

Proof. By induction on r. For r = 0 the claim is immediate.
Suppose the claim holds for round r − 1 of phase i, and consider round
r ≥ tdist_i(u, t). If r = tdist_i(u, t), then r − tdist_i(u, t) = 0 and the claim
holds trivially. Thus, suppose that r > tdist_i(u, t). Hence, r − 1 ≥ tdist_i(u, t),
and the induction hypothesis applies: either t ∈ S_u^i(r) or S_u^i(r) includes at least
(r − 1 − tdist_i(u, t)) tokens that are smaller than t. In the first case we are done,
since S_u^i(r) ⊆ S_u^i(r + 1); thus, assume that t ∉ S_u^i(r), and S_u^i(r) includes at
least (r − 1 − tdist_i(u, t)) tokens smaller than t. However, if S_u^i(r) includes at
least (r − tdist_i(u, t)) tokens smaller than t, then so does S_u^i(r + 1), and the
claim is again satisfied; thus we assume that S_u^i(r) includes exactly
(r − 1 − tdist_i(u, t)) tokens smaller than t.
It is sufficient to prove that min (A_u^i(r) \ S_u^i(r)) ≤ t: if this holds, then
in round r node u broadcasts min (A_u^i(r) \ S_u^i(r)), which is either t or a
token smaller than t; thus, either t ∈ S_u^i(r + 1) or S_u^i(r + 1) includes at least
(r − tdist_i(u, t)) tokens smaller than t, and the claim holds.
First we handle the case where tdist_i(u, t) = 0. In this case, t ∈ A_u^i(0) ⊆
A_u^i(r). Since we assumed that t ∉ S_u^i(r) we have t ∈ A_u^i(r) \ S_u^i(r), which
implies that min (A_u^i(r) \ S_u^i(r)) ≤ t.
Next suppose that tdist_i(u, t) > 0. Let x ∈ K_t^i(0) be a node such
that dist_i(u, x) = tdist_i(u, t) (such a node must exist from the definition of
tdist_i(u, t)), and let v be a neighbor of u along the path from u to x in G'_i, such
that dist_i(v, x) = dist_i(u, x) − 1 < r. From the induction hypothesis, either
t ∈ S_v^i(r) or S_v^i(r) includes at least (r − 1 − tdist_i(v, t)) = (r − tdist_i(u, t))
tokens that are smaller than t. Since the edge between u and v exists throughout
phase i, node u receives everything v sends in phase i, and hence S_v^i(r) ⊆ A_u^i(r).
Finally, because we assumed that S_u^i(r) contains exactly (r − 1 − tdist_i(u, t))
tokens smaller than t, and does not include t itself, we have
min (A_u^i(r) \ S_u^i(r)) ≤ t, as desired.

Using Lemma 25.9 we can show:

Lemma 25.10. If k ≥ n, at the end of procedure disseminate the set A_u of
each node u contains the T smallest tokens.

Proof. Let N_i^d(t) := {u ∈ V | tdist_i(u, t) ≤ d} denote the set of nodes at
distance at most d from some node that knows t at the beginning of phase i, and
let t be one of the T smallest tokens.
From Lemma 25.9, for each node u ∈ N_i^T(t), either t ∈ S_u^i(2T + 1) or
S_u^i(2T + 1) contains at least 2T − T = T tokens that are smaller than t. But t
is one of the T smallest tokens, so the second case is impossible. Therefore all
nodes in N_i^T(t) know token t at the end of phase i. Because G'_i is connected we
have |N_i^T(t)| ≥ min {n − |K_t^i(0)|, T}; that is, in each phase T new nodes learn t,
until all the nodes know t. Since there are no more than k nodes and we have
⌈k/T⌉ phases, at the end of the last phase all nodes know t.

To solve counting and token dissemination with up to n tokens, we use
Procedure disseminate to speed up the k-committee election protocol from
Algorithm 2. Instead of inviting one node in each cycle, we can use disseminate
to have the leader learn the IDs of the T smallest nodes in the polling phase,
and use procedure disseminate again to extend invitations to all T smallest
nodes in the selection phase. Thus, in O(k + T) rounds we can increase the size
of the committee by T.

Theorem 25.11. It is possible to solve k-committee election in O(k + k²/T)
rounds in T-interval connected graphs. When used in conjunction with the k-
verification protocol, this approach yields O(n + n²/T)-round protocols for
counting and all-to-all token dissemination.

Remarks:

• The same result can also be achieved for the asynchronous start case,
  as long as T ≥ 2.

• The described algorithm is based on the assumption that all nodes
  know T (or that they have a common lower bound on T). At the cost
  of a log-factor, it is possible to drop this assumption and adapt to the
  actual interval-connectivity T.

• It is not known whether the bound of Theorem 25.11 is tight. It can be
  shown that it is tight for a restricted class of protocols (see exercises).

• If we make additional assumptions about the stable subgraphs that
  are guaranteed for intervals of length T, the bound in Theorem 25.11
  can be improved. E.g., if intervals of length T induce a stable k-vertex
  connected subgraph, the complexity can be improved to O(n + n²/(kT)).

Chapter Notes

See [Sch10, BW05].

Bibliography

[BW05] Regina O'Dell Bischoff and Roger Wattenhofer. Information Dissemination
  in Highly Dynamic Graphs. In 3rd ACM Joint Workshop on Foundations of
  Mobile Computing (DIALM-POMC), Cologne, Germany, September 2005.

[Sch10] Leonard J. Schulman, editor. Proceedings of the 42nd ACM Symposium
  on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8
  June 2010. ACM, 2010.
Chapter 26

Consensus

This chapter is the first to deal with fault tolerance, one of the most fundamental
aspects of distributed computing. Indeed, in contrast to a system with a single
processor, having a distributed system may permit getting away with failures
and malfunctions of parts of the system. This line of research was motivated
by the basic question whether, e.g., putting two (or three?) computers into
the cockpit of a plane will make the plane more reliable. Clearly fault-tolerance
often comes at a price, as having more than one decision-maker often complicates
decision-making.

26.1 Impossibility of Consensus

Imagine two cautious generals who want to attack a common enemy.¹ Their
only means of communication are messengers. Unfortunately, the route of these
messengers leads through hostile enemy territory, so there is a chance that a
messenger does not make it. Only if both generals attack at the very same time
can the enemy be defeated. Can we devise a protocol such that the two generals
can agree on an attack time? Clearly general A can send a message to general
B asking to e.g. "attack at 6am". However, general A cannot be sure that
this message will make it, so she asks for a confirmation. The problem is that
general B getting the message cannot be sure that her confirmation will reach
general A. If the confirmation message indeed is destroyed, general A cannot
distinguish this case from the case where general B did not even get the attack
information. So, to be safe, general B herself will ask for a confirmation of her
confirmation. Taking again the position of general A we can similarly derive
that she cannot be sure unless she also gets a confirmation of the confirmation
of the confirmation. . .
To make things worse, also different approaches do not seem to work. In
fact it can be shown that this two generals problem cannot be solved; in other
words, there is no finite protocol which lets the two generals find consensus! To
show this, we need to be a bit more formal:

¹If you don't fancy the martial tone of this classic example, feel free to think about
something else, for instance two friends trying to make plans for dinner over instant messaging
software, or two lecturers sharing the teaching load of a course trying to figure out who is in
charge of the next lecture.

Definition 26.1 (Consensus). Consider a distributed system with n nodes.
Each node i has an input x_i. A solution of the consensus problem must
guarantee the following:

• Termination: Every non-faulty node eventually decides.

• Agreement: All non-faulty nodes decide on the same value.

• Validity: The decided value must be the input of at least one node.

Remarks:

• The validity condition implies that if all nodes have the same input x,
  then the nodes need to decide on x. Please note that consensus is not
  democratic; it may well be that the nodes decide on an input value
  promoted by a small minority.

• Whether consensus is possible depends on many parameters of the
  distributed system, in particular whether the system is synchronous
  or asynchronous, or what "faulty" means. In the following we study
  some simple variants to get a feeling for the problem.

• Consensus is a powerful primitive. With established consensus almost
  everything can be computed in a distributed system, e.g. a leader.

Given a distributed asynchronous message passing system with n ≥ 2 nodes.
All nodes can communicate directly with all other nodes, simply by sending a
message. In other words, the communication graph is the complete graph. Can
the consensus problem be solved? Yes!

Algorithm 97 Trivial Consensus
1: Each node has an input
2: We have a leader, e.g. the node with the highest ID
3: if node v is the leader then
4:    the leader shall simply decide on its own input
5: else
6:    send message to the leader asking for its input
7:    wait for answer message by leader, and decide on that
8: end if

Remarks:

• This algorithm is quite simple, and at first sight seems to work perfectly,
  as all three consensus conditions of Definition 26.1 are fulfilled.

• However, the algorithm is not fault-tolerant at all. If the leader crashes
  before being able to answer all requests, there are nodes which will
  never terminate, and hence violate the termination condition. Is there
  a deterministic protocol that can achieve consensus in an asynchronous
  system, even in the presence of failures? Let's first try something
  slightly different.
Definition 26.2 (Reliable Broadcast). Consider an asynchronous distributed
system with n nodes that may crash. Any two nodes can exchange messages,
i.e., the communication graph is complete. We want node v to send a reliable
broadcast to the n − 1 other nodes. Reliable means that either nobody receives
the message, or everybody receives the message.

Remarks:

• This seems to be quite similar to consensus, right?

• The main problem is that the sender may crash while sending the
  message to the n − 1 other nodes such that some of them get the
  message, and the others not. We need a technique that deals with
  this case:

Algorithm 98 Reliable Broadcast
1: if node v is the source of message m then
2:    send message m to each of the n − 1 other nodes
3:    upon receiving m from any other node: broadcast succeeded!
4: else
5:    upon receiving message m for the first time:
6:    send message m to each of the n − 1 other nodes
7: end if

Theorem 26.3. Algorithm 98 solves reliable broadcast as in Definition 26.2.

Proof. First we should note that we do not care about nodes that crash during
the execution: whether or not they receive the message is irrelevant since they
crashed anyway. If a single non-faulty node u received the message (no matter
how, it may be that it received it through a path of crashed nodes) all non-faulty
nodes will receive the message through u. If no non-faulty node receives
the message, we are fine as well!

Remarks:

• While it is clear that we could also solve reliable broadcast by means of
  a consensus protocol (first send message, then agree on having received
  it), the opposite seems more tricky!

• No wonder, it cannot be done!! For the presentation of this impossibility
  result we use the read/write shared memory model introduced in
  Chapter 5. Not only was the proof originally conceived in the shared
  memory model, it is also cleaner.

Definition 26.4 (Univalent, Bivalent). A distributed system is called x-valent
if the outcome of a computation will be x. An x-valent system is also called
univalent. If, depending on the execution, still more than one possible outcome
is feasible, the system is called multivalent. If exactly two outcomes are still
possible, the system is called bivalent.

Theorem 26.5. In an asynchronous shared memory system with n > 1 nodes
and node crash failures (but no memory failures!), consensus as in Definition
26.1 cannot be achieved by a deterministic algorithm.

Proof. Let us simplify the proof by setting n = 2. We have processes u and v,
with input values x_u and x_v. Further let the input values be binary, either 0
or 1.
First we have to make sure that there are input values such that initially the
system is bivalent. If x_u = 0 and x_v = 0 the system is 0-valent, because
of the validity condition (Definition 26.1). Even in the case where process
v immediately crashes the system remains 0-valent. Similarly if both input
values are 1 and process u immediately crashes the system is 1-valent. If
x_u = 0 and x_v = 1 and v immediately crashes, process u cannot distinguish
this from both having input 0; equivalently, if u immediately crashes, process v
cannot distinguish this from both having input 1. Hence the system is bivalent!
In order to solve consensus an algorithm needs to terminate. All non-faulty
processes need to decide on the same value x (agreement condition of Definition
26.1), in other words, at some instant this value x must be known to the system
as a whole, meaning that no matter what the execution is, the system will be
x-valent. In other words, the system needs to change from bivalent to univalent.
We may ask ourselves what can cause this change in a deterministic asynchronous
shared memory algorithm? We need an element of non-determinism; if
everything happens deterministically the system would have been x-valent even
after initialization, which we proved to be impossible already.
The only nondeterministic elements in our model are the asynchrony of
accessing the memory and crashing processes. Initially and after every memory
access, each process decides what to do next: read or write a memory cell, or
terminate with a decision. We take control of the scheduling, either choosing
which request is served next or making a process crash. Now we hope for a
critical bivalent state with more than one memory request, and depending which
memory request is served next the system is going to switch from bivalent to
univalent. More concretely, if process u is being served next the system is going
x-valent, if process v (with v ≠ u) is served next the system is going y-valent
(with y ≠ x). We have several cases:

• If the operations of processes u and v target different memory cells,
  processes cannot distinguish which memory request was executed first.
  Hence the local states of the processes are identical after serving both
  operations and the state cannot be critical.

• The same argument holds if both processes want to read the same register.
  Nobody can distinguish which read was first, and the state cannot be
  critical.

• If process u reads memory cell c, and process v writes memory cell c,
  the scheduler first executes u's read. Now process v cannot distinguish
  whether that read of u did or did not happen before its write. If it did
  happen, v should decide on x, if it did not happen, v should decide y. But
  since v does not know which one is true, it needs to be informed about
  it by u. We prevent this by making u crash. Thus the state can only be
  univalent if v never decides, violating the termination condition!
• Also if both processes write the same memory cell we have the same issue,
  since the second writer will immediately overwrite the first writer, and
  hence the second writer cannot know whether the first write happened at
  all. Again, the state cannot be critical.

Hence, if we are unlucky (and in a worst case, we are!) there is no critical
state. In other words, the system will remain bivalent forever, and consensus is
impossible.

Remarks:

• The proof presented is a variant of a proof by Michael Fischer, Nancy
  Lynch and Michael Paterson, a classic result in distributed computing.
  The proof was motivated by the problem of committing transactions in
  distributed database systems, but is sufficiently general that it directly
  implies the impossibility of a number of related problems, including
  consensus. The proof also is pretty robust with regard to different
  communication models.

• The FLP (Fischer, Lynch, Paterson) paper won the 2001 PODC
  Influential Paper Award, which later was renamed Dijkstra Prize.

• One might argue that FLP destroys all the fun in distributed computing,
  as it makes so many things impossible! For instance, it seems
  impossible to have a distributed database where the nodes can reach
  consensus whether to commit a transaction or not.

• So are two-phase-commit (2PC), three-phase-commit (3PC) et al.
  wrong?! No, not really, but sometimes they just do not commit!

• What about turning some other knobs of the model? Can we have
  consensus in a message passing system? No. Can we have consensus
  in synchronous systems? Yes, even if all but one node fails!

• Can we have consensus in synchronous systems even if some nodes
  are mischievous, and behave much worse than simply crashing, and
  send for example contradicting information to different nodes? This is
  known as Byzantine behavior. Yes, this is also possible, as long as the
  Byzantine nodes are strictly less than a third of all the nodes. This
  was shown by Marshall Pease, Robert Shostak, and Leslie Lamport
  in 1980. Their work won the 2005 Dijkstra Prize, and is one of the
  cornerstones not only in distributed computing but also information
  security. Indeed this work was motivated by the "fault-tolerance in
  planes" example. Pease, Shostak, and Lamport noticed that the
  computers they were given to implement a fault-tolerant fighter plane at
  times behaved strangely. Before crashing, these computers would start
  behaving quite randomly, sending out weird messages. At some point
  Pease et al. decided that a malicious behavior model would be the
  most appropriate to be on the safe side. Being able to allow strictly
  less than a third Byzantine nodes is quite counterintuitive; even today
  many systems are built with three copies. In light of the result
  of Pease et al. this is a serious mistake! If you want to be tolerant
  against a single Byzantine machine, you need four copies, not three!

• Finally, FLP only prohibits deterministic algorithms! So can we solve
  consensus if we use randomization? The answer again is yes! We will
  study this in the remainder of this chapter.

26.2 Randomized Consensus

Can we solve consensus if we allow randomization? Yes. The following algorithm
solves consensus even in the face of Byzantine errors, i.e., malicious behavior of
some of the nodes. To simplify arguments we assume that at most f nodes will
fail (crash) with n > 9f, and that we only solve binary consensus, that is, the
input values are 0 and 1. The general idea is that nodes try to push their input
value; if other nodes do not follow they will try to push one of the suggested
values randomly. The full algorithm is in Algorithm 99.

Algorithm 99 Randomized Consensus
 1: node u starts with input bit x_u ∈ {0, 1}, round := 1.
 2: broadcast BID(x_u, round)
 3: repeat
 4:    wait for n − f BID messages of current round
 5:    if at least n − f messages have value x then
 6:       x_u := x; decide on x
 7:    else if at least n − 2f messages have value x then
 8:       x_u := x
 9:    else
10:       choose x_u randomly, with Pr[x_u = 0] = Pr[x_u = 1] = 1/2
11:    end if
12:    round := round + 1
13:    broadcast BID(x_u, round)
14: until decided

Theorem 26.6. Algorithm 99 solves consensus as in Definition 26.1 even if up
to f < n/9 nodes exhibit Byzantine failures.

Proof. First note that it is not a problem to wait for n − f BID messages in
line 4 since at most f nodes are corrupt. If all nodes have the same input value
x, then all (except the f Byzantine nodes) will bid for the same value x. Thus,
every node receives at least n − 2f BID messages containing x, deciding on x
in the first round already. We have consensus!
If the nodes have different (binary) input values the validity condition
becomes trivial as any result is fine. What about agreement? Let u be one of
the first nodes to decide on value x (in line 6). It may happen that due to
asynchronicity another node v received messages from a different subset of the
nodes, however, at most f senders may be different. Taking into account that
Byzantine nodes may lie, i.e., send different BIDs to different nodes, f additional
BID messages received by v may differ from those received by u. Since
node u had at least n − 2f BID messages with value x, node v has at least
n − 4f BID messages with x. Hence every correct node will bid for x in the
next round, and then decide on x.
So we only need to worry about termination! We already have seen that
as soon as one correct node terminates (in line 6) everybody terminates in the
next round. So what are the chances that some node u terminates in line 6?
Well, if push comes to shove we can still hope that all correct nodes randomly
propose the same value (in line 10). Maybe there are some nodes not choosing
at random (i.e., entering line 8), but they unanimously propose either 0 or 1:
For the sake of contradiction, assume that both 0 and 1 are proposed in line
8. This means that both 0 and 1 had been proposed by at least n − 5f correct
nodes. In other words, we have a total of 2(n − 5f) + f = n + (n − 9f) > n
nodes. Contradiction!
Thus, at worst all n − f correct nodes need to randomly choose the same bit,
which happens with probability 2^−(n−f). If so, all will send the same BID, and
the algorithm terminates. So the expected running time is smaller than 2^n.

Remarks:

• The presentation of Algorithm 99 is a simplification of the typical
  presentation in text books.

• What about an algorithm that allows for crashes only, but can manage
  more failures? Good news! Slightly changing the presented algorithm
  will do that for f < n/4! See exercises.

• Unfortunately Algorithm 99 is still impractical as termination is awfully
  slow. In expectation about the same number of nodes choose 1
  or 0 in line 10. Termination would be much more efficient if all nodes
  chose the same random value in line 10! So why not simply replace
  line 10 with "choose x_u := 1"?!? The problem is that a majority
  of nodes may see a majority of 0 bids, hence proposing 0 in the
  next round. Without randomization it is impossible to get out of this
  equilibrium. (Moreover, this approach is deterministic, contradicting
  Theorem 26.5.)

• The idea is to replace line 10 with a subroutine where all nodes compute
  a so-called shared (or common, or global) coin. A shared coin
  is a random variable that is 0 with constant probability and 1 with
  constant probability. Sounds like magic, but it isn't! We assume at
  most f < n/3 nodes may crash:

Algorithm 100 Shared Coin (code for node u)
 1: set local coin x_u := 0 with probability 1/n, else x_u := 1
 2: use reliable broadcast to tell everybody about your local coin x_u
 3: memorize all coins you get from others in the set c_u
 4: wait for exactly n − f coins
 5: copy these coins into your local set s_u (but keep learning coins)
 6: use reliable broadcast to tell everybody about your set s_u
 7: wait for exactly n − f sets s_v (which satisfy s_v ⊆ c_u)
 8: if seen at least a single coin 0 then
 9:    return 0
10: else
11:    return 1
12: end if

Theorem 26.7. If f < n/3 nodes crash, Algorithm 100 implements a shared
coin.

Proof. Since only f nodes may crash, each node sees at least n − f coins and
sets in lines 4 and 7, respectively. Thanks to the reliable broadcast protocol
each node eventually sees all the coins in the other sets. In other words, the
algorithm terminates in O(1) time.
The general idea is that a third of the coins are being seen by everybody. If
there is a 0 among these coins, everybody will see that 0. If not, chances are
high that there is no 0 at all! Here are the details:
Let u be the first node to terminate (satisfy line 7). For u we draw a matrix
of all the seen sets s_v (columns) and all coins c_u seen by node u (rows). Here is
an example with n = 7, f = 2, n − f = 5:

         s1  s3  s5  s6  s7
    c1    X   X   X   X   X
    c2    X   X   X
    c3    X   X   X   X   X
    c5    X   X   X   X
    c6    X   X   X   X
    c7    X   X   X   X

Note that there are exactly (n − f)² X's in this matrix as node u has seen
exactly n − f sets (line 7) each having exactly n − f coins (lines 4 to 6). We
need two little helper lemmas:

Lemma 26.8. There are at least f + 1 rows that have at least f + 1 X's.

Proof. Assume (for the sake of contradiction) that this is not the case. Then
at most f rows have all n − f X's, and all other rows (at most n − f) have at
most f X's. In other words, the number of total X's is bounded by

    |X| ≤ f · (n − f) + (n − f) · f = 2f(n − f).

Using n > 3f we get n − f > 2f, and hence |X| ≤ 2f(n − f) < (n − f)². This
is a contradiction to having exactly (n − f)² X's!

Lemma 26.9. Let W be the set of local coins for which the corresponding matrix
row has more than f X's. All local coins in the set W are seen by all nodes that
terminate.

Proof. Let w ∈ W be such a local coin. By definition of W we know that w is
in at least f + 1 seen sets. Since each node must see at least n − f seen sets
before terminating, each node has seen at least one of these sets, and hence w
is seen by everybody terminating.
Continuing the proof of Theorem 26.7: With probability (1 − 1/n)^n ≈ 1/e ≈ 0.37
all nodes chose their local coin equal to 1, and 1 is decided. With probability
1 − (1 − 1/n)^|W| there is at least one 0 in W. With Lemma 26.8 we know that
|W| ≈ n/3, hence this probability is about 1 − (1 − 1/n)^(n/3) ≈ 1 − (1/e)^(1/3) ≈ 0.28.
With Lemma 26.9 this 0 is seen by all, and hence everybody will decide 0. So
indeed we have a shared coin.
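The two constants above are easy to verify numerically; the following throwaway snippet (an illustration, not part of the script) evaluates both expressions for a large n:

```python
import math

# Sanity check of the probabilities in the proof of Theorem 26.7.
n = 10**6
p_all_ones = (1 - 1/n) ** n              # all n local coins equal 1
p_zero_in_w = 1 - (1 - 1/n) ** (n / 3)   # at least one 0 among the ~n/3 coins in W

print(round(p_all_ones, 2))   # ~ 1/e       ~ 0.37
print(round(p_zero_in_w, 2))  # ~ 1 - (1/e)^(1/3) ~ 0.28
```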
Theorem 26.10. Plugging Algorithm 100 into Algorithm 99, we get a randomized
consensus algorithm which finishes in a constant expected number of rounds.
Remarks:
• If some nodes go into line 8 of Algorithm 99 the others still have a
constant probability to guess the same shared coin.
• For crash failures there exists an improved constant expected time
algorithm which tolerates f failures with 2f < n.

• For Byzantine failures there exists a constant expected time algorithm
which tolerates f failures with 3f < n.
• Similar algorithms have been proposed for the shared memory model.

Chapter Notes
See [Lam82, FLP85, PLS83, Sim88].

Bibliography
[FLP85] Michael J. Fischer, Nancy A. Lynch, and Mike Paterson. Impossibility
of Distributed Consensus with One Faulty Process. J. ACM, 32(2):374–
382, 1985.
[Lam82] Leslie Lamport. The Byzantine generals problem. ACM Transactions on
Programming Languages and Systems, 4(3):382–401, 1982.
[PLS83] Robert L. Probert, Nancy A. Lynch, and Nicola Santoro, editors. Pro-
ceedings of the Second Annual ACM SIGACT-SIGOPS Symposium on
Principles of Distributed Computing, Montreal, Quebec, Canada, Au-
gust 17-19, 1983. ACM, 1983.

[Sim88] Janos Simon, editor. Proceedings of the 20th Annual ACM Sympo-
sium on Theory of Computing, May 2-4, 1988, Chicago, Illinois, USA.
ACM, 1988.
Chapter 27

Multi-Core Computing

This chapter is based on the article "Distributed Computing and the Multicore
Revolution" by Maurice Herlihy and Victor Luchangco. Thanks!

27.1 Introduction

In the near future, nearly all computers, ranging from supercomputers to cell
phones, will be multiprocessors. It is harder and harder to increase processor
clock speed (the chips overheat), but easier and easier to cram more processor
cores onto a chip (thanks to Moore's Law). As a result, uniprocessors are
giving way to dual-cores, dual-cores to quad-cores, and so on.
However, there is a problem: Except for "embarrassingly parallel"
applications, no one really knows how to exploit lots of cores.

27.1.1 The Current State of Concurrent Programming

In today's programming practice, programmers typically rely on combinations
of locks and conditions, such as monitors, to prevent concurrent access by
different threads to the same shared data. While this approach allows
programmers to treat sections of code as "atomic", and thus simplifies
reasoning about interactions, it suffers from a number of severe shortcomings.

• Programmers must decide between coarse-grained locking, in which a large
data structure is protected by a single lock (usually implemented using
operations such as test-and-set or compare-and-swap (CAS)), and fine-grained
locking, in which a lock is associated with each component of the data
structure. Coarse-grained locking is simple, but permits little or no
concurrency, thereby preventing the program from exploiting multiple
processing cores. By contrast, fine-grained locking is substantially more
complicated because of the need to ensure that threads acquire all necessary
locks (and only those, for good performance), and because of the need to
avoid deadlocks when acquiring multiple locks. The decision is further
complicated by the fact that the best engineering solution may be
platform-dependent, varying with different machine sizes, workloads, and so
on, making it difficult to write code that is both scalable and portable.

• Conventional locking provides poor support for code composition and reuse.
For example, consider a lock-based hash table that provides atomic insert
and delete methods. Ideally, it should be easy to move an element atomically
from one table to another, but this kind of composition simply does not
work. If the table methods synchronize internally, then there is no way to
acquire and hold both locks simultaneously. If the tables export their
locks, then modularity and safety are compromised. For a concrete example,
assume we have two hash tables T1 and T2 storing integers and using internal
locks only. Every number is only inserted into a table if it is not already
present, i.e., multiple occurrences are not permitted. We want to atomically
move elements between the tables using two threads running Algorithm Move.
If we have external locks, we must pay attention to avoid deadlocks etc.

Algorithm Move(Element e, Table from, Table to)
1: if from.find(e) then
2:   to.insert(e)
3:   from.delete(e)
4: end if

Initially, table T1 contains 1 and T2 is empty.

Time  Thread 1: Move(1,T1,T2)   Thread 2: Move(1,T2,T1)
 1    T1.find(1)                delayed
 2    T2.insert(1)
 3    delayed                   T2.find(1)
 4                              T1.insert(1)
 5    T1.delete(1)              T2.delete(1)

Afterwards, both T1 and T2 are empty.

• Such basic issues as the mapping from locks to data, that is, which locks
protect which data, and the order in which locks must be acquired and
released, are all based on convention, and violations are notoriously
difficult to detect and debug. For these and other reasons, today's software
practices make lock-based concurrent programs (too) difficult to develop,
debug, understand, and maintain.

The research community has addressed this issue for more than fifteen years
by developing nonblocking algorithms for stacks, queues and other data
structures. These algorithms are subtle and difficult. For example, the
pseudo code of a delete operation for a (non-blocking) linked list, recently
presented at a conference, contains more than 30 lines of code, whereas a
delete procedure for a (non-concurrent, used only by one thread) linked list
can be written with 5 lines of code.
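The bad schedule from the Move example can be replayed deterministically. A small sketch (Python sets standing in for the two hash tables, with the interleaving hard-coded step by step) shows the element being lost:

```python
# Replay the bad schedule: two "hash tables" (Python sets as stand-ins)
# and the two Move calls interleaved exactly as in the table above.
T1, T2 = {1}, set()    # initially T1 contains 1, T2 is empty

t1_found = 1 in T1     # time 1: Thread 1 runs T1.find(1); Thread 2 delayed
if t1_found:
    T2.add(1)          # time 2: Thread 1 runs T2.insert(1)
t2_found = 1 in T2     # time 3: Thread 1 delayed; Thread 2 runs T2.find(1)
if t2_found:
    T1.add(1)          # time 4: Thread 2 runs T1.insert(1)
if t1_found:
    T1.discard(1)      # time 5: Thread 1 runs T1.delete(1)
if t2_found:
    T2.discard(1)      # time 5: Thread 2 runs T2.delete(1)

print(T1, T2)          # set() set(): the element 1 has vanished
```

Each individual table operation is atomic here, yet the composed Move operations are not, which is exactly the composition problem described above.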
27.2 Transactional Memory

Recently the transactional memory programming paradigm has gained momentum
as an alternative to locks in concurrent programming. Rather than using locks
to give the illusion of atomicity by preventing concurrent access to shared
data, with transactional memory programmers designate regions of code as
transactions, and the system guarantees that such code appears to execute
atomically. A transaction that cannot complete is aborted—its effects are
discarded—and may be retried. Transactions have been used to build large,
complex and reliable database systems for over thirty years; with
transactional memory, researchers hope to translate that success to
multiprocessor systems. The underlying system may use locks or nonblocking
algorithms to implement transactions, but the complexity is hidden from the
application programmer. Proposals exist for implementing transactional memory
in hardware, in software, and in schemes that mix hardware and software. This
area is growing at a fast pace.
More formally, a transaction is defined as follows:

Definition 27.1. A transaction in transactional memory is characterized by
three properties (ACI):

• Atomicity: Either a transaction finishes all its operations or no operation
has an effect on the system.

• Consistency: All objects are in a valid state before and after the
transaction.

• Isolation: Other transactions cannot access or see data in an intermediate
(possibly invalid) state of any parallel running transaction.

Remarks:

• For database transactions there exists a fourth property called durability:
If a transaction has completed, its changes are permanent, i.e., even if the
system crashes, the changes can be recovered. In principle, it would be
feasible to demand the same thing for transactional memory, however this
would mean that we had to use slow hard discs instead of fast DRAM chips...

• Although transactional memory is a promising approach for concurrent
programming, it is not a panacea, and in any case, transactional programs
will need to interact with other (legacy) code, which may use locks or other
means to control concurrency.

• One major challenge for the adoption of transactional memory is that it has
no universally accepted specification. It is not clear yet how interactions
with I/O and system calls should be dealt with. For instance, imagine you
print a news article. The printer job is part of a transaction. After
printing half the page, the transaction gets aborted. Thus the work
(printing) is lost. Clearly, this behavior is not acceptable.

• From a theory perspective we also face a number of open problems. For
example:

  – System model: An abstract model for a (shared-memory) multiprocessor is
  needed that properly accounts for performance. In the 80s, the PRAM model
  became a standard model for parallel computation, and the research
  community developed many elegant parallel algorithms for this model.
  Unfortunately, PRAM assumes that processors are synchronous, and that
  memory can be accessed only by read and write operations. Modern computer
  architectures are asynchronous and they provide additional operations such
  as test-and-set. Also, PRAM did not model the effects of contention nor
  the performance implications of multilevel caching, assuming instead a
  flat memory with uniform-cost access. More realistic models have been
  proposed to account for the costs of interprocess communication, but these
  models still assume synchronous processors with only read and write access
  to memory.

  – How to resolve conflicts? Many transactional memory implementations
  "optimistically" execute transactions in parallel. Conflicts between two
  transactions intending to modify the same memory at the same time are
  resolved by a contention manager. A contention manager decides whether a
  transaction continues, waits or is aborted. The contention management
  policy of a transactional memory implementation can have a profound effect
  on its performance, and even its progress guarantees.

27.3 Contention Management

After the previous introduction of transactional memory, we look at different
aspects of contention management from a theoretical perspective. We start
with a description of the model.
We are given a set of transactions S := {T1, ..., Tn} sharing up to s
resources (such as memory cells) that are executed on n threads. Each thread
runs on a separate processor/core P1, ..., Pn. For simplicity, each
transaction T consists of a sequence of tT operations. An operation requires
one time unit and can be a write access of a resource R or some arbitrary
computation.¹ To perform a write, the written resource must be acquired
exclusively (i.e., locked) before the access. Additionally, a transaction
must store the original value of a written resource. Only one transaction can
lock a resource at a time. If a transaction A attempts to acquire a resource
locked by B, then A and B face a conflict. If multiple transactions
concurrently attempt to acquire an unlocked resource, an arbitrary
transaction A will get the resource and the others face a conflict with A. A
contention manager decides how to resolve a conflict. Contention managers
operate in a distributed fashion, that is to say, a separate instance of a
contention manager is available for every thread and they operate
independently. Contention managers can make a transaction wait (arbitrarily
long) or abort. An aborted transaction undoes all its changes to resources
and frees all locks before restarting. Freeing locks and undoing the changes
can be done with one operation. A successful transaction finishes with a
commit and simply frees all locks.

¹ Reads are of course also possible, but are not critical because they do not
attempt to modify data.
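The model of a transaction used here — exclusive locks per resource, storing the original value of every written resource, and aborts that undo all changes before releasing the locks — can be sketched as follows (class and method names are illustrative, and the sketch is single-threaded, not a real transactional-memory implementation):

```python
# A toy version of the model: a transaction acquires exclusive locks,
# logs original values, and an abort rolls every write back before
# releasing the locks.

class Resource:
    def __init__(self, value=0):
        self.value = value
        self.owner = None          # transaction holding the exclusive lock

class Transaction:
    def __init__(self, name):
        self.name = name
        self.undo_log = []         # (resource, original value) pairs
        self.locked = []

    def write(self, resource, value):
        if resource.owner not in (None, self):
            raise RuntimeError("conflict")  # a contention manager would decide
        if resource.owner is None:
            resource.owner = self           # acquire the exclusive lock
            self.locked.append(resource)
            self.undo_log.append((resource, resource.value))
        resource.value = value

    def abort(self):               # undo all changes, then free all locks
        for resource, original in reversed(self.undo_log):
            resource.value = original
        self._release()

    def commit(self):              # a commit simply frees all locks
        self._release()

    def _release(self):
        for resource in self.locked:
            resource.owner = None
        self.undo_log.clear()
        self.locked.clear()

r = Resource(1)
t = Transaction("T1")
t.write(r, 2)
t.abort()
print(r.value)   # 1: the aborted write left no trace
```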


A contention manager is unaware of (potential) future conflicts of a
transaction. The required resources might also change at any time.
The quality of a contention manager is characterized by different properties:

• Throughput: How long does it take until all transactions have committed?
How good is our algorithm compared to an optimal?

Definition 27.2. The makespan of the set S of transactions is the time
interval from the start of the first transaction until all transactions have
committed.

Definition 27.3. The competitive ratio is the ratio of the makespans of the
algorithm to analyze and an optimal algorithm.

• Progress guarantees: Is the system deadlock-free? Does every transaction
commit in finite time?

Definition 27.4. We look at three levels of progress guarantees:

– wait freedom (strongest guarantee): all threads make progress in a finite
number of steps

– lock freedom: one thread makes progress in a finite number of steps

– obstruction freedom (weakest): one thread makes progress in a finite number
of steps in absence of contention (no other threads compete for the same
resources)

Remarks:

• For the analysis we assume an oblivious adversary. It knows the algorithm
to analyze and chooses/modifies the operations of transactions arbitrarily.
However, the adversary does not know the random choices (of a randomized
algorithm). The optimal algorithm knows all decisions of the adversary,
i.e., first the adversary must say what the transactions look like and then
the optimal algorithm, having full knowledge of all transactions, computes
an (optimal) schedule.

• Wait freedom implies lock freedom. Lock freedom implies obstruction
freedom.

• Here is an example to illustrate how needed resources change over time:
Consider a dynamic data structure such as a balanced tree. If a transaction
attempts to insert an element, it must modify a (parent) node and maybe it
also has to do some rotations to rebalance the tree. Depending on the
elements of the tree, which change over time, it might modify different
objects. For a concrete example, assume that the root node of a binary tree
has value 4 and the root has a (left) child of value 2. If a transaction A
inserts value 5, it must modify the pointer to the right child of the root
node with value 4. Thus it locks the root node. If A gets aborted by a
transaction B, which deletes the node with value 4 and commits, it will
attempt to lock the new root node with value 2 after its restart.

• There are also systems where resources are not locked exclusively. All we
need is a correct serialization (analogous to transactions in database
systems). Thus a transaction might speculatively use the current value of a
resource, modified by an uncommitted transaction. However, these systems
must track dependencies to ensure the ACI properties of a transaction (see
Definition 27.1). For instance, assume a transaction T1 increments variable
x from 1 to 2. Then transaction T2 might access x and assume its correct
value is 2. If T1 commits everything is fine and the ACI properties are
ensured, but if T1 aborts, T2 must abort too, since otherwise the atomicity
property was violated.

• In practice, the number of concurrent transactions might be much larger
than the number of processors. However, performance may decrease with an
increasing number of threads since there is time wasted to switch between
threads. Thus, in practice, load adaption schemes have been suggested to
limit the number of concurrent transactions close to (or even below) the
number of cores.

• In the analysis, we will assume that the number of operations is fixed for
each transaction. However, the execution time of a transaction (in the
absence of contention) might also change, e.g., if data structures shrink,
fewer elements have to be considered. Nevertheless, often the changes are
not substantial, i.e., only involve a constant factor. Furthermore, if an
adversary can modify the duration of a transaction arbitrarily during the
execution of a transaction, then any algorithm must make the exact same
choices as an optimal algorithm: Assume two transactions T0 and T1 face a
conflict and an algorithm Alg decides to let T0 wait (or abort). The
adversary could make the opposite decision and let T0 proceed such that it
commits at time t0. Then it sets the execution time of T0 to infinity, i.e.,
tT0 = ∞ after t0. Thus, the makespan of the schedule for algorithm Alg is
unbounded though there exists a schedule with bounded makespan. Thus the
competitive ratio is unbounded.

Problem complexity

In graph theory, coloring a graph with as few colors as possible is known to
be a hard problem. A (vertex) coloring assigns a color to each vertex of a
graph such that no two adjacent vertices share the same color. It was shown
that computing an optimal coloring given complete knowledge of the graph is
NP-hard. Even worse, computing an approximation within a factor of
χ(G)^(log χ(G)/25), where χ(G) is the minimal number of colors needed to
color the graph, is NP-hard as well.
To keep things simple, we assume for the following theorem that resource
acquisition takes no time, i.e., as long as there are no conflicts,
transactions get all locks they wish for at once. In this case, there is an
immediate connection to graph coloring, showing that even an offline version
of contention management, where all potential conflicts are known and do not
change over time, is extremely hard to solve.
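The connection to coloring can be made concrete: interpret nodes as transactions and edges as shared resources; any proper coloring then yields a schedule that runs one color class per round, so the makespan equals the number of colors used. A sketch (using a simple greedy coloring, which is not necessarily optimal):

```python
# Schedule transactions (= nodes) of a conflict graph (edges = shared
# resources) via coloring: transactions of one color run in parallel,
# the color classes run sequentially.
def color_schedule(adj):
    """adj: dict node -> set of neighbors. Returns {node: round}."""
    rounds = {}
    for v in sorted(adj):                      # greedy coloring
        taken = {rounds[u] for u in adj[v] if u in rounds}
        rounds[v] = next(r for r in range(len(adj)) if r not in taken)
    return rounds

# A triangle plus a pendant node: chromatic number 3, hence makespan 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
schedule = color_schedule(adj)
makespan = max(schedule.values()) + 1
print(schedule, makespan)   # {0: 0, 1: 1, 2: 2, 3: 0} 3
```

No two conflicting transactions share a round, so every round is conflict-free, mirroring the "translation" used in the proof below.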
Theorem 27.5. If the optimal schedule has makespan k and resource acquisition
takes zero time, it is NP-hard to compute a schedule of makespan less than
k^((log k)/25), even if all conflicts are known and transactions do not
change their resource requirements.

Proof. We will prove the claim by showing that any algorithm finding a
schedule taking k′ < k^((log k)/25) can be utilized to approximate the
chromatic number of any graph better than χ(G)^((log χ(G))/25).
Given the graph G = (V, E), define that V is the set of transactions and E is
the set of resources. Each transaction (node) v ∈ V needs to acquire a lock
on all its resources (edges) {v, w} ∈ E, and then computes something for
exactly one round. Obviously, this "translation" of a graph into our
scheduling problem does not require any computation at all.
Now, if we knew a χ(G)-coloring of G, we could simply use the fact that the
nodes sharing one color form an independent set and execute all transactions
of a single color in parallel and the colors sequentially. Since no two
neighbors are in an independent set and resources are edges, all conflicts
are resolved. Consequently, the makespan k is at most χ(G).
On the other hand, the makespan k must be at least χ(G): Since each
transaction (i.e., node) locks all required resources (i.e., adjacent edges)
for at least one round, no schedule could do better than serve a (maximum)
independent set in parallel while all other transactions wait. However, by
definition of the chromatic number χ(G), V cannot be split into less than
χ(G) independent sets, meaning that k ≥ χ(G). Therefore k = χ(G).
In other words, if we could compute a schedule using k′ < k^((log k)/25)
rounds in polynomial time, we knew that

χ(G) = k ≤ k′ < k^((log k)/25) = χ(G)^((log χ(G))/25).

Remarks:

• The theorem holds for a central contention manager, knowing all
transactions and all potential conflicts. Clearly, the online problem, where
conflicts remain unknown until they occur, is even harder. Furthermore, the
distributed nature of contention managers also makes the problem even more
difficult.

• If resource acquisition does not take zero time, the connection between
the problems is not a direct equivalence. However, the same proof technique
shows that it is NP-hard to compute a polynomial approximation, i.e.,
k′ ≤ k^c for some constant c ≥ 1.

Deterministic contention managers

Theorem 27.5 showed that even if all conflicts are known, one cannot produce
schedules whose makespan gets close to the optimal without a lot of
computation. However, we target to construct contention managers that make
their decisions quickly without knowing conflicts in advance. Let us look at
a couple of contention managers and investigate their throughput and progress
guarantees.

• A first naive contention manager: Be aggressive! Always abort the
transaction having locked the resource. Analysis: The throughput might be
zero, since a livelock is possible. But the system is still obstruction free.
Consider two transactions consisting of three operations. The first operation
of both is a write to the same resource R. If they start concurrently, they
will abort each other infinitely often.

• A smarter contention manager: Approximate the work done. Assume before a
start (also before a restart after an abort) a transaction gets a unique
timestamp. The older transaction, which is believed to have already performed
more work, should win the conflict.
Analysis: Clearly, the oldest transaction will always run until commit
without interruption. Thus we have lock-freedom, since at least one
transaction makes progress at any time. In other words, at least one
processor is always busy executing a transaction until its commit. Thus, the
bound says that all transactions are executed sequentially. How about the
competitive ratio? We have s resources and n transactions starting at the
same time. For simplicity, assume every transaction Ti needs to lock at least
one resource for a constant fraction c of its execution time tTi. Thus, at
most s transactions can run concurrently from start until commit without
(possibly) facing a conflict (if s + 1 transactions run at the same time, at
least two of them lock the same resource). Thus, the makespan of an optimal
contention manager is at least Σ_{i=0}^{n} c · tTi / s. The makespan of our
timestamping algorithm is at most the duration of a sequential execution,
i.e., the sum of the lengths of all transactions: Σ_{i=0}^{n} tTi. The
competitive ratio is:

(Σ_{i=0}^{n} tTi) / (Σ_{i=0}^{n} c · tTi / s) = s/c = O(s).

Remarks:

– Unfortunately, in most relevant cases the number of resources is larger
than the number of cores, i.e., s > n. Thus, our timestamping algorithm only
guarantees sequential execution, whereas the optimal might execute all
transactions in parallel.

Are there contention managers that guarantee more than sequential execution,
if a lot of parallelism is possible? If we have a powerful adversary, that
can change the required resources after an abort, the analysis is tight.
Though we restrict to deterministic algorithms here, the theorem also holds
for randomized contention managers.

Theorem 27.6. Suppose n transactions start at the same time and the adversary
is allowed to alter the resource requirement of any transaction (only) after
an abort, then the competitive ratio of any deterministic contention manager
is Ω(n).

Proof. Assume we have n resources. Suppose all transactions consist of two
operations, such that conflicts arise, which force the contention manager to
abort one of the two transactions T2i−1, T2i for every i < n/2. More
precisely, transaction T2i−1 writes to resource R2i−1 and to R2i afterwards.
Transaction T2i writes to resource R2i and to R2i−1 afterwards. Clearly, any
contention manager has to abort n/2 transactions. Now the adversary tells
each transaction which did not finish to adjust its resource requirements and
write to resource R0 as their first operation. Thus, for any deterministic
contention manager the n/2 aborted transactions must execute sequentially and
the makespan of the algorithm becomes Ω(n).
The optimal strategy first schedules all transactions that were aborted and
in turn aborts the others. Since the now aborted transactions do not change
their resource requirements, they can be scheduled in parallel. Hence the
optimal makespan is 4, yielding a competitive ratio of Ω(n).

Remarks:

• The proof can be generalized to show that the ratio is Ω(s) if s resources
are present, matching the previous upper bound.

• But what if the adversary is not so powerful, i.e., a transaction has a
fixed set of needed resources?
The analysis of algorithm timestamp is still tight. Consider the dining
philosophers problem: Suppose all transactions have length n and transaction
i requires its first resource Ri at time 1 and its second Ri+1 (except Tn,
which only needs Rn) at time n − i. Thus, each transaction Ti potentially
conflicts with transaction Ti−1 and transaction Ti+1. Let transaction i have
the ith oldest timestamp. At time n − i transaction i + 1 with i ≥ 1 will get
aborted by transaction i and only transaction 1 will commit at time n. After
every abort transaction i restarts 1 time unit before transaction i − 1.
Since transaction i − 1 acquires its second resource i − 1 time units before
its termination, transaction i − 1 will abort transaction i at least i − 1
times. After i − 1 aborts transaction i may commit. The total time until the
algorithm is done is bounded by the time transaction n stays in the system,
i.e., at least Σ_{i=1}^{n} (n − i) = Ω(n²). An optimal schedule requires
only O(n) time: First schedule all transactions with even indices, then the
ones with odd indices.

• Let us try to approximate the work done differently. The transaction which
has performed more work should win the conflict. A transaction counts the
number of accessed resources, starting from 0 after every restart. The
transaction which has acquired more resources wins the conflict. In case
both have accessed the same number of resources, the transaction having
locked the resource may proceed and the other has to wait.
Analysis: A deadlock is possible: Transactions A and B start concurrently.
Transaction A writes to R1 as its first operation and to R2 as its second
operation. Transaction B writes to the resources in opposite order.

Randomized contention managers

Though the lower bound of the previous section (Theorem 27.6) is valid for
both deterministic and randomized schemes, let us look at a randomized
approach: Each transaction chooses a random priority in [1, n]. In case of a
conflict, the transaction with lower priority gets aborted. (If both
conflicting transactions have the same priority, both abort.)
Additionally, if a transaction A was aborted by transaction B, it waits until
transaction B committed or aborted, then transaction A restarts and draws a
new priority.
Analysis: Assume the adversary cannot change the resource requirements,
otherwise we cannot show more than a competitive ratio of n, see Theorem
27.6. Assume that if two transactions A and B (potentially) conflict (i.e.,
write to the same resource), then they require the resource for at least a
fraction c of their running time. We assume a transaction T potentially
conflicts with dT other transactions. Therefore, if a transaction has highest
priority among these dT transactions, it will abort all others and commit
successfully. The chance that for a transaction T no conflicting transaction
chooses the same random number is (1 − 1/n)^dT > (1 − 1/n)^n ≈ 1/e. The
chance that a transaction chooses the largest random number and no other
transaction chose this number is thus at least 1/dT · 1/e. Thus, for any
constant c ≥ 1, after choosing e · dT · c · ln n random numbers the chance
that transaction T has committed successfully is

1 − (1 − 1/(e · dT))^(e · dT · c · ln n) ≈ 1 − e^(−c ln n) = 1 − 1/n^c.

Assuming that the longest transaction takes time tmax, within that time a
transaction either commits or aborts and chooses a new random number. The
time to choose e · dT · c · ln n numbers is thus at most
e · tmax · dT · c · ln n = O(tmax · dT · ln n). Therefore, with high
probability each transaction makes progress within a finite amount of time,
i.e., our algorithm ensures wait freedom. Furthermore, the competitive ratio
of our randomized contention manager for the previously considered dining
philosophers problem is w.h.p. only O(ln n), since any transaction only
conflicts with two other transactions.

Chapter Notes

See [GWK09, Att08, SW09, HSW10].

Bibliography

[Att08] Hagit Attiya. Needed: foundations for transactional memory. SIGACT
News, 39(1):59–61, 2008.

[GWK09] Vijay K. Garg, Roger Wattenhofer, and Kishore Kothapalli, editors.
Distributed Computing and Networking, 10th International Conference, ICDCN
2009, Hyderabad, India, January 3-6, 2009. Proceedings, volume 5408 of
Lecture Notes in Computer Science. Springer, 2009.

[HSW10] David Hasenfratz, Johannes Schneider, and Roger Wattenhofer.
Transactional Memory: How to Perform Load Adaption in a Simple And
Distributed Manner. In The 2010 International Conference on High Performance
Computing & Simulation (HPCS), Caen, France, June 2010.
[SW09] Johannes Schneider and Roger Wattenhofer. Bounds On Contention
Management Algorithms. In 20th International Symposium on Algo-
rithms and Computation (ISAAC), Honolulu, USA, 2009.
318 CHAPTER 28. DOMINATING SET

28.1 Sequential Greedy Algorithm


Intuitively, to end up with a small dominating set S, nodes in S need to cover
as many neighbors as possible. It is therefore natural to add nodes v with a
large span w(v) to S. This idea leads to a simple greedy algorithm:

Chapter 28 Algorithm 101 Greedy Algorithm


1: S := ∅;
2: while there
 are white nodes do
3: v := v w(v) = maxu {w(u)} ;
Dominating Set 4: S := S ∪ v;
5: end while

Theorem 28.2. The Greedy Algorithm computes a ln ∆-approximation, that


is, for the computed dominating set S and an optimal dominating set S ∗ , we
In this chapter we present another randomized algorithm that demonstrates the
have
power of randomization to break symmetries. We study the problem of finding |S|
a small dominating set of the network graph. As it is the case for the MIS, an ≤ ln ∆.
|S ∗ |
efficient dominating set algorithm can be used as a basic building block to solve
a number of problems in distributed computing. For example, whenever we need Proof. Each time, we choose a new node of the dominating set (each greedy
to partition the network into a small number of local clusters, the computation step), we have cost 1. Instead of letting this node pay the whole cost, we
of a small dominating set usually occurs in some way. A particularly important distribute the cost equally among all newly covered nodes. Assume that node
application of dominating sets is the construction of an efficient backbone for v, chosen in line 3 of the algorithm, is white itself and that its white neighbors
routing. are v1 , v2 , v3 , and v4 . In this case each of the 5 nodes v and v1 , . . . , v4 get
charged 1/5. If v is chosen as a gray node, only the nodes v1 , . . . , v4 get charged
Definition 28.1 (Dominating Set). Given an undirected graph G = (V, E), a
dominating set is a subset S ⊆ V of its nodes such that for all nodes v ∈ V,
either v ∈ S or a neighbor u of v is in S.

Remarks:

   • It is well-known that computing a dominating set of minimal size is
     NP-hard. We therefore look for approximation algorithms, that is,
     algorithms which produce solutions which are optimal up to a certain
     factor.

   • Note that every MIS (cf. Chapter 7) is a dominating set. In general,
     the size of an MIS can however be larger than the size of a minimum
     dominating set by a factor of Ω(n). As an example, connect the centers
     of two stars by an edge. Every MIS contains all the leaves of at least
     one of the two stars, whereas there is a dominating set of size 2.

All the dominating set algorithms that we study throughout this chapter
operate in the following way. We start with S = ∅ and add nodes to S until
S is a dominating set. To simplify presentation, we color nodes according to
their state during the execution of an algorithm. We call nodes in S black,
nodes which are covered (neighbors of nodes in S) gray, and all uncovered
nodes white. By W(v), we denote the set of white nodes among the direct
neighbors of v, including v itself. We call w(v) = |W(v)| the span of v.

(they all get 1/4).
   Now, assume that we know an optimal dominating set S∗. By the definition
of dominating sets, to each node which is not in S∗, we can assign a
neighbor from S∗. By assigning each node to exactly one neighboring node of
S∗, the graph is decomposed into stars, each having a dominator (node in S∗)
as center and non-dominators as leaves. Clearly, the cost of an optimal
dominating set is 1 for each such star. In the following, we show that the
amortized cost (distributed cost) of the greedy algorithm is at most
ln ∆ + 2 for each star. This suffices to prove the theorem.
   Consider a single star with center v∗ ∈ S∗ before choosing a new node u
in the greedy algorithm. The number of nodes that become dominated when
adding u to the dominating set is w(u). Thus, if some white node v in the
star of v∗ becomes gray or black, it gets charged 1/w(u). By the greedy
condition, u is a node with maximal span and therefore w(u) ≥ w(v∗). Thus,
v is charged at most 1/w(v∗). After becoming gray, nodes do not get charged
any more. Therefore the first node that is covered in the star of v∗ gets
charged at most 1/(d(v∗) + 1). Because w(v∗) ≥ d(v∗) when the second node is
covered, the second node gets charged at most 1/d(v∗). In general, the ith
node that is covered in the star of v∗ gets charged at most
1/(d(v∗) + i − 2). Thus, the total amortized cost in the star of v∗ is at
most

   1/(d(v∗) + 1) + 1/d(v∗) + · · · + 1/2 + 1/1 = H(d(v∗) + 1) ≤ H(∆ + 1) < ln ∆ + 2

where ∆ is the maximal degree of G and where H(n) = Σ_{i=1}^{n} 1/i is the
nth harmonic number.
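The greedy algorithm and its ln ∆ + 2 guarantee are easy to check on small
instances. The following Python sketch (the function names and the random
test graphs are ours, not part of the notes) compares greedy against a
brute-force optimum:

```python
import random
from itertools import combinations
from math import log

def greedy_dominating_set(adj):
    """adj[v] is the closed neighborhood of node v (v itself included)."""
    n = len(adj)
    white = set(range(n))                  # uncovered nodes
    black = []                             # the dominating set S
    while white:
        # pick a node with maximal span w(v) = |W(v)| (ties by id)
        v = max(range(n), key=lambda x: (len(white & adj[x]), x))
        black.append(v)
        white -= adj[v]                    # newly covered nodes turn gray
    return black

def optimum_size(adj):
    """Brute-force minimum dominating set size (exponential, tiny n only)."""
    n = len(adj)
    for k in range(1, n + 1):
        for cand in combinations(range(n), k):
            if set().union(*(adj[v] for v in cand)) == set(range(n)):
                return k

random.seed(7)
for _ in range(5):
    n = 10
    adj = [{v} for v in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if random.random() < 0.3:
                adj[u].add(v); adj[v].add(u)
    S, opt = greedy_dominating_set(adj), optimum_size(adj)
    delta = max(1, max(len(a) - 1 for a in adj))   # maximum degree ∆
    assert set().union(*(adj[v] for v in S)) == set(range(n))
    assert len(S) <= (log(delta) + 2) * opt        # the ln ∆ + 2 guarantee
```

On graphs this small, greedy is usually optimal or off by one; the ln ∆ + 2
bound is far from tight here, but the assertion mirrors the theorem exactly.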
28.2. DISTRIBUTED GREEDY ALGORITHM 319 320 CHAPTER 28. DOMINATING SET

Remarks:

   • One can show that unless NP ⊆ DTIME(n^O(log log n)), no polynomial-time
     algorithm can approximate the minimum dominating set problem better
     than (1 − o(1)) · ln ∆. Thus, unless P ≈ NP, the approximation ratio of
     the simple greedy algorithm is optimal (up to lower order terms).

28.2 Distributed Greedy Algorithm

For a distributed algorithm, we use the following observation. The span of a
node can only be reduced if any of the nodes at distance at most 2 is
included in the dominating set. Therefore, if the span of node v is greater
than the span of any other node at distance at most 2 from v, the greedy
algorithm chooses v before any of the nodes at distance at most 2. This
leads to a very simple distributed version of the greedy algorithm. Every
node v executes the following algorithm.

Algorithm 102 Distributed Greedy Algorithm (at node v):
 1: while v has white neighbors do
 2:   compute span w(v);
 3:   send w(v) to nodes at distance at most 2;
 4:   if w(v) largest within distance 2 (ties are broken by IDs) then
 5:     join dominating set
 6:   end if
 7: end while

Theorem 28.3. Algorithm 102 computes a dominating set of size at most
ln ∆ + 2 times the size of an optimal dominating set in O(n) rounds.

Proof. The approximation quality follows directly from the above observation
and the analysis of the greedy algorithm. The time complexity is at most
linear because in every iteration of the while loop, at least one node is
added to the dominating set and because one iteration of the while loop can
be implemented in a constant number of rounds.

The approximation ratio of the above distributed algorithm is best possible
(unless P ≈ NP or unless we allow local computations to be exponential).
However, the time complexity is very bad. In fact, there really are graphs
on which in each iteration of the while loop, only one node is added to the
dominating set. As an example, consider a graph as in Figure 28.1. An
optimal dominating set consists of all nodes on the center axis. The
distributed greedy algorithm computes an optimal dominating set, however,
the nodes are chosen sequentially from left to right. Hence, the running
time of the algorithm on the graph of Figure 28.1 is Ω(√n). Below, we will
see that there are graphs on which Algorithm 102 even needs Ω(n) rounds.

Figure 28.1: Distributed greedy algorithm: Bad example

The problem of the graph of Figure 28.1 is that there is a long path of
descending degrees (spans). Every node has to wait for the neighbor to the
left. Therefore, we want to change the algorithm in such a way that there
are no long paths of descending spans. Allowing for an additional factor 2
in the approximation ratio, we can round all spans to the next power of 2
and let the greedy algorithm take a node with a maximal rounded span. In
this case, a path of strictly descending rounded spans has at most length
log n. For the distributed version, this means that nodes whose rounded span
is maximal within distance 2 are added to the dominating set. Ties are again
broken by unique node IDs. If node IDs are chosen at random, the time
complexity for the graph of Figure 28.1 is reduced from Ω(√n) to O(log n).

Unfortunately, there still is a problem remaining. To see this, we consider
Figure 28.2. The graph of Figure 28.2 consists of a clique with n/3 nodes
and two leaves per node of the clique. An optimal dominating set consists of
all the n/3 nodes of the clique. Because they all have distance 1 from each
other, the described distributed algorithm only selects one in each while
iteration (the one with the largest ID). Note that as soon as one of the
nodes is in the dominating set, the span of all remaining nodes of the
clique is 2. They do not have common neighbors and therefore there is no
reason not to choose all of them in parallel. However, the time complexity
of the simple algorithm is Ω(n). In order to improve this example, we need
an algorithm that can choose many nodes simultaneously as long as these
nodes do not interfere too much, even if they are neighbors. In Algorithm
103, N(v) denotes the set of neighbors of v (including v itself) and
N₂(v) = ∪_{u∈N(v)} N(u) are the nodes at distance at most 2 of v. As before,
W(v) = {u ∈ N(v) : u is white} and w(v) = |W(v)|.

Figure 28.2: Distributed greedy algorithm with rounded spans: Bad example

It is clear that if Algorithm 103 terminates, it computes a valid dominating
set. We will now show that the computed dominating set is small and that the
algorithm terminates quickly.

Theorem 28.4. Algorithm 103 computes a dominating set of size at most (6 ·
ln ∆ + 12) · |S∗|, where S∗ is an optimal dominating set.

Algorithm 103 Fast Distributed Dominating Set Algorithm (at node v):
 1: W(v) := N(v); w(v) := |W(v)|;
 2: while W(v) ≠ ∅ do
 3:   w̃(v) := 2^⌊log₂ w(v)⌋;  // round down to next power of 2
 4:   ŵ(v) := max_{u∈N₂(v)} w̃(u);
 5:   if w̃(v) = ŵ(v) then v.active := true else v.active := false end if;
 6:   compute support s(v) := |{u ∈ N(v) : u.active = true}|;
 7:   ŝ(v) := max_{u∈W(v)} s(u);
 8:   v.candidate := false;
 9:   if v.active then
10:     v.candidate := true with probability 1/ŝ(v)
11:   end if;
12:   compute c(v) := |{u ∈ N(v) : u.candidate = true}|;
13:   if v.candidate and Σ_{u∈W(v)} c(u) ≤ 3w(v) then
14:     node v joins dominating set
15:   end if
16:   W(v) := {u ∈ N(v) : u is white}; w(v) := |W(v)|;
17: end while

Proof. The proof is a bit more involved but analogous to the analysis of the
approximation ratio of the greedy algorithm. Every time we add a new node v
to the dominating set, we distribute the cost among v (if it is still white)
and its white neighbors. Consider an optimal dominating set S∗. As in the
analysis of the greedy algorithm, we partition the graph into stars by
assigning every node u not in S∗ to a neighbor v∗ in S∗. We want to show
that the total distributed cost in the star of every v∗ ∈ S∗ is at most
6H(∆ + 1).
   Consider a node v that is added to the dominating set by Algorithm 103.
Let W(v) be the set of white nodes in N(v) when v becomes a dominator. For a
node u ∈ W(v) let c(u) be the number of candidate nodes in N(u). We define
C(v) = Σ_{u∈W(v)} c(u). Observe that C(v) ≤ 3w(v) because otherwise v would
not join the dominating set in line 14. When adding v to the dominating set,
every newly covered node u ∈ W(v) is charged 3/(c(u)w(v)). This compensates
the cost 1 for adding v to the dominating set because

   Σ_{u∈W(v)} 3/(c(u)w(v)) ≥ w(v) · 3/(w(v) · Σ_{u∈W(v)} c(u)/w(v)) = 3/(C(v)/w(v)) ≥ 1.

The first inequality follows because it can be shown that for αᵢ > 0,
Σ_{i=1}^{k} 1/αᵢ ≥ k/ᾱ where ᾱ = Σ_{i=1}^{k} αᵢ/k.
   Now consider a node v∗ ∈ S∗ and assume that a white node u ∈ W(v∗) turns
gray or black in iteration t of the while loop. We have seen that u is
charged 3/(c(u)w(v)) for every node v ∈ N(u) that joins the dominating set
in iteration t. Since a node can only join the dominating set if its span is
largest up to a factor of two within two hops, we have w(v) ≥ w(v∗)/2 for
every node v ∈ N(u) that joins the dominating set in iteration t. Because
there are at most c(u) such nodes, the charge of u is at most 6/w(v∗).
Analogously to the sequential greedy algorithm, we now get that the total
cost in the star of a node v∗ ∈ S∗ is at most

   Σ_{i=1}^{|N(v∗)|} 6/i ≤ 6 · H(|N(v∗)|) ≤ 6 · H(∆ + 1) < 6 · ln ∆ + 12.

To bound the time complexity of the algorithm, we first need to prove the
following lemma.

Lemma 28.5. Consider an iteration of the while loop. Assume that a node u is
white and that 2s(u) ≥ max_{v∈C(u)} ŝ(v) where C(u) = {v ∈ N(u) :
v.candidate = true}. Then, the probability that u becomes dominated (turns
gray or black) in the considered while loop iteration is larger than 1/9.

Proof. Let D(u) be the event that u becomes dominated in the considered
while loop iteration, i.e., D(u) is the event that u changes its color from
white to gray or black. Thus, we need to prove that Pr[D(u)] > 1/9. We can
write this probability as

   Pr[D(u)] = Pr[c(u) > 0] · Pr[D(u) | c(u) > 0] + Pr[c(u) = 0] · Pr[D(u) | c(u) = 0],

where the last term is 0 (if no neighbor of u is a candidate, no neighbor of
u can join the dominating set). It is therefore sufficient to lower bound
the probabilities Pr[c(u) > 0] and Pr[D(u) | c(u) > 0]. We have
2s(u) ≥ max_{v∈C(u)} ŝ(v). Therefore, in line 10, each of the s(u) active
nodes v ∈ N(u) becomes a candidate node with probability 1/ŝ(v) ≥ 1/(2s(u)).
The probability that at least one of the s(u) active nodes in N(u) becomes a
candidate therefore is

   Pr[c(u) > 0] > 1 − (1 − 1/(2s(u)))^{s(u)} > 1 − 1/√e > 1/3.

We used that for x ≥ 1, (1 − 1/x)^x < 1/e. We next want to bound the
probability Pr[D(u) | c(u) > 0] that at least one of the c(u) candidates in
N(u) joins the dominating set. We have

   Pr[D(u) | c(u) > 0] ≥ min_{v∈N(u)} Pr[v joins dominating set | v.candidate = true].

Consider some node v and let C(v) = Σ_{v′∈W(v)} c(v′). If v is a candidate,
it joins the dominating set if C(v) ≤ 3w(v). We are thus interested in the
probability Pr[C(v) ≤ 3w(v) | v.candidate = true]. Assume that v is a
candidate. Let c′(v′) = c(v′) − 1 be the number of candidates in N(v′) \ {v}.
For a node v′ ∈ W(v), c′(v′) is upper bounded by a binomial random variable
Bin(s(v′) − 1, 1/s(v′)) with expectation (s(v′) − 1)/s(v′). We therefore
have

   E[c(v′) | v.candidate = true] = 1 + E[c′(v′)] = 1 + (s(v′) − 1)/s(v′) < 2.

By linearity of expectation, we hence obtain

   E[C(v) | v.candidate = true] = Σ_{v′∈W(v)} E[c(v′) | v.candidate = true] < 2w(v).
We can now use Markov's inequality to bound the probability that C(v)
becomes too large:

   Pr[C(v) > 3w(v) | v.candidate = true] < 2/3.

Combining everything, we get

   Pr[v joins dom. set | v.candidate = true] = Pr[C(v) ≤ 3w(v) | v.candidate = true] > 1/3

and hence

   Pr[D(u)] = Pr[c(u) > 0] · Pr[D(u) | c(u) > 0] > 1/3 · 1/3 = 1/9.

Theorem 28.6. In expectation, Algorithm 103 terminates in O(log²∆ · log n)
rounds.

Proof. First observe that every iteration of the while loop can be executed
in a constant number of rounds. Consider the state after t iterations of the
while loop. Let w̃_max(t) = max_{v∈V} w̃(v) be the maximal span rounded down
to the next power of 2 after t iterations. Further, let s_max(t) be the
maximal support s(v) of any node v for which there is a node u ∈ N(v) with
w(u) ≥ w̃_max(t) after t while loop iterations. Observe that all nodes v
with w(v) ≥ w̃_max(t) are active in iteration t + 1 and that as long as the
maximal rounded span w̃_max(t) does not change, s_max(t) can only get
smaller with increasing t. Consider the pair (w̃_max, s_max) and define a
relation ≺ such that (w′, s′) ≺ (w, s) iff w′ < w or w = w′ and s′ ≤ s/2.
From the above observations, it follows that

   (w̃_max(t), s_max(t)) ≺ (w̃_max(t′), s_max(t′)) =⇒ t > t′.    (28.6.1)

For a given time t, let T(t) be the first time for which

   (w̃_max(T(t)), s_max(T(t))) ≺ (w̃_max(t), s_max(t)).

We first want to show that for all t,

   E[T(t) − t] = O(log n).    (28.6.2)

Let us look at the state after t while loop iterations. By Lemma 28.5, every
white node u with support s(u) ≥ s_max(t)/2 will be dominated after the
following while loop iteration with probability larger than 1/9. Consider a
node u that satisfies the following three conditions:

(1) u is white

(2) there is a node v ∈ N(u) : w(v) ≥ w̃_max(t)

(3) s(u) ≥ s_max(t)/2.

As long as u satisfies all three conditions, the probability that u becomes
dominated is larger than 1/9 in every while loop iteration. Hence, after
t + τ iterations (from the beginning), u is dominated or does not satisfy
(2) or (3) with probability larger than 1 − (8/9)^τ. Choosing
τ = log_{9/8}(2n), the remaining probability (8/9)^τ becomes 1/(2n). There
are at most n nodes u satisfying Conditions (1)–(3). Therefore, applying the
union bound, we obtain that with probability more than 1/2, there is no
white node u satisfying Conditions (1)–(3) at time t + log_{9/8}(2n).
Equivalently, with probability more than 1/2, T(t) ≤ t + log_{9/8}(2n).
Analogously, we obtain that with probability more than 1 − 1/2^k,
T(t) ≤ t + k · log_{9/8}(2n). We then have

   E[T(t) − t] = Σ_{τ=1}^{∞} Pr[T(t) − t = τ] · τ
               ≤ Σ_{k=1}^{∞} (1/2^k − 1/2^{k+1}) · k · log_{9/8}(2n) = log_{9/8}(2n)

and thus Equation (28.6.2) holds.
   Let t₀ = 0 and tᵢ = T(tᵢ₋₁) for i = 1, …, k, where
t_k = min{t : w̃_max(t) = 0}. Because w̃_max(t) = 0 implies that w(v) = 0
for all v ∈ V and that we therefore have computed a dominating set, by
Equations (28.6.1) and (28.6.2) (and linearity of expectation), the expected
number of rounds until Algorithm 103 terminates is O(k · log n). Since
w̃_max(t) can only take ⌊log ∆⌋ different values and because for a fixed
value of w̃_max(t), the number of times s_max(t) can be decreased by a
factor of 2 is at most log ∆, we have k ≤ log²∆.

Remarks:

   • It is not hard to show that Algorithm 103 even terminates in
     O(log²∆ · log n) rounds with probability 1 − 1/n^c for an arbitrary
     constant c.

   • Using the median of the supports of the neighbors instead of the
     maximum in line 7 results in an algorithm with time complexity
     O(log ∆ · log n). With another algorithm, this can even be slightly
     improved to O(log²∆).

   • One can show that Ω(log ∆/ log log ∆) rounds are necessary to obtain
     an O(log ∆)-approximation.

   • It is not known whether there is a fast deterministic approximation
     algorithm. This is an interesting and important open problem. The best
     deterministic algorithm known to achieve an O(log ∆)-approximation has
     time complexity 2^O(√log n).

Chapter Notes

See [JRS02, KW05].
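The Ω(n) behavior of Algorithm 102 on the graph of Figure 28.2 can be
reproduced with a toy simulation. The sketch below (Python; the concrete
graph encoding and variable names are ours, not from the notes) runs the
synchronous while-loop rounds and counts them:

```python
# The clique-with-leaves graph of Figure 28.2: a clique of c nodes,
# each clique node with two private leaves (n = 3c nodes overall).
c = 6
n = 3 * c
adj = {v: {v} for v in range(n)}           # closed neighborhoods
for u in range(c):
    for v in range(c):
        adj[u].add(v)                      # clique edges
    for leaf in (c + 2 * u, c + 2 * u + 1):
        adj[u].add(leaf)                   # leaf edges
        adj[leaf].add(u)

def dist2(v):                              # nodes within distance 2 of v
    reach = set()
    for u in adj[v]:
        reach |= adj[u]
    return reach

white = set(range(n))
dominating, rounds = set(), 0
while white:
    rounds += 1
    span = {v: len(white & adj[v]) for v in range(n)}
    # a node joins iff its (span, id) pair is maximal within distance 2
    joiners = [v for v in range(n) if span[v] > 0
               and all((span[v], v) >= (span[u], u) for u in dist2(v))]
    for v in joiners:
        dominating.add(v)
        white -= adj[v]

assert dominating == set(range(c))   # finds the optimum (the clique) ...
assert rounds == c                   # ... but adds only one node per round
```

With c = 6 the simulation performs exactly c rounds, one clique node per
round, even though after the first join all remaining clique nodes could
safely join in parallel — which is exactly the weakness Algorithm 103
addresses.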
Bibliography

[JRS02] Lujun Jia, Rajmohan Rajaraman, and Torsten Suel. An efficient
        distributed algorithm for constructing small dominating sets.
        Distributed Computing, 15(4):193–205, 2002.

[KW05] Fabian Kuhn and Roger Wattenhofer. Constant-Time Distributed
       Dominating Set Approximation. Distributed Computing, 17(4), May 2005.
Chapter 29

Routing

29.1 Array

(Routing is important for any distributed system. This chapter is only an
introduction into routing; we will see other facets of routing in a next
chapter.)

Definition 29.1 (Routing). We are given a graph and a set of routing
requests, each defined by a source and a destination node.

Definition 29.2 (One-to-one, Permutation). In a one-to-one routing problem,
each node is the source of at most one packet and each node is the
destination of at most one packet. In a permutation routing problem, each
node is the source of exactly one packet and each node is the destination of
exactly one packet.

Remarks:

   • Permutation routing is a special case of one-to-one routing.

Definition 29.3 (Store and Forward Routing). The network is synchronous. In
each step, at most two packets (one in each direction) can be sent over each
link.

Remarks:

   • If two packets want to follow the same link, then one is queued
     (stored) at the sending node. This is known as contention.

Algorithm 104 Greedy on Array
An array is a linked list of n nodes; that is, node i is connected with
nodes i − 1 and i + 1, for i = 2, …, n − 1. With the greedy algorithm, each
node injects its packet at time 0. At each step, each packet that still
needs to move rightward or leftward does so.

Theorem 29.4 (Analysis). The greedy algorithm terminates in n − 1 steps.

Proof. By induction two packets will never contend for the same link. Then
each packet arrives at its destination in d steps, where d is the distance
between source and destination.

Remarks:

   • Unfortunately, only the array (or the ring) allows such a simple
     contention-free analysis. Already in a tree (with nodes of degree 3 or
     more) there might be two packets arriving at the same step at the same
     node, both want to leave on the same link, and one needs to be queued.
     In a “Mercedes-Benz” graph Ω(n) packets might need to be queued. We
     will study this problem in the next section.

   • There are many strategies for scheduling packets contending for the
     same edge (e.g. “farthest goes first”); these queuing strategies have
     a substantial impact on the performance of the algorithm.

29.2 Mesh

Algorithm 105 Greedy on Mesh
A mesh (a.k.a. grid, matrix) is a two-dimensional array with m columns and
m rows (n = m²). Packets are routed to their correct column (on the row in
greedy array style), and then to their correct row. The farthest packet will
be given priority.

Theorem 29.5 (Analysis). In one-to-one routing, the greedy algorithm
terminates in 2m − 2 steps.

Proof. First note that packets in the first phase of the algorithm do not
interfere with packets in the second phase of the algorithm. With Theorem
29.4 each packet arrives at its correct column in m − 1 steps. (Some packets
may arrive at their turning node earlier, and already start the second
phase; we will not need this in the analysis.) We need the following lemma
for the second phase of the algorithm.

Lemma 29.6 (Many-to-One on Array, Lemma 1.5 in Leighton Section 1.7). We
are given an array with n nodes. Each node is a destination for at most one
packet (but may be the source of many). If edge contention is resolved by
farthest-to-go (FTG), the algorithm terminates in n − 1 steps.

Proof (Leighton Section 1.7, Lemma 1.5). Leftward moving packets and
rightward moving packets never interfere; so we can restrict ourselves to
rightward moving packets. We name the packets with their destination node.
Since the queuing strategy is FTG, packet i can only be stopped by packets
j > i. Note that a packet i may be contending with the same packet j several
times. However, packet i will either find its destination “among” the higher
packets, or directly after the last of the higher packets. More formally,
after k steps, packets j, j + 1, …, n do not need links 1, …, l anymore,
with k = n − j + l. Proof by induction: Packet n has the highest priority:
After k steps it has escaped
the first k links. Packet n − 1 can therefore use link l in step l + 1, and
so on. Packet i not needing link i in step k = n means that packet i has
arrived at its destination node i in step n − 1 or earlier.

Lemma 29.6 completes the proof.

Remarks:

   • A 2m − 2 time bound is the best we can hope for, since the distance
     between the two farthest nodes in the mesh is exactly 2m − 2.

   • One thing still bugs us: The greedy algorithm might need queues in the
     order of m. And queues are expensive! In the next section, we try to
     bring the queue size down!

29.3 Routing in the Mesh with Small Queues

(First we look at a slightly simpler problem.)

Definition 29.7 (Random Destination Routing). In a random destination
routing problem, each node is the source of at most one packet with
destination chosen uniformly at random.

Remarks:

   • Random destination routing is not one-to-one routing. In the worst
     case, a node can be destination for all n packets, but this case is
     very unlikely (with probability 1/n^{n−1}).

   • We study Algorithm 105, but this time in the random destination model.
     Studying the random destination model will give us a deeper
     understanding of routing... and distributed computing in general!

Theorem 29.8 (Random destination analysis of Algorithm 105). If
destinations are chosen at random the maximum queue size is
O(log n/ log log n) with high probability. (With high probability means with
probability at least 1 − O(1/n).)

Proof. We can restrict ourselves to column edges because there will not be
any contention at row edges. Let us consider the queue for a north-bound
column edge. In each step, there might be three packets arriving (from
south, east, west). Since each arriving south packet will be forwarded north
(or consumed when the node is the destination), the queue size can only grow
from east or west packets – packets that are “turning” at the node. Hence
the queue size of a node is always bounded by the number of packets turning
at the node. A packet only turns at a node u when it is originated at a node
in the same row as u (there are only m nodes in the row). Packets have
random destinations, so the probability to turn for each of these packets is
1/m only. Thus the probability P that r or more packets turn in some
particular node u is at most

   P ≤ (m choose r) · (1/m)^r.

(The factor (1 − 1/m)^{m−r} is not present because we bound the probability
that some set of r packets all turn, which covers the cases where more than
r packets turn as well.) Using

   (x choose y) < (xe/y)^y, for 0 < y < x,

we directly get

   P < (me/r)^r · (1/m)^r = (e/r)^r.

Hence most queues do not grow larger than O(1). Also, when we choose
r := e log n / log log n we can show P = o(1/n²). The probability that any
of the 4n queues ever exceeds r is less than 1 − (1 − P)^{4n} = o(1/n).

Remarks:

   • OK. We got a bound on the queue size. Now what about time
     complexity?!? The same analysis as for one-to-one routing applies. The
     probability that a node sees “many” packets in phase 2 is small... it
     can be shown that the algorithm terminates in O(m) time with high
     probability.

   • In fact, maximum queue sizes are likely to be a lot less than
     logarithmic. The reason is the following: Though Θ(log n/ log log n)
     packets might turn at some node, these turning packets are likely to
     be spread in time. Early arriving packets might use gaps and do not
     conflict with late arriving packets. With a much more elaborate method
     (using the so-called “wide-channel” model) one can show that there
     will never be more than four(!) packets in any queue (with high
     probability only, of course).

   • Unfortunately, the above analysis only works for random destination
     problems. Question: Can we devise an algorithm that uses small queues
     only but for any one-to-one routing problem? Answer: Yes, we can! In
     the simplest form we can use a clever trick invented by Leslie
     Valiant: Instead of routing the packets directly on their row-column
     path, we route each packet to a randomly chosen intermediate node (on
     the row-column path), and from there to the destination (again on the
     row-column path). Valiant’s trick routes all packets in O(m) time
     (with high probability) and only needs queues of size O(log n).
     Instead of choosing a random intermediate node one can choose a random
     node that is more or less in the direction of the destination, solving
     any one-to-one routing problem in 2m + O(log n) time with only
     constant-size queues. You don’t wanna know the details...

   • What about no queues at all?!?

29.4 Hot-Potato Routing

Definition 29.9 (Hot-Potato Routing). Like the store-and-forward model the
hot-potato model is synchronous and at most two packets (one in each
direction)
can be sent over a link. However, contending packets cannot be stored;
instead all but one contending packet must be sent over a “wrong link”
(known as deflection) immediately, since the hot-potato model does not allow
queuing.

Remarks:

   • Don’t burn your fingers with “hot-potato” packets. If you get one you
     better forward it directly!

   • A node with degree δ receives up to δ packets at the beginning of each
     step – since the node has δ links, it can forward all of them, but
     unfortunately not all in the right direction.

   • Hot-potato routing is easier to implement, especially on light-based
     networks, where you don’t want to convert photons into electrons and
     then back again. There are a couple of parallel machines that use the
     hot-potato paradigm to simplify and speed up routing.

   • How bad does hot-potato routing get (in the random or the one-to-one
     model)? How bad can greedy hot-potato routing (greedy: whenever there
     is no contention you must send a packet into the right direction) get
     in a worst case?

Algorithm 106 Greedy Hot-Potato Routing on a Mesh
Packets move greedily towards their destination (any good link is fine if
there is more than one). If a packet gets deflected, it gets excited with
probability p (we set p = Θ(1/m)). An excited packet has higher priority.
When being excited, a packet tries to reach the destination on the
row-column path. If two excited packets contend, then the one that wants to
exit the opposite link is given priority. If an excited packet fails to take
its desired link it becomes normal again.

Theorem 29.10 (Analysis). A packet will reach its destination in O(m)
expected time.

Proof (sketch; full proof in Busch et al., SODA 2000). An excited packet can
only be deflected at its start node (after becoming excited), and when
trying to turn. In both cases, the probability to fail is only constant
since other excited packets need to be at the same node at exactly the right
instant. Thus the probability that an excited packet finds its destination
is constant, and therefore a packet needs to “try” (to become excited) only
constantly often. Since a packet tries every p’th time it gets deflected, it
only gets deflected O(1/p) = O(m) times in expectation. Since each time it
does not get deflected, it gets closer to its destination, it will arrive at
the destination in O(m) expected time.

Remarks:

   • It seems that at least in expectation having no memory at all does not
     harm the time bounds much.

   • It is conjectured that one-to-one routing can be shown to have time
     complexity O(m) for this greedy hot-potato routing algorithm. However,
     the best known bound needs an additional logarithmic factor.

29.5 More Models

Routing comes in many flavors. We mention some of them in this section for
the sake of completeness.
   Store-and-forward and hot-potato routing are variants of
packet-switching. In the circuit-switching model, the entire path from
source to destination must be locked such that a stream of packets can be
transmitted.
   A packet-switching variant where more than one packet needs to be sent
from source to destination in a stream is known as wormhole routing.
   Static routing is when all the packets to be routed are injected at time
0. Instead, in dynamic routing, nodes may inject new packets constantly (at
a certain rate). Not much is known for dynamic routing.
   Instead of having a single source and a single destination for each
packet as in one-to-one routing, researchers have studied many-to-one
routing, where a node may be destination for many sources. The problem of
many-to-one routing is that there might be congested areas in the network
(areas with nodes that are destinations of many packets). Packets that can
be routed around such a congested area should do that, or they increase the
congestion even more. Such an algorithm was studied by Busch et al. at STOC
2000.
   Also one-to-many routing (multicasting) was considered, where a source
needs to send the same packet to many destinations. In one-to-many routing,
packets can be duplicated whenever needed.
   Nobody knows the topology of the Internet (and it is certainly not an
array or a mesh!). The problem is to find short paths without storing huge
routing tables at each node. There are several forms of routing (e.g.
compact routing, interval routing) that study the trade-off between routing
table size and quality of routing.
   Also, researchers started studying the effects of mixing various queuing
strategies in one network. This area of research is known as adversarial
queuing theory.
   And last but not least there are several special networks. A mobile
ad-hoc network, for example, consists of mobile nodes equipped with a
wireless communication device. In such networks nodes can only communicate
when they are within transmission range. Since the network is mobile
(dynamic), and since the nodes are considered to be simple, a variety of new
problems arise.

Chapter Notes

See [BHW00a, BHW00b, RT92, Lei90, VB81].

Bibliography

[BHW00a] Costas Busch, Maurice Herlihy, and Roger Wattenhofer. Hard-Potato
         Routing. In 32nd Annual ACM Symposium on Theory of Computing
         (STOC), Portland, Oregon, May 2000.

[BHW00b] Costas Busch, Maurice Herlihy, and Roger Wattenhofer. Randomized
         Greedy Hot-Potato Routing. In 11th Annual ACM-SIAM Symposium on
         Discrete Algorithms (SODA), pages 458–466, San Francisco,
         California, USA, January 2000.

[Lei90] Frank Thomson Leighton. Average Case Analysis of Greedy Routing
        Algorithms on Arrays. In SPAA, pages 2–10, 1990.

[RT92] Sanguthevar Rajasekaran and Thanasis Tsantilas. Optimal Routing
       Algorithms for Mesh-Connected Processor Arrays. Algorithmica,
       8(1):21–38, 1992.

[VB81] Leslie G. Valiant and Gordon J. Brebner. Universal Schemes for
       Parallel Communication. In STOC, pages 263–277. ACM, 1981.
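Before moving on, the chain of estimates in the proof of Theorem 29.8 can be
checked numerically. The sketch below (Python; the concrete parameters m and
the helper name are ours) verifies that the exact binomial tail is at most
(m choose r)(1/m)^r, which is at most (e/r)^r, and locates the smallest r
with (e/r)^r < 1/n²:

```python
from math import comb, e

def turn_tail(m, r):
    """Exact Pr[Bin(m, 1/m) >= r]: r or more of the m packets that start
    in a given row all turn at one fixed node of that row."""
    p = 1.0 / m
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(r, m + 1))

m = 100                                   # row length, n = m*m nodes
for r in range(2, 13):
    exact = turn_tail(m, r)
    union = comb(m, r) / m**r             # P <= (m choose r)(1/m)^r
    clean = (e / r) ** r                  # (m choose r) < (me/r)^r cancels m
    assert exact <= union <= clean

# smallest r with (e/r)^r < 1/n^2, the proof's r = Theta(log n / log log n)
n = m * m
r_star = next(r for r in range(3, 50) if (e / r) ** r < 1 / n**2)
assert r_star == 13
```

For n = 10⁴ the threshold r = 13 is of the order of e·ln n/ln ln n, matching
the Θ(log n/ log log n) queue bound of the theorem.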
336 CHAPTER 30. ROUTING STRIKES BACK

steps.

Remarks:
• The bit-reversed routing is therefore asymptotically a worst-case ex-
ample.

Chapter 30 • However, one that requires square-root queues. When being limited
to constant queue sizes the greedy algorithm can be forced to use Θ(n)
steps for some permutations.
• A routing problem where all the sources are on level 0 and all the
Routing Strikes Back destinations are on level d is called an end-to-end routing problem.
Surprisingly, solving an arbitrary routing problem on a butterfly (or
any hypercubic network) is often not harder.
30.1 Butterfly

Let's first assume that all the sources are on level 0, and all destinations are on level d of a d-dimensional butterfly.

Algorithm 107 Greedy Butterfly Routing

The unique path from a source on level 0 to a destination on level d with d hops is the greedy path. In the greedy butterfly routing algorithm each packet is constrained to follow its greedy path.

Remarks:

• In the bit-reversal permutation routing problem, the destination of a packet is the bit-reversed address of the source. With d = 3 you can see that both source (000, 0) and source (001, 0) route through edge (000, 1..2). Will the contention grow with higher dimension? Yes! Choose an odd d, then all the sources (0...0 b_{(d+1)/2} ... b_{d−1}, 0) will route through edge (00..0, (d−1)/2...(d+1)/2). You can choose the bits b_i arbitrarily. There are 2^{(d−1)/2} bit combinations, which is √(n/2) for n = 2^d sources.

• On the good side, this contention is also a guaranteed time bound, as the following theorem shows.

Theorem 30.1 (Analysis). The greedy butterfly algorithm terminates in O(√n) steps.

Proof. For simplicity we assume that d is odd. An edge on level l (from a node on level l to a node on level l+1) has at most 2^l sources, and at most 2^{d−l−1} destinations. Therefore the number of paths through an edge on level l is bounded by n_l = 2^{min(l, d−l−1)}. A packet can therefore be delayed at most n_l − 1 times on level l. Summing up over all levels, a packet is delayed at most

∑_{l=0}^{d−1} n_l = ∑_{l=0}^{(d−1)/2} n_l + ∑_{l=(d+1)/2}^{d−1} n_l = ∑_{l=0}^{(d−1)/2} 2^l + ∑_{l=0}^{(d−3)/2} 2^l < 3 · 2^{(d−1)/2} = O(√n).

Remarks:

• In the next section we show that there is a general square-root lower bound for "greedy" algorithms for any constant-degree graph. (In other words, our optimal greedy mesh routing algorithm of Chapter 4 was only possible because the mesh has such a bad diameter...)

30.2 Oblivious Routing

Definition 30.2 (Oblivious). A routing algorithm is oblivious if the path taken by each packet depends only on source and destination of the packet (and not on other packets, or the congestion encountered).

Theorem 30.3 (Lower Bound). Let G be a graph with n nodes and (maximum) degree d. Let A be any oblivious routing algorithm. Then there is a one-to-one routing problem for which A will need at least √n/(2d) steps.

Proof. Since A is oblivious, the path from node u to node v is P_{u,v}; A can be specified by n^2 paths. We must find k one-to-one paths that all use the same edge e. Then we can prove that A takes at least k/2 steps.

Let's look at the n − 1 paths to destination node v. For any integer k let S_k(v) be the set of edges in G where k or more of these paths pass through them. Also, let S_k^*(v) be the nodes incident to S_k(v). Since there are two nodes incident to each edge, |S_k^*(v)| ≤ 2|S_k(v)|. In the following we assume that k ≤ (n − 1)/d; then v ∈ S_k^*(v), hence |S_k^*(v)| > 0.

We have

n − |S_k^*(v)| ≤ (k − 1)(d − 1)|S_k^*(v)|

because every node u not in S_k^*(v) is the start of a path P_{u,v} that enters S_k^*(v) from outside. In particular, for any node u ∉ S_k^*(v) there is an edge (w, w′) in P_{u,v} that enters S_k^*(v). Since the edge (w, w′) ∉ S_k(v), there are at most (k − 1) starting nodes u for edge (w, w′). Also there are at most (d − 1) edges adjacent to w′ that are not in S_k(v). We get

n ≤ (k − 1)(d − 1)|S_k^*(v)| + |S_k^*(v)| ≤ 2[1 + (k − 1)(d − 1)]|S_k(v)| ≤ 2kd|S_k(v)|.

Thus, |S_k(v)| ≥ n/(2kd). We set k = √n/d, and sum over all n nodes:

∑_{v∈V} |S_k(v)| ≥ n^2/(2kd) = n^{3/2}/2.
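The bit-reversal contention of Section 30.1 is easy to check numerically. The sketch below (helper names are mine, not from the course) builds the greedy path of every packet in the bit-reversal permutation and counts how many paths cross each butterfly edge; the convention here fixes bit l of the row when moving from level l to level l+1.

```python
from collections import Counter

def greedy_path(src, dst, d):
    """Edges of the greedy path from (src, 0) to (dst, d): when moving from
    level l to level l+1, bit l of the row is set to the destination's bit l."""
    row, path = src, []
    for l in range(d):
        mask = 1 << l
        nxt = (row & ~mask) | (dst & mask)
        path.append(((row, l), (nxt, l + 1)))
        row = nxt
    return path

def bit_reverse(x, d):
    """Destination of x in the bit-reversal permutation on d bits."""
    return int(format(x, f"0{d}b")[::-1], 2)

def max_contention(d):
    """Largest number of greedy bit-reversal paths through a single edge."""
    load = Counter()
    for s in range(2 ** d):
        for edge in greedy_path(s, bit_reverse(s, d), d):
            load[edge] += 1
    return max(load.values())
```

For d = 3 every used middle-level edge carries exactly 2 paths (matching the example with sources 000 and 001), and for larger odd d the maximum load at the middle level is at least √(n/2).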

Since there are at most nd/2 edges in G, this means that there is an edge e for at least

(n^{3/2}/2) / (nd/2) = √n/d = k

different values of v. Since edge e is in at least k different paths in each set S_k(v), we can construct a one-to-one permutation problem where edge e is used √n/d times (directed: √n/(2d) contention).

Remarks:

• In fact, as many as (√n/d)! one-to-one routing problems can be constructed with this method.

• The proof can be extended to the case where the one-to-one routing problem consists of R route requests. The lower bound is then Ω(R/(d√n)).

• There is a node that needs to route Ω(√(n/d)) packets.

• The lower bound can be extended to randomized oblivious algorithms... however, if we are allowed to use randomization, the lower bound gets much weaker. In fact, one can use Valiant's trick also in the butterfly: In a first phase, we route each packet on the greedy path to a random destination on level d, in the second phase on the same row back to level 0, and in a third phase on the greedy path to the destination. This way we can escape the bad one-to-one problems with high probability. (There are many more good one-to-one problems than bad one-to-one problems.) One can show that with this trick one can route any one-to-one end-to-end routing problem in asymptotically optimal O(log n) time (with high probability).

• If a randomized algorithm fails (takes too long), simply re-run it. It will be likely to succeed then. On the other hand, if a deterministic algorithm fails in some rare instance, re-running it will not help!

30.3 Offline Routing

There are a variety of other aspects in routing. In this section we study one of them to gain further insights.

Definition 30.4 (Offline Routing). We are given a routing problem (graph and set of routing requests). An offline routing algorithm is a (not distributed) algorithm that sees the whole input (the routing problem).

Remarks:

• Offline routing is worth studying because the same communication pattern might appear whenever you run your (important!) (parallel) algorithm.

• In offline routing, path selection and scheduling can be studied independently.

Definition 30.5 (Path Selection). We are given a routing problem (a graph and a set of routing requests). A path selection algorithm selects a path (a route) for each request.

Remarks:

• Path selection is efficient if the paths are "short" and do not interfere if they do not need to. Formally, this can be defined by congestion and dilation (see below).

• For some routing problems, path selection is easy. If the graph is a tree, for example, the best path between two nodes is the direct path. (Every route from a source to a destination includes at least all the links of the shortest path.)

Definition 30.6 (Dilation, Congestion). The dilation of a path selection is the length of a maximum path. The contention of an edge is the number of paths that use the edge. The congestion of a path selection is the load of a most contended edge.

Remarks:

• A path selection should minimize congestion and dilation.

• Networking researchers have defined the "flow number", which is defined as the minimum max(congestion, dilation) over all possible path selections.

• Alternatively, congestion can be defined with directed edges, or nodes.

Definition 30.7 (Scheduling). We are given a set of source-destination paths. A scheduling algorithm specifies which messages traverse which link at which time step (for an appropriate model).

Remarks:

• The most popular model is store-and-forward (with small queues). Other popular models have no queues at all: e.g. hot-potato routing, or direct routing (where the source might delay the injection of a packet; once a packet is injected, however, it will go to the destination without stopping).

Lemma 30.8 (Lower Bound). Scheduling takes at least Ω(C + D) steps, where C is the congestion and D is the dilation.

Remarks:

• We aim for algorithms that are competitive with the lower bound. (As opposed to algorithms that finish in O(f(n)) time; C + D and n are generally not comparable.)

Theorem 30.9 (Analysis). Algorithm 108 terminates in 2C + D steps.
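Dilation and congestion as in Definition 30.6 translate directly into code. A minimal sketch (function names are mine; paths are node lists, edges counted undirected) computes the two quantities appearing in the Ω(C + D) lower bound of Lemma 30.8:

```python
from collections import Counter

def dilation(paths):
    """Length (in edges) of a longest selected path."""
    return max(len(p) - 1 for p in paths)

def congestion(paths):
    """Contention of a most contended (undirected) edge."""
    load = Counter()
    for p in paths:
        for u, v in zip(p, p[1:]):
            load[frozenset((u, v))] += 1
    return max(load.values())
```

For example, on the path graph 0-1-2-3 with selected paths [0,1,2], [1,2,3], [2,3], the dilation is 2 and the congestion is 2 (edges {1,2} and {2,3} each carry two paths), so any schedule needs at least max(C, D) = 2 steps.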

Algorithm 108 Direct Tree Routing


We are given a tree, and a set of routing requests. (Since the graph is a tree, each route request will take the direct path between source and destination; in other words, path selection is trivial.) Choose an arbitrary root r. Now sort all packets using the following order (breaking ties arbitrarily): packet p comes before packet q if the path of p reaches a node closer to r than the path of q. Now scan all packets in this order, and for each packet greedily assign its injection time to be the first that does not cause a conflict with any previous packet.

Proof. A packet p first goes up, then down the tree, turning at some node u. Let e_u and e_d be the "up" resp. "down" edge on the path adjacent to u. The injection time of packet p is only delayed by packets that traverse e_u or e_d (if p contends with a packet q on another edge, and packet q has a lower order, then q also contends on e_u or e_d). Since the congestion is C, there are at most 2C − 2 such packets q. Thus the algorithm terminates after 2C + D steps.
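A minimal sketch of Algorithm 108 (helper names and the parent-pointer tree representation are my own; a conflict is checked per undirected edge and time step, a simplification of the store-and-forward model):

```python
from itertools import count

def tree_path(parent, depth, s, t):
    """Direct s-t path in a rooted tree given parent pointers, as a node list."""
    a, b = s, t
    up, down = [a], [b]
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
            up.append(a)
        else:
            b = parent[b]
            down.append(b)
    return up + down[-2::-1]  # climb to the meeting node, then descend to t

def direct_tree_routing(parent, depth, requests):
    """Sort packets by how close to the root their path climbs, then greedily
    give each one the earliest injection time that never puts it on an edge
    already claimed by an earlier packet at the same time step."""
    paths = [tree_path(parent, depth, s, t) for s, t in requests]
    order = sorted(range(len(paths)), key=lambda i: min(depth[v] for v in paths[i]))
    busy, inject = set(), {}
    for i in order:
        edges = [frozenset(e) for e in zip(paths[i], paths[i][1:])]
        for t0 in count():
            if all((e, t0 + j) not in busy for j, e in enumerate(edges)):
                inject[i] = t0
                busy.update((e, t0 + j) for j, e in enumerate(edges))
                break
    return inject, paths
```

On a star with root 0 and leaves 1, 2, the two requests (1, 2) and (2, 1) can both be injected at time 0: they traverse the two edges in opposite order, so they never meet, and the schedule finishes well within the 2C + D bound.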

Remarks:
• [Leighton, Maggs, Rao 1988] have shown the existence of an O(C + D) schedule for any routing problem (on any graph!) using the Lovász Local Lemma. Later the result was made more accessible by [Leighton, Maggs, Richa 1996] and others. Still, it is too hard for this course...

Chapter Notes
See [BH85, LMR88, LM95, KKT91].

Bibliography
[BH85] Allan Borodin and John E. Hopcroft. Routing, Merging, and Sorting on Parallel Models of Computation. J. Comput. Syst. Sci., 30(1):130–145, 1985.

[KKT91] Christos Kaklamanis, Danny Krizanc, and Thanasis Tsantilas. Tight Bounds for Oblivious Routing in the Hypercube. Mathematical Systems Theory, 24(4):223–232, 1991.

[LM95] T. Leighton and B. Maggs. Fast algorithms for finding O(congestion+dilation) packet routing schedules. In Proceedings of the 28th Hawaii International Conference on System Sciences, HICSS '95, pages 555–, Washington, DC, USA, 1995. IEEE Computer Society.

[LMR88] Frank Thomson Leighton, Bruce M. Maggs, and Satish Rao. Universal Packet Routing Algorithms (Extended Abstract). In FOCS, pages 256–269, 1988.
