Lecture Notes in Computer Science
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
5401
Principles of
Distributed Systems
12th International Conference, OPODIS 2008
Luxor, Egypt, December 15-18, 2008
Proceedings
Volume Editors
Theodore P. Baker
Florida State University
Department of Computer Science
207A Love Building, Tallahassee, FL 32306-4530, USA
E-mail: [email protected]
Alain Bui
Université de Versailles-St-Quentin-en-Yvelines
Laboratoire PRiSM
45, avenue des Etats-Unis, 78035 Versailles Cedex, France
E-mail: [email protected]
Sébastien Tixeuil
LIP6 & INRIA Grand Large
Université Pierre et Marie Curie - Paris 6
104 avenue du Président Kennedy, 75016 Paris, France
E-mail: [email protected]
ISSN 0302-9743
ISBN-10 3-540-92220-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-92220-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2008
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12582457
Preface
This volume contains the 30 regular papers, the 11 short papers, and the abstracts
of two invited keynotes that were presented at the 12th International Conference
on Principles of Distributed Systems (OPODIS), held during December 15-18,
2008 in Luxor, Egypt.
OPODIS is a yearly selective international forum for researchers and practitioners in the design and development of distributed systems.
This year, we received 102 submissions from 28 countries. Each submission
was carefully reviewed by three to six Program Committee members with the
help of external reviewers, with 30 regular papers and 11 short papers being
selected. The overall quality of submissions was excellent, and many papers
that had to be rejected because of organizational constraints deserved
to be published. The two invited keynotes dealt with hot topics in distributed
systems: "The Next 700 BFT Protocols" by Rachid Guerraoui and "On Replication of Software Transactional Memories" by Luis Rodrigues.
On behalf of the Program Committee, we would like to thank all authors of
submitted papers for their support. We also thank the members of the Steering Committee for their invaluable advice. We wish to express our appreciation to the Program Committee members and additional external reviewers for
their tremendous effort and excellent reviews. We gratefully acknowledge the
Organizing Committee members for their generous contribution to the success of the symposium. Special thanks go to Thibault Bernard for managing the conference publicity and technical organization. The paper submission
and selection process was greatly eased by the EasyChair conference system
(http://www.easychair.org). We wish to thank the EasyChair creators and
maintainers for their commitment to the scientific community.
December 2008
Ted Baker
Sébastien Tixeuil
Alain Bui
Organization
OPODIS 2008 was organized by PRiSM (Université de Versailles Saint-Quentin-en-Yvelines) and LIP6 (Université Pierre et Marie Curie).
General Chair
Alain Bui
Program Co-chairs
Theodore P. Baker
Sébastien Tixeuil
Program Committee
Björn Andersson
James Anderson
Alan Burns
Andrea Clementi
Liliana Cucu
Shlomi Dolev
Khaled El Fakih
Pascal Felber
Paola Flocchini
Gerhard Fohler
Felix Freiling
Mohamed Gouda
Fabiola Greve
Isabelle Guerin-Lassous
Ted Herman
Anne-Marie Kermarrec
Rastislav Kralovic
Emmanuelle Lebhar
Jane W.S. Liu
Steve Liu
Toshimitsu Masuzawa
Rolf H. Möhring
Bernard Mans
Maged Michael
Mohamed Mosbah
Marina Papatriantafilou
Boaz Patt-Shamir
Raj Rajkumar
Sergio Rajsbaum
Andre Schiper
Sam Toueg
Eduardo Tovar
Koichi Wada
Organizing Committee
Thibault Bernard
Celine Butelle
Publicity Chair
Thibault Bernard
Steering Committee
Alain Bui
Marc Bui
Hacene Fouchal
Roberto Gomez
Nicola Santoro
Philippas Tsigas
Referees
H.B. Acharya
Amitanand Aiyer
Mario Alves
James Anderson
Björn Andersson
Hagit Attiya
Rida Bazzi
Muli Ben-Yehuda
Alysson Bessani
Gaurav Bhatia
Konstantinos Bletsas
Bjoern Brandenburg
Alan Burns
John Calandrino
Pierre Casteran
Daniel Cederman
Keren Censor
Jeremie Chalopin
Claude Chaudet
Yong Hoon Choi
Andrea Clementi
Reuven Cohen
Alex Cornejo
Roberto Cortinas
Pilu Crescenzi
Liliana Cucu
Shantanu Das
Emiliano De Cristofaro
Gianluca De Marco
Carole Delporte
UmaMaheswari Devi
Shlomi Dolev
Pu Duan
Partha Dutta
Khaled El-fakih
Yuval Emek
Hugues Fauconnier
Pascal Felber
Paola Flocchini
Gerhard Fohler
Pierre Fraignaud
Felix Freiling
Zhang Fu
Shelby Funk
Emanuele G. Fusco
Giorgos Georgiadis
Seth Gilbert
Emmanuel Godard
Joel Goossens
Mohamed Gouda
Maria Gradinariu Potop-Butucaru
Vincent Gramoli
Fabiola Greve
Damas Gruska
Isabelle Guerin-Lassous
Phuong Ha Hoai
Ahmed Hadj Kacem
Elyes-Ben Hamida
Danny Hendler
Thomas Herault
Ted Herman
Daniel Hirschkoff
Akira Idoue
Nobuhiro Inuzuka
Taisuke Izumi
Tomoko Izumi
Katia Jaffres-Runser
Prasad Jayanti
Arshad Jhumka
Mohamed Jmaiel
Hirotsugu Kakugawa
Arvind Kandhalu
Yoshiaki Katayama
Branislav Katreniak
Anne-Marie Kermarrec
Ralf Klasing
Boris Koldehofe
Anis Koubaa
Darek Kowalski
Rastislav Kralovic
Evangelos Kranakis
Ioannis Krontiris
Petr Kuznetsov
Mikel Larrea
Erwan Le Merrer
Emmanuelle Lebhar
Hennadiy Leontyev
Xu Li
George Lima
Jane Liu
Steve Liu
Hong Lu
Victor Luchangco
Weiqin Ma
Bernard Mans
Soumaya Marzouk
Toshimitsu Masuzawa
Nicole Megow
Maged Michael
Luis Miguel Pinho
Rolf Möhring
Mohamed Mosbah
Heinrich Moser
Achour Mostefaoui
Junya Nakamura
Alfredo Navarra
Gen Nishikawa
Nicolas Nisse
Luis Nogueira
Koji Okamura
Fukuhito Ooshita
Marina Papatriantafilou
Dana Pardubska
Boaz Patt-Shamir
Andrzej Pelc
David Peleg
Nuno Pereira
Tomas Plachetka
Shashi Prabh
Giuseppe Prencipe
Shi Pu
Raj Rajkumar
Sergio Rajsbaum
Dror Rawitz
Tahiry Razafindralambo
Etienne Riviere
Gianluca Rossi
Anthony Rowe
Nicola Santoro
Gabriel Scalosub
Elad Schiller
Andre Schiper
Nicolas Schiper
Ramon Serna Oliver
Alexander Shvartsman
Riccardo Silvestri
Françoise Simonot-Lion
Alex Slivkins
Jason Smith
Kannan Srinathan
Sebastian Stiller
David Stotts
Weihua Sun
Hakan Sundell
Cheng-Chung Tan
Andreas Tielmann
Sam Toueg
Eduardo Tovar
Corentin Travers
Frederic Tronel
Remi Vannier
Jan Vitek
Roman Vitenberg
Koichi Wada
Timo Warns
Andreas Wiese
Yu Wu
Zhaoyan Xu
Hirozumi Yamaguchi
Yukiko Yamauchi
Keiichi Yasumoto
Table of Contents
Invited Talks
The Next 700 BFT Protocols (Abstract) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Rachid Guerraoui
On Replication of Software Transactional Memories
(Extended Abstract) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Luis Rodrigues
Regular Papers
Write Markers for Probabilistic Quorum Systems . . . . . . . . . . . . . . . . . . . .
Michael G. Merideth and Michael K. Reiter
Group Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Yehuda Afek, Iftah Gamzu, Irit Levy, Michael Merritt, and
Gadi Taubenfeld
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
Byzantine fault-tolerant state machine replication (BFT) has reached a reasonable level of maturity as an appealing, software-based technique for building
robust distributed services with commodity hardware. The current tendency,
however, is to implement a new BFT protocol from scratch for each new application and network environment. This is notoriously difficult. Modern BFT
protocols each require more than 20,000 lines of sophisticated C code, and proving their correctness involves an entire PhD. Maintaining and testing each new
protocol seems just impossible.
This talk will present a candidate abstraction, named ABSTRACT (Abortable
State Machine Replication), to remedy this situation. A BFT protocol is viewed
as a, possibly dynamic, composition of instances of ABSTRACT, each instance
developed and analyzed independently. A new effective BFT protocol can be
developed by adding less than 10% of code to an existing one. Correctness proofs
come within human reach, and even model checking techniques can be envisaged.
To illustrate the ABSTRACT approach, we describe a new BFT protocol we
name Aliph: the first of a hopefully long series of effective yet modular BFT
protocols. The Aliph protocol has a peak throughput that outperforms those of
all BFT protocols we know of by 300%, and a best-case latency that is less than
30% of that of state-of-the-art BFT protocols.
This is joint work with Dr V. Quema (CNRS) and Dr M. Vukolic (IBM).
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, p. 1, 2008.
© Springer-Verlag Berlin Heidelberg 2008
On Replication of
Software Transactional Memories
(Invited Talk)
Luis Rodrigues
INESC-ID/IST
joint work with:
Extended Abstract
Software Transactional Memory (STM) systems have garnered considerable interest of late due to the recent architectural trend that has led to the pervasive
adoption of multi-core CPUs. STMs represent an attractive solution to spare
programmers from the pitfalls of conventional explicit lock-based thread synchronization, leveraging concurrency-control concepts used for decades by
the database community to simplify mainstream parallel programming [1].
As STM systems begin to penetrate the realm of enterprise systems [2,3] and to face the high-availability and scalability requirements
typical of production environments, it is rather natural to foresee the emergence
of replication solutions specifically tailored to enhance the dependability and the
performance of STM systems. Also, since STM and Database Management Systems (DBMS) share the key notion of transaction, it might appear that state-of-the-art database replication schemes, e.g., [4,5,6,7], represent natural candidates
to support STM replication as well.
In this talk, we will first contrast, from a replication-oriented perspective,
the workload characteristics of two standard benchmarks for DBMS and STM,
namely TPC-W [8] and STMBench7 [9]. This will allow us to uncover several
pitfalls related to the adoption of conventional database replication techniques
in the context of STM systems.
In light of this analysis, we will then discuss promising research directions we are currently pursuing in order to develop high-performance replication
strategies able to fit the unique characteristics of STM.
In particular, we will present one of our most recent results in this area which
not only tackles some key issues characterizing STM replication, but actually
represents a valuable tool for the replication of generic services: the Weak Mutual
Exclusion (WME) abstraction. Unlike the classical Mutual Exclusion problem
(ME), which regulates the concurrent access to a single and indivisible shared
resource, the WME abstraction ensures mutual exclusion in the access to a
shared resource that appears as single and indivisible only at a logical level,
while instead being physically replicated for both fault-tolerance and scalability
purposes.
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 2-4, 2008.
© Springer-Verlag Berlin Heidelberg 2008
Unlike ME, which is well known to be solvable only under very constraining synchrony assumptions [10] (essentially only in
synchronous systems), we will show that WME is solvable in an asynchronous
system using an eventually perfect failure detector, ◇P, and prove that ◇P is
actually the weakest failure detector for solving the WME problem. These results imply that, unlike ME, WME is solvable in partially synchronous systems (i.e.,
systems in which the bounds on communication latency and relative process
speed either exist but are unknown, or are known but only guaranteed to
hold starting at some unknown time), which are widely recognized as a realistic
model for large-scale distributed systems [11,12].
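To make the adaptive-timeout intuition behind ◇P concrete, the sketch below is illustrative only (it is not from the talk, and the class and method names are our own): a heartbeat received from a currently suspected process proves the suspicion false, so the process is rehabilitated and its timeout doubled. Once the timeout exceeds the unknown delay bound of the partially synchronous system, suspicions of correct processes cease forever.

```python
# Illustrative sketch (not from the talk) of an eventually perfect failure
# detector built from adaptive timeouts, the standard construction in
# partially synchronous systems: no delay bound needs to be known, only to
# eventually exist.

class EventuallyPerfectFD:
    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.suspected = set()

    def heartbeat(self, p):
        """A heartbeat from p arrived."""
        if p in self.suspected:
            self.suspected.discard(p)   # false suspicion detected:
            self.timeout[p] *= 2.0      # never suspect p this quickly again

    def timer_expired(self, p):
        """timeout[p] elapsed without a heartbeat from p."""
        self.suspected.add(p)

# Message delays of 3.0 exceed the initial timeout, causing false
# suspicions; after two doublings the timeout (4.0) covers the delay and
# the process is never suspected again.
fd = EventuallyPerfectFD(["p"])
for delay in [3.0, 3.0, 3.0, 3.0]:
    if delay > fd.timeout["p"]:
        fd.timer_expired("p")
    fd.heartbeat("p")
print(fd.suspected, fd.timeout["p"])   # set() 4.0
```

In a real deployment the timer would be driven by a clock and network messages; the loop above merely simulates rounds with a delay that is fixed but unknown to the detector.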
However, this is not the only element contributing to the practical relevance
of the WME abstraction. In fact, reliance on the WME abstraction, as a means
of regulating concurrent access to a replicated resource, also provides the
two following important practical benefits:
Robustness: pessimistic concurrency control is widely used in commercial off-the-shelf systems, e.g., DBMSs and operating systems, because of its robustness and predictability in the presence of conflict-intensive workloads. The
WME abstraction lays a bridge between these proven contention-management techniques and replica control schemes. Analogously to centralized lock-based concurrency control, WME proves particularly useful in the context
of conflict-sensitive applications, such as STMs or interactive systems, where
it may be preferable to bridle concurrency rather than incur the costs
of application-level conflicts, such as transaction aborts or re-submission of
user inputs.
Performance: the WME abstraction ensures that users issue operations on
the replicated shared resource in a sequential manner. Interestingly, it has
been shown that, in such a scenario, it is possible to significantly boost the
performance of lower-level abstractions [13,14], such as consensus or atomic
broadcast, which are typically used as building blocks of modern replica
control schemes and which often represent, as in typical STM workloads,
the performance bottleneck of the whole system.
References
1. Adl-Tabatabai, A.R., Kozyrakis, C., Saha, B.: Unlocking concurrency. ACM
Queue 4, 24-33 (2007)
2. Cachopo, J.: Development of Rich Domain Models with Atomic Actions. PhD
thesis, Instituto Superior Técnico/Universidade Técnica de Lisboa (2007)
3. Carvalho, N., Cachopo, J., Rodrigues, L., Rito Silva, A.: Versioned transactional
shared memory for the FenixEDU web application. In: Proc. of the Second Workshop on Dependable Distributed Data Management (in conjunction with Eurosys
2008), Glasgow, Scotland. ACM, New York (2008)
4. Agrawal, D., Alonso, G., Abbadi, A.E., Stanoi, I.: Exploiting atomic broadcast in
replicated databases (extended abstract). In: Lengauer, C., Griebl, M., Gorlatch,
S. (eds.) Euro-Par 1997. LNCS, vol. 1300, pp. 496-503. Springer, Heidelberg (1997)
5. Cecchet, E., Marguerite, J., Zwaenepoel, W.: C-JDBC: flexible database clustering
middleware. In: Proc. of the USENIX Annual Technical Conference, Berkeley, CA,
USA, p. 26. USENIX Association (2004)
6. Patiño-Martínez, M., Jiménez-Peris, R., Kemme, B., Alonso, G.: Scalable replication in database clusters. In: Proc. of the 14th International Conference on Distributed Computing, London, UK, pp. 315-329. Springer, Heidelberg (2000)
7. Pedone, F., Guerraoui, R., Schiper, A.: The database state machine approach.
Distributed and Parallel Databases 14, 71-98 (2003)
8. Transaction Processing Performance Council: TPC Benchmark™ W, Standard
Specification, Version 1.8. Transaction Processing Performance Council (2002)
9. Guerraoui, R., Kapalka, M., Vitek, J.: STMBench7: a benchmark for software transactional memory. SIGOPS Oper. Syst. Rev. 41, 315-324 (2007)
10. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R., Kouznetsov, P.: Mutual exclusion in asynchronous systems with failure detectors. J. Parallel Distrib. Comput. 65,
492-505 (2005)
11. Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. ACM 35, 288-323 (1988)
12. Cristian, F., Fetzer, C.: The timed asynchronous distributed system model. IEEE
Transactions on Parallel and Distributed Systems 10, 642-657 (1999)
13. Brasileiro, F.V., Greve, F., Mostefaoui, A., Raynal, M.: Consensus in one communication step. In: Proc. of the International Conference on Parallel Computing
Technologies, pp. 42-50 (2001)
14. Lamport, L.: Fast Paxos. Distributed Computing 19, 79-103 (2006)
Introduction
A quorum system 𝒬 over a universe of n servers is a set of subsets (quorums) of the servers such that every two quorums intersect:

Q ∩ Q′ ≠ ∅ for all Q, Q′ ∈ 𝒬.   (1)
For this reason, Malkhi and Reiter [2] proposed various ways of strengthening
the intersection property (1) so as to enable quorums to be used in Byzantine
environments. For example, an alternative to (1) is

|Q ∩ Q′ \ B| > |Q′ ∩ B|   (2)

for all Q, Q′ ∈ 𝒬, where B is the (unknown) set of all (up to b) servers that are
faulty. In other words, the intersection of any two quorums contains more non-faulty servers than there are faulty servers in either quorum. As such, the responses from
these non-faulty servers will outnumber those from faulty ones. These quorum
systems are called masking quorum systems.
Opaque quorum systems have an even more stringent requirement as an alternative to (1):

|Q ∩ Q′ \ B| > |(Q′ ∩ B) ∪ (Q′ \ Q)|   (3)

for all Q, Q′ ∈ 𝒬. In other words, the number of correct servers in the intersection
of Q and Q′ (i.e., |Q ∩ Q′ \ B|) exceeds the number of faulty servers in Q′ (i.e.,
|Q′ ∩ B|) together with the number of servers in Q′ but not Q. The rationale
for this property can be seen by considering the servers in Q′ but not Q as
outdated, in the sense that if Q was used to perform an update to the system,
then those servers in Q′ \ Q are unaware of the update. As such, if the faulty
servers in Q′ behave as the outdated ones do, their behavior (i.e., their responses)
will dominate that of the correct servers in the intersection (Q ∩ Q′ \ B) unless
(3) holds.
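As an illustration (our own sketch, not from the paper), properties (2) and (3) can be checked mechanically on small, explicitly enumerated quorum systems; the function names and example parameters below are ours:

```python
# Check the strict masking (2) and opaque (3) intersection properties on an
# explicitly enumerated quorum system. B is the set of faulty servers.
from itertools import product

def is_masking(quorums, B):
    # (2): |Q ∩ Q' \ B| > |Q' ∩ B| for all Q, Q'
    return all(len((Q & Q2) - B) > len(Q2 & B)
               for Q, Q2 in product(quorums, repeat=2))

def is_opaque(quorums, B):
    # (3): |Q ∩ Q' \ B| > |(Q' ∩ B) ∪ (Q' \ Q)| for all Q, Q'
    return all(len((Q & Q2) - B) > len((Q2 & B) | (Q2 - Q))
               for Q, Q2 in product(quorums, repeat=2))

# n = 5 servers, quorums of size 4, one faulty server: the masking property
# holds, but the more stringent opaque property already fails.
servers = set(range(5))
quorums = [servers - {i} for i in servers]
B = {0}
print(is_masking(quorums, B))   # True
print(is_opaque(quorums, B))    # False
```

The example illustrates the cost of stringency discussed next: the same system that masks one fault fails the opaque requirement, because a faulty server colluding with the outdated servers of another quorum can match the correct votes in the intersection.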
The increasingly stringent properties of Byzantine quorum systems come with
costs in terms of the smallest system sizes that can be supported while tolerating
a number b of faults [2]. This implies that a system with a fixed number of
servers can tolerate fewer faults when the property is more stringent, as seen in
Table 1, which refers to the quorums just discussed as strict. Table 1 also shows
the negative impact on the ability of the system to disperse load amongst the
replicas, as discussed next.
Naor and Wool [3] introduced the notion of an access strategy by which clients
select quorums to access. An access strategy p : 𝒬 → [0, 1] is simply a probability distribution on quorums, i.e., Σ_{Q∈𝒬} p(Q) = 1. Intuitively, when a client
accesses the system, it does so at a quorum selected randomly according to the
distribution p.
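For illustration (our own sketch, not the paper's), the load induced by one particular access strategy can be computed directly; the load L(𝒬) of the system is then the minimum of this quantity over all strategies:

```python
# Access probability of the busiest server under a given access strategy p.
# The load L(Q) of the quorum system is the minimum of this over all
# possible strategies.
from collections import defaultdict
from itertools import combinations

def load_under_strategy(quorums, p):
    """quorums: list of frozensets of servers; p: matching probabilities."""
    access = defaultdict(float)
    for Q, prob in zip(quorums, p):
        for server in Q:
            access[server] += prob
    return max(access.values())

# Majority quorums over n = 5 servers with the uniform strategy: every
# server lies in the same number of quorums, so each is accessed with
# probability 3/5 (quorum size over n), optimal here by symmetry.
quorums = [frozenset(c) for c in combinations(range(5), 3)]
p = [1.0 / len(quorums)] * len(quorums)
print(round(load_under_strategy(quorums, p), 9))   # 0.6
```

The uniform strategy is optimal for this symmetric system; for asymmetric quorum systems, finding the load-minimizing strategy is a linear program over p.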
The formalization of an access strategy is useful as a tool for discussing the
load-dispersing properties of quorums. The load [3] of a quorum system, L(𝒬), is
the probability with which the busiest server is accessed in a client access, under
the best possible access strategy p. As listed in Table 1, tight lower bounds
have been proven for the load of each type of strict Byzantine quorum system.
The load for opaque quorum systems is particularly unfortunate: systems that
utilize opaque quorum systems cannot effectively disperse processing load across
more servers (i.e., by increasing n) because the load is at least a constant. Such
Byzantine quorum systems are used by many modern Byzantine-fault-tolerant
protocols, e.g., [4,5,6,7,8,9], in order to tolerate the arbitrary failure of a subset
of their replicas. As such, circumventing the bounds is an important topic.
Our primary contributions are (i) the identification and analysis of the benefits
of write markers; and (ii) a proposed implementation of write markers that
handles the complexities of tolerating Byzantine clients. Our analysis yields the
following results:
Masking Quorums: We show that the use of write markers allows probabilistic
masking quorum systems to tolerate up to b < n/2 faults when quorums are of
size ω(√n). Setting all quorums to size ℓ√n for some constant ℓ, we achieve
a load that is asymptotically optimal for any quorum system, i.e., ℓ√n/n =
O(1/√n) [3].
This represents an improvement in both the load and the number of faults that can
be tolerated. Probabilistic masking quorums without write markers can tolerate
up to b < n/2.62 faults [11] and achieve load no better than Θ(b/n) [10]. In
addition, the maximum number of faults that can be tolerated is tied to the size
of quorums [10]. Thus, without write markers, achieving optimal load requires
tolerating fewer faults. Strict masking quorum systems can tolerate (only) up to
b < n/4 faults [2] and can achieve load Θ(√(b/n)) [12].
Opaque Quorums: We show that the use of write markers allows probabilistic opaque quorum systems to tolerate up to b < n/2.62 faults. We present a
construction with load O(b/n) when b = Ω(√n), thereby breaking the constant
lower bound of 1/2 on the load of strict opaque quorum systems [2]. Moreover,
if b = O(√n), we can set all quorums to size ℓ√n for some constant ℓ, in order
to achieve a load that is asymptotically optimal for any quorum system, i.e.,
ℓ√n/n = O(1/√n) [3].
This represents an improvement in both the load and the number of faults that can
be tolerated. Probabilistic opaque quorum systems without write markers can
tolerate (only) up to b < n/3.15 faults [11]. Strict opaque quorum systems can
tolerate (only) up to b < n/5 faults [2]; these quorum systems can do no better
than constant load even if b = 0 [2].
called a conflicting candidate. Two candidates may conflict because, e.g., they
both bear the same timestamp. In either masking or opaque quorum systems,
a faulty server may try to forge a conflicting candidate. No non-faulty server
accepts two candidates that conflict with each other.
A server can try to vote for some candidate (e.g., by responding to a read
operation) if the server is a participant in voting (i.e., if the server is a member
of the client's read access set). However, a server becomes qualified to vote for
a particular candidate only if the server is a member of the client's write access
set selected for the write operation for which it votes. Non-faulty clients wait for
responses from a read quorum of size q_rd contained in the read access set of size
a_rd. An error is said to occur in a read operation when a non-faulty client fails
to observe the latest value or a faulty client obtains sufficiently many votes for
a conflicting value.¹ The error probability is the probability of this occurring.
Behavior of faulty clients. We assume that faulty clients seek to maximize
the error probability by following specific strategies [11]. This is a conservative
assumption; a client cannot increase, but may decrease, the probability of error
by failing to follow these strategies. At a high level, the strategies are as follows:
a faulty client, which may be completely restricted in its choices: (i) when establishing a candidate, writes the candidate to as few non-faulty servers as possible
to minimize the probability that it is observed by a non-faulty client; and (ii)
writes a conflicting candidate to as many servers as will accept it (i.e., faulty
servers plus, in the case of an opaque quorum system, any non-faulty server that
has not accepted the established candidate) in order to maximize the probability
that it is observed.
Faulty clients may be able to affect the system with such votes in some protocols [11].
Thus, write markers remove the advantage enjoyed by faulty servers in strict
and traditional-probabilistic masking and opaque quorum systems, where any
faulty participant can vote for any candidate, and can therefore collude to have
a conflicting, potentially fabricated candidate chosen instead of an established
candidate. This aspect of write markers is summarized in Table 2, which shows
the impact of write markers in terms of the abilities of faulty and non-faulty
servers to vote for a given candidate.
3.1 Consistency Constraints

First, the constraints must ensure that an established candidate is observed, i.e., that sufficiently many non-faulty servers vote for it in expectation; the relevant quantity is

E[|(Q_rd ∩ Q_wt) \ B|].   (4)

The use of write markers has no impact here on (4) because (Q_rd ∩ Q_wt) \ B
contains no faulty servers. However, write markers do enable us to set r smaller,
as the following shows.
Second, the constraints must ensure that a conflicting candidate (which is in
conflict with an established candidate as described in Section 2) is, in expectation, not observed by any client (non-faulty or faulty). In general, it is important
for all clients to observe only established candidates so as to enable higher-level
protocols (e.g., [4]) that employ repair phases that may affect the state of the
system within a read [11]. Let A_rd and A_wt represent read and write access sets,
respectively, chosen uniformly at random. (Think of A_wt as the access set used by
a faulty client for a conflicting candidate, and of A_rd as the access set used by a
faulty client for a read operation. How faulty clients can be forced to choose uniformly at random is described in Section 4.) We consider the cases for masking
and opaque quorums separately:
Probabilistic Masking Quorums. With write markers, only the faulty servers that are qualified participants can vote for a conflicting candidate, so the consistency constraint becomes:

E[|(Q_rd ∩ Q_wt) \ B|] > E[|(A_rd ∩ A_wt) ∩ B|].   (5)

Contrast this with (2) and with the consistency requirement for traditional probabilistic masking quorum systems [10] (adapted to consider access sets), which
requires that the faulty participants (qualified or not) cannot produce sufficient
votes for a candidate to be observed in expectation:

E[|(Q_rd ∩ Q_wt) \ B|] > E[|A_rd ∩ B|].   (6)

Intuitively, the intersection between access sets can be smaller with write markers
because the right-hand side of (5) is less than the right-hand side of (6) if
a_wt < n.
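The gap between the two right-hand sides can be quantified under the uniform-selection assumption (our own back-of-the-envelope numbers, not from the paper): a faulty server lands in A_rd with probability a_rd/n, and is additionally qualified only if it also lands in A_wt.

```python
# Expected number of faulty voters with and without write markers, when
# access sets are chosen uniformly at random. Without markers every faulty
# member of A_rd can vote: E|A_rd ∩ B| = a_rd*b/n (right-hand side of (6)).
# With markers only qualified faulty servers vote:
# E|A_rd ∩ A_wt ∩ B| = a_rd*a_wt*b/n^2 (right-hand side of (5)),
# smaller by the factor a_wt/n < 1.
n, b = 100, 20          # servers and faulty servers (example values)
a_rd, a_wt = 40, 40     # read and write access-set sizes (example values)

rhs6 = a_rd * b / n             # faulty voters, no write markers
rhs5 = a_rd * a_wt * b / n**2   # qualified faulty voters, with markers
print(rhs6, rhs5)   # 8.0 3.2
```

With these example numbers, write markers cut the adversary's expected vote count by the factor a_wt/n = 0.4, which is exactly the slack that lets the quorum intersections, and hence the threshold r, be smaller.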
Probabilistic Opaque Quorums. With write markers, we have the benefit, described above for probabilistic masking quorums, in terms of the number of
faulty participants that can vote for a candidate in expectation. However, as
shown in (3), opaque quorum systems must additionally consider the maximum
number of non-faulty qualified participants that vote for the same conflicting
candidate in expectation. As such, instead of (5), we have:

E[|(Q_rd ∩ Q_wt) \ B|] > E[|(A_rd ∩ A_wt) ∩ B|] + E[|((A_rd ∩ A_wt) \ B) \ Q_wt|].   (7)

Contrast this with the consistency requirement for traditional probabilistic
opaque quorums [11]:

E[|(Q_rd ∩ Q_wt) \ B|] > E[|A_rd ∩ B|] + E[|((A_rd ∩ A_wt) \ B) \ Q_wt|].   (8)

Again, intuitively, the intersection between access sets can be smaller with write
markers because the right-hand side of (7) is less than the right-hand side of (8)
if a_wt < n.
3.2 Implied Bounds

In this subsection, we are concerned with quorum systems for which we can
achieve error probability (as defined in Section 2) no greater than a given ε for
any n sufficiently large. For such quorum systems, there is an upper bound on b
in terms of n, akin to the bound for strict quorum systems.
Intuitively, the maximum value of b is limited by the relevant constraint (i.e.,
either (5) or (7)). Of primary interest are Theorem 1 and its corollaries, which
demonstrate the benefits of write markers for probabilistic masking quorum systems, and Theorem 2 and its corollaries, which demonstrate the benefits of write
markers for probabilistic opaque quorum systems. They utilize Lemmas 1 and 2,
which together present basic requirements for the types of quorum systems with
which we are concerned. Due to space constraints, proofs of the lemmas and
theorems appear only in a companion technical report [15].
Define MinCorrect to be a random variable for the number of non-faulty servers
with the established candidate, i.e., MinCorrect = |(Q_rd ∩ Q_wt) \ B| as indicated
in (4).
Lemma 1. Let n − b = Ω(n). For all c > 0 there is a constant d > 1 such that
for all q_rd, q_wt where q_rd · q_wt > dn and (q_rd · q_wt)/n = ω(1), it is the case that
E[MinCorrect] > c for all n sufficiently large.
Let r be the threshold, discussed in Section 3.1, for the number of votes necessary to observe a candidate. Define MaxConflicting to be a random variable for
the maximum number of servers that vote for a conflicting candidate. For example: due to (5), in masking quorums with write markers, MaxConflicting =
|(A_rd ∩ A_wt) ∩ B|; and due to (7), in opaque quorums with write markers,
MaxConflicting = |(A_rd ∩ A_wt) ∩ B| + |((A_rd ∩ A_wt) \ B) \ Q_wt|.
Lemma 2. Let the following hold:²

E[MinCorrect] − E[MaxConflicting] > 0,
E[MinCorrect] − E[MaxConflicting] = ω(√(E[MinCorrect])).

Then it is possible to set r such that

error probability → 0

as E[MinCorrect] → ∞.
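The concentration mechanism behind Lemma 2 can be illustrated numerically (our own Monte Carlo sketch; the parameter values are arbitrary): MinCorrect clusters around its mean, so a threshold r placed well below E[MinCorrect] (and above E[MaxConflicting]) is rarely crossed.

```python
# Monte Carlo sketch: sample read and write quorums uniformly at random and
# observe that MinCorrect = |(Q_rd ∩ Q_wt) \ B| concentrates around its
# mean, roughly q*q*(n-b)/n^2, so a threshold r below the mean is rarely
# undershot.
import random

def sample_min_correct(n, b, q, trials, rng):
    servers = list(range(n))
    faulty = set(range(b))          # w.l.o.g. servers 0..b-1 are faulty
    counts = []
    for _ in range(trials):
        q_rd = set(rng.sample(servers, q))
        q_wt = set(rng.sample(servers, q))
        counts.append(len((q_rd & q_wt) - faulty))
    return counts

rng = random.Random(1)
n, b, q = 400, 100, 100
counts = sample_min_correct(n, b, q, trials=2000, rng=rng)
mean = sum(counts) / len(counts)    # close to q*q*(n-b)/n**2 = 18.75
r = 10                              # threshold well below the mean
err = sum(c < r for c in counts) / len(counts)
print(round(mean, 1), err)          # mean near 18.75; undershoot is rare
```

As the expectations grow (larger quorums relative to √n), the relative spread shrinks, which is why the ω(√·) gap in the lemma drives the error probability to zero.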
In other words, with write markers, the size of quorums does not impact the
maximum fraction of faults that can be tolerated when quorums are selected
uniformly at random (i.e., when a_rd = q_rd and a_wt = q_wt).
Corollary 2. Let a_rd = q_rd, a_wt = q_wt, and b < n/2. For all ε there is a
constant ℓ > 1 such that if q_rd = q_wt = ℓ√n, any such probabilistic masking
quorum system employing write markers achieves error probability no greater
than ε given a suitable setting of r for all n sufficiently large, and has load
O(1/√n).
Corollary 3. Let a_rd = q_rd and a_wt = q_wt. For all ε there is a constant d > 1
such that for all q_rd, q_wt where q_rd · q_wt > dn, (q_rd · q_wt)/n = ω(1), and

b < (q_wt · n)/(q_wt + n),

any such probabilistic opaque quorum system employing write markers achieves
error probability no greater than ε given a suitable setting of r for all n sufficiently
large.
Comparing Corollary 3 with Corollary 1, we see that in the opaque quorum case
q_wt cannot be set independently of b.
Corollary 4. Let a_rd = q_rd, a_wt = q_wt, and b < (q_wt · n)/(q_wt + n). For all ε
there is a constant d > 1 such that for all q_rd, q_wt where q_rd · q_wt > dn and
(q_rd · q_wt)/n = ω(1), any such probabilistic opaque quorum system employing write
markers achieves error probability no greater than ε given a suitable setting of r
for all n sufficiently large, and has load Ω(b/n).
Corollary 5. Let b = Ω(√n). For all ε there is a constant d > 1 such that
for all a_rd, a_wt, q_rd, q_wt where a_rd = a_wt = q_rd = q_wt = lb for a value l such
that c ≥ l > n/(n − b) for some constant c, (lb)² > dn and (lb)²/n = ω(1),
any such probabilistic opaque quorum system employing write markers achieves
error probability no greater than ε given a suitable setting of r for all n sufficiently
large, and has load O(b/n).
Corollary 6. Let a_rd = q_rd and a_wt = q_wt = n − b. For all ε there is a constant
d > 1 such that for all q_rd, q_wt where q_rd · q_wt > dn, (q_rd · q_wt)/n = ω(1), and
b < n/2.62,
any such probabilistic opaque quorum system employing write markers achieves
error probability no greater than ε given a suitable setting of r for all n sufficiently
large.
4 Implementation
Non-faulty clients should choose a new access set for each operation to ensure independence from the decisions of faulty clients [11].
Figures 1, 2, 3, and 4 illustrate the relevant pieces of the preexisting protocol and our modifications for write markers in the context of read and write operations in probabilistic masking and opaque quorum systems. The figures highlight that the additions to the protocol for write markers involve saving the write markers and returning them to clients so that clients can also verify them.

Fig. 2. Message types (write marker emphasized with gray)
The dierences in the structure of the write marker for probabilistic opaque
and masking quorum systems mentioned above results in subtly dierent guarantees. The remainder of the section discusses these details.
Masking write
4.1
Even if the access set contains all of the faulty servers, i.e., B ⊆ A_wt, this becomes (6), the constraint on probabilistic masking quorum systems without write markers. In effect, a faulty client must either: (i) use a recent access set that is therefore chosen approximately uniformly at random, and be limited by (7); or (ii) use a stale access set and be limited by (6). If quorums are the sizes of access sets, both inequalities have the same upper bound on b (see [15]); otherwise, a faulty client is disadvantaged by using a stale access set because a system that satisfies (6) can tolerate more faults than one that satisfies (7), and is therefore less likely to result in error (see [15]).
[Fig. 3. Write operation in opaque quorum systems: messages and stages of verification of write marker (changes in gray)]
Protocols for masking quorum systems involve an additional round of communication (an echo phase, c.f., [8], or broadcast phase, c.f., [18]) during write operations in order to tolerate Byzantine or concurrent clients. This round prevents non-faulty servers from accepting conflicting data values, as assumed by (2). In order to write a data value, a client must first obtain a write certificate (a quorum of replies that together attest that the non-faulty servers will accept no conflicting data value). In contrast to optimistic protocols that use opaque quorum systems, these protocols are pessimistic.
This additional round allows us to prevent clients from using stale access sets. Specifically, in the request to authorize a data value (message in Figure 2 and Figure 4), the client sends the access set identifier (including the VRV), the solution to the puzzle enabling use of this access set, and the data value. We require that the certificate come from servers in the access set that is chosen for the write operation. Each server verifies the VRV and that the puzzle solution enables use of the indicated access set before returning authorization (message in Figure 2 and Figure 4). The non-faulty servers that contribute to the certificate all implicitly agree that the access set is not stale, for otherwise they would not agree to the write. This certificate (sent to each server in message in Figure 2 and Figure 4) is stored along with the data value as a write marker. Thus, unlike in probabilistic opaque quorum systems, a verifiable write marker in a probabilistic masking quorum system implies that a stale access set was not used. The reading client verifies the certificate (returned in message ii in Figure 1 and Figure 2) before accepting a vote for a candidate. Because a writing client will be unable to obtain a certificate for a stale access set, votes for such a candidate will be rejected by reading clients. Therefore, the analysis in Section 3 applies without additional complications.
[Fig. 4. Write operation in masking quorum systems: messages and stages of verification of write marker (changes in gray)]
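As an illustration of the certificate check described above, the sketch below models servers authorizing an (access set, value) pair and a reader accepting a write marker only when q_wt valid authorizations come from servers in the chosen access set. The HMAC-based authorization, the key table, and all names here are assumptions of the sketch; the paper does not fix a concrete authentication mechanism, VRV format, or puzzle scheme.

```python
import hmac, hashlib

# Hypothetical shared keys between the verifier and each server; HMAC
# stands in for whatever authentication mechanism the real protocol uses.
KEYS = {s: bytes([s]) * 16 for s in range(10)}

def authorize(server, access_set_id, value):
    # A server's authorization binds the access set identifier and the value.
    msg = repr((access_set_id, value)).encode()
    return hmac.new(KEYS[server], msg, hashlib.sha256).digest()

def verify_certificate(cert, access_set, access_set_id, value, q_wt):
    """Accept iff at least q_wt servers *from the chosen access set*
    produced valid authorizations for this (access set, value) pair."""
    valid = {s for s, tag in cert.items()
             if s in access_set
             and hmac.compare_digest(tag, authorize(s, access_set_id, value))}
    return len(valid) >= q_wt

access_set = {0, 1, 2, 3, 4}
cert = {s: authorize(s, "as-7", b"v1") for s in access_set}
print(verify_certificate(cert, access_set, "as-7", b"v1", q_wt=4))    # True
# A certificate built for a stale access set fails verification:
print(verify_certificate(cert, access_set, "as-old", b"v1", q_wt=4))  # False
```

Because the authorization covers the access set identifier, a certificate gathered for one access set cannot be replayed as a write marker for another, which is the property the reading client relies on.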
Conclusion
We have presented write markers, a way to improve the load of masking and opaque quorum systems asymptotically. Moreover, our new masking and opaque probabilistic quorum systems with write markers can tolerate an additional 24% and 17% of faulty replicas, respectively, compared with the proven bounds of probabilistic quorum systems without write markers. Write markers achieve this by limiting the extent to which Byzantine-faulty servers may cooperate to provide incorrect values to clients. We have presented a proposed implementation of write markers that is designed to be effective even while tolerating Byzantine-faulty clients and servers.
References
1. Lamport, L., Shostak, R., Pease, M.: The Byzantine generals problem. ACM Transactions on Programming Languages and Systems 4, 382–401 (1982)
2. Malkhi, D., Reiter, M.: Byzantine quorum systems. Distributed Computing 11, 203–213 (1998)
3. Naor, M., Wool, A.: The load, capacity, and availability of quorum systems. SIAM Journal on Computing 27, 423–447 (1998)
4. Abd-El-Malek, M., Ganger, G.R., Goodson, G.R., Reiter, M.K., Wylie, J.J.: Fault-scalable Byzantine fault-tolerant services. In: Symposium on Operating Systems Principles (2005)
5. Castro, M., Liskov, B.: Practical Byzantine fault tolerance. In: Symposium on Operating Systems Design and Implementation (1999)
6. Goodson, G.R., Wylie, J.J., Ganger, G.R., Reiter, M.K.: Efficient Byzantine-tolerant erasure-coded storage. In: International Conference on Dependable Systems and Networks (2004)
7. Kong, L., Manohar, D., Subbiah, A., Sun, M., Ahamad, M., Blough, D.: Agile store: Experience with quorum-based data replication techniques for adaptive Byzantine fault tolerance. In: IEEE Symposium on Reliable Distributed Systems, pp. 143–154 (2005)
8. Malkhi, D., Reiter, M.K.: An architecture for survivable coordination in large distributed systems. IEEE Transactions on Knowledge and Data Engineering 12, 187–202 (2000)
9. Martin, J.P., Alvisi, L.: Fast Byzantine consensus. IEEE Transactions on Dependable and Secure Computing 3, 202–215 (2006)
10. Malkhi, D., Reiter, M.K., Wool, A., Wright, R.N.: Probabilistic quorum systems. Information and Computation 170, 184–206 (2001)
11. Merideth, M.G., Reiter, M.K.: Probabilistic opaque quorum systems. In: International Symposium on Distributed Computing (2007)
12. Malkhi, D., Reiter, M.K., Wool, A.: The load and availability of Byzantine quorum systems. SIAM Journal of Computing 29, 1889–1906 (2000)
13. Yu, H.: Signed quorum systems. Distributed Computing 18, 307–323 (2006)
14. Herlihy, M., Wing, J.: Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12, 463–492 (1990)
15. Merideth, M.G., Reiter, M.K.: Write markers for probabilistic quorum systems. Technical Report CMU-CS-07-165R, Computer Science Department, Carnegie Mellon University (2008)
16. Juels, A., Brainard, J.: Client puzzles: A cryptographic countermeasure against connection depletion attacks. In: Network and Distributed Systems Security Symposium, pp. 151–165 (1999)
17. Malkhi, D., Mansour, Y., Reiter, M.K.: Diffusion without false rumors: On propagating updates in a Byzantine environment. Theoretical Computer Science 299, 289–306 (2003)
18. Martin, J.P., Alvisi, L., Dahlin, M.: Minimal Byzantine storage. In: International Symposium on Distributed Computing (2002)
19. Abraham, I., Malkhi, D.: Probabilistic quorums for dynamic systems. Distributed Computing 18, 113–124 (2005)
20. Du, W., Deng, J., Han, Y.S., Varshney, P.K., Katz, J., Khalili, A.: A pairwise key predistribution scheme for wireless sensor networks. ACM Transactions on Information and System Security 8, 228–258 (2005)
21. Luo, J., Hubaux, J.P., Eugster, P.T.: PAN: providing reliable storage in mobile ad hoc networks with probabilistic quorum systems. In: International Symposium on Mobile Ad Hoc Networking and Computing, pp. 1–12 (2003)
22. Lee, H., Welch, J.L.: Applications of probabilistic quorums to iterative algorithms. In: International Conference on Distributed Computing Systems, pp. 21–30 (2001)
23. Lee, H., Welch, J.L.: Randomized shared queues applied to distributed optimization algorithms. In: International Symposium on Algorithms and Computation (2001)
24. Alvisi, L., Malkhi, D., Pierce, E., Reiter, M.K.: Fault detection for Byzantine quorum systems. IEEE Transactions on Parallel and Distributed Systems 12, 996–1007 (2001)
Abstract. Consensus is a fundamental building block used to solve many practical problems that arise in reliable distributed systems. Although consensus has been widely studied in the context of classical networks, few studies have addressed it in the context of dynamic, self-organizing systems characterized by unknown networks. While in a classical network the set of participants is static and known, in unknown networks the set and number of participants are not known in advance. This work goes one step further and studies the problem of Byzantine Fault-Tolerant Consensus with Unknown Participants, namely BFT-CUP. This new problem aims at solving consensus in unknown networks with the additional requirement that participants in the system can behave maliciously. This paper presents a solution for BFT-CUP that does not require digital signatures. The algorithms are shown to be optimal in terms of synchrony and knowledge connectivity among participants in the system.
Keywords: Consensus, Byzantine fault tolerance, Self-organizing systems.
1 Introduction
The consensus problem [1,2,3,4,5], and more generally the agreement problems, form the basis of almost all solutions related to the development of reliable distributed systems. Through these protocols, participants are able to coordinate their actions in order to maintain state consistency and ensure system progress. This problem has been extensively studied in classical networks, where the set of processes involved in a particular computation is static and known by all participants in the system. Nonetheless, even in these environments, the consensus problem has no deterministic solution in the presence of a single process crash when entities behave asynchronously [2].
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 22–40, 2008.
© Springer-Verlag Berlin Heidelberg 2008
In self-organizing systems, such as wireless mobile ad-hoc networks, sensor networks and, in a different context, unstructured peer-to-peer (P2P) networks, solving consensus is even more difficult. In these environments, initial knowledge about the participants in the system is a strong assumption to adopt, and the number of participants and their knowledge cannot be determined in advance. These environments indeed define a new model of distributed systems with essential differences from the classical one, and thus bring new challenges to the specification and resolution of fundamental problems. In the case of consensus, the majority of existing protocols are not suitable for the new dynamic model because their computation model consists of a set of initially known nodes. The only notable exceptions are the works of Cavin et al. [6,7] and Greve et al. [8].
Cavin et al. [6,7] defined a new problem named FT-CUP (fault-tolerant consensus with unknown participants), which keeps the consensus definition but assumes that nodes are not aware of Π, the set of processes in the system. They identified necessary and sufficient conditions for solving FT-CUP concerning knowledge about the system composition and synchrony requirements regarding failure detection. They concluded that in order to solve FT-CUP in a scenario with the weakest knowledge connectivity, the strongest synchrony conditions are necessary, which are represented by failure detectors of the class P [4].
Greve and Tixeuil [8] show that there is in fact a trade-off between knowledge connectivity and synchrony for consensus in fault-prone unknown networks. They provide an alternative solution for FT-CUP which requires minimal synchrony assumptions; indeed, the same assumptions already identified to solve consensus in a classical environment, which are represented by failure detectors of the class S [4]. The approach followed in the design of their FT-CUP protocol is modular: initially, algorithms identify a set of participants in the network that share the same view of the system. Subsequently, any classical consensus protocol, such as those initially designed for traditional networks, can be reused and executed by these participants.
Our work extends these results and studies the problem of Byzantine Fault-Tolerant Consensus with Unknown Participants (BFT-CUP). This new problem aims at solving CUP in unknown networks with the additional requirement that participants in the system can behave maliciously [1]. The main contribution of the paper is then the identification of necessary and sufficient conditions for solving BFT-CUP. More specifically, an algorithm for solving BFT-CUP is presented for a scenario which does not require the use of digital signatures (a major source of performance overhead in Byzantine fault-tolerant protocols [9]). Finally, we show that this algorithm is optimal in terms of synchrony and knowledge connectivity requirements, establishing the necessary and sufficient conditions for BFT-CUP solvability in this context.
The paper is organized as follows. Section 2 presents our system model and the concept of participant detectors, among other preliminary definitions used in this paper. Section 3 describes a basic dissemination protocol used for process communication. BFT-CUP protocols and the respective necessity and sufficiency proofs are described in Section 4. Section 5 presents some comments about our protocol. Section 6 presents our final remarks.
2 Preliminaries
2.1 System Model
We consider a distributed system composed of a finite set Π of n processes (also called participants or nodes) drawn from a larger universe U. In a known network, Π and n are known to every participating process, while in an unknown network, a process i may only be aware of a subset Πi ⊆ Π.
Processes are subject to Byzantine failures [1], i.e., they can deviate arbitrarily from the algorithm they are specified to execute and work in collusion to corrupt the system behavior. Processes that do not follow their algorithm in some way are said to be faulty. A process that is not faulty is said to be correct. Although a process does not know all participants of the system, it does know the expected maximum number of processes that may fail, denoted by f. Moreover, we assume that all processes have a unique id, and that it is infeasible for a faulty process to obtain additional ids to be able to launch a Sybil attack [10] against the system.
Processes communicate by sending and receiving messages through authenticated and reliable point-to-point channels established between known processes¹. Authenticity of messages disseminated to a not yet known node is verified through message channel redundancy, as explained in Section 3. A process i may only send a message directly to another process j if j ∈ Πi, i.e., if i knows j. Of course, if i sends a message to j such that i ∉ Πj, then upon receipt of the message, j may add i to Πj, i.e., j now knows i and becomes able to send messages to it. We assume the existence of an underlying routing layer resilient to Byzantine failures [11,12,13], in such a way that if j ∈ Πi and there is sufficient network connectivity, then i can send a message reliably to j. For example, [12] presents a secure multipath routing protocol that guarantees proper communication between two processes provided that there is at least one path between these processes that is not compromised, i.e., none of its processes or channels are faulty.
There are no assumptions on the relative speed of processes or on message transfer
delays, i.e., the system is asynchronous. However, the protocol presented in this paper
uses an underlying classical Byzantine consensus that could be implemented over an
eventually synchronous system [14] (e.g., Byzantine Paxos [9]) or over a completely
asynchronous system (e.g., using a randomized consensus protocol [5,15,16]). Thus,
our protocol requires the same level of synchrony required by the underlying classical
Byzantine consensus protocol.
2.2 Participant Detectors
To solve any nontrivial distributed problem, processes must somehow get a partial
knowledge about the others if some cooperation is expected. The participant detector oracle, namely PD, was proposed to handle this subset of known processes [6]. It
can be seen as a distributed oracle that provides hints about the participating processes
in the computation. Let i.PD be defined as the participant detector of a process i. When
¹ Without authenticated channels it is not possible to tolerate process misbehavior in an asynchronous system, since a single faulty process can play the roles of all other processes to some (victim) process.
[Figure: knowledge connectivity graphs with components A and B and a sink component: (a) 2-OSR; (b) 3-OSR]
subsequent calls. This ensures that the partial view about the initial composition of the system is consistent for all nodes in the system, which defines a common knowledge connectivity graph Gdi. Also, in this work we say that some participant p is a neighbor of another participant i iff p ∈ i.PD.
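The neighbor relation just defined can be modeled directly, with the participant detector as the out-neighborhood of a node in a directed knowledge graph. The sketch below is illustrative; the graph is hypothetical, chosen only to exercise the oracle:

```python
# Directed knowledge graph: an edge i -> p means i initially knows p.
# This particular graph is a made-up example, not one from the paper.
GRAPH = {
    1: {2, 3},
    2: {3, 4},
    3: {4},
    4: {3},   # 3 and 4 know each other: a strongly connected "sink"
}

def PD(i):
    """Participant detector of process i: hints about known participants."""
    return GRAPH.get(i, set())

def is_neighbor(p, i):
    # p is a neighbor of i iff p is in i.PD
    return p in PD(i)

print(is_neighbor(3, 1), is_neighbor(1, 3))  # True False
```

Note that the relation is directed: 1 knows 3, but 3 is unaware of 1 until 1 contacts it, as described in the system model.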
2.3 The Consensus Problem
In a distributed system, the consensus problem consists of ensuring that all correct processes eventually decide the same value, previously proposed by some processes in the system. Thus, each process i proposes a value vi and all correct processes decide on some unique value v among the proposed values. Formally, consensus is defined by the following properties [4]: every correct process eventually decides some value (termination); every correct process decides at most once (integrity); no two correct processes decide differently (agreement); and if a process decides v, then v was previously proposed by some process (validity).
The Byzantine Fault-Tolerant Consensus with Unknown Participants problem, namely BFT-CUP, proposes to solve consensus in unknown networks with the additional requirement that a bounded number of participants in the system can behave maliciously.
message:
3. REACHABLE_FLOODING:
4.   message : value to flood
5.   route : ordered list of nodes

** Initiator Only **
procedure: reachable_send(message, sender)
6. ∀j ∈ i.PD, send REACHABLE_FLOODING(message, sender) to j;   // sender = i

** All Nodes **
INIT:
7. i.received_msgs ← ∅;

upon receipt of REACHABLE_FLOODING(m.message, m.route) from j
8. if getLastElement(m.route) = j ∧ i ∉ m.route then
9.    append(m.route, i);
10.   initiator ← getFirstElement(m.route);
11.   i.received_msgs ← i.received_msgs ∪ {⟨m.message, m.route⟩};
12.   routes ← computeRoutes(m.message, i.received_msgs);
13.   if routes ≥ f + 1 then
14.     trigger reachable_deliver(m.message, initiator);
15.     i.received_msgs ← i.received_msgs \ {⟨m.message, ∗⟩};
16.   end if
17.   ∀z ∈ i.PD \ {j}, send REACHABLE_FLOODING(m.message, m.route) to z;
18. end if
after it has been received through f + 1 node-disjoint paths, it is able to verify its authenticity. These measures prevent the delivery of forged messages (generated by malicious participants), because their authenticity cannot be verified by correct processes.
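The delivery condition, accepting a message only once it has arrived over f + 1 node-disjoint paths, can be sketched as follows. Since computeRoutes is left abstract in Algorithm 1, this sketch substitutes a greedy selection of routes with disjoint intermediate nodes, which lower-bounds the true number of disjoint paths (an assumption of the sketch, not the paper's definition):

```python
def disjoint_route_count(routes):
    """Greedy lower bound on the number of routes that are pairwise
    node-disjoint in their *intermediate* nodes (endpoints are shared)."""
    used = set()
    count = 0
    for route in sorted(routes, key=len):   # prefer shorter routes
        inner = set(route[1:-1])            # exclude initiator and receiver
        if not (inner & used):
            used |= inner
            count += 1
    return count

def should_deliver(routes, f):
    # Deliver once the message is confirmed over at least f + 1 disjoint paths.
    return disjoint_route_count(routes) >= f + 1

routes = [["a", "x", "i"], ["a", "y", "i"], ["a", "x", "z", "i"]]
print(should_deliver(routes, f=1))  # True: the routes via x and via y are disjoint
```

With f = 1, two disjoint copies suffice, so at least one traversed only correct intermediate nodes; a single faulty node on every path could otherwise forge the message.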
An undesirable property of the proposed solution is that the same message, sent by some participant, could be delivered more than once by its receivers. This property does not affect the use of this protocol in our consensus protocol (Section 4), so we do not deal with this limitation of the algorithm. However, it can easily be solved by using buffers to store the delivered messages, which must have unique identifiers.
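The buffering fix suggested above can be sketched as a thin wrapper around delivery; msg_id stands for the unique message identifier the text assumes:

```python
class DedupDeliverer:
    """Wraps reachable_deliver so each unique message id is delivered once."""

    def __init__(self, deliver):
        self.deliver = deliver   # underlying delivery callback
        self.seen = set()        # buffer of already-delivered message ids

    def on_deliver(self, msg_id, payload, initiator):
        if msg_id in self.seen:
            return               # duplicate: drop silently
        self.seen.add(msg_id)
        self.deliver(payload, initiator)

log = []
d = DedupDeliverer(lambda payload, init: log.append((payload, init)))
for _ in range(3):               # the same message arrives three times
    d.on_deliver("m1", "value", "p")
print(log)  # [('value', 'p')]
```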
Additionally, each receiver of a message disseminated by some participant p is able to send back a reply to p using some routing protocol resilient to Byzantine failures [11,12,13]. Our BFT-CUP protocol (Section 4) uses this algorithm to disseminate messages.
Sketch of Proof. The correctness of this protocol is based on the proof of the properties
defined for the reachable reliable broadcast.
its local participant detector, a process is able to get initial knowledge about the system, which is not enough to solve BFT-CUP. A process then expands this knowledge by executing the DISCOVERY protocol, presented in Algorithm 2. The main idea is that each participant i broadcasts a message requesting information about the neighbors of each reachable participant, performing a sort of breadth-first search in the knowledge graph. At the end of the algorithm, i obtains the maximal set of reachable participants, which represents the participants known by i (a partial view of the system).
The algorithm uses three sets:
1. i.known – set containing identifiers of all processes known by i;
2. i.msg_pend – set containing identifiers of processes that should send a message to i, i.e., for each j ∈ i.msg_pend, i should receive a message from j;
3. i.nei_pend – set containing identifiers of processes that i knows but whose neighbors i does not yet fully know (i is still waiting for information about them), i.e., for each ⟨j, j.neighbor⟩ ∈ i.nei_pend, i knows j but does not know all neighbors of j.
In the initialization phase of the algorithm for participant i, the set i.known is updated to itself plus its neighbors, returned by i.PD, and the set i.msg_pend to its neighbors (line 7). Moreover, a message requesting information about neighbors is disseminated to all participants reachable from i (line 8). When a participant p delivers this message, p sends back to i a reply indicating its neighbors (line 9).
Upon receipt of a reply at participant i, the set of known participants is updated, along with the set of pending neighbors³ and the set of pending messages (lines 10-12). The next step is to verify whether i has acquired knowledge about any new participant (lines 13-16). Thus, i gets to know another participant j if at least f + 1 other processes known by i reported to i that j is their neighbor (line 13). After this verification, the set of pending neighbors is updated (lines 17-21), according to the new participants discovered.
To determine whether there is still some participant to be discovered, i uses the sets i.msg_pend and i.nei_pend, which store the pendencies related to the replies received by i. The algorithm then ends when there remain at most f pendencies (lines 22-24). The intuition behind this condition is that if there are at most f pendencies at process i, then i has already discovered all processes reachable from it, because k ≥ 2f + 1. Thus, the algorithm ends by returning the set of participants discovered by i (line 23), which contains all participants (correct or faulty) reachable from it. Algorithm 2 satisfies some properties that are stated by Lemma 1.
Lemma 1. Consider Gdi a knowledge graph induced by a k-OSR PD. Let f < k/2 < n be the number of nodes that may fail. Algorithm DISCOVERY executed by each correct participant p satisfies the following properties:
Termination: p terminates the execution of the algorithm and returns a set of known processes;
Accuracy: the algorithm returns the maximal set of processes reachable from p in Gdi.
³ If i reaches p, i also reaches all neighbors of p and should receive a reply to its initial dissemination (line 8) from all of them.
** All Nodes **
INIT:
7. i.known ← {i} ∪ i.PD; i.nei_pend ← ∅; i.msg_pend ← i.PD;
8. reachable_send(GET_NEIGHBOR, i);

upon execution of reachable_deliver(GET_NEIGHBOR, sender)
9. send SET_NEIGHBOR(i.PD) to sender;

upon receipt of SET_NEIGHBOR(m.neighbor) from sender
10. i.known ← i.known ∪ {sender};
11. i.nei_pend ← i.nei_pend ∪ {⟨sender, m.neighbor⟩};
12. i.msg_pend ← i.msg_pend \ {sender};
13. if (∃j : #⟨∗, j⟩ ∈ i.nei_pend > f) ∧ (j ∉ i.known) then
14.   i.known ← i.known ∪ {j};
15.   i.msg_pend ← i.msg_pend ∪ {j};
16. end if
17. for all ⟨j, j.neighbor⟩ ∈ i.nei_pend do
18.   if (∀z ∈ j.neighbor : z ∈ i.known) then
19.     i.nei_pend ← i.nei_pend \ {⟨j, j.neighbor⟩};
20.   end if
21. end for
22. if (|i.nei_pend| + |i.msg_pend|) ≤ f then
23.   return i.known;
24. end if
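The f + 1 confirmation rule of lines 13-16 can be sketched as follows: each SET_NEIGHBOR reply from a known process vouches for the processes it reports, and an unknown process is adopted once more than f known processes vouch for it. This is a simplified single-pass model of the bookkeeping (pendency tracking and termination are omitted), not a full implementation of Algorithm 2:

```python
from collections import defaultdict

def run_discovery(i, pd, replies, f):
    """pd: i's initial neighbors; replies: {process: its reported neighbor set}.
    Returns the set of processes i ends up knowing."""
    known = {i} | set(pd)
    votes = defaultdict(set)            # candidate -> processes vouching for it
    for sender, neighbors in replies.items():
        known.add(sender)
        for p in neighbors:
            votes[p].add(sender)
    changed = True
    while changed:                      # adopt candidates with > f known vouchers
        changed = False
        for p, vouchers in votes.items():
            if p not in known and len(vouchers & known) > f:
                known.add(p)
                changed = True
    return known

# Hypothetical run with f = 1: process 9 is adopted because two known
# processes (2 and 3) report it; 8 is reported by only one, so it is not.
replies = {2: {3, 9}, 3: {2, 9}, 4: {8}}
print(sorted(run_discovery(1, {2, 3, 4}, replies, f=1)))  # [1, 2, 3, 4, 9]
```

Requiring more than f vouchers ensures that a collusion of the f faulty processes cannot make i "discover" a nonexistent participant.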
Sketch of Proof. Termination: In the worst case, the algorithm ends when p receives replies from at least all correct reachable participants (line 22). By the dissemination protocol properties, even in the presence of f < k/2 failures, all messages disseminated by p are delivered by their correct receivers (processes reachable from p). Thus, each correct participant reachable from p receives a request (line 8) and sends back a reply (line 9) that is received by p (lines 10-24). Then, as Π is finite, it is guaranteed that p receives replies from at least all correct reachable participants and ends the algorithm by returning a set of known processes.
Accuracy: The algorithm only ends when there remain at most f pendencies, which may be divided between processes that supply information about neighbors that do not exist in the system (i.nei_pend) and processes from which p is still waiting for their messages/replies (i.msg_pend). Moreover, each participant z (with z reachable from p) is a neighbor of at least 2f + 1 other participants, because f < k/2 < n. Now, we have to consider two cases:
If z is malicious and does not send back a reply to p (line 9), then p computes messages (replies) from at least f + 1 correct neighbors of z, discovering z (lines 13-16).
If z is correct, in the worst case, the message from z to p is delayed and f neighbors of z are malicious and do not inform p that z is in the system. However, as f < k/2, there remain f + 1 correct neighbors of z in the system that inform p about the presence of z in the system.
As the algorithm only ends when there remain at most f pendencies, in both cases it is guaranteed that p only ends after discovering z, even if it first computes messages from the f malicious processes.
4.2 Sink Component Determination
The objective of this phase is to define which participants belong to the sink component of the knowledge graph induced by a k-OSR PD. More specifically, through Algorithm 3 (SINK), each participant is able to determine whether or not it is a member of the sink component. The idea behind this algorithm is that, after the execution of the procedure DISCOVERY, members of the sink component obtain the same partial view of the system, whereas in the other components, nodes have strictly more knowledge than in the sink, i.e., each node knows at least the members of the component to which it belongs and the members of the sink (see Definition 3).
In the initialization phase of the algorithm for participant i, i executes the DISCOVERY procedure in order to obtain its partial view of the system (line 8) and sends this view to all reachable/known participants (line 10). When these messages are delivered by some participant j, j sends back an ack response to i if it has the same knowledge as i (i.e., j belongs to the same component as i). Otherwise, j sends back a nack response (lines 11-15).
Upon receipt of a reply (lines 16-27), i updates the set of processes that have already answered (line 16). Moreover, if the reply received is a nack, the set of processes that belong to other components (i.nacked) is updated (line 18), and if the number of processes that do not belong to the same component as i is greater than f (line 19), i concludes that it does not belong to the sink component (lines 20-21). This condition holds because the system has at least 3f + 1 processes in the sink, known by all participants, that have strictly less knowledge about Π than processes not in the sink (Lemma 1). On the other hand, if i has received replies from all known processes, excluding f possibly faulty ones (line 24), and the number of processes that belong to other components is not greater than f, i concludes that it belongs to the sink component (lines 25-26). This condition holds because processes in the sink receive messages only from members of this component. Moreover, in both cases, a collusion of f malicious participants cannot lead a process to decide incorrectly. Lemma 2 states the properties satisfied by Algorithm 3.
message:
6. RESPONSE:
7.   ack/nack : boolean

** All Nodes **
INIT:
8. i.known ← DISCOVERY();
9. i.responded ← {i}; i.nacked ← ∅;
10. reachable_send(i.known, i);

upon execution of reachable_deliver(sender.known, sender)
11. if i.known = sender.known then
12.   send RESPONSE(ack) to sender;
13. else
14.   send RESPONSE(nack) to sender;
15. end if

upon receipt of RESPONSE(m) from sender
16. i.responded ← i.responded ∪ {sender};
17. if m.nack then
18.   i.nacked ← i.nacked ∪ {sender};
19.   if |i.nacked| ≥ f + 1 then
20.     i.in_the_sink ← false;
21.     return ⟨i.in_the_sink, i.known⟩;
22.   end if
23. end if
24. if |i.responded| ≥ |i.known| − f then
25.   i.in_the_sink ← true;
26.   return ⟨i.in_the_sink, i.known⟩;
27. end if
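The two exit conditions of Algorithm 3 (lines 19-21 and 24-26) can be sketched directly. This simplified, single-shot model (the function name and the None-for-undecided convention are ours) returns False after f + 1 nacks and True once all but f known processes have answered without that many nacks:

```python
def in_sink(known, replies, f):
    """replies: {process: True for ack (same view), False for nack}.
    Follows Algorithm 3's exit rules; None means still waiting for replies."""
    nacks = sum(1 for ok in replies.values() if not ok)
    if nacks >= f + 1:
        return False                         # > f processes hold a different view
    if len(replies) + 1 >= len(known) - f:   # +1: i counts itself as responded
        return True                          # enough matching replies arrived
    return None

known = {1, 2, 3, 4, 5, 6, 7}
# Process 1, f = 1: every reachable process shares its view -> in the sink.
print(in_sink(known, {p: True for p in [2, 3, 4, 5, 6]}, f=1))  # True
# Two processes report a different (larger) view -> not in the sink.
print(in_sink(known, {2: False, 3: False}, f=1))                # False
```

The f + 1 nack threshold is what stops a collusion of f faulty participants from pushing a sink member into the False branch.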
Lemma 2. Consider a k-OSR PD. Let f < k/2 < n be the number of nodes that may fail. Algorithm SINK, executed by each correct participant p of a system that has at least 3f + 1 nodes in the sink component, satisfies the following properties:
Termination: p terminates the execution by deciding whether it belongs (true) or not (false) to the sink;
Accuracy: p is in the unique k-strongly connected sink component iff algorithm SINK returns true.
Sketch of Proof. Termination: For each participant p, the algorithm returns in two cases: (i) when it receives f + 1 replies from processes that belong to other components (processes not in the sink – line 19), or (ii) when it receives replies from at least all correct known processes (processes in the sink – line 24). By the properties of the dissemination protocol, even in the presence of f < k/2 failures, all messages disseminated by p are delivered by their receivers (processes reachable from p). Thus, each correct participant known by p (reachable from p) receives the request (line 10) and sends back a reply (lines 11-15) that is received by p (lines 16-27). Then, it is guaranteed that either (i) or (ii) always occurs.
Accuracy: By Lemma 1, after execution of the DISCOVERY algorithm, each correct participant discovers the maximal set of participants reachable from it. Then, by Lemma 1 and by the k-OSR PD properties, it is guaranteed that all correct processes that belong to the same component obtain the same partial view of the system. Thus, as members of the sink component receive replies only from members of this component, it is guaranteed that these participants end correctly (line 26). Moreover, as the sink has at least 3f + 1 nodes, members of other components know at least 2f + 1 correct members of the sink (Lemma 1). Then, before making a wrong decision, these members must compute at least f + 1 replies from correct members of the sink (which have strictly less knowledge about Π, due to Lemma 1), which makes it possible for correct members not in the sink to end correctly (line 21).
4.3 Achieving Consensus
This is the last phase of the protocol for solving BFT-CUP. Here, the main idea is to make members of the sink component execute a classical Byzantine consensus and send the decision value to the other participants of the system. The optimal resilience of these algorithms for solving classical consensus is 3f + 1 [3,9]. Thus, at least 3f + 1 participants are necessary in the sink component.
Algorithm 4 (CONSENSUS) presents this protocol. In the initialization, each participant executes the SINK procedure (line 11) in order to get its partial view of the system and decide whether or not it belongs to the sink component. Depending on whether or not the node belongs to the sink, two distinct behaviors are possible:
1. Nodes in the sink execute a classical consensus (line 13) and send the decision value to the other participants (lines 18 and 20-24). By construction, all correct nodes in the sink component share the same partial view of the system (exactly the members of the sink – Lemma 1). Thus, these nodes know at least 2f + 1 correct members that belong to the sink component, which makes it possible to achieve the properties of classical Byzantine consensus (Section 2.3);
2. The other nodes (in the remaining components) do not participate in the classical consensus. These nodes ask for the decision value from all known nodes, i.e., all reachable nodes, which includes all nodes in the sink (line 15). Each node decides on a value v only after it has received v from at least f + 1 other participants, ensuring that v is gathered from at least one correct participant (lines 25-31). Theorem 1 shows that Algorithm 4 solves the BFT-CUP problem as defined in Section 2.3 with the stated participant detector and connectivity requirements.
variables:
3. i.in_the_sink : boolean      // is i in the sink?
4. i.known : set of nodes       // partial view of i
5. i.decision : value           // decision value
6. i.asked : set of nodes       // nodes that have required the decision value
7. i.values : set of ⟨node, value⟩ tuples   // reported decisions

message:
8. SET_DECISION:
9.   decision : value

** All Nodes **
INIT: {Main Decision Task}
10. i.decision ← ⊥; i.values ← ∅; i.asked ← ∅;
11. ⟨i.in_the_sink, i.known⟩ ← SINK();
12. if i.in_the_sink then
13.   Consensus.propose(i.initial);   // underlying Byzantine consensus with all p ∈ i.known
14. else
15.   reachable_send(GET_DECISION, i);
16. end if

** Node In Sink **
upon Consensus.decide(v)
17. i.decision ← v;
18. ∀j ∈ i.asked, send SET_DECISION(i.decision) to j;
19. return i.decision;

upon execution of reachable_deliver(GET_DECISION, sender)
20. if i.decision = ⊥ then
21.   i.asked ← i.asked ∪ {sender};
22. else
23.   send SET_DECISION(i.decision) to sender;
24. end if

** Node Not In Sink **
upon receipt of SET_DECISION(m.decision) from sender
25. if i.decision = ⊥ then
26.   i.values ← i.values ∪ {⟨sender, m.decision⟩};
27.   if #⟨∗, m.decision⟩ ∈ i.values ≥ f + 1 then
28.     i.decision ← m.decision;
29.     return i.decision;
30.   end if
31. end if
Theorem 1. Consider a classical Byzantine consensus protocol. Algorithm CONSENSUS
solves BFT-CUP, in spite of f < ⌈k/2⌉ failures, if a k-OSR PD is used and assuming
at least 3f + 1 participants in the sink.
Sketch of Proof. In this proof we have to consider two cases:
Processes in the sink: All correct participants in the sink component determine that they
belong to the sink (Lemma 2) (line 12) and start the execution of an underlying classical
Byzantine consensus algorithm (line 13). Then, as the sink has at least 2f + 1 correct
nodes, it is guaranteed that all properties of the classical consensus will be met, i.e.,
validity, integrity, agreement and termination. Thus, nodes in the sink obtain the decision
value (line 17), send this value to the other participants (line 18) and return the decided
value to the application (line 19), ensuring termination. Whenever a process in the sink
receives a request for the decision from other processes (lines 20-24), it sends the value
if it has already decided (line 23); otherwise, it stores the sender's identity in order
to send the decision value later (line 18), after the consensus has been achieved.
Processes not in the sink: Processes not in the sink request the decision value from all
participants in the sink (line 15). Notice that if there is enough connectivity (k ≥ 2f + 1),
nodes in the sink are reachable from any node of the system. Moreover, by the properties of
the reachable reliable broadcast, all correct participants in the sink will receive requests
sent by correct participants not in the sink, even in the presence of f < ⌈k/2⌉ failures (lines
20-24). Thus, as there are at least 2f + 1 correct participants in the sink able to send
back replies to these requests (lines 18, 23), it is guaranteed that nodes not in the sink
will receive at least f + 1 messages with the same decision value (lines 25-31) and the
predicate of line 27 will become true, allowing the process to terminate and return the
decided value (line 28). Moreover, a collusion of up to f malicious participants cannot
lead a process to decide on an incorrect value (line 27), thus guaranteeing agreement.
Integrity is ensured through the verification of the predicate on line 25, by which each
correct participant decides only once. Notice that validity is ensured through the underlying
classical Byzantine consensus protocol, i.e., the decided value is a value proposed by
nodes in the sink. This proves that k-OSR PD is sufficient to solve BFT-CUP.
4.4 Necessity of k-OSR Participant Detector to Solve BFT-CUP
Using a k-OSR PD, our protocol requires a degree of connectivity k ≥ 2f + 1 to solve
BFT-CUP. Theorem 2 states that a participant detector of this class and this connectivity
degree are necessary to solve BFT-CUP.
Theorem 2. A participant detector PD ∈ k-OSR is necessary to solve BFT-CUP, in
spite of f < ⌈k/2⌉ failures.
Sketch of Proof. This proof is based on the same arguments used to prove the necessity of
OSR (One Sink Reducibility) for solving CUP [6]. Assume by contradiction that there
is an algorithm which solves BFT-CUP with a PD ∉ k-OSR. Let Gdi be the knowledge
graph induced by the PD; then two scenarios are possible: (i.) there are fewer than k
node-disjoint paths connecting a participant p in Gdi ; or (ii.) the directed acyclic graph
obtained by the reduction of Gdi to its k-strongly connected components has at least two
sinks. We consider each scenario in turn.
In the first scenario, let at most 2f node-disjoint paths connect p in Gdi . Then, the
simple crash failure of f neighbors of p makes it impossible for a participant i (with
p reachable from i) to discover p, because only f processes are able to inform i about
the presence of p in the system. In fact, i is not able to determine whether p really exists,
i.e., it is not guaranteed that i has received this information from a correct process. Then,
the partial view obtained by i will be inconsistent, which makes it impossible to solve
BFT-CUP. Thus, we reach a contradiction.
In the second scenario, let G1 and G2 be two of the sink components, and consider
that participants in G1 have proposal value v and participants in G2 value w, with
v ≠ w. By the Termination property of consensus, processes in G1 and G2 must eventually
decide. Let us assume that the first process in G1 that decides, say p, does so at time t1 ,
and the first process in G2 that decides, say q, does so at time t2 . Delay all messages sent
to G1 and G2 such that they are received after max{t1 , t2 }. Since the processes in a sink
component are unaware of the existence of the other participants, p decides v and q decides
w, violating the Agreement property of consensus and thus reaching a contradiction.
5 Discussion
This section discusses some aspects of the protocol presented in this paper.
5.1 Digital Signatures
It is worth noticing that the lower bound required to solve BFT-CUP in terms of connectivity
and resilience is k ≥ 2f + 1, and it holds even if digital signatures are used. With
digital signatures, it is possible to exchange messages reliably among participants as long as
there is at least one path formed only by correct processes (k ≥ f + 1). However, even
with digital signatures, a connectivity of k ≥ 2f + 1 is still required in order to discover
the participants properly (first phase of the protocol). In fact, if k < 2f + 1, a malicious
participant can lead a correct participant p not to discover every node reachable from it,
which makes it impossible to use this protocol to solve BFT-CUP (the partial view of p
will be inconsistent).
For example, Figure 2 presents a knowledge connectivity graph induced by a 2-OSR
PD (k = 2) in which the system does not tolerate any fault (to tolerate f = 1, k ≥ 3 is
needed). Now, consider that process 2 is malicious and that process 1 is starting the
DISCOVERY phase. Then, process 2 could inform process 1 that it only knows process 3. At
this point, process 1 will stop the search because it is only waiting for a message from
process 3, i.e., the number of pending messages is less than or equal to f . Thus, process 1
obtains the wrong partial view {1, 2, 3} of the system.
5.2 Protocol Limitations
The model used in this study, as well as in all solutions for FT-CUP [7,8], supports
mobility of nodes, but it is not strong enough to tolerate arbitrary churn (arrivals and
departures of processes) during protocol executions. This happens because, after the
relations of knowledge have been established (first phase of the protocol), new participants
will be considered only in future executions of the consensus.
In the current algorithms, process departures can be treated as failures. Nonetheless,
this is not the optimal approach: our protocols tolerate Byzantine faults, while the
behaviour of a departing process resembles a simple crash failure. An alternative
approach consists in specifying an additional parameter d to indicate the number of
supported departures, thereby separating departures from malicious faults. In this way, the
degree of connectivity in the knowledge graph should be k ≥ 2f + d + 1 to support up
to f malicious faults and up to d departures. Moreover, even with departures, the sink
component should retain enough participants to execute a classical consensus,
i.e., n_sink ≥ 3f + 2d + 1, following the same reasoning as [19].
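The two bounds of this section can be collected in a small helper; the function name and dictionary keys are ours, for illustration only:

```python
def requirements(f, d=0):
    """Minimal connectivity and sink size under f Byzantine faults
    and d departures, per the bounds discussed above (a sketch)."""
    return {
        "k_min": 2 * f + d + 1,         # node-disjoint paths needed
        "sink_min": 3 * f + 2 * d + 1,  # members needed in the sink
    }
```

For instance, tolerating f = 1 fault and d = 2 departures would require k ≥ 5 and a sink of at least 8 members.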
5.3 Other Participant Detectors
Although k-OSR PD is the weakest participant detector defined to solve FT-CUP, there
are other (stronger) participant detectors able to solve BFT-CUP [6,8]:
FCO (Full Connectivity PD): the knowledge connectivity graph Gdi = (V, E) induced
by the PD oracle is such that for all p, q ∈ V, we have (p, q) ∈ E.
k-SCO (k-Strong Connectivity PD): the knowledge connectivity graph Gdi = (V, E)
induced by the PD oracle is k-strongly connected.
Notice that a characteristic common to all participant detectors able to solve BFT-CUP
(except for the FCO PD, whose graph is fully connected) is the degree of connectivity k,
which enables the protocol to work properly even in the presence of failures.
Using these participant detectors (FCO or k-SCO), the partial view obtained by each
process in the system contains exactly all processes in the system (first phase of the
protocol). Thereafter, the consensus problem is trivially solved using a classical Byzantine
consensus protocol, since all processes have the same (complete) view of the system.
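These graph classes are easy to test on a concrete knowledge graph. The sketch below checks FCO exactly, and checks the k = 1 case of k-SCO (strong connectivity); a full k-SCO test would additionally verify k node-disjoint paths between every pair. Function names and the edge encoding are ours:

```python
def is_fco(nodes, edges):
    """FCO check: every ordered pair of distinct nodes is an edge."""
    return all((p, q) in edges for p in nodes for q in nodes if p != q)

def is_strongly_connected(nodes, edges):
    """1-strong connectivity of a directed knowledge graph (the k = 1
    case of k-SCO): every node reaches, and is reached by, a start node."""
    def reach(start, adj):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    nodes = set(nodes)
    if not nodes:
        return True
    fwd, bwd = {}, {}
    for (p, q) in edges:
        fwd.setdefault(p, []).append(q)
        bwd.setdefault(q, []).append(p)
    s = next(iter(nodes))
    return reach(s, fwd) == nodes and reach(s, bwd) == nodes
```

A directed 3-cycle is strongly connected but not FCO; a complete directed graph is both.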
6 Final Remarks
Most of the studies about consensus found in the literature consider a static, known
set of participants in the system (e.g., [1,3,4,5,17,19]). Recently, some works that
Table 1. Comparing solutions for the consensus with unknown participants problem

Approach             | failure model    | participant detector | participants in the sink | connectivity between components     | synchrony model
CUP [6]              | without failures | OSR                  | 1                        | OSR                                 | asynchronous
FT-CUP [7]           | crash            | OSR                  | f + 1                    | OSR + safe crash pattern            | asynchronous + P
FT-CUP [8]           | crash            | k-OSR                | 2f + 1                   | k node-disjoint paths               | asynchronous + ◊S
BFT-CUP (this paper) | Byzantine        | k-OSR                | 3f + 1                   | k node-disjoint paths (k ≥ 2f + 1)  | same as the underlying consensus protocol
deal with partial knowledge about the system composition have been proposed. The
works of [6,7,8] are worth noticing. They propose solutions and study the conditions for
solving consensus when the set of participants is unknown and the system is
asynchronous. The work presented herein extends these previous results and presents
an algorithm for solving FT-CUP in a system prone to Byzantine failures. It shows
that to solve Byzantine FT-CUP in an environment with weak synchrony requirements,
it is necessary to enrich the system with a greater degree of knowledge connectivity
among its participants. The main result of this work is to show that it is possible to solve
Byzantine FT-CUP with the same class of participant detectors (k-OSR) and the same
synchrony requirements (◊S) necessary to solve FT-CUP in a system prone to crash
failures [8]. As a side effect, a Byzantine fault-tolerant dissemination primitive, namely
reachable reliable broadcast, has been defined and implemented, and can be used in
other protocols for unknown networks.
Table 1 summarizes these results and compares them with the known results on
consensus solvability with unknown participants.
Acknowledgements
Eduardo Alchieri is supported by a CAPES/Brazil grant. Joni Fraga and Fabíola Greve
are supported by CNPq/Brazil grants. This work was partially supported by the EC,
through project IST-2004-27513 (CRUTIAL), by the FCT, through the Multiannual
(LaSIGE) and the CMU-Portugal Programmes, and by CAPES/GRICES (project TISD).
References
1. Lamport, L., Shostak, R., Pease, M.: The Byzantine generals problem. ACM Transactions on Programming Languages and Systems 4(3), 382-401 (1982)
2. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of distributed consensus with one faulty process. Journal of the ACM 32(2), 374-382 (1985)
3. Toueg, S.: Randomized Byzantine agreements. In: Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, pp. 163-178 (1984)
4. Chandra, T.D., Toueg, S.: Unreliable failure detectors for reliable distributed systems. Journal of the ACM 43(2), 225-267 (1996)
5. Correia, M., Neves, N.F., Veríssimo, P.: From consensus to atomic broadcast: Time-free Byzantine-resistant protocols without signatures. The Computer Journal 49(1) (2006)
6. Cavin, D., Sasson, Y., Schiper, A.: Consensus with unknown participants or fundamental self-organization. In: Nikolaidis, I., Barbeau, M., Kranakis, E. (eds.) ADHOC-NOW 2004. LNCS, vol. 3158, pp. 135-148. Springer, Heidelberg (2004)
7. Cavin, D., Sasson, Y., Schiper, A.: Reaching agreement with unknown participants in mobile self-organized networks in spite of process crashes. Technical Report IC/2005/026, EPFL - LSR (2005)
8. Greve, F.G.P., Tixeuil, S.: Knowledge connectivity vs. synchrony requirements for fault-tolerant agreement in unknown networks. In: Proceedings of the International Conference on Dependable Systems and Networks - DSN, pp. 82-91 (2007)
9. Castro, M., Liskov, B.: Practical Byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems 20(4), 398-461 (2002)
10. Douceur, J.: The Sybil attack. In: Proceedings of the 1st International Workshop on Peer-to-Peer Systems (2002)
11. Awerbuch, B., Holmer, D., Nita-Rotaru, C., Rubens, H.: An on-demand secure routing protocol resilient to Byzantine failures. In: Proceedings of the 1st ACM Workshop on Wireless Security - WiSE, pp. 21-30. ACM, New York (2002)
12. Kotzanikolaou, P., Mavropodi, R., Douligeris, C.: Secure multipath routing for mobile ad hoc networks. In: Wireless On-demand Network Systems and Services - WONS, pp. 89-96 (2005)
13. Papadimitratos, P., Haas, Z.: Secure routing for mobile ad hoc networks. In: Proceedings of the SCS Communication Networks and Distributed Systems Modeling and Simulation Conference - CNDS (2002)
14. Dwork, C., Lynch, N.A., Stockmeyer, L.: Consensus in the presence of partial synchrony. Journal of the ACM 35(2), 288-322 (1988)
15. Bracha, G.: An asynchronous ⌊(n − 1)/3⌋-resilient consensus protocol. In: Proceedings of the 3rd ACM Symposium on Principles of Distributed Computing, pp. 154-162 (1984)
16. Ben-Or, M.: Another advantage of free choice: Completely asynchronous agreement protocols (extended abstract). In: Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, pp. 27-30 (1983)
17. Friedman, R., Mostefaoui, A., Raynal, M.: Simple and efficient oracle-based consensus protocols for asynchronous Byzantine systems. IEEE Transactions on Dependable and Secure Computing 2(1), 46-56 (2005)
18. Dolev, D.: The Byzantine generals strike again. Journal of Algorithms 3(1), 14-30 (1982)
19. Martin, J.P., Alvisi, L.: Fast Byzantine consensus. IEEE Transactions on Dependable and Secure Computing 3(3), 202-215 (2006)
Introduction
This work was initiated while Franck Petit was with MIS Lab., Université de Picardie,
France. Research partially supported by Région Picardie, Proj. APREDY.
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 41-57, 2008.
© Springer-Verlag Berlin Heidelberg 2008
C. Delporte-Gallet et al.
The FIFO assumption is necessary because, from the results in [18], if lossy links are
not FIFO, reliable broadcast requires unbounded message headers.
With infinite memory and fair lossy links, (uniform) reliable broadcast can be solved
using [19], and it is strictly weaker than (Ω, Σ), which is necessary to solve consensus.
Model
We assume that each process knows the set of processes that are in the system; some
papers related to failure detectors do not make this assumption, e.g., [21,22,23].
◊S ≺ ◊P ≺ P
The weakest [8] failure detector D to solve a given problem is a failure detector
D that is sufficient to solve the problem and that is also necessary to solve the
problem, i.e., D is weaker than any failure detector that solves the problem.
Notations. In the sequel, v_p denotes the value of the variable v at process p.
Finally, a datum in a message can be replaced by − when this value has no
impact on the reasoning.
Problem Specifications
Reliable Broadcast. Reliable broadcast [26] is defined with two primitives:
BROADCAST(m) and DELIVER(m). Informally, any reliable broadcast algorithm
guarantees that after a process p invokes BROADCAST(m), every correct process
eventually executes DELIVER(m). In the formal definition below, we denote by
sender(m) the process that invokes BROADCAST(m).
Specification 1 (Reliable Broadcast). A run R satisfies the specification
Reliable Broadcast if and only if the following three requirements are satisfied in
R:
Validity: If a correct process invokes BROADCAST(m), then it eventually executes DELIVER(m).
(Uniform) Agreement: If a process executes DELIVER(m), then all other correct processes eventually execute DELIVER(m).
Integrity: For every message m, every process executes DELIVER(m) at most
once, and only if sender(m) previously invoked BROADCAST(m).
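On a finished, finite run these requirements can be checked mechanically. The sketch below uses our own names and encoding, and assumes every message appearing in the run was actually broadcast; Validity and Termination cannot be checked on a finite trace, so only the at-most-once clause of Integrity and Agreement are verified:

```python
def check_reliable_broadcast(run, correct):
    """Check a finished run against Specification 1 (partial sketch).

    `run` maps each process to the list of (sender, payload) messages
    it delivered; `correct` is the set of correct processes.
    """
    delivered = {p: set(ms) for p, ms in run.items()}
    # Integrity (at most once): no process delivers a message twice.
    no_dups = all(len(ms) == len(set(ms)) for ms in run.values())
    # (Uniform) Agreement: anything delivered anywhere is eventually
    # delivered by every correct process.
    everywhere = set().union(*delivered.values()) if delivered else set()
    agreement = all(everywhere <= delivered[p] for p in correct)
    return no_dups and agreement
```

For example, a run where one correct process delivers a message that another correct process never delivers violates Agreement.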
Consensus. In the consensus problem, all correct processes propose a value and
must reach a unanimous and irrevocable decision on some value chosen among
the proposed values. We define the consensus problem in terms of two
primitives, PROPOSE(v) and DECIDE(u). When a process executes PROPOSE(v), we
say that it proposes v; similarly, when a process executes DECIDE(u), we say that
it decides u.
Specification 2 (Consensus). A run R satisfies the specification Consensus
if and only if the following three requirements are satisfied in R:
(Uniform) Agreement: No two processes decide differently.
Termination: Every correct process eventually decides some value.
Validity: If a process decides v, then v was proposed by some process.
Repeated Consensus. We now define repeated consensus. Each correct process
has as input an infinite sequence of proposed values, and outputs an infinite
sequence of decision values such that:
1. Two correct processes have the same output. (The output of a faulty process
is a prefix of this output.)
2. The ith value of the output is the ith value of the input of some process.
We define repeated consensus in terms of two primitives, R-PROPOSE(v) and
R-DECIDE(u). When a process executes the ith R-PROPOSE(v), v is the ith value
of its input (we say that it proposes v for the ith consensus); similarly, when a
process executes the ith R-DECIDE(u), u is the ith value of its output (we say that
it decides u for the ith consensus).
Specification 3 (Repeated Consensus). A run R satisfies the specification
Repeated Consensus if and only if the following three requirements are satisfied
in R:
Agreement: If u and v are the outputs of two processes, then u is a prefix of
v or v is a prefix of u.
Termination: Every correct process has an infinite output.
Validity: If the ith value of the output of a process is v, then v is the ith
value of the input of some process.
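As with reliable broadcast, Agreement and Validity can be checked on finite prefixes of the outputs (Termination cannot). A sketch, with our own encoding of runs:

```python
def check_repeated_consensus(outputs, inputs):
    """Check finite output prefixes against Specification 3 (sketch).

    `outputs[p]` is the observed decision sequence of process p and
    `inputs[p]` its proposal sequence. Agreement: any two outputs are
    prefix-related. Validity: the ith decision equals some process's
    ith proposal.
    """
    def is_prefix(u, v):
        return len(u) <= len(v) and v[:len(u)] == u
    procs = list(outputs)
    agreement = all(is_prefix(outputs[p], outputs[q]) or
                    is_prefix(outputs[q], outputs[p])
                    for p in procs for q in procs)
    validity = all(
        any(i < len(inputs[q]) and outputs[p][i] == inputs[q][i]
            for q in inputs)
        for p in procs for i in range(len(outputs[p])))
    return agreement and validity
```

A faulty process that stops early is still consistent here, since its shorter output is a prefix of the others.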
Reliable Broadcast in F
In this section, we show that P is the weakest failure detector to solve reliable
broadcast in F.
Fig. 2. Algorithm B
Consensus in F
In this section, we show that we can solve consensus in system F with a failure
detector that is strictly weaker than the failure detector necessary to solve
reliable broadcast and repeated consensus. We solve consensus with the strong
failure detector S. S is not the weakest failure detector to solve consensus
whatever the number of crashes, but it is strictly weaker than P and thus
sufficient to establish our results.
We adapt the algorithm of Chandra and Toueg [7], which works in an
asynchronous message-passing system with reliable links augmented with
a strong failure detector (S), to our model.
In this algorithm, called CS in the following (Figure 3), the processes execute
n asynchronous rounds. First, processes execute n − 1 asynchronous rounds (r
denotes the current round number) during which they broadcast and relay their
proposed values. Each process p waits until it receives a round r message from
every other non-suspected process (n.b., as mentioned in Section 2, we assume
that when a process is suspected it remains suspected forever) before proceeding
Repeated Consensus in F
We show in this section that P is the weakest failure detector to solve the
repeated consensus problem in F.
P is Necessary. The proof is similar to the one in Section 4; here, the
following lemma is central to the proof:
Lemma 2. Let A be an algorithm solving Repeated Consensus in F with a
failure detector D. There exists an integer k such that, for every process p and
every correct process q, in every run R of A where process p R-PROPOSEs and
R-DECIDEs k times, at least one message from q has been received by some process.
Assume that there exists an algorithm A that implements Repeated Consensus in
F using the failure detector D. To show our result we have to give an algorithm
that uses only D to emulate the output of P for every failure pattern.
In fact, we give an algorithm Aq (Figure 4) in which processes monitor a given
process q. This algorithm uses one instance of A with D. Note that all processes
except q participate in this algorithm following the code of A. In this algorithm,
Output_q is equal to either {q} (q has crashed) or ∅ (q is correct).
Fig. 4. Aq
Note also that if the consensus function is executed with P, then there is
no need to send R-x in rounds r > x again. We have rewritten the consensus
function to take these facts into account, but the behaviour remains the same.
Theorem 5. Algorithm RCP (Figures 5 and 6) is a Repeated Consensus algorithm in F with P.
Corollary 3. P is sufficient for solving Repeated Consensus in F.
Contrary to these results in system F, in system I we have the same weakest
failure detector for solving the consensus problem and the repeated consensus
problem:
Proposition 2. In system I, if there is an algorithm A with failure detector
D solving Consensus, then there exists an algorithm solving Repeated Consensus
with D.
References
1. Guerraoui, R., Schiper, A.: The generic consensus service. IEEE Transactions on Software Engineering 27(1), 29-41 (2001)
2. Gafni, E., Lamport, L.: Disk Paxos. Distributed Computing 16(1), 1-20 (2003)
3. Fischer, M.J., Lynch, N.A., Paterson, M.: Impossibility of distributed consensus with one faulty process. Journal of the ACM 32(2), 374-382 (1985)
4. Chor, B., Coan, B.A.: A simple and efficient randomized Byzantine agreement algorithm. IEEE Trans. Software Eng. 11(6), 531-539 (1985)
5. Dolev, D., Dwork, C., Stockmeyer, L.J.: On the minimal synchronism needed for distributed consensus. Journal of the ACM 34(1), 77-97 (1987)
6. Dwork, C., Lynch, N.A., Stockmeyer, L.J.: Consensus in the presence of partial synchrony. Journal of the ACM 35(2), 288-323 (1988)
7. Chandra, T.D., Toueg, S.: Unreliable failure detectors for reliable distributed systems. Journal of the ACM 43(2), 225-267 (1996)
8. Chandra, T.D., Hadzilacos, V., Toueg, S.: The weakest failure detector for solving consensus. Journal of the ACM 43(4), 685-722 (1996)
9. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R.: Shared memory vs message passing. Technical Report LPD-REPORT-2003-001 (2003)
10. Eisler, J., Hadzilacos, V., Toueg, S.: The weakest failure detector to solve nonuniform consensus. Distributed Computing 19(4), 335-359 (2007)
11. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R., Hadzilacos, V., Kouznetsov, P., Toueg, S.: The weakest failure detectors to solve certain fundamental problems in distributed computing. In: Twenty-Third Annual ACM Symposium on Principles of Distributed Computing (PODC 2004), pp. 338-346 (2004)
12. Aguilera, M.K., Toueg, S., Deianov, B.: Revisiting the weakest failure detector for uniform reliable broadcast. In: Jayanti, P. (ed.) DISC 1999. LNCS, vol. 1693, pp. 13-33. Springer, Heidelberg (1999)
13. Halpern, J.Y., Ricciardi, A.: A knowledge-theoretic analysis of uniform distributed coordination and failure detectors. In: Eighteenth Annual ACM Symposium on Principles of Distributed Computing (PODC 1999), pp. 73-82 (1999)
14. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R., Kouznetsov, P.: Mutual exclusion in asynchronous systems with failure detectors. Journal of Parallel and Distributed Computing 65(4), 492-505 (2005)
15. Guerraoui, R., Kapalka, M., Kouznetsov, P.: The weakest failure detectors to boost obstruction-freedom. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 399-412. Springer, Heidelberg (2006)
16. Raynal, M., Travers, C.: In search of the holy grail: Looking for the weakest failure detector for wait-free set agreement. In: Shvartsman, M.M.A.A. (ed.) OPODIS 2006. LNCS, vol. 4305, pp. 3-19. Springer, Heidelberg (2006)
17. Zielinski, P.: Anti-omega: the weakest failure detector for set agreement. Technical Report UCAM-CL-TR-694, Computer Laboratory, University of Cambridge, Cambridge, UK (July 2007)
18. Lynch, N.A., Mansour, Y., Fekete, A.: Data link layer: Two impossibility results. In: Symposium on Principles of Distributed Computing, pp. 149-170 (1988)
19. Bazzi, R.A., Neiger, G.: Simulating crash failures with many faulty processors (extended abstract). In: Segall, A., Zaks, S. (eds.) WDAG 1992. LNCS, vol. 647, pp. 166-184. Springer, Heidelberg (1992)
20. Delporte-Gallet, C., Devismes, S., Fauconnier, H., Petit, F., Toueg, S.: With finite memory consensus is easier than reliable broadcast. Technical Report hal-00325470, HAL (October 2008)
21. Cavin, D., Sasson, Y., Schiper, A.: Consensus with unknown participants or fundamental self-organization. In: Nikolaidis, I., Barbeau, M., Kranakis, E. (eds.) ADHOC-NOW 2004. LNCS, vol. 3158, pp. 135-148. Springer, Heidelberg (2004)
22. Greve, F., Tixeuil, S.: Knowledge connectivity vs. synchrony requirements for fault-tolerant agreement in unknown networks. In: DSN, pp. 82-91. IEEE Computer Society, Los Alamitos (2007)
23. Fernández, A., Jiménez, E., Raynal, M.: Eventual leader election with weak assumptions on initial knowledge, communication reliability, and synchrony. In: DSN, pp. 166-178. IEEE Computer Society, Los Alamitos (2006)
24. Chandra, T.D., Toueg, S.: Unreliable failure detectors for asynchronous systems (preliminary version). In: 10th Annual ACM Symposium on Principles of Distributed Computing (PODC 1991), pp. 325-340 (1991)
25. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R.: A realistic look at failure detectors. In: DSN, pp. 345-353. IEEE Computer Society, Los Alamitos (2002)
26. Hadzilacos, V., Toueg, S.: A modular approach to fault-tolerant broadcasts and related problems. Technical Report TR 94-1425, Department of Computer Science, Cornell University (1994)
27. Bartlett, K.A., Scantlebury, R.A., Wilkinson, P.T.: A note on reliable full-duplex transmission over half-duplex links. Journal of the ACM 12, 260-261 (1969)
28. Stenning, V.: A data transfer protocol. Computer Networks 1, 99-110 (1976)
Group Renaming
Yehuda Afek, Iftah Gamzu, Irit Levy, Michael Merritt, and Gadi Taubenfeld
1 Introduction
1.1 The Group Renaming Problem
We investigate the group renaming task, which generalizes the well-known renaming
task [3]. In the original renaming task, each processor starts with a unique identifier
taken from a large domain, and the goal of each processor is to select a new
unique identifier from a smaller range. Such an identifier can be used, for example,
Supported by the Binational Science Foundation, by the Israel Science Foundation, and by
the European Commission under the Integrated Project QAP funded by the IST directorate as
Contract Number 015848.
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 58-72, 2008.
© Springer-Verlag Berlin Heidelberg 2008
to mark a memory slot in which the processor may publish information in its possession.
In the group renaming task, groups of processors may hold some information
which they would like to publish, preferably using a common memory slot for each
group. An additional motivation for studying the group version of the problem is to
further our understanding of the inherent difficulties in solving tasks with respect to
groups [10].
More formally, an instance of the group renaming task consists of n processors
partitioned into m groups, each of which consists of at most g processors. Each processor
has a group name taken from some large name space [M] = {1, . . . , M}, representing the
group that the processor affiliates with. In addition, every processor has a unique
identifier taken from [N]. The objective of each processor is to choose a new group name
from [ℓ], where ℓ < M. The collection of new group names selected by the processors
must satisfy the uniqueness property, meaning that any two processors from different
groups choose distinct new group names. We consider two variants of the problem:
a tight variant, in which, in addition to satisfying the uniqueness property, processors
of the same group must choose the same new group name (this requirement is called
the consistency property), and
a loose variant, in which processors from the same group may choose different
names, rather than a single one, as long as no two processors from different groups
choose the same new name.
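The two properties can be stated operationally as a checker over a finished execution; the encoding below is ours, for illustration:

```python
def check_group_renaming(assignment, groups, tight):
    """Check new names against the two variants (sketch).

    `assignment` maps processor -> new group name and `groups` maps
    processor -> original group. Uniqueness: processors of different
    groups never share a new name. Consistency (tight variant only):
    processors of the same group agree on one name.
    """
    procs = list(assignment)
    for p in procs:
        for q in procs:
            same_group = groups[p] == groups[q]
            if not same_group and assignment[p] == assignment[q]:
                return False  # uniqueness violated
            if tight and same_group and assignment[p] != assignment[q]:
                return False  # consistency violated
    return True
```

For instance, two members of the same group picking different names passes the loose check but fails the tight one.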
1.2 Summary of Results
We present a wait-free algorithm that solves the tight variant of the problem with
ℓ = 2m − 1 in a system equipped with g-consensus objects and atomic read/write registers.
This algorithm extends the upper bound result of Attiya et al. [3] for g = 1. On the
lower bound side, we show that there is no wait-free implementation of tight group
renaming in a system equipped with (g − 1)-consensus objects and atomic read/write
registers. In particular, this result implies that there is no wait-free implementation of
tight group renaming using only atomic read/write registers for g ≥ 2.
We then restrict our attention to shared memory systems which support only atomic
read/write registers and study the loose variant. We develop a self-adjusting algorithm,
namely, an algorithm that achieves distinctive performance guarantees conditioned on
the number of groups and processors. In the worst case, this algorithm has ℓ = 3n − √n − 1,
while guaranteeing that the number of different new group names chosen by processors
from the same group is at most min{g, 2m, 2√n}. It seems worthy to note that the
algorithm is built around a filtering technique that overcomes scenarios in which both
the size of the maximal group and the number of groups are large, i.e., g = Θ(n) and
m = Θ(n). Essentially, such a scenario arises when there are Θ(n) groups containing
only few members and few groups containing Θ(n) members.
We also consider the special case when the groups are uniform in size, and refine
the analysis of our loose group renaming algorithm; we demonstrate that ℓ =
Y. Afek et al.
The consensus number of an object of type o is the largest n for which it is possible
to implement an n-consensus object in a wait-free manner, using any number of
objects of type o and any number of atomic registers. If no largest n exists, the
consensus number of o is infinite.
The consensus hierarchy (also called the wait-free hierarchy) is an infinite hierarchy
of objects such that the objects at level i of the hierarchy are exactly those objects
with consensus number i.
It has been shown in [12] that in the consensus hierarchy, for any positive i, in a system
with i processors: (1) no object at a level less than i together with atomic registers can
implement any object at level i; and (2) each object at level i together with atomic
registers can implement any object at level i or at a lower level, in a system with i
processors. Classifying objects by their consensus numbers is a powerful technique for
understanding the relative power of shared objects.
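As a concrete illustration of an object at level 2 of the hierarchy, test-and-set (plus registers) solves wait-free consensus for exactly two processes. The sketch below simulates the classic construction; it is illustrative and not the paper's algorithm, and all names are ours:

```python
import threading

class TestAndSet:
    """A one-shot test-and-set object (consensus number 2)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._set = False

    def test_and_set(self):
        with self._lock:
            old, self._set = self._set, True
            return old

class TwoConsensus:
    """Wait-free 2-process consensus from one test-and-set object plus
    read/write registers."""
    def __init__(self):
        self._tas = TestAndSet()
        self._proposal = [None, None]

    def propose(self, i, v):
        self._proposal[i] = v          # announce proposal in a register
        if not self._tas.test_and_set():
            return v                   # won the test-and-set: decide own value
        return self._proposal[1 - i]   # lost: adopt the winner's announced value
```

The first proposer wins the test-and-set and imposes its value; the loser adopts the value the winner announced before winning.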
Finally, for simplicity, when referring to the group renaming problem, we will assume that m, the number of groups, is greater than or equal to 2.
    Agree on w, the winner of group GIDi in iteration k, and import its snapshot:
      w ← CON[GIDi][k].Compete(i)
      (GID1, p1, k1, ..., GIDN, pN, kN) ← HIS[w][k][1, ..., N]
    Check if pw, the proposal of w, can be chosen as the new name of group GIDi:
 8:   P ← {pj : j ∈ [N] has GIDj ≠ GIDw and kj = max_{q∈[N]} {kq : GIDq = GIDj}}
 9:   if pw ∈ P then
10:     r ← the rank of GIDw in {GIDj ≠ ⊥ : j ∈ [N]}
11:     p ← the r-th integer not in P
12:   else return pw
13:   end if
14:   k ← k + 1
15: end while
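The selection rule of line 11 ("the r-th integer not in P") can be sketched directly; the function name is ours, and the property it illustrates (the result never exceeds r + |P|) is the one the analysis relies on:

```python
def rth_int_not_in(P, r):
    """Return the r-th smallest positive integer that is not in the set P."""
    candidate = 0
    for _ in range(r):
        candidate += 1
        while candidate in P:  # skip names already taken
            candidate += 1
    return candidate
```

Since at most |P| candidates are ever skipped, the chosen value is at most r + |P|; with r ≤ m and |P| ≤ m − 1 this yields the 2m − 1 bound of Lemma 4 below.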
processor chooses a new group name, all the other processors will follow its lead and
choose the same name.
Lemma 3. No two processors of different groups choose the same new group name.
Proof. Recall that we know, by Lemma 2, that all the processors of the same group
choose an identical new group name. Hence, it is sufficient to prove that no two
groups select the same new name. Assume by way of contradiction that this is not the
case, namely, that there are two distinct groups G and G′ that select the same new group
name p′. Let k and k′ be the iteration numbers in which the decisions on the new
names of G and G′ are made, and let w ∈ G and w′ ∈ G′ be the corresponding processors that won the g-consensus objects in those iterations. Now, consider the snapshot
(GID1, p1, k1, ..., GIDN, pN, kN) taken by w in its k-th iteration. One can easily
validate that pw = p′ since w writes its proposed value before taking a snapshot. Similarly, it is clear that pw′ = p′ in the snapshot (GID′1, p′1, k′1, ..., GID′N, p′N, k′N)
taken by w′ in its k′-th iteration. By the linearizability property of the atomic snapshot
object and without loss of generality, we may assume that the snapshot of w was taken before the snapshot of w′. Consequently, w′ must have captured the proposal value of w
in its snapshot, i.e., pw = p′. This implies that p′ appeared in the set P of w′. However,
this violates the fact that w′ reached the decision step in line 12, a contradiction.
Lemma 4. All the new group names are from the range [ℓ], where ℓ = 2m − 1.
Proof. In what follows, we prove that the proposal value of any processor in any iteration is in the range [ℓ]. Clearly, this proves the lemma, as the chosen name of any group
is a proposal value of some processor. Consider some processor. It is clear that its first-iteration
proposal value is in the range [ℓ]. Thus, let us consider some iteration k > 1
and prove that its proposal value is at most 2m − 1. Essentially, this is done by bounding
the value of p calculated in line 11 of the preceding iteration. For this purpose, we first
claim that the set P consists of at most m − 1 values. Notice that P holds the proposal
values of processors from at most m − 1 groups. Furthermore, observe that for each of
those groups, it holds the proposal values of processors having the same maximal iteration counter. This implies, in conjunction with Lemma 1, that for each of those groups,
the proposal values of the corresponding processors are identical. Consequently, P consists of at most m − 1 distinct values. Now, one can easily verify that the rank of every
group calculated in line 10 is at most m. Therefore, since line 11 selects the r-th integer not in P,
the new value of p is no more than m + (m − 1) = 2m − 1.
Lemma 5. Any processor either takes a finite number of steps or chooses a new group
name.
Proof. The proof of this theorem is a natural generalization of the termination proof of
the renaming algorithm (see, e.g., [5, Sec. 16.3]). Thus, we defer it to the final version
of the paper.
3.2 An Impossibility Result
In Appendix A.1, we provide an FLP-style proof of the following theorem.
Theorem 2. For any g ≥ 2, it is impossible to wait-free implement tight group renaming in a system having (g − 1)-consensus objects and atomic registers.
In particular, Theorem 2 implies that there is no wait-free implementation of tight group
renaming, even when g = 2, using only atomic registers.
We are now ready to present our self-adjusting loose group renaming algorithm. The
algorithm has its roots in the natural approach that applies the best response with respect to the instance under consideration. For example, it is easy to see that Algorithm 3
outperforms Algorithm 2 with respect to the inner scope size, for any instance. In addition,
one can verify that when m < √n, Algorithm 3 has an outer scope size of at
most n/2 + √n/2, whereas Algorithm 2 has an outer scope size of at least n. Hence,
given an instance having m < √n, the best response would be to execute Algorithm 3.
Unfortunately, a straightforward application of this approach has several difficulties.
One immediate difficulty concerns the implementation, since none of the processors
has prior knowledge of the real values of m or g. Our algorithm bypasses this difficulty by maintaining an estimation of these parameters using an atomic snapshot object.
Another difficulty concerns performance. Specifically, both algorithms have
poor inner scope size guarantees for instances which simultaneously satisfy g = ω(√n)
and m = ω(√n). One concrete example, having g = n/2 and m = n/2 + 1, consists
of a single group having n/2 members and n/2 singleton groups. In this case, both
algorithms have an inner scope size guarantee of n/2. We overcome this difficulty by
sensibly combining the algorithms, thereby yielding an inner scope size guarantee of
2√n for these hard cases. The key observation utilized in this context is that if there
are many groups then most of them must be small. Consequently, by filtering out the
small-sized groups, we are left with a small number of large groups that we can handle efficiently. Note that Algorithm 4 employs Algorithm 3 as a sub-procedure in two
cases (see lines 6 and 12). It is assumed that the shared memory space used by each
application of the algorithm is distinct.
Theorem 5. Algorithm 4 is a loose group renaming algorithm having a worst case inner
scope size of min{g, 2m, 2√n} and a worst case outer scope size of 3n + √n − 1.
Proof. We begin by establishing the correctness of the algorithm. For this purpose, we
demonstrate that it maintains the uniqueness property and terminates after a finite number of steps. One can easily validate that the termination property holds since both
Algorithm 2 and Algorithm 3 terminate after a finite number of steps. It is also easy to
verify that the uniqueness property is maintained. This follows by recalling that both
Algorithm 2 and Algorithm 3 maintain the uniqueness property, and noticing that each
case of the if statement (see lines 5–14) utilizes a distinct set of new names. To be
precise, one should observe that any processor that executes Algorithm 3 in line 6 is
assigned a new name in the range {1, ..., n/2 + √n/2}, any processor that executes
Algorithm 2 in line 9 is assigned a new name in the range {n/2 + √n/2 + 1, ..., 5n/2 + √n/2 − 1},
and any processor that executes Algorithm 3 in line 12 is assigned a new
name whose value is at least 5n/2 + √n/2. The first claim results by the outer scope
properties of Algorithm 3 and the fact that processors from fewer than √n groups may
execute this algorithm. The second argument follows by the outer scope properties of
Algorithm 2, combined with the observation that m ≤ n, and the fact that the value
of the name returned by the algorithm is increased by n/2 + √n/2 in line 10. Finally,
the last claim holds since Algorithm 3 is guaranteed to attain a positive-valued integer
name, and the value of this name is increased by 5n/2 + √n/2 − 1 in line 13.
Algorithm 4. Adjusting group renaming algorithm: code for processor i ∈ [N].
In shared memory: SS[1, ..., N] array of swmr registers, initially ⊥.
 1: SS[i] ← GIDi
 2: (GID1, ..., GIDN) ← Snapshot(SS)
 3: m̂ ← the number of distinct GIDs in {GIDj ≠ ⊥ : j ∈ [N]}
 4: ĝ ← the number of processors j ∈ [N] having GIDj = GIDi
 5: if m̂ < √n then
 6:   x ← the outcome of Algorithm 3 (using shared memory SS1[1, ..., N])
 7:   return x
 8: else if ĝ ≤ √n then
 9:   x ← the outcome of Algorithm 2 (using shared memory SS2[1, ..., N])
10:   return x + n/2 + √n/2
11: else
12:   x ← the outcome of Algorithm 3 (using shared memory SS3[1, ..., N])
13:   return x + 5n/2 + √n/2 − 1
14: end if
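The three-way dispatch of Algorithm 4 can be sketched sequentially as follows; this is a simplification in which the snapshot-based estimates and the sub-algorithms are abstracted away as parameters, and all names are ours:

```python
import math

def algorithm4_dispatch(m_est, g_est, n, run_alg2, run_alg3):
    """Pick the sub-algorithm and the name-range offset from the estimates
    m_est (distinct groups seen) and g_est (own group size seen)."""
    sqrt_n = math.sqrt(n)
    first_range = n / 2 + sqrt_n / 2      # outer scope of the line-6 branch
    if m_est < sqrt_n:                    # few groups: Algorithm 3 directly
        return run_alg3()
    elif g_est <= sqrt_n:                 # small own group: Algorithm 2, shifted
        return run_alg2() + first_range
    else:                                 # large group: Algorithm 3, shifted again
        return run_alg3() + (5 * n / 2 + sqrt_n / 2 - 1)
```

The three offsets keep the three name ranges disjoint, which is exactly the uniqueness argument in the proof of Theorem 5.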
We now turn to establish the performance properties of the algorithm. We demonstrate that it is self-adjusting and has the following (inner scope, outer scope) properties:

( min{m, g}, m(m + 1)/2 )         when m < √n
( g, 3n/2 + m + √n/2 − 1 )        when m ≥ √n and g ≤ √n
( min{g, 2√n}, 3n + √n − 1 )      when m ≥ √n and g > √n
Case I: m < √n. The estimation values always satisfy m̂ ≤ m. Therefore, all the
processors execute Algorithm 3 in line 6. The properties of Algorithm 3 guarantee that
the inner scope size is min{m, g} and the outer scope size is m(m + 1)/2. Take notice
that m(m + 1)/2 < n/2 + √n/2 in this case.
Case II: m ≥ √n and g ≤ √n. The estimation values never exceed their real values,
namely, m̂ ≤ m and ĝ ≤ g. Consequently, some processors may execute Algorithm 3 in
line 6 and some may execute Algorithm 2 in line 9, depending on the concrete execution
sequence. The inner scope size guarantee is trivially satisfied since there are at most g
processors in each group. Furthermore, one can establish the outer scope size guarantee
by simply summing the size of the name space that may be used by Algorithm 3, which
is n/2 + √n/2, with the size of the name space that may be used by Algorithm 2,
which is n + m − 1. Notice that g = min{g, 2m, 2√n} since g ≤ √n ≤ m, and
3n/2 + m + √n/2 − 1 ≤ 3n + √n − 1 as m ≤ n. Hence, the performance properties
of the algorithm in this case support the worst case analysis.
Case III: m ≥ √n and g > √n. Every processor may execute any of the algorithms,
depending on the concrete execution sequence. The first observation one should make
is that no more than √n new names may be collectively assigned to processors of the
same group by Algorithm 3 in line 6 and Algorithm 2 in line 9. Moreover, one should
notice that any processor that executes Algorithm 3 in line 12 is part of a group of
size greater than √n. Consequently, processors from fewer than √n groups may execute
it. This implies, in conjunction with the properties of Algorithm 3, that no more than
√n new names may be assigned to each group, and at most n/2 + √n/2 names are
assigned by this algorithm. Putting everything together, we attain that the inner scope
size is min{g, 2√n} and the outer scope size is 3n + √n − 1. Notice that
min{g, 2√n} = min{g, 2m, 2√n} since m ≥ √n, and thus the performance properties of the algorithm in this case also support the worst case analysis.
4.2 The Uniform Case
In what follows, we study the problem when the groups are guaranteed to be uniform
in size. We refine the analysis of Algorithm 4 by establishing that it is a loose group
renaming algorithm having a worst case inner scope size of min{m, g}, and an outer
scope size of 3n/2 + m + √n/2 − 1. Note that min{m, g} ≤ √n in this case. In
particular, we demonstrate that the algorithm is self-adjusting and has the following
(inner scope, outer scope) properties:

( min{m, g}, m(m + 1)/2 )         when m < √n
( g, 3n/2 + m + √n/2 − 1 )        when m ≥ √n
This result settles, to some extent, an open question posed by Gafni [10], which called
for a self-adjusting group renaming algorithm that requires at most m(m + 1)/2 names
on one extreme, and no more than 2n − 1 names on the other.
The key observation required to establish this refinement is that n = m · g when
the groups are uniform in size. Consequently, either m < √n or g ≤ √n. Since the
estimation values that each processor sees cannot exceed the corresponding real values,
no processor can ever reach the second execution of Algorithm 3 in line 12. Now, the
proof of the performance properties follows the same line of argumentation presented
in the proof of Theorem 5.
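The fact that n = m · g rules out the line-12 branch can be checked exhaustively; the predicate below (names ours) encodes the branch condition of Algorithm 4:

```python
import math

def line_12_reachable(m, g, n):
    """The line-12 branch of Algorithm 4 requires m >= sqrt(n) and g > sqrt(n)."""
    s = math.sqrt(n)
    return m >= s and g > s

# With uniform groups, m * g = n, so m >= sqrt(n) forces g <= sqrt(n):
# the conjunction can never hold.
```

The arithmetic behind the check: m ≥ √n and g > √n together would give m · g > n, contradicting m · g = n.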
5 Discussion
This paper has considered and investigated the tight and loose variants of the group renaming problem. Below we discuss a few ways in which our results can be extended. An
immediate open question is whether a g-consensus task can be constructed from group
renaming tasks for groups of size g, in a system with g processes. Another question
is to design an adaptive group renaming algorithm in which a processor is assigned
a new group name, from the range 1 through k where k is a constant multiple of the
contention (i.e., the number of different active groups) that the processor experiences.
We have considered only one-shot tasks (i.e., solutions that can be used only once); it
would be interesting to design long-lived group renaming algorithms. We have focused
in this work mainly on reducing the new name space as much as possible; it would be
interesting to construct algorithms also with low space and time (step) complexities. Finally, the k-set consensus task, a generalization of the consensus task, enables each
processor that starts with an input value from some domain to choose some participating processor's input as its output, such that all processors together may choose no more
than k distinct output values. It is interesting to find out what type of group renaming
task, if any, can be implemented using k-set consensus tasks and registers.
References
1. Afek, Y., Attiya, H., Fouren, A., Stupp, G., Touitou, D.: Long-lived renaming made adaptive.
In: Proc. 18th ACM Symp. on Principles of Distributed Computing, pp. 91–103 (May 1999)
2. Afek, Y., Stupp, G., Touitou, D.: Long lived adaptive splitter and applications. Distributed
Computing 15(2), 67–86 (2002)
3. Attiya, H., Bar-Noy, A., Dolev, D., Peleg, D., Reischuk, R.: Renaming in an asynchronous
environment. J. ACM 37(3), 524–548 (1990)
4. Attiya, H., Fouren, A.: Algorithms adapting to point contention. Journal of the ACM 50(4),
444–468 (2003)
5. Attiya, H., Welch, J.: Distributed Computing: Fundamentals, Simulations and Advanced Topics. John Wiley Interscience, Chichester (2004)
6. Bar-Noy, A., Dolev, D.: Shared memory versus message-passing in an asynchronous environment. In:
Proc. 8th ACM Symp. on Principles of Distributed Computing, pp. 307–318 (1989)
7. Bar-Noy, A., Dolev, D.: A partial equivalence between shared-memory and message-passing
in an asynchronous fail-stop distributed environment. Mathematical Systems Theory 26(1),
21–39 (1993)
8. Burns, J., Peterson, G.: The ambiguity of choosing. In: Proc. 8th ACM Symp. on Principles
of Distributed Computing, pp. 145–158 (August 1989)
9. Fischer, M.J., Lynch, N.A., Paterson, M.: Impossibility of distributed consensus with one
faulty process. J. ACM 32(2), 374–382 (1985)
10. Gafni, E.: Group-solvability. In: Proc. 18th International Conference on Distributed
Computing, pp. 30–40 (2004)
11. Gafni, E., Merritt, M., Taubenfeld, G.: The concurrency hierarchy, and algorithms for unbounded concurrency. In: Proc. 20th ACM Symp. on Principles of Distributed Computing,
pp. 161–169 (August 2001)
12. Herlihy, M.: Wait-free synchronization. ACM Transactions on Programming Languages and
Systems 13(1), 124–149 (1991)
13. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects.
ACM Transactions on Programming Languages and Systems 12(3), 463–492 (1990)
14. Inoue, M., Umetani, S., Masuzawa, T., Fujiwara, H.: Adaptive long-lived O(k²)-renaming
with O(k²) steps. In: Welch, J.L. (ed.) DISC 2001. LNCS, vol. 2180, pp. 123–135. Springer,
Heidelberg (2001)
15. Moir, M., Anderson, J.H.: Wait-free algorithms for fast, long-lived renaming. Science of
Computer Programming 25(1), 1–39 (1995)
Otherwise, consider the input instance obtained by adding processors to G until it becomes maximal in size. Notice that the execution sequences α1 and α2 are valid with
respect to the new input instance. In addition, observe that each processor must decide
on the same value as in the former instance. This follows by the assumption that none
of the processors has prior knowledge about the other processors and groups, and thus
each processor cannot distinguish between the two instances. Hence, the initial algorithm state is also multivalent with respect to G in this new instance.
Lemma 7. Every group renaming algorithm admits an input instance for which a critical state with respect to a maximal size group may be reached.
Proof. We prove that every group renaming algorithm which admits an input instance
whose initial algorithm state is multivalent with respect to some group may reach a critical state with respect to that group. Notice that having this claim proved, the lemma
follows as a consequence of Lemma 6. Consider some group renaming algorithm, and
suppose its initial algorithm state is multivalent with respect to group G. Consider the
following sequential execution, starting from this state. Initially, some arbitrary processor executes until it reaches a state where its next operation leaves the algorithm in a
univalent state with respect to G, or until it terminates and decides on a new group name.
Note that the latter case can only happen if the underlying processor is not affiliated with
G. Also note that the processor must eventually reach one of the above-mentioned states
since the algorithm is wait-free and cannot run forever. Later on, another arbitrary processor executes until it reaches a similar state, and so on. This sequential execution
continues until reaching a state in which any step of any active processor is a decision
step with respect to G. Again, since the algorithm cannot run forever, it must eventually
reach such a state, which is, by definition, critical.
We are now ready to prove the impossibility result.
Proof of Theorem 2. Assume that there is a group renaming algorithm implemented
from atomic registers and r-consensus objects, where r < g. We derive a contradiction
by constructing an infinite sequential execution that keeps such an algorithm in a multivalent state with respect to some maximal size group. By Lemma 7, we know that there
is an input instance and a corresponding execution of the algorithm that leads to a critical state s with respect to some group G of size g. Keep in mind that there are at least
g active processors in this critical state since, in particular, all the processors of G are
active. Let p and q be two active processors in the critical state which respectively carry
the algorithm into an x-valent and a y-valent state with respect to G, where x and y
are distinct. We now consider four cases, depending on the nature of the decision steps
taken by the processors:
Case I: One of the processors reads a register. Let us assume without loss of generality that this processor is p. Let s′ be the algorithm state reached if p's read step
is immediately followed by q's step, and let s′′ be the algorithm state following q's
step alone. Notice that s′ and s′′ differ only in the internal state of p. Hence, any processor
p′ ∈ G, other than p, cannot distinguish between these states. Thus, if it executes a solo
run, it must decide on the same value. However, an impossibility follows since s′ is
x-valent with respect to G whereas s′′ is y-valent. This case is schematically described
in Figure 1(a).
Case II: Both processors write to the same register. Let s′ be the algorithm state
reached if p's write step is immediately followed by q's write step, and let s′′ be the algorithm state following q's write step alone. Observe that in the former scenario q overwrites
the value written by p. Hence, s′ and s′′ differ only in the internal state of p. Therefore, any processor p′ ∈ G, other than p, cannot distinguish between these states. The
impossibility follows identically to Case I.
Case III: Each processor writes to or competes for a distinct register or consensus
object. In what follows, we prove impossibility for the scenario in which both processors write to different registers, noting that impossibility for the other scenarios can be
easily established using nearly identical arguments. The algorithm state that results if
p's write step is immediately followed by q's write step is identical to the state which
results if the write steps occur in the opposite order. This is clearly impossible as one
state is x-valent and the other is y-valent. This case is schematically illustrated in Figure 1(b).
Case IV: All active processors compete for the same consensus object. As mentioned above, there are at least g active processors in the critical state. Additionally, we
assumed that the algorithm uses r-consensus objects, where r < g. This implies that
the underlying consensus object is accessed by more processors than its capacity, which
is illegal.
Fig. 1. The decision steps cases: (a) a read step of p followed by a step of q, versus a step of q alone; (b) write steps of p and q to distinct registers, in either order.
Abstract. Consider the problem of scheduling real-time tasks on a multiprocessor with the goal of meeting deadlines. Tasks arrive sporadically
and have implicit deadlines, that is, the deadline of a task is equal
to its minimum inter-arrival time. Consider this problem to be solved
with global static-priority scheduling. We present a priority-assignment
scheme with the property that if at most 38% of the processing capacity
is requested then all deadlines are met.
Introduction
B. Andersson
for static-priority scheduling on a single processor. The success story of static-priority
scheduling on a single processor started with the development of the
rate-monotonic (RM) priority-assignment scheme [4]. It assigns task τj a higher
priority than task τi if Tj < Ti. RM is an optimal priority-assignment scheme,
meaning that for every task set, it holds that if there is an assignment of priorities that causes deadlines to be met, then deadlines are met as well when RM
is used. It is also known [4] that UB_RM = 0.69 for the case that m = 1. This
result is important because it gives designers an intuitive idea of how much a
processor can be utilized without missing a deadline.
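As a quick reminder of this single-processor baseline, the RM order and the Liu and Layland bound n·(2^(1/n) − 1), whose limit is ln 2 ≈ 0.69, can be sketched as follows (function names are ours):

```python
def rm_order(tasks):
    """Rate-monotonic priority order: indices sorted by period T
    (smaller period = higher priority). tasks is a list of (C, T) pairs."""
    return sorted(range(len(tasks)), key=lambda i: tasks[i][1])

def liu_layland_bound(n):
    """Sufficient utilization bound for RM on one processor with n tasks."""
    return n * (2 ** (1.0 / n) - 1)
```

For a single task the bound is 1.0, and it decreases monotonically toward ln 2 ≈ 0.693 as n grows, which is the 0.69 figure quoted above.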
Multiprocessor scheduling algorithms are often categorized as partitioned or
global. Global scheduling stores tasks which have arrived but not finished execution in one queue, shared by all processors. At any moment, the m highest-priority tasks among those are selected for execution on the m processors. In
contrast, partitioned scheduling algorithms partition the task set such that all
tasks in a partition are assigned to the same processor. Tasks may not migrate
from one processor to another. The multiprocessor scheduling problem is thus
transformed to many uniprocessor scheduling problems.
Real-time scheduling on a multiprocessor is much less developed than real-time
scheduling on a single processor, and this applies to static-priority scheduling as well. In particular, it is known that it is impossible to design a partitioned
algorithm with UB > 0.5 [5]. It is also known that for global static-priority
scheduling, RM is not optimal. In fact, global RM can miss a deadline although
Us approaches zero [6]. For a long time, the research community dismissed global
static-priority scheduling for this reason. But later, it was realized that other
priority-assignment schemes (not necessarily RM) can be used for global static-priority scheduling, and the research community developed such schemes. Many
priority-assignment schemes and analysis techniques for global static-priority
scheduling are available (see for example [7, 8, 9, 10]) but so far, only two
priority-assignment schemes, RM-US(m/(3m − 2)) [11] and RM-US(x) [12], have
known (and non-zero) utilization bounds. These two algorithms categorize a task
τi as heavy or light. A task is said to be heavy if Ci/Ti exceeds a certain threshold
number and a task is said to be light otherwise. Heavy tasks are assigned the
highest priority and the light tasks are assigned a lower priority; the relative
priority order among light tasks is given by RM. It was shown that among the
algorithms that separate heavy and light tasks and use RM for light tasks, no
algorithm can achieve a utilization bound greater than 0.374 [12]. And in fact,
the current state of the art offers no algorithm with a utilization bound greater than
0.374.
In this paper, we present a new priority-assignment scheme, SM-US(2/(3 + √5)).
It categorizes tasks as heavy and light and assigns the highest priority to
heavy tasks. The relative priority order of light tasks is given by slack-monotonic
(SM) though, meaning that task τj is assigned higher priority than task τi if Tj
− Cj < Ti − Ci. We prove that the utilization bound of SM-US(2/(3 + √5)) is
2/(3 + √5), which is approximately 0.382.
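The resulting priority assignment can be sketched as follows; this is a sequential sketch with illustrative names, since the paper defines only the priority order, not an implementation:

```python
import math

THRESHOLD = 2 / (3 + math.sqrt(5))  # the SM-US threshold, approx. 0.382

def sm_us_order(tasks):
    """Priority order (highest first) for a list of (C, T) tasks:
    heavy tasks (C/T > THRESHOLD) first, then light tasks ordered
    by increasing slack T - C (slack-monotonic)."""
    heavy = [i for i, (c, t) in enumerate(tasks) if c / t > THRESHOLD]
    light = [i for i, (c, t) in enumerate(tasks) if c / t <= THRESHOLD]
    light.sort(key=lambda i: tasks[i][1] - tasks[i][0])  # smaller slack = higher
    return heavy + light
```

Note the contrast with RM-US: light tasks are compared on slack T − C rather than on period T.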
We consider this result to be significant because (i) the new algorithm SM-US(2/(3 + √5))
breaks free from the performance limitations of the RM-US
framework, (ii) the utilization bound of SM-US(2/(3 + √5)) is higher than the
utilization bound of the previously-known best algorithm in global static-priority
scheduling, and (iii) the utilization bound of SM-US(2/(3 + √5)) is reasonably
Background
2.1
Fig. 1. An example of a task set where RM-US(0.375) performs poorly. All tasks arrive
at time 0. Tasks τ1, τ2, ..., τm are assigned the highest priority and execute on the m
processors during [0, δ). Then the tasks τm+1, τm+2, ..., τ2m execute on the m processors
during [δ, 2δ). The other groups of tasks execute in an analogous manner. Task τn executes
then until time 1. Then the groups of tasks arrive again. The task set meets its deadlines
but an arbitrarily small increase in execution times causes a deadline miss.
Example 2. [Partially taken from [12]]. Figure 1 illustrates the example. Consider
n = m·q + 1 tasks to be scheduled on m processors, where q is a positive integer.
The task τn is characterized by Tn = 1 + y and Cn = 1 − y. The tasks with index
i ∈ {1, 2, ..., n − 1} are organized into groups, where each group comprises m
tasks. One group is the tasks with index i ∈ {1, 2, ..., m}. Another group is
the tasks with index i ∈ {m + 1, m + 2, ..., 2m} and so on. The r:th group
comprises the tasks with index i ∈ {r·m + 1, r·m + 2, ..., r·m + m}. All
tasks belonging to the same group have the same Ti and Ci. Clearly there are
q groups. The tasks in the r:th group have the parameters Ti = 1 + r·δ and
Ci = δ, where δ is selected such that y = q·δ. Hence, specifying m and y gives us the
task set. By letting y = 0.454 and m → ∞ we have a task set where all
tasks are light. The resulting task set is depicted in Figure 1. Also, all tasks meet
their deadlines but an arbitrarily small increase in the execution time of τn causes
a deadline miss.
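A sketch that generates this task set, assuming the parameter reading Ci = δ and Ti = 1 + r·δ with δ = y/q (the function name and tuple layout are ours):

```python
def example2_taskset(m, q, y):
    """Build the n = m*q + 1 tasks of Example 2 as (C, T) pairs:
    q groups of m identical tasks, plus the final task tau_n."""
    delta = y / q                      # chosen so that y = q * delta
    tasks = []
    for r in range(q):                 # r-th group occupies [r*delta, (r+1)*delta)
        tasks += [(delta, 1 + r * delta)] * m
    tasks.append((1 - y, 1 + y))       # tau_n: C = 1 - y, T = 1 + y
    return tasks
```

With these parameters the groups keep all m processors busy during [0, y), after which τn runs during [y, 1), exactly as in Figure 1.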
Lemmas 1–4 state four simple inequalities that we will find useful; their proofs
are available in the Appendix.
Lemma 1. Let m denote a positive integer. Consider ui to be a real number
such that 0 ≤ ui < 2/(3 + √5) and consider S to denote a set of non-negative real
numbers uj such that

  (1/m)·( Σ_{j∈S} uj ) + ui ≤ 2/(3 + √5)    (1)

then it follows that

  (1/m)·( Σ_{j∈S} (2 − ui)·uj ) + ui ≤ 1    (2)
Lemma 2. Consider ui and uj as in Lemma 1. Then:

  ((1 − ui)/(1 − uj))·uj + (1 − ((1 − ui)/(1 − uj))·uj)·uj ≤ (2 − ui)·uj    (3)–(4)

Lemma 3. Consider tasks τi and τj with uj = Cj/Tj and ui = Ci/Ti such that
Tj − Cj ≤ Ti − Ci. Then:

  (Tj/Ti)·uj + (1 − (Tj/Ti)·uj)·uj ≤ ((1 − ui)/(1 − uj))·uj + (1 − ((1 − ui)/(1 − uj))·uj)·uj    (5)
Lemma 4. For every t ≥ 0:

  ⌊t/Tj⌋·Cj + min(t − ⌊t/Tj⌋·Tj, Cj) ≤ Cj + (t − Cj)·(Cj/Tj)    (6)
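Inequality 6 can be spot-checked numerically: demand() below is the standard maximum execution a sporadic task (C, T) can require in a window of length t, and lemma4_bound() is the lemma's linear bound (function names are ours):

```python
import math

def demand(t, C, T):
    """Maximum execution a sporadic task (C, T) can demand in a window of length t."""
    k = math.floor(t / T)               # number of fully contained periods
    return k * C + min(t - k * T, C)    # plus the partial final job

def lemma4_bound(t, C, T):
    """Right-hand side of Inequality 6."""
    return C + (t - C) * (C / T)
```

The linear bound is what makes the later summation over hp(i) tractable in the schedulability analysis.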
Lemma 5. Consider a task set such that

  ∀j: Cj/Tj ≤ m/(2m − 1)    (7)

and

  Σ_j Cj/Tj ≤ m·(m/(2m − 1))    (8)

then

  W(G, τ, [t0, t1)) ≥ Σ_j (t1 − t0 − gap([t0, t1), τj))·(Cj/Tj)    (9)

Proof. From Equation 7 and Equation 8 it follows that the task set can be
scheduled to meet deadlines by OPT on a multiprocessor with m processors of
speed m/(2m − 1). The amount of execution during [t0, t1) is then at least that given by the
right-hand side of Equation 9. And the result by Phillips et al. gives us that
algorithm G also performs at least as much execution during [t0, t1). Hence Equation 9 is true
and it gives us that the lemma is true.
Schedulability analysis. Let t0 denote a time such that no tasks arrive before
t0. Let us consider a time interval that begins at time t0; let [t0, t2) denote this
time interval. We obtain that the amount of execution performed by the task
set during [t0, t2) is at most:

  Σ_j ( ⌊(t2 − t0 − gap([t0, t2), τj))/Tj⌋·Cj + min(t2 − t0 − gap([t0, t2), τj) − ⌊(t2 − t0 − gap([t0, t2), τj))/Tj⌋·Tj, Cj) )    (10)

From Lemma 5 we obtain that the amount of execution performed by the task
set during [t0, t1) is at least:

  Σ_j (t1 − t0 − gap([t0, t1), τj))·(Cj/Tj)    (11)

Let us consider the case that a deadline was missed, and let us consider the earliest
time when a deadline was missed. Let t1 denote the arrival time of the job that
missed this deadline and let τi denote the task that generated this job. Let hp(i)
denote the set of tasks with higher priority than τi. Let t2 denote the deadline
that was missed; that is, t2 = t1 + Ti. Applying Equation 10 and Equation 11 on
hp(i) gives us that the amount of execution by hp(i) during [t1, t2) is at most:

  Σ_{j∈hp(i)} ( ⌊(t2 − t0 − gap([t0, t2), τj))/Tj⌋·Cj + min(t2 − t0 − gap([t0, t2), τj) − ⌊(t2 − t0 − gap([t0, t2), τj))/Tj⌋·Tj, Cj) ) − Σ_{j∈hp(i)} (t1 − t0 − gap([t0, t1), τj))·(Cj/Tj)    (12)
Since gap([t0, t2), τj) = gap([t0, t1), τj) + gap([t1, t2), τj), the first sum of Equation 12
can be rewritten using tj = Ti + t1 − t0 − gap([t0, t1), τj) − gap([t1, t2), τj):

  Σ_{j∈hp(i)} ( ⌊tj/Tj⌋·Cj + min(tj − ⌊tj/Tj⌋·Tj, Cj) ) − Σ_{j∈hp(i)} (t1 − t0 − gap([t0, t1), τj))·(Cj/Tj)    (13)

Applying Lemma 4 on Equation 13 gives us that the amount of execution by
hp(i) during [t1, t2) is at most:

  Σ_{j∈hp(i)} ( Cj + (Ti + t1 − t0 − gap([t0, t1), τj) − gap([t1, t2), τj) − Cj)·(Cj/Tj) ) − Σ_{j∈hp(i)} (t1 − t0 − gap([t0, t1), τj))·(Cj/Tj)    (14)

which simplifies to:

  Σ_{j∈hp(i)} ( Cj + (Ti − gap([t1, t2), τj) − Cj)·(Cj/Tj) )    (15)

Relaxing gives that the amount of execution by tasks in hp(i) during [t1, t2)
is at most:

  Σ_{j∈hp(i)} ( Cj + (Ti − Cj)·(Cj/Tj) )    (16)
From Equation 16 it follows that the amount of time during [t1, t2)
where all processors are busy executing tasks in hp(i) is at most:

  (1/m)·Σ_{j∈hp(i)} ( Cj + (Ti − Cj)·(Cj/Tj) )    (17)
Lemma 6. Consider a task τi under global static-priority scheduling. If

  ∀j ∈ hp(i): Cj/Tj ≤ m/(2m − 1)    (18)

and

  Ci/Ti ≤ m/(2m − 1)    (19)

and

  (1/m)·( Σ_{j∈hp(i)} Cj/Tj ) + Ci/Ti ≤ m/(2m − 1)    (20)

and

  (1/m)·( Σ_{j∈hp(i)} ( Cj + (Ti − Cj)·(Cj/Tj) ) ) + Ci ≤ Ti    (21)

then all deadlines of τi are met.
Section 3.1 presents Slack-monotonic (SM) scheduling and analyzes its performance for restricted task sets (called light tasks). This restriction is then removed
in Section 3.2; the new algorithm is presented and its utilization bound is proven.
3.1 Light Tasks
We say that a task τi is light if Ci/Ti ≤ 2/(3 + √5). We let Slack-Monotonic (SM) denote
a priority assignment scheme which assigns priorities such that task τj is assigned
higher priority than task τi if Tj − Cj < Ti − Ci.
Lemma 7. Consider global static-priority scheduling with SM, and consider a task τi such that

  Cj/Tj ≤ 2/(3 + √5) for every j ∈ hp(i)    (22)

and

  Ci/Ti ≤ 2/(3 + √5)    (23)

and

  (1/m)·( Σ_{j∈hp(i)} Cj/Tj ) + Ci/Ti ≤ 2/(3 + √5)    (24)

Then all deadlines of τi are met.

Proof. Applying Lemma 1 on Inequality 24, with uj = Cj/Tj and ui = Ci/Ti, yields:

  (1/m)·( Σ_{j∈hp(i)} (2 − Ci/Ti)·(Cj/Tj) ) + Ci/Ti ≤ 1    (25)
Applying Lemma 2 on each term of Inequality 25 yields:

  (1/m)·Σ_{j∈hp(i)} ( ((1 − Ci/Ti)/(1 − Cj/Tj))·(Cj/Tj) + (1 − ((1 − Ci/Ti)/(1 − Cj/Tj))·(Cj/Tj))·(Cj/Tj) ) + Ci/Ti ≤ 1    (26)–(27)

Applying Lemma 3 on each term, which is permitted since SM gives Tj − Cj ≤ Ti − Ci for every j ∈ hp(i), yields:

  (1/m)·Σ_{j∈hp(i)} ( (Tj/Ti)·(Cj/Tj) + (1 − (Tj/Ti)·(Cj/Tj))·(Cj/Tj) ) + Ci/Ti ≤ 1    (28)

Multiplying both the left-hand side and the right-hand side of Inequality 28
by Ti and rewriting yields:

  (1/m)·Σ_{j∈hp(i)} ( Cj + (Ti − Cj)·(Cj/Tj) ) + Ci ≤ Ti    (29)

Using Inequality 29 and Lemma 6 gives us that all deadlines of τi are met.
This states the lemma.
Lemma 8. Consider global static-priority scheduling with SM. If it holds for
the task set that

  ∀j: Cj/Tj ≤ 2/(3 + √5)    (30)

and

  Σ_j Cj/Tj ≤ m·(2/(3 + √5))    (31)

then all deadlines are met.
We say that a task is heavy if it is not light. We let the algorithm SM-US(2/(3 + √5))
denote a priority assignment scheme which assigns the highest priority to
heavy tasks and assigns a lower priority to light tasks; the priority order between
light tasks is given by SM.
Theorem 1. Consider global static-priority scheduling with SM-US(2/(3 + √5)). If it holds for the task set that

  ∀j: Cj/Tj ≤ 1    (32)

and

  Σ_j Cj/Tj ≤ m·(2/(3 + √5))    (33)

then all deadlines are met.

Proof. Suppose, towards a contradiction, that a task set τ^failed satisfying Inequalities 32 and 33 misses a deadline when scheduled by SM-US(2/(3 + √5)). Let k denote the number of heavy tasks in τ^failed, and let τ^failed2 denote the task set obtained from τ^failed by deleting its heavy tasks. Since every heavy task has utilization greater than 2/(3 + √5), Inequality 33 gives:

  Σ_{τj∈light(τ^failed)} Cj/Tj ≤ (m − k)·(2/(3 + √5))    (34)
where light(τ^failed) denotes the set of light tasks in τ^failed. Since the light tasks
are the same in τ^failed and τ^failed2 it clearly follows that

  Σ_{τj∈light(τ^failed2)} Cj/Tj ≤ (m − k)·(2/(3 + √5))    (35)

If the task set τ^failed2 would meet all deadlines when scheduled by SM-US(2/(3 + √5))
then it would follow (from the fact that global static-priority
scheduling is predictable) that all deadlines would have been met when τ^failed
was scheduled by SM-US(2/(3 + √5)). Hence it follows that at least one deadline was missed by τ^failed2. And since there are at most k ≤ m − 1 heavy tasks
it follows that no deadline miss occurs for the heavy tasks. Hence it must have
been that a deadline miss occurred from a light task in τ^failed2. But the scheduling of the light tasks in τ^failed2 is identical to what it would have been if we
deleted the heavy tasks in τ^failed2 and deleted k processors. That is, we
have that scheduling the light tasks on m − k processors causes a deadline miss.
But Inequality 35 and Lemma 8 give that no deadline miss occurs. This is a
contradiction. Hence the theorem is correct.
Conclusions
Acknowledgements
This work was partially funded by the Portuguese Science and Technology Foundation (Fundação para a Ciência e a Tecnologia - FCT) and the ARTIST2 Network of Excellence on Embedded Systems Design.
Appendix
Lemma 1. Let m denote a positive integer. Consider u_i to be a real number such that 0 \le u_i < 2/(3+\sqrt{5}), and consider S to denote a set of non-negative real numbers u_j such that

\Big(\sum_{j \in S} u_j\Big) + u_i \le m \cdot \frac{2}{3+\sqrt{5}} ;    (36)

then it follows that

\frac{1}{m}\Big(\sum_{j \in S} (2 - u_i) \cdot u_j\Big) + u_i \le 1 .    (37)
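A brute-force numerical check of this lemma (an illustrative script of ours, not part of the paper):

```python
import math
import random

ALPHA = 2 / (3 + math.sqrt(5))  # 2/(3+sqrt(5)) ~= 0.382

def premise(m, ui, S):
    # Inequality 36: (sum of u_j) + u_i <= m * 2/(3+sqrt(5))
    return sum(S) + ui <= m * ALPHA

def conclusion(m, ui, S):
    # Inequality 37: (1/m) * sum((2 - u_i) * u_j) + u_i <= 1
    return (1 / m) * sum((2 - ui) * uj for uj in S) + ui <= 1 + 1e-9

random.seed(1)
for _ in range(20000):
    m = random.randint(1, 8)
    ui = random.uniform(0, ALPHA)
    S = [random.uniform(0, 1) for _ in range(random.randint(0, 10))]
    if premise(m, ui, S):
        assert conclusion(m, ui, S)
print("no counterexample found")
```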
Proof. Define

g(u_i) \stackrel{def}{=} u_i^2 + \Big(m - 2 - m \cdot \frac{2}{3+\sqrt{5}}\Big) u_i + 2m \cdot \frac{2}{3+\sqrt{5}} - m .    (38)

At the endpoints of the range of u_i we have

g(0) = m\Big(\frac{4}{3+\sqrt{5}} - 1\Big) < 0 \quad\text{and}\quad g\Big(\frac{2}{3+\sqrt{5}}\Big) = \Big(\frac{2}{3+\sqrt{5}}\Big)^2 - \frac{4}{3+\sqrt{5}} < 0 ,    (39)

and since g is a convex quadratic it follows that

g(u_i) \le 0 \text{ for all } 0 \le u_i \le \frac{2}{3+\sqrt{5}} .    (40)

Recall that u_i \le 2/(3+\sqrt{5}), so 2 - u_i > 0. Multiplying Inequality 36 by (2 - u_i) yields

(2 - u_i) \cdot \Big(\big(\sum_{j \in S} u_j\big) + u_i\Big) \le (2 - u_i) \cdot m \cdot \frac{2}{3+\sqrt{5}} .    (41)

Expanding the right-hand side and applying Inequality 40,

\Big(\sum_{j \in S} (2 - u_i) \cdot u_j\Big) + m \cdot u_i \le g(u_i) + m \le m .    (42)

Dividing Inequality 42 by m gives us:

\frac{1}{m}\Big(\sum_{j \in S} (2 - u_i) \cdot u_j\Big) + u_i \le 1 .    (43)
Lemma 2. Consider real numbers u_i and u_j such that 0 \le u_i < 1 and 0 \le u_j < 1. Then it holds that

\frac{1-u_i}{1-u_j} \cdot u_j + \Big(1 - u_j \cdot \frac{1-u_i}{1-u_j}\Big) \cdot u_j \le (2 - u_i) \cdot u_j .    (44)

Proof. The proof is by contradiction. Suppose that the lemma is false. Then we have:

\frac{1-u_i}{1-u_j} \cdot u_j + \Big(1 - u_j \cdot \frac{1-u_i}{1-u_j}\Big) \cdot u_j > (2 - u_i) \cdot u_j .    (45)

Let us explore the following cases.

1. u_i = 0 and u_j = 0. Applying this case on Inequality 45 gives us:

0 > 0 ,    (46)

a contradiction.

2. u_i = 0 and u_j \ne 0. Applying this case on Inequality 45 gives us:

\frac{1}{1-u_j} \cdot u_j + \Big(1 - u_j \cdot \frac{1}{1-u_j}\Big) \cdot u_j > 2 \cdot u_j .    (47)

Dividing by u_j > 0 and simplifying:

\frac{1}{1-u_j} + 1 - \frac{u_j}{1-u_j} > 2    (48)

\frac{1-u_j}{1-u_j} + 1 > 2    (49)

2 > 2 ,    (50)

a contradiction.

3. u_i \ne 0 and u_j = 0. Applying this case on Inequality 45 gives us 0 > 0,    (51) a contradiction.

4. u_i \ne 0 and u_j \ne 0. Dividing Inequality 45 by u_j > 0 and simplifying:

\frac{1-u_i}{1-u_j} + 1 - u_j \cdot \frac{1-u_i}{1-u_j} > 2 - u_i    (52)

(1-u_i) \cdot \frac{1-u_j}{1-u_j} > 1 - u_i    (53)

1 > 1 ,    (54)

a contradiction.

Since every case ends in a contradiction, the lemma follows.
Lemma 3. Consider tasks with utilizations u_i and u_j, 0 \le u_i, u_j < 1, and periods such that T_j \le T_i. Then

\frac{T_j}{T_i} \cdot \frac{1-u_i}{1-u_j} \cdot u_j + \Big(1 - u_j \cdot \frac{T_j}{T_i} \cdot \frac{1-u_i}{1-u_j}\Big) \cdot u_j \le (2 - u_i) \cdot u_j .    (56)

Proof. Since T_j \le T_i it holds that

\frac{T_j}{T_i} \cdot \frac{1-u_i}{1-u_j} \le \frac{1-u_i}{1-u_j} .    (57)
Let qi,j denote the left-hand side of Inequality 57. There are two occurrences
qi,j in the left-hand side of Inequality 56. Also observe that the left-hand side
of Inequality 56 is increasing with increasing qi,j . For this reason, combining
Inequality 57 and the left-hand side of inequality 56 gives us that the lemma is
true.
Lemma 4. Consider two integers T_j and C_j such that 0 \le C_j \le T_j. For every t > 0 it holds that:

\Big\lfloor \frac{t}{T_j} \Big\rfloor \cdot C_j + \min\Big(t - \Big\lfloor \frac{t}{T_j} \Big\rfloor \cdot T_j,\; C_j\Big) \;\le\; C_j + (t - C_j) \cdot \frac{C_j}{T_j} .    (58)
Proof. The proof is by contradiction. Suppose that the lemma is false. Then there is a t > 0 such that:

\Big\lfloor \frac{t}{T_j} \Big\rfloor \cdot C_j + \min\Big(t - \Big\lfloor \frac{t}{T_j} \Big\rfloor \cdot T_j,\; C_j\Big) \;>\; C_j + (t - C_j) \cdot \frac{C_j}{T_j} .    (59)
2. t - \lfloor t/T_j \rfloor \cdot T_j \ge C_j.
Let \Delta be defined as: \Delta = (t - \lfloor t/T_j \rfloor \cdot T_j) - C_j. Let us decrease t by \Delta. Then the left-hand side of Inequality 59 is unchanged and the right-hand side decreases by (C_j/T_j) \cdot \Delta. Since 0 \le C_j/T_j it follows that Inequality 59 is still true. That is:

\Big\lfloor \frac{t}{T_j} \Big\rfloor \cdot C_j + \min\Big(t - \Big\lfloor \frac{t}{T_j} \Big\rfloor \cdot T_j,\; C_j\Big) \;>\; C_j + (t - C_j) \cdot \frac{C_j}{T_j} .    (63)
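Inequality 58 is easy to check numerically; the following illustrative script (ours, not the paper's) samples it over a small grid:

```python
import math

def lhs(t, C, T):
    # floor(t/T)*C + min(t - floor(t/T)*T, C): the left-hand side of (58)
    k = math.floor(t / T)
    return k * C + min(t - k * T, C)

def rhs(t, C, T):
    # C + (t - C)*(C/T): the linear bound on the right-hand side of (58)
    return C + (t - C) * (C / T)

# brute-force check of Lemma 4 over a small grid of integer (C, T) pairs
for T in range(1, 12):
    for C in range(0, T + 1):
        for step in range(1, 400):
            t = step * 0.25
            assert lhs(t, C, T) <= rhs(t, C, T) + 1e-9
print("Inequality 58 holds on the sampled grid")
```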
Abstract. The scheduling of sporadic task systems upon uniform multiprocessor platforms using the global Deadline Monotonic algorithm is studied. A sufficient schedulability test is presented and proved correct. It is shown that this test offers non-trivial quantitative guarantees, in the form of a processor speedup bound.
Introduction
A multiprocessor computer platform is comprised of several processors. A platform in which all the processors have the same capabilities is referred to as an identical multiprocessor, while those in which different processors have different capabilities are called heterogeneous multiprocessors. Heterogeneous multiprocessors may be further classified into uniform and unrelated multiprocessors. The only difference between the different processors in a uniform multiprocessor is the rate at which they can execute work: each processor is characterized by a speed or computing capacity parameter s, and any job executing on the processor for t time units completes t × s units of execution. In unrelated multiprocessors, on the other hand, the amount of execution completed by a particular job executing on a given processor depends upon the identities of both the job and the processor.
A real-time system is often modeled as a finite collection of independent recurrent tasks, each of which generates a potentially infinite sequence of jobs. Every job is characterized by an arrival time, an execution requirement, and a deadline, and it is required that a job completes execution between its arrival time and its deadline. Different formal models for recurring tasks place different restrictions on the values of the parameters of jobs generated by each task. One of the more commonly used formal models is the sporadic task model [1, 2]. Each recurrent task τi in this model is characterized by three parameters: τi = (Ci, Di, Ti), with the interpretation that τi may generate an infinite sequence of jobs with successive jobs arriving at least Ti time units apart, each with an execution
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 89–104, 2008.
© Springer-Verlag Berlin Heidelberg 2008
requirement at most Ci and a deadline Di time units after its arrival time. A sporadic task system is comprised of a finite collection of such sporadic tasks. Sporadic task systems in which each task is required to have its relative deadline and period parameters the same (Di = Ti for all i) are called implicit-deadline task systems, and ones in which each task is required to have its relative deadline be no larger than its period parameter (Di ≤ Ti for all i) are called constrained-deadline task systems. A task system that is not constrained-deadline is said to be an arbitrary-deadline task system.
Several results have been obtained over the past decade concerning the scheduling of implicit-deadline systems on identical [3, 4, 5, 6, 7, 8, 9] and on uniform [10,11,12,13,14,15,16,17] multiprocessors, of constrained-deadline systems on identical multiprocessors [18, 19, 20, 21, 22, 23, 24], and of arbitrary-deadline systems on identical multiprocessors [25, 26]. This paper seeks to extend this body of work by addressing the scheduling of constrained- and arbitrary-deadline sporadic task systems upon uniform multiprocessors. We assume that the platform is fully preemptive: an executing job may be interrupted at any instant in time and have its execution resumed later with no cost or penalty. We study the behavior of the well-known and very widely used Deadline Monotonic scheduling algorithm [27] when scheduling systems of sporadic tasks upon such preemptive platforms. We will refer to Deadline Monotonic scheduling with global interprocessor migration as global dm (or simply dm).
Contributions. We obtain a new test (to our knowledge, the first such test) for determining whether a given constrained- or arbitrary-deadline sporadic task system is guaranteed to meet all deadlines upon a specified uniform multiprocessor platform when scheduled using dm. This test is derived by applying techniques that have previously been used for the schedulability analysis of constrained-deadline task systems on uniform multiprocessors scheduled using edf [28], and by integrating techniques used for the schedulability analysis of sporadic arbitrary-deadline systems on identical multiprocessors using dm [25].
Organization. The remainder of this paper is organized as follows. In Sect. 2 we formally define the sporadic task model and uniform multiprocessor platforms, and provide some additional useful definitions, notation, and terminology concerning sporadic tasks and uniform multiprocessors. We also provide a specification of how global dm is to be implemented upon uniform multiprocessors. In Sect. 3 we derive, and prove the correctness of, a schedulability test for determining whether a given sporadic task system is dm-schedulable on a specified uniform multiprocessor platform. In Sect. 4 we provide a quantitative characterization of the efficacy of this new schedulability test in terms of the resource augmentation metric.
Density: The density δi of a task τi is the ratio (Ci / min(Di, Ti)) of its execution requirement to the smaller of its relative deadline and its period. The total density δsum(τ) of a task system τ is defined as follows:

\delta_{sum}(\tau) \stackrel{def}{=} \sum_{\tau_i \in \tau} \delta_i .

For each k, 1 \le k \le n, \delta_{max}(k) denotes the largest density from among the tasks τ1, τ2, ..., τk:

\delta_{max}(k) \stackrel{def}{=} \max_{1 \le i \le k} \delta_i .
DBF: For any interval length t, the demand bound function dbf(τi, t) of a sporadic task τi bounds the maximum cumulative execution requirement by jobs of τi that both arrive in, and have deadlines within, any interval of length t. It has been shown [2] that

dbf(\tau_i, t) = \max\Big(0, \Big(\Big\lfloor \frac{t - D_i}{T_i} \Big\rfloor + 1\Big) \cdot C_i\Big) .
Load: A load parameter, based upon the dbf function, may be defined for any sporadic task system as follows:

load(k) \stackrel{def}{=} \max_{t > 0} \Big( \frac{\sum_{i=1}^{k} dbf(\tau_i, t)}{t} \Big) .
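The three quantities defined above can be computed directly. The following Python helpers are an illustrative sketch of ours (not code from the paper); load is evaluated by sampling t at the job deadlines D_i + j·T_i, which is where the maximum of the ratio is attained:

```python
import math

def density(C, D, T):
    # delta_i = C_i / min(D_i, T_i)
    return C / min(D, T)

def dbf(C, D, T, t):
    # demand bound function of a sporadic task (C, D, T) over a window t
    return max(0, (math.floor((t - D) / T) + 1) * C)

def load(tasks, horizon):
    # load = max over t of sum_i dbf(tau_i, t) / t; between deadlines the
    # numerator is constant while t grows, so sampling t = D + j*T suffices
    points = {D + j * T for (C, D, T) in tasks
              for j in range(int(horizon // T) + 1)}
    return max(sum(dbf(C, D, T, t) for (C, D, T) in tasks) / t
               for t in sorted(points) if 0 < t <= horizon)

tasks = [(2, 5, 10), (3, 10, 10)]   # (C, D, T) triples
print(density(2, 5, 10))            # 0.4
print(dbf(2, 5, 10, 15))            # 4
print(load(tasks, horizon=100))     # 0.5
```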
Computing dbf (and thereby, load) will turn out to be a critical component of the schedulability analysis test proposed in this paper; hence, it is important that dbf be efficiently computable if this schedulability test is to be efficiently implementable as claimed. Fortunately, computing dbf is a well-studied subject, and algorithms are known for computing dbf exactly [2, 29], or approximately to any arbitrary degree of accuracy [30, 31, 32].
The following lemma relates the density of a task to its dbf:

Lemma 1 ([25]). For all tasks τi and for all t \ge 0,

t \cdot \delta_i \ge dbf(\tau_i, t) .
In constrained task systems (those in which Di ≤ Ti for all i) a job becomes eligible to execute upon arrival, and remains eligible until it completes execution¹. In systems with Di > Ti for some tasks τi, we require that at most one job of each task be eligible to execute at each time instant. We assume that jobs of the same task are considered in first-come first-served order; hence, a job only becomes eligible to execute after both these conditions are satisfied: (i) it has arrived, and (ii) all previous jobs generated by the same task that generated it have completed execution. This gives rise to the notion of an active task: briefly, a task is active at some instant if it has some eligible job awaiting execution at that instant. More formally,

Definition 1 (active task). A task is said to be active in a given schedule at a time-instant t if some job of the task is eligible to execute at time-instant t. That is, (i) t is no smaller than the greater of the job's arrival time and the completion time of the previous job of the same task, and (ii) the job has not completed execution prior to time-instant t.
2. Uniform multiprocessors. A uniform multiprocessor π = (s1, s2, ..., sm) is comprised of m > 1 processors, with the ith processor characterized by speed or computing capacity si. The interpretation is that a job executing on the ith processor for a duration of t units of time completes t × si units of execution. Without loss of generality, we assume that the speeds are specified in non-increasing order: si ≥ si+1 for all i. We will also use the following notation:

S_i(\pi) \stackrel{def}{=} \sum_{j=1}^{i} s_j .    (1)

That is, Si(π) denotes the sum of the computing capacities of the i fastest processors in π (and Sm(π) hence denotes the total computing capacity of π).
An additional parameter that turns out to be useful in describing the properties of a uniform multiprocessor is the lambda parameter [12, 10]:

\lambda(\pi) \stackrel{def}{=} \max_{1 \le i \le m} \Big( \frac{\sum_{j=i+1}^{m} s_j}{s_i} \Big) .    (2)
¹ Or its deadline has elapsed, in which case the system is deemed to have failed.
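Equations 1 and 2 translate directly into code; a small illustrative sketch of ours:

```python
def S(pi, i):
    # S_i(pi): total capacity of the i fastest processors (Eq. 1)
    return sum(pi[:i])

def lam(pi):
    # lambda(pi): max over i of (capacity of processors slower than i) / s_i (Eq. 2)
    return max(sum(pi[i + 1:]) / pi[i] for i in range(len(pi)))

pi = (4, 4, 2, 2)           # speeds in non-increasing order
print(S(pi, len(pi)))       # total capacity S_m(pi) = 12
print(lam(pi))              # 2.0; for identical speeds lam equals m - 1
```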
[Figure 1: a timeline from t_o and t_a to t_d, with an interval of length D_k ending in a deadline miss at t_d.]
Fig. 1. Notation. A job of task τk arrives at t_a. Task τk is not active immediately prior to t_a, and is continually active over [t_a, t_d).
This parameter ranges in value between 0 and (m - 1) for an m-processor uniform multiprocessor platform, with a value of (m - 1) corresponding to the degenerate case when all the processors are of the same speed (i.e., the platform is an identical multiprocessor).
3. Deadline Monotonic scheduling. Priority-driven scheduling algorithms operate on uniform multiprocessors as follows: at each instant in time they assign a priority to each job that is awaiting execution, and favor for execution the jobs with the greatest priorities. Specifically, (i) no processor is idled while there is an active job awaiting execution; (ii) when there are fewer active jobs than processors, the jobs execute on the fastest processors and the slowest ones are idled; and (iii) greater-priority jobs execute on the faster processors. The Deadline Monotonic (dm) scheduling algorithm [33] is a priority-driven scheduling algorithm that assigns priority to tasks according to their (relative) deadlines: the smaller the deadline, the greater the priority.
With respect to a specified platform, a given sporadic task system is said to
be feasible if there exists a schedule meeting all deadlines for every collection of
jobs that may be generated by the task system. A given sporadic task system is
said to be (global) dm schedulable if dm meets all deadlines for every collection
of jobs that may be generated by the task system.
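The dispatching rules (i)-(iii) for a single scheduling instant can be sketched as follows (a hypothetical helper of ours, not an implementation from the paper):

```python
def dm_dispatch(jobs, speeds):
    """One-instant dispatch under global DM on a uniform platform.

    jobs: list of (relative_deadline, job_id) pairs awaiting execution;
    speeds: processor speeds in non-increasing order.
    Returns (job_id, speed) pairs: the highest-priority jobs (smallest
    relative deadline) run on the fastest processors; if there are fewer
    jobs than processors, the slowest processors are left idle.
    """
    ranked = sorted(jobs)  # smaller deadline => higher priority
    return [(job_id, s) for (_, job_id), s in zip(ranked, speeds)]

print(dm_dispatch([(10, "J1"), (3, "J2"), (7, "J3")], (4, 2, 1, 1)))
# [('J2', 4), ('J3', 2), ('J1', 1)]: one speed-1 processor is idled
```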
C \ge \sum_{\ell=1}^{m-1} s_\ell \cdot I_\ell ,    (4)

where I_\ell denotes the total duration over [t_a, t_d) for which exactly \ell processors are busy, and C denotes the amount of execution received by τk's jobs over [t_a, t_d) (a job of τk must be executing whenever some processor is idle). Since S_m(\pi) \cdot D - \sum_{\ell=1}^{m-1} (S_m(\pi) - S_\ell(\pi)) \cdot I_\ell denotes the total amount of execution completed over [t_a, t_d) and this is not enough for τk's jobs to complete the execution requirement C before t_d, we have the following relationship:
\sum_{\ell=1}^{m-1} (S_m(\pi) - S_\ell(\pi)) \cdot I_\ell = \sum_{\ell=1}^{m-1} s_\ell \cdot \frac{S_m(\pi) - S_\ell(\pi)}{s_\ell} \cdot I_\ell \le \lambda(\pi) \cdot \sum_{\ell=1}^{m-1} s_\ell \cdot I_\ell ,

and hence the total amount of execution completed over [t_a, t_d) is at least

S_m(\pi) \cdot D - \lambda(\pi) \cdot \sum_{\ell=1}^{m-1} s_\ell \cdot I_\ell .    (5)

\nu_k \stackrel{def}{=} \lambda(\pi) \cdot \delta_{max}(k) + load(k) ;    (6)

observe that the value of ν_k depends upon the parameters of both the task system τ being scheduled, and the uniform multiprocessor π = (s1, s2, ..., sm) upon which it is scheduled.
Let t_o denote the smallest value of t \le t_a such that W(t)/(t_d - t) \ge \nu_k. Let \Delta \stackrel{def}{=} t_d - t_o (see Fig. 1).
By definition, W(t_o) denotes the amount of work that the dm schedule needs (but fails) to execute over [t_o, t_d). This work in W(t_o) arises from two sources: those jobs that arrived at or after t_o, and those that arrived prior to t_o but have not completed execution in the dm schedule by time-instant t_o. We will refer to jobs arriving prior to t_o that need execution over [t_o, t_d) as carry-in jobs.
We wish to obtain an upper bound on the total contribution of all the carry-in jobs to the W(t_o) term. We achieve this in two steps: we first bound the number of tasks that may have carry-in jobs (Lemma 2), and then we bound the amount of work that all the carry-in jobs of any one such task may contribute to W(t_o) (Lemma 3).
Lemma 2. The number of tasks that have carry-in jobs is strictly bounded from above by

\mu_k \stackrel{def}{=} \max\big\{ \ell : S_\ell(\pi) < \nu_k \big\} .    (7)
Proof. Let ε denote an arbitrarily small positive number. By definition of the instant t_o, W(t_o - ε)/(t_d - t_o + ε) < \nu_k while W(t_o)/(t_d - t_o) \ge \nu_k. It must therefore be the case that strictly less than \nu_k \cdot ε work was executed over [t_o - ε, t_o); i.e., the total computing capacity of all the busy processors over [t_o - ε, t_o) is < \nu_k. And since
[Figure 2: a timeline showing three jobs of τi arriving before t_o, with t_i marked at the arrival of the first job.]
Fig. 2. Example: defining t_i for a task τi with D_i \ge T_i. Three jobs of τi are shown. Task τi is not active prior to the arrival of the first of these 3 jobs, and the first job completes execution only after the next job arrives. This second job does not complete execution prior to t_o. Thus, the task is continuously active after the arrival of the first job shown, and t_i is hence set equal to the arrival time of this job.
\nu_k < S_m(\pi) (as can be seen from (6) above), it follows that some processor was idled over [t_o - ε, t_o), implying that all jobs active at this time would have been executing. This allows us to conclude that there are strictly fewer than \mu_k tasks with carry-in jobs.
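Equation 7 can be evaluated directly; a small illustrative helper of ours (the value ν_k is assumed to be given):

```python
def mu(speeds, nu):
    # mu_k = max { l : S_l(pi) < nu_k } (Eq. 7): an upper bound on the
    # number of tasks with carry-in jobs; speeds are non-increasing.
    best = 0
    total = 0.0
    for l, s in enumerate(speeds, start=1):
        total += s
        if total < nu:
            best = l
    return best

print(mu([4, 4, 2, 2], 9.0))   # S_1=4<9 and S_2=8<9, but S_3=10>=9 -> 2
```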
Lemma 3. The total remaining execution requirement of all the carry-in jobs of each task τi (that has carry-in jobs at time-instant t_o) is < \delta_{max}(k) \cdot \Delta.

Proof. Let us consider some task τi (i < k) that has a carry-in job. Let t_i < t_o denote the earliest time-instant such that τi is active throughout the interval [t_i, t_o]. Observe that t_i is necessarily the arrival time of some job of τi. If D_i < T_i, then t_i is the arrival time of the (sole) carry-in job of τi. If D_i \ge T_i, however, t_i may be the arrival-time of a job that is not a carry-in job (see Fig. 2).
Let \Phi_i \stackrel{def}{=} t_o - t_i (see Fig. 2). All the carry-in jobs of τi have their arrival-times and their deadlines within the (\Phi_i + \Delta)-sized interval [t_i, t_d), and consequently their cumulative execution requirement is \le dbf(τ_i, \Phi_i + \Delta); in what follows, we will quantify how much of this must have been completed prior to t_o (and hence cannot contribute to the carry-in). We thus obtain an upper bound on the total work that all the carry-in jobs of τi contribute to W(t_o), as the difference between dbf(τ_i, \Phi_i + \Delta) and the amount of execution received by τi over [t_i, t_o).
By definition of t_o, it must be the case that W(t_i)/(t_d - t_i) < \nu_k. That is,

W(t_i) < \nu_k \cdot (\Delta + \Phi_i) .    (8)

Also by definition of t_o, W(t_o)/(t_d - t_o) \ge \nu_k; i.e.,

W(t_o) \ge \nu_k \cdot \Delta .    (9)
Let C_i denote the amount of execution received by τi's carry-in jobs over the duration [t_i, t_o); the difference dbf(τ_i, \Phi_i + \Delta) - C_i thus denotes an upper bound on the amount of carry-in execution. Let J_\ell denote the total duration over [t_i, t_o) for which exactly \ell processors are busy in this dm schedule, 0 \le \ell \le m. Observe that the amount of execution that τi's carry-in jobs receive over [t_i, t_o) is at least \sum_{\ell=1}^{m-1} s_\ell \cdot J_\ell, since τi's job must be executing on one of the processors during any instant when some processor is idle; therefore

C_i \ge \sum_{\ell=1}^{m-1} s_\ell \cdot J_\ell .    (10)
=1
m1
Since Sm ()i =1 J (Sm () S ())
denotes the total
amount of execution completed over [ti , to ), the dierence W (ti ) W (to ) the amount of
execution completed over [ti , to ) is given by
W (ti ) W (to ) = Sm ()i
m1
(Sm () S ())J
=1
k ( + i ) k >
m1
Sm ()i
(Sm () S ())J
=1
m1
Sm () S ()
s J
s
=1
k i > Sm ()i
m1
()s J
=1
k i > Sm ()i ()
k i > Sm ()i
m1
s J
=1
()Ci .
98
(12)
(13)
(since \Delta \ge D_k)
A Speedup Bound

If τ is feasible upon π^x (the platform π with each speed multiplied by x), then

\delta_{max}(k) \le x \cdot s_1 \quad\text{and}\quad load(k) \le S_m(\pi) \cdot x    (14)

for all k, 1 \le k \le n.
Proof. Suppose that task system τ is feasible upon π^x. To prove that \delta_{max}(k) \le x s_1, consider each task τi separately:
- In order to be able to meet all deadlines of τi if τi generates jobs exactly Ti time units apart, it is necessary that Ci/Ti \le x s_1.
- Since any individual job of τi can receive at most Di · x · s_1 units of execution by its deadline, we must have Ci \le Di · x · s_1; i.e., Ci/Di \le x s_1.
Putting both conditions together, we get (Ci / min(Ti, Di)) \le x s_1. Taken over all the tasks τ1, τ2, ..., τk, this observation yields the condition that \delta_{max}(k) \le x s_1.
To prove that load(k) \le S_m(\pi) \cdot x, recall the definition of load(k) from Sect. 2. Let t* denote some value of t which defines load(k):

t^* \stackrel{def}{=} \arg\max_{t>0} \Big( \frac{\sum_{i=1}^{k} dbf(\tau_i, t)}{t} \Big) .

Suppose that all tasks in {τ1, τ2, ..., τk} generate a job at time-instant zero, and each task τi generates subsequent jobs exactly Ti time units apart. The total amount of execution that is available over the interval [0, t*) on this platform is equal to S_m(\pi) \cdot x \cdot t^*; hence, it is necessary that load(k) \le S_m(\pi) \cdot x if all deadlines are to be met.
Lemma 5. Any sporadic task system that is feasible upon a multiprocessor platform π^x is determined to be global-dm schedulable on π by the dm-schedulability test of Theorem 1, provided

x \le \frac{S_m(\pi)s_1 + 2S_m(\pi)s_m + s_1 s_m - \big(S_m^2(\pi)s_1^2 + 4S_m^2(\pi)s_1 s_m + 2S_m(\pi)s_1^2 s_m + 4S_m^2(\pi)s_m^2 + 4 s_1 s_m^2 S_m(\pi) + s_1^2 s_m^2 - 4 s_1^2 S_m(\pi) s_m\big)^{1/2}}{2 s_1^2} .    (15)
By the dm-schedulability test of Theorem 1 and (6),

load(k) \le \frac{1}{2} \big(S_m(\pi) - \delta_{max}(k)\big)\Big(1 - \frac{\delta_{max}(k)}{s_m}\Big) .

Substituting the bounds \delta_{max}(k) \le s_1 x and load(k) \le S_m(\pi) x from (14), it suffices that

S_m(\pi)\, x \le \frac{1}{2} \big(S_m(\pi) - s_1 x\big)\Big(1 - \frac{s_1 x}{s_m}\Big)

\Longleftrightarrow\quad S_m(\pi)\, x \le \frac{1}{2}\Big(S_m(\pi) - \frac{S_m(\pi) s_1 x}{s_m} - s_1 x + \frac{s_1^2 x^2}{s_m}\Big) ,

and solving this quadratic in x yields the bound (15).
The following table shows, for some example platforms, the resulting speedup bound:

m     Sm(π)   λ(π)     s1   sm   speedup
4     12      2        4    2    4.59
20    60      14       4    2    4.84
100   300     74       4    2    4.89
1000  3000    749      4    2    4.90
4     15      0.875    8    1    10.42
20    75      8.375    8    1    10.81
100   375     45.875   8    1    10.89
1000  3750    467.75   8    1    10.91
4.1 Conclusions
Most research on multiprocessor real-time scheduling has focused on the simplest model: systems of implicit-deadline tasks that are scheduled on identical
References
1. Mok, A.K.: Fundamental Design Problems of Distributed Systems for The
Hard-Real-Time Environment. PhD thesis, Laboratory for Computer Science, Massachusetts Institute of Technology, Available as Technical Report
No. MIT/LCS/TR-297 (1983)
2. Baruah, S., Mok, A., Rosier, L.: Preemptively scheduling hard-real-time sporadic
tasks on one processor. In: Proceedings of the 11th Real-Time Systems Symposium,
Orlando, Florida, pp. 182–190. IEEE Computer Society Press, Los Alamitos (1990)
3. Baruah, S., Cohen, N., Plaxton, G., Varvel, D.: Proportionate progress: A notion
of fairness in resource allocation. Algorithmica 15(6), 600–625 (1996)
4. Oh, D.I., Baker, T.P.: Utilization bounds for N-processor rate monotone scheduling with static processor assignment. Real-Time Systems: The International Journal of Time-Critical Computing 15, 183–192 (1998)
5. Lopez, J.M., Garcia, M., Diaz, J.L., Garcia, D.F.: Worst-case utilization bound for
EDF scheduling in real-time multiprocessor systems. In: Proceedings of the EuroMicro Conference on Real-Time Systems, Stockholm, Sweden, pp. 25–34. IEEE
Computer Society Press, Los Alamitos (2000)
6. Andersson, B., Jonsson, J.: Fixed-priority preemptive multiprocessor scheduling:
To partition or not to partition. In: Proceedings of the International Conference
on Real-Time Computing Systems and Applications, Cheju Island, South Korea,
pp. 337–346. IEEE Computer Society Press, Los Alamitos (2000)
7. Andersson, B., Baruah, S., Jonsson, J.: Static-priority scheduling on multiprocessors. In: Proceedings of the IEEE Real-Time Systems Symposium, pp. 193–202.
IEEE Computer Society Press, Los Alamitos (2001)
8. Goossens, J., Funk, S., Baruah, S.: Priority-driven scheduling of periodic task systems on multiprocessors. Real-Time Systems 25(2–3), 187–205 (2003)
9. Lopez, J.M., Diaz, J.L., Garcia, D.F.: Utilization bounds for EDF scheduling on
real-time multiprocessor systems. Real-Time Systems: The International Journal
of Time-Critical Computing 28(1), 39–68 (2004)
10. Funk, S.H.: EDF Scheduling on Heterogeneous Multiprocessors. PhD thesis, Department of Computer Science, The University of North Carolina at Chapel Hill
(2004)
103
11. Baruah, S.: Scheduling periodic tasks on uniform processors. In: Proceedings of the
EuroMicro Conference on Real-Time Systems, Stockholm, Sweden, pp. 7–14 (June
2000)
12. Funk, S., Goossens, J., Baruah, S.: On-line scheduling on uniform multiprocessors.
In: Proceedings of the IEEE Real-Time Systems Symposium, pp. 183–192. IEEE
Computer Society Press, Los Alamitos (2001)
13. Baruah, S., Goossens, J.: Rate-monotonic scheduling on uniform multiprocessors.
IEEE Transactions on Computers 52(7), 966–970 (2003)
14. Funk, S., Baruah, S.: Task assignment on uniform heterogeneous multiprocessors.
In: Proceedings of the EuroMicro Conference on Real-Time Systems, Palma de
Mallorca, Balearic Islands, Spain, pp. 219–226. IEEE Computer Society Press, Los
Alamitos (2005)
15. Darera, V.N., Jenkins, L.: Utilization bounds for RM scheduling on uniform multiprocessors. In: RTCSA 2006: Proceedings of the 12th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, Washington, DC, USA, pp. 315–321. IEEE Computer Society, Los Alamitos (2006)
16. Andersson, B., Tovar, E.: Competitive analysis of partitioned scheduling on uniform
multiprocessors. In: Proceedings of the Workshop on Parallel and Distributed Real-Time Systems, Long Beach, CA (March 2007)
17. Andersson, B., Tovar, E.: Competitive analysis of static-priority scheduling on
uniform multiprocessors. In: Proceedings of the IEEE International Conference on
Embedded and Real-Time Computing Systems and Applications, Daegu, Korea.
IEEE Computer Society Press, Los Alamitos (2007)
18. Baker, T.: Multiprocessor EDF and deadline monotonic schedulability analysis.
In: Proceedings of the IEEE Real-Time Systems Symposium, pp. 120–129. IEEE
Computer Society Press, Los Alamitos (2003)
19. Baker, T.P.: An analysis of EDF schedulability on a multiprocessor. IEEE Transactions on Parallel and Distributed Systems 16(8), 760–768 (2005)
20. Bertogna, M., Cirinei, M., Lipari, G.: Improved schedulability analysis of EDF on
multiprocessor platforms. In: Proceedings of the EuroMicro Conference on Real-Time Systems, Palma de Mallorca, Balearic Islands, Spain, pp. 209–218. IEEE
Computer Society Press, Los Alamitos (2005)
21. Bertogna, M., Cirinei, M., Lipari, G.: New schedulability tests for real-time task
sets scheduled by deadline monotonic on multiprocessors. In: Proceedings of the 9th
International Conference on Principles of Distributed Systems, Pisa, Italy. IEEE
Computer Society Press, Los Alamitos (2005)
22. Cirinei, M., Baker, T.P.: EDZL scheduling analysis. In: Proceedings of the EuroMicro Conference on Real-Time Systems, Pisa, Italy. IEEE Computer Society Press,
Los Alamitos (2007)
23. Fisher, N.: The Multiprocessor Real-Time Scheduling of General Task Systems.
PhD thesis, Department of Computer Science, The University of North Carolina
at Chapel Hill (2007)
24. Baruah, S., Baker, T.: Schedulability analysis of global EDF. Real-Time Systems
(to appear, 2008)
25. Baruah, S., Fisher, N.: Global deadline-monotonic scheduling of arbitrary-deadline
sporadic task systems. In: Tovar, E., Tsigas, P., Fouchal, H. (eds.) OPODIS 2007.
LNCS, vol. 4878, pp. 204–216. Springer, Heidelberg (2007)
26. Baruah, S., Baker, T.: Global EDF schedulability analysis of arbitrary sporadic
task systems. In: Proceedings of the EuroMicro Conference on Real-Time Systems,
Prague, Czech Republic. IEEE Computer Society Press, Los Alamitos (2008)
27. Leung, J., Whitehead, J.: On the complexity of fixed-priority scheduling of periodic, real-time tasks. Performance Evaluation 2, 237–250 (1982)
28. Baruah, S., Goossens, J.: The EDF scheduling of sporadic task systems on uniform
multiprocessors. Technical report, University of North Carolina at Chapel Hill
(2008)
29. Ripoll, I., Crespo, A., Mok, A.K.: Improvement in feasibility testing for real-time
tasks. Real-Time Systems: The International Journal of Time-Critical Computing 11, 19–39 (1996)
30. Baker, T.P., Fisher, N., Baruah, S.: Algorithms for determining the load of a sporadic task system. Technical Report TR-051201, Department of Computer Science,
Florida State University (2005)
31. Fisher, N., Baruah, S., Baker, T.: The partitioned scheduling of sporadic tasks
according to static priorities. In: Proceedings of the EuroMicro Conference on Real-Time Systems, Dresden, Germany. IEEE Computer Society Press, Los Alamitos
(2006)
32. Fisher, N., Baker, T., Baruah, S.: Algorithms for determining the demand-based
load of a sporadic task system. In: Proceedings of the International Conference on
Real-time Computing Systems and Applications, Sydney, Australia. IEEE Computer Society Press, Los Alamitos (2006)
33. Liu, C., Layland, J.: Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the ACM 20(1), 46–61 (1973)
A Comparison of the
M-PCP, D-PCP, and FMLP on LITMUSRT
Björn B. Brandenburg and James H. Anderson
The University of North Carolina at Chapel Hill
Dept. of Computer Science
Chapel Hill, NC 27599-3175 USA
{bbb,anderson}@cs.unc.edu
Abstract. This paper presents a performance comparison of three multiprocessor real-time locking protocols: the multiprocessor priority ceiling protocol (M-PCP), the distributed priority ceiling protocol (D-PCP),
and the flexible multiprocessor locking protocol (FMLP). In the FMLP,
blocking is implemented via either suspending or spinning, while in the
M-PCP and D-PCP, all blocking is by suspending. The presented comparison was conducted using a UNC-produced Linux extension called
LITMUSRT . In this comparison, schedulability experiments were conducted in which runtime overheads as measured on LITMUSRT were
used. In these experiments, the spin-based FMLP variant always exhibited the best performance, and the M-PCP and D-PCP almost always
exhibited poor performance. These results call into question the practical viability of the M-PCP and D-PCP, which have been the de-facto
standard for real-time multiprocessor locking for the last 20 years.
Introduction
With the continued push towards multicore architectures by most (if not all)
major chip manufacturers [19,26], the computing industry is facing a paradigm
shift: in the near future, multiprocessors will be the norm. Current off-the-shelf
systems now routinely contain chips with two, four, and even eight cores, and
chips with up to 80 cores are envisioned within a decade [26]. Not surprisingly,
with multicore platforms becoming so widespread, real-time applications are
already being deployed on them. For example, systems processing time-sensitive
business transactions have been realized by Azul Systems on top of the highly parallel Vega2 platform, which consists of up to 768 cores [4].
Motivated by these developments, research on multiprocessor real-time scheduling has intensified in recent years (see [13] for a survey). Thus far, however, few
proposed approaches have actually been implemented in operating systems and
evaluated under real-world conditions. To help bridge the gap between algorithmic research and real-world systems, our group recently developed LITMUSRT ,
a multiprocessor real-time extension of Linux [8,11,12]. Our choice of Linux as a development platform was influenced by recent efforts to introduce real-time-oriented features in stock Linux (see, for example, [1]). As Linux evolves, it could
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 105–124, 2008.
© Springer-Verlag Berlin Heidelberg 2008
it can be difficult to predict which scheduler events may affect a task while it
is suspended, so needed analysis tends to be pessimistic. (Each of the protocols
considered here is described more fully later.)
Methodology and results. The main contribution of this paper is an assessment of the performance of the three protocols described above in terms of P-SP
schedulability. Our methodology in conducting this assessment is similar to that
used in our earlier work on EDF-scheduled systems [11]. The performance of
any synchronization protocol will depend on runtime overheads, such as preemption costs, scheduling costs, and costs associated with performing various
system calls. We determined these costs by analyzing trace data collected while
running various workloads under LITMUSRT (which, of course, first required implementing each synchronization protocol in LITMUSRT). We then used these
costs in schedulability experiments involving randomly-generated task systems.
In these experiments, a wide range of task-set parameters was considered (though
only a subset of our data is presented herein, due to space limitations). In each
experiment, schedulability was checked for each scheme using a demand-based
schedulability test [15], augmented to account for runtime overheads. In these experiments, we found that the spin-based FMLP variant always exhibited the best
performance (usually, by a wide margin), and the M-PCP and D-PCP almost always exhibited poor performance. These results reinforce our earlier nding that
spin-based locking is preferable to suspension-based locking under EDF scheduling [11]. They also call into question the practical viability of the M-PCP and
D-PCP.
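As a concrete illustration of the last step, a static-priority, single-processor response-time check in the spirit of a demand-based test can be sketched as follows. This is a simplified stand-in for the actual test of [15], with hypothetical inputs; overheads are assumed to be already folded into the cost and blocking terms.

```python
from math import ceil

def schedulable(tasks, blocking):
    """Static-priority, single-processor response-time check.
    tasks: list of (cost, period) in decreasing priority order,
    with implicit deadlines; blocking[i]: worst-case blocking of
    task i.  A simplified sketch, not the paper's exact test."""
    for i, (c_i, t_i) in enumerate(tasks):
        r = c_i + blocking[i]
        while True:
            # demand of task i plus higher-priority interference in [0, r)
            demand = c_i + blocking[i] + sum(
                ceil(r / t_j) * c_j for c_j, t_j in tasks[:i])
            if demand == r:
                break          # fixed point: r is the response time
            if demand > t_i:
                return False   # demand exceeds the period: deadline miss
            r = demand
    return True
```

Adding blocking to a task directly shrinks the slack available to it, which is why large suspension-induced blocking terms can render an otherwise feasible set unschedulable.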
Organization. In the next two sections, we discuss needed background and the
results of our experiments. In an appendix, we describe how runtime overheads
were obtained.
Background
[Fig. 1 omitted: (a) a legend distinguishing scheduled (no resource), scheduled (with resource), and blocked intervals, job releases, and job completions; (b) the issued/satisfied/complete phases of a resource request.]
Fig. 1. (a) Legend. (b) Phases of a resource request. Tij issues R1 and blocks since R1 is not immediately satisfied. Tij holds the resource associated with R1 for |R1| time units, which includes blocking incurred due to nested requests.
ℓ1, A14 is blocked. Similarly, A12 becomes active and blocks at time 4. When T31 releases ℓ1, A12 gains access next because it is the highest-priority active agent on processor 1. Note that, even though the highest-priority job T11 is released at time 2, it is not scheduled until time 7 because agents and resource-holding jobs have an effective priority that exceeds the base priority of T11. A12 becomes active at time 9 since T21 requests ℓ2. However, T11 is accessing ℓ1 at the time, and thus has an effective priority that exceeds A12's priority. Therefore, A12 is not scheduled until time 10.
Inset (b) shows the same scenario under the M-PCP. In this case, T21 and T41 access global resources directly instead of via agents. T41 suspends at time 2 since T31 already holds ℓ1. Similarly, T21 suspends at time 4 until it holds ℓ1 one time unit later. Meanwhile, on processor 1, T11 is scheduled at time 5 after T21 returns to normal priority, and also requests ℓ1 at time 6. Since resource requests are satisfied in priority order, T11's request has precedence over T41's request, which was issued much earlier at time 2. Thus, T41 must wait until time 8 to access ℓ1.
[Fig. 2 omitted: per-processor schedules of T1-T4 (and, for the D-PCP, agents A12 and A14) on processors 1 and 2 over time units 0-15.]
Fig. 2. Example schedules of four tasks sharing two global resources. (a) D-PCP schedule. (b) M-PCP schedule. (c) FMLP schedule (ℓ1, ℓ2 are long). (d) FMLP schedule (ℓ1, ℓ2 are short).
Note that T41 preempts T21 when it resumes at time 8 since it is holding a global
resource.
The FMLP. The FMLP is considered to be flexible for several reasons: it can be used under either partitioned or global scheduling, with either static or dynamic task priorities, and it is agnostic regarding whether blocking is via spinning or suspension. Regarding the latter, resources are categorized as either short or long. Short resources are accessed using queue locks (a type of spin lock) [2,14,18] and long resources are accessed via a semaphore protocol. Whether a resource should be considered short or long is user-defined, but requests for long resources may not be contained within requests for short resources. To date, we have implemented FMLP variants for both partitioned and global EDF and P-SP scheduling (the focus of the description given here).
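The difference between the FIFO-ordered queues used by the FMLP and the priority-ordered queues of the M-PCP can be illustrated with a small simulation. This is an illustrative sketch with made-up request parameters, not LITMUS^RT code.

```python
def grant_order(requests, discipline):
    """Order in which contending requests for one resource are granted.
    requests: list of (arrival, priority, hold_time); a lower priority
    number means higher priority.  discipline: 'fifo' (FMLP-style) or
    'priority' (M-PCP-style)."""
    todo = sorted(range(len(requests)), key=lambda i: requests[i][0])
    pending, order, t, k = [], [], 0, 0
    while len(order) < len(requests):
        while k < len(todo) and requests[todo[k]][0] <= t:
            pending.append(todo[k])   # admit requests issued by time t
            k += 1
        if not pending:
            t = requests[todo[k]][0]  # resource idle until next request
            continue
        key = (lambda i: requests[i][0]) if discipline == 'fifo' \
              else (lambda i: requests[i][1])
        nxt = min(pending, key=key)
        pending.remove(nxt)
        order.append(nxt)
        t += requests[nxt][2]         # resource held for hold_time units
    return order
```

With three requests issued at times 2, 4, and 6 by jobs of decreasing, middle, and highest priority, FIFO ordering serves them strictly in issue order, whereas priority ordering lets later high-priority requests overtake the earliest one once the resource next becomes free.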
Deadlock avoidance. The FMLP uses a very simple deadlock-avoidance mechanism that was motivated by trace data we collected involving the behavior of actual real-time applications [7]. This data (which is summarized later) suggests that nesting, which is required to cause a deadlock, is somewhat rare; thus, complex deadlock-avoidance mechanisms are of questionable utility. In the FMLP, deadlock is prevented by grouping resources and allowing only one job to access resources in any given group at any time. Two resources are in the same group iff they are of the same type (short or long) and requests for one may be nested within those of the other. A group lock is associated with each resource group; before a job can access a resource, it must first acquire its corresponding group lock. All blocking incurred by a job occurs when it attempts to acquire the group lock associated with a resource request that is outermost with respect to either short or long resources.2 We let G(ℓ) denote the group that contains resource ℓ.
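The grouping rule can be expressed with a small union-find sketch; the resource names and nesting relation below are hypothetical.

```python
def resource_groups(kind, nested_pairs):
    """kind: dict resource -> 'short' or 'long'; nested_pairs: pairs
    (a, b) meaning a request for b may be nested inside one for a.
    Two resources share a group iff they have the same type and are
    connected by nesting; each group gets one group lock."""
    parent = {r: r for r in kind}
    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]   # path compression
            r = parent[r]
        return r
    for a, b in nested_pairs:
        if kind[a] == kind[b]:              # only same-type resources merge
            parent[find(a)] = find(b)
    groups = {}
    for r in kind:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())
```

Note that a nesting relation between a short and a long resource does not merge their groups; it is instead constrained by the rule that long requests may not be nested inside short ones.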
We now explain how resource requests are handled in the FMLP. This process
is illustrated in Fig. 3.
[Fig. 3 omitted: the issued/satisfied/complete phases of short requests (the job spins when blocked and executes the critical section non-preemptively) and of long requests (the job suspends when blocked and is priority-boosted during the critical section).]
2 A short resource request nested within a long resource request but no short resource request is considered outermost.
3 The desirability of FIFO-based real-time multiprocessor locking protocols has been noted by others [17], but to our knowledge, the FMLP is the first such protocol to be implemented in a real OS.
ℓ1 and ℓ2 are classified as long resources. As before, T31 requests ℓ1 first and forces the jobs on processor 2 to suspend (T41 at time 2 and T21 at time 4). In contrast to both the D-PCP and M-PCP, contending requests are satisfied in FIFO order. Thus, when T31 releases ℓ1 at time 5, T41's request is satisfied before that of T21. Similarly, T11's request for ℓ1 is only satisfied after T21 completes its request at time 7. Note that, since jobs suspend when blocked on a long resource, T31 can be scheduled for one time unit at time 6 when T11 blocks on ℓ1.
Inset (d) depicts the schedule that results when both ℓ1 and ℓ2 are short. The main difference from the schedule depicted in (c) is that jobs busy-wait non-preemptively when blocked on a short resource. Thus, when T21 is released at time 3, it cannot be scheduled until time 6 since T41 executes non-preemptively from time 2 until time 6. Similarly, T41 cannot be scheduled at time 7 when T21 blocks on ℓ2 because T21 does not suspend. Note that, due to the waste of processing time caused by busy-waiting, the last job only finishes at time 15. Under suspension-based synchronization methods, the last job finishes at either time 13 (M-PCP and FMLP for long resources) or 14 (D-PCP).
Experiments
resource requests. By eliminating the need to partition task sets, we prevent the effects of bin-packing heuristics from skewing our results. All generated task sets were determined to be schedulable before blocking was taken into account.
Resource sharing. Each task was configured to issue between 0 and K resource requests. The access cost of each request (excluding synchronization overheads) was chosen uniformly from [0.1μs, L]. K ranged from 0 to 9 and L from 0.5μs to 15.5μs. The latter range was chosen based on locking trends observed in a prior study of locking patterns in the Linux kernel, two video players, and an interactive 3D video game (see [7] for details). Although Linux is not a real-time system, its locking behavior should be similar to that of many complex systems, including real-time systems, where great care is taken to make critical sections short and efficient. The video players and the video game need to ensure that both visual and audio content are presented to the user in a timely manner, and thus are representative of the locking behavior of a class of soft real-time applications. The trace data we collected in analyzing these applications suggests that, with respect to both semaphores and spin locks, critical sections tend to be short (usually just a few microseconds on a modern processor) and nested lock requests are somewhat rare (typically only 1% to 30% of all requests, depending on the application, with nesting levels deeper than two being very rare).
The total number of generated tasks N was used to determine the number of resources according to the formula KN/(αm), where the sharing degree α was chosen from {0.5, 1, 2, 4}. Under the D-PCP, resources were assigned to processors in a round-robin manner to distribute the load evenly. Nested resource requests were not considered, since they are not supported by the M-PCP and D-PCP, and also because allowing nesting has a similar effect on schedulability under the FMLP as increasing the maximum critical section length.
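Under these conventions, the resource count and the D-PCP's round-robin assignment can be sketched as follows. The rounding mode is an assumption; the paper does not spell it out.

```python
def assign_resources(n_tasks, k, alpha, m):
    """Number of shared resources per the formula K*N/(alpha*m), with
    a round-robin assignment of resources to the m processors as done
    for the D-PCP.  Rounding to the nearest integer (with a minimum of
    one resource) is an assumption of this sketch."""
    n_res = max(1, round(k * n_tasks / (alpha * m)))
    host = {r: r % m for r in range(n_res)}  # resource -> processor
    return n_res, host
```

A higher sharing degree α yields fewer resources, i.e., more tasks contending for each one.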
Finally, task execution costs and request durations were inflated to account for system overheads (such as context-switching costs) and synchronization overheads (such as the cost of invoking synchronization-related system calls). The methodology for doing this is explained in the appendix.
Schedulability. After a task set was generated, the worst-case blocking delay of each task was determined by using methods described in [20,21] (M-PCP), [16,21] (D-PCP), and [9] (FMLP). Finally, we determined whether a task set was schedulable after accounting for overheads and blocking delay with a demand-based [15] schedulability test.
A note on the period enforcer. When a job suspends, it defers part of its execution to a later instant, which can cause a lower-priority job to experience deferral blocking. In checking schedulability, this source of blocking must be accounted for. In [21], it is claimed that deferral blocking can be eliminated by using a technique called the period enforcer. In this paper, we do not consider the use of the period enforcer, for a number of reasons. First, the period enforcer has not been described in published work (nor is a complete description available online). Thus, we were unable to verify its correctness4 and were unable to obtain sufficient information to enable an implementation in LITMUS^RT (which obviously is a prerequisite for obtaining realistic overhead measurements). Second, from our understanding, it requires a task to be split into subtasks whenever it requests a resource. Such subtasks are eligible for execution at different times based on the resource-usage history of prior (sub-)jobs. We do not consider it feasible to efficiently maintain a sufficiently complete resource-usage history in-kernel at runtime. (Indeed, to the best of our knowledge, the period enforcer has never been implemented in any real OS.) Third, all tested suspension-based synchronization protocols are affected by deferral blocking to the same extent. Thus, even if it were possible to avoid deferral blocking altogether, the relative performance of the algorithms is unlikely to differ significantly from our findings.
3.1
4 In fact, we have confirmed that some existing scheduling analysis (e.g., [21]) that uses the period enforcer is flawed [22]. Interestingly, in her now-standard textbook on the subject of real-time systems, Liu does not assume the presence of the period enforcer in her analysis of the D-PCP [16].
concerning these trends. Below, we consider a few specific graphs that support these observations.
In all tested scenarios, suspending was never preferable to spinning. In fact, in the vast majority of the tested scenarios, every generated task set was schedulable under spinning (the short FMLP variant). In contrast, many scenarios could not be scheduled under any of the suspension-based methods. The only time that suspending was ever a viable alternative was in scenarios with a small number of resources (i.e., small K, low Ū, high α) and relatively lax timing constraints (long, homogeneous periods). Since the short FMLP variant is clearly the best choice (from a schedulability point of view), we mostly focus our attention on the suspension-based protocols in the discussion that follows.
Overall, the long FMLP variant exhibited the best performance among the suspension-based algorithms, especially in low-sharing-degree scenarios. For α = 0.5, the long FMLP variant always exhibited better performance than both the M-PCP and D-PCP. For α = 1, the long FMLP variant performed best in 101 of 108 tested scenarios. In contrast, the M-PCP was never the preferable choice for any α. Our results show that the D-PCP hits a "sweet spot" (which we discuss in greater detail below) when K = 2, Ū ≤ 0.3, and α ≥ 2; it even outperformed the long FMLP variant in some of these scenarios (but never the short variant). However, the D-PCP's performance quickly diminished outside this narrow sweet spot. Further, even in the cases where the D-PCP exhibited the best performance among the suspension-based protocols, schedulability was very low. The M-PCP often outperformed the D-PCP; however, in all such cases, the long FMLP variant performed better (and sometimes significantly so).
The observed behavior of the D-PCP reveals a significant difference with respect to the M-PCP and FMLP. Whereas the performance of the latter two is mostly determined by the task count and the tightness of timing constraints, the D-PCP's performance closely depends on the number of resources: whenever the number of resources does not exceed the number of processors significantly, the D-PCP does comparatively well. Since (under our task-set generation method) the number of resources depends directly on both K and α (and indirectly on Ū, which determines how many tasks are generated), this explains the observed sweet spot. The D-PCP's insensitivity to the total task count can be traced back to its distributed nature: under the D-PCP, a job can only be delayed by events on its local processor and on remote processors where it requests resources. In contrast, under the M-PCP and FMLP, a job can be delayed transitively by events on all processors where jobs reside with which the job shares a resource.
Example graphs. Insets (a)-(f) of Fig. 4 and (a)-(c) of Fig. 5 display nine selected graphs for the four-processor case that illustrate the above trends. These insets are discussed next.
Fig. 4 (a)-(c). The left column of graphs in Fig. 4 shows schedulability as a function of L for K = 9. The case depicted in inset (a), where Ū = 0.3 and p(Ti) ∈ [33, 100], shows how both FMLP variants significantly outperform both the M-PCP and D-PCP in low-sharing-degree scenarios. Note how even the long
[Fig. 4 omitted: schedulability as a function of L (in μs) for the FMLP (short and long), M-PCP, and D-PCP; insets (a)-(f), including the scenario ucap=0.15, L=9, period=3-33, alpha=2, cpus=4.]
3.2 Scalability
We now consider how the performance of each protocol scales with the processor count. To determine this, we varied the processor count from two to 16 for all possible combinations of α, Ū, L, K, and periods (assuming the ranges for each defined earlier). This resulted in 324 graphs, three of which are shown in the right column of Fig. 5. The main difference between insets (d) and (e) of the figure is that task periods are large in (d) (p(Ti) ∈ [100, 1000]) but small in (e)
[Fig. 5 omitted: schedulability curves for the FMLP (short and long), M-PCP, and D-PCP.]
Fig. 5. Schedulability as a function of (a)-(c) the per-processor utilization cap Ū and of (d)-(f) the processor count
(p(Ti) ∈ [10, 100]). As seen, both FMLP variants scale well in inset (d), but the performance of the long variant begins to degrade quickly beyond six processors in inset (e). In both insets, the M-PCP shows a trend similar to, but worse than, that of the long FMLP variant. This relationship was apparent in many (but not all) of the tested scenarios, as the performance of both protocols largely depends on the total number of tasks. In contrast, the D-PCP quite consistently does not follow the same trend as the M-PCP and FMLP. This, again, is due to the fact that the D-PCP depends heavily on the number of resources. Since, in this study,
the total number of tasks increases at roughly the same rate as the number of processors, in each graph, the number of resources does not change significantly as the processor count increases (since α and K are constant in each graph). The fact that the D-PCP's performance does not remain constant indicates that its performance also depends on the total task count, but to a lesser degree.
Inset (f) depicts the most taxing scenario considered in this paper, i.e., that shown earlier in Fig. 5 (c). None of the suspension-based protocols support this scenario (on any number of processors), and the short FMLP variant does not scale beyond four to five processors.
Finally, we repeated some of the four-processor experiments discussed in Sec. 3.1 for 16 processors to explore certain scenarios in more depth. Although we are unable to present the graphs obtained for lack of space, we do note that blocking-by-suspending did not become more favorable on 16 processors, and the short FMLP variant still outperformed all other protocols in all tested scenarios. However, the relative performance of the suspension-based protocols did change, so that the D-PCP was favorable in more cases than before. This appears to be due to two reasons. First, as discussed above, among the suspension-based protocols, the D-PCP is impacted the least by an increasing processor count (given our task-set generation method). Second, the long FMLP variant appears to be somewhat less effective at supporting short periods for larger processor counts. However, schedulability was poor under all suspension-based protocols for task sets with tight timing constraints on a 16-processor system.
3.3 Impact of Overheads
Conclusion
References
1. IBM and Red Hat announce new development innovations in Linux kernel. Press release (2007), http://www-03.ibm.com/press/us/en/pressrelease/21232.wss
2. Anderson, T.: The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems 1(1), 6-16 (1990)
3. Baker, T.: Stack-based scheduling of real-time processes. Journal of Real-Time Systems 3(1), 67-99 (1991)
4. Bisson, S.: Azul announces 192 core Java appliance (2006), http://www.itpro.co.uk/serves/news/99765/azul-announces-192-core-java-appliance.html
5. Block, A., Brandenburg, B., Anderson, J., Quint, S.: An adaptive framework for multiprocessor real-time systems. In: Proceedings of the 20th Euromicro Conference on Real-Time Systems, pp. 23-33 (2008)
6. Block, A., Leontyev, H., Brandenburg, B., Anderson, J.: A flexible real-time locking protocol for multiprocessors. In: Proceedings of the 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, pp. 71-80 (2007)
7. Brandenburg, B., Anderson, J.: Feather-Trace: A light-weight event tracing toolkit. In: Proceedings of the Third International Workshop on Operating Systems Platforms for Embedded Real-Time Applications, pp. 19-28 (2007)
8. Brandenburg, B., Anderson, J.: Integrating hard/soft real-time tasks and best-effort jobs on multiprocessors. In: Proceedings of the 19th Euromicro Conference on Real-Time Systems, pp. 61-70 (2007)
Appendix
To obtain the overheads required in this paper, we used the same methodology
that we used in the prior study concerning EDF scheduling [11]. For the sake of
completeness, the approach is summarized here.
In real systems, task execution times are affected by the following sources of overhead. At the beginning of each quantum, tick scheduling overhead is incurred, which is the time needed to service a timer interrupt. Whenever a scheduling decision is made, a scheduling cost is incurred, which is the time taken to select the next job to schedule. Whenever a job is preempted, context-switching overhead and preemption overhead are incurred; the former term includes any non-cache-related costs associated with the preemption, while the latter accounts for any costs due to a loss of cache affinity.
When jobs access shared resources, they incur an acquisition cost. Similarly, when leaving a critical section, they incur a release cost. Further, when a system call is invoked, a job will incur the cost of switching from user mode to kernel mode and back. Whenever a task should be preempted while it is executing a non-preemptive (NP) section, it must notify the kernel when it is leaving its NP-section, which entails some overhead. Under the D-PCP, in order to communicate with a remote agent, a job must invoke that agent. Similarly, the agent also incurs overhead when it receives a request and signals its completion.
Accounting for overheads. Task execution costs can be inflated using standard techniques to account for overheads in schedulability analysis [16]. Care must be taken to also properly inflate resource request durations. Acquire and release costs contribute to the time that a job holds a resource and thus can cause blocking. Similarly, suspension-based synchronization protocols must properly account for preemption effects within critical sections. Further, care must be taken to inflate task execution costs for preemptions and scheduling events due to suspensions in the case of contention. Whenever it is possible for a lower-priority job to preempt a higher-priority job and execute a critical section,5 the event source (i.e., the resource request causing the preemption) must be accounted for in the demand term of all higher-priority tasks. One way this can be achieved is by modeling such critical sections as special tasks with priorities higher than that of the highest-priority normal task [16].
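A simplified version of this inflation step, using the overhead names from Table 1, can be sketched as follows. The exact accounting in the paper is more involved; this sketch only shows the flavor, charging one preemption and one context switch per job and two mode switches plus acquisition and release per request.

```python
def inflate(exec_cost, req_durations, oh):
    """Inflate a task's execution cost and its resource-request
    durations by overheads (all in us).  oh: dict with keys
    'to_kernel', 'to_user', 'acquire', 'release', 'preemption',
    'context_switch'.  A deliberately simplified sketch."""
    per_req = (oh['to_kernel'] + oh['to_user']
               + oh['acquire'] + oh['release'])
    # acquire/release also extend the time the resource is held,
    # and thus the blocking that other jobs may incur
    new_reqs = [d + oh['acquire'] + oh['release'] for d in req_durations]
    new_exec = (exec_cost + len(req_durations) * per_req
                + oh['preemption'] + oh['context_switch'])
    return new_exec, new_reqs
```

Note that the inflated request durations feed into the blocking analysis, while the inflated execution cost feeds into the demand term of the schedulability test.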
Implementation. To obtain realistic overhead values, we implemented the M-PCP, D-PCP, and FMLP under P-SP scheduling in LITMUS^RT. A detailed description of the LITMUS^RT kernel and its architecture is beyond the scope of this paper. Such details can be found in [10]. Additionally, a detailed account of the implementation issues encountered, and relevant design decisions made, when implementing the aforementioned synchronization protocols in LITMUS^RT can be found in [9]. LITMUS^RT is open source software that can be downloaded freely.6
5 This is possible under all three suspension-based protocols considered in this paper: a blocked lower-priority job might resume due to a priority boost under the FMLP and M-PCP and might activate an agent under the D-PCP.
6 http://www.cs.unc.edu/anderson/litmus-rt.
Overheads for the short FMLP variant were already known from prior work [11] and
did not have to be re-determined.
Table 1. (a) Worst-case overhead values (in μs), on our four-processor test platform, obtained in prior studies. (b) Newly measured worst-case overhead values, on our four-processor test platform, in μs. These values are based on 86,368,984 samples recorded over a total of 150 minutes.

(a)
Overhead                           Worst-Case
Preemption                         42.00
Context-switching                  9.25
Switching to kernel mode           0.34
Switching to user mode             0.89
Leaving NP-section                 4.12
FMLP short acquisition / release   2.00 / 0.87

(b)
Overhead                           Worst-Case
Scheduling cost                    6.39
Tick                               8.08
FMLP long acquisition / release    2.74 / 8.67
M-PCP acquisition / release        5.61 / 8.27
D-PCP acquisition / release        4.61 / 2.85
D-PCP invoke / agent               8.36 / 7.15
timestamp before acquiring a resource and after the resource was acquired (however, no blocking is included in these overhead terms). Each overhead term was
determined by plotting the measured values obtained to check for anomalies, and
then computing the maximum value (discarding outliers, as discussed above).
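The last step can be sketched as follows. The fixed cutoff quantile is an illustrative choice of this sketch; the paper inspected plots manually rather than applying a fixed rule.

```python
def worst_case(samples, keep_fraction=0.999):
    """Estimate a worst-case overhead from raw measurement samples by
    discarding the most extreme values as suspected measurement
    artifacts and taking the maximum of the rest."""
    ordered = sorted(samples)
    keep = max(1, int(len(ordered) * keep_fraction))
    return ordered[keep - 1]
```

A single wild outlier (e.g., a sample perturbed by an unrelated interrupt) then no longer dominates the reported worst case.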
Measurement results. In some cases, we were able to re-use overheads determined in prior work; these are shown in inset (a) of Table 1. In other cases, new measurements were required; these are shown in inset (b) of Table 1.
The preemption cost in Table 1 was derived in [12]. In [12], this cost is given as a function of working set size (WSS). These WSSs are per quantum, thus reflecting the memory footprint of a particular task during a 1-ms quantum, rather than over its entire lifetime. WSSs of 4K, 32K, and 64K were considered in [12], but we only consider the 4K case here, due to space constraints. Note that larger WSSs tend to decrease the competitiveness of methods that suspend, as preemption costs are higher in such cases. Thus, we concentrate on the 4K case to demonstrate that, even in cases where such methods are most competitive, spinning is still preferable. The other costs shown in inset (a) of Table 1 were determined in [11].
Introduction
Y. Asahiro et al.
In Fig. 4, the circles and a line segment represent the robots and the ladder they carry, respectively.
Parameters LB, α, and β will be explained later. An analytical method for computing a time-optimal motion of this problem for two robots, under the assumption that the robots' speed is either 0 or a given constant at any moment, is reported in [11]. We used this method to calculate this optimal motion.
Fig. 2. The setup of the problem. We represent robots A and B by hollow and gray
circles, respectively.
Here we ignore the acceleration of the robots and assume that a sufficiently long period of time for spring relaxation is given to the robots before sensing the current size of the offset vector.
[Fig. 3 omitted: the ladder, robot R at time t, its velocity vector vR, and the offset vector oR.]
assume that both robots compute their respective velocity vectors and move to their new positions at time instances 0, 1, .... See Fig. 3.
The finish time is the time tf when both robots arrive at their respective goal positions. The delay is then defined to be (tf − to)/to × 100 (%), where to is the finish time of a time-optimal motion. We use the size of the offset vector during a motion to evaluate the smoothness of a motion.
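These two metrics are straightforward to compute; the sample data below is made up for illustration.

```python
def delay_percent(t_finish, t_opt):
    """Delay of a motion relative to the time-optimal finish time, in %."""
    return (t_finish - t_opt) / t_opt * 100.0

def peak_offset(offsets):
    """Smoothness measure: the largest offset-vector magnitude observed
    during a motion.  offsets: list of (x, y) samples."""
    return max((x * x + y * y) ** 0.5 for x, y in offsets)
```

A small delay indicates a fast motion; a small peak offset indicates low mechanical stress on the ladder and grips.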
The distance to the goal is large, and the robots do not need to rotate the ladder quickly, as in instance I2 (LB = 4, α = 30°, β = 150°), whose time-optimal motion is shown in Fig. 1 (right).
Note that only the relative positioning of the initial and goal positions of the ladder is important. That is, by interchanging the initial and goal positions, and by interchanging endpoints A and B, the above setting also covers the following cases, which hence need not be discussed separately:
- The case α > 90° and β = 180° (rotating the ladder clockwise).
- The case −90° < α < 0° and β = 180° (this is a symmetric case).
Although this configuration setting does not cover all the possibilities, we consider it a reasonable subset of the infinite instances: for example, the case β = 180° for 0° < α < 90° is not included explicitly in the above setting. However, the robots fall into such a configuration at some intermediate step during the motion, because the angle of the ladder gradually changes and the robots determine their motion based only on the current and goal positions. Hence, if the algorithm works well for the above configuration setting, we can expect that it also works for the case β = 180°.
The two figures of Fig. 4 show the motions by G for instances I1 (left) and I2 (right), respectively. The finish times are 340 and 439, respectively.
First, let us observe that the trajectories of the robots in these figures look quite different from the smoother motions shown in Fig. 1. Robot A temporarily yields to generate a smooth motion in Fig. 1 (left), while it does not in Fig. 4 (left). Both translation and rotation take place simultaneously in Fig. 1 (right), while in Fig. 4 (right), the ladder starts to rotate only toward the end of the motion. That is, rotation can occur only as a result of the robots' individual moves toward their goal positions, and that explains why in Fig. 4 (right) the ladder first translates without any rotation.
Here we would like to note the effect of the spring constant s: intuitively speaking, if s is smaller, the gray robot tends to move (more) straight to the goal while narrowing the distance from the other robot, which largely breaks the formation of the robots. (The evaluation of the motions based on this criterion is mentioned below.)
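The role of s can be made concrete with a toy velocity update, a hypothetical model sketched from the description above rather than the authors' exact algorithm G.

```python
def step(pos, goal, offset, s, vmax=1.0):
    """One update of a robot's position: head toward the goal,
    corrected by a spring force proportional to s and the current
    offset vector; the speed is capped at vmax.  A larger s pulls
    the robot back toward its nominal grip position (preserving the
    formation), a smaller s lets it cut straight toward the goal."""
    vx = (goal[0] - pos[0]) + s * offset[0]
    vy = (goal[1] - pos[1]) + s * offset[1]
    norm = (vx * vx + vy * vy) ** 0.5
    if norm > vmax:
        vx, vy = vx / norm * vmax, vy / norm * vmax
    return (pos[0] + vx, pos[1] + vy)
```

With a zero offset vector, the update reduces to a capped straight-line move toward the goal.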
The two figures of Fig. 5 show, for LB = 2 (left) and LB = 4 (right), respectively, the finish times of the motions generated by G and those of time-optimal motions.
[Figs. 5 and 6 omitted: Fig. 5 plots the finish times of the motions by G, G+, and the time-optimal motions as a function of Alpha; Fig. 6 plots the offset-vector size over time.]
Fig. 6. Offset vector size for instances I1 (left) and I2 (right), by G (Fig. 4, left), G+ (Fig. 9, left), and a time-optimal motion (Fig. 1, left)
[Fig. 7 omitted: peak offset-vector size as a function of Alpha for Optimal, G, and G+.]
Fig. 7. Peak offset vector size by G, G+, and a time-optimal motion, for LB = 2 (left) and LB = 4 (right), respectively
in Fig. 6 (right) stays at zero before time 229, when the ladder is translated but not rotated. Furthermore, in Fig. 6 (left), the curve stays very high for a long period, around time 20 to 150, indicating continuous severe stress that might be unacceptable to physical robots. In particular, the motions by G produce oscillating offset vectors, e.g., around time 30 for I1, that might be unacceptable for practical robots as well. The spring constant s affects this stress of motion. The above observations were obtained with s = 0.25, so we may be able to choose a more suitable (actually larger) value for s in order to keep the size of the offset vector small. However, a larger s causes a longer finish time, because the robots have to detour compared to the motions obtained with s = 0.25. The difficult point here is that we need to choose a value for s that achieves fastness and smoothness at the same time.
Figs. 7 (left) and 7 (right) show the peak offset vector size observed during a motion generated by G for LB = 2 and LB = 4, respectively, for various values of α. As observed above, the motion of G is divided into two parts: translation of the ladder followed by rotation. Since offset vectors become large during rotation, the distance to the goal does not have much effect here. Therefore, the results for LB = 2 and LB = 4 are very similar. The maximum size of the offset vectors gradually decreases as α increases, since a smaller rotation is then required.
To summarize, we observe that G tends to separate translation and rotation.
This results in a motion that is less smooth because of a sudden transition
between the two phases. We believe that there are two reasons for the separation.
There is no explicit mechanism to rotate the ladder. Consequently, robot A
never moves away from its goal location in Fig. 4 (left) to assist robot B to
rotate the ladder.
The robots do not utilize the information on the distances to their respective
goals. Consequently, both robots move at the same speed (and hence, the
ladder is not rotated at all), during translation in Fig. 4 (right), even though
robot B is farther away from its goal than robot A.
It is conceivable that a smoother, faster motion can result if translation and
rotation are merged by resolving these issues.
Fig. 8. Finish time of the motions by G+ for instance I1 (left) and I2 (right), for t, u = 0, 1, ..., 10. (Surface plots of finish time over t and u; panel labels: L=2, Alpha=1 (left) and L=4, Alpha=30 (right).)
Fig. 9. The motions by G+ with t = 1 and u = 1, for instance I1 (left) and I2 (right),
respectively
often gives a good finish time. If the ladder need not be rotated very quickly, then, as shown in Fig. 8 (right), the finish time becomes less dependent on t and u, and we can expect good performance using any value of t ≥ 1 and u = 0. In addition to I1 and I2, we tested more configurations, e.g., LB = 3, 4, 5, 6 and α = 1°, and also LB = 2 and α = 5°, 10°, 15°, 20°. We omit the detailed results here due to space limitations; however, for all the tested configurations the obtained charts look like those in Fig. 8. Based on the above observations, in the following we use t = 1 and u = 1 to evaluate G+.
Figs. 9 (left) and 9 (right) show the motions generated by G+ for instances I1 and I2, respectively. These motions resemble the time-optimal motions in Figs. 1 (left) and 1 (right) more closely than those by G shown in Figs. 4 (left) and 4 (right). Specifically, in Fig. 9 (left), robot A (the hollow circle) temporarily leaves its position and returns to it in order to rotate the ladder quickly. In Fig. 9, translation and rotation take place simultaneously, and robot B (which has to move more than the other robot) can move nearly straight to its destination.
The simulation results indicate that G+ generates a faster and smoother motion than G. See Fig. 5 for a comparison of the finish times of G+, G, and a time-optimal motion. For α ≤ 80° the finish time of G+ is smaller than that of G. (For α > 80°, both G and G+ generate a motion whose finish time equals that of a time-optimal motion.) The delay of G+ in the motion of Fig. 9
(left) for instance I1 is only 9.6% (this is the worst case we observed for G+), as opposed to 63.5% for G in Fig. 4 (left). For instance I2, the delay is 3.5% in the motion of Fig. 9 (right) by G+, as opposed to 9.8% in the motion of Fig. 4 (right) by G. Figs. 6 (left) and 6 (right), respectively, show that the offset vector size is considerably smaller in the motions of Figs. 9 (left) and 9 (right) by G+, compared to those of Figs. 4 (left) and 4 (right) by G. Fig. 7 shows that the peak offset vector size is smaller for G+ than for G.
We next show that G+ is correct, i.e., for arbitrary initial and goal positions of the ladder, using Algorithm G+, robots A and B can transport the ladder to its goal position, provided that there are no sensor or control errors.
Theorem 1. Suppose that u > 0. Then Algorithm G+ is correct.
Proof. We give an outline of the proof. If Lmax = max{LA, LB} ≤ V, then G+ terminates after the robots transport the ladder to the goal position in Step 1. We show that eventually Lmax ≤ V holds. Suppose that the current orientation of the ladder differs from its goal orientation. Since u > 0, the orientation gets closer to the goal orientation in each iteration of G+. Once the two orientations become equal, the rest of the work for the robots is just moving straight to the goal position; the rotation vector is no longer needed, and so it has size 0 in Step 3. Thus the target vectors tA and tB of the robots can have opposite directions only in the very first iterations, and hence in subsequent iterations, tA and tB together have the effect of moving the center of the ladder closer to its center in the goal position. The robots' offset correction vectors hA and hB are opposite in direction and equal in magnitude, and hence they do not affect the movement of the center of the ladder. Even if the center of the ladder stays exactly at the center of the goal position of the ladder, this does not mean that the robots have reached their goal positions. However, after that, the vectors tR and hR help to move the robots to the goal positions without moving the center of the ladder.
Consider the instance shown on the left in Fig. 10, where G+ would drive both robots straight to their respective goal positions if there were no control errors. As illustrated on the right, in any physical experiment the robots will inevitably deviate from the intended trajectories for a number of reasons, including sensor and control errors. An advantage of G+ is that its future output depends only on the current and goal states, and is independent of past history. An algorithm is said to be self-stabilizing if it tolerates any finite number of transient faults, such as sensor and control errors; i.e., the algorithm remains correct even in the presence of any finite number of transient faults.
Corollary 1. Algorithm G+ is self-stabilizing.
To confirm the robustness of G+, we conducted a series of simulations, replacing TR in Step 5 of G+ by TR = tR + rR + hR + nR, where nR is a random noise vector of size 0.1V. Since Corollary 1 guarantees that G+ eventually transports the ladder to the goal position, our main concern here is the finish time.
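The noise model of these simulations can be sketched as follows (the function name is ours, and the vector composition below is an assumption about how tR, rR, and hR combine; the paper only specifies the sum in Step 5): a random vector of fixed magnitude 0.1V is added to the target vector at each step.

```python
# Hedged sketch of the robustness experiment's noise model: a random
# direction is drawn and scaled to 0.1*V, then added to the target vector.
import math
import random

def noisy_target(t_R, r_R, h_R, V, rng=random):
    theta = rng.uniform(0.0, 2.0 * math.pi)            # random direction
    n_R = (0.1 * V * math.cos(theta), 0.1 * V * math.sin(theta))
    return (t_R[0] + r_R[0] + h_R[0] + n_R[0],
            t_R[1] + r_R[1] + h_R[1] + n_R[1])
```

By construction the perturbation always has magnitude exactly 0.1V, matching the fixed-size noise vector nR described above.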
Figs. 11 (left) and 11 (right), respectively, show the finish times of the motions by G+ in the presence of noise (as well as the results for the noise-free case and
Fig. 10. A simple instance (left), and deviation from the intended path (right)
Fig. 11. Finish times of G+ with noise for LB = 2 (left) and LB = 4 (right), respectively. (Curves: Optimal, No Noise, With Noise; x-axis: Alpha, y-axis: Finish time.)
Fig. 13. Motions with three robots by G (left) and G+ (right) for instance I1
Fig. 14. Motions with three robots by G (left) and G+ (right) for instance I2
Fig. 15. Finish times (left) and peak offset vector size (right) of G and G+ for L = 2 and 0° ≤ α ≤ 90° with three robots, respectively. (Curves: G, G+; x-axis: Alpha.)
Fig. 16. Finish times (left) and peak offset vector size (right) of G and G+ for L = 4 and 0° ≤ α ≤ 90° with three robots, respectively. (Curves: G, G+; x-axis: Alpha.)
Fig. 17. Motions for instance I1 with four robots by (i) G and (ii) G+, and with five robots by (iii) G and (iv) G+, respectively
Fig. 18. Motions for instance I1 with six robots by (i) G and (ii) G+, and with seven robots by (iii) G and (iv) G+, respectively
Fig. 19. Motions for instance I1 with eight robots by (i) G and (ii) G+, and with nine robots by (iii) G and (iv) G+, respectively
Fig. 20. Motions with four robots by G (left) and G+ (right) for instance I2
Fig. 21. Motions with five robots by G (left) and G+ (right) for instance I2
Fig. 22. Motions with six robots by G (left) and G+ (right) for instance I2
Fig. 23. Motions with seven robots by G (left) and G+ (right) for instance I2
Fig. 24. Motions with eight robots by G (left) and G+ (right) for instance I2
do not show the results for optimal motions. The finish times of G and G+ are almost the same, although G+ is slightly better (Fig. 15, left). We observed the largest improvement in finish time for instance I1 with nine robots: the finish times of G and G+ are 835 and 741, respectively, i.e., the motion by G+ is completed 12.2% faster than the motion by G. As for the size of the peak offset vector, the motions by G+ are always smoother than those by G
Fig. 25. Motions with nine robots by G (left) and G+ (right) for instance I2
(Fig. 15, right). As an example, for instance I1 with nine robots, the size of the peak offset vector by G+ is about 63% of that by G; G+ improves the smoothness of the motion. In summary, the simulation results indicate that if the number of robots is greater than two, the advantage of algorithm G+ over G lies in smoother motions rather than smaller finish times. This observation suggests that maintaining a given formation can be a severe constraint when seeking time-optimal motion with many robots.
We note here that, arguing as in the proof of Theorem 1, we can prove the convergence of Algorithm G+, i.e., the robots' positions always converge to their goal positions, for the case of more than two robots in a simple formation such as the regular polygons we examined.
Conclusion
deep studies of the algorithm G+; if we can develop a method to select their values for every configuration, it would be very useful. The algorithms G and G+ are evaluated by delay (finish time) and formation error, which shows that G+ is better than G; however, it remains open how much delay or formation error is acceptable for real robots, and whether G+ is a best possible algorithm. In addition, the proof of correctness of G+ does not bound the maximum finish time or the maximum formation error. The self-stabilization of G+ is demonstrated under the assumption that a robot may move to a wrong position because of noise. An alternative interesting situation to test is that each robot sometimes misperceives the others' locations because of sensor errors. Distributed marching algorithms (i) by robots with different maximum speeds and capabilities, (ii) using many robots, and (iii) in an environment occupied by obstacles, are suggested for future study. Some results on these issues can be found in [3, 4].
Acknowledgments
This work was partially supported by KAKENHI 18300004 and 18700015.
References
1. Alami, R., Fleury, S., Herrb, M., Ingrand, F., Qutub, S.: Operating a Large Fleet of Mobile Robots Using the Plan-merging Paradigm. In: IEEE Int. Conf. on Robotics and Automation, pp. 2312–2317 (1997)
2. Ando, H., Oasa, Y., Suzuki, I., Yamashita, M.: A Distributed Memoryless Point Convergence Algorithm for Mobile Robots with Limited Visibility. IEEE Trans. Robotics and Automation 15(5), 818–828 (1999)
3. Asahiro, Y., Chang, E.C., Mali, A., Nagafuji, S., Suzuki, I., Yamashita, M.: Distributed Motion Generation for Two Omni-directional Robots Carrying a Ladder. Distributed Autonomous Robotic Systems 4, 427–436 (2000)
4. Asahiro, Y., Chang, E.C., Mali, A., Suzuki, I., Yamashita, M.: A Distributed Ladder Transportation Algorithm for Two Robots in a Corridor. In: IEEE Int. Conf. on Robotics and Automation, pp. 3016–3021 (2001)
5. Asama, H., Sato, M., Bogoni, L., Kaetsu, H., Matsumoto, A., Endo, I.: Development of an Omni-directional Mobile Robot with 3 DOF Decoupling Drive Mechanism. In: IEEE Int. Conf. on Robotics and Automation, pp. 1925–1930 (1995)
6. Balch, T.: Behavior-based Formation Control for Multi-robot Teams. IEEE Trans. Robotics and Automation 14(6), 926–939 (1998)
7. Belta, C., Kumar, V.: Abstraction and Control for Groups of Robots. IEEE Trans. Robotics 20(5), 865–875 (2004)
8. Canepa, D., Gradinariu Potop-Butucaru, M.: Stabilizing Flocking via Leader Election in Robot Networks. In: Int. Symp. Stabilization, Safety, and Security, pp. 52–66 (2007)
9. Cieliebak, M., Flocchini, P., Prencipe, G., Santoro, N.: Solving the Robots Gathering Problem. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 1181–1196. Springer, Heidelberg (2003)
10. Cao, Y.U., Fukunaga, A.S., Kahng, A.B.: Cooperative Mobile Robots: Antecedents and Directions. Autonomous Robots 4, 1–23 (1997)
11. Chen, A., Suzuki, I., Yamashita, M.: Time-optimal Motion of Two Omnidirectional Robots Carrying a Ladder Under a Velocity Constraint. IEEE Trans. Robotics and Automation 13(5), 721–729 (1997)
12. Czyzowicz, J., Gasieniec, L., Pelc, A.: Gathering Few Fat Mobile Robots in the Plane. In: Shvartsman, M.M.A.A. (ed.) OPODIS 2006. LNCS, vol. 4305, pp. 350–364. Springer, Heidelberg (2006)
13. Cohen, R., Peleg, D.: Convergence Properties of the Gravitational Algorithm in Asynchronous Robot Systems. SIAM J. on Computing 34, 1516–1528 (2005)
14. Cohen, R., Peleg, D.: Convergence of Autonomous Mobile Robots with Inaccurate Sensors and Movements. SIAM J. on Computing 38, 276–302 (2008)
15. Debest, X.A.: Remark about Self-stabilizing Systems. Comm. ACM 38(2), 115–117 (1995)
16. Donald, B.R.: Information Invariants in Robotics: Part I – State, Communication, and Side-effects. In: IEEE Int. Conf. on Robotics and Automation, pp. 276–283 (1993)
17. Flocchini, P., Prencipe, G., Santoro, N., Widmayer, P.: Hard Tasks for Weak Robots: The Role of Common Knowledge in Pattern Formation by Autonomous Mobile Robots. In: Aggarwal, A.K., Pandu Rangan, C. (eds.) ISAAC 1999. LNCS, vol. 1741, pp. 93–102. Springer, Heidelberg (1999)
18. Flocchini, P., Prencipe, G., Santoro, N., Widmayer, P.: Arbitrary Pattern Formation by Asynchronous, Anonymous, Oblivious Robots. Theoretical Computer Science (to appear)
19. Ge, S.S., Lewis, F.L. (eds.): Autonomous Mobile Robots: Sensing, Control, Decision Making and Applications. CRC Press, Boca Raton (2006)
20. Gervasi, V., Prencipe, G.: Coordination without Communication: The Case of the Flocking Problem. Discrete Applied Mathematics 143, 203–223 (2003)
21. Izumi, T., Katayama, Y., Inuzuka, N., Wada, K.: Gathering Autonomous Mobile Robots with Dynamic Compasses: An Optimal Result. In: Int. Symp. Distributed Computing, pp. 298–312 (2007)
22. Jadbabaie, A., Lin, J., Morse, A.S.: Coordination of Groups of Mobile Autonomous Agents Using Nearest Neighbor Rules. IEEE Trans. Automatic Control 48(6), 988–1001 (2003)
23. Justh, E.W., Krishnaprasad, P.S.: Equilibria and Steering Laws for Planar Formations. Systems & Control Letters 52(1), 25–38 (2004)
24. Katayama, Y., Tomida, Y., Imazu, H., Inuzuka, N., Wada, K.: Dynamic Compass Models and Gathering Algorithms for Autonomous Mobile Robots. In: Prencipe, G., Zaks, S. (eds.) SIROCCO 2007. LNCS, vol. 4474, pp. 274–288. Springer, Heidelberg (2007)
25. Kosuge, K., Oosumi, T.: Decentralized Control of Multiple Robots Handling an Object. In: International Conference on Intelligent Robots and Systems, pp. 318–323 (1996)
26. Martínez, S., Cortés, J., Bullo, F.: Motion Coordination with Distributed Information. IEEE Control Systems Magazine, 75–88 (2007)
27. LaValle, S.M.: Planning Algorithms. Cambridge University Press, Cambridge (2006)
28. Lee, L.F., Krovi, V.: A Standardized Testing-ground for Artificial Potential-field Based Motion Planning for Robot Collectives. In: 2006 Performance Metrics for Intelligent Systems Workshop, pp. 232–239 (2006)
29. Nakamura, Y., Nagai, K., Yoshikawa, T.: Dynamics and Stability in Coordination of Multiple Robotic Mechanisms. Int. J. of Robotics Research 8(2), 44–60 (1989)
30. Olfati-Saber, R.: Flocking for Multi-agent Dynamic Systems: Algorithms and Theory. IEEE Trans. Automatic Control 51(3), 401–420 (2006)
31. Prencipe, G.: CORDA: Distributed Coordination of a Set of Autonomous Mobile Robots. In: ERSADS 2001, pp. 185–190 (2001)
32. Prencipe, G.: On the Feasibility of Gathering by Autonomous Mobile Robots. In: Pelc, A., Raynal, M. (eds.) SIROCCO 2005. LNCS, vol. 3499, pp. 246–261. Springer, Heidelberg (2005)
33. Schneider, F.E., Wildermuth, D., Wolf, H.L.: Motion Coordination in Formations of Multiple Robots Using a Potential Field Approach. Distributed Autonomous Robotic Systems 4, 305–314 (2000)
34. Souissi, S., Défago, X., Yamashita, M.: Gathering Asynchronous Mobile Robots with Inaccurate Compasses. In: Shvartsman, M.M.A.A. (ed.) OPODIS 2006. LNCS, vol. 4305, pp. 333–349. Springer, Heidelberg (2006)
35. Souissi, S., Défago, X., Yamashita, M.: Using Eventually Consistent Compasses to Gather Oblivious Mobile Robots with Limited Visibility. In: Datta, A.K., Gradinariu, M. (eds.) SSS 2006. LNCS, vol. 4280, pp. 471–487. Springer, Heidelberg (2006)
36. Stilwell, D.J., Bay, J.S.: Toward the Development of a Material Transport System Using Swarms of Ant-like Robots. In: IEEE Int. Conf. on Robotics and Automation, pp. 766–771 (1995)
37. Sugihara, K., Suzuki, I.: Distributed Motion Coordination of Multiple Mobile Robots. In: IEEE Int. Symp. on Intelligent Control, pp. 138–143 (1990)
38. Sugihara, K., Suzuki, I.: Distributed Algorithms for Formation of Geometric Patterns with Many Mobile Robots. Journal of Robotic Systems 13(3), 127–139 (1996)
39. Suzuki, I., Yamashita, M.: Formation and Agreement Problems for Anonymous Mobile Robots. In: Annual Allerton Conference on Communication, Control, and Computing, pp. 93–102 (1993)
40. Suzuki, I., Yamashita, M.: Distributed Anonymous Mobile Robots: Formation of Geometric Patterns. SIAM J. Computing 28(4), 1347–1363 (1999)
41. Tanner, H., Jadbabaie, A., Pappas, G.J.: Flocking in Fixed and Switching Networks. IEEE Trans. Automatic Control 52(5), 863–868 (2007)
42. Whitcomb, L.L., Koditschek, D.E., Cabrera, J.B.D.: Toward the Automatic Control of Robot Assembly Tasks via Potential Functions: The Case of 2-D Sphere Assemblies. In: IEEE Int. Conf. on Robotics and Automation, pp. 2186–2191 (1992)
43. Yamaguchi, H.: A Distributed Motion Coordination Strategy for Multiple Nonholonomic Mobile Robots in Cooperative Hunting Operations. Robotics and Autonomous Systems 43(4), 257–282 (2003)
Abstract. This paper studies the flocking problem, where mobile robots group
to form a desired pattern and move together while maintaining that formation.
Unlike previous studies of the problem, we consider a system of mobile robots
in which a number of them may possibly fail by crashing. Our algorithm ensures
that the crash of faulty robots does not bring the formation to a permanent stop,
and that the correct robots are thus eventually allowed to reorganize and continue
moving together. Furthermore, the algorithm makes no assumption on the relative
speeds at which the robots can move.
The algorithm relies on the assumption that robots' activations follow a k-bounded asynchronous scheduler, in the sense that the beginnings and ends of activations are not synchronized across robots (asynchronous), and that while the slowest robot is activated once, the fastest robot is activated at most k times (k-bounded).
The proposed algorithm is made of three parts. First, appropriate restrictions
on the movements of the robots make it possible to agree on a common ranking
of the robots. Second, based on the ranking and the k-bounded scheduler, robots
can eventually detect any robot that has crashed, and thus trigger a reorganization of the robots. Finally, the third part of the algorithm ensures that the robots
move together while keeping an approximation of a regular polygon, while also
ensuring the necessary restrictions on their movement.
1 Introduction
Be it on earth, in space, or on other planets, robots and other kinds of automatic systems provide essential support in otherwise adverse and hazardous environments. For
instance, among many other applications, it is becoming increasingly attractive to consider a group of mobile robots as a way to provide support for rescue and relief during
or after a natural catastrophe (e.g., earthquake, tsunami, cyclone, volcano eruption).
As a result, research on mechanisms for coordination and self-organization of mobile
robot systems is beginning to attract considerable attention (e.g., [17,19,20,21]). For
Work supported by MEXT Grant-in-Aid for Young Scientists (A) (Nr. 18680007).
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 145–163, 2008.
© Springer-Verlag Berlin Heidelberg 2008
such operations, relying on a group of simple robots for delicate operations has various advantages over considering a single complex robot. For instance, (1) it is usually
more cost-effective to manufacture and deploy a number of cheap robots rather than a
single expensive one, (2) a higher number yields better potential for a system resilient to
individual robot failures, (3) smaller robots have obviously better mobility in tight and
confined spaces, and (4) a group can survey a larger area than an individual robot, even
if the latter is equipped with better sensors.
Nevertheless, merely bringing robots together is by no means sufficient, and adequate coordination mechanisms must be designed to ensure coherent group behavior.
Furthermore, since many applications of cooperative robotics consider cheap robots
dwelling in hazardous environments, fault-tolerance is of primary concern.
The problem of reaching agreement among a group of autonomous mobile robots
has attracted considerable attention over the last few years. While much formal work
focuses on the gathering problem (robots must meet at a point, e.g., [7]) as the embodiment of a static notion of agreement, this work studies the problem of flocking (robots
must move together), which embodies a dynamic notion of agreement, as well as coordination and synchronization. The flocking problem has been studied from various
perspectives. Studies can be found in different disciplines, from artificial intelligence
to engineering [1,3,5,6]. However, only a few works have considered the presence of faulty
robots [2,4].
Fault-tolerant flocking. Briefly, the main problem studied in this paper, namely the
flocking problem, requires that a group of robots move together, staying close to each
other, and keeping some desired formation while moving. Numerous definitions of
flocking can be found in the literature [3,11,12,14], but few of them define the problem precisely. The rare rigorous definitions of the problem suppose the existence of a
leader robot and require that the other robots, called followers, follow the leader in a
desired fashion [3,6,10], such as by maintaining an approximation of a regular polygon.
The variant of the problem that we consider in this paper requires that the robots
form and move while maintaining an approximation of a regular polygon, in spite of the possible presence of faulty robots; robots may fail by crashing, and a crash is permanent. Although we do consider the presence of a leader robot to lead the group, the
role of leader is assigned dynamically and any of the robots can potentially become a
leader. In particular, after the crash of a leader, a new leader must eventually take over
that role.
Model. The system is modelled as a group of autonomous mobile robots, represented as points evolving on the plane, all of which execute the same algorithm independently. Some of the robots may fail by crashing, after which they never move again. Although the robots share no common origin, they do share
one common direction (as given by a compass), a common unit distance, and the same
notion of clockwise direction.
Robots repeatedly go through a succession of activation cycles during which they
observe their environment, compute a destination and move. Robots are asynchronous
in that one robot may begin an activation cycle while another robot finishes one. While
some robots may be activated more often than others, we assume that the scheduler
is k-bounded in the sense that, in the interval it takes any correct robot to perform a
single activation cycle, no other robot performs more than k activations. The robots can
remember only a limited number of their past activations.
Contribution. The paper presents a fault-tolerant flocking algorithm for a k-bounded
asynchronous robot system. The algorithm is decomposed into three parts. In the first
part, the algorithm relies on the k-bounded scheduler to ensure failure detection. In the
second part, the algorithm establishes a ranking system for the robots and then ensures
that robots agree on the same ranking throughout activations. In the third and last part,
the ranking and the failure detector are combined to realize the flocking of the robots
by maintaining an approximation of a regular polygon while moving.
Related work. Gervasi and Prencipe [3] have proposed a flocking algorithm for robots
based on a leader-followers model, but introduce additional assumptions on the speed
of the robots. In particular, they proposed a flocking algorithm for formations that are
symmetric with respect to the leader's movement, without agreement on a common coordinate system (except for the unit distance). However, their algorithm requires that the leader be distinguished from the follower robots.
Canepa and Potop-Butucaru [6] proposed a flocking algorithm in an asynchronous
system with oblivious robots. First, the robots elect a leader using a probabilistic algorithm. After that, the robots position themselves according to a specific formation. Finally, the formation moves ahead. Their algorithm only lets the formation move straight
forward. Although the leader is determined dynamically, once elected it can no longer
change. In the absence of faulty robots, this is a reasonable limitation in their model.
To the best of our knowledge, our work is the first to consider flocking of asynchronous (k-bounded) robots in the presence of faulty robots. Also, we want to stress
that the above two algorithms do not work properly in the presence of faulty robots, and
that their adaptation is not straightforward.
Structure. The remainder of this paper is organized as follows. In Section 2, we present
the system model. In Section 3, we define the problem. In Section 4, we propose a
failure detection algorithm based on a k-bounded scheduler. In Section 5, we give an algorithm that provides a ranking mechanism for the robots. In Section 6, we propose a dynamic fault-tolerant flocking algorithm that maintains an approximation of a regular
polygon. Finally, in Section 7, we conclude the paper.
The local view of each robot includes a unit of length, an origin, and the directions and orientations of the two x and y coordinate axes. In particular, we assume
that robots have a partial agreement on the local coordinate system. Specifically, they
agree on the orientation and direction of one axis, say y. Also, they agree on the clockwise/counterclockwise direction.
The robots are completely autonomous. Moreover, they are anonymous, in the sense
that they are a priori indistinguishable by appearance. Furthermore, there is no direct
means of communication among them.
In the CORDA model, robots are totally asynchronous. The cycle of a robot consists
of a sequence of events: Wait-Look-Compute-Move.
Wait. A robot is idle. A robot cannot stay permanently idle. At the beginning, all robots are in the Wait state.
Look. Here, a robot observes the world by activating its sensors, which will return
a snapshot of the positions of the robots in the system.
Compute. In this event, a robot performs a local computation according to its deterministic algorithm. The algorithm is the same for all robots, and the result of the Compute state is a destination point.
Move. The robot moves toward its computed destination. However, the distance it moves is not fixed in advance: it is neither infinite nor infinitesimally small. Hence, the robot can only move towards its goal, but the move can end anywhere before the destination.
In the model, there are two limiting assumptions related to the cycle of a robot.
Assumption 1. It is assumed that the distance travelled by a robot r in a move is not infinite. Furthermore, it is not infinitesimally small: there exists a constant δr > 0 such that, if the target point is closer than δr, r will reach it; otherwise, r will move toward it by at least δr.
Assumption 2. The amount of time required by a robot r to complete a cycle (wait-look-compute-move) is not infinite. Furthermore, it is not infinitesimally small: there exists a constant εr > 0 such that the cycle will require at least εr time.
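Assumption 1 can be sketched in code as follows (names are ours: delta_r is the per-robot constant, and travelled models how far the scheduler lets r actually go in this Move, at least delta_r).

```python
# Sketch of the Move phase under Assumption 1: a robot reaches any target
# closer than delta_r; otherwise it advances by 'travelled' >= delta_r and
# may stop anywhere before the destination.
import math

def move(pos, target, delta_r, travelled):
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist <= max(delta_r, travelled):
        return target                       # target reached this cycle
    f = travelled / dist                    # partial progress toward target
    return (pos[0] + f * dx, pos[1] + f * dy)
```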
2.2 Assumptions
k-bounded-scheduler. In this paper, we assume the CORDA model with a k-bounded scheduler, in order to ensure some fairness of activations among robots. Before we define the k-bounded-scheduler, we give a definition of a full activation cycle for robots.
Definition 1 (full activation cycle). A full activation cycle for any robot ri is defined
as the interval from the event Look (included) to the next instance of the same event
Look (excluded).
Definition 2 (k-bounded-scheduler). With a k-bounded scheduler, between two consecutive full activation cycles of the same robot ri , another robot rj can execute at most
k full activation cycles.
This allows us to establish the following lemma:
Lemma 1. If a robot ri is activated k+1 times, then all (correct) robots have completed
at least one full activation cycle during the same interval.
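Definition 2 can be checked on a concrete activation trace (function and variable names are ours): between two consecutive full activation cycles of any robot, every other robot may appear at most k times.

```python
# Checks the k-bounded property of Definition 2 on a trace of robot ids
# listed in the order their full activation cycles complete.
from collections import Counter

def is_k_bounded(trace, k):
    last = {}                                   # previous index of each robot
    for i, r in enumerate(trace):
        if r in last:
            between = Counter(trace[last[r] + 1 : i])
            if any(c > k for other, c in between.items() if other != r):
                return False
        last[r] = i
    return True
```

Lemma 1 then corresponds to the observation that, in a trace passing this check, any window containing k + 1 activations of ri must contain at least one full cycle of every other correct robot.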
Faults. In this paper, we address crash failures; we consider both robots that crash initially and robots that crash during the execution. A robot may fail by crashing, after which it executes no actions (no movement). A crash is permanent in the sense that a faulty robot never recovers. However, it is still physically present in the system, and it is seen by the other non-crashed robots. A robot that is not faulty is called a correct robot.
Before proceeding, we give the following notation, used throughout this paper. We denote by R = {r1, ..., rn} the set of all the robots in the system. Given some robot ri, ri(t) is the position of ri at time t, and y(ri) denotes the y coordinate of robot ri at some time t. Let A and B be two points; AB denotes the segment from A to B, and dist(A, B) is the length of this segment. Given a region X, we denote by |X| the number of robots in that region at time t. Finally, if S is a set of robots, then |S| denotes the number of robots in S.
3 Problem Definition
Definition 3 (Formation). A formation F = Formation(P1, P2, ..., Pn) is a configuration, with P1 the leader of the formation and the remaining points the followers of the formation. The leader P1 is not physically distinct from the follower robots.
In this paper, we assume that the formation F is a regular polygon. We denote by d the length of the polygon edge (known to the robots); the interior angle of the polygon is (n − 2) · 180°/n, where n is the number of robots in F.
Definition 4 (Approximate Formation). We say that the robots form an approximation of the formation F if each robot ri is within ε of its target Pi in F.
Definition 5 (The Flocking Problem). Let r1, ..., rn be a group of robots whose positions constitute a formation F = Formation(P1, P2, ..., Pn). The robots solve the Approximate Flocking Problem if, starting from any arbitrary formation at time t0, there exists t1 ≥ t0 such that, for all t ≥ t1, all robots are at a distance of at most ε from their respective targets Pi in F, where ε is a small positive value known to all robots.
ri never visits the same location during its last k + 1 activations.1 Finally, a robot ri never visits a location that was visited by any other robot rj during the last k + 1 activations of rj.
Recall that we only consider permanent crash failures of robots, and that crashed robots remain physically in the system. Besides, robots are anonymous. Therefore, the problem is how to distinguish faulty robots from correct ones. Algorithm 1 provides a simple perfect failure detection mechanism for the identification of correct robots. The algorithm is based on the fact that a correct robot must change its current position whenever it is activated (Assumption 3), and also relies on the definition of the k-bounded scheduler for the activations of robots. So, a robot ri considers that some robot rj is faulty if ri is activated k + 1 times while robot rj remains in the same position. Algorithm 1 outputs the set of positions of correct robots, Scorrect, and uses the following variables:
– SPosPrevObser: a global variable representing the set of positions of the robots in the system (both correct and faulty) at the previous activation of robot ri. SPosPrevObser is initialized to the empty set during the first activation of robot ri.
– SPosCurrObser: the set of points representing the positions of robots (including faulty ones) at the current activation of robot ri.
– cj: a global variable recording how many times robot rj did not change its position.
Algorithm 1. Perfect Failure Detection (code executed by robot ri)
Initialization: SPosPrevObser := ∅; cj := 0
 1: procedure Failure Detection(SPosPrevObser, SPosCurrObser)
 2:   Scorrect := SPosCurrObser;
 3:   for pj ∈ SPosCurrObser do
 4:     if (pj ∈ SPosPrevObser) then
 5:       cj := cj + 1;
 6:     else
 7:       cj := 0;
 8:     end if
 9:     if (cj ≥ k) then
10:       Scorrect := Scorrect \ {pj};
11:     end if
12:   end for
13:   return (Scorrect)
14: end
The proposed failure detection algorithm (Algorithm 1) satisfies the two properties of a perfect failure detector: strong completeness and strong accuracy. It also satisfies the eventual agreement property. These properties are stated respectively in Theorem 1, Theorem 2, and Theorem 3; their proofs are straightforward (details are in the corresponding research report [18]).
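As an illustration, the detection rule of Algorithm 1 can be sketched as follows. The function name and the use of positions as robot identifiers are our own simplifications, not the paper's; they work here because a crashed robot keeps its position forever, so a position that is unchanged across enough observations marks a suspect.

```python
def detect_correct(prev_positions, curr_positions, counters, k):
    """Sketch of Algorithm 1: exclude from the correct set any robot whose
    position has not changed for k consecutive observations."""
    correct = set(curr_positions)
    for p in curr_positions:
        if p in prev_positions:
            counters[p] = counters.get(p, 0) + 1  # robot did not move since last look
        else:
            counters[p] = 0                        # robot moved: reset its counter
        if counters[p] >= k:
            correct.discard(p)                     # suspected crashed
    return correct
```

Strong completeness follows because a crashed robot never moves and its counter eventually reaches k; strong accuracy follows because, under the k-bounded scheduler and Assumption 3, a correct robot changes position before the observer is activated k + 1 times.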
¹ That is, ri never revisits a point location that was on its line of movement during its last k + 1 activations.
Consider two robots that happen to have the same coordinate system and that are always activated together. It is impossible to separate them deterministically. In contrast, it would be trivial to scatter them to distinct positions using randomization (e.g., [15]), but this is ruled out in our model.
Note that the bound min(r/(k+1)(k+2), dist(ri, p)/(k+1)(k+2)) set on the movement of robots is conservative; it is sufficient to avoid collisions between robots and to satisfy Assumption 3.
derive the following lemmas. In particular, the algorithm gives a unique ranking to every
robot in the system, and also ensures no collisions between robots.
Lemma 2. Algorithm 2 gives a unique ranking to every correct robot in the system.
Lemma 3. By Algorithm 2, there is a finite time after which, all correct robots agree
on the same initial sequence of ranking, RankSequence.
Lemma 4. Algorithm 2 guarantees no collisions between the robots in the system.
The proofs of the above lemmas are simple (details can be found in the corresponding research report [18]).
[Fig. 1. Zone of movement of the leader. (a) ri and ri+1 have the same y-coordinate, and dist(ri, ri+1) < r: Zone(ri) is the half circle with radius dist(ri, ri+1)/(k+1)(k+2). (b) ri and ri+1 do not have the same y-coordinate, and dist(ri, proj(ri+1)) ≥ r: Zone(ri) is the circle with radius r/(k+1)(k+2).]
– ri and ri+1 do not have the same y-coordinate: Zone(ri) is the circle centered at ri with radius min(dist(ri, proj(ri+1))/(k+1)(k+2), r/(k+1)(k+2)) (refer to Fig. 1(b)).
After determining its zone of movement Zone(ri), robot ri needs to determine whether there are crashed robots within Zone(ri). If no crashed robots are within its zone, then robot ri can move to any desired target within Zone(ri), satisfying Assumption 3. Otherwise, robot ri moves within Zone(ri), excluding the positions of crashed robots and satisfying Assumption 3.
6. Robot ri is a follower. First, ri assigns the points of the formation P1, ..., Pn to the robots in RankSequence based on their order in RankSequence. Subsequently, robot ri determines its target Pi based on the current position of the leader (P1) and the polygon angle α = (n − 2) · 180°/n, where n is the number of robots in the formation.
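The assignment of target points along a regular polygon can be sketched as follows. The helper and its heading parameter are our own illustration, not the paper's: starting from the leader's position, we walk edges of length d, turning by the exterior angle 360°/n at each vertex (the exterior angle is the supplement of the interior angle α).

```python
import math

def polygon_targets(leader, n, d, heading=0.0):
    """Sketch (our own helper): generate the n vertices P1..Pn of a regular
    polygon with edge length d, starting at the leader's position and walking
    along edges, turning by the exterior angle 2*pi/n at each vertex."""
    x, y = leader
    points = [(x, y)]
    angle = heading
    for _ in range(n - 1):
        x += d * math.cos(angle)
        y += d * math.sin(angle)
        points.append((x, y))
        angle += 2 * math.pi / n  # exterior angle = 180° - α
    return points
```

For n = 4 and d = 1 this walks the four corners of a unit square starting at the leader.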
In order to ensure that no collisions occur between robots, the algorithm also defines a movement zone for each robot follower. The zone of a follower ri, referred to as Zone(ri), is defined depending on the positions of the previous robot ri−1 and the next robot ri+1 of ri in RankSequence. Before we proceed, we denote by proj(ri−1) the projection of robot ri−1 on the y-axis of robot ri; similarly, we denote by proj(ri+1) the projection of robot ri+1 on the y-axis of ri. The zone of movement of a robot follower ri is defined as follows:
– ri, ri−1 and ri+1 have the same y-coordinate: then Zone(ri) is the segment ri p, with p the point at distance min(dist(ri, ri+1)/(k+1)(k+2), r/(k+1)(k+2)) from ri (Fig. 2(a)).
– ri, ri−1 and ri+1 do not have the same y-coordinate: then Zone(ri) is the circle centered at ri with radius min(r/(k+1)(k+2), dist(ri, proj(ri−1))/(k+1)(k+2), dist(ri, proj(ri+1))/(k+1)(k+2)) (Fig. 2(b)).
– ri and ri+1 have the same y-coordinate but ri−1 does not: then Zone(ri) is the half circle above ri, centered at ri, with radius min(r/(k+1)(k+2), dist(ri, proj(ri−1))/(k+1)(k+2), dist(ri, ri+1)/(k+1)(k+2)).
Fig. 2. Zone of movement of a follower. There are three cases. (a) ri−1, ri, and ri+1 have the same y-coordinate. (b) ri−1, ri and ri+1 do not have the same y-coordinate, and dist(ri, proj(ri−1)) ≥ r and dist(ri, proj(ri+1)) ≥ r. (c) ri−1 and ri have the same y-coordinate but ri+1 does not; also, dist(ri, ri−1) ≥ r and dist(ri, proj(ri+1)) < r.
– ri and ri−1 have the same y-coordinate but ri+1 does not: then Zone(ri) is the half circle below ri, centered at ri, with radius min(r/(k+1)(k+2), dist(ri, ri−1)/(k+1)(k+2), dist(ri, proj(ri+1))/(k+1)(k+2)) (Fig. 2(c)).
As mentioned before, the bound min(r/(k+1)(k+2), dist(ri, p)/(k+1)(k+2)) set on the movement of robots is conservative; it is sufficient to avoid collisions between robots and to satisfy Assumption 3 (this will be proved later).
For the sake of clarity, we do not describe explicitly in Algorithm 6 the zone of movement of the last robot in the rank sequence. The computation of its zone of movement is similar to that of the other robot followers, the only difference being that it has no next neighbor ri+1. So, if robot ri has the same y-coordinate as its previous neighbor ri−1, then its zone of movement is the half circle with radius min(r/(k+1)(k+2), dist(ri, ri−1)/(k+1)(k+2)), centered at ri and below ri. Otherwise, it is the circle centered at ri with radius min(r/(k+1)(k+2), dist(ri, proj(ri−1))/(k+1)(k+2)).
After determining its zone of movement Zone(ri), robot ri needs to determine whether it can progress toward its target Target(ri). Note that Target(ri) does not necessarily belong to Zone(ri). To do so, robot ri computes the intersection of the segment ri Target(ri) and Zone(ri), called Intersect. If Intersect is equal to the position of ri, then ri moves toward its right as given by the procedure Lateral Move Right(). Otherwise, ri moves along the segment Intersect as far as possible, while avoiding the location of any crashed robot in Intersect and satisfying Assumption 3. In any case, if ri is not able to move to any point in Intersect other than its current position, it moves to its right as in the procedure Lateral Move Right().
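The clamped progress toward the target can be sketched as follows; the function name is ours, and for simplicity the zone is modeled as a full disk centered at the robot (the half-circle cases restrict this further).

```python
import math

def next_step(pos, target, zone_radius):
    """Sketch (our naming): move from pos toward target, but never leave the
    movement zone, modeled here as a disk of radius zone_radius around pos."""
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or zone_radius == 0:
        return pos  # no progress possible: the caller falls back to a lateral move
    step = min(dist, zone_radius)  # clamp the move to the zone boundary
    return (pos[0] + dx / dist * step, pos[1] + dy / dist * step)
```

When the clamp returns the current position, the algorithm's Lateral Move Right() fallback applies.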
Note that, by the algorithm, robot followers can move in any direction by adapting their target positions with respect to the new position of the leader. When the leader is idle, robot followers move within distance r/(k+1)(k+2) or smaller in order to keep an approximation of the formation with respect to the position of the leader, and to preserve the rank sequence.
Now assume that both robots ri and ri+1 are moving in the same direction; we show that ri never reaches the position of ri+1 within k + 1 activations of ri+1. Assume the worst case, where robot ri+1 is activated once during every k activations of ri. Then, after k + 1 activations of ri+1, ri has moved toward ri+1 by a distance of at most dist(ri, ri+1)(k+1)²/(k+1)(k+2), which is strictly less than dist(ri, ri+1); hence ri is unable to reach the position of ri+1.
Finally, assume that ri and ri+1 are moving toward each other. In this case, we assume the worst case, when both robots are always activated together. After k + 1 activations of either ri or ri+1, each of them travels toward the other by at most dist(ri, ri+1)(k+1)/(k+1)(k+2) = dist(ri, ri+1)/(k+2). The total, 2 dist(ri, ri+1)/(k+2), is always strictly less than dist(ri, ri+1) because k ≥ 1. Hence, neither ri nor ri+1 moves to a location that was occupied by the other during its last k + 1 activations, and the lemma holds.
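The arithmetic in this worst case can be checked numerically (our own helper, under the stated per-activation bound of dist/(k+1)(k+2)):

```python
def max_travel_toward_each_other(dist, k, activations):
    """Sketch: each robot moves at most dist/((k+1)(k+2)) per activation;
    two robots moving toward each other close at most this total gap."""
    per_step = dist / ((k + 1) * (k + 2))
    return 2 * activations * per_step

# after k+1 joint activations the gap closed is 2*dist/(k+2) < dist for k >= 1
for k in range(1, 10):
    assert max_travel_toward_each_other(1.0, k, k + 1) < 1.0
```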
Corollary 1. By Algorithm 4, at any time t, there is no overlap between the zones of
movement of any two correct robots in the system.
Agreement on Ranking. In this section, we show that correct robots always agree on the same ranking sequence, even in the presence of robot failures.
Lemma 6. By Algorithm 4, correct robots always agree on the same RankSequence
when there is no crash. Moreover, if some robot rj crashes, there is a finite time after
which, all correct robots exclude rj from the ordered set RankSequence, and keep the
same total order in RankSequence.
Proof (Lemma 6). By Lemma 3, all correct robots agree on the same ranking sequence RankSequence after the first k activations of any robot in the system. In the following, we first show that RankSequence is preserved during the execution of Algorithm 4 when there is no crash in the system. Second, we show that if some robot rj has crashed, there is a finite time after which correct robots agree on the new ranking sequence, excluding rj.
– There is no crash in the system: we consider three consecutive robots ra, rb and rc in RankSequence, such that ra < rb < rc. We prove that the movement of rb does not allow it to swap ranks with ra or rc in the three cases that follow:
1. ra, rb and rc share the same y-coordinate. In this case, rb moves by min(r/(k+1)(k+2), dist(rb, rc)/(k+1)(k+2)) along the segment rb rc. Such a move does not change the y-coordinate of rb, nor its rank with respect to ra and rc, because rb always stays between ra and rc and never reaches either ra or rc, by the restrictions of the algorithm.
2. ra, rb and rc do not share the same y-coordinate. In this case, the movement of rb is restricted to a circle C, centered at rb, whose radius does not allow rb to reach the same y-coordinate as either ra or rc. In particular, the radius of C is equal to min(r/(k+1)(k+2), dist(rb, proj(ra))/(k+1)(k+2), dist(rb, proj(rc))/(k+1)(k+2)), which is less than dist(rb, proj(ra))/k and dist(rb, proj(rc))/k, where proj(ra) and proj(rc) are, respectively, the projections of robots ra and rc on the y-axis of rb. Hence, such a restriction on the movement of rb does not allow it to swap its rank with either ra or rc.
3. Two consecutive robots (say ra and rb) have the same y-coordinate, but rc does not. This case is similar to the previous one. The movement of rb is restricted to a half circle, centered at rb and below it, whose radius does not allow rb to reach a y-coordinate less than or equal to that of rc. In particular, that radius is equal to min(r/(k+1)(k+2), dist(ra, rb)/(k+1)(k+2), dist(rb, proj(rc))/(k+1)(k+2)), which is less than dist(ra, rb)/k and also less than dist(rb, proj(rc))/k, where proj(rc) is the projection of robot rc on the y-axis of rb. Hence, the restriction on the movement of rb does not allow it to swap ranks with either ra or rc.
Since all robots execute the same algorithm, the proof holds for any two consecutive robots in RankSequence. Note that the same proof applies to the algorithms executed by both the leader and the followers, because the restrictions made on their movements are the same.
– Some robot rj crashes: from what we proved above, we deduce that all robots agree on and preserve the same ranking sequence RankSequence in the case of no crash. Assume now that a robot rj crashes. By Theorem 3, we know that there is a finite time after which all correct robots detect the crash of rj. Hence, there is a finite time after which correct robots exclude robot rj from the ordered set RankSequence.
In conclusion, the total order in RankSequence is preserved for correct robots during
the entire execution of Algorithm 4. This terminates the proof.
The following theorem is a direct consequence of Lemma 6.
Theorem 4. By Algorithm 4, all robots agree on the total order of their ranking during
the entire execution of the algorithm.
Collision-Freedom
Lemma 7. Under Algorithm 4, at any time t, no two correct robots ever move to the
same location. Also, no correct robot ever moves to a position occupied by a faulty
robot.
Proof (Lemma 7). To prove that no two correct robots ever move to the same location, we show that any robot ri always moves to a location within its own zone Zone(ri); the rest follows from the fact that the zones of two robots do not intersect (Corollary 1). By the restrictions of the algorithm, ri must move to a location Target(ri) within Zone(ri). Since ri belongs to Zone(ri), Zone(ri) is a convex region or a line segment, and the movement of ri is linear, all points between ri and Target(ri) must be in Zone(ri).
Now we prove that no correct robot ever moves to a position occupied by a crashed robot. By Theorem 1, robot ri can compute the positions of crashed robots in finite time. Moreover, by Lemma 5, robot ri always has free destinations within its zone Zone(ri), which excludes crashed robots. Finally, Algorithm 4 restricts robots from moving to locations occupied by crashed robots. Thus, robot ri never moves to a location occupied by a crashed robot.
r/(k+1)(k+2), then in every new k activations in the system, each correct robot ri cannot move farther than r from its position during those k activations. Consequently, ri can always remain within r of its target Pi as in Definition 4, and the lemma follows.
Theorem 6. Algorithm 4 allows correct robots to dynamically form an approximation
of a regular polygon, while avoiding collisions.
Proof (Theorem 6). First, by Theorem 3, there is a finite time after which all correct robots agree on the same set of correct robots. Second, by Theorem 4, all correct robots agree on the total order of their ranking RankSequence. Third, by Theorem 5, there is no collision between any two robots in the system, including crashed ones. Finally, by Lemma 9, all correct robots form an approximation of a regular polygon in finite time, and the theorem holds.
Lemma 10. Algorithm 4 tolerates permanent crash failures of robots.
Proof (Lemma 10). By Theorem 1, a crash of a robot is detected in finite time, and by Algorithm 4, a crashed robot is removed from the list of correct robots, although it remains physically in the system. Finally, by Theorem 5, correct robots avoid collisions with crashed robots. Thus, Algorithm 4 tolerates permanent crash failures of robots.
From Theorem 6, and Lemma 10, we infer the following theorem:
Theorem 7. Algorithm 4 is a fault tolerant dynamic flocking algorithm that tolerates
permanent crash failures of robots.
7 Conclusion
In this paper, we have proposed a fault-tolerant flocking algorithm that allows a group of asynchronous robots to self-organize dynamically and form an approximation of a regular polygon, while maintaining this formation in movement. The algorithm relies on the assumptions that robot activations follow a k-bounded asynchronous scheduler and that robots have a limited memory of the past.
Our flocking algorithm allows correct robots to move in any direction while keeping an approximation of the polygon. Unlike previous works (e.g., [3,6]), our algorithm is fault-tolerant and tolerates permanent crash failures of robots. The only drawback of our algorithm is that it does not permit the robots to rotate the polygon; this is due to the restrictions made on the algorithm in order to ensure agreement on the ranking by robots. The existence of such an algorithm is left as an open question that we will investigate in future work.
Finally, our work opens new interesting questions; for instance, it would be interesting to investigate how to support flocking in a model in which robots may crash and recover.
Acknowledgments
This work is supported by the JSPS (Japan Society for the Promotion of Science) postdoctoral fellowship for foreign researchers (ID No. P08046).
References
1. Daigle, M.J., Koutsoukos, X.D., Biswas, G.: Distributed diagnosis in formations of mobile robots. IEEE Transactions on Robotics 23(2), 353–369 (2007)
2. Coble, J., Cook, D.: Fault tolerant coordination of robot teams. citeseer.ist.psu.edu/coble98fault.html
3. Gervasi, V., Prencipe, G.: Coordination without communication: the Case of the Flocking Problem. Discrete Applied Mathematics 143(1-3), 203–223 (2004)
4. Hayes, A.T., Dormiani-Tabatabaei, P.: Self-organized flocking with agent failure: Off-line optimization and demonstration with real robots. In: Proc. IEEE Intl. Conference on Robotics and Automation, vol. 4, pp. 3900–3905 (2002)
5. Saber, R.O., Murray, R.M.: Flocking with Obstacle Avoidance: Cooperation with Limited Communication in Mobile Networks. In: Proc. 42nd IEEE Conference on Decision and Control, pp. 2022–2028 (2003)
6. Canepa, D., Potop-Butucaru, M.G.: Stabilizing flocking via leader election in robot networks. In: Masuzawa, T., Tixeuil, S. (eds.) SSS 2007. LNCS, vol. 4838, pp. 52–66. Springer, Heidelberg (2007)
7. Defago, X., Gradinariu, M., Messika, S., Raipin-Parvedy, P.: Fault-tolerant and self-stabilizing mobile robots gathering. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 46–60. Springer, Heidelberg (2006)
8. Prencipe, G.: CORDA: Distributed Coordination of a Set of Autonomous Mobile Robots. In: Proc. European Research Seminar on Advances in Distributed Systems, pp. 185–190 (2001)
9. Flocchini, P., Prencipe, G., Santoro, N., Widmayer, P.: Pattern Formation by Autonomous Robots Without Chirality. In: Proc. 8th Intl. Colloquium on Structural Information and Communication Complexity (SIROCCO 2001), pp. 147–162 (2001)
10. Gervasi, V., Prencipe, G.: Flocking by a Set of Autonomous Mobile Robots. Technical Report TR-01-24, Dipartimento di Informatica, Università di Pisa, Italy (2001)
11. Reynolds, C.W.: Flocks, Herds, and Schools: A Distributed Behavioral Model. Computer Graphics 21(1), 79–98 (1987)
12. Brogan, D.C., Hodgins, J.K.: Group Behaviors for Systems with Significant Dynamics. Autonomous Robots Journal 4, 137–153 (1997)
13. Toner, J., Tu, Y.: Flocks, Herds, and Schools: A Quantitative Theory of Flocking. Physical Review E 58(4), 4828–4858 (1998)
14. Yamaguchi, H., Beni, G.: Distributed Autonomous Formation Control of Mobile Robot Groups by Swarm-based Pattern Generation. In: Proc. 2nd Int. Symp. on Distributed Autonomous Robotic Systems (DARS 1996), pp. 141–155 (1996)
15. Dieudonne, Y., Petit, F.: A Scatter of Weak Robots. Technical Report RR07-10, LARIA, CNRS, France (2007)
16. Chandra, T.D., Toueg, S.: Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM 43(2), 225–267 (1996)
17. Schreiner, K.: NASA's JPL Nanorover Outposts Project Develops Colony of Solar-powered Nanorovers. IEEE DS Online 3(2) (2001)
18. Souissi, S., Yang, Y., Defago, X.: Fault-tolerant Flocking in a k-bounded Asynchronous System. Technical Report IS-RR-2008-004, JAIST, Japan (2008)
19. Konolige, K., Ortiz, C., Vincent, R., Agno, A., Eriksen, M., Limketkai, B., Lewis, M., Briesemeister, L., Ruspini, E., Fox, O., Stewart, J., Ko, B., Guibas, L.: CENTIBOTS: Large-Scale Robot Teams. In: Multi-Robot Systems: From Swarms to Intelligent Automata (2003)
20. Bellur, B.R., Lewis, M.G., Templin, F.L.: An Ad-hoc Network for Teams of Autonomous Vehicles. In: Proc. 1st IEEE Symp. on Autonomous Intelligent Networks and Systems (2002)
21. Jennings, J.S., Whelan, G., Evans, W.F.: Cooperative Search and Rescue with a Team of Mobile Robots. In: Proc. 8th Intl. Conference on Advanced Robotics, pp. 193–200 (1997)
Abstract. In this paper we study the impact of the speed of movement of nodes
on the solvability of deterministic reliable geocast in mobile ad-hoc networks,
where nodes move in a continuous manner with bounded maximum speed. Nodes
do not know their position, nor the speed or direction of their movements. Nodes
communicate over a radio network, so links may appear and disappear as nodes
move in and out of the transmission range of each other. We assume that it takes a
given time T for a single-hop communication to reliably complete. The mobility of nodes may be an obstacle to deterministic reliable communication, because the speed of movement may affect how quickly the communication topology changes.
Assuming the two-dimensional mobility model, the paper presents two tight bounds for the solvability of deterministic geocast. First, we prove that a maximum speed vmax < δ/T is a necessary and sufficient condition to solve the geocast, where δ is a parameter that, together with the maximum speed, captures the local stability of the communication topology. We also prove that Ω(nT) is a time complexity lower bound for a geocast algorithm to ensure deterministic reliable delivery, and we provide a distributed solution which is asymptotically optimal in time.
Finally, assuming the one-dimensional mobility model, i.e. nodes moving on
a line, we provide a lower bound on the speed of movement necessary to solve
the geocast problem, and we give a distributed solution. The algorithm proposed
is more efficient in terms of time and message complexity than the algorithm for
two dimensions.
Keywords: Mobile ad-hoc network, geocast, speed of movement towards solvability, distributed algorithms.
1 Introduction
A mobile ad-hoc network (MANET) is a set of mobile wireless nodes which dynamically build a network, without relying on a stable infrastructure. Direct communication
links are created between pairs of nodes as they come into the transmission range of
each other. If two nodes are too far apart to establish a direct wireless link, other nodes
act as relays to route messages between them. This self-organizing nature of mobile
This research was supported in part by Comunidad de Madrid grant S-0505/TIC/0285 and Spanish MEC grants TIN2005-09198-C02-01 and TIN2008-06735-C02-01. The work of Alessia Milani is funded by a Juan de la Cierva contract.
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 164–183, 2008.
© Springer-Verlag Berlin Heidelberg 2008
ad-hoc networks makes them specially interesting for scenarios where networks have
to be built on the fly, e.g., under emergency situation, in military operations, or in environmental data collection and dissemination.
A fundamental communication primitive in mobile ad-hoc networks is geocasting [14]. This is an operation, initiated by a node called the source, that disseminates some information to all the nodes in a given geographical area, named the geocast region. In this sense, the geocast primitive is a variant of multicasting, where nodes are eligible to deliver the information according to their geographical location.
While geocasting in two dimensions is clearly useful, geocasting in one dimension is also a natural operation in some real situations, like announcing an accident to nearby vehicles on a highway. In mobile ad-hoc environments, geocasting is also a basic building block for more complex services. As an example, Dolev et al. [5] use
a deterministic reliable geocast service to implement atomic memory in mobile ad-hoc
networks. A geocast service is deterministic if it ensures deterministic reliable delivery,
i.e. all the nodes eligible to deliver the information will surely deliver it.
Designing a geocast primitive for a mobile ad-hoc network forces one to deal with the uncertainty due to the dynamics of the network. Since communication links appear and disappear as nodes move in and out of the transmission range of other nodes, the communication topology (potentially) changes continuously. In other words, the movement of nodes and their speed usually impact the lifetime of radio links. Then, since it takes at least Ω(log n) time to ensure a one-hop successful transmission in a network with n nodes [6], mobility may be an obstacle to deterministic reliable communication.
Our contributions. In this paper we study the impact of the maximum speed of movement of nodes on the solvability of deterministic geocast in mobile ad-hoc networks.
In our model we assume that a node does not know its position (nodes have no GPS or
similar device), and that it knows neither the speed nor the direction of its movement.
Additionally, we assume that it takes a given time T for a single-hop communication to
succeed. As far as we know, [1] is the only theoretical work that deals with the geocast
problem in such a model.
Our results improve and generalize the bounds presented in [1] and, to the best of our
knowledge, present the first deterministic reliable geocast solution for two dimensions,
i.e. where nodes move in a continuous manner in the plane. In particular, we give bounds
on the maximum speed of nodes in order to be able to solve the deterministic reliable
geocast problem in one and two dimensions. While the bounds provided in [1] are for a
special class of algorithms, our bounds apply to all geocasting algorithms and are tighter.
Then, we present a distributed solution for the two-dimensional mobility model, which is proved to be asymptotically optimal in terms of time complexity: with n nodes in the system, our solution takes 3nT time to ensure that the geocast information is reliably delivered by all the eligible nodes. Additionally, we prove that Ω(nT) is a time complexity lower bound for a geocast algorithm to ensure deterministic reliable delivery. Finally, we provide a distributed geocast algorithm for the one-dimensional case (i.e., nodes move on a line) and upper bound its message and time complexity. This algorithm is more efficient in terms of message complexity than the algorithms proposed in [1], and (not surprisingly) than the algorithm for two dimensions.
Related work. Initially introduced by Imielinski and Navas [14] for the Internet, the
geocast problem was then proposed for mobile ad-hoc networks by Ko and Vaidya
[7]. The majority of geocast algorithms presented in the literature for mobile ad-hoc
networks provide probabilistic guarantees, e.g., [8, 12, 9]. See the review of Jiang and Camp [4] for an overview of the main existing geocast algorithms. As mentioned above,
Baldoni et al. [1] provide a deterministic solution for the case where nodes move on a
line.
Other deterministic solutions for multicast and broadcast in MANETs have been proposed, but their correctness relies on strong synchronization or stability assumptions. In particular, Mohsin et al. [13] present a deterministic solution to solve broadcast in one-dimensional mobile ad-hoc networks. They assume that nodes move on a linear grid,
that nodes know their current position on the grid, and that communication happens in
synchronous rounds. Gupta and Srimani [10], and Pagani and Rossi [16] provide two
deterministic multicast solutions for MANET, but they require the network topology to
globally stabilize for long enough periods to ensure delivery. Moreover, they assume a
fixed and finite number of nodes arranged in some logical or physical structure.
Few bounds on deterministic communication in MANETs have been provided [15, 2]. We prove that the time complexity lower bound to complete a geocast in the plane is Ω(nT). Interestingly, Prakash et al. [15] provide a lower bound of Ω(n) rounds for the completion time of broadcast in mobile ad-hoc networks, where n is the number of nodes in the network. As the authors point out, they consider grid-based networks, but a lower bound proved for this restricted grid mobility model automatically applies to more general mobility models. This latter result improves the Ω(D log n) bound provided by Bruschi and Pinto [2], where D is the diameter of the network. These results unveil the fact that, when nodes may move, the dominating factor in the complexity of an algorithm is the number of nodes in the network and not its diameter.
Road map. In Section 2 we present the model for mobile ad hoc network we consider
and in Section 3 we revise the problem. Then, in Section 4 we present the results for
two dimensions and in Section 5 the results for one dimension. Finally, our conclusions
are presented in Section 6.
Local broadcast primitive. To directly communicate with their neighbors, nodes are
provided with a local reliable broadcast primitive. Communication is not instantaneous,
it takes some time for a broadcast message to be received. To simplify the presentation
we consider as time unit the time slot. This is the time the communication of a message
takes when accessing the underlying wireless network communication channel without
collision. Additionally, we assume that local computation at the nodes takes negligible
time (zero time for the purpose of the analyses). Since collisions are a intrinsic characteristic of MANETs, they have to be considered. We assume that the potential collisions
due to concurrent broadcasts by neighbors are dealt by a lower level communication
layer, and that this layer takes T units of time to (reliably and deterministically) deliver
a message to its destination. The value of T could be related to the size of the system
and depends on the complexity of the lower level communication protocol. As already
stated, [6] shows that it takes at least (log n) time to ensure a one-hop successful
transmission in a network with n nodes.
Then, if a node p invokes broadcast(m) at time t, all nodes that remain neighbors of p throughout [t, t + T] receive m by time t + T, for some fixed known integer T > 0. A node that receives a message m generates a receive(m) event. It is possible that some node that has been a neighbor of p at some time in [t, t + T] (but not during the whole period) also receives m, but there is no such guarantee. However, no node receives m after time t + T. A node issues a new invocation of the broadcast primitive only after it has completed the previous one (T time later). Then, in each time interval of length T, a node broadcasts at most one message.
Connectivity. Baldoni et al. [1] have proved that traditional connectivity is too weak to implement a deterministic geocasting primitive in the model described. To overcome this impossibility result, they introduced the notion of strong connectivity and assumed it in their model. Like them, we also assume strong connectivity in our model. Let us remark that strong connectivity is only one possible way to overcome the above impossibility. A different approach could be to constrain the mobility pattern of nodes (e.g., [13]) or to assume that the global communication topology is stable long enough to ensure reliable delivery (e.g., [10]). On the other hand, strong connectivity is a local property which helps to formalize the local stability in the communication topology necessary to solve the problem. In the following, we revise the notions of strong neighborhood and strong connectivity.
Definition 1 (Strong neighborhood). Let δ₂ = r and δ₁ be fixed positive real numbers such that δ₁ < δ₂. Two nodes p and p′ are strong neighbors at some time t if there is a time t′ ≤ t such that distance(p, p′, t′) ≤ δ₁, and distance(p, p′, t″) < δ₂ for all t″ ∈ [t′, t].
Assumption 1 (Strong Connectivity). For every pair of nodes p and p′, and every time t, there is at least one path of strong neighbors connecting p and p′ at t.
When convenient, we may say that a pair of (strong) neighbors have a (strong) connection, or are (strongly) connected. Observe that once two nodes p and p′ become strong neighbors (i.e., they are at distance δ₁ from each other), to get disconnected they must move away from each other so that their distance is at least δ₂. This means that the total distance to be covered in order for p and p′ to disconnect is δ₂ - δ₁. We use the notation Δ = (δ₂ - δ₁)/2, where Δ denotes the minimum distance that any two nodes that just became strong neighbors have to travel to stop being neighbors when moving in opposite directions. Thus, for a clearer presentation of our results, we express the maximum speed of nodes, denoted vmax, as the ratio between Δ and the time necessary to travel this distance, denoted Tδ. Formally,

Assumption 2 (Movement Speed). It takes at least Tδ > 0 time for a node to travel distance Δ = (δ₂ - δ₁)/2, i.e., vmax = (δ₂ - δ₁)/(2Tδ).
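In concrete numbers, with hypothetical parameters r = δ₂ = 10, δ₁ = 4 and Tδ = 3, the definitions above give:

```python
# Hypothetical parameters illustrating Assumption 2.
delta2 = 10.0   # communication radius r
delta1 = 4.0    # strong-neighbor radius, delta1 < delta2
T_delta = 3.0   # minimum time to travel Delta

Delta = (delta2 - delta1) / 2   # distance to break a fresh strong connection
v_max = Delta / T_delta         # maximum node speed
print(Delta, v_max)             # Delta = 3.0, v_max = 1.0
```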
Since nodes move, the topology of the network may continuously change. In this sense,
assuming both strong connectivity and an upper bound on the maximum speed of nodes
provides some topological stability in the network. In particular, it ensures that there are
periods in which the neighborhood of a node remains stable. Formally,
Lemma 1. If two nodes become strong neighbors at time t, then they are neighbors throughout the interval (t - Tδ, t + Tδ) and remain strong neighbors throughout the interval [t, t + Tδ).
Proof. If p and q become strong neighbors at time t, then distance(p, q, t) = δ₁. To be disconnected, they must move away from each other a distance of at least 2Δ, so that their distance is at least δ₂. From Assumption 2, this takes at least Tδ time. Hence, for every τ ∈ (t - Tδ, t + Tδ), distance(p, q, τ) < δ₁ + 2Δ = δ₂, which combined with Definition 1 proves the claims.
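The bound in the proof can be checked numerically: two nodes starting at distance δ₁ and separating at speed vmax each stay below δ₂ for any |τ - t| < Tδ (hypothetical parameter values, same as in the earlier example).

```python
# Numeric check of Lemma 1 with hypothetical parameters.
delta2, delta1, T_delta = 10.0, 4.0, 3.0
Delta = (delta2 - delta1) / 2
v_max = Delta / T_delta

def worst_case_distance(tau):
    # Both nodes move apart at v_max, starting from distance delta1 at tau = 0.
    return delta1 + 2 * v_max * abs(tau)

# Sample the open interval (-T_delta, T_delta): the distance stays below delta2.
samples = [i * T_delta / 100 for i in range(-99, 100)]
assert all(worst_case_distance(tau) < delta2 for tau in samples)
# At |tau| = T_delta the bound is reached exactly: delta1 + 2*Delta = delta2.
assert worst_case_distance(T_delta) == delta2
```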
Property 2. [Termination] If no other node issues a call to the geocast service, then there is a positive integer C such that after time t + C, no node performs any communication (i.e., a local broadcast) triggered by a geocast.

Property 3. [Integrity] There is a d′ ≥ d such that, if a node has never been within distance d′ from l, it never delivers I.

Observe that these properties are deterministic. This justifies the use of a deterministic reliable local broadcast primitive and the fact that we require nodes to be in range less than δ₂ during T steps to complete a successful communication.
[Fig. 1. (a) State at time t and t0′; (b) state after the nodes have moved distance Δ. Nodes s, p, q, x1, x2, x3 at mutual distances δ₂; locations l, l1, l2. Figure not reproduced.]
not a neighbor of x1, x2, or x3. Additionally, both p and q are in the geocast region (since d ≥ δ₂). They remain in the region during the whole execution, and hence, to satisfy reliable delivery, they should deliver I.
We consider several possible behaviors of the geocast algorithm. Let us first assume that, although it invoked Geocast(I, d), s never makes a call to broadcast(I). Then p and q will never receive nor deliver I, and reliable delivery is violated. Otherwise, assume that, as a consequence of the Geocast(I, d) invocation, s invokes broadcast(I) at times t0, t1, ..., with t_{i+1} ≥ t_i + T. Let us define first the behavior of the nodes in the interval [t0, t1]. At time t0, the source s and node q start moving towards each other at the maximum speed vmax, while nodes p, x1, x2, and x3 start moving in the same direction as q. At time t0′ = t0 + Tδ ≤ t1 all nodes have travelled a distance of Δ (by definition of Tδ), and the system is in the state depicted in Figure 1(b). In the interval [t0′, t1] no node moves.

Observe that strong connectivity has been preserved during the whole period [t0, t0′], since the distances along the path q, x1, x2, x3, p did not change, and the source is a strong neighbor of p for all of [t0, t0′) and at time t0′ it becomes a strong neighbor of q. Note also that neither p nor q has been a neighbor of s during the whole period [t0, t0 + T], because q is not a neighbor at time t0 and p is not a neighbor at time t0′ ≤ t0 + T. Hence, in our execution no node delivers I in [t0, t1].
The behavior in interval [t1, t2] is the same as described for [t0, t1], but swapping the directions of movement and the roles of p and q. The initial state at time t1 is the one shown in Figure 1(b), and the final state reached at time t1′ = t1 + Tδ is the one shown in Figure 1(a). Again, I is delivered at neither p nor q, because they have not been neighbors of s during the whole period [t1, t1 + T]. For any interval [t_i, t_{i+1}] the behavior is the same as in interval [t0, t1] if i is even, and the same as in interval [t1, t2] if i is odd. Then, in this scenario only s delivers I and the reliable delivery property is not satisfied.
The above theorem gives a lower bound of Tδ > T for the geocast problem to be solvable. We now show that this bound is tight by presenting an algorithm that always solves the problem if Tδ > T. The algorithm belongs to the class of algorithms presented in Figure 2, which has a configuration parameter M, the bound on time that the algorithm uses to stop the geocast. The algorithm M-Geocast(I, d) works as follows. When the source node invokes a call Geocast(I, d), it immediately delivers the information I (Line 8). Then, it broadcasts I and stores in a local variable T_LB the time this first transmission happened (Lines 10-11), in order to retransmit every T units of time (Lines 13-14). When a node p receives a message with information I for the first time, it immediately delivers it and starts rebroadcasting the information periodically (Lines 2-6). Together with the information I, the algorithm broadcasts a value count_I, which contains an estimate of the time that has passed since the geocast started. This value, combined with the parameter M, is used to terminate the algorithm.
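Since Figure 2 is not reproduced here, the sketch below follows the prose description only; the class structure, the synchronous tick loop, and details such as exactly when count_I is incremented are our assumptions, not the paper's pseudocode.

```python
# Sketch of a node running M-Geocast(I, d), following the prose above
# (not the paper's Figure 2). Discrete time; a broadcast performed at
# time t is delivered to current neighbors by time t + T.

class Node:
    def __init__(self, T, M):
        self.T, self.M = T, M
        self.delivered = False
        self.count = None       # count_I: estimate of time since the geocast
        self.t_lb = None        # T_LB: time of the last broadcast
        self.outbox = []        # (send_time, counter) of broadcasts performed

    def geocast(self, now):
        # Source side: deliver I immediately, then broadcast with counter 0.
        self.delivered = True
        self.count = 0
        self._send(now)

    def receive(self, now, counter):
        # First receipt of I: deliver, set counter, start rebroadcasting.
        if not self.delivered:
            self.delivered = True
            self.count = counter + 1
            self._send(now)

    def tick(self, now):
        # Every T time units, increment count_I by T and rebroadcast,
        # as long as count_I < M (the termination condition).
        if (self.delivered and self.count < self.M
                and self.t_lb is not None and now == self.t_lb + self.T):
            self.count += self.T
            self._send(now)

    def _send(self, now):
        self.t_lb = now
        self.outbox.append((now, self.count))
```

A small driver that relays each broadcast to chain neighbors after T time units suffices to exercise the sketch.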
We now show that the algorithm M-Geocast(I, d) solves the geocast problem in two dimensions for an appropriate value of M. Let us denote by S the set of nodes that have already delivered the information I, and by S(t) the set S at time t. Let us denote by t_i the time at which the set S increases from size i to i + 1. Note that t0 is the time the geocast starts.
[Fig. 2. The M-Geocast(I, d) algorithm. Pseudocode not reproduced; the Lines referenced in the text refer to this figure.]
Lemma 2. If Tδ > T and count_I < M at all nodes during the time interval [t0, t_{n-1}], then t_{i+1} - t_i ≤ 3T for every i ∈ {0, ..., n - 2}.
Proof. Since strong connectivity holds, at any time there must be chains of strong neighbors connecting any two nodes in the system. In particular, at every time t0 < t < t_{n-1} (i.e., such that neither S(t) nor its complement is empty) there must exist at least one pair of neighbors q and p such that q ∈ S(t) and p ∉ S(t). Let C(t) denote the set of all such pairs.
Let us fix an i ∈ {0, ..., n - 2} and assume, for contradiction, that t_{i+1} - t_i > 3T. Consider first the case in which some pair (q, p) ∈ C(t_i) belongs to C(t′) for all t′ ∈ [t_i, t_i + 2T]. In other words, this pair is formed by a node q that has I at t_i and a node p that does not, and they are neighbors for at least 2T time. By the M-Geocast(I, d) algorithm and the fact that count_I < M during the time interval [t0, t_{n-1}], a node having the information I rebroadcasts it once every T time. Hence q will rebroadcast the information I at some time t ∈ [t_i, t_i + T], and thus p will receive and deliver it by time t + T ≤ t_i + 2T, contradicting t_{i+1} - t_i > 3T.
Otherwise, all the connections in C(t_i) have been broken by some time t′ ∈ (t_i, t_i + 2T]. But, for strong connectivity to hold, a strong connection has to exist between some node q ∈ S(t′) and a node p ∉ S(t′), since otherwise these two subsets would be disconnected at time t′. Let t″, with t_i < t″ ≤ t′, be the time at which q and p become strong neighbors, i.e., come within distance δ₁ of each other. The claim follows if q ∈ S(t_i), since q ∈ S(t′) and t_i < t″ ≤ t′ ≤ t_i + 2T. Otherwise, note that, by Lemma 1 and the fact that Tδ > T, q and p are neighbors throughout the period [t″ - T, t″ + T]. Then, since q ∈ S(t″) and t_i < t″, q will broadcast I once in the period [t″ - T, t″], and p will deliver I by time t″ + T > t″ > t_i. Given that t″ ≤ t_i + 2T, p will deliver the information I by time t_i + 3T, and the claim holds.
Let us now relate the value of count_I at each node to the time that has passed since M-Geocast(I, d) was invoked. Let count_I(q, t) be the value of the variable count_I of node q at time t. Let us define a propagation sequence as a sequence of nodes s = p0, p1, p2, ..., pk = q such that the first message received by p_i with information I was sent by p_{i-1}. Node s = p0 is the source of the geocast.
Lemma 3. Let t0 be the time at which M-Geocast(I, d) is invoked at source s. Given a node q with propagation sequence s = p0, p1, p2, ..., pk = q and a time t ≥ t0 at which q has delivered I, with count_I(q, t) ≤ M, it is satisfied that ((t - t0) - count_I(q, t)) ∈ [0, k(T - 1) + T].
Proof. We prove by induction on k that, at any time t ≥ t0, it is satisfied that ((t - t0) - count_I(p_k, t)) ∈ [0, k(T - 1) + T], and that if a message is sent at time t it carries a counter c(p_k, t) such that ((t - t0) - c(p_k, t)) ∈ [0, k(T - 1)]. The base case is the source node s = p0. At time t0 the source sets count_I(s, t0) = 0 (Line 9), and then, as long as count_I < M, it increments count_I by T every T time (Line 12). Hence, at time t = t0 + λ we have ((t - t0) - count_I(s, t)) = 0 if λ is a multiple of T, and ((t - t0) - count_I(s, t)) > 0 otherwise. Furthermore, the difference (t - t0) - count_I is always smaller than T. Since messages are broadcast at times t = t0 + λ with λ a multiple of T, the values c(s, t) carried by the messages sent by the source satisfy ((t - t0) - c(s, t)) = 0.
Let us assume now by induction that, if p_{i-1} broadcasts a message at time t ≥ t0, this message carries a value c(p_{i-1}, t) such that ((t - t0) - c(p_{i-1}, t)) ∈ [0, (i - 1)(T - 1)]. If p_i receives I for the first time at t′, and the corresponding message was sent by p_{i-1} at time t, then p_i sets count_I(p_i, t′) = c(p_{i-1}, t) + 1 (Line 4). This message took between 1 and T time units to be received at time t′ = t + α; hence, α ∈ [1, T]. Considering one extreme case, if ((t - t0) - c(p_{i-1}, t)) = 0 and α = 1, then ((t′ - t0) - count_I(p_i, t′)) = 0. At the other extreme, if ((t - t0) - c(p_{i-1}, t)) = (i - 1)(T - 1) and α = T, then ((t′ - t0) - count_I(p_i, t′)) = i(T - 1). Therefore, ((t′ - t0) - count_I(p_i, t′)) ∈ [0, i(T - 1)]. Like the source, p_i increments count_I by T every T time as long as count_I < M (Line 12). Hence, at any time t″ = t′ + λ we have ((t″ - t0) - count_I(p_i, t″)) ∈ [0, i(T - 1)] if λ is a multiple of T. Otherwise, this difference increases by up to T before the next increment, and hence ((t″ - t0) - count_I(p_i, t″)) ∈ [0, i(T - 1) + T]. Since messages are broadcast by p_i at times t″ = t′ + λ with λ a multiple of T, the values c(p_i, t″) carried by the messages sent by p_i satisfy ((t″ - t0) - c(p_i, t″)) ∈ [0, i(T - 1)].
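The receive-time drift bound in this proof can be checked by direct simulation of a propagation sequence: with per-hop latencies α ∈ [1, T] and the counter incremented by one per hop, the drift at the i-th hop stays within [0, i(T - 1)] (illustrative sketch).

```python
import random

def check_drift(T, k, trials=200, seed=1):
    """Simulate random per-hop latencies and check the drift bound."""
    rng = random.Random(seed)
    for _ in range(trials):
        t, c = 0, 0                     # source sends at t0 = 0 with counter 0
        for hop in range(1, k + 1):
            alpha = rng.randint(1, T)   # per-hop latency alpha in [1, T]
            t += alpha                  # receive time at node p_hop
            c += 1                      # p_hop sets count_I = c + 1 (Line 4)
            # receive-time drift bound from the proof:
            assert 0 <= (t - c) <= hop * (T - 1)
    return True

assert check_drift(T=4, k=6)
```

Sending only at multiples of T afterwards does not increase the drift carried by messages, which is the other half of the induction.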
This lemma can be used to prove the following theorem, which shows that the geocast problem can be solved in two dimensions as long as Tδ > T.

Theorem 4. If Tδ > T, the M-Geocast(I, d) algorithm with M = 3T(n - 1) ensures
(1) the Reliable Delivery Property 1 for C = 3T(n - 1),
(2) the Termination Property 2 for C′ = 3T(n - 1) + (n - 1)(T - 1) + T, and
(3) the Integrity Property 3 for d′ = max(d, 3T(n - 1)·vmax + (n - 1)·δ₂).
Proof. The first part of the claim is a direct consequence of Lemma 2, which proves that at most 3T(n - 1) ≥ t_{n-1} - t0 time after Geocast(I, d) is invoked, all nodes have delivered the information I. The second part of the claim follows from Lemma 3, using the fact that no propagation sequence has more than n nodes (hence taking k = n - 1), combined with the first part of the claim. The third claim is also a direct consequence of Lemma 2, since the information can be carried by nodes at most distance 3T(n - 1)·vmax in time 3T(n - 1) from the initial location of the source, and travels less than (n - 1)·δ₂ in the n - 1 broadcasts that inform new nodes.
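For illustration, with the hypothetical values n = 10, T = 4, δ₂ = 10, δ₁ = 4 and Tδ = 5 (so that Tδ > T), the constants of Theorem 4 evaluate to:

```python
# Hypothetical parameters; evaluates the constants of Theorem 4.
n, T = 10, 4
delta2, delta1, T_delta = 10.0, 4.0, 5.0
v_max = (delta2 - delta1) / (2 * T_delta)            # = 0.6

M = 3 * T * (n - 1)                                  # = 108
C = 3 * T * (n - 1)                                  # reliable delivery bound
C_term = 3 * T * (n - 1) + (n - 1) * (T - 1) + T     # termination bound = 139
d = 50.0
d_prime = max(d, 3 * T * (n - 1) * v_max + (n - 1) * delta2)  # integrity radius
print(M, C, C_term, d_prime)
```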
Finally, we show that the time bound obtained for the M-Geocast(I, d) algorithm with M = 3T(n - 1) is in fact asymptotically optimal, since there are cases in which any geocast algorithm requires Ω(nT) time to complete.
[Fig. 3. (a) Initial configuration and (b) configuration after the chain has moved: the source s = q0 at x_l; the chain q1, q2, q3, ..., arranged in a snake shape with consecutive nodes at distance δ₂; and node p at x_p = x_l + 2√2·δ₂, next to the nodes a0, a1, a2, .... Figure not reproduced.]
Theorem 5. Any deterministic Geocast(I, d) algorithm that ensures the Reliable Delivery Property in two dimensions requires time Ω(nT) to complete, for each d ≥ 3√2·δ₂.

Proof. We present a scenario (shown in Figure 3) in which, for any Geocast(I, d) call with d ≥ 3√2·δ₂, all nodes that are in the geocast region require time (n - 1)T to deliver I.
Consider the scenario depicted in Figure 3(a), where Geocast(I, d) is invoked at time t0 at the source node s while at location l = (x_l, y_l) and with only one neighbor, at distance δ₂. The rest of the nodes, except p, form a chain in which each node is a neighbor of its predecessor and successor in the chain. The chain forms a snake shape. Node p is a node that is within distance d of l, at the same level (y coordinate) as the last node in the chain (r in Figure 3(b)), and connected with some node in the chain as shown. Specifically, p is located at some position (x_p, y_p), where x_p equals x_l + 2√2·δ₂ and y_p is the same as the y coordinate of node r. Assume that nodes reach this configuration having previously been at distance δ₁ from each other. Thus, at time t0 strong connectivity holds. We usually refer to the location of a node by its x coordinate only, because it is the one of interest.
The number of nodes between any pair of nodes q_i and q_{i+1}, as depicted in Figure 3, is fixed to k = ⌈δ₂/(T·vmax)⌉. In this execution, at time t0 all the nodes, except node p, start to move towards the left at the same speed v = δ₂/(T(k + 1)) ≤ vmax. These values are chosen so that, in the execution we construct, q_i is at position x_l (see Figure 3(b)) at time t0 + iT(k + 1). When the last node of the chain is at distance δ₂ to the left of p, the latter also starts moving to the left at the same speed. In our execution, we assume that a node that receives the information I immediately rebroadcasts it. In any other case the execution can be easily adapted by stopping the movement while I is not rebroadcast.
Then, in the execution, s broadcasts I at time t0. We assume that each node in the chain receives I from its predecessor after T units of time. Then, q_i first receives I at time t0 + iT(k + 1). Since at that time q_i is at x_l, no progress has been made to the right.
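Under this reconstruction, i.e., k = ⌈δ₂/(T·vmax)⌉ and v = δ₂/(T(k + 1)), a quick check with hypothetical numbers confirms that v ≤ vmax and that a node starting at x_l + i·δ₂ and moving at speed v reaches x_l exactly at time t0 + iT(k + 1):

```python
import math

# Hypothetical parameters; k and v as reconstructed in the proof above.
delta2, T, v_max = 10.0, 4.0, 0.6
k = math.ceil(delta2 / (T * v_max))   # nodes between q_i and q_{i+1}
v = delta2 / (T * (k + 1))            # common leftward speed

assert v <= v_max
# A node starting at x_l + i*delta2 reaches x_l at time t0 + i*T*(k+1):
for i in range(1, 6):
    travel_time = (i * delta2) / v
    assert abs(travel_time - i * T * (k + 1)) < 1e-9
```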
Hence, the geocast problem is solved only when all nodes in the chain have received I and p has received I from the last node in the chain. Since this implies n - 1 transmissions, and each takes T time, the total time to provide reliable delivery is (n - 1)T. Finally, for all nodes except p, strong connectivity holds throughout the execution because they do not change their neighbourhood. At time t0 strong connectivity holds for node p because it is at distance δ₁ from a node a0 on its right. It is easy to see that strong connectivity holds throughout the execution since, due to the movement pattern and speed, p stops being a strong neighbour of node a_i only after it has already become a strong neighbour of node a_{i+1}, for i ≥ 0 (see Figure 3(a)). The
claim holds because p is at most at distance √((Tδ·vmax + 2√2·δ₂)² + (2δ₂)²) < 3√2·δ₂ from (x_l, y_l).
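As a numeric sanity check of this distance bound as reconstructed above, using Tδ·vmax = Δ < δ₂/2:

```python
import math

# Check sqrt((Delta + 2*sqrt(2)*delta2)^2 + (2*delta2)^2) < 3*sqrt(2)*delta2
# for every admissible Delta < delta2/2 (reconstruction; illustrative values).
delta2 = 1.0
for Delta in [0.05, 0.2, 0.49]:
    dist = math.sqrt((Delta + 2 * math.sqrt(2) * delta2) ** 2
                     + (2 * delta2) ** 2)
    assert dist < 3 * math.sqrt(2) * delta2
```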
The bound of the above theorem depends on n. If n is finite, this bound is finite. However, in a system with a potentially infinite number of nodes, the geocast problem may never be solved.

Corollary 1. No deterministic Geocast(I, d) algorithm ensures the reliable delivery property in a system with infinitely many nodes, for d ≥ 3√2·δ₂.
[Figure: state s_pq at time t_i, the state at some time in (t_i, t_{i+1}), and state s_qp at time t_{i+1}, for nodes s, p, q at distances δ₁, δ₂ and locations l1, l2. Figure not reproduced.]
Observe that strong connectivity has been preserved during the whole period [t0, t0′]: p and q never stop being strong neighbors, and the source is a strong neighbor of p for all of [t0, t0′) and at time t0′ it becomes a strong neighbor of q. Note that neither p nor q has been a neighbor of s during the whole period [t0, t0 + T], because q is not a neighbor at time t0 and p is not a neighbor at time t0′ ≤ t0 + T. Hence, in our execution no node delivers I in [t0, t1].
The behavior in interval [t1, t2] is the same as described, but exchanging the roles of p and q: the initial state is s_qp, at time t1 they start moving to exchange positions, and at time t1′ they end up in state s_pq. Again, I is delivered at neither p nor q, because they have not been neighbors of s during the whole period [t1, t1 + T]. For any interval [t_i, t_{i+1}] the behavior is the same as in interval [t0, t1] if i is even, and the same as in interval [t1, t2] if i is odd. Then, in this execution only s delivers I and the reliable delivery property is not fulfilled.
This result shows that, in order to solve the geocast problem, it must hold that Tδ > T/2. In the following we prove that, when all nodes move along the same line, the algorithm presented in Figure 2, for an appropriate value of M, efficiently solves the problem as long as Tδ > T and δ₁ > Δ. Let us first introduce some preliminary observations and lemmata which are instrumental for the proof of the main Theorem 13.
Assume that the source s = q0 initiates a call of M-Geocast(I, d) at time t = t0 from location l = l0. Next, we prove that I propagates from l0 towards the right of l0. (For the left of l0, the proof is symmetrical.) This happens in steps, so that within a small period of time I moves from a node q_j to another node q_{j+1} at some large distance away.
Observation 7. Let p be a node that receives information I at time t. Then either p immediately rebroadcasts I, or there exists a time τ ∈ [t, t + T] such that p broadcasts I both at time τ - T and at time τ.
Observation 8. If Tδ > T, a node travels a distance of at most Δ·T/Tδ < Δ in time T.
Proof. Assume that at time t_j a node q_j broadcasts the information I, being located at l_j. One of the following two cases holds:

At time t_j + T, there is at least one node p located in the interval [l_j + δ₁, l_j + δ₁ + Δ]. Then, by Lemma 5, p will deliver the information I by time t_j + T. By Observation 7, p will broadcast I at some time t_{j+1} ∈ [t_j, t_j + T]. By Observation 8 and its position at time t_j + T, p will rebroadcast I at some time t_{j+1} ≤ t_j + T, being at location l_{j+1} ≥ l_j + δ₁ - Δ·T/Tδ. The claim holds with q_{j+1} = p and q = q_j.
At time t_j + T, no node is located in the interval [l_j + δ₁, l_j + δ₁ + Δ]. Let L and L′ respectively denote the set of nodes that at time t_j + T are located to the left of l_j + δ₁, and the set of those that at time t_j + T are located to the right of l_j + δ₁ + Δ. If L′ = ∅, then all the nodes that at time t_j were to the right of l_j + δ₁ are within distance δ₁ + Δ of l_j at time t_j + T. By Lemma 5, these nodes deliver the information I by time t_j + T. Otherwise, there must exist paths of strong neighbors from nodes in L′ to nodes to the left of l_j + δ₁. In particular, nodes in L′ can be connected only with nodes in L that are at most at distance Δ to the left of l_j. These latter have delivered the information I by time t_j + T. One of the following cases has to hold:
1. There exists at least one connection between a node p in L′ and a node q in L which lasts throughout [t_j, t_j + 2T]. Then p will deliver the information I at some time t ∈ [t_j, t_j + 2T]. Note that at time t_j + T, p is to the right of l_j + δ₁ + Δ and, since Tδ > T, it remains on or to the right of l_j + δ₁ + Δ - Δ·T/Tδ > l_j + δ₁ - Δ·T/Tδ throughout the period [t_j + T, t_j + 2T]. Then, by Observation 7, p will broadcast the information I at some time t_{j+1} ∈ [t_j + T, t_j + 2T], being located at some position l_{j+1} > l_j + δ₁ - Δ·T/Tδ. The claim holds with q_{j+1} = p and by the fact that p and q are neighbors throughout [τ, t_{j+1}] ⊆ [t_j, t_j + 2T], where τ = min{t_{j+1} - T, t_j}.
2. Each connection between nodes in L′ and nodes in L breaks at some time in [t_j, t_j + 2T]. Then a new strong connection has to be created at some time t ∈ [t_j, t_j + 2T] before all such connections break; otherwise strong connectivity is violated.
Let p and q be, respectively, the node in L′ and the node in L that create the new strong connection at time t, i.e., distance(p, q, t) ≤ δ₁. By Lemma 4, p and q have been neighbors throughout [t - T, t + T]. If t ∈ [t_j, t_j + T], then [t_j, t_j + T] ⊆ [t - T, t + T], and all such connections have to break at some point in [t_j + T, t_j + 2T], since otherwise there would exist at least one connection between a node in L′ and a node in L lasting throughout the whole period [t_j, t_j + 2T], and we would reach a contradiction.
Then, a new connection between a node p in L′ and a node q in L has to be created at some time t ∈ [t_j + T, t_j + 2T]. At time t, distance(p, q, t) ≤ δ₁ and, since in 2T time a node can travel at most a distance 2Δ·T/Tδ, at time t_j q was to the right of l_j - 2Δ. Thus q delivers the information I by time t_j + T, and q broadcasts I both at time τ and at time τ′ = τ + T, with τ ∈ [t_j, t_j + T]. By Lemma 1, p and q are neighbors throughout the period [t - T, t + T], with t ∈ [t_j + T, t_j + 2T]. Either τ or τ′ is in the interval [t - T, t + T]; then p delivers the information I by time t + T ≤ t_j + 3T. Then, either p immediately broadcasts I or it broadcasted
[l_j, l_{j+1}], and there does not exist t′ ∈ [t_j, t_{j+1}] with t′ > t such that p is to the right of l_{j+1} at t′.
If t ∈ [t_{j+1} - T, t_{j+1}], the claim follows by Lemma 8 and Observation 9. Then, consider t ∈ [t_j, t_{j+1} - T). Since [t_j, t_{j+1} - T) ⊆ [t_j, t_j + 2T), by Observation 8, at time t_{j+1} - T, p is to the right of l_{j+1} - 2Δ. At the same time t_{j+1} - T, q_{j+1} is at most within distance Δ·T/Tδ of l_{j+1}, since it has to be located at l_{j+1} at time t_{j+1}. Then, since 3Δ < δ₂, at time t_{j+1} - T, p and q_{j+1} are neighbors. Moreover, p and q_{j+1} remain neighbors throughout the period [t_{j+1} - T, t_{j+1}], because at time t_{j+1} p is at most within distance 3Δ of l_{j+1}, due to Lemma 6(1) and Observation 8.
By the third bullet of Lemma 6, either q_{j+1} receives I at time t_{j+1} because a node q that received the information directly from q_j invoked broadcast(I) at some time τ ∈ [t_{j+1} - T, t_{j+1}], or q_{j+1} itself invoked broadcast(I) at time t_{j+1} - T. In this last case, the claim holds because p and q_{j+1} are neighbors throughout the period [t_{j+1} - T, t_{j+1}] and because of Observation 9. Then, consider the case where q_{j+1} receives I at time t_{j+1} because a node q invoked broadcast(I) at some time τ ∈ [t_{j+1} - T, t_{j+1}].
If at time t_{j+1} + T node p is within distance δ₁ + Δ of l_{j+1}, then by Lemma 5 p delivers the message by time t_{j+1} + T. Otherwise, the location of p at time t_{j+1} + T is to the left of l_{j+1}. This implies that the location of p at time t_{j+1} is less than or equal to l_{j+1} + Δ·T/Tδ. Then, at time t_{j+1}, p and q are neighbors, because distance(q, q_{j+1}, t_{j+1}) < δ₂ and distance(p, q_{j+1}, t_{j+1}) ≤ Δ·T/Tδ.
Note that at time t_j, p is to the right of l_j. Then, by Observation 10, either p delivers the information by time t_j + T, or at some point t′ ∈ [t_j, t_j + T] p is located to the right of q. Note that q broadcasts the information once in each time interval [t_j + kT, t_j + (k + 1)T], with k ∈ {0, ..., 3}. So either there is a time in [t_j, t_{j+1}] at which p and q are strong neighbors, and then p delivers the information by time t_{j+1} + T, or at time t_{j+1} q is to the left of p and the latter is to the left of q_{j+1}. Then p will deliver the information I by time t_{j+1} + 2T because of a call of broadcast either at q_{j+1} or at q. This is because either p remains a neighbor of q or of q_{j+1} throughout the interval [t_{j+1} - T, t_{j+1} + T], or at time t_{j+1} p and q are within distance greater than δ₁ of each other and they move towards, or in the same direction as, q. So they do not disconnect for at least another 2T.
Lemma 10. Let p be a node that at some time t ∈ [t_j, t_{j+1}] is at some location l_p ∈ [l_j, l_{j+1}]. If there does not exist a time t′ ∈ [t_j, t_{j+1}] with t′ > t such that p is to the right of l_{j+1}, then p delivers the information I by time t_{j+1} + 2T.
Proof. Let p be a node that at some time t ∈ [t_j, t_{j+1}] is located at l_p ∈ [l_j, l_{j+1}], and assume that there does not exist a time t′ ∈ [t_j, t_{j+1}] with t′ > t such that p is to the right of l_{j+1}. If at time t_j p is either to the left of l_j or to the right of l_{j+1}, then the claim follows, respectively, by Lemma 7 and Observation 9, or by Lemma 9. Finally, consider the case where node p is inside the interval [l_j, l_{j+1}] throughout the whole interval [t_j, t_{j+1}]. We prove that p delivers the information I by time t_{j+1} + T. If at time t_j + T p is within distance δ₁ + Δ of l_j, p delivers the information I by time t_j + T, because of Lemma 5. Then assume that p is located in the interval [l_j + δ₁ + Δ, l_{j+1}] at time t_j + T. At that time q is located to the left of location l_j + δ₁ + Δ. At some time τ ∈ [t_{j+1} - T, t_{j+1}], q and q_{j+1} are neighbors, because of the third bullet of Lemma 6. At time t_{j+1} - T one of the following cases happens: (1) p is in between q and q_{j+1}; (2) p is to the right of both these nodes but to the left of l_{j+1}; or (3) p is to the left of both q and q_{j+1}. This means that p is a neighbor of q throughout [t_{j+1}, t_{j+1} + 2T], or a neighbor of q_{j+1} throughout [t_{j+1} - T, t_{j+1}]. Since q broadcasts I once in each time interval [t_j + kT, t_j + (k + 1)T] with k ∈ {0, ..., 3}, and q_{j+1} broadcasts at time t_{j+1}, p delivers I by time t_{j+1} + 2T and the claim holds.
We now prove that, if a node stays within distance d of the location where the geocast was invoked throughout the whole geocast period, then it is eventually inside one of the intervals between two consecutive broadcasts, at the right time and for long enough to deliver the information I.
Lemma 11. If a node q stays within distance d of l throughout [t0, t_{i+1}], for i such that l + d ∈ [l_i, l_{i+1}], then q delivers the information I by time t_{i+1} + 2T.
Proof. Let t0 be the time when the source node s performs the first broadcast(m) because of a call of M-Geocast(I, d). If q is located at l0 (= l) at time t0, then the lemma holds. Otherwise, without loss of generality, let q be located to the right of s at time t0. At every time in [t0, t_{i+1}], q is located on or to the left of l_{i+1}, because l + d ≤ l_{i+1}.
By induction on j, it is easy to see that there exists a j ≤ i such that at some time t ∈ [t_j, t_{j+1}] q is in the interval [l_j, l_{j+1}], and there does not exist a time t′ ∈ [t_j, t_{j+1}] with t′ > t such that q is to the right of l_{j+1}. Otherwise, at each time t_{j+1} q would be to the right of l_{j+1}, and for j = i we would have that at time t_{i+1} q is to the right of l_{i+1}. This would mean that at time t_{i+1} q is at distance greater than d from l, a contradiction. By Lemma 10, q will deliver the information I by time t_{i+1} + 2T.
Observation 11. Let count_I be the counter associated with the communication generated by a call of M-Geocast(I, d). count_I is set to zero only once, when the source invokes the first broadcast(I) at time t0, and it is never reset.

Observation 12. Let p be a node different from the source node. p invokes broadcast(I) at some time t only if it has generated a receive(I) event at some time before t.
Lemma 12. Let t be the time when a call of M-Geocast(I, d) is invoked. Every message broadcast or received at some time in [t, t + k] has counter at most equal to k.
Proof. The proof is by induction on k. For k = 0, we have to consider the time t. At that time only the source node invokes a broadcast(I), and the counter of the broadcast message has value 0 (Line 9 of Figure 2), so the claim holds. By inductive hypothesis, assume that every message broadcast or received at some time in [t, t + k] has counter at most equal to k. We then prove that every message broadcast or received at some time in [t, t + k + 1] has counter at most equal to k + 1. By the inductive hypothesis, a violation cannot happen by time t + k. Then, by contradiction, assume that there exists a message that is received at time t + k + 1 and whose counter has value greater than k + 1. Since it takes at least 1 time unit to receive a message, the message received at time t + k + 1 was broadcast at the latest at time t + k. But then the message was broadcast by time t + k with counter greater than k + 1, contradicting the inductive hypothesis.
Finally, consider the case where at time t + k + 1 a message m is broadcast by a node p. p possibly increments its counter each time it receives a message or when it broadcasts a message. But by time t + k all the messages received by p have counter smaller than or equal to k, and p may have broadcast at most k messages. So at time t + k the counter of p is at most k. Then, when it broadcasts a message at time t + k + 1, this message has counter at most k + 1, and the claim follows.
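Lemma 12 can be exercised on the event pattern used in its proof: with delivery latencies of at least one time unit and counters incremented by one per relay, no message seen by time t + k carries a counter above k (illustrative sketch; latency range is arbitrary).

```python
import random

def simulate_counters(k_max, trials=100, seed=7):
    """Relay a message along random-latency hops; check counter <= elapsed."""
    rng = random.Random(seed)
    for _ in range(trials):
        events = [(0, 0)]                  # (time, counter): source at t = 0
        frontier = [(0, 0)]
        while frontier:
            t, c = frontier.pop()
            if t >= k_max:
                continue
            latency = rng.randint(1, 3)    # >= 1 time unit to be received
            t2, c2 = t + latency, c + 1    # receiver rebroadcasts, counter + 1
            events.append((t2, c2))
            frontier.append((t2, c2))
        # Lemma 12: every counter is at most the elapsed time.
        assert all(c <= t for (t, c) in events)
    return True

assert simulate_counters(20)
```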
Finally, we establish the bound on the time needed to ensure the reliable delivery property and the termination property. From the latter, we obtain the bound for the integrity property.
Theorem 13. If Tδ > T, the M-Geocast(I, d) algorithm with M = 3T(i + 1) + 2T and i = ⌈d/(δ₁ - Δ·T/Tδ)⌉ ensures
(1) the Reliable Delivery Property 1 for C = 3T(i + 1) + 2T,
(2) the Termination Property 2 for C′ = (3T(i + 1) + 2T + 1)T, and
(3) the Integrity Property 3 for d′ = (C′ + T)(δ₂ + Δ/Tδ).

Proof. From Lemma 11, all the nodes that remain within distance d of l (= l0) throughout [t0, t_{i+1}] deliver I by time t_{i+1} + 2T = t + C. By Lemma 6, t_{i+1} - t ≤ 3T(i + 1), and C = t_{i+1} - t + 2T ≤ 3T(i + 1) + 2T. Then C ≤ 3T(⌈d/(δ₁ - Δ·T/Tδ)⌉ + 1) + 2T.
We finally have to prove that, during [t, t + C], count_I < M at every node, where M = C. This follows from Lemma 12.
We now prove (2). Every message received causes a rebroadcast of I in a message with counter incremented by at least one. This happens at least once every T time. Termination happens once every message received has counter larger than 3T(i + 1) + 2T, where i = ⌈d/(δ₁ - Δ·T/Tδ)⌉. This happens within (3T(i + 1) + 2T + 1)T + T time, because all messages broadcast after time (3T(i + 1) + 2T + 1)T have counters at least equal to 3T(i + 1) + 2T + 1, and all such messages are received within at most another T time. Note that, in the worst case, each broadcast message is received exactly T time after it is sent and then the counter is incremented by one unit, while in reality T steps have passed. Therefore, C′ = (3T(⌈d/(δ₁ - Δ·T/Tδ)⌉ + 1) + 2T + 1)T.
Finally, we prove (3). A broadcast message is received after at least one time unit, during which any node can traverse a distance of at most Δ/Tδ. Therefore, if a node broadcasts a message from location l′ at time t′, then its neighbors receive it at the earliest at time t′ + 1, when at distance less than δ₂ + Δ/Tδ from l′. Then, if the source starts M-Geocast(I, d) at time t from location l, at time t + m the furthest node that has delivered I is at distance less than m(δ₂ + Δ/Tδ) from l. By (2), after time t + C′ no node broadcasts messages with information I. Therefore, no node delivers I after time t + C′ + T. But at time t + C′ + T, all nodes that have delivered I are within distance less than (C′ + T)(δ₂ + Δ/Tδ) of l. Therefore, if a node remains further than d′ = (C′ + T)(δ₂ + Δ/Tδ) from l, it will never deliver I.
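Using the same hypothetical parameters as in the earlier examples (T = 4, Tδ = 5, δ₂ = 10, δ₁ = 4, hence Δ = 3), the constants of Theorem 13 for a radius d = 50 evaluate to:

```python
import math

# Hypothetical parameters; evaluates the constants of Theorem 13.
T, T_delta = 4.0, 5.0
delta2, delta1 = 10.0, 4.0
Delta = (delta2 - delta1) / 2                        # = 3.0
d = 50.0

# per-step rightward progress of I is delta1 - Delta*T/T_delta
i = math.ceil(d / (delta1 - Delta * T / T_delta))    # number of steps
M = 3 * T * (i + 1) + 2 * T                          # termination parameter
C = M                                                # reliable delivery bound
C_term = (3 * T * (i + 1) + 2 * T + 1) * T           # termination bound
d_prime = (C_term + T) * (delta2 + Delta / T_delta)  # integrity radius
print(i, M, C_term, d_prime)
```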
6 Conclusion
We have studied the geocast problem in mobile ad hoc networks. We have considered a set of n mobile nodes which move in a continuous manner with bounded maximum speed. We have addressed the question of how the speed of movement impacts the ability to provide a deterministic reliable geocast solution, assuming that it takes some time T to ensure a successful one-hop radio communication.
Our results improve and generalize the bounds presented in [1]. For the two-dimensional mobility model, we have presented a tight bound on the maximum speed of movement that preserves the solvability of geocast. We have also proved that Ω(nT) is a time complexity lower bound for a geocast algorithm to ensure deterministic reliable delivery, and we have provided a distributed solution which is proved to be asymptotically optimal in time. This latter bound confirms the intuition, presented in [15] for the broadcast problem by Prakash et al., that when nodes may move, the number of nodes in the system is the dominant factor in the reliable communication completion time. In fact, our solution and bounds are also applicable to three dimensions, a case that is rarely studied but may be of growing interest.
Finally, assuming the one-dimensional mobility model, i.e., nodes moving on a line, we have proved that vmax < 2/T is a necessary condition to solve geocast, where T is a system parameter, and we have presented an efficient algorithm for the case vmax < 1/T. This still leaves a gap on the maximum speed for which the geocast problem is solvable in one dimension.
References
1. Baldoni, R., Ioannidou, K., Milani, A.: Mobility versus the cost of geocasting in mobile ad hoc networks. In: Pelc, A. (ed.) DISC 2007. LNCS, vol. 4731, pp. 48–62. Springer, Heidelberg (2007)
2. Bruschi, D., Del Pinto, M.: Lower bounds for the broadcast problem in mobile radio networks. Distributed Computing 10(3), 129–135 (1997)
3. Clark, B.N., Colbourn, C.J., Johnson, D.S.: Unit disk graphs. Discrete Mathematics 86(1-3), 165–177 (1990)
4. Jiang, X., Camp, T.: A review of geocasting protocols for a mobile ad hoc network. In: Proceedings of the Grace Hopper Celebration (2002)
5. Dolev, S., Gilbert, S., Lynch, N.A., Shvartsman, A.A., Welch, J.: GeoQuorums: Implementing atomic memory in mobile ad hoc networks. Distributed Computing 18(2), 125–155 (2005)
6. Chlebus, B.S., Gasieniec, L., Gibbons, A., Pelc, A., Rytter, W.: Deterministic broadcasting in ad hoc radio networks. Distributed Computing 15(1), 27–38 (2002)
7. Ko, Y.-B., Vaidya, N.H.: Geocasting in mobile ad-hoc networks: Location-based multicast algorithms. In: Proceedings of IEEE WMCSA, New Orleans, LA (1999)
8. Ko, Y., Vaidya, N.H.: GeoTORA: A protocol for geocasting in mobile ad hoc networks. In: Proceedings of the 8th International Conference on Network Protocols (ICNP), p. 240. IEEE Computer Society, Los Alamitos (2000)
9. Ko, Y., Vaidya, N.H.: Flooding-based geocasting protocols for mobile ad hoc networks. Mobile Networks and Applications 7(6), 471–480 (2002)
10. Gupta, S.K.S., Srimani, P.K.: An adaptive protocol for reliable multicast in mobile multi-hop radio networks. In: Proceedings of the 2nd Workshop on Mobile Computing Systems and Applications (WMCSA), p. 111. IEEE Computer Society, Los Alamitos (1999)
11. Imielinski, T., Navas, J.C.: GPS-based geographic addressing, routing, and resource discovery. Communications of the ACM 42(4), 86–92 (1999)
12. Liao, W., Tseng, Y., Lo, K., Sheu, J.: GeoGRID: A geocasting protocol for mobile ad hoc networks based on grid. Journal of Internet Technology 1(2), 23–32 (2001)
13. Mohsin, M., Cavin, D., Sasson, Y., Prakash, R., Schiper, A.: Reliable broadcast in wireless mobile ad hoc networks. In: Proceedings of the 39th Hawaii International Conference on System Sciences (HICSS), p. 233.1. IEEE Computer Society, Los Alamitos (2006)
14. Navas, J.C., Imielinski, T.: Geocast: Geographic addressing and routing. In: Proceedings of the 3rd Annual ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), pp. 66–76. ACM Press, New York (1997)
15. Prakash, R., Schiper, A., Mohsin, M., Cavin, D., Sasson, Y.: A lower bound for broadcasting in mobile ad hoc networks. École Polytechnique Fédérale de Lausanne, Tech. Rep. IC/2004/37 (2004)
16. Pagani, E., Rossi, G.P.: Reliable broadcast in mobile multihop packet networks. In: Proceedings of the 3rd Annual ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), pp. 34–42. ACM Press, New York (1997)
1 Introduction
Peer-to-peer networks have become an established paradigm of distributed computing and data storage. One of the main issues tackled in this research area is building an overlay network that provides a sparse set of connections for communication between all node pairs. The aim is to build the network in such a way that an underlying routing scheme is able to quickly reach any node from any other, without maintaining a complete graph of connections. In this paper, we investigate such networks, suitable not only for peer-to-peer networks but also for Bluetooth scatternet formation and one-hop radio networks.
An important property of the investigated networks is their scalability. We introduce a scalable and dynamic network structure which we call SkewCCC. The maximum in- and out-degree of a node inside the SkewCCC network is 3, and routing and lookup times are logarithmic in the current number of nodes. Naturally, it is impossible to decrease the degree to 2 while preserving any network topology besides a ring. Our routing scheme is name-driven, i.e., packet
This work has been partially supported by EU Commission COST 295 Action
DYNAMO Foundations and Algorithms for Dynamic Networks.
Supported by MNiSW grant number N206 001 31/0436, 2006-2008.
Partially supported by the EU within the 6th Framework Programme under contract IST-2005-034891 Hydra.
Supported by MNiSW grant number PBZ/MNiSW/07/2006/46.
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 184–196, 2008.
© Springer-Verlag Berlin Heidelberg 2008
2 Related Work
3 SkewCCC
As we base our approach on the hypercube and the CCC (Cube Connected Cycles) networks, we briefly review their construction. These networks have been studied extensively and have good properties concerning maintenance, diameter,
degree of nodes and routing speed. On the other hand, they are meant exclusively
for static networks. For a thorough introduction to these kinds of networks, we
refer the reader to [8].
In this paper, we use the following notation. By a string we always mean a binary string, whose bits are numbered from 0. For any two strings a and b, we write a ⊑ b to denote that a is a (not necessarily proper) prefix of b. We denote the empty string by ε and use ∘ to denote the concatenation of two strings; ⊕ denotes the bitwise xor operation. We also identify strings of fixed length with the binary numbers they represent.
Definition 1. The d-dimensional hypercube network has n = 2^d nodes. Each node is represented by a number 0 ≤ i < n. Two nodes i and j are connected if and only if i ⊕ 2^k = j for some integer 0 ≤ k < d.
A d-dimensional Cube Connected Cycles (CCC) network is essentially a d-dimensional hypercube in which each node is replaced with a ring of length d and each of its d connections is assigned to one of the ring nodes. This way, the degree of the network is reduced from d to 3, whereas almost all of the network properties (e.g., diameter) change only slightly.
Definition 2. The d-dimensional CCC network has d · 2^d nodes. Each node is represented by a pair (i, j), where 0 ≤ i < d and 0 ≤ j < 2^d. Each such node is connected to three neighbors: two cycle neighbors with indices ((i ± 1) mod d, j), and a hypercubic one, (i, j ⊕ 2^i).
Examples of a 3-dimensional hypercube and CCC network are given in Fig. 2.
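A minimal sketch of Definition 2 (helper names are ours) that enumerates the nodes of the CCC and their three neighbors:

```python
def ccc_nodes(d):
    """All nodes (i, j) of the d-dimensional CCC: 0 <= i < d, 0 <= j < 2**d."""
    return [(i, j) for i in range(d) for j in range(2 ** d)]

def ccc_neighbors(i, j, d):
    """The two cycle neighbors ((i +/- 1) mod d, j) and the hypercubic
    neighbor (i, j ^ 2**i) of node (i, j)."""
    return [((i - 1) % d, j), ((i + 1) % d, j), (i, j ^ (1 << i))]
```

For d = 3 this yields 24 nodes, each of degree exactly 3, and the neighbor relation is symmetric.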
We present our network in two stages. First, we describe the network and its properties. Second, we describe the algorithms which ensure a proper structure when nodes join and leave the network.
3.1 Network Structure
When we compare the CCC network with the hypercube, we may think of the d-dimensional hypercube as a skeleton for the d-dimensional CCC. In other words, if we replace each node of the hypercube (to which we refer as a corner) by a cycle of nodes, then the resulting network is a CCC. In the following description, we start with such a skeleton for our network, called the SkewHypercube, and then show how to replace each of its corners by a ring of real network nodes, finally creating a structure which we call a SkewCCC network.
SkewHypercube. In the following, we describe a skeleton network called the SkewHypercube. Each node of this network will correspond to a group of real nodes. To avoid ambiguity, we refer to a skeleton node as a corner.
First, we define the set of corners. Each corner i has an identifier, which is a string si of length di. The number di is called the dimension of the corner. We require that the set of corner identifiers C = {si} is prefix-free and complete. Prefix-freeness means that for any two distinct strings si, sj, neither si ⊑ sj nor sj ⊑ si. Completeness means that for any infinite string s, there exists an identifier si ∈ C such that si ⊑ s. The description above implies that (i) a single corner with the empty name s = ε constitutes a correct set C and (ii) any correct set of corners C can be obtained by repeated use of the following operation (starting from the set {ε}): take a corner i and replace it with two corners j and k with identifiers sj = si ∘ 0 and sk = si ∘ 1.
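The splitting operation and the two invariants can be sketched as follows (function names are ours; completeness is checked only up to a fixed depth, as a finite stand-in for the "every infinite string" requirement):

```python
from itertools import product

def split_corner(corners, s):
    """Replace corner s by the two corners s + '0' and s + '1'."""
    assert s in corners
    return (corners - {s}) | {s + '0', s + '1'}

def is_prefix_free(corners):
    """No identifier is a prefix of another, distinct identifier."""
    return not any(a != b and b.startswith(a) for a in corners for b in corners)

def is_complete(corners, depth=8):
    """Every depth-bit string has exactly one corner identifier as a prefix."""
    return all(sum(s.startswith(c) for c in corners) == 1
               for s in map(''.join, product('01', repeat=depth)))
```

Starting from the set containing only the empty identifier, any sequence of splits preserves both invariants.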
Second, we define the set of edges in a SkewHypercube.
Definition 3. Two SkewHypercube corners i and j are connected iff
(i) there exists 0 ≤ ki < di such that si ⊕ 2^ki ⊑ sj, or
(ii) there exists 0 ≤ kj < dj such that sj ⊕ 2^kj ⊑ si.
We note that if all identifiers of corners have the same length d, then our SkewHypercube is just a regular d-dimensional hypercube. On the other hand, the definition above allows the following situation to occur. It may happen that a single corner s has identifier 0 (dimension 1) while there are 2^k corners of dimension k + 1 with identifiers starting with 1. This results in corner s having degree 2^k. We explicitly forbid such situations in the construction of our network and require that the dimensions of neighboring corners differ by at most 1. This ensures that each corner of dimension d has at most 2d neighbors. An example of a SkewHypercube is presented in Fig. 3.
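Definition 3 can be sketched as follows (our helper names; we take bit k of a length-d identifier to be the bit of value 2^k, matching the hypercube convention):

```python
def flip(s, k):
    """Flip the bit of value 2**k in the binary string s."""
    i = len(s) - 1 - k
    return s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]

def skew_connected(si, sj):
    """Corners are connected iff flipping some bit of one identifier yields
    a prefix of the other identifier (Definition 3)."""
    return (any(sj.startswith(flip(si, k)) for k in range(len(si))) or
            any(si.startswith(flip(sj, k)) for k in range(len(sj))))
```

For identifiers of equal length this coincides with the hypercube; the corner '0' is connected to every corner whose identifier starts with '1', which is exactly the high-degree situation ruled out above.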
Identifiers. To specify which nodes are stored in particular corners of the SkewHypercube, we have to give nodes unique identifiers. These identifiers are infinite strings, chosen randomly upon joining the network, where each bit is chosen independently and uniformly at random.
3.2 Network Maintenance
In this section, we show how to maintain the shape of the network in a distributed
way in a dynamic setting, where nodes may join or leave the system. We start
node j is chosen so that, after removing i and migrating j into its place, spare nodes are distributed evenly in the corner. After i leaves and j migrates, a check is performed whether the number of nodes in the corner has decreased sufficiently to call a merging operation.
Merging. When the number of nodes in a corner s of dimension d drops below 2d, we have to merge s with its neighbor s′ = s ⊕ 2^(d−1), i.e., with the one differing from s in the last bit. Actually, we do it already when the number of nodes in such a corner drops to (6λ + 7) · d. First, we send a message to all neighboring corners of dimension d + 1 (possibly including the neighbors across bit d), telling them to merge first. They merge recursively, and after all neighbors of s are of dimension d or d − 1, we merge the two corners. Naturally, whenever a corner receives a message from one of its neighbors (of lower dimension) telling it to merge, it starts the merge procedure too.
3.3 Analysis
Before we bound the runtime of all operations on the system, we prove that with high probability the system is balanced, i.e., the dimensions of all corners are roughly the same.
To formally define this notion, we introduce (just for the analysis) a parameter du, which would be the current dimension of the network if it were a CCC. This means that if n is the current number of nodes in the network, then du · 2^du ≤ n < (du + 1) · 2^(du+1). For du > 2, it holds that du/2 ≤ ln n ≤ 2du. Additionally, we introduce a parameter dl differing from du by a constant: dl := du − log λ − 5.
For simplicity of notation, we assume that all node identifiers are infinite.
Definition 4. A SkewCCC network is balanced if the dimensions of all corners are between dl and du.
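The parameters du and dl can be computed as in the following sketch (our illustration; `lam` stands for the analysis constant λ, and the concrete value λ = 1 is our assumption for the example):

```python
import math

def current_dimension(n):
    """The d_u with d_u * 2**d_u <= n < (d_u + 1) * 2**(d_u + 1)."""
    du = 1
    while (du + 1) * 2 ** (du + 1) <= n:
        du += 1
    return du

def lower_dimension(du, lam=1):
    """d_l := d_u - log(lam) - 5."""
    return du - math.log2(lam) - 5
```

For example, n = 10^6 gives du = 15, and the stated relation du/2 ≤ ln n ≤ 2·du indeed holds.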
Now we show that the network is balanced with high probability. We note that even in case of bad luck (happening with polynomially small probability), the system still works; it might just work slower.
Lemma 1. If the network is stable, i.e., no nodes are currently joining or leaving and no split or merge operations are currently being executed or pending, then the network is balanced with probability at least 1 − 2n^−λ.
Proof. We prove two claims:
(i) the probability that there exists a corner with dimension du or greater is at most n^−λ;
(ii) the probability that there exists a corner with dimension dl or smaller is at most n^−λ.
For proving (i), we take a closer look at the set S of all node identifiers. For any string s, let Ss = {si ∈ S : s ⊑ si}, i.e., Ss consists of all identifiers starting with s. We say that S is well separated if for each du-bit string s, |Ss| ≤ (6λ + 7) · du.
First, we observe that if S is well separated, then there is no corner of dimension du or higher. Assume the contrary and choose a corner si with the highest dimension (at least du). Then there are at most (6λ + 7) · du identifiers starting with si, i.e., remaining in this corner. As all the neighbors of corner si have smaller or equal dimension, this corner should be merged, which contradicts the assumption that the network is stable.
Second, we show that the probability that S is not well separated is at most n^−λ. Fix any string s of length du. For each node i with identifier si, let an indicator random variable Xi^s be equal to 1 iff s ⊑ si. Since s is a du-bit string, we have E[Xi^s] = 2^−du. Let X^s = Σ_{i=1}^n Xi^s; by the linearity of expectation, E[X^s] = Σ_{i=1}^n E[Xi^s] = n · 2^−du ≥ du. Using the Chernoff bound, we obtain that

Pr[X^s ≥ (6λ + 7) · du] ≤ Pr[X^s − E[X^s] ≥ (6λ + 6) · E[X^s]]
≤ exp(−(6λ + 6) · E[X^s] / 3) ≤ e^−(2λ+2)·du ≤ n^−(λ+1).

There are 2^du ≤ n possible du-bit strings, and thus (by the union bound) the probability that S is not well separated is at most n · n^−(λ+1) = n^−λ.
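The well-separation bound can also be checked empirically; the simulation below (our addition, with λ = 1 assumed) buckets n random identifiers by their first du bits and confirms that the fullest bucket stays below the (6λ + 7) · du threshold:

```python
import random

def max_prefix_load(n, du, trials=5, seed=0):
    """Max over trials of max_s |S_s|, bucketing n random identifiers by
    their first du bits."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        counts = {}
        for _ in range(n):
            p = rng.getrandbits(du)
            counts[p] = counts.get(p, 0) + 1
        worst = max(worst, max(counts.values()))
    return worst
```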
Proving (ii) is analogous. We say that S is well glued if for each dl-bit string s, |Ss| ≥ 12λ · dl.
Again, we prove that if S is well glued, then there is no corner of dimension dl or lower. Assume the contrary and let si be the corner with the lowest dimension. There are at least 12λ · dl nodes in corner si. As all the neighbors of corner si have greater or equal dimension, this corner should be split, which contradicts the assumption that the network is stable.
Again, we show that the probability that S is not well glued is at most n^−λ. Let t be any dl-bit string, let indicator random variables Xi^t denote whether t ⊑ si, and let X^t = Σ_{i=1}^n Xi^t. Then E[X^t] = n · 2^−(du − 5 − log λ) ≥ 32λ · du. Using the Chernoff bound, we get that

Pr[X^t ≤ 12λ · dl] ≤ Pr[X^t ≤ (1 − 1/2) · E[X^t]] ≤ e^−E[X^t]/8 ≤ e^−4λ·du ≤ n^−2λ.

There are 2^dl ≤ n possible dl-bit strings, and thus the probability that S is not well glued is at most n · n^−2λ ≤ n^−λ.
According to Lemma 1, the system is balanced with high probability. Now we show that all basic operations are performed in time logarithmic in the current number of nodes. In the following, we assume that the high-probability event of the network being balanced actually happens.
Lemma 2. If the network is balanced, then each search operation is performed
in logarithmic time.
Proof. Since each corner is of dimension Θ(log n), we have to fix Θ(log n) bits in order to reach the destination corner. As the number of nodes in each corner is within a constant factor of its dimension, there are Θ(log n) nodes in each corner. In particular, there is a constant number of spare nodes between any two consecutive core nodes. Thus, in order to fix the i-th bit after fixing the (i − 1)-st bit, we have to traverse only a constant number of edges. In order to fix the first bit, and to reach the destination node after reaching the destination corner, we have to travel through at most all the nodes of two corners. Hence, the total number of traversed edges is O(log n).
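The bit-fixing idea behind the proof can be illustrated on the plain CCC of Definition 2 (this routine is our sketch, not the paper's algorithm; in the SkewCCC the same sweep runs over corners of slightly varying dimension and uses the spare nodes between core nodes):

```python
def ccc_route(src, dst, d):
    """Route in the d-dimensional CCC by one sweep around the ring, taking the
    hypercubic edge at ring position i whenever bit i still differs, then
    walking the ring to the destination's ring position."""
    (i, j), (ti, tj) = src, dst
    path = [src]
    for _ in range(d):                # one sweep fixes every bit once
        if (j ^ tj) >> i & 1:         # bit i differs: take the hypercubic edge
            j ^= 1 << i
            path.append((i, j))
        i = (i + 1) % d               # step along the ring
        path.append((i, j))
    while i != ti:                    # finish on the destination's ring
        i = (i + 1) % d
        path.append((i, j))
    return path
```

The path length is O(d), i.e., O(log n) in an n-node CCC.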
Before we prove upper bounds for the join and leave operations, we bound the time of split and merge.
Lemma 3. If the network is balanced, then each split operation is performed in
logarithmic time.
Proof. When we split a corner s of dimension d into corners s ∘ 0 and s ∘ 1 of dimension d + 1, there are more than sufficient nodes for each corner, and we know that each neighboring corner is of dimension d or d + 1. Since the corner s currently has to split, there are at least 12(d + 1) nodes of each type in s. Starting at any node, we traverse the ring a constant number of times and do the following. In the first pass, we make sure that there are two connections to neighboring corners across every bit. If there are two connections already, there is nothing to do; if there is only one, we take any spare node in the corner we are currently splitting and make a connection to a spare node in the neighboring corner. From now on, these two spare nodes (one in each corner) are core nodes. Finally, we add two core nodes without an outside connection to s; they will be responsible for connecting the two corners into which we split s.
In the second pass, we use all spare nodes to create two additional rings: one built of nodes with the d-th bit equal to 0 and the other built of nodes with the d-th bit equal to 1. Since each ring has at most 2(d + 1) core nodes and at least 12(d + 1) nodes in total, each of the newly created rings has at least 10(d + 1) nodes. In the next pass, we go along each of the three rings in parallel and pass the responsibility for a connection to another corner from the old ring to one of the new ones. This means that the ring of nodes with the d-th bit equal to 0 takes responsibility for the connections to corners (s ⊕ 2^k) ∘ 0 (analogously if the last bit is equal to 1).
In the last traversal of all rings, we delete nodes from the old ring and make them join one of the new rings as spare nodes. Again, we move nodes with the d-th bit equal to 0 to the newly created corner s ∘ 0, and nodes with the d-th bit equal to 1 to the newly created corner s ∘ 1. As we have used only a constant number of traversals of rings of length O(log n), the whole split operation needs time O(log n).
We note that in the case of search and split operations, the system can be balanced in a weaker sense than we defined, i.e., we just need the corner dimension to be O(log n). The merge operation is the only operation which depends on the property that the difference between the dimensions of the corners is constant. It is also the most time- and work-consuming function, as a single merge operation might need to execute other merge operations before it can start.
Lemma 4. If the network is balanced, then a merge operation is performed in time O(polylog(n)), including the recursive execution of other needed merge operations, whereas the amortized cost per merge operation is O(log n).
Proof. Since the total number of different dimensions of all corners in the network is bounded by a constant (log λ + 5), the recursive execution of other possibly necessary merge operations for neighboring corners of higher dimensions has only constant depth. As on each level a corner can have d + O(1) = O(log n) neighbors, the total number of involved corners is log^O(1) n. Below we prove that a single merge operation of corners s ∘ 0 and s ∘ 1 of dimension d + 1 into a corner s of dimension d has cost O(d) = O(log n), if all neighbors have been reduced to equal or lower dimension.
Similarly to the split operation, we can traverse in parallel both rings s ∘ 0 and s ∘ 1 which we want to merge; we denote their dimension by d. As no neighbor of s ∘ 0 and s ∘ 1 is of dimension d + 1, only d connections of the 2d core nodes are actually used in each of them. We first build a ring composed of these used core nodes of both old rings, whereby we zip them into one ring, interleaving the core nodes of s ∘ 0 and the core nodes of s ∘ 1. When we remove core nodes from the old rings, we glue the holes so that we get two rings composed of spare nodes. Next, we remove the two core nodes which have been responsible for the connection across bit d − 1 (they connected s ∘ 0 to s ∘ 1) and move them to one of the old rings as spare nodes. In the next traversal of the old rings, we calculate how many spare nodes they contain and then, in the last traversal, we evenly distribute the spare nodes in the newly created corner s.
Notice that there is no need to add any connections to neighboring corners: all necessary connections already exist. On the other hand, if our new (d − 1)-dimensional corner is a neighbor of another (d − 1)-dimensional one, we have a double connection with this corner. We should remove one of the connections, namely the one which originates from s ∘ 1.
Since the cost of merging s ∘ 0 and s ∘ 1 into s (not including the recursive merging of neighbors) is Θ(d), and the cost of splitting s into s ∘ 0 and s ∘ 1 has also been Θ(d), we can amortize the cost of merging into s against the cost of splitting s. This shows that the amortized cost of a merge operation, together with its symmetric split operation of a corner of dimension d, is Θ(d) = Θ(log nm) = Θ(log ns), where nm is the number of nodes in the system at the moment when we perform the merge operation and ns is the number of nodes at the moment when we have performed the split operation.
Lemma 5. The join operation can be performed in logarithmic time.
Proof. Each time a node joins the network, it has to search for its position inside the network and take its place inside the ring. Based on the previous lemmas, the search operation can be performed in logarithmic time, and the update of the ring structure involves the creation of two new edges and the removal of one existing edge. Furthermore, it might happen that the corner has to split, resulting in additional O(log n) operations.
Finally, in the following lemma, we analyze the cost of a leave operation.
Lemma 6. The leave operation can be performed in polylogarithmic time, and in amortized logarithmic time.
Proof. A typical leave operation only triggers a few connection updates. If the node has been a spare node in its corner, the leave operation involves two connection updates; if the node has been a core node, one additional update has to be performed to re-connect to the neighboring corner.
Besides the connection updates, a node leaving the network might also trigger a merge operation on its corner. Based on the previous lemmas, each merge operation costs at most polylogarithmic time (and logarithmic amortized time), and this majorizes the cost of a leave operation.
We have shown a fully distributed yet simple scheme which joins a potentially very large set of computationally weak nodes into an organized network with the minimal possible degree of 3, logarithmic dilation, and name-driven routing.
Based on the properties and the structure of the SkewCCC network, it is possible to further investigate aspects of heterogeneity and locality. The former means allowing the existence of network nodes which can have a higher degree and potentially also greater computational power. The latter aspect would incorporate distances in the underlying network.
References
1. Aspnes, J., Shah, G.: Skip graphs. ACM Transactions on Algorithms 3(4) (2007); also appeared in: Proc. of the 14th SODA, pp. 384–393 (2003)
2. Awerbuch, B., Scheideler, C.: The hyperring: A low-congestion deterministic data structure for distributed environments. In: Proc. of the 15th ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 318–327 (2004)
3. Barrière, L., Fraigniaud, P., Narayanan, L., Opatrny, J.: Dynamic construction of Bluetooth scatternets of fixed degree and low diameter. In: Proc. of the 14th ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 781–790 (2003)
4. Bienkowski, M., Brinkmann, A., Korzeniowski, M., Orhan, O.: Cube connected cycles based Bluetooth scatternet formation. In: Proc. of the 4th International Conference on Networking, pp. 413–420 (2005)
5. Harvey, N.J.A., Jones, M.B., Saroiu, S., Theimer, M., Wolman, A.: SkipNet: A scalable overlay network with practical locality properties. In: Proc. of the 4th USENIX Symposium on Internet Technologies and Systems (2003)
6. Jiang, X., Polastre, J., Culler, D.: Perpetual environmentally powered sensor networks. In: Proc. of the 4th Int. Symp. on Information Processing in Sensor Networks (IPSN), pp. 463–468 (2005)
7. Karger, D., Lehman, E., Leighton, T., Levine, M., Lewin, D., Panigrahy, R.: Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In: Proc. of the 29th ACM Symp. on Theory of Computing (STOC), pp. 654–663 (1997)
8. Leighton, F.T.: Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, San Francisco (1992)
9. Malkhi, D., Naor, M., Ratajczak, D.: Viceroy: A scalable and dynamic emulation of the butterfly. In: Proc. of the 21st ACM Symp. on Principles of Distributed Computing (PODC), pp. 183–192 (2002)
10. Naor, M., Wieder, U.: Novel architectures for P2P applications: The continuous-discrete approach. ACM Transactions on Algorithms 3(3) (2007); also appeared in: Proc. of the 15th SPAA, pp. 50–59 (2003)
11. Rowstron, A.I.T., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Guerraoui, R. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001)
12. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek, F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking 11(1), 17–32 (2003); also in: Proc. of the ACM SIGCOMM, pp. 149–160 (2001)
13. Zhao, B.Y., Huang, L., Stribling, J., Rhea, S.C., Joseph, A.D., Kubiatowicz, J.: Tapestry: A resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications 22(1), 41–53 (2004)
On the Time-Complexity of
Robust and Amnesic Storage
Dan Dobre, Matthias Majuntke, and Neeraj Suri
TU Darmstadt, Hochschulstr. 10, 64289 Darmstadt, Germany
{dan,majuntke,suri}@cs.tu-darmstadt.de
1 Introduction
Research funded in part by DFG GRK 1362 (TUD GKmM), EC NoE ReSIST and
Microsoft Research via the European PhD Fellowship.
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 197–216, 2008.
© Springer-Verlag Berlin Heidelberg 2008
registers where read operations never return outdated values. A regular register is deemed to return the last value written before the read was invoked, or one written concurrently with the read (see [1] for a formal definition). Regular registers are attractive because even under concurrency they never return spurious values, as is sometimes done by the weaker class of safe registers [1]. Furthermore, they can be used, for instance, together with a failure detector to implement consensus [2].
The abstraction of a reliable storage is typically built by replicating the data over multiple unreliable distributed storage units called base objects. These can range from simple (low-level) read/write registers to more powerful base objects like active disks [3] that can perform more sophisticated operations (e.g., an atomic read-modify-write). Taken to the extreme, base objects can also be implemented by full-fledged servers that execute more complex protocols and actively push data [4]. We consider Byzantine-fault-tolerant register constructions where a threshold t < n/3 of the base objects can fail by being non-responsive or by returning arbitrary values, a failure model called NR-arbitrary [5]. Furthermore, we consider wait-free implementations where concurrent access to the base objects and client failures must not hamper the liveness of the algorithm. Wait-freedom is the strongest possible liveness property, stating that each client completes its operations independently of the progress and activity of other clients [6]. Algorithms that wait-free implement a regular register from Byzantine components are called robust [7]. An implementation of a reliable register requires the (client) processes accessing the register via a high-level operation to invoke multiple low-level operations on the base objects. In a distributed setting, each invocation of a low-level operation results in one round of communication from the client to the base object and back. The number of rounds needed to complete the high-level operation is used as a measure of the time complexity of the algorithm.
Robust algorithms are particularly difficult to design when the base objects store only a limited number of written values. Algorithms that satisfy this property are called amnesic. With amnesic algorithms, values previously stored are not permanently kept in storage but are eventually erased by a sequence of values written after them. Amnesic algorithms eliminate the problem of space exhaustion raised by (existing) non-amnesic algorithms, which take the approach of storing the entire version history. Therefore, the amnesic property captures an important aspect of the space requirements of a distributed storage implementation. The notion of amnesic storage was introduced in [7] and defined in terms of write-reachable configurations. A configuration captures the state of the correct base objects. Starting from an initial configuration, any low-level read/write operation (i.e., one changing the state of a base object) leads the system to a new configuration. A configuration C′ is write-reachable from a configuration C when there is a sequence consisting only of (high-level) write operations that, starting from C, leads the system to C′. Intuitively, a storage algorithm is amnesic if, except for a finite number of configurations, all configurations reached by the algorithm are eventually erased by a sufficient number of values written after them. Erasing a configuration C′, which itself was obtained from a configuration C, means reaching a configuration C″ that could have been obtained directly from C without going through C′. This means that once in C″, the system cannot tell whether it has ever been in configuration C′. For instance, an algorithm that stores the entire history of written values in the base objects is not amnesic. In contrast, an algorithm that stores in the base objects only the last l written values is amnesic, because after writing the (l + 1)-st value the algorithm cannot recall the first written value anymore.
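A toy illustration of the last point (our sketch; the actual algorithms discussed here operate on n replicated base objects, not a single one): a base object that keeps only the last l written values reaches configurations that are indistinguishable from ones reached without the erased values ever having been written.

```python
from collections import deque

class BoundedHistoryObject:
    """A base object that stores only the last l written values."""
    def __init__(self, l):
        self.history = deque(maxlen=l)

    def write(self, value):
        self.history.append(value)

    def state(self):
        return list(self.history)
```

After l + 1 writes, the first value is gone: the same state is reached by a history that never wrote it.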
1.1 Related Work
Despite the importance of amnesic and robust distributed storage, most implementations to date are either not robust or not amnesic. While some relax wait-freedom and provide weaker termination guarantees instead [2, 8], others relax consistency and implement only the weaker safe semantics [5, 9, 2, 10]. Generally, when it comes to robustly accessing (unauthenticated) data, most algorithms store an unlimited number of values in the base objects [11, 10, 12]. Also in systems where base objects push messages to subscribed clients [4, 13, 14], the servers store every update until the corresponding message has been received by every non-faulty subscriber. Therefore, when the system is asynchronous, the servers might store an unbounded number of updates. A different approach is to assume a stronger model where data is self-verifying [9, 15, 16], typically based on digital signatures. For unauthenticated data, the only existing robust and amnesic storage algorithms [17, 18] do not achieve the same time complexity as non-amnesic ones. Time-complexity lower bounds have shown that protocols using the optimal number of 3t + 1 base objects [4] require at least two rounds to implement both read and write operations [10, 2]. So far these bounds are met only by non-amnesic algorithms [12]. In fact, the only robust and amnesic algorithm with optimal resilience [17] requires an unbounded number of read rounds in the worst case. For the 4t + 1 case, the trivial lower bound of one round for both operations is not reached by the only other existing amnesic implementation [18], which, albeit elegant, requires at least three rounds for reading and two for writing.
1.2 Paper Contributions
Current state-of-the-art protocols leave the following question open: do amnesic algorithms inherently have non-optimal time complexity? This paper addresses this question and shows, for the first time, that amnesic algorithms can achieve optimal time complexity in both the 3t + 1 and 4t + 1 cases. Justified by the impossibility of amnesic and robust register constructions when readers do not write [7], one of the key principles shared by our algorithms is having the readers change the base objects' state. The developed algorithms are based on a novel concurrency detection mechanism and a helping procedure, by which a writer detects overlapping reads and helps them to complete. Specifically, the paper makes the following two main contributions:
- A first algorithm, termed DMS, which uses 4t + 1 base objects, described in
Section 3. With DMS, every (high-level) read and write operation is fast, i.e.,
it completes after only one round of communication with the base objects.
This is the first robust and amnesic register construction (for unauthenticated
data) with optimal time-complexity.
- A second algorithm, termed DMS3, which uses the optimal number of 3t + 1
base objects, presented in Section 4. With DMS3, every (high-level) read
operation completes after only two rounds, while write operations complete
after three rounds. This is the first amnesic and robust register construction
(for unauthenticated data) with optimal read complexity. Note also that,
compared to the optimal write complexity, it needs only one additional
communication round.
Table 1 summarizes our contributions and compares DMS and DMS3 with recent
distributed storage solutions for unauthenticated data.
Table 1. Distributed storage for unauthenticated data

Protocol   Resilience   Worst-Case Time-Complexity   Amnesic   Robust
                        Read          Write
[18]       4t + 1       3             2              yes       yes
DMS        4t + 1       1             1              yes       yes
[12]       3t + 1       2             2              no        yes
[14]       3t + 1       t + 1         2              no        yes
[17]       3t + 1       unbounded     3              yes       yes
DMS3       3t + 1       2             3              yes       yes
2.1 System Model
Note that since write&read is not an atomic operation, it can be implemented from
simple read/write registers and thus the model is not strengthened.
2.2 Preliminaries
In order to distinguish between the target register's interface and that of the
base registers, throughout the paper we denote the high-level read (resp. write)
operation as read (resp. write). Each of the developed protocols uses an underlying
layer that invokes operations on different base objects in separate threads
in parallel. We use the notation from [2] and write invoke write(Xi, v) (resp.
invoke x[i] ← read(Xi)) to denote that a write(v) operation on register Xi
(resp. a read of register Xi whose response will be stored in a local variable x[i])
is invoked in a separate thread by the underlying layer. The notation invoke
x[i] ← write&read(Yi, v, Xi) denotes the invocation of an operation write&read
on base object i, consisting of a write(v) on register Yi followed by a read of
register Xi (whose response will be stored in x[i]).
As base objects may be non-responsive, high-level operations can return while
there are still pending invocations to the base objects. The underlying layer keeps
track of which invocations are pending to ensure well-formedness, i.e., that a
process does not invoke an operation on a base object while invocations of the
same process and on the same base object are pending. Instead, the operation
is denoted enabled. If an operation is enabled when a pending one responds,
the response is discarded and the enabled operation is invoked. See e.g. [2] for a
detailed implementation of such layers.
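As a toy illustration (not from the paper, and with hypothetical names), the bookkeeping that such a layer performs per base object can be sketched as follows: at most one invocation is in flight per object; a newer invocation is kept enabled, and when the stale response arrives it is discarded and the enabled invocation is issued.

```python
class InvocationLayer:
    """Sketch of per-object pending/enabled bookkeeping (well-formedness)."""

    def __init__(self):
        self.pending = {}   # object id -> invocation currently in flight
        self.enabled = {}   # object id -> invocation waiting to be issued

    def invoke(self, obj, op):
        if obj in self.pending:
            # An invocation is in flight: remember only the latest enabled one.
            self.enabled[obj] = op
            return None
        self.pending[obj] = op
        return op  # issued immediately

    def on_response(self, obj, response):
        op = self.pending.pop(obj, None)
        if obj in self.enabled:
            # Discard the stale response and issue the enabled invocation.
            self.pending[obj] = self.enabled.pop(obj)
            return None, self.pending[obj]
        return response, None

layer = InvocationLayer()
assert layer.invoke("X1", "read") == "read"     # issued at once
assert layer.invoke("X1", "write") is None      # pending -> enabled
resp, issued = layer.on_response("X1", "stale")
assert resp is None and issued == "write"       # stale response dropped
```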
We say that an operation op is complete in a run if the run contains a response
step for op. For any two operations op1 and op2 , when the response step of op1
precedes the invocation step of op2 , we say op1 precedes op2 . If neither op1 nor
op2 precedes the other then the two operations are said to be concurrent.
In order to better convey the insight behind the protocols, we simplify the
presentation in two ways. First, we introduce a shared object termed safe counter and
describe both algorithms in terms of this abstraction. Although easy to follow,
the resulting implementations require more rounds than the optimal number.
Thus, for each of the protocols we explain how, with small changes, these rather
didactic versions can be condensed to achieve the announced time-complexity.
The full details of the optimizations can be found in our publicly available
technical report [19]. Second, for presentation simplicity we implement an SRSW
register. Conceptually, an MRSW register for m readers can be constructed using
m copies of this register, one for each reader. In a distributed storage setting,
the writer accesses all m copies in parallel, whereas the reader accesses a single
copy. It is worth noting that this approach is heavy and that in practice, cheaper
solutions are needed to reduce the communication complexity and the amount
of memory needed in the base objects.
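A minimal sketch of this construction (illustrative only; the parallelism, quorums and fault model of the storage setting are elided, and all names are hypothetical):

```python
# Toy illustration: an m-reader register built from m single-reader (SRSW)
# copies. The writer updates every copy; reader i only touches copy i.
class SRSW:
    def __init__(self):
        self.value = None
    def write(self, v):
        self.value = v
    def read(self):
        return self.value

class MRSW:
    def __init__(self, m):
        self.copies = [SRSW() for _ in range(m)]
    def write(self, v):
        for c in self.copies:   # in the storage setting: in parallel
            c.write(v)
    def read(self, i):
        return self.copies[i].read()

r = MRSW(3)
r.write("x")
assert all(r.read(i) == "x" for i in range(3))
```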
We now introduce the safe counter abstraction used in our algorithms. A
safe counter has two wait-free operations, inc and get. inc modifies the counter
by incrementing its value (initially 0) and returns the new value. Specifically,
the k-th inc operation, denoted inc_k, returns k. get returns the current value
of the counter without modifying it. The counter provides the following
guarantees:

- Validity: if get returns k, then get does not precede inc_k.
- Safety: if inc_k precedes get and, for every l > k, get precedes inc_l, then get returns k.
We start by describing an initial version of protocol DMS that uses the safe
counter abstraction. It is worth noting that the algorithm requires more rounds
than the optimum, but it conveys the main idea. Next, we explain the changes
applied to DMS to obtain an algorithm with optimal time-complexity.
3.1
Protocol Description
Local variables:
  y[1 . . . n] ∈ Integers ∪ {⊥}
  k ∈ Integers, initially 0

inc()
  k ← k + 1
  for 1 ≤ i ≤ n do
    invoke write(Yi, k)
  wait for n − t responses
  return k

Predicates:
  safe(c) ≜ |{i : y[i] ≥ c}| ≥ t + 1

get()
  for 1 ≤ i ≤ n do y[i] ← ⊥
  for 1 ≤ i ≤ n do
    invoke y[i] ← read(Yi)
  wait for n − t responses
  return max{c ∈ Integers : safe(c)}

Fig. 1. Safe counter algorithm (4t + 1)
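Under the reconstruction above, get's return rule has a compact reading: the largest c reported by at least t + 1 responses is simply the (t + 1)-st highest response, so at least one correct object vouches for it. A toy sketch (hypothetical helper, not from the paper):

```python
# Sketch of get's return rule for the 4t+1 safe counter: y holds the
# responses received so far (None for objects that have not responded);
# safe(c) holds when at least t+1 responses are >= c, and the maximum
# safe c is the (t+1)-st highest response.
def counter_get(y, t):
    responses = sorted((v for v in y if v is not None), reverse=True)
    assert len(responses) >= len(y) - t, "need n - t responses first"
    return responses[t]  # the (t+1)-st highest value

# n = 4t+1 = 9, t = 2: up to two of the seven responses may be arbitrary.
y = [5, 5, 5, 5, 5, 99, 100, None, None]   # two malicious high values
assert counter_get(y, 2) == 5               # the outliers are filtered out
```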
The read operation consists of (1) a write phase, where it first executes inc to increment the current
view, and (2) a subsequent read phase, where it reads at least n − t registers. To
ensure that read never returns a corrupted value, the returned value must be
read from t + 1 registers, a condition captured by the predicate safe. Moreover, to
ensure regularity, read must not return old values written before the last write
preceding the read. This condition is captured by the predicate highestCand.
We now give a more detailed description of the algorithm. As depicted in
Figure 2, each base register consists of three value fields curr, prev and frozen,
holding timestamp-value pairs, and an integer field view. The writer holds a
variable x of the same type and uses x to overwrite the base registers. Each
write operation saves the timestamp-value pair previously written in x.prev.
Then, it chooses an increasing timestamp, stores the value together with the
timestamp in x.curr and overwrites n − t registers with x. Subsequently, the
writer executes get. If the view returned by get is higher than the current
view (indicating a concurrent read), then x.view is updated and the most recent
value previously written is frozen, i.e., the content of x.prev is stored in x.frozen
(line 14, Figure 3). Finally, write returns ack and completes. It is important
to note that the algorithm is amnesic because each correct base object stores at
most three values (curr, prev and frozen).
The read first executes inc to increment the current view, and then it reads
at least n − t registers into the array x[1 . . . n], where element i stores the content of
register Xi. If necessary, it waits for additional responses until there is a candidate
for returning, i.e., a read timestamp-value pair that satisfies both predicates safe
and highestCand.
Types:
  TSVals ≜ Integers × Vals, with selectors ts and val

Shared objects:
  - regular registers Xi ∈ TSVals³ × Integers, with selectors curr, prev,
    frozen and view, initially ⟨⟨0, v0⟩, ⟨0, v0⟩, ⟨0, v0⟩, 0⟩
  - safe counter object Y ∈ Integers, initially Y = 0

Predicates (reader):
  readFrom(c, i) ≜ (c = x[i].curr ∧ x[i].view < view) ∨
                   (c = x[i].frozen ∧ x[i].view = view)
  safe(c) ≜ |{i : c ∈ {x[i].curr, x[i].prev, x[i].frozen}}| ≥ t + 1
  highestCand(c) ≜ |{i : readFrom(c′, i) ⇒ c′.ts ≤ c.ts}| ≥ 2t + 1

Local variables (reader):
  view ∈ Integers, initially 0
  x[1 . . . n] ∈ TSVals³ × Integers ∪ {⊥}

read()
1   for 1 ≤ i ≤ n do x[i] ← ⊥
2   view ← inc(Y)
3   for 1 ≤ i ≤ n do invoke x[i] ← read(Xi)
4   wait until n − t responded ∧ ∃c ∈ TSVals: safe(c) ∧ highestCand(c)
5   return c.val

Local variables (writer):
  newView, ts ∈ Integers, initially 0
  x ∈ TSVals³ × Integers, initially ⟨⟨0, v0⟩, ⟨0, v0⟩, ⟨0, v0⟩, 0⟩

write(v)
6   ts ← ts + 1
7   x.prev ← x.curr
8   x.curr ← ⟨ts, v⟩
9   for 1 ≤ i ≤ n do invoke write(Xi, x)
10  wait for n − t responses
11  newView ← get(Y)
12  if newView > x.view then
13    x.view ← newView
14    x.frozen ← x.prev
15  return ack

Fig. 3. Robust and amnesic storage algorithm DMS (4t + 1)
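The reader's predicates can be exercised on concrete responses. The following toy sketch (hypothetical names, not the paper's code; timestamp-value pairs are reduced to their timestamps) mirrors the definitions of readFrom, safe and highestCand:

```python
# Toy evaluation of the DMS reader predicates over responses x[i]
# (None if base object i has not responded yet).
def read_from(r, view):
    # the value register i "votes for": curr if written in an older view,
    # frozen if the register already carries the reader's view
    return r["curr"] if r["view"] < view else r["frozen"]

def safe(c, x, t):
    return sum(1 for r in x if r is not None
               and c in (r["curr"], r["prev"], r["frozen"])) >= t + 1

def highest_cand(c, x, view, t):
    votes = [read_from(r, view) for r in x if r is not None]
    return sum(1 for v in votes if v <= c) >= 2 * t + 1

# n = 4t+1 = 5, t = 1: four registers responded, all still in an old view.
x = [{"curr": 3, "prev": 2, "frozen": 0, "view": 0}] * 4 + [None]
assert safe(3, x, 1) and highest_cand(3, x, view=1, t=1)
```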
[Fig. 4: space-time diagram of a DMS run showing read_k (with its inc_k) concurrent with a sequence of writes; the candidate pair c migrates through the curr, prev and frozen fields of the base registers as the view advances from below k to k.]
We now explain, with the help of Figure 4, why reads are wait-free. Consider
read_k, i.e., the read operation that executes the k-th inc (henceforth inc_k), and the
last write that still reads a view lower than k, i.e., the corresponding get
returns a view lower than k; let c be the timestamp-value pair it writes. Note that by the safety
property of the counter, inc_k does not precede get, and thus c is stored in 2t + 1
correct registers before any of them is read. A key aspect of the algorithm is
to ensure that, no matter how many writes are subsequently invoked, c never
disappears from all fields of those 2t + 1 correct registers, as long as read_k is
still in progress. Essentially this holds because the subsequent write re-writes
c to all registers and it also freezes c, to ensure that future writes do the same.
In this process, c migrates from curr to prev, and from prev to frozen, where it
stays until the next view change. Therefore, c eventually becomes safe. But what
if c is not highestCand? In this situation, at least t + 1 correct registers report
timestamp-value pairs higher than c; let c_h be the highest timestamp-value pair
readFrom a correct register. We note that if any of these registers had stored c in
its frozen field, then it would report c. This implies that none of these registers
has stored c in its frozen field and thus, also, none of these registers has stored a
timestamp-value pair higher than c_h in its curr field. Therefore, c_h is reported
by t + 1 correct registers, and hence it is safe. Note that c_h is also highestCand
because only faulty registers report values with higher timestamps.
We now explain how the fast algorithm is derived from DMS. The principle
underlying the optimization is to condense one round of writes to the base
objects and a subsequent round of reads of the base objects into a single round
of write&read. For this purpose we disregard the safe counter abstraction and
directly weave inc and get (Fig. 1) into read and write (Fig. 3), respectively.
As a result, the reader advances the view and reads the base registers in one
round. Likewise, the writer stores a value in the base registers and reads the
view in a single round. The reader code (Fig. 3) is modified as follows: variable
view is incremented locally, and line 3 is replaced with the statement for
1 ≤ i ≤ n do invoke x[i] ← write&read(Yi, view, Xi). Similarly, in the writer
code (Fig. 3), line 9 is replaced with the statement for 1 ≤ i ≤ n do invoke
y[i] ← write&read(Xi, x, Yi). Additionally, in line 11, instead of executing get,
the writer picks the (t + 1)-st highest element of y.
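A minimal sketch of the condensed round (illustrative only; quorums, threads and failures are elided, and all names are hypothetical): a single write&read RPC per base object replaces a write round followed by a read round.

```python
# Sketch: one write&read round instead of a write round plus a read round.
class BaseObject:
    def __init__(self):
        self.X = None   # data register, written by the writer
        self.Y = 0      # view register, written by the reader

    def write_and_read(self, write_reg, value, read_reg):
        setattr(self, write_reg, value)   # the write...
        return getattr(self, read_reg)    # ...and the read, in one round trip

obj = BaseObject()
# Reader: advance the view and read the data register in one round.
x_i = obj.write_and_read("Y", 7, "X")
# Writer: store the value and read the view in a single round.
y_i = obj.write_and_read("X", ("ts1", "v"), "Y")
assert x_i is None and y_i == 7
```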
Protocol Correctness
the write of c.val reads a view equal to k (case 1), or lower than k (case 2).
Note that by the validity of the counter, only views ≤ k are returned. Case 1
implies that (a) only timestamp-value pairs lower than c are frozen, and (b) c
is the highest timestamp-value pair readFrom the curr field of a correct register.
Together, (a) and (b) imply that c is the highest timestamp-value pair readFrom
a correct register. Thus, for all registers Xi ∈ R (|R| ≥ t + 1), readFrom(c′, i) implies
that c′ = c and hence, c is safe. We now consider case 2, where write(c.val)
reads a view lower than k. This implies that c or a higher timestamp-value pair
is frozen in view k. If t + 1 registers in L were updated with c before they
are read, then they would report c either from their curr or their frozen field,
and clearly c would be safe. Therefore, c is missing from t + 1 correct registers.
Thus, write(c.val)'s write phase (lines 9–10) does not precede read_k's read
phase (lines 3–4). By the transitivity of the precedence relation, inc_k (line 2)
precedes get (line 11). By the safety of the counter, write(c.val) reads view k,
a contradiction.
Theorem 1 (Robustness). The algorithm in Figure 3 wait-free implements a
regular register.
Proof. Immediately follows from Lemmas 1 and 2.
Similar to the previous section, we describe an initial version of DMS3 that uses
a safe counter. The algorithm requires more rounds than the optimum but it
is easier to understand because most of its complexity is hidden in the counter
implementation. Then, we overview the changes necessary to obtain the optimal
algorithm. The full details of the optimized DMS3 such as the pseudocode and
proofs can be found in our technical report [19]. We proceed in a bottom-up
fashion and describe the counter implementation rst.
4.1
We present a safe counter with operations inc and get using 3t + 1 base objects
i ∈ {1 . . . n}, where t base objects can be subject to NR-arbitrary failures. The
types and shared objects used by the counter are depicted in Figure 5 and
the algorithm appears in Figure 6. Each base object i implements two regular
registers: a register Ti holding a timestamp written by get and read by inc, and
a second register Yi consisting of two fields pw and w, modified by inc and read
by get. While the pw field stores only the counter value, the w field stores the
counter value together with a high-resolution timestamp [20]. A high-resolution
timestamp is a timestamp array with n entries, one for each base object.
The get operation performs in two phases. The first phase reads from the
base objects until n − t registers Yi have responded and all responses are
non-conflicting. This condition is captured by the predicate conflict. When two base
Additional Types:
  TSs ≜ array of n Integers, Integers[n]
  TSsInt ≜ TSs × Integers, with selectors hrts (high-resolution timestamp)
    and cnt

Shared objects:
  - regular registers Yi ∈ Integers × TSsInt, with selectors pw and w,
    initially Yi = ⟨0, ⟨[0, . . . , 0], 0⟩⟩
  - regular registers Ti ∈ Integers, initially 0

Fig. 5. Types and shared objects of the safe counter (3t + 1)
objects i and j are in conflict, then at least one of them is malicious. In this
situation, the get operation can wait for more than n − t responses without
blocking, effectively filtering out responses from malicious base objects. Next, the
get operation uses the responses to build a candidate set from values appearing
in the w field of Yi. In the second phase, the get operation chooses an increasing
timestamp ts and overwrites n − t registers Ti with ts; at the same time it re-reads
the registers Yi until n − t of them have responded and there exists a candidate
to return. This condition is captured by the predicates safe and highCand. If no
candidate can be returned (because of overlapping inc operations), get returns
the initial counter value 0.
Similarly, the inc operation performs in two phases, a pre-write and a write
phase. The pre-write phase accesses n − t base objects i, overwriting the pw field
of Yi with an increasing counter value and reading the individual timestamps
stored in Ti into a single high-resolution timestamp. Subsequently, in the write
phase, inc stores the counter value together with the high-resolution timestamp
in the w field of n − t registers Yi and returns.
We now show that the algorithm in Figure 6 wait-free implements a safe
counter. We do this by showing that the following two properties are satisfied:
Validity: If get returns k, then get does not precede inc_k.
Safety: If inc_k precedes get and, for all l > k, get precedes inc_l, then get
returns k.
Lemma 3 (Validity). The counter object implemented in Figure 6 is valid.
Proof. If the initial value is returned, then we are done. Else, only a value c.cnt = k
is returned such that c is safe. This implies that t + 1 base objects report values
k or higher, either from their pw or w fields. As not all of them are faulty, there
exists a correct object Yi and a value l ≥ k such that l was indeed written to Yi.
As inc_k precedes inc_l (or it is the same operation) and get does not precede
inc_l, it follows that get does not precede inc_k.
Lemma 4 (Safety). The counter object implemented in Figure 6 is safe.
Proof. Let inc_k be the last operation preceding the invocation of get. Furthermore,
for all l > k, get precedes inc_l. By assumption, c.cnt = k was written to
Local variables (inc):
  cnt ∈ Integers, initially 0
  y ∈ Integers × TSsInt; hrts[1 . . . n] ∈ Integers ∪ {⊥}

inc()
1   cnt ← cnt + 1
2   y.pw ← cnt
3   for 1 ≤ i ≤ n do invoke hrts[i] ← write&read(Yi, y, Ti)
4   wait for n − t responses
5   y.w.hrts ← hrts
6   y.w.cnt ← cnt
7   for 1 ≤ i ≤ n do invoke write(Yi, y)
8   wait for n − t responses
9   return ack

Predicates (get):
  conflict(i, j) ≜ y[i].w.hrts[j] ≥ ts
  safe(c) ≜ |{i : max{PW[i]} ≥ c.cnt ∧ (∀c′ ∈ W[i] : c′.cnt ≤ c.cnt)}| > t
  highCand(c) ≜ c ∈ C ∧ (c.cnt = max{c′.cnt : c′ ∈ C})

Local variables (get):
  PW[1 . . . n] ∈ 2^Integers, W[1 . . . n] ∈ 2^TSsInt, C ∈ 2^TSsInt
  y[1 . . . n] ∈ Integers × TSsInt ∪ {⊥}
  ts ∈ Integers, initially 0

get()
10  for 1 ≤ i ≤ n do y[i] ← ⊥; PW[i] ← ∅; W[i] ← ∅
11  C ← ∅
12  ts ← ts + 1
13  for 1 ≤ i ≤ n do invoke y[i] ← read(Yi)
14  repeat check
15  until a set S of n − t objects responded ∧ ∀i, j ∈ S : ¬conflict(i, j)
16  C ← {y[i].w : |{j : y[j].w ≠ y[i].w}| ≤ 2t}
17  for 1 ≤ i ≤ n do invoke y[i] ← write&read(Ti, ts, Yi)
18  repeat
19    check
20    C ← C \ {c ∈ C : |{i : ∃c′ ∈ W[i] : c′ ≠ c}| ≥ 2t + 1}
21  until (n − t responded ∧ ∃c ∈ C : safe(c) ∧ highCand(c)) ∨ C = ∅
22  if C ≠ ∅ then return c.cnt else return 0

check:
  if Yi responded then
    PW[i] ← PW[i] ∪ {y[i].pw}
    W[i] ← W[i] ∪ {y[i].w}

Fig. 6. Safe counter algorithm (3t + 1)
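The conflict filter of get's first phase can be sketched as follows (a toy simplification with hypothetical names: the real get keeps waiting for further responses rather than returning a failure value). Object i's response carries a high-resolution timestamp; its entry for j cannot be at least get's fresh timestamp ts unless one of i, j misbehaves.

```python
# Illustrative check of get's conflict filter: a pair (i, j) conflicts when
# i's reported high-resolution timestamp entry for j is >= get's fresh ts.
def conflict(y, i, j, ts):
    return y[i]["w_hrts"][j] >= ts

def nonconflicting_set(y, ts, n, t):
    responded = [i for i in range(n) if y[i] is not None]
    ok = [i for i in responded
          if all(not conflict(y, i, j, ts) and not conflict(y, j, i, ts)
                 for j in responded if j != i)]
    return ok if len(ok) >= n - t else None  # None: keep waiting

# n = 3t+1 = 4, t = 1, get's fresh timestamp ts = 5.
honest = {"w_hrts": [0, 0, 0, 0]}
y = [honest, honest, honest, honest]
assert nonconflicting_set(y, 5, 4, 1) == [0, 1, 2, 3]

# Object 3 claims object 0 already saw a timestamp "from the future":
y_bad = [honest, honest, honest, {"w_hrts": [9, 0, 0, 0]}]
assert nonconflicting_set(y_bad, 5, 4, 1) is None   # must wait for more
```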
[Fig. 7: round diagrams of the safe counter operations — (a) get's rounds (read Yi; then write ts to Ti while re-reading Yi) and (b) inc's rounds (pre-write k to Yi while reading Ti; then write to Yi).]
Protocol Description
In this section we present a robust and amnesic SRSW register construction from
a safe counter and 3t + 1 regular base registers, out of which t can be subject to
NR-arbitrary failures. We now describe the write and read operations of the
DMS3 algorithm illustrated in Figure 8.
The write operation performs in three phases: (1) a pre-write phase (lines 7–9),
where it stores a timestamp-value pair c in the pw field of n − t registers, (2) a
read phase (line 10), where it calls get to read the current view, and (3) a write
phase (lines 14–16), where it overwrites the w field of n − t registers with c. If
the read phase results in a view change, the most recent value previously written
is frozen together with the new view. This is done by updating the view field
and copying the value stored in w to the frozen field (lines 11–13). The reader
performs exactly the same steps as in DMS (see Section 3).
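The three-phase structure of write can be sketched as follows (an illustrative skeleton with hypothetical names; quorum waiting and parallel invocation are elided, and `registers` stands in for the n base registers):

```python
# Skeleton of DMS3's three-phase write: pre-write, view read, write.
def dms3_write(x, ts, v, registers, get_view):
    ts += 1
    x["pw"] = (ts, v)
    for r in registers:           # pre-write phase
        r.update(x)
    new_view = get_view()         # read phase: read the current view
    if new_view > x["view"]:      # view change: freeze the previous value
        x["view"] = new_view
        x["frozen"] = x["w"]
    x["w"] = (ts, v)
    for r in registers:           # write phase
        r.update(x)
    return ts

x = {"pw": (0, "v0"), "w": (0, "v0"), "frozen": (0, "v0"), "view": 0}
regs = [dict() for _ in range(4)]
ts = dms3_write(x, 0, "a", regs, lambda: 0)   # no concurrent read
ts = dms3_write(x, ts, "b", regs, lambda: 1)  # view change: freeze (1, "a")
assert x["frozen"] == (1, "a") and x["w"] == (2, "b") and x["view"] == 1
```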
We now explain with the help of Figure 9 why reads are wait-free. Similar to the
description of DMS in Section 3, we consider read_k and the last write that
reads a view lower than k. Note that inc_k does not precede get and thus, c is
stored in the pw field of t + 1 correct registers before they are read. Also, the
w field of t + 1 correct registers is updated with c. As the subsequent write
encounters a view change, c is written to the frozen field of t + 1 correct registers,
where it stays until read_k completes. Hence, c is sampled from t + 1 correct
registers' pw, w or frozen field and thus it is safe. Note that c is also highestCand
because only faulty registers report newer values.
read()
1   for 1 ≤ i ≤ n do x[i] ← ⊥
2   view ← inc(Y)
3   for 1 ≤ i ≤ n do invoke x[i] ← read(Xi)
4   wait until n − t responded ∧ ∃c ∈ TSVals: safe(c) ∧ highestCand(c)
5   return c.val

Local variables (writer):
  ts, newView ∈ Integers, initially 0
  x ∈ TSVals³ × Integers, initially ⟨⟨0, v0⟩, ⟨0, v0⟩, ⟨0, v0⟩, 0⟩

write(v)
6   ts ← ts + 1
7   x.pw ← ⟨ts, v⟩
8   for 1 ≤ i ≤ n do invoke write(Xi, x)
9   wait for n − t responses
10  newView ← get(Y)
11  if newView > x.view then
12    x.view ← newView
13    x.frozen ← x.w
14  x.w ← ⟨ts, v⟩
15  for 1 ≤ i ≤ n do invoke write(Xi, x)
16  wait for n − t responses
17  return ack
Fig. 8. Robust and amnesic storage algorithm DMS3 (3t + 1)
[Fig. 9: space-time diagram of a DMS3 run showing read_k (with its inc_k) concurrent with a sequence of writes; c is pre-written to the pw field, then moves to w and, on the view change to k, to frozen, where it remains while read_k is in progress.]
the read phase are merged together. Overall, this results in a time-complexity
of three rounds for the write and two rounds for the read.
We now informally argue that the optimization is correctness-preserving. As
above, we consider read_k and the last write that reads a view lower than k.
We argue that t + 1 correct base registers have stored c in their pw field before
any of them is read. This would imply that c is safe. The fact that the write of
c.val reads a view lower than k implies that k is missing from at least 2t + 1 base
objects. We know from the safe counter algorithm in the previous section that
if only 2t base objects respond without k, then k is never removed from the set
of candidates. As the safe counter implementation is wait-free, k is eventually
read, contradicting the initial assumption. Therefore, 2t + 1 base objects respond
without k, and thus there are t + 1 correct base objects among them that are
accessed by (the read phase of) read_k only after c was pre-written to them. By
applying similar arguments as above, it is not difficult to see that c does not
disappear from any of the t + 1 correct base objects before read_k completes.
This implies that c eventually becomes safe. For a formal treatment we
refer the interested reader to our full paper [19]. The remainder of this section
is concerned with the correctness of DMS3.
Protocol Correctness
Lemma 6 (Regularity). Algorithm DMS3 in Figure 8 implements a regular
register.
Proof. Identical to the proof of Lemma 1.
Lemma 7 (Wait-freedom). Algorithm DMS3 in Figure 8 implements wait-free
read and write operations.
Proof. The write operation is nonblocking because it never waits for more than
n − t responses. To derive a contradiction, we assume that read_k blocks at line 4
and show that there exists a candidate for returning. We consider the time after
which all correct base objects (at least 2t + 1) have responded. We choose c
as the highest timestamp-value pair readFrom a correct register. Note that c is
highestCand by construction, because values with timestamps ≤ c.ts are readFrom
2t + 1 correct registers. In the following, we distinguish the cases where the view
read by the write of c.val is equal to k (case 1) or lower than k (case 2).
Note that by the validity of the counter, only views ≤ k are returned. Case
1: Let Xi be a correct register such that readFrom(c, i). Since by assumption
x[i].view = k, c is readFrom the frozen field of Xi. However, in view k only
timestamp-value pairs lower than c are frozen, a contradiction. Now we consider
case 2, where write(c.val) reads a view lower than k. This implies that
inc_k does not precede get. As the pre-write phase (lines 8–9) precedes get
(line 10), and inc_k (line 2) precedes the read phase (lines 3–4), by transitivity,
the pre-write phase also precedes the read phase (see Figure 9). Thus, t + 1
correct registers have stored c in their pw field before they are read. What is
left to show is that no subsequent write erases c from all fields of those t + 1
correct registers. Note that in view k, only timestamp-value pairs c or higher are
frozen. Thus, if c was stored in the w field of t + 1 correct registers before they
are read, then c would be safe. Hence, c is missing from t + 1 correct registers' w
field. Consequently, write(c.val)'s write phase (lines 15–16) does not precede
read_k's read phase (lines 3–4). By transitivity, the subsequent write reads
view k and freezes c. Note that c is erased from pw only after c was previously
stored in w (line 14). Furthermore, c is erased from w only after it was stored
in frozen (line 13). As k is the last view, by the validity of the safe counter, c is
never erased from frozen.
Theorem 3 (Robustness). Algorithm DMS3 in Figure 8 implements a robust
register.
Proof. Immediately follows from Lemmas 6 and 7.
Concluding Remarks
Acknowledgments
We thank Gregory Chockler, Felix Freiling, Marco Serafini and Jay Wylie for
many useful comments on an earlier version of this paper.
References
1. Lamport, L.: On interprocess communication. Part II: Algorithms. Distributed Computing 1(2), 86–101 (1986)
2. Abraham, I., Chockler, G., Keidar, I., Malkhi, D.: Byzantine disk paxos: optimal resilience with byzantine shared memory. Distributed Computing 18(5), 387–408 (2006)
3. Chockler, G., Malkhi, D.: Active disk paxos with infinitely many processes. Distributed Computing 18(1), 73–84 (2005)
4. Martin, J.P., Alvisi, L., Dahlin, M.: Minimal Byzantine Storage. In: Malkhi, D. (ed.) DISC 2002. LNCS, vol. 2508, pp. 311–325. Springer, Heidelberg (2002)
5. Jayanti, P., Chandra, T.D., Toueg, S.: Fault-tolerant wait-free shared objects. J. ACM 45(3), 451–500 (1998)
6. Herlihy, M.: Wait-free synchronization. ACM Trans. Program. Lang. Syst. 13(1), 124–149 (1991)
7. Chockler, G., Guerraoui, R., Keidar, I.: Amnesic Distributed Storage. In: Pelc, A. (ed.) DISC 2007. LNCS, vol. 4731, pp. 139–151. Springer, Heidelberg (2007)
8. Hendricks, J., Ganger, G.R., Reiter, M.K.: Low-overhead byzantine fault-tolerant storage. In: SOSP 2007: Proceedings of the twenty-first ACM SIGOPS Symposium on Operating Systems Principles, pp. 73–86. ACM, New York (2007)
9. Malkhi, D., Reiter, M.: Byzantine quorum systems. Distrib. Comput. 11(4), 203–213 (1998)
10. Guerraoui, R., Vukolic, M.: How fast can a very robust read be? In: PODC 2006: Proceedings of the twenty-fifth annual ACM Symposium on Principles of Distributed Computing, pp. 248–257. ACM, New York (2006)
11. Goodson, G.R., Wylie, J.J., Ganger, G.R., Reiter, M.K.: Efficient byzantine-tolerant erasure-coded storage. In: DSN 2004: Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN 2004), Washington, DC, USA, pp. 135–144. IEEE Computer Society, Los Alamitos (2004)
12. Guerraoui, R., Vukolic, M.: Refined quorum systems. In: PODC 2007: Proceedings of the twenty-sixth annual ACM Symposium on Principles of Distributed Computing, pp. 119–128 (2007)
13. Bazzi, R.A., Ding, Y.: Non-skipping timestamps for byzantine data storage systems. In: Guerraoui, R. (ed.) DISC 2004. LNCS, vol. 3274, pp. 405–419. Springer, Heidelberg (2004)
14. Aiyer, A., Alvisi, L., Bazzi, R.A.: Bounded wait-free implementation of optimally resilient byzantine storage without (unproven) cryptographic assumptions. In: Pelc, A. (ed.) DISC 2007. LNCS, vol. 4731, pp. 7–19. Springer, Heidelberg (2007)
15. Cachin, C., Tessaro, S.: Optimal resilience for erasure-coded byzantine distributed storage. In: DSN 2006: Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), Washington, DC, USA, pp. 115–124. IEEE Computer Society, Los Alamitos (2006)
16. Liskov, B., Rodrigues, R.: Tolerating byzantine faulty clients in a quorum system. In: ICDCS 2006: Proceedings of the 26th IEEE International Conference on Distributed Computing Systems, Washington, DC, USA, pp. 34–43. IEEE Computer Society, Los Alamitos (2006)
17. Guerraoui, R., Levy, R.R., Vukolic, M.: Lucky read/write access to robust atomic storage. In: DSN 2006: Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), pp. 125–136 (2006)
18. Abraham, I., Chockler, G., Keidar, I., Malkhi, D.: Wait-free regular storage from byzantine components. Inf. Process. Lett. 101(2) (2007)
19. Dobre, D., Majuntke, M., Suri, N.: On the time-complexity of robust and amnesic storage. Technical Report TR-TUD-DEEDS-04-01-2008, Technische Universität Darmstadt (2008), http://www.deeds.informatik.tu-darmstadt.de/dan/amnesicTR.pdf
20. Chockler, G., Guerraoui, R., Keidar, I., Vukolic, M.: Reliable distributed storage. IEEE Computer (to appear, 2008)
Introduction
The small world effect, or six degrees of separation, is the well-known property, observed in social networks [9,21], that any pair of nodes in these networks
is connected by a very short chain of acquaintances (typically polylogarithmic
in the size of the network) that, moreover, can be discovered locally. In the
literature, a small world graph can either refer to this property or to a graph
with polylogarithmic diameter and high clustering (see e.g. [23]). In this paper, a
small world graph refers to a graph of polylogarithmic diameter whose short
paths can be discovered locally, i.e., which is navigable. This surprising property
has gained a lot of interest recently since Kleinberg [17] introduced the first
analytical graph model for navigability, and because of its potential in the design
of large decentralized networks with efficient routing schemes. The model proposed
by Kleinberg in 2000 consists of a d-dimensional mesh augmented by one
extra random link in each node, distributed according to the d-harmonic distribution.
The local search is then modeled by greedy routing, which is the simple
algorithm that, at each node, forwards the message to the neighbor that is the
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 217–225, 2008.
© Springer-Verlag Berlin Heidelberg 2008
closest to the destination in the mesh. Kleinberg demonstrates that greedy routing
computes paths of expected length O(log² n) between any pair of nodes in
his model, with the only knowledge of the distances in the mesh: the augmented
mesh is navigable.
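Kleinberg's construction is easy to reproduce experimentally. The following toy sketch (one-dimensional for brevity, with illustrative parameters; not from the paper) augments a cycle with one 1-harmonic long-range contact per node and routes greedily towards the target:

```python
import random

# Toy 1-D Kleinberg-style experiment: a cycle of n nodes, each with one
# long-range contact drawn from the 1-harmonic distribution; greedy routing
# forwards to the neighbor closest to the target on the cycle.
def ring_dist(a, b, n):
    return min((a - b) % n, (b - a) % n)

def harmonic_contact(u, n, rng):
    others = [v for v in range(n) if v != u]
    weights = [1 / ring_dist(u, v, n) for v in others]
    return rng.choices(others, weights=weights)[0]

def greedy_route(src, dst, links, n):
    hops, u = 0, src
    while u != dst:
        neighbors = [(u - 1) % n, (u + 1) % n, links[u]]
        u = min(neighbors, key=lambda v: ring_dist(v, dst, n))
        hops += 1
    return hops

rng = random.Random(0)
n = 256
links = {u: harmonic_contact(u, n, rng) for u in range(n)}
hops = greedy_route(0, n // 2, links, n)
assert 0 < hops <= n // 2   # never worse than pure ring routing
```

Since one of the two ring neighbors is always strictly closer to the destination, the distance decreases at every hop, so greedy routing terminates; the long-range links are what make the typical hop count small.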
Following this seminal work, a major challenge was to extend this model to
larger classes of graphs than regular meshes, i.e., to determine which n-node
graphs G admit an augmentation with one link in each node such that greedy
routing, with the only knowledge of G, computes polylog(n)-length paths between
any pair in the augmented graph. Kleinberg [18] and Duchon et al. [7] showed that this
is possible for all graphs of bounded growth, i.e., where, for any node u and
radius r ≥ 1, the 2r-neighborhood of u is of size at most a constant times
its r-neighborhood. Fraigniaud [10] demonstrates that any bounded-treewidth
graph can also be augmented by one link per node to become navigable, and
Abraham and Gavoille [4] showed that, more generally, this is possible for all
graphs excluding a fixed minor. The definition of the problem can directly be
extended to metric spaces by asking which n-point metric spaces¹ M = (V, δ)
can be augmented by O(log n) links such that, in the resulting graph, greedy
routing computes polylog(n) routes between any pair with the only knowledge
of M. In this framework, Slivkins [22] showed that any doubling metric can be
augmented to become navigable. A doubling metric is a metric where, for all
r ≥ 1, any ball of radius 2r can be covered by at most C balls of radius r, for
some constant C.
However, it was recently proven by Fraigniaud et al. [13] that such an augmentation
is not possible for all graphs: there exists an infinite family of n-node
graphs on which any distribution of augmented links will leave greedy paths
of expected length Ω(n^(1/√log n)) for some pairs. The best upper bound valid for
arbitrary graphs, up to our knowledge, is an expected length O(n^(1/3)) between
any pair, due to Fraigniaud et al. [12], with some specific link augmentation.
The remaining gap between these two bounds is still open today and leaves a
question mark on the limiting characteristics of a metric for the navigability
augmentation.
Orthogonally to the navigability question, studies on embeddings of metric
spaces have known huge developments over the last decade (cf. Chapter 15 of [20]
for a review), due in particular to their applications in approximation algorithms [15]
and, more recently, in handling efficiently large decentralized networks [6]. An
embedding φ of a metric M = (V, δ) into a metric M′ = (V′, δ′) is an injective
function from V into V′. Its quality is characterized by the distortion it induces
on the distances. For the sake of simplicity, we consider only non-contracting
embeddings; we then say that φ has distortion ρ if and only if, for any u, v ∈ V,
δ′(φ(u), φ(v)) ≤ ρ · δ(u, v). Crucial networking problems like routing, resource
location or nearest neighbor search are easy to handle on a low-dimensional
Euclidean space. However, large real networks like the Internet do not present
such a simple structure. The increasing interest in metric embeddings
therefore comes partially from the fact that, if the embedding is of good quality, it
can provide a way to develop efficient algorithms on complex, or even arbitrary,
metric spaces, by solving problems on a simple metric space that approximates them
well (cf. e.g. [14,15]). In addition, many good quality embeddings are computed
by randomized local algorithms that only require a distance oracle, making
them particularly appropriate to the setting of large decentralized networks (cf.
e.g. [5] for a seminal example).
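On a finite metric, the distortion of a given non-contracting embedding can be checked by brute force over all pairs. The sketch below is our own illustration (the function names are ours); it computes the distortion ρ of a map φ into (R^d, ℓ_p):

```python
from itertools import combinations

def lp_norm(u, v, p):
    """||u - v||_p = (sum_i |u_i - v_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def distortion(points, delta, phi, p=2):
    """Distortion rho of a non-contracting embedding phi of the finite
    metric (points, delta) into (R^d, l_p): the smallest rho such that
    delta(u, v) <= ||phi(u) - phi(v)||_p <= rho * delta(u, v),
    i.e. the maximum expansion ratio over all pairs."""
    rho = 1.0
    for u, v in combinations(points, 2):
        ratio = lp_norm(phi[u], phi[v], p) / delta(u, v)
        assert ratio >= 1.0 - 1e-9, "phi contracts the pair (%s, %s)" % (u, v)
        rho = max(rho, ratio)
    return rho
```

For instance, the map i ↦ (2i) doubles every distance of the line metric, so its distortion is 2.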
In this paper, we propose a new way to tackle the augmented graph navigability problem through the metric embedding setting.
1.1 Our Contribution
We introduce a generalized augmentation process. The main feature of our augmentation process is to use an embedding of the input graph's shortest paths metric into a metric that is easy to augment into a navigable graph. This distinction
between the augmentation process in itself (handled on the easy metric) and
the structural characteristics of the input (captured by the embedding quality)
provides a new way to characterize the classes of navigable graphs. We consider
embeddings into (R^d, ℓ_p), the d-dimensional real space associated with the
ℓ_p norm, for d, p ≥ 1: for any u = (u_1, …, u_d) and v = (v_1, …, v_d), we
have ||u − v||_p = (∑_{i=1}^d |u_i − v_i|^p)^{1/p}. We prove the following theorem:
Theorem 1. Let p, n, ρ, d ≥ 1. For any ε > 0, any n-node graph G whose
shortest path metric M = (V, δ) admits an embedding φ of distortion ρ into (R^d, ℓ_p)
can be augmented with one link per node such that greedy routing in the resulting
graph computes paths of expected length O((1/ε) d 5^d ρ^d log^{2+ε} n) between any pair, using
only the knowledge of M.
For instance, using the recent embedding result of Abraham et al. [3], we get as
an immediate corollary that, for any 0 < γ ≤ 1 and any n ≥ 1, any n-node graph
G of doubling dimension D (cf. [14]) can be augmented so that the expected
length of every greedy path is O((log^{1+γ} n)^{O(D/γ)} log^{2+ε} n) = O((log n)^{O(D)}), using
only the knowledge of G. This provides a more direct proof of the fact that
graphs of bounded doubling dimension are navigable (proved in [22]).
Intuitively, if the considered metric is not too far from a metric M′ that can
be easily augmented, we use a low distortion embedding of the metric into M′,
draw the random links in M′, and then map these links back appropriately to
the original metric so that they remain useful shortcuts for greedy routing.
Moreover, the design of the augmented links in our process can be done in a
fully decentralized way and only requires knowing the embedding. In the case
where the chosen embedding is itself local (like, e.g., the seminal Bourgain embedding [5] if a distance oracle is available), we thus provide an algorithm that
locally adds one address to each routing table in a network and guarantees
decentralized routing with a small number of hops between any pair, for a large class of
input graphs.
In this section, we present our augmentation process, which adds one directed link
per node. This process is universal in the sense that it only requires as input
the base graph (arbitrary) and an embedding function of this graph into R^d_p, for
some p, d ≥ 1. Such a function exists for any graph, and the algorithm is
therefore not restricted to a specific graph class. However, as we will see in the next
section, the analysis of greedy routing might give a poor routing time if the
embedding is not of good quality. There exist lower bound results on the quality
of arbitrary metric embeddings. A typical example is that embedding some n-node
constant degree expander graphs into R^d_p requires distortion Ω(log n) [20] and
dimension d = Ω(log n) [2]. Nevertheless, expander graphs are always navigable
without any augmentation, given their polylogarithmic diameter.
The augmentation algorithm is based on the well-known augmentation of d-dimensional meshes of the Kleinberg model, where the shortcuts are distributed
according to the d-harmonic distribution. The idea is to map these links back
to the original set of nodes. Given that not all the extremities of the shortcuts
added in ℓ_p^d are images of the original nodes, this requires some careful rewiring.
Augmentation Process AP
Input: an n-node graph G = (V, E), an embedding φ of its shortest path
metric M = (V, δ) into ℓ_p^d, and a constant ε > 0;
Output: G augmented with one directed link at each node.
Begin
For each u ∈ V do
  Pick a point λ_u ∈ R^d_p with probability density
    (1/Z) · 1 / ( ||φ(u) − λ||_p^d · ln^{1+ε}(||φ(u) − λ||_p + e) ),
  over all λ ∈ R^d_p.
  Add a directed link from u to v ∈ V, where v is the node such that φ(v) is
  the closest point to λ_u in φ(V).
End.
Note: e stands here for exp(1) and is only used to allow the distance to be zero
in the formula. Z is the normalizing factor of the probability density described:
Z = ∫_{t>0} S(t) / (t^d ln^{1+ε}(t + e)) dt, where S(t) is the surface of a hypersphere of radius t in
R^d. Figure 1 illustrates the process AP.
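A minimal runnable sketch of AP, under assumptions of ours: we truncate the induced radial law to [1, R] (graph distances are at least 1; R is an assumed cut-off), sample the radius by rejection against a log-uniform proposal, and draw the direction uniformly for the Euclidean case p = 2. None of these implementation choices come from the paper, which specifies the density only up to Z.

```python
import math
import random

def sample_contact_point(center, eps, d, R, rng=random):
    """Draw a contact point around `center`. With S(t) ~ t^(d-1), AP's
    density 1/(Z ||.||^d ln^(1+eps)(||.|| + e)) induces a radius density
    proportional to 1/(r ln^(1+eps)(r + e)); we sample it on [1, R] by
    rejection against a log-uniform proposal (density ~ 1/r), then pick
    a uniform direction via normalized Gaussians (p = 2 assumed)."""
    while True:
        r = math.exp(rng.uniform(0.0, math.log(R)))          # density ~ 1/r
        accept = (math.log(1 + math.e) / math.log(r + math.e)) ** (1 + eps)
        if rng.random() < accept:
            break
    gauss = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(g * g for g in gauss)) or 1.0
    return tuple(c + r * g / norm for c, g in zip(center, gauss))

def augment(nodes, phi, eps, d, R, rng=random):
    """Mapping back: each node u draws a contact point around phi(u)
    and receives one directed link to the node v whose image phi(v)
    is closest to that point."""
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    long_link = {}
    for u in nodes:
        lam = sample_contact_point(phi[u], eps, d, R, rng)
        long_link[u] = min(nodes, key=lambda v: l2(phi[v], lam))
    return long_link
```

The returned map supplies exactly the `long_link` table that greedy routing consults, one directed shortcut per node.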
In this section, we demonstrate our main result. The intrinsic dimension [3], or
doubling dimension [1], of a graph G characterizes its geometric properties: it is
the minimum constant α such that any ball in G can be covered by at most 2^α
balls of half the radius. We show that, if a graph has a low intrinsic dimension,
the AP process provides augmented shortcuts that enable navigability. We have
the following theorem:
Theorem 2. Let p, n, ρ, d ≥ 1 and ε > 0, let G be an n-node graph, and let φ be an embedding
of distortion ρ of the shortest path metric M of G into (R^d, ℓ_p). Then greedy
routing in AP(G, φ, ε) computes paths of expected length at most O((1/ε) d 5^d ρ^d log^{2+ε} n)
between any pair, using only the information of the distances in M.
Proof. In order to analyze the performance of greedy routing in AP(G, φ, ε), we begin
by analyzing some technical properties of the probability distribution of the
chosen points in (R^d, ℓ_p). For any u ∈ G, we say that λ_u, as defined in algorithm
AP, is the contact point of u.
Let Z be the normalizing factor of the contact points distribution. We have:

  Z = ∫_{t>0} S(t) / ( t^d ln^{1+ε}(t + e) ) dt,

where S(t) stands for the surface of a sphere of radius t in R^d_p. This surface is
at most c_p · (2^d/(d−1)!) · t^{d−1}, where c_p > 1 is a constant depending on p. It
follows:

  Z ≤ c_p (2^d/(d−1)!) ∫_{t>1} dt / ( t ln^{1+ε}(t + e) ) ≤ c_p (2^d/(d−1)!) · (1 + e)/ε.
Let s and t ∈ G be the source and the target of greedy routing in AP(G, φ, ε).
Let M = (V, δ) be the shortest paths metric of G. Let v be the current node of
greedy routing, and let 1 ≤ i ≤ log δ(s, t) be such that δ(v, t) ∈ [2^{i−1}, 2^i).
Let X = ||φ(v) − φ(t)||_p, and let B be the ball of radius X/(4ρ) centered at φ(t) in ℓ_p^d.
We bound from below the probability P that the contact point of v falls in B, showing
that

  P = Ω( ε / ( d 5^d ρ^d ln^{1+ε}(2ρ δ(v, t) + e) ) ).

We have:

  P ≥ vol(B) / ( Z · ((1 + 1/(4ρ)) X)^d · ln^{1+ε}((1 + 1/(4ρ)) X + e) ),

since (1 + 1/(4ρ)) X is the largest distance from φ(v) to any point in B.
On the other hand, the volume of B is at least c′_p (2^d/d!) (X/(4ρ))^d, for some
constant c′_p > 0. We get:

  P ≥ (1/Z) · ( c′_p 2^d (X/(4ρ))^d ) / ( d! (1 + 1/(4ρ))^d X^d ln^{1+ε}((1 + 1/(4ρ)) X + e) )
    ≥ ( c′_p ε / (c_p (1 + e)) ) · 1 / ( d 5^d ρ^d ln^{1+ε}((1 + 1/(4ρ)) X + e) )
    = Ω( ε / ( d 5^d ρ^d ln^{1+ε}(2ρ δ(v, t) + e) ) ).
Claim. If the current node v of greedy routing satisfies δ(v, t) ∈ [2^{i−1}, 2^i) for
some 1 ≤ i ≤ log δ(s, t), then after O((1/ε) d 5^d ρ^d (i + 1)^{1+ε}) steps in expectation,
greedy routing is at distance less than 2^{i−1} from t.
Proof of the claim. Combining the bounds above, we get that, with probability
Ω( ε [d 5^d ρ^d ln^{1+ε}(ρ δ(v, t))]^{-1} ),
the long-range contact of v is at distance at most 2^{i−1} from t. If this does not occur,
greedy routing moves to a neighbor v′ at distance strictly less than δ(v, t) from t
and strictly greater than 2^{i−1}, and we can repeat the same argument. Therefore,
after O((1/ε) d 5^d ρ^d ln^{1+ε}(ρ δ(v, t))) = O((1/ε) d 5^d ρ^d (i + 1)^{1+ε}) steps, greedy routing is at distance less than 2^{i−1} from t with constant probability. □
Finally, from this last claim, the expected number of steps of greedy routing
from s to t is at most:

  ∑_{i=1}^{log δ(s,t)} O( (1/ε) d 5^d ρ^d (i + 1)^{1+ε} ) = O( (1/ε) d 5^d ρ^d log^{2+ε} n ).  □
From this theorem, results giving new insights into the navigability problem can
be derived from the very recent advances in metric embedding theory. In particular, graphs of bounded doubling dimension, which subsume graphs of bounded
growth, have received increasing interest recently. They are of particular interest
for scalable and distributed network applications since it is possible to decompose
them greedily into clusters of exponentially decreasing diameter.
Corollary 1. For any ε > 0, any n-node graph G of bounded doubling dimension α can be augmented with one link per node so that greedy routing computes
paths of expected length O((1/ε) log^{2+ε+2α} n) between any pair of vertices, using
only the knowledge of G.
Indeed, from Theorem 1.1 of [3], it is known that, for every n-point metric space
M of doubling dimension α and every γ ∈ (0, 1], there exists an embedding of M
into R^d_p with distortion O(log^{1+γ} n) and dimension d = O(α/γ). Taking γ = 1 gives
the corollary. This result was previously proved in [22] by another method of
augmentation, using rings of neighbors. The originality of our method is that
it is not specific to a given graph or metric class, this dependency lying only in
the embedding function. It therefore enables more direct proofs that a
graph can be augmented into a navigable small world than previous ones.
This new kind of augmentation process via embedding is also promising for deriving lower bounds on metric embedding quality. Indeed, since not all graphs can
be augmented to become navigable, necessarily, if there exists a positive result
on small-world augmentation via some embedding, then this embedding cannot
keep the same quality on all graphs. For the particular case of Theorem 2, we
derive that any injective function that embeds an arbitrary metric into R^d_p
in a spanner of this graph. They remarked that greedy routing usually requires
knowledge of the spanner map of distances in order to achieve efficient routing. On
the contrary, our augmentation process does not require greedy routing to be
aware of distances in R^d. This is due to the geography of the spaces considered:
an embedding of a graph in R^d preserves geographical neighboring regions.
Discussion
The result presented in this paper gives new perspectives on the understanding
of small-world augmentations of networks. Indeed, the augmentation process AP
isolates all the dependencies on the graph structure in the embedding function.
On the other hand, such an augmentation process focuses on the geography
of the graph and cannot capture the augmentation processes that are based on
graph separator decompositions. Two main kinds of augmentation processes can be distinguished in the navigable networks literature. One kind relies on the graph density and its similarity with a mesh (like the augmentations
in [7,17,18,22]), while the other kind relies on the existence of good separators in
the graph (like the augmentations in [4,10]). Augmentation via embedding cannot
be directly extended to augmentations using separators because of the difficulty
of handling the distortion in the analysis of greedy routing. Finally, the extension
of AP to graphs that are close to a tree metric (using embeddings into tree metrics) could open the path to an exhaustive characterization of the graph classes that
can be augmented to become navigable, as well as provide new lower and upper
bounds on embeddings as side results. More generally, the exhaustive characterization of the graphs that can be augmented to become navigable is still an
important open problem, as is the design of good quality embeddings into
low dimensional spaces.
References
1. Assouad, P.: Plongements lipschitziens dans R^n. Bull. Soc. Math. France 111(4), 429–448 (1983)
2. Abraham, I., Bartal, Y., Neiman, O.: Advances in metric embedding theory. In: Proceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pp. 271–286 (2006)
3. Abraham, I., Bartal, Y., Neiman, O.: Embedding Metric Spaces in their Intrinsic Dimension. In: Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 363–372 (2008)
4. Abraham, I., Gavoille, C.: Object location using path separators. In: Proceedings of the 25th Annual ACM Symposium on Principles of Distributed Computing (PODC), pp. 188–197 (2006)
5. Bourgain, J.: On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics 52, 46–52 (1985)
6. Dabek, F., Cox, R., Kaashoek, F., Morris, R.: Vivaldi: A decentralized network coordinate system. In: ACM SIGCOMM (2004)
7. Duchon, P., Hanusse, N., Lebhar, E., Schabanel, N.: Could any graph be turned into a small-world? Theoretical Computer Science 355(1), 96–103 (2006)
8. Duchon, P., Hanusse, N., Lebhar, E., Schabanel, N.: Towards small world emergence. In: 18th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 225–232 (2006)
9. Dodds, P.S., Muhamad, R., Watts, D.J.: An experimental study of search in global social networks. Science 301, 827–829 (2003)
10. Fraigniaud, P.: Greedy routing in tree-decomposed graphs: a new perspective on the small-world phenomenon. In: Brodal, G.S., Leonardi, S. (eds.) ESA 2005. LNCS, vol. 3669, pp. 791–802. Springer, Heidelberg (2005)
11. Fraigniaud, P., Gavoille, C.: Polylogarithmic network navigability using compact metrics with small stretch. In: 20th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 62–69 (2008)
12. Fraigniaud, P., Gavoille, C., Kosowski, A., Lebhar, E., Lotker, Z.: Universal Augmentation Schemes for Network Navigability: Overcoming the √n-Barrier. In: Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 1–7 (2007)
13. Fraigniaud, P., Lebhar, E., Lotker, Z.: A Doubling Dimension Threshold Θ(log log n) for Augmented Graph Navigability. In: Azar, Y., Erlebach, T. (eds.) ESA 2006. LNCS, vol. 4168, pp. 376–386. Springer, Heidelberg (2006)
14. Gupta, A., Krauthgamer, R., Lee, J.R.: Bounded geometries, fractals, and low-distortion embeddings. In: Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 534–543 (2003)
15. Indyk, P.: Algorithmic aspects of geometric embeddings. In: Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (FOCS) (2001)
16. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics 26, 189–206 (1984)
17. Kleinberg, J.: The Small-World Phenomenon: An Algorithmic Perspective. In: 32nd ACM Symposium on Theory of Computing (STOC), pp. 163–170 (2000)
18. Kleinberg, J.: Small-World Phenomena and the Dynamics of Information. Advances in Neural Information Processing Systems (NIPS) 14 (2001)
19. Kleinberg, J.: Complex networks and decentralized search algorithms. In: International Congress of Mathematicians (ICM) (2006)
20. Matousek, J.: Lectures on Discrete Geometry. Graduate Texts in Mathematics, vol. 212. Springer, Heidelberg (2002)
21. Milgram, S.: The Small-World Problem. Psychology Today, 60–67 (1967)
22. Slivkins, A.: Distance estimation and object location via rings of neighbors. In: 24th Annual ACM Symposium on Principles of Distributed Computing (PODC), pp. 41–50 (2005)
23. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature 393, 440–442 (1998)
Abstract. The aim of a software transactional memory (STM) system is to facilitate the delicate problem of low-level concurrency management, i.e., the design of
programs made up of processes/threads that concurrently access shared objects.
To that end, an STM system allows a programmer to write transactions accessing shared objects, without having to take care of the fact that these objects are
concurrently accessed: the programmer is discharged from the delicate problem
of concurrency management. Given a transaction, the STM system commits or
aborts it. Ideally, it has to be efficient (this is measured by the number of transactions committed per time unit), while ensuring that as few transactions as possible
are aborted. From a safety point of view (the one addressed in this paper), an STM
system has to ensure that, whatever its fate (commit or abort), each transaction
always operates on a consistent state.
STM systems have recently received a lot of attention. Among the proposed
solutions, lock-based systems and clock-based systems have been particularly investigated. Their design is mainly efficiency-oriented, the properties they satisfy
are not always clearly stated, and few of them are formally proved. This paper
presents a lock-based STM system designed from simple basic principles. Its
main features are the following: it (1) uses visible reads, (2) does not require the
shared memory to manage several versions of each object, (3) uses neither timestamps nor version numbers, (4) satisfies the opacity safety property, (5) aborts
a transaction only when it conflicts with some other live transaction (progressiveness property), (6) never aborts a write-only transaction, (7) employs only
bounded control variables, (8) has no centralized contention point, and (9) is formally proved correct.
Keywords: Atomic operation, Commit/abort, Concurrency control, Consistent
global state, Lock, Opacity, Progressiveness, Shared object, Software transactional memory, Transaction.
1 Introduction
Software transactional memory. Recent advances in technology, and more particularly
in multicore processors, have given new momentum to practical and theoretical research in concurrency and synchronization. Software transactional memory (STM)
constitutes one of the most visible domains impacted by these advances. Given that concurrent processes (or threads) that share data structures (base objects) have to synchronize, the transactional memory concept originates from the observation that traditional
T.P. Baker, A. Bui, and S. Tixeuil (Eds.): OPODIS 2008, LNCS 5401, pp. 226–245, 2008.
© Springer-Verlag Berlin Heidelberg 2008
lock-based solutions have inherent drawbacks. On one side, if the set of data whose accesses are controlled by a single lock is too large (coarse grain), parallelism can be
drastically reduced, while, on the other side, the solutions where a lock is associated with
each datum (fine grain) are difficult to master, error-prone, and difficult to prove correct.
The software transactional memory (STM) approach was proposed in [23].
Considering a set of sequential processes that access shared objects, it consists in
decomposing each process into (a sequence of) transactions (plus possibly some parts
of code not embedded in transactions). This is the job of the programmer. The job of
the STM system is then to ensure that the transactions are executed as if each were
an atomic operation (it would make little sense to move the complexity of concurrent programming from the fine management of locks to intricate decompositions into
transactions). So, basically, the STM approach is a structuring approach. (STM borrows
ideas from database transactions; there are nevertheless fundamental differences with
database transactions, which are examined in [10].)
Of course, as in database transactions, the fate of a transaction is to abort or commit.
(Depending on its aim, it is then up to the issuing process to restart, or not, an aborted
transaction.) The great challenge any STM system has to take up is consequently to be
efficient (the more transactions are executed per time unit, the better), while ensuring
that few transactions are aborted. This is a fundamental issue each STM system has to
address. Moreover, in the case where a transaction is executed alone (no concurrency)
or in the absence of conflicting transactions, it should not be aborted. Two transactions
conflict if they access the same object and one of them modifies that object.
Consistency of an STM. In recent years, several STM concepts have been proposed, and numerous STM systems have been designed and analyzed. On the correctness side (safety), an important notion that has been introduced very recently is the
concept of opacity. That concept, introduced and formalized by Guerraoui and Kapałka
[12], is a consistency criterion suited to STM executions. Its aim is to render aborted
transactions harmless.
The classical consistency criterion for database transactions is serializability [19]
(sometimes strengthened into strict serializability, as implemented when using the two-phase locking mechanism). The serializability consistency criterion involves only the
transactions that are committed. Said differently, a transaction that aborts is not prevented from accessing an inconsistent state before aborting. In an STM system, the code
encapsulated in a transaction can be any piece of code, and consequently a transaction
has to always operate on a consistent state. To be more explicit, let us consider the following example where a transaction contains the statement x ← a/(b − c) (where a, b
and c are integer data), and let us assume that b − c is different from 0 in all the consistent states. If the values of b and c read by a transaction come from different states, it is
possible that the transaction obtains values such that b = c (and b = c defines an inconsistent state). If this occurs, the transaction raises an exception that has to be handled
by the process that invoked the corresponding transaction¹. Such bad behaviors have to
be prevented in STM systems: whatever its fate (commit or abort), a transaction has to
¹ Even worse undesirable behaviors can be obtained when reading values from inconsistent
states. This occurs for example when an inconsistent state provides a transaction with values
that generate infinite loops.
always see a consistent state of the data it accesses. The important point here is that a
transaction can (a priori) be any piece of code (involving shared data); it is not restricted
to predefined patterns. This also motivates the design of STM protocols that reduce the
number of aborts (even if this entails a slightly lower throughput for short transactions).
Roughly speaking, opacity extends serializability to all the transactions (regardless of
whether they are committed or aborted). Of course, a committed transaction is considered entirely. Differently, only an appropriately defined subset of an aborted transaction
has to be considered.
Opacity (like serializability) states only what a correct execution is; it is a safety
property. It does not state when a transaction has to commit, i.e., it is not a liveness
property. Several types of liveness properties are investigated in [22].
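The division hazard described above can be made concrete with a small simulation (our own illustration; the states and values are invented for the example): a reader whose two reads straddle a concurrent update computes with a pair of values that belongs to no consistent state.

```python
a = 10
states = [
    {"b": 5, "c": 3},   # consistent state: b - c != 0
    {"b": 7, "c": 5},   # consistent state: b - c != 0
]

def read_without_opacity():
    """The two reads span two consistent states: b comes from the first
    state, c from the second. The pair (b, c) = (5, 5) belongs to no
    consistent state, and a / (b - c) divides by zero."""
    b = states[0]["b"]          # read before a concurrent update
    c = states[1]["c"]          # read after the update
    try:
        return a / (b - c)
    except ZeroDivisionError:
        return None             # the failure an opaque STM rules out

def read_with_opacity():
    """With opacity, both reads come from one consistent state."""
    snap = states[0]
    return a / (snap["b"] - snap["c"])
```

Only the second reader is guaranteed to observe a pair (b, c) with b − c ≠ 0.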
Context of the work. Among the numerous STM systems that have been designed in
the past years, only four of them are considered here, namely JVSTM [8], TL2 [9],
LSA-RT [21], and RSTM [16]. This choice is motivated by (1) the fact that (differently
from a lot of other STM systems) they all satisfy the opacity property, and (2) the additional
properties that can be associated with STM systems.
Before introducing these properties, we first consider the underlying mechanisms on
which the design of STM systems is based.
From an operational point of view, locks and (physical or logical) clocks constitute the base synchronization mechanisms used in a lot of STM systems. Locks allow
mutex-based solutions. Clocks allow the system to benefit from the progress of (physical
or logical) time in order to facilitate the validation test when it has to decide the fate (commit or abort) of a transaction. As a clock can always increase,
clock-based systems require appropriate management of the clock values.
An important design principle that differentiates STM systems is the way they implement base objects. More specifically, two types of implementation of base objects can be distinguished, namely single-version implementations and multi-version implementations. The aim
of the latter is to allow more (mainly read-only) transactions to commit, but
it requires paying a higher price in shared memory occupation.
An STM implementation can also be characterized by whether or not it satisfies
important additional properties. We consider here the progressiveness property.
The progressiveness notion, introduced in [12], is a safety property from the commit/abort termination point of view: it defines an execution pattern that forbids a
transaction from aborting another one.
As already indicated, two transactions conflict if they access the same base
object and one of them updates it. The STM system satisfies the progressiveness
property if it forcefully aborts a transaction T1 only when there is a time t at which T1 conflicts
with another concurrent transaction (say T2) that is not committed or aborted by
time t [12]. This means that, in all the other cases, T1 cannot be aborted due to T2.
As an example, let us consider Figure 1, where two patterns are depicted. Both
involve the same conflicting concurrent transactions: T1, which reads X, and T2, which
writes X (each transaction execution is encapsulated in a rectangle). On the left
[Figure 1. Two execution patterns for the conflicting transactions T1 (which reads X) and T2 (which writes X). Each execution is delimited by its begin and end events (B_T1, E_T1, B_T2, E_T2); time t marks the read of X by T1. Left: T2 is still live at time t. Right: T2 has terminated by time t.]
side, T2 has not yet terminated when T1 reads X. In that case, an STM system
that aborts T1, due to its conflict with T2, does not violate the progressiveness
property. Differently, when we consider the right side, T2 has terminated when T1
reads X. In that case, an STM system that guarantees the progressiveness property
cannot abort T1 due to T2.
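The progressiveness condition can be phrased as a predicate over transaction intervals. The sketch below is our own rendering (the operation-set and interval encodings are assumptions, not the paper's notation):

```python
def conflicts(t1_ops, t2_ops):
    """Two transactions conflict if they access a common base object
    and at least one of the two accesses is a write."""
    shared = {o for (o, _) in t1_ops} & {o for (o, _) in t2_ops}
    return any((o, "w") in t1_ops or (o, "w") in t2_ops for o in shared)

def may_force_abort(t1, t2, t):
    """Progressiveness: T1 may be forcefully aborted due to T2 only if,
    at some time t, T1 conflicts with T2 and T2 is still live (neither
    committed nor aborted) at t. A transaction is encoded as a dict
    {'ops': {(obj, 'r'|'w'), ...}, 'begin': ..., 'end': ...},
    where 'end' is the commit/abort time."""
    t2_live_at_t = t2["begin"] <= t < t2["end"]
    return conflicts(t1["ops"], t2["ops"]) and t2_live_at_t
```

Evaluated on the two patterns of Figure 1, the predicate allows the abort in the left pattern (T2 live at time t) and forbids it in the right one (T2 already terminated).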
Finally, a last criterion to compare STM systems lies in the way they cope with lower
bound results related to the cost of read and write operations.
Let k be the number of objects shared by a set of transactions. A theorem proved in
[12] states the following: for any STM protocol that (1) ensures the opacity consistency criterion, (2) is based on single-version objects, (3) implements invisible read
operations, and (4) ensures the progressiveness property, each read/write operation
issued by a transaction requires Ω(k) computation steps in the worst case. This
theorem shows an inescapable cost associated with the implementation of invisible
read operations as soon as we want single-version objects and aborts only due to
conflicts with a live transaction.
Considering the previous list of items (base mechanisms, number of versions, additional properties, lower bound), Table 1 indicates how each of TL2, LSA-RT,
JVSTM, and RSTM behaves. While traditional comparisons of STM systems are based
on efficiency measurements (usually from benchmark-based simulations), this table
provides a different view for comparing STM systems. A read operation issued by a transaction is invisible if its implementation does not entail updates of the underlying control
variables (kept in shared memory). Otherwise, the read is visible.
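The distinction can be illustrated with a toy base object (a sketch of ours, not the protocol presented later in the paper): a visible read registers the reader in a shared control variable that writers can consult, while an invisible read leaves no trace in shared memory.

```python
class SharedObject:
    """A base object with a value and a shared readers set.
    The readers set is the control variable that makes reads visible."""
    def __init__(self, value):
        self.value = value
        self.readers = set()        # shared control variable

    def visible_read(self, tx_id):
        # The read updates shared control state: another transaction
        # (e.g., a writer about to commit) can see who is reading.
        self.readers.add(tx_id)
        return self.value

    def invisible_read(self, tx_id):
        # No shared control variable is updated: writers cannot detect
        # this reader, which is why the Omega(k) validation cost arises
        # when opacity, progressiveness, and a single version are all
        # required together.
        return self.value
```
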
Content of the paper. The Ω(k) lower bound states an inherent cost for the STM systems that want to ensure invisible read operations and progressiveness while using a
single version per object. Looking at Table 1, we see that, while both TL2 and
JVSTM implement invisible read operations, each circumvents the Ω(k) lower bound
in its own way. JVSTM uses several copies of each object and does not ensure the progressiveness property. TL2 does not ensure the progressiveness property either (there are
even scenarios in which a transaction is directed to abort despite the fact that it has read
consistent values).
Progressiveness is a noteworthy safety property. As already indicated, it states circumstances under which transactions must commit². Considering progressiveness as a first-class property, this paper consequently presents a new STM system that circumvents
the Ω(k) lower bound and satisfies the progressiveness property. To that end it employs
² This can be particularly attractive when there are long-lived read-only transactions.
System                            TL2 [9]  LSA-RT [20]  JVSTM [8]  RSTM [16]  This paper
Clock-free                        no       no           no         yes        yes
Lock-based                        yes      no           yes        no         yes
Single version                    yes      no           no         yes        yes
Invisible read operations         yes      yes          yes        no         no
Progressiveness                   no       yes          no         yes        yes
Circumvents the Ω(k) lower bound  yes      no           yes        no         yes
a single version per object and implements visible read operations. Moreover, differently from nearly all the STM systems proposed so far, whose designs have been driven
mainly by implementation concerns and efficiency, the paper strives for a protocol with
powerful properties that can be formally proved. Its formal proof gives us a deeper
understanding of how the protocol works and why it works. Combined with existing
protocols, it consequently enriches our understanding of STM systems.
Finally, let us notice that the proposed protocol exhibits an interesting property related to contention management. The shared control variables associated with each object X (it is their very existence that makes the read operations visible) can be used
by an underlying contention manager [11,24]. If the contention manager is called when
a transaction is about to commit, it can benefit from the content of these variables to
decide whether to accept the commit or to abort the transaction, in case this abort would
allow more transactions to commit.
Roadmap. The paper is made up of six sections. Section 2 describes the computation
model and the safety property we are interested in (opacity [12]). The proposed protocol is presented incrementally. A base protocol is first presented in Section 3. This STM
protocol (also called STM system in the following) associates a lock and two atomic
control variables (sets) with each object X. It also uses a global control variable (a set
denoted OW) that is accessed by all the update transactions (when they try to commit).
Section 4 presents a formal proof of the protocol. Then, Section 5 presents the final version of the protocol. The resulting STM system has the following noteworthy features:
it (1) does not require the shared memory to manage several versions of each object,
(2) does not use timestamps, (3) satisfies the opacity and progressiveness properties,
(4) never aborts a write-only transaction, (5) employs only bounded control variables,
and (6) has no centralized contention point. The design of provable STM protocols is
an important issue for researchers interested in the foundations of STM systems [3].
Finally, Section 6 concludes the paper.
The base computation model consists of sequential processes (also called threads) denoted p1, ..., pn (a process is also sometimes denoted p) that cooperate through base read/write atomic registers and locks. The
shared objects are denoted with upper case letters (e.g., the base object X). A lock, with
its classical mutex semantics, is associated with each base object X.
Each process p has a local memory (a memory that can be accessed only by p).
Variables in local memory are denoted with lower case letters indexed by the process id
(e.g., lrs_i is a local variable of p_i).
High (user) abstraction level: transactions. From a structural point of view, at the user
abstraction level, each process is made up of a sequence of transactions (plus some code
managing these transactions). A transaction is a sequential piece of code (a computation
unit) that reads/writes base objects and does local computations. At the abstraction level
at which the transactions are defined, a transaction sees only base objects; it sees neither
the atomic registers nor the locks. (Atomic registers and locks are used by the STM
system to correctly implement transactions on top of the base model.)
A transaction can be a read-only transaction (it then only reads base objects) or an
update transaction (it then modifies at least one base object). A write-only transaction
is an update transaction that does not read base objects. A transaction is assumed to be
executed entirely (commit) or not at all (abort). If a transaction is aborted, it is up to
the invoking process to re-issue it (as a new transaction) or not. Each transaction has its
own identifier, and the set of transactions can be infinite.
2.2 Problem Specification
Intuitively, the STM problem consists in designing (on top of the base computation
model) protocols that ensure that, whatever the base objects they access, the transactions
are correctly executed. The following property formulates precisely what "correctly
executed" means in this paper.
Safety property. Given a run of an STM system, let C be the set of transactions that
commit, and A the set of transactions that abort. Let us assume that any transaction T
starts with an invocation event (B_T) and terminates with an end event (E_T).
Given T ∈ A, let T′ = ρ(T) be the transaction built from T as follows (ρ stands for reduced). As T has been aborted, there is a read or a write on a base object that entailed that abortion. Let prefix(T) be the prefix of T that includes all the read and write operations on the base objects accessed by T until (but excluding) the read or write that provoked the abort of T. T′ = ρ(T) is obtained from prefix(T) by replacing its write operations on base objects, and all the subsequent read operations on these objects, by corresponding write and read operations on a copy in local memory. The idea here is that only an appropriate prefix of an aborted transaction is considered: its write operations on base objects (and the subsequent read operations) are made fictitious in T′ = ρ(T). Finally, let A′ = {T′ | T′ = ρ(T) ∧ T ∈ A}.
As announced in the Introduction, the safety property considered in this paper is
opacity (introduced in [12] with a different formalism). It expresses the fact that a transaction never sees an inconsistent state of the base objects. With the previous notation, it
can be stated as follows:
(Figure 2 here: on the left, transactions T1 (reading X then Y), T2 (writing Y) and T3 (writing X); on the right, T2 and T3 merged into a single transaction T4 writing Y then X.)
An example explaining the meaning of FBDX is described in Figure 2. On the left side, the execution of three transactions is depicted (as before, each rectangle encapsulates a transaction execution). T1 starts by reading X, executes local computation, and then reads Y. The execution of T1 overlaps with two transactions: T2, which is a simple write of Y, followed by T3, which is a simple write of X. It is easy to see that the execution of these three transactions can be linearized: first T2, then T1, and finally T3. In this execution, FBDX does not include T1.
In the execution on the right side, T2 and T3 are combined to form a single transaction T4. It is easy to see that this concurrent execution of T1 and T4 cannot be linearized. Due to its access to X, the STM system (as we will see) will force T4 to add T1 to FBDY, entailing the abort of T1 when T1 accesses Y (if T1 did not access Y, it would not be aborted). Let us observe that the same thing occurs if, instead of T4, we have (with the same timing) a transaction made up of X.write() followed by another transaction including Y.write().
The STM system also uses the following local variables (kept in the local memory
of the process that invoked the corresponding transaction). lrsT is a local set where T
keeps the ids of all the objects it reads. Similarly, lwsT is a local set where T keeps the
ids of all the objects it writes. Finally, read onlyT is a boolean variable initialized to
true.
The previous shared sets can be efficiently implemented using Bloom filters (e.g., [2,7,17]). Interestingly, the small probability of false positives on membership queries does not make the protocol incorrect: it can only affect its efficiency by entailing unnecessary aborts.
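To illustrate why false positives are harmless here, the following minimal Bloom filter is a sketch of ours (not the paper's implementation; the size, hash count, and hash choice are arbitrary assumptions). An element that was added is always reported present, so a real conflict is never missed; a false positive merely aborts a transaction that could have committed.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector packed into one integer

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May answer True for an item that was never added (false positive):
        # for RS_X, FBD_X or OW this only entails an unnecessary abort.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

rs_x = BloomFilter()
rs_x.add("T1")
assert "T1" in rs_x  # a transaction that read X is never missed
```

The safety argument rests only on the absence of false negatives: membership of a transaction that really belongs to the set is always detected.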
Let us recall that a process is sequential and consequently executes transactions one
after the other. As local control variables are associated with a transaction, the corresponding process has to reset them to their initial values between two transactions. Similarly, if a transaction creates a local copy of an object, that copy is destroyed when the
transaction terminates (a given copy of an object is meaningful for one transaction only).
3.3 The Algorithms of the STM System
The three operations that constitute the STM system X.readT (), X.writeT (v), and
try to commitT (), are described in Figure 3.
The operation X.readT (). The algorithm implementing this operation is pretty simple.
If there is a local copy of X, its value is returned (lines 01 and 07). Otherwise, space
for X is allocated in the local memory (line 02), X is added to the set of objects read
by T (line 03), T is added to the read set RSX of X, and the current value of X is read
from the shared memory and saved in the local memory (line 04).
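Under illustrative names of our own (a small Transaction record, a per-object lock; line numbers refer to the paper's Figure 3), the read path just described can be sketched as follows. The abort test of lines 05-06 against FBDX is omitted here.

```python
import threading

class SharedObject:
    """A base object X with its lock and control sets RS_X / FBD_X."""
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()
        self.RS = set()    # RS_X: ids of transactions that read X
        self.FBD = set()   # FBD_X: ids of transactions forbidden on X

class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.lrs = set()        # lrs_T: objects read by T
        self.local_copy = {}    # per-transaction copies of objects

def read(T, X):
    """Sketch of X.read_T() (the abort test of lines 05-06 is omitted)."""
    if X in T.local_copy:          # line 01: a local copy already exists
        return T.local_copy[X]     # line 07: return its value
    with X.lock:
        T.lrs.add(X)               # line 03: record that T reads X
        X.RS.add(T.tid)            # line 04: T joins RS_X ...
        T.local_copy[X] = X.value  # ... and copies X into local memory
    return T.local_copy[X]

X = SharedObject(42)
T1 = Transaction("T1")
assert read(T1, X) == 42 and "T1" in X.RS
```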
If the transaction T is an update transaction, try to commitT () first locks all the objects accessed by T (line 14). (In order to prevent deadlocks, it is assumed that these
objects are locked according to a predefined total order, e.g., their identity order.) Then,
T checks if it belongs to the set OW. If this is the case, there is a read-write conflict: T has read an object that since then has been overwritten. T consequently aborts (after having released all the locks, line 15). If the predicate T ∈ OW is false, T will necessarily commit. But, before committing (at line 20), T has to update the control variables to indicate possible conflicts due to the objects it has written, the ids of which have been kept by T in the local set lwsT during its execution.
So, after it has updated the shared memory with the new value of each object X ∈ lwsT (line 16), T computes the union of their read sets; this union contains all the transactions that will have a write/read conflict with T when they read an object X ∈ lwsT. This union set is consequently added to OW (line 17), and the set FBDX of each object X ∈ lwsT is updated to OW (line 18). (It is important to notice that each set FBDX is updated to OW in order not to miss the transitive conflict dependencies that have possibly been created by other transactions.) Moreover, as the past read/write conflicts are now memorized in FBDX (line 18), the transaction T resets RSX to ∅ just after it has set FBDX to OW. Finally, before committing, T releases all its locks (line 19).
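With the same illustrative names as before (ours, not the paper's; OW is modeled as a plain set and line numbers refer to Figure 3), the commit path of an update transaction can be sketched as:

```python
import threading

class SharedObject:
    next_oid = 0
    def __init__(self, value):
        self.oid = SharedObject.next_oid     # identity used as lock order
        SharedObject.next_oid += 1
        self.value = value
        self.lock = threading.Lock()
        self.RS, self.FBD = set(), set()

class Transaction:
    def __init__(self, tid):
        self.tid, self.lrs, self.lws, self.local_copy = tid, set(), set(), {}

def try_to_commit(T, OW):
    """Sketch of try_to_commit_T() for an update transaction (lines 14-20)."""
    objs = sorted(T.lrs | T.lws, key=lambda o: o.oid)  # line 14: lock in a
    for X in objs:                                     # predefined total order
        X.lock.acquire()                               # (prevents deadlocks)
    try:
        if T.tid in OW:                                # line 15: read-write
            return "abort"                             # conflict: T aborts
        for X in T.lws:
            X.value = T.local_copy[X]                  # line 16: publish writes
        conflicts = set().union(set(), *(X.RS for X in T.lws))
        OW |= conflicts                                # line 17: OW <- OW U ...
        for X in T.lws:
            X.FBD = set(OW)                            # line 18: FBD_X <- OW,
            X.RS = set()                               #          RS_X  <- empty
        return "commit"                                # line 20
    finally:
        for X in reversed(objs):                       # lines 15/19: release
            X.lock.release()

X = SharedObject(0)
T = Transaction("T"); T.lws.add(X); T.local_copy[X] = 7
X.RS = {"R1"}                           # some transaction R1 has read X
assert try_to_commit(T, set()) == "commit" and X.value == 7
```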
On locking. As in TL2 [9], it is possible to adopt the following systematic abort strategy.
When a transaction T tries to lock an object that is currently locked, it immediately
aborts (after releasing the locks it has, if any).
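This systematic abort strategy can be sketched as follows (a sketch under our own API assumptions, not TL2's code): try each lock without blocking and, on the first busy lock, release everything already held and report failure, letting the caller abort the transaction.

```python
import threading

def lock_all_or_abort(locks):
    """Try to acquire every lock without blocking; on a busy lock,
    release what is held and signal the caller to abort."""
    acquired = []
    for lk in locks:
        if lk.acquire(blocking=False):   # never wait on a busy lock
            acquired.append(lk)
        else:
            for held in reversed(acquired):
                held.release()           # give back what we hold
            return False                 # caller aborts the transaction
    return True

l1, l2 = threading.Lock(), threading.Lock()
assert lock_all_or_abort([l1, l2])       # both free: all acquired
assert not lock_all_or_abort([l1, l2])   # l1 busy: immediate abort
```

Because no invocation ever waits on a lock, this strategy also removes the need for the predefined locking order used to prevent deadlocks.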
3.4 On the Management of the Sets RSX , FBDX and OW
Let us recall that these sets are kept in atomic variables.
Management of RSX and FBDX. The set RSX is written only at line 04 (readT() operation) and reset to ∅ at line 18 (try to commitT() operation), and (due to the lock associated with X) no two updates of RSX can be concurrent; so, no update of RSX is missed. Its only read (line 16) is protected by the same lock. So, there is no concurrency problem for RSX.
The set FBDX is read at line 06 (readT () operation), and its only write (line 18,
try to commitT () operation) is protected by a lock. As it is an atomic variable, there is
no concurrency problem for FBDX .
Management of the set OW . This set is read and written only by try to commitT ()
which reads it at lines 15 and 17, and writes it at line 17 (its read at line 18 can benefit
from a local copy saved at line 17).
Concurrent invocations of try to commitT () can come from transactions accessing
distinct sets of objects. When this occurs, the set OW is not protected by the locks associated with the objects and can consequently be concurrently accessed. As OW is kept
in an atomic variable there is no concurrency problem for the reads. Differently, writes
of OW (line 17) can be missed. Actually, when we look at the update of the atomic set variable OW, namely OW ← OW ∪ (∪X∈lwsT RSX) (line 17), we can observe that this update is nothing else than a Fetch&Add() statement that has to atomically add ∪X∈lwsT RSX to OW. If such an operation on a set variable is not provided by
the hardware, there are several ways to implement it. One consists in using a lock to execute this operation in mutual exclusion. Another consists in using specialized hardware operations such as Compare&Swap() (manipulating a pointer on the set OW), or LL/SC (load-linked/store-conditional) [15,18]. Yet another possible implementation consists in considering the set OW as a shared array with one entry per process, pi being the only process that can write OW[i]. Moreover, for efficiency, the current value of OW[i] can be saved in a local variable owi. A write by pi in OW[i] then becomes owi ← owi ∪ (∪X∈lwsT RSX) followed by OW[i] ← owi, while the atomic read of the set OW is implemented by a snapshot operation on the array OW[1..n] [1] (there are efficient implementations of the snapshot operation, e.g., [4,5]).
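The array-based implementation can be sketched as follows (our own illustration: the snapshot is a naive copy standing in for a wait-free snapshot algorithm such as those of [4,5], and n is arbitrary). Since each entry OW[i] has a single writer pi, no update can be lost.

```python
n = 3                                   # number of processes (illustrative)
OW = [set() for _ in range(n)]          # OW[i] is written only by p_i
ow_cache = [set() for _ in range(n)]    # local copy ow_i kept by p_i

def write_OW(i, new_ids):
    """p_i's update: ow_i <- ow_i U new_ids, then OW[i] <- ow_i.
    Single writer per entry, so no concurrent write is ever missed."""
    ow_cache[i] = ow_cache[i] | new_ids
    OW[i] = ow_cache[i]                 # one atomic register write

def read_OW():
    """Read of the abstract set OW as a snapshot of OW[0..n-1]
    (naive copy here; a wait-free snapshot would be used in practice)."""
    return set().union(*OW)

write_OW(0, {"T1"})
write_OW(1, {"T2", "T3"})
assert read_OW() == {"T1", "T2", "T3"}
```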
Differently from the pair of sets RSX and FBDX , associated with each object X, the
set OW constitutes a global contention point. This contention point can be suppressed
by replacing OW by independent boolean variables (see Section 5). We have adopted
here an incremental presentation, to make the final protocol easier to understand.
3.5 Early Abort and Contention Manager
When the predicate T ∈ OW is satisfied, the transaction T has read an object that since then has been overwritten. This fact is not sufficient to abort T if it is a read-only transaction. Differently, if T is an update transaction, it cannot be linearized; consequently, it will be aborted when executing line 15 of try to commitT(). It is possible to abort such an update transaction T earlier than during the execution of try to commitT(). This can simply be done by adding the statement if T ∈ OW then return(abort) end if just before the first line of the operation writeT(). Similarly, the statement if T ∈ FBDX then return(abort) end if can be inserted between the first and the second line of the operation readT().
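With the illustrative names used earlier (ours, not the paper's; abort is signalled here by a return value), the two early-abort statements amount to prepending a membership test:

```python
class Obj:
    def __init__(self):
        self.FBD = set()        # FBD_X

class Txn:
    def __init__(self, tid):
        self.tid, self.lws, self.local_copy = tid, set(), {}

def write(T, X, v, OW):
    """write_T(v) with the early-abort check prepended."""
    if T.tid in OW:             # T has read a value overwritten since:
        return "abort"          # as an update transaction it cannot commit
    T.lws.add(X)
    T.local_copy[X] = v         # remainder of write_T() (local write)
    return "ok"

def read_check(T, X):
    """The test inserted between the first and second line of read_T()."""
    return "abort" if T.tid in X.FBD else "ok"

T = Txn("T1"); X = Obj(); OW = {"T1"}
assert write(T, X, 0, OW) == "abort"     # aborted before doing any work
assert read_check(T, X) == "ok"          # T1 not (yet) forbidden on X
```

The benefit is purely in efficiency: a doomed update transaction stops consuming shared resources as soon as its fate is known.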
It is worth noticing that the sets RSX , FBDX , and OW can be
used by an underlying contention manager [11,24] to abort transactions according to
predefined rules (namely, there are configurations where aborting a single transaction
can prevent the abort of other transactions).
Let wT(X)v denote the event associated with the write of the value v in X. Given an object X, there is at most one event wT(X)v per transaction T. If any, it corresponds to a write issued at line 16 in the try to commitT() operation. If the value v is irrelevant, wT(X)v is abbreviated wT(X).
Without loss of generality we assume that no two writes on the same object X write
the same value.
We also assume that all the objects are initially written by a fictitious transaction.
Let ALT (X, op) denote the event associated with the acquisition of the lock on
the object X issued by the transaction T during an invocation of op where op is
X.readT () or try to commitT ().
Similarly, let RLT (X, op) denote the event associated with the release of the lock
on the object X issued by the transaction T during an invocation of op.
Given an execution, let H be the set of all the events generated by the shared memory
accesses issued by the STM system described in Figure 3. As these shared memory
accesses are atomic, the previous events are totally ordered. Consequently, at the shared memory level, an execution can be represented by the pair Ĥ = (H, <H), where <H denotes the total ordering on its events. Ĥ is called a shared memory history.
As <H is a total order, it is possible to consider each event in H as a date of the time
line. This date view of a sequential history on events will be used in the proof.
History at the transaction level. Given an execution, let TR be the set of transactions
issued during that execution. Let →TR be the order relation defined on the transactions of TR as follows: T1 →TR T2 if ET1 <H BT2 (T1 has terminated before T2 starts). If ¬(T1 →TR T2) ∧ ¬(T2 →TR T1), we say that T1 and T2 are concurrent (their executions overlap in time). At the transaction level, that execution is defined by the partial order T̂R = (TR, →TR), which is called a transaction level history.
The read-from relation between transactions, denoted →^X_rf, is defined as follows:
5. T1 →^X_rf T2 ⇒ T1 →ST T2.
4.3 Definition of the Linearization Points
→ST is produced by ordering the transactions according to their linearization points. The linearization point of the transaction T is denoted ℓT. The linearization points of the transactions are defined as follows:
– If a transaction T aborts, ℓT is the time just before T is added to the set OW (line 17 of the try to commitT() operation that entails its abort).
– If a read-only transaction T commits, ℓT is placed at the earliest of (1) the occurrence time of the test during its last read operation (line 05 of the X.read() operation) and (2) the time just before it is added to OW (if it ever is). (An example is depicted in Figure 4.)
– If an update transaction T commits, ℓT is placed just after the execution of line 17 by T (update of OW).
The total order <H (defined on the events generated by T R) can be extended with
these linearization points. Transactions whose linearization points happen at the same
time (for example, in multi-core systems) are ordered arbitrarily. An example is given
in Figure 4.
4.4 Safety: Proof of the Opacity Property
Let T̂R = (TR, →TR) be a transaction history. Let ŜT = (ρ(TR), →ST) be a history whose transactions are the transactions in ρ(TR), and such that →ST is defined according to the linearization points of each transaction in ρ(TR). If two transactions in ρ(TR) have the same linearization point, they are ordered arbitrarily. Finally, let us observe that
(Figure 4 here: an event/time line showing BT1, rT1(X), rT1(Y), ℓT1, CT1 for T1; wT2(X), ℓT2, CT2 for T2; BT3, wT3(Y), ℓT3, CT3 for T3; together with the updates RSX ← RSX ∪ {T1} and OW ← OW ∪ {T1}.)
the linearization points can be trivially added to the sequential history Ĥ = (H, <H) defined on the events generated by the transaction history T̂R. So, we consider in the following that the set H includes the linearization points of the transactions.
Lemma 1. →ST is a total order.
Proof. Trivial from the ordering of the linearization points.
Lemma 2. →ρ(TR) ⊆ →ST.
Proof. This lemma follows from the fact that, given any transaction T, its linearization point is placed within its lifetime. Therefore, if T1 →ρ(TR) T2 (T1 ends before T2 begins), then T1 →ST T2. □
Let ow(T, t) be the predicate "at time t, T belongs to OW".
Lemma 3. ow(T, t) ⇒ ℓT <H t.
Proof. We show that the linearization point of a transaction T cannot be after the time at which the transaction's id is added to OW. There are three cases.
– By construction, if T aborts, its linearization point ℓT is the time just before its id is added to OW, which proves the lemma.
– If T is read-only and commits, again by construction, its linearization point ℓT is placed at the latest just before the time at which its id is added to OW (if it ever is), which again proves the lemma.
– If T writes and commits, its linearization point ℓT is placed during try to commitT(), while T holds the locks of every object that it has read. If T was in OW before it acquired all the locks, it would not commit (due to line 15). Let us notice that T can be added to OW only by an update transaction holding a lock on a base object previously read by T. As T releases the locks just before committing (line 19), it follows that ℓT occurs before the time at which its id is added to OW (if it ever is), which proves the last case of the lemma. □
Let rsX(T, t) be the predicate "at time t, T belongs to RSX or OW".
Lemma 4. TW →^X_rf TR ⇒ (∄ TW′ such that TW →ST TW′ →ST TR ∧ wTW′(X) ∈ H).
Proof. By contradiction, let us assume that there are transactions TW, TW′ and TR and an object X such that:
– TW →^X_rf TR,
– wTW′(X)v ∈ H,
– TW →ST TW′ →ST TR.
write X in shared memory, they have necessarily committed (a
As both TW and TW
write in shared memory occurs only at line 16 during the execution of try to commit(),
abbreviated ttc in the following). Moreover, their linearization points
TW and
TW
occur while they hold the lock on X (before committing), from which we have the
following implications:
,
TW ST TW
TW <H TW
(X, ttc),
TW <H TW
RLTW (X, ttc) &l