Concurrent Programming:
Algorithms, Principles,
and Foundations
Michel Raynal
Institut Universitaire de France
IRISA-ISTIC
Université de Rennes 1
Rennes Cedex
France
… That day I really believed I had grasped something and that my life would be changed by it.
But nothing of that kind is ever definitively acquired.
Like water, the world passes through you and for a time lends you its colors.
Then it withdraws and places you once more before that void one carries within oneself,
before that kind of central insufficiency of the soul that one must learn to live with, to fight,
and which, paradoxically, is perhaps our surest driving force.
In L’usage du monde (1963), Nicolas Bouvier (1929–1998)
Preface
What synchronization is
Since the early work of E.W. Dijkstra (1965), who introduced the mutual exclu-
sion problem, the concept of a process, the semaphore object, the notion of a
weakest precondition, and guarded commands (among many other contributions),
synchronization is no longer a catalog of tricks but a domain of computing science
with its own concepts, mechanisms, and techniques whose results can be applied in
many domains. This means that process synchronization has to be a major topic of
any computer science curriculum.
Content
As stressed by its title, this book is on algorithms, base principles, and foundations
of concurrent objects and synchronization in shared memory systems, i.e., systems
where the entities communicate by reading and writing a common memory. (Such
a corpus of knowledge is becoming more and more important with the advent of
new technologies such as multicore architectures.)
The book is composed of six parts. Three parts are more focused on base
synchronization mechanisms and the construction of concurrent objects, while the
other three parts are more focused on the foundations of synchronization. (A
noteworthy feature of the book is that nearly all the algorithms that are presented
are proved.)
• Part I is on lock-based synchronization, i.e., on well-known synchronization
concepts, techniques, and mechanisms. It defines the most important synchro-
nization problem in reliable asynchronous systems, namely the mutual exclusion
problem (Chap. 1). It then presents several base approaches which have been
proposed to solve it with machine-level instructions (Chap. 2). It also presents
traditional approaches which have been proposed at a higher abstraction level to
solve synchronization problems and implement concurrent objects, namely the
concept of a semaphore and, at an even more abstract level, the concepts of
monitor and path expression (Chap. 3).
• After the reader has become familiar with base concepts and mechanisms suited
to classical synchronization in reliable systems, Part II, which is made up of a
single chapter, addresses a fundamental concept of synchronization; namely, it
presents and investigates the concept of atomicity and its properties. This allows
for the formalization of the notion of a correct execution of a concurrent pro-
gram in which processes cooperate by accessing shared objects (Chap. 4).
• Part I has implicitly assumed that the cooperating processes do not fail. Hence,
the question: What happens when cooperating entities fail? This is the main
issue addressed in Part III (and all the rest of the book); namely, it considers that
cooperating entities can halt prematurely (crash failure). To face the net effect of
asynchrony and failures, it introduces the notions of mutex-freedom and asso-
ciated progress conditions such as obstruction-freedom, non-blocking, and wait-
freedom (Chap. 5).
The rest of Part III focuses on hybrid concurrent objects (Chap. 6), wait-free
implementations of paradigmatic concurrent objects such as counters and store-
collect objects (Chap. 7), snapshot objects (Chap. 8), and renaming objects
(Chap. 9).
• Part IV, which consists of a single chapter, is on the transactional memory
approach, which shifts the burden of synchronization from the programmer to
the underlying system (Chap. 10).
• Part V returns to the foundations side. It shows how reliable atomic read/write
registers (shared variables) can be built from non-atomic bits. This part consists
of three chapters. Chapter 11 introduces the notions of safe register, regular
register, and atomic register. Then, Chap. 12 shows how to build an atomic bit
from a safe bit. Finally, Chap. 13 shows how an atomic register of any size can
be built from safe and atomic bits.
This part shows that, while atomic read/write registers are easier to use than safe
read/write registers, they are not more powerful from a computability point-of-
view.
• Part VI, which also concerns the foundations side, is on the computational
power of concurrent objects. It is made up of four chapters. It first introduces the
notion of a consensus object and shows that consensus objects are universal
objects (Chap. 14). This means that, as soon as a system provides us with atomic
read/write registers and consensus objects, it is possible to implement in a wait-
free manner any object defined from a sequential specification.
Part VI then introduces the notion of self-implementation and shows how atomic
registers and consensus objects can be built from base objects of the same type
which are not reliable (Chap. 15). Then, it presents the notion of a consensus
number and the associated consensus hierarchy which allows the computability
power of concurrent objects to be ranked (Chap. 16). Finally, the last chapter of
the book focuses on the wait-free implementation of consensus objects from
read/write registers and failure detectors (Chap. 17).
To have a more complete feeling of the spirit of this book, the reader can also
consult the section “What Was the Aim of This Book” in the Afterword, which
describes what it is hoped has been learned from this book. Each chapter starts
with a short presentation of its content and a list of keywords; it terminates with a
summary of the main points that have been explained and developed. Each of the
six parts of the book is also introduced by a brief description of its aim and its
technical content.
Acknowledgments
This book originates from lecture notes for undergraduate and graduate courses on
process synchronization that I give at the University of Rennes (France) and, as an
invited professor, at several universities all over the world. I would like to thank
the students for their questions that, in one way or another, have contributed to this
book.
Last but not least (and maybe most importantly), I also want to thank all the
researchers whose results are presented in this book. Without their work, this book
would not exist. (Since I typeset the entire text myself (LaTeX for the text and xfig
for figures), any typesetting or technical errors that remain are my responsibility.)
Michel Raynal
Professeur des Universités
Institut Universitaire de France
IRISA-ISTIC, Université de Rennes 1
Campus de Beaulieu, 35042 Rennes, France
Notation

No-op                    No operation
Process                  Program in action
n                        Number of processes
Correct process          Process that does not crash during an execution
Faulty process           Process that crashes during an execution
Concurrent object        Object shared by several processes
AA[1..m]                 Array with m entries
⟨a, b⟩                   Pair with two elements a and b
Mutex                    Mutual exclusion
Read/write register      Synonym of read/write variable
SWSR                     Single-writer/single-reader (register)
SWMR                     Single-writer/multi-reader (register)
MWSR                     Multi-writer/single-reader (register)
MWMR                     Multi-writer/multi-reader (register)
ABCD                     Identifiers in italics upper case letters: shared objects
abcd                     Identifiers in italics lower case letters: local variables
↑X                       Pointer to object X
P↓                       Object pointed to by the pointer P
AA[1..s] (a[1..s])       Shared (local) array of size s
for each i ∈ {1, ..., m} do statements end for   Order irrelevant
for each i from 1 to m do statements end for     Order relevant
wait(P)                  while ¬P do no-op end while
return(v)                Returns v and terminates the operation invocation
% blablabla %            Comments
;                        Sequentiality operator between two statements
Figures and Algorithms

7.1  A simple wait-free counter for n processes (code for pi)
7.2  Wait-free weak counter (one-shot version, code for pi)
7.3  Proof of the weak increment property
7.4  Fast read of a weak counter (code for process pi)
7.5  Reading a weak counter (non-restricted version, code for process pi)
7.6  A trivial implementation of a store-collect object (code for pi)
7.7  A store-collect object has no sequential specification
7.8  A complete binary tree to implement a store-collect object
7.9  Structure of a vertex of the binary tree
7.10 An adaptive implementation of a store-collect object (code for pi)
7.11 Computing an upper bound on the number of marked vertices
7.12 Merging store() and collect() (code for process pi)
7.13 Incorrect versus correct implementation of the store_collect() operation
7.14 An efficient store_collect() algorithm (code for pi)
7.15 Sequential and concurrent invocations of store_collect()
7.16 Concurrent invocations of store_collect()
The concept of a process, which captures the idea of an ongoing activity, has become
an indispensable tool for mastering the activity of multiprocessors. More precisely, a concurrent algorithm
(or concurrent program) is the description of a set of sequential state machines that
cooperate through a communication medium, e.g., a shared memory. A concurrent
algorithm is sometimes called a multiprocess program (each process corresponding
to the sequential execution of a given state machine).
This chapter considers processes that are reliable and asynchronous. “Reliable”
means that each process results from the correct execution of the code of the corre-
sponding algorithm. “Asynchronous” means that there is no timing assumption on
the time it takes for a process to proceed from a state transition to the next one (which
means that an asynchronous sequential process proceeds at an arbitrary speed).
1.2.2 Synchronization
More generally, synchronization is the set of rules and mechanisms that allows
the specification and implementation of sequencing properties on statements issued
by the processes so that all the executions of a multiprocess program are correct.
One basic type of process interaction is competition: processes compete to execute
some statements, and only one process at a time (or a bounded number of them) is
allowed to execute them. This occurs, for example, when processes compete for a
shared resource. More generally, resource allocation is a typical example of process
competition.
From the disk D's point of view, the execution corresponds to the sequence
D.seek(x); r ← D.read(); D.seek(y); D.write(v), from which we conclude that p
has read the value at address x and afterwards q has written the value v at address y.
Let us now consider the case where p and q simultaneously invoke disk_read(x)
and disk_write(y, v), respectively. The effect of the corresponding parallel execution
is produced by any interleaving of the primitives invoked by p and the primitives
invoked by q that respects the order of invocations issued by p and q. As an example,
a possible execution is depicted in Fig. 1.2. This figure is a classical space-time
diagram. Time flows from left to right, and each operation issued by a process is
represented by a segment on the time axis associated with this process. Two dashed
arrows are associated with each invocation of an operation. They meet at a point of
the “real time” line, which indicates the instant at which the corresponding operation
appears to have been executed instantaneously. This sequence of points defines the
order in which the execution is seen by an external sequential observer (i.e., an
observer who can see one operation invocation at a time).
In this example, the processes p and q have invoked in parallel D.seek(x) and
D.seek(y), respectively, and D.seek(x) appears to be executed before D.seek(y).
Then q executes D.write(v) while p executes in parallel D.read(), and the write by
q appears to an external observer to be executed before the read of p.
It is easy to see that, while the write by process q is correct (namely v has been
written at address y), the read by process p of the value at address x is incorrect
( p obtains the value written at address y and not the value stored at address x).
Other incorrect parallel executions (involving invocations of both disk_read() and
disk_write() or involving only invocations of disk_write() operations) in which a
value is not written at the correct address can easily be designed.
A solution to prevent this problem from occurring consists in allowing only
one operation at a time (either disk_read() or disk_write()) to be executed. Mutual
exclusion (addressed later in this chapter) provides such a solution.
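To make this concrete, here is a minimal Python sketch (the Disk class and the disk_read()/disk_write() helpers are hypothetical stand-ins for the disk of this section, and threading.Lock plays the role of the mutual exclusion mechanism discussed below): without the lock, a concurrent disk_write() could move the head between the seek() and the read() of a disk_read(), which is exactly the incorrect interleaving described above.

    import threading

    class Disk:
        # Toy disk: a single shared head position and an array of blocks.
        def __init__(self, nb_blocks):
            self.head = 0
            self.blocks = [None] * nb_blocks
        def seek(self, addr):              # moves the single shared head
            self.head = addr
        def read(self):                    # reads the block under the head
            return self.blocks[self.head]
        def write(self, v):                # writes the block under the head
            self.blocks[self.head] = v

    D = Disk(16)
    mutex = threading.Lock()               # the mutual exclusion object

    def disk_read(x):
        with mutex:                        # without this bracket, a concurrent
            D.seek(x)                      # disk_write() could move the head
            return D.read()                # between seek() and read()

    def disk_write(y, v):
        with mutex:                        # seek() + write() now appear as one
            D.seek(y)                      # indivisible step
            D.write(v)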
This section presents two examples of process cooperation. The first is a pure coor-
dination problem while the second is the well-known producer–consumer problem.
In both cases the progress of a process may depend on the progress of other processes.
To better understand the nature of what synchronization is, let us consider the
previous producer–consumer problem. Let #p and #c denote the number of data
items produced and consumed so far, respectively. The instance of the problem
with a buffer of size b is then characterized by the invariant 0 ≤ #p − #c ≤ b: the
executions in which this predicate is always satisfied remain inside an authorized
area, while the others enter a forbidden area.
Critical section Let us consider a part of code A (i.e., an algorithm) or several parts
of code A, B, C . . . (i.e., different algorithms) that, for some consistency reasons,
must be executed by a single process at a time. This means that, if a process is execut-
ing one of these parts of code, e.g., the code B, no other process can simultaneously
execute the same or another of these parts of code, i.e., any of the codes A, B, C, etc.
This is, for example, the case of the disk operations disk_read() and disk_write()
introduced in Sect. 1.2.2, where guaranteeing that, at any time, at most one process
can execute either of these operations ensures that each read or write of the disk is
correct. Such parts of code define what is called a critical section. It is assumed that a
code defining a critical section always terminates when executed by a single process
at a time.
In the following, the critical section code is abstracted into a procedure called
cs_code(in) where in denotes its input parameter (if any) and that returns a result
value (without loss of generality, the default value ⊥ is returned if there is no explicit
result).
It is easy to extend this execution in such a way that, while p3 wants to enter the
critical section, it can never enter it. This execution is deadlock-free but (due to p3 )
is not starvation-free.
Finite bypass versus bounded bypass A liveness property that is stronger than
starvation-freedom is the following one. Let p and q be a pair of competing processes
such that q wins the competition. Let f (n) denote a function of n (where n is the
total number of processes).
• Bounded bypass. There is a function f (n) such that, each time a process invokes
acquire_mutex(), it loses at most f (n) competitions with respect to the other
processes.
Let us observe that starvation-freedom is nothing else than the case where the
number of times that a process can be bypassed is finite. More generally, we have the
following hierarchy of liveness properties: bounded bypass ⇒ starvation-freedom
≡ finite bypass ⇒ deadlock-freedom.
Definition A lock (say LOCK) is a shared object that provides the processes
with two operations denoted LOCK.acquire_lock() and LOCK.release_lock(). It
can take two values, free and locked, and is initialized to the value free. Its
behavior is defined by a sequential specification: from an external observer point
of view, all the acquire_lock() and release_lock() invocations appear as if they
have been invoked one after the other. Moreover, using the regular language
operators “;” and “∗”, this sequence corresponds to the regular expression
(LOCK.acquire_lock(); LOCK.release_lock())∗ (see Fig. 1.5).
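This sequential specification can be expressed as executable code. The following Python sketch is only a checker for the specification (a free/locked state machine), not a mutex implementation: it verifies that the sequence of invocations seen by the external observer matches (LOCK.acquire_lock(); LOCK.release_lock())∗.

    class LockSpec:
        # State machine for the sequential specification of a lock: the only
        # legal sequences are (acquire_lock(); release_lock())*.
        def __init__(self):
            self.state = "free"            # initial value
        def acquire_lock(self):
            assert self.state == "free", "illegal: the lock is already locked"
            self.state = "locked"
        def release_lock(self):
            assert self.state == "locked", "illegal: the lock is already free"
            self.state = "free"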
According to the operations and their properties provided to the processes by the
underlying shared memory communication system, several families of mutex algo-
rithms can be designed. Three distinct families of such algorithms are investigated
in the next chapter.
1.4 Summary
This chapter has presented the mutual exclusion problem. Solving this problem con-
sists in providing a lock object, i.e., a synchronization object that allows a zone of
code to be bracketed to guarantee that a single process at a time can execute it.
• The mutual exclusion problem was first stated by E.W. Dijkstra [88].
• A theory of interprocess communication and mutual exclusion is described in
[185].
• The notions of safety and liveness were introduced by L. Lamport in [185]. The
notion of liveness is investigated in [20].
• An invariant-based view of synchronization is presented in [194].
Chapter 2
Solving Mutual Exclusion
The read/write register object is one of the most basic objects encountered in com-
puter science. When such an object is accessed only by a single process it is said to
be local to that process; otherwise, it is a shared register. A local register allows a
process to store and retrieve data. A shared register allows concurrent processes to
also exchange data.
Definition A register R can be accessed by two base operations: R.read(), which
returns the value of R (also denoted x ← R where x is a local variable of the invoking
process), and R.write(v), which writes a new value into R (also denoted R ← v,
where v is the value to be written into R). An atomic shared register satisfies the
following properties:
• Each read or write invocation appears as if it has been executed instantaneously
at a single point of the time line between its start event and its end event,
• no two invocations appear at the same point of the time line, and
• each read invocation returns the value written by the closest preceding write
invocation on the time line (or the initial value of the register if there is no
preceding write).
Let us observe that R.write(3) and R.write(2) are concurrent, which means that
they could appear to an external observer as if R.write(2) was executed before
R.write(3). If this was the case, the execution would be correct if the last two read
invocations (issued by p1 and p3 ) return the value 3; i.e., the external observer should
then see the following sequential execution:
Let us also observe that the second read invocation by p1 is concurrent with both
R.write(2) and R.write(3). This means that it could appear as having been executed
before these two write operations or even between them. If it appears as having been
executed before these two write operations, it should return the value 1 in order for
the register behavior to be atomic.
As shown by these possible scenarios (and as noticed before) concurrency is
intimately related to non-determinism. It is not possible to predict which execution
will be produced; it is only possible to enumerate the set of possible executions that
could be produced (we can only predict that the one that is actually produced is one
of them).
Examples of non-atomic read and write operations will be presented in Sect. 2.3.
Why atomicity is important Atomicity is a fundamental concept because it allows
the composition of shared objects for free (i.e., their composition is at no additional
cost). This means that, when considering two (or more) atomic registers R1 and
R2, the composite object [R1, R2] which is made up of R1 and R2 and provides the
processes with the four operations R1.read(), R1.write(), R2.read(), and R2.write()
is also atomic. Everything appears as if at most one operation at a time was executed,
and the sub-sequence including only the operations on R1 is a correct behavior of
R1, and similarly for R2.
This is very important when one has to reason about a multiprocess program
whose processes access atomic registers. More precisely, we can keep reasoning
sequentially whatever the number of atomic registers involved in a concurrent com-
putation. Atomicity allows us to reason on a set of atomic registers as if they were a
single “bigger” atomic object. Hence, we can reason in terms of sequences, not only
for each atomic register taken separately, but also on the whole set of registers as if
they were a single atomic object.
The composition of atomic objects is formally addressed in Sect. 4.4, where
it is shown that, as atomicity is a “local property”, atomic objects compose for
free.
The mutex algorithm for two processes that is presented below is due to G.L. Peterson
(1981). This construction, which is fairly simple, is built from an “addition” of two
base components. Despite the fact that these components are nearly trivial, they allow
us to introduce simple basic principles.
Fig. 2.2 Peterson’s algorithm for two processes: first component (code for pi )
The processes are denoted pi and p j . As the algorithm for p j is the same as the
one for pi after having replaced i by j, we give only the code for pi .
First component This component is described in Fig. 2.2 for process pi . It is
based on a single atomic register denoted AFTER_YOU, the initial value of which
is irrelevant (a process writes into this register before reading it). The principle that
underlies this algorithm is a “politeness” rule used in everyday life. When pi wants
to acquire the critical section, it sets AFTER_YOU to its identity i and waits until
AFTER_YOU ≠ i before entering the critical section. Releasing the critical section
entails no particular action.
It is easy to see that this algorithm satisfies the mutual exclusion property. When
both processes want to acquire the critical section, each assigns its identity to the
register AFTER_YOU and waits until this register contains the identity of the other
process. As the register is atomic, there is a “last” process, say p j , that updated it,
and consequently only the other process pi can proceed to the critical section.
Unfortunately, this simple algorithm is not deadlock-free. If one process alone
wants to enter the critical section, it remains blocked forever in the wait statement.
Actually, this algorithm ensures that, when both processes want to enter the critical
section, the first process that updates the register AFTER_YOU is the one that is
allowed to enter it.
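A minimal Python transcription of this first component follows (a busy-waiting sketch in which plain assignments stand in for atomic register operations, and the two processes have identities 0 and 1):

    AFTER_YOU = 0                # atomic register; its initial value is irrelevant

    def acquire_mutex(i):
        global AFTER_YOU
        AFTER_YOU = i            # politeness: sign in and let the other go first
        while AFTER_YOU == i:    # wait until the other process has signed in
            pass                 # alone, a process loops here forever (no deadlock-freedom)

    def release_mutex(i):
        pass                     # releasing the critical section entails no action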
Second component This component is described in Fig. 2.3. It is based on a simple
idea. Each process pi manages a flag (denoted FLAG[i]) the value of which is down
or up. Initially, both flags are down. When a process wants to acquire the critical
section, it first raises its flag to indicate that it is interested in the critical section. It
is then allowed to proceed only when the flag of the other process is equal to down.
To release the critical section, a process pi has only to reset FLAG[i] to its initial
value (namely, down), thereby indicating that it is no longer interested in the mutual
exclusion.
Fig. 2.3 Peterson’s algorithm for two processes: second component (code for pi )
It is easy to see that, if a single process pi wants to repeatedly acquire the critical
section while the other process is not interested in the critical section, it can do so
(hence this algorithm does not suffer the drawback of the previous one). Moreover,
it is also easy to see that this algorithm satisfies the mutual exclusion property. This
follows from the fact that each process follows the following pattern: first write its flag
and only then read the value of the other flag. Hence, assuming that pi has acquired
(and not released) the critical section, we had (FLAG[i] = up)∧(FLAG[ j] = down)
when it was allowed to enter the critical section. It follows that, after p j has set
FLAG[ j] to the value up, it reads up from FLAG[i] and is delayed until pi resets
FLAG[i] to down when it releases the critical section.
Unfortunately, this algorithm is not deadlock-free. If both processes concurrently
raise their flags first and then read the other flag, each process remains blocked until
the other flag is set to down, which will never happen.
Remark: the notion of a livelock In order to prevent the previous deadlock situa-
tion, one could think replacing wait (FLAG[ j] = down) by the following statement:
while (FLAG[ j] = up) do
FLAG[i] ← down;
pi delays itself for an arbitrary period of time;
FLAG[i] ← up
end while.
This modification can reduce deadlock situations but cannot eliminate all of them.
This occurs, for example, when both processes execute “synchronously” (both delay
themselves for the same duration and execute the same step—writing their flag and
reading the other flag—at the very same time). When it occurs, this situation is
sometimes called a livelock.
This tentative solution was obtained by playing with asynchrony (modifying the
process speed by adding delays). As a correct algorithm has to work despite any
asynchrony pattern, playing with asynchrony can eliminate bad scenarios but cannot
suppress all of them.
Peterson's two-process algorithm (Fig. 2.4) combines the two previous components:
when both processes compete for the critical section and each of them has seen that
the flag of the other one was raised, the current value of the register AFTER_YOU
allows exactly one of them to progress.
It is important to observe that, in the wait statement of Fig. 2.4, the readings of
the atomic registers FLAG[j] and AFTER_YOU are asynchronous (they are done at
different times and can be done in any order).
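The following Python sketch of the combined algorithm assumes the two processes have identities 0 and 1 (so the other process is 1 − i); as before, plain assignments model atomic registers, and the two registers of the wait predicate are read one after the other:

    FLAG = ["down", "down"]      # FLAG[i] = "up": p_i is interested
    AFTER_YOU = 0                # last process to have signed in

    def acquire_mutex(i):
        global AFTER_YOU
        j = 1 - i                # identity of the other process
        FLAG[i] = "up"           # second component: announce interest first
        AFTER_YOU = i            # first component: politeness
        while not (FLAG[j] == "down" or AFTER_YOU == j):
            pass                 # the two registers are read asynchronously

    def release_mutex(i):
        FLAG[i] = "down"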
Theorem 1 The algorithm described in Fig. 2.4 satisfies mutual exclusion and
bounded bypass (where the bound is f (n) = 1).
Preliminary remark for the proof The reasoning is based on the fact that the
three registers FLAG[i], FLAG[ j], and AFTER_YOU are atomic. As we have seen
when presenting the atomicity concept (Sect. 2.1.1), this allows us to reason as if at
most one read or write operation on any of these registers occurs at a time.
Proof Proof of the mutual exclusion property.
Let us assume by contradiction that both pi and p j are inside the critical section.
Hence, both have executed acquire_mutex() and we have then FLAG[i] = up,
FLAG[ j] = up and AFTER_YOU = j (if AFTER_YOU = i, the reasoning is the
same after having exchanged i and j). According to the predicate that allowed pi to
enter the critical section, there are two cases.
[Figure: pi executes FLAG[i] ← up and then AFTER_YOU ← i; pj executes AFTER_YOU ← j and then its wait statement; at the current time, AFTER_YOU = j and FLAG[i] = FLAG[j] = up.]
As p j executes the wait statement after writing j into AFTER_YOU and pi read
j from AFTER_YOU, it follows that p j cannot read down from FLAG[i] when it
executes the wait statement. This contradicts the assumption that p j is inside the
critical section.
[Figure: pi executes FLAG[i] ← up and AFTER_YOU ← i, while pj executes AFTER_YOU ← j, then FLAG[j] ← down, then FLAG[j] ← up; pi does not read FLAG[j] in the meantime.]
Each process pi manages an atomic register FLAG_LEVEL[i] (initialized to 0)
that stores the level it has attained; a process must attain the last level, n − 1,
which allows it to enter the critical section. For 1 ≤ x < n − 1, FLAG_LEVEL[i] = x
means that pi is trying to enter level x + 1.
Moreover, to eliminate possible deadlocks at any level ℓ, 0 < ℓ < n − 1 (such as
the deadlock that can occur in the algorithm of Fig. 2.3), the processes use a second
array of atomic registers AFTER_YOU[1..(n − 1)] such that AFTER_YOU[ℓ] keeps
track of the last process that has entered level ℓ.
More precisely, a process pi executes a for loop to progress from one level to
the next one, starting from level 1 and finishing at level n − 1. At each level the
two-process solution is used to block a process (if needed). The predicate that allows
a process to progress from level ℓ, 0 < ℓ < n − 1, to level ℓ + 1 is similar to the
one of the two-process algorithm. More precisely, pi is allowed to progress to level
ℓ + 1 if, from its point of view,
• Either all the other processes are at a lower level (i.e., ∀ k ≠ i: FLAG_LEVEL[k] < ℓ).
• Or it is not the last one that entered level ℓ (i.e., AFTER_YOU[ℓ] ≠ i).
Let us notice that the predicate used in the wait statement of line 4 involves all but one
of the atomic registers FLAG_LEVEL[·] plus the atomic register AFTER_YOU[ℓ].
As these registers cannot be read in a single atomic step, the predicate is repeatedly
evaluated asynchronously on each register.
When all processes compete for the critical section, at most (n − 1) processes can
concurrently be winners at level 1, (n − 2) processes can concurrently be winners
at level 2, and more generally (n − ℓ) processes can concurrently be winners at
level ℓ. Hence, there is a single winner at level (n − 1).
The code of the operation release_mutex(i) is similar to the one of the two-process
algorithm: a process pi resets FLAG_LEVEL[i] to its initial value 0 to indicate that
it is no longer interested in the critical section.
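The resulting algorithm (Fig. 2.8) can be sketched in Python as follows. The all(...) expression is only a notational convenience: in the actual algorithm the registers of the wait predicate are read one at a time, asynchronously, as stressed above. Entry 0 of AFTER_YOU is unused so that AFTER_YOU[lvl] corresponds to level lvl.

    n = 4                                 # number of processes, identities 0..n-1
    FLAG_LEVEL = [0] * n                  # level attained by each process
    AFTER_YOU = [None] * n                # last process to have entered each level

    def acquire_mutex(i):
        for lvl in range(1, n):           # progress from level 1 to level n-1
            FLAG_LEVEL[i] = lvl
            AFTER_YOU[lvl] = i
            while not (all(FLAG_LEVEL[k] < lvl for k in range(n) if k != i)
                       or AFTER_YOU[lvl] != i):
                pass                      # re-evaluated register by register

    def release_mutex(i):
        FLAG_LEVEL[i] = 0                 # no longer interested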
Theorem 2 The algorithm described in Fig. 2.8 satisfies mutual exclusion and
starvation-freedom.
Proof Initially, a process pi is such that FLAG_LEVEL[i] = 0 and we say that it is
at level 0. Let ℓ ∈ [1..(n − 1)]. We say that a process pi has “attained” level ℓ (or,
from a global state point of view, “is” at level ℓ) if it has exited the wait statement
of the ℓth loop iteration. Let us notice that, after it has set its loop index to α > 0
and until it exits the wait statement of the corresponding iteration, that process is at
level α − 1. Moreover, a process that attains level ℓ has also attained the levels ℓ′
with 0 ≤ ℓ′ ≤ ℓ ≤ n − 1 and consequently it is also at these levels ℓ′.
The proof of the mutual exclusion property amounts to showing that at most one
process is at level (n − 1). This is a consequence of the following claim when we
consider ℓ = n − 1.
Claim. For ℓ, 0 ≤ ℓ ≤ n − 1, at most n − ℓ processes are at level ℓ.
The proof of this claim is by induction on the level ℓ. The base case ℓ = 0 is
trivial. Assuming that the claim is true up to level ℓ − 1, i.e., at most n − (ℓ − 1)
processes are simultaneously at level ℓ − 1, we have to show that at least one process
does not progress to level ℓ. The proof is by contradiction: let us assume that n − ℓ + 1
processes are at level ℓ.
Let px be the last process that wrote its identity into AFTER_YOU[ℓ] (hence,
AFTER_YOU[ℓ] = x). When considering the sequence of read and write operations
executed by every process, and the fact that these operations are on atomic registers,
this means that, for any of the n − ℓ other processes py that are at level ℓ, these
operations appear as if they have been executed in the following order, where the
first two operations are issued by py while the last two operations are issued by px
(Fig. 2.9):
1. FLAG_LEVEL[y] ← ℓ is executed before AFTER_YOU[ℓ] ← y (sequentiality
of py),
2. AFTER_YOU[ℓ] ← y is executed before AFTER_YOU[ℓ] ← x (assumption:
definition of px),
3. AFTER_YOU[ℓ] ← x is executed before r ← FLAG_LEVEL[y] (sequentiality
of px; r is px's local variable storing the last value read from FLAG_LEVEL[y]
before px exits the wait statement at level ℓ).
It follows from this sequence that r = ℓ. Consequently, as AFTER_YOU[ℓ] = x,
px exited the wait statement of the ℓth iteration because ∀ k ≠ x : FLAG_LEVEL
[k] < ℓ. But this is contradicted by the fact that we had then FLAG_LEVEL[y] = ℓ,
which concludes the proof of the claim.
The proof of the starvation-freedom property is by induction on the levels starting
from level n − 1 and proceeding until level 1. The base case ℓ = n − 1 follows from
the previous claim: if there is a process at level (n − 1), it is the only process at that
level and it can exit the for loop. This process eventually enters the critical section
(that, by assumption, it will leave later). The induction assumption is the following:
each process that attains a level ℓ′ such that n − 1 ≥ ℓ′ ≥ ℓ eventually enters the
critical section.
The rest of the proof is by contradiction. Let us assume that ℓ is such that there is
a process (say px) that remains blocked forever in the wait statement during its ℓth
iteration (hence, px cannot attain level ℓ). It follows that, each time px evaluates the
predicate controlling the wait statement, we have
(∃ k ≠ x : FLAG_LEVEL[k] ≥ ℓ) ∧ (AFTER_YOU[ℓ] = x)
(let us remember that the atomic registers are read one at a time, asynchronously,
and in any order). There are two cases.
• Case 1: There is a process py that eventually executes AFTER_YOU[ℓ] ← y.
As only px can execute AFTER_YOU[ℓ] ← x, there is eventually a read of
AFTER_YOU[ℓ] that returns a value different from x, and this read allows px
to progress to level ℓ. This contradicts the assumption that px remains blocked
forever in the wait statement during its ℓth iteration.
• Case 2: No process py eventually executes AFTER_YOU[ℓ] ← y.
The other processes can be partitioned into two sets: the set G that contains the
processes at a level greater than or equal to ℓ, and the set L that contains the
processes at a level smaller than ℓ.
As the predicate AFTER_YOU[ℓ] = x remains forever true, it follows that no
process py in L enters the ℓth loop iteration (otherwise py would necessarily
execute AFTER_YOU[ℓ] ← y, contradicting the case assumption).
On the other side, due to the induction assumption, all the processes in G eventu-
ally enter (and later leave) the critical section. When this has occurred, these
processes have moved from the set G to the set L and then the predicate
∀ k ≠ x : FLAG_LEVEL[k] < ℓ becomes true.
When this has happened, the values returned by the asynchronous reading of
FLAG_LEVEL[1..n] by px allow it to attain level ℓ, which contradicts the assump-
tion that px remains blocked forever in the wait statement during its ℓth iteration.
In both cases the assumption that a process remains blocked forever at level ℓ is
contradicted, which completes the proof of the induction step and concludes the
proof of the starvation-freedom property.
an arbitrarily long “sleeping” period (this is possible as the processes are asynchronous)
and consequently does not read AFTER_YOU[1] = 1 (which would allow it to
progress to the second level). Differently, p2 progresses to the second level and
enters the critical section. Later, p2 first invokes release_mutex() and immediately
after invokes acquire_mutex() and updates AFTER_YOU[1] = 2. While p3 keeps
on “sleeping”, p1 progresses to level 2 and finally enters the critical section. This
scenario can be reproduced an arbitrary number of times until p3 wakes up. When this
occurs, p3 reads from AFTER_YOU[1] a value different from 3, and consequently
progresses to level 2. Hence:
• Due to asynchrony, a “sleeping period” can be arbitrarily long, and a process can
consequently lose an arbitrary number of competitions with respect to the other
processes,
• But, as a process does not sleep forever, it eventually progresses to the next level.
It is important to notice that, as shown in the proof of the bounded bypass property of
Theorem 1, this scenario cannot happen when n = 2.
Atomic register: size and number It is easy to see that the algorithm uses
2n − 1 atomic registers. The domain of each of the n registers FLAG_LEVEL[i] is
[0..(n − 1)], while the domain of each of the n − 1 registers AFTER_YOU[ℓ] is [1..n].
Hence, in both cases, ⌈log2 n⌉ bits are necessary and sufficient for each atomic reg-
ister.
Number of accesses to atomic registers Let us define the time complexity of a
mutex algorithm as the number of accesses to atomic registers for one use of the
critical section by a process.
It is easy to see that this cost is finite but not bounded when there is contention
(i.e., when several processes simultaneously compete to execute the critical section
code).
Differently, in a contention-free scenario (i.e., when only one process pi wants to
use the critical section), the number of accesses to atomic registers is (n − 1)(n + 2)
in acquire_mutex(i) and one in release_mutex(i).
The case of k-exclusion This is the k-mutual exclusion problem where the critical
section code can be concurrently accessed by up to k processes (mutual exclusion
corresponds to the case where k = 1).
Peterson’s n-process algorithm can easily be modified to solve k-mutual exclusion.
The upper bound of the for loop (namely (n−1)) has simply to be replaced by (n−k).
No other statement modification is required. Moreover, let us observe that the size
of the array AFTER_YOU can then be reduced to [1..(n − k)].
Even in the absence of contention, a process has to traverse all the levels of the
algorithm before being able to access the critical section. Said differently, it has to
execute n − 1 loop iterations (eliminating another process at each iteration), and
consequently, the cost (measured in number of accesses to atomic registers) in a
contention-free scenario is O(n) × the cost of one loop iteration, i.e., O(n²). Hence
a natural question is the following: Is it possible to reduce this cost and (if so) how?
Tournament tree A simple principle to reduce the number of shared memory
accesses is to use a tournament tree. Such a tree is a complete binary tree. To simplify
the presentation, we consider that the number of processes is a power of 2, i.e., n = 2^k
(hence k = log2 n). If n is not a power of two, it has to be replaced by n′ = 2^k′ where
k′ = ⌈log2 n⌉ (i.e., n′ is the smallest power of 2 such that n′ > n).
Such a tree for n = 2^3 = 8 processes p1 , . . . , p8 is represented in Fig. 2.10. Each
node of the tree is any two-process starvation-free mutex algorithm, e.g., Peterson’s
two-process algorithm. It is even possible to associate different two-process mutex
algorithms with different nodes. The important common feature of these algorithms
is that any of them assumes that it is used by two processes whose identities are 0
and 1.
As we have seen previously, any two-process mutex algorithm implements a lock
object. Hence, we consider in the following that the tournament tree is a tree of (n−1)
locks and we accordingly adopt the lock terminology. The locks are kept in an array
denoted LOCK[1..(n − 1)], and for x = y, LOCK[x] and LOCK[y] are independent
objects (the atomic registers used to implement LOCK[x] and the atomic registers
used to implement LOCK[y] are different).
The lock LOCK[1] is associated with the root of the tree, and if it is not a leaf, the
node associated with the lock LOCK[x] has two children associated with the locks
LOCK[2x] and LOCK[2x + 1].
According to its identity i, each process pi starts competing with a single other
process p j to obtain a lock that is a leaf of the tree. Then, when it wins, the process
pi proceeds to the next level of the tree to acquire the lock associated with the node
that is the father of the node currently associated with pi (initially the leaf node
associated with pi ). Hence, a process competes to acquire all the locks on the path
from the leaf it is associated with until the root node.
As (a) the length of such a path is ⌈log2 n⌉ and (b) the cost to obtain a lock
associated with a node is O(1) in contention-free scenarios, it is easy to see that
the number of accesses to atomic registers in these scenarios is O(log2 n) (it
is exactly 4 log2 n when each lock is implemented with Peterson's two-process
algorithm).
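The tournament structure can be sketched as follows in Python. The factory make_lock() and the acquire(side)/release(side) interface are assumptions of this sketch: make_lock() must return any two-process starvation-free lock used with identities 0 and 1 (e.g., Peterson's two-process algorithm).

    class TournamentMutex:
        # n-process mutex built from a complete binary tree of n-1 two-process
        # locks; lock[1] is the root, lock[x] has children lock[2x] and lock[2x+1].
        def __init__(self, n, make_lock):
            self.n = n                                                # n = 2**k
            self.lock = [None] + [make_lock() for _ in range(n - 1)]  # 1-based

        def _path(self, i):
            # Locks on the path from p_i's leaf to the root, with the identity
            # (0 or 1) that p_i must use at each of them.
            node, side, path = (self.n + i) // 2, (self.n + i) % 2, []
            while node >= 1:
                path.append((node, side))
                node, side = node // 2, node % 2
            return path

        def acquire(self, i):
            for node, side in self._path(i):            # log2(n) lock acquisitions
                self.lock[node].acquire(side)

        def release(self, i):
            for node, side in reversed(self._path(i)):  # from the root down
                self.lock[node].release(side)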
Remark Let us consider the case where each algorithm implementing an under-
lying two-process lock object uses a bounded number of bounded atomic regis-
ters (which is the case for Peterson’s two-process algorithm). In that case, as the
tournament-based algorithm uses (n−1) lock objects, it follows that it uses a bounded
number of bounded atomic registers.
Let us observe that this tournament-based algorithm has better time complexity
than Peterson’s n-process algorithm.
In a contention-free scenario, the cost of Peterson's n-process algorithm is O(n²)
while the cost of the tournament tree-based algorithm is O(log2 n).
Hence, a natural question is the following: Is it possible to design a fast n-process
mutex algorithm, where fast means that the cost of the algorithm is constant in a
contention-free scenario?
The next section of this chapter answers this question positively. To that end, an
incremental presentation is adopted. A simple one-shot operation is first presented.
Each of its invocations returns a value r to the invoking process, where r is the value
abort or the value commit. Then, the next section enriches the algorithm imple-
menting this operation to obtain a deadlock-free fast mutual exclusion algorithm due
to L. Lamport (1987).
Concurrency-abortable operation A concurrency-abortable (also named conten-
tion-abortable and usually abbreviated abortable) operation is an operation that is
allowed to return the value abort in the presence of concurrency. Otherwise, it has to
return the value commit. More precisely, let conc_abort_op() be such an operation.
Assuming that each process invokes it at most once (one-shot operation), the set of
invocations satisfies the following properties:
• Obligation. If the first process which invokes conc_abort_op() is such that its
invocation occurs in a concurrency-free pattern (i.e., no other process invokes
conc_abort_op() during its invocation), this process obtains the value commit.
• At most one. At most one process obtains the value commit.
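The one-shot operation itself (presented as Fig. 2.12 in the book, not reproduced in this excerpt) can be reconstructed from the line numbers referenced in the rest of this section; a hedged Python transcription, with two shared registers X and Y (Y initialized to ⊥):

    BOT = None                 # stands for the default value ⊥
    X, Y = BOT, BOT            # atomic registers; only Y's initial value matters

    def conc_abort_op(i):
        global X, Y
        X = i                  # line 1: deposit your identity into X
        if Y is not BOT:       # line 2: some process already executed line 4:
            return "abort1"    # line 3: you are late, give up
        Y = i                  # line 4: record that somebody went through
        if X == i:             # line 5: X untouched since line 1
            return "commit"    # line 6: no concurrency was detected
        return "abort2"        # line 7: a concurrent invocation overwrote X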
• Let us first consider the (possibly empty) set Q of processes p j that read Y at line
2 after this register was written by pi or another process (let us notice that, due to
the atomicity of the registers X and Y , the notion of after/before is well defined).
As Y is never reset to ⊥, it follows that each process pj ∈ Q obtains a non-⊥
value from Y and consequently executes return(abort1) at line 3.
Fig. 2.13 Access pattern to X and Y for a successful conc_abort_op() invocation by process pi
The next corollary follows from the proof of the previous theorem.
Corollary 1 (Y ≠ ⊥) ⇒ a process has obtained the value commit or several
processes have invoked conc_abort_op().
Remark: splitter object When we (a) replace the values commit, abort1, and
abort2 by stop, right, and left, respectively, and (b) rename the operation
conc_abort_op() accordingly, the resulting object is nothing other than what is
called a splitter.
Principle and description This section presents L. Lamport’s fast mutex algo-
rithm, which is built from the previous one-shot concurrency-abortable operation.
More specifically, this algorithm behaves similarly to the algorithm of Fig. 2.12 in
contention-free scenarios and (instead of returning abort) guarantees the deadlock-
freedom liveness property when there is contention.
The algorithm is described in Fig. 2.14. The line numbering is the same as in
Fig. 2.12: the lines with the same number are the same in both algorithms, line N0 is
new, line N3 replaces line 3, lines N7.1–N7.5 replace line 7, and line N10 is new.
To attain its goal (both fast mutex and deadlock-freedom) the algorithm works as
follows. First, each process pi manages a SWMR flag FLAG[i] (initialized to down)
that pi sets to up to indicate that it is interested in the critical section (line N0). This
flag is reset to down when pi exits the critical section (line N10). As we are about
to see, it can be reset to down also in other parts of the algorithm.
According to the contention scenario in which a process pi returns abort in the
algorithm of Fig. 2.12, there are two cases to consider, which have been differentiated
by the values abort1 and abort2.
• Eliminating abort1 (line N3).
In this case, as we have seen in Fig. 2.12, process pi is “late”. As captured by
Corollary 1, this is because there are other processes that currently compete for
the critical section or there is a process inside the critical section. Line 3 of Fig. 2.12
is consequently replaced by the following statements (new line N3):
– Process pi first resets its flag to down in order not to prevent other processes
from entering the critical section (if no other process is currently inside it).
– According to Corollary 1, it is useless for pi to retry entering the critical section
while Y ≠ ⊥. Hence, process pi delays its request for the critical section until
Y = ⊥.
• Eliminating abort2 (lines N7.1–N7.5).
In this case, as we have seen in the base contention-abortable algorithm (Fig. 2.12),
several processes are competing for the critical section (or a process is already
inside the critical section). Differently from the base algorithm, one of the com-
peting processes has now to be granted the critical section (if no other process is
inside it). To that end, in order not to prevent another process from entering the
critical section, process pi first resets its flag to down (line N7.1). Then, pi tries
to enter the critical section. To that end, it first waits until all flags are down (line
N7.2). Then, pi checks the value of Y (line N7.3). There are two cases:
– If Y = i, process pi enters the critical section. This is due to the following
reason.
Let us observe that, if Y = i when pi reads it at line N7.3, then no process has
modified Y since pi set it to the value i at line 4 (the write of Y at line 4 and its
reading at line N7.3 follow the same access pattern as the write of X at line 1 and
its reading at line 5). Hence, process pi is the last process to have executed line
4. It then follows that, as it has (asynchronously) seen each flag equal to down
(line N7.2), process pi is allowed to enter the critical section (return() statement
at line N7.3).
– If Y ≠ i, process pi does the same as what is done at line N3. As it has already
set its flag to down, it has only to wait until the critical section is released before
retrying to enter it (line N7.4). (Let us remember that the only place where Y is
reset to ⊥ is when a process releases the critical section.)
Fast path and slow path The fast path to enter the critical section is when pi
executes only the lines N0, 1, 2, 4, 5, and 6. The fast path is open for a process pi
if it reads i from X at line 5. This is the path that is always taken by a process in
contention-free scenarios.
The cost of the fast path is five accesses to atomic registers. As release_mutex()
requires two accesses to atomic registers, it follows that the cost of a single use of the
critical section in a contention-free scenario is seven accesses to atomic registers.
The slow path is the path taken by a process which does not take the fast path.
Its cost in terms of accesses to atomic registers depends on the current concurrency
pattern.
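Putting the pieces together, the full algorithm of Fig. 2.14 can be transcribed as the following busy-waiting Python sketch; the tags in the comments follow the line numbering used in the text, and plain assignments again stand in for atomic registers:

    BOT = None
    n = 4                                    # number of processes, identities 0..n-1
    X, Y = BOT, BOT
    FLAG = ["down"] * n

    def acquire_mutex(i):
        global X, Y
        while True:
            FLAG[i] = "up"                   # line N0: announce interest
            X = i                            # line 1
            if Y is not BOT:                 # line 2: contention, or busy CS
                FLAG[i] = "down"             # line N3: step aside and wait
                while Y is not BOT:
                    pass
                continue                     # then retry from the start
            Y = i                            # line 4
            if X == i:                       # line 5: the fast path is open
                return                       # line 6: five shared accesses in all
            FLAG[i] = "down"                 # line N7.1
            for j in range(n):               # line N7.2: wait until all flags are down
                while FLAG[j] == "up":
                    pass
            if Y == i:                       # line N7.3: last writer of Y wins
                return
            while Y is not BOT:              # line N7.4: wait for a release,
                pass                         # then retry from the start

    def release_mutex(i):
        global Y
        Y = BOT                              # the only place where Y is reset to ⊥
        FLAG[i] = "down"                     # line N10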
A few remarks A register FLAG[i] is set to down when pi exits the critical section
(line N10) but also at line N3 or N7.1. It is consequently possible for a process pk to
be inside the critical section while all flags are down. But let us notice that, when this
occurs, the value of Y is different from ⊥, and as already indicated, the only place
where Y is reset to ⊥ is when a process releases the critical section.
When executed by a process pi , the aim of the wait statement at line N3 is to
allow any other process p j to see that pi has set its flag to down. Without such a
wait statement, a process pi could loop forever executing the lines N0, 1, 2 and N3
and could thereby favor a livelock by preventing the other processes from seeing
FLAG[i] = down.
Theorem 6 Lamport’s fast mutex algorithm satisfies mutual exclusion and
deadlock-freedom.
Proof Let us first consider the mutual exclusion property. Let pi be a process that
is inside the critical section. Trivially, we have then Y ≠ ⊥ and pi returned from
acquire_mutex() at line 6 or at line N7.3. Hence, there are two cases. Before consid-
ering these two cases, let us first observe that each process (if any) that reads Y after
it was written by pi (or another process) executes line N3: it resets its flag to down
and waits until Y = ⊥ (i.e., at least until pi exits the critical section, line N10). As
the processes that have read a non-⊥ value from Y at line 2 cannot enter the critical
section, it follows that we have to consider only the processes p j that have read ⊥
from Y at line 2.
It then follows from the fact that pi is the last process which wrote into X and τ2j > τ1i
that pj reads i from X at line 4 and consequently does enter the repeat loop again
and waits until X = ⊥. The mutual exclusion property follows.
Proof of the deadlock-freedom property. This is an immediate consequence of
the fact that, among the processes that have concurrently invoked the operation
acquire_mutex(), the last process that writes X ( pi in the previous reasoning) reads
its own identity from X at line 4.
Short discussion The main property of this algorithm is its simplicity. Moreover,
its code is independent of the number of processes.
The previous section presented mutual exclusion algorithms based on atomic read/
write registers. These algorithms are important because understanding their design
and their properties provides us with precise knowledge of the difficulty and subtleties
that have to be addressed when one has to solve synchronization problems. These
algorithms capture the essence of synchronization in a read/write shared memory
model.
Nearly all shared memory multiprocessors provide built-in primitives (i.e., atomic
operations implemented in hardware) specially designed to address synchroniza-
tion issues. This section presents a few of them (the ones that are the most
popular).
These algorithms assume that a process does not modify its local variable r between
acquire_mutex() and release_mutex() (or, equivalently, that it sets r to 1 before
invoking release_mutex()). The test&set-
based algorithm and the swap-based algorithm are actually the very same algorithm.
Let ri be the local variable used by each process pi. Due to the atomicity property
and the “exchange of values” semantics of the swap() primitive, it is easy to see that
the swap-based algorithm is characterized by the invariant X + Σ1≤i≤n ri = 1.
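A Python sketch of the swap-based algorithm makes this invariant visible: a single token (the value 1) circulates between X and the local variables. The threading.Lock is used only to model the atomicity that the hardware swap() primitive provides.

    import threading

    _atomic = threading.Lock()    # models the atomicity of the hardware primitive
    X = 1                         # invariant: X + (sum of all local values r) = 1

    def swap(v):
        # Atomic exchange of the local value v with the register X.
        global X
        with _atomic:
            X, v = v, X
        return v

    def acquire_mutex():
        r = 0
        while r != 1:             # spin until the unique token is obtained
            r = swap(r)
        return r                  # the caller keeps r = 1 until release_mutex(r)

    def release_mutex(r):
        swap(r)                   # the token (r = 1) goes back into X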
The compare&swap() primitive Let X be a shared register and old and new
be two values. The semantics of the primitive X.compare&swap(old, new), which
returns a Boolean value, is defined by the following code that is assumed to be
executed atomically.
X.compare&swap(old, new) is
if (X = old) then X ← new; return(true)
else return(false)
end if.
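A deadlock-free mutex follows immediately from this primitive; as in the previous sketch, the threading.Lock merely models the hardware-provided atomicity.

    import threading

    _atomic = threading.Lock()    # models the hardware-provided atomicity
    X = "free"

    def compare_and_swap(old, new):
        global X
        with _atomic:             # the whole test-and-update is one atomic step
            if X == old:
                X = new
                return True
            return False

    def acquire_mutex():
        while not compare_and_swap("free", "locked"):
            pass                  # deadlock-free, but a process may lose forever

    def release_mutex():
        global X
        X = "free"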
A problem due to asynchrony The previous primitives allow for the (simple)
design of algorithms that ensure mutual exclusion and deadlock-freedom. These
algorithms do not, however, ensure starvation-freedom.
To obtain starvation-freedom on top of a deadlock-free lock LOCK, the processes
share an array FLAG[1..n] of flags and a register TURN that defines a priority moving
along the ring of processes: at any time, the process pTURN is the process that has
priority and p(TURN mod n)+1 is the next process that will have priority.
• When a process pi invokes acquire_mutex(i) it first raises its flag to inform the
other processes that it is interested in the critical section (line 1). Then, it waits
(repeated checks at line 2) until it has priority (predicate TURN = i) or the process
that is currently given the priority is not interested (predicate FLAG[TURN] =
down). Finally, as soon as it can proceed, it invokes LOCK.acquire_lock(i)
in order to obtain the underlying lock (line 3). (Let us remember that reading
FLAG[TURN] requires two shared memory accesses.)
• When a process pi invokes release_mutex(i), it first resets its flag to down
(line 5). Then, if (from pi ’s point view) the process that is currently given priority
is not interested in the critical section (i.e., the predicate FLAG[TURN] = down
is satisfied), then pi makes TURN progress to the next process (line 6) on the ring
before releasing the underlying lock (line 7).
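The two operations just described (Fig. 2.20) can be transcribed as follows; the DeadlockFreeLock class is a stand-in for any deadlock-free lock (e.g., one built from test&set or compare&swap), and the tags in the comments follow the line numbering used above:

    import threading

    class DeadlockFreeLock:
        # Stand-in for any deadlock-free lock object.
        def __init__(self):
            self._l = threading.Lock()
        def acquire_lock(self, i):
            self._l.acquire()
        def release_lock(self, i):
            self._l.release()

    n = 4
    FLAG = ["down"] * (n + 1)          # identities 1..n; FLAG[0] is unused
    TURN = 1                           # p_TURN currently has priority
    LOCK = DeadlockFreeLock()

    def acquire_mutex(i):
        FLAG[i] = "up"                                  # line 1: announce interest
        while not (TURN == i or FLAG[TURN] == "down"):  # line 2: wait for priority
            pass
        LOCK.acquire_lock(i)                            # line 3: underlying lock

    def release_mutex(i):
        global TURN
        FLAG[i] = "down"                                # line 5
        if FLAG[TURN] == "down":                        # line 6: priority holder not
            TURN = (TURN % n) + 1                       # interested: move the priority
        LOCK.release_lock(i)                            # line 7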
Remark 1 Let us observe that the modification of TURN by a process pi is always
done in the critical section (line 6). This is due to the fact that pi modifies TURN
after it has acquired the underlying mutex lock and before it has released it.
Remark 2 Let us observe that a process pi can stop waiting at line 2 because it finds
TURN = i while another process pj increases TURN to (i mod n) + 1 because it
does not see that FLAG[i] has been set to up. This situation is described in Fig. 2.21.
Theorem 8 Assuming that the underlying mutex lock LOCK is deadlock-free, the
algorithm described in Fig. 2.20 builds a starvation-free mutex lock.
Proof We first claim that, if at least one process invokes acquire_mutex(), then
at least one process invokes LOCK.acquire_lock() (line 3) and enters the critical
section.
[Fig. 2.21: on pj's side, pj reads TURN = i, then reads FLAG[i] = down, and executes line 6, updating TURN to (i mod n) + 1, while on pi's side FLAG[i] has concurrently been set to up.]
Proof of the claim. Let us first observe that, if processes invoke LOCK.acquire_
lock(), one of them enters the critical section (this follows from the fact that the
lock is deadlock-free). Hence, X being the non-empty set of processes that invoke
acquire_mutex(), let us assume by contradiction that no process of X terminates
the wait statement at line 2. It follows from the waiting predicate that TURN ∉ X
and FLAG[TURN] = up. But FLAG[TURN] = up implies TURN ∈ X, which
contradicts the previous waiting predicate and concludes the proof of the claim.
Let pi be a process that has invoked acquire_mutex(). We have to show that
it enters the critical section. Due to the claim, there is a process pk that holds the
underlying lock. If pk is pi, the theorem follows; hence, let pk ≠ pi. When pk exits
the critical section it executes line 6. Let TURN = j when pk reads it. We consider
two cases:
1. FLAG[ j] = up. Let us observe that p j is the only process that can write into
FLAG[ j] and that it will do so at line 5 when it exits the critical section. More-
over, as TURN = j, p j is not blocked at line 2 and consequently invokes
LOCK.acquire_lock() (line 3).
We first show that eventually p j enters the critical section. Let us observe that
all the processes which invoke acquire_mutex() after FLAG[ j] was set to up
and TURN was set to j remain blocked at line 2 (Observation OB). Let Y be
the set of processes that compete with p j for the lock with y = |Y |. We have
0 ≤ y ≤ n − 1. It follows from observation OB and the fact that the lock is
deadlock-free that the number of processes that compete with p j decreases from
y to y − 1, y − 2, etc., until p j obtains the lock and executes line 5 (in the worst
case, p j is the last of the y processes to obtain the lock).
If pi is p j or a process that has obtained the lock before p j , the theorem follows
from the previous reasoning. Hence, let us assume that pi has not obtained the
lock. After pj has obtained the lock, it eventually executes lines 5 and 6. As
TURN = j and pj sets FLAG[j] to down, it follows that pj updates the register
TURN to j′ = (j mod n) + 1. The previous reasoning, where k and j are replaced
by j and j′, is then applied again.
Fast starvation-free mutual exclusion Let us consider the case where a process pi
wants to enter the critical section, while no other process is interested in entering it.
We have the following:
• The invocation of acquire_mutex(i) requires at most three accesses to the shared
memory: one to set the register FLAG[i] to up, one to read TURN and save it in a
local variable turn, and one to read FLAG[turn].
• Similarly, the invocation by pi of release_mutex(i) requires at most four accesses
to the shared memory: one to reset FLAG[i] to down, one to read TURN and save
it in a local variable turn, one to read FLAG[turn], and a last one to update TURN.
It follows from this observation that the stacking of the algorithm of Fig. 2.20
on top of the algorithm described in Fig. 2.14 (Sect. 2.1.7), which implements a
deadlock-free fast mutex lock, provides a fast starvation-free mutex algorithm.
2.2.3 Fetch&Add
The fetch&add() primitive atomically increments a shared register and returns its
value. It allows for a simple starvation-free mutex algorithm: a process first obtains
a unique ticket number by invoking TICKET.fetch&add(), and then waits until the
register NEXT, which stores the ticket number currently allowed to enter, becomes
equal to its ticket; when a process exits the critical section, it increments NEXT.
Let us observe that, while NEXT is an atomic MWMR register, the operation
NEXT ← NEXT + 1 is not atomic. It is easy to see that no increase of NEXT can be
missed. This follows from the fact that the increase statement NEXT ← NEXT + 1
appears in the operation release_mutex(), which is executed by a single process at a
time.
The mutual exclusion property follows from the uniqueness of each ticket number,
and the starvation-freedom property follows from the fact that the ticket numbers are
defined from a sequence of consecutive known values (here the increasing sequence
of positive integers).
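A hedged Python reconstruction of this fetch&add-based algorithm follows (the register names TICKET and NEXT follow the discussion above; the threading.Lock models only the atomicity of the fetch&add() primitive):

    import threading

    _atomic = threading.Lock()     # models the atomicity of fetch&add
    TICKET = 0                     # last ticket number handed out
    NEXT = 1                       # ticket number currently allowed to enter

    def fetch_and_add():
        # Atomically increment TICKET and return its new value (a unique ticket).
        global TICKET
        with _atomic:
            TICKET += 1
            return TICKET

    def acquire_mutex():
        my_turn = fetch_and_add()  # unique ticket number
        while NEXT != my_turn:     # tickets are served in increasing order,
            pass                   # hence starvation-freedom

    def release_mutex():
        global NEXT
        NEXT = NEXT + 1            # not atomic, but executed in mutual exclusion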
This section presents two mutex algorithms which rely on shared read/write registers
weaker than read/write atomic registers. In that sense, they implement atomicity
without relying on underlying atomic objects.
The algorithms described in this section rely on safe registers. As shown here, safe
registers are the weakest type of shared register that one can imagine while still
being useful in the presence of concurrency.
Like an atomic register, a safe register (or a regular register) R provides the processes
with a write operation denoted R.write(v) (or R ← v), where v is the value that is
written and a read operation R.read() (or local ← R, where local is a local variable
of the invoking process). Safe, regular and atomic registers differ in the value returned
by a read operation invoked in the presence of concurrent write operations.
Let us remember that the domain of a register is the set of values that it can contain.
As an example, the domain of a binary register is the set {0, 1}.
SWMR safe register An SWMR safe register is a register whose read operation
satisfies the following properties (the notion of an MWMR safe register will be
introduced in Sect. 2.3.3):
• A read that is not concurrent with a write operation (i.e., their executions do not
overlap) returns the current value of the register.
• A read that is concurrent with one (or several consecutive) write operation(s) (i.e.,
their executions do overlap) returns any value that the register can contain.
It is important to see that, in the presence of concurrent write operations, a read can
return a value that has never been written. The returned value has only to belong to
the register domain. As an example, let the domain of a safe register R be {0, 1, 2, 3}.
Assuming that R = 0, let R.write(2) be concurrent with a read operation. This read
can return 0, 1, 2, or 3. It cannot return 4, as this value is not in the domain of R, but
can return the value 3, which has never been written.
A binary safe register can be seen as modeling a flickering bit. Whatever its
previous value, the value of the register can flicker during a write operation and
stabilizes to its final value only when the write finishes. Hence, a read that overlaps
with a write can arbitrarily return either 0 or 1.
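To make these read semantics concrete, the following deliberately simplified Java model (not from the book) simulates an SWMR safe register with domain {0, 1, 2, 3}: a read that overlaps a write may return any value of the domain, even one that was never written.

import java.util.concurrent.ThreadLocalRandom;

// An illustrative model of an SWMR safe register with domain {0,1,2,3}.
final class SafeRegisterModel {
    private volatile int value = 0;
    private volatile boolean writeInProgress = false;

    void write(int v) {              // single writer
        writeInProgress = true;
        value = v;                   // the register "flickers" during this window
        writeInProgress = false;
    }

    int read() {
        if (writeInProgress) {
            // Concurrent with a write: any value of the domain may come back.
            return ThreadLocalRandom.current().nextInt(4);
        }
        return value;                // no overlap: the current value is returned
    }
}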
SWMR regular register An SWMR regular register is an SWMR safe register
that satisfies the following property. This property addresses read operations in the
presence of concurrency. It replaces the second item of the definition of a safe register.
• A read that is concurrent with one or several write operations returns the value of
the register before these writes or the value written by any of them.
An example of a regular register R (whose domain is the set {0, 1, 2, 3, 4}) written
by a process p1 and read by a process p2 is described in Fig. 2.23. As there is no
concurrent write during the first read by p2 , this read operation returns the current
value of the register R, namely 1. The second read operation is concurrent with three
write operations. It can consequently return any value in {1, 2, 3, 4}. If the register
was only safe, this second read could return any value in {0, 1, 2, 3, 4}.
Atomic register The notion of an atomic register was defined in Sect. 2.1.1. Due
to the total order on all its operations, an atomic register is more constrained (i.e.,
stronger) than a regular register.
[Fig. 2.23 An execution of a regular register R written by p1 and read by p2; the two reads by p2 return 1 and v, respectively.]
[Fig. 2.24 An execution of a binary register R; the five reads by p2 return 1, a, b, 0, and c, respectively.]
To illustrate the differences between safe, regular, and atomic, Fig. 2.24 presents
an execution of a binary register R and Table 2.1 describes the values returned by
the read operations when the register is safe, regular, and atomic. The first and third
read by p2 are issued in a concurrency-free context. Hence, whatever the type of the
register, the value returned is the current value of the register R.
• If R is safe, as the other read operations are concurrent with a write operation,
they can return any value (i.e., 0 or 1 as the register is binary). This is denoted 0/1
in Table 2.1.
It follows that there are eight possible correct executions when the register R is
safe for the concurrency pattern depicted in Fig. 2.24.
• If R is regular, each of the values a and b returned by the read operation which
is concurrent with R.write(0) can be 1 (the value of R before the read oper-
ation) or 0 (the value of R that is written concurrently with the read operation).
Differently, the value c returned by the last read operation can only be 0 (because
the value that is written concurrently does not change the value of R).
It follows that there are only four possible correct executions when the register R
is regular.
• If R is atomic, there are only three possible executions, each corresponding to a
correct sequence of read and write invocations (“correct” means that the sequence
respects the real-time order of the invocations and is such that each read invocation
returns the value written by the immediately preceding write invocation).
Principle of the algorithm The mutex algorithm presented in this section is due to
L. Lamport (1974) who called it the mutex bakery algorithm. It was the first algorithm
ever designed to solve mutual exclusion on top of non-atomic registers, namely on
top of SWMR safe registers. The principle that underlies its design (inspired from
bakeries where a customer receives a number upon entering the store, hence the
algorithm name) is simple. When a process pi wants to acquire the critical section,
it acquires a number x that defines its priority, and the processes enter the critical
section according to their current priorities.
As there are no atomic registers, it is possible that two processes obtain the same
number. A simple way to establish an order for requests that have the same number
consists in using the identities of the corresponding processes. Hence, let a pair ⟨x, i⟩
define the identity of the current request issued by pi. A total order is defined for
the requests competing for the critical section as follows, where ⟨x, i⟩ and ⟨y, j⟩
are the identities of two competing requests; ⟨x, i⟩ < ⟨y, j⟩ means that the request
identified by ⟨x, i⟩ has priority over the request identified by ⟨y, j⟩, where "<" is
defined as the lexicographical ordering on pairs of integers, namely

⟨x, i⟩ < ⟨y, j⟩ ≡ (x < y) ∨ ((x = y) ∧ (i < j)).
Description of the algorithm Two SWMR safe registers, denoted FLAG[i] and
MY _TURN[i], are associated with each process pi (hence these registers can be read
by any process but written only by pi ).
• MY _TURN[i] (which is initialized to 0 and reset to that value when pi exits the
critical section) is used to contain the priority number of pi when it wants to use the
critical section. The domain of MY _TURN[i] is the set of non-negative integers.
• FLAG[i] is a binary control variable whose domain is {down, up}. Initialized to
down, it is set to up by pi while it computes the value of its priority number
MY _TURN[i].
The sequence of values taken by FLAG[i] is consequently described by the regular
expression down (up, down)∗. The reader can verify that a binary safe register whose
write operations of up and down alternate behaves as a regular register.
The algorithm of a process pi is described in Fig. 2.25. When it invokes acquire_
mutex(), process pi enters a “doorway” (lines 1–3) in which it computes its turn
number MY _TURN[i] (line 2). To that end it selects a number greater than all
MY _TURN[ j], 1 ≤ j ≤ n. It is possible that pi reads some MY _TURN[ j] while it is
written by p j . In that case the value obtained from MY _TURN[ j] can be any value.
Moreover, a process informs the other processes that it is computing its turn value by
raising its flag before this computation starts (line 1) and resetting it to down when
it has finished (line 3). Let us observe that a process is never delayed while in the
doorway, which means no process can direct another process to wait in the doorway.
After it has computed its turn value, a process pi enters a “waiting room” (lines
4–7) which consists of a for loop with one loop iteration per process p j . There are
two cases:
• If p j does not want to enter the critical section, we have FLAG[ j] = down ∧
MY _TURN[ j] = 0. In this case, pi proceeds to the next iteration without being
delayed by p j .
• Otherwise, pi waits until FLAG[j] = down (i.e., until pj has finished computing
its turn, line 5) and then waits until either pj has exited the critical section (predicate
MY_TURN[j] = 0) or pi's current request has priority over pj's request (predicate
⟨MY_TURN[i], i⟩ < ⟨MY_TURN[j], j⟩).
When pi has priority with respect to each other process (these priorities being
checked in an arbitrary order, one after the other) it enters the critical section
(line 8).
Finally, when it exits the critical section, the only thing a process pi has to do is
to reset MY _TURN[i] to 0 (line 9).
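The following Java sketch mirrors the doorway/waiting-room structure just described (the line numbers of Fig. 2.25 appear as comments). Java's atomic arrays are far stronger than the safe registers the algorithm actually requires; they are used here only for convenience.

import java.util.concurrent.atomic.AtomicIntegerArray;

// A sketch of the bakery's doorway / waiting-room structure.
final class Bakery {
    private final int n;
    private final AtomicIntegerArray flag;   // 0 = down, 1 = up
    private final AtomicIntegerArray myTurn; // 0 = not competing

    Bakery(int n) {
        this.n = n;
        this.flag = new AtomicIntegerArray(n);
        this.myTurn = new AtomicIntegerArray(n);
    }

    void acquireMutex(int i) {
        flag.set(i, 1);                              // line 1: enter the doorway
        int max = 0;                                 // line 2: compute a turn number
        for (int j = 0; j < n; j++) max = Math.max(max, myTurn.get(j));
        myTurn.set(i, max + 1);
        flag.set(i, 0);                              // line 3: leave the doorway
        for (int j = 0; j < n; j++) {                // lines 4-7: waiting room
            if (j == i) continue;
            while (flag.get(j) == 1) Thread.onSpinWait();        // line 5
            while (true) {                                       // line 6
                int t = myTurn.get(j);
                if (t == 0) break;                   // pj is not competing
                if (myTurn.get(i) < t || (myTurn.get(i) == t && i < j)) break;
                Thread.onSpinWait();
            }
        }
        // the caller now enters the critical section (line 8)
    }

    void releaseMutex(int i) {
        myTurn.set(i, 0);                            // line 9
    }
}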
Remark: process crashes Let us consider the case where a process may crash (i.e.,
stop prematurely). It is easy to see that the algorithm works despite this type of failure
if, after a process pi has crashed, its two registers FLAG[i] and MY _TURN[i] are
eventually reset to their initial values. When this occurs, the process pi is considered
as being no longer interested in the critical section.
A first in first out (FIFO) order As already indicated, the priority of a process
pi over a process p j is defined from the identities of their requests, namely the pairs
MY _TURN[i], i and MY _TURN[ j], j. Moreover, let us observe that it is not
possible to predict the values of these pairs when pi and p j compute concurrently
the values of MY _TURN[i] and MY _TURN[ j].
Let us consider two processes pi and p j that have invoked acquire_mutex() and
where pi has executed its doorway part (line 2) before p j has started executing its
doorway part. We will see that the algorithm guarantees a FIFO order property defined
as follows: pi terminates its invocation of acquire_mutex() (and consequently enters
the critical section) before p j . This FIFO order property is an instance of the bounded
bypass liveness property with f (n) = n − 1.
Definitions The following time instant definitions are used in the proof of
Theorem 9. Let px be a process. Let us remember that, as the read and write operations
on the registers are not atomic, they cannot be abstracted as having been executed
instantaneously. Hence, when considering the execution of such an operation, its
starting time and its end time are considered instead.
The number that appears in the following definitions corresponds to a line number
(i.e., to a register operation). Moreover, “b” stands for “beginning” while “e” stands
for “end”.
1. τ_e^x(1) is the time instant at which px terminates the assignment FLAG[x] ← up
(line 1).
2. τ_e^x(2) is the time instant at which px terminates the execution of line 2. Hence,
at time τ_e^x(2) the non-atomic register MY_TURN[x] contains the value used by
px to enter the critical section.
3. τ_b^x(3) is the time instant at which px starts the execution of line 3. This means that
a process that reads FLAG[x] during the time interval [τ_e^x(1)..τ_b^x(3)] necessarily
obtains the value up.
4. τ_b^x(5, y) is the time instant at which px starts its last evaluation of the waiting
predicate (with respect to FLAG[y]) at line 5. This means that px has obtained
the value down from FLAG[y].
5. Let us notice that, as it is the only process which writes into MY_TURN[x],
px can save its value in a local variable. This means that the reading of
MY_TURN[x] entails no access to the shared memory. Moreover, as far as a
register MY_TURN[y] (y ≠ x) is concerned, we consider that px reads it once
each time it evaluates the predicate of line 6.
τ_b^x(6, y) is the time instant at which px starts its last reading of MY_TURN[y].
Hence, the value turn it reads from MY_TURN[y] is such that (turn =
0) ∨ ⟨MY_TURN[x], x⟩ < ⟨turn, y⟩.
Lemma 1 Let pi and pj be two processes such that pi is in the bakery before pj
enters the doorway. Then MY_TURN[i] < MY_TURN[j].

Proof Let turn_i be the value used by pi at line 6. As pi is in the bakery (i.e., exe-
cuting lines 4–9) before pj enters the doorway (line 2), it follows that MY_TURN[i]
was assigned the value turn_i before pj reads it at line 2. Hence, when pj reads the
safe register MY_TURN[i], there is no concurrent write and pj consequently obtains
the value turn_i. It follows that the value turn_j assigned by pj to MY_TURN[j] is
such that turn_j ≥ turn_i + 1, from which the lemma follows.
Lemma 2 Let pi and pj be two processes such that pi is inside the critical section
while pj is in the bakery. Then ⟨MY_TURN[i], i⟩ < ⟨MY_TURN[j], j⟩.
Proof Let us first notice that, as pj is inside the bakery, it may itself be inside the
critical section.
As process pi is inside the critical section, it has read down from FLAG[ j] at
line 5 (and exited the corresponding wait statement). It follows that, according to the
timing of this read of FLAG[ j] that returned the value down to pi and the updates
of FLAG[ j] by p j to up at line 1 or down at line 3 (the only lines where FLAG[ j]
is modified), there are two cases to consider (Fig. 2.26).
[Fig. 2.26 The two cases where pj updates the safe register FLAG[j]]

As pi reads down from FLAG[j], we have either τ_b^i(5, j) < τ_e^j(1) or τ_e^i(5, j) >
τ_b^j(3) (see Fig. 2.26). This is because, if we had both τ_b^i(5, j) > τ_e^j(1) and
τ_e^i(5, j) < τ_b^j(3), the read of FLAG[j] by pi would have been entirely contained
in a period during which FLAG[j] stably contains up, and pi would consequently
have read up from FLAG[j]. Let us consider each case:
• Case 1: τ_b^i(5, j) < τ_e^j(1) (left part of Fig. 2.26). In this case, process pi has entered
the bakery before process pj enters the doorway. It then follows from Lemma 1
that MY_TURN[i] < MY_TURN[j], which proves the lemma for this case.
• Case 2: τ_e^i(5, j) > τ_b^j(3) (right part of Fig. 2.26). As pj is sequential, we have
τ_e^j(2) < τ_b^j(3) (P1). Similarly, as pi is sequential, we also have τ_e^i(5, j) < τ_b^i(6, j)
(P2). Combining (P1), (P2), and the case assumption, namely τ_b^j(3) < τ_e^i(5, j), we
obtain

τ_e^j(2) < τ_b^j(3) < τ_e^i(5, j) < τ_b^i(6, j),

i.e., τ_e^j(2) < τ_b^i(6, j), from which we conclude that the last read of
MY_TURN[j] by pi occurred after the safe register MY_TURN[j] obtained its
value (say turn_j).
As pi is inside the critical section (lemma assumption), it exited the second wait
statement because (MY_TURN[j] = 0) ∨ ⟨MY_TURN[i], i⟩ < ⟨MY_TURN[j], j⟩.
Moreover, as pj was in the bakery before pi executed line 6 (τ_e^j(2) < τ_b^i(6, j)), we
have MY_TURN[j] = turn_j ≠ 0. It follows that we have ⟨MY_TURN[i], i⟩ <
⟨MY_TURN[j], j⟩, which terminates the proof of the lemma.
Theorem 9 Lamport’s bakery algorithm satisfies mutual exclusion and the bounded
bypass liveness property where f (n) = n − 1.
Proof Proof of the mutual exclusion property. The proof is by contradiction. Let
us assume that pi and pj (i ≠ j) are simultaneously inside the critical section. We
have the following:
• As pi is inside the critical section and pj is inside the bakery, we can apply Lemma
2. We then obtain ⟨MY_TURN[i], i⟩ < ⟨MY_TURN[j], j⟩.
• Similarly, as pj is inside the critical section and pi is inside the bakery, applying
Lemma 2, we obtain ⟨MY_TURN[j], j⟩ < ⟨MY_TURN[i], i⟩.
As i ≠ j, the pairs ⟨MY_TURN[j], j⟩ and ⟨MY_TURN[i], i⟩ are totally ordered.
It follows that each item contradicts the other, from which the mutex property follows.
Proof of the FIFO order liveness property. The proof shows first that the algo-
rithm is deadlock-free. It then shows that the algorithm satisfies the bounded
bypass property where f(n) = n − 1 (i.e., the FIFO order as defined on the pairs
⟨MY_TURN[x], x⟩).
The proof that the algorithm is deadlock-free is by contradiction. Let us assume
that processes have invoked acquire_mutex() and no process exits the waiting room
(lines 4–7). Let Q be this set of processes. (Let us notice that, for any other process
p j , we have FLAG[ j] = down and MY _TURN[ j] = 0.) As the number of processes
is bounded and no process has to wait in the doorway, there is a time after which
we have ∀ j ∈ {1, . . . , n} : FLAG[ j] = down, from which we conclude that no
process of Q can be blocked forever in the wait statement of line 5.
By construction, the pairs ⟨MY_TURN[x], x⟩ of the processes px ∈ Q are totally
ordered. Let ⟨MY_TURN[i], i⟩ be the smallest one. It follows that, eventually, when
evaluated by pi , the predicate associated with the wait statement of line 6 is satisfied
for any j. Process pi then enters the critical section, which contradicts the deadlock
assumption and proves that the algorithm is deadlock-free.
To show the FIFO order liveness property, let us consider a pair of processes pi
and p j that are competing for the critical section and such that p j wins and after
exiting the critical section it invokes acquire_mutex( j) again, executes its doorway,
and enters the bakery. Moreover, let us assume that pi is still waiting to enter the
critical section. Let us observe that we are then in the context defined in Lemma 1: pi
and p j are in the bakery and pi entered the bakery before p j enters the doorway.
We then have MY _TURN[i] < MY _TURN[ j], from which we conclude that p j
cannot bypass pi again. As there are n processes, in the worst case pi is competing
with all other processes. Due to the previous observation and the fact that there is
no deadlock, it can lose at most n − 1 competitions (one with respect to each other
process pj, which enters the critical section before pi), which proves the bounded
bypass liveness property with f(n) = n − 1.
This section presents a second mutex algorithm which does not require underlying
atomic registers. This algorithm is due to A. Aravind (2011). Its design principles
differ from those of the bakery algorithm.
Principle of the algorithm The idea that underlies the design of this algorithm is to
associate a date with each request issued by a process and favor the competing process
which has the oldest (smallest) request date. To that end, the algorithm ensures that
(a) the dates associated with requests are increasing and (b) no two process requests
have the same date.
More precisely, let us consider a process pi that exits the critical section. The
date of its next request (if any) is computed in advance when, just after pi has used
the critical section, it executes the corresponding release_mutex() operation. In that
way, the date of the next request of a process is computed while this process is still
“inside the critical section”. As a consequence, the sequence of dates associated with
the requests is an increasing sequence of consecutive integers and no two requests
(from the same process or different processes) are associated with the same date.
From a liveness point of view, the algorithm can be seen as ensuring a least
recently used (LRU) priority: the competing process whose previous access to the
critical section is the oldest (with respect to request dates) is given priority when it
wants to enter the critical section.
Safe registers associated with each process The following three SWMR safe
registers are associated with each process pi :
• FLAG[i], whose domain is {down, up}. It is initialized to up when pi wants to
enter the critical section and reset to down when pi exits the critical section.
• If pi is not competing for the critical section, the safe register DATE[i] contains the
(logical) date of its next request to enter the critical section. Otherwise, it contains
the logical date of its current request.
DATE[i] is initialized to i. Hence, no two processes start with the same date for
their first request. As already indicated, pi will compute its next date (the value
that will be associated with its next request for the critical section) when it exits
the critical section.
• STAGE[i] is a binary control variable whose domain is {0, 1}. Initialized to 0,
it is set to 1 by pi when pi sees DATE[i] as being the smallest date among the
dates currently associated with the processes that it perceives as competing for the
critical section. The sequence of successive values taken by STAGE[i] (including
its initial value) is defined by the regular expression 0((0, 1)+ , 0)∗ .
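As the figure containing Aravind's algorithm is not reproduced in this text, the following Java sketch reconstructs its structure from the line references used in the proof below (date predicate at line 3, STAGE check at line 5, date update at line 7, STAGE reset at line 8); it should be read as an approximation of the design idea, not as the book's exact code.

import java.util.concurrent.atomic.AtomicIntegerArray;

// A sketch of the date-based (LRU) priority idea described above.
final class DatePriorityLock {
    private final int n;
    private final AtomicIntegerArray flag;  // 0 = down, 1 = up
    private final AtomicIntegerArray stage; // 0 or 1
    private final AtomicIntegerArray date;  // DATE[i], initialized to i+1

    DatePriorityLock(int n) {
        this.n = n;
        flag = new AtomicIntegerArray(n);
        stage = new AtomicIntegerArray(n);
        date = new AtomicIntegerArray(n);
        for (int i = 0; i < n; i++) date.set(i, i + 1); // no two initial dates equal
    }

    void acquireMutex(int i) {
        flag.set(i, 1);                                 // announce the request
        while (true) {
            stage.set(i, 0);
            // line 3: wait until pi's date is the smallest among competitors
            while (!smallestDate(i)) Thread.onSpinWait();
            stage.set(i, 1);
            // line 5: enter only if no other process is in stage 1
            boolean alone = true;
            for (int j = 0; j < n; j++)
                if (j != i && stage.get(j) == 1) { alone = false; break; }
            if (alone) return;
        }
    }

    private boolean smallestDate(int i) {
        for (int j = 0; j < n; j++)
            if (j != i && flag.get(j) == 1 && date.get(j) < date.get(i)) return false;
        return true;
    }

    void releaseMutex(int i) {
        flag.set(i, 0);                                 // FLAG reset at exit (cf. text)
        int max = 0;                                    // line 7: next date, computed
        for (int j = 0; j < n; j++) max = Math.max(max, date.get(j));
        date.set(i, max + 1);                           // while still "inside the CS"
        stage.set(i, 0);                                // line 8: STAGE[i] <- 0
    }
}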
… from which we conclude τ_e^i(4) < τ_b^j(5, i), i.e., the last read of STAGE[i] by pj at line
5 started after pi had written 1 into it. Hence, the last read of STAGE[i] by pj returned
1, which contradicts the fact that it is inside the critical section simultaneously with
pi. (A similar reasoning shows that, if pj is inside the critical section, pi cannot be.)
Before proving the liveness property, let us notice that at most one process at a
time can modify the array DATE[1..n]. This follows from the fact that the algorithm
satisfies the mutual exclusion property (proved above) and line 7 is executed by
a process pi before it resets STAGE[i] to 0 (at line 8), which is necessary to allow
another process p j to enter the critical section (as the predicate of line 5 has to be true
when evaluated by p j ). It follows from the initialization of the array DATE[1..n] and
the previous reasoning that no two requests can have the same date, and that the
sequence of dates computed in mutual exclusion at line 7 by the processes is the
sequence of natural numbers (Observation OB).
As in the proof of Lamport’s algorithm, let us first prove that there is no deadlock.
Let us assume (by contradiction) that there is a non-empty set of processes Q that have
invoked acquire_mutex() and no process succeeds in entering the critical section.
Let pi be the process of Q with the smallest date. Due to observation OB, pi is
uniquely defined. It then follows that, after some finite time, pi is the only process
whose predicate at line 3 is satisfied. Hence, after some time, pi is the only process
such that STAGE[i] = 1, which allows it to enter the critical section. This contradicts
the initial assumption and proves the deadlock-freedom property.
As a single process at a time can modify its entry of the array DATE, it follows
that a process p j that exits the critical section updates its register DATE[ j] to a
value greater than all the values currently kept in DATE[1..n]. Consequently, after
p j has executed line 7, all the other processes pi which are currently competing
for the critical section are such that DATE[i] < DATE[ j]. Hence, as we now have
(FLAG[i] = up) ∧ (DATE[i] < DATE[ j]), the next request (if any) issued by p j
cannot bypass the current request of pi , from which the starvation-freedom property
follows.
Moreover, it also follows from the previous reasoning that, if pi and p j are
competing and p j wins, then as soon as p j has exited the critical section pi has
priority over p j and can no longer be bypassed by it. This is nothing else than the
bounded bypass property with f (n) = n − 1 (which defines a FIFO order property).
Bounded mutex algorithm Each safe register MY_TURN[i] of Lamport's algo-
rithm and each safe register DATE[i] of Aravind's algorithm can take arbitrarily large
values. It is shown in the following how a simple modification of Aravind's algorithm
allows for bounded dates. This modification relies on the notion of an MWMR safe
register.
MWMR safe register An MWMR safe register is a safe register that can be written
and read by several processes. When the write operations are sequential, an MWMR
safe register behaves as an SWMR safe register. When write operations are concur-
rent, the value written into the register is any value of its domain (not necessarily a
value of a concurrent write).
Said differently, an algorithm based on MWMR safe registers has to prevent write
operations on the same register from being concurrent, so that every write is
meaningful. The behavior of an MWMR safe register is then similar to the behavior
of an SWMR safe register in which the "single writer" is implemented by several
processes that never write at the same time.
From unbounded dates to bounded dates Let us now consider that each safe
register DATE[i], 1 ≤ i ≤ n, is an MWMR safe register: any process pi can write
any register DATE[ j]. MWMR safe registers allow for the design of a (particularly
simple) bounded mutex algorithm. The domain of each register DATE[ j] is now
[1..N ] where N ≥ 2n. Hence, all registers are safe and have a bounded domain.
In the following we consider N = 2n. A single bit is needed for each safe register
FLAG[j] and each safe register STAGE[j], and only ⌈log₂ N⌉ bits are needed for
each safe register DATE[j].
In a very interesting way, no statement has to be modified to obtain a bounded
version of the algorithm. A single new statement has to be added, namely the insertion
of the following line 7′ between line 7 and line 8:

(7′) if (DATE[i] ≥ N) then for all j ∈ [1..n] do DATE[j] ← j end for end if.
This means that, when a process pi exiting the critical section updates its register
DATE[i] and this update is such that DATE[i] ≥ N , pi resets all date registers
to their initial values. As for line 7, this new line is executed before STAGE[i] is
reset to 0 (line 8), from which it follows that it is executed in mutual exclusion and
consequently no two processes can concurrently write the same MWMR safe register
DATE[ j]. Hence, the MWMR safe registers are meaningful.
Moreover, it is easy to see that the date resetting mechanism is such that each
date d, 1 ≤ d ≤ n, is used only by process pd , while each date d, n + 1 ≤ d ≤ 2n
can be used by any process. Hence, ∀d ∈ {1, . . . , n} we have DATE[d] ∈ {d, n +
1, n + 2, . . . , 2n}.
Theorem 11 When considering Aravind's mutual exclusion algorithm enriched with
line 7′ with N ≥ 2n, a process encounters at most one reset of the array DATE[1..n]
while it is executing acquire_mutex().
Proof Let pi be a process that executes acquire_mutex() while a reset of the array
DATE[1..n] occurs. If pi is the next process to enter the critical section, the theorem
follows. Otherwise, let p j be the next process which enters the critical section. When
p j exits the critical section, DATE[ j] is updated to max(DATE[1], . . . , DATE[n]) +
1 = n + 1. We then have FLAG[i] = up and DATE[i] < DATE[ j]. It follows that,
if there is no new reset, p j cannot enter again the critical section before pi .
In the worst case, after the reset, all the other processes are competing with pi
and pi is pn (hence, DATE[i] = n, the greatest date value after a reset). Due to
line 3 and the previous observation, each other process p j enters the critical section
before pi and max(DATE[1], . . . , DATE[n]) becomes equal to n + (n − 1). As
2n − 1 < 2n ≤ N , none of these processes issues a reset. It follows that pi enters
the critical section before the next reset. (Let us notice that, after the reset, the
invocation issued by pi can be bypassed only by invocations, whether pending before
the reset or issued after it, that come from processes pj such that j < i.)
The following corollary is an immediate consequence of the previous theorem.
Corollary 2 Let N ≥ 2n. Aravind's mutual exclusion algorithm enriched with line
7′ satisfies the starvation-freedom property.
(Different progress conditions that this algorithm can ensure are investigated in
Exercise 6.)
Bounding the domain of the safe registers has a price. More precisely, the addition
of line 7′ has an impact on the maximal number of bypasses, which can now increase
up to f(n) = 2n − 2. This is because, in the worst case where all the processes always
compete for the critical section, before it is allowed to access the critical section, a
process can be bypassed (n − 1) times just before a reset of the array DATE and, due
to the new values of DATE[1..n], it can again be bypassed (n − 1) times just after
the reset.
2.4 Summary
This chapter has presented three families of algorithms that solve the mutual exclu-
sion problem. These algorithms differ in the properties of the base operations they
rely on to solve mutual exclusion.
Mutual exclusion is one way to implement atomic objects. Interestingly, it was
shown that implementing atomicity does not require the underlying read and write
operations to be atomic.
2.5 Bibliographic Notes

• The reader will find surveys on mutex algorithms in [24, 231, 262]. Mutex algo-
rithms are also described in [41, 146].
• Peterson’s algorithm for two processes and its generalization to n processes are
presented in [224].
The first tournament-based mutex algorithm is due to G.L. Peterson and M.J.
Fischer [227].
A variant of Peterson’s algorithm in which all atomic registers are SWMR registers
due to J.L.W. Kessels is presented in [175].
• The contention-abortable mutex algorithm is inspired from Lamport’s fast mutex
algorithm [191]. Fischer’s synchronous algorithm is described in [191].
Lamport’s fast mutex algorithm gave rise to the splitter object as defined in [209].
The notion of fast algorithms has given rise to the notion of adaptive algorithms
(algorithms whose cost is related to the number of participating processes) [34].
• The general construction from deadlock-freedom to starvation-freedom that was
presented in Sect. 2.2.2 is from [262]. It is due to Y. Bar-David.
• The notions of safe, regular, and atomic read/write registers are due to L. Lamport.
They are presented and investigated in [188, 189]. The first intuition on these types
of registers appears in [184].
It is important to insist on the fact that “non-atomic” does not mean “arbiter-free”.
As defined in [193], “An arbiter is a device that makes a discrete decision based on
a continuous range of values”. Binary arbiters are the most popular. Actually, the
implementation of a safe register requires an arbiter. The notion of arbitration-free
synchronization is discussed in [193].
• Lamport’s bakery algorithm is from [183], while Aravind’s algorithm and its
bounded version are from [28].
• A methodology based on model-checking for automatic discovery of mutual exclu-
sion algorithms has been proposed by Y. Bar-David and G. Taubenfeld [46]. Inter-
estingly enough, this methodology is both simple and computationally feasible.
New algorithms obtained in this way are presented in [46, 262].
• Techniques (and corresponding algorithms) suited to the design of locks for
NUMA and CC-NUMA architectures are described in [86, 200]. These techniques
take into account non-uniform memories and caching hierarchies.
• A combiner is a thread which, using a coarse-grain lock, serves (in addition to its
own synchronization request) active requests announced by other threads while
they are waiting by performing some form of spinning. Two implementations of
such a technique are described in [173]. The first addresses systems that support
coherent caches, whereas the second works better in cacheless NUMA architec-
tures.
2.6 Exercises and Problems

1. Peterson's algorithm for two processes uses an atomic register denoted TURN
that is written and read by both processes. Design a two-process mutual exclusion
algorithm (similar to Peterson’s algorithm) in which the register TURN is replaced
by two SWMR atomic registers TURN[i] (which can be written only by pi) and
TURN[j] (which can be written only by pj). The algorithm will be described for
pi where i ∈ {0, 1} and j = (i + 1) mod 2.
Solution in [175].
2. Considering the tournament-based mutex algorithm, show that if the base two-
process mutex algorithm is deadlock-free then the n-process algorithm is
deadlock-free.
3 Lock-Based Concurrent Objects

After having introduced the notion of a concurrent object, this chapter presents
lock-based methodologies to implement such objects. The first one is based on a
low-level synchronization object called a semaphore. The other ones are based on
linguistic constructs. One of these constructs is based on an imperative approach
(monitor construct), while the other one is based on a declarative approach (path
expression construct). This chapter closes the first part of the book devoted to lock-
based synchronization.
value ⊥ when the stack is empty. Hence, both operations can always be executed
whatever the current state of the stack (such operations are said to be total). The
sequential specification of such a stack is the set of all the sequences of push() and
pop() operations that satisfy the “last in, first out” (LIFO) property (“last in” being ⊥
when the stack is empty). Differently, as indicated in the first chapter, a rendezvous
object has no sequential specification.
operation C_STACK.push(v) is
LOCK.acquire_lock(); r ← S_STACK.push(v); LOCK.release_lock(); return(r)
end operation.
operation C_STACK.pop() is
LOCK.acquire_lock(); r ← S_STACK.pop(); LOCK.release_lock(); return(r)
end operation.
first uses a sequential stack S_STACK1 and a lock instance LOCK1, while the sec-
ond uses another sequential stack S_STACK2 and another lock instance LOCK2.
Hence, as LOCK1 and LOCK2 are distinct locks, the operations on C_STACK1 and
C_STACK2 are not prevented from being concurrent.
S = s0 + #(S.up) − #(S.down).
64 3 Lock-Based Concurrent Objects
Hence, when it is negative, the implementation counter S.count provides us with the
number of processes currently blocked on the semaphore S. Differently, when it is
non-negative, the value of S.count is the value of the semaphore S.
operation S.down() is
S.count ← S.count − 1;
if (S.count < 0) then
the invoking process is blocked and added to the queue S.queue; the control is given to the scheduler
end if
end operation

operation S.up() is
S.count ← S.count + 1;
if (S.count ≤ 0) then
remove the first process in S.queue, which can now be assigned a processor
end if;
end operation.
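In Java, java.util.concurrent.Semaphore provides exactly this down/up pair under the names acquire() and release(); a fair semaphore additionally serves blocked threads in FIFO order. A minimal usage sketch:

import java.util.concurrent.Semaphore;

// acquire() plays the role of down() (it may block), and release()
// plays the role of up() (it may reactivate a blocked thread).
final class SemaphoreDemo {
    public static void main(String[] args) throws InterruptedException {
        Semaphore s = new Semaphore(1, true); // s0 = 1, FIFO (fair) queue
        s.acquire();                          // S.down()
        try {
            System.out.println("inside the critical section");
        } finally {
            s.release();                      // S.up()
        }
    }
}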
The base read and write operations on BUF[x] are denoted BUF[x].read() and
BUF[x].write().
– in and out are two local variables containing array indexes whose domain is
[0..(k − 1)]; in is used by the producer to point to the next entry of BUF where
an item can be deposited; out is used by the consumer to point to the next entry
of BUF from which an item can be consumed. The law that governs the progress
of these index variables is the addition mod k, and we say that the buffer is
circular.
• Control part. This part comprises the synchronization objects that allow the
processes to never violate the buffer invariant. It consists of two semaphores:
– The semaphore FREE counts the number of entries of the array BUF that
can currently be used to deposit new items. This semaphore is initialized
to k.
– The semaphore BUSY counts the number of entries of the array BUF that cur-
rently contain items produced and not yet consumed. This semaphore is initial-
ized to 0.
operation B.produce(v) is
(1) FREE.down();
(2) BUF[in].write(v);
(3) in ← (in + 1) mod k;
(4) BUSY.up()
end operation

operation B.consume() is
(5) BUSY.down();
(6) r ← BUF[out].read(); out ← (out + 1) mod k;
(7) FREE.up();
(8) return(r)
end operation
When the consumer invokes B.consume(), it first checks if there is an entry in the
array BUF that contains an item not yet consumed. The semaphore BUSY is used to
that end (line 5). When it is allowed to consume, the consumer consumes the next
item value, i.e., the one kept in BUF[out], saves it in a local variable r, and increases
the index out (line 6). Finally, before returning the value saved in r (line 8), the
consumer signals that one entry has been freed; this is done by increasing the value
of the semaphore FREE (line 7).
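A runnable Java counterpart of this two-semaphore circular buffer is easy to write for the single-producer/single-consumer case (the class and method names below are illustrative):

import java.util.concurrent.Semaphore;

// The circular buffer described above, for one producer and one consumer
// (so the indexes in/out can be plain fields, each used by a single thread).
final class BoundedBuffer<T> {
    private final T[] buf;
    private final Semaphore free;                    // empty entries, init k
    private final Semaphore busy = new Semaphore(0); // full entries, init 0
    private int in = 0, out = 0;

    @SuppressWarnings("unchecked")
    BoundedBuffer(int k) {
        buf = (T[]) new Object[k];
        free = new Semaphore(k);
    }

    void produce(T v) throws InterruptedException {
        free.acquire();                  // line 1: FREE.down()
        buf[in] = v;                     // line 2
        in = (in + 1) % buf.length;      // line 3
        busy.release();                  // line 4: BUSY.up()
    }

    T consume() throws InterruptedException {
        busy.acquire();                  // line 5: BUSY.down()
        T r = buf[out];                  // line 6
        out = (out + 1) % buf.length;
        free.release();                  // line 7: FREE.up()
        return r;                        // line 8
    }
}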
Remark 1 It is important to repeat that, for any x, a register BUF[x] is not required
to satisfy special semantic requirements and that the value that is written (read) can be
of any type and as large as we want (e.g., a big file). This is an immediate consequence
of the fact that each register BUF[x] is accessed in mutual exclusion. This means that
what is abstracted as a register BUF[x] does not have to be constrained in any way. As
an example, the operation BUF[in].write(v) (line 2) can abstract several low-level
write operations involving accesses to underlying disks which implement the register
(and similarly for the operation BUF[out].read() at line 6). Hence, the size of the items
that are produced and consumed can be arbitrarily large, and reading and writing them
can take arbitrary (but finite) durations. This means that one can reasonably assume
that the duration of the operations BUF[in].write(v) and BUF[in].read() (i.e., the
operations which are in the data part of the algorithms) is usually several orders of
magnitude greater than the execution of the rest of the algorithms (which is devoted
to the control part).
Remark 2 It is easy to see that the values taken by the semaphores FREE and
BUSY are such that 0 ≤ FREE, BUSY ≤ k, but it is important to remark that a
semaphore object does not offer an operation such as FREE.read() that would return
the exact value of FREE. Actually, such an operation would be useless because there
is no guarantee that the value returned by FREE.read() is still meaningful when the
invoking process would use it (FREE may have been modified by FREE.up() or
FREE.down() just after its value was returned).
A semaphore S can be seen as an atomic register that could be modified by the
operations fetch&add() and fetch&sub() (which atomically add 1 and subtract 1,
respectively) with the additional constraint that S can never become negative.
The case of a buffer with a single entry This is the case k = 1. Each of the
semaphores FREE and BUSY takes then only the values 0 or 1. It is interesting
to look at the way these values are modified. A corresponding cycle of produc-
tion/consumption is depicted in Fig. 3.5.
Initially the buffer is empty, FREE = 1, and BUSY = 0. When the producer
starts to deposit a value, the semaphore FREE decreases from 1 to 0 and the buffer
starts to being filled. When it has been filled, the producer raises BUSY from 0 to
1. Hence, FREE = 0 ∧ BUSY = 1 means that a value has been deposited and can
be consumed. When the consumer wants to read, it first decreases the semaphore
BUSY from 1 to 0 and then reads the value kept in the buffer. When, the reading is
terminated, the consumer signals that the buffer is empty by increasing FREE from
68 3 Lock-Based Concurrent Objects
If there are several producers (consumers) the previous solution no longer works,
because the control register in (out) now has to be an atomic register shared by all
producers (consumers). Hence, the local variables in and out are replaced by the
atomic registers IN and OUT. Moreover, (assuming k > 1) the read and update
operations on each of these atomic registers have to be executed in mutual exclusion
so that no two producers simultaneously obtain the same value of IN, which
could entail the write of an arbitrary value into the corresponding entry of BUF.
(And similarly for OUT.)
A simple way to solve this issue consists in adding two semaphores initialized
to 1, denoted MP and MC. The semaphore MP is used by the producers to ensure
that at most one process at a time is allowed to execute B.produce(); (similarly
MC is used to ensure that no two consumers concurrently execute B.consume()).
Albeit correct, such a solution can be very inefficient. Let us consider the case of
a producer p1 that is very slow while another producer p2 is very rapid. If both p1
and p2 simultaneously invoke produce() and p1 wins the competition, p2 is forced to
wait for a long time before being able to produce. Moreover, if there are several free
entries in BUF[0..(k −1)], it should be possible for p1 and p2 to write simultaneously
in two different free entries of the array.
Additional control data To that end, in addition to the buffer BUF[0..(k − 1)] and
the atomic registers IN and OUT , two arrays of atomic Boolean registers denoted
[Fig. 3.6 Behavior of the Boolean registers FULL[x] and EMPTY[x]]
FULL[0..(k − 1)] and EMPTY [0..(k − 1)] are used. They are such that, for every x,
the pair FULL[x], EMPTY [x] describes the current state of BUF[x] (full, empty,
being filled, being emptied). These registers have similar behaviors, one from the
producer point of view and the other one from the consumer point of view. More
precisely, we have the following:
• FULL[x] (which is initialized to false) is set to true by a producer p just after it has
written a new item value in BUF[x]. In that way, p informs the consumers that the
value stored in BUF[x] can be consumed. FULL[x] is reset to false by a consumer
c just after it has obtained the right to consume the item value kept in BUF[x]. In
that way, c informs the other consumers that the value in BUF[x] is not for them.
To summarize: FULL[x] ⇔ (BUF[x] can be read by a consumer) (Fig. 3.6).
• EMPTY[x] (which is initialized to true) is set to true by a consumer c just after
it has read the item value kept in BUF[x]. In that way, the consumer c informs the
producers that BUF[x] can be used again to deposit a new item value. EMPTY[x]
is reset to false by a producer p just before it writes a new item value in BUF[x]. In
that way, the producer p informs the other producers that BUF[x] is reserved and
they cannot write into it. To summarize: EMPTY[x] ⇔ (BUF[x] can be written
by a producer).
[Fig. 3.7 The multi-producer/multi-consumer buffer: operation B.produce(v) (lines 1–11, with a while loop at line 3 in which a producer looks for an empty entry of BUF) and operation B.consume() (lines 12–22, with a while loop at line 14 in which a consumer looks for a full entry); the code itself is not reproduced here.]
The invariant ∀x : ¬(FULL[x] ∧ EMPTY[x])
characterizes the buffer implementation given in Fig. 3.7. This invariant states that
no BUF[x] can be simultaneously full and empty. Other facts can be deduced from
Fig. 3.7:
– ¬EMPTY[x] ∧ ¬FULL[x] ⇒ BUF[x] is currently being filled or emptied,
– FULL[x] ⇒ BUF[x] contains a value not yet consumed, and
– EMPTY[x] ⇒ BUF[x] does not contain a value to be consumed.
Let us observe that this priority scheme is without preemption: when a process p
has obtained the resource, it keeps it until it invokes release(), whatever the priority
of the requests that have been issued by other processes after p was granted the
resource.
Principle of the solution and base objects The object we are building can be
seen as a room made up of two parts: a blackboard room where processes can post
information to inform the other processes plus a sleeping room where processes go
to wait when the resource is used by another process (see Fig. 3.8).
In order for the information on the blackboard to be consistent, at most one
process at a time can access the room. To that end a semaphore denoted MUTEX and
initialized to 1 is used by the processes.
[Fig. 3.8 The blackboard room and the sleeping room: a semaphore MUTEX, a Boolean register BUSY, and registers describing the pending requests (blackboard room), plus one sleeping-chair semaphore SLEEP_CHAIR[i] per process pi (sleeping room).]
[Fig. 3.9 Resource allocation with priority: operation acquire() (lines 1–8) and operation release() (lines 9–16); the code itself is not reproduced here. A process that finds the resource busy registers its request and blocks on its sleeping chair with SLEEP_CHAIR[i].down() (lines 3–4); a process that finds the resource free takes it with BUSY ← true (line 5); a process that releases the resource wakes up the waiting process with the highest priority with SLEEP_CHAIR[k].up() (line 12), or executes BUSY ← false if no process is waiting (line 13).]
Remark Let us observe that, due to asynchrony, it is possible that a process pi wakes
up a waiting process pk (by executing SLEEP_CHAIR[k].up(), line 12) before pk
has executed SLEEP_CHAIR[k].down() (this can occur if pk takes a long time to
go from line 3 to line 4). The reader may check that this does not cause problems:
actually, the slowness of pk between line 3 and line 4 has the same effect as if pk were
waiting for SLEEP_CHAIR[k] to become positive.
On the way priorities are defined The previous resource allocation algorithm is
very general. The way priorities are defined does not depend on it. Priorities can be
statically associated with processes, or can be associated with each invocation of the
operation acquire(), independently of one another.
A token metaphor The algorithms in Fig. 3.9 can be seen as a token management
algorithm. There is initially a single token deposited on a table (this is expressed by
the predicate BUSY = false).
When there is no competition a process takes the token which is on the table
(statement BUSY ← true, line 5), and when it has finished using the resource, it
deposits it on the table (statement BUSY ← false, line 13).
When there is competition, the management of the token is different. The
process pi that owns the token gives it directly to a waiting process pk (state-
ment SLEEP_CHAIR[k].up(), line 12) and pk obtains it when it terminates executing
SLEEP_CHAIR[k].down() (line 4). In that case, due to priorities, the transmission
of the token is "direct" from pi to pk. The Boolean register BUSY is not reset to
false by pi; it remains equal to true until the current token holder deposits the
token on the table at line 13 because no process wants to access the resource.
The readers-writers problem Let us consider a file that can be accessed by a single
process by the operations read_file() and write_file(). The readers-writers problem
consists in designing a concurrent object that allows several processes to access this
file in a consistent way. Consistent means here that any number of processes can
simultaneously read the file, but at most one process at a time can write the file, and
writing the file and reading the file are mutually exclusive.
To that end, an approach similar to the one used in Fig. 3.1 to go from a sequen-
tial stack to a concurrent stack can be used. As reading a file does not modify it,
using a lock would be too constraining (as it would allow at most one read at a
time). In order to allow several read operations to execute simultaneously, let us
instead design a concurrent object defined by four operations denoted begin_read(),
end_read(), begin_write(), and end_write(), and let us use them to bracket the opera-
tions read_file() and write_file(), respectively. More precisely, the high-level oper-
ations used by the processes, denoted conc_read_file() and conc_write_file(), are
defined as described in Fig. 3.10.
operation conc_read_file() is
begin_read(); r ← read_file(); end_read(); return(r)
end operation

operation conc_write_file(v) is
begin_write(); write_file(v); end_write(); return()
end operation

operation begin_read() is
(1) MUTEX_NBR.down();
(2) NBR ← NBR + 1;
(3) if (NBR = 1) then GLOBAL_MUTEX.down() end if;
(4) MUTEX_NBR.up();
(5) return()
end operation

operation end_read() is
(6) MUTEX_NBR.down();
(7) NBR ← NBR − 1;
(8) if (NBR = 0) then GLOBAL_MUTEX.up() end if;
(9) MUTEX_NBR.up();
(10) return()
end operation

operation begin_write() is
(11) GLOBAL_MUTEX.down();
(12) return()
end operation

operation end_write() is
(13) GLOBAL_MUTEX.up();
(14) return()
end operation
Let us consider the case where two writers pw1 and pw2 have invoked begin_write()
and one of them (say pw1) has obtained the mutual exclusion to write the file. The
semaphore GLOBAL_MUTEX is then equal to 0 and the
other writers are blocked at line 11. Let us assume that a process pr invokes
begin_read(). The counter NBR is increased from 0 to 1, and consequently pr invokes
GLOBAL_MUTEX.down() and becomes blocked. Hence, pw2 and pr are blocked on
the semaphore GLOBAL_MUTEX with pw2 blocked before pr . Hence, when later
pw1 executes end_write(), it invokes GLOBAL_MUTEX.up(), which unblocks the
first process blocked on that semaphore, i.e., pw2 .
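A Java rendering of this weak-priority object follows. GLOBAL_MUTEX and NBR come from the text above; the semaphore protecting the counter (mutexNbr below) is an assumed detail of the figure, which is not reproduced here.

import java.util.concurrent.Semaphore;

// Readers-writers with weak priority to the readers.
final class ReadersWriters {
    private final Semaphore globalMutex = new Semaphore(1, true);
    private final Semaphore mutexNbr = new Semaphore(1, true); // protects nbr
    private int nbr = 0; // number of current readers

    void beginRead() throws InterruptedException {
        mutexNbr.acquire();
        nbr++;
        if (nbr == 1) globalMutex.acquire(); // first reader locks out writers
        mutexNbr.release();
    }

    void endRead() throws InterruptedException {
        mutexNbr.acquire();
        nbr--;
        if (nbr == 0) globalMutex.release(); // last reader lets writers in
        mutexNbr.release();
    }

    void beginWrite() throws InterruptedException { globalMutex.acquire(); }

    void endWrite() { globalMutex.release(); }
}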
Strong priority to the read operation is weak priority (a reader obtains the priority
for the class of readers) plus the following property: when a writer terminates and
readers are waiting, the readers have to immediately obtain the mutual exclusion.
As already noticed, the readers-writers object described in Fig. 3.11 satisfies weak
priority for the readers but not strong priority.
There is a simple way to enrich the previous implementation to obtain an object
implementation that satisfies strong priority for the readers. It consists in ensur-
ing that, when several processes invoke begin_write(), at most one of them is
allowed to access the semaphore GLOBAL_MUTEX (in the previous example, pw2
was allowed to invoke GLOBAL_MUTEX.down() while pw1 had not yet invoked
GLOBAL_MUTEX.up()). To that end a new semaphore used only by the writers
is introduced. As its aim is to ensure mutual exclusion among concurrent writers,
this semaphore, which is denoted WRITER_MUTEX, is initialized to 1.
operation begin_write() is
(11.0) WRITER_MUTEX.down();
(11) GLOBAL_MUTEX.down();
(12) return()
end operation

operation end_write() is
(13) GLOBAL_MUTEX.up();
(13.0) WRITER_MUTEX.up();
(14) return()
end operation
[Fig. 3.13 A further readers-writers variant in which the reader operations are enriched with the new lines NR.1–NR.2 and the writer operations with the new lines NW.1–NW.8; its code and the accompanying discussion are not reproduced here.]
When considering the readers-writers problem, the mutual exclusion used in the
previous solutions between base read and write operations can entail that readers
and writers experience long delays when there is heavy contention for the shared
file. This section presents a solution to the readers-writers problem that reduces
waiting delays. Interestingly, this solution relies on the producer-consumer problem.
Read/write lock A read/write lock is a synchronization object that (a) provides
the processes with the four operations begin_read(), end_read(), begin_write(), and
end_write() and (b) ensures one of the specifications associated with the readers-
writers problem (no priority, weak/strong priority to the readers or the writers, etc.).
[Fig. 3.14 The two-buffer implementation: operation conc_read_file() (lines 1–5) and operation conc_write_file(v) (lines 6–11); the code itself is not reproduced here. The writer alternately writes BUF[0] and BUF[1] and updates the register LAST (lines 6 and 10) so that it always points to the last entry written, while a reader reads the entry currently pointed to by LAST.]
BUF[0]. Similarly, the last read (which is of BUF[1] because LAST = 1 when the
corresponding conc_read_file() operation starts) is concurrent with the write into
BUF[0]. Hence, the next write operation (namely conc_write_file(v5 )) will be on
BUF[1]. If conc_write_file(v5 ) is invoked while BUF[1].read() has not terminated,
the write must be constrained to wait until this read terminates. Hence, the mutual
exclusion requirement on the reads and writes on each entry of the buffer.
Starvation-freedom There are two algorithm instances that ensure the mutual
exclusion property: one between BUF[0].read() and BUF[0].write() and the other
one between BUF[1].read() and BUF[1].write().
Let us assume that these algorithms guarantee the starvation-freedom property
for the read operations (i.e., each invocation of BUF[x].read() terminates). Hence,
there is no specific liveness property attached to the base BUF[x].write() operations.
The following theorem captures the liveness properties of the conc_read_file() and
conc_write_file() operations which are guaranteed by the algorithms described in
Fig. 3.14.
Theorem 12 If the underlying read/write lock objects ensure starvation-freedom for
the read operations, then the implementation of the operations conc_read_file() and
conc_write_file() given in Fig. 3.14 ensures starvation-freedom for both operations.
Proof Starvation-freedom of the invocations of the operation conc_read_file() fol-
lows trivially from the starvation-freedom of the base BUF[x].read() operation.
To show that each invocation of the operation conc_write_file() terminates, we
show that any invocation of the operation BUF[x].write() does terminate. Let us
consider an invocation of BUF[x].write(). Hence, LAST = 1 − x (lines 6 and 10).
This means that the read invocations that start after the invocation BUF[x].write() are
on BUF[1 − x]. Consequently, these read invocations cannot prevent BUF[x].write()
from terminating.
Let us now consider the invocations BUF[x].read() which are concurrent with
BUF[x].write(). There is a finite number of such invocations. As the underlying
mutual exclusion algorithm guarantees starvation-freedom for these read invocations,
there is a finite time after which they have all terminated. If the invocation BUF[x].write()
has not yet been executed, it is then the only operation on BUF[x] and is consequently
executed, which concludes the proof of the theorem.
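A Java sketch of this two-buffer construction is given below, with one fair java.util.concurrent.locks.ReentrantReadWriteLock per entry playing the role of the underlying starvation-free mutual exclusion algorithms (a single writer is assumed, as in the text; the class and field names are illustrative):

import java.util.concurrent.locks.ReentrantReadWriteLock;

// The writer alternates between BUF[0] and BUF[1] and publishes the index
// of the last buffer written in LAST; readers read the buffer LAST points to.
final class DoubleBufferFile<T> {
    private final Object[] buf = new Object[2];
    private final ReentrantReadWriteLock[] rw = {
        new ReentrantReadWriteLock(true), new ReentrantReadWriteLock(true)
    };
    private volatile int last = 0; // the register LAST

    @SuppressWarnings("unchecked")
    T concReadFile() {
        int x = last;                    // read LAST
        rw[x].readLock().lock();         // concurrent readers are allowed
        try { return (T) buf[x]; }
        finally { rw[x].readLock().unlock(); }
    }

    void concWriteFile(T v) {            // single writer assumed
        int x = 1 - last;                // write the buffer LAST does not point to
        rw[x].writeLock().lock();
        try { buf[x] = v; }
        finally { rw[x].writeLock().unlock(); }
        last = x;                        // publish: new reads now go to BUF[x]
    }
}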
The case of several writers and several readers The previous single writer/
multi-reader algorithm can be easily generalized to obtain a multi-writer/multi-reader
algorithm.
Let us assume that there are m writers denoted q1, . . . , qm. The array of reg-
isters now has 2m entries, BUF[0..(2m − 1)], and the registers BUF[2i − 2] and
BUF[2i − 1] are associated with writer qi. As previously, writer qi
writes alternately BUF[2i − 2] and BUF[2i − 1] and updates LAST after it has
written a base register. The corresponding write algorithm is described in Fig. 3.16.

[Fig. 3.16 The write operation conc_write_file(v) for a writer qi (lines 7–11, plus the new lines NW.1–NW.2); the code itself is not reproduced here, and the operation conc_read_file() is unchanged.]
The local index my_last of qi is initialized to 2i − 2. Basically line 6 of Fig. 3.14 is
replaced by the new lines NW.1–NW.2. Moreover, the local variable new_last is now
renamed my_last and the algorithm for the operation conc_read_file() is the same as
before.
3.3 A Construct for Imperative Languages: the Monitor

Semaphores are synchronization objects that allow for the construction of application-
oriented concurrent objects. Unfortunately, they are low-level counting objects.
The concept of a monitor allows for the definition of concurrent objects at a
“programming language” abstraction level. Several variants of the monitor concept
have been proposed. This concept was developed by P. Brinch Hansen and C.A.R.
Hoare from an initial idea of E.W. Dijkstra. To introduce it, this section adopts Hoare’s
presentation (1974).
• Operation C.wait().
When a process p invokes C.wait() it stops executing and from an operational
point of view it waits in the queue C. As the invoking process is no longer active,
the mutual exclusion on the monitor is released.
• Operation C.signal().
When a process p invokes C.signal() there are two cases according to the value of
C:
– If no process is blocked in the queue C, the operation C.signal() has no effect.
– If at least one process is blocked in the queue C, the operation C.signal() reacti-
vates the first process blocked in C. Hence, there is one fewer process blocked in
C but two processes are now active inside the monitor. In order to guarantee that
a single process at a time can access the internal representation of the monitor
the following rule is applied:
* The process which was reactivated becomes active inside the monitor and
executes the statements which follow its invocation C.wait().
* The process that has executed C.signal() becomes passive but has priority to
re-enter the monitor. When allowed to re-enter the monitor, it will execute the
statements which follow its invocation C.signal().
• Operation C.empty().
This operation returns a Boolean value indicating whether the queue C is empty or not.
[Fig. 3.17 A rendezvous object: the code of its operation (lines 1–6, with a conditional wait at line 4) is not reproduced here.]
Let us observe that a rendezvous object implemented as in Fig. 3.17 is not restricted
to being used only once. It can be used repeatedly to synchronize a given set of processes
as many times as needed. We say that the rendezvous object is not restricted to being a
one-shot object.
A monitor-based rendezvous The internal representation of the rendezvous object
(for m participating processes) consists of a register denoted COUNTER (initialized
to 0) and a condition denoted QUEUE.
The algorithm implementing the operation rendezvous() is described in Fig. 3.18.
Let us remember that, when a process is active inside the monitor, it is the only
active process inside the monitor. Hence, the algorithm implementing the opera-
tion rendezvous() can be designed as a sequential algorithm which momentarily
stops when it executes QUEUE.wait() and restarts when it is reactivated by another
process that has executed QUEUE.signal(). As we are about to see, as an invocation
of QUEUE.signal() reactivates at most one process, the only tricky part is the man-
agement of the invocations of QUEUE.signal() so that all blocked processes are
eventually reactivated.
When one of the m participating processes invokes rendezvous(), it first increases
COUNTER (line 1) and then checks the value of COUNTER (line 2). If COUNTER <
m, it is blocked and waits in the condition QUEUE (line 2). It follows that the (m −1)
first processes which invoke rendezvous() are blocked and wait in that condition. The
mth process which invokes rendezvous() increases COUNTER to m and consequently
resets COUNTER to 0 (line 3) and reactivates the first process that is blocked in the
condition QUEUE. When reactivated, this process executes QUEUE.signal() (line
5), which reactivates another process, etc., until all m processes are reactivated and
terminate their invocations of rendezvous().
Let us notice that, after their first rendezvous has terminated, the m processes can
use again the very same object for a second rendezvous (if needed). As an example,
this object can be used to re-synchronize the processes at the beginning of each
iteration of a parallel loop.
monitor RENDEZVOUS is
COUNTER init 0; QUEUE: condition.
operation rendezvous() is
(1) COUNTER ← COUNTER + 1;
(2) if (COUNTER < m) then QUEUE.wait()
(3) else COUNTER ← 0
(4) end if;
(5) QUEUE.signal();
(6) return()
end operation
end monitor
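Java's built-in monitors (synchronized methods with wait()/notifyAll()) obey signal-and-continue rather than Hoare's signal-and-wait semantics, so a notified thread must re-check its condition. The following translation of the rendezvous monitor therefore uses notifyAll() and a generation counter instead of the cascade of signal() invocations used above:

// A reusable m-process rendezvous expressed with Java's built-in monitor.
final class Rendezvous {
    private final int m;      // number of participating processes
    private int counter = 0;
    private long generation = 0;

    Rendezvous(int m) { this.m = m; }

    synchronized void await() throws InterruptedException {
        long gen = generation;
        counter++;
        if (counter < m) {
            while (gen == generation) wait(); // not the last: wait for this round
        } else {
            counter = 0;                      // last arrival: open the barrier
            generation++;                     // the object is reusable (not one-shot)
            notifyAll();
        }
    }
}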
[Fig. 3.19 A monitor-based producer-consumer buffer: two operations (lines 1–5 and lines 6–10), each beginning with a conditional wait (lines 1 and 6); the code itself is not reproduced here.]
[Fig. 3.20 Predicate transfer from the consumer to the producer: inside the monitor the producer goes from active to passive and back to active, while the consumer goes from active to passive.]
The base objects (semaphores and integers) used to implement the control part of
a signal-and-wait monitor are described below. This monitor internal structure is
depicted in Fig. 3.21 (let us remark that the control part of the internal structure is
similar to the structure depicted in Fig. 3.8).
• A semaphore MUTEX, initialized to 1, is used to ensure mutual exclusion on the
monitor (at most one process at a time can access the monitor internal representa-
tion).
[Fig. 3.21 Internal structure of a signal-and-wait monitor: a semaphore (MUTEX) and integers, plus one integer per predicate, constitute the blackboard room; one semaphore per predicate constitutes the waiting rooms.]
The code executed by a process p when it terminates a monitor operation is described
at lines 1–4. Process p first checks PRIO_NB (line 1). If PRIO_NB > 0, a process that
has invoked a signal() operation is waiting with priority, and p reactivates it by executing
PRIO_SEM.up() (line 2). Hence, the control inside the monitor passes directly from
p to the reactivated process. If there are no such processes, we have PRIO_NB = 0. In
that case p releases the mutual exclusion on the monitor by releasing the semaphore
MUTEX (line 3).
The code executed by a process p that invokes C[P].wait() is described at lines
5–10. Process p then has to be blocked on the semaphore COND_SEM[P], which
is the waiting room associated with the condition C[P] (line 9). Hence, p increases
NB[P] before being blocked (line 5) and will decrease it when reactivated (line 10).
Moreover, as p is going to wait, it has to release the mutual exclusion on the monitor
internal representation. If there is a process q which is blocked due to an invocation of
wait(), it has priority to re-enter the monitor. Consequently, p directly passes control
of the monitor to q (line 6). Differently, if no process is blocked on PRIO_SEM, p
releases the mutual exclusion on the monitor entrance (line 7) to allow a process that
invokes a monitor operation to enter it.
The code executed by a process p that invokes C[P].signal() is described at lines
11–15. If no process is blocked in the condition C[P] we have NB[P] = 0 and there
is nothing to do. If NB[P] > 0, the first process blocked in C[P] has to be reactivated
and p has to become a priority process to obtain the control of the monitor again. To
that end, p increases PRIO_NB (line 11) and reactivates the first process blocked in
C[P] (line 12). Then, it exits the monitor and goes to wait in the priority waiting room
PRIO_SEM (line 13). When later reactivated, it will decrease PRIO_NB in order to
indicate that one fewer process is blocked in the priority semaphore PRIO_SEM
(line 14).
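The translation just described can be expressed in Java with java.util.concurrent.Semaphore. The following sketch is only an illustration written for this discussion (class and field names are ours, and interrupts are ignored via acquireUninterruptibly()); it mirrors the blackboard/waiting-room structure and the line numbering used above:

import java.util.concurrent.Semaphore;

class SemaphoreMonitor {
    final Semaphore mutex = new Semaphore(1);    // entrance of the monitor
    final Semaphore prioSem = new Semaphore(0);  // priority waiting room
    int prioNb = 0;                              // processes blocked on prioSem

    void enter() { mutex.acquireUninterruptibly(); } // invoked when an operation starts

    void exit() {                                    // invoked when an operation ends
        if (prioNb > 0) prioSem.release();  // pass control to a priority process (line 2)
        else mutex.release();               // otherwise reopen the monitor (line 3)
    }

    class Cond {
        private final Semaphore condSem = new Semaphore(0); // waiting room of this condition
        private int nb = 0;                                 // blackboard counter NB[P]

        void doWait() {                           // lines 5-10
            nb++;                                 // line 5
            if (prioNb > 0) prioSem.release();    // line 6: priority to a signaler
            else mutex.release();                 // line 7
            condSem.acquireUninterruptibly();     // line 9: block in the waiting room
            nb--;                                 // line 10
        }

        void doSignal() {                         // lines 11-15
            if (nb > 0) {
                prioNb++;                         // line 11
                condSem.release();                // line 12: reactivate a blocked process
                prioSem.acquireUninterruptibly(); // line 13: wait with priority
                prioNb--;                         // line 14
            }
        }
    }
}

All the integer counters are read and written only while the invoking process holds the monitor (directly, or through the explicit control transfers above), so no further synchronization is needed in this sketch.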
This section considers several monitors suited to the readers-writers problem. They
differ in the type of priority they give to readers or writers. This family of monitors
gives a good illustration of the programming comfort provided by the monitor con-
struct (the fact that a monitor allows a programmer to use directly the power of a
programming language makes her/his life easier).
These monitors are methodologically designed. They systematically use the fol-
lowing registers (which are all initialized to 0 and remain always non-negative):
• NB_WR and NB_AR denote the number of readers which are currently waiting and
the number of readers which are currently allowed to read the file, respectively.
• NB_WW and NB_AW denote the number of writers which are currently waiting
and the number of writers which are currently allowed to write the file, respectively.
In some monitors that follow, NB_WR and NB_AR could be replaced by a single
register NB_R (number of readers) whose value would be NB_WR + NB_AR and,
as its value is always 0 or 1, NB_AW could be replaced by a Boolean register. This is
not done, in order to insist on the systematic design dimension of these monitors.
Strong priority to the readers The monitor described in Fig. 3.23 provides strong
priority to the readers: a reader has to wait only when a writer is currently writing,
never because writers are waiting.
monitor RW_READERS is
NB_WR, NB_AR, NB_WW, NB_AW: integers init 0;
C_READERS, C_WRITERS: conditions;
operation begin_read() is
(1) NB_WR ← NB_WR + 1;
(2) if (NB_AW > 0) then C_READERS.wait() end if;
(3) NB_WR ← NB_WR − 1; NB_AR ← NB_AR + 1;
(4) C_READERS.signal()
end operation
operation end_read() is
(5) NB_AR ← NB_AR − 1;
(6) if (NB_AR = 0) then C_WRITERS.signal() end if
end operation
operation begin_write() is
(7) NB_WW ← NB_WW + 1;
(8) if (NB_AR + NB_AW > 0) then C_WRITERS.wait() end if;
(9) NB_AW ← NB_AW + 1;
(10) NB_WW ← NB_WW − 1
end operation
operation end_write() is
(11) NB_AW ← NB_AW − 1;
(12) if (NB_WR > 0) then C_READERS.signal() else C_WRITERS.signal() end if
end operation
end monitor
Fig. 3.23 A readers-writers monitor with strong priority to the readers
Let us now assume that NB_AW = 1. The corresponding writer was allowed to
proceed at line 8 at a time when the predicate NB_AR + NB_AW = 0 was true (this
predicate is transferred to it, either between lines 6 and 8 or between lines 12 and 8),
from which we conclude that NB_AR = 0 when it starts writing; as readers are then
blocked at line 2, this remains true while it writes, and consequently NB_AR × NB_AW = 0.
Let us now assume that NB_AR > 0. This register is increased at line 3. Due to the
waiting predicate NB_AW > 0 used at line 2 and the transfer of predicate between
line 12 (where we also have NB_AW = 0) and line 2, it follows that NB_AW = 0
when line 3 is executed. Consequently, we have (NB_AR > 0) ⇒ (NB_AW = 0),
which completes the proof of the safety property of the readers-writers problem.
Let us now prove the liveness property, namely strong priority to the readers.
Let us first observe that, if a reader is allowed to read, we have NB_AR > 0 and,
consequently, NB_AW = 0. It then follows from the waiting predicate used at line 2
that all the readers which invoke begin_read() are allowed to read.
Let us now consider that readers invoke begin_read() while a writer has previously
been allowed to write. We then have NB_AW > 0 (line 9 executed by the writer)
and NB_WR > 0 (line 1 executed later by these readers). It follows that, when the writer
invokes end_write(), it will execute C_READERS.signal() (line 12) and reactivate a
reader (which in turn will reactivate another reader, etc.). Consequently, when a
writer is writing and there are waiting readers, those readers proceed to read when
the writer terminates, which concludes the proof of the liveness property.
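In Java, the same reader-priority policy can be expressed with the language's built-in monitor construct. The sketch below is our own code, not the book's: it uses notifyAll() together with re-checked predicates (i.e., the "signal all" semantics discussed in the exercises), and its counters mirror NB_WR, NB_AR, and NB_AW:

class ReaderPriorityRW {
    private int waitingReaders = 0;       // plays the role of NB_WR
    private int activeReaders = 0;        // plays the role of NB_AR
    private boolean activeWriter = false; // NB_AW rendered as a Boolean

    synchronized void beginRead() throws InterruptedException {
        waitingReaders++;
        while (activeWriter) wait();      // a reader waits only for an active writer
        waitingReaders--;
        activeReaders++;
    }
    synchronized void endRead() {
        activeReaders--;
        if (activeReaders == 0) notifyAll();  // the last reader lets a writer in
    }
    synchronized void beginWrite() throws InterruptedException {
        // yielding to *waiting* readers is what makes the priority "strong"
        while (activeWriter || activeReaders > 0 || waitingReaders > 0) wait();
        activeWriter = true;
    }
    synchronized void endWrite() {
        activeWriter = false;
        notifyAll();                      // waiting readers (if any) proceed first
    }
}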
Strong priority to the writers The monitor described in Fig. 3.24 provides strong
priority to the writers. This means that, as soon as writers want to write, no more
readers are allowed to read until they have all terminated. The text of the monitor is
self-explanatory.
When looking at Fig. 3.24, as far as the management of priority is concerned,
it is important to insist on the role played by the register NB_WW . This register
stores the actual number of processes which want to write and are blocked. Hence,
giving strong priority to the writers is based on the testing of that register at line 1
and line 12. Moreover, when the priority is given to the writers, the register NB_WR
(which counts the number of waiting readers) is useless.
Similarly, the same occurs in the monitor described in Fig. 3.23. Strong priority
is given to the readers with the help of the register NB_WR while, in that case, the
register NB_WW becomes useless.
A type of fairness Let us construct a monitor in which, while all invocations
of conc_read_file() and conc_write_file() (as defined in Fig. 3.10) terminate, the
following two additional liveness properties are satisfied:
monitor RW_WRITERS is
NB_WR, NB_AR, NB_WW, NB_AW: integers init 0;
C_READERS, C_WRITERS: conditions;
operation begin_read() is
(1) if (NB_WW + NB_AW > 0) then NB_WR ← NB_WR + 1; C_READERS.wait(); NB_WR ← NB_WR − 1 end if;
(2) NB_AR ← NB_AR + 1;
(3) C_READERS.signal()
end operation
operation end_read() is
(4) NB_AR ← NB_AR − 1;
(5) if (NB_AR = 0) then C_WRITERS.signal() end if;
(6) return()
end operation
operation begin_write() is
(7) NB_WW ← NB_WW + 1;
(8) if (NB_AR + NB_AW > 0) then C_WRITERS.wait() end if;
(9) NB_WW ← NB_WW − 1;
(10) NB_AW ← NB_AW + 1
end operation
operation end_write() is
(11) NB_AW ← NB_AW − 1;
(12) if (NB_WW > 0) then C_WRITERS.signal() else C_READERS.signal() end if;
(13) return()
end operation
end monitor
Fig. 3.24 A readers-writers monitor with strong priority to the writers
• Property P1: When a write terminates, all waiting readers are allowed to read
before the next write.
• Property P2: When there are readers which are reading the file, the newly arriving
readers have to wait if writers are waiting.
These properties are illustrated in Fig. 3.25, where indexes are used to distinguish
different executions of the same operation. During an execution of conc_write_file1 (),
two readers invoke conc_read_file1 () and conc_read_file2 () and a writer invokes
conc_write_file2 (). As there is currently a write on the file, these operations are
blocked inside the monitor (to preserve the monitor invariant). When
conc_write_file1 () terminates, due to property P1, the invocations conc_read_file1 ()
and conc_read_file2 () are executed. Then, while they are reading the file,
conc_read_file3 () is invoked. Due to property P2, this invocation must be blocked
because, despite the fact that the file is currently being read, there is a write waiting.
When conc_read_file1 () and conc_read_file2 () have terminated, conc_write_file2 ()
can be executed. When this write terminates, conc_read_file3 () and conc_read_file4 ()
are executed. Etc.
The corresponding monitor is described in Fig. 3.26. The difference from the
previous readers-writers monitors lies in the way the properties P1 and P2 are ensured.
The property P2 is ensured by the waiting predicate at line 2: if a writer is writing or
waiting (i.e., NB_WW + NB_AW > 0) when a reader arrives, the reader has
to wait, and when this reader is reactivated it will propagate the reactivation to another
waiting reader (if any) before starting to read. The property P1 is ensured by the
reactivating predicate (NB_WR > 0) used at line 13: if there are readers that are
waiting when a writer terminates, the first of them is reactivated, which reactivates
the following one (statement C_READERS.signal() at line 2), etc., until all waiting
readers have been reactivated.
The reader can check that the implementation of such fairness properties would
have been much more difficult if one was asked to implement them directly from
semaphores. (“Directly” meaning here: without using the translation described in
Fig. 3.22.)
monitor RW_FAIR is
NB_WR, NB_AR, NB_WW, NB_AW: integers init 0;
C_READERS, C_WRITERS: conditions;
operation begin_read() is
(1) NB_WR ← NB_WR + 1;
(2) if (NB_WW + NB_AW > 0) then C_READERS.wait(); C_READERS.signal() end if;
(3) NB_WR ← NB_WR − 1;
(4) NB_AR ← NB_AR + 1
end operation
operation end_read() is
(5) NB_AR ← NB_AR − 1;
(6) if (NB_AR = 0) then C_WRITERS.signal() end if;
(7) return()
end operation
operation begin_write() is
(8) NB_WW ← NB_WW + 1;
(9) if (NB_AR + NB_AW > 0) then C_WRITERS.wait() end if;
(10) NB_WW ← NB_WW − 1;
(11) NB_AW ← NB_AW + 1
end operation
operation end_write() is
(12) NB_AW ← NB_AW − 1;
(13) if (NB_WR > 0) then C_READERS.signal() else C_WRITERS.signal() end if;
(14) return()
end operation
end monitor
Fig. 3.26 A readers-writers monitor satisfying the fairness properties P1 and P2
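Here is how the two properties could also be obtained with Java's built-in monitor. This sketch is our own illustration, not the book's code: the generation counter gen and the pendingReaders register are devices introduced here to implement P1 (admit, as one batch, all the readers waiting when a write terminates) and P2 (make newly arriving readers wait as soon as a writer is waiting):

class FairRW {
    private int waitingReaders = 0, pendingReaders = 0, activeReaders = 0;
    private int waitingWriters = 0;
    private boolean activeWriter = false;
    private long gen = 0;   // incremented by each endWrite(): opens a reader batch

    synchronized void beginRead() throws InterruptedException {
        if (activeWriter || waitingWriters > 0) {   // property P2: defer to writers
            long myGen = gen;
            waitingReaders++;
            while (gen == myGen) wait();            // released by the next endWrite()
            pendingReaders--;                       // this reader enters its batch
        }
        activeReaders++;
    }
    synchronized void endRead() {
        activeReaders--;
        if (activeReaders == 0) notifyAll();        // the last reader lets a writer in
    }
    synchronized void beginWrite() throws InterruptedException {
        waitingWriters++;
        while (activeWriter || activeReaders > 0 || pendingReaders > 0) wait();
        waitingWriters--;
        activeWriter = true;
    }
    synchronized void endWrite() {
        activeWriter = false;
        pendingReaders = waitingReaders;  // property P1: admit all waiting readers
        waitingReaders = 0;
        gen++;
        notifyAll();
    }
}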
monitor ALARM_CLOCK is
CLOCK: integer init 0;
BAG: bag of integers init ∅;
QUEUE: condition;
operation wakeup(delay) is
(1) wake_up_date ← CLOCK + delay;
(2) add wake_up_date to BAG;
(3) QUEUE.wait(wake_up_date);
(4) suppress wake_up_date from BAG;
(5) return()
end operation
operation tic() is
(6) CLOCK ← CLOCK + 1;
(7) now ← CLOCK;
(8) while (now ∈ BAG) do QUEUE.signal() end while;
(9) return()
end operation
end monitor
A monitor-based alarm clock object
The statement “add d to BAG” adds one copy of the date d to the bag BAG, while the
statement “suppress d from BAG” removes one copy of d from the bag (if any). When
a process p wants to be woken up at some future date, it computes this wake-up date
(line 1) and adds it to the bag (line 2). Then, p invokes QUEUE.wait(wake_up_date)
(line 3). When it is reactivated, p removes its wake-up date from the bag (line 4).
The second operation, denoted tic(), is executed by the system at the end of each
time unit (it can also be executed by a specific process whose job is to measure the
physical or logical passage of time). This operation first increases the monitor clock
CLOCK (line 6). Then, it reactivates, one after the other, all the processes whose
wake up time is equal to now (the current time value) (lines 7–8).
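In Java, the same object could be sketched as follows (class and method names are ours; the bag of wake-up dates is represented by a priority queue, and notifyAll() together with a re-checked predicate plays the role of the reactivation loop):

import java.util.PriorityQueue;

class AlarmClock {
    private long now = 0;                                   // plays the role of CLOCK
    private final PriorityQueue<Long> bag = new PriorityQueue<>(); // bag of wake-up dates

    synchronized void wakeup(long delay) throws InterruptedException {
        long wakeUpDate = now + delay;   // line 1: compute the wake-up date
        bag.add(wakeUpDate);             // line 2: add it to the bag
        while (now < wakeUpDate) wait(); // line 3: sleep until the date has arrived
        bag.remove(wakeUpDate);          // line 4: remove the date from the bag
    }

    synchronized void tic() {            // invoked once per time unit
        now++;                           // line 6: increase the clock
        // lines 7-8: reactivate the processes whose wake-up date has arrived
        if (!bag.isEmpty() && bag.peek() <= now) notifyAll();
    }
}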
The monitor concept allows concurrent objects to be built by providing (a) sequen-
tiality inside a monitor and (b) condition objects to solve internal synchronization
issues. Hence, as we have seen, monitor-based synchronization is fundamentally an
imperative approach to synchronization. This section shows that, similarly to sequen-
tial programming languages, the statement of synchronization can be imperative or
declarative.
As with monitors, several path expression formulations have been introduced. We
consider here the one that was introduced in a variant of the Pascal programming
language.
3.4.1 Definition
The idea of path expressions is to state constraints on the order in which the operations
on a concurrent object have to be executed. To that end, four base operators are used,
namely concurrency, sequentiality, restriction, and de-restriction. It is then up to the
compiler to generate the appropriate control code so that these constraints are always
satisfied.
Let us consider an object defined by a set of operations. A path expression
associated with this object has the form path pe end path, where pe is an expression
built from the operation names and four base operators. As examples:
• path 1 : op1 , 1 : op2 end path states that, at any time, there is at most one
execution of op1 and one execution of op2 , which can proceed concurrently.
• path 2 : (op1 ; op2 ) end path states that, at any time, (a) the number of executions
of op2 that have started never surpasses the number of executions of op1 that
have completed (this is due to the “;” internal operator), and (b) the number of
executions of op1 that have started never surpasses by more than two the number
of executions of op2 that have completed (this is due to the “2 : ()” operator).
• path 1 : [op1 ], op2 end path states that, at any time, there is at most either one
execution of op2 or any number of concurrent executions of op1 .
• path 4 : (3 : op1 ), (2 : op2 ) end path states that, at any time, there are at most
three concurrent executions of op1 and at most two concurrent executions of op2 ,
and at most four concurrent executions when adding the executions of op1 and the
executions of op2 .
(Figure: a file object, defined by the two operations read_file() and write_file().)
Readers-writers As stated by path1 , the file is given either to any number of readers
(which have invoked read_file()) or to any number of writers (which have invoked
write_file()). Moreover, path2 defines a
kind of alternating priority. If a reader is reading, it gives access to the file to all the
readers that arrive while it is reading and, similarly, if a writer is writing, it reserves
the file for all the writers that are waiting.
Producer-consumer Let us consider a buffer of size k shared by a single producer
and a single consumer. Using the same base objects (BUF[0..(k − 1)], in and out)
as in Fig. 3.4 (Sect. 3.2.2), the operations of such a buffer object B are defined in
Fig. 3.28.
The following path expression path4 defines the synchronization control associ-
ated with such a buffer:
path4 = path k : (prod; cons) end path.
If there are both several producers and several consumers, it is possible to use the same
object B. (The only difference for B is that now in is shared by the producers and out
is shared by the consumers, but this does not entail a modification of the code of B.)
The only modification is the addition of synchronization constraints specifying that
at most one producer at a time is allowed to produce and at most one consumer at a
time is allowed to consume.
Said differently, the only modification is the replacement of path4 by path5 ,
defined as follows:
path5 = path k : ((1 : prod); (1 : cons)) end path.
Generating prefixes and suffixes Let pe denote a path expression. prefix(pe) and
suffix(pe) denote the code prefix and the code suffix currently associated with pe.
These prefixes and suffixes are defined recursively starting with the path expression
pe and proceeding until the prefix and suffix of each operation is determined. Initially,
prefix(pe) and suffix(pe) are empty control sequences.
1. Concurrency rule. Let pe = pe1 , pe2 . The expression prefix(pe) pe1 ,
pe2 suffix(pe) gives rise to the two expressions prefix(pe) pe1 suffix(pe)
and prefix(pe) pe2 suffix(pe), which are then considered separately.
2. Sequentiality rule. Let pe = pe1 ; pe2 . The expression prefix(pe) pe1 ;
pe2 suffix(pe) gives rise to two expressions which are then considered sep-
arately, namely the expression prefix(pe) pe1 S.up() and the expression
S.down() pe2 suffix(pe) where S is a new semaphore initialized to 0. As we
can see, the aim of the semaphore S is to force pe2 to wait until pe1 is executed.
Hence, for the next step, as far as pe1 is concerned, we have prefix(pe1 ) =
prefix(pe) and suffix(pe1 ) = S.up(). Similarly, we have prefix(pe2 ) = S.down() and
suffix(pe2 ) = suffix(pe).
3. Restriction rule. Let pe = k : pe1 . The expression prefix(pe) k : pe1 suffix(pe)
gives rise to the expression S′.down(); prefix(pe) pe1 suffix(pe); S′.up(),
where S′ is a new semaphore initialized to k.
Hence, we have prefix(pe1 ) = S′.down(); prefix(pe) and suffix(pe1 ) = suffix(pe);
S′.up() to proceed recursively (if pe1 is not an operation name).
4. De-restriction rule. Let pe = [pe1 ]. The expression prefix(pe) [pe1 ]
suffix(pe) gives rise to the expression prio_down(CT, S′, prefix(pe))
pe1 prio_up(CT, S′, suffix(pe)), where
• CT is a counter initialized to 0,
• S′ is a new semaphore initialized to 1, and
• prio_down() and prio_up() are defined as indicated in Fig. 3.29.
operation prio_down(CT, S′, prefix) is
S′.down(); CT ← CT + 1;
if (CT = 1) then execute prefix end if;
S′.up()
end operation
operation prio_up(CT, S′, suffix) is
S′.down(); CT ← CT − 1;
if (CT = 0) then execute suffix end if;
S′.up()
end operation
Fig. 3.29 The operations prio_down() and prio_up()
The aim of these operations is to give priority to the processes that invoke an
operation involved in the path expression pe1 . (The reader can check that their
internal statements are the same as the ones used in Fig. 3.11 to give weak priority
to the readers.)
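As an illustration, the two operations can be written in Java as follows (the class name Bracket and the use of Runnable values for the enclosing prefix and suffix are our own choices, not the book's). Note that the first invocation may block inside prefix.run() while holding the internal mutex; this is precisely the behavior of the first reader in the weak-priority readers-writers algorithm:

import java.util.concurrent.Semaphore;

class Bracket {
    private int ct = 0;                               // the counter CT
    private final Semaphore mutex = new Semaphore(1); // the semaphore S'
    private final Runnable prefix, suffix;            // enclosing prefix(pe) and suffix(pe)

    Bracket(Runnable prefix, Runnable suffix) {
        this.prefix = prefix;
        this.suffix = suffix;
    }

    void prioDown() {                    // prio_down(CT, S', prefix)
        mutex.acquireUninterruptibly();
        ct++;
        if (ct == 1) prefix.run();       // the first one in executes the enclosing prefix
        mutex.release();
    }

    void prioUp() {                      // prio_up(CT, S', suffix)
        mutex.acquireUninterruptibly();
        ct--;
        if (ct == 0) suffix.run();       // the last one out executes the enclosing suffix
        mutex.release();
    }
}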
An example To illustrate the previous rules, let us consider the following path
expression involving three operations denoted op1 , op2 , and op3 :
path 1 : [op1 ; op2 ], op3 end path.
Hence, we have initially pe = 1 : [op1 ; op2 ], op3 and prefix(pe) = suffix(pe) = ε
(where ε represents the empty sequence). Let us now apply the rules as defined
by their precedence order. This is also described in Fig. 3.30 with the help of the
syntactical tree associated with the considered path expression.
• Let us first apply the restriction rule (item 3). We obtain k = 1 with pe1 =
[op1 ; op2 ], op3 . It follows from that rule that prefix(pe1 ) = S1.down() and
suffix(pe1 ) = S1.up(), where S1 is a semaphore initialized to 1.
• Let us now apply the concurrency rule (item 1). We have pe1 = pe2 , pe3 , where
pe2 = [op1 ; op2 ] and pe3 = op3 . It follows from that rule that:
– prefix(op3 ) = prefix(pe1 ) = S1.down() and suffix(op3 ) = suffix(pe1 ) =
S1.up(). Hence, any invocation of op3 () has to be bracketed by S1.down()
and S1.up().
– Similarly, prefix(pe2 ) = prefix(pe1 ) = S1.down() and suffix(pe2 ) = suffix(pe1 )
= S1.up().
• Let us now consider pe2 = [op1 ; op2 ] = [pe4 ]. Applying the de-restriction rule
(item 4) we obtain prefix(pe4 ) = prio_down(CT, S2, prefix(pe2 )) and suffix(pe4 ) =
prio_up(CT, S2, suffix(pe2 )), where CT is a counter initialized to 0 and S2 is a new
semaphore initialized to 1.
• Let us finally apply the sequentiality rule (item 2) to pe4 = op1 ; op2 . We obtain
prefix(op1 ) = prefix(pe4 ) = prio_down(CT, S2, S1.down()), suffix(op1 ) = S3.up(),
prefix(op2 ) = S3.down(), and suffix(op2 ) = suffix(pe4 ) = prio_up(CT, S2, S1.up()),
where S3 is a new semaphore initialized to 0.
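The end result of the derivation can be made concrete with the Bracket sketch given above (again our own illustration): S1 implements the outer restriction 1 : (...), S3 implements the ";" between op1 and op2, and the bracket object implements the de-restriction [op1 ; op2]:

class PathControlledObject {
    private final java.util.concurrent.Semaphore s1 =
        new java.util.concurrent.Semaphore(1);   // outer restriction "1 :"
    private final java.util.concurrent.Semaphore s3 =
        new java.util.concurrent.Semaphore(0);   // the ";" between op1 and op2
    private final Bracket bracket =              // the de-restriction [op1 ; op2]
        new Bracket(s1::acquireUninterruptibly, s1::release);

    void op1() {
        bracket.prioDown();            // prefix(op1) = prio_down(CT, S2, S1.down())
        /* body of op1 */
        s3.release();                  // suffix(op1) = S3.up()
    }
    void op2() {
        s3.acquireUninterruptibly();   // prefix(op2) = S3.down()
        /* body of op2 */
        bracket.prioUp();              // suffix(op2) = prio_up(CT, S2, S1.up())
    }
    void op3() {
        s1.acquireUninterruptibly();   // prefix(op3) = S1.down()
        /* body of op3 */
        s1.release();                  // suffix(op3) = S1.up()
    }
}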
3.5 Summary
This chapter has presented the semaphore object and two programming language
constructs (monitors and path expressions) that allow the design of lock-based atomic
objects. Such language constructs provide a higher abstraction level than semaphores
or base mutual exclusion when one has to reason about the implementation of concurrent
objects. Hence, they can ease the job of programmers who have to implement
concurrent objects.
• A reader that arrives while another reader is reading can immediately access
the file if no writer is waiting. Otherwise, the reader has to wait until the writers
that arrived before it have accessed the file.
• A writer cannot bypass the writers and the readers which arrived before it.
• Prove that the algorithm is correct (a writer executes in mutual exclusion and
readers are allowed to proceed concurrently).
• What type of priority is offered by this algorithm?
• In the worst case, how many processes can be blocked on the wait statement?
• Let us replace the array FLAG[1..n] of SWMR atomic registers by a single
MWMR atomic register READERS initialized to 0, and adapt the algorithm
accordingly.
9. Let us associate the semantics “signal all” with the C.signal() operation on each
condition C of a monitor. This semantics means that all the processes (if any)
which are blocked on the condition C are reactivated and have priority to obtain
mutual exclusion and re-access the monitor. The process which has invoked
C.signal() continues its execution inside the monitor. Considering this “signal
all” semantics:
• Design a readers-writers monitor with strong priority to the writers.
• Design an implementation for “signal all” monitors from underlying
semaphores.
10. The last writer. Let us consider the monitor-based solution with strong priority
to the readers (Fig. 3.23). Modify this solution so that only the last writer can be
blocked (it can be blocked only because a reader is reading or a writer is writing).
This means that, when a writer p invokes begin_write(), it unblocks the waiting
writer q if there is one. The write (not yet done) of q is then “overwritten” by
the write of p, and the invocation of begin_write() issued by q returns false.
To that end, the operation conc_write_file(v) defined in Fig. 3.10 is redefined
as follows:
operation conc_write_file(v) is
r ← begin_write();
if (r) then write_file(v); end_write() end if;
return(r)
end operation.
11. Implement semaphores (a) from monitors, and (b) from path expressions.
(c) leave_C_to_D() when it arrives at C and tries to enter the line CD.
(d) arrive_in_D() when it arrives at D.
The same four operations are defined for the trains that go from D to A (A, B, C,
and D are replaced by D, C, B, and A).
Design first a deadlock-free monitor that provides the processes (trains) with the
previous eight operations. Design then a starvation-free monitor.
Hints.
• The internal representation of the monitor will be made of:
– The integer variables NB_AB, NB_BA, NB_CD, and NB_DC, where NB_xy
represents the number of trains currently going from x to y.
– The binary variables NB_BC and NB_CB, whose values are 0 or 1.
All these control variables are initialized to 0.
– The six following conditions (queues): START_FROM_A, ENTER_BC,
ENTER_CD, START_FROM_D, ENTER_CB, ENTER_BA.
• The code of a process going from A to D is:
start_from_A(); . . . ; leave_B_to_C(); . . . ; leave_C_to_D(); . . . ;
arrive_in_D().
• Before deriving the predicates that allow a train to progress when it executes
a monitor operation, one may first prove that the following relation must be
an invariant of the monitor internal representation:
13. Let us consider a file object defined by the two operations read_file() and
write_file(), the accesses to which are controlled by two path expressions path1
and path2 .
Let us observe that path2 defines mutual exclusion between a write invocation
and any other operation invocation while allowing concurrent read operations.
3.7 Exercises and Problems 109
The combination of the path expressions path1 and path2 defines the associated
priority. What type of priority is defined?
Solutions in [63].
Part II
On the Foundations Side:
The Atomicity Concept
This part of the book is made up of a single chapter that introduces the atomicity
concept (also called linearizability). This concept (which was sketched in the first
part of the book) is certainly (with non-determinism) one of the most important
concepts related to the concurrency and synchronization of parallel and distributed
programs. It is central to the understanding and the implementation of concurrent
objects. This chapter presents a formal definition of atomicity and its main
properties. Atomicity (which is different from sequential consistency or serializ-
ability) is the most popular consistency condition. This is due to the fact that
atomic objects compose “for free”.
Chapter 4
Atomicity:
Formal Definition and Properties
4.1 Introduction
What does it mean for a concurrent object to be correct, and what does happen when
a process stops forever in the middle of an operation? (This last possibility was not
considered in the previous chapters.)
Example To give a flavor of these questions, let us consider an unbounded first-in
first-out (FIFO) queue denoted Q which provides the processes with the following
two operations:
• Q.enq(v), which adds the value v at the tail of the queue, and
• Q.deq(), which returns the value at the head of the queue and suppresses it from
the queue. If the queue is empty, the default value ⊥ is returned.
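As a sketch, such a sequential specification could be rendered in Java by the following interface (the name FifoQueue is ours); using null to play the role of ⊥ makes deq() a total operation:

interface FifoQueue<V> {
    void enq(V v);   // adds v at the tail of the queue
    V deq();         // removes and returns the head value; null plays the role of ⊥
}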
Figure 4.1 describes a sequential execution of a system made up of a single process
using the queue. The time line, going from left to right, describes the progress of
the process when it enqueues first the value a, then the value b, and finally the value
c. According to the expected semantics of a queue, and as depicted in the figure,
the first invocation of Q.deq() returns the value a, the second returns the value
b, etc.
Figure 4.2 depicts an execution of a system made up of two processes sharing
the same queue. Now, process p1 enqueues first a and then b whereas process p2
concurrently enqueues c. As shown in the figure, the execution of Q.enq(c) by p2
overlaps the executions of both Q.enq(a) and Q.enq(b) by p1 . Such an execution
raises many questions, including the following: What values are dequeued by p1 and
p2 ? What values can be returned by a process, say p1 , if the other process, p2 , stops
forever in the middle of an operation? What happens if p1 and p2 share several queues
instead of a single one?
Addressing the previous questions and related issues starts from the definition of a
precise computation model. This chapter first presents the base elements of such a
model and the important notion of a concurrent computation history.
4.2.2 Objects
An object has a name and a type. A type is defined by (1) the set of possible values
for (the states of) objects of that type, (2) a finite set of operations through which the
objects of that type can be manipulated, and (3) a specification describing, for each
operation, the condition under which that operation can be invoked, and the effect
produced after the operation was executed. Figure 4.3 presents a structural view of a
set of n processes sharing m objects.
Sequential specification The object types we consider are defined by a sequential
specification. (We talk interchangeably about the specification of the object or the
specification of the type.) A sequential specification depicts the behavior of the object
when accessed sequentially, i.e., in a sequential execution. This means that, despite
concurrency, the implementation of any such object has to provide the illusion of
sequential accesses. As already noticed, the aim is to facilitate the task of application
programmers who have to reason only about sequential specifications.
It is common to define a sequential specification by associating two predicates
with each operation. These predicates are called pre-assertion and post-assertion.
Assuming the pre-assertion is satisfied before executing the operation, the post-
assertion describes the new value of the object and the result of the operation returned
to the calling process. We refine the notion of sequential specification in terms of
histories later in this chapter.
Total versus partial operations An object operation is total if it is defined for
every state of the object; otherwise it is partial. This means that, differently from a
pre-assertion associated with a partial operation, the pre-assertion associated with a
total operation is always satisfied.
Deterministic versus non-deterministic operations An object operation is deter-
ministic if, given any state of the object that satisfies the pre-assertion of the oper-
ation, and given any valid input parameters of the operation, the output parameters
and the final state of the object are uniquely defined. An object type that has only
deterministic operations is said to be deterministic.
4.2.3 Histories
Two operations op and op′ are said to overlap (or to be concurrent) in a history H if
neither resp[op] <H inv[op′] nor resp[op′] <H inv[op]. Notice that two overlapping
operations are such that ¬(op →H op′) and ¬(op′ →H op).
A sequential history has no overlapping operations; i.e., for any pair of operations
op and op′, we have (op ≠ op′) ⇒ (op →H op′) ∨ (op′ →H op). Consequently,
→H is a total order if H is a sequential history.
Illustrating histories Figure 4.4 depicts the (well-formed) history H associated
with the queue object execution described in Fig. 4.2. This history comprises ten
events e1 . . . e10 (e4, e6, e7, and e9 are explicitly detailed). As there is a single object,
its name is omitted. Let us notice that the operation enq(c) by p2 is concurrent with
both enq(a) and enq(b) issued by p1 . Moreover, as the history H has no pending
operations, it is a complete history.
The sequence e1 . . . e9 is a partial history where the dequeue operation issued by
p1 is pending. The sequence e1 . . . e6 e7 e8 e10 is another partial history in which
the dequeue operation issued by p2 is pending. Finally, the history e1 . . . e8 has two
pending operations.
Definition A history is sequential if its first event is an invocation, and then (1) each
invocation event, except possibly the last, is immediately followed by the matching
reply event, and (2) each reply event, except possibly the last, is immediately followed
by an invocation event. The phrase “except possibly the last” associated with an
invocation event is due to the fact that a history can be partial. A complete sequential
history always ends with a reply event. A history that is not sequential is concurrent.
A sequential history models a sequential multiprocess computation (there are no
overlapping operations in such a computation), while a concurrent history models
a concurrent multiprocess computation (there are at least two overlapping opera-
tions in such a computation). Given that a sequential history S has no overlapping
operations, the associated partial order →S defined on its operations is actually a
total order. With a sequential history, one can thus reason about executions at the
granularity of the operations invoked by the processes, instead of at the granularity
of the underlying events.
Strictly speaking, the sequential specification of an object is a set of sequential
histories involving solely that object. Basically, the sequential specification repre-
sents all possible sequential ways according to which the object can be accessed such
that the pre-assertion and post-assertion of each of its operations are respected.
Example The history H = e1 e2 · · · e10 depicted in Fig. 4.4 is a complete concurrent
history. On the other hand, the complete history
H1 = e1 e3 e4 e6 e2 e5 e7 e9 e8 e10
is sequential. At the granularity defined by the operations, it can be written
H1 = [e1 e3] [e4 e6] [e2 e5] [e7 e9] [e8 e10].
The histories
H2 = [e1 e3] [e4 e6] [e2 e5] [e8 e10] [e7 e9] and
H3 = [e1 e3] [e4 e6] [e8 e10] [e2 e5] [e7 e9]
are also sequential. Let us also notice that H, H1, H2, and H3 are equivalent histories
(they have the same local histories). Let H4 be the history defined as
H4 = [e1 e3] [e4 e6] [e2 e5] [e8 e10] e7.
H4 is a partial sequential history (the operation whose invocation event is e7 is
pending). All these histories have the same local history for process p1:
H|p1 = H1|p1 = H2|p1 = H3|p1 = H4|p1 = [e1 e3] [e4 e6] [e8 e10],
and, as far as p2 is concerned, H4|p2 is a prefix of H|p2 = H1|p2 = H2|p2 = H3|p2 =
[e2 e5] [e7 e9].
Hence, the notion of a history is an abstract way to depict the interactions between
a set of processes and a set of concurrent objects. In short, a history is a total order
on the set of (invocation and reply) events generated by the processes on the objects.
As we are about to see, the notion of a history is central to defining the notion of
atomicity through the very notion of atomic history.
4.3 Atomicity
The role of a correctness condition is to select, among all possible histories of a set
of processes accessing shared objects, those considered to be correct. This section
introduces the correctness condition called atomicity (also called linearizability). The
aim of atomicity is to transform the difficult problem of reasoning about a concurrent
execution into the simpler problem of reasoning about a sequential one.
Intuitively, atomicity states that a history is correct if its invocation and reply
events could have been obtained, in the same order, by a single sequential process. In
an atomic (or linearizable) history, each operation has to appear as if it was executed
alone and instantaneously at some point between its invocation event and its reply
event.
As the concurrent objects that are considered are defined by sequential specifications,
a definition of what is a “correct” history has to refer in one way or another to these
specifications. The notion of legal history captures this idea.
This section first defines atomicity for complete histories H, i.e., histories without
pending operations: each invocation event of H has a matching reply event in H. The
section that follows will extend this definition to partial histories.
Definition A complete history H is atomic (or linearizable) if there is a “witness”
history S such that:
1. H and S are equivalent,
2. S is sequential and legal, and
3. →H ⊆ →S.
The definition above states that, for a history H to be linearizable, there must
exist a permutation of H (namely the witness history S) which satisfies the following
requirements:
• First, S has to be composed of the same set of events as H and has to respect the
local history of each process [item 1].
• Second, S has to be sequential (interleave the process histories at the granularity of
complete operations) and legal (respect the sequential specification of each object)
[item 2]. Notice that, as S is sequential, →S is a total order.
• Finally, S has also to respect the real-time occurrence order of the operations as
defined by →H [item 3].
Hence, S represents a history that could have been obtained by executing all the opera-
tions, one after the other, while respecting the occurrence order of non-overlapping
operations. Such a sequential history S is called a linearization of H.
Proving that an algorithm implements an atomic object To show that an algorithm
implements an atomic object, we need to prove that all the histories it generates are
linearizable, i.e., to identify, for each of them, a linearization of its operations that
respects the “real-time” occurrence order of the operations and that is consistent with
the sequential specification of the object.
It is important to notice that the notion of atomicity inherently includes a form
of non-determinism. More precisely, given a history H, several linearizations of H
might exist.
It is easy to check that the history H described in Fig. 4.4 is linearizable: the sequential
history H1 defined in Sect. 4.2.4 is such a witness. At the granularity level defined by
the operations, the witness history H1 can be represented as
[Q.enq(a) by p1] [Q.enq(b) by p1] [Q.enq(c) by p2] [Q.deq() → a by p2] [Q.deq() → b by p1].
This formulation highlights the intuition that underlies the definition of the atomicity
concept.
Linearization point The very existence of a linearization of an atomic history H
means that each operation of H could have been executed at an indivisible instant
between its invocation and reply time events (while providing the same result as H).
It is thus possible to associate a linearization point with each operation of an atomic
history. This is a point of the time line at which the corresponding operation could
have been “instantaneously” executed according to the witness sequential and legal
history.
To respect the real-time occurrence order, the linearization point associated with
an operation has always to appear within the interval defined by the invocation event
and the reply event associated with that operation.
Example Figure 4.5 depicts the linearization point of each operation. A triangle is
associated with each operation, such that the vertex at the bottom of a triangle (bold
dot) represents the associated linearization point. A triangle shows how atomicity
allows shrinkage of an operation (the history of which takes some duration) into a
single point on the time line.
In that sense, atomicity reduces the difficult problem of reasoning about a con-
current system to the simpler problem of reasoning about a sequential system where
the operations issued by the processes are instantaneously executed.
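A classic way to exhibit a linearization point is to implement an operation with a single atomic hardware step. The Java sketch below (our own example, not the book's) implements a counter whose fetch-and-increment operation linearizes at the unique atomic update performed inside incrementAndGet(); the whole operation can be "shrunk" to that instant on the time line:

import java.util.concurrent.atomic.AtomicInteger;

class LinearizableCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // The linearization point is the atomic update inside incrementAndGet().
    int fetchAndIncrement() { return value.incrementAndGet() - 1; }
}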
As a second example, let us consider a variant of the history depicted in Fig. 4.5
where the reply events e9 and e10 are “exchanged”, i.e., we have now that e9 =
resp[deq(b) by p2 ] and e10 = resp[deq(a) by p1 ]. It is easy to see that this history
is linearizable: the sequential history H2 described in Sect. 4.2.4 is a linearization
of it.
Similarly, the history where e9 = resp[deq(c) by p2] and e10 = resp[deq(a) by
p1] is also linearizable. It has the following sequential witness history:
[Q.enq(c) by p2] [Q.enq(a) by p1] [Q.enq(b) by p1] [Q.deq() → c by p2] [Q.deq() → a by p1].
Differently, the history in which the two dequeue operations would return the
same value is not linearizable: it does not have a witness history which respects the
sequential specification of the queue.
This section extends the definition of atomicity to partial histories. As already indicated,
these are histories with at least one process whose last operation is pending: the invocation
event of this operation appears in the history while the corresponding reply event does
not. The history H4 described in Sect. 4.2.4 is a partial history. Extending atomicity to
partial histories is important as it allows arbitrary delays experienced by processes, or
even process crashes (when these delays become infinite), to be dealt with.
Definition A partial history H is linearizable if H can be modified in such a way
that every invocation of a pending operation is either removed or completed with a
reply event, and the resulting (complete) history H′ is linearizable.
Basically, the problem of determining whether a partial history H is linearizable is
reduced to the problem of determining whether a complete history H′, extracted from
H, is linearizable. We obtain H′ by adding reply events to certain pending operations
of H, as if these operations had indeed been completed, but also by removing invocation
events from some of the pending operations of H. We require, however, that all
complete operations of H be preserved in H′. It is important to notice that, given a
history H, we can extract several histories H′ that satisfy the required conditions.
Example Let us consider Fig. 4.6, which depicts two processes accessing a register.
Process p1 first writes the value 0. The same process later issues a write for the value 1,
but p1 crashes during this second write (this is indicated by a cross on its time line).
Process p2 executes two consecutive read operations. The first read operation lies
between the two write operations of p1 and returns the value 0. A different value
would clearly violate atomicity. The situation is less obvious with the second value,
and it is not entirely clear what value v has to be returned by the second read operation
in order for the history to be linearizable.
As explained below, both values 0 and 1 can be returned by that read operation
while preserving atomicity. The second write operation is pending in the partial
history H modeling this execution. This history H is made up of seven events (the
names of the object and the processes are omitted as there is no ambiguity), namely
H = inv[write(0)] resp[write(0)] inv[read()] resp[read() → 0] inv[write(1)] inv[read()] resp[read() → v].
We explain now why both 0 and 1 can be returned by the second read:
• Let us first assume that the returned value v is 0.
We can associate with the history H a legal sequential witness history H0 which
includes only complete operations and respects the partial order defined by H
on these operations (see Fig. 4.6). To obtain H0, we first construct a history H′
by removing the event inv[write(1)] from H: we obtain a complete history, i.e.,
a history without pending operations.
The history H with v = 0 is consequently linearizable. The associated witness history
H0 models the situation where p1 is considered as having crashed before invoking
the second write operation: everything appears as if this write had never been
issued.
• Let us now assume that the returned value v is 1.
Similarly to the previous case, we can associate with the history H a legal
sequential witness history H1 that respects the partial order on the operations. We
actually derive H1 by first constructing H′, which we obtain by adding to H the
reply event resp[write(1)]. (In Fig. 4.6, the part added to H in order to obtain H′,
from which H1 is constructed, is indicated by dotted lines.)
The history where v = 1 is consequently linearizable. The associated witness
history H1 represents the situation where the second write is taken into account
despite the crash of the process that issued that write operation.
4.4 Object Composability and Guaranteed Termination Property
This section presents two fundamental properties of atomicity that make it partic-
ularly attractive. The first property states that atomic objects can be composed for
free, while the second property states that, as object operations are total, no operation
invocation can be prevented from terminating.
This composability property follows from the locality theorem: a history H is
linearizable if and only if, for each object X involved in H, H|X is linearizable. The
“only if” direction is trivial. For the “if” direction, let →X denote, for each object X,
the total order on the operations on X defined by a linearization of H|X (item 1), and
let → be the union of these relations →X and of →H (item 2). Basically, “→” totally
orders all operations on the same object X, according to →X (item 1), while preserving
→H, i.e., the real-time occurrence order on the operations (item 2).
Claim. “→ is acyclic”. This claim means that → defines a partial order on the set of
all the operations of H.
Assuming this claim (see its proof below), it is thus possible to construct a sequential
history S including all the events of H and respecting →. We trivially have → ⊆ →S,
where →S is the total order on the operations defined from S. We have the three
following conditions: (1) H and S are equivalent (they contain the same events and
hence the same local histories), (2) S is sequential (by construction) and legal (due
to item 1 above), and (3) →H ⊆ →S (due to item 2 above and → ⊆ →S). It follows
that H is linearizable.
Proof of the claim. We show (by contradiction) that → is acyclic. Assume first that
→ induces a cycle involving the operations on a single object X. Indeed, as →X is
a total order, in particular transitive, there must be two operations opi and opj on X
such that opi →X opj and opj →H opi . But opi →X opj ⇒ inv[opi ] <H resp[opj ]
because X is linearizable. As <H is a total order on the whole set of events, the fact
that opj →H opi ⇒ resp[opj ] <H inv[opi ] establishes the contradiction.
It follows that any cycle must involve at least two objects. To obtain a contradiction
we show that, in that case, a cycle in → implies a cycle in →H (which is acyclic).
Let us examine the way the cycle could be obtained. If two consecutive edges of the
cycle are due to just some →X or just →H , then the cycle can be shortened, as any
of these relations is transitive. Moreover, opi →X opj →Y opk is not possible for
X ≠ Y, as each operation is on only one object (opi →X opj →Y opk would imply
that opj is on both X and Y). So let us consider any sequence of edges of the cycle
such that: op1 →H op2 →X op3 →H op4 . We have:
– op1 →H op2 ⇒ resp[op1 ] <H inv[op2 ] (definition of →H ),
– op2 →X op3 ⇒ inv[op2 ] <H resp[op3 ] (as X is linearizable),
– op3 →H op4 ⇒ resp[op3 ] <H inv[op4 ] (definition of →H ).
Combining these statements, we obtain resp[op1 ] <H inv[op4 ], from which we can
conclude that op1 →H op4 . It follows that any cycle in → can be reduced to a cycle
in →H , which is a contradiction as →H is an irreflexive partial order. End of the
proof of the claim.
The benefit of locality Considering an execution of a set of processes that access
concurrently a set of objects, atomicity allows the programmer to reason as if
all the operations issued by the processes on the objects were executed one after
the other. The previous theorem is fundamental. It states that, to reason about
sequential processes that access concurrent atomic objects, one can reason on
each object independently, without losing the atomicity property of the whole
computation.
An example Locality means that atomic objects compose for free. As an example,
let us consider two atomic queue objects Q1 and Q2 each with its own implementation
I1 and I2, respectively (hence, the implementations can use different algorithms).
Let us define the object Q that is a composition of Q1 and Q2 defined as fol-
lows (Fig. 4.7). Q provides processes with the four following operations Q.enq1(),
Q.deq1(), Q.enq2(), and Q.deq2() whose effect is the same as Q1.enq(),
Q1.deq(), Q2.enq() and Q2.deq(), respectively.
Thanks to locality, an implementation of Q consists simply in piecing together I1
and I2 without any modification of their code. As we will see in Sect. 4.5, this object
composition property is no longer true for other consistency conditions.
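Using the FifoQueue interface sketched earlier, the composed object Q can be written as follows; note the complete absence of added synchronization code, which is exactly what locality promises:

class ComposedQueues<V> {
    private final FifoQueue<V> q1, q2;  // any two atomic implementations I1 and I2

    ComposedQueues(FifoQueue<V> q1, FifoQueue<V> q2) {
        this.q1 = q1;
        this.q2 = q2;
    }

    // Each operation simply forwards to the corresponding underlying queue.
    void enq1(V v) { q1.enq(v); }
    V deq1()       { return q1.deq(); }
    void enq2(V v) { q2.enq(v); }
    V deq2()       { return q2.deq(); }
}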
Due to the fact that operations are total, atomicity (linearizability) per se does not
require a pending invocation of an operation to wait for another operation to complete.
This means that, if a given implementation I of an atomic object entails blocking of a
total operation, this is not due to the atomicity concept but only to I. Blocking is an
artifact of particular implementations of atomicity but not an inherent feature of
atomicity.
This property of the atomicity consistency condition is captured by the following
theorem, which states that any (atomic) history with a pending operation invocation
can be extended with a reply to that operation.
Theorem 15 Let inv[op(arg)] be the invocation event of a total operation that is
pending in a linearizable history H. There exists a matching reply event resp[op(res)]
such that the history H′ = H resp[op(res)] (i.e., H followed by that reply event) is
linearizable.
Proof Let S be a linearization of the partial history H. By definition of a linearization,
S has a matching reply to every invocation. Assume first that S includes a reply event
resp[op(res)] matching the invocation event inv[op(arg)]. In this case, the theorem
trivially follows, as then S is also a linearization of H′.
If S does not include a matching reply event, then S does not include inv[op(arg)].
Because the operation op() is total, there is a reply event resp[op(res)] matching
the invocation event inv[op(arg)] in every state of the shared object. Let S′ be the
sequential history S with the invocation event inv[op(arg)] and a matching reply
event resp[op(res)] added, in that order, at the end of S. S′ is trivially legal. It follows
that S′ is a linearization of H′.
4.5 Sequential Consistency and Serializability
4.5.1 Sequential Consistency
Sequential consistency is a weakening of atomicity: it requires a legal sequential
witness history that respects the process order of the operations, but not necessarily
their real-time occurrence order. Unlike atomicity, sequential consistency is not a
local property. To see why, let us consider two processes p1 and p2 sharing two
queues Q and Q′ (Fig. 4.9): p1 executes Q.enq(a); Q′.enq(a′); Q′.deq() (which
returns a′), while p2 executes Q′.enq(b′); Q.enq(b); Q.deq() (which returns b). It is
easy to see that, when we consider each object in isolation, we obtain the histories
H|Q and H|Q′ that are sequentially consistent: [Q.enq(b) by p2] [Q.enq(a) by p1]
[Q.deq(b) by p2] is a legal witness for H|Q, and [Q′.enq(a′) by p1] [Q′.enq(b′) by p2]
[Q′.deq(a′) by p1] is a legal witness for H|Q′. Unfortunately, there is no way to build a
legal witness total order S that involves the six operations: if p1 dequeues a′ from
Q′, Q′.enq(a′) has to be ordered before Q′.enq(b′) in a witness sequential history.
But this means that (to respect process-order) Q.enq(a) by p1 is necessarily ordered
before Q.enq(b) by p2. Consequently, Q.deq() by p2 should return a for S to be
legal. A similar reasoning can be done starting from the operation Q.deq(b) by p2.
It follows that there can be no legal witness total order. Hence, despite the fact that
H|Q and H|Q′ are sequentially consistent, the whole history H is not.
4.5.2 Serializability
Adding a commit (or abort) event after all the other events of a pending transaction
is called committing (or aborting) the transaction. A sequential history is a sequence
of committed transactions.
Let →trans denote the total order of events of the committed transactions. This
is analogous to the process-order relation defined above. We say that a history is
complete if all its transactions are complete.
Let H be a complete history. H is serializable if there is a “witness” history S
such that:
1. S is made up of all the events of the committed transactions of H,
2. S is sequential and legal, and
3. →trans ⊆ →S (S has to respect the transaction order).
Let H be a history that is not complete. H is serializable if we can derive from H
a history H′ (by completing or removing pending transactions from H) such that:
(1) H′ is complete, (2) H′ includes the complete transactions of H, and (3) H′ is
serializable.
Atomicity versus serializability As for atomicity, serializability is defined accord-
ing to the equivalence to a witness sequential history, but differently from atomicity,
no real-time ordering is required. In this sense, serializability can be viewed as an
extension of sequential consistency to transactions where a transaction is made up of
several invocations of object operations. Unlike atomicity, serializability is not a local
property (replacing processes with transactions in Fig. 4.9 gives a counter-example).
4.6 Summary
This chapter has introduced the basic elements that are needed to reason about exe-
cutions of a multiprocess program whose processes cooperate through concurrent
objects (defined by a sequential specification on total operations). More specifically,
this chapter has presented the basic notions from which the atomicity concept has
then been defined.
The fundamental modeling element is that of a history: a sequence of events depicting
the interaction between processes and objects. An event represents the invocation
of an object operation or the return of a reply. A history is atomic if, despite concurrency, it
appears as if processes access the objects by invoking operations one after the other.
In this sense, the correctness of a concurrent computation is judged with respect to a
sequential behavior, itself determined by the sequential specification of the objects.
Hence, atomicity is what allows us to reason sequentially despite concurrency.
• The notion of atomic read/write objects (registers), as studied here, was investi-
gated and formalized by L. Lamport [189] and J. Misra [206].
• The generalization of the atomicity consistency condition to objects of any sequen-
tial type was developed by M. Herlihy and J. Wing under the name linearizability
[148].
• The notion of sequential consistency was introduced by L. Lamport [187].
The relation between atomicity and sequential consistency was investigated in
[40] and [232], where it was shown that, from a protocol design point of view,
sequential consistency can be seen as lazy linearizability. Examples of protocols
implementing sequential consistency can be found in [3, 40, 233].
• The concept of transactions is part of almost every textbook on database systems.
Books entirely devoted to transactions include [50, 97, 119]. The theory of serial-
izability is the main topic of [97, 222].
Part III
Mutex-Free Synchronization
While Part I was devoted to lock-based synchronization, this part of the book is on
the design of concurrent objects whose implementation does not rely on mutual
exclusion. It is made up of five chapters:
• The first chapter introduces the notion of a mutex-free implementation (i.e.,
implementations which are not allowed to rely on locks) and the associated
liveness properties, namely obstruction-freedom, non-blocking, and wait-
freedom.
• The second chapter introduces the notion of a hybrid implementation, namely
an implementation which is partly lock-based and partly mutex-free.
• The next three chapters are on the power of atomic read/write registers when one
has to design wait-free object implementations. These chapters show that non-
trivial objects can be built in such a particularly poor context. To that end, they
present wait-free implementations of the following concurrent objects: weak
counters, store-collect objects, snapshot objects, and renaming objects.
Remark on terminology As we are about to see, the term mutex-freedom is used
to indicate that the use of critical sections (locks) is prohibited. The term lock-
freedom could have been used instead of mutex-freedom. This has not been done
for the following reason: the term lock-freedom is already used in a lot of papers
on synchronization with different meanings. In order not to overload it and to
prevent confusion, the term mutex-freedom is used in this book.
Chapter 5
Mutex-Free Concurrent Objects
Locks are not always the panacea As we have seen in Chaps. 1 and 3, the systematic
use of locks constitutes a relatively simple method to implement atomic concurrent
objects defined by total operations. A lock is associated with every object O and all the
operation invocations on O are bracketed by acquire_lock() and release_lock() so that
at most one operation invocation on O at a time is executed. However, as we are about to
see in this chapter, locks are not the only approach to implement atomic objects. Locks
have drawbacks related to process blocking and the granularity of the underlying base
objects used in the internal representation of the object under construction.
As far as the granularity of the object protected by a lock is concerned, let us
consider a lock-based implementation of a bounded queue object Q with total oper-
ations (Q.deq() returns ⊥ when the queue is empty, and Q.enq() returns ⊥ when
the queue is full). The use of a single lock on the whole internal representation of
the queue prevents Q.enq() and Q.deq() from being executed concurrently. This can
decrease the queue efficiency, as nothing prevents these two operations from exe-
cuting concurrently when the queue is neither empty nor full. A solution consists
in using locks at a finer granularity level in order to benefit from concurrency and
increase efficiency. Unfortunately this makes deadlock prevention more difficult and,
due to their very nature, locks cannot eliminate the blocking problem.
The drawback related to process blocking is more severe. Let us consider a process
p that for some reason (e.g., page fault) stops executing during a long period in the
middle of an operation on an object O. If we use locks, as we have explained above,
the processes which have concurrently invoked an operation on O become blocked
until p terminates its own operation. When such a scenario occurs, processes suffer
delays due to other processes. Such an implementation is said to be blocking-prone.
The situation is even worse if the process p crashes while it is in the middle of an
operation execution. (In an asynchronous system a crash corresponds to the case
where the speed of the corresponding process becomes and remains forever equal to
0, this being never known by the other processes. This point is developed below at
the end of Sect. 5.1.2.) When this occurs, p never releases the lock, and consequently,
all the processes that will invoke an operation on O will become blocked forever.
Hence, the crash of a process creates an infinite delay that can entail a deadlock on
all operations accessing the object O.
These observations have motivated the design of concurrent object implementa-
tions that do not use locks in one way or another (i.e., explicitly or implicitly). These
implementations are called mutex-free.
Operation level versus implementation level Let us consider an object O with
two operations O.op1() and O.op2(). At the user level, the (correct) behaviors of O
are defined by the traces of its sequential specification.
When considering the implementation level, the situation is different. Each exe-
cution of O.op1() or O.op2() corresponds to a sequence of invocations of base
operations on the base objects that constitute the internal representation of O.
If the implementation of O is lock-based and we do not consider the execution of
the base operations that implement acquire_lock() and release_lock(), the sequence
of base operations produced by an invocation of O.op1() or O.op2() cannot be
interleaved with the sequence of base operations produced by another operation
invocation. When the implementation is mutex-free, this is no longer the case, as
depicted in Fig. 5.1.
Figure 5.1 shows that the invocations of O.op1() by p1 , O.op2() by p2 , and O.op1()
by p3 are linearized in that order (i.e., they appear to have been executed in that order
from an external observer point of view).
(Fig. 5.1: the invocations are linearized at the object level, while, at the implementation level, the corresponding sequences of base operations are interleaved.)
When processes may crash A process crashes when it stops its execution prema-
turely. Due to the asynchrony assumption on the speed of processes, a crash can be
seen as if the corresponding process pauses during an infinitely long period before
executing its next step. Asynchrony, combined with the fact that no base shared-
memory operation (read, write, compare&swap, etc.) provides processes with infor-
mation on failures, makes it impossible for a process to know if another process
has crashed or is only very slow. It follows that, when we consider mutex-free
object implementations, the definition of obstruction-freedom, non-blocking, and
wait-freedom copes naturally with any number of process crashes.
Of course, if a process crashes while executing an object operation, it is assumed
that this invocation trivially terminates. As we have seen in Chap. 4 devoted to the
atomicity concept, this operation invocation is then considered either as entirely
executed (and everything appears as if the process crashed just after the invocation)
or not at all executed (and everything appears as if the process crashed just before
the invocation). This is the all-or-nothing semantics associated with crash failures
from the atomicity consistency condition point of view.
Hierarchy of progress conditions It is easy to see that obstruction-freedom, non-
blocking, and wait-freedom define a hierarchy of progress conditions for mutex-free
implementations of concurrent objects.
More generally, the various progress conditions encountered in the implementa-
tion of concurrent objects are summarized in Table 5.1.
Definition The splitter object was implicitly used in Chap. 1 when presenting Lam-
port’s fast mutual exclusion algorithm. A splitter is a concurrent object that provides
processes with a single operation, denoted direction(). This operation returns a value to
the invoking process. The semantics of a splitter is defined by the following properties:
• Validity. The value returned by direction() is stop, left, or right.
• Concurrent execution. If x ≥ 1 processes invoke direction(), then at most one
process obtains the value stop, at most x − 1 processes obtain the value left, and
at most x − 1 processes obtain the value right.
• Termination. Every invocation of direction() by a process that does not crash
terminates.
(Fig. 5.2: the x invoking processes are split into at most one process that obtains stop, at most x − 1 processes that obtain left, and at most x − 1 processes that obtain right.)
A splitter (Fig. 5.2) ensures that (a) not all the invoking processes go in the same
direction, and (b) the direction stop is taken by at most one process and exactly one
process in a solo execution. As we will see in this chapter, splitters are base objects
used to build more sophisticated concurrent objects.
Let us observe that, for x = 1, the concurrent execution property becomes: if a
single process invokes direction(), only the value stop can be returned. This property
is sometimes called the “solo execution” property.
A wait-free implementation A very simple wait-free implementation of a splitter
object SP is described in Fig. 5.3. The internal representation is made up of two
MWMR atomic registers: LAST , which contains a process index (its initial value is
arbitrary), and a binary register DOOR, whose domain is {open, closed} and which
is initialized to open.
When a process pi invokes SP.direction() it first writes its index i in the atomic
register LAST (line 1). Then it checks if the door is open (line 2). If the door has been
closed by another process, pi returns right (line 3). Otherwise, pi closes the door
(which can be closed by several processes, line 4) and then checks if it was the last
process to have invoked direction() (line 5). If this is the case, we have LAST = i
and pi returns stop, otherwise it returns left.
A process that obtains the value right is actually a “late” process: it arrived late at
the splitter and found the door closed. Differently, a process pi that obtains the value
left is actually a “slow” process: it set LAST ← i but was not quick enough during
the period that started when it wrote its index i into LAST (line 1) and ended when
it read LAST (line 5). According to the previous meanings for “late” and “slow”,
not all the processes can be late, not all the processes can be slow, and at most one
process can be neither late nor slow, being “timely” and obtaining the value stop.
Theorem 17 The algorithm described in Fig. 5.3 is a correct wait-free implemen-
tation of a splitter.
operation SP.direction() is
(1) LAST ← i;
(2) if (DOOR = closed)
(3)    then return(right)
(4)    else DOOR ← closed;
(5)         if (LAST = i)
(6)            then return(stop)
(7)            else return(left)
(8)         end if
(9) end if
end operation
Fig. 5.3 Wait-free implementation of a splitter object (code for process pi)
Proof The algorithm of Fig. 5.3 is basically the same as the one implementing
the operation conc_abort_op() presented in Fig. 2.12 (Chap. 2); abort1, abort2, and
commit are replaced by right, left, and stop. The following proof is consequently very
close to the proof of Theorem 4. We adapt and repeat it here for self-containment of
the chapter.
The validity property follows trivially from the fact that the only values that can
be returned are right (line 3), stop (line 6), and left (line 7).
As far as the termination property is concerned, let us observe that the code of the
algorithm contains neither loops nor wait statements. It follows that any invocation
of SP.direction() by a process (which does not crash) does terminate and returns a
value. The implementation is consequently wait-free.
As far as the solo execution property is concerned, it follows from a simple
examination of the code and the fact that the door is initially open that, if a single
process invokes SP.direction() (and does not crash before executing line 6), it returns
the value stop.
Let us now consider the concurrent execution property. For a process to obtain
right, the door must be closed (lines 2–3). As the door is initially open, it follows that
the door was closed by at least one process p and this was done at line 4 (which is the
only place where a process can close the door). According to the value of LAST (line
5), process p will return stop or left. It follows that, among the x processes which
invoke SP.direction(), at least one does not return the value right.
As far as the value left is concerned, we have the following. Let pi be the last
process that writes its index i into the register LAST (as this register is atomic, the
notion of “last” writer is well defined). If the door is closed, it obtains the value
right. If the door is open, it finds LAST = i and obtains the value stop. Hence, not
all processes can return left.
Let us finally consider the value stop. Let pi be the first process that finds LAST equal to its own index i (line 5). This means that no process pj, j ≠ i, has modified LAST during the period starting when it was written by pi at line 1 and ending when it was read by pi at line 5 (Fig. 5.4). It follows that any process pj that modifies LAST
after this register was read by pi will find the door closed (line 2). Consequently, any
such pj cannot obtain the value stop.
The reader may check that the proof of the splitter object remains valid if
processes crash.
This section presents a simple obstruction-free timestamp object built from atomic
registers. Actually, the object is built from splitters which, as we have just seen, are
in turn built from atomic read/write registers.
Definition The object is a weak timestamp generator which provides the processes with a single operation denoted get_timestamp(); this operation returns a natural integer. Its specification is the following:
• Validity. No two invocations of get_timestamp() return the same value.
• Consistency. Let gt1 () and gt2 () be two distinct invocations of get_timestamp().
If gt1 () returns before gt2 () starts, the timestamp returned by gt2 () is greater than
the one returned by gt1 ().
• Termination. Obstruction-freedom.
It is easy to see that a lock-based implementation of a timestamp object is trivial: an atomic register protected by a lock is used to supply timestamps. But, as already noticed, locking and obstruction-freedom are incompatible in asynchronous crash-prone systems. It is also trivial to implement this object directly from the fetch&add() primitive. The interest of the timestamp generator presented here is consequently mainly pedagogical: it shows an obstruction-free implementation built on top of read/write registers only.
An algorithm The obstruction-free implementation relies on the following under-
lying data structures:
• NEXT defines the value of the next integer that can be used as a timestamp. It is
initialized to 1.
• LAST is an unbounded array of atomic registers. A process pi deposits its index i in LAST[k] to indicate it is trying to obtain the timestamp k.
• COMP is another unbounded array of atomic Boolean registers, each entry being initialized to false. A process pi sets COMP[k] to true to indicate that it is competing for the timestamp k (hence, several processes can write true into COMP[k]).
operation get_timestamp() is
(1) k ← NEXT;
(2) repeat forever
(3)    LAST[k] ← i;
(4)    if (¬COMP[k])
(5)       then COMP[k] ← true;
(6)            if (LAST[k] = i) then NEXT ← k + 1; return(k) end if
(7)    end if;
(8)    k ← k + 1
(9) end repeat
end operation
Fig. 5.5 An obstruction-free implementation of a weak timestamp generator (code for process pi)
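A hedged Java transcription of this algorithm could look as follows (the names are ours; the unbounded arrays LAST and COMP are bounded here, which is the main simplification of the sketch):

import java.util.concurrent.atomic.AtomicIntegerArray;

final class WeakTimestampGenerator {
    private static final int CAP = 1 << 20;     // LAST and COMP are unbounded in the book
    private volatile int next = 1;              // NEXT, initialized to 1
    private final AtomicIntegerArray last = new AtomicIntegerArray(CAP); // LAST[k]
    private final AtomicIntegerArray comp = new AtomicIntegerArray(CAP); // COMP[k] (0 = false)

    int getTimestamp(int i) {                   // i = index of the invoking process (i > 0)
        int k = next;                           // line 1
        while (true) {                          // line 2: repeat forever
            last.set(k, i);                     // line 3: LAST[k] <- i
            if (comp.get(k) == 0) {             // line 4: nobody competes for k yet
                comp.set(k, 1);                 // line 5: COMP[k] <- true
                if (last.get(k) == i) {         // line 6: pi won the splitter for k
                    next = k + 1;
                    return k;
                }
            }
            k++;                                // line 8: compete for the next value
        }
    }
}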
The ABA problem When using compare&swap(), a process pi usually does the
following. It first reads the atomic register X (obtaining the value a), then executes
statements (possibly involving accesses to the shared memory) and finally updates
X to a new value c only if X has not been modified by another process since it was
read by pi . To that end, pi invokes X.compare&swap(a, c) (Fig. 5.6).
Unfortunately, the fact that this invocation returns true to pi does not allow pi
to conclude that X has not been modified since the last time it read it. This is
because, between the read of X and the invocation X.compare&swap(a, c) both
issued by pi , X could have been updated twice, first by a process pj that success-
fully invoked X.compare&swap(a, b) and then by a process pk that has successfully
invoked X.compare&swap(b, a), thereby restoring the value a to X. This is called
the ABA problem.
Solving the ABA problem This problem can be solved by associating tags (sequence numbers) with each value that is written. The atomic register X is then composed of two fields ⟨content, tag⟩. When it reads X, a process pi obtains a pair ⟨x, y⟩ (where x is the current “data value” of X) and it later invokes X.compare&swap(⟨x, y⟩, ⟨c, y + 1⟩) to write a new value c into X. It is easy to see that the write succeeds only if X has continuously been equal to ⟨x, y⟩.
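In Java, this ⟨content, tag⟩ pairing is provided directly by AtomicStampedReference; the following sketch (class and method names are ours) shows the read/compute/conditional-write pattern of Fig. 5.6 with the tag-based fix:

import java.util.concurrent.atomic.AtomicStampedReference;
import java.util.function.UnaryOperator;

final class TaggedRegister<V> {
    private final AtomicStampedReference<V> x;   // X = <content, tag>

    TaggedRegister(V initialValue) { x = new AtomicStampedReference<>(initialValue, 0); }

    // Read X, execute statements computing a new value, and write it back
    // only if X has not been modified in between: even if other processes
    // restore the same data value, their writes increase the tag, so the
    // final compare&swap fails and the ABA scenario is ruled out.
    boolean readComputeWrite(UnaryOperator<V> statements) {
        int[] tagHolder = new int[1];
        V a = x.get(tagHolder);                  // obtain the pair <a, y>
        int y = tagHolder[0];
        V c = statements.apply(a);               // compute the new value c
        return x.compareAndSet(a, c, y, y + 1);  // X.compare&swap(<a,y>, <c,y+1>)
    }
}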
Differently, (Q ↓).tail.ptr points to the dummy cell only when the list is empty.
Moreover, we have initially (Q ↓).head.tag = (Q ↓).tail.tag = 0.
It is assumed that the operation new_cell() creates a new cell in the shared memory,
while the operation free_cell(pt) frees the cell pointed to by pt.
The algorithm implementing the operation Q.enq() As already indicated, these
algorithms consist in handling pointers in an appropriate way. An interesting point is
the fact that they require processes to help other processes terminate their operations.
Actually, this helping mechanism is the mechanism that implements the non-blocking
property.
The algorithm implementing the enq() operation is described at lines 1–13 of Fig. 5.9. The invoking process pi first creates a new cell in the shared memory, assigns its address to the local pointer pt_cell, and updates its fields value and next.ptr (line 1). Then pi enters a loop that it will exit only when the value v has been enqueued.
In the loop, pi executes the following statements. It is important to notice that,
in order to obtain consistent pointer values, these statements include sequences of
read and re-read (with compare&swap) to check that pointer values have not been
modified.
• Process pi first makes local copies (kept in tail and next) of (Q ↓).tail and
(tail.ptr ↓).next, respectively. These values inform pi on the current state of the
tail of the queue (lines 3–4).
• Then pi checks if the content of (Q ↓).tail has changed since it read it (line 5).
If it has changed, tail.ptr no longer points to the last element of the queue.
Consequently, pi starts the loop again.
• If tail = (Q ↓).tail (line 6), pi optimistically considers that no other process is
currently trying to enqueue a value. It then checks if next.ptr is equal to ⊥.
operation Q.deq() is
(14) repeat forever
(15)    head ← (Q ↓).head;
(16)    tail ← (Q ↓).tail;
(17)    next ← ((head.ptr) ↓).next;
(18)    if (head = (Q ↓).head) then
(19)       if (head.ptr = tail.ptr)
(20)          then if (next.ptr = ⊥) then return(empty) end if;
(21)               ((Q ↓).tail).compare&swap(tail, ⟨next.ptr, tail.tag + 1⟩)
(22)          else v ← ((next.ptr) ↓).value;
(23)               if ((Q ↓).head).compare&swap(head, ⟨next.ptr, head.tag + 1⟩)
(24)                  then free_cell(head.ptr); return(v)
(25)               end if
(26)       end if
(27)    end if
(28) end repeat
end operation
∗ If process pi succeeds in appending its new cell to the list, it tries to update the content of (Q ↓).tail. This is done by executing ((Q ↓).tail).compare&swap(tail, ⟨pt_cell, tail.tag + 1⟩) (line 8). Finally, pi returns from its invocation.
Let us observe that it is possible that the second compare&swap does not
succeed. This is the case when, due to asynchrony, another process pj did the
work for pi by executing line 10 of enq() or line 21 of deq().
– If next.ptr ≠ ⊥, pi discovers that tail.ptr does not point to the last element of the queue. Hence, pi discovers that the value of (Q ↓).tail was not up to date when it read it. Another process has added an element to the queue but had not yet updated (Q ↓).tail when pi read it. In that case, pi tries to help the other process terminate the update of (Q ↓).tail if not yet done. To that end, it executes the statement ((Q ↓).tail).compare&swap(tail, ⟨next.ptr, tail.tag + 1⟩) (line 10) before restarting the loop.
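For illustration, here is a hedged Java transcription of the enqueue side only (the class and field names are ours; head is shown for completeness). Since the Java garbage collector prevents a reachable cell from being recycled, the tag field of Fig. 5.9 can be dropped and a plain AtomicReference used:

import java.util.concurrent.atomic.AtomicReference;

final class MSQueue<V> {
    private static final class Cell<V> {
        final V value;
        final AtomicReference<Cell<V>> next = new AtomicReference<>(null);
        Cell(V value) { this.value = value; }
    }

    private final AtomicReference<Cell<V>> head, tail;

    MSQueue() {
        Cell<V> dummy = new Cell<>(null);            // the dummy cell
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enq(V v) {
        Cell<V> cell = new Cell<>(v);                // line 1: create the new cell
        while (true) {                               // line 2: loop until v is enqueued
            Cell<V> last = tail.get();               // line 3: local copy of tail
            Cell<V> next = last.next.get();          // line 4: copy of its successor
            if (last != tail.get()) continue;        // line 5: tail moved, retry
            if (next == null) {                      // line 6: tail really was last
                if (last.next.compareAndSet(null, cell)) {  // line 7: append the cell
                    tail.compareAndSet(last, cell);  // line 8: swing tail (may be helped)
                    return;
                }
            } else {
                tail.compareAndSet(last, next);      // line 10: help the other enqueuer
            }
        }
    }
}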
The stack and its operations The stack has two operations, denoted push(v)
(where v is the value to be added at the top of the stack) and pop(). It is a bounded
stack: it can contain at most k values. If the stack is full, push(v) returns the control
value full, otherwise v is added to the top of the stack and the control value done is
returned. The operation pop() returns the value that is at the top of the stack (and
suppresses it from the stack), or the control value empty if the stack is empty.
Internal representation of the stack This non-blocking implementation of an
atomic stack is due to N. Shafiei (2009). The stack is implemented with an atomic
register denoted TOP and an array of k + 1 atomic registers denoted STACK[0..k].
These registers can be read and can be modified only by using the compare&swap()
primitive.
• TOP has three fields that contain an index (to address an entry of STACK), a value, and a counter. It is initialized to ⟨0, ⊥, 0⟩.
• Each atomic register STACK[x] has two fields: the field STACK[x].val, which
contains a value, and the field STACK[x].sn, which contains a sequence number
(used to prevent the ABA problem as far as STACK[x] is concerned).
STACK[0] is a dummy entry initialized to ⟨⊥, −1⟩. Its first field always contains the default value ⊥. As far as the other entries are concerned, STACK[x] (1 ≤ x ≤ k) is initialized to ⟨⊥, 0⟩.
The array STACK is used to store the contents of the stack, and the register TOP
is used to store the index and the value of the element at the top of the stack. The
contents of TOP and STACK[x] are modified with the help of the conditional write
instruction compare&swap() (which is used to prevent erroneous modifications of
the stack internal representation).
The implementation is lazy in the sense that a stack operation assigns its new
value to TOP and leaves the corresponding effective modification of STACK to the
next stack operation. Hence, while on the one hand a stack operation is lazy, on the
other hand it has to help terminate the previous stack operation (as far as the internal
representation of the stack is concerned).
The algorithm implementing the operation push(v) When a process pi invokes
push(v), it enters a repeat loop that it will exit at line 4 or line 7. The process first
reads the content of TOP (which contains the last operation on the stack) and stores
its three fields in its local variables index, value, and seqnb (line 2).
Then, pi calls the internal procedure help(index, value, seqnb) to help terminate the previous stack operation (line 3). That stack operation (be it a push() or a pop()) is required to write the pair ⟨value, seqnb⟩ into STACK[index]. To that end, pi invokes STACK[index].compare&swap(old, new) with the appropriate values old and new so that the write is executed only if not yet done (lines 17–18).
After its help (which was effective only if the write was not already done by another stack operation) to move the content of TOP into STACK[index], pi returns full if the stack is full (line 4). If the stack is not full, it tries to modify TOP so that it registers its push operation. This invocation of TOP.compare&swap() (line 7) succeeds if no other process has modified TOP since it was read by pi at line 2. If it succeeds, TOP takes its new value and push(v) returns the control value done (line 7). Otherwise pi executes the body of the repeat loop again until its invocation of push() succeeds.
The triple of values to be written in TOP at line 7 is computed at lines 5–6. Process pi first computes the last sequence number sn_of_next used in STACK[index + 1] and then defines the new triple, namely newtop = ⟨index + 1, v, sn_of_next + 1⟩, to be written first in TOP and, later, in STACK[index + 1] thanks to the help provided by the next stack operation (let us remember that sn_of_next + 1 is used to prevent the ABA problem).
The algorithm implementing the operation pop() The algorithm implementing
this operation has exactly the same structure as the previous one and is nearly the
same. Its explanation is consequently left to the reader.
Linearization points of the push() and pop() operations The operations that
terminate are linearizable; i.e., they can be totally ordered on the time line, each
operation being associated with a single point of that line after its start event and
before its end event. Its start event corresponds to the execution of the first statement of
an operation, and its end event corresponds to the execution of the return() statement.
More precisely, an invocation of an operation appears as if it was atomically executed
• when it reads TOP (at line 2 or 10) if it returns full or empty (at line 4 or 12),
• or at the time at which its invocation of TOP.compare&swap(−, −) (at line 7 or 15) is successful (i.e., returns true).
operation push(v) is
(1) repeat forever
(2)    ⟨index, value, seqnb⟩ ← TOP;
(3)    help(index, value, seqnb);
(4)    if (index = k) then return(full) end if;
(5)    sn_of_next ← STACK[index + 1].sn;
(6)    newtop ← ⟨index + 1, v, sn_of_next + 1⟩;
(7)    if TOP.compare&swap(⟨index, value, seqnb⟩, newtop) then return(done) end if
(8) end repeat
end operation

operation pop() is
(9)  repeat forever
(10)    ⟨index, value, seqnb⟩ ← TOP;
(11)    help(index, value, seqnb);
(12)    if (index = 0) then return(empty) end if;
(13)    ⟨val_below, sn_below⟩ ← STACK[index − 1];
(14)    newtop ← ⟨index − 1, val_below, sn_below + 1⟩;
(15)    if TOP.compare&swap(⟨index, value, seqnb⟩, newtop) then return(value) end if
(16) end repeat
end operation

procedure help(index, value, seqnb) is
(17) old ← ⟨STACK[index].val, seqnb − 1⟩; new ← ⟨value, seqnb⟩;
(18) STACK[index].compare&swap(old, new)
end procedure
Fig. 5.10 A non-blocking implementation of an atomic stack (code for process pi)
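A hedged Java sketch of the push side and of the helping procedure follows (the names are ours; TOP and the STACK entries are modeled as immutable records handled through AtomicReference, so that each compare&swap on a triple or a pair becomes a single reference CAS):

import java.util.concurrent.atomic.AtomicReference;

final class LazyStack<V> {
    private record Top<V>(int index, V value, int seqnb) {}
    private record Entry<V>(V val, int sn) {}

    private final int k;                                    // maximal size
    private final AtomicReference<Top<V>> top;              // TOP
    private final AtomicReference<Entry<V>>[] stack;        // STACK[0..k]

    @SuppressWarnings("unchecked")
    LazyStack(int k) {
        this.k = k;
        top = new AtomicReference<>(new Top<>(0, null, 0)); // <0, bottom, 0>
        stack = new AtomicReference[k + 1];
        stack[0] = new AtomicReference<>(new Entry<>(null, -1));   // dummy entry
        for (int x = 1; x <= k; x++) stack[x] = new AtomicReference<>(new Entry<>(null, 0));
    }

    // Terminate the previous operation: install <value, seqnb> into
    // STACK[index] if, and only if, this has not been done yet (lines 17-18).
    private void help(Top<V> t) {
        Entry<V> cur = stack[t.index()].get();
        if (cur.sn() == t.seqnb() - 1)     // the write is still pending
            stack[t.index()].compareAndSet(cur, new Entry<>(t.value(), t.seqnb()));
    }

    boolean push(V v) {                                     // false stands for "full"
        while (true) {                                      // line 1
            Top<V> t = top.get();                           // line 2
            help(t);                                        // line 3
            if (t.index() == k) return false;               // line 4
            int snOfNext = stack[t.index() + 1].get().sn(); // line 5
            Top<V> newTop = new Top<>(t.index() + 1, v, snOfNext + 1); // line 6
            if (top.compareAndSet(t, newTop)) return true;  // line 7
        }
    }
}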
• NEXT is an atomic register that contains the index of the next entry where a value
can be deposited. It is initialized to 1. This register can be read by any process. It
can be modified by any process by invoking NEXT .fetch&add(), which adds 1 to
NEXT and returns its new value.
• Case 1: pi crashes after it has executed the atomic statement REG[in] ← v (line 2).
In this case, from an external observer point of view, everything appears as if pi
crashed after it invoked STACK.push(v).
• Case 2: pi crashes after it has obtained an index value (line 1) and before it invokes
the atomic statement REG[in] ← v. In this case, pi has obtained an entry in from
operation push(v) is
(1) in ← NEXT.fetch&add() − 1;
(2) REG[in] ← v;
(3) return(done)
end operation

operation pop() is
(4) last ← NEXT − 1;
(5) for x from last down to 1 do
(6)    aux ← REG[x].swap(⊥);
(7)    if (aux ≠ ⊥) then return(aux) end if
(8) end for;
(9) return(empty)
end operation
NEXT but did not deposit a value into REG[in], which consequently will remain
forever equal to ⊥. In this case, from an external observer point of view, everything
appears as if the process crashed before invoking STACK.push(v).
From an internal point of view, the crash of pi just before executing REG[in] ← v
entails an increase of NEXT . But as the corresponding entry of the array REG will
remain forever equal to ⊥, this increase of NEXT can only increase the duration
of the loop but cannot affect its output.
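A hedged Java sketch of these two operations (the names are ours; REG is bounded here, and fetch&add is rendered with getAndIncrement, which returns the previous value of NEXT):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

final class WeakStack<V> {
    private static final int CAP = 1 << 20;         // REG is unbounded in the book
    private final AtomicInteger next = new AtomicInteger(1);          // NEXT
    private final AtomicReferenceArray<V> reg = new AtomicReferenceArray<>(CAP);

    void push(V v) {
        int in = next.getAndIncrement();            // line 1: obtain a fresh entry
        reg.set(in, v);                             // line 2: REG[in] <- v
    }

    V pop() {                                       // null stands for "empty"
        int last = next.get() - 1;                  // highest entry possibly used
        for (int x = last; x >= 1; x--) {           // scan from the top down
            V aux = reg.getAndSet(x, null);         // swap REG[x] with bottom
            if (aux != null) return aux;            // first non-null value = top of stack
        }
        return null;
    }
}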
If pi executes REG[x] ← c after all the values deposited at entries with an index
greater than x have been removed from the stack, and before new values are pushed
onto the stack, then the linearization point associated with push(c) is the time at
which pi executes REG[x] ← c.
While the definition of the linearization points associated with the operation invo-
cations on a concurrent object is sometimes fairly easy, the previous wait-free imple-
mentation of a stack (whose algorithms are simple) shows that this is not always the
case. This is due to the net effect of the mutex-freedom requirement and asynchrony.
Let us consider the case where (a) the processes can cooperate by accessing base
read/write atomic registers only and (b) any number of processes may crash. Let
us suppose that, in such a context, we have an obstruction-free implementation of
a concurrent object (hence this implementation relies only on read/write atomic
registers). An important question is then the following: Is it possible to boost this
implementation in order to obtain a non-blocking or even a wait-free implementa-
tion? This section presents an approach based on failure detectors that answers this
question.
The failure detector Ω_X (eventual leadership) This failure detector provides each process pi with a local read-only variable denoted ev_leader(X) (eventual leader in the set X)
such that the following properties are always satisfied:
• Validity. At any time, the variable ev_leader(X) of any process contains a process
index.
• Eventual leadership. There is a finite time after which the local variables
ev_leader(X) of the correct processes of X contain the same index, which is
the index of one of them.
This means that there is an arbitrarily long anarchy period during which the
content of any local variable ev_leader(X) can change and, at the same time, distinct
processes can have different values in their local variables. However, this anarchy
period terminates for the correct processes of X, and when it has terminated, the
local variables ev_leader(X) of the correct processes of X contain forever the same
index, and it is the index of one of them. The time at which this occurs is finite but
remains unknown to the processes. This means that, when a process of X reads x from
ev_leader(X), it can never be sure that px is correct. In that sense, the information
on failures (or the absence of failures) provided by Ω_X is particularly weak.
Remark on the use of X This failure detector is usually used in a context where
X denotes a dynamically defined subset of processes. It then allows these processes
to rely on the fact that one of them (which is correct) is eventually elected as their
common leader.
It is possible that, at some time, a process pi perceives locally X as being xi while another process pj perceives it as being xj ≠ xi. Consequently, the local read-only variables provided by Ω are denoted ev_leader(xi) at pi and ev_leader(xj) at pj. As xi and xj may change with time, this means that Ω may potentially be required to produce outputs for any non-empty subset x of Π (the whole set of processes composing the system).
The failure detector ♦P (eventually perfect) This failure detector provides each process pi with a local set variable denoted suspectedi such that the following properties are always satisfied:
• Completeness. There is a finite time after which the set suspectedi of every correct process pi contains the indexes of all the crashed processes.
• Eventual accuracy. There is a finite time after which the set suspectedi of every correct process pi contains only indexes of crashed processes.
As with Ω_X, (a) there is an arbitrarily long anarchy period during which each set suspectedi can contain arbitrary values, and (b) the time at which this anarchy period terminates remains unknown to the processes.
It is easy to see that ♦P is stronger than Ω_X (actually, it is strictly stronger). Let us assume that we are given ♦P. The output of Ω_X can be constructed as follows. For a process pi such that i ∉ X, the current value of ev_leader(X) is any process index and it can change at any time. For a process pi such that i ∈ X, the output of Ω_X is defined as follows: ev_leader(X) = min((Π \ suspected) ∩ X). The reader can check that the local variables ev_leader(X) satisfy the validity and eventual leadership properties of Ω_X.
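This extraction of Ω_X from ♦P is essentially a one-liner; as a hedged Java illustration (the names are ours, suspected being the local output of ♦P):

import java.util.Set;

final class OmegaFromDiamondP {
    // Output of Omega_X at a process pi with i in X:
    // ev_leader(X) = min((Pi \ suspected) ∩ X); if all of X is currently
    // suspected, any process index is a valid output (we return i itself).
    static int evLeader(Set<Integer> X, Set<Integer> suspected, int i) {
        return X.stream()
                .filter(j -> !suspected.contains(j))
                .min(Integer::compare)
                .orElse(i);
    }
}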
Fig. 5.15 A contention manager based on Ω_X: code of the operations need_help(i) and stop_help(i)
Let us remember that ♦P provides each process pi with a set suspected that eventually contains all crashed processes and only them.
This contention manager uses an underlying operation, denoted weak_ts(), that
generates locally increasing timestamps such that, if a process obtains a timestamp
value ts, then any process can obtain only a finite number of timestamp values
lower than ts. This operation weak_ts() can be implemented from atomic read/write
registers only. (Let us remark that weak_ts() is a weaker operation than the operation
get_timestamp() described in Fig. 5.5.)
The internal representation of the contention manager consists of an array of
SWMR atomic read/write registers TS[1..n] such that only pi can write TS[i]. This
array is initialized to [0, . . . , 0].
When pi invokes need_help(i), it assigns a weak timestamp to TS[i] (line 1). It will
reset TS[i] to 0 only when it executes stop_help(i). Hence, TS[i] ≠ 0 means that pi
is competing inside the contention manager. After it has assigned a value to TS[i], pi
waits (loops) until the pair (TS[i], i) is the smallest pair (according to lexicographical
ordering) among the processes that (a) are competing inside the contention manager
and (b) are not locally suspected to have crashed (lines 2–4).
Theorem 20 The contention manager described in Fig. 5.16 transforms an enriched
obstruction-free implementation of an object into a wait-free implementation.
Proof The proof is similar to the proof of Theorem 19. Let us suppose (by con-
tradiction) that there is an operation invocation by a correct process pi that never
terminates. Let tsi be its timestamp (obtained at line 1). Moreover, let this invocation be the one with the smallest pair ⟨tsi, i⟩ among all the invocations issued by correct processes that never terminate.
It follows from the property of weak_ts() that any other process obtains a finite number of timestamp values smaller than tsi, from which we conclude that there is a finite number of operation invocations that are lexicographically ordered before ⟨tsi, i⟩. Let I be this set of invocations. There are two cases.
• If an invocation of I issued by a process pj that is not correct (i.e., a process that will crash in the execution) does not terminate, it follows from the completeness property of ♦P that eventually j is forever suspected by pi (i.e., j remains forever in its set suspected).
operation need_help(i) is
(1) if (TS[i] = 0) then TS[i] ← weak_ts() end if;
(2) repeat
(3)    let competing be {⟨TS[j], j⟩ such that (TS[j] ≠ 0) ∧ (j ∉ suspected)}
(4) until (⟨TS[i], i⟩ = min(competing)) end repeat
end operation
Fig. 5.16 A contention manager for boosting obstruction-freedom to wait-freedom (code for process pi)
It then follows from the predicate tested by pi at line 3 that there is a finite time after which, whatever the value of the pair ⟨tsj, j⟩ attached to the invocation issued by pj, j will never belong to the set competing repeatedly computed by pi. Hence, these invocations cannot prevent pi from progressing.
• Let us now consider the invocations in I issued by correct processes. Due to the definition of the pair ⟨tsi, i⟩ (minimal among the invocations that never terminate), all these invocations terminate. Moreover, due to the definition of I, any of these processes pj that invokes again an operation obtains a pair ⟨tsj, j⟩ greater than the pair ⟨tsi, i⟩. Consequently, the fact that j belongs or not to the set suspected of pi cannot prevent pi from progressing.
To conclude the proof, as pi is correct, it follows from the eventual accuracy property of ♦P that there is a finite time after which i never belongs to the set suspectedk of any correct process pk.
Hence, there is a finite time after which, at any correct process pj, i ∉ suspectedj and ⟨tsi, i⟩ is the smallest pair. As the number of processes is bounded, it follows that, when this occurs, only pi can progress; as the base implementation is obstruction-free, its invocation consequently terminates, which contradicts the initial assumption.
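As an illustration of the algorithm of Fig. 5.16, a hedged Java sketch of this contention manager follows (the names are ours; weakTs and suspected are assumed to be supplied by the environment, the latter being the current output of ♦P):

import java.util.Set;
import java.util.function.LongSupplier;
import java.util.function.Supplier;
import java.util.concurrent.atomic.AtomicLongArray;

final class DiamondPContentionManager {
    private final AtomicLongArray ts;                       // TS[1..n], 0 = not competing
    private final LongSupplier weakTs;                      // weak_ts()
    private final Supplier<Set<Integer>> suspected;         // local output of the failure detector

    DiamondPContentionManager(int n, LongSupplier weakTs, Supplier<Set<Integer>> suspected) {
        this.ts = new AtomicLongArray(n + 1);
        this.weakTs = weakTs;
        this.suspected = suspected;
    }

    void needHelp(int i) {
        if (ts.get(i) == 0) ts.set(i, weakTs.getAsLong());  // line 1
        while (true) {                                      // lines 2-4: wait until
            long myTs = ts.get(i);                          // <TS[i], i> is minimal
            Set<Integer> susp = suspected.get();
            boolean smallest = true;
            for (int j = 1; j < ts.length(); j++) {
                long tj = ts.get(j);
                if (j != i && tj != 0 && !susp.contains(j)
                        && (tj < myTs || (tj == myTs && j < i)))  // lexicographic order
                    smallest = false;
            }
            if (smallest) return;
        }
    }

    void stopHelp(int i) { ts.set(i, 0); }                  // pi stops competing
}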
On the design principles of contention managers As one can see, this contention
manager and the previous one are based on the same design principle. When a process
asks for help, a priority is given to some process so that it can proceed alone and
benefit from the obstruction-freedom property.
In the case of non-blocking, it is required that any one among the concurrent
processes progresses. This was obtained from Ω_X, and the only additional under-
lying objects which are required are bounded atomic read/write registers. As any
invocation by a correct process has to terminate, the case of wait-freedom is more
demanding. This progress property is obtained from ♦P and unbounded atomic
read/write registers.
5.4 Summary
This chapter has introduced the notion of a mutex-free implementation and the asso-
ciated progress conditions, namely obstruction-freedom, non-blocking, and wait-
freedom.
To illustrate these notions, several mutex-free implementations of concurrent objects have been described: a wait-free splitter, an obstruction-free timestamp generator, a non-blocking queue and a non-blocking stack based on compare&swap registers, and a wait-free stack based on fetch&add and swap registers. Techniques based on failure detectors
have also been described that allow boosting of an obstruction-free implementa-
tion of a concurrent object to a non-blocking or wait-free implementation of that
object.
5.5 Exercises and Problems
1. Prove that the concurrent queue implemented by Michael & Scott’s non-blocking
algorithm presented in Sect. 5.2.4 is an atomic object (i.e., its operations are
atomic).
Solution in [205].
2. The hardware-provided primitives LL(), SC() and VL() are defined in Sect. 6.3.2.
Modify Michael & Scott’s non-blocking algorithm to obtain an algorithm that
uses the operations LL(), SC(), and VL() instead of compare&swap().
3. A one-shot atomic test&set register R allows each process to invoke the operation
R.test&set() once. This operation is such that one of the invoking processes
obtains the value winner while the other invoking processes obtain the value
loser.
Let us consider an atomic swap() operation that can be used by two (statically
determined) processes only. Assuming that there are n processes, this means
that there is a half-matrix of registers MSWAP such that (a) MSWAP[i, j] and
MSWAP[j, i] denote the same atomic register, (b) this register can be accessed
only by pi and pj , and (c) their accesses are invocations of MSWAP[j, i].swap().
Design, in such a context, a wait-free algorithm that implements R.test&set().
Solutions in [13].
As already mentioned, if a process crashes while holding a lock, the processes that
invoke a lock-based operation on the same object can be blocked forever. Hence, locks
cannot cope with process crashes. This means that the implementations described in
this chapter tolerate process crashes in all executions in which no process crashes
while holding a lock.
The internal representation of a set The set S is represented by a linked list pointed
to by a pointer kept in an atomic register HEAD. A cell of the list (say NEW_CELL)
is made up of four atomic registers:
• NEW_CELL.val, which contains a value (an element of the set).
• NEW_CELL.out, a Boolean (initialized to false) that is set to true when the
corresponding element is suppressed from the list.
• NEW_CELL.lock, which is a lock used to ensure mutual exclusion (when needed)
on the registers composing the cell. This lock is accessed with the operations
acquire_lock() and release_lock().
• NEW_CELL.next, which is a pointer to the next cell. The set is organized as a sorted linked list. Initially the list is empty and contains two sentinel cells, as indicated in Fig. 6.1. The values associated with these cells are the default values denoted ⊥ and ⊤. These values cannot belong to the set and are such that, for any value v of the set, we have ⊥ < v < ⊤. All operations are based on list traversal.
The algorithm implementing the operation S.remove(v) This algorithm is described in lines 1–9 of Fig. 6.2. Using the fact that the list is sorted in increasing order, the invoking process pi traverses the list from the beginning until the first cell whose element v′ is such that v′ ≥ v (lines 1–2). Then it locks two cells: the cell containing the element v′ (which is pointed to by its local variable curr) and the immediately preceding cell (which is pointed to by its local variable pred).
The list traversal and the locking of the two consecutive cells are asynchronous, and other processes can concurrently access the list to add or remove elements. It is consequently possible that there are synchronization conflicts that make the content of pred and curr no longer valid. More specifically, the cell pointed to by pred or curr could have been removed, or new cells could have been inserted between the cells pointed to by pred and curr. Hence, before suppressing the cell containing v (if any), pi checks that pred and curr are still valid. The Boolean procedure validate(pred, curr) is used to this end (lines 10–11).
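Expressed in Java, this predicate could look as follows (a minimal sketch; the hypothetical Cell type below mirrors the four-register cell described above, the lock field being omitted):

final class LazyListSketch {
    static final class Cell {
        final int val;                 // value (an element of the set)
        volatile boolean out;          // logical-removal mark
        volatile Cell next;            // pointer to the next cell
        Cell(int val) { this.val = val; }
    }

    // pred and curr are valid if neither has been logically removed and
    // pred still points to curr (lines 10-11):
    static boolean validate(Cell pred, Cell curr) {
        return !pred.out && !curr.out && pred.next == curr;
    }
}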
If the validation predicate is false, pi restarts the removal operation (line 9). This
is the price that has to be paid to have an optimistic removal operation (there is no
global locking of the whole list, which would prevent concurrent processes from
traversing the list). Let us remember that, as by assumption there are few invocations
of the remove() and add() operations, pi will eventually terminate its invocation.
If the validation predicate is satisfied, pi checks whether v belongs to the set or not (Boolean pres, line 5). If v is present, it is suppressed from the set (line 6). This
is done in two steps:
• First the Boolean field out of the cell containing v is set to true. This is a logical
removal (logical because the pointers have not yet been modified to suppress the
cell from the list). This logical removal is denoted S1 in Fig. 6.3.
• Then, the physical removal occurs. The pointer (pred ↓).next is updated to its new value, namely (curr ↓).next. This physical removal is denoted S2 in Fig. 6.3.
The algorithm implementing the operation S.add(v) This algorithm is described in lines 12–23 of Fig. 6.2. It is very close to the algorithm implementing the remove(v) operation. Process pi first traverses the list until it reaches the first cell whose value field is greater than or equal to v (lines 12–13) and then locks the cell that precedes it (line 14). Then, as previously, it checks if the values of its pointers pred and curr are valid (line 14). If they are valid and v is not in the list, pi creates a new cell that contains v and inserts it into the list (lines 17–20).
Finally, pi releases the lock on the cell pointed to by its local pointer variable ptr. It returns a Boolean value if the validation predicate was satisfied, and restarts if it was not.
The algorithm implementing the operation S.contain(v) This algorithm is described in lines 24–26 of Fig. 6.2. As it does not use locks and cannot be delayed by locks used by the add() and remove() operations, it is wait-free. It consists of a simple traversal of the list. Let us remark that, during this traversal, the list does not necessarily remain constant: cells can be added or removed, and so the values of the pointers are not necessarily up to date when they are read by the process pi that invoked S.contain(). Let us consider Fig. 6.5. It is possible that the pointer values predi and curri of the current invocation of contain(v) by pi are as indicated in the figure while all the cells between those containing a1 and b are removed (let us remark that it is also possible that a new cell containing the value v is concurrently added).
The list traversal is the same as for the add() and remove() operations. The value
true is returned if and only if v is currently the value of the cell pointed to by curr
and this cell has not been logically removed. The algorithm relies on the fact that a
cell cannot be recycled as long as it is reachable from a global or local pointer. (In
contrast, cells that are no longer accessible can be recycled.)
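Reusing the hypothetical Cell type sketched above, the wait-free traversal can be rendered in Java as follows (head is the left sentinel, here with val = Integer.MIN_VALUE; the right sentinel carries Integer.MAX_VALUE, so the loop always terminates):

// Inside the LazyListSketch class introduced above.
// Wait-free membership test (lines 24-26): a simple walk down the sorted
// list with no locking, correct even if cells are concurrently added or
// (logically then physically) removed.
static boolean contain(Cell head, int v) {
    Cell curr = head;
    while (curr.val < v) curr = curr.next;   // traverse without locking
    return curr.val == v && !curr.out;       // present and not logically removed
}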
Base properties The previous implementation of a concurrent set has the following
noteworthy features:
• The traversal of the list by an add()/remove() operation is wait-free (a cell locked
by an add()/remove() does not prevent another add()/remove() from progressing
until it locks a cell).
• Locks are used on at most two (consecutive) cells by an add()/remove() operation.
• Invocations of the add()/remove() operations on non-adjacent list entries do not
interfere, thereby favoring concurrency.
Linearization points Let us remember that the linearization point of an operation invocation is a point of the time line such that the operation appears as if it had been executed instantaneously at that time instant. This point must lie between the start time and the end time of the operation.
The algorithm described in Fig. 6.2 provides the operations add(), remove(), and
contain() with the following linearization points. Let an operation be successful
(unsuccessful) if it returns true (false).
• remove() operation:
– The linearization point of a successful remove(v) operation is when it marks
the value v as being removed from the set, i.e., when it executes the statement
(curr ↓).out ← true (line 6).
– The linearization point of an unsuccessful remove(v) operation is when, during its list traversal, it reads the first unmarked cell with a value v′ > v (line 2).
• add(v) operation:
– The linearization point of a successful add(v) operation is when it updates the pointer (pred ↓).next which, from then on, points to the new cell (line 19).
– The linearization point of an unsuccessful add(v) operation is when it reads the
value kept in (curr ↓).val and that value is v (line 16).
• contain(v) operation:
– The linearization point of a successful contain(v) operation is when it checks
whether the value v kept in (curr ↓).val belongs to the set, i.e., (curr ↓).out
is then false (line 26).
– The linearization point of an unsuccessful contain(v) operation is more tricky
to define. This is due to the fact that (as discussed previously with the help of
Fig. 6.5), while contain(v) executes, an execution of add(v) or remove(v) can
proceed concurrently.
Let τ1 be the time at which a cell containing v is found but its field out is marked true (line 26), or a cell containing v′ > v is found (line 25). Let τ2 be the time
just before the linearization point of a new operation add(v) that adds v to the
set (if there is no such add(v), let τ2 = +∞). The linearization point of an
unsuccessful contain(v) operation is min(τ1 , τ2 ).
The proof that this object construction is correct consists in (a) showing that the operation contain() is wait-free and the operations add() and remove() are deadlock-free, and (b) showing that, given any execution, the previous linearization points associated with the operation invocations define a trace that belongs to the sequential specification of the set object.
• Case 2. Both values v and 1 − v are written into AUX (line 1).
Let pi be a process that proposes v and reads ⊥ from AUX, and pj a process that proposes 1 − v and reads ⊥ from AUX. As both pi and pj have read ⊥ from AUX, we conclude that, at line 3, both pi and pj have read true from PROPOSED[1 − v] and PROPOSED[v], respectively (Fig. 6.8). It follows that both of them execute lines 4–8.
Let us now consider a process pk that proposes a value w and reads a non-⊥ value
from AUX. As it reads a non-⊥ value and both PROPOSED[0] and PROPOSED[1]
were equal to true when it read them, it follows that pk necessarily reads true from
PROPOSED[1 − w]. Hence, it executes lines 4–8.
It follows that all processes execute lines 4–8. The first process that acquires the
lock writes the current value of AUX into DECIDED, and that value becomes the
only decided value.
(Let us notice that, due to the arbitrary speed of processes, it is not possible to
predict if it is the first value written in AUX or the second one that will be the
decided value.)
Let us now show that the implementation satisfies the contention sensitiveness
property. We consider each case of “favorable circumstances” separately.
• Case 1: all participating processes propose the same value v.
In this case, PROPOSED[1 − v] remains forever equal to false. It follows that all
the processes that invoke C.propose() write v into the atomic register DECIDED
(line 3). Consequently, none of the participating processes ever executes lines 4–8, which proves the property.
• Case 2: the invocations of C.propose(v) are not concurrent.
Let us consider such an invocation. If it is the first one, it writes v into DECIDED
(line 3) and does not execute the lines 4–8, which proves the property. If other
invocations have been executed before this one, they have all terminated and at
least one of them has written a value into DECIDED (at line 3 or 6). Hence, the con-
sidered invocation C.propose(v) executes line 4 and, as DECIDED ≠ ⊥, it does not execute lines 5–8, which concludes the proof of the contention sensitiveness property.
The double-ended queue A double-ended queue has two heads: one on its left side
and one on its right side. The head on one side is the last element of the queue seen
from the other side. Such an object has four operations:
• The operation right_enq(v) (or left_enq(v)) adds v to the queue such that v
becomes the last value on the right (or left) side of the queue.
• The operation right_deq() (or left_deq()) suppresses the last element at the right
(or left) of the queue. If the queue is empty, the operation returns the value empty.
A double-ended queue is defined by a sequential specification. This specification contains all the correct sequences of operations, involving all four operations or only a subset of them. It follows that, in a concurrency context, a queue has to be an atomic object. A double-ended
queue is a powerful object that generalizes queues and stacks. More precisely, we
have the following (see Fig. 6.9, where the double-ended queue contains the list of
values a, b, c, d, e, f ):
• If either only the operations left_enq() and right_deq() or only the operations
right_enq() and left_deq() are used, the object is a queue.
• If either only the operations left_enq() and left_deq() or only the operations
right_enq() and right_deq() are used, the object is a stack.
Favorable circumstances The implementation that follows considers the following
notion of “favorable circumstances” from the contention sensitiveness point of view:
The operation invocations appear in a concurrency-free context. When this occurs,
such an operation invocation is not allowed to use locks.
Internal representation of a double-ended queue Let D Q be a double-ended
queue. Its internal representation is made up of the following objects:
• An unbounded array Q of atomic registers. Each entry Q[x] contains either a value, the default value ⊥ℓ (on the left side of the queue), or the default value ⊥r (on its right side).
• Two integer indexes LI and RI: LI is the index of the rightmost entry equal to ⊥ℓ, while RI is the index of the leftmost entry equal to ⊥r.
• Two locks, denoted L_LOCK and R_LOCK, used (when needed) to solve conflicts among the left-side operations and among the right-side operations, respectively.
The entries of Q always satisfy the following invariant:
∀ x, y : (x < y) ⇒ (((Q[y] = ⊥ℓ) ⇒ (Q[x] = ⊥ℓ)) ∧ ((Q[x] = ⊥r) ⇒ (Q[y] = ⊥r))).
Hence, at any time, the list of values which have been enqueued and not yet dequeued
is the list kept in the array Q[(LI + 1)..(RI − 1)]. In Fig. 6.9, the current value of the
double-ended queue is represented by the array Q[−2..3].
Atomic operations for accessing a register Q[x] An atomic register Q[x] can
be accessed by three atomic operations, denoted LL() (linked load), SC() (store
conditional) and VL() (validate). These operations are provided by the hardware,
and their effects are described by the algorithms of Fig. 6.10.
Let X be any register Q[x]. The description given in Fig. 6.10 assumes there are
n processes whose indexes are in {1, . . . , n}. It considers that a distinct Boolean array VALID_X[1..n] is associated with each register X. This array is initialized to [false, . . . , false].
An invocation of X.LL() (linked load) returns the current value of X and links this read (issued by a process pi) by setting VALID_X[i] to true (line 1).
An invocation of X.SC(−, v) (store conditional) by a process pi is successful if
no process has written X since pi ’s last invocation of X.LL(). In that case, the write
is executed (line 2) and the value true is returned (line 4). If it is not successful, the
value false is returned (line 5). Moreover, if X.SC(−, v) is successful, all the entries
of the array VALID_X[1..n] are set to false (line 3) to prevent the processes that have previously invoked X.LL() from having a successful X.SC().
Fig. 6.10 Definition of the atomic operations LL(), SC(), and VL() (code for process pi)
An invocation of X.VL() (validate) by a process pi returns true if and only if no other process has issued a successful X.SC() operation since the last X.LL() invocation issued by pi.
It is important to notice that between an invocation of X.LL() and an invocation of X.SC() or X.VL(), a process pi can execute any code at any speed (including invocations of Y.LL(), Y.SC(), and Y.VL() where Y ≠ X).
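The following Java sketch (the names are ours) emulates this specification; it is meant to make the semantics concrete, not to be efficient: real LL/SC is hardware-provided, and the emulation obtains the required atomicity of each operation simply by declaring the methods synchronized:

final class LLSCRegister<V> {
    private V value;                      // current content of X
    private final boolean[] valid;        // VALID_X[1..n]

    LLSCRegister(V initialValue, int n) { value = initialValue; valid = new boolean[n + 1]; }

    synchronized V ll(int i) {            // line 1: read and link
        valid[i] = true;
        return value;
    }

    synchronized boolean sc(int i, V v) {
        if (!valid[i]) return false;      // line 5: link broken, SC fails
        value = v;                        // line 2: the write is executed
        java.util.Arrays.fill(valid, false); // line 3: break all the links
        return true;                      // line 4: SC succeeds
    }

    synchronized boolean vl(int i) {      // no successful SC since pi's last LL
        return valid[i];
    }
}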
LL/SC primitives appear in MIPS architectures. Variants of these atomic operations are proposed in some architectures such as Alpha AXP (under the names ldl_l and stl_c), IBM PowerPC (under the names lwarx and stwcx), or ARM (under the names ldrex and strex).
Fig. 6.11 Implementation of the operations right_enq() and right_deq() of a double-ended queue
since it was read by pi at line 2, the write is successful and pi consequently increases
the right index RI and returns the control value done. This behavior, which entails
the enqueue of v on the right side, is described in Fig. 6.12.
If the previous invocations of LL() and SC() (issued at lines 2, 4, and 5) reveal that the right side of the double-ended queue was modified, pi acquires the lock in order to solve conflicts among the invocations of DQ.right_enq() (line 9). It then executes a loop in which it does the same as before. Lines 10–16 are exactly the same as lines 1–7 except for the statement return() at line 5, which is replaced at line 14 by term ← true to indicate that the value v was added to the right side of the double-ended queue. When this occurs, the process pi releases the lock and returns the control value done.
Let us consider an invocation of right_enq() (by a process p) which is about to
terminate. More precisely, p starts executing the statements in the then part at line
5 (or line 14). If other processes are concurrently executing right_enq(), they will
loop until p has updated the right index RI to RI + 1. This is due to the fact that p
modifies Q[RI] at line 5 (or line 14) before updating RI.
While the right_enq() operation issues SC() invocations first on Q[my_index − 1] and then on Q[my_index] (lines 4–5 or lines 13–14), the right_deq() operation has to issue them in the opposite order, first on Q[my_index] and then on Q[my_index − 1] (lines 24–25 or lines 34–35). This is due to the fact that right_enq() writes (a value v) into Q[my_index] while right_deq() writes ⊥r into Q[my_index − 1].
The algorithms implementing the operations left_enq() and left_deq() These algorithms are similar to the algorithms implementing the right_enq() and right_deq() operations. The only modifications to be made to the previous algorithms are the following: replace RI by LI, replace R_LOCK by L_LOCK, replace each occurrence of ⊥r by ⊥ℓ, and replace the occurrence of ⊥ℓ at line 33 by ⊥r.
A left-side operation and a right-side operation can be concurrent and try to invoke an atomic SC() operation on the same register Q[x]. In such a case, if one is unsuccessful, it is because the other one was successful. More generally, the construction
is non-blocking.
A concurrency-abortable stack is described in Fig. 6.14, where the operations are denoted ab_push() and ab_pop(). The internal
representation of the stack is the same as the one defined in Sect. 5.2.5 with the fol-
lowing simple modification: in each operation, the loops are suppressed and replaced
by a return(⊥) statement. It is easy to see that this modification does not alter the
non-blocking property of the algorithm described in Fig. 5.10.
This section describes a simple contention sensitive algorithm which transforms the
implementation of any non-blocking concurrency-abortable object into a wait-free
implementation of the same object. This algorithm, which is based on a starvation-
free lock, is described in Fig. 6.15. (Let us remember that a simple algorithm which
builds a starvation-free lock from a deadlock-free lock was presented in Sect. 2.2.2.)
Notation The algorithm is presented in Fig. 6.15. Let oper(par) denote any operation of the considered object O and ab_oper(par) denote the corresponding operation on its non-blocking concurrency-abortable version ABO. This means that, when considering the stack object presented in the previous section, push() or pop() denote the non-abortable counterparts of ab_push() or ab_pop(), respectively. It is assumed that any invocation of an object operation oper(par) returns a value which is different from the default value ⊥. As in the previous section, ⊥ can be returned by invocations of ab_oper(par) only to indicate that they failed.
In a concurrency-free context, the Boolean register CONTENTION is equal to false when the operation starts; consequently, oper() invokes ABO.ab_oper() at line 2. As ABO is a concurrency-abortable object and there are
no concurrent operation invocations, the invocation of ab_oper() does not abort. It
follows that this invocation of oper() returns at line 2, which proves the property.
Let us now show that the implementation is starvation-free, i.e., that any invo-
cation of any operation oper() terminates. To this end, given an invocation inv_opp of an operation oper() issued by a process p, we have to show that there is eventually an underlying invocation of ABO.ab_oper() invoked by inv_opp that does not return ⊥.
Let us first observe that, as ABO is a concurrency-abortable object, any invoca-
tion of ABO.ab_oper() terminates (returning ⊥ or another value). If the underlying
invocation ABO.ab_oper() issued at line 2 returns a non-⊥ value, inv_opp does
terminate. If the underlying invocation ABO.ab_oper() returns ⊥ or if the Boolean CONTENTION was equal to true when p executed line 1, p tries to acquire the lock (line 4).
Among the processes that compete for the lock, let q be the process which has obtained and not yet released the lock. It repeatedly invokes some operation ABO.ab_operq() until it obtains a non-⊥ value. It is possible that other processes execute, concurrently with q, some underlying operations ABO.ab_oper1(), ABO.ab_oper2(), etc. This happens if these processes have found CONTENTION = false at line 1 (which means that they have read CONTENTION before it was written by q). Hence, in the worst case, there are (n − 2) other processes executing operations
on ABO concurrently with q (all the processes but p and q). As ABO is non-blocking,
one of them returns a non-⊥ value and the corresponding process terminates its invo-
cation of an operation on the underlying object ABO. If this process is q, we are done. If it is not q and it invokes an operation again, it is directed to acquire the lock because now CONTENTION = true.
Hence, there are now at most (n − 3) processes executing operations on ABO concurrently with q. It follows that, if q has not obtained a non-⊥ value before, it eventually executes ABO.ab_operq() in a concurrency-free context. It then obtains
a non-⊥ value and releases the lock.
As the lock is starvation-free, it follows that p eventually obtains it. Then, replac-
ing q by p in the previous reasoning, it follows that p eventually obtains a non-⊥
value from an invocation of ABO.ab_oper() and accordingly terminates its upper-
layer invocation of the operation oper().
The proof of atomicity follows from the following definition of the linearization
points associated with the invocations of the underlying object ABO. Given an invo-
cation of an operation O.oper(), let us consider its last invocation of ABO.ab_oper()
(that invocation returned a non-⊥ value). The linearization point of oper() is the linearization point of this underlying invocation.
6.5 Summary
6.6 Bibliographic Notes
• Without giving it the name “static hybrid”, the notion of a static hybrid implementation of a concurrent object was implicitly introduced by S. Heller, M. Herlihy, V. Luchangco, M. Moir, W. Scherer, and N. Shavit in [137].
The implementation of the concurrent set object described in Sect. 6.2 is due to the same authors [137]. This implementation was formally proved correct in [78].
• The notion of contention sensitive implementation is due to G. Taubenfeld [263].
The contention sensitive implementations of a binary consensus object and of a
double-ended queue are due to G. Taubenfeld [263]. The second of these imple-
mentations is an adaptation of an implementation of a double-ended queue based
on compare&swap() proposed by M. Herlihy, V. Luchangco, and M. Moir in
[143] (the notion of obstruction-freedom is also introduced in this paper).
• The notion of concurrency-abortable implementation used in this chapter is from
[214] where the methodology to go from a non-blocking abortable implementation
to a starvation-free implementation of an object is presented. This methodology
relies on a general approach introduced by G. Taubenfeld in [263].
• It is important to insist on the fact that the notion of “abortable object” used in this
chapter is different from the one used in [16] (where an operation that returns ⊥
may or may not have been executed).
The two previous chapters were on the implementation of concurrent atomic objects
(such as queues and stacks). More precisely, the aim of Chap. 5 was to introduce
and illustrate the notion of a mutex-free implementation and associated progress
conditions, namely obstruction-freedom, non-blocking and wait-freedom. The aim
of Chap. 6 was to introduce and investigate the notion of a hybrid implementation. In
both cases, the internal representation of the high-level object that was constructed
was based on atomic read/write registers and more sophisticated registers accessed
by stronger hardware-provided operations such as compare&swap(), fetch&add(),
or swap().
This chapter and the two following ones address another dimension when one is
interested in building wait-free implementations of concurrent objects, namely the
case where the only base objects that can be used are atomic read/write registers.
Hence, these chapters investigate the power of base read/write registers to construct
wait-free implementations. This chapter is on the wait-free implementation of weak
counters and store-collect objects, while Chap. 8 addresses snapshot objects, and
Chap. 9 focuses on renaming objects.
As we are concerned with wait-free implementations, let us remember that it is
assumed that any number of processes may crash. Let us also remember that, as far
as terminology is concerned, a process is correct in a run if it does not crash in that
run; otherwise, it is faulty.
This section has two aims: to present a wait-free implementation of a weak counter
object and to show how to cope with an unknown and arbitrarily large number of
processes. To that end, it first presents a very simple implementation of a (non-weak)
counter and then focuses on the wait-free implementation of a weak counter that can
be accessed by infinitely many processes.
A shared counter C is a concurrent object that has an integer value (initially 0) and
provides the processes with two operations denoted increment() and get_count().
Informally, the operation increment() increases the value of the counter by 1, while
the operation get_count() returns its current value. In a more precise way, the behav-
ior of a counter is defined by the three following properties:
• Liveness. Any invocation of increment() or get_count() by a correct process ter-
minates.
• Monotonicity. Let gt1 and gt2 be two invocations of get_count() such that gt1 returns c1, gt2 returns c2, and gt1 terminates before gt2 starts. Then, c1 ≤ c2.
• Freshness. Let gt be an invocation of get_count() and c the value it returns. Let ca
be the number of invocations of increment() that have terminated before gt starts
and cb be the number of invocations of increment() that have started before gt
terminates. Then, ca ≤ c ≤ cb .
The liveness property expresses that the implementation has to be wait-free.
Monotonicity and freshness are the safety properties which give meaning to the
object; namely, they define the domain of the value returned by a get_count() invo-
cation. As we will see in the proof of Theorem 23, the previous behavior can be
defined by a sequential specification.
A simple implementation A concurrent counter can be easily built as soon as
the number of processes n is known and the system provides one atomic SWMR
read/write register per process. More precisely, let REG[1..n] be an array of atomic
registers initialized to 0, such that, for any i, REG[i] can be read by any process but
is written only by pi .
The algorithms implementing the operations increment() and get_count() are trivial (Fig. 7.1). The invocation of increment() by pi consists in adding 1 to REG[i]
(local_ct is a local variable of pi , initialized to 0). The invocation of get_count()
consists in reading (in any order) and summing up the values of all the entries of the
array REG[1..n].
Theorem 23 The algorithms described in Fig. 7.1 are a wait-free implementation
of an atomic counter object.
Proof The fact that the operations are wait-free follows directly from their code.
The proof that the construction provides an atomic counter is based on the atomicity
of the underlying base registers. Let us associate a linearization point with each
invocation as follows:
operation increment() is
   local_ct ← local_ct + 1; REG[i] ← local_ct; return()
end operation.

operation get_count() is
   ct ← 0;
   for j from 1 to n do ct ← ct + REG[j] end for;
   return(ct)
end operation.
Fig. 7.1 A simple wait-free counter for n processes (code for process pi)
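A hedged Java counterpart of Fig. 7.1 (the names are ours): one SWMR register per process, with get_count() summing all the registers in an arbitrary order:

import java.util.concurrent.atomic.AtomicLongArray;

final class SimpleCounter {
    private final AtomicLongArray reg;     // REG[1..n]; reg[i] is written only by pi

    SimpleCounter(int n) { reg = new AtomicLongArray(n + 1); }

    void increment(int i) { reg.set(i, reg.get(i) + 1); }  // single writer: no CAS needed

    long getCount() {
        long ct = 0;
        for (int j = 1; j < reg.length(); j++) ct += reg.get(j);
        return ct;
    }
}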
Infinitely many processes This section focuses on dynamic systems where each
run can have an unknown, arbitrarily large, and possibly infinite number of processes.
The only constraint is that in each finite time interval only finitely many processes
execute operations. Each process pi has an identity i, and it is common knowledge
that no two processes have the same identity.
Differently from the static model where there are n processes p1 , . . . , pn , each
process knowing n and the whole set of identities, now the identities of the processes
that are in the system are not necessarily consecutive, and no process has a priori
knowledge of which other processes can execute operations concurrently with it.
(Intuitively, this means that a process can “enter” or “leave” the system at any time.)
Moreover, no process is provided with an upper bound n on their number, which
could be used by the algorithms (as, for example, in the previous algorithm, where
the operation get_count() scans the whole array REG[1..n]). This model, called the
finite concurrency model, captures existing physical systems where the only source
of “infiniteness” is the passage of time.
It is important to see that the algorithms designed for this computation model have to be inherently wait-free, as they have to guarantee progress even if new processes keep on arriving: the progress of pending operations cannot be indefinitely delayed by these continuous arrivals.
Helping mechanism A basic principle when designing algorithms suited to the
previous dynamic model with finite concurrency consists in using a helping mecha-
nism. More generally, such mechanisms are central when one has to design wait-free
implementations of concurrent objects.
More precisely, ensuring the wait-freedom property despite the fact that infinitely
many processes can be involved in an algorithm requires a process to help other
processes terminate their operations. This strategy prevents slow processes from
never terminating despite the continuous arrival of new processes. This will clearly
appear in the weak counter algorithms described below.
Weak counter: definition A weak counter is a counter whose increment() and
get_count() operations satisfy the liveness and monotonicity properties of a classical
counter (as defined previously), plus the following property (which replaces the
previous freshness property):
• Weak increment. Let gt1 and gt2 be two invocations of the get_count() operation that return c1 and c2, respectively. Let incr be an invocation of increment() that (a) has started after gt1 has terminated (i.e., res[gt1] <H inv[incr], using the notation defined in Chap. 4), and (b) has terminated before gt2 has started (i.e., res[incr] <H inv[gt2]). We have then c1 < c2.
With a classical counter, each invocation of the increment() operation, be it con-
current with other invocations or not, results in adding 1 to the value of the counter (if
the invoking process does not crash before updating the SWMR register it is associ-
ated with). The way the counter increases is different for a weak counter. Let k be the