Ownership of a queue for practical lock-free scheduling

Lincoln Quirk
April 26, 2008

Abstract
When scheduling tasks to processors in a concurrent system, tasks
may not be independent – for example, only one task associated with
a particular object can be run at once. How can a lock-free, constant-
time greedy scheduler decide which tasks to run under this constraint?
We implement such a scheduler using a system of OwnerQueue objects:
single-dequeuer lock-free queues that allow the smooth handoff of dequeuing
privilege from processor to processor. One OwnerQueue is allocated per
object, and tasks on the object are enqueued to it. Only the owner of the
queue may run tasks from it. The OwnerQueue itself
ensures two safety properties: no more than one thread considers itself
the owner at any one time, and at least one thread considers itself the
owner if there are any items on the queue. With these guarantees we
construct a scheduler satisfying the constraints.

1 Introduction
Most programming languages are based around the idea of a process, which
comes into existence, does some work, and exits. However, this model does
not work well for event-driven server systems: the process must instruct the
computer to listen and respond to events, and the code to do this tends to
be difficult to write and maintain if the programmer is using a process model.
Furthermore, processes have a lot of state, and so when multiple threads of
execution exist within a process (as are often created by asynchronous events
in event-driven systems), synchronization of this state between threads is
also difficult to write and maintain. Lastly, requiring the programmer to
specify which processes run on which processors (or threads) requires the
programmer to predict beforehand how much work a particular process will
require.
A new programming paradigm can take steps towards solving these is-
sues. We have designed a programming language based not around processes
but around units, which are an aggregation of data (which must be consistent
with itself) and a set of operations (code which operates on the data). Units
communicate only by sending asynchronous messages to other units. Since
the messages are asynchronous between units, the system is free to schedule
non-conflicting messages in parallel. Due to this scheduling freedom, the
system exhibits natural parallelism at the granularity of units. However, to
preserve consistency within a unit, the system is constrained never to run
two messages on a particular unit at once.

2 Multithreaded computation model


To provide a formal framework in which to specify this system, consider the
model of multithreaded computations given by Blumofe and Leiserson [1]:
computations are split into a series of steps, some of which may be executed
in parallel. Tasks are defined as a series of steps separated by continue
edges; they may spawn other tasks (using spawn edges) or depend – block,
if necessary – on the results of other tasks (using join edges).
Figure 1 shows a typical multithreaded computation. The main com-
putation spawns at step v0 (using a dotted spawn edge) an asynchronous
subcomputation, whose value it accepts in step v3 using a join edge. Figure
2 shows a non-strict computation for contrast.
This framework depends on the constraint that all multithreaded com-
putations are fully strict, where all join edges from a task go to the task’s
parent.
On the other hand, I propose a different constraint: all join edges to
a task go to the first computational step in that task, and the graph is
acyclic. This property is tentatively called forwardesque, because all edges
are forward – they go from an earlier step in the computation to a later one.
By design, in our system, all tasks are forwardesque.
Figure 3 shows a typical forwardesque computation. The main computation
spawns two asynchronous subcomputations (v4-v7 and v8-v11). It then
spawns a receiver subcomputation (v12-v15), which waits for the values of
the two asynchronous subcomputations (indicated by the two join edges)
before it executes.

Figure 1: Fully strict computation. (The main computation v0-v3 spawns an
asynchronous subcomputation v4-v7 via a spawn edge and accepts its result
via a join edge.)

Figure 2: A non-strict multithreaded computation. Note the join edge from
a sibling.

Figure 3: Forwardesque computation. (The main computation v0-v3 spawns
asynchronous subcomputations v4-v7 and v8-v11, whose join edges enter the
receiver subcomputation v12-v15.)

Forwardesque is a desirable property in our system, because it means that
once a task is runnable (once a message arrives at a unit), its execution may
be completed fully. Tasks do not ever block waiting on other tasks. Similarly,
deadlock cannot occur because the graph is acyclic. (I go into more depth
later.)
The user in our system creates forwardesque computations using Futures.
Our notion of Futures is similar to Halstead’s in Multilisp[2]: an operation is
started asynchronously, and returns a Future immediately, which is a promise
for a later value when that asynchronous computation terminates.
The user starts asynchronous computations using the send primitive.
Send accepts a receiver and some number of arguments, returns a Future,
and causes the receiver to be invoked asynchronously with the arguments.
When the receiver returns a value, the Future is filled (resolved) with the
value.

4
In Multilisp, the user may “force” a future during execution. In our
system, on the other hand, there is no way to force a Future – no capability
for blocking on Futures during execution of an operation. In order to ensure
that a Future has a value, the user must perform a send, passing the Future
whose value he wishes to use as an argument to another asynchronous
computation. That computation will not begin until the Future passed as an argument
is resolved. This restriction ensures that all computations in the system are
forwardesque.
Proof. Join edges are created by attempting to wait for the value of a Future.
In our system, the only way a task can wait for a value of a future is by start-
ing an asynchronous computation which accepts this value as an argument.
In this case, the asynchronous computation’s first step must wait on all of its
arguments to be filled (and thus have join edges for each argument), because
any step in the computation could use the value of one of the arguments.
Once a computation has started, it may not wait on any arguments, so there
are no join edges on any other steps in a computation. Thus the computation
is forwardesque.
Computations in the system are also acyclic and therefore deadlock-free,
assuming that a computation can never start with a reference to an unre-
solved Future (a join edge).
Proof. Trivially there are no cycles among continue or spawn edges alone.
Suppose a situation existed where two processes had some join edge forming
a cycle. How was this join edge created? Since join edges may only be
created to a process by sending a Future to the (newly spawned) process, the
newly spawned process must have had outgoing edges to previously existing
processes. But creating such an edge would require a previously existing
process to have held, before the new process was created, a reference to a
step within the newly spawned process. This contradicts the assumption
that a computation never
starts with an unresolved Future.
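To make the send primitive and Futures concrete, the following is a minimal
single-threaded C++ sketch of their semantics; Future, send, run_all, and the
toy scheduler loop are illustrative assumptions rather than the actual runtime.

// Toy, single-threaded sketch of the send/Future semantics described above.
// Future, send, and run_all are illustrative names, not the real runtime's API.
#include <cstdio>
#include <functional>
#include <memory>
#include <optional>
#include <vector>

struct Future {
    std::optional<int> value;                 // filled when the computation returns
    bool resolved() const { return value.has_value(); }
};
using FutureRef = std::shared_ptr<Future>;

struct Task {
    std::vector<FutureRef> args;              // may not start until all are resolved
    std::function<int(const std::vector<int>&)> body;
    FutureRef result;
};

std::vector<Task> pending;                    // stand-in for the scheduler's runqueues

// send: start an asynchronous computation and immediately return a Future that
// will be filled with its result.  Join edges only enter the first step of the
// new task, which is what keeps the computation forwardesque.
FutureRef send(std::function<int(const std::vector<int>&)> body,
               std::vector<FutureRef> args = {}) {
    Task t{std::move(args), std::move(body), std::make_shared<Future>()};
    FutureRef r = t.result;
    pending.push_back(std::move(t));
    return r;
}

// Toy scheduler: repeatedly run any task whose Future arguments are all resolved.
// The real system does this concurrently and lock-free.
void run_all() {
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < pending.size(); ++i) {
            if (pending[i].result->resolved()) continue;           // already ran
            bool ready = true;
            for (const FutureRef& a : pending[i].args) ready = ready && a->resolved();
            if (!ready) continue;
            std::vector<int> vals;
            for (const FutureRef& a : pending[i].args) vals.push_back(*a->value);
            auto body = pending[i].body;          // copy: the body may itself call send
            FutureRef result = pending[i].result;
            result->value = body(vals);           // fill (resolve) the Future
            progress = true;
        }
    }
}

int main() {
    // Two independent asynchronous computations...
    FutureRef a = send([](const std::vector<int>&) { return 1; });
    FutureRef b = send([](const std::vector<int>&) { return 2; });
    // ...and a receiver that cannot begin until both argument futures are resolved.
    FutureRef sum = send([](const std::vector<int>& v) { return v[0] + v[1]; }, {a, b});
    run_all();
    std::printf("%d\n", *sum->value);             // prints 3
}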

3 Units
Our runtime system includes the notion of units. Units may have some shared
state which must stay consistent no matter how many messages are ready to
run on it. The system guarantees to the user that all operations on a unit
are linearizable [3]. In future work, we will examine automatic linearization
of operations using software transactional memory [4]. For now, however, we
will require the scheduler to execute at most one operation on a particular
unit at one time. This property is called unit exclusion.
Not all operations are part of a unit: if the operation doesn’t read or
write any shared state, then it is known as a free operation.
It is worthwhile to note that the proof of deadlock freedom above stip-
ulates that computations may not start with a reference to an unresolved
Future. Therefore, provisions must be made against the possibility of an op-
eration inside a unit storing a Future inside the unit that another operation
might access (or else it would be possible to construct a deadlock).
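As an illustration (not the language's actual syntax), the distinction between
unit operations and free operations can be sketched as follows; CounterUnit
and square are hypothetical names.

// Hypothetical sketch of the unit/free-operation distinction; CounterUnit and
// square are illustrative, not taken from the system described in this paper.
#include <cstdio>

struct CounterUnit {
    long count = 0;                    // shared state belonging to the unit

    // Unit operations: they read or write the shared state, so under unit
    // exclusion the scheduler must never run two of them on the same
    // CounterUnit at the same time.
    void increment() { ++count; }
    long get() const { return count; }
};

// A free operation: it touches no shared state, so any number of these may
// run in parallel without synchronization.
long square(long x) { return x * x; }

int main() {
    CounterUnit c;
    c.increment();
    c.increment();
    std::printf("%ld\n", square(c.get()));   // prints 4
}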

4 Scheduling specification
Since there could be more tasks runnable at one time than processors avail-
able, the runtime system needs a scheduler to select which steps of which
runnable tasks will actually be run. Furthermore, it is possible that there
are more runnable tasks than processors and yet not all processors can be
busy, because of the unit exclusion requirement.
Another scheduler requirement is that it be efficient. It should scale as
well as possible to highly concurrent systems with many processors. I believe
that lock-free algorithms scale well because they minimize critical sections,
so I implement a lock-free multithreaded scheduler.
One approach to lock-free scheduling is to use a global runqueue from
which each processor dequeues. This is subject to high contention due to
cache misses and interconnect bandwidth on the runqueue, however, so Blumofe
and Leiserson [1] describe a model where each processor has its own private
runqueue. When only free operations are ready to run, the scheduler in this
paper is very similar to a wait-free implementation of the work-stealing deques
scheduler of Blumofe and Leiserson.
However, when there are ready unit operations, we must be careful to run only
one operation per unit. In order to guarantee unit exclusion, we place unit-
specific messages on a unit runqueue, and then place the unit on a processor’s
runqueue.
Here it is important to consider the scheduler in the presence of work-
stealing. When a processor is not busy and other processors are, it is usually
beneficial for that processor to try to find some work to do. This is accom-
plished through workstealing. When it would otherwise be idle, the processor
scans other processor runqueues and attempts to steal a unit by dequeuing
from the first runqueue it comes across with some units on it.
But in the presence of workstealing, a unit that appeared twice on a
processor's runqueue might have one of its entries stolen. Then it appears
on two processors' runqueues at once, and unit exclusion could be broken. Thus
a unit must appear at most once on any processor's runqueue. This is done with
an OwnerQueue, described below.

5 OwnerQueue specification
The OwnerQueue structure is an extension of a simple lock-free queue to notify
the user when the queue is empty. It defines two methods: enq_and_was_empty
and deq_and_is_empty. Its internal state includes a backing lock-free queue
and a counter.
The sequential specification for enq_and_was_empty is the following: Enqueue
the given item into the backing queue, increment the counter, and if
the counter was zero before the increment, return True. Otherwise, return
False.
The sequential specification for deq_and_is_empty is the following: Dequeue
an item i from the backing queue. Decrement the counter, and if the
counter is now zero, return (i, True). Otherwise, return (i, False).
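A sequential, single-threaded reference for this specification might look like
the following sketch, with a std::deque standing in for the backing lock-free
queue; the class and member names are assumptions, and the lock-free
implementation is discussed in Section 6.

// Sequential reference for the OwnerQueue specification above (a sketch only).
#include <deque>
#include <utility>

template <typename T>
class SequentialOwnerQueue {
    std::deque<T> backing;     // stand-in for the backing lock-free queue
    long counter = 0;

public:
    // Enqueue the item, increment the counter, and report whether the counter
    // was zero before the increment (i.e. whether the caller became the owner).
    bool enq_and_was_empty(T item) {
        backing.push_back(std::move(item));
        return counter++ == 0;
    }

    // Dequeue an item, decrement the counter, and report whether the counter
    // is now zero (i.e. whether the caller's ownership ends here).
    std::pair<T, bool> deq_and_is_empty() {
        T item = std::move(backing.front());
        backing.pop_front();
        return {std::move(item), --counter == 0};
    }
};

int main() {
    SequentialOwnerQueue<int> oq;
    bool owner = oq.enq_and_was_empty(7);      // true: caller becomes the owner
    oq.enq_and_was_empty(8);                   // false: the queue was not empty
    auto [a, e1] = oq.deq_and_is_empty();      // (7, false): still the owner
    auto [b, e2] = oq.deq_and_is_empty();      // (8, true): ownership revoked
    return (owner && !e1 && e2 && a == 7 && b == 8) ? 0 : 1;
}
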
The correctness condition of linearizability says that to be linearizable,
any concurrent history of the object must be equivalent to a legal sequential
history [3]. For efficiency reasons, the implementation provided in this paper
is not linearizable. However, it can still be useful.
The OwnerQueue provides the user with certain guarantees, provided its
contract is met. Critical to the contract is the understanding of ownership
of an OwnerQueue. When a thread receives True from enq_and_was_empty,
that thread must consider itself the owner of the OwnerQueue. That thread
now has exclusive privilege to call deq_and_is_empty on the OwnerQueue,
until deq_and_is_empty returns True, at which time the thread's privilege is
revoked and it must no longer consider itself the owner.
The OwnerQueue makes the guarantee that if the threads obey the Own-
erQueue contract, then no two threads consider themselves the owner at the
same time (Theorem 1).

6 Implementation
The OwnerQueue’s enqueue and dequeue methods are implemented as a sim-
ple non-synchronized two-step operation: perform the operation on the under-
lying queue, then atomically fetch_and_increment (for enqueue) or
decrement_and_fetch (for dequeue) the counter, assigning the return value to
c, then return c ≤ 0.
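A sketch of these two-step operations follows. The counter handoff is the
point of interest; BackingQueue is only a stand-in (a mutex-guarded deque) for
the lock-free queue the paper assumes (e.g. [5]), and the class and member
names are assumptions.

// Sketch of the two-step OwnerQueue operations described above.
#include <atomic>
#include <deque>
#include <mutex>
#include <utility>

template <typename T>
class BackingQueue {                   // placeholder: NOT actually lock-free
    std::mutex m;
    std::deque<T> q;
public:
    void enq(T item) {
        std::lock_guard<std::mutex> g(m);
        q.push_back(std::move(item));
    }
    T deq() {                          // called only by the owner, on a non-empty queue
        std::lock_guard<std::mutex> g(m);
        T item = std::move(q.front());
        q.pop_front();
        return item;
    }
};

template <typename T>
class OwnerQueue {
    BackingQueue<T> backing;
    std::atomic<long> counter{0};
public:
    bool enq_and_was_empty(T item) {
        backing.enq(std::move(item));
        long c = counter.fetch_add(1);       // fetch_and_increment: old value
        return c <= 0;                       // old value <= 0: caller becomes the owner
    }
    std::pair<T, bool> deq_and_is_empty() {  // only the current owner may call this
        T item = backing.deq();
        long c = counter.fetch_sub(1) - 1;   // decrement_and_fetch: new value
        return {std::move(item), c <= 0};    // new value <= 0: ownership is revoked
    }
};
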
As an example of how this works, consider an empty queue with two
threads t1 and t2 concurrently attempting to enqueue. Each performs the
enqueue operation on the underlying queue, then attempts to increment the
counter. Both increments will succeed, but only one gets the value 0 and
thus returns True.
As noted above, this implementation is not linearizable. As a counterexample,
in the above situation, t1's underlying-enq operation might succeed
before that of t2, but the subsequent fetch_and_increment operations might
execute in the reverse order. t2 would be surprised to find that although the
queue appeared “empty” according to the return value of enq_and_was_empty,
the first element subsequently dequeued was not the item that t2 enqueued
(it was instead that of t1).
This is not usually a problem due to the OwnerQueue contract. The
contract’s mention of “exclusive privilege to call deq and is empty” also
comes with a responsibility: the requirement that each thread that becomes
the owner shall eventually exhaust the queue by dequeuing from it until
deq and is empty returns false. This leads to the requirement that a thread
must be prepared to deal with any number of other people’s items on the
queue, in arbitrary order, if it becomes the owner.
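In terms of the OwnerQueue sketch above, the owner's obligation reads roughly
as follows; enqueue_and_maybe_drain and handle are hypothetical names.

// Sketch of the contract obligation, using the OwnerQueue sketch from above.
// A thread that becomes the owner must keep dequeuing, handling whatever
// items are there (possibly other threads' items, in arbitrary order), until
// deq_and_is_empty reports the queue empty and ownership ends.
template <typename T, typename Handler>
void enqueue_and_maybe_drain(OwnerQueue<T>& oq, T item, Handler handle) {
    if (!oq.enq_and_was_empty(std::move(item)))
        return;                            // another thread owns the queue; it will drain
    for (;;) {                             // we are now the owner
        auto [next, now_empty] = oq.deq_and_is_empty();
        handle(next);
        if (now_empty)
            return;                        // ownership revoked: stop dequeuing
    }
}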

7 Correctness
The correctness and linearizability of the underlying lock-free queue is as-
sumed (e.g., as in [5]). Linearizability of shared counters using fetch_and_add
and similar operations is also assumed.

OwnerQueue contract. For all histories, for any thread A and Own-
erQueue oq, while A is the owner (as in the following pattern in the history),
no thread B calls Deq on oq.

oq Enq(_) A
oq EnqOk(True) A          A becomes the owner.
... other entries, not including Deq() B ...
oq Deq() A
oq DeqOk(_, True) A       A is no longer the owner.
Theorem 1. If the OwnerQueue contract is satisfied, then for all legal his-
tories, for any threads A and B and OwnerQueue oq, whenever oq En-
qOk(True) A appears in the history, oq DeqOk(_, True) A appears before
oq EnqOk(True) B.
Proof. When no thread is the owner, the linearization point of gaining own-
ership is the increment of the counter from zero to one. When a thread is
the owner, the counter either stays above zero and the current thread stays
the owner, or the owner decrements the counter to zero, and the next thread
to increment it from zero becomes the new owner.
A thread which starts but doesn’t finish an enq operation could result in
an OQ with one more item than is represented by the state of the counter.
If this is the case, then when other threads perform enqueues and dequeues,
they will hand off ownership with respect to the counter – i.e., the queue will
always have one more item in it than the other threads realize.
Corollary. If the OwnerQueue contract is satisfied, there are no overlapping
calls to deq_and_is_empty.
Proof. Assume Theorem 1. According to the OwnerQueue contract, the
owner has exclusive privilege to call deq_and_is_empty. Therefore, since no
more than one thread is the owner at one time, no more than one thread will
call deq_and_is_empty at one time.
Theorem 2. If the OwnerQueue contract is satisfied, both enq_and_was_empty
and deq_and_is_empty are total.
Proof. The underlying queue is unbounded, and the shared counter is infinite,
so enq_and_was_empty is total. (In theory; in practice, most implementations
will use a fixed-size integer, which can be very large.)
A thread can only become the owner after enqueuing. If the OwnerQueue
contract holds, no thread will ever dequeue unless it is the owner.
A thread which starts but doesn't finish a deq operation could result in
an OQ with one fewer item than is represented by the counter. But since
only one thread can be calling dequeue at once, this disagreement can never
result in a failure to dequeue. Therefore, deq_and_is_empty is total.

The OwnerQueue is used in a fairly simple manner: a unit’s runqueue
is an OwnerQueue. If there are no items on the runqueue, there may or
may not be an owner. But as soon as a processor enqueues onto an empty
runqueue, that processor becomes the owner. That processor is constrained
to ensure that messages on that queue will be executed as long as there are
messages ready. In order for another processor to execute ready messages on
that queue, it must obtain the first processor’s consent to receive ownership
of the queue (which, in our system, is obtained by taking it off the victim’s
runqueue atomically).
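A hedged sketch of this wiring, reusing the OwnerQueue sketch from Section 6,
is shown below; Unit, Message, Processor, deliver, and try_steal are assumed
names, and the plain deque stands in for a real work-stealing deque.

// Sketch of how a unit's OwnerQueue runqueue might be wired into the scheduler.
#include <deque>

struct Message { int id; };                  // placeholder for an operation + arguments

struct Unit {
    OwnerQueue<Message> runqueue;            // per-unit message queue (sketch in Section 6)
};

struct Processor {
    std::deque<Unit*> owned_units;           // units this processor currently owns
                                             // (stand-in for a work-stealing deque)
};

// Deliver a message to a unit.  If the sender becomes the owner of the unit's
// runqueue (it enqueued onto an empty queue), it takes responsibility for the
// unit by pushing it onto its own runqueue; otherwise the current owner will
// eventually run the message.
void deliver(Processor& self, Unit& u, Message m) {
    if (u.runqueue.enq_and_was_empty(m))
        self.owned_units.push_back(&u);
}

// Workstealing: an idle processor takes a whole unit off a victim's runqueue,
// and ownership of that unit's OwnerQueue travels with it.  Because a unit
// appears at most once on any processor's runqueue, unit exclusion is preserved.
Unit* try_steal(Processor& victim) {
    if (victim.owned_units.empty())
        return nullptr;
    Unit* u = victim.owned_units.front();
    victim.owned_units.pop_front();
    return u;
}

A processor that takes a unit from a runqueue (its own or a victim's) then
drains the unit's OwnerQueue as in the drain sketch above, which is what
ensures that ready messages on that unit are eventually executed.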

8 Efficiency
The system was tested on two benchmark programs, called fib and ddb.
fib is an implementation of tree-recursive Fibonacci: calculation of a par-
ticular value k in the Fibonacci sequence by creating two recursive free fib
operations to calculate k − 1 and k − 2, and another free operation (plus)
that waits until the futures from the previous operations are filled with val-
ues, then sums the values. This is inefficient for calculating the Fibonacci
sequence (a dynamic programming algorithm reduces the exponential com-
plexity of this algorithm to a linear one). However, the experiment is a good
test that messages and futures are handled properly.
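In terms of the toy send/Future sketch at the end of Section 2, the structure
of this benchmark might be expressed roughly as follows (a simplification: the
recursion unfolds eagerly, and only the leaf values and the plus operations run
as asynchronous tasks).

// Sketch of fib using the toy send/Future interface from Section 2's sketch
// (send, FutureRef, run_all).  Each plus is a free operation that runs once
// both of its argument futures are filled.
FutureRef fib(int k) {
    if (k < 2)
        return send([k](const std::vector<int>&) { return k; });       // leaf value
    FutureRef a = fib(k - 1);
    FutureRef b = fib(k - 2);
    return send([](const std::vector<int>& v) { return v[0] + v[1]; },  // plus
                {a, b});
}

// Usage: FutureRef f = fib(10); run_all();  *f->value is then 55.
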
A modification of the experiment applies some arbitrary amount of extra
work (for example, counting to ten thousand) at each fib and plus operation,
in order to have a better idea of synchronization overhead (the extra work
requires no synchronization).
ddb is a simple distributed database. Four database servers are imple-
mented as units, and there are twenty-eight clients. The database servers
map integer keys to values, and each server holds a unique set of keys; the
server on which a key resides is determined by the hash of the key. The clients
put heavy load on the servers, requesting and updating keys, and taking ac-
tion based on the value of particular keys. Each server unit exposes three
operations: put, get, and increment. This experiment is designed to test the
unit-related aspects of the system: performance of units, unit exclusion, and
workstealing.
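The key-placement rule can be sketched in a few lines; the names below are
illustrative only, not the benchmark's actual code.

// Sketch of ddb's key placement: a key always resides on the server unit
// selected by hashing the key.
#include <cstdio>
#include <functional>

constexpr int kNumServers = 4;           // the experiment uses four server units

int server_for_key(int key) {
    return static_cast<int>(std::hash<int>{}(key) % kNumServers);
}

int main() {
    for (int key : {7, 42, 1000})
        std::printf("key %d -> server %d\n", key, server_for_key(key));
}
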
The load on the database is unbalanced; in particular, one key is designed
to have heavy load. Because of the characteristics of the database, this key
is always assigned to one server, and because of unit exclusion, that server
will be executed by at most one processor. In fact, if that server has enough
requests to keep it constantly busy, then that server will only execute requests
on that one unit. Therefore, units which process many messages should tend
to have affinity to whichever processor they end up on. Since there are 32
units and 32 processors, the system should “settle” in the 32-processor case,
after a small amount of workstealing, at one unit per processor.

Figure 4: CPU utilization for the fib experiment: CPU time divided by
actual time. (Speedup vs. number of threads for 10, 100, 1000, and 10000
units of work per node, with a linear reference.)

8.1 Results
The Fibonacci experiment demonstrated that the system scales fairly well to
multiprocessor machines. The experiment is almost embarrassingly parallel,
so it should exhibit linear speedup as processors are added, except for syn-
chronization overhead. This appears to be the case, as Fig. 4 demonstrates;
when the amount of work to be done at each node is large, the system uti-
lizes all the processors fairly effectively. Some synchronization overhead is
observed, as the graph tails off at higher concurrencies. Another measurement
of synchronization overhead is the idea of waste: the amount of extra
CPU time that adding a processor costs. Figs. 5 and 6 demonstrate the
amount of waste exhibited by the system. (I don't yet know what to make
of these results, as it seems waste is a function of work, yet I don't believe it
should be.)

Figure 5: CPU time lost during fib due to synchronization. The lost time
is calculated by subtracting the CPU time of the one-thread execution from
the total CPU time. (Waste in seconds vs. number of threads, for several
amounts of per-node work.)

Figure 6: CPU waste ratio of fib: total CPU time divided by CPU time of
the one-thread execution. (Waste ratio vs. number of threads.)

Figure 7: Number of messages processed in ddb by each processor.
(Thousands of messages processed, per thread and per unit.)

The distributed database experiment demonstrated that the system is
able to distribute work to processors, and that the workstealing algorithm
we chose has the effect of placing each unit onto a separate processor. Fig. 7
demonstrates that the workstealing was able to do this, since the distribution
of work assigned to processors matches up very closely with the amount of
work assigned to units by the experiment. Fig. 8 shows how much work-
stealing was taking place across the experiment; Fig. 9 shows which units
were stolen during the 32-processor experiment. The results are as expected:
the units with higher load were stolen less often, because the processors they
were assigned to were busy working on them, and workstealing never steals
a unit that another processor already owns.
CPU utilization in ddb was lower than in fib because of unit exclusion (see
Fig. 10). Since most of the messages were concentrated onto a few units, it
was often the case that some processors ran out of work.

Figure 8: Number of units stolen during the ddb experiment. (Units stolen
during execution, per thread.)

Figure 9: Units which were stolen during the ddb experiment on 32 processors.
(Steal count and message load, per unit.)

Figure 10: CPU utilization for the ddb experiment. (Utilization vs. number
of threads, with a linear reference.)

9 Conclusions
A lock-free multithreaded scheduler that enforces unit exclusion is feasi-
ble to implement. Performance in my implementation is not great, but is
acceptable, and there are some notable ways to improve it (for example, us-
ing thread runqueues for free ops rather than scheduling them all off a main
runqueue).
The OwnerQueue data structure seems very useful and more generally ap-
plicable as a design pattern for lock-free implementations. The OwnerQueue
was found useful for another part of the scheduling system, not just unit
exclusion: When a future is filled, it is important to wake up a list of mes-
sages that depend on that future. However, messages must not be woken
up before the future is filled, and “lost wakeups” are not permissible, but
messages could be added to the list asynchronously, even while the future is
being filled.
Therefore, the list was implemented using an OwnerQueue. A thread
which fills the future also takes ownership of the queue and wakes up all
dependent messages. A thread which adds a message to the queue takes
ownership only if it adds a message to an empty queue. Otherwise, the
message will be woken by the real owner of the queue.
Thus, when it is desirable for a particular task to be handled by exactly
one thread at a time, but you don’t want to designate a separate thread to
handle it, an “ownership” model is a useful abstraction. An OwnerQueue is
a simple implementation of an ownership-based queue of tasks.

References
[1] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded
computations by work stealing. J. ACM, 46(5):720–748, 1999.

[2] Robert H. Halstead, Jr. Multilisp: a language for concurrent symbolic
computation. ACM Trans. Program. Lang. Syst., 7(4):501–538, 1985.

[3] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: a correct-
ness condition for concurrent objects. ACM Trans. Program. Lang. Syst.,
12(3):463–492, 1990.

[4] Nir Shavit and Dan Touitou. Software transactional memory. In Sympo-
sium on Principles of Distributed Computing, pages 204–213, 1995.

[5] J. D. Valois. Implementing lock-free queues. In Proceedings of the Seventh
International Conference on Parallel and Distributed Computing Systems,
pages 64–69, Las Vegas, NV, 1994.
