Unit 6 Notes
Our emphasis here will be parallel algorithms, that is, multithreading a single
algorithm so that some of its instructions may be executed simultaneously. Parallelism
can also be applied to scheduling and managing multiple algorithms, each running
concurrently in their own thread and possibly sharing resources, as studied in courses
on operating systems and concurrent and high performance computing.
Concurrency Constructs:
parallel: add to loop construct such as for to indicate each iteration can be
executed in parallel.
spawn: create a parallel subprocess, then keep executing the current process
(parallel procedure call).
sync: wait here until all active parallel threads created by this instance of the
program finish.
These keywords specify opportunities for parallelism without affecting whether (or
not) the corresponding sequential program obtained by removing them is correct. In
other words, if we ignore the parallel keywords the program can be analyzed as a
single threaded program. We exploit this in analysis.
Logical Parallelism
The parallel and spawn keywords do not force parallelism: they just say that it is
permissible. This is logical parallelism. A scheduler will make the decision
concerning allocation to processors. We return to the question of scheduling at the end
of this document, after the appropriate concepts have been introduced.
For illustration, we take a really slow algorithm and make it parallel. (There are much
better ways to compute Fibonacci numbers; this is just for illustration.) Here is the
definition of Fibonacci numbers:
F(0) = 0.
F(1) = 1.
F(i) = F(i−1) + F(i−2), for i ≥ 2.
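The serial pseudocode referred to just below is not reproduced in these notes. As a
stand-in, here is a minimal Python sketch (the name fib is mine) that transcribes the
definition directly; it is deliberately slow, taking time exponential in n:

    def fib(n):
        # direct transcription of the recursive definition above
        if n <= 1:
            return n
        return fib(n - 1) + fib(n - 2)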
Noticing that the recursive calls operate independently of each other, let's see what
improvement we can get by computing the two recursive calls in parallel. This will
illustrate the concurrency keywords and also be an example for analysis:
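The P-Fib pseudocode itself is given as a figure that is not reproduced here. As a rough
sketch, the CLRS-style procedure can be imitated in Python as below, with spawn and
sync modeled by a thread; this shows only the logical structure (Python threads will not
actually speed up this CPU-bound computation):

    import threading

    def p_fib(n):
        if n <= 1:                 # base case (pseudocode lines 1-2)
            return n
        result = {}
        # spawn P-Fib(n-1): the child may run in parallel with the parent
        t = threading.Thread(target=lambda: result.update(x=p_fib(n - 1)))
        t.start()
        y = p_fib(n - 2)           # the parent keeps executing, computing P-Fib(n-2)
        t.join()                   # sync: wait for the spawned call to finish
        return result['x'] + y     # combine the results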
Notice that without the parallel keywords it is the same as the serial program above.
A multithreaded computation can be modeled as a directed acyclic graph G = (V, E): the
vertices are strands (maximal sequences of instructions containing no parallel control),
and the edges indicate dependencies between strands. (An example appears in the figure
discussed below.)
If G has a directed path from u to v they are logically in series; otherwise they
are logically parallel.
A strand with multiple successors means all but one of them must have been
spawned. A strand with multiple predecessors means they join at a sync
statement.
The model can be visualized as exemplified below for the computation DAG for P-
Fib(4):
Vertices (strands) are visualized as circles in the figure.
The rounded rectangles are not part of the formal model, but they help organize
the visualization by collecting together all strands for a given call.
The colors are specific to this example and indicate the corresponding code:
black indicates that the strand is for lines 1-3; grey for line 4; and white for
lines 5-6.
Continuation Edges (u, v) are drawn horizontally and indicate that v is the
successor to u in the sequential procedure.
Call Edges (u, v) point downwards, indicating that u called v as a normal
subprocedure call. In this example they come out of the grey circles.
Spawn Edges (u, v) also point downwards, indicating that u spawned v in
parallel. In this example they come out of the black circles.
Return edges point upwards to indicate the strand executed next after returning
from a normal procedure call, or point to the strand containing the sync when a
spawned procedure returns. In this example they return to the white circles.
Work
T1 = the total time to execute an algorithm on one processor. This is called work in
analogy to work in physics: the total amount of computational work that gets done.
An ideal parallel computer with P processors can do at most P units of work in one
time step. So, in TP time it can do at most P⋅TP work. Since the total work is T1,
P⋅TP ≥ T1, or dividing by P we get the work law:
TP ≥ T1 / P
The work law can be read as saying that the time on P processors can be no better
than the time on one processor divided by P; equivalently, the speedup T1 / TP is at
most P.
Parallelism will not change the asymptotic class of an algorithm: it's not a substitute
for careful design of asymptotically fast algorithms.
Span
T∞ = the time to execute the algorithm on an unlimited number of processors. This
equals the time required for the longest chain of strands that must be executed in
sequence (the critical path in the computation DAG), and is called the span. The span
in our P-Fib example is represented by the shaded edges in the figure.
The span law states that a P-processor ideal parallel computer cannot run faster than
one with an infinite number of processors:
TP ≥ T∞
This is because at some point the span will limit the speedup possible: No matter how
many processors you have, you still must do these strands in sequence, taking the time
they require.
Speedup
The ratio T1 / TP defines how much speedup you get with P processors as compared
to one.
TP ≥ T1 / P, so T1 / TP ≤ P:
one cannot have any more speedup than the number of processors.
When the speedup T1 / TP = Θ(P) we have linear speedup: the speedup is linear in the
number of processors.
Parallelism
The ratio T1 / T∞ of the work to the span gives the (potential) parallelism of the
computation. It can be interpreted in three ways:
Ratio : The average amount of work that can be performed for each step of
parallel execution time.
Upper Bound : the maximum possible speedup that can be achieved on any
number of processors.
Limit: The limit on the possibility of attaining perfect linear speedup. Once the
number of processors exceeds the parallelism, the computation cannot possibly
achieve perfect linear speedup. The more processors we use beyond
parallelism, the less perfect the speedup.
Slackness
The (parallel) slackness is the ratio (T1 / T∞) / P: the factor by which the parallelism
of the computation exceeds the number of processors in the machine. We have three
cases:
If slackness is less than 1 then perfect linear speedup is not possible: you have
more processors than you can make use of.
If slackness is greater than 1, then the work per processor is the limiting
constraint and a scheduler can strive for linear speedup by distributing the work
across more processors.
If slackness is 1, (T1 / T∞) / P = 1 so T1 / T∞ = P: we get perfect linear speedup
with P processors.
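As a small worked example (the numbers are made up, not from the notes): suppose an
algorithm has work T1 = 1000 and span T∞ = 10, so its parallelism is 100. The snippet
below computes the slackness and the best speedup permitted by the work and span laws
for three machine sizes, illustrating the three cases above:

    T1, Tinf = 1000, 10                    # hypothetical work and span
    parallelism = T1 / Tinf                # = 100
    for P in (10, 100, 1000):
        slackness = parallelism / P        # 10.0, 1.0, 0.1
        best_TP = max(T1 / P, Tinf)        # lower bound on TP from the work and span laws
        print(P, slackness, T1 / best_TP)  # best possible speedup: 10, 100, 100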
Analyzing span requires a different approach. (I hope you did the exercises above:
they will make you appreciate the following all the more.)
Analyzing Span
If a set of subcomputations (or the vertices representing them) are in series, the span is
the sum of the spans of the subcomputations. This is like normal sequential analysis
(as was just exemplified above with the sum T(n − 1) + T(n − 2)).
If a set of subcomputations (or the vertices representing them) are in parallel, the span
is the maximum of the spans of the computations. This is where analysis of
multithreaded algorithms differs.
Returning to our example, the span of the parallel recursive calls of P-Fib(n) is
computed by taking the max rather than the sum:
T∞(n) = max(T∞(n−1), T∞(n−2)) + Θ(1) = T∞(n−1) + Θ(1).
This recurrence has solution Θ(n), so the span of P-Fib(n) is Θ(n).
We can now compute the parallelism of P-Fib(n) in general (not just the specific case
of n = 4 that we computed earlier) by dividing its work Θ(Fn) by its span Θ(n), giving
parallelism Θ(Fn / n), which grows dramatically as n grows.
(Of course in this example that is because we chose an inefficient way to compute
Fibonacci numbers, but this was only for illustration. These ideas apply to other well
designed algorithms.)
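The two rules above (sum for series, max for parallel) can be turned into a small
sketch. Assuming, for illustration, that every strand costs one unit of time, the
following Python functions compute the work and span of the P-Fib computation DAG
directly from the recurrences; running it shows the parallelism growing rapidly with n:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def work(n):
        if n <= 1:
            return 1                              # a single unit-cost strand
        return work(n - 1) + work(n - 2) + 1      # in series: sum, plus Theta(1)

    @lru_cache(maxsize=None)
    def span(n):
        if n <= 1:
            return 1
        return max(span(n - 1), span(n - 2)) + 1  # in parallel: max, plus Theta(1)

    for n in (4, 10, 20, 30):
        print(n, work(n), span(n), round(work(n) / span(n), 1))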
Parallel Loops
So far we have used spawn, but not the parallel keyword, which is used with loop
constructs such as for. Here is an example.
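The example pseudocode (CLRS's MAT-VEC, which computes y = Ax) is given as a figure that
is not reproduced here. Here is a Python sketch of the same structure; the "parallel for"
loops are only marked in comments, and the line numbers in the comments assume the
standard CLRS numbering that the analysis below refers to:

    def mat_vec(A, x):
        n = len(A)                               # line 1: n = A.rows
        y = [0] * n                              # lines 2-4: create and zero y (a parallel for)
        for i in range(n):                       # line 5: parallel for i = 1 to n
            for j in range(n):                   # line 6: (serial) for j = 1 to n
                y[i] = y[i] + A[i][j] * x[j]     # line 7
        return y                                 # line 8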
The parallel for keywords indicate that each iteration of the loop can be executed
concurrently. (Notice that the inner for loop is not parallel; a possible point of
improvement to be discussed.)
It is not realistic to think that all n subcomputations in these loops can be spawned
immediately with no extra work. (For some operations on some hardware up to a
constant n this may be possible; e.g., hardware designed for matrix operations; but we
are concerned with the general case.) How might this parallel spawning be done, and
how does this affect the analysis?
Parallel for spawning can be accomplished by a compiler with a divide and conquer
approach, itself implemented with parallelism. The procedure shown below is called
with Mat-Vec-Main-Loop(A, x, y, n, 1, n). Lines 2 and 3 are the lines originally within
the loop.
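The recursive spawning procedure is also shown only as a figure. Here is a sketch in
Python, using 0-based indices, so the top-level call is mat_vec_main_loop(A, x, y, n, 0,
n - 1) rather than (A, x, y, n, 1, n); spawn and sync appear only as comments, so this
version runs serially but has the same divide-and-conquer shape:

    def mat_vec_main_loop(A, x, y, n, i, ip):
        if i == ip:
            # the body originally inside the loop: one iteration, for row i
            for j in range(n):
                y[i] = y[i] + A[i][j] * x[j]
        else:
            mid = (i + ip) // 2
            # spawn: the two halves are logically parallel
            mat_vec_main_loop(A, x, y, n, i, mid)
            mat_vec_main_loop(A, x, y, n, mid + 1, ip)
            # sync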
The computation DAG is also shown. It appears that a lot of work is being done to
spawn the n leaf node computations, but the increase is not asymptotic.
The work of Mat-Vec is T1(n) = Θ(n2) due to the nested loops in lines 5-7.
Since the recursion tree is a full binary tree, the number of internal nodes is 1 fewer
than the n leaf nodes, so this extra work is Θ(n).
Each leaf node corresponds to one iteration of the loop, and the extra work of recursive
spawning can be amortized against the work of the iterations, so that it contributes
only a constant factor to the overall work.
The span is increased by Θ(lg n) due to the addition of the recursion tree for Mat-Vec-
Main-Loop, which is of height Θ(lg n). In some cases (such as this one), this increase is
washed out by other dominating factors (e.g., the span in this example is dominated
by the nested loops).
Nested Parallelism
Continuing with our example, the span is Θ(n) because even with full utilization of
parallelism the inner for loop still requires Θ(n). Since the work is Θ(n2) the
parallelism is Θ(n2)/Θ(n) = Θ(n). Can we improve on this?
Perhaps we could make the inner for loop parallel as well? Compare the original to
the revised version Mat-Vec':
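The revised pseudocode is not reproduced here; the only change from Mat-Vec is that the
inner loop is also marked parallel. A sketch of Mat-Vec' (the trouble this causes is the
subject of the next section):

    def mat_vec_prime(A, x):
        n = len(A)
        y = [0] * n
        for i in range(n):                       # parallel for, as before
            for j in range(n):                   # now also a parallel for in Mat-Vec'
                y[i] = y[i] + A[i][j] * x[j]     # logically parallel updates to y[i]
        return y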
Race Conditions
A determinacy race occurs when two logically parallel instructions access the same
memory location and at least one of them performs a write. Determinacy races are hard
to detect with empirical testing: many execution sequences would give correct results.
This kind of software bug is consequential: race condition bugs caused the Therac-25
radiation therapy machine to overdose patients, killing three, and caused the North
American Blackout of 2003.
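The "simple example" referred to next is given as a figure; it is along the lines of
CLRS's RACE-EXAMPLE, in which two logically parallel increments of a shared variable can
interleave. A Python sketch (thread scheduling makes the bad interleaving rare in
practice, but it is possible):

    import threading

    x = 0

    def increment():
        global x
        x = x + 1      # the read of x and the write of x are separate steps

    t1 = threading.Thread(target=increment)
    t2 = threading.Thread(target=increment)
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(x)           # usually 2, but 1 is possible if both reads happen before either write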
After you understand that simple example, let's look at our (renamed) matrix-vector
example again:
Exercise: Do you see how yi might be updated differently depending on the order in
which parallel invocations of line 7 (each of which reads the current value of yi and
writes a new one) are executed?
The text next analyzes a multithreaded matrix multiplication built from triply nested
loops: the two outer loops are parallel for loops and the innermost loop is serial (a
sketch follows). Its work is T1(n) = Θ(n3). The span of this algorithm is T∞(n) = Θ(n),
due to the path for spawning the outer and inner parallel loop executions and then the
n executions of the innermost for loop. So the parallelism is T1(n) / T∞(n) = Θ(n3) /
Θ(n) = Θ(n2).
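Here is the sketch promised above of that triply nested multithreaded matrix
multiplication (CLRS calls it P-SQUARE-MATRIX-MULTIPLY). The two outer loops are
parallel for loops in the pseudocode, while the serial innermost loop is what gives the
Θ(n) span:

    def p_square_matrix_multiply(A, B):
        n = len(A)
        C = [[0] * n for _ in range(n)]
        for i in range(n):                       # parallel for in the pseudocode
            for j in range(n):                   # parallel for in the pseudocode
                for k in range(n):               # serial innermost loop
                    C[i][j] += A[i][k] * B[k][j]
        return C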
Here is a parallel version of the divide and conquer algorithm from Chapter 4 of
CLRS (not in these web notes):
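The pseudocode is again given as a figure. As a rough functional sketch of the same idea
in Python, assuming n is an exact power of 2 (the CLRS version works in place, spawning
the eight recursive products and then adding a temporary matrix with nested parallel for
loops):

    def p_matrix_multiply_recursive(A, B):
        n = len(A)
        if n == 1:
            return [[A[0][0] * B[0][0]]]
        h = n // 2
        def quarters(M):  # split M into four h-by-h submatrices
            return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
                    [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])
        A11, A12, A21, A22 = quarters(A)
        B11, B12, B21, B22 = quarters(B)
        # the eight recursive products are spawned (logically parallel) in the pseudocode
        P1 = p_matrix_multiply_recursive(A11, B11)
        P2 = p_matrix_multiply_recursive(A12, B21)
        P3 = p_matrix_multiply_recursive(A11, B12)
        P4 = p_matrix_multiply_recursive(A12, B22)
        P5 = p_matrix_multiply_recursive(A21, B11)
        P6 = p_matrix_multiply_recursive(A22, B21)
        P7 = p_matrix_multiply_recursive(A21, B12)
        P8 = p_matrix_multiply_recursive(A22, B22)
        # sync, then add the pairs (a doubly nested parallel for in the pseudocode)
        def add(X, Y):
            return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]
        C11, C12, C21, C22 = add(P1, P2), add(P3, P4), add(P5, P6), add(P7, P8)
        return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
                [r1 + r2 for r1, r2 in zip(C21, C22)])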
See the text for analysis, which concludes that while the work is still Θ(n3), the span is
reduced to Θ(lg2n). Thus, while the work is the same as the basic algorithm the
parallelism is Θ(n3) / Θ(lg2n), which makes good use of parallel resources.
Parallelizing Merge-Sort
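MERGE-SORT' itself is shown in the text but not in these notes: it is ordinary merge
sort with the first recursive call spawned, a sync, and then the usual serial MERGE. A
rough out-of-place Python sketch, with spawn and sync noted only in comments:

    def merge_sort_prime(A):
        if len(A) <= 1:
            return A
        mid = len(A) // 2
        # spawn: the two recursive sorts are logically parallel
        left = merge_sort_prime(A[:mid])
        right = merge_sort_prime(A[mid:])
        # sync, then an ordinary serial merge
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]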
The recurrence for the work MS'1(n) of MERGE-SORT' is the same as the serial version:
The recurrence for the span MS'∞(n) of MERGE-SORT' is based on the fact that the
recursive calls run in parallel, so there is only one n/2 term (the two spans are the
same, so the max is either one):
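The recurrences themselves appear as figures in the text. For reference, following the
standard CLRS analysis they are:

    MS'1(n) = 2 MS'1(n/2) + Θ(n) = Θ(n lg n)    (work: same as serial merge sort)
    MS'∞(n) = MS'∞(n/2) + Θ(n) = Θ(n)           (span: one recursive term plus the serial MERGE)

so the parallelism is MS'1(n) / MS'∞(n) = Θ(lg n).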
This is low parallelism, meaning that even for large input we would not benefit from
having hundreds of processors. How about speeding up the serial MERGE?
Parallelizing Merge
MERGE takes two sorted lists and steps through them together to construct a single
sorted list. This seems intrinsically serial, but there is a clever way to make it parallel.
A divide-and-conquer strategy can rely on the fact that the lists are sorted to break the
lists into four lists, two of which will be merged to form the head of the final list and
the other two merged to form the tail.
1. Choose the longer list to be the first list, T[p1 .. r1] in the figure below.
2. Find the middle element (median) of the first list (x at q1).
3. Use binary search to find the position (q2) of this element if it were to be
inserted in the second list T[p2 .. r2].
4. Recursively merge
o The first list up to just before the median T[p1 .. q1-1] and the second list
up to the insertion point T[p2 .. q2-1].
o The first list from just after the median T[q1+1 .. r1] and the second list
after the insertion point T[q2 .. r2].
5. Assemble the results with the median element placed between them, as shown
below.
The text presents the BINARY-SEARCH pseudocode and analysis of Θ(lg n) worst case;
this should be review for you. It then assembles these ideas into a parallel merge
procedure that merges into a second array Z at location p3 (r3 is not provided as it can
be computed from the other parameters):
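The P-MERGE pseudocode itself is not reproduced here. As a simplified sketch in Python,
merging two already sorted lists out of place (the real P-MERGE merges T[p1 .. r1] and
T[p2 .. r2] into Z starting at p3 and spawns the first recursive call; here bisect_left
plays the role of BINARY-SEARCH):

    from bisect import bisect_left

    def p_merge(T1, T2):
        if len(T1) < len(T2):            # make T1 the longer (first) list
            T1, T2 = T2, T1
        if not T1:
            return []
        q1 = len(T1) // 2
        x = T1[q1]                       # median of the first list
        q2 = bisect_left(T2, x)          # where x would be inserted in the second list
        # spawn: the two sub-merges are logically parallel (run serially here)
        head = p_merge(T1[:q1], T2[:q2])
        tail = p_merge(T1[q1 + 1:], T2[q2:])
        return head + [x] + tail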
Analysis
My main purpose in showing this to you is to see that even apparently serial
algorithms sometimes have a parallel alternative, so we won't get into details, but here
is an outline of the analysis:
The span of P-MERGE is determined by the span of the larger parallel recursive call,
plus the Θ(lg n) binary search. Notice that although we divide the first list in half, it
could turn out that x's insertion point q2 is at the beginning or end of the second list.
Thus (informally), a recursive call can receive as many as 3n/4 of the n elements (at
best we have "chopped off" only 1/4 of them).
The text derives the recurrence shown below; it does not meet the Master Theorem, so
an approach from a prior exercise is used to solve it:
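(For reference, the span recurrence in CLRS is PM∞(n) = PM∞(3n/4) + Θ(lg n), whose
solution is Θ(lg2n); this is the span of P-MERGE.)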
Given 1/4 ≤ α ≤ 3/4 for the unknown division of the elements between the two
recursive calls, the work recurrence turns out to be:
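(In the text's notation the work recurrence is PM1(n) = PM1(αn) + PM1((1 − α)n) + O(lg n).)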
With some more work, PM1(n) = Θ(n) is derived. Thus the parallelism is Θ(n / lg2n)
Some adjustment to the MERGE-SORT' code is needed to use this P-MERGE; see the text.
Further analysis shows that the work for the new sort, P-MERGE-SORT, is PMS1(n) =
Θ(n lg n), and the span PMS∞(n) = Θ(lg3n). This gives parallelism of Θ(n / lg2n), which
is much better than the Θ(lg n) of MERGE-SORT' in terms of the potential use of
additional processors as n grows.
The chapter ends with a comment on coarsening the parallelism by using an ordinary
serial sort once the lists get small. One might consider whether P-MERGE-SORT is still a
stable sort, and choose the serial sort to retain this property if it is desirable.
Scheduling
At the beginning, we noted that we rely on a concurrency platform to determine how
to allocate potentially parallel threads of computation to available processors. This is
the scheduling problem. Scheduling parallel computations is a complex problem, and
sophisticated schedulers have been designed that are beyond what we can discuss
here.
Centralized schedulers are those that have information on the global state of
computation, but must make decisions in real time rather than in batch. A simple
approach to centralized scheduling is a greedy scheduler, which assigns as many
strands to available processors as possible at any given time step. The CLRS text
proves a theorem concerning the performance of a greedy scheduler, with interesting
corollaries:
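The theorem and its corollaries appear as figures in the notes. For reference, the
greedy-scheduler bound in CLRS states that on an ideal parallel computer with P
processors, a greedy scheduler executes a computation with work T1 and span T∞ in time

    TP ≤ T1 / P + T∞

with the corollaries that a greedy schedule is always within a factor of 2 of optimal,
and that when the slackness is large (P much smaller than T1 / T∞) the running time is
approximately T1 / P, giving near-perfect linear speedup.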
The proofs are not difficult to understand: see the text if you are interested. I think we
have said enough here to introduce the concepts of multithreading.
Distributed Algorithms - Introduction
An Introduction to Distributed Algorithms takes up some of the main concepts and
algorithms, ranging from basic to advanced techniques and applications, that underlie
the programming of distributed-memory systems such as computer networks,
networks of workstations, and multiprocessors. Written from the broad perspective of
distributed-memory systems in general, it includes topics such as algorithms for
maximum flow, program debugging, and simulation that do not appear in more
orthodox texts on distributed algorithms. Moving from fundamentals to advances and
applications, ten chapters, with exercises and bibliographic notes, cover a variety of
topics. These include models of distributed computation, information propagation,
leader election, distributed snapshots, network synchronization, self-stabilization,
termination detection, deadlock detection, graph algorithms, mutual exclusion,
program debugging, and simulation.
All of the algorithms are presented in a clear, template-based format for the
description of message-passing computations among the nodes of a connected graph.
Such a generic setting allows the treatment of problems originating from many
different application areas. The main ideas and algorithms are described in a way that
balances intuition and formal rigor: most are preceded by a general intuitive
discussion and followed by formal statements as to correctness, complexity, or other
properties.
The following algorithms build a breadth-first search (BFS) tree in a network. All
assume that there is a designated initiator node that starts the algorithm. At the end of
the algorithm, each node except the initiator has a parent pointer and every node has a
list of children. These are consistent and define a BFS tree, i.e., nodes at distance k
from the initiator appear at level k of the tree.
Asynchronous algorithms
To keep things simple, we'll drop the requirement that a parent learn the IDs of its
children, since this can be tacked on as a separate notification protocol, in which each
child just sends one message to its parent once it figures out who its parent is.
A simple algorithm using explicit distances
It's a very simple algorithm, closely related to Dijkstra's algorithm for shortest paths,
but there is otherwise no particular reason to use it; it is dominated by the O(D) time
and O(DE) message complexity synchronizer-based algorithm described later.
The idea is to run an AsynchronousBroadcast with distances attached. Each node sets
its distance to 1 plus the smallest distance sent by its neighbors and its parent to the
neighbor supplying that smallest distance. A node notifies all its neighbors of its new
distance whenever its distance changes.
In pseudocode:
States: distance (initially 0 for the initiator, ∞ for all other nodes), parent (initially
undefined), internal send buffers
Initiator: send distance to all neighbors
All processes:
    upon receiving d from p:
        if d + 1 < distance:
            distance := d + 1
            parent := p
            send distance to all neighbors
(See LynchBook for a precondition-effect description, which also includes code for
buffering outgoing messages.)
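As a sanity check of the pseudocode, here is a small Python simulation (the function
name and example graph are mine); pending messages are delivered in arbitrary order to
mimic asynchrony, and the resulting distances should match the true BFS distances:

    import random
    from math import inf

    def async_bfs(graph, initiator):
        # graph: dict mapping each node to a list of its neighbors
        distance = {v: inf for v in graph}
        parent = {v: None for v in graph}
        distance[initiator] = 0
        msgs = [(initiator, w, 0) for w in graph[initiator]]   # initiator's initial sends
        while msgs:
            i = random.randrange(len(msgs))                    # deliver some pending message
            p, v, d = msgs.pop(i)
            if d + 1 < distance[v]:
                distance[v] = d + 1
                parent[v] = p
                msgs.extend((v, w, distance[v]) for w in graph[v])
        return distance, parent

    graph = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}       # a small example network
    print(async_bfs(graph, 1))   # distances 0, 1, 1, 2 from node 1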
The claim is that after at most O(VE) messages and O(D) time, all distance values are
equal to the length of the shortest path from the initiator to the appropriate node. The
proof is by showing the following
Invariant
distancep is always the length of some path from initiator to p, and any
message sent by p is also the length of some path from initiator to p.
Proof
The second part follows from the first; any message sent equals p's current
value of distance. For the first part, suppose p updates its distance after receiving d
from a neighbor p'; then it sets it to d + 1, where d is the length of some path from the
initiator to p', so d + 1 is the length of that same path extended by adding the p'p edge.
We also need a liveness argument that says that distancep = d(initiator, p) no later
than time d(initiator, p). Note that we can't detect this condition occurring without a
lot of additional work.
The distributed minimum spanning tree (MST) problem involves the construction of
a minimum spanning tree by a distributed algorithm, in a network where nodes
communicate by message passing. It is radically different from the classical sequential
problem, although the most basic approach resembles Borůvka's algorithm. One
important application of this problem is to find a tree that can be used
for broadcasting. In particular, if the cost for a message to pass through an edge in a
graph is significant, an MST can minimize the total cost for a source process to
communicate with all the other processes in the network.
String Matching
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char
txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
pat[] = "TEST"
pat[] = "AABA"
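The example texts and expected outputs are not included above. As an illustration, here
is a sketch of the naive (brute-force) search in Python; the text string and the printed
indices below are my own made-up example, not from the notes:

    def search(pat, txt):
        n, m = len(txt), len(pat)
        # slide the pattern over the text one position at a time
        for i in range(n - m + 1):
            if txt[i:i + m] == pat:
                print("pattern found at index", i)

    search("ABA", "ABABABA")   # made-up example: prints indices 0, 2, and 4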