
The Design and Analysis of Parallel Algorithms

Justin R. Smith
Preface

This book grew out of lecture notes for a course on parallel algorithms that I
gave at Drexel University over a period of several years. I was frustrated by the
lack of texts that had the focus that I wanted. Although the book also addresses
some architectural issues, the main focus is on the development of parallel algo-
rithms on “massively parallel” computers. This book could be used in several
versions of a course on Parallel Algorithms. We tend to focus on SIMD parallel
algorithms in several general areas of application:
• Numerical and scientific computing. We study matrix algorithms and nu-
merical solutions to partial differential equations.
• “Symbolic” areas, including graph algorithms, symbolic computation,
sorting, etc.
There is more material in this book than can be covered in any single course, but
there are many ways this book can be used. I have taught a graduate course in
parallel numerical algorithms by covering the material in:
1. The Introduction.
2. § 2 of chapter 3 (page 57).
3. Chapter 4, and;
4. Chapter 5.
Another possible “track” through this book, that emphasizes symbolic algorithms,
involves covering:
1. The Introduction;
2. Chapter 2;
3. and Chapter 6.
A graduate course on parallel algorithms in general could follow the sequence:
1. The Introduction;
2. Chapter 2 — for a theoretical background;
3. §§ 1 and 1.2 of chapter 5 — for a taste of some numerical algorithms;
4. §§ 1, 2, 2.1, 2.2, 2.3.1, 3.5, and, if time permits, 3.7.1 of chapter 6.
I welcome readers’ comments, and am particularly interested in reports of errors
or suggestions for new topics or exercises. My address is:
Justin R. Smith
Department of Mathematics
Drexel University
Philadelphia, PA 19104
USA
and my electronic mail address is [email protected]. Although I will try to respond to readers’ comments, I regret that it will probably not be possible to respond to all of them.
I generally used the C* language as a kind of “pseudocode” for describing
parallel algorithms. Although Modula* might have been more suitable than C* as
a “publication language” for parallel algorithms, I felt that C* was better known
in the computing community1. In addition, my access to a C* compiler made it
possible for me to debug the programs. I was able to address many issues that
frequently do not arise until one actually attempts to program an algorithm. All
of the source code in this book can actually be run on a computer, although I make
no claim that the published source code is optimal.
The manuscript was typeset using AMS-LaTeX — the extension of LaTeX developed by the American Mathematical Society. I used a variety of machines in the excellent facilities of the Department of Mathematics and Computer Science of Drexel University for processing the TeX source code, and used MacDraw on a Macintosh Plus computer to draw the figures. The main font in the manuscript is Adobe Palatino — I used the psfonts.sty package developed by David M. Jones at the MIT Laboratory for Computer Science to incorporate this font into TeX.
Many of the symbolic computations and graphs were made using the Maple
symbolic computation package, running on my own Macintosh Plus computer.
Although this package is not object-oriented, it is very powerful. I am particu-
larly impressed with the fact that the versions of Maple on microcomputers are of
“industrial strength”.

1This will probably change as Modula* becomes more widely available.



Acknowledgements

I am indebted to several people who read early versions of this manuscript and pointed out errors or made suggestions as to how to improve the exposition,
including Jeffrey Popyack, Herman Gollwitzer, Bruce Char, David Saunders, and
Jeremy Johnson. I am also indebted to the students of my M780 course in the
Summer of 1992 who discovered some errors in the manuscript. This course was
based upon a preliminary version of this book, and the students had to endure the
use of a “dynamic textbook” that changed with time.
I would like to thank the proofreaders at Oxford University Press for an ex-
tremely careful reading of an early version of the manuscript, and many helpful
suggestions.
I would also like to thank Christian Herter and Ernst Heinz for providing
me with information on the Triton project at the University of Karlsruhe and the
Modula* language, and Kenneth Traub for providing me with information on the
Monsoon Project at M.I.T. I am indebted to Thinking Machines Corporation and
the CMNS project for providing me with access to a CM-2 Connection Machine
for development of the parallel programs that appear in this book.
Most of all, I am indebted to my wonderful wife Brigitte, for her constant
encouragement and emotional support during this arduous project. Without her
encouragement, I wouldn’t have been able to complete it. This book is dedicated
to her.
Contents

Chapter 1. Basic Concepts 1


1. Introduction 1

Chapter 2. Models of parallel computation 13


1. Generalities 13
2. Sorting on an EREW-SIMD PRAM computer 18
3. Bitonic Sorting Algorithm 19
4. Appendix: Proof of the 0-1 Principle 24
5. Relations between PRAM models 25
5.1. Complexity Classes and the Parallel Processing Thesis 29
5.2. P-Completeness and Inherently Sequential Problems 37
5.3. Further reading. 41
5.4. Brent’s Theorem 42
5.5. SIMD Algorithms 44
5.5.1. Doubling Algorithms 44
5.5.2. The Brent Scheduling Principle. 45
5.5.3. Pipelining. 46
5.5.4. Divide and Conquer. 46
5.6. MIMD Algorithms 46
5.6.1. Generalities 46
5.6.2. Race-conditions 47
5.6.3. Optimization of loops 51
5.6.4. Deadlocks 52
5.7. Comparison of the SIMD and MIMD models of computation 53

Chapter 3. Distributed-Memory Models 57


1. Introduction. 57
2. Generic Parallel Algorithms. 57
3. The Butterfly Network. 61
3.1. Discussion and further reading 66
3.2. Description 67
3.3. Discussion and further reading. 76
4. Cube-Connected Cycles 77
5. Dataflow Computers 84

Chapter 4. Examples of Existing Parallel Computers 95


1. Asynchronous Parallel Programming 95
1.1. Portable Programming Packages 95
1.1.1. Linda 95
1.1.2. p4 99

1.1.3. MPI 99
1.2. Automatically Parallelizing Compilers 100
2. Discussion and further reading 100
3. SIMD Programming: the Connection Machine 101
3.1. Generalities 101
3.2. Algorithm-Design Considerations 103
3.3. The C* Programming Language 103
3.3.1. Running a C* program. 109
3.4. Semantics of C* 109
3.4.1. Shapes and parallel allocation of data. 109
3.4.2. Action of the with-statement 110
3.4.3. Action of the where-statement. 111
3.4.4. Parallel Types and Procedures. 114
3.4.5. Special Parallel Operators. 117
3.4.6. Sample programs. 117
3.4.7. Pointers. 121
3.4.8. Subtleties of Communication. 122
3.4.9. Collisions. 123
3.5. A Critique of C* 126
4. Programming a MIMD-SIMD Hybrid Computer: Modula* 128
4.1. Introduction. 128
4.2. Data declaration. 129
4.2.1. Elementary data types. 129
4.2.2. Parallel data types. 129
4.3. The FORALL Statement. 130
4.3.1. The asynchronous case. 131
4.3.2. The synchronous case. 131
4.4. Sample Programs 134
4.5. A Critique of Modula* 135

Chapter 5. Numerical Algorithms 137


1. Linear algebra 137
1.1. Matrix-multiplication 137
1.2. Systems of linear equations 137
1.2.1. Generalities on vectors and matrices 138
1.2.2. Appendix: Diagonalization of Hermitian Matrices 147
1.2.3. The Jacobi Method 151
1.2.4. The JOR method 154
1.2.5. The SOR and Consistently Ordered Methods 157
1.2.6. Discussion 163
1.3. Power-series methods: the Pan-Reif Algorithm 165
1.3.1. Introduction. 165
1.3.2. The main algorithm 167
1.3.3. Proof of the main result 170
1.4. Nonlinear Problems 173
1.5. A Parallel Algorithm for Computing Determinants 174
1.6. Further reading 178
2. The Discrete Fourier Transform 178
2.1. Background 178

2.2. Definition and basic properties 182


2.3. The Fast Fourier Transform Algorithm 186
2.4. Eigenvalues of cyclic matrices 197
3. The JPEG Algorithm 200
4. Wavelets 201
4.1. Background 201
4.2. Discrete Wavelet Transforms 208
4.3. Discussion and Further reading 220
5. Numerical Evaluation of Definite Integrals 222
5.1. The one-dimensional case. 222
5.2. Higher-dimensional integrals 227
5.3. Discussion and Further reading. 230
6. Partial Differential Equations 232
6.1. Elliptic Differential Equations 233
6.1.1. Basic concepts 233
6.1.2. Self-adjoint equations 242
6.1.3. Rate of convergence 244
6.2. Parabolic Differential Equations 250
6.2.1. Basic Methods 250
6.2.2. Error Analysis 252
6.2.3. Implicit Methods 255
6.2.4. Error Analysis 259
6.3. Hyperbolic Differential Equations 261
6.3.1. Background 261
6.3.2. Finite differences 262
6.4. Further reading 265

Chapter 6. A Survey of Symbolic Algorithms 267


1. Doubling Algorithms 267
1.1. General Principles 267
1.2. Recurrence Relations 273
1.3. Deterministic Finite Automata 275
1.4. Parallel Interpolation 278
2. The Euler Tour Algorithm 282
2.1. Parallel Tree Contractions 289
2.2. Shortest Paths 290
2.3. Connected Components 293
2.3.1. Algorithm for a CREW Computer 294
2.3.2. Algorithm for a CRCW computer 301
2.4. Spanning Trees and Forests 305
2.4.1. An algorithm for an inverted spanning forest 311
2.5. Minimal Spanning Trees and Forests 315
2.6. Cycles in a Graph 327
2.6.1. Definitions 327
2.6.2. A simple algorithm for a cycle basis 328
2.6.3. Lowest Common Ancestors 329
2.7. Further Reading 331
3. Parsing and the Evaluation of arithmetic expressions 333
3.1. Introduction 333

3.2. An algorithm for building syntax-trees 334


3.3. An algorithm for evaluating a syntax tree 338
3.4. Discussion and Further Reading 344
3.5. Appendix: Parenthesis-matching algorithm 348
3.6. Parallel searching 348
3.7. Sorting Algorithms for a PRAM computer 349
3.7.1. The Cole Sorting Algorithm — CREW version 349
3.7.2. Example 356
3.7.3. The Cole Sorting Algorithm — EREW version 361
3.8. Detailed proof of the correctness of the algorithm 376
3.9. Expander Graphs 379
3.10. Introduction 382
3.11. Number-Theoretic Considerations 384
3.12. Discussion and further reading 398
Chapter 7. Probabilistic Algorithms 401
1. Introduction and basic definitions 401
1.1. Numerical algorithms 401
1.1.1. Monte Carlo Integration 401
1.2. Monte Carlo algorithms 403
1.3. Las Vegas algorithms 404
2. The class RNC 405
2.1. Work-efficient parallel prefix computation 406
2.2. The Valiant and Brebner Sorting Algorithm 409
2.3. Maximal Matchings in Graphs 410
2.3.1. A Partitioning Lemma 412
2.3.2. Perfect Matchings 412
2.3.3. The General Case 415
2.4. The Maximal Independent-Set Problem 417
2.4.1. Definition and statement of results. 417
2.4.2. Proof of the main results. 421
3. Further reading 426
Appendix A. Solutions to Selected Exercises 429
Appendix. Bibliography 447
Appendix B. Index 453
Appendix. Index 455
CHAPTER 1

Basic Concepts

1. Introduction
Parallel processing is a very broad field — in this introduction we
will try to give some kind of overview.
Certain applications of computers require much more processing power than
can be provided by today’s machines. These applications include solving differ-
ential equations and some areas of artificial intelligence like image processing. Ef-
forts to increase the power of sequential computers by making circuit elements
smaller and faster have approached basic physical limits.
Consequently, it appears that substantial increases in processing power can
only come about by somehow breaking up a task and having processors work on
the parts independently. The parallel approach to problem-solving is sometimes
very different from the sequential one — we will occasionally give examples of
parallel algorithms that convey the flavor of this approach.
A designer of custom VLSI circuits to solve some problem has a potentially
large number of processors available that can be placed on the same chip. It is
natural, in this context, to consider parallel algorithms for solving problems. This
leads to an area of parallel processing known as the theory of VLSI algorithms.
These are parallel algorithms that use a large number of processors that each have
a small amount of local memory and that can communicate with neighboring pro-
cessors only along certain predefined communication lines. The number of these
communication lines is usually very limited.
The amount of information that can be transmitted through such a commu-
nication scheme is limited, and this provides a limit to the speed of computa-
tion. There is an extensive theory of the speed of VLSI computation based on
information-flow arguments — see [160].
Certain tasks, like low level image processing, lend themselves to paralleliza-
tion because they require that a large number of independent computations be car-
ried out. In addition, certain aspects of computer design lead naturally to the
question of whether tasks can be done in parallel.
For instance, in custom VLSI circuit design, one has a large number of simple
processing elements available and it is natural to try to exploit this fact in devel-
oping a VLSI to solve a problem.
We illustrate this point with an example of one of the first parallel algorithms
to be developed and applied. It was developed to solve a problem in computer
vision — the counting of distinct objects in a field of view. Although this algorithm
has some applications, we present it here only to convey the flavor of many parallel
algorithms. It is due to Levialdi (see [100]).


We are given a two-dimensional array whose entries are all 0 or 1. The array
represents pixels of a black and white image and the 1’s represent the darkened
pixels. In one of the original applications, the array of pixels represented digitized
images of red blood cells. Levialdi’s algorithm solves the problem:
How do we efficiently count the connected sets of darkened pix-
els?
Note that this is a more subtle problem than simply counting the number of dark-
ened pixels. Levialdi developed a parallel algorithm for processing the array that
shrinks the objects in the image in each step — it performs transformations on the
array in a manner reminiscent of Conway’s well-known Game of Life:
Suppose ai,j denotes the (i, j)th pixel in the image array during some step of
the algorithm. The (i, j)th entry of the array in the next step is calculated as
(1)    h[h(ai,j−1 + ai,j + ai+1,j − 1) + h(ai,j + ai+1,j−1 − 1)]
where h is a function defined by:
    h(x) = 1 if x ≥ 1, and h(x) = 0 otherwise.
This algorithm has the effect of shrinking the connected groups of dark pixels
until they finally contain only a single pixel. At this point the algorithm calls for
removing the isolated pixels and incrementing a counter.
We assume that each array element ai,j (or pixel) has a processor (or CPU)
named Pi,j associated with it, and that these CPU’s can communicate with their
neighbors. These can be very simple processors — they would have only limited
memory and only be able to carry out simple computations — essentially the com-
putations contained in the equations above, and simple logical operations. Each
processor would have to be able to communicate with its neighbors in the array.
This algorithm can be carried out in cycles, where each cycle might involve:
1. Exchanging information with neighboring processors; or
2. Doing simple computations on data that is stored in a processor’s local
memory.
In somewhat more detail, the algorithm is:
C ← 0
for k ← 1 to n do
    for all processors {Pi,j} do in parallel
        Pi,j receives the values of
            ai+1,j, ai−1,j, ai+1,j+1, ai+1,j−1, ai−1,j+1, ai,j−1, ai,j+1
        from its neighbors (it already contains the value of ai,j)
        if ai,j = 1 and all neighboring elements are 0 then
            C ← C + 1
        Perform the computation in equation (1) above
    end /* do in parallel */
end for
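To make the structure of the algorithm concrete, here is a minimal sequential Python simulation of it (the function names, and the convention that pixels outside the array count as 0, are our own choices; the neighbor set is the one listed in the pseudocode above, and a real implementation would perform the inner loops in parallel):

def h(x):
    # Threshold function from equation (1): h(x) = 1 if x >= 1, else 0.
    return 1 if x >= 1 else 0

def levialdi_count(image, steps):
    # `image` is a rectangular list of lists of 0/1 entries; entries outside
    # the array are treated as 0.  Returns the count C after `steps` iterations.
    def a(img, i, j):
        if 0 <= i < len(img) and 0 <= j < len(img[0]):
            return img[i][j]
        return 0
    count = 0
    for _ in range(steps):
        nxt = [[0] * len(image[0]) for _ in range(len(image))]
        for i in range(len(image)):
            for j in range(len(image[0])):
                neighbors = [a(image, i + 1, j), a(image, i - 1, j),
                             a(image, i + 1, j + 1), a(image, i + 1, j - 1),
                             a(image, i - 1, j + 1), a(image, i, j - 1),
                             a(image, i, j + 1)]
                # Count a component when its last remaining pixel is isolated.
                if a(image, i, j) == 1 and all(n == 0 for n in neighbors):
                    count += 1
                # Equation (1): the shrinking transformation.
                nxt[i][j] = h(h(a(image, i, j - 1) + a(image, i, j)
                                + a(image, i + 1, j) - 1)
                              + h(a(image, i, j) + a(image, i + 1, j - 1) - 1))
        image = nxt
    return count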
Here is an example of the Levialdi algorithm — we are assuming that the i axis is horizontal (increasing from left-to-right) and the j axis is vertical. Suppose the initial image is given by figure 1.1.

FIGURE 1.1. Initial image

FIGURE 1.2. Result of first iteration

The result of the first iteration of the algorithm is given by figure 1.2.
In the next step the lone pixel in the upper right is removed and a counter
incremented. The result is depicted in figure 1.3. After a sufficient number of
steps (in fact, n steps, where n is the size of the largest side of the rectangle) the
screen will be blank, and all of the connected components will have been counted.
This implementation is an example of a particularly simple form of parallel
algorithm called a systolic algorithm1. Another example of such an algorithm is the
following:
Suppose we have a one-dimensional array (with n elements) whose entries are
processing elements. We assume that these processing elements can carry out the
basic operations of a computer — in this case it must be able to store at least two
numbers and compare two numbers. Each entry of this array is connected to its
two neighboring entries by communication lines so that they can send a number
to their neighbors — see figure 1.4.
Now suppose each processor has a number stored in it and we want to sort
these numbers. There exists a parallel algorithm for doing this in n steps — note
that it is well-known that a sequential sort using comparisons (other than a radix-
sort) requires Ω(n lg n) steps. Here lg denotes the logarithm to the base 2.

1See the discussion on page 16 for a (somewhat) general definition of systolic algorithms. Also
see [96]

FIGURE 1.3. Result of second iteration

FIGURE 1.4. A linear array of processors

In odd-numbered steps odd-numbered processors compare their number with that of their next higher numbered even processors and exchange numbers if they are out of sequence.
In even-numbered steps the even-numbered processors carry out a similar op-
eration with their odd-numbered neighbors (this is a problem that first appeared
in [91]). See figure 1.5 for an example of this process.
Note that this algorithm for sorting corresponds to the bubble sort algorithm
when regarded as a sequential algorithm.
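Here is a small sequential Python sketch of this odd-even transposition sort (the function name is ours); each pass of the outer loop simulates one parallel step of the linear array, and n passes suffice:

def odd_even_transposition_sort(a):
    # Sequential simulation of the n-step sorting algorithm for a linear
    # array of processors; processor i holds a[i].
    a = list(a)
    n = len(a)
    for step in range(1, n + 1):
        # Odd-numbered steps compare pairs (0,1), (2,3), ...;
        # even-numbered steps compare pairs (1,2), (3,4), ...
        start = 0 if step % 2 == 1 else 1
        for i in range(start, n - 1, 2):   # all pairs handled in parallel
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([7, 1, 6, 3, 4, 2, 5, 0]))
# prints [0, 1, 2, 3, 4, 5, 6, 7], reproducing the last row of figure 1.5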
At first glance it might seem that the way to develop the fastest possible parallel
algorithm is to start with the fastest possible sequential algorithm. This is very
definitely not true, in general. In fact, in many cases, the best parallel algorithm
for a problem doesn’t remotely resemble the best sequential algorithm. In order to
understand this phenomenon it is useful to think in terms of computation networks.
For instance the expression ( a + b + c)/(e + f − g) can be represented by the
directed graph in figure 1.6.
The meaning of a computation-network is that we try to perform a computa-
tion at each vertex of the graph and transmit the result of the computation along
the single exit edge. See § 5.4 on page 42 for a rigorous definition of a computation
network. The computation cannot be performed until data arrives at the vertex
along the incoming directed edges. It is not hard to see that the computations be-
gin at the vertices that have no incoming edges and end at the vertex (labeled the
root in figure 1.6) that has no outgoing edges.
We will briefly analyze the possibilities for parallelization presented by this
network. Suppose that each vertex of the network has a number associated with
it, called its value. The numbers attached to the leaf vertices are just the numerical
values of the variables a through g — we assume these are given.
For a non-leaf vertex the number attached to it is equal to the result of per-
forming the indicated operation on the values of the children of the vertex. It is
not hard to see that the value of the root of the tree will equal the value of the whole expression.

FIGURE 1.5. The odd-even sorting algorithm. The first row is the initial sequence; each later row is the result of one step:
7 1 6 3 4 2 5 0
1 7 3 6 2 4 0 5
1 3 7 2 6 0 4 5
1 3 2 7 0 6 4 5
1 2 3 0 7 4 6 5
1 2 0 3 4 7 5 6
1 0 2 3 4 5 7 6
0 1 2 3 4 5 6 7

FIGURE 1.6. Computation network for the expression (a + b + c)/(e + f − g)

FIGURE 1.7. Computation network for sequential addition

It is also not hard to see that to compute the value of a given
non-leaf vertex, it is first necessary to compute the values of its children — so that
the whole computation proceeds in a bottom up fashion. We claim that:
If a computation is to be done sequentially the execution time is
very roughly proportional to the number of vertices in the syntax
tree. If the execution is to be parallel, though, computations in dif-
ferent branches of the syntax tree are essentially independent so
they can be done simultaneously. It follows that the parallel exe-
cution time is roughly proportional to the distance from the root to
the leaves of the syntax tree.
This idea is made precise by Brent’s Theorem (5.17 on page 42). The task of find-
ing a good parallel algorithm for a problem can be regarded as a problem of re-
modeling the computation network in such a way as to make the distance from the
root to the leaves a minimum. This process of remodeling may result in an increase
in the total number of vertices in the network so that the efficient parallel algorithm
would be very inefficient if executed sequentially. In other words a relatively com-
pact computation network (i.e. small number of vertices) might be remodeled to a
network with a large total number of vertices that is relatively balanced or flat (i.e.
has a shorter distance between the root and the leaves).
For instance, if we want to add up 8 numbers a1 , . . . , a8 , the basic sequential
algorithm for this has the computation network given in figure 1.7.
This represents the process of carrying out 8 additions sequentially. We can re-
model this computation network to get figure 1.8. In this case the total execution-time is 3 units, since the distance from the root of the tree to the leaves is 3.

FIGURE 1.8. Remodeled computation network

When we remodel these computation-graphs and try to implement them on parallel computers, we encounter issues of interconnection topology — this is a description of how the different processors in a parallel computer communicate with each other.
It is not hard to see that the linear array of processors used in the odd-even sorting
algorithm depicted in figure 1.5 on page 5 would have a hard time implementing
the addition-algorithm shown in figure 1.8. The ideal situation would be for the
communication-patterns of the processors to be identical to the links of the com-
putation graph. Although this ideal configuration is not always possible, there are
many interconnection topologies that are suitable for a wide variety of problems
— see chapter 2, particularly § 3 on page 57 through § 4 on page 77.
Assuming an ideal interconnection topology, we get an algorithm for adding
2^k numbers in k steps, by putting the numbers at the leaves of a complete binary
tree. Figure 1.9 shows this algorithm adding 8 = 2^3 numbers in 3 steps.
As simple as this algorithm for adding 2^k numbers is, it forms the basis of a
whole class of parallel algorithms — see chapter 6.
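As a concrete illustration, here is a sequential Python sketch of this tree-style addition (it assumes the number of inputs is a power of 2); each pass of the loop corresponds to one parallel step:

def tree_sum(values):
    # Add 2^k numbers in k passes; within a pass all additions are disjoint
    # and could be done simultaneously by separate processors.
    level = list(values)
    while len(level) > 1:
        level = [level[2 * i] + level[2 * i + 1] for i in range(len(level) // 2)]
    return level[0]

print(tree_sum([4, 2, 1, 3, 1, -2, 0, 2]))   # 11, as in figure 1.9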
Note that, since we can add n numbers in lg n steps, with sufficiently many
processors, we can do matrix multiplication of n × n matrices in lg n steps. See 1.1
on page 137 for the details. Consequently we can also perform other operations
derived from matrix multiplication rapidly. For instance, we can find the distance
between all pairs of vertices in a graph in O(lg^2 n) steps — see 2.4 on page 291 for
the details.
While it is clear that certain tasks, like matrix addition, can be done rapidly
in parallel it doesn’t follow that all tasks have this property. This leads to the
question of whether there exist inherently sequential problems — problems that can’t
necessarily be done faster in parallel.
From the discussion connected with the example given above one might get
the impression that many unbounded parallel algorithms have an execution time
that is O(lg^k n) for some value of k. This turns out to be the case — for instance,
many algorithms are loosely based upon the example given above.
This phenomenon is so widespread that Nicholas Pippenger has defined a class
of problems called NC. These are problems that can be solved on a parallel
computer (like that in the preceding example) in O(lg^k n) time using a polynomial
number of processors (it seems reasonable to impose this restriction on the number
of processors — in addition many problems can be rapidly solved in parallel if an
unlimited number of processors are available). A rigorous definition is given in
§ 5.1 on page 34. In general, a problem is called parallelizable if it is in NC. Since
any NC problem can be sequentially solved in polynomial time (by simulating a PRAM computer by a sequential one) it follows that NC ⊆ P.

FIGURE 1.9. Sum of 8 numbers in 3 steps. Initial data: 4 2 1 3 1 −2 0 2; after step 1: 6 4 −1 2; after step 2: 10 1; after step 3: 11.

The natural question
at this point is whether NC = P — do there exist any inherently sequential problems?
This question remains open at present but there are a number of interesting things
one can say about this problem. The general belief is that NC ≠ P but no one
has come up with an example of an inherently sequential problem yet. A number
of problems are known to be P-complete. Roughly speaking, a polynomial-time
sequential problem is called P-complete if any other polynomial-time sequential
problem can be transformed into an instance of it — see page 38. If a fast parallel
algorithm can be found for any P-complete problem, then fast parallel versions
can be found for all polynomial-time sequential algorithms. See § 5.2 in chapter 2
for examples and a discussion of the theoretical issues involved.
Chapter 2 discusses various unbounded parallel models of computation
and Batcher’s sorting algorithm. We discuss the distinction between
“Procedure-level” and “Data-level” parallelism, and the question of whether
“super-linear” speedup of a sequential algorithm is possible. We give
comparisons of the so-called Concurrent-I/O and Exclusive-I/O models (where
the “I/O” in question is between processors and the shared random access
memory).
We also discuss some theoretical questions of when efficient parallel algo-
rithms exist. These questions all involve the class NC mentioned above and its
relations with the class of polynomial-time sequential problems P. We will dis-
cuss the relations between NC and PLogspace-sequential problems — known as
the Parallel Processing Thesis, originally due to Fortune and Wyllie. We will also
discuss potential candidates for inherently sequential problems here.
In addition, we discuss some of the architectures of existing parallel computers
and how algorithms are developed for these architectures. These architectures
include:
• the Butterfly Network
• the Hypercube
• Shuffle-Exchange networks and
• Cube-Connected Cycles
We also briefly discuss dataflow computers, and issues connected with the ques-
tion of how memory is accessed in network computers and the Granularity Prob-
lem.
Chapter 4 discusses some concrete examples of parallel computers and pro-
gramming languages. It shows how architectural considerations influence pro-
gramming languages. It considers some of the languages for the Sequent Symme-
try and the Connection Machine in some detail.
The programming constructs used on the Sequent provide examples of how
one programs MIMD computers in general. The use of semaphores and synchro-
nization primitives like cobegin and coend is discussed in this connection. We
also discuss the LINDA system for asynchronous parallel programming. It has the
advantage of simplicity, elegance, and wide availability.
We also discuss various portable packages for performing MIMD computing
using existing languages like FORTRAN and C — see § 1.1 on page 95.
We conclude our material on coarse-grained MIMD algorithms with a discus-
sion of automatic parallelization of algorithms (“Parallelizing compilers”) and look
at some of the research being done in this area.
The C* language for the Connection Machine is a model for programming SIMD machines in general. We use it as a kind of pseudocode for describing SIMD algorithms in the remainder of the text.
Chapter 5 considers numeric applications of parallel computing:
• Systems of linear equations. We discuss iterative techniques for solving
systems of linear equations.
• Matrix operations. We discuss the Pan-Reif matrix-inversion algorithm as
an example of an extraordinarily inefficient sequential algorithm that can
be efficiently parallelized.
• The Fourier Transform. We develop the FFT algorithm and give a few
applications.
• Wavelet transforms. This is a new area, related to the Fourier Transform,
that involves expanding input-functions (which may represent time-series
or other phenomena) in a series whose terms are fractal functions. Wavelet
transforms lend themselves to parallel computations — in some respects
to a greater extent than Fourier Transforms. We present parallel algorithms for wavelet transforms.
• Numerical integration. We give several parallel algorithms for approxi-
mately evaluating definite integrals.
• Numeric solutions to partial differential equations — including
techniques peculiar to elliptic, parabolic, and hyperbolic differential
equations.
Chapter 6 considers several classes of non-numeric parallel algorithms, including:
1. A general class of algorithms that may be called doubling algorithms. All of
these roughly resemble the algorithms for adding 8 numbers given in this
introduction. In general, the technique used for adding these 8 numbers
can be used in performing any associative operation. A large number of
operations can be expressed in these terms:
• The solution of linear recurrences.
• Parsing of linear languages (with applications to pattern-recognition
and the design of front-ends to compilers).
2. searching and various sorting algorithms including that of Cole and the
Ajtai, Komlós, Szemerédi sorting algorithm.
3. Graph-theoretic algorithms like minimal spanning tree, connected com-
ponents of a graph, cycles and transitive closure.
4. Fast Fourier Transforms (revisited) with applications to Computer Alge-
bra.
5. Probabilistic algorithms including Monte Carlo Integration, and some
symbolic algorithms. We introduce the class RNC, of problems that
have parallel solutions that involve random choices, and whose expected
execution time (this term is defined on page 405) is in O(lg^k n) for some
integer k, and for n equal to the complexity-parameter of the problem.

EXERCISES.

1.1. Prove the correctness of the algorithm on page 59 for forming the cumulative sum of 2^k numbers in k steps.
CHAPTER 2

Models of parallel computation

1. Generalities
In this section we discuss a few basic facts about parallel processing in general.
One very basic fact that applies to parallel computation, regardless of how it is
implemented, is the following:
CLAIM 1.1. Suppose the fastest sequential algorithm for doing a computation with
parameter n has execution time of T(n). Then the fastest parallel algorithm with m proces-
sors (each comparable to that of the sequential computer) has execution time ≥ T(n)/m.
The idea here is: If you could find a faster parallel algorithm, you could exe-
cute it sequentially by having a sequential computer simulate parallelism and get a
faster sequential algorithm. This would contradict the fact that the given sequen-
tial algorithm is the fastest possible. We are making the assumption that the cost
of simulating parallel algorithms by sequential ones is negligible.
This claim is called the “Principle of Unitary Speedup”.
As usual, the parameter n represents the relative size of the instance of the
problem being considered. For instance, if the problem was that of sorting, n might
be the number of items to be sorted and T(n) would be O(n lg n) for a sorting
algorithm based upon comparisons.
As simple as this claim is, it is a bit controversial. It makes the tacit assumption
that the algorithm in question is deterministic. In other words, the algorithm is
like the usual idea of a computer program — it performs calculations and makes
decisions based on the results of these calculations.
There is an interesting area of the theory of algorithms in which statement
1.1 is not necessarily true — this is the theory of randomized algorithms. Here, a
solution to a problem may involve making random “guesses” at some stage of the
calculation. In this case, the parallel algorithm using m processors can run faster
than m× the speed of the sequential algorithm (“Super-unitary speedup”). This
phenomenon occurs in certain problems in which random search is used, and most
guesses at a solution quickly lead to a valid solution, but there are a few guesses
that execute for a long time without producing any concrete results.
We will give an example of this phenomenon. Although it is highly oversim-
plified, it does illustrate how super-unitary speedup can occur.
Suppose there are 100 possible approaches to an AI-type search problem
and:
1. 99 out of the 100 possibilities arrive at a solution in 1 time unit.
2. 1 of the possibilities runs for 1000 time units, and then fails.
The expected execution-time of a single (sequential) attempt to find a solution is the
average of all of these times, or 10.99 time-units.
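Explicitly, this expected time is the probability-weighted average of the running times:

(99 · 1 + 1 · 1000)/100 = 1099/100 = 10.99 time-units.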

If we attack this problem with a parallel computer that has 2 processors that
try distinct possibilities, the expected time (and even the worst-case time) is 1 unit,
since at least one of these two distinct possibilities will be a fast solution. We, conse-
quently, see a super-unitary speedup in the parallel algorithm. In other words the
expected1 running-time of the algorithm is divided by > 10, which is much greater
than the ratio of processors.
The opponents of the concept of super-unitary speedups (including the au-
thor) would argue that the original sequential algorithm was not optimal — and
that the optimal sequential algorithm would have attempted two possible solu-
tions with two distinct processes, run concurrently. A sequential algorithm that
created two processes would have a running time of 2 units. The speedup that
results by going to the sample parallel algorithm is 2, which is exactly equal to the
ratio of processors. Thus, by modifying the sequential algorithm used, the validity
of 1.1 is restored.
In [128], Parkinson argues that super-unitary speedup is possible, and in [49]
Faber, Lubeck, and White argue that it is not.
See [98], [101], and [114], for more information about this phenomenon.
We will concentrate on deterministic algorithms in this text, so that we
will assume that super-unitary speedup is essentially impossible.
The next question we consider is how the instructions to the processors are
handled.
In this section we will consider some simple algorithms that can be imple-
mented when we have a SIMD computer in which every processor can access com-
mon RAM. In general, a computer in which many processors can access common
RAM in a single program-step is called the PRAM model of computer. This is one
of the oldest models of parallel processing to be considered, although there have
never been any large parallel computers that implement it. The PRAM model is a
kind of mathematical idealization of a parallel computer that eliminates many of
the low-level details and allows a researcher to concentrate on the “purely paral-
lel” aspects of a problem.
Traditionally, the PRAM model of computation has been regarded as more
of theoretical than practical interest. This is due to the fact that it requires large
numbers of processors to be physically connected to the same memory location.
The two examples cited above (the Sequent and the Encore Multimax) aren’t an
exception to this statement: they only have small numbers of processors. Several
researchers are exploring the possibility of physically realizing PRAM comput-
ers using optical interfaces — see [109].
There exist several different models of program control. In [51], Flynn listed
several basic schemes:
SIMD Single Instruction Multiple Data. In this model the processors are
controlled by a program whose instructions are applied to all of them si-
multaneously (with certain qualifications). We will assume that each of
the processors has a unique number that is “known” to the processor in
the sense that instructions to the parallel computer can refer to processor
numbers. An example of this type of machine is:

1Recall that the expected running-time of an algorithm like the one in the example is the average
of actual running times, weighted by probabilities that these running times occur.

The Connection Machine (models CM-1 and CM-2), from Thinking Machines Corporation. This is used as a generic example of a SIMD computer in this text.
A seemingly more powerful model is:
MIMD Multiple Instruction Multiple Data. In this model processors can
each have independent programs that are read from the common RAM.
This model is widely used in several settings:
1. Coarse-grained parallelism — this is a form of parallel computing that
closely resembles concurrent programming in that it involves processes
that run asynchronously. Several commercial parallel computers are
available that support this model of computation, including the Se-
quent Balance and Symmetry and the Encore Multimax. Many inter-
esting issues arise in this case. The data-movement and communica-
tions problems that occur in all parallel computation are more signifi-
cant here because the instructions to the processors as well as the data
must be passed between the common memory and the processors.
Due to these data-movement problems, commercial MIMD computers
tend to have a relatively small number of processors (≈ 20). In general,
it is easier to program a MIMD machine if one is only interested in a
very limited form of parallelism — namely the formation of processes.
Conventional operating systems like UNIX form separate processes to
carry out many functions, and these processes really execute in parallel
on commercially-available MIMD machines. It follows that, with one
of these MIMD machines, one can reap some of the benefits of paral-
lelism without explicitly doing any parallel programming. For this
reason, most of the parallel computers in commercial use today tend
to be MIMD machines, run as general-purpose computers.
On the surface, it would appear that MIMD machines are strictly more
powerful than SIMD machines with the same number of processors.
Interestingly enough, this is not the case — it turns out that SIMD ma-
chines are more suited to performing computations with a very regular
structure. MIMD machines are not as suited to solving such problems
because their processors must be precisely synchronized to implement
certain algorithms — and this synchronization has a cost that, in some
cases, can dominate the problem. See § 1.1.1, and particularly 5.23 for a
discussion of these issues.
Pure MIMD machines have no hardware features to guarantee syn-
chronization of processors. In general, it is not enough to simply load
multiple copies of a program into all of the processors and to start
all of these copies at the same time. In fact many such computers
have hardware features that tend to destroy synchronization, once it
has been achieved. For instance, the manner in which memory is ac-
cessed in the Sequent Symmetry series generally causes processes to
run at different rates even if they are synchronized at some time. Many
Sequents even have processors that run at different clock-rates.
2. The BBN Butterfly Computer — this is a computer whose architecture
is based upon that of the shuffle-exchange network, described in sec-
tion 3.2, on page 71.

Three other terms that fill out this list are:


SISD Single Instruction, Single Data. This is nothing but conventional se-
quential computing.
MISD This case is often compared to computation that uses Systolic Ar-
rays. These are arrays of processors that are developed to solve specific
problems — usually on a single VLSI chip. A clock coordinates the data-
movement operations of all of the processors, and output from some pro-
cessors are pipelined into other processors. The term “Systolic” comes
from an analogy with an animal’s circulatory system — the data in the
systolic array playing the part of the blood in the circulatory system. In
a manner of speaking, one can think of the different processors in a sys-
tolic array as constituting “multiple processors” that work on one set of
(pipelined) data. We will not deal with the MISD case (or systolic arrays)
very much in this text. See [96] for a discussion of systolic computers.
SIMD-MIMD Hybrids This is a new category of parallel computer that
is becoming very significant. These machines are also called SAMD ma-
chines (Synchronous-Asynchronous Multiple Data). The first announced
commercial SAMD computer is the new Connection Machine, the CM-5.
This is essentially a MIMD computer with hardware features to allow:
• Precise synchronization of processes to be easily achieved.
• Synchronization of processors to be maintained with little or no over-
head, once it has been achieved (assuming that the processors are all
executing the same instructions in corresponding program steps). It
differs from pure MIMD machines in that the hardware maintains a
uniform “heartbeat” throughout the machine, so that when the same
program is run on all processors, and all copies of this program are
started at the same time, it is possible for the execution of all copies
to be kept in lock-step with essentially no overhead. Such computers
allow efficient execution of MIMD and SIMD programs.
Here are some systems of this type:
1. The Paragon system from Intel. (Also called the iPSC-860.)
2. The SP-1 from IBM. This is essentially a networked set of independent
processors, each running its own copy of Unix and having its own
disk space. Logically, it looks like a set of workstations that communi-
cate through Ethernet connections. The thing that makes this system a
“parallel computer” is that the processors are networked together in a
way that looks like Ethernet to a programmer, but actually has a much
higher transfer rate.
3. The Triton Project, at the University of Karlsruhe (Germany). The Tri-
ton Project is currently developing a machine called the Triton/1 that
can have up to 4096 processors.
Flynn’s scheme for classifying parallel computers was somewhat refined by
Händler in 1977 in [64]. Händler’s system for classifying parallel computers
involves three pairs of integers:

T(C) = < K × K′, D × D′, W × W′ >

where:

1. K: the number of processor control units (PCU’s). These are portions of CPU’s that interpret instructions and can alter flow of instructions. The number K corresponds, in some sense, to the number of instruction-streams that can execute on a computer.
2. K′: The number of processor control units that can be pipelined. Here, pipelining represents the process of sending the output of one PCU to the input of another without making a reference to main memory. A MISD computer (as described above) would represent one in which K′ > K.
3. D: The number of arithmetic-logical units (ALU’s) controlled by each PCU. An ALU is a computational-element — it can carry out arithmetic or logical calculations. A SIMD computer would have a K-number of 1 and a D-number that is > 1.
4. D′: The number of ALU’s that can be pipelined.
5. W: The number of bits in an ALU word.
6. W′: The number of pipeline segments in all ALU’s controlled by a PCU or in a single processing element.
Although Händler’s scheme is much more detailed than Flynn’s, it still leaves much
to be desired — many modern parallel computers have important features that do
not enter into the Händler scheme2 at all. In most cases, we will use the much simpler
Flynn scheme, and we will give additional details when necessary.
With the development of commercially-available parallel computers, some
new terms have come into common use:
Procedure-level parallelism: This represents parallel programming on a com-
puter that has a relatively small number of processors, and is usually a MIMD-
machine. Since the number of processors is small, each processor usually does a
large chunk of the computation. In addition, since the machine is usually MIMD
each processor must be programmed with a separate program. This is usually
done in analogy with concurrent programming practices — a kind of fork-statement
is issued by the main program and a procedure is executed on a separate proces-
sor. The different processors wind up executing procedures in parallel, so this style
of programming is called procedure-level parallelism. It has the flavor of concur-
rent programming (where there is real concurrency) and many standard concur-
rent programming constructs are used, like semaphores, monitors, and message-
passing.
Data-level parallelism: This represents the style of parallel programming that is
emphasized in this book. It is used when there is a large number of processors on a
machine that may be SIMD or MIMD. The name ‘data-level parallelism’ is derived
from the idea that the number of processors is so great that the data can be broken
up and sent to different processors for computations — originally this referred
to a do-loop in FORTRAN. With a large number of processors you could send
each iteration of the loop to a separate processor. In contrast to procedure-level
parallelism, where you broke the code up into procedures to be fed to different
processors, here you break up the data and give it to different processors.
This author feels that these terms are not particularly meaningful — they are
only valid if one does parallel programming in a certain way that is closely related
to ordinary sequential programming (i.e. the terms arose when people tried to

2 For instance, the CM-5 computer mentioned above.


parallelize sequential programs in a fairly straightforward way). The terms are widely used, however.

2. Sorting on an EREW-SIMD PRAM computer


The next issue we might consider in classifying parallel processors, is how
memory is accessed.
DEFINITION 2.1. A PRAM computer follows the EREW scheme of memory
access if, in one program step, each memory location can be written or read by at
most a single processor.
Consider the sorting algorithm discussed in the introduction. It isn’t hard to
see that it is optimal in the sense that it will always take at least n steps to sort
n numbers on that computer. For instance, some numbers might start out n − 1
positions away from their final destination in the sorted sequence and they can
only move one position per program step. On the other hand it turns out that the
PRAM-EREW computer described above can sort n numbers in O(lg^2 n) program
steps using an old algorithm due to Batcher. This difference in execution time
throws some light on the fundamental property of the PRAM model of computa-
tion: there is an unbounded flow of information. In other words even if only one
processor can access one memory location at a time it is very significant that all
processors can access all of the available memory in a single program step.
Batcher’s sorting algorithm involves recursively using the Batcher Merge al-
gorithm, which merges two sequences of length n in time O(lg n).
In order to discuss this algorithm, we must first look at a general result that is
indispensable for proving the validity of sorting algorithms. This is the 0-1 Princi-
ple — already alluded to in the exercise at the end of the introduction. See § 2 in
chapter 28 of [35] for more information. This result applies to sorting and merging
networks, which we now define.
DEFINITION 2.2. A comparator is a type of device (a computer-circuit, for instance) with two inputs, IN1 and IN2, and two outputs, OUT1 and OUT2, such that:
• OUT1 = min(IN1, IN2)
• OUT2 = max(IN1, IN2)
The standard notation for a comparator (when it is part of a larger network) is a more compact diagram: the two horizontal data lines carrying IN1 and IN2 are joined by a single vertical line representing the comparator.
A comparator network is a directed graph with several special properties:
1. Data (numbers) can flow along the edges (in their natural direction). You
can think of these edges as “pipes” carrying data, or “wires”.
FIGURE 2.1. A sorting network on three inputs

2. One set of vertices constitutes the inputs for the graph, and another are
called its outputs. These are the exterior vertices of the graph.
3. The interior vertices of the graph are comparators.
A sorting network is a comparator network that has the additional property:
The data that appears at the output vertices is the result of sorting
the data that was at the input vertices.
A merging network is defined to be a comparator network with the property that,
if we subdivide the inputs into two subsets of equal sizes, and insert sorted data
into each of these subsets, the output is the result of merging the input-sequences
together.
PROPOSITION 2.3. If a comparator network correctly sorts all input-sequences
drawn from the set {0, 1}, then it correctly sorts any input-sequence of numbers, so that
it constitutes a sorting network.
Similarly, if a comparator-network whose inputs are subdivided into two equal sets
correctly merges all pairs of 0-1-sequences, then it correctly merges all pairs of number-
sequences.
The proof is given in the appendix to this section on page 24. Figure 2.1 shows
a sorting network that correctly sorts all possible sequences of three numbers.
Although the 0-1 Principle is stated for comparator networks it also applies
to many algorithms. If a sort or merge algorithm performs fixed sequences of
comparisons between elements, it clearly defines a network in the sense defined
above.
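To make this concrete, here is a small Python sketch (not taken from the text; the encoding of a network as a list of index pairs is our own convention) that uses the 0-1 Principle to test whether a comparator network sorts, by checking only the 2^n inputs of 0’s and 1’s:

from itertools import product

def apply_network(network, data):
    # `network` is a list of comparators (i, j) with i < j; each comparator
    # leaves min(data[i], data[j]) at position i and the max at position j.
    data = list(data)
    for i, j in network:
        if data[i] > data[j]:
            data[i], data[j] = data[j], data[i]
    return data

def is_sorting_network(network, n):
    # By the 0-1 Principle it suffices to test the 2^n inputs of 0's and 1's.
    return all(apply_network(network, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))

# A three-input network written as comparator pairs (an illustrative network,
# not necessarily the one drawn in figure 2.1):
print(is_sorting_network([(0, 1), (1, 2), (0, 1)], 3))   # True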

3. Bitonic Sorting Algorithm


In this section we will discuss one of the first parallel sorting algorithms to be
developed.
DEFINITION 3.1. A sequence of numbers will be called bitonic if either of the
following two conditions is satisfied:
• It starts out being monotonically increasing up to some point and then
becomes monotonically decreasing.
• It starts out being monotonically decreasing up to some point and then
becomes monotonically increasing.
A sequence of 0’s and 1’s will be called clean if it consists entirely of 0’s or entirely
of 1’s.
For instance the sequence {4, 3, 2, 1, 3, 5, 7} is bitonic. We will present an al-
gorithm that correctly sorts all bitonic sequences. This will turn out to imply an
efficient algorithm for merging all pairs of sorted sequences, and then, an associ-
ated algorithm for sorting all sequences.
FIGURE 2.2. A bitonic halver of size 8

DEFINITION 3.2. Given a bitonic sequence {a0, . . . , an−1} of size n = 2m, a bitonic halver is a comparator network that performs the following sequence of compare-exchange operations:
for i ← 0 to m − 1 do in parallel
    if (ai > ai+m) then swap(ai, ai+m)
endfor
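For concreteness, here is a minimal sequential Python sketch of a bitonic halver (the function name is ours; on a PRAM the loop body would execute for all i in a single parallel step). The sample input and output match the first step of Example 3.5 below:

def bitonic_halve(a):
    # One bitonic halving operation, in place: element i is compare-exchanged
    # with element i + m so that the smaller value ends up in the first half.
    m = len(a) // 2
    for i in range(m):              # done in parallel on a PRAM
        if a[i] > a[i + m]:
            a[i], a[i + m] = a[i + m], a[i]
    return a

print(bitonic_halve([1, 2, 3, 6, 7, 4, 2, 1]))   # [1, 2, 2, 1, 7, 4, 3, 6]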
Figure 2.2 shows a bitonic halver of size 8.
Note that a bitonic halver performs some limited sorting of its input (into as-
cending order). The following proposition describes the precise sense in which
sorting has been performed:
PROPOSITION 3.3. Suppose {a0, . . . , an−1}, where n = 2m, is a bitonic sequence
of 0’s and 1’s, that is input to a bitonic halving network, and suppose the output is
{b0 , . . . , bn−1 } = {r0 , . . . , rm−1 , s0 , . . . , sm−1 }. Then one of the two following statements
applies:
• The sequence {r0 , . . . , rm−1 } consists entirely of 0’s and the sequence
{s0 , . . . , sm−1 } is bitonic, or
• The sequence {r0 , . . . , rm−1 } is bitonic, the sequence {s0 , . . . , sm−1 } consists en-
tirely of 1’s.
Consequently, the smallest element of the lower half of the output is ≥ the largest element
of the upper half.
The two cases are distinguished by the number of 1’s that were in the original
input.
PROOF. We have four cases to contend with:
1. The input first increases and later decreases, and has a preponderance of 1’s:
{0_0, . . . , 0_{α−1}, 1_α, . . . , 1_{m−1}, . . . , 1_β, 0_{β+1}, . . . , 0_{n−1}}
where β − α ≥ m − 1, or β ≥ α + m − 1, or n − 1 − β ≤ m − 1 − α. This
inequality implies that when we compare 0’s in the upper half of the input
with 1’s in the lower half, each 0 will be compared with a 1 (i.e., there will
FIGURE 2.3. Bitonic sorting network for 8 inputs

be enough 1’s) and, therefore, will be swapped with it. A straightforward computation shows that the output will be
{0_0, . . . , 0_{α−1}, 1_α, . . . , 1_{β−m+1}, 0_{β−m+2}, . . . , 0_{m−1}, 1_m, . . . , 1_{n−1}}
so the conclusion is true.
2. The input first decreases and later increases, and has a preponderance of 1’s:
{1_0, . . . , 1_{α−1}, 0_α, . . . , 0_{m−1}, . . . , 0_β, 1_{β+1}, . . . , 1_{n−1}}
where β − α < m. In this case each 0 in the lower half of the input will also
be compared with a 1 in the upper half, since β ≤ α + m − 1. The output is
{0_0, . . . , 0_{β−1}, 1_β, . . . , 1_{α−m+1}, 0_{α−m+2}, . . . , 0_{m−1}, 1_m, . . . , 1_{n−1}}
The two cases with a preponderance of 0’s follow by symmetry. □

It is not hard to see how to completely sort a bitonic sequence of 0’s and 1’s:
ALGORITHM 3.4. Bitonic Sorting Algorithm. Let n = 2^k and let
{a0, . . . , an−1} be a bitonic sequence of 0’s and 1’s. The following algorithm sorts
it completely:
for i ← k downto 1 do in parallel
    Subdivide the data into 2^(k−i) disjoint sublists of size 2^i
    Perform a Bitonic Halving operation on each sublist
endfor
In the first iteration the “sublists” in question are the entire original input.
Since this correctly sorts all bitonic sequences of 0’s and 1’s, and since it is
a sorting network, the 0-1 Principle implies that it correctly sorts all bitonic se-
quences of numbers.
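Here is a sequential Python sketch of Algorithm 3.4 (the function name is ours); the example input reproduces Example 3.5 below:

def sort_bitonic(a):
    # Algorithm 3.4, simulated sequentially: phase i applies independent
    # bitonic halving operations to disjoint sublists of size 2^i.
    a = list(a)
    n = len(a)                      # n must be a power of 2
    size = n
    while size >= 2:
        m = size // 2
        for start in range(0, n, size):
            for i in range(start, start + m):    # one parallel step per phase
                if a[i] > a[i + m]:
                    a[i], a[i + m] = a[i + m], a[i]
        size //= 2
    return a

print(sort_bitonic([1, 2, 3, 6, 7, 4, 2, 1]))    # [1, 1, 2, 2, 3, 4, 6, 7]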

Figure 2.3 shows a bitonic sorting network. Since the bitonic halving opera-
tions can be carried out in a single parallel step (on an EREW computer, for in-
stance), the running time of the algorithm is O(lg n), using O(n) processors.

PROOF. Each Bitonic Halving operation leaves one half of its data correctly
sorted, the two halves are in the correct relationship with each other, and the other
half is bitonic. It follows that in phase i, each sublist is either sorted, or bitonic (but
in the proper sorted relationship with the other sublists). In the end the intervals
will be of size 1, and the whole list will be properly sorted. □

Here is an example of the bitonic sorting algorithm:


EXAMPLE 3.5. We set n to 8. The input data is:
{1, 2, 3, 6, 7, 4, 2, 1}
and we will sort in ascending order. After the first bitonic halving operation we
get
{1, 2, 2, 1, 7, 4, 3, 6}
Now we apply independent bitonic halving operations to the upper and lower
halves of this sequence to get
{1, 1, 2, 2, 3, 4, 7, 6}
In the last step we apply bitonic halving operations to the four sublists of size 2 to
get the sorted output (this step only interchanges the 6 and the 7)
{1, 1, 2, 2, 3, 4, 6, 7}
Now we have a fast algorithm for sorting bitonic sequences of numbers. Un-
fortunately, such sequences are very rare in the applications for sorting. It turns
out, however, that this bitonic sorting algorithm gives rise to a fast algorithm for
merging two arbitrary sorted sequences. Suppose we are given two sorted se-
quences { a0 , . . . , am−1 } and {b0 , . . . , bm−1 }. If we reverse the second sequence and
concatenate it with the first sequence the result is
{ a0 , . . . , am−1 , bm−1 . . . , b0 }
which is bitonic. This gives rise to the following Batcher Merge algorithm:
ALGORITHM 3.6. Batcher Merge. Suppose { a0 , . . . , am−1 } and {b0 , . . . , bm−1 } are two sorted sequences, where m = 2^(k−1). Then the following algorithm merges them together:
1. Preparation step. Reverse the second input-sequence and concatenate it
with the first, forming
{ a0 , . . . , am−1 , bm−1 . . . , b0 }
2. Bitonic sort step.
for i ← k downto 1 do in parallel
Subdivide the data into 2^(k−i) disjoint sublists of size 2^i
Perform a Bitonic Halving operation on each sublist
endfor
The result is a correctly-sorted sequence.
This merge algorithm executes in O(lg n) steps, using O(n) processors.


It is now straightforward to come up with an algorithm for sorting a sequence
of n numbers on an EREW-PRAM computer with n processors. We will also an-
alyze its running time. Suppose T (n) represents the time required to sort n = 2^k
data-items.
ALGORITHM 3.7. Batcher Sort
1. Sort right and left halves of the sequence (recursively). This runs in T(2^(k−1))-time (assuming that the right and left halves are sorted in parallel).
2. Perform a Batcher Merge (3.6) of the two sorted halves.
The overall running time satisfies the relation T(2^k) = T(2^(k−1)) + c1 (k − 1) + c2 .
Here c1 is equal to the constant factors involved in the Merge algorithm and c2
is the total contribution of the constant factors in the unshuffle and shuffle-steps,
and the last step. We get
T(2^k) = ∑_{j=1}^{k} (c1 (j − 1) + c2) = c1 k(k − 1)/2 + c2 k = O(k^2)

Since n = 2^k, k = lg n and we get T(n) = O(lg^2 n).
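The pieces fit together as in the following sequential C sketch of Algorithms 3.6 and 3.7. It is not a parallel implementation (the loops that would be single parallel steps are written out serially), the test data in main() are made up, and n must be a power of 2.

#include <stdio.h>

/* One Bitonic Halving pass on a[lo..lo+len-1]. */
static void halving(int a[], int lo, int len)
{
    for (int j = lo; j < lo + len / 2; j++)
        if (a[j] > a[j + len / 2]) {
            int t = a[j]; a[j] = a[j + len / 2]; a[j + len / 2] = t;
        }
}

/* Algorithm 3.6: merge the sorted runs a[lo..lo+m-1] and a[lo+m..lo+2m-1].
   Reversing the second run makes the whole stretch bitonic; the halving
   passes then sort it. */
static void batcher_merge(int a[], int lo, int m)
{
    for (int i = 0; i < m / 2; i++) {                 /* preparation step */
        int t = a[lo + m + i];
        a[lo + m + i] = a[lo + 2 * m - 1 - i];
        a[lo + 2 * m - 1 - i] = t;
    }
    for (int size = 2 * m; size >= 2; size /= 2)      /* bitonic sort step */
        for (int s = lo; s < lo + 2 * m; s += size)
            halving(a, s, size);
}

/* Algorithm 3.7: sort the halves recursively, then Batcher-merge them. */
static void batcher_sort(int a[], int lo, int n)
{
    if (n < 2) return;
    batcher_sort(a, lo, n / 2);
    batcher_sort(a, lo + n / 2, n / 2);
    batcher_merge(a, lo, n / 2);
}

int main(void)
{
    int a[8] = {5, 8, 2, 7, 1, 4, 6, 3};   /* made-up test data */
    batcher_sort(a, 0, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* prints 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}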

EXERCISES.

3.1. Find examples of PRAM and networked parallel computers that are com-
mercially available. What programming languages are commonly used on these
machines?
3.2. Is it possible for a sorting algorithm to not be equivalent to a sorting net-
work? What operations would the algorithm have to perform in order to destroy
this equivalence?
3.3. Prove that the odd-even sorting algorithm on page 5 works. Hint: Use
the 0-1 principle.
3.4. Consider the example of super-unitary speedup on page 13. Is such
super-unitary speedup always possible? If not, what conditions must be satisfied
for it to happen?
3.5. At about the time that he developed the Bitonic Sorting Algorithm, and
the associated merging algorithm, Batcher also developed the Odd-Even Merge Algorithm.
1. Assume that we have two sorted sequences of length n: { Ai } and
{ Bi }. Unshuffle these sequences forming four sequences of length n/2:
{ A2j−1 },{ A2j }, { B2j−1 }, { B2j };
2. (Recursively) merge { A2j−1 } with { B2j−1 } forming {Ci } and { A2j } with
{ B2j } forming { Di };
3. Shuffle {Ci } with { Di } forming {C1 , D1 , C2 , D2 , . . . Cn , Dn };
4. Sorting this result requires at most 1 parallel step, interchanging Ci with Di−1 for some values of i.
The correctness of this algorithm depends upon the lemma:
LEMMA 3.8. Let n be a power of 2 and let A1 , . . . , An and B1 , . . . , Bn be two sorted
sequences such that A1 ≤ B1 . Merge the sequences A1 , A3 , A5 , . . . and B1 , B3 , B5 , . . . to
form C1 , . . . , Cn and the sequences A2 , A4 , A6 , . . . and B2 , B4 , B6 , . . . to form D1 , . . . , Dn .
Now shuffle the C-sequence with the D- sequence to get C1 , D1 , C2 , D2 , . . . Cn , Dn . Then
sorting this last sequence requires, at most interchanging Ci with Di−1 for some values of
i.
Here is an example: Suppose the original A-sequence is 1, 5, 6, 9 and the
original B-sequence is 2, 3, 8, 10. The sequence of odd A’s is 1, 6 and the sequence
of even A’s is 5, 9. The sequence of odd B’s is 2, 8 and the sequence of even B’s is
3, 10. The result of merging the odd A’s with the odd B’s is the sequence of C’s —
this is 1, 2, 6, 8. Similarly the sequence of D’s is 3, 5, 9, 10. The result of shuffling
the C’s with the D’s is 1, 3, 2, 5, 6, 9, 8, 10. Sorting this sequence only involves interchanging C2 = 2 with D1 = 3, and C4 = 8 with D3 = 9.
Prove this lemma, using the 0-1 Principle. (A sequential sketch of the odd-even merge itself appears below.)
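For readers who want to experiment with the construction, here is a sequential C sketch of the Odd-Even Merge (steps 1-4 above, written recursively). It is only an illustration, not a solution of the exercise, and the scratch-array bounds are arbitrary limits chosen for this small example.

#include <stdio.h>

/* Batcher's Odd-Even Merge: a and b are sorted arrays of length n (a power
   of 2, at most 64 with the scratch sizes below); the merged result is
   written to out[0..2n-1]. */
static void odd_even_merge(const int *a, const int *b, int n, int *out)
{
    if (n == 1) {                        /* a single comparator */
        out[0] = a[0] < b[0] ? a[0] : b[0];
        out[1] = a[0] < b[0] ? b[0] : a[0];
        return;
    }
    int ao[32], ae[32], bo[32], be[32], c[64], d[64];
    for (int j = 0; j < n / 2; j++) {    /* step 1: unshuffle */
        ao[j] = a[2 * j];  ae[j] = a[2 * j + 1];
        bo[j] = b[2 * j];  be[j] = b[2 * j + 1];
    }
    odd_even_merge(ao, bo, n / 2, c);    /* step 2: merge the two "odd" subsequences  */
    odd_even_merge(ae, be, n / 2, d);    /*         and the two "even" subsequences   */
    for (int j = 0; j < n; j++) {        /* step 3: shuffle C with D */
        out[2 * j] = c[j];
        out[2 * j + 1] = d[j];
    }
    for (int j = 1; j < 2 * n - 1; j += 2)   /* step 4: one compare-exchange step */
        if (out[j] > out[j + 1]) {
            int t = out[j]; out[j] = out[j + 1]; out[j + 1] = t;
        }
}

int main(void)
{
    int a[4] = {1, 5, 6, 9}, b[4] = {2, 3, 8, 10}, out[8];   /* data of the worked example */
    odd_even_merge(a, b, 4, out);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);       /* prints 1 2 3 5 6 8 9 10 */
    printf("\n");
    return 0;
}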

4. Appendix: Proof of the 0-1 Principle


We begin by recalling the definition of monotonically increasing functions:
DEFINITION 4.1. A real-valued function f is called monotonically increasing if,
whenever x ≤ y, then f ( x ) ≤ f (y).
For instance, f(x) = x and f(x) = x^2 are monotonically increasing functions. These
functions (and 0-1 comparator networks) have the following interesting property:
PROPOSITION 4.2. Let { a0 , . . . , ak−1 } be a set of numbers that are input to a network, N, of comparators, and let {b0 , . . . , bk−1 } be the output. If f(x) is any monotonically increasing function, the result of inputting { f(a0), . . . , f(ak−1)} to N will be { f(b0), . . . , f(bk−1)}.
In other words, the way a comparator network permutes a set of numbers
is not affected by applying a monotonically increasing function to them. This is
intuitively clear because:
• the decisions that a comparator network makes as to whether to permute
two data items are based solely upon the relative values of the data items;
• and monotonically increasing functions preserve these relative value rela-
tionships.

P ROOF. We use induction on the number of comparators in the comparator


network. If there is only one comparator in the network, then each data item must
traverse at most one comparator. In this case, the proof is clear:


min( f ( x ), f (y)) = f (min( x, y))
max( f ( x ), f (y)) = f (max( x, y))
Now suppose the conclusion is true for all comparator networks with n com-
parators. If we are given a comparator network with n + 1 comparators, we can
place one comparator, C, adjacent to the input-lines and regard the original net-
work as a composite of this comparator, and the subnetwork, N \ C, that remained
after we removed it. The notation A \ B represents the set-difference of the sets A
and B.
We have just shown that comparator C (i.e., the one that was closest to the
input lines in the original network) satisfies the conclusion of this result. It follows
that the input-data to N \ C will be f (output of C ). The inductive hypothesis im-
plies that the rest of the original network (which has n comparators) will produce
output whose relative order will not be modified by applying the function f. □
COROLLARY 4.3. If a comparator network correctly sorts all input-sequences drawn
from the set {0, 1}, then it correctly sorts any input-sequence of numbers, so that it con-
stitutes a sorting network.
Similarly, if a comparator-network whose inputs are subdivided into two equal sets
correctly merges all pairs of 0-1-sequences, then it correctly merges all pairs of number-
sequences.
P ROOF. If we have k inputs { a0 , . . . , ak−1 }, define the k monotonically increas-
ing functions:
(
0 if x < ai
fi (x) =
1 if x ≥ ai
The conclusion follows immediately from applying 4.2, above, to these monotonically increasing functions. □
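Corollary 4.3 also gives a practical way to check a small comparator network: run it on every 0-1 input. The C sketch below does this for a five-comparator network on four inputs (a standard four-element sorting network, included here only as test data); the brute-force loop grows exponentially with the number of inputs, so this is only reasonable for small networks.

#include <stdio.h>

/* A comparator network is a list of (i, j) pairs; applying comparator (i, j)
   puts the minimum in position i and the maximum in position j. */
struct comparator { int i, j; };

/* A standard 5-comparator sorting network on 4 inputs, used as test data. */
static const struct comparator net[] = {
    {0, 1}, {2, 3}, {0, 2}, {1, 3}, {1, 2}
};

/* By Corollary 4.3 it is enough to try every 0-1 input (n < 32 here). */
static int sorts_all_01_inputs(const struct comparator *c, int m, int n)
{
    for (unsigned v = 0; v < (1u << n); v++) {
        int x[32];
        for (int k = 0; k < n; k++) x[k] = (v >> k) & 1;
        for (int k = 0; k < m; k++)              /* run the network */
            if (x[c[k].i] > x[c[k].j]) {
                int t = x[c[k].i]; x[c[k].i] = x[c[k].j]; x[c[k].j] = t;
            }
        for (int k = 0; k + 1 < n; k++)          /* check the output is sorted */
            if (x[k] > x[k + 1]) return 0;
    }
    return 1;
}

int main(void)
{
    printf("the network %s a sorting network\n",
           sorts_all_01_inputs(net, 5, 4) ? "is" : "is NOT");
    return 0;
}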

5. Relations between PRAM models


In this section we will use the sorting algorithm developed in the last section
to compare several variations on the PRAM models of computation. We begin
by describing two models that appear to be substantially stronger than the EREW
model:
CREW — Concurrent Read, Exclusive Write. In this case any number of pro-
cessors can read from a memory location in one program step, but at most
one processor can write to a location at a time. In some sense this model
is the one that is most commonly used in the development of algorithms.
CRCW — Concurrent Read, Concurrent Write. In this case any number of
processors can read from or write to a common memory location in one
program step. The outcome of a concurrent write operation depends on
the particular model of computation being used (i.e. this case breaks up
into a number of sub-cases). For instance, the result of a concurrent write
might be the boolean OR of the operands; or it might be the value stored
by the lowest numbered processor attempting the write operation, etc.
This model of computation is more powerful than the CREW model —
see § 2.3 in chapter 6 (on page 293) for an example of a problem (connected
components of a graph) for which there exists an improved algorithm using the CRCW model. The Hirschberg-Chandra-Sarwate algorithm runs
on a CREW computer in O(lg^2 n) time and the Shiloach-Vishkin algorithm
runs on a CRCW computer in O(lg n) time. There is no known algorithm
for this problem that runs on a CREW computer in O(lg n) time.
It is a somewhat surprising result, due to Vishkin (see [167]) that these models
can be effectively simulated by the EREW model (defined in 2.1 on page 18). The
original statement is as follows:
THEOREM 5.1. If an algorithm on the CRCW model of memory access executes in α time units using β processors then it can be simulated on the EREW model using O(α lg^2 n)-time and β processors. The RAM must be increased by a factor of O(β).
This theorem uses the Batcher sorting algorithm in an essential way. If we sub-
stitute the (equally usable) EREW version of the Cole sorting algorithm, described
in § 3.7.3 in chapter 6 (see page 357) we get the following theorem:
THEOREM 5.2 (Improved Vishkin Simulation Theorem). If an algorithm on the CRCW model of memory access executes in α time units using β processors then it can be simulated on the EREW model using O(α lg n)-time and β processors. The RAM must be increased by a factor of O(β).
Incidentally, we are assuming the SIMD model of program control.
The algorithm works by simulating the read and write operations in a single
program step of the CRCW machine.
CRCW Write Operation. This involves sorting all of the requests to write to
a single memory location and picking only one of them per location. The request
that is picked per memory location is the one coming from the processor with the
lowest number — and that request is actually performed.
Suppose that processor i wants to write to address a(i )(0 ≤ i ≤ β − 1).
1. Sort the pairs {( a(i ), i ), 0 ≤ i ≤ β − 1} in lexicographic order using the
Batcher Algorithm (or the Cole sorting algorithm on page 357, for the im-
proved version of the theorem) presented in the last section. Call the re-
sulting list of pairs {( a( ji ), ji )}.
2. Processor 0 writes in a( j0 ) the value that processor j0 originally intended
to write there. Processor k (k > 0) tests whether a( jk−1 ) = a( jk ). If not
it writes in a( jk ) the value that processor jk originally intended to write
there.
Here is an example of the simulation of a CRCW-write operation:
Processor 0 1 2 3 4 5 6 7
Target 2 6 3 1 5 7 7 0
D (i ) 3 4 5 6 7 8 9 0
Here D (i ) is the data that processor i wants to write to location a(i ).
This is converted into a list of pairs:
(2,0), (6,1), (3,2), (1,3), (5,4), (7,5), (7,6), (0,7)
This list is sorted in lexicographic order, i.e., primarily on the first element (the target address) of each pair:
(0,7), (1,3), (2,0), (3,2), (5,4), (6,1), (7,5), (7,6)
Suppose the ith pair in the sorted list is called ( a( ji ), ji ), and the ith memory location is called Mi . These pairs are processed via the
following sequence of operations:

Processor Action
0 M0 ← D(7) = 0
1 Test a(j1) = 1 ≠ a(j0) = 0 and do M1 ← D(3) = 6
2 Test a(j2) = 2 ≠ a(j1) = 1 and do M2 ← D(0) = 3
3 Test a(j3) = 3 ≠ a(j2) = 2 and do M3 ← D(2) = 5
4 Test a(j4) = 5 ≠ a(j3) = 3 and do M5 ← D(4) = 7
5 Test a(j5) = 6 ≠ a(j4) = 5 and do M6 ← D(1) = 4
6 Test a(j6) = 7 ≠ a(j5) = 6 and do M7 ← D(5) = 8
7 Test a(j7) = a(j6) = 7 and do nothing
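Here is a sequential C sketch of the write-simulation just described, run on the data of the example. The call to qsort stands in for the Batcher (or Cole) sort that the actual EREW simulation uses, and the loop over k plays the role of the β processors acting in a single parallel step.

#include <stdio.h>
#include <stdlib.h>

#define N 8

struct req { int addr, proc; };

/* Lexicographic comparison of (address, processor) pairs. */
static int cmp(const void *x, const void *y)
{
    const struct req *a = x, *b = y;
    return a->addr != b->addr ? a->addr - b->addr : a->proc - b->proc;
}

int main(void)
{
    int a[N] = {2, 6, 3, 1, 5, 7, 7, 0};   /* a(i): the write-addresses of the example */
    int D[N] = {3, 4, 5, 6, 7, 8, 9, 0};   /* D(i): the data to be written             */
    int M[N] = {0};                        /* simulated memory; unwritten cells stay 0 */

    struct req r[N];
    for (int i = 0; i < N; i++) { r[i].addr = a[i]; r[i].proc = i; }

    /* Step 1: sort the pairs (a(i), i); qsort stands in for the EREW sort. */
    qsort(r, N, sizeof r[0], cmp);

    /* Step 2: position k writes only if its address differs from position
       k-1's, so the lowest-numbered requester wins each address. */
    for (int k = 0; k < N; k++)
        if (k == 0 || r[k].addr != r[k - 1].addr)
            M[r[k].addr] = D[r[k].proc];

    for (int m = 0; m < N; m++) printf("M%d = %d\n", m, M[m]);
    return 0;
}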
CRCW Read Operation. Here a(i )(0 ≤ i ≤ β − 1) denotes the address from which
processor i wants to read in the CRCW machine.
1. Identical to step 1 in the Write Operation. In addition introduce an auxil-
iary β × 3 array denoted Z.
For i, 0 ≤ i ≤ β − 1: Z (i, 0) contains the content of memory address
a( ji ) at the end of the read-operation.
Z (i, 1) contains YES if the content of a( ji ) is already written in Z (i, 0),
and NO otherwise. It is set to NO before each simulated CRCW read-
operation.
Z (i, 2) contains the content of address a(i ) at the end of the read-
operation.
2. Processor 0 copies the content of a( j0 ) into Z (0, 0); Z (0, 1) ← YES. If
a( ji ) ≠ a( ji−1 ) then processor i copies the content of a( ji ) into Z (i, 0); Z (i, 1) ← YES. The array now has the unique values needed by the pro-
cessors. The next step consists of propagating these values throughout the
portions of Z (0, ∗) that correspond to processors reading from the same
location. This is accomplished in lg n iterations of the following steps (for
processors 0 ≤ i ≤ β − 1):
k(i) ← 0 (once, in the first iteration);
Wait until Z(i, 1) is turned to YES;
while (i + 2^k(i) ≤ β − 1 and
      Z(i + 2^k(i), 1) = NO) do
   Z(i + 2^k(i), 1) ← YES;
   Z(i + 2^k(i), 0) ← Z(i, 0);
   k(i + 2^k(i)) ← k(i) + 1;
   k(i) ← k(i) + 1;
endwhile
Z(ji, 2) ← Z(i, 0);
Note that the EREW design of the computer makes it necessary for us to have
separate counters — the k (i ) — for each of the processors.
This operation copies the values that have been read from memory as many
times as are necessary to satisfy the original read-requests. The final step consists
in having the processors read Z (∗, 2).
Here is an example of the simulation of a CRCW-read operation:
Processor 0 1 2 3 4 5 6 7
Reads from 2 6 7 1 5 7 1 0
D (i ) 3 4 5 6 7 8 9 0
In this example Di is the data that processor i initially contains. The first step
is the same as in the simulation of the CRCW-write operation. The set of desired
read-operations is converted into a list of pairs:
(2,0), (6,1), (7,2), (1,3), (5,4), (7,5), (1,6), (0,7)
This list is sorted in lexicographic order, i.e., primarily on the first element (the address being read) of each pair:
(0,7), (1,3), (1,6), (2,0), (5,4), (6,1), (7,2), (7,5)
Now we set up the Z array. Initially it looks like the following:
i 0 1 2 3 4 5 6 7
Z (i, 0)
Z (i, 1) NO NO NO NO NO NO NO NO
Z (i, 2)
The first processor copies D ( a( j0 )) into position Z (0, 0). Every other processor
tests whether a( ji ) ≠ a( ji−1 ) and, if the values are not equal, copies its value of
D ( a( ji )) into Z (i, 0). Each position of the Z-array that receives one of the a(i ) is
marked by having its value of Z (i, 1) set to YES. We also set up the variables k(i ).
We get the following array:
i        0    1    2    3    4    5    6    7
Z(i, 0)  0    6         3    7    4    5
Z(i, 1)  YES  YES  NO   YES  YES  YES  YES  NO
Z(i, 2)
k(i)     0    0    0    0    0    0    0    0
Now we begin the iterations of the algorithm. After the first iteration we get
i 0 1 2 3 4 5 6 7
Z (i, 0) 0 6 6 3 7 4 5 5
Z (i, 1) YES YES YES YES YES YES YES YES
Z (i, 2)
k (i ) 1 1 1 1 1 1 1 1
In this particular example the iterations are completed in the first step. No
computations occur in the remaining (2) iterations. In the last step of the algorithm
the data is copied into Z (∗, 2).
i 0 1 2 3 4 5 6 7
Z (i, 0) 0 6 6 3 7 4 5 5
Z (i, 1) YES YES YES YES YES YES YES YES
Z (i, 2) 3 4 5 6 7 5 6 0
k (i ) 3 3 3 3 3 3 3 3
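The read-simulation can be sketched the same way. In the C program below the memory contents are made up (they are not the values of the worked example); qsort again stands in for the EREW sorting step, and the doubling loop is a serial rendering of the lg n propagation rounds, with a snapshot of the YES-flags imitating the synchronous parallel update.

#include <stdio.h>
#include <stdlib.h>

#define N 8

struct req { int addr, proc; };

static int cmp(const void *x, const void *y)
{
    const struct req *a = x, *b = y;
    return a->addr != b->addr ? a->addr - b->addr : a->proc - b->proc;
}

int main(void)
{
    int a[N]   = {2, 6, 7, 1, 5, 7, 1, 0};           /* the read-addresses of the example */
    int mem[N] = {10, 11, 12, 13, 14, 15, 16, 17};   /* hypothetical memory contents      */

    struct req r[N];
    int Z0[N], Z2[N], yes[N] = {0};    /* Z(*,0), Z(*,2) and the YES/NO flags Z(*,1) */

    for (int i = 0; i < N; i++) { r[i].addr = a[i]; r[i].proc = i; }
    qsort(r, N, sizeof r[0], cmp);     /* step 1 */

    /* Step 2: one representative per distinct address fetches the datum. */
    for (int i = 0; i < N; i++)
        if (i == 0 || r[i].addr != r[i - 1].addr) {
            Z0[i] = mem[r[i].addr];
            yes[i] = 1;
        }

    /* Doubling: each round copies fetched values further along the block of
       positions that requested the same address. */
    for (int step = 1; step < N; step *= 2) {
        int snap[N];
        for (int i = 0; i < N; i++) snap[i] = yes[i];
        for (int i = 0; i < N; i++)                    /* one parallel step, done serially */
            if (snap[i] && i + step < N && !snap[i + step]) {
                Z0[i + step] = Z0[i];
                yes[i + step] = 1;
            }
    }

    /* Finally route each value back to the processor that asked for it. */
    for (int i = 0; i < N; i++) Z2[r[i].proc] = Z0[i];
    for (int p = 0; p < N; p++) printf("processor %d reads %d\n", p, Z2[p]);
    return 0;
}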

EXERCISES.
5.1. Modify the CRCW Write phase of the simulation algorithm described
above (page 26) so that, whenever multiple processors attempt to write numbers
to the same simulated location, their sum is actually written.3

Theoretical Issues
5.1. Complexity Classes and the Parallel Processing Thesis. In this chapter
we will be concerned with various theoretical issues connected with parallel pro-
cessing. We will study the question of what calculations can be efficiently done
in parallel and in what sense. We present the so-called Parallel Processing Thesis of
Fortune and Wyllie — see [52]. It essentially shows that execution-time on a par-
allel computer corresponds in some sense to space (i.e., memory) on a sequential
computer. The arguments used by Fortune and Wyllie also give some insight into
why the execution time of many parallel algorithms is a power of a logarithm of the
complexity of the problem.
One of the most interesting theoretical questions that arise in this field
is whether there exist inherently sequential problems. These are essentially
computations for which it is impossible to find parallel algorithms that are
substantially faster than the fastest sequential algorithms. This is a subtle
question, because there are many problems that appear to be inherently
sequential at first glance but have fast parallel algorithms. In many cases the fast
parallel algorithms approach the problem from a completely different angle than
the preferred sequential algorithms. One of the most glaring examples of this is the
problem of matrix inversion, where:
1. the fastest sequential algorithm (i.e., a form of Gaussian Elimination) only
lends itself to a limited amount of parallelization (see the discussion be-
low, on page 41);
2. the (asymptotically) fastest parallel algorithm would be extremely bad
from a sequential point of view.
This should not be too surprising — in many cases the fastest sequential algo-
rithms are the ones that reduce the amount of parallelism in the computations to
a minimum.
First it is necessary to make precise what we mean by a parallel algorithm
being substantially faster than the corresponding sequential algorithm. Here are
some of the algorithms that have been considered so far:
1. Forming cumulative sums of n numbers. The sequential algorithm has an
execution time of O(n). The parallel algorithm has an execution time of
O(lg n) using O(n) processors;
2. Sorting n numbers by performing comparisons. The best sequential algo-
rithms have an asymptotic execution time of O(n lg n). The best parallel

3This situation actually arises in existing parallel computers — see the description of census-
operations on the CM-2 on page 103.
algorithms have asymptotic execution times of O(lg n) using O(n) processors — see Chapter 5, § 1.3 (page 165);
3. Inversion of an n × n non-sparse matrix. The best sequential algorithms
use Gaussian Elimination and have an execution time of O(n3 ). The
asymptotically fastest known parallel algorithms have an execution time
of O(lg^2 n) using n^2.376 processors.
The general pattern that emerges is:
• we have a sequential algorithm that executes in an amount of time that
is bounded by a polynomial function of the input-size. The class of such
problems is denoted P;
• we have parallel algorithms that execute in an amount of time that is
bounded by a polynomial of the logarithm of the input-size, and use a num-
ber of processors bounded by a polynomial of the input size. The class of
these problems is denoted NC;
As has been remarked before, NC ⊆ P — any algorithm for a problem in NC can
be sequentially simulated in an amount of time that is bounded by a polynomial
function of the original input.
Our question of whether inherently sequential problems exist boils down to
the question of whether there exist any problems in P \ NC — or the question of
whether NC = P.
As of this writing (1991) this question is still open. We will discuss some partial
results in this direction. They give a natural relationship between parallel execu-
tion time and the amount of RAM required by sequential algorithms. From this we
can deduce some rather weak results regarding sequential execution time.
It is first necessary to define the complexity of computations in a fairly rigor-
ous sense. We will consider general problems equipped with
• An encoding scheme for the input-data. This is some procedure, chosen
in advance, for representing the input-data as a string in some language
associated with the problem. For instance, the general sorting problem
might get its input-data in a string of the form { a1 , a2 , . . . }, where the ai
are bit-strings representing the numbers to be sorted.
• A complexity parameter that is proportional to the size of the input-string.
For instance, depending on how one defines the sorting problem, the com-
plexity parameter might be equal to the number of items to be sorted, or
the total number of symbols required to represent these data-items.
These two definitions of the sorting problem differ in significant
ways. For instance if we assume that inputs are all distinct, then it
requires O(n lg n) symbols to represent n numbers. This is due to the
fact that lg n bits are needed to count from 0 to n − 1 so (at least) this
many bits are needed to represent each number in a set of n distinct
numbers. In this case, it makes a big difference whether one defines
the complexity-parameter to be the number of data-items or the size
(in bits) of the input. If (as is usual) we assume that all inputs to a
sorting-algorithm can be represented by a bounded number of bits, then
the number of input items is proportional to the actual size of the input.
Incidentally, in defining the complexity-parameter of a problem to be the size of
the string containing the input-data, we make no assumptions about how this input
string is processed — in particular, we do not assume that it is processed sequentially.

FIGURE 2.4. Tape in a Turing Machine
Having defined what we mean by the size of the input, we must give a rigor-
ous definition of the various parameters associated with the execution of an algo-
rithm for solving the problem like running time, memory, and number of proces-
sors. This is usually done with a Turing Machine.
A Turing Machine is a kind of generalization of a finite state automaton — a fictitious computer for which it is easy to define the number of steps
required to perform a computation. It consists of the following elements:
1. A Tape, shown in figure 2.4.
This tape is infinite in length and bounded at one end. It is divided
into cells that may have a symbol marked on them or be blank. Initially at
most a finite number of cells are nonblank.
2. A Control Mechanism, which consists of:
• Q = a finite set of states;
• Γ= set of tape symbols, called the alphabet of the Turing machine;
• B=blank symbol;
• Σ= input symbols;
• δ=next move function:Q × Γ → Q × Γ × { L, R};
• q0 =start state;
• F ⊆ Q = final states.
In each step the machine reads a symbol from the tape, changes state, optionally
writes a symbol, and moves the tape head left (L) or right (R) one position — all
of these actions are encoded into the next move function. There are a number of
auxiliary definitions we make at this point:
DEFINITION 5.3. Suppose we have a Turing machine T. Then:
1. An input string, S, is accepted by T if T ends up in one of its stop-states after
executing with the string as its original tape symbols.
2. The set, L( T ), of input strings accepted by T is called the language rec-
ognized by T. The reader might wonder what all of this has to do with
performing computations.
3. If a string, S, is recognized by T, the symbols left on the tape after T has
stopped will be called the result of T performing a computation on S. Note
that this is only well-defined if T recognizes S.
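A machine of this kind is easy to simulate by table lookup. The C sketch below runs a made-up two-state machine that merely rewrites every 0 as a 1 and accepts when it reaches the blank at the right end of its input; it is not the sorting machine of table 2.1 below, and is included only to show the next-move function δ driving the tape head.

#include <stdio.h>
#include <string.h>

#define TAPE_LEN 64
#define N_STATES 2
#define N_SYMS   3                 /* '0', '1' and the blank */

struct move { char write; int dir; int next; };    /* dir: +1 = right, -1 = left */

/* delta[state][symbol]; symbol index 0 means '0', 1 means '1', 2 means blank.
   State 1 is the single final (accepting) state; its row is never used. */
static const struct move delta[N_STATES][N_SYMS] = {
    { {'1', +1, 0}, {'1', +1, 0}, {'B', -1, 1} },
    { {'B', -1, 1}, {'B', -1, 1}, {'B', -1, 1} },
};

static int sym_index(char c) { return c == '0' ? 0 : c == '1' ? 1 : 2; }

int main(void)
{
    char tape[TAPE_LEN];
    memset(tape, 'B', sizeof tape);
    memcpy(tape, "0100110", 7);              /* the input string */

    int state = 0, head = 0;
    while (state != 1) {                     /* run until the final state is reached */
        const struct move *m = &delta[state][sym_index(tape[head])];
        tape[head] = m->write;
        head += m->dir;
        state = m->next;
    }
    printf("result: %.7s\n", tape);          /* prints 1111111 */
    return 0;
}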
It can be shown that any computation that can be done on a conventional se-
quential computer can also be done on a suitable Turing machine (although in
much more time). It is also well-known that any computation that can be done in
polynomial time on a conventional computer can be done in polynomial time on a
Input Symbols
States space 0 1
Write 1,
Move left Move right,
1 (Start state) move left,
Go to state 4 go to state 1
go to state 2
Move right, Move left, Move left,
2
go to state 3 go to state 1 go to state 1
Write 0,
3 Move right move right,
go to state 1
Move right,
4 Move left Move left
go to state 5
5 Stop
TABLE 2.1. Actions of a Turing machine for sorting a string of 0’s
and 1’s

Turing machine. Thus a Turing machine presents a simple model of computation


that can be analyzed theoretically.
EXAMPLE 5.4. Here is a Turing machine whose language consists of arbitrary
finite strings on two symbols {0, 1} (in other words, it accepts all such strings). It
has five states and its actions are described by table 2.1.
Careful examination of the action of this Turing Machine shows that it sorts its
input string.
We will also need a variation on this form of Turing machine. An offline Turing
machine is defined to be like the one above except that there are three tapes. The
first one is a read-only tape that contains the input string. The second is like the
tape in the definition above but it is required to be initially blank. The third is a
write-only tape — it receives the output of the computation. Note that this form of
Turing machine clearly distinguishes input from output, unlike the general Turing
machines. It is possible to show that offline Turing machines are essentially equiv-
alent to general Turing machines in the sense that any language recognized by a
general Turing machine can also be recognized by a suitable offline Turing ma-
chine. Furthermore, the computations of the two types of Turing machines agree
under this equivalence. Clearly, if we are only interested in what computations
can be computed by a Turing machine, there is no need to define offline Turing
machines — we only define them so we can rigorously define the memory used by
an algorithm. See [73] for more information on Turing machines.
DEFINITION 5.5. The space required to solve a problem on an offline Turing
machine is defined to be the number of cells of the second tape that was used in
the course of computation when the problem was solved.
DEFINITION 5.6. 1. A problem with complexity parameter n will be said to
be in the class T(n)-space if there exists an offline Turing machine that uses space
T( n ).
2. A problem will be said to be in Plogspace(k) if there exists an offline Turing
machine that solves the problem using space that is O(lg^k n).
It is well-known that Plogspace(1) ⊂ P, i.e. any problem that can be solved on an offline Turing machine in space proportional to the logarithm of the complexity-
parameter can be solved in polynomial time on a conventional Turing machine.
This is not hard to see — if the total amount of RAM used by a sequential algorithm
is c lg n then the total number of possible states (or sets of data stored in memory)
is a power of n.
The converse question is still open — and probably very difficult. In fact it is
not known whether every problem in P is in Plogspace(k) for any value of k.
One question that might arise is: “How can problems with an input-size of n
use an amount of RAM that is less than n?” The answer is that we can use an ineffi-
cient algorithm that doesn’t store much of the input data in RAM at any one time.
For instance it turns out that sorting n quantities by comparisons is in Plogspace —
we simply use a kind of bubblesort that requires a great deal of time to execute but
very little RAM. It turns out that essentially all of the typical algorithms discussed
(for instance) in a course on algorithms are in Plogspace.
This will be our model of sequential computation. Our model for parallel com-
putation will be somewhat different from those described in chapter 4. It is a
MIMD form of parallel computation:
We will have a large (actually infinite) number of independent processors that
can access common RAM and can execute the following instructions:
LOAD op
STORE op
ADD op
SUB op
JUMP label
JZERO label
READ reg#
FORK label
HALT

Here an operand can be either: an address; an indirect address; or a literal. Initially input is placed in input registers that can be accessed via the READ instruc-
tion. Each processor has one register called its accumulator. Binary arithmetic
instructions use the accumulator as the first operand and store the result in the
accumulator. New processors are introduced into a computation via the FORK
instruction which, when executed by processor i:
1. activates the next available processor — say this is processor j;
2. copies the accumulator of processor i to processor j;
3. and makes processor j take its first instruction from label.
When a processor executes a HALT instruction it stops running (and re-enters the
pool of processors available for execution by FORK instructions). Execution of a
program continues until processor 0 executes a HALT instruction, or two or more
processors try to write to the same location in the same program step.
One processor can initiate two other processors in constant time via the FORK
command. It can pass (via its accumulator) the address of a block of memory
containing parameters for a subtask. It follows that a processor can initiate a tree of
n other processors in O(lg n)-time.
Processors have no local memory — they use the global memory available
to all processors. It is not hard to “localize” some of this memory via a suitable
allocation scheme. For instance, suppose a given processor has every kth memory
location allocated to it (as local memory). When this processor initiates a new
processor, it can allocate local memory to the new processors from its own local
memory. It can allocate every pi-th memory location (of its local memory) to the ith processor it directly initiates, where pi is the ith prime number.
Note that this is a very generous model of parallel computation — it is much
more powerful than the Connection Machine, for instance.
We are in a position to give a rigorous definition of the term NC:
DEFINITION 5.7. A problem with complexity parameter n is in the class NC if there exists a parallel algorithm on the computer described above, that executes in time O(lg^k n) and uses O(n^k′) processors, where k and k′ are two integers ≥ 0.
Our first result is:
THEOREM 5.8. For T(n) ≥ lg n:

∪_{k=1}^{∞} T(n)^k-time-P-RAM = ∪_{k=1}^{∞} T(n)^k-space

In particular,

∪_{k=1}^{∞} lg^k n-time-P-RAM = ∪_{k=1}^{∞} lg^k n-space

The proof consists in the following two lemmas:


LEMMA 5.9. Let L be a language accepted by a deterministic T(n)-space bounded
Turing machine M, for T(n) ≥ lg n. Then L is accepted by a deterministic cT(n)-time
bounded P-RAM P, for some constant c.
In the theory of Turing machines, acceptance of a language can be regarded as
performance of a computation.
P ROOF. We will simulate the behavior of M by P. Initially our simulation will
assume that the value of T(n) is available at the beginning of the computation
(possibly an unrealistic assumption) and then we will show how to conduct the
simulation in such a way that this assumption is removed.
Given T(n), P constructs a directed graph representing all possible configura-
tions of M during the computation. We will regard a configuration of M as a state
of M together with the data required to compute the next state. This consists of:
1. the data in the memory tape;
2. a pointer to a position of the input tape;
Statement 1 above results in 2^T(n) possible configurations (since the memory tape has T(n) slots for data) and since the parameter defining the original problem’s size is n we assume O(n) possible positions on the input tape. The total number of possible configurations is thus wn2^T(n), where w is some constant. This can be regarded as 2^(T(n)+lg n+lg w) ≤ 2^(dT(n)), for some constant d depending on M. Leaving
each node of the graph will be a single edge to the node of its successor configu-
ration. Accepting configuration-nodes of M are their own successors. Thus there
5. RELATIONS BETWEEN PRAM MODELS 35

exists a path from the initial configuration node to an accepting node if and only if
M accepts its input within T(n)-space.
To build the graph, P first initiates 2^(dT(n)) processors in O(T(n))-steps, each
holding a different integer, representing a different configuration of M. Each pro-
cessor then, in O(T(n))-time:
1. unpacks its configuration integer (we assume this integer contains an en-
coded representation of a state of the Turing machine);
2. computes the successor configuration (simulating the behavior of the Tur-
ing machine when it is in this state);
3. packs this successor configuration into integer form.
The graph is stored in global memory and the parallel computer then determines
whether there exists a path connecting the initial node to some accepting node.
This is done as follows: Each processor computes the successor of its successor
node and stores this as its immediate successor. In k steps each processor will
point to the 2^k-th successor node and in O(T(n))-steps the question of whether the language will be accepted will be decided because the successor nodes are all a distance of at most 2^(dT(n)) from the start node.
As pointed out above this algorithm has the unfortunate drawback that it re-
quires knowledge of T(n) before it can be carried out. This can be corrected as
follows: At the beginning of the simulation processor 0 starts other processors out
that assume that the value of T(n) is 1, 2, . . . , respectively. These processors carry
out the simulation above, using these assumptions and if any of them accepts the
language, then processor 0 accepts also. We assume that memory is allocated so
that the areas of memory used by the different simulations are disjoint — this can be accomplished by adding c·2^(dT(n)) to the addresses of nodes for the simulation of a given value of T(n). □
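The successor-doubling step at the heart of this proof can be illustrated directly. In the C fragment below the successor table is made up: each configuration has one successor and the accepting configuration is its own successor, as in the graph built above. After each round every entry points twice as far along its path, so lg of the number of configurations rounds decide reachability; the serial inner loop stands in for one parallel step.

#include <stdio.h>

#define N 8                     /* number of configurations in this toy example */

int main(void)
{
    /* succ[v] is the unique successor configuration of v (made-up data);
       configuration 7 is accepting and is its own successor. */
    int succ[N] = {3, 2, 2, 5, 1, 6, 7, 7};
    int start = 0, accept = 7;

    /* Pointer doubling: after round k, succ[v] is v's 2^k-th successor. */
    for (int round = 1; round < N; round *= 2) {
        int next[N];
        for (int v = 0; v < N; v++) next[v] = succ[succ[v]];   /* one parallel step */
        for (int v = 0; v < N; v++) succ[v] = next[v];
    }

    printf("input %s accepted\n", succ[start] == accept ? "is" : "is not");
    return 0;
}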

LEMMA 5.10. Let L be accepted by a deterministic T(n)-time bounded P-RAM. Then L is accepted by a T(n)^2-space bounded Turing machine.

P ROOF. We will first construct a nondeterministic T(n)2 -space bounded Turing


Machine accepting L. We will then show how to make it deterministic. Recall that
a nondeterministic Turing machine is one that may take many different actions in a
given step — imagine that the machine “splits” into many different machines, each
taking a different action and attempting to complete the computation using that
alternate. The language is ultimately accepted if any one of these split machines
accepts it.
In order to determine whether the P-RAM accepts its input, the Turing Ma-
chine needs to:
1. know the contents of processor 0’s accumulator when it halts; and
2. to verify that no two writes occur simultaneously into the same global
memory location.
The simulation is based on a recursive procedure ACC which checks the contents
of a processor’s accumulator at a particular time. By applying the procedure to
processor 0 we can determine if the P-RAM accepts the language. ACC will check
that at time t, processor j executed the ith instruction of its program leaving c in its
accumulator. In order to check this, ACC needs to know
1. the instruction executed by processor j at time t − 1 and the ensuing contents of its accumulator, and
2. the contents of the memory locations referenced by instruction i.

ACC can nondeterministically guess 1 and recursively verify it. To determine 2,


for each memory location, m, referenced ACC guesses that m was last written
by some processor k at time t′ < t. ACC can recursively verify that processor k did a STORE of the proper contents into m at time t′. ACC must also check that no other processor writes into m at any time between t′ and t. It can do this by
guessing the instructions executed by each processor at each such time, recursively
verifying them, and verifying that none of the instructions changes m. Checking
that two writes do not occur into the same memory location at the same time
can be done in a similar fashion. For each time step and each pair of processors,
ACC nondeterministically guesses the instructions executed, recursively verifies
them, and checks that the two instructions were not write-operations into the same
memory location. The correctness of the simulation follows from the determinism
of the P-RAM. In general, each instruction executed by the P-RAM will be guessed
and verified many times by ACC. Since the P-RAM is deterministic, however,
there can be only one possible instruction that can be executed by a processor in a
program step — each verified guess must be the same. Now we analyze the space
requirements:
Note that there can be at most 2^T(n) processors running on the P-RAM af-
ter T(n) program steps so writing down a processor number requires T(n) space.
Since addition and subtraction are the only arithmetic operators, numbers can in-
crease by at most one bit each step. Thus, writing down the contents of the ac-
cumulator takes at most T(n) + lg n = O(T(n)) space. Writing down a time step
takes lg T(n) space and the program counter requires only constant space. Hence
the arguments to a recursive call require O(T(n)) space. Cycling through time
steps and processor numbers to verify that a memory location was not overwrit-
ten also only takes T(n) space, so that the total space requirement at each level of
the recursion is O(T(n)). Since there can be at most T(n) levels of recursion (one
for each time step of the program on the parallel machine) the total space require-
ment is O(T(n)^2). Note that this simulation can be performed by a deterministic
Turing Machine — in each step, simply loop through all of the possibilities when
performing the recursive calls to ACC — i.e., all instructions in the instruction set;
all prior times, etc.
This requires that we modify the ACC procedure slightly so that it returns
information on whether a given guess was correct. This could be accomplished
by having it return either the value of the accumulator, if a given processor really did execute a given instruction at a given time, or NULL, indicating that the guess was wrong. This increases the execution time of the algorithm tremendously but has no effect on the space requirement. □

EXERCISES.
5.2. Show that the example of a Turing machine given in 5.4 on page 32 is in
Plogspace. Do this by explicitly transforming it into an offline Turing machine.

5.2. P-Completeness and Inherently Sequential Problems. The previous section provided some important definitions and showed that problems were
solvable in polylogarithmic time on a parallel computer if and only if they were
solvable in polylogarithmic space on an offline Turing machine. In the present
section we will apply these results to study the question of whether there exist
inherently sequential problems. Although this question is still (as of 1991) open,
we can study it in a manner reminiscent of the theory of NP-complete problems.
As is done there, we restrict our attention to a particular class of problems known
as decision problems. These are computations that produce a boolean 0 or 1 as
their result. It will turn out that there are reasonable candidates for inherently
sequential problems even among this restricted subset.
DEFINITION 5.11. A decision problem is a pair { T, L} where:
1. T recognizes all strings made up of the basic alphabet over which it is
defined.
2. L is the set of strings over this alphabet for which the computation that T
performs results in a 1.
For instance, a decision-problem version of the problem of sorting n numbers
would be the problem of whether the kth element of the result of sorting the n
numbers had some given value.
We will be able to give likely candidates for inherently sequential problems by
showing that there exist problems in P with the property that, if there exists an NC
algorithm for solving them, then P = NC. We need the very important concept of
reducibility of decision-problems:
DEFINITION 5.12. Let P1 = { T1 , L1 } and P2 = { T2 , L2 } be two decision-
problems and let Σ1 and Σ2 be the respective alphabets for the Turing machines
that resolve these decision-problems. Then P1 will be said to be reducible to P2 ,
denoted P1 ∝ P2 , if there exists a function f : Σ1∗ → Σ2∗ such that:
• a string s ∈ L1 if and only if f (s) ∈ L2 ;
• the function f is in NC — in other words, there exists a parallel algorithm for computing f that requires O(lg^k n) time-units and O(n^k′) processors, where k and k′ are some integers, and n is the complexity parameter for P1 .
P1 will be said to be logspace reducible to P2 , denoted P1 ∝logspace P2 , if the con-
ditions above are satisfied, and in addition, the algorithm computing the function
f is in logspace. In other words, there must exist an offline Turing machine using
logspace, that computes f . The results of the previous section show that strong
reducibility is equivalent to reducibility with the exponent k equal to 1.
Note that P1 ∝ P2 implies that an NC algorithm for P2 gives rise to a similar
algorithm for P1 : if we want to decide whether a string s ∈ Σ1∗ results in a 1 when
executed on T1 , we just apply the function f (which can be done via an NC algo-
rithm) and execute T2 on the result. If there exists an NC algorithm for P2 , P1 ∝ P2
implies that, in some sense, the decision problem P1 is solvable in polylogarithmic
time with a polynomial number of processors:
1. Convert an input-string of P1 into a corresponding input-string of P2 via
the transformation-function f , in P1 ∝ P2 . The computation of this function
can be done via an NC algorithm.
2. Execute the NC algorithm for P2 .
As in the theory of NP completeness, we distinguish certain problems in P that are
the “hardest”:
DEFINITION 5.13. A problem Z in P will be called P-complete if it has the prop-
erty that:
For every problem A ∈ P, A ∝ Z.
The problem Z will be called logspace-complete for P or strongly P-complete if:
For every problem A ∈ P, A ∝logspace Z.
The first problem proved to be P-complete was the problem of “directed forest
accessibility” — see [32].
We will conclude this section with an example of a problem that is known to
be strongly P-complete — i.e., logspace complete for P, as defined above. It is
fairly simple to state:
EXAMPLE 5.14. Circuit Value Problem.
• Input: A list L = { L1 , . . . , Ln } of n terms (where n is some number > 0), where each term is either:
1. A 0 or a 1, or;
2. A boolean expression involving strictly lower-numbered terms — for
instance, we might have L5 = ( L2 ∨ L4 ) ∧ ( L1 ∨ L3 ).
• Output: The boolean value of Ln . Note that this is a decision problem.
Note that the requirement that boolean expressions involve strictly
lower-numbered terms means that the first term must be a 0 or a 1.
This problem is trivial to solve in n steps sequentially: just scan the list from left
to right and evaluate each term as you encounter it. It is interesting that there is no
known NC algorithm for solving this problem. Ladner proved that this problem
is logspace complete for P, or strongly P-complete — see [97].
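The left-to-right scan is short enough to write down. The C sketch below evaluates a term list of the kind just described; the particular constants in main() are made up, and the final term encodes (L2 ∨ L4) ∧ (L1 ∨ L3) from the example, with indices shifted to start at 0.

#include <stdio.h>

/* Each term is a constant or a boolean operation on strictly earlier terms. */
enum op { CONST, AND, OR, NOT };

struct term {
    enum op kind;
    int value;          /* used when kind == CONST                        */
    int left, right;    /* indices of earlier terms (right unused by NOT) */
};

/* Sequential circuit-value evaluation: one left-to-right scan (n <= 64 here). */
static int circuit_value(const struct term *L, int n)
{
    int val[64];
    for (int i = 0; i < n; i++)
        switch (L[i].kind) {
        case CONST: val[i] = L[i].value;                        break;
        case AND:   val[i] = val[L[i].left] && val[L[i].right]; break;
        case OR:    val[i] = val[L[i].left] || val[L[i].right]; break;
        case NOT:   val[i] = !val[L[i].left];                   break;
        }
    return val[n - 1];
}

int main(void)
{
    struct term L[] = {
        {CONST, 0, 0, 0},   /* L1 = 0 (made-up input values) */
        {CONST, 1, 0, 0},   /* L2 = 1 */
        {CONST, 1, 0, 0},   /* L3 = 1 */
        {CONST, 0, 0, 0},   /* L4 = 0 */
        {OR,    0, 1, 3},   /* L2 OR L4 */
        {OR,    0, 0, 2},   /* L1 OR L3 */
        {AND,   0, 4, 5},   /* (L2 OR L4) AND (L1 OR L3) */
    };
    printf("circuit value = %d\n", circuit_value(L, 7));
    return 0;
}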
LEMMA 5.15. The circuit value problem, as defined above, is logspace-complete for
P.
P ROOF. Let CVP denote the Circuit Value Problem. We must show, that if Z
is any problem in P, then Z ∝logspace CVP. We will assume that Z is computed by
a Turing machine T that always halts in a number of steps that is bounded by a
polynomial of the complexity parameter of the input. We will construct a (fairly
large) circuit whose final value is always the same as that computed by T. We will
also show that the construction can be done in logspace. We will assume:
• The number of characters in the input-string to T is n;
• The execution-time of T is E = O(n^k), where k is some integer ≥ 1.
• During the course of the execution of T, the number of states that it ever
passes through is O(n^k), since it stops after this number of steps. The
binary representation of a state-number of T clearly requires O(lg n) bits.
• The number of tape-symbols that are nonblank at any step of the algo-
rithm is O(n^k). This is because the tape started out with n nonblank (in-
put) symbols, and at most O(n^k) write-operations were executed while T
was active.
• The transitions of T are encoded as a large set of boolean formulas, com-
puting:
1. the bits of the next state-number from the bits of the current state-
number, the bits of the current input-symbol.
2. whether the tape is moved to the left or the right.
3. whether anything is written to the tape in this step (and what is writ-
ten).
Although it might seem that this assumption is an attempt to make life
easy for ourselves, it is nevertheless a reasonable one. The transition func-
tion may well be encoded as such a sequence of boolean formulas. If it
is simply represented as a table like that on page 32, it is fairly trivial to
convert this table into a series of boolean formulas representing it.
We will build a large circuit that is the concatenation of sublists { L1 , . . . , L E }, one
for each program-step of the execution of T. The list L1 consists of bits representing
the input-data of T, and a sequence of bits representing its start-state. In general Li
will consist of a concatenation of two sublists Si and Qi .
• the sublist Si consists of a sequence of boolean formulas computing the
bits of the new state-number of T in program-step i. These formulas use
the bits of Si−1 and Qi−1 as inputs.
• Qi lists the bits of the nonblank tape symbols in program-step i (of T). For
tape-positions that are not written to in step i, the corresponding entries in
Qi simply copy data from corresponding positions of Qi−1 . In other words,
these formulas are trivial formulas that are just equal to certain previous
terms in the giant list. The entries of Qi that represent tape-symbols that
are written to in program-step i have boolean formulas that compute this
data from:
1. The bits of the current state-number, computed by the boolean formu-
las in Si .
2. Bits of Qi−1 , representing the tape in the previous program-step.
Now we consider how this conversion of the input data and the
transition-function of T into the Li can be accomplished:
1. We must be able to count the number of program-steps of T so that we will
know when to stop generating the Li . Maintaining such a counter requires
O(lg n) memory-locations in the offline Turing machine that converts T
into the circuit.
2. We must be able to copy formulas from rows of the table representing the
transition-function of T into entries of the circuit we are generating. This
requires looping on the number of bits in these boolean formulas. Again,
we will need O(lg n) memory-locations.
3. We must keep track of the current location of the read-write head of T.
This tells us how to build the formulas in Qi that represent tape-symbols
that are written to in program-step i. Since the total number of such tape-
positions is O(n^k), we will need O(lg n) bits to represent this number. We
increment this number every time the transition-function calls for move-
ment to the right (for instance) and decrement it every time we are to move
to the left.
The offline Turing machine will contain this memory, and its input program. The
total memory used will be O(lg n), so the transformation from T to the Circuit
Value Problem will be logspace. □

Many problems are known to be P-complete. We will give a list of a few of the
more interesting ones.
• The Monotone Circuit Problem This is a circuit whose only operations
are ∨ and ∧. Goldschlager gave a logspace reduction of the general circuit-
value problem to this — see [59]. The Planar Circuit Value Problem is also
P-complete. A planar circuit is a circuit that can be drawn on a plane
without any “wires” crossing. It is interesting that the Monotone, Planar
Circuit Value problem is in NC — see [57].
• Linear Inequalities The input to this problem is an n × d integer-valued
matrix A, and an integer-valued n × 1 vector c. The problem is to answer
the question of whether there exists a rational-valued vector x such that
Ax ≤ c
In 1982, Cook found the following reduction of the Circuit Value Problem
to the Linear Inequality Problem:
1. If input xi (of the Circuit Value Problem) is True (respectively False), it is represented by the equation xi = 1 (respectively xi = 0).
2. A NOT gate with input u and output w, computing w = ¬u, is represented by the inequalities
   w = 1 − u, 0 ≤ w ≤ 1
3. An OR gate with inputs u and v, and output w is represented by the inequalities
   0 ≤ w ≤ 1, u ≤ w, v ≤ w, w ≤ u + v
4. An AND gate with inputs u and v, and output w is represented by the inequalities
   0 ≤ w ≤ 1, w ≤ u, w ≤ v, u + v − 1 ≤ w
This is interesting because this decision problem is a crucial step in per-
forming Linear Programming. It follows that Linear Programming is P-
complete. It is interesting that Xiaotie Deng has found an NC algorithm
for planar Linear Programming. This is linear programming with only two
variables (but, perhaps, many inequalities) — see [46] for the algorithm.
• Gaussian elimination This is the standard sequential algorithm for
solving a system of simultaneous linear equations or inverting a matrix.
Among other things, it involves clearing out rows of a matrix at certain
pivot points, which are computed as the algorithm executes. The paper
[165], by Stephen Vavasis proves that the problem of deciding whether
a given element will be a pivot-element in the Gaussian elimination
algorithm is P-complete. This implies that fast parallel algorithms (i.e.,
NC algorithms) cannot be based upon Gaussian elimination. This
conclusion is interesting because many people have worked on the
problem of parallelizing Gaussian elimination by trying to exploit certain
parallelisms in the problem. This paper implies that none of this work
will ever result in an NC algorithm for solving linear equations. See § 1 in
chapter 5 for an example of an NC algorithm for matrix inversion. This
algorithm is not based upon Gaussian elimination.
• Maximum flows This is an important problem in the theory of networks.
In order to understand it, imagine a network of pipes with the property
that each pipe in the network has a certain carrying capacity — a maxi-
mum rate at which fluid can flow through it. Suppose that one point
of the network is a source of fluid and some other point is where fluid
leaves the network. The maximum flow problem asks the question “What
is the maximum rate fluid can flow from the source to the exit?”. This
must be computed from characteristics of the network, including the car-
rying capacities of the pipes and how they are arranged. The paper [60],
by L. Goldschlager, L. Shaw and J. Staples proves that this problem is
Plogspace-complete.
• Inference problem for multivalued dependencies This is a problem from
the theory of databases. Essentially, a large part of the problem of de-
signing a database consists in identifying dependencies in the data: as-
pects of the data that determine other aspects. A multivalued dependency
represents a situation where one aspect of the data influences but doesn’t
completely determine another.4 These multivalued dependencies imply
the existence of other multivalued dependencies among aspects of data.
The problem of determining these other multivalued dependencies from the
given ones turns out to be P-complete — see [45].

5.3. Further reading. Greenlaw, Hoover, and Ruzzo have compiled a large
list of P-complete problems — see [61]. This list will be periodically updated to
incorporate newly discovered P-complete problems.
In [171] Wilson shows that the class NC has a “fine-structure” — it can be
decomposed into subclasses that have natural descriptions.
In [31] Cook gives a detailed description of parallel complexity classes and
how they relate to each other. Also see [30].
General Principles of Parallel Algorithm Design
In this section we discuss some general principles of algorithm-design. Later
chapters in this book will explore these principles in more detail. We have already
seen some of these concepts in the Introduction.

4There is more to it than this, however. See a good book on database design for more information.
5.4. Brent’s Theorem. Brent’s Theorem makes precise some of the heuristic
arguments in the introduction relating computation networks with time required
to compute something in parallel.
We will need a rigorous definition of a combinational network, or computa-
tional network. As with comparator networks defined in § 2 we consider a kind
of “circuit” whose “wires” can transmit numerical data and nodes that can modify
the data in prescribed ways.5

DEFINITION 5.16. A computational network is a directed acyclic6 graph. We will assume that the edges of this graph “transmit” data like sorting networks. The ver-
tices of a computational network are subdivided into three sets:
Input vertices These vertices have no incoming edges.
Output vertices These vertices have no outgoing edges.
Interior vertices These vertices have incoming edges and one or more out-
going edges. All of the outgoing edges of an interior vertex pass the same
data.
Each interior vertex is labeled with an elementary operation (to be discussed be-
low). The number of incoming edges to an interior vertex is called its fan-in. The
number of outgoing edges is called the fan-out. The maxima of these two quanti-
ties over the entire graph are called, respectively, the fan-in and the fan-out of the
graph.
The length of the longest path from any input vertex to any output vertex is
called the depth of the computational network. The computation performed by
a computation network, on a given set of inputs is defined to be the data that
appears on the output vertices as a result of the following procedure:
1. Apply the input data to the input vertices.
2. Transmit data along directed edges. Whenever an interior vertex is en-
countered, wait until data arrives along all of its incoming edges, and then
perform the indicated elementary computation. Transmit the result of the
computation along all of the outgoing edges.
3. The procedure terminates when there is no data at interior vertices.
Each edge of the computational network is assumed to be able to transmit a
number of a certain size that is fixed for the rest of this discussion (a more detailed
consideration of these networks would regard the carrying capacity of each wire
to be a single bit — in this case the magnitudes of the numbers in question would
enter into complexity-estimates). The elementary operations are operations that
can be carried out on the numbers transmitted by the edges of the network within
a fixed, bounded, amount of time on a RAM computer. As before, a more detailed
discussion would define an elementary operation to be an elementary boolean
operation like AND, OR, or NOT. In our (higher level) discussion, elementary op-
erations include +, −, min, max, ∗, /, and many more.
Now we are in a position to state Brent’s Theorem:

5These are, of course, precisely the properties of the electronic circuits used in a computer. We do
not want to become involved with issues like the representation of arithmetic operations in terms of
boolean logic elements.
6“Acyclic” just means that the graph has no directed loops.
THEOREM 5.17. Let N be a computational network with n interior nodes and depth d, and bounded fan-in. Then the computations performed by N can be carried out by a CREW-PRAM computer with p processors in time O(n/p + d).

The total time depends upon the fan-in of N — we have absorbed this into the
constant of proportionality.

P ROOF. We simulate the computations of N in a fairly straightforward way.


We assume that we have a data-structure in the memory of the PRAM for encoding
a vertex of N, and that it has a field with a pointer to the vertex that receives
the output of its computation if it is an interior vertex. This field is nil if it is an
output vertex. We define the depth of a vertex, v, in N to be the maximum length
of any path from an input vertex to v — clearly this is the greatest distance any
amount of input-data has to travel to reach v. It is also clear that the depth of N
is the maximum of the depths of any of its vertices. We perform the simulation
inductively — we simulate all of the computations that take place at vertices of
depth ≤ k − 1 before simulating computations at depth k. Suppose that there
are ni interior vertices of N whose depth is precisely i. Then ∑_{i=1}^{d} ni = n. After
simulating the computations on vertices of depth k − 1, we will be in a position
to simulate the computations on nodes of depth k, since the inputs to these nodes
will now be available.
CLAIM 5.18. When performing the computations on nodes of depth k, the order of
the computations is irrelevant.
This is due to the definition of depth — it implies that the output of any vertex
of depth k is input to a vertex of strictly higher depth (since depth is the length of
the longest path from an input vertex to the vertex in question).
The simulation of computations at depth k proceeds as follows:
1. Processors read the data from the output areas of the data-structures for
vertices at depth k − 1.
2. Processors perform the required computations.
Since there are n_k vertices of depth k, and the computations can be performed in
any order, the execution-time of this phase is

⌈n_k/p⌉ ≤ n_k/p + 1

The total execution-time is thus

∑_{k=1}^{d} ⌈n_k/p⌉ ≤ ∑_{k=1}^{d} (n_k/p + 1) = n/p + d

□

We normally apply this to modifying parallel algorithms to use fewer proces-


sors —see § 5.5.2 below (page 45). If a computational network has bounded fan-out
as well as bounded fan-in we get:
COROLLARY 5.19. Let N be a computational network with n interior nodes and
depth d, and bounded fan-in and fan-out. Then the computations performed by N can be
carried out by an EREW-PRAM computer with p processors in time O(n/p + d).

Brent’s theorem has interesting implications for the question of work efficiency
of an algorithm.
D EFINITION 5.20. The amount of work performed by a parallel algorithm is
defined to be the product of the execution time by the number of processors.
This measures the number of distinct computations a parallel algorithm per-
forms.
We can think of a computation network with n vertices as requiring n units
of work to perform its associated computation — since there are n distinct com-
putations to be performed. The work required by a simulation of a computation
network is (by Brent’s Theorem) O(n + dp). This is proportional to the number of
vertices in the original computation network if p is proportional to n/d.
D EFINITION 5.21. Let A be a parallel algorithm that performs a computation
that can be represented by a computation network. Then A will be said to be
a work-efficient parallel algorithm if it executes in time O(d) using p processors,
where p = n/d, and n is the number of vertices in a computation network for
performing the computations of algorithm A with the smallest possible number
of vertices.
Work-efficient parallel algorithms are optimal, in some sense.

E XERCISES .

5.3. Find a computation network for evaluating the expression (x^2 + 2)(x^3 − 3) − 2x^3.
5.4. Show that sorting networks can be expressed in terms of computation
networks. This implies that they are a special case of computation networks.

5.5. SIMD Algorithms.


5.5.1. Doubling Algorithms. The program for adding up n numbers in O(lg n)
time is an example of a general class of parallel algorithms known by several dif-
ferent names:
• Parallel-prefix Operations.
• Doubling Algorithms.
In each case a single operation is applied to a large amount of data in such a
way that the amount of relevant data is halved in each step. The term “Doubling
Algorithms” is somewhat more general than “Parallel Prefix Operations”. The lat-
ter term is most often used to refer to generalizations of our algorithm for adding

n numbers — in which the operation of addition is replaced by an arbitrary asso-


ciative operation. See §1 in chapter 6 for several examples of how this can be done.

Another variation on this theme is a class of techniques used in Graph Algo-


rithms called “Pointer Jumping”. This involves assigning a processor to each node
of a directed graph and:
1. Having each processor determine the successor of its successor;
2. Causing the processors to effectively regard the nodes computed in step 1
as their new successors;
3. Go to step 1.
In this way, the end of a directed path of length n can be found in O(lg n) steps.
This is a special case of parallel prefix operations, since we can regard the edges of
the graph as the objects under study, and regard the operation of combining two
edges into “one long edge” (this is basically what the pointer jumping operation
amounts to) as an associative operation. See § 1.4 for examples of algorithms of this
type (and definitions of the terminology used in graph theory).
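The pointer-jumping step is easy to express concretely. The following is a minimal sequential sketch of list-ranking by pointer jumping; the array names succ and dist are invented for illustration, the last node of the list is assumed to point to itself, and a genuinely parallel version would execute all iterations of the inner loop simultaneously.

/* A minimal sequential sketch of pointer jumping (list-ranking), assuming
 * succ[i] holds the successor of node i (the last node points to itself)
 * and dist[i] starts at 1 for every node except the last, which starts at 0.
 * On a PRAM the body of the do-loop is one parallel step; here we copy the
 * arrays first to mimic the synchronous "read old values" semantics.       */
#include <stdio.h>
#include <string.h>

#define N 8

int main(void)
{
    int succ[N], dist[N], s[N], d[N], i, done;

    for (i = 0; i < N - 1; i++) { succ[i] = i + 1; dist[i] = 1; }
    succ[N - 1] = N - 1;  dist[N - 1] = 0;      /* end of the list          */

    do {
        memcpy(s, succ, sizeof succ);           /* snapshot = "old" values  */
        memcpy(d, dist, sizeof dist);
        done = 1;
        for (i = 0; i < N; i++) {               /* "do in parallel"         */
            dist[i] = d[i] + d[s[i]];           /* add successor's distance */
            succ[i] = s[s[i]];                  /* jump over the successor  */
            if (succ[i] != s[i]) done = 0;
        }
    } while (!done);                            /* O(lg N) iterations       */

    for (i = 0; i < N; i++)
        printf("node %d: distance to end = %d\n", i, dist[i]);
    return 0;
}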
5.5.2. The Brent Scheduling Principle. One other general principle in the design
of parallel algorithms is the Brent Scheduling Principle. It is a very simple and ingenious
idea, first described by R. Brent in [21], that often makes it possible to reduce
the number of processors used in parallel algorithms without increasing the
asymptotic execution time. In general, the execution time increases somewhat when
the number of processors is reduced, but not by an amount that increases the
asymptotic time. In other words, if an algorithm has an execution time of O(lg^k n),
then the execution-time might increase by a constant factor. In order to state this
result we must recall the concept of computation network, defined in 5.16 on page
42.
COROLLARY 5.22. Suppose algorithm A has the property that its computations can
be expressed in terms of a computation network with x vertices and depth t that has
bounded fan-in. Then algorithm A can be executed on a CREW-PRAM computer with
p processors in time O(x/p + t).

The proof is a direct application of Brent’s Theorem (5.17 on page 42).


See page 270 for some applications of this principle.
This result has some interesting consequences regarding the relationship be-
tween data-representation and execution-time of an algorithm. Consider the algo-
rithm for adding up numbers presented on page 7. Since the data is given in an
array, we can put it into any computation network we want — for instance, the
one in figure 1.8 on page 7. Consequently, the Brent Scheduling Principle states
that the algorithm on page 7 can be executed with ⌈n/lg(n)⌉ processors with no
asymptotic degradation in execution time (i.e., the execution time is still O(lg n)).
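The reduced-processor version of the summation algorithm is easy to describe explicitly. Here is a minimal sequential sketch (the array names and the power-of-two size are illustrative assumptions, not part of the text's C* programs): each of the ⌈n/lg n⌉ simulated processors first sums a block of about lg n entries, and the partial sums are then combined by the usual balanced-tree addition, so both phases take O(lg n) parallel steps.

/* Sketch: summing n numbers with p = ceil(n / lg n) simulated processors.
 * Phase 1: each processor sums a block of about lg n items sequentially.
 * Phase 2: the p partial sums are combined by the usual doubling scheme.
 * Both phases take O(lg n) parallel steps, so the total work is O(n).     */
#include <stdio.h>

#define N 1024          /* n = 2^10, so lg n = 10                          */

int main(void)
{
    static int a[N], partial[N];
    int i, j, stride, lgn = 10, p = (N + lgn - 1) / lgn;

    for (i = 0; i < N; i++) a[i] = 1;          /* test data: sum is N       */

    /* Phase 1 -- one iteration of this loop per simulated processor.      */
    for (i = 0; i < p; i++) {
        partial[i] = 0;
        for (j = i * lgn; j < (i + 1) * lgn && j < N; j++)
            partial[i] += a[j];
    }

    /* Phase 2 -- balanced-tree combination of the p partial sums.         */
    for (stride = 1; stride < p; stride *= 2)
        for (i = 0; i + stride < p; i += 2 * stride)   /* "in parallel"    */
            partial[i] += partial[i + stride];

    printf("sum = %d\n", partial[0]);          /* prints 1024              */
    return 0;
}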
If the input-data was presented to us as elements of a linked list, however, it
is not clear how we could apply the Brent Scheduling Principle to this problem.
The linked list can be regarded as a computation network of depth n, so Brent’s
Theorem would imply an execution time of O(n). We can actually get an execution
time of O(lg n) by using the technique of pointer jumping in § 5.5.1 above, but this
actually requires n processors. The parallel algorithms for list-ranking in this case
are more complicated than straightforward pointer-jumping — see [7]. Also see
chapter 7 for a simple probabilistic algorithm for this problem.

5.5.3. Pipelining. This is another technique used in parallel algorithm design.


Pipelining can be used in situations where we want to perform several opera-
tions in sequence {P_1, . . . , P_n}, where these operations have the property that some
steps of P_{i+1} can be carried out before operation P_i is finished. In a parallel algo-
rithm, it is often possible to overlap these steps and decrease total execution-time.
Although this technique is most often used in MIMD algorithms, many SIMD al-
gorithms are also able to take advantage of it. Several algorithms in this book
illustrate this principle:
• The Shiloach-Vishkin algorithm for connected components of a graph (see
page 301). Pipelining in this case is partly responsible for reducing the
execution-time of this algorithm from O(lg^2 n) to O(lg n).
• The Cole sorting algorithm (see page 349). This is, perhaps, the most strik-
ing example of pipelining in this book. This sorting algorithm is based
upon ideas like those in 3.7 on page 23, but ingeniously “choreographs”
the steps in such a way that the execution-time is reduced from O(lg^2 n)
to O(lg n).
5.5.4. Divide and Conquer. This is the technique of splitting a problem into
small independent components and solving them in parallel. There are many ex-
amples of this technique in this text:
• The FFT algorithm (at least if we consider its recursive definition) in § 2.3
of chapter 5 (page 186);
• The parallel prefix, or doubling algorithms of § 1 in chapter 6 (page 267);
• All of the algorithms for connected components and minimal spanning
trees in § 1.4 of chapter 6(page 281);
• The Ajtai, Komlós, Szemerédi sorting algorithm in § 3.7 of chapter 6 (page
365);
The reader will doubtless be able to find many other examples in this book.

5.6. MIMD Algorithms.


5.6.1. Generalities. This section is longer than the corresponding section on
SIMD algorithm design because many of the important issues in SIMD algorithm
design are dealt with throughout this book, and frequently depend upon the
problem being studied. The issues discussed in this section are fairly constant
throughout all MIMD algorithms. As is shown in § 5.7 on page 53, design of a
MIMD algorithm can sometimes entail
• the design of a good SIMD algorithm,
• conversion to a MIMD algorithm
The second step requires the computations to be synchronized, and this involves
using the material in this section.
The issues that arise in the design of MIMD algorithms are essentially identical
to those that occur in concurrent programming. The problems that occur and their
solutions are basically the same in both cases. The main difference between MIMD
algorithm design and concurrent algorithm design involves questions of when and
why one creates multiple processes:
1. When designing concurrent algorithms to run on a single-processor com-
puter, we usually create processes in order to handle asynchronous events in
situations in which we expect little real concurrency to occur. The generic

example of this is waiting for I/O operations to complete. Here we expect


one process to be dormant most of the time, but are unable to accurately
predict when it will be dormant. We usually avoid creating processes to
handle computations or other operations that will be truly concurrent, be-
cause there is no real concurrency on a single-processor machine — it is
only simulated, and this simulation has an associated cost.
2. When designing MIMD algorithms to run on a parallel computer, we try
to maximize the amount of concurrency. We look for operations that can
be carried out simultaneously, and try to create multiple processes to han-
dle them. Some of the considerations in writing concurrent programs still
apply here. Generally, it is not advisable to create many more processes
than there are processors to execute them. The overhead involved in creat-
ing processes may be fairly large. This is the problem of grain-size. This is
in contrast to SIMD algorithms, in which there is little overhead in creating
parallel threads of data and computation.
We will discuss a few of the very basic issues involved in concurrent and MIMD
algorithm design. The most basic issue is that we should analyze the computations
to be performed and locate parallelism. This can be done with a dependency graph.
Page 7 gives two examples of syntax trees. These are like dependency graphs for
arithmetic operations. The cost of creating processes on most MIMD computers
almost always makes it necessary to work on a much coarser scale.
We generally take a high-level description of the computations to be
performed, and make a directed graph whose nodes represent discrete blocks of
independent operations. The edges represent situations in which one block of
operations depends upon the outcome of performing other blocks. After this has
been done we can design the MIMD algorithm by:
1. Creating one process for each node of the graph, and make the processes
wait for the completion of other processes upon which they are depen-
dent.
2. Creating one process for each directed path through the dependency
graph, from its starting point to its end.
5.6.2. Race-conditions. However we do this, we encounter a number of impor-
tant issues at this point. If two processes try to access the same shared data, they
may interfere with each other:
Suppose two processes update a shared linked list simultaneously
— the head of the list is pointed to by a pointer-variable named
head, and each entry has a next pointer to the next entry.
Process A wants to add a record to the beginning of the list by:
A.1 making the new record’s next pointer equal to the head pointer (so it
points to the same target);
A.2 making the head pointer point to the new record;
Process B wants to delete the first record by:
B.1 Making the head pointer equal to the next pointer of its target;
B.2 Deleting the record that was originally the target of the head pointer;
Through unlucky timing, these operations could be carried out in the sequence A.1,
B.1, A.2, B.2. The result would be that the head would point to the new record
added by process A, but the next pointer of that record would point to the record

deleted by process B. The rest of the list would be completely inaccessible. This is
an example of a race-condition — the two processes are in a “race” with each other
and the outcome of the computations depends crucially on which process reaches
certain points in the code first.
Here is a program in C for the Sequent Symmetry computer that illustrates
race-conditions:

#include <stdio.h>

/* The next two include files refer to system libraries for
 * sharing memory and creating processes. */
#include <parallel/microtask.h>
#include <parallel/parallel.h>

shared int data;

void dispatch ();
void child1 ();
void child2 ();

void main ()
{
    m_set_procs (2);    /* Causes future calls to 'm_fork' to
                         * spawn two processes. */
    m_fork (dispatch);  /* This call to 'm_fork' spawns two processes,
                         * each of which runs a copy of the
                         * routine 'dispatch'. */
    exit (0);
}

void dispatch ()        /* This routine is executed in two
                         * concurrent copies. */
{
    int p = m_get_myid (); /* Each of these copies determines
                            * its identity (i.e., whether it is
                            * process number 0 or 1). */
    if (p == 0)
        child1 ();
    else
        child2 ();
}

void child1 ()          /* 'child1' contains the actual
                         * code to be executed by process 0. */
{
    int i, j;
    for (i = 0; i < 10; i++)
    {
        data = 0;
        for (j = 0; j < 500; j++);
        printf ("Child 1, data=%d\n", data);
    }
}

void child2 ()          /* 'child2' contains the actual
                         * code to be executed by process 1. */
{
    int i, j;
    for (i = 0; i < 10; i++)
    {
        data++;
        for (j = 0; j < 500; j++);
        printf ("Child 2, data=%d\n", data);
    }
}
Here two processes are generated, called child1 and child2. Since m_fork normally
generates many processes at once, we have to make it spawn both child-processes
in a single statement. This is done by making a routine named dispatch
be the child-process. Each copy of this routine calls m_get_myid to determine its
identity and calls child1 or child2 depending on the result. Note that child1 zeroes a
shared data item named data, and child2 increments it. They both then wait a short
time and print the value out. Since data is shared, though, it is possible that while
one process is waiting to print out the data, the other process can sneak in and
change its value. This actually happens, as you can see if you run this program:
cc name.c -lpps
a.out
The results are unpredictable (this is usually true with race-conditions); you
will probably never get the same set of results twice. Most of the time the output
is what one would expect, but child1 will occasionally print out values other than 0,
and child2 will sometimes print 0.
This type of problem is solved in several ways:
a. One involves locking operations that prevent more than one process from
accessing shared data (access to data that is exclusive to a single process is called
atomic). Essentially, the first process that calls the m_lock system-function continues
to execute and any other process that calls this function is suspended until
the first process calls m_unlock. If two processes call m_lock simultaneously, the
system makes a decision as to which gets priority.
Here is how the program above looks when semaphores are used to prevent
the race-condition:
#include <stdio.h>
#include <parallel/microtask.h>
#include <parallel/parallel.h>

shared int data;

void child ();
void child1 ();
void child2 ();

void main ()
{
    m_set_procs (2);
    m_fork (child);
    exit (0);
}

void child ()
{
    int p = m_get_myid ();
    if (p == 0)
        child1 ();
    else
        child2 ();
}

void child1 ()
{
    int i, j;
    for (i = 0; i < 10; i++)
    {
        m_lock ();
        data = 0;
        for (j = 0; j < 500; j++);
        printf ("Child 1, data=%d\n", data);
        m_unlock ();
    }
}

void child2 ()
{
    int i, j;
    for (i = 0; i < 10; i++)
    {
        m_lock ();
        data++;
        for (j = 0; j < 500; j++);
        printf ("Child 2, data=%d\n", data);
        m_unlock ();
    }
}

The functions m_lock() and m_unlock() are system calls available on the Sequent
line of parallel computers.
The standard term (i.e. the term you will see most often in the literature) for
a locking operation (in the theory of concurrent programming) is a semaphore. The
lock and unlock-operations are called semaphore down and semaphore up operations.
One characteristic of processes that are under the control of a lock (or
semaphore) is that the amount of speedup that is possible due to parallel

processing is limited. This is due to the fact that the semaphore forces certain
sections of code to be executed sequentially. In fact:
LEMMA 5.23. Suppose the optimal sequential algorithm for performing a computation
requires time T, and accessing a semaphore requires time k. Then an optimal parallel
version of this computation using processes under the control of a single semaphore
requires an execution time of at least O(√(T/k)).
PROOF. If we use m processors the execution time must be at least T/m (see
1.1 in chapter 2). On the other hand, since the semaphore-operations are executed
sequentially, they will require an execution time of km — i.e. the time required
to carry out the semaphore-operations increases with the number of processors.
The total execution time will be ≥ (T/m + km). The value of m that minimizes
this occurs when the derivative of this expression with respect to m vanishes. This
means that

−T/m^2 + k = 0

This occurs when m = √(T/k), and the corresponding total execution time is 2√(Tk). □
It is interesting to note that this result makes essential use of the fact that there
is a single semaphore involved — and access of this semaphore by n processes re-
quires a time of kn. Recent unpublished results of David Saunders show that it is
possible to set up a kind of tree of semaphores that will permit the synchronization
of n processes in time that is O(lg n).
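None of the details of Saunders' construction are given here, but the underlying idea of replacing the single semaphore by a tree of them can be illustrated with a simple (one-shot) tournament-style barrier. The sketch below uses POSIX threads and semaphores purely for concreteness; it is not the Sequent code used elsewhere in this section, and the names are invented. Each process waits on, and raises, only O(lg n) semaphores, rather than all n processes serializing on one.

/* Hypothetical sketch of a "tree of semaphores" used as a one-shot barrier
 * for NTHREADS = 2^K processes.  Arrival climbs a binary tournament tree
 * and release descends it, so each process touches O(lg n) semaphores.    */
#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

#define K        3
#define NTHREADS (1 << K)

static sem_t arrive[K][NTHREADS];   /* arrive[r][i]: process i reached round r */
static sem_t depart[K][NTHREADS];   /* depart[r][i]: process i may leave round r */

static void tree_barrier(int id)
{
    int r, lost_at = K;
    for (r = 0; r < K; r++) {
        if ((id & ((2 << r) - 1)) == (1 << r)) {  /* "loser" at round r      */
            sem_post(&arrive[r][id]);             /* report arrival upward   */
            sem_wait(&depart[r][id]);             /* wait to be released     */
            lost_at = r;
            break;
        }
        sem_wait(&arrive[r][id + (1 << r)]);      /* "winner" waits for loser */
    }
    for (r = lost_at - 1; r >= 0; r--)            /* wake our own losers      */
        sem_post(&depart[r][id + (1 << r)]);
}

static void *worker(void *arg)
{
    int id = *(int *) arg;
    printf("process %d before the barrier\n", id);
    tree_barrier(id);
    printf("process %d after the barrier\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int ids[NTHREADS], i, r;
    for (r = 0; r < K; r++)
        for (i = 0; i < NTHREADS; i++) {
            sem_init(&arrive[r][i], 0, 0);
            sem_init(&depart[r][i], 0, 0);
        }
    for (i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, worker, &ids[i]);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}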
b. Another solution to this problem involves synchronizing the parallel pro-
cesses. One common notation for this construct is:
cobegin
coend;
The idea here is that all processes execute the cobegin statement simultane-
ously and remain synchronized until the coend statement is reached. This solves
the problem of processes interfering with each other when they access shared data
by allowing the programmer to “choreograph” this common access. For instance,
in the sorting algorithm on the Butterfly Computer, no semaphores were used, but
processes never interfered with each other. The DYNIX operating system provides
the m_sync() system call to implement cobegin. When a process calls m_sync() it
spins (i.e. loops) until all processes call m_sync() — then all processes execute.
The operating system uses a process scheduling algorithm that causes child pro-
cesses to execute to completion without interruption (except by higher-priority
processes). Consequently, once processes have been synchronized, they remain in
sync if they execute the same instructions.
See § 1.1.1 (page 95) for information on another interesting paradigm for syn-
chronizing parallel processes.
5.6.3. Optimization of loops. The theory of semaphores gives rise to a number
of issues connected with parallelizing loops in an algorithm. Suppose we have an
algorithm that requires a number of computations to be performed in a loop, with
very little dependency between different iterations of the loop. We assume that the
loop has been divided up between several parallel processes — each process will
execute a few iterations of the loop. Data that the loop references may be divided
up into several categories:
1. Local data. This is data that is not shared — it is declared locally in each
process. There is no need to use a semaphore in accessing this data.

2. Read-only shared data. This is data that is only read by different processes.
There is no need to use a semaphore or lock to control access to it.
3. Reduction data. This is shared data that is read and written by each process, but
in a limited way. The data is used in a single associative, commutative operation by
each iteration of the loop, and is always read and then written. Although it is shared,
we do not need to use a semaphore every time we access such data. Since it is
used in an associative commutative operation, the order in which the operations
are applied is not significant. We can replace this reference variable in the loop by
a local variable in each process. Only when the data from different processes is
being combined need we use a semaphore to prevent simultaneous access. This
saves a little execution time if each process is performing many iterations of the
loop, because a semaphore inhibits parallelism. Here is an example in C — again
this program is for the Sequent Symmetry computer:
for (i = 0; i < 1000; i++)
    for (j = 0; j < 1000; j++) sum = sum + a[i][j];
Here ‘sum’ is a reduction variable. Suppose each processor performs all itera-
tions of the loop on ‘j’. Then we could replace this nested loop by:

for (i = 0; i < 1000; i++)
{
    int local_sum = 0;
    for (j = 0; j < 1000; j++)
    {
        local_sum = local_sum + a[i][j];
    }
    m_lock ();              /* Set semaphore. */
    sum = sum + local_sum;  /* Now accumulate values
                             * computed by different processors. */
    m_unlock ();            /* Release semaphore. */
}

4. Ordered data. This is shared data whose final numerical value depends
upon the iterations of the loop being carried out in precisely the same order as in
a sequential execution of the loop. A loop containing such data is not suitable for
parallelization (at least not in an asynchronous program). There are situations in
which such a loop might be contained in a much larger block of code that does lend
itself to parallelization. In this case we must guarantee that the loop in question
is executed sequentially (even if execution of different parts of the loop is done on
different processors). There are several ways this can be done: a. We can isolate
this code in a procedure and allow only one processor to call the procedure. b. We
can share the variable that describes the index of the loop (i.e. the iteration-count) and
make each processor wait for that to reach an appropriate value (a sketch of this
approach appears after this list).
Alternative a is probably the more structured solution to this problem.
5. Shared variables that are read and written, but for which the order of exe-
cution of the iterations of the loop is not significant. Such variables must be locked
via a semaphore before they can be accessed.
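Here is the sketch of approach b promised above: a shared iteration counter forces the order-dependent part of each iteration to commit in sequential order, while the rest of the iteration can still run concurrently. The POSIX thread calls and the particular order-dependent update are illustrative assumptions; on the Sequent one would protect the counter with m_lock/m_unlock instead.

/* Sketch of approach b: a shared "turn" counter makes the ordered part of
 * each loop iteration execute in sequential order, while the unordered
 * work of different iterations can still overlap.                         */
#include <stdio.h>
#include <pthread.h>

#define ITERS    16
#define NTHREADS 4

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int turn = 0;                 /* next iteration allowed to commit   */
static int result = 0;               /* the "ordered" shared variable      */

static void *worker(void *arg)
{
    int id = *(int *) arg, i;
    for (i = id; i < ITERS; i += NTHREADS) {   /* this process's iterations */
        int v = i * i;                         /* unordered work, in parallel */

        pthread_mutex_lock(&m);
        while (turn != i)                      /* wait for our turn          */
            pthread_cond_wait(&c, &m);
        result = result * 2 + v;               /* order-dependent update     */
        turn++;                                /* let iteration i+1 proceed  */
        pthread_cond_broadcast(&c);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int ids[NTHREADS], i;
    for (i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, worker, &ids[i]);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("result = %d (same as a sequential run)\n", result);
    return 0;
}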
5.6.4. Deadlocks. The last general issue we will discuss is that of deadlock con-
ditions. Suppose we have two processes that each try to access two data-items.

Since we are aware of the problems that can arise involving race-conditions, we
use semaphores to control access to the data-items. Now suppose, for some rea-
son (unlucky choices or timing), the first process locks up the first data-item at
the same time that the second process locks up the second. Now both processes
try to lock the other data-item. Since they can’t complete their computations (and
release the data they have locked) until they get both data-items, they both wait
forever. This is known as a deadlock condition.
The classic problem that illustrates the issue of deadlocks is the Dining
Philosopher’s Problem, described by Dijkstra in [47].
Five philosophers sit at a round table with a huge plate of
spaghetti in the center. There are five forks on the table — they
lie between the philosophers. Each philosopher alternates
between meditating and eating, and a philosopher needs two
forks in order to eat. The philosophers are very egotistical, so no
philosopher will relinquish a fork once they have picked it up
until they finish eating7.
Deadlocks can only occur if the following conditions are satisfied:
1. Processes can request (and lock) only part of the resources they need.
2. Processes can never relinquish resources they have requested until their
computations are completed.
3. Processes cannot take resources away from other processes.
4. A circular chain of requests for resources can exist. Each process in the
chain requests two or more resources and at least one of these is also re-
quested by the next process in the chain.
We prevent deadlock by eliminating at least one of these conditions. It is generally
impractical to try to eliminate conditions 2 and 3, but the other two conditions can
be eliminated.
• We can prevent condition 1 from being satisfied by implementing
semaphore sets. These are sets of semaphores with semaphore-down
operations that apply atomically to the entire set. When a process
performs a semaphore-down operation on the set, it is suspended if any
of the semaphores in the set is 0. In this case none of the semaphores is
lowered. In the case where all of the semaphores in the set are 1, they are
all lowered simultaneously. In the context of the Dining Philosopher’s
Problem, this is as if the philosophers could grab both of the forks at the
same instant, so they either get both forks, or they get nothing. The AT&T
System V implementation of UNIX has such semaphore set operations.
• We can prevent condition 4 in several ways:
– Careful algorithm design.
– Use of resources in a fixed order.
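The second of these is easy to illustrate. In the sketch below (which uses POSIX mutexes as stand-ins for the semaphores discussed above; the names are invented) both processes need both data-items, but each acquires the locks in the same fixed, ascending order, so the circular chain of requests required by condition 4 can never form.

/* Sketch: acquiring resources in a fixed order prevents condition 4.
 * Both processes below need both data-items, but each always locks the
 * lower-numbered item first, so a circular wait can never arise.        */
#include <stdio.h>
#include <pthread.h>

static pthread_mutex_t item[2] = { PTHREAD_MUTEX_INITIALIZER,
                                   PTHREAD_MUTEX_INITIALIZER };
static int value[2] = { 0, 0 };

static void update_both(int id, int a, int b)
{
    int first  = a < b ? a : b;              /* fixed (ascending) order    */
    int second = a < b ? b : a;

    pthread_mutex_lock(&item[first]);
    pthread_mutex_lock(&item[second]);
    value[a] += 1;                           /* use both data-items        */
    value[b] += 1;
    printf("process %d updated items %d and %d\n", id, a, b);
    pthread_mutex_unlock(&item[second]);
    pthread_mutex_unlock(&item[first]);
}

static void *proc0(void *arg) { update_both(0, 0, 1); return NULL; }
static void *proc1(void *arg) { update_both(1, 1, 0); return NULL; }

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, proc0, NULL);
    pthread_create(&t1, NULL, proc1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("values: %d %d\n", value[0], value[1]);
    return 0;
}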
5.7. Comparison of the SIMD and MIMD models of computation. As the
title of this section suggests, we will prove that these two very general models
of computation are essentially equivalent. They are equivalent in the sense that,
with some restrictions, any algorithm that executes in T time units on one type
of computer can be made to execute in kT time units on the other, where k is

7The reader may have noticed a few puns in this example!



some constant. Before the reader concludes that this means the type of parallel
computer one uses is unimportant, it is necessary to point out that the constant k
may turn out to be very large. Many problems will naturally lend themselves to a
SIMD or MIMD implementation, and any other implementation may turn out to
be substantially slower. First we need the following definition:
D EFINITION 5.24. An algorithm for a SIMD parallel computer will be called
calibrated, if whenever processors access memory, they also compute the program-
step in which the memory was last written. This means that there is a function
f : P × T × M → T, where
1. P is the set of processors in the SIMD computer.
2. T is the set of possible time-steps — a range of integers.
3. M is the set of memory-locations.
In addition, it means that the algorithm effectively computes this function f in the
course of its execution.
Many highly regular algorithms for SIMD computers have this property. For
instance, an algorithm that accesses all of memory in each program step can be
easily converted into a calibrated algorithm.
A LGORITHM 5.25. Suppose A is a calibrated algorithm that runs in T time
units on a SIMD-EREW parallel computer with n processors and uses m memory
locations. Then it is possible to execute this algorithm on a MIMD-EREW com-
puter with n processors in kT time units, using mT distinct semaphores, where k
is the number of instruction-steps required to:
1. Check a semaphore;
2. Suspend a process;
3. Awaken a suspended process;
This result suggests that a MIMD computer is strictly better than a SIMD com-
puter, in the sense that
• It looks as though MIMD computers can execute any program that runs
on a SIMD computer.
• A MIMD computer can also run programs that require asynchronous pro-
cesses.
The “catch” here is that:
1. Many SIMD algorithms are not calibrated, and there is a very significant
cost associated with converting them into calibrated algorithms.
2. Most MIMD computers today (1992) have far fewer processors than SIMD
computers;
3. The constant k may turn out to be very large.

P ROOF. The basic idea here is that race-conditions will not occur, due to the
fact that the SIMD algorithm was designed to execute on a EREW computer. Race
conditions only occur when multiple processors try to read and write to the same
location in a given program step. The only problem with carrying out the simu-
lation in an entirely straightforward way is that of synchronizing the processors.
This is easily handled by using the fact that A is calibrated. Simply associate a
time-stamp with each data-item being computed. Each processor of the MIMD

computer executes instructions of the SIMD program and maintains a program-


counter. We attach a single semaphore to each simulated SIMD-memory location,
and for each simulated time-step. This gives a total of mT semaphores, and they
are all initially down except the ones for all of the memory locations at time 0. When a
processor is about to read the data that it needs for a given program-step, it checks
the semaphore for that data-item at the required time. When a processor com-
pletes a computation in a given simulated time-step, it executes an up operation
on the corresponding semaphore.
We must prove that the execution-time of the simulated algorithm is as stated.
We use induction on the number of program-steps in the SIMD algorithm. Cer-
tainly the conclusion is true in the ground-case of the induction. Suppose that it is
true after t simulated program-steps. This means that all processors of the MIMD
machine have simulated t program steps of the SIMD algorithm after kt time-units
have elapsed. All processors are ready to begin simulating at least the (t + 1)st
program-step at this point. If any processors require data from the tth program
step, they must access a semaphore that is attached to that data. Consequently, at
most k further time-units may elapse before the algorithm can simulate the next SIMD
program-step. □
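A small runnable sketch may make this pattern of semaphore operations clearer. The toy "SIMD program" below (which rotates an array one position per step while each processor accumulates the values it sees) and all of the names in it are invented for illustration; because every location is rewritten at every step, the calibration function is simply last-written-step = current step − 1. One stored value and one semaphore are kept per (memory location, time-step), as in the proof: a reader waits on the semaphore for the step at which its operand was written, and a writer raises the semaphore for its result.

/* Hypothetical sketch of the calibrated SIMD -> MIMD simulation of 5.25.  */
#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

#define N 4                      /* simulated SIMD processors / locations  */
#define T 3                      /* simulated SIMD time-steps              */

static int   value[T + 1][N];    /* value[t][m]: contents of m after step t */
static sem_t written[T + 1][N];  /* raised when value[t][m] is available    */
static int   total[N];           /* private accumulator of each processor   */

static void *simulate(void *arg)
{
    int pid = *(int *) arg, t, x;
    for (t = 1; t <= T; t++) {
        int src = (pid + 1) % N;          /* the one location read this step  */
        sem_wait(&written[t - 1][src]);   /* calibrated: written at step t-1  */
        x = value[t - 1][src];
        total[pid] += x;                  /* local computation                */
        value[t][pid] = x;                /* exclusive write of our location  */
        sem_post(&written[t][pid]);       /* announce the result of step t    */
    }
    return NULL;
}

int main(void)
{
    pthread_t thr[N];
    int ids[N], t, m;

    for (t = 0; t <= T; t++)
        for (m = 0; m < N; m++)
            sem_init(&written[t][m], 0, t == 0 ? 1 : 0);
    for (m = 0; m < N; m++) {
        value[0][m] = m;                  /* input data                       */
        total[m] = m;
        ids[m] = m;
        pthread_create(&thr[m], NULL, simulate, &ids[m]);
    }
    for (m = 0; m < N; m++)
        pthread_join(thr[m], NULL);
    for (m = 0; m < N; m++)
        printf("processor %d: total = %d\n", m, total[m]);  /* all print 6    */
    return 0;
}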
The results of § 5, particularly 5.1 on page 26 imply:
C OROLLARY 5.26. Suppose A is a calibrated algorithm that runs in T time units on
a SIMD-CRCW parallel computer with n processors. Then it is possible to execute this
algorithm on a MIMD-EREW computer with n processors in kT lg^2 n time units, where
k is the number of instruction-steps required to:
1. Check a semaphore;
2. Suspend a process;
3. Awaken a suspended process;
In more generality, we have:
A LGORITHM 5.27. Suppose A is an algorithm that runs in T time units on a
SIMD-EREW parallel computer with n processors. Then it is possible to execute
this algorithm on a MIMD-EREW computer with 3n processors in kT lg n time
units, where k is a constant.
This algorithm eliminates the requirement that the SIMD algorithm be cali-
brated. It is based upon the tree-of-semaphores construction of David Saunders
(see page 51). Essentially, we insert a barrier construct after each simulated SIMD
instruction. We set up a tree of semaphores and, when processors finish a given
simulated SIMD instruction-step (say instruction j) they wait until all processors
complete this instruction-step. Then they begin their simulation of the next SIMD
instruction-step.
In like fashion, it is possible to simulate MIMD algorithms on a SIMD machine.
Suppose we have a MIMD algorithm that makes use of an instruction-set with I
instructions {a_1, . . . , a_I}. Suppose that a SIMD computer can simulate a_j in time
t_j. Then we get:
ALGORITHM 5.28. Let A be an algorithm that runs in T time units on the
MIMD computer described above. Then A can be run in T·∑_{j=1}^{I} t_j time units on
the SIMD computer described above.

The idea of this simulation is extremely simple. Each program-step of the


MIMD computer is simulated on the SIMD computer by a loop with I iterations.
In iteration j all processors that should be executing instruction a_j run a simulation
of this instruction — this requires time t_j.
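A sequential sketch of this interpreter loop is given below. The three-instruction "machine" (LOADC, ADD, HALT) and all of the array names are invented for illustration; the point is only the structure: one pass over the entire instruction set per simulated MIMD step, with every processor masked off except during the pass for its own current opcode.

/* Sequential sketch of the MIMD-on-SIMD simulation of 5.28.               */
#include <stdio.h>

#define P 4                              /* simulated MIMD processors      */
#define I 3                              /* size of the instruction set    */
enum { LOADC, ADD, HALT };

static int opcode[P]  = { LOADC, ADD, LOADC, HALT };  /* current opcodes   */
static int operand[P] = { 7,     5,   2,     0 };
static int acc[P];                       /* one accumulator per processor  */

int main(void)
{
    int j, p;
    /* One simulated MIMD program-step = one loop over the instruction set. */
    for (j = 0; j < I; j++) {                   /* simulate instruction j    */
        for (p = 0; p < P; p++) {               /* "do in parallel"          */
            if (opcode[p] != j) continue;       /* masked off in this pass   */
            if (j == LOADC) acc[p] = operand[p];
            if (j == ADD)   acc[p] += operand[p];
            /* HALT: do nothing                                              */
        }
    }
    for (p = 0; p < P; p++)
        printf("processor %d: acc = %d\n", p, acc[p]);
    return 0;
}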
Note that simulations of this kind are extremely slow unless the simulated
instruction-set is made as small as possible. Mr. Jeffrey Salvage is writing such a
simulator to run on a Connection Machine (CM-2) computer8.
The question of how one can simulate a MIMD machine by a SIMD machine
has also been considered by M. Wloka in his Doctoral Dissertation ([173]) and by
Michael Littman and Christopher Metcalf in [102].

E XERCISES .

5.5. Why doesn’t the simulation algorithm 5.25 run up against the limitations
implied by 5.23 on page 51?
5.6. Is it possible for the MIMD simulation in 5.25 to run faster than the original
SIMD algorithm being simulated? Assume that the processors of the MIMD com-
puter run at the same rate as those of the SIMD computer, and that the operations
of checking semaphores take negligible time (so the constant k is 0).

8This is his Master’s Thesis at Drexel University


CHAPTER 3

Distributed-Memory Models

1. Introduction.
The PRAM models of computation require that many processors access the
same memory locations in the same program-steps. This creates engineering prob-
lems that have only been solved in a few cases. Most practical parallel computers
are built along a Distributed Memory Model of some kind. In a distributed-memory
architecture, the processors of the computer are interconnected in a communication-
network and the RAM of the computer is local to the processors. Each processor can
access its own RAM easily and quickly. If it needs access to a larger address-space
than is contained in its local RAM, it must communicate with other processors over
the network.
This leads to several interesting questions:
1. How do the various interconnection-schemes compare with each other in
regard to the kinds of algorithms that can be developed for them?
2. How easy is it for processors to “share” memory over the communication
network? This is related to the question of how easy it might be to port
PRAM-algorithms to a network-computer.
It turns out that the answer to question 1 is that “almost any network will do”.
Work of Vishkin and others shows that algorithms that are fast most of the time
(i.e. probabilistic algorithms) can be developed for any network in which it is
possible to reach 2^k other processors from a given processor in k steps. Ques-
tion 2 has a similar answer — there exist efficient algorithms for simulating a
PRAM-computer via a network-computer, if the network satisfies the condition
mentioned above.

2. Generic Parallel Algorithms.


We will consider how several different networks handle many common algo-
rithms. In order to do this, we follow Preparata and Vuillemin in [131] in defining
a pair of generic parallel algorithms that can be easily implemented on the com-
mon network-computers. Many interesting parallel algorithms can be expressed
in terms of these two.
We will assume that we have n = 2^k data items stored in storage locations
T[0], T[1], . . . , T[n − 1].
The notation OPER(m, j; U, V) will denote some operation that modifies the
data present in locations U and V, and depends upon the parameters m and j,
where 0 ≤ m < n and 0 ≤ j < k. We will also define the function bit_j(m) to be the
jth bit in the binary representation of the number m. Given these definitions, we
will say that an algorithm is in the DESCEND class if it is of the form:

ALGORITHM 2.1. DESCEND Algorithm.

for j ← k − 1 downto 0 do
    for each m such that 0 ≤ m < n
    do in parallel
        if bit_j(m) = 0 then
            OPER(m, j; T[m], T[m + 2^j])
    endfor
endfor
and we will say that an algorithm is of ASCEND class if it is a special case of
ALGORITHM 2.2. ASCEND Algorithm.

for j ← 0 to k − 1 do
    for each m such that 0 ≤ m < n
    do in parallel
        if bit_j(m) = 0 then
            OPER(m, j; T[m], T[m + 2^j])
    endfor
endfor
In many cases, algorithms that do not entirely fall into these two categories
can be decomposed into a sequence of algorithms that do — we will call these
algorithms composite. We will often want to regard OPER(m, j; T[m], T[m + 2^j]) as
a pair of functions (f_1, f_2):

T[m] ← f_1(m, j, T[m], T[m + 2^j])
T[m + 2^j] ← f_2(m, j, T[m], T[m + 2^j])
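A sequential rendering of the DESCEND skeleton may be helpful; the function-pointer argument standing in for OPER is an illustrative choice, not the notation used in this text. With OPER taken to be an unconditional interchange, the skeleton performs the order-reversing permutation of 2.5 below (the test data is that of Example 2.6).

/* A sequential sketch of the generic DESCEND skeleton (2.1); OPER is
 * passed as a function pointer taking (m, j, &T[m], &T[m + 2^j]).         */
#include <stdio.h>

typedef void (*oper_t)(int m, int j, int *u, int *v);

static void descend(int *T, int k, oper_t oper)
{
    int n = 1 << k, j, m;
    for (j = k - 1; j >= 0; j--)
        for (m = 0; m < n; m++)              /* "do in parallel"            */
            if ((m & (1 << j)) == 0)         /* bit_j(m) = 0                */
                oper(m, j, &T[m], &T[m + (1 << j)]);
}

/* OPER for the order-reversing permutation of 2.5: always interchange.    */
static void swap_oper(int m, int j, int *u, int *v)
{
    int t = *u;
    (void) m; (void) j;                      /* this OPER ignores m and j   */
    *u = *v;
    *v = t;
}

int main(void)
{
    int T[8] = { 3, 7, 2, 6, 1, 8, 0, 5 }, i;   /* the data of Example 2.6  */
    descend(T, 3, swap_oper);
    for (i = 0; i < 8; i++)
        printf("%d ", T[i]);                    /* prints 5 0 8 1 6 2 7 3   */
    printf("\n");
    return 0;
}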
We will consider some common parallel algorithms that can be re-stated in the
context of ASCEND and DESCEND algorithms. Throughout the remainder of the
text, we will occasionally encounter more of these.
E XAMPLE 2.3. The Bitonic Sorting algorithm (3.4 on page 21) is a DESCEND
algorithm — here the operation OPER(m, j; U, V ) is just compare-and-swap (if the
elements are out of sequence).
Here is a refinement of the algorithm for forming the sum of n = 2^k numbers
given on page 7 of the Introduction: we will not only want the sum of the n = 2^k
numbers — we will also want all of the cumulative partial sums. Suppose we have
n processors, numbered from 0 to n − 1, and n numbers in an array, A.
In step i (where steps are also numbered from 0 on) let processor number u, where
j·2^{i+1} + 2^i ≤ u ≤ (j + 1)·2^{i+1} − 1 (here j runs through all possible values giving
a result between 1 and n) add the value of A(j·2^{i+1} + 2^i − 1) to A(u). Figure 3.1
illustrates this process.
It is easy for a processor to determine if its number is of the correct form —
a processor participates in step i precisely when bit i of its identifying number is 1,
and it can then compute its value of j by right-shifting that number i + 1 positions.
It can then compute the corresponding value of j·2^{i+1} + 2^i − 1 and carry
out the addition. It isn't difficult to see that this procedure will terminate in lg n
steps and at the end A will contain the cumulative sums of the original numbers.
A single extra step will compute the absolute values of the differences of all of the
partial sums from half of the sum of all of the numbers, and the minimum of the
differences is easily determined.

Initial:  1  2  0  1  6  3  0  5
Step 0:   1  3  0  1  6  9  0  5
Step 1:   1  3  3  4  6  9  9  14
Step 2:   1  3  3  4  10 13 13 18

FIGURE 3.1. Computing the cumulative sums of 8 integers, in parallel
In step i processors numbered j compare the numbers in A(j·2^i − 2^{i−1}) with
those in A(j·2^i) and exchange them if the former is smaller than the latter. In
addition subscript numbers (which are stored in another array of n elements) are
exchanged. Clearly, this determination of the minimum requires lg n steps also.
Later, we will examine an improvement of this algorithm that only requires n/ lg n
processors — see 1.7 on page 271.
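Here is a minimal sequential sketch of the cumulative-sums computation described above, run on the data of figure 3.1; the snapshot array old[] mimics the synchronous "read the old values" semantics of a genuinely parallel step, and processor u participates in step i exactly when bit i of u is 1.

/* Sequential sketch of the cumulative-sums (parallel-prefix) computation.
 * In step i, processor u participates exactly when bit i of u is 1, and it
 * adds the value at position (u >> (i+1))*2^(i+1) + 2^i - 1, the last
 * element of the preceding block.                                          */
#include <stdio.h>

#define N 8          /* n = 2^3 */

int main(void)
{
    int A[N] = { 1, 2, 0, 1, 6, 3, 0, 5 };
    int i, u;

    for (i = 0; (1 << i) < N; i++) {            /* lg n steps               */
        int old[N];
        for (u = 0; u < N; u++) old[u] = A[u];  /* "old" values of this step */
        for (u = 0; u < N; u++)                 /* "do in parallel"         */
            if (u & (1 << i)) {                 /* u is of the correct form  */
                int j = u >> (i + 1);
                A[u] = old[u] + old[j * (1 << (i + 1)) + (1 << i) - 1];
            }
    }
    for (u = 0; u < N; u++)
        printf("%d ", A[u]);                    /* prints 1 3 3 4 10 13 13 18 */
    printf("\n");
    return 0;
}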
EXAMPLE 2.4. The algorithm for forming the sum of n numbers is an ASCEND
algorithm where OPER(m, j; U, V) has the following effect:

U ← U
V ← U + V

These two examples are fairly clear — the original statements of the algo-
rithms in question were almost exactly the same as the statements in terms of
the ASCEND and DESCEND algorithms.
Now we will consider a few less-obvious examples:
P ROPOSITION 2.5. Suppose we have n = 2k input data items. Then the permutation
that exactly reverses the order of input data can be expressed as a DESCEND algorithm
— here OPER(m, j; U, V ) simply interchanges U and V in all cases (i.e., independently
of the values of U and V).
P ROOF. We prove this by induction. It is clear in the case where k = 1. Now
suppose the conclusion is known for all values of k ≤ t. We will prove it for k =
t + 1. The key is to assume that the original input-data was T [i ] = i, 0 ≤ i ≤ n − 1.
If we prove the conclusion in this case, we will have proved it in all cases, since
the numerical values of the data-items never enter into the kinds of permutations
that are performed by OPER(m, j; U, V ). The next important step is to consider the
binary representations of all of the numbers involved. It is not hard to see that the
first step of the DESCEND algorithm will alter the high-order bit of the input-data
so that the high-order bit of T[i] is the same as that of n − 1 − i. The remaining bits
will not be affected since:

Bits 0 through t − 1 of the numbers {0, . . . , n/2 − 1} are the same as
bits 0 through t − 1 of the numbers {n/2, . . . , n − 1}, respectively.
The remaining iterations of the DESCEND algorithm correspond exactly to the
algorithm in the case where k = t, applied independently to the lower and upper
halves of the input-sequence. By the inductive hypothesis, these iterations of the
algorithm have the correct effect on bits 0 through t − 1 of the data. It follows that,
after the algorithm executes, we will have T[i] = n − 1 − i. □
For instance:
EXAMPLE 2.6. Suppose n = 2^3 and our input-data is:
{3, 7, 2, 6, 1, 8, 0, 5}
After the first iteration of the DESCEND algorithm (with OPER(m, j; U, V ) defined
to always interchange U and V), we have:
{1, 8, 0, 5, 3, 7, 2, 6}
After the second iteration, we get:
{0, 5, 1, 8, 2, 6, 3, 7}
And after the final iterations we get:
{5, 0, 8, 1, 6, 2, 7, 3}
This is the reversal of the input sequence.
The reason we took the trouble to prove this is that it immediately implies
that:
P ROPOSITION 2.7. The Batcher merge algorithm (3.6 on page 22) can be expressed
as a composite of two DESCEND algorithms. The first phase reverses the upper half of the
input-data via 2.5 above, and the second performs a Bitonic sort using 2.3.
This implies that the general Batcher sorting algorithm can be implemented
with O(lg n) DESCEND algorithms via the reasoning of 3.7 on page 23.
We will conclude this section by showing that the operation of shifting data
can be implemented as an ASCEND algorithm:
PROPOSITION 2.8. Suppose we have n = 2^k input data-items. Then the cyclic shift
operation

T[i] ← T[i − 1]
T[0] ← T[n − 1]

for all i such that 1 ≤ i ≤ n − 1, occurs as the result of an ASCEND algorithm with
OPER(m, j; U, V) defined to
• Interchange U and V, if m is a multiple of 2^{j+1}.
• Leave U and V unchanged otherwise.
P ROOF. We use induction on k. Clearly, the conclusion is true for k = 1 —
the algorithm just interchanges the two input data items. As with the previous
algorithm, we will assume the input is given by T [i ] = i, and will assume that the
algorithm works for k ≤ t. Now suppose we are given a sequence of 2^{t+1} data
items:

{0, 1, . . . , 2^{t+1} − 1}

FIGURE 3.2. The butterfly architecture (a butterfly network of rank 3, with processor ranks 0 through 3)

The inductive hypothesis implies that the first t iterations of the algorithm will
produce:

{T[0] = 2^t − 1, T[1] = 0, . . . , T[2^t − 1] = 2^t − 2,
T[2^t] = 2^{t+1} − 1, T[2^t + 1] = 2^t, . . . , T[2^{t+1} − 1] = 2^{t+1} − 2}

The last iteration interchanges T[0] and T[2^t] so we get the sequence

{T[0] = 2^{t+1} − 1, T[1] = 0, . . . , T[2^{t+1} − 1] = 2^{t+1} − 2}

□

3. The Butterfly Network.


Naturally, some networks are better than others for developing parallel algo-
rithms. Essentially, they have structural properties that make it easier to describe
the data-movement operations necessary for parallel computations. We will con-
sider several such architectures in this chapter. In every case, physical computers
have been constructed that implement the given architecture.
We first consider the butterfly layout — see figure 3.2 for a diagram that il-
lustrates the basic idea. The nodes represent processors and the lines represent
communication links between the processors.
Here the nodes of this graph represent processors (each of which has some
local memory) and the edges are communication lines. This particular layout is of
rank 3. The general rank k butterfly layout has (k + 1)·2^k processors arranged in
2^k columns with k + 1 processors in each column. The processors in a column are
numbered from 0 to k (this number is called the rank of the processor) — these processors
are denoted d_{r,j}, where r is the rank and j is the column number. Processor
d_{r,j} is connected to processors d_{r−1,j} and d_{r−1,j′}, where j′ is the number whose k-bit
binary representation is the same as that of j except for the (r − 1)st bit from the left.

FIGURE 3.3. The Even Sub-Butterfly (ranks 0 through 2)

In some cases the 0th and kth (last) ranks are identified so that every processor
has exactly 4 communication lines going out of it.
The fundamental properties of butterfly networks that interest us are the fol-
lowing:
3.1. 1. If the rank-0 processors (with all of their communication arcs) of
a butterfly network of rank k are deleted we get two butterfly networks of
rank k − 1.
2. If the rank-k processors (with all of their communication arcs) of a but-
terfly network of rank k are deleted we get two interwoven butterfly net-
works of rank k − 1.
Statement 1 is immediately clear from figure 3.2 — in this case the ranks of
the remaining processors have to be changed. Statement 2 is not hard to see from
figure 3.3.
Here the even columns, lightly shaded, form one butterfly network of rank 2,
and the odd columns form another.
The organization of the diagonal connections on the butterfly networks makes
it easy to implement the generic ASCEND and DESCEND algorithms on it:
ALGORITHM 3.2. DESCEND Algorithm on the Butterfly computer. Suppose
the input data is T[0], . . . , T[n − 1], with n = 2^k and we have a rank-k butterfly
network. Start with T[i] in processor d_{0,i}, for 0 ≤ i ≤ n − 1 — the top of the
butterfly diagram. In each phase of the Butterfly version of the DESCEND algorithm
(2.1 on page 58) perform the following operations:

for j ← k − 1 downto 0 do
    Transmit data from rank k − 1 − j to rank k − j
        along both vertical and diagonal lines
    The processors in rank k − j and columns
        m and m + 2^j now contain the old values
        of T[m], T[m + 2^j].
    Compute OPER(m, j; T[m], T[m + 2^j])
endfor

Note: in the step that computes OPER(m, j; T[m], T[m + 2^j]), two separate
computations are involved. The processor in column m computes the new value
of T[m] and the processor in column m + 2^j computes the new value of T[m + 2^j].
Both of these processors have all of the input-data they need.
The ASCEND algorithm is very similar:
ALGORITHM 3.3. ASCEND Algorithm on the Butterfly computer. Suppose
the input data is T[0], . . . , T[n − 1], with n = 2^k and we have a rank-k butterfly
network. Start with T[i] in processor d_{k,i}, for 0 ≤ i ≤ n − 1 — the bottom of the
butterfly diagram. In each phase of the Butterfly version of the ASCEND algorithm
(2.2 on page 58) perform the following operations:

for j ← 0 to k − 1 do
    Transmit data from rank k − j to rank k − j − 1
        along both vertical and diagonal lines
    The processors in rank k − j − 1 and columns
        m and m + 2^j now contain the old values
        of T[m], T[m + 2^j].
    Compute OPER(m, j; T[m], T[m + 2^j])
endfor
Note that the execution-times of the original ASCEND and DESCEND algo-
rithms have at most been multiplied by a constant factor.
The results of the previous section immediately imply that we can efficiently1
implement many common parallel algorithms on butterfly computers:
• Bitonic sort (see 2.3 on page 58).
• Computation of cumulative sums of n numbers (see 2.4 on page 59).
• the generalized Batcher sorting algorithm (see 2.7 on page 60).
In the future we will need the following:
D EFINITION 3.4. An algorithm for the butterfly computer will be called normal
if in each step either (but not both) of the following operations is performed:
1. data is copied from one rank to a neighboring rank;
2. calculations are performed within processors and no data movement oc-
curs.
Normality is a fairly restrictive condition for an algorithm to satisfy — for
instance it implies that the significant data is contained in a single rank in the
computer in each step (which may vary from one step to the next). Most of the
processors are inactive in each step. The DESCEND and ASCEND algorithms de-
scribed above are clearly normal. Normal algorithms are important because it will
turn out that they can be simulated on computers that have only O(n) processors
(rather than O(n lg n) processors, like the butterfly). In a sense normal algorithms
are wasteful — they only utilize a single row of the butterfly in each time-step2.
The fact that the Batcher sorting algorithm sort can be implemented on the
butterfly computer implies that:
1In this context “efficiently” means that the program on the butterfly-computer has the same
asymptotic execution-time as the PRAM implementation.
2Such algorithms, however, can be used to create online algorithms that continually input new
data and process it.

LEMMA 3.5. An algorithm for an EREW-unbounded computer that executes in α
steps using n processors and β memory locations can be simulated on a rank lg n butterfly
computer executing in O(α⌈β/n⌉ lg^2 n)-time using n(1 + lg n) processors. Here each
processor is assumed to have at least β/n memory locations of its own.
Here is the algorithm:
1. At the beginning of each simulated EREW machine cycle each processor
makes up memory requests: we get an array of lists like (op(i), a(i) mod n,
⌊a(i)/n⌋, d(i), dummyflag), where op(i) = READ or WRITE, a(i) is the address
on the EREW computer, and d(i) is the data to be written, or empty, and
dummyflag = FALSE. We assume that memory in the EREW computer is simulated
by the memory of the processors of the butterfly computer: location k in the
EREW computer is represented by location ⌊k/n⌋ in processor k (mod n) — it
follows that the two middle entries in the list above can be interpreted as:
(processor containing simulated location a(i),
memory location within that processor of simulated location a(i));
2. We now essentially sort all of these memory requests by their second entry
to route them to the correct processors. This sorting operation is basically what
was outlined above but we must be a little careful — more than one memory re-
quest may be destined to go to the same processor. In fact we must carry out
d β/ne complete sorting operations. In the first step we sort all memory requests
whose third list element is 0, next all memory requests whose third element is 1,
and so on. We must modify the sort algorithm given above slightly, to take into
account incomplete lists of items to be sorted. In the beginning of the algorithm,
if a processor doesn’t have any list item with a given value of the third element
is makes up a dummy item with the dummyflag equal to TRUE. When the a given
sorting operation is complete the dummy items are discarded.
3. When the d β/ne sorting operations are completed all processors will con-
tain the (maximum of d β/ne) memory requests for accesses to memory locations
that they contain. These requests are then processed sequentially in O(d β/ne)-time.
4. The results of the memory accesses are then sorted again (in one sorting
operation) to send them back to the processors that originated the requests.
Note that the algorithm would be poly-logarithmic if these was some small
upper-bound on d β/ne. If β is very large (i.e. O(n2 ) the algorithm seriously falls
short of being poly-logarithmic. The problem here is clearly not the amount of data
to be sorted — that is always O(n) and it should be possible to sort this much
data in O(lg2 n)-time. The problem is that a large amount of data might have to
go to the same processor (in which case many other processors would receive no
data). There is also a problem with the processing of the memory requests once
they have been routed to the correct processors. In the algorithm above this is
done sequentially within processors, but this is clearly wasteful since many other
processors will be idle (because there are at most n data items to be processed
overall). This situation is known as the Granularity Problem — it has been studied
in connection with the theory of distributed databases as well as parallel processing
algorithms. See §5 for a solution to this problem (and, consequently, an improved
algorithm for simulating an EREW computer with large memory on a butterfly).
A similar result is possible for the CRCW computer. In order to prove this
it is first necessary to develop an algorithm for moving data through a butterfly
efficiently.

We immediately conclude:
LEMMA 3.6. An algorithm for a CRCW-unbounded computer that executes in α
steps using n processors and β memory locations can be simulated on a rank lg n butterfly
computer executing in O(α⌈β/n⌉ lg^2 n)-time using n(1 + lg n) processors. Here each
processor is assumed to have β/n + 3 memory locations of its own.
This algorithm is clearly normal since each of its steps is. That will turn out
to imply that a CRCW computer can be simulated via a network that has O(n)
processors, and with no degradation in the time estimates.

P ROOF. We copy the proof of 5.1, substituting the sorting algorithm for the
butterfly computer for that of §5 and the data movement algorithm above for the
simple shift that takes place in the CRCW-read simulation of 5.1.
In addition, in the step of 5.1 in which processors compare data values with
those of their neighbors to determine whether they contain the lowest indexed
reference to a memory location, they can use the butterfly implementation of the
ASCEND algorithm and 2.8 to first move the data to be compared to the same
processors. Note that, due to the Granularity Problem mentioned above, we will
have to actually carry out the routing operation ⌈β/n⌉ times. This accounts for the
factor of ⌈β/n⌉ in the time estimate. □

Here is a sample programming language (Butterfly Pascal) that might be used


on a butterfly-SIMD computer:
D EFINITION 3.7. The language is like Pascal except that there exist:
1. pre-defined (and reserved) variables: ROW, COLUMN,
COPY_EXCHANGE, COPY — COPY_EXCHANGE and COPY are areas
of memory of size 1K — that represents the maximum amount of data
that may be copied or copy-exchanged at any given time. ROW and
COLUMN are integers which may never be the target of an assignment
(and may never be used as a var-type parameter in a procedure or
function call). They are equal, respectively, to the row- and column-
numbers of a processor (so their values vary from one processor to
another).
2. Procedures copy_exch(), copy_up(), and copy_down(). In order to copy-exchange
some data it must be plugged into COPY_EXCHANGE and the
procedure copy_exch() must be called. Assume that the Pascal assignment
operators to and from COPY and COPY_EXCHANGE are size-sensitive
— i.e. COPY:=x; copies a number of bytes equal to the size of the variable
x and the corresponding statement is true for x:=COPY.
3. Assume an additional block structure:
if <condition> paralleldo <stmt>; This statement evaluates the con-
dition (which generally tests COLUMN number in some way) and exe-
cutes the statement if the condition is true. This differs from the usual
if-statement only in the sense that a subset of the processors may execute
the statement in <stmt> and the remainder of the processors will attempt
to “mark time” — they will not execute <stmt> but will attempt to wait
the appropriate amount of time for the active processors to finish. This is
accomplished as follows: in the machine language for the butterfly each

processor has a flag that determines whether it “really” executes the cur-
rent instruction or merely waits. This flag is normally true (for active ex-
ecution) but when the pascal compiler translates the parallel-if statement
above it sets this flag in each processor according to whether the condition
is true or not (for that processor). At the end of the parallel-if block the flag
is again set to true.
Note: this statement is not necessary in order to execute a parallel program
— execution of a program is normally done in parallel by all processors. This
construct merely facilitates the synchronization of all processors across a row of the
butterfly.
Assume that a given program executes simultaneously on all processors in the
computer.

E XERCISES .

3.1. Suppose we had a computer that used the SIMD model of computation
and the Butterfly architecture. It is, consequently, necessary for each processor
to have a copy of the program to be run on it. Devise an algorithm to transmit a
program from one processor to all of the others. It should execute in O(l lg n)-time,
where l is the length of the program.
3.2. Is it possible for the processors to get “out of synchronization” in butterfly
pascal even though they use the parallel-if statement?
3.3. Why would it be difficult to synchronize processors in butterfly pascal
without a parallel-if statement? (I.e. why couldn’t we force some processors to
wait via conventional if statements and loops, for instance?)
3.4. Suppose we have the Butterfly Pascal language available on a Butterfly com-
puter of rank 5. Program (in Butterfly Pascal):
1. the butterfly sorting algorithm (Hints:
a. In each case, pass a parameter to subroutines telling which column
the processor is supposed to regard itself as being in — within the
appropriate sub-butterfly;
b. Confine activity to a single row of the butterfly at a time;
c. use the parallel-if statement described above whenever you want to
make a subset of the processors in a row execute some statements);
2. the simulation algorithm for a CRCW computer.

3.1. Discussion and further reading. The BBN Butterfly computer indirectly
utilizes a butterfly network. It has a number of processors that communicate via a
system called an Omega switch. This is a butterfly network whose vertices are not
completely-functional processors — they are gates or data-switches. See [99] for a
discussion of the issues involved in programming this machine.
We will discuss some of the literature on algorithms for the butterfly network.
In [16], Bhatt, Chung, Hong, Leighton, and Rosenberg develop algorithms for sim-
ulations that run on a butterfly computer.
The Hypercube Architecture
3.2. Description. An n-dimensional hypercube is a graph that looks, in the
3-dimensional case, like a wire frame that models a cube. The rigorous definition
of an n-dimensional hypercube is a graph H_n, where
1. The vertices of H_n are in a 1–1 correspondence with the n-bit binary sequences
a_0 · · · a_{n−1} (so there are 2^n such vertices). Each vertex has an identifying
number.
2. Two vertices a_0 · · · a_{n−1} and a′_0 · · · a′_{n−1} are connected by an edge if and
only if these sequences differ in exactly one bit — i.e., a_i = a′_i for all 0 ≤ i ≤
n − 1 with i ≠ k, for some value of k, and a_k ≠ a′_k.
An n-dimensional hypercube computer has a processing-element at each vertex of
H_n and connecting communication lines along the edges.
It is not hard to see that each vertex has exactly n edges incident upon it. Its
connectivity is, consequently, higher than that of the butterfly or perfect-shuffle
architectures. One might think that such a hypercube-computer is harder to im-
plement than a butterfly or perfect-shuffle computer.
The generic ASCEND and DESCEND algorithms (2.1 and 2.2 on page 58) are
easy to implement on the hypercube architecture:
A LGORITHM 3.8. DESCEND Algorithm on the Hypercube computer.
Suppose the input data is T [0], . . . , T [n − 1], with n = 2k and we have an
n-dimensional hypercube. Start with T [m] in vertex m, for 0 ≤ m ≤ n − 1 —
where vertex-numbers are as defined above. In iteration j of the Hypercube
version of the DESCEND algorithm perform the following operations:
for j ← k − 1 downto 0 do
for each m such that 0 ≤ m < n
do in parallel
if bit j (m) = 0 then
vertices m and m + 2 j exchange copies
of their data via the unique
common communication line
Each processor computes OPER(m, j; T [m], T [m + 2 j ])
(Now having the necessary input-data:
the old values of T [m] and T [m + 2 j ])
endfor
endfor

ALGORITHM 3.9. ASCEND Algorithm on the Hypercube computer. Suppose
the input data is T[0], . . . , T[n − 1], with n = 2^k, and we have an n-dimensional
hypercube. Start with T[i] in vertex i, for 0 ≤ i ≤ n − 1 — where vertex-numbers
are as defined above. In iteration j of the Hypercube version of the ASCEND
algorithm perform the following operations:

for j ← 0 to k − 1 do
  for each m such that 0 ≤ m < n do in parallel
    if bit_j(m) = 0 then
      vertices m and m + 2^j exchange copies of their data
        via the unique common communication line
    Each processor computes OPER(m, j; T[m], T[m + 2^j])
      (now having the necessary input-data:
       the old values of T[m] and T[m + 2^j])
  endfor
endfor

FIGURE 3.4. A 5-dimensional hypercube
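
Before moving on, it may help to see the hypercube DESCEND pattern in executable form. The short C program below is only a sequential simulation of Algorithm 3.8 — OPER is passed in as a function pointer, and the particular OPER used in main() (a compare-and-exchange) is just a placeholder to make the program runnable; ASCEND is the same loop with j running from 0 to k − 1.

    #include <stdio.h>

    /* Generic OPER(m, j; *u, *v) -- here: put the smaller value on the
       0-side and the larger on the 1-side, purely as an illustration.   */
    static void oper(int m, int j, int *u, int *v)
    {
        (void)m; (void)j;
        if (*u > *v) { int t = *u; *u = *v; *v = t; }
    }

    /* Sequential simulation of the hypercube DESCEND pattern (Algorithm 3.8):
       in iteration j the vertices m and m + 2^j (bit j of m equal to 0)
       exchange data and apply OPER; here one call updates both halves.      */
    static void descend(int *T, int k, void (*op)(int, int, int *, int *))
    {
        int n = 1 << k;
        for (int j = k - 1; j >= 0; j--)
            for (int m = 0; m < n; m++)
                if (((m >> j) & 1) == 0)
                    op(m, j, &T[m], &T[m + (1 << j)]);
    }

    int main(void)
    {
        int T[8] = {5, 7, 3, 0, 6, 2, 4, 1};
        descend(T, 3, oper);
        for (int i = 0; i < 8; i++) printf("%d ", T[i]);
        printf("\n");
        return 0;
    }
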
Note that the implementations of ASCEND and DESCEND on the hypercube
are more straightforward than on the butterfly network. These implementations
immediately imply that we have efficient implementations of all of the ASCEND
and DESCEND algorithms in § 2 on page 57. The hypercube architecture is in-
teresting for many reasons not related to having good implementations of these
generic algorithms. It turns out to be very easy to map certain other interesting
architectures into the hypercube. We will spend the rest of this section looking at
some of these.
D EFINITION 3.10. Let a and b be sequences of bits of the same length. The
Hamming distance between these sequences is the number of bit-positions in the
two sequences, that have different values.
It is not hard to see that the distance between two different vertices in a hy-
percube is equal to the Hamming distance between the binary representations of
their vertex-numbers.
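
In code, the Hamming distance of two vertex-numbers is simply the number of 1-bits in their bitwise exclusive-or; a small C sketch:

    #include <stdio.h>

    /* Hamming distance = number of bit positions where a and b differ. */
    static int hamming(unsigned a, unsigned b)
    {
        unsigned x = a ^ b;      /* 1-bits mark the differing positions */
        int d = 0;
        while (x) { d += x & 1u; x >>= 1; }
        return d;
    }

    int main(void)
    {
        /* vertices 01101 and 01011 of a 5-dimensional hypercube */
        printf("%d\n", hamming(0x0D, 0x0B));   /* prints 2 */
        return 0;
    }
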
Figure 3.4 shows a 5 dimensional hypercube.
A six dimensional hypercube is the result of taking two copies of this graph
and attaching each vertex of one copy to a corresponding vertex of the other — and
each time the dimension is raised by 1 the complexity of the graphs doubles again.
This is meant to convey the idea that high-dimensional hypercubes might be diffi-
cult to implement. Nevertheless, such computers are commercially available. The
Connection Machine from Thinking Machines (the CM-2 model) is a 12-dimensional
hypercube computer with 65,536 processors (it actually has 16 processors at each
vertex of a 12-dimensional hypercube).
It turns out that an n-dimensional hypercube is equivalent to an order-n but-
terfly network with all of the columns collapsed to single vertices, and half as many
edges. Basically, the result of collapsing the columns of a butterfly to vertices is a
hypercube with all of its edges doubled.
It is not hard to see that any normal algorithm for a degree-n butterfly net-
work can easily be ported to an n-dimensional hypercube computer with no time
degradation.
L EMMA 3.11. Every normal algorithm that runs on a degree-k butterfly network
(that has k2k processors) in t time units can run on a k-dimensional hypercube computer
in O(t) time.
Hypercubes are interesting as models for parallel communication because of
the fact that many other communication-schemes can be mapped into hypercubes.
In order to see how this is done, we discuss the subject of Gray codes. We will be
particularly interested in how one can map lattice organizations into hypercubes.
D EFINITION 3.12. The k-bit reflected Gray code sequence is defined recur-
sively via:
• The 1-bit sequence is {0, 1};
• If {s1 , . . . , sm } is the k − 1-bit reflected Gray code sequence, then the k-bit
sequence is {0s1 , . . . , 0sm , 1sm , . . . , 1s1 }.
The k-bit reflected Gray code sequence has 2k elements.
Here are the first few reflected Gray code sequences:
1. {0, 1}
2. {00, 01, 11, 10}
3. {000, 001, 011, 010, 110, 111, 101, 100}
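
Definition 3.12 translates directly into code. The C function below builds the k-bit reflected Gray code exactly as in the recursive definition (prefix 0 to the (k−1)-bit sequence, then prefix 1 to its reversal); the assertion checks it against the well-known closed form g(i) = i XOR (i >> 1), which generates the same sequence.

    #include <stdio.h>
    #include <assert.h>

    #define KMAX 10

    /* Build the k-bit reflected Gray code sequence of Definition 3.12.
       seq[] receives the 2^k code words, each stored as an integer.     */
    static void gray_code(int k, unsigned seq[])
    {
        seq[0] = 0; seq[1] = 1;                 /* the 1-bit sequence {0,1} */
        for (int bits = 2; bits <= k; bits++) {
            int half = 1 << (bits - 1);
            /* first half: 0s_1, ..., 0s_m (prefixing 0 changes nothing);
               second half: 1s_m, ..., 1s_1 (reversed, with the new top bit set) */
            for (int i = 0; i < half; i++)
                seq[half + i] = seq[half - 1 - i] | (unsigned)half;
        }
    }

    int main(void)
    {
        unsigned seq[1 << KMAX];
        gray_code(4, seq);
        for (int i = 0; i < 16; i++) {
            assert(seq[i] == ((unsigned)i ^ ((unsigned)i >> 1)));  /* closed form */
            printf("%u ", seq[i]);
        }
        printf("\n");
        return 0;
    }
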
The important property of Gray codes that will interest us is:
PROPOSITION 3.13. In a k-bit Gray code sequence, the Hamming distance between
any two successive terms, and the Hamming distance between the first term and the last
term, are both 1.
PROOF. This follows by a simple induction. It is clearly true for the 1-bit Gray
code. If it is true for the (k − 1)-bit Gray code, then the inductive definition implies
that it is true for the k-bit Gray code, because each half of this sequence is just the
(k − 1)-bit Gray code with a fixed bit prefixed (0 for the first half, and 1 for the
second). This leaves us with the question of comparing:
• the two middle terms — but these are just 0s_m, 1s_m, and they differ in only
1 bit;
• the first and last elements — but these are 0s_1, 1s_1, and they also differ in
only 1 bit.
□
Since vertices whose numbers differ in only one bit are adjacent in a hyper-
cube, the k bit Gray code sequences provides us with a way to map a loop of size
2k into a hypercube:
PROPOSITION 3.14. Let S = {s_1, . . . , s_n} be the k-bit Gray code sequence. In
addition, suppose we have a d-dimensional hypercube, where d ≥ k — its vertices are
encoded by d-bit binary numbers. If z is any (d − k)-bit binary number, then the sequence
of vertices whose encodings are {zs_1, . . . , zs_n} is an embedding of a loop of size n in the
hypercube.
We can use multiple Gray code sequences to map a multi-dimensional lattice
of vertices into a hypercube of sufficiently high dimension. This lattice should
actually be regarded as a torus. Here is an example:
E XAMPLE 3.15. Suppose we have a two dimensional lattice of processors that
we want to simulate on a hypercube computer whose dimension is at least 7. Each
processor in this lattice is denoted by a pair of numbers (u, v), where (we suppose)
u runs from 0 to 7 (so it takes on 2^3 = 8 values) and v runs from 0 to 15 (so it takes on
2^4 = 16 values). We use the 3- and 4-bit Gray codes:
• 3 bit Gray code, S1 : 000,001,011,010,110,111,101,100
• 4 bit Gray code, S2 : 0000,0001,0011,0010,
0110,0111,0101,0100, 1100,1101,1111,1110,
1010,1011,1001,1000
and we will assume that both of these sequences are numbered from 0 to 7
and 0 to 15, respectively. Now we map the processor numbered (u, v) into the
element of the hypercube whose binary representation has low-order 7 bits
{S1(u), S2(v)}. Processor (2, 3) is sent to position S1(2)S2(3) = 011 0010, i.e.
position 0110010 = 50 in the hypercube.
Note that size of each dimension of the lattice must be an exact power of 2.
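
The embedding of Example 3.15 is easy to compute. The sketch below follows the concatenation order of Proposition 3.16, taking the bits of S1(u) as the high-order bits of the 7-bit address (reversing that order would merely permute the coordinates of the hypercube); it uses the closed form i XOR (i >> 1) for the reflected Gray code.

    #include <stdio.h>

    /* i-th word of the r-bit reflected Gray code (closed form, which agrees
       with the recursive construction of Definition 3.12).                  */
    static unsigned gray(unsigned i) { return i ^ (i >> 1); }

    /* Embed lattice point (u, v), 0 <= u < 2^3, 0 <= v < 2^4, into a
       hypercube of dimension >= 7: the 3-bit code of u followed by the
       4-bit code of v, as in Proposition 3.16.                             */
    static unsigned embed(unsigned u, unsigned v)
    {
        return (gray(u) << 4) | gray(v);
    }

    int main(void)
    {
        unsigned a = embed(2, 3);
        printf("processor (2,3) -> hypercube vertex %u\n", a);  /* 0110010 = 50 */
        return 0;
    }

Because consecutive Gray-code words differ in one bit (Proposition 3.13), lattice neighbors are mapped to hypercube neighbors under this embedding.
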
The general statement of how we can embed lattices in a hypercube is:

PROPOSITION 3.16. Let L be a k-dimensional lattice of vertices such that the ith
subscript can take on 2^{r_i} possible values. Then this lattice can be embedded in an
r-dimensional hypercube, where r = Σ_{i=1}^{k} r_i. An element of L can be represented
by a sequence of k numbers {i_1, . . . , i_k}, and the embedding maps this element to the
element of the hypercube whose address has the binary representation
{S(i_1, r_1), . . . , S(i_k, r_k)}, where S(q, r) denotes the qth element of the r-bit reflected
Gray code.
This mapping has been implemented in the hardware of the Connection Ma-
chine (CM-2 model), so that it is easy to define and operate upon arrays of proces-
sors that have the property that the range of each subscript is exactly a power of
2.
Many parallel computers have been built using the hypercube network archi-
tecture including:
• the nCUBE computers, including the nCUBE/7 and nCUBE 2;
• The Cosmic Cube.
• the Connection Machines (CM-1 and CM-2) from Thinking Machines Cor-
poration. These are SIMD computers discussed in more detail later in this
book (see § 3 on page 101).
FIGURE 3.5. The shuffle-exchange network (processors numbered 0 through 7)

• the Intel iPSC series, including the iPSC/1, iPSC/2, iPSC/860, and the
Touchstone Gamma.
• the Caltech Mark II.
• the MasPar MP-1 computer.
Most of these are MIMD computers

E XERCISES .

3.5. Prove the statement that an n-dimensional hypercube is equivalent to an
order-n butterfly network with all of the columns collapsed to single vertices, and
half as many edges.
3.6. Compute an explicit embedding of a 3-dimensional lattice of size 8 × 4 × 2
into a 6-dimensional hypercube.
3.7. Suppose an n × n matrix is embedded in a hypercube (as an n × n lattice
— we assume that n is a power of 2). Find an algorithm for transposing this matrix
in O(lg n) time. Hint: Consider algorithm 2.5 on page 59, and the implementation
of DESCEND on the hypercube.

The Shuffle-Exchange Network


The last network-computer we will consider is the shuffle-exchange network.
Like the others it is physically realizable and has the ability to efficiently simulate
unbounded parallel computers. It has the added advantage that this simulation
can be done without any increase in the number of processors used.
Suppose n is a power of 2. Then a degree-n shuffle exchange network is con-
structed as follows: Start with n processors, numbered 0 to n − 1, arranged in a
linear array except that processor i is connected to:
1. processor i + 1 if i is even;
2. processor j, where j ≡ 2i (mod n − 1);
3. itself, if i = n − 1 (rather than processor 0, as rule b would imply).
Figure 3.5 shows an 8-node shuffle-exchange network.
Here the shaded lines represent exchange lines: data can be swapped be-
tween processors connected by such lines — this movement will be denoted EX(i ).
The dark curved lines represent the shuffle lines — they connect processor i with
2i mod n − 1 (in the pascal sense). Although data can move in both directions
along these lines one direction is regarded as forward and the other is regarded as
reverse.
There are two main data-movement operations that can be performed on the
shuffle-exchange network:
1. Shuffle, PS(i ) (for “perfect shuffle”). Here data from processor i is moved
to processor 2i mod n − 1. The inverse operation PS−1 (i ) is also defined
(it moves data from processor i to processor j, where i ≡ 2j (mod n − 1)).
Most of the time these operations will be applied in parallel to all proces-
sors — the parallel operations will be denoted PS and PS−1 , respectively.
2. Exchange, EX(i ). Here data from processor i is moved to processor i + 1 if
i is even and i − 1 if i is odd.
We will consider the effects of these operations. Suppose b(i ) is the binary repre-
sentation of the number i.
PROPOSITION 3.17. Suppose processor i has data in it and:
1. PS(i) will send that data to processor j, or;
2. EX(i) will send that data to processor j′.
Then:
1. b(j) is the result of cyclically permuting b(i) one bit to the left;
2. b(j′) is the same as b(i) except that the low-order bit is different.
PROOF. Recall that we are considering n to be a power of 2. The second statement
is clear. The first statement follows from considering how j is computed: i is multiplied
by 2 and reduced mod (n − 1). Multiplication by 2 shifts the binary representation
of a number one bit to the left. If the high-order bit is 1, it becomes equal to n after
multiplication by 2 — and n is congruent to 1 (mod n − 1), so that bit reappears in the
low-order position. □
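
Proposition 3.17 can be checked mechanically: with n = 2^k, PS(i) is the cyclic left rotation of the k-bit representation of i, and EX(i) flips the low-order bit. A small C sketch (the assertion verifies the rotation against the wiring rule j = 2i mod (n − 1) for i < n − 1; processor n − 1 is connected to itself):

    #include <stdio.h>
    #include <assert.h>

    #define K 3
    #define N (1 << K)            /* number of processors, n = 2^k */

    /* PS(i): cyclic left rotation of the k-bit representation of i. */
    static unsigned ps(unsigned i)
    {
        return ((i << 1) | (i >> (K - 1))) & (N - 1);
    }

    /* EX(i): flip the low-order bit, i.e. pair 2m with 2m+1. */
    static unsigned ex(unsigned i) { return i ^ 1u; }

    int main(void)
    {
        for (unsigned i = 0; i < N; i++) {
            if (i < N - 1) assert(ps(i) == (2 * i) % (N - 1));
            printf("PS(%u) = %u   EX(%u) = %u\n", i, ps(i), i, ex(i));
        }
        return 0;
    }
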

Our main result concerning the shuffle-exchange network is:


L EMMA 3.18. Every normal algorithm that runs on a degree-k butterfly network
(that has k2k processors) in t time units can run on a 2k -processor shuffle-exchange net-
work in O(t) time.
P ROOF. We assume that the processors of the shuffle-exchange network can
carry out the same operations as the processors of the butterfly. The only thing that
remains to be proved is that the data-movements of the butterfly can be simulated
on the shuffle-exchange network.
We will carry out this simulation (of data-movements) in such a way that the
processors of the shuffle-exchange network correspond to the columns of the but-
terfly.
In other words we will associate all processors of column i of the butterfly
with processor i in the shuffle-exchange network. That it is reasonable to associate
all processors in a column of the butterfly with a single processor of the shuffle-
exchange network follows from the fact that the algorithm we are simulating is
normal — only one rank of processors is active at any given time.
Recall how the processors of the butterfly were connected together. Processor
dr,i (r is the rank) was connected to processors dr−1,i and dr−1,i0 , where i and i0
differ (in their binary representations) only in the r − 1st bit from the left. Let us
simulate the procedure of moving data from dr,i to dr−1,i0 :
Perform PSr on all processors in parallel. Proposition 3.17 implies that this
will cyclically left-shift the binary representation of i by r-positions. The r − 1st
bit from the left will wind up in the low-order position. EX will alter this value
(whatever it is) and PS−r will right-shift this address so that the result will be in
processor i0 .
We have (in some sense) simulated the copy operation from dr,i to dr−1,i0 : by
−r
PS ◦ EX(PSr (i )) ◦ PSr . The inverse copy operation (from dr−1,i0 to dr,i ) is clearly
simulated by (PS−r ◦ EX(PSr (i )) ◦ PSr )−1 = PS−r ◦ EX(PSr (i )) ◦ PSr .
Incidentally — these composite operations must be read from right to left.
There are several obvious drawbacks to these procedures: we must carry out
O(lg n) steps (to do PSr and PS−r ) each time; and we must compute PSr (i ) inside
each EX.
These problems are both solved as follows: Note that after doing the copy
from dr,i to dr−1,i0 : the data will be in rank r − 1 — consequently the next step will
be an operation of the form PS1−r ◦ EX(PSr−1 (i0 )) ◦ PSr−1 — and the composite
will have steps of the form PSr−1 ◦ PS−r = PS−1 .
Consequently the simulations aren’t so time-consuming if we compose suc-
cessive operations and cancel out terms that are inverses of each other. In fact,
with this in mind, we can define:
• Simulation of dr,i → dr−1,i0 :PS−1 ◦ EX(PSr (i )).
• Simulation of dr,i → dr−1,i :PS−1 (here we have represented movement of
dr,i → dr−1,i by PS−r ◦ PSr before canceling).
Here we have lost the simple correspondence between columns of the butterfly
and processors of the shuffle-exchange network. Now processor dr,i in the butter-
fly corresponds to processor PSr (i ) of the shuffle-exchange network — here we are
regarding PS as an operation performed upon numbers (like squaring) as well as
a data-movement command on a computer.
This correspondence is illustrated by figure 3.6.
Here the bottom row represents the processors of the shuffle-exchange com-
puter and the top rows represent the butterfly computer (with its interconnections
drawn very lightly).
The correspondence between ranks 1, 2 and 3 of the butterfly and the shuffle-
exchange computer are indicated by curved lines. The top row of the butterfly
corresponds to the shuffle-exchange computer in exactly the same way as the bot-
tom row so no lines have been drawn in for it.
It is easy to keep track of this varying correspondence and solve the second
problem mentioned above by doing the following: in each step manipulate a list
(data to be moved, column #). The first item is the data upon which we are perform-
ing calculations. The second is the column number of the butterfly that the data
is supposed to be in. In each step we only carry out the PS and PS−1 operations
on the column numbers so that they reflect this varying correspondence. Our final
simulation program can be written as:
• Simulation of dr,i → dr−1,i0 :PS−1 ◦ (column # i) — the EX-operation is
only carried out on the data portion of the lists — the PS−1 is carried out
on both portions.
FIGURE 3.6. The correspondence between butterfly ranks and the processors of the
shuffle-exchange computer

• Simulation of dr,i → dr−1,i :PS−1 — carried out on both portions of whole


lists.
• Simulation of dr,i → dr+1,i0 :EX(column # i0 ) ◦ PS — the EX-operation is
only carried out on the data portion of the lists — the PS is carried out
on both portions. Note that we most carry out the EX-operation on the
processor whose column# field is i0 rather than i. It is easy for a processor
to determine whether it has column# i0 , however. The processors with
column numbers i and i0 will be adjacent along an EX-communication line
before the PS-operation is carried out. So our simulation program has
the processors check this; set flags; perform PS, and then carry out EX-
operations if the flags are set.
• Simulation of dr,i → dr+1,i :PS — carried out on both portions of whole
lists.
r
This immediately implies that we can implement the DESCEND and ASCEND
algorithms on the shuffle-exchange computer — we “port” the Butterfly versions
of these algorithms (3.2 and 3.3 on page 63) to the shuffle-exchange computer.
A LGORITHM 3.19. DESCEND Algorithm on the Shuffle-Exchange
computer. Suppose the input data is T [0], . . . , T [n − 1], with n = 2k and we
have a shuffle-exchange computer with n processors. The first elements of these
lists will be called their data portions. In each phase of the Shuffle-Exchange
version of the DESCEND algorithm (2.1 on page 58) perform the following
operations:
do in parallel (all processors)
L1 ← { T [i ], i } in processor i
for j ← k − 1 downto 0 do
do in parallel (all processors)


Perform PS
L2 ← L1
Perform EX upon the
data-portions of L2 in each processor.
Each processor now contains two lists:
L1 = old { T [m], m},L2 = old { T [m + 2 j ], m}
Compute OPER(m, j; T [m], T [m + 2 j ])

Here is an example of this algorithm. Suppose n = 23 , so that k = 3. We have


an array of 8 processors with a data-item ti in processor i. If we use binary notation
for the processor-numbers we start out with:

{000,t0 } {001,t1 } {010,t2 } {011,t3 } {100,t4 } {101,t5 } {110,t6 } {111,t7 }


In the first step of the DESCEND algorithm, we perform PS. The result is:

{000,t0 } {100,t4 } {001,t1 } {101,t5 } {010,t2 } {110,t6 } {011,t3 } {111,t7 }


Note that the data-elements that are now adjacent to each other are the ones
whose original index-numbers differ by 22 = 4:
• {000,t0 } and {100,t4 };
• {001,t1 } and {101,t5 };
• {010,t2 } and {110,t6 };
• {011,t3 } and {111,t7 };
We are now in a position to perform OPER(m, j; T[m], T[m + 2^2]) (after exchanging
data via an EXCHANGE operation).
The next PS operation results in the array

{000,t0 } {010,t2 } {100,t4 } {110,t6 } {001,t1 } {011,t3 } {101,t5 } {111,t7 }


Now the pairs of data-items that are comparable via the EXCHANGE opera-
tion are
• {000,t0 } and {010,t2 };
• {100,t4 } and {110,t6 };
• {001,t1 } and {011,t3 };
• {101,t5 } and {111,t7 };
and we can perform the next phase of the DESCEND algorithm. The final PS
operation shuffles the array elements back to their original positions, and we can
use the EXCHANGE operation to operate on even-numbered elements and their
corresponding next higher odd-numbered elements.
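
The worked example can be reproduced mechanically: each PS step permutes the whole array, and afterwards the EX lines pair items whose original indices differ by 2^j. The following C simulation tracks only the index portion of each list (the data portion is irrelevant to the pairing); the printed pairs match the ones listed above.

    #include <stdio.h>

    #define K 3
    #define N (1 << K)

    /* cyclic left rotation of a k-bit index: the PS data movement */
    static unsigned ps(unsigned i)
    {
        return ((i << 1) | (i >> (K - 1))) & (N - 1);
    }

    int main(void)
    {
        unsigned pos[N];                 /* pos[p] = original index of the item
                                            currently held by processor p       */
        for (unsigned p = 0; p < N; p++) pos[p] = p;

        for (int j = K - 1; j >= 0; j--) {
            /* one PS step: the item in processor p moves to processor ps(p) */
            unsigned next[N];
            for (unsigned p = 0; p < N; p++) next[ps(p)] = pos[p];
            for (unsigned p = 0; p < N; p++) pos[p] = next[p];

            /* the EX lines now pair items whose original indices differ by 2^j */
            printf("after PS, EX pairs:");
            for (unsigned p = 0; p < N; p += 2)
                printf("  (t%u,t%u)", pos[p], pos[p + 1]);
            printf("   (difference %u)\n", 1u << j);
        }
        return 0;
    }
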
A LGORITHM 3.20. ASCEND Algorithm on the Shuffle-Exchange computer.
Suppose the input data is T [0], . . . , T [n − 1], with n = 2k and we have a shuffle-
exchange computer with n processors. The first elements of these lists will be
called their data portions. In each phase of the Shuffle-Exchange version of the
ASCEND algorithm (2.2 on page 58) perform the following operations:
do in parallel (all processors)
L1 ← { T [i ], i } in processor i
for j ← 0 to k − 1 do
do in parallel (all processors)


L2 ← L1
Perform EX upon the
data-portions of L2 in each processor.
Perform PS−1 in each processor, with both lists
Each processor now contains two lists:
L1 = old { T [m], m},L2 = old { T [m + 2 j ], m}
Compute OPER(m, j; T [m], T [m + 2 j ])
COROLLARY 3.21. An algorithm for a CRCW-unbounded computer that
executes in α steps using n processors and β memory locations can be simulated on an
n-processor shuffle-exchange computer in O(α⌈β/n⌉ lg^2 n) time. Each processor of the
shuffle-exchange computer must have at least (β/n) + 3 memory locations of its own.
Here we are assuming that n is a power of 2.

3.3. Discussion and further reading. The shuffle-exchange network is also
frequently called the DeBruijn network. This network, and variations on it, are
widely used in existing parallel computers.
• The NYU Ultra Computer project is currently building Ultra III, which will
be based on a shuffle-exchange network. See [140] for some additional
material on this type of machine.
• Many new computer systems under development will use this design, in-
cluding:
1. the Triton project at the University of Karlsruhe. Here, according to
Dr. Christian Herter3:
“. . . to select a topology with very low average diam-
eter because the average diameter is directly respon-
sible for the expected latency of the network. The best
network class we found are DeBruijn nets also known
as perfect shuffle (which is nearly equivalent to shuf-
fle exchange). Those networks have an average di-
ameter very close to the theoretical limit. Each node
has indegree two and outdegree two (constant for all
network sizes) the (max.) diameter is still log N the
average diameter is approximately log N - 1.7.”
This machine is intended to be a SIMD-MIMD hybrid that initially
would have 256 processors. See [71] for more information on this
project.
2. the Cedar Machine, being developed at the Center for Supercomputing
Research and Development, University of Illinois at Urbana-Champaign.
See [55].

E XERCISES .

3Private communication.
F IGURE 3.7. The cube-connected cycles architecture

3.8. Devise a version of Pascal to run on the shuffle exchange computer, along
the lines of butterfly pascal (page 66).
3.9. Write out the Bitonic Sort algorithm (3.4 on page 21) for the
shuffle-exchange computer in terms of the PS and EX operations.

4. Cube-Connected Cycles
One of the limiting factors in designing a computer like the butterfly com-
puter is the number of connections between processors. Clearly, if processors are
only allowed to have two communication lines coming out of them the whole net-
work will form a tree or ring. Since both of these networks only allow a limited
amount of data to flow through them, it is clear that networks with high data flow
must have at least three communication lines coming out of most of the proces-
sors. The butterfly network is close to this optimum since it has four lines through
most processors. It is possible to design a network that has the same asymptotic
performance as the butterfly with precisely three communications lines coming
out of each processor. This is called the cube-connected cycles (or CCC) network.
Figure 3.7 shows how the processors are interconnected — as usual the nodes rep-
resent processors and the edges represent communication lines. These were first
described by Preparata and Vuillemin in [130]
In general a k-dimensional CCC is a k-dimensional cube with the vertices re-
placed with loops of k processors. In a k-dimensional cube every vertex has k
edges connected to it but in the k-dimensional CCC only one of the processors in
a loop connects to each of the k dimensions so that each processor has only three
communication lines connected to it — two connecting it to other processors in its
loop and one connecting it to another face of the cube.
If we identify rank 0 processors in the butterfly with rank k processors we get
a graph that turns out to contain the k-dimensional CCC network as a subgraph.
In addition, every data movement in the butterfly network can be simulated in the
embedded CCC, except that some movements in the butterfly require two steps in
the CCC since it lacks some of the edges of the butterfly.
Now we are ready to discuss the development of algorithms for the CCC
network. In order to implement an efficient scheme for addressing processors, we
make an additional assumption about the dimension of the hypercube that is the
backbone of the network. We will assume that n = 2^k, and that k is of the form
k = r + 2^r. We are, in effect, assuming that
the dimension of the hypercube that forms the backbone of a
Cube-Connected Cycles network must be a power of 2 (i.e., 2^r).
Although Cube-Connected Cycles of other dimensions are perfectly well-defined,
the addressing scheme that we will present imposes this requirement (the
3-dimensional example in figure 3.7 on page 77 doesn't meet this requirement).
Each vertex has a k-bit address, which is regarded as a pair of numbers (α, ℓ),
where α is a (k − r)-bit number and ℓ is an r-bit number. Each vertex has three
connections to it:
• F — the forward connection. F(α, ℓ) is connected to B(α, (ℓ + 1) mod 2^r);
• B — the back connection. B(α, ℓ) is connected to F(α, (ℓ − 1) mod 2^r);
• L — the lateral connection. L(α, ℓ) is connected to L(α + e·2^ℓ, ℓ), where
e = 1 if the ℓth bit of α is 0, and e = −1 if the ℓth bit of α is 1.
The forward and back connections connect the vertices within cycles, and the lat-
eral connections connect different cycles, within the large-scale hypercube. If we
shrink each cycle to a single vertex, the result is a hypercube, and the lateral con-
nections are the only ones that remain.
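
The three connections are easy to compute from a vertex's address. The following C sketch (with r = 2, so cycles of length 4 and k = 6; the struct and function names are invented for illustration) returns the forward, back and lateral neighbors of a vertex (α, ℓ); note that adding e·2^ℓ with the sign determined by bit ℓ of α is the same as flipping that bit.

    #include <stdio.h>

    #define R 2                       /* cycles of length 2^R = 4; k = R + 2^R = 6 */
    #define CYCLE (1u << R)

    struct ccc_vertex { unsigned alpha;   /* which cycle: a (k - r)-bit number   */
                        unsigned ell; };  /* position in the cycle: an r-bit number */

    /* forward neighbor: next processor in the same cycle */
    static struct ccc_vertex fwd(struct ccc_vertex v)
    { v.ell = (v.ell + 1) % CYCLE; return v; }

    /* back neighbor: previous processor in the same cycle */
    static struct ccc_vertex back(struct ccc_vertex v)
    { v.ell = (v.ell + CYCLE - 1) % CYCLE; return v; }

    /* lateral neighbor: adding e*2^ell with e = +/-1 chosen by bit ell of alpha
       is the same as flipping bit ell of alpha                                   */
    static struct ccc_vertex lateral(struct ccc_vertex v)
    { v.alpha ^= (1u << v.ell); return v; }

    int main(void)
    {
        struct ccc_vertex v = { 5, 1 };            /* alpha = 0101, ell = 1 */
        struct ccc_vertex f = fwd(v), b = back(v), l = lateral(v);
        printf("F -> (%u,%u)  B -> (%u,%u)  L -> (%u,%u)\n",
               f.alpha, f.ell, b.alpha, b.ell, l.alpha, l.ell);
        /* prints: F -> (5,2)  B -> (5,0)  L -> (7,1)  (bit 1 of 0101 is 0, so e = +1) */
        return 0;
    }
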
In fact Zvi Galil and Wolfgang Paul show that the cube-connected cycles net-
work could form the basis of a general-purpose parallel computer that could effi-
ciently simulate all other network-computers in [54].
We will present several versions of the DESCEND and ASCEND algorithms.
The first will closely mimic the implementations of these algorithms for the but-
terfly computer. Each loop of the CCC network will have a single data-item, and
this will be cyclically shifted through the network in the course of the algorithm.
This algorithm only processes 2^{k−r} data-items, but does it in O(k − r) steps. We
define the loop-operations:
D EFINITION 4.1. Define the following data-movement operations on the CCC
network:
• F-LOOP is the permutation that transmits all data within the cycles of the
CCC in the forward direction, i.e., through the F (α, `) port.
• B-LOOP is the permutation that transmits all data within the cycles of the
CCC in the backward direction, i.e., through the B(α, `) port.
In the following algorithm, we assume that processor i has two memory-
locations allocated: T[i] and T′[i].
ALGORITHM 4.2. Wasteful version of the DESCEND algorithm. The input
consists of 2^{k−r} data items stored, respectively, in D[0], . . . , D[2^{k−r} − 1].

do in parallel (for 0 ≤ i ≤ 2^{k−r} − 1)
  T[i·2^r + k − r − 1] ← D[i]
for j ← k − r − 1 downto 0 do
  do in parallel (for all processors)
    Transmit T[∗] along lateral lines, and
      store received data in T′[∗]
    (if the jth bit of i is 0,
      T′[i·2^r + j] = old value of T[i·2^r + 2^{j+r} + j];
     if the jth bit of i is 1,
      T′[i·2^r + j] = old value of T[i·2^r − 2^{j+r} + j]
     — i.e., the active processors now have the old values
      of T[i], T[i + 2^j])
    Compute OPER(i, j; T[i], T[i + 2^j])
  Perform B-LOOP
endfor

Note that processor number i2r + k − r − 1 is the processor with coordinates


(i, k − r − 1).

ALGORITHM 4.3. Wasteful version of the ASCEND algorithm. The input
consists of 2^{k−r} data items stored, respectively, in D[0], . . . , D[2^{k−r} − 1].

do in parallel (for 0 ≤ i ≤ 2^{k−r} − 1)
  T[i·2^r] ← D[i]
for j ← 0 to k − r − 1 do
  do in parallel (for all processors)
    Transmit T[∗] along lateral lines, and store received data in T′[∗]
    (if the jth bit of i is 0,
      T′[i·2^r + j] = old value of T[i·2^r + 2^{j+r} + j];
     if the jth bit of i is 1,
      T′[i·2^r + j] = old value of T[i·2^r − 2^{j+r} + j]
     — i.e., the active processors now have the old values
      of T[i], T[i + 2^j])
    Compute OPER(i, j; T[i], T[i + 2^j])
  Perform F-LOOP
endfor

Now we will present the implementations of DESCEND and ASCEND of


Preparata and Vuillemin in [131]. Their implementations utilize all of the proces-
sors in each time-step. Their algorithms for the n-processor CCC operate upon
n data-items and execute in the same asymptotic time as the generic versions
(O(z · lg n), where z is the execution-time of OPER( a, b; U, V )).
We begin with the DESCEND algorithm. To simplify the discussion slightly,
we will initially assume that we are only interested in carrying out the first k − r
iterations of the DESCEND algorithm — so the for-loop only has the subscript
going from k − 1 to k − r. We can carry out these iterations of the algorithm in
a way that is somewhat like the “wasteful” algorithm, except that:
• Now we have more data to work with — the “wasteful” algorithm only
had data in a single processor in each loop of the CCC.
• We must process each data-item at the proper time. This means that in
the first few steps of the algorithm, we cannot do anything with most of
the data-items.For instance, in the first step, the only data item that can
be processed in each loop is the one that the “wasteful” algorithm would
have handled — namely, the one with `-coordinate equal to k − 1. In the
second step this data-item will have been shifted into the processor with
`-coordinate equal to k − 2, but another data item will have shifted into
the processor with coordinate k − 1. That second data-item will now be
ready to be processed in the first iteration of the DESCEND algorithm.
In this manner, each step of the algorithm will “introduce” a new ele-
ment of the loop to the first iteration of the DESCEND algorithm. This will
continue until the k + 1st , at which time the element that started out being
in the processor with ` coordinate k − 2 will be in position to start the al-
gorithm (i.e., it will be in the processor with ` coordinate k − 1). Now we
will need another k − r steps to complete the computations for all of the
data-items.
• We must have some procedure for “turning off” most of the processors in
some steps.

A LGORITHM 4.4. If we have an n-processor CCC network and n data-items


T [0], . . . , T [n − 1], where T [i ] is stored in processor i (in the numbering scheme
described above). Recall that k = r + 2r .

do in parallel (for all processors)


for i ← 2r − 1 downto −2r do
do in parallel (for all processors whose α coordinate
satisfies 0 ≤ α < 2k−r )
and whose `-coordinate satisfies
max(i, 0) ≤ ` < min(2r , 2r + i ))
(if bit` (α) = 0 then processor (α, `) contains
T [α2r + `]
and if bit` (α) = 1 then processor (α, `) contains
T [α2r + ` + 2`+r ])
All processors transmit their value of T [∗] along a lateral
communication line
Each processor (α, `) computes OPER( a, b; U, V )
where: a = α2r + ((i − ` − 1) mod 2r )}
b = `+r
U = T [α2r + `]
V = T [α2r + ` + 2`+r ]
B-LOOP
endfor

This handles iterations k − 1 to k − r. The remaining iterations do not involve


any communication between distinct cycles in the CCC network.

We will discuss how the remaining iterations of the algorithm are performed
later.
We will analyze how this algorithm executes. In the first iteration, i = 2r − 1,


and the only processors (α, ℓ) that can be active must satisfy the conditions
0 ≤ α < 2^{k−r} and max(i, 0) = 2^r − 1 ≤ ℓ < min(2^r, 2^r + i) = 2^r,
so ℓ = 2^r − 1. Data-item T[α·2^r + 2^r − 1] is stored in processor (α, 2^r − 1), for all α
satisfying the condition above. In this step
a = α·2^r + (((2^r − 1) − ℓ − 1) mod 2^r) = α·2^r + (−1 mod 2^r) = α·2^r + 2^r − 1
b = ℓ + r = k − 1 (recalling that r + 2^r = k)
The lateral communication-lines in these processors connect processor (α, 2^r − 1)
to (α′, 2^r − 1), where |α − α′| = 2^{2^r − 1} = 2^{k−r−1}. These two processors contain
data-items T[α·2^r + 2^r − 1] and T[α′·2^r + 2^r − 1], where

|(α·2^r + 2^r − 1) − (α′·2^r + 2^r − 1)| = 2^r·|α − α′| = 2^r · 2^{k−r−1} = 2^{k−1}
Consequently, in the first iteration, the algorithm does exactly what it should —
it performs OPER(a, b; U, V), where
a = α·2^r + ℓ
b = k − 1
U = T[a] and V = T[a + 2^{k−1}] (for the a with bit_{k−1}(a) = 0).
In the next step, ℓ can only take on the values 2^r − 1 and 2^r − 2. Our original
data that was in processor (α, 2^r − 1) has been shifted into processor (α, 2^r − 2). In
this case
a = α·2^r + (((2^r − 2) − ℓ − 1) mod 2^r) = α·2^r + (−1 mod 2^r) = α·2^r + 2^r − 1
b = ℓ + r = k − 2
Note that the quantity a remains unchanged — as it should. This is because
a represents the index of that data-item being processed, and this hasn’t
changed. The original data-item was shifted into a new processor (with a lower
`-coordinate), but the formula for a compensates for that. It is also not difficult to
verify that the correct version of OPER( a, b; U, V ) is performed in this step.
This iteration of the algorithm has two active processors in each cycle of the
CCC network. It processes all data-items of the form T [α2r + 2r − 1] and T [α2r +
2r − 2] — a new data-item has entered into the “pipeline” in each cycle of the
network. This is what happens for the first 2r − 1 iterations of the algorithm — new
elements enter into the algorithm. At iteration 2r − 1 — when i = 0, the first data-
items to enter the entire algorithm (namely the ones of the form T [α2r + 2r − 1], for
all possible values of α) are completely processed. We must continue the algorithm
for an additional 2r steps to process the data-items that entered the algorithm late.
The if-statement that imposes the condition max(i, 0) ≤ ` < min(2r , 2r + i )


controls which processors are allowed to be active in any step.
We have described how to implement the first k − r − 1 iterations of the DE-
SCEND algorithm. Now we must implement the remaining r iterations. In these
iterations, the pairs of data-items to be processed in each call of OPER( a, b; U, V )
both lie in the same loop of the CCC network, so that no lateral moves of data are
involved (in the sense of the definition on page 78). Consider the ith step of the
last r iterations of the DESCEND algorithm. It involves the computation:
if (bitr−i ( j) = 0 then
OPER( j, r − i; T [ j], T [ j + 2r−i ])
It is completely straightforward to implement this on a linear array of processors
(and we can think of the processors within the same cycle of the CCC as a linear
array) — we:
1. Move data-item T [ j + 2r−i ] to processor j, for all j with the property that
bitr−i ( j) = 0. This is a parallel data-movement, and moves each selected
data-item a distance of 2r−i . The execution-time is, thus, proportional to
2r − i .
2. For every j such that bitr−i ( j) = 0, processor j now contains T [ j] and
T [ j + 2r−i ]. It is in a position to compute OPER( j, r − i; T [ j], T [ j + 2r−i ]). It
performs the computation in this step.
3. Now the new values of T [ j] and T [ j + 2r−i ] are in processor j. We send
T [ j + 2r−i ] back to processor j + 2r−i — this is exactly the reverse of step 1
above. The execution-time is also 2r−i .
The total execution-time is thus
T = Σ_{i=0}^{r} 2^{r−i} = 2^{r+1} − 1 = 2(k − r) − 1
We can combine all of this together to get our master DESCEND algorithm for
the CCC network:
A LGORITHM 4.5. If we have an n-processor CCC network and n data-items
T [0], . . . , T [n − 1], where T [i ] is stored in processor i (in the numbering scheme
described above). Recall that k = r + 2r .
do in parallel (for all processors)
for i ← 2r − 1 downto −2r do
do in parallel (for all processors whose α coordinate
satisfies 0 ≤ α < 2k−r )
and whose `-coordinate satisfies
max(i, 0) ≤ ` < min(2r , 2r + i ))
(if bit` (α) = 0 then processor (α, `) contains
T [α2r + `]
and if bit` (α) = 1 then processor (α, `) contains
T [α2r + ` + 2`+r ])
All processors transmit their value of T [∗] along a lateral
communication line
Each processor (α, `) computes OPER( a, b; U, V )
where: a = α2r + ((i − ` − 1) mod 2r )}
b = `+r
U = T [α2r + `]
V = T [α2r + ` + 2`+r ]
B-LOOP
endfor

for i ← r downto 0 do
for all processors j such that biti ( j) = 1
transmit T [ j] to processor j − 2i
endfor
for all processors j such that biti ( j) = 0
transmit T [ j + 2i ] to processor j + 2i
endfor
All processors compute OPER(i, j; U, V)
/* Processors j with bit_i(j) = 0 have values of T[j], T[j + 2^i]
and processors j with bit_i(j) = 1 have values of T[j], T[j − 2^i],
so they are able to perform this computation. */
endfor
The “transmit” operations are completely straightforward — they simply in-
volve a sequence of steps in which data is sent to the appropriate neighboring
processor.
The corresponding ASCEND algorithm is:
A LGORITHM 4.6. If we have an n-processor CCC network and n data-items
T [0], . . . , T [n − 1], where T [i ] is stored in processor i (in the numbering scheme
described above).
for i ← r downto 0 do
for all processors j such that biti ( j) = 1
transmit T [ j] to processor j − 2i
endfor
for all processors j such that biti ( j) = 0
Compute OPER( j, i; T [ j], T [ j + 2i ])
(Both of these data-items are now in processor j)
for all processors j such that biti ( j) = 0
transmit T [ j + 2i ] to processor j + 2i
endfor
endfor

do in parallel (for all processors)


for i ← −2r to 2r − 1 do
do in parallel (for all processors whose α coordinate
satisfies 0 ≤ α < 2k−r )
and whose `-coordinate satisfies
max(i, 0) ≤ ` < min(2r , 2r + i ))
(if bit` (α) = 0 then processor (α, `) contains
T [α2r + `]
and if bit` (α) = 1 then processor (α, `) contains
T [α2r + ` + 2`+r ])
All processors transmit their value of T [∗] along a lateral
communication line
Each processor (α, `) computes OPER( a, b; U, V )
where: a = α2r + ((i − ` − 1) mod 2r )}
b = `+r
U = T [α2r + `]
V = T [α2r + ` + 2`+r ]
F-LOOP
endfor
Although these algorithms look considerably more complex than the corre-
sponding algorithms for the Butterfly and the shuffle-exchange network, their
execution-time is comparable.

E XERCISES .

4.1. find an embedding of the CCC-network in a butterfly network that has


had its highest and lowest ranks identified (we will call this the m-Butterfly net-
work).
1. Show that the wasteful DESCEND and ASCEND algorithms (4.2 and 4.3
on page 79) of the CCC map into algorithms 3.2 and 3.3 on page 63 for the
Butterfly network, under this isomorphism.
2. Map the master DESCEND and ASCEND algorithms for the CCC into an
algorithm for the m-Butterfly network under this isomorphism
3. Implement the shift-operation (2.8 on page 60) on the CCC network. De-
scribe all of the data-movements.
4. Implement the Bitonic sort algorithm (originally defined in § 3 on page 19)
on the CCC network.

5. Dataflow Computers
Recall the definition of computation network in 5.16 on page 42. Dataflow
computers represent an interesting approach to parallel computation in which a
computation network for a problem is directly implemented by the hardware.
The processors of a dataflow computer perform the arithmetic operations of the
program and directly correspond to the vertices of the computation network. A
program for a dataflow computer consists of a kind of symbolic representation of
a computation network.
As one might think, many complex and interesting issues arise in connection
with the design of dataflow computers — for instance:
• The architecture must reconfigure itself somehow during execution of a
program.
• The vertices in the hardware-implemented computation network must


perform various auxiliary functions, such as the queuing of data that has
been received along input lines before the other data needed to perform a
computation. For instance, if a processor is to add two numbers, and one
of the numbers has arrived and the other hasn’t, the processor must hold
the number that is available and wait for the second number.
There is a great deal of research being conducted on the development of dataflow
computers. At M.I.T. a group is working on the Monsoon Project — see [13], by
Beckerle, Hicks, Papadopoulos, and Traub. In Japan a group is developing the
Sigma-1 computer — see [143] and [142], by Sekiguchi, Hoshino, Yuba, Hiraki,
and Shimada.
More recently, the Multicon corporation in Russia has developed a massively
parallel dataflow computer with 92000 processors. It fits into a package the size of
an IBM PC.
The Granularity Problem
In all of the simulation algorithms described above we assumed that the un-
bounded parallel computer had an amount of memory that was proportional to
the number of processors (this factor of proportionality is the term β that appears
so prominently). Suppose we want to simulate an unbounded parallel computer
with a large amount of RAM — much larger than the number of processors. Also
suppose that the processors in the network used to carry out the simulation to-
gether have enough memory between them to accomplish this. We then encounter
the so-called Granularity Problem — how is this distributed memory to be effi-
ciently accessed?
This is an interesting problem that has been considered before in the context
of distributed databases — suppose a database is broken up into many pieces in
distinct computers that are networked together. Suppose people use all of the
computers in the network to access the database. Under some (possibly rare) cir-
cumstances it is possible that all of the users want data that is located on one
computer in the network, and response time will slow down tremendously because
each individual computer in the network can only handle a small number of
requests for data at a time. The question arises: Is it possible to organize the data
in the database (possibly with multiple copies of data items) so that access to data
is always fast, regardless of how users request it?
The term granularity comes from the fact that the number of processors is
much lower than the amount of memory available so that each processor has a
sizable chunk of local memory that must be accessed by all of the other proces-
sors. This was an open question until recently. Work of Upfal and Wigderson
solved this problem in a very satisfactory way. Although this entire chapter has
made references to graphs to some extent (for instance all of the networks we have
considered are graphs), the present section will use slightly more graph theory. We
will make a few definitions:
D EFINITION 5.1. A graph (or undirected graph) is a pair (V, E), where V is a set
of objects called the vertices of the graphs and E is a set of ordered-pairs of vertices,
called the edges of the graph. In addition, we will assume the symmetry condition
If a pair (v1 , v2 ) ∈ E, then (v2 , v1 ) ∈ E also.
A directed graph (or digraph) is defined like an undirected graph except that the
symmetry condition is dropped.
F IGURE 3.8. A complete graph on 7 vertices

Throughout this section we will assume that our network computer has pro-
cessors connected in a complete graph. This is a graph in which every vertex is
connected to every other vertex. Figure 3.8 shows a complete graph.
If the actual network is not a complete graph, it turns out that a complete-graph
network computer can be efficiently simulated by the models presented above.
The main result is the following:
THEOREM 5.2. A program step on an EREW computer with n processors and
RAM bounded by a polynomial in n can be simulated by a complete graph computer in
O((lg n)(lg lg n)^2) steps.
The idea of this result is as follows:
1. Problems occur when many processors try to access the data in a single
processor’s local memory. Solve this by keeping many copies of each data
item randomly distributed in other processors’ local memory. For the time
being ignore the fact that this requires the memory of the whole computer
to be increased considerably.
2. When processors want to access local memory of some other processor,
let them randomly access one of the many redundant copies of this data,
stored in some other processor’s local memory. Since the multiple copies
of data items are randomly distributed in other processor’s local memory,
and since the processors that read this data access randomly chosen copies
of it, chances are that the bottleneck described above won’t occur. Most of
the time the many processors that want to access the same local data items
will actually access distinct copies of it.
We have, of course, ignored several important issues:
1. If multiple copies of data items are maintained, how do we insure that
all copies are current — i.e., some copies may get updated as a result of a
calculation and others might not.
2. Won’t total memory have to be increased to an unacceptable degree?
3. The vague probabilistic arguments presented above don’t, in themselves
prove anything. How do we really know that the possibility of many pro-
cessors trying to access the same copy of the same data item can be ig-
nored?
The argument presented above is the basis for a randomized solution to the gran-
ularity problem developed by Upfal in [161]. Originally the randomness of the al-
gorithm was due to the fact that the multiple copies of data items were distributed
randomly among the other processors. Randomness was needed to guarantee that
other processors trying to access the same data item would usually access distinct
copies of it. In general, the reasoning went, if the pattern of distributing the copies
of data items were regular most of the time the algorithm would work, but under
some circumstances the processors might try to access the data in the same pattern
in which it was distributed.
Basically, he showed that if only a limited number of copies of each data item
are maintained (≈ lg n) conflicts will still occur in accessing the data (i.e. proces-
sors will try to access the same copy of the same data item) but will be infrequent
enough that the expected access time will not increase unacceptably. The fact that a
limited number of copies is used answers the objection that the total size of the
memory would have to be unacceptably increased. Since this is a randomized algorithm,
Upfal rigorously computed the expected⁴ execution time and showed that it was
O(β lg^2 n) — i.e. that the vague intuitive reasoning used in the discussion follow-
ing the theorem was essentially correct. The issue of some copies of data getting
updated and others being out of date was addressed by maintaining a time stamp
on each copy of the data and broadcasting update-information to all copies in a
certain way that guaranteed that a certain minimum number of these copies were
current at any given time. When a processor accessed a given data item it was
required to access several copies of that data item, and this was done in such a
way that the processor was guaranteed of getting at least one current copy. After
accessing these multiple copies of the data item the processor would then simply
use the copy with the latest time stamp.
After this, Upfal and Wigderson (in [162]) were able to make the randomized
algorithm deterministic. They were able to show that there exist fixed patterns for
distributing data among the processors that allow the algorithm to work, even in
the worst case. This is the algorithm we will consider in this section.
Let U denote the set of all simulated RAM locations in the EREW computer.
The important idea here is that of an organization scheme S. An organization
scheme consists of an assignment of sets Γ(u) to every u ∈ U — where Γ(u) is the
set of processors containing RAM location u — with a protocol for execution of
read/write instructions.
We will actually prove the following result:
THEOREM 5.3. There exists a constant b_0 > 1, such that for every b ≥ b_0 and
c satisfying b^c ≥ m^2, there exists a consistent scheme with efficiency
O(b[c(lg c)^2 + lg n lg c]).
1. Note that we get the time estimates in the previous result by setting c pro-
portional to lg n. In fact, c must be ≥ log_b(m^2) = 2 log(m)/log(b).
2. Here:
m is the number of memory locations in the RAM that is being simulated,
n is the number of processors,

4In randomized algorithms the expected execution time is the weighted average of all possible execution
times, weighted by the probabilities that they occur.
b is a constant parameter that determines how many real memory locations


are needed to accomplish the simulation. It must be ≥ 4 but is otherwise
arbitrary.
To get some idea of how this works, suppose we are trying to simulate a megabyte
of RAM on a computer with 1024 processors. Each PE must contain 1024 simu-
lated memory locations following the straightforward algorithm in the previous
chapter, and each simulation step might require 1024 lg^2 n time. Using the present
algorithm, m = 2^20, n = 2^10, and the execution time is

$$O\left(b\left[\frac{40}{\lg b}\,\lg^2\!\left(\frac{40}{\lg b}\right) + 10\,\lg\left(\frac{40}{\lg b}\right)\right]\right)$$

with 40/lg b megabytes of real memory needed to simulate the one megabyte of
simulated memory. By varying b we determine the execution time of the simulation
versus the amount of real memory needed — where b must be at least 4.
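
The trade-off can be tabulated directly from the bound in Theorem 5.3. The small C program below (purely illustrative — the numbers are values of the asymptotic bound, not measured running times) evaluates c = 2 log m / log b and b[c(lg c)^2 + lg n lg c] for the megabyte example, for a few admissible values of b:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double m = 1048576.0;   /* 2^20 simulated memory locations */
        const double n = 1024.0;      /* 2^10 processors                 */
        const double lg_m = log2(m), lg_n = log2(n);

        for (double b = 4; b <= 256; b *= 4) {
            double c = 2.0 * lg_m / log2(b);                    /* c >= log_b(m^2)   */
            double lg_c = log2(c);
            double eff = b * (c * lg_c * lg_c + lg_n * lg_c);   /* Theorem 5.3 bound */
            printf("b = %3.0f   c = %5.1f   bound = %8.1f\n", b, c, eff);
        }
        return 0;
    }
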
In our scheme every item u ∈ U will have exactly 2c − 1 copies. It follows that
Γ(u) is actually a set of 2c − 1 values: {γ1 (u), . . . , γ2c−1 (u)}. These γ-functions can
be regarded as hashing functions, like those used in the sorting algorithm for the
hypercube computer. Each copy of a data item is of the form <value,time stamp>.
The protocol for accessing data item u at the tth instruction is as follows:
1. to update u, access any c copies in Γ(u), update their values and set their
time-stamp to t;
2. to read u access any c copies in Γ(u) and use the value of the copy with the
latest time-stamp;
The algorithm maintains the following invariant condition:

Every u ∈ U has at least c copies that agree in value and are


time-stamped with the index of the last instruction in which a
processor updated u.

This follows from the fact that every two c-subsets of Γ(u) have a non-empty inter-
section (because the size of Γ(u) is 2c − 1).
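
A toy sequential version of this protocol makes the invariant easy to see. In the sketch below (all names and data structures are invented for illustration — the real scheme spreads the 2c − 1 copies over distinct processors and accesses them in parallel), an update writes any c copies with the current time stamp, and a read examines any c copies and takes the value with the latest stamp; since any two c-subsets of the 2c − 1 copies intersect, the read is guaranteed to see the last update.

    #include <stdio.h>

    #define C 3
    #define NCOPIES (2 * C - 1)          /* every item u keeps 2c-1 copies */

    struct copy { int value; int stamp; };

    /* update: write any c of the copies and stamp them with time t */
    static void update_item(struct copy g[NCOPIES], int v, int t, const int which[C])
    {
        for (int i = 0; i < C; i++) { g[which[i]].value = v; g[which[i]].stamp = t; }
    }

    /* read: look at any c of the copies and use the latest time stamp */
    static int read_item(const struct copy g[NCOPIES], const int which[C])
    {
        int best = which[0];
        for (int i = 1; i < C; i++)
            if (g[which[i]].stamp > g[best].stamp) best = which[i];
        return g[best].value;
    }

    int main(void)
    {
        struct copy u[NCOPIES] = {{0,0},{0,0},{0,0},{0,0},{0,0}};
        int w[C] = {0, 1, 2};            /* copies touched by the update ...   */
        int r[C] = {2, 3, 4};            /* ... and by a later read: they must
                                            share at least one copy (here: 2)  */
        update_item(u, 42, 7, w);
        printf("%d\n", read_item(u, r)); /* prints 42 */
        return 0;
    }
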
Processors will help each other to access these data items according to the pro-
tocol. It turns out to be efficient if at most n/(2c − 1) data items are processed at
a time. We consequently shall partition the set of processors into k = n/(2c − 1)
groups, each of size 2c − 1. There will be 2c phases, and in each phase each group
will work in parallel to satisfy the request of one of its members. The current dis-
tinguished member will broadcast its request (access ui or write vi into ui )
to the other members of its group. Each of them will repeatedly try to access a
fixed distinct copy of ui . After each step the processors in this group will check
whether ui is still alive (i.e., c of its copies haven’t yet been accessed). When c of
a given data item’s copies have been accessed the group will stop working on it —
the copy with the latest time stamp is computed and sent to Pi .
Each of the first 2c − 1 phases will have a time limit that may stop processing
of the k data items while some are still alive (i.e., haven’t been fully processed).
We will show, however, that at most k/(2c − 1) of the original k items will remain.
These are distributed, using sorting, one to a group. The last phase (which has no
time limit) processes these items.
5.4. Let P_{(m−1)(2c−1)+i}, i = 1, . . . , 2c − 1, denote the processors in group m,
m = 1, . . . , k, k = n/(2c − 1). The structure of the jth copy of data item u is
<value_j(u), time_stamp_j(u)>.

Phase(i, time_limit):
{
  m ← ⌈processor # / (2c − 1)⌉
  f ← (m − 1)(2c − 1)
  P_{f+i} broadcasts its request (read(u_{f+i}) or
    update(u_{f+i}, v_{f+i})) to P_{f+1}, . . . , P_{f+2c−1};
  live(u_{f+i}) ← true;
  count ← 0;
  while live(u_{f+i}) and count < time_limit do
    count ← count + 1;
    P_{f+j} tries to access copy_j(u_{f+i});
    if permission granted
      if read request
        read <value_j(u_{f+i}), time_stamp_j(u_{f+i})>
      else /* update request */
        <value_j(u_{f+i}), time_stamp_j(u_{f+i})> ← <v_{f+i}, t>
    if < c copies of u_{f+i} are still alive
      live(u_{f+i}) ← false;
  endwhile
  if read request then find and send to P_{f+i}
    the value with the latest time stamp;
}   /* phase i */

Incidentally, the condition permission granted refers to the fact that more than
one processor may try to access the same copy of a data item in a single step. Only
one processor is given permission to perform the access.

ALGORITHM 5.5. Here is the top-level algorithm:

begin
  for i = 1 to 2c − 1 do
    run Phase(i, log_η 4c);   /* for a fixed η to be calculated later */
  endfor
  sort the k′ ≤ k live requests and route them
    to the first processors in the k′ first groups, one to each processor;
  run Phase(1, lg n);
end; end algorithm;


LEMMA 5.6. If the number of live items at the beginning of a phase is w ≤ k, then
after the first s iterations of the while loop at most 2(1 − 1/b)^s w live items remain.
The significant point here is that the number of live data-items is reduced by
a constant factor with each iteration. This implies that at most a logarithmic number
of iterations are required to reduce the number to zero.
PROOF. At the beginning of a phase there are w live items, and all of their
copies are alive, so there are a total of (2c − 1)w live copies. By the lemma above,
after s iterations the number of live copies is at most (1 − 1/b)^s (2c − 1)w. Since
|Γ′(u)| ≥ c for each live item, these can be the live copies of at most
(1 − 1/b)^s (2c − 1)w/c ≤ 2(1 − 1/b)^s w live items. □
COROLLARY 5.7. Let η = (1 − 1/b)^{−1}:
1. After the first log_η(4c − 2) iterations of the while loop, at most k/(2c − 1) live
items remain alive (so the last phase has to process at most k requests);
2. After log_η 4c ≤ log_η n iterations in a phase, no live items remain.
To complete the analysis note that each group needs to perform the following
operations during each phase: broadcast, maximum, summation (testing whether ui
is still alive). Also, before the last phase all the requests that are still alive are
sorted.
It remains to give an efficient memory access protocol. In each iteration of the
while loop in the algorithm the number of requests sent to each processor is equal
to the number of live copies of live data items this processor contains. (Recall that
a data item is called live if at least c copies of it are live.) Since a processor can only
process one data item at a time the number of copies processed in an iteration of
the while loop is equal to the number of processors that contain live copies of live
data items. We will, consequently, attempt to organize memory allocation in such
a way as to maximize this number. We will use bipartite graphs to describe this
memory allocation scheme.
DEFINITION 5.8. A bipartite graph is a graph with the property that its vertices
can be partitioned into two sets in such a way that the endpoints of each edge lie
in different sets. Such a graph is denoted G ( A, B, E), where A and B are the sets of
vertices and E is the set of edges.
Figure 3.9 shows a bipartite graph.
In our memory allocation scheme we will consider bipartite graphs
G(U, N, E), where U represents the set of m shared memory items and N is the set
of n processors. In this scheme, if u ∈ U is a data item then Γ(u) is the set of vertices
adjacent to u in the bipartite graph being used.
LEMMA 5.9. For every b ≥ 4, if m ≤ (b/(2e)^4)^{c/2} then there is a way to distribute
the 2c − 1 copies of the m shared data items among the n processors such that before the
start of each iteration of the while loop Γ ≥ A(2c − 1)/b. Here Γ is the number of
processors containing live copies of live data items and A is the number of live data items.
We give a probabilistic proof of this result. Essentially we will compute the
probability that an arbitrary bipartite graph has the required properties. If this
probability is greater than 0 there must exist at least one good graph. In fact it

FIGURE 3.9. A bipartite graph

turns out that the probability that a graph is good approaches 1 as n goes to ∞
— this means that most arbitrary graphs will suit our purposes. Unfortunately,
the proof is not constructive. The problem of verifying that a given graph is good
turns out to be of exponential complexity.
It turns out that the bipartite graphs we want are certain expander graphs —
see § 3.7 in chapter 6 (page 365). It is intriguing to compare the operation we are
performing here with the sorting-operation in § 3.7.
PROOF. Consider the set G_{m,n,c} of all bipartite graphs G(U, N, E) with |U| =
m, |N| = n, and with the degree of each vertex in U equal to 2c − 1.
We will say that a graph G(U, N, E) ∈ G_{m,n,c} is good if for all possible choices
of sets {Γ′(u): Γ′(u) ⊆ Γ(u), |Γ′(u)| ≥ c, for all u ∈ U} and for all S ⊆ U such that
|S| ≤ n/(2c − 1) the inequality |⋃_{u∈S} Γ′(u)| ≥ |S|(2c − 1)/b holds — here S represents
the set of live data items and ⋃_{u∈S} Γ′(u) represents the set of processors containing
live copies of these data items. We will count the number of bipartite graphs in
G_{m,n,c} that aren't good, or rather compute the probability that a graph isn't good.
If a graph isn't good then there exists a choice {Γ′(u): Γ′(u) ⊆ Γ(u), |Γ′(u)| ≥ c,
for all u ∈ U} and a set S ⊆ U such that |S| ≤ n/(2c − 1) and
|⋃_{u∈S} Γ′(u)| < |S|(2c − 1)/b.

(2)   Prob[G ∈ G_{m,n,c} is not good]
        ≤ Σ_{g ≤ n/(2c−1)} \binom{m}{g} \binom{n}{g(2c−1)/b} \binom{2c−1}{c}^g (g(2c−1)/(bn))^{gc}

Here g is the size of the set S and g(2c − 1)/b is the size of ⋃_{u∈S} Γ′(u). The
formula has the following explanation:
1. the fourth factor is the probability that an edge coming out of a fixed set S
will hit a fixed set ⋃_{u∈S} Γ′(u) — raising it to the power gc gives the probability
that all of the edges coming out of all of the elements of S will have their other
end in ⋃_{u∈S} Γ′(u). The idea is that we imagine the graph G as varying and
this results in variations in the ends of the edges coming out of vertices of
S.
2. the first factor is the number of ways of filling out the set S to get the m
vertices of the graph;
3. the second factor is the number of ways of filling out the set ⋃_{u∈S} Γ′(u),
whose size is < |S|(2c − 1)/b ≤ g(2c − 1)/b, to get the n processor-
vertices of the graph;
4. the third factor is the number of ways of adding edges to the graph to get
the original graph in G_{m,n,c} — we are choosing the edges that were deleted
to get the subgraph connecting S with Γ′(u).
This can be approximately evaluated using Stirling's Formula, which states that n! is
asymptotically proportional to n^{n−.5} e^{−n}. We want to get an estimate that is ≥ the
original.

DEFINITION 5.10.
    \binom{m}{q} = m!/(q!(m − q)!)

We will use the estimate

(3)    \binom{m}{q} ≤ m^q q^{−q+1/2} e^q

Claim: Formula (2) asymptotically behaves like o(1/n) as n → ∞, for b ≥ 4 and
m ≤ (b/(2e)^4)^{c/2}.
Since the formula increases with increasing m we will assume m has
its maximal allowable value of (b/(2e)^4)^{c/2}. This implies that the first
term is ≤ ((b/(2e)^4)^{c/2})^g g^{−g+1/2} e^g = b^{gc/2} 2^{−2gc} e^{g−2gc} g^{−g+1/2}. Now
the third term is ((2c − 1)!/(c!(c − 1)!))^g, which can be estimated by
((2c − 1)^{2c−1.5}/(c^{c−.5}(c − 1)^{c−1.5}))^g, and this is

    ≤ ((2c)^{2c−1.5}/(c^{c−.5} c^{c−1.5}))^g = ((2c)^{2c−1.5}/c^{2c−2})^g
    = (2^{2c−1.5} c^{2c−1.5}/c^{2c−2})^g = (2^{2c−1.5} c^{.5})^g ≤ c^{.5g} 2^{2cg}

The product of the first and the third terms is therefore

    ≤ b^{gc/2} 2^{−2gc} e^{g−2gc} g^{−g+1/2} c^{.5g} 2^{2cg}
    = b^{gc/2} e^{g−2gc} g^{−g+1/2} c^{g/2}

Now the second term is

    ≤ n^{g(2c−1)/b} (g(2c − 1)/b)^{−(g(2c−1)/b)+1/2} e^{g(2c−1)/b}
    ≤ n^{gc/2} (g(2c − 1)/b)^{−(g(2c−1)/b)+1/2} e^{gc/2}

— here we have replaced g(2c − 1)/b first by g(2c − 1)/4 (since we are only trying
to get an upper bound for the term and 4 is the smallest allowable value for b), and
then by gc/2. We can continue this process with the term raised to the 1/2 power
to get

    n^{gc/2} (g(2c − 1)/b)^{−(g(2c−1)/b)} (gc/2)^{1/2} e^{gc/2}
    = n^{gc/2} (g(2c − 1))^{−(g(2c−1)/b)} b^{g(2c−1)/b} (gc/2)^{1/2} e^{gc/2}
    = n^{gc/2} (g(2c − 1))^{−(g(2c−1)/b)} b^{gc/2} (gc/2)^{1/2} e^{gc/2}

The product of the first three terms is

    b^{gc/2} e^{g−2gc} g^{−g+1/2} c^{g/2} n^{gc/2} (g(2c − 1))^{−(g(2c−1)/b)} b^{gc/2} (gc/2)^{1/2} e^{gc/2}
    = b^{gc} e^{g−3gc/2} g^{−g+1} c^{(g+1)/2} n^{gc/2} (g(2c − 1))^{−(g(2c−1)/b)}

The product of all four terms is

    b^{gc} e^{g−3gc/2} g^{−g+1} c^{(g+1)/2} n^{gc/2} (g(2c − 1))^{−(g(2c−1)/b)} (g(2c − 1)/(bn))^{gc}
    = e^{g−3gc/2} g^{−g+1} c^{(g+1)/2} n^{gc/2} (g(2c − 1))^{−(g(2c−1)/b)} (g(2c − 1)/n)^{gc}

Now note that the exponential term (on the left) dominates all of the polynomial
terms, so the expression is ≤ g^{−g} c^{g} n^{gc/2} (g(2c − 1))^{−(g(2c−1)/b)} (g(2c − 1)/n)^{gc}. In
the sum the first term is c n^{−c/2} (2c − 1)^{c−(2c−1)/b}. This dominates the series be-
cause the factors of g^{−g} and (g(2c − 1))^{−(g(2c−1)/b)} overwhelm the factors c^g n^{gc/2}
— the last factor is bounded by 1. □

EXERCISES.

5.1. The algorithm for the Granularity Problem requires the underlying net-
work to be a complete graph. Can this algorithm be implemented on other types of
networks? What would have to be changed?
5.2. Program the algorithm for the granularity problem in Butterfly Pascal.
CHAPTER 4

Examples of Existing Parallel Computers

1. Asynchronous Parallel Programming


In this section we will consider software for asynchronous parallel programs.
These are usually run on MIMD computers with relatively few processors, al-
though it is possible to develop such programs for networks of sequential com-
puters. Many of the important concepts were first discussed in § 5.6.1 in chapter
2.
We begin by discussing a number of portable programming packages.
1.1. Portable Programming Packages.
1.1.1. Linda. In this section we will discuss a general programming system
that has been implemented on many different parallel computers. Our main em-
phasis will be on the LINDA package. It is widely available and striking in its
simplicity and elegance. The commercial version, called LINDA, is marketed by
Scientific Computing Associates, who have ported it to many platforms —
see [141]. Linda is particularly interesting because it is available on parallel com-
puters and on sets of networked uni-processor machines. The network version
of LINDA effectively converts networked uni-processor machines into a kind of
MIMD-parallel computer. It consists of:
• A set of commands that can be issued from a C or Pascal program.
• A preprocessor that scans the source program (with the embedded LINDA
commands) and replaces them by calls to LINDA functions.
In order to describe the commands, we must first mention the LINDA concept of
tuple space. Essentially, LINDA stores all data that is to be used by several processes
in a table created and maintained by LINDA itself, called tuple space. User-defined
processes issue LINDA commands to put data into, or to take data out of tuple
space. In particular, they are completely unable to share memory explicitly. LINDA
controls access to tuple space in such a way that race-conditions are impossible —
for instance the program on page 48 can’t be written in LINDA1.
A tuple is an ordered sequence of typed values like:
(1,zz,"cat",37)

or
(1,3.5e27)

An anti-tuple is similar to a tuple, except that some of its entries are written as
variable-names preceded by ?. Example:

1So LINDA is not a suitable system to use in the beginning of a course on concurrent programming!


(1,?x,?y,37)
A tuple is said to match an anti-tuple if they both have the same number of
entries, and all of the quantities in the anti-tuple that are not marked with a ? are
exactly equal to the corresponding quantities in the tuple. For instance:
(1,?x,?y,37)
matches
(1,17,”cat”,37)
but doesn’t match
(1,17,37)
(wrong number of entries) or
(2,17,”cat”,37)
(first entries don’t agree).
The LINDA system defines the following operations upon tuples (a small example combining them appears after the list):
1. out(tuple). This inserts the tuple that appears as a parameter into tuple
space. Example:
out(1,zz,”cat”,37)
Suppose t and a are, respectively, a tuple and an anti-tuple. The opera-
tion of unifying t with a is defined to consist of assigning to every element
of the anti-tuple that is preceded by a ?, the corresponding value in the
tuple. After these assignments are made, the anti-tuple and the tuple are
identical.
2. in(anti-tuple) This attempts to find an actual tuple in tuple-space that
matches the given anti-tuple. If no such matching tuple exists, the process
that called in is suspended until a matching tuple appears in tuple space
(as a result of another process doing an out operation). If a match is found
a. the matching tuple is atomically removed from tuple-space
b. the anti-tuple is unified with the matching tuple. This results in the
assignment of values to variables in the anti-tuple that has been pre-
ceded by a ?.
The in statement represents the primary method of receiving data from
other processes in a system that employs LINDA. Here is an example:
We have a process that has a variable defined in it named zz. The state-
ment:
in(1,?zz)
tests whether a tuple exists of the form (1,x) in tuple-space. If none
exists, then the calling process is blocked. If such a tuple is found then it is
atomically removed from tuple-space (in other words, if several processes
“simultaneously” attempt such an operation, only one of them succeeds)
and the assignment zz←x is made.
3. eval(special-tuple) Here special-tuple is a kind of tuple that contains one or
more function-calls. LINDA creates separate processes for these function-
calls. This is the primary way to create processes in the LINDA system.
Example:
eval(compeq(pid))
4. rd(anti-tuple) Similar to in(anti-tuple), but it doesn’t remove the matching
tuple from tuple-space.

5. inp(anti-tuple) Similar to in(anti-tuple), but it doesn’t suspend the calling


process if no matching tuple is found. It simply returns a boolean value
(an integer equal to 1 or 0) depending on whether a matching tuple was
found.
6. rdp(anti-tuple) Similar to inp(anti-tuple), but it doesn’t remove a matching
tuple from tuple-space.
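Here is a small example that combines these operations. It is only an illustrative
sketch (the function square, the constant NTASKS and the tuple formats are invented
for this example), but it shows the typical master/worker pattern:

#include <stdio.h>
#define NTASKS 10

int square(int i)
{
    out("result", i, i*i);          /* deposit one answer into tuple space */
    return 0;
}

real_main()
{
    int i, arg, val;
    for (i = 0; i < NTASKS; i++)
        eval(square(i));            /* create NTASKS worker processes */
    for (i = 0; i < NTASKS; i++) {
        in("result", ?arg, ?val);   /* block until some worker has finished */
        printf("%d squared is %d\n", arg, val);
    }
}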
Although LINDA is very simple and elegant, it has the unfortunate shortcoming
that it lacks anything analogous to semaphore sets as defined on page 53.
Here is a sample LINDA program that implements the generic DESCEND al-
gorithm (see 2.1 on page 58):

#define TRUE 1
#define K /* the power of 2 defining the size of the
           * input-data. */
#define N /* 2^K */
int bit_test[K];
real_main(argc, argv) /* The main program in LINDA must be called
                       * 'real_main' --- the LINDA system has its own
                       * main program that calls this. */
int argc;
char **argv;
{
    struct DATA_ITEM {/* declaration */};

    struct DATA_ITEM T[N];

    int i,k;
    for (i=0; i<N; ++i) out(T[i],i,K); /* Place the data into
                                        * tuple-space */
    for (i=0; i<K; i++) bit_test[i]=1<<i; /* Initialize bit_test array. */
    for (k=0; k<K; k++)
        for (i=0; i<N; ++i) eval(worker(i,k)); /* Start processes */
}

void worker(x,k)
int x,k;
{
    struct DATA_ITEM data1,data2;
    if ((bit_test[k]&x) == 0)
    {
        in(?data1,x,k);
        in(?data2,x+bit_test[k],k);
        OPER(x,k,&data1,&data2);
        out(data1,x,k-1);
        out(data2,x+bit_test[k],k-1);
    }
}
In this program, we are assuming that the procedure OPER(x,k,&data1,&data2);
is declared like:
void OPER(x,k,p1,p2); int x,k; struct DATA_ITEM *p1, *p2;
and is defined to perform the computations of OPER(m, j; T[m], T[m + 2^j]).
This program is not necessarily practical:
1. The time required to start a process or even to carry out the in and out
operations may exceed the time required to simply execute the algorithm
sequentially.
2. Only half of the processes that are created ever execute.
Nevertheless, this illustrates the simplicity of using the parallel programming con-
structs provided by the LINDA system. Substantial improvements in the speed of
LINDA (or even a hardware implementation) would negate these two objections.
In addition, a practical algorithm can be developed that is based upon the one pre-
sented above — it would have each process do much more work in executing the
DESCEND algorithm so that the time required for the real computations would
dominate the overall computing time. For instance we can have each worker rou-
tine perform the OPER(m, j; T[m], T[m + 2^j]) computation for G different values of
m rather than just 1. Suppose:

p_t = the time required to create a process with eval,
i_t = the time required to in a tuple,
o_t = the time required to out a tuple,
r_t = the time required to perform OPER(m, j; T[m], T[m + 2^j]).

Then the time required to perform the whole algorithm, when each worker
procedure works on G distinct rows of the T-array, is approximately

    nkp_t/G + k(i_t + o_t) + kGr_t

Here, we are assuming that, if each worker process handles G different sets
of data, we can get away with creating 1/G of the processes we would have had
to otherwise. We are also assuming that the times listed above dominate all of
the execution-time — i.e., the time required to increment counters, perform if-
statements, etc., is negligible. The overall execution time is a minimum when the
derivative of this with respect to G is zero, or:

    −nkp_t/G^2 + kr_t = 0
    nkp_t = G^2 kr_t
    G = √(np_t/r_t)

The resulting execution-time is

    nkp_t/G + k(i_t + o_t) + kGr_t = nkp_t/√(np_t/r_t) + kr_t√(np_t/r_t) + k(i_t + o_t)
        = k√(np_t r_t) + k√(np_t r_t) + k(i_t + o_t)
        = k(2√(np_t r_t) + i_t + o_t)
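To get a feeling for the sizes involved, suppose (with purely hypothetical timings,
chosen only for illustration) that p_t = 10^{−3} seconds, r_t = 10^{−5} seconds,
i_t = o_t = 10^{−4} seconds, and n = 8192. Then G = √(np_t/r_t) = √(8192 · 100) ≈ 905,
so each worker process should handle roughly 900 rows of the T-array, and the running
time becomes k(2√(np_t r_t) + i_t + o_t) ≈ 0.018k seconds, compared with a
process-creation cost alone of nkp_t ≈ 8.2k seconds when G = 1.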

One good feature of the LINDA system is that it facilitates the implementation
of calibrated-algorithms, as described in 5.25 on page 54.
ALGORITHM 1.1. Suppose we have a calibrated algorithm for performing a
computation on an EREW-SIMD computer. Then we can implement this
algorithm on a MIMD computer in LINDA as follows:
1. Create a LINDA process for each processor in the SIMD algorithm (using
eval).
2. Add all input-data for the SIMD algorithm to the tuple-space via out-
statements of the form
out(addr,0,data);
where addr is the address of the data (in the SIMD-computer), 0 repre-
sents the 0th time step, and the data is the data to be stored in this memory
location (in the SIMD-computer). In general, the middle entry in the out-
statement is a time-stamp.
3. In program step i, in LINDA process p (which is simulating a processor
in the SIMD computer) we want to read a data-item from address a. The
fact that the algorithm was calibrated implies that we know that this data-
item was written in program-step i0 = f(i, a, p), where i0 < i, and we perform
rd(a,i0,data);
4. In program step i, in LINDA process p, we want to store a data-item. We
perform
out(a,i,data);
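The following fragment sketches what one of these LINDA processes might look like.
It is only an outline: the names NSTEPS, read_address, write_address, f and compute
stand for quantities that would come from the particular calibrated algorithm being
simulated, and are not LINDA primitives.

/* Sketch of the LINDA process simulating SIMD processor p. */
void simulate(int p)
{
    int i, a, value, result;
    for (i = 1; i <= NSTEPS; i++) {
        a = read_address(p, i);        /* where this processor reads at step i */
        rd(a, f(i, a, p), ?value);     /* f(i,a,p) < i is the step at which that
                                        * location was written; it is known in
                                        * advance because the algorithm is calibrated */
        result = compute(p, i, value); /* the processor's local computation */
        out(write_address(p, i), i, result); /* write the result, stamped with
                                              * the current step i */
    }
}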
1.1.2. p4. P4 is a portable system developed at Argonne National Laboratories
for parallel programming. It is designed to be used with FORTRAN and C and
consists of a library of routines that implement
• Monitors — these are synchronization primitives that can be used as
semaphores and barriers, and more general things.
• Shared memory.
• Message-passing.
Its user interface is not as elegant as that of Linda, but it essentially gives the user
similar functionality. It is free software, available via anonymous ftp to ????????
1.1.3. MPI. The acronym “MPI” stands for “Message Passing Interface” and
represents a standard for message passing. It was developed by a consortium
of hardware manufacturers — who have agreed to develop optimized versions of
MPI on their respective platforms. Consequently, a software developer can assume
that an efficient version of MPI exists (or will eventually exist) on a given platform
and develop software accordingly.
It consists of a library of routines, callable from Fortran or C programs that
implement asynchronous message passing. Its functionality is a subset of that of
Linda or P4, and its user-interface is rather ugly2, but it is the only standard3.
There is a portable (and free) implementation of MPI developed jointly by
people at
2This is a highly subjective term, but represents the author’s reaction to a library of routines with
large numbers of parameters, most of which aren’t used in any but marginal situations.
3As of this writing — May 21, 1994

• Argonne National Labs — Rusty Lusk, Bill Gropp;


• Mississippi State University — Tony Skjellum and Nathan Doss;
• IBM Corporation — Hubertus Franke.
This implementation interfaces to the FORTRAN and C languages and runs
on the following platforms:
• Networked Sun workstations and servers;
• RS6000’s;
• the IBM SP-1 — see page 16;
• Intel iPSC — see page 71;
• the Intel Paragon — see page 16;
The sample implementation isn’t as efficient as it might be — it gains its portability
by using an Abstract Device Interface to relate it to peculiarities of the hardware.
It can use any one of several such systems: Chameleon, p4, Chimp, or PVM. It is
available via anonymous ftp from info.mcs.anl.gov in directory pub/mpi.
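As an indication of what the user-interface looks like, here is a minimal C program
using a few of the basic MPI calls (the program is ours, not part of the implementation
just described): process 1 sends a single integer to process 0.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);                 /* start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    if (rank == 1) {
        value = 37;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("process 0 received %d\n", value);
    }
    MPI_Finalize();                         /* shut down MPI */
    return 0;
}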

1.2. Automatically Parallelizing Compilers. This has always been an impor-


tant topic because:
• Some people find parallel programming hard and want a way to auto-
mate the process. Furthermore, expertise in parallel programming isn’t as
widely available as expertise in sequential programming.
• The fastest numerical compute-engines these days are parallel, but there
are millions of lines of old sequential code lying around, and people don’t
want to expend the effort to re-write this stuff from scratch.
Theoretically speaking, automatic parallelization is essentially impossible. This is to
say that getting an optimal parallel algorithm from a given sequential algorithm is
probably recursively uncomputable. There are sub-problems of this general prob-
lem that are known to be NP-complete. For instance, an optimal parallel algorithm
would require optimal assignment of processors to perform portions of a sequen-
tial algorithm. This is essentially equivalent to the optimal register assignment
problem, which is NP-complete (see [57]).
One of the problems is that optimal parallel algorithms are frequently very
different from optimal sequential algorithms — the problem of matrix inversion is
a striking example of this. The optimal sequential algorithm is Gaussian elimina-
tion, which is known to be P-complete (see page 41, and [165]). The optimal (as
of May 21, 1994) parallel algorithm for this problem involves plugging the matrix
into a power series — a technique that would lead to an extraordinarily inefficient
sequential algorithm (see page 165).
Having said all of this there is still some motivation for attempting to develop
automatically parallelizing compilers.

2. Discussion and further reading


p4 is a successor to the set of parallel programming macros described in the
book [20].
Linda was developed by David Gelernter, who works as a consultant to S. C.
A. For an interesting discussion of the history of Linda see [111] — this article also
explains the name “Linda”.

3. SIMD Programming: the Connection Machine


3.1. Generalities. In this section we will consider the Connection Machine
manufactured by Thinking Machines Corporation. Aside from the inherent inter-
est in this particular piece of hardware, it will be instructive to see how a number
of important issues in the theory of parallel algorithms are addressed in this sys-
tem.
We first consider some general features of the architecture of this system. The
computer (model CM-2) is physically implemented as a 12-dimensional hyper-
cube, each node of which has 16 processors. This is a SIMD machine in which
programs are stored on a front end computer. The front end is a conventional com-
puter that is hard-wired into the Connection Machine. As of this writing (March,
1990) the types of machines that can be front end for the Connection Machine are:
DEC VAX’s; Symbolics Lisp machines; Sun machines. In a manner of speaking the
Connection Machine may be viewed as a kind of coprocessor (!) for the front end
— even though its raw computational power is usually many thousands of times
that of the front end. The front end essentially “feeds” the Connection Machine its
program, one instruction at a time. Strictly speaking, it broadcasts the instructions
to all of the processors of the Connection Machine though the hardware connec-
tion between them. Each processor executes the instructions, using data from:
1. the front end (which acts like common RAM shared by all processors)
2. its own local RAM (8K);
3. local RAM of other processors.
The first operation is relatively slow if many processors try to access distinct data
items from the front end at the same time — the front end can, however, broadcast
data to all processors very fast. This turns out to mean that it is advantageous to
store data that must be the same for all processors, in the front end. Copying of
data between processors (statement 3 above) follows two basic patterns:
1. Grid communication This makes use of the fact that the processors in
the machine can be organized (at run-time) into arrays with rapid com-
munication between neighboring processors. Under these conditions, the
Connection Machine behaves like a mesh-connected computer. Grid com-
munication is the fastest form of communication between processors.
2. General communication This uses an algorithm somewhat like
the randomized sorting algorithm of Valiant and Brebner (see
[164] and §3.2 in chapter 2 of this book). This sorting algorithm is used
when data must be moved between distinct nodes of the 12-dimensional
hypercube (recall that each node has 16 processors). The sorting
algorithm is built into hardware and essentially transparent to the user.
Movement of data between processors is slower than access of local
memory, or receiving broadcast data from the front end. Since this data
movement uses a randomized algorithm, the time it takes varies — but
is usually reasonably fast. In this case the Connection Machine can be
viewed as a P-RAM machine, or unbounded parallel computer.
You can think of the memory as being organized into hierarchies:
• local memory — this is memory contained in the individual processors.
This memory can be accessed very rapidly, in parallel.

• front-end — this is memory contained in the front-end computer, that also


contains the program for the connection machine. This memory can also
be accessed rapidly, but only a limited amount of parallelism can be used.
The front-end computer can
– read data from a single individual processor in the Connection Ma-
chine, in one program-step;
– perform a census-operation on all active processors. See § 3.2 on page
103 for a discussion of census-operations.
– “broadcast” data from the front-end to all of the processors in a single
program-step. This is a fast operation, since it uses the same mech-
anism used for broadcasting the instructions to the Connection Ma-
chine.
• non-local — here processors access local memory of other processors. This
is a parallel memory access that is somewhat slower than the other two
operations. It still uses a single program-step — program steps on the
Connection Machine do not all use the same amount of time.
This division into hierarchies turns out to fit in well with programming practices
used in high-level programming languages, since modular programs make use of
local memory in procedures — and this turns out to be local memory of processors.
I/O can be performed by the front-end, or the Connection Machine itself. The
latter form of I/O makes use of a special piece of hardware called a Data Vault —
essentially disk drives with a high data rate (40 megabytes/second).
The Connection Machine is supplied with several programming languages,
including:
Paris essentially the assembly language of the Connection Machine (it
stands for Parallel Instruction Set). It is meant to be used in conjunction
with a high-level language that runs on the front-end computer. This is
usually C. You would, for instance, write a C program that contained
Paris instructions embedded in it.
*Lisp this is a version of Common Lisp with parallel programming features
added to it. These extra commands are in a close correspondence with
PARIS instructions — causing *Lisp to be regarded as a “low-level” lan-
guage (!). There is also a form of lisp called CM-Lisp that is somewhat
higher-level than *Lisp.
CM-Fortran this is a version of Fortran with parallel programming primi-
tives.
C* This language is essentially a superset of ANSI C. We will focus on this
language in some detail. Since it is fairly typical of a language used on
SIMD machines we will use it as a kind of pseudocode for presenting par-
allel algorithms in the sequel. There are two completely different versions
of this language. The old versions (< 6.0) were very elegant but also in-
efficient to the point of being essentially unusable. The new versions (with
version-numbers ≥ 6.0) are lower-level languages4 that are very powerful
and efficient — they rival C/Paris in these respects. We will focus on this
language in the remainder of this chapter.

4As this statement implies, these two versions have almost nothing in common but the name, C*,
and the fact that they are loosely based on C.

3.2. Algorithm-Design Considerations. In many respects designing


algorithms for the Connection Machine is like designing algorithms for a PRAM
computer. The processors have their own local RAM, and they don’t share it with
other processors (at least in a straightforward way), but modular programming
techniques make direct sharing of memory unnecessary in most circumstances.
One important difference between the Connection Machine and a PRAM ma-
chine is that two operations have different relative timings:
• Census operations take constant time on a Connection Machine and
O(lg n)-time on a PRAM computer. Census operations are operations
that involving accumulating data over all of the active processors.
“Accumulating data” could amount to:
1. taking a (sum, product, logical OR, etc.) of data items over all active
processors.
2. counting the number of active processors.
On a PRAM computer, census operations are implemented via an algo-
rithm like the one on page 58 or in § 1 of chapter 6. Although the computations
are not faster on the Connection Machine, the census operations are effectively
built into the hardware, so the time spent doing them is essentially incorporated
into the machine cycles (a small example appears at the end of this section).
• Parallel data movement-operations require unit-time on a PRAM
computer, but may require longer on a Connection Machine. Certain
basic movement-operations are built into the hardware and require unit
time, but general operations (such as a parallel movement of data from
processor i to processor σ(i), with σ an arbitrary permutation) can require
much longer. See section 3.4.8 for more details.
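Here is a minimal sketch of a census operation, written with the reduction idiom
used by the sample programs later in this chapter (the shape s and the variable val
are invented for this example):

shape [8192]s;
int:s val;                      /* one data item per processor */

void main()
{
    int count = 0;
    with (s) {
        val = pcoord(0) % 2;    /* some sample data: alternating 0s and 1s */
        count += (val != 0);    /* (val != 0) is a parallel value of 0s and 1s;
                                 * the += reduction adds it up over all active
                                 * processors in a single census operation */
    }
}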
3.3. The C* Programming Language. In addition to the usual constructs of
ANSI C, it also has ways of describing code to be executed in parallel. Before one
can execute parallel code, it is necessary to describe the layout of the processing
elements. This is done via the shape command. It allocates processors in an array
and assigns a name to the set of processors allocated. All code to be executed on
these processors must contain a reference to the name that was assigned. We illus-
trate these ideas with a sample program — it implements the parallel algorithm
for forming cumulative sums of integers:

#include <stdio.h>
#define MACHINE_SIZE 8192
shape [MACHINE_SIZE]sum;
int:sum n; /* each processor will hold an integer value */
int:sum PE_number; /* processors must determine their identity. */
int:current lower(int);
void add(int);
int:current lower (int iteration)
{
    int next_iter=iteration+1;
    int:current PE_num_reduced = (PE_number >>
        next_iter)<<next_iter;
    return PE_num_reduced + (1<<iteration) - 1;
}
void add(int iteration)
{
    where (lower(iteration)<PE_number)
        n+=[lower(iteration)]n;
}
void main() /* code not prefaced by a shape name
               is executed sequentially on the front end. */
{
    int i;
    with (sum) PE_number = pcoord(0);
    printf("Enter 8 numbers to be added in parallel:\n");
    for (i=0;i<8;i++)
    {
        int temp;
        scanf("%d",&temp);
        [i]n=temp;
    }
    for (i=0;i<=2;i++)
        with(sum) add(i);
        /* the call to add is executed in parallel
           by all processors in the shape named sum */
    printf("The cumulative sums are:\n");
    for (i=0;i<8;i++)
        printf("Processor %d, value =%d\n",
            [i]PE_number,[i]n);
}

(See algorithm 1.4 on page 268 for an explanation of what this program does.)
The first aspect of this program that strikes the reader is the statement: shape
[MACHINE_SIZE]sum;. This allocates an array of processors (called a shape) of
size equal to MACHINE_SIZE and assigns the name sum to it. Our array was one-
dimensional, but you can allocate an array of processors with up to 32 dimen-
sions6. The dimension is actually significant in that transmission of data to neigh-
boring processors can be done faster than to random processors. This language
construct mirrors the fact that the hardware of the Connection Machine can be dy-
namically reconfigured into arrays of processors. Of course transmission of data
between processors can also be done at random (so that the Connection Machine
can be viewed as a CREW computer) — this motion is optimized for certain con-
figurations of processors. If you are writing a program in which the data move-
ment doesn’t fit into any obvious array-pattern it is usually best to simply use
a one-dimensional array and do random-access data movement (data-movement
commands can contain subscripts to denote target-processors).

6Currently, (i.e., as of release 6.0 of C*) each dimension of a shape must be a power of 2 and the
total number of processors allocated must be a power of 2 times the number of processors physically
present in the machine

All statements that use this array of processors must be within the scope of a
statement with(sum).
The statement int:sum n; allocates an integer variable in each processor of the
shape sum. Note that the shape name follows the data type and that they are
separated by a colon. In C* there are two basic classes of data: scalar and parallel.
Scalar data is allocated on the front-end computer in the usual fashion (for the
C-language), and parallel data is allocated on the processors of the Connection
Machine.
You may have noticed the declaration: int:current lower(int); and wonder
where the shape current was declared. This is an example of a pre-defined
and reserved shape name. Another such predefined name is physical — it
represents all processors physically present on a given Connection Machine. The
word current simply represents all currently-active processors. The fact that the
procedure lower has such a shape declaration in it implies that it is parallel code to
be executed on the processors of the Connection Machine, rather than the front
end. The word current is a kind of generic term for a shape that can be used to
declare a procedure to be parallel.
Note that the parameter to this procedure has no shape declaration — this
means it represents scalar data allocated on the front end computer and broadcast
to the set of active processors whenever it is needed. Note that the procedure allo-
cates data on the front end and on the processors — the second allocation is done
via int:current PE_num_reduced....
Consider the next procedure:

void add(int iteration)


{
where (lower(iteration)<PE_number)
n+=[lower(iteration)]n;
}

This block of code is executed sequentially on the front end computer. On


the other hand, it calls parallel code in the where-statement. A where statement
evaluates a logical condition on the set of active processors and executes its conse-
quent on the processors that satisfy the test. In a manner of speaking, it represents
a parallel-if statement.
Now we look at the main program. It is also executed on the front end. The
first unusual piece of code to catch the eye is:

with (sum) PE_number = pcoord(0);

At most one shape can be active at any given time. When a shape is activated
all processors in the shape become active and parallel code can be executed on
them. Note that where and with-statements can be nested.
Nested where statements simply decrease the number of active processors —
each successive where-statement is executed on the processors that remained ac-
tive from the previous where statements. This is a block structured statement in
the sense that, when the program exits from the where statement the context that

existed prior to it is restored. In other words, when one leaves a where block, all
processors that were turned off by the logical condition are turned back on.
Nested with statements simply change the current shape (and restore the pre-
viously active shape when the program leaves the inner with statement).
Note the expression pcoord(0). This is a predefined parallel procedure (i.e. it
runs on all active processors) that returns the coordinate of a processor in a given
shape-array. These coordinates are numbered from 0 so the only coordinate that
exists in a one-dimensional array like sum is the 0th. This statement assigns a
unique number (in parallel) to each processor in the shape sum. Here is a list of the
pre-defined functions that can be used to access shapes (a short fragment using
them appears after this list):
1. pcoord(number) — defined above.
2. rankof(shape) — returns the rank of the given shape — this is the number
of dimensions defined for it. Example: rankof(sum) is equal to 1.
3. positionsof(shape) — returns the total number of positions of a
given shape — essentially the number of processors allocated to it.
positionsof(sum) is 8192.
4. dimof(shape, number) — returns the range of a given dimension of a shape.
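Here is a small fragment using these functions (the shape grid is invented for this
example, and we assume, as with pcoord, that the dimensions passed to dimof are
numbered from 0):

#include <stdio.h>
shape [64][128]grid;

void main()
{
    int r  = rankof(grid);       /* 2, since grid has two dimensions */
    int p  = positionsof(grid);  /* 8192, the total number of positions */
    int d0 = dimof(grid, 0);     /* 64, the range of dimension 0 */
    int d1 = dimof(grid, 1);     /* 128, the range of dimension 1 */
    printf("rank %d, %d positions, %d by %d\n", r, p, d0, d1);
}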
Next, we examine the code:

for (i=0;i<8;i++)
{
int temp;
scanf("%d",&temp);
[i]n=temp;
}

This reads 8 numbers typed in by the user and stores them into 8 processors.
The scanf procedure is the standard C-procedure, so it reads data into the front-
end computer. We read the data into a temporary variable named temp and then
assign it into the variable n declared in the processors of sum. Note the statement
[i]n=temp; — this illustrates how one refers to data in parallel processors. The data
temp is assigned to the ith copy of n. Such statements can be used in parallel code
as well — this is how one does random access data movement on a Connection
Machine. With a multi-dimension shape of processor, each subscript is enclosed
in its own pair of square brackets: [2][3][1]z.
There is also a special notation for referencing neighboring processors in a shape.
Suppose p is a parallel variable in a one-dimensional shape. Then the notation
[pcoord(0)+1]p refers to the value of p in the next-higher entry in this shape. The
statement

[pcoord(0)+1]p=p;

sends data in each processor in this shape to the next higher processor in the
shape — in other words it does a shift operation. The statement

with (this_shape)
    z=[pcoord(0)+1][pcoord(1)-1]z;

does a two-dimensional shift operation.


In such cases, there is a shorthand for referring to the current position or current
value of some index in a shape — this is the dot notation. The expression
'pcoord(i)', where i is the current index, is represented by a dot. The data-movements
described above can be replaced by:

[.+1]p=p;

and
with (this_shape) z=[.+1][.-1]z;
The next sample program illustrates the use of a two-dimensional shape. It im-
plements the Levialdi Counting Algorithm, discussed in the Introduction. First,
it generates a random array of pixels using the rand function. Then it imple-
ments the Levialdi algorithm and reports on the number of connected components
found.

#include <stdio.h>
#include <stdlib.h>
shape [64][128]pix; /* image, represented as an array of
* pixels */
int:pix val, nval; /* 1 or 0, representing on or off
* pixel */
int total = 0; /* total number of clusters found
* thus far */
int:current h(int:current);
int:pix h(int:pix);
void update();
int isolated();
void print_array();
void
main()
{
int i, j, on_pixels; /* number of 1 pixels in
* image */
unsigned int seed; /* seed value passed to random number
* generator */
int density;/* density (% of 1’s) in original
* image (0−100) */
printf("Enter density and seed: ");
scanf("%d %u", &density, &seed);
srand(seed);
for (i = 0; i < 64; i++)
for (j = 0; j < 64; j++)
[i][j] val = ((i == 0) || (i >= 63) ||
(j == 0) || (j >= 63)) ? 0 :
(rand() % 100 < density);
    printf("Array Initialized\n");
    print_array();
    on_pixels = 1;
    i = 0;
    with(pix)
        while (on_pixels > 0) {
            int t = 0;
            update();
            i++;
            t += val;
            if (i < 2)
                print_array();
            on_pixels = t;
            printf(
                "Iteration %.3d -- %.4d pixels on, %.3d clusters found\n",
                i, on_pixels, total);
}
printf("Final tally: %d clusters found \n",
total);
}
int:current h(int:current x)
{
return (x >= 1);
}
void update()
{
total += isolated();
where((pcoord(0) > 0)
& (pcoord(0) < 63)
& (pcoord(1) > 0)
& (pcoord(1) < 63)) {
        val = h(h([.][.-1]val
                + [.][.]val
                + [.+1][.]val - 1)
            + h([.][.]val
                + [.+1][.-1]val - 1));
}
}
int isolated()
{
int t = 0;
int:current iso;
    iso = (val == 1)
        & ([.-1][.]val == 0)
        & ([.+1][.]val == 0)
        & ([.][.-1]val == 0)
        & ([.][.+1]val == 0)
        & ([.-1][.-1]val == 0)
        & ([.+1][.-1]val == 0)
        & ([.-1][.+1]val == 0)
        & ([.+1][.+1]val == 0);
t += iso;
return t;
}
void
print_array()
{
int i, j;
for (i = 0; i < 15; i++) {
for (j = 0; j < 64; j++)
printf("%d",[i][j] val);
printf("\n");
}
printf("\n\n");
}

3.3.1. Running a C* program. To run a C* program:


1. Give it a name of the form ‘name.cs’;
2. Compile it via a command of the form:
cs name.cs
there are a number of options that can be supplied to the compiler.
These closely correspond to the options for the ordinary C compiler. If no
option regarding the name of the output file is supplied, the default name
is ‘a.out’ (as with ordinary C).
3. Execute it via a command:
a.out
This attaches to the Connection Machine, and executes a.out.
3.4. Semantics of C*.
3.4.1. Shapes and parallel allocation of data. We will begin this section by dis-
cussing the precise meaning of the with statement and its effects on where state-
ments. As has been said before, the with statement determines the current shape.
In general, the current shape determines which data can actively participate in
computations. This rule has the following exceptions:
1. You can declare a parallel variable that is not of the current shape. This
variable cannot be initialized, since that would involve its participation
in active computations.
2. A parallel variable that is not of the current shape can be operated upon
if it is left-indexed by a scalar or scalars — since the result of left-indexing
such a parallel variable is effectively a scalar. In other words, you can
execute statements like [2][3]z=2; even though z is not of the current shape.
3. The result of left-indexing a variable not of the current shape by a parallel
variable that is of the current shape is regarded as being of the current
shape. In other words, the following code is valid:

shape [128][64]s1;

shape [8192]s2;
int:s2 a,b;
int:s1 z;
with (s2) [a][b]z=2;

since the expression [a][b]z is regarded as being of the


shape s2. This is essentially due to the fact that the set
of values it takes on is indexed by the shape s2 — since a
and b are of the shape s2.
4. It is possible to apply dimof and shapeof and the address-of operator, &,
to a parallel variable that is not of the current shape. This is due to the
fact that these are implemented by the compiler, and don’t involve actual
parallel computations.
5. You can right-index a parallel array that is not of the current shape with a
scalar expression.
6. You can use the “dot” operator to select a field of a parallel structure or
union that is not of the current shape — provided that field is not another
structure, union, or array.
3.4.2. Action of the with-statement. The with-statement activates a given shape
and makes it the current shape. The concept of current shape is central to the C*
language. In order to understand it, we must consider the problem of describ-
ing SIMD code. Since each processor executes the same instruction, it suffices to
describe what one processor is doing in each program step. We consider a single
generic processor and its actions. This generic processor will be called the current
processor. The term “current processor” does not mean that only one processor
is “current” at any given time — it merely selects one processor out of all of the
processors and uses it as an example of how the parallel program executes on all of
the processors.
The current shape defines:
• how the current processor is accessed in program-code. If the current
shape is k dimensional, the current processor is a member of a
k-dimensional array.
• the functions pcoord(i).
The next example illustrates how variables that are not of the current shape can be
accessed in parallel code. This code-fragment will multiply a vector v by a matrix
A:
shape [64][128]twodim;
shape [8192]onedim;
int:twodim A;
int:onedim v,prod;
void main()
{
with(onedim) prod=0; /* Zero out the prod-vector. */
with(twodim) /* Make the shape ’twodim’ current. */
{
[pcoord(0)]prod+=A*[pcoord(1)]v;
/* This code accesses ’prod’ and ’v’, in

* spite of the fact that they are not of the


* current shape. The fact that ’twodim’ is
* the current shape defines the ’pcoord(0)’
* ’pcoord(1)’ */

}
}
The use of the shape twodim is not essential here — it merely makes it more
convenient to code the array-access of A. We could have written the code above
without using this shape. The processors in the shape twodim are the same proces-
sors as those in the shape onedim (not simply the same number of processors). Here
is a version of the code-fragment above that only uses the onedim-shape:
shape [64][128]twodim;
shape [8192]onedim;
int:twodim A;
int:onedim v,prod;
void main()
{
with(onedim) prod=0; /* Zero out the prod-vector. */
with(onedim) /* Continue to use the shape ’onedim’. */
{
[pcoord(0)/128]prod+=[pcoord(0)/128][pcoord(0) % 128]A*[pcoord(0) % 128]v;

}
}
Here, we use the expression [pcoord(0)/128][pcoord(0) % 128]A to refer to el-
ements of A. We assume that the shape twodim is numbered in a row-major format.
The second code-fragment carries out the same activity as the first. The use of
the shape twodim is not entirely essential, but ensures:
• The code-fragment
[pcoord(0)]prod+=A*[pcoord(1)]v;
is simpler, and
• It executes faster — this is due to features of the hardware of the Connec-
tion Machine. Shapes are implemented in hardware and a with-statement
has a direct influence on the Paris code that is generated by the C* com-
piler.
3.4.3. Action of the where-statement. There are several important
considerations involving the execution of the where-statement. In computations
that take place within the scope of a where-statement, the active processors
that are subject to the conditions in the where-statement are the ones whose
coordinates are exactly equal to {pcoord(0),. . . ,pcoord(n)}. This may seem to
be obvious, but has some interesting consequences that are not apparent at first
glance. For instance, in the code:

shape [8192]s;
int:s z;
where(z > 2) [.+2]z=z;

a processor [i+2]z will receive the value of [i]z if that value is > 2 — the original
value of [i+2]z (and, therefore, whether processor number i + 2 was active) is
irrelevant. Processor [.+2] in this statement is simply a passive receiver of data — it
doesn't have to be active to receive the data from processor [.]. On the other hand,
processor [.] must be active in order to send the data.
Suppose the processors had z-values given by:

[0]z=1;
[1]z=3;
[2]z=3;
[3]z=0;
[4]z=1;
[5]z=4;
[6]z=2;
[7]z=0;

then the code fragment given above will result in the processors having the values:

[0]z=1;
[1]z=3;
[2]z=3;
[3]z=3;
[4]z=3;
[5]z=4;
[6]z=2;
[7]z=4;

When processors are inactive in a where-statement they mark time in a certain


sense — they execute the code in a manner that causes no changes in data in their
memory or registers. They can be passive receivers of data from other processors,
however.
Other processors can also read data from an inactive processor’s local memory.
For instance the effect of the where statement:

where(z>2)
z=[.-1]z;

is to cause the original data in the parallel variable z to be transformed


into:

[0]z=1;
[1]z=1;
[2]z=3;
[3]z=0;
[4]z=1;
[5]z=1;
[6]z=2;

[7]z=0;

In order to understand more complex code like

where(cond)
[.+3]z=[.+2]z;

it is necessary to pretend the existence of a new variable temp in active proces-


sors:

where(cond)
{
int:shapeof(z) temp;
temp=[.+2]z;
[.+3]z=temp;
}

a given data-movement will take place if and only if the condition in cond is sat-
isfied in the current processor. In this case both [.+2]z and [.+3]z play a passive role
— the processor associated with one is a storage location for parallel data and the
other is a passive receiver for parallel data.
These where-statements can be coupled with an optional else clause. The else-
clause that follows a where-statement is executed on the complement of the set of
processors that were active in the where-statement.

where (c) s1;


else s2;

Statement s2 is executed on all processors that do not satisfy the condition c.


There is also the everywhere-statement, that explicitly turns on all of the pro-
cessors of the current shape.
Ordinary arithmetic operations and assignments in C* have essentially the
same significance as in C except when scalar and parallel variables are combined
in the same statement. In this case the following two rules are applied:
1. Replication Rule: A scalar value is automatically replicated where neces-
sary to form a parallel value. This means that in a mixed expression like:

int a;
int:someshape b,c;
b=a+c;

the value of a is converted into a parallel variable before it is substituted into the
expression. This means it is broadcast from the front-end computer.
This is also used in the Levialdi Counting Program in several places — for
instance:

int:current h(int:current x)

{
return (x >= 1);
}

2. As-If-Serial Rule: A parallel operator is executed for all active processors


as if in some serial order. This means that in an expression:

int a;
int:someshape b;
a=b;

a will get the value of one of the copies of b — which one cannot be determined
before execution. Code like the following:

int a;
int:someshape b;
a += b;

will assign to a the sum of all of the values of b. This is accomplished via an efficient
parallel algorithm (implemented at a low-level) like that used to add numbers in
the sample program given above. Note that the basic computations given in the
sample program were unnecessary, in the light of this rule — the main computa-
tion could simply have been written as

int total;
with(sum) total += n;

This is used in several places in the Levialdi Counting Program:

with(pix)
while (on pixels > 0) {
int t = 0;
update();
i++;
t += val;
if (i < 2)
print array();
on pixels = t;
printf(
"Iteration %.3d -- %.4d pixels on, %.3d clusters found\n",
i, on pixels, total);
}

3.4.4. Parallel Types and Procedures. 1. Scalar data. This is declared exactly as
in ANSI C.
2. Shapes Here is an example:
shape descriptor shapename;

where descriptor is a list of 0 or more symbols like [number] specifying the size
of each dimension of the array of processors. If no descriptor is given, the shape
is regarded as having been only partially specified – its specification must be com-
pleted (as described below) before it can be used.
When shapes are defined in a procedure, they are dynamically allocated on
entry to the procedure and de-allocated on exit. Shapes can also be explicitly dy-
namically allocated and freed. This is done with the allocate_shape command.
This is a procedure whose:
1. first parameter is a pointer to the shape being created;
2. second parameter is the rank of the shape (the number of dimensions);
3. the remaining parameters are the size of each dimension of the shape.
For instance, the shape sum in the first sample program could have been dynami-
cally allocated via the code:

shape sum;
sum=allocate_shape(&sum,1,8192);

The shape pix in the Levialdi Counting Program could have been created
by:

shape pix;
pix=allocate_shape(&pix,2,64,128);

Shapes can be explicitly deallocated by calling the function


deallocate_shape and passing a pointer to the shape. Example:
deallocate_shape(&pix);. You have to include the header file <stdlib.h>
when you use this function (you do not need this header file to use
allocate_shape).
3. Parallel data Here is a typical parallel type declaration:
typename:shapename variable;
4. Scalar pointer to a shape In C*, shapes can be regarded as data items and it
is possible to declare a scalar pointer to a shape. Example:
shape *a;
Having done this we can make an assignment like
a=&sum;
and execute code like
with(*a)...
A function related to this is the shapeof-function. It takes a parallel variable
as its parameter and returns the shape of the parameter.
5. Scalar pointers to parallel data. Sample:
typename:shapename *variable;
Here variable becomes a pointer to all instances of a given type of parallel data.
It becomes possible to write
[3]*variable.data

Typecasting can be done with parallel data in C*. For example, it is possible to
write a=(int:sum)b;.
6. type procedure name (type1 par1, type2 par2,
...,typen parn);
Note that the preferred syntax of C* looks a little like Pascal. All
procedures are supposed to be declared (or prototyped), as well as being defined
(i.e. coded). The declaration of the procedure above would be:
7. type procedure name (type1, type2,
...,typen);
— in other words, you list the type of data returned by the proce-
dure, the name of the procedure, and the types of the parameters. This is a
rule that is not strictly enforced by present-day C*, but may be in the future.
Example:

int:sum lower (int iteration) /* DEFINITION of
                               * 'lower' procedure */
{
    int next_iter=iteration+1;
    int:sum PE_num_reduced = (PE_number >>
        next_iter)<<next_iter;
    return PE_num_reduced + (1<<iteration) - 1;
}
void add(int iteration) /* DEFINITION of 'add'
                         * procedure */
{
    where (lower(iteration)<PE_number)
        n+=[lower(iteration)]n;
}

In the remainder of this discussion of C* we will always follow the preferred syn-
tax.
8. Overloaded function-definitions. In C* it is possible to define scalar and
parallel versions of the same function or procedure. In other words, one can de-
clare procedures with the same name with different data types for the param-
eters. It is possible to have different versions of a procedure for different shapes of
input parameters. This is already done with many of the standard C functions. For
instance, it is clear that the standard library <math.h> couldn't be used in the usual
way for parallel data: the standard functions only compute single values. On the
other hand, one can code C* programs as if many of these functions existed. The
C* compiler automatically resolves these function-calls in the proper way when
parallel variables are supplied as parameters. In fact, many of these function-calls
are implemented on the Connection Machine as machine language instructions, so
the C* compiler simply generates the appropriate assembly language (actually,
PARIS) instructions to perform these computations.
In order to overload a function you must use its name in an
overload-statement. This must be followed by the declarations of all of the
versions of the function, and the definitions of these versions (i.e., the actual
code).
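The following sketch shows what this might look like for a function absval with a
scalar and a parallel version; the exact form of the overload-statement is an assumption
here, not taken from the C* documentation:

overload absval;                   /* assumed syntax: name the overloaded function */

int absval(int);                   /* declaration of the scalar version */
int:current absval(int:current);   /* declaration of the parallel version */

int absval(int x)                  /* definition of the scalar version */
{
    return (x < 0) ? -x : x;
}

int:current absval(int:current x)  /* definition of the parallel version */
{
    int:current result;
    result = x;
    where (x < 0)
        result = -x;
    return result;
}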

3.4.5. Special Parallel Operators. The following special operators are defined in
C* for doing parallel computations:
<?= Sets the target to the minimum of the values of the parallel variable on
the right;
>?= Sets the target to the maximum of the values of the parallel variable on
the right;
,= Sets the target to an arbitrary selection of the values of the parallel variable
on the right.
<? This is a function of two variables (scalar or parallel) that returns the
minimum of the variables.
>? This is a function of two variables (scalar or parallel) that returns the
maximum of the variables.
(?:) This is very much like the standard C operator-version of the
if-statement. It is coded as (var?stmt1:stmt2). If var is a scalar variable it
behaves exactly like the corresponding C statement. If var is a parallel
variable it behaves like: where(var) stmt1; else stmt2;
C* also defines a new data type: bool. This can be used for flags and is advan-
tageous, since it is implemented as a single bit and the Connection Machine can
access single bits.
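For instance, the following small sketch (the shape s and the variable n are invented
for the example) finds the smallest and largest values of a parallel variable, each in a
single operation:

shape [8192]s;
int:s n;

void main()
{
    int smallest = (1 << 30);       /* initialized conservatively large */
    int largest = -(1 << 30);       /* initialized conservatively small */
    with (s) {
        n = pcoord(0) % 100;        /* some sample parallel data */
        smallest <?= n;             /* minimum of n over the active processors */
        largest >?= n;              /* maximum of n over the active processors */
    }
}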
3.4.6. Sample programs. We begin with sample programs to compute Julia sets
and the Mandelbrot set. These are important fractal sets used in many areas of
pure and applied mathematics and computer science including data compression
and image processing. Computing these sets requires an enormous amount of CPU
activity and lends itself to parallelization. We begin by quickly defining these sets.
These computations are interesting in that they require essentially no communica-
tion between processors.
Let p ∈ C be some complex number. The Julia set with parameter p is defined
to be the set of points z ∈ C with the property that the sequence of points z_i
remains bounded. Here z_0 = z and, for all i, z_{i+1} = z_i^2 + p. Note that if z starts out
sufficiently large, ‖z_i‖ → ∞ as i → ∞ since the numbers are being squared each
time. In our program we will calculate z_100 for each initial value of z and reject
initial values of z that give a value of z_100 whose absolute value is > 5, since adding
p to such a number doesn't have a chance of canceling it out. (This is not obvious,
incidentally; it turns out that the values of p that produce a nonempty Julia set have
an absolute value < 2.) We will assign a different processor to each point that is
tested. Here is the program:

#include <stdio.h>
shape[64][128] plane;
float:plane r, c, r1, c1, x, y, mag;
int:plane injulia, row, column;
void
main()
{
float p_r = 0.0, p_c = 0.0;
int i, j;
with(plane) {
r = pcoord(0); /* Determine row of processor. */

c = pcoord(1); /* Determine column of processor. */


        x = r / 16.0 - 2.0; /* Determine x-coordinate
                             * (real part of point that
                             * this processor will
                             * handle. We subtract 2.0
                             * so that x-coordinates will
                             * range from -2 to +2. */
        y = c / 32.0 - 2.0; /* Determine y-coordinate
                             * (complex part of point
                             * that this processor will
                             * handle. We subtract 2.0
                             * so that y-coordinates will
                             * range from -2 to +2. */
r = x;
c = y;
injulia = 1; /* Initially assume that all points
* are in the Julia set. */
mag = 0.; /* Initially set the absolute value
* of the z(i)’s to 0. We will
* calculate these absolute values
* and reject all points for which
* they get too large. */
}
printf("Enter real part of par:\n");
scanf("%g", &p r);
printf("Enter complex part of par:\n");
scanf("%g", &p c);
/*
* Compute the first 100 z’s for each point of the selected
* region of the complex plane.
*/
for (i = 0; i < 100; i++) {
with(plane) {
/*
* We only work with points still thought to
* be in the Julia set.
*/
where(injulia == 1) {
r1 = r * r - c * c + p_r;
c1 = 2.0 * r * c + p_c;
r = r1;
c = c1;
mag = (r * r + c * c);
/*
* This is the square of the absolute
* value of the point.
*/
}
where(mag > 5.0) injulia = 0;
/*
* Eliminate these points as candidates for
* membership in the Julia set.
*/
}
}
/*
* Now we print out the array of 0’s and 1’s for half of the
* plane in question. A more serious program for computing a
* Julia set might work with more data points and display the
* set in a graphic image, rather than simply printing out a
* boolean array.
*/
for (i = 0; i < 64; i++) {
for (j = 0; j < 64; j++)
printf("%d",[i][j] injulia);
printf("\n");
}
}
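For readers without access to a C* compiler, the arithmetic in the inner loop above can be tried out in ordinary sequential C. The following sketch applies the same escape test to a single point; the point (x, y) and the parameter (p_r, p_c) are arbitrary sample values, not part of the program above:

#include <stdio.h>

int main(void)
{
    /* One sample point z = x + iy and parameter p = p_r + i*p_c. */
    double x = 0.3, y = 0.4, p_r = -0.8, p_c = 0.156;
    double r = x, c = y, mag = 0.0;
    int injulia = 1, i;

    for (i = 0; i < 100 && injulia; i++) {
        double r1 = r * r - c * c + p_r;  /* real part of z*z + p      */
        double c1 = 2.0 * r * c + p_c;    /* imaginary part of z*z + p */
        r = r1;
        c = c1;
        mag = r * r + c * c;              /* squared absolute value    */
        if (mag > 5.0)
            injulia = 0;                  /* the point has escaped     */
    }
    printf("(%g,%g): %s\n", x, y,
           injulia ? "appears to lie in the Julia set" : "escapes");
    return 0;
}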

Mandelbrot sets are very similar to Julia sets, except that, in computing zi+1
from zi , we add the original value of z instead of a fixed parameter p. In fact, it is
not hard to see that a given point is in the Mandelbrot set if and only if the Julia
set generated using this point as its parameter is nonempty. This only requires a
minor change in the program:

#include <stdio.h>
shape [64][128]plane;
float:plane r, c, r1, c1, x, y, mag;
int:plane inmand, row, column;
void main()
{
int i, j;
with(plane) {
r = pcoord(0); /* Determine row of processor. */
c = pcoord(1); /* Determine column of processor. */
x = r / 16.0 − 2.0; /* Determine x−coordinate
* (real part of point that
* this processor will
* handle. We subtract 2.0
* so that x−coordinates will
* range from −2 to +2. */
y = c / 32.0 − 2.0; /* Determine y−coordinate
* (complex part of point
* that this processor will
* handle. We subtract 2.0
* so that y−coordinates will
* range from −2 to +2. */
r = x;
c = y;
inmand = 1; /* Initially assume that all points
* are in the Mandelbrot set. */
mag = 0.; /* Initially set the absolute value
* of the z(i)’s to 0. We will
* calculate these absolute values
* and reject all points for which
* they get too large. */
}
/*
* Compute the first 100 z’s for each point of the selected
* region of the complex plane.
*/
for (i = 0; i < 100; i++) {
with(plane) {
/*
* We only work with points still thought to
* be in the Mandelbrot set.
*/
where(inmand == 1) {
r1 = r * r − c * c + x;
c1 = 2.0 * r * c + y;
r = r1;
c = c1;
mag = (r * r + c * c);
/*
* This is the square of the absolute
* value of the point.
*/
}
where(mag > 5.0) inmand = 0;
/*
* Eliminate these points as candidates for
* membership in the Mandelbrot set.
*/
}
}
/*
* Now we print out the array of 0’s and 1’s for half of the
* plane in question. A more serious program for computing a
* Mandelbrot set might work with more data points and
* display the set in a graphic image, rather than simply
* printing out a boolean array.
*/
for (i = 0; i < 64; i++) {
for (j = 0; j < 64; j++)
printf("%d",[i][j] inmand);
printf("\n");
}
}

3.4.7. Pointers. In this section we will discuss the use of pointers in C*. As was
mentioned above, C* contains all of the usual types of C-pointers. In addition, we
have:
• scalar pointers to shapes;
• scalar pointers to parallel variables.
It is interesting that there is no simple way to have a parallel pointer to a paral-
lel variable. Although the CM-2 Connection Machine has indirect addressing, the
manner in which floating point data is processed requires that indirect addressing
be done in a rather restrictive and non-intuitive way (the address must be
"shuffled" in a certain way). The designer of the current
version of C* has simply opted not to implement this in C*. It can be done in PARIS
(the assembly language), but it is not straightforward (although it is fast). We can
“fake” parallel pointers to parallel variables by making the arrays into arrays of
processors — i.e., storing individual elements in separate processors. It is easy and
straightforward to refer to sets of processors via parallel variables. The disadvan-
tage is that this is slower than true indirect addressing (which only involves the
local memory of individual processors).
A scalar pointer to a shape can be coded as:
shape *ptr;
This can be accessed as one might expect:
with(*ptr)...
As mentioned before, shapes can be dynamically allocated via the
allocate_shape command. In order to use this command, the user needs to
include the file <stdlib.h>.
allocate_shape is a procedure whose:
1. first parameter is a pointer to the shape being created;
2. second parameter is the rank of the shape (the number of dimensions);
3. the remaining parameters are the size of each dimension of the shape.
The shape pix in the Levialdi Counting Program could have been created
by:
shape pix;
pix=allocate_shape(&pix,2,64,128);
Dynamically allocated shapes can be freed via the deallocate_shape
command.
Scalar pointers to parallel variables point to all instances of these variables in
their respective shape. For instance, if we write:

shape [8192]s;
int:s a;
int:s *b;
b=&a;
then a and *b refer to the same parallel data-item. Parallel data can be
dynamically allocated via the palloc function. In order to use it one must include
the file <stdlib.h>. The parameters to this function are:
1. The shape in which the parallel variable lives;
2. The size of the parallel data item in bits. This is computed via the bool-
sizeof function.
Example:

#include <stdlib.h>
shape [8192]s;
int:s *p;
p=palloc(s, boolsizeof(int:s));

Dynamically allocated parallel variables can be freed via the pfree function.
This only takes one parameter — the pointer to the parallel data-item.
3.4.8. Subtleties of Communication. A number of topics were glossed over in
the preceding sections, in an effort to present a simple account of C*. We will
discuss these topics here. The alert reader may have had a few questions about
communication:
• How are general and grid-based communications handled in C* — assum-
ing that one has the ability to program these things in C*?
• Although statements like a+=b, in which a is scalar and b is parallel, add
up values of b into a, how is this accomplished if a is a parallel variable? A
number of algorithms (e.g., matrix multiplication) will require this capa-
bility.
• How is data passed between different shapes?
The C* language was designed with the idea of providing all of the capability of
assembly language in a higher-level language (like the C language, originally) and
addresses all of these topics.
We first note that the default forms of communication (implied whenever one
uses left-indexing of processors) is:
1. general communication, if the left-indexing involves data-items or num-
bers;
2. grid communication, if the left-indexing involves dot notation with fixed dis-
placements. This means the numbers added to the dots in the left-indexes
are the same for all processors. For instance a=[.+1][.−1]a; satisfies this
condition. In addition, the use of the pcoord function to compute index-
values results in grid communication, if the displacements are the same
for all processors. For instance, the compiler is smart enough to determine
that
a=[pcoord(0)+1][pcoord(1)−1]a;
involves grid communication, but
a=[pcoord(1)+1][pcoord(0)−1]a;
does not, since the data movement is different for different processors.
Even in general communication, there are some interesting distinctions to
be made. Suppose a is some parallel data-item. We will call the following
code-fragment
a = [2]a
a parallel get statement, and the following
[7]a = a
a parallel send statement. In the first case data enters the current processor
from some other processor, and in the second, data from the current processor is
plugged into another processor.
It turns out that send operations are twice as fast as the parallel get operations.
Essentially, in a get operation, the current processor has to send a message to the
other processor requesting that it send the data.
It is also important to be aware of the distinction between get and send oper-
ations when determining the effect of inactive processors. The get and send opera-
tions are only carried out by the active processors. Consider the code:

shape [8192]s;
int:s dest,source;
int index=4;
where(source < 30)
dest = [index]source;

here, the processors carrying out the action are those of dest — they are getting
data from source. It follows that this code executes by testing the value of source
in every processor, then plugging [4]source into dest in processors in which
source is < 30. In code like

shape [8192]s;
int:s dest,source;
int index=4;
where(source < 30)
[.+1]source = source;

the processors in [.+1]source are passive receivers of data. Even when they are
inactive, data will be sent to them by the next lower numbered processor unless
that processor is also inactive.
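The difference can be simulated in ordinary sequential C. In the sketch below (an illustration only; the array contents, the threshold, and the wraparound in the send are all invented for the example), the active[] mask plays the role of the where-statement: the get only changes active destination positions, while the send transmits from every active source position, even into inactive ones:

#include <stdio.h>

#define N 8

int main(void)
{
    int source[N] = {10, 40, 20, 50, 5, 60, 25, 70};
    int dest[N]   = {0};
    int next[N];
    int active[N];
    int i;

    /* where(source < 30) ... : build the activity mask */
    for (i = 0; i < N; i++)
        active[i] = (source[i] < 30);

    /* get: dest = [4]source; -- only active destinations pull data */
    for (i = 0; i < N; i++)
        if (active[i])
            dest[i] = source[4];

    /* send: roughly [.+1]source = source; -- only active sources push,
     * but the receiving position need not be active (wraparound added
     * here just to keep the example well-defined) */
    for (i = 0; i < N; i++)
        next[i] = source[i];
    for (i = 0; i < N; i++)
        if (active[i])
            next[(i + 1) % N] = source[i];

    for (i = 0; i < N; i++)
        printf("%d: dest=%d next=%d\n", i, dest[i], next[i]);
    return 0;
}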
3.4.9. Collisions. In PARIS, there are commands for data-movement on the
Connection Machine that resolve collisions of data in various ways. The C* com-
piler is smart enough to generate appropriate code for handling these collisions in
many situations — in other words the compiler can determine the programmer’s
intent and generate the proper PARIS instructions.
1. if the data movement involves a simple assignment and there are colli-
sions, a random instance of the colliding data is selected;
2. if a reduction-assignment operation is used for the assignment,
the appropriate operation is carried out. Table 4.1 lists the valid
reduction-assignment operations in C*.
TABLE 4.1. Reduction-assignment operations in C*
Operation   Effect          Value if no active processors
+=          Sum             0
-=          Negative sum    0
^=          Exclusive OR    0
|=          OR              0
<?=         MIN             Maximum possible value
>?=         MAX             Minimum possible value

These reduction-operations are carried out using an efficient algorithm like the
algorithm for adding 8 numbers on page 58. Reduction-operations take O(lg n)
time to operate on n numbers. See § 1 of chapter 6 for more information on the
algorithm used.
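As a sequential C sketch of the shape of such a reduction, the loop below adds n numbers in lg n passes by doubling the stride on each pass (the length is assumed to be a power of two in this simplified version; on a parallel machine each pass would be a single parallel step):

#include <stdio.h>

#define N 8   /* assumed to be a power of two in this sketch */

int main(void)
{
    int v[N] = {3, 1, 4, 1, 5, 9, 2, 6};
    int stride, i;

    /* After the pass with a given stride, v[i] holds the sum of a block
     * of 2*stride consecutive original elements; lg N passes leave the
     * total in v[0]. */
    for (stride = 1; stride < N; stride *= 2)
        for (i = 0; i + stride < N; i += 2 * stride)
            v[i] += v[i + stride];

    printf("sum = %d\n", v[0]);   /* prints 31 */
    return 0;
}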
These considerations are essential in programming some kinds of algorithms.
For instance, matrix multiplication (at least when one uses the O(lg n) time algo-
rithm) requires the use of reduction operations:

#include <stdio.h>
#include <stdlib.h>
shape [64][128]mats;
void matmpy(int, int, int, int:current*,
int:current*, int:current*); /* Function prototype. */

/* This is a routine for multiplying an
m x n matrix (*mat1) by an n x k matrix (*mat2)
to get an m x k matrix (*outmat)
*/
void matmpy(int m, int n, int k,
int:current *mat1,
int:current *mat2,
int:current *outmat)
{
shape [32][32][32]tempsh; /* A temporary shape used
* to hold intermediate
* results −− products
* that are combined
* in the reduction
* operations.
*/
int:tempsh tempv;
*outmat = 0;
with(tempsh)
{
bool:tempsh region = (pcoord(0) < m) &
(pcoord(1) < n)
& (pcoord(2) < k);
/* This is the most efficient
* way to structure a where
* statement with complex
* conditions.
*/
where(region) {
tempv =[pcoord(0)][pcoord(1)] (*mat1) *
[pcoord(1)][pcoord(2)] (*mat2);
[pcoord(0)][pcoord(2)](*outmat) += tempv;
/* Parallel reduction operations here −−
via the ‘+=’−operation. */
}
}
}
void main()
{
int n = 3;
int m = 3;
int k = 3;
int:mats a, b, c;
int i, j;
with(mats) {
a = 0;
b = 0;
c = 0;
/* Test data. */
[0][0]a = 1;
[0][1]a = 2;
[0][2]a = 3;
[1][0]a = 4;
[1][1]a = 5;
[1][2]a = 6;
[2][0]a = 7;
[2][1]a = 8;
[2][2]a = 9;
b = a;
matmpy(3, 3, 3, &a, &b, &c);
for (i = 0; i < n; i++)
for (j = 0; j < k; j++)
printf("%d\n",[i][j]c);
}
}

Note that the matrices are passed to the procedure as pointers to parallel in-
tegers. The use of pointers (rather than pass-by-value) is generally the most efficient way
to pass parallel data. C programmers may briefly wonder why it is necessary to
take pointers at all — in C matrices are normally stored as pointers to the first
element. The answer is that these “arrays” are not arrays in the usual sense (as
regarded by the C programming language). They are parallel integers — the index-
ing of these elements is done by the parallel programming constructs of C* rather
than the array-constructs of the C programming language. In other words these
are regarded as integers rather than arrays of integers. They store the same kind of
information as an array of integers because they are parallel integers.
Note the size of the shape shape [32][32][32]tempsh; — this is the fundamental
limitation of the O(lg n)-time algorithm for matrix multiplication. A more practi-
cal program would dynamically allocate tempsh to have dimensions that approxi-
mate n, m, and k if possible. Such a program would also switch to a slower algo-
rithm that didn’t require as many processors if n × m × k was much larger than
the available number of processors. For instance, it is possible to simply compute
the m × k elements of *outmat in parallel — this is an O(n)-time algorithm that
requires m × k processors.
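As a concrete (purely sequential) C sketch of this simpler scheme, the outer double loop below stands in for the m × k processors, each of which performs the length-n dot product for its own output entry; the dimensions and test data are arbitrary:

#include <stdio.h>

#define M 3
#define N 3
#define K 3

int main(void)
{
    int a[M][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    int b[N][K] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    int c[M][K];
    int i, j, k;

    /* Conceptually, each (i,k) pair is handled by its own processor; each
     * one runs the O(n) loop over j independently of all the others. */
    for (i = 0; i < M; i++)
        for (k = 0; k < K; k++) {
            int sum = 0;
            for (j = 0; j < N; j++)
                sum += a[i][j] * b[j][k];
            c[i][k] = sum;
        }

    for (i = 0; i < M; i++)
        for (k = 0; k < K; k++)
            printf("%d%c", c[i][k], k == K - 1 ? '\n' : ' ');
    return 0;
}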

EXERCISES.

3.1. How would the CM-2 be classified, according to the Händler scheme de-
scribed on page 16?
3.2. Write the type of program for matrix multiplication described above. It
should dynamically allocate (and free) tempsh to make its dimensions approximate
n, m, and k (recall that each dimension of a shape must be a power of 2), and switch
to a slower algorithm if the number of processors required is too great.
3.3. Suppose we want to program a parallel algorithm that does nested paral-
lelism. In other words it has statements like:
for i = 1 to k do in parallel
int t;
for j = 1 to ℓ do in parallel
t ← min{ j | A[i, j] = 2}
How do we implement this in C*?

3.5. A Critique of C*. In this section we will discuss some of the good and
bad features of the C* language. Much of this material is taken from [66], by
Hatcher, Philippsen, and Tichy.
We do this for several reasons:
• Analysis of these features shows how the hardware influences the design
of the software.
• This type of analysis is important in pointing the way to the development
of newer parallel programming languages.
The development of parallel languages cannot proceed in a purely abstract fash-
ion. Some of the issues discussed here were not known until the C* language was
actually developed.
If one knows PARIS (the “assembly language” of the CM-1 and 2), the first
flaw of C* that comes to mind is the fact that most PARIS instructions are not
implemented in C*. This is probably due to the fact that people working with
the Connection Machine originally programmed mostly in PARIS. There was the
general belief that one had to do so in order to get good performance out of the
machine. The designers of C* may have felt that the people who wanted
to extract maximal performance from the CM-series by using some of the more
obscure PARIS instructions, would simply program in PARIS from the beginning.
Consequently, they may have felt that there was no compelling reason to imple-
ment these instructions in C*. This position is also amplified by the fact that PARIS
is a very large assembly language that is generally more “user-friendly” than many
other assembly languages (though some people might disagree with this statement!) — in other words, PARIS was designed with the idea
that “ordinary users” might program in it (rather than just system programmers).
We have already seen the problem of parallel pointers and indirect addresses
— see § 3.4.7 on page 121. This limitation was originally due to certain limitations
in the CM-2 hardware. In spite of the hardware limitations, the C* language could
have provided parallel pointers — it would have been possible to implement in
software, but wasn’t a high enough priority.
The use of left indices for parallel variables is a non-orthogonal feature of C*.
Parallel scalar variables are logically arrays of data. The only difference between
these arrays and right-indexed arrays is that the data-items are stored in sepa-
rate processors. Since the left-indexed data is often accessed and handled in the
same way as right-indexed arrays, we should be able to use the same notation for
them. In other words, the criticism is that C* requires us to be more aware of the
hardware (i.e. which array-elements are in separate processors) than the logic of a
problem requires. The issue is further confused slightly by the fact that the CM-1
and 2 frequently allocate virtual processors. This means that our knowledge of how
data-items are distributed among the processors (as indicated by the left-indices)
is frequently an illusion in any case.
In [66] Hatcher, Philippsen, and Tichy argue that the statements for declaring
data in C* are non-orthogonal. They point out that the type of a data-item should de-
termine the kinds of operations that can be performed upon it — and that C* mixes
the concepts of data-type and array size. In C* the shape of a parallel variable is
regarded as an aspect of its data-type, where it is really just a kind of array decla-
ration. For instance, in Pascal (or C) one first declares a data-type, and then can
declare an array of data of that type. The authors mentioned above insist that these
are two logically independent operations that C* forces a user to combine. Again
the C* semantics are clearly dictated by the hardware, but Hatcher, Philippsen,
and Tichy point out that a high-level language should insulate the user from the
hardware as much as possible.
In the same vein Hatcher, Philippsen, and Tichy argue that C* requires users
to declare the same procedure in more different ways than are logically necessary.
For instance, if we have a procedure for performing a simple computation on an
integer (say) and we want to execute this procedure in parallel, we must define a
version of the procedure that takes a parallel variable as its parameter. This also
results in logical inconsistency in parallel computations.
For instance, suppose we have a procedure for computing absolute value of
an integer abs(). In order to compute absolute values in parallel we must define
a version of this procedure that takes a suitable parallel integer as its parame-
ter:

int:current abs(int:current);

Now suppose we use this procedure in a parallel block of code:

int:someshape c,d,e;
where(somecondition)
{
int:someshape a;
a=c+d;
e=abs(a);
}

Now we examine how this code executes. The add-operation a=c+d is per-
formed in parallel, upon parallel data. When the code reaches the statement
e=abs(a), the code logically becomes sequential. That is, it performs a single
function-call with a single data-item (albeit, a parallel one). Hatcher, Philippsen,
and Tichy argue that it should be possible to simply have a single definition of
the abs-function for scalar variables, and have that function executed in parallel
upon all of the components of the parallel variables involved. They argue that this
would be logically more consistent than what C* does, because the semantics of
the abs-function would be the same as for addition-operations.
In defense of C*, one should keep in mind that it is based upon C, and the
C programming language is somewhat lower-level than languages like Pascal or
Lisp (it is sometimes called "high-level assembly language"). One could argue that it is better than working in PARIS, and one shouldn't
expect to be completely insulated from the hardware in a language based upon C.
An international group of researchers have developed a programming lan-
guage that attempts to correct these perceived shortcomings of C* — see [70], [69],
and [71]. This language is based upon Modula-2, and is called Modula*. We dis-
cuss this language in the next section.

4. Programming a MIMD-SIMD Hybrid Computer: Modula*


4.1. Introduction. In this section we will discuss a new parallel programming
language based upon Modula-2 in somewhat the same way that C* was based
upon the C programming language.
The semantics of this language are somewhat simpler and clearer than C*, and
it is well-suited to both SIMD and MIMD parallel programming. This language
has been developed in conjunction with a MIMD-SIMD hybrid computer — the
Triton system, being developed at the University of Karlsruhe. They have imple-
mented Modula* on the CM-2 Connection Machine, and the MasPar MP-1, and
intend to implement it upon their MIMD-SIMD hybrid computer, the Triton/1,
when it is complete. The discussion in this section follows the treatment of Mod-
ula* in [70], by Christian Herter and Walter Tichy.

4.2. Data declaration.


4.2.1. Elementary data types. The elementary data-types of Modula* are the
same as those of Modula-2. They closely resemble basic data-types in Pascal. The
most striking difference (at first glance) is that the language is case-sensitive and
all keywords are capitalized. We have:
• INTEGER.
• CARDINAL. These are unsigned integers.
• CHAR. Character data.
• BOOLEAN. Boolean data.
• REAL. Floating point data.
• SIGNAL. This is a special data-type used for synchronizing processes. A
SIGNAL is essentially a semaphore. These data-items are used in Modula*
somewhat more than in Modula-2, since this is the main synchronization
primitive for asynchronous parallel processes.
4.2.2. Parallel data types. Unlike C*, Modula* has no special parallel data type.
Parallel data is indistinguishable from array data, except that one can specify cer-
tain aspects of parallel allocation.
In Modula-2 array-types are defined with statements like:
ARRAY RangeDescriptor OF type;
Here RangeDescriptor describes the range of subscript-values, and type is the
data-type of the elements of the array. Example:
VAR
x:ARRAY [1..10] OF INTEGER;
This is very much like Pascal, except that keywords must be capitalized. This
construct is generalized to higher dimensions in a way reminiscent of C:
VAR
y:ARRAY [1..10], [-3..27] OF INTEGER;
In Modula*, arrays declared this way will be parallel, by default. If we
want to control how this allocation is done, Modula* allows an optional
additional parameter, called the allocator that follows the RangeDescriptor for each
dimension:
ARRAY RangeDescriptor [allocator] OF type;
If an allocator is missing it is assumed to be the same as the next allocator to
its right. The possible values for this allocator are:
1. SPREAD. This is the default allocator. It means that values of the array
are allocated in ranges of consecutive values in the separate processors.
This means that if we want to allocate n array elements among p processors,
segments of size ⌈n/p⌉ are allocated to each processor. This allocator
guarantees that array elements with consecutive subscript-values are allocated,
as much as possible, to the same processors. In a one-dimensional
array (from 0 to n − 1, say), element i is stored in location i mod p in
processor ⌈i/p⌉. (A small C sketch of the SPREAD and CYCLE layouts
appears after this list.)
2. CYCLE. This allocator assigns elements of the array in a round-robin fashion
among the available processors. As a result, array-elements with consecutive
subscript-values are almost always put in different processors. In
a one-dimensional array (from 0 to n − 1), element i is put in location ⌈i/p⌉
of processor i mod p.
In both of the preceding cases, if more than one subscript in a sequence
uses the same allocator, all of the consecutive subscripts with the same al-
locator are combined into a single one-dimensional index that is processed
using SPREAD or CYCLE, as the case may be. For instance in the follow-
ing declaration:
VAR
y:ARRAY [1..10] CYCLE [-2..27] CYCLE OF INTEGER;
the two subscripts are combined into a single subscript that runs from
0 to 299, and the elements are distributed by the CYCLE distribution, us-
ing this one subscript.
The following two allocators correspond to SPREAD and CYCLE, re-
spectively, but inhibit the process of combining several consecutive sub-
scripts with the same allocator into a single subscript:
3. SBLOCK. this corresponds to SPREAD.
4. CBLOCK. this corresponds to CYCLE.
5. LOCAL. This forces all elements of the array whose subscripts only differ
in an index defined to be LOCAL, to be in the same processor.
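The following small C sketch illustrates the two main layouts in the way they are described above: SPREAD places blocks of ⌈n/p⌉ consecutive elements on each processor, while CYCLE deals the elements out round-robin. The values of n and p are arbitrary sample values:

#include <stdio.h>

int main(void)
{
    int n = 10;                   /* number of array elements      */
    int p = 4;                    /* number of processors          */
    int block = (n + p - 1) / p;  /* ceil(n/p): SPREAD block size  */
    int i;

    printf(" i  SPREAD proc  CYCLE proc\n");
    for (i = 0; i < n; i++) {
        int spread_proc = i / block;  /* consecutive elements share a processor */
        int cycle_proc  = i % p;      /* consecutive elements alternate         */
        printf("%2d      %d            %d\n", i, spread_proc, cycle_proc);
    }
    return 0;
}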

EXERCISES.

4.1. If we have a computer with 6 processors (numbered from 0 to 5), describe
how the following array is allocated among the processors:
VAR
y:ARRAY [1..10] CYCLE [-2..27] CYCLE OF INTEGER;
4.2. If we have a computer with 6 processors (numbered from 0 to 5), describe
how the following array is allocated among the processors:
VAR
y:ARRAY [1..10] CBLOCK [-2..27] CBLOCK OF INTEGER;

4.3. The FORALL Statement. In this section we will describe the command
that initiates parallel execution of code-segments. There is a single statement that
suffices to handle asynchronous and synchronous parallel execution, and can con-
sequently, be used for MIMD or SIMD programs.
4.1. The basic syntax is:
FORALL identifier:rangeType IN { PARALLEL | SYNC }
statementSequence
END
Here:
1. identifier is a variable that becomes available inside the block — its scope
is the FORALL-block. This identifier takes on a unique value in each of
the threads of control that are created by FORALL-statement.
2. rangeType is a data-type that defines a range of values — it is like that
used to define subscript-ranges in arrays. It define the number of parallel
threads created by the FORALL-statement. One such thread is created for
each value in the rangeType, and in each such thread, the identifier identifier
takes on the corresponding value.
3. statementSequence is the code that is executed in parallel.
Example:
FORALL x:[1..8] IN PARALLEL
y:=f(x)+1;
END
This statement creates 8 processes, in a 1-1 correspondence with the numbers
1 through 8. In process i, the variable x takes the value i.
The two synchronization-primitives, PARALLEL and SYNC, determine how
the parallel threads are executed. The PARALLEL option specifies asynchronous
processes — this is MIMD-parallel execution. The SYNC option specifies syn-
chronous execution of the threads — essentially SIMD-parallel execution.
Two standard Modula-2 commands SEND and WAIT are provided for ex-
plicit synchronization of asynchronous processes. The syntax of these functions is
given by:
SEND(VAR x:SIGNAL);
WAIT(VAR x:SIGNAL);
The sequence of events involved in using these functions is:
1. A variable of type SIGNAL is declared globally;
2. When a process calls SEND or WAIT with this variable as its parameter,
the variable is initialized. Later, when the matching function is called this
value is accessed.
4.3.1. The asynchronous case. In the asynchronous case, the FORALL-
statement terminates when the last process in the StatementSequence terminates.
No assumptions are made about the order of execution of the statements in the
asynchronous case, except where they are explicitly synchronized via the SEND
and WAIT statements. The statements in statementSequence may refer to their
private value of identifier or any global variable. If two processes attempt to store
data to the same global memory-location, the result is undefined, but equal to one
of the values that were stored to it.
4.3.2. The synchronous case. This is the case that corresponds to execution on a
SIMD computer. This case is like the asynchronous case except that the statements
in the body of the construct are executed synchronously. There are a number of
details to be worked out at this point.
Sequence A sequence of statements of the form
$S_1; S_2; \ldots; S_k$
is executed synchronously by a set of processes by synchronously executing
each of the $S_i$ in such a way that $S_{i+1}$ is not begun until statement $S_i$ is
completed. This can be regarded as a recursive definition.
Assignment An assignment statement of the form
L:=E
is executed synchronously by a set rangeType of processes as follows.
1. All processes in rangeType evaluate the designator L synchronously,
yielding |rangeType| results, each designating a variable.
2. All processes in rangeType evaluate expression E synchronously,
yielding |rangeType| values.
3. All processes in rangeType store their values computed in the second
step into their respective variables computed in the first step (in arbi-
trary order).
Then the assignment terminates. If the third step results in several values
being stored into the same memory location, one of the store-operations is
successful — and which of them succeeds is indeterminate.
expressions These are executed somewhat like the assignment statement
above. The set of processes rangeType apply the operators to the operands
in the same order and in unison. The order of application is determined
by the precedence-rules of Modula-2.
If statement An if statement executes somewhat like a where statement in C*
— see page 105 in the previous section. The following statement
IF e1 THEN t1
ELSIF e2 THEN t2
···
ELSE tk
END
is executed synchronously as follows:
1. All processes evaluate expression e1 synchronously.
2. Those processes for which e1 evaluates to TRUE then execute t1 syn-
chronously, while the other processes evaluate e2 synchronously.
3. Those processes whose evaluation of e2 results in TRUE then execute
t2 synchronously, and so on.
The synchronous execution of the if statement terminates when the last
non-empty subset of rangeType terminates.
CASE statement Syntax:
CASE e OF
c1: s1
···
ck: sk
ELSE s0
END
(here the ELSE clause is optional).
1. All processes in rangeType synchronously evaluate expression e.
2. Each process selects the case whose label matches the computed value
of e. The set of processes rangeType gets partitioned into k (or k + 1,
if the ELSE clause appears). The processes in each set synchronously
execute the statements that correspond to the case that they have se-
lected. This is very much like the switch statement in the old version of
C*. (In the current version of C*, the switch statement is used only for
sequential execution).
LOOP statement. This is one of the standard looping constructs in Modula-
2.
LOOP
t1
END
Looping continues until the EXIT statement is executed. This type of
statement is executed synchronously in Modula* by a set of processors as
follows:
• As long as there is one active process within the scope of the LOOP
statement, the active processes execute statement t1 synchronously.
• An EXIT statement is executed synchronously by a set of processes by
marking those processes inactive with respect to the smallest enclosing
LOOP statement.
WHILE statement Syntax:
WHILE condition DO
t1
END
In Modula* each iteration of the loop involves:
1. all active processes execute the code in condition and evaluate it.
2. processes whose evaluation resulted in FALSE are marked inactive.
3. all other processes remain active and synchronously execute the code
in t1.
The execution of the WHILE statement continues until there are no active
processes.
FORALL statement One of the interesting aspects of Modula* is that parallel
code can be nested. Consider the following FORALL statement:
FORALL D: U
t1
...
tk
END
This is executed synchronously by a set rangeType, of processes, as
follows:
1. Each process p ∈ rangeType computes U_p, the set of elements in the
range given by U. Processes may compute different values for U_p, if
U has identifier (the index created by the enclosing FORALL statement
that identifies the process — see 4.1 on page 130) as a parameter.
2. Each process p creates a set, S_p, of processes with |S_p| = |U_p| and
supplies each process with a constant, D, bound to a unique value of U_p.
3. Each process, p, initiates the execution of the statement sequence t1
through tk by S_p. Thus there are |rangeType| sets of processes,
each executing the sequence of statements independently of the oth-
ers. Each such set of processes executes its statements synchronously
or asynchronously, depending on the option chosen in this FORALL
statement.
WAIT and SEND When a set of processes synchronously executes a WAIT
command, if any process in the set blocks on this command, then all of
them do. This rule allows synchronization to be maintained.
Procedure call The set of processes rangeType synchronously executes a pro-
cedure call of the form
P(x1,...,xk);
as follows:
1. Each process in rangeType creates an activation of procedure P.
2. All processes in rangeType evaluate the actual parameters x1,. . . ,xk
synchronously in the same order (not necessarily left to right) and
substitute them for the corresponding formal parameters in their
respective activations of P.
3. All processes execute the body of P synchronously in the environment
given by their respective activations.
4. The procedure call terminates when all processes in rangeType have
been designated inactive with respect to the call.
Note: there is an implied return statement at the end of each procedure.
In an explicit return statement like
RETURN [E];
if the expression E is supplied all processes synchronously evaluate it
and use this value as the result of the function-call. After evaluating E (or
if no expression is supplied) all processes in rangeType are marked inactive
with respect to the most recent procedure call.

4.4. Sample Programs. Here are some sample programs. Our standard ex-
ample is the program for adding up n integers. This example was first introduced
in the Introduction (on page 59), and an implementation in C* appeared in § 3.3 (on
page 103). It is interesting to compare these two programs:
VAR V: ARRAY [0..N-1] OF REAL;
VAR stride: ARRAY [0..N-1] OF CARDINAL;
BEGIN
FORALL i : [0..N-1] IN SYNC
stride[i] := 1;
WHILE stride[i] < N DO
IF (((i + 1) MOD (stride[i] * 2)) = 0) AND ((i - stride[i]) >= 0) THEN
V[i] := V[i] + V[i - stride[i]]
END;
stride[i] := stride[i] * 2
END
END (* sum in V[N-1] *)
END
Note that we must explicitly declare stride as an array, with one element per
process, and its use is indexed by the process-identifier i.
Modula* does not take advantage of the hardware support for performing
operations like the above summation in a single statement.
We can also perform these calculations asynchronously:
VAR V: ARRAY [0..N-1] OF REAL;
VAR stride: CARDINAL;
BEGIN
stride := 1;
WHILE stride < N DO
FORALL i : [0..N-1] IN PARALLEL
IF (((i + 1) MOD (stride * 2)) = 0) AND ((i - stride) >= 0) THEN
V[i] := V[i] + V[i - stride]
END;
stride := stride * 2
END
END (* sum in V[N-1] *)
END

4.5. A Critique of Modula*. The language Modula* responds to all of the
points raised by Tichy, Philippsen, and Hatcher in [66] and it might, consequently,
be regarded as a “Second generation” parallel programming language (with C* a
first generation language). This is particularly true in light of the probable trend
toward the development of hybrid MIMD-SIMD computers like the CM-5 and the
Triton/1, and the fact that Modula* handles asynchronous as well as synchronous
parallel code.
The simple and uniform semantics of the parallel code in Modula* makes programs
very portable, and even makes Modula* a good language for describing
parallel algorithms in the abstract sense (this is reminiscent of the Algol language,
which was used as much as a publication language or pseudocode for describing
algorithms as an actual programming language).
One suggestion: some operations can be performed very quickly and effi-
ciently on a parallel computer, but the implementation of the algorithm is highly
dependent upon the architecture and the precise layout of the data. It would be
nice if Modula* supported these operations.
For instance the census-operations like:
1. Forming a sum or product of a series of numbers.
2. Performing versions of the generic ASCEND or DESCEND algorithms in
chapter 3 (see page 57).
Many architectures have special hardware support for such operations — for in-
stance the Connection Machines support census-operations. The material in chap-
ter 3 shows that such hardware support is at least possible for the ASCEND and
DESCEND algorithms. I feel that it might be desirable to add ASCEND and DE-
SCEND operations to the Modula* language, where the user could supply the
OPER(∗, ∗; ∗, ∗)-function. Failing this, it might be desirable to include a generic
census operation in the language.

The alternative to implementing these generic parallel operations would be to
write the Modula* compiler to automatically recognize these constructs in a program.
This does not appear to be promising — such a compiler would constitute
an "automatically parallelizing" compiler, at least to some extent: we would be
requiring the compiler to recognize the programmer's intent rather than simply
translate the program, and the experience with such compilers has been discouraging.

CHAPTER 5

Numerical Algorithms

In this chapter we will develop SIMD algorithms for solving several types of
numerical problems.

1. Linear algebra
1.1. Matrix-multiplication. In this section we will discuss algorithms for per-
forming various matrix operations. We will assume that all matrices are n × n,
unless otherwise stated. It will be straightforward to generalize these algorithms
to non-square matrices where applicable. We will also assume that we have a
CREW parallel computer at our disposal, unless otherwise stated. We begin with:
PROPOSITION 1.1. Two n × n matrices A and B can be multiplied in O(lg n) time using O(n³) processors.

PROOF. The idea here is that we form the n³ products $A_{ij}B_{jk}$ and take O(lg n) steps to sum over j. □

Since there exist algorithms for matrix multiplication that require fewer than
n³ multiplications (the best current asymptotic estimate, as of 1991, is $n^{2.376}$
multiplications — see [34]) we can generalize the above to:
COROLLARY 1.2. If multiplication of n × n matrices can be accomplished with M(n)
multiplications then it can be done in O(lg n) time using M(n) processors.
This algorithm is efficiently implemented by the sample program on page 124.
We will also present an algorithm, due to Pan and Reif, which inverts an n × n
matrix in O(lg² n) time using M(n) processors — see [127]. Recall that M(n) is
the number of multiplications needed to multiply two n × n matrices.

1.2. Systems of linear equations. In this section we will study parallel algo-
rithms for solving systems of linear equations
Ax = b
where A is an n × n matrix, and b is an n-dimensional vector.
We will concentrate upon iterative methods for solving such equations. There
are a number of such iterative methods available:
• The Jacobi Method.
• The JOR method — a variation on the Jacobi method.
• The Gauss-Seidel Method.
• The SOR method.
These general procedures build upon each other. The last method is the one we
will explore in some detail, since a variant of it is suitable for implementation on
a parallel computer. All of these methods make the basic assumption that the
largest elements of the matrix A are located on the main diagonal. Although this
assumption may seem rather restrictive:
• This turns out to be a natural assumption for many of the applications of
systems of linear equations. One important application involves numer-
ical solutions of partial differential equations, and the matrices that arise in
this way are mostly zero. See § 6 for more information on this application.
• Matrices not dominated by their diagonal entries can sometimes be trans-
formed into this format by permuting rows and columns.
In 1976, Csanky found NC-parallel algorithms for computing determinants and in-
verses of matrices — see [39], and § 1.5 on page 174. Determinants of matrices are
defined in 1.8 on page 139. Csanky’s algorithm for the inverse of a matrix wasn’t
numerically stable and in 1985, Pan and Reif found an improved algorithm for
this — see [127] and § 1.3 on page 165. This algorithm is an important illustration
of the fact that the best parallel algorithm for a problem is often entirely different
from the best sequential algorithms. The standard sequential algorithm for invert-
ing a matrix, Gaussian elimination, does not lend itself to parallelization because
the process of choosing pivot point is P-complete — see [165]. It is very likely that
there doesn’t exist a fast parallel algorithm for inverting a matrix based upon the
more traditional approach of Gaussian elimination. See the discussion on page 41
for more information.
§ 1.3 discusses methods that work for arbitrary invertible matrices. These
methods require more processors than the iterative methods discussed in the other
sections, but are of some theoretical interest.
1.2.1. Generalities on vectors and matrices. Recall that a matrix is a linear trans-
formation on a vector-space.
D EFINITION 1.3. Let A be an n × n matrix. A will be called
1. lower triangular if Ai,j = 0 whenever i ≥ j.
2. full lower triangular if Ai,j = 0 whenever i > j.
3. upper triangular if Ai,j = 0 whenever i ≤ j.
4. full upper triangular if Ai,j = 0 whenever i < j.
Note that this definition implies that upper and lower triangular matrices have
zeroes on the main diagonal. Many authors define these terms in such a way that
the matrix is permitted to have nonzero entries on the main diagonal.
DEFINITION 1.4. Given a matrix A, the Hermitian transpose of A, denoted $A^H$, is defined by $(A^H)_{ij} = A^*_{ji}$, where ∗ denotes the complex conjugate.

DEFINITION 1.5. Let u and v be two vectors of an n-dimensional vector-space V. The inner product of u and v, denoted (u, v), is defined by
$$(u,v) = \sum_{i=1}^{n} \bar{u}_i \cdot v_i$$
where $\bar{u}_i$ denotes the complex conjugate of $u_i$.
Inner products have the following properties:
1. $\|v\|_2^2 = (v, v)$
2. $(v, w) = \overline{(w, v)}$
3. $(v, Aw) = (A^H v, w)$
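As a small illustration of the conjugation convention in this definition, the following C sketch (using the C99 complex types; the sample vectors are arbitrary) computes (u, v) by conjugating the components of the first argument:

#include <stdio.h>
#include <complex.h>

int main(void)
{
    double complex u[3] = {1.0 + 2.0 * I, 0.0 - 1.0 * I, 3.0};
    double complex v[3] = {2.0, 1.0 + 1.0 * I, 0.0 - 4.0 * I};
    double complex ip = 0.0;
    int i;

    /* (u,v) = sum over i of conj(u_i) * v_i */
    for (i = 0; i < 3; i++)
        ip += conj(u[i]) * v[i];

    printf("(u,v) = %g + %gi\n", creal(ip), cimag(ip));
    return 0;
}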
DEFINITION 1.6. A set of vectors $\{v_1, \ldots, v_k\}$ is called orthonormal if
$$(v_i, v_j) = \begin{cases} 1 & \text{if } i = j\\ 0 & \text{otherwise} \end{cases}$$

DEFINITION 1.7. Let $L = \{n_1, \ldots, n_k\}$ be the result of permuting the list of integers $\{1, \ldots, k\}$. The parity of the permutation, $\wp(n_1, \ldots, n_k)$, is ±1, and is computed via:
1. For each $n_i$ in L, count the number of values $n_j$ appearing to the right of $n_i$ in L (i.e. j > i) such that $n_i > n_j$. Let this count be $c_i$.
2. Form the total $c = \sum_{i=1}^{k} c_i$, and define $\wp(n_1, \ldots, n_k) = (-1)^c$.
The number c roughly measures the extent to which the permutation alters the normal sequence of the integers $\{1, \ldots, k\}$. Here are a few examples:
Suppose the permutation is {3, 1, 2, 4, 5}. Then $c_1 = 2$, because the first element in this sequence is 3 and there are 2 smaller elements to the right of it in the sequence (namely 1 and 2), while $c_2 = \cdots = c_5 = 0$, so $c = 2$. It follows that the parity of this permutation is +1. Permutations with a parity of +1 are commonly called even permutations and ones with a parity of −1 are called odd permutations.
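The parity computation is easy to express directly in code. The following C sketch counts, for each position, the smaller values appearing to its right, exactly as in the definition; the sample permutation is the one from the example above:

#include <stdio.h>

/* Returns +1 or -1: the parity of the permutation perm[0..k-1]. */
int parity(const int *perm, int k)
{
    int c = 0;
    int i, j;

    for (i = 0; i < k; i++)
        for (j = i + 1; j < k; j++)
            if (perm[i] > perm[j])
                c++;                      /* one inversion counted */
    return (c % 2 == 0) ? +1 : -1;        /* (-1)^c */
}

int main(void)
{
    int perm[5] = {3, 1, 2, 4, 5};
    printf("parity = %+d\n", parity(perm, 5));   /* prints +1 */
    return 0;
}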
DEFINITION 1.8. If A is an n × n matrix, the determinant of A is defined to be
$$\det(A) = \sum_{\substack{i_1,\ldots,i_n\\ \text{all distinct}}} \wp(i_1,\ldots,i_n)\, A_{1,i_1}\cdots A_{n,i_n}$$
where $\wp(i_1,\ldots,i_n)$ is the parity of the permutation $\begin{pmatrix} 1 & \ldots & n\\ i_1 & \ldots & i_n \end{pmatrix}$.
There is a geometric definition of determinant that is interesting:
Let C be the n-dimensional unit cube at the origin of the coordinate
system. It is defined by letting each of the n coordinates go from
0 to 1. Now consider A(C), where A is an n × n matrix. This is an
n-dimensional polyhedron. Then the absolute value of det(A) is
equal to the volume of A(C).
Determining the sign of det(A) is not quite so simple — it depends upon something called the orientation of A(C).
Here are some basic properties of determinants of matrices:
PROPOSITION 1.9. If A and B are n × n matrices then:
1. det(A · B) = det(A) · det(B);
2. The linear transformation represented by A is invertible if and only if det(A) ≠ 0.
3. If A is a lower or upper triangular matrix and D is a diagonal matrix, then det(D + A) = det(D).
Vector-spaces can be equipped with measures of the size of vectors: these are
called norms of vectors. Our measure of the size of a matrix will be called the norm
of a matrix – it will be closely associated with a norm of a vector (although there
exist norms of matrices that are not induced by norms of vectors, we will not use
them in the present discussion). Essentially,
the norm of a matrix will measure the extent to which that matrix “stretches” vec-
tors in the vector-space — where stretching is measured with respect to a norm of
vectors.
DEFINITION 1.10. Let V be a vector space. A norm on V is a function $\|\ast\|: V \to \mathbb{R}$ with the following properties:
1. $\|v\| = 0$ if and only if the vector v is 0;
2. for all v, w ∈ V, $\|v + w\| \le \|v\| + \|w\|$;
3. for all v ∈ V, c ∈ C, $\|c \cdot v\| = |c| \cdot \|v\|$.
A function $\|\ast\|$ from matrices over V to $\mathbb{R}$ is called a matrix norm if it satisfies conditions like 1, 2, and 3 above and, in addition:
4. for all matrices X and Y over V, $\|XY\| \le \|X\| \cdot \|Y\|$.

DEFINITION 1.11. Given a vector norm $\|\ast\|$ on a vector space we can define the associated matrix norm as follows: $\|M\| = \max_{v \neq 0} \|Mv\|/\|v\|$.
1. The matrix norm inherits properties 1, 2, and 3 from the vector norm. Property 4 results from the fact that the vector v that gives rise to the maximum value for $\|Xv\|$ or $\|Yv\|$ might not also give the maximum value of $\|XYv\|$.
2. Property 3 of the vector norm implies that we can define the matrix norm via: $\|X\| = \max_{v,\|v\|=1} \|Xv\|$.
Here are three fairly standard vector norms:

DEFINITION 1.12. $\|v\|_1 = \sum_i |v_i|$;
$\|v\|_\infty = \max_i |v_i|$;
$\|v\|_2 = \sqrt{\sum_i |v_i|^2}$.
The three vector norms give rise to corresponding matrix norms.
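A direct C sketch of these three vector norms, for a real vector with arbitrary sample entries:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double v[4] = {1.0, -2.0, 3.0, -4.0};
    double norm1 = 0.0, norm_inf = 0.0, norm2 = 0.0;
    int i;

    for (i = 0; i < 4; i++) {
        norm1 += fabs(v[i]);              /* ||v||_1        */
        if (fabs(v[i]) > norm_inf)
            norm_inf = fabs(v[i]);        /* ||v||_infinity */
        norm2 += v[i] * v[i];
    }
    norm2 = sqrt(norm2);                  /* ||v||_2        */

    printf("1-norm = %g, infinity-norm = %g, 2-norm = %g\n",
           norm1, norm_inf, norm2);
    return 0;
}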
DEFINITION 1.13.
1. A square matrix will be called a diagonal matrix if all entries not on the main diagonal are 0;
2. $\det(\lambda \cdot I - A)$ is a polynomial in λ called the characteristic polynomial of A.
3. A number λ ∈ C is called an eigenvalue of A if there exists a vector v ≠ 0 such that Av = λv. This vector is called the eigenvector corresponding to λ. An alternate definition of eigenvalues of A is: λ is an eigenvalue of A if and only if the matrix λ · I − A is not invertible. This leads to an equation for computing eigenvalues (at least, for finite-dimensional matrices): eigenvalues are roots of the characteristic polynomial.
4. If λ is an eigenvalue of a matrix A, then the eigenspace associated with λ is the space of vectors, v, satisfying (λ · I − A)v = 0. If A has only one eigenvalue equal to λ the eigenspace associated to λ is 1-dimensional — it consists of all scalar multiples of the eigenvector associated with λ.
5. The minimal polynomial µ(λ) of a square matrix A is the polynomial, µ(x), of lowest degree such that µ(A) = 0.
6. The nullspace of a matrix A is the space of all vectors v such that Av = 0. It is the same as the eigenspace of 0 (regarded as an eigenvalue of A).
7. ρ(A), the spectral radius of A, is defined to be the maximum magnitude of the eigenvalues of A.
8. A matrix A is called Hermitian if $A = A^H$. Recall the definition of the Hermitian transpose in 1.4 on page 138.
9. A matrix U is called unitary if $U^{-1} = U^H$.
10. The condition number of a matrix is defined by cond A = $\|A\|_2 \cdot \|A^{-1}\|_2 \ge \|I\| = 1$ if A is nonsingular, and ∞ otherwise.
It is not difficult to see that the eigenvalues of $A^H$ are the complex conjugates of
those of A — consequently, if A is Hermitian its eigenvalues are real.
Note that eigenvectors are not nearly as uniquely determined as eigenvalues
— for instance any scalar multiple of an eigenvector associated with a given eigen-
value is also an eigenvector associated with that eigenvalue.
Although computation of eigenvalues and eigenvectors of matrices is some-
what difficult in general, some special classes of matrices have symmetries that
simplify the problem. For instance, the eigenvalues of a diagonal matrix are just the
values that occur in the main diagonal. For a more interesting example of compu-
tation of eigenvalues and eigenvectors of a matrix see § 2.4 on page 197 and § 6.1.3
on page 244.
PROPOSITION 1.14. If $\|\ast\|$ is any norm and A is any matrix, then:
1. $\rho(A) \le \|A\|$.
2. $\|A^k\| \to 0$ as $k \to \infty$ if and only if $\rho(A) < 1$.
3. $\det(A) = \prod_{i=1}^{n} \lambda_i$, where the $\lambda_i$ are the eigenvalues of A. Here, given values of eigenvalues may occur more than once (as roots of the characteristic polynomial).

PROOF. First statement: Let λ be the largest eigenvalue of A. Then $\rho(A) = |\lambda|$, and $Av = \lambda v$, where v is the eigenvector corresponding to λ. But $\|Av\| \le \|A\| \cdot \|v\|$ by 1.11, and $Av = \lambda v$ so $\|Av\| = |\lambda| \cdot \|v\| = \rho(A)\|v\|$. We get $\rho(A)\|v\| \le \|A\| \cdot \|v\|$ and the conclusion follows upon dividing by $\|v\|$.
Second statement: Suppose $\rho(A) \ge 1$. Then $\rho(A)^k \le \rho(A^k)$, so $\rho(A^k) \ge 1$ for all k, and this means that $\|A^k\| \ge 1$ for all k. On the other hand, if $\|A^k\| \to 0$ as $k \to \infty$, then $\rho(A^k) \to 0$ as $k \to \infty$. The fact that $\rho(A)^k \le \rho(A^k)$ and the fact that powers of numbers ≥ 1 are all ≥ 1 imply the conclusion.
Third statement: This follows from the general fact that the constant term of a polynomial is equal to its value at zero and the product of the roots. We apply this general fact to the characteristic polynomial $\det(A - \lambda \cdot I) = 0$. □

DEFINITION 1.15. Two n × n matrices A and B will be said to be similar if there exists a third invertible n × n matrix C such that $A = CBC^{-1}$.
Two matrices that are similar are equivalent in some sense. Suppose:
• A is a transformation of a vector-space with basis vectors $\{b_i\}$, i = 1, . . . , n.
• B is a transformation of the same vector-space with basis vectors $\{b_i'\}$.
• C is the matrix whose columns are the result of expressing the $\{b_i'\}$ in terms of the $\{b_i\}$.
Then the result of writing the B-transformation in terms of the basis $\{b_i\}$ is the matrix A (if $A = CBC^{-1}$). In other words, similar matrices represent the same transformation — in different coordinate-systems. It makes sense that:
LEMMA 1.16. Let A and B be similar n × n matrices with $A = CBC^{-1}$ for some invertible matrix C. Then A and B have the same eigenvalues and spectral radii.

PROOF. The statement about the eigenvalues implies the one about spectral radius. Let λ be an eigenvalue of A with corresponding eigenvector v (see 1.13 on page 140). Then
$$Av = \lambda v$$
Now replace A by $CBC^{-1}$ in this formula to get
$$CBC^{-1}v = \lambda v$$
and multiply (on the left) by $C^{-1}$ to get
$$BC^{-1}v = \lambda C^{-1}v$$
This implies that λ is an eigenvalue of B with corresponding eigenvector $C^{-1}v$. A similar argument implies that every eigenvalue of B also occurs as an eigenvalue of A. □
In fact
LEMMA 1.17. Let A be an n × n square matrix. Then A is similar to a full upper triangular matrix, T, with the eigenvalues of A on the main diagonal.

PROOF. Suppose the eigenvalues of A are $\lambda_1, \ldots, \lambda_k$ with corresponding linearly independent eigenvectors $v = \{v_1, \ldots, v_k\}$. Now we form a basis of the vector space composed of eigenvectors and completed (i.e. the eigenvectors might not form a complete basis for the vector-space) by some other vectors linearly independent of the eigenvectors — say $u = \{u_1, \ldots, u_{n-k}\}$. In this basis, A has the form
$$\begin{pmatrix} D_1 & M_1 \\ 0 & A_2 \end{pmatrix}$$
where $M_1$ is a $k \times (n-k)$ matrix, $A_2$ is an $(n-k) \times (n-k)$ matrix, and $D_1$ is the diagonal matrix:
$$\begin{pmatrix} \lambda_1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \lambda_k \end{pmatrix}$$
Note that we have to include the submatrix $M_1$ because, although A maps the $\{v_i\}$ into themselves, it may map the $\{u_j\}$ into the subspace spanned by the $\{v_i\}$. Now we find the eigenvalues and corresponding eigenvectors of $A_2$ and modify the set of vectors u so that it contains as many of these new eigenvectors as possible. In this new basis, $A_2$ will have the form
$$\begin{pmatrix} D_2 & M_2 \\ 0 & A_3 \end{pmatrix}$$
and the original matrix A will have the form
$$\begin{pmatrix} D_1 & M_1 \\ 0 & \begin{pmatrix} D_2 & M_2 \\ 0 & A_3 \end{pmatrix} \end{pmatrix}$$
The argument follows by induction: we carry out a similar operation on $A_3$ and so forth to get a sequence of matrices $\{A_4, \ldots, A_\ell\}$ of monotonically decreasing size. The process terminates when we reach a matrix $A_\ell$ that is 1 × 1 and, at that point, the original matrix A is full upper-triangular. □
If A is Hermitian, we can say more:

LEMMA 1.18. Let A be a Hermitian matrix. Then A is similar to a diagonal matrix with the eigenvalues of A on the main diagonal. The similarity transformation is unitary — in other words
$$A = UDU^{-1}$$
where U is a unitary matrix and D is a diagonal matrix.
Recall the definition of a unitary matrix in 1.13 on page 141. This lemma is proved in Appendix 1.2.2 on page 147.

LEMMA 1.19. Let λ be an eigenvalue of a nonsingular matrix W. Then 1/λ is an eigenvalue of $W^{-1}$.

PROOF. Note that λ ≠ 0 because the matrix is nonsingular. Let v be an associated eigenvector, so $Wv = \lambda v$. Then $Wv \neq 0$ and $W^{-1}(Wv) = v = (1/\lambda)\lambda v = (1/\lambda)Wv$, so Wv is an eigenvector of $W^{-1}$ with eigenvalue 1/λ. □
LEMMA 1.20. $\|W\|_2 = \|W^H\|_2$; and $\|W\|_2 = \rho(W)$ if $W = W^H$.

PROOF. Recall the definition of the inner product in 1.5 on page 138. The three basic properties of inner product listed after 1.5 imply the first statement, since
$$\|W\|_2 = \sqrt{\max_{v \neq 0} (Wv, Wv)/(v,v)} = \sqrt{\max_{v \neq 0} |(v, W^H W v)|/(v,v)} = \sqrt{\max_{v \neq 0} |(W^H W v, v)|/(v,v)}$$
Suppose V is the vector-space upon which W acts: W: V → V. The second statement follows from the fact that we can find a basis for V composed of eigenvectors of W (here, we use the term eigenvector in the loosest sense: the nullspace of W is generated by the eigenvectors of 0). This is a well-known result that is based upon the fact that eigenvectors of distinct eigenvalues are linearly independent, and a count of the dimensions of the nullspaces of $W - \lambda \cdot I$ shows that the vector-space generated by the eigenspaces of all of the eigenvectors is all of V. Suppose that $v = \sum_{i=1}^{n} c_i e_i$, where the $e_i$ are eigenvectors of W. Then
$$(Wv, Wv) = \sum_{i=1}^{n} (Wc_i e_i, Wc_i e_i) = \sum_{i=1}^{n} (\lambda_i c_i e_i, \lambda_i c_i e_i) = \sum_{i=1}^{n} |\lambda_i|^2 |c_i|^2$$
so $(Wv, Wv)/(v, v)$ is a weighted average of the squared magnitudes of the eigenvalues of W. It follows that the maximum value of $(Wv, Wv)/(v, v)$ occurs when v is an eigenvector, and that value is the square of an eigenvalue. □

DEFINITION 1.21. A Hermitian matrix A will be called:
1. Positive semidefinite if (Av, v) ≥ 0 for any nonzero vector v.
2. Positive definite if (Av, v) > 0 for any nonzero vector v.
Such a matrix A will also be called Hermitian positive semidefinite (respectively, Hermitian positive definite).

LEMMA 1.22. A Hermitian matrix is positive semidefinite if and only if its eigenvalues are nonnegative. It is positive definite if and only if its eigenvalues are positive.
This follows immediately from 1.18 on page 143.
COROLLARY 1.23. Let A be a positive semidefinite matrix. Then there exists a positive semidefinite matrix W such that
$$A = W^2$$

PROOF. This follows immediately from 1.18 on page 143, which implies that
$$A = UDU^{-1}$$
where D is a diagonal matrix with nonnegative entries on the diagonal. We define
$$W = UD^{1/2}U^{-1}$$
where the entries of $D^{1/2}$ are just the (nonnegative) square roots of the corresponding entries of D. Now
$$W^2 = UD^{1/2}U^{-1}UD^{1/2}U^{-1} = UD^{1/2}D^{1/2}U^{-1} = UDU^{-1} = A$$
□

It turns out that the 1-norm and the ∞-norm of matrices are very easy to compute:

PROPOSITION 1.24. The matrix norm associated with the vector norm $\|\ast\|_\infty$ is given by $\|M\|_\infty = \max_i \sum_{j=1}^{n} |M_{ij}|$.

PROOF. We must maximize $\max_i \left|\sum_j M_{ij} \cdot v_j\right|$, subject to the requirement that all of the $v_j$ are between −1 and 1. It is clear that we can maximize one of these quantities, say the $i$th, by setting:
$$v_j = \begin{cases} +1, & \text{if } M_{ij} \text{ is positive;}\\ -1, & \text{if } M_{ij} \text{ is negative.} \end{cases}$$
and this will result in a total value of $\sum_j |M_{ij}|$ for the $i$th row. The norm of the matrix is just the maximum of these row-sums. □

PROPOSITION 1.25. The 1-norm, $\|M\|_1$, of a matrix M is given by
$$\max_j \sum_{i=1}^{n} |M_{ij}|$$
FIGURE 5.1. A 2-dimensional diamond (the unit diamond in the x-y plane)

FIGURE 5.2. The image M(unit diamond), and the smallest diamond enclosing it

PROOF. This is somewhat more difficult to show than the case of the ∞-norm. We must compute the maximum of $\|Mv\|_1 = \sum_i \left|\sum_j M_{ij} \cdot v_j\right|$ subject to the requirement that $\sum_j |v_j| = 1$. The crucial step involves showing:
Claim: The maximum occurs when all but one of the $v_j$ are zero. The remaining nonzero $v_j$ must be +1 or −1.
We will give a heuristic proof of this claim. With a little work the argument can be made completely rigorous. Essentially, we must consider the geometric significance of the 1-norm of vectors, and the shape of "spheres" with respect to this norm. The set of all vectors v with the property that $\|v\|_1 = 1$ forms a polyhedron centered at the origin. We will call this a diamond. Figure 5.1 shows a 2-dimensional diamond.
The 1-norm of a vector can be regarded as the radius (in the sense of 1-norms) of the smallest diamond, centered at the origin, that encloses the vector. With this in mind, we can define the 1-norm of a matrix M as the radius of the smallest diamond that encloses M(unit diamond), as in figure 5.2.

Note that the radius of a diamond centered at the origin is easy to measure —
it is just the x-intercept or the y-intercept. The heuristic part of the argument is to note
that the smallest enclosing diamond always intersects M(unit diamond) at a vertex.
This implies the claim, however, because vertices of the unit diamond are precisely
the points with one coordinate equal to +1 or −1 and all other coordinates 0.
Given this claim, the proposition follows quickly. If the jth component of v
is 1 and all other components are 0, then k Mvk1 = ∑i | Mij |. The 1-norm of M is
computed with the value of j that maximizes this. r
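The two formulas just proved are one line each in code. A Python/NumPy sketch (the matrix is only an illustration), with numpy.linalg.norm used as an independent check:

    import numpy as np

    M = np.array([[ 1.0, 2.0],
                  [-3.0, 4.0]])

    inf_norm = np.abs(M).sum(axis=1).max()   # max_i sum_j |M_ij|   (proposition 1.24)
    one_norm = np.abs(M).sum(axis=0).max()   # max_j sum_i |M_ij|   (proposition 1.25)

    print(inf_norm, np.linalg.norm(M, np.inf))   # 7.0  7.0
    print(one_norm, np.linalg.norm(M, 1))        # 6.0  6.0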

Unfortunately, there is no simple formula for the 2-norm of a matrix in terms


of the entries of the matrix. The 2-norm of a Hermitian positive semidefinite matrix
(defined in 1.21 on page 144) turns out to be equal to the largest eigenvalue of the
matrix2. We do have the following result that relates these quantities, and the
spectral radius:

L EMMA 1.26. Let W = (w_ij). Then
      kW^H Wk2 = ρ(W^H W) = kWk2^2 ≤ kW^H Wk1 ≤ (max_i ∑_j |w_ij|)(max_j ∑_i |w_ij|) ≤ n·kW^H Wk2.

P ROOF. It is not hard to see that kW^H Wk1 ≤ (max_i ∑_j |w_ij|)(max_j ∑_i |w_ij|), since kW^H Wk1 ≤ kW^H k1 · kWk1 and kW^H k1 = kWk∞, by the explicit formulas given
for these norms. The remaining statements follow from 1.54 on page 170. r

E XERCISES .

1.1. Let A be a square matrix. Show that if λ1 and λ2 are two eigenvalues of A
with associated eigenvectors v1 and v2 and λ1 = λ2 , then any linear combination
of v1 and v2 is a valid eigenvector of λ1 and λ2 .
1.2. Let A be a square matrix and let λ1 , . . . , λk be eigenvalues with associated
eigenvectors v1 , . . . , vk . In addition, suppose that λi 6= λ j for all i, j such that i 6= j,
and that all of the vi are nonzero. Show that the set of vectors {v1 , . . . , vk } are
linearly independent — i.e. the only way a linear combination
      ∑_{i=1}^{k} α_i v_i

(where the αi are scalars) can equal zero is for all of the coefficients, αi , to be zero.
The two exercises above show that the numerical values of eigenvalues deter-
mine many of the properties of the associated eigenvectors: if the eigenvalues are
equal, the eigenvectors are strongly related to each other, and if the eigenvalues
are distinct, the eigenvectors are independent.
1.3. Show that the two exercises above imply that an n × n matrix cannot have
more than n distinct eigenvalues.

2This follows from 1.20 on page 143 and 1.23 on page 144.

1.4. Compute the 1-norm and the ∞-norm of the matrix


 
1 2 3
0 −1 0
2 −4 8
1.5. Compute the 1-norm and the ∞-norm of the matrix
        0     1/4    1/4    1/4   1/4
        1/4    0    −1/4   −1/4   1/4
      −1/4   −1/4    0      1/4   1/4
      −1/4   −1/4   −1/4     0    1/4
      −1/4   −1/4   −1/4    1/4    0
1.6. Compute the eigenvalues and eigenvectors of the matrix
 
1 2 3
A = 2 −1 1
3 0 0
Use this computation to compute its spectral radius.
1.7. Let A be the matrix
      1   2
      3  −1
Compute the 2-norm of A directly.
1.8. Suppose that A is an n × n matrix with eigenvectors {vi } and correspond-
ing eigenvalues {λ_i}. If I is an n × n identity matrix and α and β are constants, show
that the eigenvectors of αA + βI are also {vi } and the corresponding eigenvalues
are αλi + β.
1.9. Suppose that W is some square matrix. Show that WW^H is positive semidefinite. Show that, if W is nonsingular, WW^H is positive definite.
1.10. Suppose that W is some nonsingular matrix and A is a positive definite
matrix. Show that WAW H is positive definite.
1.11. Lemma 1.14 on page 141 implies that, for any square matrix A, ρ( A) ≤
k Ak, where k Ak is any norm. In addition, we know that, if A is Hermitian, ρ( A) =
k Ak2 . Give an example of a matrix A such that ρ( A) < k Ak2 (i.e., they are not
equal).

1.2.2. Appendix: Diagonalization of Hermitian Matrices. In this appendix we will


prove 1.18 on page 143. This will be somewhat like the proof of 1.17 on page 142,
except that the full upper-triangular matrix, T, will turn out to be Hermitian —
which will imply that it is a diagonal matrix.
Throughout this section A will denote an n × n Hermitian matrix with eigen-
values {λ1 , . . . , λk } and corresponding eigenspaces {V1 , . . . , Vk }.
We begin with:

D EFINITION 1.27. A set of vectors {v1 , . . . , vk } will be called orthogonal if, for
every i, j such that i 6= j we have
( vi , v j ) = 0
The set of vectors will be called orthonormal if, in addition, we have
( vi , vi ) = 1
for all 1 ≤ i ≤ k.
We need to develop a basic property of orthonormal bases of vector spaces:
L EMMA 1.28. Let {v1 , . . . , vk } be an orthonormal basis of a vector-space V and let u
be an arbitrary vector in V. Then
      u = ∑_{i=1}^{k} (v_i, u) v_i

P ROOF. Since {v1 , . . . , vk } is a basis for V it is possible to find an expression


      u = ∑_{i=1}^{k} a_i v_i
where the { ai } are suitable constants. Now we form the inner product of this
expression with an arbitrary basis vector v j :
      (v_j, u) = (v_j, ∑_{i=1}^{k} a_i v_i)
               = ∑_{i=1}^{k} a_i (v_j, v_i)
               = a_j
r
C OROLLARY 1.29. Let u = {u1 , . . . , uk } and v = {v1 , . . . , vk } be two orthonormal
bases of the same vector space. Then the matrix U that transforms u into v is given by
Ui,j = (v j , ui )
and is a unitary matrix.
P ROOF. The ith column of U is the set of coefficients that expresses ui in terms
of the {v1 , . . . , vk }, so 1.28 above implies this statement. The remaining statement
follows from the fact that
      (U^{-1})_{i,j} = (u_j, v_i) = \overline{(v_i, u_j)} = \overline{U_{j,i}}
r
Our proof of 1.18 on page 143 will be like that of 1.17 on page 142 except that
we will use an orthonormal basis of eigenvectors for Rn . The first result we need to
show this is:

L EMMA 1.30. Let λi 6= λ j be eigenvalues of A with corresponding eigenvectors vi ,


v j . Then
( vi , v j ) = 0
P ROOF. We compute
( Avi , v j ) = λi (vi , v j )
= ( v i , AH v j )
= (vi , Av j ) (because A is Hermitian)
= λ j ( vi , v j )
So λi (vi , v j ) = λ j (vi , v j ), which implies the conclusion since λi 6= λ j . r

C OROLLARY 1.31. The eigenspaces {V1 , . . . , Vk } of A are orthogonal to each other.


In fact, we will be able to get a basis for Rn of orthonormal eigenvectors.
This turns out to be the crucial fact that leads to the conclusion. If each of the
eigenspaces {V1 , . . . , Vk } is one-dimensional, we have already proved that: sim-
ply normalize each eigenvector by dividing it by its 2-norm. A slight problem
arises if an eigenspace has a dimension higher than 1. We must show that we can
find an orthonormal basis within each of these eigenspaces.
L EMMA 1.32. Let V be a vector space with an inner product (∗, ∗) and a basis
{v1 , . . . , vk }. Then there exists an orthonormal basis {v̄1 , . . . , v̄k }. This is computed
by an inductive procedure — we compute the following two equations with i = 1..k:
      (4)   u_i = v_i − ∑_{j=1}^{i−1} (v_i, v̄_j) v̄_j
      (5)   v̄_i = u_i / √((u_i, u_i))
Here u1 = v1 .
This is called the Gram-Schmidt orthogonalization algorithm.
P ROOF. Equation (5) implies that each of the {v̄i } has the property that
(v̄i , v̄i ) = 1. We will show that the vectors are orthogonal by induction. Suppose
that (v̄_j, v̄_j′) = 0 for all j, j′ < i. We will show that (u_i, v̄_i′) = 0 for all i′ < i:
      (u_i, v̄_i′) = (v_i − ∑_{j=1}^{i−1} (v_i, v̄_j) v̄_j, v̄_i′)
                  = (v_i, v̄_i′) − ∑_{j=1}^{i−1} (v_i, v̄_j)(v̄_j, v̄_i′)
                  = (v_i, v̄_i′) − (v_i, v̄_i′)
                  = 0
r
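Equations (4) and (5) translate directly into code. The following Python/NumPy sketch applies them to a small list of real vectors with the standard inner product; it is only an illustration of the formulas (in floating-point arithmetic one would normally use the modified variant for stability).

    import numpy as np

    def gram_schmidt(vectors):
        # Apply equations (4) and (5) to a list of linearly independent vectors.
        basis = []
        for v in vectors:
            u = v.astype(float).copy()
            for w in basis:                           # u_i = v_i - sum_j (v_i, v̄_j) v̄_j
                u -= np.vdot(w, v) * w
            basis.append(u / np.sqrt(np.vdot(u, u)))  # normalize by sqrt((u_i, u_i))
        return basis

    vs = [np.array([1.0, 1.0, 0.0]),
          np.array([1.0, 0.0, 1.0]),
          np.array([0.0, 1.0, 1.0])]
    B = np.vstack(gram_schmidt(vs))
    print(np.allclose(B @ B.T, np.eye(3)))            # True: the new basis is orthonormal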
Suppose we apply this algorithm to each eigenspace {Vi } whose dimension
is > 1. The new basis-vectors will be orthogonal to each other, and the new basis
vectors in each of the {Vi } will remain orthogonal to the new basis vectors of all

{Vi0 } with i0 6= i. We will, consequently, get an orthonormal basis for Rn . Suppose


that this new basis is
{ u1 , . . . , u n }
We claim that the matrix, U, that transforms this basis to the standard basis of Rn
is a unitary matrix (see the definition of a unitary matrix in 1.13 on page 141). This
follows from 1.29 on page 148 and the fact that U transforms the standard basis of
Rn (which is orthonormal) into {u1 , . . . , un }.
Now we return to the fact that we are proving a variation of 1.17 on page 142
and we have an expression:
A = UTU −1
where T is full upper-triangular and U is unitary. We claim that this implies that
T is Hermitian since
      A^H = (U T U^{-1})^H
          = (U^{-1})^H T^H U^H
          = U T^H U^{-1}          (since U is unitary)
This implies that T = T H . Since all of the elements of T below the main
diagonal are 0, this implies that all of the elements above the main diagonal are
also 0, and T is a diagonal matrix.

1.2.3. The Jacobi Method. The most basic problem that we will want to solve is
a system of linear equations:
(6) Ax = b
where A is a given n × n matrix, x = ( x0 , . . . , xn−1 ) is the set of unknowns and
b = (b0 , . . . , bn−1 ) is a given vector of dimension n. Our method for solving such
problems will make use of the matrix D ( A) composed of the diagonal elements of
A (which we now assume are nonvanishing):
      (7)   D(A) = diag(A_{0,0}, A_{1,1}, . . . , A_{n−1,n−1})
As remarked above, the traditional methods (i.e., Gaussian elimination) for solv-
ing this do not lend themselves to easy parallelization. We will, consequently, ex-
plore iterative methods for solving this problem. The iterative methods we discuss
requires that the D-matrix is invertible. This is equivalent to the requirement that
all of the diagonal elements are nonzero. Assuming that this condition is satisfied,
we can rewrite equation (6) in the form

D ( A)−1 Ax = D ( A)−1 b
x + ( D ( A ) −1 A − I ) x = D ( A ) −1 b
(8) x = ( I − D ( A ) −1 A ) x + D ( A ) −1 b
where I is an n × n identity matrix. We will be interested in the properties of the
matrix Z ( A) = I − D ( A)−1 A. The basic iterative method for solving equation (6)
is to
1. Guess at a solution u(0) = (r0 , . . . , rn−1 );
2. Forming a sequence of vectors {u(0) , u(1) , . . . }, where u(i+1) = Z ( A)u(i) +
D ( A)−1 b.
Now suppose that the sequence {u(0) , u(1) , . . . } converges to some vector ū. The
fact that ū is the limit of this sequence implies that ū = Z ( A)ū + D ( A)−1 b — or
that ū is a solution of the original system of equations (6). This general method
of solving systems of linear equations is called the Jacobi method or the Relaxation
Method. The term “relaxation method” came about as a result of an application
of linear algebra to numerical solutions of partial differential equations – see the
discussion on page 238.
We must, consequently, be able to say whether, and when, the sequence {u^(0), u^(1), . . . } converges. We will use the material in the preceding section on norms of vectors and matrices for this purpose.


P ROPOSITION 1.33. Suppose A is an n × n matrix with the property that all of its
diagonal elements are nonvanishing. The Jacobi algorithm for solving the linear system
Ax = b
converges to the same value regardless of starting point (u(0) ) if and only if ρ( Z ( A)) =
µ < 1, where ρ( Z ( A)) is the spectral radius defined in 1.13 on page 140.

Note that this result also gives us some idea of how fast the algorithm con-
verges.
P ROOF. Suppose ū is an exact solution to the original linear system. Then
equation (8) on page 151 implies that:
(9) ū = ( I − D ( A)−1 A)ū + D ( A)−1 b
Since ρ( Z ( A)) = µ < 1 it follows that k Z ( A)k k → 0 as k → ∞ for any matrix-
norm k ∗ k — see 1.14 on page 141. We will compute the amount of error that exists
at any given stage of the iteration. The equation of the iteration is
u ( i +1) = ( I − D ( A ) −1 A ) u ( i ) + D ( A ) −1 b
and if we subtract this from equation (9) above we get
      ū − u^(i+1) = ((I − D(A)^{-1}A)ū + D(A)^{-1}b) − ((I − D(A)^{-1}A)u^(i) + D(A)^{-1}b)
                  = (I − D(A)^{-1}A)(ū − u^(i))
      (10)        = Z(A)(ū − u^(i))
The upshot of this is that each iteration of the algorithm has the effect of multiply-
ing the error by Z ( A). A simple inductive argument shows that at the end of the
ith iteration
(11) ū − u(i+1) = Z ( A)i (ū − u(0) )
The conclusion follows from 1.14 on page 141, which implies that Z ( A)i → 0 as
i → ∞ if and only if ρ( Z ( A)) < 1. Clearly, if Z ( A)i → 0 the error will be killed off
as i → ∞ regardless of how large it was initially. r
The following corollary give us an estimate of the rate of convergence.
C OROLLARY 1.34. The conditions in the proposition above are satisfied if k Z ( A)k =
τ < 1 for any matrix-norm k ∗ k. If this condition is satisfied then
kū − u(i) k ≤ τ i−1 kū − u(0) k
where ū is an exact solution of the original linear system.
P ROOF. This is a direct application of equation (11) above:
      ū − u^(i+1) = Z(A)^i (ū − u^(0))
      kū − u^(i+1)k = kZ(A)^i (ū − u^(0))k
                    ≤ kZ(A)^ik · kū − u^(0)k
                    ≤ kZ(A)k^i · kū − u^(0)k
                    = τ^i kū − u^(0)k
r
We conclude this section with an example. Let
 
4 1 1
A = 1 4 −1
1 1 4

and  
1
b = 2
3
Then  
4 0 0
D ( A ) = 0 4 0
0 0 4
and  
0 −1/4 −1/4
Z ( A) = −1/4 0 1/4 
−1/4 −1/4 0
so our iteration-step is:
      u^(i+1) = Z(A) u^(i) + D(A)^{-1} b,   where D(A)^{-1} b = (1/4, 1/2, 3/4).
It is not hard to see that k Z ( A)k1 = 1/2 so the Jacobi method converges for any
initial starting point. Set u^(0) = 0. We get:
      u^(1) = (1/4, 1/2, 3/4)
      u^(2) = (−1/16, 5/8, 9/16)
      u^(3) = (−3/64, 21/32, 39/64)
Further computations show that the iteration converges to the solution
      x = (−1/15, 2/3, 3/5)
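The iterates above are easy to reproduce. Here is a Python/NumPy sketch of the Jacobi iteration u^(i+1) = Z(A)u^(i) + D(A)^{-1}b for this example; it is serial arithmetic only, whereas the parallel versions discussed in this book would distribute the matrix-vector product across processors.

    import numpy as np

    A = np.array([[4.0, 1.0,  1.0],
                  [1.0, 4.0, -1.0],
                  [1.0, 1.0,  4.0]])
    b = np.array([1.0, 2.0, 3.0])

    D_inv = np.diag(1.0 / np.diag(A))
    Z = np.eye(3) - D_inv @ A          # Z(A) = I - D(A)^{-1} A
    c = D_inv @ b                      # D(A)^{-1} b = (1/4, 1/2, 3/4)

    u = np.zeros(3)                    # u^(0) = 0
    for i in range(50):
        u = Z @ u + c                  # one Jacobi step
        if i < 3:
            print(i + 1, u)            # prints u^(1), u^(2), u^(3) as listed above

    print(u)                           # approximately (-1/15, 2/3, 3/5)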

E XERCISES .

1.12. Apply the Jacobi method to the system


Ax = b
where  
4 0 1
A = 1 4 −1
1 1 4
and  
1
b = 2
3

1.13. Compute D ( A) and Z ( A) when


 
4 1 1
A = 2 5 1
0 −3 10
Can the Jacobi method be used to solve linear systems of the form
Ax = b?

1.2.4. The JOR method. Now we will consider a variation on the Jacobi
method. JOR, in this context stands for Jacobi Over-Relaxation. Essentially the
JOR method attempts to speed up the convergence of the Jacobi method by
modifying it slightly. Given the linear system:
Ax = b
where A is a given n × n matrix whose diagonal elements are nonvanishing and
b is a given n-dimensional vector. The Jacobi method solves it by computing the
sequence
u ( i +1) = Z ( A ) u ( i ) + D ( A ) −1 b
where D ( A) is the matrix of diagonal elements of A and Z ( A) = I − D ( A)−1 A.
In other words, we solve the system of equations by “moving from u^(i) toward u^(i+1)”. The basic idea of the JOR method is that:
      If motion in this direction leads to a solution, maybe moving further in this direction at each step leads to the solution faster.
We, consequently, replace u^(i) not by u^(i+1) as defined above, but by (1 − ω)u^(i) + ωu^(i+1). When ω = 1 we get the Jacobi method exactly. The number ω is called the

relaxation coefficient. Actual overrelaxation occurs when ω > 1. Our basic iteration-
step is:
      (12)   u^(i+1) = ωZ(A)u^(i) + ωD(A)^{-1}b + (1 − ω)u^(i)
                     = (ωZ(A) + (1 − ω)I)u^(i) + ωD(A)^{-1}b
                     = (ω(I − D(A)^{-1}A) + (1 − ω)I)u^(i) + ωD(A)^{-1}b
                     = (I − ωD(A)^{-1}A)u^(i) + ωD(A)^{-1}b
Note that, if this iteration converges, then the limit is still a solution of the
original system of linear equations. In other words a solution of
x = ( I − ωD ( A)−1 A) x + ωD ( A)−1 b
is a solution of x = x − ωD ( A)−1 Ax + ωD ( A)−1 b and 0 = −ωD ( A)−1 Ax +
ωD ( A)−1 b. Dividing by ω and multiplying by D ( A) gives Ax = b.
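In code the JOR step differs from the Jacobi step only by the factor ω. A Python/NumPy sketch follows; the matrix is the Jacobi example from above, and the value ω = 0.9 is only an illustration (by 1.36 below, any 0 < ω ≤ 1 works here, since the Jacobi method converges for this matrix).

    import numpy as np

    def jor_solve(A, b, omega, iterations=100):
        # Iterate u <- (I - omega D(A)^{-1} A) u + omega D(A)^{-1} b   (equation (12))
        n = len(b)
        D_inv = np.diag(1.0 / np.diag(A))
        M = np.eye(n) - omega * (D_inv @ A)
        c = omega * (D_inv @ b)
        u = np.zeros(n)
        for _ in range(iterations):
            u = M @ u + c
        return u

    A = np.array([[4.0, 1.0,  1.0],
                  [1.0, 4.0, -1.0],
                  [1.0, 1.0,  4.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(jor_solve(A, b, omega=0.9))   # converges to the solution (-1/15, 2/3, 3/5)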
The rate of convergence of this method can be determined in the same way as in the Jacobi method:
C OROLLARY 1.35. The conditions in the proposition above are satisfied if k I −
ωD ( A)−1 Ak = τ < 1 for any matrix-norm k ∗ k. In this case
kū − u(i) k ≤ τ i−1 kū − u(0) k
where ū is an exact solution of the original linear system.
P ROOF. The proof of this is almost identical to that of 1.34 on page 152. r
We must, consequently, adjust ω in order to make k I − ωD ( A)−1 Ak < 1 and
to minimize ωτ i−1 k D ( A)−1 Au(0) + D ( A)−1 bk.
P ROPOSITION 1.36. If the Jacobi method converges, then the JOR method converges
for 0 < ω ≤ 1.
P ROOF. This is a result of the fact that the eigenvalues of ( I − ωD ( A)−1 A) are
closely related to those of Z ( A) because ( I − ωD ( A)−1 A) = ωZ ( A) + (1 − ω ) I
If θ j is an eigenvalue of ( I − ωD ( A)−1 A) then θ j = ωλ j + 1 − ω, where λ j is an
eigenvalue of Z ( A). If λ j = reit , then
|θ j |2 = ω 2 r2 + 2ωr cos t(1 − ω ) + (1 − ω )2
≤ (ωr + 1 − ω )2 < 1
r
The main significance of the JOR method is that, with a suitable choice of ω,
the JOR method will converge under circumstances where the Jacobi method does
not (the preceding result implies that it converges at least as often).
T HEOREM 1.37. Let A be a positive definite matrix with nonvanishing diagonal el-
ements such that the associated matrix Z ( A) has spectral radius > 1. Then there exists
a number α such that 0 < α < 1 such that the JOR method converges for all values of ω
such that 0 < ω < α.
Let λmin be the smallest eigenvalue of Z ( A) (I.e., the one with the largest magnitude).
Then
      α = 2/(1 − λ_min) = 2/(1 + ρ(Z(A)))

P ROOF. This proof will be in several steps. We first note that all of the diagonal
elements of A must be positive.
1. If 0 < ω < α = 2/(1 − λmin ) then the matrix V = 2ω −1 D ( A) − A is
positive definite.
Proof of claim: Since the diagonal elements of A are positive, there exists a matrix D^{1/2} such that (D^{1/2})^2 = D(A). V is positive definite if and only if D^{-1/2}VD^{-1/2} is positive definite. We have D^{-1/2}VD^{-1/2} = 2ω^{-1}I − D^{-1/2}AD^{-1/2}. Now we express this in terms of Z(A):
      D^{1/2} Z(A) D^{-1/2} = D^{1/2} I D^{-1/2} − D^{1/2} D(A)^{-1} A D^{-1/2}
                            = I − D^{-1/2} A D^{-1/2}
so
D −1/2 VD −1/2 = (2ω −1 − 1) I + D1/2 Z ( A) D −1/2
and this matrix will also be positive definite if and only if V = 2ω −1 D ( A) − A is
positive definite. But the eigenvalues of this matrix are
2ω −1 − 1 + µi
(see exercise 1.8 on page 147 and its solution on page 435) and these are the same as
the eigenvalues of Z ( A) — see 1.16 on page 142. The matrix V is positive definite
if and only if these eigenvalues are all positive (see 1.22 on page 144, and 1.18 on
page 143). This is true if and only if 2ω −1 − 1 > |µi | for all i, or ω < 2/(1 − λmin ).
2. If V = 2ω −1 D ( A) − A is positive definite, then ρ( I − ωD ( A)−1 A) < 1, so
that the JOR method converges (by 1.35 on page 155).
Since A is positive definite, it has a square root (by 1.23 on page 144) — we
call this square root A1/2 . We will prove that ρ( A1/2 ( I − ωD ( A)−1 A) A−1/2 ) < 1,
which will imply the claim since spectral radii of similar matrices are equal (1.16
on page 142).
Now R = A1/2 ( I − ωD ( A)−1 A) A−1/2 = I − ωA1/2 D ( A)−1 A1/2 . Now we
form RRH :
RRH = ( I − ωA1/2 D ( A)−1 A1/2 )( I − ωA1/2 D ( A)−1 A1/2 )H
= I − 2ωA1/2 D ( A)−1 A1/2 + ω 2 A1/2 D ( A)−1 AD ( A)−1 A1/2
= I − ω 2 A1/2 D ( A)−1 (V ) D ( A)−1 A1/2
= I−M
where V = 2ω −1 D ( A) − A. This matrix is positive definite (product of
an invertible matrix by its Hermitian transpose — see exercise 1.9 on
page 147 and its solution on page 435). It follows that the eigenvalues of
M = ω 2 A1/2 D ( A)−1 (V ) D ( A)−1 A1/2 must be < 1 (since the result of
subtracting them from 1 is still positive). But M is also positive definite,
since it is of the form FVF^H and V is positive definite (see exercise 1.10 on page 147 and its solution on page 436). This means that the eigenvalues of
M = ω 2 A1/2 D ( A)−1 (V ) D ( A)−1 A1/2 lie between 0 and 1. We conclude that
the eigenvalues of RRH = I − M also lie between 0 and 1. This means that the
eigenvalues of R = A1/2 ( I − ωD ( A)−1 A) A−1/2 lie between 0 and 1 and the
conclusion follows. r

1.2.5. The SOR and Consistently Ordered Methods. We can combine the itera-
tive methods described above with the Gauss-Seidel method. The Gauss-Seidel
method performs iteration as described above with one important difference:
In the computation of u(i+1) from u(i) in equation (12), computed
values for u(i+1) are substituted for values in u(i) as soon as they are
available during the course of the computation.
In other words, assume we are computing u^(i+1) sequentially by computing u_1^(i+1), u_2^(i+1), and so forth. The regular Jacobi method or the JOR method involves performing these computations in a straightforward way. The Gauss-Seidel method involves computing u_1^(i+1), and immediately setting u_1^(i) ← u_1^(i+1) before doing any other computations. When we reach the point of computing u_2^(i+1), it will already contain the computed value of u_1^(i+1). This technique is easily implemented on a sequential computer, but it is not clear how to implement it in parallel.
The combination of the Gauss-Seidel method and overrelaxation is called the
SOR method. The term SOR means successive overrelaxation. Experience and theo-
retical results show that it almost always converges faster than the JOR method.
The result showing that the JOR method converges for A a positive definite matrix
(theorem 1.37 on page 155) also applies for the SOR method.
In order to write down an equation for this iteration-scheme, we have to consider what it means to use u_j^(i+1) for 0 ≤ j < k when we are computing the kth entry, u_k^(i+1), of u^(i+1). We are essentially multiplying u^(i) by a matrix (Z(A)) in order to get u^(i+1). When computing the kth entry of u^(i+1), the entries of Z(A) that multiply already-updated components are the entries whose column-number is strictly less than their row-number. In other words, they are the lower triangular entries of the matrix. It amounts to using the following iteration-scheme:

      (13)   u^(i+1) = ω(L(A)u^(i+1) + U(A)u^(i) + D(A)^{-1}b) + (1 − ω)u^(i)
Here L(A) is the lower-triangular portion of Z(A) and U(A) is the upper-triangular portion. Solving (13) for u^(i+1) gives
      (14)   u^(i+1) = L_ω u^(i) + (I − ωL(A))^{-1} ω D(A)^{-1} b
where L_ω = (I − ωL)^{-1}(ωU + (1 − ω)I). As before, we have the following crite-
rion for the convergence of the SOR method:
T HEOREM 1.38. The SOR iteration-scheme for solving the linear system
Au = b
(where A has nonvanishing diagonal elements) converges if kLω k = τ < 1 for any
matrix-norm k ∗ k. In this case

kū − u(i) k ≤ τ i−1 kū − u(0) k


where ū is an exact solution of the original linear system.
P ROOF. As before, the proof of this is almost identical to that of 1.34 on page
152. r
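A sequential sweep of the SOR iteration (equation (13)) is easy to write componentwise: each entry is updated in place, so later entries of the same sweep already see the new values. A Python/NumPy sketch follows; the matrix is the earlier Jacobi example, ω = 1 is the Gauss-Seidel case, and ω > 1 would give actual over-relaxation.

    import numpy as np

    def sor_sweep(A, b, u, omega):
        # One in-place sweep: u_k <- (1-omega) u_k + (omega/A_kk)(b_k - sum_{j != k} A_kj u_j),
        # where the sum uses values already updated earlier in this sweep.
        n = len(b)
        for k in range(n):
            s = A[k, :] @ u - A[k, k] * u[k]
            u[k] = (1.0 - omega) * u[k] + omega * (b[k] - s) / A[k, k]
        return u

    A = np.array([[4.0, 1.0,  1.0],
                  [1.0, 4.0, -1.0],
                  [1.0, 1.0,  4.0]])
    b = np.array([1.0, 2.0, 3.0])
    u = np.zeros(3)
    for _ in range(25):
        sor_sweep(A, b, u, omega=1.0)
    print(u)                              # approximately (-1/15, 2/3, 3/5)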

T HEOREM 1.39. (See [80]) The spectral radius of Lω satisfies


ρ (Lω ) ≥ | ω − 1|
In addition, if the SOR method converges, then
0<ω<2
P ROOF. This follows from the fact that the determinant of a matrix is equal to
the product of the eigenvalues — see 1.14 on page 141.
      det L_ω = det((I − ωL)^{-1}(ωU + (1 − ω)I))
              = (det(I − ωL))^{-1} det(ωU + (1 − ω)I)
              = det(ωU + (1 − ω)I)
              = (1 − ω)^n
since L and U have zero diagonals, so that I − ωL and ωU + (1 − ω)I are triangular matrices with diagonal entries 1 and 1 − ω, respectively, and the determinant of a triangular matrix is the product of its diagonal entries. It follows that:
      ρ(L_ω) ≥ (|ω − 1|^n)^{1/n} = |1 − ω|
r

Experiment and theory show that this method tends to converge twice as
rapidly as the basic Jacobi method — 1.42 on page 160 computes the spectral ra-
dius ρ(Lω ) if the matrix A satisfies a condition to be described below.
Unfortunately, the SOR method as presented above doesn’t lend itself to par-
allelization. Fortunately, it is possible to modify the SOR method in a way that
does lend itself to parallelization.
D EFINITION 1.40. Let A be an n × n matrix. Then:
1. two integers 0 ≤ i, j ≤ n are associated if Ai,j 6= 0 or A j,i 6= 0;
2. Let Σ = {1, . . . , n}, the set of numbers from 1 to n and let S1 , S2 , . . . , Sk be
disjoint subsets of Σ such that
      S_1 ∪ S_2 ∪ · · · ∪ S_k = Σ

Then the partition S1 , S2 , . . . , Sk of Σ is


a. an ordering of A if
i. 1 ∈ S1 and for any i, j contained in the same set St , i and j are not
associated — i.e., Ai,j = A j,i = 0.
ii. If j is the lowest number in Σ \ (S_1 ∪ · · · ∪ S_i),

then j ∈ Si+1 for all 1 ≤ i < k.


b. a consistent ordering of A if for any pair of associated integers 0 ≤ i, j ≤
n, such that i ∈ St ,
• j ∈ St+1 if j > i;
• j ∈ St−1 if j < i;

3. A vector γ = (γ_1, . . . , γ_n)
will be called a consistent ordering vector of a matrix A if
a. γi − γ j = 1 if i and j are associated and i > j;
b. γi − γ j = −1 if i and j are associated and i < j;
Note that every matrix has an ordering: we can just set Si = {i }. It is not true
that every matrix has a consistent ordering.
Consistent orderings, and consistent ordering vectors are closely related: the
set Si in a consistent ordering is the set of j such that γ j = i, for a consistent
ordering vector.
An ordering for a matrix is important because it allows us to parallelize the
SOR method:
P ROPOSITION 1.41. Suppose A is an n × n matrix equipped with an ordering
{S1 , . . . , St }, and consider the following iteration procedure:

for j = 1 to t do
    for all k such that k ∈ S_j do
        Compute the entry u_k^(i+1)
        Set u_k^(i) ← u_k^(i+1)
    endfor
endfor
This procedure is equivalent to the SOR method applied to a version of the linear system
Au = b
in which the coordinates have been re-numbered in some way. If the ordering of A was
consistent, then the iteration procedure above is exactly equivalent to the SOR algorithm3.
In other words, instead of using computed values of u(i+1) as soon as they are
available, we may compute all components for u(i+1) whose subscripts are in S1 ,
then use these values in computing other components whose subscripts are in S2 ,
etc. Each individual “phase” of the computation can be done in parallel. In many
applications, it is possible to use only two phases (i.e., the ordering only has sets
S1 and S2 ).
Re-numbering the coordinates of the linear system
Au = b
does not change the solution. It is as if we solve the system
BAu = Bb
where B is some permutation-matrix4. The solution is
u = A−1 B−1 Bb = A−1 b
3I.e., without re-numbering the coordinates.
4A matrix with exactly one 1 in each row and column, and all other entries equal to 0.

— the same solution as before. Since the computations are being carried out in a
different order than in equation (13) on page 157, the rate of convergence might be
different.
P ROOF. We will only give an intuitive argument. The definition of an ordering in 1.40 on page 158 implies that distinct elements r, s in the same set S_i are independent. This means that, in the formula (equation (13) on page 157) for u^(i+1) in terms of u^(i),
1. the equation for u_r^(i+1) does not contain u_s^(i+1) on the right, and
2. the equation for u_s^(i+1) does not contain u_r^(i+1) on the right.
It follows that we can compute u_r^(i+1) and u_s^(i+1) simultaneously. The re-numbering
of the coordinates comes about because we might do some computations in a dif-
ferent order than equation (13) would have done them. For instance, suppose we
have an ordering with S1 = {1, 2, 4}, and S2 = {3, 5}, and suppose that compo-
nent 4 is dependent upon component 3 — this does not violate our definition of
an ordering. The original SOR algorithm would have computed component 3 be-
fore component 4 and may have gotten a different result than the algorithm based
upon our ordering.
It is possible to show (see [174]) that if the ordering is consistent, the permu-
tation B that occurs in the re-numbering of the coordinates has the property that
it doesn’t map any element of the lower triangular half of A into the upper triangu-
lar half, and vice-versa. Consequently, the phenomena described in the example
above will not occur. r
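For an ordering with just two sets S1 and S2 (the common "red-black" situation of exercise 1.15 below), each phase updates all of its components at once from values produced by the other phase, so each phase is a single data-parallel step. The following Python/NumPy sketch carries out one iteration organized in this way; the matrix, the right-hand side, and the partition are taken from the consistently-ordered example discussed below and are only illustrations.

    import numpy as np

    def ordered_sor_iteration(A, b, u, omega, phases):
        # One iteration in the style of proposition 1.41: within a phase the components
        # are mutually non-associated, so they may be updated simultaneously.
        for S in phases:
            S = np.asarray(S)
            r = b[S] - A[S, :] @ u                   # residuals of the components in S
            u[S] = u[S] + omega * r / np.diag(A)[S]  # the usual SOR update, done for all of S at once
        return u

    A = np.array([[ 4.0,  0.0,  0.0, -1.0],
                  [-1.0,  4.0, -1.0,  0.0],
                  [ 0.0, -1.0,  4.0,  0.0],
                  [-1.0,  0.0,  0.0,  4.0]])
    b = np.array([1.0, 1.0, 1.0, 1.0])               # an arbitrary right-hand side
    u = np.zeros(4)
    phases = [[0, 2], [1, 3]]                        # rows {1,3} and {2,4} in the text's numbering
    for _ in range(30):
        ordered_sor_iteration(A, b, u, omega=1.0, phases=phases)
    print(np.allclose(A @ u, b))                     # True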
The importance of a consistent ordering of a matrix5 is that it is possible to give an explicit formula for the optimal relaxation-coefficient for such a matrix:
T HEOREM 1.42. If the matrix A is consistently-ordered, in the sense defined in 1.40
on page 158, then the SOR or the consistently-ordered iteration procedures for solving the
system
Ax = b
both converge if ρ( Z ( A)) < 1. In both cases the optimum relaxation coefficient to use is
      ω = 2/(1 + √(1 − ρ(Z(A))^2))
where (as usual) Z ( A) = I − D ( A)−1 A, and D ( A) is the matrix of diagonal elements of
A. If the relaxation coefficient has this value, then the spectral radius of the effective linear
transformation used in the SOR (or consistently-ordered) iteration scheme is ρ(Lω ) =
ω − 1.
The last statement gives us a good measure of the rate of convergence of the
SOR method (via 1.38 on page 157 and the fact that the 2-norm of a symmetric
matrix is equal to the spectral radius).
Writing μ = ρ(Z(A)), we can expand ω into a Taylor series to get some idea of its size:
      ω = 1 + μ^2/4 + μ^4/8 + O(μ^6)
5Besides the fact that it produces computations identical to the SOR algorithm.

F IGURE 5.3. Color-graph of a linear coloring

and this results in a value of the effective spectral radius of the matrix of
      ρ(L_ω) = ω − 1 = μ^2/4 + μ^4/8 + O(μ^6)
The proof of 1.42 is beyond the scope of this book — see chapter 6 of [174]
for proofs. It is interesting that this formula does not hold for matrices that are
not consistently ordered — [174] describes an extensive theory of what happens in
such cases.
We will give a criterion for when a consistent ordering scheme exists for a
given matrix. We have to make a definition first:
D EFINITION 1.43. Let G be an undirected graph with n vertices. A coloring of
G is an assignment of colors to the vertices of G in such a way that no two adjacent
vertices have the same color.
Given a coloring of G with colors {c1 , . . . , ck }, we can define the associated
coloring graph to be a graph with vertices in a 1-1 correspondence with the colors
{c1 , . . . , ck } and an edge between two vertices c1 and c2 if any vertex (of G) colored
with color c1 is adjacent to any vertex that is colored with color c2 .
A linear coloring of G is a coloring whose associated coloring graph is a linear
array of vertices (i.e., it consists of a single path, as in figure 5.3).

P ROPOSITION 1.44. The operation of finding a consistent-ordering of a matrix can


be regarded as equivalent to solving a kind of graph-coloring problem:
Given a square matrix, A, construct a graph, G, with one vertex for
each row (or column) of the matrix, and an edge connecting a vertex
representing row i to the vertex representing row j if j 6= i and Aij 6= 0.
Then:
1. the ordering schemes of A are in a 1 − 1 correspondence with the colorings of G.
2. the consistent ordering schemes of A are in a 1 − 1 correspondence with the linear
colorings of the graph G.
P ROOF. Statement 1: Define the sets {S j } to be the sets of vertices of G with
the same color (i.e., set S1 might be the set of “green” vertices, etc.). Now arrange
these sets so as to satisfy the second condition of 1.40 on page 158. This essentially
amounts to picking the smallest element of each set and arranging the sets so that
these smallest elements are in ascending order as we go from 1 to the largest number.
Statement 2: Suppose {S1 , . . . , Sk } is a consistent ordering of the matrix. We
will color vertex i of G with color S j where S j is the set containing i. The condition:
. . . for any pair of associated integers 0 ≤ i, j ≤ n, such that i ∈ St ,
• j ∈ St+1 if j > i;
• j ∈ St−1 if j < i;
implies that the associated coloring graph is linear. It implies that vertex i in the
coloring graph is only adjacent to vertices i − 1 and i + 1 (if they exist).
Conversely, given a linear coloring, we number the vertices of the coloring
graph by assigning to a vertex (and its associated color) its distance from one end.

F IGURE 5.4. Graph of the matrix A, and the associated color-graph (left panel: the matrix graph; right panel: the color graph)

1. For each color ci arbitrarily order the rows with that color.
2. Associate with a row i the pair (c ji , oi ), where c ji is the color of row i, and
oi is the ordinal position of this row in the set of all rows with the same
color.
3. Order the rows lexicographically by the pairs (c ji , oi ).
4. Define the permutation π to map row i to its ordinal position in the order-
ing of all of the rows defined in the previous line.
It is not hard to verify that, after the permutation of the rows and columns of
the matrix (or re-numbering of the coordinates in the original problem) that the
matrix will be consistently-ordered with ordering vector whose value on a given
coordinate is the number of the color of that coordinate. r
Here is an example of a consistently ordered matrix: Let
             4    0    0   −1
      A =   −1    4   −1    0
             0   −1    4    0
            −1    0    0    4
If we let S1 = {1}, S2 = {2, 4}, and S3 = {3}, we have a consistent ordering of A.
The vector γ = (1, 2, 3, 2)
is a consistent ordering vector for this matrix. The graph for this matrix is in figure
5.4.
We conclude this section with an algorithm for determining whether a ma-
trix has a consistent ordering (so that we can use the formula in 1.42 on page 160
to compute the optimum relaxation-factor). The algorithm is constructive in the
sense that it actually finds an ordering if it exists. It is due to Young (see [174]):
A LGORITHM 1.45. Let A be an n × n matrix. It is possible to determine
whether A has a consistent ordering via the following sequence of
steps:
Data: vectors {γ_i} and {γ̄_i}
      sets D (initially {1}) and T (initially {1})
      Boolean variable Cons ← TRUE

γ_1 ← 1, γ̄_1 ← 1
for j ← 2 to n do
    if A_{1,j} ≠ 0 or A_{j,1} ≠ 0 then
        γ_j ← 2, γ̄_j ← 0
        D ← D ∪ {j}
        T ← T ∪ {j}
    endif
endfor
D ← D \ {1}
while T ≠ {1, . . . , n} do
    if D ≠ ∅ then
        i ← minimal element of D
    else
        i ← minimal element of {1, . . . , n} \ T
        γ_i ← 1, γ̄_i ← 1
        T ← T ∪ {i}
    endif
    for each j ≠ i such that A_{i,j} ≠ 0 or A_{j,i} ≠ 0 do
        if j ∉ T then
            D ← D ∪ {j}
            T ← T ∪ {j}
            γ̄_j ← 1 − γ̄_i
            if j > i then
                γ_j ← γ_i + 1
            else
                γ_j ← γ_i − 1
            endif
        else { j ∈ T }
            if γ̄_j ≠ 1 − γ̄_i then
                Cons ← FALSE
                Exit
            endif
            if j > i then
                if γ_j ≠ γ_i + 1 then
                    Cons ← FALSE
                    Exit
                endif
            else
                if γ_j ≠ γ_i − 1 then
                    Cons ← FALSE
                    Exit
                endif
            endif
        endif
    endfor
    D ← D \ {i}
endwhile
The variable Cons determines whether the matrix A has a consistent ordering.
If it is TRUE at the end of the algorithm, the vector γ is a consistent ordering
vector. This determines a consistent ordering, via the comment following 1.40, on
page 159.
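The test can also be organized as a small graph search: for every associated pair i < j, a consistent ordering vector must satisfy γ_j = γ_i + 1, so γ-values can be propagated outward from each starting vertex and a failure reported on any contradiction. The following Python/NumPy sketch implements this check; it is a compact restatement of the idea, not Young's exact bookkeeping above.

    import numpy as np

    def consistent_ordering_vector(A):
        # Return a consistent ordering vector (definition 1.40), or None if none exists.
        n = A.shape[0]
        adj = [[j for j in range(n) if j != i and (A[i, j] != 0 or A[j, i] != 0)]
               for i in range(n)]
        gamma = [None] * n
        for start in range(n):                  # handle each connected component
            if gamma[start] is not None:
                continue
            gamma[start] = 1
            stack = [start]
            while stack:
                i = stack.pop()
                for j in adj[i]:
                    expected = gamma[i] + 1 if j > i else gamma[i] - 1
                    if gamma[j] is None:
                        gamma[j] = expected
                        stack.append(j)
                    elif gamma[j] != expected:
                        return None             # contradictory constraints: not consistently orderable
        return gamma

    A = np.array([[ 4, 0,  0, -1],
                  [-1, 4, -1,  0],
                  [ 0, -1, 4,  0],
                  [-1, 0,  0,  4]])
    print(consistent_ordering_vector(A))        # [1, 2, 3, 2], as in the example above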

1.2.6. Discussion. We have only scratched the surface in our treatment of the
theory of iterative algorithms for solving systems of linear equations. There are a

number of important issues in finding parallel versions of these algorithms. For


instance, the execution-time of such an algorithm depends upon:
1. The number of complete iterations required to achieve a desired degree of
accuracy. This usually depends upon the norm or spectral radius of some
matrix.
2. The number of parallel steps required to implement one complete itera-
tion of an iteration-scheme. This depends upon the number of sets in an
ordering (or consistent ordering) of a matrix.
In many cases the fastest parallel algorithm is the one that uses an ordering with
the smallest number of sets in an ordering — even though that may lead to an
iteration-matrix with a larger norm (than some alternative).
The reader may have noticed that we have required that the diagonal elements
of the matrix A be nonvanishing throughout this chapter. This is a very rigid
condition, and we can eliminate it to some extent. The result is known as the
Richardson Iteration scheme. It requires that we have some matrix B that satisfies
the conditions:
1. B−1 is easy to compute exactly;
2. The spectral radius of I − B−1 A is small (or at least < 1);
We develop this iteration-scheme via the following sequence of steps

      Av = b
      (A − B)v + Bv = b
      B^{-1}(A − B)v + v = B^{-1}b
      v = B^{-1}(B − A)v + B^{-1}b
      v = (I − B^{-1}A)v + B^{-1}b

So our iteration-scheme is:

v ( i +1) = ( I − B −1 A ) v ( i ) + B −1 b

Note that we really only need to know B−1 in order to carry this out. The main
result of the Pan-Reif matrix inversion algorithm (in the next section) gives an
estimate for B−1 :
B−1 = AH /(k Ak1 · k Ak∞ )

(this is a slight re-phrasing of 1.53 on page 169). The results of that section show
that O(lg n) parallel iterations of this algorithm are required to give a desired de-
gree of accuracy (if A is an invertible matrix). In general the SOR algorithm is much
faster than this scheme if it can be used.
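As a sketch of the Richardson scheme with this choice of B^{-1}, here is a Python/NumPy loop. The matrix is the one used in the worked example of the next section; the right-hand side is arbitrary, and the iteration count is chosen generously because kR(B)k2 is close to 1 for this matrix.

    import numpy as np

    def richardson_solve(A, b, iterations=10000):
        # Iterate v <- (I - B^{-1}A) v + B^{-1} b  with  B^{-1} = A^H / (||A||_1 ||A||_inf).
        B_inv = A.conj().T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
        v = np.zeros_like(b)
        for _ in range(iterations):
            v = v + B_inv @ (b - A @ v)     # algebraically the same as (I - B^{-1}A)v + B^{-1}b
        return v

    A = np.array([[1.0, 2.0, 3.0, 5.0],
                  [3.0, 0.0, 1.0, 5.0],
                  [0.0, 1.0, 3.0, 1.0],
                  [5.0, 0.0, 0.0, 3.0]])
    b = np.array([1.0, 0.0, 0.0, 0.0])
    v = richardson_solve(A, b)
    print(np.linalg.norm(A @ v - b))        # a very small residual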
Even the theory of consistent ordering schemes is well-developed. There are
group-iteration schemes, and generalized consistent ordering schemes. In these
cases it is also possible to compute the optimal relaxation coefficient. See [174] for
more details.

E XERCISES .

1.14. Let  
4 2 3
A = 2 4 1
3 0 4
Compute the spectral radius of this matrix. Will the SOR method converge in the
problem
      Ax = (2, 3, −5, 1)?
If so compute the optimum relaxation coefficient in the SOR method.
1.15. Show that a matrix that is consistently ordered has an ordering (non-
consistent) that has only two sets. This means that the parallel version of the SOR
algorithm (1.41 on page 159) has only two phases. (Hint: use 1.44 on page 161).

1.3. Power-series methods: the Pan-Reif Algorithm.


1.3.1. Introduction. The reader might have wondered what we do in the case
where the diagonal elements of a matrix vanish, or the conditions on spectral ra-
dius are not satisfied.
In this section we present a variation on the iterative methods described above.
The results in this section are of more theoretical than practical value since they
require a larger number of processors than the iterative methods, and inversion of
a matrix is generally not the most numerically stable method for solving a linear
system.
In order to understand the algorithm for matrix inversion, temporarily forget
that we are trying to invert a matrix — just assume we are inverting a number.
Suppose we want to invert a number u using a power series. Recall the well-known
power-series for the inverse of 1 − x:
1 + x + x2 + x3 · · ·
We can use this power series to compute u−1 if we start with an estimate a
for u−1 since u−1 = u−1 a−1 a = ( au)−1 a = (1 − (1 − au))−1 a, where we use the
power-series to calculate (1 − (1 − au))−1 . If a is a good estimate for the inverse
of u, then au will be close to 1 and 1 − au will be close to 0 so that the power-series
for (1 − (1 − au))−1 will converge.
It turns out that all of this can be made to work for matrices. In order to refor-
mulate it for matrices, we must first consider what it means for a power-series of
matrices to converge. The simplest way for a power-series of matrices to converge
is for all but a finite number of terms to vanish. For instance:

P ROPOSITION 1.46. Let M = (M_ij) be an n × n lower triangular matrix. This is a matrix for which M_ij = 0 whenever i ≤ j. Then M^n = 0, and
      (I − M)^{-1} = I + M + M^2 + · · · + M^{n−1}
P ROOF. We prove that Mn = 0 by induction on n. It is easy to verify the result
in the case where n = 2. Suppose the result is true for all k × k, lower triangular
matrices. Given a k + 1 × k + 1 lower triangular matrix, M, we note that:
• Its first k rows and columns form a k × k lower triangular matrix, M′.
• If we multiply M by any other (k + 1) × (k + 1) lower triangular matrix, L, with its k × k lower-triangular submatrix L′, we note that the product is of the form:
            M′L′   0
             E     0
  where the 0 in the upper right position represents a column of k zeros, and E represents a last row of k elements.
It is not hard to see that M^k will be of the form
             0     0
             E     0
i.e., it will only have a single row with nonzero elements — the last. One further
multiplication of this by M will kill off all of the elements.
The remaining statement follows by direct computation:

( I − M)( I + M + M2 + · · · + Mn−1 ) = I − Mn = I
r

In somewhat greater generality:


C OROLLARY 1.47. Let A be an n × n matrix of the form
A = D−L
where D is a diagonal matrix with all diagonal entries nonzero, and L is a lower-triangular
matrix. Then, writing D = diag(a_{11}, a_{22}, . . . , a_{nn}),
      D^{-1} = diag(a_{11}^{-1}, a_{22}^{-1}, . . . , a_{nn}^{-1})
and
      A^{-1} = (I + (D^{-1}L) + (D^{-1}L)^2 + · · · + (D^{-1}L)^{n−1}) D^{-1}
Now we give a fast algorithm for adding up these power series:
P ROPOSITION 1.48. Let Z_k = ∏_{h=0}^{k−1} (I + R^{2^h}). Then Z_k = ∑_{i=0}^{2^k − 1} R^i.

P ROOF. This follows by induction on k — the result is clearly true in the case
where k = 1. Now suppose the result is true for a given value of k — we will prove
it for the next higher value. Note that Z_{k+1} = Z_k (I + R^{2^k}). But this is equal to
      (I + R^{2^k}) ∑_{i=0}^{2^k − 1} R^i = ∑_{i=0}^{2^k − 1} R^i + R^{2^k} ∑_{i=0}^{2^k − 1} R^i
                                          = ∑_{i=0}^{2^k − 1} R^i + ∑_{i=0}^{2^k − 1} R^{i + 2^k}
                                          = ∑_{i=0}^{2^{k+1} − 1} R^i

C OROLLARY 1.49. Let M be an n × n matrix, all of whose entries vanish above the
main diagonal, and such that M has an inverse. Then there is a SIMD-parallel algorithm
for computing M−1 in O(lg2 n) time using n2.376 processors.
This algorithm first appeared in [126] and [67]. See [68] for related algorithms.

P ROOF. We use 1.46, 1.47 and 1.48 in a straightforward way. The first two
results imply that M−1 is equal to a suitable power-series of matrices, with only n
nonzero terms, and the last result gives us a way to sum this power-series in lg n
steps. The sum is a product of terms that each equal the identity matrix plus an
iterated square of an n × n matrix. Computing this square requires O(lg n) time,
using n2.376 processors, by 1.2 on page 137. r
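A Python/NumPy sketch of the computation behind this corollary: write A = D − L with D the diagonal part, set R = D^{-1}L (which is strictly lower triangular), accumulate Z_k = ∏(I + R^{2^h}) by repeated squaring, and then A^{-1} = Z_k D^{-1} once 2^k ≥ n. The example matrix is only an illustration; in the parallel algorithm each of the ⌈lg n⌉ matrix products would itself be computed with the fast matrix-multiplication routine.

    import numpy as np

    def lower_triangular_inverse(A):
        # Invert a lower-triangular matrix with nonzero diagonal (corollaries 1.47-1.49).
        n = A.shape[0]
        D = np.diag(np.diag(A))
        D_inv = np.diag(1.0 / np.diag(A))
        R = D_inv @ (D - A)                   # R = D^{-1} L is strictly lower triangular
        Z = np.eye(n)
        power = R.copy()
        for _ in range(int(np.ceil(np.log2(n)))):
            Z = Z @ (np.eye(n) + power)       # Z_k = prod_h (I + R^{2^h})
            power = power @ power             # repeated squaring gives R^{2^h}
        return Z @ D_inv                      # A^{-1} = (I + R + ... + R^{n-1}) D^{-1}

    A = np.array([[1.0, 0.0, 0.0],
                  [2.0, 3.0, 0.0],
                  [4.0, 5.0, 6.0]])           # an illustrative lower-triangular matrix
    print(np.allclose(lower_triangular_inverse(A) @ A, np.eye(3)))   # True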

Now we will consider the question of how one finds an inverse of an arbitrary
invertible matrix. In general, we cannot assume that some finite power of part
of a matrix will vanish. We must be able to handle power series of matrices that
converge more slowly. We need to define a measure of the size of a matrix (so
we can easily quantify the statement that the terms of a power-series of matrices
approach 0, for instance).
1.3.2. The main algorithm. Now we are in a position to discuss the Pan-Reif
algorithm. Throughout this section, we will have a fixed n × n matrix, A. We will
assume that A is invertible.
D EFINITION 1.50. The following notation will be used throughout this section:
1. If B is a matrix R( B) is defined to be I − BA;
2. A matrix B will be called a sufficiently good estimate for the inverse of A if
k R( B)k = u( B) < 1.
Now suppose A is a nonsingular matrix, and B is a sufficiently good estimate
for the inverse (with respect to some matrix-norm) of A (we will show how to get
such an estimate later). Write A−1 = A−1 B−1 B = ( BA)−1 B = ( I − R( B))−1 B.
We will use a power series expansion for ( I − R( B))−1 — the fact that B was a
sufficiently good estimate for the inverse of A will imply that the series converges
(with respect to the same matrix-norm).
The following results apply to any matrix-norms:
P ROPOSITION 1.51. Suppose that B is a sufficiently close estimate for the inverse of
A. Then the power series ( I + R( B) + R( B)2 + · · · ) B converges to the inverse of A. In

fact, if Sk is the kth partial sum of the series (i.e. ( I + R( B) + · · · + R( B)k ) B) then
      kA^{-1} − S_kk ≤ kBk u(B)^k / (1 − u(B))
which tends to 0 as k goes to ∞.
P ROOF. The inequalities in the remark above imply that kR(B)^ik ≤ kR(B)k^i, and kS_k − S_{k′}k (where k > k′) is ≤ kBk(u(B)^{k′} + · · · + u(B)^k) ≤ kBk u(B)^{k′}(1 + u(B) + u(B)^2 + · · · ) = kBk u(B)^{k′}/(1 − u(B)) (we are making use of the fact that u(B) < 1 here). This tends to 0 as k′ and k tend to ∞, so the series converges (it isn’t hard to see that the series will converge if the norms of the partial sums converge — this implies that the result of applying the matrix to any vector converges). Now A^{-1} − S_k = (R(B)^{k+1} + · · · )B = R(B)^{k+1}(I + R(B) + R(B)^2 + · · · )B. The second statement is clear. r
We can use 1.48 on page 166 to add up this power series. We apply this result by setting B_k^* = Z_k, with R = R(B). It follows that
      kA^{-1} − B_k^* Bk ≤ kBk u(B)^{2^k} / (1 − u(B))
We will show that the series converges to an accurate estimate of A−1 in O(lg n)
steps. We need to know something about the value of u(B). It turns out that this value will depend upon n — in fact it will increase towards 1 as n goes to ∞ — we consequently need to know that it doesn’t increase towards 1 too rapidly. It will be proved later in this section that:
Fact: u(B) = 1 − 1/n^{O(1)} as n → ∞, and |lg kBk| ≤ an^β.
P ROPOSITION 1.52. Let α be such that u(B) ≤ 1 − 1/n^α, and assume the two facts listed above and that r bits of precision are desired in the computation of A^{-1}. Then B_k^* · B is a sufficiently close estimate of A^{-1} whenever αn^α lg(n) + r + an^{α+β} ≤ 2^k. It follows that O(lg n) factors in the computation of ∏_{h=0}^{k−1}(I + R(B)^{2^h}) suffice to compute A^{-1} to the
desired degree of accuracy.
P ROOF. We want
      kA^{-1} − B_k^* Bk ≤ kBk u(B)^{2^k} / (1 − u(B)) ≤ 2^{-r}
Taking the logarithm gives
      |lg kBk| + 2^k lg(u(B)) − lg(1 − u(B)) ≤ −r
The second fact implies that this will be satisfied if 2^k lg(u(B)) − lg(1 − u(B)) ≤ −(r + an^β). The first fact implies that this inequality will be satisfied if we have
      2^k lg(1 − 1/n^α) − lg(1/n^α) = 2^k lg(1 − 1/n^α) + α lg(n) ≤ −(r + an^β)
Now substituting the well-known power series lg(1 − g) = −g − g^2/2 − g^3/3 − · · · gives
      2^k(−1/n^α − 1/(2n^{2α}) − 1/(3n^{3α}) − · · · ) + α lg(n) ≤ −(r + an^β).
This inequality will be satisfied if the following one is satisfied (where we have replaced the power series by a strictly greater one): 2^k(−1/n^α) + α lg(n) ≤ −(r + an^β).

Rearranging terms gives α lg(n) + r + an^β ≤ 2^k/n^α, or αn^α lg(n) + r + an^{α+β} ≤ 2^k, which proves the result. r

Now that we have some idea of how the power series converges, we can state
the most remarkable part of the work of Pan and Reif — their estimate for the
inverse of A. This estimate is given by:
1.53. Estimate for A−1 :
B = AH /(k Ak1 · k Ak∞ )

It is remarkable that this fairly simple formula gives an estimate for the in-
verse of an arbitrary nonsingular matrix. Basically, the matrix R( B) will play the
part of the matrix Z ( A) in the iteration-methods of the previous sections. One dif-
ference between the present results and the iteration methods, is that the present
scheme converges much more slowly than the iteration-schemes discussed above.
We must consequently, use many more processors in the computations — enough
processors to be able to perform multiple iterations in parallel, as in 1.48.
Now we will work out an example to give some idea of how this algorithm
works:
Suppose our initial matrix is
 
1 2 3 5
3 0 1 5
A= 0 1 3 1

5 0 0 3
Then our estimate for the inverse of A is:
      B = A^H/(kAk1 · kAk∞) =
            0.0065   0.0195   0        0.0325
            0.0130   0        0.0065   0
            0.0195   0.0065   0.0195   0
            0.0325   0.0325   0.0065   0.0195
and
      R(B) = I − BA =
             0.7727  −0.0130  −0.0390  −0.2273
            −0.0130   0.9675  −0.0584  −0.0714
            −0.0390  −0.0584   0.8766  −0.1494
            −0.2273  −0.0714  −0.1494   0.6104
The 2-norm of R( B) turns out to be 0.9965 (this quantity is not easily
computed, incidentally), so the power-series will converge. The easily computed
norms are given by k R( B)k1 = 1.1234 and k R( B)k∞ = 1.1234, so they give no
indication of whether the series will converge.
Now we compute the B_k^* — essentially we use 1.1 and square R(B) until we arrive at a power of R(B) whose 1-norm is < 0.00001. This turns out to require 13 steps (which represents adding up 2^13 − 1 terms of the power series). We get:
      B_13 =
             19.2500    24.3833     2.5667   −16.6833
             24.3833   239.9833   −94.9667   −21.8167
              2.5667   −94.9667    61.6000    −7.7000
            −16.6833   −21.8167    −7.7000    19.2500

and
      B_13 B =
            −0.0500  −0.1500   0.1000   0.3000
             0.7167  −0.8500  −0.4333   0.3667
            −0.2667  −0.2667   0.5333  −0.0667
             0.0833   0.2500  −0.1667  −0.1667


which agrees with A−1 to the number of decimal places used here.
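The computation above is easy to reproduce. Here is a serial Python/NumPy sketch; the cutoff on the 1-norm is taken much tighter than the 10^{-5} used above, purely so that the final check passes to machine-level accuracy.

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0, 5.0],
                  [3.0, 0.0, 1.0, 5.0],
                  [0.0, 1.0, 3.0, 1.0],
                  [5.0, 0.0, 0.0, 3.0]])

    B = A.conj().T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))   # the estimate 1.53
    R = np.eye(4) - B @ A                                                 # R(B) = I - BA

    Z = np.eye(4)
    power = R.copy()
    while np.linalg.norm(power, 1) >= 1e-12:    # keep multiplying factors (I + R(B)^{2^h})
        Z = Z @ (np.eye(4) + power)
        power = power @ power                   # repeated squaring

    approx_inverse = Z @ B                      # (I + R(B) + R(B)^2 + ...) B
    print(np.round(approx_inverse, 4))          # essentially the matrix B13 B displayed above
    print(np.allclose(approx_inverse @ A, np.eye(4)))   # True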
1.3.3. Proof of the main result. Recall that § 1.2.1 describes several different con-
cepts of the size of a matrix: the 1-norm, the ∞-norm, and the 2-norm. The proof
that the Pan-Reif algorithm works requires all three of these. We will also need
several inequalities that these norms satisfy.
L EMMA 1.54. The vector norms, the 1-norm, the ∞-norm, and the 2-norm defined
in 1.12 on page 140 satisfy:
1. k v k ∞ ≤ k v k2 ≤ k v k1 ;
2. k v k1 ≤ n k v k ∞ ;
3. kvk2 ≤ (n1/2 )kvk∞ ;
4. (n−1/2 )kvk1 ≤ kvk2 ;
5. kvk22 ≤ kvk1 · kvk∞ ;
P ROOF. kvk∞ ≤ kvk2 follows by squaring both terms; kvk2 ≤ kvk1 follows by
squaring both terms and realizing that kvk21 has cross-terms as well as the squares of
single terms. kvk1 ≤ nkvk∞ follows from the fact that each of the n terms in kvk1
is ≤ the maximum such term. kvk2 ≤ (n1/2 )kvk∞ follows by a similar argument
after first squaring both sides.
(n^{-1/2})kvk1 ≤ kvk2 follows from Lagrange undetermined multipliers. They are used to minimize √(∑_i |v_i|^2) subject to the constraint that ∑_i |v_i| = 1 (this is the same as minimizing ∑_i |v_i|^2) — it is not hard to see that the minimum occurs when all coordinates (i.e., all of the |v_i|) are equal, and the value taken on by √(∑_i |v_i|^2) is (n^{1/2})|v_0| in this case.
The last inequality follows by an argument like that used for ii — each term
|vi |2 is ≤ |vi |× the maximum such term. r

These relations imply the following relations among the corresponding matrix
norms:
C OROLLARY 1.55.
1. kMk2 ≤ (n^{1/2})kMk∞;
2. (n^{-1/2})kMk1 ≤ kMk2;
3. kMk2^2 ≤ kMk1 · kMk∞
Recall the formulas for the 1-norm and the ∞-norm in 1.24 on page 144 and
1.25 on page 144.
The rest of this section will be spent examining the theoretical basis for this
algorithm. In particular, we will be interested in seeing why B, as given above, is
a sufficiently good estimate for A−1 . The important property of B is:
T HEOREM 1.56. 1. k Bk2 ≤ 1/k Ak2 ≤ 1/ maxi,j | aij |;
2. k R( B)k2 ≤ 1 − 1/((cond A)2 n).
1. This result applies to the 2-norms of the matrices in question.

2. The remainder of this section will be spent proving these statements. It


isn’t hard to see how these statements imply the main results (i.e. the facts
cited above) — k Bk2 ≤ 1/ maxi,j | aij | ≤ M and k R( B)k2 ≤ 1 − 1/n7 M2 ,
by the remark above.
3. The proof of this theorem will make heavy use of the eigenvalues of ma-
trices — 1.13 on page 140 for a definition of eigenvalues. The idea, in a
nutshell, is that:
a. the eigenvalues of B and R( B), as defined in 1.53, are closely related to
those of A;
b. the eigenvalues of any matrix are closely related to the 2-norm of the
matrix;
We can, consequently, prove an inequality for the 2-norm of R( B) and use certain
relations between norms (described in 1.54 on page 170) to draw conclusions about
the values of the other norms of R( B) and the convergence of the power-series.
Another way to think of this, is to suppose that the matrix A is diagonalizable in
a suitable way, i.e. there exists a nonsingular matrix Z that preserves the 2-norm6,
and such that D = Z −1 AZ is a diagonal matrix — its only nonzero entries lie on
the main diagonal. For the time being forget about how to compute Z — it turns
out to be unnecessary to do so — we only need know that such a Z exists. It turns
out that it will be relatively easy to prove the main result in this case.
We make essential use of the fact that the 2-norm of D is the same as that of A
(since Z preserves 2-norms) — in other words, it is not affected by diagonalization.
Suppose that D = diag(λ_1, λ_2, . . . , λ_n).
Then:
1. k D k2 = k Ak2 = maxi |λi |;
2. A is nonsingular if and only if none of the λi is 0.
Now, let t = 1/(k Ak1 · k Ak∞ ). Then k Ak22 ≤ 1/t, by 1.26 on page 146, and the
2-norm of B is equal to t · max_i |λ_i|. The matrix R(D) is equal to:
      R(D) = diag(1 − tλ_1^2, 1 − tλ_2^2, . . . , 1 − tλ_n^2)
and its 2-norm is equal to max_i(1 − tλ_i^2). If A was nonsingular, then the λ_i are all nonzero. The fact that kAk2^2 ≤ 1/t and the fact that kAk2 = max_i |λ_i| imply that tλ_i^2 ≤ 1 for all i and kR(D)k2 = max_i(1 − tλ_i^2) < 1. But this is a 2-norm, so it is not sensitive to diagonalization. It is not hard to see that R(D) is the diagonalization of R(B), so we conclude that kR(B)k2 < 1, if A was nonsingular.
All of this depended on the assumption that A could be diagonalized in this
way. This is not true for all matrices — and the bulk of the remainder of this section

6Such a matrix is called unitary – it is characterized by the fact that ZH = Z −1



involves some tricks that Reif and Pan use to reduce the general case to the case
where A can be diagonalized.
Recall the definition of spectral radius in 1.13 on page 140.
Applying 1.26 on page 146 to W = AH we get k AH k22 ≤ 1/t where t =
1/(k Ak1 · k Ak∞ ) is defined in 1.52 on page 168. It follows, by 1.56 that k AH k2 ≤
1/(tk Ak2 ).
The remainder of this section will be spent proving the inequality in line 2 of
1.56 on page 170.

C OROLLARY 1.57. Let λ be an eigenvalue of AH A and let A be nonsingular. Then


1/k A−1 k22 ≤ λ ≤ k Ak22 .

P ROOF. λ ≤ ρ( AH A) = k Ak22 by definition and 1.20 on page 143. On the


other hand ( AH A)−1 = A−1 ( A−1 )H , so ( AH A)−1 is a Hermitian positive definite
matrix. Lemma 1.19 on page 143 implies that 1/λ is an eigenvalue of ( AH A)−1 .
Consequently 1/λ ≤ ρ(( AH A)−1 ) = ρ( A−1 ( A−1 )H ) = k A−1 k22 . r

L EMMA 1.58. Let B = tA^H, where t = 1/(kAk1 · kAk∞), and let µ be an eigenvalue of R(B) = I − BA. Then 0 ≤ µ ≤ 1 − 1/((cond A)^2 n).

P ROOF. Let R(B)v = µv for v ≠ 0. Then (I − tA^H A)v = v − tA^H Av = µv. Therefore A^H Av = λv for λ = (1 − µ)/t, so λ is an eigenvalue of A^H A. Corollary
1.57 implies that 1/k A−1 k22 ≤ λ = (1 − µ)/t ≤ k Ak22 . It immediately follows
that 1 − tk Ak22 ≤ µ ≤ 1 − t/k A−1 k22 . It remains to use the definition of t =
1/(k Ak1 · k Ak∞ ) and to apply statement 3 of 1.55 on page 170. r

We are almost finished. We have bounded the eigenvalues of R( B) and this


implies a bound on the spectral radius. The bound on the spectral radius implies
a bound on the 2-norm since R(B) is Hermitian. Since µ is an arbitrary eigenvalue of R(B), ρ(R(B)) ≤ 1 − 1/((cond A)^2 n). On the other hand, kR(B)k2 = ρ(R(B))
since R( B) = I − tAH A is Hermitian. This completes the proof of the second line
of 1.56 on page 170.

E XERCISES .

1.16. Apply the algorithm given here to the development of an algorithm for
determining whether a matrix is nonsingular. Is it possible for a matrix A to be sin-
gular with k R( B)k < 1 (any norm)? If this happens, how can one decide whether
A was singular or not?

1.17. Write a C* program to implement this algorithm. Design the algorithm


to run until a desired degree of accuracy is achieved — in other words do not make
the program use the error-estimates given in this section.

1.18. From the sample computation done in the text, estimate the size of the
constant of proportionality that appears in the ‘O(lg2 n)’.

1.19. Compute the inverse of the matrix


 
2 0 0 0
1 −1 0 0
 
3 0 3 0
1 2 3 4
using the algorithm of 1.47 on page 166.
1.20. If A is a symmetric positive definite matrix (i.e., all of its eigenvalues are
positive) show that
      B = I/kAk∞
can be used as an approximate inverse of A (in place of the estimate in 1.53 on
page 169).

1.4. Nonlinear Problems. In this section we will give a very brief discussion
of how the iterative methods developed in this chapter can be generalized to non-
linear problems. See [76] for a general survey of this type of problem, and [103] and [15] for surveys of parallel algorithms.
D EFINITION 1.59. Let f : Rn → Rn be a function that maps some region M ⊆
Rn to itself. This function will be called:
1. a contracting map on M with respect to a (vector) norm k ∗ k if there exists
a number 0 ≤ α < 1 such that, for all pairs of points x and y, in M
k f ( x ) − f (y)k ≤ αk x − yk
2. a pseudo-contracting map on M (with respect to a vector norm) if there ex-
ists a point x0 ∈ M such that:
a. f ( x0 ) = x0 and there exists a number α between 0 and 1 such that for
all x ∈ M
k f ( x ) − x0 k ≤ α k x − x0 k
Although these definitions might seem a little abstract at first glance, it turns out that the property of being a contracting map is precisely a nonlinear version of the statement that the norm of a matrix must be < 1. The material in this section is a direct generalization of the material in the earlier section on the Jacobi method — see page 151.
Suppose f is a pseudo-contracting map. Then it is not hard to see that iterating the application of f to any point of M will result in a sequence of points that converges to x₀:

    f(x), f(f(x)), f(f(f(x))), · · · → x₀

This is a direct consequence of the fact that the distance between any point and x₀ is reduced by at least a constant factor each time f is applied. This means that if we want to solve an equation like

    f(x) = x

we can easily get an iterative procedure for finding x₀. In fact, the Jacobi (JOR, SOR) methods are just special cases of this procedure in which f is a linear function.
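To make the iteration concrete, here is a minimal sequential sketch in Python (used here only as a stand-in for the C* versions in the text); the particular map f, the starting point, and the tolerance are arbitrary choices for illustration.

    import numpy as np

    def fixed_point(f, x, tol=1e-12, max_iter=1000):
        """Iterate x <- f(x) until the step size drops below tol."""
        for _ in range(max_iter):
            x_next = f(x)
            if np.linalg.norm(x_next - x) < tol:
                return x_next
            x = x_next
        return x

    # f(x) = (cos(x1), sin(x2)/2) maps the box [0,1] x [0,1] to itself and is
    # contracting there, so the iteration converges to the fixed point x0.
    f = lambda x: np.array([np.cos(x[0]), np.sin(x[1]) / 2.0])
    x0 = fixed_point(f, np.array([1.0, 1.0]))
    print(x0, f(x0))    # the two printed vectors agree to within the tolerance

The number of iterations needed depends on the contraction factor α: each step shrinks the distance to x₀ by at least that factor, exactly as in the linear case.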
As remarked above, the possibilities for exploiting parallelism in the nonlinear case are generally far more limited than in the linear case. In the linear case, if we have enough processors, we can parallelize multiple iterations of the iterative solution of a linear system — i.e., if we must compute

    M^n x

where M is some matrix and n is a large number, we can compute M^n by repeated squaring — this technique is used in the Pan-Reif algorithm in § 1.3. In nonlinear problems, we generally do not have a compact and uniform way of representing f^n (the n-fold composite of a function f), so we must perform the iterations one at a time. We can still exploit parallelism in computing f itself — when there are a large number of variables in the problem under investigation.
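The repeated-squaring trick itself is easy to sketch sequentially (Python again; each of the O(lg n) squarings is a matrix multiplication, which is exactly the step that parallelizes well):

    import numpy as np

    def matrix_power_by_squaring(M, n):
        """Compute M^n using O(lg n) matrix multiplications."""
        result = np.eye(M.shape[0])
        while n > 0:
            if n & 1:                # if the current bit of n is set,
                result = result @ M  #   multiply the accumulated result by M
            M = M @ M                # square M for the next bit
            n >>= 1
        return result

    M = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
    print(matrix_power_by_squaring(M, 10))   # [[1, 10], [0, 1]]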
We conclude this section with an example:
EXAMPLE 1.60. Let M be a k × k matrix. If v is an eigenvector of M with a nonzero eigenvalue λ, then

    Mv = λv

by definition. If ‖∗‖ is any norm, we get

    ‖Mv‖ = |λ| ‖v‖

so we get

    Mv/‖Mv‖ = v/‖v‖

Consequently, we get the following nonlinear equation

    f(w) = w

where f(w) = Mw/‖Mw‖. Its solutions are eigenvectors of M of unit norm (all eigenvectors of M are scalar multiples of these). This is certainly nonlinear, since it involves dividing a linear function by ‖Mw‖, which, depending upon the norm used, may involve square roots or the maximum function. This turns out to be a pseudo-contracting map, where x₀ (in the notation of definition 1.59) is an eigenvector belonging to the eigenvalue of largest absolute value.
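This fixed-point formulation is exactly the classical power method. The following sketch (Python, sequential, with an arbitrary symmetric test matrix and a fixed iteration count) iterates w ← Mw/‖Mw‖ and compares the resulting Rayleigh quotient with the eigenvalue of largest absolute value computed by a library routine.

    import numpy as np

    def power_method(M, iterations=200):
        """Iterate w <- Mw/||Mw|| starting from a fixed vector."""
        w = np.ones(M.shape[0])
        for _ in range(iterations):
            w = M @ w
            w = w / np.linalg.norm(w)       # here ||*|| is the 2-norm
        return w

    M = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    w = power_method(M)
    print(w @ M @ w)                             # Rayleigh quotient: estimate of the dominant eigenvalue
    print(np.max(np.abs(np.linalg.eigvals(M))))  # the same value from a library routine

Each iteration is a matrix-vector product and a norm computation, both of which are easy to parallelize when M is large.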
1.5. A Parallel Algorithm for Computing Determinants. In this section we will discuss an algorithm for computing determinants of matrices. The algorithm is due to Csanky and appeared in [39]. It is of more theoretical than practical interest, since it uses O(n⁴) processors and is numerically unstable. The problem of computing the determinant of a matrix is an interesting one, because there are factors that make it seem as though there might not exist an NC algorithm:
• the definition of a determinant (1.8 on page 139) has an exponential number of terms in it.

• the only well-known methods for computing determinants involve variations on using the definition, or Gaussian elimination. But the latter is known to be P-complete — see page 41.
Before Csanky published his paper [39], in 1976, the general belief was that no NC
algorithm existed for computing determinants.
Throughout this section A will denote an n × n matrix.
DEFINITION 1.61. If A is an n × n matrix, the trace of A is defined to be

    tr(A) = ∑_{i=1}^{n} A_{ii}

Recall the characteristic polynomial, defined in 1.13 on page 140. If the matrix is nonsingular,

(15)    f(λ) = det(λ·I − A) = ∏_{i=1}^{n} (λ − λ_i)

(16)         = λ^n + c_1 λ^{n−1} + · · · + c_{n−1} λ + c_n

where the {λ_i} are the eigenvalues of the matrix. Direct computation shows that

(17)    tr(A) = ∑_{i=1}^{n} λ_i = −c_1

It follows that the trace is an invariant quantity — in other words, transforming the matrix in a way that corresponds to a change of coordinate system in the vector-space upon which it acts doesn't change the trace.
Setting λ = 0 in equation (16) shows that the determinant of A is equal to (−1)^n c_n, since f(0) = det(−A) = (−1)^n det(A) on the one hand, and f(0) = c_n on the other.
The first step in Csanky's algorithm for the determinant is to compute the powers of A: A², A³, . . . , A^n, and the trace of each. Set s_k = tr(A^k).
PROPOSITION 1.62. If A is a nonsingular matrix and k ≥ 1 is an integer, then

    s_k = tr(A^k) = ∑_{i=1}^{n} λ_i^k

PROOF. Equation (17) shows that tr(A^k) = ∑_{i=1}^{n} µ(k)_i, where the {µ(k)_i} are the eigenvalues of A^k, counted with their multiplicities. The result follows from the fact that the eigenvalues of A^k are just the k-th powers of the corresponding eigenvalues of A. This follows from the definition of an eigenvalue in 1.13 on page 140: it is a number λ with the property that there exists a nonzero vector v such that Av = λv. Clearly, if multiplying v by A has the effect of multiplying it by the scalar λ, then multiplying the same vector by A^k will have the effect of multiplying it by the scalar λ^k. So µ(k)_i = λ_i^k and the result follows. □
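As a quick sanity check of Proposition 1.62, the following few lines (Python/NumPy, with an arbitrary test matrix) compare tr(A^k) against the sum of the k-th powers of the eigenvalues.

    import numpy as np

    A = np.array([[2.0, 1.0, 0.0],
                  [0.0, 3.0, 1.0],
                  [1.0, 0.0, 1.0]])
    eigenvalues = np.linalg.eigvals(A)

    for k in range(1, 4):
        s_k = np.trace(np.linalg.matrix_power(A, k))   # tr(A^k)
        print(k, s_k, np.sum(eigenvalues ** k).real)   # the two values agree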

At this point we have the quantities ∑_{i=1}^{n} λ_i, ∑_{i=1}^{n} λ_i², . . . , ∑_{i=1}^{n} λ_i^n. It turns out that we can compute ∏_{i=1}^{n} λ_i from this information. It is easy to see how to do this in simple cases. For instance, suppose n = 2. Then
1. s_1 = λ_1 + λ_2
2. s_2 = λ_1² + λ_2²

and, if we compute s_1², we get λ_1² + 2λ_1λ_2 + λ_2², so

    λ_1 λ_2 = (s_1² − s_2)/2
There is a general method for computing the coefficients of a polynomial in terms of sums of powers of its roots. It was developed by Le Verrier in 1840 (see [166] and [50]) to compute certain elements of the orbits of the first seven planets (all that were known at the time).

PROPOSITION 1.63. Let

    p(x) = x^n + c_1 x^{n−1} + · · · + c_{n−1} x + c_n

and let the roots of p(x) be x_1, . . . , x_n. Then

    ( 1                                  ) ( c_1 )     ( s_1 )
    ( s_1      2                         ) (  ·  )     (  ·  )
    (  ·        ·     ·                  ) (  ·  ) = − (  ·  )
    ( s_{k−1}  ···   s_1   k             ) (  ·  )     (  ·  )
    (  ·                    ·    ·       ) (  ·  )     (  ·  )
    ( s_{n−1}  ···  s_{k−1} ··· s_1    n ) ( c_n )     ( s_n )

where the k-th row of the matrix is (s_{k−1}, s_{k−2}, . . . , s_1, k, 0, . . . , 0) — all entries above the diagonal are 0 — and s_k = ∑_{i=1}^{n} x_i^k.

PROOF. We will give an analytic proof of this result. If we take the derivative of p(x), we get:

    dp(x)/dx = n x^{n−1} + c_1 (n−1) x^{n−2} + · · · + c_{n−1}

We can also set

    p(x) = ∏_{i=1}^{n} (x − x_i)

and differentiate this formula (using the product rule) to get:

    dp(x)/dx = ∑_{i=1}^{n} ∏_{j≠i} (x − x_j) = ∑_{i=1}^{n} p(x)/(x − x_i)

so we get

(18)    n x^{n−1} + c_1 (n−1) x^{n−2} + · · · + c_{n−1} = ∑_{i=1}^{n} p(x)/(x − x_i) = p(x) ∑_{i=1}^{n} 1/(x − x_i)

Now we expand each of the terms 1/(x − x_i) into a power series in x_i/x to get

    1/(x − x_i) = 1/( x(1 − x_i/x) ) = (1/x) ( 1 + x_i/x + x_i²/x² + · · · )

Now we plug this into equation (18) to get

    n x^{n−1} + c_1 (n−1) x^{n−2} + · · · + c_{n−1} = p(x) ∑_{i=1}^{n} 1/(x − x_i)
                                                   = (p(x)/x) ∑_{i=1}^{n} ( 1 + x_i/x + x_i²/x² + · · · )
                                                   = p(x) ( n/x + s_1/x² + s_2/x³ + · · · )

Since the power series converge for all sufficiently large values of x, the coefficients of each power of x must be the same on both sides of the equation. If we equate the coefficients of x^{n−k−1} on both sides of this equation, we get the matrix equation in the statement. □

Csanky's algorithm for the determinant is thus:

ALGORITHM 1.64. Given an n × n matrix A we can compute the determinant by the following sequence of steps:
1. Compute A^k in parallel for k = 2, . . . , n. This can be done in O(lg² n) time using O(n · n^{2.376}) processors;
2. Compute s_k = tr(A^k) for k = 1, . . . , n. This requires O(lg n) time using O(n²) processors;
3. Solve the matrix equation in 1.63 for (c_1, . . . , c_n)^T — these are the coefficients of the characteristic polynomial of A. This equation can be solved in O(lg² n) time using n^{2.376} processors. The only thing that has to be done is to invert the square matrix in the equation, and this can be done via algorithm 1.49 on page 167.
Return (−1)^n c_n as the determinant of A.
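Here is a compact sequential sketch of these steps in Python (NumPy), intended only to make the arithmetic concrete: it computes the power sums, builds and solves the triangular system of Proposition 1.63 directly (rather than inverting the matrix in parallel), and checks the result against a library determinant. The test matrix is the one from exercise 1.19.

    import numpy as np

    def csanky_determinant(A):
        """Determinant via power sums and the system of Proposition 1.63 (sequential sketch)."""
        n = A.shape[0]
        # Steps 1 and 2: the power sums s_k = tr(A^k) for k = 1, ..., n
        s = np.array([np.trace(np.linalg.matrix_power(A, k)) for k in range(1, n + 1)])
        # Step 3: build the lower-triangular matrix; row k is (s_{k-1}, ..., s_1, k, 0, ..., 0)
        L = np.zeros((n, n))
        for k in range(1, n + 1):
            L[k - 1, k - 1] = k
            L[k - 1, :k - 1] = s[:k - 1][::-1]
        c = np.linalg.solve(L, -s)               # the coefficients c_1, ..., c_n
        return (-1) ** n * c[-1]                 # det(A) = (-1)^n c_n

    A = np.array([[2.0,  0.0, 0.0, 0.0],
                  [1.0, -1.0, 0.0, 0.0],
                  [3.0,  0.0, 3.0, 0.0],
                  [1.0,  2.0, 3.0, 4.0]])
    print(csanky_determinant(A), np.linalg.det(A))   # both give -24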

Note that we also get the values of c_1, c_2, etc., as an added bonus. There is no simple way to compute c_n without also computing these other coefficients. The original paper of Csanky used these coefficients to compute A⁻¹ via the formula:

    A⁻¹ = −(1/c_n) ( A^{n−1} + c_1 A^{n−2} + · · · + c_{n−1} I )

Although this was the first published NC-algorithm for the inverse of an arbitrary invertible matrix, it is not currently used, since there are much better ones available7.

7Better in the sense of using fewer processors, and being more numerically stable.
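The inversion formula above is just the Cayley–Hamilton theorem solved for A⁻¹, and it is easy to check numerically. In the following sketch (sequential Python again) the coefficients c_i are taken from a library routine rather than from the parallel algorithm, and the test matrix is an arbitrary invertible example.

    import numpy as np

    A = np.array([[4.0, 1.0, 2.0],
                  [0.0, 3.0, 1.0],
                  [1.0, 0.0, 2.0]])
    n = A.shape[0]
    c = np.poly(A)[1:]        # coefficients c_1, ..., c_n of det(lambda*I - A)

    # A^{-1} = -(1/c_n) (A^{n-1} + c_1 A^{n-2} + ... + c_{n-1} I)
    A_inv = np.zeros_like(A)
    for k in range(n):                                   # term c_k A^{n-1-k}, with c_0 = 1
        coefficient = 1.0 if k == 0 else c[k - 1]
        A_inv += coefficient * np.linalg.matrix_power(A, n - 1 - k)
    A_inv *= -1.0 / c[-1]

    print(np.allclose(A_inv @ A, np.eye(n)))             # True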

1.6. Further reading. Many matrix algorithms make use of the so-called LU decomposition of a matrix. Given a square matrix M, the LU decomposition of M is a factorization

    M = LU

where L is a lower triangular matrix and U is an upper triangular matrix. (When M is Hermitian positive definite one can take U = L^H; this special case is known as the Cholesky decomposition.) In [41], Datta gives an NC algorithm for finding the Cholesky decomposition of a matrix. It isn't entirely practical, since it requires a parallel algorithm for the determinant (like that in § 1.5 above).
One topic we haven’t touched upon here is that of normal forms of matrices.
A normal form is a matrix-valued function of a matrix that determines (among
other things) whether two matrices are similar (see the definition of similarity of
matrices in 1.15 on page 141). If a given normal form of a matrix M is denoted
F ( M) (some other matrix), then two matrices M1 and M2 are similar if and only if
F(M₁) = F(M₂), exactly. There are a number of different normal forms of matrices, including the Smith normal form and the Jordan form.
Suppose M is an n × n matrix with eigenvalues (in increasing order) λ1 , . . . ,
λk . Then the Jordan normal form of M is the matrix
 
    J(M) =  ( Q_1                 )
            (      Q_2            )
            (           ·         )
            (             ·       )
            (               ·     )
            (                 Q_k )

(a block-diagonal matrix — all entries outside the diagonal blocks Q_1, . . . , Q_k are 0), where Q_i is an m_i × m_i matrix of the form

    Q_i =  ( λ_i                )
           (  1   λ_i           )
           (       ·    ·       )
           (         ·    ·     )
           (           1   λ_i  )

(λ_i on the diagonal, 1 on the subdiagonal, and 0 elsewhere), and the {m_i} are a sequence of positive integers that depend upon the matrix.
In [81], Kaltofen, Krishnamoorthy, and Saunders present parallel randomized
algorithms for computing these normal forms. Note that their algorithms must,
among other things, compute the eigenvalues of a matrix.
If we only want to know the largest eigenvalue of a matrix, we can use the
power method, very briefly described in example 1.60 on page 174. If we only want
the eigenvalues of a matrix, we can use the parallel algorithm developed by Kim
and Chronopoulos in [88]. This algorithm is particularly adapted to finding the
eigenvalues of sparse matrices. In [144], Sekiguchi, Sugihara, Hiraki, and Shimada
give an implementation of an algorithm for eigenvalues of a matrix on a particular
parallel computer (the Sigma-1).

2. The Discrete Fourier Transform


2.1. Background. Fourier Transforms are variations on the well-known
Fourier Series. A Fourier Series was traditionally defined as an expansion of
some function in a series of sines and cosines like:

    f(x) = ∑_{k=0}^{∞} ( a_k sin(kx) + b_k cos(kx) )

Since sines and cosines are periodic functions8 with period 2π, the expansion will also have this property. So any expansion of this type will only be valid if f(x) is a periodic function with the same period. It is easy to transform this series (by a simple scale-factor) to make the period equal to any desired value — we will stick to the basic case shown above. If they exist, such expansions have many applications:
• If f(x) is equal to the sum given above, it will be periodic — a wave of some sort — and we can regard the terms {a_k sin(kx), b_k cos(kx)} as the components of f(x) of various frequencies9. We can regard the expansion of f(x) into a Fourier series as a decomposition of it into its components of various frequencies. This has many applications to signal-processing, time-series analysis, etc.
• Fourier series are very useful in finding solutions of certain types of partial differential equations.
Suppose f(x) is the function equal to x when −π < x ≤ π, and periodic with period 2π (these two statements define f(x) completely). Then its Fourier series is:

(19)    f(x) = 2 ∑_{k=1}^{∞} (−1)^{k+1} sin(kx)/k
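One quick way to see the convergence behavior is to evaluate a few partial sums of equation (19) numerically; the figures referred to below show the same thing graphically. In the following sketch (Python; the numbers of terms and the sample points are arbitrary choices) the maximum error against f(x) = x over a few points of (−π/2, π/2] shrinks as more terms are added.

    import numpy as np

    def partial_sum(x, terms):
        """Partial sum of the Fourier series (19) for f(x) = x on (-pi, pi]."""
        total = np.zeros_like(x)
        for k in range(1, terms + 1):
            total += 2.0 * (-1) ** (k + 1) * np.sin(k * x) / k
        return total

    x = np.linspace(-np.pi / 2, np.pi / 2, 5)
    for terms in (5, 20, 80):
        print(terms, np.max(np.abs(partial_sum(x, terms) - x)))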

Figures 5.5 through 5.8 illustrate the convergence of this series when −π/2 <
x ≤ π/2. In each case a partial sum of the s