Shared Versus Distributed Memory Multiprocessors*
Harry F. Jordan
ABSTRACT
The question of whether multiprocessors should have shared or distributed memory has
attracted a great deal of attention. Some researchers argue strongly for building distributed
memory machines, while others argue just as strongly for programming shared memory mul-
tiprocessors. A great deal of research is underway on both types of parallel systems. This
paper puts special emphasis on systems with a very large number of processors for computa-
tion intensive tasks and considers research and implementation trends. It appears that the two
types of system will likely converge to a common form for large scale multiprocessors.
*This work was supported in part by the National Aeronautics and Space Administration under NASA contract NAS1-18605 while
the author was in residence at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665, and in part by the National
Science Foundation under Grant NSF-GBT-17773.
What Are They?
The generic term parallel processor covers a wide variety of architectures, including
SIMD machines, data flow computers and systolic arrays. The issue of shared versus distri-
buted memory arises specifically in connection with MIMD computers or multiprocessors.
These are sometimes referred to simply as "parallel" computers to distinguish them from vec-
tor computers, but we prefer to be precise and call them multiprocessors to avoid confusion
with the generic use of the former word. Some similar sounding but different terms are often
used in a confusing way. Multiprocessors are computers capable of running multiple instruc-
tion streams simultaneously to cooperatively execute a single program. Multiprogramming is
the sharing of a computer by many independent jobs. They interact only through their
requests for the same resource. Multiprocessors can be used to multiprogram single stream
(sequential) programs. A process is a dynamic instance of an instruction stream. It is a com-
bination of code and process state, e.g. program counter and status word. Processes are also
called tasks, threads, or virtual processors. The term Multiprocessing can be ambiguous. It
is either:
a) Running a program (perhaps sequential) on a multiprocessor or
b) Running a program which consists of several cooperating processes.
The interest here is in the second meaning of multiprocessing. We want to gain high speed in
scientific computation by breaking the computation into pieces which are independent enough
to be performed in parallel using several processes running on separate hardware units but
cooperative enough that they solve a single problem.
There are two basic types of MIMD or multiprocessor architectures, commonly called
shared memory and distributed memory multiprocessors. Figure 1 shows block diagrams of
these two types, which are distinguished by the way in which values computed by one proces-
sor reach another processor. Since architectures may have mixtures of shared and private
memories, we use the term "fragmented" to indicate lack of any shared memory. Mixing
memories private to specific processors with shared memory in a system may well yield a
better architecture, but the issues can be discussed easily with respect to the two extremes,
fully shared memory and fragmented memory.
[Figure 1: Shared and distributed memory multiprocessors. The shared memory multiprocessor connects its CPUs through a switch to the memories (the "dance hall" architecture); the distributed memory multiprocessor pairs each CPU with a local memory and connects the pairs through a switch (the "boudoir" architecture).]

A few characteristics are commonly used to distinguish shared and fragmented memory
multiprocessors. Starting with shared memory machines, communication of data values
between processors is by way of memory, supported by hardware in the memory interface.
Interfacing many processors may lead to long and variable memory latency. Contributing to
the latency is the fact that collisions are possible among references to memory. As in unipro-
cessor systems with memory module interleaving, randomization of requests may be used to
reduce collisions. Distinguishing characteristics of fragmented memory rest on the fact that
communication is done in software by data transmission instructions, so that the machine
level instruction set has send/receive instructions as well as read/write. The long and variable
latency of the interconnection network is not associated with the memory and may be masked
by software which assembles and transmits long messages. Collisions of long messages are
not easily managed by randomization, so careful management of communications is used
instead. The key question of how data values produced by one processor reach another to be
used by it as operands is illustrated in Fig. 2.
The organizations of Fig. 1 and the transmission mechanisms of Fig. 2 lead to a broad
brush characterization of the differences in the appearance of the two types of architecture to
a user. A shared memory multiprocessor supports communication of data entirely by
hardware in the memory interface. It requires short and uniform latency for access to any
memory cell. The collisions which are inevitable when multiple processors access memory
can be reduced by randomizing the references, say by memory module interleaving. A frag-
mented memory switching network involves software in data communication by way of
explicit send and receive instructions. Data items are packed into large messages to mask
long and variable latency. Since messages are long, communications scheduling instead of
randomization is used to reduce collisions. To move an intermediate datum from its producer
to its consumer a fragmented memory machine ideally sends it to the consumer as soon as it
is produced, while a shared memory machine stores it in memory to be picked up by the con-
sumer when it is needed.
[Figure 2: Communication of data in multiprocessors. Part a) shared memory communication: the producing processor executes write(loc. A) and the consuming processor executes read(loc. A) through the shared memory switch. Part b) fragmented memory communication: processor X executes send(proc. Y) and processor Y executes receive(proc. X) through the communications switch.]
It can be seen from Fig. 1 that the switching network which communicates data among
processors occupies two different positions with respect to the classical, von Neumann, single
processor architecture. In shared memory, it occupies a position analogous to that of the
memory bus in a classical architecture. In the fragmented memory case, it is independent of
the processor to memory connection and more analogous to an I/O interface. The use of send
and receive instructions in the fragmented memory case also contributes to the similarity to
an I/O interface. This memory bus versus I/O channel nature of the position of the switching
network underlies the naive characterization of the differences between the two types of net-
work. A processor to memory interconnection network involves one word transfers with reli-
able transmission. The address (name) of the datum controls a circuit switched connection
with uniform access time to any location. Since a read has no knowledge of a previous write,
explicit synchronization is needed to control data sharing. In contrast, a processor to proces-
sor interconnection network supports large block transfers and error control protocols. Mes-
sage switching routes the information through the network on the basis of the receiving
processor’s name. Delivery time varies with the source and destination pair, and the
existence of a message at the receiver provides an implicit form of synchronization.
From the user’s perspective, there are two distinct naive programming models for the
two multiprocessor architectures. A fragmented memory machine requires mapping data
structures across processors and the communication of intermediate results using send and
receive. The data mapping must be available in a form which allows each processor to deter-
mine the destinations for intermediate results which it produces. Large message overhead
encourages the user to gather many data items for the same destination into long messages
before transmission. If many processors transmit simultaneously, the source/destination pairs
should be disjoint and not cause congestion on specific paths in the network. The user of a
shared memory machine sees a shared address space and explicit synchronization instructions
to maintain consistency of shared data. Synchronization can be based on program control
structures or associated with the data whose sharing is being synchronized. There is no rea-
son to aggregate intermediate results unless synchronization overhead is unusually large.
Large synchronization overhead leads to a programming style which uses one synchroniza-
tion to satisfy many write before read dependencies at once. Better performance can result
from avoiding memory “hot spots" by randomizing references so that no specific memory
module is referenced simultaneously by many processors.
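As a concrete illustration of the fragmented memory programming model just described, the following is a minimal sketch using MPI send and receive as a modern stand-in for machine level message instructions (this paper predates MPI); the block data mapping and the single boundary exchange are illustrative assumptions, not an example taken from the paper.

```c
/* Minimal sketch of the fragmented memory model: each processor owns a
 * block of the array (data mapping), computes its intermediate results,
 * and ships a boundary value to the neighbor that will consume it.
 * MPI is used only as a stand-in for machine level send/receive. */
#include <mpi.h>

#define N_LOCAL 1000                 /* size of each processor's block (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double a[N_LOCAL + 1];           /* local block plus one cell for a neighbor's value */
    MPI_Request req = MPI_REQUEST_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N_LOCAL; i++)          /* produce intermediate results */
        a[i] = (double)(rank * N_LOCAL + i);

    /* Producer initiated delivery: the producer must know its consumer
     * (binding by processor name) and sends as soon as the value exists. */
    if (rank + 1 < size)
        MPI_Isend(&a[N_LOCAL - 1], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &req);

    /* The arrival of the message is the implicit synchronization. */
    if (rank > 0)
        MPI_Recv(&a[N_LOCAL], 1, MPI_DOUBLE, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
```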
Why it Isn’t That Simple
The naive views of the hardware characteristics and programming styles for shared and
fragmented memory multiprocessors just presented are oversimplified for several reasons.
First, as already mentioned, shared and private memories can be mixed in a single architec-
ture, as shown in Fig. 3. This corresponds to real aspects of multiprocessor programs, where
some data is conceptually private to the processor doing an individual piece of work. The
program, while normally shared by processors, is read only for each and should be placed in a
private memory, if only for caching purposes. The stack generated by most compilers nor-
mally contains only private data and need not be in shared memory. In addition, analysis
done by many parallel compilers identifies some shared data as read only and thus cachable in
private memory. Some multiprocessors share memories among some, but not all, processors.
Examples are the PAX[1] and DIRMU[2] computers. These machines move intermediate
data by having its producer place it in the correct memory and its consumer retrieve it from
there. The transmission may be assisted by other processors if producer and consumer do not
share a memory.
[Figure 3: Shared plus private memory architecture. Notation: P - processor, M - memory, S - switch (high concurrency), K - controller, T - transducer (I/O device); each processor has a local memory and the processors are also joined through the high concurrency switch.]

Not only may a multiprocessor mix shared and private memories, but the same memory
structure may have different appearances when viewed at different system levels. An impor-
tant early multiprocessor was Cm*[3], built at Carnegie Mellon University. An abbreviated
block diagram of the architecture is shown in Fig. 4. Processors were attached by way of a
local bus to memories and possibly I/O devices to form computer modules. Several computer
modules were linked into a cluster by a cluster bus. Processors could access the memory of
other processors using the cluster bus. Processors in different clusters communicated through
interconnected mapping controllers, called K.maps. The name K.map and some of the
behavior of Cm* are easier to understand in light of the fact that the PDP-11 had a very small
physical address, so that address mapping was essential to accessing any large physical
memory, shared or not.

[Figure 4: Architecture of the Cm* multiprocessor. Computer modules, each a PDP-11 processor with memory and controller on a local bus, are joined by a cluster bus; a K.map connects each cluster bus to the K.maps of other clusters.]
Not only does Cm* illustrate a mixture of shared and fragmented memory ideas, but
there are three answers to the question of whether Cm* is a shared or fragmented memory
multiprocessor. At the microcode level in the K.map, there are explicit send and receive
instructions and message passing software, thus making the Cm* appear to be a fragmented
memory machine. At the PDP-11 instruction set level, the machine has shared memory.
There were no send and receive instructions, and any memory cell could be accessed by any
processor. The page containing the memory address had to be mapped into the processor's
address space, but as mentioned, this was a standard mechanism for the PDP-11. A third
answer to the question appeared at the level of programs running under an operating system.
Two operating systems were built for Cm*. The processes which these operating systems
supported were not allowed to share any memory. They communicated through operating
system calls to pass messages between processes. Thus at this level Cm* became a frag-
mented memory machine once more.
Taking the attitude that a machine architecture is characterized by its native instruction
set, we should call Cm* a shared memory machine. A litmus test for a fragmented memory
machine could be the existence of distinct send and receive instructions for data sharing in the
processor instruction set. The Cm* is an example of shared memory machines with non-
uniform memory access time, sometimes called NUMA machines. If access to a processor’s
local memory took one time unit, then access via the cluster bus required about three units
and access to memory in another cluster took about 20 units. Writing programs under either
operating system followed the programming paradigm for fragmented memory multiproces-
sors, with explicit send and receive of shared data, but performance concerns favored large
granularity cooperation less strongly than in a truly fragmented memory machine.
A more recent NUMA shared memory multiprocessor is the BBN Butterfly[4]. Refer-
ences to non-local memory take about three times as long as local references. The Butterfly
processor to memory interconnection network also contradicts the naive characterization of
shared memory switches. The network connecting N processors to N memories is a multi-
stage network with log_2 N stages, and thus (N/2) log_2 N individual links. It thus has a poten-
tially high concurrency, although collisions are possible when two memory references require
the same link. Read and write data are sent through the network as messages with a self rout-
ing header which establishes a circuit over which the data bits follow. Messages are pipe-
lined a few bits at a time, and long data packets of many words can use the circuit, once esta-
blished. Thus, although single word transfers are the norm, higher bandwidths can be
achieved by packing data into a multiword transmission. Messages attempting to reference a
memory which is in use, or colliding with others in the switch, fail and are retried by the pro-
cessor.
Finally, the naive view of the difference between implicit synchronization in fragmented
memory and the need for explicit synchronization with shared memory should be challenged.
A shared memory synchronization based on data rather than control structures is that of asyn-
chronous variables. Asynchronous variables have a state as well as a value. The state has
two values, usually called full and empty, which control access to the variable by two opera-
tions, produce and consume. Produce waits for the state to be empty, writes the variable with
a new value, and sets the state to full. Consume waits for the state to be full, reads the value,
and sets the state to empty. Both are atomic operations, or in general obey the serialization
principle. Void and copy operations are often supplied to initialize the state to empty, and to
wait for full, read and leave full, respectively. The HEP[5] and Cedar[6] computers sup-
ported these operations on memory cells in hardware.
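Where full/empty bits are not provided by hardware, the produce/consume discipline can be emulated in software. The sketch below uses POSIX threads; the type async_double and the helper names are illustrative only, not an interface from the HEP, Cedar, or any standard library.

```c
/* Minimal software emulation of an asynchronous (full/empty) variable. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    bool            full;                 /* the state: full or empty */
    double          value;
} async_double;

void void_state(async_double *v)          /* initialize the state to empty */
{
    pthread_mutex_init(&v->lock, NULL);
    pthread_cond_init(&v->changed, NULL);
    v->full = false;
}

void produce(async_double *v, double x)   /* wait for empty, write, set full */
{
    pthread_mutex_lock(&v->lock);
    while (v->full)
        pthread_cond_wait(&v->changed, &v->lock);
    v->value = x;
    v->full = true;
    pthread_cond_broadcast(&v->changed);
    pthread_mutex_unlock(&v->lock);
}

double consume(async_double *v)           /* wait for full, read, set empty */
{
    pthread_mutex_lock(&v->lock);
    while (!v->full)
        pthread_cond_wait(&v->changed, &v->lock);
    double x = v->value;
    v->full = false;
    pthread_cond_broadcast(&v->changed);
    pthread_mutex_unlock(&v->lock);
    return x;
}

double copy(async_double *v)              /* wait for full, read, leave full */
{
    pthread_mutex_lock(&v->lock);
    while (!v->full)
        pthread_cond_wait(&v->changed, &v->lock);
    double x = v->value;
    pthread_mutex_unlock(&v->lock);
    return x;
}
```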
When data is put in memory by one processor using produce and read by another using
consume, the transaction behaves like a one word message from producer to consumer, with
minor differences. The memory cell serves as a one word buffer, and may be occupied when
produce is attempted. The producer need not name the consumer; instead, both name a com-
mon item as when send and receive are linked to a common communications channel name.
Another difference is that one produce and multiple copys suffice to deliver the same datum
to multiple receivers.
Abstraction of Characteristics
The essence of the problem to be addressed by the switching network in both shared and
fragmented memory multiprocessors is the communication of data from a processor produc-
ing it to one which will use it. This process can slow parallel computation when either the
producer is delayed in transmitting or when the consumer is delayed in receiving. This pro-
cess can be abstracted in terms of four characteristics: initiation of transmission to the data’s
destination, synchronization of production and use of the data, binding of the data’s source to
its destination, and how transmission latency is dealt with. Table 1 summarizes these charac-
teristics and tabulates them for the traditional views of shared and fragmented memory mul-
tiprocessors.
The initiation of data delivery to its consumer is a key characteristic and influences the
others. Producer initiated delivery characterizes the programming model of fragmented
memory multiprocessors. It implies that the producer knows the identity of the consumer, so
that binding by processor name can be used, and provides the possibility of implicit synchron-
ization when the consumer is informed of the arrival of data. If a producer in a shared
memory multiprocessor were forced to write data into an asynchronous variable in a section
of memory uniquely associated with the consumer, the programming model would be much
the same as for fragmented memory. Consumer initiated access to data assumes a binding
where the identity of the data allows a determination of where it resides. Since the consumer
operation is decoupled from the data’s writing by its producer, explicit synchronization is
needed to guarantee validity. One can imagine a fragmented memory system in which part of
a data item’s address specifies its producer and a sharing protocol in which the consumer
sends a request message to the owner (producer) of a required operand. An interrupt could
cause the owner to satisfy the consumer's request, yielding a consumer initiated data
transmission. Such a fragmented memory system would be programmed like a shared
memory machine. Binding is by data name, and the consumer has no implicit way of know-
ing the data it requests has been written yet, so explicit synchronization is required.
Too many explicit synchronization mechanisms are possible to attempt a complete treat-
ment, and sufficient characterization for our purposes has already been given. Since message
delivery is less often thought of in terms of synchronization, Table 2 summarizes the types of
synchronization associated with message delivery. Different requirements are placed on the
operating or run-time system and different precedence constraints are imposed by the possible
combinations of blocking and non-blocking send and receive operations.
Characteristic    | Fragmented Memory              | Shared Memory
Initiation        | Producer                       | Consumer
Synchronization   | Implicit by message existence  | Explicit
Binding           | Processor name                 | Data name
Latency           | Masked by early send           | Consumer waits

Table 1: Data Sharing in Multiprocessors.

Message Synchronization                  | System Requirements                              | Precedence Constraints
Send: nonblocking, Receive: nonblocking  | Message buffering; fail return from receive      | None, unless message is received successfully
Send: nonblocking, Receive: blocking     | Message buffering; termination detection         | Actions before send precede those after receive
Send: blocking, Receive: nonblocking     | Termination detection; fail return from receive  | Actions before receive precede those after send
Send: blocking, Receive: blocking        | Termination detection                            | Actions before rendezvous precede ones after it in both processes

Table 2: Summary of the types of message synchronization.

Types of binding between producer and consumer in fragmented memory systems
include: source/destination pair, channel, and destination/type. In the case of
source/destination, the send operation names the destination and receive names the source. A
message can be broadcast, or sent to multiple receivers, but not received from multiple
sources. Source thus designates a single processor while destination might specify one or
more. Message delivery can also be through a “channel” or mailbox. In this case send and
receive are connected because both specify the same channel. A channel holds a sequence of
messages, limited by the channel capacity. To facilitate a receiver handling messages from
several sources, a sender can specify a "type" for the message and the receiver ask for the
next message of that type. The source is then not explicitly specified by the receiver but may
be supplied to it as part of the message. Binding in shared memory is normally by data loca-
tion, but note that the Linda[7] shared tuple memory uses content addressability, which is
somewhat like the "type" binding just mentioned.
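The destination/type style of binding survives in modern message passing interfaces. As a hedged sketch in MPI terms (not something the paper describes), the receiver names a message tag rather than a source and learns the actual source from the delivered message; the tag value and the gathering pattern below are illustrative assumptions.

```c
/* Destination/type binding: senders label a message with a "type" (an MPI
 * tag here); the receiver asks for the next message of that type from any
 * source, and the source is supplied with the message. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define RESULT_TAG 42

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        double item = 1.0 * rank;                 /* an intermediate result */
        MPI_Send(&item, 1, MPI_DOUBLE, 0, RESULT_TAG, MPI_COMM_WORLD);
    } else {
        for (int i = 1; i < size; i++) {          /* receive by type, not by source */
            double item;
            MPI_Status status;
            MPI_Recv(&item, 1, MPI_DOUBLE, MPI_ANY_SOURCE, RESULT_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %g from processor %d\n", item, status.MPI_SOURCE);
        }
    }
    MPI_Finalize();
    return 0;
}
```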
The problem of latency in sharing data and how it is dealt with is the most important
issue in the performance of multiprocessors. At the lowest level it is tied up with the latency
and concurrency of the switch. Two slightly different concepts should be distinguished. If T_s
is the time at which a send is issued in a message passing system and T_r is the time at which
the corresponding receive returns data, then the latency is T_L = T_r - T_s. The transmission
time for messages often has an initial startup overhead and a time per unit of information in
the message, of the form t_s + t_u k, where k is the number of units transmitted. The startup
time t_s is less than T_L, but is otherwise unrelated. In particular, if T_L is large, several mes-
sages can be sent before the first one is received. The granularity of data sharing is deter-
mined by the relationship of t_s to t_u. If t_s >> t_u, good performance dictates k >> 1, making
the granularity coarse. If t_s ~ t_u, the fine granularity case of k = 1 suffers little performance
degradation. Read and write in a shared memory switch must at least have small t_s so that
data transmissions with small k perform well. A fine granularity switch with small startup t_s
may still have a large latency T_L, and this is the concern of the fourth characteristic in
Table 1.
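For illustration only (the numbers are assumed, not taken from the paper): with a startup time t_s of 100 microseconds and a per word time t_u of 1 microsecond, a one word transfer costs 101 microseconds while a 1000 word message costs 1100 microseconds, about 1.1 microseconds per word, so efficiency demands k in the hundreds of words or more. With t_s of 2 microseconds and t_u of 1 microsecond, even k = 1 runs within a factor of three of the asymptotic rate, and fine granularity sharing is acceptable.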
Latency must grow with the number of processors in a system, if only because its physi-
cal size grows and signal transmission is limited by the speed of light. As the system size
grows, the key question is how the inevitable latency is dealt with. An architecture in which
latency does not slow down individual processors as the number of them increases is called
scalable. Scalability is a function both of how latency grows and how it is managed. Mes-
sage latency can be masked by overlapping it with useful computation. Figure 5 shows a
send/receive transaction in a fragmented memory system.

[Figure 5: Masking message latency by computation. In part a) the consumer computes work independent of the producer while the message is in the channel, so the latency is well masked; in part b) the consumer runs out of independent work and must wait for the message before receiving the intermediate result.]

In part a) message latency is
successfully overlapped by computation in the consumer whereas in part b) the consumer
does not have enough to do before needing the data in order to completely mask the latency.
In reference to Fig. 5, scalability is a function of how the program doing the sends and
receives is organized. The ratio of available overlapping computation to message latency
decreases as system size grows, both because latency grows and because computation is more
finely divided.
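A minimal sketch of the well masked case of Fig. 5 a), using nonblocking MPI operations to stand in for an early send and an early posted receive; the message length and the placeholder work routines are illustrative assumptions, and the sketch needs at least two processes to run.

```c
#include <mpi.h>

#define K 1024                                   /* message length (illustrative) */

/* placeholders for the producer's and consumer's real work */
static void produce_block(double *buf, int n)   { for (int i = 0; i < n; i++) buf[i] = i; }
static void independent_work(void)              { /* computation not needing the message */ }
static void use_block(const double *buf, int n) { (void)buf; (void)n; }

int main(int argc, char **argv)
{
    int rank;
    double buf[K];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                              /* producer */
        produce_block(buf, K);
        MPI_Isend(buf, K, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);   /* send early */
        independent_work();                       /* overlap latency with computation */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {                       /* consumer */
        MPI_Irecv(buf, K, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);   /* post receive early */
        independent_work();                       /* work independent of the producer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);        /* Fig. 5 b) is when this has to wait */
        use_block(buf, K);
    }

    MPI_Finalize();
    return 0;
}
```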
In shared memory multiprocessors the consumer initiation of access to data when
needed eliminates the possibility of arranging the program so that sends occur early enough to
mask latency. Latency can be managed in this case, as in the other, by reducing it or by
masking it with useful computation. Latency reduction in the shared memory hardware
regime is done by caching and latency masking by pipelining or multiprogramming. In the
naive view, scalability is a hardware concern in shared memory but more a function of pro-
gram structure in fragmented memory, leading to the notion of software scalability. Assum-
ing infinitely fast transmission, networks with P processors and a reasonable number of
switching nodes usually have latency on the order of log_m P, where m is the number of input
and output ports per switch node. If finite speed of signal transmission is an issue, latency is
proportional to the cube root of P for a system buildable in three dimensional space and to
the square root of P if messages occupy volume.
Concurrency of the switch also has an influence on latency. It must clearly have a con-
currency much greater than one for any multiprocessor with more than a very few processors.
Using a single bus for this switch is inadmissible in all but the smallest of systems. For scala-
bility, concurrency should grow linearly with the number of processors; otherwise the lack of
physical network paths will lead to long latencies when many processors use the switch
simultaneously. Even with order P links, collisions between messages can occur under
unfavorable access patterns. The way to control collisions is a function of granularity. In a
fine granularity network, randomization which distributes the small transactions uniformly
over the network is usually appropriate. With large granularity transactions, randomization is
less effective, and scheduling of the transactions may be required.
Thus the abstract differences between shared and fragmented memory multiprocessors
rest on the four characteristics of Table 1, with the selection of producer or consumer initia-
tion of data delivery having a strong influence on the other three. Consumer initiation is
naively associated with explicit synchronization, data name binding, and latency reduction.
Producer initiation suggests implicit synchronization, processor name binding, and latency
tolerance by executing sends early.
Convergence
The direction of current developments in shared and fragmented memory multiproces-
sors is generally toward convergence. The desire to write programs with a shared name space
for fragmented memory machines is supported both by research on virtual shared memory
using paging techniques and by automatic compiled or preprocessed generation of sends and
receives for remote data references. Multiprogramming the nodes of a fragmented memory
multiprocessor can also increase the amount of computation available to mask latency. Vir-
tual processors make use of the idea of parallel slackness, or the use of some of a problem's
inherent parallelism to control latency. In shared memory multiprocessors, considerable
work is being applied to multiprocessor caching, which distributes shared data among proces-
sors to reduce latency. Hardware cache management, software cachability analysis, and
correct placement and copying in NUMA machines have been considered. Much attention
has been given to fast, packet switched, multistage interconnection networks for use in the
processor to memory interface, and pipelining techniques have been applied to tolerate the
inevitably large latency of such networks connecting many processors.
Support for a shared name space on fragmented memory multiprocessors takes several
forms. Li[8] has considered using paging techniques to produce a shared memory address
space on a fragmented machine. If the paging is heavily supported by hardware, convergence
is easily seen between this work and the work on multiprocessor caching exemplified by [9].
Another approach uses program analysis to automatically generate the sends and receives
required to move data from its producer to its consumer. For regular access patterns, the user
can specify data mapping and a language like DINO[10] can generate message transmissions
to satisfy non-local references. When regular access patterns are generated by loops in
automatic parallelization of a sequential program[11], the more constrained structure allows
even more of the mapping and data movement to be generated automatically by the compiler.
Automating data mapping across distributed memories has a long history and might be
typified by the work of Berman and Snyder[12]. If access patterns are data dependent, as in
computations on machine generated grids, they may still be constant over long periods. It
may then be beneficial to bind addresses and generate data movement using a preproces-
sor[13] which acts at run-time, after data affecting addresses is known, but before the bulk of
the computation, which is often iterative, is carried out. Preprocessor work can thus be amor-
tized over many iterations with the same access pattern. Convergent work for shared memory
has taken place in connection with NUMA architectures. The BBN Butterfly provides sup-
port for placement and copying to reduce the penalty for long memory references. Software
places private data in the local memory of its processor and randomizes references to struc-
tures such as arrays over memory modules to avoid memory "hot spots"[14].
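The flavor of such randomization can be sketched as follows; the module count and the multiplicative hash are illustrative assumptions, not the Butterfly's actual mapping.

```c
#include <stdint.h>
#include <stdio.h>

#define N_MODULES 64          /* number of memory modules (illustrative) */

/* Blocked mapping: consecutive words of an array land in the same module,
 * so processors sweeping the array in step all hit one module at a time. */
static unsigned module_blocked(uint64_t addr, uint64_t array_len)
{
    uint64_t block = (array_len + N_MODULES - 1) / N_MODULES;
    return (unsigned)(addr / block);
}

/* Randomized (hashed) mapping: a multiplicative hash scatters consecutive
 * addresses across modules so that no single module becomes a hot spot.
 * The shift keeps the top 6 bits, giving a module number in 0..63. */
static unsigned module_hashed(uint64_t addr)
{
    return (unsigned)((addr * 0x9E3779B97F4A7C15ULL) >> 58);
}

int main(void)
{
    uint64_t len = 1 << 20;
    for (uint64_t i = 0; i < 8; i++)
        printf("addr %llu: blocked -> module %u, hashed -> module %u\n",
               (unsigned long long)i, module_blocked(i, len), module_hashed(i));
    return 0;
}
```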
Finally, convergence in latency hiding techniques is seen between the use of virtual pro-
cessors in fragmented memory and pipelining in shared memory multiprocessors. If we
attempt to use consumer initiation in fragmented memory by interrupting the owner of a
datum with a request for transmission, we see a behavior like that of Fig. 6 a). In order to
make use of the long wait resulting from consumer initiation of the delivery, the processor
executing the consumer process can be switched to another process, as shown in Fig. 6 b). If
the process is associated with a different program, we have the standard technique of masking
latency by multiprogramming, which is used in masking disk latency in virtual memory sys-
tems. If the extra process is associated with the same parallel program, we have a partly time
multiplexed form of multiprocessing often characterized by the name virtual processors. The
use of virtual processors to enhance performance has recently been most frequently discussed
in relation to an SIMD architecture, the Connection Machine[15], where it is important for
masking latency arising from several different sources. If each processor of a fragmented
memory multiprocessor time multiplexes several processes so that message latency in the
communication network is overlapped with useful computation, a time snapshot of message
traffic and processor activity might appear as in Fig. 7.
An early use of multiprocessing to mask memory latency, as opposed to I/O latency, was
in the peripheral processors of the CDC 6600[16]. Ten slow peripheral processor memories
were accommodated by time multiplexing ten virtual processors on a single set of fast proces-
sor logic. Process contexts were switched on a minor cycle basis. Later, the Denelcor HEP
used fine grained multiprocessing to mask latency in a shared, pipelined data memory. The
concept of pipelined multiprocessing is illustrated in Fig. 8. Round robin issuing of a set of
process states into the unified pipeline is done on a minor cycle basis. Processes making
memory references are queued separately to be returned to the execution queue when
satisfied. Pipeline interlocks are largely unnecessary since instructions which occupy the
pipeline simultaneously come from different processes and can only depend on each other
through explicit synchronization.
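The effect of round robin issue can be seen in a toy simulation; the context count, memory latency, and reference frequency below are illustrative assumptions, not HEP or CDC 6600 parameters.

```c
#include <stdio.h>

/* Toy model of pipelined multiprocessing: N_PROC process contexts are
 * issued round robin, one instruction per cycle.  Every MEM_EVERY-th
 * instruction of a process is a memory reference that parks the process
 * for MEM_LATENCY cycles.  With enough contexts the issue slot is rarely
 * idle, i.e. memory latency is masked. */
#define N_PROC      16
#define MEM_LATENCY 20        /* cycles until a memory reference is satisfied */
#define MEM_EVERY    4        /* one instruction in four references memory */
#define CYCLES  100000

int main(void)
{
    int  ready_at[N_PROC] = {0};   /* cycle at which each process can issue again */
    int  count[N_PROC]    = {0};   /* instructions issued per process */
    long issued = 0, idle = 0;

    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int found = 0;
        for (int i = 0; i < N_PROC && !found; i++) {   /* round robin scan */
            int p = (cycle + i) % N_PROC;
            if (ready_at[p] <= cycle) {
                found = 1;
                issued++;
                count[p]++;
                if (count[p] % MEM_EVERY == 0)         /* memory reference: park it */
                    ready_at[p] = cycle + MEM_LATENCY;
                else
                    ready_at[p] = cycle + 1;
            }
        }
        if (!found)
            idle++;
    }
    printf("utilization with %d contexts: %.1f%% (idle cycles: %ld)\n",
           N_PROC, 100.0 * issued / CYCLES, idle);
    return 0;
}
```

With these assumed numbers, each context wants to issue about four instructions every 23 cycles, so 16 contexts oversubscribe the single issue slot and utilization approaches 100 percent; with only a few contexts the issue slot sits idle most of the time.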
[Figure 6: Consumer initiated transmission in a fragmented memory system. In part a) the consumer requests the intermediate result and waits while the request and reply messages traverse the channel, so the latency results in a consumer wait. In part b) the latency is masked by multiprocessing: while process A's request is outstanding, the consumer's processor runs process B.]
[Figure 7: Masking message transmission with multiprogramming. Each of four processors holds an active queue of runnable processes and a wait queue of processes with outstanding sends or receives; messages between the waiting processes travel through the switch nodes while other processes compute.]
[Figure 8: One execution unit of a pipelined multiprocessor. Memories connect through a pipelined switch to a pipelined CPU; a process queue feeds the execution pipeline, and processes with outstanding memory references wait in a separate memory reference queue.]
For latency to be masked by satisfying requests at a higher rate than processor-memory
latency would imply, many requests must be in progress simultaneously. This implies a pipe-
lined switch between processors and memory, and possibly pipelining the memory also.
Pipelining and single word access together imply a low overhead, message switched network.
Variable traffic in the shared network requires a completion report for each transaction,
regardless of whether it is a read or write. Whether the memory modules themselves are
pipelined or not depends on the ratio of the module response time to the step time of the pipe-
lined switch. If the memory module responds significantly slower than the switching network
can deliver requests, memory mapping and address decoding are obvious places to use pipe-
lining within the memory itself. Figure 9, which bears an intentional resemblance to Fig. 7,
shows an activity snapshot in a system built of multiple pipelined multiprocessors which
mask the latency of multiple read and write operations in the processor to memory switch.
Convergence can also be seen in switching network research. Packet switched processor
to memory interconnections such as that in the NYU Ultracomputer[17] bear a strong resem-
blance to communication networks used in message passing distributed memory computers.
Previously, the store and forward style of contention resolution was only seen in communica-
tions networks carrying information packets much larger than one memory word. There is
also a strong resemblance between the "cut-through routing"[18] recently introduced in frag-
mented memory multiprocessors and the previously mentioned header switched connections
made by messages in the BBN Butterfly shared memory switch.

[Figure 9: Pipelined multiprocessors in a shared memory multiprocessor system. Each of four pipelined processors keeps a queue of running processes; read and write transactions from many of these processes are simultaneously in flight through the switch nodes to the memory modules.]
Conclusions
The question of what one concludes from all this is really a question of what one is led
to predict for the future of multiprocessors. The predictions can be formulated as the answers
to three questions: What will be the programming model and style for multiprocessors? How
will the system architecture support this model of computation? What will be the split
between hardware and software in contributing to this system architecture?
The programmer will surely reference a global name space. This feature corresponds
too closely to the way we formulate problems, and too much progress has been made toward
supporting it on widely different multiprocessor architectures, for us to give it up. It also
seems that most synchronization will be data based rather than control based. Associating the
synchronization with the objects whose consistency it is supposed to preserve is more direct
and less error prone than associating it with the control flow of one or more processes. Pro-
grams will have more parallelism than the number of physical processors in the multiproces-
sor expected to run them, with the extra parallelism being used to mask latency.
Multiprocessor architecture will consist of many processors connected to many
memories. A portion of the memory will be globally interconnected by way of a high con-
currency switch. The switch will have a latency which scales as log_m P for moderate speed
systems, with m probably greater than two. For the highest speed systems, the latency will
scale as the square root of P. Multiprocessors will use a Harvard architecture, separating the program
memory from data memory to take advantage of its very different access patterns. Data
memory private to each processor will be used to store the stack, other process private data
and copies of read only shared data. Only truly shared data will reside in the shared memory.
A combination of software and hardware techniques will be used to mask the latency
inherent in data sharing. Compiler analysis will be the main mechanism for determining what
data is truly shared. It may even generate code to dynamically migrate data into private
memories for a long program section during which it is not shared. The hardware will time
multiplex (pipeline) multiple processes on each processor at a very fine granularity in order to
support latency masking by multiplexed computation. Some of the multiprocessor cache
research may find use in partially supporting the data migration with hardware, but a
knowledge of reference patterns is so important to data sharing that it is unlikely that the
hardware will forego the increasingly effective assistance available from the compiler.
In short, the hardware, assisted by the compiler, of multiprocessor systems can do much
more than we currently ask of it. Moving software mechanisms into hardware produces a
significant performance gain, and should be done when a mechanism is well understood, pro-
ven effective and of reasonably low complexity. Finally, although automatic parallelization
has been poorly treated in this paper, it is perhaps possible to say that, in spite of the excellent
work done in turning sequential programs into parallel ones, a user should not take great
pains in a new program to artificially sequentialize naturally parallel operations so that they
can be done on a computer.
REFERENCES
[1] T. Hoshino, "An invitation to the world of PAX," IEEE Computer, V. 19, pp. 68-79 (May 1986).
[2] W. Haendler, E. Maehle and K. Wirl, "DIRMU multiprocessor configurations," Proc. 1985 Int'l Conf. on Parallel Processing, pp. 652-656 (Aug. 1985).
[3] E.F. Gehringer, D.P. Siewiorek and Z. Segall, Parallel Processing: The Cm* Experience, Digital Press, Billerica, MA (1987).
[4] R.H. Thomas, "Behavior of the Butterfly parallel processor in the presence of memory hot spots," Proc. of the 1986 Int'l Conf. on Parallel Processing, pp. 46-50 (Aug. 1986).
[5] J.S. Kowalik, Ed., Parallel MIMD Computation: The HEP Supercomputer and its Applications, MIT Press (1985).
[6] D. Gajski et al., "Cedar," Proc. Compcon, pp. 306-309 (Spring 1989).
[7] S. Ahuja, N. Carriero and D. Gelernter, "Linda and friends," IEEE Computer, V. 19, pp. 26-34 (1986).
[8] K. Li, Shared Virtual Memory on Loosely Coupled Multiprocessors, Ph.D. Thesis, Yale Univ., New Haven, CT (Sept. 1986).
[9] J.-L. Baer and W.-H. Wang, "Multilevel cache hierarchies: Organizations, protocols, and performance," J. Parallel and Distributed Computing, V. 6, No. 3, pp. 451-476 (June 1989).
[10] M. Rosing, R.W. Schnabel and R.P. Weaver, "Expressing complex parallel algorithms in DINO," Proc. 4th Conf. on Hypercubes, Concurrent Computers & Applications, pp. 553-560 (1989).
[11] D. Callahan and K. Kennedy, "Compiling programs for distributed-memory multiprocessors," J. of Supercomputing, V. 2, pp. 131-169 (1988).
[12] F. Berman and L. Snyder, "On mapping parallel algorithms into parallel architectures," Proc. 1984 Int'l Conf. on Parallel Processing, pp. 307-309 (1984).
[13] J. Saltz, K. Crowley, R. Mirchandaney and H. Berryman, "Run-time scheduling and execution of loops on message passing machines," J. Parallel and Distributed Computing, V. 8, pp. 303-312 (1990).
[14] R. Rettberg and R. Thomas, "Contention is no obstacle to shared-memory multiprocessing," Communications of the ACM, V. 29, No. 12, pp. 1202-1212 (Dec. 1986).
[15] L.W. Tucker and G.G. Robertson, "Architecture and applications of the Connection Machine," Computer, V. 21, pp. 26-38 (Aug. 1988).
[16] J.E. Thornton, Design of a Computer: The Control Data 6600, Scott, Foresman and Co., Glenview, IL (1970).
[17] A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph and M. Snir, "The NYU Ultracomputer—Designing an MIMD shared memory parallel computer," IEEE Trans. on Computers, V. C-32, No. 2, pp. 175-189 (Feb. 1983).
[18] W.J. Dally and C.L. Seitz, "The torus routing chip," Distributed Computing, V. 1, pp. 187-196 (1986).