Parallel Computation Models
Lecture 3
Lecture 4
Slide 1
Parallel Computation Models
PRAM (parallel RAM)
Fixed Interconnection Network
bus, ring, mesh, hypercube, shuffle-exchange
Boolean Circuits
Combinatorial Circuits
BSP
LOGP
Slide 2
PARALLEL AND DISTRIBUTED
COMPUTATION
MANY INTERCONNECTED PROCESSORS WORKING CONCURRENTLY
P4
P5
P3
INTERCONNECTION
NETWORK
P2
P1
....
Pn
CONNECTION MACHINE
INTERNET
Connects all the computers of the world
Slide 3
TYPES OF MULTIPROCESSING FRAMEWORKS
PARALLEL
DISTRIBUTED
TECHNICAL ASPECTS
PARALLEL COMPUTERS (USUALLY) WORK IN TIGHT SYNCRONY, SHARE MEMORY TO A LARGE
EXTENT AND HAVE A VERY FAST AND RELIABLE COMMUNICATION MECHANISM BETWEEN THEM.
DISTRIBUTED COMPUTERS ARE MORE INDEPENDENT, COMMUNICATION IS LESS
FREQUENT AND LESS SYNCRONOUS, AND THE COOPERATION IS LIMITED.
PURPOSES
PARALLEL COMPUTERS COOPERATE TO SOLVE MORE EFFICIENTLY (POSSIBLY)
DIFFICULT PROBLEMS
DISTRIBUTED COMPUTERS HAVE INDIVIDUAL GOALS AND PRIVATE ACTIVITIES.
SOMETIME COMMUNICATIONS WITH OTHER ONES ARE NEEDED. (E. G. DISTRIBUTED DATA
BASE OPERATIONS).
PARALLEL COMPUTERS: COOPERATION IN A POSITIVE SENSE
DISTRIBUTED COMPUTERS: COOPERATION IN A NEGATIVE SENSE,
ONLY WHEN IT IS NECESSARY
Slide 4
FOR PARALLEL SYSTEMS
WE ARE INTERESTED TO SOLVE ANY PROBLEM IN PARALLEL
FOR DISTRIBUTED SYSTEMS
WE ARE INTERESTED TO SOLVE IN PARALLEL
PARTICULAR PROBLEMS ONLY, TYPICAL EXAMPLES ARE:
COMMUNICATION SERVICES
ROUTING
BROADCASTING
MAINTENANCE OF CONTROL STUCTURE
SPANNING TREE CONSTRUCTION
TOPOLOGY UPDATE
LEADER ELECTION
RESOURCE CONTROL ACTIVITIES
LOAD BALANCING
MANAGING GLOBAL DIRECTORIES
Slide 5
PARALLEL ALGORITHMS
WHICH MODEL OF COMPUTATION IS THE BETTER TO USE?
HOW MUCH TIME WE EXPECT TO SAVE USING A PARALLEL ALGORITHM?
HOW TO CONSTRUCT EFFICIENT ALGORITHMS?
MANY CONCEPTS OF THE COMPLEXITY THEORY MUST BE REVISITED
IS THE PARALLELISM A SOLUTION FOR HARD PROBLEMS?
ARE THERE PROBLEMS NOT ADMITTING AN EFFICIENT PARALLEL SOLUTION,
THAT IS INHERENTLY SEQUENTIAL PROBLEMS?
Slide 6
We need a model of computation
NETWORK (VLSI) MODEL
The processors are connected by a network of bounded degree.
No shared memory is available.
Several interconnection topologies.
Synchronous way of operating.
MESH CONNECTED ARRAY
degree = 4 (N)
diameter = 2N
Slide 7
HYPERCUBE
0110
0111
1110
0100
degree = 4 (log2N)
0000
0101
0010
1111
1100
0011
1101
1010
diameter = 4
1011
0001
1000
1001
N = 24 PROCESSORS
Slide 8
Other important topologies
binary trees
mesh of trees
cube connected cycles
In the network model a PARALLEL MACHINE is a very complex
ensemble of small interconnected units, performing elementary
operations.
- Each processor has its own memory.
- Processors work synchronously.
LIMITS OF THE MODEL
different topologies require different algorithms to solve the same
problem
it is difficult to describe and analyse algorithms (the migration of
data have to be described)
A shared-memory model is more suitable by an algorithmic point of view
Slide 9
Model Equivalence
given two models M1 and M2, and a problem
of size n
if M1 and M2 are equivalent then solving
requires:
T(n) time and P(n) processors on M1
T(n)O(1) time and P(n)O(1) processors on M2
Slide 10
PRAM
Parallel Random Access Machine
Shared-memory multiprocessor
unlimited number of processors, each
has unlimited local memory
knows its ID
able to access the shared memory
unlimited shared memory
Slide 11
PRAM MODEL
1
2
3
P1
P2
.
Pi
Common Memory
.
.
.
Pn
m
PRAM n RAM processors connected to a common memory of m cells
ASSUMPTION: at each time unit each Pi can read a memory cell, make an internal
computation and write another memory cell.
CONSEQUENCE: any pair of processor Pi Pj can communicate in constant time!
Pi writes the message in cell x at time t
Pi reads the message in cell x at time t+1
Slide 12
PRAM
Inputs/Outputs are placed in the shared
memory (designated address)
Memory cell stores an arbitrarily large
integer
Each instruction takes unit time
Instructions are synchronized across the
processors
Slide 13
PRAM Instruction Set
accumulator architecture
memory cell R0 accumulates results
multiply/divide instructions take only
constant operands
prevents generating exponentially large
numbers in polynomial time
Slide 14
PRAM Complexity Measures
for each individual processor
time: number of instructions executed
space: number of memory cells accessed
PRAM machine
time: time taken by the longest running processor
hardware: maximum number of active processors
Slide 15
Two Technical Issues for PRAM
How processors are activated
How shared memory is accessed
Slide 16
Processor Activation
P0 places the number of processors (p) in the
designated shared-memory cell
each active Pi, where i < p, starts executing
O(1) time to activate
all processors halt when P0 halts
Active processors explicitly activate additional
processors via FORK instructions
tree-like activation
O(log p) time to activate
Slide 17
THE PRAM IS A THEORETICAL (UNFEASIBLE) MODEL
The interconnection network between processors and memory would require
a very large amount of area .
The message-routing on the interconnection network would require time
proportional to network size (i. e. the assumption of a constant access time
to the memory is not realistic).
WHY THE PRAM IS A REFERENCE MODEL?
Algorithms designers can forget the communication problems and focus their
attention on the parallel computation only.
There exist algorithms simulating any PRAM algorithm on bounded degree
networks.
E. G. A PRAM algorithm requiring time T(n), can be simulated in a mesh of tree
in time T(n)log2n/loglogn, that is each step can be simulated with a slow-down
of log2n/loglogn.
Instead of design ad hoc algorithms for bounded degree networks, design more
general algorithms for the PRAM model and simulate them on a feasible network.
Slide 18
For the PRAM model there exists a well developed body of techniques
and methods to handle different classes of computational problems.
The discussion on parallel model of computation is still HOT
The actual trend:
COARSE-GRAINED MODELS
The degree of parallelism allowed is independent from the number
of processors.
The computation is divided in supersteps, each one includes
local computation
communication phase
syncronization phase
the study is still at the beginning!
Slide 19
Metrics
A measure of relative performance between a multiprocessor
system and a single processor system is the speed-up S( p),
defined as follows:
S( p) =
Execution time using a single processor system
Execution time using a multiprocessor with p processors
S( p) =
T1
Tp
Efficiency =
Sp
p
Cost = p Tp
Slide 20
Metrics
Parallel algorithm is cost-optimal:
parallel cost = sequential time
C p = T1
Ep = 100%
Critical when down-scaling:
parallel implementation may
become slower than sequential
T1 = n 3
Tp = n2.5 when p = n2
Cp = n4.5
Slide 21
Amdahls Law
f = fraction of the problem thats
inherently sequential
(1 f) = fraction thats parallel
Parallel time Tp:
Tp f (1 f ) p
1
Speedup with p processors: S p
1 f
f
p
Slide 22
Amdahls Law
Upper bound on speedup (p = )
1
Sp
1 f
f
p
Converges to 0
1
S
f
Example:
f = 2%
S = 1 / 0.02 = 50
Slide 23
PRAM
Too many interconnections gives problems with synchronization
However it is the best conceptual model for designing efficient
parallel algorithms
due to simplicity and possibility of simulating efficiently PRAM
algorithms on more realistic parallel architectures
Slide 24
Shared-Memory Access
Concurrent (C) means, many processors can do the operation simultaneously in
the same memory
Exclusive (E) not concurent
EREW (Exclusive Read Exclusive Write)
CREW (Concurrent Read Exclusive Write)
Many processors can read simultaneously
the same location, but only one can attempt to write to a given location
ERCW
CRCW
Many processors can write/read at/from the same memory location
Slide 25
Example CRCW-PRAM
Initially
table A contains values 0 and 1
output contains value 0
The program computes the Boolean OR of
A[1], A[2], A[3], A[4], A[5]
Slide 26
Example CREW-PRAM
Assume initially table A contains [0,0,0,0,0,1] and we
have the parallel program
Slide 27
Pascal triangle
PRAM CREW
Slide 28