Advanced Computer Architecture
Parallel Processing
KAUSHIK BANERJEE
IEM, CSE
Advanced Computer Architecture
Pipeline Processing
Architecture Classification
Vector Processing
Vector Processor Architecture
Memory Interleaving
Network Properties
MIMD and Multiprocessors
Parallel Processing
Introduction
Need is ahead of availability: we keep trying to get more
performance out of computer systems, and hence we turn to
advanced computer architecture.
Computer Architecture vs Computer Organization:
Architecture: Structural behaviour of the various functional
modules of the computer and how they interact to provide the
processing needs of the workload.
Organization: The way the h/w components are connected
together to form a computer system.
Parallel Processing
Background
Earliest computing machines worked on fixed programs, as in
calculators. A computer used to be meant for a particular type
of job. For example, a desk calculator can perform arithmetic,
but not word processing or gaming.
Reprogramming would mean making flowcharts, paperwork,
documentation and even rewiring of the system.
Parallel Processing
Paradigm Change
John von Neumann, in 1944-45, wrote a paper in which he came up with a
scheme to treat a program as data and store it in memory. This is called the
stored program concept.
[Diagram: the von Neumann organization, with Memory connected to the Control Unit and the Arithmetic Logic Unit, plus Input and Output units.]
The above is a uniprocessor or, in other words, a scalar design.
Parallel Processing
Problems: As memory is much slower than the ALU, it causes wait states in
the latter, throttling the communication between memory and ALU.
This problem is known as the von Neumann bottleneck.
In order to improve performance, the superscalar design was introduced.
How to improve performance
Rely on faster circuits: Cost per circuit increases with circuit speed and at
some point cost/performance ratio becomes unfavourable
Introduce concurrency
Replicate resources
Do more per cycle
Parallel Processing
Superscalar Design (as CPU is getting cheaper)
More instructions issued at the same time.
Multiple processing units.
The parallel instructions must be independent.
Disadvantage:
• Overhead of controlling
• Burden to programmer
• Hard to have more than 5 parallel issue lines (no point in increasing the number of
processing units above 5)
Parallel Processing
In superscalar design, resources are replicated as shown below.
[Diagram: resource replication, with four Control Units, four Arithmetic & Logic Units and four Memory Units, contrasting the uniprocessor with the multiprocessor.]
This superscalar design is flexible, but it is difficult to segregate the statements
that are independent, so as to make them work in parallel.
Parallel Processing
A CISC or RISC scalar processor can be improved with a superscalar or vector
architecture. Scalar processors are those executing one instruction per cycle. Only one
instruction is issued per cycle and only one completion of instruction is expected from
the pipeline per cycle.
In a superscalar processor, multiple instruction pipelines are used. This implies that
multiple instructions are issued per cycle and multiple results are generated per
cycle. A vector processor executes vector instructions on arrays of data. Thus each
instruction involves a string of repeated operations, which are ideal for pipelining
with one result per cycle.
Superscalar processors are designed to exploit more instruction level parallelism in
user programs. Only independent instructions can be executed in parallel without
causing a wait state. The amount of instruction level parallelism varies widely
depending on the type of code being executed.
It has been observed that the average value is around 2 for code without loop
unrolling. Therefore, for these codes there is not much benefit gained from building
a machine that can issue more than three instructions per cycle
Parallel Processing
Pipelines
Instruction Pipeline
[Diagram: a four-stage instruction pipeline (Fetch, Decode, Execute, Write Back) shown for four instructions overlapping in time.]
Parallel Processing
Figure: A superscalar processor of degree m=3
[Diagram: three Fetch-Decode-Execute-Write Back pipelines operating in parallel.]
Parallel Processing
Disadvantages of superscalar design are:
The programmer is burdened with designing the program in
parallel executable parts.
It is very hard to have more than 5 parallel lines; so there is
no point in increasing number of processing units over 5.
Some representative superscalar processors: IBM RS/6000,
DEC21064, Intel i960CA
The above architecture shows replication to the extreme.
Very flexible, but costly
Do we need all this flexibility?
How about part replication?
Parallel Processing
Arithmetic Pipeline
Pipelined addition: One of the six stages of the addition of a
pair of elements is performed at each stage in the pipeline.
Each stage of the pipeline has a separate arithmetic unit
designed for the operation to be performed at that stage.
Once stage A has been completed for the first pair of elements,
these elements can be moved to the next stage B while the
second pair of elements moves to the first stage A
Arithmetic Pipeline for Floating Point Addition
The stages of adding two floating point numbers, for example,
0.1234 x 10^5 + 0.5678 x 10^4
1) The exponents are compared: 10^5 > 10^4.
2) Since 5 - 4 = 1, the mantissa of the lesser, that is the second
number, is shifted right by 1 digit and 1 is added to its
exponent. It becomes 0.05678 x 10^5.
3) Now that the exponents of the two numbers are equal, the
mantissas are added: 0.1234 + 0.05678 = 0.18018.
4) The exponent of the result is 5. So, the final result is
0.18018 x 10^5.
5) The result is normalised if required.
6) Check for exceptions, such as overflow.
7) Rounding off, if required.
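These stages can be mirrored in software. Below is a minimal C sketch of the same align-add-normalise sequence, using a decimal mantissa/exponent pair as in the example above; the struct, the helper name fp_add and the 4-digit rounding are assumptions made only for illustration.

#include <math.h>
#include <stdio.h>

/* A floating-point value kept as mantissa * 10^exponent, with 0.1 <= |mantissa| < 1. */
struct fp { double mant; int exp; };

struct fp fp_add(struct fp a, struct fp b) {
    /* Stages 1-2: compare exponents and align the smaller operand. */
    if (a.exp < b.exp) { struct fp t = a; a = b; b = t; }
    b.mant = b.mant / pow(10, a.exp - b.exp);      /* shift mantissa right */
    /* Stage 3: add the aligned mantissas. */
    struct fp r = { a.mant + b.mant, a.exp };
    /* Stage 4: normalise so that 0.1 <= |mantissa| < 1. */
    while (fabs(r.mant) >= 1.0)                    { r.mant /= 10.0; r.exp++; }
    while (r.mant != 0.0 && fabs(r.mant) < 0.1)    { r.mant *= 10.0; r.exp--; }
    /* Stages 5-7: exception check and rounding (here simply to 4 digits). */
    r.mant = round(r.mant * 1e4) / 1e4;
    return r;
}

int main(void) {
    struct fp a = {0.1234, 5}, b = {0.5678, 4};
    struct fp c = fp_add(a, b);                    /* 0.18018 x 10^5, rounded to 0.1802 x 10^5 */
    printf("%.4f x 10^%d\n", c.mant, c.exp);
    return 0;
}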
Arithmetic Pipeline for Floating Point Addition
How this is achieved:
The processing unit is designed in such a way that all stages of
the computations are separate. When one pair, say the first pair
of data have gone through Stage 1 (and it enters Stage 2), the
second pair of data can enter Stage 1.
Similarly, when the first pair of data has completed Stage 2 and
entered Stage 3, the second pair of data, having completed
Stage 1, can enter Stage 2. At the same time, the third pair of
data can enter Stage 1.
This chain goes on.
Arithmetic Pipeline for Floating Point Addition
For example, at a certain point of time, the following take place
simultaneously:
Stage 6 of pair 1 (x1, y1)
Stage 5 of pair 2 (x2, y2)
Stage 4 of pair 3 (x3, y3)
Stage 3 of pair 4 (x4, y4)
Stage 2 of pair 5 (x5, y5)
Stage 1 of pair 6 (x6, y6)
Arithmetic Pipeline for Floating Point Addition
Cycle 1:  Stage 1: x1+y1;  Stages 2-6: idle
In the beginning, in the very first cycle, when (x1+y1) has started, it is in Stage 1. At
that time, remaining stages are idle.
Arithmetic Pipeline for Floating Point Addition
Cycle 2:  Stage 1: x2+y2;  Stage 2: x1+y1;  Stages 3-6: idle
In the second cycle, when (x1+y1) completed Stage 1 and it enters Stage 2, the next
pair of data (x2+y2) enters Stage 1. At this time, the Stages 3 – 6 are idle.
Arithmetic Pipeline for Floating Point Addition
Cycle 3:  Stage 1: x3+y3;  Stage 2: x2+y2;  Stage 3: x1+y1;  Stages 4-6: idle
Likewise, the pipeline keeps filling; after 6 cycles all 6 stages are occupied.
Arithmetic Pipeline for Floating Point Addition
Cycle 4:  Stage 1: x4+y4;  Stage 2: x3+y3;  Stage 3: x2+y2;  Stage 4: x1+y1;  Stages 5-6: idle
Likewise, the pipeline keeps filling; after 6 cycles all 6 stages are occupied.
Arithmetic Pipeline for Floating Point Addition
Cycle 5:  Stage 1: x5+y5;  Stage 2: x4+y4;  Stage 3: x3+y3;  Stage 4: x2+y2;  Stage 5: x1+y1;  Stage 6: idle
Likewise, the pipeline keeps filling; after 6 cycles all 6 stages are occupied.
Arithmetic Pipeline for Floating Point Addition
Cycle 6:  Stage 1: x6+y6;  Stage 2: x5+y5;  Stage 3: x4+y4;  Stage 4: x3+y3;  Stage 5: x2+y2;  Stage 6: x1+y1
Thus, in 6 cycles, all the 6 stages of the pipeline are filled up.
Architecture Categories
Architecture Categories
Classification/ Taxonomy by Flynn
Michael Flynn’s classification of Computer Architecture is based
on instruction and data streams
                      Single Instruction    Multiple Instruction
Single Data Stream          SISD                   MISD
Multiple Data Stream        SIMD                   MIMD
Flynn’s Classification
[Figures: Flynn’s classification.]
Flynn’s Classification - SISD
Now we will discuss the models SISD, SIMD and MIMD in detail.
MISD however, is hardly used for practical purposes. So, the above basic definition
will suffice for our curriculum.
(I) SISD
[Diagram: SISD organization. The Control Unit receives an instruction stream from the Memory Unit and sends an instruction stream to the Processing Unit; the data stream flows between the Processing Unit, the Memory Unit and Input/Output.]
This is SISD, the conventional architecture.
Flynn’s Classification - SISD
Instructions are fetched from the Memory Unit and sent to the Control
Unit. The control unit, after generating proper timing signals,
sends them to the processing unit.
Data required for processing are made available in either or
both of the following ways:
Fetched from input device through the control unit
Fetched from the memory unit if it is already stored there
However, parallelism can be introduced in SISD by the use of
pipelining of instructions.
Flynn’s Classification - SIMD
(II) SIMD: Single Instruction over a multiple data stream
In this architecture, many processing units are under the supervision of a
single control unit.
All processors receive the same instruction from the control unit but
operate on different items of data
Needed: shared memory with multiple interleaved modules in order to
communicate simultaneously with multiple processes
In this kind of scenario, we can use SIMD.
So, in SIMD,
Many processing units are under the supervision of a common control unit.
All processing units receive the same instruction from the control unit but
operate on different items of data.
Flynn’s Classification - SIMD
Practical scenario
Part of architecture design is about understanding application needs.
Many applications have the following nature of processing:
c = 125 Scalar (not repeated)
For i=0 to 999
a(i) = b(i) + c Vector (repeated)
or huge matrix manipulation like multiplication, inversion, etc.
The above indicates same application over many tuples of data, mostly independent
across iterations. That is, the instruction step
a(2) = b(2) + c
can be executed independent of the instruction step
a(1) = b(1) + c
Flynn’s Classification - SIMD
[Diagram: SIMD organization. A common Control Unit with Main Memory drives an array of Processing Elements; each Processing Element contains a micro control unit, registers and an Arithmetic & Logic Unit, and has its own local Memory Unit.]
Flynn’s Classification - SIMD
The principles behind SIMD are:
Replicate data path, not control
All PEs work in tandem
Common CU coordinates operations
Operation model of SIMD can be represented by a 5-tuple:
M = {N, C, I, M, R}
N: Number of processing elements (PE) in the machine, e.g., ILLIAC has 64 PEs,
Connection Machine’s CM-2 has 65,536 PEs (designed by Danny Hillis of Thinking
Machines)
C: Set of instructions directly executed by the control unit (CU), including scalar and
program flow instructions.
I: Set of instructions broadcast by the CU to all PEs for parallel execution. These
include arithmetic, logic, data routing, masking and other local operations executed
by each active PE over local data within that PE.
Flynn’s Classification - SIMD
M: Set of masking schemes, where each mask partitions the set of PEs into enabled
and disabled subsets.
R: Set of data routing functions, specifying various patterns to be set up in the
interconnection network for the inter-PE communications.
Flynn’s Classification - SIMD
Almost all SIMD machines built today are based on the distributed
memory model.
• PEs communicate among themselves through routing network
• Array of processing elements are controlled by the common array control
unit
• Program and data are loaded into the control memory through the host
computer
• Instruction decoded by the control unit
• Scalar instructions are passed on to the scalar processor directly
• Vector instruction is broadcast to the PEs for parallel execution
• Partitioned data sets are distributed to the local memory units
• Data routing network is under program control through the control unit
• PEs are synchronized by the hardware of the control unit
• Same instruction is executed by all the PEs in the same cycle. However,
PEs can be selectively disabled by masking instruction
SIMD – Distributed Memory
[Diagram: SIMD distributed-memory model. PE: Processing Element.]
SIMD – Shared Memory
(b) Shared memory model:
SIMD – Shared Memory
Note: m is chosen relatively prime with respect to n so that
parallel memory access can be achieved through skewing
without conflicts.
An example would be Burroughs Scientific Processor (BSP) with
n=16 and m=17.
[Diagram: skewing example with n = 3 PEs and m = 4 memory modules. PEs n1, n2, n3 handle elements {1,4,7}, {2,5,8}, {3,6,9}; modules m1, m2, m3, m4 hold elements {1,5,9,13,17}, {2,6,10,14,18}, {3,7,11,15,19}, {4,8,12,16,20}.]
SIMD – Processing Element
Internal structure of a Processing Element
We will give example of ILLIAC IV
processor
A, B, C: Working registers
D: Address register for holding the
address of the PE
I: Index register for internal usage
S: Status flag register. If the value is
zero then the PE is disabled for the
current instruction. Value of 1 means
selected.
R: Data routing register for sending
data to and receiving data from other
PEs in the network.
Parallel Processing - SIMD
Matrix data manipulation
What is crucial: Distributing the vector elements among the PEMs. Ideally, N
elements of a vector are retrieved from N different PEMs simultaneously. If
the matrix is one dimensional, then try to use one memory location of each
PEM.
If there is a matrix with n rows and m columns, each PEM has to occupy
either row data or column data. It depends on the values of m and n and the
number of PEs.
For example, A[i, j] is stored in i-th memory location of j-th PEM.
[Diagram: distribution of a matrix over the PEMs. Column j of the matrix is placed in the memory of PE j (PE0 ... PEn); row i occupies the i-th memory location (0 ... m-1) of each PEM.]
Vector Processing
Focus:
Large volumes of data or Series of data to be processed
Same operation will be performed over a string of data
elements
A vector instruction involves large arrays of operands .
Classification:
(A) Pipelined vector machines using a few powerful processors equipped
with vector hardware: vector supercomputers.
(B) SIMD machine model.
Vector Data
Vector Data is specified by the following parameters
(A) Base Address: From which memory location the data starts
(B) Length: The number of elements in the vector data
(C) Stride: the address gap between two consecutive elements
For example, a vector data item may
• Start at base memory location 1000
• Have a stride of 5. Thus, the subsequent elements will be stored at
addresses 1005, 1010, 1015, 1020 and so on
• Have a length of 100. So, the last element will be at memory location 1000 + 99 x 5 = 1495.
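As a quick illustration (not from the slides), the address of every element follows directly from these three parameters. The C sketch below assumes word addressing and reproduces the numbers of the example above.

#include <stdio.h>

/* Address of the i-th element of a vector operand described by (base, stride, length). */
long element_address(long base, long stride, long i) {
    return base + i * stride;
}

int main(void) {
    long base = 1000, stride = 5, length = 100;
    printf("%ld %ld %ld\n",
           element_address(base, stride, 0),            /* 1000 */
           element_address(base, stride, 1),            /* 1005 */
           element_address(base, stride, length - 1));  /* 1495 */
    return 0;
}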
SIMD – Vector Processing
Array Processors
SIMD can be array processors or associative processors. Associative
processors are content addressable where array elements can be
accessed concurrently, but are of high cost. Array processors use
RAM where arrays of data are to be accessed sequentially.
A synchronous array of parallel processors is called an array
processor; it consists of multiple processing elements under
the supervision of one control unit.
An array processor can handle single instruction and multiple data
stream (SIMD).
SIMD machines are especially designed to perform ‘vector’
computations over matrices or arrays of data.
Vector Instruction
Vector Instruction
A vector is a set of scalar data items, e.g., A[1..1000] of integer.
We can represent such a series by a vector, e.g., V1, V2 or V3,
etc.
A vector processor is one that can compute operations on
entire vectors with one simple instruction.
A vector compiler will attempt to translate loops into single
vector instructions.
Example - Suppose we have the following do loop:
4 DO 5 I = 1, N
5 X(I) = Y(I) + Z(I)
6 CONTINUE
Vector Instruction
This will be translated into one long vector of length n and a vector add
instruction will be executed.
The assembly vector instruction will be something like
VADD Y,Z,X
------------------------------------------------------------------------------------
Why is this more efficient ?
(1) Instruction is fetched and decoded only once. Thus, memory bandwidth
and the control unit overhead are reduced considerably
(2) The vector processor, after receiving the instruction, will be told that it
must fetch x amount of pairs of operands.
These operands will have a set pattern of arrangement in memory
Vector Instruction
The vector instructions can be of the following types:
Format
V1 op V2 → V3 : Vector binary operation: Operation between two vectors to
produce another vector. Example will be adding each element of an array
(V1) with the corresponding element of another array (V2) to produce
elements of a third array (V3)
V1 op S → V2 : Vector scaling operation: Operation of each element of a
vector (V1) with a common scalar quantity (S) to produce elements of
another vector (V2). Example will be multiplying each element of an array by
a constant number.
V1 op V2 → S : Vector binary reduction: Operation between two vectors
to produce a scalar quantity. Example will be the dot product of two arrays.
Vector Instruction
M(1:n) → V1 : Vector load: copying data elements from memory to the
vector operand.
V1 → M(1:n) : Vector store: copying data elements from the vector
operand to memory.
Uop V1 → V2 : Unary vector operation: Applying some operation
(e.g., negation) to each element of a vector (V1) to produce another
vector (V2).
Uop V1 → S : Unary vector reduction: Applying some operation
(e.g., Average / Maximum / Sum) to all elements of a vector (V1) to
produce a scalar (S).
Vector Instruction
Advantage of vector instructions: they effectively do the loop unrolling.
That means operation of all the elements of a vector can be
carried out simultaneously and independent of each other.
There is no need to do memory access after operation on each
element to get the next element. In vector processing, a series
of elements can be fetched together from the memory and
operated without waiting for each other.
Vector Instruction
Pseudo vector instructions
Vector Load/Store
Vector load/store instructions provide the ability to do strided
and scatter/gather memory accesses, which take data elements
distributed throughout memory and pack them into sequential
vectors/streams placed in vector/stream registers. This
promotes data locality. It results in less data pollution, since
only useful data is loaded from the memory system. It provides
latency tolerance because there can be many simultaneous
outstanding memory accesses. Vector instructions such as VLD
and VST provide this capability.
Vector Instruction
As illustrated in figure 3,
VLD V0, (V1) loads memory elements into V0 specified by the
memory addresses (indices) stored in V1.
VST (V1), V0 works in a similar manner.
VST (S1), 4, V0, stores the contents of V0 starting at the address
contained in S1 with stride of 4.
Vector Instruction
(1) Vector-Vector
Format: Vj op Vk → Vi
Example: V0 = V1 x V2 ;  V3 = V2 + V4
[Diagram: registers Vj and Vk feed a Functional Unit or Pipeline whose result goes to register Vi; one pipeline for each type of operation.]
Vector Instruction
(2) Vector-Scalar
Format: Vj op Sk → Vi
Example: V0 = V1 x 2 ;  V3 = V2 + 0.5
[Diagram: register Vj and scalar register Sk feed a Functional Unit or Pipeline whose result goes to register Vi; one pipeline for each type of operation.]
Vector Instruction
(3) Vector-Memory
Format: Load: M(1:n) → Vi ;  Store: Vi → M(1:n)
[Diagram: vector load moves data from Main Memory into register Vi; vector store moves Vi back to Main Memory.]
Vector Instruction
(4) Vector Reduction
Format: Uop Vj → Si
Example: S = ∑ Vj[i]  (sum of all elements)
[Diagram: register Vj feeds a Functional Unit or Pipeline producing scalar register Si; one pipeline for each type of operation.]
Vector Instruction
(5) Vector Binary Reduction
Format: Vj op Vk → Si
Example: S = ∑ Vj[i] * Vk[i]  (dot product)
[Diagram: registers Vj and Vk feed a Functional Unit or Pipeline producing scalar register Si; one pipeline for each type of operation.]
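For comparison, a scalar processor would compute this reduction with an explicit loop. The C sketch below (illustrative only) shows the loop that a single vector binary-reduction instruction replaces.

/* Scalar equivalent of the vector binary reduction Vj op Vk -> Si:
   the dot product of two length-n vectors, one element per iteration. */
double dot_product(const double *vj, const double *vk, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += vj[i] * vk[i];
    return s;
}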
Vector Instruction
(6) Gather : Fetches the non-zero elements of a sparse vector using
indices that themselves are indexed.
[Diagram: gather operation; the base address and the index vector are given, and the packed data register is to be filled.]
Vector Instruction
A0 = Starting memory address
V0 = Indices of the non-zero elements w.r.t. the base address in memory
VL = Length of the vector
V1 = Data register
The i-th location of V1 will get the value of the (Base + V0[i])-th memory location:
V1[i] ← Memory[A0 + V0[i]]
Example: For i = 0, V0[0] = 4; so the (Base + 4)-th, i.e., 104th memory
location will be copied to V1[0].
For i = 1, V0[1] = 2; so the (Base + 2)-th, i.e., 102nd memory location will
be copied to V1[1].
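A small C sketch of the gather step described above. The base address A0 = 100 and the index values follow the slide's example; the memory contents (600 and 400) and the array sizes are assumed purely for illustration.

#include <stdio.h>

int main(void) {
    int memory[200] = {0};
    memory[104] = 600;                 /* non-zero elements sitting in memory */
    memory[102] = 400;
    int A0 = 100;                      /* base address                        */
    int V0[] = {4, 2};                 /* indices of the non-zero elements    */
    int VL = 2;                        /* vector length                       */
    int V1[2];                         /* data register                       */

    for (int i = 0; i < VL; i++)
        V1[i] = memory[A0 + V0[i]];    /* V1[0] <- M[104], V1[1] <- M[102]    */

    printf("%d %d\n", V1[0], V1[1]);   /* prints 600 400 */
    return 0;
}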
Vector Instruction
(7) Scatter
It stores a vector into memory as a sparse vector whose non-zero
entries are indexed.
The structure is the same as for gather; the data assignment is in the
opposite direction:
Memory[A0 + V0[i]] ← V1[i]
Example: Entry 0 of V1 is copied to memory location 100 + V0[0], i.e., 104.
So, M[104] ← V1[0], i.e., 600.
Entry 1 of V1 is copied to memory location 100 + V0[1], i.e., 102.
So, M[102] ← V1[1].
Vector Instruction
Note
Gathering is done before a vector operation to fetch the
scattered data element into a register
Scattering is done after a vector operation to put back the
result into memory locations.
Vector Instruction
(8) Masking
Step 1. Elements of V0 are scanned and the mask register VM is populated accordingly (1 where the element is non-zero, 0 where it is zero).
Step 2. V1 is then populated from VM with the indices of the non-zero elements.
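A C sketch of the two masking steps; the register length of 8 and the test values are assumptions made for illustration.

#include <stdio.h>

#define N 8

int main(void) {
    double V0[N] = {0.0, 3.5, 0.0, 0.0, 7.1, 2.2, 0.0, 9.9};
    int VM[N];                           /* mask register                 */
    int V1[N];                           /* indices of non-zero elements  */
    int count = 0;

    for (int i = 0; i < N; i++)          /* Step 1: build the mask        */
        VM[i] = (V0[i] != 0.0);

    for (int i = 0; i < N; i++)          /* Step 2: compress the indices  */
        if (VM[i])
            V1[count++] = i;

    for (int i = 0; i < count; i++)
        printf("%d ", V1[i]);            /* prints 1 4 5 7 */
    printf("\n");
    return 0;
}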
Vector Processing
Summary
Same operation is performed over a stream of data
Instruction fetch and decode takes place only once
A vector instruction involves a large array of operands
Vector processor is a co-processor specially designed to
perform vector computation
Vector operand: Series of data on which same
operation will be carried out
Vector Processing
Summary
Vector computation: Same operation in all elements in
the series
Vector operands can be in memory or in registers.
Registers are faster but limited . Memory is slower but
unlimited.
Uses interleaved memory for parallel fetch
The elements of vector operands are fetched from memory and
also written back to the memory simultaneously.
Vector Processing & Architecture
Vector Processing
Vector Processing
Vector Functional Pipelining
For example, consider the following:
v0 = v1 * v2
v3 = v2 + v4
The Multiplication and Addition can be carried on in parallel, as they are not
dependent on each other.
Vector Processing
Arithmetic Pipelining
Pipelined addition: One of the six stages of the floating point addition for a pair of
elements is performed at each stage of the pipeline.
The stages are, for addition:
1. Compare Exponent
2. Shift Mantissa of the element with lower exponent
3. Add Mantissas
4. Normalize
Each stage of the pipeline has a separate arithmetic unit designed for the operation
to be performed at that stage.
Once stage A has been completed, the elements with the intermediate results can
be moved to the next stage B, while the second pair of elements enter stage A.
Vector Processing
Vector Chaining
For example, consider the following:
v0 = v1 * v2
v3 = v0 + v4
Element-wise: V0[i] = V1[i] * V2[i] and V3[i] = V0[i] + V4[i], for i = 0, 1, 2, 3.
After V0 is computed for the first pair of V1 and V2, it can be passed on to the add
pipeline.
During this process, there will be a point when operands from V1 and V2 still need to
be fetched and sent to the pipeline, and results placed into V0 are just leaving the
pipeline. The process of vector chaining will send the result from V0 directly to the
pipelined adder (at the same time it is stored in the vector register), and combined
with the appropriate value from V4.
Thus the second instruction will be able to begin before the first is finished and the
machine creates 2 results as opposed to 1. This is very similar to the MIPS pipeline
and forwarding. The result of this is approximately 3 times the peak performance
Vector Processing
Time:  tau     2 tau   3 tau   4 tau   5 tau   6 tau   7 tau   8 tau
Step
A      x1+y1   x2+y2   x3+y3   x4+y4   x5+y5   x6+y6   x7+y7   x8+y8
B              x1+y1   x2+y2   x3+y3   x4+y4   x5+y5   x6+y6   x7+y7
C                      x1+y1   x2+y2   x3+y3   x4+y4   x5+y5   x6+y6
D                              x1+y1   x2+y2   x3+y3   x4+y4   x5+y5
E                                      x1+y1   x2+y2   x3+y3   x4+y4
F                                              x1+y1   x2+y2   x3+y3
Some vector architectures provide greater efficiency by allowing
the output of one pipeline to be chained directly into another
pipeline. This feature is called chaining and eliminates the need to
store the result of the first pipeline before sending it into the
second pipeline. Figure 5 demonstrates the use of chaining in the
computation of a saxpy vector operation: a*x + y,
where x and y are vectors and a is a scalar constant.
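In scalar form, the saxpy operation referred to above is just the following loop (a minimal C sketch; on a chained vector machine the multiply and add pipelines would process these elements back to back).

/* saxpy: result[i] = a * x[i] + y[i] for vectors x, y of length n and scalar a. */
void saxpy(int n, double a, const double *x, const double *y, double *result) {
    for (int i = 0; i < n; i++)
        result[i] = a * x[i] + y[i];
}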
Vector Processing
Observe that it still takes 6*tau units of time to complete the
sum of the first pair of elements, but that the sum of the next
pair is ready in only tau more units of time. And this pattern
continues for each succeeding pair. This means that the time,
Tp, to do the pipelined addition of two vectors of length n is
Tp = 6*tau + (n-1)*tau = (n + 5)*tau.
Each of the remaining (n-1) results is churned out in one additional unit of time.
The first 6*tau units of time are required to fill the pipeline and
to obtain the first result. After the last result, xn + yn, is
completed, the pipeline is emptied out or flushed.
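These two formulas are easy to tabulate. The C sketch below (tau and the loop bounds are arbitrary choices) prints Ts = 6*n*tau, Tp = (n+5)*tau and their ratio, which approaches 6 for large n.

#include <stdio.h>

int main(void) {
    double tau = 1.0;                        /* one stage delay (arbitrary unit) */
    int stages = 6;
    for (int n = 1; n <= 1000; n *= 10) {
        double Ts = stages * n * tau;        /* serial time                      */
        double Tp = (n + stages - 1) * tau;  /* pipelined time = (n + 5) * tau   */
        printf("n=%4d  Ts=%7.0f  Tp=%7.0f  speedup=%.2f\n", n, Ts, Tp, Ts / Tp);
    }
    return 0;
}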
Vector Processing
Comparing the equations for Ts and Tp, it is clear that
(n + 5)*tau < 6*n*tau, for n > 1.
Thus, this pipelined version of addition is faster than the serial version by almost a
factor of the number of stages in the pipeline. This is an example of what makes
vector processing more efficient than scalar processing. For large n, the pipelined
addition for this sample pipeline is about six times faster than scalar addition.
In this discussion, we have assumed that the floating-point addition requires six
stages and takes 6*tau units of time. There is nothing magic about this number 6; in
fact, for some architectures, the number of stages in a floating-point addition may be
more or less than six. Further, the individual stages may be quite different from the
ones listed in section on pipelined addition. The operations at each stage of a
pipeline for floating-point multiplication are slightly different than those for addition;
a multiplication pipeline may even have a different number of stages than an
addition pipeline. There may also be pipelines for integer operations.
Vector Processing
Vector Chaining used to compute a*x + y
Vector Processing
Vector chaining used to compute a scalar value a times a vector
x, adding the elements of the resultant vector to the elements of
a second vector y (of the same length).
Chaining can double the number of floating-point operations
that are done in tau units of time. Once both the multiplication
and addition pipelines have been filled, one floating-point
multiplication and one floating-point addition (a total of two
floating-point operations) are completed every tau time units.
Conceptually, it is possible to chain more than two functional
units together, providing an even greater speedup. However
this is rarely (if ever) done due to difficult timing problems.
Vector Processor Architecture or Vector Super Computer
Vector Processing can be done in two ways
SIMD machine model
Pipelined vector machines using a few powerful processors equipped with
vector hardware. These are called vector supercomputers or the vector
processor architecture.
We have discussed SIMD before. Now we discuss Vector processor
architecture or Vector Super Computer
While vector processing can be done by the SIMD architecture by utilizing spatial
distribution of processing elements, there is another advanced architecture
which implements parallel processing by means of functional pipelining.
Vector Architecture
Let us consider two vector operands A and B. These are each a one-dimensional
array of equal size. We want to compute the dot product of these two:
C = A1B1 + A2B2 + A3B3 + …
Vector Architecture
This architecture takes advantage of the fact that each
arithmetic operation has multiple stages. Say,
multiplication has 5 stages and addition has 4 stages.
Each multiplication will go through the 5 stages
of the multiplier pipeline. When A1 and B1 are in
Stage 2, A2 and B2 will be in Stage 1.
At some point of time, the scenario will look like this:
A1 and B1 in Stage 5, A2 and B2 in Stage 4, A3 and B3 in
Stage 3,A4 and B4 in Stage 2, A5 and B5 in Stage 1,
A6 and B6 are yet to enter the pipeline.
Vector Architecture
Next, the product A1*B1 leaves the multiply pipeline
and enters the add pipeline. It gets added to the
previous value of the add pipeline, which is, say, C. We
assume that C has been initialized to 0. So, the
addition C = 0 + A1*B1 will start now. When the first
addition, i.e., C = A1*B1, comes out of the add
pipeline, it will be fed back into the add pipeline.
Simultaneously, A5*B5 has come out of the multiply
pipeline. Now, the addition of A1*B1 + A5*B5 will
start. In the next cycle, the addition of A2*B2 and A6*B6
will start, and so on.
Vector Architecture
[Diagram: vector supercomputer organization. A Scalar Processor is attached to the system; a Host Computer, together with Input/Output and Mass Storage, loads programs and data.]
Vector Architecture
As shown in the diagram, the vector processor is attached to the
scalar processor as an optional feature.
Program and data are first loaded into the main memory through a
host computer.
Instruction decoding is done by the Scalar Control Unit.
Depending on whether the operation is scalar or vector, the
decoded instruction will be sent to the scalar or the vector processor,
respectively.
Vector instruction will be sent to the vector control unit (of the
vector processor)
The vector control unit will supervise the flow of vector data
between the main memory and the vector functional pipelines
Vector Architecture
Vector Processing Models
(A) Register to Register Model (as shown in the diagram)
Vector Registers hold the operands, intermediate and
final vector results.
Vector functional pipelines retrieve operands from and
put results into the vector registers.
Vector registers are programmable
Vector registers can be of fixed length or reconfigurable
(later example: Fujitsu VP2000 series)
Vector Architecture
Vector Processing Models (Continued)
(B) Memory to Memory model
Uses a vector stream unit in place of vector registers
Vector operands and results are directly retrieved from
the main memory in superwords (e.g., 512 bits, as in the
Cyber 205).
Vector Architecture
A comparison
Memory to Memory: Slower but can support very long
length of vector operands. (suitable for longer vectors)
Register to Register: Faster, but limited length of vector
operands (suitable for smaller vectors)
Vector Architecture
A brief history
Pipelined vector supercomputers started with uniprocessor
models, e.g., Cray 1 (1976).
Recent supercomputer systems offer both uniprocessor
and multiprocessor models, e.g., the Cray Y-MP series.
Most high-end mainframes offer multiprocessor models
with add-on vector hardware, e.g., VAX9000, IBM 390/VF.
Vectorization
Benefit and How it operates
A FORTRAN example
5  DO 10 I=1,100
10   Z(I) = X(I) * Y(I)
15 CONTINUE
(gets compiled by a vectorizing compiler)
3 cycles to fetch instruction, decode instruction & fetch operand
1 cycle to execute
For 100 instructions
On Scalar: (3+1) * 100 = 400 cycles are required
On Vector: 3 + 1 * 100 = 103 cycles are required
Vectorization
I. First, load the vector registers with X and Y from memory, simultaneously.
II. One set of elements is loaded in a single clock cycle.
III. Finally, the computed values are taken from the result register
and put into Z in memory.
Vector compiler translates the code into something like
VLOAD X VR1
VLOAD Y VR2
VMULT VR1 VR2 VR3
VSTORE VR3 Z
Vector Architecture
Summary
The vector machine is faster at performing mathematical
operations on larger vectors than the scalar or MIPS
(Microprocessor without Interlocked Pipeline Stages)
machines.
The vector processing computer’s vector register architecture
makes it better able to compute vast amounts of data quickly.
Conclusion
Not widely popular today but still used in home PCs as SIMD
units which augment the scalar CPU when necessary. (usually
multimedia applications)
Vector Architecture
Implementation examples
To change the brightness or contrast of an image, a value is
added to or subtracted from each R,G,B set of data.
Limitations
One should have separate and longer registers for vector
processing.
Saving a register by reusing scalar registers might create
contention between SIMD and normal floating point
instructions.
Vector Architecture
Basic Vector Architecture
[Diagram: basic vector architecture; 64-element vector registers connected to a Functional Unit.]
Vector Processing
Note: Very Important
(A) Loop-structured program code is translated to vectorized
assembly code using a vectorizing compiler.
A well vectorized code can easily achieve a speed up of 10-20
times.
Vectorization rate: How much vectorization has been done to the
code.
(B) Assembly Language Instruction Set must support vector
operation.
(C) Of course, the hardware must have a vector processor.
Vector Processing & Processor
Summary
Same operation is performed over a stream of data
Instruction fetch and decode takes place only once
A vector instruction involves a large array of operands
Vector processor is a co-processor specially designed to
perform vector computation
Vector operand: Series of data on which same
operation will be carried out
Vector Processing & Processor
Note
For register based vector processors,
Long vectors exceeding the register length n must be segmented
to fit the vector registers, n elements at a time.
e.g., if the vector length is > 10 but <= 20, and the vector register length = 10,
then the vector operation will have to be performed 2 times.
For example, if the number of vector elements = 15, then:
the first time, 10 elements are copied to the register and processed;
the second time, the remaining 5 elements are copied to the register and
processed.
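This segmentation is often called strip-mining. A C sketch, assuming a register length of 10 and an element-wise add purely for illustration:

#define MVL 10   /* vector register length (assumed) */

/* Process a vector of length n in segments of at most MVL elements. */
void vector_add_stripmined(int n, const double *b, double c, double *a) {
    for (int start = 0; start < n; start += MVL) {
        int len = (n - start < MVL) ? (n - start) : MVL;  /* 10, then 5 for n = 15 */
        /* one "vector instruction" operates on this segment */
        for (int i = 0; i < len; i++)
            a[start + i] = b[start + i] + c;
    }
}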
Vector Access Memory Structure
Note
Flow of vector operands between memory and vector registers
is pipelined with multiple access paths.
Vector Operand specification
Location of vector operand in memory must be specified by Base
Address, Stride and Length.
If row elements are stored continuously with unit stride, the
column elements will be stored with stride of n. (Diagonal
elements will be spaced at stride n+1)
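In C terms, for a row-major n x n matrix this is simply the following (a sketch; one word per element is assumed):

/* Address of A[i][j] for an n x n matrix stored row-major starting at 'base'. */
long addr(long base, long n, long i, long j) { return base + i * n + j; }
/* consecutive row elements:      addr(b,n,i,j+1)   - addr(b,n,i,j) = 1     */
/* consecutive column elements:   addr(b,n,i+1,j)   - addr(b,n,i,j) = n     */
/* consecutive diagonal elements: addr(b,n,i+1,j+1) - addr(b,n,i,j) = n + 1 */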
Vector Access Memory Structure
Note
Arrangement of vector in memory can be row major or column
major.
The arrangement of the vector data in memory should be
interleaved to allow pipelined or parallel access
Since each vector register has fixed number of components, only
a segment of vector data from memory can be loaded per cycle.
Memory Interleaving
Memory Interleaving
Memory is not desired to be monolithic, but a collection of
modules.
Arrangement of vector data in memory should be interleaved
to allow pipeline/parallel access. Flow of vector operands
between memory and vector registers is pipelined with
parallel access path.
Same word address is applied to all memory modules
simultaneously.
Memory Interleaving
Memory is not desired to be monolithic, but a collection of
modules.
Same word address is applied to all memory modules
simultaneously.
Example:
Word address: 5 bits; Module address: 3 bits;
Memory Interleaving
Word address: 5 bits; Module number: 3 bits
So, number of modules = 2^3 = 8
Number of words in each module = 2^5 = 32
10000 000 → 16th word of module 0
10000 111 → 16th word of module 7
Memory Interleaving
M0: 0, 8, 16, 24, 32, 40, 48, 56, 64, 72, …, 240, 248
M1: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, …, 241, 249
M2: 2, 10, 18, 26, 34, 42, 50, 58, 66, 74, …, 242, 250
…
M7: 7, 15, 23, 31, 39, 47, 55, 63, 71, 79, …, 247, 255
Each module has its own memory data buffer (MDB).
8 modules with 32 words in each.
Memory Interleaving
Two kinds of memory interleaving: Low order and High order
Consider a memory bank of 256MB:
M0 M1 M2 M3
64 MB 64 MB 64 MB 64 MB
It consists of 4 memory modules, 64MB each.
256 MB = 2^28 bytes; so, we need a total of 28 address lines to access
this memory bank.
We will split these address lines into two parts, one for
selecting the module and the other for addressing a word
within a module.
Memory Interleaving
Each memory module is of size 64 MB = 2^26 bytes.
Hence, in order to access the memory words within a module, we need
26 address lines:
A25 – A0 : Word select (within a module)
The remaining bits, i.e., A27 – A26 (2 bits), will be used to select
the module:
A27 – A26 : Module select
This is called High order memory interleaving.
Memory Interleaving
The other arrangement is
A27 – A02 : Word Select (within a module)
A01 – A00 : Module Select
This is called Low order memory interleaving.
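A C sketch of the two address splits for this 256 MB / 4-module example; the function names and the test address are illustrative assumptions.

#include <stdio.h>

#define MODULE_BITS 2
#define WORD_BITS   26

void high_order(unsigned long a, unsigned *module, unsigned long *word) {
    *module = a >> WORD_BITS;                    /* A27-A26 select the module */
    *word   = a & ((1UL << WORD_BITS) - 1);      /* A25-A0 select the word    */
}

void low_order(unsigned long a, unsigned *module, unsigned long *word) {
    *module = a & ((1UL << MODULE_BITS) - 1);    /* A1-A0 select the module   */
    *word   = a >> MODULE_BITS;                  /* A27-A2 select the word    */
}

int main(void) {
    unsigned m; unsigned long w;
    high_order(67108864UL, &m, &w);  printf("high: module %u, word %lu\n", m, w); /* module 1 */
    low_order(67108864UL, &m, &w);   printf("low:  module %u, word %lu\n", m, w); /* module 0 */
    return 0;
}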
Memory Interleaving
HIGH ORDER MEMORY INTERLEAVING
[Diagram: high-order interleaving; module M0 covers byte addresses 0 to 67108863, M1 covers 67108864 to 134217727, and so on.]
Memory Interleaving
High Order Interleaving:
Higher bits (MSB) are used for selecting a memory module and
lower bits (LSB) are used to address word in a module.
Address arrangement in high order (addresses in MB):
Module   From address   To address
0              0             63
1             64            127
2            128            191
3            192            255
Memory Interleaving
In high order interleaving, consecutive words are stored
in the same module, except at the module boundaries. For
example, 0-63 MB are stored in M0, 64-127 MB in M1,
128-191 MB in M2 and 192-255 MB in M3.
Usefulness
High order interleaving is useful in cases where multiple
processing elements are accessing different zones of the
memory at the same time.
Memory Interleaving
LOW ORDER MEMORY INTERLEAVING
Memory Interleaving
Low Order Interleaving: Higher bits (MSB) are used
for addressing word in a module and lower bits (LSB) are
used to select memory module.
Address arrangement in low order
Module Word Address
0 0 4 8 12 ……….. 252
1 1 5 9 13 ……….. 253
2 2 6 10 14 ……….. 254
3 3 7 11 15 ……….. 255
Memory Interleaving
In low order interleaving, consecutive words are stored in
different modules. That is, location n and location n+1 will
be in different modules. Location 0 will be in M0, location
1 will be in M1, location 2 in M2 and location 3 will be
stored in M3. Again, location 4 will be in M0, location 5 in
M1, location 6 in M2 and location 7 will be stored in M3
and so on.
Therefore, the vector stride within a module will be equal to the
number of modules.
Memory Interleaving
Usefulness
Low order interleaving is useful in case of vector
operands where vector elements need to be fetched
simultaneously from consecutive memory locations. If
memory were organized in a contiguous (high-order) manner,
such access would make very poor use of the interleaved
modules, in fact not using the interleaving at all.
Memory Interleaving
How it works
Word addresses are asserted simultaneously on all
modules. However, module addresses have to be
asserted one by one.
Suppose we are fetching n-th word of each module.
The value of n is placed on the word select address lines.
We have to select a module. We will start from module 0.
First, 000 is placed on the module select lines and
thereby module 000 gets selected.
Memory Interleaving
So, module 0 starts accessing word n.
Before the word is accessed from module 0, the system can
access the same word from the next module, i.e., module 1.
So, module 1 starts accessing word n.
Before the access is over, module address 2 is placed on the
module select lines and the system initiates access of the n-
th word of module 2.
When array processing is required, low order interleaving is
used for pipelining. As the consecutive memory locations
are stored in different modules, they can be accessed in
parallel to be consumed by different processing units.
Memory Interleaving
For example, for operations on an array A, the elements
A[0], A[1], A[2]….will be stored in different modules, i.e.,
M0, M1, M2…respectively. Therefore, they can be
accessed in a pipelined fashion and loaded into vector
registers or local PE memories simultaneously
Memory Interleaving
Some Sample Applications
It is important to keep some actual applications in mind,
such as:
• Gaussian elimination. Recall that a typical operation is
to add a multiple of one row of our matrix to another row
of the matrix.
• Matrix multiplication. Recall that to do AB = C for
matrices A, B and C, we get Cij by taking the inner
product (“dot product”) of A’s ith row and B’s jth column.
Memory Interleaving
• Image processing, such as in Adobe Photoshop or the
open-source GIMP. Here the (i,j) element of our matrix
would be the brightness at pixel (i,j). If we wish our image
to look smoother, then at each pixel we would average the
brightness at neighbouring points, and make this average
value our new brightness at the given point.
Memory Interleaving
Categories of interleaving
Interleaving is of the following kinds:
• Simultaneous Access (S-Access)
• Concurrent Access (C-Access) and
• Concurrent-Simultaneous Access (C/S Access)
Memory Interleaving
Simultaneous Access (S-Access) memory organization
Memory Interleaving
This is the simplest technique.
A single access returns m consecutive words of information from
the m memory modules.
All the modules are accessed simultaneously.
Using low order m address bits, the information from a
particular module can be accessed.
A data latch is associated with each module. For a fetch
operation, the information from each module is gated
into its latch, whereupon the multiplexer can be used to
direct the desired data to the single word bus.
Memory Interleaving
The memory access timeline is as depicted below:
Memory access time = T (the same for all modules)
Latch delay = τ
Total time for k consecutive words = T + k * τ
Memory Interleaving
Concurrent Access (C-Access) memory organization
Memory Interleaving
Concurrent Access (C-Access) memory organization
[Diagram: C-access; the common word address is impressed on all modules concurrently, while the module selects are impressed one by one.]
Memory Interleaving
The m-way low order interleaved memory structure
allows m memory modules to be access concurrently in
an overlapped manner.
The access cycles in different memory modules are
staggered. The low order a bits select the module and
the high order b bits select the word within each module,
where m = 2^a and a + b = n is the address length.
To access a vector of stride 1, successive addresses are
latched in the address buffer at the rate of one per cycle.
Effectively it takes m minor cycles to fetch m words,
which equals one (major) memory cycle as in the time
graph below.
Memory Interleaving
After the addressed word is found in a module, the
contents are brought to the respective MDB.
We can depict the behaviour by the time graph
Θ : Major cycle [access time]
τ : Minor cycle [churn out time]
Above, each word is being fetched from different modules.
It is clear that there is a good overlap in the access of
adjacent modules.
So, how much faster is the pipelining access?
It is measured by M = Θ / τ and is called degree of
interleaving.
Memory Interleaving
If the stride is 2, the successive accesses must be separated by two
minor cycles in order to avoid access conflicts. This reduces memory
throughput by one-half.
If the stride is 3, there is no module conflict and the maximum
throughput (m words) results.
In general, C-access will yield the maximum throughput of m
words per memory cycle if the stride is relatively prime to m,
the number of interleaved memory modules.
Memory Interleaving
(C) C/S-access memory organization
This is a combination of the S-access and C-Access
organizations, where n access buses are used with m
interleaved memory modules attached to each bus. The m
modules on each bus are m-way interleaved to allow S-access.
In each memory cycle, at most m*n words are fetched if the n
buses are fully used with pipelined memory accesses.
The C/S access memory is suitable for vector multiprocessor
configurations. It provides parallel pipelined access of vector
data set with high bandwidth. A special vector cache design is
needed within each processor in order to guarantee smooth
data movement between the memory and multiple vector
processors.
Memory Interleaving
[Figures: S-Access, C-Access and C/S-Access memory organizations.]
Network for Inter Processor & Memory
Communication
Network Properties and Routing
The topology of an interconnection network can be
Static or Dynamic
Static : Point to point direct connections which will
not change during program execution
Dynamic: Dynamically configured using switched
channels
Ref: Kai Hwang, p. 76
Network Properties and Routing
Some definitions
Node Degree : The number of edges (I/O links/channels) incident on a node
Network Diameter : Maximum shortest path between any two nodes
Bisection Width : When a given network is cut into two halves, the minimum
number of edges along the cut is called channel bisection width b.
If the number of bits = w for each edge, then the wire bisection width is B =
bw.
When B is fixed, the channel width in bits is w = B/b
Another quantitative parameter is the wire (channel) length between nodes.
It determines the latency, clock skewing or power requirement.
Symmetric network: If the topology is the same w.r.t. any node.
Network Properties and Routing
Data Routing Functions
Needed for inter PE data exchange.
For multicomputer network, message passing is used
through hardware routers.
Common data routing functions:
shifting, rotation, permutation (one-to-one),
broadcast (one-to-all), multicast (many-to-many),
personalized communication (one-to-many), shuffle,
exchange, etc.
Network Properties and Routing
Network Routing Functions
A large number of network structures were
investigated by telephone companies that needed
ever larger and more efficient circuit switching
exchanges .
When it became possible to build computer systems
with more than one processor, the application of
interconnection networks to parallel computers was
also investigated
Network Properties and Routing
Network Routing Functions
The requirements of parallel computing structures are
somewhat different from those of a telephone system.
In a telephone network requests for the connection of
a circuit-switched link between an originator and a
respondent may occur at any time. The primary aim is
to maximise the number of concurrent circuits. In a
parallel architecture a network is required to support
either processor-to-memory connections or
processor-to-processor communication links. It is
instructive to visualise an interconnection network in
a parallel computer simply as a `black box', with a
number of input ports and a number of output ports,
which performs a specified routing function to
connect inputs to outputs
Network Properties and Routing
Network Routing Functions
In SIMD systems, where a single instruction operates on a
multiplicity of data, the routing function may be completely
defined within each data-movement instruction. Consequently
all input-output connections will be distinct. Much of the
remainder of this chapter discusses interconnection networks
which fall into this category. In MIMD systems, where each
connection is defined by individual processors operating
independently, no assumptions can be made about the input-
output connections requested by each processor. For example,
two processors may both request data from the same shared
memory bank simultaneously, resulting in network requests
with distinct input ports but the same output port. Although
this problem is not found exclusively
Kaushik Banerjee, IEM,in
CSEMIMD systems most of 138
Network Properties and Routing
Network Routing Functions
An idealised interconnection structure takes a set of labelled input
ports and sets up a number of connections from them to a similar set of
output ports, as shown in the figure. In order to simplify this
discussion of interconnection networks it is assumed that the
number of input and output ports in the network are equal. Hence
if A is defined to be an ordered set of N port labels, then
A = {0, 1, 2, .... , N-1} and a routing function f is a function from port
labels to port labels: f:A → A
If f is an injection on A, then it can often be represented as a
sequence of simple permutations of the labels in A.
For example, if A represented a labelled deck of playing cards then a
possible permutation to perform would be a perfect shuffle, and in
fact this is a very useful permutation for interconnection networks.
Network Properties and Routing
An idealised interconnection structure
Network Properties and Routing
Permutations
For n nodes, there will be n! ways in which the nodes can be
ordered.
The set of all permutations form a permutation group with
respect to a composition operation.
For example, the permutation π = (a,b,c)(d,e) stands for the
bijection mapping a→b, b→c, c→a and d→e, e→d.
Network Properties and Routing
Permutations
The cycle (a,b,c) has a period of 3 and the cycle (d,e) has a
period of 2, each acting in a circular fashion.
Combining the two cycles, the permutation π has a period of
3 x 2 = 6.
Network Properties and Routing
A perfect-shuffle permutation of port labels can be used
to map from a set of source labels S to a set of
destination labels D. The ordered set of input labels is
divided into two subsets of equal size which are then
interleaved. This can be represented by the bipartite
graph on the next slide, from which it can be observed
that this permutation can be produced by a simple
manipulation of the binary representation of the source
label.
Network Properties and Routing
Perfect Shuffle and Exchange
Perfect Shuffle: shift the bits one position to the left, wrapping the MSB around to the LSB.
Inverse Perfect Shuffle: shift the bits one position to the right, wrapping the LSB around to the MSB.

Perfect Shuffle        Inverse Perfect Shuffle
000 → 000              000 → 000
001 → 010              001 → 100
010 → 100              010 → 001
011 → 110              011 → 101
100 → 001              100 → 010
101 → 011              101 → 110
110 → 101              110 → 011
111 → 111              111 → 111

Bits at places (k-1, k-2, …, 1, 0) are shuffled into the pattern (k-2, k-3, …, 1, 0, k-1).
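A C sketch of the shuffle as a one-bit left rotation of a k-bit label; with k = 3 it reproduces the table above (the function name is an assumption).

#include <stdio.h>

/* Perfect shuffle of a k-bit port label: rotate left by one, wrapping the MSB to the LSB. */
unsigned shuffle(unsigned label, int k) {
    unsigned msb = (label >> (k - 1)) & 1u;
    return ((label << 1) | msb) & ((1u << k) - 1u);
}

int main(void) {
    for (unsigned p = 0; p < 8; p++)          /* k = 3: 000..111 */
        printf("%u -> %u\n", p, shuffle(p, 3));
    return 0;
}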
Network Properties and Routing
Hypercube Routing Functions
[Diagram: a 3-cube with nodes 000 to 111.]
There are three routing functions for the 3-cube. Routing can be done by
(A) Changing the MSB
(B) Changing the LSB
(C) Changing the middle bit
Observe: the routing is to an adjacent node.
For an n-cube there will be n routing functions.
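A C sketch of these routing functions: routing function Ci simply complements bit i of the node label (the node value 5 used below is an arbitrary example).

#include <stdio.h>

/* Neighbour of 'node' along dimension i of an n-cube: flip bit i. */
unsigned cube_route(unsigned node, int i) {
    return node ^ (1u << i);
}

int main(void) {
    unsigned node = 5;                       /* node 101 of a 3-cube      */
    for (int i = 0; i < 3; i++)              /* the three routing functions */
        printf("flip bit %d: %u -> %u\n", i, node, cube_route(node, i));
    /* 5 (101) -> 4 (100), 7 (111), 1 (001): each is an adjacent node */
    return 0;
}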
Network Properties and Routing
Broadcast and Multicast
Broadcast : One to all mapping.
Achieved in SIMD by a broadcast bus extended from Array
Controller to all the PEs.
In Multiprocessor, it is done by message passing.
Multicast: Many to many ; from one set to another.
Personalised Broadcast: One to selected many
Network Properties and Routing
Network Performance
Depends on
(1) Functionality: The method of routing, interrupt handling,
synchronization, request /message combining and
coherence.
(2) Network Latency: Worst case time delay for a unit message
to be transferred
(3) Bandwidth: Maximum data transfer rate in Mbytes/s
(4) Hardware complexity: Costs of wires, switches, connectors,
arbitration and interface logic
(5) Scalability
Static Network
[Figures: Linear Array, Star, Ring, Tree, Near-Neighbour Mesh 3 x 3.]
Static Network
[Figures: Chordal Ring of degree 3, Chordal Ring of degree 4 (bisection width = 5), Completely Connected network.]
Static Network
[Figures: Near-Neighbour Mesh 3 x 3, wrapped-around mesh (ILLIAC), Torus (equivalent to a chordal ring of degree 4), Systolic Array.]
Static Network
[Figures: Hypercubes (low scalability): a 3-Cube, and a 4-cube formed by interconnecting two 3-Cubes; 3-Cube Connected Cycles, a modified version of the hypercube.]
Static Network
Linear Array
Diameter = N-1
Suitable for small number of nodes
Very different from a shared bus: it allows concurrent use of the
different channels.
Ring
Uni-directional or bi-directional
Symmetric with a constant node degree = 2
Diameter = N/2 (rounded down) for a bi-directional ring; N-1 if uni-directional
The IBM token ring protocol is based on this. A message keeps
travelling until it reaches the addressed node.
Chordal Ring
Formed by increasing the node degree of a ring to 3 or 4
The higher the node degree, the shorter the diameter
Kaushik Banerjee, IEM, CSE 152
Static Network
Tree and Star
Tree: a k-level completely balanced binary tree will have N = 2^k - 1
nodes
Scalable
Long diameter of 2(k - 1)
Star is a two-level tree with a high central node degree of d = N - 1 and a
small constant diameter of 2 (for systems with a centralized supervisor
node)
Fat tree
When the channel width of a tree increases from the leaves towards
the root.
Kaushik Banerjee, IEM, CSE 153
Static Network
Mesh and Torus
Mesh is popular. Used in ILLIAC IV, MPP, DAP, CM-2 and Intel
Paragon
In general, a k-dimensional mesh with N = n^k nodes has an interior
node degree of 2k and a network diameter of k(n - 1).
In the example, we have a 2-dimensional 3 × 3 mesh with 3^2 = 9 nodes.
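The mesh formulas can be checked with a small sketch (the helper is illustrative; n and k are as defined above):

```python
def mesh_properties(n, k):
    """Properties of a k-dimensional mesh with n nodes along each dimension."""
    return {
        "nodes": n ** k,            # N = n^k
        "interior_degree": 2 * k,   # two neighbours per dimension
        "diameter": k * (n - 1),    # worst-case hop count, corner to corner
    }

print(mesh_properties(3, 2))   # the 3 x 3 example: 9 nodes, degree 4, diameter 4
```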
Systolic Arrays
This is a class of multi-dimensional pipelined array architectures
designed for implementing fixed algorithms.
The one shown in the figure is specifically designed for performing
matrix-matrix multiplication.
In general, static systolic arrays are pipelined with multidirectional
flow of data streams. They are very difficult to program.
Kaushik Banerjee, IEM, CSE 154
Dynamic Network
Dynamic interconnection networks enable the temporary
connection of any two components of a multiprocessor.
There are two categories
(A) Shared path networks
i. Single bus
ii. Multiple bus systems
(B) Switching networks
i. Crossbars
ii. Multistage networks
Kaushik Banerjee, IEM, CSE 155
Dynamic Network
Shared path network
Can be allocated to any active component of the machine
on a competitive basis
Switching networks
Allocated by a number of switches that can be set in
different ways according to the connection requirements.
Kaushik Banerjee, IEM, CSE 156
Dynamic Network
Shared Path Networks
Single shared bus
One of the most popular
A generalization and extension of the buses employed in
uniprocessor systems
Contains the same bus lines (address, data, control,
interrupt) as uniprocessors, plus some additional lines for
resolving contention (arbitration).
Very cost-effective: increasing the number of processors
does not increase the price of the shared bus.
Kaushik Banerjee, IEM, CSE 157
Dynamic Network
Limitation of shared bus
As the number of processors on the bus increases, the
probability of contention also increases proportionally,
reaching a point where the whole bandwidth of the bus is
exhausted by the processors; hence, adding a new
processor will not yield any speed-up in the
multiprocessor.
Kaushik Banerjee, IEM, CSE 158
Dynamic Network
One of the main design issues in shared bus multiprocessors is
increasing the number of processors that can usefully be
attached to the bus. The three most important techniques are:
A. Introducing private memory
B. Introducing coherent cache memory
C. Introducing multiple buses
Without these improvements the applicable number of
processors is in the range 3-5.
This can be improved up to about 30 processors using private
memory and coherent caches.
Bus hierarchies open the way to constructing scalable
shared memory systems based on bus interconnection.
Kaushik Banerjee, IEM, CSE 159
Dynamic Network
Although, in principle, the uniprocessor and multiprocessor
buses are very similar, there is a significant difference in their
mode of operation.
Uniprocessors and first-generation multiprocessors employ
the so-called locked buses; examples are the Multibus,
VMEbus, etc.
Second-generation multiprocessors apply pended buses. The
difference is in how memory accesses are handled on the bus.
Kaushik Banerjee, IEM, CSE 160
Dynamic Network
A memory write access needs two phases:
Phase 1. The address and data are transferred via the bus to
the memory controller.
Phase 2. The memory write operation (including parity check,
error correction, and so on) is executed by the memory
controller.
The first phase is typically 3-4 times faster than the second
There is no need to lock the bus during phase 2.
In a multiprocessor system, if phase 2 of one processor's access
can be overlapped with phase 1 of another's, a lot of time is saved.
Kaushik Banerjee, IEM, CSE 161
Dynamic Network
Similarly, improvement can be made in the case of a memory read.
A memory read access needs three phases:
Phase 1. The address is transferred via the bus to the memory
controller.
Phase 2. The memory read operation is executed by the
memory controller.
Phase 3. The data is transferred via the bus to the requesting
processor.
In a multiprocessor system it is possible to overlap reads by
combining phase 1 of one memory read with phase 3 of another,
executed on different memory units, using separate address and
data buses and appropriate buffers.
Kaushik Banerjee, IEM, CSE 162
Dynamic Network
Multiple shared bus
The limited bandwidth of the single shared bus represents a
major limitation in building scalable multiprocessors.
There are several ways to increase the bandwidth of the
interconnection network. A natural idea is to multiply the
number of buses, like the processors and memory units. Four
different ways have been proposed for connecting buses to
the processors, memory units and other buses:
1-dimensional multiple bus systems
2- or 3-dimensional multiple bus systems
Cluster bus systems
Hierarchical bus systems
Kaushik Banerjee, IEM, CSE 163
Dynamic Network
Bus arbitration procedures are of two kinds:
A. Serial arbitration
B. Parallel arbitration
Kaushik Banerjee, IEM, CSE 164
Dynamic Network
Switching Networks
Crossbar
The crossbar is the most powerful network type since it
provides simultaneous access among all the inputs and outputs
of the network, provided that all the requested outputs are
different. Every switch must contain arbiter logic to allocate
the memory block in the case of conflicting requests, and a bus-
bus connection unit to enable connection between the buses of
the winning processor and the memory buses. This means that
both the wiring and the logic complexity of the crossbar are
dramatically increased compared with the single bus
interconnection.
Kaushik Banerjee, IEM, CSE 165
Dynamic Network
Switching Networks (Contd.)
Crossbar (more)
The single bus system is unable to serve as an
interconnection network for scalable multiprocessors due to
the limited bandwidth of the single bus. Although the
crossbar provides a scalable bandwidth, it is not appropriate
for constructing large-scale multiprocessors because of the
large complexity and high cost of the switches. In addition, the
number of switches increases with the square of the number
of processors in the crossbar.
Kaushik Banerjee, IEM, CSE 166
Dynamic Network
Switching Networks (Contd.)
Multistage networks
Multistage networks represent a compromise between the
single bus and the crossbar switch interconnections from the
point of view of implementation complexity, cost,
connectivity and bandwidth. A multistage network consists of
alternating stages of links and switches. Many kinds of
multistage networks have been proposed and built so far.
They can be categorized according to the number of stages,
the number of switches per stage, the topology of the links
connecting subsequent stages, the type of switches
employed at the stages and the possible operation modes.
Kaushik Banerjee, IEM, CSE 167
Dynamic Network
Time Shared Common Bus
[Figure: a single time-shared bus connecting a Memory Unit,
CPU 1-3 and I/O Processors 1-2]
Only one processor can transfer to/from memory or another
processor at a time.
The receiving unit recognizes its address from the sender and responds to
the control signal; then the transfer starts.
Bus conflicts may be resolved by dedicating a bus controller which performs
the arbitration.
Kaushik Banerjee, IEM, CSE 168
Dynamic Network
Time Shared Common Bus
Improvement: two or more parallel buses.
[Figure: a common system bus with a shared Memory Unit; each of
CPU 1-3 has its own local bus connecting a System Bus Controller,
an IOP and a Local Memory]
Kaushik Banerjee, IEM, CSE 169
Dynamic Network
Time Shared Common Bus
Locally connected memory and IOP
System Bus Controller connects the local bus to the system
bus
IOP/Memory connected to System bus is shared.
At a time, only one processor can communicate with the
shared memory through the system bus, while other
processors communicate locally.
Kaushik Banerjee, IEM, CSE 170
Dynamic Network
Multi Port Memory
[Figure: CPU 1-3, each connected to Memory Units 1-3 through
separate ports]
Kaushik Banerjee, IEM, CSE 171
Dynamic Network
Multi Port Memory
Each Processor bus is connected to each memory unit.
Processor bus contains address, control and data lines
Only one port is active at any point in time
Fixed priority is given to each memory port, depending on the physical
port position.
Advantage: Multiple ports => higher transfer rate
Disadvantage: Expensive memory control logic
Good for small number of processors
Kaushik Banerjee, IEM, CSE 172
Dynamic Network
Crossbar Switches
[Figure: a crossbar connecting CPU 1-3 to Memory Units 1-3]
A switch is placed at each cross point.
The switch recognizes the address and directs the
instruction/data to the desired memory module.
Kaushik Banerjee, IEM, CSE 173
Dynamic Network
Crossbar Switches
Simultaneous transfers are possible to all memory modules.
[Figure: each memory module contains multiplexers and arbitration
logic; every CPU presents Data, Address and Control (R/W) lines and
receives a Memory Enable signal when selected]
Arbitration is by priority encoder.
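A priority encoder grants the bus to the highest-priority requester; a minimal sketch (the fixed priority order and function name are assumptions for illustration):

```python
def priority_encode(requests):
    """Return the index of the highest-priority asserted request (index 0 is
    highest), or None when no CPU is requesting the memory module."""
    for cpu, asserted in enumerate(requests):
        if asserted:
            return cpu
    return None

# CPUs at indices 1 and 2 both request the same module; index 1 wins.
print(priority_encode([False, True, True]))   # -> 1
```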
Kaushik Banerjee, IEM, CSE 174
Dynamic Network
Multi Stage Switching Network
[Figure: a 2 × 2 switch with inputs A, B and outputs 0, 1]
The Switch Concept: 2 × 2
The switch resolves/arbitrates priority between A and B for each output port.
Switches are a × b; generally a = b, and both are powers of 2.
Commonly 2 × 2, 4 × 4 and 8 × 8.
When it is 2 × 2, it is called a binary switch.
For an n × n switch, the number of legitimate states is n^n and the number of
permutations is n!. For a 2 × 2 switch these are 4 legitimate states (straight,
crossover, upper broadcast, lower broadcast) and 2 permutations.
Kaushik Banerjee, IEM, CSE 175
Dynamic Network
Multi Stage Switching Network
[Figure: a binary tree of 2 × 2 switches connecting processors P1, P2
(and other processors, marked *) to memory modules 000-111;
from P1's first switch, output 0 leads to modules 000-011 and
output 1 leads to modules 100-111]
P1, P2: Processors
000-111: Memory Modules
Kaushik Banerjee, IEM, CSE 176
Kaushik Banerjee, IEM, CSE 177
Various connectivity of a switch
[Figure: the four connection states of a 2 × 2 switch:
Straight, Crossover, Upper Broadcast, Lower Broadcast]
Kaushik Banerjee, IEM, CSE 178
Omega Network
Implementation of an 8 × 8 network using 2 × 2 switches, with the
Perfect Shuffle as the interconnection pattern between stages.
From -> To
000 -> 000 (0)
001 -> 010 (2)
010 -> 100 (4)
011 -> 110 (6)
100 -> 001 (1)
101 -> 011 (3)
110 -> 101 (5)
111 -> 111 (7)
The number in parentheses indicates the decimal equivalent of the node.
This is the Perfect Shuffle.
Kaushik Banerjee, IEM, CSE 179
Omega Network
We will demonstrate the connection from 1 to 3, i.e., from 001 to 011.
The path starts at input node 001. Since the first stage is wired as a perfect
shuffle, node 001 is connected to switch input 010, i.e., port 2:
001 (node 1) --perfect shuffle--> 010 (port 2)
Kaushik Banerjee, IEM, CSE 180
Omega Network
Now, in the transition from 001 to 011 the 1st bit is unchanged.
So, the connection within switch A will be straight.
Kaushik Banerjee, IEM, CSE 181
Omega Network
Next mapping: 010 (port 2) --perfect shuffle--> 100 (port 4)
Kaushik Banerjee, IEM, CSE 182
Omega Network
Now, in the transition from 001 to 011 the 2nd bit changes.
So, the connection through switch B will be a crossover, hence port 5.
Kaushik Banerjee, IEM, CSE 183
Omega Network
Next mapping: 101 (port 5) --perfect shuffle--> 011 (port 3).
Then, since the third bit is unchanged, the connection within switch C will be
straight, delivering the message to output 011.
Kaushik Banerjee, IEM, CSE 184
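The whole walkthrough can be reproduced by alternating a perfect-shuffle step with a straight/crossover decision at each stage. A sketch, assuming an 8 × 8 omega network of 2 × 2 switches (function names are illustrative):

```python
def rotl(x, k):
    """Perfect shuffle of a k-bit port label: rotate left by one bit."""
    return ((x << 1) & ((1 << k) - 1)) | ((x >> (k - 1)) & 1)

def omega_route(src, dst, k=3):
    """Return the port labels visited when routing src -> dst through the network."""
    cur, path = src, [src]
    for stage in range(k):
        cur = rotl(cur, k)                    # shuffle link into this stage
        bit = k - 1 - stage                   # bit examined at this stage (MSB first)
        if ((src >> bit) & 1) != ((dst >> bit) & 1):
            cur ^= 1                          # bits differ: crossover (flip LSB)
        path.append(cur)                      # otherwise straight
    assert cur == dst
    return path

# 001 -> 011 passes through ports 2 and 5 and arrives at 3, as in the slides.
print([format(p, '03b') for p in omega_route(0b001, 0b011)])
```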
Kaushik Banerjee, IEM, CSE 185
The connections 000→110 and 100→111 will conflict at switch F;
011→000 and 111→011 will conflict at switch G;
101→001 and 011→000 will conflict at switch H.
At switches I and J a broadcast would be required, if the hardware allows it.
Because of such conflicts, not all permutations can be implemented in one pass.
Kaushik Banerjee, IEM, CSE 186
Hypercube Connection
One Cube (addresses 0 and 1)
2^1 = 2 nodes
Each node contains a CPU, Local memory and I/O interface
Kaushik Banerjee, IEM, CSE 187
Hypercube Connection
Two Cube (addresses 00, 01, 10, 11)
2^2 = 4 nodes
Each node contains a CPU, Local memory and I/O interface
Kaushik Banerjee, IEM, CSE 188
Hypercube Connection
Similarly, a 3-cube has 2^3 = 8 nodes.
Likewise, an n-cube has 2^n nodes.
For say, 3-cube,
Node 000 can directly communicate with
Node 001 or
Node 010 or
Node 100
That is, only one bit change.
To communicate with 111, the message has to be routed through three links:
000 → 001 → 011 → 111
Kaushik Banerjee, IEM, CSE 189
Hypercube Connection
Take the XOR of the two node addresses:
If the 1st bit is 1, the nodes differ along the X axis.
If the 2nd bit is 1, the nodes differ along the Y axis.
If the 3rd bit is 1, the nodes differ along the Z axis.
For example, between 000 and 010 the XOR is 010, so the nodes differ along
the Y axis.
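A sketch of this XOR-based routing: XOR the two addresses, then correct one differing bit per hop (one possible visiting order; names are illustrative):

```python
def hypercube_path(src, dst, n):
    """Route src -> dst in an n-cube by fixing one differing address bit per hop."""
    path = [src]
    diff = src ^ dst              # set bits mark the axes along which the nodes differ
    for bit in range(n):
        if diff & (1 << bit):
            src ^= (1 << bit)     # move along that axis to an adjacent node
            path.append(src)
    return path

# 000 -> 111 needs three links; one valid route is 000, 001, 011, 111.
print([format(p, '03b') for p in hypercube_path(0b000, 0b111, 3)])
print(format(0b000 ^ 0b010, '03b'))   # 010: only the Y axis differs
```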
Kaushik Banerjee, IEM, CSE 190
Multiprocessors and MIMD
Kaushik Banerjee, IEM, CSE 191
MIMD
An MIMD computer system has a number of
independent processors operating upon separate data
concurrently.
Hence, each processor has its own program memory or
has access to program memory.
Similarly, each processor has its own data memory or
has access to data memory.
Clearly, there has to be a mechanism for passing
information between processors, as they work on the
same problem.
Kaushik Banerjee, IEM, CSE 192
Multiprocessors
Interconnection of two or more CPUs with memory and I/O
Necessarily means multiple CPUs with optional I/O
processors.
Falls into the category of MIMD
MIMD (Multiple Instruction Stream, Multiple Data Stream) is
a technique to achieve parallelism.
MIMD machines have a number of processors functioning
asynchronously and independently.
At any point of time, different processors may be executing
different instructions on different pieces of data.
Kaushik Banerjee, IEM, CSE 193
Multiprocessors
Most multiprocessor systems and multiple computer systems
can be included in this category.
An intrinsic MIMD computer implies interactions among the n
processors because all memory streams are derived from the
same data space shared by all the processors.
If the n data streams were derived from disjoint subspaces
of the shared memories, then we would have the so-called
MSISD (Multiple SISD) operation, which is nothing but a set of n
independent SISD uniprocessor systems.
An intrinsic MIMD computer is tightly coupled if the degree of
interactions among the processors is high. Most commercial
MIMD computers are loosely coupled.
Kaushik Banerjee, IEM, CSE 194
Multiprocessors
Many of the early parallel machines were SIMD, and SIMD received
renewed attention in the 1980s.
Except for vector processors, SIMD machines were gone by the mid 1990s.
MIMD emerged as the architecture of choice for general
purpose multiprocessors.
There are two important aspects of MIMD:
(1) Flexibility: single-user multiprocessors offering high
performance for one application.
(2) Cost-effectiveness: can be built from off-the-shelf
microprocessors, giving cost/performance advantages.
Kaushik Banerjee, IEM, CSE 195
Multiprocessors
Advantages
(1) Flexibility : Single user multiprocessors offering high
performance for one application by distributing the task.
Needs compiler/ language support.
(2) Cost Effective: Can be built on off-the-shelf
microprocessors
Improved Performance
(1)Multiple independent jobs in parallel, OR
(2)Single job partitioned into parallel tasks.
Example: Processor A: Computation, Processor B: Control
Kaushik Banerjee, IEM, CSE 196
Multiprocessors
MIMD Architecture’s Goal
Without a shared memory, it is not possible for a processor in a
multicomputer to directly access a memory block that resides in a
remote node. Rather than having a processor stall, or be
reallocated to a different process, while remote data is obtained,
distributed MIMD architectures rely on explicit message passing.
Thus, the main point of this architectural design is to develop a
message-passing parallel computer system organized such that the
processor time spent in communication within the network is
reduced to a minimum.
Kaushik Banerjee, IEM, CSE 197
Multiprocessors
Components of the Multi Computer and Their Tasks
Within a multicomputer, there are a large number of nodes and a
communication network linking these nodes together. Inside each node,
there are three important elements that have tasks related to message
passing:
Computation Processor and Private Memory
Communication Processor
This component is responsible for organizing communication among the
multicomputer nodes, "packetizing" a message as a chunk of memory on
the sending end, and "depacketizing" the same message for the
receiving node.
Router, commonly referred to as a switch unit.
The router's main task is to transmit the message from one node to the
next and to assist the communication processor in moving the
message through the network of nodes.
Kaushik Banerjee, IEM, CSE 198
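A toy sketch of the packetize/depacketize step (the packet layout and sizes are purely illustrative, not taken from the text):

```python
def packetize(message: bytes, payload_size: int = 4):
    """Split a message into (sequence number, packet count, payload) packets."""
    chunks = [message[i:i + payload_size]
              for i in range(0, len(message), payload_size)]
    return [(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]

def depacketize(packets):
    """Reassemble the original message, even if packets arrive out of order."""
    return b"".join(chunk for _, _, chunk in sorted(packets))

pkts = packetize(b"hello, node 3")
assert depacketize(reversed(list(pkts))) == b"hello, node 3"
```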
Multiprocessors
The development of each of the above three elements took
place progressively, as hardware became more and more
complex and useful.
First came the Generation I computers, in which messages
were passed through direct links between nodes, but there
were not any communication processors or routers.
Then Generation II multicomputers came along with
independent switch units that were separate from the
processor, and finally in the Generation III multicomputer, all
three components exist.
Kaushik Banerjee, IEM, CSE 199
Multiprocessors
Categorization of MIMD processors
There are two kinds of categorization of MIMD computers
Memory Organization wise
Interconnection wise
1. MIMD memory organization
There are two major types of MIMD memory architectures: distributed
memory MIMD architecture, and shared memory MIMD architecture.
Shared memory organizations are those where all processors share the
same common memory. The processors in a shared memory organisation
are known as tightly coupled, as there is a high level of interaction
among the processors.
Distributed memory organizations are those where each CPU has its own
local private memory. A message passing scheme is used for interaction
among the processors. Such a system is known as loosely coupled due to the
low degree of interaction among the processors.
Kaushik Banerjee, IEM, CSE 200
Multiprocessors Memory Models
(A) Distributed Memory MIMD Architectures
Kaushik Banerjee, IEM, CSE 201
Multiprocessors Memory Models
Advantages and Disadvantages (of distributed memory)
+Highly Scalable
+Message passing solves memory access synchronization problem
-Load balancing problem
-Deadlock in message passing
-Need to physically copy data between processes
Kaushik Banerjee, IEM, CSE 202
Multiprocessors Memory Models
(B) Shared Memory MIMD Architectures
Tightly Coupled
Also called Symmetric Multiprocessors
Kaushik Banerjee, IEM, CSE 203
Multiprocessors Memory Models
Advantages and Disadvantages (of shared memory)
+May use uniprocessor programming techniques
+Communication between processors is efficient
-Synchronized access to shared data in memory is needed
-Lack of scalability due to (memory) contention problem
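A minimal illustration of the synchronization point, using threads as a stand-in for processors sharing one memory location (the counter example is illustrative):

```python
import threading

counter = 0                     # the shared datum
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, taking the lock for each update."""
    global counter
    for _ in range(n):
        with lock:              # synchronized access to the shared data
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                  # 400000, regardless of thread interleaving
```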
Kaushik Banerjee, IEM, CSE 204
Multiprocessors Memory Models
Shared Memory
No matter how the memory blocks are physically organized, their
address spaces are unified into a global address space which is
completely visible to all processors of the shared memory
system.
Issuing a certain memory address by any processor will
access the same memory block location.
However, according to the physical organization of the
logically shared memory, two main types of shared
memory system could be distinguished:
Physically shared memory systems
Virtual (or distributed) shared memory systems
Kaushik Banerjee, IEM, CSE 205
Multiprocessors Memory Models
Shared Memory
In physically shared memory systems all memory blocks
can be accessed uniformly by all processors.
In distributed shared memory systems the memory blocks
are physically distributed among the processors as local
memory units.
The three main design issues in increasing the scalability
of shared memory systems are:
a. Organization of memory
b. Design of interconnection networks
c. Design of cache coherence protocols
Kaushik Banerjee, IEM, CSE 206
Multiprocessors Memory Models
(C) Virtual shared memory (Best of Both Worlds)
Also called distributed shared memory architecture.
Kaushik Banerjee, IEM, CSE 207
Multiprocessors Memory Models
Shared memory systems are basically classified according to
their memory organization since this is the most fundamental
design issue. Accordingly, shared memory systems can be
divided into four main classes:
1. Uniform memory access (UMA) machines. These follow the
physically shared memory architecture.
The following are models of the distributed shared memory
architecture:
2. Non-uniform memory access (NUMA) machines
3. Cache-coherent non-uniform memory access
(CC_NUMA) machines
4. Cache-only memory access (COMA) machines
Kaushik Banerjee, IEM, CSE 208
Multiprocessors Memory Models
UMA machines belong to the physically shared memory
architecture class, while NUMA, CC_NUMA, and COMA
machines form the class of distributed shared memory
architectures.
The four classes cover the three generations of shared
memory systems. Their first generation contains the UMA
machines where the interconnection network was based
either on the concept of shared bus in order to construct low-
price multiprocessors or on multistage networks in order to
build massively parallel shared memory systems.
Kaushik Banerjee, IEM, CSE 209
Multiprocessors Memory Models
Shared bus is a serious bottleneck due to contention
In first-generation shared memory systems this was alleviated by
using local cache memories. Still, the scalability was limited to the
range of 20-30 processors in shared bus machines and 100-200 in
multistage network-based machines.
The second-generation shared memory systems tried to
physically distribute the shared memory among the processors,
and also employed more complex multibus or multistage
networks.
The third-generation shared memory systems combine the
advantages of the first two generations (by introducing large
local cache memories), resulting in a dramatic improvement.
Kaushik Banerjee, IEM, CSE 210
Multiprocessors Memory Models
However, the application of caches in a multiprocessor
environment gives rise to the so-called cache consistency
problem.
In order to maintain data consistency in the caches, a cache
coherence protocol must be employed, which adds extra traffic
to the network.
Kaushik Banerjee, IEM, CSE 211
Multiprocessors Memory Models
(I) Uniform Memory Access (UMA) Machines
Contemporary uniform memory access machines are small-
size single bus multiprocessors. Large UMA machines with
hundreds of processors and a switching network were typical
in the early design of scalable shared memory systems.
Famous representatives of that class of multiprocessors are
the Denelcor HEP and the NYU Ultracomputer.
They introduced many innovative features in their design,
some of which even today represent a significant milestone in
parallel computer architectures.
Kaushik Banerjee, IEM, CSE 212
Multiprocessors Memory Models
UMA Machines (Contd.)
However, these early systems contained neither cache
memory nor local main memory, which turned out to be
necessary to achieve high performance in scalable shared
memory systems. Although the UMA architecture is not
suitable for building scalable parallel computers, it is
excellent for constructing small-size single bus
multiprocessors.
Two such machines are the Encore Multimax of Encore
Computer Corporation representing the technology of the
late 1980s and the Power Challenge of Silicon Graphics
Computing Systems representing the technology of the 1990s.
Kaushik Banerjee, IEM, CSE 213
Multiprocessors Memory Models
(II) Non-Uniform Memory Access (NUMA) Machines
Kaushik Banerjee, IEM, CSE 214
Multiprocessors Memory Models
Non-uniform memory access (NUMA) machines were designed
to avoid the memory access bottleneck of UMA machines. The
logically shared memory is physically distributed among the
processing nodes of NUMA machines, leading to distributed
shared memory architectures. On one hand these parallel
computers became highly scalable, but on the other hand they
are very sensitive to data allocation in local memories.
Accessing a local memory segment of a node is much faster
than accessing a remote memory segment. Not by chance, the
structure and design of these machines resemble in many ways
that of distributed memory multicomputers. The main
difference is in the organization of the address space.
Kaushik Banerjee, IEM, CSE 215
Multiprocessors Memory Models
In multiprocessors, a global address space is applied that is
uniformly visible from each processor; that is, all processors
can transparently access all memory locations.
In multicomputers, the address space is replicated in the local
memories of the processing elements. This difference in the
address space of the memory is also reflected at the software
level: distributed memory multicomputers are programmed
on the basis of the message-passing paradigm, while NUMA
machines are programmed on the basis of the global address
space (shared memory) principle.
Kaushik Banerjee, IEM, CSE 216
Multiprocessors Memory Models
The problem of cache coherency does not appear in
distributed memory multi-computers since the message-
passing paradigm explicitly handles different copies of the
same data structure in the form of independent messages.
In the shared memory paradigm, multiple accesses to the same
global data structure are possible and can be accelerated if
local copies of the global data structure are maintained in local
caches.
Kaushik Banerjee, IEM, CSE 217
Multiprocessors Memory Models
However, the hardware-supported cache consistency schemes
are not introduced into the NUMA machines. These systems
can cache read-only code and data, as well as local data, but
not shared modifiable data.
This is the distinguishing feature between NUMA and CC-NUMA
multiprocessors.
Kaushik Banerjee, IEM, CSE 218
Multiprocessors Memory Models
Accordingly, NUMA machines are closer to multicomputers
than to other shared memory multiprocessors, while CC-NUMA
machines look like real shared memory systems.
In NUMA machines, as in multicomputers, the main design
issues are the organization of processor nodes, the
interconnection network, and the possible techniques to
reduce remote memory accesses. Two examples of NUMA
machines are the Hector and the Cray T3D multiprocessors.
Kaushik Banerjee, IEM, CSE 219
Multiprocessors Memory Models
NUMA
Remote load
Kaushik Banerjee, IEM, CSE 220
Multiprocessors Memory Models
(III) Cache-Only Memory Access (COMA) Machines
Kaushik Banerjee, IEM, CSE 221
Multiprocessors Memory Models
Cache-Only Memory Access (COMA) Machines
COMA machines try to avoid the problems of static memory
allocation of NUMA and CC-NUMA machines by excluding main
memory blocks from the local memory of nodes and employing
only large caches as node memories. In these architectures only
cache memories are present; no main memory is employed either
in the form of a central shared memory as in UMA machines or in
the form of a distributed main memory as in NUMA and CC-NUMA
computers. Similarly to the way virtual memory has eliminated the
need to handle memory addresses explicitly, COMA machines
render static data allocation to local memories superfluous. In
COMA machines data allocation is demand driven; according to the
cache coherence scheme, data is always attracted to the local
(cache) memory where it is needed.
Kaushik Banerjee, IEM, CSE 222
Multiprocessors Memory Models
Cache-Only Memory Access (COMA) Machines
In COMA machines similar cache coherence schemes can
be applied as in other shared memory systems. The only
difference is that these techniques must be extended with
the capability of finding the data on a cache read miss and
of handling replacement. Since COMA machines are
scalable parallel architectures, only cache coherence
protocols that support large-scale parallel systems can be
applied, that is, directory schemes and hierarchical cache
coherence schemes. Two representative COMA
architectures are the DDM (Data Diffusion Machine) and the KSR1.
Kaushik Banerjee, IEM, CSE 223
Multiprocessors Memory Models
(IV) Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
All the CC-NUMA machines share the common goal of building
a scalable shared memory multiprocessor. The main difference
among them is in the way the memory and cache coherence
mechanisms are distributed among the processing nodes.
Another distinguishing design issue is the selection of the
interconnection network among the nodes. They demonstrate
a progress from bus-based networks towards a more general
interconnection network and from the snoopy cache
coherency protocol towards a directory scheme.
Kaushik Banerjee, IEM, CSE 224
Multiprocessors Memory Models
Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
The Wisconsin multicube architecture is the closest
generalization of a single bus-based multiprocessor. It
completely relies on the snoopy cache protocol but in a
hierarchical way. The main goal of the Stanford FLASH design
was the efficient integration of cache-coherent shared memory
with high-performance message passing. The FLASH applies a
directory scheme for maintaining cache coherence. The figure
below shows the design space of the CC-NUMA machines.
Kaushik Banerjee, IEM, CSE 225
Multiprocessors Memory Models
Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
Kaushik Banerjee, IEM, CSE 226