
Advanced Computer Architecture

Parallel Processing

KAUSHIK BANERJEE
IEM, CSE
Advanced Computer Architecture

Pipeline Processing

Architecture Classification

Vector Processing

Vector Processor Architecture

Memory Interleaving

Network Properties

MIMD and Multiprocessors


Kaushik Banerjee, IEM, CSE 2
Parallel Processing
Introduction

Need is ahead of availability: We are trying to get more performance out of computer systems, and hence we go for advanced computer architecture.

Computer Architecture vs Computer Organization:

Architecture: the structural behaviour of the various functional modules of the computer and how they interact to meet the processing needs of the workload.

Organization: the way the hardware components are connected together to form a computer system.
Kaushik Banerjee, IEM, CSE 3
Parallel Processing

Background

Earliest computing machines worked on fixed programs, as in calculators. A computer used to be meant for one particular type of job. For example, a desk calculator can perform arithmetic, but not word processing or gaming.

Reprogramming would mean making flowcharts, paperwork, documentation and even rewiring of the system.

Kaushik Banerjee, IEM, CSE 4


Parallel Processing
Paradigm Change

John Von Neumann in 1944-45 wrote a paper in which he proposed a scheme to treat the program as data and store it in memory. This is called the stored program concept.

[Figure: the Von Neumann organization - a Memory unit connected to the Control Unit and the Arithmetic Logic Unit, with Input and Output units.]

The above is a uniprocessor or, in other words, a scalar design.


Kaushik Banerjee, IEM, CSE 5
Parallel Processing
Problems: As memory is much slower than the ALU, the ALU spends time in wait states, and the narrow path between memory and ALU limits performance.

This problem is known as the Von Neumann bottleneck.

In order to improve performance, the superscalar design was introduced.


How to improve performance

Rely on faster circuits: Cost per circuit increases with circuit speed and at
some point cost/performance ratio becomes unfavourable

Introduce concurrency
Replicate resources
Do more per cycle

Kaushik Banerjee, IEM, CSE 6


Parallel Processing
Superscalar Design (as CPU is getting cheaper)

More instructions issued at the same time.

Multiple processing units.

The parallel instructions must be independent.

Disadvantage:
• Control overhead
• Burden on the programmer
• Hard to have more than 5 parallel lines (no point in increasing the number of processing units above 5)

Kaushik Banerjee, IEM, CSE 7


Parallel Processing
In superscalar design, resources are replicated as shown below.

[Figure: four replicated processor slices, each with its own Control Unit, Arithmetic & Logic Unit and Memory Unit (a multiprocessor), contrasted with the single-slice uniprocessor.]

This superscalar design is flexible, but it is difficult to identify which statements are independent of one another so as to make them work in parallel.
Kaushik Banerjee, IEM, CSE 8
Parallel Processing
A CISC or RISC scalar processor can be improved with a superscalar or vector
architecture. Scalar processors are those executing one instruction per cycle. Only one
instruction is issued per cycle and only one completion of instruction is expected from
the pipeline per cycle.

In a superscalar processor, multiple instruction pipelines are used. This implies that multiple instructions are issued per cycle and multiple results are generated per cycle. A vector processor executes vector instructions on arrays of data: each such instruction involves a string of repeated operations, which are ideal for pipelining with one result per cycle.

Superscalar processors are designed to exploit more instruction level parallelism in


user programs. Only independent instructions can be executed in parallel without
causing a wait state. The amount of instruction level parallelism varies widely
depending on the type of code being executed.

It has been observed that the average value is around 2 for code without loop
unrolling. Therefore, for these codes there is not much benefit gained from building
a machine that can issue more than three instructions per cycle

Kaushik Banerjee, IEM, CSE 9


Parallel Processing
Pipelines
Instruction Pipeline

[Figure: four instructions flowing through a four-stage pipeline; each instruction passes through Fetch, Decode, Execute and Write Back, and successive instructions overlap by one stage.]

Kaushik Banerjee, IEM, CSE 10


Parallel Processing
Figure: A superscalar processor of degree m=3

[Figure: three Fetch - Decode - Execute - Write Back pipelines operating side by side.]

Kaushik Banerjee, IEM, CSE 11


Parallel Processing
Disadvantages of superscalar design are:
The programmer is burdened with designing the program in
parallel executable parts.
It is very hard to have more than 5 parallel lines; so there is
no point in increasing number of processing units over 5.

Some representative superscalar processors: IBM RS/6000,


DEC21064, Intel i960CA

The above architecture shows replication to the extreme.

Very flexible, but costly


Do we need all this flexibility?
How about part replication?
Kaushik Banerjee, IEM, CSE 12
Parallel Processing
Arithmetic Pipeline

Pipelined addition: One of the six stages of the addition of a


pair of elements is performed at each stage in the pipeline.

Each stage of the pipeline has a separate arithmetic unit


designed for the operation to be performed at that stage.

Once stage A has been completed for the first pair of elements,
these elements can be moved to the next stage B while the
second pair of elements moves to the first stage A

Kaushik Banerjee, IEM, CSE 13


Arithmetic Pipeline for Floating Point Addition
The stages of adding two floating point numbers, for example,
0.1234 × 10^5 + 0.5678 × 10^4

1) The exponents are compared: 10^5 > 10^4.
2) Since 5 - 4 = 1, the mantissa of the lesser number (the second one) is shifted right by 1 digit and 1 is added to its exponent. It becomes 0.05678 × 10^5.
3) Now that the exponents of the two numbers are equal, the mantissas are added: 0.1234 + 0.05678 = 0.18018.
4) The exponent of the result is 5. So the final result is 0.18018 × 10^5.
5) The result is normalised if required.
6) Check for exceptions, such as overflow.
7) Rounding off, if required.
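A minimal C sketch (not part of the original slides) of the main steps above, using a decimal mantissa/exponent pair so that the worked example can be reproduced; exception checking and rounding are left out:

    #include <stdio.h>
    #include <math.h>

    /* Each block of statements corresponds to one stage of the decimal
       floating-point addition described above. */
    typedef struct { double mantissa; int exponent; } Fp;

    Fp fp_add(Fp a, Fp b) {
        /* Stages 1-2: compare exponents and align the smaller operand */
        while (a.exponent < b.exponent) { a.mantissa /= 10.0; a.exponent++; }
        while (b.exponent < a.exponent) { b.mantissa /= 10.0; b.exponent++; }
        /* Stage 3: add mantissas */
        Fp r = { a.mantissa + b.mantissa, a.exponent };
        /* Stage 4-5: normalise so that the mantissa stays below 1 */
        while (fabs(r.mantissa) >= 1.0) { r.mantissa /= 10.0; r.exponent++; }
        /* Stages 6-7: exception check and rounding omitted in this sketch */
        return r;
    }

    int main(void) {
        Fp a = { 0.1234, 5 }, b = { 0.5678, 4 };
        Fp c = fp_add(a, b);
        printf("%.5f x 10^%d\n", c.mantissa, c.exponent);  /* 0.18018 x 10^5 */
        return 0;
    }

Running it prints 0.18018 x 10^5, matching step 4 of the example.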
Kaushik Banerjee, IEM, CSE 14
Arithmetic Pipeline for Floating Point Addition

How this is achieved:

The processing unit is designed in such a way that all stages of the computation are separate. When one pair of data, say the first pair, has gone through Stage 1 (and enters Stage 2), the second pair of data can enter Stage 1.

Similarly, when the first pair of data has completed Stage 2 and entered Stage 3, the second pair of data, if it has completed Stage 1, can enter Stage 2. At the same time, the third pair of data can enter Stage 1.

This chain goes on.
Kaushik Banerjee, IEM, CSE 15


Arithmetic Pipeline for Floating Point Addition

For example, at a certain point in time, the following take place simultaneously:

Stage 6 of pair 1 (x1, y1)
Stage 5 of pair 2 (x2, y2)
Stage 4 of pair 3 (x3, y3)
Stage 3 of pair 4 (x4, y4)
Stage 2 of pair 5 (x5, y5)
Stage 1 of pair 6 (x6, y6)

Kaushik Banerjee, IEM, CSE 16


Arithmetic Pipeline for Floating Point Addition

Cycle Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6

Stage
1 x1+y1
2
3
4
5
6

Time
In the beginning, in the very first cycle, when (x1+y1) has started, it is in Stage 1. At
that time, remaining stages are idle.

Kaushik Banerjee, IEM, CSE 17


Arithmetic Pipeline for Floating Point Addition

Cycle Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6

Stage
1 x2+y2
2 x1+y1
3
4
5
6

Time

In the second cycle, when (x1+y1) completed Stage 1 and it enters Stage 2, the next
pair of data (x2+y2) enters Stage 1. At this time, the Stages 3 – 6 are idle.

Kaushik Banerjee, IEM, CSE 18


Arithmetic Pipeline for Floating Point Addition

Cycle Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6

Stage
1 x3+y3
2 x2+y2
3 x1+y1
4
5
6

Time

Like wise, in 6 cycles, all the 6 stages of the pipeline are filled up.

Kaushik Banerjee, IEM, CSE 19


Arithmetic Pipeline for Floating Point Addition

Cycle Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6

Stage
1 x4+y4
2 x3+y3
3 x2+y2
4 x1+y1
5
6

Time

Like wise, in 6 cycles, all the 6 stages of the pipeline are filled up.

Kaushik Banerjee, IEM, CSE 20


Arithmetic Pipeline for Floating Point Addition

Cycle Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6

Stage
1 x5+y5
2 x4+y4
3 x3+y3
4 x2+y2
5 x1+y1
6

Time

Like wise, in 6 cycles, all the 6 stages of the pipeline are filled up.

Kaushik Banerjee, IEM, CSE 21


Arithmetic Pipeline for Floating Point Addition

Cycle Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6

Stage
1 x6+y6
2 x5+y5
3 x4+y4
4 x3+y3
5 x2+y2
6 x1+y1

Time

Like wise, in 6 cycles, all the 6 stages of the pipeline are filled up.
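The fill-up shown in these snapshots can be reproduced with a small C sketch (added here only for illustration); pair i occupies stage s during cycle i + s - 1:

    #include <stdio.h>
    #define STAGES 6
    #define PAIRS  6

    int main(void) {
        /* Print which pair sits in which stage during every cycle. */
        for (int cycle = 1; cycle <= PAIRS + STAGES - 1; cycle++) {
            printf("Cycle %d:", cycle);
            for (int stage = 1; stage <= STAGES; stage++) {
                int pair = cycle - stage + 1;       /* pair occupying this stage */
                if (pair >= 1 && pair <= PAIRS)
                    printf("  stage %d: x%d+y%d", stage, pair, pair);
            }
            printf("\n");
        }
        return 0;
    }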

Kaushik Banerjee, IEM, CSE 22


Architecture Categories

Kaushik Banerjee, IEM, CSE 23


Architecture Categories
Classification/ Taxonomy by Flynn

Michael Flynn's classification of computer architecture is based on instruction and data streams:

                        Single Instruction    Multiple Instruction
Single Data Stream      SISD                  MISD
Multiple Data Stream    SIMD                  MIMD

Kaushik Banerjee, IEM, CSE 24


Flynn’s Classification

Kaushik Banerjee, IEM, CSE 25


Flynn’s Classification

Kaushik Banerjee, IEM, CSE 26


Flynn’s Classification - SISD
Now we will discuss the models SISD, SIMD and MIMD in detail.
MISD however, is hardly used for practical purposes. So, the above basic definition
will suffice for our curriculum.

(I) SISD

[Figure: the Control Unit sends an instruction stream to the Processing Unit; the Memory Unit supplies the instruction stream and exchanges a data stream with the Processing Unit and the Input/Output.]

This is the SISD, the conventional architecture.

Kaushik Banerjee, IEM, CSE 27


Flynn’s Classification - SISD

Instructions are fetched from the Memory Unit and sent to the Control Unit. The control unit, after generating the proper timing signals, sends them to the processing unit.

Data required for processing are made available in either or both of the following ways:
Fetched from an input device through the control unit
Fetched from the memory unit if they are already stored there

However, parallelism can be introduced in SISD by the use of pipelining of instructions.

Kaushik Banerjee, IEM, CSE 28


Flynn’s Classification - SIMD
(II) SIMD: Single Instruction over a multiple data stream

In this architecture, many processing units are under the supervision of a single control unit.

All processors receive the same instruction from the control unit but operate on different items of data.

Needed: a shared memory with multiple interleaved modules, in order to communicate simultaneously with multiple processors.

In this kind of scenario, we can use SIMD.

So, in SIMD,
many processing units are under the supervision of a common control unit;
all processing units receive the same instruction from the control unit but operate on different items of data.
Kaushik Banerjee, IEM, CSE 29
Flynn’s Classification - SIMD
Practical scenario

Part of designing an application is understanding its processing needs. Many applications have the following nature of processing:

c = 125                  ← scalar (not repeated)

For i = 0 to 999
    a(i) = b(i) + c      ← vector (repeated)

or huge matrix manipulations such as multiplication, inversion, etc.

The above indicates the same operation over many tuples of data, mostly independent across iterations. That is, the instruction step
a(2) = b(2) + c
can be executed independently of the instruction step
a(1) = b(1) + c

Kaushik Banerjee, IEM, CSE 30


Flynn’s Classification - SIMD

[Figure: a Control Unit with Main Memory broadcasts instructions to an array of Processing Elements, each with its own Memory Unit; every Processing Element contains a micro control unit, registers and an Arithmetic & Logic Unit.]

Kaushik Banerjee, IEM, CSE 31


Flynn’s Classification - SIMD
The principles behind SIMD are:
Replicate data path, not control
All PEs work in tandem
Common CU coordinates operations

Operation model of SIMD can be represented by a 5-tuple:


M = {N, C, I, M, R}

N: Number of processing elements (PEs) in the machine, e.g., the ILLIAC IV has 64 PEs and the Connection Machine CM-2 has 65,536 PEs (designed by Danny Hillis of Thinking Machines).

C: Set of instructions directly executed by the control unit (CU), including scalar and
program flow instructions.

I: Set of instructions broadcast by the CU to all PEs for parallel execution. These
include arithmetic, logic, data routing, masking and other local operations executed
by each active PE over local data within that PE.
Kaushik Banerjee, IEM, CSE 32
Flynn’s Classification - SIMD

M: Set of masking schemes, where each mask partitions the set of PEs into enabled
and disabled subsets.

R: Set of data routing functions, specifying various patterns to be set up in the


interconnection network for the inter-PE communications.

Kaushik Banerjee, IEM, CSE 33


Flynn’s Classification - SIMD

Almost all SIMD machines built today are based on the distributed
memory model.

• PEs communicate among themselves through routing network


• The array of processing elements is controlled by the common array control unit
• Program and data are loaded into the control memory through the host
computer
• Instruction decoded by the control unit
• Scalar instructions are passed on to the scalar processor directly
• Vector instruction is broadcast to the PEs for parallel execution
• Partitioned data sets are distributed to the local memory units
• Data routing network is under program control through the control unit
• PEs are synchronized by the hardware of the control unit
• Same instruction is executed by all the PEs in the same cycle. However,
PEs can be selectively disabled by masking instruction
Kaushik Banerjee, IEM, CSE 34
SIMD – Distributed Memory

PE: Processing Element

Kaushik Banerjee, IEM, CSE 35


SIMD – Shared Memory
(b) Shared memory model:

Kaushik Banerjee, IEM, CSE 36


SIMD – Shared Memory
Note: m is chosen relatively prime with respect to n so that parallel memory access can be achieved through skewing without conflicts.

An example would be the Burroughs Scientific Processor (BSP) with n = 16 and m = 17.

[Figure: a small skewed allocation with n = 3 PEs (n1, n2, n3) and m = 4 memory modules (m1-m4). PE n1 works on elements 1, 4, 7, ...; n2 on 2, 5, 8, ...; n3 on 3, 6, 9, .... Module m1 holds elements 1, 5, 9, 13, 17; m2 holds 2, 6, 10, 14, 18; m3 holds 3, 7, 11, 15, 19; m4 holds 4, 8, 12, 16, 20.]
Kaushik Banerjee, IEM, CSE 37
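A small C sketch (illustrative only, not the BSP hardware) of why a module count that is relatively prime to the access pattern avoids conflicts: with m = 17 modules and element i placed in module (i * stride) mod m, any stride not divisible by 17 visits all 17 modules before repeating.

    #include <stdio.h>

    int main(void) {
        int m = 17;          /* modules, as in the BSP example (n = 16 PEs) */
        int stride = 4;      /* any stride not divisible by 17 works */
        for (int i = 0; i < 17; i++)
            printf("element %2d -> module %2d\n", i, (i * stride) % m);
        return 0;
    }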
SIMD – Processing Element
Internal structure of a Processing Element

We will use the ILLIAC IV processor as an example.

A, B, C: working registers

D: address register holding the address of the PE

I: index register for internal use

S: status flag register. If the value is 0, the PE is disabled for the current instruction; a value of 1 means the PE is selected.

R: data routing register for sending data to, and receiving data from, other PEs in the network.
Kaushik Banerjee, IEM, CSE 38
Parallel Processing - SIMD
Matrix data manipulation

What is crucial: Distributing the vector elements among the PEMs. Ideally, N
elements of a vector are retrieved from N different PEMs simultaneously. If
the matrix is one dimensional, then try to use one memory location of each
PEM.
If there is a matrix with n rows and m columns, each PEM has to occupy
either row data or column data. It depends on the values of m and n and the
number of PEs.
For example, A[i, j] is stored in i-th memory location of j-th PEM.
[Figure: PE0, PE1, ..., PEn as columns, with local memory locations 0, 1, 2, ..., m-1 as rows; element A[i, j] occupies location i of PEM j.]
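A minimal C sketch (hypothetical sizes, not from the slides) of the distribution just described, where A[i, j] is kept in local word i of PEM j so that one whole row can be read from all PEMs in the same cycle:

    #include <stdio.h>
    #define N 4   /* rows (hypothetical size, for illustration) */
    #define M 4   /* PEMs / columns */

    int main(void) {
        int A[N][M], pem[M][N];            /* pem[j][i] models word i of PEM j */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                A[i][j] = 10 * i + j;
                pem[j][i] = A[i][j];       /* A[i][j] -> word i of PEM j */
            }
        /* Row 2 is now spread over all PEMs at the same local address 2. */
        for (int j = 0; j < M; j++)
            printf("PEM %d, word 2 = %d\n", j, pem[j][2]);
        return 0;
    }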
Kaushik Banerjee, IEM, CSE 39
Vector Processing
Focus:

Large volumes of data or Series of data to be processed

Same operation will be performed over a string of data


elements

A vector instruction involves large arrays of operands .

Classification:

(A) Pipelined vector machines using a few powerful processors equipped with vector hardware → vector supercomputers

(B) SIMD machine model
Kaushik Banerjee, IEM, CSE 40
Vector Data
Vector Data is specified by the following parameters

(A) Base Address: From which memory location the data starts

(B) Length: The number of elements in the vector data

(C) Stride: the address gap between two consecutive elements

For example, a vector operand may:

• start at base memory location 1000;
• have a stride of 5: the subsequent elements will then be stored at addresses 1005, 1010, 1015, 1020, and so on;
• have a length of 100: the last element will then be at memory location 1000 + 99 × 5 = 1495.
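A small C sketch (not part of the original slides) that enumerates the addresses implied by these three parameters; it confirms that the 100th element of this operand sits at address 1495:

    #include <stdio.h>

    int main(void) {
        long base = 1000, stride = 5, length = 100;
        for (long i = 0; i < length; i++) {
            long addr = base + i * stride;         /* address of element i */
            if (i < 3 || i == length - 1)
                printf("element %ld at address %ld\n", i, addr);  /* last: 1495 */
        }
        return 0;
    }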

Kaushik Banerjee, IEM, CSE 41


SIMD – Vector Processing
Array Processors

SIMD can be array processors or associative processors. Associative


processors are content addressable where array elements can be
accessed concurrently, but are of high cost. Array processors use
RAM where arrays of data are to be accessed sequentially.

A synchronous array of parallel processors is called an array processor; it consists of multiple processing elements under the supervision of one control unit.

An array processor can handle single instruction and multiple data


stream (SIMD).

SIMD machines are especially designed to perform 'vector' computations over matrices or arrays of data.
Kaushik Banerjee, IEM, CSE 42
Vector Instruction
Vector Instruction

A vector is a set of scalar data items, e.g., A[1..1000] of integer.


We can represent such a series by a vector, e.g., V1, V2 or V3,
etc.

A vector processor is one that can compute operations on


entire vectors with one simple instruction.

A vector compiler will attempt to translate loops into single


vector instructions.
Example - Suppose we have the following DO loop:

    4 DO 5 I = 1, N
    5 X(I) = Y(I) + Z(I)
    6 CONTINUE
Kaushik Banerjee, IEM, CSE 43
Vector Instruction
This will be translated into one long vector of length n and a vector add
instruction will be executed.

The assembly vector instruction will be something like

VADD Y,Z,X
------------------------------------------------------------------------------------
Why is this more efficient ?

(1) Instruction is fetched and decoded only once. Thus, memory bandwidth
and the control unit overhead are reduced considerably

(2) The vector processor, after receiving the instruction, is told how many pairs of operands it must fetch. These operands have a set pattern of arrangement in memory.

Kaushik Banerjee, IEM, CSE 44


Vector Instruction
The vector instructions can be of the following types:

Format
V1 op V2 → V3 : Vector binary operation: an operation between two vectors that produces another vector. An example is adding each element of an array (V1) to the corresponding element of another array (V2) to produce the elements of a third array (V3).

V1 op S → V2 : Vector scaling operation: an operation of each element of a vector (V1) with a common scalar quantity (S) to produce the elements of another vector (V2). An example is multiplying each element of an array by a constant number.

V1 op V2 → S : Vector binary reduction: an operation between two vectors that produces a scalar quantity. An example is the dot product of two arrays.
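The three formats can be written as equivalent scalar loops. The following C sketch is only illustrative (the operation is fixed to + or * for concreteness); a vector processor would perform each whole loop as one instruction:

    /* V1 op V2 -> V3 : vector binary operation */
    void vec_binary(int n, const double *v1, const double *v2, double *v3)
    { for (int i = 0; i < n; i++) v3[i] = v1[i] + v2[i]; }

    /* V1 op S -> V2 : vector scaling operation */
    void vec_scale(int n, const double *v1, double s, double *v2)
    { for (int i = 0; i < n; i++) v2[i] = v1[i] * s; }

    /* V1 op V2 -> S : vector binary reduction (dot product) */
    double vec_reduce(int n, const double *v1, const double *v2)
    { double s = 0; for (int i = 0; i < n; i++) s += v1[i] * v2[i]; return s; }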

Kaushik Banerjee, IEM, CSE 45


Vector Instruction
M(1:n) → V1 : Vector load: copying data elements from memory to the vector operand.

V1 → M(1:n) : Vector store: copying data elements from the vector operand to memory.

Uop V1 → V2 : Unary vector operation: applying some operation (e.g., negation) to each element of a vector (V1) to produce another vector (V2).

Uop V1 → S : Unary vector reduction: applying some operation (e.g., average / maximum / sum) to all elements of a vector (V1) to produce a scalar (S).

Kaushik Banerjee, IEM, CSE 46


Vector Instruction

Advantage of vectors: the loop is, in effect, unrolled.

That means the operations on all the elements of a vector can be carried out independently of each other. There is no need to perform a separate memory access after the operation on each element in order to get the next element. In vector processing, a series of elements can be fetched together from memory and operated on without waiting for each other.

Kaushik Banerjee, IEM, CSE 47


Vector Instruction
Pseudo vector instructions

Vector Load/Store

Vector load/store instructions provide the ability to do strided


and scatter/gather memory accesses, which take data elements
distributed throughout memory and pack them into sequential
vectors/streams placed in vector/stream registers. This
promotes data locality. It results in less data pollution, since
only useful data is loaded from the memory system. It provides
latency tolerance because there can be many simultaneous
outstanding memory accesses. Vector instructions such as VLD
and VST provide this capability.

Kaushik Banerjee, IEM, CSE 48


Vector Instruction
As illustrated in figure 3,
VLD V0, (V1) loads memory elements into V0 specified by the
memory addresses (indices) stored in V1.

VST (V1), V0 works in a similar manner.


VST (S1), 4, V0, stores the contents of V0 starting at the address
contained in S1 with stride of 4.
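A minimal C sketch (not the actual VLD/VST hardware; V0, V1 and S1 are modelled here as plain arrays and an index) of the three access patterns named above:

    /* Gather:   VLD V0,(V1)   - load the elements whose addresses are in V1 */
    void gather(int n, double *V0, const long *V1, const double *mem)
    { for (int i = 0; i < n; i++) V0[i] = mem[V1[i]]; }

    /* Scatter:  VST (V1),V0   - store V0 to the addresses held in V1 */
    void scatter(int n, const double *V0, const long *V1, double *mem)
    { for (int i = 0; i < n; i++) mem[V1[i]] = V0[i]; }

    /* Strided:  VST (S1),4,V0 - store V0 starting at S1 with a stride of 4 */
    void store_strided(int n, const double *V0, double *mem, long S1, long stride)
    { for (int i = 0; i < n; i++) mem[S1 + i * stride] = V0[i]; }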

Kaushik Banerjee, IEM, CSE 49


Vector Instruction
(1) Vector-Vector

Format: Vj op Vk → Vi

Example:
V0 = V1 × V2
V3 = V2 + V4

[Figure: registers Vj and Vk feed a functional unit or pipeline whose result goes to register Vi. One pipeline for each type of operation.]

Kaushik Banerjee, IEM, CSE 50


Vector Instruction
(2) Vector-Scalar

Format: Vj op Sk → Vi

Example:
V0 = V1 × 2
V3 = V2 + 0.5

[Figure: register Vj and scalar register Sk feed a functional unit or pipeline whose result goes to Vi. One pipeline for each type of operation.]

Kaushik Banerjee, IEM, CSE 51


Vector Instruction
(3) Vector-Memory

Vector load:  M(1:n) → Vi
Vector store: Vi → M(1:n)

[Figure: data moves in both directions between Main Memory and vector register Vi.]
Kaushik Banerjee, IEM, CSE 52
Vector Instruction
(4) Vector Reduction

Format: Uop Vj → Si

Example: S = Σ Vj[i]  (sum of all elements)

[Figure: register Vj feeds a functional unit or pipeline whose scalar result goes to register Si. One pipeline for each type of operation.]

Kaushik Banerjee, IEM, CSE 53


Vector Instruction
(5) Vector Binary Reduction

Format: Vj op Vk → Si

Example: S = Σ Vj[i] × Vk[i]  (dot product)

[Figure: registers Vj and Vk feed a functional unit or pipeline whose scalar result goes to register Si. One pipeline for each type of operation.]

Kaushik Banerjee, IEM, CSE 54


Vector Instruction
(6) Gather: fetches into a vector register the non-zero elements of a sparse vector, using a vector of indices.

[Figure: the base address A0 and the index vector V0 are given; the data register V1 is to be filled from memory.]
Kaushik Banerjee, IEM, CSE 55




Vector Instruction
A0 = Starting memory address
V0 = Indices of non zero elements w.r.t. the base address
in memory
VL = Length of the vector
V1 = Data register

The i-th location of V1 will get the value of the (Base + V0[i])-th memory location:

V1[i] ← Memory[A0 + V0[i]]

Example: for i = 0, V0[0] = 4; so the (Base + 4)-th, i.e., 104th, memory location is copied to V1[0].
For i = 1, V0[1] = 2; so the (Base + 2)-th, i.e., 102nd, memory location is copied to V1[1].
Kaushik Banerjee, IEM, CSE 57
Vector Instruction
(7) Scatter

It stores a vector register into memory as a sparse vector whose non-zero entries are indexed.

The structure is the same as for gather; the data assignment is in the opposite direction:

Memory[A0 + V0[i]] ← V1[i]

Example: entry 0 of V1 is copied to memory location 100 + V0[0], i.e., 104. So M[104] ← V1[0], i.e., 600.
Entry 1 of V1 is copied to memory location 100 + V0[1], i.e., 102. So M[102] ← V1[1].
Kaushik Banerjee, IEM, CSE 58
Vector Instruction
Note

Gathering is done before a vector operation to fetch the


scattered data element into a register

Scattering is done after a vector operation to put back the


result into memory locations.

Kaushik Banerjee, IEM, CSE 59


Vector Instruction
(8) Masking

Step 1. The elements of V0 are scanned and the mask register VM is populated accordingly (a 1 marks a non-zero element).

Step 2. V1 is then populated from VM with the indices of the non-zero elements.
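A small C sketch of the two steps (the contents of V0 are hypothetical, chosen only for illustration):

    #include <stdio.h>

    int main(void) {
        int V0[8] = { 0, 5, 0, 7, 0, 0, 3, 9 };   /* hypothetical sparse vector */
        int VM[8], V1[8], k = 0;
        for (int i = 0; i < 8; i++)
            VM[i] = (V0[i] != 0);                  /* Step 1: build the mask    */
        for (int i = 0; i < 8; i++)
            if (VM[i]) V1[k++] = i;                /* Step 2: compress indices  */
        for (int i = 0; i < k; i++)
            printf("%d ", V1[i]);                  /* prints: 1 3 6 7           */
        printf("\n");
        return 0;
    }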
Kaushik Banerjee, IEM, CSE 62
Vector Processing

Summary

Same operation is performed over a stream of data

Instruction fetch and decode takes place only once

A vector instruction involves a large array of operands

Vector processor is a co-processor specially designed to


perform vector computation

Vector operand: Series of data on which same


operation will be carried out
Kaushik Banerjee, IEM, CSE 63
Vector Processing

Summary

Vector computation: Same operation in all elements in


the series

Vector operands can be in memory or in registers.


Registers are faster but limited . Memory is slower but
unlimited.

Uses interleaved memory for parallel fetch

The elements of vector operands are fetched from memory and also written back to the memory simultaneously.
Kaushik Banerjee, IEM, CSE 64
Vector Processing & Architecture

Kaushik Banerjee, IEM, CSE 65


Vector Processing

Kaushik Banerjee, IEM, CSE 66


Vector Processing

Vector Functional Pipelining

For example, consider the following:


v0 = v1 * v2
v3 = v2 + v4

The Multiplication and Addition can be carried on in parallel, as they are not
dependent on each other.

Kaushik Banerjee, IEM, CSE 67


Vector Processing

Arithmetic Pipelining

Pipelined addition: One of the six stages of the floating point addition for a pair of
elements is performed at each stage of the pipeline.

The stages are, for addition:


1. Compare Exponent
2. Shift Mantissa of the element with lower exponent
3. Add Mantissas
4. Normalize

Each stage of the pipeline has a separate arithmetic unit designed for the operation
to be performed at that stage.

Once stage A has been completed, the elements with the intermediate results can
be moved to the next stage B, while the second pair of elements enter stage A.

Kaushik Banerjee, IEM, CSE 68


Vector Processing

Vector Chaining

For example, consider the following:
v0 = v1 * v2
v3 = v0 + v4

The multiply pipeline produces V0[0] = V1[0]*V2[0], V0[1] = V1[1]*V2[1], V0[2] = V1[2]*V2[2], V0[3] = V1[3]*V2[3], ...
The add pipeline consumes these results: V3[0] = V0[0]+V4[0], V3[1] = V0[1]+V4[1], V3[2] = V0[2]+V4[2], V3[3] = V0[3]+V4[3], ...

After V0 is computed for the first pair of V1 and V2, it can be passed on to the add
pipeline.

During this process, there will be a point when operands from V1 and V2 still need to
be fetched and sent to the pipeline, and results placed into V0 are just leaving the
pipeline. The process of vector chaining will send the result from V0 directly to the
pipelined adder (at the same time it is stored in the vector register), and combined
with the appropriate value from V4.

Thus the second instruction will be able to begin before the first is finished and the
machine creates 2 results as opposed to 1. This is very similar to the MIPS pipeline
and forwarding. The result of this is approximately 3 times the peak performance
Kaushik Banerjee, IEM, CSE 69
Vector Processing
Time:  tau    2 tau  3 tau  4 tau  5 tau  6 tau  7 tau  8 tau
Step
A      x1+y1  x2+y2  x3+y3  x4+y4  x5+y5  x6+y6  x7+y7  x8+y8
B             x1+y1  x2+y2  x3+y3  x4+y4  x5+y5  x6+y6  x7+y7
C                    x1+y1  x2+y2  x3+y3  x4+y4  x5+y5  x6+y6
D                           x1+y1  x2+y2  x3+y3  x4+y4  x5+y5
E                                  x1+y1  x2+y2  x3+y3  x4+y4
F                                         x1+y1  x2+y2  x3+y3



Kaushik Banerjee, IEM, CSE 70
Vector Processing

Some vector architectures provide greater efficiency by


allowing the output of one pipeline to be chained directly into
another pipeline. This feature is called chaining and eliminates
the need to store the result of the first pipeline before sending
it into the second pipeline. Figure 5 demonstrates the use of
chaining in the computation of a saxpy vector operation:
a*x + y,

where x and y are vectors and a is a scalar constant.
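As a point of reference, the saxpy operation itself is just the following loop; this C sketch shows the computation, not the chained hardware. With chaining, each product a*x[i] flows straight from the multiply pipeline into the adder, so one multiply and one add complete per cycle once both pipelines are full:

    /* saxpy: y = a*x + y, for vectors x and y of length n */
    void saxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* multiply result chained into the adder */
    }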

Kaushik Banerjee, IEM, CSE 71


Vector Processing

Observe that it still takes 6*tau units of time to complete the sum of the first pair of elements, but the sum of the next pair is ready in only tau more units of time, and this pattern continues for each succeeding pair. This means that the time Tp to do the pipelined addition of two vectors of length n is

Tp = 6*tau + (n-1)*tau = (n + 5)*tau.

Each of the remaining (n-1) results is churned out in one further unit of time (tau).

The first 6*tau units of time are required to fill the pipeline and to obtain the first result. After the last result, xn + yn, is completed, the pipeline is emptied out, or flushed.
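A short C sketch (added here for illustration, taking the serial time to be Ts = 6*n*tau as implied by the comparison on the next slide) that tabulates Ts, Tp and the resulting speedup:

    #include <stdio.h>

    int main(void) {
        for (int n = 1; n <= 1000; n *= 10) {
            double Ts = 6.0 * n;        /* serial time, in units of tau    */
            double Tp = n + 5.0;        /* pipelined time, in units of tau */
            printf("n=%4d  Ts=%6.0f tau  Tp=%6.0f tau  speedup=%.2f\n",
                   n, Ts, Tp, Ts / Tp);
        }
        return 0;
    }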

Kaushik Banerjee, IEM, CSE 72


Vector Processing

Comparing the equations for Ts and Tp, it is clear that

(n + 5)*tau < 6*n*tau, for n > 1.

Thus, this pipelined version of addition is faster than the serial version by almost a
factor of the number of stages in the pipeline. This is an example of what makes
vector processing more efficient than scalar processing. For large n, the pipelined
addition for this sample pipeline is about six times faster than scalar addition.

In this discussion, we have assumed that the floating-point addition requires six
stages and takes 6*tau units of time. There is nothing magic about this number 6; in
fact, for some architectures, the number of stages in a floating-point addition may be
more or less than six. Further, the individual stages may be quite different from the
ones listed in section on pipelined addition. The operations at each stage of a
pipeline for floating-point multiplication are slightly different than those for addition;
a multiplication pipeline may even have a different number of stages than an
addition pipeline. There may also be pipelines for integer operations..

Kaushik Banerjee, IEM, CSE 73


Vector Processing

Vector Chaining used to compute a*x + y

Kaushik Banerjee, IEM, CSE 74


Vector Processing

Vector chaining is used to compute a scalar value a times a vector x, adding the elements of the resultant vector to the elements of a second vector y (of the same length).

Chaining can double the number of floating-point operations


that are done in tau units of time. Once both the multiplication
and addition pipelines have been filled, one floating-point
multiplication and one floating-point addition (a total of two
floating-point operations) are completed every tau time units.
Conceptually, it is possible to chain more than two functional
units together, providing an even greater speedup. However
this is rarely (if ever) done due to difficult timing problems.

Kaushik Banerjee, IEM, CSE 75


Vector Processor Architecture or Vector Super Computer

Vector Processing can be done in two ways

SIMD machine model

Pipelined vector machines using a few powerful processors equipped with vector hardware. These are called vector supercomputers, or the vector processor architecture.

We have discussed SIMD before. Now we discuss Vector processor


architecture or Vector Super Computer

While vector processing can be done by the SIMD architecture, by utilizing a spatial distribution of processing elements, there is another advanced architecture which implements parallel processing by means of functional pipelining.

Kaushik Banerjee, IEM, CSE 76


Vector Architecture

Let us consider two vector operands A and B. These are each one-dimensional arrays of equal size. We want to compute the dot product of these two: C = A1*B1 + A2*B2 + A3*B3 + ...

Kaushik Banerjee, IEM, CSE 77


Vector Architecture
This architecture takes advantage of the fact that each
arithmetic operation has multiple stages. Say,
multiplication has 5 stages and addition has 4 stages.

Each multiplication will go through the various 5 stages


of the multiplier pipeline. When A1 and B1 will be in the
Stage 2, A2 and B2 will be in the Stage 1.

At some point of time, the scenario will look like this:

A1 and B1 in Stage 5, A2 and B2 in Stage 4, A3 and B3 in Stage 3, A4 and B4 in Stage 2, A5 and B5 in Stage 1; A6 and B6 are yet to enter the pipeline.
Kaushik Banerjee, IEM, CSE 78
Vector Architecture
Next, the product A1*B1 leaves the multiply pipeline and enters the add pipeline. It gets added to the previous value held by the add pipeline, say C, which we assume has been initialized to 0. So the addition C = 0 + A1*B1 starts now. When this first sum, i.e., C = A1*B1, comes out of the add pipeline, it is fed back into the add pipeline. By then, A5*B5 has come out of the multiply pipeline, so the addition of A1*B1 and A5*B5 starts. In the next cycle, the addition of A2*B2 and A6*B6 starts, and so on.
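The effect of this feedback can be modelled in software as four interleaved partial sums, one per addition in flight in the 4-stage adder. The following C sketch is only an illustration of that bookkeeping (the data values are hypothetical), not of the pipeline hardware itself:

    #include <stdio.h>

    int main(void) {
        double A[20], B[20], partial[4] = { 0, 0, 0, 0 };
        for (int i = 0; i < 20; i++) { A[i] = i + 1; B[i] = 1.0; }
        for (int i = 0; i < 20; i++)
            partial[i % 4] += A[i] * B[i];   /* each slot models one in-flight sum */
        double C = partial[0] + partial[1] + partial[2] + partial[3];
        printf("C = %f\n", C);               /* 210 for this data */
        return 0;
    }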
Kaushik Banerjee, IEM, CSE 79
Vector Architecture
[Figure: a vector processor attached to the scalar processor; the host computer, with input/output devices and mass storage, connects to the system's main memory.]

Kaushik Banerjee, IEM, CSE 80




Vector Architecture
As shown in the diagram, the vector processor is attached to the
scalar processor as an optional feature.

Program and data are first loaded into the main memory through a
host computer.

Instruction decoding is done by the Scalar Control Unit.

Depending on whether the operation is scalar or vector, the decoded instruction will be sent to the scalar or the vector processor, respectively.

Vector instruction will be sent to the vector control unit (of the
vector processor)

The vector control unit will supervise the flow of vector data
between the main memory and the vector functional pipelines
Kaushik Banerjee, IEM, CSE 82
Vector Architecture
Vector Processing Models

(A) Register to Register Model (as shown in the diagram)

Vector Registers hold the operands, intermediate and


final vector results.
Vector functional pipelines retrieve operands from and
put results into the vector registers.
Vector registers are programmable
Vector registers can be of fixed length or reconfigurable (an example of the latter: the Fujitsu VP2000 series)

Kaushik Banerjee, IEM, CSE 83


Vector Architecture
Vector Processing Models (Continued)

(B) Memory to Memory model

Uses a vector stream unit in place of vector registers


Vector operands and results are directly retrieved from the main memory in superwords (e.g., 512 bits, as in the Cyber 205).

Kaushik Banerjee, IEM, CSE 84


Vector Architecture
A comparison

Memory to Memory: Slower but can support very long


length of vector operands. (suitable for longer vectors)

Register to Register: Faster, but limited length of vector


operands (suitable for smaller vectors)

Kaushik Banerjee, IEM, CSE 85


Vector Architecture

A brief history

Pipelined vector supercomputers started with uniprocessor models, e.g., the Cray 1 (1976).

Recent supercomputer systems offer both uniprocessor and multiprocessor models, e.g., the Cray Y-MP series.

Most high-end mainframes offer multiprocessor models with add-on vector hardware, e.g., the VAX 9000 and IBM 390/VF.

Kaushik Banerjee, IEM, CSE 86


Vectorization
Benefit and How it operates

A FORTRAN example

    5 DO 10 I=1,100
   10 Z(I) = X(I) * Y(I)
   15 CONTINUE

(Gets compiled by a vectorizing compiler.)

3 cycles to fetch the instruction, decode it and fetch the operand; 1 cycle to execute.

For 100 iterations:
On a scalar machine: (3 + 1) * 100 = 400 cycles are required
On a vector machine: 3 + 1 * 100 = 103 cycles are required
Kaushik Banerjee, IEM, CSE 87
Vectorization
I. First, the vectors X and Y are loaded from memory into vector registers, simultaneously.

II. One set of elements is loaded in a single clock cycle.

III. Finally, the computed values are taken from the result register and put into Z in memory.

Vector compiler translates the code into something like

VLOAD X VR1
VLOAD Y VR2
VMULT VR1 VR2 VR3
VSTORE VR3 Z
Kaushik Banerjee, IEM, CSE 88
Vector Architecture
Summary

The vector machine is faster at performing mathematical


operations on larger vectors than the scalar or MIPS
(Microprocessor without Interlocked Pipeline Stages)
machines.

The vector processing computer’s vector register architecture


makes it better able to compute vast amounts of data quickly.

Conclusion

Not widely popular today but still used in home PCs as SIMD
units which augment the scalar CPU when necessary. (usually
multimedia applications)
Kaushik Banerjee, IEM, CSE 89
Vector Architecture
Implementation examples

To change the brightness or contrast of an image, a value is added to or subtracted from each R, G, B set of data.

Limitations

One should have separate and longer registers for vector


processing.

Saving a register by reusing scalar registers might create


contention between SIMD and normal floating point
instructions.

Kaushik Banerjee, IEM, CSE 90


Vector Architecture
Basic Vector Architecture
[Figure: basic vector architecture; vector registers of 64 elements feed a functional unit.]

Kaushik Banerjee, IEM, CSE 91


Vector Processing
Note: Very Important

(A)Loop structure program code is translated to vectorized


assembly code using vectorizing compiler.

A well vectorized code can easily achieve a speed up of 10-20


times.

Vectorization rate: How much vectorization has been done to the


code.

(B) Assembly Language Instruction Set must support vector


operation.

(C) Of course, the hardware must have a vector processor.
Kaushik Banerjee, IEM, CSE 92
Vector Processing & Processor

Summary

Same operation is performed over a stream of data

Instruction fetch and decode takes place only once

A vector instruction involves a large array of operands

Vector processor is a co-processor specially designed to


perform vector computation

Vector operand: Series of data on which same


operation will be carried out
Kaushik Banerjee, IEM, CSE 93
Vector Processing & Processor

Note

For register-based vector processors, long vectors exceeding the register length n must be segmented to fit the vector registers, n elements at a time.

E.g., if the vector length is > 10 but <= 20, and the vector register length is 10, then the vector operation will have to be performed 2 times.

For example, if the number of vector elements is 15, then:
the first time, 10 elements are copied to the register and processed;
the second time, the remaining 5 elements are copied to the register and processed.
Kaushik Banerjee, IEM, CSE 94
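This segmentation is commonly called strip-mining. A minimal C sketch (assuming a hypothetical register length MVL of 10) of the segmented loop:

    #define MVL 10   /* hypothetical vector register length */

    void vadd_stripmined(int n, const double *x, const double *y, double *z)
    {
        for (int start = 0; start < n; start += MVL) {
            int len = (n - start < MVL) ? n - start : MVL;  /* last partial strip */
            for (int i = 0; i < len; i++)                   /* one vector operation */
                z[start + i] = x[start + i] + y[start + i];
        }
    }

For n = 15 this performs one full strip of 10 elements and one partial strip of 5, exactly as in the example above.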
Vector Access Memory Structure
Note

Flow of vector operands between memory and vector registers


is pipelined with multiple access paths.

Vector Operand specification

Location of vector operand in memory must be specified by Base


Address, Stride and Length.

If row elements are stored continuously with unit stride, the


column elements will be stored with stride of n. (Diagonal
elements will be spaced at stride n+1)

Kaushik Banerjee, IEM, CSE 95


Vector Access Memory Structure
Note

Arrangement of vector in memory can be row major or column


major.

The arrangement of the vector data in memory should be


interleaved to allow pipelined or parallel access

Since each vector register has fixed number of components, only


a segment of vector data from memory can be loaded per cycle.

Kaushik Banerjee, IEM, CSE 96


Memory Interleaving

Kaushik Banerjee, IEM, CSE 97


Memory Interleaving

Memory is not desired to be monolithic, but a collection of


modules.

Arrangement of vector data in memory should be interleaved


to allow pipeline/parallel access. Flow of vector operands
between memory and vector registers is pipelined with
parallel access path.

Same word address is applied to all memory modules


simultaneously.

Kaushik Banerjee, IEM, CSE 98


Memory Interleaving

Memory is not desired to be monolithic, but a collection of


modules.

Same word address is applied to all memory modules


simultaneously.
Example:

Word address: 5 bits; Module address: 3 bits;

Kaushik Banerjee, IEM, CSE 99


Memory Interleaving

Word address: 5 bits; Module number: 3 bits

So, number of modules = 2^3 = 8
Number of words in each module = 2^5 = 32

10000 000 → word 16 of module 0
10000 111 → word 16 of module 7
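A small C sketch (illustrative only, assuming the module number occupies the low-order 3 bits as in the example above) that decodes such an 8-bit address:

    #include <stdio.h>

    int main(void) {
        unsigned addr = 0x87;                    /* binary 10000 111 */
        unsigned module = addr & 0x7;            /* low-order 3 bits: module 7 */
        unsigned word   = (addr >> 3) & 0x1F;    /* high-order 5 bits: word 16 */
        printf("module %u, word %u\n", module, word);
        return 0;
    }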

Kaushik Banerjee, IEM, CSE 100


Memory Interleaving
M0    M1    M2    ...   M7

  0     1     2   ...     7
  8     9    10   ...    15
 16    17    18   ...    23
 24    25    26   ...    31
 32    33    34   ...    39
 40    41    42   ...    47
 48    49    50   ...    55
 56    57    58   ...    63
 64    65    66   ...    71
 72    73    74   ...    79
 ...
240   241   242   ...   247
248   249   250   ...   255

MDB   MDB   MDB   ...   MDB

8 modules with 32 words in each


Kaushik Banerjee, IEM, CSE 101
Memory Interleaving
Two kinds of memory interleaving: Low order and High order

Consider a memory bank of 256MB:

M0 M1 M2 M3
64 MB 64 MB 64 MB 64 MB

It consists of 4 memory modules, 64MB each.

256 MB = 2^28 bytes; so we need a total of 28 address lines to access this memory bank.
We will split these address lines into two parts: one for selecting the module and the other for addressing a word within the module.
Kaushik Banerjee, IEM, CSE 102
Memory Interleaving

Each memory module is of size 64 MB = 2^26 bytes.

Hence, in order to access the memory words within a module, we need 26 address lines:

A27 - A26 : Module Select
A25 - A0  : Word Select (within a module)

The remaining bits, i.e., A27 - A26 (2 bits), are used to select the module.

This is called High order memory interleaving.


Kaushik Banerjee, IEM, CSE 103
Memory Interleaving

The other arrangement is

A27 – A02 : Word Select (within a module)


A01 – A00 : Module Select

This is called Low order memory interleaving.
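The two address splits can be compared with a short C sketch (added for illustration; the address value is arbitrary):

    #include <stdio.h>

    int main(void) {
        unsigned long addr = 70000000UL;                  /* any address < 2^28   */
        unsigned hi_module = (addr >> 26) & 0x3;          /* high order: A27-A26  */
        unsigned long hi_word = addr & 0x3FFFFFF;         /*             A25-A0   */
        unsigned lo_module = addr & 0x3;                  /* low order:  A1-A0    */
        unsigned long lo_word = (addr >> 2) & 0x3FFFFFF;  /*             A27-A2   */
        printf("high order: module %u, word %lu\n", hi_module, hi_word);
        printf("low order : module %u, word %lu\n", lo_module, lo_word);
        return 0;
    }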

Kaushik Banerjee, IEM, CSE 104


Memory Interleaving
HIGH ORDER MEMORY INTERLEAVING

[Figure: module M0 holds addresses 0 - 67108863 and module M1 holds 67108864 - 134217727; M2 and M3 hold the next two 64 MB ranges.]

Kaushik Banerjee, IEM, CSE 105


Memory Interleaving
High Order Interleaving:

Higher bits (MSB) are used for selecting a memory module and
lower bits (LSB) are used to address word in a module.

Address arrangement in high order interleaving:

Module   From address (MB)   To address (MB)
0        0                   63
1        >63                 127
2        >127                191
3        >191                255

Kaushik Banerjee, IEM, CSE 106


Memory Interleaving

In high order interleaving, consecutive words are stored in the same module, except when a module boundary is crossed. For example, 0-63 MB are stored in M0, above 63 MB to 127 MB in M1, above 127 MB to 191 MB in M2, and above 191 MB to 255 MB in M3.

Usefulness
High order interleaving is useful in cases where multiple
processing elements are accessing different zones of the
memory at the same time.

Kaushik Banerjee, IEM, CSE 107


Memory Interleaving
LOW ORDER MEMORY INTERLEAVING

Kaushik Banerjee, IEM, CSE 108


Memory Interleaving

Low Order Interleaving: Higher bits (MSB) are used


for addressing word in a module and lower bits (LSB) are
used to select memory module.

Address arrangement in low order

Module Word Address


0 0 4 8 12 ……….. 252
1 1 5 9 13 ……….. 253
2 2 6 10 14 ……….. 254
3 3 7 11 15 ……….. 255

Kaushik Banerjee, IEM, CSE 109


Memory Interleaving

In low order interleaving, consecutive words are stored in


different modules. That is, location n and location n+1 will
be in different modules. Location 0 will be in M0, location
1 will be in M1, location 2 in M2 and location 3 will be
stored in M3. Again, location 4 will be in M0, location 5 in
M1, location 6 in M2 and location 7 will be stored in M3
and so on.

Therefore, vector stride in a module will be equal to the


number of modules
.

Kaushik Banerjee, IEM, CSE 110


Memory Interleaving
Usefulness

Low order interleaving is useful in case of vector


operands where vector elements need to be fetched
simultaneously from consecutive memory locations. If
memory is organized in a contiguous manner, then low
order interleaving will result in very poor memory
utilization, in fact not using the interleaving at all.

Kaushik Banerjee, IEM, CSE 111


Memory Interleaving
How it works

Word addresses are asserted simultaneously on all


modules. However, module addresses have to be
asserted one by one.

Suppose we are fetching n-th word of each module.

The value of n is placed on the word select address lines.


We have to select a module. We will start from module 0.
First, 000 is placed on the module select lines and
thereby module 000 gets selected.

Kaushik Banerjee, IEM, CSE 112


Memory Interleaving
So, module 0 starts accessing word n.

Before the word is accessed from module 0, the system can


access the same word from the next module, i.e., module 1.

So, module 1 starts accessing word n.


Before the access is over, module address 2 is placed on the
module select lines and the system initiates access of the n-
th word of module 2.

When array processing is required, low order interleaving is


used for pipelining. As the consecutive memory locations
are stored in different modules, they can be accessed in
parallel to be consumed by different processing units.
Kaushik Banerjee, IEM, CSE 113
Memory Interleaving

For example, for operations on an array A, the elements


A[0], A[1], A[2]….will be stored in different modules, i.e.,
M0, M1, M2…respectively. Therefore, they can be
accessed in a pipelined fashion and loaded into vector
registers or local PE memories simultaneously

Kaushik Banerjee, IEM, CSE 114


Memory Interleaving
Some Sample Applications

It is important to keep some actual applications in mind,


such as:

• Gaussian elimination. Recall that a typical operation is


to add a multiple of one row of our matrix to another row
of the matrix.

• Matrix multiplication. Recall that to do AB = C for


matrices A, B and C, we get Cij by taking the inner
product (“dot product”) of A’s ith row and B’s jth column.

Kaushik Banerjee, IEM, CSE 115


Memory Interleaving

• Image processing, such as in Adobe Photoshop or the open-source GIMP. Here the (i,j) element in our matrix would be the brightness at pixel (i,j). If we wish our image to look smoother, then at each pixel we would average the brightness at the neighbouring points, and make this average value the new brightness at the given point.

Kaushik Banerjee, IEM, CSE 116


Memory Interleaving

Categories of interleaving

Interleaving is of the following kinds:

• Simultaneous Access (S-Access)


• Concurrent Access (C-Access) and
• Concurrent-Simultaneous Access (C/S Access)

Kaushik Banerjee, IEM, CSE 117


Memory Interleaving
Simultaneous Access (S-Access) memory organization


Kaushik Banerjee, IEM, CSE 118


Memory Interleaving
This is the simplest technique.

A single access returns m consecutive words of information from the m memory modules.

All the modules are accessed simultaneously.

Using the low-order address bits, the information from a particular module can then be selected.

A data latch is associated with each module. For a fetch


operation, the information from each module is gated
into its latch, whereupon the multiplexer can be used to
direct the desired data to the single word bus.
Kaushik Banerjee, IEM, CSE 119
Memory Interleaving
The memory access time line diagram is as depicted below

Memory access
time = T (for all)
Latch delay = τ

Total time for k


consecutive words
=T+k*τ

Kaushik Banerjee, IEM, CSE 120


Memory Interleaving
Concurrent Access (C-Access) memory organization

Kaushik Banerjee, IEM, CSE 121


Memory Interleaving
Concurrent Access (C-Access) memory organization

The common word address is impressed concurrently.

Kaushik Banerjee, IEM, CSE 122


Memory Interleaving
Concurrent Access (C-Access) memory organization
Module selects are impressed one by one.

The common word address is impressed concurrently.

Kaushik Banerjee, IEM, CSE 123


Memory Interleaving
The m-way low order interleaved memory structure allows m memory modules to be accessed concurrently in an overlapped manner.

The access cycles in different memory modules are staggered. The low order a bits select the module and the high order b bits select the word within each module, where m = 2^a and a + b = n is the address length.

To access a vector of stride 1, successive addresses are


latched in the address buffer at the rate of one per cycle.
Effectively it takes m minor cycles to fetch m words,
which equals one (major) memory cycle as in the time
graph below.

Kaushik Banerjee, IEM, CSE 124


Memory Interleaving
After the addressed word is found in a module, the
contents are brought to the respective MDB.

We can depict the behaviour by the time graph

Θ : Major cycle [access time]


τ : Minor cycle [churn out time]

Above, each word is being fetched from different modules.


It is clear that there is a good overlap in the access of
adjacent modules.

So, how much faster is the pipelining access?


It is measured by M = Θ / τ and is called degree of
interleaving.
Kaushik Banerjee, IEM, CSE 125
Memory Interleaving
If the stride is 2, the successive accesses must be separated by two minor cycles in order to avoid access conflicts. This reduces memory throughput by one-half.

If the stride is 3, there is no module conflict and the maximum throughput (m words) results.

In general, C-access will yield the maximum throughput of m words per memory cycle if the stride is relatively prime to m, the number of interleaved memory modules.
Kaushik Banerjee, IEM, CSE 126
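A small C sketch (added for illustration) that counts, for m = 8 low-order interleaved modules, how many distinct modules a burst of 8 consecutive vector elements touches at each stride; strides that are relatively prime to 8 use all 8 modules, stride 2 uses 4, stride 4 uses 2, and stride 8 uses only 1:

    #include <stdio.h>

    int main(void) {
        int m = 8;
        for (int stride = 1; stride <= 8; stride++) {
            int hit[8] = {0}, distinct = 0;
            for (int i = 0; i < 8; i++) {
                int module = (i * stride) % m;     /* low-order interleaving */
                if (!hit[module]) { hit[module] = 1; distinct++; }
            }
            printf("stride %d -> %d of %d modules used\n", stride, distinct, m);
        }
        return 0;
    }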
Memory Interleaving
(C) C/S-access memory organization

This is a combination of the S-access and C-Access


organizations, where n access buses are used with m
interleaved memory modules attached to each bus. The m
modules on each bus are m-way interleaved to allow S-access.
In each memory cycle, at most m*n words are fetched if the n
buses are fully used with pipelined memory accesses.

The C/S access memory is suitable for vector multiprocessor


configurations. It provides parallel pipelined access of vector
data set with high bandwidth. A special vector cache design is
needed within each processor in order to guarantee smooth
data movement between the memory and multiple vector
processors.
Kaushik Banerjee, IEM, CSE 127
Memory Interleaving

Kaushik Banerjee, IEM, CSE 128


Memory Interleaving

S Access

Kaushik Banerjee, IEM, CSE 129


Memory Interleaving

C-Access

Kaushik Banerjee, IEM, CSE 130


Memory Interleaving

C/S-Access

Kaushik Banerjee, IEM, CSE 131


Network for Inter Processor & Memory
Communication

Kaushik Banerjee, IEM, CSE 132


Network Properties and Routing

The topology of an interconnection network can be

Static or Dynamic

Static : Point to point direct connections which will


not change during program execution

Dynamic: Dynamically configured using switched


channels

Ref: Kai Hwang, p. 76
Kaushik Banerjee, IEM, CSE 133


Network Properties and Routing
Some definitions

Node Degree : The number of edges (I/O links/channels) incident on a node

Network Diameter : Maximum shortest path between any two nodes

Bisection Width : When a given network is cut into two halves, the minimum
number of edges along the cut is called channel bisection width b.

If the number of bits = w for each edge, then the wire bisection width is B =
bw.
When B is fixed, the channel width in bits is w = B/b

Another quantitative parameter is the wire (channel) length between nodes.


It determines the latency, clock skewing or power requirement.

Symmetric network: If the topology is the same w.r.t. any node.


Kaushik Banerjee, IEM, CSE 134
Network Properties and Routing
Data Routing Functions

Needed for inter PE data exchange.


For multicomputer network, message passing is used
through hardware routers.

Common data routing functions:

shifting, rotation, permutation (one-to-one),


broadcast (one-to-all), multicast (many-to-many),
personalized communication (one-to-many), shuffle,
exchange, etc.

Kaushik Banerjee, IEM, CSE 135


Network Properties and Routing
Network Routing Functions
A large number of network structures were
investigated by telephone companies that needed
ever larger and more efficient circuit switching
exchanges .

When it became possible to build computer systems


with more than one processor, the application of
interconnection networks to parallel computers was
also investigated

Kaushik Banerjee, IEM, CSE 136


Network Properties and Routing
Network Routing Functions
The requirements of parallel computing structures are
somewhat different from those of a telephone system.
In a telephone network requests for the connection of
a circuit-switched link between an originator and a
respondent may occur at any time. The primary aim is
to maximise the number of concurrent circuits. In a
parallel architecture a network is required to support
either processor-to-memory connections or
processor-to-processor communication links. It is
instructive to visualise an interconnection network in
a parallel computer simply as a `black box', with a
number of input ports and a number of output ports,
which performs a specified routing function to
connect inputs to outputs
Kaushik Banerjee, IEM, CSE 137
Network Properties and Routing
Network Routing Functions

In SIMD systems, where a single instruction operates on a


multiplicity of data, the routing function may be completely
defined within each data-movement instruction. Consequently
all input-output connections will be distinct. Much of the
remainder of this chapter discusses interconnection networks
which fall into this category. In MIMD systems, where each
connection is defined by individual processors operating
independently, no assumptions can be made about the input-
output connections requested by each processor. For example,
two processors may both request data from the same shared
memory bank simultaneously, resulting in network requests
with distinct input ports but the same output port. Although
this problem is not found exclusively in MIMD systems, most of ...
Kaushik Banerjee, IEM, CSE 138
Network Properties and Routing
Network Routing Functions

An idealised interconnection structure takes a set of labelled input ports and sets up a number of connections from them to a similar set of output ports, as shown in the figure. In order to simplify this discussion of interconnection networks it is assumed that the numbers of input and output ports in the network are equal. Hence if A is defined to be an ordered set of N port labels, then A = {0, 1, 2, ..., N-1}, and a routing function f is a function from port labels to port labels: f: A → A.

If f is an injection on A, then it can often be represented as a sequence of simple permutations of the labels in A.

For example, if A represented a labelled deck of playing cards then a possible permutation to perform would be a perfect shuffle, and in fact this is a very useful permutation for interconnection networks.
Kaushik Banerjee, IEM, CSE 139
Network Properties and Routing
An idealised interconnection structure

Kaushik Banerjee, IEM, CSE 140


Network Properties and Routing
Permutations

For n nodes, there will be n! ways in which the nodes can be


ordered.

The set of all permutations form a permutation group with


respect to a composition operation.

For example, the permutation π = (a,b,c)(d,e) stands for the


bijection mapping

Kaushik Banerjee, IEM, CSE 141


Network Properties and Routing
Permutations

[Figure: the permutation π = (a, b, c)(d, e) maps a → b, b → c, c → a and d → e, e → d.]

Permutation groups

The cycle (a, b, c) has a period of 3 and the cycle (d, e) has a period of 2, each acting in a circular fashion.

Combining the two cycles, the permutation π has a period of 3 × 2 = 6.
Kaushik Banerjee, IEM, CSE 142
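To make the rule concrete, here is a small Python sketch (not part of the original slides; the cycle labels and function names are illustrative) that computes the period of a permutation given as disjoint cycles and applies it to a label.

```python
from math import lcm

# The permutation pi = (a,b,c)(d,e) written as a list of disjoint cycles.
# The labels are purely illustrative.
pi = [("a", "b", "c"), ("d", "e")]

def period(cycles):
    """Period of a permutation = lcm of the lengths of its disjoint cycles."""
    result = 1
    for cycle in cycles:
        result = lcm(result, len(cycle))
    return result

def apply_once(cycles, x):
    """Apply the permutation once: each label maps to the next one in its cycle."""
    for cycle in cycles:
        if x in cycle:
            return cycle[(cycle.index(x) + 1) % len(cycle)]
    return x  # labels not mentioned in any cycle are fixed points

print(period(pi))            # 6 = lcm(3, 2), as on the slide
print(apply_once(pi, "a"))   # b
print(apply_once(pi, "e"))   # d
```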
Network Properties and Routing

A perfect-shuffle permutation of port labels can be used


to map from a set of source labels S to a set of
destination labels D. The ordered set of input labels is
divided into two subsets of equal size which are then
interleaved. This can be represented by the bipartite
graph on the next slide, from which it can be observed
that this permutation can be produced by a simple
manipulation of the binary representation of the source
label.

Kaushik Banerjee, IEM, CSE 143


Network Properties and Routing
Perfect Shuffle and Exchange

Perfect Shuffle: shift the address bits one place to the left, with the MSB
wrapping around to become the LSB.
Inverse Perfect Shuffle: shift the address bits one place to the right, with
the LSB wrapping around to become the MSB.

Perfect Shuffle        Inverse Perfect Shuffle
000 → 000              000 → 000
001 → 010              001 → 100
010 → 100              010 → 001
011 → 110              011 → 101
100 → 001              100 → 010
101 → 011              101 → 110
110 → 101              110 → 011
111 → 111              111 → 111

In general, an address (b_k-1, b_k-2, …, b_1, b_0) is shuffled to
(b_k-2, b_k-3, …, b_0, b_k-1), i.e. the address bits are rotated left by one
position.
Kaushik Banerjee, IEM, CSE 144
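Since the perfect shuffle is just a one-bit left rotation of the address, the table above can be generated programmatically. The following Python sketch is illustrative only (the function names are assumptions, not part of the lecture):

```python
def perfect_shuffle(addr, k):
    """Rotate a k-bit address one place to the left; the old MSB wraps
    around to become the new LSB."""
    msb = (addr >> (k - 1)) & 1
    return ((addr << 1) & ((1 << k) - 1)) | msb

def inverse_shuffle(addr, k):
    """Rotate a k-bit address one place to the right; the old LSB wraps
    around to become the new MSB."""
    return (addr >> 1) | ((addr & 1) << (k - 1))

# Reproduce the 3-bit table above and check that the two shuffles are inverses.
for src in range(8):
    dst = perfect_shuffle(src, 3)
    assert inverse_shuffle(dst, 3) == src
    print(f"{src:03b} -> {dst:03b}")
```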
Network Properties and Routing

Hypercube Routing Functions

[Figure: a 3-cube with nodes labelled 000 to 111]

Three routing functions for the 3-cube. Routing can be done by
(A) changing the MSB,
(B) changing the LSB, or
(C) changing the middle bit.

Observe: each routing function moves the message to an adjacent node.

For n-cube there will be n routing functions.


Kaushik Banerjee, IEM, CSE 145
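A minimal sketch of the n routing functions of an n-cube, assuming the usual convention that the i-th routing function simply complements bit i of the node address (the names used are illustrative):

```python
def cube_routing_function(node, i):
    """C_i: complement bit i of the node address, giving the neighbour
    of `node` across dimension i of the hypercube."""
    return node ^ (1 << i)

# The three routing functions of the 3-cube applied to node 010.
node = 0b010
for i in range(3):
    print(f"C{i}(010) = {cube_routing_function(node, i):03b}")
# C0 changes the LSB -> 011, C1 the middle bit -> 000, C2 the MSB -> 110;
# each result is an adjacent node, as observed above.
```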
Network Properties and Routing

Broadcast and Multicast

Broadcast : One to all mapping.

Achieved in SIMD by a broadcast bus extended from Array


Controller to all the PEs.

In Multiprocessor, it is done by message passing.

 Multicast: Many to many; from one set of nodes to another.

 Personalised Broadcast: One to selected many.
Kaushik Banerjee, IEM, CSE 146
Network Properties and Routing

Network Performance

Depends on

(1) Functionality: The method of routing, interrupt handling,


synchronization, request/message combining and
coherence.

(2) Network Latency: Worst case time delay for a unit message
to be transferred

(3) Bandwidth: Maximum data transfer rate in Mbytes/s

(4) Hardware complexity: Costs of wires, switches, connectors,


arbitration and interface logic

(5) Scalability
Kaushik Banerjee, IEM, CSE 147
Static Network

Linear Array
Star Ring

Tree
Near Neighbour Mesh 3 X 3
Kaushik Banerjee, IEM, CSE 148
Static Network

Chordal Ring of degree = 3          Chordal Ring of degree = 4

Bisection width=5

Completely Connected

Kaushik Banerjee, IEM, CSE 149


Static Network

Near Neighbour Mesh 3 X 3


Wrapped-around mesh (ILLIAC)

Torus
[equivalent to a chordal ring of degree 4]

Systolic Array
Kaushik Banerjee, IEM, CSE 150
Static Network

Hyper Cubes
(Low scalability)

A 4-cube formed by interconnecting two 3-Cubes


3-Cube

3-Cube Connected Cycles


(Modified version of Hypercube)
Kaushik Banerjee, IEM, CSE 151
Static Network

Linear Array
 Diameter = N-1
 Suitable for small number of nodes
 Very different from a shared bus. Allows concurrent use of different
channels.

Ring
Uni-directional or bi-directional
 Symmetric with a constant node degree = 2
 Diameter = N/2
 IBM token ring protocol is based on this. Message keeps on
travelling till it finds the addressed node

Chordal Ring
 By increasing the node degree of a Ring to 3 or 4
 The higher the node degree, the shorter the diameter
Kaushik Banerjee, IEM, CSE 152
Static Network

Tree and Star


 Tree: a k-level completely balanced binary tree will have N = 2^k - 1
nodes
 Scalable
 Long diameter
 Star is a two level tree with high node degree of d=N-1 and a small
constant diameter of 2 (for systems with a centralized supervisor
node)

Fat tree
 When the channel width of a
tree increases from the leaves towards the root.

Kaushik Banerjee, IEM, CSE 153


Static Network

Mesh and Torus


 Mesh is popular. Used in ILLIAC IV, MPP, DAP, CM-2 and Intel
Paragon
 In general, a k-dimensional mesh with N = n^k nodes has an interior
node degree of 2k and network diameter is k(n-1).

In the example, we have a 2-dimensional, 3 X 3 mesh with 3^2 = 9 nodes

Systolic Arrays
 This is a class of multi dimensional pipelined array architecture
designed for implementing fixed algorithms.

 The one shown in the figure is specifically designed for performing


matrix-matrix multiplication.

In general static systolic arrays are pipelined with multidirectional


flow of data streams. Very difficult to program.
Kaushik Banerjee, IEM, CSE 154
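The degree and diameter formulas quoted on the last few slides can be checked with a short Python sketch; the function names and the sample parameters are illustrative assumptions:

```python
def mesh_metrics(n, k):
    """k-dimensional mesh with n nodes per dimension and no wraparound."""
    return {"nodes": n ** k, "interior_degree": 2 * k, "diameter": k * (n - 1)}

def ring_metrics(N):
    """Bidirectional ring of N nodes."""
    return {"nodes": N, "degree": 2, "diameter": N // 2}

def hypercube_metrics(n):
    """n-cube: 2**n nodes, node degree n, diameter n."""
    return {"nodes": 2 ** n, "degree": n, "diameter": n}

print(mesh_metrics(3, 2))    # the 3 X 3 mesh: 9 nodes, interior degree 4, diameter 4
print(ring_metrics(16))
print(hypercube_metrics(4))  # 16 nodes
```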
Dynamic Network

Dynamic interconnection networks enable the temporary


connection of any two components of a multiprocessor.

There are two categories

(A) Shared path networks


i. Single bus
ii. Multiple bus systems

(B) Switching networks

i. Crossbars
ii. Multistage networks

Kaushik Banerjee, IEM, CSE 155


Dynamic Network

Shared path network

Can be allocated to any active component of the machine


on a competitive basis

Switching networks

Allocated by a number of switches that can be set in


different ways according to the connection requirements.

Kaushik Banerjee, IEM, CSE 156


Dynamic Network

Shared Path Networks

Single shared bus

One of the most popular


A generalization and extension of the buses employed in
uniprocessor systems
Contains the same bus lines (address, data, control,
interrupt) as uniprocessors and some additional ones to
solve the contention. Arbitration.
Very cost-effective, increasing the number of processors
does not increase the price of the shared bus.

Kaushik Banerjee, IEM, CSE 157


Dynamic Network

Limitation of shared bus

As the number of processors on the bus increases, the


probability of contention also increases proportionally,
reaching a point when the whole bandwidth of the bus is
exhausted by the processors and hence, adding a new
processor will not cause any potential speed-up in the
multiprocessor.

Kaushik Banerjee, IEM, CSE 158


Dynamic Network

One of the main design issues in shared bus multiprocessors is


the enhancement of the number of applicable processors by
different methods. The three most important techniques are:

A. Introducing private memory


B. Introducing coherent cache memory
C. Introducing multiple buses

 Without these improvements the applicable number of


processors is in the range 3-5
Can be improved up to 30 processors using private and
coherent cache memory
 Bus hierarchies open the way to constructing scalable
shared memory systems based on bus interconnection
Kaushik Banerjee, IEM, CSE 159
Dynamic Network

Although, in principle, the uniprocessor and multiprocessor


buses are very similar, there is a significant difference in their
mode of operation.

Uniprocessors and first-generation multiprocessors employ


the so-called locked buses; examples are the Multibus,
VMEbus, etc.

Second-generation multiprocessors apply pended buses. The


difference is in how memory accesses are handled on the bus.

Kaushik Banerjee, IEM, CSE 160


Dynamic Network

A memory write access needs two phases:

Phase 1. The address and data are transferred via the bus to
the memory controller.

Phase 2. The memory write operation (including parity check,


error correction, and so on) is executed by the memory
controller.

 The first phase is typically 3-4 times faster than the second.
There is no need to lock the bus during phase 2.

In multiprocessor system, if the two phases of two processors


can be overlapped then it will save a lot of time.
Kaushik Banerjee, IEM, CSE 161
Dynamic Network

Similarly, improvement can be done in case of Memory Read.

 A memory read access involves three phases:

Phase 1. The address is transferred via the bus to the memory


controller.
Phase 2. The memory read operation is executed by the
memory controller.
Phase 3. The data is transferred via the bus to the requesting
processor

In a multiprocessor system it is possible to reduce the 3 phases to 2 by


combining the 1st phase of one read with the 3rd phase of another read
executed on a different memory unit. This requires separate buses for
address and data, and appropriate buffers.
Kaushik Banerjee, IEM, CSE 162
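A rough back-of-the-envelope model (in Python) of why releasing the bus during the memory phase helps. The timings are assumed values chosen only to reflect the 3-4x ratio mentioned above, and the pended-bus figure is an optimistic lower bound that assumes enough independent memory units to overlap every memory phase:

```python
T_TRANSFER = 1.0   # assumed: address/data transfer phase (bus occupied)
T_MEMORY = 3.5     # assumed: memory operation phase, 3-4x longer than transfer

def locked_bus_time(n_accesses):
    """Locked bus: the bus is held for both phases of every access."""
    return n_accesses * (T_TRANSFER + T_MEMORY)

def pended_bus_time(n_accesses):
    """Pended bus: the bus is released during the memory phase, so accesses to
    different memory units overlap; the bus is busy only for the transfers."""
    return n_accesses * T_TRANSFER + T_MEMORY

for n in (1, 4, 16):
    print(n, locked_bus_time(n), pended_bus_time(n))
```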
Dynamic Network

Multiple shared bus


The limited bandwidth of the single shared bus represents a
major limitation in building scalable multiprocessors.
There are several ways to increase the bandwidth of the
interconnection network. A natural idea is to multiply the
number of buses, like the processors and memory units. Four
different ways have been proposed for connecting buses to
the processors, memory units and other buses:

1-dimension multiple bus system


2- or 3-dimension bus systems
Cluster bus system
Hierarchical bus system

Kaushik Banerjee, IEM, CSE 163


Dynamic Network

Bus arbitration procedures are of two kinds:

A. Serial arbitration
B. Parallel arbitration

Kaushik Banerjee, IEM, CSE 164


Dynamic Network

Switching Networks
Crossbar
The crossbar is the most powerful network type since it
provides simultaneous access among all the inputs and outputs
of the network, provided that all the requested outputs are
different. All the switches must contain arbiter logic to allocate
the memory block in the case of conflicting requests and a bus-
bus connection unit to enable connection between the buses of
the winning processor and the memory buses. It means that
both the wiring and the logic complexity of the crossbar is
dramatically increased compared with the single bus
interconnection.

Kaushik Banerjee, IEM, CSE 165


Dynamic Network

Switching Networks (Contd.)

Crossbar (more)

The single bus system is unable to serve as an


interconnection network for scalable multiprocessors due to
the limited bandwidth of the single bus. Although the
crossbar provides a scalable bandwidth, it is not appropriate for
constructing large-scale multiprocessors because of the large
complexity and high cost of the switches. In addition, the
number of switches increases with the square of the number
of processors in the crossbar

Kaushik Banerjee, IEM, CSE 166


Dynamic Network
Switching Networks (Contd.)

Multistage networks

Multistage networks represent a compromise between the


single bus and the crossbar switch interconnections from the
point of view of implementation complexity, cost,
connectivity and bandwidth. A multistage network consists of
alternating stages of links and switches. Many kinds of
multistage networks have been proposed and built so far.
They can be categorized according to the number of stages,
the number of switches at a stage, the topology of links
connecting subsequent stages, the type of switches
employed at the stages and the possible operation modes.

Kaushik Banerjee, IEM, CSE 167


Dynamic Network
Time Shared Common Bus
[Figure: a single common bus connecting a Memory Unit, CPU 1, CPU 2, CPU 3,
I/O Processor 1 and I/O Processor 2]

Only one processor can transfer to/from memory or to another processor at a
time.

The receiving unit recognizes its address from the sender and responds to the
control signal; then the transfer starts.

Bus conflicts may be resolved by dedicating a bus controller which performs
the arbitration.
Kaushik Banerjee, IEM, CSE 168
Dynamic Network
Time Shared Common Bus
Improvement: Two or more parallel buses

[Figure: a common system bus with a shared Memory Unit; each of the three
nodes has a System Bus Controller, a CPU, an IOP and a local Memory connected
by its own Local Bus]
Kaushik Banerjee, IEM, CSE 169


Dynamic Network
Time Shared Common Bus

 Locally connected memory and IOP

 System Bus Controller connects the local bus to the system


bus

 IOP/Memory connected to System bus is shared.

 At a time, only one processor can communicate with the


shared memory through the system bus, while other
processors communicate locally.

Kaushik Banerjee, IEM, CSE 170


Dynamic Network
Multi Port Memory

Memory Memory Memory


Unit 1 Unit 2 Unit 3

CPU 1

CPU 2

CPU 3
Kaushik Banerjee, IEM, CSE 171
Dynamic Network
Multi Port Memory

Each Processor bus is connected to each memory unit.

Processor bus contains address, control and data lines

 Only one port of a memory unit is active at a time

Fixed priority is given to each memory port, depending on the physical


port position.

Advantage: Multiple ports => higher transfer rate

Disadvantage: Expensive memory control logic

Good for small number of processors


Kaushik Banerjee, IEM, CSE 172
Dynamic Network
Crossbar Switches

Memory Memory Memory


Unit 1 Unit 2 Unit 3

CPU 1
Switch placed at each cross point

CPU 2
Switch recognizes the address and directs
instruction/data to the desired memory module

CPU 3
Kaushik Banerjee, IEM, CSE 173
Dynamic Network
Crossbar Switches
[Figure: one crossbar cross-point. A memory module connects, through
multiplexers and arbitration logic, to the data, address and control (R/W)
lines of CPU 1, CPU 2 and CPU 3; a memory-enable signal selects the module.]

Simultaneous transfers are possible to/from all memory modules.

Arbitration is by priority encoder
Kaushik Banerjee, IEM, CSE 174
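A minimal sketch of how a fixed-priority arbiter (a priority encoder) resolves conflicting requests for the same memory module; the CPU numbering and the priority order are assumptions made for illustration:

```python
def priority_encoder_arbiter(requests):
    """`requests[i]` is True if CPU i is requesting the memory module.
    The winner is the lowest-numbered requesting CPU (highest fixed priority)."""
    for cpu, requesting in enumerate(requests):
        if requesting:
            return cpu
    return None  # no request pending

# CPU 1 and CPU 2 both request the same module; CPU 1 wins this cycle.
print(priority_encoder_arbiter([False, True, True]))   # -> 1
```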
Dynamic Network
Multi Stage Switching Network

[Figure: a 2 X 2 switch with inputs A, B and outputs 0, 1]

The Switch Concept: 2 X 2

Resolves/arbitrates priority between A and B for an output port.

Switches are a X b; generally a = b, and both are powers of 2.
Commonly 2 X 2, 4 X 4 and 8 X 8.
When it is 2 X 2, it is called a binary switch.
For an n X n switch, the number of legitimate states = n^n and the number of
permutations = n!
Kaushik Banerjee, IEM, CSE 175
Dynamic Network
Multi Stage Switching Network

[Figure: a binary tree of 2 X 2 switches connecting processors P1, P2 (and
other processors, marked *) to memory modules 000-111; output 0 of the
first-level switch reaches modules 000-011 and output 1 reaches modules
100-111]

P1, P2: Processors
000-111: Memory Modules
Binary tree with 2 X 2 switches
Kaushik Banerjee, IEM, CSE 176
Kaushik Banerjee, IEM, CSE 177
Various connectivity of a switch

[Figure: the four settings of a 2 X 2 switch with inputs A, B and outputs 0, 1]

Straight: A → 0, B → 1
Crossover: A → 1, B → 0
Upper Broadcast: input A is sent to both outputs
Lower Broadcast: input B is sent to both outputs


Kaushik Banerjee, IEM, CSE 178
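The four settings can be modelled in a few lines of Python (illustrative; the setting names follow the figure above). Note that only the straight and crossover settings are permutations, which is why an n X n switch has n^n legitimate states but only n! permutation states:

```python
def switch_2x2(a, b, setting):
    """A 2 X 2 switch with inputs (a, b) and outputs (out0, out1)."""
    if setting == "straight":
        return (a, b)
    if setting == "crossover":
        return (b, a)
    if setting == "upper_broadcast":
        return (a, a)
    if setting == "lower_broadcast":
        return (b, b)
    raise ValueError("unknown setting")

for s in ("straight", "crossover", "upper_broadcast", "lower_broadcast"):
    print(s, switch_2x2("A", "B", s))
```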
Omega Network

 Implementation of an 8 X 8 network using 2 X 2


switches and Perfect Shuffle as interconnection
network

From To

000 → 000 (0)
001 → 010 (2)
010 → 100 (4)
011 → 110 (6)
100 → 001 (1)
101 → 011 (3)
110 → 101 (5)
111 → 111 (7)

(The number in parentheses indicates the decimal equivalent of the node.)
This is the Perfect Shuffle.
Kaushik Banerjee, IEM, CSE 179
Omega Network

 We will demonstrate the connection from 1 to 3, i.e., from 001 to 011.
The path starts at input node 001. Since the inter-stage wiring follows the
perfect shuffle, it is connected to 010, i.e., port 2.

001 (node 1) --Perfect Shuffle--> 010 (port 2)

Kaushik Banerjee, IEM, CSE 180


Omega Network

 Now, in the transition from 001 to 011, the 1st bit is unchanged.


So, the connection within switch A will be straight.

Kaushik Banerjee, IEM, CSE 181


Omega Network

 Next mapping: 010 (port 2) --Perfect Shuffle--> 100 (port 4)

Kaushik Banerjee, IEM, CSE 182


Omega Network

 Now, in the transition from 001 to 011, the 2nd bit changes.


So, the connection through switch B will be a crossover, hence port 5 (101).
Kaushik Banerjee, IEM, CSE 183
Omega Network

 Next mapping: 101 (port 5) --Perfect Shuffle--> 011 (port 3)


And then, since the third bit is unchanged, the connection within switch C
will be straight.
Kaushik Banerjee, IEM, CSE 184
Kaushik Banerjee, IEM, CSE 185
000 → 110 and 100 → 111 will conflict at F;
011 → 000 and 111 → 011 will conflict at G;
101 → 001 and 011 → 000 will conflict at H;
At I and J broadcast will be required, if hardware allows it.

 Because of the conflicts, not all permutations can be implemented in one pass.

Kaushik Banerjee, IEM, CSE 186
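The worked example above follows the standard destination-tag routing rule for an Omega network: at stage i the switch is set straight or crossed so that its output equals bit i (counted from the MSB) of the destination address. The Python sketch below is illustrative (the function names are assumptions) and reproduces the ports visited in the 001 → 011 example:

```python
def shuffle(x, k=3):
    """Perfect shuffle (left rotation) of a k-bit port number."""
    return ((x << 1) & ((1 << k) - 1)) | (x >> (k - 1))

def omega_route(src, dst, k=3):
    """Trace the ports visited in a 2**k x 2**k Omega network built from
    2 X 2 switches with perfect-shuffle wiring between stages."""
    path = [src]
    pos = src
    for stage in range(k):
        pos = shuffle(pos, k)                 # inter-stage perfect shuffle
        want = (dst >> (k - 1 - stage)) & 1   # destination bit for this stage
        if (pos & 1) != want:                 # crossover inside the switch
            pos ^= 1
        path.append(pos)
    return path

print([f"{p:03b}" for p in omega_route(0b001, 0b011)])
# ['001', '010', '101', '011'] -- the same ports as in the worked example
```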


Hypercube Connection

One Cube Addresses


2^1 = 2 nodes

Each node contains a CPU, Local memory and I/O interface


Kaushik Banerjee, IEM, CSE 187
Hypercube Connection

01 11

Addresses Two Cube


2^2 = 4 nodes

00 10

Each node contains a CPU, Local memory and I/O interface


Kaushik Banerjee, IEM, CSE 188
Hypercube Connection
Similarly, a 3-cube has 2^3 = 8 nodes.
Likewise, an n-cube has 2^n nodes.

For say, 3-cube,

Node 000 can directly communicate with


Node 001 or
Node 010 or
Node 100

That is, only one bit change.


 To communicate with 111, it has to be routed through three links

000 → 001 → 011 → 111


Kaushik Banerjee, IEM, CSE 189
Hypercube Connection
 Make an XOR between the addresses of the two nodes.

If the 1st bit = 1, the nodes differ along the 1st axis (X).
If the 2nd bit = 1, the nodes differ along the 2nd axis (Y).
If the 3rd bit = 1, the nodes differ along the 3rd axis (Z).

[Figure: a 3-cube drawn on the X, Y and Z axes]

For example, between 000 and 010, the Y axis will differ.

Kaushik Banerjee, IEM, CSE 190
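A small Python sketch of this XOR-based routing: the XOR of the two addresses identifies the dimensions that differ, and flipping them one at a time (here from LSB to MSB, one of several valid orders) traverses one link per differing bit. The function name is an assumption:

```python
def hypercube_path(src, dst, n):
    """Route from src to dst in an n-cube by flipping, one bit at a time,
    each bit position in which the two addresses differ."""
    path = [src]
    node = src
    diff = src ^ dst            # a 1 in bit i means the nodes differ in dimension i
    for i in range(n):
        if (diff >> i) & 1:
            node ^= (1 << i)    # move to the neighbour across dimension i
            path.append(node)
    return path

print([f"{p:03b}" for p in hypercube_path(0b000, 0b111, 3)])
# ['000', '001', '011', '111'] -- three links, as on the earlier slide
```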


Multiprocessors and MIMD

Kaushik Banerjee, IEM, CSE 191


MIMD

 An MIMD computer system has a number of


independent processors operating upon separate data
concurrently

Hence, each processor has its own program memory or


has access to program memory.

Similarly, each processor has its own data memory or


has access to data memory.

Clearly, there has to be a mechanism for passing


information between processors, as they work on the
same problem.

Kaushik Banerjee, IEM, CSE 192


Multiprocessors

Interconnection of two or more CPUs with memory and I/O

Necessarily means multiple CPUs with optional I/O


processors.

Falls into the category of MIMD

MIMD (Multiple Instruction Stream, Multiple Data Stream) is


a technique to achieve parallelism.

MIMD machines have a number of processors functioning


asynchronously and independently.

At any point of time, different processors may be executing


different instructions on different pieces of data.
Kaushik Banerjee, IEM, CSE 193
Multiprocessors

Most multiprocessor systems and multiple computer systems


can be included in this category.

An intrinsic MIMD computer implies interactions among the n


processors because all memory streams are derived from the
same data space shared by all the processors.

 If the n data streams were derived from disjoint subspaces


of the shared memories, then we would have the so called
MSISD (Multiple SISD) operation, which is nothing but a set of n
independent SISD uniprocessor systems.

An intrinsic MIMD computer is tightly coupled if the degree of


interactions among the processors is high. Most commercial
MIMD computers are loosely coupled.
Kaushik Banerjee, IEM, CSE 194
Multiprocessors

 Many of the early multiprocessors were SIMD; SIMD received renewed
attention in the 1980s.

 Except for vector processors, they were largely gone by the mid 1990s.

MIMD emerged as the architecture of choice for general


purpose multiprocessors.

There are two important aspects of MIMD

(1) Flexibility : Single user multiprocessors offering high


performance for one application
(2) Cost Effective: Can be built on off-the-shelf
microprocessors, giving cost/performance advantages.
Kaushik Banerjee, IEM, CSE 195
Multiprocessors

Advantages
(1) Flexibility : Single user multiprocessors offering high
performance for one application by distributing the task.
Needs compiler/ language support.

(2) Cost Effective: Can be built on off-the-shelf


microprocessors

Improved Performance
(1)Multiple independent jobs in parallel, OR
(2)Single job partitioned into parallel tasks.

 Example: Processor A: computation; Processor B: control.
Kaushik Banerjee, IEM, CSE 196
Multiprocessors

MIMD Architecture’s Goal

 In a multicomputer there is no shared memory, so a processor cannot
directly obtain access to a remote memory block. Rather than having a
processor stall, or be switched to a different process, while remote data is
fetched, distributed MIMD architectures rely on explicit message passing.
Thus, the main point of this architectural design is to develop a
message-passing parallel computer system organized such that the processor
time spent in communication within the network is reduced to a minimum.

Kaushik Banerjee, IEM, CSE 197


Multiprocessors

Components of the Multi Computer and Their Tasks


Within a multicomputer, there are a large number of nodes and a
communication network linking these nodes together. Inside each node,
there are three important elements that do have tasks related to message
passing:

Computation Processor and Private Memory


Communication Processor

This component is responsible for organizing communication among the


multicomputer nodes, "packetizing" a message as a chunk of memory on the
sending end, and "depacketizing" the same message for the
receiving node.

 Router, commonly referred to as a switch unit.


The router’s main task is to transmit the message from one node to the
next and assist the communication processor in organizing the
communication of the message through the network of nodes.
Kaushik Banerjee, IEM, CSE 198
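A toy Python sketch of what the communication processor's "packetizing" and "depacketizing" might look like. The packet header layout (destination node, sequence number, packet count) and sizes are assumptions made purely for illustration:

```python
import struct

HEADER = ">HHH"   # assumed header: destination, sequence number, packet count

def packetize(dest, payload, max_payload=8):
    """Split a message (bytes) into fixed-size packets with a small header."""
    chunks = [payload[i:i + max_payload]
              for i in range(0, len(payload), max_payload)] or [b""]
    return [struct.pack(HEADER, dest, seq, len(chunks)) + chunk
            for seq, chunk in enumerate(chunks)]

def depacketize(packets):
    """Reassemble the payload at the receiving node, even if packets arrive out of order."""
    packets = sorted(packets, key=lambda p: struct.unpack(HEADER, p[:6])[1])
    return b"".join(p[6:] for p in packets)

pkts = packetize(5, b"hello, node 5! here is some data")
print(len(pkts), depacketize(pkts))
```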
Multiprocessors

The development of each of the above three elements took


place progressively, as hardware became more and more
complex and useful.

First came the Generation I computers, in which messages


were passed through direct links between nodes, but there
were not any communication processors or routers.

Then Generation II multicomputers came along with


independent switch units that were separate from the
processor, and finally in the Generation III multicomputer, all
three components exist.

Kaushik Banerjee, IEM, CSE 199


Multiprocessors

Categorization of MIMD processors


There are two kinds of categorization of MIMD computers

 Memory Organization wise


 Interconnection wise

1. MIMD memory organization


There are two major types of MIMD memory architectures: distributed
memory MIMD architecture, and shared memory MIMD architecture.

Shared memory organizations are those where all processors share the
same common memory. The processors in a shared memory organisation
are known as tightly coupled as they provide a high level of interaction
among the processors.
Distributed memory organization are those where each CPU has its own
local private memory. Message passing scheme is used for interaction
among the processors. This system is known as loosely coupled due to the
low degree of interaction among the processors.
Kaushik Banerjee, IEM, CSE 200
Multiprocessors Memory Models

(A) Distributed Memory MIMD Architectures

Kaushik Banerjee, IEM, CSE 201


Multiprocessors Memory Models

Advantages and Disadvantages (of distributed memory)

+Highly Scalable
+Message passing solves memory access synchronization problem

-Load balancing problem


-Deadlock in message passing
-Need to physically copy data between processes

Kaushik Banerjee, IEM, CSE 202


Multiprocessors Memory Models

(B) Shared Memory MIMD Architectures

Tightly
Coupled

 Also called Symmetric Multi Processors (SMP)
Kaushik Banerjee, IEM, CSE 203
Multiprocessors Memory Models

Advantages and Disadvantages (of shared memory)

+May use uniprocessor programming techniques


+Communication between processor is efficient

-Synchronized access to shared data in memory is needed


-Lack of scalability due to (memory) contention problem

Kaushik Banerjee, IEM, CSE 204
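A minimal Python sketch of the synchronization requirement, using threads on a single machine as a stand-in for processors sharing memory; without the lock, concurrent updates to the shared variable could be lost:

```python
import threading

counter = 0                      # data held in the logically shared memory
lock = threading.Lock()          # synchronizes access to the shared data

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:               # the read-modify-write must be done atomically
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 40000 with the lock; unpredictable without it
```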


Multiprocessors Memory Models
Shared Memory

 No matter how the memory blocks are physically organized, the address
spaces of these memory blocks are unified into a global address space which
is completely visible to all processors of the shared memory system.

Issuing a certain memory address by any processor will


access the same memory block location.

However, according to the physical organization of the


logically shared memory, two main types of shared
memory system could be distinguished:

Physically shared memory systems


 Virtual (or distributed) shared memory systems
Kaushik Banerjee, IEM, CSE 205
Multiprocessors Memory Models
Shared Memory

In physically shared memory systems all memory blocks


can be accessed uniformly by all processors.

In distributed shared memory systems the memory blocks


are physically distributed among the processors as local
memory units.

The three main design issues in increasing the scalability


of shared memory systems are:

a. Organization of memory
b. Design of interconnection networks
c. Design of cache coherent protocols
Kaushik Banerjee, IEM, CSE 206
Multiprocessors Memory Models

(C) Virtual shared memory (Best of Both Worlds)

Also called distributed shared memory architecture.

Kaushik Banerjee, IEM, CSE 207


Multiprocessors Memory Models

Shared memory systems are basically classified according to


their memory organization since this is the most fundamental
design issue. Accordingly, shared memory systems can be
divided into four main classes:

1. Uniform memory access (UMA) machines. This follows


shared memory architecture

Following are models of distributed shared memory


architecture

2. Non-uniform memory access (NUMA) machines


3. Cache-coherent non-uniform memory access
(CC_NUMA) machines
4. Cache-only memory access (COMA) machines
Kaushik Banerjee, IEM, CSE 208
Multiprocessors Memory Models

UMA machines belong to the physically shared memory


architecture class, while NUMA, CC_NUMA, and COMA
machines form the class of distributed shared memory
architectures.

The four classes cover the three generations of shared


memory systems. Their first generation contains the UMA
machines where the interconnection network was based
either on the concept of shared bus in order to construct low-
price multiprocessors or on multistage networks in order to
build massively parallel shared memory systems.

Kaushik Banerjee, IEM, CSE 209


Multiprocessors Memory Models

Shared bus is a serious bottleneck due to contention

Earlier, in first generation shared memory systems this was


being improved by using local cache memories. Still the
scalability was in the range of 20-30 in shared bus machines
and 100-200 in multistage network-based machines.

The second-generation shared memory systems tried to


physically distribute the shared memory among the
processors and also more complex multibus or multistage
network.

 The third-generation shared memory systems combine the advantages of the
first two generations (by introducing large local cache memories), resulting
in a dramatic improvement in scalability.
Kaushik Banerjee, IEM, CSE 210
Multiprocessors Memory Models

However, the application of caches in a multiprocessor


environment gives rise to the so-called cache consistency
problem.

In order to solve the problem of maintaining data consistency


in the caches, the cache coherence protocol must be added
to the traffic on the network.

Kaushik Banerjee, IEM, CSE 211


Multiprocessors Memory Models

(I) Uniform Memory Access (UMA) Machines

Contemporary uniform memory access machines are small-


size single bus multiprocessors. Large UMA machines with
hundreds of processors and a switching network were typical
in the early design of scalable shared memory systems.
Famous representatives of that class of multiprocessors are
the Denelcor HEP and the NYU Ultracomputer.

They introduced many innovative features in their design,


some of which even today represent a significant milestone in
parallel computer architectures.

Kaushik Banerjee, IEM, CSE 212


Multiprocessors Memory Models

UMA Machines (Contd.)

However, these early systems do not contain either cache


memory or local main memory which turned out to be
necessary to achieve high performance in scalable shared
memory systems. Although the UMA architecture is not
suitable for building scalable parallel computers, it is
excellent for constructing small-size single bus
multiprocessors.

Two such machines are the Encore Multimax of Encore


Computer Corporation representing the technology of the
late 1980s and the Power Challenge of Silicon Graphics
Computing Systems representing the technology of the 1990s.
Kaushik Banerjee, IEM, CSE 213
Multiprocessors Memory Models

(II) Non-Uniform Memory Access (NUMA) Machines

Kaushik Banerjee, IEM, CSE 214


Multiprocessors Memory Models

Non-uniform memory access (NUMA) machines were designed


to avoid the memory access bottleneck of UMA machines. The
logically shared memory is physically distributed among the
processing nodes of NUMA machines, leading to distributed
shared memory architectures. On one hand these parallel
computers became highly scalable, but on the other hand they
are very sensitive to data allocation in local memories.

Accessing a local memory segment of a node is much faster


than accessing a remote memory segment. Not by chance, the
structure and design of these machines resemble in many ways
that of distributed memory multicomputers. The main
difference is in the organization of the address space.

Kaushik Banerjee, IEM, CSE 215
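The sensitivity to data allocation can be expressed with a one-line model of the average memory access time; the latency figures below are assumed, illustrative values rather than measurements from any particular NUMA machine:

```python
def average_access_time(local_fraction, t_local, t_remote):
    """Average access time when a fraction of references hit the local
    memory segment and the rest go to a remote segment."""
    return local_fraction * t_local + (1 - local_fraction) * t_remote

# Assumed latencies: remote access an order of magnitude slower than local.
for f in (0.99, 0.90, 0.50):
    print(f, average_access_time(f, t_local=100, t_remote=1000), "ns")
```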


Multiprocessors Memory Models

In multiprocessors, a global address space is applied that is


uniformly visible from each processor; that is, all processors
can transparently access all memory locations.

In multicomputers, the address space is replicated in the local


memories of the processing elements. This difference in the
address space of the memory is also reflected at the software
level: distributed memory multicomputers are programmed
on the basis of the message-passing paradigm, while NUMA
machines are programmed on the basis of the global address
space (shared memory) principle.

Kaushik Banerjee, IEM, CSE 216


Multiprocessors Memory Models

The problem of cache coherency does not appear in


distributed memory multi-computers since the message-
passing paradigm explicitly handles different copies of the
same data structure in the form of independent messages.

In the shared memory paradigm, multiple accesses to the same


global data structure are possible and can be accelerated if
local copies of the global data structure are maintained in local
caches.

Kaushik Banerjee, IEM, CSE 217


Multiprocessors Memory Models

However, the hardware-supported cache consistency schemes


are not introduced into the NUMA machines. These systems
can cache read-only code and data, as well as local data, but
not shared modifiable data.

This is the distinguishing feature between NUMA and CC-NUMA


multiprocessors.

Kaushik Banerjee, IEM, CSE 218


Multiprocessors Memory Models

Accordingly, NUMA machines are closer to multicomputers


than to other shared memory multiprocessors, while CC-NUMA
machines look like real shared memory systems.

In NUMA machines, like in multi computers, the main design


issues are the organization of processor nodes, the
interconnection network, and the possible techniques to
reduce remote memory accesses. Two examples of NUMA
machines are the Hector and the Cray T3D multiprocessors.

Kaushik Banerjee, IEM, CSE 219


Multiprocessors Memory Models
NUMA

Remote load
Kaushik Banerjee, IEM, CSE 220
Multiprocessors Memory Models

(III) Cache-Only Memory Access (COMA) Machines

Kaushik Banerjee, IEM, CSE 221


Multiprocessors Memory Models

Cache-Only Memory Access (COMA) Machines

COMA machines try to avoid the problems of static memory


allocation of NUMA and CC-NUMA machines by excluding main
memory blocks from the local memory of nodes and employing
only large caches as node memories. In these architectures only
cache memories are present; no main memory is employed either
in the form of a central shared memory as in UMA machines or in
the form of a distributed main memory as in NUMA and CC-NUMA
computers. Similarly to the way virtual memory has eliminated the
need to handle memory addresses explicitly, COMA machines
render static data allocation to local memories superfluous. In
COMA machines data allocation is demand driven; according to the
cache coherence scheme, data is always attracted to the local
(cache) memory where it is needed.
Kaushik Banerjee, IEM, CSE 222
Multiprocessors Memory Models

Cache-Only Memory Access (COMA) Machines

In COMA machines similar cache coherence schemes can


be applied as in other shared memory systems. The only
difference is that these techniques must be extended with
the capability of finding the data on a cache read miss and
of handling replacement. Since COMA machines are
scalable parallel architectures, only cache coherence
protocols that support large-scale parallel systems can be
applied, that is, directory schemes and hierarchical cache
coherent schemes. Two representative COMA
architectures are: DDM (Data Diffusion Machine), KSR1.

Kaushik Banerjee, IEM, CSE 223


Multiprocessors Memory Models

(IV) Cache-Coherent Non-Uniform Memory Access (CC-NUMA)

All the CC-NUMA machines share the common goal of building


a scalable shared memory multiprocessor. The main difference
among them is in the way the memory and cache coherence
mechanisms are distributed among the processing nodes.
Another distinguishing design issue is the selection of the
interconnection network among the nodes. They demonstrate
a progress from bus-based networks towards a more general
interconnection network and from the snoopy cache
coherency protocol towards a directory scheme.

Kaushik Banerjee, IEM, CSE 224


Multiprocessors Memory Models

Cache-Coherent Non-Uniform Memory Access (CC-NUMA)

The Wisconsin multicube architecture is the closest


generalization of a single bus-based multiprocessor. It
completely relies on the snoopy cache protocol but in a
hierarchical way. The main goal of the Stanford FLASH design
was the efficient integration of cache-coherent shared memory
with high-performance message passing. The FLASH applies a
directory scheme for maintaining cache coherence. The figure
below shows the design space of the CC-NUMA machines.

Kaushik Banerjee, IEM, CSE 225


Multiprocessors Memory Models

Cache-Coherent Non-Uniform Memory Access (CC-NUMA)

Kaushik Banerjee, IEM, CSE 226
