Advanced Computer Architecture
Parallel Processing
KAUSHIK BANERJEE
IEM, CSE
Advanced Computer Architecture
Pipeline Processing
Architecture Classification
Vector Processing
Vector Processor Architecture
Memory Interleaving
Network Properties
MIMD and Multiprocessors
Parallel Processing
Introduction
Need is ahead of availability: we keep trying to get more
performance out of computer systems, and hence we turn to
advanced computer architecture.
Computer Architecture vs Computer Organization:
Architecture: Structural behaviour of the various functional
modules of the computer and how they interact to provide the
processing needs of the workload.
Organization: The way the h/w components are connected
together to form a computer system.
Parallel Processing
Background
Earliest computing machines worked on fixed programs, as in
calculators. A computer used to be meant for a particular type
of job. For example, a desk calculator can perform arithmetic,
but not word processing or gaming.
Reprogramming would mean making flowcharts, paperwork,
documentation and even rewiring of the system.
Parallel Processing
Paradigm Change
John von Neumann, in 1944-45, wrote a paper in which he came up with a
scheme to treat a program as data and store it in memory. This is called the
stored program concept.
[Diagram: the von Neumann organization, with Memory connected to the Control Unit and the Arithmetic Logic Unit, plus Input and Output units.]
The above is a uniprocessor or, in other words, a scalar design.
Parallel Processing
Problems: As memory is much slower than the ALU, it causes wait states in
the latter, throttling the communication between memory and ALU.
This problem is known as the von Neumann bottleneck.
In order to improve performance, the superscalar design was introduced.
How to improve performance
Rely on faster circuits: Cost per circuit increases with circuit speed and at
some point cost/performance ratio becomes unfavourable
Introduce concurrency
Replicate resources
Do more per cycle
Parallel Processing
Superscalar Design (as CPU is getting cheaper)
More instructions issued at the same time.
Multiple processing units.
The parallel instructions must be independent.
Disadvantage:
• Overhead of controlling
• Burden to programmer
• Hard to have more than 5 parallel issue lines (no point in increasing the number of
processing units above 5)
Parallel Processing
In superscalar design, resources are replicated as shown below.
[Diagram: resource replication, with four Control Units, four Arithmetic & Logic Units and four Memory Units, contrasting the uniprocessor with the multiprocessor.]
This superscalar design is flexible, but it is difficult to segregate the statements
that are independent, so as to make them work in parallel.
Parallel Processing
A CISC or RISC scalar processor can be improved with a superscalar or vector
architecture. Scalar processors are those executing one instruction per cycle. Only one
instruction is issued per cycle and only one completion of instruction is expected from
the pipeline per cycle.
In a superscalar processor, multiple instruction pipelines are used. This implies that
multiple instructions are issued per cycle and multiple results are generated per
cycle. A vector processor executes vector instructions on arrays of data. Thus each
instruction involves a string of repeated operations, which are ideal for pipelining
with one result per cycle.
Superscalar processors are designed to exploit more instruction level parallelism in
user programs. Only independent instructions can be executed in parallel without
causing a wait state. The amount of instruction level parallelism varies widely
depending on the type of code being executed.
It has been observed that the average value is around 2 for code without loop
unrolling. Therefore, for these codes there is not much benefit gained from building
a machine that can issue more than three instructions per cycle
Parallel Processing
Pipelines
Instruction Pipeline
[Diagram: a four-stage instruction pipeline (Fetch, Decode, Execute, Write Back) shown for four instructions overlapping in time.]
Parallel Processing
Figure: A superscalar processor of degree m=3
[Diagram: three Fetch-Decode-Execute-Write Back pipelines operating in parallel.]
Parallel Processing
Disadvantages of superscalar design are:
The programmer is burdened with designing the program in
parallel executable parts.
It is very hard to have more than 5 parallel lines; so there is
no point in increasing number of processing units over 5.
Some representative superscalar processors: IBM RS/6000,
DEC21064, Intel i960CA
The above architecture shows replication to the extreme.
Very flexible, but costly
Do we need all this flexibility?
How about part replication?
Parallel Processing
Arithmetic Pipeline
Pipelined addition: One of the six stages of the addition of a
pair of elements is performed at each stage in the pipeline.
Each stage of the pipeline has a separate arithmetic unit
designed for the operation to be performed at that stage.
Once stage A has been completed for the first pair of elements,
these elements can be moved to the next stage B while the
second pair of elements moves to the first stage A
Arithmetic Pipeline for Floating Point Addition
The stages of adding two floating point numbers, for example,
0.1234 x 10^5 + 0.5678 x 10^4
1) The exponents are compared: 10^5 > 10^4.
2) Since 5 - 4 = 1, the mantissa of the lesser, that is the second
number, is shifted right by 1 digit and 1 is added to its
exponent. It becomes 0.05678 x 10^5.
3) Now that the exponents of the two numbers are equal, the
mantissas are added: 0.1234 + 0.05678 = 0.18018.
4) The exponent of the result is 5. So, the final result is
0.18018 x 10^5.
5) The result is normalised if required.
6) Check for exceptions, such as overflow.
7) Rounding off, if required.
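These stages can be mirrored in software. Below is a minimal C sketch of the same align-add-normalise sequence, using a decimal mantissa/exponent pair as in the example above; the struct, the helper name fp_add and the 4-digit rounding are assumptions made only for illustration.

#include <math.h>
#include <stdio.h>

/* A floating-point value kept as mantissa * 10^exponent, with 0.1 <= |mantissa| < 1. */
struct fp { double mant; int exp; };

struct fp fp_add(struct fp a, struct fp b) {
    /* Stages 1-2: compare exponents and align the smaller operand. */
    if (a.exp < b.exp) { struct fp t = a; a = b; b = t; }
    b.mant = b.mant / pow(10, a.exp - b.exp);      /* shift mantissa right */
    /* Stage 3: add the aligned mantissas. */
    struct fp r = { a.mant + b.mant, a.exp };
    /* Stage 4: normalise so that 0.1 <= |mantissa| < 1. */
    while (fabs(r.mant) >= 1.0)                    { r.mant /= 10.0; r.exp++; }
    while (r.mant != 0.0 && fabs(r.mant) < 0.1)    { r.mant *= 10.0; r.exp--; }
    /* Stages 5-7: exception check and rounding (here simply to 4 digits). */
    r.mant = round(r.mant * 1e4) / 1e4;
    return r;
}

int main(void) {
    struct fp a = {0.1234, 5}, b = {0.5678, 4};
    struct fp c = fp_add(a, b);                    /* 0.18018 x 10^5, rounded to 0.1802 x 10^5 */
    printf("%.4f x 10^%d\n", c.mant, c.exp);
    return 0;
}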
Arithmetic Pipeline for Floating Point Addition
How this is achieved:
The processing unit is designed in such a way that all stages of
the computations are separate. When one pair, say the first pair
of data have gone through Stage 1 (and it enters Stage 2), the
second pair of data can enter Stage 1.
Similarly, when the first pair of data has completed Stage 2 and
entered Stage 3, the second pair of data, having completed
Stage 1, can enter Stage 2. At the same time, the third pair of
data can enter Stage 1.
This chain goes on.
Arithmetic Pipeline for Floating Point Addition
For example, at a certain point of time, the following take place
simultaneously:
Stage 6 of pair 1 (x1, y1)
Stage 5 of pair 2 (x2, y2)
Stage 4 of pair 3 (x3, y3)
Stage 3 of pair 4 (x4, y4)
Stage 2 of pair 5 (x5, y5)
Stage 1 of pair 6 (x6, y6)
Arithmetic Pipeline for Floating Point Addition
Cycle 1:  Stage 1: x1+y1;  Stages 2-6: idle
In the beginning, in the very first cycle, when (x1+y1) has started, it is in Stage 1. At
that time, remaining stages are idle.
Arithmetic Pipeline for Floating Point Addition
Cycle 2:  Stage 1: x2+y2;  Stage 2: x1+y1;  Stages 3-6: idle
In the second cycle, when (x1+y1) completed Stage 1 and it enters Stage 2, the next
pair of data (x2+y2) enters Stage 1. At this time, the Stages 3 – 6 are idle.
Arithmetic Pipeline for Floating Point Addition
Cycle 3:  Stage 1: x3+y3;  Stage 2: x2+y2;  Stage 3: x1+y1;  Stages 4-6: idle
Likewise, the pipeline keeps filling; after 6 cycles all 6 stages are occupied.
Arithmetic Pipeline for Floating Point Addition
Cycle 4:  Stage 1: x4+y4;  Stage 2: x3+y3;  Stage 3: x2+y2;  Stage 4: x1+y1;  Stages 5-6: idle
Likewise, the pipeline keeps filling; after 6 cycles all 6 stages are occupied.
Arithmetic Pipeline for Floating Point Addition
Cycle 5:  Stage 1: x5+y5;  Stage 2: x4+y4;  Stage 3: x3+y3;  Stage 4: x2+y2;  Stage 5: x1+y1;  Stage 6: idle
Likewise, the pipeline keeps filling; after 6 cycles all 6 stages are occupied.
Arithmetic Pipeline for Floating Point Addition
Cycle 6:  Stage 1: x6+y6;  Stage 2: x5+y5;  Stage 3: x4+y4;  Stage 4: x3+y3;  Stage 5: x2+y2;  Stage 6: x1+y1
Thus, in 6 cycles, all the 6 stages of the pipeline are filled up.
Architecture Categories
Architecture Categories
Classification/ Taxonomy by Flynn
Michael Flynn’s classification of Computer Architecture is based
on instruction and data streams
                      Single Instruction    Multiple Instruction
Single Data Stream          SISD                   MISD
Multiple Data Stream        SIMD                   MIMD
Flynn’s Classification
[Figures: Flynn’s classification.]
Flynn’s Classification - SISD
Now we will discuss the models SISD, SIMD and MIMD in detail.
MISD however, is hardly used for practical purposes. So, the above basic definition
will suffice for our curriculum.
(I) SISD
[Diagram: SISD organization. The Control Unit receives an instruction stream from the Memory Unit and sends an instruction stream to the Processing Unit; the data stream flows between the Processing Unit, the Memory Unit and Input/Output.]
This is SISD, the conventional architecture.
Flynn’s Classification - SISD
Instructions are fetched from the Memory Unit and sent to the Control
Unit. The control unit, after generating proper timing signals,
sends them to the processing unit.
Data required for processing are made available in either or
both of the following ways:
Fetched from input device through the control unit
Fetched from the memory unit if it is already stored there
However, parallelism can be introduced in SISD by the use of
pipelining of instructions.
Flynn’s Classification - SIMD
(II) SIMD: Single Instruction over a multiple data stream
In this architecture, many processing units are under the supervision of a
single control unit.
All processors receive the same instruction from the control unit but
operate on different items of data
Needed: shared memory with multiple interleaved modules in order to
communicate simultaneously with multiple processes
In this kind of scenario, we can use SIMD.
So, in SIMD,
Many processing units are under the supervision of a common control unit.
All processing units receive the same instruction from the control unit but
operate on different items of data.
Flynn’s Classification - SIMD
Practical scenario
Part of architecture design is about understanding application needs.
Many applications have the following nature of processing:
c = 125 Scalar (not repeated)
For i=0 to 999
a(i) = b(i) + c Vector (repeated)
or huge matrix manipulation like multiplication, inversion, etc.
The above indicates same application over many tuples of data, mostly independent
across iterations. That is, the instruction step
a(2) = b(2) + c
can be executed independent of the instruction step
a(1) = b(1) + c
Flynn’s Classification - SIMD
[Diagram: SIMD organization. A common Control Unit with Main Memory drives an array of Processing Elements; each Processing Element contains a micro control unit, registers and an Arithmetic & Logic Unit, and has its own local Memory Unit.]
Flynn’s Classification - SIMD
The principles behind SIMD are:
Replicate data path, not control
All PEs work in tandem
Common CU coordinates operations
Operation model of SIMD can be represented by a 5-tuple:
M = {N, C, I, M, R}
N: Number of processing elements (PE) in the machine, e.g., ILLIAC has 64 PEs,
Connection Machine’s CM-2 has 65,536 PEs (designed by Danny Hillis of Thinking
Machines)
C: Set of instructions directly executed by the control unit (CU), including scalar and
program flow instructions.
I: Set of instructions broadcast by the CU to all PEs for parallel execution. These
include arithmetic, logic, data routing, masking and other local operations executed
by each active PE over local data within that PE.
Flynn’s Classification - SIMD
M: Set of masking schemes, where each mask partitions the set of PEs into enabled
and disabled subsets.
R: Set of data routing functions, specifying various patterns to be set up in the
interconnection network for the inter-PE communications.
Flynn’s Classification - SIMD
Almost all SIMD machines built today are based on the distributed
memory model.
• PEs communicate among themselves through routing network
• Array of processing elements are controlled by the common array control
unit
• Program and data are loaded into the control memory through the host
computer
• Instruction decoded by the control unit
• Scalar instructions are passed on to the scalar processor directly
• Vector instruction is broadcast to the PEs for parallel execution
• Partitioned data sets are distributed to the local memory units
• Data routing network is under program control through the control unit
• PEs are synchronized by the hardware of the control unit
• Same instruction is executed by all the PEs in the same cycle. However,
PEs can be selectively disabled by masking instruction
SIMD – Distributed Memory
[Diagram: SIMD distributed-memory model. PE: Processing Element.]
SIMD – Shared Memory
(b) Shared memory model:
SIMD – Shared Memory
Note: m is chosen relatively prime with respect to n so that
parallel memory access can be achieved through skewing
without conflicts.
An example would be Burroughs Scientific Processor (BSP) with
n=16 and m=17.
[Diagram: skewing example with n = 3 PEs and m = 4 memory modules. PEs n1, n2, n3 handle elements {1,4,7}, {2,5,8}, {3,6,9}; modules m1, m2, m3, m4 hold elements {1,5,9,13,17}, {2,6,10,14,18}, {3,7,11,15,19}, {4,8,12,16,20}.]
SIMD – Processing Element
Internal structure of a Processing Element
We will give example of ILLIAC IV
processor
A, B, C: Working registers
D: Address register for holding the
address of the PE
I: Index register for internal usage
S: Status flag register. If the value is
zero then the PE is disabled for the
current instruction. Value of 1 means
selected.
R: Data routing register for sending
data to and receiving data from other
PEs in the network.
Parallel Processing - SIMD
Matrix data manipulation
What is crucial: Distributing the vector elements among the PEMs. Ideally, N
elements of a vector are retrieved from N different PEMs simultaneously. If
the matrix is one dimensional, then try to use one memory location of each
PEM.
If there is a matrix with n rows and m columns, each PEM has to occupy
either row data or column data. It depends on the values of m and n and the
number of PEs.
For example, A[i, j] is stored in i-th memory location of j-th PEM.
[Diagram: distribution of a matrix over the PEMs. Column j of the matrix is placed in the memory of PE j (PE0 ... PEn); row i occupies the i-th memory location (0 ... m-1) of each PEM.]
Vector Processing
Focus:
Large volumes of data or Series of data to be processed
Same operation will be performed over a string of data
elements
A vector instruction involves large arrays of operands .
Classification:
(A) Pipelined vector machines using a few powerful processors equipped
with vector hardware: vector supercomputers.
(B) SIMD machine model.
Vector Data
Vector Data is specified by the following parameters
(A) Base Address: From which memory location the data starts
(B) Length: The number of elements in the vector data
(C) Stride: the address gap between two consecutive elements
For example, a vector data item may
• Start at base memory location 1000
• Have a stride of 5. Thus, the subsequent elements will be stored at
addresses 1005, 1010, 1015, 1020 and so on
• Have a length of 100. So, the last element will be at memory location 1000 + 99 x 5 = 1495.
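As a quick illustration (not from the slides), the address of every element follows directly from these three parameters. The C sketch below assumes word addressing and reproduces the numbers of the example above.

#include <stdio.h>

/* Address of the i-th element of a vector operand described by (base, stride, length). */
long element_address(long base, long stride, long i) {
    return base + i * stride;
}

int main(void) {
    long base = 1000, stride = 5, length = 100;
    printf("%ld %ld %ld\n",
           element_address(base, stride, 0),            /* 1000 */
           element_address(base, stride, 1),            /* 1005 */
           element_address(base, stride, length - 1));  /* 1495 */
    return 0;
}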
SIMD – Vector Processing
Array Processors
SIMD can be array processors or associative processors. Associative
processors are content addressable where array elements can be
accessed concurrently, but are of high cost. Array processors use
RAM where arrays of data are to be accessed sequentially.
A synchronous array of parallel processors is called an array
processor; it consists of multiple processing elements under
the supervision of one control unit.
An array processor can handle single instruction and multiple data
stream (SIMD).
SIMD machines are especially designed to perform ‘vector’
computations over matrices or arrays of data.
Vector Instruction
Vector Instruction
A vector is a set of scalar data items, e.g., A[1..1000] of integer.
We can represent such a series by a vector, e.g., V1, V2 or V3,
etc.
A vector processor is one that can compute operations on
entire vectors with one simple instruction.
A vector compiler will attempt to translate loops into single
vector instructions.
Example - Suppose we have the following do loop:
4 DO 5 I = 1, N
5 X(I) = Y(I) + Z(I)
6 CONTINUE
Vector Instruction
This will be translated into one long vector of length n and a vector add
instruction will be executed.
The assembly vector instruction will be something like
VADD Y,Z,X
------------------------------------------------------------------------------------
Why is this more efficient ?
(1) Instruction is fetched and decoded only once. Thus, memory bandwidth
and the control unit overhead are reduced considerably
(2) The vector processor, after receiving the instruction, will be told that it
must fetch x amount of pairs of operands.
These operands will have a set pattern of arrangement in memory
Vector Instruction
The vector instructions can be of the following types:
Format
V1 op V2 → V3 : Vector binary operation: Operation between two vectors to
produce another vector. Example will be adding each element of an array
(V1) with the corresponding element of another array (V2) to produce
elements of a third array (V3)
V1 op S → V2 : Vector scaling operation: Operation of each element of a
vector (V1) with a common scalar quantity (S) to produce elements of
another vector (V2). Example will be multiplying each element of an array by
a constant number.
V1 op V2 → S : Vector binary reduction: Operation between two vectors
to produce a scalar quantity. Example will be the dot product of two arrays.
Vector Instruction
M(1:n) → V1 : Vector load: copying data elements from memory to the
vector operand.
V1 → M(1:n) : Vector store: copying data elements from the vector
operand to memory.
Uop V1 → V2 : Unary vector operation: Applying some operation
(e.g., negation) to each element of a vector (V1) to produce another
vector (V2).
Uop V1 → S : Unary vector reduction: Applying some operation
(e.g., Average / Maximum / Sum) to all elements of a vector (V1) to
produce a scalar (S).
Vector Instruction
Advantage of vector instructions: they effectively do the loop unrolling.
That means operation of all the elements of a vector can be
carried out simultaneously and independent of each other.
There is no need to do memory access after operation on each
element to get the next element. In vector processing, a series
of elements can be fetched together from the memory and
operated without waiting for each other.
Vector Instruction
Pseudo vector instructions
Vector Load/Store
Vector load/store instructions provide the ability to do strided
and scatter/gather memory accesses, which take data elements
distributed throughout memory and pack them into sequential
vectors/streams placed in vector/stream registers. This
promotes data locality. It results in less data pollution, since
only useful data is loaded from the memory system. It provides
latency tolerance because there can be many simultaneous
outstanding memory accesses. Vector instructions such as VLD
and VST provide this capability.
Vector Instruction
As illustrated in figure 3,
VLD V0, (V1) loads memory elements into V0 specified by the
memory addresses (indices) stored in V1.
VST (V1), V0 works in a similar manner.
VST (S1), 4, V0, stores the contents of V0 starting at the address
contained in S1 with stride of 4.
Vector Instruction
(1) Vector-Vector
Format: Vj op Vk → Vi
Example: V0 = V1 x V2 ;  V3 = V2 + V4
[Diagram: registers Vj and Vk feed a Functional Unit or Pipeline whose result goes to register Vi; one pipeline for each type of operation.]
Vector Instruction
(2) Vector-Scalar
Format: Vj op Sk → Vi
Example: V0 = V1 x 2 ;  V3 = V2 + 0.5
[Diagram: register Vj and scalar register Sk feed a Functional Unit or Pipeline whose result goes to register Vi; one pipeline for each type of operation.]
Vector Instruction
(3) Vector-Memory
Format: Load: M(1:n) → Vi ;  Store: Vi → M(1:n)
[Diagram: vector load moves data from Main Memory into register Vi; vector store moves Vi back to Main Memory.]
Vector Instruction
(4) Vector Reduction
Format: Uop Vj → Si
Example: S = ∑ Vj[i]  (sum of all elements)
[Diagram: register Vj feeds a Functional Unit or Pipeline producing scalar register Si; one pipeline for each type of operation.]
Vector Instruction
(5) Vector Binary Reduction
Format: Vj op Vk → Si
Example: S = ∑ Vj[i] * Vk[i]  (dot product)
[Diagram: registers Vj and Vk feed a Functional Unit or Pipeline producing scalar register Si; one pipeline for each type of operation.]
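For comparison, a scalar processor would compute this reduction with an explicit loop. The C sketch below (illustrative only) shows the loop that a single vector binary-reduction instruction replaces.

/* Scalar equivalent of the vector binary reduction Vj op Vk -> Si:
   the dot product of two length-n vectors, one element per iteration. */
double dot_product(const double *vj, const double *vk, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += vj[i] * vk[i];
    return s;
}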
Vector Instruction
(6) Gather : Fetches the non-zero elements of a sparse vector using
indices that themselves are indexed.
[Diagram: gather operation; the base address and the index vector are given, and the packed data register is to be filled.]
Vector Instruction
A0 = Starting memory address
V0 = Indices of the non-zero elements w.r.t. the base address in memory
VL = Length of the vector
V1 = Data register
The i-th location of V1 will get the value of the (Base + V0[i])-th memory location:
V1[i] ← Memory[A0 + V0[i]]
Example: For i = 0, V0[0] = 4; so the (Base + 4)-th, i.e., 104th memory
location will be copied to V1[0].
For i = 1, V0[1] = 2; so the (Base + 2)-th, i.e., 102nd memory location will
be copied to V1[1].
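A small C sketch of the gather step described above. The base address A0 = 100 and the index values follow the slide's example; the memory contents (600 and 400) and the array sizes are assumed purely for illustration.

#include <stdio.h>

int main(void) {
    int memory[200] = {0};
    memory[104] = 600;                 /* non-zero elements sitting in memory */
    memory[102] = 400;
    int A0 = 100;                      /* base address                        */
    int V0[] = {4, 2};                 /* indices of the non-zero elements    */
    int VL = 2;                        /* vector length                       */
    int V1[2];                         /* data register                       */

    for (int i = 0; i < VL; i++)
        V1[i] = memory[A0 + V0[i]];    /* V1[0] <- M[104], V1[1] <- M[102]    */

    printf("%d %d\n", V1[0], V1[1]);   /* prints 600 400 */
    return 0;
}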
Vector Instruction
(7) Scatter
It stores a vector into memory as a sparse vector whose non-zero
entries are indexed.
The structure is the same as for gather; the data assignment is in the
opposite direction:
Memory[A0 + V0[i]] ← V1[i]
Example: Entry 0 of V1 is copied to memory location 100 + V0[0], i.e., 104.
So, M[104] ← V1[0], i.e., 600.
Entry 1 of V1 is copied to memory location 100 + V0[1], i.e., 102.
So, M[102] ← V1[1].
Vector Instruction
Note
Gathering is done before a vector operation to fetch the
scattered data element into a register
Scattering is done after a vector operation to put back the
result into memory locations.
Vector Instruction
(8) Masking
Step 1. Elements of V0 are scanned and the mask register VM is populated accordingly (1 where the element is non-zero, 0 where it is zero).
Step 2. V1 is then populated from VM with the indices of the non-zero elements.
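A C sketch of the two masking steps; the register length of 8 and the test values are assumptions made for illustration.

#include <stdio.h>

#define N 8

int main(void) {
    double V0[N] = {0.0, 3.5, 0.0, 0.0, 7.1, 2.2, 0.0, 9.9};
    int VM[N];                           /* mask register                 */
    int V1[N];                           /* indices of non-zero elements  */
    int count = 0;

    for (int i = 0; i < N; i++)          /* Step 1: build the mask        */
        VM[i] = (V0[i] != 0.0);

    for (int i = 0; i < N; i++)          /* Step 2: compress the indices  */
        if (VM[i])
            V1[count++] = i;

    for (int i = 0; i < count; i++)
        printf("%d ", V1[i]);            /* prints 1 4 5 7 */
    printf("\n");
    return 0;
}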
Vector Processing
Summary
Same operation is performed over a stream of data
Instruction fetch and decode takes place only once
A vector instruction involves a large array of operands
Vector processor is a co-processor specially designed to
perform vector computation
Vector operand: Series of data on which same
operation will be carried out
Vector Processing
Summary
Vector computation: Same operation in all elements in
the series
Vector operands can be in memory or in registers.
Registers are faster but limited . Memory is slower but
unlimited.
Uses interleaved memory for parallel fetch
The elements of vector operands are fetched from memory and
also written back to the memory simultaneously.
Vector Processing & Architecture
Vector Processing
Vector Processing
Vector Functional Pipelining
For example, consider the following:
v0 = v1 * v2
v3 = v2 + v4
The Multiplication and Addition can be carried on in parallel, as they are not
dependent on each other.
Vector Processing
Arithmetic Pipelining
Pipelined addition: One of the six stages of the floating point addition for a pair of
elements is performed at each stage of the pipeline.
The stages are, for addition:
1. Compare Exponent
2. Shift Mantissa of the element with lower exponent
3. Add Mantissas
4. Normalize
Each stage of the pipeline has a separate arithmetic unit designed for the operation
to be performed at that stage.
Once stage A has been completed, the elements with the intermediate results can
be moved to the next stage B, while the second pair of elements enter stage A.
Vector Processing
Vector Chaining
For example, consider the following:
v0 = v1 * v2
v3 = v0 + v4
Element-wise: V0[i] = V1[i] * V2[i] and V3[i] = V0[i] + V4[i], for i = 0, 1, 2, 3.
After V0 is computed for the first pair of V1 and V2, it can be passed on to the add
pipeline.
During this process, there will be a point when operands from V1 and V2 still need to
be fetched and sent to the pipeline, and results placed into V0 are just leaving the
pipeline. The process of vector chaining will send the result from V0 directly to the
pipelined adder (at the same time it is stored in the vector register), and combined
with the appropriate value from V4.
Thus the second instruction will be able to begin before the first is finished and the
machine creates 2 results as opposed to 1. This is very similar to the MIPS pipeline
and forwarding. The result of this is approximately 3 times the peak performance
Vector Processing
Time:  tau     2 tau   3 tau   4 tau   5 tau   6 tau   7 tau   8 tau
Step
A      x1+y1   x2+y2   x3+y3   x4+y4   x5+y5   x6+y6   x7+y7   x8+y8
B              x1+y1   x2+y2   x3+y3   x4+y4   x5+y5   x6+y6   x7+y7
C                      x1+y1   x2+y2   x3+y3   x4+y4   x5+y5   x6+y6
D                              x1+y1   x2+y2   x3+y3   x4+y4   x5+y5
E                                      x1+y1   x2+y2   x3+y3   x4+y4
F                                              x1+y1   x2+y2   x3+y3
Some vector architectures provide greater efficiency by allowing
the output of one pipeline to be chained directly into another
pipeline. This feature is called chaining and eliminates the need to
store the result of the first pipeline before sending it into the
second pipeline. Figure 5 demonstrates the use of chaining in the
computation of a saxpy vector operation: a*x + y,
where x and y are vectors and a is a scalar constant.
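In scalar form, the saxpy operation referred to above is just the following loop (a minimal C sketch; on a chained vector machine the multiply and add pipelines would process these elements back to back).

/* saxpy: result[i] = a * x[i] + y[i] for vectors x, y of length n and scalar a. */
void saxpy(int n, double a, const double *x, const double *y, double *result) {
    for (int i = 0; i < n; i++)
        result[i] = a * x[i] + y[i];
}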
Vector Processing
Observe that it still takes 6*tau units of time to complete the
sum of the first pair of elements, but that the sum of the next
pair is ready in only tau more units of time. And this pattern
continues for each succeeding pair. This means that the time,
Tp, to do the pipelined addition of two vectors of length n is
Tp = 6*tau + (n-1)*tau = (n + 5)*tau.
Each of the remaining (n-1) results is churned out in one additional unit of time.
The first 6*tau units of time are required to fill the pipeline and
to obtain the first result. After the last result, xn + yn, is
completed, the pipeline is emptied out or flushed.
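These two formulas are easy to tabulate. The C sketch below (tau and the loop bounds are arbitrary choices) prints Ts = 6*n*tau, Tp = (n+5)*tau and their ratio, which approaches 6 for large n.

#include <stdio.h>

int main(void) {
    double tau = 1.0;                        /* one stage delay (arbitrary unit) */
    int stages = 6;
    for (int n = 1; n <= 1000; n *= 10) {
        double Ts = stages * n * tau;        /* serial time                      */
        double Tp = (n + stages - 1) * tau;  /* pipelined time = (n + 5) * tau   */
        printf("n=%4d  Ts=%7.0f  Tp=%7.0f  speedup=%.2f\n", n, Ts, Tp, Ts / Tp);
    }
    return 0;
}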
Vector Processing
Comparing the equations for Ts and Tp, it is clear that
(n + 5)*tau < 6*n*tau, for n > 1.
Thus, this pipelined version of addition is faster than the serial version by almost a
factor of the number of stages in the pipeline. This is an example of what makes
vector processing more efficient than scalar processing. For large n, the pipelined
addition for this sample pipeline is about six times faster than scalar addition.
In this discussion, we have assumed that the floating-point addition requires six
stages and takes 6*tau units of time. There is nothing magic about this number 6; in
fact, for some architectures, the number of stages in a floating-point addition may be
more or less than six. Further, the individual stages may be quite different from the
ones listed in section on pipelined addition. The operations at each stage of a
pipeline for floating-point multiplication are slightly different than those for addition;
a multiplication pipeline may even have a different number of stages than an
addition pipeline. There may also be pipelines for integer operations.
Vector Processing
Vector Chaining used to compute a*x + y
Vector Processing
Vector chaining used to compute a scalar value a times a vector
x, adding the elements of the resultant vector to the elements of
a second vector y (of the same length).
Chaining can double the number of floating-point operations
that are done in tau units of time. Once both the multiplication
and addition pipelines have been filled, one floating-point
multiplication and one floating-point addition (a total of two
floating-point operations) are completed every tau time units.
Conceptually, it is possible to chain more than two functional
units together, providing an even greater speedup. However
this is rarely (if ever) done due to difficult timing problems.
Vector Processor Architecture or Vector Super Computer
Vector Processing can be done in two ways
SIMD machine model
Pipelined vector machines using a few powerful processors equipped with
vector hardware. These are called vector supercomputers or the vector
processor architecture.
We have discussed SIMD before. Now we discuss Vector processor
architecture or Vector Super Computer
While vector processing can be done by the SIMD architecture by utilizing spatial
distribution of processing elements, there is another advanced architecture
which implements parallel processing by means of functional pipelining.
Vector Architecture
Let us consider two vector operands A and B. These are each a one-dimensional
array of equal size. We want to compute the dot product of these two:
C = A1B1 + A2B2 + A3B3 + …
Vector Architecture
This architecture takes advantage of the fact that each
arithmetic operation has multiple stages. Say,
multiplication has 5 stages and addition has 4 stages.
Each multiplication will go through the 5 stages
of the multiplier pipeline. When A1 and B1 are in
Stage 2, A2 and B2 will be in Stage 1.
At some point of time, the scenario will look like this:
A1 and B1 in Stage 5, A2 and B2 in Stage 4, A3 and B3 in
Stage 3,A4 and B4 in Stage 2, A5 and B5 in Stage 1,
A6 and B6 are yet to enter the pipeline.
Vector Architecture
Next, the product A1*B1 leaves the multiply pipeline
and enters the add pipeline. It gets added to the
previous value of the add pipeline, which is, say, C. We
assume that C has been initialized to 0. So, the
addition C = 0 + A1*B1 will start now. When the first
addition, i.e., C = A1*B1, comes out of the add
pipeline, it will be fed back into the add pipeline.
Simultaneously, A5*B5 has come out of the multiply
pipeline. Now, the addition of A1*B1 + A5*B5 will
start. In the next cycle, the addition of A2*B2 and A6*B6
will start, and so on.
Vector Architecture
[Diagram: vector supercomputer organization. A Scalar Processor is attached to the system; a Host Computer, together with Input/Output and Mass Storage, loads programs and data.]
Vector Architecture
As shown in the diagram, the vector processor is attached to the
scalar processor as an optional feature.
Program and data are first loaded into the main memory through a
host computer.
Instruction decoding is done by the Scalar Control Unit.
Depending on whether the operation is scalar or vector, the
decoded instruction will be sent to the scalar or the vector processor,
respectively.
Vector instruction will be sent to the vector control unit (of the
vector processor)
The vector control unit will supervise the flow of vector data
between the main memory and the vector functional pipelines
Vector Architecture
Vector Processing Models
(A) Register to Register Model (as shown in the diagram)
Vector Registers hold the operands, intermediate and
final vector results.
Vector functional pipelines retrieve operands from and
put results into the vector registers.
Vector registers are programmable
Vector registers can be of fixed length or reconfigurable
(later example: Fujitsu VP2000 series)
Vector Architecture
Vector Processing Models (Continued)
(B) Memory to Memory model
Uses a vector stream unit in place of vector registers
Vector operands and results are directly retrieved from
the main memory in superwords (e.g., 512 bits, as in the
Cyber 205).
Vector Architecture
A comparison
Memory to Memory: Slower but can support very long
length of vector operands. (suitable for longer vectors)
Register to Register: Faster, but limited length of vector
operands (suitable for smaller vectors)
Vector Architecture
A brief history
Pipelined vector supercomputers started with uniprocessor
models, e.g., Cray 1 (1976).
Recent supercomputer systems offer both uniprocessor
and multiprocessor models, e.g., the Cray Y-MP series.
Most high-end mainframes offer multiprocessor models
with add-on vector hardware, e.g., VAX9000, IBM 390/VF.
Vectorization
Benefit and How it operates
A FORTRAN example
5  DO 10 I=1,100
10   Z(I) = X(I) * Y(I)
15 CONTINUE
(gets compiled by a vectorizing compiler)
3 cycles to fetch instruction, decode instruction & fetch operand
1 cycle to execute
For 100 instructions
On Scalar: (3+1) * 100 = 400 cycles are required
On Vector: 3 + 1 * 100 = 103 cycles are required
Vectorization
I. First, load the vector registers with X and Y from memory, simultaneously.
II. One set of elements is loaded in a single clock cycle.
III. Finally, the computed values are taken from the result register
and put into Z in memory.
Vector compiler translates the code into something like
VLOAD X VR1
VLOAD Y VR2
VMULT VR1 VR2 VR3
VSTORE VR3 Z
Vector Architecture
Summary
The vector machine is faster at performing mathematical
operations on larger vectors than the scalar or MIPS
(Microprocessor without Interlocked Pipeline Stages)
machines.
The vector processing computer’s vector register architecture
makes it better able to compute vast amounts of data quickly.
Conclusion
Not widely popular today but still used in home PCs as SIMD
units which augment the scalar CPU when necessary. (usually
multimedia applications)
Vector Architecture
Implementation examples
To change the brightness or contrast of an image, a value is
added to or subtracted from each R,G,B set of data.
Limitations
One should have separate and longer registers for vector
processing.
Saving a register by reusing scalar registers might create
contention between SIMD and normal floating point
instructions.
Vector Architecture
Basic Vector Architecture
[Diagram: basic vector architecture; 64-element vector registers connected to a Functional Unit.]
Vector Processing
Note: Very Important
(A) Loop-structured program code is translated to vectorized
assembly code using a vectorizing compiler.
A well vectorized code can easily achieve a speed up of 10-20
times.
Vectorization rate: How much vectorization has been done to the
code.
(B) Assembly Language Instruction Set must support vector
operation.
(C) Of course, the hardware must have a vector processor.
Vector Processing & Processor
Summary
Same operation is performed over a stream of data
Instruction fetch and decode takes place only once
A vector instruction involves a large array of operands
Vector processor is a co-processor specially designed to
perform vector computation
Vector operand: Series of data on which same
operation will be carried out
Vector Processing & Processor
Note
For register based vector processors,
Long vectors exceeding the register length n must be segmented
to fit the vector registers, n elements at a time.
e.g., if the vector length is > 10 but <= 20, and the vector register length = 10,
then the vector operation will have to be performed 2 times.
For example, if the number of vector elements = 15, then:
the first time, 10 elements are copied to the register and processed;
the second time, the remaining 5 elements are copied to the register and
processed.
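This segmentation is often called strip-mining. A C sketch, assuming a register length of 10 and an element-wise add purely for illustration:

#define MVL 10   /* vector register length (assumed) */

/* Process a vector of length n in segments of at most MVL elements. */
void vector_add_stripmined(int n, const double *b, double c, double *a) {
    for (int start = 0; start < n; start += MVL) {
        int len = (n - start < MVL) ? (n - start) : MVL;  /* 10, then 5 for n = 15 */
        /* one "vector instruction" operates on this segment */
        for (int i = 0; i < len; i++)
            a[start + i] = b[start + i] + c;
    }
}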
Vector Access Memory Structure
Note
Flow of vector operands between memory and vector registers
is pipelined with multiple access paths.
Vector Operand specification
Location of vector operand in memory must be specified by Base
Address, Stride and Length.
If row elements are stored continuously with unit stride, the
column elements will be stored with stride of n. (Diagonal
elements will be spaced at stride n+1)
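In C terms, for a row-major n x n matrix this is simply the following (a sketch; one word per element is assumed):

/* Address of A[i][j] for an n x n matrix stored row-major starting at 'base'. */
long addr(long base, long n, long i, long j) { return base + i * n + j; }
/* consecutive row elements:      addr(b,n,i,j+1)   - addr(b,n,i,j) = 1     */
/* consecutive column elements:   addr(b,n,i+1,j)   - addr(b,n,i,j) = n     */
/* consecutive diagonal elements: addr(b,n,i+1,j+1) - addr(b,n,i,j) = n + 1 */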
Vector Access Memory Structure
Note
Arrangement of vector in memory can be row major or column
major.
The arrangement of the vector data in memory should be
interleaved to allow pipelined or parallel access
Since each vector register has fixed number of components, only
a segment of vector data from memory can be loaded per cycle.
Memory Interleaving
Memory Interleaving
Memory is not desired to be monolithic, but a collection of
modules.
Arrangement of vector data in memory should be interleaved
to allow pipeline/parallel access. Flow of vector operands
between memory and vector registers is pipelined with
parallel access path.
Same word address is applied to all memory modules
simultaneously.
Memory Interleaving
Memory is not desired to be monolithic, but a collection of
modules.
Same word address is applied to all memory modules
simultaneously.
Example:
Word address: 5 bits; Module address: 3 bits;
Memory Interleaving
Word address: 5 bits; Module number: 3 bits
So, number of modules = 2^3 = 8
Number of words in each module = 2^5 = 32
10000 000 → 16th word of module 0
10000 111 → 16th word of module 7
Memory Interleaving
M0: 0, 8, 16, 24, 32, 40, 48, 56, 64, 72, …, 240, 248
M1: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, …, 241, 249
M2: 2, 10, 18, 26, 34, 42, 50, 58, 66, 74, …, 242, 250
…
M7: 7, 15, 23, 31, 39, 47, 55, 63, 71, 79, …, 247, 255
Each module has its own memory data buffer (MDB).
8 modules with 32 words in each.
Memory Interleaving
Two kinds of memory interleaving: Low order and High order
Consider a memory bank of 256MB:
M0 M1 M2 M3
64 MB 64 MB 64 MB 64 MB
It consists of 4 memory modules, 64MB each.
256 MB = 2^28 bytes; so, we need a total of 28 address lines to access
this memory bank.
We will split these address lines into two parts, one for
selecting the module and the other for addressing a word
within a module.
Memory Interleaving
Each memory module is of size 64 MB = 2^26 bytes.
Hence, in order to access the memory words within a module, we need
26 address lines:
A25 – A0 : Word select (within a module)
The remaining bits, i.e., A27 – A26 (2 bits), will be used to select
the module:
A27 – A26 : Module select
This is called High order memory interleaving.
Memory Interleaving
The other arrangement is
A27 – A02 : Word Select (within a module)
A01 – A00 : Module Select
This is called Low order memory interleaving.
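A C sketch of the two address splits for this 256 MB / 4-module example; the function names and the test address are illustrative assumptions.

#include <stdio.h>

#define MODULE_BITS 2
#define WORD_BITS   26

void high_order(unsigned long a, unsigned *module, unsigned long *word) {
    *module = a >> WORD_BITS;                    /* A27-A26 select the module */
    *word   = a & ((1UL << WORD_BITS) - 1);      /* A25-A0 select the word    */
}

void low_order(unsigned long a, unsigned *module, unsigned long *word) {
    *module = a & ((1UL << MODULE_BITS) - 1);    /* A1-A0 select the module   */
    *word   = a >> MODULE_BITS;                  /* A27-A2 select the word    */
}

int main(void) {
    unsigned m; unsigned long w;
    high_order(67108864UL, &m, &w);  printf("high: module %u, word %lu\n", m, w); /* module 1 */
    low_order(67108864UL, &m, &w);   printf("low:  module %u, word %lu\n", m, w); /* module 0 */
    return 0;
}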
Memory Interleaving
HIGH ORDER MEMORY INTERLEAVING
[Diagram: high-order interleaving; module M0 covers byte addresses 0 to 67108863, M1 covers 67108864 to 134217727, and so on.]
Memory Interleaving
High Order Interleaving:
Higher bits (MSB) are used for selecting a memory module and
lower bits (LSB) are used to address word in a module.
Address arrangement in high order (addresses in MB):
Module   From address   To address
0              0             63
1             64            127
2            128            191
3            192            255
Memory Interleaving
In high order interleaving, consecutive words are stored
in the same module, except at the module boundaries. For
example, 0-63 MB are stored in M0, 64-127 MB in M1,
128-191 MB in M2 and 192-255 MB in M3.
Usefulness
High order interleaving is useful in cases where multiple
processing elements are accessing different zones of the
memory at the same time.
Memory Interleaving
LOW ORDER MEMORY INTERLEAVING
Memory Interleaving
Low Order Interleaving: Higher bits (MSB) are used
for addressing word in a module and lower bits (LSB) are
used to select memory module.
Address arrangement in low order
Module Word Address
0 0 4 8 12 ……….. 252
1 1 5 9 13 ……….. 253
2 2 6 10 14 ……….. 254
3 3 7 11 15 ……….. 255
Memory Interleaving
In low order interleaving, consecutive words are stored in
different modules. That is, location n and location n+1 will
be in different modules. Location 0 will be in M0, location
1 will be in M1, location 2 in M2 and location 3 will be
stored in M3. Again, location 4 will be in M0, location 5 in
M1, location 6 in M2 and location 7 will be stored in M3
and so on.
Therefore, the vector stride within a module will be equal to the
number of modules.
Memory Interleaving
Usefulness
Low order interleaving is useful in case of vector
operands where vector elements need to be fetched
simultaneously from consecutive memory locations. If
memory were organized in a contiguous (high-order) manner,
such access would make very poor use of the interleaved
modules, in fact not using the interleaving at all.
Memory Interleaving
How it works
Word addresses are asserted simultaneously on all
modules. However, module addresses have to be
asserted one by one.
Suppose we are fetching n-th word of each module.
The value of n is placed on the word select address lines.
We have to select a module. We will start from module 0.
First, 000 is placed on the module select lines and
thereby module 000 gets selected.
Memory Interleaving
So, module 0 starts accessing word n.
Before the word is accessed from module 0, the system can
access the same word from the next module, i.e., module 1.
So, module 1 starts accessing word n.
Before the access is over, module address 2 is placed on the
module select lines and the system initiates access of the n-
th word of module 2.
When array processing is required, low order interleaving is
used for pipelining. As the consecutive memory locations
are stored in different modules, they can be accessed in
parallel to be consumed by different processing units.
Memory Interleaving
For example, for operations on an array A, the elements
A[0], A[1], A[2]….will be stored in different modules, i.e.,
M0, M1, M2…respectively. Therefore, they can be
accessed in a pipelined fashion and loaded into vector
registers or local PE memories simultaneously
Memory Interleaving
Some Sample Applications
It is important to keep some actual applications in mind,
such as:
• Gaussian elimination. Recall that a typical operation is
to add a multiple of one row of our matrix to another row
of the matrix.
• Matrix multiplication. Recall that to do AB = C for
matrices A, B and C, we get Cij by taking the inner
product (“dot product”) of A’s ith row and B’s jth column.
Memory Interleaving
• Image processing, such as in Adobe Photoshop or the
open-source GIMP. Here the (i,j) element of our matrix
would be the brightness at pixel (i,j). If we wish our image
to look smoother, then at each pixel we would average the
brightness at neighbouring points, and make this average
value our new brightness at the given point.
Memory Interleaving
Categories of interleaving
Interleaving is of the following kinds:
• Simultaneous Access (S-Access)
• Concurrent Access (C-Access) and
• Concurrent-Simultaneous Access (C/S Access)
Memory Interleaving
Simultaneous Access (S-Access) memory organization
Memory Interleaving
This is the simplest technique.
A single access returns m consecutive words of information from
the m memory modules.
All the modules are accessed simultaneously.
Using low order m address bits, the information from a
particular module can be accessed.
A data latch is associated with each module. For a fetch
operation, the information from each module is gated
into its latch, whereupon the multiplexer can be used to
direct the desired data to the single word bus.
Memory Interleaving
The memory access timeline is as depicted below:
Memory access time = T (the same for all modules)
Latch delay = τ
Total time for k consecutive words = T + k * τ
Memory Interleaving
Concurrent Access (C-Access) memory organization
Memory Interleaving
Concurrent Access (C-Access) memory organization
[Diagram: C-access; the common word address is impressed on all modules concurrently, while the module selects are impressed one by one.]
Memory Interleaving
The m-way low order interleaved memory structure
allows m memory modules to be access concurrently in
an overlapped manner.
The access cycles in different memory modules are
staggered. The low order a bits select the module and
the high order b bits select the word within each module,
where m = 2^a and a + b = n is the address length.
To access a vector of stride 1, successive addresses are
latched in the address buffer at the rate of one per cycle.
Effectively it takes m minor cycles to fetch m words,
which equals one (major) memory cycle as in the time
graph below.
Memory Interleaving
After the addressed word is found in a module, the
contents are brought to the respective MDB.
We can depict the behaviour by the time graph
Θ : Major cycle [access time]
τ : Minor cycle [churn out time]
Above, each word is being fetched from different modules.
It is clear that there is a good overlap in the access of
adjacent modules.
So, how much faster is the pipelining access?
It is measured by M = Θ / τ and is called degree of
interleaving.
Memory Interleaving
If the stride is 2, the successive accesses must be separated by two
minor cycles in order to avoid access conflicts. This reduces memory
throughput by one-half.
If the stride is 3, there is no module conflict and the maximum
throughput (m words) results.
In general, C-access will yield the maximum throughput of m
words per memory cycle if the stride is relatively prime to m,
the number of interleaved memory modules.
Memory Interleaving
(C) C/S-access memory organization
This is a combination of the S-access and C-Access
organizations, where n access buses are used with m
interleaved memory modules attached to each bus. The m
modules on each bus are m-way interleaved to allow S-access.
In each memory cycle, at most m*n words are fetched if the n
buses are fully used with pipelined memory accesses.
The C/S access memory is suitable for vector multiprocessor
configurations. It provides parallel pipelined access of vector
data set with high bandwidth. A special vector cache design is
needed within each processor in order to guarantee smooth
data movement between the memory and multiple vector
processors.
Memory Interleaving
[Figures: S-Access, C-Access and C/S-Access memory organizations.]
Network for Inter Processor & Memory
Communication
Network Properties and Routing
The topology of an interconnection network can be
Static or Dynamic
Static : Point to point direct connections which will
not change during program execution
Dynamic: Dynamically configured using switched
channels
Ref: Kai Hwang, p. 76
Network Properties and Routing
Some definitions
Node Degree : The number of edges (I/O links/channels) incident on a node
Network Diameter : Maximum shortest path between any two nodes
Bisection Width : When a given network is cut into two halves, the minimum
number of edges along the cut is called channel bisection width b.
If the number of bits = w for each edge, then the wire bisection width is B =
bw.
When B is fixed, the channel width in bits is w = B/b
Another quantitative parameter is the wire (channel) length between nodes.
It determines the latency, clock skewing or power requirement.
Symmetric network: If the topology is the same w.r.t. any node.
Network Properties and Routing
Data Routing Functions
Needed for inter PE data exchange.
For multicomputer network, message passing is used
through hardware routers.
Common data routing functions:
shifting, rotation, permutation (one-to-one),
broadcast (one-to-all), multicast (many-to-many),
personalized communication (one-to-many), shuffle,
exchange, etc.
Network Properties and Routing
Network Routing Functions
A large number of network structures were
investigated by telephone companies that needed
ever larger and more efficient circuit switching
exchanges .
When it became possible to build computer systems
with more than one processor, the application of
interconnection networks to parallel computers was
also investigated
Network Properties and Routing
Network Routing Functions
The requirements of parallel computing structures are
somewhat different from those of a telephone system.
In a telephone network requests for the connection of
a circuit-switched link between an originator and a
respondent may occur at any time. The primary aim is
to maximise the number of concurrent circuits. In a
parallel architecture a network is required to support
either processor-to-memory connections or
processor-to-processor communication links. It is
instructive to visualise an interconnection network in
a parallel computer simply as a `black box', with a
number of input ports and a number of output ports,
which performs a specified routing function to
connect inputs to outputs
Network Properties and Routing
Network Routing Functions
In SIMD systems, where a single instruction operates on a
multiplicity of data, the routing function may be completely
defined within each data-movement instruction. Consequently
all input-output connections will be distinct. Much of the
remainder of this chapter discusses interconnection networks
which fall into this category. In MIMD systems, where each
connection is defined by individual processors operating
independently, no assumptions can be made about the input-
output connections requested by each processor. For example,
two processors may both request data from the same shared
memory bank simultaneously, resulting in network requests
with distinct input ports but the same output port. Although
this problem is not found exclusively
Kaushik Banerjee, IEM,in
CSEMIMD systems most of 138
Network Properties and Routing
Network Routing Functions
An idealised interconnection structure takes a set of labelled input
ports and sets up a number of connections from them to a similar set of
output ports, as shown in the figure. In order to simplify this
discussion of interconnection networks it is assumed that the
number of input and output ports in the network are equal. Hence
if A is defined to be an ordered set of N port labels, then
A = {0, 1, 2, .... , N-1} and a routing function f is a function from port
labels to port labels: f:A → A
If f is an injection on A, then it can often be represented as a
sequence of simple permutations of the labels in A.
For example, if A represented a labelled deck of playing cards then a
possible permutation to perform would be a perfect shuffle, and in
fact this is a very useful permutation for interconnection networks.
Network Properties and Routing
An idealised interconnection structure
Network Properties and Routing
Permutations
For n nodes, there will be n! ways in which the nodes can be
ordered.
The set of all permutations form a permutation group with
respect to a composition operation.
For example, the permutation π = (a,b,c)(d,e) stands for the
bijection mapping a→b, b→c, c→a and d→e, e→d.
Network Properties and Routing
Permutations
The cycle (a,b,c) has a period of 3 and the cycle (d,e) has a
period of 2, each acting in a circular fashion.
Combining the two cycles, the permutation π has a period of
3 x 2 = 6.
Network Properties and Routing
A perfect-shuffle permutation of port labels can be used
to map from a set of source labels S to a set of
destination labels D. The ordered set of input labels is
divided into two subsets of equal size which are then
interleaved. This can be represented by the bipartite
graph on the next slide, from which it can be observed
that this permutation can be produced by a simple
manipulation of the binary representation of the source
label.
Network Properties and Routing
Perfect Shuffle and Exchange
Perfect Shuffle: shift the bits one position to the left, wrapping the MSB around to the LSB.
Inverse Perfect Shuffle: shift the bits one position to the right, wrapping the LSB around to the MSB.

Perfect Shuffle        Inverse Perfect Shuffle
000 → 000              000 → 000
001 → 010              001 → 100
010 → 100              010 → 001
011 → 110              011 → 101
100 → 001              100 → 010
101 → 011              101 → 110
110 → 101              110 → 011
111 → 111              111 → 111

Bits at places (k-1, k-2, …, 1, 0) are shuffled into the pattern (k-2, k-3, …, 1, 0, k-1).
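A C sketch of the shuffle as a one-bit left rotation of a k-bit label; with k = 3 it reproduces the table above (the function name is an assumption).

#include <stdio.h>

/* Perfect shuffle of a k-bit port label: rotate left by one, wrapping the MSB to the LSB. */
unsigned shuffle(unsigned label, int k) {
    unsigned msb = (label >> (k - 1)) & 1u;
    return ((label << 1) | msb) & ((1u << k) - 1u);
}

int main(void) {
    for (unsigned p = 0; p < 8; p++)          /* k = 3: 000..111 */
        printf("%u -> %u\n", p, shuffle(p, 3));
    return 0;
}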
Network Properties and Routing
Hypercube Routing Functions
[Diagram: a 3-cube with nodes 000 to 111.]
There are three routing functions for the 3-cube. Routing can be done by
(A) Changing the MSB
(B) Changing the LSB
(C) Changing the middle bit
Observe: the routing is to an adjacent node.
For an n-cube there will be n routing functions.
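A C sketch of these routing functions: routing function Ci simply complements bit i of the node label (the node value 5 used below is an arbitrary example).

#include <stdio.h>

/* Neighbour of 'node' along dimension i of an n-cube: flip bit i. */
unsigned cube_route(unsigned node, int i) {
    return node ^ (1u << i);
}

int main(void) {
    unsigned node = 5;                       /* node 101 of a 3-cube      */
    for (int i = 0; i < 3; i++)              /* the three routing functions */
        printf("flip bit %d: %u -> %u\n", i, node, cube_route(node, i));
    /* 5 (101) -> 4 (100), 7 (111), 1 (001): each is an adjacent node */
    return 0;
}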
Network Properties and Routing
Broadcast and Multicast
Broadcast : One to all mapping.
Achieved in SIMD by a broadcast bus extended from Array
Controller to all the PEs.
In Multiprocessor, it is done by message passing.
Multicast: Many to many ; from one set to another.
Personalised Broadcast: One to selected many
Network Properties and Routing
Network Performance
Depends on
(1) Functionality: The method of routing, interrupt handling,
synchronization, request /message combining and
coherence.
(2) Network Latency: Worst case time delay for a unit message
to be transferred
(3) Bandwidth: Maximum data transfer rate in Mbytes/s
(4) Hardware complexity: Costs of wires, switches, connectors,
arbitration and interface logic
(5) Scalability
Static Network
[Figures: Linear Array, Star, Ring, Tree, Near-Neighbour Mesh 3 x 3.]
Static Network
[Figures: Chordal Ring of degree 3, Chordal Ring of degree 4 (bisection width = 5), Completely Connected network.]
Static Network
[Figures: Near-Neighbour Mesh 3 x 3, wrapped-around mesh (ILLIAC), Torus (equivalent to a chordal ring of degree 4), Systolic Array.]
Static Network
[Figures: Hypercubes (low scalability): a 3-Cube, and a 4-cube formed by interconnecting two 3-Cubes; 3-Cube Connected Cycles, a modified version of the hypercube.]
Static Network
Linear Array
Diameter = N-1
Suitable for small number of nodes
Very different from a shared bus: it allows concurrent use of the
different channels.
Ring
Uni-directional or bi-directional
Symmetric with a constant node degree = 2
Diameter = N/2 (rounded down) for a bi-directional ring; N-1 if uni-directional
The IBM token ring protocol is based on this. A message keeps
travelling until it reaches the addressed node.
Chordal Ring
Formed by increasing the node degree of a ring to 3 or 4
The higher the node degree, the shorter the diameter
Kaushik Banerjee, IEM, CSE 152
Static Network
Tree and Star
Tree: a k-level completely balanced binary tree will have N = 2^k - 1
nodes
Scalable
Long diameter of 2(k - 1)
Star is a two-level tree with a high central node degree of d = N - 1 and a
small constant diameter of 2 (for systems with a centralized supervisor
node)
Fat tree
When the channel width of a tree increases from the leaves towards
the root.
Kaushik Banerjee, IEM, CSE 153
Static Network
Mesh and Torus
Mesh is popular. Used in ILLIAC IV, MPP, DAP, CM-2 and Intel
Paragon
In general, a k-dimensional mesh with N = n^k nodes has an interior
node degree of 2k and a network diameter of k(n - 1).
In the example, we have a 2-dimensional 3 × 3 mesh with 3^2 = 9 nodes.
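The mesh formulas can be checked with a small sketch (the helper is illustrative; n and k are as defined above):

```python
def mesh_properties(n, k):
    """Properties of a k-dimensional mesh with n nodes along each dimension."""
    return {
        "nodes": n ** k,            # N = n^k
        "interior_degree": 2 * k,   # two neighbours per dimension
        "diameter": k * (n - 1),    # worst-case hop count, corner to corner
    }

print(mesh_properties(3, 2))   # the 3 x 3 example: 9 nodes, degree 4, diameter 4
```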
Systolic Arrays
This is a class of multi-dimensional pipelined array architectures
designed for implementing fixed algorithms.
The one shown in the figure is specifically designed for performing
matrix-matrix multiplication.
In general, static systolic arrays are pipelined with multidirectional
flow of data streams. They are very difficult to program.
Kaushik Banerjee, IEM, CSE 154
Dynamic Network
Dynamic interconnection networks enable the temporary
connection of any two components of a multiprocessor.
There are two categories
(A) Shared path networks
i. Single bus
ii. Multiple bus systems
(B) Switching networks
i. Crossbars
ii. Multistage networks
Kaushik Banerjee, IEM, CSE 155
Dynamic Network
Shared path network
Can be allocated to any active component of the machine
on a competitive basis
Switching networks
Allocated by a number of switches that can be set in
different ways according to the connection requirements.
Kaushik Banerjee, IEM, CSE 156
Dynamic Network
Shared Path Networks
Single shared bus
One of the most popular
A generalization and extension of the buses employed in
uniprocessor systems
Contains the same bus lines (address, data, control,
interrupt) as uniprocessors, plus some additional lines for
resolving contention (arbitration).
Very cost-effective: increasing the number of processors
does not increase the price of the shared bus.
Kaushik Banerjee, IEM, CSE 157
Dynamic Network
Limitation of shared bus
As the number of processors on the bus increases, the
probability of contention also increases proportionally,
reaching a point where the whole bandwidth of the bus is
exhausted by the processors; hence, adding a new
processor will not yield any speed-up in the
multiprocessor.
Kaushik Banerjee, IEM, CSE 158
Dynamic Network
One of the main design issues in shared bus multiprocessors is
increasing the number of processors that can usefully be
attached to the bus. The three most important techniques are:
A. Introducing private memory
B. Introducing coherent cache memory
C. Introducing multiple buses
Without these improvements the applicable number of
processors is in the range 3-5.
This can be improved up to about 30 processors using private
memory and coherent caches.
Bus hierarchies open the way to constructing scalable
shared memory systems based on bus interconnection.
Kaushik Banerjee, IEM, CSE 159
Dynamic Network
Although, in principle, the uniprocessor and multiprocessor
buses are very similar, there is a significant difference in their
mode of operation.
Uniprocessors and first-generation multiprocessors employ
the so-called locked buses; examples are the Multibus,
VMEbus, etc.
Second-generation multiprocessors apply pended buses. The
difference is in how memory accesses are handled on the bus.
Kaushik Banerjee, IEM, CSE 160
Dynamic Network
A memory write access needs two phases:
Phase 1. The address and data are transferred via the bus to
the memory controller.
Phase 2. The memory write operation (including parity check,
error correction, and so on) is executed by the memory
controller.
The first phase is typically 3-4 times faster than the second
There is no need to lock the bus during phase 2.
In a multiprocessor system, if phase 2 of one processor's access
can be overlapped with phase 1 of another's, a lot of time is saved.
Kaushik Banerjee, IEM, CSE 161
Dynamic Network
Similarly, improvement can be made in the case of a memory read.
A memory read access needs three phases:
Phase 1. The address is transferred via the bus to the memory
controller.
Phase 2. The memory read operation is executed by the
memory controller.
Phase 3. The data is transferred via the bus to the requesting
processor.
In a multiprocessor system it is possible to overlap reads by
combining phase 1 of one memory read with phase 3 of another,
executed on different memory units, using separate address and
data buses and appropriate buffers.
Kaushik Banerjee, IEM, CSE 162
Dynamic Network
Multiple shared bus
The limited bandwidth of the single shared bus represents a
major limitation in building scalable multiprocessors.
There are several ways to increase the bandwidth of the
interconnection network. A natural idea is to multiply the
number of buses, like the processors and memory units. Four
different ways have been proposed for connecting buses to
the processors, memory units and other buses:
1-dimensional multiple bus systems
2- or 3-dimensional multiple bus systems
Cluster bus systems
Hierarchical bus systems
Kaushik Banerjee, IEM, CSE 163
Dynamic Network
Bus arbitration procedures are of two kinds:
A. Serial arbitration
B. Parallel arbitration
Kaushik Banerjee, IEM, CSE 164
Dynamic Network
Switching Networks
Crossbar
The crossbar is the most powerful network type since it
provides simultaneous access among all the inputs and outputs
of the network, provided that all the requested outputs are
different. Every switch must contain arbiter logic to allocate
the memory block in the case of conflicting requests, and a bus-
bus connection unit to enable connection between the buses of
the winning processor and the memory buses. This means that
both the wiring and the logic complexity of the crossbar are
dramatically increased compared with the single bus
interconnection.
Kaushik Banerjee, IEM, CSE 165
Dynamic Network
Switching Networks (Contd.)
Crossbar (more)
The single bus system is unable to serve as an
interconnection network for scalable multiprocessors due to
the limited bandwidth of the single bus. Although the
crossbar provides a scalable bandwidth, it is not appropriate
for constructing large-scale multiprocessors because of the
large complexity and high cost of the switches. In addition, the
number of switches increases with the square of the number
of processors in the crossbar.
Kaushik Banerjee, IEM, CSE 166
Dynamic Network
Switching Networks (Contd.)
Multistage networks
Multistage networks represent a compromise between the
single bus and the crossbar switch interconnections from the
point of view of implementation complexity, cost,
connectivity and bandwidth. A multistage network consists of
alternating stages of links and switches. Many kinds of
multistage networks have been proposed and built so far.
They can be categorized according to the number of stages,
the number of switches per stage, the topology of the links
connecting subsequent stages, the type of switches
employed at the stages and the possible operation modes.
Kaushik Banerjee, IEM, CSE 167
Dynamic Network
Time Shared Common Bus
[Figure: a single time-shared bus connecting a Memory Unit,
CPU 1-3 and I/O Processors 1-2]
Only one processor can transfer to/from memory or another
processor at a time.
The receiving unit recognizes its address from the sender and responds to
the control signal; then the transfer starts.
Bus conflicts may be resolved by dedicating a bus controller which performs
the arbitration.
Kaushik Banerjee, IEM, CSE 168
Dynamic Network
Time Shared Common Bus
Improvement: two or more parallel buses.
[Figure: a common system bus with a shared Memory Unit; each of
CPU 1-3 has its own local bus connecting a System Bus Controller,
an IOP and a Local Memory]
Kaushik Banerjee, IEM, CSE 169
Dynamic Network
Time Shared Common Bus
Locally connected memory and IOP
System Bus Controller connects the local bus to the system
bus
IOP/Memory connected to System bus is shared.
At a time, only one processor can communicate with the
shared memory through the system bus, while other
processors communicate locally.
Kaushik Banerjee, IEM, CSE 170
Dynamic Network
Multi Port Memory
[Figure: CPU 1-3, each connected to Memory Units 1-3 through
separate ports]
Kaushik Banerjee, IEM, CSE 171
Dynamic Network
Multi Port Memory
Each Processor bus is connected to each memory unit.
Processor bus contains address, control and data lines
Only one port is active at any point in time
Fixed priority is given to each memory port, depending on the physical
port position.
Advantage: Multiple ports => higher transfer rate
Disadvantage: Expensive memory control logic
Good for small number of processors
Kaushik Banerjee, IEM, CSE 172
Dynamic Network
Crossbar Switches
[Figure: a crossbar connecting CPU 1-3 to Memory Units 1-3]
A switch is placed at each cross point.
The switch recognizes the address and directs the
instruction/data to the desired memory module.
Kaushik Banerjee, IEM, CSE 173
Dynamic Network
Crossbar Switches
Simultaneous transfers are possible to all memory modules.
[Figure: each memory module contains multiplexers and arbitration
logic; every CPU presents Data, Address and Control (R/W) lines and
receives a Memory Enable signal when selected]
Arbitration is by priority encoder.
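A priority encoder grants the bus to the highest-priority requester; a minimal sketch (the fixed priority order and function name are assumptions for illustration):

```python
def priority_encode(requests):
    """Return the index of the highest-priority asserted request (index 0 is
    highest), or None when no CPU is requesting the memory module."""
    for cpu, asserted in enumerate(requests):
        if asserted:
            return cpu
    return None

# CPUs at indices 1 and 2 both request the same module; index 1 wins.
print(priority_encode([False, True, True]))   # -> 1
```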
Kaushik Banerjee, IEM, CSE 174
Dynamic Network
Multi Stage Switching Network
[Figure: a 2 × 2 switch with inputs A, B and outputs 0, 1]
The Switch Concept: 2 × 2
The switch resolves/arbitrates priority between A and B for each output port.
Switches are a × b; generally a = b, and both are powers of 2.
Commonly 2 × 2, 4 × 4 and 8 × 8.
When it is 2 × 2, it is called a binary switch.
For an n × n switch, the number of legitimate states is n^n and the number of
permutations is n!. For a 2 × 2 switch these are 4 legitimate states (straight,
crossover, upper broadcast, lower broadcast) and 2 permutations.
Kaushik Banerjee, IEM, CSE 175
Dynamic Network
Multi Stage Switching Network
[Figure: a binary tree of 2 × 2 switches connecting processors P1, P2
(and other processors, marked *) to memory modules 000-111;
from P1's first switch, output 0 leads to modules 000-011 and
output 1 leads to modules 100-111]
P1, P2: Processors
000-111: Memory Modules
Kaushik Banerjee, IEM, CSE 176
Kaushik Banerjee, IEM, CSE 177
Various connectivity of a switch
[Figure: the four connection states of a 2 × 2 switch:
Straight, Crossover, Upper Broadcast, Lower Broadcast]
Kaushik Banerjee, IEM, CSE 178
Omega Network
Implementation of an 8 × 8 network using 2 × 2 switches, with the
Perfect Shuffle as the interconnection pattern between stages.
From -> To
000 -> 000 (0)
001 -> 010 (2)
010 -> 100 (4)
011 -> 110 (6)
100 -> 001 (1)
101 -> 011 (3)
110 -> 101 (5)
111 -> 111 (7)
The number in parentheses indicates the decimal equivalent of the node.
This is the Perfect Shuffle.
Kaushik Banerjee, IEM, CSE 179
Omega Network
We will demonstrate the connection from 1 to 3, i.e., from 001 to 011.
The path starts at input node 001. Since the first stage is wired as a perfect
shuffle, node 001 is connected to switch input 010, i.e., port 2:
001 (node 1) --perfect shuffle--> 010 (port 2)
Kaushik Banerjee, IEM, CSE 180
Omega Network
Now, in the transition from 001 to 011 the 1st bit is unchanged.
So, the connection within switch A will be straight.
Kaushik Banerjee, IEM, CSE 181
Omega Network
Next mapping: 010 (port 2) --perfect shuffle--> 100 (port 4)
Kaushik Banerjee, IEM, CSE 182
Omega Network
Now, in the transition from 001 to 011 the 2nd bit changes.
So, the connection through switch B will be a crossover, hence port 5.
Kaushik Banerjee, IEM, CSE 183
Omega Network
Next mapping: 101 (port 5) --perfect shuffle--> 011 (port 3).
Then, since the third bit is unchanged, the connection within switch C will be
straight, delivering the message to output 011.
Kaushik Banerjee, IEM, CSE 184
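The whole walkthrough can be reproduced by alternating a perfect-shuffle step with a straight/crossover decision at each stage. A sketch, assuming an 8 × 8 omega network of 2 × 2 switches (function names are illustrative):

```python
def rotl(x, k):
    """Perfect shuffle of a k-bit port label: rotate left by one bit."""
    return ((x << 1) & ((1 << k) - 1)) | ((x >> (k - 1)) & 1)

def omega_route(src, dst, k=3):
    """Return the port labels visited when routing src -> dst through the network."""
    cur, path = src, [src]
    for stage in range(k):
        cur = rotl(cur, k)                    # shuffle link into this stage
        bit = k - 1 - stage                   # bit examined at this stage (MSB first)
        if ((src >> bit) & 1) != ((dst >> bit) & 1):
            cur ^= 1                          # bits differ: crossover (flip LSB)
        path.append(cur)                      # otherwise straight
    assert cur == dst
    return path

# 001 -> 011 passes through ports 2 and 5 and arrives at 3, as in the slides.
print([format(p, '03b') for p in omega_route(0b001, 0b011)])
```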
Kaushik Banerjee, IEM, CSE 185
The connections 000→110 and 100→111 will conflict at switch F;
011→000 and 111→011 will conflict at switch G;
101→001 and 011→000 will conflict at switch H.
At switches I and J a broadcast would be required, if the hardware allows it.
Because of such conflicts, not all permutations can be implemented in one pass.
Kaushik Banerjee, IEM, CSE 186
Hypercube Connection
One Cube (addresses 0 and 1)
2^1 = 2 nodes
Each node contains a CPU, Local memory and I/O interface
Kaushik Banerjee, IEM, CSE 187
Hypercube Connection
Two Cube (addresses 00, 01, 10, 11)
2^2 = 4 nodes
Each node contains a CPU, Local memory and I/O interface
Kaushik Banerjee, IEM, CSE 188
Hypercube Connection
Similarly, a 3-cube has 2^3 = 8 nodes.
Likewise, an n-cube has 2^n nodes.
For say, 3-cube,
Node 000 can directly communicate with
Node 001 or
Node 010 or
Node 100
That is, only one bit change.
To communicate with 111, the message has to be routed through three links:
000 → 001 → 011 → 111
Kaushik Banerjee, IEM, CSE 189
Hypercube Connection
Take the XOR of the two node addresses:
If the 1st bit is 1, the nodes differ along the X axis.
If the 2nd bit is 1, the nodes differ along the Y axis.
If the 3rd bit is 1, the nodes differ along the Z axis.
For example, between 000 and 010 the XOR is 010, so the nodes differ along
the Y axis.
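A sketch of this XOR-based routing: XOR the two addresses, then correct one differing bit per hop (one possible visiting order; names are illustrative):

```python
def hypercube_path(src, dst, n):
    """Route src -> dst in an n-cube by fixing one differing address bit per hop."""
    path = [src]
    diff = src ^ dst              # set bits mark the axes along which the nodes differ
    for bit in range(n):
        if diff & (1 << bit):
            src ^= (1 << bit)     # move along that axis to an adjacent node
            path.append(src)
    return path

# 000 -> 111 needs three links; one valid route is 000, 001, 011, 111.
print([format(p, '03b') for p in hypercube_path(0b000, 0b111, 3)])
print(format(0b000 ^ 0b010, '03b'))   # 010: only the Y axis differs
```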
Kaushik Banerjee, IEM, CSE 190
Multiprocessors and MIMD
Kaushik Banerjee, IEM, CSE 191
MIMD
An MIMD computer system has a number of
independent processors operating upon separate data
concurrently.
Hence, each processor has its own program memory or
has access to program memory.
Similarly, each processor has its own data memory or
has access to data memory.
Clearly, there has to be a mechanism for passing
information between processors, as they work on the
same problem.
Kaushik Banerjee, IEM, CSE 192
Multiprocessors
Interconnection of two or more CPUs with memory and I/O
Necessarily means multiple CPUs with optional I/O
processors.
Falls into the category of MIMD
MIMD (Multiple Instruction Stream, Multiple Data Stream) is
a technique to achieve parallelism.
MIMD machines have a number of processors functioning
asynchronously and independently.
At any point of time, different processors may be executing
different instructions on different pieces of data.
Kaushik Banerjee, IEM, CSE 193
Multiprocessors
Most multiprocessor systems and multiple computer systems
can be included in this category.
An intrinsic MIMD computer implies interactions among the n
processors because all memory streams are derived from the
same data space shared by all the processors.
If the n data streams were derived from disjoint subspaces
of the shared memories, then we would have the so-called
MSISD (Multiple SISD) operation, which is nothing but a set of n
independent SISD uniprocessor systems.
An intrinsic MIMD computer is tightly coupled if the degree of
interactions among the processors is high. Most commercial
MIMD computers are loosely coupled.
Kaushik Banerjee, IEM, CSE 194
Multiprocessors
Many of the early parallel machines were SIMD, and SIMD received
renewed attention in the 1980s.
Except for vector processors, SIMD machines were gone by the mid 1990s.
MIMD emerged as the architecture of choice for general
purpose multiprocessors.
There are two important aspects of MIMD:
(1) Flexibility: single-user multiprocessors offering high
performance for one application.
(2) Cost-effectiveness: can be built from off-the-shelf
microprocessors, giving cost/performance advantages.
Kaushik Banerjee, IEM, CSE 195
Multiprocessors
Advantages
(1) Flexibility : Single user multiprocessors offering high
performance for one application by distributing the task.
Needs compiler/ language support.
(2) Cost Effective: Can be built on off-the-shelf
microprocessors
Improved Performance
(1)Multiple independent jobs in parallel, OR
(2)Single job partitioned into parallel tasks.
Example: Processor A: Computation, Processor B: Control
Kaushik Banerjee, IEM, CSE 196
Multiprocessors
MIMD Architecture’s Goal
Without a shared memory, it is not possible for a processor in a
multicomputer to directly access a memory block that resides in a
remote node. Rather than having a processor stall, or be
reallocated to a different process, while remote data is obtained,
distributed MIMD architectures rely on explicit message passing.
Thus, the main point of this architectural design is to develop a
message-passing parallel computer system organized such that the
processor time spent in communication within the network is
reduced to a minimum.
Kaushik Banerjee, IEM, CSE 197
Multiprocessors
Components of the Multi Computer and Their Tasks
Within a multicomputer, there are a large number of nodes and a
communication network linking these nodes together. Inside each node,
there are three important elements that have tasks related to message
passing:
Computation Processor and Private Memory
Communication Processor
This component is responsible for organizing communication among the
multicomputer nodes, "packetizing" a message as a chunk of memory on
the sending end, and "depacketizing" the same message for the
receiving node.
Router, commonly referred to as a switch unit.
The router's main task is to transmit the message from one node to the
next and to assist the communication processor in moving the
message through the network of nodes.
Kaushik Banerjee, IEM, CSE 198
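A toy sketch of the packetize/depacketize step (the packet layout and sizes are purely illustrative, not taken from the text):

```python
def packetize(message: bytes, payload_size: int = 4):
    """Split a message into (sequence number, packet count, payload) packets."""
    chunks = [message[i:i + payload_size]
              for i in range(0, len(message), payload_size)]
    return [(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]

def depacketize(packets):
    """Reassemble the original message, even if packets arrive out of order."""
    return b"".join(chunk for _, _, chunk in sorted(packets))

pkts = packetize(b"hello, node 3")
assert depacketize(reversed(list(pkts))) == b"hello, node 3"
```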
Multiprocessors
The development of each of the above three elements took
place progressively, as hardware became more and more
complex and useful.
First came the Generation I computers, in which messages
were passed through direct links between nodes, but there
were not any communication processors or routers.
Then Generation II multicomputers came along with
independent switch units that were separate from the
processor, and finally in the Generation III multicomputer, all
three components exist.
Kaushik Banerjee, IEM, CSE 199
Multiprocessors
Categorization of MIMD processors
There are two kinds of categorization of MIMD computers
Memory Organization wise
Interconnection wise
1. MIMD memory organization
There are two major types of MIMD memory architectures: distributed
memory MIMD architecture, and shared memory MIMD architecture.
Shared memory organizations are those where all processors share the
same common memory. The processors in a shared memory organisation
are known as tightly coupled, as there is a high level of interaction
among the processors.
Distributed memory organizations are those where each CPU has its own
local private memory. A message passing scheme is used for interaction
among the processors. Such a system is known as loosely coupled due to the
low degree of interaction among the processors.
Kaushik Banerjee, IEM, CSE 200
Multiprocessors Memory Models
(A) Distributed Memory MIMD Architectures
Kaushik Banerjee, IEM, CSE 201
Multiprocessors Memory Models
Advantages and Disadvantages (of distributed memory)
+Highly Scalable
+Message passing solves memory access synchronization problem
-Load balancing problem
-Deadlock in message passing
-Need to physically copy data between processes
Kaushik Banerjee, IEM, CSE 202
Multiprocessors Memory Models
(B) Shared Memory MIMD Architectures
Tightly Coupled
Also called Symmetric Multiprocessors
Kaushik Banerjee, IEM, CSE 203
Multiprocessors Memory Models
Advantages and Disadvantages (of shared memory)
+May use uniprocessor programming techniques
+Communication between processors is efficient
-Synchronized access to shared data in memory is needed
-Lack of scalability due to (memory) contention problem
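A minimal illustration of the synchronization point, using threads as a stand-in for processors sharing one memory location (the counter example is illustrative):

```python
import threading

counter = 0                     # the shared datum
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, taking the lock for each update."""
    global counter
    for _ in range(n):
        with lock:              # synchronized access to the shared data
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                  # 400000, regardless of thread interleaving
```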
Kaushik Banerjee, IEM, CSE 204
Multiprocessors Memory Models
Shared Memory
No matter how the memory blocks are physically organized, their
address spaces are unified into a global address space which is
completely visible to all processors of the shared memory
system.
Issuing a certain memory address by any processor will
access the same memory block location.
However, according to the physical organization of the
logically shared memory, two main types of shared
memory system could be distinguished:
Physically shared memory systems
Virtual (or distributed) shared memory systems
Kaushik Banerjee, IEM, CSE 205
Multiprocessors Memory Models
Shared Memory
In physically shared memory systems all memory blocks
can be accessed uniformly by all processors.
In distributed shared memory systems the memory blocks
are physically distributed among the processors as local
memory units.
The three main design issues in increasing the scalability
of shared memory systems are:
a. Organization of memory
b. Design of interconnection networks
c. Design of cache coherence protocols
Kaushik Banerjee, IEM, CSE 206
Multiprocessors Memory Models
(C) Virtual shared memory (Best of Both Worlds)
Also called distributed shared memory architecture.
Kaushik Banerjee, IEM, CSE 207
Multiprocessors Memory Models
Shared memory systems are basically classified according to
their memory organization since this is the most fundamental
design issue. Accordingly, shared memory systems can be
divided into four main classes:
1. Uniform memory access (UMA) machines. These follow the
physically shared memory architecture.
The following are models of the distributed shared memory
architecture:
2. Non-uniform memory access (NUMA) machines
3. Cache-coherent non-uniform memory access
(CC_NUMA) machines
4. Cache-only memory access (COMA) machines
Kaushik Banerjee, IEM, CSE 208
Multiprocessors Memory Models
UMA machines belong to the physically shared memory
architecture class, while NUMA, CC_NUMA, and COMA
machines form the class of distributed shared memory
architectures.
The four classes cover the three generations of shared
memory systems. Their first generation contains the UMA
machines where the interconnection network was based
either on the concept of shared bus in order to construct low-
price multiprocessors or on multistage networks in order to
build massively parallel shared memory systems.
Kaushik Banerjee, IEM, CSE 209
Multiprocessors Memory Models
Shared bus is a serious bottleneck due to contention
In first-generation shared memory systems this was alleviated by
using local cache memories. Still, the scalability was limited to the
range of 20-30 processors in shared bus machines and 100-200 in
multistage network-based machines.
The second-generation shared memory systems tried to
physically distribute the shared memory among the processors,
and also employed more complex multibus or multistage
networks.
The third-generation shared memory systems combine the
advantages of the first two generations (by introducing large
local cache memories), resulting in a dramatic improvement.
Kaushik Banerjee, IEM, CSE 210
Multiprocessors Memory Models
However, the application of caches in a multiprocessor
environment gives rise to the so-called cache consistency
problem.
In order to maintain data consistency in the caches, a cache
coherence protocol must be employed, which adds extra traffic
to the network.
Kaushik Banerjee, IEM, CSE 211
Multiprocessors Memory Models
(I) Uniform Memory Access (UMA) Machines
Contemporary uniform memory access machines are small-
size single bus multiprocessors. Large UMA machines with
hundreds of processors and a switching network were typical
in the early design of scalable shared memory systems.
Famous representatives of that class of multiprocessors are
the Denelcor HEP and the NYU Ultracomputer.
They introduced many innovative features in their design,
some of which even today represent a significant milestone in
parallel computer architectures.
Kaushik Banerjee, IEM, CSE 212
Multiprocessors Memory Models
UMA Machines (Contd.)
However, these early systems contained neither cache
memory nor local main memory, which turned out to be
necessary to achieve high performance in scalable shared
memory systems. Although the UMA architecture is not
suitable for building scalable parallel computers, it is
excellent for constructing small-size single bus
multiprocessors.
Two such machines are the Encore Multimax of Encore
Computer Corporation representing the technology of the
late 1980s and the Power Challenge of Silicon Graphics
Computing Systems representing the technology of the 1990s.
Kaushik Banerjee, IEM, CSE 213
Multiprocessors Memory Models
(II) Non-Uniform Memory Access (NUMA) Machines
Kaushik Banerjee, IEM, CSE 214
Multiprocessors Memory Models
Non-uniform memory access (NUMA) machines were designed
to avoid the memory access bottleneck of UMA machines. The
logically shared memory is physically distributed among the
processing nodes of NUMA machines, leading to distributed
shared memory architectures. On one hand these parallel
computers became highly scalable, but on the other hand they
are very sensitive to data allocation in local memories.
Accessing a local memory segment of a node is much faster
than accessing a remote memory segment. Not by chance, the
structure and design of these machines resemble in many ways
that of distributed memory multicomputers. The main
difference is in the organization of the address space.
Kaushik Banerjee, IEM, CSE 215
Multiprocessors Memory Models
In multiprocessors, a global address space is applied that is
uniformly visible from each processor; that is, all processors
can transparently access all memory locations.
In multicomputers, the address space is replicated in the local
memories of the processing elements. This difference in the
address space of the memory is also reflected at the software
level: distributed memory multicomputers are programmed
on the basis of the message-passing paradigm, while NUMA
machines are programmed on the basis of the global address
space (shared memory) principle.
Kaushik Banerjee, IEM, CSE 216
Multiprocessors Memory Models
The problem of cache coherency does not appear in
distributed memory multi-computers since the message-
passing paradigm explicitly handles different copies of the
same data structure in the form of independent messages.
In the shared memory paradigm, multiple accesses to the same
global data structure are possible and can be accelerated if
local copies of the global data structure are maintained in local
caches.
Kaushik Banerjee, IEM, CSE 217
Multiprocessors Memory Models
However, the hardware-supported cache consistency schemes
are not introduced into the NUMA machines. These systems
can cache read-only code and data, as well as local data, but
not shared modifiable data.
This is the distinguishing feature between NUMA and CC-NUMA
multiprocessors.
Kaushik Banerjee, IEM, CSE 218
Multiprocessors Memory Models
Accordingly, NUMA machines are closer to multicomputers
than to other shared memory multiprocessors, while CC-NUMA
machines look like real shared memory systems.
In NUMA machines, as in multicomputers, the main design
issues are the organization of processor nodes, the
interconnection network, and the possible techniques to
reduce remote memory accesses. Two examples of NUMA
machines are the Hector and the Cray T3D multiprocessors.
Kaushik Banerjee, IEM, CSE 219
Multiprocessors Memory Models
NUMA
Remote load
Kaushik Banerjee, IEM, CSE 220
Multiprocessors Memory Models
(III) Cache-Only Memory Access (COMA) Machines
Kaushik Banerjee, IEM, CSE 221
Multiprocessors Memory Models
Cache-Only Memory Access (COMA) Machines
COMA machines try to avoid the problems of static memory
allocation of NUMA and CC-NUMA machines by excluding main
memory blocks from the local memory of nodes and employing
only large caches as node memories. In these architectures only
cache memories are present; no main memory is employed either
in the form of a central shared memory as in UMA machines or in
the form of a distributed main memory as in NUMA and CC-NUMA
computers. Similarly to the way virtual memory has eliminated the
need to handle memory addresses explicitly, COMA machines
render static data allocation to local memories superfluous. In
COMA machines data allocation is demand driven; according to the
cache coherence scheme, data is always attracted to the local
(cache) memory where it is needed.
Kaushik Banerjee, IEM, CSE 222
Multiprocessors Memory Models
Cache-Only Memory Access (COMA) Machines
In COMA machines similar cache coherence schemes can
be applied as in other shared memory systems. The only
difference is that these techniques must be extended with
the capability of finding the data on a cache read miss and
of handling replacement. Since COMA machines are
scalable parallel architectures, only cache coherence
protocols that support large-scale parallel systems can be
applied, that is, directory schemes and hierarchical cache
coherence schemes. Two representative COMA
architectures are the DDM (Data Diffusion Machine) and the KSR1.
Kaushik Banerjee, IEM, CSE 223
Multiprocessors Memory Models
(IV) Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
All the CC-NUMA machines share the common goal of building
a scalable shared memory multiprocessor. The main difference
among them is in the way the memory and cache coherence
mechanisms are distributed among the processing nodes.
Another distinguishing design issue is the selection of the
interconnection network among the nodes. They demonstrate
a progress from bus-based networks towards a more general
interconnection network and from the snoopy cache
coherency protocol towards a directory scheme.
Kaushik Banerjee, IEM, CSE 224
Multiprocessors Memory Models
Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
The Wisconsin multicube architecture is the closest
generalization of a single bus-based multiprocessor. It
completely relies on the snoopy cache protocol but in a
hierarchical way. The main goal of the Stanford FLASH design
was the efficient integration of cache-coherent shared memory
with high-performance message passing. The FLASH applies a
directory scheme for maintaining cache coherence. The figure
below shows the design space of the CC-NUMA machines.
Kaushik Banerjee, IEM, CSE 225
Multiprocessors Memory Models
Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
Kaushik Banerjee, IEM, CSE 226