Parallel Processor Computing Unit 2
BIT, Durg
UNIT-2 : PIPELINE & VECTOR PROCESSING
VECTOR PROCESSING
A vector instruction specifies a large array of operands on which the same operation is performed.
Such instructions need pipelined systems to execute efficiently.
PIPELINING CONCEPT
To achieve pipelining we must subdivide the input process (major task) into a number of subtasks that can be
fed to the pipeline one after another without waiting for the results.
Each subtask is handled by dedicated hardware.
As earlier tasks move forward through the system, subsequent tasks enter it behind them.
A typical pipelined CPU is as follows:

[Figure: Memory → Cache → Instruction Queue (pipeline of instructions I1, I2, I3, … ready for execution) → Execution Unit containing Arithmetic and Logic pipelines]
So the major tasks are executed in parallel, while the subtasks within each of them are executed serially.
The major concern in pipelined systems is collision avoidance.
PIPELINED PROCESSING
Pipelined computers perform overlapped computations and are said to implement Temporal Parallelism. Usually
there are four major steps in program execution: first, Instruction Fetch (I.F.) from the main memory; second,
Instruction Decode (I.D.) to identify the type of operation to be performed; third, Operand Fetch (O.F.) of the
data needed by the instruction; and fourth, Execution (E.X.) of the identified operation.
In non-pipelined computers these steps are performed sequentially for every instruction. In a pipelined system,
instructions are fed in according to the capacity of the pipeline: the first instruction enters I.F., and as it advances
to the second stage (I.D.) the next instruction enters I.F. behind it. In this manner several instructions occupy
different stages at any instant, and they finish one after another in quick succession.
Say, for example, we have a system that can handle an instruction pipeline with a capacity of 5 instructions, I1
to I5. Diagrammatically the pipelining can be represented as:
Pipelined System (space-time diagram; each column is one clock period, and completed instructions leave E.X. as OUTPUT):

E.X.                   I1   I2   I3   I4   I5
O.F.              I1   I2   I3   I4   I5
I.D.         I1   I2   I3   I4   I5
I.F.    I1   I2   I3   I4   I5
        t1   t2   t3   t4   t5   t6   t7   t8
The operations of all stages are synchronized under a common clock control.
Interface latches are used to hold the intermediate results between adjacent segments.
These systems perform optimally when the same type of operation is performed throughout the pipeline, e.g.
addition. Whenever the type of instruction changes, say from addition to multiplication, the pipeline
must be drained and reconfigured.
They are therefore most appropriate for Vector processing, where the same operation is repeated over an array of
operands.
Examples: AP-120B, FPS-164, etc.
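
The timing above can be reproduced with a short simulation. Below is a minimal sketch in Python (an illustration, not part of the original tutorial) that prints the space-time diagram of the four-stage pipeline, assuming one clock period per stage and no stalls:

    # Print the space-time diagram of a 4-stage instruction pipeline for
    # five instructions; one clock period per stage, no stalls assumed.
    stages = ["I.F.", "I.D.", "O.F.", "E.X."]
    n_instr = 5
    n_cycles = n_instr + len(stages) - 1

    # Instruction Ik (0-based k) occupies stage s during clock period k + s.
    print("      " + "".join(f"t{t+1:<4}" for t in range(n_cycles)))
    for s, name in reversed(list(enumerate(stages))):
        row = ["     "] * n_cycles
        for k in range(n_instr):
            row[k + s] = f"I{k+1}   "
        print(f"{name:<5} " + "".join(row))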
[Figure: linear pipeline: I/P → L1 → S1 → L2 → S2 → L3 → S3 → L4 → O/P, where S1 to S3 are stages and L1 to L4 are interface latches]
DEDICATED PIPELINES
These pipelines are built to perform one fixed function, and the data fed to them must be in a
specific format. An example of such a dedicated pipeline is a system that performs matrix multiplication.
Even though a dedicated pipeline performs only a single function, its performance is always better than that of a
non-pipelined system for the same function.
Let’s consider the example of matrix multiplication:
Let there be two matrices A and B of size 3x3 and we try to build a pipelined system that can perform the
multiplication of A and B and generate a third resultant matrix C.
So, the formula that can be used for this is
Cij = Σk ( Aik * Bkj )
where i = 1 to 3, j = 1 to 3, k = 1 to 3.
Algorithm:
For i=1 to 3 do
{
For j=1 to 3 do
{
For k=1 to 3 do
{
Cij = Cij + ( Aik * Bkj )
}
}
}
The complexity of this program in a non-pipelined system would be 3*3*3 = 27 units of time.
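
For concreteness, here is a direct, runnable Python rendering of the algorithm above (the non-pipelined reference; the matrix values are arbitrary samples), confirming the 27 multiply-add steps:

    # Plain non-pipelined 3x3 matrix multiplication: 27 multiply-add steps.
    N = 3
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    C = [[0] * N for _ in range(N)]

    steps = 0
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
                steps += 1          # one multiply-add per time unit

    print(C)      # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
    print(steps)  # 27 time units, as stated above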
[Figure: dedicated pipeline for matrix multiplication built from processing elements. Each processing element takes operands A, B and a partial result C, passes A through unchanged, and outputs C + (A*B). The elements of B enter as a skewed (staggered) stream of rows padded with zeros, B11 first, then B21 and B12, and so on up to B33, so that each product meets the correct partial sum.]
So, compared to a non-pipelined system which takes 27 time units, we need only 8 time units to complete the same
task of matrix multiplication.
The speedup factor is 27/8, i.e. about 3.4 times the non-pipelined system.
A dedicated pipeline is meant to perform the same task always.
The data flows in only one direction, from stage i to stage j where j = i + 1; j can be neither less than i nor
greater than i + 1.
Reservation Tables are used to denote which part of the pipeline is reserved for an operation at a specific instant
in time. In a uni-functional system there is a single reservation table; in a multi-functional system the number of
reservation tables equals the number of functions.
For example, if we have two functions A and B in a multi-functional non-linear pipelined system, then the
following tables represent the reservation tables:
Here S1, S2 and S3 are the stages in the pipeline, A is the function, and t1 to t6 are the instants of time at which a
particular stage of the pipeline (Sn) is engaged in performing function A.
Likewise, S1, S2 and S3 are the stages, B is the function, and t1 to t6 are the instants at which a
particular stage (Sn) is engaged in performing function B.
In a uni-functional system we mark the table with an X instead of A or B, and there is only one
reservation table to represent the system.
Different reservation tables can be made for different tasks that occupy the resources of stages S1 to S3.
Latency: It is the step difference between the initiations of two successive tasks. For example, if a task I is
initiated at time instant t0 and another task J is initiated at time instant t5, then the latency is t5 – t0 = 5 time
steps.
Latency Sequence: It is the sequence formed by the latencies of a series of tasks performed one after another in
succession. For example, if task I is initiated at time instant t0, task J at time instant t5, and task K at time
instant t8, then the latency sequence of tasks I, J and K is 5, 3.
Latency Cycle: It is a latency sequence that repeats itself and does not lead to collision, for example 5, 3, 5, 3, …
Collision Vector (C): It is a binary vector that encodes the forbidden latencies of a particular task. We can create
a collision vector for every task present in the system. It helps us identify the forbidden latencies at which new
tasks must not be initiated, as initiating them would result in a collision.
Forbidden Set (F): It is the set of all those latencies at which new tasks should not be initiated because doing so
would lead to collision.
Consider the following example:
Let there be a task I, the first task in the system, whose reservation table is given as follows:
According to the reservation table of this first task we create the collision vector, since this is the first state in
which the system is set.
Step1: We find the column differences between every pair of crosses in each row.
Step2: Collecting these differences from all the rows gives the forbidden set F; here F = {1, 5, 6, 8}. Now we
compute the collision vector C, which is a binary vector whose values are either 0 or 1. The number of elements
in the vector equals the maximum forbidden latency, in our example 8, so we will have 8 elements in the
collision vector.
C = { Cn, Cn-1, …, C2, C1 }
where Cn to C1 are each either 0 or 1, depending upon the following condition:
Ci = 1 only if i is a member of set F,
else Ci = 0.
So, C1 = 1, C5=1, C6=1 and C8=1 and others are 0.
C8 C7 C6 C5 C4 C3 C2 C1
Value = 1 0 1 1 0 0 0 1
C = {1 0 1 1 0 0 0 1}
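
The steps above are mechanical enough to automate. The following Python sketch derives the forbidden set and collision vector from a reservation table; since the original reservation table figure is not reproduced here, the table in the code is hypothetical, chosen so that it yields the same F = {1, 5, 6, 8} as the worked example:

    # Derive the forbidden set F and collision vector C from a reservation
    # table. The table below is HYPOTHETICAL; it is constructed so that it
    # reproduces F = {1, 5, 6, 8} from the text.
    reservation_table = [
        [1, 0, 0, 0, 0, 0, 0, 0, 1],  # S1 busy at t0 and t8
        [0, 1, 0, 0, 0, 0, 1, 1, 0],  # S2 busy at t1, t6, t7
        [0, 0, 0, 1, 0, 0, 0, 0, 0],  # S3 busy at t3
    ]

    def forbidden_set(table):
        # Steps 1-2: column differences between every pair of marks per row.
        F = set()
        for row in table:
            marks = [t for t, busy in enumerate(row) if busy]
            F.update(b - a for i, a in enumerate(marks) for b in marks[i+1:])
        return F

    def collision_vector(F):
        # Ci = 1 iff i is in F; listed as Cn .. C1, n = max forbidden latency.
        n = max(F)
        return [1 if i in F else 0 for i in range(n, 0, -1)]

    F = forbidden_set(reservation_table)
    print(sorted(F))              # [1, 5, 6, 8]
    print(collision_vector(F))    # [1, 0, 1, 1, 0, 0, 0, 1]  (C8 .. C1)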
Step3: We put the elements of the collision vector into a queue (shift register) and pop the elements out (right
shift). If a 1 comes out, we cannot initiate a task at that point; if a 0 comes out, we can initiate a new
task at that point. Meanwhile, 0s are inserted at the leftmost positions as they become vacant.
Initial vector: 10110001 → right shift (pop): a 1 comes out, so a new task will not be initiated, leaving
01011000 → right shift (pop): a 0 comes out, so a new task can be initiated, leaving 00101100 → and so on.
Step4: When we get a 0 we initiate a new task, and we must then recalculate the collision vector by bitwise
ORing the original collision vector with the current shifted vector; the result of this ORing gives the new
collision vector.
10110001 >> 01011000 >> 00101100; at this point a new task is introduced, so we calculate the new collision
vector as follows:
00101100 OR 10110001 = 10111101
So 10111101 becomes the new C (collision vector), and further calculations are done on the basis of this new
C.
The above steps 1 to 4 can now be repeated, but with the newly formed collision vector in place of the old one.
By analyzing the initial collision vector we can identify that at latencies 2, 3, 4 and 7 new tasks can be
introduced into the system.
The following state diagram represents the states of the pipelined system under consideration. Each state is
labelled with its collision vector; arc labels are latencies, and 7+ denotes a latency of 7 or greater (excluding the
forbidden latency 8), all of which return the pipeline to state A:

Start → A = 10110001
A --2--> B = 10111101,  A --3--> D = 10110111,  A --4--> E = 10111011
B --2--> C = 10111111
D --4--> E,  E --3--> D
A, B, C, D, E --7+--> A
The cycles in the diagram indicate that there exist sustainable, stable sequences in which tasks can be initiated
without collisions.
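
The whole state diagram can be generated mechanically by the shift-and-OR rule of Steps 3 and 4. Below is a small Python sketch (an illustration, not from the original text) that enumerates the states and transitions starting from the initial collision vector:

    # Enumerate the pipeline state diagram. From a state, each permissible
    # latency p (a 0 in bit position Cp) leads to (state >> p) | INITIAL.
    INITIAL = 0b10110001   # C8 .. C1 of the example
    N = 8                  # collision vector length

    def successors(state):
        # All (latency, next_state) pairs for latencies 1..N; any latency
        # greater than N simply returns the pipeline to INITIAL.
        out = []
        for p in range(1, N + 1):
            if not (state >> (p - 1)) & 1:        # bit Cp is 0: p is allowed
                out.append((p, (state >> p) | INITIAL))
        return out

    seen, work = {INITIAL}, [INITIAL]
    while work:                                    # depth-first enumeration
        s = work.pop()
        for latency, t in successors(s):
            print(f"{s:08b} --{latency}--> {t:08b}")
            if t not in seen:
                seen.add(t)
                work.append(t)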
Optimizing the throughput: To optimize throughput we must find the optimum path, one that maximizes pipeline
utilization and minimizes the required time, together with a cycle that can sustain it. For this we find the average
latency of every path and pick the path with the smallest value; in other words, we find the Minimum Average
Latency (MAL) among the available set of average latencies.
Average Latency = (Sum of latencies in the path) / (Number of steps needed to go through the path)
Evaluation of the paths:

Path    Path elements     Average Latency
Path1   A–B–A             (2+7)/2 = 4.5
Path2   A–B–C–A           (2+2+7)/3 ≈ 3.67
Path3   A–D–A             (3+7)/2 = 5
Path4   A–D–E–D           (3+4+3)/3 ≈ 3.33
Path5   A–D–E–D–A         (3+4+3+7)/4 = 4.25
Path6   A–E–D–A           (4+3+7)/3 ≈ 4.67
Path7   A–E–A             (4+7)/2 = 5.5

The optimum path here is A–D–E–D, which has the least average latency, about 3.33, and after reaching E the
system can stay in the sustainable cycle E–D. So A–D–E–D gives the MAL of 3.33.
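
The table can be checked with a few lines of Python; the edge latencies below are hardcoded from the state diagram above (with 7 standing in for the 7+ arcs):

    # Recompute the average latency of each candidate path.
    edges = {
        ('A', 'B'): 2, ('A', 'D'): 3, ('A', 'E'): 4,
        ('B', 'C'): 2, ('B', 'A'): 7, ('C', 'A'): 7,
        ('D', 'E'): 4, ('D', 'A'): 7,
        ('E', 'D'): 3, ('E', 'A'): 7,
    }
    paths = [['A','B','A'], ['A','B','C','A'], ['A','D','A'],
             ['A','D','E','D'], ['A','D','E','D','A'],
             ['A','E','D','A'], ['A','E','A']]

    for path in paths:
        lats = [edges[a, b] for a, b in zip(path, path[1:])]
        print('-'.join(path), lats, "avg =", round(sum(lats) / len(lats), 2))
        # A-D-E-D gives the smallest value, 3.33: the MAL.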
The collision vector analysis described above applies to uni-functional pipelines. For multi-functional pipelines
we must consider collisions in the following situations and build collision vectors for each of them.
Let there be two functions A and B, and suppose either function may initiate at any instant of time. Then we
need to generate collision vectors for four different conditions:
1. When A has initiated and A initiates again.
2. When B has initiated and B initiates again.
3. When A has initiated and then B initiates.
4. When B has initiated and then A initiates.
For this, a collision matrix is created.
Arithmetic Pipeline : The arithmetic logic units of a computer can be segmented for pipelined operations in
various data formats. Well-known arithmetic pipeline examples are the four-stage pipes used in the Star-100, the
eight-stage pipes used in the TI-ASC, the up to 14 pipeline stages used in the Cray-1, and up to 26 stages per
pipe in the Cyber-205.
Instruction Pipelining : The execution of a stream of instructions can be pipelined by overlapping the execution
of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This technique is
also known as instruction lookahead. Almost all high-performance computers are now equipped with instruction-
execution pipelines.
Processor Pipelining : This refers to the pipelined processing of the same data stream by a cascade of processors,
each of which handles a specific task. The data stream passes through the first processor, with results stored in a
memory block that is also accessible by the second processor. The second processor then passes its refined
results to the third, and so on. The pipelining of multiple processors is not yet well accepted as a common practice.
Unifunctional & Multifunctional Pipelines : A pipeline unit with a fixed and dedicated function, such as a
floating-point adder, is called unifunctional. The Cray-1 has 12 unifunctional pipeline units for various scalar,
vector, fixed-point, and floating-point operations. A multifunction pipe may perform different subsets of stages
in the pipeline. The TI-ASC has four multifunction pipeline processors, each of which is reconfigurable for a
variety of arithmetic logic operations at different times.
Static & Dynamic Pipelines: A static pipeline may assume only one functional configuration at a time. Static
pipelines can be either unifunctional or multi-functional. Pipelining is made possible in static pipes only if
instructions of the same type are to be executed continuously. The function performed by a static pipeline should
not change frequently. Otherwise, its performance may be very low. A dynamic pipeline processor permits
several functional configurations to exist simultaneously. In this sense, a dynamic pipeline must be
multifunctional. On the other hand, a unifunctional pipe must be static. The dynamic configuration needs much
more elaborate control and sequencing mechanisms than those for static pipelines. Most existing computers are
equipped with static pipes, either unifunctional or multifunctional.
Scalar & Vector Pipelines : Depending on the instruction or data types, pipeline processors can also be
classified into scalar pipelines and vector pipelines. A scalar pipeline processes a sequence of scalar operands
under the control of a DO loop. Instructions in a small DO loop are often prefetched into the instruction buffer. The
required scalar operands for repeated scalar instructions are moved into a data cache in order to continuously
supply the pipeline with operands. The IBM System/360 Model 91 is a typical example of a machine equipped
with scalar pipelines (although the Model 91 does not have a cache). Vector pipelines are specially designed to
handle vector instructions over vector operands. Computers having vector instructions are often called vector
processors. The design of a vector pipeline is extended from that of a scalar pipeline. The handling of vector
operands in vector pipelines is under firmware and hardware control (rather than under software control as in
scalar pipelines).
PIPELINING MODELS
1. Linear Pipelining Model
2. Non-Linear Pipelining Model
3. Instruction Pipelining Model
4. Arithmetic Pipelining Model
5. Superscalar Pipelining Model
6. Super-pipelined Model
In Asynchronous pipelines the data flows between adjacent stages in an asynchronous manner, so handshaking
between the stages is needed to avoid collisions. When stage Si is ready to send, it sends a ready signal to Si+1,
and when Si+1 receives the data it sends an acknowledgement signal back to Si confirming receipt. These
pipelines therefore have variable throughput.
[Figure: asynchronous pipeline: I/P → S1 → S2 → S3 → O/P, with ready/acknowledge handshake signals between adjacent stages]
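
The ready/acknowledge protocol can be mimicked in software. The sketch below (illustrative only) models each stage as a thread and uses one-slot blocking queues in place of the hardware handshake signals, which also shows why the throughput becomes variable:

    # Software stand-in for the ready/acknowledge handshake between stages.
    from queue import Queue
    from threading import Thread

    def stage(inbox, outbox, transform):
        while True:
            item = inbox.get()           # wait for the predecessor's "ready"
            if item is None:             # end-of-stream marker
                outbox.put(None)
                return
            outbox.put(transform(item))  # blocking put ~ wait for "acknowledge"

    q0, q1, q2 = Queue(maxsize=1), Queue(maxsize=1), Queue(maxsize=1)
    Thread(target=stage, args=(q0, q1, lambda x: x + 1), daemon=True).start()
    Thread(target=stage, args=(q1, q2, lambda x: x * 2), daemon=True).start()

    for x in [1, 2, 3, None]:
        q0.put(x)
    while (item := q2.get()) is not None:
        print(item)                      # 4, 6, 8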
In Synchronous pipelines, clocked latches are used as interfaces between stages. The latches are made of
master-slave flip-flops. On the arrival of a clock pulse, all latches transfer their data to the next stage simultaneously.
[Figure: synchronous linear pipeline: I/P → L1 → S1 → L2 → S2 → L3 → S3 → L4 → O/P, with all latches under common clock control]

[Figures: non-linear pipelines with Feed-Back and Feed-Forward connections: a multiplexer (MUX) in front of each stage S1, S2, S3 selects between the normal flow and the feed-back or feed-forward paths between stages]
[Space-time diagram of a five-stage instruction pipeline that adds a Write-Back (W.B.) stage:]

W.B.                        I1   I2   I3   I4   I5   → OUTPUT
E.X.                   I1   I2   I3   I4   I5
O.F.              I1   I2   I3   I4   I5
I.D.         I1   I2   I3   I4   I5
I.F.    I1   I2   I3   I4   I5
        t1   t2   t3   t4   t5   t6   t7   t8   t9
Prefetch Buffers:
Three types of buffers can be used to match the instruction fetch rate to the pipeline consumption rate.
In one memory-access time, a block of consecutive instructions is fetched into a Prefetch Buffer.
Sequential instructions are loaded into a pair of Sequential Buffers for in-sequence pipelining.
Instructions from a branch target are loaded into a pair of Target Buffers for out-of-sequence pipelining.
Both pairs of buffers work in FIFO fashion.
Within a pair, one buffer can be used to load instructions from memory while the other feeds instructions into
the pipeline.
The two buffers alternate to prevent a collision between instructions flowing into and out of the pipeline.
A Loop Buffer is used to hold sequential instructions contained in a loop. The prefetched instructions of the loop
body are executed repeatedly until all iterations are complete. The loop buffer works in two ways: first, it holds
the instructions sequentially ahead of the current instruction; second, it recognizes when the target of a branch
falls within the loop boundary, so if the target instruction is already in the loop buffer, an extra memory access is
avoided.
In Store-Store forwarding we optimize by eliminating one of the two memory-access operations: with a single
instruction, the data from register R2 is stored directly into memory M.
This elimination of memory operations drastically improves the throughput.
Structural Hazards: They arise from resource conflicts when the hardware cannot support all possible
combinations of instructions in simultaneous overlapped fashion. A structural hazard occurs when a part of the
processor's hardware is needed by two or more instructions at the same time. A structural hazard might occur, for
instance, if a program were to execute a branch instruction followed by a computation instruction. Because they
are executed in parallel, and because branching is typically slow (requiring a comparison, program counter-
related computation, and writing to registers), it is quite possible (depending on architecture) that the computation
instruction and the branch instruction will both require the ALU (arithmetic logic unit) at the same time.
Data Hazards: They arise when an instruction depends on the result of a previous instruction in a way that is
exposed by the overlapping of instructions in the pipeline.
1. Read after Write (RAW) or True dependency: An operand is modified and read simultaneously. Because
the first instruction may not have finished writing to the operand, the second instruction may use incorrect
data. A RAW Data Hazard refers to a situation where we refer to a result that has not yet been calculated, for
example:
i1. R2 <= R1 + R3
i2. R4 <= R2 + R3
The 1st instruction is calculating a value to be saved in register 2, and the second is going to use this value to
compute a result for register 4. However, in a pipeline, when we fetch the operands for the 2nd operation, the
results from the 1st will not yet have been saved, and hence we have a data dependency.
We say that instruction 2 has a data dependency on instruction 1, as it cannot complete correctly until
instruction 1 completes.
2. Write after Read (WAR) or Anti dependency: Read an operand and write soon after to that same operand.
Because the write may have finished before the read, the read instruction may incorrectly get the new written
value. A WAR Data Hazard represents a problem with concurrent execution, for example:
i1. R4 <= R1 + R3
i2. R3 <= R1 + R2
If there is a chance that i2 may complete before i1 (i.e. with concurrent execution), we must ensure that the
result is not stored into register R3 before i1 has had a chance to fetch its operands.
3. Write after Write (WAW) or Output dependency: Two instructions that write to the same operand are
performed. The first one issued may finish second, and therefore leave the operand with an incorrect data
value. A WAW Data Hazard is another situation which may occur in a Concurrent execution environment,
for example:
i1. R2 <= R1 + R2
i2. R2 <= R4 x R7
Control Hazards: They arise from the pipelining of branches and other instructions that change the PC.
Branching hazards (also known as control hazards) occur when the processor is told to branch - i.e., if a certain
condition is true, then jump from one part of the instruction stream to another - not necessarily to the next
instruction sequentially. In such a case, the processor cannot tell in advance whether it should process the next
instruction (when it may instead have to move to a distant instruction). This can result in the processor doing
unwanted actions.
Eliminating Hazards
Eliminating a hazard often requires that some instructions in the pipeline be allowed to proceed while others
are delayed. When an instruction is stalled, all instructions issued later than the stalled instruction are also
stalled. Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never
clear.
We can delegate the task of removing data dependencies to the compiler, which can fill in an appropriate number
of NOP instructions between dependent instructions to ensure correct operation, or re-order instructions where
possible.
Bubbling the Pipeline: Bubbling the pipeline (a technique also known as a pipeline break or pipeline stall) is a
method for preventing data, structural, and branch hazards from occurring. As instructions are fetched, control
logic determines whether a hazard could/will occur. If this is true, then the control logic inserts NOPs into the
pipeline. Thus, before the next instruction (which would cause the hazard) is executed, the previous one will have
had sufficient time to complete and prevent the hazard. If the number of NOPs is equal to the number of stages in
the pipeline, the processor has been cleared of all instructions and can proceed free from hazards. This is called
flushing the pipeline. All forms of stalling introduce a delay before the processor can resume execution.
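
As an illustration of bubbling, the following Python sketch models a simple five-stage in-order pipeline that stalls an instruction until its source registers have been written back. The stage numbering (operand fetch at stage 3, write-back at stage 5) and the no-forwarding rule are simplifying assumptions, not a description of any particular machine:

    # Bubbling in a five-stage in-order pipeline. An instruction is a
    # (dest, src1, src2) register triple. Assumptions: operands are read at
    # stage 3 (O.F.), results written at stage 5 (W.B.), no forwarding, so a
    # consumer stalls until its producer's write-back has happened.
    READ_STAGE, WRITE_STAGE = 3, 5

    def schedule(instrs):
        writeback = {}                 # register -> cycle of its write-back
        issue = 1
        for dest, src1, src2 in instrs:
            earliest = issue
            for r in (src1, src2):     # O.F. must come after producers' W.B.
                if r in writeback:
                    earliest = max(earliest, writeback[r] - READ_STAGE + 2)
            print(f"{dest} <= {src1} op {src2}: {earliest - issue} bubble(s)")
            writeback[dest] = earliest + WRITE_STAGE - 1
            issue = earliest + 1
        return max(writeback.values())  # cycle when the last result is ready

    total = schedule([('R2', 'R1', 'R3'),   # i1: R2 <= R1 + R3
                      ('R4', 'R2', 'R3'),   # i2: RAW on R2 -> 2 bubbles
                      ('R5', 'R6', 'R7')])  # i3: independent, no bubbles
    print("finished at cycle", total)       # 9 instead of the stall-free 7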
Tomasulo's method: The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert
Tomasulo from IBM. It allows sequential instructions that would normally be stalled due to certain
dependencies to execute non-sequentially (out-of-order execution). It was first implemented for the IBM
System/360 Model 91’s floating point unit.
The Tomasulo algorithm also uses a common data bus (CDB) on which computed values are broadcast to all
the reservation stations that may need it. This allows for improved parallel execution of instructions which
may otherwise stall under the use of scoreboarding.
Tomasulo's algorithm implements register renaming through the use of what are called reservation
stations. Reservation stations are buffers which fetch and store instruction operands as soon as they're
available. Source operands point to either the register file or to other reservation stations. Each reservation
station corresponds to one instruction. Once all source operands are available, the instruction is sent for
execution, provided a functional unit is also available. Once execution is complete, the result is buffered at
the reservation station. Thus, unlike in scoreboarding where the functional unit would stall during a WAR
hazard, the functional unit is free to execute another instruction. The reservation station then sends the result
to the register file and any other reservation station which is waiting on that result. WAW hazards are
handled since only the last instruction (in program order) actually writes to the registers. The other results are
buffered in other reservation stations and are eventually sent to any instructions waiting for those results.
WAR hazards are handled since reservation stations can get source operands from either the register file or
other reservation stations (in other words, from another instruction). In Tomasulo's algorithm, the control
logic is distributed among the reservation stations, whereas in scoreboarding, the scoreboard keeps track of
everything.
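
The key idea, register renaming, can be shown in a few lines. The Python sketch below is a conceptual illustration (not Tomasulo's actual hardware): every write gets a fresh tag and reads refer to the latest tag, which makes the WAW example given earlier (R2 <= R1 + R2 followed by R2 <= R4 x R7) collision-free:

    # Register renaming, the idea behind the reservation stations. Fresh
    # tags per write remove WAR and WAW dependences on architectural names.
    from itertools import count

    tags = count()
    alias = {}                    # architectural register -> current tag

    def rename(dest, srcs):
        renamed = [alias.get(r, r) for r in srcs]   # read latest versions
        alias[dest] = f"t{next(tags)}"              # fresh name per write
        return alias[dest], renamed

    program = [('R2', ['R1', 'R2']),   # i1: R2 <= R1 + R2
               ('R2', ['R4', 'R7'])]   # i2: R2 <= R4 x R7 (WAW with i1)
    for dest, srcs in program:
        d, s = rename(dest, srcs)
        print(f"{d} <= {', '.join(s)}")
    # Output: t0 <= R1, R2  and  t1 <= R4, R7. The two writes now target
    # different names, so they may complete in either order.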
A pipelined floating-point adder is useful when we have two vectors, each containing a series of FP numbers,
and we want to add the elements of the two vectors in a pipelined fashion.
[Figure: a five-unit instruction pipeline: S1 Instruction Fetch Unit → S2 Instruction Decode Unit → S3 Operand Fetch Unit → S4 Instruction Execute Unit → S5 Memory Write-back Unit]
In a superscalar processor, multiple instruction pipelines are required. This implies that multiple instructions are
issued per cycle and multiple results are generated per cycle.
Superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only
independent instructions can be executed in parallel without causing a wait state.
The instruction-issue degree in a superscalar processor is limited to 2-5 in practice.
The degree of a superscalar machine is the number of instructions issued per cycle: if m instructions are issued
per cycle, the degree of the machine is m.
Instruction Level Parallelism (ILP) is the maximum number of instructions that can be simultaneously executed
in the pipeline. The ILP should therefore equal m in order to fully utilize the pipelines.
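
To make the degree-m issue constraint concrete, here is a small Python sketch (a hypothetical greedy model, not any real machine's issue logic) that groups an instruction stream into issue cycles of at most m mutually independent instructions:

    # Greedy in-order issue for a degree-m superscalar machine: up to m
    # instructions per cycle, provided they are independent within the group.
    def issue_schedule(instrs, m=2):
        cycles, group, written = [], [], set()
        for dest, srcs in instrs:
            conflict = dest in written or any(s in written for s in srcs)
            if len(group) == m or conflict:   # start a new issue cycle
                cycles.append(group)
                group, written = [], set()
            group.append(dest)
            written.add(dest)
        cycles.append(group)
        return cycles

    prog = [('R1', []), ('R2', []), ('R3', ['R1', 'R2']), ('R4', ['R1'])]
    for c, group in enumerate(issue_schedule(prog), 1):
        print(f"cycle {c}: {', '.join(group)}")
    # cycle 1: R1, R2 and cycle 2: R3, R4: four instructions in two cycles.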
Super-pipelining Models:
Super-pipelining allows a processor to improve its performance by running the pipeline at a higher clock rate:
doubling the internal clock speed, for example, yields two tasks per external clock cycle.
If a super-pipelined system is of degree n, the pipeline cycle time is 1/n of the base cycle, so every operation
takes n short cycles, each spanning 1/n of the base cycle. Hence a very high-speed clocking mechanism is
needed.
VLIW – COMPUTERS
A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length.
Multiple functional units are used concurrently in a VLIW processor.
All functional units share the use of a common large register file.
The VLIW compiler prepares fixed packets of multiple operations that give the full "plan of execution".
Dependencies are determined by the compiler and used to schedule operations according to functional-unit
latencies.
Functional units are assigned by the compiler and correspond to positions within the instruction packet
("slotting").
The compiler produces fully scheduled, hazard-free code, so the hardware does not have to "rediscover"
dependencies.
Compatibility across implementations is a major problem: VLIW code will not run properly on a machine with a
different number of functional units or different latencies, and unscheduled events (e.g., a cache miss) stall the
entire processor.
One VLIW instruction encodes multiple operations; specifically, one instruction encodes at least one operation
for each execution unit of the device. For example, if a VLIW device has five execution units, then a VLIW
instruction for that device would have five operation fields, each field specifying what operation should be done
on that corresponding execution unit. To accommodate these operation fields, VLIW instructions are usually at
least 64 bits wide, and on some architectures are much wider.
The following is an instruction for the SHARC (Super Harvard Architecture Single-Chip Computer). In one
cycle, it does a floating-point multiply, a floating-point add, and two autoincrement loads. All of this fits into a
single 48-bit instruction: f12=f0*f4, f8=f8+f12, f0=dm(i0,m3), f4=pm(i8,m9);
ARCHITECTURE OF CRAY-1
The CRAY-1 is the only computer to have been built to date that satisfies ERDA's Class VI requirement (a
computer capable of processing from 20 to 60 million floating point operations per second).
The CRAY-1's Fortran compiler (CFT) is designed to give the scientific user immediate access to the benefits of
the CRAY-1's vector processing architecture.
The arithmetic calculations are performed in discrete steps, with each step producing interim results used in
subsequent steps. Through a technique called "chaining," the CRAY-1 vector functional units, in combination
with scalar and vector registers, generate interim results and use them again immediately without additional
memory references.
Other than its computational capabilities, features that are worth mentioning are:
its small size, which reduces the distances through which the electrical signals must travel within the
computer's framework and allows a 12.5-nanosecond clock period;
a one-million-word semiconductor memory equipped with error detection and correction logic (SECDED);
its 64-bit word size; and
its optimizing Fortran compiler.
Functional Units
There are 12 functional units, organized in four groups: address, scalar, vector, and floating point.
Each functional unit is pipelined into single clock segments.
All of the functional units can operate concurrently, so in addition to the benefits of pipelining (each
functional unit can be driven at a result rate of one per clock period) we also have parallelism across the units.
Registers
The basic set of programmable registers is as follows:
8 24-bit address (A) registers
64 24-bit address-save (B) registers
8 64-bit scalar (S) registers
64 64-bit scalar-save (T) registers
8 64-word (4096-bit) vector (V) registers
Instruction Formats
Instructions are expressed in either one or two 16-bit parcels.
Vector Instructions
On the CRAY-1, vector instructions may issue at a rate of one instruction parcel per clock period.
All vector instructions are one parcel instructions (parcel size = 16 bits).
Vector instructions place a reservation on whichever functional unit they use, including memory, and on the input
operand registers.
System Software
The CRAY Operating System (COS) and the CRAY Fortran compiler (CFT) are the main software components of
the CRAY-1.
COS:
It is a batch operating system capable of supporting up to 63 jobs in a multiprogramming environment.
It is designed to receive job requests and data files from front-end computers.
Output from jobs is normally staged back to the front ends upon job completion.
Other CRAY-1 software includes the Cray Assembler Language (CAL), a powerful macro assembler, a full
range of utilities including a text editor, and some debugging aids.