CO UNIT - 5 - Parallel Processing

college notes

Uploaded by sahilmadan0508

Unit - 5

UNIT – V: Syllabus
Pipeline and Vector Processing: Parallel Processing, Pipelining, Arithmetic Pipeline,
Instruction Pipeline, RISC Pipeline, Vector Processing, Array Processors

Parallel Processing

Instead of processing each instruction sequentially, a parallel processing system provides
concurrent data processing to increase computational speed and reduce the overall
execution time.

In this, the system may have two or more ALUs and should be able to execute two or more
instructions at the same time. The system may also have two or more processors operating
concurrently.

The purpose of parallel processing is to speed up the computer's processing capability and
increase its throughput.

NOTE: Throughput is the number of instructions that can be executed in a unit of time.

Parallel processing can be achieved by using multiple functional units that perform many
operations simultaneously. As an example, arithmetic, shift, and logic operations can be
divided into three units, and operands are routed to each unit under the supervision of a
control unit.
One possible method of dividing the execution unit into eight functional units
operating in parallel is shown in figure. Depending on the operation specified
by the instruction, operands in the registers are transferred to one of the
units, associated with the operands. In each functional unit, the operation
performed is denoted in each block of the diagram. The arithmetic operations
with integer numbers are performed by the adder and integer multiplier.
Floating-point operations can be divided into three circuits operating in
parallel. Logic, shift, and increment operations are performed concurrently on
different data. All units are independent of each other, so one number can be shifted while
another number is being incremented. Generally, a multifunctional organization is
associated with a complex control unit that coordinates all the activities among the several
components.
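As an illustration only (the unit names are hypothetical, and Python threads stand in for dedicated hardware units), the dispatch of independent operations to separate functional units can be sketched as:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical functional units, each handling one class of operation.
def adder(a, b): return a + b
def integer_multiply(a, b): return a * b
def logic_and(a, b): return a & b
def shifter(a, n): return a << n
def incrementer(a): return a + 1

# The "control unit" dispatches each operation to its own unit; since
# the units are independent, the operations can proceed concurrently.
with ThreadPoolExecutor() as pool:
    futures = {
        "add":   pool.submit(adder, 7, 5),
        "mul":   pool.submit(integer_multiply, 7, 5),
        "and":   pool.submit(logic_and, 7, 5),
        "shift": pool.submit(shifter, 7, 2),
        "incr":  pool.submit(incrementer, 7),
    }
    values = {name: f.result() for name, f in futures.items()}

print(values)  # {'add': 12, 'mul': 35, 'and': 5, 'shift': 28, 'incr': 8}
```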

Flynn's Classification of Computers


M.J. Flynn proposed a classification for the organization of a computer
system by the number of instructions and data items that are manipulated
simultaneously.
The sequence of instructions read from memory constitutes
an instruction stream.

The operations performed on the data in the processor constitute a data


stream.

Parallel processing may occur in the instruction stream, in the data


stream, or both.

Flynn's classification divides computers into four major groups that are:

Single instruction stream, single data stream (SISD)

Single instruction stream, multiple data stream (SIMD)

Multiple instruction stream, single data stream (MISD)

Multiple instruction stream, multiple data stream (MIMD)

SISD (Single Instruction Stream, Single Data Stream)

It represents the organization of a single computer


containing a control unit, processor unit and a memory
unit. Instructions are executed sequentially. Parallel
processing in this case may be achieved by pipelining or
multiple functional units.

SIMD (Single Instruction Stream, Multiple Data Stream)

It represents an organization that includes multiple


processing units under the control of a common control
unit. All processors receive the same instruction from
control unit but operate on different parts of the data.

They are highly specialized computers. They are basically


used for numerical problems that are expressed in the form of vectors or matrices.
However, they are not suitable for other types of computations.
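The single-instruction, multiple-data idea can be mimicked in software: one operation is applied uniformly across a whole data set. A minimal sketch (ordinary Python, not real SIMD hardware, with made-up data):

```python
data_a = [1.0, 2.0, 3.0, 4.0]
data_b = [10.0, 20.0, 30.0, 40.0]

# SIMD model: a single operation is broadcast across all data elements.
# A real SIMD machine issues this as one instruction over many data
# items; here a single map() expresses the same idea.
add = lambda pair: pair[0] + pair[1]
c = list(map(add, zip(data_a, data_b)))

print(c)  # [11.0, 22.0, 33.0, 44.0]
```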
MISD (Multiple Instruction Stream, Single Data Stream)

It consists of a single computer containing multiple


processors connected with multiple control units and a
common memory unit. It is capable of processing several
instructions over single data stream simultaneously. MISD
structure is only of theoretical interest since no practical
system has been constructed using this organization.

MIMD (Multiple Instruction Stream, Multiple Data Stream)

It represents the organization which is capable of


processing several programs at same time. It is the
organization of a single computer containing multiple
processors connected with multiple control units and a
shared memory unit. The shared memory unit contains
multiple modules to communicate with all processors
simultaneously. Multiprocessors and multicomputers are examples of MIMD. It fulfills the
demand of large-scale computations.

Pipelining : Pipelining is an arrangement of the hardware elements of the CPU such that its
overall performance is increased. Simultaneous execution of more than one instruction
takes place in a pipelined processor.

The term Pipelining refers to a technique of decomposing a sequential process


into sub-operations, with each sub-operation being executed in a dedicated
segment that operates concurrently with all other segments.

The most important characteristic of a pipeline technique is that several


computations can be in progress in distinct segments at the same time

Let us consider an example of combined multiplication and addition operation to


get a better understanding of the pipeline organization.

The combined multiplication and addition operation is done with a stream


of numbers such as:

Ai * Bi + Ci for i = 1, 2, 3, ......., 7


The operation to be performed on the numbers is decomposed into sub-
operations with each sub-operation to be implemented in a segment
within a pipeline.

The sub-operations performed in each segment of the pipeline are defined


as:

R1 ← Ai, R2 ← Bi Input Ai, and Bi


R3 ← R1 * R2, R4 ← Ci Multiply, and input Ci
R5 ← R3 + R4 Add Ci to product

The following block diagram represents the combined as well as the sub-
operations performed in each segment of the pipeline.
 Perhaps the simplest way of viewing the pipeline structure is to imagine that each
segment consists of an input register followed by a combinational circuit. The register
holds the data and the combinational circuit performs the suboperation in the particular
segment.

 The output of the combinational circuit in a given segment is applied to the input register
of the next segment. A clock is applied to all registers after enough time has elapsed to
perform all segment activity. In this way the information flows through the pipeline one
step at a time.

 Each segment of this pipeline has one or two registers and a combinational circuit, as
shown in Fig. 2. R1 through R5 are registers that receive new data with every clock pulse;
the multiplier and adder are combinational circuits.

 The five registers are loaded with new data every clock pulse. The effect of each clock
pulse is shown in Table 1. The first clock pulse transfers A1 and B1 into R1 and R2.
 The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4. The
same clock pulse transfers A2 and B2 into R1 and R2. The third clock pulse operates on
all three segments simultaneously.
 It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers
C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock pulses to fill up
the pipe and retrieve the first output from R5. From there on, each clock produces a new
output and moves the data one step down the pipeline.

 This happens as long as new input data flow into the system. When no more input data
are available, the clock must continue until the last output emerges out of the pipeline.
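The clock-by-clock behaviour described above can be reproduced with a short simulation (a software sketch of the three-segment pipeline, not a hardware model; the data values are made up for illustration):

```python
A = [1, 2, 3, 4, 5, 6, 7]
B = [2, 2, 2, 2, 2, 2, 2]
C = [10, 10, 10, 10, 10, 10, 10]

R1 = R2 = R3 = R4 = R5 = None
outputs = []

for clock in range(len(A) + 2):           # 2 extra pulses drain the pipe
    # Evaluate segments back-to-front so each one uses the values
    # latched on the previous clock pulse, as simultaneously clocked
    # registers would.
    if R3 is not None:
        R5 = R3 + R4                      # segment 3: add Ci to product
        outputs.append(R5)
    if R1 is not None:
        R3, R4 = R1 * R2, C[clock - 1]    # segment 2: multiply, input Ci
    else:
        R3 = R4 = None
    if clock < len(A):
        R1, R2 = A[clock], B[clock]       # segment 1: input Ai and Bi
    else:
        R1 = R2 = None

print(outputs)  # Ai*Bi + Ci for each i: [12, 14, 16, 18, 20, 22, 24]
```

The first result appears only after three clock pulses (the fill-up time); after that, one result emerges per clock.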
In general, the pipeline organization is applicable for two areas of
computer design which includes:

Arithmetic Pipeline

Instruction Pipeline

Arithmetic Pipeline

Arithmetic Pipelines are mostly used in high-speed computers. They are used to
implement floating-point operations, multiplication of fixed-point numbers, and
similar computations encountered in scientific problems.

To understand the concepts of arithmetic pipeline in a more convenient


way, let us consider an example of a pipeline unit for floating-point
addition and subtraction.

The inputs to the floating-point adder pipeline are two normalized floating-
point binary numbers defined as:

X = A * 2^a
Y = B * 2^b

where A and B are two fractions that represent the mantissas, and a and b are the
exponents. For the worked example, decimal numbers are used for clarity:

X = 0.9504 * 10^3
Y = 0.8200 * 10^2

The combined operation of floating-point addition and subtraction is
divided into four segments. Each segment contains the corresponding
suboperation to be performed in the given pipeline. The suboperations
that are shown in the four segments are:

1. Compare the exponents by subtraction.


2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.

We will discuss each suboperation in a more detailed manner later in this


section.

The following block diagram represents the suboperations performed in


each segment of the pipeline.
Note: Registers are placed after each suboperation to store the intermediate results.

1. Compare exponents by subtraction:


The exponents are compared by subtracting them to determine their
difference. The larger exponent is chosen as the exponent of the result.

The difference of the exponents, i.e., 3 - 2 = 1 determines how many


times the mantissa associated with the smaller exponent must be shifted
to the right.

2. Align the mantissas:


The mantissa associated with the smaller exponent is shifted according to
the difference of exponents determined in segment one.

X = 0.9504 * 10^3
Y = 0.08200 * 10^3

3. Add mantissas:
The two mantissas are added in segment three.

Z = X + Y = 1.0324 * 10^3

4. Normalize the result:


After normalization, the result is written as:

Z = 0.10324 * 10^4
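Putting the four segments together, the worked example can be traced with a small sketch (decimal arithmetic is used to match the example above; real hardware would operate on binary mantissas and exponents):

```python
def fp_add(a_mant, a_exp, b_mant, b_exp):
    # Segment 1: compare exponents by subtraction; the larger exponent
    # becomes the exponent of the result.
    diff = a_exp - b_exp
    exp = max(a_exp, b_exp)

    # Segment 2: align the mantissa of the smaller exponent by
    # shifting it right |diff| places (here, dividing by 10**|diff|).
    if diff > 0:
        b_mant /= 10 ** diff
    elif diff < 0:
        a_mant /= 10 ** -diff

    # Segment 3: add the aligned mantissas.
    mant = a_mant + b_mant

    # Segment 4: normalize so the mantissa is a fraction less than 1.
    while abs(mant) >= 1.0:
        mant /= 10
        exp += 1
    return mant, exp

# X = 0.9504 * 10^3, Y = 0.8200 * 10^2
print(fp_add(0.9504, 3, 0.8200, 2))  # mantissa ≈ 0.10324, exponent 4
```

In a pipelined unit these four steps run concurrently on different operand pairs, one pair per segment per clock.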

Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the
instruction stream as well.

Most digital computers with complex instructions require an instruction pipeline to carry
out operations like fetch, decode, and execute instructions.

In general, the computer needs to process each instruction with the


following sequence of steps.

1. Fetch instruction from memory.


2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.

Each step is executed in a particular segment, and there are times when
different segments may take different times to operate on the incoming
information. Moreover, there are times when two or more segments may
require memory access at the same time, causing one segment to wait
until another is finished with the memory.


The organization of an instruction pipeline will be more efficient if the


instruction cycle is divided into segments of equal duration. One of the
most common examples of this type of organization is a Four-segment
instruction pipeline.

A four-segment instruction pipeline combines two or more of these steps into a single
segment. For instance, the decoding of the instruction can be combined with the
calculation of the effective address into one segment.

The following block diagram shows a typical example of a four-segment


instruction pipeline. The instruction cycle is completed in four segments.
Segment 1:

The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2:

The instruction fetched from memory is decoded in the second segment,


and eventually, the effective address is calculated in a separate arithmetic
circuit.

Segment 3:

An operand from memory is fetched in the third segment.

Segment 4:

The instructions are finally executed in the last segment of the pipeline
organization.
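Assuming one instruction enters the pipeline per clock and no conflicts occur, the completion times follow directly. A small sketch of this timing (the segment names FI, DA, FO, EX are common shorthands for the four stages, used here only for illustration):

```python
def schedule(n_instr, segments=("FI", "DA", "FO", "EX")):
    """Clock cycle (1-based) at which each instruction leaves the last
    segment, assuming one instruction enters per cycle and no stalls."""
    return [i + len(segments) - 1 for i in range(1, n_instr + 1)]

# 5 instructions in a 4-segment pipeline finish in 8 clocks, instead of
# the 20 a purely sequential machine would need (k + n - 1 vs k * n).
print(schedule(5))  # [4, 5, 6, 7, 8]
```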

Data Dependency

 A difficulty that may cause a degradation of performance in an instruction pipeline is the
possible collision of data or addresses.

 A collision occurs when an instruction cannot proceed because previous instructions did
not complete certain operations.

 A data dependency occurs when an instruction needs data that are not yet available.

 For example, an instruction in the FO (fetch operand) segment may need to fetch an
operand that is being generated at the same time by the previous instruction in segment
EX (execute).

 Therefore, the second instruction must wait for data to become available by the first
instruction. Similarly, an address dependency may occur when an operand address
cannot be calculated because the information needed by the addressing mode is not
available.

 For example, an instruction with register indirect mode cannot proceed to fetch the
operand if the previous instruction is loading the address into the register.

 Therefore, the operand access to memory must be delayed until the required address is
available. Pipelined computers deal with such conflicts between data dependencies in a
variety of ways.
 The most straightforward method is to insert hardware interlocks. An interlock is a
circuit that detects instructions whose source operands are destinations of instructions
farther up in the pipeline.

 Detection of this situation causes the instruction whose source is not available to be
delayed by enough clock cycles to resolve the conflict. This approach maintains the
program sequence by using hardware to insert the required delays.

 Another technique called operand forwarding uses special hardware to detect a conflict
and then avoid it by routing the data through special paths between pipeline segments.

 For example, instead of transferring an ALU result into a destination register, the
hardware checks the destination operand, and if it is needed as a source in the next
instruction, it passes the result directly into the ALU input, bypassing the register.

 This method requires additional hardware paths through multiplexers as well as the
circuit that detects the conflict.
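A rough software sketch of the forwarding idea (the register names and the single-stage model are illustrative assumptions, not a hardware description):

```python
def alu_stage(result, dest, next_sources, regfile):
    """One execute-stage step with operand forwarding: the ALU result
    is written back to the register file, but if the next instruction
    needs it as a source, it is also routed straight to the ALU input,
    bypassing the register read."""
    regfile[dest] = result                 # normal write-back
    forwarded = {src: result for src in next_sources if src == dest}
    return forwarded                       # bypass path (may be empty)

regs = {"R1": 0, "R2": 5}
# Some instruction writes 42 into R1; the next instruction reads R1, R2.
bypass = alu_stage(42, "R1", ("R1", "R2"), regs)
print(bypass)  # {'R1': 42}: R1 comes from the bypass, R2 from the file
```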

 A procedure employed in some computers is to give the responsibility for solving data
conflict problems to the compiler that translates the high-level programming language
into a machine language program.

 The compiler for such computers is designed to detect a data conflict and reorder the
instructions as necessary to delay the loading of the conflicting data by inserting no-
operation instructions. This method is referred to as delayed load.

Handling of Branch Instructions

 One of the major problems in operating an instruction pipeline is the occurrence of


branch instructions.

 A branch instruction can be conditional or unconditional. An unconditional branch


always alters the sequential program flow by loading the program counter with the target
address.
 In a conditional branch, the control selects the target instruction if the condition is
satisfied or the next sequential instruction if the condition is not satisfied. As mentioned
previously, the branch instruction breaks the normal sequence of the instruction stream,
causing difficulties in the operation of the instruction pipeline.

 Pipelined computers employ various hardware techniques to minimize the performance


degradation caused by instruction branching. One way of handling a conditional branch is
to prefetch the target instruction in addition to the instruction following the branch. Both
are saved until the branch is executed. If the branch condition is successful, the pipeline
continues from the branch target instruction. An extension of this procedure is to continue
fetching instructions from both places until the branch decision is made. At that time
control chooses the instruction stream of the correct program flow.

 Another possibility is the use of a branch target buffer or BTB.

 The BTB is an associative memory included in the fetch segment of the pipeline. Each
entry in the BTB consists of the address of a previously executed branch instruction and
the target instruction for that branch.

 It also stores the next few instructions after the branch target instruction. When the
pipeline decodes a branch instruction, it searches the associative memory BTB for the
address of the instruction.

 If it is in the BTB, the instruction is available directly and prefetch continues from the
new path.

 If the instruction is not in the BTB, the pipeline shifts to a new instruction stream and
stores the target instruction in the BTB. The advantage of this scheme is that branch
instructions that have occurred previously are readily available in the pipeline without
interruption. A variation of the BTB is the loop buffer.

 This is a small very high speed register file maintained by the instruction fetch segment
of the pipeline. When a program loop is detected in the program, it is stored in the loop
buffer in its entirety, including all branches.

 The program loop can be executed directly without having to access memory until the
loop mode is removed by the final branching out. Another procedure that some
computers use is branch prediction . A pipeline with branch prediction uses some
additional logic to guess the outcome of a conditional branch instruction before it is
executed.
 The pipeline then begins prefetching the instruction stream from the predicted path. A
correct prediction eliminates the wasted time caused by branch penalties. A procedure
employed in most RISC processors is the delayed branch .

 In this procedure, the compiler detects the branch instructions and rearranges the
machine language code sequence by inserting useful instructions that keep the pipeline
operating without interruptions.

 An example of delayed branch is the insertion of a no-operation instruction after a


branch instruction. This causes the computer to fetch the target instruction during the
execution of the no operation instruction, allowing a continuous flow of the pipeline.
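The BTB lookup described above behaves like an associative memory keyed by branch address; a dictionary gives a minimal software sketch (the addresses and instruction names are hypothetical):

```python
btb = {}  # branch address -> (target address, prefetched instructions)

def fetch_branch(branch_addr, target_addr, prefetch):
    """On decoding a branch, consult the BTB.  On a hit, the target
    instructions are available immediately; on a miss, install the
    target and its following instructions for next time."""
    if branch_addr in btb:
        return ("hit", btb[branch_addr])   # prefetch continues at once
    btb[branch_addr] = (target_addr, prefetch)
    return ("miss", btb[branch_addr])

print(fetch_branch(0x100, 0x200, ["I200", "I204"]))  # first time: miss
print(fetch_branch(0x100, 0x200, ["I200", "I204"]))  # next time: hit
```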

RISC Pipeline

 Among the characteristics attributed to RISC is its ability to use an efficient instruction
pipeline.

 The simplicity of the instruction set can be utilized to implement an instruction pipeline
using a small number of suboperations, with each being executed in one clock cycle.

 Because of the fixed-length instruction format, the decoding of the operation can occur
at the same time as the register selection. All data manipulation instructions have
register-to-register operations.

 Since all operands are in registers, there is no need for calculating an effective address
or fetching of operands from memory. Therefore, the instruction pipeline can be
implemented with two or three segments. One segment fetches the instruction from
program memory, and the other segment executes the instruction in the ALU.

 A third segment may be used to store the result of the ALU operation in a destination
register. The data transfer instructions in RISC are limited to load and store instructions.
These instructions use register indirect addressing.

 They usually need three or four stages in the pipeline. To prevent conflicts between a
memory access to fetch an instruction and to load or store an operand, most RISC
machines use two separate buses with two memories: one for storing the instructions
and the other for storing the data. The two memories can sometimes operate at the same
speed as the CPU clock and are referred to as cache memories.
 One of the major advantages of RISC is its ability to execute instructions at the rate of
one per clock cycle. It is not possible to expect that every instruction be fetched from
memory and executed in one clock cycle.

 What is done, in effect, is to start each instruction with each clock cycle and to pipeline
the processor to achieve the goal of single-cycle instruction execution.

 The advantage of RISC over CISC (complex instruction set computer) is that RISC can
achieve pipeline segments that require just one clock cycle, while CISC uses many
segments in its pipeline, with the longest segment requiring two or more clock cycles.
Another characteristic of RISC is the support given by the compiler that translates the
high-level language program into machine language program.

 Instead of designing hardware to handle the difficulties associated with data conflicts
and branch penalties, RISC processors rely on the efficiency of the compiler to detect
and minimize the delays encountered with these problems.

Example: Three-Segment Instruction Pipeline

 The data manipulation instructions operate on data in processor registers. The data
transfer instructions are load and store instructions that use an effective address
obtained from the addition of the contents of two registers or a register and a
displacement constant provided in the instruction.

 The program control instructions use register values and a constant to evaluate the
branch address, which is transferred to a register or the program counter PC.

 Now consider the hardware operation for such a computer. The control section fetches
the instruction from program memory into an instruction register. The instruction is
decoded at the same time that the registers needed for the execution of the instruction
are selected.

 The processor unit consists of a number of registers and an arithmetic logic unit (ALU)
that performs the necessary arithmetic, logic, and shift operations. A data memory is
used to load or store the data from a selected register in the register file. The instruction
cycle can be divided into three suboperations and implemented in three segments:
 I: Instruction fetch
 A: ALU operation
 E: Execute instruction

 The I segment fetches the instruction from program memory. The instruction is decoded
and an ALU operation is performed in the A segment.

 The ALU is used for three different functions, depending on the decoded instruction. It
performs an operation for a data manipulation instruction, it evaluates the effective
address for a load or store instruction, or it calculates the branch address for a program
control instruction.

 The E segment directs the output of the ALU to one of three destinations, depending on
the decoded instruction. It transfers the result of the ALU operation into a destination
register in the register file, it transfers the effective address to a data memory for loading
or storing, or it transfers the branch address to the program counter.

Delayed Load

 Consider now the operation of the following four instructions:

 1. LOAD: R1 ← M [address 1]
 2. LOAD: R2 ← M [address 2]
 3. ADD: R3 ← R1 + R2
 4. STORE: M[address 3] ← R3

 If the three-segment pipeline proceeds without interruptions, there will be a data


conflict in instruction 3 because the operand in R2 is not yet available in the A segment.

 This can be seen from the timing of the pipeline shown in Fig. 9(a). The E segment in
clock cycle 4 is in the process of placing the memory data into R2.
 The A segment in clock cycle 4 is using the data from R2, but the value in R2 will not be
the correct value since it has not yet been transferred from memory.

 It is up to the compiler to make sure that the instruction following the load instruction
uses the data fetched from memory. If the compiler cannot find a useful instruction to put
after the load, it inserts a no-op (no-operation) instruction.

 This is a type of instruction that is fetched from memory but has no operation, thus
wasting a clock cycle. This concept of delaying the use of the data loaded from memory
is referred to as delayed load.

 Figure 9(b) shows the same program with a no-op instruction inserted after the load to
R2 instruction. The data is loaded into R2 in clock cycle 4.
 The add instruction uses the value of R2 in step 5.

 Thus the no-op instruction is used to advance one clock cycle in order to compensate
for the data conflict in the pipeline. (Note that no operation is performed in segment A
during clock cycle 4 or segment E during clock cycle 5.)

 The advantage of the delayed load approach is that the data dependency is taken care
of by the compiler rather than the hardware.

 This results in a simpler hardware segment since the segment does not have to check if
the content of the register being accessed is currently valid or not.
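The compiler pass that implements delayed load can be sketched as a scan over the instruction list (the tuple encoding of instructions is a made-up convenience for this illustration):

```python
def insert_noops(program):
    """program: list of (op, dest, sources) tuples.  Insert a no-op
    after any LOAD whose destination register is read by the very next
    instruction, so the dependent use is delayed by one clock."""
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        if op == "LOAD" and i + 1 < len(program):
            _, _, next_srcs = program[i + 1]
            if dest in next_srcs:               # load-use data conflict
                out.append(("NOP", None, ()))   # delay the dependent use
    return out

# The four-instruction program from the example above.
prog = [
    ("LOAD",  "R1", ("M1",)),
    ("LOAD",  "R2", ("M2",)),
    ("ADD",   "R3", ("R1", "R2")),
    ("STORE", "M3", ("R3",)),
]
for ins in insert_noops(prog):
    print(ins)
# A NOP is inserted between the second LOAD and the ADD, as in Fig. 9(b).
```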
Delayed Branch
A branch instruction delays the pipeline operation until the instruction at the branch
address is fetched. With delayed branch, it is possible to improve pipeline performance by
automatically rearranging instructions within a program so that branch instructions take
effect later than originally specified, keeping the pipeline filled in the meantime.

Vector processing
There are computational problems that require a vast number of computations
that will take a conventional computer days or even weeks to complete. In many
science and engineering applications, the problems can be formulated in terms of
vectors and matrices that lend themselves to vector processing.
Computers with vector processing capabilities are in demand in specialized
applications.
Following are the application areas where vector processing is of the utmost importance:
long-range weather forecasting, petroleum exploration, seismic data analysis, medical
diagnosis, aerodynamics and space flight simulations, artificial intelligence and expert
systems, and image processing.
To achieve the required level of high performance, it is necessary to utilize the fastest and
most reliable hardware and to apply innovative procedures from vector and parallel
processing techniques.

Vector operations

A vector is an ordered set of data items stored as a one-dimensional array. A vector V of
length n is represented as a row vector by V = [V1, V2, V3, ..., Vn]. It may be represented
as a column vector if the data items are listed in a column.

The element Vi of vector V is written as V(I), where the index I refers to a memory address
or register where the number is stored.

To examine the difference between a scalar processor and a vector processor, consider the
following FORTRAN do loop:

      DO 20 I = 1, 100
 20   C(I) = A(I) + B(I)

This is a program for adding two vectors A and B of length 100 to produce a vector C. In
machine language it is implemented as a loop that reads a pair of operands from arrays A
and B, adds them, stores the result into array C, increments the index, and branches back
until all 100 elements are processed.
A computer capable of vector processing eliminates the overhead associated with the time
it takes to fetch and execute the instructions in the program loop. It allows the operations
to be specified with a single vector instruction of the form

      C(1:100) = A(1:100) + B(1:100)

Instruction format of a vector processor:

This is essentially a three-address instruction with three fields specifying the base
addresses of the operands and an additional field that gives the length of the data items in
the vector.
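The three-address vector instruction can be sketched as a record with three base addresses and a length field (the field names, opcode, and tiny memory model are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class VectorInstruction:
    opcode: str        # operation, e.g. vector add
    base_src1: int     # base address of first operand vector
    base_src2: int     # base address of second operand vector
    base_dest: int     # base address of destination vector
    length: int        # number of elements to process

def execute(ins, memory):
    """Interpret one vector instruction: a single instruction drives
    the whole element-by-element loop, with no per-element instruction
    fetch and decode overhead."""
    if ins.opcode == "VADD":
        for i in range(ins.length):
            memory[ins.base_dest + i] = (memory[ins.base_src1 + i]
                                         + memory[ins.base_src2 + i])

mem = {100 + i: i + 1 for i in range(5)}       # A = 1..5 at base 100
mem.update({200 + i: 10 for i in range(5)})    # B = 10s at base 200
execute(VectorInstruction("VADD", 100, 200, 300, 5), mem)
print([mem[300 + i] for i in range(5)])  # [11, 12, 13, 14, 15]
```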
