CO UNIT - 5 - Parallel Processing
UNIT – V: Syllabus
Pipeline and Vector Processing: Parallel Processing, Pipelining, Arithmetic Pipeline, Instruction Pipeline, RISC Pipeline, Vector Processing, Array Processors
Parallel Processing
In a parallel processing system, the computer may have two or more ALUs and be able to execute two or more instructions at the same time. The system may also have two or more processors operating concurrently.
Flynn's classification divides computers into four major groups:
SISD (single instruction stream, single data stream)
SIMD (single instruction stream, multiple data stream)
MISD (multiple instruction stream, single data stream)
MIMD (multiple instruction stream, multiple data stream)
The following block diagram represents the combined operation as well as the suboperations performed in each segment of the pipeline.
Perhaps the simplest way of viewing the pipeline structure is to imagine that each
segment consists of an input register followed by a combinational circuit. The register
holds the data and the combinational circuit performs the suboperation in the particular
segment.
The output of the combinational circuit in a given segment is applied to the input register
of the next segment. A clock is applied to all registers after enough time has elapsed to
perform all segment activity. In this way the information flows through the pipeline one
step at a time.
The pipeline can be demonstrated with the combined multiply-and-add operation Ai * Bi + Ci for i = 1, 2, ..., 7. The suboperations performed in each segment are:

Segment 1: R1 ← Ai, R2 ← Bi (input Ai and Bi)
Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply and input Ci)
Segment 3: R5 ← R3 + R4 (add Ci to product)
The five registers are loaded with new data every clock pulse. The effect of each clock is shown in Table 1. The first clock pulse transfers A1 and B1 into R1 and R2.
The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 and R2. The third clock pulse operates on all three segments simultaneously.
It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock pulses to fill up the pipe and retrieve the first output from R5. From then on, each clock pulse produces a new output and moves the data one step down the pipeline.
This continues as long as new input data flow into the system. When no more input data are available, the clock must continue until the last output emerges from the pipeline.
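To make the clock-by-clock behavior concrete, here is a minimal Python sketch (the variable names and input values are illustrative, not from the text) that simulates the three-segment pipeline computing Ai * Bi + Ci. Each loop iteration plays the role of one clock pulse; the later segments are updated first so that all registers appear to be clocked simultaneously.

```python
# Minimal sketch (illustrative, not from the text): simulate the
# three-segment pipeline computing Ai*Bi + Ci for seven input pairs.
A = [1, 2, 3, 4, 5, 6, 7]
B = [8, 7, 6, 5, 4, 3, 2]
C = [9, 8, 7, 6, 5, 4, 3]

R1 = R2 = R3 = R4 = R5 = None        # the five pipeline registers
results = []

# One iteration = one clock pulse. Segment 3 is updated first, then
# segment 2, then segment 1, so every register reads its old input.
for clock in range(len(A) + 2):      # n inputs need n + 2 pulses to drain
    if R3 is not None:               # segment 3: add Ci to product
        R5 = R3 + R4
        results.append(R5)
    if R1 is not None:               # segment 2: multiply and input Ci
        R3, R4 = R1 * R2, C[clock - 1]
    else:
        R3 = R4 = None
    if clock < len(A):               # segment 1: input Ai and Bi
        R1, R2 = A[clock], B[clock]
    else:
        R1 = R2 = None

print(results)                       # the seven values Ai*Bi + Ci
```

The first result appears after three clock pulses (the fill time of the pipe); after that one result emerges per pulse, matching the behavior described above.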
In general, the pipeline organization is applicable to two areas of computer design:
Arithmetic Pipeline
Instruction Pipeline
Arithmetic Pipeline
Arithmetic Pipelines are mostly used in high-speed computers. They are used to
implement floating-point operations, multiplication of fixed-point numbers, and
similar computations encountered in scientific problems.
The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers defined as:

X = A × 2^a = 0.9504 × 10^3
Y = B × 2^b = 0.8200 × 10^2

(Decimal numbers are used in the example for simplicity of illustration.)
The combined operation of floating-point addition and subtraction is divided into four segments. Each segment contains the corresponding suboperation to be performed in the given pipeline. The suboperations performed in the four segments are:

1. Compare the exponents:
The larger exponent, 3, is chosen as the exponent of the result.

2. Align the mantissas:
The mantissa of Y is shifted right to equalize the exponents:
X = 0.9504 × 10^3
Y = 0.0820 × 10^3

3. Add the mantissas:
The two mantissas are added in segment three.
Z = X + Y = 1.0324 × 10^3

4. Normalize the result:
The sum is adjusted so that the mantissa is a fraction less than 1:
Z = 0.10324 × 10^4
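The same four suboperations can be sketched in Python to check the worked example. This is a simplified, illustrative model using base-10 (mantissa, exponent) pairs rather than the binary hardware the text describes.

```python
# Illustrative sketch (not from the text): the four pipeline
# suboperations for floating-point addition, modeled on base-10
# (mantissa, exponent) pairs to mirror the worked example.

def fp_add(x, y):
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare exponents; the larger is chosen for the result.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    # Segment 2: align mantissas by shifting the smaller number right.
    mb = mb / 10 ** (ea - eb)            # 0.8200*10^2 -> 0.0820*10^3
    # Segment 3: add the mantissas.
    mz, ez = ma + mb, ea                 # 1.0324 * 10^3
    # Segment 4: normalize so the mantissa is a fraction less than 1.
    while abs(mz) >= 1.0:
        mz, ez = mz / 10, ez + 1         # 0.10324 * 10^4
    return mz, ez

print(fp_add((0.9504, 3), (0.8200, 2)))  # ~ (0.10324, 4)
```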
Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the
instruction stream as well.
Each step is executed in a particular segment, and there are times when
different segments may take different times to operate on the incoming
information. Moreover, there are times when two or more segments may
require memory access at the same time, causing one segment to wait
until another is finished with the memory.
Segment 1:
The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2:
The instruction fetched from memory is decoded in the second segment and the effective address is calculated.
Segment 3:
An operand is fetched from memory in the third segment.
Segment 4:
The instructions are finally executed in the last segment of the pipeline organization.
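As an illustration, a short Python sketch (segment abbreviations FI, DA, FO, EX follow the four segments just listed) can print the space-time diagram in which a new instruction enters the pipeline each clock cycle:

```python
# Illustrative sketch: space-time diagram for the four-segment
# instruction pipeline, one instruction entering per clock cycle.
SEGMENTS = ["FI", "DA", "FO", "EX"]

def timing(n_instructions):
    # row i = instruction i; each column is one clock cycle
    for i in range(n_instructions):
        row = ["  "] * i + SEGMENTS
        print(f"instr {i + 1}: " + " ".join(row))

timing(4)
# instr 1: FI DA FO EX
# instr 2:    FI DA FO EX
# instr 3:       FI DA FO EX
# instr 4:          FI DA FO EX
```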
Data Dependency
A collision occurs when an instruction cannot proceed because previous instructions did
not complete certain operations.
A data dependency occurs when an instruction needs data that are not yet available.
For example, an instruction in the FO segment may need to fetch an operand that is
being generated at the same time by the previous instruction in segment EX.
Therefore, the second instruction must wait until the first instruction makes its data available. Similarly, an address dependency may occur when an operand address cannot be calculated because the information needed by the addressing mode is not available.
For example, an instruction with register indirect mode cannot proceed to fetch the
operand if the previous instruction is loading the address into the register.
Therefore, the operand access to memory must be delayed until the required address is available. Pipelined computers deal with such data dependency conflicts in a variety of ways.
The most straightforward method is to insert hardware interlocks. An interlock is a
circuit that detects instructions whose source operands are destinations of instructions
farther up in the pipeline.
Detection of this situation causes the instruction whose source is not available to be
delayed by enough clock cycles to resolve the conflict. This approach maintains the
program sequence by using hardware to insert the required delays.
Another technique called operand forwarding uses special hardware to detect a conflict
and then avoid it by routing the data through special paths between pipeline segments.
For example, instead of transferring an ALU result into a destination register, the
hardware checks the destination operand, and if it is needed as a source in the next
instruction, it passes the result directly into the ALU input, bypassing the register.
This method requires additional hardware paths through multiplexers as well as the
circuit that detects the conflict.
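A minimal sketch of the detection logic is shown below, assuming a toy instruction format of (opcode, destination, source1, source2) tuples that is not from the text:

```python
# Illustrative sketch (toy instruction format, not a real ISA):
# detect a read-after-write conflict between adjacent instructions,
# the situation the forwarding hardware must recognize.
def needs_forwarding(producer, consumer):
    """Instructions are (opcode, dest, src1, src2) tuples."""
    _, dest, _, _ = producer
    _, _, src1, src2 = consumer
    return dest in (src1, src2)

prog = [("ADD", "R3", "R1", "R2"),   # R3 is written here ...
        ("SUB", "R5", "R3", "R4")]   # ... and read immediately after

if needs_forwarding(prog[0], prog[1]):
    # route the ALU result straight to the ALU input, bypassing R3
    print("forward result of", prog[0][0], "to input of", prog[1][0])
```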
A procedure employed in some computers is to give the responsibility for solving data conflict problems to the compiler that translates the high-level programming language into a machine language program.
The compiler for such computers is designed to detect a data conflict and reorder the instructions as necessary to delay the loading of the conflicting data by inserting no-operation instructions. This method is referred to as delayed load.
Handling of Branch Instructions
One way a pipelined computer handles branch instructions is with a branch target buffer (BTB). The BTB is an associative memory included in the fetch segment of the pipeline. Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for that branch.
It also stores the next few instructions after the branch target instruction. When the
pipeline decodes a branch instruction, it searches the associative memory BTB for the
address of the instruction.
If it is in the BTB, the instruction is available directly and prefetch continues from the
new path.
If the instruction is not in the BTB, the pipeline shifts to a new instruction stream and
stores the target instruction in the BTB. The advantage of this scheme is that branch
instructions that have occurred previously are readily available in the pipeline without
interruption. A variation of the BTB is the loop buffer.
This is a small, very high speed register file maintained by the instruction fetch segment of the pipeline. When a program loop is detected in the program, it is stored in the loop buffer in its entirety, including all branches.
The program loop can be executed directly without having to access memory until the
loop mode is removed by the final branching out. Another procedure that some
computers use is branch prediction . A pipeline with branch prediction uses some
additional logic to guess the outcome of a conditional branch instruction before it is
executed.
The pipeline then begins prefetching the instruction stream from the predicted path. A correct prediction eliminates the wasted time caused by branch penalties. A procedure employed in most RISC processors is the delayed branch.
In this procedure, the compiler detects the branch instructions and rearranges the
machine language code sequence by inserting useful instructions that keep the pipeline
operating without interruptions.
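The BTB behavior described above can be sketched as an associative lookup. This illustrative Python model (the function names are hypothetical) stores only the target address, whereas a real BTB also holds the target instruction and a few of its successors:

```python
# Illustrative sketch of a branch target buffer as an associative
# lookup: branch address -> branch target address.
btb = {}

def fetch_after_branch(branch_addr, compute_target):
    if branch_addr in btb:                  # hit: prefetch continues
        return btb[branch_addr]             # from the stored target
    target = compute_target(branch_addr)    # miss: resolve the branch...
    btb[branch_addr] = target               # ...and remember it
    return target

# the first encounter misses; later encounters hit without interruption
print(fetch_after_branch(0x100, lambda a: 0x200))
print(fetch_after_branch(0x100, lambda a: 0x200))
```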
RISC Pipeline
Among the characteristics attributed to RISC is its ability to use an efficient instruction pipeline.
The simplicity of the instruction set can be utilized to implement an instruction pipeline
using a small number of suboperations, with each being executed in one clock cycle.
Because of the fixed-length instruction format, the decoding of the operation can occur at the same time as the register selection. All data manipulation instructions have register-to-register operations.
Since all operands are in registers, there is no need for calculating an effective address
or fetching of operands from memory. Therefore, the instruction pipeline can be
implemented with two or three segments. One segment fetches the instruction from
program memory, and the other segment executes the instruction in the ALU.
A third segment may be used to store the result of the ALU operation in a destination
register. The data transfer instructions in RISC are limited to load and store instructions.
These instructions use register indirect addressing.
They usually need three or four stages in the pipeline. To prevent conflicts between a
memory access to fetch an instruction and to load or store an operand, most RISC
machines use two separate buses with two memories: one for storing the instructions
and the other for storing the data. The two memories can sometimes operate at the same speed as the CPU clock and are referred to as cache memories.
One of the major advantages of RISC is its ability to execute instructions at the rate of
one per clock cycle. It is not possible to expect that every instruction be fetched from
memory and executed in one clock cycle.
What is done, in effect, is to start each instruction with each clock cycle and to pipeline
the processor to achieve the goal of single-cycle instruction execution.
The advantage of RISC over CISC (complex instruction set computer) is that RISC can achieve pipeline segments requiring just one clock cycle, while CISC uses many segments in its pipeline, with the longest segment requiring two or more clock cycles.
Another characteristic of RISC is the support given by the compiler that translates the high-level language program into a machine language program.
Instead of designing hardware to handle the difficulties associated with data conflicts
and branch penalties, RISC processors rely on the efficiency of the compiler to detect
and minimize the delays encountered with these problems.
The data manipulation instructions operate on data in processor registers. The data
transfer instructions are load and store instructions that use an effective address
obtained from the addition of the contents of two registers or a register and a
displacement constant provided in the instruction.
The program control instructions use register values and a constant to evaluate the
branch address, which is transferred to a register or the program counter PC.
Now consider the hardware operation for such a computer. The control section fetches
the instruction from program memory into an instruction register. The instruction is
decoded at the same time that the registers needed for the execution of the instruction
are selected.
The processor unit consists of a number of registers and an arithmetic logic unit (ALU)
that performs the necessary arithmetic, logic, and shift operations. A data memory is
used to load or store the data from a selected register in the register file. The instruction
cycle can be divided into three suboperations and implemented in three segments:
I: Instruction fetch
A: ALU operation
E: Execute instruction
The I segment fetches the instruction from program memory. The instruction is decoded
and an ALU operation is performed in the A segment.
The ALU is used for three different functions, depending on the decoded instruction. It
performs an operation for a data manipulation instruction, it evaluates the effective
address for a load or store instruction, or it calculates the branch address for a program
control instruction.
The E segment directs the output of the ALU to one of three destinations, depending on
the decoded instruction. It transfers the result of the ALU operation into a destination
register in the register file, it transfers the effective address to a data memory for loading
or storing, or it transfers the branch address to the program counter.
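The three uses of the ALU in the A segment and the three destinations in the E segment can be sketched as follows. This is a simplified, illustrative model; the instruction encodings are invented for the example:

```python
# Illustrative sketch (invented encodings): the three uses of the ALU
# in the A segment and the three destinations in the E segment.
def a_segment(instr, regs):
    op, *f = instr
    if op in ("ADD", "SUB"):             # data manipulation: ALU op
        a, b = regs[f[1]], regs[f[2]]
        return a + b if op == "ADD" else a - b
    if op in ("LOAD", "STORE"):          # effective address: reg + offset
        return regs[f[1]] + f[2]
    if op == "BRANCH":                   # branch address: reg + constant
        return regs[f[0]] + f[1]

def e_segment(instr, regs, mem, result):
    op, *f = instr
    if op in ("ADD", "SUB"):
        regs[f[0]] = result              # result into destination register
    elif op == "LOAD":
        regs[f[0]] = mem[result]         # effective address to data memory
    elif op == "STORE":
        mem[result] = regs[f[0]]
    elif op == "BRANCH":
        regs["PC"] = result              # branch address to program counter

regs, mem = {"R1": 5, "R2": 7, "PC": 0}, {}
instr = ("ADD", "R3", "R1", "R2")
e_segment(instr, regs, mem, a_segment(instr, regs))
print(regs["R3"])                        # 12
```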
Delayed Load
Consider the operation of the following four instructions:
1. LOAD: R1 ← M [address 1]
2. LOAD: R2 ← M [address 2]
3. ADD: R3 ← R1 + R2
4. STORE: M[address 3] ← R3
If the three-segment pipeline proceeds without interruptions, there will be a data conflict in instruction 3 because the operand in R2 is not yet available in the A segment. This can be seen from the timing of the pipeline shown in Fig. 9(a). The E segment in clock cycle 4 is in the process of placing the memory data into R2.
The A segment in clock cycle 4 is using the data from R2, but the value in R2 will not be the correct value since it has not yet been transferred from memory.
It is up to the compiler to make sure that the instruction following the load instruction uses the data fetched from memory. If the compiler cannot find a useful instruction to put after the load, it inserts a no-op (no-operation) instruction. This is a type of instruction that is fetched from memory but has no operation, thus wasting a clock cycle. This concept of delaying the use of the data loaded from memory is referred to as delayed load.
Figure 9(b) shows the same program with a no-op instruction inserted after the load to
R2 instruction. The data is loaded into R2 in clock cycle 4.
The add instruction uses the value of R2 in step 5.
Thus the no-op instruction is used to advance one clock cycle in order to compensate
for the data conflict in the pipeline. (Note that no operation is performed in segment A
during clock cycle 4 or segment E during clock cycle 5.)
The advantage of the delayed load approach is that the data dependency is taken care
of by the compiler rather than the hardware.
This results in a simpler hardware segment since the segment does not have to check if
the content of the register being accessed is currently valid or not.
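A compiler-side check for this conflict can be sketched in a few lines of Python (toy instruction tuples, not a real instruction set); it inserts a NOP whenever an instruction reads a register that its immediate predecessor loads:

```python
# Illustrative sketch of the delayed-load fix for the four-instruction
# program above: insert a NOP when an instruction reads a register
# loaded by the instruction just before it.
def insert_delayed_load_nops(program):
    fixed = []
    for i, (op, dest, *srcs) in enumerate(program):
        prev = program[i - 1] if i > 0 else None
        if prev and prev[0] == "LOAD" and prev[1] in srcs:
            fixed.append(("NOP",))       # fill the load delay slot
        fixed.append((op, dest, *srcs))
    return fixed

prog = [("LOAD", "R1", "addr1"),
        ("LOAD", "R2", "addr2"),
        ("ADD",  "R3", "R1", "R2"),      # conflicts with the load to R2
        ("STORE", "addr3", "R3")]
for instr in insert_delayed_load_nops(prog):
    print(instr)                         # NOP appears before the ADD
```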
Delayed Branch
A branch instruction delays the pipeline operations until the instruction at the branch
address is fetched.
With delayed branch, it is possible to improve pipeline performance by automatically rearranging the instructions within a program so that branch instructions occur later than actually desired.
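As an illustration of the rearrangement, here is a deliberately simplified sketch that moves the instruction preceding a branch into the delay slot. A real compiler must first prove that the branch does not depend on the moved instruction; that dependence analysis is omitted here:

```python
# Deliberately simplified sketch: hoist the instruction preceding a
# branch into the branch delay slot (dependence analysis omitted).
def fill_delay_slot(program):
    out = list(program)
    for i, instr in enumerate(out):
        if instr[0] == "BRANCH" and i > 0:
            out[i - 1], out[i] = out[i], out[i - 1]   # swap into the slot
    return out

prog = [("ADD", "R1", "R2", "R3"),
        ("BRANCH", "X")]
print(fill_delay_slot(prog))
# [('BRANCH', 'X'), ('ADD', 'R1', 'R2', 'R3')] -- ADD fills the delay slot
```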
Vector Processing
There are computational problems that require a vast number of computations
that will take a conventional computer days or even weeks to complete. In many
science and engineering applications, the problems can be formulated in terms of
vectors and matrices that lend themselves to vector processing.
Computers with vector processing capabilities are in demand in specialized
applications.
Following are the application areas where vector processing is of the utmost importance:
Long-range weather forecasting
Petroleum explorations
Seismic data analysis
Medical diagnosis
Aerodynamics and space flight simulations
Artificial intelligence and expert systems
Mapping the human genome
Image processing
To achieve the required level of high performance it is necessary to utilize the fastest and most
reliable hardware and apply innovative procedures from vector and parallel processing techniques.
Vector Operations
A vector is an ordered, one-dimensional array of data items. A vector V of length n is represented as a row vector by V = [V1, V2, V3, ..., Vn]. It may be represented as a column vector if the data items are listed in a column.
The element Vi of vector V is written as V(I) and the index I refers to a memory address or register
where the number is stored.
To examine the difference between a scalar processor and a vector processor, consider the following Fortran DO loop:

DO 20 I = 1, 100
20 C(I) = A(I) + B(I)

This is a program for adding two vectors A and B of length 100 to produce a vector C. On a scalar processor this is implemented in machine language by the following sequence of operations:

Initialize I = 0
20 Read A(I)
Read B(I)
Store C(I) = A(I) + B(I)
Increment I = I + 1
If I ≤ 100 go to 20
A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop. It allows operations to be specified with a single vector instruction of the form

C(1:100) = A(1:100) + B(1:100)

This is essentially a three-address instruction with three fields specifying the base addresses of the operands and an additional field that gives the length of the data items in the vectors.
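The contrast between the scalar loop and the single vector instruction can be illustrated in Python, with NumPy's elementwise addition standing in for the vector hardware:

```python
# Illustrative sketch: scalar loop vs. the single vector operation
# C(1:100) = A(1:100) + B(1:100).
import numpy as np

A = np.arange(100, dtype=float)
B = np.arange(100, dtype=float)

# scalar processor: one fetch-decode-execute round per element
C_scalar = np.empty(100)
for i in range(100):
    C_scalar[i] = A[i] + B[i]

# vector processor: one instruction specifies base addresses and length
C_vector = A + B

assert (C_scalar == C_vector).all()
```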