COA Unit 5
Parallel Processing
Introduction
Parallel processing can be described as a class of techniques that enables a system to perform simultaneous data-processing tasks in order to increase the computational speed of a computer system.
A parallel processing system can carry out simultaneous data-processing to achieve
faster execution time.
The primary purpose of parallel processing is to enhance the computer processing
capability and increase its throughput, i.e. the amount of processing that can be
accomplished during a given interval of time.
A parallel processing system can be achieved by having a multiplicity of functional
units that perform identical or different operations simultaneously.
The data can be distributed among the multiple functional units.
The following diagram shows one possible way of separating the execution unit into eight functional units operating in parallel; the operation performed in each functional unit is indicated in each block of the diagram:
o The adder and integer multiplier perform arithmetic operations on integer numbers.
o The floating-point operations are separated into three circuits operating in parallel.
o The logic, shift, and increment operations can be performed concurrently on different
data.
o All units are independent of each other.
Parallel processing can be classified in a variety of ways. It can be considered from the:
o Internal organization of the processors
o Interconnection structure between processors
o The flow of information through the system
One classification introduced by M. J. Flynn considers the organization of a computer
system by the number of instructions and data items that are manipulated
simultaneously.
The normal operation of a computer is to fetch instructions from memory and execute
them in the processor.
The sequence of instructions read from memory constitutes an instruction stream.
The operations performed on the data in the processor constitute a data stream.
Parallel processing may occur in the instruction stream, in the data stream, or in both.
Flynn’s classification divides computers into four major groups as follows:
1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data stream (SIMD)
3. Multiple instruction stream, single data stream (MISD)
4. Multiple instruction stream, multiple data stream (MIMD)
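The difference between the first two categories can be sketched in a few lines of Python, with numpy's vectorized operations standing in for a single instruction applied to a whole data stream; the array contents are illustrative:

import numpy as np

a = np.arange(8)
b = np.arange(8)

# SISD: one instruction stream operates on one data item at a time
c_sisd = np.empty_like(a)
for i in range(len(a)):
    c_sisd[i] = a[i] + b[i]

# SIMD: one instruction is applied to many data items simultaneously
c_simd = a + b

assert (c_sisd == c_simd).all()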
3. Multiple Instruction stream, Single Data stream (MISD)
• Here there are n processor units, each receiving distinct instructions operating over the
same data stream.
• The results of one processor become the input of the next processor in the macro pipe.
• This structure has received less attention and has been challenged as impractical in some applications.
Advantages of Parallel Processing
1. It saves time and money, as many resources working together reduce the time and cut potential costs.
2. Larger problems that are impractical to solve with serial computing can be solved.
3. It can take advantage of non-local resources when the local resources are finite.
Examples of Pipelining
Example:
To perform the combined multiply and add operations with a stream of numbers
Ai * Bi + Ci for i = 1, 2, 3, …, 7
each sub operation is to be implemented in a segment within a pipeline:
Segment 1: R1 ← Ai, R2 ← Bi (input Ai and Bi)
Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply and input Ci)
Segment 3: R5 ← R3 + R4 (add Ci to the product)
The five registers are loaded with new data every clock pulse.
The effect of each clock is shown in Table 9-1.
The first clock pulse transfers A1 and B1 into R1 and R2.
The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4.
The same clock pulse transfers A2 and B2 into R1 and R2.
The third clock pulse operates on all three segments simultaneously.
It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3,
transfers C2 into R4, and places the sum of R3 and R4 into R5.
It takes three clock pulses to fill up the pipe and retrieve the first output from R5.
From there on, each clock produces a new output and moves the data one step down
the pipeline.
Each segment has one or two registers and a combinational circuit as shown in Fig. 9-2.
If the above task is executed without pipelining, each operation passes through all three segments one after another, taking 3 cycles, so the 7 operations need 7 * 3 = 21 CPU cycles in total.
Using the pipeline, the task can be executed in k + n - 1 = 3 + 7 - 1 = 9 cycles.
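The clock-by-clock behavior summarized in Table 9-1 can be reproduced with a short simulation; the operand values below are stand-ins, and the register names follow the text:

A = [1, 2, 3, 4, 5, 6, 7]         # stand-in values for A1..A7
B = [10, 20, 30, 40, 50, 60, 70]  # stand-in values for B1..B7
C = [9, 9, 9, 9, 9, 9, 9]         # stand-in values for C1..C7
n, k = len(A), 3                  # 7 tasks, 3 segments

R1 = R2 = R3 = R4 = R5 = None
for clock in range(1, n + k):     # k + n - 1 = 9 clock pulses
    # All transfers happen on the same clock edge, so the new register
    # values are computed from the old ones before anything is updated.
    new_R5 = R3 + R4 if R3 is not None else None             # segment 3
    new_R3 = R1 * R2 if R1 is not None else None             # segment 2
    new_R4 = C[clock - 2] if 2 <= clock <= n + 1 else None   # segment 2
    new_R1 = A[clock - 1] if clock <= n else None            # segment 1
    new_R2 = B[clock - 1] if clock <= n else None            # segment 1
    R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
    print(f"clock {clock}: R1={R1} R2={R2} R3={R3} R4={R4} R5={R5}")

The first output in R5 appears at clock 3, and a new result emerges on every clock thereafter, matching the description above.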
Principles of Pipelining
General considerations
Any operation that can be decomposed into a sequence of sub operations of about the
same complexity can be implemented by a pipeline processor.
The general structure of a four-segment pipeline is illustrated in Fig. 9-3.
The behavior of a pipeline can be illustrated with a space-time diagram.
The space-time diagram of a four-segment pipeline is demonstrated in Fig. 9-4.
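The space-time diagram leads directly to the standard pipeline timing formulas: with k segments and n tasks, the pipeline needs k + n - 1 clock cycles, against n * k cycles for an equivalent nonpipelined unit. A minimal sketch, with the segment and task counts chosen to match a four-segment diagram with six tasks:

def pipeline_cycles(k, n):
    """Clock cycles for n tasks in a k-segment pipeline."""
    return k + n - 1

def nonpipeline_cycles(k, n):
    """Clock cycles when each task runs through all k steps serially."""
    return n * k

k, n = 4, 6
print(pipeline_cycles(k, n))                              # 9
print(nonpipeline_cycles(k, n))                           # 24
print(nonpipeline_cycles(k, n) / pipeline_cycles(k, n))   # speedup ~2.67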
Arithmetic Pipeline
An arithmetic pipeline divides an arithmetic problem into various sub problems for
execution in various pipeline segments.
Arithmetic Pipelines are mostly used in high-speed computers.
They are used to implement floating-point operations, multiplication of fixed-point
numbers, and similar computations encountered in scientific problems.
To understand the concepts of arithmetic pipeline, let us consider an example of a
pipeline unit for floating-point addition and subtraction with two normalized floating-
point binary numbers defined as:
X = A * 2^a = 0.9504 * 10^3
Y = B * 2^b = 0.8200 * 10^2
where A and B are two fractions that represent the mantissas, and a and b are the exponents (decimal fractions and a base of 10 are used here for clarity).
The combined operation of floating-point addition and subtraction is divided into four
segments.
Each segment contains the corresponding sub-operation to be performed in the given
pipeline.
The sub-operations that are shown in the four segments are:
1. Compare the exponents by subtraction.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
The flowchart of the arithmetic pipeline for floating-point addition is shown in the diagram.
Figure: Flowchart of Arithmetic Pipeline
For the two numbers above, the four segments produce:
1. Compare exponents: 3 - 2 = 1.
2. Align mantissas: Y becomes 0.0820 * 10^3.
3. Add mantissas: Z = X + Y = 1.0324 * 10^3.
4. Normalize the result: Z = 0.10324 * 10^4.
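The four suboperations can be traced with a small sketch that mirrors the example above; decimal fractions and base 10 are used for clarity, and the function name and structure are ours, not the pipeline hardware:

def pipelined_float_add(a_mant, a_exp, b_mant, b_exp):
    # Segment 1: compare the exponents by subtraction
    diff = a_exp - b_exp
    # Segment 2: align the mantissa with the smaller exponent
    if diff >= 0:
        b_mant /= 10 ** diff
        b_exp = a_exp
    else:
        a_mant /= 10 ** (-diff)
        a_exp = b_exp
    # Segment 3: add the mantissas
    z_mant, z_exp = a_mant + b_mant, a_exp
    # Segment 4: normalize the result so the mantissa is a fraction
    while abs(z_mant) >= 1.0:
        z_mant /= 10
        z_exp += 1
    return round(z_mant, 10), z_exp

print(pipelined_float_add(0.9504, 3, 0.8200, 2))   # (0.10324, 4)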
Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream
as well.
Most digital computers with complex instructions require an instruction pipeline to carry out operations like fetching, decoding, and executing instructions.
In this technique, a stream of instructions is executed by overlapping the fetch, decode, and execute phases of the instruction cycle.
This technique is used to increase the throughput of the computer system.
An instruction pipeline reads instructions from memory while previous instructions are being executed in other segments of the pipeline.
Thus we can execute multiple instructions simultaneously.
The pipeline will be more efficient if the instruction cycle is divided into segments of
equal duration.
In general, the computer needs to process each instruction with the following sequence
of steps:
1. Fetch instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the results.
The flowchart for instruction pipeline is shown below.
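The overlap can be visualized with a short sketch. Here the six steps are grouped into four segments, a common arrangement for this pipeline (FI = fetch instruction, DA = decode and calculate effective address, FO = fetch operand, EX = execute and store the result); the grouping and instruction count are illustrative:

stages = ["FI", "DA", "FO", "EX"]
n_instructions = 5

for i in range(n_instructions):
    # instruction i occupies segment s during clock cycle i + s + 1
    row = ["--"] * (n_instructions + len(stages) - 1)
    for s, name in enumerate(stages):
        row[i + s] = name
    print(f"instr {i + 1}: " + " ".join(row))

Each printed row is shifted one clock to the right of the previous one, so at steady state all four segments are busy with different instructions.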
RISC Pipeline
Among the characteristics attributed to RISC is its ability to use an efficient
instruction pipeline.
The simplicity of the instruction set can be utilized to implement an instruction
pipeline using a small number of sub operations, with each being executed in one clock
cycle.
Because of the fixed-length instruction format, the decoding of the operation can occur at the same time as the register selection.
All data manipulation instructions have register-to-register operations.
Since all operands are in registers, there is no need for calculating an effective address
or fetching of operands from memory.
Therefore, the instruction pipeline can be implemented with two or three segments.
One segment fetches the instruction from program memory, and the other segment
executes the instruction in the ALU.
A third segment may be used to store the result of the ALU operation in a destination
register.
The data transfer instructions in RISC are limited to load and store instructions.
1. These instructions use register indirect addressing. They usually need three or
four stages in the pipeline.
2. To prevent conflicts between a memory access to fetch an instruction and to
load or store an operand, most RISC machines use two separate buses with two
memories.
3. Cache memory is used that operates at the same speed as the CPU clock.
One of the major advantages of RISC is its ability to execute instructions at the rate of
one per clock cycle.
1. In effect, the goal is to start each instruction with each clock cycle and to pipeline the processor so as to achieve single-cycle instruction execution.
2. RISC can achieve pipeline segments that require just one clock cycle each.
Consider the hardware operation for such a computer.
The control section fetches the instruction from program memory into an instruction
register.
The instruction is decoded at the same time that the registers needed for the execution of the instruction are selected.
The processor unit consists of a number of registers and an arithmetic logic unit
(ALU).
A data memory is used to load or store the data from a selected register in the register
file.
The instruction cycle can be divided into three sub operations and implemented in
three segments:
1. I: Instruction fetch
Fetches the instruction from program memory.
2. A: ALU operation
The instruction is decoded and an ALU operation is performed: it performs an operation for a data manipulation instruction, evaluates the effective address for a load or store instruction, or calculates the branch address for a program control instruction.
3. E: Execute instruction
Directs the output of the ALU to one of three destinations, depending on the decoded instruction: it transfers the result of the ALU operation into a destination register in the register file, transfers the effective address to a data memory for loading or storing, or transfers the branch address to the program counter.
Delayed Load
Consider a load instruction immediately followed by an instruction that uses the register being loaded. The loaded value is not yet available when the following instruction reaches the ALU stage, which creates a data conflict; handling this conflict by having the compiler insert a no-op instruction is referred to as delayed load.
Figure 9-9(b) shows the same program with a no-op instruction inserted after the load to R2 instruction.
The data is loaded into R2 in clock cycle 4.
The add instruction uses the value of R2 in step 5.
Thus the no-op instruction is used to advance one clock cycle in order to compensate
for the data conflict in the pipeline.
The advantage of the delayed load approach is that the data dependency is taken care
of by the compiler rather than the hardware.
This results in a simpler hardware segment since the segment does not have to check if
the content of the register being accessed is currently valid or not.
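A sketch of the situation, with the three-segment timing written out; the instruction sequence mirrors the Figure 9-9 discussion, and the mnemonics are illustrative:

program = [
    "LOAD  R1 <- M[address 1]",
    "LOAD  R2 <- M[address 2]",
    "NOP",                       # inserted by the compiler (delayed load)
    "ADD   R3 <- R1 + R2",
    "STORE M[address 3] <- R3",
]
for i, instr in enumerate(program):
    # instruction at position i: I at clock i+1, A at i+2, E at i+3
    print(f"{instr:28s} I={i + 1}  A={i + 2}  E={i + 3}")
# R2 is loaded in the E stage at clock 4; thanks to the no-op, the ADD
# does not need R2 in its A stage until clock 5, so there is no conflict.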
Delayed Branch
The method used in most RISC processors is to rely on the compiler to redefine the
branches so that they take effect at the proper time in the pipeline.
This method is referred to as delayed branch.
The compiler is designed to analyze the instructions before and after the branch and
rearrange the program sequence by inserting useful instructions in the delay steps.
It is up to the compiler to find useful instructions to put after the branch instruction.
Failing that, the compiler can insert no-op instructions.
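A sketch of the idea, in the spirit of the text: instructions that do not affect the branch condition are moved into the delay slots after the branch, so they do useful work while the branch resolves (the program and the two-slot assumption are illustrative):

# Original order: the branch is last, so the cycles needed to resolve
# it would have to be filled with no-ops.
original = ["LOAD", "INCREMENT", "ADD", "SUBTRACT", "BRANCH TO X"]

# The compiler moves ADD and SUBTRACT, which the branch does not depend
# on, into the two delay slots following the branch:
rearranged = ["LOAD", "INCREMENT", "BRANCH TO X", "ADD", "SUBTRACT"]

# Both sequences compute the same result, but the rearranged one wastes
# no cycles: ADD and SUBTRACT execute while the branch takes effect.
print(rearranged)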
Characteristics of RISC
Simpler instructions, hence simple instruction decoding.
Instructions fit in one word.
Instructions take a single clock cycle to execute.
More general-purpose registers.
Simple addressing modes.
Fewer data types.
Pipelining can be easily achieved.
Advantages of RISC
Simpler instructions: RISC processors use a smaller set of simple instructions,
which makes them easier to decode and execute quickly. This results in faster
processing times.
Faster execution: Because RISC processors have a simpler instruction set, they can
execute instructions faster than CISC processors.
Lower power consumption: RISC processors consume less power than CISC
processors, making them ideal for portable devices.
Disadvantages of RISC
More instructions required: RISC processors require more instructions to perform
complex tasks than CISC processors.
Increased memory usage: RISC processors require more memory to store the
additional instructions needed to perform complex tasks.
Higher cost: Developing and manufacturing RISC processors can be more expensive
than CISC processors.
Characteristics of CISC
Complex instruction, hence complex instruction decoding.
Instructions are larger than one-word size.
Instruction may take more than a single clock cycle to get executed.
Fewer general-purpose registers, as operations are performed in memory itself.
Complex Addressing Modes.
More Data types.
Advantages of CISC
Reduced code size: CISC processors use complex instructions that can perform
multiple operations, reducing the amount of code needed to perform a task.
More memory efficient: Because CISC instructions are more complex, they require
fewer instructions to perform complex tasks, which can result in more memory-
efficient code.
Widely used: CISC processors have been in use for a longer time than RISC
processors, so they have a larger user base and more available software.
Disadvantages of CISC
Slower execution: CISC processors take longer to execute instructions because they
have more complex instructions and need more time to decode them.
More complex design: CISC processors have more complex instruction sets, which
makes them more difficult to design and manufacture.
Higher power consumption: CISC processors consume more power than RISC
processors because of their more complex instruction sets.
Key Difference between RISC and CISC Processor
• Instructions are easier to decode in RISC than in CISC, which has a more complex
decoding process.
• CISC instructions can operate on data in external memory, whereas RISC requires operands to be brought into registers first.
• While CISC only has a single register set, RISC has numerous register sets.
• The execution time per instruction of a RISC computer is very low compared to that of a CISC computer.
RISC architectures emphasize simpler instructions, shorter execution times, and
efficient use of registers.
CISC architectures offer complex instructions, potentially leading to fewer
instructions overall but with longer execution times.
RISC architectures often follow a load-store model, while CISC architectures allow
direct memory access instructions.
Example: RISC versus CISC
Consider the task of finding the product of two numbers, one stored in memory location 2:3 and the other in location 5:2, and storing the product back in location 2:3.
The execution unit can only operate on data that has been loaded into one of the six registers (A, B, C, D, E, or F).
A CISC processor would provide a single complex instruction, say "MULT 2:3, 5:2", whose microcode loads the two values into registers, multiplies them, and stores the result back in memory, all in one multi-cycle instruction.
RISC processors only use simple instructions that can be executed within one clock cycle.
Thus, the "MULT" command could be divided into three separate commands: "LOAD,"
which moves data from the memory bank to a register, "PROD," which finds the product of
two operands located within the registers, and "STORE," which moves data from a register
to the memory banks.
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
Here there are more lines of code, and more RAM is needed to store the assembly-level instructions.
The compiler must also perform more work to convert a high-level language statement into
code of this form.
The advantage of RISC is that because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle "MULT" command.
These RISC "reduced instructions" require fewer transistors and less hardware space than the complex instructions, leaving more room for general-purpose registers.
Because all of the instructions execute in a uniform amount of time (i.e. one clock),
pipelining is possible.
Separating the "LOAD" and "STORE" instructions actually reduces the amount of work
that the computer must perform.
CISC VERSUS RISC
CISC:
o Executes microcode instructions.
o Cannot have a large number of registers.
o Intel and AMD CPUs are based on CISC architectures.
o Mainly used in normal PCs, workstations, and servers.
o Direct addition between data in two memory locations is possible (e.g., 8085).
o The CISC approach minimizes the number of instructions per program and increases the number of cycles per instruction.
RISC:
o Has a number of hardwired instructions.
o Large number of registers, most of which can be used as general-purpose registers.
o Apple uses RISC chips.
o Mainly used for real-time applications.
o Direct addition between memory locations is not possible.
o The RISC approach maximizes the number of instructions per program and reduces the number of cycles per instruction.
Vector Processing
In many science and engineering applications, the problems can be formulated in
terms of vectors and matrices that lend themselves to vector processing.
Computers with vector processing capabilities are in demand in specialized applications, e.g.:
o Long-range weather forecasting
o Petroleum explorations
o Seismic data analysis
o Medical diagnosis
o Artificial intelligence and expert systems
o Image processing
o Mapping the human genome
To achieve the required level of high performance it is necessary to utilize the fastest
and most reliable hardware and apply innovative procedures from vector and parallel
processing techniques.
Vector operations
Matrix Multiplication
Consider the multiplication of two 3 x 3 matrices A and B. The first element of the product matrix is:
c11 = a11*b11 + a12*b21 + a13*b31
This requires three multiplications and (after initializing c11 to 0) three additions.
In general, an inner product consists of the sum of k product terms of the form
C = A1B1 + A2B2 + A3B3 + … + AkBk
In a typical application k may be equal to 100 or even 1000.
The inner product calculation on a pipeline vector processor is shown in Fig. 9-12. The floating-point multiplier and adder pipelines are assumed to have four segments each, so the products accumulate in four partial sums that interleave as follows:
C = A1B1 + A5B5 + A9B9 + A13B13
  + A2B2 + A6B6 + A10B10 + A14B14
  + A3B3 + A7B7 + A11B11 + A15B15
  + A4B4 + A8B8 + A12B12 + A16B16
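The interleaving can be checked with a few lines of Python; the function below only mimics the order in which a four-segment pipelined adder accumulates the products, not the hardware itself:

def interleaved_inner_product(A, B, segments=4):
    partial = [0.0] * segments
    for i, (a, b) in enumerate(zip(A, B)):
        partial[i % segments] += a * b   # product i joins sum i mod 4
    return sum(partial)                  # finally add the partial sums

A = [float(i) for i in range(1, 17)]     # 16 illustrative elements
B = [0.5] * 16
print(interleaved_inner_product(A, B))   # 68.0
print(sum(a * b for a, b in zip(A, B)))  # same result: 68.0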
Array Processors
Introduction
An array processor is a processor that performs computations on large arrays of
data.
The term is used to refer to two different types of processors.
o Attached array processor:
Is an auxiliary processor.
It is intended to improve the performance of the host computer in specific
numerical computation tasks.
o SIMD array processor:
Has a single-instruction multiple-data organization.
It manipulates vector instructions by means of multiple functional units
responding to a common instruction.
Attached Array Processor
The attached array processor is connected to the host computer through an input-output interface, and the computer treats it like an external interface.
The data for the attached processor are transferred from main memory to a local
memory through a high-speed bus.
The general-purpose computer without the attached processor serves the users that
need conventional data processing.
The system with the attached processor satisfies the needs for complex arithmetic
applications.
The array processor can be programmed by the user to accommodate a variety of
complex arithmetic problems.
The objective of the attached array processor is to provide vector manipulation capabilities to a conventional computer at a fraction of the cost of a supercomputer.
Example
• Consider the vector addition C=A+B
• The master control unit first stores the ith components ai and bi of A and B in the local memory Mi for i = 1, 2, 3, …, n.
• It then broadcasts the floating-point add instruction ci = ai + bi to all the PEs, causing the addition to take place simultaneously.
• The components ci of the result are stored in fixed locations in each local memory.
• This produces the desired vector sum in one add cycle.
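A sketch of the example in Python: the list comprehension stands in for the single broadcast add instruction that every processing element executes on its own local operands (the values are illustrative):

a = [1, 2, 3, 4, 5]        # a_i held in the local memory M_i of PE_i
b = [10, 20, 30, 40, 50]   # b_i held in the local memory M_i of PE_i

# one broadcast instruction: every PE computes c_i = a_i + b_i at once
c = [ai + bi for ai, bi in zip(a, b)]
print(c)                   # [11, 22, 33, 44, 55] in one "add cycle"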
Introduction to Multiprocessors
A multiprocessor is a set of processors connected by a communication network.
Fig. 5.2 Taxonomy of mono- and multiprocessor organizations
Characteristics of Multiprocessors
Improved system performance: the system derives high performance from the fact that computations can proceed in parallel in one of two ways:
a) Multiple independent jobs can be made to operate in parallel.
b) A single job can be partitioned into multiple parallel tasks.
Shared (Global) Memory
A Global Memory Space accessible by all processors
Processors may also have some local memory
Distributed (Local, Message-Passing) Memory
All memory units are associated with processors
To retrieve information from another processor's memory a message must be
sent there
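The two organizations can be contrasted with Python's multiprocessing module; Value plays the role of a global memory word reachable by all processors, and Queue plays the role of a message sent between otherwise-private memories (a sketch, not a hardware model):

from multiprocessing import Process, Queue, Value

def shared_writer(x):
    x.value = 42               # write directly into the global memory

def message_sender(q):
    q.put(42)                  # the value must be sent as a message

if __name__ == "__main__":
    x = Value("i", 0)          # shared (global) memory word
    p = Process(target=shared_writer, args=(x,))
    p.start(); p.join()
    print(x.value)             # 42: visible to every process directly

    q = Queue()                # channel between private memories
    p = Process(target=message_sender, args=(q,))
    p.start()
    print(q.get())             # 42: retrieved by receiving the message
    p.join()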
Interconnection Structures
A multiprocessor system consists of a set of components such as CPUs, memory modules, and I/O processors that communicate with each other.
The collection of paths connecting the various components is called the interconnection structure.
Five ways in which the different components can be interconnected:
a. Time-shared common bus
b. Multiport memory
c. Crossbar switch
d. Multistage switching network
e. Hypercube system
a. Time-shared common bus:
A time-shared common-bus multiprocessor consists of a number of processors connected through a common path to a memory unit.
Fig. 5.6 System bus structure for multiprocessor
Disadvantages:
Only one processor can communicate with the memory or another processor at
any given time.
As a consequence, the total overall transfer rate within the system is limited by
the speed of the single path
b. Multiport Memory:
• A multiport memory system employs separate buses between each memory module and each CPU.
• A processor bus comprises the address, data, and control lines necessary to communicate with memory.
• Each memory module connects to each processor bus.
• The memory module must have internal control logic to determine which port will have access to memory at any given time.
Fig. 5.7 Multiport memory
• Each memory module can be said to have four ports, and each port accommodates one of the buses.
• Memory access conflicts are resolved by assigning fixed priorities to each memory port.
• The priority for memory access associated with each processor is established by the physical port position.
• Thus CPU 1 has priority over CPU 2, CPU 2 has priority over CPU 3, and CPU 4 has the lowest priority.
Advantages:
The high transfer rate can be achieved because of the multiple paths.
Disadvantages:
It requires expensive memory control logic and a large number of cables
and connections
c. Crossbar switch:
A crossbar switch system consists of a number of crosspoints placed at the intersections between the memory-module and processor bus paths.
At each crosspoint, the small square represents a switch that determines the path from a processor to a memory module.
Each switch point has control logic to set up the transfer path between a memory module and a processor.
It examines the address placed on the bus to determine whether its particular module is being addressed.
In addition, it resolves multiple requests for access to the same memory module on a predetermined priority basis.
Fig. 5.8 a) Crossbar switch b) Block diagram of crossbar switch
Advantage:
- Supports simultaneous transfers from all memory modules
Disadvantage:
- The hardware required to implement the switch can become quite large and
complex.
e. Hypercube System:
The hypercube or binary n-cube multiprocessor structure is a loosely coupled system composed of N = 2^n processors interconnected in an n-dimensional binary cube.
Each processor forms a node of the cube.
Each node is assigned a binary address in such a manner, that the addresses of two
neighbors differ in exactly one bit position.
For example, the three neighbors of the node with address 100 are 000, 110, and 101
in a three-cube structure.
Each of these binary numbers differs from address 100 by one bit value.
Routing messages through an n-cube structure may take from one to n links from a
source node to a destination node.
• In a three-cube structure, node 000 may communicate with 011 (from 000 to 010 to
011 or from 000 to 001 to 011).
• A message must cross at least three links to communicate from node 000 to node 111.
• A routing procedure is designed by determining the exclusive-OR of the source node
address with the destination node address.
• The resulting binary value will have 1 bits corresponding to the axes on which the two
nodes differ.
• The message is then transmitted along any one of these axes.
Example:
• A message at node 010 going to node 001 produces an exclusive-OR of the two
addresses equal to 011 in a three-cube structure.
• The message can be transmitted along the second axis to node 000 and then through
the third axis to node 001.
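The routing procedure is easy to state in code. A minimal sketch follows; the choice of which differing axis to cross first is arbitrary, and is made here so that the printed path matches the example above:

def route(src, dst, n=3):
    """Return the node path, following the XOR procedure above."""
    path = [format(src, f"0{n}b")]
    node = src
    while node != dst:
        diff = node ^ dst                    # 1 bits mark differing axes
        axis = 1 << (diff.bit_length() - 1)  # choose one such axis
        node ^= axis                         # cross one link along it
        path.append(format(node, f"0{n}b"))
    return path

print(route(0b010, 0b001))   # ['010', '000', '001'], as in the example
print(route(0b000, 0b111))   # four nodes, i.e. three links, to reach 111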
Interprocessor Arbitration
The processor, main memory and I/O devices can be interconnected by means of a
common bus.
A bus provides a communication path for the transfer of data between their various
components.
The system bus that connects the CPU, the input-output processor, and memory (the main components of a computer) is a shared resource: only one of them can be granted use of the bus at any given time.
Hence, an appropriate priority resolving mechanism is required to decide which
processor should get control of the bus.
Therefore, a mechanism is needed to handle multiple requests for the bus, known as
Interprocessor Arbitration.
The arbitration procedure services all processor requests based on established
priorities.
Different types of arbitration: Static arbitration, Dynamic arbitration
Serial Arbitration
This technique, also known as daisy-chain arbitration, is based on a serial (daisy-chain) connection of the processors' bus arbiters, similar to the daisy-chain priority interrupt logic.
In this type of arbitration, bus-access priority is resolved by each processor's position in the serial connection.
The processors connected to the system bus are assigned priority according to their
position along the priority control line.
Figure: Serial (Daisy Chain) Arbitration
When multiple devices concurrently request the use of the bus, the device with the
highest priority is granted access to it.
Each processor has its own bus arbiter logic with priority-in and priority-out lines.
The priority out (PO) of each arbiter is connected to the priority in (PI) of the next-
lower-priority arbiter.
The PI of the highest-priority unit is maintained at a logic value 1.
The highest-priority unit in the system will always receive access to the system bus
when it requests it.
The processor whose arbiter has PI = 1 and PO = 0 is the one that gets access to the system bus.
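A sketch of the priority chain in Python: PI of the first arbiter is tied to 1, each arbiter passes PO = PI AND (not requesting) down the chain, and the unit with PI = 1 and PO = 0 wins; the list ordering and names are ours:

def serial_arbitrate(requests):
    """requests[0] belongs to the highest-priority processor."""
    pi = True                     # PI of the first arbiter is tied to 1
    for cpu, req in enumerate(requests):
        if pi and req:            # PI = 1 and PO = 0: this arbiter wins
            return cpu
        pi = pi and not req       # pass PO to the next arbiter's PI
    return None                   # no processor requested the bus

print(serial_arbitrate([False, True, True, False]))   # 1: CPU 1 wins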
Advantages
It is a simple design.
Fewer control lines are used.
Disadvantages
Priority depends on the physical location of the device.
Propagation delay due to the serial granting of the bus.
Failure of one of the devices may cause the entire system to fail.
Cannot assure fairness: a low-priority device may be locked out indefinitely.
Parallel Arbitration
This technique uses an external priority encoder and decoder as shown in figure
below.
Each bus arbiter in the parallel scheme has a bus request output line and a bus
acknowledge input line.
When a processor wants to access the system bus, its arbiter activates its bus request line.
The processor takes control of the bus when its acknowledge input line is enabled.
The figure shows the request lines from four arbiters going into a 4 x 2 priority encoder.
The output of the encoder generates a 2-bit code, which represents the highest-priority
unit among those requesting the bus.
The 2-bit code from the encoder output drives a 2×4 decoder which enables the proper
acknowledge line to grant bus access to the highest-priority unit.
Arbitration follows the priority-encoder truth table: among the active requests, the highest-priority one is encoded and acknowledged.
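A sketch of the encoder/decoder pair in Python; index 0 is taken here as the highest-priority arbiter (the real ordering is fixed by the encoder's truth table):

def parallel_arbitrate(requests):
    """requests: four booleans, one per bus arbiter."""
    for i, req in enumerate(requests):            # 4 x 2 priority encoder
        if req:
            code = i                              # 2-bit code of winner
            ack = [j == code for j in range(4)]   # 2 x 4 decoder output
            return code, ack
    return None, [False] * 4                      # no requests pending

code, ack = parallel_arbitrate([False, True, False, True])
print(code, ack)    # 1 [False, True, False, False]: arbiter 1 is granted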
Dynamic Arbitration
Serial and parallel bus arbitration are static, since the priorities assigned are fixed.
A dynamic priority algorithm, in contrast, gives the system the capability to change the priority of the devices while the system is in operation.
Few dynamic arbitration procedures that use dynamic priority algorithms: Time Slice,
Polling, LRU, FIFO
Time Slice
This algorithm allocates a fixed-length time slice of bus time that is offered to each processor sequentially, in round-robin fashion.
The service provided to each processor with this scheme is independent of its location along the bus.
No preference is given to any particular device since each is allotted the same amount
of time to communicate with the bus.
Polling:
In a bus system that uses polling, the bus-grant signal is replaced by a set of lines
called poll lines, which are connected to all units.
Poll lines are used by the bus controller to define an address for each device connected
to the bus.
The bus controller sequences through the addresses in a prescribed manner.
When a processor recognizes its address, it activates the bus busy line and then accesses the bus.
After a number of bus cycles, the polling process continues by choosing a different
processor.
The polling sequence is normally programmable; as a result, the selection priority can be altered under program control.
LRU:
The LRU (least recently used) algorithm gives the highest priority to the requesting
device that has not used the bus for the longest interval.
The priorities are adjusted after a number of bus cycles according to the LRU
algorithm.
With this procedure, no processor is favoured over any other since the priorities are
dynamically changed to give every device an opportunity to access the bus.
FIFO:
In the first-come, first-serve scheme, requests are served in the order received.
To implement this algorithm, the bus controller establishes a queue arranged
according to the time that the bus requests arrive.
Each processor must wait for its turn to use the bus on a first-in, first-out (FIFO) basis.
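A sketch of the FIFO policy: the bus controller keeps a queue in arrival order and grants the bus to the request at the head; the names and the deque choice are ours:

from collections import deque

bus_queue = deque()

def request_bus(cpu):
    bus_queue.append(cpu)          # enqueue in order of arrival

def grant_bus():
    return bus_queue.popleft() if bus_queue else None

request_bus("P2"); request_bus("P0"); request_bus("P3")
print(grant_bus(), grant_bus(), grant_bus())   # P2 P0 P3: arrival order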
Interprocess Synchronization
The instruction set of a multiprocessor contains basic instructions that are used to
implement communication and synchronization between cooperating processes.
Communication refers to the exchange of data between different processes.
Synchronization refers to the special case where the data used to communicate
between processors is control information.
Synchronization is needed to enforce the correct sequence of processes and to ensure
mutually exclusive access to shared writable data.
Multiprocessor systems usually include various mechanisms to deal with the
synchronization of resources.
Low-level primitives are implemented directly by the hardware.
These primitives are the basic mechanisms that enforce mutual exclusion for more
complex mechanisms implemented in software.
A number of hardware mechanisms for mutual exclusion have been developed.
o A binary semaphore: a memory bit that indicates whether a shared resource is free (0) or busy (1), tested and set by the hardware in a single indivisible read-modify-write memory cycle.
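A sketch of how the semaphore enforces mutual exclusion, assuming the hardware provides a test-and-set operation that reads and sets the bit in one indivisible memory cycle; the atomicity is only simulated here:

semaphore = 0                      # 0 = critical section free, 1 = busy

def test_and_set():
    """In hardware this is a single indivisible memory cycle."""
    global semaphore
    old = semaphore                # read the semaphore...
    semaphore = 1                  # ...and set it, atomically
    return old

def enter_critical_section():
    while test_and_set() == 1:     # busy-wait until the section is free
        pass

def leave_critical_section():
    global semaphore
    semaphore = 0                  # clear the semaphore on the way out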
Cache Coherence
Suppose that at some time during operation an element X from main memory is loaded into three processors, P1, P2, and P3.
As a consequence, it is also copied into the private caches of the three processors.
We assume that X contains the value 52.
The load on X to the three processors results in consistent copies in the caches and
main memory.
If one of the processors performs a store to X, the copies of X in the caches become
inconsistent.
A load by the other processors will not return the latest value.
Depending on the memory update policy used in the cache, main memory may also become inconsistent with respect to the caches, as shown in Fig. 13-13.
A store to X (of the value of 120) into the cache of processor P1 updates memory to
the new value in a write-through policy.
A write-through policy maintains consistency between memory and the originating
cache, but the other two caches are inconsistent since they still hold the old value.
In a write-back policy, main memory is not updated at the time of the store.
The copies in the other two caches and main memory are inconsistent.
Memory is updated eventually when the modified data in the cache are copied back
into memory.
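The situation can be made concrete with a sketch: three private caches hold X = 52, P1 stores 120, and the two update policies leave different components stale (the dictionaries stand in for the caches and main memory):

memory = {"X": 52}
caches = {p: {"X": 52} for p in ("P1", "P2", "P3")}   # after the loads

def store(cpu, addr, value, policy):
    caches[cpu][addr] = value
    if policy == "write-through":
        memory[addr] = value       # memory stays consistent with P1

store("P1", "X", 120, policy="write-through")
print(memory["X"])                               # 120: updated
print(caches["P2"]["X"], caches["P3"]["X"])      # 52 52: stale copies
# With policy="write-back", memory["X"] would also still be 52.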
Various schemes have been proposed to solve the cache coherence problem in shared
memory multiprocessors.
A simple scheme is to disallow private caches for each processor and have a shared
cache memory associated with main memory.
Every data access is made to the shared cache.
This method violates the principle of closeness of CPU to cache and increases the
average memory access time.
In effect, this scheme solves the problem by avoiding it.
For performance considerations it is desirable to attach a private cache to each
processor.
One scheme that has been used allows only nonshared and read-only data to be stored
in caches.
Such items are called cachable.
Shared writable data are noncachable.
The compiler must tag data as either cachable or noncachable, and the system
hardware makes sure that only cachable data are stored in caches.
The noncachable data remain in main memory.
This method restricts the type of data stored in caches and introduces an extra software
overhead that may degrade performance.
A scheme that allows writable data to exist in at least one cache employs a centralized global table maintained by the compiler.
The status of memory blocks is stored in the central global table.
Each block is identified as read-only (RO) or read and write (RW). All caches can
have copies of blocks identified as RO.
Only one cache can have a copy of an RW block.
Thus if the data are updated in the cache with an RW block, the other caches are not
affected because they do not have a copy of this block.
The cache coherence problem can be solved by means of a combination of software
and hardware or by means of hardware-only schemes.
The two methods mentioned previously use software-based procedures that require the
ability to tag information in order to disable caching of shared writable data.
Hardware-only solutions are handled by the hardware automatically and have the
advantage of higher speed and program transparency.
In the hardware solution, the cache controller is specially designed to allow it to
monitor all bus requests from CPUs and IOPs.
All caches attached to the bus constantly monitor the network for possible write
operations.
Depending on the method used, they must then either update or invalidate their own
cache copies when a match is detected.
The bus controller that monitors this action is referred to as a snoopy cache controller.
This is basically a hardware unit designed to maintain a bus-watching mechanism over
all the caches attached to the bus.
Various schemes have been proposed to solve the cache coherence problem by means
of snoopy cache protocol.
The simplest method is to adopt a write-through policy and use the following
procedure.
All the snoopy controllers watch the bus for memory store operations. When a word in
a cache is updated by writing into it, the corresponding location in main memory is also
updated.
The local snoopy controllers in all other caches check their memory to determine if
they have a copy of the word that has been overwritten.
If a copy exists in a remote cache, that location is marked invalid.
Because all caches snoop on all bus writes, whenever a word is written, the net effect is to update it in the original cache and main memory and remove it from all other caches.
If at some future time a processor accesses the invalid item from its cache, the
response is equivalent to a cache miss, and the updated item is transferred from main
memory.
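A sketch of this write-through snoopy procedure: the dictionaries model main memory and the private caches, and the loop over the other caches plays the role of the snoopy controllers watching the bus:

memory = {"X": 52}
caches = {p: {} for p in ("P1", "P2", "P3")}

def read(cpu, addr):
    if addr not in caches[cpu]:               # invalid or absent: a miss
        caches[cpu][addr] = memory[addr]      # fetch from main memory
    return caches[cpu][addr]

def write(cpu, addr, value):
    caches[cpu][addr] = value                 # update the local cache
    memory[addr] = value                      # write-through to memory
    for other, cache in caches.items():       # snoopy controllers watch
        if other != cpu:
            cache.pop(addr, None)             # invalidate remote copies

for p in ("P1", "P2", "P3"):
    read(p, "X")              # all three caches load X = 52
write("P1", "X", 120)         # P2 and P3 copies are invalidated
print(read("P2", "X"))        # 120: a miss, refilled from updated memory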