

COMPUTER ORGANIZATION AND ARCHITECTURE


UNIT –V
OVERVIEW OF PIPELINING TECHNIQUE & PROCESSOR ARCHITECTURE
TOPIC- PARALLEL PROCESSING

Parallel Processing

Need of Parallel Processing

 Parallel processing is needed to enhance the computer's processing capability and increase its throughput, i.e., the amount of processing that can be accomplished during a given interval of time.

Introduction
 Parallel processing can be described as a class of techniques that enable a system to carry out simultaneous data-processing tasks, increasing the computational speed of a computer system.
 A parallel processing system can carry out simultaneous data-processing to achieve
faster execution time.
 The primary purpose of parallel processing is to enhance the computer processing
capability and increase its throughput, i.e. the amount of processing that can be
accomplished during a given interval of time.
 A parallel processing system can be achieved by having a multiplicity of functional
units that perform identical or different operations simultaneously.
 The data can be distributed among the multiple functional units.
 The following diagram shows one possible way of separating the execution unit into eight functional units operating in parallel; the operation performed in each functional unit is indicated in each block of the diagram:

o The adder and integer multiplier perform arithmetic operations on integer numbers.
o The floating-point operations are separated into three circuits operating in parallel.

o The logic, shift, and increment operations can be performed concurrently on different
data.
o All units are independent of each other.

Classification of Parallel Processing (or) Flynn’s classification (or) Flynn’s Taxonomy

 There are a variety of ways that parallel processing can be classified. It can be considered from the:
o Internal organization of the processors
o Interconnection structure between processors
o The flow of information through the system
 One classification introduced by M. J. Flynn considers the organization of a computer
system by the number of instructions and data items that are manipulated
simultaneously.
 The normal operation of a computer is to fetch instructions from memory and execute
them in the processor.
 The sequence of instructions read from memory constitutes an instruction stream.
 The operations performed on the data in the processor constitute a data stream
 Parallel processing may occur in the instruction stream, in the data stream, or in both.
 Flynn’s classification divides computers into four major groups as follows:
1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data stream (SIMD)
3. Multiple instruction stream, single data stream (MISD)
4. Multiple instruction stream, multiple data stream (MIMD)

1. Single Instruction stream, Single Data stream (SISD)


• An SISD computing system is a uniprocessor machine which is capable of executing a
single instruction, operating on a single data stream.
• In SISD, machine instructions are processed in a sequential manner and computers
adopting this model are popularly called sequential computers.
• Most conventional computers have SISD architecture.
• Ex: Von-Neumann computers

2. Single Instruction stream, Multiple Data stream (SIMD)


• Here there are multiple processing elements (PEs) supervised by the same control unit.
• All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams.
• The shared memory subsystem may contain multiple modules.
Ex: Pipelined Processors

3. Multiple Instruction stream, Single Data stream (MISD)
• Here there are n processor units, each receiving distinct instructions operating over the
same data stream.
• The result of one processor becomes the input of the next processor in the macropipe.
• This structure has received less attention and has been challenged as impractical in some applications.

4. Multiple Instruction stream, Multiple Data stream (MIMD)


• Most multiprocessor systems and multiple computer systems come under this
category.
• An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors.
• If the n data streams were derived from disjoint subspaces of the shared memories, it would be called a multiple SISD (MSISD) organization.
• Ex: Superscalar processors.

Advantages of Parallel Computing over Serial Computing

1. It saves time and money, as many resources working together reduce the time and cut potential costs.
2. It can be impractical to solve larger problems with serial computing.
3. It can take advantage of non-local resources when the local resources are finite.

Disadvantages of Parallel Computing over Serial Computing


1. Increases the cost of computers since more hardware is required.
2. Multicore architectures consume higher power.
3. Parallel architectures are difficult to achieve.
4. It increases the overhead cost due to synchronization and data transfers.


UNIT –V
OVERVIEW OF PIPELINING TECHNIQUE & PROCESSOR ARCHITECTURE
TOPIC- PIPELINING
Pipelining
 Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.
 The name “pipeline” implies a flow of information analogous to an industrial
assembly line.
 It is characteristic of pipelines that several computations can be in progress in distinct
segments at the same time.

Examples of Pipelining
Example 1:

Example 2:
 To perform the combined multiply and add operations with a stream of numbers: Ai*Bi + Ci for i = 1, 2, 3, …, 7.
 Each suboperation is to be implemented in a segment within a pipeline.
Segment 1: R1 ← Ai, R2 ← Bi (input Ai and Bi)
Segment 2: R3 ← R1*R2, R4 ← Ci (multiply and input Ci)
Segment 3: R5 ← R3 + R4 (add Ci to the product)
 The five registers are loaded with new data every clock pulse.
 The effect of each clock is shown in Table 9-1.
 The first clock pulse transfers A1 and B1 into R1 and R2.
 The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4.
 The same clock pulse transfers A2 and B2 into R1 and R2.
 The third clock pulse operates on all three segments simultaneously.
 It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3,
transfers C2 into R4, and places the sum of R3 and R4 into R5.
 It takes three clock pulses to fill up the pipe and retrieve the first output from R5.
 From there on, each clock produces a new output and moves the data one step down
the pipeline.
 Each segment has one or two registers and a combinational circuit as shown in Fig. 9-
2.


 We are considering the processing of the 7-element array A with the 7-element array B (i = 1 to 7).

 If the above task is executed without pipelining, each operation must pass through all three suboperations in sequence, so the 7 operations need 3 x 7 = 21 CPU cycles in total.
 Using the concept of a pipeline, the same task can be executed in k + (n - 1) = 3 + 6 = 9 cycles, as the sketch below illustrates.
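A minimal Python sketch of this three-segment pipeline makes the clock-by-clock behavior concrete. The data values are illustrative assumptions; the register names and segment actions follow the text above:

```python
# Minimal sketch of the three-segment multiply-add pipeline (R5 = Ai*Bi + Ci).
A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [1, 1, 1, 1, 1, 1, 1]

R1 = R2 = R3 = R4 = R5 = None
for clock in range(len(A) + 2):              # k + (n - 1) = 3 + 6 = 9 clock pulses
    # Segment 3: add the product to Ci (uses values latched one pulse earlier).
    R5 = R3 + R4 if R3 is not None else None
    # Segment 2: multiply and input Ci (uses the R1, R2 latched one pulse earlier).
    R3, R4 = (R1 * R2, C[clock - 1]) if R1 is not None else (None, None)
    # Segment 1: input Ai and Bi while earlier tasks move down the pipe.
    R1, R2 = (A[clock], B[clock]) if clock < len(A) else (None, None)
    print(f"clock {clock + 1}: R5 = {R5}")   # first result appears at clock 3
```

The first output emerges after three pulses (the pipe fill time); from then on one result is produced per pulse, giving the 9-cycle total.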

Principles of Pipelining

• Each task is divided into a number of successive subtasks.
• A pipeline stage is associated with each subtask.
• The same amount of time is required for each pipeline stage.
• The output of one stage is given as input to the next stage.
• The pipeline is clocked synchronously.

General considerations
 Any operation that can be decomposed into a sequence of sub operations of about the
same complexity can be implemented by a pipeline processor.
 The general structure of a four-segment pipeline is illustrated in Fig. 9-3.

 The behavior of a pipeline can be illustrated with a space-time diagram.
 The space-time diagram of a four-segment pipeline is demonstrated in Fig. 9-4.

Serial execution vs Pipelining


 In a k-segment pipeline, a clock cycle of time tp is used to execute a task.
 The first task T1 requires a time equal to k*tp to complete its operation.
 The remaining n - 1 tasks emerge one per cycle and are completed after a further time equal to (n - 1)*tp.
 Therefore, to complete n tasks using a k-segment pipeline requires (k + n - 1) clock cycles, i.e., a time of (k + n - 1)*tp.
 Consider a non-pipelined unit that performs the same operation and takes a time equal to tn to complete each task.
 The total time required for n tasks is n*tn.
 The speedup of pipeline processing over an equivalent non-pipelined processing is defined by the ratio
S = n*tn / ((k + n - 1)*tp)
 If n becomes much larger than k - 1, the speedup becomes S = tn/tp.
 If we assume that the time it takes to process a task is the same in the pipelined and non-pipelined circuits, i.e., tn = k*tp, the speedup reduces to S = k*tp/tp = k.
 This shows that the maximum speedup a pipeline can provide is k, where k is the number of segments in the pipeline (a numeric check appears in the sketch below).
 To duplicate the theoretical speed advantage of a pipeline process by means of multiple functional units, it is necessary to construct k identical units that will be operating in parallel.
 This is illustrated in Fig. 9-5, where four identical circuits are connected in parallel.
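The speedup formula is easy to check numerically. A minimal sketch (the clock times below are illustrative assumptions, not values from the text):

```python
def pipeline_speedup(n_tasks, k_segments, tp, tn=None):
    """Speedup S = n*tn / ((k + n - 1) * tp); tn defaults to k*tp."""
    if tn is None:
        tn = k_segments * tp                 # equal-work assumption: tn = k * tp
    return (n_tasks * tn) / ((k_segments + n_tasks - 1) * tp)

print(pipeline_speedup(100, 4, tp=20))       # ~3.88
print(pipeline_speedup(10_000, 4, tp=20))    # ~4.00: approaches the maximum speedup k
```

As n grows, the fill time of the pipe is amortized and the speedup converges to k, exactly as the derivation states.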


UNIT –V
OVERVIEW OF PIPELINING TECHNIQUE & PROCESSOR ARCHITECTURE
TOPIC- ARITHMETIC PIPELINE

Arithmetic Pipeline

 An arithmetic pipeline divides an arithmetic problem into various sub problems for
execution in various pipeline segments.
 Arithmetic Pipelines are mostly used in high-speed computers.
 They are used to implement floating-point operations, multiplication of fixed-point
numbers, and similar computations encountered in scientific problems.
 To understand the concepts of arithmetic pipeline, let us consider an example of a
pipeline unit for floating-point addition and subtraction with two normalized floating-
point binary numbers defined as:
X = A * 2^a
Y = B * 2^b
where A and B are two fractions that represent the mantissas and a and b are the exponents. For simplicity, the running example below uses decimal numbers: X = 0.9504 * 10^3 and Y = 0.8200 * 10^2.
 The combined operation of floating-point addition and subtraction is divided into four
segments.
 Each segment contains the corresponding sub-operation to be performed in the given
pipeline.
 The sub-operations that are shown in the four segments are:
1. Compare the exponents by subtraction.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
 The flowchart of the arithmetic pipeline for floating-point addition is shown in the diagram.

Figure : Flowchart of Arithmetic Pipeline

1. Compare exponents by subtraction:


 The exponents are compared by subtracting them to determine their difference. The
larger exponent is chosen as the exponent of the result.
 The difference of the exponents, i.e., 3 - 2 = 1, determines how many times the mantissa associated with the smaller exponent must be shifted to the right.

2. Align the mantissas:


 The mantissa associated with the smaller exponent is shifted according to the
difference of exponents determined in segment one.
X = 0.9504 * 10^3
Y = 0.08200 * 10^3

3. Add mantissas:
 The two mantissas are added in segment three.
Z = X + Y = 1.0324 * 10^3

4. Normalize the result:


 After normalization, the result is written as:
Z = 0.10324 * 10^4
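The four sub-operations can be traced end to end with a small decimal sketch. This is an illustration only: the (mantissa, exponent) representation, base 10, and the normalization range 0.1 <= m < 1 are assumptions chosen to match the example above:

```python
def float_add_pipeline(ma, ea, mb, eb):
    """Four-segment floating-point add on (mantissa, exponent) pairs, base 10."""
    # Segment 1: compare exponents by subtraction; keep the larger one.
    diff = ea - eb
    exp = max(ea, eb)
    # Segment 2: align the mantissa attached to the smaller exponent.
    if diff > 0:
        mb /= 10 ** diff
    else:
        ma /= 10 ** -diff
    # Segment 3: add the mantissas.
    mant = ma + mb
    # Segment 4: normalize the result so that 0.1 <= |mantissa| < 1.
    while abs(mant) >= 1:
        mant /= 10
        exp += 1
    while 0 < abs(mant) < 0.1:
        mant *= 10
        exp -= 1
    return mant, exp

print(float_add_pipeline(0.9504, 3, 0.8200, 2))   # -> approx (0.10324, 4)
```

In a real pipeline each segment would process a different operand pair on each clock; here the four segments are simply run in sequence for one pair to show the data flow.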


UNIT –V
OVERVIEW OF PIPELINING TECHNIQUE & PROCESSOR ARCHITECTURE
TOPIC- INSTRUCTION PIPELINE

Instruction Pipeline
 Pipeline processing can occur not only in the data stream but in the instruction stream
as well.
 Most digital computers with complex instructions require an instruction pipeline to carry out operations such as fetching, decoding, and executing instructions.
 Here a stream of instructions is executed by overlapping the fetch, decode, and execute phases of the instruction cycle.
 This technique is used to increase the throughput of the computer system.
 An instruction pipeline reads instruction from the memory while previous instructions
are being executed in other segments of the pipeline.
 Thus we can execute multiple instructions simultaneously.
 The pipeline will be more efficient if the instruction cycle is divided into segments of
equal duration.
 In general, the computer needs to process each instruction with the following sequence
of steps:
1. Fetch instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the results.
 The flowchart for instruction pipeline is shown below.

Figure 3.1: Flowchart of Instruction Pipeline


 Figure 3.1 shows how the instruction cycle in the CPU can be processed with a four-
segment pipeline.
 While an instruction is being executed in segment 4, the next instruction in sequence
is busy fetching an operand from memory in segment 3.
 The effective address may be calculated in a separate arithmetic circuit for the third
instruction, and whenever the memory is available, the fourth and all subsequent
instructions can be fetched and placed in an instruction FIFO.
 Thus up to four sub-operations in the instruction cycle can overlap and up to four
different instructions can be in progress of being processed at the same time.
 Once in a while, an instruction in the sequence may be a program control type that
causes a branch out of normal sequence.
 In that case the pending operations in the last two segments are completed and all
information stored in the instruction buffer is deleted.
 The pipeline then restarts from the new address stored in the program counter.
 Similarly, an interrupt request, when acknowledged, will cause the pipeline to empty
and start again from a new address value.
 Figure 3.2 shows the operation of an instruction pipeline with the help of a four-segment pipeline.

 The instruction cycle is divided into four segments or phases:


 Segment 1:
FI- Fetches an instruction from memory
 Segment 2:
DA-The instruction fetched from memory is decoded in the second segment, and the
effective address is calculated in a separate arithmetic circuit.
 Segment 3:
FO- An operand from memory is fetched in the third segment.
 Segment 4:
EX-The instructions are finally executed in the last segment of the pipeline organization.

Figure 3.2: Four Segment Instruction Pipeline

 Here the instruction is fetched in the first clock cycle in segment 1.
 It is decoded in the next clock cycle, then the operands are fetched, and finally the instruction is executed.
 The fetch and decode phases overlap due to pipelining, i.e., by the time the first instruction is being decoded, the next instruction is fetched by the pipeline.
 The third instruction here is a branch instruction.
 While it is being decoded, the 4th instruction is fetched simultaneously.
 But as it is a branch instruction, it may point to some other instruction when it is decoded.
 Thus the fourth instruction is kept on hold until the branch instruction is executed.
 When the branch has executed, the fourth instruction (or the instruction at the branch target) is fetched and the remaining phases continue as usual.
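The overlap in Figure 3.2, including the hold on instruction 4 caused by the branch, can be sketched as a timing table. This is a minimal sketch; the five-instruction mix, the position of the branch, and the two-cycle hold are assumptions chosen to mirror the figure:

```python
# Sketch of FI-DA-FO-EX overlap; instruction 3 is a branch, so the following
# instruction is held until the branch reaches EX.
STAGES = ["FI", "DA", "FO", "EX"]

def schedule(n_instr, branch_at=3, stall=2):
    start = 0
    rows = []
    for i in range(1, n_instr + 1):
        rows.append({start + s: STAGES[s] for s in range(4)})
        start += 1
        if i == branch_at:        # instructions after the branch wait
            start += stall        # until the branch has executed
    return rows

for i, row in enumerate(schedule(5), 1):
    cells = [row.get(c, "  ") for c in range(10)]
    print(f"I{i}: " + " ".join(cells))
```

Printed side by side, the rows show up to four instructions in progress at once, with a gap opening after the branch just as the figure describes.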


UNIT –V
OVERVIEW OF PIPELINING TECHNIQUE & PROCESSOR ARCHITECTURE
TOPIC- RISC PIPELINE

RISC Pipeline
 Among the characteristics attributed to RISC is its ability to use an efficient
instruction pipeline.
 The simplicity of the instruction set can be utilized to implement an instruction
pipeline using a small number of sub operations, with each being executed in one clock
cycle.
 Because of the fixed-length instruction format, the decoding of the operation can occur at the same time as the register selection.
 All data manipulation instructions have register-to-register operations.
 Since all operands are in registers, there is no need for calculating an effective address
or fetching of operands from memory.
 Therefore, the instruction pipeline can be implemented with two or three segments.
One segment fetches the instruction from program memory, and the other segment
executes the instruction in the ALU.
 A third segment may be used to store the result of the ALU operation in a destination
register.
 The data transfer instructions in RISC are limited to load and store instructions.
1. These instructions use register indirect addressing. They usually need three or
four stages in the pipeline.
2. To prevent conflicts between a memory access to fetch an instruction and to
load or store an operand, most RISC machines use two separate buses with two
memories.
3. Cache memories are used that operate at the same speed as the CPU clock.
 One of the major advantages of RISC is its ability to execute instructions at the rate of
one per clock cycle.
1. In effect, the goal is to start each instruction with each clock cycle and to pipeline the processor to achieve single-cycle instruction execution.
2. RISC can achieve pipeline segments requiring just one clock cycle each.

Example: Three-Segment Instruction Pipeline

 A typical set of instructions for a RISC processor consists of three types of instructions:
1. The data manipulation instructions: operate on data in processor registers
2. The data transfer instructions
3. The program control instructions

Consider the hardware operation for such a computer.
 The control section fetches the instruction from program memory into an instruction
register.
1. The instruction is decoded at the same time that the registers needed for the
execution of the instruction are selected.
 The processor unit consists of a number of registers and an arithmetic logic unit
(ALU).
 A data memory is used to load or store the data from a selected register in the register
file.
 The instruction cycle can be divided into three sub operations and implemented in
three segments:
1. I: Instruction fetch
 Fetches the instruction from program memory
2. A: ALU operation
 The instruction is decoded and an ALU operation is performed.
 It performs an operation for a data manipulation instruction.
 It evaluates the effective address for a load or store instruction.
 It calculates the branch address for a program control instruction.
3. E: Execute instruction
 Directs the output of the ALU to one of three destinations, depending on
the decoded instruction.
 It transfers the result of the ALU operation into a destination register in
the register file.
 It transfers the effective address to a data memory for loading or
storing.
 It transfers the branch address to the program counter.

Delayed Load

 Consider the operation of the following four instructions:


1. LOAD: R1 ← M[address 1]
2. LOAD: R2 ← M[address 2]
3. ADD: R3 ← R1 + R2
4. STORE: M[address 3] ← R3
 There will be a data conflict in instruction 3 because the operand in R2 is not yet available in the A segment, as can be seen from the timing of the pipeline shown in Fig. 9-9(a).
 The E segment in clock cycle 4 is in the process of placing the memory data into R2.
 The A segment in clock cycle 4 is using the data from R2, but the value in R2 will not be the correct value since it has not yet been transferred from memory.
 It is up to the compiler to make sure that the instruction following the load instruction
uses the data fetched from memory.
 If the compiler cannot find a useful instruction to put after the load, it inserts a no-op (no-operation) instruction.
 This is a type of instruction that is fetched from memory but has no operation, thus
wasting a clock cycle.
 This concept of delaying the use of the data loaded from memory is referred to as delayed load.
 Figure 9-9(b) shows the same program with a no-op instruction inserted after the load to R2 instruction.
 The data is loaded into R2 in clock cycle 4.
 The add instruction uses the value of R2 in step 5.
 Thus the no-op instruction is used to advance one clock cycle in order to compensate
for the data conflict in the pipeline.
 The advantage of the delayed load approach is that the data dependency is taken care
of by the compiler rather than the hardware.
 This results in a simpler hardware segment since the segment does not have to check if
the content of the register being accessed is currently valid or not.
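The compiler pass implied by delayed load can be sketched in a few lines: whenever an instruction reads a register that the immediately preceding load writes, a no-op is inserted. The tuple encoding of instructions and the one-slot delay rule are assumptions for illustration:

```python
def insert_noops(program):
    """Insert a NOP after a LOAD whose destination the next instruction reads."""
    out = []
    for i, (op, dst, srcs) in enumerate(program):
        out.append((op, dst, srcs))
        nxt = program[i + 1] if i + 1 < len(program) else None
        if op == "LOAD" and nxt and dst in nxt[2]:
            out.append(("NOP", None, ()))    # delay slot the compiler could not fill
    return out

prog = [("LOAD", "R1", ("M1",)),
        ("LOAD", "R2", ("M2",)),
        ("ADD",  "R3", ("R1", "R2")),
        ("STORE", "M3", ("R3",))]
for ins in insert_noops(prog):
    print(ins)     # a NOP appears between the second LOAD and the ADD, as in Fig. 9-9(b)
```

A smarter compiler would first try to move an independent instruction into the slot instead of a NOP; the hardware itself never has to check register validity.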

Delayed Branch

 The method used in most RISC processors is to rely on the compiler to redefine the
branches so that they take effect at the proper time in the pipeline.
 This method is referred to as delayed branch.
 The compiler is designed to analyze the instructions before and after the branch and
rearrange the program sequence by inserting useful instructions in the delay steps.
 It is up to the compiler to find useful instructions to put after the branch instruction.
Failing that, the compiler can insert no-op instructions.

An Example of Delayed Branch


 The program for this example consists of five instructions.
1. Load from memory to R1
2. Increment R2
3. Add R3 to R4
4. Subtract R5 from R6
5. Branch to address X
 In Fig. 9-10(a) the compiler inserts two no-op instructions after the branch.
 The branch address X is transferred to PC in clock cycle 7.
 The fetching of the instruction at X is delayed by two clock cycles by the no-op
instructions.
 The instruction at X starts the fetch phase at clock cycle 8 after the program counter
PC has been updated.
 The program in Fig. 9-10(b) is rearranged by placing the add and subtract instructions after the branch instruction instead of before, as in the original program.
 PC is updated to the value of X in clock cycle 5, but the add and subtract instructions are fetched from memory and executed in the proper sequence.
 In other words, if the load instruction is at address 101 and X is equal to 350, the branch instruction is fetched from address 103.
 The add instruction is fetched from address 104 and executed in clock cycle 6. The subtract instruction is fetched from address 105 and executed in clock cycle 7.
 Since the value of X is transferred to PC with clock cycle 5 in the E segment, the instruction fetched from memory at clock cycle 6 is from address 350, which is the instruction at the branch address.
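The rearrangement in Fig. 9-10(b) can likewise be sketched as a compiler pass that moves the instructions just before the branch into its delay slots. A minimal sketch (two delay slots as in the example; the check that the moved instructions are independent of the branch is omitted for brevity):

```python
def fill_delay_slots(program, slots=2):
    """Move the instructions just before the branch into its delay slots,
    falling back to NOPs when not enough candidates exist (Fig. 9-10)."""
    branch = next(i for i, ins in enumerate(program) if ins.startswith("BRANCH"))
    movable = program[max(0, branch - slots):branch]   # assumed independent
    fillers = movable + ["NOP"] * (slots - len(movable))
    return program[:branch - len(movable)] + [program[branch]] + fillers

prog = ["LOAD R1", "INC R2", "ADD R3,R4", "SUB R6,R5", "BRANCH X"]
print(fill_delay_slots(prog))
# -> ['LOAD R1', 'INC R2', 'BRANCH X', 'ADD R3,R4', 'SUB R6,R5']
```

The add and subtract still execute exactly once, but now they occupy the cycles that would otherwise be wasted while the branch target is fetched.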


UNIT –V
OVERVIEW OF PIPELINING TECHNIQUE & PROCESSOR ARCHITECTURE
TOPIC- REDUCED INSTRUCTION SET COMPUTER (RISC) & COMPLEX INSTRUCTION SET COMPUTER (CISC)

Reduced Instruction Set Architecture (RISC)


 It stands for Reduced Instruction Set Computer.
 It is a type of microprocessor architecture that uses a small set of instructions of
uniform length.
 These are simple instructions that are generally executed in one clock cycle.
 RISC chips are relatively simple to design and inexpensive.
 The main idea behind this is to simplify hardware by using an instruction set composed of a few basic steps for loading, evaluating, and storing operations: a load command loads data, a store command stores it.
 RISC architectures are generally well-suited for high-throughput applications like
scientific computing and multimedia processing.
 Their streamlined instructions and efficient use of registers contribute to faster
execution, which is crucial for data-intensive tasks.
 Examples: SPARC, PowerPC, etc.

Characteristics of RISC
 Simpler instructions, hence simple instruction decoding.
 Instructions fit within one word.
 Instructions take a single clock cycle to get executed.
 More general-purpose registers.
 Simple addressing modes.
 Fewer data types.
 Pipelining can be achieved.

Advantages of RISC
 Simpler instructions: RISC processors use a smaller set of simple instructions,
which makes them easier to decode and execute quickly. This results in faster
processing times.
 Faster execution: Because RISC processors have a simpler instruction set, they can
execute instructions faster than CISC processors.
 Lower power consumption: RISC processors consume less power than CISC
processors, making them ideal for portable devices.

Disadvantages of RISC
 More instructions required: RISC processors require more instructions to perform complex tasks than CISC processors.
 Increased memory usage: RISC processors require more memory to store the additional instructions needed to perform complex tasks.
 Higher cost: Developing and manufacturing RISC processors can be more expensive
than CISC processors.

Complex Instruction Set Architecture (CISC)


 It stands for Complex Instruction Set Computer.
 These processors offer hundreds of instructions of variable sizes.
 CISC architecture includes a complete set of special-purpose circuits that carry out
these instructions at a very high speed.
 These instructions interact with memory by using complex addressing modes.
 CISC processors reduce the program size, and hence fewer memory cycles are required to execute the programs. This increases the overall speed of execution.
 The main idea is that a single instruction will do all loading, evaluating, and storing
operations just like a multiplication command will do stuff like loading data,
evaluating, and storing it, hence it’s complex.
 Examples: Intel architecture, AMD

Characteristics of CISC
 Complex instructions, hence complex instruction decoding.
 Instructions are larger than one word in size.
 An instruction may take more than a single clock cycle to get executed.
 Fewer general-purpose registers, since operations can be performed in memory itself.
 Complex addressing modes.
 More data types.

Advantages of CISC
 Reduced code size: CISC processors use complex instructions that can perform
multiple operations, reducing the amount of code needed to perform a task.
 More memory efficient: Because CISC instructions are more complex, they require
fewer instructions to perform complex tasks, which can result in more memory-
efficient code.
 Widely used: CISC processors have been in use for a longer time than RISC
processors, so they have a larger user base and more available software.

Disadvantages of CISC
 Slower execution: CISC processors take longer to execute instructions because they
have more complex instructions and need more time to decode them.
 More complex design: CISC processors have more complex instruction sets, which
makes them more difficult to design and manufacture.
 Higher power consumption: CISC processors consume more power than RISC
processors because of their more complex instruction sets.

Key Difference between RISC and CISC Processor
• Instructions are easier to decode in RISC than in CISC, which has a more complex
decoding process.
• In CISC, calculations can act on external memory directly, while in RISC they cannot: operands must first be loaded into registers.
• While CISC only has a single register set, RISC has numerous register sets.
• The execution time of a RISC computer is very low compared to a CISC computer,
which is very high.
 RISC architectures emphasize simpler instructions, shorter execution times, and
efficient use of registers.
 CISC architectures offer complex instructions, potentially leading to fewer
instructions overall but with longer execution times.
 RISC architectures often follow a load-store model, while CISC architectures allow
direct memory access instructions.


UNIT –V
OVERVIEW OF PIPELINING TECHNIQUE & PROCESSOR ARCHITECTURE
TOPIC- EXAMPLE OF RISC & CISC, RISC VS CISC

Example of RISC and CISC

Multiplying Two Numbers in the Memory


 The main memory is divided into locations numbered from 1:1 (row 1, column 1) to 6:4 (row 6, column 4).
 The execution unit is responsible for carrying out all the computations.

Figure: Multiplying two numbers in the memory

 However, the execution unit can only operate on data that has been loaded into
one of the six registers (A, B, C, D, E, or F).
 The task is to find the product of two numbers, one stored in location 2:3 and the other in location 5:2, and then store the product back in location 2:3.

The CISC Approach

 The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible.
 This is achieved by building processor hardware that is capable of understanding
and executing a series of operations.
 For this particular task, a CISC processor has a specific instruction (called
"MULT"). When executed, this instruction loads the two values into separate
registers, multiplies the operands in the execution unit, and then stores the product
in the appropriate register.
 The task of multiplying two numbers can be completed with one instruction:
MULT 2:3, 5:2
 MULT is known as a "complex instruction", which operates directly on the computer's memory banks and does not require the programmer to explicitly call any loading or storing functions.
 One of the primary advantages is that the compiler has to do very little work to translate a high-level language statement into assembly. Because the length of the code is relatively short, very little RAM is required to store the instructions. The emphasis is placed on building complexity directly into the hardware.

The RISC Approach

 RISC processors only use simple instructions that can be executed within one clock cycle.
 Thus, the "MULT" command could be divided into three separate commands: "LOAD,"
which moves data from the memory bank to a register, "PROD," which finds the product of
two operands located within the registers, and "STORE," which moves data from a register
to the memory banks.
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
 Here there are more lines of code, so more RAM is needed to store the assembly-level instructions.
 The compiler must also perform more work to convert a high-level language statement into
code of this form.
 The advantage of RISC is that because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle "MULT" command.
 These RISC "reduced instructions" require fewer transistors of hardware space than the complex instructions, leaving more room for general-purpose registers.
 Because all of the instructions execute in a uniform amount of time (i.e. one clock),
pipelining is possible.
 Separating the "LOAD" and "STORE" instructions actually reduces the amount of work
that the computer must perform.
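The RISC sequence above can be mimicked with an explicit register file and memory. This is a minimal sketch; the 6x4 memory grid, the register names A-F, and the instruction semantics are assumptions that follow the figure and the text:

```python
# Registers A-F and a memory grid addressed as "row:col", as in the figure.
regs = dict.fromkeys("ABCDEF", 0)
mem = {"2:3": 7, "5:2": 6}

def LOAD(r, addr):  regs[r] = mem[addr]             # memory -> register
def PROD(r1, r2):   regs[r1] = regs[r1] * regs[r2]  # register * register
def STORE(addr, r): mem[addr] = regs[r]             # register -> memory

LOAD("A", "2:3")
LOAD("B", "5:2")
PROD("A", "B")
STORE("2:3", "A")
print(mem["2:3"])   # 42: the product replaces the original value at 2:3
```

Each of the four calls corresponds to one single-cycle RISC instruction, whereas the CISC "MULT 2:3, 5:2" would perform all four steps inside one multi-cycle instruction.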

CISC VERSUS RISC

1. CISC: Large instruction set. RISC: Small instruction set.
2. CISC: Emphasis on the hardware. RISC: Emphasis on the software.
3. CISC: Complex operations. RISC: Simple instructions that allow fast execution (fewer steps).
4. CISC: A large number of different and complex instructions, which require multiple cycles. RISC: Reduced instructions that take one clock cycle.
5. CISC: Instructions are executed one at a time. RISC: Uses pipelining to execute instructions.
6. CISC: Many instructions can reference memory. RISC: Only load and store instructions can reference memory.
7. CISC: Few general registers. RISC: Many general registers.
8. CISC: Easy to program; simpler compiler. RISC: Complex tasks are left to the compiler to construct from simple operations, with increased compiler complexity and compiling time.
9. CISC: Complex addressing modes, which are infrequently used and can always be realized using several simple instructions. RISC: Simple addressing modes that allow fast address computation.
10. CISC: Requires more transistors, so it is harder to design and costly. RISC: Chips require fewer transistors.
11. CISC: Slower than RISC chips. RISC: Faster than CISC chips.
12. CISC: At least 75% of processors use the CISC architecture. RISC: The RISC architecture is not as widely used.
13. CISC: Processors execute microcoded instructions. RISC: Processors have a number of hardwired instructions.
14. CISC: Processors cannot have a large number of registers. RISC: A large number of registers, most of which can be used as general-purpose registers.
15. CISC: Intel and AMD CPUs are based on CISC architectures. RISC: Apple uses RISC chips.
16. CISC: Mainly used in normal PCs, workstations, and servers. RISC: Mainly used for real-time applications.
17. CISC: Software developers do not need to write more lines for the same task. RISC: Puts a greater burden on the software; developers need to write more lines for the same task.
18. CISC: Direct addition between data in two memory locations is possible (e.g., 8085). RISC: Direct addition is not possible.
19. CISC: Pipelining is not easy to implement. RISC: Pipelining can be implemented easily.
20. CISC: Minimizes the number of instructions per program but increases the number of cycles per instruction. RISC: Maximizes the number of instructions per program but reduces the number of cycles per instruction.
21. Examples of CISC: IBM System/360, VAX, PDP-11, Motorola 68000 family, Intel x86/Pentium CPUs. Examples of RISC: SPARC, PowerPC, ARM, MIPS.

UNIT –V
OVERVIEW OF PIPELINING TECHNIQUE & PROCESSOR ARCHITECTURE
TOPIC- VECTOR PROCESSING

Vector Processing
 In many science and engineering applications, the problems can be formulated in
terms of vectors and matrices that lend themselves to vector processing.
 Computers with vector processing capabilities are in demand in specialized applications, e.g.:
o Long-range weather forecasting
o Petroleum explorations
o Seismic data analysis
o Medical diagnosis
o Artificial intelligence and expert systems
o Image processing
o Mapping the human genome

 To achieve the required level of high performance it is necessary to utilize the fastest
and most reliable hardware and apply innovative procedures from vector and parallel
processing techniques.

Vector operations

 Many scientific problems require arithmetic operations on large arrays of numbers.


 A vector is an ordered set of a one-dimensional array of data items.
 A vector V of length n is represented as a row vector by V = [v1, v2, ..., vn].
 To examine the difference between a conventional scalar processor and a vector processor, consider the following Fortran DO loop:

DO 20 I = 1, 100
20 C(I) = B(I) + A(I)

 This is implemented in machine language by the following sequence of operations:

Initialize I = 0
20 Read A(I)
Read B(I)
Store C(I) = A(I) + B(I)
Increment I = I + 1
If I <= 100 go to 20
Continue
 A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop; the entire loop is expressed by a single vector instruction:

C(1:100) = A(1:100) + B(1:100)
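The contrast between the scalar loop and the single vector instruction can be seen with NumPy standing in for the vector hardware (an illustrative assumption; the text's machine executes a dedicated vector instruction rather than library code):

```python
import numpy as np

A = np.arange(100, dtype=float)
B = np.arange(100, dtype=float)

# Scalar processor: one fetch-execute iteration per element, loop overhead each time.
C = np.empty(100)
for i in range(100):
    C[i] = A[i] + B[i]

# Vector processor: the whole loop collapses into one operation,
# the analogue of C(1:100) = A(1:100) + B(1:100).
C_vec = A + B
assert (C == C_vec).all()
```

The per-iteration instruction fetch, decode, increment, and branch test of the scalar version are exactly the overhead the vector instruction removes.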

 A possible instruction format for a vector instruction is shown in Fig. 9-11.

 This assumes that the vector operands reside in memory.


 It is also possible to design the processor with a large number of registers and store all
operands in registers prior to the addition operation.
 The base address and length in the vector instruction specify a group of CPU registers.

Matrix Multiplication

 The multiplication of two n x n matrices consists of n^2 inner products, or n^3 multiply-add operations.
 Consider, for example, the multiplication of two 3 x 3 matrices A and B:

c11 = a11b11 + a12b21 + a13b31

 This requires three multiplications and (after initializing c11 to 0) three additions.
 In general, an inner product consists of the sum of k product terms of the form:

C = A1B1 + A2B2 + A3B3 + ... + AkBk

 In a typical application k may be equal to 100 or even 1000.
 The inner product calculation on a pipeline vector processor is shown in Fig. 9-12; with a four-segment pipelined adder, the products accumulate as four interleaved partial sums:

C = A1B1 + A5B5 + A9B9 + A13B13
  + A2B2 + A6B6 + A10B10 + A14B14
  + A3B3 + A7B7 + A11B11 + A15B15
  + A4B4 + A8B8 + A12B12 + A16B16


UNIT –V
OVERVIEW OF PIPELINING TECHNIQUE & PROCESSOR ARCHITECTURE
TOPIC- ARRAY PROCESSORS

Array Processors
Introduction
 An array processor is a processor that performs computations on large arrays of
data.
 The term is used to refer to two different types of processors.
o Attached array processor:
 Is an auxiliary processor.
 It is intended to improve the performance of the host computer in specific
numerical computation tasks.
o SIMD array processor:
 Has a single-instruction multiple-data organization.
 It manipulates vector instructions by means of multiple functional units
responding to a common instruction.

Attached Array Processor

 An attached array processor is designed as a peripheral for a conventional host


computer.
 Its purpose is to enhance the performance of the computer by providing vector
processing for complex scientific applications.
 High performance is achieved by means of parallel processing with multiple functional
units.
 The host computer is a general-purpose commercial computer and the attached
processor is a back-end machine driven by the host computer.
 Fig. 9-14 shows the interconnection of an attached array processor to a host computer.

 The array processor is connected through an input-output controller to the computer, and the computer treats it like an external interface.
 The data for the attached processor are transferred from main memory to a local
memory through a high-speed bus.
 The general-purpose computer without the attached processor serves the users that
need conventional data processing.
 The system with the attached processor satisfies the needs for complex arithmetic
applications.
 The array processor can be programmed by the user to accommodate a variety of
complex arithmetic problems.
 The objective of the attached array processor is to provide vector manipulation capabilities to a conventional computer at a fraction of the cost of a supercomputer.

SIMD Array Processor

• An SIMD array processor is a computer with multiple processing units operating in


parallel.
• The processing units are synchronized to perform the same task under the control of a
common control unit.
• It contains a set of identical processing elements (PEs), each having a local memory M.
• Each PE includes an ALU, a floating-point arithmetic unit, and working registers.

 Vector instructions are broadcast to all PEs simultaneously.


 Masking schemes are used to control the status of each PE during the execution of
vector instructions.
 Each PE has a flag that is set when the PE is active and reset when the PE is inactive.
 For example, the ILLIAC IV computer, developed at the University of Illinois and manufactured by the Burroughs Corp., is of this type.
o SIMD array processors are highly specialized computers.
o They are suited primarily for numerical problems that can be expressed in vector or matrix form.

Example
• Consider the vector addition C = A + B.
• The master control unit first stores the ith components ai and bi of A and B in local memory Mi for i = 1, 2, 3, ..., n.
• It then broadcasts the floating-point add instruction ci = ai + bi to all the PEs, causing the addition to take place simultaneously.
• The components of C are stored in fixed locations in local memory.
• This produces the desired vector sum in one add cycle.
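This broadcast model is easy to sketch: each PE holds its ai and bi in local memory, and every active PE executes the same add in the same cycle. The PE count, data values, and mask are illustrative assumptions (the mask mimics the masking scheme described above):

```python
# Each PE i holds (a_i, b_i) in its local memory M_i; the control unit broadcasts
# one add instruction and every active PE executes it in the same cycle.
local = [{"a": a, "b": b, "c": None} for a, b in zip([1, 2, 3, 4], [10, 20, 30, 40])]
mask = [True, True, True, False]          # masking: PE 4 sits out this instruction

for pe, active in zip(local, mask):       # conceptually simultaneous, one per PE
    if active:
        pe["c"] = pe["a"] + pe["b"]

print([pe["c"] for pe in local])          # [11, 22, 33, None]
```

On real SIMD hardware the loop does not exist; all PEs fire at once, and the mask flag simply disables the inactive ones.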


UNIT –V
MULTIPROCESSORS
TOPIC- CHARACTERISTICS OF MULTIPROCESSORS

Introduction to Multiprocessors
 A multiprocessor is a set of processors connected by a communication network.

Fig. 5.1: Basic multiprocessor architecture


 A multiprocessor system is an interconnection of two or more CPUs with memory and input-output equipment.
 Multiprocessor systems are classified as multiple instruction stream, multiple data stream (MIMD) systems.
 There is a distinction between multiprocessors and multicomputers, though both support concurrent operations.
 In multicomputers, several autonomous computers are connected through a network and may or may not communicate, whereas in a multiprocessor system a single OS control provides the interaction between processors and makes all the components of the system cooperate in the solution of the problem.
 VLSI circuit technology has reduced the cost of computers to such a low level that the concept of applying multiple processors to meet system performance requirements has become an attractive design possibility.

Fig. 5.2: Taxonomy of mono- and multiprocessor organizations

Characteristics of Multiprocessors

1. Multiprocessing increases the reliability of the system, so that a failure or error in one part has a limited effect on the rest of the system. If a fault causes one processor to fail, a second processor can be assigned to perform the functions of the disabled one.

2. Improved System performance. System derives high performance from the fact
that computations can proceed in parallel in one of the two ways:
a) Multiple independent jobs can be made to operate in parallel.
b) A single job can be partitioned into multiple parallel tasks.

This can be achieved in two ways:
- The user explicitly declares that the tasks of the program be executed in parallel.
- The compiler is provided with multiprocessor software that can automatically detect parallelism in a program by checking for data dependency.
Coupling of processors

• Multiprocessors are classified by the way their memory is organized.

Tightly Coupled System/shared memory


• A multiprocessor system with common shared memory is classified as a shared-memory or tightly coupled multiprocessor.
• Tightly coupled multiprocessors provide a cache memory with each CPU.
• In addition, there is a global common memory that all CPUs can access.
• Information can therefore be shared among the CPUs by placing it in the common global memory.

Loosely Coupled System/Distributed Memory


• Each processor element in a loosely coupled system has its own private local memory.
• The processors are tied together by a switching scheme designed to route information
from one processor to another through a message-passing scheme.
• Message-passing packets consist of an address, the data content, and some error-detection code.
• Overhead for data exchange is high
Loosely coupled systems are more efficient when the interaction between tasks is
minimal, whereas tightly coupled system can tolerate a higher degree of interaction
between tasks.

Shared (Global) Memory
 A Global Memory Space accessible by all processors
 Processors may also have some local memory
Distributed (Local, Message-Passing) Memory
 All memory units are associated with processors
 To retrieve information from another processor's memory a message must be
sent there


UNIT –V
MULTIPROCESSORS
TOPIC- INTERCONNECTION STRUCTURES

Interconnection Structures

 A multiprocessor system consists of a set of components such as CPUs, memory, and I/O processors that communicate with each other.
 The collection of paths connecting the various components is called the interconnection structure.
 Five ways in which we can interconnect different types of CPUs:

a. Time-Shared Common Bus


b. Multiport Memory
c. Crossbar Switch
d. Multistage Switching Network
e. Hypercube System

a. Time shared common Bus


 All processors (and memory) are connected to a common bus.
 Memory access is fairly uniform, but not very scalable.
 A processor can use the bus only when the bus is free.

Fig. 5.5: Time-shared common bus organization

Fig. 5.6: System bus structure for a multiprocessor

• Each local bus may be connected to a CPU, an IOP, or any combinations of


processors.
• A system bus controller links each local bus to a common system bus.
• The I/O devices connected to the local IOP, as well as the local memory, are
available to the local processor.
• The memory connected to the common system bus is shared by all processors.
• If an IOP is connected directly to the system bus the I/O devices attached to it
may be made available to all processors.

Disadvantages:
 Only one processor can communicate with the memory or another processor at
any given time.
 As a consequence, the total overall transfer rate within the system is limited by
the speed of the single path

b. Multiport Memory:
• A multiport memory system employs separate buses between each memory module and each CPU.
• A processor bus comprises the address, data, and control lines necessary to communicate with memory.
• Each memory module connects to each processor bus.
• The memory module has internal control logic to determine which port will have access to memory at any given time.

Fig. 5.7: Multiport memory
• The memory module can be said to have four ports, and each port accommodates one of the buses.
• Assigning fixed priorities to each memory port resolves memory access conflicts.
• The priority for memory access associated with each processor is established by the physical port position.
• Therefore CPU 1 has priority over CPU 2, CPU 2 has priority over CPU 3, and CPU 4 has the lowest priority.

Advantages:
 The high transfer rate can be achieved because of the multiple paths.

Disadvantages:
 It requires expensive memory control logic and a large number of cables
and connections

c. Crossbar Switch:
 A crossbar switch system consists of a number of crosspoints placed at the intersections between the memory module and processor bus paths.
 At each crosspoint, the small square represents a switch that determines the path from a processor to a memory module.
 Each switch point has control logic to set up the transfer path between a memory module and a processor.
 It examines the address placed on the bus to determine whether its particular module is being addressed.
 In addition, it resolves multiple requests for access to the same memory module on a predetermined priority basis.

Fig. 5.8: a) Crossbar switch b) Block diagram of the crossbar switch

Advantage:
- Supports simultaneous transfers from all memory modules
Disadvantage:
- The hardware required to implement the switch can become quite large and
complex.

d. Multistage Switching Network:


 The basic component of a multistage switching network is a two-input, two-output interchange switch.

Fig. 5.9: Operation of a 2x2 interchange switch


• Using the 2x2 switch as a building block, it is possible to build a multistage network to control the communication between a number of sources and destinations.
• To see how this is done, consider the binary tree shown in Fig.below.
• In a tightly coupled multiprocessor system, the source is a processor and the destination is a memory module.

Fig. 5.10: Binary tree with 2x2 switches

Fig. 5.11: 8x8 Omega switching network

e. Hypercube System:
 The hypercube or binary n-cube multiprocessor structure is a loosely coupled system
composed of N=2n processors interconnected in an n-dimensional binary cube.
 Each processor forms a node of the cube.
 Each node is assigned a binary address in such a manner, that the addresses of two
neighbors differ in exactly one bit position.
 For example, the three neighbors of the node with address 100 are 000, 110, and 101
in a three-cube structure.
 Each of these binary numbers differs from address 100 by one bit value.
 Routing messages through an n-cube structure may take from one to n links from a
source node to a destination node.

Fig. 5.12: Hypercube structures for n = 1, 2, 3

• In a three-cube structure, node 000 may communicate with 011 (from 000 to 010 to
011 or from 000 to 001 to 011).
• It should cross at least three links to communicate from node 000 to node 111.
• A routing procedure is designed by determining the exclusive-OR of the source node
address with the destination node address.
• The resulting binary value will have 1 bits corresponding to the axes on which the two
nodes differ.
• The message is then transmitted along any one of these axes.

Example:
• A message at node 010 going to node 001 produces an exclusive-OR of the two
addresses equal to 011 in a three-cube structure.
• The message can be transmitted along the second axis to node 000 and then through
the third axis to node 001.
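The exclusive-OR routing procedure amounts to flipping one differing address bit per hop. A minimal sketch for the three-cube (choosing the highest differing axis first, so the first example reproduces the route in the text):

```python
def route(src, dst, bits=3):
    """Route through an n-cube by flipping one differing address bit per hop."""
    path, node = [src], src
    while node != dst:
        diff = node ^ dst                        # 1-bits mark the axes that differ
        axis = 1 << (diff.bit_length() - 1)      # cross the highest differing axis
        node ^= axis                             # traverse one link
        path.append(node)
    return [format(n, f"0{bits}b") for n in path]

print(route(0b010, 0b001))   # ['010', '000', '001'] -- the example in the text
print(route(0b000, 0b111))   # three links: ['000', '100', '110', '111']
```

The number of hops equals the number of 1-bits in src XOR dst, which is why node 000 needs at least three links to reach node 111.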



UNIT –V
MULTIPROCESSORS
TOPIC- INTERPROCESSOR ARBITRATION

Interprocessor Arbitration

 The processor, main memory and I/O devices can be interconnected by means of a
common bus.
 A bus provides a communication path for the transfer of data between the various components.
 There is contention for the system bus that connects the CPU, input-output processor, and memory (the main components of a computer).
 Only one of the CPU, input-output processor, and memory units can be granted use of the bus at any given time.
 Hence, an appropriate priority resolving mechanism is required to decide which
processor should get control of the bus.
 Therefore, a mechanism is needed to handle multiple requests for the bus, known as
Interprocessor Arbitration.
 The arbitration procedure services all processor requests based on established
priorities.
 Different types of arbitration: Static arbitration, Dynamic arbitration

Static Arbitration Techniques


 In this technique, the priority assigned is fixed.
 It has two types: Serial arbitration and Parallel arbitration.

Serial Arbitration

 This technique is based on a daisy-chain (serial) connection of the processors and is also known as daisy-chain arbitration.
 In this type of arbitration, processors access the bus based on priority.
 In serial arbitration, bus access priority is resolved by the serial connection of the processors.
 The serial priority resolving technique is obtained from daisy-chain connection similar
to the daisy chain priority interrupt logic.
 The processors connected to the system bus are assigned priority according to their
position along the priority control line.


Figure : Serial (Daisy Chain) Arbitration

 When multiple devices concurrently request the use of the bus, the device with the
highest priority is granted access to it.
 Each processor has its own bus arbiter logic with priority-in and priority-out lines.
 The priority out (PO) of each arbiter is connected to the priority in (PI) of the next-
lower-priority arbiter.
 The PI of the highest-priority unit is maintained at a logic value 1.
 The highest-priority unit in the system will always receive access to the system bus
when it requests it.
 The processor whose arbiter has PI = 1 and PO = 0 is the one that accesses the system bus.

Advantages
 It is a simple design.
 Fewer control lines are used.

Disadvantages
 Priority depends on the physical location of the device.
 Propagation delay arises from granting the bus serially.
 Failure of one of the devices may bring down the entire chain.
 Fairness cannot be assured: a low-priority device may be locked out indefinitely.
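The PI/PO chain can be sketched as a single pass down the priority line: the first requesting arbiter that sees PI = 1 takes the grant and blocks everything below it. The four-arbiter setup is an illustrative assumption:

```python
def daisy_chain_grant(requests):
    """Return the index of the granted arbiter: first requester whose PI is 1."""
    pi = 1                        # PI of the highest-priority arbiter is tied to 1
    for i, req in enumerate(requests):
        if pi and req:
            return i              # this arbiter sees PI=1, drives PO=0, takes the bus
        pi = pi and not req       # PO = PI AND (not request), passed to the next PI
    return None                   # no requests pending

print(daisy_chain_grant([0, 1, 0, 1]))   # 1: the second arbiter outranks the fourth
```

The sketch also makes the fairness problem visible: as long as arbiter 1 keeps requesting, arbiter 3 can never win.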

Parallel Arbitration

 This technique uses an external priority encoder and decoder, as shown in the figure below.
 Each bus arbiter in the parallel scheme has a bus request output line and a bus
acknowledge input line.
 When a processor wants to access the system bus, its arbiter enables the request line.
 The processor takes control of the bus when its acknowledge input line is enabled.
 Figure shows the request lines from four arbiters going into a 4 x 2 priority encoder.
 The output of the encoder generates a 2-bit code, which represents the highest-priority
unit among those requesting the bus.
 The 2-bit code from the encoder output drives a 2×4 decoder which enables the proper
acknowledge line to grant bus access to the highest-priority unit.
 The encoder follows the priority-encoder truth table.

Advantages

 Separate pair of bus request and bus grant signals, so it is faster.

Disadvantages

 Requires more bus request and grant signals.
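The 4x2 encoder plus 2x4 decoder pair reduces to a priority pick over the request lines. A minimal sketch (arbiter 0 is assumed to have the highest priority, following the usual priority-encoder truth-table convention):

```python
def parallel_arbitrate(requests):
    """4x2 priority encoder + 2x4 decoder: one acknowledge line goes active."""
    code = None
    for i, req in enumerate(requests):     # encoder: highest-priority requester wins
        if req:
            code = i                       # the 2-bit code of the winning arbiter
            break
    ack = [0, 0, 0, 0]
    if code is not None:
        ack[code] = 1                      # decoder drives that acknowledge line
    return code, ack

print(parallel_arbitrate([0, 1, 1, 0]))    # (1, [0, 1, 0, 0])
```

Unlike the daisy chain, every arbiter has its own request/grant pair, so the decision is taken in one combinational step rather than propagated serially.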

Dynamic Arbitration

 Serial and Parallel bus arbitration are static since the priorities assigned are fixed.
 In dynamic arbitration, the priorities of the system change while the system is in operation: a dynamic priority algorithm gives the system the capability to change the priority of the devices at run time.
 A few dynamic arbitration procedures that use dynamic priority algorithms are: time slice, polling, LRU, and FIFO.

Time Slice

 This algorithm allocates a fixed-length time slice of bus time to each processor sequentially, in round-robin fashion.
 The service provided to each processor with this scheme is independent of its location along the bus.
 No preference is given to any particular device, since each is allotted the same amount of time to communicate with the bus.

Polling:

 In a bus system that uses polling, the bus-grant signal is replaced by a set of lines
called poll lines, which are connected to all units.
 Poll lines are used by the bus controller to define an address for each device connected
to the bus.
 The bus controller sequences through the addresses in a prescribed manner.
 When a processor recognizes its address, it activates the bus-busy line and then accesses the bus.
 After a number of bus cycles, the polling process continues by choosing a different processor.
 The polling sequence is normally programmable; as a result, the selection priority can be altered under program control.

LRU:

 The LRU (least recently used) algorithm gives the highest priority to the requesting
device that has not used the bus for the longest interval.
 The priorities are adjusted after a number of bus cycles according to the LRU
algorithm.
 With this procedure, no processor is favoured over any other since the priorities are
dynamically changed to give every device an opportunity to access the bus.

FIFO:

 In the first-come, first-serve scheme, requests are served in the order received.
 To implement this algorithm, the bus controller establishes a queue arranged
according to the time that the bus requests arrive.
 Each processor must wait for its turn to use the bus on a first-in, first-out (FIFO) basis.

Advantages

 The priority can be changed by altering the sequence stored in the controller.


 More reliable.


UNIT –V
MULTIPROCESSORS
TOPIC- INTERPROCESSOR COMMUNICATION & SYNCHRONIZATION

Interprocessor Communication & Synchronization


 The various processors in a multiprocessor system must be provided with a facility for communicating with each other.
 A communication path can be established through a portion of memory or a common input-output channel.
 The sending processor structures a request, a message, or a procedure and places it in the memory mailbox.
 Status bits residing in common memory indicate that a message is waiting.
 The receiving processor can check the mailbox periodically.
 The response time of this procedure can be time-consuming.
 A more efficient procedure is for the sending processor to alert the receiving
processor directly by means of an interrupt signal.
 In addition to shared memory, a multiprocessor system may have other shared
resources.
 e.g., a magnetic disk storage unit.
 To prevent conflicting use of shared resources by several processors, there must be a
provision for assigning resources to processors; this is the task of the operating system.
 There are three organizations that have been used in the design of operating system for
multiprocessors: master-slave configuration, separate operating system, and distributed
operating system.
 In a master-slave mode, one processor, designated the master, always executes the
operating system functions.
 In the separate operating system organization, each processor can execute the
operating system routines it needs. This organization is more suitable for loosely
coupled systems.
 In the distributed operating system organization, the operating system routines are
distributed among the available processors.
 However, each particular operating system function is assigned to only one processor
at a time. It is also referred to as a floating operating system.
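As referenced above, a minimal sketch of the memory mailbox with its status bit, in C (the structure and function names are hypothetical; a real multiprocessor would also need atomic operations or memory barriers):

```c
#include <stdio.h>

/* A mailbox residing in common memory. The status bit tells the
 * receiving processor whether the mailbox holds meaningful data. */
typedef struct {
    volatile int full;      /* status bit: 1 = message waiting      */
    int message;            /* request/message placed by the sender */
} Mailbox;

static Mailbox mbox = {0, 0};

/* Sending processor: structure the request and set the status bit. */
void send(int msg) {
    mbox.message = msg;
    mbox.full = 1;
}

/* Receiving processor: periodic check of the mailbox. */
int poll_receive(int *msg) {
    if (!mbox.full)
        return 0;           /* nothing yet: check again later */
    *msg = mbox.message;
    mbox.full = 0;          /* mark the mailbox empty         */
    return 1;
}

int main(void) {
    send(42);
    int m;
    if (poll_receive(&m))
        printf("received %d\n", m);
    return 0;
}
```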

Loosely Coupled System


 There is no shared memory for passing information.
 The communication between processors is by means of message passing through I/O
channels.
 The communication is initiated by one processor calling a procedure that resides in the
memory of the processor with which it wishes to communicate.
 The communication efficiency of the interprocessor network depends on the
communication routing protocol, processor speed, data link speed, and the topology of
the network.

Interprocess Synchronization
 The instruction set of a multiprocessor contains basic instructions that are used to
implement communication and synchronization between cooperating processes.
 Communication refers to the exchange of data between different processes.
 Synchronization refers to the special case where the data used to communicate
between processors is control information.
 Synchronization is needed to enforce the correct sequence of processes and to ensure
mutually exclusive access to shared writable data.
 Multiprocessor systems usually include various mechanisms to deal with the
synchronization of resources.
 Low-level primitives are implemented directly by the hardware.
 These primitives are the basic mechanisms that enforce mutual exclusion for more
complex mechanisms implemented in software.
 A number of hardware mechanisms for mutual exclusion have been developed.
o A binary semaphore is one of the most common, as described below.

Mutual Exclusion with Semaphore


 A properly functioning multiprocessor system must provide a mechanism that will
guarantee orderly access to shared memory and other shared resources.
 Mutual exclusion: this is necessary to protect data from being changed
simultaneously by two or more processors.
 Critical section: a program sequence that, once begun, must complete execution before
another processor accesses the same shared resource.
 A binary variable called a semaphore is often used to indicate whether or not a
processor is executing a critical section.
 Testing and setting the semaphore is itself a critical operation and must be performed
as a single indivisible operation.
 A semaphore can be initialized by means of a test and set instruction in conjunction
with a hardware lock mechanism.
 The instruction TSL SEM is executed in two memory cycles (the first to read and
the second to write) as follows: R ← M[SEM] (test), M[SEM] ← 1 (set).
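A sketch of the idea in C11, using atomic_flag_test_and_set as a software stand-in for the indivisible read-then-write of TSL (the mapping is an analogy, not the hardware instruction itself):

```c
#include <stdatomic.h>
#include <stdio.h>

/* Binary semaphore: 0 = free, 1 = busy. atomic_flag_test_and_set
 * performs the test (read the old value) and set (write 1) as one
 * indivisible operation, like the hardware-locked TSL SEM. */
static atomic_flag sem = ATOMIC_FLAG_INIT;

void lock(void) {
    while (atomic_flag_test_and_set(&sem))
        ;   /* old value was 1: another processor is in its critical section */
}

void unlock(void) {
    atomic_flag_clear(&sem);   /* M[SEM] <- 0: release the semaphore */
}

int main(void) {
    lock();
    puts("in critical section: exclusive access to the shared resource");
    unlock();
    return 0;
}
```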

UNIT –V
MULTIPROCESSORS
TOPIC- CACHE COHERENCE
Cache Coherence
 The primary advantage of cache is its ability to reduce the average access time in
uniprocessors.
 When the processor finds a word in cache during a read operation, the main memory is
not involved in the transfer.
 If the operation is to write, there are two commonly used procedures to update
memory.
 In the write-through policy, both cache and main memory are updated with every
write operation.
 In the write-back policy, only the cache is updated and the location is marked so that it
can be copied later into main memory.
 In a shared memory multiprocessor system, all the processors share a common
memory.
 In addition, each processor may have a local memory, part or all of which may be a
cache.
 The compelling reason for having separate caches for each processor is to reduce the
average access time in each processor.
 The same information may reside in a number of copies in some caches and main
memory.
 To ensure the ability of the system to execute memory operations correctly, the
multiple copies must be kept identical.
 This requirement imposes a cache coherence problem.
 To illustrate the problem, consider the three-processor configuration with private
caches shown in Fig. 13-13.

 Sometime during the operation an element X from main memory is loaded into the
three processors, P1, P2, and P3.
 As a consequence, it is also copied into the private caches of the three processors.
 We assume that X contains the value of 52.
 The load on X to the three processors results in consistent copies in the caches and
main memory.
 If one of the processors performs a store to X, the copies of X in the caches become
inconsistent.
 A load by the other processors will not return the latest value.
 Depending on the memory update policy used in the cache, the main memory may
also be inconsistent with respect to the cache shown in Fig. 13-13.

 A store to X (of the value of 120) into the cache of processor P1 updates memory to
the new value in a write-through policy.
 A write-through policy maintains consistency between memory and the originating
cache, but the other two caches are inconsistent since they still hold the old value.
 In a write-back policy, main memory is not updated at the time of the store.
 The copies in the other two caches and main memory are inconsistent.
 Memory is updated eventually when the modified data in the cache are copied back
into memory.

Solutions to the Cache Coherence Problem

 Various schemes have been proposed to solve the cache coherence problem in shared
memory multiprocessors.
 A simple scheme is to disallow private caches for each processor and have a shared
cache memory associated with main memory.
 Every data access is made to the shared cache.
 This method violates the principle of closeness of CPU to cache and increases the
average memory access time.
 In effect, this scheme solves the problem by avoiding it.
 For performance considerations it is desirable to attach a private cache to each
processor.
 One scheme that has been used allows only nonshared and read-only data to be stored
in caches.
 Such items are called cachable.
 Shared writable data are noncachable.
 The compiler must tag data as either cachable or noncachable, and the system
hardware makes sure that only cachable data are stored in caches.
 The noncachable data remain in main memory.
 This method restricts the type of data stored in caches and introduces an extra software
overhead that may degrade performance.
 A scheme that allows writable data to exist in at least one cache is a method that
employs a centralized global table in its compiler.
 The status of memory blocks is stored in the central global table.
 Each block is identified as read-only (RO) or read and write (RW). All caches can
have copies of blocks identified as RO.
 Only one cache can have a copy of an RW block.
 Thus if the data are updated in the cache with an RW block, the other caches are not
affected because they do not have a copy of this block.
 The cache coherence problem can be solved by means of a combination of software
and hardware or by means of hardware-only schemes.
 The two methods mentioned previously use software-based procedures that require the
ability to tag information in order to disable caching of shared writable data.
 Hardware-only solutions are handled by the hardware automatically and have the
advantage of higher speed and program transparency.
 In the hardware solution, the cache controller is specially designed to allow it to
monitor all bus requests from CPUs and IOPs.
 All caches attached to the bus constantly monitor the network for possible write
operations.
 Depending on the method used, they must then either update or invalidate their own
cache copies when a match is detected.
 The bus controller that monitors this action is referred to as a snoopy cache controller.
 This is basically a hardware unit designed to maintain a bus-watching mechanism over
all the caches attached to the bus.
 Various schemes have been proposed to solve the cache coherence problem by means
of snoopy cache protocol.
 The simplest method is to adopt a write-through policy and use the following
procedure.
 All the snoopy controllers watch the bus for memory store operations. When a word in
a cache is updated by writing into it, the corresponding location in main memory is also
updated.
 The local snoopy controllers in all other caches check their memory to determine if
they have a copy of the word that has been overwritten.
 If a copy exists in a remote cache, that location is marked invalid.
 Because all caches snoop on all bus writes, whenever a word is written the net effect
is to update it in the original cache and main memory and to remove it from all other caches.
 If at some future time a processor accesses the invalid item from its cache, the
response is equivalent to a cache miss, and the updated item is transferred from main
memory.
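Putting this together, a minimal simulation in C of write-through with snoopy invalidation, replaying the X = 52 → 120 example from the text (single-word caches and the function names are illustrative simplifications):

```c
#include <stdio.h>

#define NPROC 3

typedef struct { int value; int valid; } CacheLine;

static int memory = 52;                                 /* main memory copy of X */
static CacheLine cache[NPROC] = {{52,1},{52,1},{52,1}}; /* consistent start      */

/* A store under write-through: update the originating cache and main
 * memory; every other snoopy controller sees the bus write and
 * invalidates its own copy. */
void bus_write(int writer, int value) {
    cache[writer].value = value;
    memory = value;                           /* write-through to memory  */
    for (int p = 0; p < NPROC; p++)
        if (p != writer)
            cache[p].valid = 0;               /* invalidate remote copies */
}

/* A load: an invalidated entry behaves like a cache miss, so the
 * updated word is fetched from main memory. */
int cache_read(int p) {
    if (!cache[p].valid) {
        cache[p].value = memory;
        cache[p].valid = 1;
    }
    return cache[p].value;
}

int main(void) {
    bus_write(0, 120);                          /* P1 stores 120 into X */
    printf("P2 reads X = %d\n", cache_read(1)); /* miss, then 120       */
    return 0;
}
```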