Dynamic Scheduling Computer Architecture

Chapter 3 of 'Computer Architecture: A Quantitative Approach' discusses instruction-level parallelism (ILP) and its exploitation through pipelining, which overlaps instruction execution to improve performance. It highlights the challenges of data dependencies and the importance of optimizing instruction execution across branches. The chapter also details the structure of RISC instruction sets and the execution cycle of instructions in a pipelined architecture.

Computer Architecture

A Quantitative Approach, Fifth Edition

Chapter 3
Instruction-Level Parallelism
and Its Exploitation

Basic Pipelining & Performance 1


Introduction
 Pipelining became a universal technique in 1985
 Overlaps execution of instructions
 Exploits “Instruction Level Parallelism”

The potential overlap among instructions

 Beyond this, there are two main approaches:
 Hardware-based dynamic approaches (out-of-order execution)
 Used in server and desktop processors
 Not used as extensively in PMD processors
 Compiler-based static approaches
 Not as successful outside of scientific applications

Basic Pipelining & Performance 2


Introduction
Instruction-Level Parallelism
 When exploiting instruction-level parallelism, the
goal is to maximize IPC (or minimize CPI)
 Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

 Parallelism within a basic block is limited
 A basic block is a code sequence with no branch in
except to the entry and no branch out except at the exit
 Typical size of basic block = 3-6 instructions
 Must optimize across branches
Basic Pipelining & Performance 3
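The CPI decomposition above can be sketched numerically; the stall contributions below are made-up illustrative values, not measurements from the slides:

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal pipeline CPI + structural stalls
    + data hazard stalls + control stalls (all per instruction)."""
    return ideal_cpi + structural + data_hazard + control

# Hypothetical per-instruction stall averages:
cpi = pipeline_cpi(1.0, 0.0, 0.5, 0.4)   # = 1.9 cycles per instruction
ipc = 1.0 / cpi                          # maximizing IPC = minimizing CPI
print(round(cpi, 2), round(ipc, 2))
```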
Introduction
Data Dependence
 Loop-Level Parallelism
 Unroll loop statically or dynamically
 Use SIMD (vector processors and GPUs)

 Challenges:
 Data dependence
 Instruction j is data dependent on instruction i if:
 instruction i produces a result that may be used by instruction j, or
 instruction j is data dependent on instruction k, and instruction k
is data dependent on instruction i (dependence is transitive)
 Dependent instructions cannot be executed
simultaneously
Basic Pipelining & Performance 4


Introduction
Data Dependence
 Dependencies are a property of programs
 Pipeline organization determines if dependence
is detected and if it causes a stall

 Data dependence conveys:


 Possibility of a hazard
 Order in which results must be calculated
 Upper bound on exploitable instruction level
parallelism

 Dependencies that flow through memory


locations are difficult to detect
Basic Pipelining & Performance 5
Introduction
Name Dependence
 Two instructions use the same name but no flow
of information
 Not a true data dependence, but is a problem when
reordering instructions
 Antidependence: instruction j writes a register or
memory location that instruction i reads

Initial ordering (i before j) must be preserved

 Output dependence: instruction i and instruction j


write the same register or memory location

Ordering must be preserved

 To resolve, use renaming techniques

Basic Pipelining & Performance 6


Pipelining
 Multiple instructions are overlapped in execution
 Takes advantage of parallelism
 Increases instruction throughput, but does not
reduce the time to complete an instruction
 Ideal case:
 The time per instruction: Time per instruction on
unpipelined machine / number of pipe stages
 Speedup: the number of pipe stages
 Intel Pentium 4: 20 stages; Cedar Mill: 31 stages
 Intel Core i7: 14 stages. Why?

Basic Pipelining & Performance 7
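The ideal case above can be sketched with illustrative numbers (the 10 ns figure is an assumption, not from the slides):

```python
def ideal_pipeline(time_per_instr_unpipelined, stages):
    """Ideal pipelining: time per instruction drops by the stage count,
    so the speedup equals the number of pipe stages."""
    pipelined = time_per_instr_unpipelined / stages
    speedup = time_per_instr_unpipelined / pipelined
    return pipelined, speedup

# A hypothetical 10 ns unpipelined instruction on a 5-stage pipeline:
print(ideal_pipeline(10.0, 5))   # (2.0, 5.0)
```

Note that throughput improves while the latency of any single instruction does not shrink (it usually grows slightly with latch overhead).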


The Basics of a RISC Instruction Set
 Key Properties:
 All operations on data apply to data in registers
 The only operations that affect memory are load and
store operations
 The instruction formats are few in number, with all
instructions typically being one size.

Basic Pipelining & Performance 8


Basic Pipelining: A "Typical" RISC ISA
 32-bit fixed format instructions (3 formats)
 ALU instructions, Load/Store, Branches and
Jumps
 32 32-bit GPRs (R0 contains zero; double-precision
values take a register pair)
 3-address, reg-reg arithmetic instructions
 Single addressing mode for load/store:
base + displacement
 see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC,
CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
 no indirection
 Simple branch conditions, jump
Basic Pipelining & Performance 9
Basic Pipelining: e.g., MIPS
Register-Register:
Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]

Register-Immediate:
Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]

Branch:
Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]

Jump / Call:
Op [31:26] | target [25:0]

Basic Pipelining & Performance 10


A simple implementation of a RISC Instruction Set
 Every instruction can be implemented in at most
5 cycles.
 Instruction fetch cycle (IF): send the PC to memory and
fetch the current instruction; update the PC to the
next PC by adding 4.
 Instruction decode/register fetch cycle (ID): decode
the instruction and read the registers. For a branch,
do the equality test on the registers, and compute the
target address.
 Execution/effective address cycle (EX): the ALU
operates on the operands prepared in the prior cycle,
performing:
 Memory reference: the ALU adds the base register and the
offset. LW 100(R1) => 100 + R1
 Register-Register ALU instruction: the ALU performs the
operation specified by the opcode on the register operands.
Basic Pipelining & Performance 11
A simple implementation of a RISC Instruction Set
 Every instruction can be implemented in at most
5 cycles.
 Memory access (MEM): read the memory using the
effective address computed in the previous cycle if the
instruction is a load; if it is a store, write the data to
the memory.
 Write-back cycle (WB): for a Register-Register ALU
instruction or load instruction, write the result into the
register file, whether it comes from the memory (for a
load) or from the ALU (for an ALU instruction).
 Branch instructions require 2 cycles, store
instructions require 4 cycles, others require 5
cycles.
Basic Pipelining & Performance 12
An example of an instruction execution
 ADD R3, R1, R2 => R3 = R1 + R2;
[Datapath diagram slides 13-27, stepping the ADD through the stages:]
 IF: fetch the instruction from memory into the IR; move the PC to the next instruction
 ID: decode the instruction (ADD, sources R1 and R2, destination R3) and load the operand values x (from R1) and y (from R2) out of the reg file
 EX: the ALU executes the operation, producing x+y
 WB: write the result x+y to register R3 in the reg file
Basic Pipelining & Performance 13-27


Another example: Load
 LW R3, R31, 200 => R3 = mem[R31 + 200];
[Datapath diagram slides 28-32:]
 IF/ID: fetch and decode the load; read the base register R31 (value n) and the offset 200
 EX: the ALU executes the address calculation, producing n + 200
 MEM: read the memory at effective address n + 200 (returning the value 123456)
 WB: write the loaded value to register R3
Basic Pipelining & Performance 28-32


5-stage pipeline (resource used per stage, and which instruction classes use it)
 IF: memory (alu, load/store, branch)
 ID: reg file (alu, load/store, branch)
 EX: ALU (alu, load/store)
 MEM: memory (load/store)
 WB: reg file (alu, load)

 Can we change the order of MEM and WB?
Why?

Basic Pipelining & Performance 33


5 Steps of MIPS Datapath (Figure A.2, Page A-8)
[Datapath diagram: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back]
IR <= mem[PC];
PC <= PC + 4
Reg[IRrd] <= Reg[IRrs1] opIRop Reg[IRrs2]
Basic Pipelining & Performance 36


Classic RISC Pipeline

Source: https://en.wikipedia.org/wiki/Classic_RISC_pipeline

Basic Pipelining & Performance 37



Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows the overlap among the parts of
the data path, with clock cycle 5 (CC 5) showing the steady-state situation. Because the register file is used as a source in the ID
stage and as a destination in the WB stage, it appears twice. We show that it is read in one part of the stage and written in another by
using a solid line, on the right or left, respectively, and a dashed line on the other side. The abbreviation IM is used for instruction
memory, DM for data memory, and CC for clock cycle.
Basic Pipelining & Performance 38
5 Steps of MIPS Datapath (Figure A.3, Page A-9)
[Pipelined datapath diagram: the same five stages separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
IR <= mem[PC];
PC <= PC + 4
A <= Reg[IRrs1]; B <= Reg[IRrs2]
rslt <= A opIRop B
WB <= rslt
Reg[IRrd] <= WB
Basic Pipelining & Performance 40
Visualizing Pipelining (Figure A.2, Page A-8)
[Pipeline timing diagram: four instructions in program order, each flowing through Ifetch, Reg, ALU, DMem, Reg across clock cycles 1-7]
Basic Pipelining & Performance 41


[Pipeline timing diagrams (slides 42-43): six instructions in program order overlapped across clock cycles 1-9, one entering the pipeline per cycle]
Basic Pipelining & Performance 42-43


Pipelining is not quite that easy!
 Limits to pipelining: hazards prevent the next
instruction from executing during its
designated clock cycle
 Structural hazards: HW cannot support this
combination of instructions
 Data hazards: instruction depends on result of
prior instruction still in the pipeline
 Control hazards: caused by delay between the
fetching of instructions and decisions about
changes in control flow (branches and jumps)
Basic Pipelining & Performance 44
One Memory Port / Structural Hazards (Figure A.4, Page A-14)
[Pipeline timing diagram: a Load followed by Instr 1-4; with a single memory port, the Load's DMem access in cycle 4 conflicts with Instr 3's Ifetch in the same cycle]
Basic Pipelining & Performance 45


One Memory Port / Structural Hazards (Similar to Figure A.5, Page A-15)
[Pipeline timing diagram: the same sequence with the conflict resolved by stalling; Instr 3 waits one cycle (a bubble propagates through the pipeline) before its Ifetch]
Basic Pipelining & Performance 46


Data Hazard on R1 (Figure A.6, Page A-17)
[Pipeline timing diagram for the sequence below: R1 is written by the add in WB, but the following instructions read it in ID before the write completes]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Basic Pipelining & Performance 47


Three Generic Data Hazards
 Read After Write (RAW)
InstrJ tries to read operand before InstrI
writes it
I: add r1,r2,r3
J: sub r4,r1,r3
 Caused by a "Dependence" (in compiler
nomenclature). This hazard results from an
actual need for communication.
Basic Pipelining & Performance 48
Three Generic Data Hazards
 Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
 Called an "anti-dependence" by compiler
writers.
This results from reuse of the name "r1".
 Can it happen in MIPS 5 stage pipeline?
 All instructions take 5 stages, and
 Reads are always in stage 2, and
 Writes are always in stage 5
Basic Pipelining & Performance 49

Three Generic Data Hazards


 Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

 Called an "output dependence" by compiler
writers.
This also results from the reuse of name "r1".
 Can't happen in MIPS 5 stage pipeline
Basic Pipelining & Performance 50
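The three hazard classes above can be checked mechanically; the (destination, sources) tuple encoding of an instruction here is an assumption for illustration:

```python
def classify_hazards(instr_i, instr_j):
    """Return the data hazards between instr_i (earlier in program order)
    and instr_j (later). Each instruction is a (dest, sources) pair."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = []
    if dest_i in srcs_j:
        hazards.append("RAW")   # j reads what i writes: true dependence
    if dest_j in srcs_i:
        hazards.append("WAR")   # j writes what i reads: anti-dependence
    if dest_i == dest_j:
        hazards.append("WAW")   # both write the same name: output dependence
    return hazards

# I: add r1,r2,r3 ; J: sub r4,r1,r3  ->  RAW on r1
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))
```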
Forwarding to Avoid Data Hazard (Figure A.7, Page A-19)
[Pipeline timing diagram: the same add/sub/and/or/xor sequence, with the add's ALU result forwarded directly to the ALU inputs of the following instructions, avoiding the stalls]
Basic Pipelining & Performance 51


HW Change for Forwarding (Figure A.23, Page A-37)
[Datapath diagram: forwarding paths from the EX/MEM and MEM/WR pipeline registers back through muxes to the ALU inputs]
Basic Pipelining & Performance 52


Forwarding to Avoid LW-SW Data Hazard (Figure A.8, Page A-20)
[Pipeline timing diagram for the sequence below: the loaded value is forwarded from the lw's DMem output to the sw's DMem input]
add r1,r2,r3
lw r4, 0(r1)
sw r4,12(r1)
or r8,r6,r9
xor r10,r9,r11
Any hazard that cannot be avoided with forwarding?
Basic Pipelining & Performance 53
Data Hazard Even with Forwarding
[Pipeline timing diagram for the sequence below: the lw's data is not available until the end of MEM, but the sub needs it at the start of EX one cycle earlier, so forwarding alone cannot help]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Basic Pipelining & Performance 54


Data Hazard Even with Forwarding (Similar to Figure A.10, Page A-21)
[Pipeline timing diagram: the same sequence with the load interlock inserting one bubble; the sub stalls one cycle, then receives the loaded value by forwarding, delaying all later instructions]
Basic Pipelining & Performance 55


Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd

Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd

Compiler optimizes for performance. Hardware checks for safety.
Basic Pipelining & Performance 56
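The slow/fast contrast can be verified with a small stall counter. The three-field instruction encoding and the single-cycle load-use penalty are assumptions matching the 5-stage pipeline with forwarding:

```python
def load_use_stalls(code):
    """Count the 1-cycle stalls incurred when an instruction uses the
    result of the immediately preceding load (forwarding handles all
    other cases in the 5-stage pipeline). Each instruction is
    (opcode, destination, source registers)."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        op_p, dst_p, _ = prev
        _, _, srcs_c = cur
        if op_p == "LW" and dst_p in srcs_c:
            stalls += 1
    return stalls

slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
print(load_use_stalls(slow), load_use_stalls(fast))   # 2 0
```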
Introduction
Other Factors
 Data Hazards
 Read after write (RAW)
 Write after write (WAW)
 Write after read (WAR)
 Control Dependence
 Ordering of instruction i with respect to a branch
instruction
 Example:
if (p1) {s1};
p3=0;
if (p2) {s2};
 An instruction control dependent on a branch cannot be moved
before the branch so that its execution is no longer controlled
by the branch (e.g., cannot move s1 before if (p1))
 An instruction not control dependent on a branch cannot be
moved after the branch so that its execution is controlled by
the branch (e.g., cannot move p3=0 into {s2})

Basic Pipelining & Performance 57


Introduction
Examples
• Example 1: the OR instruction is dependent on the DADDU and DSUBU
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R1,R6
L: …
OR R7,R1,R8

• Example 2: assume R4 isn't used after skip; possible to move DSUBU before the branch
DADDU R1,R2,R3
BEQZ R12,skip
DSUBU R4,R5,R6
DADDU R5,R4,R9
skip:
OR R7,R8,R9

Basic Pipelining & Performance 58


Control Hazard:
[Pipelined datapath diagram, as before]
• Branch condition determined @ end of EX stage
• Target instruction address ready @ end of MEM stage
Basic Pipelining & Performance 59
Control Hazard on Branches: Three Stage Stall
[Pipeline timing diagram for the sequence below: the three instructions after the beq enter the pipeline before the branch outcome is known]
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

What do you do with the 3 instructions in between?
How do you do it?
Where is the "commit"?
Basic Pipelining & Performance 60
Branch Stall Impact
 If CPI = 1, 30% branch,
stall 3 cycles => new CPI = 1.9!
 Two part solution:
 Determine branch outcome sooner, AND
 Compute taken branch (target) address earlier
 MIPS branch tests if register = 0 or ≠ 0
 MIPS Solution:
 Move Zero test to ID/RF stage
 Add an adder to calculate new PC in ID/RF stage
 1 clock cycle penalty for branch versus 3
Basic Pipelining & Performance 61
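The 1.9 figure above follows from a one-line model (the 30% branch frequency is the slide's example number):

```python
def cpi_with_branches(base_cpi, branch_frac, branch_penalty):
    """Effective CPI when every branch pays a fixed stall penalty."""
    return base_cpi + branch_frac * branch_penalty

print(round(cpi_with_branches(1.0, 0.30, 3), 2))   # 1.9 with the 3-cycle stall
print(round(cpi_with_branches(1.0, 0.30, 1), 2))   # 1.3 after the MIPS fix
```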
Two-part solution to the long stalls:
[Pipelined datapath diagram with an extra adder and the Zero? test moved into the ID stage]
1. Determine the condition earlier @ ID
2. Calculate the target instruction address earlier @ ID
Basic Pipelining & Performance 62

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
 Execute successor instructions in sequence
 "Squash" instructions in pipeline if branch actually
taken
 Advantage of late pipeline state update
 47% MIPS branches not taken on average
 PC+4 already calculated, so use it to get next
instruction
#3: Predict Branch Taken
 53% MIPS branches taken on average
 But haven't calculated branch target address in MIPS
 MIPS still incurs 1 cycle branch penalty
 Other machines: branch target known before outcome
Basic Pipelining & Performance 63
Four Branch Hazard Alternatives
#4: Delayed Branch
 Define branch to take place AFTER a
following instruction

branch instruction
sequential successor1
sequential successor2
........
sequential successorn
branch target if taken
(the n sequential successors form a branch delay of length n)

 1 slot delay allows proper decision and
branch target address in 5 stage pipeline
Basic Pipelining & Performance 64
Scheduling Branch Delay Slots (Fig A.14)

A. From before branch:
add $1,$2,$3
if $2=0 then
   delay slot
becomes:
if $2=0 then
   add $1,$2,$3

B. From branch target:
sub $4,$5,$6
…
add $1,$2,$3
if $1=0 then
   delay slot
becomes:
add $1,$2,$3
if $1=0 then
   sub $4,$5,$6

C. From fall through:
add $1,$2,$3
if $1=0 then
   delay slot
sub $4,$5,$6
OR $7,$8,$9
becomes:
add $1,$2,$3
if $1=0 then
   sub $4,$5,$6
OR $7,$8,$9

 A is the best choice, fills delay slot
 In B, the sub instruction may need to be copied, increasing IC
 In B and C, must be okay to execute sub when branch fails
Basic Pipelining & Performance 65
Delayed Branch
 Compiler effectiveness for single branch
delay slot:
 Fills about 60% of branch delay slots
 About 80% of instructions executed in branch
delay slots useful in computation
 About 50% (60% x 80%) of slots usefully filled
 Delayed Branch downside: as processors go
to deeper pipelines and multiple issue, the
branch delay grows and needs more than one
delay slot
 Delayed branching has lost popularity compared
to more expensive but more flexible dynamic
approaches
Basic Pipelining & Performance 66
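The "about 50%" figure is just the product of the two rates (the slides round 48% up):

```python
fill_rate = 0.60     # branch delay slots the compiler manages to fill
useful_rate = 0.80   # filled slots whose instruction does useful work
usefully_filled = fill_rate * useful_rate
print(round(usefully_filled, 2))   # 0.48, i.e. about 50% of slots
```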
Compiler Techniques
Compiler Techniques for Exposing ILP
 Pipeline scheduling
 Separate dependent instruction from the
source instruction by the pipeline latency of
the source instruction
 Example:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;

Basic Pipelining & Performance 67


FP Loop: Where are the Hazards?
• First translate into MIPS code:
- To simplify, assume 8 is lowest address, R1 =
base address of X and F2 = s

Loop: L.D F0,0(R1) ;F0=vector element
ADD.D F4,F0,F2 ;add scalar from F2
S.D 0(R1),F4 ;store result
DADDUI R1,R1,#-8 ;decrement pointer 8B (DW)
BNEZ R1,Loop ;branch R1!=zero

08/25/2025 Basic Pipelining & Performance 68


FP Loop Showing Stalls
1 Loop: L.D F0,0(R1) ;F0=vector element
2 stall
3 ADD.D F4,F0,F2 ;add scalar in F2
4 stall
5 stall
6 S.D 0(R1),F4 ;store result
7 DADDUI R1,R1,-8 ;decrement pointer 8B (DW)
8 stall ;assumes can’t forward to branch
9 BNEZ R1,Loop ;branch R1!=zero

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op | Another FP ALU op | 3
FP ALU op | Store double | 2
Load double | FP ALU op | 1

• 9 clock cycles: Rewrite code to minimize stalls?


08/25/2025 Basic Pipelining & Performance 69
Revised FP Loop Minimizing Stalls
1 Loop: L.D F0,0(R1)
2 DADDUI R1,R1,-8
3 ADD.D F4,F0,F2
4 stall
5 stall
6 S.D 8(R1),F4 ;altered offset when move DADDUI
7 BNEZ R1,Loop

Swap DADDUI and S.D by changing address of S.D


Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op | Another FP ALU op | 3
FP ALU op | Store double | 2
Load double | FP ALU op | 1

7 clock cycles, but just 3 for execution (L.D, ADD.D,S.D), 4 for loop
overhead; How to make faster?
08/25/2025 Basic Pipelining & Performance 70
Unroll Loop Four Times (straightforward way)

1 Loop: L.D F0,0(R1)
(1 cycle stall)
3 ADD.D F4,F0,F2
(2 cycles stall)
6 S.D 0(R1),F4 ;drop DSUBUI & BNEZ
7 L.D F6,-8(R1)
9 ADD.D F8,F6,F2
12 S.D -8(R1),F8 ;drop DSUBUI & BNEZ
13 L.D F10,-16(R1)
15 ADD.D F12,F10,F2
18 S.D -16(R1),F12 ;drop DSUBUI & BNEZ
19 L.D F14,-24(R1)
21 ADD.D F16,F14,F2
24 S.D -24(R1),F16
25 DADDUI R1,R1,#-32 ;alter to 4*8
26 BNEZ R1,LOOP

27 clock cycles, or 6.75 per iteration
(Assumes R1 is multiple of 4)
Rewrite loop to minimize stalls?
08/25/2025 Basic Pipelining & Performance 71
Unrolled Loop That Minimizes Stalls
1 Loop: L.D F0,0(R1)
2 L.D F6,-8(R1)
3 L.D F10,-16(R1)
4 L.D F14,-24(R1)
5 ADD.D F4,F0,F2
6 ADD.D F8,F6,F2
7 ADD.D F12,F10,F2
8 ADD.D F16,F14,F2
9 S.D 0(R1),F4
10 S.D -8(R1),F8
11 S.D -16(R1),F12
12 DSUBUI R1,R1,#32
13 S.D 8(R1),F16 ; 8-32 = -24
14 BNEZ R1,LOOP

14 clock cycles, or 3.5 per iteration

08/25/2025 Basic Pipelining & Performance 72


Unrolled Loop Detail
• Do not usually know upper bound of loop
• Suppose it is n, and we would like to unroll the
loop to make k copies of the body
• Instead of a single unrolled loop, we generate a
pair of consecutive loops:
– 1st executes (n mod k) times and has a body that is the
original loop
– 2nd is the unrolled body surrounded by an outer loop that
iterates (n/k) times
• For large values of n, most of the execution time
will be spent in the unrolled loop

08/25/2025 Basic Pipelining & Performance 73
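The two-loop split described above can be sketched as:

```python
def split_unrolled(n, k):
    """Split n iterations into a cleanup loop of (n mod k) original
    bodies followed by n // k trips through the k-times-unrolled body."""
    cleanup = n % k            # 1st loop: original body, (n mod k) times
    unrolled_trips = n // k    # 2nd loop: unrolled body, n // k times
    assert cleanup + unrolled_trips * k == n   # all n iterations covered
    return cleanup, unrolled_trips

print(split_unrolled(1000, 4))   # (0, 250): no cleanup needed
print(split_unrolled(1003, 4))   # (3, 250): 3 original-body iterations first
```

For large n, almost all time goes to the second (unrolled) loop, which is why the cleanup loop's inefficiency does not matter.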


5 Loop Unrolling Decisions
• Requires understanding how one instruction depends on
another and how the instructions can be changed or
reordered given the dependences:
1. Determine loop unrolling useful by finding that loop
iterations were independent (except for maintenance code)
2. Use different registers to avoid unnecessary constraints
forced by using same registers for different computations
3. Eliminate the extra test and branch instructions and adjust
the loop termination and iteration code
4. Determine that loads and stores in unrolled loop can be
interchanged by observing that loads and stores from
different iterations are independent
• Transformation requires analyzing memory addresses
and finding that they do not refer to the same address
5. Schedule the code, preserving any dependences needed to
yield the same result as the original code

08/25/2025 Basic Pipelining & Performance 74


3 Limits to Loop Unrolling
1. Decrease in amount of overhead amortized with
each extra unrolling

2. Growth in code size


• For larger loops, it increases the instruction cache miss rate
3. Register pressure: potential shortfall in
registers created by aggressive unrolling and
scheduling
• If not possible to allocate all live values to registers, may lose
some or all of its advantage
• Loop unrolling reduces impact of branches on
pipeline; another way is branch prediction

08/25/2025 Basic Pipelining & Performance 75


In-class Exercise
• Identify data hazards in the code below:
– MULTD F3, F4, F2
– ADDD F2, F6, F1
– SD F2, 0(F3)
• For each of the following code fragments,
identify each type of dependence that a compiler
will find (a fragment may have no dependences)
and whether a compiler could schedule the two
instructions (i.e., change their orders).
1. DADDI R1, R1, #4
   LD R2, 7(R1)

2. DADD R3, R1, R2
   SD R2, 7(R1)

3. SD R2, 7(R1)
   SD F2, 200(R7)

4. BEZ R1, place
   SD R1, 7(R1)

08/25/2025 Basic Pipelining & Performance 76


Static Branch Prediction
• Previous lectures showed scheduling code around
delayed branch
• To reorder code around branches, need to predict
branch statically when compiling
• Simplest scheme is to predict a branch as taken
- Average misprediction = untaken branch frequency = 34% for SPEC
• More accurate scheme predicts branches using profile
information collected from earlier runs, and modifies the
prediction based on the last run
[Bar chart: misprediction rates of profile-based prediction across SPEC benchmarks, roughly 4% to 25%, generally lower for floating point than for integer]
08/25/2025 Basic Pipelining & Performance 77
Dynamic Branch Prediction
• Why does prediction work?
– Underlying algorithm has regularities
– Data that is being operated on has regularities
– Instruction sequence has redundancies that are artifacts of
way that humans/compilers think about problems
• Is dynamic branch prediction better than static
branch prediction?
– Seems to be
– There are a small number of important branches in programs
which have dynamic behavior

08/25/2025 Basic Pipelining & Performance 78


Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)


• Branch History Table: Lower bits of PC address
index table of 1-bit values
– Says whether or not branch taken last time
– No address check
• Problem: in a loop, 1-bit BHT will cause two
mispredictions (avg is 9 iterations before exit):
– End of loop case, when it exits instead of looping as before
– First time through loop on next time through code, when it
predicts exit instead of looping

08/25/2025 Basic Pipelining & Performance 79
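The two-mispredictions-per-loop behaviour of a 1-bit BHT entry is easy to reproduce; the 9-iteration loop trace below matches the slide's example shape:

```python
def one_bit_mispredictions(outcomes, prediction=True):
    """Simulate a single 1-bit predictor entry over a branch's outcomes."""
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken        # 1 bit: remember only the last outcome
    return misses

# A loop branch: taken 9 times, then not taken at exit; run the loop twice.
trace = ([True] * 9 + [False]) * 2
print(one_bit_mispredictions(trace))   # 3: one exit miss, then exit + re-entry
```

The first execution mispredicts only the exit; every later execution mispredicts both the exit and the first iteration (the bit still says "not taken" from the previous exit).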


Dynamic Branch Prediction
• Solution: 2-bit scheme where we change the
prediction only if we get a misprediction twice
[State diagram: two "Predict Taken" states and two "Predict Not Taken" states; a taken (T) outcome moves toward Predict Taken, a not-taken (NT) outcome moves toward Predict Not Taken, so a single misprediction in a saturated state does not flip the prediction]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process
08/25/2025 Basic Pipelining & Performance 80
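A minimal 2-bit saturating-counter sketch (counter values 0-1 predict not taken, 2-3 predict taken), run on the same kind of loop trace as before:

```python
def two_bit_mispredictions(outcomes, state=3):
    """2-bit saturating counter: predict taken when state >= 2."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Saturating update toward the actual outcome:
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

trace = ([True] * 9 + [False]) * 2   # 9-iteration loop executed twice
print(two_bit_mispredictions(trace))  # 2: only the loop exit mispredicts
```

Compared with the 1-bit version on the same trace, the hysteresis removes the re-entry misprediction: one not-taken outcome only drops the counter from 3 to 2, leaving the prediction "taken".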
BHT Accuracy
• Mispredict because either:
- Wrong guess for that branch
- Got branch history of wrong branch when indexing the table
• 4096 entry table:
[Bar chart: misprediction rates for SPEC89 benchmarks, from 0% up to 18%; the integer benchmarks (eqntott, espresso, gcc, li) are higher than the floating-point benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp)]
08/25/2025 Basic Pipelining & Performance 81
Correlated Branch Prediction
• Idea: record m most recently executed branches
as taken or not taken, and use that pattern to
select the proper n-bit branch history table

• In general, a (m,n) predictor means recording last
m branches to select between 2^m history tables,
each with n-bit counters
- Thus, old 2-bit BHT is a (0,2) predictor

• Global Branch History: m-bit shift register
keeping T/NT status of last m branches.

• Each entry in table has 2^m n-bit predictors.

08/25/2025 Basic Pipelining & Performance 82
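A sketch of an (m,n) predictor along the lines above; the table size and the modulo indexing by PC are simplifying assumptions:

```python
class CorrelatingPredictor:
    """(m, n) predictor sketch: an m-bit global history selects one of
    2**m tables of n-bit saturating counters, indexed by branch address."""

    def __init__(self, m, n, entries=4096):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                     # m-bit global shift register
        mid = (1 << n) // 2                  # start at the taken threshold
        self.tables = [[mid] * entries for _ in range(1 << m)]

    def predict(self, pc):
        counter = self.tables[self.history][pc % self.entries]
        return counter >= (1 << self.n) // 2   # upper half predicts taken

    def update(self, pc, taken):
        table = self.tables[self.history]
        idx = pc % self.entries
        if taken:
            table[idx] = min(table[idx] + 1, (1 << self.n) - 1)
        else:
            table[idx] = max(table[idx] - 1, 0)
        # Shift the outcome into the global history:
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

p = CorrelatingPredictor(2, 2)   # the (2,2) predictor from the slides
print(p.predict(0x40))           # True: counters start at the threshold
```

With m = 0 this degenerates to the plain 2-bit BHT, matching the "(0,2) predictor" remark above.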


Correlating Branches
(2,2) predictor
- Behavior of the recent branches selects
between four predictions of the next
branch, updating just that prediction
[Diagram: the branch address indexes a row of four 2-bit per-branch predictors; a 2-bit global branch history selects which of the four supplies the prediction]
08/25/2025 Basic Pipelining & Performance 83


Accuracy of Different Schemes
[Bar chart comparing three predictors on SPEC89 benchmarks (nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, expresso, eqntott, li): a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT; misprediction frequencies range from 0% to 18%, and the (2,2) predictor is generally the most accurate despite using fewer bits]
4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2)
08/25/2025 Basic Pipelining & Performance 84


Tournament Predictors

Tournament predictor using, say, 4K 2-bit counters


indexed by local branch address. Chooses
between:
• Global predictor
– 4K entries indexed by history of last 12 branches (2^12 = 4K)
– Each entry is a standard 2-bit predictor

• Local predictor
– Local history table: 1024 10-bit entries recording last 10
branches, indexed by branch address
– The pattern of the last 10 occurrences of that particular branch
used to index table of 1K entries with 3-bit saturating counters

08/25/2025 Basic Pipelining & Performance 85
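The slides do not spell out how the tournament predictor chooses between its components; one common design (an assumption here) is a 2-bit selector counter per entry, nudged toward whichever component predictor was right when the two disagree:

```python
# Tournament-predictor chooser: a 2-bit saturating counter per entry.
# sel in 0..3; sel >= 2 means "use the global predictor".
def update_selector(sel, global_correct, local_correct):
    if global_correct and not local_correct:
        return min(3, sel + 1)    # move toward the global predictor
    if local_correct and not global_correct:
        return max(0, sel - 1)    # move toward the local predictor
    return sel                    # both right or both wrong: no information

sel = 2
sel = update_selector(sel, True, False)   # global wins: 3
sel = update_selector(sel, False, True)   # local wins: back to 2
sel = update_selector(sel, True, True)    # agreement: unchanged
assert sel == 2
```

Only disagreements train the chooser, which lets it learn per-branch which component predictor to trust.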


Comparing Predictors (Fig. 2.8)
• Advantage of tournament predictor is ability to
select the right predictor for a particular branch
– Particularly crucial for integer benchmarks.
– A typical tournament predictor will select the global predictor
almost 40% of the time for the SPEC integer benchmarks and
less than 15% of the time for the SPEC FP benchmarks

08/25/2025 Basic Pipelining & Performance 87


Pentium 4 Misprediction Rate
(per 1000 instructions, not per branch)
6% misprediction rate per branch SPECint
(19% of INT instructions are branches)
2% misprediction rate per branch SPECfp
(5% of FP instructions are branches)
[Bar chart: branch mispredictions per 1000 instructions. SPECint2000
(164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) range from about 5 to 13;
SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa) from
about 0 to 1.]
08/25/2025 Basic Pipelining & Performance 88
Branch Target Buffers (BTB)

• Branch target calculation is costly and stalls the


instruction fetch.
• BTB stores PCs the same way as caches
• The PC of a branch is sent to the BTB
• When a match is found the corresponding
Predicted PC is returned
• If the branch was predicted taken, instruction
fetch continues at the returned predicted PC

08/25/2025 Basic Pipelining & Performance 89
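The BTB lookup described above can be sketched as follows. A Python dict stands in for the tagged hardware array, and the fall-through address `pc + 4` assumes fixed 4-byte instructions:

```python
# Branch target buffer: the fetch-stage PC indexes the buffer; a hit
# returns the stored predicted target so fetch can redirect immediately.
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}   # branch PC -> predicted target PC

    def lookup(self, pc):
        # Hit: next fetch continues at the predicted target.
        # Miss: fall through to sequential fetch at pc + 4.
        return self.entries.get(pc, pc + 4)

    def update(self, pc, target):
        self.entries[pc] = target

btb = BranchTargetBuffer()
assert btb.lookup(0x100) == 0x104   # miss: sequential fetch
btb.update(0x100, 0x40)             # record taken-branch target
assert btb.lookup(0x100) == 0x40    # hit: redirect fetch
```

In hardware the `entries` dict becomes a tag-matched table, so only PCs of (predicted-taken) branches hit.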


Branch Target Buffers

Branch target folding: for unconditional branches, store the
target instructions themselves in the buffer!

08/25/2025 Basic Pipelining & Performance 90
Dynamic Branch Prediction Summary
• Prediction becoming important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated
with next branch
– Either different branches (GA)
– Or different executions of same branches (PA)
• Tournament predictors take insight to next level, by
using multiple predictors
– usually one based on global information and one based on local
information, and combining them with a selector
– In 2006, tournament predictors using ~30K bits are in processors
like the Power5 and Pentium 4
• Branch Target Buffer: include branch address &
prediction;
• Branch target folding.
08/25/2025 Basic Pipelining & Performance 91
In-Class Exercise
• Given a deeply pipelined processor and a branch-target
buffer for conditional branches only, assuming a
misprediction penalty of 4 cycles and a buffer miss penalty
of 3 cycles, 90% hit rate and 90% accuracy, and 15%
branch frequency. How much faster is the processor with
the BTB vs. a processor that has a fixed 2-cycle branch
penalty?
• Speedup = CPI_noBTB / CPI_BTB
• = (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
• Stalls = ∑ Frequency x Penalty
• Stalls_noBTB = 15% x 2 = 0.30
• Stalls_BTB = Stalls_miss-BTB + Stalls_hit-BTB-correct + Stalls_hit-BTB-wrong
• = 15% x 10% x 3 + 15% x 90% x 90% x 0 + 15% x 90% x 10% x 4
• = 0.045 + 0 + 0.054 = 0.099
• Speedup = (1 + 0.3) / (1 + 0.099) ≈ 1.18

08/25/2025 Basic Pipelining & Performance 92
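The exercise's stall arithmetic can be checked directly; the exact sum is 0.099, giving a speedup of about 1.18:

```python
# Verify the BTB speedup computation from the exercise above.
branch_freq, hit_rate, accuracy = 0.15, 0.90, 0.90
miss_penalty, mispredict_penalty = 3, 4

stalls_no_btb = branch_freq * 2   # fixed 2-cycle branch penalty
stalls_btb = (branch_freq * (1 - hit_rate) * miss_penalty                      # BTB miss
              + branch_freq * hit_rate * accuracy * 0                          # hit, correct
              + branch_freq * hit_rate * (1 - accuracy) * mispredict_penalty)  # hit, wrong
speedup = (1 + stalls_no_btb) / (1 + stalls_btb)

assert round(stalls_no_btb, 2) == 0.30
assert round(stalls_btb, 3) == 0.099
assert round(speedup, 2) == 1.18
```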


Branch Prediction
Dynamic Scheduling
 Rearrange order of instructions to reduce
stalls while maintaining data flow

 Advantages:
 Compiler doesn’t need to have knowledge of
microarchitecture
 Handles cases where dependencies are
unknown at compile time
 Helps to tolerate unpredictable delay (cache
misses)
 Disadvantage:
 Substantial increase in hardware
Basic Pipelining & Performance
complexity
100
Branch Prediction
Dynamic Scheduling
 Dynamic scheduling implies:
 Out-of-order execution
 Out-of-order completion

 Creates the possibility for WAR and WAW


hazards

 Tomasulo’s Approach
 Tracks when operands are available
 Introduces register renaming in hardware

Minimizes WAW and WAR hazards
Basic Pipelining & Performance 101
Branch Prediction
Register Renaming
RAR, RAW, WAR, WAW
 Example:

DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14 ; antidependence on F8 (ADD.D reads F8)
MUL.D F6,F10,F8  ; antidependence on F6 (S.D reads F6)
                 ; + output dependence on F6 (ADD.D writes F6)

Basic Pipelining & Performance 102


Branch Prediction
Register Renaming
 Example:

DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T

 Now only RAW hazards remain, which can be strictly


ordered

Basic Pipelining & Performance 103
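The renaming step above can be sketched as a pass that allocates a fresh name for every register write, while reads always use the latest mapping. The T-names and the tuple format are illustrative, not the slides' notation:

```python
# Hypothetical register renamer: each destination write gets a fresh
# name (T1, T2, ...); sources read the latest mapping. Instruction
# tuples are (op, dest-or-None, [sources]); stores have no register dest.
def rename(instructions):
    mapping = {}   # architectural register -> current physical name
    counter = 0
    out = []
    for op, dst, srcs in instructions:
        srcs = [mapping.get(s, s) for s in srcs]   # read latest names
        if dst is not None:
            counter += 1
            mapping[dst] = f"T{counter}"
            dst = mapping[dst]
        out.append((op, dst, srcs))
    return out

prog = [("DIV.D", "F0", ["F2", "F4"]),
        ("ADD.D", "F6", ["F0", "F8"]),
        ("S.D",   None, ["F6"]),          # stores F6 to 0(R1)
        ("SUB.D", "F8", ["F10", "F14"]),
        ("MUL.D", "F6", ["F10", "F8"])]
renamed = rename(prog)

# SUB.D now writes T3, so ADD.D's read of the old F8 is untouched (WAR
# gone); MUL.D writes T4, distinct from ADD.D's T2 (WAW gone).
assert renamed[1] == ("ADD.D", "T2", ["T1", "F8"])
assert renamed[4] == ("MUL.D", "T4", ["F10", "T3"])
```

After renaming, only the RAW chains remain, which is why they "can be strictly ordered."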


Branch Prediction
Register Renaming
 Register renaming is provided by reservation stations
(RS)
 Contains:

The instruction

Buffered operand values (when available)

Reservation station number of instruction providing
the operand values
 RS fetches and buffers an operand as soon as it becomes
available (not necessarily involving register file)
 Pending instructions designate the RS to which they will send
their output

Result values broadcast on a result bus, called the common data bus (CDB)
 Only the last output updates the register file
 As instructions are issued, the register specifiers are renamed
to reservation-station identifiers
 May be more reservation stations than registers
Basic Pipelining & Performance 104
Branch Prediction
Tomasulo’s Algorithm
 Load and store buffers
 Contain data and addresses, act like
reservation stations

 Top-level design:

Basic Pipelining & Performance 105


Branch Prediction
Tomasulo’s Algorithm
 Three Steps:
 Issue

Get next instruction from FIFO queue

If an RS is available, issue the instruction to the RS, with operand values if
available

If operand values are not available, record the RS that will produce them
(issue stalls only when no RS is free)
 Execute

When an operand becomes available, store it in any reservation
stations waiting for it

When all operands are ready, begin execution

Loads and stores maintained in program order through effective
address

No instruction allowed to initiate execution until all branches that
precede it in program order have completed
 Write result

Write result on CDB into reservation stations and store buffers
 (Stores must wait until address and value are received)

Basic Pipelining & Performance 106


Tomasulo Organization
From Mem FP Op FP Registers
Queue
Load Buffers
Load1
Load2
Load3
Load4
Load5 Store
Load6
Buffers

Add1
Add2 Mult1
Add3 Mult2

Reservation To Mem
Stations
FP
FPadders
adders FP
FPmultipliers
multipliers

Common Data Bus (CDB)


08/25/2025 Basic Pipelining & Performance 107
Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)


Vj, Vk: Value of Source operands
– Store buffers has V field, result to be stored
Qj, Qk: Reservation stations producing source
registers (value to be written)
– Note: Qj,Qk=0 => ready
– Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy

Register result status—Indicates which functional unit


will write each register, if one exists. Blank when no
pending instructions that will write that register.

08/25/2025 Basic Pipelining & Performance 108
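The fields listed above can be written down as a record. This is a sketch: the field names follow the slide, but the Python types and the `ready` helper are assumptions:

```python
# One reservation station, with the Vj/Vk value fields and Qj/Qk
# producer tags described above (0 in Qj/Qk means the value is present).
from dataclasses import dataclass

@dataclass
class ReservationStation:
    busy: bool = False
    op: str = ""      # operation to perform in the unit (e.g. "+" or "-")
    vj: float = 0.0   # value of source operand j (valid when qj == 0)
    vk: float = 0.0   # value of source operand k (valid when qk == 0)
    qj: int = 0       # RS number producing operand j; 0 => operand ready
    qk: int = 0       # RS number producing operand k; 0 => operand ready

    def ready(self) -> bool:
        """Qj, Qk == 0: both operands present, may begin execution."""
        return self.busy and self.qj == 0 and self.qk == 0

rs = ReservationStation(busy=True, op="+", vj=1.5, qk=3)
assert not rs.ready()    # still waiting on reservation station 3
rs.vk, rs.qk = 2.5, 0    # CDB broadcast from RS 3 delivers the operand
assert rs.ready()
```

The transition from `qk=3` to `qk=0` is exactly the "watch the CDB for the result" behavior in the execute step below.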


Three Stages of Tomasulo Algorithm

1. Issue—get instruction from FP Op Queue


If reservation station free (no structural hazard),
control issues instr & sends operands (renames registers).
2. Execute—operate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast
• Example speed:
3 clocks for FP +,-; 10 for *; 40 clocks for /

08/25/2025 Basic Pipelining & Performance 109


Tomasulo Example
Instruction stream
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 Load1 No
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2 3 Load/Buffers
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
FU count Add2 No
3 FP Adder R.S.
Add3 No
down 2 FP Mult R.S.
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU

Clock cycle
counter
08/25/2025 Basic Pipelining & Performance 110
Tomasulo Example Cycle 1
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Load1

08/25/2025 Basic Pipelining & Performance 111


Tomasulo Example Cycle 2
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Load2 Load1

Note: Can have multiple loads outstanding


08/25/2025 Basic Pipelining & Performance 112
Tomasulo Example Cycle 3
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Mult1 Load2 Load1
• Note: registers names are removed (“renamed”) in Reservation
Stations; MULT issued
• Load1 completing; what is waiting for Load1?
08/25/2025 Basic Pipelining & Performance 113
Tomasulo Example Cycle 4
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 Yes SUBD M(A1) Load2
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Mult1 Load2 M(A1) Add1

• Load2 completing; what is waiting for Load2?

08/25/2025 Basic Pipelining & Performance 114


Tomasulo Example Cycle 5
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes SUBD M(A1) M(A2)
Add2 No
Add3 No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Mult1 M(A2) M(A1) Add1 Mult2

• Timer starts down for Add1, Mult1


08/25/2025 Basic Pipelining & Performance 115
Tomasulo Example Cycle 6
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 M(A2) Add2 Add1 Mult2

• Issue ADDD here despite name dependency on F6?


08/25/2025 Basic Pipelining & Performance 116
Tomasulo Example Cycle 7
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 M(A2) Add2 Add1 Mult2

• Add1 (SUBD) completing; what is waiting for it?


08/25/2025 Basic Pipelining & Performance 117
Tomasulo Example Cycle 8
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
2 Add2 Yes ADDD (M-M) M(A2)
Add3 No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 M(A2) Add2 (M-M) Mult2

08/25/2025 Basic Pipelining & Performance 118


Tomasulo Example Cycle 9
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
1 Add2 Yes ADDD (M-M) M(A2)
Add3 No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 M(A2) Add2 (M-M) Mult2

08/25/2025 Basic Pipelining & Performance 119


Tomasulo Example Cycle 10
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
0 Add2 Yes ADDD (M-M) M(A2)
Add3 No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
10 FU Mult1 M(A2) Add2 (M-M) Mult2

• Add2 (ADDD) completing; what is waiting for it?


08/25/2025 Basic Pipelining & Performance 120
Tomasulo Example Cycle 11
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Write result of ADDD here?


• All quick instructions complete in this cycle!
08/25/2025 Basic Pipelining & Performance 121
Tomasulo Example Cycle 12
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

08/25/2025 Basic Pipelining & Performance 122


Tomasulo Example Cycle 13
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

08/25/2025 Basic Pipelining & Performance 123


Tomasulo Example Cycle 14
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

08/25/2025 Basic Pipelining & Performance 124


Tomasulo Example Cycle 15
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Mult1 (MULTD) completing; what is waiting for it?


08/25/2025 Basic Pipelining & Performance 125
Tomasulo Example Cycle 16
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Just waiting for Mult2 (DIVD) to complete


08/25/2025 Basic Pipelining & Performance 126
Faster than light computation
(skip a couple of cycles)

08/25/2025 Basic Pipelining & Performance 127


Tomasulo Example Cycle 55
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
55 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

08/25/2025 Basic Pipelining & Performance 128


Tomasulo Example Cycle 56
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5 56
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
56 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Mult2 (DIVD) is completing; what is waiting for it?


08/25/2025 Basic Pipelining & Performance 129
Tomasulo Example Cycle 57
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5 56 57
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
57 FU M*F4 M(A2) (M-M+M)(M-M) Result

• Once again: In-order issue, out-of-order execution and


out-of-order completion.
08/25/2025 Basic Pipelining & Performance 130
Speculation to greater ILP
• Greater ILP: Overcome control dependence by
hardware speculating on outcome of branches
and executing program as if guesses were correct
– Speculation ⇒ fetch, issue, and execute instructions as if
branch predictions were always correct
– Dynamic scheduling ⇒ only fetches and issues
instructions
• Essentially a data flow execution model:
Operations execute as soon as their operands are
available

08/25/2025 Basic Pipelining & Performance 135


Speculation to greater ILP
 3 components of HW-based speculation:

1. Dynamic branch prediction to choose which


instructions to execute
2. Speculation to allow execution of instructions
before control dependences are resolved
+ ability to undo effects of incorrectly speculated
sequence
3. Dynamic scheduling to deal with scheduling of
different combinations of basic blocks

08/25/2025 Basic Pipelining & Performance 136


Adding Speculation to Tomasulo
• Must separate execution from allowing
instruction to finish or “commit”
• This additional step called instruction commit
• When an instruction is no longer speculative,
allow it to update the register file or memory
• Requires additional set of buffers to hold results
of instructions that have finished execution but
have not committed
• This reorder buffer (ROB) is also used to pass
results among instructions that may be
speculated

08/25/2025 Basic Pipelining & Performance 137


Reorder Buffer (ROB)
• In non-speculative Tomasulo’s algorithm, once
an instruction writes its result, any subsequently
issued instructions will find result in the register
file
• With speculation, the register file is not updated
until the instruction commits
– (we know definitively that the instruction should execute)
• Thus, the ROB supplies operands in interval
between completion of instruction execution and
instruction commit
– ROB is a source of operands for instructions, just as
reservation stations (RS) provide operands in Tomasulo’s
algorithm
– ROB extends the architected registers, just as the RS do

08/25/2025 Basic Pipelining & Performance 138


Reorder Buffer Entry Fields
 Each entry in the ROB contains four fields:
1. Instruction type
• a branch (has no destination result), a store (has a memory
address destination), or a register operation (ALU operation
or load, which has register destinations)
2. Destination
• Register number (for loads and ALU operations) or
memory address (for stores)
where the instruction result should be written
3. Value
• Value of instruction result until the instruction commits
4. Ready
• Indicates that instruction has completed execution, and the
value is ready

08/25/2025 Basic Pipelining & Performance 139
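The four ROB entry fields above can be captured as a record, together with the FIFO commit rule from the next slide. The names and types here are assumptions for illustration:

```python
# A reorder-buffer entry (type, destination, value, ready) and the
# in-order commit check: only the head of the FIFO may commit.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    itype: str                  # "branch", "store", or "register"
    dest: Optional[str] = None  # register number, or memory address for stores
    value: object = None        # instruction result, held here until commit
    ready: bool = False         # execution complete, value is valid

def can_commit(rob):
    """Commit is in order: the head entry must exist and be ready."""
    return bool(rob) and rob[0].ready

rob = [ROBEntry("register", "F0"),               # head: still executing
       ROBEntry("register", "F10", 3.5, True)]   # finished, but must wait
assert not can_commit(rob)   # F10 is done yet cannot commit behind F0
rob[0].ready = True
assert can_commit(rob)
```

This "finished but waiting behind the head" case is precisely why the ROB must also supply operands between completion and commit.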


Reorder Buffer operation
• Holds instructions in FIFO order, exactly as issued
• When instructions complete, results placed into ROB
– Supplies operands to other instructions between execution
complete & commit ⇒ more registers, like RS
– Tag results with ROB buffer number instead of reservation station
• Instructions commit values at head of ROB placed in
registers (or memory locations)
• As a result, easy to undo speculated instructions on
mispredicted branches or on exceptions
[Diagram: FP Op Queue feeds the reservation stations and FP adders;
results enter the Reorder Buffer, whose commit path writes the FP Regs.]

08/25/2025 Basic Pipelining & Performance 140


Recall: 4 Steps of Speculative Tomasulo
Algorithm

1. Issue—get instruction from FP Op Queue


If reservation station and reorder buffer slot free, issue instr &
send operands & reorder buffer no. for destination (this stage
sometimes called “dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch
CDB for result; when both in reservation station, execute;
checks RAW (sometimes called “issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit—update register with reorder result
When instr. at head of reorder buffer & result present, update
register with result (or store to memory) and remove instr from
reorder buffer. Mispredicted branch flushes reorder buffer
(sometimes called “graduation”)

08/25/2025 Basic Pipelining & Performance 141


Tomasulo With Reorder buffer:
Done?
FP Op ROB7 Newest
Queue ROB6
ROB5

Reorder Buffer ROB4


ROB3
ROB2 Oldest
F0 LD F0,10(R2) N ROB1

Registers To
Memory
Dest from
Dest
Memory
Dest
Reservation 1 10+R2
Stations
FP adders FP multipliers

08/25/2025 Basic Pipelining & Performance 142


Tomasulo With Reorder buffer:
Done?
FP Op ROB7 Newest
Queue ROB6
ROB5

Reorder Buffer ROB4


ROB3
F10 ADDD F10,F4,F0 N ROB2 Oldest
F0 LD F0,10(R2) N ROB1

Registers To
Memory
Dest from
Dest
2 ADDD R(F4),ROB1 Memory
Dest
Reservation 1 10+R2
Stations
FP adders FP multipliers

08/25/2025 Basic Pipelining & Performance 143


Tomasulo With Reorder buffer:
Done?
FP Op ROB7 Newest
Queue ROB6
ROB5

Reorder Buffer F2 DIVD F2,F10,F6 N


ROB4
ROB3
F10 ADDD F10,F4,F0 N ROB2 Oldest
F0 LD F0,10(R2) N ROB1

Registers To
Memory
Dest from
Dest
2 ADDD R(F4),ROB1 Memory
3 DIVD ROB2,R(F6)
Dest
Reservation 1 10+R2
Stations
FP adders FP multipliers

08/25/2025 Basic Pipelining & Performance 144


Tomasulo With Reorder buffer:
Done?
FP Op ROB7 Newest
Queue F0 ADDD F0,F4,F6 N ROB6
F4 LD F4,0(R3) N ROB5

Reorder Buffer --
F2
BNE F2,<…>
DIVD F2,F10,F6
N
N
ROB4
ROB3
F10 ADDD F10,F4,F0 N ROB2 Oldest
F0 LD F0,10(R2) N ROB1

Registers To
Memory
Dest from
Dest
2 ADDD R(F4),ROB1 Memory
6 ADDD ROB5, R(F6) 3 DIVD ROB2,R(F6)
Dest
Reservation 1 10+R2
Stations 5 0+R3
FP adders FP multipliers

08/25/2025 Basic Pipelining & Performance 145


Tomasulo With Reorder buffer:
Done?
FP Op -- ROB5 ST 0(R3),F4 N ROB7 Newest
Queue F0 ADDD F0,F4,F6 N ROB6
F4 LD F4,0(R3) N ROB5

Reorder Buffer --
F2
BNE F2,<…>
DIVD F2,F10,F6
N
N
ROB4
ROB3
F10 ADDD F10,F4,F0 N ROB2 Oldest
F0 LD F0,10(R2) N ROB1

Registers To
Memory
Dest from
Dest
2 ADDD R(F4),ROB1 Memory
6 ADDD ROB5, R(F6) 3 DIVD ROB2,R(F6)
Dest
Reservation 1 10+R2
Stations 5 0+R3
FP adders FP multipliers

08/25/2025 Basic Pipelining & Performance 146


Tomasulo With Reorder buffer:
Done?
FP Op -- M[10] ST 0(R3),F4 Y ROB7 Newest
Queue F0 ADDD F0,F4,F6 N ROB6
F4 M[10] LD F4,0(R3) Y ROB5

Reorder Buffer --
F2
BNE F2,<…>
DIVD F2,F10,F6
N
N
ROB4
ROB3
F10 ADDD F10,F4,F0 N ROB2 Oldest
F0 LD F0,10(R2) N ROB1

Registers To
Memory
Dest from
Dest
2 ADDD R(F4),ROB1 Memory
6 ADDD M[10],R(F6) 3 DIVD ROB2,R(F6)
Dest
Reservation 1 10+R2
Stations
FP adders FP multipliers

08/25/2025 Basic Pipelining & Performance 147


Tomasulo With Reorder buffer:
Done?
FP Op -- M[10] ST 0(R3),F4 Y ROB7 Newest
Queue F0 <val2> ADDD F0,F4,F6 Ex ROB6
F4 M[10] LD F4,0(R3) Y ROB5

Reorder Buffer --
F2
BNE F2,<…> N
DIVD F2,F10,F6 N
ROB4
ROB3
F10 ADDD F10,F4,F0 N ROB2 Oldest
F0 LD F0,10(R2) N ROB1

Registers To
Memory
Dest from
Dest
2 ADDD R(F4),ROB1 Memory
3 DIVD ROB2,R(F6)
Dest
Reservation 1 10+R2
Stations
FP adders FP multipliers

08/25/2025 Basic Pipelining & Performance 148


Tomasulo With Reorder buffer:
Done?
FP Op -- M[10] ST 0(R3),F4 Y ROB7 Newest
Queue F0 <val2> ADDD F0,F4,F6 Ex ROB6
F4 M[10] LD F4,0(R3) Y ROB5

Reorder Buffer --
F2
BNE F2,<…> N
DIVD F2,F10,F6 N
ROB4
ROB3
F10 ADDD F10,F4,F0 N ROB2 Oldest
F0 LD F0,10(R2) N ROB1
What about memory hazards???
Registers To
Memory
Dest from
Dest
2 ADDD R(F4),ROB1 Memory
3 DIVD ROB2,R(F6)
Dest
Reservation 1 10+R2
Stations
FP adders FP multipliers

08/25/2025 Basic Pipelining & Performance 149


Avoiding Memory Hazards
• WAW and WAR hazards through memory are
eliminated with speculation because actual
updating of memory occurs in order, when a
store is at head of the ROB, and hence, no
earlier loads or stores can still be pending
• RAW dependence through memory are
maintained by two restrictions:
1. not allowing a load to initiate the second step of its execution
if any active ROB entry occupied by a store has a Destination
field that matches the value of the A field of the load, and
2. maintaining the program order for the computation of an
effective address of a load with respect to all earlier stores.
• these restrictions ensure that any load that
accesses a memory location written to by an
earlier store cannot perform the memory access
until the store has written the data

08/25/2025 Basic Pipelining & Performance 150


Exceptions and Interrupts
• IBM 360/91 invented “imprecise interrupts”
– Computer stopped at this PC; it's likely close to this address
– Not so popular with programmers
– Also, what about Virtual Memory? (Not in IBM 360)
• Technique for both precise interrupts/exceptions
and speculation: in-order completion and in-
order commit
– If we speculate and are wrong, need to back up and restart
execution to point at which we predicted incorrectly
– This is exactly same as need to do with precise exceptions
• Exceptions are handled by not recognizing the
exception until instruction that caused it is ready
to commit in ROB
– If a speculated instruction raises an exception, the exception
is recorded in the ROB
– This is why reorder buffers in all new processors

08/25/2025 Basic Pipelining & Performance 151


Multiple Issue and Static Scheduling
Multiple Issue and Static Scheduling
 To achieve CPI < 1, need to complete
multiple instructions per clock

 Solutions:
 Statically scheduled superscalar processors
 VLIW (very long instruction word) processors
 dynamically scheduled superscalar
processors

Basic Pipelining & Performance 153


Multiple Issue and Static Scheduling
Multiple Issue

Basic Pipelining & Performance 154


VLIW: Very Long Instruction Word
• Each “instruction” has explicit coding for multiple
operations
– In IA-64, grouping called a “packet”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long
instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches

08/25/2025 Basic Pipelining & Performance 155


Recall: Unrolled Loop that
Minimizes Stalls for Scalar

1 Loop: L.D F0,0(R1) L.D to ADD.D: 1 Cycle


2 L.D F6,-8(R1) ADD.D to S.D: 2 Cycles
3 L.D F10,-16(R1)
4 L.D F14,-24(R1)
5 ADD.D F4,F0,F2
6 ADD.D F8,F6,F2
7 ADD.D F12,F10,F2
8 ADD.D F16,F14,F2
9 S.D 0(R1),F4
10 S.D -8(R1),F8
11 S.D -16(R1),F12
12 DSUBUI R1,R1,#32
13 BNEZ R1,LOOP
14 S.D 8(R1),F16 ; 8-32 = -24

14 clock cycles, or 3.5 per iteration


08/25/2025 Basic Pipelining & Performance 156
Loop Unrolling in VLIW
Memory Memory FP FP Int. op/ Clock
reference 1 reference 2 operation 1 op. 2 branch
L.D F0,0(R1) L.D F6,-8(R1) 1
L.D F10,-16(R1) L.D F14,-24(R1) 2
L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 3
L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F2 4
ADD.D F20,F18,F2 ADD.D F24,F22,F2 5
S.D 0(R1),F4 S.D -8(R1),F8 ADD.D F28,F26,F2 6
S.D -16(R1),F12 S.D -24(R1),F16 7
S.D -32(R1),F20 S.D -40(R1),F24 DSUBUI R1,R1,#48 8
S.D -0(R1),F28 BNEZ R1,LOOP 9
Unrolled 7 times to avoid delays
7 iterations in 9 clocks, or 1.3 clocks per iteration (1.8X speedup)
Average: 2.5 ops per clock, 50% efficiency
Note: Need more registers in VLIW (15 vs. 6 in SS)

08/25/2025 Basic Pipelining & Performance 157
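A quick check of the figures quoted above, assuming 23 operations counted from the table's 9 long instructions, each with 5 operation slots:

```python
# Verify clocks/iteration, ops/clock, and slot efficiency for the
# unrolled VLIW loop above.
iterations, clocks, ops, slots_per_instr = 7, 9, 23, 5
slots = clocks * slots_per_instr   # 45 operation slots in total

clocks_per_iter = clocks / iterations   # ~1.29, quoted as 1.3
ops_per_clock = ops / clocks            # ~2.56, quoted as 2.5
efficiency = ops / slots                # ~0.51, quoted as 50%

assert abs(clocks_per_iter - 1.3) < 0.02
assert abs(ops_per_clock - 2.5) < 0.1
assert abs(efficiency - 0.5) < 0.02
```

The unused 22 of 45 slots are the "wasted bits in instruction encoding" criticized on the next slide.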


Problems with 1st Generation VLIW
• Increase in code size
– generating enough operations in a straight-line code fragment
requires ambitiously unrolling loops
– whenever VLIW instructions are not full, unused functional
units translate to wasted bits in instruction encoding
• Operated in lock-step; no hazard detection HW
– a stall in any functional unit pipeline caused entire processor
to stall, since all functional units must be kept synchronized
– Compiler might predict function units, but caches hard to
predict
• Binary code compatibility
– Pure VLIW => different numbers of functional units and unit
latencies require different versions of the code

08/25/2025 Basic Pipelining & Performance 158


Dynamic Scheduling, Multiple Issue, and Speculation
Multiple Issue
 Limit the number of instructions of a given
class that can be issued in a “bundle”
 E.g., one FP, one integer, one load, one store

 Examine all the dependencies among the


instructions in the bundle

 If dependencies exist in bundle, encode


them in reservation stations

 Also need multiple completion/commit


Dynamic Scheduling, Multiple Issue, and Speculation
Example
Loop: LD R2,0(R1) ;R2=array element
DADDIU R2,R2,#1 ;increment R2
SD R2,0(R1) ;store result
DADDIU R1,R1,#8 ;increment pointer
BNE R2,R3,LOOP ;branch if not last element



Dynamic Scheduling, Multiple Issue, and Speculation
Example (No Speculation)



Dynamic Scheduling, Multiple Issue, and Speculation
Example (Speculating)



Adv. Techniques for Instruction Delivery and Speculation
Branch-Target Buffer
 Need high instruction bandwidth!
 Branch-Target buffers

Next PC prediction buffer, indexed by current PC
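A minimal sketch of such a next-PC prediction buffer, written as a direct-mapped table. The table size, entry layout, and eviction policy here are illustrative assumptions, not details from the text:

```python
# Minimal branch-target buffer (BTB) sketch: a small direct-mapped table
# that, given the current PC, predicts the next PC in the same cycle the
# instruction is fetched.

NUM_ENTRIES = 16                      # assumed small power-of-two table

btb = [None] * NUM_ENTRIES            # each entry: (tag_pc, predicted_target)

def predict(pc):
    """Return the predicted next PC: BTB target on a hit, else pc + 4."""
    entry = btb[(pc >> 2) % NUM_ENTRIES]      # index by word-aligned PC
    if entry is not None and entry[0] == pc:  # tag match -> predict taken
        return entry[1]
    return pc + 4                             # miss -> predict fall-through

def update(pc, taken, target):
    """After the branch resolves, insert a taken branch or evict an entry."""
    idx = (pc >> 2) % NUM_ENTRIES
    if taken:
        btb[idx] = (pc, target)
    elif btb[idx] is not None and btb[idx][0] == pc:
        btb[idx] = None               # branch was not taken: drop the entry

update(0x40, True, 0x100)             # a taken branch at PC 0x40 -> 0x100
print(hex(predict(0x40)))             # 0x100 (BTB hit)
print(hex(predict(0x44)))             # 0x48  (BTB miss: fall through)
```

Because the prediction is available during fetch, a hit lets the fetch unit redirect to the target with no bubble, which is what sustains high instruction bandwidth.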



Adv. Techniques for Instruction Delivery and Speculation
Branch Folding
 Optimization:
 Larger branch-target buffer
 Add target instruction into buffer to deal with
longer decoding time required by larger buffer
 “Branch folding”



Adv. Techniques for Instruction Delivery and Speculation
Return Address Predictor
 Most unconditional branches come from
function returns
 The same procedure can be called from
multiple sites
 Causes the buffer to potentially forget about
the return address from previous calls
 Create return address buffer organized
as a stack
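A sketch of such a return-address stack in Python. The depth and the overflow policy (discard the oldest entry) are assumptions; real designs vary:

```python
# Return-address-stack sketch: push the return PC on each call, pop a
# prediction on each return.

class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        """Push the address the matching return should come back to."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)         # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        """Pop the predicted return address (None models an empty stack)."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x400)                    # call from site A
ras.on_call(0x800)                    # nested call from site B
print(hex(ras.on_return()))           # 0x800 -- innermost call returns first
print(hex(ras.on_return()))           # 0x400
```

Because the same procedure can be called from many sites, a PC-indexed BTB entry would keep forgetting the correct return address; tracking the call nesting with a stack avoids that.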



Adv. Techniques for Instruction Delivery and Speculation
Integrated Instruction Fetch Unit
 Design monolithic unit that performs:
 Branch prediction
 Instruction prefetch

Fetch ahead
 Instruction memory access and buffering

Deal with crossing cache lines



Adv. Techniques for Instruction Delivery and Speculation
Register Renaming
 Register renaming vs. reorder buffers
 Instead of virtual registers from reservation stations and
reorder buffer, create a single register pool

Contains visible registers and virtual registers
 Use hardware-based map to rename registers during issue
 WAW and WAR hazards are avoided
 Speculation recovery occurs by copying during commit
 Still need a ROB-like queue to update table in order
 Simplifies commit:

Record that mapping between architectural register and physical register
is no longer speculative

Free up physical register used to hold older value

In other words: SWAP physical registers on commit
 Physical register de-allocation is more difficult
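A sketch of the issue-time renaming step with a unified physical register pool. The table sizes, the free-list policy, and the register names here are illustrative assumptions, not the text's:

```python
# Issue-time register renaming with a single physical register pool.
NUM_ARCH = 4                       # architectural registers r0..r3
NUM_PHYS = 8                       # physical registers p0..p7

map_table = list(range(NUM_ARCH))  # arch reg -> current physical reg
free_list = list(range(NUM_ARCH, NUM_PHYS))

def rename(dst, srcs):
    """Rename one instruction: read source mappings, then allocate a fresh
    physical register for the destination (removes WAW/WAR hazards)."""
    phys_srcs = [map_table[s] for s in srcs]  # sources read the OLD mappings
    old_dst = map_table[dst]                  # freed later, at commit
    new_dst = free_list.pop(0)
    map_table[dst] = new_dst
    return new_dst, phys_srcs, old_dst

# add r1, r2, r3 followed by add r1, r1, r2: the second write to r1 gets
# a fresh physical register, so there is no WAW hazard between them.
d1, s1, _ = rename(1, [2, 3])
d2, s2, _ = rename(1, [1, 2])
print(d1, s1)  # 4 [2, 3]
print(d2, s2)  # 5 [4, 2]
```

Note the second instruction's source r1 correctly maps to p4, the physical register allocated by the first instruction, which is how the true (RAW) dependence is preserved while the name dependences disappear.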



Adv. Techniques for Instruction Delivery and Speculation
Integrated Issue and Renaming
 Combining instruction issue with register
renaming:
 Issue logic pre-reserves enough physical
registers for the bundle (fixed number?)
 Issue logic finds dependencies within
bundle, maps registers as necessary
 Issue logic finds dependencies between
current bundle and already in-flight bundles,
maps registers as necessary



Adv. Techniques for Instruction Delivery and Speculation
How Much?
 How much to speculate
 Mis-speculation degrades performance and
power relative to no speculation

May cause additional misses (cache, TLB)
 Prevent speculative code from causing higher-cost misses (e.g. L2 misses)

 Speculating through multiple branches
 Complicates speculation recovery
 No processor can resolve multiple branches per cycle
Adv. Techniques for Instruction Delivery and Speculation
Energy Efficiency
 Speculation and energy efficiency
 Note: speculation is only energy efficient
when it significantly improves performance

 Value prediction
 Uses:

Loads that load from a constant pool

Instructions that produce a value from a small set of values
 Has not been incorporated into modern processors
 A similar idea, address aliasing prediction, has been used in a few processors
Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)
          x (Cycle time_unpipelined / Cycle time_pipelined)

For the simple RISC pipeline, Ideal CPI = 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI)
          x (Cycle time_unpipelined / Cycle time_pipelined)



Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clock_unpipe/clock_pipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clock_unpipe/(clock_unpipe/1.05))
         = (Pipeline Depth/1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
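The comparison above can be reproduced numerically; this sketch just plugs the slide's numbers into the speedup equation:

```python
# Machine A (dual-ported memory) vs. Machine B (single-ported, faster clock).
depth = 5          # any pipeline depth works; it cancels in the final ratio

# Machine A: dual-ported memory, so loads cause no structural stall.
speedup_a = depth / (1 + 0.0)

# Machine B: single-ported memory costs 1 stall per load (40% of
# instructions), partly offset by a 1.05x faster clock.
speedup_b = depth / (1 + 0.4 * 1) * 1.05

print(round(speedup_a / speedup_b, 2))   # 1.33 -> Machine A wins
```

The structural-hazard penalty (a factor of 1.4 in CPI) outweighs the 5% clock advantage, which is why the dual-ported design comes out 1.33x ahead.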
Evaluating Branch Alternatives
Assume 4% unconditional branches, 6% untaken conditional branches, and
10% taken conditional branches; deeper pipeline: branch target &
condition are known at the end of the 3rd & 4th stage, respectively.
What is the addition to CPI?
Scheduling          Uncond.   Untaken   Taken
scheme              penalty   penalty   penalty
Stall pipeline         2         3         3
Predict taken          2         3         2
Predict not taken      2         0         3

Scheduling          Uncond.    Untaken    Taken      All
scheme              branches   branches   branches   branches
Frequency             4%         6%        10%        20%
Stall pipeline       0.08       0.18       0.30       0.56
Predict taken        0.08       0.18       0.20       0.46
Predict not taken    0.08       0.00       0.30       0.38
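Each entry in the lower table is penalty x frequency, summed across branch types. A quick sketch reproducing the arithmetic:

```python
# CPI added by control hazards under each scheduling scheme.
freq = {"uncond": 0.04, "untaken": 0.06, "taken": 0.10}
penalty = {
    "stall":        {"uncond": 2, "untaken": 3, "taken": 3},
    "pred_taken":   {"uncond": 2, "untaken": 3, "taken": 2},
    "pred_untaken": {"uncond": 2, "untaken": 0, "taken": 3},
}

for scheme, p in penalty.items():
    added_cpi = sum(freq[k] * p[k] for k in freq)
    print(scheme, round(added_cpi, 2))
# stall 0.56, pred_taken 0.46, pred_untaken 0.38
```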
More branch evaluations
• Suppose branch frequencies (as a percentage of all instructions) of
15% conditional branches and 1% jumps and calls, and that 60% of
conditional branches are taken. Consider a 4-stage pipeline where
branches are resolved at the end of the 2nd cycle for unconditional
branches and at the end of the 3rd cycle for conditional branches.
How much faster would the machine be without any branch hazards,
ignoring other pipeline stalls?
Pipeline speedup_ideal = Pipeline depth/(1 + Pipeline stalls)
                       = 4/(1+0) = 4
Pipeline stalls_real = (1 x 1%) + (2 x 9%) + (0 x 6%) = 0.19
  (taken conditional branches: 60% x 15% = 9%; untaken: 40% x 15% = 6%)

Pipeline speedup_real = 4/(1+0.19) = 3.36

Speedup without control hazards = 4/3.36 = 1.19
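The same calculation as code, using the frequencies and resolution points from the problem statement above:

```python
# Stalls come only from unconditional branches (1 cycle, resolved end of
# cycle 2) and taken conditional branches (2 cycles, resolved end of cycle 3).
depth = 4
cond, uncond, taken_frac = 0.15, 0.01, 0.60

stalls = 1 * uncond + 2 * (cond * taken_frac) + 0 * (cond * (1 - taken_frac))
speedup_real = depth / (1 + stalls)       # ~3.36
speedup_ideal = depth / (1 + 0)           # 4

print(round(stalls, 2))                        # 0.19
print(round(speedup_ideal / speedup_real, 2))  # 1.19
```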



More branch question
• A reduced hardware implementation of the
classic 5-stage RISC pipeline might use the EX
stage hardware to perform a branch instruction
comparison and then not actually deliver the
branch target PC to the IF stage until the clock
cycle in which the branch reaches the MEM
stage. Control hazard stalls can be reduced by
resolving branch instructions in ID, but
improving performance in one aspect may
reduce performance in other circumstances.
• How does determining branch outcome in the ID
stage have the potential to increase data hazard
stall cycles?



Problems with Pipelining
• Exception: An unusual event happens to an
instruction during its execution
– Examples: divide by zero, undefined opcode
• Interrupt: Hardware signal to switch the
processor to a new instruction stream
– Example: a sound card interrupts when it needs more audio
output samples (an audio “click” happens if it is left waiting)
• Problem: The exception or interrupt must appear to occur
between 2 instructions (Ii and Ii+1)
– The effect of all instructions up to and including Ii is totally
complete
– No effect of any instruction after Ii has taken place
• The interrupt (exception) handler either aborts the
program or restarts it at instruction Ii+1



Homework
• C.1.(e)
– 10-stage pipeline, full forwarding & bypassing, predicted taken
– Branch outcomes and targets are obtained at the end of the ID stage.

1 2 3 4 5 6 7 8 9 10 11 12 13 14
LD R1,0(R2) IF1 IF2 ID1 ID2 EX1 EX2 MEM1 MEM2 WB1 WB2
DADDI R1, R1, #1 IF1 IF2 ………………….
SD R1, 0(R2) IF1 ……………………



Homework
• C.1.(e)
– “For example, data are forwarded from the output of the second EX stage to the input of the
first EX stage, still causing 1-cycle delay”
– Example: (not as the same as the problem)

Data Dependence: Read After Write


                 1    2    3    4    5    6     7     8     9     10    11    12
ADD R1, R2, R3   IF1  IF2  ID1  ID2  EX1  EX2   MEM1  MEM2  WB1   WB2
SUB R4, R5, R1        IF1  IF2  ID1  ID2  stall EX1   EX2   MEM1  MEM2  WB1   WB2

No Data Dependence

                 1    2    3    4    5    6    7     8     9     10    11
ADD R1, R2, R3   IF1  IF2  ID1  ID2  EX1  EX2  MEM1  MEM2  WB1   WB2
SUB R4, R5, R6        IF1  IF2  ID1  ID2  EX1  EX2   MEM1  MEM2  WB1   WB2



Homework
• C.1.(g)
– CPI = The total number of clock cycles / the total number of
instructions of code segment



Homework
• C.2.(a) & C.2.(b)
– “Assuming that only the first pipe stage can always be done
independent of whether the branch goes”
– An indicator of “predicted not taken”
– The first stage of the instruction after the branch instruction
could be done
– Example: 4-stage pipeline
– Unconditional branch: “at the end of the second cycle”
                     1   2   3   4   5   6
J loop               IF  ID  EX  WB
Next instruction         IF
Target instruction           IF  ID  EX  WB



Homework
• C.2.(a) & C.2.(b)
– Conditional Branch: “at the end of the third cycle”
– If the branch is not taken, penalty = ?
                      1   2   3      4   5   6   7
BNEZ R4, loop         IF  ID  EX     WB
Next instruction          IF  stall  ID  EX  WB
Next + 1 instruction                 IF  ID  EX  WB

– If the branch is taken, penalty = ?


                      1   2   3      4   5   6   7
BNEZ R4, loop         IF  ID  EX     WB
Next instruction          IF  stall  (squashed)
Target instruction                   IF  ID  EX  WB

