Dynamic Scheduling Computer Architecture
Chapter 3
Instruction-Level Parallelism
and Its Exploitation
A basic block is a straight-line code sequence: no branch in except at the entry, and no branch out except at the exit.
Typical size of a basic block = 3-6 instructions
To find enough parallelism, we must optimize across branches
Basic Pipelining & Performance 3
Introduction
Data Dependence
Loop-Level Parallelism
Unroll loop statically or dynamically
Use SIMD (vector processors and GPUs)
Challenges:
Data dependence: instruction j is data dependent on instruction i if
- instruction i produces a result that may be used by instruction j, or
- instruction j is data dependent on instruction k, and instruction k is
data dependent on instruction i (dependence is transitive through a chain)
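For example, in MIPS-style code (a small illustrative sequence, not from the slides):

```asm
L.D   F0,0(R1)      ; i: produces F0
ADD.D F4,F0,F2      ; k: data dependent on i (uses F0)
S.D   F4,0(R1)      ; j: data dependent on k, hence on i through the chain
```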
Simple, RISC-style instruction sets make pipelining easy:
- base + displacement addressing, no indirection
- simple branch conditions; jump
- see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Basic Pipelining: e.g., MIPS
Register-Register
  31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-11: Rd | 10-6: — | 5-0: Opx
Register-Immediate
  31-26: Op | 25-21: Rs1 | 20-16: Rd | 15-0: immediate
Branch
  31-26: Op | 25-21: Rs1 | 20-16: Rs2/Opx | 15-0: immediate
Jump / Call
  31-26: Op | 25-0: target
[Figure sequence: single-cycle datapath (PC, instruction memory, register file, ALU) stepped through the first stages.
Instruction Fetch (IF): fetch the instruction from memory at the PC; move the PC to the next instruction.
Instruction Decode (ID): decode the instruction ADD R3, R1, R2 and read the source operands from the register file (R0 = 0, R1 = x, R2 = y, R3 = z, ..., R31).]
Execution (EX)
ADD R3, R1, R2 => R3 = R1 + R2; the ALU computes x + y from the operands read in ID.
Writeback (WB)
Write the result x + y back to register R3 in the register file.
The same stages for a load, LW R3, R31, 200 => R3 = mem[R31 + 200] (with R31 = n):
Execution (EX): the ALU computes the effective address n + 200.
Memory access: read the memory at address n + 200 (value 123456 in the figure).
Writeback (WB): write the loaded value to register R3.
5-stage pipeline: which resource each stage uses, and which instruction classes use it
IF: instruction memory — ALU, load/store, branch
ID: register file read — ALU, load/store, branch
EX: ALU — ALU ops, load/store (address calculation)
MEM: data memory — load/store
WB: register file write — ALU, load
[Figure: single-cycle datapath — PC, instruction memory, register file (RS1, RS2, RD), sign-extended immediate, ALU with input muxes, data memory, and a write-back (WB Data) mux; an adder computes Next SEQ PC = PC + 4 and Zero? tests the branch condition.]
IF: IR <= mem[PC]; PC <= PC + 4
EX/WB: Reg[IRrd] <= Reg[IRrs1] op(IRop) Reg[IRrs2]
Source: http://en.wikipedia.org/wiki/Classic_RISC_pipeline
Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows the overlap among the parts of
the data path, with clock cycle 5 (CC 5) showing the steady-state situation. Because the register file is used as a source in the ID
stage and as a destination in the WB stage, it appears twice. We show that it is read in one part of the stage and written in another by
using a solid line, on the right or left, respectively, and a dashed line on the other side. The abbreviation IM is used for instruction
memory, DM for data memory, and CC for clock cycle.
[Figure: the same datapath pipelined with IF/ID, ID/EX, EX/MEM, and MEM/WB latches; Next SEQ PC, the register-file read ports (RS1, RS2), the ALU muxes, data memory, and the WB-data mux are separated by the latches, and the destination register RD travels down the pipeline with the instruction.]
Per-stage register transfers:
IF:  IR <= mem[PC]; PC <= PC + 4
ID:  A <= Reg[IRrs1]; B <= Reg[IRrs2]
EX:  rslt <= A op(IRop) B
MEM: WB <= rslt
WB:  Reg[IRrd] <= WB
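The per-stage register transfers can be sketched as a toy interpreter (a sketch only — the register/memory contents and the tuple instruction encoding are illustrative, and MEM is a no-op for an ALU instruction):

```python
def run_add(reg, mem, pc):
    # IF: IR <= mem[PC]; PC <= PC + 4
    ir = mem[pc]
    pc += 4
    # ID: A <= Reg[IRrs1]; B <= Reg[IRrs2]
    op, rd, rs1, rs2 = ir              # illustrative tuple encoding
    a, b = reg[rs1], reg[rs2]
    # EX: rslt <= A op B
    rslt = a + b if op == "ADD" else None
    # MEM: WB <= rslt (no memory access for an ALU instruction)
    wb = rslt
    # WB: Reg[IRrd] <= WB
    reg[rd] = wb
    return pc

reg = {1: 7, 2: 5, 3: 0}
mem = {0: ("ADD", 3, 1, 2)}            # ADD R3, R1, R2
pc = run_add(reg, mem, 0)              # reg[3] becomes 12, pc becomes 4
```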
Visualizing Pipelining
Figure A.2, Page A-8
Time (clock cycles)
[Figure: five instructions in program order, each occupying Ifetch, Reg, ALU, DMem, Reg in successive clock cycles; in any one cycle the pipeline overlaps up to five different instructions, one per stage.]
[Figure: structural hazard with a single memory port. Sequence: Load, Instr 1, Instr 2, Instr 3. Instr 3's Ifetch would need the memory port in the same cycle as the Load's DMem access; with one port, Instr 3 stalls and a bubble propagates through every stage.]
Data Hazard on R1
Figure A.6, Page A-17
            IF  ID/RF  EX  MEM  WB
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
[Figure: add writes r1 in WB (cycle 5), but sub, and, and or read r1 before then — RAW hazards; xor reads r1 after the write and is safe. A second diagram shows forwarding from the EX/MEM and MEM/WB latches delivering r1 to the dependent instructions without stalling.]
[Figure: datapath with forwarding hardware — muxes at the ALU inputs select among the register file (via ID/EX) and the EX/MEM and MEM/WR latches; NextPC, Registers, Immediate, and Data Memory as before.]
Forwarding to avoid data hazards:
add r1,r2,r3
lw  r4, 0(r1)
sw  r4,12(r1)
or  r8,r6,r9
xor r10,r9,r11
[Figure: r1 is forwarded to the lw and sw address calculations, and the loaded r4 is forwarded from memory output to memory input for the sw.]
Any hazard that cannot be avoided with forwarding?
Load-use hazard:
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
[Figure: even with forwarding, the loaded r1 is not available until the end of MEM, so sub must stall one cycle; the bubble delays sub, and, and or by one cycle each.]
Control Dependence
Determines the ordering of an instruction with respect to a branch
instruction:
An instruction that is control dependent on a branch cannot be moved
before the branch, so that its execution is no longer controlled
by the branch (e.g., cannot move s1 before if (p1))
An instruction that is not control dependent on a branch cannot be
moved after the branch, so that its execution is controlled by
the branch (e.g., cannot move p3 = 0 into {s2})
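A minimal sketch of the two rules, using the slide's p1/p2/s1/s2 names (the concrete values are made up for illustration):

```python
def run(p1, p2):
    x = 0
    if p1:
        x = 10        # s1: control dependent on p1 -- must not be hoisted
                      # above the branch, or it would always execute
    p3 = 5
    if p2:
        x = 20        # s2: control dependent on p2
    p3 = 0            # NOT control dependent on p2: sinking it into the
                      # if-p2 block would wrongly skip it when p2 is false
    return x, p3

# run(1, 0) -> (10, 0); run(0, 1) -> (20, 0): p3 = 0 happens either way
```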
Control Hazard:
The stages are Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back; the next PC is not known at fetch time because the branch is resolved late in the pipeline.
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB latches) — the Zero? test and branch target adder sit late in the pipeline, so several instructions are fetched before the branch outcome is known.]
Branch hazard example:
10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Figure: the three sequential instructions after the beq enter the pipeline before the branch outcome (target 36) is known.]
[Figure: datapath revised to reduce the branch penalty — an extra adder computes the branch target and the Zero? test is moved into ID, so the branch resolves one stage earlier.]
Branch delay of length n:
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken
A 1-slot delay allows a proper branch decision and target address in the 5-stage pipeline.
Delayed Branch
Compiler effectiveness for a single branch delay slot:
- Fills about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots are useful in computation
- So about 50% (60% x 80%) of slots are usefully filled
Delayed-branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and needs more than one delay slot; delayed branching has lost popularity compared to more expensive but more flexible dynamic branch prediction.
Compiler Techniques for Exposing ILP
Pipeline scheduling: separate a dependent instruction from the
source instruction by the pipeline latency of
the source instruction.
Example:
for (i = 999; i >= 0; i = i - 1)
    x[i] = x[i] + s;
7 clock cycles per iteration, but just 3 for the real work (L.D, ADD.D, S.D) and 4 for loop
overhead; how can we make it faster?
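The 7-cycle figure corresponds to the pipeline-scheduled loop body; a sketch following the standard MIPS treatment (assuming R2 holds the loop-exit value of R1 and the usual FP latencies):

```asm
Loop: L.D    F0,0(R1)     ; cycle 1
      DADDUI R1,R1,#-8    ; cycle 2 -- fills the load delay slot
      ADD.D  F4,F0,F2     ; cycle 3
      (stall)             ; cycle 4
      (stall)             ; cycle 5
      S.D    F4,8(R1)     ; cycle 6 -- offset adjusted for the moved DADDUI
      BNE    R1,R2,Loop   ; cycle 7
```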
Unroll Loop Four Times (straightforward way)

1  Loop: L.D    F0,0(R1)
           (1 cycle stall)
3        ADD.D  F4,F0,F2
           (2 cycles stall)
6        S.D    0(R1),F4     ;drop DSUBUI & BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8    ;drop DSUBUI & BNEZ
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12  ;drop DSUBUI & BNEZ
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32   ;alter to 4*8
26       BNEZ   R1,LOOP

Rewrite the loop to minimize stalls?
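After scheduling, the unrolled loop can run without stalls; a sketch of the standard result (the two stores that follow DADDUI use offsets 16 and 8 to compensate for the moved decrement):

```asm
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    0(R1),F4
      S.D    -8(R1),F8
      DADDUI R1,R1,#-32
      S.D    16(R1),F12
      S.D    8(R1),F16
      BNEZ   R1,LOOP
```

14 clock cycles for 4 elements, i.e. 3.5 cycles per element.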
Static branch prediction: one scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run.
[Chart: misprediction rates of profile-based static prediction on SPEC benchmarks, roughly 4%-18%; integer programs (e.g., compress, eqntott, espresso, gcc, li) mispredict more often than floating-point programs (e.g., doduc, hydro2d, su2cor).]
Dynamic Branch Prediction
• Why does prediction work?
– Underlying algorithm has regularities
– Data that is being operated on has regularities
– Instruction sequence has redundancies that are artifacts of the
way that humans/compilers think about problems
• Is dynamic branch prediction better than static
branch prediction?
– Seems to be
– There are a small number of important branches in programs
which have dynamic behavior
[FSM: 2-bit predictor with four states — Predict Taken (strong, weak) and Predict Not Taken (weak, strong). A taken outcome (T) moves one step toward strongly taken; a not-taken outcome (NT) moves one step toward strongly not taken, so two consecutive mispredictions are needed to flip the prediction.]
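The 2-bit scheme can be sketched as a saturating counter (a minimal sketch; the initial state and the branch trace are illustrative):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""
    def __init__(self, state=2):           # start at weakly taken
        self.state = state

    def predict(self):
        return self.state >= 2             # True = predict taken

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch that is taken 9 times and then falls through once:
p = TwoBitPredictor()
hits = 0
for taken in [True] * 9 + [False]:
    hits += (p.predict() == taken)
    p.update(taken)
# hits == 9: only the final, not-taken iteration is mispredicted
```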
[Chart: misprediction rates of a 2-bit branch-history table on SPEC89 benchmarks, from about 0%-1% on floating-point codes (e.g., nasa7, fpppp) up to about 9%-12% on integer codes (e.g., eqntott, li, gcc); Integer vs. Floating Point.]
Correlated Branch Prediction
• Idea: record the m most recently executed branches as taken
or not taken, and use that pattern to select the proper one of
2^m n-bit branch history tables
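A minimal sketch of an (m, n) = (2, 2) correlating predictor (the table size, the indexing by a hypothetical branch address pc, and the alternating branch trace are all illustrative):

```python
class CorrelatingPredictor:
    """(m, n) = (2, 2): 2 bits of global history select one of four
    2-bit saturating counters within each table entry."""
    def __init__(self, m=2, entries=16):
        self.m = m
        self.entries = entries
        self.history = 0                                  # last m outcomes
        self.table = [[1] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        return self.table[pc % self.entries][self.history] >= 2

    def update(self, pc, taken):
        ctrs = self.table[pc % self.entries]
        h = self.history
        ctrs[h] = min(3, ctrs[h] + 1) if taken else max(0, ctrs[h] - 1)
        self.history = ((h << 1) | int(taken)) & ((1 << self.m) - 1)

# A branch that strictly alternates taken / not taken: a plain 2-bit
# predictor does poorly here, but the history-selected counters learn it.
p = CorrelatingPredictor()
hits = 0
for i in range(20):
    taken = (i % 2 == 0)
    hits += (p.predict(pc=4) == taken)
    p.update(pc=4, taken=taken)
# hits == 18: only two mispredictions, both during warm-up
```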
[Chart: misprediction rates on SPEC89 (nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li) for three predictors — 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries of (2,2). The (2,2) correlating predictor generally matches or beats even the unlimited simple 2-bit predictor while using fewer total bits.]
• Local predictor
– Local history table: 1,024 10-bit entries recording the last 10
outcomes of each branch, indexed by branch address
– The pattern of the last 10 occurrences of that particular branch is
used to index a table of 1K entries with 3-bit saturating counters
[Chart: branch mispredictions per 1000 instructions on SPECint2000 (e.g., 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) vs. SPECfp2000 (e.g., 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa). Integer codes mispredict far more often (up to about 9 per 1000 instructions) than floating-point codes (often 0-1 per 1000).]
Branch Target Buffers (BTB)
Advantages:
Compiler doesn’t need to have knowledge of
microarchitecture
Handles cases where dependencies are
unknown at compile time
Helps to tolerate unpredictable delay (cache
misses)
Disadvantage:
Substantial increase in hardware complexity
Dynamic Scheduling
Dynamic scheduling implies:
Out-of-order execution
Out-of-order completion
Tomasulo’s Approach
Tracks when operands are available
Introduces register renaming in hardware
Minimizes WAW and WAR hazards
Register Renaming
Dependence types: RAR, RAW, WAR, WAW (only RAW is a true data dependence; WAR and WAW are name dependences)
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8     ; antidependence on F8 with SUB.D below
S.D   F6,0(R1)     ; antidependence on F6 with MUL.D below
SUB.D F8,F10,F14
MUL.D F6,F10,F8

Renamed with temporaries S and T:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
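The renaming step can be sketched in software (a sketch only: the tag names T0, T1, ... and the tuple encoding are illustrative, and the store is omitted since it writes no register):

```python
def rename(instrs):
    """Give every destination a fresh physical name; sources read the
    current mapping, so only true (RAW) dependences remain."""
    mapping = {}                                   # architectural -> physical
    tags = iter(f"T{i}" for i in range(10000))
    out = []
    for op, dst, *srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]   # read before redefining dst
        tag = next(tags)
        mapping[dst] = tag
        out.append((op, tag, *srcs))
    return out

code = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("SUB.D", "F8", "F10", "F14"),   # WAR on F8 vs. ADD.D's read
    ("MUL.D", "F6", "F10", "F8"),    # WAW on F6 vs. ADD.D's write
]
renamed = rename(code)
# ADD.D still reads the old F8 while SUB.D writes a fresh tag, and the two
# writes of F6 now target distinct tags -- the WAR and WAW hazards are gone.
```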
Top-level design:
[Figure: reservation stations Add1-Add3 feed the FP adders and Mult1-Mult2 feed the FP multipliers; results go to memory and are broadcast back to the reservation stations; a clock-cycle counter drives the example that follows.]
Tomasulo Example, Cycle 1

Instruction status:          Issue  Exec Comp  Write Result
LD    F6  34+ R2               1
LD    F2  45+ R3
MULTD F0  F2  F4
SUBD  F8  F6  F2
DIVD  F10 F0  F6
ADDD  F6  F8  F2

Load buffers:  Busy  Address
Load1          Yes   34+R2
Load2          No
Load3          No
Reservation Stations (columns: Time, Name, Busy, Op, Vj, Vk, Qj, Qk). Representative snapshot once both loads have completed:

Time  Name   Busy  Op     Vj     Vk     Qj     Qk
2     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

[Remaining cycle-by-cycle snapshots: initially Mult1 (MULTD) and Add1 (SUBD) wait on Load2 via Qj/Qk; after the loads complete, Add1's SUBD counts down 2, 1, 0 and writes; Add2 then holds ADDD (M-M), M(A2) and counts down 2, 1, 0; Mult1's MULTD counts down from 10 to 0 and writes; finally Mult2's DIVD, stalled on Mult1, counts its 40-cycle divide down to 0 and completes.]
[Figure sequence: Tomasulo with a reorder buffer (hardware speculation). Program: LD F0,10(R2); ADDD F10,F4,F0; DIVD F2,F10,F6; BNE F2,<…>; plus instructions speculated past the branch. ROB entries, oldest first, all not yet committed (N): ROB1 LD F0,10(R2); ROB2 ADDD F10,F4,F0; ROB3 DIVD F2,F10,F6; ROB4 BNE F2,<…>; later entries hold the speculated instructions. Reservation stations name ROB entries as operand sources — e.g., ADDD R(F4),ROB1 (dest 2), DIVD ROB2,R(F6) (dest 3), and the speculated ADDD ROB5,R(F6) (dest 6); load buffers hold addresses 10+R2 and 0+R3. When the LD completes, operands tagged ROB1 are replaced by the loaded value M[10].]
What about memory hazards???
Solutions:
Statically scheduled superscalar processors
VLIW (very long instruction word) processors
dynamically scheduled superscalar
processors
Value prediction
Uses:
- Loads that load from a constant pool
- Instructions that produce a value from a small set of values
Value prediction has not been incorporated into modern processors; a similar idea, address aliasing prediction, is used in some processors.
Speed Up Equation for Pipelining

Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Cycle time unpipelined / Cycle time pipelined)
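As a worked example (the numbers are illustrative, not from the slides):

```python
def pipeline_speedup(depth, stall_cpi, t_unpipelined, t_pipelined):
    # Speedup = depth / (1 + stall CPI) * (unpipelined / pipelined cycle time)
    return (depth / (1 + stall_cpi)) * (t_unpipelined / t_pipelined)

# 5-deep pipeline, 0.25 stall cycles per instruction, equal cycle times:
s = pipeline_speedup(depth=5, stall_cpi=0.25, t_unpipelined=1.0, t_pipelined=1.0)
# s == 4.0: stalls reduce the ideal 5x speedup by a factor of 1.25
```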
1 2 3 4 5 6 7 8 9 10 11 12 13 14
LD R1,0(R2) IF1 IF2 ID1 ID2 EX1 EX2 MEM1 MEM2 WB1 WB2
DADDI R1, R1, #1 IF1 IF2 ………………….
SD R1, 0(R2) IF1 ……………………
No Data Dependence
1 2 3 4 5 6 7 8 9 10 11 12 13
ADD R1, R2, R3 IF1 IF2 ID1 ID2 EX1 EX2 MEM1 MEM2 WB1 WB2
SUB R4, R5, R6 IF1 IF2 ID1 ID2 EX1 EX2 MEM1 MEM2 WB1 WB2