Computer Architecture
ELE 475 / COS 475
Slide Deck 4: Superscalar 1
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Types of Data Hazards
Consider executing a sequence of
rk ri op rj
type of instructions
Data-dependence
r3 r1 op r2 Read-after-Write
r5 r3 op r4 (RAW) hazard
Anti-dependence
r3 r1 op r2 Write-after-Read
r1 r4 op r5 (WAR) hazard
Output-dependence
r3 r1 op r2 Write-after-Write
r3 r6 op r7 (WAW) hazard
2
Introduction to Superscalar Processor
• Processors studied so far are fundamentally
limited to CPI >= 1
• Superscalar processors enable CPI < 1 (IPC > 1)
by executing multiple instructions in parallel
• Can have both in-order and out-of-order
superscalar processors. We will start with in-
order.
3
Baseline 2-Way In-Order Superscalar
Processor
IR0 Branch Cond.
ALU
PC
addr
rdata A
RF RF
Instr. IR1 Read
Cache
Write
ALU
addr
B rdata
Data
Cache
Pipe A: Integer Ops., Branches
Pipe B: Integer Ops., Memory 4
Baseline 2-Way In-Order Superscalar
Processor
4 Read 2 Write
IR0 Ports Branch Cond. Ports
ALU
PC
addr
rdata A
RF RF
Instr. IR1 Read
Cache
Write
ALU
addr
B rdata
Data
Fetch 2 Instructions at Cache
same time
Pipe A: Integer Ops., Branches
Pipe B: Integer Ops., Memory 5
Baseline 2-Way In-Order Superscalar
Processor
IR0 Branch Cond.
ALU
PC
addr
rdata A
RF RF
Instr. IR1 Read
Cache
Write
ALU
addr
B rdata
Data
Issue Logic / Cache
Instruction
Steering Pipe A: Integer Ops., Branches
Pipe B: Integer Ops., Memory 6
Baseline 2-Way In-Order Superscalar
Duplicate Control
Processor
IR0 Decode A
IR1 Decode B
IR0 Branch Cond.
ALU
PC
addr
rdata A
RF RF
Instr. IR1 Read
Cache
Write
ALU
addr
B rdata
Data
Cache
Pipe A: Integer Ops., Branches
Pipe B: Integer Ops., Memory 7
Issue Logic Pipeline Diagrams
OpA F D A0 A1 W CPI = 0.5 (IPC = 2)
OpB F D B0 B1 W
OpC F D A0 A1 W Double Issue Pipeline
Can have two instructions in
OpD F D B0 B1 W same stage at same time
OpE F D A0 A1 W
OpF F D B0 B1 W
ADDIU F D A0 A1 W
LW F D B0 B1 W
Instruction Issue Logic swaps from
LW F D B0 B1 W natural position
ADDIU F D A0 A1 W
LW F D B0 B1 W
Structural
LW F D D B0 B1 W Hazard
8
Dual Issue Data Hazards
No Bypassing:
ADDIU R1,R1,1 F D A0 A1 W
ADDIU R3,R4,1 F D B0 B1 W
ADDIU R5,R6,1 F D A0 A1 W
ADDIU R7,R5,1 F D D D D A0 A1 W
Full Bypassing:
ADDIU R1,R1,1 F D A0 A1 W
ADDIU R3,R4,1 F D B0 B1 W
ADDIU R5,R6,1 F D A0 A1 W
ADDIU R7,R5,1 F D D A0 A1 W 9
Dual Issue Data Hazards
Order Matters:
ADDIU R1,R1,1 F D A0 A1 W
ADDIU R3,R4,1 F D B0 B1 W
ADDIU R7,R5,1 F D A0 A1 W
ADDIU R5,R6,1 F D B0 B1 W
WAR Hazard Possible?
10
Fetch Logic and Alignment
Cyc Addr Instr
0 0x000 OpA 0x000 0 0 1 1
0 0x004 OpB
1 0x008 OpC …
1 0x00C J 0x100
… 0x100 2 2
2 0x100 OpD
2 0x104 J 0x204 …
…
3 0x204 OpE 0x200 3 3
3 0x208 J 0x30C
…
…
4 0x30C OpF 0x300 4
4 0x310 OpG
5 0x314 OpH 0x310 4 5
Fetching across cache Lines is
very hard. May need extra ports 11
Fetch Logic and Alignment
Cyc Addr Instr
0 0x000 OpA Ideal, No Alignment Constraints
0 0x004 OpB
1 0x008 OpC OpA F D A0 A1 W
1 0x00C J 0x100 OpB F D B0 B1 W
… OpC F D B0 B1 W
2 0x100 OpD J F D A0 A1 W
2 0x104 J 0x204 OpD F D B0 B1 W
… J F D A0 A1 W
3 0x204 OpE OpE F D B0 B1 W
3 0x208 J 0x30C J F D A0 A1 W
… OpF F D A0 A1 W
4 0x30C OpF OpG F D B0 B1 W
4 0x310 OpG OpH F D A0 A1 W
5 0x314 OpH
12
With Alignment Constraints
Cyc Addr Instr
? 0x000 OpA 0x000 0 0 1 1
? 0x004 OpB
? 0x008 OpC …
? 0x00C J 0x100
… 0x100 2 2
? 0x100 OpD
? 0x104 J 0x204 …
…
? 0x204 OpE 0x200 3 3 4 4
? 0x208 J 0x30C
…
…
? 0x30C OpF 0x300 5 5
? 0x310 OpG
? 0x314 OpH 0x310 6 6
13
With Alignment Constraints
Cyc Addr Instr
1 0x000 OpA F D A0 A1 W
1 0x004 OpB F D B0 B1 W
2 0x008 OpC F D B0 B1 W
2 0x00C J 0x100 F D A0 A1 W
3 0x100 OpD F D B0 B1 W
3 0x104 J 0x204 F D A0 A1 W
4 0x200 ? F - - - -
4 0x204 OpE F D A0 A1 W
5 0x208 J 0x30C F D A0 A1 W
5 0x20C ? F - - - -
6 0x308 ? F - - - -
6 0x30C OpF F D A0 A1 W
7 0x310 OpG F D A0 A1 W
7 0x314 OpH F D B0 B1 W
14
Precise Exceptions and Superscalars
• Similar to tracking program order for data
dependencies, we need to track order for
exceptions
LW F D B0 B1 W
SYSCALL F D A0 A1 W
LW is in B pipeline, but commits first in logical
order!
15
Bypassing in Superscalar Pipelines
IR0 Branch Cond.
ALU
PC
addr
rdata A
RF RF
Instr. IR1 Read
Cache
Write
ALU
addr
B rdata
Data
Cache
16
Bypassing in Superscalar Pipelines
Branch Cond.
ALU
A
RF
Write
ALU
addr
B rdata
Data
Cache
17
Bypassing in Superscalar Pipelines
Branch Cond.
ALU
A
RF
Write
ALU
addr
B rdata
Data
Cache
18
Bypassing in Superscalar Pipelines
Branch Cond.
ALU
A1 3 5
RF
Write
ALU
addr
B2 rdata
Data
4 6
Cache
19
123456
Breaking Decode and Issue Stage
• Bypass Network can become very complex
• Can motivate breaking Decode and Issue Stage
D = Decode, Possibly resolve structural Hazards
I = Register file read, Bypassing, Issue/Steer
Instructions to proper unit
OpA F D I A0 A1 W
OpB F D I B0 B1 W
OpC F D I A0 A1 W
OpD F D I B0 B1 W
20
Superscalars Multiply Branch Cost
BEQZ F D I A0 A1 W
OpA F D I B0 - -
OpB F D I - - -
OpC F D I - - -
OpD F D - - - -
OpE F D - - - -
OpF F - - - - -
OpG F - - - - -
OpH F D I A0 A1 W
OpI F D I B0 B1 W
21
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
• MIT material derived from course 6.823
• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750
22
Copyright © 2013 David Wentzlaff
23