The document discusses the implementation of a RISC-V pipeline in computer architecture, focusing on CPU performance factors like instruction count, CPI, and cycle time. It reviews the unpipelined and pipelined datapaths, detailing the stages of a 5-stage pipeline and the control logic required to handle hazards such as structural and data hazards. Solutions for data hazards include interlocking and forwarding techniques to maintain efficient instruction execution.


Lecture 08: RISC-V Pipeline Implementation

CSCE 513 Computer Architecture

Department of Computer Science and Engineering


Yonghong Yan
yanyh@[Link]
[Link]

1
Acknowledgement
• Slides adapted from Computer Science 152: Computer
Architecture and Engineering, Spring 2016 by Dr. George
Michelogiannakis from UC Berkeley

2
Review
• CPU performance factors
– Instruction count
• Determined by ISA and compiler
– CPI and cycle time
• Determined by CPU hardware

CPU Time = (Instructions / Program) * (Cycles / Instruction) * (Time / Cycle)

• Three groups of instructions


– Memory reference: lw, sw
– Arithmetic/logical: add, sub, and, or, slt
– Control transfer: jal, jalr, b*

• CPI
– Single-cycle: CPI = 1, but normally a longer cycle time
– 5-stage unpipelined: CPI = 5
– 5-stage pipelined: CPI = 1
3
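As a quick worked example of the CPU time equation (the instruction count, CPI, and clock rate below are illustrative assumptions, not figures from the lecture):

# Worked example of CPU Time = (Instructions/Program) * (Cycles/Instruction) * (Time/Cycle).
# All numbers here are assumed for illustration only.
instructions = 1_000_000          # dynamic instruction count of the program
cpi = 1.2                         # average cycles per instruction
clock_hz = 1_000_000_000          # 1 GHz clock -> 1 ns cycle time

cpu_time = instructions * cpi * (1 / clock_hz)
print(f"CPU time = {cpu_time * 1e3:.3f} ms")   # 1.200 ms with these numbers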
Review: Unpipelined Datapath for RISC-V
[Datapath figure: the PC indexes the instruction memory; the fetched instruction drives the register file (GPRs), immediate select, ALU, and data memory, with branch-compare logic and ALU control. Control signals: PCSel (pc+4 / br / rind / jabs), RegWriteEn, MemWrite, WBSel, ImmSel, Op2Sel, WASel.]
4
Review: Hardwired Control Table
Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel
ALU * Reg Func no yes ALU rd pc+4
ALUi IType12 Imm Op no yes ALU rd pc+4
LW IType12 Imm + no yes Mem rd pc+4
SW SType12 Imm + yes no * * pc+4
BEQtrue SBType12 * * no no * * br
BEQfalse SBType12 * * no no * * pc+4
J * * * no no * * jabs
JAL * * * no yes PC X1 jabs
JALR * * * no yes PC rd rind

Op2Sel = Reg / Imm
WBSel = ALU / Mem / PC
WASel = rd / X1
PCSel = pc+4 / br / rind / jabs

5
An Ideal Pipeline

[Figure: objects flow through stage 1 → stage 2 → stage 3 → stage 4.]

• All objects go through the same stages


• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• The scheduling of an object entering the pipeline is not
affected by the objects in other stages
These conditions generally hold for industrial assembly lines.
In a laundry pipeline, two loads do not depend on each other.
But instructions depend on each other!
6
Technology Assumptions
• A small amount of very fast memory (caches)
backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported Register files (slower)
Thus, the following timing assumption is reasonable
tIF ~= tID/RF ~= tEX ~= tMEM ~= tWB

A 5-stage pipeline will be the focus of our detailed design


Some commercial designs have over 30 pipeline
stages to do an integer add!

7
5-Stage Pipelined Execution: Resource Usage
The Whole Pipeline Resources are Used by 5 Instructions in Every Cycle!
[Pipelined datapath figure: in any given cycle all five stages are busy with different instructions, e.g. I-Fetch (IF) – I5, Decode/Reg. Fetch (ID) – I4, Execute (EX) – I3, Memory (ME) – I2, Write-Back (WB) – I1.]
time t0 t1 t2 t3 t4 t5 t6 t7 ....
Instruction 1 IF1 ID1 EX1 ME1 WB1
Instruction 2 IF2 ID2 EX2 ME2 WB2
Instruction 3 IF3 ID3 EX3 ME3 WB3
Instruction 4 IF4 ID4 EX4 ME4 WB4
Instruction 5 IF5 ID5 EX5 ME5 WB5
8
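A small sketch (my own illustration, not from the slides) that reproduces this kind of stage-occupancy table for an ideal, hazard-free 5-stage pipeline:

# Print which stage each instruction occupies in every cycle of an ideal
# 5-stage pipeline: instruction i enters IF in cycle i and advances one stage per cycle.
STAGES = ["IF", "ID", "EX", "ME", "WB"]

def ideal_pipeline(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = [" .  "] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage}{i + 1} "
        print(f"I{i + 1}: " + "".join(row))

ideal_pipeline(5)   # matches the diagram: five instructions, all stages busy from t4 on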
Instruction Register in Each Stage

[Pipelined datapath figure: an IR is placed at the start of each stage, together with pipeline data registers A, B, Y, R, MD1, and MD2 between stages.]

• An instruction register (IR) in each stage holds the instruction currently in that stage

9
Connect Controls from Instruction Register
[Pipelined datapath figure: the stages F, D, E, M, W each hold a different instruction (e.g. Instruction 4 in decode, Instruction 3 in execute, Instruction 2 in memory, Instruction 1 in write-back). Each stage's control signals (ImmSel, Op2Sel, ALU control, MemWrite, WBSel, RegWriteEn) are decoded from that stage's own IR.]
10
Compared With Control Logic in Unpipelined

• Unpipelined:
– A single block of control logic uses the instruction from IF
• Pipelined:
– Distributed control logic uses the instruction held in each stage's IR

11
Instructions Interact With Each Other in Pipeline:
Dealing with Hazards
• An instruction may need a resource being used by another instruction → structural hazard
– Solution #1: Stall the newer instruction until the older instruction finishes
– Solution #2: Add more hardware to the design
• E.g., separate memory into I-memory and D-memory
– Our 5-stage pipeline has no structural hazards by design

• An instruction may depend on something produced by an earlier one
– Dependence may be on a data value or on using the same register (not the value) → data hazard
• Solutions for RAW hazards: #1 interlocking (bubble delay) and #2 forwarding
• WAR and WAW hazards: not possible in the 5-stage pipeline
– Dependence may be on the next instruction's address → control hazard (branches, exceptions)
• Solutions: delay, prediction, etc.
12
Read-After-Write (RAW) Data Hazards

[Datapath figure: the older instruction "x1 ← ..." is still in the pipeline while the younger instruction "x4 ← x1 ..." reads x1 in the decode stage.]

...
x1 ← x0 + 10
x4 ← x1 + 17
...

x1 in the GPRs contains a stale value, since passing a value between two instructions has to go through the GPRs (register file).
13
To Resolve Data Hazards: #1, Interlocking, i.e., Stall the Pipeline by Inserting Bubbles
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) x1 ← (x0) + 10 IF1 ID1 EX1 ME1 WB1
(I2) x4 ← (x1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2
(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3 WB3
(I4) stalled stages IF4 ID4 EX4 ME4 WB4
(I5) IF5 ID5 EX5 ME5 WB5

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I3 I3 I3 I4 I5
ID I1 I2 I2 I2 I2 I3 I4 I5
Resource EX I1 - - - I2 I3 I4 I5
Usage ME I1 - - - I2 I3 I4 I5
WB I1 - - - I2 I3 I4 I5

- ⇒ pipeline bubble
14
Insert Bubble for Interlocking
Bubble: a NOP instruction, e.g. add x0, x0, x0
Stall condition: when it holds, the instruction in ID is held back and a bubble is inserted in its place as it would enter the EX stage.

[Datapath figure: a mux in front of the EX-stage IR can select a bubble instead of the decoded instruction, driven by the stall condition.]

...
x1 ← x0 + 10
x4 ← x1 + 17
...

15
Interlock Control Logic
...
x1 ← x0 + 10
x4 ← x1 + 17
...

[Datapath figure: a Cstall block examines the source registers (rs1, rs2) of the instruction in decode and the write-select (ws) fields of the instructions still in flight, and drives the stall signal and the bubble mux.]

Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.
16
Interlock Control Logic (ignoring jumps & branches)

we: write enable, 1-bit on/off
ws: write select, 5-bit register number
re: read enable, 1-bit on/off
rs: read select, 5-bit register number

[Datapath figure: Cre derives re1/re2 from the decode-stage opcode; Cdest derives we/ws for the instructions in EX, MEM, and WB; Cstall compares rs1D and rs2D against wsEX, wsMEM, and wsWB to generate the stall signal that inserts bubbles.]

Should we always stall if an rs field matches some rd? No:
- not every instruction writes a register => we
- not every instruction reads a register => re

17
Source & Destination Registers
Instruction formats:
  ALU:           func7 | rs2 | rs1 | func3 | rd | opcode
  ALUI/LW/JALR:  immediate12 | rs1 | func3 | rd | opcode
  SW/Bcond:      imm | rs2 | rs1 | func3 | imm | opcode
  JAL:           Jump Offset[19:0] | rd | opcode

                                      source(s)   destination
  ALU    rd <= rs1 func10 rs2         rs1, rs2    rd
  ALUI   rd <= rs1 op imm             rs1         rd
  LW     rd <= M[rs1 + imm]           rs1         rd
  SW     M[rs1 + imm] <= rs2          rs1, rs2    -
  Bcond  true:  PC <= PC + imm        rs1, rs2    -
         false: PC <= PC + 4
  JAL    x1 <= PC, PC <= PC + imm     -           rd
  JALR   rd <= PC, PC <= rs1 + imm    rs1         rd

18
Deriving the Stall Signal
(source/destination table from the previous slide shown again for reference)

Cre:
  re1 = Case opcode
          ALU, ALUi, LW, SW, Bcond, JALR => on
          JAL                            => off
  re2 = Case opcode
          ALU, SW, Bcond => on
          ...            => off

Cdest:
  ws = rd
  we = Case opcode
         ALU, ALUi, LW, JALR => on
         ...                 => off

Cstall:
  stall = ((rs1D == wsEX)  && weEX  +
           (rs1D == wsMEM) && weMEM +
           (rs1D == wsWB)  && weWB) && re1D
        + ((rs2D == wsEX)  && weEX  +
           (rs2D == wsMEM) && weMEM +
           (rs2D == wsWB)  && weWB) && re2D

Note: the WB term is not strictly needed for interlock control, since we only have to handle hazards between MEM-EX and EX-EX; for two instructions in WB and EX with a RAW hazard, the dependence is handled through the register file.
19
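A minimal Python sketch of this interlock check (the dictionary-based representation and names like ws/we are my own shorthand for the slide's signals, not a real simulator API):

# Sketch of the interlock (stall) logic from the slide.
# Each downstream stage is summarized by (we, ws): write-enable and write-select (rd).
def reads_rs1(opcode):
    # re1: which opcodes read rs1 (JAL does not)
    return opcode in {"ALU", "ALUi", "LW", "SW", "Bcond", "JALR"}

def reads_rs2(opcode):
    # re2: which opcodes read rs2
    return opcode in {"ALU", "SW", "Bcond"}

def stall_signal(dec, ex, mem, wb):
    """dec = {'opcode', 'rs1', 'rs2'}; ex/mem/wb = {'we', 'ws'}."""
    def hazard(rs):
        return ((rs == ex["ws"] and ex["we"]) or
                (rs == mem["ws"] and mem["we"]) or
                (rs == wb["ws"] and wb["we"]))
    return ((hazard(dec["rs1"]) and reads_rs1(dec["opcode"])) or
            (hazard(dec["rs2"]) and reads_rs2(dec["opcode"])))

# Example: "x1 <- x0 + 10" is in EX while "x4 <- x1 + 17" is in decode -> stall.
dec = {"opcode": "ALUi", "rs1": 1, "rs2": 0}
ex  = {"we": True,  "ws": 1}
mem = {"we": False, "ws": 0}
wb  = {"we": False, "ws": 0}
print(stall_signal(dec, ex, mem, wb))   # True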
To Resolve Data Hazards: #2, Forwarding (Bypassing)
time t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) x1 ← x0 + 10 IF1 ID1 EX1 ME1 WB1
(I2) x4 ← x1 + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2
(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3
(I4) stalled stages IF4 ID4 EX4
(I5) IF5 ID5
Each stall or kill introduces a bubble in the pipeline
=> CPI > 1
A new datapath, i.e., a bypass, can get the data from
the output of the ALU to its input
time t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) x1 ← x0 + 10 IF1 ID1 EX1 ME1 WB1
(I2) x4 ← x1 + 17 IF2 ID2 EX2 ME2 WB2
(I3) IF3 ID3 EX3 ME3 WB3
(I4) IF4 ID4 EX4 ME4 WB4
(I5) IF5 ID5 EX5 ME5 WB5
20
Review: Hardware Support for Forwarding, and Detecting RAW
Hazards with Previous and 2nd Previous Instructions

• Slide 48 of lecture05-06

21
Adding a Bypass (To Bypass Register Files)
[Datapath figure: an ASrc mux in front of the ALU's A input lets the instruction in decode ("x4 <= x1 + 17") pick up x1 directly from the ALU output of the older instruction ("x1 <= ...") instead of reading the stale value from the register file.]

When does this bypass help?

  x1 <= x0 + 10       x1 <= M[x0 + 10]     JAL 500
  x4 <= x1 + 17       x4 <= x1 + 17        x4 <= x1 + 17
  Yes                 No (load → EX use)   No
22
The Bypass Signal: Deriving It from the Stall Signal
stall = ( ((rs1D == wsE) && weE + (rs1D == wsM) && weM + (rs1D == wsW) && weW) && re1D
+((rs2D == wsE) && weE + (rs2D == wsM) && weM + (rs2D == wsW) && weW) && re2D )

ws = rd
we = Case opcode
       ALU, ALUi, LW, JAL, JALR => on
       ...                      => off

ASrc = (rs1D == wsE) && weE && re1D

Is this correct?
No, because only ALU and ALUi instructions can benefit from this bypass.
Split weE into two components: we-bypass and we-stall.

23
Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
               ALU, ALUi => on
               ...       => off

we-stallE = Case opcodeE
              LW, JAL, JALR => on
              ...           => off

ASrc = (rs1D == wsE) && we-bypassE && re1D

stall = ((rs1D == wsE) && we-stallE +
         (rs1D == wsM) && weM +
         (rs1D == wsW) && weW) && re1D
      + ((rs2D == wsE) && weE +
         (rs2D == wsM) && weM +
         (rs2D == wsW) && weW) && re2D

24
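A sketch in the same spirit (again with my own naming), splitting the EX-stage write enable into a bypassable part and a stall-only part, shown here for the rs1 operand:

# Sketch of the split bypass/stall decision for the rs1 operand, per the slide.
# we_bypass: EX result is available at the ALU output (ALU/ALUi) and can be forwarded.
# we_stall:  EX will write a register but its value is not yet available (LW, JAL, JALR).
BYPASSABLE = {"ALU", "ALUi"}
STALL_ONLY = {"LW", "JAL", "JALR"}

def rs1_control(rs1_d, re1_d, ex_opcode, ws_e, we_m, ws_m, we_w, ws_w):
    we_bypass_e = ex_opcode in BYPASSABLE
    we_stall_e = ex_opcode in STALL_ONLY
    asrc = (rs1_d == ws_e) and we_bypass_e and re1_d           # take ALU output instead of rd1
    stall = ((rs1_d == ws_e and we_stall_e) or
             (rs1_d == ws_m and we_m) or
             (rs1_d == ws_w and we_w)) and re1_d
    return asrc, stall

# x1 <- x0 + 10 (ALUi) in EX, x4 <- x1 + 17 reading rs1 = x1 in decode: bypass, no stall.
print(rs1_control(rs1_d=1, re1_d=True, ex_opcode="ALUi", ws_e=1,
                  we_m=False, ws_m=0, we_w=False, ws_w=0))     # (True, False)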
Fully Bypassed Datapath
[Fully bypassed datapath figure: bypass muxes ASrc and BSrc in front of both ALU inputs can select results forwarded from the EX, MEM, or WB stages; the PC is also passed along for JAL.]

Is there still a need for the stall signal?
Yes, for the load-use case:

stall = (rs1D == wsE) && (opcodeE == LW) && (wsE != 0) && re1D
      + (rs2D == wsE) && (opcodeE == LW) && (wsE != 0) && re2D
25
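The remaining load-use check, sketched under the same naming assumptions:

# Even with full bypassing, a load in EX cannot forward its data to the
# instruction right behind it, so that one case still stalls.
def load_use_stall(rs1_d, re1_d, rs2_d, re2_d, ex_opcode, ws_e):
    load_in_ex = (ex_opcode == "LW") and (ws_e != 0)   # x0 never needs a stall
    return load_in_ex and ((rs1_d == ws_e and re1_d) or
                           (rs2_d == ws_e and re2_d))

# lw x1, 0(x2) in EX, add x4, x1, x3 in decode -> must stall one cycle.
print(load_use_stall(rs1_d=1, re1_d=True, rs2_d=3, re2_d=True,
                     ex_opcode="LW", ws_e=1))          # True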
Control Hazards: Branches and Jumps
• JAL: unconditional jump to PC+immediate

• JALR: indirect jump to rs1+immediate

• Branch: if (rs1 conds rs2), branch to PC+immediate

26
Info for Control Transfer
Two pieces of info:
1. Taken or Not Taken
2. Target address?

• JAL: unconditional jump to PC+immediate


• JALR: indirect jump to rs1+immediate
• Branch: if (rs1 conds rs2), branch to PC+immediate

Instruction Taken known? Target known?


JAL After Inst. Decode After Inst. Decode
JALR After Inst. Decode After Reg. Fetch
B<cond.> After Execute After Inst. Decode

27
Speculate Next Address is PC+4
PCSrc (pc+4 / jabs / rind / br), stall

[Datapath figure: while the jump I2 is being decoded, the fetch stage has already speculatively fetched from PC+4 (address 104, instruction I3).]

I1  096  ADD
I2  100  J 304
I3  104  ADD   <- killed
I4  304  ADD

A jump instruction kills (not stalls) the following instruction. How?

28
Pipelining Jumps
PCSrc (pc+4 / jabs / rind / br), stall

To kill a fetched instruction, insert a mux before the decode-stage IR that can substitute a bubble.

[Datapath figure: when the jump is recognized in decode, the IRSrcD mux replaces the speculatively fetched instruction with a bubble and the PC is redirected to 304.]

Any interaction between stall and jump?

IRSrcD = Case opcodeD
           JAL => bubble
           ... => IM

I1  096  ADD
I2  100  J 304
I3  104  ADD   <- killed
I4  304  ADD

29
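A small sketch of this fetch-kill decision (BUBBLE and the field names are hypothetical, for illustration only):

# When decode sees an unconditional jump, the instruction just fetched behind it
# is replaced with a bubble (NOP) and the PC is redirected to the jump target.
BUBBLE = {"opcode": "NOP"}

def fetch_stage_update(decode_instr, fetched_instr, pc_plus_4):
    if decode_instr["opcode"] == "JAL":
        next_ir = BUBBLE                               # IRSrcD = bubble
        next_pc = decode_instr["target"]               # PCSrc = jabs
    else:
        next_ir = fetched_instr                        # IRSrcD = IM
        next_pc = pc_plus_4                            # PCSrc = pc+4
    return next_ir, next_pc

# J 304 in decode while the ADD at 104 was just fetched: kill it, jump to 304.
print(fetch_stage_update({"opcode": "JAL", "target": 304},
                         {"opcode": "ADD", "pc": 104}, 108))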
Jump Pipeline Diagrams
time
t0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1
(I2) 100: J 304 IF2 ID2 EX2 ME2 WB2
(I3) 104: ADD IF3 - - - -
(I4) 304: ADD IF4 ID4 EX4 ME4 WB4

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 - I4 I5
Resource EX I1 I2 - I4 I5
Usage ME I1 I2 - I4 I5
WB I1 I2 - I4 I5

- ⇒ pipeline bubble

30
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br), stall

[Datapath figure: the branch I2 (BEQ) is in decode while I3 at 104 has already been fetched; the branch comparison is done by the ALU, so Taken? is not known until the execute stage.]

I1  096  ADD
I2  100  BEQ x1, x2, +200
I3  104  ADD
I4  304  ADD

The branch condition is not known until the execute stage.
31
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br), stall

[Datapath figure: by the time the branch I2 reaches execute (Bcond?), I3 is in decode and the instruction at 108 has already been fetched.]

I1  096  ADD
I2  100  BEQ x1, x2, +200
I3  104  ADD
I4  304  ADD

If the branch is taken:
- Kill the two following instructions
- The instruction at the decode stage is not valid ⇒ the stall signal is not valid
32
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br), stall

[Datapath figure: a second kill mux, controlled by IRSrcE, is placed in front of the execute-stage IR so that both the instruction in decode and the instruction in fetch can be turned into bubbles when the branch resolves taken in execute.]

I1: 096  ADD
I2: 100  BEQ x1, x2, +200
I3: 104  ADD
I4: 304  ADD

If the branch is taken:
- Kill the two following instructions
- The instruction at the decode stage is not valid ⇒ the stall signal is not valid
33
Branch Pipeline Diagrams
(resolved in execute stage)

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1
(I2) 100: BEQ +200 IF2 ID2 EX2 ME2 WB2
(I3) 104: ADD IF3 ID3 - - -
(I4) 108: IF4 - - - -
(I5) 304: ADD IF5 ID5 EX5 ME5 WB5

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 I3 - I5
Resource EX I1 I2 - - I5
Usage ME I1 I2 - - I5
WB I1 I2 - - I5

- ⇒ pipeline bubble
34
Use Simpler Branches: E.g., Only Compare One Register Against Zero in the ID Stage
time
t0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1
(I2) 100: BEQZ +200 IF2 ID2 EX2 ME2 WB2
(I3) 104: ADD IF3 - - - -
(I4) 300: ADD IF4 ID4 EX4 ME4 WB4

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 - I4 I5
Resource EX I1 I2 - I4 I5
Usage ME I1 I2 - I4 I5
WB I1 I2 - I4 I5

- ⇒ pipeline bubble
35
Pipelined MIPS Datapath
[Classic 5-stage MIPS pipeline figure: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, with the next-PC adders and mux, zero-detect, register file, sign-extended immediate, ALU, data memory, and write-back mux.]

(I1) 096: ADD
(I2) 100: BEQZ +200
(I3) 104: ADD
(I4) 300: ADD

36
Control Hazard Delay Summary
• JAL: unconditional jump to PC+immediate
– 1 cycle delay of pipeline
• JALR: indirect jump to rs1+immediate
– 1 cycle delay
• Branch: if (rs1 conds rs2), branch to PC+immediate
– 2 cycles delay
– 1 cycle delay for simpler branch (BEQZ) with pipeline
improvement

37
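As a back-of-the-envelope illustration of what these penalties cost (the branch and jump frequencies below are assumptions, not from the lecture):

# Effective CPI = base CPI + (control-transfer frequency) * (cycles lost per control transfer).
base_cpi = 1.0
branch_freq = 0.15    # assumed fraction of instructions that are taken branches
branch_penalty = 2    # cycles lost when the branch resolves in EX
jump_freq = 0.05      # assumed fraction of JAL/JALR
jump_penalty = 1      # one killed instruction

effective_cpi = base_cpi + branch_freq * branch_penalty + jump_freq * jump_penalty
print(effective_cpi)  # 1.35 with these assumed numbers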
Reducing Control Flow Penalty
• Software solutions
– Eliminate branches - loop unrolling
• Increases the run length
– Reduce resolution time - instruction scheduling
• Compute the branch condition as early as possible (of limited value because branches are often on the critical path through the code)

• Hardware solutions
– Find something else to do - delay slots
• Replaces pipeline bubbles with useful work (requires
software cooperation)
– Speculate - branch prediction
• Speculative execution of instructions beyond the branch
38
Additional Materials – Branch Prediction

39
Branch Prediction
• Motivation
– Branch penalties limit performance of deeply pipelined processors
– Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly

• Required hardware support:


– Prediction structures:
• Branch history tables, branch target buffers, etc.

• Mispredict recovery mechanisms:


– Keep result computation separate from commit
– Kill instructions following branch in pipeline
– Restore state to that following branch
40
Static Branch Prediction
Overall probability a branch is taken is ~60-70%, but:
- backward branches: ~90% taken (what C++ statement does this look like?)
- forward branches: ~50% taken (what C++ statement does this look like?)

ISA can attach preferred direction semantics to branches, e.g.,


Motorola MC88110
bne0 (preferred taken) beq0 (not taken)

41
Dynamic Branch Prediction
learning based on past behavior
• Temporal correlation (time)
– If I tell you that a certain branch was taken last time, does
this help?
– The way a branch resolves may be a good predictor of the way
it will resolve at the next execution

• Spatial correlation (space)


– Several branches may resolve in a highly correlated manner
– For instance, a preferred path of execution

42
Dynamic Branch Prediction
• 1-bit prediction scheme
– The low-order bits of the branch address index a one-bit flag recording whether the branch was historically Taken or NotTaken
– Simple
• 2-bit prediction
– Must mispredict twice before the prediction changes

43
Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!
[State diagram: four states, (predict take / predict ¬take) × (last prediction right / wrong). Taken outcomes move the state toward "take", not-taken outcomes toward "¬take"; the predicted direction flips only after two wrong predictions in a row.]

BP state:
(predict take/¬take) x (last prediction right/wrong)
44
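A minimal sketch of such a 2-bit predictor implemented as a saturating counter (one common encoding of the four states above; the specific encoding and reset state are my choices):

# 2-bit branch predictor as a saturating counter:
# 0,1 -> predict not taken; 2,3 -> predict taken.
# The prediction only flips after two consecutive mispredictions.
class TwoBitPredictor:
    def __init__(self):
        self.state = 1          # start weakly not-taken (arbitrary choice)

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:   # loop-like branch with one exit
    print(p.predict(), outcome)
    p.update(outcome)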
Branch History Table
[Figure: k bits of the fetch PC index a 2^k-entry BHT with 2 bits per entry, accessed in parallel with the I-Cache; the fetched instruction's opcode and offset give Branch? and Target PC, while the BHT entry supplies Taken/¬Taken.]

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
45
Exploiting Spatial Correlation
Yeh and Patt, 1992

if (x[i] < 7) then


y += 1;
if (x[i] < 5) then
c -= 4;

If the first condition is false, the second condition is also false.

A history register, H, records the direction of the last N branches executed by the processor.

46
Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches
to select one of the four sets of BHT bits (~95% correct)

[Figure: k bits of the fetch PC index the BHT as before, while a 2-bit global branch history shift register (shifting in the Taken/¬Taken result of each branch) selects which of the four per-entry counter sets supplies the Taken/¬Taken prediction.]

47
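A compact sketch of the two-level idea, with a 2-bit global history selecting among four 2-bit counters per BHT entry (table size, PC hashing, and initial values are illustrative assumptions):

# Two-level predictor: a global history of the last 2 branch outcomes selects
# one of four 2-bit counters per BHT entry.
class TwoLevelPredictor:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        # table[pc_index][history] -> 2-bit saturating counter
        self.table = [[1] * 4 for _ in range(1 << index_bits)]
        self.history = 0                       # last two outcomes, 2 bits

    def predict(self, pc):
        return self.table[(pc >> 2) & self.mask][self.history] >= 2

    def update(self, pc, taken):
        ctr = self.table[(pc >> 2) & self.mask][self.history]
        ctr = min(3, ctr + 1) if taken else max(0, ctr - 1)
        self.table[(pc >> 2) & self.mask][self.history] = ctr
        self.history = ((self.history << 1) | int(taken)) & 0b11

bp = TwoLevelPredictor()
print(bp.predict(0x100))    # initial prediction: not taken (counters start at 1)
bp.update(0x100, True)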
Speculating Both Directions
• An alternative to branch prediction is to execute
both directions of a branch speculatively
– resource requirement is proportional to the number of concurrent
speculative executions
– only half the resources engage in useful work when both directions
of a branch are executed speculatively
– branch prediction takes less resources than speculative execution
of both paths
• With accurate branch prediction, it is more cost
effective to dedicate all resources to the
predicted direction!
– What would you choose with 80% accuracy?

48
Are We Missing Something?
• Knowing whether a branch is taken or not is great, but what
else do we need to know about it?

Branch target address

49
Branch Target Buffer
[Figure: k bits of the PC index a Branch Target Buffer (2^k entries) in parallel with the IMEM access; each entry supplies a predicted target address and the BP bits.]
BP bits are stored with the predicted target address.

IF stage: If (BP=taken) then nPC=target else nPC=PC+4


Later: check prediction, if wrong then kill the instruction and
update BTB & BPb else update BPb
50
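A sketch of the IF-stage lookup and the later update for a direct-mapped BTB (field names and sizes are my own choices):

# Direct-mapped Branch Target Buffer sketch: IF-stage lookup plus later update.
class BTB:
    def __init__(self, index_bits=7):
        self.mask = (1 << index_bits) - 1
        self.entries = {}                      # index -> (tag_pc, target, taken_bit)

    def next_pc(self, pc):
        """IF stage: if the entry matches and predicts taken, fetch the target."""
        e = self.entries.get((pc >> 2) & self.mask)
        if e is not None and e[0] == pc and e[2]:
            return e[1]                        # predicted target
        return pc + 4                          # default speculation

    def update(self, pc, taken, target):
        """After the branch resolves: record/refresh the entry."""
        self.entries[(pc >> 2) & self.mask] = (pc, target, taken)

btb = BTB()
print(hex(btb.next_pc(0x100)))   # 0x104: no entry yet, fall through to PC+4
btb.update(0x100, taken=True, target=0x1C8)
print(hex(btb.next_pc(0x100)))   # 0x1c8: predicted taken with stored target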
Address Collisions (Misprediction)

Assume a 128-entry BTB.

  132   Jump +104     (BTB entry: target = 236, BPb = take)
  ...
  1028  Add ...

What will be fetched after the instruction at 1028?
(The Add at 1028 maps to the same BTB entry as the Jump at 132.)
BTB prediction = 236
Correct target = 1032

=> kill PC=236 and fetch PC=1032

Is this a common occurrence?

51
BTB is only for Control Instructions

• Is even branch prediction fast enough to avoid bubbles?


• When do we index the BTB?
– i.e., what state is the branch in, in order to avoid bubbles?

• BTB contains useful information for branch and jump


instructions only
=> Do not update it for other instructions

• For all other instructions the next PC is PC+4 !


• How to achieve this effect without decoding the instruction?

52
Branch Target Buffer (BTB)
[Figure: the PC indexes both the I-Cache and a 2^k-entry direct-mapped BTB (can also be associative); each entry holds an entry PC (tag), a valid bit, and a predicted target PC, producing match, valid, and target outputs.]

• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded
53
Combining BTB and BHT
• BTB entries are considerably more expensive than BHT entries, but they can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR)
• The BHT can hold many more entries and is more accurate

[Figure: a deep front-end with stages A (PC Generation/Mux), P (Instruction Fetch Stage 1), F (Instruction Fetch Stage 2), B (Branch Address Calc / Begin Decode), I (Complete Decode), J (Steer Instructions to Functional Units), R (Register File Read), E (Integer Execute). The BTB is consulted at PC generation; the BHT, accessed in a later pipeline stage, corrects the fetch stream when the BTB misses a predicted-taken branch.]

BTB/BHT only updated after branch resolves in E stage


54
Uses of Jump Register (JR)
• Switch statements (jump to address of matching case)
BTB works well if same case used repeatedly

• Dynamic function call (jump to run-time function address)


BTB works well if the same function is usually called (e.g., in C++ programming, when objects have the same type in a virtual function call)

• Subroutine returns (jump to return address)


BTB works well if usually return to the same place
⇒ Often one function is called from many distinct call sites!
How well does BTB work for each of these cases?
55
Subroutine Return Stack
Small structure to accelerate JR for subroutine returns,
typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }

Push the call's return address when a function call is executed; pop it when a subroutine return is decoded.

[Figure: a small stack of k entries (typically k = 8-16) holding the pending return addresses, here &fb(), &fc(), &fd().]
56
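A minimal return-address-stack sketch (the fixed depth and the overflow policy are assumptions; real designs vary):

# Return address stack: push on call, pop on return, used to predict JR/JALR
# return targets. A small fixed depth suffices because call nesting is usually shallow.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:      # overflow: drop the oldest entry
            self.stack.pop(0)
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None   # None -> fall back to BTB

ras = ReturnAddressStack()
ras.on_call(0x104)      # fa() calls fb()
ras.on_call(0x204)      # fb() calls fc()
print(hex(ras.predict_return()))   # 0x204: fc() returns into fb()
print(hex(ras.predict_return()))   # 0x104: fb() returns into fa()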
