The document discusses the implementation of a RISC-V pipeline in computer architecture, focusing on CPU performance factors like instruction count, CPI, and cycle time. It reviews the unpipelined and pipelined datapaths, detailing the stages of a 5-stage pipeline and the control logic required to handle hazards such as structural and data hazards. Solutions for data hazards include interlocking and forwarding techniques to maintain efficient instruction execution.


Lecture 08: RISC-V Pipeline Implementation

CSCE 513 Computer Architecture

Department of Computer Science and Engineering


Yonghong Yan
yanyh@[Link]
[Link]

1
Acknowledgement
• Slides adapted from Computer Science 152: Computer
Architecture and Engineering, Spring 2016 by Dr. George
Michelogiannakis from UC Berkeley

2
Review
• CPU performance factors
– Instruction count
• Determined by ISA and compiler
– CPI and cycle time
• Determined by CPU hardware

CPU Time = (Instructions / Program) * (Cycles / Instruction) * (Time / Cycle)

• Three groups of instructions


– Memory reference: lw, sw
– Arithmetic/logical: add, sub, and, or, slt
– Control transfer: jal, jalr, b*

• CPI
– Single-cycle: CPI = 1, but normally a longer cycle time
– 5-stage unpipelined: CPI = 5
– 5-stage pipelined: CPI = 1
3
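As a quick worked example of the CPU time equation (the instruction count, CPI, and clock rate below are illustrative assumptions, not figures from the lecture):

# Worked example of CPU Time = (Instructions/Program) * (Cycles/Instruction) * (Time/Cycle).
# All numbers here are assumed for illustration only.
instructions = 1_000_000          # dynamic instruction count of the program
cpi = 1.2                         # average cycles per instruction
clock_hz = 1_000_000_000          # 1 GHz clock -> 1 ns cycle time

cpu_time = instructions * cpi * (1 / clock_hz)
print(f"CPU time = {cpu_time * 1e3:.3f} ms")   # 1.200 ms with these numbers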
Review: Unpipelined Datapath for RISC-V
[Datapath figure: the PC indexes the instruction memory; the fetched instruction drives the register file (GPRs), immediate select, ALU, and data memory, with branch-compare logic and ALU control. Control signals: PCSel (pc+4 / br / rind / jabs), RegWriteEn, MemWrite, WBSel, ImmSel, Op2Sel, WASel.]
4
Review: Hardwired Control Table
Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel
ALU * Reg Func no yes ALU rd pc+4
ALUi IType12 Imm Op no yes ALU rd pc+4
LW IType12 Imm + no yes Mem rd pc+4
SW SType12 Imm + yes no * * pc+4
BEQtrue SBType12 * * no no * * br
BEQfalse SBType12 * * no no * * pc+4
J * * * no no * * jabs
JAL * * * no yes PC X1 jabs
JALR * * * no yes PC rd rind

Op2Sel = Reg / Imm
WBSel = ALU / Mem / PC
WASel = rd / X1
PCSel = pc+4 / br / rind / jabs

5
An Ideal Pipeline

[Figure: objects flow through stage 1 → stage 2 → stage 3 → stage 4.]

• All objects go through the same stages


• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• The scheduling of an object entering the pipeline is not
affected by the objects in other stages
These conditions generally hold for industrial assembly lines.
In a laundry pipeline, two loads do not depend on each other.
But instructions depend on each other!
6
Technology Assumptions
• A small amount of very fast memory (caches)
backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported Register files (slower)
Thus, the following timing assumption is reasonable
tIF ~= tID/RF ~= tEX ~= tMEM ~= tWB

A 5-stage pipeline will be the focus of our detailed design


Some commercial designs have over 30 pipeline
stages to do an integer add!

7
5-Stage Pipelined Execution: Resource Usage
The Whole Pipeline Resources are Used by 5 Instructions in Every Cycle!
[Pipelined datapath figure: in any given cycle all five stages are busy with different instructions, e.g. I-Fetch (IF) – I5, Decode/Reg. Fetch (ID) – I4, Execute (EX) – I3, Memory (ME) – I2, Write-Back (WB) – I1.]
time t0 t1 t2 t3 t4 t5 t6 t7 ....
Instruction 1 IF1 ID1 EX1 ME1 WB1
Instruction 2 IF2 ID2 EX2 ME2 WB2
Instruction 3 IF3 ID3 EX3 ME3 WB3
Instruction 4 IF4 ID4 EX4 ME4 WB4
Instruction 5 IF5 ID5 EX5 ME5 WB5
8
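A small sketch (my own illustration, not from the slides) that reproduces this kind of stage-occupancy table for an ideal, hazard-free 5-stage pipeline:

# Print which stage each instruction occupies in every cycle of an ideal
# 5-stage pipeline: instruction i enters IF in cycle i and advances one stage per cycle.
STAGES = ["IF", "ID", "EX", "ME", "WB"]

def ideal_pipeline(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = [" .  "] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage}{i + 1} "
        print(f"I{i + 1}: " + "".join(row))

ideal_pipeline(5)   # matches the diagram: five instructions, all stages busy from t4 on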
Instruction Register in Each Stage

[Pipelined datapath figure: an IR is placed at the start of each stage, together with pipeline data registers A, B, Y, R, MD1, and MD2 between stages.]

• An instruction register (IR) in each stage holds the instruction currently in that stage

9
Connect Controls from Instruction Register
[Pipelined datapath figure: the stages F, D, E, M, W each hold a different instruction (e.g. Instruction 4 in decode, Instruction 3 in execute, Instruction 2 in memory, Instruction 1 in write-back). Each stage's control signals (ImmSel, Op2Sel, ALU control, MemWrite, WBSel, RegWriteEn) are decoded from that stage's own IR.]
10
Compared With Control Logic in Unpipelined

• Unpipelined:
– A single block of control logic uses the instruction from IF
• Pipelined:
– Distributed control logic uses the instruction held in each stage's IR

11
Instructions Interact With Each Other in Pipeline:
Dealing with Hazards
• An instruction may need a resource being used by another instruction → structural hazard
– Solution #1: Stall the newer instruction until the older instruction finishes
– Solution #2: Add more hardware to the design
• E.g., separate memory into I-memory and D-memory
– Our 5-stage pipeline has no structural hazards by design

• An instruction may depend on something produced by an earlier one
– Dependence may be on a data value or on using the same register (not the value) → data hazard
• Solutions for RAW hazards: #1 interlocking (bubble delay) and #2 forwarding
• WAR and WAW hazards: not possible in the 5-stage pipeline
– Dependence may be on the next instruction's address → control hazard (branches, exceptions)
• Solutions: delay, prediction, etc.
12
Read-After-Write (RAW) Data Hazards

[Datapath figure: the older instruction "x1 ← ..." is still in the pipeline while the younger instruction "x4 ← x1 ..." reads x1 in the decode stage.]

...
x1 ← x0 + 10
x4 ← x1 + 17
...

x1 in the GPRs contains a stale value, since passing a value between two instructions has to go through the GPRs (register file).
13
To Resolve Data Hazards: #1, Interlocking, i.e., Stall the Pipeline by Inserting Bubbles
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) x1 ← (x0) + 10 IF1 ID1 EX1 ME1 WB1
(I2) x4 ← (x1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2
(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3 WB3
(I4) stalled stages IF4 ID4 EX4 ME4 WB4
(I5) IF5 ID5 EX5 ME5 WB5

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I3 I3 I3 I4 I5
ID I1 I2 I2 I2 I2 I3 I4 I5
Resource EX I1 - - - I2 I3 I4 I5
Usage ME I1 - - - I2 I3 I4 I5
WB I1 - - - I2 I3 I4 I5

- ⇒ pipeline bubble
14
Insert Bubble for Interlocking
Bubble: a NOP instruction, e.g. add x0, x0, x0
Stall condition: when it holds, the instruction in ID is held back and a bubble is inserted in its place as it would enter the EX stage.

[Datapath figure: a mux in front of the EX-stage IR can select a bubble instead of the decoded instruction, driven by the stall condition.]

...
x1 ← x0 + 10
x4 ← x1 + 17
...

15
Interlock Control Logic
...
x1 ← x0 + 10
x4 ← x1 + 17
...

[Datapath figure: a Cstall block examines the source registers (rs1, rs2) of the instruction in decode and the write-select (ws) fields of the instructions still in flight, and drives the stall signal and the bubble mux.]

Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.
16
Interlock Control Logic (ignoring jumps & branches)

we: write enable, 1-bit on/off
ws: write select, 5-bit register number
re: read enable, 1-bit on/off
rs: read select, 5-bit register number

[Datapath figure: Cre derives re1/re2 from the decode-stage opcode; Cdest derives we/ws for the instructions in EX, MEM, and WB; Cstall compares rs1D and rs2D against wsEX, wsMEM, and wsWB to generate the stall signal that inserts bubbles.]

Should we always stall if an rs field matches some rd? No:
- not every instruction writes a register => we
- not every instruction reads a register => re

17
Source & Destination Registers
Instruction formats:
  ALU:           func7 | rs2 | rs1 | func3 | rd | opcode
  ALUI/LW/JALR:  immediate12 | rs1 | func3 | rd | opcode
  SW/Bcond:      imm | rs2 | rs1 | func3 | imm | opcode
  JAL:           Jump Offset[19:0] | rd | opcode

                                      source(s)   destination
  ALU    rd <= rs1 func10 rs2         rs1, rs2    rd
  ALUI   rd <= rs1 op imm             rs1         rd
  LW     rd <= M[rs1 + imm]           rs1         rd
  SW     M[rs1 + imm] <= rs2          rs1, rs2    -
  Bcond  true:  PC <= PC + imm        rs1, rs2    -
         false: PC <= PC + 4
  JAL    x1 <= PC, PC <= PC + imm     -           rd
  JALR   rd <= PC, PC <= rs1 + imm    rs1         rd

18
Deriving the Stall Signal
(source/destination table from the previous slide shown again for reference)

Cre:
  re1 = Case opcode
          ALU, ALUi, LW, SW, Bcond, JALR => on
          JAL                            => off
  re2 = Case opcode
          ALU, SW, Bcond => on
          ...            => off

Cdest:
  ws = rd
  we = Case opcode
         ALU, ALUi, LW, JALR => on
         ...                 => off

Cstall:
  stall = ((rs1D == wsEX)  && weEX  +
           (rs1D == wsMEM) && weMEM +
           (rs1D == wsWB)  && weWB) && re1D
        + ((rs2D == wsEX)  && weEX  +
           (rs2D == wsMEM) && weMEM +
           (rs2D == wsWB)  && weWB) && re2D

Note: the WB term is not strictly needed for interlock control, since we only have to handle hazards between MEM-EX and EX-EX; for two instructions in WB and EX with a RAW hazard, the dependence is handled through the register file.
19
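A minimal Python sketch of this interlock check (the dictionary-based representation and names like ws/we are my own shorthand for the slide's signals, not a real simulator API):

# Sketch of the interlock (stall) logic from the slide.
# Each downstream stage is summarized by (we, ws): write-enable and write-select (rd).
def reads_rs1(opcode):
    # re1: which opcodes read rs1 (JAL does not)
    return opcode in {"ALU", "ALUi", "LW", "SW", "Bcond", "JALR"}

def reads_rs2(opcode):
    # re2: which opcodes read rs2
    return opcode in {"ALU", "SW", "Bcond"}

def stall_signal(dec, ex, mem, wb):
    """dec = {'opcode', 'rs1', 'rs2'}; ex/mem/wb = {'we', 'ws'}."""
    def hazard(rs):
        return ((rs == ex["ws"] and ex["we"]) or
                (rs == mem["ws"] and mem["we"]) or
                (rs == wb["ws"] and wb["we"]))
    return ((hazard(dec["rs1"]) and reads_rs1(dec["opcode"])) or
            (hazard(dec["rs2"]) and reads_rs2(dec["opcode"])))

# Example: "x1 <- x0 + 10" is in EX while "x4 <- x1 + 17" is in decode -> stall.
dec = {"opcode": "ALUi", "rs1": 1, "rs2": 0}
ex  = {"we": True,  "ws": 1}
mem = {"we": False, "ws": 0}
wb  = {"we": False, "ws": 0}
print(stall_signal(dec, ex, mem, wb))   # True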
To Resolve Data Hazards: #2, Forwarding (Bypassing)
time t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) x1 ← x0 + 10 IF1 ID1 EX1 ME1 WB1
(I2) x4 ← x1 + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2
(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3
(I4) stalled stages IF4 ID4 EX4
(I5) IF5 ID5
Each stall or kill introduces a bubble in the pipeline
=> CPI > 1
A new datapath, i.e., a bypass, can get the data from
the output of the ALU to its input
time t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) x1 ← x0 + 10 IF1 ID1 EX1 ME1 WB1
(I2) x4 ← x1 + 17 IF2 ID2 EX2 ME2 WB2
(I3) IF3 ID3 EX3 ME3 WB3
(I4) IF4 ID4 EX4 ME4 WB4
(I5) IF5 ID5 EX5 ME5 WB5
20
Review: Hardware Support for Forwarding, and Detecting RAW
Hazards with Previous and 2nd Previous Instructions

• Slide 48 of lecture05-06

21
Adding a Bypass (To Bypass Register Files)
[Datapath figure: an ASrc mux in front of the ALU's A input lets the instruction in decode ("x4 <= x1 + 17") pick up x1 directly from the ALU output of the older instruction ("x1 <= ...") instead of reading the stale value from the register file.]

When does this bypass help?

  x1 <= x0 + 10       x1 <= M[x0 + 10]     JAL 500
  x4 <= x1 + 17       x4 <= x1 + 17        x4 <= x1 + 17
  Yes                 No (load → EX use)   No
22
The Bypass Signal: Deriving It from the Stall Signal
stall = ( ((rs1D == wsE) && weE + (rs1D == wsM) && weM + (rs1D == wsW) && weW) && re1D
+((rs2D == wsE) && weE + (rs2D == wsM) && weM + (rs2D == wsW) && weW) && re2D )

ws = rd
we = Case opcode
       ALU, ALUi, LW, JAL, JALR => on
       ...                      => off

ASrc = (rs1D == wsE) && weE && re1D

Is this correct?
No, because only ALU and ALUi instructions can benefit from this bypass.
Split weE into two components: we-bypass and we-stall.

23
Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
               ALU, ALUi => on
               ...       => off

we-stallE = Case opcodeE
              LW, JAL, JALR => on
              ...           => off

ASrc = (rs1D == wsE) && we-bypassE && re1D

stall = ((rs1D == wsE) && we-stallE +
         (rs1D == wsM) && weM +
         (rs1D == wsW) && weW) && re1D
      + ((rs2D == wsE) && weE +
         (rs2D == wsM) && weM +
         (rs2D == wsW) && weW) && re2D

24
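A sketch in the same spirit (again with my own naming), splitting the EX-stage write enable into a bypassable part and a stall-only part, shown here for the rs1 operand:

# Sketch of the split bypass/stall decision for the rs1 operand, per the slide.
# we_bypass: EX result is available at the ALU output (ALU/ALUi) and can be forwarded.
# we_stall:  EX will write a register but its value is not yet available (LW, JAL, JALR).
BYPASSABLE = {"ALU", "ALUi"}
STALL_ONLY = {"LW", "JAL", "JALR"}

def rs1_control(rs1_d, re1_d, ex_opcode, ws_e, we_m, ws_m, we_w, ws_w):
    we_bypass_e = ex_opcode in BYPASSABLE
    we_stall_e = ex_opcode in STALL_ONLY
    asrc = (rs1_d == ws_e) and we_bypass_e and re1_d           # take ALU output instead of rd1
    stall = ((rs1_d == ws_e and we_stall_e) or
             (rs1_d == ws_m and we_m) or
             (rs1_d == ws_w and we_w)) and re1_d
    return asrc, stall

# x1 <- x0 + 10 (ALUi) in EX, x4 <- x1 + 17 reading rs1 = x1 in decode: bypass, no stall.
print(rs1_control(rs1_d=1, re1_d=True, ex_opcode="ALUi", ws_e=1,
                  we_m=False, ws_m=0, we_w=False, ws_w=0))     # (True, False)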
Fully Bypassed Datapath
[Fully bypassed datapath figure: bypass muxes ASrc and BSrc in front of both ALU inputs can select results forwarded from the EX, MEM, or WB stages; the PC is also passed along for JAL.]

Is there still a need for the stall signal?
Yes, for the load-use case:

stall = (rs1D == wsE) && (opcodeE == LW) && (wsE != 0) && re1D
      + (rs2D == wsE) && (opcodeE == LW) && (wsE != 0) && re2D
25
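The remaining load-use check, sketched under the same naming assumptions:

# Even with full bypassing, a load in EX cannot forward its data to the
# instruction right behind it, so that one case still stalls.
def load_use_stall(rs1_d, re1_d, rs2_d, re2_d, ex_opcode, ws_e):
    load_in_ex = (ex_opcode == "LW") and (ws_e != 0)   # x0 never needs a stall
    return load_in_ex and ((rs1_d == ws_e and re1_d) or
                           (rs2_d == ws_e and re2_d))

# lw x1, 0(x2) in EX, add x4, x1, x3 in decode -> must stall one cycle.
print(load_use_stall(rs1_d=1, re1_d=True, rs2_d=3, re2_d=True,
                     ex_opcode="LW", ws_e=1))          # True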
Control Hazards: Branches and Jumps
• JAL: unconditional jump to PC+immediate

• JALR: indirect jump to rs1+immediate

• Branch: if (rs1 conds rs2), branch to PC+immediate

26
Info for Control Transfer
Two pieces of info:
1. Taken or Not Taken
2. Target address?

• JAL: unconditional jump to PC+immediate


• JALR: indirect jump to rs1+immediate
• Branch: if (rs1 conds rs2), branch to PC+immediate

Instruction Taken known? Target known?


JAL After Inst. Decode After Inst. Decode
JALR After Inst. Decode After Reg. Fetch
B<cond.> After Execute After Inst. Decode

27
Speculate Next Address is PC+4
PCSrc (pc+4 / jabs / rind / br), stall

[Datapath figure: while the jump I2 is being decoded, the fetch stage has already speculatively fetched from PC+4 (address 104, instruction I3).]

I1  096  ADD
I2  100  J 304
I3  104  ADD   <- killed
I4  304  ADD

A jump instruction kills (not stalls) the following instruction. How?

28
Pipelining Jumps
PCSrc (pc+4 / jabs / rind / br), stall

To kill a fetched instruction, insert a mux before the decode-stage IR that can substitute a bubble.

[Datapath figure: when the jump is recognized in decode, the IRSrcD mux replaces the speculatively fetched instruction with a bubble and the PC is redirected to 304.]

Any interaction between stall and jump?

IRSrcD = Case opcodeD
           JAL => bubble
           ... => IM

I1  096  ADD
I2  100  J 304
I3  104  ADD   <- killed
I4  304  ADD

29
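A small sketch of this fetch-kill decision (BUBBLE and the field names are hypothetical, for illustration only):

# When decode sees an unconditional jump, the instruction just fetched behind it
# is replaced with a bubble (NOP) and the PC is redirected to the jump target.
BUBBLE = {"opcode": "NOP"}

def fetch_stage_update(decode_instr, fetched_instr, pc_plus_4):
    if decode_instr["opcode"] == "JAL":
        next_ir = BUBBLE                               # IRSrcD = bubble
        next_pc = decode_instr["target"]               # PCSrc = jabs
    else:
        next_ir = fetched_instr                        # IRSrcD = IM
        next_pc = pc_plus_4                            # PCSrc = pc+4
    return next_ir, next_pc

# J 304 in decode while the ADD at 104 was just fetched: kill it, jump to 304.
print(fetch_stage_update({"opcode": "JAL", "target": 304},
                         {"opcode": "ADD", "pc": 104}, 108))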
Jump Pipeline Diagrams
time
t0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1
(I2) 100: J 304 IF2 ID2 EX2 ME2 WB2
(I3) 104: ADD IF3 - - - -
(I4) 304: ADD IF4 ID4 EX4 ME4 WB4

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 - I4 I5
Resource EX I1 I2 - I4 I5
Usage ME I1 I2 - I4 I5
WB I1 I2 - I4 I5

- ⇒ pipeline bubble

30
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br), stall

[Datapath figure: the branch I2 (BEQ) is in decode while I3 at 104 has already been fetched; the branch comparison is done by the ALU, so Taken? is not known until the execute stage.]

I1  096  ADD
I2  100  BEQ x1, x2, +200
I3  104  ADD
I4  304  ADD

The branch condition is not known until the execute stage.
31
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br), stall

[Datapath figure: by the time the branch I2 reaches execute (Bcond?), I3 is in decode and the instruction at 108 has already been fetched.]

I1  096  ADD
I2  100  BEQ x1, x2, +200
I3  104  ADD
I4  304  ADD

If the branch is taken:
- Kill the two following instructions
- The instruction at the decode stage is not valid ⇒ the stall signal is not valid
32
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br), stall

[Datapath figure: a second kill mux, controlled by IRSrcE, is placed in front of the execute-stage IR so that both the instruction in decode and the instruction in fetch can be turned into bubbles when the branch resolves taken in execute.]

I1: 096  ADD
I2: 100  BEQ x1, x2, +200
I3: 104  ADD
I4: 304  ADD

If the branch is taken:
- Kill the two following instructions
- The instruction at the decode stage is not valid ⇒ the stall signal is not valid
33
Branch Pipeline Diagrams
(resolved in execute stage)

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1
(I2) 100: BEQ +200 IF2 ID2 EX2 ME2 WB2
(I3) 104: ADD IF3 ID3 - - -
(I4) 108: IF4 - - - -
(I5) 304: ADD IF5 ID5 EX5 ME5 WB5

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 I3 - I5
Resource EX I1 I2 - - I5
Usage ME I1 I2 - - I5
WB I1 I2 - - I5

- ⇒ pipeline bubble
34
Use Simpler Branches: E.g., Only Compare One Register Against Zero in the ID Stage
time
t0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1
(I2) 100: BEQZ +200 IF2 ID2 EX2 ME2 WB2
(I3) 104: ADD IF3 - - - -
(I4) 300: ADD IF4 ID4 EX4 ME4 WB4

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 - I4 I5
Resource EX I1 I2 - I4 I5
Usage ME I1 I2 - I4 I5
WB I1 I2 - I4 I5

- ⇒ pipeline bubble
35
Pipelined MIPS Datapath
[Classic 5-stage MIPS pipeline figure: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, with the next-PC adders and mux, zero-detect, register file, sign-extended immediate, ALU, data memory, and write-back mux.]

(I1) 096: ADD
(I2) 100: BEQZ +200
(I3) 104: ADD
(I4) 300: ADD

36
Control Hazard Delay Summary
• JAL: unconditional jump to PC+immediate
– 1 cycle delay of pipeline
• JALR: indirect jump to rs1+immediate
– 1 cycle delay
• Branch: if (rs1 conds rs2), branch to PC+immediate
– 2 cycles delay
– 1 cycle delay for simpler branch (BEQZ) with pipeline
improvement

37
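As a back-of-the-envelope illustration of what these penalties cost (the branch and jump frequencies below are assumptions, not from the lecture):

# Effective CPI = base CPI + (control-transfer frequency) * (cycles lost per control transfer).
base_cpi = 1.0
branch_freq = 0.15    # assumed fraction of instructions that are taken branches
branch_penalty = 2    # cycles lost when the branch resolves in EX
jump_freq = 0.05      # assumed fraction of JAL/JALR
jump_penalty = 1      # one killed instruction

effective_cpi = base_cpi + branch_freq * branch_penalty + jump_freq * jump_penalty
print(effective_cpi)  # 1.35 with these assumed numbers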
Reducing Control Flow Penalty
• Software solutions
– Eliminate branches - loop unrolling
• Increases the run length
– Reduce resolution time - instruction scheduling
• Compute the branch condition as early as possible (of limited value because branches are often on the critical path through the code)

• Hardware solutions
– Find something else to do - delay slots
• Replaces pipeline bubbles with useful work (requires
software cooperation)
– Speculate - branch prediction
• Speculative execution of instructions beyond the branch
38
Additional Materials – Branch Prediction

39
Branch Prediction
• Motivation
– Branch penalties limit performance of deeply pipelined processors
– Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly

• Required hardware support:


– Prediction structures:
• Branch history tables, branch target buffers, etc.

• Mispredict recovery mechanisms:


– Keep result computation separate from commit
– Kill instructions following branch in pipeline
– Restore state to that following branch
40
Static Branch Prediction
Overall probability a branch is taken is ~60-70%, but:
- backward branches: ~90% taken (what C++ statement does this look like?)
- forward branches: ~50% taken (what C++ statement does this look like?)

ISA can attach preferred direction semantics to branches, e.g.,


Motorola MC88110
bne0 (preferred taken) beq0 (not taken)

41
Dynamic Branch Prediction
learning based on past behavior
• Temporal correlation (time)
– If I tell you that a certain branch was taken last time, does
this help?
– The way a branch resolves may be a good predictor of the way
it will resolve at the next execution

• Spatial correlation (space)


– Several branches may resolve in a highly correlated manner
– For instance, a preferred path of execution

42
Dynamic Branch Prediction
• 1-bit prediction scheme
– The low-order bits of the branch address index a one-bit flag recording whether the branch was historically Taken or NotTaken
– Simple
• 2-bit prediction
– Must mispredict twice before the prediction changes

43
Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!
[State diagram: four states, (predict take / predict ¬take) × (last prediction right / wrong). Taken outcomes move the state toward "take", not-taken outcomes toward "¬take"; the predicted direction flips only after two wrong predictions in a row.]

BP state:
(predict take/¬take) x (last prediction right/wrong)
44
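A minimal sketch of such a 2-bit predictor implemented as a saturating counter (one common encoding of the four states above; the specific encoding and reset state are my choices):

# 2-bit branch predictor as a saturating counter:
# 0,1 -> predict not taken; 2,3 -> predict taken.
# The prediction only flips after two consecutive mispredictions.
class TwoBitPredictor:
    def __init__(self):
        self.state = 1          # start weakly not-taken (arbitrary choice)

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:   # loop-like branch with one exit
    print(p.predict(), outcome)
    p.update(outcome)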
Branch History Table
[Figure: k bits of the fetch PC index a 2^k-entry BHT with 2 bits per entry, accessed in parallel with the I-Cache; the fetched instruction's opcode and offset give Branch? and Target PC, while the BHT entry supplies Taken/¬Taken.]

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
45
Exploiting Spatial Correlation
Yeh and Patt, 1992

if (x[i] < 7) then


y += 1;
if (x[i] < 5) then
c -= 4;

If the first condition is false, the second condition is also false.

A history register, H, records the direction of the last N branches executed by the processor.

46
Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches
to select one of the four sets of BHT bits (~95% correct)

[Figure: k bits of the fetch PC index the BHT as before, while a 2-bit global branch history shift register (shifting in the Taken/¬Taken result of each branch) selects which of the four per-entry counter sets supplies the Taken/¬Taken prediction.]

47
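A compact sketch of the two-level idea, with a 2-bit global history selecting among four 2-bit counters per BHT entry (table size, PC hashing, and initial values are illustrative assumptions):

# Two-level predictor: a global history of the last 2 branch outcomes selects
# one of four 2-bit counters per BHT entry.
class TwoLevelPredictor:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        # table[pc_index][history] -> 2-bit saturating counter
        self.table = [[1] * 4 for _ in range(1 << index_bits)]
        self.history = 0                       # last two outcomes, 2 bits

    def predict(self, pc):
        return self.table[(pc >> 2) & self.mask][self.history] >= 2

    def update(self, pc, taken):
        ctr = self.table[(pc >> 2) & self.mask][self.history]
        ctr = min(3, ctr + 1) if taken else max(0, ctr - 1)
        self.table[(pc >> 2) & self.mask][self.history] = ctr
        self.history = ((self.history << 1) | int(taken)) & 0b11

bp = TwoLevelPredictor()
print(bp.predict(0x100))    # initial prediction: not taken (counters start at 1)
bp.update(0x100, True)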
Speculating Both Directions
• An alternative to branch prediction is to execute
both directions of a branch speculatively
– resource requirement is proportional to the number of concurrent
speculative executions
– only half the resources engage in useful work when both directions
of a branch are executed speculatively
– branch prediction takes less resources than speculative execution
of both paths
• With accurate branch prediction, it is more cost
effective to dedicate all resources to the
predicted direction!
– What would you choose with 80% accuracy?

48
Are We Missing Something?
• Knowing whether a branch is taken or not is great, but what
else do we need to know about it?

Branch target address

49
Branch Target Buffer
[Figure: k bits of the PC index a Branch Target Buffer (2^k entries) in parallel with the IMEM access; each entry supplies a predicted target address and the BP bits.]
BP bits are stored with the predicted target address.

IF stage: If (BP=taken) then nPC=target else nPC=PC+4


Later: check prediction, if wrong then kill the instruction and
update BTB & BPb else update BPb
50
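A sketch of the IF-stage lookup and the later update for a direct-mapped BTB (field names and sizes are my own choices):

# Direct-mapped Branch Target Buffer sketch: IF-stage lookup plus later update.
class BTB:
    def __init__(self, index_bits=7):
        self.mask = (1 << index_bits) - 1
        self.entries = {}                      # index -> (tag_pc, target, taken_bit)

    def next_pc(self, pc):
        """IF stage: if the entry matches and predicts taken, fetch the target."""
        e = self.entries.get((pc >> 2) & self.mask)
        if e is not None and e[0] == pc and e[2]:
            return e[1]                        # predicted target
        return pc + 4                          # default speculation

    def update(self, pc, taken, target):
        """After the branch resolves: record/refresh the entry."""
        self.entries[(pc >> 2) & self.mask] = (pc, target, taken)

btb = BTB()
print(hex(btb.next_pc(0x100)))   # 0x104: no entry yet, fall through to PC+4
btb.update(0x100, taken=True, target=0x1C8)
print(hex(btb.next_pc(0x100)))   # 0x1c8: predicted taken with stored target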
Address Collisions (Misprediction)

Assume a 128-entry BTB.

  132   Jump +104     (BTB entry: target = 236, BPb = take)
  ...
  1028  Add ...

What will be fetched after the instruction at 1028?
(The Add at 1028 maps to the same BTB entry as the Jump at 132.)
BTB prediction = 236
Correct target = 1032

=> kill PC=236 and fetch PC=1032

Is this a common occurrence?

51
BTB is only for Control Instructions

• Is even branch prediction fast enough to avoid bubbles?


• When do we index the BTB?
– i.e., what state is the branch in, in order to avoid bubbles?

• BTB contains useful information for branch and jump


instructions only
=> Do not update it for other instructions

• For all other instructions the next PC is PC+4 !


• How to achieve this effect without decoding the instruction?

52
Branch Target Buffer (BTB)
[Figure: the PC indexes both the I-Cache and a 2^k-entry direct-mapped BTB (can also be associative); each entry holds an entry PC (tag), a valid bit, and a predicted target PC, producing match, valid, and target outputs.]

• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded
53
Combining BTB and BHT
• BTB entries are considerably more expensive than BHT entries, but they can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR)
• The BHT can hold many more entries and is more accurate

[Figure: a deep front-end with stages A (PC Generation/Mux), P (Instruction Fetch Stage 1), F (Instruction Fetch Stage 2), B (Branch Address Calc / Begin Decode), I (Complete Decode), J (Steer Instructions to Functional Units), R (Register File Read), E (Integer Execute). The BTB is consulted at PC generation; the BHT, accessed in a later pipeline stage, corrects the fetch stream when the BTB misses a predicted-taken branch.]

BTB/BHT only updated after branch resolves in E stage


54
Uses of Jump Register (JR)
• Switch statements (jump to address of matching case)
BTB works well if same case used repeatedly

• Dynamic function call (jump to run-time function address)


BTB works well if the same function is usually called (e.g., in C++ programming, when objects have the same type in a virtual function call)

• Subroutine returns (jump to return address)


BTB works well if usually return to the same place
⇒ Often one function is called from many distinct call sites!
How well does BTB work for each of these cases?
55
Subroutine Return Stack
Small structure to accelerate JR for subroutine returns,
typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }

Push the call's return address when a function call is executed; pop it when a subroutine return is decoded.

[Figure: a small stack of k entries (typically k = 8-16) holding the pending return addresses, here &fb(), &fc(), &fd().]
56
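A minimal return-address-stack sketch (the fixed depth and the overflow policy are assumptions; real designs vary):

# Return address stack: push on call, pop on return, used to predict JR/JALR
# return targets. A small fixed depth suffices because call nesting is usually shallow.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:      # overflow: drop the oldest entry
            self.stack.pop(0)
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None   # None -> fall back to BTB

ras = ReturnAddressStack()
ras.on_call(0x104)      # fa() calls fb()
ras.on_call(0x204)      # fb() calls fc()
print(hex(ras.predict_return()))   # 0x204: fc() returns into fb()
print(hex(ras.predict_return()))   # 0x104: fb() returns into fa()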
