Lecture08 RISCV Impl Pipeline
Lecture08 RISCV Impl Pipeline
Implementation
1
Acknowledgement
• Slides adapted from Computer Science 152: Computer
Architecture and Engineering, Spring 2016 by Dr. George
Michelogiannakis from UC Berkeley
2
Review
• CPU performance factors
– Instruction count
• Determined by ISA and compiler Instructions Cycles Time
– CPI and Cycle time CPU Time = * *
• Determined by CPU hardware Program Instruction Cycle
• CPI
– Single-cycle, CPI = 1, and normally longer cycle
– 5 stage unpipelined, CPI = 5
– 5 stage pipelined, CPI = 1
3
Review: Unpipelined Datapath for RISC-V
PCSel
br RegWriteEn MemWrite WBSel
rind
jabs
pc+4
0x4
Add
Add
clk
we Br Logic Bcomp?
clk
rs1
rs2
PC addr rd1 we
inst wa addr
wd rd2 ALU
clk Inst. GPRs rdata
Memory Data
Imm Memory
Select wdata
ALU
Control
4
Review: Hardwired Control Table
Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel
ALU * Reg Func no yes ALU rd pc+4
ALUi IType12 Imm Op no yes ALU rd pc+4
LW IType12 Imm + no yes Mem rd pc+4
SW SType12 Imm + yes no * * pc+4
BEQtrue SBType12 * * no no * * br
BEQfalse SBType12 * * no no * * pc+4
J * * * no no * * jabs
JAL * * * no yes PC X1 jabs
JALR * * * no yes PC rd rind
5
An Ideal Pipeline
7
5-Stage Pipelined Execution: Resource Usage
The Whole Pipeline Resources are Used by 5 Instructions in Every Cycle!
0x4
Add
we
rs1
rs2
PC addr rd1 we
rdata IR wa addr
wd rd2 ALU
GPRs rdata
Inst. Data
Imm Memory
Memory Select wdata
Write
I-Fetch -Back
Decode, Reg. Fetch Execute Memory
(IF) - I5 (EX) - I3 (ME) - I2 (WB) - I1
(ID) - I4
time t0 t1 t2 t3 t4 t5 t6 t7 ....
Instruction 1 IF1 ID1 EX1 ME1 WB1
Instruction 2 IF2 ID2 EX2 ME2 WB2
Instruction 3 IF3 ID3 EX3 ME3 WB3
Instruction 4 IF4 ID4 EX4 ME4 WB4
Instruction 5 IF5 ID5 EX5 ME5 WB5
8
Instruction Register in
Each Stage
0x4
Add IR IR IR
Instruction 1
we
rs1
rs2
addr rd1 A
PC we
inst IR wa ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data R
Imm Memory
wdata
Select wdata
MD1 MD2
9
Connect Controls from Instruction Register
F D E M W
IR IR IR
Instruction 3
Instruction 2
0x4
Add Instruction 1
RegWriteEn
we ALU
rs1 Control MemWrite
WBSel
rs2
addr rd1 A we
PC
inst IR wa ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data R
Imm
Select
Memory
wdata
wdata
MD1 MD2
ImmSel Op2Sel
Instruction 4
10
Compared With Control Logic in Unpipelined
• Unpipelined:
– Single control logic uses
the instruction from IF
• Pipelined:
– Distributed logics that
uses instructions from
IRs in each stage
11
Instructions Interact With Each Other in Pipeline:
Dealing with Hazards
• An instruction may need a resource being used by another
instruction à structural hazard
– Solution #1: Stalling newer instruction till older instruction finishes
– Solution #2: Adding more hardware to design
• E.g., separate memory into I-memory and D-memory
– Our 5-stage pipeline has no structural hazards by design
x4 ¬ x1 … x1 ¬ …
0x4
Add IR IR IR
we
rs1
rs2
addr rd1 A
PC we
inst IR wa ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data
Imm Memory R
Select wdata
wdata
MD1 MD2
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I3 I3 I3 I4 I5
ID I1 I2 I2 I2 I2 I3 I4 I5
Resource EX I1 - - - I2 I3 I4 I5
Usage ME I1 - - - I2 I3 I4 I5
WB I1 - - - I2 I3 I4 I5
- Þ pipeline bubble
14
Insert Bubble for
Interlocking
Bubble: Nop instruction, e.g. add x0, x0, x0
Stall Condition
Bubble are inserted at ID and by entering EXE stage
0x4 bubble
IR IR IR
Add
we
rs1
rs2
addr rd1 A
PC we
inst IR wa ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data
Imm Memory R
wdata
Select wdata
...
MD1 MD2
x1 ¬ x0 + 10
x4 ¬ x1 + 17
... 15
Interlock Control Logic
...
x1 ¬ x0 + 10 ws
stall
x4 ¬ x1 + 17 Cstall
... rs2 ?
rs1
0x4 bubble
Add IR IR IR
we
rs1
rs2
addr rd1 A
PC we
inst IR wa ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data
Imm Memory R
wdata
Select wdata
MD1 MD2
we
rs1
rs2
addr rd1 A
PC we
inst IR wa ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data R
Imm Memory
wdata
Select wdata
MD1 MD2
source(s) destination
ALU rd <= rs1 func10 rs2 rs1, rs2 rd
ALUI rd <= rs1 op imm rs1 rd
LW rd <= M [rs1 + imm] rs1 rd
SW M [rs1 + imm] <= rs2 rs1, rs2 -
Bcond rs1,rs2 rs1, rs2 -
true: PC <= PC + imm
false: PC <= PC + 4
JAL x1 <= PC, PC <= PC + imm - rd
JALR rd <= PC, PC <= rs1 + imm rs1 rd 18
Deriving the Stall Signal
Cre
re1 = Case opcode
ALU, ALUi, LW, SW, Bcond, JALR => on
source(s) destination JAL =>off
ALU rd <= rs1 func10 rs2 rs1, rs2 rd
ALUI rd <= rs1 op imm rs1 rd re2 = Case opcode
LW
SW
rd <= M [rs1 + imm]
M [rs1 + imm] <= rs2
rs1
rs1, rs2
rd
-
ALU, SW, Bcond =>on
Bcond rs1,rs2 rs1, rs2 - … =>off
true: PC <= PC + imm
false: PC <= PC + 4 No need the WB for interlock control since we only need to deal with hazard
JAL x1 <= PC, PC <= PC + imm - rd
between MEM-EXE and EXE-EXE. For two instructions which are in WB and EXE,
JALR rd <= PC, PC <= rs1 + imm rs1 rd
and have RAW hazard, the dependency are handled through the register file.
• Slide 48 of lecture05-06
21
Adding a Bypass (To Bypass Register Files)
stall
ASrc
we
rs1
rs2
addr D rd1 A
PC we
inst IR wa ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data R
Imm Memory
wdata
Select wdata
MD1 MD2
ws = rd we = Case opcode
ALU, ALUi, LW,, JAL JALR => on
... => off
No because only ALU and ALUi instructions can benefit from this
bypass
Split weE into two components: we-bypass, we-stall
23
Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE we-stallE = Case opcodeE
ALU, ALUi => on LW, JAL, JALR=> on
... => off JAL => on
... => off
ASrc = (rs1D == wsE) && we-bypassE && re1D
stall = ((rs1D == wsE) && we-stallE +
(rs1D== wsM) && weM + (rs1D== wsW) && weW) && re1D
+((rs2D == wsE) && weE + (rs2D == wsM) && weM + (rs2D == wsW) && weW) && re2D
24
Fully Bypassed Datapath
stall PC for JAL, ...
0x4 E M W
bubble IR IR IR
Add
ASrc
we
rs1
rs2
addr D rd1 A
PC we
inst IR wa ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data R
Imm Memory
wdata
Select wdata
BSrc
MD1 MD2
Is there still
a need for the
stall signal ? stall = (rs1D== wsE) && (opcodeE==LWE)&&(wsE!=0 )&&re1D
+ (rs2D== wsE) && (opcodeE== LWE)&&(wsE!=0 )&&re2D
25
Control Hazards: Branches and Jumps
• JAL: unconditional jump to PC+immediate
26
Info for
Control Transfer
Two pieces of info:
1. Taken or Not Taken
2. Target address?
27
Speculate Next Address is PC+4
PCSrc (pc+4 / jabs / rind/ br) stall
Add
E M
0x4 bubble
Add IR IR
Jump? I1
PC addr
inst IR
104 Inst
Memory I2
Jump? II21 I1
IRSrcD
Any
addr
interaction
PC bubble
inst IR between
304
104 Inst
bubble
I2
stall and
Memory
jump?
IRSrcD = Case opcodeD
I1 096 ADD JAL Þ bubble
I2 100 J 304 ... Þ IM
I3 104 ADD kill
I4 304 ADD 29
Jump Pipeline Diagrams
time
t0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1
(I2) 100: J 304 IF2 ID2 EX2 ME2 WB2
(I3) 104: ADD IF3 - - - -
(I4) 304: ADD IF4 ID4 EX4 ME4 WB4
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 - I4 I5
Resource EX I1 I2 - I4 I5
Usage ME I1 I2 - I4 I5
WB I1 I2 - I4 I5
- Þ pipeline bubble
30
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br) stall
Add
E M
0x4 bubble
Add IR IR
BEQ? I1
Taken?
IRSrcD
addr
PC
inst bubble IR
A
ALU Y
104 Inst
I2
Memory
I1 096 ADD
Branch condition is not known until the
I2 100 BEQ x1,x2 +200 execute stage
I3 104 ADD
I4 304 ADD
31
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br) stall
?
Add
E Bcond? M
0x4 bubble
Add IR IR
I2 I1
Taken?
IRSrcD
addr
PC
inst bubble IR
A
ALU Y
108 Inst
I3
Memory
If the branch is taken:
I1 096 ADD
I2 100 BEQ x1,x2 +200 - Kill the two following instructions
I3 104 ADD - The instruction at the decode stage is
I4 304 ADD not valid Þ stall signal is not valid
32
Pipelining Conditional Branches
PCSrc (pc+4/jabs/rind/br) stall
Add
E Bcond? M
IRSrcE
0x4 bubble
Add IR IR
Jump? I2 I1
Taken?
PC
IRSrcD
addr
PC
inst bubble IR
A
ALU Y
108 Inst
I3
Memory
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1
(I2) 100: BEQ +200 IF2 ID2 EX2 ME2 WB2
(I3) 104: ADD IF3 ID3 - - -
(I4) 108: IF4 - - - -
(I5) 304: ADD IF5 ID5 EX5 ME5 WB5
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 I3 - I5
Resource EX I1 I2 - - I5
Usage ME I1 I2 - - I5
WB I1 I2 - - I5
- Þ pipeline bubble
34
Use Simpler Branches: E.g. Only Compare One
Register Against Zero in ID Stage
time
t0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1
(I2) 100: BEQZ +200 IF2 ID2 EX2 ME2 WB2
(I3) 104: ADD IF3 - - - -
(I4) 300: ADD IF4 ID4 EX4 ME4 WB4
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 - I4 I5
Resource EX I1 I2 - I4 I5
Usage ME I1 I2 - I4 I5
WB I1 I2 - I4 I5
- Þ pipeline bubble
35
Pipelined MIPS Datapath
Instruction Instr. Decode Execute Memory Writ
Fetch Reg. Fetch Addr. Calc Access Back
Next PC Next
SEQ PC
MUX
Adder
Adder
Zero?
4 RS1
MEM/WB
Address
Memory
RS2
EX/MEM
Reg File
ID/EX
ALU
IF/ID
Memory
MUX
Data
MUX
(I1) 096: ADD Sign
Extend
(I2) 100: BEQZ +200 Imm
36
Control Hazard Delay Summary
• JAL: unconditional jump to PC+immediate
– 1 cycle delay of pipeline
• JALR: indirect jump to rs1+immediate
– 1 cycle delay
• Branch: if (rs1 conds rs2), branch to PC+immediate
– 2 cycles delay
– 1 cycle delay for simpler branch (BEQZ) with pipeline
improvement
37
Reducing Control Flow Penalty
• Software solutions
– Eliminate branches - loop unrolling
• Increases the run length
– Reduce resolution time - instruction scheduling
• Compute the branch condition as early as possible (of
limited value because branches often in critical path through
code)
• Hardware solutions
– Find something else to do - delay slots
• Replaces pipeline bubbles with useful work (requires
software cooperation)
– Speculate - branch prediction
• Speculative execution of instructions beyond the branch
38
Additional Materials – Branch
Prediction
39
Branch Prediction
• Motivation
– Branch penalties limit performance of deeply pipelined processors
– Modern branch predictors have high accuracy
– (>95%) and can reduce branch penalties significantly
41
Dynamic Branch Prediction
learning based on past behavior
• Temporal correlation (time)
– If I tell you that a certain branch was taken last time, does
this help?
– The way a branch resolves may be a good predictor of the way
it will resolve at the next execution
42
Dynamic Branch Prediction
• 1-bit prediction scheme
– Low-portion address as address for a one-bit flag for Taken or
NotTaken historically
– Simple
• 2-bit prediction
– Miss twice to change
43
Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!
¬take
wrong ¬ taken
taken
taken
taken ¬ taken
take ¬take
right right
taken
¬ taken ¬ taken
take
wrong
BP state:
(predict take/¬take) x (last prediction right/wrong)
44
Branch History Table
Fetch PC 00
k 2k-entry
I-Cache BHT Index BHT,
2 bits/entry
Instruction
Opcode offset
46
Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches
to select one of the four sets of BHT bits (~95% correct)
00
Fetch PC k
Shift in Taken/¬Taken
results of each branch
Taken/¬Taken? 47
Speculating Both Directions
• An alternative to branch prediction is to execute
both directions of a branch speculatively
– resource requirement is proportional to the number of concurrent
speculative executions
– only half the resources engage in useful work when both directions
of a branch are executed speculatively
– branch prediction takes less resources than speculative execution
of both paths
• With accurate branch prediction, it is more cost
effective to dedicate all resources to the
predicted direction!
– What would you choose with 80% accuracy?
48
Are We Missing Something?
• Knowing whether a branch is taken or not is great, but what
else do we need to know about it?
49
Branch Target Buffer
predicted BPb
target
Branch
Target
IMEM
Buffer
(2k entries)
k
PC
target BP
51
BTB is only for Control Instructions
52
Branch Target Buffer (BTB)
I-Cache 2k-entry direct-mapped BTB
PC (can also be associative)
Entry PC Valid predicted
target PC
A PC Generation/Mux
BTB P Instruction Fetch Stage 1
F Instruction Fetch Stage 2
BHT in later BHT B Branch Address Calc/Begin Decode
pipeline stage
corrects when
I Complete Decode
BTB misses a J Steer Instructions to Functional units
predicted taken R Register File Read
branch
E Integer Execute
&fd() k entries
&fc() (typically k=8-16)
&fb()
56