Pipelining: Basic and
Intermediate Concepts
Chappathi Making
Chappathi Making (Contd.)
Chappathi Making (Contd.)
Pipelining Basics
• Pipelining overlaps execution of instructions to
improve performance.
• Pipelining partitions the system into multiple
independent stages with added buffers between
the stages.
• Similar to an assembly line where each stage
handles part of the task.
• Each stage works concurrently on different
instructions.
• Key metric: throughput, or instructions completed
per unit time.
Classic Five-Stage RISC
Pipeline
• Five stages: Instruction Fetch (IF),
Instruction Decode (ID), Execute
(EX), Memory Access (MEM), Write-
Back (WB).
• Each new instruction begins one
cycle after the previous.
• Multiple instructions progress
simultaneously through the pipeline.
Visualizing Pipelining
Pipelining Characteristics
Time Pipelining doesn’t reduce
30 40 40 40 40 20 latency of single task, it
A improves throughput of entire
workload
B
Pipeline rate limited by slowest
C pipeline stage
D Potential speedup = Number of
pipe stages
Unbalanced lengths of pipe stages reduces speedup
Time to “fill” pipeline and time to “drain” it reduces speedup
Datapaths – single cycle and multi-cycle
●
Datapath
– collection of functional units (like ALU,
registers, memory) and their connections
that execute instructions
– It’s the “hardware pathway” data travels
through during instruction execution.
– two kinds of Datapaths: Single Cycle
Datapath and Multiple Cycle Datapath
26 / 35
Single Cycle Datapath
●
Every instruction completes in exactly one clock
cycle — no matter if it’s a simple ADD or a complex
LOAD.
●
Key characteristics:
●
One long clock cycle that must be long enough for
the slowest instruction.
●
All steps (fetch, decode, execute, memory access,
write-back) happen in that one cycle.
●
Hardware must duplicate some resources (e.g.,
separate instruction and data memory) so multiple
steps can happen at once in that cycle.
27 / 35
Single Cycle Datapath (Contd.)
●
Advantages of Single Cycle Datapath
●
Simplicity: It is less complex than top-down since
its development is easier than that of other large
systems and as such perfect for such systems.
●
Consistent Timing: All of the instructions take
equal amounts of time.
28 / 35
Single Cycle Datapath (Contd.)
●
Disadvantages of Single Cycle Datapath
●
Inefficient for Complex Instructions: Since all
instructions, including the basic ones, have to wait
for the longest operation to be completed, this can
result in more time being taken.
●
Duplicate Hardware Requirement: It calls for
more hardware due to the fact that functional units
cannot be shared or utilized in the same instruction
cycle.
29 / 35
Multi-Cycle Datapath
●
Each instruction is split into multiple steps, with each
step taking one clock cycle.
●
Different instructions take different numbers of
cycles depending on complexity.
●
Key characteristics:
●
Shorter clock cycle (based on the slowest single step,
not the slowest whole instruction).
●
Can reuse hardware (same memory for instruction
and data in different cycles).
●
Control logic is more complex (often
microprogrammed).
30 / 35
Multi-Cycle Datapath (Contd.)
●
Advantages of Multi-Cycle Datapath
●
More efficient — clock period is shorter, simple
instructions finish quickly, complex ones take longer.
●
Less hardware duplication - Some functional units
can be utilized more than once during a single step
of the same instruction, as is the case with ALU.
31 / 35
Multi-Cycle Datapath (Contd.)
●
Disadvantages of Multi-Cycle Datapath
●
Complex Control Logic: Additionally, the control
unit has to be more complex to handle multiple
cycles per one instruction.
●
Extra Registers: In between the steps, more
registers are required to hold results that will be used
in later operations or computations.
●
Slower overall throughput compared to pipelining
(but better than single-cycle).
32 / 35
Single Vs Multi-Cycle Datapaths
33 / 35
Pipeline – a series of datapaths
34 / 35
Datapath with pipeline registers
35 / 35
Pipelined RISC Data path
Instruction Instr. Decode Execute Memory Write
Fetch Reg. Fetch Addr. Calc Access Back
Next PC
Next SEQ PC Next SEQ PC
Adder
4 RS1
Zero?
MEM/WB
Address
Reg File
EX/MEM
RS2
ID/EX
IF/ID
ALU
Memory
Memory
MUX
WB Data
Sign
Extend
Imm
RD RD RD
RISC MIPS Instruction Pipeline
Each instruction can take at most 5 clock cycles
Instruction fetch cycle (IF)
Based on PC, fetch the instruction from memory
Increment PC
Next PC
Instruction
Fetch Adder
4
Address
Memory
RISC MIPS Instruction Pipeline
Instruction decode/register fetch cycle (ID)
Decode the instruction + register read operation
Fixed field decoding [ADD R1,R2,R2] OR [LW R1,8(R2) ]
Ex: A3.01.02.03 : 10100011 00000001 00000010 00000011
Ex: 86.01.02.03 : 10000110 00000001 00001000 00000010
Next SEQ PC
Instr. Decode
RS1
Reg. Fetch
Reg File
RS2
Sign
Extend
Imm
RD
RISC MIPS Instruction Pipeline
Execution/Effective address cycle (EX)
Memory reference: Calculate the effective address
[LW R1,8(R2) ] EFF ADDR= [R2] +8
Register-register ALU instruction [ADD R1,R2,R2]
Execute
Zero?
Addr. Calc
ALU
RD
RISC MIPS Instruction Pipeline
Memory access cycle (MEM)
Load from memory and store in register [LW R1,8(R2)]
Store the data from the register to memory [SW R3,16(R4)]
Memory
Access
Memory
RISC Instruction Pipeline
Write-back cycle (WB)
Register-register ALU instruction or load instruction
Write to register file [LW R1,8(R2)] , [ADD R1,R2,R3]
Instr. Decode Write
Reg. Fetch Back
RS1
RS2
Reg File
RD
Imm
Sign
Extend
Visualizing Pipelining
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9
ALU
IM REG DM REG
IM REG ALU DM REG
IM REG ALU DM REG
ALU
IM REG DM REG
Benefits of Pipelining
• Increases instruction throughput
significantly.
• Can approach ideal speedup =
number of pipeline stages.
• Reduction in CPI (cycles per
instruction) for many workloads.
• Not visible to the programmer;
handled entirely in hardware.
Limits to pipelining
Hazards: circumstances that would cause incorrect execution if
next instruction is fetched and executed
Structural hazards: Attempting to use the same hardware to do
two different things at the same time
Data hazards: Instruction depends on result of prior instruction
still in the pipeline
Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches)
Structural Hazard
Eg: Uniport Memory
Time (clock cycles)
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
ALU
Load Ifetch Reg DMem Reg
ALU
Instr 1 Ifetch Reg DMem Reg
ALU
Instr 2 Ifetch Reg DMem Reg
ALU
Instr 3 Ifetch Reg DMem Reg
Instr 4
Structural Hazard
Resolving Structural Hazard
Eliminate the use same hardware for two different things at the
same time
Solution 1: Wait
must detect the hazard
must have mechanism to stall
Solution 2: Duplicate hardware
Multiple such units will help both instruction to progress
Handling Structural Hazards
• Duplicate or pipeline functional units
to avoid conflicts.
• Use stalls (pipeline bubbles) when
conflicts can't be avoided.
• Example: shared memory causing
fetch/store conflict.
• Optimized designs use split caches
and buffers to reduce stalls.
Detecting & Resolving Structural Hazard
Time (clock cycles)
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
ALU
Load Ifetch Reg DMem Reg
ALU
Instr 1 Ifetch Reg DMem Reg
ALU
Instr 2 Ifetch Reg DMem Reg
Stall Bubble Bubble Bubble Bubble Bubble
ALU
Instr 3 Ifetch Reg DMem Reg
Eliminating Structural Hazards at Design
Adder
Next PC
Next SEQ PC Next SEQ PC
4 Zero?
RS1
MUX MUX
MEM/WB
Address
RS2
EX/MEM
Reg File
ID/EX
IF/ID
ALU
Cache
Cache
Instr
Data
MUX
WB Data
Sign
Extend
Imm
RD RD RD
Data Hazard
Time (clock cycles)
IF ID/RF EX MEM WB
ALU
add r1,r2,r3 Ifetch Reg DMem Reg
ALU
Ifetch Reg DMem Reg
sub r4,r1,r3
ALU
and r6,r1,r7 Ifetch Reg DMem Reg
ALU
Ifetch Reg DMem Reg
or r8,r1,r9
ALU
xor r10,r1,r11 Ifetch Reg DMem Reg
Data Hazards and
Forwarding
• Data hazard: later instruction reads a
register before it’s updated.
• Forwarding (bypassing) sends results
directly to dependent stages.
• Prevents stalls for most ALU
operations.
• More complex forwarding needed for
multi-cycle instructions.
Three Generic Data Hazards
Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I : add r1,r2,r3
J: sub r4,r1,r3
Caused by a data dependence
This hazard results from an actual need for communication.
Three Generic Data Hazards
Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an anti-dependence by compiler writers.
This results from reuse of the name r1
Can’t happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and
Reads are always in stage 2, and
Writes are always in stage 5
Three Generic Data Hazards
Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
Called an output dependence J: add r1,r2,r3
This also results from the reuse of name r1. K: mul r6,r1,r7
Can’t happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and
Writes are always in stage 5
WAR and WAW happens in out of order pipes
Operand Forwarding to Avoid Data Hazard
Time (clock cycles)
ALU
add r1,r2,r3 Ifetch Reg DMem Reg
ALU
sub r4,r1,r3 Ifetch Reg DMem Reg
ALU
Ifetch Reg DMem Reg
and r6,r1,r7
ALU
Ifetch Reg DMem Reg
or r8,r1,r9
ALU
Ifetch Reg DMem Reg
xor r10,r1,r11
Pipeline Interlocks
• Hardware mechanism to detect and
resolve hazards.
• Stalls the pipeline when data isn't
available in time.
• Ensures correct execution without
programmer intervention.
• Example: stall after a load followed
by a dependent ALU op.
Branch Hazards
• Branches disrupt sequential
instruction flow.
• The target of the branch isn’t known
until later stage (ID or EX).
• Stalls or incorrect path execution can
occur.
• Common solution: freeze pipeline or
flush incorrect instructions.
Control Hazard on Branches
=> Three Stage Stall
ALU
10: beq r1,r3,36 Ifetch Reg DMem Reg
ALU
Ifetch Reg DMem Reg
14: and r2,r3,r5
ALU
18: or r6,r1,r7 Ifetch Reg DMem Reg
ALU
Ifetch Reg DMem Reg
22: add r8,r1,r9
ALU
36: xor r10,r1,r11 Ifetch Reg DMem Reg
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
#3: Predict Branch Taken
#4: Delayed Branch
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
Execute successor instructions in sequence
“Squash” instructions in pipeline if branch actually taken
Four Branch Hazard Alternatives
#3: Predict Branch Taken
But branch target address in is not known by IF stage
Target is known at same time as branch outcome (IDstage)
MIPS still incurs 1 cycle branch penalty
Four Branch Hazard Alternatives
#4: Delayed Branch
Define branch to take place AFTER one instruction
following the branch instruction.
1 slot delay allows proper decision and branch target
address in 5 stage pipeline (MIPS uses this approach)
Where to get instructions to fill branch delay slot?
Conditional Branches
When do you know you have a branch?
During ID cycle (Could you know before that?)
When do you know if the branch is Taken or Not-Taken
During EXE cycle/ ID stage depending on the design
We need for sophisticated solutions for following cases
Modern pipelines are deep ( 10 + stages)
Several instructions issued/cycle
Several predicted branches in-flight at the same time
Simple static predictive schemes
Predict branch Not -Taken (easiest to implement; default
for dynamic branch prediction)
If prediction correct no problem;
If prediction incorrect, delay = number of stages
between ID and EXE
Predict branch Taken
Interesting only if target address can be computed
early
Prediction depends on the direction of branch
Backward-Taken-Forward-Not-Taken (BTFNT)
Rationale: Backward branches at end of loops:
mostly taken
Dynamic branch prediction
Execution of a branch requires knowledge of:
Branch instruction - encode whether instruction is a
branch or not. Decide on taken or not taken (i.e.,
prediction can be done at IF stage)
Whether the branch is Taken/Not-Taken (hence a branch
prediction mechanism)
If the branch is taken what is the target address (can be
computed but can also be “precomputed”, i.e., stored in
some table)
If the branch is taken what is the instruction at the branch
target address (saves the fetch cycle for that instruction)
Dynamic branch prediction
Use a Branch Prediction Buffer (BPB)
Also called Branch Prediction Table (BPT), Branch History
Table (BHT)
Records previous outcomes of the branch instruction.
How to index into the table is an issue.
A prediction using BPB is attempted when the branch
instruction is fetched (IF stage or equivalent)
It is acted upon during ID stage (when we know we have a
branch)
Dynamic branch prediction
Has a prediction been made (Y/N)
If not use default “Not Taken”
Is it correct or incorrect ?
Two cases:
Case 1: Yes and the prediction was correct (known at
ID stage) or No but the default was correct: No delay
Case 2: Yes and the prediction was incorrect or No and
the default was incorrect: Delay
RISC Instruction Set
• MIPS
RISC-V 64-bit
32/64 bit is used as the
representative RISC architecture.
• Three main instruction classes: ALU,
Load/Store, Branches.
• Fixed instruction format simplifies
decoding and pipelining.
• Registers are the only operands for
arithmetic/logical ops.
Unpipelined Implementation
• Executes one instruction at a time,
over multiple cycles.
• Typical instruction takes 4-5 cycles
(CPI ≈ 4.54).
• Easy to understand and implement,
but inefficient.
• Used as a baseline to illustrate
pipelining benefits.
Unpipelined RISC Data path
Instruction Instr. Decode Execute Memory Write
Fetch Reg. Fetch Addr. Calc Access Back
Next PC
Adder
Next SEQ PC
4 RS1
Zero?
Reg File
Address
RS2
ALU
Inst
Memory
Memory
RD L
M
D
Sign
Imm Extend
WB Data
Implementing the Pipeline
• Split instruction/data memory or
cache to avoid memory conflicts.
• Use pipeline registers between
stages (e.g., IF/ID, ID/EX).
• Careful control needed to avoid
simultaneous hardware usage.
• Balance stage durations to prevent
bottlenecks.
Pipelined RISC Data path
Instruction Instr. Decode Execute Memory Write
Fetch Reg. Fetch Addr. Calc Access Back
Next PC
Next SEQ PC Next SEQ PC
Adder
4 RS1
Zero?
MEM/WB
Address
Reg File
EX/MEM
RS2
ID/EX
IF/ID
ALU
Memory
Memory
MUX
WB Data
Sign
Extend
Imm
RD RD RD
RISC MIPS Instruction Pipeline
RISC-V
Each instruction can take at most 5 clock cycles
Instruction fetch cycle (IF)
Based on PC, fetch the instruction from memory
Increment PC
Next PC
Instruction
Fetch Adder
4
Address
Memory
RISC MIPS Instruction Pipeline
RISC-V
Instruction decode/register fetch cycle (ID)
Decode the instruction + register read operation
Fixed field decoding [ADD R1,R2,R2] OR [LW R1,8(R2) ]
Ex: A3.01.02.03 : 10100011 00000001 00000010 00000011
Ex: 86.01.02.03 : 10000110 00000001 00001000 00000010
Next SEQ PC
Instr. Decode
RS1
Reg. Fetch
Reg File
RS2
Sign
Extend
Imm
RD
RISC MIPS Instruction Pipeline
RISC-V
Execution/Effective address cycle (EX)
Memory reference: Calculate the effective address
[LW R1,8(R2) ] EFF ADDR= [R2] +8
Register-register ALU instruction [ADD R1,R2,R2]
Execute
Zero?
Addr. Calc
ALU
RD
RISC MIPS Instruction Pipeline
RISC-V
Memory access cycle (MEM)
Load from memory and store in register [LW R1,8(R2)]
Store the data from the register to memory [SW R3,16(R4)]
Memory
Access
Memory
RISC Instruction Pipeline
RISC-V
Write-back cycle (WB)
Register-register ALU instruction or load instruction
Write to register file [LW R1,8(R2)] , [ADD R1,R2,R3]
Instr. Decode Write
Reg. Fetch Back
RS1
RS2
Reg File
RD
Imm
Sign
Extend
RISC-V
RISC MIPS Instruction Pipeline
Cycles required to implement different instructions
Branch instructions – 4 cycles
Store instructions – 4 cycles
All other instructions – 5 cycles
EX
IF ID MEM WB
Visualizing Pipelining
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9
ALU
IM REG DM REG
IM REG ALU DM REG
IM REG ALU DM REG
ALU
IM REG DM REG
Multi-cycle Operations
Can EXE stage complete the operation in 1 cycle ?
Some operations require more than 1 clock cycle to complete.
Floating Point/Integer Multiply
Floating Point/Integer Divide
Floating Point Add/Sub
Dedicated hardware units are available on the processor for
performing these operations.
FP-Mul and FP-Add are fully pipelined, but FP-Div is un-
pipelined.
Multi-cycle Operations
Latency: The number of intervening cycles between an
instruction that produces a result and an instruction that uses
the result.
Initiation / Repeat Interval: The number of cycles that must
elapse between issuing two operations of a given type.