Pipelining
Advanced Computer Architecture
Pipeline Design
Instruction Pipeline Design
- Instruction Execution Phases
- Mechanisms for Instruction Pipelining
- Dynamic Instruction Scheduling
- Branch Handling Techniques
Arithmetic Pipeline Design
- Computer Arithmetic Principles
- Static Arithmetic Pipelines
- Multifunctional Arithmetic Pipelines
Typical Instruction Pipeline
A typical instruction execution proceeds through a sequence of operations:
- Instruction Fetch (F)
- Decode (D)
- Operand Fetch or Issue (I)
- Execute, possibly several stages (E)
- Write Back (W)
Source: Kai Hwang
Instruction Execution Phases
Each operation (F, D, I, E, W) may require one clock cycle or more. Ideally, these operations are overlapped across instructions. Example (assumptions):
- load and store instructions take four cycles
- add and multiply instructions take three cycles
Shaded regions indicate idle cycles due to dependencies
Source: Kai Hwang
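The overlap can be sketched as a simple space-time chart. This is a minimal sketch assuming one instruction issues per cycle; the example program and the decision to ignore dependency stalls (the shaded regions in the figure) are assumptions for illustration:

```python
# Space-time chart of ideal overlapped execution (no dependency stalls).
# Each instruction issues one cycle after its predecessor and then
# occupies `lat` consecutive cycles.
program = [("load", 4), ("load", 4), ("mul", 3), ("add", 3), ("store", 4)]

for i, (op, lat) in enumerate(program):
    print(f"{op:5s} " + ". " * i + "X " * lat)
```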
Mechanisms for Instruction Pipelining
Goal: achieve maximum parallelism in the pipeline by smoothing the instruction flow and minimizing idle cycles. Mechanisms:
- Prefetch buffers
- Multiple functional units
- Internal data forwarding
- Hazard avoidance
Prefetch Buffers
Used to match the instruction fetch rate to the pipeline consumption rate. In a single memory access, a block of consecutive instructions is fetched into a prefetch buffer. Three types of prefetch buffers:
- Sequential buffers, used to store instructions fetched in sequence
- Target buffers, used to store instructions fetched from a branch target
- Loop buffers, used to store the instructions of small loops
Source: Kai Hwang
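A toy sketch of a sequential prefetch buffer, modelled as a FIFO queue; the block size and the instruction naming are illustrative assumptions:

```python
# One memory access fills the buffer with a block of consecutive
# instructions; the pipeline then consumes them one per cycle.
from collections import deque

BLOCK = 4                                  # instructions per memory access
memory = [f"I{i}" for i in range(16)]      # pretend instruction memory

seq_buf = deque()                          # sequential prefetch buffer
pc = 0
while pc < 8:
    if not seq_buf:                        # buffer empty: one block fetch
        seq_buf.extend(memory[pc:pc + BLOCK])
    print("issue", seq_buf.popleft())
    pc += 1
```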
Multiple Functional Units
At times, a specific pipeline stage becomes the bottleneck, identifiable as a row with a large number of check marks in the reservation table. To resolve the resulting contention, multiple functional units with reservation stations (RS) are used. Each RS is uniquely identified by a tag, which is monitored by a tag unit (register tagging). Reservation stations help in conflict resolution and also serve as buffers.
Source: Kai Hwang
Internal Data Forwarding
Goal: replace memory access operations with register transfer operations. Types (a sketch of store-load forwarding follows below):
- Store-load forwarding
- Load-load forwarding
- Store-store forwarding
Source: Kai Hwang
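A toy peephole pass illustrating store-load forwarding: a load from an address that has just been written is replaced by a register-to-register move, eliminating the memory access. The three-field instruction encoding is an assumption for illustration:

```python
code = [
    ("store", "R1", "M[512]"),   # M[512] <- R1
    ("load",  "R2", "M[512]"),   # R2 <- M[512], same address
]

out = []
for ins in code:
    prev = out[-1] if out else None
    if (ins[0] == "load" and prev and prev[0] == "store"
            and prev[2] == ins[2]):             # load follows store to same address
        out.append(("move", ins[1], prev[1]))   # R2 <- R1, no memory access
    else:
        out.append(ins)

print(out)   # [('store', 'R1', 'M[512]'), ('move', 'R2', 'R1')]
```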
Hazard Avoidance
Reads and writes of shared variables by different instructions in the pipeline may lead to different results if the instructions are executed out of order. Types (see the classifier sketch below):
- Read-after-Write (RAW) hazard
- Write-after-Write (WAW) hazard
- Write-after-Read (WAR) hazard
Source: Kai Hwang
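A small sketch that classifies the hazard between two instructions from their read/write register sets; the (reads, writes) encoding is an assumption for illustration:

```python
def hazard(first, second):
    r1, w1 = first                 # (reads, writes) of the earlier instruction
    r2, w2 = second                # (reads, writes) of the later instruction
    if w1 & r2: return "RAW"       # later instruction reads what earlier writes
    if w1 & w2: return "WAW"       # both write the same register
    if r1 & w2: return "WAR"       # later overwrites what earlier reads
    return "none"

# I1: R3 <- R1 + R2 and I2: R5 <- R3 * R4 give a RAW hazard on R3
print(hazard(({"R1", "R2"}, {"R3"}), ({"R3", "R4"}, {"R5"})))
```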
Instruction Scheduling
Aim: to schedule instructions through an instruction pipeline. Types of instruction scheduling:
- Static scheduling, supported by an optimizing compiler
- Dynamic scheduling, achieved by Tomasulo's register-tagging scheme or by a scoreboarding scheme
Static Scheduling
Data dependencies in a sequence of instructions create interlocked relationships among them. Interlocking can be resolved by the compiler by increasing the separation between interlocked instructions. Example: two independent load instructions can be moved ahead so that the spacing between them and the multiply instruction that consumes their results is increased.
Tomasulo's Algorithm
Hardware-dependent scheme. Data operands are held in reservation stations (RS) until their dependencies are resolved. Register tagging is used to allocate and deallocate registers; all working registers are tagged. A minimal sketch follows below.
Source: Kai Hwang
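A minimal sketch of the register-tagging idea: each destination register is tagged with the reservation station that will produce its value, so a later instruction reads the tag rather than a stale value. This is a deliberate simplification of the full algorithm:

```python
regs = {"R1": 3, "R2": 5, "R3": 0}   # architectural register values
tags = {}                            # register -> RS tag while result pending

def issue(rs_tag, dst, src_a, src_b):
    # Each operand is either a ready value or the tag of its producer RS.
    a = tags.get(src_a, regs[src_a])
    b = tags.get(src_b, regs[src_b])
    tags[dst] = rs_tag               # dst now waits on this station
    print(f"{rs_tag}: {dst} <- {a} op {b}")

issue("RS1", "R3", "R1", "R2")       # R3 is tagged with RS1
issue("RS2", "R1", "R3", "R2")       # operand shows tag RS1, not stale R3
```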
Scoreboarding
Multiple functional units appear as multiple execution pipelines. The parallel units allow instructions to execute out of order with respect to the original program sequence. The processor has instruction buffers, and instructions are issued regardless of the availability of their operands. A centralized control unit called the scoreboard keeps track of the unavailable operands for the instructions stored in the buffers.
Source: Kai Hwang
Branch Handling Techniques
Pipeline performance is limited by the presence of branch instructions in a program. Various branch strategies are applied to minimize the resulting performance degradation. To evaluate a branch strategy, two approaches can be followed:
- Trace-data approach
- Analytical approach
Branching Illustrated
Ib: the branch instruction. Once the branch is resolved as taken, all instructions fetched after Ib are flushed from the pipeline. Subsequently, the instructions starting at the branch target are executed.
Source: Kai Hwang
Effect of Branching
Nomenclature:
- Branch taken: the action of fetching a non-sequential (remote) instruction after a branch instruction
- Branch target: the (remote) instruction to be executed after a branch taken
- Delay slot (b): the number of pipeline cycles wasted between a branch taken and the fetching of its branch target; in general, 0 <= b <= k-1, where k is the number of pipeline stages
Effect of Branching
When a branch is taken, all instructions fetched after the branch instruction become useless; the pipeline is flushed, losing a number of cycles. Let Ib be the branch instruction; a branch taken causes all instructions from Ib+1 to Ib+k-1 to be drained from the pipeline. Let p be the probability that an instruction is a branch instruction and q the probability that a branch is taken. The penalty, in terms of time, is
Tpenalty = pqnbt
where n is the number of instructions, b the number of pipeline cycles consumed per taken branch, and t the cycle time. The effective execution time becomes
Teff = kt + (n-1)t + pqnbt
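Plugging illustrative numbers into this model gives a feel for the cost of branching; all values below (k, n, t, p, q, b) are assumptions, not from the text:

```python
k, n, t = 4, 1000, 1e-9        # stages, instructions, cycle time (s)
p, q = 0.20, 0.60              # branch probability, taken probability
b = k - 1                      # worst-case delay slot

T_ideal   = (k + (n - 1)) * t  # execution time with no branches
T_penalty = p * q * n * b * t  # cycles lost to taken branches
print(f"ideal {T_ideal*1e9:.0f} ns, penalty {T_penalty*1e9:.0f} ns, "
      f"effective {(T_ideal + T_penalty)*1e9:.0f} ns")
```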
Branch Prediction
A branch can be predicted statically, from the branch type and program profile, or dynamically, from recent branch history.
Static Branch Strategy
The probability of a branch being taken for a particular branch type can be used to predict the branch. The probability may be obtained by collecting the frequencies of branch taken and of branch types across a large number of program traces.
Dynamic Branch Strategy
Uses limited recent branch history to predict whether or not branch will be taken when it occurs next time
Branch Prediction Internals
Branch prediction buffer
Used to store branch history information in order to make the branch prediction
State transition diagram used in dynamic branch prediction
Source: Kai Hwang
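As a concrete sketch of dynamic prediction, here is the common 2-bit saturating-counter automaton; whether this matches the exact diagram in the figure is an assumption:

```python
class TwoBitPredictor:
    # States 0,1 predict "not taken"; states 2,3 predict "taken".
    def __init__(self):
        self.state = 2                       # start in weakly "taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Step one state toward the actual outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in (True, True, False, True):    # observed branch outcomes
    print(p.predict(), end=" ")              # a single False is absorbed
    p.update(outcome)
```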
Delayed Branches
The branch penalty can be reduced by the concept of the delayed branch. The central idea is to delay the effect of the branch instruction so as to accommodate independent* instructions. Delaying by d cycles allows a few useful (independent*) instructions to be executed before the branch takes effect.
* The execution of these instructions must be independent of the outcome of the branch instruction.
Linear Pipeline Processors
A linear pipeline processor is constructed with k processing stages S1, ..., Sk. These stages are linearly connected to perform a specific function. The data stream flows from one end of the pipeline to the other: external inputs are fed into S1, final results move out from Sk, and intermediate results pass from Si to Si+1. Linear pipelining is applied to:
- Instruction execution
- Arithmetic computation
- Memory access operations
Asynchronous Model
Data flow between adjacent stages is controlled by handshaking protocol
When stage Si is ready to transmit, it sends a ready signal to stage Si+1. This is followed by the actual data transfer. After stage Si+1 receives the data, it returns an acknowledge signal to Si.
Source: Kai Hwang
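A toy model of this handshake, with queues standing in for the ready/data and acknowledge wires; the stage behaviour is a sketch, not a hardware-accurate model:

```python
import threading, queue

data_ch = queue.Queue(maxsize=1)   # Si -> Si+1: ready signal + data
ack_ch  = queue.Queue(maxsize=1)   # Si+1 -> Si: acknowledge signal

def stage_i(items):
    for x in items:
        data_ch.put(x)             # assert "ready" and transfer the data
        ack_ch.get()               # block until Si+1 acknowledges

def stage_i_plus_1(n):
    for _ in range(n):
        x = data_ch.get()          # receive the data
        print("Si+1 received", x)
        ack_ch.put(True)           # return the acknowledge

t = threading.Thread(target=stage_i_plus_1, args=(3,))
t.start()
stage_i([10, 20, 30])
t.join()
```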
Contd
Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers. They have a variable throughput rate, since different amounts of delay may be experienced in different stages.
Synchronous Model
Clocked latches are used to interface between stages
Latches are master-slave flip-flops that isolate inputs from outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage at the same time.
Pipeline stages are combinational circuits.
Source: Kai Hwang
Contd
It is desirable to have equal delays in all stages. These delays determine the clock period and thus the speed of the pipeline.
Reservation Table
It specifies the utilization pattern of successive stages in a synchronous pipeline. It is a space-time graph depicting the precedence relationship in the use of the pipeline stages. For a k-stage linear pipeline, k clock cycles are needed for data to flow through the pipeline.
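A minimal sketch of such a table for an assumed 4-stage linear pipeline, with rows as stages and columns as clock cycles:

```python
k = 4                                  # pipeline stages (assumed)
table = [["X" if col == row else "." for col in range(k)] for row in range(k)]
for row in table:
    print(" ".join(row))
# The diagonal of X's shows each stage used once per task and a
# flow-through time of k = 4 clock cycles.
```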
Clocking and Timing Control
Clock cycle and throughput:
The clock cycle time (t) of a pipeline is given by t = tm + d, where tm denotes the maximum stage delay and d denotes the latch delay. The pipeline frequency f = 1/t is referred to as the throughput of the pipeline.
Clock skewing:
Ideally, clock pulses should arrive at all stages at the same time, but due to clock skewing, the same clock pulse may arrive at different stages with an offset of s. Further, let tmax be the time delay of the longest logic path within a stage and tmin that of the shortest logic path within a stage; the clock period must then satisfy d + tmax + s <= t <= tm + tmin - s.
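With some assumed delays, this constraint bounds the admissible clock period as follows:

```python
d, s = 1.0, 0.5                  # latch delay, clock skew (ns) -- assumptions
tm = 10.0                        # maximum stage delay (ns)
tmax, tmin = 9.0, 5.0            # longest / shortest logic paths (ns)

lower = d + tmax + s             # 10.5 ns
upper = tm + tmin - s            # 14.5 ns
print(f"choose t in [{lower}, {upper}] ns; nominal t = tm + d = {tm + d} ns")
```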
Speedup
Case 1: Pipelined processor
Ideally, the number of clock cycles required by a k-stage pipeline to process n tasks is Np = k + (n-1): k clock cycles for the first task and 1 clock cycle for each of the remaining n-1 tasks. The total time required is Tk = (k + (n-1))t.
Case 2: Non-pipelined processor
An equivalent non-pipelined processor would take T1 = nkt.
Speedup factor: the speedup Sk of a k-stage pipeline over an equivalent non-pipelined processor is:
Sk = T1 / Tk = nkt / ((k + (n-1))t) = nk / (k + n - 1)
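A quick sketch of the formula shows Sk approaching k as n grows (k = 4 is an arbitrary choice here):

```python
def speedup(k, n):
    return (n * k) / (k + n - 1)

for n in (1, 10, 100, 10_000):
    print(n, round(speedup(4, n), 3))   # tends to k = 4 for large n
```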
Optimal number of stages:
Most pipelining is staged at the functional level with 2 <= k <= 15. Very few pipelines are designed to exceed 10 stages in real computers. The optimal choice of the number of pipeline stages should maximize the performance/cost ratio for the target processing load:
PCR = f / (c + kh) = 1 / ((t/k + d)(c + kh)), where f = 1/(t/k + d)
The value of k maximizing the PCR is
k0 = sqrt(tc / (dh))
where t is the total flow-through delay of the pipeline, c is the total stage cost, d is the latch delay, and h is the cost of each latch.
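A worked example of the optimal stage count, with assumed delay and cost figures:

```python
from math import sqrt

t, c = 18.0, 8.0      # total flow-through delay (ns), total stage cost
d, h = 1.0, 4.0       # latch delay (ns), cost per latch

k0 = sqrt(t * c / (d * h))
print(f"optimal number of stages k0 = {k0:.0f}")   # sqrt(36) = 6
```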
Efficiency & Throughput
Efficiency: defined as the speedup factor divided by the number of stages: Ek = Sk / k = n / (k + (n-1))
Pipeline throughput: defined as the number of tasks completed per unit time: Hk = n / ((k + (n-1))t) = nf / (k + (n-1))
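A small sketch computing both measures; the values of k, n, and t are assumptions:

```python
def efficiency(k, n):
    return n / (k + n - 1)

def throughput(k, n, t):
    return n / ((k + n - 1) * t)            # tasks per second

k, n, t = 4, 100, 1e-9                      # 4 stages, 100 tasks, 1 ns cycle
print(f"Ek = {efficiency(k, n):.3f}")       # ~0.971
print(f"Hk = {throughput(k, n, t):.3e} tasks/s")
```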