Pipelining
Advanced Computer Architecture
Pipeline Design
Instruction Pipeline Design
- Instruction Execution Phases
- Mechanisms for Instruction Pipelining
- Dynamic Instruction Scheduling
- Branch Handling Techniques
Arithmetic Pipeline Design
- Computer Arithmetic Principles
- Static Arithmetic Pipelines
- Multifunctional Arithmetic Pipelines
Typical Instruction Pipeline
A typical instruction execution proceeds through a sequence of operations:
- Instruction Fetch (F)
- Decode (D)
- Operand Fetch or Issue (I)
- Execute, possibly several stages (E)
- Write Back (W)
Source: Kai Hwang
Instruction Execution Phases
Each operation (F, D, I, E, W) may require one clock cycle or more. Ideally, these operations are overlapped across instructions. Example (assumptions):
- load and store instructions take four cycles
- add and multiply instructions take three cycles
Shaded regions indicate idle cycles due to dependencies
Source: Kai Hwang
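The overlap can be sketched as a simple space-time chart. This is a minimal sketch assuming one instruction issues per cycle; the example program and the decision to ignore dependency stalls (the shaded regions in the figure) are assumptions for illustration:

```python
# Space-time chart of ideal overlapped execution (no dependency stalls).
# Each instruction issues one cycle after its predecessor and then
# occupies `lat` consecutive cycles.
program = [("load", 4), ("load", 4), ("mul", 3), ("add", 3), ("store", 4)]

for i, (op, lat) in enumerate(program):
    print(f"{op:5s} " + ". " * i + "X " * lat)
```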
Mechanisms for Instruction Pipelining
Goal: achieve maximum parallelism in the pipeline by smoothing the instruction flow and minimizing idle cycles. Mechanisms:
- Prefetch buffers
- Multiple functional units
- Internal data forwarding
- Hazard avoidance
Prefetch Buffers
Used to match the instruction fetch rate to the pipeline consumption rate. In a single memory access, a block of consecutive instructions is fetched into a prefetch buffer. Three types of prefetch buffers:
- Sequential buffers, used to store instructions fetched in sequence
- Target buffers, used to store instructions fetched from a branch target
- Loop buffers, used to store the instructions of small loops
Source: Kai Hwang
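A toy sketch of a sequential prefetch buffer, modelled as a FIFO queue; the block size and the instruction naming are illustrative assumptions:

```python
# One memory access fills the buffer with a block of consecutive
# instructions; the pipeline then consumes them one per cycle.
from collections import deque

BLOCK = 4                                  # instructions per memory access
memory = [f"I{i}" for i in range(16)]      # pretend instruction memory

seq_buf = deque()                          # sequential prefetch buffer
pc = 0
while pc < 8:
    if not seq_buf:                        # buffer empty: one block fetch
        seq_buf.extend(memory[pc:pc + BLOCK])
    print("issue", seq_buf.popleft())
    pc += 1
```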
Multiple Functional Units
At times, a specific pipeline stage becomes the bottleneck, identifiable as a row with a large number of check marks in the reservation table. To resolve the resulting contention, multiple functional units with reservation stations (RS) are used. Each RS is uniquely identified by a tag, which is monitored by a tag unit (register tagging). Reservation stations help in conflict resolution and also serve as buffers.
Source: Kai Hwang
Internal Data Forwarding
Goal: replace memory access operations with register transfer operations. Types (a sketch of store-load forwarding follows below):
- Store-load forwarding
- Load-load forwarding
- Store-store forwarding
Source: Kai Hwang
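A toy peephole pass illustrating store-load forwarding: a load from an address that has just been written is replaced by a register-to-register move, eliminating the memory access. The three-field instruction encoding is an assumption for illustration:

```python
code = [
    ("store", "R1", "M[512]"),   # M[512] <- R1
    ("load",  "R2", "M[512]"),   # R2 <- M[512], same address
]

out = []
for ins in code:
    prev = out[-1] if out else None
    if (ins[0] == "load" and prev and prev[0] == "store"
            and prev[2] == ins[2]):             # load follows store to same address
        out.append(("move", ins[1], prev[1]))   # R2 <- R1, no memory access
    else:
        out.append(ins)

print(out)   # [('store', 'R1', 'M[512]'), ('move', 'R2', 'R1')]
```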
Hazard Avoidance
Reads and writes of shared variables by different instructions in the pipeline may lead to different results if the instructions are executed out of order. Types (see the classifier sketch below):
- Read-after-Write (RAW) hazard
- Write-after-Write (WAW) hazard
- Write-after-Read (WAR) hazard
Source: Kai Hwang
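A small sketch that classifies the hazard between two instructions from their read/write register sets; the (reads, writes) encoding is an assumption for illustration:

```python
def hazard(first, second):
    r1, w1 = first                 # (reads, writes) of the earlier instruction
    r2, w2 = second                # (reads, writes) of the later instruction
    if w1 & r2: return "RAW"       # later instruction reads what earlier writes
    if w1 & w2: return "WAW"       # both write the same register
    if r1 & w2: return "WAR"       # later overwrites what earlier reads
    return "none"

# I1: R3 <- R1 + R2 and I2: R5 <- R3 * R4 give a RAW hazard on R3
print(hazard(({"R1", "R2"}, {"R3"}), ({"R3", "R4"}, {"R5"})))
```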
Instruction Scheduling
Aim: to schedule instructions through an instruction pipeline. Types of instruction scheduling:
- Static scheduling, supported by an optimizing compiler
- Dynamic scheduling, achieved by Tomasulo's register-tagging scheme or by a scoreboarding scheme
Static Scheduling
Data dependencies in a sequence of instructions create interlocked relationships among them. Interlocking can be resolved by the compiler by increasing the separation between interlocked instructions. Example: two independent load instructions can be moved ahead so that the spacing between them and the multiply instruction that consumes their results is increased.
Tomasulo's Algorithm
Hardware-dependent scheme. Data operands are held in reservation stations (RS) until their dependencies are resolved. Register tagging is used to allocate and deallocate registers; all working registers are tagged. A minimal sketch follows below.
Source: Kai Hwang
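A minimal sketch of the register-tagging idea: each destination register is tagged with the reservation station that will produce its value, so a later instruction reads the tag rather than a stale value. This is a deliberate simplification of the full algorithm:

```python
regs = {"R1": 3, "R2": 5, "R3": 0}   # architectural register values
tags = {}                            # register -> RS tag while result pending

def issue(rs_tag, dst, src_a, src_b):
    # Each operand is either a ready value or the tag of its producer RS.
    a = tags.get(src_a, regs[src_a])
    b = tags.get(src_b, regs[src_b])
    tags[dst] = rs_tag               # dst now waits on this station
    print(f"{rs_tag}: {dst} <- {a} op {b}")

issue("RS1", "R3", "R1", "R2")       # R3 is tagged with RS1
issue("RS2", "R1", "R3", "R2")       # operand shows tag RS1, not stale R3
```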
Scoreboarding
Multiple functional units appear as multiple execution pipelines. The parallel units allow instructions to execute out of order with respect to the original program sequence. The processor has instruction buffers, and instructions are issued regardless of the availability of their operands. A centralized control unit called the scoreboard keeps track of the unavailable operands for the instructions stored in the buffers.
Source: Kai Hwang
Branch Handling Techniques
Pipeline performance is limited by the presence of branch instructions in a program. Various branch strategies are applied to minimize the resulting performance degradation. To evaluate a branch strategy, two approaches can be followed:
- Trace-data approach
- Analytical approach
Branching Illustrated
Ib: the branch instruction. Once the branch is resolved as taken, all instructions fetched after Ib are flushed from the pipeline. Subsequently, the instructions starting at the branch target are executed.
Source: Kai Hwang
Effect of Branching
Nomenclature:
- Branch taken: the action of fetching a non-sequential (remote) instruction after a branch instruction
- Branch target: the (remote) instruction to be executed after a branch taken
- Delay slot (b): the number of pipeline cycles wasted between a branch taken and the fetching of its branch target; in general, 0 <= b <= k-1, where k is the number of pipeline stages
Effect of Branching
When a branch is taken, all instructions fetched after the branch instruction become useless; the pipeline is flushed, losing a number of cycles. Let Ib be the branch instruction; a branch taken causes all instructions from Ib+1 to Ib+k-1 to be drained from the pipeline. Let p be the probability that an instruction is a branch instruction and q the probability that a branch is taken. The penalty, in terms of time, is
Tpenalty = pqnbt
where n is the number of instructions, b the number of pipeline cycles consumed per taken branch, and t the cycle time. The effective execution time becomes
Teff = kt + (n-1)t + pqnbt
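Plugging illustrative numbers into this model gives a feel for the cost of branching; all values below (k, n, t, p, q, b) are assumptions, not from the text:

```python
k, n, t = 4, 1000, 1e-9        # stages, instructions, cycle time (s)
p, q = 0.20, 0.60              # branch probability, taken probability
b = k - 1                      # worst-case delay slot

T_ideal   = (k + (n - 1)) * t  # execution time with no branches
T_penalty = p * q * n * b * t  # cycles lost to taken branches
print(f"ideal {T_ideal*1e9:.0f} ns, penalty {T_penalty*1e9:.0f} ns, "
      f"effective {(T_ideal + T_penalty)*1e9:.0f} ns")
```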
Branch Prediction
A branch can be predicted statically, from the branch type and program profile, or dynamically, from recent branch history.
Static Branch Strategy
The probability of a branch being taken for a particular branch type can be used to predict the branch. The probability may be obtained by collecting the frequencies of branch taken and of branch types across a large number of program traces.
Dynamic Branch Strategy
Uses limited recent branch history to predict whether or not branch will be taken when it occurs next time
Branch Prediction Internals
Branch prediction buffer
Used to store branch history information in order to make the branch prediction
State transition diagram used in dynamic branch prediction
Source: Kai Hwang
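As a concrete sketch of dynamic prediction, here is the common 2-bit saturating-counter automaton; whether this matches the exact diagram in the figure is an assumption:

```python
class TwoBitPredictor:
    # States 0,1 predict "not taken"; states 2,3 predict "taken".
    def __init__(self):
        self.state = 2                       # start in weakly "taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Step one state toward the actual outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in (True, True, False, True):    # observed branch outcomes
    print(p.predict(), end=" ")              # a single False is absorbed
    p.update(outcome)
```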
Delayed Branches
The branch penalty can be reduced by the concept of the delayed branch. The central idea is to delay the effect of the branch instruction so as to accommodate independent* instructions. Delaying by d cycles allows a few useful (independent*) instructions to be executed before the branch takes effect.
* The execution of these instructions must be independent of the outcome of the branch instruction.
Linear Pipeline Processors
A linear pipeline processor is constructed with k processing stages S1, ..., Sk. These stages are linearly connected to perform a specific function. The data stream flows from one end of the pipeline to the other: external inputs are fed into S1, final results move out from Sk, and intermediate results pass from Si to Si+1. Linear pipelining is applied to:
- Instruction execution
- Arithmetic computation
- Memory access operations
Asynchronous Model
Data flow between adjacent stages is controlled by handshaking protocol
When stage Si is ready to transmit, it sends a ready signal to stage Si+1. This is followed by the actual data transfer. After stage Si+1 receives the data, it returns an acknowledge signal to Si.
Source: Kai Hwang
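A toy model of this handshake, with queues standing in for the ready/data and acknowledge wires; the stage behaviour is a sketch, not a hardware-accurate model:

```python
import threading, queue

data_ch = queue.Queue(maxsize=1)   # Si -> Si+1: ready signal + data
ack_ch  = queue.Queue(maxsize=1)   # Si+1 -> Si: acknowledge signal

def stage_i(items):
    for x in items:
        data_ch.put(x)             # assert "ready" and transfer the data
        ack_ch.get()               # block until Si+1 acknowledges

def stage_i_plus_1(n):
    for _ in range(n):
        x = data_ch.get()          # receive the data
        print("Si+1 received", x)
        ack_ch.put(True)           # return the acknowledge

t = threading.Thread(target=stage_i_plus_1, args=(3,))
t.start()
stage_i([10, 20, 30])
t.join()
```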
Contd
Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers. They have a variable throughput rate, since different amounts of delay may be experienced in different stages.
Synchronous Model
Clocked latches are used to interface between stages
Latches are master-slave flip-flops that isolate inputs from outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage at the same time.
Pipeline stages are combinational circuits.
Source: Kai Hwang
Contd
It is desirable to have equal delays in all stages. These delays determine the clock period and thus the speed of the pipeline.
Reservation Table
It specifies the utilization pattern of successive stages in a synchronous pipeline. It is a space-time graph depicting the precedence relationship in the use of the pipeline stages. For a k-stage linear pipeline, k clock cycles are needed for data to flow through the pipeline.
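A minimal sketch of such a table for an assumed 4-stage linear pipeline, with rows as stages and columns as clock cycles:

```python
k = 4                                  # pipeline stages (assumed)
table = [["X" if col == row else "." for col in range(k)] for row in range(k)]
for row in table:
    print(" ".join(row))
# The diagonal of X's shows each stage used once per task and a
# flow-through time of k = 4 clock cycles.
```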
Clocking and Timing Control
Clock cycle and throughput:
The clock cycle time (t) of a pipeline is given by t = tm + d, where tm denotes the maximum stage delay and d denotes the latch delay. The pipeline frequency f = 1/t is referred to as the throughput of the pipeline.
Clock skewing:
Ideally, clock pulses should arrive at all stages at the same time, but due to clock skewing, the same clock pulse may arrive at different stages with an offset of s. Further, let tmax be the time delay of the longest logic path within a stage and tmin that of the shortest logic path within a stage; the clock period must then satisfy d + tmax + s <= t <= tm + tmin - s.
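With some assumed delays, this constraint bounds the admissible clock period as follows:

```python
d, s = 1.0, 0.5                  # latch delay, clock skew (ns) -- assumptions
tm = 10.0                        # maximum stage delay (ns)
tmax, tmin = 9.0, 5.0            # longest / shortest logic paths (ns)

lower = d + tmax + s             # 10.5 ns
upper = tm + tmin - s            # 14.5 ns
print(f"choose t in [{lower}, {upper}] ns; nominal t = tm + d = {tm + d} ns")
```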
Speedup
Case 1: Pipelined processor
Ideally, the number of clock cycles required by a k-stage pipeline to process n tasks is Np = k + (n-1): k clock cycles for the first task and 1 clock cycle for each of the remaining n-1 tasks. The total time required is Tk = (k + (n-1))t.
Case 2: Non-pipelined processor
An equivalent non-pipelined processor would take T1 = nkt.
Speedup factor: the speedup Sk of a k-stage pipeline over an equivalent non-pipelined processor is:
Sk = T1 / Tk = nkt / ((k + (n-1))t) = nk / (k + n - 1)
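A quick sketch of the formula shows Sk approaching k as n grows (k = 4 is an arbitrary choice here):

```python
def speedup(k, n):
    return (n * k) / (k + n - 1)

for n in (1, 10, 100, 10_000):
    print(n, round(speedup(4, n), 3))   # tends to k = 4 for large n
```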
Optimal number of stages:
Most pipelining is staged at the functional level with 2 <= k <= 15. Very few pipelines are designed to exceed 10 stages in real computers. The optimal choice of the number of pipeline stages should maximize the performance/cost ratio for the target processing load:
PCR = f / (c + kh) = 1 / ((t/k + d)(c + kh)), where f = 1/(t/k + d)
The value of k maximizing the PCR is
k0 = sqrt(tc / (dh))
where t is the total flow-through delay of the pipeline, c is the total stage cost, d is the latch delay, and h is the cost of each latch.
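A worked example of the optimal stage count, with assumed delay and cost figures:

```python
from math import sqrt

t, c = 18.0, 8.0      # total flow-through delay (ns), total stage cost
d, h = 1.0, 4.0       # latch delay (ns), cost per latch

k0 = sqrt(t * c / (d * h))
print(f"optimal number of stages k0 = {k0:.0f}")   # sqrt(36) = 6
```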
Efficiency & Throughput
Efficiency: defined as the speedup factor divided by the number of stages: Ek = Sk / k = n / (k + (n-1))
Pipeline throughput: defined as the number of tasks completed per unit time: Hk = n / ((k + (n-1))t) = nf / (k + (n-1))
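A small sketch computing both measures; the values of k, n, and t are assumptions:

```python
def efficiency(k, n):
    return n / (k + n - 1)

def throughput(k, n, t):
    return n / ((k + n - 1) * t)            # tasks per second

k, n, t = 4, 100, 1e-9                      # 4 stages, 100 tasks, 1 ns cycle
print(f"Ek = {efficiency(k, n):.3f}")       # ~0.971
print(f"Hk = {throughput(k, n, t):.3e} tasks/s")
```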