GPU Architecture and Programming NPTEL Week 1

The document covers GPU architectures and programming, highlighting the evolution from VGA controllers to complex, programmable GPUs. It reviews the classic 5-stage RISC pipeline, the memory hierarchy, and instruction-level parallelism (ILP), and outlines the course organization, including topics such as CUDA programming and efficient neural network training.


GPU Architectures and Programming

Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur

December 5, 2019

Outline: The classic 5-stage RISC pipeline; The Memory Hierarchy; Instruction Level Parallelism (ILP)
GPU evolution

- Fifteen years ago, graphics on a PC was performed by a video graphics array (VGA) controller.
- VGA controllers evolved into more complex hardware that accelerates graphics functions.
- Early GPUs and their associated drivers implemented the OpenGL and DirectX models (APIs) of graphics processing.
- With time, fixed hardware functionality evolved into programmable software pipelines.
Figure: Historical PC - the CPU connects over the front side bus to the north bridge (which attaches main memory); the PCI bus and south bridge connect the VGA controller with its framebuffer memory, plus LAN and UART, driving the VGA display. (Reproduced from Hennessy and Patterson, "Computer Organization and Design".)
Figure: GPU architecture - the host CPU and system memory connect through a bridge to the GPU's host interface; the input assembler, vertex/pixel/compute work distribution and an HD video processor feed an array of TPCs, each containing streaming multiprocessors (SMs) with an I-cache, MT issue unit, C-cache, streaming processors (SPs), SFUs and shared memory, plus texture units with L1 texture caches; an interconnection network links the SMs to ROP/L2 partitions, DRAM channels and the display interface. (Reproduced from Hennessy and Patterson.)
Course Organization

Topic                                           Week    Hours
Review of basic COA w.r.t. performance          1       2
Intro to GPU architectures                      2       3
Intro to CUDA programming                       3       2
Multi-dimensional data and synchronization      4       2
Warp Scheduling and Divergence                  5       2
Memory Access Coalescing                        6       2
Optimizing Reduction Kernels                    7       3
Kernel Fusion, Thread and Block Coarsening      8       3
OpenCL - runtime system                         9       3
OpenCL - heterogeneous computing                10      2
Efficient Neural Network Training/Inferencing   11-12   6
Section 1: The classic 5-stage RISC pipeline
Basic RISC architecture

- The operation of a processor is characterized by a fetch ⇒ decode ⇒ execute cycle.
- RISC and CISC ⇒ two different philosophies of computing hardware design.
- RISC/CISC - Reduced/Complex Instruction Set Computing.
- CISC approach: complete a task with as few instructions (instrs) as possible.
- A CISC instruction: MUL addr1 addr2 addr3
- Equivalent RISC sequence: LOAD R2 addr2; LOAD R3 addr3; MUL R1 R2 R3; STORE addr1 R1
CISC vs RISC

CISC features
- Older ISA
- Multi-cycle instructions, hardware-intensive design
- Efficient RAM usage
- Complex, variable-length instructions, and lots of them
- Micro-code support
- Compound addressing modes

RISC features
- Ideas emerged in the 1980s
- Single-cycle instructions, software-intensive design
- Heavy RAM usage, large register file
- Small number of simple, fixed-length instructions
- Fewer addressing modes
Elementary CPU Datapath

Figure: elementary single-cycle datapath - PC, instruction memory, register file, ALU, data memory, sign-extend unit, adders and multiplexers, steered by control signals (PCSrc, ALUSrc, ALU operation, MemRead, MemWrite, MemToReg, RegWrite).

- The datapath 'fetches' an instruction, 'decodes' it, and 'executes' it.
- Control logic generates the suitable activation signals.
- Different instructions execute with variable delays.
Single cycle implementation of datapath

- The choice of clock rate is limited by the instruction with the maximum delay.
- Options: choose a clock period longer than the latency of the 'slowest' instruction, or
- choose variable periods for different instructions - not practical!
- Alternate possibility: break the instruction execution cycle into a series of basic steps.
- Basic steps have less delay; choose a fast clock and use it to execute one basic step at a time.
Multi-cycle instructions

A basic stage represents one of the following states in the execution of an instruction:
- Fetch (IF): IR ⇐ Memory[PC]; PC = PC + 4
- Decode (ID): understand instruction semantics
- Execute (EX): based on instruction type - arithmetic/logical operation, memory address or branch condition computation
- Memory (MEM): for load/store instructions, read/write data from/to memory
- Writeback (WB): update the register file
Pipelining

- Operate IF → ID → EX → MEM → WB in parallel for a sequence of instructions.
- Every basic stage is always processing some instruction.
- In every clock cycle, one instruction completes - the ideal scenario.
- Practical issues: pipeline hazards.
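As a rough illustration of the ideal scenario above (a minimal sketch, not part of the lecture), the C program below prints which stage each instruction occupies in every cycle of a hazard-free 5-stage pipeline; once the pipeline fills, one instruction completes per clock.

    #include <stdio.h>

    /* Ideal 5-stage pipeline: instruction i enters IF in cycle i+1 and
     * occupies stage s (0=IF..4=WB) in cycle i+1+s. No hazards assumed. */
    int main(void) {
        const char *stage[5] = {"IF", "ID", "EX", "MEM", "WB"};
        const int n_instr = 4, n_cycles = n_instr + 4;

        printf("cycle:");
        for (int c = 1; c <= n_cycles; c++) printf("%5d", c);
        printf("\n");

        for (int i = 0; i < n_instr; i++) {
            printf("I%d   :", i + 1);
            for (int c = 1; c <= n_cycles; c++) {
                int s = c - 1 - i;              /* stage index this cycle */
                printf("%5s", (s >= 0 && s < 5) ? stage[s] : ".");
            }
            printf("\n");
        }
        return 0;
    }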
Structural hazard

- Consider a sequence of 4 lw (load-word) instructions.
- When the first instruction reads its data from memory, the fourth instruction itself is to be fetched from memory.
- This is a structural hazard: if the hardware cannot support multiple memory reads in parallel, the pipeline needs to stall due to lack of resources.
Data Hazard: MIPS example

- sub $2, $1, $3; and $12, $2, $5 - a Read after Write (RAW) dependence on $2.
- If 'sub' is in the IF stage in the (i + 1)-th clock cycle, $2 is updated (written back) in the (i + 5)-th cycle.
- 'and' is in the EX stage in the (i + 4)-th cycle, when the updated value of $2 is not yet ready.
- Solution: 'sub' already computes the value destined for $2 in the (i + 3)-th cycle (its EX stage);
- this value may be forwarded directly to the execution of 'and';
- suitable logic is needed to detect the hazard and the forwarding requirement.
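The detection logic mentioned in the last bullet can be sketched as below; this is a simplified, hypothetical check (struct and field names are assumptions, not from the slides) in the spirit of the classic MIPS forwarding conditions: forward when a pipeline register downstream is about to write the register that the instruction now in EX wants to read.

    #include <stdbool.h>
    #include <stdio.h>

    /* Simplified pipeline-register snapshots (hypothetical field names). */
    struct ExMem { bool reg_write; int rd; };   /* instr that just left EX  */
    struct MemWb { bool reg_write; int rd; };   /* instr that just left MEM */

    /* True if the value for source register 'src' of the instruction now in
     * EX must be forwarded rather than read from the register file. */
    bool needs_forwarding(struct ExMem ex_mem, struct MemWb mem_wb, int src) {
        if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == src)
            return true;                        /* forward from EX/MEM */
        if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == src)
            return true;                        /* forward from MEM/WB */
        return false;
    }

    int main(void) {
        /* sub $2,$1,$3 has just left EX; and $12,$2,$5 is now in EX. */
        struct ExMem sub  = { true, 2 };
        struct MemWb none = { false, 0 };
        printf("forward $2 to 'and'? %s\n",
               needs_forwarding(sub, none, 2) ? "yes" : "no");
        return 0;
    }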
Control hazards

- Branch decisions: the branch condition needs evaluation (beq $1, $2, offset).
- The branch decision is inferred only in the MEM stage.
- Optimization: assume the branch is not taken and operate the pipeline normally;
- when the decision is evaluated as taken, execute the branch and flush the intermediate instructions from the pipeline.
- Sophisticated schemes: use branch prediction hardware (predict a branch decision based on branch history table content).
Section 2: The Memory Hierarchy
Multi-level Arrangement

Figure: the multi-level memory hierarchy - levels nearer to the CPU are faster.
Principle of locality

- Temporal locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial locality: if an item is referenced, items at nearby addresses will tend to be referenced soon.
- Hence, computer memory is hierarchically organized:
- the register file provides the fastest access;
- cache memory uses (fast) SRAM (static random access memory);
- main memory uses (slow) DRAM (dynamic random access memory), which is less costly per bit than SRAM.
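A small illustration of spatial locality (not from the slides; the 1024 x 1024 array of doubles is an assumption): the row-major traversal below walks memory sequentially and reuses each cache block, while the column-major version jumps a whole row between accesses and typically misses far more often.

    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    /* Row-major: consecutive accesses touch adjacent addresses (good spatial locality). */
    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major: consecutive accesses are N*sizeof(double) bytes apart,
     * so each access may land in a different cache block. */
    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }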
Cache Mapping

Figure: direct-mapped cache with 8 blocks (indices 000-111); each memory block maps to the cache block given by its block address modulo 8.

- Direct mapped: cache block address = (memory block address) modulo (number of cache blocks in the cache).
- Block = minimum unit of information that can be either present or not present in the cache.
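A minimal sketch of the direct-mapped rule above (the 64-byte block size, 256-block cache and sample address are assumptions): the byte address is reduced to a memory block number, whose low bits give the cache index and whose remaining high bits form the tag.

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE  64u    /* bytes per block (assumed)     */
    #define NUM_BLOCKS 256u    /* blocks in the cache (assumed) */

    int main(void) {
        uint32_t addr = 0x12345678u;

        uint32_t block_addr = addr / BLOCK_SIZE;        /* memory block number   */
        uint32_t index      = block_addr % NUM_BLOCKS;  /* direct-mapped slot    */
        uint32_t tag        = block_addr / NUM_BLOCKS;  /* identifies the block  */

        printf("addr=0x%08x -> block=%u, index=%u, tag=0x%x\n",
               addr, block_addr, index, tag);
        return 0;
    }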
Cache Blocks

- With larger blocks we get lower miss rates due to spatial locality, but large blocks also lead to a larger miss penalty.
- Nothing is free: with very big block sizes we have too few blocks in the cache, and eventually the miss rate goes up.
- Handling a cache miss (instruction fetch):
- send the PC value (current PC - 4) to the memory;
- read from main memory, then write the updated cache entry.
Cache write policy

- Handling consistency: always write the data into both the memory and the cache (write-through).
- This conservative policy slows things down.
- Use a write buffer: memory writes are performed from the buffer, and the processor stalls only when the buffer is full. The buffer size can be decided by the memory speed.
- Alternative policy, write-back: writes update only the cache, and main memory is updated only during cache block replacement.
- Write-back offers better performance in case of frequent writes, but is more complex to implement.
Memory System

- Memory chips are designed to read/write more than one word in parallel (hiding latency).
- Use a wide bus to allow parallel access to all the words in a block,
- or keep the bus at standard width (= memory word length = register size) and connect it to multiple memory units operating in parallel (memory banks).
- Why? Bus transmission is fast; memory read/write is slow.
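A hedged sketch of word interleaving across banks (the 4-bank, 4-byte-word layout is an assumption): consecutive words of a block map to different banks, so several banks can service the block transfer in parallel while the bus stays at standard width.

    #include <stdint.h>
    #include <stdio.h>

    #define WORD_SIZE 4u   /* bytes per word (assumed)  */
    #define NUM_BANKS 4u   /* parallel banks (assumed)  */

    int main(void) {
        /* Consecutive words land in banks 0,1,2,3,0,1,... */
        for (uint32_t addr = 0x1000; addr < 0x1000 + 8 * WORD_SIZE; addr += WORD_SIZE) {
            uint32_t word = addr / WORD_SIZE;
            printf("addr 0x%04x -> bank %u, row %u\n",
                   addr, word % NUM_BANKS, word / NUM_BANKS);
        }
        return 0;
    }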
Cache Mapping: alternate schemes

- Fully associative: a block can be placed in any location in the cache (large hardware requirement for fast parallel search).
- Practical only for caches with a small number of blocks.
- Optimizing in the middle: the set-associative cache.
- An n-way set-associative cache consists of a number of sets, each of which consists of n blocks.
- Set number = (memory block number) modulo (number of sets in the cache).
- Inside a set, the tags of all the elements must be searched.
- Increasing associativity decreases the miss rate up to a point, but increases the hit time.
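As a small worked instance of the set-number formula above (the cache size is assumed for illustration): a 2-way set-associative cache with 8 blocks has 8 / 2 = 4 sets, so memory block 13 maps to set 13 mod 4 = 1 and may occupy either of that set's two blocks; on a lookup, both tags in set 1 must be compared.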
Cache replacement policy

- In a direct-mapped cache, a new block can go to exactly one location.
- In a fully associative cache, a new block can potentially replace any existing block - how to resolve?
- In a set-associative cache, a new block can potentially replace any existing block inside the matching set - how to resolve?
- Least Recently Used (LRU) policy: the block replaced is the one that has been unused for the longest time.
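A compact sketch of LRU bookkeeping for one set (the 4-way size and timestamp scheme are assumptions, not the lecture's design): on a hit the block's last-use time is refreshed; on a miss an empty way is filled if available, otherwise the way with the oldest last-use time is evicted.

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4   /* blocks per set (assumed) */

    struct line { bool valid; uint32_t tag; uint64_t last_use; };

    /* Look up 'tag' in one set of a set-associative cache; 'now' is a
     * running access counter. Returns true on a hit. On a miss, an invalid
     * way is filled if available, otherwise the LRU way is replaced. */
    bool access_set(struct line set[WAYS], uint32_t tag, uint64_t now) {
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == tag) {  /* hit: refresh LRU info */
                set[w].last_use = now;
                return true;
            }
        }
        int victim = 0;                               /* miss: pick a victim   */
        for (int w = 1; w < WAYS; w++) {
            if (!set[victim].valid) break;            /* empty way already found */
            if (!set[w].valid || set[w].last_use < set[victim].last_use)
                victim = w;
        }
        set[victim].valid = true;
        set[victim].tag = tag;
        set[victim].last_use = now;
        return false;
    }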
Section 3: Instruction Level Parallelism (ILP)
Actual Pipeline CPI

Pipeline cycles per instruction (CPI) = ideal pipeline CPI + structural stalls + data hazard stalls + control stalls

- Handling hazards requires both architectural and compiler techniques.
- Data hazard types while executing instruction i followed by j in a pipeline:
- RAW - j tries to read a source before i writes it, so j incorrectly gets the old value.
- WAW - j tries to write an operand before it is written by i. This does not happen in the simple RISC pipeline, but it can in pipelines that write in more than one stage or allow an instruction to proceed even when a previous instruction is stalled.
- WAR - j tries to write a destination before it is read by i; can happen when instructions are reordered.
- RAR - not a hazard.
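For a concrete feel of the formula (the stall rates below are assumed for illustration, not measured): with an ideal CPI of 1 and average stalls of 0.05 (structural) + 0.10 (data hazard) + 0.05 (control) cycles per instruction, the actual CPI is 1 + 0.05 + 0.10 + 0.05 = 1.20, i.e. the pipeline runs about 20% slower than ideal.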
Compiler Techniques for ILP

To keep a pipeline full, a compiler can find sequences of unrelated instructions that can be overlapped.

for (i = 100; i >= 0; i = i - 1)
    x[i] = x[i] + s;

Unoptimized MIPS:

Loop:  L.D     F0,0(R1)      ;F0 = array element
       ADD.D   F4,F0,F2      ;add scalar in F2
       S.D     F4,0(R1)      ;store result
       DADDUI  R1,R1,#-8     ;decrement pointer by 8 bytes (per DW)  //loop overhead
       BNE     R1,R2,Loop    ;branch if R1 != R2                     //branch decision
Unrolling: eliminates three branches and three decrements of R1 (Hennessy and Patterson)

Loop:  L.D     F0,0(R1)
       ADD.D   F4,F0,F2
       S.D     F4,0(R1)
       L.D     F6,-8(R1)
       ADD.D   F8,F6,F2
       S.D     F8,-8(R1)      //code size increase - more instruction cache misses
       L.D     F10,-16(R1)    //more live values - increased register pressure
       ADD.D   F12,F10,F2
       S.D     F12,-16(R1)
       L.D     F14,-24(R1)
       ADD.D   F16,F14,F2
       S.D     F16,-24(R1)
       DADDUI  R1,R1,#-32
       BNE     R1,R2,Loop
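The same transformation at the C level (a sketch of what the unrolled MIPS above corresponds to; the function name add_scalar_unrolled is hypothetical): four array elements are processed per iteration, so the decrement and branch execute once instead of four times, with a remainder loop for leftover iterations.

    /* Original loop (from the slide):
     *   for (i = 100; i >= 0; i = i - 1)
     *       x[i] = x[i] + s;
     * Unrolled by 4 (illustrative sketch). */
    void add_scalar_unrolled(double *x, double s, int n) {
        int i = n - 1;
        for (; i >= 3; i -= 4) {          /* one branch per 4 elements */
            x[i]     += s;
            x[i - 1] += s;
            x[i - 2] += s;
            x[i - 3] += s;
        }
        for (; i >= 0; i--)               /* remainder loop */
            x[i] += s;
    }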
Branch Prediction assisted ILP

General single-level predictor with a 2-bit saturating counter:

Figure: four-state machine - strongly not taken, weakly not taken, weakly taken, strongly taken; a 'taken' outcome moves one state toward strongly taken, a 'not taken' outcome moves one state toward strongly not taken, saturating at the ends.

- A conditional jump has to deviate twice from its past behaviour before the prediction changes.
- Exercise: consider a sequence of alternating decisions in a loop and calculate the performance improvement over a 1-bit saturating counter.
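A minimal sketch of the 2-bit saturating counter above (the state encoding and initial state are assumptions): states 0-1 predict not taken, states 2-3 predict taken, and the prediction only flips after two consecutive deviations. Running it on the alternating pattern from the exercise shows how often the counter is right with this start state.

    #include <stdbool.h>
    #include <stdio.h>

    /* 2-bit saturating counter: 0,1 -> predict not taken; 2,3 -> predict taken. */
    static int state = 1;   /* start weakly not taken (assumed initial state) */

    bool predict(void) { return state >= 2; }

    void update(bool taken) {
        if (taken  && state < 3) state++;
        if (!taken && state > 0) state--;
    }

    int main(void) {
        /* Alternating outcome pattern: T, N, T, N, ... */
        bool outcomes[] = {true, false, true, false, true, false};
        int correct = 0, n = 6;
        for (int i = 0; i < n; i++) {
            if (predict() == outcomes[i]) correct++;
            update(outcomes[i]);
        }
        printf("%d/%d correct\n", correct, n);
        return 0;
    }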
Hierarchical Prediction

How about generalizing the idea of prediction with larger branch histories?
- Store an m-length history of a branch: 2^m possibilities.
- For each possibility use an n-bit predictor: the (m, n) prediction scheme.
- A two-level predictor with m-bit history can predict any repetitive sequence with any period if all m-bit sub-sequences are different.
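A sketch of an (m, n) = (2, 2) predictor for a single branch (table size, indexing and initial counter values are assumptions): the last m outcomes form a history register that selects one of 2^m independent 2-bit counters, so a repetitive pattern such as T,T,N,T,T,N,... is learned once each history value has settled, consistent with the claim above.

    #include <stdbool.h>
    #include <stdio.h>

    #define M 2                          /* history bits (assumed)           */
    #define TABLE (1 << M)               /* one 2-bit counter per history    */

    static unsigned history = 0;         /* last M outcomes, newest in bit 0 */
    static int counters[TABLE] = {1, 1, 1, 1};   /* start weakly not taken   */

    static bool predict_two_level(void) {
        return counters[history] >= 2;   /* 2,3 -> predict taken             */
    }

    static void update_two_level(bool taken) {
        int *c = &counters[history];
        if (taken  && *c < 3) (*c)++;    /* saturating counter update        */
        if (!taken && *c > 0) (*c)--;
        history = ((history << 1) | (taken ? 1u : 0u)) & (TABLE - 1);
    }

    int main(void) {
        /* Pattern T,T,N,...: all 2-bit sub-sequences are distinct,
         * so after warm-up the predictor is always right. */
        bool pattern[] = {true, true, false};
        int correct = 0, total = 30;
        for (int i = 0; i < total; i++) {
            bool outcome = pattern[i % 3];
            if (predict_two_level() == outcome) correct++;
            update_two_level(outcome);
        }
        printf("%d/%d correct\n", correct, total);
        return 0;
    }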
Dynamic Scheduling for ILP

- Simple pipelines execute instructions in order:

  DIV.D  F0,F2,F4
  ADD.D  F10,F0,F8
  SUB.D  F12,F8,F14

- SUB.D suffers as ADD.D stalls due to its dependence on F0.
- A different ordering would avoid the stall in this case.
- Out-of-order execution brings in the possibility of WAR and WAW hazards.

Robert Tomasulo developed an algorithm to minimize WAW and WAR hazards while allowing out-of-order execution: it tracks when operands for instructions are available to minimize RAW hazards, and uses register renaming to minimize WAW and WAR hazards.
Register Renaming

Original code:
  DIV F0,F2,F4
  ADD F6,F0,F8    //(RAW with DIV on F0)
  S   F6,0(R1)    //(RAW with ADD on F6)
  SUB F8,F10,F14  //(WAR with ADD on F8)
  MUL F6,F10,F8   //(WAR with S, WAW with ADD; RAW with SUB on F8)

After renaming (S and T are new register names):
  DIV F0,F2,F4
  ADD S,F0,F8
  S   S,0(R1)
  SUB T,F10,F14
  MUL F6,F10,T

- RAW is due to true data dependency and stalls even an in-order pipeline.
- WAR/WAW constrain out-of-order execution.
- Renaming to S removes the WAR and the WAW of MUL; renaming to T removes the WAR of SUB.
ILP Using Multiple Issue and Static Scheduling

Multiple-issue processors allow multiple instructions to be issued in a clock cycle:
- VLIW (very long instruction word): parallel instructions statically scheduled by the compiler; a fixed number of instructions is issued, formatted as one large instruction.
- Statically scheduled superscalar: issues a varying rather than a fixed number of instructions (compiler decided) per clock, with in-order execution.
- Dynamically scheduled superscalar: issues a varying rather than a fixed number of instructions (hardware decided) per clock, with out-of-order execution.

For large issue widths, VLIW (with multiple independent functional units) is preferred over a statically scheduled superscalar.
