GPU Architecture and Programming NPTEL Week 1 Assignment
GPU Architecture and Programming NPTEL Week 1 Assignment
EL
GPU Architectures and Programming
PT
CSE, IIT Kharagpur
December 5, 2019
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
I Fifteen years ago, Graphics on a PC were performed by a video graphics array
(VGA) controller.
I VGAs evolved to more complex hardwares : accelerating graphics functions
PT
I Early GPUs and their associated drivers implemented the OpenGL and DirectX
models (APIs) of graphics processing.
I With time, HW functionality evolved as programmable SW
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
CPU
FRONT SIDE BUS
NORTH
MEMORY
BRIDGE
PCI BUS
PT
SOUTH VGA FRAMEBUFFER
BRIDGE CONTROLLER MEMORY
VGA
LAN UART DISPLAY
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
Figure: Historical PC. - Hennessy and Patterson "Computer Organization and Design" -
IND
19 5 1
(Figure reproduced)
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
Input Assembler Clip/Setup/Raster/ZCull Compute Work HD Video Processor SM
Distribution
Vertex Work Distribution Pixel Work Distribution I-Cache
MT Issue
TPC TPC TPC
C-Cache
SM SM SM SM SM SM SP SP
….. SP SP
PT
SP SP SP SP SP SP SP SP SP SP SP SP
SP SP SP SP SP SP SP SP SP SP SP SP
SP SP SP SP SP SP SP SP SP SP SP SP
SP SP SP SP SP SP SP SP SP SP SP SP SP SP
SP SP
Texture Unit Texture Unit Texture Unit
Tex L1 Tex L1 Tex L1
SFU SFU
Interconnection Network
GY
ITU
IAN INST
KH
ARAGPUR
IND
Figure: GPU Architecture - Hennessy Patterson (Figure reproduced)
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Course Organization
EL
Topic Week Hours
Review of basic COA w.r.t. performance 1 2
Intro to GPU architectures 2 3
Intro to CUDA programming 3 2
Multi-dimensional data and synchronization 4 2
PT
Warp Scheduling and Divergence 5 2
Memory Access Coalescing 6 2
Optimizing Reduction Kernels 7 3
Kernel Fusion, Thread and Block Coarsening 8 3
OpenCL - runtime system 9 3
N
OpenCL - heterogeneous computing 10 2 TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
Section 1
PT
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
I The operation of a processor is characterized by a fetch⇒ decode⇒execute cycle.
I RISC n CISC ⇒ two different philosophies of computing hardware design
I RISC/CISC - Reduced/Complex Instruction Set Computing
PT
I CISC approach - complete a task with as few instructions (instrs) as possible
I A CISC instruction : MUL addr1 addr2 addr3
I Equivalent RISC : LOAD R2 addr2 ; LOAD R3 addr3 ; MUL R1 R2 R3; STORE
addr1 R1
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
CISC vs RISC
EL
CISC features
RISC features
I Older ISA
I Ideas emerged in 1980s
I Multi-cycle instructions, HW intensive
I Single-cycle instructions, SW intensive
design
PT
design
I Efficient RAM usage
I Heavy RAM usage, Large Register file
I Instructions - complex and variable
I Small no. of simple fixed length
length, lots of them
instructions
I Micro-code support
I Less no. of addressing modes
N
I Compound addressing modes OF
TECHNO
LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
PCSrc
4 M
U
X
ADD
Shift
ADD left 2
ALU operation
MemRead
R−Reg−1
Data−1
PT
MemToReg
PC Address Address Data
Instruction Zero
R−Reg−2
ALUSrc
Instruction Registers Data Memory M
W−Reg Result U
Data−2 M X
U Write data
Instruction
Memory Data X
MemWrite
RegWrite
Sign
Extend
N
I The datapath ‘fetches’ instruction, ‘decodes’ and ‘executes’ it
I Control logic generates suitable activation signals TE
OF
TECHNO
LO
GY
ITU
IAN INST
KH
ARAGPUR
IND
I Executes different instructions with variable delays
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
I The choice of clock rate is limited by the instruction with maximum delay
I Options : choose the clock period more than latency of ‘slowest’ instruction or,
I choose variable periods for diff instructions – not practical !
PT
I Alternate possibility - break the instruction execution cycle into a series of basic
steps
I Basic steps have less delay, choose a fast clock and use it to execute one basic step
at a time
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Multi-cycle instructions
EL
A basic stage represents one of the following states in the execution of an instruction
I Fetch (IF): IR ⇐ Memory[PC]; PC=PC+4
I Decode (ID): Understand instruction semantics
PT
I Execute (EX): based on instruction type
I Arithmetic/logical operation, Mem address / Branch condition computation
I Memory (MEM): For load/store Instr, read/write data from/to memory
I Writeback (WB): Update register file
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Pipelining
EL
I Operate IF→ID→EX→MEM→WB in parallel for a sequence of instructions
I Every basic stage is always processing some instruction
PT
I In every clock cycle, one instruction completes - ideal scenario
I Practical issues - pipeline hazards
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Structural hazard
EL
I Consider a sequence of 4 lw (load-word) instructions
I When the first instruction fetches data from memory, the fourth instruction itself is
PT
to be fetched from memory
I This is structural hazard as the pipeline needs to stall due to lack of resources, if
the hardware cannot support multiple reads in parallel
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
I sub $2, $1, $3; and $12, $2, $5 Read after Write (RAW)
I if ‘sub’ is in IF stage in i + 1-th clock cycle, $2 is updated in (i + 5)-th cycle
I ‘and’ is in EX stage in i + 4-th cycle, updated value of $2 is not yet ready
PT
I Solution : ‘sub’ computes the value for $2 in (i + 3)-th stage,
I this may be forwarded directly to execution of ‘and’
I need suitable logic to detect hazard and forwarding requirement
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Control hazards
EL
I Branch decisions : the branch condition needs evaluation (beq $1, $2, offset)
I The branch decision is inferred only in MEM stage
I Optimization : assume branch not taken, operate pipeline normally,
PT
I Execute branch when decision is evaluated as true (taken) and flush intermediate
instructions from pipeline
I Sophisticated schemes : use branch prediction HW (predict a branch decision
based on branch history table content)
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
Section 2
PT
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Multi-level Arrangement
EL
PT
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Principle of locality
EL
I Temporal locality : If an item is referenced, it will tend to be referenced again soon
I Spatial locality : If an item is referenced, items at nearby addresses will be
referenced soon
PT
I Hence, computer memory is hierarchically organized
I Register file provides fastest access,
I Cache memory uses (fast) SRAM (static random access memory)
I Main memory uses (slow) DRAM (dynamic random access memory) : is less costly
per bit than SRAM
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Cache Mapping
EL
Cache
011
000
010
100
110
101
001
111
PT
000001 000101 001001 001101
Memory
…
GY
ITU
IAN INST
KH
ARAGPUR
IND
I Block = minimum unit of information that can be either present or not present
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Cache Blocks
EL
I With larger blocks we have lower miss rates due to spatial locality, large blocks
lead to large miss penalty
I Nothing is free : with very big block sizes, we have too small no of blocks in
PT
cache, eventually the miss rate goes up
I Handling Cache Miss:
I Send the PC value (current PC – 4) to the memory
I Read access from main memory, write updated cache entry
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
I Handling consistency : always write the data into both the memory and the cache
(write-through)
I Conservative policy, slows things down
I Use write buffer to perform writes only when buffer is full. Buffer size can be
PT
decided by memory speed
I Alternative policy write-back : Writes are updated only in cache. Main memory is
update only during cache block replacement
I Write-back offers better performance in case of frequent writes, is more complex to
N
implement
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Memory System
EL
I Memory chips are designed to read/write more than one word in parallel (hiding
latency)
I Use a wide bus - allow parallel access to all words in a block
PT
I OR - keep bus of standard width (= memory word length = register size) and
connect bus with multiple memory units in parallel (memory banks)
I WHY ? bus transmission is fast, memory read/write is slow
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
I Fully associative: a block can be placed in any location in the cache. (Large HW
requirement for fast parallel search)
I Practical only for cache with small number of blocks
I Optimizing in the middle : set associative cache
PT
I An n-way set-associative cache consists of a number of sets, each of which consists
of n blocks.
I Set number = (Memory Block number) modulo (Number of sets in the cache)
I Inside a set, all the tags of all the elements must be searched
N
I Increasing associativity decreases miss rate up to a point, but increases hit time
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
I In direct mapped cache, a new block can go to exactly one location
I In fully associative cache, a new block can potentially replace any existing block -
how to resolve ?
PT
I In set associative cache, a new block can potentially replace any existing block
inside a matching set - how to resolve ?
I Least Recently Used (LRU) policy - The block replaced is the one that has been
unused for the longest time.
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
Section 3
PT
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
Pipeline Cycles per instruction (CPI) = Ideal pipeline CPI + Structural stalls + Data
hazard stalls + Control stalls
I Handling hazards require both architectural and compiler techniques
I Data hazard types while executing instruction i followed by j in a pipeline
PT
I RAW — j tries to read a source before i writes it, so j incorrectly gets the old value
I WAW — j tries to write an operand before it is written by i. Will not happen in
simple RISC, but in pipelines that write in more than one basic stage or allow an
instruction to proceed even when a previous instruction is stalled
I WAR - j tries to write a destination before it is read by i, can happen in case
instructions are reordered
N
I RAR - not a hazard
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
To keep a pipeline full, a compiler can find sequences of unrelated instructions that can
be overlapped
for (i=100; i>=0; i=i–1)
x[i] = x[i] + s;
Unoptimized MIPs
PT
Loop:
L.D F0,0(R1) ;F0=array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
N
DADDUI R1,R1,#-8 ;decrement pointer //loop overhead
;8 bytes (per DW) TE
OF
TECHNO
LO
GY
ITU
IAN INST
KH
ARAGPUR
IND
BNE R1,R2,Loop ;branch R1!=R2 //branch decision
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Unrolling: eliminated three branches and decrements of R1 (Hen Pat etl. al.)
EL
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
L.D F6,-8(R1)
ADD.D F8,F6,F2
PT
S.D F8,-8(R1) //Code size increase - more instr cache miss
L.D F10,-16(R1) //more no. of live values - increased register pressure
ADD.D F12,F10,F2
S.D F12,-16(R1)
L.D F14,-24(R1)
ADD.D F16,F14,F2
N
S.D F16,-24(R1) TE
OF
TECHNO
LO
GY
ITU
IAN INST
KH
ARAGPUR
DADDUI R1,R1,#-32
IND
19 5 1
BNE R1,R2,Loop
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
General single level predictor with 2-bit saturating counter
PT
strongly weakly weakly strongly
not taken not taken taken taken
not taken taken
I conditional jump has to deviate twice from past before the prediction changes.
N
I Consider a sequence of altering decisions in a loop and calculate performance TE
OF
TECHNO
LO
GY
ITU
IAN INST
KH
ARAGPUR
improvement over 1-bit saturating counter !!!!
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Hierarchical Prediction
EL
How about generalizing the idea of prediction with larger branch histories.
I store m length history of a branch - 2m possibilities
PT
I for each possibility use an n-bit predictor : (m, n) prediction scheme
I a two-level predictor with m-bit history can predict any repetitive sequence with
any period if all m-bit sub-sequences are different.
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
I Simple pipelines execute instructions in-order
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
PT
I SUB.D suffers as ADD.D stalls due to dependence
I different ordering will avoid stall in this case
I Out of order execution brings in the possibility of WAR and WAW hazards
Robert Tomasulo: developed algorithm to minimize WAW and WAR hazards while
N
allowing out of order execution (tracks when operands for instructions are available to
minimize RAW hazards and uses register renaming to minimize WAW and WAR). TE
OF
TECHNO
LO
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
Register Renaming
EL
DIV F0,F2,F4
ADD F6,F0,F8 //(RAW for DIV : F0) DIV F0,F2,F4
S F6,0(R1) //(RAW for ADD : F6) ADD S,F0,F8
SUB F8,F10,F14 //(WAR for ADD : F8) S S,0(R1)
PT
MUL F6,F10,F8 // (WAR for S, WAW for ADD) SUB T,F10,F14
//(RAW for SUB : F8) MUL F6,F10,T
GY
ITU
⇒
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)
EL
Multiple-issue processors - allow multiple instructions to be issued in a clock cycle
I VLIW (very long instruction word) - Parallel instructions statically scheduled by
compiler; issue a fixed number of instructions formatted as one large instruction
I Statically scheduled superscalar - issue a varying rather than a fixed number of
PT
instructions (compiler decided) per clock, in-order execution
I Dynamically scheduled superscalar - issue a varying rather than a fixed number of
instructions (hardware decided) per clock, out-of-order execution
For large issue width VLIW (with multiple independent FUs) is preferred w.r.t.
statically scheduled superscalar
N
TECHNO
OF LO
TE
GY
ITU
IAN INST
KH
ARAGPUR
IND
19 5 1
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur