CKV
Advanced VLSI Architecture
MEL G624
Lecture 29: Data Level Parallelism
Graphics Processing Units
Basic Idea
Heterogeneous execution model
CPU is the host, GPU is the device
Develop a C-like programming language for the GPU
Compute Unified Device Architecture (CUDA) from NVIDIA
OpenCL is a vendor-independent alternative
Unify all forms of GPU parallelism as the CUDA thread
Programming model is "Single Instruction, Multiple Thread" (SIMT)
NVIDIA GPU Architecture
Similarities to vector machines:
Works well with data‐level parallel problems
Scatter‐gather transfers from memory into local store
Mask registers
Large register files
Differences:
No scalar processor (scalar work is left to the host)
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Example
Multiply two vectors of length 8192
Code that works over all elements is the grid
Thread blocks break this down into manageable sizes
512 elements per block
SIMD instruction executes 32 elements (per thread) at a time
Thus grid size = 16 blocks
Block is analogous to a strip-mined vector loop (vector length = 32 elements)
Block is assigned to a multithreaded SIMD processor by the thread block scheduler
(Figure: the grid for A[i] = B[i] * C[i], i = 0..8191, decomposed into 16 thread blocks.
Block 0 covers A[0..511]: SIMD thread 0 computes A[0]=B[0]*C[0] through A[31]=B[31]*C[31], SIMD thread 1 ends at A[63]=B[63]*C[63], ..., SIMD thread 15 ends at A[511]=B[511]*C[511].
Block 15 covers the last 512 elements, ending with A[8191]=B[8191]*C[8191].)
(Figure: Fermi GTX 480. The Thread Block Scheduler assigns each thread block to a multithreaded SIMD processor (SM); a per-SM thread scheduler then dispatches that block's SIMD threads.)
Terminology
Threads of SIMD instructions
Each has its own PC
Thread scheduler uses a scoreboard to dispatch ready instructions
No data dependences between different SIMD threads!
Keeps track of up to 48 threads of SIMD instructions
Hides memory latency
Thread block scheduler schedules blocks to SIMD processors
Within each SIMD processor:
16 SIMD lanes
Wide and shallow compared to vector processors
Thread Scheduler
GPU Register Example
NVIDIA GPU has 32,768 registers
Divided logically into lanes (2048 32-bit registers per lane)
Each SIMD thread is limited to 64 vector-like registers
SIMD thread has up to:
64 vector registers of 32 32-bit elements, or
32 vector registers of 32 64-bit elements
Each CUDA thread gets one element of each vector register
How many registers do the CUDA threads of a thread block use?
Multi-Threaded SIMD Processor
(Figure: a multithreaded SIMD processor.
Each SIMD thread's instruction works on 32 elements, spread across the SIMD lanes.
The scheduler keeps track of 48 independent SIMD threads.
Each CUDA thread corresponds to one element (one lane position).
2048 32-bit registers per lane are dynamically allocated to SIMD threads.)
Multithreading
(Figure: an SM exploits multithreading, MIMD, SIMD, and ILP together.)
Programming the GPU
Challenges for the GPU programmer:
Getting good performance
Coordinating the scheduling of computation on the system processor and the GPU
Managing the transfer of data between system memory and GPU memory
Programming the GPU
To distinguish between functions for the GPU (device) and the system processor (host), CUDA uses qualifiers:
__device__ or __global__ => GPU (device)
__host__ => system processor (host)
Variables declared in __device__ or __global__ functions are allocated to GPU memory
Function call: name<<<dimGrid, dimBlock>>>(...parameter list...)
threadIdx: identifier for a thread within its block
blockIdx: identifier for a block within the grid
blockDim: number of threads per block
Programming the GPU
//Invoke DAXPY
daxpy(n, 2.0, x, y);

//DAXPY in C
void daxpy(int n, double a, double* x, double* y) {
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

//Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

//DAXPY in CUDA
__global__
void daxpy(int n, double a, double* x, double* y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
Thank You for Attending