Graphics Processing Unit
Parallel and Distributed Computing
Arfan Shahzad
{ [email protected] }
GPU Architecture & Programming
• Graphics Processing Units (GPUs) are specialized hardware designed
for highly parallel computing tasks.
• Unlike CPUs, which have a few powerful cores optimized for
sequential processing, GPUs consist of thousands of smaller cores
that execute tasks in parallel.
GPU vs. CPU Architecture
• GPUs differ from Central Processing Units (CPUs) in terms of architecture
and execution model:
• CPU: Optimized for sequential tasks, with few powerful cores, complex
control logic, and large caches.
• GPU: Optimized for parallelism, with thousands of simple cores executing
multiple threads simultaneously.
Feature            CPU            GPU
Cores              Few (4–64)     Thousands
Execution Model    Sequential     Parallel
Latency            Low            High
Throughput         Low            High
Control Flow       Complex        Simple
Memory Hierarchy   Large caches   Small shared memory
Architectural components
• A typical GPU consists of the following architectural components:
• 1- Streaming Multiprocessors (SMs): A GPU consists of multiple SMs, each
containing numerous CUDA cores (or streaming processors) that perform
computations in parallel.
• 2- CUDA Cores: Individual processing units within SMs, executing
instructions in parallel.
• 3- Memory Hierarchy: GPUs have multiple memory types, including:
• Global Memory: Large but relatively slow, accessible by all threads.
• Shared Memory: Faster, limited in size, shared within a single SM.
• Registers: Fastest memory, private to each thread.
• Texture and Constant Memory: Optimized for specific read patterns and
frequently used data.
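The memory spaces above can be seen side by side in one kernel. This is an illustrative sketch only; the kernel and variable names (memorySpaces, tile, coeff) are hypothetical:

```cuda
__constant__ float coeff[16];          // constant memory: read-only in kernels, cached

__global__ void memorySpaces(const float *in, float *out, int n)
{
    __shared__ float tile[256];        // shared memory: visible to one block, lives on the SM

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // local variable: held in a register
    if (i < n) {
        tile[threadIdx.x] = in[i];     // read from global memory into shared memory
        __syncthreads();               // make the shared data visible block-wide
        out[i] = tile[threadIdx.x] * coeff[0];      // write result back to global memory
    }
}
```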
• 4- Warp-based Execution: Threads are organized into warps (typically 32
threads), which execute instructions in lockstep.
• 5- Memory Controller: Manages access to different memory types and optimizes
bandwidth.
• 6- SIMT Execution Model: GPUs follow the Single Instruction Multiple Thread (SIMT) model, a variant of SIMD (Single Instruction Multiple Data), in which the threads of a warp execute the same instruction on different data.
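The SIMT idea is easiest to see in a minimal kernel: every thread executes the same code, but each operates on the element selected by its own index. A sketch, with a hypothetical vecAdd kernel:

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n)                  // threads past the end of the array do nothing
        c[i] = a[i] + b[i];     // same instruction, different data per thread
}
```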
Programming models
• Programming a GPU involves writing parallel code using specialized
frameworks. The most common models include:
• 1- CUDA (Compute Unified Device Architecture): A parallel computing platform and API from NVIDIA that allows direct programming of GPUs using C/C++. Key features of CUDA include:
• A- CUDA Kernels: Functions executed on the GPU, written in CUDA.
• B- CUDA Threads and Blocks: Threads are grouped into blocks, and
blocks form a grid, allowing scalable parallel execution.
• C- Memory Management: Optimizing memory access patterns (e.g.,
coalesced memory access) is crucial for performance.
• D- Streams and Asynchronous Execution: CUDA streams allow
overlapping computation and memory transfers.
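A hedged sketch of two streams overlapping transfers and compute. This is a fragment, not a full program: it assumes device buffers d_a and d_b, pinned host buffers h_a and h_b of `half` bytes each (allocated with cudaMallocHost, which asynchronous copies require), and a previously defined kernel; `blocks` and `threads` are placeholder launch parameters:

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Each stream copies its half in, runs the kernel, and copies back;
// the work queued in s1 can overlap with the work queued in s2.
cudaMemcpyAsync(d_a, h_a, half, cudaMemcpyHostToDevice, s1);
kernel<<<blocks, threads, 0, s1>>>(d_a);
cudaMemcpyAsync(h_a, d_a, half, cudaMemcpyDeviceToHost, s1);

cudaMemcpyAsync(d_b, h_b, half, cudaMemcpyHostToDevice, s2);
kernel<<<blocks, threads, 0, s2>>>(d_b);
cudaMemcpyAsync(h_b, d_b, half, cudaMemcpyDeviceToHost, s2);

cudaStreamSynchronize(s1);   // wait for each stream to drain
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```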
• 2- OpenCL (Open Computing Language): A framework for writing
programs that execute across heterogeneous platforms, including
GPUs, CPUs, and FPGAs.
• 3- HIP (Heterogeneous-compute Interface for Portability): AMD's CUDA-like framework; HIP code can be compiled for both AMD and NVIDIA GPUs.
CUDA Programming Model
• CUDA is a widely used framework for GPU programming. It follows a
hierarchical execution model:
• Thread: Smallest execution unit.
• Block: A group of threads sharing memory and synchronization
mechanisms.
• Grid: A collection of thread blocks.
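The thread/block/grid hierarchy shows up directly in how a kernel is launched. A sketch for a 1-D problem of n elements; myKernel is a hypothetical kernel name:

```cuda
int threadsPerBlock = 256;                                        // threads per block
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / 256) blocks

// Launch a grid of blocksPerGrid blocks, each with threadsPerBlock threads.
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);

// Inside the kernel, each thread recovers its global index from the hierarchy:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
```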
• A basic CUDA program consists of:
• A- Kernel Function: Defines the computation to be executed on the GPU.
• B- Memory Management: Allocating and transferring data between CPU (host)
and GPU (device).
• C- Launching Kernel: Configuring execution parameters (grid and block sizes).
• D- Synchronizing Threads: Ensuring proper execution order.
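The four parts above fit into one small, complete program. A minimal sketch (the addOne kernel is hypothetical), compilable with nvcc:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// A- Kernel function: the computation executed on the GPU, one thread per element.
__global__ void addOne(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_x[1024];
    for (int i = 0; i < n; i++) h_x[i] = (float)i;

    // B- Memory management: allocate device memory, copy host -> device.
    float *d_x;
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    // C- Launch the kernel: a grid of 4 blocks, 256 threads each (4 * 256 = n).
    addOne<<<4, 256>>>(d_x, n);

    // D- Synchronize, then copy the result back and clean up.
    cudaDeviceSynchronize();
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);

    printf("h_x[0]=%.1f h_x[1023]=%.1f\n", h_x[0], h_x[1023]);
    return 0;
}
```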
Memory Management in GPU
• Efficient memory management is crucial for performance optimization:
• A- Memory Coalescing: Ensuring threads access consecutive memory
locations.
• B- Shared Memory Usage: Reducing global memory accesses by using
shared memory.
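Memory coalescing can be illustrated by contrasting two access patterns. A sketch with hypothetical kernel names: in the first, neighbouring threads touch neighbouring addresses, so a warp's 32 loads merge into a few memory transactions; in the second, a stride scatters the accesses across many cache lines:

```cuda
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];            // consecutive threads, consecutive addresses
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];          // accesses spread over many cache lines
}
```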
• C- Register Optimization: Minimizing register spilling to avoid
performance degradation.
• D- Unified Memory: Allowing CPU and GPU to share memory
seamlessly.
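With Unified Memory, a single cudaMallocManaged pointer is usable from both host and device, removing explicit cudaMemcpy calls. A minimal sketch (the scale kernel is hypothetical):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= f;
}

int main(void)
{
    const int n = 256;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   // one allocation, visible to CPU and GPU
    for (int i = 0; i < n; i++) x[i] = 1.0f;    // written by the CPU

    scale<<<1, n>>>(x, n, 2.0f);                // read and written by the GPU
    cudaDeviceSynchronize();                    // wait before the CPU touches x again

    printf("x[0] = %.1f\n", x[0]);              // no explicit cudaMemcpy needed
    cudaFree(x);
    return 0;
}
```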
Optimization Techniques
• To achieve high performance in GPU programming, the following
optimizations are applied:
• A- Thread Block Sizing: Choosing optimal thread and block sizes for
maximum occupancy.
• B- Loop Unrolling: Reducing loop overhead by manually unrolling
iterations.
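Instead of unrolling by hand, nvcc can be asked to unroll with a pragma. A sketch using a hypothetical grid-stride summation kernel:

```cuda
__global__ void sumStrided(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float s = 0.0f;
    #pragma unroll 4    // hint: replicate 4 iterations to cut loop overhead
    for (int k = i; k < n; k += blockDim.x * gridDim.x)
        s += in[k];
    out[i] = s;         // each thread writes its partial sum
}
```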
• C- Warp-Level Synchronization: Utilizing warp-level primitives for
faster communication.
• D- Occupancy Maximization: Ensuring high SM utilization by
managing resource allocation.
• E- Reducing Divergence: Minimizing branch divergence within warps.
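Warp-level primitives let the 32 threads of a warp exchange values without shared memory. A sketch of a warp-wide sum using __shfl_down_sync (available since CUDA 9); the helper name warpSum is hypothetical:

```cuda
__inline__ __device__ float warpSum(float v)
{
    // Halve the active distance each step: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);  // add the value from lane id + offset
    return v;   // after the loop, lane 0 holds the sum over the whole warp
}
```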
Applications of GPU Computing
• GPUs are extensively used in:
• A- Deep Learning & AI: Accelerating neural network training and
inference (e.g., TensorFlow, PyTorch).
• B- Scientific Computing: Simulating physical phenomena, fluid
dynamics, and quantum mechanics.
• C- Cryptography & Blockchain: Processing cryptographic algorithms
and mining cryptocurrencies.
• D- Gaming & Graphics: Real-time rendering and physics-based
simulations.
• E- Financial Modeling: Risk analysis and Monte Carlo simulations.