UNIT-4
PROGRAMMING
WITH CUDA
PROGRAMMING GUIDE :: CUDA TOOLKIT DOCUMENTATION
Contents:
Introduction: The Benefits of Using GPUs, CUDA: A General-Purpose Parallel
Computing Platform and Programming Model, A Scalable Programming Model.
Programming Model: Kernels, Thread Hierarchy, Memory Hierarchy,
Heterogeneous Programming, Asynchronous SIMT Programming Model,
Compute Capability. CUDA Parallel Programming: Summing two vectors (CPU-
GPU), Dot Product optimized.
What is a GPU?
◦ A GPU is an electronic circuit that your computer uses to speed up the creation and rendering of computer graphics.
◦ It improves the quality of the images, animations, and videos you see on your computer monitor.
◦ GPUs are found in desktop computers, laptops, game consoles, smartphones, tablets, and more.
◦ Typically, desktop computers and laptops use GPUs to enhance video performance, especially for graphics-intensive video games and 3D modeling software such as AutoCAD.
CPU vs GPU
CPU                                       GPU
Central Processing Unit                   Graphics Processing Unit
Several cores                             Many cores
Low latency                               High throughput
Good for serial processing                Good for parallel processing
Can do a handful of operations at once    Can do thousands of operations at once
Video: Nvidia - Computer Graphics/Processor 101 (NSIST)
Video: How to Enable GPU Rendering in Maya [3D\HD]
HPC APPLICATIONS ON GPU
◦ HPC applications: weather modeling, streaming live sporting events, tracking a developing storm, treatment of some genetic conditions, analyzing stock trends.
◦ HPC domains: artificial intelligence and machine learning, oil and gas, media and entertainment, financial services, industry and research labs.
◦ Related videos: Evolve Max Graphics Benchmark - NVidia vs. AMD FPS; Tropical Storm Ida: Friday track update & forecast; High School Football Live Streaming Setup; Genetic Disorders and Diseases.
The Benefits of Using GPUs
◦ The Graphics Processing Unit (GPU) provides much higher instruction throughput and memory bandwidth than the CPU within a similar price and power envelope.
◦ Many applications leverage these higher capabilities to run faster on the GPU than on the CPU.
◦ Other computing devices, like FPGAs[Field-programmable gate array], are also very energy
efficient, but offer much less programming flexibility than GPUs.
◦ This difference in capabilities between the GPU and the CPU exists because they are designed
with different goals in mind
◦ The GPU is designed to excel at executing thousands of threads in parallel (amortizing the
slower single-thread performance to achieve greater throughput).
◦ The GPU is specialized for highly parallel computations and therefore designed such that more
transistors are devoted to data processing rather than data caching and flow control.
◦ The schematic Figure 1 shows an example distribution of chip resources for a CPU versus a
GPU
• Devoting more transistors to data processing, e.g., floating-point computations, is beneficial for highly
parallel computations;
• the GPU can hide memory access latencies with computation, instead of relying on large data caches
and complex flow control to avoid long memory access latencies, both of which are expensive in terms
of transistors.
• In general, an application has a mix of parallel parts and sequential parts, so systems are designed
with a mix of GPUs and CPUs in order to maximize overall performance. Applications with a high
degree of parallelism can exploit this massively parallel nature of the GPU to achieve higher
performance than on the CPU.
CUDA: A General-Purpose Parallel Computing Platform and
Programming Model
◦ In November 2006, NVIDIA® introduced
CUDA®, a general purpose parallel computing
platform and programming model that powers
the parallel compute engine in NVIDIA GPUs
to solve many complex computational problems
in a more efficient way than on a CPU.
◦ CUDA comes with a software environment that
allows developers to use C++ as a high-level
programming language.
◦ As illustrated by Figure 2, other languages, application programming interfaces, and directives-based approaches are also supported, such as FORTRAN, DirectCompute, and OpenACC.
CUDA
◦CUDA − Compute Unified Device Architecture. It is an extension of C programming,
an API model for parallel computing created by Nvidia.
◦Programs are written using CUDA to harness the power of GPU. Thus, increasing the
computing performance.
◦CUDA is a parallel computing platform and an API model that was developed by
Nvidia.
◦Using CUDA, one can utilize the power of Nvidia GPUs to perform general
computing tasks, such as multiplying matrices and performing other linear algebra
operations, instead of just doing graphical calculations.
◦ Using CUDA, developers can harness the potential of the GPU for general-purpose computing (GPGPU).
CUDA processing flow
1. Copy data from main memory to GPU memory.
2. The CPU initiates the GPU compute kernel.
3. The GPU's CUDA cores execute the kernel in parallel.
4. Copy the resulting data from GPU memory back to main memory.
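A minimal sketch of how these four steps map onto CUDA runtime calls (the kernel name myKernel, the sizes, and the variable names are illustrative assumptions, not part of the original slides):

__global__ void myKernel(float *data, int n)   // hypothetical kernel used only to illustrate the flow
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void run(float *h_data, int n)                 // h_data points to n floats in main memory
{
    size_t bytes = n * sizeof(float);
    float *d_data;
    cudaMalloc(&d_data, bytes);                // allocate GPU memory

    // 1. Copy data from main memory to GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 2. The CPU initiates the GPU compute kernel
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // 3. The GPU's CUDA cores execute the kernel in parallel
    //    (the copy below implicitly waits for the kernel to finish)

    // 4. Copy the resulting data from GPU memory back to main memory
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
}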
CPU-GPU ARCHITECTURE
◦ Each CPU has multiple cores, and each
core has its cache.
◦ The CPU is responsible for application
control, distributing tasks between CPU
and GPU, and reading the result from the
GPU. The GPU has several streaming
multiprocessors (SMs).
◦ GPUs perform parallel processing based on
the data provided by the CPU. Even
though GPUs can perform massively
parallel computation, they cannot operate
independently without the support of a CPU.
Program Structure of CUDA
◦ A typical CUDA program has code intended both for GPU and the CPU.
◦ By default, a traditional C program is a CUDA program with only the host
code.
◦ The CPU is referred to as the host, and the GPU is referred to as the device.
◦ Whereas the host code can be compiled by a traditional C compiler such as GCC,
◦ the device code needs a special compiler that understands the API functions that are used. For Nvidia GPUs, this compiler is NVCC (the NVIDIA CUDA Compiler).
◦ The device code runs on the GPU, and the host code runs on the CPU.
◦ The NVCC processes a CUDA program, and separates the host code from
the device code.
◦ To accomplish this, it looks for special CUDA keywords. The code intended to run on the GPU (device code) is marked with special CUDA keywords that label data-parallel functions, called 'Kernels'.
◦ The device code is further compiled by the NVCC and executed on the
GPU.
A Scalable Programming Model
◦ The introduction of multicore CPUs and manycore GPUs means that mainstream processor chips are now
parallel systems.
◦ The challenge is to develop application software that transparently scales its parallelism to power the
increasing number of processor cores, much as 3D graphics applications transparently scale their
parallelism to manycore GPUs with widely varying numbers of cores.
◦ The CUDA parallel programming model is designed to overcome this challenge while maintaining a low
learning curve for programmers familiar with standard programming languages such as C.
◦ At its core are three key constructs:
◦ 1. a hierarchy of thread groups
◦ 2. shared memories
◦ 3. barrier synchronization
◦ These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism.
◦ They guide the programmer to partition the problem into sub-problems that can be solved independently in
parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in
parallel by all threads within the block.
◦ This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-
problem, and at the same time enables automatic scalability
Indeed, each block of threads can be scheduled on any of the available multiprocessors
within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program
can execute on any number of multiprocessors as illustrated by Figure 3, and only the runtime
system needs to know the physical multiprocessor count.
A GPU is built around an array of Streaming Multiprocessors (SMs). A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.
Programming Model
◦ This section introduces the main concepts of the CUDA programming model by outlining how they are exposed in C/C++:
I. Kernels,
II. Thread Hierarchy,
III. Memory Hierarchy,
IV. Heterogeneous Programming,
V. Asynchronous SIMT Programming Model,
VI. Compute Capability.
Kernels
◦ CUDA C++ extends C++ by allowing the programmer to
define C++ functions, called kernels, that, when called, are
executed N times in parallel by N different CUDA
threads, as opposed to only once like regular C++ functions.
◦ A kernel is defined using the __global__ declaration
specifier, and the number of CUDA threads that execute
that kernel for a given kernel call is specified using a new
<<<...>>> execution configuration syntax.
◦ Each thread that executes the kernel is given a
unique thread ID that is accessible within the kernel
through built-in variables.
◦ As an illustration, the following sample code, using the
built-in variable threadIdx, adds two vectors A and B of
size N and stores the result into vector C:
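This is the standard VecAdd example from the CUDA C++ Programming Guide; a sketch of it:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}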
◦ Here, each of the N threads that execute VecAdd()
performs one pair-wise addition.
Comparison between CUDA and C

C version:

#include <stdio.h>

void c_hello() {
    printf("Hello World!\n");
}

int main() {
    c_hello();
    return 0;
}

CUDA version:

#include <stdio.h>

__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}

int main() {
    cuda_hello<<<1,1>>>();
    cudaDeviceSynchronize();    // wait for the kernel so its printf output is flushed
    return 0;
}
$> nvcc hello.cu -o hello    (assuming the source file is saved as hello.cu)
Program to add 2 numbers
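A hedged sketch of such a program, in the style of the well-known "add two numbers on the GPU" listing from CUDA by Example (variable names such as dev_c are assumptions):

#include <stdio.h>

__global__ void add(int a, int b, int *c)
{
    *c = a + b;                                  // runs on the device
}

int main(void)
{
    int c;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));     // allocate device memory for the result

    add<<<1, 1>>>(2, 7, dev_c);                  // launch one block containing one thread

    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);

    cudaFree(dev_c);
    return 0;
}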
Thread Hierarchy
◦ threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index,
◦ forming a one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block.
◦ This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.
◦ As an example, the following code adds two matrices A and B of size NxN and stores the result into matrix C:
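A sketch of that example, following the CUDA C++ Programming Guide (a single block of N x N threads):

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}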
There is a limit to the number of threads per block,
since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of
that core.
On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks as illustrated by Figure 4.
The number of thread blocks in a grid is usually dictated by the size of the data being processed, which typically exceeds the number
of processors in the system.
◦ The number of threads per block and the number of blocks per grid are specified in the <<<...>>> execution configuration; two-dimensional blocks or grids can be specified as in the example below.
◦ Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional unique index accessible within the kernel through the built-in blockIdx variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
◦ Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows (see the sketch after this list).
◦ A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice.
◦ The grid is created with enough blocks to have one thread per matrix element as before.
◦ For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that need not be the case.
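The multi-block version referenced above, again sketched after the CUDA C++ Programming Guide:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}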
Thread blocks are required to execute independently:
It must be possible to execute them in any order, in parallel or in series.
This independence requirement allows thread blocks to be scheduled in any order across any number of cores as
illustrated by Figure 3
Threads within a block can cooperate by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses.
Synchronization points are specified in the kernel by calling __syncthreads(), which acts as a barrier at which all threads in the block must wait before any is allowed to proceed, as in the sketch below.
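A minimal sketch of this cooperation (the kernel and variable names are illustrative assumptions): each thread writes one value into shared memory, and __syncthreads() keeps the whole block in step while a tree reduction combines the values.

#define BLOCK_SIZE 256        // launch the kernel with BLOCK_SIZE threads per block

__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float cache[BLOCK_SIZE];               // shared by all threads of this block

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                                  // every thread has written its value

    // Tree reduction within the block; each step needs a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = cache[0];          // one partial sum per block
}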
Vector Addition in CUDA C
CPU vector sums
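The "slightly more convoluted method" discussed below is a CPU loop written with an explicit tid counter, in the style of CUDA by Example (a sketch; N is assumed to be defined elsewhere):

void add(int *a, int *b, int *c)
{
    int tid = 0;                     // this is CPU zero, so we start at zero
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;                    // we have one CPU, so we increment by one
    }
}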
Our slightly more convoluted method was intended to
suggest a potential way to parallelize the code on a system
with multiple CPUs or CPU cores. For example, with a
dual-core processor, one could change the increment to
2 and have one core initialize the loop with tid = 0 and
another with tid = 1. The first core would add the even-
indexed elements, and the second core would add the odd
indexed elements. This amounts to executing the following
code on each of the two CPU cores:
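A sketch of that two-core split, following the CUDA by Example presentation:

// CPU core 1:
void add(int *a, int *b, int *c)
{
    int tid = 0;                     // first core handles the even indices
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 2;                    // two cores, so increment by two
    }
}

// CPU core 2:
void add(int *a, int *b, int *c)
{
    int tid = 1;                     // second core handles the odd indices
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 2;
    }
}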
GPU vector sums
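A hedged sketch of the GPU version, again in the style of CUDA by Example (N and the error checking are simplified):

#define N 10

__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x;            // each block handles one element
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate memory on the GPU
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    // fill the input arrays on the CPU
    for (int i = 0; i < N; i++) { a[i] = -i; b[i] = i * i; }

    // copy the inputs to the GPU
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c);              // N blocks, one thread each

    // copy the result back and release device memory
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}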
Memory Hierarchy
◦ CUDA threads may access data from multiple memory spaces
during their execution as illustrated by Figure 5.
◦ Each thread has private local memory. Each thread block has
shared memory visible to all threads of the block and with the
same lifetime as the block. All threads have access to the same
global memory.
◦ There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Texture and Surface Memory).
◦ The global, constant, and texture memory spaces are persistent
across kernel launches by the same application.
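An illustrative sketch of where variables live in this hierarchy (the declarations and names are assumptions; launch with one block of 256 threads for the indexing to hold):

__constant__ float coeff[16];            // constant memory: read-only in kernels, set from the host

__device__ float persistent[256];        // global memory: visible to all threads, persists across launches

__global__ void memorySpaces(const float *globalIn)
{
    int localVar = threadIdx.x;          // per-thread registers / local memory

    __shared__ float tile[256];          // shared memory: one copy per thread block
    tile[threadIdx.x] = globalIn[threadIdx.x] * coeff[0];
    __syncthreads();

    persistent[threadIdx.x] = tile[threadIdx.x] + localVar;
}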
Heterogeneous Programming
◦ As illustrated by Figure 6, the CUDA programming model assumes that the CUDA
threads execute on a physically separate device that operates as a coprocessor to
the host running the C++ program. This is the case, for example, when the kernels
execute on a GPU and the rest of the C++ program executes on a CPU.
◦ The CUDA programming model also assumes that both the host and the device
maintain their own separate memory spaces in DRAM, referred to as host
memory and device memory, respectively. Therefore, a program manages the
global, constant, and texture memory spaces visible to kernels through calls to the
CUDA runtime (described in Programming Interface). This includes device
memory allocation and deallocation as well as data transfer between host and
device memory.
◦ Unified Memory provides managed memory to bridge the host and device memory
spaces. Managed memory is accessible from all CPUs and GPUs in the system as a
single, coherent memory image with a common address space. This capability
enables oversubscription of device memory and can greatly simplify the task of
porting applications by eliminating the need to explicitly mirror data on host and
device. See Unified Memory Programming for an introduction to Unified Memory.
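A minimal Unified Memory sketch using cudaMallocManaged (the sizes and the scale kernel are illustrative assumptions):

#include <cstdio>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data;

    // One allocation visible to both host and device; no explicit cudaMemcpy is needed.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // host writes

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // device reads and writes the same pointer
    cudaDeviceSynchronize();                         // wait before the host touches the data again

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}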
Asynchronous SIMT Programming Model
◦ In the CUDA programming model a thread is the lowest level of abstraction for doing a computation or a memory operation. Starting
with devices based on the NVIDIA Ampere GPU architecture, the CUDA programming model provides acceleration to memory
operations via the asynchronous programming model. The asynchronous programming model defines the behavior of asynchronous
operations with respect to CUDA threads.
◦ The asynchronous programming model defines the behavior of Asynchronous Barrier for synchronization between CUDA threads.
The model also explains and defines how cuda::memcpy_async can be used to move data asynchronously from global memory while
computing in the GPU.
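A sketch of the cuda::memcpy_async pattern described here, assuming a device of compute capability 7.0 or higher and the libcu++ <cuda/barrier> header (the kernel name and count parameter are illustrative):

#include <cuda/barrier>
#include <cooperative_groups.h>

__global__ void asyncCopyKernel(const int *global_in, size_t count)
{
    auto block = cooperative_groups::this_thread_block();
    extern __shared__ int smem[];                    // dynamic shared memory for the tile

    // Block-scoped barrier used to track completion of the asynchronous copy.
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0)
        init(&bar, block.size());
    block.sync();

    // Initiate the copy from global to shared memory; it proceeds asynchronously.
    cuda::memcpy_async(block, smem, global_in, sizeof(int) * count, bar);

    bar.arrive_and_wait();                           // all threads wait until the data has arrived

    // ... compute on smem ...
}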
Asynchronous Operations
◦ An asynchronous operation is defined as an operation that is initiated by a CUDA thread and is executed asynchronously as-if by
another thread. In a well formed program one or more CUDA threads synchronize with the asynchronous operation. The CUDA thread
that initiated the asynchronous operation is not required to be among the synchronizing threads.
◦ These synchronization objects can be used at different thread scopes. A scope defines the set of threads that may use the
synchronization object to synchronize with the asynchronous operation. The following table defines the thread scopes available in
CUDA C++ and the threads that can be synchronized with each.
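The table itself is not reproduced in these notes; as a sketch, the scopes defined by libcu++ are listed below, together with a device-scoped atomic counter as a usage example (descriptions paraphrased from the Programming Guide):

#include <cuda/atomic>

// cuda::thread_scope_thread - only the CUDA thread that initiated the operation synchronizes
// cuda::thread_scope_block  - CUDA threads in the same thread block may synchronize
// cuda::thread_scope_device - CUDA threads on the same GPU device may synchronize
// cuda::thread_scope_system - all CUDA and CPU threads in the same system may synchronize

// Example: an atomic counter shared by every thread on the GPU.
__global__ void countVisits(cuda::atomic<int, cuda::thread_scope_device> *counter)
{
    counter->fetch_add(1);           // device-scope atomic increment
}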
Compute Capability
◦ The compute capability of a device is represented by a version number, also sometimes called its "SM version". This version
number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which
hardware features and/or instructions are available on the present GPU.
◦ The compute capability comprises a major revision number X and a minor revision number Y and is denoted by X.Y.
◦ Devices with the same major revision number are of the same core architecture. The major revision number is 8 for devices based
on the NVIDIA Ampere GPU architecture, 7 for devices based on the Volta architecture, 6 for devices based on the Pascal
architecture, 5 for devices based on the Maxwell architecture, 3 for devices based on the Kepler architecture, 2 for devices based
on the Fermi architecture, and 1 for devices based on the Tesla architecture.
◦ The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.
Turing is the architecture for devices of compute capability 7.5, and is an incremental update based on the Volta architecture.
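A short sketch of querying the compute capability at runtime with the CUDA runtime API:

#include <cstdio>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major and prop.minor form the X.Y compute capability ("SM version")
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}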
CUDA-Enabled GPUs
◦ The CUDA-Enabled GPUs page lists all CUDA-enabled devices along with their compute capability; Compute Capabilities gives the technical specifications of each compute capability.