
Introduction to CUDA Programming

CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and API (Application Programming Interface) model developed by Nvidia, exposed to programmers as an extension of C/C++ that runs code on the Graphics Processing Unit (GPU). CUDA allows computations to be performed in parallel, delivering substantial speed-ups. Using CUDA, one can harness the power of an Nvidia GPU for general-purpose computing tasks, such as processing matrices and other linear algebra operations, rather than only for graphics calculations.

Need of CUDA

 GPUs are designed to perform high-speed parallel computations for displaying graphics, for example in games.
 CUDA makes this large installed base of hardware available for general computation: more than 100 million CUDA-capable GPUs are already deployed.
 It provides a 30-100x speed-up over other microprocessors for some applications.
 GPUs contain many small Arithmetic Logic Units (ALUs), in contrast to the few larger, more complex cores of a CPU. This allows many calculations to run in parallel, such as computing the color of every pixel on the screen.
Architecture of CUDA

 The G80 architecture described below contains 16 Streaming Multiprocessors (SMs); the device-query sketch after this list shows how to read the corresponding figures for your own GPU.


 Each Streaming Multiprocessor has 8 Streaming Processors (SPs), i.e., a total of 128 Streaming Processors (SPs).
 Each Streaming Processor has a MAD unit (Multiplication and Addition unit) and an additional MU (Multiplication Unit).
 The GT200 has 30 Streaming Multiprocessors (SMs), and each Streaming Multiprocessor (SM) has 8 Streaming Processors (SPs), i.e., a total of 240 Streaming Processors (SPs) and more than 1 TFLOP of processing power.
 Each Streaming Processor is massively threaded and can run thousands of threads per application.
 The G80 card has 16 Streaming Multiprocessors (SMs) and each SM has 8 Streaming
Processors (SPs), i.e., a total of 128 SPs and it supports 768 threads per Streaming
Multiprocessor (note: not per SP).
 Since each Streaming Multiprocessor has 8 SPs, each SP supports a maximum of 768/8 = 96 threads. The total number of threads that can run on the 128 SPs is therefore 128 * 96 = 12,288 threads.
 Therefore these processors are called massively parallel.
 The G80 chips have a memory bandwidth of 86.4GB/s.
 It also has an 8GB/s communication channel with the CPU (4GB/s for uploading to the
CPU RAM, and 4GB/s for downloading from the CPU RAM).
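The figures above are specific to the G80 and GT200 generations. A minimal sketch, using the standard CUDA runtime call cudaGetDeviceProperties, for querying the corresponding numbers on whatever NVIDIA GPU is installed:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("GPU name                  : %s\n", prop.name);
    printf("Streaming Multiprocessors : %d\n", prop.multiProcessorCount);
    printf("Max threads per SM        : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block     : %d\n", prop.maxThreadsPerBlock);
    printf("Global memory (MB)        : %zu\n", prop.totalGlobalMem >> 20);
    return 0;
}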
CUDA working procedure
 GPUs run one kernel (a group of tasks) at a time.
 Each kernel is made up of blocks, which are independent groups of threads (mapped onto the GPU's ALUs).
 Each block contains threads, which are the individual units of computation.
 The threads in each block typically work together to calculate a value.
 Threads in the same block can share memory.
 In CUDA, copying data between CPU and GPU memory is often the most time-consuming part of the computation.
 For each thread, registers are the fastest storage, followed by on-chip shared memory; local, global, constant, and texture memory reside in device memory and are slower (constant and texture accesses are cached). A minimal sketch of threads in one block cooperating through shared memory follows this list.
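The sketch below assumes each block is launched with exactly 256 threads (a power of two); the kernel and array names are illustrative. Each block loads 256 elements into shared memory, reduces them cooperatively, and writes one partial sum:

// Hypothetical kernel: each block sums 256 elements of `in` into one entry of `out`.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float cache[256];          // visible to every thread in this block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;  // each thread loads one element
    __syncthreads();                      // wait until the whole block has loaded

    // Tree reduction in shared memory: half the threads add, then a quarter, ...
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0)                         // thread 0 writes this block's partial sum
        out[blockIdx.x] = cache[0];
}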

Basic CUDA Program flow


 Load data into CPU memory
 Copy data from CPU to GPU memory - e.g., cudaMemcpy(...,
cudaMemcpyHostToDevice)
 Launch the GPU kernel, passing device pointers as arguments - e.g., kernel<<<numBlocks, threadsPerBlock>>>(gpuVar)
 Copy results from GPU to CPU memory - e.g., cudaMemcpy(..., cudaMemcpyDeviceToHost)
 Use results on CPU
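A minimal end-to-end sketch of the five steps above, assuming a hypothetical kernel addOne that increments every element of an array (error checking omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(int *data, int n)          // runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1024;
    int host[n];                                   // 1. load data into CPU memory
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev;
    cudaMalloc((void **)&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);   // 2. CPU -> GPU

    addOne<<<(n + 255) / 256, 256>>>(dev, n);                         // 3. launch kernel

    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);   // 4. GPU -> CPU
    cudaFree(dev);

    printf("host[0] = %d, host[%d] = %d\n", host[0], n - 1, host[n - 1]);  // 5. use results
    return 0;
}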

Work distribution
 Each thread "knows" the x and y coordinates of the block it belongs to, and its own coordinates within that block.
 These positions can be used to calculate a unique thread ID for each thread.
 The computational work done will depend on the value of the thread ID.
 For example, the thread ID can be mapped to a particular matrix element (or group of elements), as in the sketch below.
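A sketch of this mapping, assuming a hypothetical kernel scaleMatrix launched over a 2-D grid of 2-D blocks; each thread derives a unique element index from the built-in blockIdx, blockDim, and threadIdx variables:

__global__ void scaleMatrix(float *matrix, int width, int height, float factor)
{
    // Global (x, y) position of this thread in the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row

    if (x < width && y < height) {
        int id = y * width + x;    // unique thread/element ID
        matrix[id] *= factor;      // the work performed depends on this ID
    }
}

// Launched, for example, as:
// scaleMatrix<<<dim3((width + 15) / 16, (height + 15) / 16), dim3(16, 16)>>>(d_m, width, height, 2.0f);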

CUDA Applications
 CUDA suits applications that run parallel operations over large amounts of data and are processing-intensive. Typical domains include:
 Computational finance
 Climate, weather, and ocean modeling
 Data science and analytics
 Deep learning and machine learning
 Defense and intelligence
 Manufacturing/AEC
 Media and entertainment
 Medical imaging
 Oil and gas
 Research
 Safety and security
 Tools and management

Benefits of CUDA
 Several advantages give CUDA an edge over traditional general-purpose GPU computing through graphics APIs:
 Unified memory (CUDA 6.0 or later) and unified virtual memory (CUDA 4.0 or later).
 Shared memory provides a fast region of memory shared among the threads of a block. It can be used as a user-managed cache and provides more bandwidth than texture lookups.
 Scattered reads - code can read from arbitrary addresses in memory.
 Improved performance on downloads and readbacks, both to and from the GPU.
 CUDA has full support for bitwise and integer operations.

Limitations of CUDA
 Host-side CUDA source code is processed according to C++ syntax rules. Older versions of CUDA used C syntax rules, so up-to-date CUDA source code may or may not compile with them as required.
 CUDA has one-way interoperability (the ability of computer systems or software to exchange and make use of information) with rendering APIs such as OpenGL: OpenGL can access CUDA-registered memory, but CUDA cannot access OpenGL memory.
 Later versions of CUDA do not provide emulators or fallback support for older versions.
 CUDA supports only NVIDIA hardware.

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
developed by NVIDIA. It enables developers to use NVIDIA GPUs for general-purpose computing,
significantly accelerating applications in areas like scientific simulations, data processing, machine
learning, and more.

Key aspects of CUDA:


 Parallel Computing:

CUDA leverages the massively parallel architecture of NVIDIA GPUs, which contain thousands of cores,
to perform computations simultaneously. This is in contrast to traditional CPUs, which are optimized for
sequential processing.

 Programming Model:

CUDA provides extensions to popular programming languages like C, C++, Fortran, Python, and Julia,
allowing developers to express parallelism and offload compute-intensive portions of their applications
to the GPU.

 CUDA Toolkit:

NVIDIA offers the free CUDA Toolkit, which includes essential components for GPU-accelerated
development, such as a compiler, development tools, GPU-accelerated libraries, and the CUDA runtime.
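For example, a single .cu source file containing both host and device code can typically be built with the toolkit's compiler driver as nvcc program.cu -o program (the file name here is illustrative); the resulting binary links against the CUDA runtime automatically.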

 Kernel Execution:

In CUDA, the parallelizable parts of an application are written as "kernels" that are executed on the
GPU. These kernels are launched with a specific execution configuration, defining the number of thread
blocks and threads within each block, which dictates how the work is distributed across the GPU's cores.
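A minimal sketch of such a launch, assuming a hypothetical kernel invert that operates on a width x height grayscale image; the execution configuration is a 2-D grid of 16x16-thread blocks, rounded up so that every pixel gets one thread:

#include <cuda_runtime.h>

__global__ void invert(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];   // invert one pixel
}

int main()
{
    int width = 1920, height = 1080;
    unsigned char *img;
    cudaMalloc((void **)&img, (size_t)width * height);   // device image buffer

    // Execution configuration: <<<grid, block>>>
    dim3 block(16, 16);                                  // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,           // enough blocks to cover
              (height + block.y - 1) / block.y);         // the whole image
    invert<<<grid, block>>>(img, width, height);
    cudaDeviceSynchronize();                             // wait for the kernel to finish

    cudaFree(img);
    return 0;
}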

 Memory Management:

CUDA provides mechanisms for managing data transfer between the CPU (host) memory and the GPU
(device) memory, which is crucial for optimal performance. Techniques like unified memory
management simplify this process.
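A minimal sketch of the unified memory approach, using cudaMallocManaged so the same pointer is valid on both host and device and no explicit cudaMemcpy is needed (the kernel name is illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleAll(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    const int n = 256;
    float *x;
    cudaMallocManaged((void **)&x, n * sizeof(float));   // visible to CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0f;             // initialize on the host

    doubleAll<<<1, n>>>(x, n);                           // run on the device
    cudaDeviceSynchronize();                             // wait before touching x on the host

    printf("x[0] = %f\n", x[0]);                         // prints 2.000000
    cudaFree(x);
    return 0;
}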
