Introduction to Parallel Computing (CMSC416 / CMSC616)
GPGPUs and CUDA
Abhinav Bhatele, Alan Sussman
Many slides borrowed from Daniel Nichols’ slides
GPGPUs
• Originally developed to handle computation related to graphics processing
• Also found to be useful for scientific computing
• Hence the name: General Purpose Graphics Processing Unit
Accelerators
• IBM’s Cell processors
• Used in Sony’s PlayStation 3 (2006)
• GPUs: NVIDIA, AMD, Intel
• First programmable GPU: NVIDIA GeForce 256 (1999)
• Around 1999-2001, early GPGPU results
• FPGAs
https://www.cs.unc.edu/xcms/wpfiles/50th-symp/Harris.pdf
Used for mainstream HPC
• 2013: NAMD, used for molecular dynamics simulations on a supercomputer with 3000 NVIDIA Tesla GPUs
GPGPU Hardware
• Higher instruction throughput than CPUs
• Hide memory access latencies with computation instead of relying on large caches
Comparing GPUs to CPUs
• Intel i9 11900K: 8 cores, 3.3 GHz vs. NVIDIA GeForce RTX 3090: 10,496 cores, 1.4 GHz
• AMD Epyc 7763: 64 cores, 2.45 GHz vs. NVIDIA A100: 17,712 cores, 0.76 GHz
Volta GV100 SM
• CUDA Core
• Single serial execution unit
• Each Volta Streaming Multiprocessor (SM) has:
• 64 FP32 cores
• 64 INT32 cores
• 32 FP64 cores
• 8 Tensor cores
• CUDA capable device (GPU): a collection of SMs (see the device-query sketch below)
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Figure 5. Volta GV100 Streaming Multiprocessor (SM)
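• As a concrete illustration of “a GPU is a collection of SMs”: a minimal sketch (not from the original slides) that queries device properties at runtime; cudaGetDeviceProperties and the cudaDeviceProp fields shown are standard CUDA runtime API, but the program itself is my addition

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  // Query properties of device 0 (assumes at least one CUDA-capable GPU is present)
  cudaGetDeviceProperties(&prop, 0);

  printf("Device: %s\n", prop.name);
  printf("Number of SMs: %d\n", prop.multiProcessorCount);
  printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("Compute capability: %d.%d\n", prop.major, prop.minor);
  return 0;
}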
Volta GV100
Figure 4. Volta GV100 Full GPU with 84 SM Units
Summit Node Overview
• GPU-based nodes
• Figure on the right shows a single node of Summit @ ORNL
• [Figure: two POWER9 (P9) CPUs, each with 256 GB of DRAM (135 GB/s) and three 7 TF GPUs with 16 GB of HBM (900 GB/s) each; NVLink at 50 GB/s, X-Bus at 64 GB/s between the CPUs, PCIe Gen4 at 16 GB/s to the NIC (2x12.5 GB/s), NVM at 6.0 GB/s read / 2.2 GB/s write]
• Node totals: 42 TF (6x7 TF), 96 GB HBM (6x16 GB), 512 GB DRAM (2x16x16 GB), NET 25 GB/s (2x12.5 GB/s), 83 MMsg/s
• HBM & DRAM speeds are aggregate (Read+Write); all other speeds (X-Bus, NVLink, PCIe, IB) are bi-directional
CUDA: A programming model for NVIDIA GPUs
• Allows developers to use C++ as a high-level programming language
• CUDA is a language extension
• Built around threads, blocks and grids
• Terminology:
• Host: CPU
• Device: GPU
• CUDA kernel: a function that gets executed on the GPU (a minimal host/device sketch follows)
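• To make the terminology concrete, a minimal sketch (my addition, not from the slides): the host (CPU) launches a kernel that executes on the device (GPU); device-side printf and cudaDeviceSynchronize are standard CUDA runtime features

#include <cstdio>

// CUDA kernel: code that runs on the device; __global__ marks it as launchable from the host
__global__ void hello() {
  printf("Hello from thread %d of block %d\n", (int)threadIdx.x, (int)blockIdx.x);
}

int main() {
  // Host code: launch the kernel on the device with 2 blocks of 4 threads each
  hello<<<2, 4>>>();
  // Wait for the device to finish so the kernel's output is flushed before exiting
  cudaDeviceSynchronize();
  return 0;
}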
CUDA software abstraction
• Thread
• Serial unit of execution
• Block
• Collection of threads
• Number of threads in block <= 1024
• Grid
• Collection of blocks (indexing across this hierarchy is shown in the sketch below)
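• A minimal sketch (my addition) of how the hierarchy appears in code: the launch configuration creates a grid of blocks of threads, and each thread locates itself with the built-in threadIdx, blockIdx, blockDim, and gridDim variables (dim3 also allows 2D/3D blocks and grids)

#include <cstdio>

__global__ void whoami() {
  int t = threadIdx.x;          // thread index within its block
  int b = blockIdx.x;           // block index within the grid
  int nt = blockDim.x;          // threads per block
  int nb = gridDim.x;           // blocks in the grid
  int global_id = b * nt + t;   // unique (1D) index across the whole grid
  printf("block %d/%d, thread %d/%d -> global id %d\n", b, nb, t, nt, global_id);
}

int main() {
  dim3 grid(2);    // 2 blocks in the grid
  dim3 block(4);   // 4 threads per block (must be <= 1024)
  whoami<<<grid, block>>>();
  cudaDeviceSynchronize();   // wait for the kernel and flush its output
  return 0;
}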
Software to hardware mapping
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Three steps to writing a CUDA kernel
• Copy input data from host to device memory
• Load the GPU program (kernel) and execute
• Copy the results back to host memory
Copying data to the GPU
double *d_Matrix, *h_Matrix;
h_Matrix = new double[N];
cudaMalloc(&d_Matrix, sizeof(double)*N);
// ... initialize h_Matrix ...
cudaMemcpy(d_Matrix, h_Matrix, sizeof(double)*N, cudaMemcpyHostToDevice);
// ... some computation on GPU ...
cudaMemcpy(h_Matrix, d_Matrix, sizeof(double)*N, cudaMemcpyDeviceToHost);
cudaFree(d_Matrix);
• Possible values for the last argument (cudaMemcpyKind): cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, cudaMemcpyHostToHost, cudaMemcpyDefault (an error-checking sketch follows)
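• The slide code omits error handling for brevity; a hedged sketch of how the return values could be checked (cudaError_t, cudaSuccess, and cudaGetErrorString are part of the CUDA runtime API; the check helper and variable names are my own)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call did not succeed
void check(cudaError_t err, const char *what) {
  if (err != cudaSuccess) {
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
}

int main() {
  const int N = 1 << 20;
  double *h_Matrix = new double[N];
  for (int i = 0; i < N; i++) h_Matrix[i] = i;

  double *d_Matrix;
  check(cudaMalloc(&d_Matrix, sizeof(double) * N), "cudaMalloc");
  check(cudaMemcpy(d_Matrix, h_Matrix, sizeof(double) * N, cudaMemcpyHostToDevice), "cudaMemcpy H2D");
  // ... some computation on GPU ...
  check(cudaMemcpy(h_Matrix, d_Matrix, sizeof(double) * N, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");

  check(cudaFree(d_Matrix), "cudaFree");
  delete[] h_Matrix;
  return 0;
}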
CUDA syntax

__global__ void saxpy(float *x, float *y, float alpha) {
  int i = threadIdx.x;
  y[i] = alpha*x[i] + y[i];
}

int main() {
  ...
  saxpy<<<1, N>>>(x, y, alpha);
  ...
}

• Launch configuration: <<<#blocks, threads_per_block>>>, i.e., <<<grid size, block size>>>; here, 1 block of N threads
• What happens when the array size (N) > 1024? (the sketch below shows the failure; the following slides show the fix)
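• One way to see the answer empirically (a sketch, not from the slides): a launch requesting more than 1024 threads per block fails, and the error can be retrieved with cudaGetLastError right after the launch

#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float *x, float *y, float alpha) {
  int i = threadIdx.x;
  y[i] = alpha * x[i] + y[i];
}

int main() {
  const int N = 4096;              // more threads than a single block allows
  float *x, *y;
  cudaMalloc(&x, N * sizeof(float));
  cudaMalloc(&y, N * sizeof(float));

  saxpy<<<1, N>>>(x, y, 2.0f);     // asks for 4096 threads in one block -> exceeds the limit

  // The launch itself fails; cudaGetLastError reports why
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));  // e.g. "invalid configuration argument"

  cudaFree(x);
  cudaFree(y);
  return 0;
}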
Compiling CUDA code

nvcc -o saxpy --generate-code arch=compute_80,code=sm_80 saxpy.cu
./saxpy

• The arch/code values select the target GPU architecture (compute_80/sm_80 is Ampere, e.g. the A100); choose values compatible with the compute capability of the GPU you run on
Multiple blocks
__global__ void saxpy(float *x, float *y, float alpha, int N) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < N)
    y[i] = alpha*x[i] + y[i];
}

int main() {
  ...
  int threadsPerBlock = 512;
  int numBlocks = N/threadsPerBlock + (N % threadsPerBlock != 0);
  saxpy<<<numBlocks, threadsPerBlock>>>(x, y, alpha, N);
  ...
}
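• Putting the pieces together: a complete, runnable sketch (my own consolidation of the slide snippets, with initialization and a simple correctness check added) that follows the three steps: copy in, launch, copy back

#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float *x, float *y, float alpha, int N) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < N)                        // guard the extra threads in the last block
    y[i] = alpha * x[i] + y[i];
}

int main() {
  const int N = 1 << 20;
  const float alpha = 2.0f;

  // Host arrays
  float *h_x = new float[N];
  float *h_y = new float[N];
  for (int i = 0; i < N; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

  // Device arrays
  float *d_x, *d_y;
  cudaMalloc(&d_x, N * sizeof(float));
  cudaMalloc(&d_y, N * sizeof(float));

  // Step 1: copy input data from host to device memory
  cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, h_y, N * sizeof(float), cudaMemcpyHostToDevice);

  // Step 2: launch the kernel with enough blocks to cover all N elements
  int threadsPerBlock = 512;
  int numBlocks = N / threadsPerBlock + (N % threadsPerBlock != 0);
  saxpy<<<numBlocks, threadsPerBlock>>>(d_x, d_y, alpha, N);

  // Step 3: copy the results back to host memory
  cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

  // Every element should now be alpha*1.0 + 2.0 = 4.0
  printf("y[0] = %f, y[N-1] = %f\n", h_y[0], h_y[N - 1]);

  cudaFree(d_x); cudaFree(d_y);
  delete[] h_x; delete[] h_y;
  return 0;
}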
Questions?