0% found this document useful (0 votes)
35 views26 pages

07 cmsc416 Cuda

Uploaded by

qiqi85078802
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
35 views26 pages

07 cmsc416 Cuda

Uploaded by

qiqi85078802
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Introduction to Parallel Computing (CMSC416 / CMSC616)

GPGPUs and CUDA


Abhinav Bhatele, Alan Sussman

Many slides borrowed from Daniel Nichols’ slides


GPGPUs

• Originally developed to handle computation related to graphics processing

• Also found to be useful for scienti c computing

• Hence the name: General Purpose Graphics Processing Unit

Abhinav Bhatele (CMSC416 / CMSC616) 2


fi
Accelerators

• IBM’s Cell processors


• Used in Sony’s Playstation 3 (2006)

• GPUs: NVIDIA, AMD, Intel


• First programmable GPU: NVIDIA GeForce 256 (1999)

• Around 1999-2001, early GPGPU results

• FPGAs

https://www.cs.unc.edu/xcms/wp les/50th-symp/Harris.pdf

Abhinav Bhatele (CMSC416 / CMSC616) 3


fi
Used for mainstream HPC

• 2013: NAMD, used for molecular dynamics


simulations on a supercomputer with 3000
NVIDIA Tesla GPUs

Abhinav Bhatele (CMSC416 / CMSC616) 4


GPGPU Hardware
• Higher instruction throughput

• Hide memory access latencies with computation

Abhinav Bhatele (CMSC416 / CMSC616) 5


Comparing GPUs to CPUs

• Intel i9 11900K • NVIDIA GeForce RTX 3090


• 8 cores • 10,496 cores

• 3.3 GHz • 1.4 GHz

• AMD Epyc 7763 • NVIDIA A100


• 64 cores • 17,712 cores

• 2.45 GHz • 0.76 GHz

Abhinav Bhatele (CMSC416 / CMSC616) 6


Volta GV100 SM
• CUDA Core
• Single serial execution unit

• Each Volta Streaming Multiprocessor (SM) has:


• 64 FP32 cores

• 64 INT32 cores

• 32 FP64 cores

• 8 Tensor cores

• CUDA capable device or GPU


• Collection of SMs

https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Figure 5. Volta GV100 Streaming Multiprocessor (SM)

Abhinav Bhatele (CMSC416 / CMSC616) 7


Volta GV100

Figure 4. Abhinav Bhatele


Volta GV100 Full GPU(CMSC416
with/ CMSC616)
84 SM Units 8
Volta GV100

Figure 4. Abhinav Bhatele


Volta GV100 Full GPU(CMSC416
with/ CMSC616)
84 SM Units 8
it Node Overview

900 GB/s

900 GB/s
GPU-based nodes

16 GB

16 GB
DRAM DRAM

HBM

HBM
GPU

GPU
7 TF

7 TF
256 GB 256 GB

50 GB/s

50 GB/s
135 GB/s

135 GB/s
50 GB/s 50 GB/s

64

900 GB/s

900 GB/s
50 GB/s

50 GB/s
GB/s

16 GB

16 GB
HBM

HBM
GPU

GPU
7 TF

7 TF
P9 P9

16 G

B/s
50 GB/s

50 GB/s
16 G
B/s
50 GB/s 50 GB/s

900 GB/s

900 GB/s
16 GB

16 GB
HBM

HBM
GPU

GPU
7 TF

7 TF
NIC
• Figure on the right shows a single
node of Summit @ ORNL

12.5 GB/s

12.5 GB/s
6.0 GB/s Read
NVM 2.2 GB/s Write

TF 42 TF (6x7 TF) HBM/DRAM Bus (aggregate B/W)


HBM 96 GB (6x16 GB) NVLINK
DRAM 512 GB (2x16x16 GB) X-Bus (SMP)
NET 25 GB/s (2x12.5 GB/s) PCIe Gen4
MMsg/s 83 EDR IB

HBM & DRAM speeds are aggregate (Read+Write).


All other speeds (X-Bus, NVLink, PCIe, IB) are bi-directional.

Abhinav Bhatele (CMSC416 / CMSC616) 9


CUDA: A programming model for NVIDIA GPUs

• Allows developers to use C++ as a high-level programming language


• CUDA is a language extension

• Built around threads, blocks and grids

• Terminology:
• Host: CPU

• Device: GPU

• CUDA kernel: a function that gets executed on the GPU

Abhinav Bhatele (CMSC416 / CMSC616) 10


CUDA software abstraction

Abhinav Bhatele (CMSC416 / CMSC616) 11


CUDA software abstraction

• Thread
• Serial unit of execution

Abhinav Bhatele (CMSC416 / CMSC616) 11


CUDA software abstraction

• Thread
• Serial unit of execution

• Block
• Collection of threads

• Number of threads in block <= 1024

Abhinav Bhatele (CMSC416 / CMSC616) 11


CUDA software abstraction

• Thread
• Serial unit of execution

• Block
• Collection of threads

• Number of threads in block <= 1024

• Grid
• Collection of blocks

Abhinav Bhatele (CMSC416 / CMSC616) 11


Software to hardware mapping

https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/

Abhinav Bhatele (CMSC416 / CMSC616) 12


Three steps to writing a CUDA kernel

• Copy input data from host to device memory

• Load the GPU program (kernel) and execute

• Copy the results back to host memory

Abhinav Bhatele (CMSC416 / CMSC616) 13


Copying data to the GPU
double *d_Matrix, *h_Matrix;
h_Matrix = new double[N];

cudaMalloc(&d_Matrix, sizeof(double)*N);

// ... initialize h_Matrix ...


cudaMemcpy(d_Matrix, h_Matrix, sizeof(double)*N, cudaMemcpyHostToDevice);

// ... some computation on GPU …

cudaMemcpy(h_Matrix, d_Matrix, sizeof(double)*N, cudaMemcpyDeviceToHost);

cudaFree(d_Matrix);

Abhinav Bhatele (CMSC416 / CMSC616) 14


Copying data to the GPU
double *d_Matrix, *h_Matrix; cudaMemcpyHostToDevice
h_Matrix = new double[N]; cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
cudaMalloc(&d_Matrix, sizeof(double)*N); cudaMemcpyHostToHost
cudaMemcpyDefault
// ... initialize h_Matrix ...
cudaMemcpy(d_Matrix, h_Matrix, sizeof(double)*N, cudaMemcpyHostToDevice);

// ... some computation on GPU …

cudaMemcpy(h_Matrix, d_Matrix, sizeof(double)*N, cudaMemcpyDeviceToHost);

cudaFree(d_Matrix);

Abhinav Bhatele (CMSC416 / CMSC616) 14


CUDA syntax

__global__ void saxpy(float *x, float *y, float alpha) {


int i = threadIdx.x;
y[i] = alpha*x[i] + y[i];
}

int main() {
...
saxpy<<<1, N>>>(x, y, alpha);
...
}

Grid size, Block size


Abhinav Bhatele (CMSC416 / CMSC616) 15
CUDA syntax

__global__ void saxpy(float *x, float *y, float alpha) {


int i = threadIdx.x;
y[i] = alpha*x[i] + y[i];
}

int main() {
...
saxpy<<<1, N>>>(x, y, alpha);
...
}

Grid size, Block size


Abhinav Bhatele (CMSC416 / CMSC616) 15
CUDA syntax

__global__ void saxpy(float *x, float *y, float alpha) {


int i = threadIdx.x;
y[i] = alpha*x[i] + y[i];
}

int main() {
...
saxpy<<<1, N>>>(x, y, alpha);
...
}

<<<#blocks, threads_per_block>>>
Grid size, Block size
Abhinav Bhatele (CMSC416 / CMSC616) 15
CUDA syntax

__global__ void saxpy(float *x, float *y, float alpha) {


int i = threadIdx.x;
y[i] = alpha*x[i] + y[i];
} What happens when:
array size (N) > 1024?
int main() {
...
saxpy<<<1, N>>>(x, y, alpha);
...
}

<<<#blocks, threads_per_block>>>
Grid size, Block size
Abhinav Bhatele (CMSC416 / CMSC616) 15
Compiling CUDA code

nvcc -o saxpy --generate-code arch=compute_80,code=sm_80 saxpy.cu

./saxpy

Abhinav Bhatele (CMSC416 / CMSC616) 16


Multiple blocks
__global__ void saxpy(float *x, float *y, float alpha, int N) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N)
y[i] = alpha*x[i] + y[i];
}

int main() {
...
int threadsPerBlock = 512;
int numBlocks = N/threadsPerBlock
+ (N % threadsPerBlock != 0);

saxpy<<<numBlocks, threadsPerBlock>>>(x, y, alpha, N);


...
}
Abhinav Bhatele (CMSC416 / CMSC616) 17
Questions?

You might also like