Introduction to Parallel Computing (CMSC416 / CMSC616)
GPGPUs and CUDA
Abhinav Bhatele, Alan Sussman
Many slides borrowed from Daniel Nichols’ slides
GPGPUs
• Originally developed to handle computation related to graphics processing
• Also found to be useful for scientific computing
• Hence the name: General Purpose Graphics Processing Unit
Accelerators
• IBM’s Cell processors
• Used in Sony’s PlayStation 3 (2006)
• GPUs: NVIDIA, AMD, Intel
• First programmable GPU: NVIDIA GeForce 256 (1999)
• Around 1999-2001, early GPGPU results
• FPGAs
https://www.cs.unc.edu/xcms/wpfiles/50th-symp/Harris.pdf
Used for mainstream HPC
• 2013: NAMD, used for molecular dynamics simulations on a supercomputer with 3000 NVIDIA Tesla GPUs
GPGPU Hardware
• Higher instruction throughput than CPUs
• Hide memory access latencies with computation instead of relying on large caches
Comparing GPUs to CPUs
• Intel i9 11900K: 8 cores, 3.3 GHz vs. NVIDIA GeForce RTX 3090: 10,496 cores, 1.4 GHz
• AMD Epyc 7763: 64 cores, 2.45 GHz vs. NVIDIA A100: 17,712 cores, 0.76 GHz
Volta GV100 SM
• CUDA Core
• Single serial execution unit
• Each Volta Streaming Multiprocessor (SM) has:
• 64 FP32 cores
• 64 INT32 cores
• 32 FP64 cores
• 8 Tensor cores
• CUDA capable device (GPU): a collection of SMs (see the device-query sketch below)
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Figure 5. Volta GV100 Streaming Multiprocessor (SM)
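• As a concrete illustration of “a GPU is a collection of SMs”: a minimal sketch (not from the original slides) that queries device properties at runtime; cudaGetDeviceProperties and the cudaDeviceProp fields shown are standard CUDA runtime API, but the program itself is my addition

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  // Query properties of device 0 (assumes at least one CUDA-capable GPU is present)
  cudaGetDeviceProperties(&prop, 0);

  printf("Device: %s\n", prop.name);
  printf("Number of SMs: %d\n", prop.multiProcessorCount);
  printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("Compute capability: %d.%d\n", prop.major, prop.minor);
  return 0;
}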
Volta GV100
Figure 4. Volta GV100 Full GPU with 84 SM Units
Summit Node Overview
• GPU-based nodes
• Figure on the right shows a single node of Summit @ ORNL
• [Figure: two POWER9 (P9) CPUs, each with 256 GB of DRAM (135 GB/s) and three 7 TF GPUs with 16 GB of HBM (900 GB/s) each; NVLink at 50 GB/s, X-Bus at 64 GB/s between the CPUs, PCIe Gen4 at 16 GB/s to the NIC (2x12.5 GB/s), NVM at 6.0 GB/s read / 2.2 GB/s write]
• Node totals: 42 TF (6x7 TF), 96 GB HBM (6x16 GB), 512 GB DRAM (2x16x16 GB), NET 25 GB/s (2x12.5 GB/s), 83 MMsg/s
• HBM & DRAM speeds are aggregate (Read+Write); all other speeds (X-Bus, NVLink, PCIe, IB) are bi-directional
CUDA: A programming model for NVIDIA GPUs
• Allows developers to use C++ as a high-level programming language
• CUDA is a language extension
• Built around threads, blocks and grids
• Terminology:
• Host: CPU
• Device: GPU
• CUDA kernel: a function that gets executed on the GPU (a minimal host/device sketch follows)
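• To make the terminology concrete, a minimal sketch (my addition, not from the slides): the host (CPU) launches a kernel that executes on the device (GPU); device-side printf and cudaDeviceSynchronize are standard CUDA runtime features

#include <cstdio>

// CUDA kernel: code that runs on the device; __global__ marks it as launchable from the host
__global__ void hello() {
  printf("Hello from thread %d of block %d\n", (int)threadIdx.x, (int)blockIdx.x);
}

int main() {
  // Host code: launch the kernel on the device with 2 blocks of 4 threads each
  hello<<<2, 4>>>();
  // Wait for the device to finish so the kernel's output is flushed before exiting
  cudaDeviceSynchronize();
  return 0;
}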
CUDA software abstraction
• Thread
• Serial unit of execution
• Block
• Collection of threads
• Number of threads in block <= 1024
• Grid
• Collection of blocks (indexing across this hierarchy is shown in the sketch below)
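• A minimal sketch (my addition) of how the hierarchy appears in code: the launch configuration creates a grid of blocks of threads, and each thread locates itself with the built-in threadIdx, blockIdx, blockDim, and gridDim variables (dim3 also allows 2D/3D blocks and grids)

#include <cstdio>

__global__ void whoami() {
  int t = threadIdx.x;          // thread index within its block
  int b = blockIdx.x;           // block index within the grid
  int nt = blockDim.x;          // threads per block
  int nb = gridDim.x;           // blocks in the grid
  int global_id = b * nt + t;   // unique (1D) index across the whole grid
  printf("block %d/%d, thread %d/%d -> global id %d\n", b, nb, t, nt, global_id);
}

int main() {
  dim3 grid(2);    // 2 blocks in the grid
  dim3 block(4);   // 4 threads per block (must be <= 1024)
  whoami<<<grid, block>>>();
  cudaDeviceSynchronize();   // wait for the kernel and flush its output
  return 0;
}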
Software to hardware mapping
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Three steps to writing a CUDA kernel
• Copy input data from host to device memory
• Load the GPU program (kernel) and execute
• Copy the results back to host memory
Copying data to the GPU
double *d_Matrix, *h_Matrix;
h_Matrix = new double[N];
cudaMalloc(&d_Matrix, sizeof(double)*N);
// ... initialize h_Matrix ...
cudaMemcpy(d_Matrix, h_Matrix, sizeof(double)*N, cudaMemcpyHostToDevice);
// ... some computation on GPU ...
cudaMemcpy(h_Matrix, d_Matrix, sizeof(double)*N, cudaMemcpyDeviceToHost);
cudaFree(d_Matrix);
• Possible values for the last argument (cudaMemcpyKind): cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, cudaMemcpyHostToHost, cudaMemcpyDefault (an error-checking sketch follows)
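• The slide code omits error handling for brevity; a hedged sketch of how the return values could be checked (cudaError_t, cudaSuccess, and cudaGetErrorString are part of the CUDA runtime API; the check helper and variable names are my own)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call did not succeed
void check(cudaError_t err, const char *what) {
  if (err != cudaSuccess) {
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
}

int main() {
  const int N = 1 << 20;
  double *h_Matrix = new double[N];
  for (int i = 0; i < N; i++) h_Matrix[i] = i;

  double *d_Matrix;
  check(cudaMalloc(&d_Matrix, sizeof(double) * N), "cudaMalloc");
  check(cudaMemcpy(d_Matrix, h_Matrix, sizeof(double) * N, cudaMemcpyHostToDevice), "cudaMemcpy H2D");
  // ... some computation on GPU ...
  check(cudaMemcpy(h_Matrix, d_Matrix, sizeof(double) * N, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");

  check(cudaFree(d_Matrix), "cudaFree");
  delete[] h_Matrix;
  return 0;
}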
CUDA syntax

__global__ void saxpy(float *x, float *y, float alpha) {
  int i = threadIdx.x;
  y[i] = alpha*x[i] + y[i];
}

int main() {
  ...
  saxpy<<<1, N>>>(x, y, alpha);
  ...
}

• Launch configuration: <<<#blocks, threads_per_block>>>, i.e., <<<grid size, block size>>>; here, 1 block of N threads
• What happens when the array size (N) > 1024? (the sketch below shows the failure; the following slides show the fix)
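• One way to see the answer empirically (a sketch, not from the slides): a launch requesting more than 1024 threads per block fails, and the error can be retrieved with cudaGetLastError right after the launch

#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float *x, float *y, float alpha) {
  int i = threadIdx.x;
  y[i] = alpha * x[i] + y[i];
}

int main() {
  const int N = 4096;              // more threads than a single block allows
  float *x, *y;
  cudaMalloc(&x, N * sizeof(float));
  cudaMalloc(&y, N * sizeof(float));

  saxpy<<<1, N>>>(x, y, 2.0f);     // asks for 4096 threads in one block -> exceeds the limit

  // The launch itself fails; cudaGetLastError reports why
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));  // e.g. "invalid configuration argument"

  cudaFree(x);
  cudaFree(y);
  return 0;
}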
Compiling CUDA code

nvcc -o saxpy --generate-code arch=compute_80,code=sm_80 saxpy.cu
./saxpy

• The arch/code values select the target GPU architecture (compute_80/sm_80 is Ampere, e.g. the A100); choose values compatible with the compute capability of the GPU you run on
Multiple blocks
__global__ void saxpy(float *x, float *y, float alpha, int N) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < N)
    y[i] = alpha*x[i] + y[i];
}

int main() {
  ...
  int threadsPerBlock = 512;
  int numBlocks = N/threadsPerBlock + (N % threadsPerBlock != 0);
  saxpy<<<numBlocks, threadsPerBlock>>>(x, y, alpha, N);
  ...
}
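• Putting the pieces together: a complete, runnable sketch (my own consolidation of the slide snippets, with initialization and a simple correctness check added) that follows the three steps: copy in, launch, copy back

#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float *x, float *y, float alpha, int N) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < N)                        // guard the extra threads in the last block
    y[i] = alpha * x[i] + y[i];
}

int main() {
  const int N = 1 << 20;
  const float alpha = 2.0f;

  // Host arrays
  float *h_x = new float[N];
  float *h_y = new float[N];
  for (int i = 0; i < N; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

  // Device arrays
  float *d_x, *d_y;
  cudaMalloc(&d_x, N * sizeof(float));
  cudaMalloc(&d_y, N * sizeof(float));

  // Step 1: copy input data from host to device memory
  cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, h_y, N * sizeof(float), cudaMemcpyHostToDevice);

  // Step 2: launch the kernel with enough blocks to cover all N elements
  int threadsPerBlock = 512;
  int numBlocks = N / threadsPerBlock + (N % threadsPerBlock != 0);
  saxpy<<<numBlocks, threadsPerBlock>>>(d_x, d_y, alpha, N);

  // Step 3: copy the results back to host memory
  cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

  // Every element should now be alpha*1.0 + 2.0 = 4.0
  printf("y[0] = %f, y[N-1] = %f\n", h_y[0], h_y[N - 1]);

  cudaFree(d_x); cudaFree(d_y);
  delete[] h_x; delete[] h_y;
  return 0;
}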
Questions?