MODULE FOUR:
GPU PROGRAMMING
Dr. Volker Weinberg | LRZ
MODULE OVERVIEW
OpenACC Directives
Multicore CPU vs GPU
Introduction to GPU Data Management
CUDA Managed Memory
GPU Profiling with Nsight Systems
CPU VS GPU
Number of cores and parallelism
Both are extremely popular parallel processors, but with different degrees of parallelism
CPUs generally have a small number of very fast physical cores
GPUs have thousands of simple cores able to achieve high performance in aggregate
Both require parallelism to be fully utilized, but GPUs require much more
CPU + GPU WORKFLOW
[Diagram: Application code is split between CPU and GPU. The compute-intensive functions (a small % of the code, but a large % of the runtime) run on the GPU; the rest of the sequential code runs on the CPU.]
GPU PROGRAMMING IN OPENACC
Execution always begins and ends on the host CPU
Compute-intensive loops are offloaded to the GPU using directives
Offloading may or may not require data movement between the host and device

int main(){
  <sequential code>

  #pragma acc kernels   /* <-- compiler hint */
  {
    <parallel code>
  }
}
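A minimal, self-contained sketch of this pattern (an illustrative example, not taken from the course labs): the arrays are set up in sequential host code, and the compute loop is offered to the compiler with the kernels directive.

#include <stdio.h>
#include <stdlib.h>

int main(void){
    int n = 1 << 20;
    double *x = (double*) malloc(n * sizeof(double));
    double *y = (double*) malloc(n * sizeof(double));

    /* sequential code: runs on the host CPU */
    for(int i = 0; i < n; i++){
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* parallel code: the compiler may offload this loop to the GPU */
    #pragma acc kernels
    for(int i = 0; i < n; i++){
        y[i] = 2.0 * x[i] + y[i];
    }

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}

The compiler may either infer the necessary data transfers itself or rely on managed memory, as discussed later in this module.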
CPU + GPU
Physical Diagram
CPU memory is larger; GPU memory has more bandwidth
CPU and GPU memory are usually separate, connected by an I/O bus (traditionally PCI-e)
Any data transferred between the CPU and GPU will be handled by the I/O bus
The I/O bus is relatively slow compared to memory bandwidth
The GPU cannot perform computation until the data is within its memory

[Diagram: CPU cores with shared cache and high-capacity memory, connected over the I/O bus to GPU cores with shared cache and high-bandwidth memory]
BASIC DATA MANAGEMENT
Between the host and device
The host is traditionally a CPU
The device is some parallel accelerator
When our target hardware is multicore, the host and device are the same, meaning that their memory is also the same
There is no need to explicitly manage data when using a shared-memory accelerator, such as the multicore target
BASIC DATA MANAGEMENT
Between the host and device
When the target hardware is a GPU, data will usually need to migrate between CPU and GPU memory
The next lecture will discuss OpenACC data management; for now we'll assume a unified host/accelerator memory
CUDA MANAGED MEMORY
Commonly referred to as “unified memory”
Simplified Developer Effort
CPU and GPU memories are combined into a single, shared pool

[Diagram: without Managed Memory, system memory and GPU memory are separate pools; with Managed Memory, both appear as a single managed pool]
CUDA MANAGED MEMORY
Usefulness
Handling explicit data transfers between the host and device (CPU and GPU) can be difficult
The PGI compiler can utilize CUDA Managed Memory to defer data management
This allows the developer to concentrate on parallelism and think about data movement as an optimization

$ pgcc -fast -acc -ta=tesla:managed -Minfo=accel main.c
$ pgfortran -fast -acc -ta=tesla:managed -Minfo=accel main.f90
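As a rough sketch of what this enables (illustrative code, not from the lab; with the managed sub-option, dynamically allocated data is placed in CUDA Managed Memory), an OpenACC region can then use heap arrays without any explicit data clauses:

#include <stdio.h>
#include <stdlib.h>

int main(void){
    int n = 1 << 20;
    /* with -ta=tesla:managed this allocation ends up in managed memory */
    double *a = (double*) malloc(n * sizeof(double));

    for(int i = 0; i < n; i++) a[i] = 1.0;   /* first touched on the host */

    /* no data clauses: the runtime migrates the data when the GPU touches it */
    #pragma acc parallel loop
    for(int i = 0; i < n; i++)
        a[i] = 2.0 * a[i];

    printf("a[0] = %f\n", a[0]);             /* migrated back on host access */
    free(a);
    return 0;
}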
MANAGED MEMORY
Limitations
The programmer will almost always be able to get better performance by manually handling data transfers
Memory allocation/deallocation takes longer with managed memory
Data cannot be transferred asynchronously
Currently only available from PGI on NVIDIA GPUs
OPENACC WITH MANAGED MEMORY
An Example from the Lab Code
while ( error > tol && iter < iter_max )
{
    error = 0.0;
    #pragma acc kernels
    {
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];
            }
        }
    }
}

Without Managed Memory the compiler must determine the size of A and Anew and copy their data to and from the GPU each iteration to ensure correctness
With Managed Memory the underlying runtime will move the data only when needed
INTRODUCTION TO DATA CLAUSES
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy
Data clauses allow the programmer to tell the compiler which data to move and when
Data clauses may be added to kernels or parallel regions, but also to data, enter data, and exit data, which will be discussed shortly

C/C++
#pragma acc kernels
for(int i = 0; i < N; i++){
    a[i] = 0;
}
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

C/C++
#pragma acc parallel loop copyout(a[0:N])
for(int i = 0; i < N; i++){
    a[i] = 0;
}

I don’t need the initial value of a, so I’ll only copy it out of the region at the end.
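As a brief preview of the data construct mentioned above (a sketch with a hypothetical array a; data regions are covered in detail in the next lecture), a structured data region lets a single pair of transfers serve several compute regions:

#pragma acc data copy(a[0:N])
{
    /* 'a' is copied to the GPU once, at the start of the data region */
    #pragma acc parallel loop
    for(int i = 0; i < N; i++)
        a[i] = 0;

    #pragma acc parallel loop
    for(int i = 0; i < N; i++)
        a[i] = a[i] + 1;

    /* ...and copied back once, when the region ends */
}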
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

Allocate ‘a’ on GPU → Copy ‘a’ from CPU to GPU → Execute Kernels → Copy ‘a’ from GPU to CPU → Deallocate ‘a’ from GPU

#pragma acc parallel loop copy(a[0:N])
for(int i = 0; i < N; i++){
    a[i] = 2 * a[i];
}

[Diagram: ‘a’ starts in CPU memory, is copied to GPU memory, is updated on the GPU, and the updated values are copied back to CPU memory]
DATA CLAUSES
copy( list )     Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
                 Principal use: for many important data structures in your code, this is a logical default to input, modify and return the data.

copyin( list )   Allocates memory on the GPU and copies data from host to GPU when entering the region.
                 Principal use: think of this like an array that you would use as just an input to a subroutine.

copyout( list )  Allocates memory on the GPU and copies data to the host when exiting the region.
                 Principal use: a result that isn’t overwriting the input data structure.

create( list )   Allocates memory on the GPU but does not copy.
                 Principal use: temporary arrays.
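For example (a sketch with hypothetical arrays a, b, and c, not from the lab code), a simple vector operation can state exactly which direction each array needs to move; a create clause would be used instead for a device-only scratch array:

/* inputs are only copied in, the result is only copied out */
#pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
for(int i = 0; i < N; i++){
    c[i] = 2.0 * a[i] + b[i];
}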
ARRAY SHAPING
Sometimes the compiler needs help understanding the shape of an array
The first number is the start index of the array
In C/C++, the second number is how much data is to be transferred
In Fortran, the second number is the ending index
copy(array[starting_index:length]) C/C++
copy(array(starting_index:ending_index)) Fortran
BASIC DATA MANAGEMENT
Multi-dimensional Array shaping
copy(array[0:N][0:M]) C/C++
copy(array(1:N, 1:M)) Fortran
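A short sketch of the C form (an illustrative example using a C99 variable-length array parameter, not from the lab code):

void zero(int n, int m, double array[n][m]){
    /* transfer the full n x m array to the GPU and back */
    #pragma acc parallel loop copy(array[0:n][0:m])
    for(int j = 0; j < n; j++){
        for(int i = 0; i < m; i++){
            array[j][i] = 0.0;
        }
    }
}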
PROFILING GPU CODE
PROFILING GPU CODE (PGI)
Obtaining information about your GPU
Using the pgaccelinfo command will display information about available accelerators
Each device is numbered starting with 0
The Device Name identifies the type of accelerator
Can Managed Memory be used?
What compiler options should be used to target this device?

Terminal Window
$ pgaccelinfo
Device Number: 0
Device Name: Tesla P100-PCIE-16GB
...
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60

Without Managed Memory
$ pgcc -ta=tesla:cc60 main.c

With Managed Memory
$ pgcc -ta=tesla:cc60,managed main.c
COMPILING GPU CODE
Terminal Window
$ pgcc -fast -ta=tesla:cc60 -Minfo=accel jacobi.c laplace2d.c
calcNext:
     37, Generating copy(Anew[:m*n],A[:m*n])
         Accelerator kernel generated
         Generating Tesla code
     37, Generating reduction(max:error)
     38, #pragma acc loop gang /* blockIdx.x */
     41, #pragma acc loop vector(128) /* threadIdx.x */
     41, Loop is parallelizable
swap:
     56, Generating copy(Anew[:m*n],A[:m*n])
         Accelerator kernel generated
         Generating Tesla code
     57, #pragma acc loop gang /* blockIdx.x */
     60, #pragma acc loop vector(128) /* threadIdx.x */
     60, Loop is parallelizable

We can see that our data copies are being applied by the compiler (“Generating copy(...)”)
We also see that the compiler is generating code for our GPU (“Generating Tesla code”)
The “acc loop gang” line shows the parallelization of the outer loop, and the “acc loop vector(128)” line shows the parallelization of the inner loop
PROFILING GPU CODE (NSIGHT SYSTEMS)
Using nsys to profile GPU code
▪ Nsight Systems presents far more information when running on a GPU
▪ It is capable of capturing information about CUDA execution in the profiled process
▪ In the Timeline view, you can see all the information about kernels and memory movements (expand the CUDA row)
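A typical invocation looks like the following (an illustrative command line; the exact flags and the report file extension can differ between Nsight Systems versions):

Terminal Window
$ nsys profile --trace=cuda,openacc -o laplace ./a.out
$ nsys stats laplace.nsys-rep

The resulting report can also be opened in the Nsight Systems GUI to inspect the timeline discussed below.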
PROFILING GPU CODE (NSIGHT SYSTEMS)
Using nsys to profile GPU code
▪ Kernels: These are our computational functions. We can see our calcNext and swap functions
▪ MemCpy(HtoD): This includes data transfers from the Host to the Device (CPU to GPU)
▪ MemCpy(DtoH): These are data transfers from the Device to the Host (GPU to CPU)
PROFILING GPU CODE
Receiving unexpected performance results

Here we can see the runtime of our application: 151 seconds
The program is now performing over 3 times worse than the sequential version
A profiler can help us understand why this performance is worse

Terminal Window
$ pgcc -ta=tesla:cc60 jacobi.c laplace2d.c
$ ./a.out
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 151.772627 s
PROFILING GPU CODE
Inspecting the Nsight Systems timeline
▪ Let’s focus on the data movement (Memory row)
▪ At first glance, it looks like our program is spending a significant amount of time transferring data between the host and device
▪ We also see that the compute regions are very small and spread out
▪ What if we try Managed Memory?
PROFILING GPU CODE
Using managed memory
Using managed memory drastically improves performance
This managed memory version is performing over 20x better than the sequential code
What does the profiler tell us about this?

Terminal Window
$ pgcc -ta=tesla:cc60,managed jacobi.c laplace2d.c
$ ./a.out
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 1.474951 s
PROFILING GPU CODE
Using managed memory
▪ The data no longer needs to transfer between each kernel
▪ The data is only moved when it’s first accessed on the GPU or CPU
▪ During the timestepping, data remains on the device
▪ Now a higher percentage of time is spent computing
KEY CONCEPTS
In this module we discussed…
The fundamental differences between CPUs and GPUs
Assisting the compiler by providing information about array sizes for data management
Managed memory
THANK YOU