MODULE FOUR:
GPU PROGRAMMING
Dr. Volker Weinberg | LRZ
MODULE OVERVIEW
OpenACC Directives
Multicore CPU vs GPU
Introduction to GPU Data Management
CUDA Managed Memory
GPU Profiling with Nsight Systems
CPU VS GPU
Number of cores and parallelism
Both are extremely popular parallel processors, but with different degrees of parallelism
CPUs generally have a small number of very fast physical cores
GPUs have thousands of simple cores able to achieve high performance in aggregate
Both require parallelism to be fully utilized, but GPUs require much more
CPU + GPU WORKFLOW
[Diagram: Application code is split between CPU and GPU. The compute-intensive functions (a small % of the code, but a large % of the runtime) run on the GPU; the rest of the sequential code runs on the CPU.]
GPU PROGRAMMING IN OPENACC
Execution always begins and ends on the host CPU
Compute-intensive loops are offloaded to the GPU using directives
Offloading may or may not require data movement between the host and device

int main(){
  <sequential code>

  #pragma acc kernels   /* <-- compiler hint */
  {
    <parallel code>
  }
}
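A minimal, self-contained sketch of this pattern (an illustrative example, not taken from the course labs): the arrays are set up in sequential host code, and the compute loop is offered to the compiler with the kernels directive.

#include <stdio.h>
#include <stdlib.h>

int main(void){
    int n = 1 << 20;
    double *x = (double*) malloc(n * sizeof(double));
    double *y = (double*) malloc(n * sizeof(double));

    /* sequential code: runs on the host CPU */
    for(int i = 0; i < n; i++){
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* parallel code: the compiler may offload this loop to the GPU */
    #pragma acc kernels
    for(int i = 0; i < n; i++){
        y[i] = 2.0 * x[i] + y[i];
    }

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}

The compiler may either infer the necessary data transfers itself or rely on managed memory, as discussed later in this module.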
CPU + GPU
Physical Diagram
CPU memory is larger; GPU memory has more bandwidth
CPU and GPU memory are usually separate, connected by an I/O bus (traditionally PCI-e)
Any data transferred between the CPU and GPU will be handled by the I/O bus
The I/O bus is relatively slow compared to memory bandwidth
The GPU cannot perform computation until the data is within its memory

[Diagram: CPU cores with shared cache and high-capacity memory, connected over the I/O bus to GPU cores with shared cache and high-bandwidth memory]
BASIC DATA MANAGEMENT
Between the host and device
The host is traditionally a CPU
The device is some parallel accelerator
When our target hardware is multicore, the host and device are the same, meaning that their memory is also the same
There is no need to explicitly manage data when using a shared-memory accelerator, such as the multicore target
BASIC DATA MANAGEMENT
Between the host and device
When the target hardware is a GPU, data will usually need to migrate between CPU and GPU memory
The next lecture will discuss OpenACC data management; for now we'll assume a unified host/accelerator memory
CUDA MANAGED MEMORY
Commonly referred to as “unified memory”
Simplified Developer Effort
CPU and GPU memories are combined into a single, shared pool

[Diagram: without Managed Memory, system memory and GPU memory are separate pools; with Managed Memory, both appear as a single managed pool]
CUDA MANAGED MEMORY
Usefulness
Handling explicit data transfers between the host and device (CPU and GPU) can be difficult
The PGI compiler can utilize CUDA Managed Memory to defer data management
This allows the developer to concentrate on parallelism and think about data movement as an optimization

$ pgcc -fast -acc -ta=tesla:managed -Minfo=accel main.c
$ pgfortran -fast -acc -ta=tesla:managed -Minfo=accel main.f90
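As a rough sketch of what this enables (illustrative code, not from the lab; with the managed sub-option, dynamically allocated data is placed in CUDA Managed Memory), an OpenACC region can then use heap arrays without any explicit data clauses:

#include <stdio.h>
#include <stdlib.h>

int main(void){
    int n = 1 << 20;
    /* with -ta=tesla:managed this allocation ends up in managed memory */
    double *a = (double*) malloc(n * sizeof(double));

    for(int i = 0; i < n; i++) a[i] = 1.0;   /* first touched on the host */

    /* no data clauses: the runtime migrates the data when the GPU touches it */
    #pragma acc parallel loop
    for(int i = 0; i < n; i++)
        a[i] = 2.0 * a[i];

    printf("a[0] = %f\n", a[0]);             /* migrated back on host access */
    free(a);
    return 0;
}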
MANAGED MEMORY
Limitations
The programmer will almost always be able to get better performance by manually handling data transfers
Memory allocation/deallocation takes longer with managed memory
Data cannot be transferred asynchronously
Currently only available from PGI on NVIDIA GPUs
OPENACC WITH MANAGED MEMORY
An Example from the Lab Code
while ( error > tol && iter < iter_max )
{
    error = 0.0;
    #pragma acc kernels
    {
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];
            }
        }
    }
}

Without Managed Memory the compiler must determine the size of A and Anew and copy their data to and from the GPU each iteration to ensure correctness
With Managed Memory the underlying runtime will move the data only when needed
INTRODUCTION TO DATA CLAUSES
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy
Data clauses allow the programmer to tell the compiler which data to move and when
Data clauses may be added to kernels or parallel regions, but also to data, enter data, and exit data, which will be discussed shortly

C/C++
#pragma acc kernels
for(int i = 0; i < N; i++){
    a[i] = 0;
}
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

C/C++
#pragma acc parallel loop copyout(a[0:N])
for(int i = 0; i < N; i++){
    a[i] = 0;
}

I don’t need the initial value of a, so I’ll only copy it out of the region at the end.
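As a brief preview of the data construct mentioned above (a sketch with a hypothetical array a; data regions are covered in detail in the next lecture), a structured data region lets a single pair of transfers serve several compute regions:

#pragma acc data copy(a[0:N])
{
    /* 'a' is copied to the GPU once, at the start of the data region */
    #pragma acc parallel loop
    for(int i = 0; i < N; i++)
        a[i] = 0;

    #pragma acc parallel loop
    for(int i = 0; i < N; i++)
        a[i] = a[i] + 1;

    /* ...and copied back once, when the region ends */
}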
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

Allocate ‘a’ on GPU → Copy ‘a’ from CPU to GPU → Execute Kernels → Copy ‘a’ from GPU to CPU → Deallocate ‘a’ from GPU

#pragma acc parallel loop copy(a[0:N])
for(int i = 0; i < N; i++){
    a[i] = 2 * a[i];
}

[Diagram: ‘a’ starts in CPU memory, is copied to GPU memory, is updated on the GPU, and the updated values are copied back to CPU memory]
DATA CLAUSES
copy( list )     Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
                 Principal use: for many important data structures in your code, this is a logical default to input, modify and return the data.

copyin( list )   Allocates memory on the GPU and copies data from host to GPU when entering the region.
                 Principal use: think of this like an array that you would use as just an input to a subroutine.

copyout( list )  Allocates memory on the GPU and copies data to the host when exiting the region.
                 Principal use: a result that isn’t overwriting the input data structure.

create( list )   Allocates memory on the GPU but does not copy.
                 Principal use: temporary arrays.
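For example (a sketch with hypothetical arrays a, b, and c, not from the lab code), a simple vector operation can state exactly which direction each array needs to move; a create clause would be used instead for a device-only scratch array:

/* inputs are only copied in, the result is only copied out */
#pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
for(int i = 0; i < N; i++){
    c[i] = 2.0 * a[i] + b[i];
}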
ARRAY SHAPING
Sometimes the compiler needs help understanding the shape of an array
The first number is the start index of the array
In C/C++, the second number is how much data is to be transferred
In Fortran, the second number is the ending index
copy(array[starting_index:length]) C/C++
copy(array(starting_index:ending_index)) Fortran
BASIC DATA MANAGEMENT
Multi-dimensional Array shaping
copy(array[0:N][0:M]) C/C++
copy(array(1:N, 1:M)) Fortran
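A short sketch of the C form (an illustrative example using a C99 variable-length array parameter, not from the lab code):

void zero(int n, int m, double array[n][m]){
    /* transfer the full n x m array to the GPU and back */
    #pragma acc parallel loop copy(array[0:n][0:m])
    for(int j = 0; j < n; j++){
        for(int i = 0; i < m; i++){
            array[j][i] = 0.0;
        }
    }
}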
PROFILING GPU CODE
PROFILING GPU CODE (PGI)
Obtaining information about your GPU
Using the pgaccelinfo command will display information about available accelerators
Each device is numbered starting with 0
The Device Name identifies the type of accelerator
Can Managed Memory be used?
What compiler options should be used to target this device?

Terminal Window
$ pgaccelinfo
Device Number: 0
Device Name: Tesla P100-PCIE-16GB
...
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60

Without Managed Memory
$ pgcc -ta=tesla:cc60 main.c

With Managed Memory
$ pgcc -ta=tesla:cc60,managed main.c
COMPILING GPU CODE
Terminal Window
$ pgcc -fast -ta=tesla:cc60 -Minfo=accel jacobi.c laplace2d.c
calcNext:
     37, Generating copy(Anew[:m*n],A[:m*n])
         Accelerator kernel generated
         Generating Tesla code
     37, Generating reduction(max:error)
     38, #pragma acc loop gang /* blockIdx.x */
     41, #pragma acc loop vector(128) /* threadIdx.x */
     41, Loop is parallelizable
swap:
     56, Generating copy(Anew[:m*n],A[:m*n])
         Accelerator kernel generated
         Generating Tesla code
     57, #pragma acc loop gang /* blockIdx.x */
     60, #pragma acc loop vector(128) /* threadIdx.x */
     60, Loop is parallelizable

We can see that our data copies are being applied by the compiler (“Generating copy(...)”)
We also see that the compiler is generating code for our GPU (“Generating Tesla code”)
The “acc loop gang” line shows the parallelization of the outer loop, and the “acc loop vector(128)” line shows the parallelization of the inner loop
PROFILING GPU CODE (NSIGHT SYSTEMS)
Using nsys to profile GPU code
▪ Nsight Systems presents far more information when running on a GPU
▪ It is capable of capturing information about CUDA execution in the profiled process
▪ In the Timeline view, you can see all the information about kernels and memory movements (expand the CUDA row)
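A typical invocation looks like the following (an illustrative command line; the exact flags and the report file extension can differ between Nsight Systems versions):

Terminal Window
$ nsys profile --trace=cuda,openacc -o laplace ./a.out
$ nsys stats laplace.nsys-rep

The resulting report can also be opened in the Nsight Systems GUI to inspect the timeline discussed below.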
PROFILING GPU CODE (NSIGHT SYSTEMS)
Using nsys to profile GPU code
▪ Kernels: These are our computational functions. We can see our calcNext and swap functions
▪ MemCpy(HtoD): This includes data transfers from the Host to the Device (CPU to GPU)
▪ MemCpy(DtoH): These are data transfers from the Device to the Host (GPU to CPU)
PROFILING GPU CODE
Receiving unexpected performance results

Here we can see the runtime of our application: 151 seconds
The program is now performing over 3 times worse than the sequential version
A profiler can help us understand why this performance is worse

Terminal Window
$ pgcc -ta=tesla:cc60 jacobi.c laplace2d.c
$ ./a.out
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 151.772627 s
PROFILING GPU CODE
Inspecting the Nsight Systems timeline
▪ Let’s focus on the data movement (Memory row)
▪ At first glance, it looks like our program is spending a significant amount of time transferring data between the host and device
▪ We also see that the compute regions are very small and spread out
▪ What if we try Managed Memory?
PROFILING GPU CODE
Using managed memory
Using managed memory drastically improves performance
This managed memory version is performing over 20x better than the sequential code
What does the profiler tell us about this?

Terminal Window
$ pgcc -ta=tesla:cc60,managed jacobi.c laplace2d.c
$ ./a.out
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 1.474951 s
PROFILING GPU CODE
Using managed memory
▪ The data no longer needs to transfer between each kernel
▪ The data is only moved when it’s first accessed on the GPU or CPU
▪ During the timestepping, data remains on the device
▪ Now a higher percentage of time is spent computing
KEY CONCEPTS
In this module we discussed…
The fundamental differences between CPUs and GPUs
Assisting the compiler by providing information about array sizes for data management
Managed memory
THANK YOU