
UNIT-IV

Parallel Programming with OpenACC
Three ways to accelerate applications on GPU
OpenACC
What is OpenACC?
• OpenACC (for Open Accelerators) is a programming standard for parallel computing on accelerators (mostly NVIDIA GPUs).
• A set of compiler directives that allow code regions to be offloaded from a host CPU to a GPU
• High-level GPU programming
• Similar to OpenMP directives
• Works for Fortran, C, and C++
• Portable across different platforms and compilers
• It is designed to simplify GPU programming.
• The basic approach is to insert special comments (directives) into the code so as to offload computation onto GPUs and parallelize the code at the level of GPU (CUDA) cores.
• It is possible for programmers to create an efficient parallel OpenACC code with only minor modifications to a serial CPU code.
OpenACC
• The OpenACC Application Programming Interface (API) provides a set of compiler directives,
library routines, and environment variables that can be used to write data-parallel FORTRAN, C,
and C++ programs that run on accelerator devices, including GPUs.
• It is an extension to the host language.
• The OpenACC specification was initially developed by the Portland Group (PGI), Cray Inc., and
NVIDIA, with support from CAPS enterprise.

OpenACC Task Granularity
• Gang – block
• Worker – warp
• Vector – thread

Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
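As a concrete illustration, here is a minimal C sketch (the array c and the bounds N and M are chosen only for illustration) that maps a loop nest onto gangs, workers, and vector lanes:

#include <stdio.h>

#define N 1024
#define M 512

int main(void) {
    static float c[N][M];   /* statically sized so the compiler knows the array shape */

    /* Outer loop distributed across gangs; inner loop shared by workers and vector lanes. */
    #pragma acc parallel loop gang num_gangs(256) vector_length(128) copyout(c)
    for (int i = 0; i < N; i++) {
        #pragma acc loop worker vector
        for (int j = 0; j < M; j++) {
            c[i][j] = (float)(i + j);   /* one element per vector lane */
        }
    }

    printf("c[10][10] = %f\n", c[10][10]);
    return 0;
}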
Levels of Parallelism

Key concepts
• Vector: threads work in SIMT fashion
    Individual tasks that are executed in parallel on the GPU
    Threads are organized into warps, which are groups of 32 threads each
    All threads within a warp execute together in lockstep
• Worker: groups of threads that can be scheduled and executed on a streaming multiprocessor (SM) within the GPU
• Gang: workers are organized into gangs; multiple gangs work independently
OpenACC
What are compiler directives?
• The directives tell the compiler or runtime to ……
• Generate parallel code for GPU
• Allocate GPU memory and copy input data
• Execute parallel code on GPU
• Copy output data to CPU and deallocate GPU memory

OpenACC Directive syntax
• C
• #pragma acc directive-name [clause [[,] clause]…]
  …often followed by a structured code block

OpenACC Syntax

• #pragma - Gives instructions to the compiler on how to compile the code


• acc - Informs the compiler that code is to be executed using OpenACC
• directives - Commands in OpenACC for altering our code
• clauses - Specifiers or additions to directives

Execution Model

• Program runs on the host CPU
• Host offloads compute-intensive regions (kernels) and related data to the accelerator GPU
• Compute kernels are executed by the GPU
What are compiler directives?
❑ The directives tell the compiler or runtime to ……
✔ Generate parallel code for GPU
✔ Allocate GPU memory and copy input data to GPU
✔ Execute parallel code on GPU
✔ Copy output data to CPU and deallocate GPU memory

❑ The first OpenACC directive: kernels
✔ ask the compiler to generate a GPU code
✔ let the compiler determine safe parallelism and data transfer

// ... serial code ...
#pragma acc kernels
for (int i = 0; i < n; i++) {
    // ... parallel code ...
}
// ... serial code ...
The first OpenACC program
❑ Example: Compute a*x + y, where x and y are vectors, and a is a scalar.

C
int main(int argc, char **argv){
    int N = 1000;
    float a = 3.0f;
    float x[N], y[N];

    for (int i = 0; i < N; ++i) {
        x[i] = 2.0f;
        y[i] = 1.0f;
    }

    #pragma acc kernels
    for (int i = 0; i < N; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
Data dependency
❑ The loop is not parallelized if there is a data dependency. For example,

#pragma acc kernels
for (int i = 0; i < N-1; i++) {
    x[i] = a * x[i+1];
}

❑ The compiling output:

……
14, Loop carried dependence of x-> prevents parallelization
    Loop carried backward dependence of x-> prevents vectorization
    Accelerator scalar kernel generated
    Loop carried backward dependence of x-> prevents vectorization

❑ The compiler creates a serial program, which runs slower on GPU than on CPU!
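One common way to remove such a dependence, shown here as a minimal sketch that assumes a second array y of the same length is available, is to write the results into a separate output array so every iteration becomes independent:

#pragma acc kernels
for (int i = 0; i < N-1; i++) {
    y[i] = a * x[i+1];   /* no loop-carried dependence: the compiler can now parallelize */
}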
OPENACC VERSUS CUDA C
• One big difference between OpenACC and CUDA C is the use of compiler directives in OpenACC
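For instance, the same vector update can be sketched in both models (the kernel name, launch configuration, and host-side calls below are illustrative, not taken from a particular source):

/* CUDA C: the programmer writes an explicit kernel and manages memory and launches. */
__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n) y[i] = a * x[i] + y[i];
}
/* Host side (sketch): cudaMalloc(...) and cudaMemcpy(...) to move data, then
   saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
   followed by cudaMemcpy(...) back and cudaFree(...). */

/* OpenACC: the computation stays as ordinary loop code plus a directive; the
   compiler generates the kernel, the launch, and the data transfers. */
#pragma acc kernels
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}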

#include<stdio.h>
int main()
{
#pragma acc kernels
for(int i = 0; i < 5; i++)
{
printf("Hello World!\n");
}
}

❑ An accelerator kernel is generated. The loop computation is offloaded to the (Tesla) GPU and is parallelized.

❑ The keywords copy and copyin are involved with data transfer.
• The loop directive parallelizes the for loop and also accepts other OpenACC clauses, for example copyin and copyout here (a minimal sketch follows below).
• The example above needs two vectors copied to the GPU and one vector copied back to the CPU.
• copyin will create the memory on the GPU and transfer the data from CPU to GPU.
• copyout will create the memory on the GPU and transfer the data from GPU to CPU.
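A minimal sketch of such a loop (the arrays x, y, z and the length n are assumed for illustration): x and y are the two inputs copied to the GPU, and z is the result copied back.

#pragma acc parallel loop copyin(x[0:n], y[0:n]) copyout(z[0:n])
for (int i = 0; i < n; i++) {
    z[i] = x[i] + y[i];
}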

PGI Compiler basics
• The command to compile C code is ‘pgcc’
• The command to compile C++ code is ‘pgc++’
• The command to compile Fortran code is ‘pgfortran’

$ pgcc main.c
$ pgc++ main.cpp
$ pgfortran main.f90
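To enable OpenACC directive processing and report what the compiler generated for each loop, the PGI compilers also accept the -acc and -Minfo=accel options; a typical invocation (reusing main.c from above) would be:

$ pgcc -acc -Minfo=accel main.c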

OpenACC Directives for parallelism
Two different approaches for defining parallel regions:
kernels
• Defines a region to be translated into a series of kernels to be executed in sequence on an accelerator
• Work-sharing parallelism is defined automatically for the separate kernels
parallel
• Defines a region to be executed on an accelerator
• Work-sharing parallelism has to be defined manually

With similar work sharing, both can perform equally well (a minimal sketch of each follows below).
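A minimal sketch of the same loop written both ways (arrays a and b of length n are assumed for illustration):

/* kernels: the compiler analyzes the region and decides how to parallelize it. */
#pragma acc kernels
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
}

/* parallel: the programmer asserts the region is parallel and marks the loop explicitly. */
#pragma acc parallel loop
for (int i = 0; i < n; i++)
    a[i] = 2.0f * b[i];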


• OpenACC is not GPU programming
• OpenACC is expressing the parallelism in your existing code
• OpenACC can be used with both NVIDIA and AMD GPUs
• OpenACC will enable programmers to easily develop portable applications that maximize the performance and power-efficiency benefits of the hybrid CPU/GPU architecture of Titan.
• OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others.

Compilers and directives
• OpenACC is supported by the NVIDIA, PGI, GCC, and HPE Cray (Fortran only) compilers
• PGI is now part of NVIDIA, and its compilers are available through the NVIDIA HPC SDK
• Compute constructs:
  • parallel and kernels
• Loop constructs:
  • loop, collapse, gang, worker, vector, etc.
• Data management clauses:
  • copy, create, copyin, copyout, delete, and present
• Others:
  • reduction, atomic, cache, etc. (a reduction sketch follows below)
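For example, a minimal sketch of the reduction clause (an array x of length n is assumed), which lets many gangs and vector lanes safely accumulate into a single scalar:

float sum = 0.0f;
#pragma acc parallel loop reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += x[i];   /* partial sums are combined by the compiler and runtime */
}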

• OpenACC is designed to be portable across computing architectures
• At its core, OpenACC supports offloading of both computation and data from a host device to an accelerator device.
• These devices may be the same, or they may be completely different architectures, as in the case of a CPU host and a GPU accelerator.
• The two devices may also have separate memory spaces or a single memory space.
• In the case that the two devices have different memories, the OpenACC compiler and runtime will analyze the code and handle any accelerator memory management and the transfer of data between host and device memory.

The OpenACC Accelerator Model

EXECUTION MODEL
• The OpenACC target machine has a host and an attached accelerator device, such as a GPU.
• Most accelerator devices can support multiple levels of parallelism.
• Figure 15.2 illustrates a typical accelerator that supports three levels of parallelism.
• At the outermost coarse-grain level, there are multiple execution units. Within each execution unit,
there are multiple threads.
• At the innermost level, each thread is capable of executing vector operations.
• Currently, OpenACC does not assume any synchronization capability on the accelerator, except for
thread forking and joining.
• Once work is distributed among the execution units, they will execute in parallel from start to
finish.
• Similarly, once work is distributed among the threads within an execution unit, the threads execute
in parallel.

• An OpenACC program starts its execution on the host single-threaded (Figure 15.3).
• When the host thread encounters a parallel or a kernels construct, a parallel region or a kernels
region that comprises all the code enclosed in the construct is created and launched on the
accelerator device.
• The parallel region or kernels region can optionally execute asynchronously with the host thread
and join with the host thread at a future synchronization point.
• The parallel region is executed entirely on the accelerator device.

• The kernels region may contain a sequence of kernels, each of which is executed on the
accelerator device.
• The kernel execution follows a fork-join model. A group of gangs are used to execute each kernel.
• A group of workers can be forked to execute a parallel work-sharing loop that belongs to a gang.
• The workers are disbanded when the loop is done.
• Typically a gang executes on one execution unit, and a worker runs on one thread within an
execution unit.
• The programmer can instruct how the work within a parallel region or a kernels region is to be distributed among the different levels of parallelism on the accelerator, as in the sketch below.
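A minimal sketch of steering that distribution with loop clauses (the matrix m and the bounds rows and cols are assumed for illustration):

#pragma acc kernels loop gang
for (int i = 0; i < rows; i++) {
    /* a group of workers is forked to share this loop within each gang ... */
    #pragma acc loop worker
    for (int j = 0; j < cols; j++) {
        m[i][j] = 0.0f;
    }
    /* ... and the workers are disbanded when the loop is done */
}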

MEMORY MODEL
• In the OpenACC memory model, the host memory and the device memory are treated as separate.
• It is assumed that the host is not able to access device memory directly and the device is not able to access host memory directly.
• This is to ensure that the OpenACC programming model can support a wide range of accelerator devices, including most of the current GPUs that do not have the capability of unified memory access between GPUs and CPUs.
• The unified virtual addressing and GPUDirect introduced by NVIDIA in CUDA 4.0 allow a single virtual address space for both host memory and device memory and allow direct cross-device memory access between different GPUs. However, direct access across host and device memory is still not possible.

• Just like in CUDA C/C++, in OpenACC input data needs to be transferred from the host to the device before kernel launches, and result data needs to be transferred back from the device to the host.
• However, unlike in CUDA C/C++, where programmers need to explicitly code data movement through API calls, in OpenACC they can just annotate which memory objects need to be transferred, as shown by line 4 in Figure 15.1.
• The OpenACC compiler will automatically generate code for memory allocation, copying, and de-allocation.
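Figure 15.1 is not reproduced here, but a minimal sketch of that style of annotation (arrays x and y of length n are assumed) uses a structured data region; the compiler then generates one allocation, the needed host-to-device and device-to-host copies, and the final de-allocation:

#pragma acc data copyin(x[0:n]) copy(y[0:n])
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = 2.0f * x[i];

    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = y[i] + x[i];
}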

• OpenACC adopts a fairly weak consistency model for memory on the accelerator device.
• Although data on the accelerator can be shared by all execution units, OpenACC does not provide a reliable way
to allow one execution unit to consume the data produced by another execution unit.
• There are two reasons for this.
• First, recall OpenACC does not provide any mechanism for synchronization between execution units.
• Second, memories between different execution units are not coherent.
• Although some hardware provides instructions to explicitly invalidate and update cache, they are not exposed at
the OpenACC level.
• Therefore, in OpenACC, different execution units are expected to work on disjoint memory sets.
• Threads within an execution unit can also share memory and threads have coherent memory.
• However, OpenACC currently only mandates a memory fence at the thread fork and join, which are also the only
synchronizations OpenACC provides for threads.
• While the device memory model may appear very limiting, it is not so in practice. For data-race-free OpenACC data-parallel applications, the weak memory model works quite well.

• OpenACC provides two different approaches for exposing parallelism in the code:
parallel and kernels regions.

The Kernels Construct
• The kernels construct identifies a region of code that may contain parallelism, but relies
on the automatic parallelization capabilities of the compiler to analyze the region, identify
which loops are safe to parallelize, and then accelerate those loops.

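The code figure for this example is not reproduced here; a minimal sketch consistent with the description below (array names a, b, c and length n are assumed) is:

#pragma acc kernels
{
    /* first candidate loop: initialize two arrays */
    for (int i = 0; i < n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    /* second candidate loop: a simple calculation on them */
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}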
• In this example the code initializes two arrays and then performs a simple calculation on them.
• The code contains two candidate loops for acceleration.
• The compiler will analyze these loops for data independence and parallelize both loops by generating an accelerator kernel for each.
• The compiler is given complete freedom to determine how best to map the parallelism available in these loops to the hardware, meaning that we will be able to use this same code regardless of the accelerator we are building for.
• The compiler will use its own knowledge of the target accelerator to choose the best path for acceleration.
• One caution about the kernels directive, however, is that if the compiler cannot be certain that a loop is data independent, it will not parallelize the loop.

Routine Directive
• Function or subroutine calls within parallel loops can be problematic for compilers, since it’s not always possible for the compiler to see all of the loops at one time.
• OpenACC 1.0 compilers were forced to either inline all routines called within parallel regions or
not parallelize loops containing routine calls at all.
• OpenACC 2.0 introduced the routine directive to address this shortcoming.
• The routine directive gives the compiler the necessary information about the function or subroutine
and the loops it contains in order to parallelize the calling parallel region.
• The routine directive must be added to a function definition informing the compiler of the level of
parallelism used within the routine.
• OpenACC’s levels of parallelism will be discussed in a later section.

1. The routine directive enables function execution on the GPU.
2. Functions need explicit directives to be used inside parallel or kernels regions.
3. The execution level (seq, vector, worker, gang) determines how the function is executed.
4. Ensure functions are compiled with OpenACC support for GPU execution.

Routine Directive
• The routine directive in OpenACC is used to specify how a function should be compiled
for execution on the accelerator.
• It allows functions to be called within parallel regions or kernels, enabling better
performance on GPUs.
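A minimal sketch of the directive in use (the function name scale, the helper apply_scale, and the arrays are assumed for illustration; acc routine seq tells the compiler the function contains no further parallelism):

/* Compile scale() for the accelerator so it can be called inside a parallel loop. */
#pragma acc routine seq
float scale(float v, float a) {
    return a * v;
}

void apply_scale(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = scale(x[i], a);   /* device-side call to the routine */
    }
}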

UNIT 4
ENDS!!!
