MODULE THREE:
OPENACC DIRECTIVES
Dr. Volker Weinberg | LRZ
MODULE OVERVIEW
OpenACC Directives
The parallel directive
The kernels directive
The loop directive
Fundamental differences between the kernels and parallel directive
Expressing parallelism in OpenACC
OPENACC SYNTAX
OPENACC SYNTAX
Syntax for using OpenACC directives in code
C/C++
#pragma acc directive clauses
<code>

Fortran
!$acc directive clauses
<code>
A pragma in C/C++ gives instructions to the compiler on how to compile the code.
Compilers that do not understand a particular pragma can freely ignore it.
A directive in Fortran is a specially formatted comment that likewise instructs the
compiler in its compilation of the code and can likewise be freely ignored.
“acc” informs the compiler that what follows is an OpenACC directive
Directives are commands in OpenACC for altering our code.
Clauses are specifiers or additions to directives.
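For illustration, a minimal sketch of a directive with a clause (num_gangs is just one example of an OpenACC clause and is not covered further in this module):
// "parallel" is the directive; "num_gangs(4)" is a clause requesting four gangs
#pragma acc parallel num_gangs(4)
{
  <code>
}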
OPENACC PARALLEL DIRECTIVE
OPENACC PARALLEL DIRECTIVE
Explicit programming
[Diagram: CPU code offloading a parallel region to the parallel hardware]
The parallel directive instructs the compiler to create parallel gangs on the accelerator
Gangs are independent groups of worker threads on the accelerator
The code contained within a parallel directive is executed redundantly by all parallel gangs

<sequential code>
#pragma acc parallel
{
  <sequential code>
}
OPENACC PARALLEL DIRECTIVE
Expressing parallelism
#pragma acc parallel
{
  ...
}

When encountering the parallel directive, the compiler will generate 1 or more parallel gangs, which execute redundantly.
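A minimal sketch of this redundant execution (assuming a PGI/NVHPC-style compiler where printf is supported inside compute regions): the message below is printed once by every gang, not once in total.
#include <stdio.h>
int main(void)
{
  #pragma acc parallel
  {
    printf("Hello from a gang\n"); /* executed redundantly by each gang */
  }
  return 0;
}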
OPENACC PARALLEL DIRECTIVE
Expressing parallelism
#pragma acc parallel
{
  for(int i = 0; i < N; i++)
  {
    // Do Something
  }
}

This loop will be executed redundantly on each gang
OPENACC PARALLEL DIRECTIVE
Expressing parallelism
#pragma acc parallel
{
  for(int i = 0; i < N; i++)
  {
    // Do Something
  }
}

This means that each gang will execute the entire loop
OPENACC PARALLEL DIRECTIVE
Parallelizing a single loop
C/C++
#pragma acc parallel
{
  #pragma acc loop
  for(int i = 0; i < N; i++)
    a[i] = 0;
}

Fortran
!$acc parallel
!$acc loop
do i = 1, N
  a(i) = 0
end do
!$acc end parallel

Use a parallel directive to mark a region of code where you want parallel execution to occur
This parallel region is marked by curly braces in C/C++ or a start and end directive in Fortran
The loop directive is used to instruct the compiler to parallelize the iterations of the next loop to run across the parallel gangs
OPENACC PARALLEL DIRECTIVE
Parallelizing a single loop
C/C++
#pragma acc parallel loop
for(int i = 0; i < N; i++)
  a[i] = 0;

Fortran
!$acc parallel loop
do i = 1, N
  a(i) = 0
end do

This pattern is so common that you can do all of this in a single line of code
In this example, the parallel loop directive applies to the next loop
This directive both marks the region for parallel execution and distributes the iterations of the loop.
When applied to a loop with a data dependency, parallel loop may produce incorrect results
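For instance (a minimal sketch, using the same dependent loop that reappears in the compiler feedback examples at the end of this module), each iteration reads the value written by the previous one, so splitting the iterations across gangs can produce wrong values:
// Loop-carried dependence: iteration i reads a[i-1], which iteration i-1 writes.
// Forcing parallelization with parallel loop may therefore give incorrect results.
#pragma acc parallel loop
for(int i = 1; i < N; i++)
  a[i] = a[i-1] + a[i];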
OPENACC PARALLEL DIRECTIVE
Expressing parallelism
#pragma acc parallel
{
  #pragma acc loop
  for(int i = 0; i < N; i++)
  {
    // Do Something
  }
}

The loop directive informs the compiler which loops to parallelize.
OPENACC PARALLEL DIRECTIVE
Parallelizing many loops
#pragma acc parallel loop
for(int i = 0; i < N; i++)
  a[i] = 0;

#pragma acc parallel loop
for(int j = 0; j < M; j++)
  b[j] = 0;

To parallelize multiple loops, each loop should be accompanied by a parallel directive
Each parallel loop can have different loop boundaries and loop optimizations
Each parallel loop can be parallelized in a different way
This is the recommended way to parallelize multiple loops. Attempting to parallelize multiple loops within the same parallel region may give performance issues or unexpected results
OPENACC LOOP DIRECTIVE
OPENACC LOOP DIRECTIVE
Expressing parallelism
C/C++
#pragma acc loop
for(int i = 0; i < N; i++)
  // Do something

Fortran
!$acc loop
do i = 1, N
  ! Do something
end do

Marks a single for loop for parallelization
Allows the programmer to give additional information and/or optimizations about the loop
Provides many different ways to describe the type of parallelism to apply to the loop
Must be contained within an OpenACC compute region (either a kernels or a parallel region) to parallelize loops
OPENACC LOOP DIRECTIVE
Inside of a parallel compute region
#pragma acc parallel
{
  for(int i = 0; i < N; i++)
    a[i] = 0;

  #pragma acc loop
  for(int j = 0; j < N; j++)
    a[j]++;
}

In this example, the first loop is not marked with the loop directive
This means that the loop will be “redundantly parallelized”
Redundant parallelization, in this case, means that the loop will be run in its entirety, multiple times, by the parallel hardware
The second loop is marked with the loop directive, meaning that the loop iterations will be properly split across the parallel hardware
OPENACC LOOP DIRECTIVE
Inside of a kernels compute region
#pragma acc kernels
{
  #pragma acc loop
  for(int i = 0; i < N; i++)
    a[i] = 0;

  #pragma acc loop
  for(int j = 0; j < M; j++)
    b[j] = 0;
}

With the kernels directive, the loop directive is implied
The programmer can still explicitly mark loops with the loop directive; however, this could affect the optimizations the compiler makes
The loop directive is not needed, but does allow the programmer to optimize the loops themselves
OPENACC LOOP DIRECTIVE
Parallelizing loop nests
C/C++
#pragma acc parallel loop
for(int i = 0; i < N; i++){
  #pragma acc loop
  for(int j = 0; j < M; j++){
    a[i][j] = 0;
  }
}

Fortran
!$acc parallel loop
do i = 1, N
  !$acc loop
  do j = 1, M
    a(i,j) = 0
  end do
end do

You are able to include multiple loop directives to parallelize multi-dimensional loop nests
On some parallel hardware, this will allow you to express more levels of parallelism, and increase performance further
Other parallel hardware has difficulties expressing enough parallelism for multi-dimensional loops
In this case, inner loop directives may be ignored
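As a hedged sketch of what expressing more levels of parallelism can look like (the gang and vector clauses shown here are illustrative assumptions, only hinted at by the compiler feedback later in this module):
// Sketch only: map the outer loop to gangs and the inner loop to vector lanes.
// Without these hints the compiler chooses its own mapping.
#pragma acc parallel loop gang
for(int i = 0; i < N; i++){
  #pragma acc loop vector
  for(int j = 0; j < M; j++){
    a[i][j] = 0;
  }
}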
OPENACC KERNELS DIRECTIVE
OPENACC KERNELS DIRECTIVE
Compiler directed parallelization
[Diagram: CPU code offloading the kernels region to the parallel hardware]
The kernels directive instructs the compiler to search for parallel loops in the code
The compiler will analyze the loops and parallelize those it finds safe and profitable
The kernels directive can be applied to regions containing multiple loop nests

<sequential code>
#pragma acc kernels
{
  <for loop>
  <for loop>
}
OPENACC KERNELS DIRECTIVE
Parallelizing a single loop
C/C++
#pragma acc kernels
for(int i = 0; i < N; i++)
  a[i] = 0;

Fortran
!$acc kernels
do i = 1, N
  a(i) = 0
end do
!$acc end kernels

In this example, the kernels directive applies to the next for loop
The compiler will take the loop, and attempt to parallelize it on the parallel hardware
The compiler will also attempt to optimize the loop
If the compiler decides that the loop is not parallelizable, it will not parallelize the loop
OPENACC KERNELS DIRECTIVE
Parallelizing many loops
C/C++
#pragma acc kernels
{
  for(int i = 0; i < N; i++)
    a[i] = 0;

  for(int j = 0; j < M; j++)
    b[j] = 0;
}

Fortran
!$acc kernels
do i = 1, N
  a(i) = 0
end do
do j = 1, M
  b(j) = 0
end do
!$acc end kernels

In this example, we mark a region of code with the kernels directive
The kernels region is defined by the curly braces in C/C++, and the !$acc kernels and !$acc end kernels in Fortran
The compiler will attempt to parallelize all loops within the kernels region
Each loop can be parallelized/optimized in a different way
EXPRESSING PARALLELISM
Compiler generated parallelism
#pragma acc kernels
{
  for(int i = 0; i < N; i++)
  {
    // Do Something
  }
  for(int i = 0; i < M; i++)
  {
    // Do Something Else
  }
}

With the kernels directive, the loop directive is implied.
EXPRESSING PARALLELISM
Compiler generated parallelism
#pragma acc kernels
{
  for(int i = 0; i < N; i++)
  {
    // Do Something
  }
  for(int i = 0; i < M; i++)
  {
    // Do Something Else
  }
}

Each loop can have a different number of gangs, and those gangs can be organized/optimized completely differently.
This process can happen multiple times within the kernels region.
OPENACC KERNELS DIRECTIVE
Fortran array syntax
!$acc kernels
a(:) = 1
b(:) = 2
c(:) = a(:) + b(:)
!$acc end kernels

!$acc parallel loop
c(:) = a(:) + b(:)

One advantage that the kernels directive has over the parallel directive is Fortran array syntax
The parallel directive must be paired with the loop directive, and the loop directive does not recognize the array syntax as a loop
The kernels directive can correctly parallelize the array syntax
KERNELS VS PARALLEL
Kernels
Compiler decides what to parallelize, with direction from user
Compiler guarantees correctness
Can cover multiple loop nests

Parallel
Programmer decides what to parallelize and communicates that to the compiler (programmer based parallelization, optimizations and restrictions)
Programmer guarantees correctness
Must decorate each loop nest
When fully optimized, both will give similar performance.
COMPILING PARALLEL CODE
COMPILING PARALLEL CODE (PGI)
CODE
7: #pragma acc parallel loop
8: for(int i = 0; i < N; i++)
9: a[i] = 0;
COMPILING
$ pgcc -fast -acc -ta=multicore -Minfo=accel main.c
FEEDBACK
main:
7, Generating Multicore code
8, #pragma acc loop gang
COMPILING PARALLEL CODE (PGI)
CODE
7: #pragma acc kernels
8: for(int i = 0; i < N; i++)
9: a[i] = 0;
COMPILING
$ pgcc -fast -acc -ta=multicore -Minfo=accel main.c
FEEDBACK
main:
8, Loop is parallelizable
Generating Multicore code
8, #pragma acc loop gang
COMPILING PARALLEL CODE (PGI)
CODE
7: #pragma acc kernels
8: for(int i = 1; i < N; i++)
9: a[i] = a[i-1] + a[i];
Non-parallel loop
COMPILING
$ pgcc -fast -acc -ta=multicore -Minfo=accel main.c
FEEDBACK
main:
8, Loop carried dependence of a-> prevents parallelization
Loop carried backward dependence of a-> prevents vectorization
COMPILING PARALLEL CODE (PGI)
CODE
7: #pragma acc parallel loop
8: for(int i = 1; i < N; i++)
9: a[i] = a[i-1] + a[i];
Non-parallel loop
COMPILING
$ pgcc -fast -acc -ta=multicore -Minfo=accel main.c
FEEDBACK
main:
7, Generating Multicore code
8, #pragma acc loop gang
Note the contrast with the previous kernels example: because parallel loop makes the programmer responsible for correctness, the compiler parallelizes this dependent loop anyway, and the result may be incorrect.
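The examples above target a multicore CPU. As an aside (a hedged sketch; the exact target option depends on your PGI/NVHPC installation and GPU), the same source can be compiled for an NVIDIA GPU by changing the target accelerator flag:
$ pgcc -fast -acc -ta=tesla -Minfo=accel main.c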
KEY CONCEPTS
By the end of this module, you should now understand
The parallel, kernels, and loop directives
The key differences in functionality and use between the kernels and parallel
directives
When and where to include loop directives
How the parallel and kernels directives conceptually generate parallelism
THANK YOU
OPENACC RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow
Resources: [Link]
Success Stories: [Link]
Compilers and Tools (free compilers available): [Link]
Events: [Link]