
UNIT-IV

Parallel Programming with OpenACC
Three ways to accelerate applications on GPU
OpenACC
What is OpenACC?
• OpenACC (for Open Accelerators) is a programming standard for parallel computing on accelerators (mostly NVIDIA GPUs).
• A set of compiler directives that allow code regions to be offloaded from a host CPU to a GPU
• High-level GPU programming
• Similar to OpenMP directives
• Works for Fortran, C, and C++
• Portable across different platforms and compilers
• It is designed to simplify GPU programming.
• The basic approach is to insert special comments (directives) into the code so as to offload computation onto GPUs and parallelize the code at the level of GPU (CUDA) cores.
• It is possible for programmers to create an efficient parallel OpenACC code with only minor modifications to a serial CPU code.
OpenACC
• The OpenACC Application Programming Interface (API) provides a set of compiler directives,
library routines, and environment variables that can be used to write data-parallel FORTRAN, C,
and C++ programs that run on accelerator devices, including GPUs.
• It is an extension to the host language.
• The OpenACC specification was initially developed by the Portland Group (PGI), Cray Inc., and
NVIDIA, with support from CAPS enterprise.

OpenACC Task Granularity
• Gang – block
• Worker – warp
• Vector – thread

Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
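As a concrete illustration, here is a minimal C sketch (the array c and the bounds N and M are chosen only for illustration) that maps a loop nest onto gangs, workers, and vector lanes:

#include <stdio.h>

#define N 1024
#define M 512

int main(void) {
    static float c[N][M];   /* statically sized so the compiler knows the array shape */

    /* Outer loop distributed across gangs; inner loop shared by workers and vector lanes. */
    #pragma acc parallel loop gang num_gangs(256) vector_length(128) copyout(c)
    for (int i = 0; i < N; i++) {
        #pragma acc loop worker vector
        for (int j = 0; j < M; j++) {
            c[i][j] = (float)(i + j);   /* one element per vector lane */
        }
    }

    printf("c[10][10] = %f\n", c[10][10]);
    return 0;
}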
Levels of Parallelism

Key concepts
• Vector: threads work in SIMT fashion
    Individual tasks that are executed in parallel on the GPU
    Threads are organized into warps, which are groups of 32 threads each
    All threads within a warp execute together in lockstep
• Worker: groups of threads that can be scheduled and executed on a streaming multiprocessor (SM) within the GPU
• Gang: workers are organized into gangs; multiple gangs work independently
OpenACC
What are compiler directives?
• The directives tell the compiler or runtime to ……
• Generate parallel code for GPU
• Allocate GPU memory and copy input data
• Execute parallel code on GPU
• Copy output data to CPU and deallocate GPU memory

OpenACC Directive syntax
• C
• #pragma acc directive-name [clause [[,] clause]…]
  …often followed by a structured code block

OpenACC Syntax

• #pragma - Gives instructions to the compiler on how to compile the code


• acc - Informs the compiler that code is to be executed using OpenACC
• directives - Commands in OpenACC for altering our code
• clauses - Specifiers or additions to directives

Execution Model

• Program runs on the host CPU
• Host offloads compute-intensive regions (kernels) and related data to the accelerator GPU
• Compute kernels are executed by the GPU
What are compiler directives?
❑ The directives tell the compiler or runtime to ……
✔ Generate parallel code for GPU
✔ Allocate GPU memory and copy input data to GPU
✔ Execute parallel code on GPU
✔ Copy output data to CPU and deallocate GPU memory

❑ The first OpenACC directive: kernels
✔ ask the compiler to generate a GPU code
✔ let the compiler determine safe parallelism and data transfer

// ... serial code ...
#pragma acc kernels
for (int i = 0; i < n; i++) {
    // ... parallel code ...
}
// ... serial code ...
The first OpenACC program
❑ Example: Compute a*x + y, where x and y are vectors, and a is a scalar.

C
int main(int argc, char **argv){
    int N = 1000;
    float a = 3.0f;
    float x[N], y[N];

    for (int i = 0; i < N; ++i) {
        x[i] = 2.0f;
        y[i] = 1.0f;
    }

    #pragma acc kernels
    for (int i = 0; i < N; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
Data dependency
❑ The loop is not parallelized if there is a data dependency. For example,

#pragma acc kernels
for (int i = 0; i < N-1; i++) {
    x[i] = a * x[i+1];
}

❑ The compiling output:

……
14, Loop carried dependence of x-> prevents parallelization
    Loop carried backward dependence of x-> prevents vectorization
    Accelerator scalar kernel generated
    Loop carried backward dependence of x-> prevents vectorization

❑ The compiler creates a serial program, which runs slower on GPU than on CPU!
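One common way to remove such a dependence, shown here as a minimal sketch that assumes a second array y of the same length is available, is to write the results into a separate output array so every iteration becomes independent:

#pragma acc kernels
for (int i = 0; i < N-1; i++) {
    y[i] = a * x[i+1];   /* no loop-carried dependence: the compiler can now parallelize */
}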
OPENACC VERSUS CUDA C
• One big difference between OpenACC and CUDA C is the use of compiler directives in OpenACC
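For instance, the same vector update can be sketched in both models (the kernel name, launch configuration, and host-side calls below are illustrative, not taken from a particular source):

/* CUDA C: the programmer writes an explicit kernel and manages memory and launches. */
__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n) y[i] = a * x[i] + y[i];
}
/* Host side (sketch): cudaMalloc(...) and cudaMemcpy(...) to move data, then
   saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
   followed by cudaMemcpy(...) back and cudaFree(...). */

/* OpenACC: the computation stays as ordinary loop code plus a directive; the
   compiler generates the kernel, the launch, and the data transfers. */
#pragma acc kernels
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}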

#include<stdio.h>
int main()
{
#pragma acc kernels
for(int i = 0; i < 5; i++)
{
printf("Hello World!\n");
}
}

❑ An accelerator kernel is generated. The loop computation is offloaded to the (Tesla) GPU and is parallelized.

❑ The keywords copy and copyin are involved with data transfer.
• The loop directive parallelizes the for loop and also accepts other OpenACC clauses, for example copyin and copyout here (a minimal sketch follows below).
• The example above needs two vectors copied to the GPU and one vector copied back to the CPU.
• copyin will create the memory on the GPU and transfer the data from CPU to GPU.
• copyout will create the memory on the GPU and transfer the data from GPU to CPU.
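A minimal sketch of such a loop (the arrays x, y, z and the length n are assumed for illustration): x and y are the two inputs copied to the GPU, and z is the result copied back.

#pragma acc parallel loop copyin(x[0:n], y[0:n]) copyout(z[0:n])
for (int i = 0; i < n; i++) {
    z[i] = x[i] + y[i];
}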

PGI Compiler basics
• The command to compile C code is ‘pgcc’
• The command to compile C++ code is ‘pgc++’
• The command to compile Fortran code is ‘pgfortran’

$ pgcc main.c
$ pgc++ main.cpp
$ pgfortran main.f90
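To enable OpenACC directive processing and report what the compiler generated for each loop, the PGI compilers also accept the -acc and -Minfo=accel options; a typical invocation (reusing main.c from above) would be:

$ pgcc -acc -Minfo=accel main.c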

OpenACC Directives for parallelism
Two different approaches for defining parallel regions:
kernels
• Defines a region to be translated into a series of kernels to be executed in sequence on an accelerator
• Work-sharing parallelism is defined automatically for the separate kernels
parallel
• Defines a region to be executed on an accelerator
• Work-sharing parallelism has to be defined manually

With similar work sharing, both can perform equally well (a minimal sketch of each follows below).
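A minimal sketch of the same loop written both ways (arrays a and b of length n are assumed for illustration):

/* kernels: the compiler analyzes the region and decides how to parallelize it. */
#pragma acc kernels
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
}

/* parallel: the programmer asserts the region is parallel and marks the loop explicitly. */
#pragma acc parallel loop
for (int i = 0; i < n; i++)
    a[i] = 2.0f * b[i];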


• OpenACC is not GPU programming
• OpenACC is expressing the parallelism in your existing code
• OpenACC can be used with both NVIDIA and AMD GPUs
• OpenACC will enable programmers to easily develop portable applications that maximize the performance and power-efficiency benefits of the hybrid CPU/GPU architecture of Titan.
• OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others.

Compilers and directives
• OpenACC is supported by the NVIDIA, PGI, GCC, and HPE Cray (Fortran only) compilers
• PGI is now part of NVIDIA, and its compilers are available through the NVIDIA HPC SDK
• Compute constructs:
  • parallel and kernels
• Loop constructs:
  • loop, collapse, gang, worker, vector, etc.
• Data management clauses:
  • copy, create, copyin, copyout, delete, and present
• Others:
  • reduction, atomic, cache, etc. (a reduction sketch follows below)
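For example, a minimal sketch of the reduction clause (an array x of length n is assumed), which lets many gangs and vector lanes safely accumulate into a single scalar:

float sum = 0.0f;
#pragma acc parallel loop reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += x[i];   /* partial sums are combined by the compiler and runtime */
}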

• OpenACC is designed to be portable across computing architectures
• At its core, OpenACC supports offloading of both computation and data from a host device to an accelerator device.
• These devices may be the same, or they may be completely different architectures, as in the case of a CPU host and a GPU accelerator.
• The two devices may also have separate memory spaces or a single memory space.
• In the case that the two devices have different memories, the OpenACC compiler and runtime will analyze the code and handle any accelerator memory management and the transfer of data between host and device memory.

The OpenACC Accelerator Model

EXECUTION MODEL
• The OpenACC target machine has a host and an attached accelerator device, such as a GPU.
• Most accelerator devices can support multiple levels of parallelism.
• Figure 15.2 illustrates a typical accelerator that supports three levels of parallelism.
• At the outermost coarse-grain level, there are multiple execution units. Within each execution unit,
there are multiple threads.
• At the innermost level, each thread is capable of executing vector operations.
• Currently, OpenACC does not assume any synchronization capability on the accelerator, except for
thread forking and joining.
• Once work is distributed among the execution units, they will execute in parallel from start to
finish.
• Similarly, once work is distributed among the threads within an execution unit, the threads execute
in parallel.

• An OpenACC program starts its execution on the host single-threaded (Figure 15.3).
• When the host thread encounters a parallel or a kernels construct, a parallel region or a kernels
region that comprises all the code enclosed in the construct is created and launched on the
accelerator device.
• The parallel region or kernels region can optionally execute asynchronously with the host thread
and join with the host thread at a future synchronization point.
• The parallel region is executed entirely on the accelerator device.

• The kernels region may contain a sequence of kernels, each of which is executed on the
accelerator device.
• The kernel execution follows a fork-join model. A group of gangs are used to execute each kernel.
• A group of workers can be forked to execute a parallel work-sharing loop that belongs to a gang.
• The workers are disbanded when the loop is done.
• Typically a gang executes on one execution unit, and a worker runs on one thread within an
execution unit.
• The programmer can instruct how the work within a parallel region or a kernels region is to be distributed among the different levels of parallelism on the accelerator, as in the sketch below.
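A minimal sketch of steering that distribution with loop clauses (the matrix m and the bounds rows and cols are assumed for illustration):

#pragma acc kernels loop gang
for (int i = 0; i < rows; i++) {
    /* a group of workers is forked to share this loop within each gang ... */
    #pragma acc loop worker
    for (int j = 0; j < cols; j++) {
        m[i][j] = 0.0f;
    }
    /* ... and the workers are disbanded when the loop is done */
}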

MEMORY MODEL
• In the OpenACC memory model, the host memory and the device memory are treated as separate.
• It is assumed that the host is not able to access device memory directly and the device is not able to access host memory directly.
• This is to ensure that the OpenACC programming model can support a wide range of accelerator devices, including most of the current GPUs that do not have the capability of unified memory access between GPUs and CPUs.
• The unified virtual addressing and GPUDirect introduced by NVIDIA in CUDA 4.0 allow a single virtual address space for both host memory and device memory and allow direct cross-device memory access between different GPUs. However, direct access across host and device memory is still not possible.

• Just like in CUDA C/C++, in OpenACC input data needs to be transferred from the host to the device before kernel launches, and result data needs to be transferred back from the device to the host.
• However, unlike in CUDA C/C++, where programmers need to explicitly code data movement through API calls, in OpenACC they can just annotate which memory objects need to be transferred, as shown by line 4 in Figure 15.1.
• The OpenACC compiler will automatically generate code for memory allocation, copying, and de-allocation.
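Figure 15.1 is not reproduced here, but a minimal sketch of that style of annotation (arrays x and y of length n are assumed) uses a structured data region; the compiler then generates one allocation, the needed host-to-device and device-to-host copies, and the final de-allocation:

#pragma acc data copyin(x[0:n]) copy(y[0:n])
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = 2.0f * x[i];

    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = y[i] + x[i];
}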

• OpenACC adopts a fairly weak consistency model for memory on the accelerator device.
• Although data on the accelerator can be shared by all execution units, OpenACC does not provide a reliable way
to allow one execution unit to consume the data produced by another execution unit.
• There are two reasons for this.
• First, recall OpenACC does not provide any mechanism for synchronization between execution units.
• Second, memories between different execution units are not coherent.
• Although some hardware provides instructions to explicitly invalidate and update cache, they are not exposed at
the OpenACC level.
• Therefore, in OpenACC, different execution units are expected to work on disjoint memory sets.
• Threads within an execution unit can also share memory and threads have coherent memory.
• However, OpenACC currently only mandates a memory fence at the thread fork and join, which are also the only
synchronizations OpenACC provides for threads.
• While the device memory model may appear very limiting, it is not so in practice. For data-race-free OpenACC data-parallel applications, the weak memory model works quite well.

• OpenACC provides two different approaches for exposing parallelism in the code:
parallel and kernels regions.

The Kernels Construct
• The kernels construct identifies a region of code that may contain parallelism, but relies
on the automatic parallelization capabilities of the compiler to analyze the region, identify
which loops are safe to parallelize, and then accelerate those loops.

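The code figure for this example is not reproduced here; a minimal sketch consistent with the description below (array names a, b, c and length n are assumed) is:

#pragma acc kernels
{
    /* first candidate loop: initialize two arrays */
    for (int i = 0; i < n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    /* second candidate loop: a simple calculation on them */
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}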
• In this example the code initializes two arrays and then performs a simple calculation on them.
• The code contains two candidate loops for acceleration.
• The compiler will analyze these loops for data independence and parallelize both loops by generating an accelerator kernel for each.
• The compiler is given complete freedom to determine how best to map the parallelism available in these loops to the hardware, meaning that we will be able to use this same code regardless of the accelerator we are building for.
• The compiler will use its own knowledge of the target accelerator to choose the best path for acceleration.
• One caution about the kernels directive, however, is that if the compiler cannot be certain that a loop is data independent, it will not parallelize the loop.

Routine Directive
• Function or subroutine calls within parallel loops can be problematic for compilers, since it’s not always possible for the compiler to see all of the loops at one time.
• OpenACC 1.0 compilers were forced to either inline all routines called within parallel regions or
not parallelize loops containing routine calls at all.
• OpenACC 2.0 introduced the routine directive to address this shortcoming.
• The routine directive gives the compiler the necessary information about the function or subroutine
and the loops it contains in order to parallelize the calling parallel region.
• The routine directive must be added to a function definition informing the compiler of the level of
parallelism used within the routine.
• OpenACC’s levels of parallelism will be discussed in a later section.

1. The routine directive enables function execution on the GPU.
2. Functions need explicit directives to be used inside parallel or kernels regions.
3. The execution level (seq, vector, worker, gang) determines how the function is executed.
4. Ensure functions are compiled with OpenACC support for GPU execution.

Routine Directive
• The routine directive in OpenACC is used to specify how a function should be compiled
for execution on the accelerator.
• It allows functions to be called within parallel regions or kernels, enabling better
performance on GPUs.
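A minimal sketch of the directive in use (the function name scale, the helper apply_scale, and the arrays are assumed for illustration; acc routine seq tells the compiler the function contains no further parallelism):

/* Compile scale() for the accelerator so it can be called inside a parallel loop. */
#pragma acc routine seq
float scale(float v, float a) {
    return a * v;
}

void apply_scale(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = scale(x[i], a);   /* device-side call to the routine */
    }
}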

UNIT 4
ENDS!!!
