A Compiler’s View of OpenMP

Johannes Doerfert, Argonne National Laboratory


About Me

● PhD in CS from Saarland University, Saarbrücken, Germany
● Researcher at Argonne National Laboratory (ANL), Chicago, USA
● Code owner for OpenMP offloading in LLVM (officially) since recently
● Active in the LLVM community since 2014, in the OpenMP community since 2018

[Photo: me in Zurich]
Background

LLVM in a Nutshell

● open (source/community/...)
● extensible, “fixable”
● portable (GPUs, CPUs, …)
● C++/OpenMP/SYCL/HIP/CUDA/… feature complete 😉 [😉 eventually]
● early access to *the coolest* features
● performant and correct ;)

(Graphic: thanks to Ryan Houdek)
LLVM/Clang 101

file.c → (clang) → LLVM IR → (opt) → LLVM IR → (llc) → Machine IR (MIR) → Machine Code

Slide originally by Eric Christopher and Johannes Doerfert [Link]


OpenMP in LLVM
Johannes Doerfert, jdoerfert@[Link], Argonne National Lab
(The following slides were originally presented at the LLVM Dev Meeting 2020 [Link])

Clang
● OpenMP Parser → OpenMP Sema → OpenMP CodeGen

Flang
● OpenMP Parser → OpenMP Sema → OpenMP CodeGen

OpenMP runtimes
● libomp.so (classic, host)
● libomptarget + plugins (offloading, host)
● libomptarget-nvptx (offloading, device)

OpenMPIRBuilder
● frontend-independent OpenMP LLVM-IR generation
● favors simple and expressive LLVM-IR
● reusable for non-OpenMP parallelism

OpenMPOpt
● interprocedural optimization pass
● contains host & device optimizations
● run with -O2 and -O3 since LLVM 11
OpenMP Implementation & Optimization

Use default(firstprivate), or default(none) + firstprivate(...), for (almost) all values!
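A minimal sketch of that advice (made-up function and variable names; default(firstprivate) on parallel is OpenMP 5.1 syntax):

  // Every captured value becomes an explicit, private copy.
  void scale(double *A, int n, double factor) {
    #pragma omp parallel for default(firstprivate)
    for (int i = 0; i < n; ++i)   // the loop variable is private anyway
      A[i] *= factor;             // A, n, and factor are firstprivate copies
  }

  // Equivalent, with default(none) forcing every captured value to be listed:
  void scale_explicit(double *A, int n, double factor) {
    #pragma omp parallel for default(none) firstprivate(A, n, factor)
    for (int i = 0; i < n; ++i)
      A[i] *= factor;
  }

Writes through the firstprivate pointer A still reach the shared array; only the pointer value itself is copied.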
LLVM’s OpenMP-Aware Optimizations
Towards OpenMP-aware compiler optimizations

● LLVM “knows” about OpenMP API and (internal) runtime calls, incl. their potential effects (e.g., they won’t throw exceptions).
● LLVM performs “high-level” optimizations, e.g., parallel region merging, and various GPU-specific optimizations late.
● Some LLVM/Clang “optimizations” remain, but we are in the process of removing them: simple frontend, smart middle-end.

(OpenMPOpt: interprocedural optimization pass, contains host & device optimizations, run with -O2 and -O3 since LLVM 11.)
Optimization Remarks
Example: OpenMP runtime call deduplication

  double *A = malloc(size * omp_get_thread_limit());
  double *B = malloc(size * omp_get_thread_limit());
  #pragma omp parallel
  do_work(A, B);

OpenMP runtime calls with the same return values can be merged into a single call.

  $ clang -g -O2 deduplicate.c -fopenmp -Rpass=openmp-opt

  deduplicate.[Link] remark: OpenMP runtime call omp_get_thread_limit moved to deduplicate.[Link] [-Rpass=openmp-opt]
    double *B = malloc(size*omp_get_thread_limit());
  deduplicate.[Link] remark: OpenMP runtime call omp_get_thread_limit deduplicated [-Rpass=openmp-opt]
    double *A = malloc(size*omp_get_thread_limit());
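For reference, a self-contained file the command above can be run on; do_work and the surrounding function are made-up scaffolding around the slide's snippet:

  // deduplicate.c
  #include <omp.h>
  #include <stdlib.h>

  void do_work(double *A, double *B);   // assumed to be defined elsewhere

  void prepare(size_t size) {
    double *A = malloc(size * omp_get_thread_limit());
    double *B = malloc(size * omp_get_thread_limit());
    #pragma omp parallel
    do_work(A, B);
  }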
Optimization Remarks
Example: OpenMP Target Scheduling (explained later)

  void bar(void) {
    #pragma omp parallel
    {}
  }

  void foo(void) {
    #pragma omp target teams
    {
      #pragma omp parallel
      {}
      bar();
      #pragma omp parallel
      {}
    }
  }

  clang12 -Rpass=openmp-opt ...

  remark: Found a parallel region that is called in a target region but not part of a combined target construct nor nested inside a target construct without intermediate code. This can lead to excessive register usage for unrelated target regions in the same translation unit due to spurious call edges assumed by ptxas.
  remark: Parallel region is not known to be called from a unique single target region, maybe the surrounding function has external linkage?; will not attempt to rewrite the state machine use.
  remark: Found a parallel region that is called in a target region but not part of a combined target construct nor nested inside a target construct without intermediate code. This can lead to excessive register usage for unrelated target regions in the same translation unit due to spurious call edges assumed by ptxas.
  remark: Specialize parallel region that is only reached from a single target region to avoid spurious call edges and excessive register usage in other target regions. (parallel region ID: __omp_outlined__1_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
  remark: Target region containing the parallel region that is specialized. (parallel region ID: __omp_outlined__1_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
  remark: Found a parallel region that is called in a target region but not part of a combined target construct nor nested inside a target construct without intermediate code. This can lead to excessive register usage for unrelated target regions in the same translation unit due to spurious call edges assumed by ptxas.
  remark: Specialize parallel region that is only reached from a single target region to avoid spurious call edges and excessive register usage in other target regions. (parallel region ID: __omp_outlined__3_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
  remark: Target region containing the parallel region that is specialized. (parallel region ID: __omp_outlined__3_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
  remark: OpenMP GPU kernel __omp_offloading_35_a1e179_foo_l7
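One way to act on the "external linkage" remark above (a sketch, not the only fix): giving bar internal linkage lets the pass see that its parallel region is only reachable from the one surrounding target region, so the state machine rewrite can apply.

  static void bar(void) {
    #pragma omp parallel
    {}
  }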
OpenMP Compile-Time and Runtime Information

● Use OpenMP optimization remarks.
● Optimization remark explanations, examples, FAQs, … are gradually being added to [Link].
● Use LIBOMPTARGET_INFO for runtime library interactions.

  $ clang -O2 generic.c -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o generic
  $ env LIBOMPTARGET_INFO=1 ./generic

  CUDA device 0 info: Device supports up to 65536 CUDA blocks and 1024 threads with a warp size of 32
  CUDA device 0 info: Launching kernel __omp_offloading_fd02_c2a59832_main_l106 with 48 blocks and 128 threads in Generic mode
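A made-up sketch of the kind of generic.c that produces output like the above: the parallel region is not part of a combined target construct, so the kernel is launched in Generic mode.

  // generic.c (hypothetical)
  int main(void) {
    int x = 1;
    #pragma omp target teams firstprivate(x)
    {
      // sequential (per-team) code ...
      #pragma omp parallel firstprivate(x)
      {
        // parallel work ...
      }
      // more sequential (per-team) code ...
    }
    return 0;
  }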
OpenMP Offloading (in LLVM)
Compiling: clang -fopenmp -fopenmp-targets=nvptx64 file.c

Clang Actions

Host toolchain:
  {0} Input        → C file
  {1} Preprocessor → CPP-output
  {2} Compiler     → LLVM IR
  {3} Backend      → Assembler
  {4} Assembler    → Object
  {5} Linker       → Image (host-openmp)

Device toolchain:
  {6} Input        → C file
  {7} Preprocessor → CPP-output
  {8} Compiler     → LLVM IR
  {9} Offload      → LLVM IR
  {10} Backend     → Assembler
  {11} Assembler   → Object
  {12} Linker      → Image (device-openmp)

  {13} Offload Linker: host-openmp Image + device-openmp Image → Fat binary

Slide originally by Jose Monsalve Diaz
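The numbered action graph above is essentially what the Clang driver reports itself; a quick way to inspect it (exact output varies between Clang versions):

  $ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -ccc-print-phases file.c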
OpenMP Offloading
The Tricky Bits: math.h

  #include <math.h>

  #pragma omp begin declare target
  void science(float f) {
    if (signbitf(f)) {
      // some science
    } else {
      // some other science
    }
  }
  #pragma omp end declare target

science can be called from the host and the device.

The host's math.h implements it, e.g., with inline x86 assembly:

  /* Test for negative number. Used in the signbit() macro. */
  __MATH_INLINE int
  __NTH (__signbitf (float __x))
  {
  # ifdef __SSE2_MATH__
    int __m;
    __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
    return (__m & 0x8) != 0;
  # else
    __extension__ union { float __f; int __i; } __u = { __f: __x };
    return __u.__i < 0;
  # endif
  }


OpenMP Offloading
The Tricky Bits: math.h (continued)

GPUs do not provide a math.h, and more importantly, no libm.

  #include <math.h>

  #pragma omp begin declare target
  void science(float f) {
    if (signbitf(f)) {
      // some science
    } else {
      // some other science
    }
  }
  #pragma omp end declare target

LLVM/Clang's "math.h" wrapper for NVPTX (CUDA):

  int __signbitf(float __a) { return __nv_signbitf(__a); }

  #pragma omp begin declare variant match(device={kind(gpu)})
  bool signbit(float __x) { return ::__signbitf(__x); }
  #pragma omp end declare variant

science can be called from the host and the device.
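The same declare variant mechanism is available to user code; a minimal sketch with made-up function names, assuming a GPU-friendly implementation exists:

  float fast_rsqrt(float x);                  // base (host) implementation elsewhere

  #pragma omp begin declare variant match(device = {kind(gpu)})
  float fast_rsqrt(float x) {                 // picked when compiling for a GPU
    return 1.0f / __builtin_sqrtf(x);
  }
  #pragma omp end declare variant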


OpenMP Offloading
The Tricky Bits

Linking
not today 😢
OpenMP Offloading vs Kernel Languages (LLVM/OpenMP)

CUDA kernel launch; every thread in the block executes Func (SPMD mode):

  Func<<</* blocks */ 1, /* threads */ 4>>>(args);

OpenMP target region; sequential code surrounds the parallel region (Generic mode):

  #pragma omp target teams num_teams(1)
  {
    A();
    #pragma omp parallel num_threads(4) default(firstprivate)
    {
      Func(args);
    }
    B();
  }
OpenMP Offloading vs Kernel Languages

  void A() {
    #pragma omp parallel
    Kernel2();
  }

  void B() {
    #pragma omp barrier
  }

  #pragma omp target teams num_teams(1)
  {
    #pragma omp parallel num_threads(4) default(firstprivate)
    {
      if (omp_get_thread_num() == 0)
        A();
      #pragma omp barrier
      Func(args);
      #pragma omp barrier
      if (omp_get_thread_num() == 0)
        B();
    }
  }

SPMD-zation, coming soon!
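Until then, a kernel written as a single combined construct already maps onto SPMD mode, since there is no sequential code around the parallel part (sketch reusing Func/args from above):

  #pragma omp target teams distribute parallel for num_teams(1) num_threads(4)
  for (int i = 0; i < 4; ++i)
    Func(args);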
OpenMP Offloading vs Kernel Languages (simplified)

  #pragma omp target teams num_teams(1)
  {
    A();
    #pragma omp parallel num_threads(4) default(firstprivate)
    {
      Func(args);
    }
    B();
  }
OpenMP Offloading vs Kernel Languages (simplified)
Function Pointer

Q: How do you identify a parallel region?

A: Via the function (pointer) we outlined it into.

Q: Won’t that cause indirect calls and spurious call edges?

A: Yes. That’s why we try to use non-function pointer IDs.


OpenMP Offloading vs Kernel Languages (simplified)

Function pointer (before):

  static void parFn() {
    // parallel function code
  }

  void kernel() {
    if (is_worker()) {
      while (1) {
        fn = __omp_wait_for_parallel();
        fn();
        __omp_inform_parallel_done();
      }
    } else {
      __omp_inform_workers(&parFn, ...);
      parFn();
      __omp_wait_for_workers();
    }
  }

Function ID (after):

  static char parFnId;
  static void parFn() {
    // parallel function code
  }

  void kernel() {
    if (is_worker()) {
      while (1) {
        fn = __omp_wait_for_parallel();
        (fn == &parFnId) ? parFn() : fn();
        __omp_inform_parallel_done();
      }
    } else {
      __omp_inform_workers(&parFnId, ...);
      parFn();
      __omp_wait_for_workers();
    }
  }

Performed since LLVM 12


OpenMP Offloading vs Kernel Languages (simplified)

Without further information, visible() might be called from outside the translation unit, so the function-pointer scheme must stay:

  static void parFn() {
    // parallel function code
  }

  void kernel() {
    if (is_worker()) {
      // ...
    } else {
      visible();
    }
  }

  void visible() {
    __omp_inform_workers(&parFn, ...);
    parFn();
    __omp_wait_for_workers();
  }

With the ompx_no_external_callers assumption, the ID-based specialization applies again:

  static char parFnId;

  #pragma omp begin assumes ompx_no_external_callers
  void visible() {
    __omp_inform_workers(&parFnId, ...);
    parFn();
    __omp_wait_for_workers();
  }
  #pragma omp end assumes

Use optimization remarks to learn about missed opportunities.

LLVM 13 will know more tricks :)
What OpenMP got Wrong
(non-exhaustive list)

All instances where a directive retroactively changes something:

  static int X;

  static int PleaseDont[alignof(X)];

  int* whileWeAreHere(void) { return &X; }

  #pragma omp allocate(X) allocator(...) align(...)

The fixation on syntactic nesting:

  #pragma omp target          // OK
  {
    #pragma omp atomic update
    ++X;
  }

  #pragma omp target teams    // error
  {
    #pragma omp atomic update
    ++X;
  }

  #pragma omp target teams    // the atomic inside foo is fine
  {
    foo();
  }
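The third case spelled out (a sketch): because the restriction is checked syntactically, the very same atomic is accepted once it is hidden behind a call.

  void foo(void) {
    #pragma omp atomic update
    ++X;
  }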
What OpenMP got (kinda) Right
(non-exhaustive list)

The target device abstraction:

● Local and remote GPUs look the same: GPU 0 and GPU 1 next to remote GPU 2-4, GPU 5-7, and GPU 8-10. LLVM 12 provides remote GPUs!
● A virtual GPU can sit next to the real devices: CPU Device 0, Virtual GPU Device 1. LLVM 13 will provide a VGPU :)
● Application + OpenMP on one side, the device on the other: the two worlds only talk through the abstraction (e.g., CUDA underneath).
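Because a device is just a number behind the abstraction, code like the following minimal sketch works the same whether device d is a local GPU, a remote GPU, or the virtual GPU:

  #include <omp.h>

  void run_on_all_devices(void) {
    int ndev = omp_get_num_devices();
    for (int d = 0; d < ndev; ++d) {
      #pragma omp target device(d)
      {
        // work executed on device d
      }
    }
  }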
What’s Next?

LLVM
● More OpenMP-aware optimizations:
  ○ hide memory transfer latencies
  ○ exploit OpenMP domain knowledge
  ○ ask for and utilize user assumptions
● GPU-specific optimizations
● More actionable optimization remarks
● OpenMP 5.1 features
● A new (portable and performant) GPU device runtime (written in OpenMP 5.1!)
● Helpful offloading “devices”:
  ○ VGPU + NewProcess for debugging, or
  ○ JIT for performance
● Host-Device optimizations

OpenMP
❏ OpenMP Interop and dynamic context selector implementations
❏ A community-developed OMPX (header) library (think stdlib for OpenMP)
❏ Function variants shipped via libraries
❏ More powerful assumptions
❏ Less syntactic / more semantic reasoning*
❏ Deprecations*

* I hope
Final Thoughts
(aka. Rambling)
Parallel Worksharing Loops ≠ “Parallel Loops”

  void f(double *A, double *B) {
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
      // ...
    }
  }

  void f(double *A, double *B) {
    #pragma omp parallel for order(concurrent)
    for (int i = 0; i < N; ++i) {
      // ...
    }
  }

Now call either one with:

  omp_set_num_threads(1);
  f(A, B);

  void f(double *A, double *B) {
    #pragma omp parallel for schedule(static, N)
    for (int i = 0; i < N; ++i) {
      // ...
    }
  }

  void f(double *A, double *B) {
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; ++i) {
      // ...
    }
  }

A plain worksharing loop is tied to observable details: the thread count (omp_set_num_threads(1) forces it onto a single thread) and the schedule (schedule(static, N) and schedule(static, 1) assign iterations to threads differently). Only order(concurrent) actually tells the compiler that the iterations may execute in any order.
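A made-up example of why the schedule is observable and therefore not freely changeable by the compiler: with schedule(static, 1), iteration i is guaranteed to run on thread i % num_threads.

  #include <omp.h>

  void who_owns(int *owner, int N) {
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; ++i)
      owner[i] = omp_get_thread_num();   // round-robin assignment is specified
  }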
What’s Next? (same list as above)

Thanks! Interested? Reach out!

Johannes Doerfert
jdoerfert@[Link]
Argonne National Lab
Joseph Huber
huberjn@[Link]
Oak Ridge National Lab

Design Goal: report every successful and failed optimization.

Shilei Tian
shilei.tian@[Link]
Stony Brook University

Design Goal: optimize offloading code; perform host + accelerator optimizations.
OpenMP Offload Compilation (simplified)

user_code_1.c:

  void foo() {
    int N = 1024;

    #pragma omp target
    *mem = N;
  }

* RFC: [Link]
OpenMP Offload Compilation (simplified)

user_code_1.c:

  void foo() {
    int N = 1024;

    #pragma omp target
    *mem = N;
  }

is split into host.c:

  extern void device_func7(int);

  void foo() {
    int N = 1024;
    if (!offload(device_func7, N)) {
      // host fallback
      *mem = N;
    }
  }

and device.c:

  void device_func7(int N) {
    *mem = N;
  }

* RFC: [Link]
OpenMP Offload Compilation (simplified)

After constant propagation, the constant is part of the “host code” only:

user_code_1.c:

  void foo() {
    int N = 1024;

    #pragma omp target
    *mem = N;
  }

host.c:

  extern void device_func7(int);

  void foo() {
    int N = 1024;
    if (!offload(device_func7, 1024)) {
      // host fallback
      *mem = 1024;
    }
  }

device.c:

  void device_func7(int N) {
    *mem = N;
  }

* RFC: [Link]
Heterogeneous LLVM-IR Module

user_code_1.c:

  void foo() {
    int N = 1024;

    #pragma omp target
    *mem = N;
  }

becomes a single heterogeneous module, heterogeneous.c:

  __attribute__((callback(Func, ...)))
  int offload(void (*Func)(...), ...);

  /* target 0 (host) */
  void foo() {
    int N = 1024;
    if (!offload(device_func7, N)) {
      // host fallback
      *mem = N;
    }
  }

  /* target 1 (device) */
  void device_func7(int N) {
    *mem = N;
  }

* RFC: [Link]   * callback attribute: [Link]


Heterogeneous LLVM-IR Module

With the callback attribute, interprocedural constant propagation now reaches both targets:

heterogeneous.c:

  __attribute__((callback(Func, ...)))
  int offload(void (*Func)(...), ...);

  /* target 0 (host) */
  void foo() {
    int N = 1024;
    if (!offload(device_func7, N)) {
      // host fallback
      *mem = 1024;
    }
  }

  /* target 1 (device) */
  void device_func7(int N) {
    *mem = 1024;
  }

* RFC: [Link]   * callback attribute: [Link]
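The callback attribute used above is a real Clang attribute; a minimal stand-alone sketch (names are made up) of what it expresses: offload_one will invoke its function-pointer argument with the given payload, so interprocedural passes may follow values through the call.

  __attribute__((callback(work, payload)))
  void offload_one(void (*work)(int), int payload);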
