A Compiler’s View of OpenMP

Johannes Doerfert, Argonne National Laboratory


About Me

● PhD in CS from Saarland University, Saarbrücken, Germany
● Researcher at Argonne National Laboratory (ANL), Chicago, USA
● Code owner for OpenMP offloading in LLVM (officially) since recently
● Active in the LLVM community since 2014, in the OpenMP community since 2018

[Photo: me in Zurich]
Background

LLVM in a Nutshell

● open (source/community/...)
● extensible, “fixable”
● portable (GPUs, CPUs, …)
● C++/OpenMP/SYCL/HIP/CUDA/… feature complete 😉 [😉 eventually]
● early access to *the coolest* features
● performant and correct ;)

(Graphic: thanks to Ryan Houdek)
LLVM/Clang 101

file.c → (clang) → LLVM IR → (opt) → LLVM IR → (llc) → Machine IR (MIR) → Machine Code

Slide originally by Eric Christopher and Johannes Doerfert [Link]


OpenMP in LLVM
Johannes Doerfert, jdoerfert@[Link], Argonne National Lab
(The following slides were originally presented at the LLVM Dev Meeting 2020 [Link])

Clang
● OpenMP Parser → OpenMP Sema → OpenMP CodeGen

Flang
● OpenMP Parser → OpenMP Sema → OpenMP CodeGen

OpenMP runtimes
● libomp.so (classic, host)
● libomptarget + plugins (offloading, host)
● libomptarget-nvptx (offloading, device)

OpenMPIRBuilder
● frontend-independent OpenMP LLVM-IR generation
● favors simple and expressive LLVM-IR
● reusable for non-OpenMP parallelism

OpenMPOpt
● interprocedural optimization pass
● contains host & device optimizations
● run with -O2 and -O3 since LLVM 11
OpenMP Implementation & Optimization

Use default(firstprivate), or default(none) + firstprivate(...), for (almost) all values!
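A minimal sketch of that advice (made-up function and variable names; default(firstprivate) on parallel is OpenMP 5.1 syntax):

  // Every captured value becomes an explicit, private copy.
  void scale(double *A, int n, double factor) {
    #pragma omp parallel for default(firstprivate)
    for (int i = 0; i < n; ++i)   // the loop variable is private anyway
      A[i] *= factor;             // A, n, and factor are firstprivate copies
  }

  // Equivalent, with default(none) forcing every captured value to be listed:
  void scale_explicit(double *A, int n, double factor) {
    #pragma omp parallel for default(none) firstprivate(A, n, factor)
    for (int i = 0; i < n; ++i)
      A[i] *= factor;
  }

Writes through the firstprivate pointer A still reach the shared array; only the pointer value itself is copied.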
LLVM’s OpenMP-Aware Optimizations
Towards OpenMP-aware compiler optimizations

● LLVM “knows” about OpenMP API and (internal) runtime calls, incl. their potential effects (e.g., they won’t throw exceptions).
● LLVM performs “high-level” optimizations, e.g., parallel region merging, and various GPU-specific optimizations late.
● Some LLVM/Clang “optimizations” remain, but we are in the process of removing them: simple frontend, smart middle-end.

(OpenMPOpt: interprocedural optimization pass, contains host & device optimizations, run with -O2 and -O3 since LLVM 11.)
Optimization Remarks
Example: OpenMP runtime call deduplication

  double *A = malloc(size * omp_get_thread_limit());
  double *B = malloc(size * omp_get_thread_limit());
  #pragma omp parallel
  do_work(A, B);

OpenMP runtime calls with the same return values can be merged into a single call.

  $ clang -g -O2 deduplicate.c -fopenmp -Rpass=openmp-opt

  deduplicate.[Link] remark: OpenMP runtime call omp_get_thread_limit moved to deduplicate.[Link] [-Rpass=openmp-opt]
    double *B = malloc(size*omp_get_thread_limit());
  deduplicate.[Link] remark: OpenMP runtime call omp_get_thread_limit deduplicated [-Rpass=openmp-opt]
    double *A = malloc(size*omp_get_thread_limit());
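For reference, a self-contained file the command above can be run on; do_work and the surrounding function are made-up scaffolding around the slide's snippet:

  // deduplicate.c
  #include <omp.h>
  #include <stdlib.h>

  void do_work(double *A, double *B);   // assumed to be defined elsewhere

  void prepare(size_t size) {
    double *A = malloc(size * omp_get_thread_limit());
    double *B = malloc(size * omp_get_thread_limit());
    #pragma omp parallel
    do_work(A, B);
  }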
Optimization Remarks
Example: OpenMP Target Scheduling (explained later)

  void bar(void) {
    #pragma omp parallel
    {}
  }

  void foo(void) {
    #pragma omp target teams
    {
      #pragma omp parallel
      {}
      bar();
      #pragma omp parallel
      {}
    }
  }

  clang12 -Rpass=openmp-opt ...

  remark: Found a parallel region that is called in a target region but not part of a combined target construct nor nested inside a target construct without intermediate code. This can lead to excessive register usage for unrelated target regions in the same translation unit due to spurious call edges assumed by ptxas.
  remark: Parallel region is not known to be called from a unique single target region, maybe the surrounding function has external linkage?; will not attempt to rewrite the state machine use.
  remark: Found a parallel region that is called in a target region but not part of a combined target construct nor nested inside a target construct without intermediate code. This can lead to excessive register usage for unrelated target regions in the same translation unit due to spurious call edges assumed by ptxas.
  remark: Specialize parallel region that is only reached from a single target region to avoid spurious call edges and excessive register usage in other target regions. (parallel region ID: __omp_outlined__1_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
  remark: Target region containing the parallel region that is specialized. (parallel region ID: __omp_outlined__1_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
  remark: Found a parallel region that is called in a target region but not part of a combined target construct nor nested inside a target construct without intermediate code. This can lead to excessive register usage for unrelated target regions in the same translation unit due to spurious call edges assumed by ptxas.
  remark: Specialize parallel region that is only reached from a single target region to avoid spurious call edges and excessive register usage in other target regions. (parallel region ID: __omp_outlined__3_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
  remark: Target region containing the parallel region that is specialized. (parallel region ID: __omp_outlined__3_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
  remark: OpenMP GPU kernel __omp_offloading_35_a1e179_foo_l7
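One way to act on the "external linkage" remark above (a sketch, not the only fix): giving bar internal linkage lets the pass see that its parallel region is only reachable from the one surrounding target region, so the state machine rewrite can apply.

  static void bar(void) {
    #pragma omp parallel
    {}
  }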
OpenMP Compile-Time and Runtime Information

● Use OpenMP optimization remarks.
● Optimization remark explanations, examples, FAQs, … are gradually being added to [Link].
● Use LIBOMPTARGET_INFO for runtime library interactions.

  $ clang -O2 generic.c -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o generic
  $ env LIBOMPTARGET_INFO=1 ./generic

  CUDA device 0 info: Device supports up to 65536 CUDA blocks and 1024 threads with a warp size of 32
  CUDA device 0 info: Launching kernel __omp_offloading_fd02_c2a59832_main_l106 with 48 blocks and 128 threads in Generic mode
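A made-up sketch of the kind of generic.c that produces output like the above: the parallel region is not part of a combined target construct, so the kernel is launched in Generic mode.

  // generic.c (hypothetical)
  int main(void) {
    int x = 1;
    #pragma omp target teams firstprivate(x)
    {
      // sequential (per-team) code ...
      #pragma omp parallel firstprivate(x)
      {
        // parallel work ...
      }
      // more sequential (per-team) code ...
    }
    return 0;
  }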
OpenMP Offloading (in LLVM)
Compiling: clang -fopenmp -fopenmp-targets=nvptx64 file.c

Clang Actions

Host toolchain:
  {0} Input        → C file
  {1} Preprocessor → CPP-output
  {2} Compiler     → LLVM IR
  {3} Backend      → Assembler
  {4} Assembler    → Object
  {5} Linker       → Image (host-openmp)

Device toolchain:
  {6} Input        → C file
  {7} Preprocessor → CPP-output
  {8} Compiler     → LLVM IR
  {9} Offload      → LLVM IR
  {10} Backend     → Assembler
  {11} Assembler   → Object
  {12} Linker      → Image (device-openmp)

  {13} Offload Linker: host-openmp Image + device-openmp Image → Fat binary

Slide originally by Jose Monsalve Diaz
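The numbered action graph above is essentially what the Clang driver reports itself; a quick way to inspect it (exact output varies between Clang versions):

  $ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -ccc-print-phases file.c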
OpenMP Offloading
The Tricky Bits: math.h

  #include <math.h>

  #pragma omp begin declare target
  void science(float f) {
    if (signbitf(f)) {
      // some science
    } else {
      // some other science
    }
  }
  #pragma omp end declare target

science can be called from the host and the device.

The host's math.h implements it, e.g., with inline x86 assembly:

  /* Test for negative number. Used in the signbit() macro. */
  __MATH_INLINE int
  __NTH (__signbitf (float __x))
  {
  # ifdef __SSE2_MATH__
    int __m;
    __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
    return (__m & 0x8) != 0;
  # else
    __extension__ union { float __f; int __i; } __u = { __f: __x };
    return __u.__i < 0;
  # endif
  }


OpenMP Offloading
The Tricky Bits: math.h (continued)

GPUs do not provide a math.h, and more importantly, no libm.

  #include <math.h>

  #pragma omp begin declare target
  void science(float f) {
    if (signbitf(f)) {
      // some science
    } else {
      // some other science
    }
  }
  #pragma omp end declare target

LLVM/Clang's "math.h" wrapper for NVPTX (CUDA):

  int __signbitf(float __a) { return __nv_signbitf(__a); }

  #pragma omp begin declare variant match(device={kind(gpu)})
  bool signbit(float __x) { return ::__signbitf(__x); }
  #pragma omp end declare variant

science can be called from the host and the device.
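The same declare variant mechanism is available to user code; a minimal sketch with made-up function names, assuming a GPU-friendly implementation exists:

  float fast_rsqrt(float x);                  // base (host) implementation elsewhere

  #pragma omp begin declare variant match(device = {kind(gpu)})
  float fast_rsqrt(float x) {                 // picked when compiling for a GPU
    return 1.0f / __builtin_sqrtf(x);
  }
  #pragma omp end declare variant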


OpenMP Offloading
The Tricky Bits

Linking
not today 😢
OpenMP Offloading vs Kernel Languages (LLVM/OpenMP)

CUDA kernel launch; every thread in the block executes Func (SPMD mode):

  Func<<</* blocks */ 1, /* threads */ 4>>>(args);

OpenMP target region; sequential code surrounds the parallel region (Generic mode):

  #pragma omp target teams num_teams(1)
  {
    A();
    #pragma omp parallel num_threads(4) default(firstprivate)
    {
      Func(args);
    }
    B();
  }
OpenMP Offloading vs Kernel Languages

  void A() {
    #pragma omp parallel
    Kernel2();
  }

  void B() {
    #pragma omp barrier
  }

  #pragma omp target teams num_teams(1)
  {
    #pragma omp parallel num_threads(4) default(firstprivate)
    {
      if (omp_get_thread_num() == 0)
        A();
      #pragma omp barrier
      Func(args);
      #pragma omp barrier
      if (omp_get_thread_num() == 0)
        B();
    }
  }

SPMD-zation, coming soon!
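Until then, a kernel written as a single combined construct already maps onto SPMD mode, since there is no sequential code around the parallel part (sketch reusing Func/args from above):

  #pragma omp target teams distribute parallel for num_teams(1) num_threads(4)
  for (int i = 0; i < 4; ++i)
    Func(args);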
OpenMP Offloading vs Kernel Languages (simplified)

  #pragma omp target teams num_teams(1)
  {
    A();
    #pragma omp parallel num_threads(4) default(firstprivate)
    {
      Func(args);
    }
    B();
  }
OpenMP Offloading vs Kernel Languages (simplified)
Function Pointer

Q: How do you identify a parallel region?

A: Via the function (pointer) we outlined it into.

Q: Won’t that cause indirect calls and spurious call edges?

A: Yes. That’s why we try to use non-function pointer IDs.


OpenMP Offloading vs Kernel Languages (simplified)

Function pointer (before):

  static void parFn() {
    // parallel function code
  }

  void kernel() {
    if (is_worker()) {
      while (1) {
        fn = __omp_wait_for_parallel();
        fn();
        __omp_inform_parallel_done();
      }
    } else {
      __omp_inform_workers(&parFn, ...);
      parFn();
      __omp_wait_for_workers();
    }
  }

Function ID (after):

  static char parFnId;
  static void parFn() {
    // parallel function code
  }

  void kernel() {
    if (is_worker()) {
      while (1) {
        fn = __omp_wait_for_parallel();
        (fn == &parFnId) ? parFn() : fn();
        __omp_inform_parallel_done();
      }
    } else {
      __omp_inform_workers(&parFnId, ...);
      parFn();
      __omp_wait_for_workers();
    }
  }

Performed since LLVM 12


OpenMP Offloading vs Kernel Languages (simplified)

Without further information, visible() might be called from outside the translation unit, so the function-pointer scheme must stay:

  static void parFn() {
    // parallel function code
  }

  void kernel() {
    if (is_worker()) {
      // ...
    } else {
      visible();
    }
  }

  void visible() {
    __omp_inform_workers(&parFn, ...);
    parFn();
    __omp_wait_for_workers();
  }

With the ompx_no_external_callers assumption, the ID-based specialization applies again:

  static char parFnId;

  #pragma omp begin assumes ompx_no_external_callers
  void visible() {
    __omp_inform_workers(&parFnId, ...);
    parFn();
    __omp_wait_for_workers();
  }
  #pragma omp end assumes

Use optimization remarks to learn about missed opportunities.

LLVM 13 will know more tricks :)
What OpenMP got Wrong
(non-exhaustive list)

All instances where a directive retroactively changes something:

  static int X;

  static int PleaseDont[alignof(X)];

  int* whileWeAreHere(void) { return &X; }

  #pragma omp allocate(X) allocator(...) align(...)

The fixation on syntactic nesting:

  #pragma omp target          // OK
  {
    #pragma omp atomic update
    ++X;
  }

  #pragma omp target teams    // error
  {
    #pragma omp atomic update
    ++X;
  }

  #pragma omp target teams    // the atomic inside foo is fine
  {
    foo();
  }
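The third case spelled out (a sketch): because the restriction is checked syntactically, the very same atomic is accepted once it is hidden behind a call.

  void foo(void) {
    #pragma omp atomic update
    ++X;
  }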
What OpenMP got (kinda) Right
(non-exhaustive list)

The target device abstraction:

● Local and remote GPUs look the same: GPU 0 and GPU 1 next to remote GPU 2-4, GPU 5-7, and GPU 8-10. LLVM 12 provides remote GPUs!
● A virtual GPU can sit next to the real devices: CPU Device 0, Virtual GPU Device 1. LLVM 13 will provide a VGPU :)
● Application + OpenMP on one side, the device on the other: the two worlds only talk through the abstraction (e.g., CUDA underneath).
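Because a device is just a number behind the abstraction, code like the following minimal sketch works the same whether device d is a local GPU, a remote GPU, or the virtual GPU:

  #include <omp.h>

  void run_on_all_devices(void) {
    int ndev = omp_get_num_devices();
    for (int d = 0; d < ndev; ++d) {
      #pragma omp target device(d)
      {
        // work executed on device d
      }
    }
  }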
What’s Next?

LLVM
● More OpenMP-aware optimizations:
  ○ hide memory transfer latencies
  ○ exploit OpenMP domain knowledge
  ○ ask for and utilize user assumptions
● GPU-specific optimizations
● More actionable optimization remarks
● OpenMP 5.1 features
● A new (portable and performant) GPU device runtime (written in OpenMP 5.1!)
● Helpful offloading “devices”:
  ○ VGPU + NewProcess for debugging, or
  ○ JIT for performance
● Host-Device optimizations

OpenMP
❏ OpenMP Interop and dynamic context selector implementations
❏ A community-developed OMPX (header) library (think stdlib for OpenMP)
❏ Function variants shipped via libraries
❏ More powerful assumptions
❏ Less syntactic / more semantic reasoning*
❏ Deprecations*

* I hope
Final Thoughts
(aka. Rambling)
Parallel Worksharing Loops ≠ “Parallel Loops”

  void f(double *A, double *B) {
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
      // ...
    }
  }

  void f(double *A, double *B) {
    #pragma omp parallel for order(concurrent)
    for (int i = 0; i < N; ++i) {
      // ...
    }
  }

Now call either one with:

  omp_set_num_threads(1);
  f(A, B);

  void f(double *A, double *B) {
    #pragma omp parallel for schedule(static, N)
    for (int i = 0; i < N; ++i) {
      // ...
    }
  }

  void f(double *A, double *B) {
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; ++i) {
      // ...
    }
  }

A plain worksharing loop is tied to observable details: the thread count (omp_set_num_threads(1) forces it onto a single thread) and the schedule (schedule(static, N) and schedule(static, 1) assign iterations to threads differently). Only order(concurrent) actually tells the compiler that the iterations may execute in any order.
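A made-up example of why the schedule is observable and therefore not freely changeable by the compiler: with schedule(static, 1), iteration i is guaranteed to run on thread i % num_threads.

  #include <omp.h>

  void who_owns(int *owner, int N) {
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; ++i)
      owner[i] = omp_get_thread_num();   // round-robin assignment is specified
  }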
What’s Next? (same list as above)

Thanks! Interested? Reach out!

Johannes Doerfert
jdoerfert@[Link]
Argonne National Lab
Joseph Huber
huberjn@[Link]
Oak Ridge National Lab

Design Goal: report every successful and failed optimization.

Shilei Tian
shilei.tian@[Link]
Stony Brook University

Design Goal: optimize offloading code; perform host + accelerator optimizations.
OpenMP Offload Compilation (simplified)

user_code_1.c:

  void foo() {
    int N = 1024;

    #pragma omp target
    *mem = N;
  }

* RFC: [Link]
OpenMP Offload Compilation (simplified)

user_code_1.c:

  void foo() {
    int N = 1024;

    #pragma omp target
    *mem = N;
  }

is split into host.c:

  extern void device_func7(int);

  void foo() {
    int N = 1024;
    if (!offload(device_func7, N)) {
      // host fallback
      *mem = N;
    }
  }

and device.c:

  void device_func7(int N) {
    *mem = N;
  }

* RFC: [Link]
OpenMP Offload Compilation (simplified)

After constant propagation, the constant is part of the “host code” only:

user_code_1.c:

  void foo() {
    int N = 1024;

    #pragma omp target
    *mem = N;
  }

host.c:

  extern void device_func7(int);

  void foo() {
    int N = 1024;
    if (!offload(device_func7, 1024)) {
      // host fallback
      *mem = 1024;
    }
  }

device.c:

  void device_func7(int N) {
    *mem = N;
  }

* RFC: [Link]
Heterogeneous LLVM-IR Module

user_code_1.c:

  void foo() {
    int N = 1024;

    #pragma omp target
    *mem = N;
  }

becomes a single heterogeneous module, heterogeneous.c:

  __attribute__((callback(Func, ...)))
  int offload(void (*Func)(...), ...);

  /* target 0 (host) */
  void foo() {
    int N = 1024;
    if (!offload(device_func7, N)) {
      // host fallback
      *mem = N;
    }
  }

  /* target 1 (device) */
  void device_func7(int N) {
    *mem = N;
  }

* RFC: [Link]   * callback attribute: [Link]


Heterogeneous LLVM-IR Module

With the callback attribute, interprocedural constant propagation now reaches both targets:

heterogeneous.c:

  __attribute__((callback(Func, ...)))
  int offload(void (*Func)(...), ...);

  /* target 0 (host) */
  void foo() {
    int N = 1024;
    if (!offload(device_func7, N)) {
      // host fallback
      *mem = 1024;
    }
  }

  /* target 1 (device) */
  void device_func7(int N) {
    *mem = 1024;
  }

* RFC: [Link]   * callback attribute: [Link]
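The callback attribute used above is a real Clang attribute; a minimal stand-alone sketch (names are made up) of what it expresses: offload_one will invoke its function-pointer argument with the given payload, so interprocedural passes may follow values through the call.

  __attribute__((callback(work, payload)))
  void offload_one(void (*work)(int), int payload);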
