A Compiler’s View of OpenMP
● open (source/community/...) [😉 eventually]
LLVM/Clang 101
[Pipeline diagram: file.c → Clang (OpenMP Parser → OpenMP Sema → OpenMP CodeGen) → LLVM-IR → opt → llc → MIR → Machine Code]
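For example, clang -fopenmp -S -emit-llvm file.c stops after the Clang frontend and prints the generated LLVM-IR that opt and llc would consume next.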
[Architecture diagram, built up over several slides: the Clang and Flang frontends (OpenMP Parser → OpenMP Sema → OpenMP CodeGen) generate OpenMP LLVM-IR through the OpenMPIRBuilder, the OpenMPOpt pass optimizes it, and the program runs against the OpenMP runtimes: libomp (classic, host) [Link], libomptarget + plugins (offloading, host), and libomptarget-nvptx (offloading, device).]

OpenMPIRBuilder
● frontend-independent OpenMP LLVM-IR generation
● favors simple and expressive LLVM-IR
● reusable for non-OpenMP parallelism

Slide originally presented at LLVM-Dev Meeting 2020 [Link] (Johannes Doerfert, Argonne National Lab)
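To make the OpenMPIRBuilder’s job concrete, here is a conceptual sketch (names are hypothetical; the real libomp entry point is __kmpc_fork_call and takes additional bookkeeping arguments) of how a parallel region is outlined:

/* Conceptual lowering of:
 *   #pragma omp parallel
 *   { counter += 1; }        (races ignored for brevity)
 */
typedef void (*parallel_body_t)(void *shared);

/* Stand-in for libomp's fork entry point: runs 'body' once per team thread. */
void runtime_fork_call(parallel_body_t body, void *shared);

static void outlined_body(void *shared) {
  int *counter = shared;
  *counter += 1;                               /* the user's region body */
}

void user_function(void) {
  int counter = 0;
  runtime_fork_call(outlined_body, &counter);  /* frontend-generated call */
}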
OpenMP Implementation & Optimization
Use default(firstprivate), or default(none) + firstprivate(...), for (almost) all values!
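A minimal sketch (names and values are hypothetical) of the recommended pattern: default(none) forces every captured value to be listed, and firstprivate gives each thread its own initialized copy, which keeps read-only values out of shared memory and easier for the optimizer to reason about.

#include <omp.h>
#include <stdio.h>

int main(void) {
  int scale = 2;          /* only read inside the region: firstprivate */
  int result[4] = {0};    /* written inside the region: must be shared */

  #pragma omp parallel num_threads(4) default(none) firstprivate(scale) shared(result)
  {
    int tid = omp_get_thread_num();
    result[tid] = tid * scale;   /* each thread uses its own copy of 'scale' */
  }

  for (int i = 0; i < 4; ++i)
    printf("result[%d] = %d\n", i, result[i]);
  return 0;
}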
LLVM’s OpenMP-Aware Optimizations
Towards OpenMP-aware compiler optimizations

OpenMPOpt: an interprocedural optimization pass; contains host & device optimizations; run with -O2 and -O3 since LLVM 11.

● LLVM “knows” about OpenMP API and (internal) runtime calls, incl. their potential effects (e.g., they won’t throw exceptions).
● LLVM performs “high-level” optimizations, e.g., parallel region merging, and various GPU-specific optimizations late.
● Some LLVM/Clang “optimizations” remain, but we are in the process of removing them: simple frontend, smart middle-end.
Optimization Remarks
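In recent Clang releases these remarks can be requested with -Rpass=openmp-opt, -Rpass-analysis=openmp-opt, and -Rpass-missed=openmp-opt.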
Example: OpenMP runtime call deduplication
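A minimal sketch (function names are hypothetical) of code this optimization targets: omp_get_thread_num() returns the same value for a given thread throughout the parallel region, so OpenMPOpt can fold the second call into the first.

#include <omp.h>

extern void use(int lo, int hi);  /* hypothetical consumer */

void dedup_example(void) {
  #pragma omp parallel
  {
    int lo = omp_get_thread_num() * 64;       /* first runtime call */
    int hi = omp_get_thread_num() * 64 + 64;  /* duplicate call: same result,
                                                 deduplicated by OpenMPOpt */
    use(lo, hi);
  }
}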
void bar(void) {
  #pragma omp parallel
  {}
}

void foo(void) {
  #pragma omp target teams
  {
    #pragma omp parallel
    {}

    bar();

    #pragma omp parallel
    {}
  }
}

Remarks emitted for this example (explained later):

remark: Found a parallel region that is called in a target region but not part of a combined target construct nor nested inside a target construct without intermediate code. This can lead to excessive register usage for unrelated target regions in the same translation unit due to spurious call edges assumed by ptxas.
remark: Parallel region is not known to be called from a unique single target region, maybe the surrounding function has external linkage?; will not attempt to rewrite the state machine use.
remark: Found a parallel region that is called in a target region but not part of a combined target construct nor nested inside a target construct without intermediate code. This can lead to excessive register usage for unrelated target regions in the same translation unit due to spurious call edges assumed by ptxas.
remark: Specialize parallel region that is only reached from a single target region to avoid spurious call edges and excessive register usage in other target regions. (parallel region ID: __omp_outlined__1_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
remark: Target region containing the parallel region that is specialized. (parallel region ID: __omp_outlined__1_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
remark: Found a parallel region that is called in a target region but not part of a combined target construct nor nested inside a target construct without intermediate code. This can lead to excessive register usage for unrelated target regions in the same translation unit due to spurious call edges assumed by ptxas.
remark: Specialize parallel region that is only reached from a single target region to avoid spurious call edges and excessive register usage in other target regions. (parallel region ID: __omp_outlined__3_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
remark: Target region containing the parallel region that is specialized. (parallel region ID: __omp_outlined__3_wrapper, kernel ID: __omp_offloading_35_a1e179_foo_l7)
remark: OpenMP GPU kernel __omp_offloading_35_a1e179_foo_l7
OpenMP Compile-Time and Runtime Information
CUDA device 0 info: Device supports up to 65536 CUDA blocks and 1024 threads with a warp size of 32
CUDA device 0 info: Launching kernel __omp_offloading_fd02_c2a59832_main_l106 with 48 blocks and 128 threads in Generic mode
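The compile-time information comes from remarks like the ones above; the runtime information shown here is printed by libomptarget, which recent LLVM releases expose through the LIBOMPTARGET_INFO environment variable.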
OpenMP Offloading (in LLVM)
Compiling: clang -fopenmp -fopenmp-targets=nvptx64 file.c

Clang Actions
[Phase diagram, host and device compilation paths: C file {0} input → … → {4} assembler → object → {5} linker → image, and C file {6} input → … → {10} backend → {11} assembler → object → {12} linker → image.]
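The numbered phases correspond to what clang -ccc-print-phases reports for this invocation: the source file is compiled twice, once for the host and once for the NVPTX device, and the results are combined into a single image.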
#pragma omp begin declare target
void science(float f) {
  if (signbit(f)) {
    // some science
  } else {
    // some other science
  }
}
#pragma omp end declare target

// LLVM/Clang's "math.h" wrapper for NVPTX (CUDA)
int __signbitf(float __a) { return __nv_signbitf(__a); }

#pragma omp begin declare variant match(device={kind(gpu)})
bool signbit(float __x) { return ::__signbitf(__x); }
#pragma omp end declare variant
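User code can tap the same mechanism through the named form of declare variant; a minimal sketch (functions are hypothetical):

/* GPU-tuned implementation (pretend). */
float fast_sqrt_gpu(float x) { return __builtin_sqrtf(x); }

/* Calls to fast_sqrt resolve to fast_sqrt_gpu when compiling for a GPU. */
#pragma omp declare variant(fast_sqrt_gpu) match(device={kind(gpu)})
float fast_sqrt(float x) { return __builtin_sqrtf(x); }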
Linking
not today 😢
OpenMP Offloading vs Kernel Languages

#pragma omp target
{
  #pragma omp atomic update
  ++X;
}

#pragma omp target teams
{
  #pragma omp atomic update // error
  ++X;
}

#pragma omp target teams
{
  // pragma omp atomic in foo is fine
  foo();
}
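The middle version is rejected because only a limited set of constructs (distribute, parallel, and friends) may be nested directly inside a teams region; the compiler can only check this lexically, so the same atomic compiles fine once it is hidden behind the call to foo().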
What OpenMP got (kinda) Right
(non-exhaustive list)
The target device abstraction:

[Diagram: GPU 0 is exposed as Device 0; a Virtual GPU on the CPU is exposed as Device 1.]

LLVM 13 will provide a VGPU :)
[Second diagram: the same abstraction backed by the cuda plugin.]
What’s Next?
void f(double *A, double *B) {
  #pragma omp parallel for
  for (int i = 0; i < N; ++i) {
    // ...
  }
}

vs.

void f(double *A, double *B) {
  #pragma omp parallel for order(concurrent)
  for (int i = 0; i < N; ++i) {
    // ...
  }
}

With the call site

omp_set_num_threads(1);
f(A, B);

void f(double *A, double *B) {
  #pragma omp parallel for schedule(static, N)
  for (int i = 0; i < N; ++i) {
    // ...
  }
}

vs.

void f(double *A, double *B) {
  #pragma omp parallel for schedule(static, 1)
  for (int i = 0; i < N; ++i) {
    // ...
  }
}
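Both rewrites need extra knowledge to be legal: order(concurrent) explicitly grants the compiler the freedom to reorder (and differently schedule) iterations, and once omp_set_num_threads(1) pins the team to a single thread, schedule(static, N) and schedule(static, 1) assign iterations identically.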
Design Goal
user_code_1.c
void foo() {
  int N = 1024;

  #pragma omp target
  *mem = N;
}

OpenMP Offload Compilation (simplified)

host.c
extern void device_func7(int);

void foo() {
  int N = 1024;

  if (!offload(device_func7, 1024)) {
    // host fallback
    *mem = 1024;
  }
}

The constant is part of the “host code”.

device.c
void device_func7(int N) {
  *mem = N;
}

* RFC: [Link]
Heterogeneous LLVM-IR Module
heterogeneous.c
__attribute__((callback(Func, ...)))
int offload(void (*Func)(...), ...);
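A sketch (reusing the hypothetical offload API from the example above) of why the callback attribute matters: it tells LLVM’s interprocedural passes that offload may invoke Func with the forwarded arguments, so the constant 1024 can flow from the host call site into device_func7.

extern int *mem;
extern int offload(void (*Func)(...), ...);  /* carries the callback attribute */

void device_func7(int N) { *mem = N; }

void foo(void) {
  int N = 1024;
  if (!offload(device_func7, N)) {  /* IPO now sees device_func7(1024) here */
    *mem = N;                       /* host fallback */
  }
}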