## Summary

AMReX does not yet have a concept to help users write effective SIMD code on CPU, besides relying on auto-vectorization and pragmas, which are unreliable for any sufficiently complex code. [1]

Luckily, C++ `std::datapar` was just accepted into C++26, which gives an easy way in to write portable SIMD/scalar code. Yet I have not found a compiler/stdlib with support for it, so I finally had a play with the C++17 `<experimental/simd>` headers, which are not as complete as C++26 but a good start, especially when complemented with the https://github.com/mattkretz/vir-simd library.

This PR adds initial support for portable user code by providing:

- build system support: `AMReX_SIMD` (default is OFF), relying on [vir-simd](https://github.com/mattkretz/vir-simd)
- an `AMReX_SIMD.H` header that handles includes & helper types
- `ParallelForSIMD<SIMD_WIDTH>(...)`

## Additional background

[1] Fun fact one: As written in the [story behind Intel's ispc compiler](https://pharr.org/matt/blog/2018/04/18/ispc-origins) and credited to [Tim Foley](http://graphics.stanford.edu/~tfoley/), *auto-vectorization is not a programming model.*

Fun fact two: This is as ad-hoc as the implementation for [data parallel types / SIMD in Kokkos](https://kokkos.org/kokkos-core-wiki/API/simd/simd.html), it seems.
## User-Code Examples & Benchmark

[Please see this ImpactX PR for details.](BLAST-ImpactX/impactx#1002)

## Checklist

- [x] clean up commits (separate commits)
- [x] finalize fallbacks & CI checks
- [ ] add a `vir::stdx::simd` test in CI
- [x] CMake
- [ ] GnuMake
- [x] `AMReX_SIMD.H`
- [x] `ParallelForSIMD`
- [x] `ParticleIdWrapper::make_(in)valid(mask)`
- [x] clean up `sincos` support
- [x] `SmallMatrix`
- [x] Support for `GpuComplex` (minimal)
- [x] Support [passing WIDTH as compile-time meta-data](https://godbolt.org/z/7455hqrEc) to callee in `ParallelForSIMD`
- [ ] include documentation in the code and/or rst files, if appropriate
- [x] add `vir::stdx::simd` in package managers:
  - [x] Spack [vir-simd](spack/spack-packages#332)
  - [x] Conda [vir-simd](conda-forge/staged-recipes#30377)

## Future Ideas / PRs

- allocate particle arrays aligned so we can use [stdx::vector_aligned](https://en.cppreference.com/w/cpp/experimental/simd/vector_aligned.html) (for [copies](https://en.cppreference.com/w/cpp/experimental/simd/simd/copy_from) into/out of vector registers; note: makes no difference anymore on modern CPUs)
- Support more/all functions in `ParticleIdWrapper`/`ParticleCpuWrapper`
- Support for [vir::simdize<std::complex<T>>](mattkretz/vir-simd#42) instead of `GpuComplex<SIMD>`
- `ParallelFor` ND support
- `ParallelFor`/`ParallelForSIMD`: one could, maybe, with enable-if magic, etc., fuse them into a single name again
- CMake superbuild: `vir-simd` auto-download for convenience (opt-out)
- Build system: "SIMD provider" selection, once we can opt in to a C++26 compiler+stdlib instead of C++17 TS2 + vir-simd
- Update AMReX package in package management:
  - Spack [vir-simd](spack/spack-packages#332)
  - Conda [vir-simd](conda-forge/staged-recipes#30377)

---------

Co-authored-by:
Alexander Sinn <[email protected]>
Basic build system support for AMReX SIMD in superbuilds. See:

- AMReX-Codes/amrex#4520 (requires this PR)
- BLAST-ImpactX/impactx#1002 (needed for this PR to be finished)
```diff
  // if the aperture is periodic, shift sx,sy coordinates to the fundamental domain
- amrex::ParticleReal u = (m_repeat_x==0.0_prt) ? sx : (std::fmod(std::abs(sx)+dx,m_repeat_x)-dx);
- amrex::ParticleReal v = (m_repeat_y==0.0_prt) ? sy : (std::fmod(std::abs(sy)+dy,m_repeat_y)-dy);
+ T_Real u = (m_repeat_x == 0_prt) ? sx : (fmod(abs(sx)+dx, m_repeat_x)-dx);
+ T_Real v = (m_repeat_y == 0_prt) ? sy : (fmod(abs(sy)+dy, m_repeat_y)-dy);
```

Check notice: Code scanning / CodeQL: Equality test on floating-point values
Ok to run through the invalid branch afterwards; in fact we must, even for the valid particles in the SIMD lane.
Biggest wins so far (SP, before/after PR):
There are also a few very simple elements (that are not compute bound) where the benchmarks show that the packing/unpacking into vector registers is not amortized, so we should not vectorize those. (Small harm, because they are so quick, but still simple to avoid.) Those are:
Double Precision

Machine: Dane (LLNL)

pytest-benchmark compare --group-by=name --columns=min,stddev --sort=name .benchmarks/Linux-CPython-3.11-64bit/000*
pytest-benchmark compare --group-by=name --sort=name --histogram=dp_histo .benchmarks/Linux-CPython-3.11-64bit/000*
```diff
  // assign intermediate parameter
- amrex::ParticleReal const step = slice_ds /std::sqrt(powi<2>(pt)-1.0_prt);
+ amrex::ParticleReal const step = slice_ds / std::sqrt(powi<2>(pt)-1.0_prt);
```
Should this be sqrt instead of std::sqrt here?
Perhaps not--this is still of type amrex::ParticleReal and not T_Real.
Not here, because it is not in the particle tracking operator(). I changed this white space by accident (sorry for the noise).
cemitch99 left a comment:

The updated logic looks reasonable to me--I did not find any obvious place where the math should break. I made only minor comments. Suggest one of our experienced AMReX colleagues take a look.
"Simple" elements, where vectorization packing/unpacking causes a slight overhead. I will revert those to stay with a serial implementation. Note that some of the benchmarks can be tainted by the well-known Intel AVX-512 frequency drops in the benchmarks, but even if they are there is no huge issue to exclude them for now from vectorization, because they are super-fast ("simple") elements. SP
DP
Add inline comments on TODOs that we will explore in follow-ups.
@atmyers @AlexanderSinn @WeiqunZhang let me know if one of you wants to take a final look; good to go from our end otherwise.
Until recently (see AMReX PR AMReX-Codes/amrex#4520), AMReX did not have a concept to help users write effective SIMD code on CPU, besides relying on auto-vectorization and pragmas, which are unreliable for any sufficiently complex code. [1]
Luckily, C++ `std::datapar`/`std::simd` was just accepted into C++26, which gives an easy way in to write portable SIMD/scalar code. Yet I have not found a compiler/stdlib with support for it, so I implemented in AMReX, using the C++17 `<experimental/simd>` headers, a new addition to the `ParallelFor` performance portability construct, relying on the https://github.com/mattkretz/vir-simd library.

This PR vectorizes CPU code in ImpactX for particle tracking (element pushes), for both the NOACC and OpenMP compute backends, keeping our portable, single-source approach.
Additional background
The implementation here works with any kind of vectorization (if implemented by the respective compiler): SSE(2), AVX(2), AVX-512, Neon, ... 512-bit vector registers are the widest currently available, able to push up to 8 (DP) or 16 (SP) particles in parallel, where a non-data-parallel (non-SIMD) implementation would push just 1 (in the worst case, for any sufficiently complex code [1]).
[1] As written in the story behind Intel's ispc compiler and credited to Tim Foley, auto-vectorization is not a programming model.
Vectorization is really a technique that primarily benefits elements that are very compute heavy (i.e., not memory-bandwidth bound). Not coincidentally, ImpactX focuses on symplectic (high-order/exact) methods for chromatic and exact tracking, where this is exactly the case. Optimizations like vectorization thus make a huge impact for our community/users/time-to-solution and address exactly our most costly tracking methods, making it highly beneficial for our users to rely on them and still get fast results when needed.
Related Reads for AMD/Intel CPUs with AVX-512
For AVX-512, Intel famously messed up the clock frequency/throughput in their early/current chips. That means one should run benchmarks long and with many particles for reliable results, but I kept it simple above. The consequence is that where SIMD sometimes "looks meh/bad", it could be (even) better in practice.
AMD Zen5 CPUs do not have that flaw in their AVX-512 implementation.
Details: https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavior
To Do
Benchmarks on Dane (LLNL)
Machine: Dane (LLNL), `-march=sapphirerapids` CPU
Compiler: gcc/13.3.1
Vector registers: 512 bit (i.e., 8 doubles or 16 floats)
Development:
SIMD Branch:
Single Precision
sp: single precision
nm: normal/default math
fm: fast-math
d(ev): development branch
s(imd): the branch in this PR

pytest-benchmark compare --group-by=name --columns=min,stddev --sort=name .benchmarks/Linux-CPython-3.11-64bit/000*
pytest-benchmark compare --group-by=name --sort=name --histogram=sp_histo .benchmarks/Linux-CPython-3.11-64bit/000*

Benchmarks on Perlmutter (NERSC)
Machine: Perlmutter (NERSC)
Compiler: ...
Vector registers: 256 bit (i.e., 4 doubles or 8 floats)
Note: Perlmutter CPUs (AMD Zen 3) do not support AVX-512; AMD introduced 512 bit (i.e., 8 doubles) wide vector registers only in Zen 5.
Boring old CPU :D
Required PRs