AMReX SIMD Helpers #4520
Src/Base/AMReX_GpuLaunchFunctsC.H
```C++
template <int WIDTH, typename T, typename L, typename M=std::enable_if_t<std::is_integral_v<T>> >
AMREX_ATTRIBUTE_FLATTEN_FOR
void ParallelForSIMD (T n, L const& f) noexcept
```
The design of this seems to be missing a lot to me. How does this work if SIMD is not enabled? What does this do on GPU? I think currently this cannot take a lambda as input and needs a struct with both simd and non-simd operator().
This one is fully standalone. The SIMD header brings in the types for it.
One could in general use this pattern on GPU as well, but that is not the target of this PR.
A solid helper for the copy in/out is still needed, which I currently have in user code and which will probably need documentation eventually. I do not yet have a good strategy for how to make this a helper inside AMReX, but maybe later we will know a way:
- https://github.com/BLAST-ImpactX/impactx/blob/629b3d884c3ca974d75ee99d3a0a494c3dc0e6f8/src/elements/mixin/beamoptic.H#L105-L151
- https://github.com/BLAST-ImpactX/impactx/blob/629b3d884c3ca974d75ee99d3a0a494c3dc0e6f8/src/elements/mixin/beamoptic.H#L172-L198
> I think currently this cannot take a lambda as input
Alex has a draft that would change the impl to support lambda: https://godbolt.org/z/Pbzs6dnbP
I added a little improvement here: https://godbolt.org/z/7455hqrEc
> [I think this] needs a struct with both simd and non-simd operator().
Correct, because one needs to be able to resolve the remainder. We could alternatively resolve the remainder using the SIMD operator() and WIDTH=1, but in practice one will template the implementation anyway as a T_Real for CPU/GPU portability.
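To make the remainder handling concrete, here is a minimal, self-contained sketch (not the AMReX implementation; all names are made up for illustration) of a chunked loop that dispatches whole chunks to a SIMD-style `operator()` and the tail to the scalar `operator()`:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical sketch: chunk the trip count by WIDTH and dispatch full
// chunks to the functor's "SIMD" operator(); the remainder falls back to
// the scalar operator() (equivalent to WIDTH=1).
template <int WIDTH, typename F>
void parallel_for_simd_sketch (int n, F const& f)
{
    int i = 0;
    for (; i + WIDTH <= n; i += WIDTH) {
        f(i, std::integral_constant<int, WIDTH>{}); // full SIMD chunk
    }
    for (; i < n; ++i) {
        f(i); // scalar remainder
    }
}

// A functor with both flavors, as the review comment describes.
struct Saxpy {
    float a; float const* x; float* y;
    // scalar path (also used for the remainder)
    void operator() (int i) const { y[i] += a * x[i]; }
    // "SIMD" path: a plain unrolled chunk standing in for stdx::simd math
    template <int W>
    void operator() (int i, std::integral_constant<int, W>) const {
        for (int k = 0; k < W; ++k) { y[i+k] += a * x[i+k]; }
    }
};
```

With `n = 10` and `WIDTH = 4`, two full chunks go through the SIMD overload and indices 8 and 9 through the scalar one.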
@AlexanderSinn tested it, and it also works when called as a lambda; we could consider adding a few helpers for the aligned reads & writes to AMReX (later on?): https://godbolt.org/z/Pbzs6dnbP
I had a godbolt test link somewhere, where I ensured that the writes of non-modified reads are optimized out by the compiler. That would be quite important, otherwise we have to manually track our modifications.
One can force it with metaprogramming, and then the user just needs to […]. Shorter version (thx @AlexanderSinn):
```C++
template <typename R, typename... Args>
constexpr bool is_nth_arg_non_const (R(*)(Args...), int n) {
    bool val_arr[sizeof...(Args)] {!std::is_const_v<std::remove_reference_t<Args>>...};
    return val_arr[n];
}
// ...
if constexpr (is_nth_arg_non_const(modify<double>, 0))
    ad.copy_to(ap + i, stdx::vector_aligned);
```
My update to also work with functors:
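A self-contained check of that helper may clarify what it computes; the kernels `read_only` and `read_write` below are made up for the demonstration:

```cpp
#include <cassert>
#include <type_traits>

// Helper from the comment above: is the n-th parameter of a free function
// taken by non-const (i.e., writable) reference?
template <typename R, typename... Args>
constexpr bool is_nth_arg_non_const (R(*)(Args...), int n) {
    bool val_arr[sizeof...(Args)] {!std::is_const_v<std::remove_reference_t<Args>>...};
    return val_arr[n];
}

// Hypothetical kernels for the demonstration
void read_only  (double const& x, double& y) { y = 2.0 * x; }
void read_write (double& x, double const& y) { x += y; }

static_assert(!is_nth_arg_non_const(&read_only, 0));  // arg 0 is const&: no write-back
static_assert( is_nth_arg_non_const(&read_only, 1));  // arg 1 is mutable: write-back
static_assert( is_nth_arg_non_const(&read_write, 0)); // arg 0 is mutable: write-back
```

The `copy_to` write-back can then be compiled out at zero cost whenever the kernel only reads a parameter.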
Ah nice, and @AlexanderSinn just showed me that the issue is because of the […]
```C++
/** ParallelFor with a SIMD Width (in elements)
 *
 * SIMD load/Write-back operations need to be performed before/after calling this.
 */
```
Probably good to add a C++ example snippet here in the doc string, too.
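For reference, such a doc-string snippet could look roughly like the following. This is a sketch against `<experimental/simd>` directly (a plain function stands in for the `ParallelForSIMD` body, and the kernel and array names are placeholders, not AMReX API):

```cpp
#include <experimental/simd>
namespace stdx = std::experimental;

constexpr int WIDTH = 4;
using VReal = stdx::fixed_size_simd<double, WIDTH>;

// One SIMD chunk of a particle push: SIMD load before and write-back after
// the computation, as the doc string requires. ParallelForSIMD would invoke
// this body with i = 0, WIDTH, 2*WIDTH, ...
void push_chunk (double* x, double const* v, double dt, int i)
{
    VReal xs, vs;
    xs.copy_from(x + i, stdx::element_aligned); // SIMD load
    vs.copy_from(v + i, stdx::element_aligned);
    xs += vs * dt;                              // vectorized compute
    xs.copy_to(x + i, stdx::element_aligned);   // SIMD write-back
}
```

Note that `<experimental/simd>` ships with libstdc++ (GCC 11+); the vir-simd library provides a fallback elsewhere.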
Compile-time helper that can be optionally used instead of an `int` running index in user code, to query the SIMD WIDTH set in `ParallelForSIMD` inside a called function.

Co-authored-by: Alexander Sinn <[email protected]>
Add support for mixed vector/scalar operations. Sufficient for ImpactX `Multipole` elements.
WeiqunZhang left a comment
LGTM. Just another minor comment.
This broke Castro/Microphysics compilation. Is there a way to opt out of this?
PR AMReX-Codes#4520 broke Microphysics's autodiff math. It could be fixed on the Microphysics side. However, we can fix it on the amrex side and we should use std::enable_if anyway to limit the data types.
It's because of […]
PR #4520 broke Microphysics's autodiff math. It could be fixed on the Microphysics side. However, we can fix it on the amrex side and we should use std::enable_if anyway to limit the data types.

Co-authored-by: Axel Huebl <[email protected]>
Basic build system support for AMReX SIMD in superbuilds. See:
- AMReX-Codes/amrex#4520 (requires this PR)
- BLAST-ImpactX/impactx#1002 (needed for this PR to be finished)
Fix merged in #4570
Update: This functionality made ImpactX astronomically faster on CPU for compute-heavy (compute-bound) kernels:
## Add `ParallelForSIMD<T>`
This adds another template overload to `ParallelForSIMD`.
A typical user pattern for maximum control so far is:
```C++
#ifdef AMREX_USE_SIMD
if constexpr (amrex::simd::is_vectorized<T>) {
amrex::ParallelForSIMD<T::simd_width>(np, pushSingleParticle);
} else
#endif
{
amrex::ParallelFor(np, pushSingleParticle); // GPU & non-SIMD CPU
}
```
This simplifies it to:
```C++
amrex::ParallelForSIMD<T>(np, pushSingleParticle);
```
indicating there _might_ be a SIMD path if `T` (e.g., a functor)
implements it.
One can still call `ParallelForSIMD` with an explicit SIMD width (int),
as before.
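To illustrate the dispatch described above, here is a minimal, self-contained sketch (hypothetical names, not the AMReX internals): a single entry point that takes the SIMD path only when the functor type advertises a compile-time `simd_width`, and otherwise falls back to the scalar loop:

```cpp
#include <cassert>
#include <type_traits>

// Detect whether F advertises a compile-time simd_width (sketch of the
// is_vectorized idea from the user pattern above).
template <typename F, typename = void>
struct is_vectorized : std::false_type {};
template <typename F>
struct is_vectorized<F, std::void_t<decltype(F::simd_width)>> : std::true_type {};

// Single entry point: SIMD chunks when available, scalar otherwise.
template <typename F>
void parallel_for_simd_sketch (int n, F const& f)
{
    if constexpr (is_vectorized<F>::value) {
        constexpr int W = F::simd_width;
        int i = 0;
        for (; i + W <= n; i += W) { f.simd(i); } // vector chunks
        for (; i < n; ++i)         { f(i); }      // scalar remainder
    } else {
        for (int i = 0; i < n; ++i) { f(i); }     // GPU / non-SIMD CPU analogue
    }
}

// Example functor implementing both paths
struct Doubler {
    static constexpr int simd_width = 4;
    double* y;
    void operator() (int i) const { y[i] *= 2.0; }                            // scalar
    void simd (int i) const {                                                 // "SIMD"
        for (int k = 0; k < simd_width; ++k) { y[i+k] *= 2.0; }
    }
};
```

A functor without a `simd_width` member would compile against the same entry point and simply take the scalar branch, which is the "might be a SIMD path" behavior the PR text describes.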
## Generalized Particle Load/Store
A typical SIMD user pattern for particle SoA kernels was:
```C++
SIMDParticleReal<SIMD_WIDTH> part_x;
part_x.copy_from(&m_part_x[i], stdx::element_aligned);
el.compute(part_x);
#ifdef AMREX_USE_SIMD
if constexpr (is_nth_arg_non_const<&el::compute, n>)
part_x.copy_to(&m_part_x[i], stdx::element_aligned);
#endif
```
This simplifies it to:
```C++
decltype(auto) x = load_1d(m_part_x, i);
el.compute(x);
store_1d<&el::compute, 0>(x, m_part_x, i);
```
and can now also be used for the GPU path, where for now the load is a transparent pointer forward/dereference and the store is a no-op.
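A sketch of what the scalar/GPU flavor of those helpers could look like (hypothetical shims mirroring the description, not the AMReX implementation): the load hands out a plain reference, so the kernel writes through it directly, and the store does nothing:

```cpp
#include <cassert>

// Scalar/GPU flavor sketch: load_1d forwards a reference and store_1d is a
// no-op, since the kernel already wrote through the reference. The SIMD
// flavor would instead return a stdx::simd copy and write it back here.
template <typename T>
decltype(auto) load_1d (T* ptr, int i) { return ptr[i]; } // returns T&

template <auto MemFn, int N, typename V, typename T>
void store_1d (V const&, T*, int) { /* no-op on this path */ }

// Hypothetical element for the demonstration
struct Pusher {
    void compute (double& x) const { x *= 2.0; }
};
```

Because `load_1d` returns `T&` via `decltype(auto)`, the same `load/compute/store` kernel source compiles on both paths with zero overhead here.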
## Combined
Using these patterns together, one can now write single-source
SIMD-CPU/non-SIMD-CPU/GPU kernels, e.g.,
BLAST-ImpactX/impactx#1279
Follow-up to #4520
## Checklist
The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [x] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [x] include documentation in the code and/or rst files, if appropriate
## Summary

Needed to cover SIMD features and generic SIMD/non-SIMD patterns. Bravely vibe coded, but now reviewed and improved for sensibility.

## Additional background

#4924 #4607 #4600 #4520

## Checklist

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate

Co-authored-by: Weiqun Zhang <[email protected]>
Summary
AMReX does not yet have a concept to help users write effective SIMD code on CPU, besides relying on auto-vectorization and pragmas, which are unreliable for any sufficiently complex code. [1]
Luckily, C++ `std::datapar` was just accepted into C++26, which gives an easy in to write portable SIMD/scalar code. Yet, I did not find a compiler/stdlib with support for it yet, so I finally played with the C++17 `<experimental/simd>` headers, which are not as complete as C++26 but a good entry point, especially if complemented with the https://github.com/mattkretz/vir-simd library.

This PR adds initial support for portable user code by providing:
- a build option `AMReX_SIMD` (default is OFF), relying on vir-simd
- an `AMReX_SIMD.H` header that handles includes & helper types
- `ParallelForSIMD<SIMD_WIDTH>(...)`

Additional background
[1] Fun fact one: As written in the story behind Intel's ispc compiler and credited to Tim Foley, auto-vectorization is not a programming model.
Fun fact two: This is as ad-hoc as the implementation for data parallel types / SIMD in Kokkos, it seems.
User-Code Examples & Benchmark
Please see this ImpactX PR for details.
Checklist
- `vir::stdx::simd` test in CI
- `AMReX_SIMD.H`
- `ParallelForSIMD`
- `ParticleIdWrapper::make_(in)valid(mask)`
- `sincos` support
- `SmallMatrix`
- `GpuComplex` (minimal)
- `ParallelForSIMD`
- `vir::stdx::simd` in package managers:

Future Ideas / PRs
- `ParticleIdWrapper`/`ParticleCpuWrapper`
- `GpuComplex<SIMD>`
- `ParallelFor` ND support
- `ParallelFor`/`ParallelForSIMD`: one could, maybe, with enable-if magic, etc., fuse them into a single name again
- vir-simd auto-download for convenience (opt-out)