AMReX SIMD Helpers #4520

Merged
WeiqunZhang merged 10 commits into AMReX-Codes:development from ax3l:topic-simd on Jul 16, 2025

@ax3l ax3l commented Jun 23, 2025

Summary

AMReX does not yet have a concept that helps users write effective SIMD code on CPU, other than relying on auto-vectorization and pragmas, which are unreliable for any sufficiently complex code. [1]

Luckily, C++ std::datapar was just accepted into C++26, which gives an easy way in to writing portable SIMD/scalar code. However, I have not yet found a compiler/stdlib that supports it, so I finally played with the C++17 <experimental/simd> headers, which are not as complete as C++26 but are a good start, especially if complemented with the https://github.com/mattkretz/vir-simd library.

This PR adds initial support for portable user-code by providing:

  • build system support: AMReX_SIMD (default is OFF), relying on vir-simd
  • an AMReX_SIMD.H header that handles includes & helper types
  • ParallelForSIMD<SIMD_WIDTH>(...)
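
To illustrate the kind of portable SIMD code these helpers target, here is a minimal sketch that uses only the C++17 `<experimental/simd>` header mentioned above, without AMReX or vir-simd. The function name `axpy_simd` and the loop structure are assumptions made for illustration and are not part of this PR: full vector-width chunks go through `stdx::native_simd`, and the leftover elements fall back to scalar code.

```cpp
#include <cassert>
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Hypothetical example kernel (not an AMReX API): y[i] += a * x[i] for i in [0, n).
inline void axpy_simd (double a, double const* x, double* y, std::size_t n)
{
    assert(n == 0 || (x != nullptr && y != nullptr));

    using simd_t = stdx::native_simd<double>;      // widest native vector of doubles
    constexpr std::size_t width = simd_t::size();  // number of lanes, known at compile time

    std::size_t i = 0;
    // Full chunks: load `width` elements, compute, write back.
    for (; i + width <= n; i += width) {
        simd_t xv(x + i, stdx::element_aligned);
        simd_t yv(y + i, stdx::element_aligned);
        yv += a * xv;  // the scalar `a` is broadcast to all lanes
        yv.copy_to(y + i, stdx::element_aligned);
    }
    // Remainder: fewer than `width` elements are left, handle them one by one.
    for (; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

The explicit scalar remainder loop is the part that a helper such as `ParallelForSIMD` is meant to take off the user's hands.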

Additional background

[1] Fun fact one: As written in the story behind Intel's ispc compiler and credited to Tim Foley, auto-vectorization is not a programming model.

Fun fact two: This is as ad-hoc as the implementation for data parallel types / SIMD in Kokkos, it seems.

User-Code Examples & Benchmark

Please see this ImpactX PR for details.

Checklist

  • clean up commits (separate commits)
  • finalize fallbacks & CI checks
  • add a vir::stdx::simd test in CI
  • CMake
  • GnuMake
  • AMReX_SIMD.H
  • ParallelForSIMD
  • ParticleIdWrapper::make_(in)valid(mask)
  • clean up sincos support
  • SmallMatrix
  • Support for GpuComplex (minimal)
  • Support passing WIDTH as compile-time meta-data to callee in ParallelForSIMD
  • include documentation in the code and/or rst files, if appropriate
  • add vir::stdx::simd in package managers:

Future Ideas / PRs

  • allocate particle arrays aligned so we can use stdx::vector_aligned (for copies into/out of vector registers - note: makes no difference anymore on modern CPUs)
  • Support more/all functions in ParticleIdWrapper/ParticleCpuWrapper
  • Support for vir::simdize<std::complex> instead of GpuComplex<SIMD>
  • ParallelFor ND support
  • ParallelFor/ParallelForSIMD: one could, maybe, with enable-if magic, etc fuse them into a single name again
  • CMake superbuild: vir-simd auto-download for convenience (opt-out)
  • Build system: "SIMD provider" selection, once we can opt-in to a C++26 compiler+stdlib instead of C++17 TS2 + vir-simd
  • Update AMReX package in package management:

@ax3l ax3l force-pushed the topic-simd branch 3 times, most recently from 4644b47 to 5cfa97f Compare July 10, 2025 18:57
@ax3l ax3l changed the title [WIP] AMReX SIMD Helpers AMReX SIMD Helpers Jul 10, 2025
@ax3l ax3l marked this pull request as ready for review July 10, 2025 19:12
@ax3l ax3l requested a review from WeiqunZhang July 10, 2025 19:12
*/
template <int WIDTH, typename T, typename L, typename M=std::enable_if_t<std::is_integral_v<T>> >
AMREX_ATTRIBUTE_FLATTEN_FOR
void ParallelForSIMD (T n, L const& f) noexcept
Member

The design of this seems to be missing a lot to me. How does this work if SIMD is not enabled? What does this do on GPU? I think currently this cannot take a lambda as input and needs a struct with both simd and non-simd operator(). 

@ax3l ax3l Jul 10, 2025

This one is fully standalone. The SIMD header brings the types for it.

One could in general use this pattern on GPU as well, but that is not the target of this PR.

A solid helper for the copy in/out is needed. I currently have it in user code, and it will probably need documentation eventually. I do not yet have a good strategy for making this a helper inside AMReX, but maybe we will find a way later:

@ax3l ax3l Jul 15, 2025

> I think currently this cannot take a lambda as input

Alex has a draft that would change the impl to support lambda: https://godbolt.org/z/Pbzs6dnbP

I added a little improvement here: https://godbolt.org/z/7455hqrEc

> [I think this] needs a struct with both simd and non-simd operator().

Correct, because one needs to be able to resolve the remainder. We could alternatively resolve the remainder using the SIMD operator() and WIDTH=1, but in practice one will template the implementation anyway as a T_Real for CPU/GPU portability.
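
A minimal sketch of that pattern, under stated assumptions: `Chunk`, `chunked_for` and `Scale` are hypothetical names invented for this illustration, and `chunked_for` is a simplified stand-in rather than the `ParallelForSIMD` implementation from this PR. The functor offers a SIMD `operator()` for full chunks and a scalar `operator()` for the remainder, and both forward to one `compute` templated on `T_Real`.

```cpp
#include <cassert>
#include <experimental/simd>

namespace stdx = std::experimental;

// Hypothetical tag type: "WIDTH consecutive elements starting at index i".
template <int WIDTH>
struct Chunk { int i; };

// Hypothetical stand-in for a SIMD-aware loop; NOT the AMReX implementation.
template <int WIDTH, typename F>
void chunked_for (int n, F const& f)
{
    assert(n >= 0);
    int i = 0;
    for (; i + WIDTH <= n; i += WIDTH) { f(Chunk<WIDTH>{i}); }  // calls the SIMD operator()
    for (; i < n; ++i) { f(i); }                                // remainder: scalar operator()
}

struct Scale
{
    double* x;
    double factor;

    // Single implementation, templated on the "real" type (double or a SIMD vector).
    template <typename T_Real>
    void compute (T_Real& v) const { v = v * T_Real(factor); }

    // Scalar path: one element.
    void operator() (int i) const { compute(x[i]); }

    // SIMD path: load WIDTH elements, compute, write back.
    template <int WIDTH>
    void operator() (Chunk<WIDTH> c) const
    {
        stdx::fixed_size_simd<double, WIDTH> v(x + c.i, stdx::element_aligned);
        compute(v);
        v.copy_to(x + c.i, stdx::element_aligned);
    }
};
```

Calling `chunked_for<4>(10, Scale{x, 3.0})` would process elements 0 to 7 in two SIMD chunks and elements 8 and 9 through the scalar overload.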

ax3l commented Jul 11, 2025

@AlexanderSinn tested this, and it also works when called with a lambda. We could consider adding a few helpers for the aligned reads & writes to AMReX (later on?): https://godbolt.org/z/Pbzs6dnbP

ax3l commented Jul 11, 2025

I had a godbolt test link somewhere in which I ensured that the write-backs of unmodified reads are optimized out by the compiler at -O3, but I am having a hard time reproducing it in that last link: https://godbolt.org/z/P3K87rsea
(Simpler example: https://godbolt.org/z/41Teqz4bE )

That would be quite important, otherwise we have to manually track our modifications.

ax3l commented Jul 11, 2025

One can force it with metaprogramming; the user then just needs to const-qualify their declarations properly, which we usually do:
https://godbolt.org/z/4czrsEP1n

Shorter: (thx @AlexanderSinn)

```C++
template <typename R, typename... Args>
constexpr bool is_nth_arg_non_const (R(*)(Args...), int n) {
    bool val_arr[sizeof...(Args)] {!std::is_const_v<std::remove_reference_t<Args>>...};
    return val_arr[n];
}

// ...

if constexpr (is_nth_arg_non_const(modify<double>, 0))
    ad.copy_to(ap + i, stdx::vector_aligned);
```

My update to also work with functors:
https://godbolt.org/z/hv43EGEKb
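
For readers who want to try the idea without opening the links, here is a self-contained scalar sketch. The helper is the one shown above. The `modify` kernel and the `apply` driver are hypothetical stand-ins, with plain `double` copies in place of SIMD registers, so this shows only the compile-time decision and not the SIMD load/store itself.

```cpp
#include <cassert>
#include <type_traits>

// Is the n-th parameter of a free function non-const (after removing the
// reference), i.e. is the callee allowed to modify it?
template <typename R, typename... Args>
constexpr bool is_nth_arg_non_const (R(*)(Args...), int n)
{
    bool val_arr[sizeof...(Args)] {!std::is_const_v<std::remove_reference_t<Args>>...};
    return val_arr[n];
}

// Hypothetical kernel: modifies `a`, only reads `b`.
inline void modify (double& a, double const& b) { a += b; }

// Hypothetical driver: copy in, compute, and write back only what may have changed.
inline void apply (double* ap, double const* bp, int i)
{
    assert(i >= 0);
    double ad = ap[i];  // copy in (stands in for a SIMD load)
    double bd = bp[i];
    modify(ad, bd);
    // Argument 0 is declared non-const, so its write-back is compiled in.
    if constexpr (is_nth_arg_non_const(modify, 0)) { ap[i] = ad; }
    // Argument 1 is `double const&`: the helper returns false, so no write-back is needed.
    static_assert(!is_nth_arg_non_const(modify, 1));
}
```

The decision is made from the kernel's signature alone, which is why it only works if the kernel's parameters are const-qualified accurately.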

ax3l commented Jul 11, 2025

Ah nice, @AlexanderSinn just showed me that the issue is caused by the std::vector use in the example (and only for gcc; clang is fine). So I remembered my test correctly: the compiler can optimize this on its own for the usual pattern we use in AMReX, which I also implemented in ImpactX: https://godbolt.org/z/8xYxTY5Mo

@ax3l ax3l force-pushed the topic-simd branch 4 times, most recently from 0b4571a to 1710287 Compare July 14, 2025 19:20
@ax3l ax3l force-pushed the topic-simd branch 2 times, most recently from 9f4b277 to 6b3640f Compare July 14, 2025 20:50

/** ParallelFor with a SIMD Width (in elements)
*
* SIMD load/Write-back operations need to be performed before/after calling this.
Member Author

Probably good to add a C++ example snippet here in the doc string, too.

Compile-time helper that can be optionally used instead of
an `int` running index in user code, to query the SIMD WIDTH
set in `ParallelForSIMD` inside a called function.

Co-authored-by: Alexander Sinn <[email protected]>
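
The compile-time width helper described in the commit message above can be sketched roughly as follows. `IndexWithWidth`, `chunked_for_with_width` and `RecordWidth` are hypothetical names chosen for this illustration and may differ from what AMReX provides. The point is that the index type carries the width, so a single templated `operator()` can query it as a constant, and the remainder is handled by calling the same operator with a width of 1.

```cpp
#include <cassert>

// Hypothetical index type: a running index plus the chunk width as a
// compile-time constant.
template <int WIDTH, typename IndexT = int>
struct IndexWithWidth
{
    static constexpr int width = WIDTH;
    IndexT index;
};

// Hypothetical stand-in loop; NOT the AMReX implementation.
template <int WIDTH, typename F>
void chunked_for_with_width (int n, F const& f)
{
    assert(n >= 0);
    int i = 0;
    for (; i + WIDTH <= n; i += WIDTH) { f(IndexWithWidth<WIDTH>{i}); }  // full chunks
    for (; i < n; ++i) { f(IndexWithWidth<1>{i}); }                      // remainder, width 1
}

// Example callee: records, per element, the width of the call that handled it.
struct RecordWidth
{
    int* widths;

    template <typename Idx>
    void operator() (Idx idx) const
    {
        constexpr int W = Idx::width;  // usable as a compile-time constant in the callee
        for (int lane = 0; lane < W; ++lane) { widths[idx.index + lane] = W; }
    }
};
```

With `n = 10` and a width of 4, elements 0 to 7 are handled by width-4 calls and elements 8 and 9 by width-1 calls.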
Add support for mixed vector/scalar operations.
Sufficient for ImpactX `Multipole` elements.
@WeiqunZhang WeiqunZhang left a comment

LGTM. Just another minor comment.

@WeiqunZhang WeiqunZhang merged commit 2211077 into AMReX-Codes:development Jul 16, 2025
75 checks passed
zingale commented Jul 17, 2025

This broke Castro/Microphysics compilation. Is there a way to opt out of this?

WeiqunZhang added a commit to WeiqunZhang/amrex that referenced this pull request Jul 17, 2025
PR AMReX-Codes#4520 broke Microphysics's autodiff math. It could be fixed on the
Microphysics side. However, we can fix it on the amrex side and we should
use std::enable_if anyway to limit the data types.
@ax3l ax3l deleted the topic-simd branch July 17, 2025 02:11
ax3l commented Jul 17, 2025

@zingale oh no, how did it break?

Fix in #4570, but maybe we can understand what breaks and update downstream?

@WeiqunZhang

It's because of using amrex::Math::sincos and they have their own sincos.
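
As a general illustration of the `std::enable_if` approach mentioned in the fix commit message (this is not the actual change in #4570, and the namespaces `lib` and `downstream` and the `Dual` type are hypothetical), constraining a templated `sincos` to floating-point types removes it from overload resolution for any other type, so a downstream type with its own `sincos` is never offered the library's version:

```cpp
#include <cmath>
#include <type_traits>
#include <utility>

namespace lib {
    // Constrained template: only participates in overload resolution
    // when T is a floating-point type.
    template <typename T, std::enable_if_t<std::is_floating_point_v<T>, int> = 0>
    std::pair<T, T> sincos (T x) { return {std::sin(x), std::cos(x)}; }
}

namespace downstream {
    // Hypothetical autodiff-style number: value and derivative.
    struct Dual { double val; double der; };

    // The downstream project's own sincos for its type.
    inline std::pair<Dual, Dual> sincos (Dual const& d)
    {
        double const s = std::sin(d.val);
        double const c = std::cos(d.val);
        return {Dual{s, c * d.der}, Dual{c, -s * d.der}};
    }
}

// Detection helper: is lib::sincos callable with a T?
template <typename T, typename = void>
struct lib_sincos_accepts : std::false_type {};

template <typename T>
struct lib_sincos_accepts<T, std::void_t<decltype(lib::sincos(std::declval<T>()))>>
    : std::true_type {};

static_assert( lib_sincos_accepts<double>::value);            // floating point: accepted
static_assert(!lib_sincos_accepts<downstream::Dual>::value);  // other types: not a candidate
```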

ax3l added a commit that referenced this pull request Jul 17, 2025
PR #4520 broke Microphysics's autodiff math. It could be fixed on the
Microphysics side. However, we can fix it on the amrex side and we
should use std::enable_if anyway to limit the data types.

---------

Co-authored-by: Axel Huebl <[email protected]>
ax3l added a commit to BLAST-WarpX/warpx that referenced this pull request Jul 17, 2025
Basic build system support for AMReX SIMD in superbuilds.

See:
- AMReX-Codes/amrex#4520 (requires this PR)
- BLAST-ImpactX/impactx#1002 (needed for this PR
to be finished)
ax3l commented Jul 17, 2025

Fix merged in #4570

ax3l commented Jul 31, 2025

Update: This functionality made ImpactX astronomically faster on CPU for compute-heavy/bound kernels:
BLAST-ImpactX/impactx#1002 (comment)

@ax3l ax3l mentioned this pull request Jan 27, 2026
5 tasks
ax3l added a commit that referenced this pull request Feb 5, 2026
## Add `ParallelForSIMD<T>`

This adds another template overload to `ParallelForSIMD`.

A typical user pattern for maximum control so far is:
```C++
#ifdef AMREX_USE_SIMD
if constexpr (amrex::simd::is_vectorized<T>) {
    amrex::ParallelForSIMD<T::simd_width>(np, pushSingleParticle);
} else
#endif
{
    amrex::ParallelFor(np, pushSingleParticle);  // GPU & non-SIMD CPU
}
```

This simplifies it to:
```C++
amrex::ParallelForSIMD<T>(np, pushSingleParticle);
```
indicating there _might_ be a SIMD path if `T` (e.g., a functor)
implements it.

One can still call `ParallelForSIMD` with an explicit SIMD width (int),
as before.

## Generalized Particle Load/Store

A typical SIMD user pattern for particle SoA kernels was:
```C++
SIMDParticleReal<SIMD_WIDTH> part_x;
part_x.copy_from(&m_part_x[i], stdx::element_aligned);

el.compute(part_x);

#ifdef AMREX_USE_SIMD
if constexpr (is_nth_arg_non_const<&el::compute, n>)
    part_x.copy_to(&m_part_x[i], stdx::element_aligned);
#endif
```

This simplifies it to:
```C++
decltype(auto) x = load_1d(m_part_x, i);

el.compute(x);

store_1d<&el::compute, 0>(x, m_part_x, i);
```
and can now also be used for the GPU path, where for now the load is a
transparent pointer forward/deref and the store is a no-OP.

## Combined

Using these patterns together, one can now write single-source
SIMD-CPU/non-SIMD-CPU/GPU kernels, e.g.,
BLAST-ImpactX/impactx#1279

Follow-up to #4520

## Checklist

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [x] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [x] include documentation in the code and/or rst files, if appropriate
@ax3l ax3l mentioned this pull request Feb 6, 2026
5 tasks
ax3l added a commit that referenced this pull request Feb 7, 2026
## Summary

Needed to cover SIMD features and generic SIMD/non-SIMD patterns.
Bravely vibe coded, but now reviewed and improved for sensibility.

## Additional background

#4924 #4607 #4600  #4520

## Checklist

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [ ] include documentation in the code and/or rst files, if appropriate

---------

Co-authored-by: Weiqun Zhang <[email protected]>