AMReX SIMD Helpers #4520
Src/Base/AMReX_GpuLaunchFunctsC.H
```C++
template <int WIDTH, typename T, typename L, typename M=std::enable_if_t<std::is_integral_v<T>> >
AMREX_ATTRIBUTE_FLATTEN_FOR
void ParallelForSIMD (T n, L const& f) noexcept
```
The design of this seems to be missing a lot to me. How does this work if SIMD is not enabled? What does this do on GPU? I think currently this cannot take a lambda as input and needs a struct with both simd and non-simd operator().
This one is fully standalone. The SIMD header brings in the types for it.
One could in general use this pattern on GPU as well, but that is not the target of this PR.
A solid helper for the copy in/out is still needed, which I currently have in user code and which will probably need documentation eventually. I do not yet have a good strategy for how to make this a helper inside AMReX, but maybe later we will know a way:
- https://github.com/BLAST-ImpactX/impactx/blob/629b3d884c3ca974d75ee99d3a0a494c3dc0e6f8/src/elements/mixin/beamoptic.H#L105-L151
- https://github.com/BLAST-ImpactX/impactx/blob/629b3d884c3ca974d75ee99d3a0a494c3dc0e6f8/src/elements/mixin/beamoptic.H#L172-L198
> I think currently this cannot take a lambda as input
Alex has a draft that would change the impl to support lambda: https://godbolt.org/z/Pbzs6dnbP
I added a little improvement here: https://godbolt.org/z/7455hqrEc
> [I think this] needs a struct with both simd and non-simd operator().
Correct, because one needs to be able to resolve the remainder. We could alternatively resolve the remainder using the SIMD operator() and WIDTH=1, but in practice one will template the implementation anyway as a T_Real for CPU/GPU portability.
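To make the remainder handling concrete, here is a minimal, self-contained sketch (not the AMReX implementation; all names are made up for illustration) of a chunked loop that dispatches whole chunks to a SIMD-style `operator()` and the tail to the scalar `operator()`:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical sketch: chunk the trip count by WIDTH and dispatch full
// chunks to the functor's "SIMD" operator(); the remainder falls back to
// the scalar operator() (equivalent to WIDTH=1).
template <int WIDTH, typename F>
void parallel_for_simd_sketch (int n, F const& f)
{
    int i = 0;
    for (; i + WIDTH <= n; i += WIDTH) {
        f(i, std::integral_constant<int, WIDTH>{}); // full SIMD chunk
    }
    for (; i < n; ++i) {
        f(i); // scalar remainder
    }
}

// A functor with both flavors, as the review comment describes.
struct Saxpy {
    float a; float const* x; float* y;
    // scalar path (also used for the remainder)
    void operator() (int i) const { y[i] += a * x[i]; }
    // "SIMD" path: a plain unrolled chunk standing in for stdx::simd math
    template <int W>
    void operator() (int i, std::integral_constant<int, W>) const {
        for (int k = 0; k < W; ++k) { y[i+k] += a * x[i+k]; }
    }
};
```

With `n = 10` and `WIDTH = 4`, two full chunks go through the SIMD overload and indices 8 and 9 through the scalar one.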
@AlexanderSinn tested it, and it also works when called as a lambda; we could consider adding a few helpers for the aligned reads & writes to AMReX (later on?): https://godbolt.org/z/Pbzs6dnbP
I had a godbolt test link somewhere, where I ensured that the writes of non-modified reads are optimized out by the compiler. That would be quite important, otherwise we have to manually track our modifications.
One can force it with metaprogramming, and then the user just needs to […]. Shorter version (thx @AlexanderSinn):
```C++
template <typename R, typename... Args>
constexpr bool is_nth_arg_non_const (R(*)(Args...), int n) {
    bool val_arr[sizeof...(Args)] {!std::is_const_v<std::remove_reference_t<Args>>...};
    return val_arr[n];
}
// ...
if constexpr (is_nth_arg_non_const(modify<double>, 0))
    ad.copy_to(ap + i, stdx::vector_aligned);
```
My update to also work with functors:
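A self-contained check of that helper may clarify what it computes; the kernels `read_only` and `read_write` below are made up for the demonstration:

```cpp
#include <cassert>
#include <type_traits>

// Helper from the comment above: is the n-th parameter of a free function
// taken by non-const (i.e., writable) reference?
template <typename R, typename... Args>
constexpr bool is_nth_arg_non_const (R(*)(Args...), int n) {
    bool val_arr[sizeof...(Args)] {!std::is_const_v<std::remove_reference_t<Args>>...};
    return val_arr[n];
}

// Hypothetical kernels for the demonstration
void read_only  (double const& x, double& y) { y = 2.0 * x; }
void read_write (double& x, double const& y) { x += y; }

static_assert(!is_nth_arg_non_const(&read_only, 0));  // arg 0 is const&: no write-back
static_assert( is_nth_arg_non_const(&read_only, 1));  // arg 1 is mutable: write-back
static_assert( is_nth_arg_non_const(&read_write, 0)); // arg 0 is mutable: write-back
```

The `copy_to` write-back can then be compiled out at zero cost whenever the kernel only reads a parameter.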
Ah nice, and @AlexanderSinn just showed me that the issue is because of the […]
```C++
/** ParallelFor with a SIMD Width (in elements)
 *
 * SIMD load/Write-back operations need to be performed before/after calling this.
 */
```
Probably good to add a C++ example snippet here in the doc string, too.
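For reference, such a doc-string snippet could look roughly like the following. This is a sketch against `<experimental/simd>` directly (a plain function stands in for the `ParallelForSIMD` body, and the kernel and array names are placeholders, not AMReX API):

```cpp
#include <experimental/simd>
namespace stdx = std::experimental;

constexpr int WIDTH = 4;
using VReal = stdx::fixed_size_simd<double, WIDTH>;

// One SIMD chunk of a particle push: SIMD load before and write-back after
// the computation, as the doc string requires. ParallelForSIMD would invoke
// this body with i = 0, WIDTH, 2*WIDTH, ...
void push_chunk (double* x, double const* v, double dt, int i)
{
    VReal xs, vs;
    xs.copy_from(x + i, stdx::element_aligned); // SIMD load
    vs.copy_from(v + i, stdx::element_aligned);
    xs += vs * dt;                              // vectorized compute
    xs.copy_to(x + i, stdx::element_aligned);   // SIMD write-back
}
```

Note that `<experimental/simd>` ships with libstdc++ (GCC 11+); the vir-simd library provides a fallback elsewhere.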
Compile-time helper that can be optionally used instead of an `int` running index in user code, to query the SIMD WIDTH set in `ParallelForSIMD` inside a called function.

Co-authored-by: Alexander Sinn <[email protected]>
Add support for mixed vector/scalar operations. Sufficient for ImpactX `Multipole` elements.
WeiqunZhang left a comment
LGTM. Just another minor comment.
This broke Castro/Microphysics compilation. Is there a way to opt out of this?
PR AMReX-Codes#4520 broke Microphysics's autodiff math. It could be fixed on the Microphysics side. However, we can fix it on the amrex side and we should use std::enable_if anyway to limit the data types.
It's because of […]
PR #4520 broke Microphysics's autodiff math. It could be fixed on the Microphysics side. However, we can fix it on the amrex side and we should use std::enable_if anyway to limit the data types.

Co-authored-by: Axel Huebl <[email protected]>
Basic build system support for AMReX SIMD in superbuilds. See:
- AMReX-Codes/amrex#4520 (requires this PR)
- BLAST-ImpactX/impactx#1002 (needed for this PR to be finished)
Fix merged in #4570
Update: This functionality made ImpactX astronomically faster on CPU for compute-heavy (compute-bound) kernels:
## Add `ParallelForSIMD<T>`
This adds another template overload to `ParallelForSIMD`.
A typical user pattern for maximum control so far is:
```C++
#ifdef AMREX_USE_SIMD
if constexpr (amrex::simd::is_vectorized<T>) {
amrex::ParallelForSIMD<T::simd_width>(np, pushSingleParticle);
} else
#endif
{
amrex::ParallelFor(np, pushSingleParticle); // GPU & non-SIMD CPU
}
```
This simplifies it to:
```C++
amrex::ParallelForSIMD<T>(np, pushSingleParticle);
```
indicating there _might_ be a SIMD path if `T` (e.g., a functor)
implements it.
One can still call `ParallelForSIMD` with an explicit SIMD width (int),
as before.
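To illustrate the dispatch described above, here is a minimal, self-contained sketch (hypothetical names, not the AMReX internals): a single entry point that takes the SIMD path only when the functor type advertises a compile-time `simd_width`, and otherwise falls back to the scalar loop:

```cpp
#include <cassert>
#include <type_traits>

// Detect whether F advertises a compile-time simd_width (sketch of the
// is_vectorized idea from the user pattern above).
template <typename F, typename = void>
struct is_vectorized : std::false_type {};
template <typename F>
struct is_vectorized<F, std::void_t<decltype(F::simd_width)>> : std::true_type {};

// Single entry point: SIMD chunks when available, scalar otherwise.
template <typename F>
void parallel_for_simd_sketch (int n, F const& f)
{
    if constexpr (is_vectorized<F>::value) {
        constexpr int W = F::simd_width;
        int i = 0;
        for (; i + W <= n; i += W) { f.simd(i); } // vector chunks
        for (; i < n; ++i)         { f(i); }      // scalar remainder
    } else {
        for (int i = 0; i < n; ++i) { f(i); }     // GPU / non-SIMD CPU analogue
    }
}

// Example functor implementing both paths
struct Doubler {
    static constexpr int simd_width = 4;
    double* y;
    void operator() (int i) const { y[i] *= 2.0; }                            // scalar
    void simd (int i) const {                                                 // "SIMD"
        for (int k = 0; k < simd_width; ++k) { y[i+k] *= 2.0; }
    }
};
```

A functor without a `simd_width` member would compile against the same entry point and simply take the scalar branch, which is the "might be a SIMD path" behavior the PR text describes.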
## Generalized Particle Load/Store
A typical SIMD user pattern for particle SoA kernels was:
```C++
SIMDParticleReal<SIMD_WIDTH> part_x;
part_x.copy_from(&m_part_x[i], stdx::element_aligned);
el.compute(part_x);
#ifdef AMREX_USE_SIMD
if constexpr (is_nth_arg_non_const<&el::compute, n>)
part_x.copy_to(&m_part_x[i], stdx::element_aligned);
#endif
```
This simplifies it to:
```C++
decltype(auto) x = load_1d(m_part_x, i);
el.compute(x);
store_1d<&el::compute, 0>(x, m_part_x, i);
```
and can now also be used for the GPU path, where for now the load is a transparent pointer forward/dereference and the store is a no-op.
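A sketch of what the scalar/GPU flavor of those helpers could look like (hypothetical shims mirroring the description, not the AMReX implementation): the load hands out a plain reference, so the kernel writes through it directly, and the store does nothing:

```cpp
#include <cassert>

// Scalar/GPU flavor sketch: load_1d forwards a reference and store_1d is a
// no-op, since the kernel already wrote through the reference. The SIMD
// flavor would instead return a stdx::simd copy and write it back here.
template <typename T>
decltype(auto) load_1d (T* ptr, int i) { return ptr[i]; } // returns T&

template <auto MemFn, int N, typename V, typename T>
void store_1d (V const&, T*, int) { /* no-op on this path */ }

// Hypothetical element for the demonstration
struct Pusher {
    void compute (double& x) const { x *= 2.0; }
};
```

Because `load_1d` returns `T&` via `decltype(auto)`, the same `load/compute/store` kernel source compiles on both paths with zero overhead here.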
## Combined
Using these patterns together, one can now write single-source
SIMD-CPU/non-SIMD-CPU/GPU kernels, e.g.,
BLAST-ImpactX/impactx#1279
Follow-up to #4520
## Checklist
The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [x] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [x] include documentation in the code and/or rst files, if appropriate
## Summary

Needed to cover SIMD features and generic SIMD/non-SIMD patterns. Bravely vibe coded, but now reviewed and improved for sensibility.

## Additional background

#4924 #4607 #4600 #4520

## Checklist

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate

Co-authored-by: Weiqun Zhang <[email protected]>
Summary
AMReX does not yet have a concept to help users write effective SIMD code on CPU, besides relying on auto-vectorization and pragmas, which are unreliable for any sufficiently complex code. [1]
Luckily, C++ `std::datapar` was just accepted into C++26, which gives an easy in to write portable SIMD/scalar code. Yet, I did not find a compiler/stdlib with support for it yet, so I finally played with the C++17 `<experimental/simd>` headers, which are not as complete as C++26 but a good entry point, especially if complemented with the https://github.com/mattkretz/vir-simd library.

This PR adds initial support for portable user code by providing:
- a build option `AMReX_SIMD` (default is OFF), relying on vir-simd
- an `AMReX_SIMD.H` header that handles includes & helper types
- `ParallelForSIMD<SIMD_WIDTH>(...)`

Additional background
[1] Fun fact one: As written in the story behind Intel's ispc compiler and credited to Tim Foley, auto-vectorization is not a programming model.
Fun fact two: This is as ad-hoc as the implementation for data parallel types / SIMD in Kokkos, it seems.
User-Code Examples & Benchmark
Please see this ImpactX PR for details.
Checklist
- `vir::stdx::simd` test in CI
- `AMReX_SIMD.H`
- `ParallelForSIMD`
- `ParticleIdWrapper::make_(in)valid(mask)`
- `sincos` support
- `SmallMatrix`
- `GpuComplex` (minimal)
- `ParallelForSIMD`
- `vir::stdx::simd` in package managers:

Future Ideas / PRs
- `ParticleIdWrapper`/`ParticleCpuWrapper`
- `GpuComplex<SIMD>`
- `ParallelFor` ND support
- `ParallelFor`/`ParallelForSIMD`: one could, maybe, with enable-if magic, etc., fuse them into a single name again
- vir-simd auto-download for convenience (opt-out)