## Summary

AMReX does not yet have a concept to help users write effective SIMD code on CPU, besides relying on auto-vectorization and pragmas, which are unreliable for any sufficiently complex code. [1]

Luckily, C++ `std::datapar` was just accepted into C++26, which gives an easy way in to write portable SIMD/scalar code. Yet I have not found a compiler/stdlib with support for it, so I finally had a play with the C++17 `<experimental/simd>` headers, which are not as complete as C++26 but a good start, especially when complemented with the https://github.com/mattkretz/vir-simd library.

This PR adds initial support for portable user code by providing:

- build system support: `AMReX_SIMD` (default is OFF), relying on [vir-simd](https://github.com/mattkretz/vir-simd)
- an `AMReX_SIMD.H` header that handles includes & helper types
- `ParallelForSIMD<SIMD_WIDTH>(...)`

## Additional background

[1] Fun fact one: As written in the [story behind Intel's ispc compiler](https://pharr.org/matt/blog/2018/04/18/ispc-origins) and credited to [Tim Foley](http://graphics.stanford.edu/~tfoley/), *auto-vectorization is not a programming model.*

Fun fact two: This is as ad-hoc as the implementation for [data parallel types / SIMD in Kokkos](https://kokkos.org/kokkos-core-wiki/API/simd/simd.html), it seems.
## User-Code Examples & Benchmark

[Please see this ImpactX PR for details.](BLAST-ImpactX/impactx#1002)

## Checklist

- [x] clean up commits (separate commits)
- [x] finalize fallbacks & CI checks
- [ ] add a `vir::stdx::simd` test in CI
- [x] CMake
- [ ] GnuMake
- [x] `AMReX_SIMD.H`
- [x] `ParallelForSIMD`
- [x] `ParticleIdWrapper::make_(in)valid(mask)`
- [x] clean up `sincos` support
- [x] `SmallMatrix`
- [x] Support for `GpuComplex` (minimal)
- [x] Support [passing WIDTH as compile-time meta-data](https://godbolt.org/z/7455hqrEc) to callee in `ParallelForSIMD`
- [ ] include documentation in the code and/or rst files, if appropriate
- [x] add `vir::stdx::simd` in package managers:
  - [x] Spack [vir-simd](spack/spack-packages#332)
  - [x] Conda [vir-simd](conda-forge/staged-recipes#30377)

## Future Ideas / PRs

- allocate particle arrays aligned so we can use [stdx::vector_aligned](https://en.cppreference.com/w/cpp/experimental/simd/vector_aligned.html) (for [copies](https://en.cppreference.com/w/cpp/experimental/simd/simd/copy_from) into/out of vector registers; note: makes no difference anymore on modern CPUs)
- Support more/all functions in `ParticleIdWrapper`/`ParticleCpuWrapper`
- Support for [vir::simdize<std::complex<T>>](mattkretz/vir-simd#42) instead of `GpuComplex<SIMD>`
- `ParallelFor` ND support
- `ParallelFor`/`ParallelForSIMD`: one could, maybe, with enable-if magic, etc., fuse them into a single name again
- CMake superbuild: `vir-simd` auto-download for convenience (opt-out)
- Build system: "SIMD provider" selection, once we can opt in to a C++26 compiler+stdlib instead of C++17 TS2 + vir-simd
- Update AMReX package in package management:
  - Spack [vir-simd](spack/spack-packages#332)
  - Conda [vir-simd](conda-forge/staged-recipes#30377)

---------

Co-authored-by:
Alexander Sinn <[email protected]>
Basic build system support for AMReX SIMD in superbuilds. See:

- AMReX-Codes/amrex#4520 (requires this PR)
- BLAST-ImpactX/impactx#1002 (needed for this PR to be finished)
```diff
  // if the aperture is periodic, shift sx,sy coordinates to the fundamental domain
- amrex::ParticleReal u = (m_repeat_x==0.0_prt) ? sx : (std::fmod(std::abs(sx)+dx,m_repeat_x)-dx);
- amrex::ParticleReal v = (m_repeat_y==0.0_prt) ? sy : (std::fmod(std::abs(sy)+dy,m_repeat_y)-dy);
+ T_Real u = (m_repeat_x == 0_prt) ? sx : (fmod(abs(sx)+dx, m_repeat_x)-dx);
+ T_Real v = (m_repeat_y == 0_prt) ? sy : (fmod(abs(sy)+dy, m_repeat_y)-dy);
```

Check notice: Code scanning / CodeQL: Equality test on floating-point values
Ok to run through the invalid branch afterwards; in fact we must, even for the valid particles in the SIMD lane.
Biggest wins so far (SP, before/after PR):
There are also a few very simple elements (that are not compute bound) where the benchmarks show that the packing/unpacking into vector registers is not amortized, so we should not vectorize those. (Small harm, because they are so quick, but still simple to avoid.) Those are:
Double Precision

Machine: Dane (LLNL)

pytest-benchmark compare --group-by=name --columns=min,stddev --sort=name .benchmarks/Linux-CPython-3.11-64bit/000*
pytest-benchmark compare --group-by=name --sort=name --histogram=dp_histo .benchmarks/Linux-CPython-3.11-64bit/000*
```diff
  // assign intermediate parameter
- amrex::ParticleReal const step = slice_ds /std::sqrt(powi<2>(pt)-1.0_prt);
+ amrex::ParticleReal const step = slice_ds / std::sqrt(powi<2>(pt)-1.0_prt);
```
Should this be sqrt instead of std::sqrt here?
Perhaps not--this is still of type amrex::ParticleReal and not T_Real.
Not here, because it is not in the particle tracking operator(). I changed this white space by accident (sorry for the noise).
cemitch99 left a comment:

The updated logic looks reasonable to me--I did not find any obvious place where the math should break. I made only minor comments. Suggest one of our experienced AMReX colleagues take a look.
"Simple" elements, where vectorization packing/unpacking causes a slight overhead. I will revert those to stay with a serial implementation. Note that some of the benchmarks can be tainted by the well-known Intel AVX-512 frequency drops in the benchmarks, but even if they are there is no huge issue to exclude them for now from vectorization, because they are super-fast ("simple") elements. SP
DP
Add inline comments on TODOs that we will explore in follow-ups.
@atmyers @AlexanderSinn @WeiqunZhang let me know if one of you wants to take a final look; good to go from our end otherwise.
Until recently (see AMReX PR AMReX-Codes/amrex#4520), AMReX did not have a concept to help users write effective SIMD code on CPU, besides relying on auto-vectorization and pragmas, which are unreliable for any sufficiently complex code. [1]
Luckily, C++ `std::datapar`/`std::simd` was just accepted into C++26, which gives an easy way in to write portable SIMD/scalar code. Yet I have not found a compiler/stdlib with support for it, so I implemented in AMReX, using the C++17 `<experimental/simd>` headers, a new addition to the `ParallelFor` performance portability construct, relying on the https://github.com/mattkretz/vir-simd library.

This PR vectorizes CPU code in ImpactX for particle tracking (element pushes), for both the NOACC and OpenMP compute backends, keeping our portable, single-source approach.
Additional background
The implementation here works with any kind of vectorization (if implemented by the respective compiler): SSE(2), AVX(2), AVX-512, Neon, ... 512-bit vector registers are the widest currently available, able to push up to 8 (DP) or 16 (SP) particles in parallel, where a non-data-parallel (non-SIMD) implementation would push just 1 (in the worst case, for any sufficiently complex code [1]).
[1] As written in the story behind Intel's ispc compiler and credited to Tim Foley, auto-vectorization is not a programming model.
Vectorization is really a technique that primarily benefits elements that are very compute heavy (i.e., not memory-bandwidth bound). Not coincidentally, ImpactX focuses on symplectic (high-order/exact) methods for chromatic and exact tracking, where this is exactly the case. Optimizations like vectorization thus make a huge impact for our community/users/time-to-solution and address exactly our most costly tracking methods, making it highly beneficial for our users to rely on them and still get fast results when needed.
Related Reads for AMD/Intel CPUs with AVX-512
For AVX-512, Intel famously messed up the clock frequency/throughput in their early/current chips. That means one should run benchmarks long and with many particles for reliable results, but I kept it simple above. The consequence is that where SIMD sometimes "looks meh/bad", it could be (even) better in practice.
AMD Zen5 CPUs do not have that flaw in their AVX-512 implementation.
Details: https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavior
To Do
Benchmarks on Dane (LLNL)
Machine: Dane (LLNL), `-march=sapphirerapids` CPU
Compiler: gcc/13.3.1
Vector registers: 512 bit (i.e., 8 doubles or 16 floats)
Development:
SIMD Branch:
Single Precision
sp: single precision
nm: normal/default math
fm: fast-math
d(ev): development branch
s(imd): the branch in this PR

pytest-benchmark compare --group-by=name --columns=min,stddev --sort=name .benchmarks/Linux-CPython-3.11-64bit/000*
pytest-benchmark compare --group-by=name --sort=name --histogram=sp_histo .benchmarks/Linux-CPython-3.11-64bit/000*

Benchmarks on Perlmutter (NERSC)
Machine: Perlmutter (NERSC)
Compiler: ...
Vector registers: 256 bit (i.e., 4 doubles or 8 floats)
Note: Perlmutter CPUs (AMD Zen 3) do not support AVX-512; AMD introduced 512 bit (i.e., 8 doubles) wide vector registers only in Zen 5.
Boring old CPU :D
Required PRs