CPU SIMD Support #1002

Merged
ax3l merged 8 commits into BLAST-ImpactX:development from ax3l:topic-simd
Aug 1, 2025

@ax3l ax3l commented Jun 23, 2025

Until recently (see AMReX PR AMReX-Codes/amrex#4520), AMReX did not have a concept to help users write effective SIMD code on CPU, besides relying on auto-vectorization and pragmas, which are unreliable for sufficiently complex code. [1]

Luckily, C++ std::datapar/std::simd was just accepted into C++26, which gives an easy way in to writing portable SIMD/scalar code. However, I did not find a compiler/stdlib with support for it yet, so I implemented a new addition to the ParallelFor performance portability construct in AMReX, using the C++17 <experimental/simd> headers and the https://github.com/mattkretz/vir-simd library.

This PR vectorizes the CPU code for particle tracking (element pushes) in ImpactX, for both the NOACC and OpenMP compute backends, while keeping our portable, single-source approach.
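To illustrate what "single-source" can mean here, the following is a minimal sketch, not the ImpactX or vir-simd code: one kernel templated on its real type runs unchanged for a scalar `double` and for a toy fixed-width pack. `Pack` and `drift_like_push` are hypothetical names; real code would use a SIMD type from `<experimental/simd>` / vir-simd instead of `Pack`.

```cpp
#include <cstddef>

// Hypothetical toy stand-in for a SIMD vector type: N values handled together.
// Only the two operators the kernel below needs are provided.
template <typename T, std::size_t N>
struct Pack
{
    T v[N];

    friend constexpr Pack operator+ (Pack a, Pack const & b)
    {
        for (std::size_t i = 0; i < N; ++i) { a.v[i] += b.v[i]; }
        return a;
    }

    friend constexpr Pack operator* (T s, Pack a)
    {
        for (std::size_t i = 0; i < N; ++i) { a.v[i] *= s; }
        return a;
    }
};

// Single-source kernel: x <- x + ds * px, written once for any T_Real
// that supports + and scalar * (a plain double or a Pack).
template <typename T_Real, typename T_Scalar>
constexpr T_Real drift_like_push (T_Real x, T_Real px, T_Scalar ds)
{
    return x + ds * px;
}

// Same kernel, two instantiations.
constexpr Pack<double, 4> x4  = {{1.0, 2.0, 3.0, 4.0}};
constexpr Pack<double, 4> px4 = {{2.0, 2.0, 2.0, 2.0}};
constexpr Pack<double, 4> r4  = drift_like_push(x4, px4, 0.5);

static_assert(drift_like_push(1.0, 2.0, 0.5) == 2.0, "scalar instantiation");
static_assert(r4.v[0] == 2.0 && r4.v[3] == 5.0, "pack instantiation");
```

The point of the design is that the physics is written once; only the type passed in decides whether one particle or several are pushed per call.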

Additional background

The implementation here works with any kind of vectorization that the respective compiler implements: SSE(2), AVX(2), AVX-512, Neon, etc. 512-bit wide vector registers are the largest currently available and can push up to 8 (DP) or 16 (SP) particles in parallel, whereas a non-data-parallel (non-SIMD) implementation pushes just 1 (in the worst case, for sufficiently complex code [1]).
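The lane counts quoted above follow from dividing the register width by the size of one value. A minimal sketch of that arithmetic, assuming a 4-byte `float` and an 8-byte `double`; the helper name `lanes_per_register` is made up for illustration:

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: how many values of type T fit into one vector
// register that is `register_bits` wide.
template <typename T>
constexpr std::size_t lanes_per_register (std::size_t register_bits)
{
    return register_bits / (CHAR_BIT * sizeof(T));
}

// 512-bit registers (e.g., AVX-512)
static_assert(lanes_per_register<double>(512) == 8,  "8 doubles per 512-bit register");
static_assert(lanes_per_register<float>(512)  == 16, "16 floats per 512-bit register");

// 256-bit registers (e.g., AVX2)
static_assert(lanes_per_register<double>(256) == 4, "4 doubles per 256-bit register");
static_assert(lanes_per_register<float>(256)  == 8, "8 floats per 256-bit register");
```

These lane counts are an upper bound on the speedup from vectorization alone; the benchmarks show where that bound is approached and where packing overhead dominates.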

[1] As written in the story behind Intel's ispc compiler and credited to Tim Foley, auto-vectorization is not a programming model.

Vectorization is a technique that primarily benefits elements that are very compute heavy (i.e., not memory-bandwidth bound). Not coincidentally, ImpactX focuses on symplectic (high-order/exact) methods for chromatic and exact tracking, where this is exactly the case. Optimizations like vectorization thus make a huge impact on time-to-solution for our community/users, and they address exactly our most costly tracking methods, making it highly beneficial for our users to rely on these methods and still get fast results when needed.

Related Reads for AMD/Intel CPUs with AVX-512

For AVX-512, Intel famously messed up the clock frequency/throughput behavior in their early/current chips. That means that in benchmarks one should run long and with many particles to get reliable results, but I kept it simple here. As a consequence, cases where SIMD "looks meh/bad" could be (even) better in practice.
AMD Zen5 CPUs do not have that flaw in their AVX-512 implementation.
Details: https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavior

To Do

  • update install docs
  • update CI (partially): SIMD, ideally even single-precision (uncovers more issues at compile-time)
  • undo vectorization for a few simple elements that are far from compute bound (see below)

Benchmarks on Dane (LLNL)

Machine: Dane (LLNL), -march=sapphirerapids CPU
Compiler: gcc/13.3.1
Vector registers: 512 bit (i.e., 8 doubles or 16 floats)

module load gcc/13.3.1
module load hdf5-serial
module load cmake/3.26.3
module load fftw/3.3.10
module load python/3.11.5

wget https://github.com/mattkretz/vir-simd/archive/refs/tags/v0.4.4.tar.gz
tar -xvf v0.4.4.tar.gz
rm -rf v0.4.4.tar.gz
cmake -S vir-simd-0.4.4 -B vir-simd-build -DCMAKE_INSTALL_PREFIX=$HOME/sw/vir-simd
cmake --build vir-simd-build --target install
rm -rf vir-simd-0.4.4
export CMAKE_PREFIX_PATH=$HOME/sw/vir-simd:${CMAKE_PREFIX_PATH}

alias getNode="srun --time=1:00:00 --nodes=1 --ntasks-per-node=1 --cpus-per-task=56 -p pdebug --pty bash"

# fix system defaults: do not escape $ with a \ on tab completion
shopt -s direxpand

export CXXFLAGS="-march=sapphirerapids"
export CFLAGS="-march=sapphirerapids"

export CC=$(which gcc)
export CXX=$(which g++)
git clone https://github.com/BLAST-ImpactX/impactx.git ${HOME}/src/impactx

rm -rf ${HOME}/venvs/impactx-dane
python3 -m venv ${HOME}/venvs/impactx-dane
source ${HOME}/venvs/impactx-dane/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade build
python3 -m pip install --upgrade packaging
python3 -m pip install --upgrade wheel
python3 -m pip install --upgrade setuptools[core]
python3 -m pip install --upgrade numpy
python3 -m pip install --upgrade pandas
python3 -m pip install --upgrade pytest
python3 -m pip install --upgrade pytest-benchmark
python3 -m pip install --upgrade scipy
python3 -m pip install --upgrade -r ${HOME}/src/impactx/requirements.txt

rm -rf $HOME/.benchmarks

Development:

rm -rf ${HOME}/src/impactx
git clone https://github.com/BLAST-ImpactX/impactx.git ${HOME}/src/impactx

SIMD Branch:

rm -rf ${HOME}/src/impactx
git clone -b topic-simd https://github.com/ax3l/impactx.git ${HOME}/src/impactx

Single Precision

  • script: SP.txt
  • results: sp-benchmarks.tar.gz
  • abbreviations:
    • sp: single precision
    • nm: normal/default math
    • fm: fast-math
    • d(ev): development branch
    • s(imd): the branch in this PR
pytest-benchmark compare --group-by=name --columns=min,stddev --sort=name .benchmarks/Linux-CPython-3.11-64bit/000*
---------------- benchmark 'test_Aperture': 4 tests ---------------
Name (time in ms)                   Min            StdDev          
-------------------------------------------------------------------
test_Aperture (0001_sp-nm-d)     8.1991 (5.96)     0.0148 (1.0)    
test_Aperture (0002_sp-nm-s)     1.3747 (1.0)      0.0328 (2.23)   
test_Aperture (0003_sp-fm-d)     7.7328 (5.63)     0.0380 (2.58)   
test_Aperture (0004_sp-fm-s)     1.4767 (1.07)     0.0421 (2.85)   
-------------------------------------------------------------------

--------------- benchmark 'test_Buncher': 4 tests ----------------
Name (time in ms)                  Min            StdDev          
------------------------------------------------------------------
test_Buncher (0001_sp-nm-d)     1.2732 (1.25)     0.0236 (1.0)    
test_Buncher (0002_sp-nm-s)     1.3347 (1.31)     0.0339 (1.43)   
test_Buncher (0003_sp-fm-d)     1.0221 (1.0)      0.0275 (1.16)   
test_Buncher (0004_sp-fm-s)     1.4687 (1.44)     0.0399 (1.69)   
------------------------------------------------------------------

---------------- benchmark 'test_CFbend': 4 tests ---------------
Name (time in ms)                 Min            StdDev          
-----------------------------------------------------------------
test_CFbend (0001_sp-nm-d)     1.3791 (1.03)     0.0340 (1.0)    
test_CFbend (0002_sp-nm-s)     1.3387 (1.0)      0.0345 (1.01)   
test_CFbend (0003_sp-fm-d)     1.3870 (1.04)     0.0390 (1.15)   
test_CFbend (0004_sp-fm-s)     1.5895 (1.19)     0.0561 (1.65)   
-----------------------------------------------------------------

---------------- benchmark 'test_ChrAcc': 4 tests ----------------
Name (time in ms)                  Min            StdDev          
------------------------------------------------------------------
test_ChrAcc (0001_sp-nm-d)     19.0046 (3.65)     0.0038 (1.0)    
test_ChrAcc (0002_sp-nm-s)      5.8782 (1.13)     0.0062 (1.61)   
test_ChrAcc (0003_sp-fm-d)     18.5467 (3.56)     0.0245 (6.41)   
test_ChrAcc (0004_sp-fm-s)      5.2128 (1.0)      0.0156 (4.06)   
------------------------------------------------------------------

---------------- benchmark 'test_ChrDrift': 4 tests ---------------
Name (time in ms)                   Min            StdDev          
-------------------------------------------------------------------
test_ChrDrift (0001_sp-nm-d)     5.9677 (4.44)     0.0305 (1.04)   
test_ChrDrift (0002_sp-nm-s)     1.3426 (1.0)      0.0416 (1.42)   
test_ChrDrift (0003_sp-fm-d)     1.4907 (1.11)     0.0293 (1.0)    
test_ChrDrift (0004_sp-fm-s)     1.3428 (1.00)     0.1671 (5.71)   
-------------------------------------------------------------------

---------------- benchmark 'test_ChrPlasmaLens': 4 tests ----------------
Name (time in ms)                         Min            StdDev          
-------------------------------------------------------------------------
test_ChrPlasmaLens (0001_sp-nm-d)     23.6499 (7.48)     0.0238 (1.02)   
test_ChrPlasmaLens (0002_sp-nm-s)      3.6103 (1.14)     0.0824 (3.54)   
test_ChrPlasmaLens (0003_sp-fm-d)     20.5786 (6.51)     0.0830 (3.56)   
test_ChrPlasmaLens (0004_sp-fm-s)      3.1605 (1.0)      0.0233 (1.0)    
-------------------------------------------------------------------------

---------------- benchmark 'test_ChrQuad': 4 tests ----------------
Name (time in ms)                   Min            StdDev          
-------------------------------------------------------------------
test_ChrQuad (0001_sp-nm-d)     70.8837 (1.99)     0.2649 (4.67)   
test_ChrQuad (0002_sp-nm-s)     54.6676 (1.53)     0.0945 (1.67)   
test_ChrQuad (0003_sp-fm-d)     51.0190 (1.43)     0.0595 (1.05)   
test_ChrQuad (0004_sp-fm-s)     35.6522 (1.0)      0.0568 (1.0)    
-------------------------------------------------------------------

---------------- benchmark 'test_ConstF': 4 tests ---------------
Name (time in ms)                 Min            StdDev          
-----------------------------------------------------------------
test_ConstF (0001_sp-nm-d)     1.0483 (1.0)      0.0277 (1.0)    
test_ConstF (0002_sp-nm-s)     1.3607 (1.30)     0.0302 (1.09)   
test_ConstF (0003_sp-fm-d)     1.3889 (1.32)     0.0334 (1.20)   
test_ConstF (0004_sp-fm-s)     1.3548 (1.29)     0.0292 (1.05)   
-----------------------------------------------------------------

------------------ benchmark 'test_DipEdge': 4 tests ------------------
Name (time in us)                      Min             StdDev          
-----------------------------------------------------------------------
test_DipEdge (0001_sp-nm-d)       766.3820 (1.0)      22.2235 (1.0)    
test_DipEdge (0002_sp-nm-s)     1,339.5720 (1.75)     31.9027 (1.44)   
test_DipEdge (0003_sp-fm-d)       926.6780 (1.21)     29.1228 (1.31)   
test_DipEdge (0004_sp-fm-s)     1,323.4870 (1.73)     26.7041 (1.20)   
-----------------------------------------------------------------------

--------------- benchmark 'test_Drift': 4 tests ----------------
Name (time in ms)                Min            StdDev          
----------------------------------------------------------------
test_Drift (0001_sp-nm-d)     1.0143 (1.0)      0.0249 (1.0)    
test_Drift (0002_sp-nm-s)     1.3445 (1.33)     0.0310 (1.24)   
test_Drift (0003_sp-fm-d)     1.2817 (1.26)     0.0263 (1.06)   
test_Drift (0004_sp-fm-s)     1.3346 (1.32)     0.0297 (1.19)   
----------------------------------------------------------------

---------------- benchmark 'test_ExactDrift': 4 tests ---------------
Name (time in ms)                     Min            StdDev          
---------------------------------------------------------------------
test_ExactDrift (0001_sp-nm-d)     4.1055 (3.10)     0.0034 (1.0)    
test_ExactDrift (0002_sp-nm-s)     1.3349 (1.01)     0.0342 (10.19)  
test_ExactDrift (0003_sp-fm-d)     1.4068 (1.06)     0.0298 (8.86)   
test_ExactDrift (0004_sp-fm-s)     1.3249 (1.0)      0.0352 (10.47)  
---------------------------------------------------------------------

---------------- benchmark 'test_ExactMultipole': 4 tests ----------------
Name (time in ms)                          Min            StdDev          
--------------------------------------------------------------------------
test_ExactMultipole (0001_sp-nm-d)     45.0051 (9.07)     0.0124 (1.0)    
test_ExactMultipole (0002_sp-nm-s)      6.1070 (1.23)     0.0247 (1.99)   
test_ExactMultipole (0003_sp-fm-d)     44.6973 (9.01)     0.0164 (1.32)   
test_ExactMultipole (0004_sp-fm-s)      4.9605 (1.0)      0.0383 (3.07)   
--------------------------------------------------------------------------

---------------- benchmark 'test_ExactQuad': 4 tests -----------------
Name (time in ms)                      Min            StdDev          
----------------------------------------------------------------------
test_ExactQuad (0001_sp-nm-d)     287.8775 (18.32)    0.1472 (37.29)  
test_ExactQuad (0002_sp-nm-s)      23.1705 (1.47)     0.0272 (6.88)   
test_ExactQuad (0003_sp-fm-d)     157.9566 (10.05)    0.0498 (12.61)  
test_ExactQuad (0004_sp-fm-s)      15.7107 (1.0)      0.0039 (1.0)    
----------------------------------------------------------------------

---------------- benchmark 'test_ExactSbend': 4 tests ----------------
Name (time in ms)                      Min            StdDev          
----------------------------------------------------------------------
test_ExactSbend (0001_sp-nm-d)     16.6224 (2.23)     0.3834 (12.75)  
test_ExactSbend (0002_sp-nm-s)      8.3900 (1.13)     0.0579 (1.93)   
test_ExactSbend (0003_sp-fm-d)     13.4190 (1.80)     0.0326 (1.09)   
test_ExactSbend (0004_sp-fm-s)      7.4398 (1.0)      0.0301 (1.0)    
----------------------------------------------------------------------

------------------ benchmark 'test_Kicker': 4 tests ------------------
Name (time in us)                     Min             StdDev          
----------------------------------------------------------------------
test_Kicker (0001_sp-nm-d)       936.2970 (1.03)     18.9457 (1.0)    
test_Kicker (0002_sp-nm-s)     1,338.2910 (1.47)     32.9080 (1.74)   
test_Kicker (0003_sp-fm-d)       908.7170 (1.0)      25.1170 (1.33)   
test_Kicker (0004_sp-fm-s)     1,330.8750 (1.46)     26.9152 (1.42)   
----------------------------------------------------------------------

--------------- benchmark 'test_LinearMap': 4 tests ----------------
Name (time in ms)                    Min            StdDev          
--------------------------------------------------------------------
test_LinearMap (0001_sp-nm-d)     7.4159 (5.46)     0.0420 (1.25)   
test_LinearMap (0002_sp-nm-s)     1.3799 (1.02)     0.0358 (1.06)   
test_LinearMap (0003_sp-fm-d)     7.5989 (5.59)     0.0664 (1.97)   
test_LinearMap (0004_sp-fm-s)     1.3588 (1.0)      0.0337 (1.0)    
--------------------------------------------------------------------

--------------- benchmark 'test_Multipole': 4 tests ----------------
Name (time in ms)                    Min            StdDev          
--------------------------------------------------------------------
test_Multipole (0001_sp-nm-d)     4.1712 (3.08)     0.0308 (1.14)   
test_Multipole (0002_sp-nm-s)     1.3526 (1.0)      0.0315 (1.16)   
test_Multipole (0003_sp-fm-d)     4.2997 (3.18)     0.0271 (1.0)    
test_Multipole (0004_sp-fm-s)     1.3540 (1.00)     0.0281 (1.04)   
--------------------------------------------------------------------

---------------- benchmark 'test_NonlinearLens': 4 tests ----------------
Name (time in ms)                         Min            StdDev          
-------------------------------------------------------------------------
test_NonlinearLens (0001_sp-nm-d)     39.5728 (1.02)     0.2085 (12.86)  
test_NonlinearLens (0002_sp-nm-s)     39.8526 (1.03)     0.1505 (9.28)   
test_NonlinearLens (0003_sp-fm-d)     39.4713 (1.02)     0.0162 (1.0)    
test_NonlinearLens (0004_sp-fm-s)     38.6155 (1.0)      0.0721 (4.44)   
-------------------------------------------------------------------------

---------------- benchmark 'test_PRot': 4 tests ---------------
Name (time in ms)               Min            StdDev          
---------------------------------------------------------------
test_PRot (0001_sp-nm-d)     3.2337 (2.44)     0.0041 (1.0)    
test_PRot (0002_sp-nm-s)     1.3243 (1.0)      0.0352 (8.50)   
test_PRot (0003_sp-fm-d)     1.3922 (1.05)     0.0234 (5.65)   
test_PRot (0004_sp-fm-s)     1.3263 (1.00)     0.0337 (8.15)   
---------------------------------------------------------------

------------------ benchmark 'test_PlaneXYRot': 4 tests ------------------
Name (time in us)                         Min             StdDev          
--------------------------------------------------------------------------
test_PlaneXYRot (0001_sp-nm-d)       788.4900 (1.0)      24.5604 (1.14)   
test_PlaneXYRot (0002_sp-nm-s)     1,357.0950 (1.72)     34.5120 (1.60)   
test_PlaneXYRot (0003_sp-fm-d)     1,006.9610 (1.28)     21.5856 (1.0)    
test_PlaneXYRot (0004_sp-fm-s)     1,335.5390 (1.69)     35.9032 (1.66)   
--------------------------------------------------------------------------

---------------- benchmark 'test_Quad': 4 tests ---------------
Name (time in ms)               Min            StdDev          
---------------------------------------------------------------
test_Quad (0001_sp-nm-d)     1.2482 (1.0)      0.0381 (1.13)   
test_Quad (0002_sp-nm-s)     1.3583 (1.09)     0.0382 (1.13)   
test_Quad (0003_sp-fm-d)     1.7011 (1.36)     0.0390 (1.16)   
test_Quad (0004_sp-fm-s)     1.3495 (1.08)     0.0337 (1.0)    
---------------------------------------------------------------

---------------- benchmark 'test_RFCavity': 4 tests ----------------
Name (time in ms)                    Min            StdDev          
--------------------------------------------------------------------
test_RFCavity (0001_sp-nm-d)     18.7676 (13.87)    0.0060 (1.0)    
test_RFCavity (0002_sp-nm-s)      1.3643 (1.01)     0.0417 (7.01)   
test_RFCavity (0003_sp-fm-d)     16.4755 (12.17)    0.0325 (5.47)   
test_RFCavity (0004_sp-fm-s)      1.3532 (1.0)      0.0338 (5.68)   
--------------------------------------------------------------------

--------------- benchmark 'test_Sbend': 4 tests ----------------
Name (time in ms)                Min            StdDev          
----------------------------------------------------------------
test_Sbend (0001_sp-nm-d)     1.0152 (1.0)      0.0261 (1.0)    
test_Sbend (0002_sp-nm-s)     1.3506 (1.33)     0.0327 (1.25)   
test_Sbend (0003_sp-fm-d)     1.3534 (1.33)     0.0284 (1.09)   
test_Sbend (0004_sp-fm-s)     1.3350 (1.32)     0.0310 (1.19)   
----------------------------------------------------------------

--------------- benchmark 'test_ShortRF': 4 tests ----------------
Name (time in ms)                  Min            StdDev          
------------------------------------------------------------------
test_ShortRF (0001_sp-nm-d)     7.1661 (6.20)     0.0278 (1.86)   
test_ShortRF (0002_sp-nm-s)     1.3507 (1.17)     0.0348 (2.33)   
test_ShortRF (0003_sp-fm-d)     1.1567 (1.0)      0.0149 (1.0)    
test_ShortRF (0004_sp-fm-s)     1.3384 (1.16)     0.0355 (2.37)   
------------------------------------------------------------------

---------------- benchmark 'test_SoftQuadrupole': 4 tests ----------------
Name (time in ms)                          Min            StdDev          
--------------------------------------------------------------------------
test_SoftQuadrupole (0001_sp-nm-d)     19.3863 (14.29)    0.0028 (2.01)   
test_SoftQuadrupole (0002_sp-nm-s)      1.3653 (1.01)     0.0370 (26.59)  
test_SoftQuadrupole (0003_sp-fm-d)     16.1216 (11.89)    0.0014 (1.0)    
test_SoftQuadrupole (0004_sp-fm-s)      1.3564 (1.0)      0.0366 (26.27)  
--------------------------------------------------------------------------

---------------- benchmark 'test_SoftSolenoid': 4 tests ----------------
Name (time in ms)                        Min            StdDev          
------------------------------------------------------------------------
test_SoftSolenoid (0001_sp-nm-d)     18.7642 (13.89)    0.0065 (1.0)    
test_SoftSolenoid (0002_sp-nm-s)      1.3685 (1.01)     0.0411 (6.33)   
test_SoftSolenoid (0003_sp-fm-d)     15.9963 (11.84)    0.0124 (1.91)   
test_SoftSolenoid (0004_sp-fm-s)      1.3508 (1.0)      0.0381 (5.86)   
------------------------------------------------------------------------

--------------- benchmark 'test_Sol': 4 tests ----------------
Name (time in ms)              Min            StdDev          
--------------------------------------------------------------
test_Sol (0001_sp-nm-d)     1.0205 (1.0)      0.0269 (1.01)   
test_Sol (0002_sp-nm-s)     1.3458 (1.32)     0.0366 (1.37)   
test_Sol (0003_sp-fm-d)     1.0292 (1.01)     0.0267 (1.0)    
test_Sol (0004_sp-fm-s)     1.3404 (1.31)     0.0329 (1.23)   
--------------------------------------------------------------

------------------ benchmark 'test_TaperedPL': 4 tests ------------------
Name (time in us)                        Min             StdDev          
-------------------------------------------------------------------------
test_TaperedPL (0001_sp-nm-d)       775.8910 (1.0)      23.6305 (1.40)   
test_TaperedPL (0002_sp-nm-s)     1,342.5990 (1.73)     36.5186 (2.16)   
test_TaperedPL (0003_sp-fm-d)       780.8800 (1.01)     16.9080 (1.0)    
test_TaperedPL (0004_sp-fm-s)     1,343.6800 (1.73)     24.9062 (1.47)   
-------------------------------------------------------------------------

---------------- benchmark 'test_ThinDipole': 4 tests ---------------
Name (time in ms)                     Min            StdDev          
---------------------------------------------------------------------
test_ThinDipole (0001_sp-nm-d)     4.0509 (3.91)     0.0102 (1.0)    
test_ThinDipole (0002_sp-nm-s)     1.3486 (1.30)     0.0365 (3.57)   
test_ThinDipole (0003_sp-fm-d)     1.0370 (1.0)      0.0245 (2.40)   
test_ThinDipole (0004_sp-fm-s)     1.3400 (1.29)     0.0318 (3.11)   
---------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
pytest-benchmark compare --group-by=name --sort=name --histogram=sp_histo .benchmarks/Linux-CPython-3.11-64bit/000*

[Histogram images sp_histo-test_<Element>, one per benchmark listed in the tables above]

Benchmarks on Perlmutter (NERSC)

Machine: Perlmutter (NERSC)
Compiler: ...
Vector registers: 256 bit (i.e., 4 doubles or 8 floats)
Note: Perlmutter CPUs (AMD Zen 3) do not support AVX512, AMD introduced 512 bit (i.e., 8 doubles) wide vector registers only in Zen 5.

Boring old CPU :D

Required PRs

Comment thread src/elements/Aperture.H
@ax3l ax3l force-pushed the topic-simd branch 5 times, most recently from f178f34 to 479fff7 Compare July 16, 2025 10:01
Comment thread src/elements/ExactSbend.H Outdated
Comment thread src/elements/ExactSbend.H Outdated
WeiqunZhang pushed a commit to AMReX-Codes/amrex that referenced this pull request Jul 16, 2025
## Summary

AMReX does not have a concept yet to help users write effective SIMD
code on CPU, besides relying on auto-vectorization and pragmas, which
are unreliable for any complex enough code. [1]

Luckily, C++ `std::datapar` was just accepted into C++26, which
gives an easy way in to writing portable SIMD/scalar code. However, I did
not find a compiler/stdlib with support for it yet, so I finally had a play
with the C++17 `<experimental/simd>` headers, which are not as complete as
C++26 but a good way in, especially if complemented with the
https://github.com/mattkretz/vir-simd library.

This PR adds initial support for portable user-code by providing:
- build system support: `AMReX_SIMD` (default is OFF), relying on
[vir-simd](https://github.com/mattkretz/vir-simd)
- an `AMReX_SIMD.H` header that handles includes & helper types
- `ParallelForSIMD<SIMD_WIDTH>(...)`
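
As a rough illustration of what a width-templated parallel loop does, the sketch below walks an index range in chunks of `WIDTH` elements and finishes with a scalar remainder. This is an assumption-laden toy, not the AMReX `ParallelForSIMD` signature or implementation: `for_each_chunk`, `ChunkCounter` and `split_in_eights` are hypothetical names, the width is passed as a runtime argument here (the checklist below mentions passing it as compile-time meta-data), and how AMReX treats a remainder is not stated in this text.

```cpp
#include <cstddef>

// Hypothetical helper, NOT the AMReX ParallelForSIMD signature:
// calls f(first_index, width) once per chunk of the index range [0, n).
template <std::size_t WIDTH, typename F>
constexpr void for_each_chunk (std::size_t n, F && f)
{
    std::size_t i = 0;
    for (; i + WIDTH <= n; i += WIDTH) { f(i, WIDTH); }  // full chunks of WIDTH elements
    for (; i < n; ++i) { f(i, std::size_t{1}); }         // scalar remainder
}

// Small callable that records how the range was split.
struct ChunkCounter
{
    std::size_t full = 0;  // number of WIDTH-wide chunks
    std::size_t tail = 0;  // number of single-element calls

    constexpr void operator() (std::size_t /*first*/, std::size_t width)
    {
        if (width == 1) { ++tail; } else { ++full; }
    }
};

// Split n elements into chunks of 8 (e.g., 8 doubles in a 512-bit register).
constexpr ChunkCounter split_in_eights (std::size_t n)
{
    ChunkCounter c{};
    for_each_chunk<8>(n, c);
    return c;
}

static_assert(split_in_eights(20).full == 2, "20 elements: two chunks of 8");
static_assert(split_in_eights(20).tail == 4, "20 elements: four left over");
```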

## Additional background

[1] Fun fact one: As written in the [story behind Intel's ispc
compiler](https://pharr.org/matt/blog/2018/04/18/ispc-origins) and
credited to [Tim Foley](http://graphics.stanford.edu/~tfoley/),
*auto-vectorization is not a programming model.*

Fun fact two: This is as ad-hoc as the implementation for [data parallel
types / SIMD in
Kokkos](https://kokkos.org/kokkos-core-wiki/API/simd/simd.html), it
seems.

## User-Code Examples & Benchmark

[Please see this ImpactX PR for
details.](BLAST-ImpactX/impactx#1002)

## Checklist

- [x] clean up commits (separate commits)
- [x] finalize fallbacks & CI checks
- [ ] add a `vir::stdx::simd` test in CI
- [x] CMake
- [ ] GnuMake
- [x] `AMReX_SIMD.H` 
- [x] `ParallelForSIMD` 
- [x] `ParticleIdWrapper::make_(in)valid(mask)`
- [x] clean up `sincos` support
- [x] `SmallMatrix`
- [x] Support for `GpuComplex` (minimal)
- [x] Support [passing WIDTH as compile-time
meta-data](https://godbolt.org/z/7455hqrEc) to callee in
`ParallelForSIMD`
- [ ] include documentation in the code and/or rst files, if appropriate
- [x] add `vir::stdx::simd` in package managers:
- [x] Spack [vir-simd](spack/spack-packages#332)
- [x] Conda
[vir-simd](conda-forge/staged-recipes#30377)

## Future Ideas / PRs

- allocate particle arrays aligned so we can use
[stdx::vector_aligned](https://en.cppreference.com/w/cpp/experimental/simd/vector_aligned.html)
(for
[copies](https://en.cppreference.com/w/cpp/experimental/simd/simd/copy_from)
into/out of vector registers - note: makes no difference anymore on
modern CPUs)
- Support more/all functions in `ParticleIdWrapper`/`ParticleCpuWrapper`
- Support for
[vir::simdize<std::complex<T>>](mattkretz/vir-simd#42)
instead of `GpuComplex<SIMD>`
- `ParallelFor` ND support
- `ParallelFor`/`ParallelForSIMD`: one could, maybe, with enable-if
magic, etc fuse them into a single name again
- CMake superbuild: `vir-simd` auto-download for convenience (opt-out)
- Build system: "SIMD provider" selection, once we can opt-in to a C++26
compiler+stdlib instead of C++17 TS2 + vir-simd
- Update AMReX package in package management:
  - Spack [vir-simd](spack/spack-packages#332)
- Conda
[vir-simd](conda-forge/staged-recipes#30377)

---------

Co-authored-by: Alexander Sinn
ax3l added a commit to BLAST-WarpX/warpx that referenced this pull request Jul 17, 2025
Basic build system support for AMReX SIMD in superbuilds.

See:
- AMReX-Codes/amrex#4520 (requires this PR)
- BLAST-ImpactX/impactx#1002 (needed for this PR
to be finished)
@ax3l ax3l force-pushed the topic-simd branch 2 times, most recently from c152e9b to 521516d Compare July 29, 2025 04:44
@ax3l ax3l changed the title [WIP] CPU SIMD Support CPU SIMD Support Jul 29, 2025
Comment thread src/elements/Aperture.H
// if the aperture is periodic, shift sx,sy coordinates to the fundamental domain
- amrex::ParticleReal u = (m_repeat_x==0.0_prt) ? sx : (std::fmod(std::abs(sx)+dx,m_repeat_x)-dx);
- amrex::ParticleReal v = (m_repeat_y==0.0_prt) ? sy : (std::fmod(std::abs(sy)+dy,m_repeat_y)-dy);
+ T_Real u = (m_repeat_x == 0_prt) ? sx : (fmod(abs(sx)+dx, m_repeat_x)-dx);

Check notice (Code scanning / CodeQL): Equality test on floating-point values (Note)
Equality checks on floating point values can yield unexpected results.
Comment thread src/elements/Aperture.H
- amrex::ParticleReal u = (m_repeat_x==0.0_prt) ? sx : (std::fmod(std::abs(sx)+dx,m_repeat_x)-dx);
- amrex::ParticleReal v = (m_repeat_y==0.0_prt) ? sy : (std::fmod(std::abs(sy)+dy,m_repeat_y)-dy);
+ T_Real u = (m_repeat_x == 0_prt) ? sx : (fmod(abs(sx)+dx, m_repeat_x)-dx);
+ T_Real v = (m_repeat_y == 0_prt) ? sy : (fmod(abs(sy)+dy, m_repeat_y)-dy);

Check notice (Code scanning / CodeQL): Equality test on floating-point values (Note)
Equality checks on floating point values can yield unexpected results.
@ax3l ax3l marked this pull request as ready for review July 29, 2025 05:17
ax3l added 2 commits July 30, 2025 23:13
Ok to run through invalid branch afterwards, must even
for the valid particles in the SIMD lane.
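
One reading of the commit message above is that once some lanes of a pack are flagged invalid, the remaining code still runs for the whole pack, because the valid lanes in it need the result; a per-lane mask then decides which results are kept. A toy sketch of that masking idea follows. `Lanes`, `Mask`, `inside_circle` and `blend` are hypothetical names and not the AMReX, vir-simd, or ImpactX API.

```cpp
#include <cstddef>

// Hypothetical toy types, standing in for a SIMD vector and its mask.
template <typename T, std::size_t N> struct Lanes { T v[N]; };
template <std::size_t N>             struct Mask  { bool m[N]; };

// Aperture-like test: lane i stays valid if x^2 + y^2 <= r^2.
template <typename T, std::size_t N>
constexpr Mask<N> inside_circle (Lanes<T, N> const & x, Lanes<T, N> const & y, T r)
{
    Mask<N> keep{};
    for (std::size_t i = 0; i < N; ++i) {
        keep.m[i] = x.v[i] * x.v[i] + y.v[i] * y.v[i] <= r * r;
    }
    return keep;
}

// Lane-wise blend: take `updated` where the mask is set, else keep `old`.
// Both inputs are fully computed for every lane; no per-lane branching.
template <typename T, std::size_t N>
constexpr Lanes<T, N> blend (Mask<N> const & keep,
                             Lanes<T, N> const & updated,
                             Lanes<T, N> const & old)
{
    Lanes<T, N> r{};
    for (std::size_t i = 0; i < N; ++i) {
        r.v[i] = keep.m[i] ? updated.v[i] : old.v[i];
    }
    return r;
}

constexpr Lanes<double, 4> x    = {{0.0, 3.0, 0.5, 2.0}};
constexpr Lanes<double, 4> y    = {{0.0, 4.0, 0.5, 0.0}};
constexpr Lanes<double, 4> xnew = {{1.0, 4.0, 1.5, 3.0}};  // update computed for all lanes

constexpr Mask<4>          keep = inside_circle(x, y, 1.0);
constexpr Lanes<double, 4> out  = blend(keep, xnew, x);

static_assert(keep.m[0] && !keep.m[1] && keep.m[2] && !keep.m[3], "lanes 1 and 3 are outside");
static_assert(out.v[0] == 1.0 && out.v[1] == 3.0, "lane 0 updated, lane 1 left untouched");
```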

ax3l commented Jul 31, 2025

Biggest wins so far (SP, before/after PR):

  • Aperture: 6x speedup
  • ChrAcc: 3.65x speedup
  • ChrDrift: 4.44x speedup
  • ChrPlasmaLens: 7.5x speedup
  • ChrQuad: 1.3x - 2x speedup
  • ExactMultipole: 9x speedup
  • ExactQuad: 18.3x speedup
  • LinearMap: 5.5x speedup
  • Multipole: 3x speedup
  • RFCavity: 14x speedup
  • SoftQuadrupole: 14.3x speedup
  • SoftSolenoid: 13.9x speedup

There are also a few very simple elements (that are not compute bound) where the benchmarks show that the packing/unpacking into vector registers is not amortized, so we should not vectorize those. (The harm is small because they are so quick, but it is still simple to avoid.) Those are: ConstF, DipEdge, Drift, Kicker.
There are also fun corner cases where, if we start to default to fast-math, SIMD similarly shows no benefit: ThinDipole (about 30% faster when staying scalar with fast-math only) and ShortRF (16%).

CFbend seems to get slower with fast-math only in the vectorized code; this might instead point to system noise, CPU thermal changes, etc. for this quick (i.e., short-runtime) element.
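
When reading the speedups listed above against the tables, note that the parenthesized factors in the pytest-benchmark output are relative to the fastest of the four runs in each group. A figure such as 18.3x for ExactQuad therefore appears to compare the development branch with default math against the SIMD branch with fast-math. The sketch below separates the like-for-like SIMD gain from that combined factor, using the single-precision test_ExactQuad minima from the table; the function and variable names are illustrative only.

```cpp
// Speedup of a candidate over a baseline, from their minimum runtimes.
constexpr double speedup (double baseline_min, double candidate_min)
{
    return baseline_min / candidate_min;
}

// Minimum runtimes in ms, copied from the single-precision test_ExactQuad table.
constexpr double sp_nm_d = 287.8775;  // development branch, default math
constexpr double sp_nm_s =  23.1705;  // SIMD branch, default math
constexpr double sp_fm_d = 157.9566;  // development branch, fast-math
constexpr double sp_fm_s =  15.7107;  // SIMD branch, fast-math

// Like-for-like: same math mode, development vs. SIMD branch.
constexpr double simd_gain_nm = speedup(sp_nm_d, sp_nm_s);  // about 12.4x
constexpr double simd_gain_fm = speedup(sp_fm_d, sp_fm_s);  // about 10.1x

// Relative to the fastest run of the group (SIMD + fast-math),
// which is what the parenthesized table factor reports.
constexpr double vs_fastest = speedup(sp_nm_d, sp_fm_s);    // about 18.3x

static_assert(simd_gain_nm > 12.4 && simd_gain_nm < 12.5, "SIMD gain, default math");
static_assert(simd_gain_fm > 10.0 && simd_gain_fm < 10.1, "SIMD gain, fast-math");
static_assert(vs_fastest   > 18.3 && vs_fastest   < 18.4, "combined SIMD + fast-math factor");
```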


ax3l commented Jul 31, 2025

Double Precision

Machine: Dane (LLNL), -march=sapphirerapids CPU
Compiler: gcc/13.3.1
Vector registers: 512 bit (i.e., 8 doubles or 16 floats)

  • script: DP.txt
  • results: dp-benchmarks.tar.gz
  • abbreviations:
    • dp: double precision
    • nm: normal/default math
    • fm: fast-math
    • d(ev): development branch
    • s(imd): the branch in this PR
pytest-benchmark compare --group-by=name --columns=min,stddev --sort=name .benchmarks/Linux-CPython-3.11-64bit/000*
---------------- benchmark 'test_Aperture': 4 tests ---------------
Name (time in ms)                   Min            StdDev          
-------------------------------------------------------------------
test_Aperture (0001_dp-nm-d)     8.8117 (3.29)     0.0208 (1.05)   
test_Aperture (0002_dp-nm-s)     3.0794 (1.15)     0.3472 (17.48)  
test_Aperture (0003_dp-fm-d)     9.1025 (3.40)     0.0199 (1.0)    
test_Aperture (0004_dp-fm-s)     2.6756 (1.0)      0.2997 (15.09)  
-------------------------------------------------------------------

--------------- benchmark 'test_Buncher': 4 tests ----------------
Name (time in ms)                  Min            StdDev          
------------------------------------------------------------------
test_Buncher (0001_dp-nm-d)     2.2314 (1.0)      0.0188 (1.0)    
test_Buncher (0002_dp-nm-s)     3.0446 (1.36)     0.0395 (2.10)   
test_Buncher (0003_dp-fm-d)     2.7800 (1.25)     0.4804 (25.57)  
test_Buncher (0004_dp-fm-s)     2.6935 (1.21)     0.0208 (1.11)   
------------------------------------------------------------------

---------------- benchmark 'test_CFbend': 4 tests ---------------
Name (time in ms)                 Min            StdDev          
-----------------------------------------------------------------
test_CFbend (0001_dp-nm-d)     2.2340 (1.0)      0.0159 (1.0)    
test_CFbend (0002_dp-nm-s)     3.1851 (1.43)     0.0433 (2.71)   
test_CFbend (0003_dp-fm-d)     2.9279 (1.31)     0.0220 (1.38)   
test_CFbend (0004_dp-fm-s)     2.7140 (1.21)     0.0234 (1.47)   
-----------------------------------------------------------------

---------------- benchmark 'test_ChrAcc': 4 tests ----------------
Name (time in ms)                  Min            StdDev          
------------------------------------------------------------------
test_ChrAcc (0001_dp-nm-d)     27.0789 (1.89)     0.2595 (1.09)   
test_ChrAcc (0002_dp-nm-s)     16.0846 (1.12)     0.3071 (1.29)   
test_ChrAcc (0003_dp-fm-d)     26.7344 (1.86)     0.2378 (1.0)    
test_ChrAcc (0004_dp-fm-s)     14.3574 (1.0)      0.3635 (1.53)   
------------------------------------------------------------------

---------------- benchmark 'test_ChrDrift': 4 tests ---------------
Name (time in ms)                   Min            StdDev          
-------------------------------------------------------------------
test_ChrDrift (0001_dp-nm-d)     5.7476 (2.14)     0.0045 (1.0)    
test_ChrDrift (0002_dp-nm-s)     3.3919 (1.26)     0.0348 (7.73)   
test_ChrDrift (0003_dp-fm-d)     3.1150 (1.16)     0.0256 (5.69)   
test_ChrDrift (0004_dp-fm-s)     2.6831 (1.0)      0.0260 (5.77)   
-------------------------------------------------------------------

---------------- benchmark 'test_ChrPlasmaLens': 4 tests ----------------
Name (time in ms)                         Min            StdDev          
-------------------------------------------------------------------------
test_ChrPlasmaLens (0001_dp-nm-d)     38.1896 (5.29)     0.0137 (1.0)    
test_ChrPlasmaLens (0002_dp-nm-s)     10.6129 (1.47)     0.0331 (2.42)   
test_ChrPlasmaLens (0003_dp-fm-d)     35.2867 (4.89)     0.0621 (4.54)   
test_ChrPlasmaLens (0004_dp-fm-s)      7.2192 (1.0)      0.2873 (21.01)  
-------------------------------------------------------------------------

---------------- benchmark 'test_ChrQuad': 4 tests ----------------
Name (time in ms)                   Min            StdDev          
-------------------------------------------------------------------
test_ChrQuad (0001_dp-nm-d)     88.0608 (1.85)     0.0811 (1.73)   
test_ChrQuad (0002_dp-nm-s)     69.0219 (1.45)     0.1602 (3.41)   
test_ChrQuad (0003_dp-fm-d)     69.1601 (1.45)     0.0469 (1.0)    
test_ChrQuad (0004_dp-fm-s)     47.6482 (1.0)      1.0788 (23.00)  
-------------------------------------------------------------------

---------------- benchmark 'test_ConstF': 4 tests ---------------
Name (time in ms)                 Min            StdDev          
-----------------------------------------------------------------
test_ConstF (0001_dp-nm-d)     2.3279 (1.0)      0.0225 (1.0)    
test_ConstF (0002_dp-nm-s)     2.7020 (1.16)     0.0259 (1.15)   
test_ConstF (0003_dp-fm-d)     2.9882 (1.28)     0.0278 (1.24)   
test_ConstF (0004_dp-fm-s)     2.7140 (1.17)     0.0328 (1.46)   
-----------------------------------------------------------------

--------------- benchmark 'test_DipEdge': 4 tests ----------------
Name (time in ms)                  Min            StdDev          
------------------------------------------------------------------
test_DipEdge (0001_dp-nm-d)     1.6459 (1.0)      0.0164 (1.01)   
test_DipEdge (0002_dp-nm-s)     2.6766 (1.63)     0.0224 (1.37)   
test_DipEdge (0003_dp-fm-d)     1.9928 (1.21)     0.0163 (1.0)    
test_DipEdge (0004_dp-fm-s)     2.6927 (1.64)     0.0211 (1.29)   
------------------------------------------------------------------

--------------- benchmark 'test_Drift': 4 tests ----------------
Name (time in ms)                Min            StdDev          
----------------------------------------------------------------
test_Drift (0001_dp-nm-d)     2.2314 (1.0)      0.0182 (1.0)    
test_Drift (0002_dp-nm-s)     2.6713 (1.20)     0.0215 (1.18)   
test_Drift (0003_dp-fm-d)     2.7497 (1.23)     0.0204 (1.12)   
test_Drift (0004_dp-fm-s)     2.6879 (1.20)     0.0235 (1.29)   
----------------------------------------------------------------

---------------- benchmark 'test_ExactDrift': 4 tests ---------------
Name (time in ms)                     Min            StdDev          
---------------------------------------------------------------------
test_ExactDrift (0001_dp-nm-d)     4.1338 (1.57)     0.0050 (1.0)    
test_ExactDrift (0002_dp-nm-s)     2.6384 (1.0)      0.0257 (5.18)   
test_ExactDrift (0003_dp-fm-d)     2.9893 (1.13)     0.0197 (3.96)   
test_ExactDrift (0004_dp-fm-s)     2.6823 (1.02)     0.0236 (4.75)   
---------------------------------------------------------------------

---------------- benchmark 'test_ExactMultipole': 4 tests ----------------
Name (time in ms)                          Min            StdDev          
--------------------------------------------------------------------------
test_ExactMultipole (0001_dp-nm-d)     58.3870 (3.00)     0.0043 (1.0)    
test_ExactMultipole (0002_dp-nm-s)     19.4795 (1.0)      0.0096 (2.22)   
test_ExactMultipole (0003_dp-fm-d)     61.0599 (3.13)     0.4898 (113.23) 
test_ExactMultipole (0004_dp-fm-s)     19.5392 (1.00)     0.0115 (2.66)   
--------------------------------------------------------------------------

---------------- benchmark 'test_ExactQuad': 4 tests -----------------
Name (time in ms)                      Min            StdDev          
----------------------------------------------------------------------
test_ExactQuad (0001_dp-nm-d)     319.2256 (7.74)     0.9417 (39.89)  
test_ExactQuad (0002_dp-nm-s)      55.8916 (1.36)     0.1077 (4.56)   
test_ExactQuad (0003_dp-fm-d)     218.9466 (5.31)     0.0236 (1.0)    
test_ExactQuad (0004_dp-fm-s)      41.2266 (1.0)      0.1527 (6.47)   
----------------------------------------------------------------------

---------------- benchmark 'test_ExactSbend': 4 tests ----------------
Name (time in ms)                      Min            StdDev          
----------------------------------------------------------------------
test_ExactSbend (0001_dp-nm-d)     19.7546 (1.40)     0.0317 (1.84)   
test_ExactSbend (0002_dp-nm-s)     15.3607 (1.09)     0.0427 (2.47)   
test_ExactSbend (0003_dp-fm-d)     17.7051 (1.26)     0.0380 (2.20)   
test_ExactSbend (0004_dp-fm-s)     14.0693 (1.0)      0.0173 (1.0)    
----------------------------------------------------------------------

---------------- benchmark 'test_Kicker': 4 tests ---------------
Name (time in ms)                 Min            StdDev          
-----------------------------------------------------------------
test_Kicker (0001_dp-nm-d)     1.6393 (1.0)      0.0165 (1.0)    
test_Kicker (0002_dp-nm-s)     2.6686 (1.63)     0.0264 (1.60)   
test_Kicker (0003_dp-fm-d)     1.9354 (1.18)     0.0200 (1.21)   
test_Kicker (0004_dp-fm-s)     2.6777 (1.63)     0.0297 (1.80)   
-----------------------------------------------------------------

--------------- benchmark 'test_LinearMap': 4 tests ----------------
Name (time in ms)                    Min            StdDev          
--------------------------------------------------------------------
test_LinearMap (0001_dp-nm-d)     7.2117 (2.62)     0.0265 (1.35)   
test_LinearMap (0002_dp-nm-s)     3.4136 (1.24)     0.0347 (1.77)   
test_LinearMap (0003_dp-fm-d)     8.0425 (2.92)     0.0451 (2.30)   
test_LinearMap (0004_dp-fm-s)     2.7540 (1.0)      0.0196 (1.0)    
--------------------------------------------------------------------

--------------- benchmark 'test_Multipole': 4 tests ----------------
Name (time in ms)                    Min            StdDev          
--------------------------------------------------------------------
test_Multipole (0001_dp-nm-d)     3.9493 (1.49)     0.0034 (1.0)    
test_Multipole (0002_dp-nm-s)     2.6580 (1.0)      0.0289 (8.53)   
test_Multipole (0003_dp-fm-d)     5.0043 (1.88)     0.0465 (13.74)  
test_Multipole (0004_dp-fm-s)     2.7018 (1.02)     0.0232 (6.84)   
--------------------------------------------------------------------

---------------- benchmark 'test_NonlinearLens': 4 tests ----------------
Name (time in ms)                         Min            StdDev          
-------------------------------------------------------------------------
test_NonlinearLens (0001_dp-nm-d)     42.7865 (1.01)     0.0070 (1.0)    
test_NonlinearLens (0002_dp-nm-s)     42.7888 (1.01)     0.0093 (1.33)   
test_NonlinearLens (0003_dp-fm-d)     42.9065 (1.02)     0.1427 (20.41)  
test_NonlinearLens (0004_dp-fm-s)     42.2151 (1.0)      0.1064 (15.22)  
-------------------------------------------------------------------------

---------------- benchmark 'test_PRot': 4 tests ---------------
Name (time in ms)               Min            StdDev          
---------------------------------------------------------------
test_PRot (0001_dp-nm-d)     4.7603 (2.20)     0.0056 (1.0)    
test_PRot (0002_dp-nm-s)     2.5962 (1.20)     0.0530 (9.43)   
test_PRot (0003_dp-fm-d)     2.1594 (1.0)      0.0219 (3.91)   
test_PRot (0004_dp-fm-s)     2.6438 (1.22)     0.0253 (4.50)   
---------------------------------------------------------------

---------------- benchmark 'test_PlaneXYRot': 4 tests ---------------
Name (time in ms)                     Min            StdDev          
---------------------------------------------------------------------
test_PlaneXYRot (0001_dp-nm-d)     1.6721 (1.0)      0.0165 (1.0)    
test_PlaneXYRot (0002_dp-nm-s)     2.6798 (1.60)     0.0305 (1.85)   
test_PlaneXYRot (0003_dp-fm-d)     2.0960 (1.25)     0.0209 (1.27)   
test_PlaneXYRot (0004_dp-fm-s)     2.7104 (1.62)     0.0238 (1.45)   
---------------------------------------------------------------------

---------------- benchmark 'test_Quad': 4 tests ---------------
Name (time in ms)               Min            StdDev          
---------------------------------------------------------------
test_Quad (0001_dp-nm-d)     2.2294 (1.0)      0.0144 (1.0)    
test_Quad (0002_dp-nm-s)     2.6693 (1.20)     0.0262 (1.82)   
test_Quad (0003_dp-fm-d)     2.2744 (1.02)     0.0189 (1.31)   
test_Quad (0004_dp-fm-s)     2.6944 (1.21)     0.0249 (1.73)   
---------------------------------------------------------------

---------------- benchmark 'test_RFCavity': 4 tests ----------------
Name (time in ms)                    Min            StdDev          
--------------------------------------------------------------------
test_RFCavity (0001_dp-nm-d)     20.7213 (7.65)     0.0185 (1.0)    
test_RFCavity (0002_dp-nm-s)      2.7097 (1.0)      0.0275 (1.49)   
test_RFCavity (0003_dp-fm-d)     14.9478 (5.52)     0.0306 (1.66)   
test_RFCavity (0004_dp-fm-s)      2.7858 (1.03)     0.0278 (1.51)   
--------------------------------------------------------------------

--------------- benchmark 'test_Sbend': 4 tests ----------------
Name (time in ms)                Min            StdDev          
----------------------------------------------------------------
test_Sbend (0001_dp-nm-d)     2.2217 (1.0)      0.0160 (1.0)    
test_Sbend (0002_dp-nm-s)     2.6813 (1.21)     0.0255 (1.59)   
test_Sbend (0003_dp-fm-d)     2.2728 (1.02)     0.0177 (1.11)   
test_Sbend (0004_dp-fm-s)     2.7168 (1.22)     0.0289 (1.81)   
----------------------------------------------------------------

--------------- benchmark 'test_ShortRF': 4 tests ----------------
Name (time in ms)                  Min            StdDev          
------------------------------------------------------------------
test_ShortRF (0001_dp-nm-d)     7.1609 (2.86)     0.0277 (1.74)   
test_ShortRF (0002_dp-nm-s)     2.6973 (1.08)     0.0256 (1.61)   
test_ShortRF (0003_dp-fm-d)     2.5034 (1.0)      0.0159 (1.0)    
test_ShortRF (0004_dp-fm-s)     2.6975 (1.08)     0.0266 (1.67)   
------------------------------------------------------------------

---------------- benchmark 'test_SoftQuadrupole': 4 tests ----------------
Name (time in ms)                          Min            StdDev          
--------------------------------------------------------------------------
test_SoftQuadrupole (0001_dp-nm-d)     20.9724 (7.78)     0.0042 (1.0)    
test_SoftQuadrupole (0002_dp-nm-s)      2.6960 (1.0)      0.0283 (6.71)   
test_SoftQuadrupole (0003_dp-fm-d)     15.4573 (5.73)     0.0055 (1.30)   
test_SoftQuadrupole (0004_dp-fm-s)      2.7473 (1.02)     0.0337 (7.99)   
--------------------------------------------------------------------------

---------------- benchmark 'test_SoftSolenoid': 4 tests ----------------
Name (time in ms)                        Min            StdDev          
------------------------------------------------------------------------
test_SoftSolenoid (0001_dp-nm-d)     20.9760 (7.77)     0.0049 (1.0)    
test_SoftSolenoid (0002_dp-nm-s)      2.7002 (1.0)      0.0331 (6.78)   
test_SoftSolenoid (0003_dp-fm-d)     17.8606 (6.61)     0.0071 (1.46)   
test_SoftSolenoid (0004_dp-fm-s)      2.7587 (1.02)     0.0312 (6.38)   
------------------------------------------------------------------------

--------------- benchmark 'test_Sol': 4 tests ----------------
Name (time in ms)              Min            StdDev          
--------------------------------------------------------------
test_Sol (0001_dp-nm-d)     2.2489 (1.0)      0.0134 (1.0)    
test_Sol (0002_dp-nm-s)     2.6901 (1.20)     0.0200 (1.49)   
test_Sol (0003_dp-fm-d)     2.3059 (1.03)     0.0157 (1.17)   
test_Sol (0004_dp-fm-s)     2.7046 (1.20)     0.0191 (1.43)   
--------------------------------------------------------------

--------------- benchmark 'test_TaperedPL': 4 tests ----------------
Name (time in ms)                    Min            StdDev          
--------------------------------------------------------------------
test_TaperedPL (0001_dp-nm-d)     1.6531 (1.0)      0.0189 (1.06)   
test_TaperedPL (0002_dp-nm-s)     2.6722 (1.62)     0.0230 (1.29)   
test_TaperedPL (0003_dp-fm-d)     1.6903 (1.02)     0.0177 (1.0)    
test_TaperedPL (0004_dp-fm-s)     2.6962 (1.63)     0.0244 (1.38)   
--------------------------------------------------------------------

---------------- benchmark 'test_ThinDipole': 4 tests ---------------
Name (time in ms)                     Min            StdDev          
---------------------------------------------------------------------
test_ThinDipole (0001_dp-nm-d)     4.1948 (1.83)     0.0060 (1.0)    
test_ThinDipole (0002_dp-nm-s)     2.6098 (1.14)     0.0262 (4.37)   
test_ThinDipole (0003_dp-fm-d)     2.2945 (1.0)      0.0177 (2.96)   
test_ThinDipole (0004_dp-fm-s)     2.6836 (1.17)     0.0253 (4.22)   
---------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
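
The value in parentheses next to each entry is that entry's ratio to the best (smallest) value in the same column of the same table. A minimal sketch that reproduces the `Min` ratios of the `test_ExactQuad` table; the helper name `relative_to_best` is hypothetical and not part of pytest-benchmark:

```python
def relative_to_best(values):
    """Return each value divided by the smallest value, rounded to 2 places."""
    best = min(values)
    return [round(v / best, 2) for v in values]


# Min times in ms, copied from the test_ExactQuad table above.
exact_quad_min = {
    "0001_dp-nm-d": 319.2256,
    "0002_dp-nm-s": 55.8916,
    "0003_dp-fm-d": 218.9466,
    "0004_dp-fm-s": 41.2266,
}

ratios = relative_to_best(list(exact_quad_min.values()))
for name, ratio in zip(exact_quad_min, ratios):
    print(f"{name}: {ratio:.2f}")
```

A ratio of 1.0 marks the fastest variant in a group; every other ratio says how many times slower that variant's best run was.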
pytest-benchmark compare --group-by=name --sort=name --histogram=dp_histo .benchmarks/Linux-CPython-3.11-64bit/000*

[Histogram images, one per benchmark group, named dp_histo-test_<Name>: Aperture, Buncher, CFbend, ChrAcc, ChrDrift, ChrPlasmaLens, ChrQuad, ConstF, DipEdge, Drift, ExactDrift, ExactMultipole, ExactQuad, ExactSbend, Kicker, LinearMap, Multipole, NonlinearLens, PlaneXYRot, PRot, Quad, RFCavity, Sbend, ShortRF, SoftQuadrupole, SoftSolenoid, Sol, TaperedPL, ThinDipole]

@ax3l ax3l added component: elements Elements/maps/external fields tracking: particles labels Jul 31, 2025
@ax3l ax3l added this to the Advanced Methods (SciDAC-5) milestone Jul 31, 2025

  // assign intermediate parameter
- amrex::ParticleReal const step = slice_ds /std::sqrt(powi<2>(pt)-1.0_prt);
+ amrex::ParticleReal const step = slice_ds / std::sqrt(powi<2>(pt)-1.0_prt);

Should this be sqrt instead of std::sqrt here?

Perhaps not--this is still of type amrex::ParticleReal and not T_Real.

@ax3l ax3l Jul 31, 2025

Not here, because it is not in the particle tracking operator(). I changed this whitespace by accident (sorry for the noise).

Comment thread src/elements/NonlinearLens.H
Comment thread src/elements/Programmable.H
Comment thread src/elements/mixin/beamoptic.H

@cemitch99 cemitch99 left a comment

The updated logic looks reasonable to me--I did not find any obvious place where the math should break. I made only minor comments. Suggest one of our experienced AMReX colleagues take a look.

ax3l commented Jul 31, 2025

The elements listed below are "simple" elements, for which vectorization packing/unpacking causes a slight overhead. I will revert them to a serial implementation.

Note that some of these benchmarks may be skewed by the well-known Intel AVX-512 frequency drops. Even if they are, excluding these elements from vectorization for now is not a problem, because they are very fast ("simple") elements.

SP

  • Buncher
  • CFbend
  • ConstF?
  • DipEdge
  • Drift?
  • Kicker
  • PlaneXYRot
  • Sbend?
  • Sol
  • TaperedPL

DP

  • Buncher?
  • CFbend?
  • ConstF?
  • DipEdge
  • Drift?
  • Kicker
  • PlaneXYRot
  • PRot?
  • Quad
  • Sbend
  • Sol
  • TaperedPL
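
One way to arrive at such a list is to compare the best (`Min`) time of the serial and the vectorized run of each element and to flag the elements where the vectorized run is slower. The sketch below does this for a few values copied from the benchmark tables earlier in this thread. It assumes that the `-d` suffix marks the non-vectorized run and the `-s` suffix the SIMD run, which this part of the thread does not spell out; the helper name `simd_speedup` is hypothetical, and this is not necessarily how the lists above were compiled.

```python
def simd_speedup(scalar_ms, simd_ms):
    """Speedup of the SIMD run over the serial run; below 1.0 means SIMD is slower."""
    return scalar_ms / simd_ms


# (serial, SIMD) Min times in ms from the dp-nm-d and dp-nm-s rows.
# Assumption: "-d" is the serial build and "-s" is the SIMD build.
samples = {
    "Kicker": (1.6393, 2.6686),
    "Quad": (2.2294, 2.6693),
    "ExactQuad": (319.2256, 55.8916),
    "RFCavity": (20.7213, 2.7097),
}

# Elements for which vectorization did not pay off in this sample.
slower_with_simd = sorted(
    name
    for name, (scalar, simd) in samples.items()
    if simd_speedup(scalar, simd) < 1.0
)
print(slower_with_simd)
```

Entries close to a speedup of 1.0 are judgment calls, which may be why some elements in the lists above carry a question mark.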

ax3l added 2 commits July 31, 2025 10:37
Add inline comments on TODOs that we will explore in
follow-ups.

ax3l commented Jul 31, 2025

@atmyers @AlexanderSinn @WeiqunZhang let me know if one of you wants to take a final look, good to go from our end otherwise.

Comment thread examples/cyclotron/analysis_cyclotron_loss.py Outdated
@ax3l ax3l merged commit 34d263b into BLAST-ImpactX:development Aug 1, 2025
16 checks passed
@ax3l ax3l deleted the topic-simd branch August 1, 2025 15:19
@ax3l ax3l mentioned this pull request Jan 27, 2026
4 tasks
@ax3l ax3l added the backend: SIMD CPU with SIMD acceleration label Jan 29, 2026

Labels

backend: openmp Specific to OpenMP execution (CPUs) backend: SIMD CPU with SIMD acceleration component: elements Elements/maps/external fields Performance optimization tracking: particles

3 participants