
Reducer: Reduction class with support for runtime number of values #4925

Closed
WeiqunZhang wants to merge 2 commits into AMReX-Codes:development from WeiqunZhang:reducer

Conversation

@WeiqunZhang
Member

This new class adds support for reducing a runtime number of values, in addition to the existing support for a compile-time number of values. The values must all be the same type, but the reduction operations can be a mix of max, min, and sum.

@WeiqunZhang WeiqunZhang requested a review from ax3l January 28, 2026 01:30
@WeiqunZhang
Member Author

@ax3l This can be used to implement the ParticleReduce feature you have requested.

@WeiqunZhang
Member Author

WeiqunZhang commented Jan 28, 2026

You can do something like

Vector<ReduceOpType> ops;
ops.push_back(ReduceOpType::min);
ops.push_back(ReduceOpType::sum);
Reducer<ParticleReal> reducer(ops);
for (ParIter ...) {
    Long np = ...;
    reducer.eval(np, [=] AMREX_GPU_DEVICE (int iop, Long i)
    {
        return ...; // particle i's value for iop'th reduction.
    });
}
Vector<ParticleReal> result = reducer.getResults();

@WeiqunZhang
Member Author

/run-hpsf-gitlab-ci


@amrex-gitlab-ci-reporter

GitLab CI 1403015 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1403015.

Comment on lines +52 to +59
reducer.eval(box, [=] AMREX_GPU_DEVICE (int iop, int i, int j, int k)
{ // 0 <= iop < 5
    if (iop >= 0 && iop <= 2) { // min, max & sum
        return a(i,j,k);
    } else { // 1-norm & inf-norm
        return std::abs(a(i,j,k));
    }
});
Member

@AlexanderSinn Jan 28, 2026


I could imagine this interface being quite awkward to use if, for example, 20-30 different quantities need to be reduced with some disabled at runtime. This would require a big switch statement to map iop to the quantity equations. Additionally, any shared data like a(i,j,k) would need to be read from memory for each iop separately instead of being reused in registers. I think this would negate most of the memory-bandwidth benefit of combining all the reductions into a single kernel, leaving only the advantage of reduced launch latency.

Member


I like this kind of style. I wonder how the performance compares, as it does more parallel updates but uses less memory bandwidth.

amrex::ParallelFor(amrex::Gpu::KernelInfo().setReduction(true), box,
    [=] AMREX_GPU_DEVICE (int i, int j, int k, amrex::Gpu::Handler const& handler) noexcept
    {
        amrex::Real * result_ptr = rptr;
        amrex::Real value = a(i, j, k);

        if (do_min_max_sum) {
            // min, max & sum
            amrex::Gpu::deviceReduceMin(result_ptr++, value, handler);
            amrex::Gpu::deviceReduceMax(result_ptr++, value, handler);
            amrex::Gpu::deviceReduceSum(result_ptr++, value, handler);
        }

        if (do_norm) {
            // 1-norm & inf-norm
            amrex::Gpu::deviceReduceSum(result_ptr++, std::abs(value), handler);
            amrex::Gpu::deviceReduceMax(result_ptr++, std::abs(value), handler);
        }
    });

Member Author


Let me do some testing.

Member Author


The interface is always going to be awkward no matter what we use. If the number of values is big, the performance could be a real issue.

The deviceReduce* approach could be very slow with OMP: for sum it uses omp atomic, and for min and max it uses a critical region.

@ax3l
Member

ax3l commented Jan 28, 2026

@WeiqunZhang
Member Author

Some results on perlmutter.

* Perlmutter, A100 80GB
  5 reduces, https://github.com/WeiqunZhang/amrex-devtests/tree/main/reducer
  Run w/ n_cell=512, max_grid_size=128
  Vector: this PR
  Tuple: Use fixed-size GpuTuple
  deviceReduce: use deviceReduceSum etc.
  Loop: Loop over single-element tuple reduction
| Method       |   CPU |     GPU | OMP w/ 8 threads |
|--------------+-------+---------+------------------|
| Vector       | 0.380 | 0.00343 |           0.0508 |
| Tuple        | 0.124 | 0.00136 |           0.0164 |
| deviceReduce | 0.161 | 0.01097 |            97.95 |
| Loop         | 0.394 | 0.00441 |           0.0627 |

* Perlmutter, A100 80GB
  30 reduces, https://github.com/WeiqunZhang/amrex-devtests/tree/main/reducer2
  Run w/ n_cell=512, max_grid_size=128
  Vector: this PR
  Tuple: Use fixed-size GpuTuple
  deviceReduce: use deviceReduceSum etc.
  Loop: Loop over single-element tuple reduction
| Method       |   CPU |    GPU |
|--------------+-------+--------|
| Vector       | 3.455 | 0.0186 |
| Tuple        | 0.117 | 0.0117 |
| deviceReduce | 0.311 | 0.0181 |
| Loop         | 3.501 | 0.0269 |

We should be able to improve the CPU performance.

@WeiqunZhang
Member Author

I swapped the loop order for the CPU implementation so that at least the data are more likely to be in cache. That improved the 30-reduce "Vector" run from 3.455 to 2.424, but it's still way slower than the "Tuple" (0.117) and "deviceReduce" (0.311) approaches.

@WeiqunZhang
Member Author

Results on Frontier.

* Frontier MI250X
  5 reduces, https://github.com/WeiqunZhang/amrex-devtests/tree/main/reducer
  Run w/ n_cell=512, max_grid_size=128
  Vector v0: cc375527bf92 this PR
  Vector v1: efd4839f83c2 this PR
  Tuple: Use fixed-size GpuTuple
  deviceReduce: use deviceReduceSum etc.
  Loop: Loop over single-element tuple reduction
| Method       |    CPU |     GPU |
|--------------+--------+---------|
| Vector v0    | 0.0895 | 0.00392 |
| Vector v1    | 0.5098 | N/A     |
| Tuple        | 0.0613 | 0.00142 |
| deviceReduce | 0.0627 | 0.14124 |
| Loop         | 0.2226 | 0.00427 |

* Frontier MI250X
  30 reduces, https://github.com/WeiqunZhang/amrex-devtests/tree/main/reducer2
  Run w/ n_cell=512, max_grid_size=128
  Vector v0/v1: this PR (same commits as above)
  Tuple: Use fixed-size GpuTuple
  deviceReduce: use deviceReduceSum etc.
  Loop: Loop over single-element tuple reduction
| Method       |    CPU |     GPU |
|--------------+--------+---------|
| Vector v0    | 0.4297 | 0.02054 |
| Vector v1    | 2.4189 |     N/A |
| Tuple        | 0.3310 | 0.00536 |
| deviceReduce | 0.3448 | 0.14935 |
| Loop         | 1.3521 | 0.02529 |

@WeiqunZhang
Member Author

One thing is clear: the tuple reduce approach is the best for GPU, CPU, and CPU with OMP threads.

The deviceReduce approach seems very bad for AMD GPUs (because of atomics?).

@WeiqunZhang
Member Author

I will try a different approach. If we make Reducer work with a single reduce type, it could simplify the code a lot. Although one would need to call Reducer up to three times, it might still be a win.

@WeiqunZhang
Member Author

An idea for providing runtime options in ImpactX: maybe we could add a runtime mask to the existing tuple-based reduction.

@ax3l
Member

ax3l commented Jan 29, 2026

Can you tell me more about the mask approach?
I thought about zero-reducing unused moment terms, but I assumed the cost is in data movement and not much in the small addition per value.

@WeiqunZhang
Member Author

Looks like the CPU compiler on Perlmutter cheated. The time for the 30-tuple reduce cannot be trusted, because the compiler figured out the values are all the same. Anyway, the bottom line is still that the approach in this PR does not work well.

@WeiqunZhang WeiqunZhang closed this Feb 3, 2026