
Reducer: Reduction class with support for runtime number of values #4925

Closed
WeiqunZhang wants to merge 2 commits into AMReX-Codes:development from WeiqunZhang:reducer

Conversation

@WeiqunZhang
Member

This new class adds support for reducing a runtime number of values, in addition to the existing support for a compile-time number of values. The values must all be the same type, but the reduction operations can be a mix of max, min, and sum.

@WeiqunZhang WeiqunZhang requested a review from ax3l January 28, 2026 01:30
@WeiqunZhang
Member Author

@ax3l This can be used to implement the ParticleReduce feature you have requested.

@WeiqunZhang
Member Author

WeiqunZhang commented Jan 28, 2026

You can do something like

Vector<ReduceOpType> ops;
ops.push_back(ReduceOpType::min);
ops.push_back(ReduceOpType::sum);
Reducer<ParticleReal> reducer(ops);
for (ParIter ...) {
    Long np = ...;
    reducer.eval(np, [=] AMREX_GPU_DEVICE (int iop, Long i)
    {
        return ...; // particle i's value for iop'th reduction.
    });
}
Vector<ParticleReal> result = reducer.getResults();

@WeiqunZhang
Member Author

/run-hpsf-gitlab-ci


@amrex-gitlab-ci-reporter

GitLab CI 1403015 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1403015.

Comment on lines +52 to +59
reducer.eval(box, [=] AMREX_GPU_DEVICE (int iop, int i, int j, int k)
{ // 0 <= iop < 5
    if (iop >= 0 && iop <= 2) { // min, max & sum
        return a(i,j,k);
    } else { // 1-norm & inf-norm
        return std::abs(a(i,j,k));
    }
});
Member

@AlexanderSinn Jan 28, 2026


I could imagine this interface being quite awkward to use if, for example, 20-30 different quantities need to be reduced with some disabled at runtime. This would require a big switch statement to map iop to the quantity equations. Additionally, any shared data like a(i,j,k) would need to be read from memory for each iop separately instead of being reused in registers. I think this would negate most of the memory-bandwidth benefit of combining all the reductions into a single kernel, leaving only the advantage of reduced launch latency.

Member


I like this kind of style. I wonder how the performance compares, as it does more parallel updates but uses less memory bandwidth.

amrex::ParallelFor(amrex::Gpu::KernelInfo().setReduction(true), box,
    [=] AMREX_GPU_DEVICE (int i, int j, int k, amrex::Gpu::Handler const& handler) noexcept
    {
        amrex::Real * result_ptr = rptr;
        amrex::Real value = a(i, j, k);

        if (do_min_max_sum) {
            // min, max & sum
            amrex::Gpu::deviceReduceMin(result_ptr++, value, handler);
            amrex::Gpu::deviceReduceMax(result_ptr++, value, handler);
            amrex::Gpu::deviceReduceSum(result_ptr++, value, handler);
        }

        if (do_norm) {
            // 1-norm & inf-norm
            amrex::Gpu::deviceReduceSum(result_ptr++, std::abs(value), handler);
            amrex::Gpu::deviceReduceMax(result_ptr++, std::abs(value), handler);
        }
    });

Member Author


Let me do some testing.

Member Author


The interface is always going to be awkward no matter what we use. If the number of values is big, the performance could be a real issue.

The deviceReduce* approach could be very slow with OMP: for sum it uses omp atomic, and for min and max it uses a critical region.

@ax3l
Member

ax3l commented Jan 28, 2026

@WeiqunZhang
Member Author

Some results on perlmutter.

* Perlmutter, A100 80GB
  5 reduces, https://github.com/WeiqunZhang/amrex-devtests/tree/main/reducer
  Run w/ n_cell=512, max_grid_size=128
  Vector: this PR
  Tuple: Use fixed-size GpuTuple
  deviceReduce: use deviceReduceSum etc.
  Loop: Loop over single-element tuple reduction
| Method       |   CPU |     GPU | OMP w/ 8 threads |
|--------------+-------+---------+------------------|
| Vector       | 0.380 | 0.00343 |           0.0508 |
| Tuple        | 0.124 | 0.00136 |           0.0164 |
| deviceReduce | 0.161 | 0.01097 |            97.95 |
| Loop         | 0.394 | 0.00441 |           0.0627 |

* Perlmutter, A100 80GB
  30 reduces, https://github.com/WeiqunZhang/amrex-devtests/tree/main/reducer2
  Run w/ n_cell=512, max_grid_size=128
  Vector: this PR
  Tuple: Use fixed-size GpuTuple
  deviceReduce: use deviceReduceSum etc.
  Loop: Loop over single-element tuple reduction
| Method       |   CPU |    GPU |
|--------------+-------+--------|
| Vector       | 3.455 | 0.0186 |
| Tuple        | 0.117 | 0.0117 |
| deviceReduce | 0.311 | 0.0181 |
| Loop         | 3.501 | 0.0269 |

We should be able to improve the CPU performance.

@WeiqunZhang
Member Author

I swapped the loop order for the CPU implementation so that at least the data are more likely to be in cache. That improved the 30-reduce "Vector" run from 3.455 to 2.424, but it's still way slower than the "Tuple" (0.117) and "deviceReduce" (0.311) approaches.

@WeiqunZhang
Member Author

Results on Frontier.

* Frontier MI250X
  5 reduces, https://github.com/WeiqunZhang/amrex-devtests/tree/main/reducer
  Run w/ n_cell=512, max_grid_size=128
  Vector v0: cc375527bf92 this PR
  Vector v1: efd4839f83c2 this PR
  Tuple: Use fixed-size GpuTuple
  deviceReduce: use deviceReduceSum etc.
  Loop: Loop over single-element tuple reduction
| Method       |    CPU |     GPU |
|--------------+--------+---------|
| Vector v0    | 0.0895 | 0.00392 |
| Vector v1    | 0.5098 | N/A     |
| Tuple        | 0.0613 | 0.00142 |
| deviceReduce | 0.0627 | 0.14124 |
| Loop         | 0.2226 | 0.00427 |

* Frontier MI250X
  30 reduces, https://github.com/WeiqunZhang/amrex-devtests/tree/main/reducer2
  Run w/ n_cell=512, max_grid_size=128
  Vector v0/v1: this PR (same commits as above)
  Tuple: Use fixed-size GpuTuple
  deviceReduce: use deviceReduceSum etc.
  Loop: Loop over single-element tuple reduction
| Method       |    CPU |     GPU |
|--------------+--------+---------|
| Vector v0    | 0.4297 | 0.02054 |
| Vector v1    | 2.4189 |     N/A |
| Tuple        | 0.3310 | 0.00536 |
| deviceReduce | 0.3448 | 0.14935 |
| Loop         | 1.3521 | 0.02529 |

@WeiqunZhang
Member Author

One thing is clear: the tuple reduce approach is the best for GPU, CPU, and CPU with OMP threads.

The deviceReduce approach seems very bad for AMD GPUs (because of atomics?).

@WeiqunZhang
Member Author

I will try a different approach. If we make Reducer work with a single reduce type, it could simplify the code a lot. Although one would need to call Reducer up to three times, it might still be a win.

@WeiqunZhang
Member Author

An idea for providing runtime options in ImpactX: maybe we could add a runtime mask to the existing tuple-based reduction.

@ax3l
Member

ax3l commented Jan 29, 2026

Can you tell me more about the mask approach?
I thought about zero-reducing unused moment terms, but I assumed the cost is in data movement and not much in the small addition per value.

@WeiqunZhang
Member Author

Looks like the CPU compiler on Perlmutter cheated. The time for the 30-tuple reduce cannot be trusted, because the compiler figured out the values are all the same. Anyway, the bottom line is still that the approach in this PR does not work well.

@WeiqunZhang WeiqunZhang closed this Feb 3, 2026