Reducer: Reduction class with support for runtime number of values #4925
WeiqunZhang wants to merge 2 commits into AMReX-Codes:development
Conversation
@ax3l This can be used to implement the ParticleReduce feature you have requested.

You can do something like

```cpp
Vector<ReduceOpType> ops;
ops.push_back(ReduceOpType::min);
ops.push_back(ReduceOpType::sum);
Reducer<ParticleReal> reducer(ops);
for (ParIter ...) {
    Long np = ...;
    reducer.eval(np, [=] AMREX_GPU_DEVICE (int iop, Long i)
    {
        return ...; // particle i's value for the iop'th reduction
    });
}
Vector<ParticleReal> result = reducer.getResults();
```
This new class adds support for reducing a runtime number of values, in addition to the existing support for a compile-time number of values. The values must all have the same type, but the reduction types can be a mix of max, min, and sum.
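As a concrete illustration of those semantics, here is a hypothetical host-side sketch in plain C++ (`ReduceOpType` mirrors the snippet above, but `runtime_reduce` and its signature are made up for this sketch and are not the AMReX API):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Illustrative only: the number of reductions and their op types are
// chosen at run time, and a callback maps (iop, i) to the value fed
// into the iop'th reduction.
enum class ReduceOpType { min, max, sum };

template <typename T>
std::vector<T> runtime_reduce(std::vector<ReduceOpType> const& ops,
                              std::size_t n,
                              std::function<T(int, std::size_t)> const& f)
{
    std::vector<T> r(ops.size());
    for (std::size_t iop = 0; iop < ops.size(); ++iop) {
        switch (ops[iop]) { // identity element for each op type
        case ReduceOpType::min: r[iop] = std::numeric_limits<T>::max();    break;
        case ReduceOpType::max: r[iop] = std::numeric_limits<T>::lowest(); break;
        case ReduceOpType::sum: r[iop] = T(0);                             break;
        }
        for (std::size_t i = 0; i < n; ++i) {
            T v = f(static_cast<int>(iop), i);
            switch (ops[iop]) {
            case ReduceOpType::min: r[iop] = std::min(r[iop], v); break;
            case ReduceOpType::max: r[iop] = std::max(r[iop], v); break;
            case ReduceOpType::sum: r[iop] += v;                  break;
            }
        }
    }
    return r;
}
```

A real device implementation would of course parallelize over `i` and stage per-block partial results; this only pins down what `eval` and `getResults` are supposed to compute.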
/run-hpsf-gitlab-ci

GitLab CI 1403015 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1403015.
```cpp
reducer.eval(box, [=] AMREX_GPU_DEVICE (int iop, int i, int j, int k)
{   // 0 <= iop < 5
    if (iop >= 0 && iop <= 2) { // min, max & sum
        return a(i,j,k);
    } else { // 1-norm & inf-norm
        return std::abs(a(i,j,k));
    }
});
```
I could imagine this interface being quite awkward to use if, for example, 20-30 different quantities need to be reduced with some disabled at runtime. That would require a big switch statement mapping iop to the quantity equations. Additionally, any shared data like a(i,j,k) would need to be read from memory for each iop separately instead of being reused in registers. I think this would negate most of the memory-bandwidth benefit of combining all the reductions into a single kernel, leaving only the advantage of reduced launch latency.
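The register-reuse point can be seen in a hypothetical host-side loop (illustrative plain C++, not AMReX code): each element is loaded once into `v` and feeds all five accumulators, which is exactly what the per-iop callback cannot do.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct FiveReductions { double mn, mx, sm, l1, linf; };

// Single pass over the data: one load of v per element is reused by all
// five reductions (assumes a nonempty input for the min/max seeds).
inline FiveReductions reduce_once(std::vector<double> const& a)
{
    FiveReductions r{a[0], a[0], 0.0, 0.0, 0.0};
    for (double v : a) {
        r.mn   = std::min(r.mn, v);
        r.mx   = std::max(r.mx, v);
        r.sm  += v;
        r.l1  += std::abs(v);                   // 1-norm
        r.linf = std::max(r.linf, std::abs(v)); // inf-norm
    }
    return r;
}
```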
I like this kind of style. I wonder how the performance compares, as it does more parallel updates but uses less memory bandwidth.
```cpp
amrex::ParallelFor(amrex::Gpu::KernelInfo().setReduction(true), box,
[=] AMREX_GPU_DEVICE (int i, int j, int k, amrex::Gpu::Handler const& handler) noexcept
{
    amrex::Real * result_ptr = rptr;
    amrex::Real value = a(i, j, k);
    if (do_min_max_sum) {
        // min, max & sum
        amrex::Gpu::deviceReduceMin(result_ptr++, value, handler);
        amrex::Gpu::deviceReduceMax(result_ptr++, value, handler);
        amrex::Gpu::deviceReduceSum(result_ptr++, value, handler);
    }
    if (do_norm) {
        // 1-norm & inf-norm
        amrex::Gpu::deviceReduceSum(result_ptr++, std::abs(value), handler);
        amrex::Gpu::deviceReduceMax(result_ptr++, std::abs(value), handler);
    }
});
```
Let me do some testing.
The interface is always going to be awkward no matter what we use. If the number of values is big, the performance could be a real issue.
The deviceReduce* way could be very slow with OMP. For sum, it uses omp atomic; for min and max, it uses a critical region.
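To illustrate why per-element atomics or critical regions hurt on the CPU, here is a standalone sketch (hypothetical, not AMReX code) in which each thread accumulates into a private partial and the partials are combined once at the end:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Thread-private partials: each thread writes only partial[t], so no
// atomics or critical regions appear in the hot loop; the combine step
// at the end is O(nthreads).
inline double partial_sum_reduce(std::vector<double> const& a, int nthreads)
{
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> pool;
    std::size_t chunk = (a.size() + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(a.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) { partial[t] += a[i]; }
        });
    }
    for (auto& th : pool) { th.join(); }
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

This is essentially what an OpenMP `reduction` clause arranges under the hood, whereas a per-element `omp atomic` (or `critical` for min/max) serializes every update.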
Thank you! ✨ For background, the application is to do a more fine-grained reduction selection in BLAST-ImpactX/impactx#1102

Some results on perlmutter. We should be able to improve the CPU performance.
I swapped the loop order in the CPU implementation so that at least the data are more likely to be in cache. That improved the 30-reduce "Vector" run from 3.455 to 2.424. But it's still way slower than the "Tuple" (0.117) and "deviceReduce" (0.311) approaches.
Results on Frontier.
One thing is clear: the tuple-reduce approach is the best on GPU, CPU, and CPU with OMP threads. The deviceReduce approach seems very bad on AMD GPUs (because of atomics?).
I will try a different approach. If we make Reducer work with a single reduce type, it could simplify a lot of code. Although one would need to call Reducer up to three times, it might still be a win.
An idea for providing runtime options in ImpactX: maybe we could add a runtime mask to the existing tuple-based reduction.
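One way such a mask could look (purely a sketch with made-up names, not an actual proposal for the AMReX interface): keep the set of reductions fixed at compile time and pass a runtime boolean mask that disables unwanted slots, so the kernel keeps its compile-time structure.

```cpp
#include <algorithm>
#include <array>
#include <vector>

struct MaskedResult { double mn, mx, sm; };

// The set of reductions (min, max, sum) is fixed at compile time; the
// runtime mask only decides which accumulators are updated. Disabled
// slots keep their initial values.
inline MaskedResult masked_reduce(std::vector<double> const& a,
                                  std::array<bool, 3> const& mask)
{
    MaskedResult r{0.0, 0.0, 0.0};
    if (a.empty()) { return r; }
    r.mn = r.mx = a[0];
    for (double v : a) {
        if (mask[0]) { r.mn = std::min(r.mn, v); }
        if (mask[1]) { r.mx = std::max(r.mx, v); }
        if (mask[2]) { r.sm += v; }
    }
    return r;
}
```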
Can you tell me more about the mask approach?
Looks like the CPU compiler on perlmutter cheated. The time for the 30-tuple reduce cannot be trusted, because the compiler figured out they all have the same value. Anyway, the bottom line is still that the approach in this PR does not work well.