
Reuse GPU tags in MLEBABecLap::applyBC() #4882

Merged
WeiqunZhang merged 10 commits into AMReX-Codes:development from
ankithadas:MLEBABecLap-Reuse-Tags
Jan 14, 2026

@ankithadas
Contributor

Summary

Additional background

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@ankithadas

This comment was marked as outdated.

@ankithadas ankithadas marked this pull request as ready for review January 8, 2026 10:56
@ankithadas ankithadas changed the title from "Reuse GPU tags in MLEBABecLap:: applyBC()" to "Reuse GPU tags in MLEBABecLap::applyBC()" Jan 8, 2026
@ankithadas ankithadas marked this pull request as draft January 8, 2026 12:26
@ankithadas
Contributor Author

This is currently broken. Using more than one AMR level causes convergence issues.

@ankithadas
Contributor Author

Slight performance improvement when reusing GPU tags in MLEBABecLap::applyBC()
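
As a rough illustration of the idea of reusing tags, here is a minimal, self-contained sketch. It assumes the saving comes from building the list of boundary tags once and handing the same list to every later call, rather than rebuilding it each time. `BoundaryTag` and `TagCache` are hypothetical names invented for this sketch; they are not AMReX types and this is not the code in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a boundary "tag": one unit of boundary work
// (a box and one of its faces). This is NOT the AMReX tag type.
struct BoundaryTag {
    int box_id;
    int face;
};

// Hypothetical cache: builds the tag list on first use and reuses it afterwards.
class TagCache {
public:
    TagCache(int nboxes, int nfaces) : m_nboxes(nboxes), m_nfaces(nfaces) {
        assert(nboxes >= 0 && nfaces >= 0);
    }

    // Returns the cached tags. The list is built only when no valid copy exists.
    const std::vector<BoundaryTag>& tags() {
        if (!m_built) {
            m_tags.reserve(static_cast<std::size_t>(m_nboxes) *
                           static_cast<std::size_t>(m_nfaces));
            for (int b = 0; b < m_nboxes; ++b) {
                for (int f = 0; f < m_nfaces; ++f) {
                    m_tags.push_back(BoundaryTag{b, f});
                }
            }
            m_built = true;
            ++m_build_count;
        }
        return m_tags;
    }

    // Drops the cached list, e.g. when the grids or boundary setup change.
    void invalidate() {
        m_tags.clear();
        m_built = false;
    }

    // Number of times the list has actually been built.
    int build_count() const { return m_build_count; }

private:
    int m_nboxes;
    int m_nfaces;
    bool m_built = false;
    int m_build_count = 0;
    std::vector<BoundaryTag> m_tags;
};
```

With this shape, calling `tags()` many times between invalidations costs one build, so any per-call setup overhead is paid once rather than on each of the 607 `applyBC()` calls seen in the profile.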

GPU Device: NVIDIA A5000

Test: amrex/Tests/LinearSolvers/CellEB
Inputs

n_cell = 320
#verbose = 10
#use_petsc = true
eb2.geom_type = sphere
eb2.sphere_center = 0.5  0.5  0.5
eb2.sphere_radius = 0.25
eb2.sphere_has_fluid_inside = 0

After

Initializing AMReX (51bd06566543-dirty)...
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
Initializing CUDA...
CUDA initialized with 1 device.
AMReX (51bd06566543-dirty) initialized
vfrc min = 2.267635963e-07
Initial max, 1 and 2-norm residuals at level 0 = 1157920.353 1.18632832e+11 172310653.1
MLMG: # of AMR levels: 1
      # of MG levels on the coarsest AMR level: 7
MLMG: Initial rhs               = 0
MLMG: Initial residual (resid0) = 1157920.353
MLCGSolver_BiCGStab: Initial error (error0) =        0.187443743
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 8.127541048e-05
MLMG: Iteration   1 Fine resid/resid0 = 0.01655981191
MLCGSolver_BiCGStab: Initial error (error0) =        0.07363543285
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 6.34748546e-05
MLMG: Iteration   2 Fine resid/resid0 = 0.00136280975
MLCGSolver_BiCGStab: Initial error (error0) =        0.009381498681
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 6.824751003e-05
MLMG: Iteration   3 Fine resid/resid0 = 4.951131385e-05
MLCGSolver_BiCGStab: Initial error (error0) =        0.0005124706739
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 4.49532358e-05
MLMG: Iteration   4 Fine resid/resid0 = 2.410026621e-06
MLCGSolver_BiCGStab: Initial error (error0) =        1.978430222e-05
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 4.455584488e-05
MLMG: Iteration   5 Fine resid/resid0 = 1.007094382e-07
MLCGSolver_BiCGStab: Initial error (error0) =        1.049685501e-06
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 5.420743823e-05
MLMG: Iteration   6 Fine resid/resid0 = 4.673451236e-09
MLCGSolver_BiCGStab: Initial error (error0) =        3.884517412e-08
MLCGSolver_BiCGStab: Final: Iteration    7 rel. err. 3.635062157e-05
MLMG: Iteration   7 Fine resid/resid0 = 2.012107808e-10
MLCGSolver_BiCGStab: Initial error (error0) =        1.865120721e-09
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 5.896060112e-05
MLMG: Iteration   8 Fine resid/resid0 = 9.116323764e-12
MLCGSolver_BiCGStab: Initial error (error0) =        7.506198938e-11
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 7.780028184e-05
MLMG: Iteration   9 Fine resid/resid0 = 3.83532538e-13
MLMG: Final Iter. 9 resid, resid/resid0 = 4.441001316e-07, 3.83532538e-13
MLMG: Timers: Solve = 1.209650513 Iter = 1.186354926 Bottom = 0.012291025
Final max, 1 and 2-norm residuals at level 0 = 3.701518851e-05 0.8073304517 0.0003706030728


TinyProfiler total time across processes [min...avg...max]: 1.451 ... 1.451 ... 1.451

--------------------------------------------------------------------------------------------
Name                                         NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------
MLEBABecLap::Fsmooth()                          432     0.9391     0.9391     0.9391  64.71%
MLEBABecLap::Fapply()                           175     0.1428     0.1428     0.1428   9.84%
main                                              1    0.04345    0.04345    0.04345   2.99%
EB2::GShopLevel()-fine                            1    0.03669    0.03669    0.03669   2.53%
FillBoundary_nowait()                           637    0.03089    0.03089    0.03089   2.13%
FabArray::setVal()                              288    0.03001    0.03001    0.03001   2.07%
FabArray::Xpay()                                 66     0.0267     0.0267     0.0267   1.84%
MLLinOp::defineGrids()                            1    0.02224    0.02224    0.02224   1.53%
MLCellLinOp::defineAuxData()                      1    0.01503    0.01503    0.01503   1.04%
FabArray::ParallelCopy_nowait()                 273    0.01293    0.01293    0.01293   0.89%
EB2::GShopLevel()-coarse                          6    0.01173    0.01173    0.01173   0.81%
amrex::Add()                                      9    0.01091    0.01091    0.01091   0.75%
MLMG::ResNormInf()                               10    0.01074    0.01074    0.01074   0.74%
MLEBABecLap::interpolation()                     54    0.01059    0.01059    0.01059   0.73%
MLMG::prepareForSolve()                           1   0.009654   0.009654   0.009654   0.67%
FabArrayBase::CPC::define()                     104   0.008786   0.008786   0.008786   0.61%
MLMG::mgVcycle_down::1                            9   0.008778   0.008778   0.008778   0.60%
MLEBABecLap::define()                             1    0.00851    0.00851    0.00851   0.59%
MLABecLaplacian::prepareForSolve()                1   0.007636   0.007636   0.007636   0.53%
MLMG::addInterpCorrection()                      54   0.007334   0.007334   0.007334   0.51%
BndryData::define()                               1   0.005956   0.005956   0.005956   0.41%
MLMG::mgVcycle_down::0                            9   0.005507   0.005507   0.005507   0.38%
FabArrayBase::FB::FB()                           67   0.005479   0.005479   0.005479   0.38%
amrex::Copy()                                    33   0.005037   0.005037   0.005037   0.35%
MLEBABecLap::applyBC()                          607   0.004428   0.004428   0.004428   0.31%
FabArray::norminf()                             120   0.003285   0.003285   0.003285   0.23%
amrex::Dot()                                    218   0.003103   0.003103   0.003103   0.21%
MLCGSolver::bicgstab                              9   0.002477   0.002477   0.002477   0.17%
MultiFab::Subtract()                              2   0.002356   0.002356   0.002356   0.16%
MLCellLinOp::prepareForSolve()                    1   0.002229   0.002229   0.002229   0.15%
MLMG::mgVcycle_down::2                            9    0.00197    0.00197    0.00197   0.14%
MLMG::apply()                                     2   0.001962   0.001962   0.001962   0.14%
FabArray::BuildMask()                             7   0.001088   0.001088   0.001088   0.07%
MLMG::mgVcycle_down::3                            9  0.0006272  0.0006272  0.0006272   0.04%
FabArrayBase::getCPC()                          195  0.0001845  0.0001845  0.0001845   0.01%
MLMG::mgVcycle_down::4                            9  0.0001448  0.0001448  0.0001448   0.01%
FabArray::FillBoundary()                        637  0.0001352  0.0001352  0.0001352   0.01%
MLMG::mgVcycle_down::5                            9  0.0001299  0.0001299  0.0001299   0.01%
MLMG::solve()                                     1  0.0001273  0.0001273  0.0001273   0.01%
FabArrayBase::getFB()                           644  0.0001254  0.0001254  0.0001254   0.01%
MLCellLinOp::smooth()                           117  8.841e-05  8.841e-05  8.841e-05   0.01%
FabArray::ParallelCopy()                        273  7.435e-05  7.435e-05  7.435e-05   0.01%
MLEBABecLap::apply()                            175  4.818e-05  4.818e-05  4.818e-05   0.00%
EB2::Initialize()                                 1  4.557e-05  4.557e-05  4.557e-05   0.00%
MLCellLinOp::defineBC()                           1  3.304e-05  3.304e-05  3.304e-05   0.00%
MLMG::actualBottomSolve()                         9   2.36e-05   2.36e-05   2.36e-05   0.00%
MLCellLinOp::correctionResidual()                54  2.327e-05  2.327e-05  2.327e-05   0.00%
MLMG::mgVcycle()                                  9   2.23e-05   2.23e-05   2.23e-05   0.00%
MLMG::oneIter()                                   9  1.775e-05  1.775e-05  1.775e-05   0.00%
MLMG:computeResOfCorrection()                    54  1.768e-05  1.768e-05  1.768e-05   0.00%
MLCellLinOp::solutionResidual()                  12  1.181e-05  1.181e-05  1.181e-05   0.00%
MLMG::computeResidual()                           9  5.926e-06  5.926e-06  5.926e-06   0.00%
MLMG::mgVcycle_up::0                              9  5.535e-06  5.535e-06  5.535e-06   0.00%
MLLinOp::define()                                 1    5.5e-06    5.5e-06    5.5e-06   0.00%
MLMG::mgVcycle_bottom                             9  4.803e-06  4.803e-06  4.803e-06   0.00%
MLMG::mgVcycle_up::5                              9  4.496e-06  4.496e-06  4.496e-06   0.00%
MLMG::mgVcycle_up::1                              9  3.287e-06  3.287e-06  3.287e-06   0.00%
MLMG::mgVcycle_up::4                              9  2.472e-06  2.472e-06  2.472e-06   0.00%
MLMG::mgVcycle_up::3                              9  2.462e-06  2.462e-06  2.462e-06   0.00%
MLMG::mgVcycle_up::2                              9  2.446e-06  2.446e-06  2.446e-06   0.00%
MLMG::computeMLResidual()                         1   7.52e-07   7.52e-07   7.52e-07   0.00%
Other                                          1448   0.009917   0.009917   0.009917   0.68%
--------------------------------------------------------------------------------------------

Before

Initializing AMReX (51bd06566543-dirty)...
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
Initializing CUDA...
CUDA initialized with 1 device.
AMReX (51bd06566543-dirty) initialized
vfrc min = 2.267635963e-07
Initial max, 1 and 2-norm residuals at level 0 = 1157920.353 1.18632832e+11 172310653.1
MLMG: # of AMR levels: 1
      # of MG levels on the coarsest AMR level: 7
MLMG: Initial rhs               = 0
MLMG: Initial residual (resid0) = 1157920.353
MLCGSolver_BiCGStab: Initial error (error0) =        0.1874375103
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 8.15655458e-05
MLMG: Iteration   1 Fine resid/resid0 = 0.01655981191
MLCGSolver_BiCGStab: Initial error (error0) =        0.07363980466
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 6.582736103e-05
MLMG: Iteration   2 Fine resid/resid0 = 0.001362809017
MLCGSolver_BiCGStab: Initial error (error0) =        0.009383728425
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 6.662239496e-05
MLMG: Iteration   3 Fine resid/resid0 = 4.951338737e-05
MLCGSolver_BiCGStab: Initial error (error0) =        0.0005125537207
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 4.707445947e-05
MLMG: Iteration   4 Fine resid/resid0 = 2.409647954e-06
MLCGSolver_BiCGStab: Initial error (error0) =        1.98425658e-05
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 4.635330225e-05
MLMG: Iteration   5 Fine resid/resid0 = 1.007922962e-07
MLCGSolver_BiCGStab: Initial error (error0) =        1.049047774e-06
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 5.403871333e-05
MLMG: Iteration   6 Fine resid/resid0 = 4.67256324e-09
MLCGSolver_BiCGStab: Initial error (error0) =        3.888555559e-08
MLCGSolver_BiCGStab: Final: Iteration    7 rel. err. 3.442317299e-05
MLMG: Iteration   7 Fine resid/resid0 = 2.013351971e-10
MLCGSolver_BiCGStab: Initial error (error0) =        1.865300846e-09
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 5.686638177e-05
MLMG: Iteration   8 Fine resid/resid0 = 9.116421951e-12
MLCGSolver_BiCGStab: Initial error (error0) =        7.540114335e-11
MLCGSolver_BiCGStab: Final: Iteration    6 rel. err. 7.540490835e-05
MLMG: Iteration   9 Fine resid/resid0 = 3.838467205e-13
MLMG: Final Iter. 9 resid, resid/resid0 = 4.444639299e-07, 3.838467205e-13
MLMG: Timers: Solve = 1.269669695 Iter = 1.24614023 Bottom = 0.013362239
Final max, 1 and 2-norm residuals at level 0 = 3.626202169e-05 0.808276678 0.0003671337233


TinyProfiler total time across processes [min...avg...max]: 1.509 ... 1.509 ... 1.509

--------------------------------------------------------------------------------------------
Name                                         NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------
MLEBABecLap::Fsmooth()                          432     0.8934     0.8934     0.8934  59.21%
MLEBABecLap::Fapply()                           175     0.1263     0.1263     0.1263   8.37%
MLEBABecLap::applyBC()                          607     0.1257     0.1257     0.1257   8.33%
main                                              1    0.04373    0.04373    0.04373   2.90%
EB2::GShopLevel()-fine                            1    0.03685    0.03685    0.03685   2.44%
FillBoundary_nowait()                           637    0.03095    0.03095    0.03095   2.05%
FabArray::setVal()                              288    0.02994    0.02994    0.02994   1.98%
FabArray::Xpay()                                 66    0.02665    0.02665    0.02665   1.77%
MLLinOp::defineGrids()                            1    0.02096    0.02096    0.02096   1.39%
MLCellLinOp::defineAuxData()                      1    0.01413    0.01413    0.01413   0.94%
FabArray::ParallelCopy_nowait()                 273    0.01274    0.01274    0.01274   0.84%
EB2::GShopLevel()-coarse                          6    0.01178    0.01178    0.01178   0.78%
amrex::Add()                                      9    0.01095    0.01095    0.01095   0.73%
MLMG::ResNormInf()                               10    0.01079    0.01079    0.01079   0.72%
MLEBABecLap::interpolation()                     54    0.01067    0.01067    0.01067   0.71%
MLMG::prepareForSolve()                           1   0.009721   0.009721   0.009721   0.64%
MLMG::mgVcycle_down::1                            9    0.00894    0.00894    0.00894   0.59%
FabArrayBase::CPC::define()                     104   0.008569   0.008569   0.008569   0.57%
MLEBABecLap::define()                             1   0.008326   0.008326   0.008326   0.55%
MLABecLaplacian::prepareForSolve()                1   0.007643   0.007643   0.007643   0.51%
MLMG::addInterpCorrection()                      54   0.007416   0.007416   0.007416   0.49%
BndryData::define()                               1   0.005828   0.005828   0.005828   0.39%
MLMG::mgVcycle_down::0                            9    0.00559    0.00559    0.00559   0.37%
FabArrayBase::FB::FB()                           67   0.005561   0.005561   0.005561   0.37%
amrex::Copy()                                    33   0.005037   0.005037   0.005037   0.33%
FabArray::norminf()                             120   0.003387   0.003387   0.003387   0.22%
amrex::Dot()                                    218   0.003202   0.003202   0.003202   0.21%
MLCGSolver::bicgstab                              9   0.002576   0.002576   0.002576   0.17%
MultiFab::Subtract()                              2    0.00236    0.00236    0.00236   0.16%
MLMG::mgVcycle_down::2                            9   0.002014   0.002014   0.002014   0.13%
MLMG::apply()                                     2   0.001973   0.001973   0.001973   0.13%
FabArray::BuildMask()                             7   0.000967   0.000967   0.000967   0.06%
MLMG::mgVcycle_down::3                            9  0.0006366  0.0006366  0.0006366   0.04%
FabArrayBase::getCPC()                          195    0.00018    0.00018    0.00018   0.01%
MLMG::solve()                                     1  0.0001654  0.0001654  0.0001654   0.01%
MLMG::mgVcycle_down::4                            9  0.0001463  0.0001463  0.0001463   0.01%
FabArrayBase::getFB()                           644  0.0001388  0.0001388  0.0001388   0.01%
FabArray::FillBoundary()                        637  0.0001362  0.0001362  0.0001362   0.01%
MLMG::mgVcycle_down::5                            9  0.0001339  0.0001339  0.0001339   0.01%
MLCellLinOp::smooth()                           117  9.462e-05  9.462e-05  9.462e-05   0.01%
FabArray::ParallelCopy()                        273  6.667e-05  6.667e-05  6.667e-05   0.00%
MLEBABecLap::apply()                            175  4.681e-05  4.681e-05  4.681e-05   0.00%
EB2::Initialize()                                 1  4.624e-05  4.624e-05  4.624e-05   0.00%
MLCellLinOp::defineBC()                           1   3.14e-05   3.14e-05   3.14e-05   0.00%
MLMG::mgVcycle()                                  9  2.969e-05  2.969e-05  2.969e-05   0.00%
MLCellLinOp::correctionResidual()                54  2.852e-05  2.852e-05  2.852e-05   0.00%
MLMG::actualBottomSolve()                         9  2.748e-05  2.748e-05  2.748e-05   0.00%
MLMG:computeResOfCorrection()                    54  2.373e-05  2.373e-05  2.373e-05   0.00%
MLCellLinOp::solutionResidual()                  12  1.617e-05  1.617e-05  1.617e-05   0.00%
MLMG::oneIter()                                   9  1.596e-05  1.596e-05  1.596e-05   0.00%
MLMG::computeResidual()                           9  8.129e-06  8.129e-06  8.129e-06   0.00%
MLMG::mgVcycle_up::0                              9  7.414e-06  7.414e-06  7.414e-06   0.00%
MLMG::mgVcycle_bottom                             9  5.569e-06  5.569e-06  5.569e-06   0.00%
MLLinOp::define()                                 1  4.885e-06  4.885e-06  4.885e-06   0.00%
MLMG::mgVcycle_up::1                              9  4.365e-06  4.365e-06  4.365e-06   0.00%
MLMG::mgVcycle_up::5                              9  4.163e-06  4.163e-06  4.163e-06   0.00%
MLMG::mgVcycle_up::3                              9   2.71e-06   2.71e-06   2.71e-06   0.00%
MLMG::mgVcycle_up::2                              9  2.572e-06  2.572e-06  2.572e-06   0.00%
MLMG::mgVcycle_up::4                              9  2.483e-06  2.483e-06  2.483e-06   0.00%
MLMG::computeMLResidual()                         1  1.267e-06  1.267e-06  1.267e-06   0.00%
Other                                          1449    0.01214    0.01214    0.01214   0.80%
--------------------------------------------------------------------------------------------

@ankithadas ankithadas marked this pull request as ready for review January 8, 2026 13:24
@WeiqunZhang
Member

/run-hpsf-gitlab-ci

@github-actions

github-actions bot commented Jan 8, 2026

@amrex-gitlab-ci-reporter

GitLab CI 1380239 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1380239.

@ankithadas
Contributor Author

ankithadas commented Jan 8, 2026

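It looks like this might be a very small performance improvement at best. Adding Gpu::streamSynchronize() after amrex::ParallelFor(tags... shows the actual cost of applyBC(): without the synchronization the kernel launch is asynchronous, so the profiler region mostly measures the launch rather than the work.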
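
The effect of the synchronization on the measured time can be sketched without a GPU. The example below is only an analogy: a worker thread launched with `std::async` stands in for an asynchronous kernel, and waiting on the future stands in for a stream synchronize. `Timings` and `time_async_work` are hypothetical names for this sketch.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Elapsed wall time in milliseconds, measured at two points.
struct Timings {
    double launch_only_ms;  // timer stopped right after the launch returns
    double with_sync_ms;    // timer stopped after waiting for the work to finish
};

// Launches work_ms milliseconds of asynchronous "work" and times it both ways.
inline Timings time_async_work(int work_ms) {
    assert(work_ms >= 0);
    using clock = std::chrono::steady_clock;
    auto to_ms = [](clock::duration d) {
        return std::chrono::duration<double, std::milli>(d).count();
    };

    const auto t0 = clock::now();
    // The launch returns immediately; the work runs on another thread.
    std::future<void> pending = std::async(std::launch::async, [work_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
    });
    const auto t1 = clock::now();  // what an unsynchronized timer region sees

    pending.wait();                // analogous to synchronizing the stream
    const auto t2 = clock::now();  // what a synchronized timer region sees

    return Timings{to_ms(t1 - t0), to_ms(t2 - t0)};
}
```

Calling `time_async_work(20)` should report a `with_sync_ms` of at least about 20 ms, while `launch_only_ms` is usually far smaller because it covers only the launch.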

MLMG: Iteration   9 Fine resid/resid0 = 3.839056296e-13
MLMG: Final Iter. 9 resid, resid/resid0 = 4.44532142e-07, 3.839056296e-13
MLMG: Timers: Solve = 1.223781896 Iter = 1.200463613 Bottom = 0.013117793
Final max, 1 and 2-norm residuals at level 0 = 4.136757882e-05 0.8063068762 0.0003748244268


TinyProfiler total time across processes [min...avg...max]: 1.462 ... 1.462 ... 1.462

--------------------------------------------------------------------------------------------
Name                                         NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------
MLEBABecLap::Fsmooth()                          432     0.8718     0.8718     0.8718  59.64%
MLEBABecLap::Fapply()                           175     0.1231     0.1231     0.1231   8.42%
MLEBABecLap::applyBC()                          607     0.1049     0.1049     0.1049   7.18%
main                                              1    0.04349    0.04349    0.04349   2.97%
EB2::GShopLevel()-fine                            1    0.03701    0.03701    0.03701   2.53%

@AlexanderSinn
Member

The total runtime is faster by 3.7%. I think that's quite decent if repeatable.

@AlexanderSinn
Member

You can benchmark both versions with tiny_profiler.device_synchronize_around_region = 1, which adds stream synchronization around all profiler sections.

@ankithadas
Contributor Author

I could rewrite this without using goto statements.

@WeiqunZhang
Member

Sure. That would be better. Thanks!

@WeiqunZhang
Member

This is great! Thanks!

I did a test using 4 AMD GPUs on a Frontier node with 512^3 cells. The applyBC time went down from 0.157 to 0.118. The total MLMG::solve time went down from 0.679 to 0.639, consistent with the performance improvement of applyBC. I used tiny_profiler.device_synchronize_around_region = 1.
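
The consistency claim can be checked with a few lines of arithmetic. This sketch uses only the four timings quoted above, in whatever unit they were reported (presumably seconds, which is an assumption). `Delta`, `improvement`, and `roughly_equal` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>

// Absolute and relative reduction between two timings.
struct Delta {
    double absolute;  // before - after
    double relative;  // (before - after) / before
};

inline Delta improvement(double before, double after) {
    assert(before > 0.0);
    return Delta{before - after, (before - after) / before};
}

// True when a and b differ by at most tol.
inline bool roughly_equal(double a, double b, double tol) {
    return std::fabs(a - b) <= tol;
}
```

`improvement(0.157, 0.118)` gives an absolute reduction of 0.039 (about 25%) for applyBC, and `improvement(0.679, 0.639)` gives 0.040 (about 6%) for MLMG::solve. The two absolute reductions agree to within 0.001, which is what one would expect if the applyBC change accounts for the whole solve-time gain.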

@ankithadas
Contributor Author

Oh wow that's great. In that case, I will update MLCellLinOp::applyBC() as well. That said, it does raise a concern about the cost of setting up tags. I had assumed this overhead would be quite small.

@WeiqunZhang
Member

Please do MLCellLinOp::applyBC() in a different PR.

@ankithadas
Contributor Author

@WeiqunZhang Is there a specific reason for iterating over idim and explicitly setting low and high sides, instead of directly looping over the Orientations?

@WeiqunZhang
Member

I don't think there are any reasons other than personal taste.
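
For readers unfamiliar with the two loop styles being compared, here is a small self-contained sketch showing that they visit the same set of faces. The face numbering used here (low faces first, then high faces) is an assumption made for illustration, and `kDim`, `face_index`, and the other helpers are hypothetical names, not AMReX's `Orientation` API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kDim = 3;  // hypothetical number of spatial dimensions

// Assumed face numbering: low faces 0..kDim-1, high faces kDim..2*kDim-1.
inline int face_index(int idim, bool high) {
    assert(idim >= 0 && idim < kDim);
    return (high ? kDim : 0) + idim;
}

// Style 1: loop over dimensions and handle the low and high sides explicitly.
inline std::vector<int> faces_by_dimension() {
    std::vector<int> visited;
    for (int idim = 0; idim < kDim; ++idim) {
        visited.push_back(face_index(idim, false));  // low side
        visited.push_back(face_index(idim, true));   // high side
    }
    return visited;
}

// Style 2: loop directly over all 2*kDim faces.
inline std::vector<int> faces_by_orientation() {
    std::vector<int> visited;
    for (int face = 0; face < 2 * kDim; ++face) {
        visited.push_back(face);
    }
    return visited;
}

// True when both styles visit exactly the same faces, ignoring order.
inline bool same_faces() {
    std::vector<int> a = faces_by_dimension();
    std::vector<int> b = faces_by_orientation();
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return a == b;
}
```

The two loops differ only in visiting order, so for work that does not depend on the order of faces the choice between them is a matter of style.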

@WeiqunZhang
Member

> I could rewrite this without using goto statements.

Do you plan to rewrite it? Either way is fine with me.

@WeiqunZhang WeiqunZhang merged commit 443bf9e into AMReX-Codes:development Jan 14, 2026
88 of 89 checks passed
@ankithadas ankithadas deleted the MLEBABecLap-Reuse-Tags branch January 14, 2026 04:13
WeiqunZhang added a commit that referenced this pull request Jan 19, 2026
Same optimization as implemented in #4882.

---------

Co-authored-by: Ankith A Das <[email protected]>
Co-authored-by: Weiqun Zhang <[email protected]>
3 participants