
Improve out of memory error message#4831

Merged
WeiqunZhang merged 6 commits into AMReX-Codes:development from AlexanderSinn:Improve_out_of_memory_error_message
Dec 7, 2025

@AlexanderSinn
Member

Summary

This PR adds extra information to the abort message that is printed when an AMReX Arena runs out of memory: the memory type, the number of bytes requested, the MPI rank, the allocation function name and error code, the TinyProfiler call stack, the TinyProfiler-profiled memory usage, and the total usage of all memory arenas. The usage information is printed for all memory types, as this is easier and might be useful when multiple arenas share the same memory type.

To support this, a PrintMemoryUsage function is added to TinyProfiler; it can also be called externally. If some memory is still allocated when the usage is printed, the number of frees and the bytes currently in use are printed as well. Otherwise, both metrics are omitted from the output. The same applies to the output when a program finishes normally, where previously the number of frees was always printed.
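
As an illustration of the column rule described above, here is a minimal standalone sketch. It does not use the AMReX API: `MemStat`, `any_live` and `format_usage` are hypothetical names, the output is space-separated rather than aligned, and the condition "any row still holds bytes" is an assumption about how the rule is applied per table.

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical per-region statistics (not the TinyProfiler data structure).
struct MemStat {
    std::string name;
    long nalloc;                  // number of allocations
    long nfree;                   // number of frees
    std::uint64_t current_bytes;  // bytes still allocated right now
};

// Assumption: the extra columns are shown if any row still holds memory.
bool any_live(const std::vector<MemStat>& stats) {
    return std::any_of(stats.begin(), stats.end(),
                       [](const MemStat& s) { return s.current_bytes != 0; });
}

// Build a simple table; Nfree and CurrentMem appear only when needed.
std::string format_usage(const std::vector<MemStat>& stats) {
    const bool show_live = any_live(stats);
    std::ostringstream os;
    os << "Name Nalloc";
    if (show_live) { os << " Nfree CurrentMem"; }
    os << '\n';
    for (const MemStat& s : stats) {
        os << s.name << ' ' << s.nalloc;
        if (show_live) { os << ' ' << s.nfree << ' ' << s.current_bytes; }
        os << '\n';
    }
    return os.str();
}
```

With a table in which every region has freed its memory, the sketch prints only `Name` and `Nalloc`; as soon as one region has a nonzero `current_bytes`, the two extra columns appear for all rows of that table.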

Example abort message (with syntax highlighting added):

00:00:23 Rank 0 started step 4 at time = 79.4766895 with dt = 19.83517343
00:00:23 Rank 1 started step 5 at time = 99.31186293 with dt = 19.80132949
00:00:23 Rank 2 started step 6 at time = 119.1131924 with dt = 19.76826784
00:00:23 Rank 3 started step 7 at time = 138.8814603 with dt = 19.73563175
amrex::Abort::0::Arena out of memory!!!
Bytes to allocate: 1449379584 (1382 MiB)
Memory type: GPU device memory
MPI rank: 0
Error: cudaMalloc returned 2: out of memory


TinyProfiler call stack:

===== TinyProfilers ======
main()
Hipace::Evolve()
Hipace::SolveOneSlice()
MultiBuffer::put_data()


TinyProfiler memory usage so far:

Device Memory Usage:
---------------------------------------------------------------------------------------
Name                                      Nalloc  Nfree    AvgMem    MaxMem  CurrentMem
---------------------------------------------------------------------------------------
MultiBuffer::put_data()                     2285   1535    24 GiB    34 GiB      32 GiB
The_Arena::Initialize()                        1      1  4563 KiB    29 GiB       0   B
PlasmaParticleContainer::InitParticles()      52     39  1596 MiB  1631 MiB    1599 MiB
Fields::AllocData()                            1      0   673 MiB   673 MiB     673 MiB
BeamParticleContainer::resize()             3300   3280   340 MiB   439 MiB     417 MiB
Hipace::ExplicitMGSolveBxBy()                 40      0   234 MiB   235 MiB     235 MiB
FFTPoissonSolverDirichletFast::define()        7      0   159 MiB   160 MiB     160 MiB
ResizeRandomSeed                               1      0    40 MiB    40 MiB      40 MiB
AdaptiveTimeStep::GatherMinUzSlice()        1385   1385  1444   B   864 KiB       0   B
shiftSlippedParticles()                     5094   5094   282   B   478 KiB       0   B
hpmg::MultiGrid::solve1()                   5513   5513    40 KiB   432 KiB       0   B
MultiBuffer::get_data()                      767    767    74   B   108 KiB       0   B
Hipace::InitData()                            12      0   463   B   464   B     464   B
main()                                        13      0   399   B   400   B     400   B
Hipace::Evolve()                               1      0    31   B    32   B      32   B
DepositCurrent_PlasmaParticleContainer()    1387   1387     2   B    16   B       0   B
---------------------------------------------------------------------------------------

Managed Memory Usage:
---------------------------------------------------------
Name                             Nalloc  AvgMem    MaxMem
---------------------------------------------------------
The_Managed_Arena::Initialize()       1  21   B  8192 KiB
---------------------------------------------------------

Pinned Memory Usage:
---------------------------------------------------------------------------------------
Name                                      Nalloc  Nfree    AvgMem    MaxMem  CurrentMem
---------------------------------------------------------------------------------------
The_Pinned_Arena::Initialize()                 1      1  1675   B  8192 KiB       0   B
Hipace::InitData()                            18      5    17 KiB    17 KiB      17 KiB
Hipace::ExplicitMGSolveBxBy()                  1      0  1083   B  1088   B    1088   B
main()                                        37     24   399   B   400   B     400   B
AdaptiveTimeStep::GatherMinUzSlice()        1385   1385     0   B    32   B       0   B
Hipace::Evolve()                               1      0    31   B    32   B      32   B
MultiBuffer::get_data()                      767    767     0   B    16   B       0   B
PlasmaParticleContainer::InitParticles()       2      2     0   B    16   B       0   B
hpmg::MultiGrid::solve1()                   5513   5513     2   B    16   B       0   B
shiftSlippedParticles()                     2304   2304     0   B    16   B       0   B
---------------------------------------------------------------------------------------

Comms Memory Usage:
--------------------------------------------------------
Name                           Nalloc   AvgMem    MaxMem
--------------------------------------------------------
The_Comms_Arena::Initialize()       1  124   B  8192 KiB
--------------------------------------------------------


AMReX Arena usage so far:

Total GPU global memory (MB): 40441
Free  GPU global memory (MB): 431
[The         Arena] space allocated (MB): 38299
[The         Arena] space used      (MB): 36058
[The         Arena]: 90 allocs, 857 busy blocks, 227 free blocks
[The Managed Arena] space allocated (MB): 8
[The Managed Arena] space used      (MB): 0
[The Managed Arena]: 1 allocs, 0 busy blocks, 1 free blocks
[The  Pinned Arena] space allocated (MB): 8
[The  Pinned Arena] space used      (MB): 0
[The  Pinned Arena]: 1 allocs, 28 busy blocks, 1 free blocks
[The   Comms Arena] space allocated (MB): 8
[The   Comms Arena] space used      (MB): 0
[The   Comms Arena]: 1 allocs, 0 busy blocks, 1 free blocks


Out of memory, see message above !!!
SIGABRT
See Backtrace.0 file for details
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
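
For readers who want to see how the first lines of such a message fit together, here is a minimal standalone sketch. It is not the code from this PR: `OomInfo` and `format_oom_header` are hypothetical names, and truncating the byte count to whole MiB is an assumption that happens to match the example above.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical description of a failed allocation (illustrative fields only).
struct OomInfo {
    std::uint64_t bytes;         // size of the failed request
    std::string memory_type;     // e.g. "GPU device memory"
    int mpi_rank;                // rank that ran out of memory
    std::string alloc_function;  // e.g. "cudaMalloc"
    int error_code;              // code returned by the allocation function
    std::string error_string;    // human-readable form of the error code
};

// Assemble the header lines; the MiB value is truncated (assumption).
std::string format_oom_header(const OomInfo& info) {
    std::ostringstream os;
    os << "Arena out of memory!!!\n"
       << "Bytes to allocate: " << info.bytes
       << " (" << (info.bytes >> 20) << " MiB)\n"
       << "Memory type: " << info.memory_type << '\n'
       << "MPI rank: " << info.mpi_rank << '\n'
       << "Error: " << info.alloc_function << " returned "
       << info.error_code << ": " << info.error_string << '\n';
    return os.str();
}
```

Feeding in the values from the example (1449379584 bytes, rank 0, `cudaMalloc` returning 2) reproduces the "Bytes to allocate", "Memory type", "MPI rank" and "Error" lines shown above.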

Additional background

Technically, I think calling TinyProfiler::PrintCallStack from a non-master thread is not thread-safe, but this is hard to fix and is unlikely to cause problems.

Should more information be added to the abort message? Maybe there is something from the "full" profiler?

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@WeiqunZhang
Member

/run-hpsf-gitlab-ci

@github-actions

github-actions bot commented Dec 4, 2025

@amrex-gitlab-ci-reporter

GitLab CI 1336425 finished with status: failed. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1336425.

@WeiqunZhang WeiqunZhang merged commit 68fdcf4 into AMReX-Codes:development Dec 7, 2025
73 checks passed