Start HIP explicitly before MPI/AMReX initialization #574

Merged
baperry2 merged 2 commits into AMReX-Combustion:development from bssoriano:init_hip
Oct 6, 2025

@bssoriano
Contributor

Description

When running PeleLMeX on Frontier at large scale (>256 nodes), some jobs failed during MPI initialization with errors such as:

srun: error: frontier02517: task 1529: Segmentation fault (core dumped)
srun: Terminating StepId=3784394.0

or

inet_recv: unexpected socket EOF on frontier0XXXX
_pmi_network_allgather:_pmi_inet_recv from target failed
Fatal error in PMPI_Init: Other MPI error, PMI_Allgather failed: -1

The failures were intermittent at small scale but consistent at 256+ nodes. HIP was being initialized implicitly (lazily, on first use) and concurrently across all ranks, either after MPI_Init or during early GPU-aware MPI setup. At very large node counts, this appears to have caused contention and race conditions in ROCm’s SDMA and HSA bring-up routines.

Fix

An explicit call to

hipInit(0);

is now issued before MPI or AMReX initialization. This ensures that the HIP runtime and HSA driver are fully brought up on each rank before MPI communication setup begins, preventing concurrent lazy initialization of HIP contexts across thousands of ranks.
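The ordering described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the macros `SKETCH_USE_HIP` and `SKETCH_USE_MPI` and the helpers `init_hip_early` and `start_up` are hypothetical names introduced here, and the placement of `amrex::Initialize` is shown only as a comment. Without either macro defined, the sketch compiles as plain C++ and simply records the intended order.

```cpp
#include <cstdio>
#include <string>
#include <vector>

#ifdef SKETCH_USE_HIP  // hypothetical flag; a real build would use its own HIP macro
#include <hip/hip_runtime.h>
#endif
#ifdef SKETCH_USE_MPI  // hypothetical flag for this sketch
#include <mpi.h>
#endif

// Force HIP runtime bring-up on this rank instead of relying on lazy
// initialization. Returns false only if hipInit reports an error.
bool init_hip_early() {
#ifdef SKETCH_USE_HIP
    const hipError_t err = hipInit(0);  // the flags argument must be 0
    if (err != hipSuccess) {
        std::fprintf(stderr, "hipInit failed: %s\n", hipGetErrorString(err));
        return false;
    }
#endif
    return true;  // nothing to do in a non-HIP build
}

// Run startup in the order described above and record each completed step.
std::vector<std::string> start_up(int* argc, char*** argv) {
    std::vector<std::string> steps;
    if (!init_hip_early()) {
        return steps;  // stop before touching MPI
    }
    steps.push_back("hip");
#ifdef SKETCH_USE_MPI
    if (MPI_Init(argc, argv) != MPI_SUCCESS) {
        return steps;
    }
#else
    (void)argc;  // unused in a non-MPI build
    (void)argv;
#endif
    steps.push_back("mpi");
    // In an AMReX application, amrex::Initialize(*argc, *argv) would follow here.
    return steps;
}
```

Calling `start_up(&argc, &argv)` at the top of `main` would keep the HIP bring-up ahead of any MPI communication setup. Checking the return value of `hipInit` is a choice made for this sketch; the PR text only states that the call is issued.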

Validation

The fix was tested on Frontier (OLCF) for 8 separate runs at 512 nodes.
All runs completed successfully without the previous startup crashes.

This avoids potential SDMA or HSA initialization races observed on large Frontier runs.
@bssoriano bssoriano requested a review from baperry2 October 6, 2025 18:03
@baperry2 baperry2 requested a review from jrood-nrel October 6, 2025 18:45
@jrood-nrel
Contributor

@baperry2
Collaborator

baperry2 commented Oct 6, 2025

Thanks @bssoriano!

@baperry2 baperry2 enabled auto-merge (squash) October 6, 2025 19:27
@baperry2 baperry2 merged commit 69322dc into AMReX-Combustion:development Oct 6, 2025
24 checks passed
3 participants