SLURM integration appears broken #4338
Closed
Labels
RTE (issue likely is in RTE or PMIx areas), Target: main, Target: v3.0.x, Target: v3.1.x, bug
Description
I have checked master, v3.1.x, and v3.0.x; only master appears to be broken, though the reason isn't obvious, as the failing code doesn't appear to differ on master. I definitely configured with --with-pmi and confirmed that the s1 and s2 components were built. I also set pmix_base_verbose and confirmed that the proper component is being selected.
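For reference, the component-selection check described above can be reproduced with something like the following (the verbosity level and binary name are illustrative; adjust for your environment):

```shell
# Ask the pmix framework to log component selection so we can see
# whether s1, s2, or pmix is chosen (pmix_base_verbose is a real
# Open MPI MCA parameter; 10 is just a reasonably chatty level).
export OMPI_MCA_pmix_base_verbose=10

# Direct-launch under SLURM with the PMI-2 plugin, as in the report.
srun -N 2 --mpi=pmi2 ./ring
```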
Here is the error:
$ srun -N 2 --mpi=pmi2 ./ring
[rhc001:111766] PMI_Get_universe_size [pmix_s2.c:288:s2_init]: Operation failed
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[rhc001:111766] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: rhc001: task 0: Exited with exit code 1
[rhc002.cluster:27298] PMI_Get_universe_size [pmix_s2.c:288:s2_init]: Operation failed
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[rhc002.cluster:27298] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: rhc002: task 1: Exited with exit code 1
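The help message above describes two build paths; a rough sketch of each (install prefixes are placeholders, not paths from this report):

```shell
# SLURM 16.05 or later: build SLURM itself with PMIx support
# (run from the SLURM source tree; /opt/pmix is a hypothetical prefix).
./configure --with-pmix=/opt/pmix && make && make install

# Earlier SLURM versions: build Open MPI against SLURM's PMI-1/PMI-2
# library (run from the Open MPI source tree; /opt/slurm is hypothetical).
./configure --with-pmi=/opt/slurm && make && make install
```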
@artpol84 Can you folks please take a look?