Vader BTL segfaults, sm okay #5375
Description
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Open MPI 3.1.0
Also tested Open MPI 2.1.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Using Linuxbrew with brew install open-mpi
See https://github.com/Linuxbrew/homebrew-core/blob/master/Formula/open-mpi.rb
Please describe the system on which you are running
- Operating system/version: CentOS Linux release 7.1.1503 (Core)
- Computer hardware: Intel(R) Xeon(R) CPU E7-8867 v3 @ 2.50GHz
- Network type: Shared memory
Details of the problem
I'm using Open MPI to assemble a human genome with ABySS 2.1.0.
With Open MPI 3.1.0 and the vader BTL, it crashes with a segfault in libc.so.6, called from mca_btl_vader.so.
With Open MPI 2.1.3 and the vader BTL, it also crashes with a segfault.
With Open MPI 2.1.3 and the sm BTL, it succeeds.
The segfault does not occur at a consistent time, but it occurs somewhere between a couple of hours and a day into the run. The job did not complete successfully in any of ten attempts.
The backtrace with Open MPI 3.1.0 is…
[hpce705:162958] *** Process received signal ***
[hpce705:162958] Signal: Segmentation fault (11)
[hpce705:162958] Signal code: (128)
[hpce705:162958] Failing at address: (nil)
[hpce705:162958] [ 0] /gsc/btl/linuxbrew/lib/libc.so.6(+0x33070)[0x7f7b4c627070]
[hpce705:162958] [ 1] /gsc/btl/linuxbrew/Cellar/open-mpi/3.1.0/lib/openmpi/mca_btl_vader.so(+0x4bde)[0x7f7b40b8fbde]
[hpce705:162958] [ 2] /gsc/btl/linuxbrew/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f7b4be5f60c]
[hpce705:162958] [ 3] /gsc/btl/linuxbrew/lib/libmpi.so.40(PMPI_Request_get_status+0x74)[0x7f7b4d2abb54]
[hpce705:162958] [ 4] ABYSS-P[0x40dcec]
[hpce705:162958] [ 5] ABYSS-P[0x40df24]
[hpce705:162958] [ 6] ABYSS-P[0x40f384]
[hpce705:162958] [ 7] ABYSS-P[0x4148a7]
[hpce705:162958] [ 8] ABYSS-P[0x416c12]
[hpce705:162958] [ 9] ABYSS-P[0x4066aa]
[hpce705:162958] [10] /gsc/btl/linuxbrew/lib/libc.so.6(__libc_start_main+0xf5)[0x7f7b4c614825]
[hpce705:162958] [11] ABYSS-P[0x407d39]
[hpce705:162958] *** End of error message ***
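Most frames in this backtrace are unsymbolized, so a natural next step is mapping the raw offsets back to function names. The following is a minimal sketch of how that could be scripted, not something I have run against these binaries: it parses the frame lines above and prints `addr2line` commands for the frames that carry no symbol name. The helper names `parse_frames` and `addr2line_commands` are made up for this illustration. It assumes the libraries were built with debug info (otherwise `addr2line` prints `??`), that `ABYSS-P` is a non-PIE executable so its absolute addresses can be passed directly, and that `ABYSS-P` is replaced by the full path to the binary when the commands are run.

```python
import re

# Matches frame lines such as:
#   [host:pid] [ 1] /path/lib.so(+0x4bde)[0x7f7b40b8fbde]
#   [host:pid] [ 2] /path/lib.so(opal_progress+0x2c)[0x7f7b4be5f60c]
#   [host:pid] [ 4] ABYSS-P[0x40dcec]
FRAME = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"                                  # frame index
    r"(?P<module>[^\s(\[]+)"                                     # library path or executable
    r"(?:\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"   # optional (symbol+offset)
    r"\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"                       # runtime address
)


def parse_frames(backtrace):
    """Return one dict per stack frame found in the backtrace text."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME.search(line)
        if match:
            frames.append(match.groupdict())
    return frames


def addr2line_commands(frames):
    """Build addr2line commands for frames that have no symbol name."""
    commands = []
    for frame in frames:
        if frame["symbol"]:
            continue  # the backtrace already names this function
        # Shared libraries: use the offset relative to the load base.
        # Executable: use the absolute address (ASSUMPTION: non-PIE binary).
        target = frame["offset"] or frame["address"]
        commands.append(f"addr2line -f -C -e {frame['module']} {target}")
    return commands


# A few lines copied from the backtrace in this report.
SAMPLE = """\
[hpce705:162958] Signal: Segmentation fault (11)
[hpce705:162958] [ 1] /gsc/btl/linuxbrew/Cellar/open-mpi/3.1.0/lib/openmpi/mca_btl_vader.so(+0x4bde)[0x7f7b40b8fbde]
[hpce705:162958] [ 2] /gsc/btl/linuxbrew/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f7b4be5f60c]
[hpce705:162958] [ 4] ABYSS-P[0x40dcec]
"""

if __name__ == "__main__":
    for command in addr2line_commands(parse_frames(SAMPLE)):
        print(command)
```

Frames that already show a symbol, such as `opal_progress+0x2c`, are skipped because the function name is known; the frame of most interest is the one inside `mca_btl_vader.so`.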
The failed mpirun command is…
mpirun -np 48 ABYSS-P …
The successful mpirun command is…
mpirun -np 48 --mca btl self,sm ABYSS-P …
Please let me know what further information I can provide to help troubleshoot. If it would help you to reproduce the issue locally, I can provide you with data and a command line to reproduce the issue.
This issue was originally reported by @jdmontenegro, and I reproduced it at a different site.
The downstream ABySS issue is at BirolLab/abyss#236
cc @benvvalk