Vader BTL segfaults, sm okay #5375

@sjackman

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI 3.1.0
Also tested Open MPI 2.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Using Linuxbrew with brew install open-mpi
See https://github.com/Linuxbrew/homebrew-core/blob/master/Formula/open-mpi.rb

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.1.1503 (Core)
  • Computer hardware: Intel(R) Xeon(R) CPU E7-8867 v3 @ 2.50GHz
  • Network type: Shared memory

Details of the problem

I'm using Open MPI to assemble a human genome with ABySS 2.1.0.
With Open MPI 3.1.0 and the vader BTL, it crashes with a segfault in libc.so.6 called from mca_btl_vader.so.
With Open MPI 2.1.3 and the vader BTL, it also crashes with a segfault.
With Open MPI 2.1.3 and the sm BTL, it succeeds.
The segfault does not occur at a consistent time, but somewhere between a couple of hours and a day into the run. The job did not complete successfully in any of ten attempts.

The backtrace with Open MPI 3.1.0 is…

[hpce705:162958] *** Process received signal ***
[hpce705:162958] Signal: Segmentation fault (11)
[hpce705:162958] Signal code:  (128)
[hpce705:162958] Failing at address: (nil)
[hpce705:162958] [ 0] /gsc/btl/linuxbrew/lib/libc.so.6(+0x33070)[0x7f7b4c627070]
[hpce705:162958] [ 1] /gsc/btl/linuxbrew/Cellar/open-mpi/3.1.0/lib/openmpi/mca_btl_vader.so(+0x4bde)[0x7f7b40b8fbde]
[hpce705:162958] [ 2] /gsc/btl/linuxbrew/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f7b4be5f60c]
[hpce705:162958] [ 3] /gsc/btl/linuxbrew/lib/libmpi.so.40(PMPI_Request_get_status+0x74)[0x7f7b4d2abb54]
[hpce705:162958] [ 4] ABYSS-P[0x40dcec]
[hpce705:162958] [ 5] ABYSS-P[0x40df24]
[hpce705:162958] [ 6] ABYSS-P[0x40f384]
[hpce705:162958] [ 7] ABYSS-P[0x4148a7]
[hpce705:162958] [ 8] ABYSS-P[0x416c12]
[hpce705:162958] [ 9] ABYSS-P[0x4066aa]
[hpce705:162958] [10] /gsc/btl/linuxbrew/lib/libc.so.6(__libc_start_main+0xf5)[0x7f7b4c614825]
[hpce705:162958] [11] ABYSS-P[0x407d39]
[hpce705:162958] *** End of error message ***
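
In case it helps with triage, the module and offset of each frame can be pulled out of a backtrace like the one above and handed to `addr2line` from binutils. The following is only a minimal sketch: `parse_frames` and `addr2line_command` are hypothetical helper names (not part of Open MPI or ABySS), the regular expression assumes the frame layout shown in this trace, and using the absolute address for `ABYSS-P` assumes it is a non-PIE executable. Useful output also requires that the binaries carry debug information.

```python
import re

# Frame layouts seen in the trace (assumed, based on the output above):
#   [host:pid] [ 1] /path/mca_btl_vader.so(+0x4bde)[0x7f7b40b8fbde]
#   [host:pid] [ 2] /path/libopen-pal.so.40(opal_progress+0x2c)[0x7f...]
#   [host:pid] [ 4] ABYSS-P[0x40dcec]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"                                 # frame number
    r"(?P<module>[^\s(\[]+)"                                    # binary or library
    r"(?:\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"  # optional (symbol+offset)
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"                         # absolute address
)


def parse_frames(trace):
    """Return one dict per stack frame found in the backtrace text."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line.strip())
        if match:
            frames.append({
                "index": int(match["index"]),
                "module": match["module"],
                "symbol": match["symbol"] or None,
                "offset": match["offset"],
                "address": match["address"],
            })
    return frames


def addr2line_command(frame):
    """Build an addr2line command line for a frame, or None if not applicable.

    Frames with a bare module-relative offset, such as (+0x4bde), use that
    offset. Frames with no offset at all (the main executable) use the
    absolute address, which is only meaningful for a non-PIE binary.
    Frames given as symbol+offset are skipped, because the offset is
    relative to the symbol rather than to the module.
    """
    if frame["symbol"] is not None:
        return None
    location = frame["offset"] or frame["address"]
    return f"addr2line -f -C -e {frame['module']} {location}"
```

For the vader frame above, `addr2line_command` builds a command that looks up offset `0x4bde` in `mca_btl_vader.so`.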

The failed mpirun command is…

mpirun -np 48 ABYSS-P …

The successful mpirun command is…

mpirun -np 48 --mca btl self,sm ABYSS-P …

Please let me know what further information I can provide to help troubleshoot. If it would help you to reproduce the issue locally, I can provide you with data and a command line to reproduce the issue.

This issue was originally reported by @jdmontenegro, and I reproduced it at a different site.
The downstream ABySS issue is at BirolLab/abyss#236
cc @benvvalk
