Issue description
I am trying to run distributed PyTorch using the MPI backend. In this minimal example, I am trying to broadcast a tensor from node 0 to node 1.
The example works fine when the tensor lives on the CPU; however, when exchanging the tensor between GPUs, I get a segmentation fault. Please see below.
Code example
import torch.distributed as dist
import torch

def debug_fun():
    torch.manual_seed(dist.get_rank())
    device = "cuda:{}".format(dist.get_rank())
    # device = "cpu"
    param = torch.FloatTensor(1, 1).to(device)
    param.uniform_()
    print("Rank {}, before sync: {} at {}".format(dist.get_rank(), param.cpu().data.numpy(), param.device))
    dist.broadcast(param.data, 0)
    print("Rank {}, after sync : {} at {}".format(dist.get_rank(), param.cpu().data.numpy(), param.device))

if __name__ == '__main__':
    dist.init_process_group("mpi", rank=0, world_size=0)
    debug_fun()
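As an aside, the per-rank device string above assumes one rank per GPU on a single node; on multi-node jobs the global rank can exceed the local GPU count. A common pattern (a sketch, not part of the original script; rank_to_device is a hypothetical helper) wraps the rank around the number of GPUs per node:

```python
def rank_to_device(rank, gpus_per_node):
    """Map a global process rank to a local device string, wrapping
    around when there are more ranks per node than visible GPUs."""
    if gpus_per_node <= 0:
        return "cpu"  # no GPUs visible: fall back to the CPU
    return "cuda:{}".format(rank % gpus_per_node)
```

In the two-rank, single-node run above this reduces to the same cuda:0/cuda:1 mapping the script uses.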
When running mpirun -n 2 segfault_demo.py with the device = "cpu" line uncommented, the output is, as expected:
Rank 0, before sync: [[0.4962566]] at cpu
Rank 1, before sync: [[0.7576316]] at cpu
Rank 0, after sync : [[0.4962566]] at cpu
Rank 1, after sync : [[0.4962566]] at cpu
However, when moving the tensor to the respective GPU, I get the following error message (the home directory path has been replaced):
Rank 0, before sync: [[0.08403993]] at cuda:0
Rank 1, before sync: [[0.29211983]] at cuda:1
[node219:15444] *** Process received signal ***
[node219:15444] Signal: Segmentation fault (11)
[node219:15444] Signal code: Invalid permissions (2)
[node219:15444] Failing at address: 0x1050de00000
[node219:15444] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x2aaaaacde5e0]
[node219:15444] [ 1] /lib64/libc.so.6(+0x14cf13)[0x2aaaab037f13]
[node219:15444] [ 2] /HOME/lib/libopen-pal.so.40(opal_convertor_unpack+0x10a)[0x2aab47ad74fa]
[node219:15444] [ 3] /HOME/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x41f)[0x2aab6451e3ff]
[node219:15444] [ 4] /HOME/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x53b)[0x2aab630d2e5b]
[node219:15444] [ 5] /HOME/lib/libopen-pal.so.40(opal_progress+0x2c)[0x2aab47ac66dc]
[node219:15444] [ 6] /HOME/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x2aab47acd395]
[node219:15444] [ 7] Rank 0, after sync : [[0.08403993]] at cuda:0
/HOME/lib/libmpi.so.40(ompi_request_default_wait+0x1ce)[0x2aab1e45f9ce]
[node219:15444] [ 8] /HOME/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x4ee)[0x2aab1e4a903e]
[node219:15444] [ 9] /HOME/lib/libmpi.so.40(ompi_coll_base_bcast_intra_binomial+0xb7)[0x2aab1e4a94b7]
[node219:15444] [10] /HOME/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc)[0x2aab6537a8ac]
[node219:15444] [11] /HOME/lib/libmpi.so.40(MPI_Bcast+0x139)[0x2aab1e474ee9]
[node219:15444] [12] /HOME/anaconda3/envs/fbi_env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN3thd14DataChannelMPI9broadcastERN2at6TensorEji+0x15f)[0x2aaac5b7b5cf]
[node219:15444] [13] /HOME/anaconda3/envs/fbi_env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(THDBroadcast+0x28)[0x2aaac5b53958]
[node219:15444] [14] /HOME/anaconda3/envs/fbi_env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_Z20THDPModule_broadcastP7_objectS0_+0xcb)[0x2aaac59ef95b]
[node219:15444] [15] python(_PyCFunction_FastCallDict+0x91)[0x555555663921]
[node219:15444] [16] python(+0x19cdfc)[0x5555556f0dfc]
[node219:15444] [17] python(_PyEval_EvalFrameDefault+0x2fa)[0x55555571594a]
[node219:15444] [18] python(+0x196206)[0x5555556ea206]
[node219:15444] [19] python(+0x1971cf)[0x5555556eb1cf]
[node219:15444] [20] python(+0x19ced5)[0x5555556f0ed5]
[node219:15444] [21] python(_PyEval_EvalFrameDefault+0x2fa)[0x55555571594a]
[node219:15444] [22] python(+0x196f8b)[0x5555556eaf8b]
[node219:15444] [23] python(+0x19ced5)[0x5555556f0ed5]
[node219:15444] [24] python(_PyEval_EvalFrameDefault+0x2fa)[0x55555571594a]
[node219:15444] [25] python(PyEval_EvalCodeEx+0x329)[0x5555556ebcb9]
[node219:15444] [26] python(PyEval_EvalCode+0x1c)[0x5555556eca4c]
[node219:15444] [27] python(+0x214c44)[0x555555768c44]
[node219:15444] [28] python(PyRun_FileExFlags+0xa1)[0x555555769041]
[node219:15444] [29] python(PyRun_SimpleFileExFlags+0x1c4)[0x555555769244]
[node219:15444] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node node219 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Please note that the above error message contains the line [node219:15444] [ 7] Rank 0, after sync : [[0.08403993]] at cuda:0, showing that the sender (rank 0) successfully returned from the broadcast call; its print-out is interleaved with the error message from rank 1.
Furthermore, when explicitly setting device = "cuda:0" such that both processes access gpu 0, the code runs fine:
Rank 1, before sync: [[0.29211983]] at cuda:0
Rank 0, before sync: [[0.08403993]] at cuda:0
Rank 0, after sync : [[0.08403993]] at cuda:0
Rank 1, after sync : [[0.08403993]] at cuda:0
However, when explicitly setting device = "cuda:1", it consistently fails with the same error message.
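The fact that only GPU-to-GPU transfers across different devices fail often points at the Open MPI build rather than PyTorch: passing device pointers to MPI_Bcast requires a CUDA-aware Open MPI. Whether the installed build has CUDA support can be checked with ompi_info (a diagnostic suggestion, not a confirmed diagnosis of this crash):

```shell
# A CUDA-aware Open MPI build reports ":value:true" for this parameter;
# "false" means device pointers cannot be passed to MPI calls directly.
ompi_info --parsable --all 2>/dev/null | grep mpi_built_with_cuda_support:value \
  || echo "ompi_info not found or parameter missing"
```

If the build does report CUDA support, the mca_btl_smcuda.so frame in the backtrace suggests the shared-memory CUDA path is in use, and (assuming the parameter exists in this Open MPI version) disabling CUDA IPC with mpirun --mca btl_smcuda_use_cuda_ipc 0 could help rule out an inter-GPU IPC problem.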
System Info
Collecting environment information...
PyTorch version: 0.5.0a0+e186377
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: CentOS Linux release 7.4.1708 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
CMake version: version 3.11.1
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
Nvidia driver version: 384.111
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy (1.14.5)
[pip] torch (0.5.0a0+e186377)
[pip] torchvision (0.2.1)
[conda] magma-cuda90 2.3.0 1 pytorch
[conda] pytorch 0.4.0 py36hdf912b8_0
[conda] torch 0.5.0a0+04a7fc1 <pip>
[conda] torch 0.5.0a0+e186377 <pip>
[conda] torchvision 0.2.1 py36_0
In addition to the above, the cuDNN version is v7 for CUDA 9.0.
PyTorch was built from source from within anaconda3.
Furthermore:
mpirun (Open MPI) 3.1.0
nccl_2.2.13-1+cuda9.0 (I am not entirely sure this was found during the build, since using "nccl" during initialization fails). If you think this is relevant, I will recompile PyTorch and double-check.
Any help would be appreciated.
Matthias