Issue description
I am trying to run distributed PyTorch using the MPI backend. In this minimal example, I am trying to broadcast a tensor from node 0 to node 1.
The example works fine when the tensor lives on the CPU; however, when exchanging the tensor between GPUs, I get a segmentation fault. Please see below.
Code example
import torch.distributed as dist
import torch

def debug_fun():
    torch.manual_seed(dist.get_rank())
    device = "cuda:{}".format(dist.get_rank())
    # device = "cpu"
    param = torch.FloatTensor(1, 1).to(device)
    param.uniform_()
    print("Rank {}, before sync: {} at {}".format(dist.get_rank(), param.cpu().data.numpy(), param.device))
    dist.broadcast(param.data, 0)
    print("Rank {}, after sync : {} at {}".format(dist.get_rank(), param.cpu().data.numpy(), param.device))

if __name__ == '__main__':
    dist.init_process_group("mpi", rank=0, world_size=0)
    debug_fun()
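As an aside, the per-rank device string above assumes one rank per GPU on a single node; on multi-node jobs the global rank can exceed the local GPU count. A common pattern (a sketch, not part of the original script; rank_to_device is a hypothetical helper) wraps the rank around the number of GPUs per node:

```python
def rank_to_device(rank, gpus_per_node):
    """Map a global process rank to a local device string, wrapping
    around when there are more ranks per node than visible GPUs."""
    if gpus_per_node <= 0:
        return "cpu"  # no GPUs visible: fall back to the CPU
    return "cuda:{}".format(rank % gpus_per_node)
```

In the two-rank, single-node run above this reduces to the same cuda:0/cuda:1 mapping the script uses.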
When running mpirun -n 2 segfault_demo.py with the device = "cpu" line uncommented, the output is, as expected:
Rank 0, before sync: [[0.4962566]] at cpu
Rank 1, before sync: [[0.7576316]] at cpu
Rank 0, after sync : [[0.4962566]] at cpu
Rank 1, after sync : [[0.4962566]] at cpu
However, when moving the tensor to the respective GPU, I get the following error message (the home directory path has been replaced):
Rank 0, before sync: [[0.08403993]] at cuda:0
Rank 1, before sync: [[0.29211983]] at cuda:1
[node219:15444] *** Process received signal ***
[node219:15444] Signal: Segmentation fault (11)
[node219:15444] Signal code: Invalid permissions (2)
[node219:15444] Failing at address: 0x1050de00000
[node219:15444] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x2aaaaacde5e0]
[node219:15444] [ 1] /lib64/libc.so.6(+0x14cf13)[0x2aaaab037f13]
[node219:15444] [ 2] /HOME/lib/libopen-pal.so.40(opal_convertor_unpack+0x10a)[0x2aab47ad74fa]
[node219:15444] [ 3] /HOME/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x41f)[0x2aab6451e3ff]
[node219:15444] [ 4] /HOME/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x53b)[0x2aab630d2e5b]
[node219:15444] [ 5] /HOME/lib/libopen-pal.so.40(opal_progress+0x2c)[0x2aab47ac66dc]
[node219:15444] [ 6] /HOME/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x2aab47acd395]
[node219:15444] [ 7] Rank 0, after sync : [[0.08403993]] at cuda:0
/HOME/lib/libmpi.so.40(ompi_request_default_wait+0x1ce)[0x2aab1e45f9ce]
[node219:15444] [ 8] /HOME/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x4ee)[0x2aab1e4a903e]
[node219:15444] [ 9] /HOME/lib/libmpi.so.40(ompi_coll_base_bcast_intra_binomial+0xb7)[0x2aab1e4a94b7]
[node219:15444] [10] /HOME/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc)[0x2aab6537a8ac]
[node219:15444] [11] /HOME/lib/libmpi.so.40(MPI_Bcast+0x139)[0x2aab1e474ee9]
[node219:15444] [12] /HOME/anaconda3/envs/fbi_env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN3thd14DataChannelMPI9broadcastERN2at6TensorEji+0x15f)[0x2aaac5b7b5cf]
[node219:15444] [13] /HOME/anaconda3/envs/fbi_env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(THDBroadcast+0x28)[0x2aaac5b53958]
[node219:15444] [14] /HOME/anaconda3/envs/fbi_env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_Z20THDPModule_broadcastP7_objectS0_+0xcb)[0x2aaac59ef95b]
[node219:15444] [15] python(_PyCFunction_FastCallDict+0x91)[0x555555663921]
[node219:15444] [16] python(+0x19cdfc)[0x5555556f0dfc]
[node219:15444] [17] python(_PyEval_EvalFrameDefault+0x2fa)[0x55555571594a]
[node219:15444] [18] python(+0x196206)[0x5555556ea206]
[node219:15444] [19] python(+0x1971cf)[0x5555556eb1cf]
[node219:15444] [20] python(+0x19ced5)[0x5555556f0ed5]
[node219:15444] [21] python(_PyEval_EvalFrameDefault+0x2fa)[0x55555571594a]
[node219:15444] [22] python(+0x196f8b)[0x5555556eaf8b]
[node219:15444] [23] python(+0x19ced5)[0x5555556f0ed5]
[node219:15444] [24] python(_PyEval_EvalFrameDefault+0x2fa)[0x55555571594a]
[node219:15444] [25] python(PyEval_EvalCodeEx+0x329)[0x5555556ebcb9]
[node219:15444] [26] python(PyEval_EvalCode+0x1c)[0x5555556eca4c]
[node219:15444] [27] python(+0x214c44)[0x555555768c44]
[node219:15444] [28] python(PyRun_FileExFlags+0xa1)[0x555555769041]
[node219:15444] [29] python(PyRun_SimpleFileExFlags+0x1c4)[0x555555769244]
[node219:15444] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node node219 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Please note that the above error message contains the line [node219:15444] [ 7] Rank 0, after sync : [[0.08403993]] at cuda:0, showing that the sender (rank 0) successfully returned from the broadcast call; its print-out is interleaved with the error message from rank 1.
Furthermore, when explicitly setting device = "cuda:0" such that both processes access gpu 0, the code runs fine:
Rank 1, before sync: [[0.29211983]] at cuda:0
Rank 0, before sync: [[0.08403993]] at cuda:0
Rank 0, after sync : [[0.08403993]] at cuda:0
Rank 1, after sync : [[0.08403993]] at cuda:0
However, when explicitly setting device = "cuda:1", it consistently fails with the same error message.
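The fact that only GPU-to-GPU transfers across different devices fail often points at the Open MPI build rather than PyTorch: passing device pointers to MPI_Bcast requires a CUDA-aware Open MPI. Whether the installed build has CUDA support can be checked with ompi_info (a diagnostic suggestion, not a confirmed diagnosis of this crash):

```shell
# A CUDA-aware Open MPI build reports ":value:true" for this parameter;
# "false" means device pointers cannot be passed to MPI calls directly.
ompi_info --parsable --all 2>/dev/null | grep mpi_built_with_cuda_support:value \
  || echo "ompi_info not found or parameter missing"
```

If the build does report CUDA support, the mca_btl_smcuda.so frame in the backtrace suggests the shared-memory CUDA path is in use, and (assuming the parameter exists in this Open MPI version) disabling CUDA IPC with mpirun --mca btl_smcuda_use_cuda_ipc 0 could help rule out an inter-GPU IPC problem.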
System Info
Collecting environment information...
PyTorch version: 0.5.0a0+e186377
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: CentOS Linux release 7.4.1708 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
CMake version: version 3.11.1
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
Nvidia driver version: 384.111
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy (1.14.5)
[pip] torch (0.5.0a0+e186377)
[pip] torchvision (0.2.1)
[conda] magma-cuda90 2.3.0 1 pytorch
[conda] pytorch 0.4.0 py36hdf912b8_0
[conda] torch 0.5.0a0+04a7fc1 <pip>
[conda] torch 0.5.0a0+e186377 <pip>
[conda] torchvision 0.2.1 py36_0
In addition to the above, the cuDNN version is v7 for CUDA 9.0.
PyTorch was built from source from within anaconda3.
Furthermore:
mpirun (Open MPI) 3.1.0
nccl_2.2.13-1+cuda9.0 (I am not entirely sure this was found during the build, since using "nccl" during initialization fails). If you think this is relevant, I will recompile PyTorch and double-check.
Any help would be appreciated.
Matthias