Description
🐛 Describe the bug
I seem to have found an issue that occurs when destroying the default process group and reinitializing it immediately afterwards. This leads to a race condition: not all workers have finished destroying their distributed process groups, but NCCL tries to connect to them anyway, which leaves NCCL in an invalid state. The issue is timing-dependent (any delay that gives the workers enough time to actually destroy the process group avoids it). However, it also reproduces very consistently for me:
dist.destroy_process_group()
dist.init_process_group()
dist.barrier()
This sequence reliably crashes in NCCL:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/streaming/base/dataset.py", line 513, in __init__
self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
File "/usr/lib/python3/dist-packages/streaming/base/shared/prefix.py", line 196, in get_shm_prefix
dist.barrier()
File "/usr/lib/python3/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
A different error appears when an operation such as dist.broadcast is called immediately after reinitialization:
File "/usr/lib/python3/dist-packages/composer/utils/dist.py", line 334, in broadcast
dist.broadcast(tensor, src)
File "/usr/lib/python3/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 172.16.5.145<49779> failed : Software caused connection abort
Adding a time.sleep(10) seems to solve the issue. I am not sure whether this comes down to waiting for an object to be garbage collected (and therefore for a resource to be properly released) or something else.
dist.destroy_process_group()
time.sleep(10)
dist.init_process_group()
dist.barrier()
This seems to alleviate the issue, but we should probably make sure destroy_process_group actually destroys the process group, or at least document this behavior.
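For anyone who needs the workaround in the meantime, it can be wrapped in a small helper. This is only a sketch: `reinit_process_group`, `delay_s`, and the injected `sleep` are names made up for illustration and are not part of torch.distributed. The 10-second default is simply the delay that happened to work in my runs, not a guaranteed bound, and the helper does not fix the underlying race.

```python
import time


def reinit_process_group(dist, delay_s=10.0, sleep=time.sleep, **init_kwargs):
    """Destroy the default process group, wait, then initialize it again.

    `dist` is expected to be the torch.distributed module (or any object that
    exposes the same four functions). It is passed in as an argument so the
    helper can be exercised without a multi-GPU job. Hypothetical helper.
    """
    # Only tear down if a default process group currently exists.
    if dist.is_initialized():
        dist.destroy_process_group()

    # Workaround from this report: give the ranks time to finish teardown
    # before any of them tries to connect again. The delay is a guess.
    sleep(delay_s)

    # Extra keyword arguments (backend, init_method, ...) are forwarded as-is.
    dist.init_process_group(**init_kwargs)

    # The crash showed up on the first collective, so issue one right away
    # to surface a bad communicator early.
    dist.barrier()
```

With PyTorch installed, this would be called on every rank as `reinit_process_group(torch.distributed, backend="nccl")`.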
Versions
2.1.1
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225