
Possible NCCL race condition destroy_process_group followed by init_process_group. #119196

@Skylion007


🐛 Describe the bug

I seem to have found an issue that can occur when destroying the default process group and attempting to reinitialize it immediately afterwards. This leads to a race condition: not all workers have finished destroying their process groups, but NCCL tries to connect to them anyway, leaving NCCL in an invalid state. The issue is stochastic (any delay that gives the workers enough time to actually destroy the process group fixes it), but it reproduces very consistently. For example,

dist.destroy_process_group()
dist.init_process_group()
dist.barrier()

reliably crashes with the following NCCL error:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/streaming/base/dataset.py", line 513, in __init__
    self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
  File "/usr/lib/python3/dist-packages/streaming/base/shared/prefix.py", line 196, in get_shm_prefix
    dist.barrier()
  File "/usr/lib/python3/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
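For reference, a more complete standalone repro along these lines might look like the following (a minimal sketch, assuming the script is launched with torchrun so that RANK, WORLD_SIZE, and the rendezvous environment variables are already set; the device selection is an assumption):

import torch
import torch.distributed as dist

# First initialization: torchrun provides MASTER_ADDR/MASTER_PORT, RANK, WORLD_SIZE.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
dist.barrier()  # works as expected

# Tear down and immediately reinitialize the default process group.
dist.destroy_process_group()
dist.init_process_group(backend="nccl")

# Racy: some ranks may still be tearing down their old NCCL communicators,
# so this collective can fail with "Connection reset by peer" / ncclSystemError.
dist.barrier()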

The same happens when a collective such as dist.broadcast is called immediately after reinitialization:

  File "/usr/lib/python3/dist-packages/composer/utils/dist.py", line 334, in broadcast
    dist.broadcast(tensor, src)
  File "/usr/lib/python3/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 172.16.5.145<49779> failed : Software caused connection abort
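The broadcast case follows the same pattern as the barrier one (a sketch; the tensor shape and source rank are arbitrary):

# Immediately after the re-initialization shown above:
t = torch.zeros(1, device="cuda")
dist.broadcast(t, src=0)  # can fail with ncclSystemError while old communicators are still being torn down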

Adding a time.sleep(10) between the two calls seems to resolve the issue. I am not sure whether it is a matter of waiting for an object to be garbage collected (and therefore for a resource to be properly released) or something else.

dist.destroy_process_group()
time.sleep(10)
dist.init_process_group()
dist.barrier()

This seems to alleviate the issue, but destroy_process_group should ideally not return until the process group is actually destroyed, or at the very least this behavior should be documented.
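One way to package the workaround until a real fix lands is a small helper that sleeps between teardown and re-initialization (a sketch only; the helper name and the 10 second delay come from the observation above, not from any PyTorch API):

import time
import torch.distributed as dist

def reinit_default_process_group(backend="nccl", delay_s=10.0):
    # Destroy the default process group, wait a grace period, then recreate it.
    # The sleep is a stopgap: it gives all ranks time to finish tearing down
    # their NCCL communicators before any of them tries to reconnect.
    if dist.is_initialized():
        dist.destroy_process_group()
    time.sleep(delay_s)
    dist.init_process_group(backend=backend)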

Versions

PyTorch 2.1.1

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225

Labels

module: nccl, oncall: distributed, triaged