
Possible NCCL race condition destroy_process_group followed by init_process_group. #119196

@Skylion007


🐛 Describe the bug

I seem to have found an issue that can occur when destroying the default process group and attempting to reinitialize it immediately afterwards. This leads to a race condition: not all workers have finished destroying their process groups, but NCCL tries to connect to them anyway, leaving NCCL in an invalid state. The issue is stochastic (any delay that gives the workers enough time to actually destroy the process group fixes it), but it reproduces very consistently. For example,

dist.destroy_process_group()
dist.init_process_group()
dist.barrier()

reliably crashes with the following NCCL error:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/streaming/base/dataset.py", line 513, in __init__
    self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
  File "/usr/lib/python3/dist-packages/streaming/base/shared/prefix.py", line 196, in get_shm_prefix
    dist.barrier()
  File "/usr/lib/python3/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
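For reference, a more complete standalone repro along these lines might look like the following (a minimal sketch, assuming the script is launched with torchrun so that RANK, WORLD_SIZE, and the rendezvous environment variables are already set; the device selection is an assumption):

import torch
import torch.distributed as dist

# First initialization: torchrun provides MASTER_ADDR/MASTER_PORT, RANK, WORLD_SIZE.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
dist.barrier()  # works as expected

# Tear down and immediately reinitialize the default process group.
dist.destroy_process_group()
dist.init_process_group(backend="nccl")

# Racy: some ranks may still be tearing down their old NCCL communicators,
# so this collective can fail with "Connection reset by peer" / ncclSystemError.
dist.barrier()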

The same happens when a collective such as dist.broadcast is called immediately after reinitialization:

  File "/usr/lib/python3/dist-packages/composer/utils/dist.py", line 334, in broadcast
    dist.broadcast(tensor, src)
  File "/usr/lib/python3/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 172.16.5.145<49779> failed : Software caused connection abort
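The broadcast case follows the same pattern as the barrier one (a sketch; the tensor shape and source rank are arbitrary):

# Immediately after the re-initialization shown above:
t = torch.zeros(1, device="cuda")
dist.broadcast(t, src=0)  # can fail with ncclSystemError while old communicators are still being torn down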

Adding a time.sleep(10) between the two calls seems to resolve the issue. I am not sure whether it is a matter of waiting for an object to be garbage collected (and therefore for a resource to be properly released) or something else.

dist.destroy_process_group()
time.sleep(10)
dist.init_process_group()
dist.barrier()

This seems to alleviate the issue, but destroy_process_group should ideally not return until the process group is actually destroyed, or at the very least this behavior should be documented.
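One way to package the workaround until a real fix lands is a small helper that sleeps between teardown and re-initialization (a sketch only; the helper name and the 10 second delay come from the observation above, not from any PyTorch API):

import time
import torch.distributed as dist

def reinit_default_process_group(backend="nccl", delay_s=10.0):
    # Destroy the default process group, wait a grace period, then recreate it.
    # The sleep is a stopgap: it gives all ranks time to finish tearing down
    # their NCCL communicators before any of them tries to reconnect.
    if dist.is_initialized():
        dist.destroy_process_group()
    time.sleep(delay_s)
    dist.init_process_group(backend=backend)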

Versions

PyTorch 2.1.1

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225

Labels

module: nccl, oncall: distributed, triaged