🐛 Describe the bug
Each non-zero rank occupies ~1 GB of memory on GPU 0.
Simple repro:
torchrun --standalone --nproc-per-node 4 repro.py
import torch
import os
import torch.distributed.distributed_c10d as c10d


def repro(rank, world_size):
    device = torch.device("cuda:%d" % rank)
    # torch.cuda.set_device(device)  # uncommenting this avoids the extra allocation on GPU 0 (see Note below)
    c10d.init_process_group(
        backend="nccl", rank=rank, world_size=world_size, device_id=device,
    )
    x = torch.ones((10,), device=device)
    c10d.all_reduce(x)
    c10d.destroy_process_group()
    print("clean exit")


if __name__ == "__main__":
    repro(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))
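The report does not say how the per-rank memory on GPU 0 was measured; one way to confirm it, assuming the nvidia-ml-py (pynvml) package is installed, is to list the compute processes resident on GPU 0 while the repro is running and check their used memory. This helper is not part of the original issue:

# check_gpu0.py -- hypothetical helper, not from the original report
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
# With the repro running, every rank's PID shows up here with ~1 GB used,
# even though only rank 0 is expected to have memory allocated on this device.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used = proc.usedGpuMemory or 0  # may be None if NVML cannot report per-process usage
    print(f"pid={proc.pid} used={used / 2**20:.0f} MiB")
pynvml.nvmlShutdown()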
Note:
The issue can be avoided if the user uncomments the torch.cuda.set_device(device) line, but calling torch.cuda.set_device is not, and has never been, a requirement of torch.distributed.
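For completeness, a minimal sketch of that workaround: the same repro with torch.cuda.set_device(device) enabled, so each rank pins the current CUDA device to its own GPU before init_process_group. The function name repro_with_workaround is mine; the body is otherwise identical to the repro above, and it is launched with the same torchrun command.

import torch
import os
import torch.distributed.distributed_c10d as c10d


def repro_with_workaround(rank, world_size):
    device = torch.device("cuda:%d" % rank)
    # Workaround: pin the current CUDA device to this rank's GPU up front,
    # so the extra ~1 GB allocation does not appear on GPU 0 for non-zero ranks.
    torch.cuda.set_device(device)
    c10d.init_process_group(
        backend="nccl", rank=rank, world_size=world_size, device_id=device,
    )
    x = torch.ones((10,), device=device)
    c10d.all_reduce(x)
    c10d.destroy_process_group()
    print("clean exit")


if __name__ == "__main__":
    repro_with_workaround(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))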
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o