[Distributed] Non-0 ranks creating CUDA contexts on device 0 #135279

@kwen2501

🐛 Describe the bug

Symptom:
(screenshot omitted)

Each non-0 rank occupies ~1 GB of memory on GPU 0.
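
One way to confirm the symptom from outside the job is to query GPU 0 for its compute processes while the repro runs. A minimal sketch, assuming the nvidia-ml-py package (imported as pynvml) is installed; this is not part of the original report:

import pynvml  # from the nvidia-ml-py package (an assumption, not part of the repro)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # With the bug present, every rank's pid appears here with
    # ~1 GB of usedGpuMemory, not only the rank that owns GPU 0.
    print(f"pid={proc.pid} usedGpuMemory={proc.usedGpuMemory}")
pynvml.nvmlShutdown()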

Simple repro:
torchrun --standalone --nproc-per-node 4 repro.py

import os

import torch
import torch.distributed.distributed_c10d as c10d

def repro(rank, world_size):
    device = torch.device("cuda:%d" % rank)
    # torch.cuda.set_device(device)  # uncommenting this line avoids the issue
    c10d.init_process_group(
        backend="nccl", rank=rank, world_size=world_size, device_id=device,
    )

    # Run a single collective, then tear down.
    x = torch.ones((10,), device=device)
    c10d.all_reduce(x)
    c10d.destroy_process_group()
    print("clean exit")

if __name__ == "__main__":
    repro(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))
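
Why the stray contexts land on GPU 0 specifically (an assumption about the mechanism, not a confirmed root cause): a process that never calls torch.cuda.set_device keeps the default current device, which is 0, so any context-creating CUDA call that is not pinned to an explicit device targets GPU 0. A minimal sketch:

import torch

# On a rank that has not called torch.cuda.set_device, the current
# device is still the default, so implicit CUDA initialization
# happens on GPU 0 no matter what RANK is.
print(torch.cuda.current_device())  # prints 0 on every rank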

Note:
The issue can be avoided if the user uncomments the torch.cuda.set_device(device) line above. But calling torch.cuda.set_device is not, and has never been, a requirement of torch.distributed.
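
For reference, a sketch of the workaround as it would look under torchrun, which exports LOCAL_RANK for each worker; RANK and WORLD_SIZE are picked up from the environment by the default env:// initialization:

import os

import torch
import torch.distributed.distributed_c10d as c10d

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
device = torch.device("cuda:%d" % local_rank)
torch.cuda.set_device(device)  # bind this process to its own GPU before any setup
c10d.init_process_group(backend="nccl", device_id=device)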

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
