🐛 Describe the bug
Each non-zero rank occupies ~1 GB of memory on GPU 0.
Simple repro:
torchrun --standalone --nproc-per-node 4 repro.py
import torch
import os
import torch.distributed.distributed_c10d as c10d


def repro(rank, world_size):
    device = torch.device("cuda:%d" % rank)
    # torch.cuda.set_device(device)  # uncommenting this avoids the extra allocation on GPU 0 (see Note below)
    c10d.init_process_group(
        backend="nccl", rank=rank, world_size=world_size, device_id=device,
    )
    x = torch.ones((10,), device=device)
    c10d.all_reduce(x)
    c10d.destroy_process_group()
    print("clean exit")


if __name__ == "__main__":
    repro(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))
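The report does not say how the per-rank memory on GPU 0 was measured; one way to confirm it, assuming the nvidia-ml-py (pynvml) package is installed, is to list the compute processes resident on GPU 0 while the repro is running and check their used memory. This helper is not part of the original issue:

# check_gpu0.py -- hypothetical helper, not from the original report
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
# With the repro running, every rank's PID shows up here with ~1 GB used,
# even though only rank 0 is expected to have memory allocated on this device.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used = proc.usedGpuMemory or 0  # may be None if NVML cannot report per-process usage
    print(f"pid={proc.pid} used={used / 2**20:.0f} MiB")
pynvml.nvmlShutdown()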
Note:
The issue can be avoided if the user uncomments the torch.cuda.set_device(device) line, but calling torch.cuda.set_device is not, and has never been, a requirement of torch.distributed.
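For completeness, a minimal sketch of that workaround: the same repro with torch.cuda.set_device(device) enabled, so each rank pins the current CUDA device to its own GPU before init_process_group. The function name repro_with_workaround is mine; the body is otherwise identical to the repro above, and it is launched with the same torchrun command.

import torch
import os
import torch.distributed.distributed_c10d as c10d


def repro_with_workaround(rank, world_size):
    device = torch.device("cuda:%d" % rank)
    # Workaround: pin the current CUDA device to this rank's GPU up front,
    # so the extra ~1 GB allocation does not appear on GPU 0 for non-zero ranks.
    torch.cuda.set_device(device)
    c10d.init_process_group(
        backend="nccl", rank=rank, world_size=world_size, device_id=device,
    )
    x = torch.ones((10,), device=device)
    c10d.all_reduce(x)
    c10d.destroy_process_group()
    print("clean exit")


if __name__ == "__main__":
    repro_with_workaround(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))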
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o