CacheDataset shared memory, during torchrun multi-gpu training #5613
Closed
Labels: bug (Something isn't working)
Description
CacheDataset with shared memory crashes during torchrun distributed training
Steps to reproduce: run `torchrun --nproc_per_node=2 main.py` with the following `main.py`:

```python
import os
import torch
import torch.distributed as dist
from monai.data import CacheDataset

rank = int(os.getenv("LOCAL_RANK"))
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(rank)

# crashes here: the runtime cache is shared across ranks via
# multiprocessing, but torchrun workers have mismatched authkeys
dataset = CacheDataset(data=[1, 2, 3], runtime_cache=True)
```
This error doesn't occur when the multi-GPU processes are spawned manually (e.g. with `torch.multiprocessing.spawn`) instead of via torchrun.

The issue is that `broadcast_object_list` checks the multiprocessing `authkey` in each process, and torchrun launches its workers with different authkeys.
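The manual-spawn case works because child processes started by a single parent inherit that parent's multiprocessing `authkey`, so they can all authenticate against the same shared-memory manager. A minimal sketch of this inheritance, using plain `multiprocessing` (no torch or MONAI involved):

```python
import multiprocessing as mp

def report_key(q):
    # child sends its own authkey back to the parent
    q.put(bytes(mp.current_process().authkey))

if __name__ == "__main__":
    # "spawn" mirrors how torch.multiprocessing.spawn starts workers
    ctx = mp.get_context("spawn")
    parent_key = bytes(mp.current_process().authkey)
    q = ctx.Queue()
    procs = [ctx.Process(target=report_key, args=(q,)) for _ in range(2)]
    for p in procs:
        p.start()
    child_keys = [q.get() for _ in procs]
    for p in procs:
        p.join()
    # children spawned by one parent inherit its authkey
    print(all(k == parent_key for k in child_keys))  # True
```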