
CacheDataset shared memory, during torchrun multi-gpu training #5613

@myron

Description


CacheDataset with shared memory crashes during torchrun distributed training

Steps to reproduce: run `torchrun --nproc_per_node=2 main.py` with the following `main.py`:

import os

import torch
import torch.distributed as dist
from monai.data import CacheDataset

# LOCAL_RANK is set by torchrun for each worker process
rank = int(os.getenv("LOCAL_RANK"))
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(rank)

# building the runtime cache crashes under torchrun
dataset = CacheDataset(data=[1, 2, 3], runtime_cache=True)

This error does not occur when the multi-GPU processes are spawned manually (instead of via torchrun).
The issue arises because `broadcast_object_list` checks the multiprocessing authkey in each process, and torchrun assigns a different authkey to each worker.
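Since the failure comes down to mismatched authkeys across torchrun workers, one possible workaround (a hedged sketch, not an official MONAI fix) is to set every worker's `multiprocessing` authkey to a common value before constructing the `CacheDataset`, so the shared-memory handshake validates. The key value `b"shared-monai-authkey"` and the helper name below are illustrative assumptions:

```python
import multiprocessing as mp


def set_shared_authkey(key: bytes = b"shared-monai-authkey") -> None:
    # Hypothetical workaround: give every torchrun worker the same
    # multiprocessing authkey, so objects pickled for the shared-memory
    # cache pass the authkey check in broadcast_object_list.
    mp.current_process().authkey = key


# call this in each worker before CacheDataset(..., runtime_cache=True)
set_shared_authkey()
```

Whether this is safe depends on your security requirements: the authkey guards `multiprocessing` connections, so sharing a fixed key weakens that check.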

Labels: bug (Something isn't working)