CacheDataset shared memory, during torchrun multi-gpu training #5613
Closed
Labels: bug (Something isn't working)
Description
CacheDataset with shared memory crashes during torchrun distributed training
Steps to reproduce: run `torchrun --nproc_per_node=2 main.py` with the following `main.py`:

```python
import os
import torch
import torch.distributed as dist
from monai.data import CacheDataset

rank = int(os.getenv("LOCAL_RANK"))
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(rank)

# crashes here: the runtime cache is shared across ranks via
# multiprocessing, but torchrun workers have mismatched authkeys
dataset = CacheDataset(data=[1, 2, 3], runtime_cache=True)
```
This error doesn't occur when the multi-GPU processes are spawned manually (e.g. with `torch.multiprocessing.spawn`) instead of via torchrun.

The issue is that `broadcast_object_list` checks the multiprocessing `authkey` in each process, and torchrun launches its workers with different authkeys.
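The manual-spawn case works because child processes started by a single parent inherit that parent's multiprocessing `authkey`, so they can all authenticate against the same shared-memory manager. A minimal sketch of this inheritance, using plain `multiprocessing` (no torch or MONAI involved):

```python
import multiprocessing as mp

def report_key(q):
    # child sends its own authkey back to the parent
    q.put(bytes(mp.current_process().authkey))

if __name__ == "__main__":
    # "spawn" mirrors how torch.multiprocessing.spawn starts workers
    ctx = mp.get_context("spawn")
    parent_key = bytes(mp.current_process().authkey)
    q = ctx.Queue()
    procs = [ctx.Process(target=report_key, args=(q,)) for _ in range(2)]
    for p in procs:
        p.start()
    child_keys = [q.get() for _ in procs]
    for p in procs:
        p.join()
    # children spawned by one parent inherit its authkey
    print(all(k == parent_key for k in child_keys))  # True
```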