
CacheDataset crashes with runtime_cache=True, num_workers=0 in DDP #5573

@myron

Description


CacheDataset crashes with runtime_cache=True, num_workers=0 in DDP

This happens because we try to convert the shared list (ListProxy) to a regular list, but we must keep the ListProxy: in DDP, the different processes need to access the shared cache items through it.

The exact error is below: the ListProxy cannot be converted. But even if it could be, we should not convert it in DDP, since the ListProxy is still needed.

EDIT: expected behavior below:

  • it should not crash
  • we should not call disable_share_memory_cache in DataLoader when in DDP and using runtime_cache, even for num_workers==0, because we still need the ListProxy (it is what lets the different DDP processes read/write the same cache indices). If we convert ListProxy -> list in that case, each process ends up with its own memory copy of the cache (potentially going OOM).
Traceback (most recent call last):                                                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap                                                               
    fn(i, *args)                                                                                                                                                
  File "/scripts/segmenter.py", line 1032, in run_segmenter_worker                                       
    best_metric = segmenter.run()                                                                                                                               
  File "//scripts/segmenter.py", line 696, in run
    self.train()
  File "/scripts/segmenter.py", line 732, in train
    train_loader = DataLoader(train_ds, batch_size=config["batch_size"], shuffle=(train_sampler is None),  num_workers=config["num_workers"], sampler=train_sampler, pin_memory=True)
  File "/mnt/amproj/Code/MONAI/monai/data/dataloader.py", line 87, in __init__
    dataset.disable_share_memory_cache() 
  File "/MONAI/monai/data/dataset.py", line 855, in disable_share_memory_cache
    self._cache = list(self._cache)
  File "<string>", line 2, in __len__
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 831, in _callmethod
    self._connect()
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 818, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 502, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 630, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
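For context, here is a minimal standard-library sketch (not MONAI code) of why the ListProxy has to be kept: converting it to a plain list snapshots the cache into the local process, so writes made by other processes are no longer visible, and each process pays for its own copy.

```python
# Sketch, assuming the fork start method on Linux: a Manager-backed ListProxy
# is shared across processes, while list(proxy) is a detached local copy.
import multiprocessing as mp

def fill_slot(shared_cache, index, value):
    # A worker process writes a cached item into the shared list.
    shared_cache[index] = value

def main():
    manager = mp.Manager()
    # ListProxy, analogous to CacheDataset's runtime cache.
    shared_cache = manager.list([None, None])

    # A plain-list copy (what converting ListProxy -> list does):
    # detached, so updates from other processes never reach it.
    local_copy = list(shared_cache)

    p = mp.Process(target=fill_slot, args=(shared_cache, 0, "cached item"))
    p.start()
    p.join()

    print(shared_cache[0])  # visible through the proxy: "cached item"
    print(local_copy[0])    # still None: the copy never sees the update

if __name__ == "__main__":
    main()
```

In DDP each rank is a separate process, so the same logic applies: only the proxy-backed cache lets one rank's cached item be reused by the others.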

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)