CacheDataset crashes with runtime_cache=True, num_workers=0 in DDP #5573
Closed
Labels: bug (Something isn't working)
Description
CacheDataset crashes when it is created with runtime_cache=True and loaded with num_workers=0 under DDP.
This happens because we try to convert the shared list (a ListProxy) into a regular list. Under DDP the cache must remain a ListProxy, because the different processes need to access the same shared items.
The exact error (see the traceback below) is that the conversion fails. But even if the conversion succeeded, we should not do it under DDP, since the ListProxy is still needed.
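To illustrate why the conversion matters, here is a minimal sketch using only the standard library. It is not MONAI code; the function name `demo_proxy_vs_copy` is made up for this example, and picking the "fork" start method is just a convenience for the demo. It shows that `list(proxy)` produces a private snapshot that no longer tracks the shared list.

```python
import multiprocessing as mp


def demo_proxy_vs_copy():
    """Hypothetical demo: compare a ListProxy with the plain list made from it."""
    # Use "fork" where it exists so the sketch does not need a __main__ guard;
    # this is a choice made for the demo only.
    methods = mp.get_all_start_methods()
    ctx = mp.get_context("fork" if "fork" in methods else None)

    with ctx.Manager() as manager:
        shared = manager.list([None] * 4)  # ListProxy: items live in the manager process
        local = list(shared)               # plain list: a private snapshot of the items

        shared[0] = "cached item"          # visible to every process holding the proxy
        local[1] = "private item"          # visible only inside this process

        # Read everything back while the manager is still running.
        return type(shared).__name__, shared[0], local[0], shared[1]


if __name__ == "__main__":
    print(demo_proxy_vs_copy())
```

After the conversion, writes through the proxy never reach the plain list and vice versa, so each DDP rank would end up filling its own copy of the cache.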
EDIT: expected behavior:
- It should not crash.
- DataLoader should not call disable_share_memory_cache when running under DDP with runtime_cache, even for num_workers==0, because the ListProxy is still needed: the different DDP processes read from and write to the same cache indices. If we convert ListProxy -> list in that case, each process gets its own memory copy of the cache (potentially going OOM).
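The expected behavior above could be expressed as a small guard. This is only a sketch under stated assumptions, not the actual MONAI fix: the helper name `should_disable_shared_cache` and its parameters are hypothetical, and it assumes the loader currently disables the shared cache whenever num_workers is 0. In a real setup, `is_distributed` might be derived from `torch.distributed.is_available() and torch.distributed.is_initialized()`.

```python
def should_disable_shared_cache(num_workers: int, runtime_cache: bool, is_distributed: bool) -> bool:
    """Hypothetical helper: may the loader turn the shared cache into a plain list?"""
    if num_workers > 0:
        # Worker processes read and write the cache, so it must stay shared.
        return False
    if runtime_cache and is_distributed:
        # No workers, but the DDP ranks still share the runtime cache.
        return False
    # Single process, nothing else touches the cache: a plain list is enough.
    return True


if __name__ == "__main__":
    # The case from this issue: DDP + runtime_cache + num_workers=0.
    print(should_disable_shared_cache(num_workers=0, runtime_cache=True, is_distributed=True))
```

With such a guard, the DataLoader would call disable_share_memory_cache only when the helper returns True, which leaves the ListProxy intact in the DDP case described here.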
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scripts/segmenter.py", line 1032, in run_segmenter_worker
    best_metric = segmenter.run()
  File "//scripts/segmenter.py", line 696, in run
    self.train()
  File "/scripts/segmenter.py", line 732, in train
    train_loader = DataLoader(train_ds, batch_size=config["batch_size"], shuffle=(train_sampler is None), num_workers=config["num_workers"], sampler=train_sampler, pin_memory=True)
  File "/mnt/amproj/Code/MONAI/monai/data/dataloader.py", line 87, in __init__
    dataset.disable_share_memory_cache()
  File "/MONAI/monai/data/dataset.py", line 855, in disable_share_memory_cache
    self._cache = list(self._cache)
  File "<string>", line 2, in __len__
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 831, in _callmethod
    self._connect()
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 818, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 502, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 630, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory