5573 fixes the shared cache race condition for cachedataset#5595
5573 fixes the shared cache race condition for cachedataset#5595wyli wants to merge 8 commits intoProject-MONAI:devfrom
Conversation
Signed-off-by: Wenqi Li <[email protected]>
|
@wyli please my comments here #5573 (comment) thanks , this PR addresses the crash, but it won't be a correct logic of using shared data. we should not be automatically converting to a standard list in DDP, since a a user wants to use shared cache even if num_workers==0 |
|
/build |
Signed-off-by: Wenqi Li <[email protected]>
|
/build |
Signed-off-by: Wenqi Li <[email protected]>
|
/build |
Signed-off-by: Wenqi Li <[email protected]>
Signed-off-by: Wenqi Li <[email protected]>
|
/build |
Signed-off-by: Wenqi Li <[email protected]>
|
/build |
|
like I said above. We should not call disable_share_memory_cache in DataLoader, if in DDP and using runtime_cache, even for num_workers==0, because we still need to use ListProxy (it is needed for different processes in DDP to read/write to the same cache indices). if we convert ListProxy -> List in such case, we will get memory copies in each process (potentially going OOM). I've updated the issues to be more explicit about it: #5573 this PR introduces a quick fix to addresses the crash, but it does not address the whole issue, which we still need to fix even if merging this PR. I'd prefer a proper fix. |
Signed-off-by: Wenqi Li [email protected]
part of #5573
Description
the shared cache is started by rank 0:
MONAI/monai/data/dataset.py
Lines 830 to 831 in 0c2fbac
so rank 0 shouldn't exit before the other processes finish.
This PR adds a barrier and slightly delays rank 0.
the PR also fixes some typos.
Types of changes
./runtests.sh -f -u --net --coverage../runtests.sh --quick --unittests --disttests.make htmlcommand in thedocs/folder.