Skip to content

5573 fixes the shared cache race condition for cachedataset#5595

Closed
wyli wants to merge 8 commits intoProject-MONAI:devfrom
wyli:fixes-5573
Closed

5573 fixes the shared cache race condition for cachedataset#5595
wyli wants to merge 8 commits intoProject-MONAI:devfrom
wyli:fixes-5573

Conversation

@wyli
Copy link
Copy Markdown
Contributor

@wyli wyli commented Nov 28, 2022

Signed-off-by: Wenqi Li [email protected]

part of #5573

Description

the shared cache is started by rank 0:

MONAI/monai/data/dataset.py

Lines 830 to 831 in 0c2fbac

# broadcast the ProxyList to all the ranks, then share the same cache content at runtime
dist.broadcast_object_list(obj_list, src=0)

so rank 0 shouldn't exit before the other processes finish.
This PR adds a barrier and slightly delays rank 0.

the PR also fixes some typos.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: Wenqi Li <[email protected]>
@myron
Copy link
Copy Markdown
Collaborator

myron commented Nov 28, 2022

@wyli please my comments here #5573 (comment)

thanks , this PR addresses the crash, but it won't be a correct logic of using shared data. we should not be automatically converting to a standard list in DDP, since a a user wants to use shared cache even if num_workers==0

@wyli
Copy link
Copy Markdown
Contributor Author

wyli commented Nov 30, 2022

/build

@wyli
Copy link
Copy Markdown
Contributor Author

wyli commented Dec 1, 2022

/build

@wyli
Copy link
Copy Markdown
Contributor Author

wyli commented Dec 1, 2022

/build

@Nic-Ma Nic-Ma mentioned this pull request Dec 1, 2022
7 tasks
@wyli
Copy link
Copy Markdown
Contributor Author

wyli commented Dec 1, 2022

/build

@wyli
Copy link
Copy Markdown
Contributor Author

wyli commented Dec 1, 2022

/build

@wyli wyli enabled auto-merge (squash) December 1, 2022 16:38
@wyli wyli disabled auto-merge December 1, 2022 16:50
@myron
Copy link
Copy Markdown
Collaborator

myron commented Dec 1, 2022

like I said above. We should not call disable_share_memory_cache in DataLoader, if in DDP and using runtime_cache, even for num_workers==0, because we still need to use ListProxy (it is needed for different processes in DDP to read/write to the same cache indices). if we convert ListProxy -> List in such case, we will get memory copies in each process (potentially going OOM).

I've updated the issues to be more explicit about it: #5573
in the issue there is a code (that will trigger assertion failure with the PR, since _cache becomes non-shared)

this PR introduces a quick fix to addresses the crash, but it does not address the whole issue, which we still need to fix even if merging this PR. I'd prefer a proper fix.

@wyli wyli closed this Dec 1, 2022
@wyli wyli deleted the fixes-5573 branch December 1, 2022 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants