cachedataset small fix #5565
Conversation
Signed-off-by: myron <[email protected]>
Hi @myron, the initial idea of "convert ProxyList to List" is to support caching on GPU memory, because we must use num_workers=0 for GPU caching. Thanks.
I don't have a crash log at the moment, but the issue arises when we use num_workers=0 inside DDP. In this case we use shared memory, so all processes can access the same shared list (ProxyList).
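The shared-list mechanism mentioned here can be illustrated with Python's standard multiprocessing manager (a sketch of the concept only, not MONAI's actual cache code; the function name is illustrative):

```python
import multiprocessing as mp

def shared_cache_demo():
    """Create a Manager-backed list (a 'ProxyList'): the list itself lives
    in the manager's server process, and any process holding the proxy
    sees the same data, which is what makes a shared runtime cache work."""
    with mp.Manager() as manager:
        cache = manager.list([None] * 4)  # shared cache, initially empty
        cache[0] = "item-0"               # write goes through the proxy
        return list(cache)                # read back a plain-list snapshot

if __name__ == "__main__":
    print(shared_cache_demo())
```

A plain Python list, by contrast, is copied into each process and writes are not visible across processes, which is why the ProxyList is needed for multi-process runtime caching.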
Hi @myron, may I know in which case you must use num_workers=0? Thanks.
I use num_workers=0: a) when something is crashing, to have a sequential loading order and to debug log outputs; b) when there is not enough RAM (on my local desktop), since it saves RAM. If you want to have a mode for on-GPU caching, we can try.
The runtime cache option + data loader option are already complicated; I just hope there are no more new parameters for the users to tune...
I agree with that, the current CacheDataset has a lot of options. That's why originally I wanted to have a separate class SharedCacheDataset (#5308). As it stands at the moment, we have a small bug: it crashes with num_workers=0, runtime_cache=True in DDP. To support GPU caching, I think the easiest thing is to simply inform the user (in the docs) not to use runtime_cache when the tensor is on GPU; if a user doesn't use runtime_cache, then GPU caching will work.
|
I added a log here #5573 |
Signed-off-by: myron <[email protected]>
Signed-off-by: myron <[email protected]>
|
for some reason flake8 test fails , but it seems unrelated to this PR, what can cause it ? @Nic-Ma @wyli |
|
#5575 That's addressed just now on the dev |
|
So you aligned to drop the "runtime_cache" for GPU caching? I think it's OK if it can simplify the user stories.. Thanks. |
I don't have a preference, we can either drop the "runtime_cache" for GPU caching (this PR) or add another option to CacheDataset to manually set when to use shared cache. |
Signed-off-by: myron <[email protected]>
Signed-off-by: myron <[email protected]>
@Nic-Ma I've updated that tests https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147 but I also see this option is used in this model zoo, not sure if anything needs to be updated there |
I tested the bundle, it works fine with this PR, so I'll merge this for now cc @Nic-Ma we can follow up if there are other issues... |
|
/build |
|
/build |
|
this still has some issues with the CI @Nic-Ma , no progress for 1 hour at this line: for cuda 10.2 torch==1.8.2 torchvision==0.9.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu102 |
This test "tests.test_smartcachedataset" passes fine on my machine with a docker nvidia/pytorch:22:11 (but it's not exact same environment) Looking at the code, It does not seem like this PR (or anything with runtime cache) should affect this test (tests.test_smartcachedataset) . I've made a slight update (just in case). @wyli can you please re-run this CI test ? thanks |
Signed-off-by: myron <[email protected]>
|
Hi @myron , Let's see whether the PR can pass all the CI tests. Thanks. |
|
/build |
|
/build |
1 similar comment
|
/build |
|
@Nic-Ma where can I see the errors of CI failing to debug it? If I click on the "blossom-ci - Fail " test it does not show me any errors or any detailes https://github.com/Project-MONAI/MONAI/actions/runs/3594715289 |
|
/build |
|
/integration-test |
|
Merged as all the CI tests passed now. Thanks. |
Fixes #5573
When using CacheDataset , and DataLoader num_workers==0, we convert ProxyList to List in disable_share_memory_cache()
but if we're already inside DDP (DistributedDataParallel), we should not do such conversion, as it will crash. we should continue using ProxyList()
Types of changes
./runtests.sh -f -u --net --coverage../runtests.sh --quick --unittests --disttests.make htmlcommand in thedocs/folder.