
cachedataset small fix #5565

Merged
Nic-Ma merged 19 commits into Project-MONAI:dev from myron:datasetfix
Dec 2, 2022

Conversation

@myron
Collaborator

@myron myron commented Nov 23, 2022

Fixes #5573

When using CacheDataset with DataLoader num_workers==0, we convert the ProxyList to a plain List in disable_share_memory_cache().

But if we're already inside DDP (DistributedDataParallel), we should not do this conversion, as it will crash; we should keep using the ProxyList().
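The guard discussed above can be sketched in isolation. This is an illustrative stand-in, not MONAI's actual code; should_convert_to_list and the is_dist flag are hypothetical names for the num_workers check and the self._is_dist condition this PR adds.

```python
# Hypothetical sketch of the fix: names are illustrative, not MONAI's API.
# The shared ProxyList may be converted back to a plain list only when a
# single process owns the cache; inside DDP, other ranks still read it.

def should_convert_to_list(num_workers: int, is_dist: bool) -> bool:
    """Return True when it is safe to replace the shared ProxyList
    with a regular list: single-process loading AND not inside DDP."""
    return num_workers == 0 and not is_dist

# Single-process, no DDP: safe to drop the shared-memory proxy.
assert should_convert_to_list(0, False)
# DDP with num_workers=0: keep the ProxyList, other ranks need it.
assert not should_convert_to_list(0, True)
# Worker subprocesses also require the shared cache.
assert not should_convert_to_list(4, False)
```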

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: myron <[email protected]>
@Nic-Ma
Contributor

Nic-Ma commented Nov 23, 2022

Hi @myron ,

The initial idea of "convert ProxyList to List" is to support caching on GPU memory, because we must use num_workers=0 for GPU caching.
Could you please help share your crash log?

Thanks.

@myron
Collaborator Author

myron commented Nov 23, 2022

> Could you please help share your crash log?

I don't have a crash log at the moment, but the issue arises when we use CacheDataset(runtime_cache=True) during DistributedDataParallel with num_workers==0.

In this case we use shared memory, so all processes can access the same shared list (ProxyList). Even if num_workers==0, we still must use the ProxyList, because the different DDP processes still need to access the same shared list (but our current logic disables it and converts to a regular List).
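As a toy analogue of why the ProxyList must stay: a multiprocessing.Manager list (the type behind the ProxyList) is visible to every process, which is the property the DDP ranks rely on. This is a stdlib-only sketch, not MONAI's implementation.

```python
from multiprocessing import Manager, Process

def fill(shared, idx):
    # each worker process writes into the shared cache slot it owns
    shared[idx] = idx * idx

def demo():
    with Manager() as mgr:
        cache = mgr.list([None] * 4)  # proxy list: shared across processes
        procs = [Process(target=fill, args=(cache, i)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return list(cache)  # materialize before the manager shuts down

if __name__ == "__main__":
    print(demo())  # -> [0, 1, 4, 9]
```

A plain Python list would not behave this way: each process would mutate its own copy, which is why converting the ProxyList to a list while other DDP ranks still read it causes the crash.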

@Nic-Ma
Contributor

Nic-Ma commented Nov 23, 2022

Hi @myron ,

May I know in which case you must use num_workers=0? I suppose for PyTorch DataLoader, at least num_workers=1 is faster than num_workers=0.
If so, we can drop the shared cache support for num_workers=0, because we must use the regular list for GPU cache.
What do you think?

Thanks.

@myron
Collaborator Author

myron commented Nov 23, 2022

> May I know in which case you must use num_workers=0? I suppose for PyTorch DataLoader, at least num_workers=1 is faster than num_workers=0. If so, we can drop the shared cache support for num_workers=0, because we must use the regular list for GPU cache. What do you think?

I use num_workers==0 (a) when something is crashing, to get a sequential loading order and readable debug log outputs, and (b) when there is not enough RAM (on my local desktop), since it saves RAM.

If you want to have a mode for on GPU caching, we can try

  1. simply inform users (in the docs) not to use runtime_cache if any tensors are on GPU.
    In fact, I believe the default DataLoader may already crash if any tensors are copied to GPU, so maybe we can just check whether it's a ThreadDataLoader.
  2. alternatively, we can separate the runtime_cache logic from shared_cache and expose two explicit options: CacheDataset(runtime_cache=True, shared_cache=True)

@Nic-Ma
Contributor

Nic-Ma commented Nov 23, 2022

Hi @myron ,

  1. GPU caching can work with DataLoader(num_workers=0), and cache at runtime.
  2. Maybe we should separate it into 2 options as you suggested, @wyli what do you think?

Thanks.

@wyli
Contributor

wyli commented Nov 23, 2022

the runtime cache option + data loader option are already complicated; I just hope there are no more new parameters for the users to tune...

@myron
Collaborator Author

myron commented Nov 23, 2022

> the runtime cache option + data loader option are already complicated; I just hope there are no more new parameters for the users to tune...

I agree; the current CacheDataset already has a lot of options. That's why I originally wanted a separate class, SharedCacheDataset (#5308).

As it stands at the moment, we have a small bug: it crashes with num_workers=0, runtime_cache=True in DDP.

To support GPU caching, I think the easiest thing is to simply inform the user (in the docs) not to use runtime_cache when tensors are on GPU, and either

  • merge this PR,
  • or remove the conversion-to-list check from CacheDataset completely.

In that case, if a user doesn't use runtime_cache, GPU caching will work.
@Nic-Ma

@myron
Collaborator Author

myron commented Nov 23, 2022

I added a log here #5573

Signed-off-by: myron <[email protected]>
@myron
Collaborator Author

myron commented Nov 23, 2022

for some reason the flake8 test fails, but it seems unrelated to this PR; what could cause it? @Nic-Ma @wyli

flake8
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/__main__.py", line 7, in <module>
    raise SystemExit(main())
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/main/cli.py", line 23, in main
    app.run(argv)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/main/application.py", line 198, in run
    self._run(argv)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/main/application.py", line 186, in _run
    self.initialize(argv)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/main/application.py", line 165, in initialize
    self.plugins, self.options = parse_args(argv)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/options/parse_args.py", line 53, in parse_args
    opts = aggregator.aggregate_options(option_manager, cfg, cfg_dir, rest)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/options/aggregator.py", line 30, in aggregate_options
    parsed_config = config.parse_config(manager, cfg, cfg_dir)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/options/config.py", line 131, in parse_config
    raise ValueError(
ValueError: Error code '#' supplied to 'ignore' option does not match '^[A-Z]{1,3}[0-9]{0,3}$'
Check failed!
Please run auto style fixes: ./runtests.sh --autofix
Error: Process completed with exit code 1.
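For context, this class of flake8 failure typically comes from an inline comment inside a multi-line option value: flake8 5.x stopped stripping such comments, so the '#' is parsed as an error code and rejected. A hypothetical setup.cfg fragment that reproduces the message (not necessarily MONAI's actual config):

```ini
[flake8]
ignore =
    E203,  # inline comment here: flake8 >= 5 reads '#' as an error code
    W503
```

Moving the comment onto its own line outside the option value, or removing it entirely, is the usual fix.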

@wyli
Contributor

wyli commented Nov 24, 2022

That's addressed just now on dev: #5575

@Nic-Ma
Contributor

Nic-Ma commented Nov 24, 2022

Hi @myron , @wyli ,

So you agreed to drop the "runtime_cache" for GPU caching? I think it's OK if it can simplify the user stories.
But please note this integration test will fail, may need some update: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147

Thanks.

@myron
Collaborator Author

myron commented Nov 24, 2022

> Hi @myron , @wyli ,
>
> So you agreed to drop the "runtime_cache" for GPU caching? I think it's OK if it can simplify the user stories. But please note this integration test will fail, may need some update: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147
>
> Thanks.

I don't have a preference; we can either drop the "runtime_cache" for GPU caching (this PR) or add another option to CacheDataset to explicitly set when to use the shared cache.

wyli
wyli previously approved these changes Nov 24, 2022
Contributor

@wyli wyli left a comment


I don't fully follow all of your comments @myron @Nic-Ma, but the current code change just adds a not self._is_dist condition to disable_share_memory_cache, which I think is good. Please let me know if I've misunderstood.

@myron
Collaborator Author

myron commented Nov 26, 2022

> Hi @myron , @wyli ,
>
> So you agreed to drop the "runtime_cache" for GPU caching? I think it's OK if it can simplify the user stories. But please note this integration test will fail, may need some update: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147
>
> Thanks.

@Nic-Ma I've updated that test: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147

but I also see this option is used in the model zoo; not sure if anything needs to be updated there:
https://github.com/Project-MONAI/model-zoo/blob/dev/models/mednist_reg/configs/train.yaml

@wyli
Contributor

wyli commented Nov 26, 2022

>> Hi @myron , @wyli ,
>> So you agreed to drop the "runtime_cache" for GPU caching? I think it's OK if it can simplify the user stories. But please note this integration test will fail, may need some update: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147
>> Thanks.
>
> @Nic-Ma I've updated that test: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147
>
> but I also see this option is used in the model zoo; not sure if anything needs to be updated there: https://github.com/Project-MONAI/model-zoo/blob/dev/models/mednist_reg/configs/train.yaml

I tested the bundle; it works fine with this PR, so I'll merge it for now. cc @Nic-Ma, we can follow up if there are other issues...

@wyli
Contributor

wyli commented Nov 26, 2022

/build

@wyli wyli enabled auto-merge (squash) November 26, 2022 10:41
@wyli
Contributor

wyli commented Nov 30, 2022

/build

@wyli wyli enabled auto-merge (squash) November 30, 2022 18:28
wyli
wyli previously approved these changes Nov 30, 2022
@wyli wyli disabled auto-merge November 30, 2022 20:19
@wyli
Contributor

wyli commented Nov 30, 2022

this still has some issues with the CI @Nic-Ma; no progress for 1 hour at this line:


[2022-11-30T18:57:41.332Z] .Finished test: test_value_3 (tests.test_rand_shift_intensity.TestRandShiftIntensity) (0.0018s)

[2022-11-30T18:57:41.332Z] Starting test: test_datalist (tests.test_smartcachedataset.TestSmartCacheDataset)...

[2022-11-30T18:57:41.332Z] 
Loading dataset:   0%|          | 0/2 [00:00<?, ?it/s]
Loading dataset: 100%|██████████| 2/2 [00:00<00:00, 13617.87it/s]

[2022-11-30T18:57:41.332Z] .Finished test: test_datalist (tests.test_smartcachedataset.TestSmartCacheDataset) (0.102s)

[2022-11-30T18:57:41.332Z] Starting test: test_set_data (tests.test_smartcachedataset.TestSmartCacheDataset)...

[2022-11-30T18:57:41.587Z] 
Loading dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Loading dataset: 100%|██████████| 5/5 [00:00<00:00, 2596.77it/s]

for cuda 10.2 torch==1.8.2 torchvision==0.9.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu102

@myron
Collaborator Author

myron commented Nov 30, 2022

> this still has some issues with the CI @Nic-Ma; no progress for 1 hour at this line: [CI log quoted above]

This test (tests.test_smartcachedataset) passes fine on my machine with the nvidia/pytorch:22.11 docker image (but it's not exactly the same environment).

Looking at the code, it does not seem like this PR (or anything related to the runtime cache) should affect this test (tests.test_smartcachedataset). I've made a slight update just in case. @wyli can you please re-run this CI test? Thanks.

@myron myron requested a review from wyli December 1, 2022 07:38
@Nic-Ma
Contributor

Nic-Ma commented Dec 1, 2022

Hi @myron ,

Let's see whether the PR can pass all the CI tests.

Thanks.

@Nic-Ma
Contributor

Nic-Ma commented Dec 1, 2022

/build

@myron myron requested a review from Nic-Ma December 1, 2022 08:00
@Nic-Ma
Contributor

Nic-Ma commented Dec 1, 2022

As the CI tests failed, I tried to re-run them.
If they continue failing, I think @myron you may need to dig deeper into the issue.
Let's merge #5595 first, as it can fix the crash issue. @wyli, you removed the sleep(0.5) logic; is that change non-breaking?

Thanks.

@wyli wyli enabled auto-merge (squash) December 1, 2022 15:43
@wyli wyli disabled auto-merge December 1, 2022 15:43
@wyli
Contributor

wyli commented Dec 1, 2022

/build

@wyli
Contributor

wyli commented Dec 1, 2022

/build

@myron
Collaborator Author

myron commented Dec 1, 2022

@Nic-Ma where can I see the errors of the failing CI to debug it? If I click on the "blossom-ci - Fail" check, it does not show me any errors or details: https://github.com/Project-MONAI/MONAI/actions/runs/3594715289
Thanks

@Nic-Ma
Contributor

Nic-Ma commented Dec 2, 2022

/build

@Nic-Ma
Contributor

Nic-Ma commented Dec 2, 2022

/integration-test

@Nic-Ma Nic-Ma merged commit e2fc703 into Project-MONAI:dev Dec 2, 2022
@Nic-Ma
Contributor

Nic-Ma commented Dec 2, 2022

Merged as all the CI tests passed now.

Thanks.



Development

Successfully merging this pull request may close these issues.

CacheDataset crashes with runtime_cache=True, num_workers=0 in DDP

3 participants