
cachedataset small fix #5565

Merged
Nic-Ma merged 19 commits into Project-MONAI:dev from myron:datasetfix
Dec 2, 2022

Conversation

@myron
Collaborator

@myron myron commented Nov 23, 2022

Fixes #5573

When using CacheDataset with DataLoader num_workers==0, we convert the ProxyList to a plain List in disable_share_memory_cache().

But if we're already inside DDP (DistributedDataParallel), we should not do this conversion, as it will crash; we should keep using the ProxyList().
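The guard discussed above can be sketched in isolation. This is an illustrative stand-in, not MONAI's actual code; should_convert_to_list and the is_dist flag are hypothetical names for the num_workers check and the self._is_dist condition this PR adds.

```python
# Hypothetical sketch of the fix: names are illustrative, not MONAI's API.
# The shared ProxyList may be converted back to a plain list only when a
# single process owns the cache; inside DDP, other ranks still read it.

def should_convert_to_list(num_workers: int, is_dist: bool) -> bool:
    """Return True when it is safe to replace the shared ProxyList
    with a regular list: single-process loading AND not inside DDP."""
    return num_workers == 0 and not is_dist

# Single-process, no DDP: safe to drop the shared-memory proxy.
assert should_convert_to_list(0, False)
# DDP with num_workers=0: keep the ProxyList, other ranks need it.
assert not should_convert_to_list(0, True)
# Worker subprocesses also require the shared cache.
assert not should_convert_to_list(4, False)
```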

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: myron <[email protected]>
@Nic-Ma
Contributor

Nic-Ma commented Nov 23, 2022

Hi @myron ,

The initial idea of "convert ProxyList to List" is to support caching on GPU memory, because we must use num_workers=0 for GPU caching.
Could you please help share your crash log?

Thanks.

@myron
Collaborator Author

myron commented Nov 23, 2022

> Could you please help share your crash log?

I don't have a crash log at the moment, but the issue arises when we use CacheDataset(runtime_cache=True) during DistributedDataParallel with num_workers==0.

In this case we use shared memory, so all processes can access the same shared list (ProxyList). Even if num_workers==0, we still must use the ProxyList, because the different DDP processes still need to access the same shared list (but our current logic disables it and converts to a regular List).
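As a toy analogue of why the ProxyList must stay: a multiprocessing.Manager list (the type behind the ProxyList) is visible to every process, which is the property the DDP ranks rely on. This is a stdlib-only sketch, not MONAI's implementation.

```python
from multiprocessing import Manager, Process

def fill(shared, idx):
    # each worker process writes into the shared cache slot it owns
    shared[idx] = idx * idx

def demo():
    with Manager() as mgr:
        cache = mgr.list([None] * 4)  # proxy list: shared across processes
        procs = [Process(target=fill, args=(cache, i)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return list(cache)  # materialize before the manager shuts down

if __name__ == "__main__":
    print(demo())  # -> [0, 1, 4, 9]
```

A plain Python list would not behave this way: each process would mutate its own copy, which is why converting the ProxyList to a list while other DDP ranks still read it causes the crash.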

@Nic-Ma
Contributor

Nic-Ma commented Nov 23, 2022

Hi @myron ,

May I know in which case you must use num_workers=0? I suppose for PyTorch DataLoader, at least num_workers=1 is faster than num_workers=0.
If so, we can drop the shared cache support for num_workers=0, because we must use the regular list for GPU cache.
What do you think?

Thanks.

@myron
Collaborator Author

myron commented Nov 23, 2022

> May I know in which case you must use num_workers=0? I suppose for PyTorch DataLoader, at least num_workers=1 is faster than num_workers=0. If so, we can drop the shared cache support for num_workers=0, because we must use the regular list for GPU cache. What do you think?

I use num_workers==0 (a) when something is crashing, to get a sequential loading order and readable debug log outputs, and (b) when there is not enough RAM (on my local desktop), since it saves RAM.

If you want to have a mode for on GPU caching, we can try

  1. simply inform users (in the docs) not to use runtime_cache if any tensors are on GPU.
    In fact, I believe the default DataLoader may already crash if any tensors are copied to GPU, so maybe we can just check whether it's a ThreadDataLoader.
  2. alternatively, we can separate the runtime_cache logic from shared_cache and expose two explicit options: CacheDataset(runtime_cache=True, shared_cache=True)

@Nic-Ma
Contributor

Nic-Ma commented Nov 23, 2022

Hi @myron ,

  1. GPU caching can work with DataLoader(num_workers=0), and cache at runtime.
  2. Maybe we should separate it into 2 options as you suggested, @wyli what do you think?

Thanks.

@wyli
Contributor

wyli commented Nov 23, 2022

the runtime cache option + data loader option are already complicated; I just hope there are no more new parameters for the users to tune...

@myron
Collaborator Author

myron commented Nov 23, 2022

> the runtime cache option + data loader option are already complicated; I just hope there are no more new parameters for the users to tune...

I agree; the current CacheDataset already has a lot of options. That's why I originally wanted a separate class, SharedCacheDataset (#5308).

As it stands at the moment, we have a small bug: it crashes with num_workers=0, runtime_cache=True in DDP.

To support GPU caching, I think the easiest thing is to simply inform the user (in the docs) not to use runtime_cache when tensors are on GPU, and either

  • merge this PR,
  • or remove the conversion-to-list check from CacheDataset completely.

In that case, if a user doesn't use runtime_cache, GPU caching will work.
@Nic-Ma

@myron
Collaborator Author

myron commented Nov 23, 2022

I added a log here #5573

Signed-off-by: myron <[email protected]>
@myron
Collaborator Author

myron commented Nov 23, 2022

for some reason the flake8 test fails, but it seems unrelated to this PR; what could cause it? @Nic-Ma @wyli

flake8
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/__main__.py", line 7, in <module>
    raise SystemExit(main())
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/main/cli.py", line 23, in main
    app.run(argv)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/main/application.py", line 198, in run
    self._run(argv)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/main/application.py", line 186, in _run
    self.initialize(argv)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/main/application.py", line 165, in initialize
    self.plugins, self.options = parse_args(argv)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/options/parse_args.py", line 53, in parse_args
    opts = aggregator.aggregate_options(option_manager, cfg, cfg_dir, rest)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/options/aggregator.py", line 30, in aggregate_options
    parsed_config = config.parse_config(manager, cfg, cfg_dir)
  File "/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/flake8/options/config.py", line 131, in parse_config
    raise ValueError(
ValueError: Error code '#' supplied to 'ignore' option does not match '^[A-Z]{1,3}[0-9]{0,3}$'
Check failed!
Please run auto style fixes: ./runtests.sh --autofix
Error: Process completed with exit code 1.
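For context, this class of flake8 failure typically comes from an inline comment inside a multi-line option value: flake8 5.x stopped stripping such comments, so the '#' is parsed as an error code and rejected. A hypothetical setup.cfg fragment that reproduces the message (not necessarily MONAI's actual config):

```ini
[flake8]
ignore =
    E203,  # inline comment here: flake8 >= 5 reads '#' as an error code
    W503
```

Moving the comment onto its own line outside the option value, or removing it entirely, is the usual fix.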

@wyli
Contributor

wyli commented Nov 24, 2022

That's addressed just now on dev: #5575

@Nic-Ma
Contributor

Nic-Ma commented Nov 24, 2022

Hi @myron , @wyli ,

So you agreed to drop the "runtime_cache" for GPU caching? I think it's OK if it can simplify the user stories.
But please note this integration test will fail, may need some update: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147

Thanks.

@myron
Collaborator Author

myron commented Nov 24, 2022

> Hi @myron , @wyli ,
>
> So you agreed to drop the "runtime_cache" for GPU caching? I think it's OK if it can simplify the user stories. But please note this integration test will fail, may need some update: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147
>
> Thanks.

I don't have a preference; we can either drop the "runtime_cache" for GPU caching (this PR) or add another option to CacheDataset to explicitly set when to use the shared cache.

wyli
wyli previously approved these changes Nov 24, 2022
Contributor

@wyli wyli left a comment


I don't fully follow all of your comments @myron @Nic-Ma, but the current code change just adds a not self._is_dist condition to disable_share_memory_cache, which I think is good. Please let me know if I've misunderstood.

@myron
Collaborator Author

myron commented Nov 26, 2022

> Hi @myron , @wyli ,
>
> So you agreed to drop the "runtime_cache" for GPU caching? I think it's OK if it can simplify the user stories. But please note this integration test will fail, may need some update: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147
>
> Thanks.

@Nic-Ma I've updated that test: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147

but I also see this option is used in the model zoo; not sure if anything needs to be updated there:
https://github.com/Project-MONAI/model-zoo/blob/dev/models/mednist_reg/configs/train.yaml

@wyli
Contributor

wyli commented Nov 26, 2022

>> Hi @myron , @wyli ,
>> So you agreed to drop the "runtime_cache" for GPU caching? I think it's OK if it can simplify the user stories. But please note this integration test will fail, may need some update: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147
>> Thanks.
>
> @Nic-Ma I've updated that test: https://github.com/Project-MONAI/MONAI/blob/dev/tests/test_integration_fast_train.py#L147
>
> but I also see this option is used in the model zoo; not sure if anything needs to be updated there: https://github.com/Project-MONAI/model-zoo/blob/dev/models/mednist_reg/configs/train.yaml

I tested the bundle; it works fine with this PR, so I'll merge it for now. cc @Nic-Ma, we can follow up if there are other issues...

@wyli
Contributor

wyli commented Nov 26, 2022

/build

@wyli wyli enabled auto-merge (squash) November 26, 2022 10:41
@wyli
Contributor

wyli commented Nov 30, 2022

/build

@wyli wyli enabled auto-merge (squash) November 30, 2022 18:28
wyli
wyli previously approved these changes Nov 30, 2022
@wyli wyli disabled auto-merge November 30, 2022 20:19
@wyli
Contributor

wyli commented Nov 30, 2022

this still has some issues with the CI @Nic-Ma; no progress for 1 hour at this line:


[2022-11-30T18:57:41.332Z] .Finished test: test_value_3 (tests.test_rand_shift_intensity.TestRandShiftIntensity) (0.0018s)

[2022-11-30T18:57:41.332Z] Starting test: test_datalist (tests.test_smartcachedataset.TestSmartCacheDataset)...

[2022-11-30T18:57:41.332Z] 
Loading dataset:   0%|          | 0/2 [00:00<?, ?it/s]
Loading dataset: 100%|██████████| 2/2 [00:00<00:00, 13617.87it/s]

[2022-11-30T18:57:41.332Z] .Finished test: test_datalist (tests.test_smartcachedataset.TestSmartCacheDataset) (0.102s)

[2022-11-30T18:57:41.332Z] Starting test: test_set_data (tests.test_smartcachedataset.TestSmartCacheDataset)...

[2022-11-30T18:57:41.587Z] 
Loading dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Loading dataset: 100%|██████████| 5/5 [00:00<00:00, 2596.77it/s]

for cuda 10.2 torch==1.8.2 torchvision==0.9.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu102

@myron
Collaborator Author

myron commented Nov 30, 2022

> this still has some issues with the CI @Nic-Ma; no progress for 1 hour at this line: [CI log quoted above]

This test (tests.test_smartcachedataset) passes fine on my machine with the nvidia/pytorch:22.11 docker image (but it's not exactly the same environment).

Looking at the code, it does not seem like this PR (or anything related to the runtime cache) should affect this test (tests.test_smartcachedataset). I've made a slight update just in case. @wyli can you please re-run this CI test? Thanks.

@myron myron requested a review from wyli December 1, 2022 07:38
@Nic-Ma
Contributor

Nic-Ma commented Dec 1, 2022

Hi @myron ,

Let's see whether the PR can pass all the CI tests.

Thanks.

@Nic-Ma
Contributor

Nic-Ma commented Dec 1, 2022

/build

@myron myron requested a review from Nic-Ma December 1, 2022 08:00
@Nic-Ma
Contributor

Nic-Ma commented Dec 1, 2022

As the CI tests failed, I tried to re-run them.
If they continue failing, I think @myron you may need to dig deeper into the issue.
Let's merge #5595 first, as it can fix the crash issue. @wyli, you removed the sleep(0.5) logic; is that change non-breaking?

Thanks.

@wyli wyli enabled auto-merge (squash) December 1, 2022 15:43
@wyli wyli disabled auto-merge December 1, 2022 15:43
@wyli
Contributor

wyli commented Dec 1, 2022

/build

@wyli
Contributor

wyli commented Dec 1, 2022

/build

@myron
Collaborator Author

myron commented Dec 1, 2022

@Nic-Ma where can I see the errors of the failing CI to debug it? If I click on the "blossom-ci - Fail" check, it does not show me any errors or details: https://github.com/Project-MONAI/MONAI/actions/runs/3594715289
Thanks

@Nic-Ma
Contributor

Nic-Ma commented Dec 2, 2022

/build

@Nic-Ma
Contributor

Nic-Ma commented Dec 2, 2022

/integration-test

@Nic-Ma Nic-Ma merged commit e2fc703 into Project-MONAI:dev Dec 2, 2022
@Nic-Ma
Contributor

Nic-Ma commented Dec 2, 2022

Merged as all the CI tests passed now.

Thanks.



Development

Successfully merging this pull request may close these issues.

CacheDataset crashes with runtime_cache=True, num_workers=0 in DDP

3 participants