
SharedCacheDataset #5308

Closed
myron wants to merge 11 commits into Project-MONAI:dev from myron:cachedataset

Conversation

@myron
Collaborator

@myron myron commented Oct 11, 2022

This introduces a new SharedCacheDataset(Dataset) class to address #3843

It has several improvements / changes compared to the CacheDataset https://github.com/Project-MONAI/MONAI/blob/dev/monai/data/dataset.py#L690

  • It caches into shared memory in RAM, so all processes in a multi-GPU run with DistributedDataParallel can read/write the same cache (it is globally accessible)
  • It works on the fly (there is no need to wait for pre-caching): data is cached on the first access and then served from the cache
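The two points above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the lazy shared-cache pattern using only the standard library; the class and parameter names are illustrative, not the actual SharedCacheDataset API.

```python
# Hypothetical sketch of the lazy shared-memory caching idea described above.
# A Manager-backed list lives in shared memory, so every process (DDP ranks,
# DataLoader workers) sees the same cache; items are cached on first access.
from multiprocessing import Manager

class LazySharedCache:
    """Minimal stand-in for a Dataset with on-the-fly shared caching."""

    def __init__(self, data, transform=None, shared_cache=None):
        self.data = data
        self.transform = transform
        # one slot per item; None means "not cached yet"
        self.cache = (shared_cache if shared_cache is not None
                      else Manager().list([None] * len(data)))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.cache[idx]
        if item is None:  # first access: apply transforms, then store
            item = self.data[idx]
            if self.transform is not None:
                item = self.transform(item)
            self.cache[idx] = item  # now visible to all processes
        return item

ds = LazySharedCache([1, 2, 3], transform=lambda x: x * 10)
print(ds[1])        # -> 20, computed on first access
print(ds.cache[1])  # -> 20, now served from the shared cache
```

In a real multi-GPU run the manager list would be created once and handed to every rank, so whichever process touches an index first fills it for everyone.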

I've tested it and used it for many tasks already in DDP / multi-GPU runs.
I'll add unit tests too if it looks good.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: myron <[email protected]>
@myron myron added the enhancement New feature or request label Oct 11, 2022
@myron myron added this to the Auto3D Seg framework [P0 v1.0] milestone Oct 11, 2022
@myron
Collaborator Author

myron commented Oct 13, 2022

I've added unit tests. Please review, thanks.

@Nic-Ma
Contributor

Nic-Ma commented Oct 13, 2022

Hi @myron ,

Thanks very much for the implementation of this feature.

  1. I see a lot of duplicated code compared to the existing CacheDataset. Is it possible to enhance the original CacheDataset directly, or inherit from it?
  2. Have you tested it with different multi-processing methods? like spawn and fork?
    https://pytorch.org/docs/stable/multiprocessing.html
  3. Please also test the case that we cache data on GPU (with ToDevice transform), refer to:
    https://github.com/Project-MONAI/tutorials/blob/main/acceleration/fast_model_training_guide.md#4-cache-io-and-transforms-data-to-gpu
  4. I remember shared memory is much slower; that's why we cache duplicated data for every rank in multi-GPU training for relatively small datasets. Could you please help provide some speed comparison when doing multi-GPU training with CacheDataset and this SharedCacheDataset?

Thanks in advance.

@myron
Collaborator Author

myron commented Oct 13, 2022

Hi @Nic-Ma , thank you for the feedback

1. I see much duplicated code if compared to the existing `CacheDataset`, is it possible to enhance the original `CacheDataset` directly? Or inherit from it?

There is only one similar method. Subclassing would be too messy and confusing, since the use case is different, and modifying CacheDataset would make it very long (and prone to bugs).

2. Have you tested it with different multi-processing methods? like `spawn` and `fork`?
   https://pytorch.org/docs/stable/multiprocessing.html

Yes, I have been using it in both spawn and fork modes.

3. Please also test the case that we cache data on GPU (with `ToDevice` transform), refer to:
   https://github.com/Project-MONAI/tutorials/blob/main/acceleration/fast_model_training_guide.md#4-cache-io-and-transforms-data-to-gpu

This class is designed for caching on CPU only; it is not designed for GPU, so I don't see a reason to test for that.
We can add an assert statement to fail if the data is on GPU, if you'd like.

4. I remember `shared memory` is much slower, that's why we cache duplicated data for every rank in multi-gpu training for relatively small dataset. Could you please help provide some speed comparison when doing multi-gpu training with `CacheDataset` and this `SharedCacheDataset`?

It's true there is an overhead with shared memory, but it's definitely faster than not caching at all.
I don't have the bandwidth to write the code for speed comparisons at the moment; perhaps someone else would like to do this exercise. Otherwise, in the interest of time, let's merge it for now, and later we can do further improvements / tuning. Thank you.

myron added 2 commits October 13, 2022 12:43
Signed-off-by: myron <[email protected]>
Signed-off-by: myron <[email protected]>
@myron
Collaborator Author

myron commented Oct 17, 2022

  1. I remember shared memory is much slower, that's why we cache duplicated data for every rank in multi-gpu training for relatively small dataset. Could you please help provide some speed comparison when doing multi-gpu training with CacheDataset and this SharedCacheDataset?

It's true there is an overhead with shared memory, but it's definitely faster than not caching at all. I don't have the bandwidth to write the code for speed comparisons at the moment; perhaps someone else would like to do this exercise. Otherwise, in the interest of time, let's merge it for now, and later we can do further improvements / tuning. Thank you.

I ran a timing comparison of SharedCacheDataset vs CacheDataset, using a 3D segmentation task (416 training images, 104 validation) with an evenly divisible data size for 8-GPU training.

Setup on 8 GPUs (8x NVIDIA A40 48 GB, Ampere, latest PyTorch, MONAI 1.0.0 dev branch):

  • For CacheDataset, the dataset is pre-split into non-overlapping subsets, and each process caches its subset before proceeding to training (using random sampling without DistributedSampler).
  • For SharedCacheDataset, there is no extra setup: it is used as a standard Dataset with DistributedSampler, and it caches images on the fly during the first epoch.
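For concreteness, the per-rank pre-split used in the CacheDataset set-up can be sketched like this (illustrative names only; the real runs used MONAI datasets and DDP):

```python
# Illustrative sketch of the CacheDataset set-up above: each rank caches a
# fixed, non-overlapping shard of the file list (no DistributedSampler).
def presplit(files, rank, world_size):
    # rank r gets every world_size-th file starting at index r
    return files[rank::world_size]

files = [f"case_{i:03d}.nii.gz" for i in range(416)]
shards = [presplit(files, r, 8) for r in range(8)]

assert all(len(s) == 52 for s in shards)            # 416 / 8 = 52 per rank
assert len(set().union(*map(set, shards))) == 416   # no overlap, nothing lost

# The SharedCacheDataset set-up instead keeps the full list on every rank and
# lets a DistributedSampler assign a (re-shufflable) share each epoch, e.g.:
#   sampler = torch.utils.data.distributed.DistributedSampler(dataset)
```

The pre-split shards are fixed for the whole run, which is what limits cross-rank shuffling; the sampler-based set-up can reshuffle assignments every epoch.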

Timing:

First epoch:

  • CacheDataset: 20 sec to pre-cache everything, then train loop 45 sec, validation loop 21 sec
  • SharedCacheDataset: train loop 49 sec, validation loop 22 sec (a slightly longer train loop, but it includes caching, so no time is wasted waiting for caching beforehand)

Second epoch (everything non-random is already cached):

  • CacheDataset: train loop 39 sec, validation loop 21 sec
  • SharedCacheDataset: train loop 39 sec, validation loop 21 sec

There is no difference in time on subsequent epochs. If there is any overhead from using shared memory, it is probably hidden within the DataLoader's parallel loads (which prefetch in parallel with the GPU runs).

So overall it seems SharedCacheDataset has all the advantages: it caches on the fly, with no need for special pre-splitting of the data (which limits randomness). PS: this is an experiment on a particular 3D CT tumor segmentation task; on other tasks the timing could be slightly different.

@wyli
Contributor

wyli commented Oct 18, 2022

agree with @Nic-Ma, if it's always better the changes should be made to CacheDataset instead of creating another class.

@Nic-Ma
Contributor

Nic-Ma commented Oct 18, 2022

Hi @myron ,

Cool, glad to see your test results. Could you please help change this to update CacheDataset directly?
And maybe you can also help double-check whether different num_workers values for the DataLoader affect the test results.

Thanks in advance.

@myron
Collaborator Author

myron commented Oct 18, 2022

Cool, glad to see your test results, could you please help change it to update the CacheDataset directly? And maybe you can help also double confirm whether different num_workers of DataLoader will affect the test results.

Thanks in advance.

@Nic-Ma, thanks. CacheDataset is quite a bit different; if I update it, it will break many old use cases. Also, perhaps some users prefer to use CacheDataset as is (as it works for GPU caching too). For num_workers, I used 3, but we could investigate its performance later.

Can we merge this, please, to unblock the overall progress? This is a new class, so it won't break anything. In the future we can consider more performance tests, or think about merging it with other classes.

@wyli
Contributor

wyli commented Oct 18, 2022

this is a nice feature, none of us wants to block it... @Nic-Ma and I just want to ensure the API is clean and maintainable. simply copying and modifying existing classes adds to the technical debt. hope we are not creating trouble for you. perhaps we can invite other people to collaborate on this if you don't have time.

@myron
Collaborator Author

myron commented Oct 18, 2022

this is a nice feature, all of us don't want to block it... @Nic-Ma and I just wanted to ensure the API is clean and maintainable. simply copying and modifying existing classes will add to the technical debts. hope we are not creating trouble for you. perhaps we can invite other people to collaborate on this if you don't have time.

@wyli I understand what you're saying, and I agree that we need a clean, maintainable API. So if you'd like to invite other people to collaborate here and they do it, that would be great. My only request is that we merge it soon, as we need it for Auto3Dseg to unblock that work as soon as possible. PS: this PR has been waiting for 7 days already.

The related issue #3843 has been open for 8 months (since February), and it doesn't look like we have much interest from other people in collaborating or contributing to it. I want to avoid a situation where it takes forever to bring any novel developments into our system. Since this class is very useful and doesn't conflict with anything existing, my preference is to merge it now; after that, for sure, people can have a thorough discussion about how to improve the cleanliness of the API or unify some classes if they want.

@Nic-Ma
Contributor

Nic-Ma commented Oct 19, 2022

Hi @myron ,

OK, let me try to investigate this feature more. Can I commit to this PR directly, or should I create another PR based on it?

Thanks.

@Nic-Ma
Contributor

Nic-Ma commented Oct 20, 2022

Hi @myron ,

I took a deep look at the solution in this PR, and I have some questions:

  1. It seems to require the master process to provide the shared-memory list; how can we use it with torchrun or torch.launch?
  2. If you use the torch.multiprocessing method, can you use fork and initialize CacheDataset before launching multiprocessing? Then I think all the subprocesses will share the master's list because it's copy-on-write. Same as the multiprocessing logic in CacheDataset + DataLoader when num_workers > 0:
    https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L1022
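The copy-on-write idea in point 2 can be illustrated with a small runnable sketch (standard library only; note that fork is not available on Windows): the cache is built in the parent, and a forked child can read it without any re-loading or pickling.

```python
# Runnable sketch of the copy-on-write idea: build the cache in the parent,
# fork a child, and the child sees the filled cache without recomputing it.
# (The fork start method is POSIX-only.)
import multiprocessing as mp

cache = {i: i * i for i in range(4)}  # "pre-cached" data, built in the parent

def read_cache(q):
    # after fork() the child inherits the parent's pages copy-on-write,
    # so the cache is readable here with no pickling or re-loading
    q.put(cache[3])

ctx = mp.get_context("fork")
q = ctx.Queue()
p = ctx.Process(target=read_cache, args=(q,))
p.start()
child_saw = q.get()
p.join()
assert child_saw == 9  # the child read the parent's cache
# caveat: writes made in the child after fork are NOT visible to the parent
```

The caveat in the last comment is the key limitation: copy-on-write gives read-only sharing of the pre-built cache, which is enough for a fully pre-cached CacheDataset but not for filling a cache lazily from worker processes.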

CC @ericspod @wyli for more discussion.

Thanks in advance.

@ericspod
Member

2. can you use fork and initialize CacheDataset

Do be aware that fork is not present on Windows, so special care has to be taken so that any multiprocessing semantics implemented here work across platforms. Shared memory and other facilities are present on Windows, but handles to such things have to be passed to subprocesses; Process does that for you, as do the other multiprocessing facilities in the standard library and PyTorch. If what you want to do is cache in multiple processes and then use those processes again in the DataLoader, I don't think that's going to be compatible.

@Nic-Ma
Contributor

Nic-Ma commented Oct 20, 2022

Hi @ericspod ,

I synced with @myron , I will take over the development of this feature in #5365 and let's keep this PR open for comparison.

Thanks.

@myron
Collaborator Author

myron commented Oct 21, 2022

Hi @myron ,

I took a deep look at the solution of this PR, and got some questions:

1. Seems it requires the master process to provide shared-memory list, how can we use it with `torchrun` or `torch.launch`?

I don't know; I don't use torch.launch personally. I always start DistributedDataParallel from within the code itself, so that I can allocate the shared memory and provide it to the individual spawned processes.
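That launch pattern might look like the following sketch (hypothetical names; a real run would use torch.multiprocessing.spawn and call dist.init_process_group inside the worker):

```python
# Sketch of the pattern above: allocate the shared cache in the main process,
# then pass it to every launched DDP worker so all ranks use one cache.
import multiprocessing as mp

def ddp_worker(rank, world_size, shared_cache):
    # a real worker would call dist.init_process_group(...) here and build
    # the dataset with the shared cache; this stub just proves sharing works
    shared_cache[rank] = f"rank {rank}"

ctx = mp.get_context("fork")  # fork for this POSIX demo; spawn also works
manager = ctx.Manager()
world_size = 2
shared_cache = manager.list([None] * world_size)  # visible to all processes

procs = [ctx.Process(target=ddp_worker, args=(r, world_size, shared_cache))
         for r in range(world_size)]
for p in procs:
    p.start()
for p in procs:
    p.join()

assert list(shared_cache) == ["rank 0", "rank 1"]  # every rank wrote to it
```

This is exactly why launching from within the code helps: the manager-backed cache exists before the ranks do, so it can simply be handed to each of them.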

2. If you use the `torch.multiprocessing` method, can you use `fork` and initialize `CacheDataset` before launching multiprocessing? Then I think all the subprocesses will share the master's list because it's `copy-on-write`. Same as the multiprocessing logic in `CacheDataset + DataLoader` when `num_worker > 0`:
   https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L1022

I'm not sure either. I do use 'fork' usually with DDP, but I don't initialize anything major before launching DDP.

CC @ericspod @wyli for more discussion.

Thanks in advance.

@myron
Collaborator Author

myron commented Nov 3, 2022

Seems it requires the master process to provide shared-memory list, how can we use it with torchrun or torch.launch?

I found a way to initialize the cache without the master process, by initializing it on the first rank and broadcasting it.
See here (I pushed the update to demonstrate): https://github.com/Project-MONAI/MONAI/pull/5308/files#diff-e682e654b07b9d753cac73f5a782e70644c82a84d30ebc757ec294a261845108R961-R966

Caveat 1: this will only work if torch.cuda.set_device(device) was called beforehand (it must be called before creating the SharedDataset).

Caveat 2: in multi-GPU runs, all processes must create the SharedDataset to be able to broadcast the cache; otherwise there will be a deadlock.

But with this approach, we don't need to initialize the cache in the master process, and it doesn't need to be an input parameter at all.
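A hedged sketch of this rank-0-initialise-then-broadcast idea (the helper name is hypothetical; the actual change is in the linked diff). The single-rank demo at the end only shows the mechanics; in a real job every rank calls the helper with its own rank.

```python
# Rank 0 builds the cache, then every rank receives a copy via a collective.
# All ranks MUST call this or the collective deadlocks (caveat 2), and
# torch.cuda.set_device must already have been called (caveat 1).
import os
import torch.distributed as dist

def init_cache_with_broadcast(build_cache_fn, rank):
    obj = [build_cache_fn() if rank == 0 else None]  # only rank 0 builds
    dist.broadcast_object_list(obj, src=0)  # blocks until all ranks join
    return obj[0]

# single-rank gloo demo so the sketch runs without a multi-GPU job
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group("gloo", rank=0, world_size=1)
cache = init_cache_with_broadcast(lambda: {0: "img0"}, rank=0)
dist.destroy_process_group()
assert cache == {0: "img0"}
```

Because broadcast_object_list is a collective, any rank that skips creating the dataset never enters the call, which is the deadlock described in caveat 2.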

@Nic-Ma what do you think?

@Nic-Ma
Contributor

Nic-Ma commented Nov 3, 2022

Hi @myron ,

Thanks for the sharing, sounds interesting, let me test it and update PR #5365 to include it.

Thanks.

@wyli wyli closed this in #5365 Nov 9, 2022
wyli pushed a commit that referenced this pull request Nov 9, 2022
Fixes #5308 .

### Description

This PR added runtime cache support in the `CacheDataset`.

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

Signed-off-by: Wenqi Li <[email protected]>
bhashemian pushed a commit to bhashemian/MONAI that referenced this pull request Nov 23, 2022
…#5365)

Fixes Project-MONAI#5308 .


Signed-off-by: Wenqi Li <[email protected]>
Signed-off-by: Behrooz <[email protected]>
@myron myron mentioned this pull request Nov 23, 2022