
SharedCacheDataset #5308

Closed
myron wants to merge 11 commits into Project-MONAI:dev from myron:cachedataset

Conversation

@myron
Collaborator

@myron myron commented Oct 11, 2022

This introduces a new SharedCacheDataset(Dataset) class to address #3843

It has several improvements / changes compared to the CacheDataset https://github.com/Project-MONAI/MONAI/blob/dev/monai/data/dataset.py#L690

  • It caches into shared memory in RAM, so all processes in a multi-GPU run with DistributedDataParallel can read/write the same cache (it is globally accessible)
  • It works on the fly (there is no need to wait for pre-caching): data is cached on the first access and then served from the cache
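The two points above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the lazy shared-cache pattern using only the standard library; the class and parameter names are illustrative, not the actual SharedCacheDataset API.

```python
# Hypothetical sketch of the lazy shared-memory caching idea described above.
# A Manager-backed list lives in shared memory, so every process (DDP ranks,
# DataLoader workers) sees the same cache; items are cached on first access.
from multiprocessing import Manager

class LazySharedCache:
    """Minimal stand-in for a Dataset with on-the-fly shared caching."""

    def __init__(self, data, transform=None, shared_cache=None):
        self.data = data
        self.transform = transform
        # one slot per item; None means "not cached yet"
        self.cache = (shared_cache if shared_cache is not None
                      else Manager().list([None] * len(data)))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.cache[idx]
        if item is None:  # first access: apply transforms, then store
            item = self.data[idx]
            if self.transform is not None:
                item = self.transform(item)
            self.cache[idx] = item  # now visible to all processes
        return item

ds = LazySharedCache([1, 2, 3], transform=lambda x: x * 10)
print(ds[1])        # -> 20, computed on first access
print(ds.cache[1])  # -> 20, now served from the shared cache
```

In a real multi-GPU run the manager list would be created once and handed to every rank, so whichever process touches an index first fills it for everyone.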

I've tested it and used it for many tasks already in DDP / multi-GPU runs.
I'll add unit tests too if it looks good.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: myron <[email protected]>
@myron myron added the enhancement New feature or request label Oct 11, 2022
@myron myron added this to the Auto3D Seg framework [P0 v1.0] milestone Oct 11, 2022
@myron
Collaborator Author

myron commented Oct 13, 2022

I've added unit tests. Please review, thanks.

@Nic-Ma
Contributor

Nic-Ma commented Oct 13, 2022

Hi @myron ,

Thanks very much for the implementation of this feature.

  1. I see a lot of duplicated code compared to the existing CacheDataset. Is it possible to enhance the original CacheDataset directly, or inherit from it?
  2. Have you tested it with different multi-processing methods? like spawn and fork?
    https://pytorch.org/docs/stable/multiprocessing.html
  3. Please also test the case that we cache data on GPU (with ToDevice transform), refer to:
    https://github.com/Project-MONAI/tutorials/blob/main/acceleration/fast_model_training_guide.md#4-cache-io-and-transforms-data-to-gpu
  4. I remember shared memory is much slower; that's why we cache duplicated data for every rank in multi-GPU training for relatively small datasets. Could you please help provide some speed comparison when doing multi-GPU training with CacheDataset and this SharedCacheDataset?

Thanks in advance.

@myron
Collaborator Author

myron commented Oct 13, 2022

Hi @Nic-Ma , thank you for the feedback

1. I see much duplicated code if compared to the existing `CacheDataset`, is it possible to enhance the original `CacheDataset` directly? Or inherit from it?

There is only one similar method. Subclassing would be too messy and confusing, since the use case is different, and modifying CacheDataset would make it very long (and prone to bugs).

2. Have you tested it with different multi-processing methods? like `spawn` and `fork`?
   https://pytorch.org/docs/stable/multiprocessing.html

Yes, I have been using it in both spawn and fork modes.

3. Please also test the case that we cache data on GPU (with `ToDevice` transform), refer to:
   https://github.com/Project-MONAI/tutorials/blob/main/acceleration/fast_model_training_guide.md#4-cache-io-and-transforms-data-to-gpu

This class is designed for caching on CPU only; it is not designed for GPU, so I don't see a reason to test for that.
We can add an assert statement to fail if the data is on GPU, if you'd like.

4. I remember `shared memory` is much slower, that's why we cache duplicated data for every rank in multi-gpu training for relatively small dataset. Could you please help provide some speed comparison when doing multi-gpu training with `CacheDataset` and this `SharedCacheDataset`?

It's true there is an overhead with shared memory, but it's definitely faster than not caching at all.
I don't have the bandwidth to write the code for speed comparisons at the moment; perhaps someone else would like to do this exercise. Otherwise, in the interest of time, let's merge it for now, and later we can do further improvements / tuning. Thank you.

myron added 2 commits October 13, 2022 12:43
Signed-off-by: myron <[email protected]>
Signed-off-by: myron <[email protected]>
@myron
Collaborator Author

myron commented Oct 17, 2022

  1. I remember shared memory is much slower, that's why we cache duplicated data for every rank in multi-gpu training for relatively small dataset. Could you please help provide some speed comparison when doing multi-gpu training with CacheDataset and this SharedCacheDataset?

It's true there is an overhead with shared memory, but it's definitely faster than not caching at all. I don't have the bandwidth to write the code for speed comparisons at the moment; perhaps someone else would like to do this exercise. Otherwise, in the interest of time, let's merge it for now, and later we can do further improvements / tuning. Thank you.

I ran a timing comparison of SharedCacheDataset vs CacheDataset, using a 3D segmentation task (416 training images, 104 validation) with an evenly divisible data size for 8-GPU training.

Setup on 8 GPUs (8x NVIDIA A40 48 GB, Ampere, latest PyTorch, MONAI 1.0.0 dev branch):

  • For CacheDataset, the dataset is pre-split into non-overlapping subsets, and each process caches its subset before proceeding to training (using random sampling without DistributedSampler).
  • For SharedCacheDataset, there is no extra setup: it is used as a standard Dataset with DistributedSampler, and it caches images on the fly during the first epoch.
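For concreteness, the per-rank pre-split used in the CacheDataset set-up can be sketched like this (illustrative names only; the real runs used MONAI datasets and DDP):

```python
# Illustrative sketch of the CacheDataset set-up above: each rank caches a
# fixed, non-overlapping shard of the file list (no DistributedSampler).
def presplit(files, rank, world_size):
    # rank r gets every world_size-th file starting at index r
    return files[rank::world_size]

files = [f"case_{i:03d}.nii.gz" for i in range(416)]
shards = [presplit(files, r, 8) for r in range(8)]

assert all(len(s) == 52 for s in shards)            # 416 / 8 = 52 per rank
assert len(set().union(*map(set, shards))) == 416   # no overlap, nothing lost

# The SharedCacheDataset set-up instead keeps the full list on every rank and
# lets a DistributedSampler assign a (re-shufflable) share each epoch, e.g.:
#   sampler = torch.utils.data.distributed.DistributedSampler(dataset)
```

The pre-split shards are fixed for the whole run, which is what limits cross-rank shuffling; the sampler-based set-up can reshuffle assignments every epoch.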

Timing:

First epoch:

  • CacheDataset: 20 sec to pre-cache everything, then train loop 45 sec, validation loop 21 sec
  • SharedCacheDataset: train loop 49 sec, validation loop 22 sec (a slightly longer train loop, but it includes caching, so no time is wasted waiting for caching beforehand)

Second epoch (everything non-random is already cached):

  • CacheDataset: train loop 39 sec, validation loop 21 sec
  • SharedCacheDataset: train loop 39 sec, validation loop 21 sec

There is no difference in time on subsequent epochs. If there is any overhead from using shared memory, it is probably hidden within the DataLoader's parallel loads (which prefetch in parallel with the GPU runs).

So overall it seems SharedCacheDataset has all the advantages: it caches on the fly, with no need for special pre-splitting of the data (which limits randomness). PS: this is an experiment on a particular 3D CT tumor segmentation task; on other tasks the timing could be slightly different.

@wyli
Contributor

wyli commented Oct 18, 2022

agree with @Nic-Ma, if it's always better the changes should be made to CacheDataset instead of creating another class.

@Nic-Ma
Contributor

Nic-Ma commented Oct 18, 2022

Hi @myron ,

Cool, glad to see your test results. Could you please help change this to update CacheDataset directly?
And maybe you can also help double-check whether different num_workers values for the DataLoader affect the test results.

Thanks in advance.

@myron
Collaborator Author

myron commented Oct 18, 2022

Cool, glad to see your test results, could you please help change it to update the CacheDataset directly? And maybe you can help also double confirm whether different num_workers of DataLoader will affect the test results.

Thanks in advance.

@Nic-Ma, thanks. CacheDataset is quite a bit different; if I update it, it will break many old use cases. Also, perhaps some users prefer to use CacheDataset as is (as it works for GPU caching too). For num_workers, I used 3, but we could investigate its performance later.

Can we merge this, please, to unblock the overall progress? This is a new class, so it won't break anything. In the future we can consider more performance tests, or think about merging it with other classes.

@wyli
Contributor

wyli commented Oct 18, 2022

this is a nice feature, none of us wants to block it... @Nic-Ma and I just want to ensure the API is clean and maintainable. simply copying and modifying existing classes adds to the technical debt. hope we are not creating trouble for you. perhaps we can invite other people to collaborate on this if you don't have time.

@myron
Collaborator Author

myron commented Oct 18, 2022

this is a nice feature, all of us don't want to block it... @Nic-Ma and I just wanted to ensure the API is clean and maintainable. simply copying and modifying existing classes will add to the technical debts. hope we are not creating trouble for you. perhaps we can invite other people to collaborate on this if you don't have time.

@wyli I understand what you're saying, and I agree that we need a clean, maintainable API. So if you'd like to invite other people to collaborate here and they do it, that would be great. My only request is that we merge it soon, as we need it for Auto3Dseg to unblock that work as soon as possible. PS: this PR has been waiting for 7 days already.

The related issue #3843 has been open for 8 months (since February), and it doesn't look like we have much interest from other people in collaborating or contributing to it. I want to avoid a situation where it takes forever to bring any novel developments into our system. Since this class is very useful and doesn't conflict with anything existing, my preference is to merge it now; after that, for sure, people can have a thorough discussion about how to improve the cleanliness of the API or unify some classes if they want.

@Nic-Ma
Contributor

Nic-Ma commented Oct 19, 2022

Hi @myron ,

OK, let me try to investigate this feature more. Can I commit to this PR directly, or should I create another PR based on it?

Thanks.

@Nic-Ma
Contributor

Nic-Ma commented Oct 20, 2022

Hi @myron ,

I took a deep look at the solution in this PR, and I have some questions:

  1. It seems to require the master process to provide the shared-memory list; how can we use it with torchrun or torch.launch?
  2. If you use the torch.multiprocessing method, can you use fork and initialize CacheDataset before launching multiprocessing? Then I think all the subprocesses will share the master's list because it's copy-on-write. Same as the multiprocessing logic in CacheDataset + DataLoader when num_workers > 0:
    https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L1022
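The copy-on-write idea in point 2 can be illustrated with a small runnable sketch (standard library only; note that fork is not available on Windows): the cache is built in the parent, and a forked child can read it without any re-loading or pickling.

```python
# Runnable sketch of the copy-on-write idea: build the cache in the parent,
# fork a child, and the child sees the filled cache without recomputing it.
# (The fork start method is POSIX-only.)
import multiprocessing as mp

cache = {i: i * i for i in range(4)}  # "pre-cached" data, built in the parent

def read_cache(q):
    # after fork() the child inherits the parent's pages copy-on-write,
    # so the cache is readable here with no pickling or re-loading
    q.put(cache[3])

ctx = mp.get_context("fork")
q = ctx.Queue()
p = ctx.Process(target=read_cache, args=(q,))
p.start()
child_saw = q.get()
p.join()
assert child_saw == 9  # the child read the parent's cache
# caveat: writes made in the child after fork are NOT visible to the parent
```

The caveat in the last comment is the key limitation: copy-on-write gives read-only sharing of the pre-built cache, which is enough for a fully pre-cached CacheDataset but not for filling a cache lazily from worker processes.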

CC @ericspod @wyli for more discussion.

Thanks in advance.

@ericspod
Member

2. can you use fork and initialize CacheDataset

Do be aware that fork is not present on Windows, so special care has to be taken so that any multiprocessing semantics implemented here work across platforms. Shared memory and other facilities are present on Windows, but handles to such things have to be passed to subprocesses; Process does that for you, as do the other multiprocessing facilities in the standard library and PyTorch. If what you want to do is cache in multiple processes and then use those processes again in the DataLoader, I don't think that's going to be compatible.

@Nic-Ma
Contributor

Nic-Ma commented Oct 20, 2022

Hi @ericspod ,

I synced with @myron , I will take over the development of this feature in #5365 and let's keep this PR open for comparison.

Thanks.

@myron
Collaborator Author

myron commented Oct 21, 2022

Hi @myron ,

I took a deep look at the solution of this PR, and got some questions:

1. Seems it requires the master process to provide shared-memory list, how can we use it with `torchrun` or `torch.launch`?

I don't know; I don't use torch.launch personally. I always start DistributedDataParallel from within the code itself, so that I can allocate the shared memory and provide it to the individual spawned processes.
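That launch pattern might look like the following sketch (hypothetical names; a real run would use torch.multiprocessing.spawn and call dist.init_process_group inside the worker):

```python
# Sketch of the pattern above: allocate the shared cache in the main process,
# then pass it to every launched DDP worker so all ranks use one cache.
import multiprocessing as mp

def ddp_worker(rank, world_size, shared_cache):
    # a real worker would call dist.init_process_group(...) here and build
    # the dataset with the shared cache; this stub just proves sharing works
    shared_cache[rank] = f"rank {rank}"

ctx = mp.get_context("fork")  # fork for this POSIX demo; spawn also works
manager = ctx.Manager()
world_size = 2
shared_cache = manager.list([None] * world_size)  # visible to all processes

procs = [ctx.Process(target=ddp_worker, args=(r, world_size, shared_cache))
         for r in range(world_size)]
for p in procs:
    p.start()
for p in procs:
    p.join()

assert list(shared_cache) == ["rank 0", "rank 1"]  # every rank wrote to it
```

This is exactly why launching from within the code helps: the manager-backed cache exists before the ranks do, so it can simply be handed to each of them.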

2. If you use the `torch.multiprocessing` method, can you use `fork` and initialize `CacheDataset` before launching multiprocessing? Then I think all the subprocesses will share the master's list because it's `copy-on-write`. Same as the multiprocessing logic in `CacheDataset + DataLoader` when `num_worker > 0`:
   https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L1022

I'm not sure either. I do use 'fork' usually with DDP, but I don't initialize anything major before launching DDP.

CC @ericspod @wyli for more discussion.

Thanks in advance.

@myron
Collaborator Author

myron commented Nov 3, 2022

Seems it requires the master process to provide shared-memory list, how can we use it with torchrun or torch.launch?

I found a way to initialize the cache without the master process, by initializing it on the first rank and broadcasting it.
See here (I pushed the update to demonstrate): https://github.com/Project-MONAI/MONAI/pull/5308/files#diff-e682e654b07b9d753cac73f5a782e70644c82a84d30ebc757ec294a261845108R961-R966

Caveat 1: this will only work if torch.cuda.set_device(device) was called beforehand (it must be called before creating the SharedDataset).

Caveat 2: in multi-GPU runs, all processes must create the SharedDataset to be able to broadcast the cache; otherwise there will be a deadlock.

But with this approach, we don't need to initialize the cache in the master process, and it doesn't need to be an input parameter at all.
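A hedged sketch of this rank-0-initialise-then-broadcast idea (the helper name is hypothetical; the actual change is in the linked diff). The single-rank demo at the end only shows the mechanics; in a real job every rank calls the helper with its own rank.

```python
# Rank 0 builds the cache, then every rank receives a copy via a collective.
# All ranks MUST call this or the collective deadlocks (caveat 2), and
# torch.cuda.set_device must already have been called (caveat 1).
import os
import torch.distributed as dist

def init_cache_with_broadcast(build_cache_fn, rank):
    obj = [build_cache_fn() if rank == 0 else None]  # only rank 0 builds
    dist.broadcast_object_list(obj, src=0)  # blocks until all ranks join
    return obj[0]

# single-rank gloo demo so the sketch runs without a multi-GPU job
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group("gloo", rank=0, world_size=1)
cache = init_cache_with_broadcast(lambda: {0: "img0"}, rank=0)
dist.destroy_process_group()
assert cache == {0: "img0"}
```

Because broadcast_object_list is a collective, any rank that skips creating the dataset never enters the call, which is the deadlock described in caveat 2.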

@Nic-Ma what do you think?

@Nic-Ma
Contributor

Nic-Ma commented Nov 3, 2022

Hi @myron ,

Thanks for the sharing, sounds interesting, let me test it and update PR #5365 to include it.

Thanks.

@wyli wyli closed this in #5365 Nov 9, 2022
wyli pushed a commit that referenced this pull request Nov 9, 2022
Fixes #5308 .

### Description

This PR added runtime cache support in the `CacheDataset`.

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

Signed-off-by: Wenqi Li <[email protected]>
bhashemian pushed a commit to bhashemian/MONAI that referenced this pull request Nov 23, 2022
…#5365)

Fixes Project-MONAI#5308 .


Signed-off-by: Wenqi Li <[email protected]>
Signed-off-by: Behrooz <[email protected]>
@myron myron mentioned this pull request Nov 23, 2022