PersistentDataset, CacheDataset improvements #3843
Closed
Labels: enhancement (New feature or request)
Description
It would be nice to have several improvements to caching of data on disk and in memory.
For caching on disk with PersistentDataset,
- the first epoch takes a long time, because saving is synchronous. I suggest saving on a background queue and returning the data right away. There is a small probability that the same image gets saved twice, but on average it would be a speed-up
- the documentation implies that cache_dir is optional (but it is not): "..If specified, this is the location for persistent storage .."
- the cached data is much larger on disk than the originals and takes a lot of space. This is because the original data is usually uint8/16 for images and uint8 for labels, and is gzipped inside the .nii.gz container. I wonder if there is a way to do something similar for the cache
For caching in memory with CacheDataset
- on a multi-GPU machine, each process gets its own cache, and the processes have no access to a shared cache. As of Python 3.8 we can create a shared memory object (https://docs.python.org/3/library/multiprocessing.shared_memory.html) so that all processes can use it. It would be nice to have this; otherwise it is not very practical to use CacheDataset on a multi-GPU machine. I know there is a way to manually partition the data between processes, but then we need to worry about accuracy due to less random data sampling.
- In-memory caching runs before any training starts, which takes time. Can we do it on the fly (similar to PersistentDataset)? As we iterate through the first epoch, we would add data into the memory cache.
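The shared-memory idea from the linked `multiprocessing.shared_memory` docs could look roughly like this for a cached NumPy array; the function names here are illustrative, and a real implementation would also need to coordinate which rank creates each block:

```python
import numpy as np
from multiprocessing import shared_memory


def publish_array(name: str, arr: np.ndarray) -> shared_memory.SharedMemory:
    """Copy arr into a named shared-memory block that other processes
    (e.g. other DDP ranks) can attach to by name."""
    shm = shared_memory.SharedMemory(name=name, create=True, size=arr.nbytes)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[:] = arr
    return shm  # keep this reference alive; close()/unlink() when done


def attach_array(name: str, shape, dtype):
    """Attach to an existing block and view it as an array, zero-copy."""
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    return view, shm
```

Readers get a zero-copy view, so only one copy of the cache lives in RAM regardless of the number of GPU processes; shape and dtype still have to be communicated out of band (e.g. via a small metadata file).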
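The on-the-fly variant is essentially lazy caching inside `__getitem__`. A minimal sketch, assuming a generic `transform` callable (this is not MONAI's CacheDataset implementation):

```python
import threading


class LazyCacheDataset:
    """Sketch: instead of pre-computing the whole cache before training,
    fill the in-memory cache as items are first requested during epoch 1;
    later epochs hit the cache directly."""

    def __init__(self, data, transform):
        self.data = data
        self.transform = transform
        self._cache = {}
        self._lock = threading.Lock()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        with self._lock:
            if idx in self._cache:
                return self._cache[idx]
        item = self.transform(self.data[idx])  # slow path, first epoch only
        with self._lock:
            self._cache[idx] = item
        return item
```

The trade-off versus pre-filling is that epoch 1 is as slow as an uncached run, but training starts immediately instead of waiting for the entire cache to build.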
Thank you