PersistentDataset, CacheDataset improvements #3843
Closed
Labels: enhancement (New feature or request)
Description
It would be nice to have several improvements to caching of data on disk and in memory.
For caching on disk with PersistentDataset,
- the first epoch takes a long time, because saving is synchronous. I suggest saving on a background queue and returning the data right away. There is a small probability that the same image gets saved twice, but on average it would be a speed-up
- the documentation implies that cache_dir is optional (but it is not): "..If specified, this is the location for persistent storage .."
- the cached data is much larger on disk than the originals and takes a lot of space. This is because the original data is usually uint8/16 for images and uint8 for labels, and is gzipped inside the .nii.gz container. I wonder if there is a way to do something similar for the cache
For caching in memory with CacheDataset
- on a multi-GPU machine, each process gets its own cache, and the processes have no access to a shared cache. As of Python 3.8 we can create a shared memory object (https://docs.python.org/3/library/multiprocessing.shared_memory.html) so that all processes can use it. It would be nice to have this; otherwise it is not very practical to use CacheDataset on a multi-GPU machine. I know there is a way to manually partition the data between processes, but then we need to worry about accuracy due to less random data sampling.
- In-memory caching runs before any training starts, which takes time. Can we do it on the fly (similar to PersistentDataset)? As we iterate through the first epoch, we would add data into the memory cache.
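The shared-memory idea from the linked `multiprocessing.shared_memory` docs could look roughly like this for a cached NumPy array; the function names here are illustrative, and a real implementation would also need to coordinate which rank creates each block:

```python
import numpy as np
from multiprocessing import shared_memory


def publish_array(name: str, arr: np.ndarray) -> shared_memory.SharedMemory:
    """Copy arr into a named shared-memory block that other processes
    (e.g. other DDP ranks) can attach to by name."""
    shm = shared_memory.SharedMemory(name=name, create=True, size=arr.nbytes)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[:] = arr
    return shm  # keep this reference alive; close()/unlink() when done


def attach_array(name: str, shape, dtype):
    """Attach to an existing block and view it as an array, zero-copy."""
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    return view, shm
```

Readers get a zero-copy view, so only one copy of the cache lives in RAM regardless of the number of GPU processes; shape and dtype still have to be communicated out of band (e.g. via a small metadata file).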
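The on-the-fly variant is essentially lazy caching inside `__getitem__`. A minimal sketch, assuming a generic `transform` callable (this is not MONAI's CacheDataset implementation):

```python
import threading


class LazyCacheDataset:
    """Sketch: instead of pre-computing the whole cache before training,
    fill the in-memory cache as items are first requested during epoch 1;
    later epochs hit the cache directly."""

    def __init__(self, data, transform):
        self.data = data
        self.transform = transform
        self._cache = {}
        self._lock = threading.Lock()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        with self._lock:
            if idx in self._cache:
                return self._cache[idx]
        item = self.transform(self.data[idx])  # slow path, first epoch only
        with self._lock:
            self._cache[idx] = item
        return item
```

The trade-off versus pre-filling is that epoch 1 is as slow as an uncached run, but training starts immediately instead of waiting for the entire cache to build.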
Thank you