
PersistentDataset, CacheDataset improvements #3843

@myron

Description

It would be nice to have several improvements to the caching of data on disk and in memory.

For caching on disk with PersistentDataset,

  • the first epoch takes a long time because saving is synchronous. I suggest saving on a background queue and returning the data right away. This carries a small probability that the same image is saved twice, but on average it would be a speed-up
  • the documentation implies that cache_dir is optional (but it is not): "..If specified, this is the location for persistent storage .."
  • the cached data is much larger on disk than the originals (and takes a lot of space). This is because the original .nii.gz files are usually uint8/16 for images and uint8 for labels, and are gzip-compressed inside the .nii.gz container. I wonder if there is a way to do something similar for the cache

For caching in memory with CacheDataset

  • on a multi-GPU machine, each process gets its own cache, and the processes have no access to any shared cache. As of Python 3.8 we can create a shared memory object (https://docs.python.org/3/library/multiprocessing.shared_memory.html) so that all processes can use it. It would be nice to have this; otherwise it is not very practical to use CacheDataset on a multi-GPU machine. I know there is a way to manually partition the data between processes, but then we need to worry about accuracy due to less random data sampling.
  • in-memory caching runs before any training starts, which takes time. Can we do it on the fly (similar to PersistentDataset): as we iterate over the first epoch, we add data to the memory cache?

Thank you

Labels: enhancement (New feature or request)