Python dataloader Improvements #13023

@ssnl

@goldsborough and I are planning a series of improvements to dataloader in both C++ and Python API. This issue mainly focuses on the planned changes for the Python API.

  • Iterator Serialization ([feature request] Savable data loader/iterator #11813)

    The DataLoader iterator is currently not picklable, due to the multiprocessing and multithreading attributes it holds. We should make it picklable as long as the dataset and the Sampler iterator are picklable, e.g., the __getstate__ could be

      def __getstate__(self):
          # keep only the picklable pieces; worker state is rebuilt when unpickling
          return (self.loader, self.sampler_iter, self.base_seed)

    We will also make the iterator of provided samplers serializable.
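    As a self-contained sketch of this pickling pattern (the class and attribute names below are illustrative, not the actual DataLoader internals): drop the non-picklable worker state in __getstate__ and rebuild it in __setstate__, so a restored iterator resumes from the surviving sampler iterator.

```python
import pickle

# Illustrative sketch, not DataLoader code: an iterator whose non-picklable
# worker handles are dropped when pickling and recreated when unpickling.
class _ResumableIter:
    def __init__(self, indices, base_seed=0):
        self.sampler_iter = iter(indices)  # assumed picklable (e.g. a list iterator)
        self.base_seed = base_seed
        self._workers = object()           # stands in for non-picklable worker handles

    def __getstate__(self):
        # persist only the picklable attributes
        state = self.__dict__.copy()
        del state["_workers"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._workers = object()           # workers are respawned on restore

    def __next__(self):
        return next(self.sampler_iter)

it = _ResumableIter([0, 1, 2, 3])
next(it)                                   # consume one index
restored = pickle.loads(pickle.dumps(it))  # resumes where the original left off
```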

  • Examples of Bulk Loading

    The current data loading API seems to suggest that the DataLoader is mainly suited for assembling batches from random reads over the dataset. However, it supports bulk loading very well. For instance, this gist implements sharded/chunked bulk loading in just 40 lines. We will improve the documentation to include examples of such cases.
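    A minimal sketch of the chunked-bulk-loading idea (pure Python, not the gist itself): each "index" maps to a whole chunk of records rather than a single sample, so one worker read pulls in a contiguous shard. With torch.utils.data.DataLoader you would then use a collate_fn (or batch handling) that passes chunks through intact.

```python
# Each __getitem__ returns a slice of the underlying records, so the
# sampler indexes chunks, not individual samples.
class ChunkDataset:
    def __init__(self, records, chunk_size):
        self.records = records
        self.chunk_size = chunk_size

    def __len__(self):
        # number of chunks, rounding up for a final partial chunk
        return (len(self.records) + self.chunk_size - 1) // self.chunk_size

    def __getitem__(self, chunk_idx):
        start = chunk_idx * self.chunk_size
        return self.records[start:start + self.chunk_size]

ds = ChunkDataset(list(range(10)), chunk_size=4)
chunks = [ds[i] for i in range(len(ds))]
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```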

  • Worker Load Configurations

    We currently balance the load of workers by keeping the number of tasks per worker balanced. This can be a problem when per-task workloads are uneven. We should make this behavior optional. Additionally, the maximum number of outstanding tasks (currently 2 * num_workers) should also become configurable.
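    A toy simulation of why this matters (function names are illustrative, not DataLoader API): with uneven task costs, fixed round-robin assignment can skew load badly, while letting the least-loaded worker take each new task, roughly what workers pulling tasks on demand achieves, stays much closer to even.

```python
# Simulate two scheduling policies over a list of per-task costs.
def round_robin_load(costs, num_workers):
    # fixed assignment: task i always goes to worker i % num_workers
    load = [0] * num_workers
    for i, c in enumerate(costs):
        load[i % num_workers] += c
    return load

def greedy_load(costs, num_workers):
    # each new task goes to the currently least-loaded worker,
    # approximating workers that pull tasks on demand
    load = [0] * num_workers
    for c in costs:
        load[load.index(min(load))] += c
    return load

costs = [10, 1, 10, 1, 10, 1]    # alternating heavy/light tasks
rr = round_robin_load(costs, 2)  # heavy tasks all land on worker 0
gr = greedy_load(costs, 2)       # much closer to balanced
```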

  • Expose Sampler iterator ([feature request] [PyTorch] Dynamic Samplers. #7359)

    This would enable dynamic updates to the Sampler iterator state, e.g., dynamic reweighting of the samples. The API may be loader_iter.get_sampler_iter(). Since we always prefetch some number of batches, we also need to augment the existing documentation to reflect that this iterator may be ahead of the latest value returned by the data loader iterator.

    Edit: As @apaszke pointed out below, it is possible to allow for strict consistency by providing an interface that flushes the pipeline and asks the sampler iterator for new indices based on the updated state. But that design needs further consideration, and we don't plan to pursue it until there is an immediate need.
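    A hypothetical sketch of the kind of object get_sampler_iter() might hand back (class and method names are made up for illustration): a sampler iterator whose weights can be adjusted between draws. As noted above, with prefetching, the batches you observe from the loader may lag behind this iterator's state.

```python
import random

# A sampler iterator supporting dynamic reweighting: updating a weight
# changes the distribution for all subsequent draws.
class ReweightableSamplerIter:
    def __init__(self, weights, seed=0):
        self.weights = list(weights)
        self.rng = random.Random(seed)

    def update_weight(self, idx, new_weight):
        self.weights[idx] = new_weight  # takes effect for subsequent draws

    def __iter__(self):
        return self

    def __next__(self):
        # draw an index with probability proportional to its current weight
        return self.rng.choices(range(len(self.weights)), weights=self.weights)[0]

it = ReweightableSamplerIter([1.0, 1.0, 1.0])
it.update_weight(2, 0.0)               # index 2 is now never drawn
draws = [next(it) for _ in range(100)]
```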

  • More Flexible worker_init_fn

    Currently, worker_init_fn only takes a single worker_idx argument, making it very difficult to initialize each worker's dataset object differently. Furthermore, it is impossible for the dataset.__getitem__ running in workers to communicate with the Sampler iterator to fetch more indices, or to update the iterator state. We plan to augment its input arguments to include a wider range of objects it can access, without breaking backward compatibility, and in a future-proof way.

    I haven't given much thought to the API design of this. But as a proof of concept, the API could be a get_worker_init_fn_arg argument which would be called in the main process, take in a data_loader_iter_info "struct" containing fields referencing the dataset and the sampler_iter (and maybe more), and return a serializable object to be fed in as an additional argument of worker_init_fn in the worker processes. Please let me know if you have suggestions!
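    To make the proof-of-concept above concrete, here is a hedged single-process sketch (get_worker_init_fn_arg and data_loader_iter_info are the hypothetical names from the proposal, not a real API): a callable runs in the main process with access to the dataset and sampler iterator, and whatever picklable value it returns is handed to each worker's worker_init_fn as an extra argument.

```python
from types import SimpleNamespace

# Main-process side: build the info "struct" and call the user hook.
def make_worker_extra_arg(get_worker_init_fn_arg, dataset, sampler_iter):
    info = SimpleNamespace(dataset=dataset, sampler_iter=sampler_iter)
    return get_worker_init_fn_arg(info)  # must return something picklable

# Worker side: the extra argument arrives alongside worker_idx.
def worker_init_fn(worker_idx, extra):
    return f"worker {worker_idx} got {extra!r}"

# Example hook: ship the dataset length to every worker.
extra = make_worker_extra_arg(lambda info: len(info.dataset),
                              dataset=[10, 20, 30],
                              sampler_iter=iter(range(3)))
msg = worker_init_fn(0, extra)
```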

  • Iterator-style Dataset

    We don't necessarily need to have a sampler. By allowing an iterator-style Dataset (rather than a stateless mapping), the workers can do interesting things like backfilling. This is entirely supported as of today, but we will make it nicer.
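    A pure-Python sketch of what an iterator-style dataset enables (the class is illustrative; PyTorch later shipped this idea as torch.utils.data.IterableDataset): because the dataset yields samples itself, it can keep internal state, here interleaving several streams and backfilling from the remaining ones as shorter streams run out.

```python
# Iterator-style dataset: no sampler, no index -> sample mapping; the
# dataset itself drives iteration and can hold state across samples.
class StreamDataset:
    def __init__(self, streams):
        self.streams = streams  # e.g. one shard per source

    def __iter__(self):
        iters = [iter(s) for s in self.streams]
        while iters:
            # round-robin over the live streams, dropping exhausted ones
            for it in list(iters):
                try:
                    yield next(it)
                except StopIteration:
                    iters.remove(it)

samples = list(StreamDataset([[1, 2, 3], "ab"]))
# interleaved, with the longer stream backfilling at the end
```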

  • Bridging C++ and Python DataLoaders

    We will be providing a simple way to convert a C++ DataLoader into a Python one, with the same API as the existing Python DataLoader.

Our Plan to Make These Happen

I (@ssnl) will be focusing on the first four items while @goldsborough will implement the fifth. In addition to these, @goldsborough is also adding a bunch of exciting features into the C++ DataLoader to allow for greater flexibility (e.g., see #12960 #12999).

Let us know your thoughts and suggestions!

cc @soumith @fmassa @apaszke
