Python dataloader Improvements #13023

@ssnl

@goldsborough and I are planning a series of improvements to dataloader in both C++ and Python API. This issue mainly focuses on the planned changes for the Python API.

  • Iterator Serialization ([feature request] Savable data loader/iterator #11813)

    The DataLoader iterator is currently not picklable, due to the multiprocessing and multithreading attributes it holds. We should make it picklable as long as the dataset and the Sampler iterator are picklable, e.g., the __getstate__ could be

      def __getstate__(self):
          # keep only the picklable pieces; worker state is rebuilt when unpickling
          return (self.loader, self.sampler_iter, self.base_seed)

    We will also make the iterator of provided samplers serializable.
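    As a self-contained sketch of this pickling pattern (the class and attribute names below are illustrative, not the actual DataLoader internals): drop the non-picklable worker state in __getstate__ and rebuild it in __setstate__, so a restored iterator resumes from the surviving sampler iterator.

```python
import pickle

# Illustrative sketch, not DataLoader code: an iterator whose non-picklable
# worker handles are dropped when pickling and recreated when unpickling.
class _ResumableIter:
    def __init__(self, indices, base_seed=0):
        self.sampler_iter = iter(indices)  # assumed picklable (e.g. a list iterator)
        self.base_seed = base_seed
        self._workers = object()           # stands in for non-picklable worker handles

    def __getstate__(self):
        # persist only the picklable attributes
        state = self.__dict__.copy()
        del state["_workers"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._workers = object()           # workers are respawned on restore

    def __next__(self):
        return next(self.sampler_iter)

it = _ResumableIter([0, 1, 2, 3])
next(it)                                   # consume one index
restored = pickle.loads(pickle.dumps(it))  # resumes where the original left off
```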

  • Examples of Bulk Loading

    The current data loading API seems to suggest that the DataLoader is mainly suited for assembling batches from random reads over the dataset. However, it supports bulk loading very well. For instance, this gist implements sharded/chunked bulk loading in just 40 lines. We will improve the documentation to include examples of such cases.
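    A minimal sketch of the chunked-bulk-loading idea (pure Python, not the gist itself): each "index" maps to a whole chunk of records rather than a single sample, so one worker read pulls in a contiguous shard. With torch.utils.data.DataLoader you would then use a collate_fn (or batch handling) that passes chunks through intact.

```python
# Each __getitem__ returns a slice of the underlying records, so the
# sampler indexes chunks, not individual samples.
class ChunkDataset:
    def __init__(self, records, chunk_size):
        self.records = records
        self.chunk_size = chunk_size

    def __len__(self):
        # number of chunks, rounding up for a final partial chunk
        return (len(self.records) + self.chunk_size - 1) // self.chunk_size

    def __getitem__(self, chunk_idx):
        start = chunk_idx * self.chunk_size
        return self.records[start:start + self.chunk_size]

ds = ChunkDataset(list(range(10)), chunk_size=4)
chunks = [ds[i] for i in range(len(ds))]
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```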

  • Worker Load Configurations

    We currently balance the load of workers by keeping the number of tasks per worker balanced. This can be a problem when per-task workloads are uneven. We should make this behavior optional. Additionally, the maximum number of outstanding tasks (currently 2 * num_workers) should also become configurable.
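    A toy simulation of why this matters (function names are illustrative, not DataLoader API): with uneven task costs, fixed round-robin assignment can skew load badly, while letting the least-loaded worker take each new task, roughly what workers pulling tasks on demand achieves, stays much closer to even.

```python
# Simulate two scheduling policies over a list of per-task costs.
def round_robin_load(costs, num_workers):
    # fixed assignment: task i always goes to worker i % num_workers
    load = [0] * num_workers
    for i, c in enumerate(costs):
        load[i % num_workers] += c
    return load

def greedy_load(costs, num_workers):
    # each new task goes to the currently least-loaded worker,
    # approximating workers that pull tasks on demand
    load = [0] * num_workers
    for c in costs:
        load[load.index(min(load))] += c
    return load

costs = [10, 1, 10, 1, 10, 1]    # alternating heavy/light tasks
rr = round_robin_load(costs, 2)  # heavy tasks all land on worker 0
gr = greedy_load(costs, 2)       # much closer to balanced
```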

  • Expose Sampler iterator ([feature request] [PyTorch] Dynamic Samplers. #7359)

    This would enable dynamic updates to the Sampler iterator state, e.g., dynamic reweighting of the samples. The API may be loader_iter.get_sampler_iter(). Since we always prefetch some number of batches, we also need to augment the existing documentation to reflect that this iterator may be ahead of the latest value returned by the data loader iterator.

    Edit: As @apaszke pointed out below, it is possible to allow for strict consistency by providing an interface that flushes the pipeline and asks the sampler iterator for new indices based on the updated state. But that design needs further consideration, and we don't plan to pursue it until there is an immediate need.
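    A hypothetical sketch of the kind of object get_sampler_iter() might hand back (class and method names are made up for illustration): a sampler iterator whose weights can be adjusted between draws. As noted above, with prefetching, the batches you observe from the loader may lag behind this iterator's state.

```python
import random

# A sampler iterator supporting dynamic reweighting: updating a weight
# changes the distribution for all subsequent draws.
class ReweightableSamplerIter:
    def __init__(self, weights, seed=0):
        self.weights = list(weights)
        self.rng = random.Random(seed)

    def update_weight(self, idx, new_weight):
        self.weights[idx] = new_weight  # takes effect for subsequent draws

    def __iter__(self):
        return self

    def __next__(self):
        # draw an index with probability proportional to its current weight
        return self.rng.choices(range(len(self.weights)), weights=self.weights)[0]

it = ReweightableSamplerIter([1.0, 1.0, 1.0])
it.update_weight(2, 0.0)               # index 2 is now never drawn
draws = [next(it) for _ in range(100)]
```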

  • More Flexible worker_init_fn

    Currently, worker_init_fn only takes a single worker_idx argument, making it very difficult to initialize each worker's dataset object differently. Furthermore, it is impossible for the dataset.__getitem__ running in workers to communicate with the Sampler iterator to fetch more indices, or to update the iterator state. We plan to augment its input arguments to include a wider range of objects it can access, without breaking backward compatibility, and in a future-proof way.

    I haven't given much thought to the API design of this. But as a proof of concept, the API could be a get_worker_init_fn_arg argument which would be called in the main process, take in a data_loader_iter_info "struct" containing fields referencing the dataset and the sampler_iter (and maybe more), and return a serializable object to be fed in as an additional argument of worker_init_fn in the worker processes. Please let me know if you have suggestions!
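    To make the proof-of-concept above concrete, here is a hedged single-process sketch (get_worker_init_fn_arg and data_loader_iter_info are the hypothetical names from the proposal, not a real API): a callable runs in the main process with access to the dataset and sampler iterator, and whatever picklable value it returns is handed to each worker's worker_init_fn as an extra argument.

```python
from types import SimpleNamespace

# Main-process side: build the info "struct" and call the user hook.
def make_worker_extra_arg(get_worker_init_fn_arg, dataset, sampler_iter):
    info = SimpleNamespace(dataset=dataset, sampler_iter=sampler_iter)
    return get_worker_init_fn_arg(info)  # must return something picklable

# Worker side: the extra argument arrives alongside worker_idx.
def worker_init_fn(worker_idx, extra):
    return f"worker {worker_idx} got {extra!r}"

# Example hook: ship the dataset length to every worker.
extra = make_worker_extra_arg(lambda info: len(info.dataset),
                              dataset=[10, 20, 30],
                              sampler_iter=iter(range(3)))
msg = worker_init_fn(0, extra)
```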

  • Iterator-style Dataset

    We don't necessarily need to have a sampler. By allowing an iterator-style Dataset (rather than a stateless mapping), the workers can do interesting things like backfilling. This is entirely supported as of today, but we will make it nicer.
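    A pure-Python sketch of what an iterator-style dataset enables (the class is illustrative; PyTorch later shipped this idea as torch.utils.data.IterableDataset): because the dataset yields samples itself, it can keep internal state, here interleaving several streams and backfilling from the remaining ones as shorter streams run out.

```python
# Iterator-style dataset: no sampler, no index -> sample mapping; the
# dataset itself drives iteration and can hold state across samples.
class StreamDataset:
    def __init__(self, streams):
        self.streams = streams  # e.g. one shard per source

    def __iter__(self):
        iters = [iter(s) for s in self.streams]
        while iters:
            # round-robin over the live streams, dropping exhausted ones
            for it in list(iters):
                try:
                    yield next(it)
                except StopIteration:
                    iters.remove(it)

samples = list(StreamDataset([[1, 2, 3], "ab"]))
# interleaved, with the longer stream backfilling at the end
```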

  • Bridging C++ and Python DataLoaders

    We will be providing a simple way to convert a C++ DataLoader into a Python one, with the same API as the existing Python DataLoader.

Our Plan to Make These Happen

I (@ssnl) will be focusing on the first four items while @goldsborough will implement the fifth. In addition to these, @goldsborough is also adding a bunch of exciting features into the C++ DataLoader to allow for greater flexibility (e.g., see #12960 #12999).

Let us know your thoughts and suggestions!

cc @soumith @fmassa @apaszke
