@goldsborough and I are planning a series of improvements to the DataLoader in both the C++ and Python APIs. This issue mainly focuses on the planned changes to the Python API.
- **Iterator Serialization** ([feature request] Savable data loader/iterator #11813)

  The DataLoader iterator is currently not picklable, due to the multiprocessing and multithreading attributes it holds. We should make it picklable as long as the dataset and the sampler iterator are picklable, e.g., the `__getstate__` could be `def __getstate__(self): return (self.loader, self.sampler_iter, self.base_seed)`. We will also make the iterators of the provided samplers serializable.
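  Assuming this lands, checkpointing mid-epoch could look like the sketch below. Note that the picklable iterator is the proposed feature, not the current API: `pickle.dumps` on a loader iterator raises an error today.

  ```python
  import pickle

  import torch
  from torch.utils.data import DataLoader, TensorDataset

  dataset = TensorDataset(torch.arange(100, dtype=torch.float32))
  loader = DataLoader(dataset, batch_size=10, shuffle=True)

  it = iter(loader)
  for _ in range(5):
      next(it)                    # consume the first half of the epoch

  # Proposed behavior: the iterator serializes (loader, sampler_iter,
  # base_seed), so unpickling resumes exactly where we stopped.
  state = pickle.dumps(it)        # fails with today's DataLoader
  resumed = pickle.loads(state)
  for _ in range(5):
      next(resumed)               # second half of the epoch
  ```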
- **Examples of Bulk Loading**

  The current data loading API seems to suggest that the DataLoader is mainly suited for assembling batches from random reads of the dataset. However, it supports bulk loading very well; for instance, this gist implements sharded/chunked bulk loading in just 40 lines (a condensed sketch of the pattern follows). We will improve the documentation to include examples of such cases.
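  A minimal sketch of the idea, not the gist itself: treat each index as a chunk ID, perform one bulk read per `__getitem__`, and use a pass-through `collate_fn` so chunks arrive intact.

  ```python
  import torch
  from torch.utils.data import DataLoader, Dataset

  class ChunkedDataset(Dataset):
      """Each index names a chunk (e.g., a shard file); __getitem__
      does one bulk read instead of one record read."""

      def __init__(self, num_chunks, chunk_size):
          self.num_chunks = num_chunks
          self.chunk_size = chunk_size

      def __len__(self):
          return self.num_chunks

      def __getitem__(self, chunk_idx):
          # Stand-in for reading a whole shard from disk.
          start = chunk_idx * self.chunk_size
          return torch.arange(start, start + self.chunk_size)

  def take_first(chunks):
      # batch_size=1 below, so unwrap the single chunk from the batch list.
      return chunks[0]

  # Workers prefetch upcoming chunks in parallel while the main process
  # consumes one full chunk at a time.
  loader = DataLoader(
      ChunkedDataset(num_chunks=8, chunk_size=1024),
      batch_size=1,
      num_workers=2,
      collate_fn=take_first,
  )
  for chunk in loader:
      ...  # chunk is a full 1024-element tensor
  ```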
- **Worker Load Configurations**

  We currently balance load across workers by keeping the number of outstanding tasks per worker equal. This can be a problem when the per-task workload is uneven, so this behavior should become optional. Additionally, the maximum number of outstanding tasks (currently `2 * num_workers`) should also become configurable.
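  Purely illustrative: the keyword names below are hypothetical placeholders for whatever the final API chooses, not existing DataLoader parameters.

  ```python
  # Hypothetical future knobs (names invented here, not real arguments):
  #
  #   loader = DataLoader(
  #       dataset,
  #       num_workers=4,
  #       max_outstanding_tasks=16,   # today hard-coded to 2 * num_workers
  #       balance_worker_load=False,  # let fast workers take extra tasks
  #   )
  ```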
- **Expose Sampler Iterator** ([feature request] [PyTorch] Dynamic Samplers. #7359)

  This would enable dynamic updates to the sampler iterator's state, e.g., dynamic reweighting of the samples. The API may be `loader_iter.get_sampler_iter()`. Since we always prefetch some number of batches, we also need to augment the existing documentation to note that this iterator may be ahead of the latest value returned by the data loader iterator (see the sketch below).

  Edit: As @apaszke pointed out below, it is possible to allow for strict consistency by providing an interface that flushes the pipeline and asks the sampler iterator for new indices based on the updated state. That design needs further consideration, though, and we don't plan to pursue it until there is an immediate need.
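  A sketch of the kind of sampler this enables; the sampler itself is plain PyTorch and works today, while the `get_sampler_iter()` accessor in the comments is the proposed, not-yet-existing API.

  ```python
  import torch
  from torch.utils.data import Sampler

  class DynamicWeightedSampler(Sampler):
      """Sampler whose weights may be mutated mid-epoch; each step
      re-reads self.weights, so external updates take effect."""

      def __init__(self, weights, num_samples):
          self.weights = torch.as_tensor(weights, dtype=torch.double)
          self.num_samples = num_samples

      def __iter__(self):
          for _ in range(self.num_samples):
              yield torch.multinomial(self.weights, 1).item()

      def __len__(self):
          return self.num_samples

  # Hypothetical usage of the proposed accessor:
  #   sampler_iter = loader_iter.get_sampler_iter()   # proposed API
  #   sampler.weights[hard_ids] *= 2.0                # reweight on the fly
  # Caveat: already-prefetched batches were drawn under the old weights.
  ```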
- **More Flexible `worker_init_fn`**

  Currently, `worker_init_fn` only takes in a single `worker_id` argument, making it very difficult to initialize each worker's dataset object in a different way. Furthermore, it is impossible for the `dataset.__getitem__` running in workers to communicate with the sampler iterator to fetch more indices or to update the iterator state. We plan to augment its input arguments to include a wider range of accessible objects, without being BC-breaking, and in a future-proof way.

  I haven't given much thought to the API design. But as a proof of concept, the API could be a `get_worker_init_fn_arg` argument which would be called in the main process, take in a `data_loader_iter_info` "struct" containing fields referencing the `dataset` and the `sampler_iter` (and maybe more), and return a serializable object to be fed in as an additional argument of `worker_init_fn` in the worker processes (sketched below). Please let me know if you have suggestions!
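  For contrast, this is roughly what `worker_init_fn` supports today (per-worker setup keyed only on the worker index), with the proposed extension shown as hypothetical comments; `get_worker_init_fn_arg` and the two-argument form do not exist yet.

  ```python
  import numpy as np
  import torch
  from torch.utils.data import DataLoader, TensorDataset

  def worker_init_fn(worker_id):
      # Today's API: only the integer worker_id is available. Each worker
      # already gets a distinct torch seed, so derive other seeds from it.
      np.random.seed(torch.initial_seed() % 2**32)

  dataset = TensorDataset(torch.arange(100, dtype=torch.float32))
  loader = DataLoader(dataset, num_workers=4, worker_init_fn=worker_init_fn)

  # Hypothetical proof-of-concept extension (names from this proposal):
  # def get_worker_init_fn_arg(data_loader_iter_info):
  #     # Runs in the main process; selects what each worker may see.
  #     return {"num_samples": len(data_loader_iter_info.dataset)}
  #
  # def worker_init_fn(worker_id, extra_arg):
  #     ...  # per-worker setup with access to the extra information
  ```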
- **Iterator-Style Dataset**

  We don't necessarily need to have a sampler. By allowing an iterator-style dataset (rather than a stateless mapping), the workers can do interesting things like backfilling. This is entirely supported as of today (one way is sketched below), but we will make it nicer.
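  One way to do this today, assuming a single-process loader (`num_workers=0`) so there is exactly one copy of the stream state: a map-style dataset that ignores its index and advances an internal iterator.

  ```python
  import torch
  from torch.utils.data import DataLoader, Dataset

  class StreamDataset(Dataset):
      """Wraps an iterator as a map-style dataset; __getitem__ ignores
      the index and pulls the next element from the stream."""

      def __init__(self, stream, length):
          self.stream = iter(stream)
          self.length = length

      def __len__(self):
          return self.length

      def __getitem__(self, _index):
          return next(self.stream)

  stream = (torch.tensor(float(i)) for i in range(100))
  # num_workers=0 keeps the single stream state in the main process.
  loader = DataLoader(StreamDataset(stream, length=100), batch_size=10)
  for batch in loader:
      ...
  ```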
- **Bridging C++ and Python DataLoaders**

  We will provide a simple way to convert a C++ DataLoader into a Python one, with the same API as the existing Python DataLoader.
**Our Plan to Make These Happen**

I (@ssnl) will be focusing on the first four items, while @goldsborough will implement the fifth. In addition to these, @goldsborough is also adding a bunch of exciting features to the C++ DataLoader to allow for greater flexibility (e.g., see #12960, #12999).
Let us know your thoughts and suggestions!