Add IterableDataset #19228
Conversation
facebook-github-bot left a comment:
@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
this is currently awaiting (1) #19421 and (2) filling in the docs
@pritamdamania87 doc is done!
@apaszke Soumith mentioned it'd be nice to have you review parts of this PR. I've looked through some of the new functionality added in this PR (IterableDataset), but it would be good if you can take a look at some of the changes to the existing code structure. Thanks!
facebook-github-bot left a comment:
@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pritamdamania87 do let me know if this breaks anything internal

Looks like renaming `pin_memory_batch` breaks a bunch of stuff which does

@pritamdamania87 for the first error, I'd prefer fixing the code that imports private helpers. For the second, it'd be great if you can show/message me the trace, or at least the non-confidential parts. Thanks!
facebook-github-bot left a comment:
@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This is a modified version of pytorch#14705, since the commit structure of that PR is quite messy.

1. Add `IterableDataset`.
2. So we have two data loader modes: `Iterable` and `Map`.
   1. `Iterable` if the `dataset` is an instance of `IterableDataset`
   2. `Map` otherwise
3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
6. Add `ChainDataset`, the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.

Closes pytorch#17909, pytorch#18096, pytorch#19946, and some of pytorch#13023.

Pull Request resolved: pytorch#19228
Reviewed By: bddppq
Differential Revision: D15058152
fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
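For context, here is a minimal sketch (not code from this PR) of the iterable-style loading described in the summary: an `IterableDataset` that only defines `__iter__`, loaded with `batch_size=None` so the `DataLoader` does no automatic batching. The `RangeStream` class is purely illustrative.

```python
from torch.utils.data import IterableDataset, DataLoader

class RangeStream(IterableDataset):
    """Illustrative iterable-style dataset that streams integers in [start, end)."""
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        # Iterable-style datasets only need __iter__; no __getitem__ / __len__.
        return iter(range(self.start, self.end))

# batch_size=None disables automatic batching; each sample comes back as-is.
loader = DataLoader(RangeStream(0, 10), batch_size=None)
for sample in loader:
    print(sample)  # 0, 1, ..., 9, one sample per iteration
```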
Seems like the PR is finished? Good job!

@bethunebtj - we're targeting 1.2 release at the end of July / early August hopefully. But you can always use nightlies :)

Is this functionality available in nightly?

@rfalcon100 yes
Summary: Back in April, malmaud added type annotations for `dataloader.py`. However, at about the same time, SsnL in #19228 replaced `_DataLoaderIter` with `_BaseDataLoaderIter` and two subclasses, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Probably because these changes happened in parallel at roughly the same time, the type stubs and several other references in the codebase were never updated to match this refactoring. I've gone ahead and done the updates to reflect the refactoring in #19228, which fixes the specific type stub/implementation mismatch pointed out in #26673, although not the broader problem that pytorch doesn't have a test to make sure that the `.pyi` type stub files match the real API defined in `.py` files.

Pull Request resolved: #27105
Differential Revision: D17813641
Pulled By: ezyang
fbshipit-source-id: ed7ac025c8d6ad3f298dd073347ec83bb4b6600c
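A purely illustrative sketch of what keeping the stubs in sync means here: after the refactor, the type stubs must describe `_BaseDataLoaderIter` and its two subclasses rather than the removed `_DataLoaderIter`. The signatures below are simplified assumptions, not the repository's actual `.pyi` contents.

```python
# Assumed, simplified stub shapes for the refactored iterator classes.
from typing import Any, Iterator

class _BaseDataLoaderIter(Iterator[Any]):
    def __init__(self, loader: Any) -> None: ...
    def __next__(self) -> Any: ...
    def __len__(self) -> int: ...

class _SingleProcessDataLoaderIter(_BaseDataLoaderIter): ...
class _MultiProcessingDataLoaderIter(_BaseDataLoaderIter): ...
```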
This is a modified version of #14705 since commit structure for that PR is quite messy.

- Add `IterableDataset`. So we have 2 data loader modes: `Iterable` and `Map`.
  - `Iterable` if the `dataset` is an instance of `IterableDataset`
  - `Map` otherwise
- Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful in doing things like bulk loading.
- Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
- Add `torch.utils.data.get_worker_info`, which returns worker information in a worker proc (e.g., worker id, dataset obj copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration (see the sketch after this list).
- Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
- Import `torch.utils.data` in `torch/__init__.py`.
- Add data loader examples and documentation.
- Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.

Closes #17909, #18096, #19946, and some of #13023.
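As referenced in the list above, here is a hedged sketch of the per-worker configuration that `get_worker_info` enables inside `IterableDataset.__iter__`. The `ShardedStream` class and its splitting scheme are illustrative assumptions modeled on the documentation examples, not code taken verbatim from this PR.

```python
import math
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedStream(IterableDataset):
    """Illustrative dataset over range(start, end), sharded across workers."""
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: iterate over the full range.
            lo, hi = self.start, self.end
        else:
            # Worker process: carve out this worker's disjoint slice so the
            # workers don't all yield duplicate copies of the same data.
            per_worker = int(math.ceil((self.end - self.start) / float(info.num_workers)))
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

# Each of the two workers yields its own shard; together they cover 0..7 exactly once.
loader = DataLoader(ShardedStream(0, 8), num_workers=2, batch_size=None)
```

Similarly, `ChainDataset([ds_a, ds_b])` lazily chains multiple iterable-style datasets, serving as the iterable analog of `ConcatDataset`.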