-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Open
Labels
module: datatorch.utils.datatorch.utils.datatriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module
Description
Analysis
In PyTorch
T_co = TypeVar('T_co', covariant=True)
class Dataset(Generic[T_co]):
r"""<...> All subclasses should overwrite :meth:`__getitem__`
<...> Subclasses could also optionally implement :meth:`__getitems__`
"""
def __getitem__(self, index) -> T_co:
<...>
class IterableDataset(Dataset[T_co], Iterable[T_co]):
r"""<...> All subclasses should overwrite :meth:`__iter__`
"""
# No `def __len__(self)` default? Subclasses raise `TypeError` when needed.
# See NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]So:
1.
object <- Dataset <---- IterableDataset
\ /
\- Iterable <-/
Datasets should implement__getitem__.IterableDatasets should implement__iter__and__len__.
In Python
Now, let's look at the Python's Collections Abstract Base Classes documentation.
What does Python offer?
-- Sized ---------------------
/ \
object <--- Iterable <-------------------- Collection <--- Sequence
\ \ / /
\ - Reversible <-------------------/
\ /
-- Container ---------------
Iterablerequires__iter__.__len__introduced inSized.Iterables doesn't have__len__.Containerintroduces__contains__.Reversibleintroduces__reversed__.Collectionjust join the superclasses.Sequencehas real implementations of__iter__,__contains__,__reversed__but requires__len__,__getitem__.
Applying back to PyTorch
What do I see?
- p.4 =>
IterableDatasetcan't be a subclass ofDataset. (It could be vice versa.) - p.6 => p.3 is incorrect, i.e.
IterableDatasets shouldn't implement__len__butDatasetcould. - Inheritance could be changed drastically, without big code changes.
Proposition
Variant 1
Ok, we could have:
-- Sized ---------------------
/ \
/ - IterableDataset \
/ / \
object <--- Iterable <-------------------- Collection <- Dataset
\ /
-- Container -------------------
| Class | Methods to implement |
|---|---|
IterableDataset |
__iter__ |
Dataset |
__len__, __iter__, __contains__ , __getitem__, __reversed__ |
Variant 2
Or even subclass of Sequence for Dataset:
-- Sized ---------------------
/ \
/ - IterableDataset \
/ / \
object <--- Iterable <-------------------- Collection <- Sequence <- Dataset
\ \ / /
\ - Reversible <--------------------/
\ /
-- Container -----------------
| Class | Methods to implement |
|---|---|
IterableDataset |
__iter__ |
Dataset |
__len__, __getitem__ |
See Also
Sequence could be Mapping
I thought a Dataset as a Sequence (list), i.e. int indices.
But strictly saying indices are untyped. So, it could be a Mapping (dict).
Type Annotations
I'd appreciate to annotate all the code (see above).
NOTE [ Lack of Default __len__ in Python Abstract Base Classes ]
Thinking on this could solve the nasty https://github.com/pytorch/pytorch/blob/v2.2.1/torch/utils/data/sampler.py#L70-L95.
On that, see also #122410, #47055.
Related Discussions, Issues and Commits
- Add IterableDataset #19228,
- [types] torch.utils.data.{Dataset, Sampler} are not Sized #47055,
- Enable len(dataloader) for iterable dataset #23587.
Thoughts
Thoughts? :)
Skeleton003 and edwardzjl
Metadata
Metadata
Assignees
Labels
module: datatorch.utils.datatorch.utils.datatriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module