Skip to content

On Dataset, IterableDataset inheritance #120139

@kuraga

Description

@kuraga

Analysis

In PyTorch

We have (https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html):

T_co = TypeVar('T_co', covariant=True)

class Dataset(Generic[T_co]):
    r"""<...> All subclasses should overwrite :meth:`__getitem__`
    <...> Subclasses could also optionally implement :meth:`__getitems__`
    """
    def __getitem__(self, index) -> T_co:
        <...>

class IterableDataset(Dataset[T_co], Iterable[T_co]):
    r"""<...> All subclasses should overwrite :meth:`__iter__`
    """
    # No `def __len__(self)` default? Subclasses raise `TypeError` when needed.
    # See NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]

So:
1.

object <- Dataset <---- IterableDataset
       \               /
        \- Iterable <-/ 
  1. Datasets should implement __getitem__.
  2. IterableDatasets should implement __iter__ and __len__.

In Python

Now, let's look at the Python's Collections Abstract Base Classes documentation.

What does Python offer?

         -- Sized ---------------------
        /                               \
object <--- Iterable <-------------------- Collection <--- Sequence
        \             \                  /               /
         \             - Reversible <-------------------/
          \                            /
           -- Container ---------------
  1. Iterable requires __iter__.
  2. __len__ introduced in Sized. Iterables doesn't have __len__.
  3. Container introduces __contains__.
  4. Reversible introduces __reversed__.
  5. Collection just join the superclasses.
  6. Sequence has real implementations of __iter__, __contains__ , __reversed__ but requires __len__, __getitem__.

Applying back to PyTorch

What do I see?

  1. p.4 => IterableDataset can't be a subclass of Dataset. (It could be vice versa.)
  2. p.6 => p.3 is incorrect, i.e. IterableDatasets shouldn't implement __len__ but Dataset could.
  3. Inheritance could be changed drastically, without big code changes.

Proposition

Variant 1

Ok, we could have:

         -- Sized ---------------------
        /                              \
       /             - IterableDataset  \
      /             /                    \
object <--- Iterable <-------------------- Collection <- Dataset
      \                                  /
        -- Container -------------------
Class Methods to implement
IterableDataset __iter__
Dataset __len__, __iter__, __contains__ , __getitem__, __reversed__

Variant 2

Or even subclass of Sequence for Dataset:

         -- Sized ---------------------
        /                              \
       /             - IterableDataset  \
      /             /                    \
object <--- Iterable <-------------------- Collection <- Sequence <- Dataset
      \             \                    /              /
       \             - Reversible <--------------------/
        \                               /
         -- Container -----------------
Class Methods to implement
IterableDataset __iter__
Dataset __len__, __getitem__

See Also

Sequence could be Mapping

I thought a Dataset as a Sequence (list), i.e. int indices.
But strictly saying indices are untyped. So, it could be a Mapping (dict).

Type Annotations

I'd appreciate to annotate all the code (see above).

NOTE [ Lack of Default __len__ in Python Abstract Base Classes ]

Thinking on this could solve the nasty https://github.com/pytorch/pytorch/blob/v2.2.1/torch/utils/data/sampler.py#L70-L95.
On that, see also #122410, #47055.

Related Discussions, Issues and Commits

Thoughts

Thoughts? :)

cc @VitalyFedyunin @ejguan @dzhulgakov @ssnl

Metadata

Metadata

Assignees

No one assigned

    Labels

    module: datatorch.utils.datatriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate module

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions