We realized that in the introduction notebook, the usage examples given for `make_multisource_episode_pipeline` did not set the `shuffle_buffer_size` parameter, which defaults to not shuffling examples within each class.
We identified two unfortunate consequences in code that does not shuffle examples:
- Evaluation on the `traffic_sign` dataset was overly optimistic, since the examples were organized as 30-image sequences of pictures of the same physical sign (successive frames from the same video), so support and query examples were frequently very similar to each other.
- Training on small datasets could suffer, since the first examples of a given class would consistently end up as support examples and the later ones as query examples, reducing the diversity of episodes.
Code using the training loop of Meta-Dataset was not affected, since it gets its `shuffle_buffer_size` value from a `DataConfig` object set from a gin configuration that is explicitly passed to `Trainer`'s constructor (in `all.gin` and `imagenet.gin`).
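In practice this means the shipped configurations keep shuffling enabled through a gin binding of the form `DataConfig.shuffle_buffer_size = <value>` (the concrete value used in those files is not reproduced here).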
We have mitigated the first point by updating the dataset conversion code to shuffle the `traffic_sign` images once (3512a82), and updated the notebook to show a better practice (c3f62a1, sketched below), but existing converted datasets and code inspired by the notebook (outside of this repository) are still impacted.
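For illustration, here is a minimal sketch of the corrected call. `make_multisource_episode_pipeline` and `shuffle_buffer_size` are confirmed above; the surrounding argument names (`dataset_spec_list`, `use_dag_ontology_list`, `use_bilevel_ontology_list`, `episode_descr_config`, `image_size`) and the buffer value of 300 are assumptions based on the notebook's usual setup:

```python
from meta_dataset.data import learning_spec
from meta_dataset.data import pipeline


def build_episode_dataset(dataset_spec_list, use_dag_ontology_list,
                          use_bilevel_ontology_list, episode_descr_config):
  # Passing an explicit shuffle_buffer_size (300 is an illustrative value)
  # shuffles examples within each class, so the support/query split no
  # longer follows the on-disk order of the examples.
  return pipeline.make_multisource_episode_pipeline(
      dataset_spec_list=dataset_spec_list,
      use_dag_ontology_list=use_dag_ontology_list,
      use_bilevel_ontology_list=use_bilevel_ontology_list,
      episode_descr_config=episode_descr_config,
      split=learning_spec.Split.TRAIN,
      image_size=84,
      shuffle_buffer_size=300)
```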
Similarly, the usage example for `make_multisource_batch_pipeline` does not pass a `shuffle_buffer_size`, but the impact seems much smaller: batch training should be less sensitive to the order of examples, and the random mixing of different classes already adds randomness.
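The same precaution can be applied there as well. A hedged sketch, where argument names other than `shuffle_buffer_size` (in particular `batch_size`) are assumptions about the function's signature:

```python
from meta_dataset.data import learning_spec
from meta_dataset.data import pipeline


def build_batch_dataset(dataset_spec_list, batch_size=256):
  # Even though batches already mix classes randomly, setting
  # shuffle_buffer_size keeps the within-class example order random too.
  return pipeline.make_multisource_batch_pipeline(
      dataset_spec_list=dataset_spec_list,
      batch_size=batch_size,
      split=learning_spec.Split.TRAIN,
      image_size=84,
      shuffle_buffer_size=300)
```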