Dataloader distribute tasks to workers when in_order is False #142324

michael-diggin · 2024-12-08T07:49:12Z

Fixes #105203 and is a follow up PR to #141833

When in_order is True (the default), tasks are handed out to workers in round-robin fashion. When in_order is False, strict round robin is no longer needed because we give up guarantees of reproducibility; tasks should instead go to workers that have capacity to do work.
This PR tracks the number of outstanding tasks for each worker (updated when a task is added to the worker's queue, and when data is returned to the main thread). When choosing the next queue to add a task to, if in_order is False, the task is only added to a worker's queue if that worker has fewer than _prefetch_factor tasks outstanding.
The current default behaviour is left as is.
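
The selection rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in torch/utils/data/dataloader.py: the helper name `pick_worker` and all of its arguments are hypothetical, and only the ideas (a round-robin cursor, a per-worker outstanding count, and the `prefetch_factor` cap when `in_order` is False) come from the description.

```python
from itertools import cycle
from typing import Iterator, List, Optional


def pick_worker(
    worker_cycle: Iterator[int],
    num_workers: int,
    outstanding: List[int],
    prefetch_factor: int,
    in_order: bool,
) -> Optional[int]:
    """Return the id of the worker that should get the next task, or None.

    Hypothetical helper: walks the round-robin cycle at most once.
    """
    for _ in range(num_workers):
        worker_id = next(worker_cycle)
        if in_order:
            # In-order mode: strict round robin, regardless of load.
            return worker_id
        if outstanding[worker_id] < prefetch_factor:
            # Out-of-order mode: only pick a worker with spare capacity.
            return worker_id
    return None  # every worker is already at capacity


# Worker 0 and worker 2 are full, so the out-of-order rule skips to worker 1.
outstanding = [2, 0, 2]
chosen = pick_worker(cycle(range(3)), 3, outstanding, prefetch_factor=2, in_order=False)
```

With the same counters and `in_order=True`, the sketch would return worker 0, since round robin ignores load.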

Tests are also updated to assert on the worker IDs for each sample of data returned.
I've run the following to confirm they aren't flaky:

for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done

cc @andrewkho @divyanshk @ssnl @VitalyFedyunin @dzhulgakov

pytorch-bot · 2024-12-08T07:49:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142324

✅ No Failures

As of commit a20b7fa with merge base 3f80632:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

michael-diggin · 2024-12-16T17:59:30Z

Hi @andrewkho, @divyanshk, would it be possible to get a review of this follow up PR when you get a chance? Thanks!

andrewkho · 2024-12-17T01:10:52Z

Sorry for the delay, @michael-diggin, this slipped off my radar. Having a look now.

andrewkho

Really nice diff! One question to think through for IterableDatasets, and probably want to add context in one of the comments for future readers

torch/utils/data/dataloader.py

michael-diggin · 2024-12-17T20:26:42Z

> Really nice diff! One question to think through for IterableDatasets, and probably want to add context in one of the comments for future readers

Thanks for the quick review @andrewkho! I've responded to the comments, let me know what you think about the IterableDataset case, or if you've got a different idea in mind that may work better.

torch/utils/data/dataloader.py

andrewkho

LGTM, and thank you for adding this! Just a question about potential deadlocks, which I think this is safe from, but not sure if we can have stronger guarantees somehow

andrewkho · 2024-12-26T18:17:02Z

torch/utils/data/dataloader.py


-    def _process_data(self, data):
+    def _process_data(self, data, worker_idx):
+        self._workers_num_tasks[worker_idx] -= 1


I'm pretty sure that the way this is set up, we'll never deadlock due to not decrementing this, but is there somewhere we can give stronger guarantees on this?

michael-diggin · 2024-12-26

It shouldn't deadlock the way it is set up.
_workers_num_tasks is only incremented in _try_put_index(), which is called in two places (other than at the very beginning of an epoch, to start off a batch of tasks):

- In _process_data, where the counter is decremented first, so that is safe. This is also the only way _next_data can exit other than shutdown or end of epoch.

- Within _next_data when a worker for an IterableDataset has finished, which is also safe as that is just distributing work.

I think the key point is that _process_data, and hence the decrement, is pretty much always called by _next_data when it returns, and that the decrement happens before any increment.
I can't think of anything off the top of my head that would give stronger guarantees, but a small refactor might make this clearer (e.g. incrementing the counter within _process_data after the call to _try_put_index). I'll also try to add a new test case that gives more confidence that it won't deadlock.
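
The ordering argument above (decrement when data comes back, then refill) can be checked with a toy simulation. This is a sketch under assumptions, not PyTorch code: `ToyLoader`, its methods, and `run` are hypothetical stand-ins, and a worker "finishing" is simulated by picking a random busy worker.

```python
import random
from typing import List


class ToyLoader:
    """Toy model of per-worker outstanding-task counters (not PyTorch code)."""

    def __init__(self, num_workers: int, prefetch_factor: int, num_tasks: int):
        self.prefetch_factor = prefetch_factor
        self.outstanding: List[int] = [0] * num_workers
        self.remaining = num_tasks  # tasks not yet handed to any worker
        self.returned = 0
        # Prime the pipeline, as at the start of an epoch.
        for _ in range(num_workers * prefetch_factor):
            self.try_put_index()

    def try_put_index(self) -> None:
        """Give one task to the first worker with spare capacity, if any."""
        if self.remaining == 0:
            return
        for worker_id, count in enumerate(self.outstanding):
            if count < self.prefetch_factor:
                self.outstanding[worker_id] += 1
                self.remaining -= 1
                return

    def process_data(self, worker_id: int) -> None:
        """Handle one result: decrement first, then try to refill."""
        self.outstanding[worker_id] -= 1
        self.returned += 1
        self.try_put_index()


def run(num_workers: int, prefetch_factor: int, num_tasks: int, seed: int) -> int:
    rng = random.Random(seed)
    loader = ToyLoader(num_workers, prefetch_factor, num_tasks)
    while loader.returned < num_tasks:
        busy = [w for w, c in enumerate(loader.outstanding) if c > 0]
        # If no worker holds a task while results are still owed, we are stuck.
        assert busy, "deadlock: tasks remain but no worker holds any"
        loader.process_data(rng.choice(busy))
        # The cap must hold after every step.
        assert all(0 <= c <= prefetch_factor for c in loader.outstanding)
    return loader.returned


completed = run(num_workers=3, prefetch_factor=2, num_tasks=50, seed=0)
```

Because the decrement always frees a slot before the refill runs, the refill in this model can always place a task while any remain, so the loop drains every task. This only illustrates the invariant; it does not model worker shutdown or IterableDataset exhaustion.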

michael-diggin · 2024-12-27T09:30:20Z

One of the test failures looked a bit strange, but also unrelated. I've merged in a later commit from main which may help (passes on my local branch now). @andrewkho would you be able to rerun the CI when you get a chance? Thanks!

michael-diggin · 2025-01-03T07:59:41Z

@pytorchbot merge

pytorchmergebot · 2025-01-03T08:01:29Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Dataloader out of order follow up

b1c50db

michael-diggin requested review from andrewkho and divyanshk as code owners December 8, 2024 07:49

pytorch-bot bot added the "release notes: dataloader" label (release notes category) Dec 8, 2024

pytorchbot added the open source label Dec 8, 2024

handle test flakiness

44021c4

janeyx99 added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Dec 9, 2024

Merge branch 'main' into out-of-order-dataloader-followup

9fe551d

andrewkho added the "module: dataloader" label (related to torch.utils.data.DataLoader and Sampler) Dec 17, 2024

andrewkho reviewed Dec 17, 2024

add more clear comment

e161686

improve distribution logic

a5295d6

andrewkho mentioned this pull request Dec 26, 2024

[Stateful DL] Pre-emptive: ensure compatibility with out-of-order updates to torch.utils.data.DataLoader meta-pytorch/data#1414

Closed

andrewkho reviewed Dec 26, 2024

add comment for non zero sum

06800da

andrewkho approved these changes Dec 26, 2024

Merge branch 'main' into out-of-order-dataloader-followup

7eab00c

make test fully deterministic to avoid flakes

a20b7fa

pytorch-bot bot added the "ciflow/trunk" label (trigger trunk jobs on your pull request) Jan 3, 2025

pytorchmergebot added the merging label Jan 3, 2025

pytorchmergebot added the Merged label Jan 3, 2025

pytorchmergebot closed this in 55dc61d Jan 3, 2025

pytorchmergebot removed the merging label Jan 3, 2025

michael-diggin mentioned this pull request Jan 20, 2025

[Stateful DL] Add out of order implementation meta-pytorch/data#1423

Merged

michael-diggin deleted the out-of-order-dataloader-followup branch January 22, 2025 20:15
