[Pipelining] Support V-schedules in IR and Runtime #138125
Conversation
Dr. CI: See artifacts and rendered test results at hud.pytorch.org/pr/138125. As of commit 50c48ad (merge base could not be retrieved; contact dev infra), there is 1 new failure. Note: links to docs will display an error until the docs builds have completed. This comment updates every 15 minutes.
V-schedules have a special case where the last rank has 2 adjacent stages. E.g. if rank3 had stage 3 and stage 4, then we should implement direct transfer of stage3 outputs to stage4 inputs without a send/recv. In the scheduling logic, we also must allow scheduling the stage 4 forward after running stage 3 forward, without expecting a stage 4 RECV_F. TODO: Implement execution runtime logic for activations for V-schedule. ghstack-source-id: 172d9a3 Pull Request resolved: #138125
    tensor, torch.Tensor
), f"expected tensor values as outputs from prev stage, got {type(tensor)}"
if not isinstance(info, _RecvInfo):
    # TODO: when would info not be a _RecvInfo? should this be an error?
I think this would not be a _RecvInfo for the first stage; a different placeholder class is used there.
n_stages,
device,
# TODO(whc) shape inference shouldn't have needed to run communications in this 1-rank, 2-stage scenario,
# but it was failing on fakePG recv data unpickling error, so something is wrong. Work around for now.
Hmm, that's interesting, I wonder why. I tried the fakePG with regular schedules and it seemed to work correctly.
if (
    not stage.is_first
    # no recv op expected for V-schedule special case (see [Note: V-schedule special case])
    and stage_idx - 1 not in stage_index_to_stage
nit: for better clarity, the recurring `stage_idx +/- 1 in stage_index_to_stage` check could be refactored into properties like `has_local_next_stage` and `has_local_prev_stage`.
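A minimal sketch of the refactor this nit suggests, assuming `stage_index_to_stage` maps stage index to the stage objects hosted on this rank (the wrapper class and helper names are hypothetical):

```python
class RankSchedule:
    """Illustrative wrapper sketching the suggested helpers."""

    def __init__(self, stage_index_to_stage: dict):
        self.stage_index_to_stage = stage_index_to_stage

    def has_local_prev_stage(self, stage_idx: int) -> bool:
        # Previous stage lives on the same rank: inputs arrive by
        # direct transfer, so no RECV_F is scheduled for this stage.
        return stage_idx - 1 in self.stage_index_to_stage

    def has_local_next_stage(self, stage_idx: int) -> bool:
        # Next stage lives on the same rank: outputs are handed over
        # locally, so no SEND_F is scheduled for this stage.
        return stage_idx + 1 in self.stage_index_to_stage

# e.g. the last rank in the PR description, hosting stages 3 and 4
sched = RankSchedule({3: "stage3", 4: "stage4"})
```

With this, the scheduling condition above would read `not stage.is_first and not sched.has_local_prev_stage(stage_idx)`.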
| "schedule": "v_2_rank_4_stage", | ||
| "compute": { | ||
| 0: [ | ||
| "0F0", |
nit: is it autoformatted to be like this? Or is it possible to have the actions together in a row? That would take up fewer lines.
yea, autoformat did this. I dunno if there is a way to override it?
Squashed this into the previous PR, because I noticed that this PR only contained a new test, and the code changes that were supposed to be in this PR ended up in the previous PR, probably due to a messed-up rebase. Closing this one and will just land the squashed one.
Stack from ghstack (oldest at bottom):
V-schedules have a special case where the last rank has 2 adjacent stages. E.g. if rank3 had stage 3 and stage 4, then we should implement direct transfer of stage3 outputs to stage4 inputs without a send/recv.

In the scheduling logic, we also must allow scheduling the stage 4 forward after running stage 3 forward, without expecting a stage 4 RECV_F.
In the runtime, we pass activations between adjacent stages without using SEND/RECV ops, since the stages are on the same rank/process. We add new APIs to the PipelineStage abstraction for passing the activations during both forward and backward. Currently the implementation directly modifies the 'recv buffers' the stage is managing, so the forward/backward execution logic does not need to know the difference.
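The recv-buffer trick described above can be sketched as follows. This is a toy model under stated assumptions, not the actual PipelineStage implementation: plain Python lists stand in for the preallocated tensor buffers, and the method name is hypothetical.

```python
class LocalStage:
    """Toy stand-in for a pipeline stage on one rank."""

    def __init__(self, numel: int):
        # Buffer that a RECV_F op would normally fill over the wire.
        self.fwd_recv_buffer = [0.0] * numel

    def set_local_fwd_input(self, prev_stage_output: list) -> None:
        # Direct transfer: overwrite the recv buffer in place (the
        # list analogue of tensor.copy_()), so the forward execution
        # logic reads its input exactly as if a RECV had completed.
        self.fwd_recv_buffer[:] = prev_stage_output

# Stage 3 and stage 4 live on the same rank, so stage 3's output is
# handed to stage 4 without any SEND/RECV op.
stage3_output = [1.0, 2.0, 3.0]
stage4 = LocalStage(numel=3)
stage4.set_local_fwd_input(stage3_output)
```

Copying into the existing buffer (rather than swapping references) is what lets the downstream forward/backward code stay oblivious to whether the data arrived locally or via communication.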
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o