Skip to content

Conversation

@wconstab
Copy link
Contributor

@wconstab wconstab commented Oct 16, 2024

Stack from ghstack (oldest at bottom):

V-schedules have a special case where the last rank has 2 adjacent
stages.

E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.

In the schedling logic, we also must allow scheduling the
stage 4 forward after running stage 3 forward, without expecting a stage
4 RECV_F

In the runtime, we pass activations between adjacent stages without
using SEND/RECV ops since the stages are on the same rank/process. We
add new APIs to PipelineStage abstraction for passing the activations
both during forward and backward. Currently the implementation directly
modifies the 'recv buffers' the stage is managing, so the
forward/backwrad execution logic does not need to know the difference.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138125

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 50c48ad with merge base failed to retrieve merge base, please contact dev infra:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 16, 2024
wconstab added a commit that referenced this pull request Oct 16, 2024
ghstack-source-id: 3287a3b
Pull Request resolved: #138125
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 17, 2024
ghstack-source-id: 55ab932
Pull Request resolved: #138125
[ghstack-poisoned]
[ghstack-poisoned]
@wconstab wconstab changed the title WIP Various fixes [Pipelining] Fix IR pass logic for V schedules with adjacent stages Oct 17, 2024
[ghstack-poisoned]
@wconstab wconstab added release notes: distributed (pipeline) release notes category module: pipelining Pipeline Parallelism labels Oct 18, 2024
[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 18, 2024
V-schedules have a special case where the last rank has 2 adjacent
stages.

E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.

In the schedling logic, we also must allow scheduling the
stage 4 forward after running stage 3 forward, without expecting a stage
4 RECV_F

TODO: Implement execution runtime logic for activations for V-schedule

ghstack-source-id: 172d9a3
Pull Request resolved: #138125
[ghstack-poisoned]
[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 24, 2024
V-schedules have a special case where the last rank has 2 adjacent
stages.

E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.

In the schedling logic, we also must allow scheduling the
stage 4 forward after running stage 3 forward, without expecting a stage
4 RECV_F

TODO: Implement execution runtime logic for activations for V-schedule

ghstack-source-id: 577940d
Pull Request resolved: #138125
[ghstack-poisoned]
[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 24, 2024
V-schedules have a special case where the last rank has 2 adjacent
stages.

E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.

In the schedling logic, we also must allow scheduling the
stage 4 forward after running stage 3 forward, without expecting a stage
4 RECV_F

ghstack-source-id: aa28430
Pull Request resolved: #138125
[ghstack-poisoned]
[ghstack-poisoned]
@wconstab wconstab changed the title [Pipelining] Fix IR pass logic for V schedules with adjacent stages [Pipelining] Support V-schedules in IR and Runtime Oct 25, 2024
[ghstack-poisoned]
@wconstab wconstab requested review from H-Huang and kwen2501 October 30, 2024 19:44
tensor, torch.Tensor
), f"expected tensor values as outputs from prev stage, got {type(tensor)}"
if not isinstance(info, _RecvInfo):
# TODO: when would info not be a _RecvInfo? should this be an error?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would not be _RecvInfo for the first stage and a different placeholder class is used for the first stage

n_stages,
device,
# TODO(whc) shape inference shouldn't have needed to run communications in this 1-rank, 2-stage scenario,
# but it was failing on fakePG recv data unpiclking error, so something is wrong. Work around for now.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm that's interesting, i wonder why. I tried the fakePG and used regular schedules and it seemed to work correctly

if (
not stage.is_first
# no recv op expected for V-schedule special case (see [Note: V-schedule special case])
and stage_idx - 1 not in stage_index_to_stage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for better clarity the check stage_idx +/- 1 in stage_index_to_stage which is recurring in a few spots could be refactored into a property like has_local_next_stage and has_local_prev_stage

"schedule": "v_2_rank_4_stage",
"compute": {
0: [
"0F0",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it autoformatted to be like this? or is it possible to have the actions together in a row, that would take up less lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, autoformat did this. I dunno if there is a way to override it?

[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Oct 31, 2024
@wconstab
Copy link
Contributor Author

squashed this into previous PR, becuase I noticed that this PR only contained a new test and the code changes that were supposed to be in this PR ended up in the previous PR, probably due to a messed up rebase. closing this one and will just land the squashed one.

@wconstab wconstab closed this Oct 31, 2024
@github-actions github-actions bot deleted the gh/wconstab/343/head branch December 1, 2024 02:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module: pipelining Pipeline Parallelism oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (pipeline) release notes category topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants