Conversation

H-Huang (Member) commented Oct 21, 2024

Stack from ghstack (oldest at bottom):

This fix is similar to that done in #138119, except this is an edge case for the last stage. For the last stage we perform backward on the loss which we detached in the previous PR. However, we also hold the stage_outputs alive because we return all the output chunks in merge_output_chunks() after the step is over. This will also still keep the autograd graph alive, so detaching these tensors frees the memory earlier.

pre-fix: [image]

post-fix: [image]
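The mechanism at work here can be illustrated without PyTorch: an output tensor that keeps a reference into the graph that produced it pins that graph in memory until the reference is dropped. The sketch below uses hypothetical stand-in classes (`GraphNode`, `FakeTensor`), not the actual autograd internals, to show why detaching only the loss is not enough when `stage_output` is still held:

```python
import gc
import weakref

class GraphNode:
    """Hypothetical stand-in for an autograd graph node."""
    def __init__(self, parents=()):
        self.parents = list(parents)

class FakeTensor:
    """Hypothetical stand-in for a tensor with a grad_fn backpointer."""
    def __init__(self, grad_fn=None):
        self.grad_fn = grad_fn
    def detach_(self):
        # Dropping grad_fn severs the reference into the graph,
        # mirroring what Tensor.detach_() does for autograd.
        self.grad_fn = None

# Build the chain: rest_of_graph -> stage_output -> loss
rest_of_graph = GraphNode()
stage_output = FakeTensor(grad_fn=GraphNode(parents=[rest_of_graph]))
loss = FakeTensor(grad_fn=GraphNode(parents=[stage_output.grad_fn]))

graph_ref = weakref.ref(rest_of_graph)
del rest_of_graph

# Detaching only the loss is not enough: stage_output, kept alive to be
# returned later, still holds a reference chain into the graph.
loss.detach_()
gc.collect()
assert graph_ref() is not None

# Detaching stage_output as well releases the whole graph.
stage_output.detach_()
gc.collect()
assert graph_ref() is None
```

This is only a reference-counting analogy; real autograd graphs additionally hold saved activations, which is where the memory savings in the traces above come from.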

cc @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Oct 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138504

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 5 New Failures

As of commit bed5bb5 with merge base deaf041:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 21, 2024
H-Huang added a commit that referenced this pull request Oct 21, 2024
ghstack-source-id: 4e34cda
Pull Request resolved: #138504
@H-Huang H-Huang added release notes: distributed (pipeline) release notes category module: pipelining Pipeline Parallelism labels Oct 21, 2024
# stage_output is kept alive to be returned to the user in
# merge_output_chunks, therefore it should be detached to release
# the autograd graph context and free memory earlier
for t in stage_output:
    t.detach_()
Contributor:

I'm confused about the difference between the 2 PRs. The first PR detaches stage_output in stage_backward_input, which (if I understand correctly) runs inside this function (backward_one_chunk) but above this point, in the non-full-backward case. If this is the last stage, wouldn't we have already run stage_backward_input and detached stage_outputs by the time we get here? Or are there multiple copies of stage_output that we have to detach both of?

H-Huang (Member, Author) replied:

For the last stage, backward_one_chunk operates on the loss. The dependencies look like:

rest_of_autograd_graph -> stage_output -> loss

The "stage_outputs" detached in the first PR was actually the loss for the last stage (it should probably be renamed to stage_output_or_loss). So we also need to detach the stage_output in the last stage to allow the memory to be freed.
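A minimal sketch of that dependency chain, assuming torch is available (variable names mirror the discussion and are illustrative, not the actual PipelineStage code):

```python
import torch

# Toy last-stage computation: stage_output feeds the loss.
x = torch.randn(4, 4, requires_grad=True)
stage_output = x * 2          # rest_of_autograd_graph -> stage_output
stage_loss = stage_output.sum()   # stage_output -> loss

stage_loss.backward()         # gradients for x are now populated

# Even after backward, stage_output (held so it can be returned by
# merge_output_chunks) still carries a grad_fn referencing the rest
# of the autograd graph, keeping it alive.
assert stage_output.grad_fn is not None

# Detaching in place drops the grad_fn, letting the graph and its
# saved activations be freed earlier.
stage_output.detach_()
assert stage_output.grad_fn is None
```

Detaching the loss alone would not break this chain, which is why the fix in this PR detaches stage_output as well.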

Contributor:

Gotcha. Yeah, rename for clarity, and maybe even put that simple illustration (rest_of_autograd_graph -> stage_output -> loss) into the comment.

@H-Huang H-Huang marked this pull request as ready for review October 22, 2024 14:45
@H-Huang H-Huang requested a review from kwen2501 October 22, 2024 14:46
@H-Huang H-Huang requested a review from wconstab October 23, 2024 18:06
H-Huang (Member, Author) commented Oct 23, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 23, 2024
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-build / build


H-Huang (Member, Author) commented Oct 23, 2024

@pytorchbot merge -i


pytorchmergebot pushed a commit that referenced this pull request Oct 25, 2024
Addressing the comments in previous PRs to update the variable names and add additional code comments

Pull Request resolved: #138735
Approved by: https://github.com/wconstab
ghstack dependencies: #138119, #138504