[Pipelining] Free memory usage earlier in last stage #138504

H-Huang · 2024-10-21T20:25:01Z

Stack from ghstack (oldest at bottom):

This fix is similar to the one in #138119, but it covers an edge case for the last stage. On the last stage we run backward on the loss, which the previous PR already detaches. However, we also keep the stage_outputs alive, because all output chunks are returned in merge_output_chunks() after the step is over. Those tensors still keep the autograd graph alive, so detaching them frees the memory earlier.
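A minimal sketch of the effect described above, assuming plain PyTorch on CPU rather than the actual pipelining stage code. The names `kept_outputs` and `weight` are hypothetical stand-ins for the output chunks a last stage holds until merge_output_chunks() and for the stage parameters. It only shows that each kept tensor stops referencing its autograd graph after `detach_()` while its values stay available for merging; it does not measure memory.

```python
import torch

torch.manual_seed(0)
weight = torch.randn(4, 2, requires_grad=True)  # hypothetical stage parameter

# Hypothetical list of per-microbatch outputs kept until the step is over.
kept_outputs = []
for _ in range(2):
    x = torch.randn(3, 4)
    out = x @ weight          # non-leaf tensor with a grad_fn
    out.sum().backward()      # gradients accumulate into weight.grad
    kept_outputs.append(out)

# Backward has run, but each kept tensor still points at its graph node.
graph_alive_before = [t.grad_fn is not None for t in kept_outputs]

for t in kept_outputs:
    t.detach_()               # in place: drops grad_fn, keeps the values

graph_alive_after = [t.grad_fn is not None for t in kept_outputs]

# The detached chunks can still be concatenated and returned to the caller.
merged = torch.cat(kept_outputs, dim=0)
```

Because `detach_()` works in place, the same tensor objects stay in the list, so code that concatenates them later sees unchanged values.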

pre-fix:

post-fix:

cc @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-10-21T20:25:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138504

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[PRE-EMPTIVE] Experimenting with new runners linux.aws.a100 on inductor-perf-compare.yml

❌ 5 New Failures

As of commit bed5bb5 with merge base deaf041:

NEW FAILURES - The following jobs have failed:

linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-build / build (gh)
bash: /builder/libtorch/build.sh: No such file or directory
linux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-build / build (gh)
bash: /builder/libtorch/build.sh: No such file or directory
linux-binary-manywheel / manywheel-py3_9-cuda11_8-build / build (gh)
bash: /builder/manywheel/build.sh: No such file or directory
linux-binary-manywheel / manywheel-py3_9-cuda12_1-build / build (gh)
bash: /builder/manywheel/build.sh: No such file or directory
linux-binary-manywheel / manywheel-py3_9-cuda12_4-build / build (gh)
bash: /builder/manywheel/build.sh: No such file or directory

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 4e34cda Pull Request resolved: #138504

wconstab · 2024-10-22T01:02:15Z

torch/distributed/pipelining/stage.py

+            # to return to the user in merge_output_chunks, therefore
+            # this should be detached to release autograd graph context and free memory earlier
+            for t in stage_output:
+                t.detach_()


I'm confused about the difference between the 2 PRs. The first PR detaches 'stage_output' in stage_backward_input, which iiuc runs inside this function (backward_one_chunk) but above this point, in the case of non-full-backward. If this is the last stage, wouldn't we have already run stage_backward_input and detached stage_outputs by the time we get here? Or are there multiple copies of 'stage_output' and we have to detach them both?

H-Huang · 2024-10-22

For the last stage, backward_one_chunk operates on the loss. The dependencies look like:

rest_of_autograd_graph -> stage_output -> loss

The "stage_outputs" detached in the first PR was actually the loss for the last stage (I guess it should be renamed to stage_output_or_loss). So we also need to detach the stage_output in the last stage to allow the memory to be freed.

wconstab · 2024-10-23

gotcha. Yeah, rename for clarity. And maybe even put that simple illustration into the comment: rest_of_autograd_graph -> stage_output -> loss
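The dependency chain discussed in this thread can be illustrated with a small sketch, assuming plain PyTorch tensors rather than the real stage objects (`weight`, `x`, and the flag names are hypothetical). It shows that detaching the loss leaves `stage_output` attached to its graph, which is why the last stage would need a separate detach.

```python
import torch

weight = torch.randn(4, 2, requires_grad=True)  # hypothetical stage parameter
x = torch.randn(3, 4)

stage_output = x @ weight           # rest_of_autograd_graph -> stage_output
loss = stage_output.pow(2).mean()   # stage_output -> loss
loss.backward()

# Detaching only the loss cuts the last edge of the chain.
loss.detach_()
loss_has_graph = loss.grad_fn is not None
output_has_graph_after_loss_detach = stage_output.grad_fn is not None

# stage_output still references its graph until it is detached as well.
stage_output.detach_()
output_has_graph_after_own_detach = stage_output.grad_fn is not None
```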

H-Huang · 2024-10-23T19:31:43Z

@pytorchbot merge

pytorchmergebot · 2024-10-23T19:33:21Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-10-23T20:05:11Z

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-build / build

Details for Dev Infra team

Raised by workflow job

H-Huang · 2024-10-23T22:56:54Z

@pytorchbot merge -i

pytorchmergebot · 2024-10-23T22:58:24Z

Merge started

Your change will be merged while ignoring the following 5 checks: linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-build / build, linux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-build / build, linux-binary-manywheel / manywheel-py3_9-cuda12_4-build / build, linux-binary-manywheel / manywheel-py3_9-cuda11_8-build / build, linux-binary-manywheel / manywheel-py3_9-cuda12_1-build / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Addressing the comments in previous PRs to update the variable names and add additional code comments
Pull Request resolved: #138735
Approved by: https://github.com/wconstab
ghstack dependencies: #138119, #138504

Pull Request resolved: #138720
Approved by: https://github.com/wconstab
ghstack dependencies: #138119, #138504, #138735

[Pipelining] Free memory usage earlier in last stage

bed5bb5

[ghstack-poisoned]

H-Huang mentioned this pull request Oct 21, 2024

[Pipelining] fix extra memory usage in zero bubble #138119

Closed

pytorch-bot bot added the oncall: distributed label Oct 21, 2024

H-Huang added a commit that referenced this pull request Oct 21, 2024

[Pipelining] Free memory usage earlier in last stage

9dd7893

ghstack-source-id: 4e34cda Pull Request resolved: #138504

H-Huang added the release notes: distributed (pipeline) and module: pipelining labels Oct 21, 2024

wconstab reviewed Oct 22, 2024

H-Huang marked this pull request as ready for review October 22, 2024 14:45

H-Huang requested a review from kwen2501 October 22, 2024 14:46

H-Huang mentioned this pull request Oct 23, 2024

[Pipelining] Clean up hooks in zero bubble #138720

Closed

H-Huang requested a review from wconstab October 23, 2024 18:06

wconstab approved these changes Oct 23, 2024

This was referenced Oct 23, 2024

[PP] Add unit tests to check for memory regressions #138726

Open

[Pipelining] small comments and variable renames #138735

Closed

pytorch-bot bot added the ciflow/trunk label Oct 23, 2024

pytorchmergebot added the merging label Oct 23, 2024

pytorchmergebot removed the merging label Oct 23, 2024

pytorchmergebot added the merging label Oct 23, 2024

pytorchmergebot added the Merged label Oct 24, 2024

pytorchmergebot closed this in 32a3dbc Oct 24, 2024

pytorchmergebot removed the merging label Oct 24, 2024
