@mikaylagawarecki (Contributor) commented May 28, 2024

### Before this PR:

`torch.utils.swap_tensors(a, b)` required the `use_count` of `a` and `b` to be 1.

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here would fail due to the reference held by the AccumulateGrad node, which is not cleaned up after backward
# torch.utils.swap_tensors(a, b)
del out
# Calling swap_tensors here would pass
torch.utils.swap_tensors(a, b)
```

### After this PR:

`torch.utils.swap_tensors(a, b)` requires the `use_count` of `a` and `b` to be 1, or 2 *if* the second reference is held by `AccumulateGrad`.

A pre-hook is registered on the `AccumulateGrad` node so that it fails if it is ever called (i.e. if the user attempts to backward through the graph).

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here is ok
torch.utils.swap_tensors(a, b)
# If we ever backward to the AccumulateGrad node, it will error that it was poisoned by swap_tensors
```
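For intuition, the effect of such a pre-hook can be sketched in Python using `torch.autograd.graph.Node.register_prehook` (a minimal illustration only; the PR registers its hook internally inside `swap_tensors`, and the `view_as` trick below is just one way to reach a leaf tensor's `AccumulateGrad` node):

```python
import torch

a = torch.randn(2, 3, requires_grad=True)
out = (a * 2).sum()

# The view_as trick exposes the AccumulateGrad node of the leaf tensor `a`.
acc_grad = a.view_as(a).grad_fn.next_functions[0][0]

def poison(grad_outputs):
    # Mimics the poisoning pre-hook: executing this node is now an error.
    raise RuntimeError("AccumulateGrad node was poisoned by swap_tensors")

acc_grad.register_prehook(poison)

try:
    out.backward()  # backward reaches the poisoned node and raises
except RuntimeError as e:
    print(e)
```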

### Application to `nn.Module`

This issue is especially pertinent in the context of `nn.Module`, where parameters have `AccumulateGrad` nodes initialized after the forward pass. Specifically, this is intended to address #126814 (comment). Previously, the following would fail at `m.cpu()`; we want users to be able to do this, and instead raise an error only if the user ever attempts to backward through the poisoned `AccumulateGrad` node.

```python
import torch
import torch.nn as nn
m = nn.Linear(3, 5)
inp = torch.randn(2, 3)
out = m(inp)
out.sum().backward()
m.cpu()
```
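With this PR, that flow is expected to work once swap-based conversion is enabled; a minimal sketch, assuming `torch.__future__.set_swap_module_params_on_conversion(True)` is set before the conversion (per the error message discussed below):

```python
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(3, 5)
inp = torch.randn(2, 3)
out = m(inp)
out.sum().backward()

m.cpu()  # now succeeds: the stale AccumulateGrad reference is poisoned instead of rejected

# A fresh forward builds a fresh graph, so training can continue normally.
out2 = m(inp)
out2.sum().backward()
```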

Stack from ghstack (oldest at bottom):

Differential Revision: D58094197

pytorch-bot bot commented May 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127313

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (2 Unrelated Failures)

As of commit 6e4efb1 with merge base 5196ef1:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mikaylagawarecki marked this pull request as draft May 28, 2024 18:51
@albanD removed their request for review May 28, 2024 21:44
@mikaylagawarecki marked this pull request as ready for review May 29, 2024 13:22
@mikaylagawarecki marked this pull request as draft May 29, 2024 15:28
@soulitzer (Contributor) left a comment:

LGTM! Just some more bikeshedding on the error message

```python
raise RuntimeError("Trying to execute AccumulateGrad node that was poisoned by swap_tensors "
                   "this can happen when you try to run backward on a tensor that was swapped. "
                   "For a module m with `torch.__future__.set_swap_module_params_on_conversion(True)` "
                   "this could happen if trying to run backward changing the device or dtype of the module "
```

Suggested change:

```diff
-"this could happen if trying to run backward changing the device or dtype of the module "
+"this could happen if trying to run backward after changing the device or dtype of the module "
```


Actually… maybe I would frame it as:

1. changing device/dtype after running forward (and then running backward) is not allowed
2. please change device/dtype before running forward

because if you tell me to swap `.cpu()` with `backward()`, the changes don't take effect unless the user also wanted to run a second iteration?
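i.e. the recommended order would be something like the following hypothetical sketch (reusing the `m`/`inp` names from the PR description):

```python
import torch
import torch.nn as nn

m = nn.Linear(3, 5)
m.cpu()  # change device/dtype first...

inp = torch.randn(2, 3)
out = m(inp)          # ...then run forward
out.sum().backward()  # backward sees a graph built after the conversion
```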

@mikaylagawarecki (Author) replied:

Gotcha, does the updated phrasing capture this appropriately?


looks good!

@mikaylagawarecki (Author) commented:

@pytorchbot merge

pytorch-bot bot added the `ciflow/trunk` label May 29, 2024
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-stable), trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-14)


@mikaylagawarecki added the topic: improvements label May 30, 2024
@mikaylagawarecki (Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorch-bot bot pushed a commit that referenced this pull request May 30, 2024
…s a reference (#127313)


Pull Request resolved: #127313
Approved by: https://github.com/soulitzer
@mikaylagawarecki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


Labels: ciflow/trunk, Merged, release notes: nn, topic: improvements
