
Conversation

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented May 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126814

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Unrelated Failures

As of commit 5aee159 with merge base 5196ef1:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following jobs failed, but the failures were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@mikaylagawarecki requested a review from JackCaoG May 21, 2024 22:21
@mikaylagawarecki marked this pull request as draft May 21, 2024 22:23
@JackCaoG
Collaborator

Thanks @mikaylagawarecki

[ghstack-poisoned]
@mikaylagawarecki added the keep-going (Don't stop on first failure, keep running tests until the end) label May 22, 2024
mikaylagawarecki added a commit that referenced this pull request May 22, 2024
@albanD removed their request for review May 22, 2024 14:22
[ghstack-poisoned]
@pytorch-bot bot added the release notes: nn (release notes category) label May 22, 2024
def test_conv_empty_input(self, device, dtype):
    def help(input, conv, memory_format):
-       ref_out = conv(input)
+       ref_out = conv(input).detach()
Contributor Author

@mikaylagawarecki May 22, 2024

These .detach() calls ensure that the autograd graph is not alive during .to(); otherwise the refcount of the param is more than 1, because the AccumulateGrad node holds a reference, and that prevents swap_tensors from being used.

As discussed, this was a known limitation of the swap_tensors path. In this case it seems to be more an artifact of how the test was written, and it seems unlikely to occur in practice (you don't normally want to change the dtype/device of your model while the autograd graph is alive).

@JackCaoG, wanted to double-check that you are okay with this limitation.
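
A minimal sketch (not code from this PR) of the limitation described above, assuming the swap path is enabled globally via torch.__future__.set_swap_module_params_on_conversion(True); the Conv2d and input shapes are arbitrary:

import torch
import torch.nn as nn

# Assumption for this sketch: opt into the swap_tensors path for module conversions.
torch.__future__.set_swap_module_params_on_conversion(True)

conv = nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 16, 16)

out = conv(x)               # the live autograd graph's AccumulateGrad nodes hold
                            # an extra reference to each parameter
try:
    conv.to(torch.float64)  # swap_tensors requires a single reference per parameter
except RuntimeError as e:
    print(e)                # e.g. "_apply(): Couldn't swap Conv2d.weight"

del out                     # free the graph (the test instead builds conv(input).detach())
conv.to(torch.float64)      # now the parameters can be swapped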

@JackCaoG
Collaborator

Yeah, I agree that in real life it is unlikely to happen.

@mikaylagawarecki marked this pull request as ready for review May 22, 2024 21:35
@mikaylagawarecki added the topic: bc breaking (topic category) label May 22, 2024
@JackCaoG
Collaborator

Sorry, my bad: upstream runs only a subset of the full XLA test suite. I started to see this CI failure on our end:

======================================================================
ERROR: test_sync_bn1d_multi_channel (__main__.TestMpSyncBatchNorm)
TestMpSyncBatchNorm.test_sync_bn1d_multi_channel
----------------------------------------------------------------------
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 826, in _apply
    torch.utils.swap_tensors(param, param_applied)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/__init__.py", line 68, in swap_tensors
    torch._C._swap_tensor_impl(t1, t2)
RuntimeError: Expected single reference to a's Tensor object but got 2

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 78, in _run_thread_per_device
    replica_results = list(
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 71, in _thread_fn
    return fn()
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 190, in __call__
    self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
  File "/__w/xla/xla/pytorch/xla/test/test_mp_sync_batch_norm.py", line 87, in _sync_bn1d_multi_channel
    assert_stats(sbn_xla.cpu(), bn_cpu)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 972, in cpu
    return self._apply(lambda t: t.cpu())
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 830, in _apply
    raise RuntimeError(f"_apply(): Couldn't swap {self._get_name()}.{key}") from e
RuntimeError: _apply(): Couldn't swap SyncBatchNorm.weight
"""

Is it ok to revert this while I debug the issue? Thanks!
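
For context, a hedged standalone sketch (not the XLA test itself) of the single-reference check that fails in the traceback above: torch.utils.swap_tensors refuses to swap a tensor whose implementation is still referenced elsewhere, for example by the AccumulateGrad node of a live autograd graph. The tensors and shapes below are made up for illustration.

import torch
import torch.nn as nn

# Both tensors have a single reference each, so the swap succeeds.
t1, t2 = torch.randn(2), torch.zeros(2)
torch.utils.swap_tensors(t1, t2)   # t1 and t2 exchange their underlying tensor impls

# A live graph keeps an extra reference to the parameter through AccumulateGrad,
# so the swap is refused.
w = nn.Parameter(torch.randn(4))
replacement = nn.Parameter(torch.zeros(4))
out = w.sum()                      # keeps AccumulateGrad(w) alive
try:
    torch.utils.swap_tensors(w, replacement)
except RuntimeError as e:
    print(e)                       # e.g. "Expected single reference to a's Tensor object but got 2"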

@mikaylagawarecki
Contributor Author

@pytorchbot revert -m "broke xla ci" -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@mikaylagawarecki
Contributor Author

mikaylagawarecki commented Jun 3, 2024

@izaitsevfb I no longer see the failure in D58015016 in the new import D58094197. Is this okay to re-merge?

@mikaylagawarecki
Contributor Author

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Jun 4, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-13-py3-arm64 / build

Details for Dev Infra team Raised by workflow job

@mikaylagawarecki
Contributor Author

@pytorchbot merge -r

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/9374002926

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_8-cuda11_8-test / test, trunk / macos-13-py3-arm64 / build

Details for Dev Infra team Raised by workflow job

@mikaylagawarecki
Contributor Author

linux-binary-manywheel / manywheel-py3_8-cuda11_8-test / test (gh):
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory

trunk / macos-13-py3-arm64 / build (gh):
/Users/ec2-user/runner/_work/pytorch/pytorch/c10/util/StringUtil.cpp:45:8: error: 'wstring_convert<std::codecvt_utf8_utf16<wchar_t>>' is deprecated [-Werror,-Wdeprecated-declarations]

These failures are unrelated.

@mikaylagawarecki
Contributor Author

@pytorchbot merge -f "failures unrelated, see above comment"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

petrex pushed a commit to petrex/pytorch that referenced this pull request Jun 5, 2024
mikaylagawarecki added a commit that referenced this pull request Jun 6, 2024
…6814)"

This reverts commit a7b1dd8.

ghstack-source-id: 761a49d
Pull Request resolved: #128170
@github-actions bot deleted the gh/mikaylagawarecki/207/head branch July 5, 2024 01:54

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
keep-going (Don't stop on first failure, keep running tests until the end)
Merged
release notes: nn (release notes category)
Reverted
topic: bc breaking (topic category)
