
Conversation

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented May 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126814

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Unrelated Failures

As of commit 5aee159 with merge base 5196ef1:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following jobs failed, but the failures were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@mikaylagawarecki requested a review from JackCaoG May 21, 2024 22:21
@mikaylagawarecki marked this pull request as draft May 21, 2024 22:23
@JackCaoG
Collaborator

Thanks @mikaylagawarecki

[ghstack-poisoned]
@mikaylagawarecki added the keep-going (Don't stop on first failure, keep running tests until the end) label May 22, 2024
mikaylagawarecki added a commit that referenced this pull request May 22, 2024
@albanD removed their request for review May 22, 2024 14:22
[ghstack-poisoned]
@pytorch-bot bot added the release notes: nn (release notes category) label May 22, 2024
def test_conv_empty_input(self, device, dtype):
    def help(input, conv, memory_format):
-       ref_out = conv(input)
+       ref_out = conv(input).detach()
Contributor Author

@mikaylagawarecki May 22, 2024

These .detach() calls ensure that the autograd graph is not alive during .to(); otherwise the refcount of the param is more than 1, because the AccumulateGrad node holds a reference, and that prevents swap_tensors from being used.

As discussed, this was a known limitation of the swap_tensors path. In this case it seems to be more an artifact of how the test was written, and it seems unlikely to occur in practice (you don't normally want to change the dtype/device of your model while the autograd graph is alive).

@JackCaoG, wanted to double-check that you are okay with this limitation.
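
A minimal sketch (not code from this PR) of the limitation described above, assuming the swap path is enabled globally via torch.__future__.set_swap_module_params_on_conversion(True); the Conv2d and input shapes are arbitrary:

import torch
import torch.nn as nn

# Assumption for this sketch: opt into the swap_tensors path for module conversions.
torch.__future__.set_swap_module_params_on_conversion(True)

conv = nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 16, 16)

out = conv(x)               # the live autograd graph's AccumulateGrad nodes hold
                            # an extra reference to each parameter
try:
    conv.to(torch.float64)  # swap_tensors requires a single reference per parameter
except RuntimeError as e:
    print(e)                # e.g. "_apply(): Couldn't swap Conv2d.weight"

del out                     # free the graph (the test instead builds conv(input).detach())
conv.to(torch.float64)      # now the parameters can be swapped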

@JackCaoG
Collaborator

Yeah, I agree that in real life it is unlikely to happen.

@mikaylagawarecki marked this pull request as ready for review May 22, 2024 21:35
@mikaylagawarecki added the topic: bc breaking (topic category) label May 22, 2024
@JackCaoG
Collaborator

Sorry, my bad: upstream runs only a subset of the full XLA test suite. I started to see this CI failure on our end:

======================================================================
ERROR: test_sync_bn1d_multi_channel (__main__.TestMpSyncBatchNorm)
TestMpSyncBatchNorm.test_sync_bn1d_multi_channel
----------------------------------------------------------------------
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 826, in _apply
    torch.utils.swap_tensors(param, param_applied)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/__init__.py", line 68, in swap_tensors
    torch._C._swap_tensor_impl(t1, t2)
RuntimeError: Expected single reference to a's Tensor object but got 2

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 78, in _run_thread_per_device
    replica_results = list(
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 71, in _thread_fn
    return fn()
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 190, in __call__
    self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
  File "/__w/xla/xla/pytorch/xla/test/test_mp_sync_batch_norm.py", line 87, in _sync_bn1d_multi_channel
    assert_stats(sbn_xla.cpu(), bn_cpu)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 972, in cpu
    return self._apply(lambda t: t.cpu())
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 830, in _apply
    raise RuntimeError(f"_apply(): Couldn't swap {self._get_name()}.{key}") from e
RuntimeError: _apply(): Couldn't swap SyncBatchNorm.weight
"""

Is it ok to revert this while I debug the issue? Thanks!
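
For context, a hedged standalone sketch (not the XLA test itself) of the single-reference check that fails in the traceback above: torch.utils.swap_tensors refuses to swap a tensor whose implementation is still referenced elsewhere, for example by the AccumulateGrad node of a live autograd graph. The tensors and shapes below are made up for illustration.

import torch
import torch.nn as nn

# Both tensors have a single reference each, so the swap succeeds.
t1, t2 = torch.randn(2), torch.zeros(2)
torch.utils.swap_tensors(t1, t2)   # t1 and t2 exchange their underlying tensor impls

# A live graph keeps an extra reference to the parameter through AccumulateGrad,
# so the swap is refused.
w = nn.Parameter(torch.randn(4))
replacement = nn.Parameter(torch.zeros(4))
out = w.sum()                      # keeps AccumulateGrad(w) alive
try:
    torch.utils.swap_tensors(w, replacement)
except RuntimeError as e:
    print(e)                       # e.g. "Expected single reference to a's Tensor object but got 2"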

@mikaylagawarecki
Contributor Author

@pytorchbot revert -m "broke xla ci" -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@mikaylagawarecki
Contributor Author

mikaylagawarecki commented Jun 3, 2024

@izaitsevfb I no longer see the failure in D58015016 in the new import D58094197. Is this okay to re-merge?

@mikaylagawarecki
Contributor Author

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Jun 4, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-13-py3-arm64 / build

Details for Dev Infra team Raised by workflow job

@mikaylagawarecki
Contributor Author

@pytorchbot merge -r

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/9374002926

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_8-cuda11_8-test / test, trunk / macos-13-py3-arm64 / build

Details for Dev Infra team Raised by workflow job

@mikaylagawarecki
Contributor Author

linux-binary-manywheel / manywheel-py3_8-cuda11_8-test / test (gh):
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory

trunk / macos-13-py3-arm64 / build (gh):
/Users/ec2-user/runner/_work/pytorch/pytorch/c10/util/StringUtil.cpp:45:8: error: 'wstring_convert<std::codecvt_utf8_utf16<wchar_t>>' is deprecated [-Werror,-Wdeprecated-declarations]

These failures are unrelated.

@mikaylagawarecki
Contributor Author

@pytorchbot merge -f "failures unrelated, see above comment"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

petrex pushed a commit to petrex/pytorch that referenced this pull request Jun 5, 2024
mikaylagawarecki added a commit that referenced this pull request Jun 6, 2024
…6814)"

This reverts commit a7b1dd8.

ghstack-source-id: 761a49d
Pull Request resolved: #128170
@github-actions bot deleted the gh/mikaylagawarecki/207/head branch July 5, 2024 01:54

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
keep-going (Don't stop on first failure, keep running tests until the end)
Merged
release notes: nn (release notes category)
Reverted
topic: bc breaking (topic category)
