
Conversation

@skotapati
Contributor

@skotapati skotapati commented Nov 14, 2024

Makes #139550 (comment) work

@pytorch-bot

pytorch-bot bot commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140725

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 5b88c10 with merge base e4a05de:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/mps Run MPS tests (subset of trunk) release notes: mps Release notes category labels Nov 14, 2024
@malfet
Contributor

malfet commented Nov 14, 2024

@skotapati this fixes the 2nd issue (when an explicit synchronize is used), but not the 1st one, where an async copy should not require synchronization at all, as from PyTorch's point of view it is done in the context of the same stream.
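
For reference, a minimal sketch of the two scenarios (tensor shapes and values here are illustrative, not taken from the original issue):

```python
import torch

# 1st issue: the async copy is queued on the same MPS stream, so later ops
# on that stream are already ordered after it; from PyTorch's point of view
# no explicit synchronization should be needed here.
x = torch.ones(1024).to("mps", non_blocking=True)
y = x * 2  # same stream, ordered after the copy

# 2nd issue (the one this PR addresses): an explicit synchronize after the
# non-blocking copy must complete correctly rather than hang or race.
torch.mps.synchronize()
print(y.sum().item())  # expect 2048.0
```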

@skotapati skotapati marked this pull request as ready for review November 14, 2024 19:31
@qqaatw
Collaborator

qqaatw commented Nov 14, 2024

I think we need:

  1. To address the second issue mentioned here: Bug in conversion to mps with non_blocking=True #139550 (comment). A simple solution would be: before replacing _prevCommandBuffer with _commandBuffer, check whether the previous command buffer has been committed. If so, wait for it and then dispose of it, i.e. instead of

```objc
if (!_enableCommitAndContinue) {
  _prevCommandBuffer = _commandBuffer;
} else {
```

something like

```objc
if (!_enableCommitAndContinue) {
  if (_prevCommandBuffer) {
    // the previous command buffer (if it exists) has already been committed,
    // so we just wait until it's completed and then dispose of it.
    [_prevCommandBuffer waitUntilCompleted];
    [_prevCommandBuffer release];
    _prevCommandBuffer = nil;
  }
  _prevCommandBuffer = _commandBuffer;
}
```

  2. Testing to prevent regression (e.g. along the lines of the sketch below).
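
For point 2, a minimal regression-test sketch (the test and class names are hypothetical; the real test would presumably live in PyTorch's MPS test suite):

```python
import unittest

import torch

class TestMPSNonBlocking(unittest.TestCase):
    @unittest.skipUnless(torch.backends.mps.is_available(), "requires MPS")
    def test_non_blocking_copy_then_sync(self):
        # Regression check for #139550: a non_blocking copy to MPS followed
        # by an explicit synchronize must neither deadlock nor corrupt data.
        x = torch.arange(1024, dtype=torch.float32)
        y = x.to("mps", non_blocking=True)
        torch.mps.synchronize()
        self.assertTrue(torch.equal(y.cpu(), x))

if __name__ == "__main__":
    unittest.main()
```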

@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Nov 15, 2024
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Nov 18, 2024
@skotapati skotapati requested a review from qqaatw November 18, 2024 23:05
@malfet
Contributor

malfet commented Nov 19, 2024

@pytorchbot merge -f "LGTM"

@pytorch-bot

pytorch-bot bot commented Nov 19, 2024

You need to provide a reason for using force merge, in the format @pytorchbot merge -f 'Explanation'.
The explanation needs to be clear on why this is needed. Here are some good examples:

  • Bypass checks due to unrelated upstream failures from ...
  • This is a minor fix to ..., which shouldn't break anything
  • This is pre-tested in a previous CI run
  • Bypass flaky ... check

@malfet
Contributor

malfet commented Nov 19, 2024

@pytorchbot merge -f "Lint + MPS tests are green"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge, ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
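
For example, the non-forced alternative suggested above (using the --ignore-current flag named in the message) would look like:

```
@pytorchbot merge -i
```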

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@malfet
Contributor

malfet commented Nov 21, 2024

@pytorchbot revert -m "It causes deadlocks when I try to run a benchmark from #127242" -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Nov 21, 2024
@pytorchmergebot
Collaborator

@skotapati your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Nov 21, 2024
@malfet
Contributor

malfet commented Nov 21, 2024

If this change is landed, running something like

```python
import torch
from timeit import default_timer
from torch.utils.benchmark import Measurement, Timer


def bench_cross(
    m,
    dtype=torch.float32,
    device: str = "cpu",
) -> Measurement:
    setup = f"""
    x = torch.rand({m}, 3, dtype={dtype}, device="{device}")
    y = torch.rand({m}, 3, dtype={dtype}, device="{device}")
    """

    t = Timer(
        stmt="torch.cross(x, y); torch.mps.synchronize()",
        setup=setup,
        language="python",
        timer=default_timer,
    )
    return t.blocked_autorange()


if __name__ == "__main__":
    torch.set_num_threads(1)
    print(bench_cross(1024, device="mps"))
```

causes a deadlock with the following backtrace:

```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

@malfet malfet requested review from kulinseth and malfet November 21, 2024 22:07
Contributor

@malfet malfet left a comment

Before the next attempt to land this, please provide a quick description of why it was causing the deadlock.

@malfet
Contributor

malfet commented Nov 21, 2024

A much simpler reproducer (going to turn it into a regression test):

```python
import torch
y = torch.nextafter(torch.ones(1024, device='mps'), torch.zeros(1024, device='mps'))
torch.mps.synchronize()
torch.mps.synchronize()
```
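
A sketch of what that regression test might look like (test and class names are hypothetical; the actual test that was landed may differ):

```python
import unittest

import torch

class TestMPSDoubleSync(unittest.TestCase):
    @unittest.skipUnless(torch.backends.mps.is_available(), "requires MPS")
    def test_synchronize_twice_after_kernel(self):
        # Two back-to-back synchronizes after a Metal kernel must not hang.
        y = torch.nextafter(torch.ones(1024, device="mps"),
                            torch.zeros(1024, device="mps"))
        torch.mps.synchronize()
        torch.mps.synchronize()
        # nextafter toward zero must strictly decrease each element from 1.0
        self.assertTrue(torch.all(y < 1.0).item())

if __name__ == "__main__":
    unittest.main()
```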

Collaborator

@qqaatw qqaatw left a comment

I couldn't repro with Nikita's scripts though.

```objc
void MPSStream::commitAndContinueRoot() {
  assert(_commandBuffer);
  id<MTLCommandBuffer> rootCommandBuffer = [_commandBuffer rootCommandBuffer];
  [_commandBuffer commitAndContinue];
```

Collaborator

I guess after this point we should check if the rootCommandBuffer's status is committed. If so, then wait until completed.

pytorchmergebot pushed a commit that referenced this pull request Nov 22, 2024
See #140725 (comment)
Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: #141296
Approved by: https://github.com/huydhn
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
@skotapati
Contributor Author

I'm no longer able to repro this issue on current main, even when putting the system under heavy load or using a very large tensor, both of which repro'd the issue reliably earlier. I'll try to figure out why the deadlock occurred and how it was resolved, but other than that I think this should be ready to merge.

@github-actions
Contributor

github-actions bot commented Mar 8, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Mar 8, 2025
@github-actions github-actions bot closed this Apr 7, 2025