Fix MPS synchronize by waiting for root buffer to complete #140725
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140725
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 5b88c10 with merge base e4a05de.
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
|
@skotapati this fixes the 2nd issue (when an explicit synchronize is used), but not the 1st one, where an async copy should not require synchronization at all, since from PyTorch's point of view it is done in the context of the same stream.
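A rough Python sketch of the two scenarios described above; the op, tensor shape, and dtype here are arbitrary and chosen only for illustration:

```python
import torch

# 2nd issue (addressed by this PR): an explicit synchronize after queued MPS
# work must block until the root command buffer has actually completed.
x = torch.rand(1024, device="mps")
y = x * 2.0
torch.mps.synchronize()  # should not return before the kernel above finishes

# 1st issue (not addressed here): an async copy enqueued in the context of the
# same stream should not require explicit synchronization from PyTorch's
# point of view.
z = y.to("cpu", non_blocking=True)
```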
|
I think we need this (pytorch/aten/src/ATen/mps/MPSStream.mm, lines 131 to 133 in 891ba2e):
if (!_enableCommitAndContinue) {
if (_prevCommandBuffer) {
// the previous command buffer (if exists) has already been committed,
// so we just wait until it's completed and then dispose it.
[_prevCommandBuffer waitUntilCompleted];
[_prevCommandBuffer release];
_prevCommandBuffer = nil;
}
_prevCommandBuffer = _commandBuffer;
}
|
Force-pushed from de3dd40 to b512e5a.
|
@pytorchbot merge -f "LGTM" |
|
You need to provide a reason for using force merge, in the format @pytorchbot merge -f 'Explanation'.
|
|
@pytorchbot merge -f "Lint + MPS tests are green" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team. |
|
@pytorchbot revert -m "It causes deadlocks when I try to run something benchmark from #127242" -c nosignal |
|
@pytorchbot successfully started a revert job. Check the current status here. |
…140725)" This reverts commit 9bc9d4c. Reverted #140725 on behalf of https://github.com/malfet due to It causes deadlocks when I try to run something benchmark from #127242 ([comment](#140725 (comment)))
|
@skotapati your PR has been successfully reverted. |
|
If this change is landed, running something like

import torch
from timeit import default_timer
from torch.utils.benchmark import Measurement, Timer


def bench_cross(
    m,
    dtype=torch.float32,
    device: str = "cpu",
) -> Measurement:
    setup = f"""
    x = torch.rand({m}, 3, dtype={dtype}, device="{device}")
    y = torch.rand({m}, 3, dtype={dtype}, device="{device}")
    """
    t = Timer(
        stmt="torch.cross(x, y);torch.mps.synchronize()", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()


if __name__ == "__main__":
    torch.set_num_threads(1)
    print(bench_cross(1024, device='mps'))

causes a deadlock with the backtrace quoted further down in this thread.
malfet left a comment:
Before the next attempt to land it, please provide a quick description of why it was causing the deadlock.
|
A much simpler reproducer (going to turn it into a regression test):

import torch
y = torch.nextafter(torch.ones(1024, device='mps'), torch.zeros(1024, device='mps'))
torch.mps.synchronize()
torch.mps.synchronize()
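A minimal sketch of how this reproducer could be wrapped into a regression test; the class and test names are hypothetical, and the actual test added to the suite may differ:

```python
import unittest

import torch


class TestMPSSynchronize(unittest.TestCase):  # hypothetical class name, for illustration only
    @unittest.skipUnless(torch.backends.mps.is_available(), "requires an MPS device")
    def test_repeated_synchronize_does_not_deadlock(self):
        # Repeated torch.mps.synchronize() calls after an MPS op must return
        # promptly instead of hanging inside waitUntilCompleted.
        y = torch.nextafter(torch.ones(1024, device="mps"), torch.zeros(1024, device="mps"))
        torch.mps.synchronize()
        torch.mps.synchronize()
        self.assertEqual(y.shape, torch.Size([1024]))


if __name__ == "__main__":
    unittest.main()
```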
qqaatw left a comment:
I couldn't repro with Nikita's scripts though.
void MPSStream::commitAndContinueRoot() {
  assert(_commandBuffer);
  id<MTLCommandBuffer> rootCommandBuffer = [_commandBuffer rootCommandBuffer];
  [_commandBuffer commitAndContinue];
I guess after this point we should check if the rootCommandBuffer's status is committed. If so, then wait until completed.
See #140725 (comment)

Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`

```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: #141296
Approved by: https://github.com/huydhn
…ytorch#140725)" This reverts commit 9bc9d4c. Reverted pytorch#140725 on behalf of https://github.com/malfet due to It causes deadlocks when I try to run something benchmark from pytorch#127242 ([comment](pytorch#140725 (comment)))
See pytorch#140725 (comment) Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]` ``` (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12 frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84 frame pytorch#2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40 frame pytorch#3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100 frame pytorch#4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92 frame pytorch#5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040 frame pytorch#6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200 frame pytorch#7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104 frame pytorch#8: 0x0000000100fccbe4 Python`run_mod + 168 frame pytorch#9: 0x0000000100fcb518 Python`pyrun_file + 164 frame pytorch#10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256 frame pytorch#11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80 frame pytorch#12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164 frame pytorch#13: 0x0000000100ff1ce4 Python`pymain_run_file + 72 frame pytorch#14: 0x0000000100ff0f74 Python`Py_RunMain + 988 frame pytorch#15: 0x0000000100ff1564 Python`pymain_main + 304 frame pytorch#16: 0x0000000100ff1604 Python`Py_BytesMain + 40 frame pytorch#17: 0x000000019f630274 dyld`start + 2840 ``` Pull Request resolved: pytorch#141296 Approved by: https://github.com/huydhn
Force-pushed from 1b930dc to 5b88c10.
|
I'm no longer able to repro this issue on current main, even when putting the system under heavy load or using a very large tensor, which repro'd the issue reliably earlier. I'll try to figure out why the deadlock occurred and how it was resolved, but other than that I think this should be ready to merge. |
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`.
Makes #139550 (comment) work