[MPS] Add regression test for sync deadlock #141296

malfet · 2024-11-21T22:33:29Z

See #140725 (comment)
Running torch.mps.synchronize() after a Metal kernel resulted in an infinite wait inside [_MTLCommandBuffer waitUntilCompleted]:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
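
A hang like the one in this backtrace never returns control to the test, so one common way to guard against it is to run the suspect code in a child interpreter with a timeout. The sketch below is a generic illustration of that pattern and is not the test added by this PR: the helper name `finishes_in_time`, the 60-second default, and the placeholder workload in `MPS_SNIPPET` are all assumptions made for the example. Only `torch.backends.mps.is_available()` and `torch.mps.synchronize()` are taken as existing PyTorch calls.

```python
import subprocess
import sys


def finishes_in_time(snippet: str, timeout_s: float = 60.0) -> bool:
    """Run `snippet` in a fresh interpreter.

    Returns True only if the child exits with status 0 before the timeout.
    A hang (for example, a synchronize call that never returns) shows up
    as False instead of blocking the calling test forever.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return False
    return proc.returncode == 0


# Hypothetical workload: queue some MPS work, then synchronize.
# This is a placeholder, not the kernel that triggered the deadlock.
MPS_SNIPPET = """
import torch
if torch.backends.mps.is_available():
    x = torch.ones(1024, device="mps")
    y = (x * 2).sum()
    torch.mps.synchronize()
"""
```

A test could then assert `finishes_in_time(MPS_SNIPPET)` on a machine with PyTorch installed. On a machine without MPS the snippet skips the device work and exits immediately.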

See

pytorch-bot · 2024-11-21T22:33:33Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141296

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[DomainsOnly] Jobs fail with GLIBC version not found

❌ 19 New Failures

As of commit 1ed6736 with merge base 7b2138b:

NEW FAILURES - The following jobs have failed:

pull / cuda12.1-py3.10-gcc9-sm75 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-cuda11.8-py3.10-gcc9 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-cuda12.1-py3.10-gcc9 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3_9-clang9-xla / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3.11-clang10 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3.12-clang10 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3.9-clang10 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3.9-clang10-onnx / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-rocm6.2-py3.10 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-cuda11.8-cudnn9-py3.9-clang12 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3-clang12-executorch / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3-clang12-mobile-build / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.10-clang15-asan / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.9-gcc11 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.9-gcc11-mobile-lightweight-dispatch-build / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.9-gcc11-no-ops / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.9-gcc11-pch / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / win-vs2019-cpu-py3 / build (gh)
sccache: error: couldn't connect to server

This comment was automatically generated by Dr. CI and updates every 15 minutes.

malfet · 2024-11-22T00:54:51Z

@pytorchbot merge -f "Lint + MPS tests are green"

pytorchmergebot · 2024-11-22T00:56:21Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

See pytorch#140725 (comment)

Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`

```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: pytorch#141296
Approved by: https://github.com/huydhn

[MPS] Add regression test for sync deadlock

6c48a2c

See

malfet requested a review from kulinseth as a code owner November 21, 2024 22:33

pytorch-bot bot added the ciflow/mps (Run MPS tests (subset of trunk)) and topic: not user facing (topic category) labels Nov 21, 2024

Update test_mps.py

1ed6736

malfet requested a review from Skylion007 November 21, 2024 23:27

huydhn approved these changes Nov 22, 2024

pytorchmergebot added the merging label Nov 22, 2024

pytorchmergebot added the Merged label Nov 22, 2024

pytorchmergebot closed this in 65166d8 Nov 22, 2024

pytorchmergebot removed the merging label Nov 22, 2024

malfet deleted the malfet-patch-36 branch December 12, 2024 22:30

4 participants
