Fix MPS synchronize by waiting for root buffer to complete #140725
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140725
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 5b88c10 with merge base e4a05de.
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
|
@skotapati this fixes the 2nd issue (when an explicit synchronize is used), but not the 1st one, where an async copy should not require synchronization at all, since from PyTorch's point of view it is done in the context of the same stream.
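A rough Python sketch of the two scenarios described above; the op, tensor shape, and dtype here are arbitrary and chosen only for illustration:

```python
import torch

# 2nd issue (addressed by this PR): an explicit synchronize after queued MPS
# work must block until the root command buffer has actually completed.
x = torch.rand(1024, device="mps")
y = x * 2.0
torch.mps.synchronize()  # should not return before the kernel above finishes

# 1st issue (not addressed here): an async copy enqueued in the context of the
# same stream should not require explicit synchronization from PyTorch's
# point of view.
z = y.to("cpu", non_blocking=True)
```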
|
I think we need this (pytorch/aten/src/ATen/mps/MPSStream.mm, lines 131 to 133 in 891ba2e):
if (!_enableCommitAndContinue) {
if (_prevCommandBuffer) {
// the previous command buffer (if exists) has already been committed,
// so we just wait until it's completed and then dispose it.
[_prevCommandBuffer waitUntilCompleted];
[_prevCommandBuffer release];
_prevCommandBuffer = nil;
}
_prevCommandBuffer = _commandBuffer;
}
|
Force-pushed from de3dd40 to b512e5a.
|
@pytorchbot merge -f "LGTM" |
|
You need to provide a reason for using force merge, in the format @pytorchbot merge -f 'Explanation'.
|
|
@pytorchbot merge -f "Lint + MPS tests are green" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team. |
|
@pytorchbot revert -m "It causes deadlocks when I try to run something benchmark from #127242" -c nosignal |
|
@pytorchbot successfully started a revert job. Check the current status here. |
…140725)" This reverts commit 9bc9d4c. Reverted #140725 on behalf of https://github.com/malfet due to It causes deadlocks when I try to run something benchmark from #127242 ([comment](#140725 (comment)))
|
@skotapati your PR has been successfully reverted. |
|
If this change is landed, running something like

import torch
from timeit import default_timer
from torch.utils.benchmark import Measurement, Timer


def bench_cross(
    m,
    dtype=torch.float32,
    device: str = "cpu",
) -> Measurement:
    setup = f"""
    x = torch.rand({m}, 3, dtype={dtype}, device="{device}")
    y = torch.rand({m}, 3, dtype={dtype}, device="{device}")
    """
    t = Timer(
        stmt="torch.cross(x, y);torch.mps.synchronize()", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()


if __name__ == "__main__":
    torch.set_num_threads(1)
    print(bench_cross(1024, device='mps'))

causes a deadlock with the backtrace quoted further down in this thread.
malfet left a comment:
Before the next attempt to land it, please provide a quick description of why it was causing the deadlock.
|
A much simpler reproducer (going to turn it into a regression test):

import torch
y = torch.nextafter(torch.ones(1024, device='mps'), torch.zeros(1024, device='mps'))
torch.mps.synchronize()
torch.mps.synchronize()
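A minimal sketch of how this reproducer could be wrapped into a regression test; the class and test names are hypothetical, and the actual test added to the suite may differ:

```python
import unittest

import torch


class TestMPSSynchronize(unittest.TestCase):  # hypothetical class name, for illustration only
    @unittest.skipUnless(torch.backends.mps.is_available(), "requires an MPS device")
    def test_repeated_synchronize_does_not_deadlock(self):
        # Repeated torch.mps.synchronize() calls after an MPS op must return
        # promptly instead of hanging inside waitUntilCompleted.
        y = torch.nextafter(torch.ones(1024, device="mps"), torch.zeros(1024, device="mps"))
        torch.mps.synchronize()
        torch.mps.synchronize()
        self.assertEqual(y.shape, torch.Size([1024]))


if __name__ == "__main__":
    unittest.main()
```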
qqaatw left a comment:
I couldn't repro with Nikita's scripts though.
void MPSStream::commitAndContinueRoot() {
  assert(_commandBuffer);
  id<MTLCommandBuffer> rootCommandBuffer = [_commandBuffer rootCommandBuffer];
  [_commandBuffer commitAndContinue];
I guess after this point we should check if the rootCommandBuffer's status is committed. If so, then wait until completed.
See #140725 (comment)

Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`

```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: #141296
Approved by: https://github.com/huydhn
…ytorch#140725)" This reverts commit 9bc9d4c. Reverted pytorch#140725 on behalf of https://github.com/malfet due to It causes deadlocks when I try to run something benchmark from pytorch#127242 ([comment](pytorch#140725 (comment)))
See pytorch#140725 (comment) Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]` ``` (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12 frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84 frame pytorch#2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40 frame pytorch#3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100 frame pytorch#4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92 frame pytorch#5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040 frame pytorch#6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200 frame pytorch#7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104 frame pytorch#8: 0x0000000100fccbe4 Python`run_mod + 168 frame pytorch#9: 0x0000000100fcb518 Python`pyrun_file + 164 frame pytorch#10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256 frame pytorch#11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80 frame pytorch#12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164 frame pytorch#13: 0x0000000100ff1ce4 Python`pymain_run_file + 72 frame pytorch#14: 0x0000000100ff0f74 Python`Py_RunMain + 988 frame pytorch#15: 0x0000000100ff1564 Python`pymain_main + 304 frame pytorch#16: 0x0000000100ff1604 Python`Py_BytesMain + 40 frame pytorch#17: 0x000000019f630274 dyld`start + 2840 ``` Pull Request resolved: pytorch#141296 Approved by: https://github.com/huydhn
Force-pushed from 1b930dc to 5b88c10.
|
I'm no longer able to repro this issue on current main, even when putting the system under heavy load or using a very large tensor, which repro'd the issue reliably earlier. I'll try to figure out why the deadlock occurred and how it was resolved, but other than that I think this should be ready to merge. |
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`.
Makes #139550 (comment) work