Make Tensor.__dlpack__(stream=None) capture-safe during CUDA Graph capture
#163242
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163242
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f21bd9e with merge base 28c42cc.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Why don't we want to keep the no-sync behavior if stream is None? Having different behavior during capture and outside of it is a nightmare.
The problem is that we are not allowed to sync the default stream during capture (cudaErrorStreamCaptureInvalidated). Before #150217 we didn't have this behavior.
Exactly, and I'm questioning the changes introduced by #150217, as they effectively make capture impossible, and I'm strongly against introducing different behavior under capture and without capture.
Note that there is stream = -1, which indicates no sync; however, it seems most implementations still default to None (which is specified as sync to the default stream). On a related note, the new speed convert RFC #162845 adds a stream query, which will make the behavior more streamlined (and compatible with cudagraph). Of course, we are talking about the default behavior of dlpack: the original reasoning is that None is a safe choice (without considering cudagraph), and the dlpack interface would require an explicit sync to the default stream. Considering cudagraph, I think defaulting to the current stream in that mode does seem more sensible, but the no-sync default behavior in stream mode might indeed cause some issues, as some consumers may expect a sync on the stream passed. cc @leofang
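To make the stream-value convention under discussion concrete, here is a toy model (plain Python, not PyTorch's actual implementation) of how a CUDA producer might interpret the `stream` argument of `__dlpack__`, assuming the array-API convention that None/-1 mean no sync (post-fix semantics), 1 is the legacy default stream, 2 is the per-thread default stream, and larger values are raw stream handles:

```python
# Toy sketch of producer-side stream handling for __dlpack__ on CUDA.
# Assumptions (not real PyTorch code): after this PR, stream=None behaves
# like stream=-1 (no synchronization), which is safe during graph capture.

LEGACY_DEFAULT = 1
PER_THREAD_DEFAULT = 2

def resolve_sync_target(stream):
    """Return the stream the producer must synchronize with,
    or None when no cross-stream sync is required."""
    if stream is None or stream == -1:
        # Post-fix: None matches -1, so no wait on the legacy default
        # stream is recorded, keeping CUDA Graph capture valid.
        return None
    if stream in (LEGACY_DEFAULT, PER_THREAD_DEFAULT):
        return stream
    if stream > 2:
        return stream  # consumer-provided raw CUDA stream handle
    # stream=0 is disallowed on CUDA due to ambiguity between defaults
    raise ValueError(f"invalid stream value: {stream}")
```

Under the pre-#163242 semantics, the first branch would instead map None to LEGACY_DEFAULT, which is exactly the wait that invalidates capture on a non-default stream.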
Please fix the PR description to match what the PR does.
Just want to send a follow-up note: it might be worthwhile to bring awareness and discuss the DLPack spec; ideally the spec should get updated to reflect these needs. Note that the same behavior can be achieved by passing in
Personally I am OK with either, and a stream-preserving call is an important path. If I had to pick initially, I personally might even prefer the preserving one. Just want to make sure of some consistency with the data-api spec, so we don't have surprises when people call
@ysiraichi are you able to weigh in here?
Just to elaborate a bit further on the choices here for clarity. The canonical way of using dlpack is to call

```
s = torch.cuda.Stream()
x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.stream(s):
    with torch.cuda.graph(g):
        _ = x + 1
        mylib_tensor = mylib.from_dlpack(x)
        mylib_kernel(mylib_tensor)
```

We have two choices for mylib to implement cudagraph support. C0: simply call
How about we change
Opened data-apis/array-api#974 for discussion in data-api.
Flipping the default to -1 sounds good to me.
Since this makes it so
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Make Tensor.__dlpack__(stream=None) capture-safe during CUDA Graph capture (#163242)

Many extensions (including pybind helpers) call `Tensor.__dlpack__()` without a stream argument. Before #150217, `stream=None` behaved like "no cross-stream sync" and was safe inside CUDA Graph capture. After #150217, `stream=None` maps to the legacy default stream, adding a cross-stream wait that invalidates capture when running on a non-default stream. See this example:

```
import torch

s = torch.cuda.Stream()
x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.stream(s):
    with torch.cuda.graph(g):
        _ = x + 1
        cap = x.__dlpack__()
        _ = torch.utils.dlpack.from_dlpack(cap)
```

This PR partially reverts #150217 so that stream=None defaults to no sync.

Pull Request resolved: #163242
Approved by: https://github.com/ngimel
cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng