[CUDA12] Conditionally set device in autograd engine #91191
Aidyn-A wants to merge 4 commits into pytorch:master
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91191
Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures
As of commit 63facca: This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc @ngimel
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: 2 additional jobs have failed; the first few of them are: trunk, trunk / macos-12-py3-arm64 / test (default, 2, 2, macos-m1-12)
Raised by workflow job
```cpp
// better.
#if defined(USE_CUDA)
if (at::detail::getCUDAHooks().hasPrimaryContext(device)) {
```
But does that mean that if the device is not initialized yet, then the current device will never be set on the worker thread? That sounds bad, no?
The current device will always be initialized at this point, because prior to this call PyTorch would have set the current device for memory allocations and kernel calls in the forward pass.
It's possible to run a custom autograd function that would run afoul of this, but we already operate under the assumption that no new contexts are going to be created in autograd:
pytorch/torch/csrc/autograd/engine.cpp
Lines 681 to 686 in 63facca
I'm still confused.
The following sounds valid, but the second backward call will use worker threads that are not properly on the right device:

```python
import torch

a = torch.rand(10, requires_grad=True)
a.sum().backward()  # This call creates the worker threads, no cuda context here

b = torch.rand(10, device="cuda", requires_grad=True)
b.sum().backward()  # This one will now run with no current device set!!
```
Ok, then would setting the device for every thread somewhere here work?
pytorch/torch/csrc/autograd/engine.cpp
Line 1112 in 63facca
You mean when you run it locally?
Both locally and in CI
Well, I do expect the code to run, but you will need a bit more meat there to check that you're on the right device.
I don't have a multi-gpu machine right now, but my guess is that the following will print all 0s instead of a 1 for the second backward:

```python
import torch

class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        print("FW", torch.cuda.current_device())
        return x.clone()

    @staticmethod
    def backward(ctx, gO):
        print("BW", torch.cuda.current_device())
        return gO

a = torch.rand(10, requires_grad=True)
MyFn.apply(a).sum().backward()  # This call creates the worker threads, no cuda context here

b = torch.rand(10, device="cuda:1", requires_grad=True)
MyFn.apply(b).sum().backward()  # This one will now run with no current device set!!
```

Curious to see what the result is if you have a multi-gpu machine at hand.
Works fine locally:

```
FW 0
BW 0
FW 1
BW 1
```

And if I print them:

```python
# a is tensor([0.2930, 0.1664, 0.7401, 0.5832, 0.2043, 0.9405, 0.6830, 0.8309, 0.1875, 0.8577], requires_grad=True)
# a.grad is tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
# b is tensor([0.1209, 0.5240, 0.6959, 0.3283, 0.1513, 0.7672, 0.0403, 0.4019, 0.3628, 0.5973], device='cuda:1', requires_grad=True)
# b.grad is tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:1')
```
It could work by virtue of device guards: the threads' default device outside the guards could still be 0.
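To illustrate why the script can still print the right device even if the worker thread's default device is wrong: a device guard sets the current device for the duration of an op and restores the previous one on exit, so the value observed inside the op comes from the guard, not from the thread's default. The following is a minimal Python sketch of that save/restore behavior; the `DeviceGuard` class and thread-local state here are illustrative stand-ins, not PyTorch's actual implementation:

```python
import threading

# Illustrative thread-local "current device", defaulting to 0.
# This is NOT PyTorch's real device state, just a toy model.
_state = threading.local()

def current_device():
    return getattr(_state, "device", 0)

class DeviceGuard:
    """Sets the current device on entry and restores the previous one on
    exit, mimicking how an RAII device guard scopes the device to one op."""
    def __init__(self, device):
        self.device = device

    def __enter__(self):
        self.prev = current_device()
        _state.device = self.device
        return self

    def __exit__(self, *exc):
        _state.device = self.prev
        return False

def run_op(tensor_device):
    # Inside the guard the op sees the tensor's device...
    with DeviceGuard(tensor_device):
        inside = current_device()
    # ...but outside guards the thread falls back to its default (0).
    return inside, current_device()

print(run_op(1))  # (1, 0): the op saw device 1, the thread is back on 0
```

This is consistent with the observation above: the backward op prints the right device thanks to the guard, while the thread's default device outside any guard may still be 0.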
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
This is a device-agnostic version of #91191. The reason for this PR's existence is the device-agnostic policy of the autograd engine: the compile-time `USE_CUDA` flag is not supported there, so doing something like https://github.com/pytorch/pytorch/blob/fa1ea9f9bcaa77c1370468059be95ad9b421f500/torch/csrc/autograd/engine.cpp#L351-L357 is not effective. In this PR, a check of the CUDA devices in the device registry is added such that the threads set the same CUDA device. Pull Request resolved: #92354 Approved by: https://github.com/albanD, https://github.com/ngimel
CUDA 12 introduces a behavioral change in `cudaSetDevice`. In older versions it would just set the device to be used for kernel launches and memory allocations, without creating a CUDA context. Now, in CUDA 12, the first time `cudaSetDevice` is called for a given device, it creates a CUDA context on that device. See issue #91122. The autograd engine iterates over all devices and sets them:
pytorch/torch/csrc/autograd/engine.cpp
Lines 1399 to 1402 in f8b348c
pytorch/torch/csrc/autograd/engine.cpp
Line 349 in f8b348c
This causes pollution of CUDA contexts on sibling devices.
This PR introduces a workaround for this issue by conditionally setting the device.
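The shape of the workaround can be sketched as: only call the (context-creating) set-device operation on devices that already have a primary context, so the engine's loop over all devices no longer initializes devices the program never touched. The following is a toy Python model of that logic under CUDA 12 semantics; the `primary` registry and `contexts` set are hypothetical stand-ins for the driver state, not real PyTorch or CUDA APIs:

```python
# Toy model: under "CUDA 12" semantics, setting a device creates a
# context, so the fix is to only set devices that already have one.

contexts = set()   # devices on which a context has been created
primary = {0}      # suppose only device 0 was used in the forward pass

def cuda_set_device(dev):
    # CUDA 12 behavior: the first cudaSetDevice on a device creates a context.
    contexts.add(dev)

def has_primary_context(dev):
    # Stand-in for querying whether the device already has a primary context.
    return dev in primary

def set_device_conditionally(dev):
    # The workaround: skip devices that were never initialized, so no
    # spurious contexts are created on sibling devices.
    if has_primary_context(dev):
        cuda_set_device(dev)

for dev in range(4):   # the autograd engine iterating over all devices
    set_device_conditionally(dev)

print(sorted(contexts))  # [0] -- no context pollution on devices 1-3
```

Without the `has_primary_context` check, the loop would create contexts on all four toy devices, which is exactly the sibling-device pollution described above.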