[CUDA12] Conditionally set device in autograd engine#91191

Closed
Aidyn-A wants to merge 4 commits into pytorch:master from Aidyn-A:cuda12_autograd_conditionally_set_device

Conversation

@Aidyn-A
Collaborator

@Aidyn-A Aidyn-A commented Dec 20, 2022

CUDA 12 introduces a behavioral change in cudaSetDevice. In older versions it would just set the device to be used for kernel launches and memory allocations without creating a CUDA context. In CUDA 12, the first call to cudaSetDevice for a given device creates a CUDA context on it. See issue #91122.

The autograd engine iterates over all devices and sets them:

for (const auto i : c10::irange(num_devices)) {
  std::thread t(&Engine::thread_init, this, i, device_ready_queues_[i], true);
  t.detach();
}

and each thread_init in turn calls:

set_device(device);

This pollutes sibling devices with unnecessary CUDA contexts.
This PR introduces a workaround for the issue by setting the device conditionally.
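The workaround can be sketched with a small Python mock of the CUDA runtime. All names here (`CudaHooks`, `has_primary_context`) are illustrative stand-ins for `at::detail::getCUDAHooks().hasPrimaryContext(device)`, not the real PyTorch API:

```python
# Sketch of the conditional device-setting idea, using a mock in place of
# the real CUDA runtime.

class CudaHooks:
    def __init__(self):
        self.contexts = set()  # devices that already have a primary context

    def has_primary_context(self, device):
        return device in self.contexts

    def set_device(self, device):
        # CUDA 12 behavior: the first cudaSetDevice on a device creates a context.
        self.contexts.add(device)

def thread_init(hooks, device, unconditional=False):
    """Only touch the device if it already has a context (the workaround)."""
    if unconditional or hooks.has_primary_context(device):
        hooks.set_device(device)

# Old behavior: every worker thread sets its device unconditionally,
# creating contexts on all sibling devices.
old = CudaHooks()
old.set_device(0)  # forward pass initialized device 0
for dev in range(4):
    thread_init(old, dev, unconditional=True)
assert old.contexts == {0, 1, 2, 3}  # contexts polluted onto devices 1-3

# New behavior: only devices that already had a context are touched.
new = CudaHooks()
new.set_device(0)
for dev in range(4):
    thread_init(new, dev)
assert new.contexts == {0}  # no spurious contexts
```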

@pytorch-bot

pytorch-bot bot commented Dec 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91191

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 63facca:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Aidyn-A Aidyn-A marked this pull request as ready for review December 21, 2022 21:30
@Aidyn-A
Collaborator Author

Aidyn-A commented Dec 21, 2022

cc @ngimel

@Aidyn-A
Collaborator Author

Aidyn-A commented Dec 21, 2022

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 21, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 additional jobs have failed, first few of them are: trunk, trunk / macos-12-py3-arm64 / test (default, 2, 2, macos-m1-12)


// better.

#if defined(USE_CUDA)
if (at::detail::getCUDAHooks().hasPrimaryContext(device)) {
Collaborator

But does that mean that if the device is not initialized yet then the current device will never be set on the worker thread? That sounds bad no?

Collaborator Author

The current device will always be initialized at this point, because prior to this call PyTorch would have set the current device for memory allocations and kernel launches in the forward pass.

Collaborator

It's possible to run a custom autograd function that would run afoul of this, but we already operate under the assumption that no new contexts are going to be created in autograd:

// stash_current_streams() stashed streams for all device IDs that already
// had a CUDA context before the GraphTask executed. For inactive devices,
// it stashed a c10::nullopt. I don't expect GraphTask's backward pass ran
// leaf nodes on any new devices, so the stashed streams should be enough.
// If leaf_stream.device_index() happens to be for a new device,
// operator* on the c10::nullopt should throw an error.

Collaborator

@albanD albanD Dec 26, 2022

I'm still confused.
The following sounds valid, but the second backward call will use worker threads that are not properly on the right device:

import torch

a = torch.rand(10, requires_grad=True)
a.sum().backward() # This call creates the worker threads, no cuda context here

b = torch.rand(10, device="cuda", requires_grad=True)
b.sum().backward() # This one will now run with no current device set!!

Collaborator

Ok, then would setting the device for every thread somewhere here:

task.stash_current_streams();

be the right place?

Collaborator

You mean when you run it locally?

Collaborator Author

Both locally and CI

Collaborator

@albanD albanD Jan 20, 2023

Well, I do expect the code to run, but you will need a bit more meat there to check that you're on the right device.
I don't have a multi-GPU machine right now, but my guess is that the following will print all 0s instead of a 1 for the second BW:

import torch

class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        print("FW", torch.cuda.current_device())
        return x.clone()

    @staticmethod
    def backward(ctx, gO):
        print("BW", torch.cuda.current_device())
        return gO

a = torch.rand(10, requires_grad=True)
MyFn.apply(a).sum().backward() # This call creates the worker threads, no cuda context here

b = torch.rand(10, device="cuda:1", requires_grad=True)
MyFn.apply(b).sum().backward() # This one will now run with no current device set!!

Curious to see what the result is if you have a multi-gpu machine at hand.

Collaborator Author

Works fine locally:

FW 0
BW 0
FW 1
BW 1

# And if I print them
# a is tensor([0.2930, 0.1664, 0.7401, 0.5832, 0.2043, 0.9405, 0.6830, 0.8309, 0.1875, 0.8577], requires_grad=True)
# a.grad is tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

# b is tensor([0.1209, 0.5240, 0.6959, 0.3283, 0.1513, 0.7672, 0.0403, 0.4019, 0.3628, 0.5973], device='cuda:1', requires_grad=True)
# b.grad is tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:1')

Collaborator

It could work by virtue of device guards, and the threads' default device outside the guards could still be 0.

@Aidyn-A
Collaborator Author

Aidyn-A commented Dec 22, 2022

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Mar 13, 2023
This is a device-agnostic version of #91191.
This PR exists because of the device-agnostic policy of the autograd engine: the compile-time `USE_CUDA` check is not supported there, so doing something like:
https://github.com/pytorch/pytorch/blob/fa1ea9f9bcaa77c1370468059be95ad9b421f500/torch/csrc/autograd/engine.cpp#L351-L357
is not effective.

In this PR, a check on the CUDA devices in the device registry is added so that the threads set the same CUDA device.
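That runtime check can be sketched as a toy model in Python. `device_registry`, `pick_device_with_context`, and `init_workers` are illustrative names for this sketch, not actual PyTorch internals:

```python
# Toy model of the runtime check: instead of a compile-time
# #if defined(USE_CUDA) branch, consult the device registry at runtime and
# have every worker thread set the same, already-initialized CUDA device.

def pick_device_with_context(device_registry):
    """Return the first device that already has a primary context, else None."""
    for dev, has_context in sorted(device_registry.items()):
        if has_context:
            return dev
    return None

def init_workers(device_registry, num_threads):
    """Devices each worker thread would set under the runtime check."""
    target = pick_device_with_context(device_registry)
    return [target] * num_threads if target is not None else []

# Only device 0 was initialized by the forward pass: all threads set device 0.
assert init_workers({0: True, 1: False, 2: False}, num_threads=3) == [0, 0, 0]
# No CUDA context anywhere yet: no thread touches any device.
assert init_workers({0: False, 1: False}, num_threads=2) == []
```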
Pull Request resolved: #92354
Approved by: https://github.com/albanD, https://github.com/ngimel

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged open source
