[CUDA12] Conditionally set device in autograd engine#91191

Closed
Aidyn-A wants to merge 4 commits into pytorch:master from Aidyn-A:cuda12_autograd_conditionally_set_device

Conversation

@Aidyn-A
Collaborator

@Aidyn-A Aidyn-A commented Dec 20, 2022

CUDA 12 introduces a behavioral change in cudaSetDevice. In older versions it would just set the device to be used for kernel launches and memory allocations without creating a CUDA context. In CUDA 12, the first call to cudaSetDevice for a given device creates a CUDA context on it. See issue #91122.

The autograd engine iterates over all devices and sets them:

for (const auto i : c10::irange(num_devices)) {
  std::thread t(&Engine::thread_init, this, i, device_ready_queues_[i], true);
  t.detach();
}

and each thread_init in turn calls:

set_device(device);

This pollutes sibling devices with unnecessary CUDA contexts.
This PR introduces a workaround for the issue by setting the device conditionally.
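The workaround can be sketched with a small Python mock of the CUDA runtime. All names here (`CudaHooks`, `has_primary_context`) are illustrative stand-ins for `at::detail::getCUDAHooks().hasPrimaryContext(device)`, not the real PyTorch API:

```python
# Sketch of the conditional device-setting idea, using a mock in place of
# the real CUDA runtime.

class CudaHooks:
    def __init__(self):
        self.contexts = set()  # devices that already have a primary context

    def has_primary_context(self, device):
        return device in self.contexts

    def set_device(self, device):
        # CUDA 12 behavior: the first cudaSetDevice on a device creates a context.
        self.contexts.add(device)

def thread_init(hooks, device, unconditional=False):
    """Only touch the device if it already has a context (the workaround)."""
    if unconditional or hooks.has_primary_context(device):
        hooks.set_device(device)

# Old behavior: every worker thread sets its device unconditionally,
# creating contexts on all sibling devices.
old = CudaHooks()
old.set_device(0)  # forward pass initialized device 0
for dev in range(4):
    thread_init(old, dev, unconditional=True)
assert old.contexts == {0, 1, 2, 3}  # contexts polluted onto devices 1-3

# New behavior: only devices that already had a context are touched.
new = CudaHooks()
new.set_device(0)
for dev in range(4):
    thread_init(new, dev)
assert new.contexts == {0}  # no spurious contexts
```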

@pytorch-bot

pytorch-bot bot commented Dec 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91191

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 63facca:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Aidyn-A Aidyn-A marked this pull request as ready for review December 21, 2022 21:30
@Aidyn-A
Collaborator Author

Aidyn-A commented Dec 21, 2022

cc @ngimel

@Aidyn-A
Collaborator Author

Aidyn-A commented Dec 21, 2022

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 21, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 additional jobs have failed, first few of them are: trunk, trunk / macos-12-py3-arm64 / test (default, 2, 2, macos-m1-12)


// better.

#if defined(USE_CUDA)
if (at::detail::getCUDAHooks().hasPrimaryContext(device)) {
Collaborator

But does that mean that if the device is not initialized yet then the current device will never be set on the worker thread? That sounds bad no?

Collaborator Author

The current device will always be initialized at this point, because prior to this call PyTorch would have set the current device for memory allocations and kernel launches in the forward pass.

Collaborator

It's possible to run a custom autograd function that would run afoul of this, but we already operate under the assumption that no new contexts are going to be created in autograd:

// stash_current_streams() stashed streams for all device IDs that already
// had a CUDA context before the GraphTask executed. For inactive devices,
// it stashed a c10::nullopt. I don't expect GraphTask's backward pass ran
// leaf nodes on any new devices, so the stashed streams should be enough.
// If leaf_stream.device_index() happens to be for a new device,
// operator* on the c10::nullopt should throw an error.

Collaborator

@albanD albanD Dec 26, 2022

I'm still confused.
The following sounds valid, but the second backward call will use worker threads that are not properly on the right device:

import torch

a = torch.rand(10, requires_grad=True)
a.sum().backward() # This call creates the worker threads, no cuda context here

b = torch.rand(10, device="cuda", requires_grad=True)
b.sum().backward() # This one will now run with no current device set!!

Collaborator

Ok, then would setting the device for every thread somewhere here:

task.stash_current_streams();

be the right place?

Collaborator

You mean when you run it locally?

Collaborator Author

Both locally and CI

Collaborator

@albanD albanD Jan 20, 2023

Well, I do expect the code to run, but you will need a bit more meat there to check that you're on the right device.
I don't have a multi-GPU machine right now, but my guess is that the following will print all 0s instead of a 1 for the second BW:

import torch

class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        print("FW", torch.cuda.current_device())
        return x.clone()

    @staticmethod
    def backward(ctx, gO):
        print("BW", torch.cuda.current_device())
        return gO

a = torch.rand(10, requires_grad=True)
MyFn.apply(a).sum().backward() # This call creates the worker threads, no cuda context here

b = torch.rand(10, device="cuda:1", requires_grad=True)
MyFn.apply(b).sum().backward() # This one will now run with no current device set!!

Curious to see what the result is if you have a multi-gpu machine at hand.

Collaborator Author

Works fine locally:

FW 0
BW 0
FW 1
BW 1

# And if I print them
# a is tensor([0.2930, 0.1664, 0.7401, 0.5832, 0.2043, 0.9405, 0.6830, 0.8309, 0.1875, 0.8577], requires_grad=True)
# a.grad is tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

# b is tensor([0.1209, 0.5240, 0.6959, 0.3283, 0.1513, 0.7672, 0.0403, 0.4019, 0.3628, 0.5973], device='cuda:1', requires_grad=True)
# b.grad is tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:1')

Collaborator

It could work by virtue of device guards, and the threads' default device outside the guards could still be 0.

@Aidyn-A
Collaborator Author

Aidyn-A commented Dec 22, 2022

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Mar 13, 2023
This is a device-agnostic version of #91191.
This PR exists because of the device-agnostic policy of the autograd engine: the compile-time `USE_CUDA` check is not supported there, so doing something like:
https://github.com/pytorch/pytorch/blob/fa1ea9f9bcaa77c1370468059be95ad9b421f500/torch/csrc/autograd/engine.cpp#L351-L357
is not effective.

In this PR, a check on the CUDA devices in the device registry is added so that the threads set the same CUDA device.
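That runtime check can be sketched as a toy model in Python. `device_registry`, `pick_device_with_context`, and `init_workers` are illustrative names for this sketch, not actual PyTorch internals:

```python
# Toy model of the runtime check: instead of a compile-time
# #if defined(USE_CUDA) branch, consult the device registry at runtime and
# have every worker thread set the same, already-initialized CUDA device.

def pick_device_with_context(device_registry):
    """Return the first device that already has a primary context, else None."""
    for dev, has_context in sorted(device_registry.items()):
        if has_context:
            return dev
    return None

def init_workers(device_registry, num_threads):
    """Devices each worker thread would set under the runtime check."""
    target = pick_device_with_context(device_registry)
    return [target] * num_threads if target is not None else []

# Only device 0 was initialized by the forward pass: all threads set device 0.
assert init_workers({0: True, 1: False, 2: False}, num_threads=3) == [0, 0, 0]
# No CUDA context anywhere yet: no thread touches any device.
assert init_workers({0: False, 1: False}, num_threads=2) == []
```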
Pull Request resolved: #92354
Approved by: https://github.com/albanD, https://github.com/ngimel

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged open source
