[CUDA12] Autograd engine use current device only #92354
Aidyn-A wants to merge 16 commits into pytorch:master
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92354
Note: links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 9bfd185. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Continuing from #94929 (comment)
@albanD do you have any comments on this PR?
albanD left a comment:
This would deserve a more detailed comment about what it does.
But that only helps if you already initialized CUDA on at least one device?
If you're making such a disruptive change, I guess at this point you might as well just move the set_device into the inner loop after we get work:
pytorch/torch/csrc/autograd/engine.cpp, line 516 in b0b5f3c
And make sure that when we don't change the device, this will be super cheap to do?
cc @ngimel
Force-pushed from 53155f0 to 375569a.
@pytorchbot label "topic: not user facing", "ciflow/trunk"
Didn't find following labels among repository labels: topic: not user facing
@pytorchbot label "topic: not user facing" |
Force-pushed from aa3b32d to 76ab85d.
albanD left a comment:
Thanks for the update, the change is at the right place; only a small perf concern.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA: 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This is a device-agnostic version of #91191. The reason this PR exists is the device-agnostic policy of the autograd engine: the compile-time `USE_CUDA` guard is not supported there, so doing something like https://github.com/pytorch/pytorch/blob/fa1ea9f9bcaa77c1370468059be95ad9b421f500/torch/csrc/autograd/engine.cpp#L351-L357 is not effective. This PR adds a check over the CUDA devices in the device registry so that threads set the same CUDA device.

Pull Request resolved: pytorch/pytorch#92354
Approved by: https://github.com/albanD, https://github.com/ngimel