Skip to content

[CUDA12] Autograd engine use current device only#92354

Closed
Aidyn-A wants to merge 16 commits intopytorch:masterfrom
Aidyn-A:cuda12_autograd_use_current_device_only
Closed

[CUDA12] Autograd engine use current device only#92354
Aidyn-A wants to merge 16 commits intopytorch:masterfrom
Aidyn-A:cuda12_autograd_use_current_device_only

Conversation

@Aidyn-A
Copy link
Collaborator

@Aidyn-A Aidyn-A commented Jan 18, 2023

This is a device agnostic version #91191.
The reason of existence of this PR is device agnostic policy of autograd engine. Hence, the compile time USE_CUDA is not supported, so doing something like:

#if defined(USE_CUDA)
if (at::detail::getCUDAHooks().hasPrimaryContext(device)) {
set_device(device);
}
#else
set_device(device);
#endif

is not effective.

In this PR a check upon CUDA devices in device registry is added such that threads set the same CUDA device.

@pytorch-bot
Copy link

pytorch-bot bot commented Jan 18, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92354

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9bfd185:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Aidyn-A Aidyn-A changed the title [CUDA12] autograd engine use current device only [CUDA12] Autograd engine use current device only Jan 18, 2023
@Aidyn-A Aidyn-A marked this pull request as ready for review March 1, 2023 16:19
@Aidyn-A Aidyn-A requested review from albanD and soulitzer as code owners March 1, 2023 16:19
@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Mar 1, 2023

Continuing from #94929 (comment)
@albanD this PR is independent on CUDA version and can be tested and benchmarked now.

@soulitzer soulitzer removed their request for review March 2, 2023 01:35
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 2, 2023
@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Mar 6, 2023

@albanD do you have any comments on this PR?

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would deserve a more detailed comment about what it does.
But that only helps if you already initialized cuda on at least one device?

If you're doing such a disruptive change, I guess at this point you might as well just move the set_device into the inner loop after we get work:

// This will only work if the worker is running a non backward task

And make sure that when we don't change the device, this will be super cheap to do ?
cc @ngimel

@Aidyn-A Aidyn-A force-pushed the cuda12_autograd_use_current_device_only branch from 53155f0 to 375569a Compare March 9, 2023 21:08
@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Mar 10, 2023

@pytorchbot label "topic: not user facing", "ciflow/trunk"

@pytorch-bot
Copy link

pytorch-bot bot commented Mar 10, 2023

Didn't find following labels among repository labels: topic: not user facing,

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 10, 2023
@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Mar 10, 2023

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 10, 2023
@Aidyn-A Aidyn-A force-pushed the cuda12_autograd_use_current_device_only branch from aa3b32d to 76ab85d Compare March 10, 2023 19:06
Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update the change is at the right place, only a small perf concern

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Mar 13, 2023

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 23, 2023
This is a device agnostic version #91191.
The reason of existence of this PR is device agnostic policy of autograd engine. Hence, the compile time `USE_CUDA` is not supported, so doing something like:
https://github.com/pytorch/pytorch/blob/fa1ea9f9bcaa77c1370468059be95ad9b421f500/torch/csrc/autograd/engine.cpp#L351-L357
is not effective.

In this PR a check upon CUDA devices in device registry is added such that threads set the same CUDA device.
Pull Request resolved: pytorch/pytorch#92354
Approved by: https://github.com/albanD, https://github.com/ngimel
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 27, 2023
This is a device agnostic version #91191.
The reason of existence of this PR is device agnostic policy of autograd engine. Hence, the compile time `USE_CUDA` is not supported, so doing something like:
https://github.com/pytorch/pytorch/blob/fa1ea9f9bcaa77c1370468059be95ad9b421f500/torch/csrc/autograd/engine.cpp#L351-L357
is not effective.

In this PR a check upon CUDA devices in device registry is added such that threads set the same CUDA device.
Pull Request resolved: pytorch/pytorch#92354
Approved by: https://github.com/albanD, https://github.com/ngimel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants