
[DO NOT MERGE] [CUDA12] Conditionally set device in device guard#91219

Closed
Aidyn-A wants to merge 28 commits into pytorch:master from
Aidyn-A:cuda12_device_guard_conditionally_set_device

Conversation

@Aidyn-A
Copy link
Collaborator

@Aidyn-A Aidyn-A commented Dec 21, 2022

CUDA 12 introduces a behavioral change in cudaSetDevice. In CUDA 11 and earlier, the call only set the device to be used for subsequent kernel launches and memory allocations, without creating a CUDA context. In CUDA 12, the first cudaSetDevice call for a given device creates a CUDA context on that device. See issue #91122.
This PR introduces a workaround for cases like:

import torch
x = torch.randn(1, device="cuda:1")

in which a primary context would be created on both device 0 and device 1, because device 0 is the default one.
The real danger arises in distributed jobs, where every rank resets the device to 0 in the destructor of the CUDA device guard, creating nproc_per_node CUDA contexts on device 0.

@pytorch-bot
Copy link

pytorch-bot bot commented Dec 21, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91219

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failure

As of commit 790d2d9:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Aidyn-A Aidyn-A marked this pull request as ready for review January 17, 2023 18:52
@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Jan 17, 2023

cc @ngimel

@bdhirsh bdhirsh requested a review from ngimel January 18, 2023 12:43
@bdhirsh bdhirsh added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jan 18, 2023
@lezcano
Copy link
Collaborator

lezcano commented Jan 30, 2023

This is the only PR left to close #91122. PTAL @ngimel

@Aidyn-A Aidyn-A changed the title [CUDA12] Conditionally set device in device guard [DO NOT MERGE] [CUDA12] Conditionally set device in device guard Jan 30, 2023
@Aidyn-A Aidyn-A marked this pull request as draft January 30, 2023 20:36
@Aidyn-A Aidyn-A force-pushed the cuda12_device_guard_conditionally_set_device branch from 32819e7 to f2eaff9 Compare February 9, 2023 22:58
@Aidyn-A
Copy link
Collaborator Author

Aidyn-A commented Feb 15, 2023

The usage of aten in c10 is not preferred. Would it make more sense to use CUDA Driver API in c10 like here #94864?

@jjsjann123
Copy link
Collaborator

Looks like the c10 bzl rule is still setting USE_CUDA, while cmake no longer uses it. Seems like a consistency issue that we should patch up and remove?! https://github.com/pytorch/pytorch/blob/master/c10/cuda/build.bzl#L28

cc'ing @vors

@Aidyn-A Aidyn-A closed this Apr 5, 2023

Labels

open source, triaged, with-ssh


6 participants