Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94864
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 88d7b8b: This comment was automatically generated by Dr. CI and updates every 15 minutes.
From offline discussion: this PR would most likely need to use a lazy loading approach.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: Meta Internal-Only Changes Check. Details for Dev Infra team: raised by workflow job.
I've unlinked the PR internally, so the next merge attempt should succeed, but let's not do it before the weekend.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: Meta Internal-Only Changes Check. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hi @Aidyn-A! Soon after this PR landed, some long-standing FSDP unit tests have become flaky (#99011, #98821). I am not entirely sure of the connection, but we see the failures there; more of the stacktrace (from a run on 8 GPUs) is in the linked issues. I wonder if there could be a conflict with this change. Perhaps one possible remediation is to revert this PR for now? cc: @ezyang @ngimel
@pytorchbot revert
❌ 🤖 pytorchbot command failed: the revert command requires the -m and -c arguments. Try `@pytorchbot --help` for more info.
@pytorchbot revert -m "causes flaky fsdp failures" -c weird
@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 94864 failed. Reason: a command failed during the revert. Details for Dev Infra team: raised by workflow job.
This PR adds a workaround for the CUDA 12 `cudaSetDevice` change, which will always create a primary context on the target device. So, consider an operation like the one sketched below.
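(A minimal sketch of such an operation; the exact example from the PR is reconstructed here from the surrounding description, so the tensor shape is illustrative.)

```python
import torch

# Illustrative reproducer based on the description above: allocating a tensor
# on cuda:1 initializes that device's primary context, and on CUDA 12 the
# device guard's destructor calling cudaSetDevice(0) would previously create
# a primary context on cuda:0 as well.
x = torch.randn(1, device="cuda:1")
```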
Such an operation would always create a primary context on device `cuda:1`, because it creates a tensor on that device, and also on device `cuda:0`, because the destructor of the CUDA Device guard calls `cudaSetDevice(0)`. After this PR, the CUDA Device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.
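One way to observe the effect (a rough verification sketch, not part of this PR; it assumes the `pynvml` package is available) is to check, after running the snippet above, which devices report a context from the current process:

```python
import os
import pynvml

# Devices on which this process holds a CUDA context list it among their
# compute running processes. Before this PR, on CUDA 12 both device 0 and
# device 1 would report the process; after it, only device 1 should.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    has_ctx = any(p.pid == os.getpid() for p in procs)
    print(f"device {i}: context from this process: {has_ctx}")
pynvml.nvmlShutdown()
```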
cc @ezyang @gchanan