Fix static linkage cases and NO_DISTRIBUTED=1 + CUDA (#16705) #17337
Conversation
@pytorchbot rebase this please

pytorchbot: Sorry, I can't rebase this because the author of this PR didn't grant maintainers permission to modify the branch. Hey @soumith! If you click the "Allow edits from maintainers" checkbox on the right sidebar, I can rebase PRs automatically for you. Please consider letting me help you out ;) (To learn more about this bot, see Bot commands.)
(force-pushed from c510d45 to 88c4796)
I checked out your branch and it seems to be ok (at least for me). Thanks!
Differential Revision: D13952085
Pulled By: soumith
fbshipit-source-id: 410c4e117a44c08eadc6f3ded91fafc320a7c696
(force-pushed from f87a0ed to 3b3b69d)
The review discussion below is anchored on this hunk of the CUDA error-checking macro, where `cudaGetLastError()` is called to clear the error state before raising:

```
do {                          \
  cudaError_t __err = EXPR;   \
  if (__err != cudaSuccess) { \
    cudaGetLastError();       \
```
soumith: @ngimel can you review that this is okay?

Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. @colesbury suggested that this is the right thing to do.
ngimel: Yep, it is. It won't work if the error is sticky and the context was corrupted, but for other cases, if you can recover from the RuntimeError thrown by AT_ERROR, you are good to go.
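For concreteness, here is a minimal sketch of a check macro of this shape. It is not the exact PyTorch macro: the `CHECK_CUDA` name and the plain `std::runtime_error` are illustrative stand-ins for `AT_CUDA_CHECK`/`AT_ERROR`. What it shows is that `cudaGetLastError()` both returns and resets CUDA's per-thread error state, so calling it here keeps the failure from resurfacing on the next checked call:

```
// Minimal sketch, not the PyTorch definition: CHECK_CUDA and the
// std::runtime_error stand in for AT_CUDA_CHECK / AT_ERROR.
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

#define CHECK_CUDA(EXPR)                                               \
  do {                                                                 \
    cudaError_t __err = (EXPR);                                        \
    if (__err != cudaSuccess) {                                        \
      /* Reading the last error resets it to cudaSuccess, so this   */ \
      /* stale failure cannot be re-reported by a later, unrelated  */ \
      /* CUDA call that happens to be checked next.                 */ \
      cudaGetLastError();                                              \
      throw std::runtime_error(std::string("CUDA error: ") +           \
                               cudaGetErrorString(__err));             \
    }                                                                  \
  } while (0)
```

A caller can then wrap something like `CHECK_CUDA(cudaMalloc(&ptr, nbytes));` in a try/catch and keep using the device afterwards, which is the recover-from-RuntimeError case described above.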
facebook-github-bot left a comment:
@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Attempt #2 (attempt 1 is pytorch/pytorch#16705 and got reverted because of CI failures). Fixes pytorch/pytorch#14805.

Pull Request resolved: pytorch/pytorch#17337
Differential Revision: D14175626
Pulled By: soumith
fbshipit-source-id: 66f2e10e219a1bf88ed342ec5c89da6f2994d8eb
(#93192) Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally

This error was accidentally introduced by #92227, which was trying to fix #91758 as introduced in #85256. The unit test `TestCuda.test_events_multi_gpu_elapsed_time` has been failing since that PR was merged (on cuda 11.8 and cuda 12.0). That test requires >=2 GPUs, so it's probably not tested in the OSS CI?

```
python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time
```

E.g. in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192

```
2023-01-27T19:41:32.2312162Z test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)
```

The original C10_CUDA_CHECK before #85256 had an extra `cudaGetLastError` that captures those cuda errors: https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra `cudaGetLastError` was originally introduced in #17337. As commented here https://github.com/pytorch/pytorch/pull/17337/files#r259104503:

> soumith on Feb 21, 2019: Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.

Pull Request resolved: #93192
Approved by: https://github.com/ezyang
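To illustrate the failure mode the restored `cudaGetLastError` guards against, here is a small standalone sketch (assuming a host with the CUDA runtime available; the deliberately failing `cudaMemcpy` is just a convenient way to plant an error in the per-thread state):

```
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Plant an error: copying through null pointers fails with
  // cudaErrorInvalidValue and is recorded in the per-thread error state.
  cudaError_t err = cudaMemcpy(nullptr, nullptr, 16, cudaMemcpyHostToDevice);
  std::printf("failing call : %s\n", cudaGetErrorString(err));

  // Any checker that consults cudaGetLastError() now sees the old failure,
  // even though no CUDA call failed in between.
  std::printf("stale state  : %s\n", cudaGetErrorString(cudaGetLastError()));

  // That read also reset the state, which is exactly what the extra
  // cudaGetLastError() inside C10_CUDA_CHECK relies on.
  std::printf("after reading: %s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
```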