Fix static linkage cases and NO_DISTRIBUTED=1 + CUDA (#16705) #17337
Conversation
@pytorchbot rebase this please

pytorchbot: Sorry, I can't rebase this because the author of this PR didn't grant maintainers permission to modify the branch. Hey @soumith! If you click the "Allow edits from maintainers" checkbox on the right sidebar, I can rebase PRs automatically for you. Please consider letting me help you out ;) (To learn more about this bot, see Bot commands.)
(force-pushed from c510d45 to 88c4796)
I checked out your branch and it seems to be ok (at least for me). Thanks!
Differential Revision: D13952085
Pulled By: soumith
fbshipit-source-id: 410c4e117a44c08eadc6f3ded91fafc320a7c696
(force-pushed from f87a0ed to 3b3b69d)
The review discussion below is anchored on this hunk of the CUDA error-checking macro, where `cudaGetLastError()` is called to clear the error state before raising:

```
do {                          \
  cudaError_t __err = EXPR;   \
  if (__err != cudaSuccess) { \
    cudaGetLastError();       \
```
soumith: @ngimel can you review that this is okay?

Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. @colesbury suggested that this is the right thing to do.
ngimel: Yep, it is. It won't work if the error is sticky and the context was corrupted, but for other cases, if you can recover from the RuntimeError thrown by AT_ERROR, you are good to go.
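For concreteness, here is a minimal sketch of a check macro of this shape. It is not the exact PyTorch macro: the `CHECK_CUDA` name and the plain `std::runtime_error` are illustrative stand-ins for `AT_CUDA_CHECK`/`AT_ERROR`. What it shows is that `cudaGetLastError()` both returns and resets CUDA's per-thread error state, so calling it here keeps the failure from resurfacing on the next checked call:

```
// Minimal sketch, not the PyTorch definition: CHECK_CUDA and the
// std::runtime_error stand in for AT_CUDA_CHECK / AT_ERROR.
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

#define CHECK_CUDA(EXPR)                                               \
  do {                                                                 \
    cudaError_t __err = (EXPR);                                        \
    if (__err != cudaSuccess) {                                        \
      /* Reading the last error resets it to cudaSuccess, so this   */ \
      /* stale failure cannot be re-reported by a later, unrelated  */ \
      /* CUDA call that happens to be checked next.                 */ \
      cudaGetLastError();                                              \
      throw std::runtime_error(std::string("CUDA error: ") +           \
                               cudaGetErrorString(__err));             \
    }                                                                  \
  } while (0)
```

A caller can then wrap something like `CHECK_CUDA(cudaMalloc(&ptr, nbytes));` in a try/catch and keep using the device afterwards, which is the recover-from-RuntimeError case described above.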
facebook-github-bot left a comment:
@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Attempt #2 (attempt 1 is pytorch/pytorch#16705 and got reverted because of CI failures). Fixes pytorch/pytorch#14805.

Pull Request resolved: pytorch/pytorch#17337
Differential Revision: D14175626
Pulled By: soumith
fbshipit-source-id: 66f2e10e219a1bf88ed342ec5c89da6f2994d8eb
(#93192) Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally

This error was accidentally introduced by #92227, which was trying to fix #91758 as introduced in #85256. The unit test `TestCuda.test_events_multi_gpu_elapsed_time` has been failing since that PR was merged (on cuda 11.8 and cuda 12.0). That test requires >=2 GPUs, so it's probably not tested in the OSS CI?

```
python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time
```

E.g. in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192

```
2023-01-27T19:41:32.2312162Z test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)
```

The original C10_CUDA_CHECK before #85256 had an extra `cudaGetLastError` that captures those cuda errors: https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra `cudaGetLastError` was originally introduced in #17337. As commented here https://github.com/pytorch/pytorch/pull/17337/files#r259104503:

> soumith on Feb 21, 2019: Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.

Pull Request resolved: #93192
Approved by: https://github.com/ezyang
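To illustrate the failure mode the restored `cudaGetLastError` guards against, here is a small standalone sketch (assuming a host with the CUDA runtime available; the deliberately failing `cudaMemcpy` is just a convenient way to plant an error in the per-thread state):

```
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Plant an error: copying through null pointers fails with
  // cudaErrorInvalidValue and is recorded in the per-thread error state.
  cudaError_t err = cudaMemcpy(nullptr, nullptr, 16, cudaMemcpyHostToDevice);
  std::printf("failing call : %s\n", cudaGetErrorString(err));

  // Any checker that consults cudaGetLastError() now sees the old failure,
  // even though no CUDA call failed in between.
  std::printf("stale state  : %s\n", cudaGetErrorString(cudaGetLastError()));

  // That read also reset the state, which is exactly what the extra
  // cudaGetLastError() inside C10_CUDA_CHECK relies on.
  std::printf("after reading: %s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
```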