
Conversation

@soumith (Contributor) commented Feb 21, 2019

Attempt #2 (attempt 1 was #16705, which got reverted because of CI failures)

Fixes #14805

@soumith (Contributor, Author) commented Feb 21, 2019

@pytorchbot rebase this please

@pytorchbot (Collaborator) replied:
Sorry, I can't rebase this because the author of this PR didn't grant maintainers permission to modify the branch.

Hey @soumith! If you click the Allow edits from maintainers checkbox on the right sidebar, I can rebase PRs automatically from you. Please consider letting me help you out ;)

(To learn more about this bot, see Bot commands.)

soumith force-pushed the nodist_cuda_attempt2 branch from c510d45 to 88c4796 on February 21, 2019 at 12:06
@mfuntowicz (Contributor) commented:
I checked out your branch and it seems to be OK (at least for me).

Thanks!

@ezyang (Contributor) commented Feb 21, 2019

@pytorchbot rebase this please

@pytorchbot (Collaborator) replied:
Sorry, I can't rebase this because the author of this PR didn't grant maintainers permission to modify the branch.

Hey @soumith! If you click the Allow edits from maintainers checkbox on the right sidebar, I can rebase PRs automatically from you. Please consider letting me help you out ;)

(To learn more about this bot, see Bot commands.)

@ezyang (Contributor) commented Feb 21, 2019

@pytorchbot rebase this please

@pytorchbot (Collaborator) replied:
Sorry, I can't rebase this because the author of this PR didn't grant maintainers permission to modify the branch.

Hey @soumith! If you click the Allow edits from maintainers checkbox on the right sidebar, I can rebase PRs automatically from you. Please consider letting me help you out ;)

(To learn more about this bot, see Bot commands.)

@ezyang (Contributor) commented Feb 21, 2019

@pytorchbot rebase this please

Differential Revision: D13952085

Pulled By: soumith

fbshipit-source-id: 410c4e117a44c08eadc6f3ded91fafc320a7c696
soumith force-pushed the nodist_cuda_attempt2 branch from f87a0ed to 3b3b69d on February 21, 2019 at 20:34
do { \
  cudaError_t __err = EXPR; \
  if (__err != cudaSuccess) { \
    cudaGetLastError(); \
@soumith (Contributor, Author) commented on this line:

@ngimel can you review that this is okay?

Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. @colesbury suggested that this is the right thing to do.

@ngimel (Collaborator) replied:

Yep, it is. It won't work if the error is sticky and the context was corrupted, but for other cases, if you can recover from the RuntimeError thrown by AT_ERROR, you are good to go.
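
For readers following along, below is a minimal, hedged sketch of what a check macro of this shape looks like with the extra call in place. This is not the exact PyTorch macro: the name `CUDA_CHECK_SKETCH` and the plain `throw` are stand-ins for AT_CUDA_CHECK/AT_ERROR so the snippet stays self-contained.

```
// Minimal sketch (assumed names, not the exact PyTorch source): the extra
// cudaGetLastError() clears the runtime's pending error state so the failure
// is not falsely re-reported by a later, unrelated CUDA call.
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

#define CUDA_CHECK_SKETCH(EXPR)                                  \
  do {                                                           \
    cudaError_t __err = (EXPR);                                  \
    if (__err != cudaSuccess) {                                  \
      (void)cudaGetLastError(); /* clear the lingering error */  \
      throw std::runtime_error(std::string("CUDA error: ") +     \
                               cudaGetErrorString(__err));       \
    }                                                            \
  } while (0)
```

As noted above, clearing only helps for recoverable errors; if the error is sticky and the context is corrupted, every subsequent call will keep failing regardless.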

@facebook-github-bot (Contributor) left a comment:

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Feb 22, 2019
Summary:
Attempt #2 (attempt 1 was pytorch/pytorch#16705, which got reverted because of CI failures)

Fixes pytorch/pytorch#14805
Pull Request resolved: pytorch/pytorch#17337

Differential Revision: D14175626

Pulled By: soumith

fbshipit-source-id: 66f2e10e219a1bf88ed342ec5c89da6f2994d8eb
soumith deleted the nodist_cuda_attempt2 branch on April 7, 2019 at 03:57
pytorchmergebot pushed a commit that referenced this pull request Jan 28, 2023
Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally (#93192)

This error was accidentally introduced by #92227, which was trying to fix #91758, itself introduced in #85256.

The unit test `TestCuda.test_events_multi_gpu_elapsed_time` has been failing since that PR got merged (on CUDA 11.8 and CUDA 12.0). That test requires >= 2 GPUs, so it's probably not tested in the OSS CI?
```
python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time
```

E.g. in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192
```
2023-01-27T19:41:32.2312162Z   test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)
```

The original C10_CUDA_CHECK before #85256 had an extra `cudaGetLastError` that captured those CUDA errors: https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra `cudaGetLastError` was originally introduced in #17337. As commented here https://github.com/pytorch/pytorch/pull/17337/files#r259104503

> soumith on Feb 21, 2019:
> Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.
Pull Request resolved: #93192
Approved by: https://github.com/ezyang
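
To make the failure mode described in that quote concrete, here is a hedged, self-contained illustration (not code from this PR or from #93192; the failing call and the exact error code are illustrative assumptions): once a CUDA runtime call fails, the error stays recorded in the thread's last-error slot until something calls cudaGetLastError(), so a later check based on cudaGetLastError()/cudaPeekAtLastError() can end up blaming an unrelated call.

```
// Hedged illustration of a lingering CUDA error (assumed failing call).
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
  // 1) Provoke a recoverable error on purpose; an allocation this large
  //    should fail (the exact error code may vary by runtime version).
  void* p = nullptr;
  cudaError_t err = cudaMalloc(&p, SIZE_MAX);
  std::printf("failing call returned: %s\n", cudaGetErrorString(err));

  // 2) The error is still pending for this host thread...
  std::printf("peek before clearing:  %s\n",
              cudaGetErrorString(cudaPeekAtLastError()));

  // 3) ...so an error check that consults cudaGetLastError() after a
  //    perfectly valid operation would report this stale failure.
  //    cudaGetLastError() returns AND resets the pending error, which is
  //    what the extra call inside the check macro is there for.
  (void)cudaGetLastError();
  std::printf("peek after clearing:   %s\n",
              cudaGetErrorString(cudaPeekAtLastError()));  // cudaSuccess
  return 0;
}
```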


Development

Successfully merging this pull request may close these issues.

libcudart.so not found when compiling with NO_DISTRIBUTED=1
