[NCCL] Provide additional information about NCCL error codes. #45950
Closed
rohan-varma wants to merge 5 commits into gh/rohan-varma/182/base from
A recurring pain point when debugging training jobs that fail with NCCL errors is
identifying the source of the error, since NCCL itself reports few details
(usually just "unhandled {system, cuda, internal} error").
In this PR, we append some basic debug information about what went wrong. The information was collected by grepping the NCCL codebase for the places where these errors are returned. For example, `ncclSystemError` is returned when system calls such as malloc or munmap fail.
Tested by forcing `result = ncclSystemError` in the macro. The new error
message looks like:
```
RuntimeError: NCCL error in: caffe2/torch/lib/c10d/ProcessGroupNCCL.cpp:759, unhandled system error, NCCL version 2.7.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```
The last line is what we have added to the message.
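As a rough illustration of the idea, the lookup can be sketched as a switch from error code to a human-readable detail string that is appended to the existing message. This is a minimal sketch, not the code in this PR: `SketchNcclResult`, `sketchErrorDetail`, and `sketchFullMessage` are hypothetical names, the enum is a stand-in so the example compiles without `nccl.h`, and only the `ncclSystemError` text is taken from the output above. The other strings are illustrative assumptions.

```cpp
#include <string>

// Stand-in for NCCL's result type so this sketch builds without nccl.h.
// Names and values are illustrative only.
enum SketchNcclResult {
  sketchSuccess = 0,
  sketchUnhandledCudaError,
  sketchSystemError,
  sketchInternalError,
};

// Hypothetical helper: maps an error code to a short explanation.
inline std::string sketchErrorDetail(SketchNcclResult r) {
  switch (r) {
    case sketchUnhandledCudaError:
      // Assumed wording, not taken from the PR.
      return "ncclUnhandledCudaError: Call to a CUDA function failed.";
    case sketchSystemError:
      // Wording taken from the sample error message in the PR description.
      return "ncclSystemError: System call (socket, malloc, munmap, etc) failed.";
    case sketchInternalError:
      // Assumed wording, not taken from the PR.
      return "ncclInternalError: Internal check failed.";
    default:
      return "Unknown NCCL error";
  }
}

// Appends the detail string as a new last line of the existing message.
inline std::string sketchFullMessage(const std::string& base,
                                     SketchNcclResult r) {
  return base + "\n" + sketchErrorDetail(r);
}
```

Keeping the lookup in one small function means any code path that reports an NCCL failure can reuse the same wording.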
In the future, we will also evaluate setting NCCL_DEBUG=WARN, which makes NCCL
provide more details about errors as well.
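For reference, one way a process could opt into `NCCL_DEBUG=WARN` by default is to set the environment variable before any NCCL work starts, while leaving an explicit user choice alone. The sketch below assumes a POSIX system (`setenv`), and `ensureNcclDebugWarn` is a hypothetical helper name, not something this PR adds.

```cpp
#include <stdlib.h>  // POSIX setenv / getenv
#include <string>

// Hypothetical helper: sets NCCL_DEBUG=WARN unless the user already chose
// a level, and returns the value that is in effect afterwards.
inline std::string ensureNcclDebugWarn() {
  // overwrite = 0 keeps an existing NCCL_DEBUG setting untouched.
  ::setenv("NCCL_DEBUG", "WARN", /*overwrite=*/0);
  const char* value = ::getenv("NCCL_DEBUG");
  return value ? std::string(value) : std::string();
}
```

A call like this would presumably need to run before the first NCCL communicator is created, since changing the variable later may have no effect.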
Differential Revision: [D24155894](https://our.internmc.facebook.com/intern/diff/D24155894/)
This was referenced Oct 7, 2020
rohan-varma added a commit that referenced this pull request Oct 7, 2020
Differential Revision: [D24155894](https://our.internmc.facebook.com/intern/diff/D24155894/)
ghstack-source-id: 113729960
Pull Request resolved: #45950
Codecov Report
| Metric | gh/rohan-varma/182/base | #45950 |
|---|---|---|
| Coverage | 68.28% | 68.28% |
| Files | 410 | 410 |
| Lines | 53609 | 53609 |
| Hits | 36608 | 36608 |
| Misses | 17001 | 17001 |
rohan-varma added a commit that referenced this pull request Oct 7, 2020
Pull Request resolved: #45950
ghstack-source-id: 113782832
Differential Revision: [D24155894](https://our.internmc.facebook.com/intern/diff/D24155894/)
rohan-varma added a commit that referenced this pull request Oct 8, 2020
Pull Request resolved: #45950
ghstack-source-id: 113914760
Differential Revision: [D24155894](https://our.internmc.facebook.com/intern/diff/D24155894/)
rohan-varma added a commit that referenced this pull request Oct 11, 2020
Pull Request resolved: #45950
ghstack-source-id: 114037060
Differential Revision: [D24155894](https://our.internmc.facebook.com/intern/diff/D24155894/)
rohan-varma added a commit that referenced this pull request Oct 13, 2020
Pull Request resolved: #45950
ghstack-source-id: 114219288
Differential Revision: [D24155894](https://our.internmc.facebook.com/intern/diff/D24155894/)
This pull request has been merged in 965046c.
rohan-varma added a commit that referenced this pull request Mar 17, 2021
#45950 enhanced our NCCL error logging so that we add some basic debug information about what went wrong when erroring out with an NCCL error. However, that PR only used the added function in `C10D_NCCL_CHECK`, which checks the return values of NCCL calls. In ProcessGroupNCCL we also have `checkForNCCLErrors`, which checks for errors on NCCL communicators, and it would be good to have this logging there too.
Also renames the function: s/errorMessage/getNcclErrorDetailStr.
Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)
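The two reporting paths described above can be sketched with one shared helper: a check macro for return values, and a loop that inspects per-communicator error state. This is a hedged illustration only. `SketchResult`, `SketchComm`, `sketchDetailStr`, `SKETCH_NCCL_CHECK`, and `sketchCheckComms` are hypothetical stand-ins so the example builds without NCCL or PyTorch, and they do not reproduce the real `C10D_NCCL_CHECK` or `checkForNCCLErrors`.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in result type; illustrative values only.
enum SketchResult { kSketchSuccess = 0, kSketchSystemError, kSketchInternalError };

// Shared helper that both reporting paths call.
inline std::string sketchDetailStr(SketchResult r) {
  switch (r) {
    case kSketchSystemError:
      return "ncclSystemError: System call (socket, malloc, munmap, etc) failed.";
    case kSketchInternalError:
      return "ncclInternalError: Internal check failed.";  // assumed wording
    default:
      return "Unknown NCCL error";
  }
}

// Path 1: check the return value of a call and throw with file, line,
// and the detail string appended on its own line.
#define SKETCH_NCCL_CHECK(cmd)                                          \
  do {                                                                  \
    SketchResult sketch_res_ = (cmd);                                   \
    if (sketch_res_ != kSketchSuccess) {                                \
      throw std::runtime_error(                                         \
          std::string("NCCL error in: ") + __FILE__ + ":" +             \
          std::to_string(__LINE__) + "\n" + sketchDetailStr(sketch_res_)); \
    }                                                                   \
  } while (0)

// Stand-in communicator that only records an asynchronous error state.
struct SketchComm {
  SketchResult asyncError = kSketchSuccess;
};

// Path 2: scan communicators and describe the first error found.
// Returns an empty string when every communicator is healthy.
inline std::string sketchCheckComms(const std::vector<SketchComm>& comms) {
  for (const auto& comm : comms) {
    if (comm.asyncError != kSketchSuccess) {
      return "NCCL error: " + sketchDetailStr(comm.asyncError);
    }
  }
  return std::string();
}
```

Routing both paths through the same helper keeps the wording identical whether the failure surfaces as a return code or as a communicator error.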
rohan-varma added a commit that referenced this pull request Mar 17, 2021
Pull Request resolved: #54117
ghstack-source-id: 124090094
Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)
rohan-varma added a commit that referenced this pull request Mar 23, 2021
Pull Request resolved: #54117
ghstack-source-id: 124589813
Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)
rohan-varma added a commit that referenced this pull request Mar 23, 2021
Pull Request resolved: #54117
ghstack-source-id: 124662592
Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)
facebook-github-bot pushed a commit that referenced this pull request Mar 24, 2021
Summary: Pull Request resolved: #54117
ghstack-source-id: 124662592
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27100497
fbshipit-source-id: fec3663ffa3e92bae8391ef4f77054abb4bb9715