[NCCL] Provide additional information about NCCL error codes.#45950

Closed
rohan-varma wants to merge 5 commits into gh/rohan-varma/182/base from gh/rohan-varma/182/head

@rohan-varma rohan-varma commented Oct 7, 2020

Stack from ghstack:

A pain point when debugging training jobs that fail with NCCL errors has
been identifying the source of the error, since NCCL itself reports few
details (usually just "unhandled {system, cuda, internal} error").

In this PR, we add some basic debug information about what went wrong. The information was collected by grepping the NCCL codebase for the places where these errors are returned. For example, `ncclSystemError` is returned when system calls such as `malloc` or `munmap` fail.

Tested by forcing `result = ncclSystemError` in the macro. The new error
message looks like:

RuntimeError: NCCL error in: caffe2/torch/lib/c10d/ProcessGroupNCCL.cpp:759, unhandled system error, NCCL version 2.7.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

The last line is what we have added to the message.
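
As a rough illustration of the approach described above, the sketch below maps a result code to an extra detail line and appends it to the existing message. It is not the code from this PR: `NcclResultSketch`, `errorDetailSketch`, `buildErrorMessageSketch`, and `checkSketch` are hypothetical names, the enum is a stand-in for NCCL's `ncclResult_t` so the example builds without `nccl.h`, and only the `ncclSystemError` text is taken from the message shown above. The other strings are placeholders.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in for NCCL's ncclResult_t so the sketch builds without nccl.h.
// The enumerator names echo NCCL's; the type itself is hypothetical.
enum class NcclResultSketch {
  Success,
  UnhandledCudaError,
  SystemError,
  InternalError,
};

// Maps a result code to one extra line of detail. Only the SystemError text
// comes from the PR's example output; the other strings are placeholders.
inline std::string errorDetailSketch(NcclResultSketch result) {
  switch (result) {
    case NcclResultSketch::SystemError:
      return "ncclSystemError: System call (socket, malloc, munmap, etc) failed.";
    case NcclResultSketch::UnhandledCudaError:
      return "ncclUnhandledCudaError: (placeholder) a CUDA call failed.";
    case NcclResultSketch::InternalError:
      return "ncclInternalError: (placeholder) an internal check failed.";
    default:
      return "Unknown NCCL error";
  }
}

// Builds the full message: location and generic error first, detail last.
inline std::string buildErrorMessageSketch(
    const std::string& file,
    int line,
    const std::string& genericError,
    const std::string& version,
    NcclResultSketch result) {
  std::ostringstream oss;
  oss << "NCCL error in: " << file << ":" << line << ", " << genericError
      << ", NCCL version " << version << "\n"
      << errorDetailSketch(result);
  return oss.str();
}

// Throws when the result is not Success, similar in spirit to a check macro.
inline void checkSketch(
    NcclResultSketch result,
    const std::string& genericError,
    const std::string& file,
    int line) {
  if (result != NcclResultSketch::Success) {
    throw std::runtime_error(
        buildErrorMessageSketch(file, line, genericError, "2.7.3", result));
  }
}
```

Calling `checkSketch(NcclResultSketch::SystemError, "unhandled system error", "ProcessGroupNCCL.cpp", 759)` throws a `std::runtime_error` whose last line is the added detail string, which is how the test described above (forcing the result in the macro) would surface it.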

In the future, we will also evaluate setting NCCL_DEBUG=WARN, with which NCCL
provides more details about errors as well.
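
The PR only says this option will be evaluated, so the following is a sketch of one way a process could opt in, not something this change does. It assumes a POSIX environment (`setenv`) and that the variable is set before NCCL is first initialized in the process; `ensureNcclDebugWarnSketch` is a hypothetical helper name.

```cpp
#include <cstdlib>
#include <string>

// Sets NCCL_DEBUG=WARN unless a level is already present in the environment.
// Returns the value in effect afterwards.
inline std::string ensureNcclDebugWarnSketch() {
  // overwrite=0 keeps an existing user setting such as NCCL_DEBUG=INFO.
  setenv("NCCL_DEBUG", "WARN", /*overwrite=*/0);
  const char* value = std::getenv("NCCL_DEBUG");
  return value ? std::string(value) : std::string();
}
```

Not overwriting an existing value is a deliberate choice here: a user who already asked for a more verbose level keeps it.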

Differential Revision: D24155894

@facebook-github-bot facebook-github-bot added the `oncall: distributed` label Oct 7, 2020
rohan-varma added a commit that referenced this pull request Oct 7, 2020
ghstack-source-id: 113729960
Pull Request resolved: #45950
codecov bot commented Oct 7, 2020

Codecov Report

Merging #45950 into gh/rohan-varma/182/base will not change coverage.
The diff coverage is n/a.

@@                   Coverage Diff                    @@
##           gh/rohan-varma/182/base   #45950   +/-   ##
========================================================
  Coverage                    68.28%   68.28%           
========================================================
  Files                          410      410           
  Lines                        53609    53609           
========================================================
  Hits                         36608    36608           
  Misses                       17001    17001           

Last update 1e29d71...05cd43e.

@mingzhe09088 mingzhe09088 left a comment

Looks good.

rohan-varma added a commit that referenced this pull request Oct 7, 2020
Pull Request resolved: #45950
ghstack-source-id: 113782832
rohan-varma added a commit that referenced this pull request Oct 8, 2020
Pull Request resolved: #45950
ghstack-source-id: 113914760
rohan-varma added a commit that referenced this pull request Oct 11, 2020
Pull Request resolved: #45950
ghstack-source-id: 114037060
rohan-varma added a commit that referenced this pull request Oct 13, 2020
Pull Request resolved: #45950
ghstack-source-id: 114219288
@facebook-github-bot
This pull request has been merged in 965046c.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/182/head branch October 17, 2020 14:15
rohan-varma added a commit that referenced this pull request Mar 17, 2021
#45950 enhanced our NCCL error logging so that we add some basic debug information about what went wrong when erroring out with an NCCL error.

However, that PR only used the added function in `C10D_NCCL_CHECK`, which checks the return values of NCCL calls. In ProcessGroupNCCL we also have `checkForNCCLErrors`, which checks for errors on NCCL communicators, and it would be good to have this logging there too.

Also renames the function: s/errorMessage/getNcclErrorDetailStr.

Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)
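
To make the follow-up concrete, here is a minimal sketch of reusing one detail helper when checking communicators rather than call return values. Everything in it is an assumption made for illustration: `NcclResultSketch` stands in for `ncclResult_t`, `CommSketch` is a hypothetical wrapper that records an asynchronous error, and `checkForCommErrorsSketch` is not the real `checkForNCCLErrors`. Only the `ncclSystemError` text is taken from the original PR.

```cpp
#include <string>
#include <vector>

// Stand-in for ncclResult_t, so the sketch builds without nccl.h.
enum class NcclResultSketch { Success, SystemError, InternalError };

// Hypothetical communicator wrapper that remembers an asynchronous error.
struct CommSketch {
  NcclResultSketch asyncError = NcclResultSketch::Success;
};

// Shared helper, named after the renamed function. Only the SystemError text
// is taken from the PR; the fallback string is a placeholder.
inline std::string getNcclErrorDetailStrSketch(NcclResultSketch error) {
  if (error == NcclResultSketch::SystemError) {
    return "ncclSystemError: System call (socket, malloc, munmap, etc) failed.";
  }
  return "Unknown NCCL error";
}

// Returns a message for the first communicator in an error state, or an
// empty string when all communicators are healthy.
inline std::string checkForCommErrorsSketch(
    const std::vector<CommSketch>& comms) {
  for (const CommSketch& comm : comms) {
    if (comm.asyncError != NcclResultSketch::Success) {
      return "NCCL communicator error\n" +
          getNcclErrorDetailStrSketch(comm.asyncError);
    }
  }
  return std::string();
}
```

Routing both the return-value check and the communicator check through the same helper keeps the detail text in one place, which is the motivation the commit message gives for extending the logging.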

rohan-varma added a commit that referenced this pull request Mar 17, 2021
ghstack-source-id: 124090094
Pull Request resolved: #54117
rohan-varma added a commit that referenced this pull request Mar 23, 2021
Pull Request resolved: #54117
ghstack-source-id: 124589813
rohan-varma added a commit that referenced this pull request Mar 23, 2021
Pull Request resolved: #54117
ghstack-source-id: 124662592
facebook-github-bot pushed a commit that referenced this pull request Mar 24, 2021
Summary:
Pull Request resolved: #54117

ghstack-source-id: 124662592

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27100497

fbshipit-source-id: fec3663ffa3e92bae8391ef4f77054abb4bb9715
Labels

Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

3 participants