Skip to content

[NCCL] Add more details for checkForNCCLErrors#54117

Closed
rohan-varma wants to merge 3 commits intogh/rohan-varma/259/basefrom
gh/rohan-varma/259/head
Closed

[NCCL] Add more details for checkForNCCLErrors#54117
rohan-varma wants to merge 3 commits intogh/rohan-varma/259/basefrom
gh/rohan-varma/259/head

Conversation

@rohan-varma
Copy link
Copy Markdown
Contributor

@rohan-varma rohan-varma commented Mar 17, 2021

Stack from ghstack:

#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error.

However, that PR only used the added function for C10D_NCCL_CHECK which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have checkForNCCLErrors which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.

Also renames the function s/errorMessage/getNcclErrorDetailStr

Differential Revision: D27100497

#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error.

However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.

Also renames the function s/errorMessage/getNcclErrorDetailStr

Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 17, 2021
@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Mar 17, 2021

💊 CI failures summary and remediations

As of commit 5b06e18 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

rohan-varma added a commit that referenced this pull request Mar 17, 2021
#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error.

However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.

Also renames the function s/errorMessage/getNcclErrorDetailStr

Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)

ghstack-source-id: 124090094
Pull Request resolved: #54117
@rohan-varma
Copy link
Copy Markdown
Contributor Author

cc @pritamdamania87

#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error.

However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.

Also renames the function s/errorMessage/getNcclErrorDetailStr

Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Mar 23, 2021
Pull Request resolved: #54117

#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error.

However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.

Also renames the function s/errorMessage/getNcclErrorDetailStr
ghstack-source-id: 124589813

Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)
#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error.

However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.

Also renames the function s/errorMessage/getNcclErrorDetailStr

Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Mar 23, 2021
Pull Request resolved: #54117

#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error.

However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.

Also renames the function s/errorMessage/getNcclErrorDetailStr
ghstack-source-id: 124662592

Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in 789dc6d.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/259/head branch March 27, 2021 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants