[NCCL] Add more details for checkForNCCLErrors#54117
Closed
rohan-varma wants to merge 3 commits intogh/rohan-varma/259/basefrom
Closed
[NCCL] Add more details for checkForNCCLErrors#54117rohan-varma wants to merge 3 commits intogh/rohan-varma/259/basefrom
rohan-varma wants to merge 3 commits intogh/rohan-varma/259/basefrom
Conversation
#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error. However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too. Also renames the function s/errorMessage/getNcclErrorDetailStr Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/) [ghstack-poisoned]
Contributor
💊 CI failures summary and remediationsAs of commit 5b06e18 (more details on the Dr. CI page):
This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group. |
rohan-varma
added a commit
that referenced
this pull request
Mar 17, 2021
#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error. However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too. Also renames the function s/errorMessage/getNcclErrorDetailStr Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/) ghstack-source-id: 124090094 Pull Request resolved: #54117
Contributor
Author
zhaojuanmao
approved these changes
Mar 19, 2021
#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error. However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too. Also renames the function s/errorMessage/getNcclErrorDetailStr Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/) [ghstack-poisoned]
rohan-varma
added a commit
that referenced
this pull request
Mar 23, 2021
Pull Request resolved: #54117 #45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error. However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too. Also renames the function s/errorMessage/getNcclErrorDetailStr ghstack-source-id: 124589813 Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)
#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error. However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too. Also renames the function s/errorMessage/getNcclErrorDetailStr Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/) [ghstack-poisoned]
rohan-varma
added a commit
that referenced
this pull request
Mar 23, 2021
Pull Request resolved: #54117 #45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error. However, that PR only used the added function for `C10D_NCCL_CHECK` which is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also have `checkForNCCLErrors` which checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too. Also renames the function s/errorMessage/getNcclErrorDetailStr ghstack-source-id: 124662592 Differential Revision: [D27100497](https://our.internmc.facebook.com/intern/diff/D27100497/)
Contributor
|
This pull request has been merged in 789dc6d. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stack from ghstack:
#45950 enhanced our NCCL logging errors so that we add some basic debug information about what when wrong when erroring out with a NCCL error.
However, that PR only used the added function for
C10D_NCCL_CHECKwhich is used to check the return values of NCCL calls. However, in ProcessGroupNCCL we also havecheckForNCCLErrorswhich checks for errors on nccl communicators, and in case of errors it would be good to have this logging there too.Also renames the function s/errorMessage/getNcclErrorDetailStr
Differential Revision: D27100497