-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[PGNCCL] Fix bugs in non-blocking mode #137741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Throw async error during init wait - Add wait after comm split [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137741
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 156cec9 with merge base 56cc22e ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
- Throw async error during init wait - Add wait after comm split cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
| ncclCommSplit(sourceComm, color_id, rank, &(comm->ncclComm_), &config), | ||
| std::nullopt); | ||
| #else | ||
| // After calling ncclCommSplit in non-blocking mode, we should wait for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm kinda confused about the meaning of non-blocking mode in nccl. maybe it would be a good time to write up a note on different modes of nccl init and what their semantics are.
are there 2 states that we transition through in non-blocking init? 1) ncclInProgress state, 2) still doing some more non-blocking init, finally 3) ready for use? And the point is we need to block/wait on (1) but we can let (2) be asynchronous? otherwise i'm confused about what part of non-blocking init is actually non-blocking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the meaning of non-blocking mode in nccl
In general, it means the API call will return immediately, instead of blocking the calling thread until the init, finalize, etc operations complete.
are there 2 states that we transition through in non-blocking init? 1) ncclInProgress state, 2) still doing some more non-blocking init, finally 3) ready for use?
In general, two states only: ncclInProgress and ncclSuccess (if there is no error).
ncclCommSplit is a special case: it involves two communicators -- the parent and the child.
So a Q is: should we wait for the parent or the child?
The answer is both, as of NCCL 2.23 (today):
when ncclCommSplit returns, a valid pointer for child comm may not have been assigned. We'd need to wait for ncclSuccess from the parent comm to confirm that. But that doesn't mean the child comm is ready for use, we need to wait for ncclSuccess from the child comm to confirm that.
I have talked to the NCCL team re improvement of the expectation, i.e. when ncclCommSplit returns, a valid pointer for child comm SHOULD have been assigned. It seems NCCL team agrees, so that the API presents non-blocking semantics for both the parent and the child comms. Detailed discussion is here:
NVIDIA/nccl#1472
| // only wait for initialization if nonblocking mode is enabled | ||
| if (!initialized_ && nccl_use_nonblocking()) { | ||
| waitUntilInitialized(nccl_nonblocking_timeout()); | ||
| if (!initialized_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous implementation should not hang forever? as there is time bound on timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I modified PR description to:
Previously we just busy wait for
result == ncclSuccess-- if the nonblocking init encountered an error, we never report it. Added detection of async error viancclGetAsyncError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If non-blocking init itself failed, does NCCL know (we are in non-blocking mode?) to populate the error in ncclGetAsyncError ? (I'm assuming ncclGetAsyncError only works in nonblocking mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ncclGetAsyncError will populate error when non-blocking init fails.
ncclGetAsyncError also works in blocking mode, i.e. when checking if a running collective hit any network error.
| } | ||
|
|
||
| #define C10D_SCHED_SLEEP() \ | ||
| std::this_thread::sleep_for( \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define an inline function instead of a macro? in general we should prefer to use inline function instead of macro as it has type checks and more safty than simple text macro replacement
| // source communicator to be out of ncclInProgress state. | ||
| // Reason 1: | ||
| // it's unsafe to call new operations on the parent comm while it's in | ||
| // ncclInProgress state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 66 "auto sourceComm = source->getNcclComm();" should have already guaranteed that the parent comm is ready? why do we need to wait again here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wait here refers to AFTER calling ncclCommSplit -- we need to wait on sourceComm to make sure it gives a valid pointer to child comm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still confused, before we call ncclCommSplit here, the parent comm should have already been ncclSuccess state instead of ncclInProgress state, because there is already a wait in getNcclComm(). We can discuss this offline as it seems I misunderstood something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this discussion help clarify?
NVIDIA/nccl#1472 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you might need update the comment? because the parent comm is ready/initialized, but it is the child comm which is not initialized, so we need to wait here.
### Fix 1: Throw async error during init wait Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`. ### Fix 2: Add wait after comm split ``` // After calling ncclCommSplit in non-blocking mode, we should wait for the // source communicator to be out of ncclInProgress state. // Reason 1: // it's unsafe to call new operations on the parent comm while it's in // ncclInProgress state. // Reason 2: // as of NCCL 2.23, the ptr value of child comm will not be filled until the // state of parent comm is ncclSuccess. This may change in the future. See: // NVIDIA/nccl#1472 ``` This wait does not mean the child comm is ready for use, neither does it block till that point. cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
### Fix 1: Throw async error during init wait Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`. ### Fix 2: Add wait after comm split ``` // After calling ncclCommSplit in non-blocking mode, we should wait for the // source communicator to be out of ncclInProgress state. // Reason 1: // it's unsafe to call new operations on the parent comm while it's in // ncclInProgress state. // Reason 2: // as of NCCL 2.23, the ptr value of child comm will not be filled until the // state of parent comm is ncclSuccess. This may change in the future. See: // NVIDIA/nccl#1472 ``` This wait does not mean the child comm is ready for use, neither does it block till that point. cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
### Fix 1: Throw async error during init wait Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`. ### Fix 2: Add wait after comm split ``` // After calling ncclCommSplit in non-blocking mode, we should wait for the // source communicator to be out of ncclInProgress state. // Reason 1: // it's unsafe to call new operations on the parent comm while it's in // ncclInProgress state. // Reason 2: // as of NCCL 2.23, the ptr value of child comm will not be filled until the // state of parent comm is ncclSuccess. This may change in the future. See: // NVIDIA/nccl#1472 ``` This wait does not mean the child comm is ready for use, neither does it block till that point. cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
### Fix 1: Throw async error during init wait Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`. ### Fix 2: Add wait after comm split ``` // After calling ncclCommSplit in non-blocking mode, we should wait for the // source communicator to be out of ncclInProgress state. // Reason 1: // it's unsafe to call new operations on the parent comm while it's in // ncclInProgress state. // Reason 2: // as of NCCL 2.23, the ptr value of child comm will not be filled until the // state of parent comm is ncclSuccess. This may change in the future. See: // NVIDIA/nccl#1472 ``` This wait does not mean the child comm is ready for use, neither does it block till that point. Pull Request resolved: #137741 Approved by: https://github.com/shuqiangzhang
Stack from ghstack (oldest at bottom):
Fix 1: Throw async error during init wait
Previously we just busy wait for
ncclSuccess, if the nonblocking init encountered error, we never report that. Added detection of async error viancclGetAsyncError.Fix 2: Add wait after comm split
This wait does not mean the child comm is ready for use, neither does it block till that point.
cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o