Skip to content

Conversation

@kwen2501
Copy link
Collaborator

@kwen2501 kwen2501 commented Oct 10, 2024

Stack from ghstack (oldest at bottom):

Fix 1: Throw async error during init wait

Previously we just busy wait for ncclSuccess, if the nonblocking init encountered error, we never report that. Added detection of async error via ncclGetAsyncError.

Fix 2: Add wait after comm split

  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   https://github.com/NVIDIA/nccl/issues/1472

This wait does not mean the child comm is ready for use, neither does it block till that point.

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

- Throw async error during init wait

- Add wait after comm split

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137741

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 156cec9 with merge base 56cc22e (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Oct 10, 2024
@kwen2501 kwen2501 added the topic: bug fixes topic category label Oct 10, 2024
- Throw async error during init wait

- Add wait after comm split

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
ncclCommSplit(sourceComm, color_id, rank, &(comm->ncclComm_), &config),
std::nullopt);
#else
// After calling ncclCommSplit in non-blocking mode, we should wait for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm kinda confused about the meaning of non-blocking mode in nccl. maybe it would be a good time to write up a note on different modes of nccl init and what their semantics are.

are there 2 states that we transition through in non-blocking init? 1) ncclInProgress state, 2) still doing some more non-blocking init, finally 3) ready for use? And the point is we need to block/wait on (1) but we can let (2) be asynchronous? otherwise i'm confused about what part of non-blocking init is actually non-blocking

Copy link
Collaborator Author

@kwen2501 kwen2501 Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the meaning of non-blocking mode in nccl

In general, it means the API call will return immediately, instead of blocking the calling thread until the init, finalize, etc operations complete.

are there 2 states that we transition through in non-blocking init? 1) ncclInProgress state, 2) still doing some more non-blocking init, finally 3) ready for use?

In general, two states only: ncclInProgress and ncclSuccess (if there is no error).

ncclCommSplit is a special case: it involves two communicators -- the parent and the child.
So a Q is: should we wait for the parent or the child?

The answer is both, as of NCCL 2.23 (today):
when ncclCommSplit returns, a valid pointer for child comm may not have been assigned. We'd need to wait for ncclSuccess from the parent comm to confirm that. But that doesn't mean the child comm is ready for use, we need to wait for ncclSuccess from the child comm to confirm that.

I have talked to the NCCL team re improvement of the expectation, i.e. when ncclCommSplit returns, a valid pointer for child comm SHOULD have been assigned. It seems NCCL team agrees, so that the API presents non-blocking semantics for both the parent and the child comms. Detailed discussion is here:
NVIDIA/nccl#1472

// only wait for initialization if nonblocking mode is enabled
if (!initialized_ && nccl_use_nonblocking()) {
waitUntilInitialized(nccl_nonblocking_timeout());
if (!initialized_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous implementation should not hang forever? as there is time bound on timeout?

Copy link
Collaborator Author

@kwen2501 kwen2501 Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I modified PR description to:

Previously we just busy wait for result == ncclSuccess -- if the nonblocking init encountered an error, we never report it. Added detection of async error via ncclGetAsyncError.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If non-blocking init itself failed, does NCCL know (we are in non-blocking mode?) to populate the error in ncclGetAsyncError ? (I'm assuming ncclGetAsyncError only works in nonblocking mode.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ncclGetAsyncError will populate error when non-blocking init fails.

ncclGetAsyncError also works in blocking mode, i.e. when checking if a running collective hit any network error.

}

#define C10D_SCHED_SLEEP() \
std::this_thread::sleep_for( \
Copy link
Contributor

@shuqiangzhang shuqiangzhang Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define an inline function instead of a macro? in general we should prefer to use inline function instead of macro as it has type checks and more safty than simple text macro replacement

// source communicator to be out of ncclInProgress state.
// Reason 1:
// it's unsafe to call new operations on the parent comm while it's in
// ncclInProgress state.
Copy link
Contributor

@shuqiangzhang shuqiangzhang Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 66 "auto sourceComm = source->getNcclComm();" should have already guaranteed that the parent comm is ready? why do we need to wait again here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wait here refers to AFTER calling ncclCommSplit -- we need to wait on sourceComm to make sure it gives a valid pointer to child comm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still confused, before we call ncclCommSplit here, the parent comm should have already been ncclSuccess state instead of ncclInProgress state, because there is already a wait in getNcclComm(). We can discuss this offline as it seems I misunderstood something

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this discussion help clarify?
NVIDIA/nccl#1472 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you might need update the comment? because the parent comm is ready/initialized, but it is the child comm which is not initialized, so we need to wait here.

### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
@kwen2501
Copy link
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 15, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

jackzhxng pushed a commit that referenced this pull request Oct 16, 2024
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

Pull Request resolved: #137741
Approved by: https://github.com/shuqiangzhang
@github-actions github-actions bot deleted the gh/kwen2501/73/head branch November 15, 2024 02:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category topic: bug fixes topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants