[PGNCCL] Fix bugs in non-blocking mode #137741

kwen2501 · 2024-10-10T22:48:00Z

Stack from ghstack (oldest at bottom):

Fix 1: Throw async error during init wait

Previously we just busy-waited for result == ncclSuccess; if the non-blocking init encountered an error, we never reported it. This PR adds detection of async errors via ncclGetAsyncError.
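A minimal sketch of the wait loop that Fix 1 describes, not the PR's actual code. `CommState` and the `poll` callback are hypothetical stand-ins for `ncclResult_t` and for the async-state query (the thread's `ncclGetAsyncError`; NCCL's public name for that query is `ncclCommGetAsyncError`), so the sketch compiles without NCCL. The point it illustrates is that the loop distinguishes "still in progress" from "failed" instead of spinning until success or timeout.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for ncclResult_t, reduced to the states discussed here.
enum class CommState { Success, InProgress, Error };

// `poll` is an assumed callback that queries the communicator's async state.
// Returns once the state is Success; throws on an async error or on timeout.
inline void waitUntilInitialized(
    const std::function<CommState()>& poll,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    const CommState state = poll();
    if (state == CommState::Success) {
      return;
    }
    if (state != CommState::InProgress) {
      // Without this branch an init failure would only surface as a timeout.
      throw std::runtime_error("non-blocking init reported an async error");
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      throw std::runtime_error("timed out waiting for non-blocking init");
    }
    std::this_thread::sleep_for(interval);
  }
}
```

A caller would pass a lambda that wraps the real state query for one communicator, plus the configured non-blocking timeout.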

Fix 2: Add wait after comm split

  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   https://github.com/NVIDIA/nccl/issues/1472

This wait does not mean the child comm is ready for use, nor does it block until that point.
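The ordering in the comment above can be sketched with a small mock, assuming the NCCL 2.23 behavior described in Reason 2. `MockComm`, `mockSplit`, and `mockPoll` are hypothetical and only imitate the relevant behavior: the split returns immediately with a null child pointer, and the pointer is filled in once the parent reports success.

```cpp
#include <stdexcept>

// Hypothetical stand-in for ncclResult_t, reduced to the states discussed here.
enum class CommState { Success, InProgress, Error };

// Hypothetical mock of a communicator; not an NCCL type.
struct MockComm {
  int remainingPolls = 0;                 // polls that still report InProgress
  MockComm** pendingChildSlot = nullptr;  // slot to fill when the split completes
  MockComm* pendingChild = nullptr;       // value to write into that slot
};

// Mock of a non-blocking split: returns at once and leaves *out null.
inline void mockSplit(MockComm& parent, MockComm& child, MockComm** out, int pollsNeeded) {
  *out = nullptr;
  parent.remainingPolls = pollsNeeded;
  parent.pendingChildSlot = out;
  parent.pendingChild = &child;
}

// Mock of an async-state query on the parent communicator.
inline CommState mockPoll(MockComm& comm) {
  if (comm.remainingPolls > 0) {
    --comm.remainingPolls;
    return CommState::InProgress;
  }
  if (comm.pendingChildSlot != nullptr) {
    *comm.pendingChildSlot = comm.pendingChild;  // child pointer becomes valid here
    comm.pendingChildSlot = nullptr;
  }
  return CommState::Success;
}

// Pattern from the comment: after the split, wait until the parent leaves
// InProgress, and only then read the child pointer.
inline MockComm* splitAndWaitOnParent(MockComm& parent, MockComm& child, int pollsNeeded) {
  MockComm* childPtr = nullptr;
  mockSplit(parent, child, &childPtr, pollsNeeded);
  CommState state = mockPoll(parent);
  while (state == CommState::InProgress) {
    // Real code would sleep or yield here and honor a timeout.
    state = mockPoll(parent);
  }
  if (state != CommState::Success) {
    throw std::runtime_error("comm split reported an async error");
  }
  return childPtr;  // valid pointer, but the child may still be initializing
}
```

Note that the returned pointer being valid says nothing about whether the child communicator itself has finished initializing; that is a separate wait.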

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot · 2024-10-10T22:48:03Z

✅ No Failures

As of commit 156cec9 with merge base 56cc22e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab · 2024-10-14T16:18:22Z

torch/csrc/distributed/c10d/NCCLUtils.cpp

      ncclCommSplit(sourceComm, color_id, rank, &(comm->ncclComm_), &config),
      std::nullopt);
+#else
+  // After calling ncclCommSplit in non-blocking mode, we should wait for the


i'm kinda confused about the meaning of non-blocking mode in nccl. maybe it would be a good time to write up a note on different modes of nccl init and what their semantics are.

are there 2 states that we transition through in non-blocking init? 1) ncclInProgress state, 2) still doing some more non-blocking init, finally 3) ready for use? And the point is we need to block/wait on (1) but we can let (2) be asynchronous? otherwise i'm confused about what part of non-blocking init is actually non-blocking

> the meaning of non-blocking mode in nccl

In general, it means the API call returns immediately, instead of blocking the calling thread until the operation (init, finalize, etc.) completes.

> are there 2 states that we transition through in non-blocking init? 1) ncclInProgress state, 2) still doing some more non-blocking init, finally 3) ready for use?

In general, two states only: ncclInProgress and ncclSuccess (if there is no error).

ncclCommSplit is a special case: it involves two communicators, the parent and the child.
So the question is: should we wait for the parent or the child?

The answer is both, as of NCCL 2.23 (today):
when ncclCommSplit returns, a valid pointer for the child comm may not have been assigned yet. We need to wait for ncclSuccess from the parent comm to confirm that. But that doesn't mean the child comm is ready for use; we need to wait for ncclSuccess from the child comm to confirm that.

I have talked to the NCCL team about improving this expectation, i.e. when ncclCommSplit returns, a valid pointer for the child comm SHOULD have been assigned. The NCCL team seems to agree, so that the API presents non-blocking semantics for both the parent and the child comms. The detailed discussion is here:
NVIDIA/nccl#1472
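The two-stage readiness described in this reply can be written down as a small classification, assuming the NCCL 2.23 behavior stated above. `CommState`, `ChildReadiness`, and `classifyChild` are hypothetical names for illustration, not part of NCCL or PyTorch.

```cpp
// Hypothetical stand-in for ncclResult_t, reduced to the states discussed here.
enum class CommState { Success, InProgress, Error };

// What a caller may assume about the child comm after a non-blocking split.
enum class ChildReadiness {
  PointerMaybeUnassigned,  // parent still InProgress: do not touch the child pointer
  Initializing,            // pointer is valid, child has not reached Success yet
  ReadyForUse,             // both parent and child report Success
  Failed                   // either communicator reported an error
};

// `child` is only meaningful once `parent` is Success, because the child's
// state cannot be queried before its pointer has been assigned.
inline ChildReadiness classifyChild(CommState parent, CommState child) {
  if (parent == CommState::Error) {
    return ChildReadiness::Failed;
  }
  if (parent == CommState::InProgress) {
    return ChildReadiness::PointerMaybeUnassigned;
  }
  if (child == CommState::Error) {
    return ChildReadiness::Failed;
  }
  return child == CommState::Success ? ChildReadiness::ReadyForUse
                                     : ChildReadiness::Initializing;
}
```

Under this reading, the wait added in this PR covers only the step from `PointerMaybeUnassigned` to `Initializing`; reaching `ReadyForUse` is a separate wait on the child.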

shuqiangzhang · 2024-10-14T16:32:23Z

torch/csrc/distributed/c10d/NCCLUtils.cpp

-  // only wait for initialization if nonblocking mode is enabled
-  if (!initialized_ && nccl_use_nonblocking()) {
-    waitUntilInitialized(nccl_nonblocking_timeout());
+  if (!initialized_) {


The previous implementation should not hang forever, should it? The wait is bounded by the timeout.

You are right. I modified PR description to:

Previously we just busy wait for result == ncclSuccess -- if the nonblocking init encountered an error, we never report it. Added detection of async error via ncclGetAsyncError.

If non-blocking init itself fails, does NCCL know (given that we are in non-blocking mode) to populate the error in ncclGetAsyncError? (I'm assuming ncclGetAsyncError only works in non-blocking mode.)

ncclGetAsyncError will report the error when non-blocking init fails.

ncclGetAsyncError also works in blocking mode, e.g. when checking whether a running collective hit a network error.

shuqiangzhang · 2024-10-14T16:38:31Z

torch/csrc/distributed/c10d/NCCLUtils.hpp

  }

+#define C10D_SCHED_SLEEP()     \
+  std::this_thread::sleep_for( \


Define an inline function instead of a macro? In general we should prefer an inline function over a macro, since it is type-checked and safer than plain text substitution.
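A minimal sketch of what this suggestion could look like. The sleep interval used by `C10D_SCHED_SLEEP` is not visible in the excerpt above, so `kAssumedSleepInterval` and the function name `schedSleep` are hypothetical.

```cpp
#include <chrono>
#include <thread>

// Assumed default; the interval the macro actually uses is not shown here.
constexpr std::chrono::milliseconds kAssumedSleepInterval{1};

// Inline replacement for a sleep macro. The parameter is type-checked, so
// passing a bare integer such as schedSleep(5) fails to compile.
inline void schedSleep(std::chrono::milliseconds interval = kAssumedSleepInterval) {
  std::this_thread::sleep_for(interval);
}
```

Unlike a macro, the function has a scoped name, shows up in debuggers and stack traces, and makes the unit of the interval part of its signature.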

shuqiangzhang · 2024-10-14T16:49:01Z

torch/csrc/distributed/c10d/NCCLUtils.cpp

+  // source communicator to be out of ncclInProgress state.
+  // Reason 1:
+  //   it's unsafe to call new operations on the parent comm while it's in
+  //   ncclInProgress state.


Line 66, "auto sourceComm = source->getNcclComm();", should already guarantee that the parent comm is ready. Why do we need to wait again here?

The wait here refers to AFTER calling ncclCommSplit -- we need to wait on sourceComm to make sure it gives a valid pointer to child comm.

Still confused: before we call ncclCommSplit here, the parent comm should already be in ncclSuccess state rather than ncclInProgress state, because there is already a wait in getNcclComm(). We can discuss this offline, as it seems I misunderstood something.

Would this discussion help clarify?
NVIDIA/nccl#1472 (comment)

Then you might need to update the comment, because the parent comm is ready/initialized; it is the child comm that is not initialized, so we need to wait here.

kwen2501 · 2024-10-15T17:49:28Z

@pytorchbot merge

pytorchmergebot · 2024-10-15T17:51:15Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

### Fix 1: Throw async error during init wait

Previously we just busy-waited for `ncclSuccess`; if the non-blocking init encountered an error, we never reported it. Added detection of async errors via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```

This wait does not mean the child comm is ready for use, nor does it block until that point.

Pull Request resolved: #137741
Approved by: https://github.com/shuqiangzhang

[PGNCCL] Fix bugs in non-blocking mode

df1d86e

- Throw async error during init wait - Add wait after comm split [ghstack-poisoned]

kwen2501 mentioned this pull request Oct 10, 2024

[PGNCCL] Enable non-blocking API mode by default #137544

Closed

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Oct 10, 2024

kwen2501 added the topic: bug fixes label Oct 10, 2024

Update on "[PGNCCL] Fix bugs in non-blocking mode"

19c65af

- Throw async error during init wait - Add wait after comm split cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

kwen2501 requested review from eqy, fduwjj, shuqiangzhang and wconstab October 10, 2024 23:08

kwen2501 mentioned this pull request Oct 13, 2024

[c10d] Fix color value for comm split being negative #137855

Closed

wconstab reviewed Oct 14, 2024

shuqiangzhang reviewed Oct 14, 2024

wconstab mentioned this pull request Oct 14, 2024

segfault when using DTensor with nonblocking nccl comm #137392

Open

kwen2501 mentioned this pull request Oct 14, 2024

[CI][Distributed] Not to test distributed_test.py with UCC #137932

Closed

shuqiangzhang approved these changes Oct 15, 2024

pytorch-bot bot added the ciflow/trunk label Oct 15, 2024

pytorchmergebot added the merging label Oct 15, 2024

pytorchmergebot added the Merged label Oct 15, 2024

pytorchmergebot closed this in 35fc24f Oct 15, 2024

pytorchmergebot removed the merging label Oct 15, 2024

kwen2501 mentioned this pull request Oct 16, 2024

Upgrade distributed test to g4dn instances (T4 GPUs) #137161

Closed

kwen2501 mentioned this pull request Oct 18, 2024

[PGNCCL] Add default value for nccl_nonblocking_timeout #138374

Closed

github-actions bot deleted the gh/kwen2501/73/head branch November 15, 2024 02:10
