Skip to content

Conversation

@kwen2501
Copy link
Collaborator

@kwen2501 kwen2501 commented Dec 11, 2024

Stack from ghstack (oldest at bottom):

This PR defines a set of APIs for interacting with NCCLComm (an existing abstraction of the real ncclComm_t).
This set of APIs include create, split, destroy, dump, etc, around the management of NCCLComm.
The previous implementation of the above methods under NCCLComm are now made the default impl of this API set.
And the impl of the API set is extendable out of trunk.

For example: for the implementation of dump

void dump(
    std::shared_ptr<NCCLComm>& comm,
    std::unordered_map<std::string, std::string>& dump) {
#if defined(IS_NCCLX) && defined(NCCL_COMM_DUMP)
  if (comm->isAborted()) {
    LOG(INFO) << "Communicator was aborted before trying to dump its state.";
    return;
  }
  C10D_NCCL_CHECK(::ncclCommDump(ncclComm_, dump), std::nullopt);
#else
  TORCH_CHECK(false, "NCCL version does not support dumping.");
#endif
}

One can now host the #if defined(IS_NCCLX) path out of trunk.

TODO:

ncclConfig_t still shows up in API surface. We probably need a way to make it opaque.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Dec 11, 2024
@pytorch-bot
Copy link

pytorch-bot bot commented Dec 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142881

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 109053e with merge base 5d3bc63 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

This PR defines a set of APIs for interacting with `NCCLComm` (an existing abstraction of the real `ncclComm_t`).
This set of APIs include `create`, `split`, `destroy`, `dump`, etc, around the management of `NCCLComm`.
The previous implementation of the above methods under `NCCLComm` are now made the default impl of this API set.
And the impl of the API set is extendable out of trunk.

For example: for the implementation of dump
```
void dump(
    std::shared_ptr<NCCLComm>& comm,
    std::unordered_map<std::string, std::string>& dump) {
#if defined(IS_NCCLX) && defined(NCCL_COMM_DUMP)
  if (comm->isAborted()) {
    LOG(INFO) << "Communicator was aborted before trying to dump its state.";
    return;
  }
  C10D_NCCL_CHECK(::ncclCommDump(ncclComm_, dump), std::nullopt);
#else
  TORCH_CHECK(false, "NCCL version does not support dumping.");
#endif
}
```
The `#if defined(IS_NCCLX)` path can be now moved out of trunk.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Dec 11, 2024
ghstack-source-id: 3dda6cc
Pull Request resolved: #142881
@github-actions
Copy link
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Feb 10, 2025
@github-actions github-actions bot closed this Mar 12, 2025
@github-actions github-actions bot deleted the gh/kwen2501/118/head branch April 11, 2025 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category Stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants