[PGNCCL] Ensure comm is ready before all accesses #138384

kwen2501 · 2024-10-19T06:49:04Z

Stack from ghstack (oldest at bottom):

Previously, we only waited for the comm to become ready after its initialization.
That is not enough. Other NCCL APIs can also leave the comm in the InProgress state, e.g. P2P calls, commSplit, commFinalize, etc.
Therefore, we now ensure the comm is ready every time we next need to access ncclComm.
The place to add this gatekeeper is `getNcclComm`.
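The gatekeeper idea can be sketched as a small simulation. This is not the `ProcessGroupNCCL` code: `FakeComm`, `FakeProcessGroup`, `start_async_op`, `poll`, and `get_comm` are hypothetical names made up for illustration, and `poll` only stands in for an asynchronous status query such as NCCL's `ncclCommGetAsyncError`.

```python
import enum
import time


class CommState(enum.Enum):
    IN_PROGRESS = "InProgress"
    READY = "Ready"


class FakeComm:
    """Hypothetical stand-in for a non-blocking NCCL communicator."""

    def __init__(self, in_progress_polls=0):
        # Number of upcoming polls that will still report InProgress.
        self._pending = in_progress_polls

    def start_async_op(self, in_progress_polls):
        # Models any call (init, P2P, split, finalize) that may leave the
        # comm InProgress again after it was previously ready.
        self._pending = in_progress_polls

    def poll(self):
        # One poll models one asynchronous status query on the comm.
        if self._pending > 0:
            self._pending -= 1
            return CommState.IN_PROGRESS
        return CommState.READY


class FakeProcessGroup:
    """Hypothetical owner of the comm; all accesses go through get_comm."""

    def __init__(self, comm, timeout_s=1.0):
        self._comm = comm
        self._timeout_s = timeout_s

    def get_comm(self):
        # Gatekeeper: hand out the comm only once it reports ready,
        # no matter which earlier call made it InProgress.
        deadline = time.monotonic() + self._timeout_s
        while self._comm.poll() is CommState.IN_PROGRESS:
            if time.monotonic() > deadline:
                raise TimeoutError("comm did not become ready in time")
        return self._comm


comm = FakeComm(in_progress_polls=3)      # still initializing
pg = FakeProcessGroup(comm)
pg.get_comm()                             # waits out the initialization
comm.start_async_op(in_progress_polls=2)  # e.g. a split makes it busy again
pg.get_comm()                             # waits again before the next access
```

Putting the wait in the single accessor, rather than after each individual call, means a caller cannot forget it: whichever operation last made the comm busy, the next access goes through the same check.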

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot · 2024-10-19T06:49:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138384

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 03d792c with merge base 195d0a6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Previously, we only waited for the comm to become ready after its initialization, but that is not enough. Other NCCL APIs can also leave the comm in the InProgress state. Therefore, we ensure the comm is ready every time we call `getNcclComm`, as protection for the subsequent NCCL call on the returned comm.

kwen2501 · 2024-10-22T21:53:12Z

@pytorchbot merge

pytorchmergebot · 2024-10-22T21:55:25Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

### Why use non-blocking mode in eager init?

To overlap comm init with model init, etc.

![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?

If the setting is dangling (i.e. neither passed in by the user nor set via env), `ProcessGroupNCCL` can apply its own preferred logic. The torch-level API semantics do not change depending on whether the NCCL comm is blocking or non-blocking; the difference is handled within `ProcessGroupNCCL`.

### Why not make non-blocking default for lazy mode as well?

PR #137544 tried it. Two reasons why that is not preferred today:
1. It is hard: the blast radius is too large.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective, and we would block there waiting for the comm to be ready. The effect is the same as blocking init, with no "opening" for overlap as there is in eager mode.

Pull Request resolved: #138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
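The overlap argument can be checked with a toy timeline model. This is only a sketch under stated assumptions: `time_to_first_collective` is a hypothetical helper, the durations are made-up inputs rather than measurements, and it assumes model init is the only CPU work between process group creation and the first collective.

```python
def time_to_first_collective(comm_init_s, model_init_s, eager, nonblocking):
    """Toy model: wall-clock seconds until the first collective can launch."""
    if eager and nonblocking:
        # Comm init proceeds in the background while the CPU builds the
        # model, so the two costs overlap.
        return max(comm_init_s, model_init_s)
    # Eager + blocking: the CPU waits for comm init, then builds the model.
    # Lazy (blocking or not): the model is built first, comm init starts at
    # the first collective, and that collective must wait for the comm.
    return comm_init_s + model_init_s


# Hypothetical durations in seconds, chosen only for illustration.
comm_init_s, model_init_s = 2.0, 3.0
for eager in (True, False):
    for nonblocking in (True, False):
        total = time_to_first_collective(
            comm_init_s, model_init_s, eager, nonblocking
        )
        print(f"eager={eager} nonblocking={nonblocking}: {total}s")
```

In this model only the eager, non-blocking combination shortens the critical path, which matches the reasoning above that lazy init has no "opening" to overlap with.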

Previously, we only waited for the comm to become ready after its initialization. That is not enough. Other NCCL APIs can also leave the comm in the InProgress state, e.g. P2P calls, commSplit, commFinalize, etc. Therefore, we now ensure the comm is ready every time we next need to access ncclComm. The place to add this gatekeeper is `getNcclComm`.

Pull Request resolved: #138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374


[PGNCCL] Ensure comm is ready before all accesses

8808319


This was referenced Oct 19, 2024

[PGNCCL] Add default value for nccl_nonblocking_timeout #138374

Closed

[PGNCCL] Enable non-blocking API mode by default #137544

Closed

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Oct 19, 2024

kwen2501 requested review from fduwjj, shuqiangzhang and wconstab October 19, 2024 20:03

This was referenced Oct 21, 2024

[Forward Fix][PGNCCL] Add define guard for NCCL_SPLIT_NOCOLOR #138488

Closed

[PGNCCL] Use non-blocking mode by default in eager init #138527

Closed

kwen2501 requested a review from eqy October 22, 2024 00:15

kwen2501 mentioned this pull request Oct 22, 2024

[Distributed] Add more tests to use eager init #138644

Closed

shuqiangzhang approved these changes Oct 22, 2024


pytorch-bot bot added the ciflow/trunk label Oct 22, 2024

pytorchmergebot added the merging label Oct 22, 2024

fduwjj approved these changes Oct 22, 2024


pytorchmergebot added the Merged label Oct 23, 2024

pytorchmergebot closed this in f2ebf6d Oct 23, 2024

pytorchmergebot removed the merging label Oct 23, 2024

kwen2501 mentioned this pull request Oct 24, 2024

[PGNCCL] Fix P2P data corruption in non-blocking mode #138860

Closed

github-actions bot deleted the gh/kwen2501/78/head branch November 22, 2024 02:09
