# [PGNCCL] Ensure comm is ready before all accesses #138384
Previously we only waited for the comm to become ready after its initialization. But that's not enough: there are other NCCL APIs that can cause the comm to be `ncclInProgress`. Therefore, we now ensure the comm is ready every time we call `getNcclComm`, as a protection for subsequent NCCL calls on the returned comm.

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
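To illustrate what such a readiness gate amounts to, here is a minimal sketch against the raw NCCL API -- not PyTorch's actual `getNcclComm` implementation, and `waitUntilReady` is a hypothetical helper name. In non-blocking mode, a pending operation leaves the comm in the `ncclInProgress` state, which `ncclCommGetAsyncError` reports until the operation completes:

```cpp
#include <nccl.h>
#include <stdexcept>

// Hypothetical gatekeeper: block until the communicator has finished
// whatever non-blocking operation left it "in progress".
ncclComm_t waitUntilReady(ncclComm_t comm) {
  ncclResult_t state = ncclInProgress;
  do {
    // Reports the status of the comm's pending non-blocking work.
    if (ncclCommGetAsyncError(comm, &state) != ncclSuccess) {
      throw std::runtime_error("ncclCommGetAsyncError failed");
    }
  } while (state == ncclInProgress);
  if (state != ncclSuccess) {
    throw std::runtime_error("NCCL communicator is in an error state");
  }
  return comm;  // now safe to hand out for the next NCCL call
}
```

A production version would also honor a timeout (e.g. the watchdog's) rather than spinning indefinitely.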
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Previously we only waited for the comm to become ready after its initialization. That's not enough: there are other NCCL APIs that can cause the comm to be `ncclInProgress`, e.g. P2P calls, `commSplit`, `commFinalize`, etc. Therefore, we ensure the comm is ready every "next time" we need to access it; the place to add such a gatekeeper is `getNcclComm`.

Pull Request resolved: #138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
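For context, a hedged sketch (raw NCCL API, not PyTorch's wrappers; `splitNonBlocking` is an illustrative name) of how a comm that was already ready can become `ncclInProgress` again on a later call such as `ncclCommSplit`:

```cpp
#include <nccl.h>

// A comm created in non-blocking mode can re-enter the "in progress" state
// on later APIs such as ncclCommSplit, not just at init time.
ncclComm_t splitNonBlocking(ncclComm_t parent, int color, int key) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  // non-blocking: split may return before completion

  ncclComm_t child = nullptr;
  ncclResult_t ret = ncclCommSplit(parent, color, key, &child, &config);
  // ret may be ncclInProgress: `child` exists but is not yet usable.
  // Handing it out without a readiness check (the pre-PR behavior) risks
  // passing a half-initialized comm to the next NCCL call.
  (void)ret;
  return child;  // callers must still gate on readiness (see waitUntilReady)
}
```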
### Why use non-blocking mode in eager init?

To overlap comm init with model init, etc.

### Why can we make non-blocking the default?

If the setting is dangling -- i.e. not passed in by the user nor set via env -- `ProcessGroupNCCL` can apply its own preferred logic. And the torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (that is handled within `ProcessGroupNCCL`).

### Why not make non-blocking the default for lazy mode as well?

PR #137544 tried it. Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we would block there waiting for the comm to become ready, so the effect is the same as blocking init, with no overlap "opening" compared to eager mode.

Pull Request resolved: #138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
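To make the eager-init overlap concrete, here is a minimal standalone-NCCL sketch (not the `ProcessGroupNCCL` code path; `initializeModel` is a hypothetical stand-in for other host-side work, and `waitUntilReady` is the helper sketched earlier):

```cpp
#include <nccl.h>

void initializeModel();                 // hypothetical overlapped CPU work
ncclComm_t waitUntilReady(ncclComm_t);  // from the earlier sketch

ncclComm_t eagerInitNonBlocking(int nranks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  // non-blocking mode

  ncclComm_t comm = nullptr;
  // Returns quickly (possibly ncclInProgress); init proceeds asynchronously.
  ncclCommInitRankConfig(&comm, nranks, id, rank, &config);

  initializeModel();  // overlapped with communicator initialization

  return waitUntilReady(comm);  // gate on readiness before first use
}
```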
Stack from ghstack (oldest at bottom):

- [PGNCCL] Ensure comm is ready before all accesses #138384 (this PR)
- nccl_nonblocking_timeout #138374

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o