-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[BE]: Update NCCL to 2.27.3 #155233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BE]: Update NCCL to 2.27.3 #155233
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155233
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 008adee with merge base 70b68ca ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
nWEIdia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please re-enable the skipped unit test in test_c10d_nccl.py and see if this update would indeed fix #153517 ?
32bc4c4 to
f16f184
Compare
I think I reenabled the skipped test, please confirm |
|
Upstream CI and runners should be able to cover cuda 12.6 + nccl 2.27.3 cases. |
I don't have access to any B200 nodes at the moment, and I don't think torch has any B200 CI runners setup yet |
Understood, then let's first wait for cu126 signals and worry about B200 runners later. |
f16f184 to
a81c050
Compare
cu126 runners passed btw, so looks like it fixed it |
That is good news on the cu126 side! But for cu128, we are internally still testing with the latest pytorch, please wait before merging, thanks! |
kwen2501
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Indeed good news it fixes the bug.
Let's upgrade!
atalman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
|
The B200 signal with ~ 2 months old pytorch looks good. Testing with current ( ~ days old) pytorch now. |
Update: that more recent pytorch also passes my check with this NCCL 2.27.3 version. |
|
Looks like a rebase is needed here. |
a81c050 to
9f6496d
Compare
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 5 jobs have failed, first few of them are: macos-arm64-binary-wheel / wheel-py3_10-cpu-build, macos-arm64-binary-wheel / wheel-py3_9-cpu-build, macos-arm64-binary-wheel / wheel-py3_12-cpu-build, macos-arm64-binary-wheel / wheel-py3_13-cpu-build, macos-arm64-binary-wheel / wheel-py3_11-cpu-build Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 6 checks: macos-arm64-binary-wheel / wheel-py3_10-cpu-build, macos-arm64-binary-wheel / wheel-py3_9-cpu-build, macos-arm64-binary-wheel / wheel-py3_12-cpu-build, macos-arm64-binary-wheel / wheel-py3_13t-cpu-build, macos-arm64-binary-wheel / wheel-py3_13-cpu-build, macos-arm64-binary-wheel / wheel-py3_11-cpu-build Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
9d55297 to
bf677de
Compare
|
@pytorchbot merge -i |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
bf677de to
8c7a226
Compare
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command |
7a63164 to
008adee
Compare
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Fixes: pytorch#155052 and pytorch#153517 This upgrade is needed to effectively use those symmetric memory kernels anyway. Also fixes some nasty NCCL bugs. Pull Request resolved: pytorch#155233 Approved by: https://github.com/nWEIdia, https://github.com/kwen2501, https://github.com/atalman, https://github.com/eqy
Fixes: #155052 and #153517
This upgrade is needed to effectively use those symmetric memory kernels anyway. Also fixes some nasty NCCL bugs.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k