[BE]: Update NCCL to 2.28.3 by Skylion007 · Pull Request #162351 · pytorch/pytorch

Skylion007 · 2025-09-07T19:36:20Z

@eqy New NCCL has a bunch of bugfixes and features, including reducing the number of SMs needed by NVLINK collectives, as well as some very useful new APIs for SymmetricMemory. It also allows FP8 support for non-reductive operations on pre-sm90 devices.

pytorch-bot · 2025-09-07T19:36:24Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162351

✅ No Failures

As of commit f8b1b0c with merge base 991e3d0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang · 2025-09-09T02:56:20Z

need to update it more places

tinglvv · 2025-09-09T18:07:55Z

Change looks good. We would need to upload the NCCL packages to download.pytorch.org (e.g. https://download.pytorch.org/whl/nightly/nvidia-nccl-cu13/) and get signals on ciflow/binaries before we merge.
cc @atalman

Skylion007 · 2025-09-14T18:32:18Z

need to update it more places

@ezyang, ah you meant the workflows. Fixed those.

Skylion007 · 2025-09-27T21:18:05Z

@pytorchbot merge

pytorchmergebot · 2025-09-27T21:19:50Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

@eqy New NCCL has a bunch of bugfixes and features, including reducing the number of SMs needed by NVLINK collectives, as well as some very useful new APIs for SymmetricMemory. It also allows FP8 support for non-reductive operations on pre-sm90 devices.
Pull Request resolved: #162351
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/atalman

ngimel · 2025-10-01T06:38:01Z

This update breaks all NCCL ops on H100 with "no kernel image available" on CUDA 12.9. Note that we cannot use 12.8 for reasons, and cannot use 13.0 because our driver version is insufficient, so 12.9 is our only option.

albanD · 2025-10-01T12:40:02Z

@pytorchbot revert -m "Broke H100 on 12.9" -c nosignal

Reverting out of caution as H100 is very widely used across our userbase.
@ptrblck can someone on your end take a look?

pytorchmergebot · 2025-10-01T12:41:46Z

@pytorchbot successfully started a revert job. Check the current status here.

pytorchmergebot · 2025-10-01T12:41:51Z

Reverting PR 162351 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 5504a06e014d96e5d226f08d403ae1117edf343e returned non-zero exit code 1

CONFLICT (modify/delete): .github/workflows/generated-linux-binary-manywheel-main.yml deleted in HEAD and modified in parent of 5504a06e014 ([BE]: Update NCCL to 2.28.3 (#162351)).  Version parent of 5504a06e014 ([BE]: Update NCCL to 2.28.3 (#162351)) of .github/workflows/generated-linux-binary-manywheel-main.yml left in tree.
error: could not revert 5504a06e014... [BE]: Update NCCL to 2.28.3 (#162351)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
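
For reference, the modify/delete conflict reported above can be reproduced in a throwaway repository. This is only a sketch and not what the merge bot runs: the file names (`version.txt`, `workflow.yml`), the commit messages, and the decision to keep the deletion are all assumptions made for illustration. It follows the resolution path the git hints describe, i.e. `git rm` the conflicted path and then `git revert --continue`.

```shell
#!/usr/bin/env bash
# Scratch reproduction of CONFLICT (modify/delete) during `git revert`.
# Every name below (temp repo, version.txt, workflow.yml) is hypothetical.
set -euo pipefail

repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
git config user.email "dev@example.com"
git config user.name "Example Dev"
git config commit.gpgsign false

# 1. Base commit: a version pin plus a generated workflow file.
printf '2.27.5\n' > version.txt
printf 'nccl: 2.27.5\n' > workflow.yml
git add version.txt workflow.yml
git commit -q -m "Add version pin and workflow"

# 2. The commit we will later revert: it modifies both files.
printf '2.28.3\n' > version.txt
printf 'nccl: 2.28.3\n' > workflow.yml
git commit -q -am "Update NCCL"
target="$(git rev-parse HEAD)"

# 3. A later, unrelated commit deletes the workflow file from HEAD.
git rm -q workflow.yml
git commit -q -m "Remove generated workflow"

# 4. The revert stops: workflow.yml is deleted in HEAD but modified
#    in the parent of the commit being reverted.
if git revert --no-edit "$target" >/dev/null 2>&1; then
    echo "unexpected: revert applied cleanly" >&2
    exit 1
fi

# 5. Resolve by keeping the deletion, then finish the revert.
#    GIT_EDITOR=true accepts the prepared commit message unchanged.
git rm -q workflow.yml
GIT_EDITOR=true git revert --continue >/dev/null

# Show the subject line of the resulting revert commit.
git log -1 --format=%s
```

In this sketch the revert still restores `version.txt`, while the already-deleted workflow file stays deleted. Whether keeping the deletion is the right resolution for a real repository depends on why the file was removed.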

Revert #162351 as it breaks H100
Pull Request resolved: #164352
Approved by: https://github.com/atalman, https://github.com/malfet

tinglvv · 2025-10-01T18:25:21Z

Hi @albanD @ngimel, would you mind sharing some links to the H100 failures so that we could take a look? Thanks.

ptrblck · 2025-10-01T18:31:20Z

@albanD A quick check does not show missing architectures. We'll follow up on Slack to get a repro

ngimel · 2025-10-02T18:27:46Z

Discussed offline: it's an NCCL bug where error propagation is handled incorrectly in the static linking case. Static linking is used for local source builds.

Skylion007 requested review from albanD, eqy, ezyang, ngimel, tinglvv and wanchaol September 7, 2025 19:36

Skylion007 requested review from a team and jeffdaily as code owners September 7, 2025 19:36

pytorch-bot bot added the ciflow/inductor and topic: not user facing labels Sep 7, 2025

pytorchbot added the open source label Sep 7, 2025

ezyang approved these changes Sep 9, 2025

ezyang added ciflow/h100 ciflow/h100-distributed ciflow/h100-symm-mem labels Sep 9, 2025

malfet approved these changes Sep 9, 2025

Skylion007 force-pushed the skylion007/update-nccl-2-28-3 branch from 2a11c8e to 15a3d77 Compare September 14, 2025 18:32

atalman approved these changes Sep 23, 2025

[BE]: Update NCCL to 2.28.3 and fix build runtime CUDA13 mismatch

f8b1b0c

Skylion007 force-pushed the skylion007/update-nccl-2-28-3 branch from 15a3d77 to f8b1b0c Compare September 27, 2025 19:51

Skylion007 changed the title ~~[BE]: Update NCCL to 2.28.3 and fix build runtime CUDA13 mismatch~~ [BE]: Update NCCL to 2.28.3 Sep 27, 2025

pytorch-bot bot added the ciflow/trunk label Sep 27, 2025

pytorchmergebot added the merging label Sep 27, 2025

pytorchmergebot added the Merged label Sep 28, 2025

pytorchmergebot closed this in 5504a06 Sep 28, 2025

pytorchmergebot removed the merging label Sep 28, 2025

albanD mentioned this pull request Oct 1, 2025

Revert nccl upgrade back to 2.27.5 #164352

Closed

Skylion007 mentioned this pull request Oct 24, 2025

[BE]: Update NCCL version to 2.28.7 #166174

Closed