[BE]: Update NCCL to 2.28.3 #162351

Closed
Skylion007 wants to merge 1 commit into pytorch:main from Skylion007:skylion007/update-nccl-2-28-3
Conversation

@Skylion007
Collaborator

@Skylion007 Skylion007 commented Sep 7, 2025

@eqy New NCCL has a bunch of bugfixes for features, including reducing the number of SMs needed by NVLINK collectives, as well as some very useful new APIs for SymmetricMemory. It also allows FP8 support for non-reductive operations on pre-sm90 devices.

@pytorch-bot

pytorch-bot bot commented Sep 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162351

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f8b1b0c with merge base 991e3d0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ezyang
Contributor

ezyang commented Sep 9, 2025

Need to update it in more places.

@tinglvv
Collaborator

tinglvv commented Sep 9, 2025

The change looks good. We would need to upload the NCCL packages to download.pytorch.org (e.g. https://download.pytorch.org/whl/nightly/nvidia-nccl-cu13/) and get signals on ciflow/binaries before we merge.
cc @atalman

@Skylion007 Skylion007 force-pushed the skylion007/update-nccl-2-28-3 branch from 2a11c8e to 15a3d77 Compare September 14, 2025 18:32
@Skylion007
Collaborator Author

Need to update it in more places.

@ezyang, ah you meant the workflows. Fixed those.

@Skylion007 Skylion007 force-pushed the skylion007/update-nccl-2-28-3 branch from 15a3d77 to f8b1b0c Compare September 27, 2025 19:51
@Skylion007 Skylion007 changed the title from "[BE]: Update NCCL to 2.28.3 and fix build runtime CUDA13 mismatch" to "[BE]: Update NCCL to 2.28.3" on Sep 27, 2025
@Skylion007
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Sep 27, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

jainapurva pushed a commit that referenced this pull request Sep 29, 2025
@eqy New NCCL has a bunch of bugfixes for features, including reducing the number of SMs needed by NVLINK collectives, as well as some very useful new APIs for SymmetricMemory. It also allows FP8 support for non-reductive operations on pre-sm90 devices.
Pull Request resolved: #162351
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/atalman
maggiemoss pushed a commit to maggiemoss/pytorch that referenced this pull request Sep 29, 2025
@ngimel
Collaborator

ngimel commented Oct 1, 2025

This update breaks all NCCL ops on H100 with "no kernel image available" on CUDA 12.9. Note that we cannot use 12.8 for reasons, and cannot use 13.0 because our driver version is insufficient, so 12.9 is the only option.
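
For anyone trying to reproduce a report like this, a minimal sketch of the build facts worth collecting first is shown below. The helper name `collect_nccl_build_info` is hypothetical and not part of PyTorch; it relies only on `torch.version.cuda`, `torch.cuda.nccl.version()`, `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. Note that `get_arch_list()` reports the architectures PyTorch itself was compiled for, which is an assumption-laden proxy and says nothing definitive about the kernels inside the NCCL library.

```python
def collect_nccl_build_info():
    """Return a dict describing the local PyTorch/CUDA/NCCL build.

    Hypothetical helper for bug reports; missing pieces are left as None
    rather than raising, so it also runs on machines without torch or a GPU.
    """
    info = {
        "torch": None,
        "cuda": None,
        "nccl": None,
        "capability": None,
        "arch_list": None,
    }
    try:
        import torch
    except ImportError:
        info["error"] = "torch is not installed"
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds

    try:
        # Tuple such as (major, minor, patch) for the NCCL PyTorch was built with.
        info["nccl"] = torch.cuda.nccl.version()
    except Exception as exc:  # builds without NCCL support
        info["nccl_error"] = repr(exc)

    if torch.cuda.is_available():
        # (9, 0) is the compute capability reported by H100.
        info["capability"] = torch.cuda.get_device_capability(0)
        info["arch_list"] = torch.cuda.get_arch_list()
    return info


if __name__ == "__main__":
    for key, value in collect_nccl_build_info().items():
        print(f"{key}: {value}")
```

Attaching this output together with the full error text makes it easier to tell a missing-architecture problem apart from other failures that surface with the same message.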

@albanD
Collaborator

albanD commented Oct 1, 2025

@pytorchbot revert -m "Broke H100 on 12.9" -c nosignal

Reverting out of caution as H100 is very widely used across our userbase.
@ptrblck can someone on your end take a look?

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

Reverting PR 162351 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 5504a06e014d96e5d226f08d403ae1117edf343e returned non-zero exit code 1

CONFLICT (modify/delete): .github/workflows/generated-linux-binary-manywheel-main.yml deleted in HEAD and modified in parent of 5504a06e014 ([BE]: Update NCCL to 2.28.3 (#162351)).  Version parent of 5504a06e014 ([BE]: Update NCCL to 2.28.3 (#162351)) of .github/workflows/generated-linux-binary-manywheel-main.yml left in tree.
error: could not revert 5504a06e014... [BE]: Update NCCL to 2.28.3 (#162351)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job

@tinglvv
Collaborator

tinglvv commented Oct 1, 2025

Hi @albanD @ngimel, would you mind sharing some links to the H100 failures so that we could take a look? Thanks.

@ptrblck
Collaborator

ptrblck commented Oct 1, 2025

@albanD A quick check does not show any missing architectures. We'll follow up on Slack to get a repro.

@ngimel
Collaborator

ngimel commented Oct 2, 2025

Discussed offline: it's an NCCL bug where error propagation is handled incorrectly in the static linking case. Static linking is used for local source builds.
