Revert nccl upgrade back to 2.27.5#164352

Closed
albanD wants to merge 1 commit into pytorch:main from albanD:revert_nccl_upgrade

@albanD albanD commented Oct 1, 2025

Revert #162351 as it breaks H100

@albanD albanD requested review from a team and jeffdaily as code owners October 1, 2025 13:13

pytorch-bot bot commented Oct 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164352

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 48 Pending, 1 Unrelated Failure

As of commit 0bb9e78 with merge base 70d1043 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ci-no-td Do not run TD on this PR ciflow/inductor topic: not user facing topic category labels Oct 1, 2025
@albanD albanD requested review from atalman and ngimel October 1, 2025 13:32
@albanD albanD force-pushed the revert_nccl_upgrade branch from 7fd444e to 0bb9e78 Compare October 1, 2025 14:12

malfet commented Oct 1, 2025

@pytorchbot merge -f "Lint is green"

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here


nWEIdia commented Oct 1, 2025

If I am not mistaken, cu13 was using 2.27.7, not 2.27.5. So this is not strictly a revert, but rather a revert plus a downgrade of NCCL from 2.27.7 to 2.27.5 in cu13.
See #162351


nWEIdia commented Oct 1, 2025

I suggest redoing this revert, since there is no 2.27.5 NCCL release for cu13.
See: https://pypi.org/project/nvidia-nccl-cu13/#history

update: created #164383 to forward fix.
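
The availability claim above can be checked programmatically. The following is a minimal sketch, assuming the public PyPI JSON API (`https://pypi.org/pypi/<project>/json`) and its `releases` mapping; the helper names `fetch_pypi_metadata`, `has_release`, and `report` are hypothetical and are not part of PyTorch or NCCL tooling.

```python
import json
import urllib.request


def fetch_pypi_metadata(project: str) -> dict:
    """Download project metadata from the PyPI JSON API (needs network access)."""
    url = f"https://pypi.org/pypi/{project}/json"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def has_release(metadata: dict, version: str) -> bool:
    """Return True if `version` is listed with at least one uploaded file."""
    files = metadata.get("releases", {}).get(version, [])
    return len(files) > 0


def report(project: str, versions: list[str]) -> dict[str, bool]:
    """Map each candidate version to whether PyPI lists files for it."""
    metadata = fetch_pypi_metadata(project)
    return {version: has_release(metadata, version) for version in versions}
```

Calling `report("nvidia-nccl-cu13", ["2.27.5", "2.27.7"])` would show which of the two versions has published files. The result depends on the state of PyPI at the time of the call, so no expected output is given here.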

@Skylion007

What's the H100 breakage? Curious whether it's actually an NCCL breakage, or NCCL's interaction with NVSHMEM, or something else that is broken.


ngimel commented Oct 2, 2025

Yes, it is actually an NCCL breakage: NCCL doesn't correctly check errors. See NVIDIA/nccl#1864.
