🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164352
Note: Links to docs will display an error until the docs builds have been completed.
⏳ 48 Pending, 1 Unrelated Failure
As of commit 0bb9e78 with merge base 70d1043:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
7fd444e to 0bb9e78
|
@pytorchbot merge -f "Lint is green" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). |
|
If I am not mistaken, cu13 was using 2.27.7, not 2.27.5. So this is not strictly a revert but rather |
|
I suggest redoing this revert, since there is no NCCL 2.27.5 for cu13. Update: created #164383 as a forward fix. |
https://pypi.org/project/nvidia-nccl-cu13/#history does not have 2.27.5 but 2.27.7+. Companion PR: #164352
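
The kind of check described above (confirming that a pinned version actually exists on PyPI before referencing it) could be sketched roughly as follows. This is only an illustration, not part of either PR: the helper names `published_versions` and `missing_pins` are hypothetical, and the sketch assumes the PyPI JSON endpoint `https://pypi.org/pypi/<package>/json` with its `releases` mapping is an acceptable source of truth.

```python
import json
import urllib.request


def published_versions(package: str) -> set[str]:
    """Return the release versions PyPI lists for `package` (hypothetical helper)."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        metadata = json.load(resp)
    # "releases" maps each version string to the files uploaded for it.
    return set(metadata["releases"])


def missing_pins(pins: dict[str, str], lookup=published_versions) -> dict[str, str]:
    """Return the subset of `pins` whose pinned version is not published.

    `lookup` is injectable so the check can be exercised without network access.
    """
    return {pkg: ver for pkg, ver in pins.items() if ver not in lookup(pkg)}
```

Calling `missing_pins({"nvidia-nccl-cu13": "2.27.5"})` would be expected to report that pin, given the release history cited above, while a pin on 2.27.7 should come back clean.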
|
What's the H100 breakage? Curious if it's actually NCCL breakage or NCCL's interaction with NVSHMEM or something else that is broken |
https://pypi.org/project/nvidia-nccl-cu13/#history does not have 2.27.5 but 2.27.7+. Companion PR: #164352
Fixes a potential binary breakage due to non-existence of referenced NCCL cu13 version.
Pull Request resolved: #164383
Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/atalman
|
Yes, it is actually an NCCL breakage: NCCL doesn't correctly check errors (NVIDIA/nccl#1864). |
Revert pytorch#162351 as it breaks H100
Pull Request resolved: pytorch#164352
Approved by: https://github.com/atalman, https://github.com/malfet
Revert #162351 as it breaks H100