Conversation

@Skylion007 (Collaborator) commented Jun 5, 2025

Fixes: #155052 and #153517

This upgrade is needed to make effective use of the symmetric memory kernels anyway. It also fixes some nasty NCCL bugs.
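As a quick sanity check for anyone trying a build with the bump, something like the sketch below should confirm the wheel actually picked up the new NCCL. This is only a sketch: the tuple form of `torch.cuda.nccl.version()` is what recent builds return, and CPU-only builds ship without NCCL at all.

```python
import torch

# Confirm the build is using the upgraded NCCL.
# Recent PyTorch builds return a tuple such as (2, 27, 3) here;
# CPU-only builds ship without NCCL, so guard accordingly.
print("CUDA toolkit:", torch.version.cuda)
print("Bundled NCCL:", torch.cuda.nccl.version())
assert torch.cuda.nccl.version() >= (2, 27, 3), "NCCL bump not picked up"
```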

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot bot commented Jun 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155233

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 008adee with merge base 70b68ca:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@nWEIdia (Collaborator) left a comment

Could you please re-enable the skipped unit test in test_c10d_nccl.py and see if this update would indeed fix #153517?

@Skylion007 force-pushed the skylion007/update-nccl-2-27-3 branch from 32bc4c4 to f16f184 on June 5, 2025 at 16:40
@pytorch-bot bot added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue) on Jun 5, 2025
@Skylion007 requested a review from nWEIdia on June 5, 2025 at 16:40
@Skylion007 (Collaborator, Author)

> Could you please re-enable the skipped unit test in test_c10d_nccl.py and see if this update would indeed fix #153517?

I think I re-enabled the skipped test; please confirm.
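For reference, the re-enabling could look roughly like the sketch below; the test name, the helper, and the exact decorator are placeholders, not the real code in test_c10d_nccl.py:

```python
import unittest
import torch

# Hypothetical sketch only: gate the previously skipped test on the bundled
# NCCL version rather than skipping it unconditionally, so it runs again
# once the bundled library is >= 2.27.3 (the version fixing #153517).
def _nccl_at_least(major: int, minor: int, patch: int) -> bool:
    # torch.cuda.nccl.version() returns a tuple like (2, 27, 3) on recent builds.
    return torch.cuda.nccl.version() >= (major, minor, patch)

@unittest.skipIf(not _nccl_at_least(2, 27, 3),
                 "hits a known NCCL bug; needs NCCL >= 2.27.3 (see #153517)")
def test_previously_skipped_collective():  # placeholder name
    ...  # body of the re-enabled collective test
```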

@nWEIdia (Collaborator) commented Jun 5, 2025

Upstream CI and runners should be able to cover cuda 12.6 + nccl 2.27.3 cases.
Perhaps on our end, we need to test cuda 12.8 + nccl 2.27.3 on B200 nodes. Is there a way for you to test? If not, we can do the test.
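If it helps, here is a small probe one could run on the node first to confirm exactly which combination is under test (a sketch using public torch APIs; the expectation that B200 reports compute capability (10, 0) is my assumption):

```python
import torch

# Print the combination under test before running test_c10d_nccl.py on the node.
print("CUDA toolkit the wheel targets:", torch.version.cuda)  # e.g. "12.6" or "12.8"
print("Bundled NCCL:", torch.cuda.nccl.version())             # expect (2, 27, 3)
if torch.cuda.is_available():
    # A Blackwell/B200 node is expected to report a (10, x) compute capability.
    print("GPU:", torch.cuda.get_device_name(0),
          "capability:", torch.cuda.get_device_capability(0))
```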

@Skylion007 (Collaborator, Author)

> Upstream CI and runners should be able to cover cuda 12.6 + nccl 2.27.3 cases. Perhaps on our end, we need to test cuda 12.8 + nccl 2.27.3 on B200 nodes. Is there a way for you to test? If not, we can do the test.

I don't have access to any B200 nodes at the moment, and I don't think torch has any B200 CI runners set up yet.

@nWEIdia (Collaborator) commented Jun 5, 2025

> Upstream CI and runners should be able to cover cuda 12.6 + nccl 2.27.3 cases. Perhaps on our end, we need to test cuda 12.8 + nccl 2.27.3 on B200 nodes. Is there a way for you to test? If not, we can do the test.

> I don't have access to any B200 nodes at the moment, and I don't think torch has any B200 CI runners set up yet.

Understood, then let's first wait for cu126 signals and worry about B200 runners later.

@Skylion007 force-pushed the skylion007/update-nccl-2-27-3 branch from f16f184 to a81c050 on June 6, 2025 at 16:06
@Skylion007 (Collaborator, Author)

> Upstream CI and runners should be able to cover cuda 12.6 + nccl 2.27.3 cases. Perhaps on our end, we need to test cuda 12.8 + nccl 2.27.3 on B200 nodes. Is there a way for you to test? If not, we can do the test.

> I don't have access to any B200 nodes at the moment, and I don't think torch has any B200 CI runners set up yet.

> Understood, then let's first wait for cu126 signals and worry about B200 runners later.

The cu126 runners passed, by the way, so it looks like the upgrade fixed it.

@nWEIdia (Collaborator) commented Jun 6, 2025

> Upstream CI and runners should be able to cover cuda 12.6 + nccl 2.27.3 cases. Perhaps on our end, we need to test cuda 12.8 + nccl 2.27.3 on B200 nodes. Is there a way for you to test? If not, we can do the test.

> I don't have access to any B200 nodes at the moment, and I don't think torch has any B200 CI runners set up yet.

> Understood, then let's first wait for cu126 signals and worry about B200 runners later.

> The cu126 runners passed, by the way, so it looks like the upgrade fixed it.

That is good news on the cu126 side!

But for cu128, we are still testing internally with the latest PyTorch; please wait before merging, thanks!

@kwen2501 self-requested a review on June 6, 2025 at 20:16
@kwen2501 (Collaborator) left a comment

Thanks! Indeed, good news that it fixes the bug.
Let's upgrade!

@atalman (Contributor) left a comment

lgtm

@nWEIdia (Collaborator) commented Jun 6, 2025

The B200 signal with a roughly two-month-old PyTorch looks good. Testing with a current (days-old) PyTorch now.

@nWEIdia (Collaborator) commented Jun 10, 2025

> The B200 signal with a roughly two-month-old PyTorch looks good. Testing with a current (days-old) PyTorch now.

Update: that more recent PyTorch also passes my check with this NCCL 2.27.3 version.

@nWEIdia (Collaborator) commented Jun 10, 2025

Looks like a rebase is needed here.

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot (Collaborator)

@Skylion007 (Collaborator, Author)

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 6 checks:

- macos-arm64-binary-wheel / wheel-py3_10-cpu-build
- macos-arm64-binary-wheel / wheel-py3_9-cpu-build
- macos-arm64-binary-wheel / wheel-py3_12-cpu-build
- macos-arm64-binary-wheel / wheel-py3_13t-cpu-build
- macos-arm64-binary-wheel / wheel-py3_13-cpu-build
- macos-arm64-binary-wheel / wheel-py3_11-cpu-build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@Skylion007 force-pushed the skylion007/update-nccl-2-27-3 branch from 9d55297 to bf677de on June 11, 2025 at 21:19
@Skylion007 (Collaborator, Author)

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@Skylion007 requested a review from tinglvv on June 12, 2025 at 16:03
@Skylion007 force-pushed the skylion007/update-nccl-2-27-3 branch from bf677de to 8c7a226 on June 13, 2025 at 20:46
@Skylion007 (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@Skylion007 force-pushed the skylion007/update-nccl-2-27-3 branch from 7a63164 to 008adee on June 14, 2025 at 14:56
@Skylion007 (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Labels

- better-engineering (Relatively self-contained tasks for better engineering contributors)
- ciflow/binaries_wheel (Trigger binary build and upload jobs for wheel on the PR)
- ciflow/inductor
- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- open source
- topic: not user facing (topic category)

Development

Successfully merging this pull request may close these issues: Update nccl 2.27.3 in pytorch nightly

7 participants