[BE]: Update NCCL version to 2.28.7 #166174

Closed
Skylion007 wants to merge 1 commit into pytorch:main from Skylion007:skylion007/update-nccl-2-28-7

Conversation

@Skylion007
Collaborator

@Skylion007 Skylion007 commented Oct 24, 2025

Update NCCL version to 2.28.7. This fixes the issues that prevented the 2.28.3 update in #162351.

@Skylion007 Skylion007 requested review from a team and jeffdaily as code owners October 24, 2025 17:38
@pytorch-bot

pytorch-bot bot commented Oct 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166174

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Cancelled Jobs

As of commit b6e6ecb with merge base 380d440:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Skylion007 Skylion007 added the better-engineering Relatively self-contained tasks for better engineering contributors label Oct 24, 2025
@Skylion007 Skylion007 requested a review from eqy October 24, 2025 19:51
@eqy eqy requested a review from nWEIdia October 24, 2025 19:56
@eqy
Collaborator

eqy commented Oct 24, 2025

We had a number of issues with fixes in flight, so I requested a review from @nWEIdia to confirm whether 2.28.7 is a good target or whether we should wait for a slightly later version.

@nWEIdia
Collaborator

nWEIdia commented Oct 24, 2025

For this version, we have known performance regression signals. I recommend waiting a few days for the NCCL team's update.

@Skylion007 Skylion007 closed this Oct 24, 2025
@Skylion007 Skylion007 reopened this Oct 28, 2025
@jerryzh168 jerryzh168 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 30, 2025
@kwen2501
Collaborator

@nWEIdia @eqy wondering if you heard any update?

@kwen2501 kwen2501 self-requested a review November 14, 2025 21:16
@nWEIdia
Collaborator

nWEIdia commented Nov 14, 2025

> @nWEIdia @eqy wondering if you heard any update?

We should be good to go with 2.28.9 https://github.com/NVIDIA/nccl/tree/v2.28.9-1
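
The thread weighs several NCCL versions against each other (2.28.3, 2.28.7, and the tag `v2.28.9-1`). As a minimal sketch of how such version strings could be compared numerically, the following helper parses both the plain and the tag form. The function names `parse_nccl_version` and `meets_minimum` are hypothetical and are not part of PyTorch or NCCL; the accepted string formats are an assumption based only on the versions quoted in this conversation.

```python
import re
from typing import Tuple


def parse_nccl_version(tag: str) -> Tuple[int, int, int]:
    """Parse '2.28.7' or 'v2.28.9-1' into (major, minor, patch).

    Hypothetical helper: the accepted formats are assumed from the
    version strings quoted in this thread, not from an NCCL spec.
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-\d+)?", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized NCCL version string: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`.

    Tuples compare element by element, so (2, 28, 9) >= (2, 28, 7)
    and (2, 9, 0) < (2, 28, 0), unlike a plain string comparison.
    """
    return parse_nccl_version(installed) >= parse_nccl_version(minimum)
```

For example, `meets_minimum("v2.28.9-1", "2.28.7")` evaluates to true under this sketch, while `meets_minimum("2.28.3", "2.28.7")` does not.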

@ngimel
Collaborator

ngimel commented Nov 14, 2025

@nWEIdia can you please make a PR?

@nWEIdia
Collaborator

nWEIdia commented Nov 14, 2025

> @nWEIdia can you please make a PR?

Yes, I will do so. I just realized that we tagged 2.28.9, but the PyPI wheels are yet to be published: https://pypi.org/project/nvidia-nccl-cu13/

@Skylion007
Collaborator Author

Skylion007 commented Nov 15, 2025

@nWEIdia BTW, @atalman finally added self-service wheel updates via PR. See this PR of mine as an example; one like it will need to land first when you do the upgrade: pytorch/test-infra#7415

@nWEIdia
Collaborator

nWEIdia commented Nov 18, 2025

> @nWEIdia can you please make a PR?

Hi @ngimel, I created the PR: #168091. Please take a look, thanks!

fduwjj added a commit that referenced this pull request Nov 27, 2025
….28.9-1"


(This PR will be rebased on #166174)

We did the following:
1. Add exchange of the buffer pointer and signal pad pointer via the NCCL device API introduced in NCCL 2.28.
2. With item 1, show that symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (a UT is added for it).
3. Add simple put, put with signal, wait for signal, and get operations, so that symmem's one-sided API works.
4. Show in a UT that symmem from the NCCL backend also works with traditional c10d collectives.
5. Store DevComm inside symmetric memory so that users can access it for customized kernels.

[ghstack-poisoned]
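
The one-sided operations named in the list above (put, put with signal, wait for signal, get) can be illustrated with a toy model. This is only a sketch of the semantics using CPU threads and a condition variable; it does not use NCCL, CUDA, or PyTorch, and the class `ToySymmetricBuffer` and all of its methods are hypothetical names chosen for illustration, not the API added by the referenced commit.

```python
import threading
from typing import List


class ToySymmetricBuffer:
    """Toy stand-in for one peer's buffer plus a signal counter.

    Hypothetical illustration only: real symmetric memory lives on
    GPUs and is accessed by kernels, not by Python threads.
    """

    def __init__(self, size: int) -> None:
        self._data: List[int] = [0] * size
        self._signal = 0
        self._cond = threading.Condition()

    def _write(self, offset: int, values: List[int]) -> None:
        if offset < 0 or offset + len(values) > len(self._data):
            raise IndexError("write outside of buffer")
        self._data[offset:offset + len(values)] = values

    def put(self, offset: int, values: List[int]) -> None:
        """Write into the peer's buffer without notifying it."""
        with self._cond:
            self._write(offset, values)

    def put_with_signal(self, offset: int, values: List[int]) -> None:
        """Write, then bump the signal so a waiter knows data arrived."""
        with self._cond:
            self._write(offset, values)
            self._signal += 1
            self._cond.notify_all()

    def wait_for_signal(self, expected: int, timeout: float = 5.0) -> bool:
        """Block until the signal counter reaches `expected` or time runs out."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._signal >= expected, timeout=timeout
            )

    def get(self, offset: int, length: int) -> List[int]:
        """Read a copy of a slice of the peer's buffer."""
        with self._cond:
            return list(self._data[offset:offset + length])


def demo() -> List[int]:
    """One thread puts with a signal; the caller waits, then reads."""
    peer = ToySymmetricBuffer(size=4)
    writer = threading.Thread(
        target=peer.put_with_signal, args=(0, [1, 2, 3, 4])
    )
    writer.start()
    arrived = peer.wait_for_signal(expected=1)
    writer.join()
    if not arrived:
        raise TimeoutError("signal did not arrive in time")
    return peer.get(0, 4)
```

The point of pairing the write with the signal is ordering: the reader only calls `get` after `wait_for_signal` returns, so it never observes a partially delivered payload. A plain `put` gives no such guarantee on its own.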
pytorchmergebot pushed a commit that referenced this pull request Dec 9, 2025
…68129)

(This PR will be rebased on #166174) (Another PR updates the NCCL version: #168091)

We did the following:
1. Add exchange of the buffer pointer and signal pad pointer via the NCCL device API introduced in NCCL 2.28.
2. With item 1, show that symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (a UT is added for it).
3. Add simple put, put with signal, wait for signal, and get operations, so that symmem's one-sided API works.
4. Show in a UT that symmem from the NCCL backend also works with traditional c10d collectives.
5. Store DevComm inside symmetric memory so that users can access it for customized kernels.

Resolves #167682
Pull Request resolved: #168129
Approved by: https://github.com/kwen2501, https://github.com/ngimel, https://github.com/atalman
pytorchbot pushed a commit that referenced this pull request Dec 13, 2025
…68129)

(cherry picked from commit c907c77)
@Skylion007 Skylion007 closed this Jan 2, 2026

Labels

better-engineering: Relatively self-contained tasks for better engineering contributors
ciflow/inductor
open source
topic: not user facing (topic category)
triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module

8 participants