[SymmetricMemoryOps] implement one_shot_all_reduce #137472
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137472
Note: Links to docs will display an error until the doc builds have been completed.
✅ No failures as of commit 2dfe6d8 with merge base 9b2e453.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Chillee left a comment:
I assume you've benchmarked this?
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with it.

Pull Request resolved: #137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
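For intuition, here is a minimal single-process sketch of the two-shot pattern (not the real CUDA kernel; buffer sizes and contents are illustrative): in the first shot each rank reduces only its 1/world_size shard across all peers' symmetric-memory buffers, and in the second shot it gathers the reduced shards back from its peers.

```python
import torch

# Hypothetical single-process illustration of the two-shot all-reduce pattern;
# in the real kernel each "rank" below is a GPU reading peers' symm_mem buffers.
world_size, n = 4, 1024
bufs = [torch.randn(n) for _ in range(world_size)]  # one buffer per rank
shard = n // world_size

# Shot 1 (reduce-scatter): rank r reduces only its own shard across all peers.
partials = [
    torch.stack([b[r * shard:(r + 1) * shard] for b in bufs]).sum(dim=0)
    for r in range(world_size)
]

# Shot 2 (all-gather): every rank reads each peer's reduced shard.
result = torch.cat(partials)
assert torch.allclose(result, torch.stack(bufs).sum(dim=0))
```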
…reduce (#137474)

## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.

Pull Request resolved: #137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
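Why an out-variant: `IntraNodeComm` manages its own output buffer, so an op that writes into a caller-provided tensor avoids an extra allocation and copy inside the collective. A hedged sketch of the call shape — the op schema is assumed here as `(input, reduce_op, group_name, out)` and may differ across PyTorch versions:

```python
import torch

def multimem_one_shot_sum(buf: torch.Tensor, group_name: str) -> torch.Tensor:
    """Reduce `buf` across ranks via the multimem one-shot out-variant.

    Assumptions: `buf` is a symmetric-memory tensor all ranks have
    rendezvous'd on, the node has NVSwitch multicast support, and the op
    schema matches the (input, reduce_op, group_name, out) shape sketched here.
    """
    out = torch.empty_like(buf)  # caller-provided output buffer
    torch.ops.symm_mem.multimem_one_shot_all_reduce_out(buf, "sum", group_name, out)
    return out
```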
…m ops (#137475)

## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`.
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`.
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee numerical consistency across all ranks (see the non-associativity sketch below).
- Removes the `IntraNodeComm` python binding (its original purpose is superseded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces the NVLink detection logic with `DMAConnectivityDetector`.

Pull Request resolved: #137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
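The HCM removal hinges on a basic floating-point fact: addition is not associative, so if ranks accumulate peers' contributions in different orders (as a hierarchical scheme naturally does), their results can differ bitwise. A quick standalone illustration:

```python
import torch

# Floating-point addition is not associative: the same three values summed
# in two different orders give different fp32 results.
a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(1.0, dtype=torch.float32)
c = torch.tensor(-1e8, dtype=torch.float32)
print((a + b) + c)  # tensor(0.) -- 1.0 is lost to rounding when added to 1e8
print((a + c) + b)  # tensor(1.)
```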
…ating bfloat16 with multimem.ld_reduce (#137529)

This provides better accuracy without additional cost. Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats.

Pull Request resolved: #137529
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475
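The accuracy win comes from accumulating in higher precision and rounding to bf16 only once at the end, instead of rounding after every addition. A single-process illustration of the gap (values and sizes are arbitrary):

```python
import torch

torch.manual_seed(0)
world_size = 8
contribs = [torch.randn(4096) for _ in range(world_size)]
ref = torch.stack(contribs).sum(dim=0)  # fp32 reference

# Round to bf16 after every add (naive bf16 accumulation).
acc_bf16 = contribs[0].to(torch.bfloat16)
for c in contribs[1:]:
    acc_bf16 = acc_bf16 + c.to(torch.bfloat16)  # rounds to bf16 at every step

# Accumulate in fp32, round once at the end (what fp32 accumulation buys).
acc_fp32 = torch.stack(contribs).sum(dim=0).to(torch.bfloat16)

print((acc_bf16.float() - ref).abs().max())  # larger error
print((acc_fp32.float() - ref).abs().max())  # at most one bf16 rounding step
```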
- Previously, the detection would fail if run before the user called APIs such as `torch.cuda.set_device()`. This is because the detection logic requires NVML initialization. In this PR, we add explicit NVML initialization (which is idempotent).
- Previously, any NVML issue that occurred in the detection logic would result in a fatal error. Now we issue an informative warning and return a topology assuming no NVLink connectivity.

Pull Request resolved: #137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
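A rough Python analogue of the hardened behavior, using `pynvml` (the actual implementation calls NVML from C++; the function name and fallback shape here are illustrative assumptions):

```python
import warnings

def count_nvlinks_per_device(device_count: int) -> list[int]:
    """Hypothetical sketch: initialize NVML explicitly, and on any NVML
    failure warn and report no NVLink connectivity instead of raising."""
    links = [0] * device_count
    try:
        import pynvml
        pynvml.nvmlInit()  # explicit init; safe even before torch.cuda.set_device()
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
                try:
                    state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
                    if state == pynvml.NVML_FEATURE_ENABLED:
                        links[i] += 1
                except pynvml.NVMLError:
                    break  # link index not present on this device
        pynvml.nvmlShutdown()
    except Exception as exc:
        # Any NVML problem downgrades to a warning plus a "no NVLink" topology.
        warnings.warn(f"NVLink detection via NVML failed ({exc}); "
                      "assuming no NVLink connectivity")
        return [0] * device_count
    return links
```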
ghstack-source-id: 6dbadcb
Pull Request resolved: pytorch#137472
Hi @yifuwang, do you know why PyTorch only supports fp32 and bf16 for multimem-based all-reduce? Why did PyTorch decide to drop fp16 / fp8_e4m3 / fp8_e5m2 for now? Thanks~
Stack from ghstack (oldest at bottom):
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
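A hedged end-to-end sketch of how the new op might be invoked. The `torch.distributed._symmetric_memory` helpers (`empty`, `rendezvous`) are assumed from later versions of the API and may differ across PyTorch builds; run with torchrun on a multi-GPU, NVLink-connected node:

```python
# Illustrative usage sketch; API names assumed, verify against your PyTorch build.
# Launch: torchrun --nproc-per-node=8 this_script.py
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem  # private/experimental API

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
group_name = dist.group.WORLD.group_name

# Allocate a symmetric-memory buffer and rendezvous so peers can map it.
buf = symm_mem.empty(4096, dtype=torch.bfloat16, device="cuda")
symm_mem.rendezvous(buf, group_name)
buf.fill_(rank)

# One-shot all-reduce: each rank reads every peer's buffer and reduces locally.
out = torch.ops.symm_mem.one_shot_all_reduce(buf, "sum", group_name)
assert out.eq(sum(range(dist.get_world_size()))).all()

dist.destroy_process_group()
```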