Conversation

yifuwang (Collaborator) commented Oct 8, 2024

pytorch-bot bot commented Oct 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137471

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit dd059f5 with merge base 9b2e453:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@yifuwang yifuwang requested review from Chillee and weifengpy October 8, 2024 19:01
@yifuwang yifuwang marked this pull request as ready for review October 8, 2024 19:01
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
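A minimal usage sketch (hypothetical; the allocation helpers `symm_mem.empty`/`symm_mem.rendezvous` and the exact op signature are assumptions that may differ across PyTorch versions):

```python
# Launch with e.g.: torchrun --nproc-per-node=8 demo.py
import os

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem  # assumed helper module

rank = int(os.environ["RANK"])
torch.cuda.set_device(rank)
dist.init_process_group("nccl")
group_name = dist.group.WORLD.group_name

# Allocate the input from the symmetric-memory allocator and rendezvous so
# every rank maps its peers' buffers.
inp = symm_mem.empty(4096, dtype=torch.bfloat16, device="cuda")
symm_mem.rendezvous(inp, group_name)

inp.fill_(rank)
# One-shot: every rank pulls all peers' data and reduces locally.
out = torch.ops.symm_mem.one_shot_all_reduce(inp, "sum", group_name)
assert out.eq(sum(range(dist.get_world_size()))).all()
dist.destroy_process_group()
```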
Pull Request resolved: #137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with it.
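A hedged sketch of the in-place variant (signature assumed), continuing the setup from the sketch above. Unlike the one-shot op, which returns a new tensor, the two-shot op (trailing underscore) mutates its symmetric-memory input:

```python
# Two-shot: reduce-scatter across peers, then each rank gathers the
# reduced shards back into its own buffer, so the result lands in place.
buf = symm_mem.empty(4096, dtype=torch.bfloat16, device="cuda")
symm_mem.rendezvous(buf, group_name)
buf.fill_(rank)
torch.ops.symm_mem.two_shot_all_reduce_(buf, "sum", group_name)
```

Two-shot moves each element twice but splits the reduction work across ranks, which typically wins over one-shot at larger message sizes.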

Pull Request resolved: #137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
…reduce (#137474)

## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
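A sketch of how the out-variant might be called (signature assumed; reusing `buf` and `group_name` from the sketches above). The `multimem` ops rely on hardware multicast (NVLS), so the input must come from a symmetric-memory allocation with a multicast address:

```python
# Out-variant: the result lands in a caller-provided regular tensor,
# which suits an integration that owns its own output allocation.
out = torch.empty(4096, dtype=torch.bfloat16, device="cuda")
torch.ops.symm_mem.multimem_one_shot_all_reduce_out(buf, "sum", group_name, out)
```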
Pull Request resolved: #137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
…m ops (#137475)

## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now); due to its accumulation order, we can't guarantee numerical consistency across all ranks (see the sketch after this list).
- Removes the `IntraNodeComm` Python binding (its original purpose has been superseded by `SymmetricMemory`).
- Removes the methods that existed only for the Python binding.
- Replaces the NVLink detection logic with `DMAConnectivityDetector`.
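
To see why a fixed accumulation order matters, a small illustration of floating-point non-associativity (a toy example, not the HCM algorithm itself):

```python
import torch

torch.manual_seed(0)
vals = torch.randn(1024, dtype=torch.float32)

# Two different reduction orders over the same values.
left_to_right = torch.zeros((), dtype=torch.float32)
for v in vals:
    left_to_right += v                        # strictly sequential order
pairwise = vals.view(-1, 2).sum(dim=1).sum()  # tree-like order

# The low bits can differ; if different ranks reduce in different orders,
# their all-reduce outputs disagree bitwise.
print(left_to_right.item() == pairwise.item())  # often False
```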

Pull Request resolved: #137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
…ating bfloat16 with multimem.ld_reduce (#137529)

This provides better accuracy without additional cost.

Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats.
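
A rough simulation in PyTorch of the effect this change targets (not the actual PTX path): summing bf16 values with a bf16 accumulator rounds away low bits at every step, while accumulating in fp32 and rounding once at the end stays within about one bf16 rounding step of the reference:

```python
import torch

torch.manual_seed(0)
x = torch.rand(8, 4096).bfloat16()  # contributions from 8 "ranks"
ref = x.float().sum(dim=0)          # fp32 reference

acc_bf16 = x[0].clone()
for i in range(1, 8):
    acc_bf16 = acc_bf16 + x[i]      # rounds to bf16 after every add

acc_fp32 = x.float().sum(dim=0).bfloat16()  # accumulate in fp32, round once

print((acc_bf16.float() - ref).abs().max())  # larger error
print((acc_fp32.float() - ref).abs().max())  # ~one bf16 rounding step
```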
Pull Request resolved: #137529
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
- Previously, detection would fail if it ran before the user called APIs such as `torch.cuda.set_device()`, because the detection logic requires NVML initialization. This PR adds explicit NVML initialization (which is idempotent).
- Previously, any NVML error in the detection logic resulted in a fatal error. Now we issue an informative warning and return a topology that assumes no NVLink connectivity (see the sketch below).
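
An illustrative Python sketch of the new behavior (the real detector is C++; `pynvml` and the helper below stand in for the NVML C API and are hypothetical):

```python
import warnings

import pynvml

def detect_nvlink_matrix(num_devices: int) -> list[list[int]]:
    """Return an NVLink connectivity matrix; all zeros if NVML is unavailable."""
    matrix = [[0] * num_devices for _ in range(num_devices)]
    try:
        # Explicit, idempotent init: NVML init is ref-counted, so calling it
        # again when something else already initialized NVML is harmless.
        pynvml.nvmlInit()
    except pynvml.NVMLError as e:
        # Warn and degrade gracefully instead of raising a fatal error.
        warnings.warn(f"NVML init failed ({e}); assuming no NVLink connectivity")
        return matrix
    # ... per-device NVLink queries would fill `matrix` here ...
    return matrix
```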

Pull Request resolved: #137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
github-actions bot deleted the gh/yifuwang/133/head branch November 9, 2024 02:02
yifuwang pushed a commit to yifuwang/pytorch that referenced this pull request Feb 22, 2025

Labels

Merged · oncall: distributed · release notes: distributed (c10d)
