Conversation

yifuwang (Collaborator) commented Oct 8, 2024

pytorch-bot bot commented Oct 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137471

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit dd059f5 with merge base 9b2e453:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@yifuwang yifuwang requested review from Chillee and weifengpy October 8, 2024 19:01
@yifuwang yifuwang marked this pull request as ready for review October 8, 2024 19:01
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
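A minimal usage sketch (hypothetical; the allocation helpers `symm_mem.empty`/`symm_mem.rendezvous` and the exact op signature are assumptions that may differ across PyTorch versions):

```python
# Launch with e.g.: torchrun --nproc-per-node=8 demo.py
import os

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem  # assumed helper module

rank = int(os.environ["RANK"])
torch.cuda.set_device(rank)
dist.init_process_group("nccl")
group_name = dist.group.WORLD.group_name

# Allocate the input from the symmetric-memory allocator and rendezvous so
# every rank maps its peers' buffers.
inp = symm_mem.empty(4096, dtype=torch.bfloat16, device="cuda")
symm_mem.rendezvous(inp, group_name)

inp.fill_(rank)
# One-shot: every rank pulls all peers' data and reduces locally.
out = torch.ops.symm_mem.one_shot_all_reduce(inp, "sum", group_name)
assert out.eq(sum(range(dist.get_world_size()))).all()
dist.destroy_process_group()
```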
Pull Request resolved: #137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with it.
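A hedged sketch of the in-place variant (signature assumed), continuing the setup from the sketch above. Unlike the one-shot op, which returns a new tensor, the two-shot op (trailing underscore) mutates its symmetric-memory input:

```python
# Two-shot: reduce-scatter across peers, then each rank gathers the
# reduced shards back into its own buffer, so the result lands in place.
buf = symm_mem.empty(4096, dtype=torch.bfloat16, device="cuda")
symm_mem.rendezvous(buf, group_name)
buf.fill_(rank)
torch.ops.symm_mem.two_shot_all_reduce_(buf, "sum", group_name)
```

Two-shot moves each element twice but splits the reduction work across ranks, which typically wins over one-shot at larger message sizes.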

Pull Request resolved: #137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
…reduce (#137474)

## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
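A sketch of how the out-variant might be called (signature assumed; reusing `buf` and `group_name` from the sketches above). The `multimem` ops rely on hardware multicast (NVLS), so the input must come from a symmetric-memory allocation with a multicast address:

```python
# Out-variant: the result lands in a caller-provided regular tensor,
# which suits an integration that owns its own output allocation.
out = torch.empty(4096, dtype=torch.bfloat16, device="cuda")
torch.ops.symm_mem.multimem_one_shot_all_reduce_out(buf, "sum", group_name, out)
```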
Pull Request resolved: #137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
…m ops (#137475)

## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now); due to its accumulation order, we can't guarantee numerical consistency across all ranks (see the sketch after this list).
- Removes the `IntraNodeComm` Python binding (its original purpose has been superseded by `SymmetricMemory`).
- Removes the methods that existed only for the Python binding.
- Replaces the NVLink detection logic with `DMAConnectivityDetector`.
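
To see why a fixed accumulation order matters, a small illustration of floating-point non-associativity (a toy example, not the HCM algorithm itself):

```python
import torch

torch.manual_seed(0)
vals = torch.randn(1024, dtype=torch.float32)

# Two different reduction orders over the same values.
left_to_right = torch.zeros((), dtype=torch.float32)
for v in vals:
    left_to_right += v                        # strictly sequential order
pairwise = vals.view(-1, 2).sum(dim=1).sum()  # tree-like order

# The low bits can differ; if different ranks reduce in different orders,
# their all-reduce outputs disagree bitwise.
print(left_to_right.item() == pairwise.item())  # often False
```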

Pull Request resolved: #137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
…ating bfloat16 with multimem.ld_reduce (#137529)

This provides better accuracy without additional cost.

Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats.
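
A rough simulation in PyTorch of the effect this change targets (not the actual PTX path): summing bf16 values with a bf16 accumulator rounds away low bits at every step, while accumulating in fp32 and rounding once at the end stays within about one bf16 rounding step of the reference:

```python
import torch

torch.manual_seed(0)
x = torch.rand(8, 4096).bfloat16()  # contributions from 8 "ranks"
ref = x.float().sum(dim=0)          # fp32 reference

acc_bf16 = x[0].clone()
for i in range(1, 8):
    acc_bf16 = acc_bf16 + x[i]      # rounds to bf16 after every add

acc_fp32 = x.float().sum(dim=0).bfloat16()  # accumulate in fp32, round once

print((acc_bf16.float() - ref).abs().max())  # larger error
print((acc_fp32.float() - ref).abs().max())  # ~one bf16 rounding step
```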
Pull Request resolved: #137529
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
- Previously, detection would fail if it ran before the user called APIs such as `torch.cuda.set_device()`, because the detection logic requires NVML initialization. This PR adds explicit NVML initialization (which is idempotent).
- Previously, any NVML error in the detection logic resulted in a fatal error. Now we issue an informative warning and return a topology that assumes no NVLink connectivity (see the sketch below).
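
An illustrative Python sketch of the new behavior (the real detector is C++; `pynvml` and the helper below stand in for the NVML C API and are hypothetical):

```python
import warnings

import pynvml

def detect_nvlink_matrix(num_devices: int) -> list[list[int]]:
    """Return an NVLink connectivity matrix; all zeros if NVML is unavailable."""
    matrix = [[0] * num_devices for _ in range(num_devices)]
    try:
        # Explicit, idempotent init: NVML init is ref-counted, so calling it
        # again when something else already initialized NVML is harmless.
        pynvml.nvmlInit()
    except pynvml.NVMLError as e:
        # Warn and degrade gracefully instead of raising a fatal error.
        warnings.warn(f"NVML init failed ({e}); assuming no NVLink connectivity")
        return matrix
    # ... per-device NVLink queries would fill `matrix` here ...
    return matrix
```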

Pull Request resolved: #137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
github-actions bot deleted the gh/yifuwang/133/head branch November 9, 2024 02:02
yifuwang pushed a commit to yifuwang/pytorch that referenced this pull request Feb 22, 2025

Labels

Merged · oncall: distributed · release notes: distributed (c10d)
