Skip to content

Conversation

@yifuwang
Copy link
Collaborator

@yifuwang yifuwang commented Oct 8, 2024

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137473

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f90ee56 with merge base 9b2e453 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@yifuwang yifuwang requested review from Chillee and weifengpy October 8, 2024 19:03
@yifuwang
Copy link
Collaborator Author

yifuwang commented Oct 8, 2024

This Stack

Implement custom all-reduce algos available in IntraNodeComm as symm_mem ops and replace the existing IntraNodeComm kernels with them.

This PR

Implement symm_mem::two_shot_all_reduce_. Later we'll replace the two-shot all-reduce in IntraNodeComm with these.

@yifuwang yifuwang marked this pull request as ready for review October 8, 2024 19:04
@yifuwang
Copy link
Collaborator Author

yifuwang commented Oct 8, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 8, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
…reduce (#137474)

## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
Pull Request resolved: #137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
…m ops (#137475)

## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` python binding (its original purpose is superceded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.

Pull Request resolved: #137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
…ating bfloat16 with multimem.ld_reduce (#137529)

This provides better accuracy without additional cost.

Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats.
Pull Request resolved: #137529
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
- Previously the detection would fail before user calling APIs such as `torch.cuda.set_device()`. This is because the detection logic requires nvml initialization. In this PR, we added explicit nvml initialization (which idempotent).
- Previously any nvml issue occurred in the detection logic would result in fatal error. Now we issue an informative warning and return a topology assuming no NVLink connectivity.

Pull Request resolved: #137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
KnAwnime pushed a commit to KnAwnime/Biblioteka that referenced this pull request Oct 16, 2024
@github-actions github-actions bot deleted the gh/yifuwang/135/head branch November 9, 2024 02:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants