Use same NVSHMEM version across CUDA builds #162206
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162206
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 194cd4e with merge base 1f0b01d:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@tinglvv Does this PR make sense? Can you please review? Thanks!
Thanks! Looks good.
@pytorchbot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
Perfect, I was planning on doing this anyway: pytorch/.ci/docker/common/install_cuda.sh Line 13 in b04e922
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_12-cuda12_8-test / test
@atalman We need some new S3 uploads for nvidia wheels
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed. Reason: Command
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well. Pull Request resolved: #162206 Approved by: https://github.com/tinglvv, https://github.com/Skylion007 ghstack-source-id: d5589c4
There is a land conflict which leads to mismatched yml generation.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed. Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable)
@kwen2501 A new NVSHMEM release just dropped on PyPI that can use IBGDA on more devices. Should we upgrade it across the board?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
@Skylion007 Thanks!
Let's do it
pytorch#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20. This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well. Pull Request resolved: pytorch#162206 Approved by: https://github.com/tinglvv, https://github.com/Skylion007
This reverts commit 0d9c95c. Reverted pytorch#162206 on behalf of https://github.com/malfet due to Broke lint, see https://hud.pytorch.org/hud/pytorch/pytorch/4dd73e659a8fd4872e5f49cfd72e420fa7c4e6c9/1?per_page=50&name_filter=workflow-checks ([comment](pytorch#162206 (comment)))
Stack from ghstack (oldest at bottom):
#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
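The mismatch described above can be illustrated with a small consistency check. This is only a sketch: the pin tables and the `mismatched_pins` helper are hypothetical names made up for illustration and are not part of the PyTorch build scripts. Only the version numbers come from the PR description.

```python
# Hypothetical pin tables: CUDA major version -> pinned NVSHMEM version.
# The values mirror the PR description; the structure is illustrative only.
NVSHMEM_PINS_BEFORE = {"12": "3.3.20", "13": "3.3.24"}
NVSHMEM_PINS_AFTER = {"12": "3.3.24", "13": "3.3.24"}


def mismatched_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return the entries whose version differs from the newest pinned version."""
    if not pins:
        return {}
    # Compare versions numerically, component by component.
    newest = max(pins.values(), key=lambda v: tuple(int(p) for p in v.split(".")))
    return {cuda: ver for cuda, ver in pins.items() if ver != newest}


if __name__ == "__main__":
    print(mismatched_pins(NVSHMEM_PINS_BEFORE))
    print(mismatched_pins(NVSHMEM_PINS_AFTER))
```

Under these assumptions, the check flags the CUDA 12 entry before this PR and nothing afterwards.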