[ROCm] Update to AOTriton 0.8b #140172
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140172
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures, 9 Unrelated Failures
As of commit 4b2c0e6 with merge base 920e436:
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorchbot label "ciflow/rocm" "keep-going" "rocm" "topic: not user facing"

Can't add following labels to PR: ciflow/rocm. Please ping one of the reviewers for help.

@jithunnair-amd @jataylo can you approve the ROCm CI tag?

@jithunnair-amd @jeffdaily @pruthvistony

"lib/*.lib",
]
)
aotriton_image_path = os.path.join(lib_path, "aotriton.images")
In 0.8 we moved the GPU kernels into separate files to counter the bloating size of the shared object file.
This is the subdirectory where AOTriton puts all GPU kernels; it must be in the same directory as libaotriton.so.
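
A minimal sketch of what this layout implies for packaging, assuming a hypothetical helper named `collect_aotriton_images` and a `lib` directory that holds both `libaotriton.so` and the `aotriton.images` subdirectory. It is not the PR's actual build change; it only illustrates gathering every file under `aotriton.images` as paths relative to the package root, so the kernels ship next to the shared object.

```python
import os


def collect_aotriton_images(package_root, lib_subdir="lib"):
    """Return sorted paths, relative to package_root, of every file under
    <package_root>/<lib_subdir>/aotriton.images.

    Hypothetical helper for illustration only. The directory name
    "aotriton.images" comes from the PR; the function name, arguments,
    and return format are assumptions.
    """
    lib_path = os.path.join(package_root, lib_subdir)
    aotriton_image_path = os.path.join(lib_path, "aotriton.images")
    collected = []
    # os.walk yields nothing if the directory does not exist, so a build
    # without AOTriton simply produces an empty list.
    for dirpath, _dirnames, filenames in os.walk(aotriton_image_path):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            collected.append(os.path.relpath(full_path, package_root))
    return sorted(collected)
```

The returned relative paths could then be appended to whatever file list the wheel build already uses for `lib/*.so` style patterns.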
@malfet, please take a look at this change especially. AOTriton now has additional files that need to be packaged with the PyTorch installation.
@jithunnair-amd I changed the runner to 12xlarge temporarily to make sure the docker image build can complete within 100 minutes.

@pytorchbot label "ciflow/inductor-rocm"
All CUDA+ROCm tests passed; moving out of draft status. Note the ROCm inductor tests hit
// On ROCM, ME and FA share the backend, and hence they share the checking
// function for fundamental limitations by the GPU kernel
// caller_is_meff is added to make the TORCH_WARN message showing the correct result
template<bool caller_is_meff = false>
@malfet, please look at this change especially, since this function is shared by CUDA and ROCm. The changes here maintain the same functionality for CUDA.

@xinyazhang This CI failure seems related to your PR: https://github.com/pytorch/pytorch/actions/runs/12189422770/job/34006369193

I noticed this problem before, but felt it was more of a compiler problem. Possibly related clang issues:
Update: okay, should be fixed by 4b2c0e6
@pytorchbot merge -f "Unrelated CI failures"

Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team

Notable new features for SDPA operators on AMD systems from AOTriton 0.8b:
1. NestedTensor support;
2. MQA/GQA support;
3. Restore Efficient attention support for causal=True and seqlen_q != seqlen_k cases;
   + The kernel should use top-left alignment; bottom-right alignment will be added later
4. Move gfx1100 (RX7900/W7800/W7900) out of experimental support status.
   However, users are strongly recommended to update to ROCm 6.2.4, notably for its firmware updates.
Related unit tests are enabled as well.
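
To make the alignment note in item 3 concrete, here is a small pure-Python sketch of the two causal mask conventions when seqlen_q != seqlen_k. The function name `causal_mask` and its `alignment` argument are illustrative assumptions and not part of AOTriton or PyTorch; the sketch only shows which key positions each query row may attend to under each convention.

```python
def causal_mask(seqlen_q, seqlen_k, alignment="top-left"):
    """Build a boolean causal mask; True means query i may attend to key j.

    Illustrative helper, not an AOTriton or PyTorch API.
    - "top-left": the diagonal starts at element (0, 0).
    - "bottom-right": the diagonal ends at element (seqlen_q-1, seqlen_k-1).
    """
    if alignment == "top-left":
        offset = 0
    elif alignment == "bottom-right":
        offset = seqlen_k - seqlen_q
    else:
        raise ValueError(f"unknown alignment: {alignment!r}")
    return [
        [j <= i + offset for j in range(seqlen_k)]
        for i in range(seqlen_q)
    ]


if __name__ == "__main__":
    # Print both masks for 2 queries and 4 keys, one row per query.
    for name in ("top-left", "bottom-right"):
        print(name)
        for row in causal_mask(2, 4, name):
            print("  " + " ".join("1" if allowed else "0" for allowed in row))
```

With seqlen_q == seqlen_k the two conventions coincide, which is why the distinction only matters for the cases restored in item 3.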
Notable related changes from AOTriton 0.8b:
1. AOTriton 0.8b moves the GPU kernels out of libaotriton.so into a separate directory `aotriton.images`;
2. LZMA replaces ZSTD as the GPU kernel compression algorithm for a better compression ratio: aotriton0.8b (.so + aotriton.images) takes 350MB, compared to 800MB for the aotriton0.7b .so;
3. Compression can no longer be disabled, and `liblzma` is a hard run-time dependency.
   + Should not be a problem, since `lzma` is part of the Python Standard Library
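
As a rough illustration of items 2 and 3, the sketch below uses Python's standard `lzma` module (which is built on liblzma) to round-trip a byte blob, and computes the size reduction implied by the 800MB and 350MB figures above. The helper names are hypothetical, and the container format and compression settings AOTriton actually uses are not stated here, so this is not a reader for `aotriton.images` files.

```python
import lzma


def lzma_roundtrip(blob):
    """Compress a bytes object with LZMA and decompress it again.

    Uses the module defaults (.xz container); this is an assumption for
    illustration, not AOTriton's on-disk format.
    """
    compressed = lzma.compress(blob)
    restored = lzma.decompress(compressed)
    return compressed, restored


def size_reduction(old_mb, new_mb):
    """Fraction of the old size that was saved, e.g. 0.25 means 25% smaller."""
    return 1.0 - new_mb / old_mb


if __name__ == "__main__":
    payload = b"placeholder kernel bytes " * 4096
    compressed, restored = lzma_roundtrip(payload)
    print("round-trip ok:", restored == payload)
    print("compressed bytes:", len(compressed), "of", len(payload))
    # Figures taken from the list above: 800MB (0.7b) vs 350MB (0.8b).
    print("reported reduction: {:.1%}".format(size_reduction(800, 350)))
```

If `import lzma` fails on a given interpreter, that Python build was compiled without liblzma, which is one quick way to check the environment.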
Pull Request resolved: #140172
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Co-authored-by: Jithun Nair <[email protected]>

Hi @xinyazhang, thanks for the great work!

It should be part of PyTorch 2.6

This is backported from upstream PR pytorch#140172, pytorch#137443 and pytorch#139432.

This PR also updates the wheel building logic to include the `aotriton.images` directory and files: fea8b5a

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov