
Conversation

@xinyazhang
Collaborator

@xinyazhang xinyazhang commented Nov 8, 2024

Notable new features for SDPA operators on AMD systems from AOTriton 0.8b:

  1. Nestedtensor support;
  2. MQA/GQA support;
  3. Restore Efficient Attention support for the causal=True and seqlen_q != seqlen_k cases;
    • The kernel currently uses top-left alignment; bottom-right alignment will be added later.
  4. Move gfx1100 (RX7900/W7800/W7900) out of experimental support status.
    However, users are strongly encouraged to update to ROCm 6.2.4, notably for
    its firmware updates.
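The alignment caveat in item 3 only matters when seqlen_q != seqlen_k (for square attention the two conventions coincide). A minimal NumPy sketch of the two causal-mask conventions (illustrative only, not AOTriton's kernel code):

```python
import numpy as np

def causal_mask(seqlen_q: int, seqlen_k: int, align: str = "top-left") -> np.ndarray:
    """Boolean mask: True where query i may attend to key j."""
    i = np.arange(seqlen_q)[:, None]
    j = np.arange(seqlen_k)[None, :]
    if align == "top-left":
        # Diagonal starts at the top-left corner: query i sees keys 0..i.
        return j <= i
    # Bottom-right: diagonal ends at the bottom-right corner, so every
    # query additionally sees the first (seqlen_k - seqlen_q) keys.
    return j <= i + (seqlen_k - seqlen_q)

# With seqlen_q=2, seqlen_k=4 the two conventions differ:
print(causal_mask(2, 4, "top-left").astype(int))
# [[1 0 0 0]
#  [1 1 0 0]]
print(causal_mask(2, 4, "bottom-right").astype(int))
# [[1 1 1 0]
#  [1 1 1 1]]
```

For seqlen_q == seqlen_k both calls produce the same lower-triangular mask.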

Related unit tests are enabled as well.

Notable related changes from AOTriton 0.8b:

  1. AOTriton 0.8b moves the GPU kernels out of libaotriton.so into a separate directory, aotriton.images;
  2. LZMA replaces ZSTD as the GPU kernel compression algorithm for a better compression ratio: AOTriton 0.8b (.so + aotriton.images) takes 350 MB, compared to 800 MB for the AOTriton 0.7b .so;
  3. Compression can no longer be disabled, and liblzma is now a hard run-time dependency.
    • This should not be a problem, since lzma is part of the Python standard library.
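The standard-library point is easy to verify: Python can round-trip LZMA (.xz) data with no third-party package. A minimal sketch (unrelated to AOTriton's internal on-disk format):

```python
import lzma

# Compress and decompress a payload with the stdlib lzma module;
# no third-party package is required.
payload = b"GPU kernel image bytes" * 1000
compressed = lzma.compress(payload)
restored = lzma.decompress(compressed)

assert restored == payload
# The repetitive payload compresses far below its original size.
print(f"{len(payload)} -> {len(compressed)} bytes")
```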

This PR also updates the wheel-building logic to include the aotriton.images directory and its files: fea8b5a

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Nov 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140172

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 9 Unrelated Failures

As of commit 4b2c0e6 with merge base 920e436 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch release notes: releng release notes category labels Nov 8, 2024
@xinyazhang
Collaborator Author

@pytorchbot label "ciflow/rocm" "keep-going" "rocm" "topic: not user facing"

@pytorch-bot

pytorch-bot bot commented Nov 8, 2024

Can't add following labels to PR: ciflow/rocm. Please ping one of the reviewers for help.

@xinyazhang
Collaborator Author

@jithunnair-amd @jataylo can you approve the ROCm CI tag?

@jithunnair-amd jithunnair-amd added ciflow/rocm Trigger "default" config CI on ROCm release notes: rocm mandatorylabel labels Nov 11, 2024
@xinyazhang xinyazhang force-pushed the xinyazhang/sdpa-nestedtensor branch from 04a6184 to eac8c49 Compare November 16, 2024 06:02
@xinyazhang xinyazhang force-pushed the xinyazhang/sdpa-nestedtensor branch from eac8c49 to 2e44237 Compare November 25, 2024 22:41
@xinyazhang xinyazhang changed the title [ROCm] Support Nestedtensor in SDPA operators [ROCm] Bump to AOTriton 0.8b Nov 26, 2024
@xinyazhang xinyazhang changed the title [ROCm] Bump to AOTriton 0.8b [ROCm] Update to AOTriton 0.8b Nov 26, 2024
@xinyazhang xinyazhang force-pushed the xinyazhang/sdpa-nestedtensor branch from 74e349e to 8e72a1f Compare November 26, 2024 20:46
@xinyazhang
Collaborator Author

@jithunnair-amd @jeffdaily @pruthvistony
We can move gfx1100 out of experimental status, since our experiments have shown its performance advantage on ROCm 6.2.4:
image

@xinyazhang xinyazhang force-pushed the xinyazhang/sdpa-nestedtensor branch from c997bfb to 9abbe31 Compare November 26, 2024 21:11
"lib/*.lib",
]
)
aotriton_image_path = os.path.join(lib_path, "aotriton.images")
Collaborator Author

In 0.8 we moved the GPU kernels into separate files to counter the bloating size of the shared object file.
This is the subdirectory where AOTriton puts all GPU kernels; it must be in the same directory as libaotriton.so.
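The co-location requirement can be sketched with a hypothetical helper (not the actual AOTriton loader code): given the path of libaotriton.so, the kernel-image directory resolves to a sibling entry.

```python
import os

def aotriton_images_dir(libaotriton_path: str) -> str:
    """Hypothetical helper: resolve the kernel-image directory, which
    must sit in the same directory as libaotriton.so."""
    return os.path.join(os.path.dirname(libaotriton_path), "aotriton.images")

print(aotriton_images_dir("/opt/rocm/lib/libaotriton.so"))
# -> /opt/rocm/lib/aotriton.images
```

This is also why the wheel-packaging change matters: if aotriton.images is not shipped alongside the shared object, the kernels cannot be found at run time.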

Collaborator

@malfet to take a look at this change especially. We have additional files in aotriton that would need to be packaged with the pytorch installation

@xinyazhang xinyazhang force-pushed the xinyazhang/sdpa-nestedtensor branch from f6b7c1a to e0b9459 Compare November 27, 2024 09:00
@xinyazhang
Collaborator Author

@jithunnair-amd I changed the runner to 12xlarge temporarily to make sure the docker image build can complete within 100 minutes.

@xinyazhang
Collaborator Author

@pytorchbot label "ciflow/inductor-rocm"

@pytorch-bot pytorch-bot bot added the ciflow/inductor-rocm Trigger "inductor" config CI on ROCm label Dec 3, 2024
@xinyazhang xinyazhang force-pushed the xinyazhang/sdpa-nestedtensor branch from 361f296 to f67538e Compare December 3, 2024 18:34
@xinyazhang xinyazhang marked this pull request as ready for review December 3, 2024 22:40
@xinyazhang
Collaborator Author

xinyazhang commented Dec 3, 2024

https://hud.pytorch.org/pytorch/pytorch/pull/140172?sha=f67538eda2ff51ded484df1e08b3c17df3224a9d

All CUDA+ROCm tests passed; moving out of draft status.

Note that the ROCm inductor tests randomly hit `Secret source: None` errors and are thus not very reliable.

@jithunnair-amd jithunnair-amd requested a review from malfet December 3, 2024 23:26
// On ROCm, ME and FA share the same backend, and hence they share the check
// for fundamental limitations imposed by the GPU kernel.
// caller_is_meff is added so the TORCH_WARN message reports the correct backend.
template<bool caller_is_meff = false>
Collaborator

@malfet to look at this change especially, since this is a function commonly used by both CUDA and ROCm; the changes here maintain the same functionality for CUDA.

@xinyazhang xinyazhang marked this pull request as ready for review December 6, 2024 05:35
@jithunnair-amd
Collaborator

@xinyazhang This CI failure seems related to your PR: https://github.com/pytorch/pytorch/actions/runs/12189422770/job/34006369193
Looking at the history of failures for this build, all the failures come only from CI runs on this PR.

@xinyazhang
Collaborator Author

xinyazhang commented Dec 6, 2024

@netlify

netlify bot commented Dec 6, 2024

Deploy Preview for chimerical-cranachan-793287 ready!

Name Link
🔨 Latest commit 4b2c0e6
🔍 Latest deploy log https://app.netlify.com/sites/chimerical-cranachan-793287/deploys/6752b01f4f67a30008f3d1e5
😎 Deploy Preview https://deploy-preview-140172--chimerical-cranachan-793287.netlify.app

@jithunnair-amd
Collaborator

@pytorchbot merge -f "Unrelated CI failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort, and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorch-bot bot pushed a commit that referenced this pull request Dec 9, 2024
Pull Request resolved: #140172
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <[email protected]>
@austin362667

Hi @xinyazhang, thanks for the great work!
Could I kindly ask which PyTorch version this fix is expected to be released in? Will it be 2.6.0 or a later version? cc @jataylo

@xinyazhang
Collaborator Author

Could I kindly ask which PyTorch version this fix is expected to be released in? Will it be 2.6.0 or a later version? cc @jataylo

It should be part of PyTorch 2.6.

@jithunnair-amd jithunnair-amd deleted the xinyazhang/sdpa-nestedtensor branch January 1, 2025 01:20
xinyazhang added a commit to ROCm/pytorch that referenced this pull request Jan 13, 2025
This is backported from upstream PR pytorch#140172, pytorch#137443 and pytorch#139432.

pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Jan 16, 2025
This is backported from upstream PR pytorch#140172, pytorch#137443 and pytorch#139432.
@jithunnair-amd jithunnair-amd restored the xinyazhang/sdpa-nestedtensor branch February 21, 2025 15:43
@jithunnair-amd jithunnair-amd deleted the xinyazhang/sdpa-nestedtensor branch February 21, 2025 15:46