[inductor][cpp] Add FlexAttention support for CPU inference #141453
Conversation
fix arg mapping; remove assert; support other buffer
fix kv block size
add SKIP_MASK_SCORE into kernel_options and update template code
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141453
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (5 Unrelated Failures) As of commit 9e3de2d with merge base f870ee2. FLAKY: the following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
leslie-fang-intel left a comment:
Thanks for the PR. Please add the corresponding UTs.
Sure, we will enable CPU path tests, in …
Please seek CI approval before scheduling CIFlow labels.
@pytorchbot revert -m "This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
@jianan-gu your PR has been successfully reverted.
…141453)" This reverts commit db379ed. Reverted #141453 on behalf of https://github.com/malfet due to This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794 ([comment](#141453 (comment)))
Hi @malfet, thanks for pointing this out. We have refined this PR with a proper check, and it now passes the latest CI for MacOS.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ytorch#141453)" This reverts commit 7edbde3. Reverted pytorch#141453 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing periodic NO_AVX2 ([comment](pytorch#141453 (comment)))
[inductor][cpp] Add FlexAttention support for CPU inference (pytorch#141453) (full description below). Pull Request resolved: pytorch#141453 Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel Co-authored-by: Wu, Chunyuan <[email protected]>
…ytorch#141453)" This reverts commit db379ed. Reverted pytorch#141453 on behalf of https://github.com/malfet due to This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794 ([comment](pytorch#141453 (comment)))
This PR extends and refines all the remaining UTs for CPU and more devices in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, as a follow-up to #141453. Pull Request resolved: #144953 Approved by: https://github.com/drisspg
This PR adds FlexAttention inference support for the Inductor backend of torch.compile on CPUs (supported precisions: bf16 and fp32).
Building on the existing CPP template, it implements a FlexAttention CPP template that supports a broad range of attention variants while delivering optimized performance on CPUs.
With this change, users can transparently extend their FlexAttention usage to CPUs through torch.compile, with consistent support in both functionality and performance.
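As a rough sketch of what this enables (not taken from the PR; the causal `score_mod`, tensor shapes, and sequence length below are illustrative assumptions), FlexAttention inference on CPU uses the same `torch.compile` entry point as on GPU, and Inductor lowers it through the CPP template added here:

```python
# Minimal sketch: FlexAttention inference on CPU via torch.compile.
# The causal score_mod and the tensor shapes are illustrative assumptions,
# not taken from this PR.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Keep the score when the query attends to current or past positions;
    # otherwise mask it out with -inf before the softmax.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# bf16 inference on CPU (bf16 and fp32 are the precisions covered by this PR).
q, k, v = (torch.randn(1, 8, 1024, 64, dtype=torch.bfloat16) for _ in range(3))

compiled_flex_attention = torch.compile(flex_attention)
with torch.no_grad():
    out = compiled_flex_attention(q, k, v, score_mod=causal)
```

Other attention variants are expressed the same way, by swapping in a different `score_mod`.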
For unit tests, this PR enables a subset of the critical tests for CPUs (inference only):
```
pytest test/inductor/test_flex_attention.py

`TestFlexAttention`
# common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

# test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
# test function:
test_paged_builtin_score_mods
```
The remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py` involve a larger change (1500+ LOC) that would make this PR hard to review; they will be enabled and refactored for the CPU device in a separate PR. More optimizations are also planned in follow-up PRs, including:
- Block sparse computation
- Flash decoding tuning
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov