[inductor][cpp] Add FlexAttention support for CPU inference #141453
Conversation
fix arg mapping; remove assert; support other buffer
fix kv block size
add SKIP_MASK_SCORE into kernel_options and update template code
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141453
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (5 Unrelated Failures) As of commit 9e3de2d with merge base f870ee2. FLAKY: the following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
leslie-fang-intel left a comment:
Thanks for the PR. Please add the corresponding UTs.
Sure, we will enable CPU path tests, in …
Please seek CI approval before scheduling CIFlow labels.
@pytorchbot revert -m "This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
@jianan-gu your PR has been successfully reverted.
…141453)" This reverts commit db379ed. Reverted #141453 on behalf of https://github.com/malfet due to This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794 ([comment](#141453 (comment)))
Hi @malfet, thanks for pointing this out. We have refined this PR with a proper check, and it now passes the latest CI for MacOS.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ytorch#141453)" This reverts commit 7edbde3. Reverted pytorch#141453 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing periodic NO_AVX2 ([comment](pytorch#141453 (comment)))
[inductor][cpp] Add FlexAttention support for CPU inference (pytorch#141453) (full description below). Pull Request resolved: pytorch#141453 Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel Co-authored-by: Wu, Chunyuan <[email protected]>
…ytorch#141453)" This reverts commit db379ed. Reverted pytorch#141453 on behalf of https://github.com/malfet due to This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794 ([comment](pytorch#141453 (comment)))
This PR extends and refines all the remaining UTs for CPU and more devices in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, as a follow-up to #141453. Pull Request resolved: #144953 Approved by: https://github.com/drisspg
This PR adds FlexAttention inference support for the Inductor backend of torch.compile on CPUs (supported precisions: bf16 and fp32).
Building on the existing CPP template, it implements a FlexAttention CPP template that supports a broad range of attention variants while delivering optimized performance on CPUs.
With this change, users can transparently extend their FlexAttention usage to CPUs through torch.compile, with consistent support in both functionality and performance.
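As a rough sketch of what this enables (not taken from the PR; the causal `score_mod`, tensor shapes, and sequence length below are illustrative assumptions), FlexAttention inference on CPU uses the same `torch.compile` entry point as on GPU, and Inductor lowers it through the CPP template added here:

```python
# Minimal sketch: FlexAttention inference on CPU via torch.compile.
# The causal score_mod and the tensor shapes are illustrative assumptions,
# not taken from this PR.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Keep the score when the query attends to current or past positions;
    # otherwise mask it out with -inf before the softmax.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# bf16 inference on CPU (bf16 and fp32 are the precisions covered by this PR).
q, k, v = (torch.randn(1, 8, 1024, 64, dtype=torch.bfloat16) for _ in range(3))

compiled_flex_attention = torch.compile(flex_attention)
with torch.no_grad():
    out = compiled_flex_attention(q, k, v, score_mod=causal)
```

Other attention variants are expressed the same way, by swapping in a different `score_mod`.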
For unit tests, this PR enables a subset of the critical tests for CPUs (inference only):
```
pytest test/inductor/test_flex_attention.py

`TestFlexAttention`
# common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

# test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
# test function:
test_paged_builtin_score_mods
```
The remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py` involve a larger change (1500+ LOC) that would make this PR hard to review; they will be enabled and refactored for the CPU device in a separate PR. More optimizations are also planned in follow-up PRs, including:
- Block sparse computation
- Flash decoding tuning
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov