
Conversation

@jianan-gu (Contributor) commented Nov 24, 2024

This PR adds FlexAttention inference support for the inductor backend of torch.compile on CPUs (supported precisions: bf16 and fp32).

Building on the existing CPP template, this PR implements a FlexAttention CPP template that supports a broad range of attention variants while delivering optimized performance on CPUs.

With this, users can transparently extend their FlexAttention usage to CPUs through torch.compile, with consistent functionality and performance.
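As a minimal illustrative sketch (not taken from this PR; the score_mod, shapes, and dtype are placeholder assumptions), the usage this enables looks roughly like:

```python
# Minimal sketch: FlexAttention compiled with the inductor backend on CPU,
# using bf16, one of the supported inference precisions. Shapes and the
# score_mod are illustrative assumptions, not code from this PR.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal_score_mod(score, b, h, q_idx, kv_idx):
    # Mask out future positions; one example of the attention variants
    # the CPP template is meant to cover.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# (batch, heads, seq_len, head_dim) tensors on CPU in bf16.
q, k, v = (torch.randn(1, 8, 128, 64, dtype=torch.bfloat16) for _ in range(3))

compiled_flex = torch.compile(flex_attention, backend="inductor")
out = compiled_flex(q, k, v, score_mod=causal_score_mod)
```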

For unit tests, this PR enables a subset of the critical tests for CPUs (run as inference tests), listed below:

```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
# common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

# test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
# test function:
test_paged_builtin_score_mods
```

The remaining UTs in test/inductor/test_flex_attention.py and test/inductor/test_flex_decoding.py involve a much larger diff (1500+ LOC) that would make this PR hard to review, so enabling and refactoring them for the CPU device will be submitted in a separate PR.

Besides, more optimizations are planned in follow-up PRs, including:

  • Block sparse computation (see the sketch after this list for how block sparsity is expressed at the FlexAttention API level)
  • Flash decoding tuning
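For context on the block-sparse item, FlexAttention expresses sparsity through a BlockMask, which is what such an optimization would exploit on the CPU path. A hedged sketch of constructing one (illustrative only, not part of this PR):

```python
# Sketch: building a block mask for FlexAttention. Fully masked blocks are
# what a block-sparse CPU kernel could skip entirely. Shapes and the mask
# are illustrative assumptions.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=1024, KV_LEN=1024, device="cpu")
q, k, v = (torch.randn(1, 8, 1024, 64) for _ in range(3))
out = torch.compile(flex_attention)(q, k, v, block_mask=block_mask)
```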

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot bot commented Nov 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141453

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit 9e3de2d with merge base f870ee2:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@leslie-fang-intel (Collaborator) left a comment


Thanks for the PR. Please add the corresponding UTs.

@jianan-gu (Contributor, Author)

> Thanks for the PR. Please add the corresponding UTs.

Sure, we will enable the CPU-path tests in both test/inductor/test_flex_attention.py and test/inductor/test_flex_decoding.py.

@pytorch-bot bot commented Nov 29, 2024

Please seek CI approval before scheduling CIFlow labels

@malfet (Contributor) commented Dec 9, 2024

@pytorchbot revert -m "This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794" -c nosignal

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@jianan-gu your PR has been successfully reverted.

@leslie-fang-intel removed the ci-no-td (Do not run TD on this PR) label on Dec 10, 2024
@jianan-gu (Contributor, Author) commented Dec 10, 2024

Hi @malfet, thanks for pointing this out. We have refined this PR with a proper check, and it now passes the latest MacOS CI.
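For reference, a hypothetical sketch of the kind of capability guard implied here; the actual check added in the PR may differ:

```python
# Hypothetical guard, not the PR's actual code: only take the CPP
# FlexAttention path when the CPU build has oneDNN/MKLDNN support
# (the MacOS build that broke was compiled without it).
import torch

def cpu_flex_attention_supported() -> bool:
    return torch.backends.mkldnn.is_available()
```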

@leslie-fang-intel (Collaborator)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

bluenote10 pushed a commit to bluenote10/pytorch that referenced this pull request Dec 14, 2024
…141453)

This PR brings the FlexAttention inference support for the inductor backend in torch.compile (support precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, and meanwhile brings optimized performance on CPUs.

With this, users can transparently extend their Flex Attention usages to CPUs with good and common support from torch.compile, both functionality and performance.

For UT tests, in this PR, we include partial critical tests for CPUs as the following (conduct inference tests):
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the rest UTs in `test/inductor/test_flex_attention.py ` and `test/inductor/test_flex_decoding.py`, due to bigger lines of changes (1500+ LOC) that make this PR hard to review, will submit another PR specific for CPU device UTs enabling and refactor.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: pytorch#141453
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <[email protected]>
pytorchmergebot pushed a commit that referenced this pull request Apr 15, 2025
This PR extends and refines all remaining UTs for CPU and more devices in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, as a follow-up to #141453

Pull Request resolved: #144953
Approved by: https://github.com/drisspg
timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025

Labels

ciflow/inductor, ciflow/periodic, ciflow/trunk, Merged, module: cpu, module: inductor, open source, Reverted, topic: not user facing, triaged
