[inductor] [cpp] use non-temporal tile load for A #129455

chunyuan-w · 2024-06-25T06:46:06Z

Stack from ghstack (oldest at bottom):

Use the non-temporal tile load `_tile_stream_loadd` for A so that B stays in L1.
Verified with AMP on static and dynamic shapes on a CPU with AMX support: there is no obvious end-to-end performance gain, and no regression either. We expect a performance gain once #129348 (also in this ghstack) is added on top of this PR.
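
To illustrate why streaming A might help B stay resident, here is a toy cache model in Python. It is not the PR's code and does not model any real CPU: the fully associative LRU cache, the line counts, and the assumption that a non-temporal load never allocates a cache line are all simplifications chosen for illustration (on real hardware the non-temporal hint is implementation-dependent). `LRUCache` and `count_b_hits` are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative LRU cache that only tracks resident lines."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line, allocate=True):
        """Return True on a hit. On a miss, insert the line only if allocate is True."""
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        if allocate:
            self.lines[line] = None
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
        return False


def count_b_hits(stream_a, capacity=8, b_lines=6, a_lines_per_step=4, steps=10):
    """Count cache hits on B when A is either streamed or loaded normally.

    Assumption: a streamed (non-temporal) load of A never allocates a line.
    """
    cache = LRUCache(capacity)
    hits = 0
    for step in range(steps):
        # A: fresh lines every step, never reused afterwards.
        for i in range(a_lines_per_step):
            cache.access(("A", step, i), allocate=not stream_a)
        # B: the same lines are reused on every step.
        for j in range(b_lines):
            hits += cache.access(("B", j))
    return hits


if __name__ == "__main__":
    print("B hits, A streamed:  ", count_b_hits(stream_a=True))
    print("B hits, A allocating:", count_b_hits(stream_a=False))
```

With these toy parameters, B (6 lines) fits in the 8-line cache on its own, but B plus one step of A (4 lines) does not, so allocating A evicts B lines before they are reused. Whether the same effect shows up in a real GEMM depends on the block sizes and on how the hardware treats the hint.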

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

[ghstack-poisoned]

pytorch-bot · 2024-06-25T06:46:09Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129455

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 415b882 with merge base cb2bce9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.

**TODOs:**
- [ ] Collect benchmark data before and after this change

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]

ghstack-source-id: 0089b94
Pull Request resolved: #129455

jgong5

Share perf numbers?

Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.

**TODOs:**
- [ ] Collect benchmark data before and after this change

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]

ghstack-source-id: 040bbba
Pull Request resolved: #129455

Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.

## Performance data

Performance speedups with >=5% on BF16 AMP, with this PR vs. without this PR, measured on CPU with AMX support:

- Static shapes Single-threaded

  | Model Family | Model Name | Speedup |
  |--------------|------------|---------|
  | torchbench | timm_vision_transformer | 5% |
  | huggingface | MT5ForConditionalGeneration | 5% |
  | huggingface | MobileBertForMaskedLM | 5% |
  | timm_models | gmixer_24_224 | 5% |

No perf regressions.

TODO: collect benchmark for
- Static shapes Multi-threaded
- Dynamic shapes Single-threaded
- Dynamic shapes Multi-threaded

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]

chunyuan-w · 2024-07-10T08:08:19Z

> Share perf numbers?

Updated the performance status in the PR description:
Switching to the non-temporal load alone gives no obvious end-to-end performance gain (and no regression either). We expect more performance gain after tuning the cache blocking (#129348, also in this ghstack) on top of this PR.

Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.
Verified AMP static shapes and dynamic shapes on CPU with AMX support and no obvious performance boost (no regression either) at end-to-end level. We're expecting to get performance gain when adding #129348 (also in this ghstack) on top of this PR.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]

chunyuan-w · 2024-07-15T01:45:10Z

@pytorchbot merge

pytorchmergebot · 2024-07-15T01:48:09Z

Merge failed

Reason: This PR needs a `release notes:` label
If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team

Raised by workflow job

chunyuan-w · 2024-07-15T02:48:07Z

@pytorchbot merge

pytorchmergebot · 2024-07-15T02:49:45Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

ghstack-source-id: 65ff95d
Pull Request resolved: pytorch/pytorch#129455

Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.
Verified AMP static shapes and dynamic shapes on CPU with AMX support and no obvious performance boost (no regression either) at end-to-end level. We're expecting to get performance gain when adding pytorch#129348 (also in this ghstack) on top of this PR.

Pull Request resolved: pytorch#129455
Approved by: https://github.com/jgong5

[inductor] [cpp] use non-temporal tile load for A

c152817

[ghstack-poisoned]

chunyuan-w mentioned this pull request Jun 25, 2024

[inductor] [cpp] improve cache blocking with CPU info #129348

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels Jun 25, 2024

pytorchbot added the open source label Jun 25, 2024

chunyuan-w marked this pull request as draft June 25, 2024 06:46

chunyuan-w added a commit that referenced this pull request Jun 25, 2024

[inductor] [cpp] use non-temporal tile load for A

1be7e53

ghstack-source-id: 0089b94
Pull Request resolved: #129455

jgong5 reviewed Jun 25, 2024


chunyuan-w added a commit that referenced this pull request Jun 26, 2024

[inductor] [cpp] use non-temporal tile load for A

e0ef4e3

ghstack-source-id: 040bbba
Pull Request resolved: #129455

chunyuan-w marked this pull request as ready for review July 12, 2024 08:55

chunyuan-w requested a review from jgong5 July 12, 2024 08:55

jgong5 approved these changes Jul 15, 2024


pytorch-bot bot added the ciflow/trunk label Jul 15, 2024

pytorchmergebot added the merging label Jul 15, 2024

pytorchmergebot removed the merging label Jul 15, 2024

chunyuan-w added the topic: not user facing label Jul 15, 2024

pytorchmergebot added the merging label Jul 15, 2024

pytorchmergebot added the Merged label Jul 15, 2024

pytorchmergebot closed this in a3c0bab Jul 15, 2024

pytorchmergebot removed the merging label Jul 15, 2024

henrylhtsang mentioned this pull request Jul 17, 2024

[aoti] Unskip some aot inductor tests #130973

Closed

francograndegmailcom pushed a commit to francograndegmailcom/pytorch-pytorch that referenced this pull request Jul 23, 2024

[inductor] [cpp] use non-temporal tile load for A

2069165

ghstack-source-id: 65ff95d
Pull Request resolved: pytorch/pytorch#129455

github-actions bot deleted the gh/chunyuan-w/19/head branch August 15, 2024 01:55

jgong5 mentioned this pull request Aug 24, 2024

[RFC] Add Cpp Template for GEMM related ops via max-autotune for Inductor CPU #125683

Open

18 tasks
