
Conversation

@jgong5 (Collaborator) commented Jul 14, 2024

Stack from ghstack (oldest at bottom):

Currently we require `n % register_block_n == 0`, which typically brings good performance when `n` is a multiple of 8, 16, 32, etc., but falls back to the reference micro-gemm (where `register_block_n == 1`) otherwise. This PR optimizes the latter case by padding `n` in the packed weight to a multiple of `register_block_n` (8, 16, 32, etc.), so the micro-gemm can work as is on the padded `n`. When the weight is padded, we use a local accumulation buffer to receive the micro-gemm result and then unpad (slice) it before storing back to the output buffer.
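To make the scheme concrete, below is a minimal PyTorch-level sketch of the pad-then-slice idea (illustrative only: the helper names, the block size of 16, and the plain matmul standing in for the packed micro-gemm are assumptions, not the generated C++ template code):

```python
import torch

def pad_weight_n(w: torch.Tensor, register_block_n: int = 16) -> torch.Tensor:
    # w: (n, k) weight; zero-pad n up to a multiple of register_block_n so the
    # vectorized micro-gemm never needs a remainder path along n.
    n, k = w.shape
    n_padded = (n + register_block_n - 1) // register_block_n * register_block_n
    w_padded = w.new_zeros((n_padded, k))
    w_padded[:n] = w
    return w_padded

def linear_with_padded_weight(x: torch.Tensor, w: torch.Tensor,
                              register_block_n: int = 16) -> torch.Tensor:
    # x: (m, k) activation, w: (n, k) weight; returns (m, n).
    n = w.shape[0]
    w_padded = pad_weight_n(w, register_block_n)
    # A local accumulation buffer holds the result over the padded n ...
    y_local = x @ w_padded.T   # stands in for the micro-gemm on the padded n
    # ... which is then unpadded (sliced) before storing to the real output.
    return y_local[:, :n]

x = torch.randn(512, 768, dtype=torch.bfloat16)
w = torch.randn(3073, 768, dtype=torch.bfloat16)
assert linear_with_padded_weight(x, w).shape == (512, 3073)
```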

Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.

Before:
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  _linear_pointwise 2.3563 ms 100.0%
  cpp_packed_gemm_0 710.5902 ms 0.3%

After:
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  cpp_packed_gemm_0 1.8909 ms 100.0%
  _linear_pointwise 2.1016 ms 90.0%

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot bot commented Jul 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130690

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7f97e30 with merge base 6c2c8ee:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jgong5 requested a review from leslie-fang-intel July 14, 2024 04:36
@leslie-fang-intel (Collaborator) left a comment

LGTM; feels like we can save some lines in the template.

        return self.store_pointwise_nodes(dst, [copy])
    else:
-       assert dst.layout == src.layout
+       assert dst.layout == src.layout, f"dst: {dst}, src: {src}"

Suggested change:

-       assert dst.layout == src.layout, f"dst: {dst}, src: {src}"
+       assert dst.layout == src.layout, f"{dst=}, {src=}"
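For context (not part of the PR), the suggestion uses Python 3.8+ self-documenting f-strings in place of the manual `dst:`/`src:` labels; a tiny illustration with placeholder string values:

```python
# Illustration of the self-documenting f-string specifier suggested above
# (Python 3.8+); the layout values here are placeholder strings.
dst, src = "FixedLayout(cpu)", "FlexibleLayout(cpu)"
print(f"dst: {dst}, src: {src}")  # dst: FixedLayout(cpu), src: FlexibleLayout(cpu)
print(f"{dst=}, {src=}")          # dst='FixedLayout(cpu)', src='FlexibleLayout(cpu)'
```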

jgong5 pushed a commit that referenced this pull request Jul 15, 2024
@jgong5 added the `topic: not user facing` and `ciflow/trunk` labels Jul 16, 2024
Jiong Gong added 2 commits July 16, 2024 01:31
@jgong5 (Collaborator, Author) commented Jul 19, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:
Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)
Details for Dev Infra team (raised by workflow job):

Failing merge rule: Core Maintainers

@jgong5 (Collaborator, Author) commented Jul 20, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Jul 20, 2024
## Description
For the single-thread case, this PR improves the cache blocking in the CPP GEMM template using CPU info (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are calculated from the conditions below (a sketch follows the list):
- size_of_B < L1
- size_of_A < 0.5 * L2
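As a rough illustration of how blocking factors could be derived from these conditions, here is a hedged Python sketch; the register block sizes (`Mr`, `Nr`, `Kr`), the cache sizes, and the dtype width are illustrative assumptions, not the values the actual template heuristic uses:

```python
# Hedged sketch of the cache-blocking heuristic described above.
# All block/cache sizes below are assumptions for illustration only.
def compute_cache_blocking(M, K, dtype_bytes=2,
                           Mr=32, Nr=32, Kr=32,
                           L1=48 * 1024, L2=2 * 1024 * 1024):
    # Condition 1: the packed B tile (Kc x Nr) fits in L1:
    #   Kc_blocks * Kr * Nr * dtype_bytes <= L1
    Kc_blocks = max(1, L1 // (Kr * Nr * dtype_bytes))
    Kc_blocks = min(Kc_blocks, (K + Kr - 1) // Kr)   # don't exceed K
    # Condition 2: the A tile (Mc x Kc) fits in half of L2:
    #   Mc_blocks * Mr * Kc * dtype_bytes <= 0.5 * L2
    Kc = Kc_blocks * Kr
    Mc_blocks = max(1, int(0.5 * L2) // (Mr * Kc * dtype_bytes))
    Mc_blocks = min(Mc_blocks, (M + Mr - 1) // Mr)   # don't exceed M
    return Mc_blocks, Kc_blocks

# Example: M=512, K=768 in bf16 gives (Mc_blocks, Kc_blocks) = (16, 24)
print(compute_cache_blocking(M=512, K=768))
```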

For the multi-thread case, we need to tune the task decomposition among threads together with the cache blocking, so we disable the cache-blocking change there for now and will submit a follow-up PR for multi-thread optimizations.

## Performance
No regressions. Models with > 3% performance speedup are listed below:

### BF16 single thread (measured on CPU with AMX support)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

### FP32 single thread (measured on Ice Lake)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

### Next step
The E2E-level improvement is limited for the reasons below:

- For several HF models, we observe a kernel-level improvement in the GEMM template kernel, but the kernel either remains slower than the ATen kernel (and thus isn't selected during autotune) or only improves from slower than ATen to roughly on par with ATen, so there is no E2E-level performance change.

- There are models where the GEMM template kernel gains > 10% performance with this PR, but since the kernel time is only about 3% of the E2E time, we don't observe a significant E2E-level improvement.

We will continue to find possible optimizations in the gemm template kernel in follow-up PRs.

Co-authored-by: Jiong Gong <[email protected]>
Pull Request resolved: #129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
DiweiSun pushed a commit to DiweiSun/pytorch that referenced this pull request Jul 22, 2024
Pull Request resolved: pytorch#130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: pytorch#130675
DiweiSun pushed a commit to DiweiSun/pytorch that referenced this pull request Jul 22, 2024
francograndegmailcom pushed a commit to francograndegmailcom/pytorch-pytorch that referenced this pull request Jul 23, 2024
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
github-actions bot deleted the gh/jgong5/59/head branch August 20, 2024 01:58