[inductor][cpp][gemm] move bias add to epilogue #130675

jgong5 · 2024-07-13T10:01:37Z

Stack from ghstack (oldest at bottom):

Speed up the bias-add computation by moving it to the epilogue. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
cpp_packed_gemm_0 1.9200 ms 100.0%
_linear_pointwise 1.9345 ms 99.3%

After
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
cpp_packed_gemm_0 1.8321 ms 100.0%
_linear_pointwise 1.9246 ms 95.2%
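
For readers unfamiliar with the term, here is a minimal pure-Python sketch of what adding the bias in the epilogue means conceptually. It is not the generated C++ kernel: the function name `gemm_with_bias_epilogue`, the tile sizes, and the list-of-lists layout are assumptions made purely for illustration. Each output tile is accumulated first, and the bias is added while that tile is stored to the output.

```python
def gemm_with_bias_epilogue(x, w, bias, tile_m=2, tile_n=2):
    """Compute y = x @ w.T + bias, adding the bias in a per-tile epilogue.

    x is M x K, w is N x K (the layout used by linear), bias has length N.
    All names and tile sizes here are illustrative assumptions.
    """
    m, k, n = len(x), len(x[0]), len(w)
    y = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, tile_m):
        for n0 in range(0, n, tile_n):
            m1, n1 = min(m0 + tile_m, m), min(n0 + tile_n, n)
            # Main loop: accumulate the tile in a local buffer.
            acc = [[0.0] * (n1 - n0) for _ in range(m1 - m0)]
            for i in range(m0, m1):
                for j in range(n0, n1):
                    acc[i - m0][j - n0] = sum(x[i][p] * w[j][p] for p in range(k))
            # Epilogue: add the bias while storing the tile to the output.
            for i in range(m0, m1):
                for j in range(n0, n1):
                    y[i][j] = acc[i - m0][j - n0] + bias[j]
    return y


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [10.0, 20.0, 30.0]
y = gemm_with_bias_epilogue(x, w, bias)
```

On the numbers reported above, `cpp_packed_gemm_0` goes from 1.9200 ms to 1.8321 ms, a reduction of roughly 4.6%.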

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot · 2024-07-13T10:01:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130675

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit bebe1bb with merge base 6c2c8ee:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Lint / lintrunner-noclang / linux-job (gh) (trunk failure)
>>> Lint for test/inductor/test_aot_inductor_package.py:
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
inductor/test_flex_attention.py::TestFlexAttention::test_fw_bw_graph_correctness
trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
inductor/test_flex_attention.py::TestFlexAttention::test_fw_bw_graph_correctness

This comment was automatically generated by Dr. CI and updates every 15 minutes.


leslie-fang-intel

LGTM, only a small question.

leslie-fang-intel · 2024-07-14T12:00:30Z

torch/_inductor/codegen/cpp_gemm_template.py

+                Y_aliases.add(current_input_buffer.get_name())
+                reindexers.append(None)
+                if i < len(epilogue_creators) - 1:
+                    current_input_buffer = ir.Buffer(


Why do we need to create a new Buffer here? It seems we could reuse the buffer created above:

    ir.ComputedBuffer(
        name=buffer_name,
        layout=template_buffer.layout,
        data=creator(current_input_buffer),
    )

since they are fake_buffers and are only used to patch the get_dtype method?

jgong5 · 2024-07-14

They have to be different buffers in the IR; otherwise there would be cyclic dependencies, i.e., a buffer depending on itself, which would cause problems in the codegen.
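
The point about cyclic dependencies can be illustrated with a toy model. This is only a sketch: `ToyBuffer`, `chain_epilogues`, and `has_self_dependency` are hypothetical names and are not part of the inductor IR. It shows that giving each chained epilogue a fresh input buffer keeps the dependency chain acyclic, while reusing a single name makes a buffer read from itself.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for an IR buffer: a name plus the names it reads."""
    name: str
    reads: set = field(default_factory=set)


def chain_epilogues(template_name, epilogue_names, fresh_inputs=True):
    """Build one ToyBuffer per epilogue, each reading the previous result.

    With fresh_inputs=False every epilogue reads and writes the same name,
    which produces the self-dependency described in the reply above.
    """
    buffers = []
    current_input = template_name
    for name in epilogue_names:
        out_name = name if fresh_inputs else template_name
        buffers.append(ToyBuffer(out_name, {current_input}))
        current_input = out_name
    return buffers


def has_self_dependency(buffers):
    """Return True if any buffer lists its own name among its reads."""
    return any(buf.name in buf.reads for buf in buffers)


ok = chain_epilogues("buf0", ["buf1", "buf2"])
bad = chain_epilogues("buf0", ["buf1", "buf2"], fresh_inputs=False)
```

Here `has_self_dependency(ok)` is false and `has_self_dependency(bad)` is true, which is the situation the separate buffers are meant to avoid.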


jgong5 · 2024-07-18T01:14:53Z

@pytorchbot merge

pytorchmergebot · 2024-07-18T01:16:39Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-07-18T07:15:20Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

jgong5 · 2024-07-18T08:06:19Z

@pytorchbot merge


pytorchmergebot · 2024-07-18T08:08:56Z

Merge started


pytorchmergebot · 2024-07-18T14:07:33Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

jgong5 · 2024-07-19T01:14:40Z

@pytorchbot merge

pytorchmergebot · 2024-07-19T01:16:24Z

Merge started


…130690)

Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32, etc., and we fall back to the reference micro gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` to a multiple of `register_block_n` (8, 16, 32, etc.) for the packed weight, so the micro-gemm can work as is on the padded `n`. When the weight is padded, we use the local accumulation buffer to get the result from the micro-gemm, and the result is then unpadded (sliced) before being stored back to the output buffer.

Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
_linear_pointwise 2.3563 ms 100.0%
cpp_packed_gemm_0 710.5902 ms 0.3%

After
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
cpp_packed_gemm_0 1.8909 ms 100.0%
_linear_pointwise 2.1016 ms 90.0%

Pull Request resolved: #130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130675
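
The pad-then-slice idea in the commit message above can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the template's actual code: `pad_to_multiple` and `gemm_padded_n` are hypothetical helpers, and the inner computation is a plain dot product standing in for the micro-gemm.

```python
def pad_to_multiple(n, block):
    """Round n up to the next multiple of block."""
    return (n + block - 1) // block * block


def gemm_padded_n(x, w, register_block_n):
    """y = x @ w.T, computed on N padded to a multiple of register_block_n.

    x is M x K and w is N x K. Hypothetical helper for illustration only.
    """
    n, k = len(w), len(w[0])
    padded_n = pad_to_multiple(n, register_block_n)
    # Pad the weight with zero rows so a blocked kernel sees a full last block.
    w_padded = w + [[0.0] * k for _ in range(padded_n - n)]
    y = []
    for row in x:
        # Local accumulation buffer of the padded width.
        acc = [sum(a * b for a, b in zip(row, w_row)) for w_row in w_padded]
        # Unpad (slice) before storing to the output.
        y.append(acc[:n])
    return y


x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = gemm_padded_n(x, w, register_block_n=2)
```

For the benchmarked shape, `pad_to_multiple(3073, 32)` gives 3104 and `pad_to_multiple(3073, 16)` gives 3088, while an already aligned `n` such as 3072 is left unchanged.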

## Description

For the single-thread case, this PR improves the cache blocking in the CPP GEMM template using CPU info (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are calculated based on the conditions below:

- size_of_B < L1
- size_of_A < 0.5 * L2

For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for multi-thread for now and will submit a follow-up PR for multi-thread optimizations.

## Performance

No regressions. Models with > 3% performance speedup are listed below:

### BF16 single thread (measured on CPU with AMX support)

- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

### FP32 single thread (measured on Ice Lake)

- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

### Next step

The E2E-level improvement is limited for the reasons below:

- For several HF models, we observe a kernel-level performance improvement for the gemm template kernel, but the kernel is either still slower than the ATen kernel (and thus won't be selected during autotune) or has improved from slower than ATen to similar to ATen, so we don't see an E2E-level performance change.
- There are models where the gemm template kernel gets > 10% performance improvement with this PR, but since the kernel time is only about 3% of the E2E time, we don't observe a significant E2E-level improvement.

We will continue to look for possible optimizations in the gemm template kernel in follow-up PRs.

Co-authored-by: Jiong Gong <[email protected]>
Pull Request resolved: #129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
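
The two cache conditions in the description above can be turned into a small sizing sketch. The commit message does not say how `size_of_A` and `size_of_B` are measured, so the code below makes its own assumptions: the B tile is taken to be `Kc x nr` elements and the A tile `Mc x Kc` elements, with `Kc = Kc_blocks * kr` and `Mc = Mc_blocks * mr`. The function `pick_cache_blocks`, its parameters, and the example cache sizes are all hypothetical.

```python
def pick_cache_blocks(mr, nr, kr, elem_bytes, l1_bytes, l2_bytes):
    """Pick (Mc_blocks, Kc_blocks) under two assumed constraints:

    B tile: (Kc_blocks * kr) * nr * elem_bytes < l1_bytes
    A tile: (Mc_blocks * mr) * (Kc_blocks * kr) * elem_bytes < 0.5 * l2_bytes
    """
    # Largest Kc_blocks whose B tile is strictly smaller than L1 (at least 1).
    kc_blocks = max(1, (l1_bytes - 1) // (kr * nr * elem_bytes))
    kc = kc_blocks * kr
    # Largest Mc_blocks whose A tile is strictly smaller than half of L2.
    half_l2 = l2_bytes // 2
    mc_blocks = max(1, (half_l2 - 1) // (mr * kc * elem_bytes))
    return mc_blocks, kc_blocks


# Example values only: 32x32x32 register blocks, 2-byte elements,
# a 48 KiB L1 and a 2 MiB L2.
mc_blocks, kc_blocks = pick_cache_blocks(
    mr=32, nr=32, kr=32, elem_bytes=2,
    l1_bytes=48 * 1024, l2_bytes=2 * 1024 * 1024,
)
```

With these example inputs the sketch returns `Mc_blocks = 22` and `Kc_blocks = 23`. The real heuristic in the template may weigh the caches differently.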


Update

06cccb7

[ghstack-poisoned]

pytorch-bot bot added the ciflow/inductor and module: inductor labels Jul 13, 2024

Update

912c696

[ghstack-poisoned]

pytorchbot added the open source label Jul 13, 2024

Update

6670217

[ghstack-poisoned]

Update

553d1f9

[ghstack-poisoned]

jgong5 mentioned this pull request Jul 13, 2024

[inductor][cpp] align dtype convert cache between vec and scalar kernels #130677

Closed

Update

f0adb10

[ghstack-poisoned]

jgong5 requested a review from leslie-fang-intel July 13, 2024 14:32

jgong5 mentioned this pull request Jul 14, 2024

[inductor][cpp][gemm] optimize arbitrary N in packed gemm template #130690

Closed

leslie-fang-intel approved these changes Jul 14, 2024


Update

83a1f67

[ghstack-poisoned]

jgong5 requested a review from jansel July 16, 2024 01:43

jgong5 added the topic: not user facing label Jul 16, 2024

jgong5 mentioned this pull request Jul 16, 2024

[inductor][cpp][gemm] support k slicing for static shapes #130821

Closed

Update

cf8d7b6

[ghstack-poisoned]

jansel approved these changes Jul 17, 2024


pytorch-bot bot added the ciflow/trunk label Jul 18, 2024

pytorchmergebot added the merging label Jul 18, 2024

Update

bebe1bb

[ghstack-poisoned]

This was referenced Jul 18, 2024

[inductor] [cpp] improve cache blocking with CPU info #129348

Closed

[inductor][cpp][gemm] improve thread blocking heuristics #131024

Closed

pytorchmergebot closed this in 39493aa Jul 19, 2024

pytorchmergebot added Merged and removed merging labels Jul 19, 2024

francograndegmailcom pushed a commit to francograndegmailcom/pytorch-pytorch that referenced this pull request Jul 23, 2024

[inductor][cpp][gemm] move bias add to epilogue

beb0c63

ghstack-source-id: 0b0ede7 Pull Request resolved: pytorch/pytorch#130675

henrylhtsang mentioned this pull request Jul 31, 2024

[BE][typing] fix types in common pruning #132309

Closed

github-actions bot deleted the gh/jgong5/57/head branch August 18, 2024 02:01
