
Conversation

@jgong5 (Collaborator) commented May 17, 2024

Stack from ghstack (oldest at bottom):

As part of #125683, this PR adds epilogue fusion support for bf16/fp16 GEMMs. The key changes are as follows:

  1. bf16 linear with epilogue fusion of some ops was originally supported via the ATen oneDNN linear pointwise ops. To match the ATen op semantics, in-template epilogue support is added to the cpp GEMM template so that we have "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues are concatenated with the out-of-template epilogues appended during scheduling.
  2. Support bf16/fp16 legalization for `codegen_loop_bodies`, which is used to generate the epilogue loops.
  3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular the reuse of the output buffers of the GEMM, the template, and the epilogues. This is not correct since the output buffer is an "output", not an "in-place" buffer, of the template kernel itself. Now we use a dedicated "aliases" dict to manage such buffer reuses, and the intermediate aliasing buffers are removed after codegen.
  4. Add a `localize_buffer` method to `LocalBufferScope` to allow replacing a global buffer with a local one in the given Inductor IR nodes. This helps the fused loops work on smaller local buffers for better data locality (a rough sketch follows this list).
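The sketch below is illustrative only: it uses plain NumPy rather than the Inductor template or `LocalBufferScope` APIs, and the function and parameter names (`tiled_gemm_with_epilogue`, `tile`, `epilogue`) are made up. It shows why applying an epilogue to a small local output tile right after the GEMM micro-kernel, instead of writing the full GEMM output and re-reading it, improves data locality, which is the idea behind points 1 and 4.

```python
import numpy as np

def tiled_gemm_with_epilogue(a, b, bias, tile=64, epilogue=np.tanh):
    """Toy tiled GEMM where the epilogue runs on a local output tile."""
    m, _ = a.shape
    _, n = b.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # Local buffer holding one output tile (the "GemmOut" of the template).
            local = a[i:i + tile] @ b[:, j:j + tile]
            # In-template epilogue applied while the tile is still hot in cache.
            local = epilogue(local + bias[j:j + tile])
            out[i:i + tile, j:j + tile] = local
    return out
```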

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

@pytorch-bot (bot) commented May 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126545

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 0cba55e with merge base 87072dc:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jgong5 pushed a commit that referenced this pull request May 17, 2024
@jgong5 jgong5 marked this pull request as draft May 17, 2024 15:27
jgong5 pushed a commit that referenced this pull request May 19, 2024
@jgong5 jgong5 marked this pull request as ready for review May 19, 2024 08:54
@jgong5 jgong5 requested review from jansel, lezcano and peterbell10 May 19, 2024 08:54
```python
input_nodes,
beta=1,
alpha=1,
has_bias=False,
```

@jgong5 (Collaborator, Author) commented on this diff:

Originally we used the number of input nodes to decide whether there is a bias (2: no bias; 3: with bias), but with inputs from epilogue nodes now part of the template, there can be more inputs even when there is no bias. So we now use a dedicated flag instead.
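A minimal illustration of this change, with made-up helper names rather than the actual template code:

```python
def infer_bias_old(input_nodes):
    # Old heuristic: 2 inputs meant no bias, 3 meant bias. This breaks once
    # epilogue inputs are appended to input_nodes by the template.
    return len(input_nodes) == 3

def infer_bias_new(input_nodes, has_bias=False):
    # New behavior: the caller states explicitly whether a bias exists, so
    # extra epilogue inputs no longer masquerade as a bias.
    return has_bias
```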

```python
has_bias=False,
trans_w=False,
input_indices=None,
epilogue_creator: Optional[Callable[[ir.Buffer], ir.Pointwise]] = None,
```

@jgong5 (Collaborator, Author) commented on this diff:

The `epilogue_creator` callable is used to create the in-template epilogue nodes.
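A hypothetical sketch of the kind of callable this parameter expects, using a stand-in tuple instead of the real `ir.Buffer`/`ir.Pointwise` construction:

```python
def relu_epilogue_creator(gemm_out_buffer):
    # In the real template this would build and return an ir.Pointwise node
    # that reads gemm_out_buffer; a plain tuple stands in for it here.
    return ("pointwise", "relu", gemm_out_buffer)
```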


```python
epilogues: List[ir.IRNode] = []
if self.epilogue_creator is not None:
    gemm_output_name = "GemmOut"
```

@jgong5 (Collaborator, Author) commented on this diff:

With in-template epilogue nodes, the GEMM output can be different from the template output, i.e., gemm out -> in-template epilogues -> template output -> fused out-of-template epilogues. So we create a dedicated buffer for the GEMM output.
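The same chain written as plain Python functions, just to make it concrete (illustrative only, not the generated kernel; the epilogue choices are arbitrary):

```python
import math

def gemm(a, b):  # produces the dedicated "GemmOut" buffer
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def in_template_epilogue(t):  # fused inside the template, e.g. tanh
    return [[math.tanh(v) for v in row] for row in t]

def out_of_template_epilogue(t):  # appended during scheduling, e.g. scaling
    return [[2.0 * v for v in row] for row in t]

template_output = in_template_epilogue(gemm([[1.0, 2.0]], [[3.0], [4.0]]))
final_output = out_of_template_epilogue(template_output)
```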

Jiong Gong added 6 commits June 7, 2024 19:17
@jgong5 (Collaborator, Author) commented Jun 12, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@jgong5 (Collaborator, Author) commented Jun 13, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this pull request Jun 14, 2024
ignaciobartol pushed a commit to ignaciobartol/pytorch that referenced this pull request Jun 14, 2024
@github-actions github-actions bot deleted the gh/jgong5/48/head branch July 14, 2024 02:02
@henrylhtsang (Contributor) commented:

Hi @jgong5, I am debugging the cpu selection algorithm tests, which are sometimes flaky and fail on the `counters["inductor"]["cpp_epilogue_fusion_counter"] == 1` check. I want to see if you have insights. I don't have a stable repro; it fails occasionally (say 25% of the time) in fbcode CI.

In particular, I am mostly looking at test_linear_with_pointwise.

  • Of the 156 tests from test_linear_with_pointwise (excluding test_linear_with_pointwise_dynamic_shapes), 41 are flaky.
  • If I limit to those that contain bfloat16 in their names, there are 52 tests and 41 of them are flaky (i.e., all flaky tests have bfloat16 in their names).
  • If I further limit to those that contain test_linear_with_pointwise_batch_size_384_in_features_196_out_features_385, 21 of the 26 tests are flaky.
  • If I further limit to "bias_True", then 11 out of 13 tests are flaky. The only two that are not flaky have epilogue div or mul.

The other thing I observed is that when the tests pass, they print

```
AUTOTUNE linear_unary(384x196, 384x196)
  cpp_packed_gemm_0 0.2068 ms 100.0%
  _linear_pointwise 236.3641 ms 0.1%
```

but when they fail due to flakiness, they print

```
AUTOTUNE mm(384x196, 196x384)
  cpp_packed_gemm_0 4.1393 ms 100.0%
  mm 22332.3400 ms 0.0%
```

Curious if you have any idea about it.

@jgong5 (Collaborator, Author) commented Aug 28, 2024

Hi @henrylhtsang, from the log you shared, it seems aten.mm is not replaced by linear_pointwise when the tests fail. There is an FX graph pass that does the graph rewrite here:

```python
def linear(match, *args, **kwargs):
```

The pass is executed when the `_is_packable_linear` extra checker returns True. I am not sure what happens when the tests fail, but perhaps you can first check whether the problem is caused by `_is_packable_linear` returning False in the first place, and then dig into the checker to understand why. You can also compare the fx_graph_readable.py between the successful and failing cases generated with TORCH_COMPILE_DEBUG=1. Below is what I got on my side with `TORCH_COMPILE_DEBUG=1 pytest -vsk test_linear_with_pointwise_batch_size_384_in_features_196_out_features_385_bias_True_epilogue_tanh_cpu_bfloat16 test_cpu_select_algorithm.py`, FYI.

```python
class <lambda>(torch.nn.Module):
    def forward(self, arg2_1: "bf16[384, 196]"):
        # No stacktrace found for following nodes
        _frozen_param1: "bf16[385]" = self._frozen_param1
        _frozen_param3 = self._frozen_param3

        # File: /home/jgong5/pytorch/test/inductor/test_cpu_select_algorithm.py:236 in forward, code: return self.epilogue(self.linear(x))
        _linear_pointwise_default: "bf16[384, 385]" = torch.ops.mkldnn._linear_pointwise.default(arg2_1, _frozen_param3, _frozen_param1, 'none', [], '');  arg2_1 = _frozen_param3 = _frozen_param1 = None
        tanh: "bf16[384, 385]" = torch.ops.aten.tanh.default(_linear_pointwise_default);  _linear_pointwise_default = None
        return (tanh,)
```

@henrylhtsang (Contributor) replied:

Thanks. Unfortunately I don't have a stable repro, so I might have to land some debugging code to check it first. I plan on checking if torch.ops.mkldnn._is_mkldnn_bf16_supported() is False. Any other flags I should pay attention to?
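For example, a minimal bit of temporary debug output for that check could look like the snippet below; the print is just a placeholder for whatever logging fits fbcode CI.

```python
import torch

# If this prints False on the failing hosts, that would support the hypothesis
# above that the mkldnn linear rewrite is being skipped, which matches aten.mm
# showing up in the autotune log.
print("mkldnn bf16 supported:", torch.ops.mkldnn._is_mkldnn_bf16_supported())
```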

@jgong5 (Collaborator, Author) commented Aug 29, 2024

If that is for debugging purposes only and a temporary change, perhaps you can add some logging inside `_is_packable_linear` for all the relevant paths that return False. That could expedite your debugging, I guess.
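A rough sketch of what such temporary instrumentation could look like; it wraps the checker from the outside rather than showing its actual body, and only assumes the checker receives the pattern match and returns a bool:

```python
import logging

log = logging.getLogger("mkldnn_fusion_debug")

def with_reject_logging(checker):
    """Wrap an extra-check predicate so every False result is logged."""
    def wrapped(match, *args, **kwargs):
        result = checker(match, *args, **kwargs)
        if not result:
            log.warning("linear not packable for match: %s", match)
        return result
    return wrapped

# Hypothetical usage: _is_packable_linear = with_reject_logging(_is_packable_linear)
```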

pytorchmergebot pushed a commit that referenced this pull request Sep 6, 2024
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024