
Conversation

@jgong5 (Collaborator) commented May 17, 2024

Stack from ghstack (oldest at bottom):

As part of #125683, this PR adds epilogue fusion support for bf16/fp16 GEMMs. The key changes are as follows:

  1. bf16 linear with epilogue fusion of some ops was originally supported via the ATen oneDNN linear pointwise ops. To match the ATen op semantics, in-template epilogue support is added to the cpp GEMM template so that we have "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues are concatenated with the out-of-template epilogues appended during scheduling.
  2. Support bf16/fp16 legalization for `codegen_loop_bodies`, which is used to generate the epilogue loops.
  3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular the reuse of the output buffers of the GEMM, the template, and the epilogues. This is not correct since the output buffer is an "output", not an "in-place" buffer, of the template kernel itself. Now we use a dedicated "aliases" dict to manage such buffer reuses, and the intermediate aliasing buffers are removed after codegen.
  4. Add a `localize_buffer` method to `LocalBufferScope` to allow replacing a global buffer with a local one in the given Inductor IR nodes. This helps the fused loops work on smaller local buffers for better data locality (a rough sketch follows this list).
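The sketch below is illustrative only: it uses plain NumPy rather than the Inductor template or `LocalBufferScope` APIs, and the function and parameter names (`tiled_gemm_with_epilogue`, `tile`, `epilogue`) are made up. It shows why applying an epilogue to a small local output tile right after the GEMM micro-kernel, instead of writing the full GEMM output and re-reading it, improves data locality, which is the idea behind points 1 and 4.

```python
import numpy as np

def tiled_gemm_with_epilogue(a, b, bias, tile=64, epilogue=np.tanh):
    """Toy tiled GEMM where the epilogue runs on a local output tile."""
    m, _ = a.shape
    _, n = b.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # Local buffer holding one output tile (the "GemmOut" of the template).
            local = a[i:i + tile] @ b[:, j:j + tile]
            # In-template epilogue applied while the tile is still hot in cache.
            local = epilogue(local + bias[j:j + tile])
            out[i:i + tile, j:j + tile] = local
    return out
```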

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

@pytorch-bot (bot) commented May 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126545

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 0cba55e with merge base 87072dc:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jgong5 pushed a commit that referenced this pull request May 17, 2024
@jgong5 jgong5 marked this pull request as draft May 17, 2024 15:27
jgong5 pushed a commit that referenced this pull request May 19, 2024
@jgong5 jgong5 marked this pull request as ready for review May 19, 2024 08:54
@jgong5 jgong5 requested review from jansel, lezcano and peterbell10 May 19, 2024 08:54
```python
input_nodes,
beta=1,
alpha=1,
has_bias=False,
```

@jgong5 (Collaborator, Author) commented on this diff:

Originally we used the number of input nodes to decide whether there is a bias (2: no bias; 3: with bias), but with inputs from epilogue nodes now part of the template, there can be more inputs even when there is no bias. So we now use a dedicated flag instead.
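A minimal illustration of this change, with made-up helper names rather than the actual template code:

```python
def infer_bias_old(input_nodes):
    # Old heuristic: 2 inputs meant no bias, 3 meant bias. This breaks once
    # epilogue inputs are appended to input_nodes by the template.
    return len(input_nodes) == 3

def infer_bias_new(input_nodes, has_bias=False):
    # New behavior: the caller states explicitly whether a bias exists, so
    # extra epilogue inputs no longer masquerade as a bias.
    return has_bias
```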

```python
has_bias=False,
trans_w=False,
input_indices=None,
epilogue_creator: Optional[Callable[[ir.Buffer], ir.Pointwise]] = None,
```

@jgong5 (Collaborator, Author) commented on this diff:

The `epilogue_creator` callable is used to create the in-template epilogue nodes.
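A hypothetical sketch of the kind of callable this parameter expects, using a stand-in tuple instead of the real `ir.Buffer`/`ir.Pointwise` construction:

```python
def relu_epilogue_creator(gemm_out_buffer):
    # In the real template this would build and return an ir.Pointwise node
    # that reads gemm_out_buffer; a plain tuple stands in for it here.
    return ("pointwise", "relu", gemm_out_buffer)
```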


```python
epilogues: List[ir.IRNode] = []
if self.epilogue_creator is not None:
    gemm_output_name = "GemmOut"
```

@jgong5 (Collaborator, Author) commented on this diff:

With in-template epilogue nodes, the GEMM output can be different from the template output, i.e., gemm out -> in-template epilogues -> template output -> fused out-of-template epilogues. So we create a dedicated buffer for the GEMM output.
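The same chain written as plain Python functions, just to make it concrete (illustrative only, not the generated kernel; the epilogue choices are arbitrary):

```python
import math

def gemm(a, b):  # produces the dedicated "GemmOut" buffer
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def in_template_epilogue(t):  # fused inside the template, e.g. tanh
    return [[math.tanh(v) for v in row] for row in t]

def out_of_template_epilogue(t):  # appended during scheduling, e.g. scaling
    return [[2.0 * v for v in row] for row in t]

template_output = in_template_epilogue(gemm([[1.0, 2.0]], [[3.0], [4.0]]))
final_output = out_of_template_epilogue(template_output)
```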

Jiong Gong added 6 commits June 7, 2024 19:17
@jgong5 (Collaborator, Author) commented Jun 12, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@jgong5 (Collaborator, Author) commented Jun 13, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this pull request Jun 14, 2024
ignaciobartol pushed a commit to ignaciobartol/pytorch that referenced this pull request Jun 14, 2024
@github-actions github-actions bot deleted the gh/jgong5/48/head branch July 14, 2024 02:02
@henrylhtsang (Contributor) commented:

Hi @jgong5, I am debugging the cpu selection algorithm tests, which are sometimes flaky and fail on the `counters["inductor"]["cpp_epilogue_fusion_counter"] == 1` check. I want to see if you have insights. I don't have a stable repro; it fails occasionally (say 25% of the time) in fbcode CI.

In particular, I am mostly looking at test_linear_with_pointwise.

  • Of the 156 tests from test_linear_with_pointwise (excluding test_linear_with_pointwise_dynamic_shapes), 41 are flaky.
  • If I limit to those that contain bfloat16 in their names, there are 52 tests and 41 of them are flaky (i.e., all flaky tests have bfloat16 in their names).
  • If I further limit to those that contain test_linear_with_pointwise_batch_size_384_in_features_196_out_features_385, 21 of the 26 tests are flaky.
  • If I further limit to "bias_True", then 11 out of 13 tests are flaky. The only two that are not flaky have epilogue div or mul.

The other thing I observed is that when the tests pass, they print

```
AUTOTUNE linear_unary(384x196, 384x196)
  cpp_packed_gemm_0 0.2068 ms 100.0%
  _linear_pointwise 236.3641 ms 0.1%
```

but when they fail due to flakiness, they print

```
AUTOTUNE mm(384x196, 196x384)
  cpp_packed_gemm_0 4.1393 ms 100.0%
  mm 22332.3400 ms 0.0%
```

Curious if you have any idea about it.

@jgong5 (Collaborator, Author) commented Aug 28, 2024

Hi @henrylhtsang, from the log you shared, it seems aten.mm is not replaced by linear_pointwise when the tests fail. There is an FX graph pass that does the graph rewrite here:

```python
def linear(match, *args, **kwargs):
```

The pass is executed when the `_is_packable_linear` extra checker returns True. I am not sure what happens when the tests fail, but perhaps you can first check whether the problem is caused by `_is_packable_linear` returning False in the first place, and then dig into the checker to understand why. You can also compare the fx_graph_readable.py between the successful and failing cases generated with TORCH_COMPILE_DEBUG=1. Below is what I got on my side with `TORCH_COMPILE_DEBUG=1 pytest -vsk test_linear_with_pointwise_batch_size_384_in_features_196_out_features_385_bias_True_epilogue_tanh_cpu_bfloat16 test_cpu_select_algorithm.py`, FYI.

```python
class <lambda>(torch.nn.Module):
    def forward(self, arg2_1: "bf16[384, 196]"):
        # No stacktrace found for following nodes
        _frozen_param1: "bf16[385]" = self._frozen_param1
        _frozen_param3 = self._frozen_param3

        # File: /home/jgong5/pytorch/test/inductor/test_cpu_select_algorithm.py:236 in forward, code: return self.epilogue(self.linear(x))
        _linear_pointwise_default: "bf16[384, 385]" = torch.ops.mkldnn._linear_pointwise.default(arg2_1, _frozen_param3, _frozen_param1, 'none', [], '');  arg2_1 = _frozen_param3 = _frozen_param1 = None
        tanh: "bf16[384, 385]" = torch.ops.aten.tanh.default(_linear_pointwise_default);  _linear_pointwise_default = None
        return (tanh,)
```

@henrylhtsang (Contributor) replied:

Thanks. Unfortunately I don't have a stable repro, so I might have to land some debugging code to check it first. I plan on checking if torch.ops.mkldnn._is_mkldnn_bf16_supported() is False. Any other flags I should pay attention to?
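For example, a minimal bit of temporary debug output for that check could look like the snippet below; the print is just a placeholder for whatever logging fits fbcode CI.

```python
import torch

# If this prints False on the failing hosts, that would support the hypothesis
# above that the mkldnn linear rewrite is being skipped, which matches aten.mm
# showing up in the autotune log.
print("mkldnn bf16 supported:", torch.ops.mkldnn._is_mkldnn_bf16_supported())
```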

@jgong5 (Collaborator, Author) commented Aug 29, 2024

If that is for debugging purposes only and a temporary change, perhaps you can add some logging inside `_is_packable_linear` for all the relevant paths that return False. That could expedite your debugging, I guess.
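A rough sketch of what such temporary instrumentation could look like; it wraps the checker from the outside rather than showing its actual body, and only assumes the checker receives the pattern match and returns a bool:

```python
import logging

log = logging.getLogger("mkldnn_fusion_debug")

def with_reject_logging(checker):
    """Wrap an extra-check predicate so every False result is logged."""
    def wrapped(match, *args, **kwargs):
        result = checker(match, *args, **kwargs)
        if not result:
            log.warning("linear not packable for match: %s", match)
        return result
    return wrapped

# Hypothetical usage: _is_packable_linear = with_reject_logging(_is_packable_linear)
```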

pytorchmergebot pushed a commit that referenced this pull request Sep 6, 2024
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024