
Conversation

@chunyuan-w (Collaborator) commented Sep 23, 2024

Stack from ghstack (oldest at bottom):

Fixes the max-autotune compilation error for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench in the single-thread dynamic-shapes case:

/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16*, const bfloat16*, const bfloat16*, bfloat16*, int64_t)’:
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression
  279 |         constexpr int64_t m_block_end = Mr_blocks;
      |                                         ^~~~~~~~~
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression
  237 |     const int64_t Mr_blocks = (M + Mr - 1) / Mr;
      |                   ^~~~~~~~~

The PR also updates the UT to add a single-thread test with BS = 512.
The previous case used BS = 1024, which equals the K and N values, so the generated code had no symbolic shapes and failed to capture the above issue.
With BS = 512, the generated code has a symbolic shape for the M dim and reproduces the issue this PR addresses.
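
To make the shape setup concrete, here is a rough, hypothetical repro sketch (not the exact UT added in this PR): the module and tensor names are illustrative, and the `cpp.threads` / `max_autotune_gemm_backends` config knobs are assumptions about how to pin the single-thread CPP GEMM path.

```python
# Hypothetical repro sketch: BS=512 differs from K=N=1024, so with
# dynamic=True the M dim stays symbolic and the generated C++ GEMM template
# has to handle a runtime M, the situation behind the constexpr error above.
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp.threads = 1                      # single-thread case (assumed knob)
inductor_config.max_autotune_gemm_backends = "CPP"   # force the CPP GEMM template (assumed option)

mod = torch.nn.Linear(1024, 1024, bias=False).to(torch.bfloat16).eval()
x = torch.randn(512, 1024, dtype=torch.bfloat16)     # BS=512 != 1024, so M stays symbolic

with torch.no_grad():
    compiled = torch.compile(mod, mode="max-autotune", dynamic=True)
    compiled(x)
```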

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

@pytorch-bot bot commented Sep 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136418

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f0e56a7 with merge base e9bfbf7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on lines 1402 to 1403
512,
1024,
Collaborator
Perhaps just using 512 would be good enough? I'm concerned about the added test time...

Collaborator Author
Changed to just keep 512.

@chunyuan-w (Collaborator Author)
@pytorchbot merge

@pytorchmergebot (Collaborator)
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Sep 25, 2024
…no epilogue_nodes (#136518)

The `template_buffer_has_other_users` function checks the case where there are epilogue nodes and the template output has users other than those epilogue nodes. When there are no epilogue nodes, the function can return `False` directly.
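
As a standalone illustration of that early return (the function here is a toy stand-in, not the actual Inductor implementation or its signature):

```python
# Toy model of the check described above: "other users" only makes sense
# relative to a set of epilogue nodes, so an empty set can short-circuit to False.
def template_output_has_other_users(user_names: set, epilogue_node_names: set) -> bool:
    if not epilogue_node_names:
        # No epilogue nodes: nothing to compare against, bail out early.
        return False
    # Any user of the template output that is not an epilogue node is an "other" user.
    return bool(user_names - epilogue_node_names)


# The template output feeds an epilogue (relu) plus a later matmul: other users exist.
print(template_output_has_other_users({"relu_0", "mm_1"}, {"relu_0"}))  # True
# No epilogue nodes at all: the early return kicks in.
print(template_output_has_other_users({"mm_1"}, set()))                 # False
```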

Pull Request resolved: #136518
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418
BoyuanFeng pushed a commit to BoyuanFeng/pytorch that referenced this pull request Sep 25, 2024
…ytorch#136418)

Fixes the max-autotune compilation error for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench in the single-thread dynamic-shapes case:

```
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16*, const bfloat16*, const bfloat16*, bfloat16*, int64_t)’:
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression
  279 |         constexpr int64_t m_block_end = Mr_blocks;
      |                                         ^~~~~~~~~
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression
  237 |     const int64_t Mr_blocks = (M + Mr - 1) / Mr;
      |                   ^~~~~~~~~
```

The PR also updates the UT to add a single-thread test with `BS` = 512.
The previous case used `BS` = 1024, which equals the `K` and `N` values, so the generated code had no symbolic shapes and failed to capture the above issue.
With `BS` = 512, the generated code has a symbolic shape for the M dim and reproduces the issue this PR addresses.

Pull Request resolved: pytorch#136418
Approved by: https://github.com/jgong5
BoyuanFeng pushed a commit to BoyuanFeng/pytorch that referenced this pull request Sep 25, 2024
…no epilogue_nodes (pytorch#136518)

The `template_buffer_has_other_users` function checks the case where there are epilogue nodes and the template output has users other than those epilogue nodes. When there are no epilogue nodes, the function can return `False` directly.

Pull Request resolved: pytorch#136518
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: pytorch#136418
pytorchmergebot pushed a commit that referenced this pull request Sep 27, 2024
Fixes the max-autotune failure of `soft_actor_critic` in Torchbench in the FP32 single-thread dynamic-shapes case:
```log
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_micro_gemm.py", line 136, in codegen_call
    C_ptr = f"&({kernel.index(C, [0, 0])})"
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_template_kernel.py", line 135, in index
    else self.args.input(node.get_name())
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/common.py", line 1251, in input
    assert name not in V.graph.removed_buffers, name
AssertionError: buf_GemmOut
```

The 1st and 2nd linears do not need a local buffer, while the 3rd linear does. The 3rd linear, which uses a local buffer, adds its global buffer (named `buf_GemmOut`) to `V.graph.removed_buffers`.

When scheduling the nodes, the 1st linear (which does not use a local buffer) looks up its output buffer (also named `buf_GemmOut`) from the input, finds it in `V.graph.removed_buffers`, and raises the AssertionError. The root cause is that the output buffers of all these linears share the same name, `buf_GemmOut`, so they conflict.

The fix renames these buffers by adding the name of the `template_buffer` as a prefix.
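
A minimal sketch of that renaming scheme (the helper and the exact name format are hypothetical, not the code from this PR):

```python
# Each linear's GEMM output used to be named plainly "buf_GemmOut"; prefixing
# it with its template buffer's name keeps the buffers distinct, so adding one
# of them to removed_buffers no longer shadows the others.
def gemm_out_name(template_buffer_name: str) -> str:
    return f"{template_buffer_name}_GemmOut"

print(gemm_out_name("buf0"))  # buf0_GemmOut  (1st linear)
print(gemm_out_name("buf1"))  # buf1_GemmOut  (2nd linear)
print(gemm_out_name("buf2"))  # buf2_GemmOut  (3rd linear, the one using a local buffer)
```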

Pull Request resolved: #136419
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418, #136518
github-actions bot deleted the gh/chunyuan-w/31/head branch October 27, 2024 02:09