Refactored template codegen to explicitly set current body when generating code #127144

Chillee · 2024-05-24T23:56:37Z

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

The main motivation for this refactor is that today, when generating templates, this is what happens.

def_kernel() # registers hook for fully generating function definition
store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`.

Later on, when we codegen the template:

https://github.com/pytorch/pytorch/blob/f8c4c268da67e9684f3287b7468f36a5a27c6a0b/torch/_inductor/codegen/simd.py#L1402 (`if not only_gen_src_code:`)

epilogue_node.codegen() # Also writes to body!
template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body`

Today, this is fine as long as `store_output` is the last function called in the template. However, there are a couple of things we probably want to do with kernels that make this annoying.

1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`.
2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322)
3. The current code also makes it quite difficult to support fusion into multiple output nodes.
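The conflict described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToySharedBodyKernel`, its placeholder strings, and the emitted lines are hypothetical stand-ins and not inductor's actual classes or output. It shows two hooks that both read one shared `body` at finalize time, so neither can tell its own lines apart from the other's.

```python
class ToySharedBodyKernel:
    """Toy stand-in for a template kernel with ONE shared body (not inductor code)."""

    def __init__(self):
        self.body = []   # shared buffer that every codegen step appends to
        self.hooks = {}  # placeholder -> callable producing the final text

    def store_output(self):
        # Registers a hook; the hook reads self.body later, at finalize time.
        self.hooks["<STORE_OUTPUT>"] = lambda: "\n".join(self.body)
        return "<STORE_OUTPUT>"

    def modification(self):
        # Also emits subgraph code, and only has self.body to write to.
        self.body.append("mod = score * 2")
        self.hooks["<MODIFICATION>"] = lambda: "\n".join(self.body)
        return "<MODIFICATION>"

    def finalize(self, code):
        # Replace every placeholder with whatever its hook returns now.
        for placeholder, hook in self.hooks.items():
            code = code.replace(placeholder, hook())
        return code


kernel = ToySharedBodyKernel()
# modification is positioned after store_output in the template text
template = kernel.store_output() + "\n---\n" + kernel.modification()
kernel.body.append("tl.store(out_ptr, acc)")  # epilogue codegen also writes to body
rendered = kernel.finalize(template)
store_part, mod_part = rendered.split("\n---\n")
```

In this toy, `store_part` and `mod_part` come out identical, because both hooks render the same accumulated buffer. That is the kind of implicit shared state the refactor is meant to remove.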

To resolve this, I do two things:

1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies.
2. I add functions that allow you to finalize specific hooks on `PartialRender`.
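A minimal sketch of how these two changes could fit together, assuming a simplified design. `ToyTemplateKernel`, `ToyPartialRender`, `writeline`, `render_body`, and the `finalize_hook`/`finalize_all` names here are hypothetical toy code; only `subgraph_bodies` and `set_subgraph_body` are names taken from the description above, and the real signatures in inductor may differ.

```python
import contextlib


class ToyPartialRender:
    """Rendered template text plus hooks that fill in placeholders later."""

    def __init__(self, code, hooks):
        self.code = code
        self.hooks = hooks  # placeholder -> callable, or None once finalized

    def finalize_hook(self, key):
        hook = self.hooks.get(key)
        if hook is None:
            return  # unknown key or already finalized
        self.code = self.code.replace(key, hook())
        self.hooks[key] = None

    def finalize_all(self):
        for key in list(self.hooks):
            self.finalize_hook(key)
        return self.code


class ToyTemplateKernel:
    def __init__(self):
        self.subgraph_bodies = {}  # name -> list of generated lines
        self.body = None           # no default body; one must be selected

    @contextlib.contextmanager
    def set_subgraph_body(self, name):
        # Make the named body current, then restore the previous one on exit.
        previous = self.body
        self.body = self.subgraph_bodies.setdefault(name, [])
        try:
            yield
        finally:
            self.body = previous

    def writeline(self, line):
        if self.body is None:
            raise RuntimeError("no subgraph body is active")
        self.body.append(line)

    def render_body(self, name):
        return "\n".join(self.subgraph_bodies[name])


kernel = ToyTemplateKernel()
hooks = {
    "<STORE_OUTPUT>": lambda: kernel.render_body("store_output"),
    "<MODIFICATION>": lambda: kernel.render_body("modification"),
}
partial = ToyPartialRender("<STORE_OUTPUT>\n---\n<MODIFICATION>", hooks)

with kernel.set_subgraph_body("modification"):
    kernel.writeline("mod = score * 2")
with kernel.set_subgraph_body("store_output"):
    kernel.writeline("tl.store(out_ptr, acc)")

partial.finalize_hook("<MODIFICATION>")  # finalize one specific hook early
after_one = partial.code
final = partial.finalize_all()           # finalize whatever is left
```

Because each hook reads only its own named body, the order in which the template calls `store_output` and `modification` no longer matters in this toy, and a single hook can be finalized without touching the others.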

[ghstack-poisoned]

pytorch-bot · 2024-05-24T23:56:40Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127144


✅ You can merge normally! (6 Unrelated Failures)

As of commit 5cd8590 with merge base e4b2452:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / macos-13-py3-arm64 / test (default, 3, 3, macos-m1-stable) (gh) (similar failure)
inductor/test_compiled_autograd.py::TestCompiledAutograd::test_autograd_cpp_node_saved

BROKEN TRUNK - The following jobs failed but were also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Lint / lintrunner-noclang / linux-job (gh) (trunk failure)
>>> Lint for test/test_optim.py:
linux-binary-manywheel / manywheel-py3_8-cuda11_8-test / test (gh) (trunk failure)
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory
linux-binary-manywheel / manywheel-py3_8-cuda12_1-test / test (gh) (trunk failure)
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory
linux-binary-manywheel / manywheel-py3_8-cuda12_4-test / test (gh) (trunk failure)
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (inductor_torchbench_cpu_smoketest_perf, 1, 1, linux.24xl.spr-metal, unstable) (gh) (#126993)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


Chillee · 2024-05-28T06:01:41Z

@pytorchbot merge

pytorchmergebot · 2024-05-28T06:03:41Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-05-28T06:03:53Z

Merge failed

Reason: 11 mandatory check(s) failed. The first few are:

pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 1, 5, linux.4xlarge.nvidia.gpu, unstable)
pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 2, 5, linux.4xlarge.nvidia.gpu, unstable)
pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu, unstable)
pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu, unstable)
pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 5, 5, linux.4xlarge.nvidia.gpu, unstable)

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

Chillee · 2024-05-28T06:22:07Z

@pytorchbot merge

pytorchmergebot · 2024-05-28T06:24:05Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot · 2024-05-28T06:24:26Z

Merge failed

Reason: 11 jobs have failed, first few of them are: pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 1, 5, linux.4xlarge.nvidia.gpu, unstable), pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu, unstable), pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu, unstable), pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu, unstable), pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu, unstable)

Details for Dev Infra team

Raised by workflow job

Chillee · 2024-05-28T07:01:01Z

@pytorchbot merge

pytorchmergebot · 2024-05-28T07:03:04Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot · 2024-05-28T07:03:28Z

Merge failed

Reason: 11 jobs have failed, first few of them are: pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 1, 5, linux.4xlarge.nvidia.gpu, unstable), pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu, unstable), pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu, unstable), pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu, unstable), pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu, unstable)

Details for Dev Infra team

Raised by workflow job

Chillee · 2024-05-28T07:35:31Z

@pytorchbot merge

pytorchmergebot · 2024-05-28T07:37:26Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


…ating code (pytorch#127144)
Pull Request resolved: pytorch#127144 Approved by: https://github.com/jansel

eellison · 2024-05-29T16:38:35Z

This now errors for me:

import torch
import torch._inductor.config
torch._inductor.config.max_autotune = True

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.linear = torch.nn.Linear(262144, 100)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        return self.relu(self.linear(x))

m = ToyModel().to(device="cuda:0")
from torch._inductor.utils import fresh_inductor_cache

with fresh_inductor_cache():
    m = torch.compile(m)
    input_tensor = torch.randn(32,3,64,64).to(device="cuda:0")
    out = m(input_tensor)

revert ?

Chillee · 2024-05-29T18:16:50Z

Is this specifically because of convs?

EDIT: It's actually because of indexing I believe.

Chillee · 2024-05-29T19:55:20Z

Will put up a forward fix.


Try disabling second codegen_body call in inductor templates

af89be2

[ghstack-poisoned]

Chillee mentioned this pull request May 24, 2024

Made some minor improvements to flexattention perf + added more autotune configs #126811

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels May 24, 2024

github-actions bot requested a review from ezyang May 24, 2024 23:56

ezyang removed their request for review May 25, 2024 00:06

This was referenced May 25, 2024

Unify add_fake_dep and add_mutation_dep, as they're literally the same thing #127148

Closed

Change direct uses of MutationOutput to mark_node_as_mutating #127149

Closed

Chillee mentioned this pull request May 25, 2024

Turn the mutation dependency of MutationOutput to weak deps #127151

Closed

Chillee added a commit that referenced this pull request May 25, 2024

Try disabling second codegen_body call in inductor templates

857f552

ghstack-source-id: bb05cd0 Pull Request resolved: #127144

Chillee requested a review from jansel May 26, 2024 00:19

Chillee added a commit that referenced this pull request May 26, 2024

Refactored template codegen to explicitly set current body when

2ae7378

generating code ghstack-source-id: 26e3570 Pull Request resolved: #127144

Chillee requested a review from eellison May 26, 2024 00:19

Chillee changed the title ~~Try disabling second codegen_body call in inductor templates~~ Refactored template codegen to explicitly set current body when generating code May 26, 2024

Chillee requested a review from ipiszy May 26, 2024 00:34

Chillee added the ciflow/trunk Trigger trunk jobs on your pull request label May 26, 2024

Chillee mentioned this pull request May 26, 2024

Added memory budget to partitioner #126320

Closed

jansel approved these changes May 26, 2024


pytorchmergebot added the merging label May 28, 2024

pytorchmergebot removed the merging label May 28, 2024

pytorchmergebot added the merging label May 28, 2024

pytorchmergebot removed the merging label May 28, 2024

pytorchmergebot added the merging label May 28, 2024

pytorchmergebot removed the merging label May 28, 2024

pytorchmergebot added the merging label May 28, 2024

Chillee mentioned this pull request May 28, 2024

[inductor][cpp] epilogue support for gemm template #126019

Closed

pytorchmergebot added the Merged label May 28, 2024

pytorchmergebot closed this in ec8b254 May 28, 2024

pytorchmergebot removed the merging label May 28, 2024

github-actions bot deleted the gh/chillee/298/head branch June 29, 2024 01:53
