Skip to content

Conversation

@Chillee
Copy link
Collaborator

@Chillee Chillee commented May 24, 2024

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

The main motivation for this refactor is that today, when generating templates, this is what happens.

def_kernel() # registers hook for fully generating function definition
store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`.

Later on, when we codegen the template:

if not only_gen_src_code:

epilogue_node.codegen() # Also writes to body!
template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body`

Today, this is fine, as long as store_output is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying.

  1. In FlexAttention backwards, we might want a modification to be positioned after the store_output (just logically from a code organization POV). This doesn't work today because modification also needs to codegen a subgraph, but writing to body here conflicts with store_output's implicit saved state on self.body.
  2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322)
  3. The current code also makes it quite difficult to support fusion into multiple output nodes.

To resolve this, I do two things:

  1. I remove the default self.body on TritonTemplateKernel. Instead, I have a dict of self.subgraph_bodies, which can be enabled in a context with TritonTemplateKernel.set_subgraph_body. This allows multiple different template functions to write to their own isolated bodies.
  2. I add functions that allow you to finalize specific hooks on PartialRender.

@pytorch-bot
Copy link

pytorch-bot bot commented May 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127144

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (6 Unrelated Failures)

As of commit 5cd8590 with merge base e4b2452 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot requested a review from ezyang May 24, 2024 23:56
@ezyang ezyang removed their request for review May 25, 2024 00:06
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
Chillee added a commit that referenced this pull request May 25, 2024
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
@Chillee Chillee requested a review from jansel May 26, 2024 00:19
Chillee added a commit that referenced this pull request May 26, 2024
generating code

ghstack-source-id: 26e3570
Pull Request resolved: #127144
@Chillee Chillee requested a review from eellison May 26, 2024 00:19
@Chillee Chillee changed the title Try disabling second codegen_body call in inductor templates Refactored template codegen to explicitly set current body when generating code May 26, 2024
@Chillee Chillee requested a review from ipiszy May 26, 2024 00:34
@Chillee Chillee added the ciflow/trunk Trigger trunk jobs on your pull request label May 26, 2024
… when generating code"




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

The main motivation for this refactor is that today, when generating templates, this is what happens.

```
def_kernel() # registers hook for fully generating function definition
store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`.
```

Later on, when we codegen the template: https://github.com/pytorch/pytorch/blob/f8c4c268da67e9684f3287b7468f36a5a27c6a0b/torch/_inductor/codegen/simd.py#L1402

```
epilogue_node.codegen() # Also writes to body!
template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body`
```

Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying.

1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`.
2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322)
3. The current code also makes it quite difficult to support fusion into multiple output nodes.

To resolve this, I do two things:
1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies.
2. I add functions that allow you to finalize specific hooks on `PartialRender`.

[ghstack-poisoned]
@Chillee
Copy link
Collaborator Author

Chillee commented May 28, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@Chillee
Copy link
Collaborator Author

Chillee commented May 28, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@Chillee
Copy link
Collaborator Author

Chillee commented May 28, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@Chillee
Copy link
Collaborator Author

Chillee commented May 28, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request May 28, 2024
…ating code (pytorch#127144)

The main motivation for this refactor is that today, when generating templates, this is what happens.

```
def_kernel() # registers hook for fully generating function definition
store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`.
```

Later on, when we codegen the template: https://github.com/pytorch/pytorch/blob/f8c4c268da67e9684f3287b7468f36a5a27c6a0b/torch/_inductor/codegen/simd.py#L1402

```
epilogue_node.codegen() # Also writes to body!
template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body`
```

Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying.

1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`.
2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322)
3. The current code also makes it quite difficult to support fusion into multiple output nodes.

To resolve this, I do two things:
1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies.
2. I add functions that allow you to finalize specific hooks on `PartialRender`.
Pull Request resolved: pytorch#127144
Approved by: https://github.com/jansel
@eellison
Copy link
Contributor

This now errors for me:

import torch
import torch._inductor.config
torch._inductor.config.max_autotune = True

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.linear = torch.nn.Linear(262144, 100)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        return self.relu(self.linear(x))

m = ToyModel().to(device="cuda:0")
from torch._inductor.utils import fresh_inductor_cache

with fresh_inductor_cache():
    m = torch.compile(m)
    input_tensor = torch.randn(32,3,64,64).to(device="cuda:0")
    out = m(input_tensor)

revert ?

@Chillee
Copy link
Collaborator Author

Chillee commented May 29, 2024

Is this specifically because of convs?

EDIT: It's actually because of indexing I believe.

@Chillee
Copy link
Collaborator Author

Chillee commented May 29, 2024

Will put up a forward fix.

Aidyn-A pushed a commit to tinglvv/pytorch that referenced this pull request May 30, 2024
…ating code (pytorch#127144)

The main motivation for this refactor is that today, when generating templates, this is what happens.

```
def_kernel() # registers hook for fully generating function definition
store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`.
```

Later on, when we codegen the template: https://github.com/pytorch/pytorch/blob/f8c4c268da67e9684f3287b7468f36a5a27c6a0b/torch/_inductor/codegen/simd.py#L1402

```
epilogue_node.codegen() # Also writes to body!
template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body`
```

Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying.

1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`.
2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322)
3. The current code also makes it quite difficult to support fusion into multiple output nodes.

To resolve this, I do two things:
1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies.
2. I add functions that allow you to finalize specific hooks on `PartialRender`.
Pull Request resolved: pytorch#127144
Approved by: https://github.com/jansel
@github-actions github-actions bot deleted the gh/chillee/298/head branch June 29, 2024 01:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants