-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Refactored template codegen to explicitly set current body when generating code #127144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127144
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (6 Unrelated Failures)As of commit 5cd8590 with merge base e4b2452 ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but was present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
… when generating code" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang The main motivation for this refactor is that today, when generating templates, this is what happens. ``` def_kernel() # registers hook for fully generating function definition store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`. ``` Later on, when we codegen the template: https://github.com/pytorch/pytorch/blob/f8c4c268da67e9684f3287b7468f36a5a27c6a0b/torch/_inductor/codegen/simd.py#L1402 ``` epilogue_node.codegen() # Also writes to body! template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body` ``` Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying. 1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`. 2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322) 3. The current code also makes it quite difficult to support fusion into multiple output nodes. To resolve this, I do two things: 1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies. 2. I add functions that allow you to finalize specific hooks on `PartialRender`. [ghstack-poisoned]
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…ating code (pytorch#127144) The main motivation for this refactor is that today, when generating templates, this is what happens. ``` def_kernel() # registers hook for fully generating function definition store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`. ``` Later on, when we codegen the template: https://github.com/pytorch/pytorch/blob/f8c4c268da67e9684f3287b7468f36a5a27c6a0b/torch/_inductor/codegen/simd.py#L1402 ``` epilogue_node.codegen() # Also writes to body! template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body` ``` Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying. 1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`. 2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322) 3. The current code also makes it quite difficult to support fusion into multiple output nodes. To resolve this, I do two things: 1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies. 2. I add functions that allow you to finalize specific hooks on `PartialRender`. Pull Request resolved: pytorch#127144 Approved by: https://github.com/jansel
|
This now errors for me: revert ? |
|
Is this specifically because of convs? EDIT: It's actually because of indexing I believe. |
|
Will put up a forward fix. |
…ating code (pytorch#127144) The main motivation for this refactor is that today, when generating templates, this is what happens. ``` def_kernel() # registers hook for fully generating function definition store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`. ``` Later on, when we codegen the template: https://github.com/pytorch/pytorch/blob/f8c4c268da67e9684f3287b7468f36a5a27c6a0b/torch/_inductor/codegen/simd.py#L1402 ``` epilogue_node.codegen() # Also writes to body! template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body` ``` Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying. 1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`. 2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322) 3. The current code also makes it quite difficult to support fusion into multiple output nodes. To resolve this, I do two things: 1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies. 2. I add functions that allow you to finalize specific hooks on `PartialRender`. Pull Request resolved: pytorch#127144 Approved by: https://github.com/jansel
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
The main motivation for this refactor is that today, when generating templates, this is what happens.
Later on, when we codegen the template:
pytorch/torch/_inductor/codegen/simd.py
Line 1402 in f8c4c26
Today, this is fine, as long as
store_outputis the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying.modificationto be positioned after thestore_output(just logically from a code organization POV). This doesn't work today becausemodificationalso needs to codegen a subgraph, but writing tobodyhere conflicts withstore_output's implicit saved state onself.body.To resolve this, I do two things:
self.bodyonTritonTemplateKernel. Instead, I have a dict ofself.subgraph_bodies, which can be enabled in a context withTritonTemplateKernel.set_subgraph_body. This allows multiple different template functions to write to their own isolated bodies.PartialRender.