
Conversation

@eellison
Contributor

@eellison eellison commented Aug 2, 2024

Stack from ghstack (oldest at bottom):

When we are autotuning matmuls, both the aten.mm and the triton template choices take in an externally allocated output tensor, which can be a view into a pre-planned aten.cat. As long as the output shape and stride of the matmul match the slice of the cat we're planning, we can realize the mm directly into the cat.
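As a rough illustration of the idea at the eager level (a minimal sketch with made-up shapes, not the actual Inductor code path): instead of computing the mm into a freshly allocated buffer and then copying it into the concatenated output, we write it straight into the matching slice of a pre-allocated cat buffer.

```
import torch

# Minimal sketch (illustrative shapes only): realize the mm directly into
# the slice of a pre-planned cat buffer instead of allocating + copying.
a = torch.randn(128, 64)
b = torch.randn(64, 32)
other = torch.randn(128, 32)

# Buffer for torch.cat([a @ b, other], dim=1), planned up front.
cat_buf = torch.empty(128, 64)

# The mm output's shape and stride match the first slice of the cat,
# so we can compute into that view directly.
torch.mm(a, b, out=cat_buf[:, :32])
cat_buf[:, 32:].copy_(other)

assert torch.allclose(cat_buf, torch.cat([a @ b, other], dim=1))
```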

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](https://github.com/pytorch/pytorch/blob/bcac71517c461765b4fa9efccc6f1a5a475c3544/torch/_inductor/kernel/mm.py#L156). While this is correct, it might lead to passing non-performant output strides to cublas. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control:

```
class AllocatedFixedLayout(FixedLayout)
```

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

@pytorch-bot

pytorch-bot bot commented Aug 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132554

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 914f5ab with merge base d1b87e2:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

eellison added a commit that referenced this pull request Aug 2, 2024
eellison added a commit that referenced this pull request Aug 2, 2024
eellison added a commit that referenced this pull request Aug 2, 2024
Collaborator

@Chillee Chillee left a comment

I haven't looked at this carefully yet, but will this work with all triton templates? Thinking about FlexAttention here.

@eellison
Contributor Author

eellison commented Aug 2, 2024

Ugh, it could, but it doesn't right now because I only implemented this for MultiTemplateBuffer, and flex_attention has input_gen_fns, which is not yet implemented for MultiTemplate. It could be done without much difficulty, though.

But I wanted to resolve the current, non-max-autotune handling of mms first; then we can handle it. As above, I think returning FlexibleLayout for aten.mm is misleading/buggy, and we should not rely on that for cat planning.

Options:

  • make the aten.mm return FixedLayout, and check for external kernel alloc in concat planning
  • introduce AllocatedFixedLayout (a rough sketch follows below)
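For the second option, here is a rough sketch of what such a layout might look like (hypothetical: AllocatedFixedLayout does not exist in Inductor today, and the class here is only a marker expressing the intent, not a real API):

```
from torch._inductor.ir import FixedLayout

# Hypothetical sketch: same fixed size/stride semantics as FixedLayout,
# but tagged as "Inductor controls this allocation" (rather than an
# externally provided buffer), so concat planning can still treat the
# node as a valid target to realize into.
class AllocatedFixedLayout(FixedLayout):
    """FixedLayout whose storage Inductor allocates itself, keeping it
    eligible for cat planning."""
```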

@Chillee
Collaborator

Chillee commented Aug 2, 2024

It feels a little bit odd that in the existing code we set the output of aten.mm as FlexibleLayout even though its shape and stride are fixed.

I don't actually understand this? Isn't aten.mm codegened with an out parameter? So its stride isn't actually fixed?

@eellison
Contributor Author

eellison commented Aug 2, 2024

Hmm, maybe; I don't know what would actually happen if you pass cublas a weird output stride.

I think you are correct that it would work, but we also don't want to pass in a transposed output to a cublas kernel and get a bunch of discontiguous writes.

Do we actually want the output strides of mms to be flexible?

@eellison
Contributor Author

eellison commented Aug 2, 2024

Hmm, at least this was about equal:

```
import torch
import triton
from torch._inductor.select_algorithm import extern_kernels

torch.set_default_device('cuda')

inps = [torch.rand([4096, 4096], dtype=torch.float16) for _ in range(2)]
out1 = inps[0].clone()      # contiguous (row-major) output
out2 = inps[0].clone().T    # transposed output: same shape, swapped strides

# Benchmark cublas mm writing into a contiguous vs. a transposed out= buffer.
print(triton.testing.do_bench(lambda: extern_kernels.mm(inps[0], inps[1], out=out1)))
print(triton.testing.do_bench(lambda: extern_kernels.mm(inps[0], inps[1], out=out2)))
```

Similarly for FlexAttention: if we just change the layout to be FlexibleLayout, this would work today, but are you okay with the output strides potentially being non-contiguous?

@Chillee
Collaborator

Chillee commented Aug 3, 2024

are you okay with the output strides potentially being non contiguous

Yeah, I think so. Well, I'd definitely want them to be "contiguous enough".
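One way to read "contiguous enough" (an illustrative interpretation, not what Inductor actually checks) is that the innermost dimension stays stride-1, so writes remain coalesced even when the outer strides are padded, e.g. for a slice of a cat buffer:

```
import torch

def contiguous_enough(t: torch.Tensor) -> bool:
    # Illustrative check only: the innermost dim is stride-1, so writes
    # stay coalesced even if outer strides are padded (e.g. a cat slice).
    return t.stride(-1) == 1

x = torch.empty(128, 64)
print(contiguous_enough(x))          # True: fully contiguous
print(contiguous_enough(x[:, :32]))  # True: padded rows, inner stride still 1
print(contiguous_enough(x.T))        # False: transposed, inner stride is 64
```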

eellison added a commit that referenced this pull request Oct 3, 2024
@eellison eellison added the topic: not user facing label Oct 4, 2024
@eellison
Contributor Author

eellison commented Oct 4, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/eellison/686/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/132554)

pytorchmergebot pushed a commit that referenced this pull request Oct 4, 2024
eellison added a commit that referenced this pull request Oct 7, 2024
@eellison
Contributor Author

eellison commented Oct 7, 2024

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Oct 7, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following check: inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_freezing_avx2_timm, 2, 2, lf.linux.10xlarge.avx2)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@huydhn
Contributor

huydhn commented Oct 8, 2024

@pytorchbot revert -m 'Sorry for reverting your change but I think it is failing on ROCm' -c nosignal

inductor/test_max_autotune.py::TestMaxAutotune::test_conv_cat (GH job link, HUD commit link)

@huydhn huydhn added the ciflow/rocm label Oct 8, 2024
@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 8, 2024
…32554)"

This reverts commit d558ec0.

Reverted #132554 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm ([comment](#132554 (comment)))
@pytorchmergebot
Collaborator

@eellison your PR has been successfully reverted.

eellison added a commit that referenced this pull request Oct 8, 2024
@eellison
Contributor Author

eellison commented Oct 8, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


Labels

ciflow/inductor, ciflow/rocm, ciflow/trunk, Merged, module: inductor, Reverted, topic: not user facing

6 participants