Add host-side Triton TMA support to Dynamo #137677

aakhundov · 2024-10-10T01:49:18Z

Stack from ghstack (oldest at bottom):

This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in triton-lang/triton#4498 ("[nvidia] Support passing TMA descriptors by-value"). As of the time of writing, this is not part of the PT2 OSS Triton pin (although it has been back-ported internally). The OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), and `TMADescriptorVariable` (for the TMA descriptor object). This maintains the path back from the Triton kernel to the Tensor from which the TMA descriptor was created.
- The newly introduced variables have `reconstruct` methods, which are used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for what the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel with the underlying Tensors, so that the input/output relationships can be tracked in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors, so that downstream AOTAutograd can perform functionalization as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.
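
The chain of calls and the argument swap described in the notes above can be pictured with a small, self-contained Python model. This is only a sketch and not Dynamo's implementation: `FakeTensor`, `DataPtr`, `TMADescriptor`, `make_2d_descriptor`, `swap_descriptors_for_tensors`, and `mutated_tensors` are hypothetical names, and the `(dims, block_dims, element_size)` metadata layout is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in for a torch.Tensor; only its name matters here."""
    name: str


@dataclass(frozen=True)
class DataPtr:
    """Models the result of t.data_ptr() while remembering the source tensor."""
    source: FakeTensor


@dataclass(frozen=True)
class TMADescriptor:
    """Models a descriptor object built from a tracked data pointer."""
    ptr: DataPtr
    dims: Tuple[int, ...]
    block_dims: Tuple[int, ...]
    element_size: int


def make_2d_descriptor(ptr, dim1, dim0, block_dim1, block_dim0, element_size):
    """Hypothetical analogue of a 2D descriptor-creation call."""
    return TMADescriptor(ptr, (dim1, dim0), (block_dim1, block_dim0), element_size)


def swap_descriptors_for_tensors(kernel_kwargs: Dict[str, Any]):
    """Replace descriptor arguments with their source tensors.

    Returns the rewritten kwargs plus a side table of descriptor metadata
    (assumed layout: dims, block dims, element size) keyed by argument name.
    """
    new_kwargs: Dict[str, Any] = {}
    metadata: Dict[str, Tuple[List[int], List[int], int]] = {}
    for name, value in kernel_kwargs.items():
        if isinstance(value, TMADescriptor):
            new_kwargs[name] = value.ptr.source
            metadata[name] = (list(value.dims), list(value.block_dims), value.element_size)
        else:
            new_kwargs[name] = value
    return new_kwargs, metadata


def mutated_tensors(new_kwargs: Dict[str, Any], stored_params: List[str]):
    """Map parameters the kernel stores through (e.g. found via a descriptor-store op)
    back to the tensors that would be mutated."""
    return [new_kwargs[p] for p in stored_params if p in new_kwargs]


a, out = FakeTensor("a"), FakeTensor("out")
kwargs = {
    "in_desc": make_2d_descriptor(DataPtr(a), 128, 64, 32, 32, 4),
    "out_desc": make_2d_descriptor(DataPtr(out), 128, 64, 32, 32, 4),
    "BLOCK": 32,
}
graph_kwargs, tma_meta = swap_descriptors_for_tensors(kwargs)
```

In this model the kernel call recorded in the graph sees `a` and `out` directly, the descriptor parameters travel separately in `tma_meta`, and a store through `out_desc` is reported as a mutation of `out`.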

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec

Differential Revision: D64404928

[ghstack-poisoned]

pytorch-bot · 2024-10-10T01:49:21Z


✅ No Failures

As of commit db1362a with merge base 4a8e493:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

This adds Dynamo tracing support to the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). Design notes TBA. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]

ghstack-source-id: f1e578b Pull Request resolved: #137677

ghstack-source-id: 6814cce Pull Request resolved: #137677

torch/_dynamo/variables/tensor.py

test/inductor/test_triton_kernels.py

torch/_dynamo/variables/functions.py

zou3519

Looks good overall, let's add some more testing around graph breaks.

mobicham · 2024-10-11T18:20:25Z

Is this gonna support pre-allocated descriptors on the host side? Creating descriptors on the fly for each kernel launch is very slow, as you probably know.
Thanks!

aakhundov · 2024-10-11T18:45:32Z

> Is this gonna support pre-allocated descriptors on the host side? Creating descriptors on the fly for each kernel launch is very slow, as you probably know. Thanks!

@mobicham thanks for chiming in! Yes, we want to support host-side TMA descriptor preallocation (and filling) with this PR and an upcoming follow-up that adds Inductor support, along the lines of `matmul_tma_persistent` in Triton tutorial 09. Is this what you're referring to?

mobicham · 2024-10-11T18:47:09Z

@aakhundov perfect, that's exactly what I am looking for, thank you!
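
For readers following the exchange above, here is a minimal sketch of one way a caller might avoid re-creating descriptors on every launch: build each descriptor once and reuse it. It is an illustration of the idea only, not what this PR implements. `DescriptorCache` and `fake_factory` are hypothetical; in real code the factory would be whatever descriptor-creation function the installed Triton provides.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


class DescriptorCache:
    """Creates a descriptor once per (pointer, dims, block dims, element size) key."""

    def __init__(self, factory: Callable[..., Any]):
        self._factory = factory
        self._cache: Dict[Tuple, Any] = {}
        self.creations = 0  # counts how often the (slow) factory actually ran

    def get(self, data_ptr: int, dims: Sequence[int],
            block_dims: Sequence[int], element_size: int) -> Any:
        key = (data_ptr, tuple(dims), tuple(block_dims), element_size)
        if key not in self._cache:
            # Only pay the creation cost the first time this key is seen.
            self._cache[key] = self._factory(data_ptr, *dims, *block_dims, element_size)
            self.creations += 1
        return self._cache[key]


def fake_factory(data_ptr, *args):
    """Hypothetical stand-in for a real descriptor-creation call."""
    return ("descriptor", data_ptr, args)


cache = DescriptorCache(fake_factory)
for _ in range(3):  # three "kernel launches" against the same buffer
    desc = cache.get(0x1000, [128, 64], [32, 32], 4)
```

Because the pointer is part of the key, a tensor whose storage moves gets a fresh descriptor rather than a stale one.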

This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]

ghstack-source-id: ed14d05 Pull Request resolved: #137677


test/dynamo/test_misc.py

torch/_dynamo/testing.py

torch/_dynamo/variables/functions.py

test/inductor/test_triton_kernels.py

[ghstack-poisoned]

aakhundov · 2024-10-15T16:15:32Z

@aakhundov has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

aakhundov · 2024-10-15T23:11:37Z

@pytorchbot merge

pytorchmergebot · 2024-10-15T23:13:15Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

X-link: pytorch/pytorch#137677
Approved by: https://github.com/zou3519
Reviewed By: clee2000
Differential Revision: D64404928
Pulled By: aakhundov
fbshipit-source-id: c812cea3867c55800d5fe213bf07bf21292345e3

This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in triton-lang/triton#4498. As of the time of writing, this is not part of the PT2 OSS Triton pin (although it has been back-ported internally). The OSS Triton pin update should be done in December 2024.
- Thanks to the Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kernel_wrap_` lowering and passed to `ir.UserDefinedTritonKernel` as an additional argument.
- Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) with the new `ir.TMADescriptor` nodes, which implement TMA descriptors in Inductor IR.
- `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwningLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel).
- Via `ir.TMADescriptor.codegen`, Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code).
- A new `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA.
- AOT Inductor support will be implemented in a follow-up PR.

Pull Request resolved: pytorch#137950
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#137677
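
To make the wrapper-codegen note above more concrete, here is a toy function that formats a host-side descriptor-creation line from per-argument metadata. It is a sketch under assumptions and does not reproduce Inductor's codegen: `emit_tma_descriptor_call` is a hypothetical helper, and the exact shape of the emitted line (variable names, argument order) is assumed for illustration.

```python
from typing import Sequence


def emit_tma_descriptor_call(var_name: str, tensor_name: str,
                             dims: Sequence[int], block_dims: Sequence[int],
                             element_size: int) -> str:
    """Return one line of host code that would create a 1D or 2D descriptor."""
    rank = len(dims)
    if rank not in (1, 2) or len(block_dims) != rank:
        raise ValueError("expected matching 1D or 2D dims and block dims")
    args = [
        f"{tensor_name}.data_ptr()",  # pointer taken from the underlying buffer
        *map(str, dims),
        *map(str, block_dims),
        str(element_size),
    ]
    return f"{var_name} = create_{rank}d_tma_descriptor({', '.join(args)})"


line = emit_tma_descriptor_call("desc0", "buf0", [128, 64], [32, 32], 4)
```

The point of the sketch is only that the descriptor is created in generated host code from the underlying buffer plus the recorded metadata, rather than being passed into the compiled graph as an opaque object.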

Add host-side Triton TMA support to Dynamo

4165fa9

[ghstack-poisoned]

aakhundov requested a review from zou3519 as a code owner October 10, 2024 01:49

pytorch-bot bot added ciflow/inductor module: dynamo module: inductor labels Oct 10, 2024

aakhundov marked this pull request as draft October 10, 2024 01:49

aakhundov added the topic: not user facing topic category label Oct 10, 2024

aakhundov added a commit that referenced this pull request Oct 10, 2024

Add host-side Triton TMA support to Dynamo

803cc57

ghstack-source-id: f1e578b Pull Request resolved: #137677

aakhundov added a commit that referenced this pull request Oct 10, 2024

Add host-side Triton TMA support to Dynamo

cd355df

ghstack-source-id: 6814cce Pull Request resolved: #137677

zou3519 reviewed Oct 11, 2024

torch/_dynamo/variables/tensor.py (outdated)

zou3519 reviewed Oct 11, 2024

test/inductor/test_triton_kernels.py

zou3519 reviewed Oct 11, 2024

test/inductor/test_triton_kernels.py (outdated)

zou3519 reviewed Oct 11, 2024

test/inductor/test_triton_kernels.py (outdated)

zou3519 reviewed Oct 11, 2024

torch/_dynamo/variables/functions.py

zou3519 reviewed Oct 11, 2024

aakhundov added a commit that referenced this pull request Oct 11, 2024

Add host-side Triton TMA support to Dynamo

8c7cf45

ghstack-source-id: ed14d05 Pull Request resolved: #137677