Add persistent+TMA version of Triton mm and addmm #142101
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142101
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 3bc5de0 with merge base e0bdae7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
# For prologue fusion we check if the underlying template of the choice
# supports all allowed prologue inputs. If not, we skip this choice in
# the fusion benchmark.
# TODO: Remove this check after all Triton templates support prologue fusion.
# Currently, persistent+TMA Triton template does not due to the TMA-based loads.
if (
    not epilogue_fusion
    and hasattr(choice, "allowed_prologue_inps")
    and choice.allowed_prologue_inps != multi_node.allowed_prologue_inps
):
    continue
```
cc @eellison: this is to selectively skip choices not supporting prologue fusion (like currently the choices from the persistent+TMA template).
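For readers skimming the thread, a self-contained illustration of how this filter behaves; the `SimpleNamespace` stand-ins below are hypothetical placeholders, not the real inductor choice / multi-node classes.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for a fused multi-node and two autotuning choices.
multi_node = SimpleNamespace(allowed_prologue_inps=frozenset({"arg_A", "arg_B"}))
choices = [
    SimpleNamespace(name="triton_mm", allowed_prologue_inps=frozenset({"arg_A", "arg_B"})),
    SimpleNamespace(name="triton_mm_persistent_tma", allowed_prologue_inps=frozenset()),
]

epilogue_fusion = False  # i.e., we are benchmarking prologue fusion
kept = []
for choice in choices:
    if (
        not epilogue_fusion
        and hasattr(choice, "allowed_prologue_inps")
        and choice.allowed_prologue_inps != multi_node.allowed_prologue_inps
    ):
        continue  # the persistent+TMA choice is skipped here
    kept.append(choice.name)

print(kept)  # ['triton_mm']
```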
```python
scaled_persistent_mm_kernel_configs = [
    {"config": (128, 128, 64, 3, 8), "cond": True},
    {"config": (128, 128, 128, 3, 8), "cond": True},
    {"config": (128, 128, 128, 4, 8), "cond": True},
    {"config": (128, 128, 128, 4, 4), "cond": True},
    {"config": (128, 128, 128, 3, 4), "cond": True},
    {"config": (128, 128, 128, 5, 4), "cond": True},
    {"config": (128, 128, 128, 5, 8), "cond": True},
    {"config": (128, 128, 128, 6, 8), "cond": True},
    {"config": (128, 128, 64, 4, 8), "cond": True},
]
```
cc @drisspg: separated these configs used in the _scaled_mm persistent+TMA template-based lowering, as ~half of them OOM on SMEM for 2-byte dtypes. Kept the ones that don't in persistent_mm_kernel_configs.
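For context, here is my own back-of-envelope arithmetic (not the model Inductor or Triton actually uses) for why the higher-staged 128x128x128 configs exceed shared memory with 2-byte dtypes:

```python
# Rough SMEM estimate for a (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps)
# config: the pipeliner keeps num_stages copies of the A and B tiles resident.
# This ignores epilogue buffers and padding, so treat it as a ballpark only.
def approx_smem_kib(block_m, block_n, block_k, num_stages, itemsize):
    a_tile = block_m * block_k * itemsize
    b_tile = block_k * block_n * itemsize
    return (a_tile + b_tile) * num_stages // 1024

for m, n, k, stages in [(128, 128, 64, 3), (128, 128, 128, 3), (128, 128, 128, 5), (128, 128, 128, 6)]:
    # fp8 (itemsize=1) stays within the ~228 KiB of SMEM per SM on H100,
    # while bf16/fp16 (itemsize=2) at 5-6 stages needs ~320-384 KiB and would not.
    print((m, n, k, stages), approx_smem_kib(m, n, k, stages, itemsize=1), approx_smem_kib(m, n, k, stages, itemsize=2))
```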
```python
    w_inverse_scale,
    bias,
)
with config.patch({"triton.enable_persistent_tma_matmul": True}):
```
wow, bad mistake on my end, thanks for fixing!
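For anyone who wants to try the new path outside the test suite, a minimal usage sketch (my own example; it assumes a Hopper-class GPU, a recent enough Triton, and that max-autotune is used so the Triton GEMM templates are actually benchmarked):

```python
import torch
import torch._inductor.config as inductor_config

a = torch.randn(1024, 512, device="cuda", dtype=torch.bfloat16)
b = torch.randn(512, 2048, device="cuda", dtype=torch.bfloat16)

# Enable the new flag and compile with autotuning so the persistent+TMA
# template participates in the choice benchmark alongside the other mm choices.
with inductor_config.patch({"triton.enable_persistent_tma_matmul": True}):
    compiled_mm = torch.compile(torch.mm, mode="max-autotune")
    out = compiled_mm(a, b)
```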
torch/_inductor/kernel/mm.py (Outdated)
```python
# based on triton.ops.matmul
start_pid = tl.program_id(0)
grid_m = (M + BLOCK_M - 1) // BLOCK_M
```
nit: why not cdiv for these too?
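For reference, the two spellings compute the same thing; a quick plain-Python check of the ceiling-division identity (inside the kernel it would be `tl.cdiv(M, BLOCK_M)`):

```python
# (M + BLOCK_M - 1) // BLOCK_M is ceiling division for positive block sizes,
# which is exactly what tl.cdiv provides.
def cdiv(x: int, y: int) -> int:
    return (x + y - 1) // y

assert cdiv(1024, 128) == 8   # exact multiple
assert cdiv(1000, 128) == 8   # 7.8125 rounds up to 8
assert cdiv(1024, 127) == 9
```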
torch/_inductor/kernel/mm_scaled.py (Outdated)
```diff
-workspace_arg=get_workspace_arg(
-    kwargs["NUM_SMS"], mat_a.get_device()
+workspace_arg=get_tma_workspace_arg(
+    num_tma_descriptors=3,
```
we actually only need 2, if you care to update as well; my top of stack had the C stores, but that was buggy anyways
Thanks for catching this! Helped me find a nasty bug in the new template code, too.
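For readers following along, a rough illustration of how such a TMA descriptor workspace is sized; the helper below and the per-SM layout are my own sketch, not the actual get_tma_workspace_arg implementation, though the 128-byte figure matches the size of a CUDA CUtensorMap.

```python
# Each persistent program gets its own on-device copy of each TMA descriptor,
# so the workspace scales with num_descriptors * num_programs. With TMA used
# only for the A and B loads (no TMA stores of C), two descriptors suffice,
# which is what the review comment above points out.
TMA_DESCRIPTOR_SIZE = 128  # bytes per CUtensorMap

def tma_workspace_bytes(num_tma_descriptors: int, num_programs: int) -> int:
    return num_tma_descriptors * num_programs * TMA_DESCRIPTOR_SIZE

num_sms = 132  # e.g., H100 SXM
print(tma_workspace_bytes(3, num_sms))  # 50688 bytes
print(tma_workspace_bytes(2, num_sms))  # 33792 bytes with A/B loads only
```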
Looks great! Will let Elias comment on the prologue stuff.
eellison left a comment:
looks great!
```python
# inductor generates a suffix
{{store_output(("idx_m", "idx_n"), "acc", "mask", indent_width=12)}}
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE)
```
maybe in a follow-up we can dedup with the scaled version; cc @drisspg
Yeah should/want to do this
```python
    [BLOCK_M, BLOCK_K] if A_ROW_MAJOR else [BLOCK_K, BLOCK_M],
    A.dtype.element_ty,
)
b = tl._experimental_descriptor_load(
```
I don't see any mask here. I guess it doesn't support K not divisible by the K block?
TMA does support K not divisible by the K block (given that K * dtype.itemsize % 16 == 0). Masking happens in the HW doing the TMA, with the OOB values set to zero.
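A tiny arithmetic illustration of that constraint (my own sketch of the alignment rule stated above, not code from the PR):

```python
# TMA needs the innermost (contiguous) dimension to be 16-byte aligned; it does
# not require K to be a multiple of BLOCK_K, since out-of-bounds tile elements
# are zero-filled by the hardware.
def tma_inner_dim_ok(k: int, itemsize: int) -> bool:
    return (k * itemsize) % 16 == 0

print(tma_inner_dim_ok(k=192, itemsize=2))  # True: 384 bytes (bf16/fp16), even if K % BLOCK_K != 0
print(tma_inner_dim_ok(k=100, itemsize=2))  # False: 200 bytes is not 16-byte aligned
print(tma_inner_dim_ok(k=100, itemsize=4))  # True: 400 bytes (fp32)
```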
```python
    c
    for c in choices
    if re.search(
        config.test_configs.autotune_choice_name_regex,
```
nit: is it worth functools.lru_cache-ing re.compile() and using that?
I imagine these regexes being used exclusively for testing (as in: not in prod / with real models) and being relatively simple (probably a substring, or a few separated by |). So I'm not sure how much it's worth precompiling them. Happy to add it if you feel it's worth it, though.
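For completeness, the cached variant the nit suggests would look roughly like the sketch below; note that re already memoizes compiled patterns internally, which is part of why the gain is marginal for simple, test-only regexes.

```python
import functools
import re

@functools.lru_cache(maxsize=None)
def _compiled(pattern: str) -> re.Pattern:
    # Compile once per distinct pattern string and reuse across filtering calls.
    return re.compile(pattern)

choices = ["mm_persistent_tma", "mm", "extern_mm"]
pattern = "persistent_tma|extern"
filtered = [c for c in choices if _compiled(pattern).search(c)]
print(filtered)  # ['mm_persistent_tma', 'extern_mm']
```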
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command ... Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
```python
)
if ki == k_tiles - 1:
    # rematerialize rm and rn to save registers
```
how do you determine if not doing this results in spill/fills?
This PR adds persistent+TMA versions (Triton template + the corresponding infra) for the tuned_mm and tuned_addmm lowerings. The persistent+TMA choices are added to the GEMM autotuning if (as checked by the use_triton_tma_template helper; a simplified illustrative sketch of this gating is included at the end of this description):

1. The minimum hardware and Triton version requirements for TMA support are met.
2. The GEMM inputs are compatible with the Triton TMA API (i.e., 16-byte aligned and contiguous).
3. config.triton.enable_persistent_tma_matmul is set to True.

Additional notes:

1. As added in this PR, the TMA uses are not compatible with prologue / epilogue fusion. To this end, the new Triton template currently supports TMA-based loads of A/B but no prologue fusion, and epilogue fusion but no TMA-based stores of C. TMA + fusion compatibility can be added as a follow-up.
2. The current Triton TMA API (experimental_device_tensormap_create2d) does not support strides. Due to this, we limit the applicability of the new Triton template to the cases where the inputs are contiguous.
3. The transposed layouts of A and/or B are supported by passing constexpr flags to the kernel and adjusting the ordering of the block sizes accordingly in the kernel code (this should have no effect on kernel perf, as it is decided at Triton compilation time).
4. After the next Triton pin update, we can switch to the tensor descriptor API (landed recently in [Pipeliner] Multi-buffer TMA descriptors, triton-lang/triton#5290) in the new Triton template, which should allow lifting 2 and 3 above.
5. The configs for the new Triton template in persistent_mm_kernel_configs are preliminary. We should do more perf exploration and possibly augment the configs in a follow-up.
6. This PR is rebased onto and unifies with two related PRs landed previously: "Adding lowering to persistent-tma device kernel for _scaled_mm" #142045 (some infra unification with the persistent+TMA template for _scaled_mm) and "Prologue Fusion" #134532 (adds the possibility to disable prologue fusion for selected choices).
7. The current Triton TMA API only supports 1D and 2D descriptors (even after triton-lang/triton#5290, see here). For now, this blocks adding a persistent+TMA template for torch.bmm.

Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
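As referenced in the description above, here is a simplified sketch of the gating the three conditions describe. The helper names below (has_tma_hardware_support, tma_compatible, can_use_persistent_tma) are illustrative, not the actual use_triton_tma_template implementation, and the sketch ignores details such as the transposed-layout handling from the notes.

```python
import torch

def has_tma_hardware_support() -> bool:
    # TMA requires a Hopper-class GPU (sm90+) and a recent enough Triton.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 9

def tma_compatible(t: torch.Tensor) -> bool:
    # The current Triton TMA API takes no strides, so the input must be
    # contiguous, with a 16-byte aligned base pointer and innermost dimension.
    return (
        t.dim() == 2
        and t.is_contiguous()
        and t.data_ptr() % 16 == 0
        and (t.size(-1) * t.element_size()) % 16 == 0
    )

def can_use_persistent_tma(a: torch.Tensor, b: torch.Tensor, enabled_in_config: bool) -> bool:
    # enabled_in_config mirrors config.triton.enable_persistent_tma_matmul.
    return enabled_in_config and has_tma_hardware_support() and tma_compatible(a) and tma_compatible(b)
```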