@shunting314 shunting314 commented Sep 8, 2025

Stack from ghstack (oldest at bottom):

The previous LOAF-after-fusion algorithm is not guaranteed to create more fusion opportunities even when loop reordering happens. I cannot find an example where LOAF reduces the amount of fusion, but here is an example where reordering loops does not add more fusions:

```python
def test_3dred_pw_2d_outer_red(self):
    """
    Test the following pattern. We have a 3D contiguous tensor [m, n, k] as input.
    1. Do a reduction on the k dimension to get an [m, n] tensor.
    2. Do a pointwise operation on this [m, n] tensor (and realize the computation).
    3. Do an outer reduction on the output of step 2 along the m dimension.
    Each of these steps generates a kernel before fusion.
    Without any loop reordering, kernel 1 and kernel 2 get fused, and kernel 3 stays separate.
    But if we reorder the loop for kernel 2, then kernel 2 gets fused with kernel 3,
    and the fused kernel-2-3 cannot be fused with kernel 1.
    The older version of the LOAF algorithm reorders in this case, but there is no real
    benefit. There are even some slight downsides:
    1. The original fusion without loop reordering is more natural.
    2. Fusing kernel 1 with kernel 2 may help precision when the output of kernel 1 is
       in low precision: thanks to fusion, the pointwise operation operates at fp32
       precision.
    """
    M, N, K = 64, 64, 64

    def f(x):
        x = x.sum(dim=-1)
        x = x + 1  # can be more complex, like sigmoid or other ops
        return x, x.sum(dim=0)

    x = torch.randn(M, N, K, device=GPU_TYPE)
    self.do_acc_test(f, x)
    self.assertEqual(0, metrics.num_loop_reordering)
```
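To see why the reorder is a wash here, the three kernels in the test can be sketched in plain Python with toy sizes. This is only an illustration of the loop orders involved; Inductor of course generates fused GPU kernels rather than Python loops:

```python
M, N, K = 2, 3, 4
x = [[[float(m * N * K + n * K + k) for k in range(K)]
      for n in range(N)] for m in range(M)]

# Kernel 1: reduce over k; loops ordered (m, n), matching the contiguous input.
red = [[sum(x[m][n]) for n in range(N)] for m in range(M)]

# Kernel 2, natural (m, n) order: same traversal as kernel 1, so 1 and 2 fuse.
pw = [[red[m][n] + 1 for n in range(N)] for m in range(M)]

# Kernel 3: outer reduction over m, naturally traversed in (n, m) order.
out = [sum(pw[m][n] for m in range(M)) for n in range(N)]
print(out)  # [62.0, 94.0, 126.0]

# Reordering kernel 2's loops to (n, m) would match kernel 3's traversal
# (allowing a 2-3 fusion) but would break the match with kernel 1; either
# way two kernels remain, which is why the reorder buys nothing here.
```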

Move LOAF to a separate final round of fusion so that we are guaranteed not to reduce the amount of fusion. Hopefully this also helps compilation time, since LOAF kicks in when there are fewer nodes.
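The two-phase structure can be sketched as a toy model: nodes carry a loop order plus a flag for whether LOAF may reorder them (reductions cannot reorder their reduction loops). This illustrates the scheduling idea only; it is not Inductor's actual `SchedulerNode`/fusion code:

```python
# Toy model: a "node" is a dict with a name, a loop order, and a reorder flag.
def regular_round(nodes):
    """One pass of regular fusion: fuse adjacent nodes with equal loop order."""
    out, changed = [], False
    for node in nodes:
        if out and out[-1]["order"] == node["order"]:
            out[-1] = {"name": out[-1]["name"] + "+" + node["name"],
                       "order": node["order"],
                       "can_reorder": out[-1]["can_reorder"] and node["can_reorder"]}
            changed = True
        else:
            out.append(node)
    return out, changed

def loaf_round(nodes):
    """Final LOAF round: also fuse when one side may reorder to match."""
    out = []
    for node in nodes:
        if out:
            prev = out[-1]
            same = prev["order"] == node["order"]
            reorderable = (sorted(prev["order"]) == sorted(node["order"])
                           and (prev["can_reorder"] or node["can_reorder"]))
            if same or reorderable:
                out[-1] = {"name": prev["name"] + "+" + node["name"],
                           "order": prev["order"],
                           "can_reorder": prev["can_reorder"] and node["can_reorder"]}
                continue
        out.append(node)
    return out

def fuse(nodes):
    changed = True
    while changed:                   # phase 1: regular fusion to a fixed point
        nodes, changed = regular_round(nodes)
    return loaf_round(nodes)         # phase 2: LOAF only runs on what is left

# Kernels 1-3 from the test above: k1/k2 share order (m, n); k3 is (n, m).
# Only the pointwise k2 may reorder; the reductions k1 and k3 may not.
nodes = [{"name": "k1", "order": ("m", "n"), "can_reorder": False},
         {"name": "k2", "order": ("m", "n"), "can_reorder": True},
         {"name": "k3", "order": ("n", "m"), "can_reorder": False}]
print([n["name"] for n in fuse(nodes)])  # ['k1+k2', 'k3']
```

Because the LOAF round only sees nodes that survived regular fusion, it can add fusions but never prevent one that regular fusion would have made.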

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot

pytorch-bot bot commented Sep 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162355

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit bfb4223 with merge base a6f9e0e (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.


@eellison eellison left a comment

What do you think about including reordering in the fusion score key, and not actually making the change to the loop body in the scoring round?

This would allow other fusions to occur following reordering. We have other, similar loop body modifications we want to do (@BoyuanFeng's masking, and I have another use case here: #161404) and this could allow those use cases to compose.

@shunting314

@eellison

> What do you think about including reordering in the fusion score key, and not actually making the change to the loop body in the scoring round.

I get what you mean. But the major blocker for turning LOAF on by default right now is the increase in compilation time, and I'm worried this would increase it further. There are a couple of ways to decide the fusion score key without changing the loop order:

  1. The most straightforward way is to clone the SchedulerNode, reorder the loops, decide the fusion score, and drop the clone. This would definitely increase compilation time further.
  2. A more sophisticated way is to pretend to do loop reordering without cloning the SchedulerNode. This would require updating Scheduler.score_fusion_memory to compute the amount of shared memory access for a specific loop order. But then we would need to add caching to avoid redoing this kind of analysis when we later perform the fusion, and the cache may be tricky to implement, since a SchedulerNode may already have been reordered to fuse another pair of nodes.

So I don't see a good way to do what you suggested without potentially increasing compilation time further.
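For concreteness, option 1 above (clone, reorder, score, drop) can be sketched with toy stand-ins. The `Node` class and `shared_memory_score` below are assumptions for illustration, not Inductor's real `SchedulerNode` or `Scheduler.score_fusion_memory` API:

```python
from copy import deepcopy

class Node:
    def __init__(self, order, reads):
        self.order = order   # current loop order, e.g. ("m", "n")
        self.reads = reads   # buffers touched, keyed by loop order

    def reorder(self, order):
        self.order = order

def shared_memory_score(a, b):
    # Toy score: number of buffers both nodes touch under their current orders.
    return len(a.reads.get(a.order, set()) & b.reads.get(b.order, set()))

def score_with_reorder(a, b, candidate_order):
    trial = deepcopy(a)              # the clone is the extra compile-time cost
    trial.reorder(candidate_order)   # pretend-reorder on the clone only
    return shared_memory_score(trial, b)  # `a` itself is left untouched

a = Node(("m", "n"), {("m", "n"): {"buf0"}, ("n", "m"): {"buf0", "buf1"}})
b = Node(("n", "m"), {("n", "m"): {"buf0", "buf1"}})
print(score_with_reorder(a, b, ("n", "m")))  # 2
```

The clone per candidate pair is exactly where the compile-time concern comes from: scoring is run for many node pairs, while an actual fusion happens for far fewer.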

> This would allow other fusions to occur following reordering. We have other, similar loop body modifications we want to do (@BoyuanFeng's masking, and I have another use case here: #161404) and this could allow those use cases to compose.

Can these fusions all be done in the final round and compose with LOAF? At a high level, the fusion process would become:

  1. iteratively find regular fusions
  2. a final round to apply other extra fusions: LOAF, index inversion (your PR), etc.

@eellison eellison left a comment

Sounds good - we can discuss other potential improvements in the future, but I think this is a clear improvement, so we should land it.


@shunting314 shunting314 added the topic: not user facing topic category label Sep 19, 2025
@shunting314

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 19, 2025
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
Pull Request resolved: pytorch#162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: pytorch#162101, pytorch#162126
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
@github-actions github-actions bot deleted the gh/shunting314/224/head branch October 20, 2025 02:17
Khanaksahu pushed a commit to Khanaksahu/pytorch-fork that referenced this pull request Nov 17, 2025