@shunting314 shunting314 commented Sep 8, 2025

Stack from ghstack (oldest at bottom):

The previous LOAF-after-fusion algorithm is not guaranteed to create more fusion opportunities even when loop reordering happens. I cannot find an example where LOAF reduces the amount of fusion, but here is an example where reordering loops does not add more fusions:

```python
def test_3dred_pw_2d_outer_red(self):
    """
    Test the following pattern. We have a 3D contiguous tensor [m, n, k] as input.
    1. Do a reduction on the k dimension to get an [m, n] tensor.
    2. Do a pointwise operation on this [m, n] tensor (and realize the computation).
    3. Do an outer reduction on the output of step 2 along the m dimension.
    Each of these steps generates a kernel before fusion.
    Without any loop reordering, kernel 1 and kernel 2 get fused, and kernel 3 stays separate.
    But if we reorder the loop for kernel 2, then kernel 2 gets fused with kernel 3,
    and the fused kernel-2-3 cannot be fused with kernel 1.
    The older version of the LOAF algorithm reorders in this case, but there is no real
    benefit. There are even some slight downsides:
    1. The original fusion without loop reordering is more natural.
    2. Fusing kernel 1 with kernel 2 may help precision when the output of kernel 1 is
       in low precision: thanks to fusion, the pointwise operation operates at fp32
       precision.
    """
    M, N, K = 64, 64, 64

    def f(x):
        x = x.sum(dim=-1)
        x = x + 1  # can be more complex, like sigmoid or other ops
        return x, x.sum(dim=0)

    x = torch.randn(M, N, K, device=GPU_TYPE)
    self.do_acc_test(f, x)
    self.assertEqual(0, metrics.num_loop_reordering)
```
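To see why the reorder is a wash here, the three kernels in the test can be sketched in plain Python with toy sizes. This is only an illustration of the loop orders involved; Inductor of course generates fused GPU kernels rather than Python loops:

```python
M, N, K = 2, 3, 4
x = [[[float(m * N * K + n * K + k) for k in range(K)]
      for n in range(N)] for m in range(M)]

# Kernel 1: reduce over k; loops ordered (m, n), matching the contiguous input.
red = [[sum(x[m][n]) for n in range(N)] for m in range(M)]

# Kernel 2, natural (m, n) order: same traversal as kernel 1, so 1 and 2 fuse.
pw = [[red[m][n] + 1 for n in range(N)] for m in range(M)]

# Kernel 3: outer reduction over m, naturally traversed in (n, m) order.
out = [sum(pw[m][n] for m in range(M)) for n in range(N)]
print(out)  # [62.0, 94.0, 126.0]

# Reordering kernel 2's loops to (n, m) would match kernel 3's traversal
# (allowing a 2-3 fusion) but would break the match with kernel 1; either
# way two kernels remain, which is why the reorder buys nothing here.
```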

Move LOAF to a separate final round of fusion so that we are guaranteed not to reduce the amount of fusion. Hopefully this also helps compilation time, since LOAF kicks in when there are fewer nodes.
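The two-phase structure can be sketched as a toy model: nodes carry a loop order plus a flag for whether LOAF may reorder them (reductions cannot reorder their reduction loops). This illustrates the scheduling idea only; it is not Inductor's actual `SchedulerNode`/fusion code:

```python
# Toy model: a "node" is a dict with a name, a loop order, and a reorder flag.
def regular_round(nodes):
    """One pass of regular fusion: fuse adjacent nodes with equal loop order."""
    out, changed = [], False
    for node in nodes:
        if out and out[-1]["order"] == node["order"]:
            out[-1] = {"name": out[-1]["name"] + "+" + node["name"],
                       "order": node["order"],
                       "can_reorder": out[-1]["can_reorder"] and node["can_reorder"]}
            changed = True
        else:
            out.append(node)
    return out, changed

def loaf_round(nodes):
    """Final LOAF round: also fuse when one side may reorder to match."""
    out = []
    for node in nodes:
        if out:
            prev = out[-1]
            same = prev["order"] == node["order"]
            reorderable = (sorted(prev["order"]) == sorted(node["order"])
                           and (prev["can_reorder"] or node["can_reorder"]))
            if same or reorderable:
                out[-1] = {"name": prev["name"] + "+" + node["name"],
                           "order": prev["order"],
                           "can_reorder": prev["can_reorder"] and node["can_reorder"]}
                continue
        out.append(node)
    return out

def fuse(nodes):
    changed = True
    while changed:                   # phase 1: regular fusion to a fixed point
        nodes, changed = regular_round(nodes)
    return loaf_round(nodes)         # phase 2: LOAF only runs on what is left

# Kernels 1-3 from the test above: k1/k2 share order (m, n); k3 is (n, m).
# Only the pointwise k2 may reorder; the reductions k1 and k3 may not.
nodes = [{"name": "k1", "order": ("m", "n"), "can_reorder": False},
         {"name": "k2", "order": ("m", "n"), "can_reorder": True},
         {"name": "k3", "order": ("n", "m"), "can_reorder": False}]
print([n["name"] for n in fuse(nodes)])  # ['k1+k2', 'k3']
```

Because the LOAF round only sees nodes that survived regular fusion, it can add fusions but never prevent one that regular fusion would have made.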

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot

pytorch-bot bot commented Sep 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162355

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit bfb4223 with merge base a6f9e0e (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.


@eellison eellison left a comment

What do you think about including reordering in the fusion score key, and not actually making the change to the loop body in the scoring round?

This would allow other fusions to occur following reordering. We have other, similar loop body modifications we want to do (@BoyuanFeng's masking, and I have another use case here: #161404) and this could allow those use cases to compose.

@shunting314

@eellison

> What do you think about including reordering in the fusion score key, and not actually making the change to the loop body in the scoring round.

I get what you mean. But the major blocker for turning LOAF on by default right now is the increase in compilation time, and I'm worried this would increase it further. There are a couple of ways to decide the fusion score key without changing the loop order:

  1. The most straightforward way is to clone the SchedulerNode, reorder the loops, decide the fusion score, and drop the clone. This would definitely increase compilation time further.
  2. A more sophisticated way is to pretend to do loop reordering without cloning the SchedulerNode. This would require updating Scheduler.score_fusion_memory to compute the amount of shared memory access for a specific loop order. But then we would need to add caching to avoid redoing this kind of analysis when we later perform the fusion, and the cache may be tricky to implement, since a SchedulerNode may already have been reordered to fuse another pair of nodes.

So I don't see a good way to do what you suggested without potentially increasing compilation time further.
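For concreteness, option 1 above (clone, reorder, score, drop) can be sketched with toy stand-ins. The `Node` class and `shared_memory_score` below are assumptions for illustration, not Inductor's real `SchedulerNode` or `Scheduler.score_fusion_memory` API:

```python
from copy import deepcopy

class Node:
    def __init__(self, order, reads):
        self.order = order   # current loop order, e.g. ("m", "n")
        self.reads = reads   # buffers touched, keyed by loop order

    def reorder(self, order):
        self.order = order

def shared_memory_score(a, b):
    # Toy score: number of buffers both nodes touch under their current orders.
    return len(a.reads.get(a.order, set()) & b.reads.get(b.order, set()))

def score_with_reorder(a, b, candidate_order):
    trial = deepcopy(a)              # the clone is the extra compile-time cost
    trial.reorder(candidate_order)   # pretend-reorder on the clone only
    return shared_memory_score(trial, b)  # `a` itself is left untouched

a = Node(("m", "n"), {("m", "n"): {"buf0"}, ("n", "m"): {"buf0", "buf1"}})
b = Node(("n", "m"), {("n", "m"): {"buf0", "buf1"}})
print(score_with_reorder(a, b, ("n", "m")))  # 2
```

The clone per candidate pair is exactly where the compile-time concern comes from: scoring is run for many node pairs, while an actual fusion happens for far fewer.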

> This would allow other fusions to occur following reordering. We have other, similar loop body modifications we want to do (@BoyuanFeng's masking, and I have another use case here: #161404) and this could allow those use cases to compose.

Can these fusions all be done in the final round and compose with LOAF? At a high level, the fusion process would become:

  1. iteratively find regular fusions
  2. a final round to apply other extra fusions: LOAF, index inversion (your PR), etc.

@eellison eellison left a comment

Sounds good - we can discuss other potential improvements in the future, but I think this is a clear improvement, so we should land it.


@shunting314 shunting314 added the topic: not user facing topic category label Sep 19, 2025
@shunting314

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 19, 2025
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
Pull Request resolved: pytorch#162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: pytorch#162101, pytorch#162126
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
@github-actions github-actions bot deleted the gh/shunting314/224/head branch October 20, 2025 02:17
Khanaksahu pushed a commit to Khanaksahu/pytorch-fork that referenced this pull request Nov 17, 2025