
Conversation

@shunting314
Contributor

@shunting314 shunting314 commented Sep 26, 2024

Stack from ghstack (oldest at bottom):

Fix #133242. In that issue, Inductor fuses 2 nodes because they access the same scalar tensor. That saving is very small (4 bytes); if we ignore it, we cannot fuse by default. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes, and we get 33% memory bandwidth savings.

I think adding a threshold for membw savings is reasonable in general.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Sep 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136782

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (15 Unrelated Failures)

As of commit b709ccd with merge base c6609ec:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

```python
# brings more savings.
#
# For the cases where loop ordering after fusion does not help, we don't lose much.
score_fusion_memory_threshold = 10
```
Contributor


n00b question, is the unit here number of elements, number of bytes, or something else? mostly for my own learning.

Contributor Author


It's number of bytes.
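For reference, a minimal sketch of how a byte-denominated threshold can gate a fusion decision (hypothetical helper names such as `shared_bytes` and `worth_fusing`; this is not Inductor's actual scoring code):

```python
SCORE_FUSION_MEMORY_THRESHOLD = 10  # bytes, mirroring the config value above

def shared_bytes(node1_accesses: dict, node2_accesses: dict) -> int:
    """Estimate bytes of memory traffic saved by fusing two nodes.

    Each dict maps buffer name -> bytes that node reads/writes from it.
    """
    common = node1_accesses.keys() & node2_accesses.keys()
    return sum(min(node1_accesses[buf], node2_accesses[buf]) for buf in common)

def worth_fusing(node1_accesses: dict, node2_accesses: dict) -> bool:
    # A 4-byte scalar overlap (as in #133242) falls below the threshold,
    # so it no longer drives the fusion decision on its own.
    return shared_bytes(node1_accesses, node2_accesses) >= SCORE_FUSION_MEMORY_THRESHOLD
```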

Contributor

@eellison eellison left a comment


Even if the actual memory savings are small, kernel launch overhead can accumulate. I think it still makes sense to fuse small tensors.

@shunting314
Contributor Author

shunting314 commented Sep 26, 2024

> Even if the actual memory savings are small, kernel launch overhead can accumulate. I think it still makes sense to fuse small tensors.

It's true. But I think it may not matter much in practice.

  1. This scenario can be rare
  2. We can leverage CUDAGraphs

Fusing two nodes that share only a very small amount of memory access is not much different from fusing two unrelated nodes. If such fusion is good in general, I guess we should enable 'aggressive_fusion' by default? We can check the perf test result once it's ready.

shunting314 added a commit that referenced this pull request Sep 26, 2024
@eellison
Contributor

eellison commented Oct 7, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/shunting314/177/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/136782)

pytorchmergebot pushed a commit that referenced this pull request Oct 7, 2024
@eellison
Contributor

eellison commented Oct 8, 2024

When running the failing test, test_asgd_recompile_foreach, and bumping the tensor sizes up to a more reasonable [1024, 1024], this PR makes the test 3x slower. I haven't debugged further, but it's possible that skipping the small fusion is inhibiting a larger fusion. This needs more investigation before landing.

@shunting314
Contributor Author

> When running the failing test, test_asgd_recompile_foreach, and bumping the tensor sizes up to a more reasonable [1024, 1024], this PR makes the test 3x slower. I haven't debugged further, but it's possible that skipping the small fusion is inhibiting a larger fusion. This needs more investigation before landing.

In my tests, with the PR the test run time increases from 4.8s to 5.6s. That's probably due to more kernels being generated. But the `torch._inductor.metrics.num_bytes_accessed` metric decreased from 117,555,520 to 83,968,512 (a 1.4x reduction), which should be good for perf.

This test involves some scalar tensors according to the generated wrapper ( https://gist.github.com/shunting314/8421fc964a1fc2848db273cce3d8a5ca ), even when the tensor sizes are increased to the more reasonable [1024, 1024]. I'll force the threshold to be a no-op for those tests and rerun a perf test. I'll dig further if the perf test shows any red flags.
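For reference, reading that counter around a compiled function looks roughly like the sketch below (it assumes the `torch._inductor.metrics` counters referenced above; exact behavior depends on the PyTorch version):

```python
import torch
import torch._inductor.metrics as metrics

def f(x, y):
    return (x + y).relu().sum()

compiled = torch.compile(f)

x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")

metrics.reset()                       # clear Inductor's counters before measuring
compiled(x, y)
print(metrics.num_bytes_accessed)     # estimated bytes read/written by generated kernels
print(metrics.generated_kernel_count) # how many kernels Inductor emitted
```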

shunting314 added a commit that referenced this pull request Oct 16, 2024
@shunting314
Contributor Author

shunting314 commented Oct 17, 2024

Add the perf result here.

Overall neutral (within 1% change).

@shunting314
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the `ciflow/trunk` label Oct 18, 2024
@ezyang
Contributor

ezyang commented Oct 22, 2024

Final HF compile time regression:

[image: final HF compile time regression]

pytorchmergebot pushed a commit that referenced this pull request Oct 24, 2024
When an fp8 dtype is involved, Inductor may set min_elem_per_thread to a positive value. This forces increasing XBLOCK even for a small xnumel (e.g. 1), and Inductor reports an error later when sanity-checking the triton config.

The simple fix here is to just not let XBLOCK be larger than xnumel.

Pull Request resolved: #138730
Approved by: https://github.com/Chillee
ghstack dependencies: #136782
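Schematically, the fix described in that commit amounts to clamping the block size; this is a hedged sketch with hypothetical parameter and function names, not the actual Inductor/Triton config code:

```python
def pick_xblock(xnumel: int, min_elem_per_thread: int, num_warps: int = 4,
                threads_per_warp: int = 32) -> int:
    # A positive min_elem_per_thread (used for fp8) pushes XBLOCK upward...
    threads = num_warps * threads_per_warp
    xblock = max(1, min_elem_per_thread * threads)
    # ...but XBLOCK must never exceed xnumel, otherwise the later triton
    # config sanity check fails for tiny workloads (e.g. xnumel == 1).
    return min(xblock, xnumel)

# e.g. pick_xblock(xnumel=1, min_elem_per_thread=8) -> 1 instead of 1024
```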
@jansel
Contributor

jansel commented Oct 26, 2024

This PR makes `x.sum() + 1` become two kernels, because fusing them only saves 4 bytes of memory. That is likely why it is increasing compile times, since we are generating more kernels.
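For illustration, a minimal repro sketch of that pattern (not taken from the PR itself):

```python
import torch

def f(x):
    # The reduction and the "+ 1" share only the 4-byte scalar produced by
    # x.sum(), which falls below the new memory-saving threshold, so the
    # heuristic no longer fuses them into a single kernel.
    return x.sum() + 1

compiled = torch.compile(f)
out = compiled(torch.randn(1024, 1024, device="cuda"))
```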

@jansel
Contributor

jansel commented Oct 26, 2024

Possible fix over at #138970

pytorchmergebot pushed a commit that referenced this pull request Oct 27, 2024
…ions (#138970)

PR #136782 made `x.sum()+1` become two kernels, which hurts compile
times as @ezyang noticed and breaks a lot of the tests in this stack.  This reworks that heuristic to not apply as often.

Pull Request resolved: #138970
Approved by: https://github.com/shunting314
SamGinzburg pushed a commit that referenced this pull request Oct 28, 2024
Pull Request resolved: #136782
Approved by: https://github.com/jansel
@github-actions github-actions bot deleted the gh/shunting314/177/head branch November 26, 2024 02:09
y-sq added a commit to y-sq/pytorch that referenced this pull request Dec 9, 2024
pytorchmergebot pushed a commit that referenced this pull request Dec 10, 2024
…42273)

**Summary:**
(Since I am trying the other solution for #141082, I moved the test case fixes out of that PR into a separate PR to land first.)

-----
Testing the float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.

The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing: https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

-------

The main change here is to modify the condition for calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't be fused because shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (since shared_data_score < config.score_fusion_memory_threshold) -> gets a larger shared_data_score -> fused`

----
It's the same issue as was fixed in #136782, but the condition for calling loop_reorder might change later, which would cause the test case to fail again.

**Test Plan:**
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
Also ran a float8 dynamic scaling training script to verify it end-to-end.

-----

Differential Revision: D66906175

Pull Request resolved: #142273
Approved by: https://github.com/eellison
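To make the gating change described in that commit concrete, here is a minimal sketch of the before/after condition (hypothetical helper names and node representation; this is not Inductor's actual fusion code):

```python
# Sketch of the condition change; `score_fusion_memory` and `loop_reorder`
# are illustrative stand-ins for Inductor internals.
class config:
    score_fusion_memory_threshold = 10  # bytes

def maybe_reorder_and_fuse(node1, node2, score_fusion_memory, loop_reorder) -> bool:
    shared_data_score = score_fusion_memory(node1, node2)

    # Before: loop reordering only ran when the nodes shared nothing at all.
    # if shared_data_score == 0:
    #     shared_data_score = loop_reorder(node1, node2)

    # After: loop reordering also runs when the shared bytes are below the
    # fusion threshold, which may reveal a much larger overlap after reordering.
    if shared_data_score < config.score_fusion_memory_threshold:
        shared_data_score = loop_reorder(node1, node2)

    return shared_data_score >= config.score_fusion_memory_threshold
```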
y-sq added a commit to y-sq/pytorch that referenced this pull request Dec 10, 2024
pytorchmergebot pushed a commit to y-sq/pytorch that referenced this pull request Dec 11, 2024
pytorchmergebot pushed a commit to y-sq/pytorch that referenced this pull request Dec 11, 2024
pytorch-bot bot pushed a commit that referenced this pull request Dec 16, 2024
pytorchmergebot pushed a commit that referenced this pull request Dec 17, 2024
…42474)

Summary:
**Re-land the PR.** The previous attempt was reverted because of a test failure on SM89; the fix is simply removing `xfailIfSM89`.

```
_____________________ LoopOrderingTest.test_fp8_pattern_2 ______________________
Unexpected success
```
------
(Since I am trying the other solution for #141082, I moved the test case fixes out of that PR into a separate PR to land first.)

-----
Testing the float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.

The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing: https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

-------

The main change here is to modify the condition for calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't be fused because shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (since shared_data_score < config.score_fusion_memory_threshold) -> gets a larger shared_data_score -> fused`

----
It's the same issue as was fixed in #136782, but the condition for calling loop_reorder might change later, which would cause the test case to fail again.

Test Plan:
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
-----
Ran a float8 dynamic scaling training script to verify it end-to-end.

Differential Revision: D67012816

Pull Request resolved: #142474
Approved by: https://github.com/eellison, https://github.com/sijiac, https://github.com/shunting314

Labels

ci-no-td, ciflow/inductor, ciflow/trunk, Merged, module: inductor, Reverted, topic: not user facing
