[inductor] add a threshold for membw saving during fusion #136782
Conversation
# brings more savings.
#
# For the cases where loop ordering after fusion does not help, we don't lose much.
score_fusion_memory_threshold = 10
n00b question: is the unit here the number of elements, the number of bytes, or something else? Mostly for my own learning.
It's the number of bytes.
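For illustration, here is a minimal sketch of how a byte-denominated threshold could gate a fusion decision; `should_fuse` and `shared_bytes_between` are hypothetical names for this sketch, not the actual scheduler API:

```python
# Minimal sketch (assumed names, not the real scheduler code): gate fusion on
# the number of bytes of memory traffic two nodes have in common.
SCORE_FUSION_MEMORY_THRESHOLD = 10  # bytes, matching the config excerpt above


def should_fuse(node1, node2, shared_bytes_between) -> bool:
    # A 4-byte scalar shared between two otherwise unrelated nodes does not
    # clear this bar, so the pair is skipped unless a later pass (e.g. loop
    # ordering after fusion) uncovers a larger overlap.
    return shared_bytes_between(node1, node2) >= SCORE_FUSION_MEMORY_THRESHOLD
```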
eellison left a comment
Even if the actual memory savings are small, kernel launch overhead can accumulate. I think it still makes sense to fuse small tensors.
It's true. But I think it may not matter much in practice.

Fusing two nodes that share only a very small amount of memory access doesn't look much different from fusing two unrelated nodes. If such fusion were good in general, I guess we should enable 'aggressive_fusion' by default? We can check the perf test result once it's ready.
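For reference, the knob mentioned above can be flipped like this; a minimal sketch assuming the config attribute is named `aggressive_fusion` as in the comment:

```python
# Sketch only: allow fusing nodes even when they share little or no data.
# The attribute name follows the comment above; behavior may vary by version.
import torch._inductor.config as inductor_config

inductor_config.aggressive_fusion = True
```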
Fix #133242. In that issue, inductor fuses 2 nodes because they access the same scalar tensor. The saving is very small (4 bytes), and if we ignore it, by default we cannot fuse. But if loop ordering after fusion gets kicked in, we can reorder loops and fuse those 2 nodes, getting 33% memory bandwidth savings. I think adding a threshold for membw saving in general is not bad. I'll run a perf test. (https://github.com/pytorch/pytorch/actions/runs/11058587412)

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang
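To make the scenario concrete, here is a hedged, illustrative sketch (not the exact repro from #133242): two element-wise ops whose only shared input is a tiny scalar tensor, with the threshold and loop ordering after fusion toggled explicitly. The config attribute names are assumptions based on the diff above and the `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION` env var mentioned later in this thread; a CUDA device is assumed.

```python
# Illustrative sketch, not the repro from #133242. The two outputs share only
# the scalar tensor `s` (~4 bytes of common traffic), which is below the
# threshold; any fusion beyond that would have to come from loop ordering
# after fusion finding a larger overlap.
import torch
import torch._inductor.config as inductor_config

inductor_config.score_fusion_memory_threshold = 10  # bytes, as in the diff above
inductor_config.loop_ordering_after_fusion = True   # TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1


def f(x, y, s):
    return x * s, y + s


compiled = torch.compile(f)
out = compiled(
    torch.randn(1024, 1024, device="cuda"),
    torch.randn(1024, 1024, device="cuda"),
    torch.tensor(2.0, device="cuda"),
)
```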
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Successfully rebased
When running the failing test,

In my tests, with the PR the test run time increases from 4.8s to 5.6s. That's probably due to more kernels being generated. But the `torch._inductor.metrics.num_bytes_accessed` metric decreased from 117555520 to 83968512 (a 1.4x reduction), which should be good for perf. This test involves some scalar tensors according to the generated wrapper (https://gist.github.com/shunting314/8421fc964a1fc2848db273cce3d8a5ca), even if the tensor sizes are increased to be more reasonable ([1024, 1024]). I'll force the threshold to be a no-op for those tests and rerun a perf test. Will dig further if the perf test shows some red flag.
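As a rough way to do the kind of comparison above, one can read the quoted metric after a compiled run. Treat this as a sketch: `metrics.reset()` and exactly when `num_bytes_accessed` gets populated are assumptions about the metrics module, not documented guarantees.

```python
# Sketch: read torch._inductor.metrics.num_bytes_accessed to compare runs.
# A CUDA device is assumed; the counter may only be populated for codegen'ed
# Triton kernels.
import torch
from torch._inductor import metrics

metrics.reset()


@torch.compile
def g(x):
    return (x.sin() + x.cos()).sum()


g(torch.randn(1024, 1024, device="cuda"))
print("num_bytes_accessed:", metrics.num_bytes_accessed)
```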
Add the perf result here. Overall neutral (within 1% change).

@pytorchbot merge
When the fp8 dtype is involved, Inductor may set min_elem_per_thread to a positive value. This forces XBLOCK to increase even for a small xnumel (e.g. 1), and Inductor then reports an error when sanity-checking the triton config. The simple fix here is to just not let XBLOCK be larger than xnumel. Pull Request resolved: #138730 Approved by: https://github.com/Chillee ghstack dependencies: #136782
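A minimal sketch of the clamp described in that fix; `pick_xblock` is an illustrative helper, not the actual Triton heuristics code:

```python
def pick_xblock(xblock_from_min_elem_per_thread: int, xnumel: int) -> int:
    # Never let the block size along x exceed the number of elements, so a
    # tiny grid (e.g. xnumel == 1) still yields a valid triton config even
    # when min_elem_per_thread pushes XBLOCK up.
    return min(xblock_from_min_elem_per_thread, xnumel)


assert pick_xblock(8, 1) == 1  # fp8 case with xnumel == 1 stays valid
```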
This PR makes

Possible fix over at #138970
…ions (#138970) PR #136782 made `x.sum()+1` become two kernels, which hurts compile times as @ezyang noticed and breaks a lot of the tests in this stack. This reworks that heuristic to not apply as often. Pull Request resolved: #138970 Approved by: https://github.com/shunting314
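A quick way to observe the kernel-count behavior discussed there, assuming `generated_kernel_count` in `torch._inductor.metrics` reflects the kernels emitted for the last compilation (a sketch, not an official recipe):

```python
# Sketch: count kernels generated for x.sum() + 1. Before #138970, the
# threshold could split this into two kernels; a CUDA device is assumed.
import torch
from torch._inductor import metrics

metrics.reset()


@torch.compile
def h(x):
    return x.sum() + 1


h(torch.randn(1024, 1024, device="cuda"))
print("kernels generated:", metrics.generated_kernel_count)
```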
Fix #133242. In that issue, inductor fuses 2 nodes because they access the same scalar tensor. The saving is very small (4 bytes), and if we ignore it, by default we cannot fuse. But if loop ordering after fusion gets kicked in, we can reorder loops and fuse those 2 nodes, getting 33% memory bandwidth savings. I think adding a threshold for membw saving in general is not bad. I'll run a perf test. (https://github.com/pytorch/pytorch/actions/runs/11375421752) Pull Request resolved: #136782 Approved by: https://github.com/jansel
…42273) **Summary:** (Since I am trying the other solution for #141082, I moved out the test case fixes from that pr to a separate pr to land first.)

Testing the float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference. The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing: https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

The main change here is to modify the condition for calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change: `shared_data_score > 0` -> won't loop_reorder -> can't fuse because `shared_data_score < config.score_fusion_memory_threshold`.
After the change: `shared_data_score > 0` -> loop_reorder (since `shared_data_score < config.score_fusion_memory_threshold`) -> get a larger `shared_data_score` -> fused.

It's the same issue as fixed in #136782, but the condition for calling loop_reorder might be changed later, causing the test case to fail again.

**Test Plan:**
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
Also ran a float8 dynamic scaling training script to verify it e2e.

Differential Revision: D66906175
Pull Request resolved: #142273
Approved by: https://github.com/eellison
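The before/after condition quoted above, written out as a small sketch; the function name is illustrative and the real check lives in the scheduler:

```python
def should_try_loop_reordering(shared_data_score: int, threshold: int) -> bool:
    # Before the change: only attempt reordering when the nodes share nothing.
    #   return shared_data_score == 0
    # After the change: also attempt it when the shared bytes fall below the
    # fusion threshold, since reordering may uncover a larger overlap and
    # allow fusion after all.
    return shared_data_score < threshold
```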
…42474) **Summary:** Re-land the pr. The previous one was reverted because of a test failure on SM89. The fix is just removing `xfailIfSM89`:
```
_____________________ LoopOrderingTest.test_fp8_pattern_2 ______________________
Unexpected success
```

(Since I am trying the other solution for #141082, I moved out the test case fixes from that pr to a separate pr to land first.)

Testing the float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference. The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing: https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

The main change here is to modify the condition for calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change: `shared_data_score > 0` -> won't loop_reorder -> can't fuse because `shared_data_score < config.score_fusion_memory_threshold`.
After the change: `shared_data_score > 0` -> loop_reorder (since `shared_data_score < config.score_fusion_memory_threshold`) -> get a larger `shared_data_score` -> fused.

It's the same issue as fixed in #136782, but the condition for calling loop_reorder might be changed later, causing the test case to fail again.

**Test Plan:**
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
Also ran a float8 dynamic scaling training script to verify it e2e.

Differential Revision: D67012816
Pull Request resolved: #142474
Approved by: https://github.com/eellison, https://github.com/sijiac, https://github.com/shunting314

Stack from ghstack (oldest at bottom):
Fix #133242. In that issue, inductor fuses 2 nodes because they access the same scalar tensor. The saving is very small (4 bytes), and if we ignore it, by default we cannot fuse. But if loop ordering after fusion gets kicked in, we can reorder loops and fuse those 2 nodes, getting 33% memory bandwidth savings.
I think adding a threshold for membw saving in general is not bad.
I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov