[Inductor] don't call sympy_str when not needed #162126

shunting314 · 2025-09-04T01:21:45Z

Stack from ghstack (oldest at bottom):

I see torch.compile spend 2% of its time on sympy_str when compiling the backward graph for MobileBertForQuestionAnswering. Most of the time, sympy_str is called while extracting read/write dependencies. But when we are extracting read/write deps, the result of sympy_str is just discarded (correct me if I'm wrong). To keep things simple, I just removed those calls. If people think the result may be useful for debugging, I can add a flag so that sympy_str is only called when the flag is explicitly set.

(scuba link: https://fburl.com/scuba/pyperf_experimental/on_demand/3k2rduh9 )
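The pattern described above can be illustrated with a minimal sketch. This is not Inductor's code: `expensive_str`, `record_dep_eager`, `record_dep_lazy`, and the `DEBUG_NAMES` flag are hypothetical stand-ins for `sympy_str`, the dependency-extraction path, and the optional debug flag mentioned in the description.

```python
DEBUG_NAMES = False  # hypothetical flag, not an actual Inductor config option

# Counts how often the costly formatter runs, so the two variants can be compared.
calls = {"expensive_str": 0}


def expensive_str(index):
    """Stand-in for a costly pretty-printer such as sympy_str."""
    calls["expensive_str"] += 1
    return "+".join(str(term) for term in index)


def record_dep_eager(deps, name, index):
    # Before: the string is always built, even though nothing reads it.
    _unused = expensive_str(index)
    deps.add((name, index))


def record_dep_lazy(deps, name, index):
    # After: the string is only built when debugging explicitly asks for it.
    if DEBUG_NAMES:
        print(f"{name}[{expensive_str(index)}]")
    deps.add((name, index))


eager_deps, lazy_deps = set(), set()

for i in range(1000):
    record_dep_eager(eager_deps, "buf0", (i, 2 * i))
eager_calls = calls["expensive_str"]

calls["expensive_str"] = 0
for i in range(1000):
    record_dep_lazy(lazy_deps, "buf0", (i, 2 * i))
lazy_calls = calls["expensive_str"]
```

Both variants record the same dependencies; only the eager one pays for the formatting. Whether a debug flag is worth keeping depends on how often the printed form is actually inspected.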

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

[ghstack-poisoned]

pytorch-bot · 2025-09-04T01:21:49Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162126

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 84662fd with merge base a6f9e0e:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

inductor / unit-test / inductor-test / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu) (gh) (similar failure)
'test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_hf_bert_ddp_aot_eager'
inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_hf_bert_ddp_aot_eager

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

inductor / inductor-test / test (inductor_huggingface, 1, 1, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
Process completed with exit code 134.
inductor / inductor-test / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
'Test'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorchmergebot · 2025-09-19T17:40:37Z

Starting merge as part of PR stack under #162355

The previous LOAF (loop ordering after fusion) algorithm was not guaranteed to create more fusion opportunities even when loop reordering happened. I cannot find an example where LOAF reduces the amount of fusion, but here is an example where reordering loops does not add more fusions: https://github.com/pytorch/pytorch/blob/a1f7639922ee0470bd7109bab6fe62989cf5000d/test/inductor/test_loop_ordering.py#L612-L641

Move LOAF to a separate final round of fusion so that it is guaranteed not to reduce the amount of fusion. Hopefully this also helps compilation time, since LOAF kicks in when there are fewer nodes.

Pull Request resolved: #162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162101, #162126
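The idea of deferring loop reordering to a final fusion round can be sketched with a toy model. This is an illustration under heavy assumptions, not Inductor's scheduler: a "node" here is just a tuple of loop sizes, and `can_fuse`, `fusion_round`, and `fuse_all` are hypothetical helpers. Two nodes fuse directly when their loop orders match, and with reordering allowed they also fuse when one order is a permutation of the other.

```python
def can_fuse(a, b, allow_reorder):
    """Toy rule: identical loop orders fuse; permutations fuse only with reordering."""
    if a == b:
        return True
    return allow_reorder and sorted(a) == sorted(b)


def fusion_round(groups, allow_reorder):
    """Greedily merge each group into the first compatible earlier group."""
    merged = []
    for group in groups:
        for target in merged:
            # Compare representative loop orders (first node of each group).
            if can_fuse(target[0], group[0], allow_reorder):
                target.extend(group)
                break
        else:
            merged.append(list(group))
    return merged


def fuse_all(nodes, final_reorder_round=True):
    groups = [[n] for n in nodes]
    # Regular rounds: repeat without loop reordering until nothing more fuses.
    while True:
        new_groups = fusion_round(groups, allow_reorder=False)
        if len(new_groups) == len(groups):
            break
        groups = new_groups
    # Final round: reordering is only considered on the already-fused groups,
    # so it can add fusions but cannot undo the ones found above.
    if final_reorder_round:
        groups = fusion_round(groups, allow_reorder=True)
    return groups


nodes = [(2, 3), (3, 2), (2, 3), (4,)]
without_final = fuse_all(nodes, final_reorder_round=False)
with_final = fuse_all(nodes)
```

In this toy model the final round also runs on fewer groups than the initial node count, which mirrors the hoped-for compile-time benefit mentioned above.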

I see torch.compile spend 2% of its time on sympy_str when compiling the backward graph for MobileBertForQuestionAnswering. Most of the time, sympy_str is called while extracting read/write dependencies. But when we are extracting read/write deps, the result of sympy_str is just discarded (correct me if I'm wrong). To keep things simple, I just removed those calls. If people think the result may be useful for debugging, I can add a flag so that sympy_str is only called when the flag is explicitly set.

(Screenshot 2025-09-03 at 6 21 52 PM: https://github.com/user-attachments/assets/a5929473-873d-4540-8f1e-c29f92be7125 )

(scuba link: https://fburl.com/scuba/pyperf_experimental/on_demand/3k2rduh9 )

Pull Request resolved: pytorch#162126
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: pytorch#162101

ghstack-source-id: 348300e Pull Request resolved: pytorch/pytorch#162126

[Inductor] don't call sympy_str when not needed

ceee7e8

[ghstack-poisoned]

This was referenced Sep 3, 2025

[ez][inductor] add a few outer dimension reduction cases for LOAF #162028

Closed

[inductor] avoid creating LoopBody twice #162101

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels Sep 4, 2025

This was referenced Sep 4, 2025

[inductor] turn on loaf (for oss) by default #162030

Closed

LOAF not for land hack #162102

Closed

shunting314 requested review from eellison and jansel September 4, 2025 01:28

jansel approved these changes Sep 4, 2025

View reviewed changes

shunting314 mentioned this pull request Sep 4, 2025

[inductor] fix TemplateBuffer.extract_read_writes #162221

Closed

This was referenced Sep 5, 2025

[inductor] rename deps during refreshing #162303

Closed

[inductor] fuse for scalar shared data #162311

Closed

shunting314 added the topic: not user facing topic category label Sep 6, 2025

This was referenced Sep 6, 2025

[inductor] fix 3d tiled online softmax #162341

Closed

[Inductor] do loop reordering in a separate final round #162355

Closed

shunting314 added 2 commits September 7, 2025 23:24

eellison approved these changes Sep 8, 2025

View reviewed changes

pytorchmergebot closed this in e88460f Sep 19, 2025

pytorchmergebot added the Merged label Sep 19, 2025

github-actions bot deleted the gh/shunting314/218/head branch October 20, 2025 02:17

Khanaksahu pushed a commit to Khanaksahu/pytorch-fork that referenced this pull request Nov 17, 2025

[Inductor] don't call sympy_str when not needed

7432cb5

ghstack-source-id: 348300e Pull Request resolved: pytorch/pytorch#162126
