
Conversation

@Lucaskabela (Contributor) commented Nov 4, 2025

Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs (`get_scheduler_node_symbol_uses` in torch/_inductor/scheduler.py, lines 4869-4885 at ee7434b: https://github.com/pytorch/pytorch/blob/ee7434be822cf6e75b4566d8159f550ee233d8ae/torch/_inductor/scheduler.py#L4869-L4885):

```python
def get_scheduler_node_symbol_uses(
    node: BaseSchedulerNode,
) -> OrderedSet[sympy.Symbol]:
    """
    Gets symbols used in node.
    """
    if isinstance(node, FusedSchedulerNode):
        return OrderedSet().union(
            *(get_scheduler_node_symbol_uses(snode) for snode in node.snodes)
        )
    assert node.node is not None
    free_symbol_uses = node.node.get_free_symbol_uses()
    free_symbol_uses.update(
        *(get_layout_symints(ir_node) for ir_node in node.node.get_outputs())
    )
    return free_symbol_uses
```
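
For intuition, the "free symbols" being collected are SymPy symbols (for example, dynamic shape sizes such as `s0`) that appear in a node's layout and data; graph partition needs their union to know which symbolic inputs each partition takes. Below is a minimal, self-contained sketch of the same union-over-children pattern, using a hypothetical `ToyNode` class and a plain `set` in place of Inductor's real IR nodes and `OrderedSet`:

```python
import sympy

class ToyNode:
    """Hypothetical stand-in for an IR node; not Inductor's real class."""

    def __init__(self, exprs, children=()):
        self.exprs = exprs          # sympy expressions, e.g. sizes/strides
        self.children = children    # nested nodes to recurse into

    def get_free_symbol_uses(self):
        # Union the symbols of this node's own expressions ...
        result = set()
        for expr in self.exprs:
            result |= expr.free_symbols
        # ... with the symbols of every child, mirroring the FusedSchedulerNode case.
        for child in self.children:
            result |= child.get_free_symbol_uses()
        return result

s0, s1 = sympy.symbols("s0 s1")
leaf = ToyNode([s0 * s1])                  # e.g. a flattened size
root = ToyNode([s0 + 1], children=[leaf])
print(root.get_free_symbol_uses())         # {s0, s1}
```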

I empirically observed that `get_free_symbol_uses()` becomes slower as graphs grow. Specifically, I tried falling back to aten for torchtitan, which results in 10k+ aten nodes. By the time the 600th node is processed, a single call to `get_free_symbol_uses()` takes seconds.

Why? Because `get_free_symbol_uses()` may recursively call `get_free_symbol_uses()` on the nodes it wraps, so one query can fan out into many recursive calls:

pytorch/torch/_inductor/ir.py, lines 4541-4543 at ee7434b (https://github.com/pytorch/pytorch/blob/ee7434be822cf6e75b4566d8159f550ee233d8ae/torch/_inductor/ir.py#L4541-L4543):

```python
result = self.layout.get_free_symbol_uses(
    unbacked_only
) | self.data.get_free_symbol_uses(unbacked_only)
```
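
To see why this compounds, consider (as a toy model, not Inductor's actual IR) a chain of wrapper nodes where each node's `get_free_symbol_uses()` recurses into the node it wraps, much like `self.data.get_free_symbol_uses()` above. Querying the k-th node walks k levels, so querying all n nodes in the chain performs on the order of n²/2 recursive calls in total:

```python
import sympy

s0 = sympy.Symbol("s0")
CALLS = 0  # count recursive calls to make the blow-up visible

class Wrapper:
    """Hypothetical node that wraps another node, mimicking layout/data nesting."""

    def __init__(self, inner=None):
        self.inner = inner

    def get_free_symbol_uses(self):
        global CALLS
        CALLS += 1
        result = {s0}
        if self.inner is not None:
            # Recursion into the wrapped node: cost grows with chain depth.
            result |= self.inner.get_free_symbol_uses()
        return result

# Build a chain of 600 wrappers and query every node once,
# roughly what happens while walking a large graph.
node = None
chain = []
for _ in range(600):
    node = Wrapper(node)
    chain.append(node)

for n in chain:
    n.get_free_symbol_uses()

print(CALLS)  # 180300 == 600 * 601 / 2 — quadratic without caching
```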

This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.
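
The description does not spell out the caching mechanism, so purely as an illustration under stated assumptions: memoizing each node's result turns repeated traversals into constant-time lookups, which is the effect the PR is after. A standalone sketch in the spirit of the toy chain above (the real method also takes an `unbacked_only` flag, so an actual cache would key on it, and would need invalidation if a node's layout can still change; none of this is the actual implementation in #166338):

```python
import sympy

s0 = sympy.Symbol("s0")

class CachedWrapper:
    """Hypothetical memoized node; not the actual implementation from #166338."""

    def __init__(self, inner=None):
        self.inner = inner
        self._free_symbols_cache = None  # assumed cache slot

    def get_free_symbol_uses(self):
        # A real cache would also key on the `unbacked_only` argument;
        # this sketch ignores that detail for brevity.
        if self._free_symbols_cache is None:
            result = {s0}
            if self.inner is not None:
                result |= self.inner.get_free_symbol_uses()
            self._free_symbols_cache = result
        return self._free_symbols_cache

# Same 600-node chain as before: each node's first query is answered mostly
# from its child's cache, so total work is linear instead of quadratic.
node = None
chain = []
for _ in range(600):
    node = CachedWrapper(node)
    chain.append(node)

for n in chain:
    n.get_free_symbol_uses()
```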

Pull Request resolved: #166338
Approved by: https://github.com/eellison

(cherry picked from commit dfebdca)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot (bot) commented Nov 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166994

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 97cb547 with merge base 4840a1a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Lucaskabela (Contributor, Author) commented:
cc @BoyuanFeng to review the cherry-pick conflict resolution

@BoyuanFeng (Contributor) commented:
The test fails on release/2.9 but not on the main branch. The unit test was added by the same PR to check that a FlexibleLayout does not change free symbol uses. However, the test requires #163639 to run, and I have validated that it passes when #163639 is patched in.

Since we don't want to cherry-pick #163639, this PR cherry-picks #166338 without the test test_flexible_layout_immutable_free_symbols_dynamic_shapes.

@atalman atalman merged commit 3d27d95 into release/2.9 Nov 6, 2025
119 checks passed
@github-actions github-actions bot deleted the cherrypick_166338 branch December 7, 2025 02:20
