inductor: use previous guards to know if a size is 1 for broadcasting #136670
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136670
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (14 Unrelated Failures)
As of commit b53bad2 with merge base 932ae13:
BROKEN TRUNK - the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Updated to just use `evaluate_expr`.
…roadcasting" Fixes #136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3*((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]
torch/_inductor/lowering.py (outdated):

```diff
     reversed(a), reversed(b), fillvalue=sympy.Integer(1)
 ):
-    if y == 1:
+    if V.graph.sizevars.shape_env.evaluate_expr(sympy.Eq(y, 1)):
```
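For context, a minimal sketch of what this change buys us, assuming a simplified version of inductor's broadcast-shape helper (the real lowering also reconciles mismatched symbolic dims, which is elided here; names are illustrative):

```python
import sympy
from itertools import zip_longest

def broadcast_symbolic_shapes(a, b, shape_env):
    # Walk dims right-to-left; a dim broadcasts when the ShapeEnv can
    # evaluate Eq(dim, 1) to True, not only when it is the literal 1.
    output = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=sympy.Integer(1)):
        if shape_env.evaluate_expr(sympy.Eq(y, 1)):
            output.append(x)
        elif shape_env.evaluate_expr(sympy.Eq(x, 1)):
            output.append(y)
        else:
            output.append(x)  # simplified: dims are assumed equal here
    return tuple(reversed(output))
```

With a backed symbolic size like `64//(2048//(s3*(s2//s3)))` that happens to equal 1 at runtime, `evaluate_expr` consults the hint and records the `Eq(..., 1)` guard, so the lowering takes the broadcast path instead of mis-sizing the output.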
@ezyang from running my test locally with TORCH_LOGS="+dynamic", this does seem to add extra guards into the shape env. Do you think that's worth worrying about? (or do we have some guard deduping logic later on?)
We're supposed to dedupe it! So clean it up into a test case and post a bug about it
oh nope nvm - the guard_added log prints twice but we don't actually end up with duplicate guards in dynamo
ezyang left a comment:
really gotta do that symint rewrite...
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 3 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
```diff
         assert old_size[i] is not None
         new_size[i] = old_size[i]
-    elif old_size[i] is None or old_size[i] == 1:
+    elif old_size[i] is None or V.graph.sizevars.shape_env.evaluate_expr(
```
welp, I can't actually use evaluate_expr() if the expression has unbacked symints in it
ah ok, I think I just want to guard with size_oblivious=True (I think this seems... reasonable for cases where inductor is dealing with broadcasting?)
Yes, size oblivious is the normal fix here, but you /really/ want to be operating on SymInts here now lol
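For illustration, a minimal sketch of the size-oblivious pattern using the public `guard_size_oblivious` helper (whether the lowering calls it directly or passes `size_oblivious=True` to `evaluate_expr` is an implementation detail; this function name is hypothetical):

```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def dim_broadcasts(t: torch.Tensor, dim: int) -> bool:
    # For an unbacked size (e.g. produced by .nonzero()), a plain
    # bool(size == 1) would raise a data-dependent-guard error.
    # Size-oblivious reasoning assumes size-like unbacked symbols
    # are >= 2, so this resolves to False without guarding on data.
    return guard_size_oblivious(t.shape[dim] == 1)
```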
…roadcasting" Fixes #136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3*((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]
…roadcasting" Fixes #136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3*((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]
inductor: use previous guards to know if a size is 1 for broadcasting (pytorch#136670)
Pull Request resolved: pytorch#136670
Approved by: https://github.com/ezyang
…ses) (pytorch#136759)

This adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:
(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see pytorch#136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely).

I ran locally, and got these numbers on the 4 paths:

```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: pytorch#136759
Approved by: https://github.com/laithsakka
ghstack dependencies: pytorch#136670
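To make the inference vs training split concrete, here is a rough sketch of the kind of workload being exercised (not the checked-in benchmark harness, just an illustrative module):

```python
import torch

class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.layers = torch.nn.ModuleList(torch.nn.Linear(16, 16) for _ in range(4))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x).relu()
        return x

m = torch.compile(M())

# inference path: no grad, so AOTDispatcher traces a forward-only graph
with torch.no_grad():
    m(torch.randn(8, 16))

# training path: differentiable input forces joint forward/backward tracing
torch._dynamo.reset()  # recompile so the two paths are measured separately
m(torch.randn(8, 16, requires_grad=True)).sum().backward()
```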
compile time benchmark for the min cut partitioner (pytorch#136760)

I'm hoping that this is a reasonable benchmark because:
(1) it consists of a single input + many weights that are used sequentially
(2) it contains a mix of recomputed vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

From running locally:

```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: pytorch#136760
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#136670, pytorch#136759
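A sketch of the workload shape the list above describes, under the same caveat (illustrative, not the checked-in benchmark):

```python
import torch

# One activation threaded sequentially through many weights: the matmuls
# are the expensive ops whose outputs the partitioner saves, while the
# cheap sin() calls are candidates for recomputation in the backward pass.
weights = [torch.randn(64, 64, requires_grad=True) for _ in range(20)]

def f(x):
    for w in weights:
        x = (x @ w).sin()
    return x.sum()

torch.compile(f)(torch.randn(64, 64, requires_grad=True)).backward()
```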
…rch#136760)" This reverts commit c010c60. Reverted pytorch#136760 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/c010c6099bf304bbb681af534b9f3996c33ce582) ([comment](pytorch#136670 (comment)))
…/subclasses) (pytorch#136759)" This reverts commit b17cd26. Reverted pytorch#136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/c010c6099bf304bbb681af534b9f3996c33ce582) ([comment](pytorch#136670 (comment)))
…dcasting (pytorch#136670)" This reverts commit dfdda2f. Reverted pytorch#136670 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/c010c6099bf304bbb681af534b9f3996c33ce582) ([comment](pytorch#136670 (comment)))
After staring at the test for a while, I think the test is just flaky and also fails on main (there is a lot of randomness going on in the test). Working on an actual repro so I can file a separate issue.
…roadcasting" Fixes #136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3*((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]
```diff
     )
     def test_associative_scan_dim(self, combine_mode, reverse, device):
+        import random
+        random.seed(10)
```
locally this "fixed" the test that previously broke for me (it was flaky; issue filed here: #137943)
…roadcasting" Fixes #136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3*((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]
```diff
-add_loop_inductor, compile_time_instruction_count, 24400000000, 0.015
 add_loop_inductor_dynamic_gpu, compile_time_instruction_count, 39410000000, 0.025
 add_loop_inductor_gpu, compile_time_instruction_count, 22440000000, 0.015
+add_loop_inductor, compile_time_instruction_count, 25292308084, 0.015
```
cc @laithsakka (this is from me re-landing the PR to better support dynamic shape broadcasting in inductor, although it hurt the compile time microbenchmarks a bit)
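For readers unfamiliar with this file: assuming the columns are (benchmark name, metric, expected value, relative noise margin), a row gates regressions roughly like this hypothetical check:

```python
def within_margin(measured: int, expected: int, margin: float) -> bool:
    # fail the benchmark if the measured count drifts past the noise margin
    return abs(measured - expected) <= expected * margin

# The re-land moved add_loop_inductor's count up ~3.7%, beyond the 1.5%
# margin around the old value, hence the updated expectation.
assert within_margin(25_292_308_084, 25_292_308_084, 0.015)
assert not within_margin(25_292_308_084, 24_400_000_000, 0.015)
```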
…roadcasting" Fixes #136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3*((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]
…roadcasting" Fixes #136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3*((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Pull Request resolved: #138075
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #136670
yep (extra thrash because this PR got reverted and re-landed 😞). My understanding is that we can either: (1) accept the regression for now and wait for the SymInt rewrite, or (2) if we want to avoid the compile time hit before doing a SymInt rewrite, manually do the branching ourselves in all of the code-paths I changed, although it would be a bit ugly.

Fixes #136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
We should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded appropriately on whether any ops performed broadcasting.

For example, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcast), FakeTensorProp will have generated a guard:

```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`, but `set_replacement()` only accepts plain symbols on the LHS, not expressions.

(2) In theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally, though.

Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`.
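As a sketch of the guard-scan idea described above (the PR as merged ended up calling `evaluate_expr` instead; `shape_env.guards` and `guard.expr` follow ShapeEnv's guard records, but treat the names here as illustrative):

```python
import sympy

def statically_known_one(expr, shape_env) -> bool:
    # Literal sizes short-circuit; otherwise look for a previously
    # recorded guard of the form Eq(expr, 1).
    if isinstance(expr, (int, sympy.Integer)):
        return expr == 1
    for guard in shape_env.guards:
        g = guard.expr
        if isinstance(g, sympy.Eq) and g.lhs == expr and g.rhs == 1:
            return True
    return False
```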
Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec