Conversation

@laithsakka (Contributor) commented Oct 11, 2024

Summary:
I was looking at this profile with @bobrenjc93

```
TORCH_COMPILE_STROBELIGHT=1 COMPILE_STROBELIGHT_MAX_STACK_LENGTH=500 buck2 run @fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200
```

strobelight profile link: https://fburl.com/scuba/pyperf_experimental/on_demand/lrh6erxx
(Screenshots of the strobelight profile, taken 2024-10-11, omitted.)

Most of the time is spent constructing the Max() node. If we pass evaluate=False, the exponential cost goes away.

Paste showing what we construct when we call Max: https://www.internalfb.com/phabricator/paste/view/P1644273374
There is clear repetition across calls, and the simplification does little other than flatten the inputs of Max.

This is just a draft to make sure all tests pass in OSS. I wonder whether avoiding simplification at construction could make other programs slower; if so, maybe we can gate this behind a flag.

Alternatively, we can define our own Max function and customize the automatic simplifications inside it:
https://docs.sympy.org/latest/explanation/best-practices.html#avoid-too-much-automatic-evaluation
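A minimal sketch of that approach, following the SymPy best-practices page linked above: subclass sympy's Function and implement an eval classmethod that only flattens nested calls, skipping the rest of the automatic simplification. FlatMax is a hypothetical name for illustration, not code from this PR or from SymPy:

```python
from sympy import Function, symbols

class FlatMax(Function):
    """Max-like node that only flattens nested FlatMax arguments,
    skipping sympy's full pairwise simplification in the constructor."""

    @classmethod
    def eval(cls, *args):
        flat, changed = [], False
        for a in args:
            if isinstance(a, FlatMax):
                # Pull nested arguments up one level.
                flat.extend(a.args)
                changed = True
            else:
                flat.append(a)
        if changed:
            return cls(*flat)
        return None  # returning None keeps the node unevaluated

a, b, c = symbols("a b c")
expr = FlatMax(FlatMax(a, b), c)  # flattened to FlatMax(a, b, c)
```

Since eval does a single pass over the arguments, constructing a chain of N nested nodes stays linear rather than re-running pairwise comparisons at every level.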

--num-features=200 (compile.compile_inner): 16.5991s vs 120.824s

--num-features=400 (compile.compile_inner): 40s vs 918.024s

num_features=100

rank: 0, world_size: 2, num_features: 100, batch_size: 10, time: 20.05s
vs
rank: 0, world_size: 2, num_features: 100, batch_size: 10, time: 40.24s

num_features=200

rank: 0, world_size: 2, num_features: 200, batch_size: 10, time: 20.66s
vs
rank: 0, world_size: 2, num_features: 200, batch_size: 10, time: 125.05s

Differential Revision: D64252491

@pytorch-bot bot commented Oct 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137796

Note: Links to docs will display an error until the docs builds have been completed.

❌ 58 Cancelled Jobs, 1 Unrelated Failure

As of commit 630f330 with merge base ed94725 (image):

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor release notes: fx release notes category labels Oct 11, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64252491

laithsakka added a commit to laithsakka/pytorch that referenced this pull request Oct 11, 2024
…sympy.functions .Max (pytorch#137796)

Differential Revision: D64252491
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64252491

@laithsakka laithsakka requested a review from ezyang October 11, 2024 18:37
@ezyang

ezyang commented Oct 12, 2024

As I was telling @bobrenjc93, you can't do this: the simplification done in the constructor is actually load-bearing for unbacked reasoning.

@ezyang

ezyang commented Oct 12, 2024

And your profile is from before #133325, which fixed the bulk of the problems.

@laithsakka

Great, if #133325 fixes it then there's no need for this; I will check that. Otherwise, we have to think about a compromise. If I understand your comment correctly, it's not a correctness issue but rather a perf one?
That is, you are saying that if we don't do the simplification at construction, unbacked reasoning can become very expensive?

@ezyang

ezyang commented Oct 12, 2024

No, it is a correctness issue. For example, if you have a test Max(1, u0, 256) == Max(u0, 256), you need to return True for it. The easiest way for this to happen is for the Max constructor to simplify Max(1, u0, 256) into Max(u0, 256); none of our other reasoning mechanisms will handle it.
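A quick sketch of this point in stock sympy, as I understand it: the eager Max constructor drops the dominated constant, so the structural == check succeeds, while an unevaluated node would not compare equal:

```python
from sympy import Max, Symbol

u0 = Symbol("u0")

# Eager construction: 1 <= 256 regardless of u0, so the constructor
# simplifies Max(1, u0, 256) down to Max(u0, 256).
eager = Max(1, u0, 256)
assert eager == Max(u0, 256)

# With evaluate=False the arguments are kept as-is, so the
# structural equality test no longer returns True.
lazy = Max(1, u0, 256, evaluate=False)
assert lazy != Max(u0, 256)
```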

@laithsakka

laithsakka commented Oct 12, 2024

> No, it is a correctness issue. For example, if you have a test Max(1, u0, 256) == Max(u0, 256), you need to return True for it. But the easiest way for this to happen is for the Max constructor to simplify Max(1, u0, 256) into Max(u0, 256), none of our other reasoning mechanisms will work.

Hmm, where in the compiler do we do this == check, and why is it not .equals()?

Max(1, u0, 256).equals(Max(u0, 256)) should be True.

Do you have an end-to-end example of this failing, e.g. a function we compile where we end up generating a wrong program, or failing to compile, because of this? I will try to play with some examples to see if I can get one.

So I have another idea that is less risky but needs benchmarking; it is O(N):

If we look at the paste https://www.internalfb.com/phabricator/paste/view/P1644273374
we can see that all Max does here is flatten its inputs.

So we can do something like this in O(N), under some conditions: given max(max(a, b), c), if a, b and c are all sums of symbols only, and there is no intersection between the symbols of a, b and c, then generate max(a, b, c), plus some sorting of the a, b, c order.
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64252491

…sympy.functions .Max (pytorch#137796)

Differential Revision: D64252491
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64252491

@laithsakka laithsakka requested a review from bobrenjc93 October 15, 2024 04:04
@laithsakka

laithsakka commented Oct 15, 2024

I confirmed that #133325 does fix the regression that this diff tries to fix.
While we could also fix the issue here by calling simplify in statically_known_true, there is no value in doing that anymore.

Also, calling simplify in statically_known_true can be a double-edged sword; I confirmed it works for the benchmark above in case #133325 could not land, but since it landed, it is not needed.

cc @bobrenjc93 @ezyang

@laithsakka laithsakka closed this Oct 15, 2024
@laithsakka laithsakka reopened this Oct 15, 2024
@laithsakka laithsakka closed this Oct 15, 2024