Conversation

@laithsakka (Contributor) commented Oct 11, 2024

Summary:
I was looking at this profile with @bobrenjc93

```
TORCH_COMPILE_STROBELIGHT=1 COMPILE_STROBELIGHT_MAX_STACK_LENGTH=500 buck2 run @fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200
```

strobelight profile link: https://fburl.com/scuba/pyperf_experimental/on_demand/lrh6erxx
(Screenshots of the strobelight profile, taken 2024-10-11, omitted.)

Most of the time is spent constructing the Max() node. If we pass evaluate=False, the exponential cost goes away.

Paste showing what we construct when we call Max: https://www.internalfb.com/phabricator/paste/view/P1644273374
There is clear repetition across calls, and the simplification does little other than flatten the inputs of Max.

This is just a draft to make sure all tests pass in OSS. I wonder whether avoiding simplification at construction could make other programs slower; if so, maybe we can gate this behind a flag.

Alternatively, we can define our own Max function and customize the automatic simplifications inside it:
https://docs.sympy.org/latest/explanation/best-practices.html#avoid-too-much-automatic-evaluation
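A minimal sketch of that approach, following the SymPy best-practices page linked above: subclass sympy's Function and implement an eval classmethod that only flattens nested calls, skipping the rest of the automatic simplification. FlatMax is a hypothetical name for illustration, not code from this PR or from SymPy:

```python
from sympy import Function, symbols

class FlatMax(Function):
    """Max-like node that only flattens nested FlatMax arguments,
    skipping sympy's full pairwise simplification in the constructor."""

    @classmethod
    def eval(cls, *args):
        flat, changed = [], False
        for a in args:
            if isinstance(a, FlatMax):
                # Pull nested arguments up one level.
                flat.extend(a.args)
                changed = True
            else:
                flat.append(a)
        if changed:
            return cls(*flat)
        return None  # returning None keeps the node unevaluated

a, b, c = symbols("a b c")
expr = FlatMax(FlatMax(a, b), c)  # flattened to FlatMax(a, b, c)
```

Since eval does a single pass over the arguments, constructing a chain of N nested nodes stays linear rather than re-running pairwise comparisons at every level.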

--num-features=200 (compile.compile_inner): 16.5991s vs 120.824s

--num-features=400 (compile.compile_inner): 40s vs 918.024s

num_features=100

rank: 0, world_size: 2, num_features: 100, batch_size: 10, time: 20.05s
vs
rank: 0, world_size: 2, num_features: 100, batch_size: 10, time: 40.24s

num_features=200

rank: 0, world_size: 2, num_features: 200, batch_size: 10, time: 20.66s
vs
rank: 0, world_size: 2, num_features: 200, batch_size: 10, time: 125.05s

Differential Revision: D64252491

@pytorch-bot bot commented Oct 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137796

Note: Links to docs will display an error until the docs builds have been completed.

❌ 58 Cancelled Jobs, 1 Unrelated Failure

As of commit 630f330 with merge base ed94725 (image):

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor release notes: fx release notes category labels Oct 11, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64252491

laithsakka added a commit to laithsakka/pytorch that referenced this pull request Oct 11, 2024
…sympy.functions .Max (pytorch#137796)

Differential Revision: D64252491
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64252491

@laithsakka laithsakka requested a review from ezyang October 11, 2024 18:37
@ezyang

ezyang commented Oct 12, 2024

As I was telling @bobrenjc93, you can't do this: the simplification done in the constructor is actually load-bearing for unbacked reasoning.

@ezyang

ezyang commented Oct 12, 2024

And your profile is from before #133325, which fixed the bulk of the problems.

@laithsakka

Great, if #133325 fixes it then there's no need for this; I will check that. Otherwise, we have to think about a compromise. If I understand your comment correctly, it's not a correctness issue but rather a perf one?
That is, you are saying that if we don't do the simplification at construction, unbacked reasoning can become very expensive?

@ezyang

ezyang commented Oct 12, 2024

No, it is a correctness issue. For example, if you have a test Max(1, u0, 256) == Max(u0, 256), you need to return True for it. The easiest way for this to happen is for the Max constructor to simplify Max(1, u0, 256) into Max(u0, 256); none of our other reasoning mechanisms will handle it.
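A quick sketch of this point in stock sympy, as I understand it: the eager Max constructor drops the dominated constant, so the structural == check succeeds, while an unevaluated node would not compare equal:

```python
from sympy import Max, Symbol

u0 = Symbol("u0")

# Eager construction: 1 <= 256 regardless of u0, so the constructor
# simplifies Max(1, u0, 256) down to Max(u0, 256).
eager = Max(1, u0, 256)
assert eager == Max(u0, 256)

# With evaluate=False the arguments are kept as-is, so the
# structural equality test no longer returns True.
lazy = Max(1, u0, 256, evaluate=False)
assert lazy != Max(u0, 256)
```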

@laithsakka

laithsakka commented Oct 12, 2024

> No, it is a correctness issue. For example, if you have a test Max(1, u0, 256) == Max(u0, 256), you need to return True for it. But the easiest way for this to happen is for the Max constructor to simplify Max(1, u0, 256) into Max(u0, 256), none of our other reasoning mechanisms will work.

Hmm, where in the compiler do we do this == check, and why is it not .equals()?

Max(1, u0, 256).equals(Max(u0, 256)) should be True.

Do you have an end-to-end example of this failing, e.g. a function we compile where we end up generating a wrong program, or failing to compile, because of this? I will try to play with some examples to see if I can get one.

So I have another idea that is less risky but needs benchmarking; it is O(N):

If we look at the paste https://www.internalfb.com/phabricator/paste/view/P1644273374
we can see that all Max does here is flatten its inputs.

So we can do something like this in O(N), under some conditions: given max(max(a, b), c), if a, b and c are all sums of symbols only, and there is no intersection between the symbols of a, b and c, then generate max(a, b, c), plus some sorting of the a, b, c order.
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64252491

…sympy.functions .Max (pytorch#137796)

Differential Revision: D64252491
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64252491

@laithsakka laithsakka requested a review from bobrenjc93 October 15, 2024 04:04
@laithsakka

laithsakka commented Oct 15, 2024

I confirmed that #133325 does fix the regression that this diff tries to fix.
While we could also fix the issue here by calling simplify in statically_known_true, there is no value in doing that anymore.

Also, calling simplify in statically_known_true can be a double-edged sword; I confirmed it works for the benchmark above in case #133325 could not land, but since it landed, it is not needed.

cc @bobrenjc93 @ezyang

@laithsakka laithsakka closed this Oct 15, 2024
@laithsakka laithsakka reopened this Oct 15, 2024
@laithsakka laithsakka closed this Oct 15, 2024