
Conversation

@bdhirsh (Contributor) commented Nov 14, 2024

Fixes #140229

Fixes #139474

The issue was that:

(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket

(2) doing so requires knowing the **exact** strides of the outputs of each subgraph, so that we can construct example inputs (with correct strides) for compiling each of the later subgraphs (see the sketch after this list)

(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with

(4) these output strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which, given the sympy string and the ShapeEnv's `var_to_val` mapping, evaluates the string to produce concrete strides

(5) evaluating this expression specializes on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know inductor's output strides symbolically. This requires converting the (string) sympy expression into actual `SymInt`s that we can return.
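
To make point (2) concrete, here is a toy sketch (not DDPOptimizer's actual implementation; `bucket1`/`bucket2` are made-up stand-ins for two buckets): the example input used to compile the second subgraph has to carry the same strides that the first subgraph will actually produce at runtime.

```python
import torch

def bucket1(x):
    # stand-in for the first bucket: produces a non-contiguous output
    return x.t()

def bucket2(y):
    # stand-in for a later bucket that consumes bucket1's output
    return y + 1

x = torch.randn(8, 16)
out1 = bucket1(x)
print(out1.shape, out1.stride())  # torch.Size([16, 8]) (1, 16) -- not contiguous

# Example input for compiling bucket2: same shape *and* strides as bucket1's
# real output, so the second subgraph is compiled for the layout it will see.
example_in = torch.empty_strided(out1.shape, out1.stride())
compiled_bucket2 = torch.compile(bucket2)
_ = compiled_bucket2(example_in)
```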

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @ezyang @SherlockNoMad @EikanWang @jgong5 @wenzhe-nrv @penguinwu @bobrenjc93 @voznesenskym @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot (bot) commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140751

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7dce432 with merge base 38d10a1:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eellison (Contributor) left a comment


looks good, a few questions.. think this was a recent regression with the aot autograd cache changes..

Comment on lines +304 to +306
if not DDP._active_ddp_module:
    with self._inside_ddp_forward():
        return self.module.forward(*inputs, **kwargs)
Contributor:

Why do we need these changes? I'm not familiar with DDP._active_ddp_module

Contributor Author:

When I just re-used this testing infra as-is, dynamo would graph break on the context manager (because it mutates an attribute on the current nn.Module) - that might have been fine for other usages of this helper in the tests, but I wanted my test to not have graph breaks, since it makes it easier for me to assert on compile metrics.

My understanding of this test class is that `_active_ddp_module` is a custom attribute on DDP that dynamo hooks into in order to know whether DDP is running, and we can use that attribute to simulate DDP <> dynamo interactions without needing to actually involve distributed code in the test.

I didn't want to fix all of the infra, but I also wanted my test not to have graph breaks, so for my own local usage I set _inside_ddp_forward() manually to be true in my test, outside of the compiled region.

If you'd rather I don't tweak the existing testing infra, I can also just tweak my test to not re-use this infra.
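
For what it's worth, here is a toy illustration of the pattern described above (a made-up module, not the real DDP test infra): the attribute-mutating context manager is entered outside the compiled region, so nothing inside the compiled forward mutates module state.

```python
import contextlib
import torch

class ToyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        self._inside_forward_flag = False  # stand-in for DDP._active_ddp_module-style state

    @contextlib.contextmanager
    def _inside_forward(self):
        # mutating a module attribute like this inside a compiled region is
        # what caused the graph break in the original test helper
        self._inside_forward_flag = True
        try:
            yield
        finally:
            self._inside_forward_flag = False

    def forward(self, x):
        return self.linear(x)

mod = ToyModule()
compiled = torch.compile(mod, fullgraph=True)

# Enter the context manager outside the compiled call, mirroring "set
# _inside_ddp_forward() manually ... outside of the compiled region".
with mod._inside_forward():
    out = compiled(torch.randn(4, 8))
```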


- def evaluate_symexpr(self, code: str) -> Union[int, float, bool]:
+ def evaluate_symexpr(
+     self, code: str, *, return_symints: bool
Contributor:

The return type is incorrect now.. Would it be easier to type / follow with a different function? Either way is fine, but I do agree with the bc linter to add a default `return_symints = False`.

Contributor Author:

yup agreed - I'll split it into a separate function
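
A hypothetical sketch of what such a split could look like (names, signatures, and the sympy-level bodies are illustrative only; this is not the code that actually landed): one concretely-typed evaluator, plus a separate entry point for the symbolic path.

```python
from typing import Union

import sympy

def evaluate_symexpr(code: str, hints: dict[str, int]) -> Union[int, float, bool]:
    # concrete path: every symbol is replaced by its hint, so the result is a
    # plain Python value -- this is where specialization happens (bool handling
    # omitted for brevity)
    val = sympy.sympify(code).subs(hints)
    return int(val) if val.is_Integer else float(val)

def parse_symexpr(code: str, symbols: dict[str, sympy.Symbol]) -> sympy.Expr:
    # symbolic path: symbols stay symbolic; the real ShapeEnv version would wrap
    # the result into SymInts rather than return a raw sympy expression
    return sympy.sympify(code, locals=symbols)
```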

tuple(
    (
-       shape_env.evaluate_symexpr(e)
+       shape_env.evaluate_symexpr(
Contributor:

So, in the original version of this PR we suppressed guards when we were setting the output strides.

I guess in the refactor we are now adding guards when we get the strides to return.

Would it be enough to suppress guards here? Or maybe it's better to do the current fix. Just trying to follow.

Contributor Author:

Yeah, looks like you're right - this must have been a regression from the inductor FX graph cache changes

> Would it be enough to suppress guards here?

fair question: I think the main issue is that:

(1) prior to the inductor cache, the output strides were actual SymInts - so we could just stash them in the tracing context, grab them in AOTAutograd, and slam them into our output tensors with a call to as_strided() while suppressing guards to make sure we didn't specialize

(2) now that we have the inductor cache, the output strides become part of the cache and are serialized to disk. In the case of an inductor cache hit, we will load them from disk, and `exprs` here is a list of raw strings (basically the serialized sympy expressions for each stride).

So at the very least, we need to convert the raw sympy strings back into SymInts to use them in AOTAutograd. I found that `ShapeEnv.evaluate_symexpr` was the existing code for doing this, but it would also replace each symbol with the current hint, instead of replacing it with the raw `sympy.Symbol` that it mapped to.
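
A minimal sketch of that round-trip in plain sympy (the real code goes through the FX graph cache and the ShapeEnv, and wraps the result into SymInt/SymNode; that wiring is elided here):

```python
import sympy

s1, s2 = sympy.symbols("s1 s2", integer=True, positive=True)

# Before caching, the output stride is a live symbolic expression.
stride = s1 * s2

# What the FX graph cache stores: just the serialized string.
serialized = str(stride)  # "s1*s2"

# On a cache hit, the string has to be parsed back into an expression.
restored = sympy.sympify(serialized, locals={"s1": s1, "s2": s2})

# Replacing symbols with hints (the old evaluate_symexpr behavior) specializes:
print(restored.xreplace({s1: 16, s2: 4}))  # 64, only valid for these sizes

# Keeping the symbols preserves a stride that can be turned back into SymInts:
print(restored)  # s1*s2
```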

Contributor:

cc @jamesjwu just to be aware.

args = {str(e): val for e, val in self.var_to_val.items()}
if return_symints:
    args = {
        str(e): SymInt(SymNode(e, self, int, int(val), fx_node=None))
Contributor:

I guess this is O(num_output_symints * num_total_symints)..

Contributor Author:

True. This seemed sus, so I tried running a small benchmark (function with 100 3D inputs compiled with dynamic=True, P1681754466) with TORCH_COMPILE_CPROFILE=1, and:

(1) I don't see evaluate_symexpr or set_tracing_context_output_strides in the svg

(2) The E2E compile time was 15s - it seems like there is probably other low-hanging fruit in a heavily-dynamic microbenchmark for finding compile time wins

svg download command: jf download GG0n2huLD3y3VGsBAAc3IKj-v5g0bsIXAAAz --file "_compile_inner_0_2.svg"

Contributor Author:

(15s was a lot, ended up filing an issue about it here 😛 #140970)
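
For context on the O(num_output_symints * num_total_symints) point above, a toy sketch (plain sympy, not the ShapeEnv code): rebuilding the full symbol-to-value mapping for each output expression does O(M) dict-building work per output, whereas a hypothetical alternative looks up only the symbols each expression actually uses.

```python
import sympy

var_to_val = {sympy.Symbol(f"s{i}"): i + 2 for i in range(1000)}  # M symbols tracked by the ShapeEnv
output_stride_exprs = ["s0*s1", "s1", "1"]                        # N serialized output strides

def eval_rebuilding_map(code: str) -> int:
    # mirrors the shape of the diff above: the full symbol->value mapping is
    # rebuilt on every call, so evaluating N output strides spends O(N * M)
    # just constructing these dicts
    args = dict(var_to_val)  # O(M) work per output
    return int(sympy.sympify(code).xreplace(args))

def eval_only_needed(code: str) -> int:
    # hypothetical alternative (not what the PR does): parse first, then look
    # up only the symbols that actually appear in this particular expression
    expr = sympy.sympify(code)
    return int(expr.xreplace({s: var_to_val[s] for s in expr.free_symbols}))

assert [eval_rebuilding_map(c) for c in output_stride_exprs] == [
    eval_only_needed(c) for c in output_stride_exprs
]
```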

bdhirsh added a commit that referenced this pull request Nov 21, 2024
@bdhirsh added the suppress-bc-linter label Nov 27, 2024
@bdhirsh (Contributor Author) commented Nov 27, 2024

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label Nov 27, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@albanD albanD removed their request for review December 4, 2024 22:48
bdhirsh added a commit that referenced this pull request Dec 4, 2024
@bdhirsh (Contributor Author) commented Dec 4, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@bdhirsh removed the suppress-bc-linter label Dec 5, 2024
AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024

Pull Request resolved: pytorch#140751
Approved by: https://github.com/eellison
@github-actions github-actions bot deleted the gh/bdhirsh/625/head branch January 5, 2025 02:10

Labels

ciflow/inductor, ciflow/trunk, fx, Merged, module: dynamic shapes, module: inductor, oncall: distributed, release notes: fx
