avoid specializing strides with DDPOptimizer + inductor #140751
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140751
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 7dce432 with merge base 38d10a1:
BROKEN TRUNK - The following job failed but was also present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
eellison left a comment
looks good, a few questions.. think this was a recent regression with the aot autograd cache changes..
    if not DDP._active_ddp_module:
        with self._inside_ddp_forward():
            return self.module.forward(*inputs, **kwargs)
Why do we need these changes? I'm not familiar with DDP._active_ddp_module
When I just re-used this testing infra as-is, dynamo would graph break on the context manager (because it mutates an attribute on the current nn.Module) - that might have been fine for other usages of this helper in the tests, but I wanted my test to not have graph breaks, since it makes it easier for me to assert on compile metrics.
My understanding of this test class is that _active_ddp_module is a custom attribute in DDP that dynamo hooks into in order to know whether DDP is running, and we can use that attribute to simulate DDP <> dynamo interactions without needing to actually involve distributed code in the test.
I didn't want to fix all of the infra, but I also wanted my test not to have graph breaks, so for my own local usage I manually set the _inside_ddp_forward() state to true in my test, outside of the compiled region (sketched below).
If you'd rather I not tweak the existing testing infra, I can also just tweak my test to not re-use this infra.
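For reference, a rough sketch of that workaround (the `FakeDDP` name and exact attributes are assumptions based on the test helper quoted above, not the actual test in this PR): the flag dynamo inspects is activated outside the compiled region, so the context manager's attribute mutation never happens inside the traced code.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class FakeDDP(nn.Module):
    """Minimal stand-in for DDP: just enough for dynamo's DDPOptimizer hook."""

    def __init__(self, module, bucket_cap_mb=25):
        super().__init__()
        self.module = module
        self.bucket_bytes_cap = int(bucket_cap_mb * 1024 * 1024)

    def forward(self, *args, **kwargs):
        return self.module(*args, **kwargs)


mod = FakeDDP(nn.Sequential(nn.Linear(8, 8), nn.ReLU()))

# Simulate "we are inside a DDP forward" for the whole compiled call, instead
# of letting _inside_ddp_forward() flip the flag mid-trace (which graph-breaks).
DDP._active_ddp_module = mod
try:
    compiled = torch.compile(mod, backend="aot_eager", fullgraph=True)
    compiled(torch.randn(4, 8))  # fullgraph=True would error on a graph break
finally:
    DDP._active_ddp_module = None
```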
    def evaluate_symexpr(self, code: str) -> Union[int, float, bool]:
    def evaluate_symexpr(
        self, code: str, *, return_symints: bool
The return type is incorrect now.. Would it be easier to type / follow with a different function? Either way is fine, but I do agree with the bc linter that we should add a default of return_symints = False.
yup agreed - I'll split it into a separate function
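Roughly what the split could look like (the second function's name and signature are assumptions, and both are written as free functions over a ShapeEnv for the sketch; the real `evaluate_symexpr` also evaluates against a richer namespace of sympy helpers rather than a bare `eval`):

```python
from typing import Union

from torch import SymInt
from torch.fx.experimental.sym_node import SymNode


def evaluate_symexpr(shape_env, code: str) -> Union[int, float, bool]:
    # Existing behavior: each symbol is replaced by its current hint, so the
    # result is a concrete value (this is where the specialization comes from).
    args = {str(s): val for s, val in shape_env.var_to_val.items()}
    return eval(code, {}, args)


def evaluate_symexpr_symbolic(shape_env, code: str) -> Union[SymInt, int]:
    # New behavior: each symbol is wrapped in a SymInt bound to the ShapeEnv,
    # so the evaluated expression stays symbolic and introduces no guards.
    args = {
        str(s): SymInt(SymNode(s, shape_env, int, int(val), fx_node=None))
        for s, val in shape_env.var_to_val.items()
    }
    return eval(code, {}, args)
```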
torch/_inductor/utils.py (Outdated)
    tuple(
        (
            shape_env.evaluate_symexpr(e)
            shape_env.evaluate_symexpr(
So, in the original version of this PR we suppressed guards when we were setting the output strides.
I guess in the refactor we are now adding guards when we get the strides to return.
Would it be enough to suppress guards here? Or maybe it's better to do the current fix; just trying to follow.
Yeah, looks like you're right - this must have regressed with the inductor FX graph cache changes.

> Would it be enough to suppress guards here?

Fair question: I think the main issue is that:
(1) prior to the inductor cache, the output strides were actual SymInts - so we could just stash them in the tracing context, grab them in AOTAutograd, and slam them into our output tensors with a call to as_strided() while suppressing guards to make sure we didn't specialize
(2) now that we have the inductor cache, the output strides become part of the cache and they are serialized to disk. In the case of an inductor cache hit, we will load them from disk, and `exprs` here is a list of raw strings (they are basically the serialized sympy expressions for each stride).
So at the very least, we need to convert the raw sympy strings back into SymInts to use them in AOTAutograd. I found that ShapeEnv.evaluate_symexpr was the existing code for doing this, but it would also replace each symbol with the current hint, instead of replacing it with the raw sympy.Symbol that it mapped to.
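A toy standalone illustration of that difference (plain sympy with made-up values, not the actual ShapeEnv code path):

```python
import sympy

# What the FX graph cache hands back on a hit: raw strings, one per stride.
serialized_strides = ["s1*s2", "s2", "1"]

s1, s2 = sympy.symbols("s1 s2", integer=True, positive=True)
hints = {s1: 32, s2: 16}  # the ShapeEnv's current example values

for code in serialized_strides:
    expr = sympy.sympify(code, locals={"s1": s1, "s2": s2})
    # Substituting hints gives concrete ints (e.g. s1*s2 -> 512): this is the
    # specialization DDPOptimizer needs to avoid.
    print(code, "-> concrete:", expr.subs(hints))
    # Keeping the symbolic expression (and wrapping it in a SymInt against the
    # real ShapeEnv, as the PR does) preserves the dependence on s1/s2.
    print(code, "-> symbolic:", expr)
```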
cc @jamesjwu just to be aware.
    args = {str(e): val for e, val in self.var_to_val.items()}
    if return_symints:
        args = {
            str(e): SymInt(SymNode(e, self, int, int(val), fx_node=None))
I guess this is O(num_output_symints * num_total_symints) ..
True. This seemed sus, so I tried running a small benchmark (function with 100 3D inputs compiled with dynamic=True, P1681754466) with TORCH_COMPILE_CPROFILE=1, and:
(1) I don't see evaluate_symexpr or set_tracing_context_output_strides in the svg
(2) The E2E compile time was 15s - it seems like there is probably other low-hanging fruit for compile-time wins in a heavily-dynamic microbenchmark like this
svg download command: jf download GG0n2huLD3y3VGsBAAc3IKj-v5g0bsIXAAAz --file "_compile_inner_0_2.svg"
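The paste above is internal, so this is only a rough reconstruction of that kind of microbenchmark (the function body and sizes are assumptions):

```python
import time

import torch


def f(*xs):
    # touch every one of the 100 3D inputs so all of their dynamic shapes matter
    out = xs[0]
    for x in xs[1:]:
        out = out + x.sum()
    return out


inputs = [torch.randn(8, 16, 32) for _ in range(100)]
compiled = torch.compile(f, dynamic=True)

t0 = time.time()
compiled(*inputs)  # first call pays the compile cost
print(f"first call (includes compile): {time.time() - t0:.1f}s")
# run as: TORCH_COMPILE_CPROFILE=1 python bench.py to collect the profile
```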
(15s was a lot, ended up filing an issue about it here 😛 #140970)
Fixes #140229

The issue was that:
(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket
(2) doing so requires knowing the **exact** strides of the outputs of each subgraph, so we can have example inputs (with correct strides) to each of the later subgraphs to compile with
(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with
(4) these output strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which, given the sympy string and the ShapeEnv's `var_to_val` mapping, will evaluate the sympy string to generate concrete strides
(5) evaluating this expression will specialize on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know what inductor's stride outputs are symbolically. This requires converting the (string) sympy expression into actual `SymInt`s that we can return.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o ezyang SherlockNoMad EikanWang jgong5 wenzhe-nrv penguinwu bobrenjc93 voznesenskym Guobing-Chen XiaobingSuper zhuhaozhe blzheng jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Fixes #140229
Fixes #139474

The issue was that:
(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket
(2) doing so requires knowing the **exact** strides of the outputs of each subgraph, so we can have example inputs (with correct strides) to each of the later subgraphs to compile with
(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with
(4) these output strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which, given the sympy string and the ShapeEnv's `var_to_val` mapping, will evaluate the sympy string to generate concrete strides
(5) evaluating this expression will specialize on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know what inductor's stride outputs are symbolically. This requires converting the (string) sympy expression into actual `SymInt`s that we can return.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o ezyang SherlockNoMad EikanWang jgong5 wenzhe-nrv penguinwu bobrenjc93 voznesenskym Guobing-Chen XiaobingSuper zhuhaozhe blzheng jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes pytorch#140229
Fixes pytorch#139474

The issue was that:
(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket
(2) doing so requires knowing the **exact** strides of the outputs of each subgraph, so we can have example inputs (with correct strides) to each of the later subgraphs to compile with
(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with
(4) these output strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which, given the sympy string and the ShapeEnv's `var_to_val` mapping, will evaluate the sympy string to generate concrete strides
(5) evaluating this expression will specialize on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know what inductor's stride outputs are symbolically. This requires converting the (string) sympy expression into actual `SymInt`s that we can return.

Pull Request resolved: pytorch#140751
Approved by: https://github.com/eellison
Fixes #140229
Fixes #139474
The issue was that:
(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket
(2) doing so requires knowing the exact strides of the outputs of each subgraph, so we can have example inputs (with correct strides) to each of the later subgraphs to compile with
(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with
(4) these output strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which, given the sympy string and the ShapeEnv's `var_to_val` mapping, will evaluate the sympy string to generate concrete strides
(5) evaluating this expression will specialize on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know what inductor's stride outputs are symbolically. This requires converting the (string) sympy expression into actual `SymInt`s that we can return.

Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @ezyang @SherlockNoMad @EikanWang @jgong5 @wenzhe-nrv @penguinwu @bobrenjc93 @voznesenskym @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov