
Conversation

@bdhirsh (Contributor) commented Nov 14, 2024

Fixes #140229

Fixes #139474

The issue was that:

(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket

(2) doing so requires knowing the **exact** strides of the outputs of each subgraph, so that we can construct example inputs (with correct strides) for compiling each of the later subgraphs (see the sketch after this list)

(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with

(4) these output strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which, given the sympy string and the ShapeEnv's `var_to_val` mapping, evaluates the string to produce concrete strides

(5) evaluating this expression specializes on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know inductor's output strides symbolically. This requires converting the (string) sympy expression into actual `SymInt`s that we can return.
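
To make point (2) concrete, here is a toy sketch (not DDPOptimizer's actual implementation; `bucket1`/`bucket2` are made-up stand-ins for two buckets): the example input used to compile the second subgraph has to carry the same strides that the first subgraph will actually produce at runtime.

```python
import torch

def bucket1(x):
    # stand-in for the first bucket: produces a non-contiguous output
    return x.t()

def bucket2(y):
    # stand-in for a later bucket that consumes bucket1's output
    return y + 1

x = torch.randn(8, 16)
out1 = bucket1(x)
print(out1.shape, out1.stride())  # torch.Size([16, 8]) (1, 16) -- not contiguous

# Example input for compiling bucket2: same shape *and* strides as bucket1's
# real output, so the second subgraph is compiled for the layout it will see.
example_in = torch.empty_strided(out1.shape, out1.stride())
compiled_bucket2 = torch.compile(bucket2)
_ = compiled_bucket2(example_in)
```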

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @ezyang @SherlockNoMad @EikanWang @jgong5 @wenzhe-nrv @penguinwu @bobrenjc93 @voznesenskym @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot (bot) commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140751

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7dce432 with merge base 38d10a1:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eellison (Contributor) left a comment


looks good, a few questions.. think this was a recent regression with the aot autograd cache changes..

Comment on lines +304 to +306
if not DDP._active_ddp_module:
    with self._inside_ddp_forward():
        return self.module.forward(*inputs, **kwargs)
Contributor:

Why do we need these changes? I'm not familiar with DDP._active_ddp_module

Contributor Author:

When I just re-used this testing infra as-is, dynamo would graph break on the context manager (because it mutates an attribute on the current nn.Module) - that might have been fine for other usages of this helper in the tests, but I wanted my test to not have graph breaks, since it makes it easier for me to assert on compile metrics.

My understanding of this test class is that `_active_ddp_module` is a custom attribute on DDP that dynamo hooks into in order to know whether DDP is running, and we can use that attribute to simulate DDP <> dynamo interactions without needing to actually involve distributed code in the test.

I didn't want to fix all of the infra, but I also wanted my test not to have graph breaks, so for my own local usage I set _inside_ddp_forward() manually to be true in my test, outside of the compiled region.

If you'd rather I don't tweak the existing testing infra, I can also just tweak my test to not re-use this infra.
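
For what it's worth, here is a toy illustration of the pattern described above (a made-up module, not the real DDP test infra): the attribute-mutating context manager is entered outside the compiled region, so nothing inside the compiled forward mutates module state.

```python
import contextlib
import torch

class ToyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        self._inside_forward_flag = False  # stand-in for DDP._active_ddp_module-style state

    @contextlib.contextmanager
    def _inside_forward(self):
        # mutating a module attribute like this inside a compiled region is
        # what caused the graph break in the original test helper
        self._inside_forward_flag = True
        try:
            yield
        finally:
            self._inside_forward_flag = False

    def forward(self, x):
        return self.linear(x)

mod = ToyModule()
compiled = torch.compile(mod, fullgraph=True)

# Enter the context manager outside the compiled call, mirroring "set
# _inside_ddp_forward() manually ... outside of the compiled region".
with mod._inside_forward():
    out = compiled(torch.randn(4, 8))
```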


- def evaluate_symexpr(self, code: str) -> Union[int, float, bool]:
+ def evaluate_symexpr(
+     self, code: str, *, return_symints: bool
Contributor:

The return type is incorrect now.. Would it be easier to type / follow with a different function? Either way is fine, but I do agree with the bc linter to add a default `return_symints = False`.

Contributor Author:

yup agreed - I'll split it into a separate function
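
A hypothetical sketch of what such a split could look like (names, signatures, and the sympy-level bodies are illustrative only; this is not the code that actually landed): one concretely-typed evaluator, plus a separate entry point for the symbolic path.

```python
from typing import Union

import sympy

def evaluate_symexpr(code: str, hints: dict[str, int]) -> Union[int, float, bool]:
    # concrete path: every symbol is replaced by its hint, so the result is a
    # plain Python value -- this is where specialization happens (bool handling
    # omitted for brevity)
    val = sympy.sympify(code).subs(hints)
    return int(val) if val.is_Integer else float(val)

def parse_symexpr(code: str, symbols: dict[str, sympy.Symbol]) -> sympy.Expr:
    # symbolic path: symbols stay symbolic; the real ShapeEnv version would wrap
    # the result into SymInts rather than return a raw sympy expression
    return sympy.sympify(code, locals=symbols)
```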

tuple(
    (
-       shape_env.evaluate_symexpr(e)
+       shape_env.evaluate_symexpr(
Contributor:

So, in the original version of this PR we suppressed guards when we were setting the output strides.

I guess in the refactor we are now adding guards when we get the strides to return.

Would it be enough to suppress guards here? Or maybe it's better to do the current fix. Just trying to follow.

Contributor Author:

Yeah, looks like you're right - this must have been a regression from the inductor FX graph cache changes

> Would it be enough to suppress guards here?

fair question: I think the main issue is that:

(1) prior to the inductor cache, the output strides were actual SymInts - so we could just stash them in the tracing context, grab them in AOTAutograd, and slam them into our output tensors with a call to as_strided() while suppressing guards to make sure we didn't specialize

(2) now that we have the inductor cache, the output strides become part of the cache and are serialized to disk. In the case of an inductor cache hit, we will load them from disk, and `exprs` here is a list of raw strings (basically the serialized sympy expressions for each stride).

So at the very least, we need to convert the raw sympy strings back into SymInts to use them in AOTAutograd. I found that `ShapeEnv.evaluate_symexpr` was the existing code for doing this, but it would also replace each symbol with the current hint, instead of replacing it with the raw `sympy.Symbol` that it mapped to.
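
A minimal sketch of that round-trip in plain sympy (the real code goes through the FX graph cache and the ShapeEnv, and wraps the result into SymInt/SymNode; that wiring is elided here):

```python
import sympy

s1, s2 = sympy.symbols("s1 s2", integer=True, positive=True)

# Before caching, the output stride is a live symbolic expression.
stride = s1 * s2

# What the FX graph cache stores: just the serialized string.
serialized = str(stride)  # "s1*s2"

# On a cache hit, the string has to be parsed back into an expression.
restored = sympy.sympify(serialized, locals={"s1": s1, "s2": s2})

# Replacing symbols with hints (the old evaluate_symexpr behavior) specializes:
print(restored.xreplace({s1: 16, s2: 4}))  # 64, only valid for these sizes

# Keeping the symbols preserves a stride that can be turned back into SymInts:
print(restored)  # s1*s2
```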

Contributor:

cc @jamesjwu just to be aware.

args = {str(e): val for e, val in self.var_to_val.items()}
if return_symints:
    args = {
        str(e): SymInt(SymNode(e, self, int, int(val), fx_node=None))
Contributor:

I guess this is O(num_output_symints * num_total_symints)..

Contributor Author:

True. This seemed sus, so I tried running a small benchmark (function with 100 3D inputs compiled with dynamic=True, P1681754466) with TORCH_COMPILE_CPROFILE=1, and:

(1) I don't see evaluate_symexpr or set_tracing_context_output_strides in the svg

(2) The E2E compile time was 15s - it seems like there is probably other low-hanging fruit in a heavily-dynamic microbenchmark for finding compile time wins

svg download command: jf download GG0n2huLD3y3VGsBAAc3IKj-v5g0bsIXAAAz --file "_compile_inner_0_2.svg"

Contributor Author:

(15s was a lot, ended up filing an issue about it here 😛 #140970)
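
For context on the O(num_output_symints * num_total_symints) point above, a toy sketch (plain sympy, not the ShapeEnv code): rebuilding the full symbol-to-value mapping for each output expression does O(M) dict-building work per output, whereas a hypothetical alternative looks up only the symbols each expression actually uses.

```python
import sympy

var_to_val = {sympy.Symbol(f"s{i}"): i + 2 for i in range(1000)}  # M symbols tracked by the ShapeEnv
output_stride_exprs = ["s0*s1", "s1", "1"]                        # N serialized output strides

def eval_rebuilding_map(code: str) -> int:
    # mirrors the shape of the diff above: the full symbol->value mapping is
    # rebuilt on every call, so evaluating N output strides spends O(N * M)
    # just constructing these dicts
    args = dict(var_to_val)  # O(M) work per output
    return int(sympy.sympify(code).xreplace(args))

def eval_only_needed(code: str) -> int:
    # hypothetical alternative (not what the PR does): parse first, then look
    # up only the symbols that actually appear in this particular expression
    expr = sympy.sympify(code)
    return int(expr.xreplace({s: var_to_val[s] for s in expr.free_symbols}))

assert [eval_rebuilding_map(c) for c in output_stride_exprs] == [
    eval_only_needed(c) for c in output_stride_exprs
]
```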

bdhirsh added a commit that referenced this pull request Nov 21, 2024
@bdhirsh added the suppress-bc-linter label Nov 27, 2024
@bdhirsh (Contributor Author) commented Nov 27, 2024

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label Nov 27, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@albanD albanD removed their request for review December 4, 2024 22:48
bdhirsh added a commit that referenced this pull request Dec 4, 2024
@bdhirsh (Contributor Author) commented Dec 4, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@bdhirsh removed the suppress-bc-linter label Dec 5, 2024
AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024

Pull Request resolved: pytorch#140751
Approved by: https://github.com/eellison
@github-actions github-actions bot deleted the gh/bdhirsh/625/head branch January 5, 2025 02:10

Labels

ciflow/inductor, ciflow/trunk, fx, Merged, module: dynamic shapes, module: inductor, oncall: distributed, release notes: fx
