dont clone symints, dont clobber symint proxies #88230
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88230
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 Failure as of commit 7c0d484.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
albanD
left a comment
Thanks :)
As discussed in the meeting, a further BE follow-up is to delete clone entirely.
Did anyone ever explain why this is better (or what's wrong with the clone approach)? Hopefully not just that "it's easier to automatically create missing symints/proxies than to find out where they should have been created"?
I have a hypothesis for why the cloning approach is wrong. It is only safe to clone when we are setting the sizes/strides on the bottom-most fake tensor; only those tensors are guaranteed to pass through proxy tensor mode and actually get tracked. With functionalization, we also have a FunctionalizeTensorWrapper floating around, and I guess we are accidentally cloning on that (we shouldn't), resulting in untracked proxies. The cloning business seems pretty delicate, and in practice this approach seems to have been more robust; we haven't had any more partitioner disasters on the branch since we put this in to soak.
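To make that concrete, here is a minimal, hypothetical sketch (plain-Python stand-ins, not the real SymNode / proxy mode classes) of how a cloned symint ends up with an untracked proxy when tracking is keyed on object identity:

import weakref

class FakeSymNode:
    """Stand-in for a symbolic int node; not the real torch SymNode."""
    def __init__(self, expr):
        self.expr = expr
    def clone(self):
        # Old behavior: produce a brand-new node carrying the same expression.
        return FakeSymNode(self.expr)

# Proxy tracking keyed on node identity (weakly, so dead nodes drop out).
tracked_proxies = weakref.WeakKeyDictionary()

s = FakeSymNode("s0")
tracked_proxies[s] = "proxy_for_s0"   # recorded when the symint was first traced

c = s.clone()                          # e.g. while setting sizes on a wrapper tensor
assert c.expr == s.expr
assert c not in tracked_proxies        # the clone has no proxy -> "untracked proxies"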
Status update on this?!
This is mostly ready, but there are a few proxy tensor test failures that I didn't have a chance to chug through. I'll try to hit them today. Also needed to rebase / rebuild due to merge conflicts.
Darn - the breakage is that there are cases where we DO want to clobber the proxy, namely for inplace ops on tensors. But there are probably also other cases where an inplace op mutates the size/stride/storage_offset and we need to replace the proxy. Since mutation of symints isn't really a thing, maybe we can avoid the clobbering only in the case where we have SymInts/SymFloats, and properly do the clobbering for tensors?
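A rough sketch of the rule being proposed (hypothetical helper, not the actual proxy tensor code; is_symbolic_scalar stands in for an isinstance check against SymInt/SymFloat):

def set_proxy(proxy_table, key, value, new_proxy, is_symbolic_scalar):
    existing = proxy_table.get(key)
    if existing is not None and is_symbolic_scalar(value):
        # SymInts/SymFloats are never mutated in place, so keep the proxy
        # recorded when the value was first produced instead of clobbering it.
        return existing
    # Tensors can be mutated in place (size/stride/storage_offset may change),
    # so here replacing ("clobbering") the proxy is the behavior we want.
    proxy_table[key] = new_proxy
    return new_proxy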
Sgtm
BTW, it doesn't have to be a full clobber for tensors either. When we |
# TODO(whc)- are the saved-tensors/saved-symints correct here?
# i just made the test pass based on what default partition did
[False, True, True, False, False] + [False] * 5 + [True] * 3,
[False, True, True, False, False] + [False] * 4 + [True] * 4,
Hey @wconstab - I guess I'm not too surprised that this test wobbled, since it seems like this test originally wobbled when we added symint cloning. Just wanted to see if you have any thoughts.
I'm trying to recall...
[False, True, True, False, False]
This first block should correspond to the outputs of f, namely return cat, sb, c, mm2, where sb expands to 2 dims of size, hence True, True for is_sym_node.
[False] * 4 + [True] * 4,
This part I can't remember exactly. In the non-clone case we're saving one less non-sym-node for backward, and one more sym-node. I think what I would do is check what these forward/backward graphs look like and what they were like before this PR.
Hmm ok. The new graph does seem strictly better than the old one. To (partially) explain the difference of why we have -1 tensor and +1 symint saved:
- One thing I see is that we return cat twice in the old forward but not the new one.
- We also previously weren't saving one of the symints that was created in the forward, and now we are:
add: Sym(s1 + s2) = sym_size + sym_size_1
Old:
class GraphModule(torch.nn.Module):
def forward(self, primals_1: f32[s0], primals_2: f32[s1, s0], primals_3: f32[s2, s0], primals_4: f32[s0, 1]):
# No stacktrace found for following nodes
sym_size: Sym(s1) = torch.ops.aten.sym_size(primals_2, 0)
sym_size_1: Sym(s2) = torch.ops.aten.sym_size(primals_3, 0)
add: Sym(s1 + s2) = sym_size + sym_size_1
sym_size_2: Sym(s0) = torch.ops.aten.sym_size(primals_1, 0)
expand: f32[s1 + s2, s0] = torch.ops.aten.expand.default(primals_1, [add, sym_size_2]); add = None
cat: f32[2*s1 + 2*s2, s0] = torch.ops.aten.cat.default([expand, primals_2, primals_3])
mm: f32[2*s1 + 2*s2, 1] = torch.ops.aten.mm.default(cat, primals_4)
sym_size_3: Sym(1) = torch.ops.aten.sym_size(mm, 1)
view: f32[1, s0] = torch.ops.aten.view.default(primals_1, [sym_size_3, sym_size_2]); primals_1 = sym_size_3 = None
mm_1: f32[2*s1 + 2*s2, s0] = torch.ops.aten.mm.default(mm, view)
sym_size_5: Sym(s0) = torch.ops.aten.sym_size(primals_2, 1); primals_2 = None
return [cat, sym_size, sym_size_5, primals_3, mm_1, view, mm, primals_4, expand, cat, sym_size_2, sym_size, sym_size_1]
New:
class GraphModule(torch.nn.Module):
def forward(self, primals_1: f32[s0], primals_2: f32[s1, s0], primals_3: f32[s2, s0], primals_4: f32[s0, 1]):
# No stacktrace found for following nodes
sym_size: Sym(s1) = torch.ops.aten.sym_size(primals_2, 0)
sym_size_1: Sym(s2) = torch.ops.aten.sym_size(primals_3, 0)
add: Sym(s1 + s2) = sym_size + sym_size_1
sym_size_2: Sym(s0) = torch.ops.aten.sym_size(primals_1, 0)
expand: f32[s1 + s2, s0] = torch.ops.aten.expand.default(primals_1, [add, sym_size_2])
cat: f32[2*s1 + 2*s2, s0] = torch.ops.aten.cat.default([expand, primals_2, primals_3]); expand = None
mm: f32[2*s1 + 2*s2, 1] = torch.ops.aten.mm.default(cat, primals_4)
sym_size_3: Sym(1) = torch.ops.aten.sym_size(primals_4, 1)
view: f32[1, s0] = torch.ops.aten.view.default(primals_1, [sym_size_3, sym_size_2]); primals_1 = sym_size_3 = None
mm_1: f32[2*s1 + 2*s2, s0] = torch.ops.aten.mm.default(mm, view)
sym_size_4: Sym(s0) = torch.ops.aten.sym_size(primals_2, 1); primals_2 = None
return [cat, sym_size, sym_size_4, primals_3, mm_1, view, mm, cat, primals_4, sym_size_1, sym_size, add, sym_size_2]
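For what it's worth, the new expected list does line up with the new forward's outputs; a quick sanity check is below (names copied from the graph above; the classification rule is just my guess at what is_sym_node checks):

new_fw_returns = [
    "cat", "sym_size", "sym_size_4", "primals_3", "mm_1",   # outputs of f
    "view", "mm", "cat", "primals_4",                       # tensors saved for backward
    "sym_size_1", "sym_size", "add", "sym_size_2",          # symints saved for backward
]
is_sym_node = [n.startswith("sym_size") or n == "add" for n in new_fw_returns]
assert is_sym_node == [False, True, True, False, False] + [False] * 4 + [True] * 4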
Hmm, aren't you returning cat twice in both old/new?
  def clone(self):
-     return SymNode(self.expr, self.shape_env, self.pytype, constant=self.constant)
+     return self
Why do we want to return self here?
We're disabling the SymNode cloning. A better refactor would be to eliminate the clone method entirely.
I see. It might be nice to add a warning here, since this behavior seems a little unexpected.
Until Brian removes calls to clone from C++, this method will get (uselessly) called a bunch, so a warning would just be spam.
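(For illustration only: a tiny plain-Python stand-in, not the real SymNode, showing why returning self keeps identity-keyed proxy tracking intact.)

class _Node:
    def clone(self):
        return self  # new behavior: "cloning" preserves identity

n = _Node()
tracked = {n: "proxy_for_n"}   # proxy tracking keyed on the node itself
assert n.clone() in tracked     # the clone still resolves to its proxy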
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 additional jobs have failed; the first few of them are: trunk, trunk / cuda11.6-py3.10-gcc7-sm86 / test (default, 1, 4, linux.g5.4xlarge.nvidia.gpu).
@pytorchbot merge -f "flaky ci only"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):