@shunting314 shunting314 commented Mar 3, 2025

Stack from ghstack (oldest at bottom):

Fix #148356

This is a short-term fix that restores the default behavior of applying a layout constraint to custom ops when they have no layout tags.

A longer-term attempt to make sure Inductor always gets the correct eager strides is here: #148104

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov
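To make the intent concrete, here is a minimal sketch of the decision this PR restores, including the backward-only refinement for ATen implicit fallbacks discussed later in this thread. This is not Inductor's actual code: plain strings stand in for the real `torch._C.Tag` values, and the function name and return values are illustrative only.

```python
# Hypothetical sketch of the layout-constraint decision this PR restores.
# Plain strings stand in for torch._C.Tag values; names are illustrative,
# not Inductor's actual API.

def decide_layout_constraint(tags, is_builtin=False, is_backward=False):
    """Pick a layout constraint for an op that falls back to eager."""
    if "needs_fixed_stride_order" in tags:
        return "constrain_to_fx_strides"  # op declared it needs eager strides
    if "flexible_layout" in tags:
        return "flexible_layout"          # Inductor may choose any layout
    if is_builtin:
        # ATen implicit fallback: forward ops tolerate any strides, so the
        # constraint only matters for the backward graph.
        return "require_contiguous" if is_backward else "flexible_layout"
    # Custom op with no tags: apply the default constraint (the behavior
    # this PR recovers) instead of leaving the layout unconstrained.
    return "constrain_to_fx_strides"

# An untagged custom op gets the default constraint back:
print(decide_layout_constraint(frozenset()))  # constrain_to_fx_strides
```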


pytorch-bot bot commented Mar 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148367

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bf76d44 with merge base 1c544a9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

shunting314 added a commit that referenced this pull request Mar 4, 2025
ghstack-source-id: 3d2340e
Pull Request resolved: #148367
@shunting314 shunting314 requested a review from zou3519 March 4, 2025 00:09

@eellison eellison left a comment


I would kind of rather we fix the relevant weight norm kernels than land this PR, which is a not-fully-featured version of what Richard was doing in the other PR.

Giving ATen kernels this extra stride constraint is bad for things like channels-last optimization.

@shunting314

> I would kind of rather we fix the relevant weight norm kernels than land this PR, which is a not-fully-featured version of what Richard was doing in the other PR.

But this is a general problem, not just for weight norm. Other ATen ops may have layout constraints as well, and Inductor does not know about them if the op is an implicit fallback.

> Giving ATen kernels this extra stride constraint is bad for things like channels-last optimization.

This only affects implicit fallbacks. For explicitly registered ATen kernels like convolution, we still have full control of the input layouts.

This PR only brings back the behavior for custom ops. The behavior for implicit fallbacks of ATen ops was already added last November.

@eellison

eellison commented Mar 4, 2025

@shunting314 I don't think this is a general problem for ATen ops. The error is only for specific backward ops, which assume as an invariant that the forward has already applied the layout constraints.

@shunting314

> I don't think this is a general problem for ATen ops. The error is only for specific backward ops, which assume as an invariant that the forward has already applied the layout constraints.

Yeah, that's what I mean by 'general'. New ops like this could be added in the future, right?

@eellison

eellison commented Mar 4, 2025

@shunting314 we are not adding new core ATen ops at any fast rate, especially ones with backward passes and unchecked invariants around striding.

@shunting314

Chatted with @eellison offline. The default layout constraint for implicit-fallback ATen ops is now only applied to the backward graph, since forward ATen ops should work with any strides.

```python
# For ATen ops, only apply the constraint for backward
# ops since fwd ops should work for any strides.
if torch._library.utils.is_builtin(target) and self.is_backward:
    decided_constraint = require_contiguous  # type: ignore[assignment]
```
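For illustration, here is a minimal sketch of what a `require_contiguous` constraint amounts to in practice: every tensor argument of the fallback op is forced into contiguous layout before the eager kernel runs. The helper name is hypothetical; only `torch.Tensor.contiguous()` is a real API here.

```python
import torch

# Illustrative only: apply_require_contiguous is a hypothetical helper, not
# Inductor's implementation. It shows the effect of a require_contiguous
# constraint: tensor inputs are made contiguous before the fallback runs.
def apply_require_contiguous(args):
    return tuple(
        a.contiguous() if isinstance(a, torch.Tensor) else a
        for a in args
    )

# A transposed tensor is non-contiguous; the constraint copies it.
t = torch.arange(6).reshape(2, 3).t()
fixed, scale = apply_require_contiguous((t, 1.5))
print(t.is_contiguous(), fixed.is_contiguous())  # False True
```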

Hmm, I guess in the future we could change this to matching the FX strides.


Do you mean `needs_exact_strides`?

shunting314 added a commit that referenced this pull request Mar 4, 2025
ghstack-source-id: cbfc7e7
Pull Request resolved: #148367
@shunting314

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 5, 2025
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_9-cuda12_6-build / build


@shunting314

@pytorchbot merge -i

@pytorchmergebot

Merge started

Your change will be merged while ignoring the following 1 checks: linux-binary-manywheel / manywheel-py3_9-cuda12_6-build / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the gh/shunting314/199/head branch April 11, 2025 02:30