[wip] Speed up permute and expand cases in as_strided_backward #8965
Conversation
facebook-github-bot
left a comment
@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
This is a pretty complex patch. Maybe try to get someone to pair review it with you in person.
facebook-github-bot
left a comment
@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
You can't claim a speed-up without perf numbers!
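For context, a minimal micro-benchmark sketch of the kind of numbers being asked for. Everything here is an illustrative assumption rather than part of this PR: the shapes, the iteration count, and the choice of timing the backward of an `as_strided` that is a pure permutation of the input geometry.

```py
import timeit
import torch

x = torch.randn(64, 128, 256, requires_grad=True)

def permute_case():
    # as_strided with a permuted geometry of x; with the proposed optimization
    # this backward could avoid the generic storage-sized scatter-add path.
    y = x.as_strided([256, 64, 128], [1, 32768, 256])
    torch.autograd.grad(y, x, torch.ones_like(y))

# Run before and after the patch and compare the timings.
print(timeit.timeit(permute_case, number=100))
```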
zou3519
left a comment
Test failure looks real
Jun 07 15:04:29 ======================================================================
Jun 07 15:04:29 ERROR: test_as_strided (__main__.TestAutograd)
Jun 07 15:04:29 ----------------------------------------------------------------------
Jun 07 15:04:29 Traceback (most recent call last):
Jun 07 15:04:29 File "test_autograd.py", line 2623, in test_as_strided
Jun 07 15:04:29 test(torch.randn(2, 3), lambda x: x.expand(10, 2, 3), [2, 3], [3, 1], 0)
Jun 07 15:04:29 File "test_autograd.py", line 2593, in test
Jun 07 15:04:29 gradcheck(closure, [x])
Jun 07 15:04:29 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/autograd/gradcheck.py", line 289, in gradcheck
Jun 07 15:04:29 'numerical:%s\nanalytical:%s\n' % (i, j, n, a))
Jun 07 15:04:29 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/autograd/gradcheck.py", line 227, in fail_test
Jun 07 15:04:29 raise RuntimeError(msg)
Jun 07 15:04:29 RuntimeError: Jacobian mismatch for output 0 with respect to input 0,
Jun 07 15:04:29 numerical:tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
Jun 07 15:04:29 [0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
Jun 07 15:04:29 [0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000],
Jun 07 15:04:29 [0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000],
Jun 07 15:04:29 [0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000],
Jun 07 15:04:29 [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000]])
Jun 07 15:04:29 analytical:tensor([[10., 0., 0., 0., 0., 0.],
Jun 07 15:04:29 [ 0., 10., 0., 0., 0., 0.],
Jun 07 15:04:29 [ 0., 0., 10., 0., 0., 0.],
Jun 07 15:04:29 [ 0., 0., 0., 10., 0., 0.],
Jun 07 15:04:29 [ 0., 0., 0., 0., 10., 0.],
Jun 07 15:04:29 [ 0., 0., 0., 0., 0., 10.]])
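For reference, a minimal reproduction sketch of that failing case. It assumes the test closure is essentially `as_strided` applied after the `expand` (the exact helper in `test_autograd.py` may differ), and uses double precision as `gradcheck` expects.

```py
import torch
from torch.autograd import gradcheck

x = torch.randn(2, 3, dtype=torch.double, requires_grad=True)

def closure(t):
    # expand to (10, 2, 3), then take a plain (2, 3) view of the storage
    return t.expand(10, 2, 3).as_strided([2, 3], [3, 1], 0)

# Each input element reaches the output exactly once, so the Jacobian should
# be the identity. On the branch under review the analytical Jacobian comes
# out 10x too large (the gradient is summed over all 10 expanded copies),
# which is the mismatch shown in the log above; a correct backward passes.
gradcheck(closure, [x])
```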
This is WIP still. But I plan to finish it when I have time.
Summary: Currently, `reshape` does an `as_strided` when the geometry is viewable. However, `as_strided` backward is not very optimized and cannot always detect such cases. Improvements are planned at #8965, and I will finish it some day. But the current situation is that, in these cases, backward through `reshape` will copy the gradient while a simple `view` will not. This is unnecessary. Notably this affects `flatten` and a whole bunch of other ops implemented on top of `reshape`.

```py
In [15]: x = torch.randn(3, 4, requires_grad=True)

In [16]: y = x.reshape(x.shape)

In [17]: assert y._base is not None

In [18]: gy = torch.randn_like(y)

In [20]: gx = torch.autograd.grad(y, x, gy)[0]

In [21]: gx
Out[21]:
tensor([[ 0.2189,  0.3396, -0.1108,  1.7703],
        [ 1.0737, -0.1222,  1.0765, -1.3363],
        [-1.3798, -0.2950,  0.0800,  0.2501]])

In [22]: gx._base  # not gy
Out[22]:
tensor([ 0.2189,  0.3396, -0.1108,  1.7703,  1.0737, -0.1222,  1.0765, -1.3363,
        -1.3798, -0.2950,  0.0800,  0.2501])

In [23]: gy.zero_()
Out[23]:
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

In [24]: gx  # not sharing storage with gy
Out[24]:
tensor([[ 0.2189,  0.3396, -0.1108,  1.7703],
        [ 1.0737, -0.1222,  1.0765, -1.3363],
        [-1.3798, -0.2950,  0.0800,  0.2501]])

# but everything is optimized with view, which should be equivalent to reshape in this case
In [25]: y = x.view(x.shape)

In [26]: assert y._base is not None

In [27]: gy = torch.randn_like(y)

In [28]: gx = torch.autograd.grad(y, x, gy)[0]

In [29]: gx
Out[29]:
tensor([[-2.4463,  1.1446,  0.1501,  0.1212],
        [-1.1125,  1.4661,  0.9092, -0.2153],
        [-0.1937, -0.3381, -1.3883, -0.7329]])

In [30]: gy.zero_()
Out[30]:
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

In [31]: gx  # sharing storage with gy
Out[31]:
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
```

Pull Request resolved: #28901
Differential Revision: D18240868
Pulled By: ezyang
fbshipit-source-id: 28fdaa0c7014a9dae6731dfe8b67784d38fc27f0
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Optimize for the permute case: since we sort the strides, we can check whether the sizes & strides match via some permutation (assuming that the strides are unique).

Optimize for the expand case: expanding dim `i` to `size[i]` just multiplies the number of times each address is referred to in the input by `size[i]`, so we can simply divide the gradient by `\prod_{expanded dim i} inp_size[i]` and `sum` over multiple dimensions at once, rather than doing a `sum` for each one (same for `squeeze`).

I added a test that asserts combinations of permute, (un)squeeze, expand, and transpose won't allocate the large storage-like tensor. Although it tests the internal behavior of `as_strided_backward` and will need to be modified if further optimizations are done for other cases, I can't find a better way to test that we are actually not copying but only returning a view of the gradient.
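To make the two detections above concrete, here is a rough, hypothetical sketch in Python. It is only an illustration under stated assumptions, not the ATen implementation: `is_pure_permutation` and `expanded_count` are made-up helper names, and the size/stride tuples are assumed to describe the same underlying storage.

```py
import torch

def is_pure_permutation(inp_size, inp_stride, out_size, out_stride):
    # After sorting by stride, geometries that differ only by a permutation of
    # dimensions yield identical (stride, size) pairs (assumes unique strides).
    return sorted(zip(inp_stride, inp_size)) == sorted(zip(out_stride, out_size))

def expanded_count(inp_size, inp_stride):
    # Each stride-0 (expanded) dim multiplies the number of times every address
    # is referred to by that dim's size, so the total factor is the product
    # over expanded dims.
    count = 1
    for sz, st in zip(inp_size, inp_stride):
        if st == 0:
            count *= sz
    return count

# A transpose is a pure permutation of a (2, 3) tensor with strides (3, 1):
assert is_pure_permutation((2, 3), (3, 1), (3, 2), (1, 3))

# x.expand(10, 2, 3) refers to every address 10 times:
assert expanded_count((10, 2, 3), (0, 3, 1)) == 10

# Summing over several expanded dims in one call instead of one `sum` per dim:
g = torch.randn(10, 4, 2, 3)
assert torch.allclose(g.sum(dim=(0, 1)), g.sum(0).sum(0))
```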