optimize RNN on CPU #22512
Conversation
0b296d7 to 7a98707
VitalyFedyunin left a comment:
Side note: Lots of unrelated formatting changes, really hard to read.
aten/src/ATen/native/RNN.cpp
Outdated
This might be a logical error: the chunk operator returns views over the input, so when you modify it in place in chunked_hgates[1].add_(chunked_igates[1]).sigmoid_(); you are also modifying the input tensor.
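For illustration only (not code from this PR's diff), a minimal libtorch sketch of the concern: chunk() returns views, so an in-place op on a chunk writes through to the tensor it was chunked from.

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  // chunk() returns views that share storage with the tensor being chunked.
  at::Tensor gates = at::zeros({1, 6});
  auto chunked = gates.chunk(3, /*dim=*/1);
  // The in-place add_/sigmoid_ on the view also rewrites those columns of `gates`.
  chunked[1].add_(1.0).sigmoid_();
  std::cout << gates << std::endl;  // the middle chunk now holds sigmoid(1) ~= 0.73, not 0
  return 0;
}
```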
Thanks for the comment. I'm not sure I understand this correctly. Here I used the in-place add_ on chunked_hgates, which comes from params.linear_hh(hidden).chunk(3, 1). The input part goes to chunked_igates, which is only used as the argument of add_. So I guess it should be safe?
In the case of pre_compute_input (btw, would it be better to rename it to past tense?):
// contains vector of views over input
const auto chunked_igates = pre_compute_input ? input.chunk(3, 1) : params.linear_ih(input).chunk(3, 1);
// ...
// changing input tensor inplace!
const auto new_gate = chunked_igates[2].add(chunked_hgates[2].mul_(reset_gate)).tanh_();
And even if you are sure that operator() will be called only once, it is a bad pattern to modify inputs.
I am a little confused here. This line is "chunked_igates[2].add", which is not an in-place add_, right? And tanh_ is applied to the result of add(), which is another tensor in my understanding.
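A minimal sketch of the distinction (illustrative only, not code from the PR): the out-of-place add() allocates a fresh tensor, so the following in-place tanh_() touches only that result, not the view over the input.

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  at::Tensor input = at::zeros({1, 6});
  auto chunked_igates = input.chunk(3, /*dim=*/1);
  at::Tensor hgate = at::ones({1, 2});
  // Out-of-place add() allocates a fresh tensor; the in-place tanh_()
  // then modifies only that fresh tensor, not the view and not `input`.
  at::Tensor new_gate = chunked_igates[2].add(hgate).tanh_();
  std::cout << input << std::endl;     // still all zeros
  std::cout << new_gate << std::endl;  // tanh(1) ~= 0.7616
  return 0;
}
```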
My bad. LGTM.
Thanks for the detailed review.
aten/src/ATen/native/RNN.cpp
Outdated
What is wrong with the same approach in the case of the GPU?
I will apply it in the next PR for GPU. Originally on GPU we do a matmul instead of linear for the input first, and fuse the bias part into the at::_thnn_fused_lstm_cell function. So I think it is better to make the change on CPU only first and then apply the GPU change, in case this PR gets too large.
7a98707 to 89f04de
Summary:
Pull Request resolved: pytorch#22512
optimize RNN on CPU
Differential Revision: D16113360
fbshipit-source-id: 32bcbe72c3749e4500d8223791b009db03a95d3d
89f04de to cee4c49
VitalyFedyunin left a comment:
Still contains an in-place modification of the input tensor.
This pull request has been merged in 8bdda03.
Summary:
Pull Request resolved: pytorch/pytorch#22512
optimize RNN on CPU
Reviewed By: llyfacebook
Differential Revision: D16113360
fbshipit-source-id: 9ee53b3b4bb9b636e7be1ccdf25420e2caa60762
This PR fuses the matmul ops for the input sequence together, which helps improve performance for all RNN layers.
In our test on a speech model, the GRU layer previously took about 40.16% of the total inference time (about 77 ms); after this PR it takes about 27.73% (about 43 ms).
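Roughly, the idea of the fusion (a hedged sketch with made-up shapes and names, not the actual code from this PR): project the whole input sequence with one large GEMM up front instead of calling linear_ih once per time step.

```cpp
#include <ATen/ATen.h>
#include <vector>

int main() {
  // Hypothetical shapes, only to illustrate the optimization:
  // seq_len T, batch B, input size I, hidden size H (a GRU has 3 gates).
  int64_t T = 8, B = 4, I = 16, H = 32;
  at::Tensor input = at::randn({T, B, I});
  at::Tensor w_ih = at::randn({3 * H, I});
  at::Tensor b_ih = at::randn({3 * H});

  // Per-step path: one small GEMM per time step.
  std::vector<at::Tensor> per_step;
  for (int64_t t = 0; t < T; ++t) {
    per_step.push_back(at::linear(input[t], w_ih, b_ih));
  }

  // Fused path: one large GEMM over the whole sequence, then split per step.
  at::Tensor fused = at::linear(input.reshape({T * B, I}), w_ih, b_ih)
                         .reshape({T, B, 3 * H});
  std::vector<at::Tensor> steps = fused.unbind(0);

  // Both paths produce the same per-step gate inputs.
  TORCH_CHECK(at::allclose(per_step[0], steps[0]));
  return 0;
}
```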
Summary: optimize RNN on CPU
Differential Revision: D16113360