Conversation

@xiaomengy
Contributor

@xiaomengy xiaomengy commented Jul 3, 2019

This PR fuses the matmul ops over the input sequence together, which helps improve the performance of all RNN layers.
In our test on a speech model, the GRU layer previously took about 40.16% of the total inference time (about 77ms); after this PR, it takes about 27.73% of the total inference time (about 43ms).
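
For readers, a minimal sketch of the fusion idea (assumed libtorch API, hypothetical helper name; not the actual code in this PR): instead of applying the input-to-hidden projection once per time step inside the cell, the projection for the whole input sequence can be computed with a single matmul up front, so each step only needs the hidden-to-hidden matmul and the elementwise gate math.

#include <torch/torch.h>
#include <iostream>

// Hypothetical helper: pre-compute the GRU input gates for the whole sequence
// with one fused GEMM instead of one matmul per time step.
torch::Tensor precompute_input_gates(
    const torch::Tensor& input,   // [seq_len, batch, input_size]
    const torch::Tensor& w_ih,    // [3 * hidden_size, input_size]
    const torch::Tensor& b_ih) {  // [3 * hidden_size]
  return torch::linear(input, w_ih, b_ih);  // [seq_len, batch, 3 * hidden_size]
}

int main() {
  const int64_t seq_len = 50, batch = 8, input_size = 128, hidden_size = 256;
  auto input = torch::randn({seq_len, batch, input_size});
  auto w_ih  = torch::randn({3 * hidden_size, input_size});
  auto b_ih  = torch::randn({3 * hidden_size});

  // Computed once here; each time step then slices igates[t] instead of
  // running its own input matmul.
  auto igates = precompute_input_gates(input, w_ih, b_ih);
  std::cout << igates.sizes() << std::endl;  // [50, 8, 768]
}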

Summary: optimize RNN on CPU

Differential Revision: D16113360

Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Side note: Lots of unrelated formatting changes, really hard to read.

Contributor

This might be a logical error: the chunk operator returns views over the input. So when you modify it in place in chunked_hgates[1].add_(chunked_igates[1]).sigmoid_(); you are also modifying the input tensor.
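
A minimal standalone sketch of the aliasing concern (illustrative only, not code from this PR): chunk() returns views, so an in-place op on a chunk writes through to the original tensor.

#include <torch/torch.h>
#include <iostream>

int main() {
  auto gates = torch::zeros({2, 6});
  auto chunks = gates.chunk(3, /*dim=*/1);  // three views over `gates`

  chunks[1].add_(1.0);  // in-place add on the view

  // Columns 2 and 3 of `gates` are now 1: the base tensor was modified
  // through the view.
  std::cout << gates << std::endl;
}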

Contributor Author

Thanks for the comment. I'm not sure I understand that correctly. Here I used the in-place add_ on chunked_hgates, which comes from params.linear_hh(hidden).chunk(3, 1). The input part goes into chunked_igates, which is only used as the argument to add_. So I think it should be safe?

Contributor

@VitalyFedyunin VitalyFedyunin Jul 11, 2019

In the pre_compute_input case (btw, would it be better to rename it to past tense?):

// contains vector of views over input
const auto chunked_igates = pre_compute_input ? input.chunk(3, 1) : params.linear_ih(input).chunk(3, 1);

// ...

// changing input tensor inplace!
const auto new_gate = chunked_igates[2].add(chunked_hgates[2].mul_(reset_gate)).tanh_();

And even if you are sure that operator() would be called only once, it is a bad pattern to modify inputs.

Contributor Author

I am a little confused here. This line uses chunked_igates[2].add, which is not an in-place add_, right? And tanh_ is applied to the result of add(), which is another tensor in my understanding.
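
To make the distinction concrete, a small sketch (illustrative only, not code from this PR) showing that the out-of-place add() allocates a new tensor, so the trailing tanh_() only mutates that result and the input behind the views is left untouched:

#include <torch/torch.h>
#include <iostream>

int main() {
  auto input = torch::zeros({2, 6});
  auto chunks = input.chunk(3, /*dim=*/1);   // views over `input`

  // add() is out of place: it returns a fresh tensor, and tanh_() is applied
  // in place to that fresh tensor, not to the view or its base.
  auto result = chunks[2].add(1.0).tanh_();

  std::cout << input.sum().item<float>() << std::endl;   // 0: `input` unchanged
  std::cout << result.sum().item<float>() << std::endl;  // 4 * tanh(1), about 3.05
}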

Contributor

My bad. LGTM.

Contributor Author

Thanks for the detailed review.

Contributor

What is wrong with taking the same approach in the GPU case?

Contributor Author

I will apply it to the GPU in the next PR. Originally, on GPU we do a matmul instead of linear on the input first, and fuse the bias part into the at::_thnn_fused_lstm_cell function. So I think it is better to make the CPU-only change first and then apply the GPU change, to keep this PR from getting too large.

@xiaomengy
Contributor Author

Summary:
Pull Request resolved: pytorch#22512

optimize RNN on CPU

Differential Revision: D16113360

fbshipit-source-id: 32bcbe72c3749e4500d8223791b009db03a95d3d
Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Still contains an in-place modification of the input tensor.

@xiaomengy xiaomengy deleted the export-D16113360 branch July 11, 2019 20:30
@facebook-github-bot
Contributor

This pull request has been merged in 8bdda03.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 11, 2019
Summary:
Pull Request resolved: pytorch/pytorch#22512

optimize RNN on CPU

Reviewed By: llyfacebook

Differential Revision: D16113360

fbshipit-source-id: 9ee53b3b4bb9b636e7be1ccdf25420e2caa60762