Conversation

@xiaomengy
Contributor

@xiaomengy xiaomengy commented Jul 3, 2019

This PR fuses the matmul ops over the input sequence together, which helps improve the performance of all RNN layers.
In our test on a speech model, the GRU layer previously took about 40.16% of the total inference time (about 77ms); after this PR, it takes about 27.73% of the total inference time (about 43ms).
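
For readers, a minimal sketch of the fusion idea (assumed libtorch API, hypothetical helper name; not the actual code in this PR): instead of applying the input-to-hidden projection once per time step inside the cell, the projection for the whole input sequence can be computed with a single matmul up front, so each step only needs the hidden-to-hidden matmul and the elementwise gate math.

#include <torch/torch.h>
#include <iostream>

// Hypothetical helper: pre-compute the GRU input gates for the whole sequence
// with one fused GEMM instead of one matmul per time step.
torch::Tensor precompute_input_gates(
    const torch::Tensor& input,   // [seq_len, batch, input_size]
    const torch::Tensor& w_ih,    // [3 * hidden_size, input_size]
    const torch::Tensor& b_ih) {  // [3 * hidden_size]
  return torch::linear(input, w_ih, b_ih);  // [seq_len, batch, 3 * hidden_size]
}

int main() {
  const int64_t seq_len = 50, batch = 8, input_size = 128, hidden_size = 256;
  auto input = torch::randn({seq_len, batch, input_size});
  auto w_ih  = torch::randn({3 * hidden_size, input_size});
  auto b_ih  = torch::randn({3 * hidden_size});

  // Computed once here; each time step then slices igates[t] instead of
  // running its own input matmul.
  auto igates = precompute_input_gates(input, w_ih, b_ih);
  std::cout << igates.sizes() << std::endl;  // [50, 8, 768]
}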

Summary: optimize RNN on CPU

Differential Revision: D16113360

Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Side note: Lots of unrelated formatting changes, really hard to read.

Contributor

This might be a logical error: the chunk operator returns views over the input. So when you modify it in place in chunked_hgates[1].add_(chunked_igates[1]).sigmoid_(); you are also modifying the input tensor.
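
A minimal standalone sketch of the aliasing concern (illustrative only, not code from this PR): chunk() returns views, so an in-place op on a chunk writes through to the original tensor.

#include <torch/torch.h>
#include <iostream>

int main() {
  auto gates = torch::zeros({2, 6});
  auto chunks = gates.chunk(3, /*dim=*/1);  // three views over `gates`

  chunks[1].add_(1.0);  // in-place add on the view

  // Columns 2 and 3 of `gates` are now 1: the base tensor was modified
  // through the view.
  std::cout << gates << std::endl;
}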

Contributor Author

Thanks for the comment. I'm not sure I understand that correctly. Here I used the in-place add_ on chunked_hgates, which comes from params.linear_hh(hidden).chunk(3, 1). The input part goes into chunked_igates, which is only used as the argument to add_. So I think it should be safe?

Contributor

@VitalyFedyunin VitalyFedyunin Jul 11, 2019

In the pre_compute_input case (btw, would it be better to rename it to past tense?):

// contains vector of views over input
const auto chunked_igates = pre_compute_input ? input.chunk(3, 1) : params.linear_ih(input).chunk(3, 1);

// ...

// changing input tensor inplace!
const auto new_gate = chunked_igates[2].add(chunked_hgates[2].mul_(reset_gate)).tanh_();

And even if you are sure that operator() would be called only once, it is a bad pattern to modify inputs.

Contributor Author

I am a little confused here. This line uses chunked_igates[2].add, which is not an in-place add_, right? And tanh_ is applied to the result of add(), which is another tensor in my understanding.
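
To make the distinction concrete, a small sketch (illustrative only, not code from this PR) showing that the out-of-place add() allocates a new tensor, so the trailing tanh_() only mutates that result and the input behind the views is left untouched:

#include <torch/torch.h>
#include <iostream>

int main() {
  auto input = torch::zeros({2, 6});
  auto chunks = input.chunk(3, /*dim=*/1);   // views over `input`

  // add() is out of place: it returns a fresh tensor, and tanh_() is applied
  // in place to that fresh tensor, not to the view or its base.
  auto result = chunks[2].add(1.0).tanh_();

  std::cout << input.sum().item<float>() << std::endl;   // 0: `input` unchanged
  std::cout << result.sum().item<float>() << std::endl;  // 4 * tanh(1), about 3.05
}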

Contributor

My bad. LGTM.

Contributor Author

Thanks for the detailed review.

Contributor

What is wrong with taking the same approach in the GPU case?

Contributor Author

I will apply it to the GPU in the next PR. Originally, on GPU we do a matmul instead of linear on the input first, and fuse the bias part into the at::_thnn_fused_lstm_cell function. So I think it is better to make the CPU-only change first and then apply the GPU change, to keep this PR from getting too large.

@xiaomengy
Contributor Author

Summary:
Pull Request resolved: pytorch#22512

optimize RNN on CPU

Differential Revision: D16113360

fbshipit-source-id: 32bcbe72c3749e4500d8223791b009db03a95d3d
Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Still contains an in-place modification of the input tensor.

@xiaomengy xiaomengy deleted the export-D16113360 branch July 11, 2019 20:30
@facebook-github-bot
Contributor

This pull request has been merged in 8bdda03.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 11, 2019
Summary:
Pull Request resolved: pytorch/pytorch#22512

optimize RNN on CPU

Reviewed By: llyfacebook

Differential Revision: D16113360

fbshipit-source-id: 9ee53b3b4bb9b636e7be1ccdf25420e2caa60762