[pt][aten] Enable the intra-op parallelism for layer norm #28464
Conversation
We would like to enable intra-op parallelism for layer norm.

Differential Revision: [D18063407](https://our.internmc.facebook.com/intern/diff/D18063407/)

[ghstack-poisoned]
```cpp
T mean_val = T(0);
T rstd_val = T(0);
for (int64_t j = 0; j < N; ++j) {
  mean_val += X_ptr[j];
```
Here we can consider using vec256::Vec256 for the reduction. In my local tests, the compiler's auto-vectorization was not as efficient as Vec256 in the reduction case. You can take a look at #23349.
Thanks! Do you know why #23349 was not merged into the PyTorch master branch?
No specific reason; it just wasn't urgent. Since we are changing this part here anyway, we can do it together.
@BIT-silence: Added your PR in #29104.
```cpp
at::parallel_for(0, num_tasks, 1, [&](int64_t start, int64_t end) {
  const int64_t M_per_thread = (M + num_tasks - 1) / num_tasks;
  const int64_t M_start = std::min(start * M_per_thread, M);
  const int64_t M_end = std::min(end * M_per_thread, M);
```
This is not how at::parallel_for is typically used; you don't need to compute the number of tasks and the start/end offsets yourself. Check #19105 for an example. You typically just need to write, e.g.:

```cpp
at::parallel_for(0, M, <choose grain size>, [&](int64_t start, int64_t end) {
  for (int64_t i = start; i < end; ++i) { ... }
});
```
Thanks! Updated the PR. One thing to double-check: when I set the <choose grain size> to 1, does this mean the loop is iterated in an interleaved way, or in a block-cyclic way?
As @ilia-cher told me, the loop is iterated in a "block-cyclic" way: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/ParallelNative.h#L20
We would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model.

Before this Diff:

```
Name               Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls  Input Shapes
native_layer_norm  14.10%            449.257ms       14.10%       449.257ms  9.360ms       NaN           0.000us     0.000us        48               [[61, 64, 1024], [1024], [1024]]
```

https://docs.google.com/spreadsheets/d/137BkyMmVuLS0Shz7QHze-12CM1MEwhGimfIhh3P5kBY/edit#gid=375380546

After this Diff:

```
Name               Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls  Input Shapes
native_layer_norm  1.55%             42.453ms        1.55%        42.453ms   884.448us     NaN           0.000us     0.000us        48               [[61, 64, 1024], [1024], [1024]]
```

https://docs.google.com/spreadsheets/d/137BkyMmVuLS0Shz7QHze-12CM1MEwhGimfIhh3P5kBY/edit#gid=930034990

Differential Revision: [D18063407](https://our.internmc.facebook.com/intern/diff/D18063407/)

[ghstack-poisoned]
Similar to #28464, we would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model. Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/) [ghstack-poisoned]
Similar to #28464 and #28477, we would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model. Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/) [ghstack-poisoned]
Similar to #28464 and #28477, we would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model.

Benchmarking the RoBERTa model with 20 threads:

Before this Diff: P120104857

```
equal  11.16%  305.851ms  11.16%  305.851ms  4.248ms  NaN  0.000us  0.000us  72  [[61, 64, 1024], [61, 64, 1024]]
```

After this Diff: (grain size is the third parameter when using `at::parallel_for`; as measured below, the performance differences between these grain sizes seem to be subtle)

- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:

```
equal  1.43%  36.056ms  1.43%  36.056ms  500.783us  NaN  0.000us  0.000us  72  [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:

```
equal  1.41%  35.126ms  1.41%  35.126ms  487.855us  NaN  0.000us  0.000us  72  [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = 1:

```
equal  1.43%  35.632ms  1.43%  35.632ms  494.886us  NaN  0.000us  0.000us  72  [[61, 64, 1024], [61, 64, 1024]]
```

Note that the values of `HYPER_TH_OMP_OVERHEAD_THRESHOLD` and `TH_OMP_OVERHEAD_THRESHOLD` can be found in https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10. `HYPER_TH_OMP_OVERHEAD_THRESHOLD` is used for more fine-grained tasks, since HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16.

Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)

[ghstack-poisoned]
xiaomengy
left a comment
LGTM
Summary: Pull Request resolved: pytorch/pytorch#28464

We would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model.

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"

Reviewed By: BIT-silence

Differential Revision: D18063407

fbshipit-source-id: c116e744d78ea50b3aadf2e9a819e5b876a944bf
This pull request has been merged in 492764b.
Summary: Pull Request resolved: #28810

Similar to #28464 and #28477, we would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model.

Test Plan: CI

Differential Revision: D18165752

fbshipit-source-id: 354cede4c36893acbd69711f49aa6a51dc94397f