[bert] Add the intra-op parallelism for equal operator #28810
Conversation
Similar to #28464, we would like to enable intra-op parallelism for the `equal` operator. This maps to a parallel performance win for the BERT/RoBERTa model.

Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)
```cpp
at::parallel_for(
    0,
    sz,
    HYPER_TH_OMP_OVERHEAD_THRESHOLD,
```
I think `HYPER_TH_OMP_OVERHEAD_THRESHOLD` is used for more expensive functions like `cosh`, so here it can generate tasks that are too fine-grained. Maybe `TH_OMP_OVERHEAD_THRESHOLD` is the right one to use?
I'd suggest running a simple benchmark to tune this parameter.
The definitions of `HYPER_TH_OMP_OVERHEAD_THRESHOLD` and `TH_OMP_OVERHEAD_THRESHOLD` can be found in https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10 .
As you mentioned, `HYPER_TH_OMP_OVERHEAD_THRESHOLD` is meant for more fine-grained tasks, since HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16.
However, as measured below, the performance differences between these grain sizes (the grain size is the third parameter of `at::parallel_for`) are subtle:
- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:
```
equal 1.43% 36.056ms 1.43% 36.056ms 500.783us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:
```
equal 1.41% 35.126ms 1.41% 35.126ms 487.855us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
- grain size = 1:
```
equal 1.43% 35.632ms 1.43% 35.632ms 494.886us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
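For context, a minimal sketch of how such a grain-size benchmark might be run as a standalone program linked against ATen. The candidate grain sizes are placeholders rather than the actual TH macro values, and the cheap elementwise body only stands in for the real `equal` comparison:

```cpp
#include <ATen/Parallel.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  constexpr int64_t kSize = 61 * 64 * 1024;  // matches the benchmarked tensor shape
  std::vector<float> src(kSize, 1.0f), dst(kSize, 0.0f);
  // Candidate grain sizes; the first two are stand-ins for
  // TH_OMP_OVERHEAD_THRESHOLD and HYPER_TH_OMP_OVERHEAD_THRESHOLD (= TH / 16).
  for (int64_t grain : {int64_t{100000}, int64_t{100000 / 16}, int64_t{1}}) {
    auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 72; ++rep) {  // 72 calls, as in the profile above
      at::parallel_for(0, kSize, grain, [&](int64_t begin, int64_t end) {
        for (int64_t i = begin; i < end; ++i) {
          dst[i] = src[i] + 1.0f;  // cheap per-element work, like equal's compare
        }
      });
    }
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("grain=%lld: %lld us total\n",
                static_cast<long long>(grain), static_cast<long long>(us));
  }
  return 0;
}
```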
TODO: more benchmarking to choose the grain size.
Done with the benchmarking; updated the results in the summary.
```cpp
    TH_OMP_OVERHEAD_THRESHOLD,
    [&](int64_t begin, int64_t end) {
      for (auto iter = begin; iter < end; iter++) {
        if (!equal) {
```
I'm a bit concerned about the reads/writes; in some cases they will be atomic, but I'm not sure this is always true. Could you use an `equal` variable local to the scope and then write it into an atomic int variable defined outside?
In the case of a non-atomic write, the worst that could happen is a few extra loop iterations, which is much cheaper than any synchronization.
What about a read/write race?
Using `std::atomic<int>` would probably also be negligibly different, as it would just use atomic processor instructions.
Updated to `std::atomic<int>`.
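A hedged sketch of the pattern this thread converged on: each chunk keeps a scope-local flag and performs at most one write into a `std::atomic<int>` defined outside the lambda. The function name and raw-pointer interface are illustrative, not the PR's actual code:

```cpp
#include <ATen/Parallel.h>
#include <atomic>
#include <cstdint>

// Illustrative helper, not the PR's actual implementation.
bool tensors_equal(const float* a, const float* b, int64_t sz, int64_t grain) {
  std::atomic<int> equal{1};  // shared result; atomic avoids the read/write race
  at::parallel_for(0, sz, grain, [&](int64_t begin, int64_t end) {
    bool local_equal = true;  // scope-local, so the hot loop shares nothing
    for (int64_t i = begin; i < end; ++i) {
      if (a[i] != b[i]) {
        local_equal = false;
        break;  // other chunks may run a few extra iterations, which is cheap
      }
    }
    if (!local_equal) {
      equal.store(0, std::memory_order_relaxed);  // one atomic write per chunk
    }
  });
  return equal.load() != 0;
}
```

This keeps the synchronization cost to a single atomic store per chunk that finds a mismatch, which matches the reviewers' point that anything heavier would cost more than the occasional extra iteration it saves.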
This pull request has been merged in 6a4b51a.
Summary: Pull Request resolved: pytorch/pytorch#28810. Similar to pytorch/pytorch#28464 and pytorch/pytorch#28477, we would like to enable intra-op parallelism for the `equal` operator. This maps to a parallel performance win for the BERT/RoBERTa model. Test Plan: CI. Differential Revision: D18165752. fbshipit-source-id: 354cede4c36893acbd69711f49aa6a51dc94397f
Stack from ghstack:

Similar to #28464 and #28477, we would like to enable intra-op parallelism for the `equal` operator. This maps to a parallel performance win for the BERT/RoBERTa model.

Benchmarking the RoBERTa model with 20 threads:

Before this Diff: P120104857
```
equal 11.16% 305.851ms 11.16% 305.851ms 4.248ms NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```

After this Diff (the grain size is the third parameter of `at::parallel_for`; as measured below, the performance differences between these grain sizes are subtle):

- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:
```
equal 1.43% 36.056ms 1.43% 36.056ms 500.783us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:
```
equal 1.41% 35.126ms 1.41% 35.126ms 487.855us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
- grain size = 1:
```
equal 1.43% 35.632ms 1.43% 35.632ms 494.886us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```

Note that the definitions of `HYPER_TH_OMP_OVERHEAD_THRESHOLD` and `TH_OMP_OVERHEAD_THRESHOLD` can be found in https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10 . `HYPER_TH_OMP_OVERHEAD_THRESHOLD` is used for more fine-grained tasks since HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16.

Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)