[bert] Add the intra-op parallelism for equal operator #28810
Conversation
Similar to #28464, we would like to enable intra-op parallelism for the `equal` operator. This maps to a parallel performance win for the BERT/RoBERTa model.

Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)
```cpp
at::parallel_for(
    0,
    sz,
    HYPER_TH_OMP_OVERHEAD_THRESHOLD,
```
I think `HYPER_TH_OMP_OVERHEAD_THRESHOLD` is used for more expensive functions like `cosh`, so here it can generate tasks that are too fine-grained. Maybe `TH_OMP_OVERHEAD_THRESHOLD` is the right one to use?
I'd suggest running a simple benchmark to tune this parameter.
The definitions of `HYPER_TH_OMP_OVERHEAD_THRESHOLD` and `TH_OMP_OVERHEAD_THRESHOLD` can be found in https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10 .
As you mentioned, `HYPER_TH_OMP_OVERHEAD_THRESHOLD` is meant for more fine-grained tasks, since HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16.
However, as measured below, the performance differences between these grain sizes (the grain size is the third parameter of `at::parallel_for`) are subtle:
- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:
```
equal 1.43% 36.056ms 1.43% 36.056ms 500.783us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:
```
equal 1.41% 35.126ms 1.41% 35.126ms 487.855us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
- grain size = 1:
```
equal 1.43% 35.632ms 1.43% 35.632ms 494.886us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
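For context, a minimal sketch of how such a grain-size benchmark might be run as a standalone program linked against ATen. The candidate grain sizes are placeholders rather than the actual TH macro values, and the cheap elementwise body only stands in for the real `equal` comparison:

```cpp
#include <ATen/Parallel.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  constexpr int64_t kSize = 61 * 64 * 1024;  // matches the benchmarked tensor shape
  std::vector<float> src(kSize, 1.0f), dst(kSize, 0.0f);
  // Candidate grain sizes; the first two are stand-ins for
  // TH_OMP_OVERHEAD_THRESHOLD and HYPER_TH_OMP_OVERHEAD_THRESHOLD (= TH / 16).
  for (int64_t grain : {int64_t{100000}, int64_t{100000 / 16}, int64_t{1}}) {
    auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 72; ++rep) {  // 72 calls, as in the profile above
      at::parallel_for(0, kSize, grain, [&](int64_t begin, int64_t end) {
        for (int64_t i = begin; i < end; ++i) {
          dst[i] = src[i] + 1.0f;  // cheap per-element work, like equal's compare
        }
      });
    }
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("grain=%lld: %lld us total\n",
                static_cast<long long>(grain), static_cast<long long>(us));
  }
  return 0;
}
```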
TODO: more benchmarking to choose the grain size.
Done with the benchmarking; updated the results in the summary.
```cpp
    TH_OMP_OVERHEAD_THRESHOLD,
    [&](int64_t begin, int64_t end) {
      for (auto iter = begin; iter < end; iter++) {
        if (!equal) {
```
I'm a bit concerned about the reads/writes; in some cases they will be atomic, but I'm not sure this is always true. Could you use an `equal` variable local to the scope and then write it into an atomic int variable defined outside?
In the case of a non-atomic write, the worst that could happen is a few extra loop iterations, which is much cheaper than any synchronization.
What about a read/write race?
Using `std::atomic<int>` would probably also be negligibly different, as it would just use atomic processor instructions.
Updated to `std::atomic<int>`.
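A hedged sketch of the pattern this thread converged on: each chunk keeps a scope-local flag and performs at most one write into a `std::atomic<int>` defined outside the lambda. The function name and raw-pointer interface are illustrative, not the PR's actual code:

```cpp
#include <ATen/Parallel.h>
#include <atomic>
#include <cstdint>

// Illustrative helper, not the PR's actual implementation.
bool tensors_equal(const float* a, const float* b, int64_t sz, int64_t grain) {
  std::atomic<int> equal{1};  // shared result; atomic avoids the read/write race
  at::parallel_for(0, sz, grain, [&](int64_t begin, int64_t end) {
    bool local_equal = true;  // scope-local, so the hot loop shares nothing
    for (int64_t i = begin; i < end; ++i) {
      if (a[i] != b[i]) {
        local_equal = false;
        break;  // other chunks may run a few extra iterations, which is cheap
      }
    }
    if (!local_equal) {
      equal.store(0, std::memory_order_relaxed);  // one atomic write per chunk
    }
  });
  return equal.load() != 0;
}
```

This keeps the synchronization cost to a single atomic store per chunk that finds a mismatch, which matches the reviewers' point that anything heavier would cost more than the occasional extra iteration it saves.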
This pull request has been merged in 6a4b51a.
Summary: Pull Request resolved: pytorch/pytorch#28810. Similar to pytorch/pytorch#28464 and pytorch/pytorch#28477, we would like to enable intra-op parallelism for the `equal` operator. This maps to a parallel performance win for the BERT/RoBERTa model. Test Plan: CI. Differential Revision: D18165752. fbshipit-source-id: 354cede4c36893acbd69711f49aa6a51dc94397f
Stack from ghstack:

Similar to #28464 and #28477, we would like to enable intra-op parallelism for the `equal` operator. This maps to a parallel performance win for the BERT/RoBERTa model.

Benchmarking the RoBERTa model with 20 threads:

Before this Diff: P120104857
```
equal 11.16% 305.851ms 11.16% 305.851ms 4.248ms NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```

After this Diff (the grain size is the third parameter of `at::parallel_for`; as measured below, the performance differences between these grain sizes are subtle):

- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:
```
equal 1.43% 36.056ms 1.43% 36.056ms 500.783us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:
```
equal 1.41% 35.126ms 1.41% 35.126ms 487.855us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```
- grain size = 1:
```
equal 1.43% 35.632ms 1.43% 35.632ms 494.886us NaN 0.000us 0.000us 72 [[61, 64, 1024], [61, 64, 1024]]
```

Note that the definitions of `HYPER_TH_OMP_OVERHEAD_THRESHOLD` and `TH_OMP_OVERHEAD_THRESHOLD` can be found in https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10 . `HYPER_TH_OMP_OVERHEAD_THRESHOLD` is used for more fine-grained tasks since HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16.

Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)