[pt][aten] Enable the intra-op parallelism for layer norm #28464
Conversation
We would like to enable intra-op parallelism for layer norm.

Differential Revision: [D18063407](https://our.internmc.facebook.com/intern/diff/D18063407/)

[ghstack-poisoned]
```cpp
T mean_val = T(0);
T rstd_val = T(0);
for (int64_t j = 0; j < N; ++j) {
  mean_val += X_ptr[j];
```
Here we can consider using vec256::Vec256 for the reduction. In my local tests, the compiler's auto-vectorization was not as efficient as Vec256 in the reduction case. You can take a look at #23349.
Thanks! Do you know why #23349 was not merged into the PyTorch master branch?
No specific reason; it just wasn't urgent. Since we are changing this part here anyway, we can do it together.
@BIT-silence: Added your PR in #29104.
```cpp
at::parallel_for(0, num_tasks, 1, [&](int64_t start, int64_t end) {
  const int64_t M_per_thread = (M + num_tasks - 1) / num_tasks;
  const int64_t M_start = std::min(start * M_per_thread, M);
  const int64_t M_end = std::min(end * M_per_thread, M);
```
This is not how at::parallel_for is typically used; you don't need to compute the number of tasks and the start/end offsets yourself. Check #19105 for an example. You typically just need to write, e.g.:

```cpp
at::parallel_for(0, M, <choose grain size>, [&](int64_t start, int64_t end) {
  for (int64_t i = start; i < end; ++i) { ... }
});
```
Thanks! Updated the PR. One thing to double-check: when I set the <choose grain size> to 1, does this mean the loop is iterated in an interleaved way, or in a block-cyclic way?
As @ilia-cher told me, the loop is iterated in a "block-cyclic" way: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/ParallelNative.h#L20
We would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model.

Before this Diff:

```
Name               Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls  Input Shapes
native_layer_norm  14.10%            449.257ms       14.10%       449.257ms  9.360ms       NaN           0.000us     0.000us        48               [[61, 64, 1024], [1024], [1024]]
```

https://docs.google.com/spreadsheets/d/137BkyMmVuLS0Shz7QHze-12CM1MEwhGimfIhh3P5kBY/edit#gid=375380546

After this Diff:

```
Name               Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls  Input Shapes
native_layer_norm  1.55%             42.453ms        1.55%        42.453ms   884.448us     NaN           0.000us     0.000us        48               [[61, 64, 1024], [1024], [1024]]
```

https://docs.google.com/spreadsheets/d/137BkyMmVuLS0Shz7QHze-12CM1MEwhGimfIhh3P5kBY/edit#gid=930034990

Differential Revision: [D18063407](https://our.internmc.facebook.com/intern/diff/D18063407/)

[ghstack-poisoned]
Similar to #28464, we would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model. Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/) [ghstack-poisoned]
Similar to #28464 and #28477, we would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model. Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/) [ghstack-poisoned]
Similar to #28464 and #28477, we would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model.

Benchmarking the RoBERTa model with 20 threads:

Before this Diff: P120104857

```
equal  11.16%  305.851ms  11.16%  305.851ms  4.248ms  NaN  0.000us  0.000us  72  [[61, 64, 1024], [61, 64, 1024]]
```

After this Diff: (grain size is the third parameter when using `at::parallel_for`; as measured below, the performance differences between these grain sizes seem to be subtle)

- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:

```
equal  1.43%  36.056ms  1.43%  36.056ms  500.783us  NaN  0.000us  0.000us  72  [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:

```
equal  1.41%  35.126ms  1.41%  35.126ms  487.855us  NaN  0.000us  0.000us  72  [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = 1:

```
equal  1.43%  35.632ms  1.43%  35.632ms  494.886us  NaN  0.000us  0.000us  72  [[61, 64, 1024], [61, 64, 1024]]
```

Note that the values of `HYPER_TH_OMP_OVERHEAD_THRESHOLD` and `TH_OMP_OVERHEAD_THRESHOLD` can be found in https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10. `HYPER_TH_OMP_OVERHEAD_THRESHOLD` is used for more fine-grained tasks, since HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16.

Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)

[ghstack-poisoned]
xiaomengy
left a comment
LGTM
Summary: Pull Request resolved: pytorch/pytorch#28464

We would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model.

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"

Reviewed By: BIT-silence

Differential Revision: D18063407

fbshipit-source-id: c116e744d78ea50b3aadf2e9a819e5b876a944bf
This pull request has been merged in 492764b.
Summary: Pull Request resolved: #28810

Similar to #28464 and #28477, we would like to enable the intra-op parallelism for layer norm. This will be mapped to the parallel performance win for the BERT/RoBERTa model.

Test Plan: CI

Differential Revision: D18165752

fbshipit-source-id: 354cede4c36893acbd69711f49aa6a51dc94397f