
Conversation

@jianyuh jianyuh commented Oct 28, 2019

Stack from ghstack:

Similar to #28464 and #28477, we would like to enable intra-op parallelism for layer norm. This translates into a parallel performance win for the BERT/RoBERTa model.

Benchmarking the RoBERTa model with 20 threads:

Before this Diff: P120104857

```
equal                    11.16%           305.851ms        11.16%           305.851ms        4.248ms          NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

After this Diff:
(Grain size is the third parameter to `at::parallel_for`; as measured below, the performance differences between the grain sizes are small. A sketch of the call shape appears after the note below.)

- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:
```
equal                    1.43%            36.056ms         1.43%            36.056ms         500.783us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:
```
equal                    1.41%            35.126ms         1.41%            35.126ms         487.855us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = 1:
```
equal                    1.43%            35.632ms         1.43%            35.632ms         494.886us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

Note that the definitions of `HYPER_TH_OMP_OVERHEAD_THRESHOLD` and `TH_OMP_OVERHEAD_THRESHOLD` can be found in https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10. `HYPER_TH_OMP_OVERHEAD_THRESHOLD` is used for more fine-grained tasks, since `HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16`.
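For context, a minimal sketch of the call shape (illustrative only; `sz`, `grain_size`, and the loop body are placeholders, not the kernel changed in this diff). The third argument is the grain size:

```cpp
#include <ATen/Parallel.h>
#include <cstdint>

// Illustrative sketch only -- not the actual layer-norm kernel in this diff.
void sketch(int64_t sz) {
  // Grain size = minimum number of elements per task; a smaller value
  // (e.g. HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16)
  // lets parallelization kick in for smaller inputs.
  const int64_t grain_size = 1;  // placeholder value
  at::parallel_for(0, sz, grain_size, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      // per-element work goes here
    }
  });
}
```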

Differential Revision: D18165752

jianyuh added a commit that referenced this pull request Oct 28, 2019
ghstack-source-id: 92772953
Pull Request resolved: #28810
@jianyuh jianyuh requested a review from ilia-cher October 28, 2019 21:45
```
at::parallel_for(
    0,
    sz,
    HYPER_TH_OMP_OVERHEAD_THRESHOLD,
```
Contributor

I think HYPER_TH_OMP_OVERHEAD_THRESHOLD is used for more expensive functions like cosh, so it can generate tasks that are too fine-grained. Maybe TH_OMP_OVERHEAD_THRESHOLD is the right one to use here?

Contributor

I'd suggest running a simple benchmark to tune this parameter.
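For instance, a throwaway harness along these lines (purely illustrative; the numbers in the PR summary come from profiling the full RoBERTa model, not from a sketch like this, and the candidate grain sizes here are arbitrary, not the TH macro values) could compare a few grain sizes:

```cpp
#include <ATen/Parallel.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative micro-benchmark: time a trivial parallel loop under a few
// arbitrary candidate grain sizes. The workload is a placeholder, not layer norm.
int main() {
  const int64_t sz = 61LL * 64 * 1024;  // roughly the tensor size from the summary
  std::vector<float> a(sz, 1.0f), b(sz, 2.0f), out(sz);
  for (int64_t grain : {int64_t{1}, int64_t{4096}, int64_t{65536}}) {
    auto t0 = std::chrono::steady_clock::now();
    at::parallel_for(0, sz, grain, [&](int64_t begin, int64_t end) {
      for (int64_t i = begin; i < end; ++i) {
        out[i] = a[i] * 0.5f + b[i];  // placeholder per-element work
      }
    });
    auto t1 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("grain=%lld time=%lldus\n", (long long)grain, (long long)us);
  }
  return 0;
}
```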

Member Author

The definitions of HYPER_TH_OMP_OVERHEAD_THRESHOLD and TH_OMP_OVERHEAD_THRESHOLD can be found in https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10.
As you mentioned, HYPER_TH_OMP_OVERHEAD_THRESHOLD is used for more fine-grained tasks, since HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16.

However, as measured, the performance differences between these grain sizes (the grain size is the third parameter to `at::parallel_for`) are small:

- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:
```
equal                    1.43%            36.056ms         1.43%            36.056ms         500.783us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:
```
equal                    1.41%            35.126ms         1.41%            35.126ms         487.855us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = 1:
```
equal                    1.43%            35.632ms         1.43%            35.632ms         494.886us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

jianyuh added a commit that referenced this pull request Nov 1, 2019
Pull Request resolved: #28810
ghstack-source-id: 93069172
jianyuh commented Nov 1, 2019

TODO: more benchmarking to choose the grain size.

jianyuh added a commit that referenced this pull request Nov 1, 2019
Pull Request resolved: #28810
ghstack-source-id: 93087184
jianyuh commented Nov 3, 2019

Done with the benchmarking; the results are updated in the summary.

```
    TH_OMP_OVERHEAD_THRESHOLD,
    [&](int64_t begin, int64_t end) {
      for (auto iter = begin; iter < end; iter++) {
        if (!equal) {
```
Contributor

I'm a bit concerned about the reads/writes: in some cases they will be atomic, but I'm not sure this is always true.
Could you use an `equal` variable local to the scope and then write it into an atomic int variable defined outside?
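A minimal sketch of the suggested pattern (assuming the kernel is roughly an elementwise equality check; the names and signature are illustrative, not the actual code in this diff):

```cpp
#include <ATen/Parallel.h>
#include <atomic>
#include <cstdint>

// Sketch of the suggestion: each chunk keeps a local flag while it loops and
// publishes a failure once, via a std::atomic defined outside the parallel
// region, instead of racing on a plain shared bool.
bool tensors_equal(const float* a, const float* b, int64_t sz, int64_t grain_size) {
  std::atomic<int> equal{1};  // shared result, written atomically
  at::parallel_for(0, sz, grain_size, [&](int64_t begin, int64_t end) {
    bool local_equal = true;  // local to this chunk; no sharing inside the loop
    for (int64_t i = begin; i < end; ++i) {
      if (a[i] != b[i]) {
        local_equal = false;
        break;
      }
    }
    if (!local_equal) {
      equal.store(0, std::memory_order_relaxed);  // one atomic write per chunk
    }
  });
  return equal.load() != 0;
}
```

Since the flag only ever transitions from 1 to 0, a relaxed store suffices, and the cost of the atomic is negligible relative to the loop.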

Contributor

In the case of a non-atomic write, the worst that could happen is a few extra loop iterations, which is much cheaper than any synchronization.

Contributor

What about a read/write race?

Contributor

@ilia-cher ilia-cher Nov 5, 2019

Using std::atomic<int> would probably also make a negligible difference, since it would just use atomic processor instructions.

Member Author

Updated to std::atomic<int>.

jianyuh added a commit that referenced this pull request Nov 5, 2019
Pull Request resolved: #28810
ghstack-source-id: 93309512
jianyuh added a commit that referenced this pull request Nov 6, 2019
Pull Request resolved: #28810
ghstack-source-id: 93329026
jianyuh added a commit that referenced this pull request Nov 6, 2019
Pull Request resolved: #28810
ghstack-source-id: 93346238
@facebook-github-bot
Contributor

This pull request has been merged in 6a4b51a.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 6, 2019
Summary:
Pull Request resolved: pytorch/pytorch#28810

Similar to pytorch/pytorch#28464 and pytorch/pytorch#28477, we would like to enable intra-op parallelism for layer norm. This translates into a parallel performance win for the BERT/RoBERTa model.

Test Plan: CI

Differential Revision: D18165752

fbshipit-source-id: 354cede4c36893acbd69711f49aa6a51dc94397f
@facebook-github-bot facebook-github-bot deleted the gh/jianyuh/40/head branch November 10, 2019 15:16