
Conversation

jianyuh (Member) commented Oct 22, 2019

Stack from ghstack:

Similar to #26692, we would like to enable intra-op parallelism for the dynamic Linear op.

Test Benchmark:

```
import time
import torch

K, N = 1024, 1024

print('M', 'nthread=1', 'nthread=2', 'nthread=4', 'nthread=8', 'nthread=16', sep=', ')

for M in range(512, 2049, 512):
    print(M, sep=',', end=', ')
    for num_threads in (1, 2, 4, 8, 16,):

        torch.set_num_threads(num_threads)

        x = torch.rand(M, K)
        w = torch.rand(K, N)

        NITER = 20

        # Quantize and prepack the weight once; linear_dynamic quantizes
        # the activation on the fly inside the op.
        q_w = torch.quantize_per_tensor(w, 0.01, 0, dtype=torch.qint8)
        packed_w = torch.ops.quantized.linear_prepack(q_w, None)

        s = time.time()
        for i in range(NITER):
            torch.ops.quantized.linear_dynamic(x, packed_w)
        elapsed_per_iter_dyn_quant = (time.time() - s) / NITER

        # Report effective throughput in GFLOPS: an (M, K) x (K, N) GEMM
        # performs 2*M*N*K floating-point operations.
        print("{:0.2f}".format(2.0*M*N*K/elapsed_per_iter_dyn_quant/1E9), end=', ')
    print("\n", end='')
```

Before this Diff (throughput in GFLOPS):

```
(base) [root@[test machine] ~/jhuang_test/dynamic_quant]# python benchmark_quantize_dynamic.py
M, nthread=1, nthread=2, nthread=4, nthread=8, nthread=16
512, 119.28, 139.50, 141.66, 141.58, 141.42,
1024, 122.42, 141.21, 123.09, 141.85, 123.03,
1536, 122.80, 122.18, 141.39, 123.25, 141.35,
2048, 123.41, 141.34, 123.62, 140.55, 123.76,
```

After this Diff (throughput in GFLOPS):

```
(base) [root@[test machine] ~/jhuang_test/dynamic_quant]# python benchmark_quantize_dynamic.py
M, nthread=1, nthread=2, nthread=4, nthread=8, nthread=16
512, 123.29, 271.99, 508.66, 882.83, 1295.07,
1024, 126.05, 273.15, 515.42, 914.11, 877.63,
1536, 142.48, 236.85, 524.10, 481.32, 970.81,
2048, 124.76, 279.03, 433.73, 958.67, 1045.82,
```

Before the change, throughput stays flat at roughly 120 to 140 GFLOPS regardless of thread count; after it, throughput scales close to linearly with thread count up to 8 threads.

Differential Revision: D18074757

jianyuh added a commit that referenced this pull request Oct 22, 2019
Pull Request resolved: #28477

Similar to #26692, we would like to enable intra-op parallelism for the dynamic Linear op.
ghstack-source-id: 92419573

Differential Revision: [D18074757](https://our.internmc.facebook.com/intern/diff/D18074757/)
jianyuh requested review from dskhudia and ilia-cher October 22, 2019 23:23
jianyuh requested a review from jamesr66a October 24, 2019 05:02
```
if (pack_ptr.q_scheme == kPerTensorAffine) {
  // Process the per tensor quantization.
  int num_tasks = at::get_num_threads();
  at::parallel_for(0, num_tasks, 1, [&](int64_t begin, int64_t end) {
```
Contributor:
Is there a way to test the parallelization logic to see if there is a speedup?

jianyuh (Member Author):

Will update with some performance numbers.

jianyuh (Member Author):

Updated with the performance numbers in the summary.

```
      /*thread_id=*/task_id,
      /*num_threads=*/num_tasks);

} else if (pack_ptr.q_scheme == kPerChannelAffine) {
```
Contributor:

Does this mean that we support per-channel dynamic quant as a qconfig option?

jianyuh (Member Author):

Yes, I think so. We added per-channel quantization support for dynamic quantization a while ago.
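
For illustration, a minimal sketch of what per-channel weight quantization looks like through ATen's C++ API (the helper name and symmetric scale choice here are ours, not this PR's code); the quantized weight is then prepacked the same way as in the per-tensor path:

```
// Sketch (assumed ATen C++ API, not this PR's code): quantize a float weight
// per output channel with symmetric int8 scales.
#include <ATen/ATen.h>

at::Tensor quantize_weight_per_channel(const at::Tensor& w) {
  // w: float weight of shape (out_features, in_features).
  auto max_per_row = std::get<0>(w.abs().max(/*dim=*/1));
  auto scales = (max_per_row / 127.0).to(at::kDouble);   // one scale per channel
  auto zero_points = at::zeros({w.size(0)}, at::kLong);  // symmetric: zp = 0
  return at::quantize_per_channel(w, scales, zero_points, /*axis=*/0, at::kQInt8);
}
```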

@jamesr66a (Collaborator) left a comment:

Looks good; this seems to just delegate to the FBGEMM algorithm for splitting work among tasks.

I'd still like to see the perf numbers, though.

jianyuh (Member Author) commented Oct 27, 2019

> Looks good; this seems to just delegate to the FBGEMM algorithm for splitting work among tasks.
>
> I'd still like to see the perf numbers, though.

Updated with the performance numbers in the summary. Thanks!
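
For context, a rough sketch of the task-splitting pattern the diff uses, with a placeholder `do_gemm_task` standing in for the actual FBGEMM call: `at::parallel_for` runs over `num_tasks` task ids with grain size 1, and each task passes its id and the task count down so the kernel can pick its own slice of the GEMM.

```
// Sketch of the parallelization pattern (placeholder kernel, not this PR's
// code): one task per thread; the kernel partitions work by task id.
#include <ATen/Parallel.h>

void do_gemm_task(int64_t thread_id, int64_t num_threads) {
  // Placeholder for the FBGEMM call, which accepts thread_id/num_threads
  // and computes only that task's share of the output.
}

void run_gemm_in_parallel() {
  const int num_tasks = at::get_num_threads();
  at::parallel_for(0, num_tasks, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
    for (int64_t task_id = begin; task_id < end; ++task_id) {
      do_gemm_task(/*thread_id=*/task_id, /*num_threads=*/num_tasks);
    }
  });
}
```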

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 28, 2019
…#28477)

Summary:
Pull Request resolved: pytorch/pytorch#28477

Similar to pytorch/pytorch#26692, we would like to enable intra-op parallelism for the dynamic Linear op.
ghstack-source-id: 92419573

Test Plan:
CI

Test Benchmark: same script and results as in the PR summary above.

Differential Revision: D18074757

fbshipit-source-id: ad5b43477d2187c818c137093c6d6af02d5ca1d5
facebook-github-bot (Contributor):
This pull request has been merged in 052046b.

jianyuh added a commit that referenced this pull request Nov 1, 2019
Similar to #28464 and #28477, we would like to enable intra-op parallelism for layer norm. This should translate into a parallel performance win for the BERT/RoBERTa model.

Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)

facebook-github-bot deleted the gh/jianyuh/36/head branch November 1, 2019 14:17
jianyuh added a commit that referenced this pull request Nov 5, 2019
Similar to #28464 and #28477, we would like to enable intra-op parallelism for layer norm. This should translate into a parallel performance win for the BERT/RoBERTa model.

Benchmarking the RoBERTa model with 20 threads:

Before this Diff: P120104857
```
equal                    11.16%           305.851ms        11.16%           305.851ms        4.248ms          NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]     
```


After this Diff:
(Grain size is the third parameter to `at::parallel_for`. As measured below, the performance differences between these grain sizes appear to be minor.)

- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:
```
equal                    1.43%            36.056ms         1.43%            36.056ms         500.783us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:
```
equal                    1.41%            35.126ms         1.41%            35.126ms         487.855us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = 1:
```
equal                    1.43%            35.632ms         1.43%            35.632ms         494.886us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

Note that the definitions of `HYPER_TH_OMP_OVERHEAD_THRESHOLD` and `TH_OMP_OVERHEAD_THRESHOLD` can be found at https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10. Since `HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16`, it is the choice for more fine-grained tasks.
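
As a rough illustration of what the grain-size argument controls (the loop body below is a placeholder, not this PR's code), under the assumption that `at::parallel_for` runs the range serially when it has at most `grain_size` iterations and otherwise splits it into chunks of at least `grain_size` iterations per task:

```
// Illustration: a larger grain_size yields fewer, coarser tasks;
// a range with n <= grain_size iterations runs serially.
#include <ATen/Parallel.h>

void scale_in_parallel(float* data, int64_t n, int64_t grain_size) {
  at::parallel_for(0, n, grain_size, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] *= 2.0f;  // placeholder per-element work
    }
  });
}
```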


Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)

facebook-github-bot pushed a commit that referenced this pull request Nov 6, 2019
Summary:
Pull Request resolved: #28810

Similar to #28464 and #28477, we would like to enable intra-op parallelism for layer norm. This should translate into a parallel performance win for the BERT/RoBERTa model.

Test Plan: CI

Differential Revision: D18165752

fbshipit-source-id: 354cede4c36893acbd69711f49aa6a51dc94397f
zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 6, 2019 (same commit message as the merged commit above).