
Conversation

jianyuh (Member) commented Oct 22, 2019

Stack from ghstack:

Similar to #26692, we would like to enable intra-op parallelism for the dynamic Linear op.

Test Benchmark:

```
import time
import torch

K, N = 1024, 1024

print('M', 'nthread=1', 'nthread=2', 'nthread=4', 'nthread=8', 'nthread=16', sep=', ')

for M in range(512, 2049, 512):
    print(M, sep=',', end=', ')
    for num_threads in (1, 2, 4, 8, 16,):

        torch.set_num_threads(num_threads)

        x = torch.rand(M, K)
        w = torch.rand(K, N)

        NITER = 20

        # Quantize and prepack the weight once; linear_dynamic quantizes
        # the activation on the fly inside the op.
        q_w = torch.quantize_per_tensor(w, 0.01, 0, dtype=torch.qint8)
        packed_w = torch.ops.quantized.linear_prepack(q_w, None)

        s = time.time()
        for i in range(NITER):
            torch.ops.quantized.linear_dynamic(x, packed_w)
        elapsed_per_iter_dyn_quant = (time.time() - s) / NITER

        # Report effective throughput in GFLOPS: an (M, K) x (K, N) GEMM
        # performs 2*M*N*K floating-point operations.
        print("{:0.2f}".format(2.0*M*N*K/elapsed_per_iter_dyn_quant/1E9), end=', ')
    print("\n", end='')
```

Before this Diff (throughput in GFLOPS):

```
(base) [root@[test machine] ~/jhuang_test/dynamic_quant]# python benchmark_quantize_dynamic.py
M, nthread=1, nthread=2, nthread=4, nthread=8, nthread=16
512, 119.28, 139.50, 141.66, 141.58, 141.42,
1024, 122.42, 141.21, 123.09, 141.85, 123.03,
1536, 122.80, 122.18, 141.39, 123.25, 141.35,
2048, 123.41, 141.34, 123.62, 140.55, 123.76,
```

After this Diff (throughput in GFLOPS):

```
(base) [root@[test machine] ~/jhuang_test/dynamic_quant]# python benchmark_quantize_dynamic.py
M, nthread=1, nthread=2, nthread=4, nthread=8, nthread=16
512, 123.29, 271.99, 508.66, 882.83, 1295.07,
1024, 126.05, 273.15, 515.42, 914.11, 877.63,
1536, 142.48, 236.85, 524.10, 481.32, 970.81,
2048, 124.76, 279.03, 433.73, 958.67, 1045.82,
```

Before the change, throughput stays flat at roughly 120 to 140 GFLOPS regardless of thread count; after it, throughput scales close to linearly with thread count up to 8 threads.

Differential Revision: D18074757

jianyuh added a commit that referenced this pull request Oct 22, 2019
Pull Request resolved: #28477

Similar to #26692, we would like to enable intra-op parallelism for the dynamic Linear op.
ghstack-source-id: 92419573

Differential Revision: [D18074757](https://our.internmc.facebook.com/intern/diff/D18074757/)
jianyuh requested review from dskhudia and ilia-cher October 22, 2019 23:23
jianyuh requested a review from jamesr66a October 24, 2019 05:02
```
if (pack_ptr.q_scheme == kPerTensorAffine) {
  // Process the per tensor quantization.
  int num_tasks = at::get_num_threads();
  at::parallel_for(0, num_tasks, 1, [&](int64_t begin, int64_t end) {
```
Contributor:
Is there a way to test the parallelization logic to see if there is a speedup?

jianyuh (Member Author):

Will update with some performance numbers.

jianyuh (Member Author):

Updated with the performance numbers in the summary.

```
      /*thread_id=*/task_id,
      /*num_threads=*/num_tasks);

} else if (pack_ptr.q_scheme == kPerChannelAffine) {
```
Contributor:

Does this mean that we support per-channel dynamic quant as a qconfig option?

jianyuh (Member Author):

Yes, I think so. We added per-channel quantization support for dynamic quantization a while ago.
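
For illustration, a minimal sketch of what per-channel weight quantization looks like through ATen's C++ API (the helper name and symmetric scale choice here are ours, not this PR's code); the quantized weight is then prepacked the same way as in the per-tensor path:

```
// Sketch (assumed ATen C++ API, not this PR's code): quantize a float weight
// per output channel with symmetric int8 scales.
#include <ATen/ATen.h>

at::Tensor quantize_weight_per_channel(const at::Tensor& w) {
  // w: float weight of shape (out_features, in_features).
  auto max_per_row = std::get<0>(w.abs().max(/*dim=*/1));
  auto scales = (max_per_row / 127.0).to(at::kDouble);   // one scale per channel
  auto zero_points = at::zeros({w.size(0)}, at::kLong);  // symmetric: zp = 0
  return at::quantize_per_channel(w, scales, zero_points, /*axis=*/0, at::kQInt8);
}
```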

@jamesr66a (Collaborator) left a comment:

Looks good; this seems to just delegate to the FBGEMM algorithm for splitting work among tasks.

I'd still like to see the perf numbers, though.

jianyuh (Member Author) commented Oct 27, 2019

> Looks good; this seems to just delegate to the FBGEMM algorithm for splitting work among tasks.
>
> I'd still like to see the perf numbers, though.

Updated with the performance numbers in the summary. Thanks!
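
For context, a rough sketch of the task-splitting pattern the diff uses, with a placeholder `do_gemm_task` standing in for the actual FBGEMM call: `at::parallel_for` runs over `num_tasks` task ids with grain size 1, and each task passes its id and the task count down so the kernel can pick its own slice of the GEMM.

```
// Sketch of the parallelization pattern (placeholder kernel, not this PR's
// code): one task per thread; the kernel partitions work by task id.
#include <ATen/Parallel.h>

void do_gemm_task(int64_t thread_id, int64_t num_threads) {
  // Placeholder for the FBGEMM call, which accepts thread_id/num_threads
  // and computes only that task's share of the output.
}

void run_gemm_in_parallel() {
  const int num_tasks = at::get_num_threads();
  at::parallel_for(0, num_tasks, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
    for (int64_t task_id = begin; task_id < end; ++task_id) {
      do_gemm_task(/*thread_id=*/task_id, /*num_threads=*/num_tasks);
    }
  });
}
```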

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 28, 2019
…#28477)

Summary:
Pull Request resolved: pytorch/pytorch#28477

Similar to pytorch/pytorch#26692, we would like to enable intra-op parallelism for the dynamic Linear op.
ghstack-source-id: 92419573

Test Plan:
CI

Test Benchmark: same script and results as in the PR summary above.

Differential Revision: D18074757

fbshipit-source-id: ad5b43477d2187c818c137093c6d6af02d5ca1d5
facebook-github-bot (Contributor):
This pull request has been merged in 052046b.

jianyuh added a commit that referenced this pull request Nov 1, 2019
Similar to #28464 and #28477, we would like to enable intra-op parallelism for layer norm. This should translate into a parallel performance win for the BERT/RoBERTa model.

Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)

facebook-github-bot deleted the gh/jianyuh/36/head branch November 1, 2019 14:17
jianyuh added a commit that referenced this pull request Nov 5, 2019
Similar to #28464 and #28477, we would like to enable intra-op parallelism for layer norm. This should translate into a parallel performance win for the BERT/RoBERTa model.

Benchmarking the RoBERTa model with 20 threads:

Before this Diff: P120104857
```
equal                    11.16%           305.851ms        11.16%           305.851ms        4.248ms          NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]     
```


After this Diff:
(Grain size is the third parameter to `at::parallel_for`. As measured below, the performance differences between these grain sizes appear to be minor.)

- grain size = `TH_OMP_OVERHEAD_THRESHOLD`:
```
equal                    1.43%            36.056ms         1.43%            36.056ms         500.783us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = `HYPER_TH_OMP_OVERHEAD_THRESHOLD`:
```
equal                    1.41%            35.126ms         1.41%            35.126ms         487.855us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

- grain size = 1:
```
equal                    1.43%            35.632ms         1.43%            35.632ms         494.886us        NaN              0.000us          0.000us          72               [[61, 64, 1024], [61, 64, 1024]]
```

Note that the definitions of `HYPER_TH_OMP_OVERHEAD_THRESHOLD` and `TH_OMP_OVERHEAD_THRESHOLD` can be found at https://github.com/pytorch/pytorch/blob/master/aten/src/TH/generic/THTensorApply.hpp#L7-L10. Since `HYPER_TH_OMP_OVERHEAD_THRESHOLD = TH_OMP_OVERHEAD_THRESHOLD / 16`, it is the choice for more fine-grained tasks.
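
As a rough illustration of what the grain-size argument controls (the loop body below is a placeholder, not this PR's code), under the assumption that `at::parallel_for` runs the range serially when it has at most `grain_size` iterations and otherwise splits it into chunks of at least `grain_size` iterations per task:

```
// Illustration: a larger grain_size yields fewer, coarser tasks;
// a range with n <= grain_size iterations runs serially.
#include <ATen/Parallel.h>

void scale_in_parallel(float* data, int64_t n, int64_t grain_size) {
  at::parallel_for(0, n, grain_size, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] *= 2.0f;  // placeholder per-element work
    }
  });
}
```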


Differential Revision: [D18165752](https://our.internmc.facebook.com/intern/diff/D18165752/)

facebook-github-bot pushed a commit that referenced this pull request Nov 6, 2019
Summary:
Pull Request resolved: #28810

Similar to #28464 and #28477, we would like to enable intra-op parallelism for layer norm. This should translate into a parallel performance win for the BERT/RoBERTa model.

Test Plan: CI

Differential Revision: D18165752

fbshipit-source-id: 354cede4c36893acbd69711f49aa6a51dc94397f
zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 6, 2019 (same commit message as the merged commit above).