Add fast sums and sums of squares over quantized ranges to QuantizedOpKernels.cpp #35693
Conversation
Summary:
Adds utility functions to quantized int types of vec256 to calculate horizontal sums and sums of squares using avx2 intrinsics. This is useful for quantized implementations of various normalization layers (LayerNorm, GroupNorm, InstanceNorm), where we need to calculate the mean and variance of a layer of quantized ints.

Test Plan:
Ad-hoc C++ tester for the correctness of the avx2 functions: https://gist.github.com/vkuzo/0380f450793cd5c05abbeacb6d3883ae

Run with:
```
-lstdc++ -mavx2 -lm -ldl -o main main.cpp && ./main
```

The integration bits and performance will be tested in the next PR in the stack, where we will hook quantized LayerNorm to use this.

Reviewers:
Subscribers:
Tasks:
Tags:

[ghstack-poisoned]
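For reference, a scalar sketch of what these reductions are used for (illustrative only, not the PR's vec256 API): one pass accumulating a sum and a sum of squares is enough to derive the mean and a biased variance for a normalization layer.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t n = 256;
  uint8_t data[n];
  for (int64_t i = 0; i < n; ++i) data[i] = static_cast<uint8_t>(i);

  // the PR vectorizes these two accumulations with AVX2
  int64_t s = 0, s_sq = 0;
  for (int64_t i = 0; i < n; ++i) {
    s += data[i];
    s_sq += static_cast<int64_t>(data[i]) * data[i];
  }
  double mean = static_cast<double>(s) / n;
  double var = static_cast<double>(s_sq) / n - mean * mean;  // biased variance
  printf("mean=%f var=%f\n", mean, var);
  return 0;
}
```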
```cpp
// horizontal sums signed i64, overflow unsafe
// x = (y3, y2, y1, y0)
int64_t custom_mm256_hsum_epi64_ignore_overflow(__m256i x) {
```
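For context, a hedged sketch of how such an overflow-unsafe 64-bit horizontal sum can be implemented with AVX2 (a plausible completion, not necessarily the PR's exact body):

```cpp
#include <immintrin.h>
#include <cstdint>

// Sum the four signed 64-bit lanes of x, ignoring potential overflow.
int64_t custom_mm256_hsum_epi64_ignore_overflow(__m256i x) {
  // fold the upper 128-bit half onto the lower: (y3+y1, y2+y0)
  __m128i lo = _mm256_castsi256_si128(x);
  __m128i hi = _mm256_extracti128_si256(x, 1);
  __m128i sum2 = _mm_add_epi64(lo, hi);
  // add the two remaining 64-bit lanes (SSE4.1 extract)
  return _mm_extract_epi64(sum2, 0) + _mm_extract_epi64(sum2, 1);
}
```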
p.s. I tried to put these in a separate header but got a bunch of linking errors with "multiple functions of the same name" defined across various build flags. Let me know if there is something special that needs to be done for a new header in this dir.
💊 CircleCI build failures summary and remediations
As of commit ab52e4b (more details on the Dr. CI page):
💚 Looks good so far! There are no CircleCI failures yet. 💚
This comment was automatically generated by Dr. CI. Follow this link to opt-out of these comments for your Pull Requests. Please report bugs/suggestions on the GitHub issue tracker. This comment has been revised 19 times.
jamesr66a left a comment:
Seems reasonable to me, but you might want to name the hsum functions something that explicitly mentions the widening behavior to distinguish those functions from something like _mm256_hadd_epi{16}, which would overflow
cc @dskhudia can you take a look as well?
```cpp
const __m256i xHalf1_64 = _mm256_cvtepu8_epi16(xHalf1);
// (x15, ..., x0), int16
const __m256i xHalf2_64 = _mm256_cvtepu8_epi16(xHalf2);
```
Any reason not to use _mm256_hadd_epi16 on xHalf1_64 and xHalf2_64 and then form a tree of hadds?
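For reference, a tree of hadds might look like the following (a hedged sketch of the suggestion, not code from the PR; note that _mm256_hadd_epi16 keeps 16-bit lanes, so it can overflow on wide inputs, which is the concern discussed below):

```cpp
#include <immintrin.h>
#include <cstdint>

// Sum all 32 int16 lanes of a and b via repeated pairwise horizontal adds.
// Each _mm256_hadd_epi16 adds adjacent pairs within 128-bit lanes, so each
// round halves the number of distinct partial sums.
int32_t hadd_tree_sum(__m256i a, __m256i b) {
  __m256i h = _mm256_hadd_epi16(a, b);  // 16 pairwise sums
  h = _mm256_hadd_epi16(h, h);          // 8 distinct partial sums
  h = _mm256_hadd_epi16(h, h);          // 4 distinct partial sums
  h = _mm256_hadd_epi16(h, h);          // 1 per 128-bit lane
  // combine the two 128-bit lanes
  return static_cast<int16_t>(_mm256_extract_epi16(h, 0)) +
         static_cast<int16_t>(_mm256_extract_epi16(h, 8));
}
```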
Just to clarify, you mean why not call custom_mm256_hsum_epu16_overflow from here? If yes, it would be slower (can't remember by how much, but I did measure it in my adhoc tester and it was significant) because that function widens the inputs again, and we only need to widen once to ensure no overflow.
I think it was more than 20% slower, but can run again if needed for the exact #
I meant the horizontal add intrinsic itself on 16-bit values.
ah, I didn't know about hadd (was searching for hsum). Thanks for the tip! Yeah, that should improve things, along with your other suggestion - will check it out and benchmark
hsum for int8 and uint8 can be combined using a template since the code is mostly the same. Similarly hsum_sq for int8 and uint8. Other than this, it looks good to me.
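A hedged sketch of the templating idea (names are illustrative, not the merged code): only the widening intrinsic differs between the signed and unsigned variants, so the reduction can be shared.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <type_traits>

// Shared reduction: sum all 16 int16 lanes of v into a scalar.
static inline int32_t hsum_epi16(__m256i v) {
  // madd against 1s widens adjacent int16 pairs into int32 sums
  __m256i s32 = _mm256_madd_epi16(v, _mm256_set1_epi16(1));
  __m128i s = _mm_add_epi32(_mm256_castsi256_si128(s32),
                            _mm256_extracti128_si256(s32, 1));
  s = _mm_add_epi32(s, _mm_srli_si128(s, 8));
  s = _mm_add_epi32(s, _mm_srli_si128(s, 4));
  return _mm_cvtsi128_si32(s);
}

// Sum 32 8-bit lanes; T selects signed vs. unsigned widening.
template <typename T>
int32_t hsum_8bit(__m256i x) {
  static_assert(
      std::is_same<T, int8_t>::value || std::is_same<T, uint8_t>::value,
      "only int8/uint8 supported");
  constexpr bool is_signed = std::is_same<T, int8_t>::value;
  __m128i lo = _mm256_castsi256_si128(x);
  __m128i hi = _mm256_extracti128_si256(x, 1);
  // the only type-dependent step: widen 8-bit lanes to int16
  __m256i w0 = is_signed ? _mm256_cvtepi8_epi16(lo) : _mm256_cvtepu8_epi16(lo);
  __m256i w1 = is_signed ? _mm256_cvtepi8_epi16(hi) : _mm256_cvtepu8_epi16(hi);
  // per-lane sums of two 8-bit values always fit in int16
  return hsum_epi16(_mm256_add_epi16(w0, w1));
}
```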
```cpp
alignas(64) int32_t temp[8];
_mm256_store_si256(reinterpret_cast<__m256i*>(temp), sum_v);
for (int k = 0; k < 8; ++k) {
  row_sum += temp[k];
}
```
If you are feeling adventurous, you can do this part using _mm256_hadd_epi32 and the remainder part below using mask instructions. For example, see the use of masking in the remainder loop in https://github.com/pytorch/FBGEMM/blob/master/src/QuantUtilsAvx2.cc#L91-L97
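For reference, a hedged sketch of the masked-remainder idea in the FBGEMM link above (illustrative names, not the PR's code): load the 0–7 leftover int32 values with _mm256_maskload_epi32 instead of a scalar tail loop.

```cpp
#include <immintrin.h>
#include <cstdint>

// Sum `rem` (0..7) leftover int32 elements using a masked vector load.
int32_t sum_tail(const int32_t* data, int rem) {
  // lanes whose mask sign bit is set are loaded; the rest read as zero
  alignas(32) static const int32_t mask_src[16] = {
      -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0};
  __m256i mask = _mm256_loadu_si256(
      reinterpret_cast<const __m256i*>(mask_src + 8 - rem));
  __m256i v = _mm256_maskload_epi32(data, mask);
  alignas(32) int32_t tmp[8];
  _mm256_store_si256(reinterpret_cast<__m256i*>(tmp), v);
  int32_t s = 0;
  for (int k = 0; k < 8; ++k) s += tmp[k];
  return s;
}
```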
hmm, not sure if this is worth it, as unless we templatize all three types we'll have to branch at the callsites
This pull request has been merged in 23e5f6a.
Stack from ghstack:
Summary:
Adds utility functions to quantized int types to calculate
horizontal sums and sums of squares using avx2 intrinsics.
This is useful for quantized implementations of various normalization
layers (LayerNorm, GroupNorm, InstanceNorm), where we need to calculate
the mean and variance of a layer of quantized ints.
Test Plan:
Adhoc c++ tester for the correctness of the avx2 functions:
https://gist.github.com/vkuzo/0380f450793cd5c05abbeacb6d3883ae
Run with:
```
-lstdc++ -mavx2 -lm -ldl -o main main.cpp && ./main
```
The integration bits and performance will be tested in the next PR in the stack,
where we will hook quantized LayerNorm to use this.
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D20768804