Add fast sums and sums of squares over quantized ranges to QuantizedOpKernels.cpp #35693
Conversation
Summary:
Adds utility functions to quantized int types of vec256 to calculate horizontal sums and sums of squares using avx2 intrinsics. This is useful for quantized implementations of various normalization layers (LayerNorm, GroupNorm, InstanceNorm), where we need to calculate the mean and variance of a layer of quantized ints.

Test Plan:
Ad-hoc C++ tester for the correctness of the avx2 functions: https://gist.github.com/vkuzo/0380f450793cd5c05abbeacb6d3883ae

Run with:
```
-lstdc++ -mavx2 -lm -ldl -o main main.cpp && ./main
```

The integration bits and performance will be tested in the next PR in the stack, where we will hook quantized LayerNorm to use this.

Reviewers:
Subscribers:
Tasks:
Tags:

[ghstack-poisoned]
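For reference, a scalar sketch of what these reductions are used for (illustrative only, not the PR's vec256 API): one pass accumulating a sum and a sum of squares is enough to derive the mean and a biased variance for a normalization layer.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t n = 256;
  uint8_t data[n];
  for (int64_t i = 0; i < n; ++i) data[i] = static_cast<uint8_t>(i);

  // the PR vectorizes these two accumulations with AVX2
  int64_t s = 0, s_sq = 0;
  for (int64_t i = 0; i < n; ++i) {
    s += data[i];
    s_sq += static_cast<int64_t>(data[i]) * data[i];
  }
  double mean = static_cast<double>(s) / n;
  double var = static_cast<double>(s_sq) / n - mean * mean;  // biased variance
  printf("mean=%f var=%f\n", mean, var);
  return 0;
}
```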
```cpp
// horizontal sums signed i64, overflow unsafe
// x = (y3, y2, y1, y0)
int64_t custom_mm256_hsum_epi64_ignore_overflow(__m256i x) {
```
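For context, a hedged sketch of how such an overflow-unsafe 64-bit horizontal sum can be implemented with AVX2 (a plausible completion, not necessarily the PR's exact body):

```cpp
#include <immintrin.h>
#include <cstdint>

// Sum the four signed 64-bit lanes of x, ignoring potential overflow.
int64_t custom_mm256_hsum_epi64_ignore_overflow(__m256i x) {
  // fold the upper 128-bit half onto the lower: (y3+y1, y2+y0)
  __m128i lo = _mm256_castsi256_si128(x);
  __m128i hi = _mm256_extracti128_si256(x, 1);
  __m128i sum2 = _mm_add_epi64(lo, hi);
  // add the two remaining 64-bit lanes (SSE4.1 extract)
  return _mm_extract_epi64(sum2, 0) + _mm_extract_epi64(sum2, 1);
}
```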
p.s. I tried to put these in a separate header but got a bunch of linking errors with "multiple functions of the same name" defined across various build flags. Let me know if there is something special that needs to be done for a new header in this dir.
💊 CircleCI build failures summary and remediations
As of commit ab52e4b (more details on the Dr. CI page):
💚 Looks good so far! There are no CircleCI failures yet. 💚
This comment was automatically generated by Dr. CI. Follow this link to opt-out of these comments for your Pull Requests. Please report bugs/suggestions on the GitHub issue tracker. This comment has been revised 19 times.
jamesr66a left a comment:
Seems reasonable to me, but you might want to name the hsum functions something that explicitly mentions the widening behavior to distinguish those functions from something like _mm256_hadd_epi{16}, which would overflow
cc @dskhudia can you take a look as well?
```cpp
const __m256i xHalf1_64 = _mm256_cvtepu8_epi16(xHalf1);
// (x15, ..., x0), int16
const __m256i xHalf2_64 = _mm256_cvtepu8_epi16(xHalf2);
```
Any reason not to use _mm256_hadd_epi16 on xHalf1_64 and xHalf2_64 and then form a tree of hadds?
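For reference, a tree of hadds might look like the following (a hedged sketch of the suggestion, not code from the PR; note that _mm256_hadd_epi16 keeps 16-bit lanes, so it can overflow on wide inputs, which is the concern discussed below):

```cpp
#include <immintrin.h>
#include <cstdint>

// Sum all 32 int16 lanes of a and b via repeated pairwise horizontal adds.
// Each _mm256_hadd_epi16 adds adjacent pairs within 128-bit lanes, so each
// round halves the number of distinct partial sums.
int32_t hadd_tree_sum(__m256i a, __m256i b) {
  __m256i h = _mm256_hadd_epi16(a, b);  // 16 pairwise sums
  h = _mm256_hadd_epi16(h, h);          // 8 distinct partial sums
  h = _mm256_hadd_epi16(h, h);          // 4 distinct partial sums
  h = _mm256_hadd_epi16(h, h);          // 1 per 128-bit lane
  // combine the two 128-bit lanes
  return static_cast<int16_t>(_mm256_extract_epi16(h, 0)) +
         static_cast<int16_t>(_mm256_extract_epi16(h, 8));
}
```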
Just to clarify, you mean why not call custom_mm256_hsum_epu16_overflow from here? If yes, it would be slower (can't remember by how much, but I did measure it in my adhoc tester and it was significant) because that function widens the inputs again, and we only need to widen once to ensure no overflow.
I think it was more than 20% slower, but can run again if needed for the exact #
I meant the horizontal add intrinsic itself on 16-bit values.
ah, I didn't know about hadd (was searching for hsum). Thanks for the tip! Yeah, that should improve things, along with your other suggestion - will check it out and benchmark
hsum for int8 and uint8 can be combined using a template since the code is mostly the same. Similarly hsum_sq for int8 and uint8. Other than this, it looks good to me.
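A hedged sketch of the templating idea (names are illustrative, not the merged code): only the widening intrinsic differs between the signed and unsigned variants, so the reduction can be shared.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <type_traits>

// Shared reduction: sum all 16 int16 lanes of v into a scalar.
static inline int32_t hsum_epi16(__m256i v) {
  // madd against 1s widens adjacent int16 pairs into int32 sums
  __m256i s32 = _mm256_madd_epi16(v, _mm256_set1_epi16(1));
  __m128i s = _mm_add_epi32(_mm256_castsi256_si128(s32),
                            _mm256_extracti128_si256(s32, 1));
  s = _mm_add_epi32(s, _mm_srli_si128(s, 8));
  s = _mm_add_epi32(s, _mm_srli_si128(s, 4));
  return _mm_cvtsi128_si32(s);
}

// Sum 32 8-bit lanes; T selects signed vs. unsigned widening.
template <typename T>
int32_t hsum_8bit(__m256i x) {
  static_assert(
      std::is_same<T, int8_t>::value || std::is_same<T, uint8_t>::value,
      "only int8/uint8 supported");
  constexpr bool is_signed = std::is_same<T, int8_t>::value;
  __m128i lo = _mm256_castsi256_si128(x);
  __m128i hi = _mm256_extracti128_si256(x, 1);
  // the only type-dependent step: widen 8-bit lanes to int16
  __m256i w0 = is_signed ? _mm256_cvtepi8_epi16(lo) : _mm256_cvtepu8_epi16(lo);
  __m256i w1 = is_signed ? _mm256_cvtepi8_epi16(hi) : _mm256_cvtepu8_epi16(hi);
  // per-lane sums of two 8-bit values always fit in int16
  return hsum_epi16(_mm256_add_epi16(w0, w1));
}
```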
```cpp
alignas(64) int32_t temp[8];
_mm256_store_si256(reinterpret_cast<__m256i*>(temp), sum_v);
for (int k = 0; k < 8; ++k) {
  row_sum += temp[k];
}
```
If you are feeling adventurous, you can do this part using _mm256_hadd_epi32 and the remainder part below using mask instructions. For example, see the use of masking in the remainder loop in https://github.com/pytorch/FBGEMM/blob/master/src/QuantUtilsAvx2.cc#L91-L97
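For reference, a hedged sketch of the masked-remainder idea in the FBGEMM link above (illustrative names, not the PR's code): load the 0–7 leftover int32 values with _mm256_maskload_epi32 instead of a scalar tail loop.

```cpp
#include <immintrin.h>
#include <cstdint>

// Sum `rem` (0..7) leftover int32 elements using a masked vector load.
int32_t sum_tail(const int32_t* data, int rem) {
  // lanes whose mask sign bit is set are loaded; the rest read as zero
  alignas(32) static const int32_t mask_src[16] = {
      -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0};
  __m256i mask = _mm256_loadu_si256(
      reinterpret_cast<const __m256i*>(mask_src + 8 - rem));
  __m256i v = _mm256_maskload_epi32(data, mask);
  alignas(32) int32_t tmp[8];
  _mm256_store_si256(reinterpret_cast<__m256i*>(tmp), v);
  int32_t s = 0;
  for (int k = 0; k < 8; ++k) s += tmp[k];
  return s;
}
```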
hmm, not sure if this is worth it, as unless we templatize all three types we'll have to branch at the callsites
This pull request has been merged in 23e5f6a.
Stack from ghstack:
Summary:
Adds utility functions to quantized int types to calculate
horizontal sums and sums of squares using avx2 intrinsics.
This is useful for quantized implementations of various normalization
layers (LayerNorm, GroupNorm, InstanceNorm), where we need to calculate
the mean and variance of a layer of quantized ints.
Test Plan:
Adhoc c++ tester for the correctness of the avx2 functions:
https://gist.github.com/vkuzo/0380f450793cd5c05abbeacb6d3883ae
Run with:
```
-lstdc++ -mavx2 -lm -ldl -o main main.cpp && ./main
```
The integration bits and performance will be tested in the next PR in the stack,
where we will hook quantized LayerNorm to use this.
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D20768804