Conversation

Contributor

@vkuzo vkuzo commented Mar 30, 2020

Stack from ghstack:

Summary:

Adds utility functions for quantized int types to calculate
horizontal sums and sums of squares using AVX2 intrinsics.

This is useful for quantized implementations of various normalization
layers (LayerNorm, GroupNorm, InstanceNorm), where we need to calculate
the mean and variance of a layer of quantized ints.
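
For illustration, a minimal sketch (hypothetical names, not necessarily this PR's exact API) of how a horizontal sum and sum of squares feed those statistics:

```
#include <cstdint>

// Given sum and sum of squares over N quantized values, recover
// mean and variance via E[x^2] - (E[x])^2.
void mean_var_from_sums(int64_t sum, int64_t sum_sq, int64_t N,
                        float* mean, float* var) {
  float m = static_cast<float>(sum) / N;
  *mean = m;
  *var = static_cast<float>(sum_sq) / N - m * m;
}
```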

Test Plan:

Ad-hoc C++ tester for the correctness of the AVX2 functions:
https://gist.github.com/vkuzo/0380f450793cd5c05abbeacb6d3883ae
Run with something like:

```
# compiler assumed to be gcc
gcc main.cpp -mavx2 -lstdc++ -lm -ldl -o main && ./main
```

The integration bits and performance will be tested in the next PR in the stack,
where we will hook quantized LayerNorm up to use this.


Differential Revision: D20768804

vkuzo added a commit that referenced this pull request Mar 30, 2020
ghstack-source-id: d01df6a
Pull Request resolved: #35693
@vkuzo vkuzo added the oncall: quantization Quantization support in PyTorch label Mar 30, 2020
@vkuzo vkuzo requested a review from jamesr66a March 30, 2020 18:10

```
// horizontal sums signed i64, overflow unsafe
// x = (y3, y2, y1, y0)
int64_t custom_mm256_hsum_epi64_ignore_overflow(__m256i x) {
```
Contributor Author

p.s. I tried to put these in a separate header but got a bunch of linking errors with "multiple functions of the same name" defined across various build flags. Let me know if there is something special that needs to be done for a new header in this dir.
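
For context, a minimal sketch of one way a function with the signature quoted above can be implemented (an illustration under that signature, not necessarily the PR's exact body):

```
#include <immintrin.h>
#include <cstdint>

// sum the four int64 lanes of x; overflow wraps around and is ignored
int64_t custom_mm256_hsum_epi64_ignore_overflow(__m256i x) {
  __m128i lo = _mm256_castsi256_si128(x);       // (y1, y0)
  __m128i hi = _mm256_extracti128_si256(x, 1);  // (y3, y2)
  __m128i s = _mm_add_epi64(lo, hi);            // (y1 + y3, y0 + y2)
  __m128i sh = _mm_unpackhi_epi64(s, s);        // bring the upper lane down
  s = _mm_add_epi64(s, sh);                     // low lane = y0 + y1 + y2 + y3
  return _mm_cvtsi128_si64(s);
}
```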

@dr-ci

dr-ci bot commented Mar 30, 2020

💊 CircleCI build failures summary and remediations

As of commit ab52e4b (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no CircleCI failures yet. 💚 💚



@vkuzo vkuzo self-assigned this Mar 30, 2020
Collaborator

@jamesr66a jamesr66a left a comment

Seems reasonable to me, but you might want to name the hsum functions something that explicitly mentions the widening behavior, to distinguish those functions from something like _mm256_hadd_epi{16}, which would overflow.

cc @dskhudia can you take a look as well?
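
To make the overflow concern concrete, a small standalone demo (illustrative, not from the PR; build with -mavx2) of _mm256_hadd_epi16 wrapping on adjacent-pair sums:

```
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
  // every int16 lane holds 30000, so each adjacent-pair sum "should" be 60000
  __m256i v = _mm256_set1_epi16(30000);
  __m256i h = _mm256_hadd_epi16(v, v);  // adds adjacent int16 pairs, wraps
  alignas(32) int16_t out[16];
  _mm256_store_si256(reinterpret_cast<__m256i*>(out), h);
  printf("%d\n", out[0]);  // prints -5536: 60000 wrapped mod 2^16
  return 0;
}
```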

…56 qint types"

vkuzo added a commit that referenced this pull request Mar 31, 2020
ghstack-source-id: 16a33f7
Pull Request resolved: #35693
Comment on lines 318 to 320
```
const __m256i xHalf1_64 = _mm256_cvtepu8_epi16(xHalf1);
// (x15, ..., x0), int16
const __m256i xHalf2_64 = _mm256_cvtepu8_epi16(xHalf2);
```
Contributor

Any reason not to use _mm256_hadd_epi16 on xHalf1_64 and xHalf2_64 and then forming a tree of hadds?

Contributor Author

Just to clarify, you mean why not call custom_mm256_hsum_epu16_overflow from here? If yes, it would be slower (can't remember by how much, but I did measure it in my ad-hoc tester and it was significant) because that function widens the inputs again, and we only need to widen once to ensure no overflow.

Contributor Author

I think it was more than 20% slower, but I can run it again if needed for the exact number.

Contributor

I meant the horizontal add intrinsic itself on 16-bit values.

Contributor Author

Ah, I didn't know about hadd (I was searching for hsum). Thanks for the tip! Yeah, that should improve things, along with your other suggestion; I will check it out and benchmark.
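
For reference, a sketch of the shape being discussed, assuming a single 32-byte vector of uint8: widen to int16 once, then reduce with a tree of 16-bit hadds. Since 32 values of at most 255 sum to at most 8160, the int16 lanes cannot overflow:

```
#include <immintrin.h>
#include <cstdint>

// sum 32 uint8 lanes: widen once to int16, then a hadd tree
int32_t hsum_epu8_via_hadd16(__m256i x) {
  __m128i lo8 = _mm256_castsi256_si128(x);
  __m128i hi8 = _mm256_extracti128_si256(x, 1);
  __m256i s = _mm256_add_epi16(_mm256_cvtepu8_epi16(lo8),
                               _mm256_cvtepu8_epi16(hi8));  // <= 510 per lane
  s = _mm256_hadd_epi16(s, s);  // pairwise sums within each 128-bit lane
  s = _mm256_hadd_epi16(s, s);
  s = _mm256_hadd_epi16(s, s);  // element 0 of each lane = that lane's sum
  int32_t a = _mm_extract_epi16(_mm256_castsi256_si128(s), 0);
  int32_t b = _mm_extract_epi16(_mm256_extracti128_si256(s, 1), 0);
  return a + b;  // total <= 8160
}
```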

…56 qint types"

@vkuzo vkuzo changed the title Add avx2 integer horizontal sum and sum of squares to vec256 qint types Add fast sums and sums of squares over quantized ranges to QuantizedOpKernels.cpp Apr 6, 2020
@dskhudia
Contributor

dskhudia commented Apr 7, 2020

hsum for int8 and uint8 can be combined using a template since the code is mostly the same. Similarly hsum_sq for int8 and uint8. Other than this, it looks good to me.

Comment on lines +176 to +180
```
alignas(64) int32_t temp[8];
_mm256_store_si256(reinterpret_cast<__m256i*>(temp), sum_v);
for (int k = 0; k < 8; ++k) {
  row_sum += temp[k];
}
```
Contributor

If you are feeling adventurous, you can do this part using _mm256_hadd_epi32 and the remainder part below using mask instructions. For example, see the use of masking in the remainder loop in https://github.com/pytorch/FBGEMM/blob/master/src/QuantUtilsAvx2.cc#L91-L97
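
For reference, a minimal sketch (not the PR's code) of replacing the store-and-loop reduction above with _mm256_hadd_epi32:

```
#include <immintrin.h>
#include <cstdint>

// horizontal sum of 8 int32 lanes without the scalar temp[] loop
int32_t hsum_epi32_hadd(__m256i v) {
  v = _mm256_hadd_epi32(v, v);  // pairwise sums within each 128-bit lane
  v = _mm256_hadd_epi32(v, v);  // element 0 of each lane = sum of its 4 inputs
  __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                            _mm256_extracti128_si256(v, 1));
  return _mm_cvtsi128_si32(s);
}
```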

@vkuzo
Contributor Author

vkuzo commented Apr 7, 2020

> hsum for int8 and uint8 can be combined using a template since the code is mostly the same. Similarly hsum_sq for int8 and uint8. Other than this, it looks good to me.

Hmm, not sure if this is worth it; unless we templatize all three types, we'll have to branch at the call sites.
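
For what it's worth, a hypothetical sketch of the suggested merge (widen16 is an illustrative name): the int8 and uint8 paths differ only in the widening intrinsic, which a template can select by signedness:

```
#include <immintrin.h>
#include <cstdint>
#include <type_traits>

// widen 16 bytes to 16 x int16, picking the intrinsic by signedness;
// an hsum template could share everything downstream of this point
template <typename T>
__m256i widen16(__m128i x) {
  static_assert(std::is_same<T, int8_t>::value ||
                    std::is_same<T, uint8_t>::value,
                "int8/uint8 only");
  return std::is_signed<T>::value ? _mm256_cvtepi8_epi16(x)
                                  : _mm256_cvtepu8_epi16(x);
}
```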

vkuzo added 2 commits April 7, 2020 15:16
… QuantizedOpKernels.cpp"
… QuantizedOpKernels.cpp"

@facebook-github-bot
Contributor

This pull request has been merged in 23e5f6a.

@facebook-github-bot facebook-github-bot deleted the gh/vkuzo/18/head branch April 13, 2020 14:16
ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Apr 13, 2020
…es (pytorch#35693)

Summary:
Pull Request resolved: pytorch#35693

Imported from OSS

Differential Revision: D20768804

fbshipit-source-id: 4720dd358dde0dabbab8e1a33a67be55925d98f9