[lazy] Optimize ZSTD_row_getMatchMask for levels 8-10 for ARM#3139
terrelln merged 4 commits into facebook:dev from danlark1:dev
Conversation
We found that movemask is either not used properly or consumes too much CPU. This effort optimizes the movemask emulation on ARM. For levels 8-9 we saw 3-5% improvements; for level 10 we saw a 1.5% improvement. The key idea is not to produce a pure movemask but to work with groups of bits: for rowEntries == 16 and 32 we use groups of size 4 and 2 respectively, so each comparison bit is duplicated within its group. Then we AND the result so that only one bit is set per group, which keeps the lowest-set-bit iteration `a &= (a - 1)` working. Also, aarch64 has no rotate instructions for 16-bit values, only for 32 and 64, which is why we see larger improvements for levels 8-9. The `vshrn_n_u16` instruction is used to achieve this: it shifts every u16 right by 4 and narrows it to the lower 8 bits. See the picture below. The same trick is used in [Folly](https://github.com/facebook/folly/blob/c5702590080aa5d0e8d666d91861d64634065132/folly/container/detail/F14Table.h#L446), and it takes 2 cycles according to the Neoverse-N{1,2} optimization guidelines. The 64-bit movemask is already well optimized. We have ongoing experiments but were not able to validate that other implementations work reliably faster.
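To make the idea concrete, below is a minimal sketch of the 16-entry movemask emulation described above. It is illustrative, not the exact zstd code (which also handles head rotation and the 32/64-entry paths); the function name `rowMatchMask16_neon` and the `0x1111...` masking constant are assumptions for this example, and it only builds on an aarch64 target with NEON.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Sketch: emulate a movemask over 16 one-byte tags, producing a 4-bit group
 * per lane, then keep one bit per group so `m &= m - 1` iteration works. */
static uint64_t rowMatchMask16_neon(const uint8_t* src, uint8_t tag)
{
    const uint8x16_t chunk    = vld1q_u8(src);
    /* 0xFF in every byte whose tag matches, 0x00 elsewhere. */
    const uint8x16_t cmp      = vceqq_u8(chunk, vdupq_n_u8(tag));
    /* vshrn_n_u16 shifts every u16 right by 4 and narrows to its low 8 bits:
     * each lane's comparison collapses into one nibble of the result. */
    const uint16x8_t cmp16    = vreinterpretq_u16_u8(cmp);
    const uint8x8_t  narrowed = vshrn_n_u16(cmp16, 4);
    const uint64_t   matches  = vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
    /* Keep a single bit per 4-bit group; bit position / 4 gives the lane. */
    return matches & 0x1111111111111111ULL;
}
```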
terrelln
left a comment
Awesome, thanks for the patch @danlark1!
When @senhuang42 implemented this, we tested that we didn't regress ARM, but we didn't optimize for it specifically. And I'm not super familiar with NEON, so I can totally see how this was missed. Now that I have an M1 Macbook and can easily test NEON, I'll have to familiarize myself.
I trust that you've measured carefully, but I'll double check that it doesn't regress x86-64, and measure on my M1.
I have one nit, then I just need to measure, and this will be good to go!
I see neutral performance on x86-64 with our clang 9 & 12 and gcc 9 & 11. On my 2021 Macbook Air I see +3.5-4.5% performance improvements for levels 5 through 8.
Thanks for the PR @danlark1!
While the scalar post-processing required to obtain one bit per lane makes this more expensive than directly supporting variable-sized bit groups (as done in Zstandard[^1]), the result is still an improvement over the current lane-by-lane algorithm.

[^1]: See facebook/zstd#3139, namely `ZSTD_row_matchMaskGroupWidth`.
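For context, the group width per row size on NEON can be expressed as a small helper. The sketch below is an illustrative stand-in for zstd's `ZSTD_row_matchMaskGroupWidth` (the name `rowMatchMaskGroupWidth` and the exact structure are assumptions); the values come straight from the description in this PR: 4-bit groups for 16 entries, 2-bit groups for 32, and a plain 1-bit mask for 64.

```c
/* Bits per row entry in the emulated movemask on NEON, per the PR:
 * 16-entry rows use 4-bit groups, 32-entry rows use 2-bit groups,
 * and the 64-entry path already yields one bit per entry. */
static unsigned rowMatchMaskGroupWidth(unsigned rowEntries)
{
#if defined(__ARM_NEON)
    if (rowEntries == 16) return 4;
    if (rowEntries == 32) return 2;
#endif
    return 1;  /* SSE2/scalar paths (and 64-entry NEON) use 1 bit per entry */
}
```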
We found that movemask is either not used properly or consumes too much CPU.
This effort optimizes the movemask emulation on ARM.
For levels 8-9 we saw 3-5% improvement in overall compression. For level 10 we saw a 1.5% improvement.
The key idea is not to produce a pure movemask but to work with groups of bits.
For rowEntries == 16 and 32 we use groups of size 4 and 2 respectively,
so each comparison bit is duplicated within its group.
Then we AND the result so that only one bit is set per group, which keeps the
lowest-set-bit iteration `a &= (a - 1)` working.
Also, aarch64 has no rotate instructions for 16-bit values, only for 32 and 64,
which is why we see larger improvements for levels 8-9.
The `vshrn_n_u16` instruction is used to achieve this: it shifts every u16 right by 4
and narrows it to the lower 8 bits. See the picture below. It's also used in Folly.
It takes 2 cycles according to the Neoverse-N{1,2} guidelines.
And after the AND, the mask has at most one set bit per group (see the sketch below).
The 64-bit movemask is already well optimized. We have ongoing experiments
but were not able to validate that other implementations work reliably faster.
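The grouped mask is consumed with the usual clear-lowest-set-bit loop. The sketch below is a hypothetical illustration (not zstd's exact loop; `forEachMatch` is an invented name and `__builtin_ctzll` is a GCC/Clang builtin): the trailing-zero count gives the bit position, and dividing by the group width recovers the row entry, which is why only one bit per group may remain set after the AND.

```c
#include <stdint.h>

/* Hypothetical walk over a grouped match mask: groupWidth is 4 for
 * 16-entry rows and 2 for 32-entry rows, as described above.  Because
 * each group holds at most one set bit, `m &= m - 1` clears exactly
 * one match per iteration. */
static void forEachMatch(uint64_t matches, unsigned groupWidth,
                         void (*visit)(unsigned entry, void* ctx), void* ctx)
{
    while (matches != 0) {
        const unsigned bit   = (unsigned)__builtin_ctzll(matches);
        const unsigned entry = bit / groupWidth;  /* group index == row entry */
        visit(entry, ctx);
        matches &= matches - 1;                   /* drop the lowest set bit */
    }
}
```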