[lazy] Optimize ZSTD_row_getMatchMask for levels 8-10 for ARM#3139
terrelln merged 4 commits into facebook:dev from danlark1:dev
Conversation
We found that movemask is either not used properly or consumes too much CPU. This effort optimizes the movemask emulation on ARM. For levels 8-9 we saw 3-5% improvements; for level 10 we saw a 1.5% improvement. The key idea is not to produce a pure movemask but to work with groups of bits: for rowEntries == 16 and 32 we use groups of size 4 and 2 respectively, so each comparison bit is duplicated within its group. Then we AND the result so that only one bit is set per group, which keeps the lowest-set-bit iteration `a &= (a - 1)` working. Also, aarch64 has no rotate instructions for 16-bit values, only for 32 and 64, which is why we see larger improvements for levels 8-9. The `vshrn_n_u16` instruction is used to achieve this: it shifts every u16 right by 4 and narrows it to the lower 8 bits. See the picture below. The same trick is used in [Folly](https://github.com/facebook/folly/blob/c5702590080aa5d0e8d666d91861d64634065132/folly/container/detail/F14Table.h#L446), and it takes 2 cycles according to the Neoverse-N{1,2} optimization guidelines. The 64-bit movemask is already well optimized. We have ongoing experiments but were not able to validate that other implementations work reliably faster.
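To make the idea concrete, below is a minimal sketch of the 16-entry movemask emulation described above. It is illustrative, not the exact zstd code (which also handles head rotation and the 32/64-entry paths); the function name `rowMatchMask16_neon` and the `0x1111...` masking constant are assumptions for this example, and it only builds on an aarch64 target with NEON.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Sketch: emulate a movemask over 16 one-byte tags, producing a 4-bit group
 * per lane, then keep one bit per group so `m &= m - 1` iteration works. */
static uint64_t rowMatchMask16_neon(const uint8_t* src, uint8_t tag)
{
    const uint8x16_t chunk    = vld1q_u8(src);
    /* 0xFF in every byte whose tag matches, 0x00 elsewhere. */
    const uint8x16_t cmp      = vceqq_u8(chunk, vdupq_n_u8(tag));
    /* vshrn_n_u16 shifts every u16 right by 4 and narrows to its low 8 bits:
     * each lane's comparison collapses into one nibble of the result. */
    const uint16x8_t cmp16    = vreinterpretq_u16_u8(cmp);
    const uint8x8_t  narrowed = vshrn_n_u16(cmp16, 4);
    const uint64_t   matches  = vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
    /* Keep a single bit per 4-bit group; bit position / 4 gives the lane. */
    return matches & 0x1111111111111111ULL;
}
```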
terrelln
left a comment
Awesome, thanks for the patch @danlark1!
When @senhuang42 implemented this, we tested that we didn't regress ARM, but we didn't optimize for it specifically. And I'm not super familiar with NEON, so I can totally see how this was missed. Now that I have an M1 Macbook and can easily test NEON, I'll have to familiarize myself.
I trust that you've measured carefully, but I'll double check that it doesn't regress x86-64, and measure on my M1.
I have one nit, then I just need to measure, and this will be good to go!
I see neutral performance on x86-64 with our clang 9 & 12 and gcc 9 & 11. On my 2021 Macbook Air I see +3.5-4.5% performance improvements for levels 5 through 8.
Thanks for the PR @danlark1!
While the scalar post-processing required to obtain one bit per lane makes this more expensive than directly supporting variable-sized bit groups (as done in Zstandard[^1]), the result is still an improvement over the current lane-by-lane algorithm.

[^1]: See facebook/zstd#3139, namely `ZSTD_row_matchMaskGroupWidth`.
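For context, the group width per row size on NEON can be expressed as a small helper. The sketch below is an illustrative stand-in for zstd's `ZSTD_row_matchMaskGroupWidth` (the name `rowMatchMaskGroupWidth` and the exact structure are assumptions); the values come straight from the description in this PR: 4-bit groups for 16 entries, 2-bit groups for 32, and a plain 1-bit mask for 64.

```c
/* Bits per row entry in the emulated movemask on NEON, per the PR:
 * 16-entry rows use 4-bit groups, 32-entry rows use 2-bit groups,
 * and the 64-entry path already yields one bit per entry. */
static unsigned rowMatchMaskGroupWidth(unsigned rowEntries)
{
#if defined(__ARM_NEON)
    if (rowEntries == 16) return 4;
    if (rowEntries == 32) return 2;
#endif
    return 1;  /* SSE2/scalar paths (and 64-entry NEON) use 1 bit per entry */
}
```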
We found that movemask is either not used properly or consumes too much CPU.
This effort optimizes the movemask emulation on ARM.
For levels 8-9 we saw 3-5% improvement in overall compression. For level 10 we saw a 1.5% improvement.
The key idea is not to produce a pure movemask but to work with groups of bits.
For rowEntries == 16 and 32 we use groups of size 4 and 2 respectively,
so each comparison bit is duplicated within its group.
Then we AND the result so that only one bit is set per group, which keeps the
lowest-set-bit iteration `a &= (a - 1)` working.
Also, aarch64 has no rotate instructions for 16-bit values, only for 32 and 64,
which is why we see larger improvements for levels 8-9.
The `vshrn_n_u16` instruction is used to achieve this: it shifts every u16 right by 4
and narrows it to the lower 8 bits. See the picture below. It's also used in Folly.
It takes 2 cycles according to the Neoverse-N{1,2} guidelines.
And after the AND, the mask has at most one set bit per group (see the sketch below).
The 64-bit movemask is already well optimized. We have ongoing experiments
but were not able to validate that other implementations work reliably faster.
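The grouped mask is consumed with the usual clear-lowest-set-bit loop. The sketch below is a hypothetical illustration (not zstd's exact loop; `forEachMatch` is an invented name and `__builtin_ctzll` is a GCC/Clang builtin): the trailing-zero count gives the bit position, and dividing by the group width recovers the row entry, which is why only one bit per group may remain set after the AND.

```c
#include <stdint.h>

/* Hypothetical walk over a grouped match mask: groupWidth is 4 for
 * 16-entry rows and 2 for 32-entry rows, as described above.  Because
 * each group holds at most one set bit, `m &= m - 1` clears exactly
 * one match per iteration. */
static void forEachMatch(uint64_t matches, unsigned groupWidth,
                         void (*visit)(unsigned entry, void* ctx), void* ctx)
{
    while (matches != 0) {
        const unsigned bit   = (unsigned)__builtin_ctzll(matches);
        const unsigned entry = bit / groupWidth;  /* group index == row entry */
        visit(entry, ctx);
        matches &= matches - 1;                   /* drop the lowest set bit */
    }
}
```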