Flatten ZSTD_row_getMatchMask #2681
Conversation
* Remove the SIMD abstraction layer.
* Add big-endian support.
* Align `hashTags` within `tagRow` to a 16-byte boundary.
* Switch SSE2 to use aligned reads.
* Optimize the scalar path using SWAR.
* Optimize the NEON path for `n == 32`.
* Work around a minor clang issue for NEON (https://bugs.llvm.org/show_bug.cgi?id=49577)
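The SWAR idea in the list above can be illustrated with a small sketch. This is not the actual zstd code; the helper name and layout are made up for illustration. It compares 8 tag bytes at once against a single search tag, using the exact zero-byte detection trick, so each matching byte ends up with its high bit set in the returned mask.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch (not the actual zstd implementation): compare 8 tag
 * bytes at once against a single search tag without SIMD. XOR turns each
 * matching byte into 0x00; the add/or/not sequence then sets the high bit
 * of exactly the zero bytes. */
static uint64_t swar_match_mask(uint64_t chunk, uint8_t tag)
{
    const uint64_t splat = 0x0101010101010101ULL * tag; /* tag in every byte */
    const uint64_t x = chunk ^ splat;        /* matching bytes become 0x00 */
    /* exact zero-byte detector: after the add, bit 7 of each byte is set
     * iff its low 7 bits were nonzero; OR-ing x back in also covers bytes
     * whose own bit 7 was set, so ~y isolates the all-zero bytes. */
    uint64_t y = (x & 0x7F7F7F7F7F7F7F7FULL) + 0x7F7F7F7F7F7F7F7FULL;
    y |= x;
    return ~y & 0x8080808080808080ULL;
}
```

A nonzero result can then be scanned with a count-trailing-zeros divided by 8 to recover the first matching entry index.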
AFAICT, the test failures are all: which leaves me mystified as to how any SIMD ever passed these tests... Do we need (or want)
senhuang42
left a comment
Thanks for these contributions!! These NEON and scalar optimizations look great - I've left some comments that mostly have to do with style and code structure. I'll get around to benchmarking the scalar-mode improvements at some point as well.
Amusingly, NEON gets cheaper when
That's actually good to know - we're also going to add support for 64 row entries.
senhuang42
left a comment
This looks good to me! @terrelln do you have any thoughts?
Thanks @aqrit for this great contribution 🚀
Excellent work @aqrit !
```c
const U16 hi = (U16)vgetq_lane_u8(t3, 8);
const U16 lo = (U16)vgetq_lane_u8(t3, 0);
return ZSTD_rotateRight_U16((hi << 8) | lo, head);
} else { /* rowEntries == 32 */
```
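The `ZSTD_rotateRight_U16` call in the hunk above rotates the 16-bit match mask so that bit 0 lines up with the row's `head` position. A sketch of such a branch-free 16-bit rotate (assumed behavior; the masking with 15 is what makes `count == 0` safe, since a plain shift by 16 would be undefined in C):

```c
#include <stdint.h>

typedef uint16_t U16;

/* Sketch of a branch-free 16-bit rotate right. Masking the left-shift
 * count with 15 turns count == 0 into a shift by 0 instead of an
 * out-of-range (undefined) shift by 16. Assumes count < 16. */
static U16 rotateRight_U16(U16 value, uint32_t count)
{
    return (U16)((value >> count) | (value << ((0U - count) & 15)));
}
```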
@aqrit :
This NEON code must be designed specifically for each rowEntries value.
Hence, we have one implementation for 16, one for 32, and one for 64.
If we wanted to introduce 128 or even 256, it would be necessary to write yet another implementation.
Ignoring the issue of generating the final mask (whose width depends directly on the number of entries),
would it be possible to rewrite the main test loop in a generic way,
so that it scales naturally with rowEntries (provided it's a power of 2)?
For an example of what I mean by "an implementation which can scale with rowEntries",
please have a look at https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_lazy.c#L992 ,
which uses SSE2.
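For reference, a loop in the spirit of that SSE2 code: it scales with any power-of-two rowEntries >= 16 by accumulating 16 match bits per iteration, so only the final mask width depends on the row size. The function name and signature here are illustrative, not zstd's.

```c
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: build a match mask over a row of byte tags,
 * 16 bytes at a time. The same loop body serves rowEntries = 16, 32,
 * or 64 without a per-size rewrite. */
static uint64_t row_match_mask_sse2(const uint8_t* tags, uint8_t tag,
                                    size_t rowEntries)
{
    const __m128i splat = _mm_set1_epi8((char)tag);
    uint64_t mask = 0;
    size_t chunk;
    for (chunk = 0; chunk < rowEntries / 16; ++chunk) {
        const __m128i bytes =
            _mm_loadu_si128((const __m128i*)(const void*)(tags + 16 * chunk));
        const __m128i eq = _mm_cmpeq_epi8(bytes, splat);
        /* movemask packs one bit per byte; shift into position */
        mask |= (uint64_t)(uint16_t)_mm_movemask_epi8(eq) << (16 * chunk);
    }
    return mask;
}
```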
128/256 would just loop the 64 byte path.
I've done no benchmarking but...
it seems really painful to emulate pmovmskb using ARMv7-A NEON.
AArch64 added the SIMD instruction `addv`, which might be useful for simplifying all this.
On AArch64, is the ARMv7-A path actually faster than the scalar fallback?
Benchmark:
tl;dr: compression speed of the scalar path increases by about 22%.
The scalar path is now only about 4% slower than the SSE2 path.
Benched on my desktop (i7-8700K) using gcc version 9.3.0.