Optimize BITCOUNT with AVX2 and AVX512 popcount implementations. #14309
sundb merged 16 commits into redis:unstable
shahsb left a comment
The benchmarks focus on very large bitmaps (100M and 1B bits), where this optimization will shine. Is there any data on performance for very small strings (e.g., 16, 32, or 64 bytes)? While unlikely to be slower, it would be useful to confirm that the overhead of the function dispatch and setup for the vectorized paths doesn't negatively impact performance on "tiny" workloads.
shahsb left a comment
Really impressive work here! The 10% improvement on large bitmaps is substantial. I'm curious what was the most challenging part of getting this optimization right?
…mize.bitcount.avx
Co-authored-by: debing.sun <[email protected]>
Does this PR bring no improvement on AMD?
It should. We only have CI runners for Intel/Arm, mainly due to cost, but I'll add data for AMD from a manual run.
@sundb added results for AMD EPYC 9R14 (single shard) -- updated the main comment as well

This PR introduces vectorized implementations of BITCOUNT for x86_64 targets with AVX2 and AVX512 support. The AVX512 path uses VPOPCNTDQ on 64B chunks, with _mm512_reduce_add_epi64 to efficiently aggregate results across 512-bit vectors.

The test suite has been expanded with unit tests that validate correctness across aligned/unaligned buffers, edge cases, random data, and large workloads, ensuring consistency between scalar, AVX2, and AVX512 implementations.
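To make the AVX512 description concrete, here is a minimal sketch of a VPOPCNTDQ popcount over 64B chunks with a final _mm512_reduce_add_epi64, plus a scalar reference and a small consistency check in the spirit of the tests described above. This is not the PR's code: the function names (popcount_scalar, popcount_avx512, popcount_dispatch, popcount_selfcheck), the runtime dispatch through __builtin_cpu_supports, and the choice to reduce once after the loop are all assumptions made for illustration, and the sketch assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX512 1   /* hypothetical macro, not from the PR */
#endif

/* Scalar reference: 8 bytes at a time, then the remaining tail bytes. */
static long long popcount_scalar(const unsigned char *p, size_t n) {
    long long bits = 0;
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);                 /* safe for unaligned input */
        bits += __builtin_popcountll(w);
        p += 8; n -= 8;
    }
    while (n--) bits += __builtin_popcount(*p++);
    return bits;
}

#ifdef SKETCH_HAVE_AVX512
/* The target attribute lets this compile without -mavx512* flags;
 * it must only be called after the CPU check in popcount_dispatch. */
__attribute__((target("avx512f,avx512vpopcntdq")))
static long long popcount_avx512(const unsigned char *p, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    while (n >= 64) {
        __m512i v = _mm512_loadu_si512((const void *)p);      /* unaligned load */
        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(v));  /* per-lane counts */
        p += 64; n -= 64;
    }
    /* One horizontal sum at the end, then the sub-64B tail in scalar code. */
    return _mm512_reduce_add_epi64(acc) + popcount_scalar(p, n);
}
#endif

/* Assumed dispatch policy: use AVX512 only when the CPU reports support
 * and at least one full 64B chunk is available. */
long long popcount_dispatch(const unsigned char *p, size_t n) {
#ifdef SKETCH_HAVE_AVX512
    if (n >= 64 && __builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512vpopcntdq"))
        return popcount_avx512(p, n);
#endif
    return popcount_scalar(p, n);
}

/* Returns how many (offset, length) pairs disagree with the scalar
 * reference; 0 means the dispatched path and the scalar path agree. */
int popcount_selfcheck(void) {
    static unsigned char buf[512];
    uint32_t x = 12345u;
    for (size_t i = 0; i < sizeof(buf); i++) {
        x = x * 1103515245u + 12345u;     /* small LCG, test data only */
        buf[i] = (unsigned char)(x >> 16);
    }
    int mismatches = 0;
    for (size_t off = 0; off < 8; off++)
        for (size_t len = 0; len + off <= sizeof(buf); len++)
            if (popcount_dispatch(buf + off, len) !=
                popcount_scalar(buf + off, len))
                mismatches++;
    return mismatches;
}
```

On hardware without AVX512 VPOPCNTDQ the dispatcher falls back to the scalar loop, so the self-check still runs but exercises only the scalar path. The size threshold in the dispatcher is also where a cutoff for tiny inputs could be tuned, which is the concern raised in the review comment about 16 to 64 byte strings.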
Performance Results on Intel Xeon SPR (single shard)
/redis unstable (median obs. ± std.dev) | filipecosta90/redis optimize.bitcount.avx (median obs. ± std.dev)

Performance Results on AMD EPYC 9R14 (single shard)
/redis unstable (median obs. ± std.dev) | filipecosta90/redis optimize.bitcount.avx (median obs. ± std.dev)

Reproduce Benchmarks