Skip to content

Optimize BITCOUNT with AVX2 and AVX512 popcount implementations. #14309

Merged
sundb merged 16 commits intoredis:unstablefrom
filipecosta90:optimize.bitcount.avx
Oct 14, 2025
Merged

Optimize BITCOUNT with AVX2 and AVX512 popcount implementations. #14309
sundb merged 16 commits intoredis:unstablefrom
filipecosta90:optimize.bitcount.avx

Conversation

@fcostaoliveira
Copy link
Collaborator

@fcostaoliveira fcostaoliveira commented Aug 27, 2025

This PR introduces vectorized implementations of BITCOUNT for x86_64 targets with AVX2 and AVX512 support.

  • AVX2 path: processes 32B at a time, using unrolled POPCNT on 64-bit lanes with independent accumulators to reduce data dependencies.
  • AVX512 path: leverages VPOPCNTDQ on 64B chunks with _mm512_reduce_add_epi64 to efficiently aggregate results across 512-bit vectors.
  • Both paths include cache prefetching hints to better overlap memory fetches with computation. This was proved to matter in Add 2K software prefetch to improve BITCOUNT performance #14103.
  • Fallbacks to the scalar implementation if hardware support is unavailable.

The test suite has been expanded with unit tests that validate correctness across aligned/unaligned buffers, edge cases, random data, and large workloads, ensuring consistency between scalar, AVX2, and AVX512 implementations.


Performance Results on Intel Xeon SPR (single shard)

Test Case Baseline /redis unstable (median obs. ± std.dev) Comparison filipecosta90/redis optimize.bitcount.avx (median obs. ± std.dev) % change (lower-better) Note
memtier_benchmark-1key-100M-bits-bitmap-bitcount 6.239 ± 0.5% (7 datapoints) 5.759 ± 0.3% (3 datapoints) −7.7% Better
memtier_benchmark-1key-1Billion-bits-bitmap-bitcount 59.647 ± 0.7% (7 datapoints) 53.503 ± 0.3% (3 datapoints) −10.3% Better

Performance Results on AMD EPYC 9R14 (single shard)

Test Case Metric Baseline /redis unstable (median obs. ± std.dev) Comparison filipecosta90/redis optimize.bitcount.avx (median obs. ± std.dev) % change (lower-better)
memtier_benchmark-1key-100M-bits-bitmap-bitcount Ops/sec (↑ better) 37,120.79 52,093.55 +40.2%
Latency (p50 / p99) 5.279 ms / 9.535 ms 3.807 ms / 6.399 ms −27.9% / −33.0%

Reproduce Benchmarks

pip3 install redis-benchmarks-specification
redis-benchmarks-spec-client-runner \
  --test memtier_benchmark-1key-1Billion-bits-bitmap-bitcount.yml \
  --db_server_host <...> --db_server_port <...> --db_server_password <...> \
  --flushall_on_every_test_start

@fcostaoliveira fcostaoliveira requested a review from sundb August 27, 2025 00:24
@snyk-io
Copy link

snyk-io bot commented Aug 27, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@kaplanben
Copy link

Logo
Checkmarx One – Scan Summary & Detailscae2c0a3-4572-43a2-919f-2bfd50f53efe

Great job! No new security vulnerabilities introduced in this pull request

Copy link

@shahsb shahsb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks focus on very large bitmaps (100M and 1B bits), where this optimization will shine. Is there any data on performance for very small strings (e.g., 16, 32, or 64 bytes)? While unlikely to be slower, it would be useful to confirm that the overhead of the function dispatch and setup for the vectorized paths doesn't negatively impact performance on "tiny" workloads.

Copy link

@shahsb shahsb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really impressive work here! The 10% improvement on large bitmaps is substantial. I'm curious what was the most challenging part of getting this optimization right?

@fcostaoliveira fcostaoliveira requested a review from sundb October 11, 2025 22:48
@fcostaoliveira fcostaoliveira requested a review from sundb October 12, 2025 12:30
Co-authored-by: debing.sun <[email protected]>
Co-authored-by: debing.sun <[email protected]>
Copy link
Collaborator

@sundb sundb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sundb
Copy link
Collaborator

sundb commented Oct 13, 2025

Optimize BITCOUNT on Intel with AVX2 and AVX512 popcount implementations

Doesn't this PR have no improvement for AMD?

@fcostaoliveira
Copy link
Collaborator Author

Doesn't this PR have no improvement for AMD?

It should. we only have runners for intel/arm on CI mainly due to cost. But i'll add data for AMD on a manual run.

@fcostaoliveira
Copy link
Collaborator Author

Optimize BITCOUNT on Intel with AVX2 and AVX512 popcount implementations

Doesn't this PR have no improvement for AMD?

@sundb added results for AMD EPYC 9R14 (single shard) -- updated the main comment as well

Test Case Metric Baseline /redis unstable (median obs. ± std.dev) Comparison filipecosta90/redis optimize.bitcount.avx (median obs. ± std.dev) % change (lower-better)
memtier_benchmark-1key-100M-bits-bitmap-bitcount Ops/sec (↑ better) 37,120.79 52,093.55 +40.2%
Latency (p50 / p99) 5.279 ms / 9.535 ms 3.807 ms / 6.399 ms −27.9% / −33.0%

@sundb sundb changed the title Optimize BITCOUNT on Intel with AVX2 and AVX512 popcount implementations. Optimize BITCOUNT with AVX2 and AVX512 popcount implementations. Oct 14, 2025
@sundb sundb merged commit 119c83d into redis:unstable Oct 14, 2025
19 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in Redis 8.4 Oct 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants