Port HyperLogLog implementation to Rust#8095
Conversation
|
Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset. In case there are security findings, they will be communicated to you as a comment inside the PR. Hope you’ll enjoy using Jit. Questions? Comments? Want to learn more? Get in touch with us. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8095 +/- ##
==========================================
- Coverage 83.07% 82.54% -0.54%
==========================================
Files 361 368 +7
Lines 54821 55352 +531
Branches 13809 14340 +531
==========================================
+ Hits 45541 45688 +147
- Misses 9133 9516 +383
- Partials 147 148 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
edf7664 to
d9e37f0
Compare
| pub struct HyperLogLog<const BITS: u8, const SIZE: usize, H: hash32::Hasher + Default = CFnvHasher> | ||
| { | ||
| cached_card: Cell<Option<usize>>, | ||
| registers: Box<[u8; SIZE]>, |
There was a problem hiding this comment.
For small sizes, why put it in a box? (We use BITS=6 for numeric index today)
There was a problem hiding this comment.
It's rather tricky to have it on the stack up to a certain size and on the heap afterwards (especially with const generics). We could limit ourselves to the stack, which is going to be fine up to 10 bits precision.
Wdyt?
There was a problem hiding this comment.
As discussed, if we are going to have 2 implementations for different use-cases, I think it makes sense to make this one in-place only and allow BITS <= 10.
Also, for the numeric use-case, I wonder if we can use even BITS=5 with the new hasher. Currently#define NR_MAXRANGE_CARD 2500
There was a problem hiding this comment.
We can play around with it when we are integrating the numeric index and see how it fares!
cd513de to
cac2830
Compare
|
I guess we can close #7356. |
Co-authored-by: GuyAv46 <[email protected]>
3f5f59f to
acdc725
Compare
|
Rebased on latest master to resolve conflicts. |
Describe the changes in the pull request
A Rust port of the HyperLogLog implementation from
hll.c.Benchmarks
I've added benches for various hashers—the original C
fnvhash, as well as a few others.Speed-wise,
fnvis the third fastest, but accuracy is an issue:The table reports the correct cardinality in the
ncolumn, the estimated cardinality for each hasher as well as the error rate.fnvdegrades really fast. All things considered (speed + accuracy), it might be a good idea to migrate over to eitherfxhashorwyhashlater on.Mark if applicable
Note
Adds a self-contained
hyperloglogcrate implementing HLL with const generics, stack-allocated registers, cached counts, merge/clear, and pluggable hashers viahash32::Hasher(defaultCFnvHashermatching the C seed;Murmur3Hasheralias provided).HyperLogLog<BITS, SIZE, H>, type aliases (HyperLogLog10, etc.),add,add_precomputed_hash,count,merge,clear,from_registers,try_set_registersfnv,murmur3,xxhash32,ahash,fxhash,wyhash; prints accuracy and measures add/count/merge throughputfnvcrate: implementshash32::HasherforFnv32to support 32-bit hashes and exposes C-compatible seeded wrapper (CFnvHasher)hyperloglogmember; adds deps (ahash,bytecount,hash32,rustc-hash,wyhash,xxhash-rust) and minor version bumps inCargo.lockWritten by Cursor Bugbot for commit 2ee0a24. This will update automatically on new commits. Configure here.