[MOD-13843] A Rust implementation of the numeric index#8276
[MOD-13843] A Rust implementation of the numeric index#8276LukeMathWalker merged 3 commits intomasterfrom
Conversation
b3f8540 to
9ccd2fc
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #8276 +/- ##
==========================================
+ Coverage 79.34% 83.05% +3.70%
==========================================
Files 389 399 +10
Lines 57556 58334 +778
Branches 15708 16486 +778
==========================================
+ Hits 45669 48449 +2780
+ Misses 11729 9718 -2011
- Partials 158 167 +9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
24119a9 to
9c26b70
Compare
9c26b70 to
bb910df
Compare
4486b46 to
8f3faa8
Compare
8f3faa8 to
bdf8961
Compare
9a7b195 to
4d02a30
Compare
4d02a30 to
853f01f
Compare
853f01f to
7e72e18
Compare
dbac863 to
c4fe674
Compare
|
Extracted find-related code to #8302. |
c4fe674 to
63087b3
Compare
meiravgri
left a comment
There was a problem hiding this comment.
Super clean!
To avoid losing track, I have reviewed most core imp:
src: arena.rs, lib.rs, mode.rs range.rs
src/tree: insert.rs, mod.rs
will continue reviewing other files after comments from this batch are reviewed
| /// Debug-asserts that the resulting key fits in `u32`. | ||
| pub fn insert(&mut self, node: NumericRangeNode) -> NodeIndex { | ||
| let key = self.nodes.insert(node); | ||
| debug_assert!(key <= u32::MAX as usize); |
There was a problem hiding this comment.
We can't actually guarantee that Slab::insert will return a usize less than u32::MAX, right? Does the memory savings of u32 over u64 justify the silent truncation risk?
Also, what's the purpose of the debug_assert!? I don't see us testing such a scenario.
There was a problem hiding this comment.
Debug asserts can be enabled in "release" builds too, they aren't necessarily tied to testing.
E.g. it's possible to produce a release build with debug assertions enabled to re-run particular scenarios and see what triggers.
In this specific case, it's better to make it a production assert, since the overhead is going to be negligible.
There was a problem hiding this comment.
On one hand i dont think we need to crash the server in this case, because the user didnt do anything wrong
on the other hand, if handling it gracefully will add complexity to the code it doesn't worth it...
your call
| /// Returns `0.0` for leaf nodes. | ||
| pub const fn split_value(&self) -> f64 { | ||
| match self { | ||
| Self::Leaf(_) => 0.0, |
There was a problem hiding this comment.
is 0.0 the right value to return for leaf nodes when 0.0 is a valid split value for internal nodes?
There was a problem hiding this comment.
Option is the right abstraction here, yes.
| while reader.next_record(&mut result).unwrap_or(false) { | ||
| // SAFETY: We know the result contains numeric data | ||
| let entry_value = unsafe { result.as_numeric_unchecked() }; | ||
| entries.push((result.doc_id, entry_value)); | ||
| } | ||
|
|
||
| (split, entries) |
There was a problem hiding this comment.
do we copy the entire inverted index of the range here?
There was a problem hiding this comment.
Yes, we need to collect all its entries into a buffer.
| // Take the existing range from the leaf and convert to an internal node. | ||
| let old_range = nodes[node_idx].take_range(); | ||
| let new_node = | ||
| NumericRangeNode::internal_indexed(split, left_idx, right_idx, old_range, nodes); | ||
| nodes[node_idx] = new_node; |
There was a problem hiding this comment.
dont we need to pass the range only if internal ranges are enabled?
There was a problem hiding this comment.
At this stage, we keep it unconditionally, then it's trimmed in node_add based on the max_depth_range that was specified. I could also push the check further down, bringing max_depth_range into this context.
GuyAv46
left a comment
There was a problem hiding this comment.
Awesome! First review without the tests.
Deletion (and repair) logic will come in a later PR, I assume? Interested in the slab compaction logic
| match self { | ||
| Self::Leaf(leaf) => { | ||
| // Replace with a default range; callers are expected to replace the whole node. | ||
| Some(std::mem::replace(&mut leaf.range, NumericRange::new(false))) |
There was a problem hiding this comment.
This is the actual default, right?
Consider using std::mem::take
| /// Returns `0.0` for leaf nodes. | ||
| pub const fn split_value(&self) -> f64 { | ||
| match self { | ||
| Self::Leaf(_) => 0.0, |
There was a problem hiding this comment.
Not sure if it matters much, but 0.0 is an arbitrary valid value.
Consider returning NaN, None, panic, or even extract the median instead
There was a problem hiding this comment.
Option is the right abstraction here, yes.
| /// cache locality, eliminates per-node heap allocation overhead, and makes | ||
| /// rotations cheaper (index swaps instead of alloc/dealloc). |
There was a problem hiding this comment.
makes rotations cheaper (index swaps instead of alloc/dealloc)
The C implementation doesn't require [de]allocations, so this is not a real improvement. The rest is true :)
There was a problem hiding this comment.
I reworded it around pruning, which is when we see the difference (i.e. one realloc vs multiple deallocs).
| pub const fn mem_usage(&self) -> usize { | ||
| let base_size = std::mem::size_of::<Self>(); | ||
| // Our tree is a full binary tree, so #nodes = 2 * #leaves - 1 | ||
| let nodes_count = 2 * self.stats.num_leaves.saturating_sub(1) + 1; |
There was a problem hiding this comment.
Why do we need 2 * self.stats.num_leaves.saturating_sub(1) + 1? num_leaves should always be at least 1, no?
There was a problem hiding this comment.
True, but we don't need this either, we know the exact node count thanks to the arena.
| /// hashed into HyperLogLog registers. We hash the raw bytes (bit | ||
| /// representation) rather than the numeric value — see | ||
| /// [`NumericRange::update_cardinality`] for rationale. | ||
| fn update_cardinality(hll: &mut Hll, value: f64) { |
There was a problem hiding this comment.
Why have this helper, and not simply define it at the NumericRange implementation of update_cardinality?
There was a problem hiding this comment.
Later on, when have the GC code (#8293) we'll need to compute hashes of values outside the context of a NumericRange, and since f64 doesn't implement Hash, I used this helper as the source of truth for how we are hashing floats.
A different approach could be to make HyperLogLog strongly typed—generic over the type of values it's counting, and then create a thin wrapper around f64 which encodes the desired hashing behaviour.
| self.last_doc_id = doc_id; | ||
|
|
||
| let mut rv = AddResult::default(); | ||
| Self::node_add( |
There was a problem hiding this comment.
Consider making node_add return an AddResult, unless it has performance panalties
There was a problem hiding this comment.
It won't work unless we make node_add non-recursive. It needs to thread rv via function parameters so that each "layer" of the chain can update it in place.
There was a problem hiding this comment.
I see. In practice, the child sets it inially and then it gets updated on the way up. Could be an idea for the future
| // Collect all entries from the range | ||
| let mut entries: Vec<(ffi::t_docId, f64)> = Vec::new(); |
There was a problem hiding this comment.
Why do we collect the results, and not use the reader in the "Redistribute entries to children" for loop?
There was a problem hiding this comment.
To avoid an annoying borrow-checker issue, but it can be worked around.
Yes, they are in #8293. I think I have addressed all feedback in the latest commit. Please have a second look! @meiravgri @GuyAv46 |
e09782e to
06f9330
Compare
06f9330 to
601f53a
Compare
|



Describe the changes in the pull request
A pure Rust implementation of the numeric index (and the associated range tree) from
numeric_index.c/numeric_index.h.Key differences:
wyhashrather thanfnvsince Port HyperLogLog implementation to Rust #8095 provedwyhashto be faster and more accurate.Which additional issues this PR fixes
Main objects this PR modified
Mark if applicable
Release Notes
If a release note is required (bug fix / new feature / enhancement), describe the user impact of this PR in the title.
Note
High Risk
Adds a brand-new numeric indexing data structure (splitting/balancing, memory accounting, iterator invalidation) and expands workspace dependencies/lints; correctness and performance regressions are plausible despite tests.
Overview
Introduces a new Rust crate
numeric_range_treeand wires it into the workspace, providing an arena-allocated numeric range tree with adaptive leaf splitting, AVL-like single-rotation balancing, optional internal-range retention, revision-based iterator invalidation, and memory/statistics tracking.Updates
hyperloglogto include and expose a sharedWyHasherimplementation (used by the range tree’s HLL for cardinality estimation) and adjusts the benchmarks accordingly. Workspace config is extended with new deps (slab,rstest) and a rustdoc lint allowance, plus correspondingCargo.lockupdates.Written by Cursor Bugbot for commit 601f53a. This will update automatically on new commits. Configure here.