Skip to content

[MOD-13843] A Rust implementation of the numeric index#8276

Merged
LukeMathWalker merged 3 commits intomasterfrom
numeric-range-tree-1-rust
Feb 9, 2026
Merged

[MOD-13843] A Rust implementation of the numeric index#8276
LukeMathWalker merged 3 commits intomasterfrom
numeric-range-tree-1-rust

Conversation

@LukeMathWalker
Copy link
Collaborator

@LukeMathWalker LukeMathWalker commented Feb 4, 2026

Describe the changes in the pull request

A pure Rust implementation of the numeric index (and the associated range tree) from numeric_index.c/numeric_index.h.

Key differences:

  • All nodes are stored in a contiguous slab of memory (arena-style). Parent nodes refer to their children by index rather than by pointer.
  • The HLL stored inside the numeric range tree using wyhash rather than fnv since Port HyperLogLog implementation to Rust #8095 proved wyhash to be faster and more accurate.

Which additional issues this PR fixes

  1. MOD-...
  2. #...

Main objects this PR modified

  1. ...

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Release Notes

  • This PR requires release notes
  • This PR does not require release notes

If a release note is required (bug fix / new feature / enhancement), describe the user impact of this PR in the title.


Note

High Risk
Adds a brand-new numeric indexing data structure (splitting/balancing, memory accounting, iterator invalidation) and expands workspace dependencies/lints; correctness and performance regressions are plausible despite tests.

Overview
Introduces a new Rust crate numeric_range_tree and wires it into the workspace, providing an arena-allocated numeric range tree with adaptive leaf splitting, AVL-like single-rotation balancing, optional internal-range retention, revision-based iterator invalidation, and memory/statistics tracking.

Updates hyperloglog to include and expose a shared WyHasher implementation (used by the range tree’s HLL for cardinality estimation) and adjusts the benchmarks accordingly. Workspace config is extended with new deps (slab, rstest) and a rustdoc lint allowance, plus corresponding Cargo.lock updates.

Written by Cursor Bugbot for commit 601f53a. This will update automatically on new commits. Configure here.

@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch from b3f8540 to 9ccd2fc Compare February 4, 2026 17:15
@codecov
Copy link

codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 83.48271% with 129 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.05%. Comparing base (d3ac7b4) to head (601f53a).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
src/redisearch_rs/numeric_range_tree/src/index.rs 40.00% 54 Missing ⚠️
src/redisearch_rs/numeric_range_tree/src/arena.rs 53.06% 23 Missing ⚠️
src/redisearch_rs/numeric_range_tree/src/node.rs 90.06% 15 Missing ⚠️
...edisearch_rs/numeric_range_tree/src/tree/insert.rs 94.14% 9 Missing and 4 partials ⚠️
...earch_rs/numeric_range_tree/src/tree/invariants.rs 87.62% 5 Missing and 7 partials ⚠️
src/redisearch_rs/hyperloglog/src/wyhash.rs 75.00% 3 Missing ⚠️
src/redisearch_rs/numeric_range_tree/src/range.rs 95.08% 3 Missing ⚠️
...c/redisearch_rs/numeric_range_tree/src/tree/mod.rs 95.83% 3 Missing ⚠️
.../redisearch_rs/numeric_range_tree/src/unique_id.rs 50.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8276      +/-   ##
==========================================
+ Coverage   79.34%   83.05%   +3.70%     
==========================================
  Files         389      399      +10     
  Lines       57556    58334     +778     
  Branches    15708    16486     +778     
==========================================
+ Hits        45669    48449    +2780     
+ Misses      11729     9718    -2011     
- Partials      158      167       +9     
Flag Coverage Δ
flow 84.12% <ø> (+5.54%) ⬆️
unit 50.94% <83.48%> (+0.46%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch 6 times, most recently from 24119a9 to 9c26b70 Compare February 4, 2026 20:41
@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch from 9c26b70 to bb910df Compare February 5, 2026 12:06
@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch 7 times, most recently from 4486b46 to 8f3faa8 Compare February 5, 2026 16:49
@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch from 8f3faa8 to bdf8961 Compare February 6, 2026 08:28
@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch 2 times, most recently from 9a7b195 to 4d02a30 Compare February 6, 2026 11:26
@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch from 4d02a30 to 853f01f Compare February 6, 2026 11:42
@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch from 853f01f to 7e72e18 Compare February 6, 2026 16:07
@LukeMathWalker
Copy link
Collaborator Author

Extracted find-related code to #8302.

@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch from c4fe674 to 63087b3 Compare February 8, 2026 10:02
Copy link
Collaborator

@meiravgri meiravgri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super clean!
To avoid losing track, I have reviewed most core imp:
src: arena.rs, lib.rs, mode.rs range.rs
src/tree: insert.rs, mod.rs

will continue reviewing other files after comments from this batch are reviewed

/// Debug-asserts that the resulting key fits in `u32`.
pub fn insert(&mut self, node: NumericRangeNode) -> NodeIndex {
let key = self.nodes.insert(node);
debug_assert!(key <= u32::MAX as usize);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't actually guarantee that Slab::insert will return a usize less than u32::MAX, right? Does the memory savings of u32 over u64 justify the silent truncation risk?

Also, what's the purpose of the debug_assert!? I don't see us testing such a scenario.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug asserts can be enabled in "release" builds too, they aren't necessarily tied to testing.
E.g. it's possible to produce a release build with debug assertions enabled to re-run particular scenarios and see what triggers.
In this specific case, it's better to make it a production assert, since the overhead is going to be negligible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On one hand i dont think we need to crash the server in this case, because the user didnt do anything wrong
on the other hand, if handling it gracefully will add complexity to the code it doesn't worth it...
your call

/// Returns `0.0` for leaf nodes.
pub const fn split_value(&self) -> f64 {
match self {
Self::Leaf(_) => 0.0,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 0.0 the right value to return for leaf nodes when 0.0 is a valid split value for internal nodes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option is the right abstraction here, yes.

Comment on lines 288 to 294
while reader.next_record(&mut result).unwrap_or(false) {
// SAFETY: We know the result contains numeric data
let entry_value = unsafe { result.as_numeric_unchecked() };
entries.push((result.doc_id, entry_value));
}

(split, entries)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we copy the entire inverted index of the range here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to collect all its entries into a buffer.

Comment on lines 329 to 333
// Take the existing range from the leaf and convert to an internal node.
let old_range = nodes[node_idx].take_range();
let new_node =
NumericRangeNode::internal_indexed(split, left_idx, right_idx, old_range, nodes);
nodes[node_idx] = new_node;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont we need to pass the range only if internal ranges are enabled?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this stage, we keep it unconditionally, then it's trimmed in node_add based on the max_depth_range that was specified. I could also push the check further down, bringing max_depth_range into this context.

Copy link
Collaborator

@GuyAv46 GuyAv46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! First review without the tests.
Deletion (and repair) logic will come in a later PR, I assume? Interested in the slab compaction logic

match self {
Self::Leaf(leaf) => {
// Replace with a default range; callers are expected to replace the whole node.
Some(std::mem::replace(&mut leaf.range, NumericRange::new(false)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the actual default, right?
Consider using std::mem::take

/// Returns `0.0` for leaf nodes.
pub const fn split_value(&self) -> f64 {
match self {
Self::Leaf(_) => 0.0,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it matters much, but 0.0 is an arbitrary valid value.
Consider returning NaN, None, panic, or even extract the median instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option is the right abstraction here, yes.

Comment on lines 87 to 88
/// cache locality, eliminates per-node heap allocation overhead, and makes
/// rotations cheaper (index swaps instead of alloc/dealloc).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes rotations cheaper (index swaps instead of alloc/dealloc)

The C implementation doesn't require [de]allocations, so this is not a real improvement. The rest is true :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded it around pruning, which is when we see the difference (i.e. one realloc vs multiple deallocs).

pub const fn mem_usage(&self) -> usize {
let base_size = std::mem::size_of::<Self>();
// Our tree is a full binary tree, so #nodes = 2 * #leaves - 1
let nodes_count = 2 * self.stats.num_leaves.saturating_sub(1) + 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need 2 * self.stats.num_leaves.saturating_sub(1) + 1? num_leaves should always be at least 1, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but we don't need this either, we know the exact node count thanks to the arena.

/// hashed into HyperLogLog registers. We hash the raw bytes (bit
/// representation) rather than the numeric value — see
/// [`NumericRange::update_cardinality`] for rationale.
fn update_cardinality(hll: &mut Hll, value: f64) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have this helper, and not simply define it at the NumericRange implementation of update_cardinality?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later on, when have the GC code (#8293) we'll need to compute hashes of values outside the context of a NumericRange, and since f64 doesn't implement Hash, I used this helper as the source of truth for how we are hashing floats.
A different approach could be to make HyperLogLog strongly typed—generic over the type of values it's counting, and then create a thin wrapper around f64 which encodes the desired hashing behaviour.

self.last_doc_id = doc_id;

let mut rv = AddResult::default();
Self::node_add(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making node_add return an AddResult, unless it has performance panalties

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't work unless we make node_add non-recursive. It needs to thread rv via function parameters so that each "layer" of the chain can update it in place.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In practice, the child sets it inially and then it gets updated on the way up. Could be an idea for the future

Comment on lines 284 to 285
// Collect all entries from the range
let mut entries: Vec<(ffi::t_docId, f64)> = Vec::new();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we collect the results, and not use the reader in the "Redistribute entries to children" for loop?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid an annoying borrow-checker issue, but it can be worked around.

@LukeMathWalker
Copy link
Collaborator Author

Awesome! First review without the tests. Deletion (and repair) logic will come in a later PR, I assume? Interested in the slab compaction logic

Yes, they are in #8293.

I think I have addressed all feedback in the latest commit. Please have a second look! @meiravgri @GuyAv46

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch from e09782e to 06f9330 Compare February 9, 2026 14:59
@LukeMathWalker LukeMathWalker force-pushed the numeric-range-tree-1-rust branch from 06f9330 to 601f53a Compare February 9, 2026 15:03
@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 9, 2026

@LukeMathWalker LukeMathWalker added this pull request to the merge queue Feb 9, 2026
Merged via the queue into master with commit c20a807 Feb 9, 2026
50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants