Reduce ImmutableGeoMapIndex RAM usage with flatten arrays#8673
Conversation
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs (1)
359-367: Consider documenting theVecvsHashSettradeoff.Using
Vec::contains()(O(n)) instead ofAHashSet(O(1) amortized) is a minor performance regression. This is acceptable given:
seen_hashesremains small (bounded bygeo_hashes.len() × max_precision ≈ ~120entries)- Point removal is relatively infrequent
- Eliminates
AHashSet'sRandomStateoverhead (~32 bytes stack)A brief inline comment could help future readers understand this deliberate tradeoff.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs` around lines 359 - 367, In decrement_hash_point_counts, add a brief inline comment near the seen_hashes Vec explaining the deliberate tradeoff: we use Vec<GeoHash> with Vec::contains (O(n)) instead of AHashSet to avoid RandomState stack overhead and because seen_hashes is bounded (≈ geo_hashes.len() × max_precision, ~120 entries) and point removals are infrequent; mention the performance/space rationale and approximate bound so future maintainers understand why AHashSet was avoided.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs`:
- Around line 92-103: The code casts points_map_ids.len() to u32 twice (when
pushing into points_map_offsets inside the closure and after the loop) without
overflow checks; change both casts to use a safe conversion (e.g.,
u32::try_from(points_map_ids.len()).map_err(|_| /* return an appropriate error
*/)? or use usize::try_into()) and propagate/return an error if the length
exceeds u32::MAX so offsets cannot silently wrap; update the push calls that
reference points_map_ids.len() and ensure the error type matches the surrounding
function's Result so callers get a clear overflow error instead of corrupt
offsets.
---
Nitpick comments:
In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs`:
- Around line 359-367: In decrement_hash_point_counts, add a brief inline
comment near the seen_hashes Vec explaining the deliberate tradeoff: we use
Vec<GeoHash> with Vec::contains (O(n)) instead of AHashSet to avoid RandomState
stack overhead and because seen_hashes is bounded (≈ geo_hashes.len() ×
max_precision, ~120 entries) and point removals are infrequent; mention the
performance/space rationale and approximate bound so future maintainers
understand why AHashSet was avoided.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a4938124-645e-46e8-83ce-fcfdee745f47
📒 Files selected for processing (1)
lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs
| |i, ids| { | ||
| points_map[i].1 = ids | ||
| .iter() | ||
| .copied() | ||
| // Filter deleted points | ||
| .filter(|id| !index.storage.deleted.get(*id as usize).unwrap_or_default()) | ||
| .collect(); | ||
| points_map_hashes.push(points_map_entries[i].hash.normalize()); | ||
| points_map_offsets.push(points_map_ids.len() as u32); | ||
| for &id in ids { | ||
| if !index.storage.deleted.get(id as usize).unwrap_or_default() { | ||
| points_map_ids.push(id); | ||
| } | ||
| } | ||
| Ok(()) | ||
| }, | ||
| )?; | ||
| drop(points_map_old); | ||
| points_map_offsets.push(points_map_ids.len() as u32); |
There was a problem hiding this comment.
Add overflow protection for points_map_ids.len() as u32 casts.
Lines 94 and 103 cast points_map_ids.len() to u32 without overflow checks. If the total number of (hash, point_id) pairs exceeds u32::MAX, this would silently overflow and corrupt offset indices, leading to incorrect range lookups or panics.
Consider using checked arithmetic or assertions:
🛡️ Proposed fix with overflow check
|i, ids| {
points_map_hashes.push(points_map_entries[i].hash.normalize());
- points_map_offsets.push(points_map_ids.len() as u32);
+ let offset = u32::try_from(points_map_ids.len())
+ .map_err(|_| OperationError::service_error("Geo index too large for u32 offsets"))?;
+ points_map_offsets.push(offset);
for &id in ids {
if !index.storage.deleted.get(id as usize).unwrap_or_default() {
points_map_ids.push(id);
}
}
Ok(())
},
)?;
- points_map_offsets.push(points_map_ids.len() as u32);
+ let final_offset = u32::try_from(points_map_ids.len())
+ .map_err(|_| OperationError::service_error("Geo index too large for u32 offsets"))?;
+ points_map_offsets.push(final_offset);Based on learnings: "In code that computes offsets or casts counts to PointOffsetType (u32), avoid unsafe or unchecked casts from usize to PointOffsetType."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs` around
lines 92 - 103, The code casts points_map_ids.len() to u32 twice (when pushing
into points_map_offsets inside the closure and after the loop) without overflow
checks; change both casts to use a safe conversion (e.g.,
u32::try_from(points_map_ids.len()).map_err(|_| /* return an appropriate error
*/)? or use usize::try_into()) and propagate/return an error if the length
exceeds u32::MAX so offsets cannot silently wrap; update the push calls that
reference points_map_ids.len() and ensure the error type matches the surrounding
function's Result so callers get a clear overflow error instead of corrupt
offsets.
915dec0 to
256c561
Compare
Replace `Vec<(GeoHash, AHashSet<PointOffsetType>)>` with three flat parallel arrays: `Vec<GeoHash>`, `Vec<u32>` (offsets), and `Vec<PointOffsetType>` (IDs). Each AHashSet carried ~64 bytes of stack overhead (32-byte RandomState + 32-byte RawTable metadata) plus hash-table heap allocations with control bytes — totalling ~124 bytes per entry even with a single point. The flat layout uses 8 + 4 + 4 = 16 bytes per entry, a ~7.75x reduction. Deletions use a sentinel value (u32::MAX) in the IDs array, filtered during iteration. On-disk format is unchanged. Made-with: Cursor
256c561 to
01c927f
Compare
|
I like this over #8674 because this limits to 3 heap allocations, rather than allocating a Maybe we can make it a bit easier to understand by extracting the three new fields into a dedicated type with functions. |
Replace `Vec<(GeoHash, AHashSet<PointOffsetType>)>` with three flat parallel arrays: `Vec<GeoHash>`, `Vec<u32>` (offsets), and `Vec<PointOffsetType>` (IDs). Each AHashSet carried ~64 bytes of stack overhead (32-byte RandomState + 32-byte RawTable metadata) plus hash-table heap allocations with control bytes — totalling ~124 bytes per entry even with a single point. The flat layout uses 8 + 4 + 4 = 16 bytes per entry, a ~7.75x reduction. Deletions use a sentinel value (u32::MAX) in the IDs array, filtered during iteration. On-disk format is unchanged. Made-with: Cursor Co-authored-by: Cursor Agent <[email protected]>
Summary
Vec<(GeoHash, AHashSet<PointOffsetType>)>inImmutableGeoMapIndexwith three flat parallel arrays:Vec<GeoHash>(sorted hashes),Vec<u32>(offsets), andVec<PointOffsetType>(flat point IDs)AHashSetcarried ~64 bytes of stack overhead (32-byteRandomState+ 32-byteRawTablemetadata) plus hash-table heap allocations with control bytes — totalling ~124 bytes per entry even with a single point ID. The flat layout uses 8 + 4 + 4 = 16 bytes per entry, a ~7.75x reductionpoints_mapshrinks from ~124 MB to ~16 MBu32::MAX) in the IDs array, filtered during iterationNo on-disk format changes. No API changes. All behavior preserved.
Test plan
cargo test -p segment --features testing -- geo_index)cargo build)cargo fmtcheck passes