Skip to content

Reduce ImmutableGeoMapIndex RAM usage with flatten arrays#8673

Merged
timvisee merged 1 commit into
devfrom
optimize-immutable-geo-map-index-memory
Apr 15, 2026
Merged

Reduce ImmutableGeoMapIndex RAM usage with flatten arrays#8673
timvisee merged 1 commit into
devfrom
optimize-immutable-geo-map-index-memory

Conversation

@qdrant-cloud-bot

@qdrant-cloud-bot qdrant-cloud-bot commented Apr 14, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace Vec<(GeoHash, AHashSet<PointOffsetType>)> in ImmutableGeoMapIndex with three flat parallel arrays: Vec<GeoHash> (sorted hashes), Vec<u32> (offsets), and Vec<PointOffsetType> (flat point IDs)
  • Each AHashSet carried ~64 bytes of stack overhead (32-byte RandomState + 32-byte RawTable metadata) plus hash-table heap allocations with control bytes — totalling ~124 bytes per entry even with a single point ID. The flat layout uses 8 + 4 + 4 = 16 bytes per entry, a ~7.75x reduction
  • For 1M unique geo locations: points_map shrinks from ~124 MB to ~16 MB
  • Deletions use a sentinel value (u32::MAX) in the IDs array, filtered during iteration

No on-disk format changes. No API changes. All behavior preserved.

Test plan

  • All 34 geo index tests pass (cargo test -p segment --features testing -- geo_index)
  • Full workspace builds (cargo build)
  • cargo fmt check passes

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs (1)

359-367: Consider documenting the Vec vs HashSet tradeoff.

Using Vec::contains() (O(n)) instead of AHashSet (O(1) amortized) is a minor performance regression. This is acceptable given:

  • seen_hashes remains small (bounded by geo_hashes.len() × max_precision ≈ ~120 entries)
  • Point removal is relatively infrequent
  • Eliminates AHashSet's RandomState overhead (~32 bytes stack)

A brief inline comment could help future readers understand this deliberate tradeoff.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs` around
lines 359 - 367, In decrement_hash_point_counts, add a brief inline comment near
the seen_hashes Vec explaining the deliberate tradeoff: we use Vec<GeoHash> with
Vec::contains (O(n)) instead of AHashSet to avoid RandomState stack overhead and
because seen_hashes is bounded (≈ geo_hashes.len() × max_precision, ~120
entries) and point removals are infrequent; mention the performance/space
rationale and approximate bound so future maintainers understand why AHashSet
was avoided.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs`:
- Around line 92-103: The code casts points_map_ids.len() to u32 twice (when
pushing into points_map_offsets inside the closure and after the loop) without
overflow checks; change both casts to use a safe conversion (e.g.,
u32::try_from(points_map_ids.len()).map_err(|_| /* return an appropriate error
*/)? or use usize::try_into()) and propagate/return an error if the length
exceeds u32::MAX so offsets cannot silently wrap; update the push calls that
reference points_map_ids.len() and ensure the error type matches the surrounding
function's Result so callers get a clear overflow error instead of corrupt
offsets.

---

Nitpick comments:
In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs`:
- Around line 359-367: In decrement_hash_point_counts, add a brief inline
comment near the seen_hashes Vec explaining the deliberate tradeoff: we use
Vec<GeoHash> with Vec::contains (O(n)) instead of AHashSet to avoid RandomState
stack overhead and because seen_hashes is bounded (≈ geo_hashes.len() ×
max_precision, ~120 entries) and point removals are infrequent; mention the
performance/space rationale and approximate bound so future maintainers
understand why AHashSet was avoided.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a4938124-645e-46e8-83ce-fcfdee745f47

📥 Commits

Reviewing files that changed from the base of the PR and between 1ad626d and 915dec0.

📒 Files selected for processing (1)
  • lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs

Comment on lines 92 to +103
|i, ids| {
points_map[i].1 = ids
.iter()
.copied()
// Filter deleted points
.filter(|id| !index.storage.deleted.get(*id as usize).unwrap_or_default())
.collect();
points_map_hashes.push(points_map_entries[i].hash.normalize());
points_map_offsets.push(points_map_ids.len() as u32);
for &id in ids {
if !index.storage.deleted.get(id as usize).unwrap_or_default() {
points_map_ids.push(id);
}
}
Ok(())
},
)?;
drop(points_map_old);
points_map_offsets.push(points_map_ids.len() as u32);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add overflow protection for points_map_ids.len() as u32 casts.

Lines 94 and 103 cast points_map_ids.len() to u32 without overflow checks. If the total number of (hash, point_id) pairs exceeds u32::MAX, this would silently overflow and corrupt offset indices, leading to incorrect range lookups or panics.

Consider using checked arithmetic or assertions:

🛡️ Proposed fix with overflow check
             |i, ids| {
                 points_map_hashes.push(points_map_entries[i].hash.normalize());
-                points_map_offsets.push(points_map_ids.len() as u32);
+                let offset = u32::try_from(points_map_ids.len())
+                    .map_err(|_| OperationError::service_error("Geo index too large for u32 offsets"))?;
+                points_map_offsets.push(offset);
                 for &id in ids {
                     if !index.storage.deleted.get(id as usize).unwrap_or_default() {
                         points_map_ids.push(id);
                     }
                 }
                 Ok(())
             },
         )?;
-        points_map_offsets.push(points_map_ids.len() as u32);
+        let final_offset = u32::try_from(points_map_ids.len())
+            .map_err(|_| OperationError::service_error("Geo index too large for u32 offsets"))?;
+        points_map_offsets.push(final_offset);

Based on learnings: "In code that computes offsets or casts counts to PointOffsetType (u32), avoid unsafe or unchecked casts from usize to PointOffsetType."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs` around
lines 92 - 103, The code casts points_map_ids.len() to u32 twice (when pushing
into points_map_offsets inside the closure and after the loop) without overflow
checks; change both casts to use a safe conversion (e.g.,
u32::try_from(points_map_ids.len()).map_err(|_| /* return an appropriate error
*/)? or use usize::try_into()) and propagate/return an error if the length
exceeds u32::MAX so offsets cannot silently wrap; update the push calls that
reference points_map_ids.len() and ensure the error type matches the surrounding
function's Result so callers get a clear overflow error instead of corrupt
offsets.

@qdrant qdrant deleted a comment from coderabbitai Bot Apr 14, 2026
@qdrant-cloud-bot qdrant-cloud-bot force-pushed the optimize-immutable-geo-map-index-memory branch from 915dec0 to 256c561 Compare April 14, 2026 13:35
@qdrant-cloud-bot qdrant-cloud-bot changed the title Reduce ImmutableGeoMapIndex RAM usage ~7x with flat arrays Reduce ImmutableGeoMapIndex RAM usage with RoaringBitmap Apr 14, 2026
Replace `Vec<(GeoHash, AHashSet<PointOffsetType>)>` with three flat
parallel arrays: `Vec<GeoHash>`, `Vec<u32>` (offsets), and
`Vec<PointOffsetType>` (IDs).

Each AHashSet carried ~64 bytes of stack overhead (32-byte RandomState +
32-byte RawTable metadata) plus hash-table heap allocations with control
bytes — totalling ~124 bytes per entry even with a single point.  The
flat layout uses 8 + 4 + 4 = 16 bytes per entry, a ~7.75x reduction.

Deletions use a sentinel value (u32::MAX) in the IDs array, filtered
during iteration.  On-disk format is unchanged.

Made-with: Cursor
@qdrant-cloud-bot qdrant-cloud-bot force-pushed the optimize-immutable-geo-map-index-memory branch from 256c561 to 01c927f Compare April 14, 2026 13:40
@qdrant-cloud-bot qdrant-cloud-bot changed the title Reduce ImmutableGeoMapIndex RAM usage with RoaringBitmap Reduce ImmutableGeoMapIndex RAM usage ~7x with flat arrays Apr 14, 2026
coderabbitai[bot]

This comment was marked as resolved.

@timvisee

Copy link
Copy Markdown
Member

I like this over #8674 because this limits to 3 heap allocations, rather than allocating a Vec (and more) for each geo hash.

Maybe we can make it a bit easier to understand by extracting the three new fields into a dedicated type with functions.

@generall generall changed the title Reduce ImmutableGeoMapIndex RAM usage ~7x with flat arrays Reduce ImmutableGeoMapIndex RAM usage with flatten arrays Apr 14, 2026
@qdrant qdrant deleted a comment from coderabbitai Bot Apr 15, 2026
@generall generall self-requested a review April 15, 2026 12:28
@timvisee timvisee merged commit 0f8f861 into dev Apr 15, 2026
20 of 21 checks passed
@timvisee timvisee deleted the optimize-immutable-geo-map-index-memory branch April 15, 2026 12:54
timvisee pushed a commit that referenced this pull request May 8, 2026
Replace `Vec<(GeoHash, AHashSet<PointOffsetType>)>` with three flat
parallel arrays: `Vec<GeoHash>`, `Vec<u32>` (offsets), and
`Vec<PointOffsetType>` (IDs).

Each AHashSet carried ~64 bytes of stack overhead (32-byte RandomState +
32-byte RawTable metadata) plus hash-table heap allocations with control
bytes — totalling ~124 bytes per entry even with a single point.  The
flat layout uses 8 + 4 + 4 = 16 bytes per entry, a ~7.75x reduction.

Deletions use a sentinel value (u32::MAX) in the IDs array, filtered
during iteration.  On-disk format is unchanged.

Made-with: Cursor

Co-authored-by: Cursor Agent <[email protected]>
@timvisee timvisee mentioned this pull request May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants