Reduce ImmutableGeoMapIndex RAM usage with flatten arrays by qdrant-cloud-bot · Pull Request #8673 · qdrant/qdrant

qdrant-cloud-bot · 2026-04-14T13:21:08Z

Summary

Replace Vec<(GeoHash, AHashSet<PointOffsetType>)> in ImmutableGeoMapIndex with three flat parallel arrays: Vec<GeoHash> (sorted hashes), Vec<u32> (offsets), and Vec<PointOffsetType> (flat point IDs)
Each AHashSet carried ~64 bytes of stack overhead (32-byte RandomState + 32-byte RawTable metadata) plus hash-table heap allocations with control bytes — totalling ~124 bytes per entry even with a single point ID. The flat layout uses 8 + 4 + 4 = 16 bytes per entry, a ~7.75x reduction
For 1M unique geo locations: points_map shrinks from ~124 MB to ~16 MB
Deletions use a sentinel value (u32::MAX) in the IDs array, filtered during iteration

No on-disk format changes. No API changes. All behavior preserved.

Test plan

All 34 geo index tests pass (cargo test -p segment --features testing -- geo_index)
Full workspace builds (cargo build)
cargo fmt check passes

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs (1)
359-367: Consider documenting the Vec vs HashSet tradeoff.

Using Vec::contains() (O(n)) instead of AHashSet (O(1) amortized) is a minor performance regression. This is acceptable given:

seen_hashes remains small (bounded by geo_hashes.len() × max_precision ≈ ~120 entries)

Point removal is relatively infrequent

Eliminates AHashSet's RandomState overhead (~32 bytes stack)

A brief inline comment could help future readers understand this deliberate tradeoff.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs` around
lines 359 - 367, In decrement_hash_point_counts, add a brief inline comment near
the seen_hashes Vec explaining the deliberate tradeoff: we use Vec<GeoHash> with
Vec::contains (O(n)) instead of AHashSet to avoid RandomState stack overhead and
because seen_hashes is bounded (≈ geo_hashes.len() × max_precision, ~120
entries) and point removals are infrequent; mention the performance/space
rationale and approximate bound so future maintainers understand why AHashSet
was avoided.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs`:
- Around line 92-103: The code casts points_map_ids.len() to u32 twice (when
pushing into points_map_offsets inside the closure and after the loop) without
overflow checks; change both casts to use a safe conversion (e.g.,
u32::try_from(points_map_ids.len()).map_err(|_| /* return an appropriate error
*/)? or use usize::try_into()) and propagate/return an error if the length
exceeds u32::MAX so offsets cannot silently wrap; update the push calls that
reference points_map_ids.len() and ensure the error type matches the surrounding
function's Result so callers get a clear overflow error instead of corrupt
offsets.

---

Nitpick comments:
In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs`:
- Around line 359-367: In decrement_hash_point_counts, add a brief inline
comment near the seen_hashes Vec explaining the deliberate tradeoff: we use
Vec<GeoHash> with Vec::contains (O(n)) instead of AHashSet to avoid RandomState
stack overhead and because seen_hashes is bounded (≈ geo_hashes.len() ×
max_precision, ~120 entries) and point removals are infrequent; mention the
performance/space rationale and approximate bound so future maintainers
understand why AHashSet was avoided.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a4938124-645e-46e8-83ce-fcfdee745f47

📥 Commits

Reviewing files that changed from the base of the PR and between 1ad626d and 915dec0.

📒 Files selected for processing (1)

lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs

coderabbitai · 2026-04-14T13:31:34Z

            |i, ids| {
-                points_map[i].1 = ids
-                    .iter()
-                    .copied()
-                    // Filter deleted points
-                    .filter(|id| !index.storage.deleted.get(*id as usize).unwrap_or_default())
-                    .collect();
+                points_map_hashes.push(points_map_entries[i].hash.normalize());
+                points_map_offsets.push(points_map_ids.len() as u32);
+                for &id in ids {
+                    if !index.storage.deleted.get(id as usize).unwrap_or_default() {
+                        points_map_ids.push(id);
+                    }
+                }
                Ok(())
            },
        )?;
-        drop(points_map_old);
+        points_map_offsets.push(points_map_ids.len() as u32);


⚠️ Potential issue | 🟠 Major

Add overflow protection for points_map_ids.len() as u32 casts.

Lines 94 and 103 cast points_map_ids.len() to u32 without overflow checks. If the total number of (hash, point_id) pairs exceeds u32::MAX, this would silently overflow and corrupt offset indices, leading to incorrect range lookups or panics.

Consider using checked arithmetic or assertions:

🛡️ Proposed fix with overflow check

|i, ids| { points_map_hashes.push(points_map_entries[i].hash.normalize()); - points_map_offsets.push(points_map_ids.len() as u32); + let offset = u32::try_from(points_map_ids.len()) + .map_err(|_| OperationError::service_error("Geo index too large for u32 offsets"))?; + points_map_offsets.push(offset); for &id in ids { if !index.storage.deleted.get(id as usize).unwrap_or_default() { points_map_ids.push(id); } } Ok(()) }, )?; - points_map_offsets.push(points_map_ids.len() as u32); + let final_offset = u32::try_from(points_map_ids.len()) + .map_err(|_| OperationError::service_error("Geo index too large for u32 offsets"))?; + points_map_offsets.push(final_offset);

Based on learnings: "In code that computes offsets or casts counts to PointOffsetType (u32), avoid unsafe or unchecked casts from usize to PointOffsetType."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs` around lines 92 - 103, The code casts points_map_ids.len() to u32 twice (when pushing into points_map_offsets inside the closure and after the loop) without overflow checks; change both casts to use a safe conversion (e.g., u32::try_from(points_map_ids.len()).map_err(|_| /* return an appropriate error */)? or use usize::try_into()) and propagate/return an error if the length exceeds u32::MAX so offsets cannot silently wrap; update the push calls that reference points_map_ids.len() and ensure the error type matches the surrounding function's Result so callers get a clear overflow error instead of corrupt offsets.

Replace `Vec<(GeoHash, AHashSet<PointOffsetType>)>` with three flat parallel arrays: `Vec<GeoHash>`, `Vec<u32>` (offsets), and `Vec<PointOffsetType>` (IDs). Each AHashSet carried ~64 bytes of stack overhead (32-byte RandomState + 32-byte RawTable metadata) plus hash-table heap allocations with control bytes — totalling ~124 bytes per entry even with a single point. The flat layout uses 8 + 4 + 4 = 16 bytes per entry, a ~7.75x reduction. Deletions use a sentinel value (u32::MAX) in the IDs array, filtered during iteration. On-disk format is unchanged. Made-with: Cursor

timvisee · 2026-04-14T13:46:14Z

I like this over #8674 because this limits to 3 heap allocations, rather than allocating a Vec (and more) for each geo hash.

Maybe we can make it a bit easier to understand by extracting the three new fields into a dedicated type with functions.

Replace `Vec<(GeoHash, AHashSet<PointOffsetType>)>` with three flat parallel arrays: `Vec<GeoHash>`, `Vec<u32>` (offsets), and `Vec<PointOffsetType>` (IDs). Each AHashSet carried ~64 bytes of stack overhead (32-byte RandomState + 32-byte RawTable metadata) plus hash-table heap allocations with control bytes — totalling ~124 bytes per entry even with a single point. The flat layout uses 8 + 4 + 4 = 16 bytes per entry, a ~7.75x reduction. Deletions use a sentinel value (u32::MAX) in the IDs array, filtered during iteration. On-disk format is unchanged. Made-with: Cursor Co-authored-by: Cursor Agent <[email protected]>

coderabbitai Bot reviewed Apr 14, 2026

View reviewed changes

qdrant deleted a comment from coderabbitai Bot Apr 14, 2026

qdrant-cloud-bot force-pushed the optimize-immutable-geo-map-index-memory branch from 915dec0 to 256c561 Compare April 14, 2026 13:35

qdrant-cloud-bot changed the title ~~Reduce ImmutableGeoMapIndex RAM usage ~7x with flat arrays~~ Reduce ImmutableGeoMapIndex RAM usage with RoaringBitmap Apr 14, 2026

qdrant-cloud-bot force-pushed the optimize-immutable-geo-map-index-memory branch from 256c561 to 01c927f Compare April 14, 2026 13:40

qdrant-cloud-bot changed the title ~~Reduce ImmutableGeoMapIndex RAM usage with RoaringBitmap~~ Reduce ImmutableGeoMapIndex RAM usage ~7x with flat arrays Apr 14, 2026

qdrant-cloud-bot mentioned this pull request Apr 14, 2026

Reduce ImmutableGeoMapIndex RAM usage with RoaringBitmap #8674

Closed

3 tasks

This comment was marked as resolved.

Sign in to view

generall changed the title ~~Reduce ImmutableGeoMapIndex RAM usage ~7x with flat arrays~~ Reduce ImmutableGeoMapIndex RAM usage with flatten arrays Apr 14, 2026

qdrant deleted a comment from coderabbitai Bot Apr 15, 2026

generall self-requested a review April 15, 2026 12:28

generall approved these changes Apr 15, 2026

View reviewed changes

timvisee approved these changes Apr 15, 2026

View reviewed changes

timvisee merged commit 0f8f861 into dev Apr 15, 2026
20 of 21 checks passed

timvisee deleted the optimize-immutable-geo-map-index-memory branch April 15, 2026 12:54

timvisee mentioned this pull request May 8, 2026

Bump version to 1.18.0 #8959

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce ImmutableGeoMapIndex RAM usage with flatten arrays#8673

Reduce ImmutableGeoMapIndex RAM usage with flatten arrays#8673
timvisee merged 1 commit into
devfrom
optimize-immutable-geo-map-index-memory

qdrant-cloud-bot commented Apr 14, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

timvisee commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

qdrant-cloud-bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

timvisee commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

qdrant-cloud-bot commented Apr 14, 2026 •

edited

Loading