feat: avoid payload-index rebuild#9138
Conversation
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/map_index/tests.rs (1)
477-501: ⚡ Quick winAdd behavior assertions to the round-trip swap test.
int_map_immutable_to_mmap_round_trip_preserves_filescurrently proves byte-for-byte file stability, but not query/value correctness after each swap. Please also assert representativeget_iterator/values_countoutputs before and after both transitions to catch in-memory rebuild regressions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/field_index/map_index/tests.rs` around lines 477 - 501, The test int_map_immutable_to_mmap_round_trip_preserves_files currently only checks file snapshots and swap state; add assertions to validate query correctness by calling MapIndex methods like get_iterator and values_count on the loaded index (from data saved via save_map_index/load_map_index) before the first swap, after swapping to Mmap, after the no-op swap, and after swapping back to Immutable: pick a few representative keys from the test data (e.g., keys corresponding to values 1, 4, and 6) and assert that values_count(key) matches expected counts and that iterating via get_iterator yields the expected Value items for those keys each time so that in-memory rebuilds preserve semantic results across swap_on_disk(true/false) transitions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/segment/src/index/struct_payload_index/mod.rs`:
- Around line 426-441: In update_indexed_schema, avoid leaving
self.config.indices mutated when save_config() fails by capturing the previous
entry for field (e.g., previous = self.config.indices.remove(field) or get
clone), perform the insert of PayloadFieldSchemaWithIndexType, call
save_config(), and if save_config() returns Err restore the previous value
(re-insert previous if Some or ensure the key is removed if it was None) before
returning the error; reference the update_indexed_schema function,
self.config.indices insertion, and save_config() call to locate where to
snapshot and rollback the in-memory change.
---
Nitpick comments:
In `@lib/segment/src/index/field_index/map_index/tests.rs`:
- Around line 477-501: The test
int_map_immutable_to_mmap_round_trip_preserves_files currently only checks file
snapshots and swap state; add assertions to validate query correctness by
calling MapIndex methods like get_iterator and values_count on the loaded index
(from data saved via save_map_index/load_map_index) before the first swap, after
swapping to Mmap, after the no-op swap, and after swapping back to Immutable:
pick a few representative keys from the test data (e.g., keys corresponding to
values 1, 4, and 6) and assert that values_count(key) matches expected counts
and that iterating via get_iterator yields the expected Value items for those
keys each time so that in-memory rebuilds preserve semantic results across
swap_on_disk(true/false) transitions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: da0805f9-5f3d-4258-9ddd-8ef4fdc794e2
📒 Files selected for processing (23)
lib/segment/src/entry/entry_point.rslib/segment/src/index/field_index/field_index_base/field_index.rslib/segment/src/index/field_index/full_text_index/immutable_text_index/lifecycle.rslib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rslib/segment/src/index/field_index/full_text_index/lifecycle.rslib/segment/src/index/field_index/full_text_index/tests/mod.rslib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rslib/segment/src/index/field_index/geo_index/mod.rslib/segment/src/index/field_index/geo_index/tests.rslib/segment/src/index/field_index/map_index/immutable_map_index/lifecycle.rslib/segment/src/index/field_index/map_index/lifecycle.rslib/segment/src/index/field_index/map_index/tests.rslib/segment/src/index/field_index/mod.rslib/segment/src/index/field_index/numeric_index/immutable_numeric_index/lifecycle.rslib/segment/src/index/field_index/numeric_index/lifecycle.rslib/segment/src/index/field_index/numeric_index/storage/lifecycle.rslib/segment/src/index/field_index/numeric_index/tests.rslib/segment/src/index/field_index/schema_transition.rslib/segment/src/index/field_index/swap_in_place.rslib/segment/src/index/struct_payload_index/mod.rslib/segment/src/index/struct_payload_index/payload_index.rslib/segment/src/segment/entry.rslib/shard/src/update.rs
5cc23aa to
00b68c4
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
lib/segment/src/index/field_index/field_index_base/field_index.rs (1)
249-252: ⚡ Quick winUpdate stale phase comment to match current behavior.
Line 249–252 says every arm returns
Ok(false), but bool/null swaps now returnOk(true). Please adjust this doc to avoid misleading future changes.Proposed doc-only diff
- /// Phase 1 scaffolding: every arm returns `Ok(false)` so the wiring is - /// in place but the fast path never fires. Each family's real - /// implementation will replace its arm in later phases (numeric → map → - /// geo → full-text). + /// Some families perform real in-place swaps already (for example, + /// schema-only families); others still return `Ok(false)` until their + /// swap implementation is added.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/field_index/field_index_base/field_index.rs` around lines 249 - 252, Update the Phase 1 scaffolding doc comment in field_index.rs (the "Phase 1 scaffolding" block near FieldIndexBase) to reflect current behavior: instead of saying every arm returns `Ok(false)`, note that most arms return `Ok(false)` but bool/null swap arms now return `Ok(true)` (so the fast path can fire for those cases); rephrase to avoid misleading future maintainers and mention that concrete family implementations will replace these arms in later phases.lib/segment/src/index/field_index/map_index/tests.rs (1)
477-500: ⚡ Quick winStrengthen swap regression coverage with behavioral assertions.
Please also assert read behavior after each swap direction (and add the
swap_on_disk(false)false-path check for mutable) so this test guards semantics, not only file-byte stability.Proposed test additions
assert!(index.swap_on_disk(true).unwrap()); assert!(matches!(index, MapIndex::Mmap(_))); assert_eq!(snapshot_files(&index), files_before); + let hw_counter = HardwareCounterCell::new(); + let mut hits: Vec<PointOffsetType> = index.get_iterator(&4, &hw_counter).collect(); + hits.sort(); + assert_eq!(hits, vec![1, 2]); // No-op assert!(index.swap_on_disk(true).unwrap()); assert!(index.swap_on_disk(false).unwrap()); assert!(matches!(index, MapIndex::Immutable(_))); assert_eq!(snapshot_files(&index), files_before); + let hw_counter = HardwareCounterCell::new(); + let mut hits: Vec<PointOffsetType> = index.get_iterator(&4, &hw_counter).collect(); + hits.sort(); + assert_eq!(hits, vec![1, 2]); @@ assert!(matches!(index, MapIndex::Mutable(_))); assert!(!index.swap_on_disk(true).unwrap()); assert!(matches!(index, MapIndex::Mutable(_))); + assert!(!index.swap_on_disk(false).unwrap()); + assert!(matches!(index, MapIndex::Mutable(_)));Also applies to: 503-519
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/field_index/map_index/tests.rs` around lines 477 - 500, Add behavioral assertions to int_map_immutable_to_mmap_round_trip_preserves_files: after loading via load_map_index and after every swap_on_disk(true/false) call, verify that reading from the index returns the expected values for the original `data` (use the same accessors used elsewhere in tests to read mapped values); also add the missing false-path check for swap_on_disk(false) on the mutable-to-mmap transition to ensure it returns Ok(false) when no-op is expected. Reference the test helpers save_map_index, load_map_index, snapshot_files and the MapIndex variants (MapIndex::Immutable, MapIndex::Mmap) and call swap_on_disk on the `index` variable, asserting both the boolean return and that lookups into the `index` match `data` after each swap.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/segment/src/index/field_index/full_text_index/tests/mod.rs`:
- Around line 602-613: The test is flaky because FullTextIndex::new_mmap(...,
false, &empty_deleted) can return Mmap depending on runtime memory heuristics;
change the setup to deterministically create an Immutable index by calling
FullTextIndex::new_mmap with the flag that forces immutable creation (replace
the third argument false with true) or otherwise use the explicit constructor
that returns FullTextIndex::Immutable so the assert!(matches!(index,
FullTextIndex::Immutable(_))) and subsequent snapshot_files checks are
deterministic.
---
Nitpick comments:
In `@lib/segment/src/index/field_index/field_index_base/field_index.rs`:
- Around line 249-252: Update the Phase 1 scaffolding doc comment in
field_index.rs (the "Phase 1 scaffolding" block near FieldIndexBase) to reflect
current behavior: instead of saying every arm returns `Ok(false)`, note that
most arms return `Ok(false)` but bool/null swap arms now return `Ok(true)` (so
the fast path can fire for those cases); rephrase to avoid misleading future
maintainers and mention that concrete family implementations will replace these
arms in later phases.
In `@lib/segment/src/index/field_index/map_index/tests.rs`:
- Around line 477-500: Add behavioral assertions to
int_map_immutable_to_mmap_round_trip_preserves_files: after loading via
load_map_index and after every swap_on_disk(true/false) call, verify that
reading from the index returns the expected values for the original `data` (use
the same accessors used elsewhere in tests to read mapped values); also add the
missing false-path check for swap_on_disk(false) on the mutable-to-mmap
transition to ensure it returns Ok(false) when no-op is expected. Reference the
test helpers save_map_index, load_map_index, snapshot_files and the MapIndex
variants (MapIndex::Immutable, MapIndex::Mmap) and call swap_on_disk on the
`index` variable, asserting both the boolean return and that lookups into the
`index` match `data` after each swap.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 3b60baff-c88f-4fd5-a47a-4d4cb4a16701
📒 Files selected for processing (25)
lib/segment/src/entry/entry_point.rslib/segment/src/index/field_index/bool_index/mod.rslib/segment/src/index/field_index/field_index_base/field_index.rslib/segment/src/index/field_index/full_text_index/immutable_text_index/lifecycle.rslib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rslib/segment/src/index/field_index/full_text_index/lifecycle.rslib/segment/src/index/field_index/full_text_index/tests/mod.rslib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rslib/segment/src/index/field_index/geo_index/mod.rslib/segment/src/index/field_index/geo_index/tests.rslib/segment/src/index/field_index/map_index/immutable_map_index/lifecycle.rslib/segment/src/index/field_index/map_index/lifecycle.rslib/segment/src/index/field_index/map_index/tests.rslib/segment/src/index/field_index/mod.rslib/segment/src/index/field_index/null_index/mod.rslib/segment/src/index/field_index/numeric_index/immutable_numeric_index/lifecycle.rslib/segment/src/index/field_index/numeric_index/lifecycle.rslib/segment/src/index/field_index/numeric_index/storage/lifecycle.rslib/segment/src/index/field_index/numeric_index/tests.rslib/segment/src/index/field_index/schema_transition.rslib/segment/src/index/field_index/swap_in_place.rslib/segment/src/index/struct_payload_index/mod.rslib/segment/src/index/struct_payload_index/payload_index.rslib/segment/src/segment/entry.rslib/shard/src/update.rs
✅ Files skipped from review due to trivial changes (2)
- lib/segment/src/index/field_index/numeric_index/immutable_numeric_index/lifecycle.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
| let mut index = FullTextIndex::new_mmap( | ||
| temp_dir.path().to_path_buf(), | ||
| config.clone(), | ||
| false, | ||
| &empty_deleted, | ||
| ) | ||
| .unwrap() | ||
| .unwrap(); | ||
|
|
||
| assert!(matches!(index, FullTextIndex::Immutable(_))); | ||
| let files_before = snapshot_files(&index); | ||
| assert!(!files_before.is_empty()); |
There was a problem hiding this comment.
Make immutable setup deterministic in swap test
This test currently depends on runtime low-memory mode. FullTextIndex::new_mmap(..., false, ...) can still return FullTextIndex::Mmap, making the Immutable assertion flaky.
Proposed fix
- // Reload as Immutable
- let mut index = FullTextIndex::new_mmap(
- temp_dir.path().to_path_buf(),
- config.clone(),
- false,
- &empty_deleted,
- )
- .unwrap()
- .unwrap();
+ // Reload as Immutable explicitly (avoid low-memory mode auto-selecting Mmap)
+ let mmap_index = crate::index::field_index::full_text_index::mmap_text_index::MmapFullTextIndex::open(
+ temp_dir.path().to_path_buf(),
+ config.clone(),
+ false,
+ &empty_deleted,
+ )
+ .unwrap()
+ .unwrap();
+ let mut index = FullTextIndex::Immutable(
+ crate::index::field_index::full_text_index::immutable_text_index::ImmutableFullTextIndex::open_mmap(mmap_index).unwrap(),
+ );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/segment/src/index/field_index/full_text_index/tests/mod.rs` around lines
602 - 613, The test is flaky because FullTextIndex::new_mmap(..., false,
&empty_deleted) can return Mmap depending on runtime memory heuristics; change
the setup to deterministically create an Immutable index by calling
FullTextIndex::new_mmap with the flag that forces immutable creation (replace
the third argument false with true) or otherwise use the explicit constructor
that returns FullTextIndex::Immutable so the assert!(matches!(index,
FullTextIndex::Immutable(_))) and subsequent snapshot_files checks are
deterministic.
00b68c4 to
87b0638
Compare
cd23362 to
84e8b34
Compare
84e8b34 to
1b43a26
Compare
1b43a26 to
aee2b8f
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI (base), Organization UI (inherited) Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR implements smart index reuse when only the Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
lib/segment/src/index/struct_payload_index/mod.rs (1)
348-363:⚠️ Potential issue | 🟠 MajorRollback the in-memory schema when
save_config()fails.
update_indexed_schema()mutatesself.config.indicesbefore persistence. Ifsave_config()errors, later retries in the same process can see the field as already transitioned and returnAlreadyBuilt, while the on-disk config is still stale.Minimal fix
let index_types: Vec<_> = self .field_indexes .get(field) .map(|indexes| indexes.iter().map(|i| i.get_full_index_type()).collect()) .unwrap_or_default(); - self.config.indices.insert( - field.clone(), - PayloadFieldSchemaWithIndexType::new(new_schema, index_types), - ); - self.save_config()?; + let next = PayloadFieldSchemaWithIndexType::new(new_schema, index_types); + let prev = self.config.indices.insert(field.clone(), next); + if let Err(err) = self.save_config() { + match prev { + Some(prev) => { + self.config.indices.insert(field.clone(), prev); + } + None => { + self.config.indices.remove(field); + } + } + return Err(err); + } Ok(())🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/struct_payload_index/mod.rs` around lines 348 - 363, update_indexed_schema currently mutates self.config.indices before calling save_config, causing a stale on-disk state if save_config fails; capture the previous entry (via the insert return or get) for the given field, perform the insert of PayloadFieldSchemaWithIndexType(new_schema, index_types), call save_config(), and if save_config() returns Err restore the prior in-memory value (re-insert the previous value or remove the key if it did not exist) and propagate the error so callers don’t see a half-applied in-memory state (referencing update_indexed_schema, self.config.indices, PayloadFieldSchemaWithIndexType, save_config and the AlreadyBuilt semantics).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@lib/segment/src/index/struct_payload_index/mod.rs`:
- Around line 348-363: update_indexed_schema currently mutates
self.config.indices before calling save_config, causing a stale on-disk state if
save_config fails; capture the previous entry (via the insert return or get) for
the given field, perform the insert of
PayloadFieldSchemaWithIndexType(new_schema, index_types), call save_config(),
and if save_config() returns Err restore the prior in-memory value (re-insert
the previous value or remove the key if it did not exist) and propagate the
error so callers don’t see a half-applied in-memory state (referencing
update_indexed_schema, self.config.indices, PayloadFieldSchemaWithIndexType,
save_config and the AlreadyBuilt semantics).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: b5e960c1-e70e-4b91-a801-2c287db45dca
📒 Files selected for processing (10)
lib/segment/src/index/field_index/full_text_index/immutable_text_index/lifecycle.rslib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rslib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rslib/segment/src/index/field_index/map_index/immutable_map_index/lifecycle.rslib/segment/src/index/field_index/mod.rslib/segment/src/index/field_index/schema_transition.rslib/segment/src/index/struct_payload_index/build.rslib/segment/src/index/struct_payload_index/mod.rslib/segment/src/index/struct_payload_index/payload_index.rslib/segment/src/index/struct_payload_index/tests.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
- lib/segment/src/index/field_index/schema_transition.rs
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs (1)
22-25: ⚡ Quick winUse explicit
OperationError::from(err)conversions here.These new
err.into()calls violate the repo rule for explicit conversions in Rust.As per coding guidelines, "Prefer explicit conversions: use
SomeType::from(x)instead of implicitx.into()in new code."Also applies to: 28-30, 61-63, 131-133
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs` around lines 22 - 25, Replace implicit conversions using err.into() with explicit OperationError::from(err) in the Result error branches: where counts_per_hash is built from index.storage.counts_per_hash.read_whole() (the match arm returning Err(err) -> Err(err.into())), and the other similar matches referenced (the match that constructs counts_per_hash, the ones at the other occurrences around the code and the ones noted at 28-30, 61-63, 131-133). Concretely, change each Err(err) => return Err(err.into()) to Err(err) => return Err(OperationError::from(err)) so the code uses explicit OperationError::from conversions for those read_whole() error paths and any identical patterns in the same module.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@lib/segment/src/index/field_index/full_text_index/immutable_text_index/lifecycle.rs`:
- Around line 15-20: open_mmap currently boxes/consumes the MmapFullTextIndex
before calling ImmutableInvertedIndex::try_from, so on failure the mmap-backed
index is dropped; fix by delaying consumption: do not call Box::new(index) until
after ImmutableInvertedIndex::try_from succeeds — create the inverted index from
&index.inverted_index while index is still owned (unboxed), handle Err by
returning the OperationError without dropping/consuming the original index, and
only box (Box::new(index)) after you have an Ok(inverted_index) to preserve the
mmap for rollback.
In
`@lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs`:
- Around line 20-25: The error branches in open_mmap currently return an
OperationError while the boxed StoredGeoMapIndex<MmapFile> is still owned,
causing it to be dropped and losing the mmap-backed index; to fix this, ensure
the boxed index is not dropped on any load failure by leaking or converting the
Box to a raw pointer before returning the error (e.g., use Box::into_raw or
std::mem::forget on the Boxed index) so the mmap file remains intact for
in-place swap rollback; apply the same change to all similar Err branches that
call storage.counts_per_hash.read_whole(), storage.offsets.read_whole(),
storage.hashes.read_whole() (the other reported locations) so
StoredGeoMapIndex<MmapFile> survives failures.
In
`@lib/segment/src/index/field_index/map_index/immutable_map_index/lifecycle.rs`:
- Around line 23-25: open_mmap currently takes ownership of UniversalMapIndex<N>
and drops the mmap-backed index if the internal scan fails, preventing rollback;
change open_mmap (and the analogous function around the other occurrence) so the
mmap index is not lost on error: either accept a reference/Box you can return or
change the error branch of OperationResult to carry back the original
UniversalMapIndex<N> (e.g., return Result<Self, (ScanError,
UniversalMapIndex<N>)>), perform the scan without dropping the owned index, and
return the index alongside Err so the caller can reinstall it during rollback.
---
Nitpick comments:
In
`@lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs`:
- Around line 22-25: Replace implicit conversions using err.into() with explicit
OperationError::from(err) in the Result error branches: where counts_per_hash is
built from index.storage.counts_per_hash.read_whole() (the match arm returning
Err(err) -> Err(err.into())), and the other similar matches referenced (the
match that constructs counts_per_hash, the ones at the other occurrences around
the code and the ones noted at 28-30, 61-63, 131-133). Concretely, change each
Err(err) => return Err(err.into()) to Err(err) => return
Err(OperationError::from(err)) so the code uses explicit OperationError::from
conversions for those read_whole() error paths and any identical patterns in the
same module.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: a7d5409c-6fd2-40c7-a2e2-33a9d65bf7ce
📒 Files selected for processing (3)
lib/segment/src/index/field_index/full_text_index/immutable_text_index/lifecycle.rslib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rslib/segment/src/index/field_index/map_index/immutable_map_index/lifecycle.rs
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs (1)
105-105:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRemove/rework the no-op
let _ = index;(it shadows intent, even though it compiles)
let _ = index;at line 105 is a move/reassignment-like line, butcargo checkreports no error and the code immediately reusesindexto buildSelf { ..., storage: index, ... }. That meansindexis likely aCopy/handle type (e.g., reference/Arc-like) rather than a unique owning value—so the line is probably redundant and confusing; if it was meant to influence drop/borrow order, it should be replaced with an explicit, targeted pattern (or removed).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs` at line 105, The line `let _ = index;` is a redundant no-op that confuses intent; remove that statement from lifecycle.rs and leave the subsequent usage that builds `Self { ..., storage: index, ... }` unchanged. If the original intent was to force a drop or control borrow/drop order for `index`, replace the no-op with an explicit operation: either wrap the earlier scope to let `index` go out of scope before reuse, or call `std::mem::drop(index)` (and then obtain a new value) — otherwise simply delete the `let _ = index;` line so `index` is used directly when constructing the `Self` instance.
♻️ Duplicate comments (2)
lib/segment/src/index/field_index/map_index/immutable_map_index/lifecycle.rs (1)
24-24:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftMmap-backed index is dropped on scan failure, preventing rollback.
The boxed
index(line 24) is dropped iffor_each_value_mapreturns an error and the?operator on line 67 triggers an early return. This makes the original mmap-backedUniversalMapIndex<N>unavailable for rollback during swap operations, as noted in the previous review.To preserve the index for rollback, either leak it before returning the error (
Box::into_raw/mem::forget) or change the return type to include the index on failure:Result<Self, (OperationError, Box<UniversalMapIndex<N>>)>.Also applies to: 40-67
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/field_index/map_index/immutable_map_index/lifecycle.rs` at line 24, The boxed mmap-backed index is dropped when for_each_value_map returns an error because the function currently returns early with `?`; change the function to return Result<Self, (OperationError, Box<UniversalMapIndex<N>>)>, keep the `let index = Box::new(index);` as is, and update all error propagation around the for_each_value_map call (and any returns in the 40-67 region) to map/map_err the OperationError into Err((error, index)) so the caller receives the boxed index for rollback; update the function signature and any call sites accordingly to accept the new Err payload.lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs (1)
21-29:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftMmap-backed index is still dropped on all error paths, preventing rollback.
The boxed
indexis dropped when any error path returns early (?on lines 25, 32, 99, 129, or explicitErron line 63). This prevents the caller from recovering the original mmap-backed index during swap rollback, as flagged in the previous review.The current code boxes the index on line 21 but does not prevent it from being dropped on error. To support rollback, the index must survive failures—either by using
Box::into_raw/mem::forgetbefore returning errors, or by changing the signature to returnResult<Self, (OperationError, Box<StoredGeoMapIndex<MmapFile>>)>.Also applies to: 32-32, 62-64, 98-99, 129-129
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs` around lines 21 - 29, The boxed mmap-backed index (created as Box::new(index)) is being dropped on early error returns which prevents rollback; update the code so the boxed index is preserved on all error paths: either convert the Box to a raw pointer (Box::into_raw or std::mem::forget) before any `?`/Err returns so it is not dropped, or change the function's error type to return Result<Self, (OperationError, Box<StoredGeoMapIndex<MmapFile>>)>, and adjust every early return (the places that currently `?` on storage.counts_per_hash.read_whole() and other fallible calls and explicit Err returns) to return the error paired with the boxed index instead of letting the Box be dropped. Ensure every path that currently returns an error returns the boxed index unchanged so the caller can perform rollback.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs`:
- Line 105: The line `let _ = index;` is a redundant no-op that confuses intent;
remove that statement from lifecycle.rs and leave the subsequent usage that
builds `Self { ..., storage: index, ... }` unchanged. If the original intent was
to force a drop or control borrow/drop order for `index`, replace the no-op with
an explicit operation: either wrap the earlier scope to let `index` go out of
scope before reuse, or call `std::mem::drop(index)` (and then obtain a new
value) — otherwise simply delete the `let _ = index;` line so `index` is used
directly when constructing the `Self` instance.
---
Duplicate comments:
In
`@lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs`:
- Around line 21-29: The boxed mmap-backed index (created as Box::new(index)) is
being dropped on early error returns which prevents rollback; update the code so
the boxed index is preserved on all error paths: either convert the Box to a raw
pointer (Box::into_raw or std::mem::forget) before any `?`/Err returns so it is
not dropped, or change the function's error type to return Result<Self,
(OperationError, Box<StoredGeoMapIndex<MmapFile>>)>, and adjust every early
return (the places that currently `?` on storage.counts_per_hash.read_whole()
and other fallible calls and explicit Err returns) to return the error paired
with the boxed index instead of letting the Box be dropped. Ensure every path
that currently returns an error returns the boxed index unchanged so the caller
can perform rollback.
In
`@lib/segment/src/index/field_index/map_index/immutable_map_index/lifecycle.rs`:
- Line 24: The boxed mmap-backed index is dropped when for_each_value_map
returns an error because the function currently returns early with `?`; change
the function to return Result<Self, (OperationError,
Box<UniversalMapIndex<N>>)>, keep the `let index = Box::new(index);` as is, and
update all error propagation around the for_each_value_map call (and any returns
in the 40-67 region) to map/map_err the OperationError into Err((error, index))
so the caller receives the boxed index for rollback; update the function
signature and any call sites accordingly to accept the new Err payload.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 8c278820-bbba-4a14-8529-091d422a1ce5
📒 Files selected for processing (3)
lib/segment/src/index/field_index/full_text_index/immutable_text_index/lifecycle.rslib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rslib/segment/src/index/field_index/map_index/immutable_map_index/lifecycle.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/segment/src/index/field_index/full_text_index/immutable_text_index/lifecycle.rs
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/segment/src/index/struct_payload_index/build.rs (1)
57-85:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winClarify doc comment about
on_diskvs the hardcodedfalse: thefalsepassed toIndexSelector::new_index/new_null_indexiscreate_if_missing, while theon_diskrepresentation is honored byselector(payload_schema)viapayload_schema.is_on_disk()feedingIndexSelectorMmap { is_on_disk }. Update the comment wording to avoid implying that the hardcoded boolean controlson_disk.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/struct_payload_index/build.rs` around lines 57 - 85, Update the doc comment for reuse_or_build_index to make clear that the hardcoded false passed to selector.new_index and selector.new_null_index is the create_if_missing flag and does not control the on_disk representation; state that on_disk is determined when constructing the selector via selector(payload_schema) (which uses payload_schema.is_on_disk() to configure IndexSelectorMmap { is_on_disk }), so rephrase the comment to avoid implying the boolean toggles on_disk and instead explain that selector honors payload_schema.is_on_disk().
♻️ Duplicate comments (1)
lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs (1)
20-32:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftError branches still drop the boxed mmap index, removing the swap-rollback recovery path.
After boxing
indexat Line 21, every fallible step (read_whole()?at Lines 25 and 32,read_batch(...)?at Line 61,collected?at Line 96, andencode_max_precision(...)?at Line 126) propagates only theOperationErrorand drops the ownedStoredGeoMapIndex<MmapFile>. During an in-placeMmap → Immutableswap, a load failure here therefore destroys the original mmap-backed index and prevents restoring it on rollback.Also applies to: 61-61, 96-96, 126-126
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs` around lines 20 - 32, In open_mmap, do not box/move the StoredGeoMapIndex<MmapFile> into heap (the local variable index) before performing fallible reads; currently boxing at the start (variable index) causes the mmap-backed index to be dropped on any early error (e.g., counts_per_hash.read_whole(), storage.points_map.read_whole(), read_batch(...), iterator.collect(), encode_max_precision(...)), preventing swap-rollback recovery. Fix by deferring the Box::new(...) (or move into the final ImmutableGeoIndex) until after all read_whole/read_batch/collect/encode_max_precision calls succeed — use references/borrows to access index.storage for these operations or perform the reads into temporaries first, then box the StoredGeoMapIndex only once all fallible operations have completed successfully.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@lib/segment/src/index/struct_payload_index/build.rs`:
- Around line 57-85: Update the doc comment for reuse_or_build_index to make
clear that the hardcoded false passed to selector.new_index and
selector.new_null_index is the create_if_missing flag and does not control the
on_disk representation; state that on_disk is determined when constructing the
selector via selector(payload_schema) (which uses payload_schema.is_on_disk() to
configure IndexSelectorMmap { is_on_disk }), so rephrase the comment to avoid
implying the boolean toggles on_disk and instead explain that selector honors
payload_schema.is_on_disk().
---
Duplicate comments:
In
`@lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rs`:
- Around line 20-32: In open_mmap, do not box/move the
StoredGeoMapIndex<MmapFile> into heap (the local variable index) before
performing fallible reads; currently boxing at the start (variable index) causes
the mmap-backed index to be dropped on any early error (e.g.,
counts_per_hash.read_whole(), storage.points_map.read_whole(), read_batch(...),
iterator.collect(), encode_max_precision(...)), preventing swap-rollback
recovery. Fix by deferring the Box::new(...) (or move into the final
ImmutableGeoIndex) until after all
read_whole/read_batch/collect/encode_max_precision calls succeed — use
references/borrows to access index.storage for these operations or perform the
reads into temporaries first, then box the StoredGeoMapIndex only once all
fallible operations have completed successfully.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 188eaae1-622d-4db9-846d-d1251179ffcf
📒 Files selected for processing (4)
lib/segment/src/index/field_index/geo_index/immutable_geo_index/lifecycle.rslib/segment/src/index/field_index/schema_transition.rslib/segment/src/index/struct_payload_index/build.rslib/segment/src/index/struct_payload_index/payload_index.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- lib/segment/src/index/struct_payload_index/payload_index.rs
- lib/segment/src/index/field_index/schema_transition.rs
| // optimizer applies `on_disk` when converting to non-appendable). | ||
| SchemaTransition::OnlyOnDiskFlipped { .. } => { | ||
| if matches!(self.storage_type, StorageType::GridstoreAppendable) { | ||
| self.update_indexed_schema(field, new_payload_schema.clone())?; |
There was a problem hiding this comment.
why do we need to update the schema during drop_if_incompatible, shouldn't it be during apply?
There was a problem hiding this comment.
Discussed in chat: It may be more intuitive to do during apply, but it also requires more changes than just doing what this PR does.
Appendable Gridstore ignores on_disk at the storage layer and is unsafe to reload while mutable
Gridstore-backed indexes behave the same regardless of on-disk parameter, yes. But it should be safe to reload if we are holding an upgradable read lock.
Making the decision here that mutable indexes are not affected by on-disk parameter feels like it might shoot us in the foot if the concrete implementations change.
wdyt about separating this edge case from the main optimization of this PR?
5f2ca3d to
64fc8c0
Compare
* feat: add swap on_disk methods * fix: linter * feat: add swap in disk for indexes * fix: linter * fix: std::fs::read -> fs_err::read * feat: add on_disk swap bool & null indexes * refactor: use macro instead boilerplate * feat: add better error propagation * fix: review comments -> remove try_swap.. and mut lock * fix: linter * chore: remove try_open_mmap * fix: match -> ? * review nits * fix: seperate edge case for appendable segment --------- Co-authored-by: Luis Cossío <[email protected]>
Impl of this plan: #9114
Summary
Re-issuing
create_payload_indexwith only a differenton_diskvalue previously triggered a full index rebuild from payload storage. For non-appendable (mmap) segments the on-disk format is identical betweenon_disk: false(loaded into RAM) andon_disk: true(mmap, page-cache only) — same files, only a different in-memory wrapper. This PR avoids the rebuild by reusing the existing files and reloading them in the requested mode, integrated into the existingdelete → build → applypipeline (no separate in-place swap). The reload runs off the write lock and the previous index is not dropped, so reads are never blocked and the field is never left unindexed.drop_index_if_incompatible— classifies the schema delta; anon_disk-only change is treated as compatible (the index is kept, files not dropped). Any other change still drops and rebuilds.build_index→reuse_existing_index— for anon_disk-only change, reloads the field's index from its existing files in the new mode (via the normalnew_indexloader) instead of rebuilding from payload; falls back to a full rebuild if the files can't be loaded.apply_index— swaps the freshly-loaded index in; only this brief step takes the write lock.on_diskhas no storage-layer effect there, so the new value is only persisted in the schema (config) and the live index is left untouched; the optimizer applies it when the segment is converted to non-appendable.Tests —
drop_index_if_incompatiblekeeps the index on anon_disk-only change;build_indexreloads in the new mode and preserves the indexed-point count; classifier unit tests for everyPayloadSchemaParamsvariant.All Submissions:
devbranch (notmaster) and my branch was created fromdev.New Feature Submissions:
cargo +nightly fmt --allcommand prior to submission?cargo clippy --workspace --all-featurescommand?Changes to Core Features: