Fix proxy deleted_mask race dropping live points from filtered search#9116
Conversation
ProxySegment::new snapshots the wrapped segment's deleted_mask while the optimizer holds only the upgradable-read lock, before the write lock freezes the segment. An upsert racing onto the still-appendable wrapped segment in that window lands at an internal offset past the snapshot. The scored search path (PlainVectorIndex::search) consults the proxy mask in place of the segment's live deleted state, and check_deleted_condition defaults any out-of-range offset to deleted (unwrap_or(true)) — so the live point is silently dropped from filtered KNN while scroll/count/retrieve still return it. Re-snapshot deleted_mask in the optimizer once the holder write lock is held (segment frozen) and before the proxy goes live, so the mask covers the segment's full final point range. The fresh read also captures any deletes that raced in, closing the ghost direction too. Adds a proxy-level regression test that reproduces the race without the model-testing harness. Co-Authored-By: Claude Opus 4.7 <[email protected]>
ProxySegment::new now returns UnsyncedProxySegment instead of a ready-to-use ProxySegment. The deleted_mask snapshot is deferred to UnsyncedProxySegment::finalize(), which reads the wrapped segment's deleted bitvec exactly once. The only way to obtain a ProxySegment (and thus put it in a SegmentHolder) is via finalize(), so the sync under the holder write lock can no longer be forgotten, and the mask is no longer read twice (once in new, once in resync). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
fb4e78f to
d8e29ef
Compare
|
Noticed, that the same logic was missing for snapshot proxification. Aslo replaced let match with full match. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/shard/src/proxy_segment/mod.rs`:
- Around line 105-107: The matches! call in ProxySegment::new consumes the
non-Copy LockedSegment (moving `segment`) which is later used; change the
scrutinee to a borrow so it doesn't move — e.g. use a borrowed pattern like
matches!(&segment, LockedSegment::Proxy(_)) (or matches!(ref segment,
LockedSegment::Proxy(_)) depending on context) so `segment` remains available
for subsequent use in ProxySegment::new and no ownership is taken.
In `@lib/shard/src/proxy_segment/tests.rs`:
- Around line 70-101: The closure build_unsynced_proxy (returning
(UnsyncedProxySegment, LockedSegment)) is misformatted; run rustfmt (cargo fmt
--all) or manually reformat the closure block around build_unsynced_proxy,
aligning the closure signature, the chained method calls on
original_segment.get().write().upsert_point(...).unwrap(), and the final let
proxy = ProxySegment::new(original_segment); so the code follows rustfmt
conventions and the test compiles/formats correctly; ensure spacing, indentation
and line breaks match rustfmt output for the closure and its inner calls.
- Around line 52-65: The regression test currently passes None for the filter
(so it only checks unfiltered search) but must exercise the payload-filtered
path; update the search invocation(s) in this test to pass a concrete payload
filter that matches the point appended after the deleted_mask snapshot (e.g., a
filter selecting payload key/value you set on that point such as "tag" ==
"new"), so the scored search codepath with filter is exercised; adjust the
filter argument(s) where None is passed around the test (around the search calls
in the blocks referenced by the comment, e.g., the search/scored_search call
sites and any verify helpers in the 117-127 and 134-162 ranges) and ensure the
inserted point(s) have the corresponding payload key/value used by the filter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 09455b91-3202-4f47-a2e3-dfeb677e243a
📒 Files selected for processing (5)
lib/collection/src/collection_manager/tests/mod.rslib/shard/src/optimize.rslib/shard/src/proxy_segment/mod.rslib/shard/src/proxy_segment/tests.rslib/shard/src/segment_holder/snapshot.rs
| if matches!(segment, LockedSegment::Proxy(_)) { | ||
| log::debug!("Double proxy segment creation"); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== LockedSegment definition =="
fd -i 'locked_segment.rs' | while read -r f; do
echo "--- $f ---"
rg -n -C3 'enum\s+LockedSegment|struct\s+LockedSegment|type\s+LockedSegment|derive\([^)]*Copy|impl\s+Copy\s+for\s+LockedSegment' "$f" || true
done
echo
echo "== Guard and subsequent use in ProxySegment::new =="
rg -n -C2 'matches!\(segment,\s*LockedSegment::Proxy|let \(wrapped_config, version\)|wrapped_segment:\s*segment' lib/shard/src/proxy_segment/mod.rsRepository: qdrant/qdrant
Length of output: 1180
Fix move-after-matches! in ProxySegment::new
LockedSegment is #[derive(Clone, Debug)] and wraps Arc<...>, so it’s not Copy; matches!(segment, LockedSegment::Proxy(_)) moves segment, but segment is used later in ProxySegment::new. Borrow the scrutinee.
Suggested fix
- if matches!(segment, LockedSegment::Proxy(_)) {
+ if matches!(&segment, LockedSegment::Proxy(_)) {
log::debug!("Double proxy segment creation");
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if matches!(segment, LockedSegment::Proxy(_)) { | |
| log::debug!("Double proxy segment creation"); | |
| } | |
| if matches!(&segment, LockedSegment::Proxy(_)) { | |
| log::debug!("Double proxy segment creation"); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/shard/src/proxy_segment/mod.rs` around lines 105 - 107, The matches! call
in ProxySegment::new consumes the non-Copy LockedSegment (moving `segment`)
which is later used; change the scrutinee to a borrow so it doesn't move — e.g.
use a borrowed pattern like matches!(&segment, LockedSegment::Proxy(_)) (or
matches!(ref segment, LockedSegment::Proxy(_)) depending on context) so
`segment` remains available for subsequent use in ProxySegment::new and no
ownership is taken.
| /// Regression test for the proxy `deleted_mask` race that drops a live point from scored | ||
| /// search (catalog: "exact dense search + payload filter drops a candidate", optimizer-on). | ||
| /// | ||
| /// `deleted_mask` is a snapshot of the wrapped segment's deleted bitvec. If it is synced while | ||
| /// the wrapped segment is still appendable, an upsert can still land afterwards at an internal | ||
| /// offset past the snapshot; the scored search consults `deleted_mask` and treats every | ||
| /// out-of-range offset as deleted (`check_deleted_condition` → `unwrap_or(true)`), silently | ||
| /// excluding the live point even though scroll/retrieve still see it. | ||
| /// | ||
| /// [`UnsyncedProxySegment::finalize`] is what reads the mask, so the fix is timing: finalize | ||
| /// only once the wrapped segment is frozen, so the mask covers its full final point range. This | ||
| /// test exercises both orderings on two parallel segments — finalize-before-race (buggy) vs | ||
| /// finalize-after-race (fixed) — entirely at the proxy level, no model-testing harness involved. | ||
| #[test] |
There was a problem hiding this comment.
Regression test does not actually exercise filtered search.
Line 53 describes a payload-filtered search failure mode, but Line 124 passes None for filter, so the new test validates unfiltered search only. Please add a concrete filter path in this test so it protects the exact bug scenario.
Also applies to: 117-127, 134-162
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/shard/src/proxy_segment/tests.rs` around lines 52 - 65, The regression
test currently passes None for the filter (so it only checks unfiltered search)
but must exercise the payload-filtered path; update the search invocation(s) in
this test to pass a concrete payload filter that matches the point appended
after the deleted_mask snapshot (e.g., a filter selecting payload key/value you
set on that point such as "tag" == "new"), so the scored search codepath with
filter is exercised; adjust the filter argument(s) where None is passed around
the test (around the search calls in the blocks referenced by the comment, e.g.,
the search/scored_search call sites and any verify helpers in the 117-127 and
134-162 ranges) and ensure the inserted point(s) have the corresponding payload
key/value used by the filter.
new deliberately returns the unsynced UnsyncedProxySegment stage rather than Self, since a usable ProxySegment only exists once deleted_mask is synced via finalize(). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…t::new for tests Instead of allowing clippy::new_ret_no_self on a ProxySegment::new that returned UnsyncedProxySegment, give the two-phase constructor its natural home: UnsyncedProxySegment::new returns Self and is what production code (optimize, snapshot) uses, finalizing under the holder write lock. ProxySegment::new becomes a #[cfg(feature = "testing")] convenience that builds and finalizes in one step (returns Self), so existing test call sites stay terse and don't need an explicit .finalize(). The shard testing feature is enabled for both shard's own tests and collection's dev-dependency, and excluded from production builds. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…#9116) * Fix proxy deleted_mask race dropping live points from scored search ProxySegment::new snapshots the wrapped segment's deleted_mask while the optimizer holds only the upgradable-read lock, before the write lock freezes the segment. An upsert racing onto the still-appendable wrapped segment in that window lands at an internal offset past the snapshot. The scored search path (PlainVectorIndex::search) consults the proxy mask in place of the segment's live deleted state, and check_deleted_condition defaults any out-of-range offset to deleted (unwrap_or(true)) — so the live point is silently dropped from filtered KNN while scroll/count/retrieve still return it. Re-snapshot deleted_mask in the optimizer once the holder write lock is held (segment frozen) and before the proxy goes live, so the mask covers the segment's full final point range. The fresh read also captures any deletes that raced in, closing the ghost direction too. Adds a proxy-level regression test that reproduces the race without the model-testing harness. Co-Authored-By: Claude Opus 4.7 <[email protected]> * Make proxy deleted_mask sync a type-state, read once ProxySegment::new now returns UnsyncedProxySegment instead of a ready-to-use ProxySegment. The deleted_mask snapshot is deferred to UnsyncedProxySegment::finalize(), which reads the wrapped segment's deleted bitvec exactly once. The only way to obtain a ProxySegment (and thus put it in a SegmentHolder) is via finalize(), so the sync under the holder write lock can no longer be forgotten, and the mask is no longer read twice (once in new, once in resync). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Allow clippy::new_ret_no_self on ProxySegment::new new deliberately returns the unsynced UnsyncedProxySegment stage rather than Self, since a usable ProxySegment only exists once deleted_mask is synced via finalize(). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Name the real constructor UnsyncedProxySegment::new, keep ProxySegment::new for tests Instead of allowing clippy::new_ret_no_self on a ProxySegment::new that returned UnsyncedProxySegment, give the two-phase constructor its natural home: UnsyncedProxySegment::new returns Self and is what production code (optimize, snapshot) uses, finalizing under the holder write lock. ProxySegment::new becomes a #[cfg(feature = "testing")] convenience that builds and finalizes in one step (returns Self), so existing test call sites stay terse and don't need an explicit .finalize(). The shard testing feature is enabled for both shard's own tests and collection's dev-dependency, and excluded from production builds. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.7 <[email protected]> Co-authored-by: generall <[email protected]>
Summary
A point that is upserted onto a segment just as the optimizer wraps that segment in a
ProxySegmentcan be silently dropped from scored (KNN) search results, whilescroll/count/retrievestill return it. The point exists, has the searched vector, and matches the filter — it's just invisible to filtered search until the next optimization cycle.Root cause
ProxySegment::newsnapshots the wrapped segment'sdeleted_mask(an internal-offset bitvec) while the optimizer holds only the upgradable-read lock on the segment holder. The write lock that actually freezes the segment and swaps the proxy in is taken later (optimize.rs). Ordinary upserts only need a shared holder read lock, so a write can still land on the still-appendable wrapped segment in that window.Such a point gets an internal offset past the end of the snapshotted mask. The scored search path (
PlainVectorIndex::search) substitutes the proxy'sdeleted_maskfor the segment's live deleted bitslice, andcheck_deleted_conditiondefaults any out-of-range offset to deleted:So the live point is excluded from filtered KNN.
scroll/countgo throughread_filtered, which uses the wrapped segment's live id-tracker state, so they still see it — hence the asymmetry.This is the missing-direction twin of the previously-known ghost race (a delete landing in the same window).
Fix
Re-snapshot
deleted_maskin the optimizer once the holder write lock is held (segment frozen) and before the proxy goes live, so the mask covers the segment's full, final point range. A fresh read also captures any deletes that raced in, closing the ghost direction too. The change is a single extra bitvec read per proxied segment, under a lock that's already held.Test plan
test_proxy_deleted_mask_resync_after_race_window_writereproduces the race on two parallel segments (one without resync = buggy, one with = fixed), entirely within theshardcrate — no collection/optimizer-thread/soak harness needed.resync_deleted_maskis neutered to a no-op (point dropped), and passes with the fix.proxy_segmentunit tests pass;cargo clippy -p shard --all-targets -- -D warningsclean.seed=42, 2 shards): originally failed at op 1574 with a missing search candidate; after the fix, clean across multiple seeds and 15–20k ops each with heavy optimizer churn.🤖 Generated with Claude Code