Skip to content

Fix proxy deleted_mask race dropping live points from filtered search#9116

Merged
agourlay merged 4 commits into
devfrom
fix-proxy-deleted-mask-race
May 22, 2026
Merged

Fix proxy deleted_mask race dropping live points from filtered search#9116
agourlay merged 4 commits into
devfrom
fix-proxy-deleted-mask-race

Conversation

@agourlay
Copy link
Copy Markdown
Member

Summary

A point that is upserted onto a segment just as the optimizer wraps that segment in a ProxySegment can be silently dropped from scored (KNN) search results, while scroll/count/retrieve still return it. The point exists, has the searched vector, and matches the filter — it's just invisible to filtered search until the next optimization cycle.

Root cause

ProxySegment::new snapshots the wrapped segment's deleted_mask (an internal-offset bitvec) while the optimizer holds only the upgradable-read lock on the segment holder. The write lock that actually freezes the segment and swaps the proxy in is taken later (optimize.rs). Ordinary upserts only need a shared holder read lock, so a write can still land on the still-appendable wrapped segment in that window.

Such a point gets an internal offset past the end of the snapshotted mask. The scored search path (PlainVectorIndex::search) substitutes the proxy's deleted_mask for the segment's live deleted bitslice, and check_deleted_condition defaults any out-of-range offset to deleted:

&& !point_deleted.get_bit(point as usize).unwrap_or(true)   // out-of-range → "deleted"

So the live point is excluded from filtered KNN. scroll/count go through read_filtered, which uses the wrapped segment's live id-tracker state, so they still see it — hence the asymmetry.

This is the missing-direction twin of the previously-known ghost race (a delete landing in the same window).

Fix

Re-snapshot deleted_mask in the optimizer once the holder write lock is held (segment frozen) and before the proxy goes live, so the mask covers the segment's full, final point range. A fresh read also captures any deletes that raced in, closing the ghost direction too. The change is a single extra bitvec read per proxied segment, under a lock that's already held.

Test plan

  • New proxy-level regression test test_proxy_deleted_mask_resync_after_race_window_write reproduces the race on two parallel segments (one without resync = buggy, one with = fixed), entirely within the shard crate — no collection/optimizer-thread/soak harness needed.
  • Verified the test fails when resync_deleted_mask is neutered to a no-op (point dropped), and passes with the fix.
  • All 12 proxy_segment unit tests pass; cargo clippy -p shard --all-targets -- -D warnings clean.
  • Found and reproduced via a model-based soak (optimizer enabled, seed=42, 2 shards): originally failed at op 1574 with a missing search candidate; after the fix, clean across multiple seeds and 15–20k ops each with heavy optimizer churn.

🤖 Generated with Claude Code

ProxySegment::new snapshots the wrapped segment's deleted_mask while the
optimizer holds only the upgradable-read lock, before the write lock freezes
the segment. An upsert racing onto the still-appendable wrapped segment in that
window lands at an internal offset past the snapshot. The scored search path
(PlainVectorIndex::search) consults the proxy mask in place of the segment's
live deleted state, and check_deleted_condition defaults any out-of-range
offset to deleted (unwrap_or(true)) — so the live point is silently dropped
from filtered KNN while scroll/count/retrieve still return it.

Re-snapshot deleted_mask in the optimizer once the holder write lock is held
(segment frozen) and before the proxy goes live, so the mask covers the
segment's full final point range. The fresh read also captures any deletes
that raced in, closing the ghost direction too.

Adds a proxy-level regression test that reproduces the race without the
model-testing harness.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@qdrant qdrant deleted a comment from coderabbitai Bot May 21, 2026
ProxySegment::new now returns UnsyncedProxySegment instead of a
ready-to-use ProxySegment. The deleted_mask snapshot is deferred to
UnsyncedProxySegment::finalize(), which reads the wrapped segment's
deleted bitvec exactly once. The only way to obtain a ProxySegment
(and thus put it in a SegmentHolder) is via finalize(), so the sync
under the holder write lock can no longer be forgotten, and the mask
is no longer read twice (once in new, once in resync).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@generall generall force-pushed the fix-proxy-deleted-mask-race branch from fb4e78f to d8e29ef Compare May 21, 2026 23:16
@generall
Copy link
Copy Markdown
Member

generall commented May 21, 2026

Noticed, that the same logic was missing for snapshot proxification.
Introduced a type-level fool protection for this case.

Aslo replaced let match with full match.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/shard/src/proxy_segment/mod.rs`:
- Around line 105-107: The matches! call in ProxySegment::new consumes the
non-Copy LockedSegment (moving `segment`) which is later used; change the
scrutinee to a borrow so it doesn't move — e.g. use a borrowed pattern like
matches!(&segment, LockedSegment::Proxy(_)) (or matches!(ref segment,
LockedSegment::Proxy(_)) depending on context) so `segment` remains available
for subsequent use in ProxySegment::new and no ownership is taken.

In `@lib/shard/src/proxy_segment/tests.rs`:
- Around line 70-101: The closure build_unsynced_proxy (returning
(UnsyncedProxySegment, LockedSegment)) is misformatted; run rustfmt (cargo fmt
--all) or manually reformat the closure block around build_unsynced_proxy,
aligning the closure signature, the chained method calls on
original_segment.get().write().upsert_point(...).unwrap(), and the final let
proxy = ProxySegment::new(original_segment); so the code follows rustfmt
conventions and the test compiles/formats correctly; ensure spacing, indentation
and line breaks match rustfmt output for the closure and its inner calls.
- Around line 52-65: The regression test currently passes None for the filter
(so it only checks unfiltered search) but must exercise the payload-filtered
path; update the search invocation(s) in this test to pass a concrete payload
filter that matches the point appended after the deleted_mask snapshot (e.g., a
filter selecting payload key/value you set on that point such as "tag" ==
"new"), so the scored search codepath with filter is exercised; adjust the
filter argument(s) where None is passed around the test (around the search calls
in the blocks referenced by the comment, e.g., the search/scored_search call
sites and any verify helpers in the 117-127 and 134-162 ranges) and ensure the
inserted point(s) have the corresponding payload key/value used by the filter.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 09455b91-3202-4f47-a2e3-dfeb677e243a

📥 Commits

Reviewing files that changed from the base of the PR and between 0be214d and fb4e78f.

📒 Files selected for processing (5)
  • lib/collection/src/collection_manager/tests/mod.rs
  • lib/shard/src/optimize.rs
  • lib/shard/src/proxy_segment/mod.rs
  • lib/shard/src/proxy_segment/tests.rs
  • lib/shard/src/segment_holder/snapshot.rs

Comment thread lib/shard/src/proxy_segment/mod.rs Outdated
Comment on lines +105 to +107
if matches!(segment, LockedSegment::Proxy(_)) {
log::debug!("Double proxy segment creation");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== LockedSegment definition =="
fd -i 'locked_segment.rs' | while read -r f; do
  echo "--- $f ---"
  rg -n -C3 'enum\s+LockedSegment|struct\s+LockedSegment|type\s+LockedSegment|derive\([^)]*Copy|impl\s+Copy\s+for\s+LockedSegment' "$f" || true
done

echo
echo "== Guard and subsequent use in ProxySegment::new =="
rg -n -C2 'matches!\(segment,\s*LockedSegment::Proxy|let \(wrapped_config, version\)|wrapped_segment:\s*segment' lib/shard/src/proxy_segment/mod.rs

Repository: qdrant/qdrant

Length of output: 1180


Fix move-after-matches! in ProxySegment::new

LockedSegment is #[derive(Clone, Debug)] and wraps Arc<...>, so it’s not Copy; matches!(segment, LockedSegment::Proxy(_)) moves segment, but segment is used later in ProxySegment::new. Borrow the scrutinee.

Suggested fix
-        if matches!(segment, LockedSegment::Proxy(_)) {
+        if matches!(&segment, LockedSegment::Proxy(_)) {
             log::debug!("Double proxy segment creation");
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if matches!(segment, LockedSegment::Proxy(_)) {
log::debug!("Double proxy segment creation");
}
if matches!(&segment, LockedSegment::Proxy(_)) {
log::debug!("Double proxy segment creation");
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/shard/src/proxy_segment/mod.rs` around lines 105 - 107, The matches! call
in ProxySegment::new consumes the non-Copy LockedSegment (moving `segment`)
which is later used; change the scrutinee to a borrow so it doesn't move — e.g.
use a borrowed pattern like matches!(&segment, LockedSegment::Proxy(_)) (or
matches!(ref segment, LockedSegment::Proxy(_)) depending on context) so
`segment` remains available for subsequent use in ProxySegment::new and no
ownership is taken.

Comment on lines +52 to +65
/// Regression test for the proxy `deleted_mask` race that drops a live point from scored
/// search (catalog: "exact dense search + payload filter drops a candidate", optimizer-on).
///
/// `deleted_mask` is a snapshot of the wrapped segment's deleted bitvec. If it is synced while
/// the wrapped segment is still appendable, an upsert can still land afterwards at an internal
/// offset past the snapshot; the scored search consults `deleted_mask` and treats every
/// out-of-range offset as deleted (`check_deleted_condition` → `unwrap_or(true)`), silently
/// excluding the live point even though scroll/retrieve still see it.
///
/// [`UnsyncedProxySegment::finalize`] is what reads the mask, so the fix is timing: finalize
/// only once the wrapped segment is frozen, so the mask covers its full final point range. This
/// test exercises both orderings on two parallel segments — finalize-before-race (buggy) vs
/// finalize-after-race (fixed) — entirely at the proxy level, no model-testing harness involved.
#[test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Regression test does not actually exercise filtered search.

Line 53 describes a payload-filtered search failure mode, but Line 124 passes None for filter, so the new test validates unfiltered search only. Please add a concrete filter path in this test so it protects the exact bug scenario.

Also applies to: 117-127, 134-162

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/shard/src/proxy_segment/tests.rs` around lines 52 - 65, The regression
test currently passes None for the filter (so it only checks unfiltered search)
but must exercise the payload-filtered path; update the search invocation(s) in
this test to pass a concrete payload filter that matches the point appended
after the deleted_mask snapshot (e.g., a filter selecting payload key/value you
set on that point such as "tag" == "new"), so the scored search codepath with
filter is exercised; adjust the filter argument(s) where None is passed around
the test (around the search calls in the blocks referenced by the comment, e.g.,
the search/scored_search call sites and any verify helpers in the 117-127 and
134-162 ranges) and ensure the inserted point(s) have the corresponding payload
key/value used by the filter.

Comment thread lib/shard/src/proxy_segment/tests.rs Outdated
coderabbitai[bot]

This comment was marked as resolved.

generall and others added 2 commits May 22, 2026 01:23
new deliberately returns the unsynced UnsyncedProxySegment stage rather
than Self, since a usable ProxySegment only exists once deleted_mask is
synced via finalize().

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…t::new for tests

Instead of allowing clippy::new_ret_no_self on a ProxySegment::new that
returned UnsyncedProxySegment, give the two-phase constructor its natural
home: UnsyncedProxySegment::new returns Self and is what production code
(optimize, snapshot) uses, finalizing under the holder write lock.

ProxySegment::new becomes a #[cfg(feature = "testing")] convenience that
builds and finalizes in one step (returns Self), so existing test call
sites stay terse and don't need an explicit .finalize(). The shard
testing feature is enabled for both shard's own tests and collection's
dev-dependency, and excluded from production builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai Bot May 22, 2026
@agourlay agourlay merged commit c82a476 into dev May 22, 2026
25 of 27 checks passed
@agourlay agourlay deleted the fix-proxy-deleted-mask-race branch May 22, 2026 08:58
generall added a commit that referenced this pull request May 22, 2026
…#9116)

* Fix proxy deleted_mask race dropping live points from scored search

ProxySegment::new snapshots the wrapped segment's deleted_mask while the
optimizer holds only the upgradable-read lock, before the write lock freezes
the segment. An upsert racing onto the still-appendable wrapped segment in that
window lands at an internal offset past the snapshot. The scored search path
(PlainVectorIndex::search) consults the proxy mask in place of the segment's
live deleted state, and check_deleted_condition defaults any out-of-range
offset to deleted (unwrap_or(true)) — so the live point is silently dropped
from filtered KNN while scroll/count/retrieve still return it.

Re-snapshot deleted_mask in the optimizer once the holder write lock is held
(segment frozen) and before the proxy goes live, so the mask covers the
segment's full final point range. The fresh read also captures any deletes
that raced in, closing the ghost direction too.

Adds a proxy-level regression test that reproduces the race without the
model-testing harness.

Co-Authored-By: Claude Opus 4.7 <[email protected]>

* Make proxy deleted_mask sync a type-state, read once

ProxySegment::new now returns UnsyncedProxySegment instead of a
ready-to-use ProxySegment. The deleted_mask snapshot is deferred to
UnsyncedProxySegment::finalize(), which reads the wrapped segment's
deleted bitvec exactly once. The only way to obtain a ProxySegment
(and thus put it in a SegmentHolder) is via finalize(), so the sync
under the holder write lock can no longer be forgotten, and the mask
is no longer read twice (once in new, once in resync).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Allow clippy::new_ret_no_self on ProxySegment::new

new deliberately returns the unsynced UnsyncedProxySegment stage rather
than Self, since a usable ProxySegment only exists once deleted_mask is
synced via finalize().

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Name the real constructor UnsyncedProxySegment::new, keep ProxySegment::new for tests

Instead of allowing clippy::new_ret_no_self on a ProxySegment::new that
returned UnsyncedProxySegment, give the two-phase constructor its natural
home: UnsyncedProxySegment::new returns Self and is what production code
(optimize, snapshot) uses, finalizing under the holder write lock.

ProxySegment::new becomes a #[cfg(feature = "testing")] convenience that
builds and finalizes in one step (returns Self), so existing test call
sites stay terse and don't need an explicit .finalize(). The shard
testing feature is enabled for both shard's own tests and collection's
dev-dependency, and excluded from production builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 <[email protected]>
Co-authored-by: generall <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants