Skip to content

feat(searcher): candidate_strategy="union" — BM25 candidates joined with vector pool before hybrid rerank#1306

Merged
igorls merged 2 commits intodevelopfrom
feat/hybrid-candidate-union
May 3, 2026
Merged

feat(searcher): candidate_strategy="union" — BM25 candidates joined with vector pool before hybrid rerank#1306
igorls merged 2 commits intodevelopfrom
feat/hybrid-candidate-union

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented May 2, 2026

Summary

Adds an opt-in candidate_strategy="union" to search_memories that pulls top-K BM25-only candidates from sqlite FTS5 and merges them into the hybrid rerank pool. Default behavior ("vector") is unchanged.

The current hybrid path gathers candidates from the vector index only (n_results * 3 over-fetch), then BM25-reranks within them. When the query embeds close to the wrong content semantically, the right doc never enters the rerank pool — no matter how wide the over-fetch. Validated against a ~6K-document mixed corpus (knowledge prose + short structured records): at 30x over-fetch (~5% of the corpus), the target doc still didn't surface for narrative-shaped queries targeting terminology guides. Wider over-fetch isn't the answer; widening the pool's source is.

The concrete failure mode: a narrative-shaped query embeds close to records sharing the same operational vocabulary. A terminology / style guide is BM25-strong for the query but vector-distant. Vector-only candidates don't include it; BM25 never gets to rerank it. The hybrid path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 — the hybrid is worse than its component on the same input.

What changed

  • New parameter candidate_strategy: str = "vector" on search_memories.
    • "vector" (default): historical behavior, no change.
    • "union": also fetch top n_results * 3 candidates via the existing _bm25_only_via_sqlite helper, dedupe by source_file, merge into the rerank pool. BM25-only candidates carry distance=None so they're scored on BM25 contribution alone.
  • _hybrid_rank now handles distance=None explicitly (vec_sim=0) instead of relying on a sentinel value.
  • New strategies register via _CANDIDATE_MERGERS; dispatch is in _apply_candidate_strategy so search_memories stays under the project's C901 complexity ceiling.

Bench numbers

~6K-doc internal mixed corpus, recall@10, 5 probes spanning policy-exception lookup, temporal-decay, style retrieval, set-difference, and pattern-recognition:

probe baseline ("vector") "union" delta
policy-exception 0.00 0.50 +0.50
temporal-decay 0.17 0.50 +0.33
style-retrieval 0.00 1.00 +1.00 (PASSES)
set-difference 0.00–0.06 0.06–0.09 ~
pattern-recog 0.64 (stable) 0.50–0.71 (typ. 0.71) typ. +0.07, occ. −0.14
macro recall 0.16–0.17 0.51–0.56 +0.34 to +0.40

The pattern-recog variance points at a related issue worth its own PR: _hybrid_rank computes BM25 IDF over the candidate set, so adding new candidates re-normalizes BM25 for existing candidates non-monotonically. Stable corpus-wide BM25 would remove this. Out of scope here.

Tests

tests/test_hybrid_candidate_union.py — 6 new tests, all passing:

  • default behavior unchanged (explicit "vector" matches default)
  • "union" surfaces a BM25-strong vector-distant doc
  • "union" doesn't drop docs "vector" would have found
  • empty-palace handling
  • invalid candidate_strategy raises ValueError
  • _hybrid_rank tolerates distance=None

Existing test_hybrid_search.py (5) and test_searcher.py (27) pass — no regressions.

Performance note

Each "union" query adds one sqlite open + FTS5 MATCH + metadata fetch (via the existing _bm25_only_via_sqlite helper, which already runs as the vector_disabled fallback path so the code is well-trodden). Per-query overhead is small but unmeasured at corpus scale. Default stays "vector" until a maintainer characterizes the cost.

Caveats / followups not in this PR

  1. Performance impact on large palaces — needs measurement before flipping the default.
  2. The local-BM25 IDF instability called out in the bench numbers above (_hybrid_rank re-normalizes existing candidates' BM25 when new ones are added) is its own architectural concern. Worth fixing in a separate PR.
  3. MCP surface — exposing candidate_strategy to MCP callers is straightforward (one extra arg) but I haven't done it here. Easy to add once the Python API lands.

Test plan

  • pytest tests/test_hybrid_candidate_union.py — 6/6
  • pytest tests/test_hybrid_search.py — 5/5 no regression
  • pytest tests/test_searcher.py — 27/27 no regression
  • ruff check clean
  • ruff format clean

…ing pool

Default search behavior is unchanged. Opt-in candidate_strategy="union"
also pulls top-K BM25-only candidates from sqlite FTS5 and merges them
into the rerank pool, catching docs with strong BM25 signal that the
vector index didn't surface in the over-fetch window.

Motivation
----------
The current hybrid path gathers candidates from the vector index only
(n_results * 3 over-fetch), then BM25-reranks within them. When the
query embeds close to the wrong content semantically, the right doc
never enters the rerank pool — *no matter how wide the over-fetch*.
Tested on a ~6K-document mixed corpus (knowledge prose + short structured
records): at *30x* over-fetch (~5% of the corpus) the target doc still
didn't surface for narrative-shaped queries targeting terminology guides.
Wider over-fetch isn't the answer; widening the pool's *source* is.

Concrete failure mode: a narrative-shaped query embeds close to records
sharing the same operational vocabulary (other narrative entries in the
corpus). A terminology / style guide is BM25-strong for the query
(rare keywords the guide repeats) but vector-distant. Vector-only
candidates don't include it; BM25 never gets to rerank it. The hybrid
path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 —
the hybrid is worse than its component on the same input.

Behavior change
---------------
* New parameter ``candidate_strategy: str = "vector"`` on ``search_memories``.
  - ``"vector"`` (default): historical behavior, no change.
  - ``"union"``: also fetch top ``n_results * 3`` candidates via the
    existing ``_bm25_only_via_sqlite`` helper, dedupe by source_file,
    merge into the rerank pool. BM25-only candidates carry
    ``distance=None`` so they're scored on BM25 contribution alone
    (vec_sim coerces to 0).
* ``_hybrid_rank`` now handles ``distance=None`` explicitly, scoring
  such candidates as vector-unknown (vec_sim=0) rather than treating
  it as max-distance via shim.
* New strategies register via ``_CANDIDATE_MERGERS``; dispatch is in
  ``_apply_candidate_strategy`` so ``search_memories`` stays under the
  C901 complexity ceiling.

Bench numbers (~6K-doc internal mixed corpus, recall@10, 5 probes spanning
policy-exception lookup, temporal-decay, style retrieval, set-difference,
and pattern-recognition):

                              baseline ("vector")   "union"
  policy-exception probe        0.00                  0.50    +0.50
  temporal-decay probe          0.17                  0.50    +0.33
  style-retrieval probe         0.00                  1.00    +1.00 (PASSES)
  set-difference probe          0.00–0.06             0.06–0.09  ~
  pattern-recog probe           0.64 (stable)         0.50–0.71  variance, typ. +0.07
  macro recall                  0.16–0.17             0.51–0.56  +0.34 to +0.40

The pattern-recog variance points at a related issue worth a separate PR:
``_hybrid_rank`` computes BM25 IDF over the candidate set. Adding new
candidates re-normalizes BM25 for *existing* candidates non-monotonically.
Stable corpus-wide BM25 would remove this. Out of scope here.

Tests
-----
``tests/test_hybrid_candidate_union.py`` (6 tests, all pass):
- default behavior unchanged (explicit ``"vector"`` matches default)
- ``"union"`` surfaces a BM25-strong vector-distant doc
- ``"union"`` doesn't drop docs ``"vector"`` would have found
- empty-palace handling
- invalid ``candidate_strategy`` raises
- ``_hybrid_rank`` tolerates ``distance=None``

Existing ``test_hybrid_search.py`` (5) and ``test_searcher.py`` (27) pass.

Performance note
----------------
Each ``"union"`` query adds one sqlite open + FTS5 MATCH + metadata
fetch (via the existing ``_bm25_only_via_sqlite`` helper, which already
runs as the ``vector_disabled`` fallback path so the code is
well-trodden). Per-query overhead is small but unmeasured at corpus
scale. Default stays ``"vector"`` until a maintainer characterizes the
cost.
Copilot AI review requested due to automatic review settings May 2, 2026 03:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an opt-in candidate_strategy="union" mode to search_memories, aiming to improve hybrid retrieval by merging BM25/FTS5 candidates into the rerank pool instead of relying on vector candidates alone. It extends the searcher’s hybrid-ranking path and adds targeted tests for the new strategy and distance=None handling.

Changes:

  • Add candidate_strategy to search_memories with "vector" and "union" dispatch paths.
  • Allow _hybrid_rank to handle BM25-only candidates by treating distance=None as no vector signal.
  • Add a new test module covering default behavior, union retrieval, invalid strategies, empty palaces, and missing-distance reranking.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
mempalace/searcher.py Adds union-candidate merging, strategy dispatch, API docs, and distance=None handling in hybrid reranking.
tests/test_hybrid_candidate_union.py Adds regression/integration tests for the new union candidate strategy and BM25-only hybrid reranking behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/searcher.py Outdated
Comment on lines +587 to +591
seen_sources = {h.get("source_file") for h in hits}
for bh in bm25_extra:
key = bh.get("source_file")
if not key or key == "?" or key in seen_sources:
continue
Comment thread mempalace/searcher.py Outdated
Comment on lines +587 to +591
seen_sources = {h.get("source_file") for h in hits}
for bh in bm25_extra:
key = bh.get("source_file")
if not key or key == "?" or key in seen_sources:
continue
Comment thread mempalace/searcher.py
n_results: int = 5,
max_distance: float = 0.0,
vector_disabled: bool = False,
candidate_strategy: str = "vector",
# The invariant is: union should not lose anything vector found.
missing = vec_ids - union_ids
assert not missing, f"union dropped docs that vector found: {missing}"

get_collection(palace, create=True) # create empty collection
result = search_memories("anything", palace, n_results=5, candidate_strategy="union")
assert result.get("results", []) == []

Comment thread mempalace/searcher.py Outdated
Comment on lines +848 to +851
# Candidate strategy hook: optionally widen the rerank pool's *source*
# before ranking. Default ("vector") is a no-op; "union" merges top-K
# BM25 candidates from sqlite. See `_apply_candidate_strategy`.
_apply_candidate_strategy(candidate_strategy, hits, query, palace_path, wing, room, n_results)
Comment thread mempalace/searcher.py
# before ranking. Default ("vector") is a no-op; "union" merges top-K
# BM25 candidates from sqlite. See `_apply_candidate_strategy`.
_apply_candidate_strategy(candidate_strategy, hits, query, palace_path, wing, room, n_results)

@igorls igorls added area/search Search and retrieval enhancement New feature or request labels May 2, 2026
@igorls igorls merged commit 2ad379b into develop May 3, 2026
6 checks passed
@igorls igorls deleted the feat/hybrid-candidate-union branch May 3, 2026 06:40
- Dedup union candidates by (full_path, chunk_index), not basename —
  two files sharing a basename in different dirs no longer collide,
  and a vector hit on chunk N of a file no longer blocks BM25 from
  contributing chunk M of the same file.
- Validate candidate_strategy at the top of search_memories so invalid
  values fail consistently, not only when the call routes through the
  vector path.
- Trim hits back to n_results after the union+rerank pool grows;
  preserves the existing search_memories size contract that the MCP
  limit parameter is built on.
- Skip BM25-only injection when max_distance > 0.0; BM25-only
  candidates carry distance=None and would silently bypass the
  caller's strict vector-distance threshold.

Adds 4 tests covering: validation under vector_disabled, n_results
trim, max_distance honoring, and basename-collision dedup.
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request May 3, 2026
MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25
scoring site inside _bm25_only_via_sqlite that this PR did not cover --
the original wiring landed before MemPalace#1306 existed. Without the plumbing,
two paths silently bypass locale stop_words even when the palace has
MEMPALACE_LANG set:

- vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without
  filtering.
- candidate_strategy="union": BM25 candidates merged into the rerank
  pool come from a tokenizer that ignored the configured locale, so
  the merged-in entries fight the lang-aware _hybrid_rank rerank.

Resolution moves once: stop_words = _resolve_stop_words(lang) is
hoisted before the vector_disabled branch and threaded through
_apply_candidate_strategy, _merge_bm25_union_candidates, and
_bm25_only_via_sqlite into _bm25_scores.

The FTS5 candidate-selection _tokenize at the top of
_bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index
is built with a trigram tokenizer that already mismatches our
\\w{2,} regex, so dropping further tokens there changes selection
semantics in ways outside this PR's scope.

Three tests pin the propagation: a sqlite-backed test for
_bm25_only_via_sqlite -> _bm25_scores, a unit spy on
_apply_candidate_strategy -> merger, and an end-to-end on
search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 4, 2026
Catches up on a month of upstream work. Highlights pulled in:

- MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool)
- MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status
- MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW
- MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate
- MemPalace#1321 cli: honor --palace flag in cmd_init
- MemPalace#1314 kg temporal params fix
- MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read
- MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read
- MemPalace#1303 mcp_server: pass embedding_function= on collection reopen
- MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json
- Various ruff format passes on touched files

Conflict resolution (CHANGELOG.md only — code files all auto-merged):

- 3.3.5 unreleased section header from upstream kept above 3.3.4
- 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new
  MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels
  (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and
  auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail
  (upstream's version was a strict subset).

xdev patches preserved (still on this branch, untouched by merge):
- 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings
- 3fad61d fix(config): allow leading dash in wing names

Not pushed to origin — run tests locally and decide when to push.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request May 6, 2026
MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25
scoring site inside _bm25_only_via_sqlite that this PR did not cover --
the original wiring landed before MemPalace#1306 existed. Without the plumbing,
two paths silently bypass locale stop_words even when the palace has
MEMPALACE_LANG set:

- vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without
  filtering.
- candidate_strategy="union": BM25 candidates merged into the rerank
  pool come from a tokenizer that ignored the configured locale, so
  the merged-in entries fight the lang-aware _hybrid_rank rerank.

Resolution moves once: stop_words = _resolve_stop_words(lang) is
hoisted before the vector_disabled branch and threaded through
_apply_candidate_strategy, _merge_bm25_union_candidates, and
_bm25_only_via_sqlite into _bm25_scores.

The FTS5 candidate-selection _tokenize at the top of
_bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index
is built with a trigram tokenizer that already mismatches our
\\w{2,} regex, so dropping further tokens there changes selection
semantics in ways outside this PR's scope.

Three tests pin the propagation: a sqlite-backed test for
_bm25_only_via_sqlite -> _bm25_scores, a unit spy on
_apply_candidate_strategy -> merger, and an end-to-end on
search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/search Search and retrieval enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants