feat(searcher): candidate_strategy="union" — BM25 candidates joined with vector pool before hybrid rerank by igorls · Pull Request #1306 · MemPalace/mempalace

igorls · 2026-05-02T03:50:44Z

Summary

Adds an opt-in candidate_strategy="union" to search_memories that pulls top-K BM25-only candidates from sqlite FTS5 and merges them into the hybrid rerank pool. Default behavior ("vector") is unchanged.

The current hybrid path gathers candidates from the vector index only (n_results * 3 over-fetch), then BM25-reranks within them. When the query embeds close to the wrong content semantically, the right doc never enters the rerank pool — no matter how wide the over-fetch. Validated against a ~6K-document mixed corpus (knowledge prose + short structured records): at 30x over-fetch (~5% of the corpus), the target doc still didn't surface for narrative-shaped queries targeting terminology guides. Wider over-fetch isn't the answer; widening the pool's source is.

The concrete failure mode: a narrative-shaped query embeds close to records sharing the same operational vocabulary. A terminology / style guide is BM25-strong for the query but vector-distant. Vector-only candidates don't include it; BM25 never gets to rerank it. The hybrid path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 — the hybrid is worse than its component on the same input.

What changed

New parameter candidate_strategy: str = "vector" on search_memories.
- "vector" (default): historical behavior, no change.
- "union": also fetch top n_results * 3 candidates via the existing _bm25_only_via_sqlite helper, dedupe by source_file, merge into the rerank pool. BM25-only candidates carry distance=None so they're scored on BM25 contribution alone.
_hybrid_rank now handles distance=None explicitly (vec_sim=0) instead of relying on a sentinel value.
New strategies register via _CANDIDATE_MERGERS; dispatch is in _apply_candidate_strategy so search_memories stays under the project's C901 complexity ceiling.

Bench numbers

~6K-doc internal mixed corpus, recall@10, 5 probes spanning policy-exception lookup, temporal-decay, style retrieval, set-difference, and pattern-recognition:

probe	baseline (`"vector"`)	`"union"`	delta
policy-exception	0.00	0.50	+0.50
temporal-decay	0.17	0.50	+0.33
style-retrieval	0.00	1.00	+1.00 (PASSES)
set-difference	0.00–0.06	0.06–0.09	~
pattern-recog	0.64 (stable)	0.50–0.71 (typ. 0.71)	typ. +0.07, occ. −0.14
macro recall	0.16–0.17	0.51–0.56	+0.34 to +0.40

The pattern-recog variance points at a related issue worth its own PR: _hybrid_rank computes BM25 IDF over the candidate set, so adding new candidates re-normalizes BM25 for existing candidates non-monotonically. Stable corpus-wide BM25 would remove this. Out of scope here.

Tests

tests/test_hybrid_candidate_union.py — 6 new tests, all passing:

default behavior unchanged (explicit "vector" matches default)
"union" surfaces a BM25-strong vector-distant doc
"union" doesn't drop docs "vector" would have found
empty-palace handling
invalid candidate_strategy raises ValueError
_hybrid_rank tolerates distance=None

Existing test_hybrid_search.py (5) and test_searcher.py (27) pass — no regressions.

Performance note

Each "union" query adds one sqlite open + FTS5 MATCH + metadata fetch (via the existing _bm25_only_via_sqlite helper, which already runs as the vector_disabled fallback path so the code is well-trodden). Per-query overhead is small but unmeasured at corpus scale. Default stays "vector" until a maintainer characterizes the cost.

Caveats / followups not in this PR

Performance impact on large palaces — needs measurement before flipping the default.
The local-BM25 IDF instability called out in the bench numbers above (_hybrid_rank re-normalizes existing candidates' BM25 when new ones are added) is its own architectural concern. Worth fixing in a separate PR.
MCP surface — exposing candidate_strategy to MCP callers is straightforward (one extra arg) but I haven't done it here. Easy to add once the Python API lands.

Test plan

pytest tests/test_hybrid_candidate_union.py — 6/6
pytest tests/test_hybrid_search.py — 5/5 no regression
pytest tests/test_searcher.py — 27/27 no regression
ruff check clean
ruff format clean

…ing pool Default search behavior is unchanged. Opt-in candidate_strategy="union" also pulls top-K BM25-only candidates from sqlite FTS5 and merges them into the rerank pool, catching docs with strong BM25 signal that the vector index didn't surface in the over-fetch window. Motivation ---------- The current hybrid path gathers candidates from the vector index only (n_results * 3 over-fetch), then BM25-reranks within them. When the query embeds close to the wrong content semantically, the right doc never enters the rerank pool — *no matter how wide the over-fetch*. Tested on a ~6K-document mixed corpus (knowledge prose + short structured records): at *30x* over-fetch (~5% of the corpus) the target doc still didn't surface for narrative-shaped queries targeting terminology guides. Wider over-fetch isn't the answer; widening the pool's *source* is. Concrete failure mode: a narrative-shaped query embeds close to records sharing the same operational vocabulary (other narrative entries in the corpus). A terminology / style guide is BM25-strong for the query (rare keywords the guide repeats) but vector-distant. Vector-only candidates don't include it; BM25 never gets to rerank it. The hybrid path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 — the hybrid is worse than its component on the same input. Behavior change --------------- * New parameter ``candidate_strategy: str = "vector"`` on ``search_memories``. - ``"vector"`` (default): historical behavior, no change. - ``"union"``: also fetch top ``n_results * 3`` candidates via the existing ``_bm25_only_via_sqlite`` helper, dedupe by source_file, merge into the rerank pool. BM25-only candidates carry ``distance=None`` so they're scored on BM25 contribution alone (vec_sim coerces to 0). * ``_hybrid_rank`` now handles ``distance=None`` explicitly, scoring such candidates as vector-unknown (vec_sim=0) rather than treating it as max-distance via shim. * New strategies register via ``_CANDIDATE_MERGERS``; dispatch is in ``_apply_candidate_strategy`` so ``search_memories`` stays under the C901 complexity ceiling. Bench numbers (~6K-doc internal mixed corpus, recall@10, 5 probes spanning policy-exception lookup, temporal-decay, style retrieval, set-difference, and pattern-recognition): baseline ("vector") "union" policy-exception probe 0.00 0.50 +0.50 temporal-decay probe 0.17 0.50 +0.33 style-retrieval probe 0.00 1.00 +1.00 (PASSES) set-difference probe 0.00–0.06 0.06–0.09 ~ pattern-recog probe 0.64 (stable) 0.50–0.71 variance, typ. +0.07 macro recall 0.16–0.17 0.51–0.56 +0.34 to +0.40 The pattern-recog variance points at a related issue worth a separate PR: ``_hybrid_rank`` computes BM25 IDF over the candidate set. Adding new candidates re-normalizes BM25 for *existing* candidates non-monotonically. Stable corpus-wide BM25 would remove this. Out of scope here. Tests ----- ``tests/test_hybrid_candidate_union.py`` (6 tests, all pass): - default behavior unchanged (explicit ``"vector"`` matches default) - ``"union"`` surfaces a BM25-strong vector-distant doc - ``"union"`` doesn't drop docs ``"vector"`` would have found - empty-palace handling - invalid ``candidate_strategy`` raises - ``_hybrid_rank`` tolerates ``distance=None`` Existing ``test_hybrid_search.py`` (5) and ``test_searcher.py`` (27) pass. Performance note ---------------- Each ``"union"`` query adds one sqlite open + FTS5 MATCH + metadata fetch (via the existing ``_bm25_only_via_sqlite`` helper, which already runs as the ``vector_disabled`` fallback path so the code is well-trodden). Per-query overhead is small but unmeasured at corpus scale. Default stays ``"vector"`` until a maintainer characterizes the cost.

Copilot

Pull request overview

This PR adds an opt-in candidate_strategy="union" mode to search_memories, aiming to improve hybrid retrieval by merging BM25/FTS5 candidates into the rerank pool instead of relying on vector candidates alone. It extends the searcher’s hybrid-ranking path and adds targeted tests for the new strategy and distance=None handling.

Changes:

Add candidate_strategy to search_memories with "vector" and "union" dispatch paths.
Allow _hybrid_rank to handle BM25-only candidates by treating distance=None as no vector signal.
Add a new test module covering default behavior, union retrieval, invalid strategies, empty palaces, and missing-distance reranking.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File	Description
`mempalace/searcher.py`	Adds union-candidate merging, strategy dispatch, API docs, and `distance=None` handling in hybrid reranking.
`tests/test_hybrid_candidate_union.py`	Adds regression/integration tests for the new union candidate strategy and BM25-only hybrid reranking behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    seen_sources = {h.get("source_file") for h in hits}
+    for bh in bm25_extra:
+        key = bh.get("source_file")
+        if not key or key == "?" or key in seen_sources:
+            continue


+    seen_sources = {h.get("source_file") for h in hits}
+    for bh in bm25_extra:
+        key = bh.get("source_file")
+        if not key or key == "?" or key in seen_sources:
+            continue


    n_results: int = 5,
    max_distance: float = 0.0,
    vector_disabled: bool = False,
+    candidate_strategy: str = "vector",


+        # The invariant is: union should not lose anything vector found.
+        missing = vec_ids - union_ids
+        assert not missing, f"union dropped docs that vector found: {missing}"
+


+        get_collection(palace, create=True)  # create empty collection
+        result = search_memories("anything", palace, n_results=5, candidate_strategy="union")
+        assert result.get("results", []) == []
+


+    # Candidate strategy hook: optionally widen the rerank pool's *source*
+    # before ranking. Default ("vector") is a no-op; "union" merges top-K
+    # BM25 candidates from sqlite. See `_apply_candidate_strategy`.
+    _apply_candidate_strategy(candidate_strategy, hits, query, palace_path, wing, room, n_results)


+    # before ranking. Default ("vector") is a no-op; "union" merges top-K
+    # BM25 candidates from sqlite. See `_apply_candidate_strategy`.
+    _apply_candidate_strategy(candidate_strategy, hits, query, palace_path, wing, room, n_results)
+


- Dedup union candidates by (full_path, chunk_index), not basename — two files sharing a basename in different dirs no longer collide, and a vector hit on chunk N of a file no longer blocks BM25 from contributing chunk M of the same file. - Validate candidate_strategy at the top of search_memories so invalid values fail consistently, not only when the call routes through the vector path. - Trim hits back to n_results after the union+rerank pool grows; preserves the existing search_memories size contract that the MCP limit parameter is built on. - Skip BM25-only injection when max_distance > 0.0; BM25-only candidates carry distance=None and would silently bypass the caller's strict vector-distance threshold. Adds 4 tests covering: validation under vector_disabled, n_results trim, max_distance honoring, and basename-collision dedup.

MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25 scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.

Catches up on a month of upstream work. Highlights pulled in: - MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool) - MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status - MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW - MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate - MemPalace#1321 cli: honor --palace flag in cmd_init - MemPalace#1314 kg temporal params fix - MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read - MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read - MemPalace#1303 mcp_server: pass embedding_function= on collection reopen - MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json - Various ruff format passes on touched files Conflict resolution (CHANGELOG.md only — code files all auto-merged): - 3.3.5 unreleased section header from upstream kept above 3.3.4 - 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail (upstream's version was a strict subset). xdev patches preserved (still on this branch, untouched by merge): - 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings - 3fad61d fix(config): allow leading dash in wing names Not pushed to origin — run tests locally and decide when to push. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25 scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.

Copilot AI review requested due to automatic review settings May 2, 2026 03:50

igorls requested review from bensig and milla-jovovich as code owners May 2, 2026 03:50

Copilot started reviewing on behalf of igorls May 2, 2026 03:51 View session

Copilot AI reviewed May 2, 2026

View reviewed changes

igorls added area/search Search and retrieval enhancement New feature or request labels May 2, 2026

igorls merged commit 2ad379b into develop May 3, 2026
6 checks passed

igorls deleted the feat/hybrid-candidate-union branch May 3, 2026 06:40

mvalentsev mentioned this pull request May 3, 2026

feat(searcher): wire i18n stop words into BM25 tokenizer (#973) #977

Open

jphein mentioned this pull request May 3, 2026

feat(searcher): warnings + sqlite BM25 top-up when vector underdelivers #1005

Open

lgertel mentioned this pull request May 5, 2026

merge: sync upstream/develop (25 commits) into fork develop durante-tech/dos-mempalace#3

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(searcher): candidate_strategy="union" — BM25 candidates joined with vector pool before hybrid rerank#1306

feat(searcher): candidate_strategy="union" — BM25 candidates joined with vector pool before hybrid rerank#1306
igorls merged 2 commits intodevelopfrom
feat/hybrid-candidate-union

igorls commented May 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

igorls commented May 2, 2026

Summary

What changed

Bench numbers

Tests

Performance note

Caveats / followups not in this PR

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants