Conversation
…ing pool
Default search behavior is unchanged. Opt-in candidate_strategy="union"
also pulls top-K BM25-only candidates from sqlite FTS5 and merges them
into the rerank pool, catching docs with strong BM25 signal that the
vector index didn't surface in the over-fetch window.
Motivation
----------
The current hybrid path gathers candidates from the vector index only
(n_results * 3 over-fetch), then BM25-reranks within them. When the
query embeds close to the wrong content semantically, the right doc
never enters the rerank pool — *no matter how wide the over-fetch*.
Tested on a ~6K-document mixed corpus (knowledge prose + short structured
records): at *30x* over-fetch (~5% of the corpus) the target doc still
didn't surface for narrative-shaped queries targeting terminology guides.
Wider over-fetch isn't the answer; widening the pool's *source* is.
Concrete failure mode: a narrative-shaped query embeds close to records
sharing the same operational vocabulary (other narrative entries in the
corpus). A terminology / style guide is BM25-strong for the query
(rare keywords the guide repeats) but vector-distant. Vector-only
candidates don't include it; BM25 never gets to rerank it. The hybrid
path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 —
the hybrid is worse than its component on the same input.
Behavior change
---------------
* New parameter ``candidate_strategy: str = "vector"`` on ``search_memories``.
- ``"vector"`` (default): historical behavior, no change.
- ``"union"``: also fetch top ``n_results * 3`` candidates via the
existing ``_bm25_only_via_sqlite`` helper, dedupe by source_file,
merge into the rerank pool. BM25-only candidates carry
``distance=None`` so they're scored on BM25 contribution alone
(vec_sim coerces to 0).
* ``_hybrid_rank`` now handles ``distance=None`` explicitly, scoring
such candidates as vector-unknown (vec_sim=0) rather than treating
it as max-distance via shim.
* New strategies register via ``_CANDIDATE_MERGERS``; dispatch is in
``_apply_candidate_strategy`` so ``search_memories`` stays under the
C901 complexity ceiling.
Bench numbers (~6K-doc internal mixed corpus, recall@10, 5 probes spanning
policy-exception lookup, temporal-decay, style retrieval, set-difference,
and pattern-recognition):
baseline ("vector") "union"
policy-exception probe 0.00 0.50 +0.50
temporal-decay probe 0.17 0.50 +0.33
style-retrieval probe 0.00 1.00 +1.00 (PASSES)
set-difference probe 0.00–0.06 0.06–0.09 ~
pattern-recog probe 0.64 (stable) 0.50–0.71 variance, typ. +0.07
macro recall 0.16–0.17 0.51–0.56 +0.34 to +0.40
The pattern-recog variance points at a related issue worth a separate PR:
``_hybrid_rank`` computes BM25 IDF over the candidate set. Adding new
candidates re-normalizes BM25 for *existing* candidates non-monotonically.
Stable corpus-wide BM25 would remove this. Out of scope here.
Tests
-----
``tests/test_hybrid_candidate_union.py`` (6 tests, all pass):
- default behavior unchanged (explicit ``"vector"`` matches default)
- ``"union"`` surfaces a BM25-strong vector-distant doc
- ``"union"`` doesn't drop docs ``"vector"`` would have found
- empty-palace handling
- invalid ``candidate_strategy`` raises
- ``_hybrid_rank`` tolerates ``distance=None``
Existing ``test_hybrid_search.py`` (5) and ``test_searcher.py`` (27) pass.
Performance note
----------------
Each ``"union"`` query adds one sqlite open + FTS5 MATCH + metadata
fetch (via the existing ``_bm25_only_via_sqlite`` helper, which already
runs as the ``vector_disabled`` fallback path so the code is
well-trodden). Per-query overhead is small but unmeasured at corpus
scale. Default stays ``"vector"`` until a maintainer characterizes the
cost.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds an opt-in candidate_strategy="union" mode to search_memories, aiming to improve hybrid retrieval by merging BM25/FTS5 candidates into the rerank pool instead of relying on vector candidates alone. It extends the searcher’s hybrid-ranking path and adds targeted tests for the new strategy and distance=None handling.
Changes:
- Add
candidate_strategytosearch_memorieswith"vector"and"union"dispatch paths. - Allow
_hybrid_rankto handle BM25-only candidates by treatingdistance=Noneas no vector signal. - Add a new test module covering default behavior, union retrieval, invalid strategies, empty palaces, and missing-distance reranking.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
mempalace/searcher.py |
Adds union-candidate merging, strategy dispatch, API docs, and distance=None handling in hybrid reranking. |
tests/test_hybrid_candidate_union.py |
Adds regression/integration tests for the new union candidate strategy and BM25-only hybrid reranking behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+587
to
+591
| seen_sources = {h.get("source_file") for h in hits} | ||
| for bh in bm25_extra: | ||
| key = bh.get("source_file") | ||
| if not key or key == "?" or key in seen_sources: | ||
| continue |
Comment on lines
+587
to
+591
| seen_sources = {h.get("source_file") for h in hits} | ||
| for bh in bm25_extra: | ||
| key = bh.get("source_file") | ||
| if not key or key == "?" or key in seen_sources: | ||
| continue |
| n_results: int = 5, | ||
| max_distance: float = 0.0, | ||
| vector_disabled: bool = False, | ||
| candidate_strategy: str = "vector", |
| # The invariant is: union should not lose anything vector found. | ||
| missing = vec_ids - union_ids | ||
| assert not missing, f"union dropped docs that vector found: {missing}" | ||
|
|
| get_collection(palace, create=True) # create empty collection | ||
| result = search_memories("anything", palace, n_results=5, candidate_strategy="union") | ||
| assert result.get("results", []) == [] | ||
|
|
Comment on lines
+848
to
+851
| # Candidate strategy hook: optionally widen the rerank pool's *source* | ||
| # before ranking. Default ("vector") is a no-op; "union" merges top-K | ||
| # BM25 candidates from sqlite. See `_apply_candidate_strategy`. | ||
| _apply_candidate_strategy(candidate_strategy, hits, query, palace_path, wing, room, n_results) |
| # before ranking. Default ("vector") is a no-op; "union" merges top-K | ||
| # BM25 candidates from sqlite. See `_apply_candidate_strategy`. | ||
| _apply_candidate_strategy(candidate_strategy, hits, query, palace_path, wing, room, n_results) | ||
|
|
- Dedup union candidates by (full_path, chunk_index), not basename — two files sharing a basename in different dirs no longer collide, and a vector hit on chunk N of a file no longer blocks BM25 from contributing chunk M of the same file. - Validate candidate_strategy at the top of search_memories so invalid values fail consistently, not only when the call routes through the vector path. - Trim hits back to n_results after the union+rerank pool grows; preserves the existing search_memories size contract that the MCP limit parameter is built on. - Skip BM25-only injection when max_distance > 0.0; BM25-only candidates carry distance=None and would silently bypass the caller's strict vector-distance threshold. Adds 4 tests covering: validation under vector_disabled, n_results trim, max_distance honoring, and basename-collision dedup.
mvalentsev
added a commit
to mvalentsev/mempalace
that referenced
this pull request
May 3, 2026
MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25 scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.
xcarbo
added a commit
to xcarbo/mempalace
that referenced
this pull request
May 4, 2026
Catches up on a month of upstream work. Highlights pulled in: - MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool) - MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status - MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW - MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate - MemPalace#1321 cli: honor --palace flag in cmd_init - MemPalace#1314 kg temporal params fix - MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read - MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read - MemPalace#1303 mcp_server: pass embedding_function= on collection reopen - MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json - Various ruff format passes on touched files Conflict resolution (CHANGELOG.md only — code files all auto-merged): - 3.3.5 unreleased section header from upstream kept above 3.3.4 - 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail (upstream's version was a strict subset). xdev patches preserved (still on this branch, untouched by merge): - 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings - 3fad61d fix(config): allow leading dash in wing names Not pushed to origin — run tests locally and decide when to push. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
4 tasks
mvalentsev
added a commit
to mvalentsev/mempalace
that referenced
this pull request
May 6, 2026
MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25 scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an opt-in
candidate_strategy="union"tosearch_memoriesthat pulls top-K BM25-only candidates from sqlite FTS5 and merges them into the hybrid rerank pool. Default behavior ("vector") is unchanged.The current hybrid path gathers candidates from the vector index only (
n_results * 3over-fetch), then BM25-reranks within them. When the query embeds close to the wrong content semantically, the right doc never enters the rerank pool — no matter how wide the over-fetch. Validated against a ~6K-document mixed corpus (knowledge prose + short structured records): at 30x over-fetch (~5% of the corpus), the target doc still didn't surface for narrative-shaped queries targeting terminology guides. Wider over-fetch isn't the answer; widening the pool's source is.The concrete failure mode: a narrative-shaped query embeds close to records sharing the same operational vocabulary. A terminology / style guide is BM25-strong for the query but vector-distant. Vector-only candidates don't include it; BM25 never gets to rerank it. The hybrid path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 — the hybrid is worse than its component on the same input.
What changed
candidate_strategy: str = "vector"onsearch_memories."vector"(default): historical behavior, no change."union": also fetch topn_results * 3candidates via the existing_bm25_only_via_sqlitehelper, dedupe by source_file, merge into the rerank pool. BM25-only candidates carrydistance=Noneso they're scored on BM25 contribution alone._hybrid_ranknow handlesdistance=Noneexplicitly (vec_sim=0) instead of relying on a sentinel value._CANDIDATE_MERGERS; dispatch is in_apply_candidate_strategysosearch_memoriesstays under the project's C901 complexity ceiling.Bench numbers
~6K-doc internal mixed corpus, recall@10, 5 probes spanning policy-exception lookup, temporal-decay, style retrieval, set-difference, and pattern-recognition:
"vector")"union"The pattern-recog variance points at a related issue worth its own PR:
_hybrid_rankcomputes BM25 IDF over the candidate set, so adding new candidates re-normalizes BM25 for existing candidates non-monotonically. Stable corpus-wide BM25 would remove this. Out of scope here.Tests
tests/test_hybrid_candidate_union.py— 6 new tests, all passing:"vector"matches default)"union"surfaces a BM25-strong vector-distant doc"union"doesn't drop docs"vector"would have foundcandidate_strategyraisesValueError_hybrid_ranktoleratesdistance=NoneExisting
test_hybrid_search.py(5) andtest_searcher.py(27) pass — no regressions.Performance note
Each
"union"query adds one sqlite open + FTS5 MATCH + metadata fetch (via the existing_bm25_only_via_sqlitehelper, which already runs as thevector_disabledfallback path so the code is well-trodden). Per-query overhead is small but unmeasured at corpus scale. Default stays"vector"until a maintainer characterizes the cost.Caveats / followups not in this PR
_hybrid_rankre-normalizes existing candidates' BM25 when new ones are added) is its own architectural concern. Worth fixing in a separate PR.candidate_strategyto MCP callers is straightforward (one extra arg) but I haven't done it here. Easy to add once the Python API lands.Test plan
pytest tests/test_hybrid_candidate_union.py— 6/6pytest tests/test_hybrid_search.py— 5/5 no regressionpytest tests/test_searcher.py— 27/27 no regressionruff checkcleanruff formatclean