Skip to content

feat: hybrid search fallback via keyword text-match#662

Closed
jphein wants to merge 3 commits intoMemPalace:developfrom
jphein:feat/hybrid-search-fallback
Closed

feat: hybrid search fallback via keyword text-match#662
jphein wants to merge 3 commits intoMemPalace:developfrom
jphein:feat/hybrid-search-fallback

Conversation

@jphein
Copy link
Copy Markdown
Collaborator

@jphein jphein commented Apr 12, 2026

Summary

When vector search returns poor results (best cosine distance > 1.0) or an explicit keyword is provided, fall back to ChromaDB's built-in where_document={"$contains": kw} text-match and merge results.

  • searcher.py: Add _extract_keyword(query) — picks the longest non-stopword token, preferring identifier-looking tokens (digits, underscores, dots, ALLCAPS). Modify search_memories() to accept an optional keyword param, run keyword fallback on poor/empty vector results, merge and deduplicate by ID, sort merged list by distance, and expose keyword_fallback and distance fields in the response.
  • mcp_server.py: Add keyword to mempalace_search input schema and thread it through tool_search()search_memories().
  • tests/test_searcher.py: 12 new tests across TestExtractKeyword (6) and TestHybridSearchFallback (6) — covering auto-trigger on poor results, explicit keyword, deduplication, no-trigger on good results, distance field presence, and graceful failure of keyword query.

Motivation

Vector search can miss exact identifiers (error codes, config keys, function names) when the embedding model lacks domain-specific vocabulary. The keyword fallback catches these cases without requiring a separate full-text index — ChromaDB's where_document.$contains filter is already available.

Test plan

  • python -m pytest tests/test_searcher.py -x -q — all 29 tests pass
  • ruff check mempalace/searcher.py mempalace/mcp_server.py tests/test_searcher.py — clean
  • Keyword fallback triggers when best vector distance > 1.0
  • Explicit keyword= param forces fallback even on good vector results
  • Duplicate IDs from both searches appear only once, sorted by distance

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 12, 2026 02:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a hybrid search mode to MemPalace so searches can fall back to a keyword-based where_document={"$contains": ...} text match when vector similarity is poor (or when an explicit keyword is provided), then merges/dedupes results.

Changes:

  • Implement _extract_keyword(query) and extend search_memories() with optional keyword fallback + merged result sorting/deduplication.
  • Thread keyword through the MCP server tool (mempalace_search) schema and handler path.
  • Add new unit tests covering keyword extraction and hybrid fallback behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
mempalace/searcher.py Adds keyword extraction and keyword fallback query/merge logic; exposes distance and keyword_fallback fields.
mempalace/mcp_server.py Extends mempalace_search tool input schema and passes keyword into search_memories().
tests/test_searcher.py Adds tests for _extract_keyword and hybrid fallback/merge behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/searcher.py
Comment on lines +31 to +40
"""
tokens = re.findall(r"[\w.]+", query.lower())
candidates = [t for t in tokens if t not in _STOPWORDS and len(t) > 2]
if not candidates:
return ""
# Prefer identifier-looking tokens (error codes, config keys, etc.)
ids = [t for t in candidates if re.search(r"\d|_|\.", t) or t.isupper()]
if ids:
return max(ids, key=len)
return max(candidates, key=len)
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_extract_keyword() lowercases the query before tokenization, but the docstring says it prefers ALLCAPS identifier-like tokens. As implemented, t.isupper() will never be true because every token has already been lowercased, so ALLCAPS preference is broken. Consider tokenizing without forcing lowercase (or track both original + normalized forms) so ALLCAPS identifiers can be detected as intended.

Suggested change
"""
tokens = re.findall(r"[\w.]+", query.lower())
candidates = [t for t in tokens if t not in _STOPWORDS and len(t) > 2]
if not candidates:
return ""
# Prefer identifier-looking tokens (error codes, config keys, etc.)
ids = [t for t in candidates if re.search(r"\d|_|\.", t) or t.isupper()]
if ids:
return max(ids, key=len)
return max(candidates, key=len)
Returns the selected token in lowercase for consistent fallback search.
"""
tokens = re.findall(r"[\w.]+", query)
candidates = [(t, t.lower()) for t in tokens if t.lower() not in _STOPWORDS and len(t) > 2]
if not candidates:
return ""
# Prefer identifier-looking tokens (error codes, config keys, etc.)
ids = [
(original, normalized)
for original, normalized in candidates
if re.search(r"\d|_|\.", original) or original.isupper()
]
if ids:
return max(ids, key=lambda item: len(item[0]))[1]
return max(candidates, key=lambda item: len(item[0]))[1]

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — real bug. query.lower() was called before tokenizing, making t.isupper() dead code. Fixed: tokenize original query first for ALLCAPS check, then lowercase for stopword filtering. a8cb521

Comment thread mempalace/searcher.py Outdated
Comment on lines +191 to +209
if kw:
try:
kw_kwargs = {
"query_texts": [query],
"n_results": n_results,
"include": ["documents", "metadatas", "distances"],
"where_document": {"$contains": kw},
}
if where:
kw_kwargs["where"] = where
kw_results = col.query(**kw_kwargs)
kw_docs = kw_results["documents"][0]
kw_metas = kw_results["metadatas"][0]
kw_dists = kw_results["distances"][0]
kw_ids = kw_results["ids"][0] if kw_results.get("ids") else [""] * len(kw_docs)
for kid, kdoc, kmeta, kdist in zip(kw_ids, kw_docs, kw_metas, kw_dists):
kw_hits_by_id[kid] = (kdoc, kmeta, kdist)
except Exception:
pass # keyword fallback is best-effort
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keyword fallback exceptions are currently swallowed with a bare except Exception: pass, which makes it very hard to debug why hybrid search isn’t returning expected keyword matches (e.g., filter syntax errors, backend capability issues). Since fallback is best-effort, it should still avoid failing the request, but it would help to at least logger.debug(..., exc_info=True) (or similar) when the keyword query fails.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — replaced bare except Exception: pass with logger.debug("Keyword fallback query failed for keyword=%r", kw). Still best-effort but visible at debug level. a8cb521

Comment thread mempalace/searcher.py Outdated
return {
"query": query,
"filters": {"wing": wing, "room": room},
"keyword_fallback": kw if kw_hits_by_id else None,
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keyword_fallback field is only set when kw_hits_by_id is non-empty. If a caller provides keyword= (or an extracted keyword is used) but the keyword query returns 0 hits, the response will report keyword_fallback: None even though fallback was attempted. Consider setting keyword_fallback to the keyword used whenever a keyword search is performed (and optionally add a separate boolean/field to indicate whether it produced hits).

Suggested change
"keyword_fallback": kw if kw_hits_by_id else None,
"keyword_fallback": kw if kw else None,

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — keyword_fallback is now set whenever a keyword was attempted, even with 0 hits. Callers can distinguish "no fallback" (None) from "fallback attempted, 0 hits". a8cb521

jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
…d keywords

Addresses Copilot review feedback on MemPalace#662:
- _extract_keyword() now tokenizes BEFORE lowercasing so isupper() works
- Replace bare `except: pass` with logger.debug in keyword fallback
- Report keyword_fallback whenever a keyword was attempted, not only on hits

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@JoyciAkira
Copy link
Copy Markdown

Closing duplicate PR created by network timeout.

3 similar comments
@JoyciAkira
Copy link
Copy Markdown

Closing duplicate PR created by network timeout.

@JoyciAkira
Copy link
Copy Markdown

Closing duplicate PR created by network timeout.

@JoyciAkira
Copy link
Copy Markdown

Closing duplicate PR created by network timeout.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
…d keywords

Addresses Copilot review feedback on MemPalace#662:
- _extract_keyword() now tokenizes BEFORE lowercasing so isupper() works
- Replace bare `except: pass` with logger.debug in keyword fallback
- Report keyword_fallback whenever a keyword was attempted, not only on hits

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@jphein jphein force-pushed the feat/hybrid-search-fallback branch from a8cb521 to 0718f2d Compare April 12, 2026 21:53
jphein and others added 2 commits April 13, 2026 07:19
When vector search returns poor results (best distance > 1.0) or empty,
fall back to ChromaDB text-match using where_document={"$contains": kw}.

- Add _extract_keyword() to pick most distinctive token from query
  (prefers identifiers: digits, dots, underscores, ALLCAPS; longest wins)
- search_memories() accepts optional keyword= param; auto-extracts on
  poor vector results; merges and deduplicates results by ID; sorts by
  distance; exposes keyword_fallback and distance fields in response
- mempalace_search MCP tool: add keyword parameter to input schema,
  thread through to search_memories()
- Add TestExtractKeyword (6 tests) and TestHybridSearchFallback (6 tests)
  covering auto-trigger, explicit keyword, dedup, no-trigger on good
  results, distance field presence, best-effort on keyword query failure

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…d keywords

Addresses Copilot review feedback on MemPalace#662:
- _extract_keyword() now tokenizes BEFORE lowercasing so isupper() works
- Replace bare `except: pass` with logger.debug in keyword fallback
- Report keyword_fallback whenever a keyword was attempted, not only on hits

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@jphein jphein force-pushed the feat/hybrid-search-fallback branch from 0718f2d to 896c5ac Compare April 13, 2026 14:22
Upstream bumped pyproject.toml to 3.2.0 in MemPalace#761 but missed version.py.
Aligns the two so test_version_consistency passes.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented Apr 16, 2026

Closing — upstream v3.3.0 shipped full Okapi-BM25 hybrid search (60% vector + 40% keyword), which supersedes this $contains keyword fallback approach. The BM25 implementation is significantly more robust.

Thanks for the review feedback on this one — it informed the direction even if the specific implementation is now obsolete.

@jphein jphein closed this Apr 16, 2026
jphein added a commit to jphein/mempalace that referenced this pull request Apr 16, 2026
Upstream release includes: closets (searchable index layer), BM25 hybrid
search (60/40 blend), 8-language i18n, halls (content type routing),
cross-wing tunnels, multi-agent file locking, background hook saves,
ChromaBackend abstraction, diary ingest, fact checker, and 29 MCP tools.

Conflict resolution strategy:
- Take upstream for: searcher.py (BM25 supersedes our keyword fallback),
  cli.py (ChromaBackend repair), hook scripts (background allow),
  plugin manifests (version bump to 3.3.0)
- Keep fork additions: max_distance param, tool_reconnect, stale HNSW
  mtime detection, graph cache with write-invalidation, bulk_check_mined,
  epsilon mtime comparison, pagination helpers
- Merge imports: palace.py (logging+contextlib+hashlib),
  palace_graph.py (time+hashlib+json+os), test_miner.py (both sides)
- README: keep fork-specific content, upstream rewrote theirs entirely

Removed fork code superseded by upstream:
- _extract_keyword() and keyword fallback in searcher.py (MemPalace#662 obsolete)
- keyword param from MCP tool_search
- Inline PYMINE transcript mining in hooks (upstream uses mempalace mine)
- Raw chromadb.PersistentClient calls (replaced by ChromaBackend)

Also: update fork-direction.md and TODO-fork-improvements.md with
context-engine, context-builder, CodeSpring, Supadev evaluations.

858 tests pass (1 skipped: test_readme_claims parses upstream README format).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 16, 2026
README: replace 460-line upstream-mirroring README with focused fork-only
content — what's different, what's merged, what's superseded, PR status.
Points to upstream README for full feature docs.

CLAUDE.md: bump version 3.2.0 → 3.3.0, test count 715 → 858, remove
fork changes absorbed by upstream, update PR table (close MemPalace#662, MemPalace#738),
simplify hook architecture notes for v3.3.0 alignment.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@jphein jphein deleted the feat/hybrid-search-fallback branch April 19, 2026 03:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mcp MCP server and tools area/search Search and retrieval enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants