fix(mcp): retry _get_collection once on transient failure (#1286)#1377
Merged
fix(mcp): retry _get_collection once on transient failure (#1286)#1377
Conversation
A transient chromadb exception inside `_get_collection` was swallowed by the bare `except Exception: return None`, leaving every subsequent tool call hitting the same poisoned cache silently. The fix wraps the body in a `for attempt in range(2)` loop: on attempt 0 failure, log via `logger.exception(...)` and clear `_client_cache` / `_collection_cache` / `_metadata_cache` so the next iteration forces `_get_client()` to rebuild from scratch — that path now re-runs `quarantine_stale_hnsw` (per #1322), so the second attempt heals the common stale-handle case automatically. If both attempts fail, return `None` (matches the prior contract for permanent failures). Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`: - `test_get_collection_retries_once_on_exception` — first attempt raises via a monkeypatched `_get_client`, second attempt succeeds; assert the caller gets the collection back, not None. - `test_get_collection_returns_none_after_two_failures` — both attempts fail, assert we exhaust the loop and return None (no infinite retry). Surgical extraction from PR #1286, which carried the same fix idea (plus a fork-sync bundle that couldn't be merged); credit to the original author below. Co-authored-by: Jeffrey Hein <[email protected]>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves the MCP server’s resilience when opening the ChromaDB collection by logging failures and retrying _get_collection() once after clearing cached client/collection/metadata state, addressing common “stale handle” transient failures without requiring a process restart.
Changes:
- Wrap
mcp_server._get_collection()in a 2-attempt loop, logging exceptions and clearing caches before the single retry. - Preserve the existing API contract by still returning
Noneafter two failures. - Add unit tests covering the “fails once then succeeds” retry behavior and the “fails twice then returns None” behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
mempalace/mcp_server.py |
Add exception logging + one retry with cache invalidation to self-heal transient collection-open failures. |
tests/test_mcp_server.py |
Add regression tests validating retry-once semantics and no-infinite-loop behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
jphein
added a commit
to jphein/mempalace
that referenced
this pull request
May 6, 2026
When the vector index returns fewer than n_results (sparse HNSW post-repair, MemPalace#951 filter-planner failure, drift), search_memories now: 1. Computes an authoritative scope count via paginated col.get(), surfaced as `available_in_scope` in the response. Caps each query below MemPalace#950's SQL-variable limit. 2. Tops up the hits list with BM25-ranked sqlite candidates tagged `matched_via: "sqlite_bm25_fallback"` when the vector path is under-delivering. Skips candidates with BM25 score 0 so the fallback never pads with unrelated content. 3. Returns `warnings: [...]` describing when fallback fired and when the scope contains more drawers than the vector path can rank (gated on a `vector_underdelivered` flag captured before fallback runs, so the warning surfaces even when BM25 papered over the gap). CLI search() delegates to search_memories() so terminal output and MCP responses share the same retrieval, fallback, and warning semantics. Preserves the palace path in printed errors. Closes the silent 0-hit failure mode where data was in sqlite but the vector path returned nothing — visible to the user via warnings and `available_in_scope`, fixable via `mempalace repair`. Tests: 29/29 pass on rebased branch (Python 3.9 floor honored via Optional[int]). Mock setup updated to set count.return_value so the new "more in scope" warning path doesn't fail on MagicMock comparison. Squashed rebase against current upstream/develop (post-MemPalace#1377). Was filed as 5-commit history; squashed for cleaner review. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
This was referenced May 6, 2026
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Surgical extraction of the
_get_collectionretry+log idea from #1286. Wraps the body infor attempt in range(2); on attempt 0 failure, log vialogger.exception(...)and clear_client_cache/_collection_cache/_metadata_cacheso the next iteration forces_get_client()to rebuild. That path now re-runsquarantine_stale_hnsw(#1322), so the second attempt heals the common stale-handle case automatically.Why this PR exists separately
#1286 carries the same idea (~25 LOC of
mcp_server.py) plus a 5500-line fork-sync (FORK_CHANGELOG.md, docs/superpowers/**, deploy scripts, version bump that breakscheck-versions, aconvo_miner.pyrewrite that regressestest_convo_miningandtest_mine_convos_rebuilds_stale_drawers_after_schema_bumpon all three OSes). I extracted just the retry shape so we can land it ahead of the v3.3.5 deadline; #1286 will be closed with a comment offering to land the genuinely-useful follow-ups (session-recovery collection split,chroma.pyNone-meta coercion, closet-boost ranking refactor) as separate PRs.Credit to @jphein via
Co-authored-by:on the commit.Behavior
None(matches the prior contract).logger.exception(...)— previously silent.Test plan
tests/test_mcp_server.py::TestCacheInvalidation:test_get_collection_retries_once_on_exception— first attempt raises, second succeeds; result is returned, notNone.test_get_collection_returns_none_after_two_failures— both attempts fail; assert exactly 2 attempts andNonereturned (no infinite loop).uvx [email protected] check .andruff format --check .clean.pytest tests/ --ignore=tests/benchmarks).Out of scope (deferred to follow-ups)
backends/chroma.pyNone-meta coercion (would generalise the per-site guards from fix(searcher): guard against None metadata in CLI print path #999/fix: guard Layer3.search_raw against None doc/meta from ChromaDB (#1011) #1013/refactor(backends/chroma): coerce None metadatas to{}at backend boundary (closes #1020) #1094).searcher.py.If @jphein can rebase those onto current develop, happy to review them as separate PRs.