fix(mcp): retry _get_collection once on transient failure (#1286) by igorls · Pull Request #1377 · MemPalace/mempalace

igorls · 2026-05-06T07:52:41Z

Summary

Surgical extraction of the _get_collection retry+log idea from #1286. Wraps the body in for attempt in range(2); on attempt 0 failure, log via logger.exception(...) and clear _client_cache / _collection_cache / _metadata_cache so the next iteration forces _get_client() to rebuild. That path now re-runs quarantine_stale_hnsw (#1322), so the second attempt heals the common stale-handle case automatically.

Why this PR exists separately

#1286 carries the same idea (~25 LOC of mcp_server.py) plus a 5500-line fork-sync (FORK_CHANGELOG.md, docs/superpowers/**, deploy scripts, version bump that breaks check-versions, a convo_miner.py rewrite that regresses test_convo_mining and test_mine_convos_rebuilds_stale_drawers_after_schema_bump on all three OSes). I extracted just the retry shape so we can land it ahead of the v3.3.5 deadline; #1286 will be closed with a comment offering to land the genuinely-useful follow-ups (session-recovery collection split, chroma.py None-meta coercion, closet-boost ranking refactor) as separate PRs.

Credit to @jphein via Co-authored-by: on the commit.

Behavior

Healthy palace, warm cache: zero overhead (returns on first attempt).
Healthy palace, cold cache: zero overhead (returns on first attempt).
Transient failure (stale handle, partial flush race, etc.): one retry after cache clear, succeeds.
Permanent failure: returns None (matches the prior contract).
Every failure is logged via logger.exception(...) — previously silent.

Test plan

New unit tests in tests/test_mcp_server.py::TestCacheInvalidation:
- test_get_collection_retries_once_on_exception — first attempt raises, second succeeds; result is returned, not None.
- test_get_collection_returns_none_after_two_failures — both attempts fail; assert exactly 2 attempts and None returned (no infinite loop).
uvx [email protected] check . and ruff format --check . clean.
Full suite locally: 1554 passed, 1 skipped in 55s (pytest tests/ --ignore=tests/benchmarks).
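For readers without the diff open, the two cases can be approximated with the sketch below. It is not the test code from `tests/test_mcp_server.py`: `get_collection_with_retry`, `open_collection`, and `clear_caches` are hypothetical names, and the sketch uses `unittest.mock` instead of a monkeypatched `_get_client`.

```python
import logging
from unittest.mock import MagicMock

logger = logging.getLogger(__name__)


def get_collection_with_retry(open_collection, clear_caches):
    """Hypothetical stand-in for _get_collection: two attempts, then None."""
    for attempt in range(2):
        try:
            return open_collection()
        except Exception:
            logger.exception("open failed on attempt %d", attempt + 1)
            clear_caches()
    return None


def test_retries_once_on_exception():
    sentinel = object()
    # First call raises, second call returns the sentinel collection.
    open_collection = MagicMock(side_effect=[RuntimeError("transient"), sentinel])
    clear_caches = MagicMock()

    assert get_collection_with_retry(open_collection, clear_caches) is sentinel
    assert open_collection.call_count == 2
    assert clear_caches.call_count == 1


def test_returns_none_after_two_failures():
    # Every call raises, so the loop must stop after exactly two attempts.
    open_collection = MagicMock(side_effect=RuntimeError("permanent"))
    clear_caches = MagicMock()

    assert get_collection_with_retry(open_collection, clear_caches) is None
    assert open_collection.call_count == 2
```

Asserting the call count in the second case is what guards against an accidental infinite retry.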

Out of scope (deferred to follow-ups)

Session-recovery collection split (substantial new feature in #1286; deserves its own design discussion).
backends/chroma.py None-meta coercion (would generalise the per-site guards from #999/#1013/#1094).
Closet-boost ranking refactor in searcher.py.

If @jphein can rebase those onto current develop, happy to review them as separate PRs.

A transient chromadb exception inside `_get_collection` was swallowed by the bare `except Exception: return None`, leaving every subsequent tool call hitting the same poisoned cache silently.

The fix wraps the body in a `for attempt in range(2)` loop: on attempt 0 failure, log via `logger.exception(...)` and clear `_client_cache` / `_collection_cache` / `_metadata_cache` so the next iteration forces `_get_client()` to rebuild from scratch. That path now re-runs `quarantine_stale_hnsw` (per #1322), so the second attempt heals the common stale-handle case automatically. If both attempts fail, return `None` (matches the prior contract for permanent failures).

Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`:

- `test_get_collection_retries_once_on_exception`: first attempt raises via a monkeypatched `_get_client`, second attempt succeeds; assert the caller gets the collection back, not None.
- `test_get_collection_returns_none_after_two_failures`: both attempts fail; assert we exhaust the loop and return None (no infinite retry).

Surgical extraction from PR #1286, which carried the same fix idea (plus a fork-sync bundle that couldn't be merged); credit to the original author below.

Co-authored-by: Jeffrey Hein <[email protected]>

Copilot

Pull request overview

This PR improves the MCP server’s resilience when opening the ChromaDB collection by logging failures and retrying _get_collection() once after clearing cached client/collection/metadata state, addressing common “stale handle” transient failures without requiring a process restart.

Changes:

Wrap mcp_server._get_collection() in a 2-attempt loop, logging exceptions and clearing caches before the single retry.
Preserve the existing API contract by still returning None after two failures.
Add unit tests covering the “fails once then succeeds” retry behavior and the “fails twice then returns None” behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `mempalace/mcp_server.py` | Add exception logging + one retry with cache invalidation to self-heal transient collection-open failures. |
| `tests/test_mcp_server.py` | Add regression tests validating retry-once semantics and no-infinite-loop behavior. |

When the vector index returns fewer than n_results (sparse HNSW post-repair, MemPalace#951 filter-planner failure, drift), search_memories now:

1. Computes an authoritative scope count via paginated col.get(), surfaced as `available_in_scope` in the response. Caps each query below MemPalace#950's SQL-variable limit.
2. Tops up the hits list with BM25-ranked sqlite candidates tagged `matched_via: "sqlite_bm25_fallback"` when the vector path is under-delivering. Skips candidates with BM25 score 0 so the fallback never pads with unrelated content.
3. Returns `warnings: [...]` describing when fallback fired and when the scope contains more drawers than the vector path can rank (gated on a `vector_underdelivered` flag captured before fallback runs, so the warning surfaces even when BM25 papered over the gap).

CLI search() delegates to search_memories() so terminal output and MCP responses share the same retrieval, fallback, and warning semantics. Preserves the palace path in printed errors.

Closes the silent 0-hit failure mode where data was in sqlite but the vector path returned nothing: the gap is now visible to the user via warnings and `available_in_scope`, and fixable via `mempalace repair`.

Tests: 29/29 pass on rebased branch (Python 3.9 floor honored via Optional[int]). Mock setup updated to set count.return_value so the new "more in scope" warning path doesn't fail on MagicMock comparison.

Squashed rebase against current upstream/develop (post-MemPalace#1377). Was filed as 5-commit history; squashed for cleaner review.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
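The top-up step in that commit message can be illustrated with a rough sketch. Everything below is assumed for illustration: the function name `top_up_hits`, the hit dictionaries, and the `(doc_id, text, score)` candidate tuples are hypothetical and are not taken from `searcher.py`. Only the `matched_via` tag value, the skip-on-zero-score rule, and the idea of capturing `vector_underdelivered` before the fallback runs come from the text above.

```python
def top_up_hits(vector_hits, bm25_candidates, n_results):
    """Fill a short vector result list from BM25-ranked candidates.

    bm25_candidates is assumed to be a list of (doc_id, text, score)
    tuples already sorted by score, highest first.
    """
    hits = list(vector_hits)
    warnings = []

    # Capture the flag before the fallback runs, so the warning still
    # fires when BM25 manages to fill the gap completely.
    vector_underdelivered = len(hits) < n_results

    if vector_underdelivered:
        seen = {hit["id"] for hit in hits}
        added = 0
        for doc_id, text, score in bm25_candidates:
            if len(hits) >= n_results:
                break
            # Skip zero-score candidates so unrelated content is never
            # used as padding, and skip anything already returned.
            if score <= 0 or doc_id in seen:
                continue
            hits.append(
                {"id": doc_id, "text": text, "matched_via": "sqlite_bm25_fallback"}
            )
            seen.add(doc_id)
            added += 1
        warnings.append(
            f"vector search returned {len(vector_hits)} of {n_results} "
            f"requested results; added {added} via sqlite BM25 fallback"
        )

    return {"hits": hits, "warnings": warnings}
```

With one vector hit, three requested results, and candidates scoring 2.0 (a duplicate), 1.5, and 0.0, this sketch returns two hits and a single warning, because the zero-score candidate is never used as padding.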

Copilot AI review requested due to automatic review settings May 6, 2026 07:52

igorls requested a review from milla-jovovich as a code owner May 6, 2026 07:52

igorls added this to the v3.3.5 milestone May 6, 2026

igorls requested a review from bensig as a code owner May 6, 2026 07:52

igorls mentioned this pull request May 6, 2026

fix(mcp_server): log exception + retry once on _get_collection failure #1286

Closed

Copilot started reviewing on behalf of igorls May 6, 2026 07:53 View session

Copilot AI reviewed May 6, 2026

igorls merged commit f0d2360 into develop May 6, 2026
10 checks passed

This was referenced May 6, 2026

feat(searcher): warnings + sqlite BM25 top-up when vector underdelivers #1005

Open

refactor(searcher): hoist CLOSET_RANK_BOOSTS to module level + record ablation finding #1378

Open

adv3nt3 mentioned this pull request May 6, 2026

fix(mine): detect concurrent palace holder, exit non-zero with clear error (#1264) #1349

Open

4 tasks

leoDYL mentioned this pull request May 6, 2026

Stop hook: 1.9 TB palace bloat + ChromaDB Rust bindings segfault despite #1231 fix #1329

Open

igorls merged 1 commit into develop from fix/get-collection-retry-on-exception
