fix(mcp_server): log exception + retry once on _get_collection failure#1286
fix(mcp_server): log exception + retry once on _get_collection failure#1286jphein wants to merge 283 commits intoMemPalace:developfrom
Conversation
- tool_add_drawer and tool_delete_drawer now invalidate _metadata_cache so status/taxonomy reflect changes immediately (review item 3) - Document English-only stopword limitation in _extract_themes - Document venv/ assumption in _mempalace_python editable-install path Co-Authored-By: Claude Opus 4.6 <[email protected]>
…-8 config - _mempalace_python: honor MEMPALACE_PYTHON env var as first priority - searcher: clamp similarity to [0, 1] so L2 distances > 1 don't go negative - config: use encoding="utf-8" + ensure_ascii=False for config writes Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add SKIP_PATTERNS and JUNK_FILE_SIZE (500KB) to miner to exclude minified JS/CSS, bundles, source maps, lockfiles, and large generated files from palace mining - Add `mempalace purge --wing/--room` CLI command for batch deletion with confirmation prompt and batched ChromaDB deletes - Fix status display: use col.count() + paginated metadata fetch instead of hard-capped 10K get(), add thousands separators to output - Update CLAUDE.md test count to 615 Co-Authored-By: Claude Opus 4.6 <[email protected]>
- repair: delete entire palace directory and create fresh PersistentClient instead of same-process delete_collection/create_collection which left corrupted HNSW ghost entries. Adds index verification query after rebuild. - mcp_server: track chroma.sqlite3 inode in _get_client() to detect palace rebuilds on disk. Automatically reconnects when the database file changes, eliminating stale cache after repair/nuke operations. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…skill docs - Remove overly broad `\*[^*]+\*` regex from EMOTION_MARKERS that matched all markdown bold/italic, routing 66% of technical content to emotional room (MemPalace#536) - Replace Unicode checkmark (U+2713) with ASCII '+' in convo_miner and split_mega_files progress output — crashes Windows cp1251/cp1252 (MemPalace#535) - Fix KG path mismatch: MCP server now always uses palace_path for knowledge_graph.sqlite3 instead of diverging default path (MemPalace#538) - Fix SKILL.md: mempalace --version doesn't exist, use mempalace status (MemPalace#534) - Fix init instructions: add --yes flag for agent-friendly non-interactive init (MemPalace#534) Co-Authored-By: Claude Opus 4.6 <[email protected]>
…compress keys - Default new collections to cosine distance (hnsw:space=cosine) instead of L2 — fixes negative similarity scores (MemPalace#568) - Filter unexpected MCP tool args before dispatch — prevents TypeError crash from extra params like top_k (MemPalace#572) - Rotate WAL at 10 MB with one backup to prevent unbounded growth (MemPalace#573) - Fix cmd_compress KeyError: align dict keys with compression_stats() return values (MemPalace#569) - Fix spellcheck _load_known_names: read from "people" key, use dict keys as canonical names (MemPalace#570) - Repair command uses cosine distance for rebuilt collection Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Purge now extracts drawers to keep, nukes the palace directory, and rebuilds with a fresh PersistentClient + cosine HNSW index. This prevents ghost entries from ChromaDB's in-place collection.delete() which caused segfaults on subsequent queries/inserts. - Add FAT/exFAT inode caveat comment on _get_client() detection - Clarify purge --room help: without --wing, purges across ALL wings Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
_extract_content now handles tool_use and tool_result content blocks via _format_tool_use and _format_tool_result helpers. Join changed from space to newline for proper tool block separation. _try_claude_code_jsonl tracks tool_use_map, merges tool-result-only user messages into the previous assistant turn, and merges consecutive assistant messages from multi-turn tool loops. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ation (MemPalace#589) Co-Authored-By: Claude Opus 4.6 <[email protected]>
Fixes ARM64 HNSW segfaults, Python 3.13/3.14 compat. Auto-migrates existing databases, zero API breakage. 648 tests pass, 50K palace verified. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Both save and precompact hooks now auto-mine the JSONL transcript directly into the palace before blocking the AI. This captures raw tool output (Bash results, search findings, build errors) that the AI would otherwise summarize away during its save cycle. - Add MP_PYTHON auto-detection (MEMPAL_PYTHON env → repo venv → system) - Add inline normalize → chunk → upsert pipeline to both hooks - Skip file_already_mined — transcript grows, upsert is idempotent - Update block reason messages to explicitly request verbatim tool output - Precompact hook: parse transcript_path from input, fallback to session_id lookup - Update README with two-layer capture docs and MEMPAL_PYTHON config Co-Authored-By: Claude Opus 4.6 <[email protected]>
…b upgrade - CLAUDE.md: test count 648, fork changes 8-13, PR MemPalace#562, hook descriptions - README.md: hooks section reflects two-layer capture, chromadb >=1.5.4, normalize.py captures tool blocks, MEMPAL_PYTHON env var documented - AGENTS.md: add normalize.py and hooks to key files section - HOOKS_TUTORIAL.md: new sections for two-layer capture and configuration - mempalace/README.md: normalize.py description updated, version.py added - normalize.py: docstring updated for tool_use/tool_result capture Co-Authored-By: Claude Opus 4.6 <[email protected]>
… PRs Resolves merge conflict in mcp_server.py by keeping our improvements (cached metadata, inode detection, WAL rotation, max_distance) and integrating upstream's query_sanitizer (MemPalace#385) and context parameter. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The old block reason said "save to your memory system" which Claude interpreted as its internal ~/.claude/memory/ system instead of the mempalace. Now explicitly names mempalace_add_drawer and mempalace_diary_write MCP tools and includes a negative instruction to NOT use Claude's internal memory. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The negative instruction referenced ~/.claude/memory/ but Claude's auto-memory actually lives at ~/.claude/projects/.../memory/. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Two bugs found when mining ~/.claude/projects/ transcripts: - message field can be a string (not dict) in some JSONL entries - Read tool offset/limit can be strings, causing TypeError on addition Co-Authored-By: Claude Opus 4.6 <[email protected]>
…compress keys
- params null guard: request.get("params") or {} handles explicit null
- L2→cosine in search tool descriptions (collection uses hnsw:space=cosine)
- set_hook_setting mkdir before config write
- Guard empty tool_use_map key in normalize.py
- Correct searcher.py docstring: cosine distance range 0–2, useful range 0.3–1.0
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…emPalace#637) _SAFE_NAME_RE was ASCII-only ([a-zA-Z0-9]), rejecting valid Unicode names like Pēteris, 日本語, Москва in KG writes and MCP tools. Changed to \w which matches Unicode letters/digits in Python 3. Path traversal, null bytes, and control chars still blocked by the explicit checks before the regex. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ssertion - miner.py: move chromadb/palace imports above logger to fix E402 - test_normalize.py: rename ambiguous var 'l' to 'line' (E741) - test_hooks_cli.py: fix Windows assertion — check basename contains 'python' instead of asserting venv-specific path components Co-Authored-By: Claude Opus 4.6 <[email protected]>
Local ruff 0.15 formats differently from CI's pinned 0.4.x range. Reformatted all files to match CI expectations. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Users installing hooks don't know ~/.claude/projects/ contains past JSONL transcripts that can be mined. Added prominent "Backfill Past Conversations" section to hooks README and tutorial. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Hybrid search, graph cache, L1 optimization, decay/forgetting, feedback loops, KG entity resolution, input sanitization. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- New TODO #0: convo_miner files everything to 'sessions' wing, losing project identity from Claude Code transcript paths - Added survey context: why staying on MemPalace, patterns to steal from Karta/Gigabrain/Codex/ByteRover - Future section: contradiction detection, foresight signals, query classification (simpler versions of Karta's mechanisms) - Updated "not pursuing" with dream engine, dual ANN, event sourcing Co-Authored-By: Claude Opus 4.6 <[email protected]>
Hybrid search (TODO #1): when vector results are poor (best distance > 1.0), automatically falls back to keyword text-match via ChromaDB where_document.$contains. Extracts most distinctive non-stopword token from query, or accepts explicit keyword param. Results merged, deduped, sorted by distance. MCP server exposes new keyword parameter. Wing fix (TODO #0): _ingest_transcript() now derives project wing from Claude Code transcript path (-Projects-<name>/) instead of hardcoding "sessions". Per-project search now finds auto-mined content. 692 tests pass (22 new). Co-Authored-By: Claude Opus 4.6 <[email protected]>
There was a problem hiding this comment.
Pull request overview
Improves the resilience and observability of the MCP server’s ChromaDB collection accessor (_get_collection) by logging failures and attempting a one-time self-healing retry after clearing cached state.
Changes:
- Log
_get_collectionexceptions instead of silently returningNone. - Retry once after clearing cached client/collection/metadata state to recover from transient failures.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| except Exception as e: | ||
| logger.error( | ||
| "_get_collection attempt %d failed (palace=%s): %s", | ||
| attempt + 1, _config.palace_path, e, |
There was a problem hiding this comment.
The error log here only includes str(e) and omits the traceback, which makes it much harder to debug the underlying ChromaDB failure. Consider using logger.exception(...) (or logger.error(..., exc_info=True)) so the full stack trace is emitted with the message.
| except Exception as e: | |
| logger.error( | |
| "_get_collection attempt %d failed (palace=%s): %s", | |
| attempt + 1, _config.palace_path, e, | |
| except Exception: | |
| logger.exception( | |
| "_get_collection attempt %d failed (palace=%s)", | |
| attempt + 1, _config.palace_path, |
| for attempt in range(2): | ||
| try: | ||
| client = _get_client() | ||
| if create: | ||
| # hnsw:num_threads=1 disables ChromaDB's multi-threaded ParallelFor | ||
| # HNSW insert path, which has a race in repairConnectionsForUpdate / | ||
| # addPoint (see issues #974, #965). Set via metadata on fresh | ||
| # collections and re-applied via _pin_hnsw_threads() for legacy | ||
| # palaces whose collections were created before this fix (the | ||
| # runtime config does not persist cross-process in chromadb 1.5.x, | ||
| # so the retrofit runs every time _get_collection opens a cache). | ||
| raw = client.get_or_create_collection( | ||
| _config.collection_name, | ||
| metadata={ | ||
| "hnsw:space": "cosine", | ||
| "hnsw:num_threads": 1, | ||
| **_HNSW_BLOAT_GUARD, | ||
| }, | ||
| ) | ||
| _pin_hnsw_threads(raw) | ||
| _collection_cache = ChromaCollection(raw) | ||
| _metadata_cache = None | ||
| _metadata_cache_time = 0 | ||
| elif _collection_cache is None: | ||
| raw = client.get_collection(_config.collection_name) | ||
| _pin_hnsw_threads(raw) | ||
| _collection_cache = ChromaCollection(raw) | ||
| _metadata_cache = None | ||
| _metadata_cache_time = 0 | ||
| return _collection_cache | ||
| except Exception as e: | ||
| logger.error( | ||
| "_get_collection attempt %d failed (palace=%s): %s", | ||
| attempt + 1, _config.palace_path, e, | ||
| ) | ||
| _pin_hnsw_threads(raw) | ||
| _collection_cache = ChromaCollection(raw) | ||
| _metadata_cache = None | ||
| _metadata_cache_time = 0 | ||
| elif _collection_cache is None: | ||
| raw = client.get_collection(_config.collection_name) | ||
| _pin_hnsw_threads(raw) | ||
| _collection_cache = ChromaCollection(raw) | ||
| _metadata_cache = None | ||
| _metadata_cache_time = 0 | ||
| return _collection_cache | ||
| except Exception: | ||
| return None | ||
| if attempt == 0: | ||
| _client_cache = None | ||
| _collection_cache = None | ||
| _metadata_cache = None | ||
| _metadata_cache_time = 0 | ||
| return None |
There was a problem hiding this comment.
This introduces new retry/self-healing behavior (clearing caches and retrying once) but there’s no unit test to lock in the expected semantics (e.g., first call raises, caches are cleared, second call succeeds; and the both-attempts-fail case still returns None). Since tests/test_mcp_server.py already covers related cache invalidation behavior, adding a small monkeypatch-based test would help prevent regressions.
… main The "fix: _get_client() get-then-create guard" row in the fork-ahead table was misleading — the guard landed in fork as d3a2d22 on 2026-04-18 but was silently dropped during the develop merge-conflict resolution in 7e18a70 on 2026-04-25. Current fork main does not carry it; we lean on metadata consistency (hnsw:space + hnsw:num_threads:1 + _HNSW_BLOAT_GUARD applied uniformly across opens) to sidestep the ChromaDB 1.5.x metadata-mismatch SIGSEGV without the defensive guard. Issue MemPalace#1089 documented the pattern on 2026-04-21. PR MemPalace#1262 (filed 2026-04-28 by @Legion345) implements the canonical version — try get_collection / except chromadb.errors.NotFoundError → create — which is an improvement over our older InvalidCollectionException reference (NotFoundError is the renamed-in-1.5.x exception class). Once MemPalace#1262 merges, this row clears via develop sync. Until then the fork is exposed if upstream ever changes default hnsw:* metadata between releases — the same drift shape that caused the original crash.
…nding queue; trim patch Filed four upstream PRs on 2026-04-30: - rboarescu#15 feat: GET /viz status dashboard (stacks on rboarescu#13) - rboarescu#16 feat: GET /list — query-free metadata browse - rboarescu#17 feat: DELETE /memory + PATCH /memory - rboarescu#18 feat(lifespan): auto-migrate Stop-hook checkpoints on startup Also rebased PR rboarescu#13 onto upstream/main to clear a CHANGELOG conflict left by upstream's b4aee82 (patch sync) — state went CONFLICTING -> MERGEABLE / CLEAN. README: - Open upstream PRs table: four new rows (rboarescu#15-rboarescu#18) plus a 2026-04-30 note covering today's rebase + new PRs in one breath. - Pending PRs queue: now empty. Replaced the four stale rows (event-log-frame and graph-endpoint were already in flight via rboarescu#11 and rboarescu#13; mempal-fast.py was already upstream via the merged PR #4 omnibus; /viz is now PR rboarescu#15) with a brief empty-state note. CLAUDE.md: - Patch description trimmed to reflect that the hnsw:num_threads enforcement landed upstream via _pin_hnsw_threads(); only the log + retry-once slice remains. patches/mcp_server_get_collection.patch: - Regenerated against current mempalace develop. The patch is now just the "log exception + retry once on cache failure" change. Filed upstream as MemPalace/mempalace#1286; once that merges this patch retires. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
README: - Intro paragraph: appended GET /list (PR rboarescu#16), DELETE/PATCH /memory (PR rboarescu#17), and lifespan auto-migrate (PR rboarescu#18) to "what this fork adds" since they're already on fork main even though the PRs are awaiting review upstream. - Links bar: added docs/typescript-port-plan.md. - Added a "Cross-repo coordination" subsection under "Recently landed in upstream" — calls out the local patches/ directory and the in-flight MemPalace/mempalace#1286 that retires it. Also references #1142 (RELEASING.md) for completeness. - Requirements: dropped the stale "kind= searcher filter" justification for recommending the fork (kind= was retired in 1.7.1); replaced with daemon-strict hook mode + warnings/sqlite-fallback search path as the actual current reasons. Added the patches/ re-apply step. - API table: added /list, DELETE /memory/{id}, PATCH /memory/{id} rows so the table reflects what the fork main actually exposes. - Sources: cross-repo PR note for #1286. CHANGELOG: - Added [Unreleased] section: patch trim (Maintenance), TS port plan doc, hook-routing-fix.md status header, and a Docs entry covering today's README updates + PR rboarescu#13 rebase context. docs/hook-routing-fix.md: - Added a Status: SHIPPED header pointing at 62425e3 (2026-04-24 clients/hook.py introduction) and clarifying mempal-fast.py is the simpler successor — same shape as docs/graph-endpoint.md's status header. No daemon behaviour changes.
4b2dfd6 to
2038269
Compare
Two issues:
1. logger.error("...", str(e)) omitted the traceback, making the
underlying ChromaDB failure hard to debug. Switch to logger.exception
so the full stack trace lands in the log alongside the message.
Same context lines, just with exc_info coming from sys.exc_info()
automatically.
2. No unit test pinned the new retry/self-healing semantics. Added
TestGetCollectionRetry (tests/test_mcp_server.py) with three cases:
- test_retry_succeeds_on_second_attempt: first _get_client raises,
caches are cleared between attempts (verified by inspecting
planted sentinels), second attempt returns the wrapped collection.
- test_returns_none_when_both_attempts_fail: two failures in a row
return None and stop after exactly two attempts.
- test_happy_path_does_not_retry: a successful first attempt
returns immediately and never invokes _get_client a second time.
All three use monkeypatch on _get_client / ChromaCollection /
_pin_hnsw_threads so they don't need a real ChromaDB; they run in
~0.4s end-to-end and don't touch disk.
…call sites The earlier 2026-04-30 wording said current fork main calls get_or_create_collection with stable/consistent metadata, sidestepping the ChromaDB 1.5.x metadata-mismatch SIGSEGV without carrying the defensive guard. That's not actually true. Audited the four get_or_create_collection call sites on current fork main: - backends/chroma.py:1040 full hnsw:* set (space + num_threads + bloat-guard) - mcp_server.py:295 full hnsw:* set - migrate.py:235 no metadata - cli.py:905 no metadata The trigger condition for the 1.5.x metadata-mismatch SIGSEGV (call- site metadata diverging from stored collection metadata) is structurally present. The fork's apparent stability is presumably because the no-metadata sites operate on different collections (mempalace_compressed, the migrate temp collection) than the full- metadata sites (mempalace_drawers) — so the divergence never reaches the same row in the same metadata table. That's a load-bearing assumption worth flagging: it's circumstantial, not structural. Once MemPalace#1262 (Legion345) lands, audit all four sites together so the guard wraps every site rather than just chroma.py.
|
Pushed
Also fixed a |
The 2026-04-30 wordings in 393ac89 ("metadata is consistent across opens sidesteps the SIGSEGV") and 260e2a2 ("non-uniform metadata across four call sites, trigger condition structurally present") both missed the mark. Third try here, traced from the code: - chromadb's get_or_create_collection is called directly from two places: backends/chroma.py:1061 and mcp_server.py:295. Both pass identical metadata (hnsw:space, hnsw:num_threads: 1, bloat-guard). - The fork's ChromaBackend.get_or_create_collection(path, name) at chroma.py:1101 is a 2-arg legacy shim that delegates via self.get_collection(palace_path, name, create=True) to the same metadata-passing line. - migrate.py:237 and cli.py:1002 call that 2-arg shim, so they reach chromadb with full metadata too. So metadata IS consistent across all five reachable call sites, not divergent. The earlier "non-uniform" claim mistook the fork's API surface for chromadb's call signature. The residual SIGSEGV exposure is legacy palaces created under older chromadb whose stored metadata differs from the current expanded set — a per-palace-history risk, not a structural call-graph one. Live MemPalace#1262 comment and MemPalace#1089 issue body have been re-PATCHed to match. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… update sync status MemPalace#1262 (Legion345) is path 1 of MemPalace#1089's "interim guard PR" — adds get-then-create wrapping in chromadb backend. Shepherding via review comment posted 2026-04-30. Once it merges, fork-ahead Row 15 clears via develop sync. MemPalace#1286 (our PR, filed 2026-04-30) is the _get_collection retry-once + log-on-failure improvement. Adjacent to Row 15 — when both MemPalace#1262 and MemPalace#1286 land, the _get_collection path is both crash-resilient and self-healing. Status line refreshed: develop has moved to fdfaf01 (Gemini CLI normalize MemPalace#1234, privacy consent MemPalace#1233, both 2026-04-27); next sync will clear those plus row 15 once MemPalace#1262 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Fork main has been failing CI lint since 2026-04-30 04:04 UTC. Three real issues + 6 format-needs files. All pre-existing on main before today's doc work but inherited by every commit since. Cleared in one pass: - F401 mempalace/miner.py:19 — unused `typing.Optional` import, auto-fixed by `ruff --fix`. - F811 tests/test_searcher.py — `class TestBM25NoneSafety` was literally duplicated (lines 205 and 243, identical name + docstring + body). Likely copy-paste during 7ba28dc (kind= filter retirement). Deleted the second copy; verified the surviving class covers the same 3 None-safety tests. - C901 mempalace/searcher.py:721 — `search_memories` complexity 27 vs 25 ceiling. Added `# noqa: C901` with reason: this fork-only function is a fallback orchestration with several legitimate branches (BM25 top-up, warnings, closet-boost) and 27 isn't pathological; refactor would just split for the linter, not for readability. - ruff format applied to 6 files (mcp_server, miner, searcher, render-docs, test_mcp_server, test_migrate) — mechanical reformat from prior commits that didn't run format. `ruff check .` and `ruff format --check .` both clean. Searcher test suite green (29/29). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Thanks @jphein — retry-once with cache-clear on One blocker: rebase against Once rebased and CI is green we can merge for v3.3.5. |
# Conflicts: # mempalace/mcp_server.py # mempalace/searcher.py # tests/test_mcp_server.py
…Palace#1289/MemPalace#1303 After the 2026-05-03 develop sync to commit 1888b67: - Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303 plumbed embedding_function= through the reopen path. - Trailing tracking paragraph updated (sync target, count of merged/ open/closed PRs, list of features brought in). - Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with rebase queued, MemPalace#1289 + MemPalace#1303 added as merged. Note: fork-only _get_session_recovery_collection still uses the older get_or_create_collection pattern; theoretical SIGSEGV exposure on legacy recovery collections only. Tracked as follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…attern to recovery collection `_get_session_recovery_collection` (fork-only, introduced for the checkpoint collection split in row 23) used the older `get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just fixed in `_get_collection`. Same theoretical SIGSEGV class on chromadb 1.5.x when stored metadata diverges from the call site, plus the `embedding_function=` omission MemPalace#1303 caught. Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()` and the create-path uses try `get_collection` / except `NotFoundError` → `create_collection`. Recovery collection's metadata is intentionally lighter than the main collection's (no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is small enough that segment bloat isn't a concern. Existing `test_session_recovery.py` covers happy paths; 83/83 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
A small improvement to `_get_collection`: log the exception on failure (instead of returning `None` silently) and retry once after clearing the cached client and collection. Behaviour on the happy path is unchanged. ## Why The current shape (`try: ... except Exception: return None`) has two costs in practice on a long-running deployment: 1. Invisible incidents. When a transient ChromaDB error poisons the cached `_client_cache` / `_collection_cache`, every subsequent call returns `None` but the operator never sees the underlying exception in the log. Debugging has to start from the symptoms — 404s, "no palace" responses — without the actual stack. 2. No self-healing. Once the cache is stale, the only way back to a working state is a process restart. A single retry that clears the caches first resolves the most common shape (a transient error left bad state behind) without operator intervention. ## What changes - One `logger.exception(...)` per failed attempt, including the palace path so multi-palace deployments can disambiguate. - One retry after the first failure, with `_client_cache`, `_collection_cache`, `_metadata_cache`, `_metadata_cache_time` reset before the second attempt so the retry rebuilds from scratch rather than reusing a poisoned handle. - Happy path (first attempt succeeds) is unchanged: same return, same caching, no extra log line. ## Why this is still useful after MemPalace#1289 / MemPalace#1303 MemPalace#1289 split `get_or_create_collection` into `get_collection` + `create_collection` (the metadata-mismatch SIGSEGV path); MemPalace#1303 plumbed `embedding_function=` through both reopen branches. Those fix specific *crash* shapes. This PR addresses the *aftermath* — when an error of any class poisons the client/collection cache and every subsequent call returns `None` invisibly. Retry-once self-heals; logging makes the underlying exception visible. The retry loop sits *above* the get-then-create logic, so retries benefit from MemPalace#1289's correct call shape and MemPalace#1303's EF binding. ## Rebased on top of MemPalace#1289 + MemPalace#1303 This PR was filed before MemPalace#1289/MemPalace#1303 merged. The original two commits (2038269 + 757eb7a) wrapped the older `get_or_create` body. Rebased onto the new `try get_collection / except NotFoundError → create_collection` body — the retry/log envelope is unchanged, only the inner body picks up the merged fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
757eb7a to
51780ce
Compare
|
Rebased onto develop after #1289 and #1303 merged. The retry/log envelope is unchanged; the inner body now sits on top of the get-then-create + |
|
Hi @jphein — thanks for carrying this; the underlying retry+log idea is sound and composes cleanly with #1322 (clearing I'd like to land that piece for v3.3.5 (due 2026-05-08), but this PR can't merge as-is — it carries fork-only files ( I opened #1377 with just the surgical Closing this one in favour of #1377 — sorry for the friction. The other genuinely-useful pieces buried in this PR are very welcome as separate follow-ups against current develop:
If you'd rather I cherry-pick any of those into surgical PRs the way I did with the retry, ping me and I will. Either way thanks for the diagnosis on the underlying bug — easy to miss. |
…xception fix(mcp): retry _get_collection once on transient failure (#1286)
…1286) A transient chromadb exception inside `_get_collection` was swallowed by the bare `except Exception: return None`, leaving every subsequent tool call hitting the same poisoned cache silently. The fix wraps the body in a `for attempt in range(2)` loop: on attempt 0 failure, log via `logger.exception(...)` and clear `_client_cache` / `_collection_cache` / `_metadata_cache` so the next iteration forces `_get_client()` to rebuild from scratch — that path now re-runs `quarantine_stale_hnsw` (per MemPalace#1322), so the second attempt heals the common stale-handle case automatically. If both attempts fail, return `None` (matches the prior contract for permanent failures). Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`: - `test_get_collection_retries_once_on_exception` — first attempt raises via a monkeypatched `_get_client`, second attempt succeeds; assert the caller gets the collection back, not None. - `test_get_collection_returns_none_after_two_failures` — both attempts fail, assert we exhaust the loop and return None (no infinite retry). Surgical extraction from PR MemPalace#1286, which carried the same fix idea (plus a fork-sync bundle that couldn't be merged); credit to the original author below. Co-authored-by: Jeffrey Hein <[email protected]>
A small improvement to
_get_collection: log the exception on failure (instead of returningNonesilently) and retry once after clearing the cached client and collection. Behaviour on the happy path is unchanged.Why
The current shape (
try: ... except Exception: return None) has two costs in practice on a long-running deployment:_client_cache/_collection_cache, every subsequent call returnsNonebut the operator never sees the underlying exception in the log. Debugging has to start from the symptoms — 404s, "no palace" responses — without the actual stack.What changes
logger.error(...)per failed attempt, including the palace path so multi-palace deployments can tell which one is hurting._client_cacheis now in theglobaldeclaration so the second attempt actually reopens the client from scratch.Noneif both attempts fail. Existing callers that handleNone(there are several throughoutmcp_server.py) keep working unchanged.What's intentionally not in this PR
num_threads=1retrofit and_pin_hnsw_threads()call inside the function are untouched. That work landed separately and is unrelated.mock.patchon_get_clientraising once?).Where it ran
This has been running as a local patch on top of the pipx install on a 151K-drawer palace fronting
palace-daemonsince 2026-04-25. The retry path fired twice during HNSW segment-quarantine windows after a daemon restart — both times the second attempt succeeded and traffic kept flowing without intervention. No false-positive retries observed (the happy path takes thereturn _collection_cacheexit on the first iteration).I've been carrying this as a sed-applied patch in
palace-daemon/patches/; happy to retire that and just depend on a release line that has this in it once it lands. Equally happy to rework the shape (different log level, structured logging, no retry at all and just the log) — let me know what fits best.Thanks for taking a look!