fix(mcp_server): log exception + retry once on _get_collection failure by jphein · Pull Request #1286 · MemPalace/mempalace

jphein · 2026-04-30T17:06:15Z

A small improvement to _get_collection: log the exception on failure (instead of returning None silently) and retry once after clearing the cached client and collection. Behaviour on the happy path is unchanged.

Why

The current shape (try: ... except Exception: return None) has two costs in practice on a long-running deployment:

Invisible incidents. When a transient ChromaDB error poisons the cached _client_cache / _collection_cache, every subsequent call returns None but the operator never sees the underlying exception in the log. Debugging has to start from the symptoms — 404s, "no palace" responses — without the actual stack.
No self-healing. Once the cache is stale, the only way back to a working state is a process restart. A single retry that clears the caches first resolves the most common shape (a transient error left bad state behind) without operator intervention.

What changes

for attempt in range(2):
    try:
        client = _get_client()
        # ...same body as today...
        return _collection_cache
    except Exception as e:
        logger.error(
            "_get_collection attempt %d failed (palace=%s): %s",
            attempt + 1, _config.palace_path, e,
        )
        if attempt == 0:
            _client_cache = None
            _collection_cache = None
            _metadata_cache = None
            _metadata_cache_time = 0
return None

One logger.error(...) per failed attempt, including the palace path so multi-palace deployments can tell which one is hurting.
After the first failure, all four caches reset before the second attempt — _client_cache is now in the global declaration so the second attempt actually reopens the client from scratch.
The function still returns None if both attempts fail. Existing callers that handle None (there are several throughout mcp_server.py) keep working unchanged.
Happy path: zero log lines, no behaviour change. The first iteration returns immediately on success.

What's intentionally not in this PR

The HNSW num_threads=1 retrofit and _pin_hnsw_threads() call inside the function are untouched. That work landed separately and is unrelated.
No new dependencies, no new config keys, no test changes. The retry path is timing-dependent (a transient failure followed by a recovery) and hard to fake reliably; happy to add a unit test if there's a preferred shape (mock.patch on _get_client raising once?).

Where it ran

This has been running as a local patch on top of the pipx install on a 151K-drawer palace fronting palace-daemon since 2026-04-25. The retry path fired twice during HNSW segment-quarantine windows after a daemon restart — both times the second attempt succeeded and traffic kept flowing without intervention. No false-positive retries observed (the happy path takes the return _collection_cache exit on the first iteration).

I've been carrying this as a sed-applied patch in palace-daemon/patches/; happy to retire that and just depend on a release line that has this in it once it lands. Equally happy to rework the shape (different log level, structured logging, no retry at all and just the log) — let me know what fits best.

Thanks for taking a look!

- tool_add_drawer and tool_delete_drawer now invalidate _metadata_cache so status/taxonomy reflect changes immediately (review item 3) - Document English-only stopword limitation in _extract_themes - Document venv/ assumption in _mempalace_python editable-install path Co-Authored-By: Claude Opus 4.6 <[email protected]>

…-8 config - _mempalace_python: honor MEMPALACE_PYTHON env var as first priority - searcher: clamp similarity to [0, 1] so L2 distances > 1 don't go negative - config: use encoding="utf-8" + ensure_ascii=False for config writes Co-Authored-By: Claude Opus 4.6 <[email protected]>

- Add SKIP_PATTERNS and JUNK_FILE_SIZE (500KB) to miner to exclude minified JS/CSS, bundles, source maps, lockfiles, and large generated files from palace mining - Add `mempalace purge --wing/--room` CLI command for batch deletion with confirmation prompt and batched ChromaDB deletes - Fix status display: use col.count() + paginated metadata fetch instead of hard-capped 10K get(), add thousands separators to output - Update CLAUDE.md test count to 615 Co-Authored-By: Claude Opus 4.6 <[email protected]>

- repair: delete entire palace directory and create fresh PersistentClient instead of same-process delete_collection/create_collection which left corrupted HNSW ghost entries. Adds index verification query after rebuild. - mcp_server: track chroma.sqlite3 inode in _get_client() to detect palace rebuilds on disk. Automatically reconnects when the database file changes, eliminating stale cache after repair/nuke operations. Co-Authored-By: Claude Opus 4.6 <[email protected]>

…skill docs - Remove overly broad `\*[^*]+\*` regex from EMOTION_MARKERS that matched all markdown bold/italic, routing 66% of technical content to emotional room (MemPalace#536) - Replace Unicode checkmark (U+2713) with ASCII '+' in convo_miner and split_mega_files progress output — crashes Windows cp1251/cp1252 (MemPalace#535) - Fix KG path mismatch: MCP server now always uses palace_path for knowledge_graph.sqlite3 instead of diverging default path (MemPalace#538) - Fix SKILL.md: mempalace --version doesn't exist, use mempalace status (MemPalace#534) - Fix init instructions: add --yes flag for agent-friendly non-interactive init (MemPalace#534) Co-Authored-By: Claude Opus 4.6 <[email protected]>

…compress keys - Default new collections to cosine distance (hnsw:space=cosine) instead of L2 — fixes negative similarity scores (MemPalace#568) - Filter unexpected MCP tool args before dispatch — prevents TypeError crash from extra params like top_k (MemPalace#572) - Rotate WAL at 10 MB with one backup to prevent unbounded growth (MemPalace#573) - Fix cmd_compress KeyError: align dict keys with compression_stats() return values (MemPalace#569) - Fix spellcheck _load_known_names: read from "people" key, use dict keys as canonical names (MemPalace#570) - Repair command uses cosine distance for rebuilt collection Co-Authored-By: Claude Opus 4.6 <[email protected]>

- Purge now extracts drawers to keep, nukes the palace directory, and rebuilds with a fresh PersistentClient + cosine HNSW index. This prevents ghost entries from ChromaDB's in-place collection.delete() which caused segfaults on subsequent queries/inserts. - Add FAT/exFAT inode caveat comment on _get_client() detection - Clarify purge --room help: without --wing, purges across ALL wings Co-Authored-By: Claude Opus 4.6 <[email protected]>

Co-Authored-By: Claude Opus 4.6 <[email protected]>

_extract_content now handles tool_use and tool_result content blocks via _format_tool_use and _format_tool_result helpers. Join changed from space to newline for proper tool block separation. _try_claude_code_jsonl tracks tool_use_map, merges tool-result-only user messages into the previous assistant turn, and merges consecutive assistant messages from multi-turn tool loops. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Co-Authored-By: Claude Opus 4.6 <[email protected]>

…ation (MemPalace#589) Co-Authored-By: Claude Opus 4.6 <[email protected]>

Fixes ARM64 HNSW segfaults, Python 3.13/3.14 compat. Auto-migrates existing databases, zero API breakage. 648 tests pass, 50K palace verified. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Both save and precompact hooks now auto-mine the JSONL transcript directly into the palace before blocking the AI. This captures raw tool output (Bash results, search findings, build errors) that the AI would otherwise summarize away during its save cycle. - Add MP_PYTHON auto-detection (MEMPAL_PYTHON env → repo venv → system) - Add inline normalize → chunk → upsert pipeline to both hooks - Skip file_already_mined — transcript grows, upsert is idempotent - Update block reason messages to explicitly request verbatim tool output - Precompact hook: parse transcript_path from input, fallback to session_id lookup - Update README with two-layer capture docs and MEMPAL_PYTHON config Co-Authored-By: Claude Opus 4.6 <[email protected]>

…b upgrade - CLAUDE.md: test count 648, fork changes 8-13, PR MemPalace#562, hook descriptions - README.md: hooks section reflects two-layer capture, chromadb >=1.5.4, normalize.py captures tool blocks, MEMPAL_PYTHON env var documented - AGENTS.md: add normalize.py and hooks to key files section - HOOKS_TUTORIAL.md: new sections for two-layer capture and configuration - mempalace/README.md: normalize.py description updated, version.py added - normalize.py: docstring updated for tool_use/tool_result capture Co-Authored-By: Claude Opus 4.6 <[email protected]>

… PRs Resolves merge conflict in mcp_server.py by keeping our improvements (cached metadata, inode detection, WAL rotation, max_distance) and integrating upstream's query_sanitizer (MemPalace#385) and context parameter. Co-Authored-By: Claude Opus 4.6 <[email protected]>

The old block reason said "save to your memory system" which Claude interpreted as its internal ~/.claude/memory/ system instead of the mempalace. Now explicitly names mempalace_add_drawer and mempalace_diary_write MCP tools and includes a negative instruction to NOT use Claude's internal memory. Co-Authored-By: Claude Opus 4.6 <[email protected]>

The negative instruction referenced ~/.claude/memory/ but Claude's auto-memory actually lives at ~/.claude/projects/.../memory/. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Two bugs found when mining ~/.claude/projects/ transcripts: - message field can be a string (not dict) in some JSONL entries - Read tool offset/limit can be strings, causing TypeError on addition Co-Authored-By: Claude Opus 4.6 <[email protected]>

…compress keys - params null guard: request.get("params") or {} handles explicit null - L2→cosine in search tool descriptions (collection uses hnsw:space=cosine) - set_hook_setting mkdir before config write - Guard empty tool_use_map key in normalize.py - Correct searcher.py docstring: cosine distance range 0–2, useful range 0.3–1.0 Co-Authored-By: Claude Opus 4.6 <[email protected]>

…emPalace#637) _SAFE_NAME_RE was ASCII-only ([a-zA-Z0-9]), rejecting valid Unicode names like Pēteris, 日本語, Москва in KG writes and MCP tools. Changed to \w which matches Unicode letters/digits in Python 3. Path traversal, null bytes, and control chars still blocked by the explicit checks before the regex. Co-Authored-By: Claude Opus 4.6 <[email protected]>

…ssertion - miner.py: move chromadb/palace imports above logger to fix E402 - test_normalize.py: rename ambiguous var 'l' to 'line' (E741) - test_hooks_cli.py: fix Windows assertion — check basename contains 'python' instead of asserting venv-specific path components Co-Authored-By: Claude Opus 4.6 <[email protected]>

Local ruff 0.15 formats differently from CI's pinned 0.4.x range. Reformatted all files to match CI expectations. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Users installing hooks don't know ~/.claude/projects/ contains past JSONL transcripts that can be mined. Added prominent "Backfill Past Conversations" section to hooks README and tutorial. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Hybrid search, graph cache, L1 optimization, decay/forgetting, feedback loops, KG entity resolution, input sanitization. Co-Authored-By: Claude Opus 4.6 <[email protected]>

- New TODO #0: convo_miner files everything to 'sessions' wing, losing project identity from Claude Code transcript paths - Added survey context: why staying on MemPalace, patterns to steal from Karta/Gigabrain/Codex/ByteRover - Future section: contradiction detection, foresight signals, query classification (simpler versions of Karta's mechanisms) - Updated "not pursuing" with dream engine, dual ANN, event sourcing Co-Authored-By: Claude Opus 4.6 <[email protected]>

Hybrid search (TODO #1): when vector results are poor (best distance > 1.0), automatically falls back to keyword text-match via ChromaDB where_document.$contains. Extracts most distinctive non-stopword token from query, or accepts explicit keyword param. Results merged, deduped, sorted by distance. MCP server exposes new keyword parameter. Wing fix (TODO #0): _ingest_transcript() now derives project wing from Claude Code transcript path (-Projects-<name>/) instead of hardcoding "sessions". Per-project search now finds auto-mined content. 692 tests pass (22 new). Co-Authored-By: Claude Opus 4.6 <[email protected]>

Copilot

Pull request overview

Improves the resilience and observability of the MCP server’s ChromaDB collection accessor (_get_collection) by logging failures and attempting a one-time self-healing retry after clearing cached state.

Changes:

Log _get_collection exceptions instead of silently returning None.
Retry once after clearing cached client/collection/metadata state to recover from transient failures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-30T17:08:38Z

+        except Exception as e:
+            logger.error(
+                "_get_collection attempt %d failed (palace=%s): %s",
+                attempt + 1, _config.palace_path, e,


The error log here only includes str(e) and omits the traceback, which makes it much harder to debug the underlying ChromaDB failure. Consider using logger.exception(...) (or logger.error(..., exc_info=True)) so the full stack trace is emitted with the message.

Suggested change

except Exception as e:

logger.error(

"_get_collection attempt %d failed (palace=%s): %s",

attempt + 1, _config.palace_path, e,

except Exception:

logger.exception(

"_get_collection attempt %d failed (palace=%s)",

attempt + 1, _config.palace_path,

Copilot · 2026-04-30T17:08:39Z

+    for attempt in range(2):
+        try:
+            client = _get_client()
+            if create:
+                # hnsw:num_threads=1 disables ChromaDB's multi-threaded ParallelFor
+                # HNSW insert path, which has a race in repairConnectionsForUpdate /
+                # addPoint (see issues #974, #965). Set via metadata on fresh
+                # collections and re-applied via _pin_hnsw_threads() for legacy
+                # palaces whose collections were created before this fix (the
+                # runtime config does not persist cross-process in chromadb 1.5.x,
+                # so the retrofit runs every time _get_collection opens a cache).
+                raw = client.get_or_create_collection(
+                    _config.collection_name,
+                    metadata={
+                        "hnsw:space": "cosine",
+                        "hnsw:num_threads": 1,
+                        **_HNSW_BLOAT_GUARD,
+                    },
+                )
+                _pin_hnsw_threads(raw)
+                _collection_cache = ChromaCollection(raw)
+                _metadata_cache = None
+                _metadata_cache_time = 0
+            elif _collection_cache is None:
+                raw = client.get_collection(_config.collection_name)
+                _pin_hnsw_threads(raw)
+                _collection_cache = ChromaCollection(raw)
+                _metadata_cache = None
+                _metadata_cache_time = 0
+            return _collection_cache
+        except Exception as e:
+            logger.error(
+                "_get_collection attempt %d failed (palace=%s): %s",
+                attempt + 1, _config.palace_path, e,
            )
-            _pin_hnsw_threads(raw)
-            _collection_cache = ChromaCollection(raw)
-            _metadata_cache = None
-            _metadata_cache_time = 0
-        elif _collection_cache is None:
-            raw = client.get_collection(_config.collection_name)
-            _pin_hnsw_threads(raw)
-            _collection_cache = ChromaCollection(raw)
-            _metadata_cache = None
-            _metadata_cache_time = 0
-        return _collection_cache
-    except Exception:
-        return None
+            if attempt == 0:
+                _client_cache = None
+                _collection_cache = None
+                _metadata_cache = None
+                _metadata_cache_time = 0
+    return None


This introduces new retry/self-healing behavior (clearing caches and retrying once) but there’s no unit test to lock in the expected semantics (e.g., first call raises, caches are cleared, second call succeeds; and the both-attempts-fail case still returns None). Since tests/test_mcp_server.py already covers related cache invalidation behavior, adding a small monkeypatch-based test would help prevent regressions.

@Legion345

… main The "fix: _get_client() get-then-create guard" row in the fork-ahead table was misleading — the guard landed in fork as d3a2d22 on 2026-04-18 but was silently dropped during the develop merge-conflict resolution in 7e18a70 on 2026-04-25. Current fork main does not carry it; we lean on metadata consistency (hnsw:space + hnsw:num_threads:1 + _HNSW_BLOAT_GUARD applied uniformly across opens) to sidestep the ChromaDB 1.5.x metadata-mismatch SIGSEGV without the defensive guard. Issue MemPalace#1089 documented the pattern on 2026-04-21. PR MemPalace#1262 (filed 2026-04-28 by @Legion345) implements the canonical version — try get_collection / except chromadb.errors.NotFoundError → create — which is an improvement over our older InvalidCollectionException reference (NotFoundError is the renamed-in-1.5.x exception class). Once MemPalace#1262 merges, this row clears via develop sync. Until then the fork is exposed if upstream ever changes default hnsw:* metadata between releases — the same drift shape that caused the original crash.

…nding queue; trim patch Filed four upstream PRs on 2026-04-30: - rboarescu#15 feat: GET /viz status dashboard (stacks on rboarescu#13) - rboarescu#16 feat: GET /list — query-free metadata browse - rboarescu#17 feat: DELETE /memory + PATCH /memory - rboarescu#18 feat(lifespan): auto-migrate Stop-hook checkpoints on startup Also rebased PR rboarescu#13 onto upstream/main to clear a CHANGELOG conflict left by upstream's b4aee82 (patch sync) — state went CONFLICTING -> MERGEABLE / CLEAN. README: - Open upstream PRs table: four new rows (rboarescu#15-rboarescu#18) plus a 2026-04-30 note covering today's rebase + new PRs in one breath. - Pending PRs queue: now empty. Replaced the four stale rows (event-log-frame and graph-endpoint were already in flight via rboarescu#11 and rboarescu#13; mempal-fast.py was already upstream via the merged PR #4 omnibus; /viz is now PR rboarescu#15) with a brief empty-state note. CLAUDE.md: - Patch description trimmed to reflect that the hnsw:num_threads enforcement landed upstream via _pin_hnsw_threads(); only the log + retry-once slice remains. patches/mcp_server_get_collection.patch: - Regenerated against current mempalace develop. The patch is now just the "log exception + retry once on cache failure" change. Filed upstream as MemPalace/mempalace#1286; once that merges this patch retires. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

README: - Intro paragraph: appended GET /list (PR rboarescu#16), DELETE/PATCH /memory (PR rboarescu#17), and lifespan auto-migrate (PR rboarescu#18) to "what this fork adds" since they're already on fork main even though the PRs are awaiting review upstream. - Links bar: added docs/typescript-port-plan.md. - Added a "Cross-repo coordination" subsection under "Recently landed in upstream" — calls out the local patches/ directory and the in-flight MemPalace/mempalace#1286 that retires it. Also references #1142 (RELEASING.md) for completeness. - Requirements: dropped the stale "kind= searcher filter" justification for recommending the fork (kind= was retired in 1.7.1); replaced with daemon-strict hook mode + warnings/sqlite-fallback search path as the actual current reasons. Added the patches/ re-apply step. - API table: added /list, DELETE /memory/{id}, PATCH /memory/{id} rows so the table reflects what the fork main actually exposes. - Sources: cross-repo PR note for #1286. CHANGELOG: - Added [Unreleased] section: patch trim (Maintenance), TS port plan doc, hook-routing-fix.md status header, and a Docs entry covering today's README updates + PR rboarescu#13 rebase context. docs/hook-routing-fix.md: - Added a Status: SHIPPED header pointing at 62425e3 (2026-04-24 clients/hook.py introduction) and clarifying mempal-fast.py is the simpler successor — same shape as docs/graph-endpoint.md's status header. No daemon behaviour changes.

Two issues: 1. logger.error("...", str(e)) omitted the traceback, making the underlying ChromaDB failure hard to debug. Switch to logger.exception so the full stack trace lands in the log alongside the message. Same context lines, just with exc_info coming from sys.exc_info() automatically. 2. No unit test pinned the new retry/self-healing semantics. Added TestGetCollectionRetry (tests/test_mcp_server.py) with three cases: - test_retry_succeeds_on_second_attempt: first _get_client raises, caches are cleared between attempts (verified by inspecting planted sentinels), second attempt returns the wrapped collection. - test_returns_none_when_both_attempts_fail: two failures in a row return None and stop after exactly two attempts. - test_happy_path_does_not_retry: a successful first attempt returns immediately and never invokes _get_client a second time. All three use monkeypatch on _get_client / ChromaCollection / _pin_hnsw_threads so they don't need a real ChromaDB; they run in ~0.4s end-to-end and don't touch disk.

…call sites The earlier 2026-04-30 wording said current fork main calls get_or_create_collection with stable/consistent metadata, sidestepping the ChromaDB 1.5.x metadata-mismatch SIGSEGV without carrying the defensive guard. That's not actually true. Audited the four get_or_create_collection call sites on current fork main: - backends/chroma.py:1040 full hnsw:* set (space + num_threads + bloat-guard) - mcp_server.py:295 full hnsw:* set - migrate.py:235 no metadata - cli.py:905 no metadata The trigger condition for the 1.5.x metadata-mismatch SIGSEGV (call- site metadata diverging from stored collection metadata) is structurally present. The fork's apparent stability is presumably because the no-metadata sites operate on different collections (mempalace_compressed, the migrate temp collection) than the full- metadata sites (mempalace_drawers) — so the divergence never reaches the same row in the same metadata table. That's a load-bearing assumption worth flagging: it's circumstantial, not structural. Once MemPalace#1262 (Legion345) lands, audit all four sites together so the guard wraps every site rather than just chroma.py.

jphein · 2026-04-30T21:00:54Z

Pushed 757eb7a addressing both Copilot notes:

Switched logger.error("...", str(e)) to logger.exception("...") so the traceback lands in the daemon log alongside the message instead of being dropped. Same context lines, just with exc_info from sys.exc_info() automatically.
Added TestGetCollectionRetry (tests/test_mcp_server.py) with three monkeypatch tests that pin the new semantics:
- test_retry_succeeds_on_second_attempt — first _get_client raises, caches are cleared between attempts (verified by inspecting planted sentinels in _client_cache / _collection_cache), second attempt returns the wrapped collection.
- test_returns_none_when_both_attempts_fail — two failures in a row return None and stop after exactly two attempts.
- test_happy_path_does_not_retry — a successful first attempt returns immediately and never invokes _get_client a second time.
All three use monkeypatch on _get_client / ChromaCollection / _pin_hnsw_threads so they don't need a real ChromaDB; they run in ~0.4s end-to-end and don't touch disk.

Also fixed a ruff format lint failure from the previous push (multi-arg logger.exception call needed each arg on its own line under the project's ruff config). Full CI matrix now green — test-linux 3.9/3.11/3.13, test-macos, test-windows, lint. Thanks for the careful review.

The 2026-04-30 wordings in 393ac89 ("metadata is consistent across opens sidesteps the SIGSEGV") and 260e2a2 ("non-uniform metadata across four call sites, trigger condition structurally present") both missed the mark. Third try here, traced from the code: - chromadb's get_or_create_collection is called directly from two places: backends/chroma.py:1061 and mcp_server.py:295. Both pass identical metadata (hnsw:space, hnsw:num_threads: 1, bloat-guard). - The fork's ChromaBackend.get_or_create_collection(path, name) at chroma.py:1101 is a 2-arg legacy shim that delegates via self.get_collection(palace_path, name, create=True) to the same metadata-passing line. - migrate.py:237 and cli.py:1002 call that 2-arg shim, so they reach chromadb with full metadata too. So metadata IS consistent across all five reachable call sites, not divergent. The earlier "non-uniform" claim mistook the fork's API surface for chromadb's call signature. The residual SIGSEGV exposure is legacy palaces created under older chromadb whose stored metadata differs from the current expanded set — a per-palace-history risk, not a structural call-graph one. Live MemPalace#1262 comment and MemPalace#1089 issue body have been re-PATCHed to match. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

… update sync status MemPalace#1262 (Legion345) is path 1 of MemPalace#1089's "interim guard PR" — adds get-then-create wrapping in chromadb backend. Shepherding via review comment posted 2026-04-30. Once it merges, fork-ahead Row 15 clears via develop sync. MemPalace#1286 (our PR, filed 2026-04-30) is the _get_collection retry-once + log-on-failure improvement. Adjacent to Row 15 — when both MemPalace#1262 and MemPalace#1286 land, the _get_collection path is both crash-resilient and self-healing. Status line refreshed: develop has moved to fdfaf01 (Gemini CLI normalize MemPalace#1234, privacy consent MemPalace#1233, both 2026-04-27); next sync will clear those plus row 15 once MemPalace#1262 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Fork main has been failing CI lint since 2026-04-30 04:04 UTC. Three real issues + 6 format-needs files. All pre-existing on main before today's doc work but inherited by every commit since. Cleared in one pass: - F401 mempalace/miner.py:19 — unused `typing.Optional` import, auto-fixed by `ruff --fix`. - F811 tests/test_searcher.py — `class TestBM25NoneSafety` was literally duplicated (lines 205 and 243, identical name + docstring + body). Likely copy-paste during 7ba28dc (kind= filter retirement). Deleted the second copy; verified the surviving class covers the same 3 None-safety tests. - C901 mempalace/searcher.py:721 — `search_memories` complexity 27 vs 25 ceiling. Added `# noqa: C901` with reason: this fork-only function is a fallback orchestration with several legitimate branches (BM25 top-up, warnings, closet-boost) and 27 isn't pathological; refactor would just split for the linter, not for readability. - ruff format applied to 6 files (mcp_server, miner, searcher, render-docs, test_mcp_server, test_migrate) — mechanical reformat from prior commits that didn't run format. `ruff check .` and `ruff format --check .` both clean. Searcher test suite green (29/29). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

igorls · 2026-05-02T23:32:09Z

Thanks @jphein — retry-once with cache-clear on _get_collection is correct and complementary to the recent #1303/#1299/#1289 work (those fixed correctness on first open; this adds resilience after a transient failure). Local tests for TestGetCollectionRetry pass.

One blocker: rebase against develop — the diff was authored against the pre-#1303 body of _get_collection. Develop's current implementation (mcp_server.py ~L276-338) does EF resolution via ChromaBackend._resolve_embedding_function() and a get_collection/create_collection split guarded by _ChromaNotFoundError. The retry loop needs to wrap that shape, not get_or_create_collection. CONFLICTING right now.

Once rebased and CI is green we can merge for v3.3.5.

# Conflicts: # mempalace/mcp_server.py # mempalace/searcher.py # tests/test_mcp_server.py

…Palace#1289/MemPalace#1303 After the 2026-05-03 develop sync to commit 1888b67: - Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303 plumbed embedding_function= through the reopen path. - Trailing tracking paragraph updated (sync target, count of merged/ open/closed PRs, list of features brought in). - Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with rebase queued, MemPalace#1289 + MemPalace#1303 added as merged. Note: fork-only _get_session_recovery_collection still uses the older get_or_create_collection pattern; theoretical SIGSEGV exposure on legacy recovery collections only. Tracked as follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…attern to recovery collection `_get_session_recovery_collection` (fork-only, introduced for the checkpoint collection split in row 23) used the older `get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just fixed in `_get_collection`. Same theoretical SIGSEGV class on chromadb 1.5.x when stored metadata diverges from the call site, plus the `embedding_function=` omission MemPalace#1303 caught. Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()` and the create-path uses try `get_collection` / except `NotFoundError` → `create_collection`. Recovery collection's metadata is intentionally lighter than the main collection's (no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is small enough that segment bloat isn't a concern. Existing `test_session_recovery.py` covers happy paths; 83/83 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

A small improvement to `_get_collection`: log the exception on failure (instead of returning `None` silently) and retry once after clearing the cached client and collection. Behaviour on the happy path is unchanged. ## Why The current shape (`try: ... except Exception: return None`) has two costs in practice on a long-running deployment: 1. Invisible incidents. When a transient ChromaDB error poisons the cached `_client_cache` / `_collection_cache`, every subsequent call returns `None` but the operator never sees the underlying exception in the log. Debugging has to start from the symptoms — 404s, "no palace" responses — without the actual stack. 2. No self-healing. Once the cache is stale, the only way back to a working state is a process restart. A single retry that clears the caches first resolves the most common shape (a transient error left bad state behind) without operator intervention. ## What changes - One `logger.exception(...)` per failed attempt, including the palace path so multi-palace deployments can disambiguate. - One retry after the first failure, with `_client_cache`, `_collection_cache`, `_metadata_cache`, `_metadata_cache_time` reset before the second attempt so the retry rebuilds from scratch rather than reusing a poisoned handle. - Happy path (first attempt succeeds) is unchanged: same return, same caching, no extra log line. ## Why this is still useful after MemPalace#1289 / MemPalace#1303 MemPalace#1289 split `get_or_create_collection` into `get_collection` + `create_collection` (the metadata-mismatch SIGSEGV path); MemPalace#1303 plumbed `embedding_function=` through both reopen branches. Those fix specific *crash* shapes. This PR addresses the *aftermath* — when an error of any class poisons the client/collection cache and every subsequent call returns `None` invisibly. Retry-once self-heals; logging makes the underlying exception visible. The retry loop sits *above* the get-then-create logic, so retries benefit from MemPalace#1289's correct call shape and MemPalace#1303's EF binding. ## Rebased on top of MemPalace#1289 + MemPalace#1303 This PR was filed before MemPalace#1289/MemPalace#1303 merged. The original two commits (2038269 + 757eb7a) wrapped the older `get_or_create` body. Rebased onto the new `try get_collection / except NotFoundError → create_collection` body — the retry/log envelope is unchanged, only the inner body picks up the merged fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

jphein · 2026-05-03T17:57:13Z

Rebased onto develop after #1289 and #1303 merged. The retry/log envelope is unchanged; the inner body now sits on top of the get-then-create + embedding_function= plumbing those PRs landed, so retries benefit from the corrected call shape rather than re-attempting a known-bad pattern. CI green locally, ready for another look when convenient.

igorls · 2026-05-06T07:53:01Z

Hi @jphein — thanks for carrying this; the underlying retry+log idea is sound and composes cleanly with #1322 (clearing _client_cache on attempt 1 forces a fresh make_client() which re-runs quarantine_stale_hnsw).

I'd like to land that piece for v3.3.5 (due 2026-05-08), but this PR can't merge as-is — it carries fork-only files (FORK_CHANGELOG.md, docs/fork-changes.yaml, docs/superpowers/**, scripts/{deploy,preflight,rebase-on-develop}.sh), bumps version.py / pyproject.toml to 3.3.4 without the matching marketplace.json / plugin.json / codex-plugin/plugin.json (which is what check-versions is catching), and the convo_miner.py rewrite regresses test_convo_mining + test_mine_convos_rebuilds_stale_drawers_after_schema_bump on all three OSes (Linux/macOS/Windows go red on those two tests where develop is green).

I opened #1377 with just the surgical _get_collection retry, credited as Co-authored-by: Jeffrey Hein <[email protected]>. It includes two unit tests for the retry path and ships the cache reset between attempts so a stale handle self-heals on the second call.

Closing this one in favour of #1377 — sorry for the friction. The other genuinely-useful pieces buried in this PR are very welcome as separate follow-ups against current develop:

Session-recovery collection split in mcp_server.py (routing diary checkpoints to a separate collection so they don't dominate search) — substantial enough to deserve its own design discussion / PR
backends/chroma.py None-metadata coercion at the backend boundary (would generalise the per-site guards from fix(searcher): guard against None metadata in CLI print path #999 / fix: guard Layer3.search_raw against None doc/meta from ChromaDB (#1011) #1013 / refactor(backends/chroma): coerce None metadatas to {} at backend boundary (closes #1020) #1094)
Closet-boost ranking refactor in searcher.py with hoisted tunables

If you'd rather I cherry-pick any of those into surgical PRs the way I did with the retry, ping me and I will. Either way thanks for the diagnosis on the underlying bug — easy to miss.

…xception fix(mcp): retry _get_collection once on transient failure (#1286)

…1286) A transient chromadb exception inside `_get_collection` was swallowed by the bare `except Exception: return None`, leaving every subsequent tool call hitting the same poisoned cache silently. The fix wraps the body in a `for attempt in range(2)` loop: on attempt 0 failure, log via `logger.exception(...)` and clear `_client_cache` / `_collection_cache` / `_metadata_cache` so the next iteration forces `_get_client()` to rebuild from scratch — that path now re-runs `quarantine_stale_hnsw` (per MemPalace#1322), so the second attempt heals the common stale-handle case automatically. If both attempts fail, return `None` (matches the prior contract for permanent failures). Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`: - `test_get_collection_retries_once_on_exception` — first attempt raises via a monkeypatched `_get_client`, second attempt succeeds; assert the caller gets the collection back, not None. - `test_get_collection_returns_none_after_two_failures` — both attempts fail, assert we exhaust the loop and return None (no infinite retry). Surgical extraction from PR MemPalace#1286, which carried the same fix idea (plus a fork-sync bundle that couldn't be merged); credit to the original author below. Co-authored-by: Jeffrey Hein <[email protected]>

jphein and others added 30 commits April 10, 2026 12:05

feat: add --version flag to CLI (from upstream PR MemPalace#559)

fe5cd56

Co-Authored-By: Claude Opus 4.6 <[email protected]>

docs: design spec for tool output capture in conversation mining

d4eee82

Co-Authored-By: Claude Opus 4.6 <[email protected]>

docs: implementation plan for tool output mining

b7d2bc5

Co-Authored-By: Claude Opus 4.6 <[email protected]>

feat: add _format_tool_use for Claude Code JSONL tool blocks

374da40

feat: add _format_tool_result with per-tool strategies

9775dd1

Co-Authored-By: Claude Opus 4.6 <[email protected]>

docs: add file handle and idempotency comments per review

321e77f

Co-Authored-By: Claude Opus 4.6 <[email protected]>

fix: dry-run room=None crash (MemPalace#586), precompact hook sanitiz…

b9e2a09

…ation (MemPalace#589) Co-Authored-By: Claude Opus 4.6 <[email protected]>

feat: upgrade chromadb pin from 0.6.x to >=1.5.4 (MemPalace#581)

70dc3df

Fixes ARM64 HNSW segfaults, Python 3.13/3.14 compat. Auto-migrates existing databases, zero API breakage. 648 tests pass, 50K palace verified. Co-Authored-By: Claude Opus 4.6 <[email protected]>

fix: correct auto-memory path in hook block reason (~/.claude/projects/)

fb73bbb

The negative instruction referenced ~/.claude/memory/ but Claude's auto-memory actually lives at ~/.claude/projects/.../memory/. Co-Authored-By: Claude Opus 4.6 <[email protected]>

style: ruff format with CI version (0.4.10)

085e346

Local ruff 0.15 formats differently from CI's pinned 0.4.x range. Reformatted all files to match CI expectations. Co-Authored-By: Claude Opus 4.6 <[email protected]>

docs: fork improvement roadmap — 7 items from agentic-memory analysis

d47d376

Hybrid search, graph cache, L1 optimization, decay/forgetting, feedback loops, KG entity resolution, input sanitization. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Copilot started reviewing on behalf of jphein April 30, 2026 17:06 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

jphein force-pushed the fix/get-collection-log-retry branch from 4b2dfd6 to 2038269 Compare April 30, 2026 20:42

jphein requested a review from igorls as a code owner April 30, 2026 20:48

jphein and others added 3 commits April 30, 2026 14:19

igorls added area/mcp MCP server and tools bug Something isn't working storage labels May 2, 2026

igorls added this to the v3.3.5 milestone May 2, 2026

jphein and others added 4 commits May 3, 2026 09:33

Merge remote-tracking branch 'upstream/develop'

3d934ed

# Conflicts: # mempalace/mcp_server.py # mempalace/searcher.py # tests/test_mcp_server.py

jphein force-pushed the fix/get-collection-log-retry branch from 757eb7a to 51780ce Compare May 3, 2026 17:56

igorls mentioned this pull request May 6, 2026

fix(mcp): retry _get_collection once on transient failure (#1286) #1377

Merged

3 tasks

igorls closed this May 6, 2026

igorls added a commit that referenced this pull request May 6, 2026

Merge pull request #1377 from MemPalace/fix/get-collection-retry-on-e…

f0d2360

…xception fix(mcp): retry _get_collection once on transient failure (#1286)

jphein mentioned this pull request May 6, 2026

refactor(searcher): hoist CLOSET_RANK_BOOSTS to module level + record ablation finding #1378

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mcp_server): log exception + retry once on _get_collection failure#1286

fix(mcp_server): log exception + retry once on _get_collection failure#1286
jphein wants to merge 283 commits intoMemPalace:developfrom
jphein:fix/get-collection-log-retry

jphein commented Apr 30, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

jphein commented Apr 30, 2026

Uh oh!

igorls commented May 2, 2026

Uh oh!

jphein commented May 3, 2026

Uh oh!

igorls commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jphein commented Apr 30, 2026

Why

What changes

What's intentionally not in this PR

Where it ran

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

jphein commented Apr 30, 2026

Uh oh!

igorls commented May 2, 2026

Uh oh!

jphein commented May 3, 2026

Uh oh!

igorls commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants