fix(mcp_server): pass embedding_function= on collection reopen (#1299)#1303
fix(mcp_server): pass embedding_function= on collection reopen (#1299)#1303
Conversation
`mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon, per #1299) the default EF's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. Reads worked because `col.get(ids=...)` and metadata fetches don't invoke the EF; the auto-ingest path worked because mining routes through the backend abstraction. Diary writes were the consistent failure surface. Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`) and pass it into both reopen branches. Falls back to the chromadb default only if `mempalace.embedding.get_embedding_function` itself raises. Regression test patches the chromadb client class to capture `embedding_function=` on every `get_collection` / `create_collection` call from `_get_collection(create=True)` and `_get_collection()`, and fails if any call omits it. Follow-up to #1262 / #1289 (which fixed the metadata-mismatch SIGSEGV path); this addresses the EF-mismatch SIGSEGV path on the same surface.
There was a problem hiding this comment.
Pull request overview
Fixes an MCP-server-specific ChromaDB collection reopen path to consistently bind MemPalace’s embedding function (instead of falling back to ChromaDB’s default EF), preventing EF-mismatch-induced crashes and aligning MCP write behavior with the miner/backend path.
Changes:
- Update
mcp_server._get_collectionto resolve an embedding function and passembedding_function=intoclient.get_collectionandclient.create_collection. - Add a regression test ensuring
embedding_function=is passed on bothcreate=Trueand cache-miss reopen (create=False) paths. - Add a changelog entry documenting the crash class and the fix.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
mempalace/mcp_server.py |
Resolves EF and threads it through collection reopen/create to avoid ChromaDB default EF binding. |
tests/test_mcp_server.py |
Adds regression coverage asserting EF is passed to ChromaDB client calls in both reopen branches. |
CHANGELOG.md |
Documents the MCP diary-write crash class and the EF-passing fix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| ef = get_embedding_function() | ||
| except Exception: | ||
| logger.exception("Failed to build embedding function; using chromadb default") | ||
| ef = None | ||
| ef_kwargs = {"embedding_function": ef} if ef is not None else {} |
There was a problem hiding this comment.
Good catch — addressed in cd98d66. EF resolution is now scoped to the two branches that actually call client.get_collection / client.create_collection, so the warm-cache path (where the function returns _collection_cache immediately) skips it entirely. Did not plumb device=_config.embedding_device through this PR — MempalaceConfig is instantiated lazily inside get_embedding_function and _EF_CACHE keys on the resolved providers tuple, so the cold-path cost is one MempalaceConfig() + _resolve_providers per cache key. Worth a follow-up if profiling shows it, but felt out of scope for the SIGSEGV fix.
| # ChromaDB 1.x does not persist the embedding function with the | ||
| # collection, so a reader/writer that omits ``embedding_function=`` | ||
| # silently gets the chromadb-built-in default. On bleeding-edge | ||
| # interpreters (#1299: python 3.14 + chromadb 1.5.x on Apple Silicon) | ||
| # the default's lazy ONNX provider selection can SIGSEGV the host | ||
| # process on first ``col.add()``. The miner / Stop hook ingest path | ||
| # avoids this because it routes through ``ChromaBackend.get_collection`` | ||
| # which resolves the EF via ``mempalace.embedding.get_embedding_function``. | ||
| # The MCP server bypassed that abstraction; mirror its behaviour so |
There was a problem hiding this comment.
Fair point — reworded in cd98d66. The new comment now says ChromaDB 1.x persists the EF identity (its name()) but not the instance/configuration, and explicitly walks through why the identity check still passes (both mempalace.embedding._MempalaceONNX and chromadb's built-in DefaultEmbeddingFunction report name()=="default") while the provider list silently differs — which is the actual SIGSEGV trigger. That should line up with the embedding.py docstring instead of contradicting it.
|
|
||
| ### Bug Fixes | ||
|
|
||
| - **MCP server `tool_diary_write` SIGSEGV when EF default differs.** `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` to the collection while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon) the default's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. `_get_collection` now resolves the EF via `mempalace.embedding.get_embedding_function` and passes it into both reopen branches, matching the miner/backend path. (#1299, follow-up to #1262 / #1289) |
There was a problem hiding this comment.
Reworded in cd98d66 to match the codebase explanation: "ChromaDB 1.x persists the EF identity (its name()) with the collection but not the EF instance/configuration" — and the entry now spells out that the identity check passes (both EFs report name()=="default") while the provider list differs, which is what embedding.py is documenting and what triggered the SIGSEGV.
| # which resolves the EF via ``mempalace.embedding.get_embedding_function``. | ||
| # The MCP server bypassed that abstraction; mirror its behaviour so | ||
| # ``tool_diary_write`` / ``tool_add_drawer`` get the same EF as mining. | ||
| try: | ||
| ef = get_embedding_function() | ||
| except Exception: | ||
| logger.exception("Failed to build embedding function; using chromadb default") | ||
| ef = None |
There was a problem hiding this comment.
Done in cd98d66 — both branches now call ChromaBackend._resolve_embedding_function() directly. Dropped the from .embedding import get_embedding_function import since the helper handles it. Calling a leading-underscore static method from outside the class is mildly smelly, but felt better than promoting it to public surface in this PR; happy to do that as a follow-up if you'd prefer.
- Resolve the EF inside the two reopen branches that actually call `client.get_collection` / `client.create_collection`, so warm-cache reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers` on every tool call). - Reuse `ChromaBackend._resolve_embedding_function()` instead of duplicating its try/except + log message + None-fallback. - Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x persists the EF *identity* (its `name()`) but not the *instance/ configuration* — `mempalace.embedding` documents this and spoofs `name()` to `"default"` precisely so the identity check passes; the bug was the *provider list* (lazy ONNX selection) silently differing.
…Palace#1289/MemPalace#1303 After the 2026-05-03 develop sync to commit 1888b67: - Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303 plumbed embedding_function= through the reopen path. - Trailing tracking paragraph updated (sync target, count of merged/ open/closed PRs, list of features brought in). - Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with rebase queued, MemPalace#1289 + MemPalace#1303 added as merged. Note: fork-only _get_session_recovery_collection still uses the older get_or_create_collection pattern; theoretical SIGSEGV exposure on legacy recovery collections only. Tracked as follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…attern to recovery collection `_get_session_recovery_collection` (fork-only, introduced for the checkpoint collection split in row 23) used the older `get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just fixed in `_get_collection`. Same theoretical SIGSEGV class on chromadb 1.5.x when stored metadata diverges from the call site, plus the `embedding_function=` omission MemPalace#1303 caught. Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()` and the create-path uses try `get_collection` / except `NotFoundError` → `create_collection`. Recovery collection's metadata is intentionally lighter than the main collection's (no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is small enough that segment bloat isn't a concern. Existing `test_session_recovery.py` covers happy paths; 83/83 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
A small improvement to `_get_collection`: log the exception on failure (instead of returning `None` silently) and retry once after clearing the cached client and collection. Behaviour on the happy path is unchanged. ## Why The current shape (`try: ... except Exception: return None`) has two costs in practice on a long-running deployment: 1. Invisible incidents. When a transient ChromaDB error poisons the cached `_client_cache` / `_collection_cache`, every subsequent call returns `None` but the operator never sees the underlying exception in the log. Debugging has to start from the symptoms — 404s, "no palace" responses — without the actual stack. 2. No self-healing. Once the cache is stale, the only way back to a working state is a process restart. A single retry that clears the caches first resolves the most common shape (a transient error left bad state behind) without operator intervention. ## What changes - One `logger.exception(...)` per failed attempt, including the palace path so multi-palace deployments can disambiguate. - One retry after the first failure, with `_client_cache`, `_collection_cache`, `_metadata_cache`, `_metadata_cache_time` reset before the second attempt so the retry rebuilds from scratch rather than reusing a poisoned handle. - Happy path (first attempt succeeds) is unchanged: same return, same caching, no extra log line. ## Why this is still useful after MemPalace#1289 / MemPalace#1303 MemPalace#1289 split `get_or_create_collection` into `get_collection` + `create_collection` (the metadata-mismatch SIGSEGV path); MemPalace#1303 plumbed `embedding_function=` through both reopen branches. Those fix specific *crash* shapes. This PR addresses the *aftermath* — when an error of any class poisons the client/collection cache and every subsequent call returns `None` invisibly. Retry-once self-heals; logging makes the underlying exception visible. The retry loop sits *above* the get-then-create logic, so retries benefit from MemPalace#1289's correct call shape and MemPalace#1303's EF binding. ## Rebased on top of MemPalace#1289 + MemPalace#1303 This PR was filed before MemPalace#1289/MemPalace#1303 merged. The original two commits (2038269 + 757eb7a) wrapped the older `get_or_create` body. Rebased onto the new `try get_collection / except NotFoundError → create_collection` body — the retry/log envelope is unchanged, only the inner body picks up the merged fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Catches up on a month of upstream work. Highlights pulled in: - MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool) - MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status - MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW - MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate - MemPalace#1321 cli: honor --palace flag in cmd_init - MemPalace#1314 kg temporal params fix - MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read - MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read - MemPalace#1303 mcp_server: pass embedding_function= on collection reopen - MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json - Various ruff format passes on touched files Conflict resolution (CHANGELOG.md only — code files all auto-merged): - 3.3.5 unreleased section header from upstream kept above 3.3.4 - 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail (upstream's version was a strict subset). xdev patches preserved (still on this branch, untouched by merge): - 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings - 3fad61d fix(config): allow leading dash in wing names Not pushed to origin — run tests locally and decide when to push. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
mcp_server._get_collectionbypassedChromaBackend.get_collectionand calledclient.get_collection/client.create_collectionwithoutembedding_function=. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-inDefaultEmbeddingFunctionwhile the miner / Stop hook ingest path boundmempalace.embedding.get_embedding_function().col.add(), killing the MCP stdio server and leaving every subsequent tool call returningConnection closeduntil Claude Code was relaunched.ChromaBackend._resolve_embedding_function) and pass it into both reopen branches.Why diary writes specifically
tool_status,tool_diary_read,tool_list_wings, etc.) succeed becausecol.get(ids=...)/ metadata fetches /col.count()never invoke the EF.ChromaBackend.get_collection(with proper EF) — only the in-process MCP write path hit the EF-less collection handle.tool_diary_write(andtool_add_drawer) triggercol.add(documents=[...]), which is when ChromaDB lazy-loads the bound EF for the first time. On the reporter's environment that load segfaults the host process.Follow-up to
_get_collection(create=True)(splitget_or_create_collectioninto try-get_collection/ except-create_collection).Closes #1299.
Test plan
TestCacheInvalidation::test_get_collection_passes_embedding_functionpatches the chromadb client class to captureembedding_function=on everyget_collection/create_collectioncall from both reopen branches; fails if any call omits it. Verified to fail ondevelopHEAD without this PR's mcp_server.py change.tests/test_mcp_server.pysuite passes (66/66).pytest tests/ --ignore=tests/benchmarkspasses (1470 passed, 1 skipped).ruff check+ruff format --checkclean.