Skip to content

fix(mcp_server): pass embedding_function= on collection reopen (#1299)#1303

Merged
igorls merged 2 commits intodevelopfrom
fix/mcp-server-missing-embedding-function
May 1, 2026
Merged

fix(mcp_server): pass embedding_function= on collection reopen (#1299)#1303
igorls merged 2 commits intodevelopfrom
fix/mcp-server-missing-embedding-function

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented May 1, 2026

Summary

  • mcp_server._get_collection bypassed ChromaBackend.get_collection and called client.get_collection / client.create_collection without embedding_function=. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in DefaultEmbeddingFunction while the miner / Stop hook ingest path bound mempalace.embedding.get_embedding_function().
  • On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon, per MCP server SIGSEGV on tool_diary_write in 3.3.4 — Connection closed; server unrecoverable in-session #1299) the default EF's lazy ONNX provider selection could SIGSEGV the host process on first col.add(), killing the MCP stdio server and leaving every subsequent tool call returning Connection closed until Claude Code was relaunched.
  • Resolve the EF up front (matching ChromaBackend._resolve_embedding_function) and pass it into both reopen branches.

Why diary writes specifically

  • Reads (tool_status, tool_diary_read, tool_list_wings, etc.) succeed because col.get(ids=...) / metadata fetches / col.count() never invoke the EF.
  • The Stop hook auto-mining keeps working because mining routes through ChromaBackend.get_collection (with proper EF) — only the in-process MCP write path hit the EF-less collection handle.
  • tool_diary_write (and tool_add_drawer) trigger col.add(documents=[...]), which is when ChromaDB lazy-loads the bound EF for the first time. On the reporter's environment that load segfaults the host process.

Follow-up to

Closes #1299.

Test plan

  • New regression test TestCacheInvalidation::test_get_collection_passes_embedding_function patches the chromadb client class to capture embedding_function= on every get_collection / create_collection call from both reopen branches; fails if any call omits it. Verified to fail on develop HEAD without this PR's mcp_server.py change.
  • Full tests/test_mcp_server.py suite passes (66/66).
  • Full pytest tests/ --ignore=tests/benchmarks passes (1470 passed, 1 skipped).
  • ruff check + ruff format --check clean.
  • Reporter (@bespined) to confirm the fix on python 3.14 + chromadb 1.5.x + Apple Silicon once a 3.3.5 build is available.

`mcp_server._get_collection` bypassed `ChromaBackend.get_collection`
and called `client.get_collection` / `client.create_collection` without
`embedding_function=`. ChromaDB 1.x does not persist the EF identity
with the collection, so the MCP server's reopen silently bound
chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop
hook ingest path bound `mempalace.embedding.get_embedding_function()`.

On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple
Silicon, per #1299) the default EF's lazy ONNX provider selection could
SIGSEGV the host process on first `col.add()`, killing the MCP stdio
server and leaving every subsequent tool call returning
`Connection closed` until Claude Code was relaunched. Reads worked
because `col.get(ids=...)` and metadata fetches don't invoke the EF;
the auto-ingest path worked because mining routes through the backend
abstraction. Diary writes were the consistent failure surface.

Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`)
and pass it into both reopen branches. Falls back to the chromadb default
only if `mempalace.embedding.get_embedding_function` itself raises.

Regression test patches the chromadb client class to capture
`embedding_function=` on every `get_collection` / `create_collection`
call from `_get_collection(create=True)` and `_get_collection()`, and
fails if any call omits it.

Follow-up to #1262 / #1289 (which fixed the metadata-mismatch SIGSEGV
path); this addresses the EF-mismatch SIGSEGV path on the same surface.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an MCP-server-specific ChromaDB collection reopen path to consistently bind MemPalace’s embedding function (instead of falling back to ChromaDB’s default EF), preventing EF-mismatch-induced crashes and aligning MCP write behavior with the miner/backend path.

Changes:

  • Update mcp_server._get_collection to resolve an embedding function and pass embedding_function= into client.get_collection and client.create_collection.
  • Add a regression test ensuring embedding_function= is passed on both create=True and cache-miss reopen (create=False) paths.
  • Add a changelog entry documenting the crash class and the fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
mempalace/mcp_server.py Resolves EF and threads it through collection reopen/create to avoid ChromaDB default EF binding.
tests/test_mcp_server.py Adds regression coverage asserting EF is passed to ChromaDB client calls in both reopen branches.
CHANGELOG.md Documents the MCP diary-write crash class and the EF-passing fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/mcp_server.py Outdated
Comment on lines +292 to +297
try:
ef = get_embedding_function()
except Exception:
logger.exception("Failed to build embedding function; using chromadb default")
ef = None
ef_kwargs = {"embedding_function": ef} if ef is not None else {}
Copy link
Copy Markdown
Member Author

@igorls igorls May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — addressed in cd98d66. EF resolution is now scoped to the two branches that actually call client.get_collection / client.create_collection, so the warm-cache path (where the function returns _collection_cache immediately) skips it entirely. Did not plumb device=_config.embedding_device through this PR — MempalaceConfig is instantiated lazily inside get_embedding_function and _EF_CACHE keys on the resolved providers tuple, so the cold-path cost is one MempalaceConfig() + _resolve_providers per cache key. Worth a follow-up if profiling shows it, but felt out of scope for the SIGSEGV fix.

Comment thread mempalace/mcp_server.py Outdated
Comment on lines +282 to +290
# ChromaDB 1.x does not persist the embedding function with the
# collection, so a reader/writer that omits ``embedding_function=``
# silently gets the chromadb-built-in default. On bleeding-edge
# interpreters (#1299: python 3.14 + chromadb 1.5.x on Apple Silicon)
# the default's lazy ONNX provider selection can SIGSEGV the host
# process on first ``col.add()``. The miner / Stop hook ingest path
# avoids this because it routes through ``ChromaBackend.get_collection``
# which resolves the EF via ``mempalace.embedding.get_embedding_function``.
# The MCP server bypassed that abstraction; mirror its behaviour so
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point — reworded in cd98d66. The new comment now says ChromaDB 1.x persists the EF identity (its name()) but not the instance/configuration, and explicitly walks through why the identity check still passes (both mempalace.embedding._MempalaceONNX and chromadb's built-in DefaultEmbeddingFunction report name()=="default") while the provider list silently differs — which is the actual SIGSEGV trigger. That should line up with the embedding.py docstring instead of contradicting it.

Comment thread CHANGELOG.md Outdated

### Bug Fixes

- **MCP server `tool_diary_write` SIGSEGV when EF default differs.** `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` to the collection while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon) the default's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. `_get_collection` now resolves the EF via `mempalace.embedding.get_embedding_function` and passes it into both reopen branches, matching the miner/backend path. (#1299, follow-up to #1262 / #1289)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded in cd98d66 to match the codebase explanation: "ChromaDB 1.x persists the EF identity (its name()) with the collection but not the EF instance/configuration" — and the entry now spells out that the identity check passes (both EFs report name()=="default") while the provider list differs, which is what embedding.py is documenting and what triggered the SIGSEGV.

Comment thread mempalace/mcp_server.py Outdated
Comment on lines +289 to +296
# which resolves the EF via ``mempalace.embedding.get_embedding_function``.
# The MCP server bypassed that abstraction; mirror its behaviour so
# ``tool_diary_write`` / ``tool_add_drawer`` get the same EF as mining.
try:
ef = get_embedding_function()
except Exception:
logger.exception("Failed to build embedding function; using chromadb default")
ef = None
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in cd98d66 — both branches now call ChromaBackend._resolve_embedding_function() directly. Dropped the from .embedding import get_embedding_function import since the helper handles it. Calling a leading-underscore static method from outside the class is mildly smelly, but felt better than promoting it to public surface in this PR; happy to do that as a follow-up if you'd prefer.

- Resolve the EF inside the two reopen branches that actually call
  `client.get_collection` / `client.create_collection`, so warm-cache
  reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers`
  on every tool call).
- Reuse `ChromaBackend._resolve_embedding_function()` instead of
  duplicating its try/except + log message + None-fallback.
- Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x
  persists the EF *identity* (its `name()`) but not the *instance/
  configuration* — `mempalace.embedding` documents this and spoofs
  `name()` to `"default"` precisely so the identity check passes; the
  bug was the *provider list* (lazy ONNX selection) silently differing.
@igorls igorls merged commit 5ddaf7a into develop May 1, 2026
6 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request May 3, 2026
…Palace#1289/MemPalace#1303

After the 2026-05-03 develop sync to commit 1888b67:
- Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed
  ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303
  plumbed embedding_function= through the reopen path.
- Trailing tracking paragraph updated (sync target, count of merged/
  open/closed PRs, list of features brought in).
- Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with
  rebase queued, MemPalace#1289 + MemPalace#1303 added as merged.

Note: fork-only _get_session_recovery_collection still uses the older
get_or_create_collection pattern; theoretical SIGSEGV exposure on
legacy recovery collections only. Tracked as follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request May 3, 2026
…attern to recovery collection

`_get_session_recovery_collection` (fork-only, introduced for the
checkpoint collection split in row 23) used the older
`get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just
fixed in `_get_collection`. Same theoretical SIGSEGV class on
chromadb 1.5.x when stored metadata diverges from the call site,
plus the `embedding_function=` omission MemPalace#1303 caught.

Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()`
and the create-path uses try `get_collection` / except
`NotFoundError` → `create_collection`. Recovery collection's
metadata is intentionally lighter than the main collection's
(no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is
small enough that segment bloat isn't a concern.

Existing `test_session_recovery.py` covers happy paths; 83/83 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request May 3, 2026
A small improvement to `_get_collection`: log the exception on failure
(instead of returning `None` silently) and retry once after clearing
the cached client and collection. Behaviour on the happy path is
unchanged.

## Why

The current shape (`try: ... except Exception: return None`) has two
costs in practice on a long-running deployment:

1. Invisible incidents. When a transient ChromaDB error poisons the
   cached `_client_cache` / `_collection_cache`, every subsequent
   call returns `None` but the operator never sees the underlying
   exception in the log. Debugging has to start from the symptoms —
   404s, "no palace" responses — without the actual stack.
2. No self-healing. Once the cache is stale, the only way back to a
   working state is a process restart. A single retry that clears
   the caches first resolves the most common shape (a transient
   error left bad state behind) without operator intervention.

## What changes

- One `logger.exception(...)` per failed attempt, including the
  palace path so multi-palace deployments can disambiguate.
- One retry after the first failure, with `_client_cache`,
  `_collection_cache`, `_metadata_cache`, `_metadata_cache_time`
  reset before the second attempt so the retry rebuilds from
  scratch rather than reusing a poisoned handle.
- Happy path (first attempt succeeds) is unchanged: same return,
  same caching, no extra log line.

## Why this is still useful after MemPalace#1289 / MemPalace#1303

MemPalace#1289 split `get_or_create_collection` into `get_collection` +
`create_collection` (the metadata-mismatch SIGSEGV path); MemPalace#1303
plumbed `embedding_function=` through both reopen branches. Those
fix specific *crash* shapes. This PR addresses the *aftermath* —
when an error of any class poisons the client/collection cache and
every subsequent call returns `None` invisibly. Retry-once
self-heals; logging makes the underlying exception visible. The
retry loop sits *above* the get-then-create logic, so retries
benefit from MemPalace#1289's correct call shape and MemPalace#1303's EF binding.

## Rebased on top of MemPalace#1289 + MemPalace#1303

This PR was filed before MemPalace#1289/MemPalace#1303 merged. The original two
commits (2038269 + 757eb7a) wrapped the older `get_or_create`
body. Rebased onto the new `try get_collection / except
NotFoundError → create_collection` body — the retry/log envelope
is unchanged, only the inner body picks up the merged fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 4, 2026
Catches up on a month of upstream work. Highlights pulled in:

- MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool)
- MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status
- MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW
- MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate
- MemPalace#1321 cli: honor --palace flag in cmd_init
- MemPalace#1314 kg temporal params fix
- MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read
- MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read
- MemPalace#1303 mcp_server: pass embedding_function= on collection reopen
- MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json
- Various ruff format passes on touched files

Conflict resolution (CHANGELOG.md only — code files all auto-merged):

- 3.3.5 unreleased section header from upstream kept above 3.3.4
- 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new
  MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels
  (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and
  auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail
  (upstream's version was a strict subset).

xdev patches preserved (still on this branch, untouched by merge):
- 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings
- 3fad61d fix(config): allow leading dash in wing names

Not pushed to origin — run tests locally and decide when to push.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCP server SIGSEGV on tool_diary_write in 3.3.4 — Connection closed; server unrecoverable in-session

2 participants