fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262) by igorls · Pull Request #1289 · MemPalace/mempalace

igorls · 2026-05-01T01:35:34Z

Summary

Companion to #1262.

#1262 split get_or_create_collection into get_collection + fallback create_collection inside ChromaBackend.get_collection, fixing the chromadb 1.5.x Rust-binding SIGSEGV that fires when stored collection metadata differs from the call-site's _HNSW_BLOAT_GUARD payload.

The MCP server has the same payload at a parallel call site that #1262 did not touch:

mempalace/mcp_server.py:287

raw = client.get_or_create_collection(
    _config.collection_name,
    metadata={
        \"hnsw:space\": \"cosine\",
        \"hnsw:num_threads\": 1,
        **_HNSW_BLOAT_GUARD,
    },
)

_get_collection(create=True) is reached by tool_add_drawer (line 794) and tool_diary_write (line 1118). The Stop hook fires mempalace_diary_write at session end — exactly the crash path #1089 was filed for. Without this follow-up, legacy palaces whose stored metadata predates the bloat-guard expansion still SIGSEGV the MCP server on reopen.

Fix

Apply the same try/except split here. New behaviour:

try:
    raw = client.get_collection(_config.collection_name)
except _ChromaNotFoundError:
    raw = client.create_collection(
        _config.collection_name,
        metadata={
            \"hnsw:space\": \"cosine\",
            \"hnsw:num_threads\": 1,
            **_HNSW_BLOAT_GUARD,
        },
    )
_pin_hnsw_threads(raw)

_pin_hnsw_threads still runs after both branches, preserving the legacy-palace retrofit comment that already lived above this block.

Test plan

New regression test test_get_collection_create_true_avoids_get_or_create_on_reopen patches get_or_create_collection at the chromadb client class level (not the instance — chromadb's mtime-change detection rebuilds the client between calls, so an instance-level spy doesn't survive) and asserts the second _get_collection(create=True) call never reaches it.
Verified the test fails on develop tip without this fix (assert col2 is not None — None returned because the spy raises and the outer try/except swallows it).
All 65 tests in tests/test_mcp_server.py pass with the fix.
ruff check + ruff format --check clean (CI-pinned 0.4.x).

#1262) #1262 split `get_or_create_collection` into `get_collection` + fallback `create_collection` inside `ChromaBackend.get_collection`, fixing the chromadb 1.5.x Rust-binding SIGSEGV that fires when stored collection metadata differs from the call-site's `_HNSW_BLOAT_GUARD` payload. The MCP server's `_get_collection(create=True)` carries the same metadata payload at `mcp_server.py:287` and routes through chromadb's Python client directly, bypassing the backend layer. Both `tool_add_drawer` and `tool_diary_write` reach this site on every invocation, and the Stop hook fires `mempalace_diary_write` at session end — which was exactly the crash path #1089 named. Apply the same try/except split here so legacy palaces whose stored metadata predates the bloat-guard expansion no longer crash on the MCP-server reopen path. Regression test patches `get_or_create_collection` at the chromadb client class level (not the instance — chromadb's mtime-change detection rebuilds the client between calls, so an instance-level spy doesn't survive) and asserts the second `_get_collection(create=True)` call never reaches it.

Copilot

Pull request overview

Follow-up to #1262 that applies the same ChromaDB 1.5.x crash-avoidance pattern in the MCP server: avoid get_or_create_collection(..., metadata=...) on reopen by trying get_collection() first and only creating with metadata when missing.

Changes:

Update mcp_server._get_collection(create=True) to use get_collection with a create_collection fallback on _ChromaNotFoundError.
Add a regression test ensuring reopen does not call get_or_create_collection (the metadata-mismatch SIGSEGV path).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
mempalace/mcp_server.py	Splits `get_or_create_collection` into `get_collection` + `create_collection` fallback to avoid ChromaDB 1.5.x metadata-mismatch segfaults on reopen.
tests/test_mcp_server.py	Adds regression coverage to ensure `_get_collection(create=True)` avoids `get_or_create_collection` on reopen.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The same try/except split that #1262 applied at the backend layer (ChromaBackend.get_collection) was needed at the parallel call site in mcp_server._get_collection(create=True), which carries the same metadata payload directly to chromadb's Python client. Both reopen paths in mempalace now bypass get_or_create_collection on existing collections, closing the SIGSEGV class for both tool_add_drawer and tool_diary_write (the Stop hook's path).

Catches up xdev-patches with 112 commits from MemPalace/develop, including: - v3.3.4 release - MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running MCP server & mempalace-api) - MemPalace#1287 HNSW divergence floor fix - MemPalace#1288 BLOB seq_id decode in repair - MemPalace#1180 cross-wing tunnels by shared topics - MemPalace#1194 wing-slug normalization for hyphenated dirs Conflict resolution: hooks_cli.py and mcp_server.py both had local patches (6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash) that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on those two files — upstream's version handles separate transcript/project ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local config.py regex relaxation auto-merged cleanly and is preserved. Safety tag: pre-upstream-merge-20260501-091227 (rollback target). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…lace#1299) `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon, per MemPalace#1299) the default EF's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. Reads worked because `col.get(ids=...)` and metadata fetches don't invoke the EF; the auto-ingest path worked because mining routes through the backend abstraction. Diary writes were the consistent failure surface. Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`) and pass it into both reopen branches. Falls back to the chromadb default only if `mempalace.embedding.get_embedding_function` itself raises. Regression test patches the chromadb client class to capture `embedding_function=` on every `get_collection` / `create_collection` call from `_get_collection(create=True)` and `_get_collection()`, and fails if any call omits it. Follow-up to MemPalace#1262 / MemPalace#1289 (which fixed the metadata-mismatch SIGSEGV path); this addresses the EF-mismatch SIGSEGV path on the same surface.

…SEGV (upstream MemPalace#1289) Mirrors the backend-layer fix from MemPalace#1262 at the MCP server's _get_collection cache rebuild path. Also imports _HNSW_BLOAT_GUARD for correct metadata on fresh collection creation.

Sync with upstream/develop via cherry-pick of 9 critical bugfixes: - HNSW index bloat prevention (MemPalace#1191) - SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289) - blob-seq marker fast-path (MemPalace#1177) - palace_graph None-metadata guard (MemPalace#1201) - palace_graph security tunnels (MemPalace#1168) - hyphenated wing name normalization (MemPalace#1197) - searcher _tokenize None guard (MemPalace#1198) - mcp_server diary topic sanitize (MemPalace#936) Tests: 1459 default + 113 benchmark + 6/7 stress all pass. (random-kill stress test was failing on dev pre-merge; not regressed.)

…Palace#1289/MemPalace#1303 After the 2026-05-03 develop sync to commit 1888b67: - Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303 plumbed embedding_function= through the reopen path. - Trailing tracking paragraph updated (sync target, count of merged/ open/closed PRs, list of features brought in). - Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with rebase queued, MemPalace#1289 + MemPalace#1303 added as merged. Note: fork-only _get_session_recovery_collection still uses the older get_or_create_collection pattern; theoretical SIGSEGV exposure on legacy recovery collections only. Tracked as follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…attern to recovery collection `_get_session_recovery_collection` (fork-only, introduced for the checkpoint collection split in row 23) used the older `get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just fixed in `_get_collection`. Same theoretical SIGSEGV class on chromadb 1.5.x when stored metadata diverges from the call site, plus the `embedding_function=` omission MemPalace#1303 caught. Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()` and the create-path uses try `get_collection` / except `NotFoundError` → `create_collection`. Recovery collection's metadata is intentionally lighter than the main collection's (no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is small enough that segment bloat isn't a concern. Existing `test_session_recovery.py` covers happy paths; 83/83 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

A small improvement to `_get_collection`: log the exception on failure (instead of returning `None` silently) and retry once after clearing the cached client and collection. Behaviour on the happy path is unchanged. ## Why The current shape (`try: ... except Exception: return None`) has two costs in practice on a long-running deployment: 1. Invisible incidents. When a transient ChromaDB error poisons the cached `_client_cache` / `_collection_cache`, every subsequent call returns `None` but the operator never sees the underlying exception in the log. Debugging has to start from the symptoms — 404s, "no palace" responses — without the actual stack. 2. No self-healing. Once the cache is stale, the only way back to a working state is a process restart. A single retry that clears the caches first resolves the most common shape (a transient error left bad state behind) without operator intervention. ## What changes - One `logger.exception(...)` per failed attempt, including the palace path so multi-palace deployments can disambiguate. - One retry after the first failure, with `_client_cache`, `_collection_cache`, `_metadata_cache`, `_metadata_cache_time` reset before the second attempt so the retry rebuilds from scratch rather than reusing a poisoned handle. - Happy path (first attempt succeeds) is unchanged: same return, same caching, no extra log line. ## Why this is still useful after MemPalace#1289 / MemPalace#1303 MemPalace#1289 split `get_or_create_collection` into `get_collection` + `create_collection` (the metadata-mismatch SIGSEGV path); MemPalace#1303 plumbed `embedding_function=` through both reopen branches. Those fix specific *crash* shapes. This PR addresses the *aftermath* — when an error of any class poisons the client/collection cache and every subsequent call returns `None` invisibly. Retry-once self-heals; logging makes the underlying exception visible. The retry loop sits *above* the get-then-create logic, so retries benefit from MemPalace#1289's correct call shape and MemPalace#1303's EF binding. ## Rebased on top of MemPalace#1289 + MemPalace#1303 This PR was filed before MemPalace#1289/MemPalace#1303 merged. The original two commits (2038269 + 757eb7a) wrapped the older `get_or_create` body. Rebased onto the new `try get_collection / except NotFoundError → create_collection` body — the retry/log envelope is unchanged, only the inner body picks up the merged fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Copilot AI review requested due to automatic review settings May 1, 2026 01:35

igorls requested review from bensig and milla-jovovich as code owners May 1, 2026 01:35

Copilot started reviewing on behalf of igorls May 1, 2026 01:36 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

igorls merged commit 5e540da into develop May 1, 2026
10 checks passed

bespined mentioned this pull request May 1, 2026

MCP server SIGSEGV on tool_diary_write in 3.3.4 — Connection closed; server unrecoverable in-session #1299

Closed

igorls mentioned this pull request May 1, 2026

fix(mcp_server): pass embedding_function= on collection reopen (#1299) #1303

Merged

5 tasks

This was referenced May 2, 2026

Fix/windows hook stdio utf8 #1280

Closed

fix(mcp_server): log exception + retry once on _get_collection failure #1286

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262)#1289

fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262)#1289
igorls merged 1 commit intodevelopfrom
fix/mcp-server-collection-reopen-crash

igorls commented May 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

igorls commented May 1, 2026

Summary

Fix

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants