fix(storage): stop ChromaDB from crashing when reopening an existing …#1262
fix(storage): stop ChromaDB from crashing when reopening an existing …#1262igorls merged 1 commit intoMemPalace:developfrom
Conversation
… main The "fix: _get_client() get-then-create guard" row in the fork-ahead table was misleading — the guard landed in fork as d3a2d22 on 2026-04-18 but was silently dropped during the develop merge-conflict resolution in 7e18a70 on 2026-04-25. Current fork main does not carry it; we lean on metadata consistency (hnsw:space + hnsw:num_threads:1 + _HNSW_BLOAT_GUARD applied uniformly across opens) to sidestep the ChromaDB 1.5.x metadata-mismatch SIGSEGV without the defensive guard. Issue MemPalace#1089 documented the pattern on 2026-04-21. PR MemPalace#1262 (filed 2026-04-28 by @Legion345) implements the canonical version — try get_collection / except chromadb.errors.NotFoundError → create — which is an improvement over our older InvalidCollectionException reference (NotFoundError is the renamed-in-1.5.x exception class). Once MemPalace#1262 merges, this row clears via develop sync. Until then the fork is exposed if upstream ever changes default hnsw:* metadata between releases — the same drift shape that caused the original crash.
|
Thanks for picking this up @Legion345 — and for the kind credit, that was generous. A couple of corrections about the field data, since you anchored the PR description on it and I want to make sure the maintainer isn't reviewing under a wrong impression: The "since April 10" / "400+ starts, zero crashes" line came from my #1089 issue body, and both numbers are off. The guard actually landed on our fork at What's actually true today (and apologies — this paragraph has been edited twice already; the prior versions overcorrected in two different directions). Tracing the call graph: chromadb's On the diff itself: the approach is right. One small thing I'd flag as a positive — you're catching Thanks again for taking this on, and apologies for the back-and-forth on this comment — I shouldn't have edited my own correction with another wrong claim. |
…call sites The earlier 2026-04-30 wording said current fork main calls get_or_create_collection with stable/consistent metadata, sidestepping the ChromaDB 1.5.x metadata-mismatch SIGSEGV without carrying the defensive guard. That's not actually true. Audited the four get_or_create_collection call sites on current fork main: - backends/chroma.py:1040 full hnsw:* set (space + num_threads + bloat-guard) - mcp_server.py:295 full hnsw:* set - migrate.py:235 no metadata - cli.py:905 no metadata The trigger condition for the 1.5.x metadata-mismatch SIGSEGV (call- site metadata diverging from stored collection metadata) is structurally present. The fork's apparent stability is presumably because the no-metadata sites operate on different collections (mempalace_compressed, the migrate temp collection) than the full- metadata sites (mempalace_drawers) — so the divergence never reaches the same row in the same metadata table. That's a load-bearing assumption worth flagging: it's circumstantial, not structural. Once MemPalace#1262 (Legion345) lands, audit all four sites together so the guard wraps every site rather than just chroma.py.
The 2026-04-30 wordings in 393ac89 ("metadata is consistent across opens sidesteps the SIGSEGV") and 260e2a2 ("non-uniform metadata across four call sites, trigger condition structurally present") both missed the mark. Third try here, traced from the code: - chromadb's get_or_create_collection is called directly from two places: backends/chroma.py:1061 and mcp_server.py:295. Both pass identical metadata (hnsw:space, hnsw:num_threads: 1, bloat-guard). - The fork's ChromaBackend.get_or_create_collection(path, name) at chroma.py:1101 is a 2-arg legacy shim that delegates via self.get_collection(palace_path, name, create=True) to the same metadata-passing line. - migrate.py:237 and cli.py:1002 call that 2-arg shim, so they reach chromadb with full metadata too. So metadata IS consistent across all five reachable call sites, not divergent. The earlier "non-uniform" claim mistook the fork's API surface for chromadb's call signature. The residual SIGSEGV exposure is legacy palaces created under older chromadb whose stored metadata differs from the current expanded set — a per-palace-history risk, not a structural call-graph one. Live MemPalace#1262 comment and MemPalace#1089 issue body have been re-PATCHed to match. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… update sync status MemPalace#1262 (Legion345) is path 1 of MemPalace#1089's "interim guard PR" — adds get-then-create wrapping in chromadb backend. Shepherding via review comment posted 2026-04-30. Once it merges, fork-ahead Row 15 clears via develop sync. MemPalace#1286 (our PR, filed 2026-04-30) is the _get_collection retry-once + log-on-failure improvement. Adjacent to Row 15 — when both MemPalace#1262 and MemPalace#1286 land, the _get_collection path is both crash-resilient and self-healing. Status line refreshed: develop has moved to fdfaf01 (Gemini CLI normalize MemPalace#1234, privacy consent MemPalace#1233, both 2026-04-27); next sync will clear those plus row 15 once MemPalace#1262 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…pen-crash fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262)
Three fixes landed on develop after the initial release-prep cut and were brought in via the develop merge. Document them in the 3.3.4 Bug Fixes section so the release notes reflect what users will actually receive. - #1287 - HNSW divergence floor scales with hnsw:sync_threshold (resolves a silent-fallback regression introduced by the interaction between #1191 and #1227 in this release) - #1262 - ChromaBackend get_or_create_collection split, fixing the stop-hook SIGSEGV class on legacy palaces with mismatched stored metadata (#1089) - #1288 / #1254 - repair --mode max-seq-id heuristic now decodes BLOB-typed embeddings.seq_id, restoring the un-poison path added in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
The same try/except split that #1262 applied at the backend layer (ChromaBackend.get_collection) was needed at the parallel call site in mcp_server._get_collection(create=True), which carries the same metadata payload directly to chromadb's Python client. Both reopen paths in mempalace now bypass get_or_create_collection on existing collections, closing the SIGSEGV class for both tool_add_drawer and tool_diary_write (the Stop hook's path).
Catches up xdev-patches with 112 commits from MemPalace/develop, including: - v3.3.4 release - MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running MCP server & mempalace-api) - MemPalace#1287 HNSW divergence floor fix - MemPalace#1288 BLOB seq_id decode in repair - MemPalace#1180 cross-wing tunnels by shared topics - MemPalace#1194 wing-slug normalization for hyphenated dirs Conflict resolution: hooks_cli.py and mcp_server.py both had local patches (6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash) that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on those two files — upstream's version handles separate transcript/project ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local config.py regex relaxation auto-merged cleanly and is preserved. Safety tag: pre-upstream-merge-20260501-091227 (rollback target). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…lace#1299) `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon, per MemPalace#1299) the default EF's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. Reads worked because `col.get(ids=...)` and metadata fetches don't invoke the EF; the auto-ingest path worked because mining routes through the backend abstraction. Diary writes were the consistent failure surface. Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`) and pass it into both reopen branches. Falls back to the chromadb default only if `mempalace.embedding.get_embedding_function` itself raises. Regression test patches the chromadb client class to capture `embedding_function=` on every `get_collection` / `create_collection` call from `_get_collection(create=True)` and `_get_collection()`, and fails if any call omits it. Follow-up to MemPalace#1262 / MemPalace#1289 (which fixed the metadata-mismatch SIGSEGV path); this addresses the EF-mismatch SIGSEGV path on the same surface.
…SEGV (upstream MemPalace#1289) Mirrors the backend-layer fix from MemPalace#1262 at the MCP server's _get_collection cache rebuild path. Also imports _HNSW_BLOAT_GUARD for correct metadata on fresh collection creation.
Sync with upstream/develop via cherry-pick of 9 critical bugfixes: - HNSW index bloat prevention (MemPalace#1191) - SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289) - blob-seq marker fast-path (MemPalace#1177) - palace_graph None-metadata guard (MemPalace#1201) - palace_graph security tunnels (MemPalace#1168) - hyphenated wing name normalization (MemPalace#1197) - searcher _tokenize None guard (MemPalace#1198) - mcp_server diary topic sanitize (MemPalace#936) Tests: 1459 default + 113 benchmark + 6/7 stress all pass. (random-kill stress test was failing on dev pre-merge; not regressed.)
…Palace#1289/MemPalace#1303 After the 2026-05-03 develop sync to commit 1888b67: - Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303 plumbed embedding_function= through the reopen path. - Trailing tracking paragraph updated (sync target, count of merged/ open/closed PRs, list of features brought in). - Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with rebase queued, MemPalace#1289 + MemPalace#1303 added as merged. Note: fork-only _get_session_recovery_collection still uses the older get_or_create_collection pattern; theoretical SIGSEGV exposure on legacy recovery collections only. Tracked as follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…attern to recovery collection `_get_session_recovery_collection` (fork-only, introduced for the checkpoint collection split in row 23) used the older `get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just fixed in `_get_collection`. Same theoretical SIGSEGV class on chromadb 1.5.x when stored metadata diverges from the call site, plus the `embedding_function=` omission MemPalace#1303 caught. Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()` and the create-path uses try `get_collection` / except `NotFoundError` → `create_collection`. Recovery collection's metadata is intentionally lighter than the main collection's (no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is small enough that segment bloat isn't a concern. Existing `test_session_recovery.py` covers happy paths; 83/83 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
What does this PR do?
Fixes #1089.
When opening an existing palace,
get_or_create_collectionwas called with hnsw metadata on every open. In ChromaDB 1.5.x, if that metadata differs from what's already stored, the Rust bindings segfault — no traceback, just a core dump. This is what causes the stop hook to crash at session end.Fix: try
get_collectionfirst, fall back tocreate_collectiononly when the collection doesn't exist yet. Existing palaces open without touching their metadata; new ones are created with the full settings as before.Credit to @jphein who described this exact approach in #1089 and has been running it on their fork since April 10 with zero crashes across 400+ starts.
How to test
Open a palace twice in the same session:
Or run the new regression tests directly:
python -m pytest tests/test_backends.py -v -k "idempotent or preserves_existing"
Checklist