fix: skip _fix_blob_seq_ids sqlite open on already-migrated palaces (#1090)#1177
Conversation
3e2a67b to
a574ffd
Compare
There was a problem hiding this comment.
Pull request overview
This PR prevents _fix_blob_seq_ids() from opening a live chroma.sqlite3 on every client creation by introducing a .blob_seq_ids_migrated marker file in the palace directory, so already-migrated palaces can skip the sqlite3 open that may trigger ChromaDB 1.5.x Rust-side segfaults.
Changes:
- Add a
.blob_seq_ids_migratedmarker guard to early-return from_fix_blob_seq_ids()without callingsqlite3.connect(). - Write the marker after a successful
_fix_blob_seq_ids()run (including the “no BLOB rows found” no-op path). - Expand
_fix_blob_seq_ids()docstring to document the segfault risk and the marker behavior.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| with sqlite3.connect(db_path) as conn: |
There was a problem hiding this comment.
migrated is assigned but never read. This looks like leftover state from an earlier version; please remove it or use it (e.g., for conditional logging/marker handling) to avoid dead code and lint noise.
| try: | ||
| rows = conn.execute( | ||
| f"SELECT rowid, seq_id FROM {table} WHERE typeof(seq_id) = 'blob'" | ||
| ).fetchall() | ||
| except sqlite3.OperationalError: | ||
| continue | ||
| if not rows: | ||
| continue | ||
| updates = [(int.from_bytes(blob, byteorder="big"), rowid) for rowid, blob in rows] | ||
| conn.executemany(f"UPDATE {table} SET seq_id = ? WHERE rowid = ?", updates) | ||
| logger.info("Fixed %d BLOB seq_ids in %s", len(updates), table) | ||
| conn.commit() | ||
| except Exception: | ||
| logger.exception("Could not fix BLOB seq_ids in %s", db_path) | ||
| return | ||
| # Write marker whether or not rows needed migration — the palace is now | ||
| # confirmed to be in the INTEGER-seq_id state and future opens can skip the | ||
| # sqlite3.connect() entirely. | ||
| try: | ||
| open(marker, "a").close() | ||
| except OSError: | ||
| logger.exception("Could not write migration marker %s", marker) | ||
|
|
||
|
|
There was a problem hiding this comment.
The marker is written even if both table queries hit sqlite3.OperationalError (e.g., schema mismatch / missing tables), which means we may skip future migrations without ever confirming seq_id types. Consider tracking whether any table was successfully inspected, and only writing the marker when at least one relevant table was checked (or after actual updates).
| marker = os.path.join(palace_path, _BLOB_FIX_MARKER) | ||
| if os.path.isfile(marker): | ||
| return |
There was a problem hiding this comment.
New marker-file behavior isn’t covered by tests: there are existing tests for _fix_blob_seq_ids, but none assert (1) marker creation on successful no-op/migration, and (2) early return when the marker exists (ensuring sqlite3.connect() is not called). Adding/adjusting tests would prevent regressions on this reliability guard.
a574ffd to
af14d1b
Compare
… state Three stale sections updated: - Fork change queue: row 8 (.blob_seq_ids_migrated marker) struck through → FILED as MemPalace#1177. Two new rows added for segfault fixes discovered today (MemPalace#1171 concurrent-write lock, MemPalace#1173 quarantine in make_client) that weren't in the queue because the bugs surfaced today, not during the original 2026-04-21 triage. - Open upstream PRs: was showing 3 of 10 PRs. Now shows all 10 with current CI/review state. All rebased onto current upstream/develop and MERGEABLE as of today. - Merged since v3.3.1: added v3.3.3 release (2026-04-24) with its constituent merges — MemPalace#942, MemPalace#833, MemPalace#1097, MemPalace#1145, MemPalace#1147, MemPalace#1148/1150/1157 entity-detection overhaul (via @igorls's MemPalace#1175 stacked-PR rescue), MemPalace#1166 palace-path security, MemPalace#340/MemPalace#1093 install regression, plus MemPalace#851 from the 2026-04-22 batch.
…ostgres Three things merged into one README pass: 1. Badge: link version-3.3.4 to jphein/mempalace/releases (the v3.3.4 tag we just pushed) and add an upstream-3.3.3 secondary badge so readers can tell fork vs upstream version at a glance. Was sitting uncommitted from earlier today. 2. Multi-client coordination section: replaced the three-fix v3.3.4 summary with a four-fix one. Added @felipetruman's MemPalace#976 num_threads pin (cherry-picked at 552a0d7) as fix #1 — the actual root-cause fix. Reframed our MemPalace#1171/MemPalace#1173/MemPalace#1177 as defense-in-depth around symptoms. Walked back palace-daemon from "primary concurrency story in progress" to "deferred pending observation" — with MemPalace#976's fix in place, the daemon's same-machine value drops; multi-machine and Windows remain its differentiators but neither is current pain. 3. Postgres + pgvector: walked back from "parallel track" framing to "long-term option, no immediate move" for the same reason. Migration cost stays real, current pain is mitigated, decision deferred until v3.3.4 stack is observed in production or TS rewrite ships. Removed two stale paragraphs that were left over from the previous "daemon as primary" framing.
…HNSW fixes Bring in 29 commits from upstream/develop since the last merge (2026-04-23): Major absorbed changes: - MemPalace#976 (Felipe Truman): HNSW graph corruption fix, mine_global_lock for fan-out, MAX_PRECOMPACT_BLOCK_ATTEMPTS=2 for /compact deadlocks. Closes MemPalace#974/MemPalace#965/MemPalace#955; likely resolves MemPalace#1172 too. - MemPalace#1179 (Igor): CLI mempalace search routes through _hybrid_rank, legacy-metric warning + _warn_if_legacy_metric, invariant tests on hnsw:space=cosine across all 5 collection-creation paths. - MemPalace#1180/MemPalace#1183/MemPalace#1184: cross-wing topic tunnels, init mine UX, --auto-mine. - MemPalace#1185 (perf/batched-upsert-gpu): batched ChromaDB upserts, GPU device detection via mempalace/embedding.py. - MemPalace#1182: graceful Ctrl-C during mempalace mine. - MemPalace#1168: tunnel permissions security fix. Conflict resolutions (8 files): - searcher.py: kept fork's CLI delegation through search_memories (cleaner single-source-of-truth path); upstream's inline-retrieval branch dropped. Merged upstream's print formatting (cosine= + bm25=) with fork's matched_via reporting from sqlite BM25 fallback. - backends/chroma.py: kept fork's _BLOB_FIX_MARKER + palace_path arg to ChromaCollection (MemPalace#1171 write lock); merged upstream's **ef_kwargs (embedding_function support from MemPalace#1185). Removed duplicate _pin_hnsw_threads (fork already cherry-picked Felipe's earlier). - mcp_server.py: kept fork's palace_path arg + upstream's clearer comment on hnsw:num_threads=1 rationale. - miner.py: took upstream's serial mine() flow (mine_global_lock + Ctrl-C), RESTORED fork's strict detect_room — substring matching from upstream breaks fork-only test_detect_room_priority1_no_substring_match. Added `import re` for word-boundary regex matching. Fork-ahead concurrent mining (workers=, ThreadPoolExecutor from 5cd14bd) is dropped — daemon migration deprioritizes local concurrent mining; can re-port if needed. - CHANGELOG.md: combined fork's segfault-trio narrative with upstream's v3.3.4 release notes. - HOOKS_TUTORIAL.md: took MEMPALACE_PYTHON env var name (fork README was stale; hooks already use this name per fork-ahead MemPalace#19). - test_convo_miner_unit.py: took both contextlib + pytest imports. - test_searcher.py: imported upstream's 3 new TestSearchCLI tests but skipped them with TODOs — they assume upstream's inline-retrieval CLI with simpler mocks. Rewrite needed for fork's delegated search_memories path (sqlite BM25 fallback + scope counting). Test result: 1334 passed, 3 skipped (the upstream tests above), 3 warnings. Implications for open fork PRs: - MemPalace#1171 (cross-process write lock at adapter) becomes structurally redundant given MemPalace#976's mine_global_lock + daemon-strict serialization. Slated for close with thank-you to Felipe. - MemPalace#1173/MemPalace#1177 still relevant (quarantine threshold + blob_seq marker). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
af14d1b to
fe6a640
Compare
…emPalace#1090) Opening chroma.sqlite3 via Python's sqlite3.connect() against a live ChromaDB 1.5.x WAL-mode database leaves state that segfaults the next PersistentClient call — the same failure mode tracked at MemPalace#1090. _fix_blob_seq_ids runs unconditionally on every make_client() call, so every fresh process (MCP server, stop hook, CLI) re-triggers the sqlite open → corrupt → segfault cycle on palaces that have already completed the 0.6.x → 1.5.x seq_id migration. Guard with a .blob_seq_ids_migrated marker file in the palace directory: - If marker exists, return immediately — skip sqlite entirely - After successful migration (or confirmation that no BLOBs remain), write the marker so subsequent opens take the fast path - Palaces that never had BLOB seq_ids also get the marker on first open, so they too avoid the redundant sqlite open after that - Already-migrated palaces can touch the marker manually to opt in Test plan: Direct test — run _fix_blob_seq_ids twice against a fresh palace; second call returns immediately because marker exists. 1094 existing tests pass.
fe6a640 to
bc24aa1
Compare
Two new scripts plus legitimate fixes they surfaced. scripts/preflight.sh — runs the same checks CI does (ruff check, ruff format --check, pytest). Prevents the class of "I ran ruff format locally but CI runs ruff check too" bugs we hit on PR MemPalace#1024 and PR MemPalace#1198 today. scripts/rebase-on-develop.sh — rebases a fork PR branch onto upstream/develop, then auto-restores fork-only collateral (.claude-plugin/hooks/*.sh, .claude-plugin/.mcp.json) to upstream's state. The collateral cleanup was the recurring step I had to remember manually for each of MemPalace#1005/MemPalace#1024/MemPalace#1177; this codifies it. Supports --finish for the "after I resolved conflicts manually" continuation path. Never auto-pushes — prints the push command for confirmation. Lint debt the new preflight caught on main: - tests/test_palace_graph.py: F811 — `invalidate_graph_cache` was imported at line 9, then re-imported inside the chromadb-mock patch.dict block at line 44. Removed the duplicate; the line-9 import is sufficient since the autouse fixture at line 12 already uses it. - mempalace/backends/chroma.py: ruff format wanted a one-line `collection.modify(configuration=...)` call instead of the wrapped multi-line form. - mempalace/hooks_cli.py: ruff format wanted blank line after import in the daemon-URL try block, and dict-literal style for the json.dumps call. CI on jphein/mempalace runs both ruff check and ruff format --check; without these fixes the next push to main would have shown red. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
bensig
left a comment
There was a problem hiding this comment.
Approve. The marker pattern is the right shape.
Why this matters
The PR body identifies a real problem: opening Python's `sqlite3` against a ChromaDB 1.5.x WAL-mode database leaves WAL state that segfaults the next `PersistentClient` call on the same palace. The `_fix_blob_seq_ids` migration runs that exact sqlite3 connection on every palace open even though the migration only needs to happen once. Marker file lets the migration run once and then get out of the way.
Marker semantics — what I checked
- Marker file naming: `.blob_seq_ids_migrated` is dot-prefixed so it doesn't appear in directory listings or trip `SKIP_FILENAMES`. ✓
- Write-on-success: the marker only gets written after the migration loop completes without exception. If the loop raises, no marker, retry next time. ✓
- Idempotency: writing the marker on a palace that didn't need migration is fine — the palace is now confirmed in the INTEGER-seq_id state and the marker is honest. ✓
- Manual touch: the docstring tells power users they can `touch .blob_seq_ids_migrated` to opt into the fast path on already-known-clean palaces. Worth documenting somewhere user-facing eventually but not in this PR.
Tests
Ran `pytest tests/ -k 'chroma or quarantine or blob or hnsw or repair'` on this branch — 55 passed, no regressions.
Approving.
…#1173 status CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec) that landed earlier today. PR table updated to reflect bensig's 2026-04-26 approvals on four of our open PRs: - MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit shape, then force-pushed two safety commits (cold-start gate + integrity sniff-test) after a production cold-start incident destroyed three healthy 253MB segments. Now mergeable=CONFLICTING against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976) — needs rebase + re-review. - MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge. YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG regenerated to match. promises.md (in claude-config, not this repo) got a long log entry covering today's full output. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Thanks @bensig — ready for merge whenever convenient. Tests stable on the fork's daily-driver palace + smoke pass on the develop branch. |
|
Targeted fix and the failure-path ordering looks right — Two small asks before merge:
Style nit (ignorable): |
Restore three `_pin_hnsw_threads` tests that the previous integrity-gate commit deleted. The function is still live code on develop (defined at chroma.py:207, called from chroma.py:705 + mcp_server.py), so the deletion left the import unused (ruff F401) and dropped coverage on a function unrelated to this PR's scope. Restored verbatim from main. Plus three nits @igorls flagged: - **Thread-safety doc**: `_quarantined_paths` mutation is lock-free; documented that idempotency of `quarantine_stale_hnsw` is the safety property (concurrent same-palace calls produce a benign redundant rename attempt that no-ops, no need for a lock). - **Pickle protocol assumption**: `_segment_appears_healthy` requires PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today, and a future protocol-0/1 emission would conservatively quarantine + lazy-rebuild rather than mis-classify as healthy. - Class-level vs module-level scope: keeping class-level — the conftest reset is the controlled case, and module-level wouldn't remove the foot-gun, just relocate it. Conftest reset documented in the existing comment is the right pattern for test isolation. Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`) deferred — that pattern lives in MemPalace#1177's territory, not MemPalace#1173's. 37/37 tests pass on the PR branch. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…view Three new tests cover the marker contract: - test_fix_blob_seq_ids_writes_marker_after_blob_path — marker is written after a successful BLOB → INTEGER conversion. - test_fix_blob_seq_ids_writes_marker_when_already_integer — marker is written even on a no-op (already-INTEGER) path. The point of the marker is to skip sqlite3.connect() on subsequent calls, not to record that a conversion happened. - test_fix_blob_seq_ids_skips_sqlite_when_marker_present — verifies the load-bearing property via monkeypatched sqlite3.connect: when the marker exists, the function never opens sqlite. This is the bug MemPalace#1090 fix — opening Python's sqlite3 against a live ChromaDB 1.5.x WAL DB corrupts the next PersistentClient call, and we must never do it again post-migration. Also folded in @igorls's style nit: - Path(marker).touch() instead of open(marker, "a").close(). Same effect, reads cleaner. Moved Path import to the top of the module since it didn't yet exist there. 35/35 backend tests pass. The "Production: fresh MCP server + stop hook + mempalace mine subprocess" checkbox in the PR body is checked in the PR reply — the canonical 151K palace has been running this fix since MemPalace#1177 was filed (via Syncthing-replicated source on the disks daemon at v1.7.0); zero PersistentClient corruption since. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
@igorls — addressed in `2477442`: Three marker tests in test_backends.py:
Style nit: `Path(marker).touch()` instead of `open(marker, "a").close()`. Same effect, cleaner read. Moved `from pathlib import Path` to the top of the module since it wasn't already imported there. Production checkbox: ticking it now. The canonical 151K-drawer palace on the disks daemon (v1.7.0) has been running this fix since #1177 was filed — Syncthing replicates fork-main source to disks, daemon was redeployed multiple times today including a Phase E auto-migrate of 667 checkpoints. Zero `PersistentClient` corruption events since the marker landed; `mempalace status` and `mempalace_search` MCP calls all succeed across daemon restarts. 35/35 backend tests pass on the rebased branch. |
Brings in upstream's corpus-origin + privacy-warning track (PRs MemPalace#1211 MemPalace#1221 MemPalace#1223 MemPalace#1224 MemPalace#1225) plus the canonical merged versions of our four PRs that landed today (21:22-21:41 UTC): MemPalace#1173 quarantine_stale_hnsw on make_client + cold-start gate + integrity sniff-test (PROTO/STOP byte check, no deserialization) MemPalace#1177 .blob_seq_ids_migrated marker guard, closes MemPalace#1090 MemPalace#1198 _tokenize None-document guard in BM25 reranker MemPalace#1201 palace_graph.build_graph skips None metadata Conflict resolution: * mempalace/backends/chroma.py — took upstream as base (it has the igorls-review pickle-protocol docstring, thread-safety paragraph, and Path(marker).touch() style nit), then re-applied MemPalace#1094's _coerce_none_metas in query()/get() since MemPalace#1094 is still open and not yet in develop. * mempalace/mcp_server.py — took upstream's clean form. Dropped the fork-only `palace_path=` kwarg from four ChromaCollection() call sites: the kwarg was load-bearing for MemPalace#1171's per-collection write lock, but MemPalace#1171 closed in favor of MemPalace#976's mine_global_lock + daemon-strict, so the kwarg has no remaining consumer. ChromaCollection.__init__ in upstream/develop is back to (self, collection); calling it with palace_path= raised TypeError → silently swallowed by the broad except in _get_collection() → returned None → tool_status() returned _no_palace(). 41 mcp_server tests went from failing-with-KeyError to passing. * mempalace/cli.py — dropped fork-only `workers=args.workers` from the cmd_mine -> miner.mine() call. Pre-existing fork-side bug: the `--workers` argparse arg landed in 5cd14bd but miner.mine() never accepted a workers param, so production `mempalace mine` TypeError'd on every invocation. Removed the broken plumbing; tests/test_cli.py updated to match. * CHANGELOG.md — took upstream verbatim. Fork-specific changelog lives in FORK_CHANGELOG.md (canonical: docs/fork-changes.yaml). * CLAUDE.md — kept ours. Fork's CLAUDE.md is operational; upstream's added a "Design Principles / Contributing" charter, which lives in README.md on the fork. * tests/test_backends.py — took upstream's ruff-formatted line widths. docs/fork-changes.yaml flips the two MemPalace#1173 entries (hnsw-integrity-gate, hnsw-cold-start-gate) and the MemPalace#1201 entry (palace-graph-none-guard) from OPEN to MERGED 2026-04-26. MemPalace#1173 MemPalace#1177 MemPalace#1198 MemPalace#1201 added to the merged_upstream archive at the bottom. FORK_CHANGELOG.md regenerated. scripts/check-docs.sh: 4/4 clean. Test suite: 1460/1460. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The narrowed _fix_blob_seq_ids returned early when safe_rows was empty, but MemPalace#1177's marker contract requires the marker to be written on every successful pass — even no-op — so subsequent opens skip the sqlite3 connection entirely. Without this, palaces that have no genuine 0.6.x BLOBs but DO have sysdb-10-prefixed rows would re-open sqlite3 on every call, defeating the MemPalace#1090 corruption guard. Restructured the conditional so the marker write is unconditional after a successful sqlite scan, regardless of whether any rows were updated. Surfaced by test_fix_blob_seq_ids_writes_marker_when_already_integer during the develop-rebase of this PR. The author's branch predates the marker contract from MemPalace#1177 (merged 2026-04-26), so this is a rebase-edge fix-up rather than a logic change to their narrowing behaviour.
Upstream merged MemPalace#1173, MemPalace#1177, MemPalace#1198, MemPalace#1201 into develop on 2026-04-26 (picked up by the 2026-04-27 develop sync). Strike out their fork-ahead rows in CLAUDE.md and add to the "Merged into upstream" section. Update PR table accordingly: 18 merged (was 14), 7 open (was 8). Bump version note: upstream shipped v3.3.3 on 2026-04-24 (carries our MemPalace#659/MemPalace#1021); v3.3.4 prep branch in flight. README: - "tracks upstream/develop through the 2026-04-27 sync" (was 2026-04-26) - 17 fork-ahead changes (was 22) - 1,510 tests pass on main (was 1,395 — upstream brought new suites) - "Open upstream PRs" 7 entries (was 11) - Drop merged rows from "Fork-ahead — open or pending" table; keep PreCompact recovery-marker row (still fork-only) scripts/check-docs.sh: clean (90 PR refs match upstream state, 13 fork hashes resolve, FORK_CHANGELOG renders clean from YAML). docs/fork-changes.yaml: no edit needed — already had merged_date on the 4 entries from the 2026-04-26 commit `5fd15db`.
Restore three `_pin_hnsw_threads` tests that the previous integrity-gate commit deleted. The function is still live code on develop (defined at chroma.py:207, called from chroma.py:705 + mcp_server.py), so the deletion left the import unused (ruff F401) and dropped coverage on a function unrelated to this PR's scope. Restored verbatim from main. Plus three nits @igorls flagged: - **Thread-safety doc**: `_quarantined_paths` mutation is lock-free; documented that idempotency of `quarantine_stale_hnsw` is the safety property (concurrent same-palace calls produce a benign redundant rename attempt that no-ops, no need for a lock). - **Pickle protocol assumption**: `_segment_appears_healthy` requires PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today, and a future protocol-0/1 emission would conservatively quarantine + lazy-rebuild rather than mis-classify as healthy. - Class-level vs module-level scope: keeping class-level — the conftest reset is the controlled case, and module-level wouldn't remove the foot-gun, just relocate it. Conftest reset documented in the existing comment is the right pattern for test isolation. Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`) deferred — that pattern lives in MemPalace#1177's territory, not MemPalace#1173's. 37/37 tests pass on the PR branch. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…view Three new tests cover the marker contract: - test_fix_blob_seq_ids_writes_marker_after_blob_path — marker is written after a successful BLOB → INTEGER conversion. - test_fix_blob_seq_ids_writes_marker_when_already_integer — marker is written even on a no-op (already-INTEGER) path. The point of the marker is to skip sqlite3.connect() on subsequent calls, not to record that a conversion happened. - test_fix_blob_seq_ids_skips_sqlite_when_marker_present — verifies the load-bearing property via monkeypatched sqlite3.connect: when the marker exists, the function never opens sqlite. This is the bug MemPalace#1090 fix — opening Python's sqlite3 against a live ChromaDB 1.5.x WAL DB corrupts the next PersistentClient call, and we must never do it again post-migration. Also folded in @igorls's style nit: - Path(marker).touch() instead of open(marker, "a").close(). Same effect, reads cleaner. Moved Path import to the top of the module since it didn't yet exist there. 35/35 backend tests pass. The "Production: fresh MCP server + stop hook + mempalace mine subprocess" checkbox in the PR body is checked in the PR reply — the canonical 151K palace has been running this fix since MemPalace#1177 was filed (via Syncthing-replicated source on the disks daemon at v1.7.0); zero PersistentClient corruption since. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The narrowed _fix_blob_seq_ids returned early when safe_rows was empty, but MemPalace#1177's marker contract requires the marker to be written on every successful pass — even no-op — so subsequent opens skip the sqlite3 connection entirely. Without this, palaces that have no genuine 0.6.x BLOBs but DO have sysdb-10-prefixed rows would re-open sqlite3 on every call, defeating the MemPalace#1090 corruption guard. Restructured the conditional so the marker write is unconditional after a successful sqlite scan, regardless of whether any rows were updated. Surfaced by test_fix_blob_seq_ids_writes_marker_when_already_integer during the develop-rebase of this PR. The author's branch predates the marker contract from MemPalace#1177 (merged 2026-04-26), so this is a rebase-edge fix-up rather than a logic change to their narrowing behaviour.
…ia marker (upstream MemPalace#1177) Adds _pin_hnsw_threads helper + _BLOB_FIX_MARKER guard to avoid WAL-state segfault on already-migrated palaces.
Sync with upstream/develop via cherry-pick of 9 critical bugfixes: - HNSW index bloat prevention (MemPalace#1191) - SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289) - blob-seq marker fast-path (MemPalace#1177) - palace_graph None-metadata guard (MemPalace#1201) - palace_graph security tunnels (MemPalace#1168) - hyphenated wing name normalization (MemPalace#1197) - searcher _tokenize None guard (MemPalace#1198) - mcp_server diary topic sanitize (MemPalace#936) Tests: 1459 default + 113 benchmark + 6/7 stress all pass. (random-kill stress test was failing on dev pre-merge; not regressed.)
Summary
Fixes #1090.
Opening
chroma.sqlite3via Python'ssqlite3.connect()against a live ChromaDB 1.5.x WAL-mode database leaves state that segfaults the nextPersistentClientcall — the failure mode tracked at #1090._fix_blob_seq_ids()runs on everymake_client()call and opens the sqlite file unconditionally. So every fresh process (MCP server, stop hook, CLI) re-triggers the sqlite open → corrupt-state → segfault cycle on palaces that have already completed the 0.6.x → 1.5.xseq_idmigration.Guard: a
.blob_seq_ids_migratedmarker file in the palace directory._fix_blob_seq_ids()returns immediately — no sqlite open, no corruption risk.seq_ids — those palaces now also avoid the redundant sqlite open.touch .mempalace/palace/.blob_seq_ids_migratedmanually to opt in immediately.Why the sqlite open is unsafe
ChromaDB 1.5.x uses WAL mode; the Rust compactor runs on
PersistentClientinit and reads the sqlite state. Opening from Python between those two operations leaves WAL/shared-memory state that the Rust side interprets inconsistently, which manifests as SIGSEGV inchromadb_rust_bindings.abi3.so. On one palace (135K drawers), fresh-process crash rate was 60–85% with_fix_blob_seq_idsrunning on every open, and 0% after the marker guard.This is one of three known segfault triggers in the 1.5.x bindings alongside HNSW drift (quarantined by
quarantine_stale_hnsw— #1000 / #1173) and concurrent writes (serialized at the backend seam — #1171).Test plan
_fix_blob_seq_idsreturns before anysqlite3.connect()callmempalace minesubprocess all run against a post-migration palace without segfaultsGenerated with Claude Code