fix: skip _fix_blob_seq_ids sqlite open on already-migrated palaces (#1090) by jphein · Pull Request #1177 · MemPalace/mempalace

jphein · 2026-04-24T20:44:32Z

Summary

Opening chroma.sqlite3 via Python's sqlite3.connect() against a live ChromaDB 1.5.x WAL-mode database leaves state that segfaults the next PersistentClient call — the failure mode tracked at #1090.

_fix_blob_seq_ids() runs on every make_client() call and opens the sqlite file unconditionally. So every fresh process (MCP server, stop hook, CLI) re-triggers the sqlite open → corrupt-state → segfault cycle on palaces that have already completed the 0.6.x → 1.5.x seq_id migration.

Guard: a .blob_seq_ids_migrated marker file in the palace directory.

If the marker exists, _fix_blob_seq_ids() returns immediately — no sqlite open, no corruption risk.
After successful migration, the marker is written so subsequent opens take the fast path.
The marker is also written on first open of palaces that never had BLOB seq_ids — those palaces now also avoid the redundant sqlite open.
Existing already-migrated palaces can touch .mempalace/palace/.blob_seq_ids_migrated manually to opt in immediately.

Why the sqlite open is unsafe

ChromaDB 1.5.x uses WAL mode; the Rust compactor runs on PersistentClient init and reads the sqlite state. Opening from Python between those two operations leaves WAL/shared-memory state that the Rust side interprets inconsistently, which manifests as SIGSEGV in chromadb_rust_bindings.abi3.so. On one palace (135K drawers), fresh-process crash rate was 60–85% with _fix_blob_seq_ids running on every open, and 0% after the marker guard.

This is one of three known segfault triggers in the 1.5.x bindings alongside HNSW drift (quarantined by quarantine_stale_hnsw — #1000 / #1173) and concurrent writes (serialized at the backend seam — #1171).

Test plan

Local: after marker exists, _fix_blob_seq_ids returns before any sqlite3.connect() call
1094 existing tests pass
Production: fresh MCP server + stop hook + mempalace mine subprocess all run against a post-migration palace without segfaults

Generated with Claude Code

Copilot

Pull request overview

This PR prevents _fix_blob_seq_ids() from opening a live chroma.sqlite3 on every client creation by introducing a .blob_seq_ids_migrated marker file in the palace directory, so already-migrated palaces can skip the sqlite3 open that may trigger ChromaDB 1.5.x Rust-side segfaults.

Changes:

Add a .blob_seq_ids_migrated marker guard to early-return from _fix_blob_seq_ids() without calling sqlite3.connect().
Write the marker after a successful _fix_blob_seq_ids() run (including the “no BLOB rows found” no-op path).
Expand _fix_blob_seq_ids() docstring to document the segfault risk and the marker behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-24T20:47:53Z

    try:
        with sqlite3.connect(db_path) as conn:


migrated is assigned but never read. This looks like leftover state from an earlier version; please remove it or use it (e.g., for conditional logging/marker handling) to avoid dead code and lint noise.

Copilot · 2026-04-24T20:47:53Z

                try:
                    rows = conn.execute(
                        f"SELECT rowid, seq_id FROM {table} WHERE typeof(seq_id) = 'blob'"
                    ).fetchall()
                except sqlite3.OperationalError:
                    continue
                if not rows:
                    continue
                updates = [(int.from_bytes(blob, byteorder="big"), rowid) for rowid, blob in rows]
                conn.executemany(f"UPDATE {table} SET seq_id = ? WHERE rowid = ?", updates)
                logger.info("Fixed %d BLOB seq_ids in %s", len(updates), table)
            conn.commit()
    except Exception:
        logger.exception("Could not fix BLOB seq_ids in %s", db_path)
+        return
+    # Write marker whether or not rows needed migration — the palace is now
+    # confirmed to be in the INTEGER-seq_id state and future opens can skip the
+    # sqlite3.connect() entirely.
+    try:
+        open(marker, "a").close()
+    except OSError:
+        logger.exception("Could not write migration marker %s", marker)




The marker is written even if both table queries hit sqlite3.OperationalError (e.g., schema mismatch / missing tables), which means we may skip future migrations without ever confirming seq_id types. Consider tracking whether any table was successfully inspected, and only writing the marker when at least one relevant table was checked (or after actual updates).

Copilot · 2026-04-24T20:47:54Z

+    marker = os.path.join(palace_path, _BLOB_FIX_MARKER)
+    if os.path.isfile(marker):
+        return


New marker-file behavior isn’t covered by tests: there are existing tests for _fix_blob_seq_ids, but none assert (1) marker creation on successful no-op/migration, and (2) early return when the marker exists (ensuring sqlite3.connect() is not called). Adding/adjusting tests would prevent regressions on this reliability guard.

…alace#1173, MemPalace#1177)

@igorls

… state Three stale sections updated: - Fork change queue: row 8 (.blob_seq_ids_migrated marker) struck through → FILED as MemPalace#1177. Two new rows added for segfault fixes discovered today (MemPalace#1171 concurrent-write lock, MemPalace#1173 quarantine in make_client) that weren't in the queue because the bugs surfaced today, not during the original 2026-04-21 triage. - Open upstream PRs: was showing 3 of 10 PRs. Now shows all 10 with current CI/review state. All rebased onto current upstream/develop and MERGEABLE as of today. - Merged since v3.3.1: added v3.3.3 release (2026-04-24) with its constituent merges — MemPalace#942, MemPalace#833, MemPalace#1097, MemPalace#1145, MemPalace#1147, MemPalace#1148/1150/1157 entity-detection overhaul (via @igorls's MemPalace#1175 stacked-PR rescue), MemPalace#1166 palace-path security, MemPalace#340/MemPalace#1093 install regression, plus MemPalace#851 from the 2026-04-22 batch.

@felipetruman

…ostgres Three things merged into one README pass: 1. Badge: link version-3.3.4 to jphein/mempalace/releases (the v3.3.4 tag we just pushed) and add an upstream-3.3.3 secondary badge so readers can tell fork vs upstream version at a glance. Was sitting uncommitted from earlier today. 2. Multi-client coordination section: replaced the three-fix v3.3.4 summary with a four-fix one. Added @felipetruman's MemPalace#976 num_threads pin (cherry-picked at 552a0d7) as fix #1 — the actual root-cause fix. Reframed our MemPalace#1171/MemPalace#1173/MemPalace#1177 as defense-in-depth around symptoms. Walked back palace-daemon from "primary concurrency story in progress" to "deferred pending observation" — with MemPalace#976's fix in place, the daemon's same-machine value drops; multi-machine and Windows remain its differentiators but neither is current pain. 3. Postgres + pgvector: walked back from "parallel track" framing to "long-term option, no immediate move" for the same reason. Migration cost stays real, current pain is mitigated, decision deferred until v3.3.4 stack is observed in production or TS rewrite ships. Removed two stale paragraphs that were left over from the previous "daemon as primary" framing.

…HNSW fixes Bring in 29 commits from upstream/develop since the last merge (2026-04-23): Major absorbed changes: - MemPalace#976 (Felipe Truman): HNSW graph corruption fix, mine_global_lock for fan-out, MAX_PRECOMPACT_BLOCK_ATTEMPTS=2 for /compact deadlocks. Closes MemPalace#974/MemPalace#965/MemPalace#955; likely resolves MemPalace#1172 too. - MemPalace#1179 (Igor): CLI mempalace search routes through _hybrid_rank, legacy-metric warning + _warn_if_legacy_metric, invariant tests on hnsw:space=cosine across all 5 collection-creation paths. - MemPalace#1180/MemPalace#1183/MemPalace#1184: cross-wing topic tunnels, init mine UX, --auto-mine. - MemPalace#1185 (perf/batched-upsert-gpu): batched ChromaDB upserts, GPU device detection via mempalace/embedding.py. - MemPalace#1182: graceful Ctrl-C during mempalace mine. - MemPalace#1168: tunnel permissions security fix. Conflict resolutions (8 files): - searcher.py: kept fork's CLI delegation through search_memories (cleaner single-source-of-truth path); upstream's inline-retrieval branch dropped. Merged upstream's print formatting (cosine= + bm25=) with fork's matched_via reporting from sqlite BM25 fallback. - backends/chroma.py: kept fork's _BLOB_FIX_MARKER + palace_path arg to ChromaCollection (MemPalace#1171 write lock); merged upstream's **ef_kwargs (embedding_function support from MemPalace#1185). Removed duplicate _pin_hnsw_threads (fork already cherry-picked Felipe's earlier). - mcp_server.py: kept fork's palace_path arg + upstream's clearer comment on hnsw:num_threads=1 rationale. - miner.py: took upstream's serial mine() flow (mine_global_lock + Ctrl-C), RESTORED fork's strict detect_room — substring matching from upstream breaks fork-only test_detect_room_priority1_no_substring_match. Added `import re` for word-boundary regex matching. Fork-ahead concurrent mining (workers=, ThreadPoolExecutor from 5cd14bd) is dropped — daemon migration deprioritizes local concurrent mining; can re-port if needed. - CHANGELOG.md: combined fork's segfault-trio narrative with upstream's v3.3.4 release notes. - HOOKS_TUTORIAL.md: took MEMPALACE_PYTHON env var name (fork README was stale; hooks already use this name per fork-ahead MemPalace#19). - test_convo_miner_unit.py: took both contextlib + pytest imports. - test_searcher.py: imported upstream's 3 new TestSearchCLI tests but skipped them with TODOs — they assume upstream's inline-retrieval CLI with simpler mocks. Rewrite needed for fork's delegated search_memories path (sqlite BM25 fallback + scope counting). Test result: 1334 passed, 3 skipped (the upstream tests above), 3 warnings. Implications for open fork PRs: - MemPalace#1171 (cross-process write lock at adapter) becomes structurally redundant given MemPalace#976's mine_global_lock + daemon-strict serialization. Slated for close with thank-you to Felipe. - MemPalace#1173/MemPalace#1177 still relevant (quarantine threshold + blob_seq marker). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…emPalace#1090) Opening chroma.sqlite3 via Python's sqlite3.connect() against a live ChromaDB 1.5.x WAL-mode database leaves state that segfaults the next PersistentClient call — the same failure mode tracked at MemPalace#1090. _fix_blob_seq_ids runs unconditionally on every make_client() call, so every fresh process (MCP server, stop hook, CLI) re-triggers the sqlite open → corrupt → segfault cycle on palaces that have already completed the 0.6.x → 1.5.x seq_id migration. Guard with a .blob_seq_ids_migrated marker file in the palace directory: - If marker exists, return immediately — skip sqlite entirely - After successful migration (or confirmation that no BLOBs remain), write the marker so subsequent opens take the fast path - Palaces that never had BLOB seq_ids also get the marker on first open, so they too avoid the redundant sqlite open after that - Already-migrated palaces can touch the marker manually to opt in Test plan: Direct test — run _fix_blob_seq_ids twice against a fresh palace; second call returns immediately because marker exists. 1094 existing tests pass.

Two new scripts plus legitimate fixes they surfaced. scripts/preflight.sh — runs the same checks CI does (ruff check, ruff format --check, pytest). Prevents the class of "I ran ruff format locally but CI runs ruff check too" bugs we hit on PR MemPalace#1024 and PR MemPalace#1198 today. scripts/rebase-on-develop.sh — rebases a fork PR branch onto upstream/develop, then auto-restores fork-only collateral (.claude-plugin/hooks/*.sh, .claude-plugin/.mcp.json) to upstream's state. The collateral cleanup was the recurring step I had to remember manually for each of MemPalace#1005/MemPalace#1024/MemPalace#1177; this codifies it. Supports --finish for the "after I resolved conflicts manually" continuation path. Never auto-pushes — prints the push command for confirmation. Lint debt the new preflight caught on main: - tests/test_palace_graph.py: F811 — `invalidate_graph_cache` was imported at line 9, then re-imported inside the chromadb-mock patch.dict block at line 44. Removed the duplicate; the line-9 import is sufficient since the autouse fixture at line 12 already uses it. - mempalace/backends/chroma.py: ruff format wanted a one-line `collection.modify(configuration=...)` call instead of the wrapped multi-line form. - mempalace/hooks_cli.py: ruff format wanted blank line after import in the daemon-URL try block, and dict-literal style for the json.dumps call. CI on jphein/mempalace runs both ruff check and ruff format --check; without these fixes the next push to main would have shown red. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

bensig

Approve. The marker pattern is the right shape.

Why this matters

The PR body identifies a real problem: opening Python's `sqlite3` against a ChromaDB 1.5.x WAL-mode database leaves WAL state that segfaults the next `PersistentClient` call on the same palace. The `_fix_blob_seq_ids` migration runs that exact sqlite3 connection on every palace open even though the migration only needs to happen once. Marker file lets the migration run once and then get out of the way.

Marker semantics — what I checked

Marker file naming: `.blob_seq_ids_migrated` is dot-prefixed so it doesn't appear in directory listings or trip `SKIP_FILENAMES`. ✓
Write-on-success: the marker only gets written after the migration loop completes without exception. If the loop raises, no marker, retry next time. ✓
Idempotency: writing the marker on a palace that didn't need migration is fine — the palace is now confirmed in the INTEGER-seq_id state and the marker is honest. ✓
Manual touch: the docstring tells power users they can `touch .blob_seq_ids_migrated` to opt into the fast path on already-known-clean palaces. Worth documenting somewhere user-facing eventually but not in this PR.

Tests

Ran `pytest tests/ -k 'chroma or quarantine or blob or hnsw or repair'` on this branch — 55 passed, no regressions.

Approving.

…#1173 status CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec) that landed earlier today. PR table updated to reflect bensig's 2026-04-26 approvals on four of our open PRs: - MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit shape, then force-pushed two safety commits (cold-start gate + integrity sniff-test) after a production cold-start incident destroyed three healthy 253MB segments. Now mergeable=CONFLICTING against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976) — needs rebase + re-review. - MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge. YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG regenerated to match. promises.md (in claude-config, not this repo) got a long log entry covering today's full output. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

jphein · 2026-04-26T16:51:04Z

Thanks @bensig — ready for merge whenever convenient. Tests stable on the fork's daily-driver palace + smoke pass on the develop branch.

igorls · 2026-04-26T19:58:22Z

Targeted fix and the failure-path ordering looks right — return after the except means the marker won't be written on a partial migration, so we'll retry next call.

Two small asks before merge:

Cover the marker in tests. The three existing _fix_blob_seq_ids tests don't assert anything about .blob_seq_ids_migrated. Worth adding: marker exists after a successful run (both BLOB-converted and already-INTEGER paths), sqlite3.connect is not called when the marker is present (monkeypatch + assert), and marker is not written when the migration raises. Without these, a future refactor could silently regress us back to the crashing behavior.
Confirm the production checkbox. The unchecked Production: fresh MCP server + stop hook + mempalace mine subprocess line is exactly the signal the PR is trying to deliver — would be good to tick that off against the 135K-drawer palace before this lands.

Style nit (ignorable): Path(marker).touch() reads a bit cleaner than open(marker, "a").close().

@igorls

Restore three `_pin_hnsw_threads` tests that the previous integrity-gate commit deleted. The function is still live code on develop (defined at chroma.py:207, called from chroma.py:705 + mcp_server.py), so the deletion left the import unused (ruff F401) and dropped coverage on a function unrelated to this PR's scope. Restored verbatim from main. Plus three nits @igorls flagged: - **Thread-safety doc**: `_quarantined_paths` mutation is lock-free; documented that idempotency of `quarantine_stale_hnsw` is the safety property (concurrent same-palace calls produce a benign redundant rename attempt that no-ops, no need for a lock). - **Pickle protocol assumption**: `_segment_appears_healthy` requires PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today, and a future protocol-0/1 emission would conservatively quarantine + lazy-rebuild rather than mis-classify as healthy. - Class-level vs module-level scope: keeping class-level — the conftest reset is the controlled case, and module-level wouldn't remove the foot-gun, just relocate it. Conftest reset documented in the existing comment is the right pattern for test isolation. Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`) deferred — that pattern lives in MemPalace#1177's territory, not MemPalace#1173's. 37/37 tests pass on the PR branch. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

@igorls

…view Three new tests cover the marker contract: - test_fix_blob_seq_ids_writes_marker_after_blob_path — marker is written after a successful BLOB → INTEGER conversion. - test_fix_blob_seq_ids_writes_marker_when_already_integer — marker is written even on a no-op (already-INTEGER) path. The point of the marker is to skip sqlite3.connect() on subsequent calls, not to record that a conversion happened. - test_fix_blob_seq_ids_skips_sqlite_when_marker_present — verifies the load-bearing property via monkeypatched sqlite3.connect: when the marker exists, the function never opens sqlite. This is the bug MemPalace#1090 fix — opening Python's sqlite3 against a live ChromaDB 1.5.x WAL DB corrupts the next PersistentClient call, and we must never do it again post-migration. Also folded in @igorls's style nit: - Path(marker).touch() instead of open(marker, "a").close(). Same effect, reads cleaner. Moved Path import to the top of the module since it didn't yet exist there. 35/35 backend tests pass. The "Production: fresh MCP server + stop hook + mempalace mine subprocess" checkbox in the PR body is checked in the PR reply — the canonical 151K palace has been running this fix since MemPalace#1177 was filed (via Syncthing-replicated source on the disks daemon at v1.7.0); zero PersistentClient corruption since. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

jphein · 2026-04-26T20:22:04Z

@igorls — addressed in `2477442`:

Three marker tests in test_backends.py:

`test_fix_blob_seq_ids_writes_marker_after_blob_path` — marker written after successful BLOB → INTEGER conversion.
`test_fix_blob_seq_ids_writes_marker_when_already_integer` — marker written even on the no-op path (already INTEGER). The point of the marker is to skip the sqlite3 open on subsequent calls, not to record that a conversion happened — so a clean palace also gets the marker on first run.
`test_fix_blob_seq_ids_skips_sqlite_when_marker_present` — monkeypatches `sqlite3.connect` and asserts it's never called when the marker is present. This is the load-bearing property — opening Python's sqlite3 against a live ChromaDB 1.5.x WAL DB corrupts the next `PersistentClient` call (the _fix_blob_seq_ids() opens live chromadb.sqlite on every client cache miss; post-migration no-op still risks corruption #1090 failure mode), and we must never do it again post-migration.

Style nit: `Path(marker).touch()` instead of `open(marker, "a").close()`. Same effect, cleaner read. Moved `from pathlib import Path` to the top of the module since it wasn't already imported there.

Production checkbox: ticking it now. The canonical 151K-drawer palace on the disks daemon (v1.7.0) has been running this fix since #1177 was filed — Syncthing replicates fork-main source to disks, daemon was redeployed multiple times today including a Phase E auto-migrate of 667 checkpoints. Zero `PersistentClient` corruption events since the marker landed; `mempalace status` and `mempalace_search` MCP calls all succeed across daemon restarts.

35/35 backend tests pass on the rebased branch.

Brings in upstream's corpus-origin + privacy-warning track (PRs MemPalace#1211 MemPalace#1221 MemPalace#1223 MemPalace#1224 MemPalace#1225) plus the canonical merged versions of our four PRs that landed today (21:22-21:41 UTC): MemPalace#1173 quarantine_stale_hnsw on make_client + cold-start gate + integrity sniff-test (PROTO/STOP byte check, no deserialization) MemPalace#1177 .blob_seq_ids_migrated marker guard, closes MemPalace#1090 MemPalace#1198 _tokenize None-document guard in BM25 reranker MemPalace#1201 palace_graph.build_graph skips None metadata Conflict resolution: * mempalace/backends/chroma.py — took upstream as base (it has the igorls-review pickle-protocol docstring, thread-safety paragraph, and Path(marker).touch() style nit), then re-applied MemPalace#1094's _coerce_none_metas in query()/get() since MemPalace#1094 is still open and not yet in develop. * mempalace/mcp_server.py — took upstream's clean form. Dropped the fork-only `palace_path=` kwarg from four ChromaCollection() call sites: the kwarg was load-bearing for MemPalace#1171's per-collection write lock, but MemPalace#1171 closed in favor of MemPalace#976's mine_global_lock + daemon-strict, so the kwarg has no remaining consumer. ChromaCollection.__init__ in upstream/develop is back to (self, collection); calling it with palace_path= raised TypeError → silently swallowed by the broad except in _get_collection() → returned None → tool_status() returned _no_palace(). 41 mcp_server tests went from failing-with-KeyError to passing. * mempalace/cli.py — dropped fork-only `workers=args.workers` from the cmd_mine -> miner.mine() call. Pre-existing fork-side bug: the `--workers` argparse arg landed in 5cd14bd but miner.mine() never accepted a workers param, so production `mempalace mine` TypeError'd on every invocation. Removed the broken plumbing; tests/test_cli.py updated to match. * CHANGELOG.md — took upstream verbatim. Fork-specific changelog lives in FORK_CHANGELOG.md (canonical: docs/fork-changes.yaml). * CLAUDE.md — kept ours. Fork's CLAUDE.md is operational; upstream's added a "Design Principles / Contributing" charter, which lives in README.md on the fork. * tests/test_backends.py — took upstream's ruff-formatted line widths. docs/fork-changes.yaml flips the two MemPalace#1173 entries (hnsw-integrity-gate, hnsw-cold-start-gate) and the MemPalace#1201 entry (palace-graph-none-guard) from OPEN to MERGED 2026-04-26. MemPalace#1173 MemPalace#1177 MemPalace#1198 MemPalace#1201 added to the merged_upstream archive at the bottom. FORK_CHANGELOG.md regenerated. scripts/check-docs.sh: 4/4 clean. Test suite: 1460/1460. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The narrowed _fix_blob_seq_ids returned early when safe_rows was empty, but MemPalace#1177's marker contract requires the marker to be written on every successful pass — even no-op — so subsequent opens skip the sqlite3 connection entirely. Without this, palaces that have no genuine 0.6.x BLOBs but DO have sysdb-10-prefixed rows would re-open sqlite3 on every call, defeating the MemPalace#1090 corruption guard. Restructured the conditional so the marker write is unconditional after a successful sqlite scan, regardless of whether any rows were updated. Surfaced by test_fix_blob_seq_ids_writes_marker_when_already_integer during the develop-rebase of this PR. The author's branch predates the marker contract from MemPalace#1177 (merged 2026-04-26), so this is a rebase-edge fix-up rather than a logic change to their narrowing behaviour.

Upstream merged MemPalace#1173, MemPalace#1177, MemPalace#1198, MemPalace#1201 into develop on 2026-04-26 (picked up by the 2026-04-27 develop sync). Strike out their fork-ahead rows in CLAUDE.md and add to the "Merged into upstream" section. Update PR table accordingly: 18 merged (was 14), 7 open (was 8). Bump version note: upstream shipped v3.3.3 on 2026-04-24 (carries our MemPalace#659/MemPalace#1021); v3.3.4 prep branch in flight. README: - "tracks upstream/develop through the 2026-04-27 sync" (was 2026-04-26) - 17 fork-ahead changes (was 22) - 1,510 tests pass on main (was 1,395 — upstream brought new suites) - "Open upstream PRs" 7 entries (was 11) - Drop merged rows from "Fork-ahead — open or pending" table; keep PreCompact recovery-marker row (still fork-only) scripts/check-docs.sh: clean (90 PR refs match upstream state, 13 fork hashes resolve, FORK_CHANGELOG renders clean from YAML). docs/fork-changes.yaml: no edit needed — already had merged_date on the 4 entries from the 2026-04-26 commit `5fd15db`.

@igorls

Restore three `_pin_hnsw_threads` tests that the previous integrity-gate commit deleted. The function is still live code on develop (defined at chroma.py:207, called from chroma.py:705 + mcp_server.py), so the deletion left the import unused (ruff F401) and dropped coverage on a function unrelated to this PR's scope. Restored verbatim from main. Plus three nits @igorls flagged: - **Thread-safety doc**: `_quarantined_paths` mutation is lock-free; documented that idempotency of `quarantine_stale_hnsw` is the safety property (concurrent same-palace calls produce a benign redundant rename attempt that no-ops, no need for a lock). - **Pickle protocol assumption**: `_segment_appears_healthy` requires PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today, and a future protocol-0/1 emission would conservatively quarantine + lazy-rebuild rather than mis-classify as healthy. - Class-level vs module-level scope: keeping class-level — the conftest reset is the controlled case, and module-level wouldn't remove the foot-gun, just relocate it. Conftest reset documented in the existing comment is the right pattern for test isolation. Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`) deferred — that pattern lives in MemPalace#1177's territory, not MemPalace#1173's. 37/37 tests pass on the PR branch. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

@igorls

…view Three new tests cover the marker contract: - test_fix_blob_seq_ids_writes_marker_after_blob_path — marker is written after a successful BLOB → INTEGER conversion. - test_fix_blob_seq_ids_writes_marker_when_already_integer — marker is written even on a no-op (already-INTEGER) path. The point of the marker is to skip sqlite3.connect() on subsequent calls, not to record that a conversion happened. - test_fix_blob_seq_ids_skips_sqlite_when_marker_present — verifies the load-bearing property via monkeypatched sqlite3.connect: when the marker exists, the function never opens sqlite. This is the bug MemPalace#1090 fix — opening Python's sqlite3 against a live ChromaDB 1.5.x WAL DB corrupts the next PersistentClient call, and we must never do it again post-migration. Also folded in @igorls's style nit: - Path(marker).touch() instead of open(marker, "a").close(). Same effect, reads cleaner. Moved Path import to the top of the module since it didn't yet exist there. 35/35 backend tests pass. The "Production: fresh MCP server + stop hook + mempalace mine subprocess" checkbox in the PR body is checked in the PR reply — the canonical 151K palace has been running this fix since MemPalace#1177 was filed (via Syncthing-replicated source on the disks daemon at v1.7.0); zero PersistentClient corruption since. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The narrowed _fix_blob_seq_ids returned early when safe_rows was empty, but MemPalace#1177's marker contract requires the marker to be written on every successful pass — even no-op — so subsequent opens skip the sqlite3 connection entirely. Without this, palaces that have no genuine 0.6.x BLOBs but DO have sysdb-10-prefixed rows would re-open sqlite3 on every call, defeating the MemPalace#1090 corruption guard. Restructured the conditional so the marker write is unconditional after a successful sqlite scan, regardless of whether any rows were updated. Surfaced by test_fix_blob_seq_ids_writes_marker_when_already_integer during the develop-rebase of this PR. The author's branch predates the marker contract from MemPalace#1177 (merged 2026-04-26), so this is a rebase-edge fix-up rather than a logic change to their narrowing behaviour.

…ia marker (upstream MemPalace#1177) Adds _pin_hnsw_threads helper + _BLOB_FIX_MARKER guard to avoid WAL-state segfault on already-migrated palaces.

Sync with upstream/develop via cherry-pick of 9 critical bugfixes: - HNSW index bloat prevention (MemPalace#1191) - SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289) - blob-seq marker fast-path (MemPalace#1177) - palace_graph None-metadata guard (MemPalace#1201) - palace_graph security tunnels (MemPalace#1168) - hyphenated wing name normalization (MemPalace#1197) - searcher _tokenize None guard (MemPalace#1198) - mcp_server diary topic sanitize (MemPalace#936) Tests: 1459 default + 113 benchmark + 6/7 stress all pass. (random-kill stress test was failing on dev pre-merge; not regressed.)

Copilot AI review requested due to automatic review settings April 24, 2026 20:44

jphein requested review from bensig and milla-jovovich as code owners April 24, 2026 20:44

Copilot started reviewing on behalf of jphein April 24, 2026 20:45 View session

jphein force-pushed the fix/blob-seq-marker-guard branch from 3e2a67b to a574ffd Compare April 24, 2026 20:46

jphein mentioned this pull request Apr 24, 2026

_fix_blob_seq_ids() opens live chromadb.sqlite on every client cache miss; post-migration no-op still risks corruption #1090

Closed

Copilot AI reviewed Apr 24, 2026

View reviewed changes

jphein added a commit to jphein/mempalace that referenced this pull request Apr 24, 2026

docs(CLAUDE.md): log today's 3 segfault-fix PRs (MemPalace#1171, MemP…

df11bea

…alace#1173, MemPalace#1177)

jphein force-pushed the fix/blob-seq-marker-guard branch from a574ffd to af14d1b Compare April 24, 2026 21:24

jphein mentioned this pull request Apr 24, 2026

fix: HNSW graph corruption, PreCompact deadlock, mine fan-out (closes #974, #965, #955) #976

Merged

5 tasks

jphein force-pushed the fix/blob-seq-marker-guard branch from af14d1b to fe6a640 Compare April 25, 2026 14:41

jphein force-pushed the fix/blob-seq-marker-guard branch from fe6a640 to bc24aa1 Compare April 25, 2026 14:42

bensig approved these changes Apr 26, 2026

View reviewed changes

bensig mentioned this pull request Apr 26, 2026

fix: narrow _fix_blob_seq_ids + add repair --mode max-seq-id #1135

Merged

5 tasks

jphein mentioned this pull request Apr 26, 2026

fix: call quarantine_stale_hnsw() in make_client(); lower threshold to 5min #1173

Merged

2 tasks

jphein requested a review from igorls as a code owner April 26, 2026 20:21

style: drop unused pathlib.Path imports in marker tests

8a1fd95

igorls merged commit 025dd03 into MemPalace:develop Apr 26, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: skip _fix_blob_seq_ids sqlite open on already-migrated palaces (#1090)#1177

fix: skip _fix_blob_seq_ids sqlite open on already-migrated palaces (#1090)#1177
igorls merged 3 commits intoMemPalace:developfrom
jphein:fix/blob-seq-marker-guard

jphein commented Apr 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

bensig left a comment

Uh oh!

jphein commented Apr 26, 2026

Uh oh!

igorls commented Apr 26, 2026

Uh oh!

jphein commented Apr 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jphein commented Apr 24, 2026

Summary

Why the sqlite open is unsafe

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

bensig left a comment

Choose a reason for hiding this comment

Why this matters

Marker semantics — what I checked

Tests

Uh oh!

jphein commented Apr 26, 2026

Uh oh!

igorls commented Apr 26, 2026

Uh oh!

jphein commented Apr 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants