fix: narrow `_fix_blob_seq_ids` + add `repair --mode max-seq-id` by sha2fiddy · Pull Request #1135 · MemPalace/mempalace

sha2fiddy · 2026-04-23T19:16:36Z

Closes #1134.

What

Shim (mempalace/backends/chroma.py):

Drop max_seq_id from _fix_blob_seq_ids. chromadb 1.5.x writes its own BLOB format there (b'\x11\x11' + 6 ASCII digits). Misreading it as a legacy 0.6.x big-endian u64 produces a ~1.23e18 integer that stalls every subsequent write.
embeddings branch stays for real 0.6.x palaces but skips any BLOB starting with b'\x11\x11'.

Recovery (mempalace/repair.py, mempalace/cli.py):

mempalace repair --mode max-seq-id [--segment <uuid>] [--from-sidecar <path>] [--dry-run] [--yes] [--no-backup]
Detection: seq_id > 2^53.
Default heuristic: MAX(embeddings.seq_id) over the owning collection. Exact for the METADATA segment; for the sibling VECTOR segment it can land a few seq_ids ahead of the true max, which the queue treats as already consumed. Better than resetting to 0.
--from-sidecar copies clean values from a pre-corruption chroma.sqlite3 backup.
Backs up the SQLite file, closes chroma handles, runs atomic UPDATEs, and verifies after. MaxSeqIdVerificationError fires if any row is still above the threshold.

Test plan

8 new tests in tests/test_repair.py (detect, heuristic, sidecar, dry-run, segment filter, no-op, backup, rollback-on-verify-failure)
3 new tests in tests/test_backends.py (max_seq_id untouched, sysdb-10 prefix skipped in embeddings, legacy big-endian u64 still converts)
pytest tests/ --ignore=tests/benchmarks -> 1077 passed
ruff check . clean (pinned to >=0.4.0,<0.5)
ruff format --check clean on touched files

bensig

Approve. This is the deeper, correct fix for #1134. The shim narrowing is exactly right and the recovery sub-command is well-designed.

Full pytest on this branch: 1077 passed (matches the PR body claim).

Shim change — the right shape

Dropping `max_seq_id` from `_fix_blob_seq_ids` is the principled fix: chromadb 1.5.x writes its own BLOB format there (`b'\x11\x11'` + 6 ASCII digits) and we have no business converting it. The prefix check on the `embeddings` branch is good defense-in-depth — real 0.6.x BLOBs are pure big-endian u64 with no text prefix, so the check is a no-op for genuine legacy data. Worth the explicit log-on-skip so operators see when the guard fires.

Comment in `_fix_blob_seq_ids` is the right level of detail — names the symptom (`embeddings_queue` filters on `seq_id > start`, so a poisoned ~1.23e18 value silently suppresses every subsequent write). Future maintainer won't accidentally re-add `max_seq_id` to the loop.

Recovery sub-mode — well-designed

`mempalace repair --mode max-seq-id` fits cleanly into the existing repair entry. Specific things I checked:

Detection threshold = 2^53. Defensible — that's JS Number.MAX_SAFE_INTEGER, the practical correctness bound for any consumer that JSON-encodes the value. The actual corruption value (~1.23e18) is well above this; the threshold catches anything pathological without false-positiving on real high-throughput palaces.
Heuristic via `MAX(embeddings.seq_id)` over the owning collection. PR body is honest about the trade-off: "for the sibling VECTOR segment it can land a few seq_ids ahead of the true max, which the queue treats as already consumed." That means a few legit-but-pending writes could be marked stale. Reasonable for an emergency recovery tool — better than resetting to 0 (which would re-process everything) or refusing to recover at all.
`--from-sidecar` for the careful path. Letting users point at a pre-corruption `chroma.sqlite3` backup and copy clean values verbatim is the correct ergonomic for the case where they have one.
Backup before mutation, atomic UPDATEs, post-repair verification with `MaxSeqIdVerificationError` on detection-still-positive. Standard data-integrity discipline applied where it should be.
`_close_chroma_handles` does `gc.collect()` to release mmap handles. Brittle across chromadb versions but necessary — the current chromadb pattern doesn't expose an explicit handle-close. Worth a comment noting this is what we expect to need a touch-up if chromadb 1.6 or 2.x changes their cache structure, but fine as-is.

Tests

8 new tests in `test_repair.py` covering: detect, heuristic, sidecar, dry-run, segment filter, no-op (no poisoned rows), backup, rollback-on-verify-failure.
3 new tests in `test_backends.py`: `max_seq_id` left untouched, sysdb-10 prefix skipped in `embeddings`, legacy big-endian u64 still converts.
The "rollback on verify failure" test is the one I'd most want to see, and it's there.

CHANGELOG

Already updated with a precise description of the failure mode and the recovery command. Nothing to add.

Observation, not actionable

This PR + #1177 (skip _fix_blob_seq_ids on already-migrated palaces) are complementary — #1177's marker file means the migration loop runs at most once per palace, this PR makes that one run safe. They should land together. Recommend merging this one first since it's the substantive fix.

Closes #1134. Ship it.

The BLOB-seq_id migration shim (PR MemPalace#664) ran int.from_bytes(..., 'big') over every BLOB in max_seq_id, including chromadb 1.5.x's own native format (b'\x11\x11' + 6 ASCII digits). That conversion yields a ~1.23e18 integer that silently suppresses every subsequent embeddings_queue write for the affected segment (queue filter is seq_id > start), causing silent drawer-write drops after a 1.5.x upgrade. Two-part fix: 1. Shim narrowing (mempalace/backends/chroma.py) - Drop max_seq_id from the shim loop. chromadb owns that column's format; we don't reinterpret it. - Defense-in-depth: skip rows in embeddings whose seq_id BLOB has the sysdb-10 b'\x11\x11' prefix rather than misconvert. 2. Recovery command (mempalace/repair.py, mempalace/cli.py) - mempalace repair --mode max-seq-id [--segment <uuid>] [--from-sidecar <path>] [--dry-run] [--yes] [--no-backup] - Detects poisoned rows via threshold (seq_id > 2**53). - Default heuristic: MAX(embeddings.seq_id) over the collection owning the poisoned segment. Matches METADATA max exactly; VECTOR segments get a few seq_ids ahead (queue skips an already-indexed window — an acceptable loss vs. resetting to 0 and re-processing everything). - --from-sidecar copies clean values from a pre-corruption sqlite db. - Backs up chroma.sqlite3, closes chroma handles, atomic UPDATEs, post-repair verification that raises MaxSeqIdVerificationError if any row is still above threshold. Tests: 8 new in tests/test_repair.py (detection, heuristic, sidecar, dry-run, segment filter, no-op, backup, rollback-on-verify-failure). 3 new in tests/test_backends.py (max_seq_id untouched by shim, sysdb-10 prefix skipped in embeddings, legacy big-endian u64 BLOBs still convert). Full suite: 1103 passed.

The narrowed _fix_blob_seq_ids returned early when safe_rows was empty, but MemPalace#1177's marker contract requires the marker to be written on every successful pass — even no-op — so subsequent opens skip the sqlite3 connection entirely. Without this, palaces that have no genuine 0.6.x BLOBs but DO have sysdb-10-prefixed rows would re-open sqlite3 on every call, defeating the MemPalace#1090 corruption guard. Restructured the conditional so the marker write is unconditional after a successful sqlite scan, regardless of whether any rows were updated. Surfaced by test_fix_blob_seq_ids_writes_marker_when_already_integer during the develop-rebase of this PR. The author's branch predates the marker contract from MemPalace#1177 (merged 2026-04-26), so this is a rebase-edge fix-up rather than a logic change to their narrowing behaviour.

igorls · 2026-04-27T06:02:24Z

Rebased onto current develop (04c48dd) and pushed.

Conflicts resolved:

mempalace/repair.py — additive: kept this PR's repair_max_seq_id + helpers alongside develop's new status() from fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222) #1227 (capacity check). Both are independent functions in the same module.
tests/test_backends.py — kept all five tests covering both axes: this PR's three narrowing-behaviour tests (does_not_touch_max_seq_id, skips_sysdb10_prefix_in_embeddings, still_converts_legacy_blobs_in_embeddings) and develop's three marker-contract tests from fix: skip _fix_blob_seq_ids sqlite open on already-migrated palaces (#1090) #1177 (writes_marker_after_blob_path, writes_marker_when_already_integer, skips_sqlite_when_marker_present).
mempalace/backends/chroma.py — kept develop's reworded quarantine_stale_hnsw log message ("integrity check failed" was a real semantic addition, not just line layout). The PR's "style revert" commit became a no-op against develop's reworded message and was dropped during rebase — that's correct.

One follow-up commit (04c48dd): the narrowed _fix_blob_seq_ids was returning early when safe_rows was empty, which broke #1177's contract that the marker is written on every successful pass (even no-op). This PR predates #1177's marker landing on develop, so the interaction wasn't visible until rebase. Restructured the conditional so the marker write is unconditional after a successful sqlite scan.

Tests: 1429 passed, 1 skipped on the rebased branch. ruff check + format clean.

Author preserved on the original commit; committer is mine (force-pushed via maintainer-edit access). Thanks for the deep dig on the sysdb-10 incident — the threshold + heuristic-vs-sidecar split is exactly the right shape.

Bumps every version source from 3.3.3 to 3.3.4: - pyproject.toml - mempalace/version.py (canonical) - .claude-plugin/plugin.json - .claude-plugin/marketplace.json - .codex-plugin/plugin.json - README.md badge Dates the CHANGELOG section and adds entries for the bug fixes that landed this cycle (#1135, #1191, #1230, #1231) plus expands the #1194 entry to credit the lookup-side recovery path from #1197. Pre-tag verification: - 1441 passed, 1 skipped (full suite minus benchmarks, all platforms) - ruff check + format clean - 44/44 in test_version_consistency + test_readme_claims (6-file sync) - JPH invariant: pyproject.toml + .claude-plugin/plugin.json both reference mempalace-mcp - Wheel build + fresh-venv install: mempalace --version reports 3.3.4, mempalace-mcp --help works (catches the v3.3.2-class regression)

22 commits from upstream/develop, including: - HNSW capacity divergence detection + BM25-only sqlite fallback when vector layer is unloadable (MemPalace#1222 / MemPalace#1227 — vector_disabled kwarg routes search through chromadb-bypass path on segfault risk). - HNSW index bloat prevention via batch_size + sync_threshold metadata on collection create (MemPalace#1191). - Hooks: always mine the active transcript as convos, additive to MEMPAL_DIR (MemPalace#1230 / MemPalace#1231). Restructured into `_get_mine_targets()` list approach + separate `_ingest_transcript()`. Daemon-strict gate preserved on each entry point. - Wing-name normalization for hyphenated dirs across miner / convo_miner / palace_graph (MemPalace#1194 / MemPalace#1195 / MemPalace#1197). - `narrow _fix_blob_seq_ids` shim + `repair --mode max-seq-id` for legacy 0.6.x BLOB-poisoned palaces (MemPalace#1135). Conflict resolution: - README.md: kept fork-shaped narrative, dropped upstream's sweep tip injection (per fork-readme handling memory). - hooks/README.md: adopted upstream's accurate `MEMPAL_DIR` additive description; kept `MEMPAL_PYTHON` env-var name (matches actual `hooks/*.sh` scripts in both forks). - mempalace/cli.py: consolidated duplicate `--mode` argparse declarations into one with all 4 choices (rebuild/legacy/reorganize/max-seq-id). - mempalace/hooks_cli.py: adopted upstream's `_get_mine_targets()` + `_ingest_transcript()` shape; added `_daemon_strict()` guard at entry of `_maybe_auto_ingest`, `_mine_sync`, and `_ingest_transcript` so the daemon-strict architecture still skips local writes. - mempalace/mcp_server.py: kept both `kind=` and `vector_disabled=` kwargs on the `search_memories` call. - mempalace/searcher.py: kept fork's `_count_in_scope`, `_sqlite_fallback_and_scope`, `_apply_kind_text_filter` AND upstream's `_bm25_only_via_sqlite`. Both `kind` and `vector_disabled` parameters on `search_memories`. Tests: 1510 passing (up from 1366 — upstream brought new test suites).

Three fixes landed on develop after the initial release-prep cut and were brought in via the develop merge. Document them in the 3.3.4 Bug Fixes section so the release notes reflect what users will actually receive. - #1287 - HNSW divergence floor scales with hnsw:sync_threshold (resolves a silent-fallback regression introduced by the interaction between #1191 and #1227 in this release) - #1262 - ChromaBackend get_or_create_collection split, fixing the stop-hook SIGSEGV class on legacy palaces with mismatched stored metadata (#1089) - #1288 / #1254 - repair --mode max-seq-id heuristic now decodes BLOB-typed embeddings.seq_id, restoring the un-poison path added in #1135 for palaces where chromadb 1.5.x writes seq_ids natively

Catches up on a month of upstream work. Highlights pulled in: - MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool) - MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status - MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW - MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate - MemPalace#1321 cli: honor --palace flag in cmd_init - MemPalace#1314 kg temporal params fix - MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read - MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read - MemPalace#1303 mcp_server: pass embedding_function= on collection reopen - MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json - Various ruff format passes on touched files Conflict resolution (CHANGELOG.md only — code files all auto-merged): - 3.3.5 unreleased section header from upstream kept above 3.3.4 - 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail (upstream's version was a strict subset). xdev patches preserved (still on this branch, untouched by merge): - 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings - 3fad61d fix(config): allow leading dash in wing names Not pushed to origin — run tests locally and decide when to push. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

sha2fiddy requested review from bensig, igorls and milla-jovovich as code owners April 23, 2026 19:16

sha2fiddy marked this pull request as draft April 23, 2026 19:55

sha2fiddy force-pushed the feature/max-seq-id-shim-fix branch from 8f88d10 to 442e931 Compare April 23, 2026 20:06

sha2fiddy marked this pull request as ready for review April 23, 2026 20:13

sha2fiddy mentioned this pull request Apr 23, 2026

Delete operations silently no-op on palaces migrated from chromadb 0.6.3 → 1.5.8 #1099

Open

yangqingggui-a11y mentioned this pull request Apr 24, 2026

Add Apple Silicon acceleration (MPS/explicit CPU) for embedding — up to 256x faster (addresses #1068) #1133

Open

igorls added bug Something isn't working storage labels Apr 24, 2026

bensig approved these changes Apr 26, 2026

View reviewed changes

sha2fiddy mentioned this pull request Apr 26, 2026

feat: add mempalace repair --mode hnsw --segment <uuid> (#1046) #1126

Open

7 tasks

sha2fiddy force-pushed the feature/max-seq-id-shim-fix branch from 04eedd5 to 4c632de Compare April 26, 2026 19:37

sha2fiddy and others added 2 commits April 27, 2026 02:57

igorls force-pushed the feature/max-seq-id-shim-fix branch from 4c632de to 04c48dd Compare April 27, 2026 06:02

igorls merged commit 003e569 into MemPalace:develop Apr 27, 2026
6 checks passed

igorls mentioned this pull request Apr 27, 2026

chore(release): v3.3.4 #1232

Merged

sha2fiddy deleted the feature/max-seq-id-shim-fix branch April 28, 2026 12:29

This was referenced Apr 28, 2026

repair --mode max-seq-id crashes with ValueError on BLOB-typed embeddings.seq_id rows #1254

Closed

repair --mode legacy discards collection embedding_model, breaking non-default models (e.g. multilingual-e5-base) #1255

Closed

Seph396 mentioned this pull request Apr 28, 2026

repair: silent exit 0 when SQLite advances but HNSW remains stale; missing quarantine_stale_hnsw() invocation #1263

Closed

messelink mentioned this pull request Apr 30, 2026

fix: batch upserts in miner to prevent ChromaDB 1.5.x compaction crashes #796

Open

6 tasks

igorls mentioned this pull request May 1, 2026

fix(repair): decode BLOB embeddings.seq_id in max-seq-id heuristic (#1254) #1288

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: narrow `_fix_blob_seq_ids` + add `repair --mode max-seq-id`#1135

fix: narrow `_fix_blob_seq_ids` + add `repair --mode max-seq-id`#1135
igorls merged 2 commits intoMemPalace:developfrom
sha2fiddy:feature/max-seq-id-shim-fix

sha2fiddy commented Apr 23, 2026 •

edited

Loading

Uh oh!

bensig left a comment

Uh oh!

igorls commented Apr 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sha2fiddy commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Test plan

Uh oh!

bensig left a comment

Choose a reason for hiding this comment

Shim change — the right shape

Recovery sub-mode — well-designed

Tests

CHANGELOG

Observation, not actionable

Uh oh!

igorls commented Apr 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sha2fiddy commented Apr 23, 2026 •

edited

Loading