fix(repair): refuse to overwrite when extraction looks truncated (#1208) #1210
Merged
Conversation
The user-reported case in #1208: a palace with 67,580 drawers had its HNSW files manually quarantined to recover from corruption. ``mempalace repair`` then ran cleanly and reported "Drawers found: 10000 ... Repair complete. 10000 drawers rebuilt." The backup was the v3.3.3 chroma.sqlite3, which did contain the full 67,580, but the rebuilt collection only had the first 10K. 85% data loss, no warning.

Root cause: ChromaDB's collection-layer get() silently caps at ``CHROMADB_DEFAULT_GET_LIMIT = 10_000`` rows when reading from a collection whose segment metadata is stale (the typical post-quarantine state). col.count() returns the same capped value, so neither the loop bound nor the extraction count flagged the truncation.

The fix is defense-in-depth, not a recovery mechanism. Repair now:

1. After extraction, queries chroma.sqlite3 directly via a read-only sqlite3 connection: COUNT(*) FROM embeddings JOIN segments JOIN collections WHERE name='mempalace_drawers'. If that count exceeds the extracted count, abort with a clear message before any destructive operation.
2. Falls back to a weaker check when the SQLite query can't run (chromadb schema drift, locked file): if the extracted count exactly equals CHROMADB_DEFAULT_GET_LIMIT, that's a strong enough cap signal to refuse without explicit acknowledgement.
3. Adds ``--confirm-truncation-ok`` (CLI) and ``confirm_truncation_ok`` (rebuild_index kwarg) to override after independent verification; useful for the rare case of a palace genuinely sized at exactly 10,000 drawers.

The guard logic lives in ``repair.check_extraction_safety()``, so the two extraction paths (CLI ``cmd_repair`` and the lower-level ``rebuild_index``) share a single implementation. It raises ``TruncationDetected`` carrying the printable message.
Tests: 9 new cases covering the safe path (counts match; SQLite unreadable but well under cap), both abort paths (SQLite higher than extracted; unreadable and exactly at cap), the override flag, and end-to-end behavior of ``rebuild_index`` with the guard wired in. Plus two ``sqlite_drawer_count`` tests for the missing-file and bad-schema cases.

What's NOT in this PR: actually recovering the missing 57,580 drawers from the user's case. The on-disk SQLite still holds them; recovery is a separate flow (direct extraction from chroma.sqlite3, bypassing the chromadb collection layer entirely). This PR's job is to stop repair from making it worse. Refs #1208.
jphein added a commit to jphein/mempalace that referenced this pull request on Apr 26, 2026
…#1173 status

CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec) that landed earlier today. PR table updated to reflect bensig's 2026-04-26 approvals on four of our open PRs:

- MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit shape, then force-pushed two safety commits (cold-start gate + integrity sniff-test) after a production cold-start incident destroyed three healthy 253MB segments. Now mergeable=CONFLICTING against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976); needs rebase + re-review.
- MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge.

YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG regenerated to match. promises.md (in claude-config, not this repo) got a long log entry covering today's full output.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
This was referenced Apr 26, 2026
Closes #1208.
The bug
ttessarolo reported a palace with 67,580 drawers silently losing 85% of its data through `mempalace repair`:
```
Drawers found: 10000
...
Repair complete. 10000 drawers rebuilt.
```
Verified by direct SQL: the embeddings table genuinely held 67,580 rows before repair and 10,000 after. ChromaDB's collection-layer `get()` silently caps at 10,000 when reading a collection whose segment metadata is stale, which is exactly the state left behind by a manual HNSW quarantine, and quarantine is what the issue reporter had to do to make repair runnable in the first place. `col.count()` returns the same capped value, so the loop bound (`while offset < total`) doesn't catch it.
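To make the failure mode concrete, here is a minimal sketch of a pre-fix extraction loop (names and shape assumed for illustration, not the actual mempalace source). Because the loop bound comes from the same capped `count()`, the extracted total and the loop bound agree with each other while both silently disagree with what is on disk:

```python
CHROMADB_DEFAULT_GET_LIMIT = 10_000  # chromadb's silent get() cap

def extract_all(col, batch=500):
    """Hypothetical pre-fix extraction loop.

    With stale segment metadata, col.count() returns at most the cap
    (10,000 instead of 67,580), so the while-condition is satisfied
    after 10,000 rows and no error is ever raised.
    """
    total = col.count()          # capped value, not the true row count
    drawers = []
    offset = 0
    while offset < total:        # bound derived from the capped count
        page = col.get(limit=batch, offset=offset)
        drawers.extend(page["ids"])
        offset += batch
    return drawers               # silently short by 57,580 drawers
```

The guard in this PR works by comparing `len(drawers)` against an independent source of truth instead of trusting `count()`.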
The fix
Defense-in-depth, not data recovery. Repair now refuses to overwrite when it can't trust the extraction count.
Strong signal — query `chroma.sqlite3` directly via a read-only sqlite3 connection:
```sql
SELECT COUNT(*)
FROM embeddings e
JOIN segments s ON e.segment_id = s.id
JOIN collections c ON s.collection = c.id
WHERE c.name = ?
```
If that count exceeds the extracted count, abort before `delete_collection` runs. The error message names both numbers and the implied data loss percentage.
Weak signal — for chromadb schema drift / locked-file cases where the SQLite query can't run, fall back to: if extracted count equals exactly `CHROMADB_DEFAULT_GET_LIMIT` (10,000), refuse without explicit acknowledgement. Hitting the chromadb default `get()` cap exactly is suspicious enough.
Override — `--confirm-truncation-ok` flag (and `rebuild_index(confirm_truncation_ok=True)` kwarg) for the rare case of a palace legitimately sized at the cap, after independent verification.
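Putting the three rules together, the guard could look roughly like this (a sketch under assumptions: the real `check_extraction_safety` may take paths or a collection handle rather than plain counts, and its message wording will differ):

```python
CHROMADB_DEFAULT_GET_LIMIT = 10_000

class TruncationDetected(RuntimeError):
    """Raised before any destructive step when extraction looks capped."""

def check_extraction_safety(extracted, sqlite_count,
                            confirm_truncation_ok=False):
    """Sketch of the shared guard (signature assumed, not verbatim).

    extracted     -- rows the collection-layer get() loop returned
    sqlite_count  -- direct count from chroma.sqlite3, or None if that
                     query couldn't run
    """
    if confirm_truncation_ok:
        return  # operator has independently verified the palace size

    # Strong signal: the on-disk table holds more than we extracted.
    if sqlite_count is not None and sqlite_count > extracted:
        lost = 100 * (sqlite_count - extracted) / sqlite_count
        raise TruncationDetected(
            f"Extracted {extracted} drawers but chroma.sqlite3 holds "
            f"{sqlite_count} ({lost:.0f}% would be lost). Refusing to "
            f"overwrite; re-run with --confirm-truncation-ok to override."
        )

    # Weak signal: SQLite was unreadable and the extraction landed
    # exactly on chromadb's default get() cap.
    if sqlite_count is None and extracted == CHROMADB_DEFAULT_GET_LIMIT:
        raise TruncationDetected(
            f"Extracted exactly {CHROMADB_DEFAULT_GET_LIMIT} drawers (the "
            f"chromadb get() cap) and chroma.sqlite3 could not be read to "
            f"verify. Refusing to overwrite; re-run with "
            f"--confirm-truncation-ok if the palace is genuinely this size."
        )
```

In the reported case the strong signal fires with extracted=10,000 and sqlite_count=67,580, aborting before `delete_collection` instead of after.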
Shared between two paths
The guard lives in `repair.check_extraction_safety()` and raises `TruncationDetected` with the printable message. Both call sites use it: the CLI `cmd_repair` path and the lower-level `rebuild_index`. Same logic, single implementation, identical error wording.
Tests
9 new test cases in `tests/test_repair.py`, covering: the safe path (counts match; SQLite unreadable but well under the cap), both abort paths (SQLite count higher than extracted; SQLite unreadable with extraction exactly at the cap), the override flag, and end-to-end behavior of `rebuild_index` with the guard wired in. Plus two `sqlite_drawer_count` tests for the missing-file and bad-schema cases.

Full suite locally: 1317 passed in 35s (+9 from develop @ HEAD baseline).
What's NOT in this PR
Actually recovering the lost 57,580 drawers from the user's case. The rows are still on disk in `chroma.sqlite3.embeddings`; recovery requires bypassing the chromadb collection layer entirely (read embeddings table → re-create segments → re-attach to collection, or extract and re-mine from source). That's a separate flow worth its own design — this PR's job is to stop repair from making it worse.
Refs #1208 and the #1145 / #934 / #1093 incident class generally.