
fix(repair): refuse to overwrite when extraction looks truncated (#1208) #1210

Merged

igorls merged 1 commit into develop from fix/repair-extraction-cap-detection on Apr 26, 2026

Conversation

@bensig (Collaborator) commented Apr 26, 2026

Closes #1208.

The bug

ttessarolo reported a palace with 67,580 drawers silently losing 85% of its data through `mempalace repair`:

```
Drawers found: 10000
...
Repair complete. 10000 drawers rebuilt.
```

Verified by direct SQL: the embeddings table genuinely held 67,580 rows before repair, 10,000 after. ChromaDB's collection-layer `get()` silently caps at 10,000 when reading a collection whose segment metadata is stale — exactly the state you're in after manual HNSW quarantine, which is what the issue reporter had to do to make repair runnable in the first place. `col.count()` returns the same capped value, so the loop bound (`while offset < total`) doesn't catch it.
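The failure mode is easy to reproduce with a minimal sketch — `CappedCollection` and `extract_all` below are hypothetical stand-ins for illustration, not mempalace or ChromaDB internals:

```python
class CappedCollection:
    """Mimics a ChromaDB collection with stale segment metadata:
    both count() and get() silently cap at 10,000 (illustrative only)."""
    CAP = 10_000

    def __init__(self, real_size):
        self._ids = [f"drawer-{i}" for i in range(real_size)]

    def count(self):
        return min(len(self._ids), self.CAP)  # capped, not the real total

    def get(self, limit, offset):
        end = min(offset + limit, self.CAP)   # reads stop at the cap too
        return {"ids": self._ids[offset:end]}


def extract_all(col, batch=500):
    total = col.count()            # already the capped value
    drawers, offset = [], 0
    while offset < total:          # bound and data are capped identically,
        rows = col.get(limit=batch, offset=offset)["ids"]
        if not rows:
            break
        drawers.extend(rows)
        offset += len(rows)
    return drawers                 # so truncation never trips the loop


extracted = extract_all(CappedCollection(67_580))
print(len(extracted))  # 10000 -- 85% of the palace never leaves the loop
```

Because the bound and the data come from the same capped layer, no amount of loop-side checking can detect the loss; the ground truth has to come from somewhere else.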

The fix

Defense-in-depth, not data recovery. Repair now refuses to overwrite when it can't trust the extraction count.

Strong signal — query `chroma.sqlite3` directly via a read-only sqlite3 connection:

```sql
SELECT COUNT(*)
FROM embeddings e
JOIN segments s ON e.segment_id = s.id
JOIN collections c ON s.collection = c.id
WHERE c.name = ?
```

If that count exceeds the extracted count, abort before `delete_collection` runs. The error message names both numbers and the implied data loss percentage.
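A sketch of that strong-signal check, assuming ChromaDB's `embeddings`/`segments`/`collections` schema — the helper name echoes the PR's `sqlite_drawer_count` tests, but this body is illustrative, not the actual implementation:

```python
import sqlite3


def sqlite_drawer_count(db_path, collection_name):
    """Ground-truth row count straight from chroma.sqlite3.
    Returns None when the file is missing or the schema doesn't
    match, so callers can fall back to the weaker signal."""
    try:
        # mode=ro: never take a write lock on a DB repair may touch
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    except sqlite3.OperationalError:
        return None  # missing or unopenable file
    try:
        row = conn.execute(
            """
            SELECT COUNT(*)
            FROM embeddings e
            JOIN segments s ON e.segment_id = s.id
            JOIN collections c ON s.collection = c.id
            WHERE c.name = ?
            """,
            (collection_name,),
        ).fetchone()
        return row[0]
    except sqlite3.Error:
        return None  # schema drift, locked file, etc.
    finally:
        conn.close()
```

The read-only URI connection matters: it keeps the safety check itself from contending with, or mutating, the database it is about to pass judgment on.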

Weak signal — when the SQLite query can't run (chromadb schema drift, locked file), fall back to a weaker check: if the extracted count equals `CHROMADB_DEFAULT_GET_LIMIT` (10,000) exactly, refuse without explicit acknowledgement. Landing on the chromadb default `get()` cap exactly is suspicious enough on its own.

Override — `--confirm-truncation-ok` flag (and `rebuild_index(confirm_truncation_ok=True)` kwarg) for the rare case of a palace legitimately sized at the cap, after independent verification.
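Putting the three pieces together, the guard might look roughly like this — only the names `check_extraction_safety`, `TruncationDetected`, `CHROMADB_DEFAULT_GET_LIMIT`, and `confirm_truncation_ok` come from the PR; the body and exact message wording here are a sketch:

```python
CHROMADB_DEFAULT_GET_LIMIT = 10_000


class TruncationDetected(RuntimeError):
    """Raised before any destructive repair step; carries the printable message."""


def check_extraction_safety(extracted_count, sqlite_count,
                            confirm_truncation_ok=False):
    """sqlite_count is the direct COUNT(*) from chroma.sqlite3,
    or None when that query could not run (sketch, not the real body)."""
    if confirm_truncation_ok:
        return  # operator has independently verified the palace size

    if sqlite_count is not None:
        # Strong signal: disk says there is more than we extracted.
        if sqlite_count > extracted_count:
            lost = 100 * (sqlite_count - extracted_count) / sqlite_count
            raise TruncationDetected(
                f"extracted {extracted_count} of {sqlite_count} rows "
                f"({lost:.0f}% would be lost); refusing to overwrite"
            )
    elif extracted_count == CHROMADB_DEFAULT_GET_LIMIT:
        # Weak signal: ground truth unreadable, but we landed exactly
        # on the chromadb get() cap.
        raise TruncationDetected(
            f"extracted exactly {CHROMADB_DEFAULT_GET_LIMIT} rows (the "
            "chromadb get() cap) and chroma.sqlite3 is unreadable; "
            "use --confirm-truncation-ok to proceed anyway"
        )
```

Note the ordering: the override is checked first, the strong signal preempts the weak one, and a matching or lower SQLite count passes silently — the guard only ever blocks, never repairs.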

Shared between two paths

The guard lives in `repair.check_extraction_safety()` and raises `TruncationDetected` with the printable message. Both call sites use it:

  • `cli.py::cmd_repair` — the user-facing `mempalace repair` command (this is the path the bug report hit).
  • `repair.py::rebuild_index` — the lower-level entry point used by `python -m mempalace.repair rebuild`.

Same logic, single implementation, identical error wording.

Tests

9 new test cases in `tests/test_repair.py`:

  • `test_check_extraction_safety_passes_when_counts_match` — safe path
  • `test_check_extraction_safety_passes_when_sqlite_unreadable_and_under_cap` — safe path with weak fallback
  • `test_check_extraction_safety_aborts_when_sqlite_higher` — the user's reported case (67,580 vs 10,000)
  • `test_check_extraction_safety_aborts_when_unreadable_and_at_cap` — fallback signal fires
  • `test_check_extraction_safety_override_skips_check` — override flag works
  • `test_sqlite_drawer_count_returns_none_on_missing_file` — graceful degrade
  • `test_sqlite_drawer_count_returns_none_on_unreadable_schema` — graceful degrade
  • `test_rebuild_index_aborts_on_truncation_signal` — end-to-end: no `delete_collection`, no upsert, no backup written when guard fires
  • `test_rebuild_index_proceeds_with_override` — end-to-end: override unblocks the destructive path

Full suite locally: 1317 passed in 35s (+9 from develop @ HEAD baseline).

What's NOT in this PR

Actually recovering the lost 57,580 drawers from the user's case. The rows are still on disk in the `embeddings` table of `chroma.sqlite3`; recovery requires bypassing the chromadb collection layer entirely (read the embeddings table → re-create segments → re-attach to the collection, or extract and re-mine from source). That's a separate flow worth its own design — this PR's job is to stop repair from making it worse.

Refs #1208 and the #1145 / #934 / #1093 incident class generally.

The user-reported case in #1208: a palace with 67,580 drawers had its
HNSW files manually quarantined to recover from corruption. ``mempalace
repair`` then ran cleanly and reported "Drawers found: 10000 ... Repair
complete. 10000 drawers rebuilt." Backup was the v3.3.3 chroma.sqlite3
that did contain the full 67,580 — but the rebuilt collection only had
the first 10K. 85% data loss, no warning.

Root cause: ChromaDB's collection-layer get() silently caps at
``CHROMADB_DEFAULT_GET_LIMIT = 10_000`` rows when reading from a
collection whose segment metadata is stale (typical post-quarantine
state). col.count() returns the same capped value, so neither the
loop bound nor the extraction count flagged the truncation.

Fix is defense-in-depth, not a recovery mechanism. Repair now:

1. After extraction, queries chroma.sqlite3 directly via a read-only
   sqlite3 connection: COUNT(*) FROM embeddings JOIN segments JOIN
   collections WHERE name='mempalace_drawers'. If that count exceeds
   the extracted count, abort with a clear message before any
   destructive operation.
2. Falls back to a weaker check when the SQLite query can't run
   (chromadb schema drift, locked file): if extracted exactly equals
   CHROMADB_DEFAULT_GET_LIMIT, that's a strong-enough cap signal to
   refuse without explicit acknowledgement.
3. Adds ``--confirm-truncation-ok`` (CLI) and ``confirm_truncation_ok``
   (rebuild_index kwarg) to override after independent verification.
   Useful for the rare case of a palace genuinely sized at exactly
   10,000 drawers.

The guard logic lives in ``repair.check_extraction_safety()`` so the
two extraction paths (CLI ``cmd_repair`` and the lower-level
``rebuild_index``) share a single implementation. Raises
``TruncationDetected`` carrying the printable message.

Tests: 9 new cases covering the safe path (counts match, SQLite
unreadable but well under cap), both abort paths (SQLite higher than
extracted, unreadable + at cap), the override flag, and end-to-end
behavior of ``rebuild_index`` with the guard wired in. Plus two
``sqlite_drawer_count`` tests for the missing-file and bad-schema
cases.

What's NOT in this PR: actually recovering the missing 57,580
drawers from the user's case. The on-disk SQLite still holds them;
recovery is a separate flow (direct-extract from chroma.sqlite3,
bypass the chromadb collection layer entirely). This PR's job is
to stop repair from making it worse.

Refs #1208.
@igorls igorls merged commit 6890948 into develop Apr 26, 2026
6 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
…#1173 status

CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec)
that landed earlier today. PR table updated to reflect bensig's
2026-04-26 approvals on four of our open PRs:

  - MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit
    shape, then force-pushed two safety commits (cold-start gate +
    integrity sniff-test) after a production cold-start incident
    destroyed three healthy 253MB segments. Now mergeable=CONFLICTING
    against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976) — needs
    rebase + re-review.
  - MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge.

YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG
regenerated to match. promises.md (in claude-config, not this repo)
got a long log entry covering today's full output.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Development

Successfully merging this pull request may close these issues.

mempalace repair silently truncates drawers to 10,000 — data loss on palaces > 10K

2 participants