
fix(repair): refuse to overwrite when extraction looks truncated (#1208) #1210

Merged

igorls merged 1 commit into develop from fix/repair-extraction-cap-detection on Apr 26, 2026

Conversation

@bensig (Collaborator) commented Apr 26, 2026

Closes #1208.

The bug

ttessarolo reported a palace with 67,580 drawers silently losing 85% of its data through `mempalace repair`:

```
Drawers found: 10000
...
Repair complete. 10000 drawers rebuilt.
```

Verified by direct SQL: the embeddings table genuinely held 67,580 rows before repair, 10,000 after. ChromaDB's collection-layer `get()` silently caps at 10,000 when reading a collection whose segment metadata is stale — exactly the state you're in after manual HNSW quarantine, which is what the issue reporter had to do to make repair runnable in the first place. `col.count()` returns the same capped value, so the loop bound (`while offset < total`) doesn't catch it.
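The failure mode is easy to reproduce with a minimal sketch — `CappedCollection` and `extract_all` below are hypothetical stand-ins for illustration, not mempalace or ChromaDB internals:

```python
class CappedCollection:
    """Mimics a ChromaDB collection with stale segment metadata:
    both count() and get() silently cap at 10,000 (illustrative only)."""
    CAP = 10_000

    def __init__(self, real_size):
        self._ids = [f"drawer-{i}" for i in range(real_size)]

    def count(self):
        return min(len(self._ids), self.CAP)  # capped, not the real total

    def get(self, limit, offset):
        end = min(offset + limit, self.CAP)   # reads stop at the cap too
        return {"ids": self._ids[offset:end]}


def extract_all(col, batch=500):
    total = col.count()            # already the capped value
    drawers, offset = [], 0
    while offset < total:          # bound and data are capped identically,
        rows = col.get(limit=batch, offset=offset)["ids"]
        if not rows:
            break
        drawers.extend(rows)
        offset += len(rows)
    return drawers                 # so truncation never trips the loop


extracted = extract_all(CappedCollection(67_580))
print(len(extracted))  # 10000 -- 85% of the palace never leaves the loop
```

Because the bound and the data come from the same capped layer, no amount of loop-side checking can detect the loss; the ground truth has to come from somewhere else.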

The fix

Defense-in-depth, not data recovery. Repair now refuses to overwrite when it can't trust the extraction count.

Strong signal — query `chroma.sqlite3` directly via a read-only sqlite3 connection:

```sql
SELECT COUNT(*)
FROM embeddings e
JOIN segments s ON e.segment_id = s.id
JOIN collections c ON s.collection = c.id
WHERE c.name = ?
```

If that count exceeds the extracted count, abort before `delete_collection` runs. The error message names both numbers and the implied data loss percentage.
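A sketch of that strong-signal check, assuming ChromaDB's `embeddings`/`segments`/`collections` schema — the helper name echoes the PR's `sqlite_drawer_count` tests, but this body is illustrative, not the actual implementation:

```python
import sqlite3


def sqlite_drawer_count(db_path, collection_name):
    """Ground-truth row count straight from chroma.sqlite3.
    Returns None when the file is missing or the schema doesn't
    match, so callers can fall back to the weaker signal."""
    try:
        # mode=ro: never take a write lock on a DB repair may touch
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    except sqlite3.OperationalError:
        return None  # missing or unopenable file
    try:
        row = conn.execute(
            """
            SELECT COUNT(*)
            FROM embeddings e
            JOIN segments s ON e.segment_id = s.id
            JOIN collections c ON s.collection = c.id
            WHERE c.name = ?
            """,
            (collection_name,),
        ).fetchone()
        return row[0]
    except sqlite3.Error:
        return None  # schema drift, locked file, etc.
    finally:
        conn.close()
```

The read-only URI connection matters: it keeps the safety check itself from contending with, or mutating, the database it is about to pass judgment on.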

Weak signal — when the SQLite query can't run (chromadb schema drift, locked file), fall back to a weaker check: if the extracted count equals `CHROMADB_DEFAULT_GET_LIMIT` (10,000) exactly, refuse without explicit acknowledgement. Landing on the chromadb default `get()` cap exactly is suspicious enough on its own.

Override — `--confirm-truncation-ok` flag (and `rebuild_index(confirm_truncation_ok=True)` kwarg) for the rare case of a palace legitimately sized at the cap, after independent verification.
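Putting the three pieces together, the guard might look roughly like this — only the names `check_extraction_safety`, `TruncationDetected`, `CHROMADB_DEFAULT_GET_LIMIT`, and `confirm_truncation_ok` come from the PR; the body and exact message wording here are a sketch:

```python
CHROMADB_DEFAULT_GET_LIMIT = 10_000


class TruncationDetected(RuntimeError):
    """Raised before any destructive repair step; carries the printable message."""


def check_extraction_safety(extracted_count, sqlite_count,
                            confirm_truncation_ok=False):
    """sqlite_count is the direct COUNT(*) from chroma.sqlite3,
    or None when that query could not run (sketch, not the real body)."""
    if confirm_truncation_ok:
        return  # operator has independently verified the palace size

    if sqlite_count is not None:
        # Strong signal: disk says there is more than we extracted.
        if sqlite_count > extracted_count:
            lost = 100 * (sqlite_count - extracted_count) / sqlite_count
            raise TruncationDetected(
                f"extracted {extracted_count} of {sqlite_count} rows "
                f"({lost:.0f}% would be lost); refusing to overwrite"
            )
    elif extracted_count == CHROMADB_DEFAULT_GET_LIMIT:
        # Weak signal: ground truth unreadable, but we landed exactly
        # on the chromadb get() cap.
        raise TruncationDetected(
            f"extracted exactly {CHROMADB_DEFAULT_GET_LIMIT} rows (the "
            "chromadb get() cap) and chroma.sqlite3 is unreadable; "
            "use --confirm-truncation-ok to proceed anyway"
        )
```

Note the ordering: the override is checked first, the strong signal preempts the weak one, and a matching or lower SQLite count passes silently — the guard only ever blocks, never repairs.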

Shared between two paths

The guard lives in `repair.check_extraction_safety()` and raises `TruncationDetected` with the printable message. Both call sites use it:

  • `cli.py::cmd_repair` — the user-facing `mempalace repair` command (this is the path the bug report hit).
  • `repair.py::rebuild_index` — the lower-level entry point used by `python -m mempalace.repair rebuild`.

Same logic, single implementation, identical error wording.

Tests

9 new test cases in `tests/test_repair.py`:

  • `test_check_extraction_safety_passes_when_counts_match` — safe path
  • `test_check_extraction_safety_passes_when_sqlite_unreadable_and_under_cap` — safe path with weak fallback
  • `test_check_extraction_safety_aborts_when_sqlite_higher` — the user's reported case (67,580 vs 10,000)
  • `test_check_extraction_safety_aborts_when_unreadable_and_at_cap` — fallback signal fires
  • `test_check_extraction_safety_override_skips_check` — override flag works
  • `test_sqlite_drawer_count_returns_none_on_missing_file` — graceful degrade
  • `test_sqlite_drawer_count_returns_none_on_unreadable_schema` — graceful degrade
  • `test_rebuild_index_aborts_on_truncation_signal` — end-to-end: no `delete_collection`, no upsert, no backup written when guard fires
  • `test_rebuild_index_proceeds_with_override` — end-to-end: override unblocks the destructive path

Full suite locally: 1317 passed in 35s (+9 from develop @ HEAD baseline).

What's NOT in this PR

Actually recovering the lost 57,580 drawers from the user's case. The rows are still on disk in the `embeddings` table of `chroma.sqlite3`; recovery requires bypassing the chromadb collection layer entirely (read the embeddings table → re-create segments → re-attach to the collection, or extract and re-mine from source). That's a separate flow worth its own design — this PR's job is to stop repair from making it worse.

Refs #1208 and the #1145 / #934 / #1093 incident class generally.

The user-reported case in #1208: a palace with 67,580 drawers had its
HNSW files manually quarantined to recover from corruption. ``mempalace
repair`` then ran cleanly and reported "Drawers found: 10000 ... Repair
complete. 10000 drawers rebuilt." Backup was the v3.3.3 chroma.sqlite3
that did contain the full 67,580 — but the rebuilt collection only had
the first 10K. 85% data loss, no warning.

Root cause: ChromaDB's collection-layer get() silently caps at
``CHROMADB_DEFAULT_GET_LIMIT = 10_000`` rows when reading from a
collection whose segment metadata is stale (typical post-quarantine
state). col.count() returns the same capped value, so neither the
loop bound nor the extraction count flagged the truncation.

Fix is defense-in-depth, not a recovery mechanism. Repair now:

1. After extraction, queries chroma.sqlite3 directly via a read-only
   sqlite3 connection: COUNT(*) FROM embeddings JOIN segments JOIN
   collections WHERE name='mempalace_drawers'. If that count exceeds
   the extracted count, abort with a clear message before any
   destructive operation.
2. Falls back to a weaker check when the SQLite query can't run
   (chromadb schema drift, locked file): if extracted exactly equals
   CHROMADB_DEFAULT_GET_LIMIT, that's a strong-enough cap signal to
   refuse without explicit acknowledgement.
3. Adds ``--confirm-truncation-ok`` (CLI) and ``confirm_truncation_ok``
   (rebuild_index kwarg) to override after independent verification.
   Useful for the rare case of a palace genuinely sized at exactly
   10,000 drawers.

The guard logic lives in ``repair.check_extraction_safety()`` so the
two extraction paths (CLI ``cmd_repair`` and the lower-level
``rebuild_index``) share a single implementation. Raises
``TruncationDetected`` carrying the printable message.

Tests: 9 new cases covering the safe path (counts match, SQLite
unreadable but well under cap), both abort paths (SQLite higher than
extracted, unreadable + at cap), the override flag, and end-to-end
behavior of ``rebuild_index`` with the guard wired in. Plus two
``sqlite_drawer_count`` tests for the missing-file and bad-schema
cases.

What's NOT in this PR: actually recovering the missing 57,580
drawers from the user's case. The on-disk SQLite still holds them;
recovery is a separate flow (direct-extract from chroma.sqlite3,
bypass the chromadb collection layer entirely). This PR's job is
to stop repair from making it worse.

Refs #1208.
@igorls igorls merged commit 6890948 into develop Apr 26, 2026
6 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
…#1173 status

CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec)
that landed earlier today. PR table updated to reflect bensig's
2026-04-26 approvals on four of our open PRs:

  - MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit
    shape, then force-pushed two safety commits (cold-start gate +
    integrity sniff-test) after a production cold-start incident
    destroyed three healthy 253MB segments. Now mergeable=CONFLICTING
    against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976) — needs
    rebase + re-review.
  - MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge.

YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG
regenerated to match. promises.md (in claude-config, not this repo)
got a long log entry covering today's full output.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Development

Successfully merging this pull request may close these issues.

mempalace repair silently truncates drawers to 10,000 — data loss on palaces > 10K

2 participants