
mempalace repair silently truncates drawers to 10,000 — data loss on palaces > 10K #1208

@ttessarolo

Summary

mempalace repair silently caps recovery at 10,000 drawers and discards everything beyond that. On a palace with 67,580 drawers, repair reported success with no warning and left the palace with exactly 10,000 drawers: a loss of 57,580 (~85%).

Environment

  • MemPalace: 3.3.3 (also reproducible on 3.3.2, see context)
  • Python: 3.x via pipx (~/.local/pipx/venvs/mempalace)
  • ChromaDB: 1.5.8
  • OS: macOS 15 (Darwin 25.3.0), Apple Silicon
  • Palace size: ~619 MB, 67,580 drawers in mempalace_drawers, 499 in mempalace_closets

Steps to Reproduce

  1. Palace with > 10,000 drawers; HNSW for mempalace_drawers corrupted (see "Trigger" below).
  2. mempalace status segfaults (exit 139) — known HNSW/sqlite drift symptom.
  3. Quarantine the drawers HNSW files (data_level0.bin, header.bin, length.bin, link_lists.bin, index_metadata.pickle) so repair itself doesn't segfault; see the sketch after this list.
  4. Run mempalace repair --yes.
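
A minimal sketch of the quarantine in step 3, assuming ChromaDB's standard persistent layout where each vector segment is a UUID-named directory next to chroma.sqlite3; the palace and quarantine paths are illustrative:

import shutil
from pathlib import Path

PALACE = Path.home() / ".mempalace" / "palace"       # illustrative path
QUARANTINE = PALACE.parent / "hnsw-quarantine"
QUARANTINE.mkdir(exist_ok=True)

# Each vector segment is a UUID-named directory holding data_level0.bin,
# header.bin, length.bin, link_lists.bin, index_metadata.pickle. This moves
# every such segment aside; here only the drawers segment was corrupt, so
# you may prefer to target that segment's UUID directory specifically.
for seg_dir in PALACE.iterdir():
    if seg_dir.is_dir() and (seg_dir / "data_level0.bin").exists():
        shutil.move(str(seg_dir), str(QUARANTINE / seg_dir.name))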

Observed Output

=======================================================
MemPalace Repair

Palace: /Users/.../.mempalace/palace
Drawers found: 10000
...
Repair complete. 10000 drawers rebuilt.
Backup saved at /Users/.../.mempalace/palace.backup

Drawers found: 10000 — but the underlying ChromaDB metadata segment held 67,580 embeddings (verified directly in chroma.sqlite3):

SELECT c.name, COUNT(*)
FROM embeddings e
JOIN segments s ON e.segment_id = s.id
JOIN collections c ON s.collection = c.id
GROUP BY c.name;
-- mempalace_closets : 499
-- mempalace_drawers : 67580   <-- before repair
-- mempalace_drawers : 10000   <-- after repair

Expected Behavior

Either:
- Repair must process all drawers (paginated collection.get(limit=N, offset=K) loop), or
- Repair must fail loudly when the source set exceeds the extraction cap, refusing to overwrite.

Silently dropping 85% of memories with a "Repair complete" success message is the worst possible outcome for a tool whose entire job is data preservation.

Root Cause (suspected)

collection.get() in ChromaDB defaults to a 10,000-row limit. The repair extraction path likely calls .get() once without paginating. Same issue would affect any palace > 10K drawers.
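
A quick way to confirm the cap hypothesis against a live palace; a sketch assuming a standard PersistentClient pointed at the palace directory (path elided as in the report):

import chromadb

client = chromadb.PersistentClient(path="/Users/.../.mempalace/palace")
col = client.get_collection("mempalace_drawers")

print(col.count())             # 67580: the true row count
print(len(col.get()["ids"]))   # 10000 if an unpaginated get() is silently capped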

Trigger of the Underlying HNSW Corruption

Pre-existing palace state — exact cause unknown. mempalace status, mempalace repair, and chromadb.Collection.count() on mempalace_drawers all segfault when the corrupt HNSW is loaded. Closets collection
unaffected. The 3.3.2 #1000 "Quarantine stale HNSW" fix did not auto-trigger here; manual quarantine of HNSW files was required just to make repair runnable — at which point the 10K cap surfaced.

Mitigation Used

Restored the pre-repair backup (~/.mempalace.bak-<date>) and recovered the lost 57,580 drawers manually by extracting embedding vectors directly from chroma.sqlite3 and rebuilding the HNSW index without going
through mempalace repair.
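
For reference, the rebuild step looked roughly like the following; a sketch that assumes ids, documents, metadatas, and embedding vectors had already been extracted from chroma.sqlite3 into parallel lists (the extraction itself is version-specific and omitted here; ids, docs, metas, and vecs are hypothetical names, and the destination path is illustrative):

import chromadb

client = chromadb.PersistentClient(path="/Users/.../.mempalace/palace.rebuilt")
dst = client.create_collection("mempalace_drawers")

# ids, docs, metas, vecs: parallel lists recovered from chroma.sqlite3
BATCH = 5000
for i in range(0, len(ids), BATCH):
    dst.add(ids=ids[i:i + BATCH], documents=docs[i:i + BATCH],
            metadatas=metas[i:i + BATCH], embeddings=vecs[i:i + BATCH])

assert dst.count() == len(ids)  # refuse to proceed on a partial rebuild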

Suggested Fix

In the repair extraction loop, paginate:

BATCH = 5000
offset = 0
while True:
    # get() always returns "ids", so an empty list means we've walked past the end
    batch = src.get(limit=BATCH, offset=offset,
                    include=["documents", "metadatas", "embeddings"])
    if not batch["ids"]:
        break
    dst.add(ids=batch["ids"], documents=batch["documents"],
            metadatas=batch["metadatas"], embeddings=batch["embeddings"])
    offset += BATCH

Plus a sanity check before declaring success and before overwriting the live palace: assert that the rebuilt collection's .count() equals the source's .count().
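
A minimal form of that check, using the src/dst names from the loop above:

src_total = src.count()  # authoritative row count from the source collection
# ... paginated copy loop from above runs here ...
assert dst.count() == src_total, (
    f"repair extracted {dst.count()} of {src_total} drawers; refusing to overwrite"
)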

Severity

High. Silent data loss in a tool sold as "verbatim memory, 96.6% R@5". The user's only signal that something had gone wrong was the post-repair status showing a number that looked suspiciously round.
