
Commit 42817d7

jphein and claude committed
feat(checkpoint-split): phase D migration + PreCompact recovery write
Closes the structural fix from phases A-C: the verbatim store (mempalace_drawers) is now actually verbatim — derivative writes (Stop-hook + PreCompact checkpoints) live in the dedicated mempalace_session_recovery collection.

Phase D — migration of existing checkpoint drawers:

- migrate_checkpoints_to_recovery(palace_path, batch_size=1000) in mempalace/migrate.py walks the main collection in pages, filters drawers with topic in _CHECKPOINT_TOPICS in Python (avoids the chromadb 1.5.x $in/$nin filter-planner bug), copies them to the recovery collection (preserving IDs + metadata), then deletes from main. Idempotent — re-running on a fully-reorganized palace returns 0.
- Add-then-delete order: a crash mid-migration leaves a duplicate, not a loss. Recoverable.
- 6 new tests in test_migrate.py::TestMigrateCheckpointsToRecovery (moves checkpoints, idempotent, preserves IDs/metadata, handles auto-save synonym, no-checkpoints returns 0, no-palace returns 0).

CLI:

- mempalace repair --mode {rebuild,reorganize} (default rebuild). The new "reorganize" mode invokes migrate_checkpoints_to_recovery on the resolved palace and prints a one-line result. Designed for operators upgrading a palace post-A-C and for palace-daemon's eventual lifespan auto-migrate (phase E, deferred).

PreCompact incorporation:

- hook_precompact now calls _save_diary_direct mirroring hook_stop, leaving a recovery-collection marker before transcript mining + compaction. Per JP's framing — "the palace is just verbatim chats + tool calls" — a context-compaction event is a derivative write that belongs in the recovery store, not the searchable corpus. Failures here are non-fatal (logged, mining still runs).
- This addresses the gap noted earlier: today's split was Stop-only; PreCompact entries are now incorporated.
Deploy script:

- scripts/deploy.sh post-restart import check now also imports _segment_appears_healthy, migrate_checkpoints_to_recovery, add_drawers, _build_drawer — proves all of today's fork-ahead surface is loaded after a restart, not just the gate fixture.

Phase E (palace-daemon lifespan auto-migrate) is still deferred — cross-repo, requires separate go-ahead.

Suite 1372/1372 pass. Test count bumped 1366→1372 in README.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
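The deploy-script idea, importing the day's new public surface and failing the deploy when anything is missing, can be sketched in plain Python. This is a stand-in using stdlib modules; `check_surface` and the `"module:attr"` spec strings are hypothetical, not part of the fork's code.

```python
import importlib

def check_surface(specs):
    """Return the specs that failed to import; an empty list means the
    restarted process is running the new code, not a stale import."""
    missing = []
    for spec in specs:
        mod_name, _, attr = spec.partition(":")
        try:
            mod = importlib.import_module(mod_name)
            if attr and not hasattr(mod, attr):
                missing.append(spec)
        except ImportError:
            missing.append(spec)
    return missing

# Stdlib stand-ins; the real check would list fork surfaces such as
# "mempalace.migrate:migrate_checkpoints_to_recovery".
assert check_surface(["json:dumps", "os.path:join"]) == []
assert check_surface(["json:no_such_name"]) == ["json:no_such_name"]
```

The point of the design is that a green restart proves the whole fork-ahead surface loaded, not just one fixture.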
1 parent a4ec341 commit 42817d7

7 files changed

Lines changed: 286 additions & 10 deletions

CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ Ruff for linting (`ruff check`), line length 100, target Python 3.9.
21. **feat: `kind` filter on `search_memories` excludes Stop-hook checkpoints by default** (commits `8d02835` → `3d85739` → `398f42f` → `f9f5cc4`, 2026-04-25) — Stop-hook auto-save diary entries (topic=checkpoint, text starting `"CHECKPOINT:"`) were dominating MCP search results because they're short, word-dense, and outrank substantive content under cosine similarity. New `kind` parameter on `search_memories` and `mempalace_search` MCP tool: `"content"` (default, excludes checkpoints), `"checkpoint"` (only checkpoints, recovery/audit), `"all"` (no filter, pre-2026-04-25 behavior). **Two architecture corrections during the same day:** (a) the where-clause filter (`topic $nin [...]`) tripped a ChromaDB 1.5.x filter-planner bug — `Internal error: Error finding id` on every kind=content vector query — so the exclusion moved to post-filter only (`398f42f`); (b) vector top-N is dominated by checkpoints on this palace (top-10 hits all CHECKPOINT entries on probe queries), so post-filter alone empties the result set without aggressive over-fetch — pull size raised to `max(n*20, 100)` for kind != "all" (`f9f5cc4`). Post-filter checks both `topic` metadata and text-prefix shape; coverage equivalent to the original belt-and-suspenders without the chromadb bug. Result dicts now surface `topic`. 9 tests in `TestCheckpointFilter`. Companion fix in [`jphein/palace-daemon`](https://github.com/jphein/palace-daemon) commit `dd8894c` standardizes all hook clients on `topic="checkpoint"` (was `topic="auto-save"` in `clients/hook.py`). Structural fix still pending: stop indexing checkpoints as searchable drawers (separate session-recovery table). Upstream PR pending.
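The `kind=` post-filter described in item 21 can be sketched as a plain Python filter over result dicts. This is an illustrative reconstruction, not the fork's actual function; it assumes results carry `topic` and `text`, and the real search path also raises the vector pull size to `max(n*20, 100)` before this filter runs.

```python
_CHECKPOINT_TOPICS = {"checkpoint", "auto-save"}

def filter_by_kind(results, kind="content", n=10):
    """Post-filter vector results by kind. Callers over-fetch upstream,
    since a checkpoint-dominated top-N would otherwise empty the list."""
    def is_checkpoint(r):
        # Belt-and-suspenders: topic metadata OR the text-prefix shape.
        return (r.get("topic") in _CHECKPOINT_TOPICS
                or r.get("text", "").startswith("CHECKPOINT:"))
    if kind == "all":
        return results[:n]
    if kind == "checkpoint":
        return [r for r in results if is_checkpoint(r)][:n]
    return [r for r in results if not is_checkpoint(r)][:n]

rows = [
    {"topic": "checkpoint", "text": "CHECKPOINT: session summary"},
    {"topic": "design", "text": "actual content drawer"},
]
assert filter_by_kind(rows, "content") == [rows[1]]
assert filter_by_kind(rows, "checkpoint") == [rows[0]]
assert filter_by_kind(rows, "all") == rows
```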
22. **fix: `palace_graph.build_graph` skips None metadata** (commit `5fd15db`, 2026-04-25) — `palace_graph.py:95` called `meta.get("room", "")` unconditionally; ChromaDB returns `None` for legacy/partial-write drawers, AttributeError took out every consumer of `build_graph` (graph_stats, find_tunnels, traverse, daemon's `/stats`). Caught by `palace-daemon/scripts/verify-routes.sh` smoke-test on 2026-04-25 — `/stats` was 500-ing on a single None drawer. Adds `if meta is None: continue` guard. Closes the same gap as upstream's #999 None-metadata audit (which covered searcher/mcp_server/miner.status) plus our PR #1094 (which coerces at backend boundary for new writes), in a different read path the audit didn't reach. Filed as #1201 on 2026-04-25.
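The None-metadata guard in item 22 is a one-line fix; a self-contained sketch of the grouping loop shows its effect (function and field names here are illustrative, not `palace_graph`'s actual API):

```python
def build_rooms(metadatas):
    """Group drawer indices by room, skipping None metadata from
    legacy/partial writes (ChromaDB can return None per drawer)."""
    rooms = {}
    for i, meta in enumerate(metadatas):
        if meta is None:  # the guard this item adds; without it, AttributeError
            continue
        rooms.setdefault(meta.get("room", ""), []).append(i)
    return rooms

# A single None drawer no longer takes out every consumer of the graph.
assert build_rooms([{"room": "a"}, None, {"room": "a"}]) == {"a": [0, 2]}
```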

23. **feat: checkpoint collection split — phases A–C** (commit `e266365`, 2026-04-25) — Promoted from "future work" to "necessary" by 2026-04-25 Cat 9 A/B (`kind=all` 632 tokens/Q vs `kind=content` 3 tokens/Q on the canonical 151K palace; over-fetch=100 inadequate, structural fix non-optional). **Phase A:** new `_SESSION_RECOVERY_COLLECTION` constant + `get_session_recovery_collection()` in `palace.py` (mirrors `get_collection`'s shape — cosine, num_threads=1). **Phase B:** `tool_diary_write` routes `topic in _CHECKPOINT_TOPICS` to the dedicated `mempalace_session_recovery` collection, everything else stays in `mempalace_drawers`; new `_get_session_recovery_collection()` in `mcp_server.py` with parallel cache. **Phase C:** new `tool_session_recovery_read` MCP handler reads recovery collection only with optional filters `session_id`, `agent`, `since`, `until`, `wing`, `limit`; `session_id` added as optional metadata field on `tool_diary_write` so the new tool can filter by Claude Code session. Registered in `TOOLS` dict, documented in `website/reference/mcp-tools.md`. 12 new tests across `tests/test_session_recovery.py` + `TestCheckpointRouting` + `TestSessionRecoveryRead`. Design + plan at `docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md` and `docs/superpowers/plans/2026-04-25-checkpoint-collection-split-impl.md`. **Phases D (data migration of ~640 existing checkpoints out of main collection) and E (palace-daemon `lifespan` auto-migrate + `mempalace repair --mode reorganize`) deferred** — multi-day work, gated on a separate go-ahead. Once D lands and the canonical-palace re-run shows the predicted `kind=all` ≈ `kind=content` token convergence, the `kind=` post-filter and over-fetch hack become deletable. **Update 2026-04-26:** phase D shipped — `migrate_checkpoints_to_recovery()` in `mempalace/migrate.py`, idempotent walk that moves topic in `_CHECKPOINT_TOPICS` drawers from main → recovery while preserving IDs and metadata. Wired into `mempalace repair --mode reorganize` (CLI dispatch in `cli.py` chooses between `rebuild` (HNSW from sqlite) and `reorganize` (this new path)). PreCompact hook also incorporated — `hook_precompact` now writes a recovery marker via `_save_diary_direct` mirroring Stop, so a context-compaction event leaves a queryable timestamp in the recovery collection. 6 new migration tests in `test_migrate.py::TestMigrateCheckpointsToRecovery`. Phase E (palace-daemon `lifespan` auto-migrate) still pending — cross-repo, separate go-ahead.
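The phase-B routing rule in entry 23 reduces to a topic check; a toy sketch with Python lists standing in for ChromaDB collections (the `route_collection` helper is hypothetical, introduced only to illustrate):

```python
_CHECKPOINT_TOPICS = ("checkpoint", "auto-save")

def route_collection(topic, main, recovery):
    """Phase-B style routing: derivative checkpoint writes go to the
    recovery store, everything else to the verbatim main collection."""
    return recovery if topic in _CHECKPOINT_TOPICS else main

main, recovery = [], []
route_collection("checkpoint", main, recovery).append("ckpt-1")
route_collection("design-notes", main, recovery).append("note-1")
assert recovery == ["ckpt-1"]
assert main == ["note-1"]
```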

27. **perf: batch ChromaDB inserts in miner (cherry-pick of upstream #1085)** (commit `6be6fff`, 2026-04-26) — Cherry-picked @midweste's [#1085](https://github.com/MemPalace/mempalace/pull/1085) "batch ChromaDB inserts in miner — 10-30x faster mining". Upstream PR #1085 is still **OPEN** as of 2026-04-26 (created 2026-04-21, base=develop, not yet merged) — verified via `gh pr view 1085 --repo MemPalace/mempalace`. We cherry-picked the commit ahead of merge so the fork can use it now; this row clears when #1085 merges into develop and we next sync. We don't file a competing fork-side PR — the proposal is @midweste's. New `_build_drawer()` helper builds id+document+metadata in one shot; new `add_drawers()` batch-insert function takes the full chunk list and sub-batches at `DRAWER_UPSERT_BATCH_SIZE` (one chromadb upsert + one ONNX embedding forward-pass per sub-batch instead of per-chunk). `process_file` now calls `add_drawers` directly. Hoists `datetime.now()` and `os.path.getmtime()` to file-level (2 syscalls per file instead of 2N). **Conflict resolution:** fork already had a fork-only `_build_drawer_metadata` + an outer batch loop in `process_file`; upstream's clean structure supersedes both. Kept fork's `DRAWER_UPSERT_BATCH_SIZE=1000` (more conservative than upstream's 5000 for embedding-pass memory headroom); aliased upstream's `CHROMA_BATCH_LIMIT` to point at it so any code/test referencing either name sees the same value. 74/74 miner+convo_miner tests pass; full suite 1366/1366. Becomes a no-op when #1085 merges into upstream develop and we next sync develop→main.
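The sub-batching described in entry 27 is a plain chunking loop; a sketch with a recording callback standing in for the chromadb upsert (`add_drawers_sketch` is a hypothetical stand-in, not the fork's `add_drawers`):

```python
# Fork keeps 1000 (vs upstream's 5000) for embedding-pass memory headroom.
DRAWER_UPSERT_BATCH_SIZE = 1000
CHROMA_BATCH_LIMIT = DRAWER_UPSERT_BATCH_SIZE  # alias, as the entry describes

def add_drawers_sketch(drawers, upsert):
    """Sub-batch the full chunk list: one upsert call (and one embedding
    forward-pass) per sub-batch instead of one per chunk."""
    for start in range(0, len(drawers), DRAWER_UPSERT_BATCH_SIZE):
        upsert(drawers[start:start + DRAWER_UPSERT_BATCH_SIZE])

calls = []
add_drawers_sketch(list(range(2500)), calls.append)
assert [len(c) for c in calls] == [1000, 1000, 500]
```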

README.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Fork of [MemPalace](https://github.com/milla-jovovich/mempalace), tracking `upst

What this fork adds that you won't get from upstream yet: a **deterministic silent-save hook architecture** (zero data loss, `systemMessage` notification, daemon-strict mode that skips local writes when `PALACE_DAEMON_URL` is set), **ChromaDB 1.5.x hardening** (`quarantine_stale_hnsw` drift recovery, segfault-trigger guards, 8-site `None`-metadata safety), **search that never silently misses** (`search_memories` returns warnings + sqlite BM25 top-up + `available_in_scope` so callers can see what they aren't getting), and **`kind=`-filtered search** that excludes Stop-hook auto-save checkpoints by default — discovered via the 2026-04-25 RLM smoke test, which surfaced that checkpoint diary entries (high word-density session summaries) were dominating retrieval and producing confident-but-misleading answers. Full list below.

-1366 tests pass on `main` · [Discussion #1017](https://github.com/MemPalace/mempalace/discussions/1017) introduces the fork upstream · [Issues on this repo](https://github.com/jphein/mempalace/issues) for fork-specific feedback.
+1372 tests pass on `main` · [Discussion #1017](https://github.com/MemPalace/mempalace/discussions/1017) introduces the fork upstream · [Issues on this repo](https://github.com/jphein/mempalace/issues) for fork-specific feedback.

## Fork change queue

mempalace/cli.py

Lines changed: 42 additions & 4 deletions
@@ -539,14 +539,38 @@ def cmd_status(args):
 
 
 def cmd_repair(args):
-    """Rebuild palace vector index from SQLite metadata."""
+    """Rebuild palace vector index, or reorganize derivative drawers."""
     import shutil
     from .backends.chroma import ChromaBackend
-    from .migrate import confirm_destructive_action, contains_palace_database
+    from .migrate import (
+        confirm_destructive_action,
+        contains_palace_database,
+        migrate_checkpoints_to_recovery,
+    )
 
     palace_path = os.path.abspath(
        os.path.expanduser(args.palace) if args.palace else MempalaceConfig().palace_path
     )
+
+    # mode=reorganize: move topic=checkpoint drawers from main → recovery.
+    # Non-destructive, idempotent. Designed to run on first daemon startup
+    # post-upgrade to land the checkpoint-collection split (phase D).
+    if getattr(args, "mode", "rebuild") == "reorganize":
+        if not os.path.isdir(palace_path) or not contains_palace_database(palace_path):
+            print(f"\n No palace database found at {palace_path}")
+            return
+        print(f"\n{'=' * 55}")
+        print(" MemPalace Reorganize — checkpoint → session-recovery")
+        print(f"{'=' * 55}\n")
+        print(f" Palace: {palace_path}")
+        moved = migrate_checkpoints_to_recovery(palace_path)
+        if moved == 0:
+            print(" Nothing to move — palace is already reorganized.")
+        else:
+            print(f" Moved {moved} checkpoint drawer(s) to mempalace_session_recovery.")
+            print(" mempalace_search now queries content-only.")
+        print(f"\n{'=' * 55}\n")
+        return
     db_path = os.path.join(palace_path, "chroma.sqlite3")
 
     if not os.path.isdir(palace_path):
@@ -1014,10 +1038,24 @@ def main():
         instructions_sub.add_parser(instr_name, help=f"Output {instr_name} instructions")
 
     # repair
-    sub.add_parser(
+    p_repair = sub.add_parser(
         "repair",
         help="Rebuild palace vector index from stored data (fixes segfaults after corruption)",
-    ).add_argument("--yes", action="store_true", help="Skip confirmation for destructive changes")
+    )
+    p_repair.add_argument(
+        "--yes", action="store_true", help="Skip confirmation for destructive changes"
+    )
+    p_repair.add_argument(
+        "--mode",
+        choices=["rebuild", "reorganize"],
+        default="rebuild",
+        help=(
+            "rebuild: extract + re-upsert all drawers to repair HNSW index. "
+            "reorganize: move existing topic=checkpoint drawers from the main "
+            "collection into mempalace_session_recovery (idempotent; safe to "
+            "re-run)."
+        ),
+    )
 
     # mcp
     sub.add_parser(
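The repair-subparser change above follows a standard argparse pattern. A minimal self-contained sketch (prog name and the absence of other subcommands are simplifications, not the real CLI):

```python
import argparse

def build_parser():
    """Sketch of the repair subcommand with a --mode choice and a default."""
    parser = argparse.ArgumentParser(prog="mempalace")
    sub = parser.add_subparsers(dest="command")
    p_repair = sub.add_parser("repair")
    p_repair.add_argument("--yes", action="store_true")
    # choices rejects anything but the two modes; default keeps old behavior.
    p_repair.add_argument("--mode", choices=["rebuild", "reorganize"],
                          default="rebuild")
    return parser

args = build_parser().parse_args(["repair", "--mode", "reorganize"])
assert args.mode == "reorganize"
assert build_parser().parse_args(["repair"]).mode == "rebuild"
```

The `default="rebuild"` is what makes the new flag backward-compatible: existing `mempalace repair` invocations keep their old meaning.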

mempalace/hooks_cli.py

Lines changed: 25 additions & 1 deletion
@@ -722,13 +722,37 @@ def hook_session_start(data: dict, harness: str):
 
 
 def hook_precompact(data: dict, harness: str):
-    """Precompact hook: mine transcript synchronously, then allow compaction."""
+    """Precompact hook: write a session-recovery checkpoint, mine the
+    transcript synchronously, then allow compaction.
+
+    Session-recovery write parallels ``hook_stop``'s ``_save_diary_direct``
+    so a context-compaction event leaves a "where we were" marker in the
+    dedicated ``mempalace_session_recovery`` collection — queryable later
+    via ``mempalace_session_recovery_read`` by session_id. This isn't a
+    summary of context; it's a timestamped event of "context boundary
+    crossed at message N" so an operator can find the last marker before
+    compaction lost in-context state.
+    """
     parsed = _parse_harness_input(data, harness)
     session_id = parsed["session_id"]
     transcript_path = parsed["transcript_path"]
 
     _log(f"PRE-COMPACT triggered for session {session_id}")
 
+    # Write a recovery marker before mining + compacting. Failure here is
+    # non-fatal; the mine + compaction must still proceed.
+    if transcript_path:
+        try:
+            project_wing = _wing_from_transcript_path(transcript_path)
+            _save_diary_direct(
+                transcript_path,
+                session_id,
+                wing=project_wing,
+                toast=False,
+            )
+        except Exception as e:
+            _log(f"PreCompact recovery-write failed (non-fatal): {e}")
+
     # Capture tool output via our normalize path before compaction loses it
     if transcript_path:
         _ingest_transcript(transcript_path)
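The non-fatal-write shape in the hunk above can be sketched abstractly: a best-effort step that logs its failure and always falls through to mining. Names here (`precompact_sketch`, the callbacks) are illustrative, not the module's API:

```python
def precompact_sketch(write_marker, mine, log):
    """Best-effort recovery write: a failure is logged and mining still
    runs, mirroring the hook_precompact control flow."""
    try:
        write_marker()
    except Exception as e:
        log(f"PreCompact recovery-write failed (non-fatal): {e}")
    mine()  # compaction-side work proceeds regardless

events = []
def bad_write():
    raise RuntimeError("recovery store down")

precompact_sketch(bad_write, lambda: events.append("mined"), events.append)
assert events[0].startswith("PreCompact recovery-write failed")
assert events[1] == "mined"
```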

mempalace/migrate.py

Lines changed: 97 additions & 0 deletions
@@ -245,3 +245,100 @@ def migrate(palace_path: str, dry_run: bool = False, confirm: bool = False):
 
     print(f"\n{'=' * 60}\n")
     return True
+
+
+# ---------------------------------------------------------------------------
+# Phase D: move existing topic=checkpoint drawers from the main searchable
+# collection into the dedicated session-recovery collection. The main
+# collection is the *verbatim* store — chats, tool calls, mined files —
+# and should not carry derivative summary entries (Stop-hook auto-save
+# checkpoints) that wreck vector ranking. See spec at
+# docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md.
+# ---------------------------------------------------------------------------
+
+# Topic values whose drawers belong in the session-recovery collection,
+# not the searchable main collection. Mirrors searcher._CHECKPOINT_TOPICS;
+# kept duplicated here to avoid pulling the searcher import into the
+# migrate module (which is loaded by CLI repair-mode dispatch).
+_CHECKPOINT_TOPICS = ("checkpoint", "auto-save")
+
+
+def migrate_checkpoints_to_recovery(palace_path: str, batch_size: int = 1000) -> int:
+    """Move all topic=checkpoint drawers from main → recovery collection.
+
+    Idempotent: re-running on a fully-migrated palace returns 0. Drawer
+    IDs and metadata are preserved exactly. The original drawer is added
+    to the recovery collection first, then deleted from main — so a
+    crash mid-migration leaves a duplicate (recoverable) rather than a
+    loss.
+
+    Returns the number of drawers moved on this invocation.
+    """
+    from .palace import get_collection, get_session_recovery_collection
+
+    palace_path = os.path.abspath(os.path.expanduser(palace_path))
+    if not contains_palace_database(palace_path):
+        return 0
+
+    try:
+        main = get_collection(palace_path, create=False)
+    except Exception:
+        # Palace dir exists but main collection isn't readable — nothing to migrate.
+        return 0
+    recovery = get_session_recovery_collection(palace_path, create=True)
+
+    moved_total = 0
+    offset = 0
+    # Walk the main collection in pages. We deliberately don't use a
+    # ``where={"topic": {"$in": _CHECKPOINT_TOPICS}}`` clause: the
+    # ChromaDB 1.5.x filter-planner bug surfaced earlier this week with
+    # ``$in``/``$nin`` on metadata. Pull batches plain and filter in
+    # Python.
+    while True:
+        try:
+            batch = main.get(
+                limit=batch_size,
+                offset=offset,
+                include=["documents", "metadatas"],
+            )
+        except Exception:
+            # Defensive: a chromadb error on the read path stops the
+            # migration cleanly without corrupting state. Caller can retry.
+            break
+
+        ids = batch.get("ids") or []
+        if not ids:
+            break
+
+        docs = batch.get("documents") or []
+        metas = batch.get("metadatas") or []
+
+        ids_to_move: list = []
+        docs_to_move: list = []
+        metas_to_move: list = []
+
+        for i, doc, meta in zip(ids, docs, metas):
+            meta = meta or {}
+            if meta.get("topic") in _CHECKPOINT_TOPICS:
+                ids_to_move.append(i)
+                docs_to_move.append(doc)
+                metas_to_move.append(meta)
+
+        if ids_to_move:
+            recovery.add(
+                ids=ids_to_move,
+                documents=docs_to_move,
+                metadatas=metas_to_move,
+            )
+            main.delete(ids=ids_to_move)
+            moved_total += len(ids_to_move)
+            # The delete shrinks main; the *next* page would skip
+            # ``len(ids_to_move)`` drawers. Reset offset so we re-page
+            # over the (now smaller) collection from the same logical
+            # position — equivalent to the standard "delete-during-walk"
+            # fixup.
+            continue
+
+        offset += len(ids)
+
+    return moved_total
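The offset-reset comment in the hunk above is the standard delete-during-walk fixup. A minimal self-contained sketch against a plain Python list (no chromadb; `migrate_sketch` is a hypothetical stand-in) shows why re-paging from the same offset after a delete visits every drawer:

```python
def migrate_sketch(store, should_move, batch_size=2):
    """Delete-during-walk pagination: after moving a page's matches we
    re-read from the same offset, because the delete shrank the list
    and advancing would skip exactly that many items."""
    moved, offset = [], 0
    while True:
        page = store[offset:offset + batch_size]
        if not page:
            break
        matches = [x for x in page if should_move(x)]
        if matches:
            moved.extend(matches)
            for x in matches:
                store.remove(x)  # stand-in for main.delete(...)
            continue  # re-page from the same offset over the smaller store
        offset += len(page)
    return moved

store = ["ckpt", "doc", "ckpt", "doc", "ckpt"]
assert migrate_sketch(store, lambda x: x == "ckpt") == ["ckpt", "ckpt", "ckpt"]
assert store == ["doc", "doc"]
```

Advancing the offset naively after a delete would silently skip as many drawers as were just moved; the `continue` avoids that without any extra bookkeeping.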

scripts/deploy.sh

Lines changed: 5 additions & 3 deletions
@@ -78,13 +78,15 @@ step "5/5 verify new code is loaded"
 # venv (proves Syncthing + restart picked up new code, not just a stale
 # import). Update this list as new public surface lands.
 ssh "$HOST" "~/.local/share/palace-daemon/venv/bin/python -c '
-from mempalace.backends.chroma import ChromaBackend
+from mempalace.backends.chroma import ChromaBackend, _segment_appears_healthy
 from mempalace.palace import _SESSION_RECOVERY_COLLECTION, get_session_recovery_collection
 from mempalace.mcp_server import tool_session_recovery_read
-assert hasattr(ChromaBackend, \"_quarantined_paths\"), \"HNSW gate fix not loaded\"
+from mempalace.migrate import migrate_checkpoints_to_recovery
+from mempalace.miner import add_drawers, _build_drawer
+assert hasattr(ChromaBackend, \"_quarantined_paths\"), \"HNSW cold-start gate not loaded\"
 assert _SESSION_RECOVERY_COLLECTION == \"mempalace_session_recovery\"
 print(\"OK\")
 '" >/dev/null 2>&1 || fail "post-restart import check failed (see ssh log)"
-ok "post-restart imports include today's fork-ahead surface"
+ok "post-restart imports include today's fork-ahead surface (gate, integrity, recovery, migrate, batch)"
 
 printf '\n\033[1;32m✦ mempalace deploy complete: %s on %s\033[0m\n' "$local_sha" "$URL"
