
Commit 42817d7

jphein and claude committed
feat(checkpoint-split): phase D migration + PreCompact recovery write
Closes the structural fix from phases A-C: the verbatim store (mempalace_drawers) is now actually verbatim — derivative writes (Stop-hook + PreCompact checkpoints) live in the dedicated mempalace_session_recovery collection.

Phase D — migration of existing checkpoint drawers:

- migrate_checkpoints_to_recovery(palace_path, batch_size=1000) in mempalace/migrate.py walks the main collection in pages, filters drawers with topic in _CHECKPOINT_TOPICS in Python (avoids the chromadb 1.5.x $in/$nin filter-planner bug), copies them to the recovery collection (preserving IDs + metadata), then deletes from main. Idempotent — re-running on a fully-reorganized palace returns 0.
- Add-then-delete order: a crash mid-migration leaves a duplicate, not a loss. Recoverable.
- 6 new tests in test_migrate.py::TestMigrateCheckpointsToRecovery (moves checkpoints, idempotent, preserves IDs/metadata, handles auto-save synonym, no-checkpoints returns 0, no-palace returns 0).

CLI:

- mempalace repair --mode {rebuild,reorganize} (default rebuild). The new "reorganize" mode invokes migrate_checkpoints_to_recovery on the resolved palace and prints a one-line result. Designed for operators upgrading a palace post-A-C and for palace-daemon's eventual lifespan auto-migrate (phase E, deferred).

PreCompact incorporation:

- hook_precompact now calls _save_diary_direct mirroring hook_stop, leaving a recovery-collection marker before transcript mining + compaction. Per JP's framing — "the palace is just verbatim chats + tool calls" — a context-compaction event is a derivative write that belongs in the recovery store, not the searchable corpus. Failures here are non-fatal (logged, mining still runs).
- This addresses the gap noted earlier: today's split was Stop-only; PreCompact entries are now incorporated.
Deploy script:

- scripts/deploy.sh post-restart import check now also imports _segment_appears_healthy, migrate_checkpoints_to_recovery, add_drawers, _build_drawer — proves all of today's fork-ahead surface is loaded after a restart, not just the gate fixture.

Phase E (palace-daemon lifespan auto-migrate) is still deferred — cross-repo, requires separate go-ahead.

Suite 1372/1372 pass. Test count bumped 1366→1372 in README.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
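The deploy-script idea, importing the day's new public surface and failing the deploy when anything is missing, can be sketched in plain Python. This is a stand-in using stdlib modules; `check_surface` and the `"module:attr"` spec strings are hypothetical, not part of the fork's code.

```python
import importlib

def check_surface(specs):
    """Return the specs that failed to import; an empty list means the
    restarted process is running the new code, not a stale import."""
    missing = []
    for spec in specs:
        mod_name, _, attr = spec.partition(":")
        try:
            mod = importlib.import_module(mod_name)
            if attr and not hasattr(mod, attr):
                missing.append(spec)
        except ImportError:
            missing.append(spec)
    return missing

# Stdlib stand-ins; the real check would list fork surfaces such as
# "mempalace.migrate:migrate_checkpoints_to_recovery".
assert check_surface(["json:dumps", "os.path:join"]) == []
assert check_surface(["json:no_such_name"]) == ["json:no_such_name"]
```

The point of the design is that a green restart proves the whole fork-ahead surface loaded, not just one fixture.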
1 parent a4ec341 commit 42817d7

7 files changed

Lines changed: 286 additions & 10 deletions

CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ Ruff for linting (`ruff check`), line length 100, target Python 3.9.
21. **feat: `kind` filter on `search_memories` excludes Stop-hook checkpoints by default** (commits `8d02835` → `3d85739` → `398f42f` → `f9f5cc4`, 2026-04-25) — Stop-hook auto-save diary entries (topic=checkpoint, text starting `"CHECKPOINT:"`) were dominating MCP search results because they're short, word-dense, and outrank substantive content under cosine similarity. New `kind` parameter on `search_memories` and `mempalace_search` MCP tool: `"content"` (default, excludes checkpoints), `"checkpoint"` (only checkpoints, recovery/audit), `"all"` (no filter, pre-2026-04-25 behavior). **Two architecture corrections during the same day:** (a) the where-clause filter (`topic $nin [...]`) tripped a ChromaDB 1.5.x filter-planner bug — `Internal error: Error finding id` on every kind=content vector query — so the exclusion moved to post-filter only (`398f42f`); (b) vector top-N is dominated by checkpoints on this palace (top-10 hits all CHECKPOINT entries on probe queries), so post-filter alone empties the result set without aggressive over-fetch — pull size raised to `max(n*20, 100)` for kind != "all" (`f9f5cc4`). Post-filter checks both `topic` metadata and text-prefix shape; coverage equivalent to the original belt-and-suspenders without the chromadb bug. Result dicts now surface `topic`. 9 tests in `TestCheckpointFilter`. Companion fix in [`jphein/palace-daemon`](https://github.com/jphein/palace-daemon) commit `dd8894c` standardizes all hook clients on `topic="checkpoint"` (was `topic="auto-save"` in `clients/hook.py`). Structural fix still pending: stop indexing checkpoints as searchable drawers (separate session-recovery table). Upstream PR pending.
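The `kind=` post-filter described in item 21 can be sketched as a plain Python filter over result dicts. This is an illustrative reconstruction, not the fork's actual function; it assumes results carry `topic` and `text`, and the real search path also raises the vector pull size to `max(n*20, 100)` before this filter runs.

```python
_CHECKPOINT_TOPICS = {"checkpoint", "auto-save"}

def filter_by_kind(results, kind="content", n=10):
    """Post-filter vector results by kind. Callers over-fetch upstream,
    since a checkpoint-dominated top-N would otherwise empty the list."""
    def is_checkpoint(r):
        # Belt-and-suspenders: topic metadata OR the text-prefix shape.
        return (r.get("topic") in _CHECKPOINT_TOPICS
                or r.get("text", "").startswith("CHECKPOINT:"))
    if kind == "all":
        return results[:n]
    if kind == "checkpoint":
        return [r for r in results if is_checkpoint(r)][:n]
    return [r for r in results if not is_checkpoint(r)][:n]

rows = [
    {"topic": "checkpoint", "text": "CHECKPOINT: session summary"},
    {"topic": "design", "text": "actual content drawer"},
]
assert filter_by_kind(rows, "content") == [rows[1]]
assert filter_by_kind(rows, "checkpoint") == [rows[0]]
assert filter_by_kind(rows, "all") == rows
```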
22. **fix: `palace_graph.build_graph` skips None metadata** (commit `5fd15db`, 2026-04-25) — `palace_graph.py:95` called `meta.get("room", "")` unconditionally; ChromaDB returns `None` for legacy/partial-write drawers, AttributeError took out every consumer of `build_graph` (graph_stats, find_tunnels, traverse, daemon's `/stats`). Caught by `palace-daemon/scripts/verify-routes.sh` smoke-test on 2026-04-25 — `/stats` was 500-ing on a single None drawer. Adds `if meta is None: continue` guard. Closes the same gap as upstream's #999 None-metadata audit (which covered searcher/mcp_server/miner.status) plus our PR #1094 (which coerces at backend boundary for new writes), in a different read path the audit didn't reach. Filed as #1201 on 2026-04-25.
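The None-metadata guard in item 22 is a one-line fix; a self-contained sketch of the grouping loop shows its effect (function and field names here are illustrative, not `palace_graph`'s actual API):

```python
def build_rooms(metadatas):
    """Group drawer indices by room, skipping None metadata from
    legacy/partial writes (ChromaDB can return None per drawer)."""
    rooms = {}
    for i, meta in enumerate(metadatas):
        if meta is None:  # the guard this item adds; without it, AttributeError
            continue
        rooms.setdefault(meta.get("room", ""), []).append(i)
    return rooms

# A single None drawer no longer takes out every consumer of the graph.
assert build_rooms([{"room": "a"}, None, {"room": "a"}]) == {"a": [0, 2]}
```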

23. **feat: checkpoint collection split — phases A–C** (commit `e266365`, 2026-04-25) — Promoted from "future work" to "necessary" by 2026-04-25 Cat 9 A/B (`kind=all` 632 tokens/Q vs `kind=content` 3 tokens/Q on the canonical 151K palace; over-fetch=100 inadequate, structural fix non-optional). **Phase A:** new `_SESSION_RECOVERY_COLLECTION` constant + `get_session_recovery_collection()` in `palace.py` (mirrors `get_collection`'s shape — cosine, num_threads=1). **Phase B:** `tool_diary_write` routes `topic in _CHECKPOINT_TOPICS` to the dedicated `mempalace_session_recovery` collection, everything else stays in `mempalace_drawers`; new `_get_session_recovery_collection()` in `mcp_server.py` with parallel cache. **Phase C:** new `tool_session_recovery_read` MCP handler reads recovery collection only with optional filters `session_id`, `agent`, `since`, `until`, `wing`, `limit`; `session_id` added as optional metadata field on `tool_diary_write` so the new tool can filter by Claude Code session. Registered in `TOOLS` dict, documented in `website/reference/mcp-tools.md`. 12 new tests across `tests/test_session_recovery.py` + `TestCheckpointRouting` + `TestSessionRecoveryRead`. Design + plan at `docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md` and `docs/superpowers/plans/2026-04-25-checkpoint-collection-split-impl.md`. **Phases D (data migration of ~640 existing checkpoints out of main collection) and E (palace-daemon `lifespan` auto-migrate + `mempalace repair --mode reorganize`) deferred** — multi-day work, gated on a separate go-ahead. Once D lands and the canonical-palace re-run shows the predicted `kind=all` ≈ `kind=content` token convergence, the `kind=` post-filter and over-fetch hack become deletable. **Update 2026-04-26:** phase D shipped — `migrate_checkpoints_to_recovery()` in `mempalace/migrate.py`, idempotent walk that moves topic in `_CHECKPOINT_TOPICS` drawers from main → recovery while preserving IDs and metadata. Wired into `mempalace repair --mode reorganize` (CLI dispatch in `cli.py` chooses between `rebuild` (HNSW from sqlite) and `reorganize` (this new path)). PreCompact hook also incorporated — `hook_precompact` now writes a recovery marker via `_save_diary_direct` mirroring Stop, so a context-compaction event leaves a queryable timestamp in the recovery collection. 6 new migration tests in `test_migrate.py::TestMigrateCheckpointsToRecovery`. Phase E (palace-daemon `lifespan` auto-migrate) still pending — cross-repo, separate go-ahead.
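The phase-B routing rule in entry 23 reduces to a topic check; a toy sketch with Python lists standing in for ChromaDB collections (the `route_collection` helper is hypothetical, introduced only to illustrate):

```python
_CHECKPOINT_TOPICS = ("checkpoint", "auto-save")

def route_collection(topic, main, recovery):
    """Phase-B style routing: derivative checkpoint writes go to the
    recovery store, everything else to the verbatim main collection."""
    return recovery if topic in _CHECKPOINT_TOPICS else main

main, recovery = [], []
route_collection("checkpoint", main, recovery).append("ckpt-1")
route_collection("design-notes", main, recovery).append("note-1")
assert recovery == ["ckpt-1"]
assert main == ["note-1"]
```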

27. **perf: batch ChromaDB inserts in miner (cherry-pick of upstream #1085)** (commit `6be6fff`, 2026-04-26) — Cherry-picked @midweste's [#1085](https://github.com/MemPalace/mempalace/pull/1085) "batch ChromaDB inserts in miner — 10-30x faster mining". Upstream PR #1085 is still **OPEN** as of 2026-04-26 (created 2026-04-21, base=develop, not yet merged) — verified via `gh pr view 1085 --repo MemPalace/mempalace`. We cherry-picked the commit ahead of merge so the fork can use it now; this row clears when #1085 merges into develop and we next sync. We don't file a competing fork-side PR — the proposal is @midweste's. New `_build_drawer()` helper builds id+document+metadata in one shot; new `add_drawers()` batch-insert function takes the full chunk list and sub-batches at `DRAWER_UPSERT_BATCH_SIZE` (one chromadb upsert + one ONNX embedding forward-pass per sub-batch instead of per-chunk). `process_file` now calls `add_drawers` directly. Hoists `datetime.now()` and `os.path.getmtime()` to file-level (2 syscalls per file instead of 2N). **Conflict resolution:** fork already had a fork-only `_build_drawer_metadata` + an outer batch loop in `process_file`; upstream's clean structure supersedes both. Kept fork's `DRAWER_UPSERT_BATCH_SIZE=1000` (more conservative than upstream's 5000 for embedding-pass memory headroom); aliased upstream's `CHROMA_BATCH_LIMIT` to point at it so any code/test referencing either name sees the same value. 74/74 miner+convo_miner tests pass; full suite 1366/1366. Becomes a no-op when #1085 merges into upstream develop and we next sync develop→main.
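The sub-batching described in entry 27 is a plain chunking loop; a sketch with a recording callback standing in for the chromadb upsert (`add_drawers_sketch` is a hypothetical stand-in, not the fork's `add_drawers`):

```python
# Fork keeps 1000 (vs upstream's 5000) for embedding-pass memory headroom.
DRAWER_UPSERT_BATCH_SIZE = 1000
CHROMA_BATCH_LIMIT = DRAWER_UPSERT_BATCH_SIZE  # alias, as the entry describes

def add_drawers_sketch(drawers, upsert):
    """Sub-batch the full chunk list: one upsert call (and one embedding
    forward-pass) per sub-batch instead of one per chunk."""
    for start in range(0, len(drawers), DRAWER_UPSERT_BATCH_SIZE):
        upsert(drawers[start:start + DRAWER_UPSERT_BATCH_SIZE])

calls = []
add_drawers_sketch(list(range(2500)), calls.append)
assert [len(c) for c in calls] == [1000, 1000, 500]
```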

README.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Fork of [MemPalace](https://github.com/milla-jovovich/mempalace), tracking `upst

What this fork adds that you won't get from upstream yet: a **deterministic silent-save hook architecture** (zero data loss, `systemMessage` notification, daemon-strict mode that skips local writes when `PALACE_DAEMON_URL` is set), **ChromaDB 1.5.x hardening** (`quarantine_stale_hnsw` drift recovery, segfault-trigger guards, 8-site `None`-metadata safety), **search that never silently misses** (`search_memories` returns warnings + sqlite BM25 top-up + `available_in_scope` so callers can see what they aren't getting), and **`kind=`-filtered search** that excludes Stop-hook auto-save checkpoints by default — discovered via the 2026-04-25 RLM smoke test, which surfaced that checkpoint diary entries (high word-density session summaries) were dominating retrieval and producing confident-but-misleading answers. Full list below.

-1366 tests pass on `main` · [Discussion #1017](https://github.com/MemPalace/mempalace/discussions/1017) introduces the fork upstream · [Issues on this repo](https://github.com/jphein/mempalace/issues) for fork-specific feedback.
+1372 tests pass on `main` · [Discussion #1017](https://github.com/MemPalace/mempalace/discussions/1017) introduces the fork upstream · [Issues on this repo](https://github.com/jphein/mempalace/issues) for fork-specific feedback.

## Fork change queue

mempalace/cli.py

Lines changed: 42 additions & 4 deletions
@@ -539,14 +539,38 @@ def cmd_status(args):
 
 
 def cmd_repair(args):
-    """Rebuild palace vector index from SQLite metadata."""
+    """Rebuild palace vector index, or reorganize derivative drawers."""
     import shutil
     from .backends.chroma import ChromaBackend
-    from .migrate import confirm_destructive_action, contains_palace_database
+    from .migrate import (
+        confirm_destructive_action,
+        contains_palace_database,
+        migrate_checkpoints_to_recovery,
+    )
 
     palace_path = os.path.abspath(
        os.path.expanduser(args.palace) if args.palace else MempalaceConfig().palace_path
     )
+
+    # mode=reorganize: move topic=checkpoint drawers from main → recovery.
+    # Non-destructive, idempotent. Designed to run on first daemon startup
+    # post-upgrade to land the checkpoint-collection split (phase D).
+    if getattr(args, "mode", "rebuild") == "reorganize":
+        if not os.path.isdir(palace_path) or not contains_palace_database(palace_path):
+            print(f"\n No palace database found at {palace_path}")
+            return
+        print(f"\n{'=' * 55}")
+        print(" MemPalace Reorganize — checkpoint → session-recovery")
+        print(f"{'=' * 55}\n")
+        print(f" Palace: {palace_path}")
+        moved = migrate_checkpoints_to_recovery(palace_path)
+        if moved == 0:
+            print(" Nothing to move — palace is already reorganized.")
+        else:
+            print(f" Moved {moved} checkpoint drawer(s) to mempalace_session_recovery.")
+            print(" mempalace_search now queries content-only.")
+        print(f"\n{'=' * 55}\n")
+        return
     db_path = os.path.join(palace_path, "chroma.sqlite3")
 
     if not os.path.isdir(palace_path):
@@ -1014,10 +1038,24 @@ def main():
         instructions_sub.add_parser(instr_name, help=f"Output {instr_name} instructions")
 
     # repair
-    sub.add_parser(
+    p_repair = sub.add_parser(
         "repair",
         help="Rebuild palace vector index from stored data (fixes segfaults after corruption)",
-    ).add_argument("--yes", action="store_true", help="Skip confirmation for destructive changes")
+    )
+    p_repair.add_argument(
+        "--yes", action="store_true", help="Skip confirmation for destructive changes"
+    )
+    p_repair.add_argument(
+        "--mode",
+        choices=["rebuild", "reorganize"],
+        default="rebuild",
+        help=(
+            "rebuild: extract + re-upsert all drawers to repair HNSW index. "
+            "reorganize: move existing topic=checkpoint drawers from the main "
+            "collection into mempalace_session_recovery (idempotent; safe to "
+            "re-run)."
+        ),
+    )
 
     # mcp
     sub.add_parser(
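The repair-subparser change above follows a standard argparse pattern. A minimal self-contained sketch (prog name and the absence of other subcommands are simplifications, not the real CLI):

```python
import argparse

def build_parser():
    """Sketch of the repair subcommand with a --mode choice and a default."""
    parser = argparse.ArgumentParser(prog="mempalace")
    sub = parser.add_subparsers(dest="command")
    p_repair = sub.add_parser("repair")
    p_repair.add_argument("--yes", action="store_true")
    # choices rejects anything but the two modes; default keeps old behavior.
    p_repair.add_argument("--mode", choices=["rebuild", "reorganize"],
                          default="rebuild")
    return parser

args = build_parser().parse_args(["repair", "--mode", "reorganize"])
assert args.mode == "reorganize"
assert build_parser().parse_args(["repair"]).mode == "rebuild"
```

The `default="rebuild"` is what makes the new flag backward-compatible: existing `mempalace repair` invocations keep their old meaning.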

mempalace/hooks_cli.py

Lines changed: 25 additions & 1 deletion
@@ -722,13 +722,37 @@ def hook_session_start(data: dict, harness: str):
 
 
 def hook_precompact(data: dict, harness: str):
-    """Precompact hook: mine transcript synchronously, then allow compaction."""
+    """Precompact hook: write a session-recovery checkpoint, mine the
+    transcript synchronously, then allow compaction.
+
+    Session-recovery write parallels ``hook_stop``'s ``_save_diary_direct``
+    so a context-compaction event leaves a "where we were" marker in the
+    dedicated ``mempalace_session_recovery`` collection — queryable later
+    via ``mempalace_session_recovery_read`` by session_id. This isn't a
+    summary of context; it's a timestamped event of "context boundary
+    crossed at message N" so an operator can find the last marker before
+    compaction lost in-context state.
+    """
     parsed = _parse_harness_input(data, harness)
     session_id = parsed["session_id"]
     transcript_path = parsed["transcript_path"]
 
     _log(f"PRE-COMPACT triggered for session {session_id}")
 
+    # Write a recovery marker before mining + compacting. Failure here is
+    # non-fatal; the mine + compaction must still proceed.
+    if transcript_path:
+        try:
+            project_wing = _wing_from_transcript_path(transcript_path)
+            _save_diary_direct(
+                transcript_path,
+                session_id,
+                wing=project_wing,
+                toast=False,
+            )
+        except Exception as e:
+            _log(f"PreCompact recovery-write failed (non-fatal): {e}")
+
     # Capture tool output via our normalize path before compaction loses it
     if transcript_path:
         _ingest_transcript(transcript_path)
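The non-fatal-write shape in the hunk above can be sketched abstractly: a best-effort step that logs its failure and always falls through to mining. Names here (`precompact_sketch`, the callbacks) are illustrative, not the module's API:

```python
def precompact_sketch(write_marker, mine, log):
    """Best-effort recovery write: a failure is logged and mining still
    runs, mirroring the hook_precompact control flow."""
    try:
        write_marker()
    except Exception as e:
        log(f"PreCompact recovery-write failed (non-fatal): {e}")
    mine()  # compaction-side work proceeds regardless

events = []
def bad_write():
    raise RuntimeError("recovery store down")

precompact_sketch(bad_write, lambda: events.append("mined"), events.append)
assert events[0].startswith("PreCompact recovery-write failed")
assert events[1] == "mined"
```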

mempalace/migrate.py

Lines changed: 97 additions & 0 deletions
@@ -245,3 +245,100 @@ def migrate(palace_path: str, dry_run: bool = False, confirm: bool = False):
 
     print(f"\n{'=' * 60}\n")
     return True
+
+
+# ---------------------------------------------------------------------------
+# Phase D: move existing topic=checkpoint drawers from the main searchable
+# collection into the dedicated session-recovery collection. The main
+# collection is the *verbatim* store — chats, tool calls, mined files —
+# and should not carry derivative summary entries (Stop-hook auto-save
+# checkpoints) that wreck vector ranking. See spec at
+# docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md.
+# ---------------------------------------------------------------------------
+
+# Topic values whose drawers belong in the session-recovery collection,
+# not the searchable main collection. Mirrors searcher._CHECKPOINT_TOPICS;
+# kept duplicated here to avoid pulling the searcher import into the
+# migrate module (which is loaded by CLI repair-mode dispatch).
+_CHECKPOINT_TOPICS = ("checkpoint", "auto-save")
+
+
+def migrate_checkpoints_to_recovery(palace_path: str, batch_size: int = 1000) -> int:
+    """Move all topic=checkpoint drawers from main → recovery collection.
+
+    Idempotent: re-running on a fully-migrated palace returns 0. Drawer
+    IDs and metadata are preserved exactly. The original drawer is added
+    to the recovery collection first, then deleted from main — so a
+    crash mid-migration leaves a duplicate (recoverable) rather than a
+    loss.
+
+    Returns the number of drawers moved on this invocation.
+    """
+    from .palace import get_collection, get_session_recovery_collection
+
+    palace_path = os.path.abspath(os.path.expanduser(palace_path))
+    if not contains_palace_database(palace_path):
+        return 0
+
+    try:
+        main = get_collection(palace_path, create=False)
+    except Exception:
+        # Palace dir exists but main collection isn't readable — nothing to migrate.
+        return 0
+    recovery = get_session_recovery_collection(palace_path, create=True)
+
+    moved_total = 0
+    offset = 0
+    # Walk the main collection in pages. We deliberately don't use a
+    # ``where={"topic": {"$in": _CHECKPOINT_TOPICS}}`` clause: the
+    # ChromaDB 1.5.x filter-planner bug surfaced earlier this week with
+    # ``$in``/``$nin`` on metadata. Pull batches plain and filter in
+    # Python.
+    while True:
+        try:
+            batch = main.get(
+                limit=batch_size,
+                offset=offset,
+                include=["documents", "metadatas"],
+            )
+        except Exception:
+            # Defensive: a chromadb error on the read path stops the
+            # migration cleanly without corrupting state. Caller can retry.
+            break
+
+        ids = batch.get("ids") or []
+        if not ids:
+            break
+
+        docs = batch.get("documents") or []
+        metas = batch.get("metadatas") or []
+
+        ids_to_move: list = []
+        docs_to_move: list = []
+        metas_to_move: list = []
+
+        for i, doc, meta in zip(ids, docs, metas):
+            meta = meta or {}
+            if meta.get("topic") in _CHECKPOINT_TOPICS:
+                ids_to_move.append(i)
+                docs_to_move.append(doc)
+                metas_to_move.append(meta)
+
+        if ids_to_move:
+            recovery.add(
+                ids=ids_to_move,
+                documents=docs_to_move,
+                metadatas=metas_to_move,
+            )
+            main.delete(ids=ids_to_move)
+            moved_total += len(ids_to_move)
+            # The delete shrinks main; the *next* page would skip
+            # ``len(ids_to_move)`` drawers. Reset offset so we re-page
+            # over the (now smaller) collection from the same logical
+            # position — equivalent to the standard "delete-during-walk"
+            # fixup.
+            continue
+
+        offset += len(ids)
+
+    return moved_total
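The offset-reset comment in the hunk above is the standard delete-during-walk fixup. A minimal self-contained sketch against a plain Python list (no chromadb; `migrate_sketch` is a hypothetical stand-in) shows why re-paging from the same offset after a delete visits every drawer:

```python
def migrate_sketch(store, should_move, batch_size=2):
    """Delete-during-walk pagination: after moving a page's matches we
    re-read from the same offset, because the delete shrank the list
    and advancing would skip exactly that many items."""
    moved, offset = [], 0
    while True:
        page = store[offset:offset + batch_size]
        if not page:
            break
        matches = [x for x in page if should_move(x)]
        if matches:
            moved.extend(matches)
            for x in matches:
                store.remove(x)  # stand-in for main.delete(...)
            continue  # re-page from the same offset over the smaller store
        offset += len(page)
    return moved

store = ["ckpt", "doc", "ckpt", "doc", "ckpt"]
assert migrate_sketch(store, lambda x: x == "ckpt") == ["ckpt", "ckpt", "ckpt"]
assert store == ["doc", "doc"]
```

Advancing the offset naively after a delete would silently skip as many drawers as were just moved; the `continue` avoids that without any extra bookkeeping.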

scripts/deploy.sh

Lines changed: 5 additions & 3 deletions
@@ -78,13 +78,15 @@ step "5/5 verify new code is loaded"
 # venv (proves Syncthing + restart picked up new code, not just a stale
 # import). Update this list as new public surface lands.
 ssh "$HOST" "~/.local/share/palace-daemon/venv/bin/python -c '
-from mempalace.backends.chroma import ChromaBackend
+from mempalace.backends.chroma import ChromaBackend, _segment_appears_healthy
 from mempalace.palace import _SESSION_RECOVERY_COLLECTION, get_session_recovery_collection
 from mempalace.mcp_server import tool_session_recovery_read
-assert hasattr(ChromaBackend, \"_quarantined_paths\"), \"HNSW gate fix not loaded\"
+from mempalace.migrate import migrate_checkpoints_to_recovery
+from mempalace.miner import add_drawers, _build_drawer
+assert hasattr(ChromaBackend, \"_quarantined_paths\"), \"HNSW cold-start gate not loaded\"
 assert _SESSION_RECOVERY_COLLECTION == \"mempalace_session_recovery\"
 print(\"OK\")
 '" >/dev/null 2>&1 || fail "post-restart import check failed (see ssh log)"
-ok "post-restart imports include today's fork-ahead surface"
+ok "post-restart imports include today's fork-ahead surface (gate, integrity, recovery, migrate, batch)"
 
 printf '\n\033[1;32m✦ mempalace deploy complete: %s on %s\033[0m\n' "$local_sha" "$URL"
