feat(lifespan): auto-migrate Stop-hook checkpoints to recovery collection on startup#18
feat(lifespan): auto-migrate Stop-hook checkpoints to recovery collection on startup#18jphein wants to merge 2 commits intorboarescu:mainfrom
Conversation
…tion (Phase E) Calls mempalace.migrate.migrate_checkpoints_to_recovery() during the daemon's lifespan startup, after the HNSW quarantine pass and before the ChromaDB warmup ping. Idempotent — re-runs return 0 once the canonical palace has reorganized; the only one-time cost is the first restart after deploy. Closes Phase E of the checkpoint-collection-split spec (docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md on jphein/mempalace). Phases A–D shipped on the mempalace fork on 2026-04-25/26 — the migrate function exists and is wired into `mempalace repair --mode reorganize` for explicit operator runs. This commit makes the daemon do it automatically on first startup post-upgrade, so operators don't have to know about the manual command. Gated behind PALACE_AUTO_MIGRATE_CHECKPOINTS=1 (default). Set =0 to skip auto-migrate in environments where the one-time cost is unwanted; the manual `mempalace repair --mode reorganize` path remains. Defensive: ImportError if the mempalace fork-side function isn't present (upstream-shaped installs); other exceptions logged as non-fatal warnings so the daemon still starts. After this lands and the canonical 151K-drawer palace migrates, the predicted Cat 9 A/B convergence (kind=all ≈ kind=content tokens-per-question) becomes testable — that's the empirical proof the structural fix worked. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
Pull request overview
Adds an optional startup-time maintenance step to automatically migrate pre-3.3.4 Stop-hook checkpoint drawers into the new mempalace_session_recovery collection, reducing search/top‑N pollution after upgrades.
Changes:
- Runs
mempalace.migrate.migrate_checkpoints_to_recovery()duringlifespan()startup (after HNSW quarantine, before Chroma warmup). - Gates the migration behind
PALACE_AUTO_MIGRATE_CHECKPOINTS(default enabled) and runs it off the asyncio loop viarun_in_executor. - Logs a non-fatal outcome (skip on missing import; warn on other exceptions; info when migration moved rows).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| from mempalace.migrate import migrate_checkpoints_to_recovery | ||
|
|
||
| loop = asyncio.get_running_loop() | ||
| moved_checkpoints = await loop.run_in_executor( | ||
| None, migrate_checkpoints_to_recovery, _mp._config.palace_path | ||
| ) | ||
| if moved_checkpoints: | ||
| logger.info( | ||
| "Migrated %d checkpoint drawer(s) from main → mempalace_session_recovery; " | ||
| "mempalace_search now queries content-only.", | ||
| moved_checkpoints, | ||
| ) | ||
| except ImportError: | ||
| # mempalace.migrate.migrate_checkpoints_to_recovery is fork-side | ||
| # only at the moment; on upstream-shaped installs without it the | ||
| # daemon should still start cleanly. | ||
| logger.debug("migrate_checkpoints_to_recovery not available; skipping auto-migrate.") |
There was a problem hiding this comment.
Already addressed — the current code at main.py:386-401 distinguishes getattr(e, "name", None) == "mempalace.migrate" (the "feature genuinely not available on this release line") from any other ImportError (which now logs at warning level so a transitive dep break inside mempalace.migrate is operator-visible, not silent). Sorry I didn't close the comment loop earlier.
| # once the canonical palace has reorganized. Gated behind | ||
| # PALACE_AUTO_MIGRATE_CHECKPOINTS so operators can disable in | ||
| # environments where the one-time migration cost is unwanted. | ||
| # See mempalace docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md. |
There was a problem hiding this comment.
Resolved — the mempalace docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md reference is no longer in the daemon code; grep on current main.py returns nothing for that path. Removed during a later revision.
…nding queue; trim patch Filed four upstream PRs on 2026-04-30: - rboarescu#15 feat: GET /viz status dashboard (stacks on rboarescu#13) - rboarescu#16 feat: GET /list — query-free metadata browse - rboarescu#17 feat: DELETE /memory + PATCH /memory - rboarescu#18 feat(lifespan): auto-migrate Stop-hook checkpoints on startup Also rebased PR rboarescu#13 onto upstream/main to clear a CHANGELOG conflict left by upstream's b4aee82 (patch sync) — state went CONFLICTING -> MERGEABLE / CLEAN. README: - Open upstream PRs table: four new rows (rboarescu#15-rboarescu#18) plus a 2026-04-30 note covering today's rebase + new PRs in one breath. - Pending PRs queue: now empty. Replaced the four stale rows (event-log-frame and graph-endpoint were already in flight via rboarescu#11 and rboarescu#13; mempal-fast.py was already upstream via the merged PR #4 omnibus; /viz is now PR rboarescu#15) with a brief empty-state note. CLAUDE.md: - Patch description trimmed to reflect that the hnsw:num_threads enforcement landed upstream via _pin_hnsw_threads(); only the log + retry-once slice remains. patches/mcp_server_get_collection.patch: - Regenerated against current mempalace develop. The patch is now just the "log exception + retry once on cache failure" change. Filed upstream as MemPalace/mempalace#1286; once that merges this patch retires. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
README: - Intro paragraph: appended GET /list (PR rboarescu#16), DELETE/PATCH /memory (PR rboarescu#17), and lifespan auto-migrate (PR rboarescu#18) to "what this fork adds" since they're already on fork main even though the PRs are awaiting review upstream. - Links bar: added docs/typescript-port-plan.md. - Added a "Cross-repo coordination" subsection under "Recently landed in upstream" — calls out the local patches/ directory and the in-flight MemPalace/mempalace#1286 that retires it. Also references #1142 (RELEASING.md) for completeness. - Requirements: dropped the stale "kind= searcher filter" justification for recommending the fork (kind= was retired in 1.7.1); replaced with daemon-strict hook mode + warnings/sqlite-fallback search path as the actual current reasons. Added the patches/ re-apply step. - API table: added /list, DELETE /memory/{id}, PATCH /memory/{id} rows so the table reflects what the fork main actually exposes. - Sources: cross-repo PR note for #1286. CHANGELOG: - Added [Unreleased] section: patch trim (Maintenance), TS port plan doc, hook-routing-fix.md status header, and a Docs entry covering today's README updates + PR rboarescu#13 rebase context. docs/hook-routing-fix.md: - Added a Status: SHIPPED header pointing at 62425e3 (2026-04-24 clients/hook.py introduction) and clarifying mempal-fast.py is the simpler successor — same shape as docs/graph-endpoint.md's status header. No daemon behaviour changes.
Two issues raised: 1. `except ImportError` was overswallowing — catching transitive dependency failures inside `mempalace.migrate` (e.g., a missing sub-import) and treating them as 'feature-not-available'. Distinguish via `e.name == 'mempalace.migrate'` so genuine missing-feature cases stay at debug level while unexpected import failures from inside the module surface at warning level. 2. Comment referenced `mempalace docs/superpowers/specs/...` which doesn't exist in this repository (it lives in the jphein/mempalace fork's superpowers tree). Replaced with a reference to the public operator-driven equivalent (`mempalace repair --mode reorganize`), which both points readers somewhere they can actually find and provides a recovery path if the auto-migrate is disabled.
…boarescu#18 + #1286 to main Same shape as 152e428 — production at disks runs fork main, so review amendments to the in-flight upstream PRs need to ship here too. ## main.py — /list docstring (from PR rboarescu#16) The previous docstring claimed both "sorted by recency" and "ordered by mempalace's natural metadata-table order", which contradicted itself. Replaced with an accurate description: ordering is whatever mempalace_list_drawers returns (sqlite metadata-table order, which approximates insertion order but isn't strictly chronological). Pagination contract called out explicitly. ## main.py — PATCH /memory body validation (from PR rboarescu#17) - await request.json() now wraps in try/except for JSONDecodeError → 400 with a clear message (was bubbling as 500). - Validate parsed body is a dict before reading keys. - Reject empty patches (no content/wing/room) with 400 listing the valid keys, instead of forwarding {drawer_id: ...} alone to mempalace_update_drawer where it's an ambiguous no-op. DELETE handler unchanged — no body parsing, nothing to harden. ## main.py — Lifespan auto-migrate ImportError narrowing (from PR rboarescu#18) Previous `except ImportError:` swallowed transitive dep failures inside mempalace.migrate (a real bug worth surfacing) as if they were "feature unavailable" (expected). Distinguish via `getattr(e, "name", None) == "mempalace.migrate"` — module genuinely absent → debug log, anything else → warning log so it's visible. Comment that pointed at `mempalace docs/superpowers/specs/2026-04-25- checkpoint-collection-split.md` (a path that doesn't exist in this repo, lives only in the jphein/mempalace fork's superpowers tree) replaced with a public-facing reference to `mempalace repair --mode reorganize` — the operator-driven equivalent that's actually discoverable. ## patches/mcp_server_get_collection.patch (from MemPalace/mempalace #1286) Regenerated against current upstream develop with the Copilot review fix folded in: logger.error(..., str(e)) → logger.exception(...). Now the traceback lands in the daemon log alongside the message instead of being dropped. Patch behaviour is otherwise identical. The upstream PR also gained TestGetCollectionRetry — three monkeypatch tests that pin the retry semantics. Those tests live in mempalace, not here; they'll arrive on this side automatically once #1286 merges and the patch retires. All four fixes are amendments on the corresponding upstream PR branches (force-pushed earlier today). No behaviour change for the healthy path; the changes harden failure modes that were caught in review. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Pushed
Thanks for the careful read. |
Calls
mempalace.migrate.migrate_checkpoints_to_recovery()during the daemon's lifespan startup — after the HNSW quarantine pass and before the ChromaDB warmup. Idempotent: re-runs return 0 once a palace has been reorganised, so the only one-time cost is the first restart after upgrading.Why
The mempalace 3.3.4 line ships a dedicated
mempalace_session_recoverycollection so Stop-hook auto-save checkpoints don't dominatemempalace_search's vector top-N. The migration tool to move pre-3.3.4 checkpoints into the new collection lives atmempalace.migrate.migrate_checkpoints_to_recoveryand is also surfaced throughmempalace repair --mode reorganizefor explicit operator runs. This change just calls it automatically on daemon startup so operators don't have to remember the manual command after upgrade.Design notes
PALACE_AUTO_MIGRATE_CHECKPOINTS=1(default on). Set=0to skip — the manualmempalace repair --mode reorganizepath stays available.migrate_checkpoints_to_recoveryreturns 0 once there's nothing left to move. After a successful migration, every subsequent restart is a quick no-op.mempalace.migrate.migrate_checkpoints_to_recoveryisn't importable (older release line, alternate distribution), the daemon logs a debug message and starts cleanly. Any other exception is logged as a non-fatal warning rather than blocking startup.loop.run_in_executor(None, ...)— the migration is straight-line SQLite work and shouldn't share the event loop with traffic.What's in the diff
main.py— 29 lines added insidelifespan(), immediately after the existing quarantine pass. No other files touched, no new dependencies, no version bump.Where it ran
In production on the canonical 151K-drawer palace, first restart on 2026-04-26. Migrated ~6.7K checkpoint drawers in ~12s on cold cache and has been a no-op on every restart since. No traffic impact observable in the daemon log because the work happens in lifespan startup before the server binds.
Happy to gate this differently (e.g. opt-in by default, log-only on first run) if you'd prefer a more conservative shape. Thanks for taking a look!