Skip to content

feat(lifespan): auto-migrate Stop-hook checkpoints to recovery collection on startup#18

Open
jphein wants to merge 2 commits intorboarescu:mainfrom
jphein:feat/lifespan-checkpoint-migrate
Open

feat(lifespan): auto-migrate Stop-hook checkpoints to recovery collection on startup#18
jphein wants to merge 2 commits intorboarescu:mainfrom
jphein:feat/lifespan-checkpoint-migrate

Conversation

@jphein
Copy link
Copy Markdown
Contributor

@jphein jphein commented Apr 30, 2026

Calls mempalace.migrate.migrate_checkpoints_to_recovery() during the daemon's lifespan startup — after the HNSW quarantine pass and before the ChromaDB warmup. Idempotent: re-runs return 0 once a palace has been reorganised, so the only one-time cost is the first restart after upgrading.

Why

The mempalace 3.3.4 line ships a dedicated mempalace_session_recovery collection so Stop-hook auto-save checkpoints don't dominate mempalace_search's vector top-N. The migration tool to move pre-3.3.4 checkpoints into the new collection lives at mempalace.migrate.migrate_checkpoints_to_recovery and is also surfaced through mempalace repair --mode reorganize for explicit operator runs. This change just calls it automatically on daemon startup so operators don't have to remember the manual command after upgrade.

Design notes

  • Gated behind PALACE_AUTO_MIGRATE_CHECKPOINTS=1 (default on). Set =0 to skip — the manual mempalace repair --mode reorganize path stays available.
  • Idempotentmigrate_checkpoints_to_recovery returns 0 once there's nothing left to move. After a successful migration, every subsequent restart is a quick no-op.
  • Defensive against older mempalace: if mempalace.migrate.migrate_checkpoints_to_recovery isn't importable (older release line, alternate distribution), the daemon logs a debug message and starts cleanly. Any other exception is logged as a non-fatal warning rather than blocking startup.
  • Off the asyncio loop via loop.run_in_executor(None, ...) — the migration is straight-line SQLite work and shouldn't share the event loop with traffic.

What's in the diff

main.py — 29 lines added inside lifespan(), immediately after the existing quarantine pass. No other files touched, no new dependencies, no version bump.

Where it ran

In production on the canonical 151K-drawer palace, first restart on 2026-04-26. Migrated ~6.7K checkpoint drawers in ~12s on cold cache and has been a no-op on every restart since. No traffic impact observable in the daemon log because the work happens in lifespan startup before the server binds.

Happy to gate this differently (e.g. opt-in by default, log-only on first run) if you'd prefer a more conservative shape. Thanks for taking a look!

…tion (Phase E)

Calls mempalace.migrate.migrate_checkpoints_to_recovery() during the
daemon's lifespan startup, after the HNSW quarantine pass and before
the ChromaDB warmup ping. Idempotent — re-runs return 0 once the
canonical palace has reorganized; the only one-time cost is the first
restart after deploy.

Closes Phase E of the checkpoint-collection-split spec
(docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md
on jphein/mempalace). Phases A–D shipped on the mempalace fork on
2026-04-25/26 — the migrate function exists and is wired into
`mempalace repair --mode reorganize` for explicit operator runs.
This commit makes the daemon do it automatically on first startup
post-upgrade, so operators don't have to know about the manual command.

Gated behind PALACE_AUTO_MIGRATE_CHECKPOINTS=1 (default). Set =0
to skip auto-migrate in environments where the one-time cost is
unwanted; the manual `mempalace repair --mode reorganize` path
remains.

Defensive: ImportError if the mempalace fork-side function isn't
present (upstream-shaped installs); other exceptions logged as
non-fatal warnings so the daemon still starts.

After this lands and the canonical 151K-drawer palace migrates,
the predicted Cat 9 A/B convergence (kind=all ≈ kind=content
tokens-per-question) becomes testable — that's the empirical proof
the structural fix worked.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional startup-time maintenance step to automatically migrate pre-3.3.4 Stop-hook checkpoint drawers into the new mempalace_session_recovery collection, reducing search/top‑N pollution after upgrades.

Changes:

  • Runs mempalace.migrate.migrate_checkpoints_to_recovery() during lifespan() startup (after HNSW quarantine, before Chroma warmup).
  • Gates the migration behind PALACE_AUTO_MIGRATE_CHECKPOINTS (default enabled) and runs it off the asyncio loop via run_in_executor.
  • Logs a non-fatal outcome (skip on missing import; warn on other exceptions; info when migration moved rows).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread main.py Outdated
Comment on lines +371 to +387
from mempalace.migrate import migrate_checkpoints_to_recovery

loop = asyncio.get_running_loop()
moved_checkpoints = await loop.run_in_executor(
None, migrate_checkpoints_to_recovery, _mp._config.palace_path
)
if moved_checkpoints:
logger.info(
"Migrated %d checkpoint drawer(s) from main → mempalace_session_recovery; "
"mempalace_search now queries content-only.",
moved_checkpoints,
)
except ImportError:
# mempalace.migrate.migrate_checkpoints_to_recovery is fork-side
# only at the moment; on upstream-shaped installs without it the
# daemon should still start cleanly.
logger.debug("migrate_checkpoints_to_recovery not available; skipping auto-migrate.")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed — the current code at main.py:386-401 distinguishes getattr(e, "name", None) == "mempalace.migrate" (the "feature genuinely not available on this release line") from any other ImportError (which now logs at warning level so a transitive dep break inside mempalace.migrate is operator-visible, not silent). Sorry I didn't close the comment loop earlier.

Comment thread main.py Outdated
# once the canonical palace has reorganized. Gated behind
# PALACE_AUTO_MIGRATE_CHECKPOINTS so operators can disable in
# environments where the one-time migration cost is unwanted.
# See mempalace docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved — the mempalace docs/superpowers/specs/2026-04-25-checkpoint-collection-split.md reference is no longer in the daemon code; grep on current main.py returns nothing for that path. Removed during a later revision.

jphein added a commit to jphein/palace-daemon that referenced this pull request Apr 30, 2026
…nding queue; trim patch

Filed four upstream PRs on 2026-04-30:
- rboarescu#15  feat: GET /viz status dashboard (stacks on rboarescu#13)
- rboarescu#16  feat: GET /list — query-free metadata browse
- rboarescu#17  feat: DELETE /memory + PATCH /memory
- rboarescu#18  feat(lifespan): auto-migrate Stop-hook checkpoints on startup

Also rebased PR rboarescu#13 onto upstream/main to clear a CHANGELOG conflict
left by upstream's b4aee82 (patch sync) — state went CONFLICTING ->
MERGEABLE / CLEAN.

README:
- Open upstream PRs table: four new rows (rboarescu#15-rboarescu#18) plus a 2026-04-30
  note covering today's rebase + new PRs in one breath.
- Pending PRs queue: now empty. Replaced the four stale rows
  (event-log-frame and graph-endpoint were already in flight via
  rboarescu#11 and rboarescu#13; mempal-fast.py was already upstream via the merged
  PR #4 omnibus; /viz is now PR rboarescu#15) with a brief empty-state note.

CLAUDE.md:
- Patch description trimmed to reflect that the hnsw:num_threads
  enforcement landed upstream via _pin_hnsw_threads(); only the
  log + retry-once slice remains.

patches/mcp_server_get_collection.patch:
- Regenerated against current mempalace develop. The patch is now
  just the "log exception + retry once on cache failure" change.
  Filed upstream as MemPalace/mempalace#1286; once that merges this
  patch retires.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/palace-daemon that referenced this pull request Apr 30, 2026
README:
- Intro paragraph: appended GET /list (PR rboarescu#16), DELETE/PATCH /memory
  (PR rboarescu#17), and lifespan auto-migrate (PR rboarescu#18) to "what this fork
  adds" since they're already on fork main even though the PRs are
  awaiting review upstream.
- Links bar: added docs/typescript-port-plan.md.
- Added a "Cross-repo coordination" subsection under "Recently landed
  in upstream" — calls out the local patches/ directory and the
  in-flight MemPalace/mempalace#1286 that retires it. Also references
  #1142 (RELEASING.md) for completeness.
- Requirements: dropped the stale "kind= searcher filter" justification
  for recommending the fork (kind= was retired in 1.7.1); replaced
  with daemon-strict hook mode + warnings/sqlite-fallback search path
  as the actual current reasons. Added the patches/ re-apply step.
- API table: added /list, DELETE /memory/{id}, PATCH /memory/{id}
  rows so the table reflects what the fork main actually exposes.
- Sources: cross-repo PR note for #1286.

CHANGELOG:
- Added [Unreleased] section: patch trim (Maintenance), TS port plan
  doc, hook-routing-fix.md status header, and a Docs entry covering
  today's README updates + PR rboarescu#13 rebase context.

docs/hook-routing-fix.md:
- Added a Status: SHIPPED header pointing at 62425e3 (2026-04-24
  clients/hook.py introduction) and clarifying mempal-fast.py is the
  simpler successor — same shape as docs/graph-endpoint.md's status
  header.

No daemon behaviour changes.
Two issues raised:

1. `except ImportError` was overswallowing — catching transitive
   dependency failures inside `mempalace.migrate` (e.g., a missing
   sub-import) and treating them as 'feature-not-available'.
   Distinguish via `e.name == 'mempalace.migrate'` so genuine
   missing-feature cases stay at debug level while unexpected
   import failures from inside the module surface at warning level.

2. Comment referenced `mempalace docs/superpowers/specs/...` which
   doesn't exist in this repository (it lives in the jphein/mempalace
   fork's superpowers tree). Replaced with a reference to the public
   operator-driven equivalent (`mempalace repair --mode reorganize`),
   which both points readers somewhere they can actually find and
   provides a recovery path if the auto-migrate is disabled.
jphein added a commit to jphein/palace-daemon that referenced this pull request Apr 30, 2026
…boarescu#18 + #1286 to main

Same shape as 152e428 — production at disks runs fork main, so review
amendments to the in-flight upstream PRs need to ship here too.

## main.py — /list docstring (from PR rboarescu#16)

The previous docstring claimed both "sorted by recency" and "ordered
by mempalace's natural metadata-table order", which contradicted
itself. Replaced with an accurate description: ordering is whatever
mempalace_list_drawers returns (sqlite metadata-table order, which
approximates insertion order but isn't strictly chronological).
Pagination contract called out explicitly.

## main.py — PATCH /memory body validation (from PR rboarescu#17)

- await request.json() now wraps in try/except for JSONDecodeError →
  400 with a clear message (was bubbling as 500).
- Validate parsed body is a dict before reading keys.
- Reject empty patches (no content/wing/room) with 400 listing the
  valid keys, instead of forwarding {drawer_id: ...} alone to
  mempalace_update_drawer where it's an ambiguous no-op.

DELETE handler unchanged — no body parsing, nothing to harden.

## main.py — Lifespan auto-migrate ImportError narrowing (from PR rboarescu#18)

Previous `except ImportError:` swallowed transitive dep failures
inside mempalace.migrate (a real bug worth surfacing) as if they
were "feature unavailable" (expected). Distinguish via
`getattr(e, "name", None) == "mempalace.migrate"` — module genuinely
absent → debug log, anything else → warning log so it's visible.

Comment that pointed at `mempalace docs/superpowers/specs/2026-04-25-
checkpoint-collection-split.md` (a path that doesn't exist in this
repo, lives only in the jphein/mempalace fork's superpowers tree)
replaced with a public-facing reference to `mempalace repair --mode
reorganize` — the operator-driven equivalent that's actually
discoverable.

## patches/mcp_server_get_collection.patch (from MemPalace/mempalace #1286)

Regenerated against current upstream develop with the Copilot review
fix folded in: logger.error(..., str(e)) → logger.exception(...).
Now the traceback lands in the daemon log alongside the message
instead of being dropped. Patch behaviour is otherwise identical.

The upstream PR also gained TestGetCollectionRetry — three monkeypatch
tests that pin the retry semantics. Those tests live in mempalace, not
here; they'll arrive on this side automatically once #1286 merges and
the patch retires.

All four fixes are amendments on the corresponding upstream PR
branches (force-pushed earlier today). No behaviour change for the
healthy path; the changes harden failure modes that were caught in
review.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jphein
Copy link
Copy Markdown
Contributor Author

jphein commented Apr 30, 2026

Pushed 8f2a43f to address both Copilot notes:

  • except ImportError was overswallowing — narrowed via getattr(e, "name", None) == "mempalace.migrate" so genuine "feature unavailable on this release" cases stay at debug level, while unexpected import failures from inside the module surface at warning level. Real bugs are now visible instead of silently treated as feature-gating.
  • The comment that pointed at mempalace docs/superpowers/specs/... (a path that lives only in the jphein/mempalace fork's superpowers tree, not in this repo) now references the public operator-driven equivalent — mempalace repair --mode reorganize — which both points readers somewhere they can actually find and provides a recovery path if auto-migrate is disabled.

Thanks for the careful read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants