Skip to content

feat: /repair + /silent-save endpoints, cold-start warmup, minor fixes#4

Closed
jphein wants to merge 15 commits intorboarescu:mainfrom
jphein:main
Closed

feat: /repair + /silent-save endpoints, cold-start warmup, minor fixes#4
jphein wants to merge 15 commits intorboarescu:mainfrom
jphein:main

Conversation

@jphein
Copy link
Copy Markdown
Contributor

@jphein jphein commented Apr 25, 2026

Six changes from running palace-daemon. Happy to split into separate PRs if you'd prefer.

fix: warm ChromaDB client on startup, remove broken signal handlers (2012f79)

ChromaDB's Rust HNSW binding can SIGSEGV on the first request when started cold. Pre-warming the client in the FastAPI lifespan (before yield) avoids this.

Also drops the custom asyncio signal handlers that called sys.exit() from inside an asyncio callback — this produced a CancelledError/SystemExit traceback on stop and skipped the lifespan shutdown flush. Uvicorn's built-in signal handling already does the right thing.

feat: add /repair, /silent-save endpoints and themed messages (d2e867c)

  • POST /repair with four modes: light, scan, prune, rebuild. Acquires _exclusive_palace() (all read/write/mine semaphores) for rebuild.
  • POST /silent-save with double-checked locking around the repair flag. While a /repair mode=rebuild is in progress, saves are queued to a JSONL file at <palace_parent>/palace-daemon-pending.jsonl and drained automatically after the repair completes.
  • messages.py with themed user-facing strings returned in systemMessage.

feat: add --palace flag to purge_wings.py (34fee93)

Script hardcoded ~/.mempalace/palace. --palace <path> lets it run against non-default deployments. Default unchanged.

fix(mcp-client): bump HTTP timeout 30s→120s, env-tunable (352b25c)

The 30s default in clients/mempalace-mcp.py is too short for cold-start mempalace_status calls on large palaces (HNSW rebuild on a few hundred thousand drawers can take several minutes). Default raised to 120s, overridable via PALACE_MCP_TIMEOUT.

feat(clients): palace-mode CLI + fast-path hook + MCP dispatcher (f8e0faa)

Three deployment helpers in clients/:

  • palace-mode: shell CLI that flips between local (in-process) and remote (daemon) palace modes by toggling PALACE_DAEMON_URL in user settings. Subcommands: status, local, remote [URL], install (idempotent re-apply after plugin updates), verify.
  • palace-mcp-dispatch.sh: invoked by the plugin's MCP server command. Dispatches to mempalace-mcp.py --daemon $URL when PALACE_DAEMON_URL is set, otherwise falls back to in-process python -m mempalace.mcp_server.
  • mempal-fast.py: stdlib-only Stop/PreCompact hook handler. POSTs to /silent-save. Imports no mempalace, loads no chromadb — so cold hook fires can't trigger the same SIGSEGV the warmup commit fixes for the daemon itself.

Conceptually separable from the daemon changes — happy to split this commit into a follow-up PR if you'd rather review the daemon work first. (Bundled here because it sits in clients/ alongside the existing mempalace-mcp.py.)

fix: address Copilot review on PR #4 + tighten queue gating (bd68632)

Pulled from Copilot's review of an earlier push:

  • /silent-save queue gating now triggers on mode == "rebuild" only — light/scan/prune don't replace the collection, so saves can proceed normally during those.
  • _enqueue_pending_write is now async + uses asyncio.to_thread so the disk append doesn't block the event loop under bursty enqueue load.
  • _drain_pending_writes holds _write_sem per entry (matching the documented contract of _do_silent_save_write); failed entries are quarantined to a timestamped sibling so the next drain doesn't replay successful saves.
  • body["message_count"] is parsed via try/except, returns 400 on non-integer input.
  • asyncio.get_event_loop()asyncio.get_running_loop() (deprecated in 3.12+ inside async functions).
  • PALACE_MCP_TIMEOUT parse hardened in the proxy client — bad env value warns and falls back to 120s instead of crashing.

jphein and others added 4 commits April 24, 2026 16:41
Adds repair coordination and queue-safe Stop-hook save path.

/repair (light/scan/prune/rebuild): coordinates repairs with live traffic.
rebuild holds all semaphore slots via _exclusive_palace() to prevent
delete_collection/create_collection from racing concurrent writes —
those backend ops sit outside the ChromaCollection flock.

/silent-save: HTTP path for Claude Code Stop-hook checkpoints. Writes
normally when the palace is healthy; queues to palace-daemon-pending.jsonl
during a rebuild window and drains automatically once the swap completes.
Double-checked locking (pre- and post-semaphore) ensures no write slips
through between the flag check and semaphore acquisition.

/repair/status: current repair state + pending queue depth.

messages.py: single source for all themed user-facing strings (✦ memory
ops, ◈ palace ops). Hook opt-in via PALACE_DAEMON_URL env var; missing
or unreachable daemon falls through to direct write with no behavior change.

Lock path now port-scoped (/tmp/palace-daemon-<port>.lock) to allow
running a test instance alongside production.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
ChromaDB's Rust HNSW binding occasionally segfaults on the very first
request when opened cold. Pre-opening the client during lifespan startup
(before yield) ensures PersistentClient is fully initialized before any
traffic arrives. Warmup failure is non-fatal — daemon still starts.

Remove the custom SIGINT/SIGTERM handlers that called sys.exit() from
inside an asyncio callback. That pattern interrupted the event loop
mid-coroutine, producing CancelledError/SystemExit tracebacks in the
logs and skipping lifespan shutdown (the flush). Uvicorn already
installs graceful shutdown handlers — no override needed.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Allows running the offline purge on deployments where the palace is not
at ~/.mempalace/palace (e.g., /mnt/raid/projects/mempalace-data/palace
on the disks server).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Default 30s urlopen timeout was too short for read tools that walk a
multi-GB palace (e.g. `mempalace_status`, `mempalace_list_wings`). On a
2.2GB palace these can take >30s under read-semaphore contention even
when the daemon eventually returns 200 OK — observed 2026-04-24 with
MCP timeouts client-side while daemon log showed POST /mcp 200 OK.

New default 120s gives headroom for slow paths and short waits behind
in-flight writes. `PALACE_MCP_TIMEOUT` env override for tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copilot AI review requested due to automatic review settings April 25, 2026 13:36
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR bundles several production-driven improvements to palace-daemon: cold-start stabilization, new repair/save-coordination endpoints with themed user messages, a configurable palace path for the purge script, and a longer/tunable MCP HTTP timeout.

Changes:

  • Warm ChromaDB client during FastAPI lifespan startup and remove custom signal handlers.
  • Add /repair, /repair/status, and /silent-save endpoints plus centralized themed messages.
  • Add --palace flag to scripts/purge_wings.py and make MCP proxy timeout env-tunable (default 120s).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
main.py Implements warmup, repair state, pending-writes queue/drain, and new HTTP endpoints.
messages.py Adds centralized themed system messages for saves/repair lifecycle.
clients/mempalace-mcp.py Bumps HTTP timeout and allows override via PALACE_MCP_TIMEOUT.
scripts/purge_wings.py Adds --palace argument to target non-default palace locations.
README.md Documents the new endpoints and hook wiring flow.
CHANGELOG.md Adds release notes for v1.5.0 features and fixes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread main.py Outdated
Comment on lines +161 to +167
def _enqueue_pending_write(payload: dict) -> None:
"""Append a silent-save payload to the pending-writes queue."""
path = _pending_writes_path()
line = json.dumps({"payload": payload, "enqueued_at": datetime.now().isoformat()})
with open(path, "a", encoding="utf-8") as f:
f.write(line + "\n")

Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_enqueue_pending_write() performs synchronous disk I/O directly on the event loop thread. Under load (many concurrent /silent-save calls) this can stall all request handling. Consider moving the append into run_in_executor (or using an async file writer / background task + queue) so /silent-save stays non-blocking even during heavy enqueue bursts.

Copilot uses AI. Check for mistakes.
Comment thread main.py
Comment on lines +187 to +197
for line in lines:
try:
entry = json.loads(line)
result = await _do_silent_save_write(entry["payload"])
if result.get("success"):
count += 1
else:
_log.warning("drain: replay failed: %s", result.get("error"))
except Exception:
_log.exception("drain: entry replay raised")
os.remove(proc_path)
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_drain_pending_writes() calls _do_silent_save_write() without holding _write_sem, even though _do_silent_save_write() documents that the caller must hold it. This can bypass the daemon’s concurrency controls and race with other writers. Also, the code deletes the .processing file unconditionally at the end, which drops any entries that failed to replay (or had JSON errors), undermining the “no lost saves” guarantee. Please drain under _write_sem (per entry or for the whole drain) and persist/requeue any failed entries instead of deleting them.

Suggested change
for line in lines:
try:
entry = json.loads(line)
result = await _do_silent_save_write(entry["payload"])
if result.get("success"):
count += 1
else:
_log.warning("drain: replay failed: %s", result.get("error"))
except Exception:
_log.exception("drain: entry replay raised")
os.remove(proc_path)
failed_lines: list[str] = []
for line in lines:
try:
entry = json.loads(line)
payload = entry["payload"]
except Exception:
_log.exception("drain: invalid entry; preserving for retry")
failed_lines.append(line)
continue
try:
async with _write_sem:
result = await _do_silent_save_write(payload)
if result.get("success"):
count += 1
else:
_log.warning("drain: replay failed: %s", result.get("error"))
failed_lines.append(line)
except Exception:
_log.exception("drain: entry replay raised; preserving for retry")
failed_lines.append(line)
if not failed_lines:
os.remove(proc_path)
return count
with open(proc_path, "w", encoding="utf-8") as f:
f.writelines(failed_lines)
try:
with open(path, "a", encoding="utf-8") as f:
f.writelines(failed_lines)
os.remove(proc_path)
except Exception:
_log.exception("drain: failed to requeue; leaving %s in place", proc_path)

Copilot uses AI. Check for mistakes.
Comment thread main.py Outdated
# occasionally segfaults on the very first request if opened cold; opening
# it here (before yield) ensures the PersistentClient is fully initialized.
try:
loop = asyncio.get_event_loop()
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside an async lifespan function, prefer asyncio.get_running_loop() over asyncio.get_event_loop() (which is deprecated in newer Python versions and can behave differently depending on policy). Switching to get_running_loop() here avoids deprecation warnings and makes the intent explicit.

Suggested change
loop = asyncio.get_event_loop()
loop = asyncio.get_running_loop()

Copilot uses AI. Check for mistakes.
Comment thread main.py Outdated
raise HTTPException(status_code=400, detail="'entry' is required")

themes = body.get("themes") or []
msg_count = int(body.get("message_count") or 1)
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

msg_count = int(body.get("message_count") or 1) can raise ValueError if a client sends a non-integer (e.g., "15" is fine, but "15 msgs" or null-like values can break). Please validate/parse message_count with a try/except and return a 400 on invalid input instead of throwing a 500.

Suggested change
msg_count = int(body.get("message_count") or 1)
raw_msg_count = body.get("message_count")
if raw_msg_count:
try:
msg_count = int(raw_msg_count)
except (TypeError, ValueError):
raise HTTPException(status_code=400, detail="invalid 'message_count'")
else:
msg_count = 1

Copilot uses AI. Check for mistakes.
Comment thread clients/mempalace-mcp.py
Comment on lines +55 to +59
# 120s headroom for slow read tools (e.g. mempalace_status walking a
# multi-GB palace) plus short waits behind in-flight writes on the
# daemon's read semaphore. Override with PALACE_MCP_TIMEOUT for tuning.
timeout = int(os.getenv("PALACE_MCP_TIMEOUT", "120"))
with urllib.request.urlopen(req, timeout=timeout) as resp:
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout = int(os.getenv("PALACE_MCP_TIMEOUT", "120")) will crash the proxy if the env var is set to a non-integer value. Please catch ValueError (and maybe enforce a sane min/max) so a misconfigured environment yields a clear error message or falls back to the default instead of exiting unexpectedly.

Copilot uses AI. Check for mistakes.
Comment thread main.py Outdated
Comment on lines +542 to +545
# Fast-path: repair already in progress, queue immediately.
if _repair_state["in_progress"]:
_enqueue_pending_write(body)
return {
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/silent-save queues whenever _repair_state["in_progress"] is true, but /repair sets in_progress for all modes (light/scan/prune/rebuild). That means silent saves will be deferred even during non-destructive/readonly repairs, which doesn’t match the PR description (“during /repair mode=rebuild”). Consider only enabling queuing for mode == "rebuild" (e.g., track a separate queue_writes flag or check _repair_state["mode"]).

Copilot uses AI. Check for mistakes.
Comment thread main.py
Comment on lines +154 to +158
def _pending_writes_path() -> str:
"""Location of the jsonl queue that holds silent-saves during rebuild."""
palace_path = _mp._config.palace_path
parent = os.path.dirname(palace_path.rstrip("/")) or "/tmp"
return os.path.join(parent, "palace-daemon-pending.jsonl")
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions the pending-writes queue living under ~/.cache/palace-daemon/..., but _pending_writes_path() currently writes to <palace_parent>/palace-daemon-pending.jsonl. Please either update the PR description (and any external docs) or adjust the implementation so the queue location matches the stated behavior.

Copilot uses AI. Check for mistakes.
@jphein jphein changed the title Cold-start warmup, /repair + /silent-save endpoints, themed messages, --palace flag, HTTP timeout bump feat: /repair + /silent-save endpoints, cold-start warmup, minor fixes Apr 25, 2026
jphein and others added 8 commits April 25, 2026 08:05
Three deployment helpers that sit alongside mempalace-mcp.py:

- palace-mode: shell CLI to switch between local (in-process) and remote
  (daemon) palace modes by toggling PALACE_DAEMON_URL in
  ~/.claude/settings.local.json. Subcommands: status, local, remote [URL],
  install (idempotent re-apply of cache+workspace customizations after a
  plugin update), verify.

- palace-mcp-dispatch.sh: thin shell wrapper invoked by the MCP server
  command in plugin configs. Dispatches to mempalace-mcp.py --daemon $URL
  when PALACE_DAEMON_URL is set, otherwise falls back to in-process
  python -m mempalace.mcp_server.

- mempal-fast.py: stdlib-only Stop/PreCompact hook handler. Counts human
  messages in the transcript, gates on SAVE_INTERVAL, POSTs to the
  daemon's /silent-save. No mempalace import, no chromadb load, so cold
  hook fires can't trigger the chromadb HNSW SIGSEGV that the warmup
  patch in main.py addresses for the daemon itself.

Together these let a user single-command-flip between an in-process
palace and a daemon-backed palace without editing any JSON by hand.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Six surgical fixes pulled from Copilot's review of #4:

main.py:
- Move /silent-save queue gating from `_repair_state['in_progress']`
  (any mode) to `mode == 'rebuild'` only. Light/scan/prune don't
  replace the collection, so silent-saves can proceed normally during
  those modes — matches the original PR intent.
- Make _enqueue_pending_write async, dispatch the disk append via
  asyncio.to_thread so /silent-save stays non-blocking under bursty
  enqueue load.
- Drain pending writes under _write_sem (per entry) — the previous
  version called _do_silent_save_write without holding the semaphore
  it documents as required, racing other writers.
- Don't unconditionally remove the .processing file in drain — quarantine
  failed entries to a timestamped sibling so the next drain doesn't
  replay successful saves.
- Validate body['message_count']: parse via try/except and return 400
  on non-integer input rather than letting int() raise a 500.
- asyncio.get_event_loop() → asyncio.get_running_loop() in the lifespan
  warmup. The former is deprecated in 3.12+ inside async functions.

clients/mempalace-mcp.py:
- Validate PALACE_MCP_TIMEOUT: catch ValueError, warn to stderr, fall
  back to 120s rather than crashing the proxy on a bad env value.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
palace-mode used to rewrite both ~/.claude/plugins/cache/...
(the runtime that Claude Code actually uses) AND
~/Projects/memorypalace/.claude-plugin/ (the plugin source dev
clone). The dev-clone rewrite produced a permanently dirty git
working tree on every machine that ran `palace-mode remote`.

Drop the workspace install and verify steps. The cache copy is
the only one that affects runtime; the dev clone now stays clean
for git operations (rebases, status checks, preflight). Existing
installs are unaffected — runtime hooks already live in the cache.

Also adds venv/ to .gitignore.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
All hook paths now write topic="checkpoint" so jphein/mempalace's
search kind= filter (default kind="content") cleanly excludes them.

Before this change, three different hook entry points wrote three
different topic values:
- mempalace.hooks_cli (silent_save fallback + daemon POST):
  topic="checkpoint"  ✓
- clients/mempal-fast.py (the active fast-path):
  topic="checkpoint"  ✓
- clients/hook.py (legacy stdlib runner via _post_mcp):
  topic="auto-save"   ✗  fixed to "checkpoint"

Defines a CHECKPOINT_TOPIC constant in both client files so future
edits are grep-anchored and a future rename is one-place.

The read-side filter in jphein/mempalace's search_memories
(commit 8d02835) handles both "checkpoint" and "auto-save" as
legacy synonyms, plus does text-prefix defense-in-depth — so
existing palace data with the old topic value, or with no topic
metadata at all, is also caught. Future writes from this commit
forward will all use the canonical name.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Three fixes, one verify script.

1. /search and /context now accept a `kind=` query param mirroring the
   mempalace_search MCP tool's input_schema enum:
     - "content" (default) — excludes Stop-hook checkpoints (the
       session-summary noise that drowns out real content under vector
       similarity)
     - "checkpoint" — only checkpoints (recovery / audit)
     - "all" — no filter, pre-2026-04-25 behavior
   Invalid values return 400. Backed by the read-side filter in
   jphein/mempalace commit 8d02835 (search_memories kind= parameter).

2. **Bug fix**: /search and /context were passing `max_results` to the
   mempalace_search MCP tool. The tool's input_schema declares `limit`,
   not `max_results`, so mempalace.mcp_server.handle_request silently
   dropped the param via its schema-property whitelist (line 1677), and
   *every* /search response was capped at the default limit=5. Confirmed
   on the running v1.5.0 daemon. Renamed to `limit` so the user-supplied
   value actually binds.

3. /silent-save now canonicalizes the `topic` field at the daemon
   boundary. Drift between client topic values (clients/hook.py wrote
   "auto-save"; mempal-fast.py wrote "checkpoint") used to leak into
   palace metadata, where the read-side checkpoint filter then had to
   keep both as known-synonyms. New _canonical_topic() rewrites
   "auto-save" → "checkpoint" with a warning log. Future drift gets the
   same treatment.

scripts/verify-routes.sh is a curl-based smoke test that exercises
every public route post-deploy. Not in CI (depends on a live palace);
run manually after `systemctl --user restart palace-daemon`.

Bumps VERSION 1.5.0 → 1.5.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Five steps: git push origin → wait for Syncthing/git mirror → systemctl
restart on the deploy host → poll /health until ready → run
verify-routes.sh smoke test. Reads PALACE_API_KEY from
~/.claude/settings.local.json env block (palace-mode's source of truth)
so it picks up the same key the local Claude Code sessions use.

Defaults: PALACE_HOST=disks, PALACE_SYNC_GRACE=3s, PALACE_HEALTH_TIMEOUT=30s,
all overridable via env.

Replaces the previous manual sequence (push → ssh → restart → curl /health →
hand-run smoke tests) with a single command that fails fast on any step.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Documents the v1.5.1 changes:
- README API table updated with kind= column on /search and /context
- New "kind= filter" section under /search explaining the three
  values (content/checkpoint/all) and why default-excluding
  checkpoints is the right call (80%+ of search results were
  checkpoint noise, side-by-side numbers from 2026-04-25
  validation)
- CHANGELOG entry for 1.5.1: kind= add, _canonical_topic helper,
  verify-routes.sh, and the silently-dropped limit= bug fix

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
/stats fan-outs to three MCP tools — mempalace_kg_stats (cached),
mempalace_graph_stats (60s TTL cache via #661), mempalace_status
(no cache, walks corpus paginated). On the deployed disks daemon
with 151K drawers, cold /stats takes ~32s and warm-cache /stats
takes ~30-56s because mempalace_status doesn't cache and the
corpus walk is intrinsically slow.

The previous 30s curl timeout meant every verify-routes run
flagged a false positive on /stats. 90s gives headroom on the
production palace.

Companion README update: documents /stats as a heavy admin route,
not a chat-turn path. End-to-end verified post-deploy 2026-04-25:
all 11 assertions pass against the deployed v1.5.1 daemon,
including the kind= filter (scope 150811 vs 151448 = ~637
checkpoints filtered) and the max_results→limit bug fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown
Owner

@rboarescu rboarescu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — tested all new endpoints against a Docker container running the PR branch. /silent-save, /repair (all four modes), the queue/drain during rebuild, and the 409 on overlapping repairs all work correctly.

Three issues to fix before I can merge:

  1. API key hardcoded in clients/palace-mode (lines 16–17)
    DEFAULT_KEY="censored_although_too_late_should_rotate_API_KEY"
    Personal API key in a public repo. Please remove — users should supply it via PALACE_API_KEY env var. Same for DEFAULT_URL, leave it empty or document it as a required env var.

  2. Hardcoded absolute path in clients/palace-mcp-dispatch.sh (line 8)
    exec "$PYTHON" /home/jp/Projects/palace-daemon/clients/mempalace-mcp.py ...
    Only works on your machine. palace-mode already resolves this correctly with readlink -f — palace-mcp-dispatch.sh should use the same pattern.

I never knew github could be such an exciting ride... and rabbit hole.

Thanks again!

jphein and others added 3 commits April 25, 2026 12:21
Single-shot structural snapshot that mirrors /stats's asyncio.gather
shape, plus:
- parallel mempalace_list_rooms fan-out per wing
- direct read-only sqlite read of knowledge_graph.sqlite3 (no extra
  MCP roundtrip)

Replaces what multipass-structural-memory-eval (SME) would otherwise
compose by serially calling list_wings + list_rooms × N + list_tunnels
+ kg_stats over HTTP. On the 151K-drawer canonical palace, list_wings
alone takes ~30s — a serial composition costs minutes; this gathers
in well under that.

KG read uses URI-mode ?mode=ro so the daemon can't accidentally write.
Schema differences tolerated via per-query OperationalError catch.
_kg_path() derives location from _mp._config.palace_path so non-default
deployments work unchanged.

Spec: docs/graph-endpoint.md. SME adapter prefers /graph once daemon ≥
1.6.0 and falls back to MCP composition otherwise — no coordinated
deploy needed.

Verify-routes.sh has a /graph probe.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…nnels

mempalace_list_tunnels and mempalace_graph_stats disagree on what counts
as a tunnel (mempalace 3.3.4 — list_tunnels returns [], graph_stats
reports 9 on the 151K-drawer canonical palace). The fix in palace-daemon
is to shadow graph_stats.top_tunnels for /graph.tunnels so

  /graph.tunnels   and   /stats.graph.tunnel_rooms

always agree. Drops one MCP call from the gather (tunnels piggybacks on
the graph_stats roundtrip we already need for kg_stats). Spec Part 2
tracks the upstream reconciliation separately.

Live verified against canonical palace post-fix: tunnels has 9 entries
including the diary, technical, architecture rooms — matching what
/stats has been reporting all along.
Drops the list_wings + list_rooms × N MCP fan-out (which serialized
through the 4-slot read semaphore and stalled indefinitely under
concurrent search load — observed by SME adapter as 60s+ timeouts).

New shape for /graph internals:
  - graph_stats (MCP, cheap, gives tunnels)
  - kg_stats    (MCP, cheap)
  - wings + rooms-per-wing  → direct sqlite read of embedding_metadata
  - kg_entities + triples   → direct sqlite read of knowledge_graph.sqlite3
All four phases gather in parallel.

Benchmarked on the canonical 151K-drawer palace: structural read drops
from 60-120s under contention to ~0.4s, semaphore-free. The MCP
roundtrips (graph_stats + kg_stats) remain the floor; both are
typically sub-second.

Safety:
  - Read-only via sqlite URI ?mode=ro (cannot acquire write locks)
  - Chroma uses WAL — concurrent writers see a consistent snapshot
  - Same pattern _read_kg_direct already uses since v1.6.0 (2003e80)
  - Per-query try/except OperationalError — schema drift across
    chroma versions degrades to empty wings/rooms (SME adapter then
    falls back to its MCP composition path)

Couples to chroma's embedding_metadata internal layout, which is not a
documented stable API. Acceptable trade given the 200× speedup and the
graceful degradation path.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
rboarescu added a commit that referenced this pull request Apr 25, 2026
…ity)

Aligns hook.py with the kind= search filter introduced in the /repair +
/silent-save PR. Saves written with topic="auto-save" would not be
returned by kind=checkpoint queries once that PR lands.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@rboarescu
Copy link
Copy Markdown
Owner

Merged locally — conflicts resolved and pushed directly to main in ef6ac03.

@rboarescu rboarescu closed this Apr 25, 2026
jphein referenced this pull request in jphein/palace-daemon Apr 26, 2026
…key, portable paths

Two issues called out by @rboarescu in the PR #4 review:

1. clients/palace-mode embedded a hardcoded PALACE_API_KEY default
   value at lines 16-17. The author key has been treated as compromised
   and will be rotated on next daemon restart; this commit stops the
   bleed by reading both URL and key from the environment with no
   in-source defaults. `palace-mode remote` now fails fast with a
   clear error if neither PALACE_DAEMON_URL nor PALACE_API_KEY is set,
   instead of silently writing empty strings into settings.local.json.

2. clients/palace-mcp-dispatch.sh hardcoded an absolute path to its
   sibling clients/mempalace-mcp.py, only working on the author's
   machine. Now resolves siblings via readlink -f / dirname (same
   pattern palace-mode already used), so the dispatcher works
   wherever the clients/ directory lives.

No daemon restart needed for these client-side fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jphein
Copy link
Copy Markdown
Contributor Author

jphein commented Apr 26, 2026

Thanks for the careful read — both real, both fixed in d450cef:

  1. Embedded API key — gone. DEFAULT_URL and DEFAULT_KEY now both read from env with no in-source defaults; palace-mode remote fails fast with a clear error if either is missing. Will rotate on the next daemon restart. Should not have shipped with that value baked in — fair catch.

  2. Hardcoded absolute path in dispatch.sh — fixed with the same readlink -f + dirname pattern palace-mode was already using. Should work regardless of install path now.

The "exciting ride" line resonates. I only started actually doing this kind of collaboration around my birthday, so we're both fairly fresh at it. There's a real difference between making something work for yourself and making it portable, reviewable, and not actively hostile to other contributors — every time I think I've internalized that, something like the API key leak reminds me there's more.

Re: the merge — thanks for picking up the substance and resolving conflicts directly. Happy to send any of the post-1.5.0 work on the fork (/graph, deploy.sh, the direct-sqlite optimization for structural reads) as small separate PRs at whatever cadence works.

jphein referenced this pull request in jphein/palace-daemon Apr 27, 2026
Pulls in 8 commits we were behind on, most importantly:

- v1.4.4: Dockerfile + docker-compose.yml + HNSW monitoring + granular
  concurrency env vars (PALACE_MAX_READ_CONCURRENCY /
  PALACE_MAX_WRITE_CONCURRENCY)
- v1.4.5: time-based saves, session-end force save, state cleanup
- 0060190: use CHECKPOINT_TOPIC constant for diary writes (PR #4
  compatibility — composes cleanly with our existing CHECKPOINT_TOPIC
  + CHECKPOINT_TOPIC_SYNONYMS block)
- a64244c: revert MCP-breaking toast injection, keep REST endpoint
  toasts
- d0aabb9 v1.5.1: harden _get_collection (silent failures logged,
  cache self-healing retry, num_threads=1 enforced on every open),
  /health returns 503 on degraded palace, systemd watchdog via
  sd_notify, startup warmup opens the collection
- patches/ + scripts/apply_patches.sh: upstream's patch-style backport
  system

Conflict resolution:
- main.py: kept VERSION at 1.7.2 (was 1.7.1; bumped to reflect the
  upstream merge). Kept our CHECKPOINT_TOPIC + CHECKPOINT_TOPIC_SYNONYMS
  block AND adopted upstream's PALACE_MAX_READ/WRITE_CONCURRENCY env
  var split.
- palace-daemon.service: kept our placeholder (User=user,
  /home/user/...) and ExecStartPre/ExecStartPost hooks; adopted
  upstream's Type=notify + NotifyAccess=main for the watchdog.
- README.md: kept fork-shaped narrative wholesale per the
  fork-README-handling pattern. Queue table needs a follow-up sweep
  to reflect what's now upstream (watchdog, HNSW threads enforcement,
  collection cache auto-retry, --force flag) — those rows can be
  retired.
- CHANGELOG.md: added a v1.7.2 entry at the top listing what came in
  from upstream's v1.5.1; kept our own v1.5.1 entry below for fork
  history with a note about the v1.5.1 naming collision and PR #4's
  closed-not-merged status.

No tests in this repo; smoke test will run via deploy.sh after this
commit.
jphein referenced this pull request in jphein/palace-daemon Apr 27, 2026
…wording

Reordered the queue so the smallest, most universally-applicable changes
sit at the top and the JP-personalized tooling sits at the bottom. The
prior order ranked by raw size, which buried the actual best candidates
(`limit=` bug fix, `_canonical_topic`, `/graph.tunnels`) under deploy
tooling that needs significant generalization before it would be a
reasonable PR.

New top of queue (PR-ready, universal):
1. `limit=` bug fix on /search /context (tiny, clear bug)
2. `_canonical_topic()` synonym rewrite (composes with upstream's CHECKPOINT_TOPIC constant from 0060190)
3. `/graph.tunnels` derived from graph_stats (small fix)
4. `scripts/verify-routes.sh` (universal curl smoke test)
5. `GET /graph` (medium, depends on a sqlite-direct read pattern)
6. `GET /viz` (medium, depends on /graph)

Marked **needs generalization** on three rows that have JP-specific
assumptions baked in:
- `clients/palace-mode` (Claude Code plugin cache layout)
- `scripts/auto-repair-if-empty.sh` (systemd --user shape)
- `scripts/deploy.sh` (PALACE_HOST=disks, ~/.claude/settings.local.json,
  Syncthing-mirrored source tree, fork-mempalace import-check that
  would fail on upstream installs)

Also tightened the front-matter wording on PR #4: rboarescu cherry-picked
the contents into upstream main directly (`ef6ac03`) rather than merging
the PR via the GitHub UI, so "merged via PR #4" was technically
misleading. Updated to reflect the actual merge mechanism.

Bumped version 1.7.0 → 1.7.2 to match the post-upstream-merge state.
Drawer count refreshed 150,814 → 150,891.
jphein referenced this pull request in jphein/palace-daemon Apr 27, 2026
…eralization

Mirrors the mempalace fork's README pattern: top-of-document section
listing PRs we've filed and their state, then a "pending" queue for
PR-ready items, then an explicit "needs generalization" table for
items that bake in JP-specific assumptions and are held until they
can be split into a universal shape.

## What's now where

- **Open upstream PRs** (3): rboarescu#7 limit= fix, rboarescu#8 _canonical_topic, rboarescu#9
  verify-routes.sh
- **Recently landed in upstream**: PR #4 (cherry-picked into ef6ac03,
  closed) — the v1.5.0 daemon work
- **Pending PRs — ready to file** (6): GET /graph (folds in
  /graph.tunnels fix), GET /viz, palace-mcp-dispatch.sh, mempal-fast.py,
  event-log-frame.md, graph-endpoint.md
- **Needs generalization before PR** (3): deploy.sh (PALACE_HOST=disks
  + fork-mempalace import-check), palace-mode (Claude Code plugin
  cache layout), auto-repair-if-empty.sh (systemd --user shape)

## Why

JP's note: "we can wait if we need to generalize things but make sure
that is noted in the readme like we do with memorypalace." Filing PRs
that have JP-specific assumptions baked in would either fail in
contributor review or land features other operators couldn't use.
Better to be explicit about which items are PR-ready, which are queued,
and which need generalization work first — and what that work is —
so anyone can pick up a generalization-ready item without re-discovering
the constraint set.

`/graph.tunnels` was previously listed as its own row but it's
structurally a sub-feature of `/graph`; merged into that row to
avoid implying it can ship independently.
jphein referenced this pull request in jphein/palace-daemon Apr 27, 2026
`clients/palace-mode` shipped with two embedded user-specific defaults
on lines 19-20: a `DEFAULT_URL` pointing at my homelab daemon and a
real hex `DEFAULT_KEY`. Both are leftovers from when the script was
extracted from my own install — the URL is my homelab address, and
the hex string is a real API key I authored (it has since been
rotated, so it doesn't authenticate against my daemon anymore — but
the literal value is still in upstream main's source).

## Fix

Read both from environment, no in-source defaults:

```bash
DEFAULT_URL="${PALACE_DAEMON_URL_DEFAULT:-${PALACE_DAEMON_URL:-}}"
DEFAULT_KEY="${PALACE_API_KEY:-}"
```

Plus fail-fast guards in `set_mode_remote()` so silent empty-string
writes don't end up in `~/.claude/settings.local.json`. If either URL
or key is unset, the script prints a clear error pointing at the right
env vars and exits 2 instead of writing broken config.

## Why now

Three reasons even though the embedded key is no longer live:

1. Trains contributors that hex defaults are normal — anyone reading
   the source might assume embedding their own key is fine.
2. Default `palace-mode remote` (no args) writes my homelab URL into
   a fresh user's `settings.local.json`, sending their hooks to a
   daemon they have no auth for.
3. Reviewable provenance — removing the literal makes future audits
   cleaner.

## Scope

+11 / -1. Single file. No new dependencies. No behavior change for
operators who set `PALACE_DAEMON_URL` and `PALACE_API_KEY` in their
shell rc. Operators who relied on the embedded defaults get a clear
error pointing at what they need to set.

Originally part of fork commit `d450cef` ("address upstream review on
PR #4"); the portable-paths half of that commit shipped as PR rboarescu#10.
Sending the embedded-defaults half standalone.
jphein referenced this pull request in jphein/palace-daemon Apr 27, 2026
The dispatcher in `clients/palace-mcp-dispatch.sh` has a hardcoded
`/home/jp/Projects/palace-daemon/clients/mempalace-mcp.py` in the
`--daemon` branch — a leftover from when it was extracted into PR #4.
As shipped in `main` today, the dispatcher only works on my machine;
on every other operator's machine the `exec` target doesn't exist
and the daemon-proxy mode fails immediately.

## Fix

Resolve the sibling client path at runtime:

```bash
HERE="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &> /dev/null && pwd -P)"
MCP_CLIENT="$HERE/mempalace-mcp.py"
```

`cd "$(dirname …)" && pwd -P` is POSIX-portable. It resolves symlinks
in the directory path via the filesystem, which covers the actual
common case for plugin invocations (the script's parent directory
may be a symlink under a Claude Code plugin cache). The earlier
`readlink -f` approach failed on macOS — BSD `readlink` doesn't
accept `-f`, so the dispatcher would silently `command not found`
on macOS clients (caught in Copilot review).

Also adds an explicit existence check on `$MCP_CLIENT` before exec,
so the failure mode is a clear error message ("missing sibling
client at …") rather than a generic Python "can't open file" trace.

Same behavior on my machine; correct + portable behavior everywhere
else. The "in-process" branch (`python -m mempalace.mcp_server`)
was already portable, no change there.

Originally extracted from a fork-side commit (`d450cef`) that landed
during PR #4's review cycle but didn't make it into the final
cherry-pick. Sending standalone per the "small separate PRs"
preference.
jphein referenced this pull request in jphein/palace-daemon Apr 27, 2026
A curl-based smoke test that exercises every public read-only route
against a running daemon. Designed for manual post-deploy validation
(`systemctl restart palace-daemon`-style), not CI — depends on a live
palace.

## What it probes

GET-only routes:
- /health (auth-bypass; always responds)
- /search?q=palace&limit=2
- /context?topic=palace&limit=2
- /stats
- /repair/status

Plus a regression check on `limit=3` returning 3 hits end-to-end (only
fails if the palace had ≥3 matches and the response was capped at fewer;
flags rather than fails on empty/unreachable palaces).

POST routes (/repair, /silent-save, /memory, /mine, /flush, /reload,
/backup) intentionally skipped — they mutate state and belong in a
dedicated integration run against a throwaway palace, not in a manual
smoke against production.

## Usage

    PALACE_DAEMON_URL=http://your-host:8085 \
    PALACE_API_KEY=... \
        scripts/verify-routes.sh

Both env vars optional; defaults to localhost:8085 and no auth header.

## Why this shape

Easier than maintaining a full pytest harness for one-shot deploy
verification, and the actual failure surface (HTTP 500, malformed
JSON, missing routes after merge) is what curl + grep catches well.
The probe helpers (`probe`, `probe_json_field`) keep failure messages
short and actionable.

Originally extracted from a fork-side commit; sending standalone per
the "small separate PRs" preference noted on PR #4.
jphein referenced this pull request in jphein/palace-daemon Apr 27, 2026
`clients/palace-mode` shipped with two embedded user-specific defaults
on lines 19-20: a `DEFAULT_URL` pointing at my homelab daemon and a
real hex `DEFAULT_KEY`. Both are leftovers from when the script was
extracted from my own install — the URL is my homelab address, and
the hex string is a real API key I authored (it has since been
rotated, so it doesn't authenticate against my daemon anymore — but
the literal value is still in upstream main's source).

## Fix

Read both from environment, no in-source defaults:

```bash
DEFAULT_URL="${PALACE_DAEMON_URL_DEFAULT:-${PALACE_DAEMON_URL:-}}"
DEFAULT_KEY="${PALACE_API_KEY:-}"
```

Plus fail-fast guards in `set_mode_remote()` so silent empty-string
writes don't end up in `~/.claude/settings.local.json`. If either URL
or key is unset, the script prints a clear error pointing at the right
env vars and exits 2 instead of writing broken config.

## Why now

Three reasons even though the embedded key is no longer live:

1. Trains contributors that hex defaults are normal — anyone reading
   the source might assume embedding their own key is fine.
2. Default `palace-mode remote` (no args) writes my homelab URL into
   a fresh user's `settings.local.json`, sending their hooks to a
   daemon they have no auth for.
3. Reviewable provenance — removing the literal makes future audits
   cleaner.

## Scope

+11 / -1. Single file. No new dependencies. No behavior change for
operators who set `PALACE_DAEMON_URL` and `PALACE_API_KEY` in their
shell rc. Operators who relied on the embedded defaults get a clear
error pointing at what they need to set.

Originally part of fork commit `d450cef` ("address upstream review on
PR #4"); the portable-paths half of that commit shipped as PR rboarescu#10.
Sending the embedded-defaults half standalone.
jphein referenced this pull request in jphein/palace-daemon Apr 27, 2026
Adds `_canonical_topic()` between `/silent-save`'s payload parse and
`tool_diary_write`. When a topic comes in matching a known legacy
synonym (currently `"auto-save"`), it gets rewritten to the canonical
`CHECKPOINT_TOPIC` ("checkpoint") and a warning is logged.

## Why

The bundled clients (`clients/hook.py`, `clients/mempal-fast.py`)
already write the canonical value, so this is defense-in-depth. It
catches:

1. Future drift in those clients (someone edits one and forgets the
   convention)
2. Third-party callers POSTing to `/silent-save` directly with the
   older synonym
3. Buggy hook installs that haven't been updated

Without this, drift would silently leak into palace metadata and
mempalace's `tool_diary_write` topic-routing — which sends matching
topics to the dedicated `mempalace_session_recovery` collection rather
than the searchable main collection — would miss those drawers and
they'd dominate vector top-N in `mempalace_search` results. The
warning makes drift observable; the rewrite keeps the palace clean.

## Scope

- New module-level constants: `CHECKPOINT_TOPIC`, `CHECKPOINT_TOPIC_SYNONYMS`
- New helper: `_canonical_topic(topic: str) -> str` (10 lines)
- One call-site change in `_do_silent_save_write`

Non-checkpoint topics (e.g. `"musings"`, `"decisions"`) pass through
unchanged — the rewrite only fires on synonyms in the explicit list.

No new dependencies, no behavior change for already-canonical writes,
no breaking change for callers using the default.

Originally extracted from a larger fork-side commit (`b4b39fc`) that
bundled this with a now-retired `kind=` filter; sending it standalone
per the "small separate PRs" preference noted on PR #4.
jphein referenced this pull request in jphein/palace-daemon Apr 30, 2026
…nding queue; trim patch

Filed four upstream PRs on 2026-04-30:
- rboarescu#15  feat: GET /viz status dashboard (stacks on rboarescu#13)
- rboarescu#16  feat: GET /list — query-free metadata browse
- rboarescu#17  feat: DELETE /memory + PATCH /memory
- rboarescu#18  feat(lifespan): auto-migrate Stop-hook checkpoints on startup

Also rebased PR rboarescu#13 onto upstream/main to clear a CHANGELOG conflict
left by upstream's b4aee82 (patch sync) — state went CONFLICTING ->
MERGEABLE / CLEAN.

README:
- Open upstream PRs table: four new rows (rboarescu#15-rboarescu#18) plus a 2026-04-30
  note covering today's rebase + new PRs in one breath.
- Pending PRs queue: now empty. Replaced the four stale rows
  (event-log-frame and graph-endpoint were already in flight via
  rboarescu#11 and rboarescu#13; mempal-fast.py was already upstream via the merged
  PR #4 omnibus; /viz is now PR rboarescu#15) with a brief empty-state note.

CLAUDE.md:
- Patch description trimmed to reflect that the hnsw:num_threads
  enforcement landed upstream via _pin_hnsw_threads(); only the
  log + retry-once slice remains.

patches/mcp_server_get_collection.patch:
- Regenerated against current mempalace develop. The patch is now
  just the "log exception + retry once on cache failure" change.
  Filed upstream as MemPalace/mempalace#1286; once that merges this
  patch retires.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein referenced this pull request in jphein/palace-daemon May 6, 2026
Seven findings, all addressed:

1. **Path translation before is_dir check** (Copilot watcher.py:133).
   parse_watch_dirs() validated against the daemon filesystem before
   any path translation, so a PALACE_WATCH_DIRS value written in the
   client namespace was always rejected as "not a directory". Add a
   translator= kwarg; main.py now passes _translate_client_path so
   client-namespace entries reach the daemon's view first.

2. **Drop on_any_event in favor of specific subscriptions** (Copilot
   watcher.py:160). watchdog 3.x emits opened/closed events on plain
   file reads on Linux; routing those through the debounce would
   re-mine the project on every file open. Replaced on_any_event
   with explicit on_created / on_modified / on_moved / on_deleted.

3. **Check dest_path on FileMovedEvent** (Copilot watcher.py:171).
   Editor save-via-rename writes a temp file then renames to the
   real filename. The src_path on the move event is the temp; only
   dest_path has the real extension. Extracted _has_watchable_extension()
   helper; on_moved() checks both sides.

4. **Per-target schedule() failure isolation** (Copilot watcher.py:219).
   A bare for-loop let an exception in observer.schedule() (e.g.
   inotify watch limit on a large repo) abort the entire service.
   Wrap each schedule() in try/except; one failed target no longer
   disables the others. Logged at warning level.

5. **Cancel debounce timers on stop** (Copilot watcher.py:224). stop()
   tore down the observer but left armed threading.Timer objects to
   fire mid-teardown. Add cancel_pending() on the handler; service
   stop() walks all handlers first.

6. **Surface watcher-coroutine errors** (Copilot watcher.py:246).
   asyncio.run_coroutine_threadsafe() returns a Future the caller
   must observe; dropping it swallowed exceptions silently. Add a
   _log_future_exception done-callback.

7. **/watch reports running state accurately** (Copilot main.py:558).
   app.state.watcher was set unconditionally even when start()
   returned early (empty env, missing dep, all targets failed). Now
   only published to app.state when watcher.is_running. GET /watch
   reports running: false for idle/disabled instead of misleading
   running: true.

Tests: 28 pass (was 24). New cases:
- test_rename_to_watched_extension_fires (covers #3)
- test_cancel_pending_drops_armed_timer (covers #5)
- test_one_failed_target_doesnt_disable_others (covers #4)
- test_translator_runs_before_is_dir (covers #1)
Existing handler tests retargeted from on_any_event to on_modified
since on_any_event no longer exists in this revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein referenced this pull request in jphein/palace-daemon May 6, 2026
Seven findings, all addressed:

1. **Path translation before is_dir check** (Copilot watcher.py:133).
   parse_watch_dirs() validated against the daemon filesystem before
   any path translation, so a PALACE_WATCH_DIRS value written in the
   client namespace was always rejected as "not a directory". Add a
   translator= kwarg; main.py now passes _translate_client_path so
   client-namespace entries reach the daemon's view first.

2. **Drop on_any_event in favor of specific subscriptions** (Copilot
   watcher.py:160). watchdog 3.x emits opened/closed events on plain
   file reads on Linux; routing those through the debounce would
   re-mine the project on every file open. Replaced on_any_event
   with explicit on_created / on_modified / on_moved / on_deleted.

3. **Check dest_path on FileMovedEvent** (Copilot watcher.py:171).
   Editor save-via-rename writes a temp file then renames to the
   real filename. The src_path on the move event is the temp; only
   dest_path has the real extension. Extracted _has_watchable_extension()
   helper; on_moved() checks both sides.

4. **Per-target schedule() failure isolation** (Copilot watcher.py:219).
   A bare for-loop let an exception in observer.schedule() (e.g.
   inotify watch limit on a large repo) abort the entire service.
   Wrap each schedule() in try/except; one failed target no longer
   disables the others. Logged at warning level.

5. **Cancel debounce timers on stop** (Copilot watcher.py:224). stop()
   tore down the observer but left armed threading.Timer objects to
   fire mid-teardown. Add cancel_pending() on the handler; service
   stop() walks all handlers first.

6. **Surface watcher-coroutine errors** (Copilot watcher.py:246).
   asyncio.run_coroutine_threadsafe() returns a Future the caller
   must observe; dropping it swallowed exceptions silently. Add a
   _log_future_exception done-callback.

7. **/watch reports running state accurately** (Copilot main.py:558).
   app.state.watcher was set unconditionally even when start()
   returned early (empty env, missing dep, all targets failed). Now
   only published to app.state when watcher.is_running. GET /watch
   reports running: false for idle/disabled instead of misleading
   running: true.

Tests: 28 pass (was 24). New cases:
- test_rename_to_watched_extension_fires (covers #3)
- test_cancel_pending_drops_armed_timer (covers #5)
- test_one_failed_target_doesnt_disable_others (covers #4)
- test_translator_runs_before_is_dir (covers #1)
Existing handler tests retargeted from on_any_event to on_modified
since on_any_event no longer exists in this revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein referenced this pull request in jphein/palace-daemon May 6, 2026
* feat(watcher): file-watcher service for auto-mining on file change

Closes the "automining when files are added or updated" half of JP's
ask. The other half (CLI add/remove of mined data) ships in
jphein/mempalace#4 as `mempalace mined` + `mempalace purge --source-file`.

Adds a watchdog-based file watcher that runs inside the daemon's
lifespan. Configured via `PALACE_WATCH_DIRS` env var:

    PALACE_WATCH_DIRS="/home/jp/Projects/realmwatch=wing_realmwatch,
                       /home/jp/Projects/oracle=wing_oracle"

Each entry is `path` or `path=wing`. Bare path derives wing via
`mempalace.config.normalize_wing_name(basename)` to match the
local-spawn miner default. Non-existent paths are warned-and-skipped,
not fatal — a misconfigured env var doesn't kill startup.

Architecture:

* watcher.py — pure-Python module: parse_watch_dirs(),
  _DebouncedMineHandler (subclass of FileSystemEventHandler with a
  per-target debounce timer), WatcherService (lifecycle wrapper around
  watchdog.Observer).
* main.py lifespan — instantiate WatcherService, schedule one
  recursive watch per target, register stop() on shutdown. Failures
  are non-fatal; the daemon serves traffic regardless.
* main.py — `_internal_mine()` async closure runs the same
  `mempalace mine ... --mode projects --wing X` argv as the existing
  /mine endpoint, gated by the same `_mine_sem`.
* main.py — new `GET /watch` endpoint surfaces the running target
  list. No POST/DELETE — runtime add/remove requires daemon restart
  (operator just edits the systemd unit env and restarts).

Debounce: events on the same target collapse via a 2-second per-target
timer. Catches editor write+rename storms and *.swp / *.lock churn.
Also pre-filters by extension allowlist (29 extensions: code, configs,
markdown). Inotify fires for everything; the handler discards
non-watched extensions before the timer schedules.

Path translation: when a watch path lives in a Syncthing-replicated
directory, the daemon-side path is translated through
`PALACE_DAEMON_PATH_MAP` before the mine subprocess runs (same
mechanism used by /mine).

Tests: 11 new unittest cases in tests/test_watcher.py:
- TestParseWatchDirs (7 cases): empty, single-with-derived-wing,
  explicit-wing, multiple, skips-nonexistent, skips-files, env-fallback
- TestDebouncedMineHandler (4 cases): single-event-fires,
  burst-collapses-to-one, skips-directory-events, skips-bad-extensions

Stacked on #1 (path-translation). When that
merges, this PR auto-rebases against main; reviewers see only the
new commits in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(watcher): address Copilot review on #2

Seven findings, all addressed:

1. **Path translation before is_dir check** (Copilot watcher.py:133).
   parse_watch_dirs() validated against the daemon filesystem before
   any path translation, so a PALACE_WATCH_DIRS value written in the
   client namespace was always rejected as "not a directory". Add a
   translator= kwarg; main.py now passes _translate_client_path so
   client-namespace entries reach the daemon's view first.

2. **Drop on_any_event in favor of specific subscriptions** (Copilot
   watcher.py:160). watchdog 3.x emits opened/closed events on plain
   file reads on Linux; routing those through the debounce would
   re-mine the project on every file open. Replaced on_any_event
   with explicit on_created / on_modified / on_moved / on_deleted.

3. **Check dest_path on FileMovedEvent** (Copilot watcher.py:171).
   Editor save-via-rename writes a temp file then renames to the
   real filename. The src_path on the move event is the temp; only
   dest_path has the real extension. Extracted _has_watchable_extension()
   helper; on_moved() checks both sides.

4. **Per-target schedule() failure isolation** (Copilot watcher.py:219).
   A bare for-loop let an exception in observer.schedule() (e.g.
   inotify watch limit on a large repo) abort the entire service.
   Wrap each schedule() in try/except; one failed target no longer
   disables the others. Logged at warning level.

5. **Cancel debounce timers on stop** (Copilot watcher.py:224). stop()
   tore down the observer but left armed threading.Timer objects to
   fire mid-teardown. Add cancel_pending() on the handler; service
   stop() walks all handlers first.

6. **Surface watcher-coroutine errors** (Copilot watcher.py:246).
   asyncio.run_coroutine_threadsafe() returns a Future the caller
   must observe; dropping it swallowed exceptions silently. Add a
   _log_future_exception done-callback.

7. **/watch reports running state accurately** (Copilot main.py:558).
   app.state.watcher was set unconditionally even when start()
   returned early (empty env, missing dep, all targets failed). Now
   only published to app.state when watcher.is_running. GET /watch
   reports running: false for idle/disabled instead of misleading
   running: true.

Tests: 28 pass (was 24). New cases:
- test_rename_to_watched_extension_fires (covers #3)
- test_cancel_pending_drops_armed_timer (covers #5)
- test_one_failed_target_doesnt_disable_others (covers #4)
- test_translator_runs_before_is_dir (covers #1)
Existing handler tests retargeted from on_any_event to on_modified
since on_any_event no longer exists in this revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants