
Commit d0aabb9

rboarescu and claude committed

fix: harden _get_collection, watchdog, health check (v1.5.1)

- Log exceptions in _get_collection instead of silent None return
- Auto-retry once after cache clear on collection open failure
- Enforce hnsw:num_threads=1 on every collection open (ChromaDB #1161)
- /health returns HTTP 503 degraded when collection unavailable
- Systemd watchdog: READY=1 on startup, WATCHDOG=1 every 60s gated on live collection check
- Warmup now opens collection directly so num_threads fix applies before startup warning check
- patches/mcp_server_get_collection.patch + scripts/apply_patches.sh for post-upgrade re-apply
- Service: Type=notify, NotifyAccess=main, WatchdogSec=120

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

1 parent a64244c commit d0aabb9

7 files changed

Lines changed: 260 additions & 10 deletions


CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -1,5 +1,18 @@
 # Changelog
 
+## [1.5.1] - 2026-04-26
+
+### Fixed
+- **`_get_collection` silent failures** -- exceptions are now logged (palace path + error) instead of silently returning `None`. Cache-staleness incidents are now visible in the daemon log.
+- **Stale collection cache self-healing** -- `_get_collection` retries once after clearing all caches (`_client_cache`, `_collection_cache`, `_metadata_cache`) on failure. The incident that required a manual daemon restart now self-heals on the next tool call.
+- **HNSW `num_threads=1` enforced on every open** -- `_get_collection` calls `collection.modify()` after every open, merging `hnsw:num_threads=1` into existing metadata. ChromaDB 1.5.x does not persist HNSW metadata across reopens (issue #1161); without this, every cache clear silently re-enabled parallel inserts and risked SIGSEGV under concurrent writes.
+- **`/health` reflects actual palace state** -- previously returned HTTP 200 `ok` even when the collection was broken (false healthy). Now calls `_get_collection()` and returns HTTP 503 `degraded` if the palace is unavailable.
+
+### Added
+- **Systemd watchdog** -- daemon sends `READY=1` on startup and `WATCHDOG=1` every `WatchdogSec/2` seconds via `sd_notify` (stdlib-only, no external deps). Watchdog pings are gated on a live `_get_collection()` check: if the palace goes dark, the watchdog goes silent and systemd kills and restarts the daemon.
+- `palace-daemon.service` updated: `Type=simple` changed to `Type=notify`, `NotifyAccess=main` added, `WatchdogSec=120` added. Re-install: `sudo cp palace-daemon.service /etc/systemd/system/ && sudo systemctl daemon-reload && sudo systemctl restart palace-daemon`.
+- **Startup warmup opens the collection** -- lifespan warmup now calls `_get_collection(create=True)` directly instead of `ping`. `ping` never touches the collection, so `num_threads=1` was not applied before `_warn_if_hnsw_threads_unset` ran at startup, causing a spurious warning on every boot. The warning is now silent on a healthy palace.
+
 ## [1.5.0] - 2026-04-24
 
 ### Added

CLAUDE.md

Lines changed: 13 additions & 2 deletions
@@ -3,17 +3,28 @@
 ## Core Mandates
 
 ### 1. SSH-Friendly Feedback
-- **Always** provide a concise, one-line terminal confirmation (e.g., '📥 Filed to {room}') after filing memories via the MemPalace MCP.
+- **Always** provide a concise, one-line terminal confirmation (e.g., '📥 Filed to {room}') after filing memories via the MemPalace MCP.
 - Do not rely on desktop notifications as the user is often on SSH.
 
 ### 2. Post-Phase Documentation
 - At the end of every work phase, systematically update the project's `README.md` or `CHANGELOG.md`.
 - **Mandatory:** File a roadmap update to the corresponding room in the 'lab_projects' wing via MemPalace.
 
 ### 3. Service Management
-- **System Service Only:** ALWAYS manage `palace-daemon` via `sudo systemctl [start|stop|restart] palace-daemon`.
+- **System Service Only:** ALWAYS manage `palace-daemon` via `sudo systemctl [start|stop|restart] palace-daemon`.
 - **No Manual Starts:** NEVER start the daemon manually via `python3 main.py`. Manual startup is blocked by default and requires the `--manual` flag; only use this for isolated debugging.
 
 ### 4. Memory Protocol
 - **Silent Mode:** Ensure `silent_save` is enabled in MemPalace settings to prevent blocking the chat flow.
 - **Roadmap Sync:** Before finishing, check the 'lab_projects' wing to ensure the next steps are clearly documented for the next session.
+
+### 5. Upgrading mempalace
+After `pipx upgrade mempalace`, always re-apply local patches and restart:
+
+    bash /home/radu/palace-daemon/scripts/apply_patches.sh
+    sudo systemctl restart palace-daemon
+
+If a patch conflicts, the script will say so. Check whether upstream fixed the issue — if so, delete the patch file. Otherwise update the patch to match the new code.
+
+Patches live in `patches/`. Current patches:
+- `mcp_server_get_collection.patch` -- `_get_collection`: exception logging, auto-retry on cache failure, `hnsw:num_threads=1` enforcement (workaround for ChromaDB issue #1161)

README.md

Lines changed: 14 additions & 1 deletion
@@ -21,6 +21,9 @@ To prevent database corruption, this project enforces a strict **Single-Process
 ## Features
 
 - **Self-Healing Startup** — `--force` flag automatically clears stale processes on the target port
+- **Collection cache auto-retry** -- if the internal ChromaDB collection cache goes stale, `_get_collection` clears all caches and retries once automatically before returning an error
+- **HNSW thread safety** -- `num_threads=1` is enforced on every collection open, not just creation; prevents SIGSEGV from parallel inserts after any cache clear (ChromaDB 1.5.x issue #1161)
+- **Systemd watchdog** -- sends `READY=1` on startup and `WATCHDOG=1` every 60s (gated on a live collection check); systemd restarts the daemon if the palace goes dark
 - **Protected Manual Start** — requires `--manual` flag for debugging, preventing accidental agent starts
 - **MCP proxy** — any MCP client connects to /mcp instead of spawning a local process
 - **REST API** — search, store, and query the palace over HTTP (Android app, netdash, scripts)
@@ -91,8 +94,18 @@ Only runs while you're logged in. Use this if you don't have sudo or only need t
 
 Edit `palace-daemon.service` to set `PALACE_API_KEY` or a custom `--palace` path before installing.
 
+The service uses `Type=notify` and `WatchdogSec=120`: the daemon signals systemd when it is ready and sends a watchdog heartbeat every 60 s. If the watchdog goes silent (e.g. the palace collection breaks), systemd kills and restarts the daemon automatically.
+
 ## Troubleshooting
 
+### Palace reports `degraded` on `/health`
+The daemon is running but cannot open the ChromaDB collection. Since 1.5.1, `_get_collection` will attempt a self-heal automatically on the next tool call. If it persists:
+
+    curl -X POST http://localhost:8085/reload    # clear client cache
+    sudo systemctl restart palace-daemon    # full restart if reload fails
+
+Check `journalctl -u palace-daemon -n 50` for the logged exception — it will now show the exact error instead of a silent `None`.
+
 ### Port 8085 already in use
 If the daemon fails to start with `[Errno 98] address already in use`, it usually means a previous instance didn't shut down cleanly.
 
@@ -109,7 +122,7 @@ To manually clear the lock and port without starting:
 
 | Method | Endpoint | Description |
 |--------|----------|-------------|
-| GET | /health | Daemon + palace status (inc. version) |
+| GET | /health | Daemon + palace status; returns HTTP 503 `degraded` if collection is unavailable |
 | POST | /backup | Atomic verified SQLite backup |
 | POST | /reload | Clear client cache / refresh index |
 | POST | /repair | Coordinate repair with daemon traffic (`mode`: `light`/`scan`/`prune`/`rebuild`) |
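For external monitoring, the `/health` status code alone distinguishes healthy from degraded. A minimal probe sketch (assumes the daemon's default port 8085; the helper names are illustrative, not part of the daemon):

```python
import json
import urllib.error
import urllib.request

def classify(status_code: int) -> str:
    """Map a /health HTTP status code to a palace state."""
    return {200: "ok", 503: "degraded"}.get(status_code, "unknown")

def check_health(url: str = "http://localhost:8085/health") -> tuple[int, str]:
    """Return (http_status, body_status) from the daemon's /health endpoint.

    A 503 response raises HTTPError, but the error object is file-like and
    still carries the JSON payload, so the body status is read from it.
    """
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, json.load(resp).get("status", "unknown")
    except urllib.error.HTTPError as e:
        return e.code, json.load(e).get("status", "unknown")
```

A cron or monitoring job can alert on anything other than `"ok"` without parsing the full payload.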

main.py

Lines changed: 66 additions & 6 deletions
@@ -39,7 +39,7 @@
 
 # ── Config (env vars override CLI defaults) ───────────────────────────────────
 
-VERSION = "1.5.0"
+VERSION = "1.5.1"
 DEFAULT_HOST = os.getenv("PALACE_HOST", "0.0.0.0")
 DEFAULT_PORT = int(os.getenv("PALACE_PORT", "8085"))
 DEFAULT_PALACE = os.getenv("PALACE_PATH", "")
@@ -67,6 +67,47 @@
 _log = logging.getLogger("palace-daemon")
 
 
+# ── Systemd watchdog / sd_notify ─────────────────────────────────────────────
+
+def _sd_notify(msg: str) -> None:
+    """Send a message to systemd notify socket without external dependencies."""
+    sock_path = os.environ.get("NOTIFY_SOCKET", "")
+    if not sock_path:
+        return
+    try:
+        import socket as _sock
+        with _sock.socket(_sock.AF_UNIX, _sock.SOCK_DGRAM) as s:
+            # Abstract namespace sockets use NUL prefix; systemd uses @ prefix.
+            addr = chr(0) + sock_path[1:] if sock_path.startswith("@") else sock_path
+            s.sendto(msg.encode(), addr)
+    except Exception:
+        pass
+
+
+def _watchdog_interval() -> int:
+    """Return WatchdogSec in seconds from WATCHDOG_USEC (set by systemd), or 0."""
+    try:
+        return int(os.environ.get("WATCHDOG_USEC", "0")) // 1_000_000
+    except ValueError:
+        return 0
+
+
+async def _watchdog_loop(interval_secs: int) -> None:
+    """Ping systemd watchdog at half the watchdog interval, only when palace is healthy."""
+    tick = max(10, interval_secs // 2)
+    while True:
+        await asyncio.sleep(tick)
+        try:
+            loop = asyncio.get_running_loop()
+            col = await loop.run_in_executor(None, _mp._get_collection)
+            if col is not None:
+                _sd_notify("WATCHDOG=1\n")
+            else:
+                _log.warning("Watchdog: palace collection unavailable — skipping WATCHDOG=1")
+        except Exception as e:
+            _log.warning("Watchdog check failed: %s", e)
+
+
 async def _warn_if_hnsw_threads_unset() -> None:
     """Warn if hnsw:num_threads != 1 after a collection reopen.
 
@@ -321,16 +362,24 @@ async def lifespan(app: FastAPI):
     # Warm the ChromaDB client before accepting traffic. The Rust HNSW binding
     # occasionally segfaults on the very first request if opened cold; opening
     # it here (before yield) ensures the PersistentClient is fully initialized.
+    # We open the collection directly (not via ping) so that _get_collection's
+    # hnsw:num_threads=1 fix is applied before _warn_if_hnsw_threads_unset runs.
     try:
         loop = asyncio.get_running_loop()
-        await loop.run_in_executor(None, _mp.handle_request, {
-            "jsonrpc": "2.0", "id": "warmup", "method": "ping", "params": {}
-        })
+        await loop.run_in_executor(None, _mp._get_collection, True)
         logger.info("Palace client warmed up.")
     except Exception as e:
-        logger.warning("Warmup ping failed (non-fatal): %s", e)
+        logger.warning("Warmup collection open failed (non-fatal): %s", e)
     await _warn_if_hnsw_threads_unset()
 
+    # Signal systemd that startup is complete (Type=notify in service file).
+    _sd_notify("READY=1\n")
+
+    # Start systemd watchdog loop if WatchdogSec is configured.
+    wdog_secs = _watchdog_interval()
+    if wdog_secs > 0:
+        asyncio.create_task(_watchdog_loop(wdog_secs))
+        logger.info("Systemd watchdog active (interval=%ds, tick=%ds).", wdog_secs, max(10, wdog_secs // 2))
 
     yield
 
@@ -371,7 +420,18 @@ async def health():
     # Bypass semaphores — health must respond even when all slots are busy.
     loop = asyncio.get_running_loop()
     result = await loop.run_in_executor(None, _mp.handle_request, {"jsonrpc": "2.0", "id": 1, "method": "ping", "params": {}}) or {}
-    return {"status": "ok", "daemon": "palace-daemon", "version": VERSION, "palace": result}
+    # Test actual collection access so /health reflects true palace state.
+    palace_ok = False
+    try:
+        col = await loop.run_in_executor(None, _mp._get_collection)
+        palace_ok = col is not None
+    except Exception:
+        pass
+    status = "ok" if palace_ok else "degraded"
+    payload = {"status": status, "daemon": "palace-daemon", "version": VERSION, "palace": result}
+    if not palace_ok:
+        return JSONResponse(content=payload, status_code=503)
+    return payload
 
 
 @app.get("/search")
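Two small details in the watchdog plumbing above are worth isolating: systemd's `WATCHDOG_USEC` is in microseconds, and abstract-namespace notify sockets (paths starting with `@`) must have that prefix replaced with a NUL byte before `sendto`. A standalone sketch of both conversions (helper names are illustrative, not the daemon's):

```python
def notify_addr(sock_path: str) -> str:
    """Translate systemd's NOTIFY_SOCKET value into an AF_UNIX address.

    systemd writes abstract-namespace sockets with a leading '@'; the
    AF_UNIX API expects a leading NUL byte instead. Filesystem paths
    pass through unchanged.
    """
    return "\0" + sock_path[1:] if sock_path.startswith("@") else sock_path


def watchdog_tick(watchdog_usec: str) -> int:
    """Derive the heartbeat period in seconds from WATCHDOG_USEC.

    Matches the daemon's policy: ping at half the watchdog window, but
    never more often than every 10 s. Returns 0 when unset or invalid.
    """
    try:
        secs = int(watchdog_usec) // 1_000_000
    except ValueError:
        return 0
    return max(10, secs // 2) if secs > 0 else 0
```

With `WatchdogSec=120`, systemd exports `WATCHDOG_USEC=120000000`, so the daemon pings every 60 s, well inside the 120 s kill window.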

palace-daemon.service

Lines changed: 3 additions & 1 deletion
@@ -3,14 +3,16 @@ Description=palace-daemon — MemPalace HTTP/MCP gateway
 After=network.target
 
 [Service]
-Type=simple
+Type=notify
+NotifyAccess=main
 User=radu
 Group=radu
 WorkingDirectory=/home/radu/palace-daemon
 # Use the new --force flag to ensure port 8085 is cleared on every start
 ExecStart=/home/radu/.local/share/pipx/venvs/mempalace/bin/python main.py --force --palace /home/radu/.mempalace/palace
 Restart=on-failure
 RestartSec=5
+WatchdogSec=120
 StandardOutput=journal
 StandardError=journal
 Environment=PYTHONUNBUFFERED=1
patches/mcp_server_get_collection.patch (new file)

Lines changed: 76 additions & 0 deletions

--- a/mempalace/mcp_server.py
+++ b/mempalace/mcp_server.py
@@ -212,25 +212,56 @@
 
 
 def _get_collection(create=False):
-    """Return the ChromaDB collection, caching the client between calls."""
-    global _collection_cache, _metadata_cache, _metadata_cache_time
-    try:
-        client = _get_client()
-        if create:
-            _collection_cache = ChromaCollection(
-                client.get_or_create_collection(
-                    _config.collection_name, metadata={"hnsw:space": "cosine"}
+    """Return the ChromaDB collection, caching the client between calls.
+
+    Retries once on failure after clearing all caches (fixes stale-cache
+    breakage without requiring a daemon restart). Logs the exception so
+    failures are visible in the daemon log instead of silently returning None.
+    Sets hnsw:num_threads=1 on every open — ChromaDB 1.5.x does not persist
+    HNSW metadata across reopens, so parallel inserts stay disabled.
+    """
+    global _client_cache, _collection_cache, _metadata_cache, _metadata_cache_time
+    for attempt in range(2):
+        try:
+            client = _get_client()
+            if create:
+                _collection_cache = ChromaCollection(
+                    client.get_or_create_collection(
+                        _config.collection_name,
+                        metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
+                    )
+                )
+                _metadata_cache = None
+                _metadata_cache_time = 0
+            elif _collection_cache is None:
+                _collection_cache = ChromaCollection(
+                    client.get_collection(_config.collection_name)
                )
+                _metadata_cache = None
+                _metadata_cache_time = 0
+            # Re-apply num_threads=1 on every open since ChromaDB 1.5.x does
+            # not persist HNSW metadata across PersistentClient reopens (#1161).
+            if _collection_cache is not None:
+                try:
+                    existing = getattr(_collection_cache._collection, "metadata", {}) or {}
+                    if existing.get("hnsw:num_threads") != 1:
+                        _collection_cache._collection.modify(
+                            metadata={**existing, "hnsw:num_threads": 1}
+                        )
+                except Exception:
+                    pass
+            return _collection_cache
+        except Exception as e:
+            logger.error(
+                "_get_collection attempt %d failed (palace=%s): %s",
+                attempt + 1, _config.palace_path, e,
            )
-            _metadata_cache = None
-            _metadata_cache_time = 0
-        elif _collection_cache is None:
-            _collection_cache = ChromaCollection(client.get_collection(_config.collection_name))
-            _metadata_cache = None
-            _metadata_cache_time = 0
-        return _collection_cache
-    except Exception:
-        return None
+            if attempt == 0:
+                _client_cache = None
+                _collection_cache = None
+                _metadata_cache = None
+                _metadata_cache_time = 0
+    return None
 
 
 def _no_palace():

scripts/apply_patches.sh

Lines changed: 75 additions & 0 deletions
#!/usr/bin/env bash
# apply_patches.sh — re-apply local patches to the mempalace pipx install
# Run this after every: pipx upgrade mempalace
#
# Usage:
#   bash scripts/apply_patches.sh
#   bash scripts/apply_patches.sh --check   # dry-run, no changes

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PATCHES_DIR="$SCRIPT_DIR/../patches"
VENV_SITE="$(/home/radu/.local/share/pipx/venvs/mempalace/bin/python \
    -c 'import site; print(site.getsitepackages()[0])')"

DRY_RUN=0
[[ "${1:-}" == "--check" ]] && DRY_RUN=1

MEMPALACE_VERSION="$(/home/radu/.local/share/pipx/venvs/mempalace/bin/python \
    -c 'import mempalace; print(mempalace.__version__)' 2>/dev/null || echo unknown)"

echo "mempalace version : $MEMPALACE_VERSION"
echo "site-packages     : $VENV_SITE"
echo "patches dir       : $PATCHES_DIR"
[[ $DRY_RUN -eq 1 ]] && echo "(dry-run — no changes will be made)"
echo ""

APPLIED=0
SKIPPED=0
FAILED=0

for patch in "$PATCHES_DIR"/*.patch; do
    [[ -f "$patch" ]] || continue
    name="$(basename "$patch")"

    # Check if already applied
    if patch --dry-run -p1 -R --quiet -d "$VENV_SITE" < "$patch" 2>/dev/null; then
        echo "  [already applied] $name"
        ((SKIPPED++)) || true
        continue
    fi

    # Check if applicable
    if ! patch --dry-run -p1 --quiet -d "$VENV_SITE" < "$patch" 2>/dev/null; then
        echo "  [CONFLICT] $name  <-- upstream may have changed this code; review manually"
        ((FAILED++)) || true
        continue
    fi

    if [[ $DRY_RUN -eq 1 ]]; then
        echo "  [would apply] $name"
        ((APPLIED++)) || true
    else
        patch -p1 -d "$VENV_SITE" < "$patch"
        echo "  [applied] $name"
        ((APPLIED++)) || true
    fi
done

echo ""
echo "Results: $APPLIED applied, $SKIPPED already-applied, $FAILED conflicts"

if [[ $FAILED -gt 0 ]]; then
    echo ""
    echo "Action required: $FAILED patch(es) conflicted."
    echo "Check if upstream fixed the issue — if so, remove the patch file."
    echo "Otherwise update the patch to match the new upstream code."
    exit 1
fi

if [[ $DRY_RUN -eq 0 && $APPLIED -gt 0 ]]; then
    echo ""
    echo "Restart the daemon to pick up changes:"
    echo "  sudo systemctl restart palace-daemon"
fi
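The script's per-patch decision rests on two probes: `patch --dry-run -R` succeeds only if the patch is already applied, and `patch --dry-run` succeeds only if it can still apply cleanly. The resulting classification, isolated as a sketch (hypothetical function, not part of the repo):

```python
def plan(applies_forward: bool, reverses: bool, dry_run: bool) -> str:
    """Classify a patch the way apply_patches.sh does.

    applies_forward: would `patch --dry-run` succeed?
    reverses:        would `patch --dry-run -R` succeed?
    """
    if reverses:                 # reverse-applies cleanly => already in place
        return "already applied"
    if not applies_forward:      # neither direction works => upstream changed
        return "conflict"
    return "would apply" if dry_run else "apply"
```

Conflicts are the only outcome that needs a human: either upstream fixed the issue (delete the patch) or the patch must be rebased onto the new code.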
