
Commit d0aabb9

rboarescu and claude committed

fix: harden _get_collection, watchdog, health check (v1.5.1)

- Log exceptions in _get_collection instead of silent None return
- Auto-retry once after cache clear on collection open failure
- Enforce hnsw:num_threads=1 on every collection open (ChromaDB #1161)
- /health returns HTTP 503 degraded when collection unavailable
- Systemd watchdog: READY=1 on startup, WATCHDOG=1 every 60s gated on live collection check
- Warmup now opens collection directly so num_threads fix applies before startup warning check
- patches/mcp_server_get_collection.patch + scripts/apply_patches.sh for post-upgrade re-apply
- Service: Type=notify, NotifyAccess=main, WatchdogSec=120

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

1 parent a64244c commit d0aabb9

7 files changed

Lines changed: 260 additions & 10 deletions


CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -1,5 +1,18 @@
 # Changelog
 
+## [1.5.1] - 2026-04-26
+
+### Fixed
+- **`_get_collection` silent failures** -- exceptions are now logged (palace path + error) instead of silently returning `None`. Cache-staleness incidents are now visible in the daemon log.
+- **Stale collection cache self-healing** -- `_get_collection` retries once after clearing all caches (`_client_cache`, `_collection_cache`, `_metadata_cache`) on failure. The incident that required a manual daemon restart now self-heals on the next tool call.
+- **HNSW `num_threads=1` enforced on every open** -- `_get_collection` calls `collection.modify()` after every open, merging `hnsw:num_threads=1` into existing metadata. ChromaDB 1.5.x does not persist HNSW metadata across reopens (issue #1161); without this, every cache clear silently re-enabled parallel inserts and risked SIGSEGV under concurrent writes.
+- **`/health` reflects actual palace state** -- previously returned HTTP 200 `ok` even when the collection was broken (false healthy). Now calls `_get_collection()` and returns HTTP 503 `degraded` if the palace is unavailable.
+
+### Added
+- **Systemd watchdog** -- daemon sends `READY=1` on startup and `WATCHDOG=1` every `WatchdogSec/2` seconds via `sd_notify` (stdlib-only, no external deps). Watchdog pings are gated on a live `_get_collection()` check: if the palace goes dark, the watchdog goes silent and systemd kills and restarts the daemon.
+- `palace-daemon.service` updated: `Type=simple` changed to `Type=notify`, `NotifyAccess=main` added, `WatchdogSec=120` added. Re-install: `sudo cp palace-daemon.service /etc/systemd/system/ && sudo systemctl daemon-reload && sudo systemctl restart palace-daemon`.
+- **Startup warmup opens the collection** -- lifespan warmup now calls `_get_collection(create=True)` directly instead of `ping`. `ping` never touches the collection, so `num_threads=1` was not applied before `_warn_if_hnsw_threads_unset` ran at startup, causing a spurious warning on every boot. The warning is now silent on a healthy palace.
+
 ## [1.5.0] - 2026-04-24
 
 ### Added

CLAUDE.md

Lines changed: 13 additions & 2 deletions
@@ -3,17 +3,28 @@
 ## Core Mandates
 
 ### 1. SSH-Friendly Feedback
-- **Always** provide a concise, one-line terminal confirmation (e.g., '📥 Filed to {room}') after filing memories via the MemPalace MCP.
+- **Always** provide a concise, one-line terminal confirmation (e.g., '📥 Filed to {room}') after filing memories via the MemPalace MCP.
 - Do not rely on desktop notifications as the user is often on SSH.
 
 ### 2. Post-Phase Documentation
 - At the end of every work phase, systematically update the project's `README.md` or `CHANGELOG.md`.
 - **Mandatory:** File a roadmap update to the corresponding room in the 'lab_projects' wing via MemPalace.
 
 ### 3. Service Management
-- **System Service Only:** ALWAYS manage `palace-daemon` via `sudo systemctl [start|stop|restart] palace-daemon`.
+- **System Service Only:** ALWAYS manage `palace-daemon` via `sudo systemctl [start|stop|restart] palace-daemon`.
 - **No Manual Starts:** NEVER start the daemon manually via `python3 main.py`. Manual startup is blocked by default and requires the `--manual` flag; only use this for isolated debugging.
 
 ### 4. Memory Protocol
 - **Silent Mode:** Ensure `silent_save` is enabled in MemPalace settings to prevent blocking the chat flow.
 - **Roadmap Sync:** Before finishing, check the 'lab_projects' wing to ensure the next steps are clearly documented for the next session.
+
+### 5. Upgrading mempalace
+After `pipx upgrade mempalace`, always re-apply local patches and restart:
+
+    bash /home/radu/palace-daemon/scripts/apply_patches.sh
+    sudo systemctl restart palace-daemon
+
+If a patch conflicts, the script will say so. Check whether upstream fixed the issue — if so, delete the patch file. Otherwise update the patch to match the new code.
+
+Patches live in `patches/`. Current patches:
+- `mcp_server_get_collection.patch` -- `_get_collection`: exception logging, auto-retry on cache failure, `hnsw:num_threads=1` enforcement (workaround for ChromaDB issue #1161)

README.md

Lines changed: 14 additions & 1 deletion
@@ -21,6 +21,9 @@ To prevent database corruption, this project enforces a strict **Single-Process
 ## Features
 
 - **Self-Healing Startup** — `--force` flag automatically clears stale processes on the target port
+- **Collection cache auto-retry** -- if the internal ChromaDB collection cache goes stale, `_get_collection` clears all caches and retries once automatically before returning an error
+- **HNSW thread safety** -- `num_threads=1` is enforced on every collection open, not just creation; prevents SIGSEGV from parallel inserts after any cache clear (ChromaDB 1.5.x issue #1161)
+- **Systemd watchdog** -- sends `READY=1` on startup and `WATCHDOG=1` every 60s (gated on a live collection check); systemd restarts the daemon if the palace goes dark
 - **Protected Manual Start** — requires `--manual` flag for debugging, preventing accidental agent starts
 - **MCP proxy** — any MCP client connects to /mcp instead of spawning a local process
 - **REST API** — search, store, and query the palace over HTTP (Android app, netdash, scripts)
@@ -91,8 +94,18 @@ Only runs while you're logged in. Use this if you don't have sudo or only need t
 
 Edit `palace-daemon.service` to set `PALACE_API_KEY` or a custom `--palace` path before installing.
 
+The service uses `Type=notify` and `WatchdogSec=120`: the daemon signals systemd when it is ready and sends a watchdog heartbeat every 60 s. If the watchdog goes silent (e.g. the palace collection breaks), systemd kills and restarts the daemon automatically.
+
 ## Troubleshooting
 
+### Palace reports `degraded` on `/health`
+The daemon is running but cannot open the ChromaDB collection. Since 1.5.1, `_get_collection` will attempt a self-heal automatically on the next tool call. If it persists:
+
+    curl -X POST http://localhost:8085/reload    # clear client cache
+    sudo systemctl restart palace-daemon    # full restart if reload fails
+
+Check `journalctl -u palace-daemon -n 50` for the logged exception — it will now show the exact error instead of a silent `None`.
+
 ### Port 8085 already in use
 If the daemon fails to start with `[Errno 98] address already in use`, it usually means a previous instance didn't shut down cleanly.
 
@@ -109,7 +122,7 @@ To manually clear the lock and port without starting:
 
 | Method | Endpoint | Description |
 |--------|----------|-------------|
-| GET | /health | Daemon + palace status (inc. version) |
+| GET | /health | Daemon + palace status; returns HTTP 503 `degraded` if collection is unavailable |
 | POST | /backup | Atomic verified SQLite backup |
 | POST | /reload | Clear client cache / refresh index |
 | POST | /repair | Coordinate repair with daemon traffic (`mode`: `light`/`scan`/`prune`/`rebuild`) |
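For external monitoring, the `/health` status code alone distinguishes healthy from degraded. A minimal probe sketch (assumes the daemon's default port 8085; the helper names are illustrative, not part of the daemon):

```python
import json
import urllib.error
import urllib.request

def classify(status_code: int) -> str:
    """Map a /health HTTP status code to a palace state."""
    return {200: "ok", 503: "degraded"}.get(status_code, "unknown")

def check_health(url: str = "http://localhost:8085/health") -> tuple[int, str]:
    """Return (http_status, body_status) from the daemon's /health endpoint.

    A 503 response raises HTTPError, but the error object is file-like and
    still carries the JSON payload, so the body status is read from it.
    """
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, json.load(resp).get("status", "unknown")
    except urllib.error.HTTPError as e:
        return e.code, json.load(e).get("status", "unknown")
```

A cron or monitoring job can alert on anything other than `"ok"` without parsing the full payload.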

main.py

Lines changed: 66 additions & 6 deletions
@@ -39,7 +39,7 @@
 
 # ── Config (env vars override CLI defaults) ───────────────────────────────────
 
-VERSION = "1.5.0"
+VERSION = "1.5.1"
 DEFAULT_HOST = os.getenv("PALACE_HOST", "0.0.0.0")
 DEFAULT_PORT = int(os.getenv("PALACE_PORT", "8085"))
 DEFAULT_PALACE = os.getenv("PALACE_PATH", "")
@@ -67,6 +67,47 @@
 _log = logging.getLogger("palace-daemon")
 
 
+# ── Systemd watchdog / sd_notify ─────────────────────────────────────────────
+
+def _sd_notify(msg: str) -> None:
+    """Send a message to systemd notify socket without external dependencies."""
+    sock_path = os.environ.get("NOTIFY_SOCKET", "")
+    if not sock_path:
+        return
+    try:
+        import socket as _sock
+        with _sock.socket(_sock.AF_UNIX, _sock.SOCK_DGRAM) as s:
+            # Abstract namespace sockets use NUL prefix; systemd uses @ prefix.
+            addr = chr(0) + sock_path[1:] if sock_path.startswith("@") else sock_path
+            s.sendto(msg.encode(), addr)
+    except Exception:
+        pass
+
+
+def _watchdog_interval() -> int:
+    """Return WatchdogSec in seconds from WATCHDOG_USEC (set by systemd), or 0."""
+    try:
+        return int(os.environ.get("WATCHDOG_USEC", "0")) // 1_000_000
+    except ValueError:
+        return 0
+
+
+async def _watchdog_loop(interval_secs: int) -> None:
+    """Ping systemd watchdog at half the watchdog interval, only when palace is healthy."""
+    tick = max(10, interval_secs // 2)
+    while True:
+        await asyncio.sleep(tick)
+        try:
+            loop = asyncio.get_running_loop()
+            col = await loop.run_in_executor(None, _mp._get_collection)
+            if col is not None:
+                _sd_notify("WATCHDOG=1\n")
+            else:
+                _log.warning("Watchdog: palace collection unavailable — skipping WATCHDOG=1")
+        except Exception as e:
+            _log.warning("Watchdog check failed: %s", e)
+
+
 async def _warn_if_hnsw_threads_unset() -> None:
     """Warn if hnsw:num_threads != 1 after a collection reopen.
 
@@ -321,16 +362,24 @@ async def lifespan(app: FastAPI):
     # Warm the ChromaDB client before accepting traffic. The Rust HNSW binding
     # occasionally segfaults on the very first request if opened cold; opening
     # it here (before yield) ensures the PersistentClient is fully initialized.
+    # We open the collection directly (not via ping) so that _get_collection's
+    # hnsw:num_threads=1 fix is applied before _warn_if_hnsw_threads_unset runs.
     try:
         loop = asyncio.get_running_loop()
-        await loop.run_in_executor(None, _mp.handle_request, {
-            "jsonrpc": "2.0", "id": "warmup", "method": "ping", "params": {}
-        })
+        await loop.run_in_executor(None, _mp._get_collection, True)
         logger.info("Palace client warmed up.")
     except Exception as e:
-        logger.warning("Warmup ping failed (non-fatal): %s", e)
+        logger.warning("Warmup collection open failed (non-fatal): %s", e)
     await _warn_if_hnsw_threads_unset()
 
+    # Signal systemd that startup is complete (Type=notify in service file).
+    _sd_notify("READY=1\n")
+
+    # Start systemd watchdog loop if WatchdogSec is configured.
+    wdog_secs = _watchdog_interval()
+    if wdog_secs > 0:
+        asyncio.create_task(_watchdog_loop(wdog_secs))
+        logger.info("Systemd watchdog active (interval=%ds, tick=%ds).", wdog_secs, max(10, wdog_secs // 2))
 
     yield
 
@@ -371,7 +420,18 @@ async def health():
     # Bypass semaphores — health must respond even when all slots are busy.
     loop = asyncio.get_running_loop()
     result = await loop.run_in_executor(None, _mp.handle_request, {"jsonrpc": "2.0", "id": 1, "method": "ping", "params": {}}) or {}
-    return {"status": "ok", "daemon": "palace-daemon", "version": VERSION, "palace": result}
+    # Test actual collection access so /health reflects true palace state.
+    palace_ok = False
+    try:
+        col = await loop.run_in_executor(None, _mp._get_collection)
+        palace_ok = col is not None
+    except Exception:
+        pass
+    status = "ok" if palace_ok else "degraded"
+    payload = {"status": status, "daemon": "palace-daemon", "version": VERSION, "palace": result}
+    if not palace_ok:
+        return JSONResponse(content=payload, status_code=503)
+    return payload
 
 
 @app.get("/search")
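Two small details in the watchdog plumbing above are worth isolating: systemd's `WATCHDOG_USEC` is in microseconds, and abstract-namespace notify sockets (paths starting with `@`) must have that prefix replaced with a NUL byte before `sendto`. A standalone sketch of both conversions (helper names are illustrative, not the daemon's):

```python
def notify_addr(sock_path: str) -> str:
    """Translate systemd's NOTIFY_SOCKET value into an AF_UNIX address.

    systemd writes abstract-namespace sockets with a leading '@'; the
    AF_UNIX API expects a leading NUL byte instead. Filesystem paths
    pass through unchanged.
    """
    return "\0" + sock_path[1:] if sock_path.startswith("@") else sock_path


def watchdog_tick(watchdog_usec: str) -> int:
    """Derive the heartbeat period in seconds from WATCHDOG_USEC.

    Matches the daemon's policy: ping at half the watchdog window, but
    never more often than every 10 s. Returns 0 when unset or invalid.
    """
    try:
        secs = int(watchdog_usec) // 1_000_000
    except ValueError:
        return 0
    return max(10, secs // 2) if secs > 0 else 0
```

With `WatchdogSec=120`, systemd exports `WATCHDOG_USEC=120000000`, so the daemon pings every 60 s, well inside the 120 s kill window.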

palace-daemon.service

Lines changed: 3 additions & 1 deletion
@@ -3,14 +3,16 @@ Description=palace-daemon — MemPalace HTTP/MCP gateway
 After=network.target
 
 [Service]
-Type=simple
+Type=notify
+NotifyAccess=main
 User=radu
 Group=radu
 WorkingDirectory=/home/radu/palace-daemon
 # Use the new --force flag to ensure port 8085 is cleared on every start
 ExecStart=/home/radu/.local/share/pipx/venvs/mempalace/bin/python main.py --force --palace /home/radu/.mempalace/palace
 Restart=on-failure
 RestartSec=5
+WatchdogSec=120
 StandardOutput=journal
 StandardError=journal
 Environment=PYTHONUNBUFFERED=1
patches/mcp_server_get_collection.patch (new file)

Lines changed: 76 additions & 0 deletions

--- a/mempalace/mcp_server.py
+++ b/mempalace/mcp_server.py
@@ -212,25 +212,56 @@
 
 
 def _get_collection(create=False):
-    """Return the ChromaDB collection, caching the client between calls."""
-    global _collection_cache, _metadata_cache, _metadata_cache_time
-    try:
-        client = _get_client()
-        if create:
-            _collection_cache = ChromaCollection(
-                client.get_or_create_collection(
-                    _config.collection_name, metadata={"hnsw:space": "cosine"}
+    """Return the ChromaDB collection, caching the client between calls.
+
+    Retries once on failure after clearing all caches (fixes stale-cache
+    breakage without requiring a daemon restart). Logs the exception so
+    failures are visible in the daemon log instead of silently returning None.
+    Sets hnsw:num_threads=1 on every open — ChromaDB 1.5.x does not persist
+    HNSW metadata across reopens, so parallel inserts stay disabled.
+    """
+    global _client_cache, _collection_cache, _metadata_cache, _metadata_cache_time
+    for attempt in range(2):
+        try:
+            client = _get_client()
+            if create:
+                _collection_cache = ChromaCollection(
+                    client.get_or_create_collection(
+                        _config.collection_name,
+                        metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
+                    )
+                )
+                _metadata_cache = None
+                _metadata_cache_time = 0
+            elif _collection_cache is None:
+                _collection_cache = ChromaCollection(
+                    client.get_collection(_config.collection_name)
                )
+                _metadata_cache = None
+                _metadata_cache_time = 0
+            # Re-apply num_threads=1 on every open since ChromaDB 1.5.x does
+            # not persist HNSW metadata across PersistentClient reopens (#1161).
+            if _collection_cache is not None:
+                try:
+                    existing = getattr(_collection_cache._collection, "metadata", {}) or {}
+                    if existing.get("hnsw:num_threads") != 1:
+                        _collection_cache._collection.modify(
+                            metadata={**existing, "hnsw:num_threads": 1}
+                        )
+                except Exception:
+                    pass
+            return _collection_cache
+        except Exception as e:
+            logger.error(
+                "_get_collection attempt %d failed (palace=%s): %s",
+                attempt + 1, _config.palace_path, e,
            )
-            _metadata_cache = None
-            _metadata_cache_time = 0
-        elif _collection_cache is None:
-            _collection_cache = ChromaCollection(client.get_collection(_config.collection_name))
-            _metadata_cache = None
-            _metadata_cache_time = 0
-        return _collection_cache
-    except Exception:
-        return None
+            if attempt == 0:
+                _client_cache = None
+                _collection_cache = None
+                _metadata_cache = None
+                _metadata_cache_time = 0
+    return None
 
 
 def _no_palace():

scripts/apply_patches.sh

Lines changed: 75 additions & 0 deletions
#!/usr/bin/env bash
# apply_patches.sh — re-apply local patches to the mempalace pipx install
# Run this after every: pipx upgrade mempalace
#
# Usage:
#   bash scripts/apply_patches.sh
#   bash scripts/apply_patches.sh --check   # dry-run, no changes

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PATCHES_DIR="$SCRIPT_DIR/../patches"
VENV_SITE="$(/home/radu/.local/share/pipx/venvs/mempalace/bin/python \
    -c 'import site; print(site.getsitepackages()[0])')"

DRY_RUN=0
[[ "${1:-}" == "--check" ]] && DRY_RUN=1

MEMPALACE_VERSION="$(/home/radu/.local/share/pipx/venvs/mempalace/bin/python \
    -c 'import mempalace; print(mempalace.__version__)' 2>/dev/null || echo unknown)"

echo "mempalace version : $MEMPALACE_VERSION"
echo "site-packages     : $VENV_SITE"
echo "patches dir       : $PATCHES_DIR"
[[ $DRY_RUN -eq 1 ]] && echo "(dry-run — no changes will be made)"
echo ""

APPLIED=0
SKIPPED=0
FAILED=0

for patch in "$PATCHES_DIR"/*.patch; do
    [[ -f "$patch" ]] || continue
    name="$(basename "$patch")"

    # Check if already applied
    if patch --dry-run -p1 -R --quiet -d "$VENV_SITE" < "$patch" 2>/dev/null; then
        echo "  [already applied] $name"
        ((SKIPPED++)) || true
        continue
    fi

    # Check if applicable
    if ! patch --dry-run -p1 --quiet -d "$VENV_SITE" < "$patch" 2>/dev/null; then
        echo "  [CONFLICT] $name  <-- upstream may have changed this code; review manually"
        ((FAILED++)) || true
        continue
    fi

    if [[ $DRY_RUN -eq 1 ]]; then
        echo "  [would apply] $name"
        ((APPLIED++)) || true
    else
        patch -p1 -d "$VENV_SITE" < "$patch"
        echo "  [applied] $name"
        ((APPLIED++)) || true
    fi
done

echo ""
echo "Results: $APPLIED applied, $SKIPPED already-applied, $FAILED conflicts"

if [[ $FAILED -gt 0 ]]; then
    echo ""
    echo "Action required: $FAILED patch(es) conflicted."
    echo "Check if upstream fixed the issue — if so, remove the patch file."
    echo "Otherwise update the patch to match the new upstream code."
    exit 1
fi

if [[ $DRY_RUN -eq 0 && $APPLIED -gt 0 ]]; then
    echo ""
    echo "Restart the daemon to pick up changes:"
    echo "  sudo systemctl restart palace-daemon"
fi
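The script's per-patch decision rests on two probes: `patch --dry-run -R` succeeds only if the patch is already applied, and `patch --dry-run` succeeds only if it can still apply cleanly. The resulting classification, isolated as a sketch (hypothetical function, not part of the repo):

```python
def plan(applies_forward: bool, reverses: bool, dry_run: bool) -> str:
    """Classify a patch the way apply_patches.sh does.

    applies_forward: would `patch --dry-run` succeed?
    reverses:        would `patch --dry-run -R` succeed?
    """
    if reverses:                 # reverse-applies cleanly => already in place
        return "already applied"
    if not applies_forward:      # neither direction works => upstream changed
        return "conflict"
    return "would apply" if dry_run else "apply"
```

Conflicts are the only outcome that needs a human: either upstream fixed the issue (delete the patch) or the patch must be rebased onto the new code.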
