feat(backends): quarantine_stale_hnsw — recover from HNSW/sqlite drift (closes #823) #1000
Conversation
Pull request overview
Adds an operator-facing recovery primitive to mitigate ChromaDB HNSW-on-disk drift vs chroma.sqlite3 by quarantining suspected-stale HNSW segment directories, addressing the read-path crash/staleness failure mode described in #823.
Changes:
- Introduce `quarantine_stale_hnsw(palace_path, stale_seconds=3600)` to rename stale HNSW segment dirs to a `.drift-YYYYMMDD-HHMMSS` suffix based on mtime skew vs `chroma.sqlite3`.
- Add pytest coverage for renaming behavior, no-op behavior on fresh segments, missing-palace safety, and skipping already-quarantined directories.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `mempalace/backends/chroma.py` | Adds the `quarantine_stale_hnsw` helper that quarantines stale HNSW segments via renaming. |
| `tests/test_backends.py` | Adds tests validating the quarantine heuristic and safety/no-op cases. |
```python
def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 3600.0) -> list:
    """Rename HNSW segment dirs whose files are stale vs. chroma.sqlite3.

    When a ChromaDB 1.5.x PersistentClient opens a palace whose on-disk
    HNSW segment is significantly older than ``chroma.sqlite3``, the Rust
    graph-walk can dereference dangling neighbor pointers for entries that
    exist in the metadata segment but not in the HNSW index, and segfault
    in a background thread on the next ``count()`` or ``query(...)`` call.

    This is the same failure mode reported at #823 (semantic search stale
    after ``add_drawer``), observed at neo-cortex-mcp#2 (SIGSEGV on
    ``count()`` with chromadb 1.5.5), and acknowledged as by-design at
    chroma-core/chroma#2594. On one fork palace (135K drawers), the drift
    caused a 65–85% crash rate on fresh-process opens; fresh-process
    crash rate dropped to 0% after the segment dir was renamed out of the
    way and ChromaDB rebuilt lazily.

    Heuristic: if ``chroma.sqlite3`` is more than ``stale_seconds`` newer
    than the segment's ``data_level0.bin``, the segment is considered
    suspect and renamed to ``<uuid>.drift-<timestamp>``. ChromaDB reopens
    cleanly without it and writes fresh index files on next use. The
    original directory is renamed, not deleted, so recovery remains
    possible if the heuristic misfires.

    The default threshold (1h) is deliberately conservative — ChromaDB's
    HNSW flush cadence means legitimate drift is normally on the order of
    seconds to minutes. A segment that is more than an hour out of date is
    almost certainly in a "crashed mid-write" state.

    Args:
        palace_path: path to the palace directory containing ``chroma.sqlite3``
        stale_seconds: minimum mtime gap to treat a segment as stale

    Returns:
        List of paths that were quarantined (empty if nothing drifted).
    """
    db_path = os.path.join(palace_path, "chroma.sqlite3")
    if not os.path.isfile(db_path):
        return []
    try:
        sqlite_mtime = os.path.getmtime(db_path)
    except OSError:
        return []

    moved: list = []
    try:
```
Type hints here are overly generic: the function returns a list of quarantined path strings, but the signature uses `-> list` and `moved: list = []`. Elsewhere in the repo newer built-in generics are used (e.g. `list[str]`). Consider updating this to `-> list[str]` and `moved: list[str] = []` (`stale_seconds: float` already matches).
Thanks @copilot-pull-request-reviewer — addressed in c6413fd: tightened the signature to `-> list[str]` and `moved: list[str] = []`.
README:
- Bump drawer count 134K → 137K (matches current palace)
- Update "Open upstream PRs" table with actual reviewer status, add MemPalace#999 and MemPalace#1000, drop stale "rebased against MemPalace#863" note since MemPalace#863 is in v3.3.1

CLAUDE.md:
- Rename section header v3.3.0 → v3.3.1 (we merged v3.3.1 earlier today)
- Add fork changes 11-13: None-metadata guards, .blob_seq_ids_migrated marker, quarantine_stale_hnsw
- Replace "Pulled in from upstream v3.3.1" summary: i18n, BCP-47 locales, UTF-8 Path.read_text, non-blocking precompact (MemPalace#863), silent_save honoring (MemPalace#966) — ours still broader
- Update the open-upstream-PRs table to 7 open, include MemPalace#999 + MemPalace#1000, and refresh each row's status line to match what's actually on GitHub (MERGEABLE flags, @bensig ping context, Copilot review addressed, etc.)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Force-pushed from `c6413fd` to `055386e`.
Rebased onto latest `develop` after MemPalace#995 merged.
README:
- Fork-changes table: expand the None-metadata row to cover all 8 sites (searcher.py CLI + API + closet-boost, miner.status, 4 mcp_server handlers). Previous row only called out the CLI print path.
- Add a new Search row: warnings + sqlite BM25 top-up contract (the "never silent miss" feature) with pointer to MemPalace#951 + MemPalace#823.
- Open-PR table: expand MemPalace#999 scope line to mention 8 sites + architectural note, update MemPalace#1000 to reflect the post-MemPalace#995 rebase, add MemPalace#1005 with Copilot fixes noted.

CLAUDE.md:
- PR status header: 7 open -> 8 open (adds MemPalace#1005).
- Same PR row updates as README for MemPalace#999/MemPalace#1000/MemPalace#1005.
- Fork Changes list: expand entry 11 (None guards) to 8 sites + adapter consolidation proposal on MemPalace#999; add entry 14 for the warnings+BM25 feature; keep 12 and 13 as-is.

42 README-claim tests still pass.
Concrete evidence:
- New "What this looks like in practice" section right after the status line, showing a real stop-hook systemMessage output and a real mempalace_search response shape (warnings, available_in_scope, matched_via tags, similarity scores). Someone evaluating the fork can see what "running in production" actually looks like.

Headlines box:
- New "Headlines" subsection at the top of Fork Changes with the three differentiators someone should know if they only read one section — silent-save hook, ChromaDB 1.5.x hardening (quarantine + None guards), search-never-misses contract. Links to MemPalace#673/MemPalace#999/MemPalace#1000/MemPalace#1005 so readers can jump to the work itself.

Citations for comparison table:
- Every row now links to its upstream repo: Hindsight (vectorize-io), Mem0 + OpenMemory (mem0ai), Cognee (topoteretes), Letta (letta-ai), engram (NickCirv), CaviraOSS OpenMemory. Cognee row updated since they've added MCP support since we first wrote the row.
- Replaced the "Systems mentioned without captured primary URLs" footnote (now stale since we have them) with a "Verification note" that's honest about the point-in-time nature of these columns and explains why TagMem is absent.

Structure cleanups:
- Removed the upstream MemPalace logo at the top — it's milla-jovovich's asset and using it in a fork README is awkward.
- Renamed "Roadmap" → "Planned work" — the section is decisive with priorities and time estimates; "Roadmap" was underselling it.

No content removed from the document beyond the stale footnote and the upstream logo. All existing sections intact. 42 README-claim tests pass.
Every remaining row in "Still ahead of upstream" now carries a status so the reader can tell at a glance whether each change is being upstreamed, pending a PR, or deliberately fork-local.

Dropped:
- "Guard ChromaDB 1.5.x metadata-mismatch segfault" — this row was overstated. The memory file for today's debugging notes that the try-get/except-create pattern is defensive code that never reproduced a specific crash (the actual crashes traced to HNSW drift). Leaving it in "Still ahead" implied an upstream-candidate fix, which it isn't. Code stays in place as defensive, but the README no longer claims it as a fork-ahead feature.

Moved to Superseded:
- "Stale HNSW mtime detection + mempalace_reconnect" — upstream took a different approach in MemPalace#757. Our broader inode+mtime detection and the mempalace_reconnect MCP tool remain as fork-local convenience; they're just not "ahead of upstream" anymore.

Statuses now populated:
- Linked PR number for the 7 changes with active upstream PRs (MemPalace#659, MemPalace#660, MemPalace#661, MemPalace#673 with APPROVED note, MemPalace#999, MemPalace#1000, MemPalace#1005).
- "PR pending" for 3 items that are good candidates but unfiled: epsilon mtime comparison, max_distance parameter, tool output mining.
- "fork-only" for 2 items we keep intentionally without pitching upstream: .blob_seq_ids_migrated marker (narrow), bulk_check_mined (complementary to upstream's MemPalace#784 file-locking).

Legend sentence added above the table explains the three status values. 42 README-claim tests pass.
Dialectician
left a comment
Read the diff. Logic is right: compare data_level0.bin mtime to chroma.sqlite3 mtime, rename the segment directory if the gap exceeds threshold. Rename-not-delete is the right call so recovery is reversible by hand if the heuristic ever misfires. Default stale_seconds=3600.0 is conservative enough that a quick write-then-read cycle will not misclassify a fresh segment.
Tests cover the two main branches (drifted segment renamed, fresh segment untouched) with clean fixtures. Closes #823 credibly.
Two small questions, both follow-ups not blockers:

- Multi-segment palaces: the loop iterates all subdirs with `data_level0.bin`, but if two segments drift, the second rename could collide with a name from the first. Probably safe because the `.drift-<ts>` suffix is unique per call, but worth a quick verification.
- Logging: `logger.exception` on rename failure is right. Might also log the sqlite/hnsw mtime delta when quarantining, so a postmortem grep tells you how stale the segment was.
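The mtime-delta logging the review suggests could be a one-line addition at the rename site. A sketch, where the `log_quarantine` helper name and the message wording are hypothetical:

```python
import logging

logger = logging.getLogger("mempalace.backends.chroma")


def log_quarantine(seg_dir: str, sqlite_mtime: float, hnsw_mtime: float,
                   stale_seconds: float) -> str:
    """Emit a postmortem-friendly line recording how stale the segment was."""
    delta = sqlite_mtime - hnsw_mtime  # staleness in seconds
    msg = (f"quarantining {seg_dir}: data_level0.bin is {delta:.0f}s older "
           f"than chroma.sqlite3 (threshold {stale_seconds:.0f}s)")
    logger.warning(msg)
    return msg
```

A grep for `quarantining` then answers the "how stale was it" question without reconstructing mtimes from the renamed directory.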
Peace B.Sweet
Addressed Copilot type hint nit: `-> list[str]` and `moved: list[str] = []`.
Thanks for reading the diff carefully — the rename-not-delete design was exactly the choice I sweated over most, so good to have it independently confirmed. The 1h default being conservative-by-design is also appreciated; a false positive here (quarantining a healthy segment) would be annoying to recover from, so erring cautious is the right call. Fixed the type-hint nit Copilot flagged in c6413fd.
Addresses Copilot nit on MemPalace#1000. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…review addressed

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add a helper that renames HNSW segment directories whose `data_level0.bin` is significantly older than `chroma.sqlite3`.

Drift between the on-disk HNSW graph and the live embeddings table is the root cause of a segfault class where the Rust graph-walk dereferences dangling neighbor pointers for entries in the metadata segment that no longer exist in the HNSW index, crashing in a background thread on `count()` or `query()`.

Issue MemPalace#823 describes the same drift as a silent-staleness symptom (semantic search returns stale results after `add_drawer` because `data_level0.bin` lags the sqlite metadata under the default `sync_threshold=1000`). Under heavier load or after an interrupted write, the same drift can escalate from "silent stale results" to "SIGSEGV on next open," which is the failure mode observed at neo-cortex-mcp#2 (chromadb 1.5.5, Python 3.12) and acknowledged at chroma-core/chroma#2594.

On one 135K-drawer palace where `index_metadata.pickle` claimed 137,813 elements against 135,464 rows in sqlite (2,349-entry drift), fresh Python processes crashed in `col.count()` 17/20 times; after renaming the segment dir out of the way and letting ChromaDB rebuild lazily, the same 20-run check went to 0 crashes.

The recovery path MemPalace#823 suggests (export / recreate / reimport) is heavy — it re-embeds every drawer. This helper is lighter: rename the segment dir so ChromaDB reopens without it, and the indexer rebuilds lazily on the next write. The original directory is renamed (not deleted) so the operator can recover if the heuristic misfires.

If `chroma.sqlite3` is more than `stale_seconds` (default 3600) newer than the segment's `data_level0.bin`, the segment is considered suspect. One hour is deliberately conservative — normal HNSW flush cadence is seconds to minutes, so an hour of drift implies a crashed mid-write, not routine lag.

- Additive: exposes `quarantine_stale_hnsw(palace_path, stale_seconds)` as a helper. Not wired into `_client()` / startup on this PR — the goal is to land the primitive first so operators and higher layers can opt in. A follow-up could call it automatically on palace open behind an env var or config flag.
- Closes MemPalace#823 by giving operators a first-class recovery path without having to install `chromadb-ops` or re-mine.

Four new tests in `tests/test_backends.py`:
- renames drifted segment, preserves original files under `.drift-TS` suffix
- leaves fresh segments alone
- no-op on missing palace path / missing `chroma.sqlite3`
- skips already-quarantined (`.drift-` suffixed) directories

`pytest tests/test_backends.py` → 11 passed. `ruff check` / `ruff format --check` — clean.
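The first of the four tests might be shaped like this. A self-contained sketch: the `quarantine_stale_hnsw` below is a minimal stand-in reimplementation so the snippet runs in isolation; in the repo the import would come from `mempalace.backends.chroma`, and the real test bodies may differ.

```python
import datetime
import os
import time


def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 3600.0) -> list:
    # Minimal stand-in so this sketch runs in isolation; the real helper
    # lives in mempalace/backends/chroma.py.
    db = os.path.join(palace_path, "chroma.sqlite3")
    if not os.path.isfile(db):
        return []
    sqlite_mtime = os.path.getmtime(db)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    moved = []
    for name in os.listdir(palace_path):
        seg = os.path.join(palace_path, name)
        level0 = os.path.join(seg, "data_level0.bin")
        if ".drift-" in name or not os.path.isfile(level0):
            continue
        if sqlite_mtime - os.path.getmtime(level0) > stale_seconds:
            os.rename(seg, f"{seg}.drift-{stamp}")
            moved.append(f"{seg}.drift-{stamp}")
    return moved


def test_quarantine_stale_hnsw_renames_drifted_segment(tmp_path):
    (tmp_path / "chroma.sqlite3").write_bytes(b"")
    seg = tmp_path / "0f0e0d0c-aaaa-bbbb-cccc-000000000001"
    seg.mkdir()
    (seg / "data_level0.bin").write_bytes(b"\x00")
    two_hours_ago = time.time() - 7200          # well past the 3600s default
    os.utime(seg / "data_level0.bin", (two_hours_ago, two_hours_ago))

    moved = quarantine_stale_hnsw(str(tmp_path))

    assert len(moved) == 1 and ".drift-" in moved[0]
    assert not seg.exists()  # original name gone
    assert os.path.isfile(os.path.join(moved[0], "data_level0.bin"))  # files preserved
```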
Force-pushed from `055386e` to `0c38dea`.
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code
…MemPalace#1036

Overnight/morning:
- MemPalace#681, MemPalace#1000, MemPalace#1023 merged — moved from "Still ahead" to "Merged upstream (post-v3.3.1)"
- bensig reviewed MemPalace#659 (wing_ prefix + agent filter) and MemPalace#1021 (silent_guard default) — both addressed on their PR branches
- MemPalace#673 needed re-rebase after overnight develop merges; done
- MemPalace#1036 filed: paginate miner.status(), closes upstream MemPalace#802 and MemPalace#1015

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Version bumps across pyproject.toml, mempalace/version.py, README badge, uv.lock, and plugin manifests (.claude-plugin/*, .codex-plugin/*). CHANGELOG aligned with main (post-3.3.1) and a new [3.3.2] section added covering the 11 PRs merged on develop since v3.3.1 — silent-transcript-drop fix + tandem sweeper (#998), None-metadata guards (#999, #1013), chromadb ≥1.5.4 for Py 3.13/3.14 (#1010), Windows Unicode (#681), HNSW quarantine recovery (#1000), PID stacking guard (#1023), doc-path cleanup (#996, #1012), and RFC 001/002 internal scaffolding (#995, #1014, #990).
Resolutions:
- `.claude-plugin/.mcp.json`, `plugin.json` — adopt upstream's `mempalace-mcp` console-script command (added via upstream MemPalace#340 for pipx/uv). Run `pip install -e .` in the plugin venv after merge to install the entry point.
- `.claude-plugin/hooks/*.sh`, `hooks/*.sh` — adopt upstream's console-command resolution order (`mempalace` script → `python3 -m mempalace` → `python`). `MEMPAL_PYTHON` override still works inside `hooks_cli.py`.
- `mempalace/hooks_cli.py`, `tests/test_hooks_cli.py` — keep fork's `_mempalace_python()` helper (fork-ahead #4); upstream only had `sys.executable`, which loses the MEMPAL_PYTHON override.
- `mempalace/miner.py` — keep fork's concurrent mining path (fork-ahead #1), apply upstream's unicode-`✓` → ASCII-`+` fix (MemPalace#681) to both paths.
- `mempalace/backends/chroma.py` — take upstream's refined `quarantine_stale_hnsw` docstring (it's the version merged via our own MemPalace#1000).

Brought in: 33 upstream commits including Belarusian/Chinese/German/Spanish/French entity detection, console-script entry points, hook plugin-root space quoting, and the v3.3.2 tag (which contains our MemPalace#681/MemPalace#1000/MemPalace#1023).

Tests: 1096 passed, 106 deselected (benchmarks). Ruff clean.
…#1023 merged, MemPalace#673 rebased

- Bump fork-ahead section header to "after v3.3.2"
- Strike #11 (quarantine_stale_hnsw) and MemPalace#18 (PID file guard) as merged-into-upstream-via-v3.3.2, keep entries for traceability
- Add v3.3.2-shipped items to "Merged into upstream (post-v3.3.1)"
- Rebuild PR table: 10 merged / 7 open / 7 closed; add MemPalace#1024 row, reclassify MemPalace#681/MemPalace#1000/MemPalace#1023 as merged, note MemPalace#673 rebased 2026-04-21
- Annotate MemPalace#661 status with the GitHub review-state machine caveat (CHANGES_REQUESTED persists until the reviewer dismisses it, not an outstanding owe)
- Bump test count 1063 → 1096 post-merge
…unts

- Upstream released v3.3.2 on 2026-04-21 (our MemPalace#681/MemPalace#1000/MemPalace#1023)
- Drawer count 152,682 → 165,632; wings 23 → 28; tests 1063 → 1096
- MemPalace#673 now MERGEABLE after fresh rebase + squash to 1 commit
- MemPalace#661 status clarified: CHANGES_REQUESTED persists until the reviewer dismisses it, not an outstanding owe
- Merged-upstream section: split out v3.3.2 release group
Port quarantine_stale_hnsw from upstream MemPalace#1000 (0c38dea) and wire it into _client() / make_client() so palaces with drifted HNSW segments self-recover on open instead of segfaulting in the Rust graph walk.

Hit on personal-v2 today: 125k-embedding palace, three segment dirs with data_level0.bin 8-9 days older than chroma.sqlite3 after an interrupted write. Every fresh client open SIGSEGV'd in chromadb_rust_bindings at the same offset (KERN_INVALID_ADDRESS at 0x44, dangling neighbor pointer in the HNSW graph walk). Renaming the segment dirs out of the way let chroma rebuild lazily from sqlite with zero data loss.

Differs from upstream by being on-by-default rather than operator-opt-in; opt out with MEMPALACE_HNSW_NO_QUARANTINE=1. The stat-call overhead is microseconds; the cost of skipping is a hard crash on the next query() or count(). Threshold unchanged at 3600s — legitimate HNSW flush lag is seconds to minutes, so an hour of drift implies a crashed mid-write.

Tests: 11/11 in test_backends.py (4 new for the helper).
Summary
Adds a `quarantine_stale_hnsw(palace_path, stale_seconds=3600)` helper that renames HNSW segment directories whose `data_level0.bin` is significantly older than `chroma.sqlite3`. Drift between the on-disk HNSW graph and the live `embeddings` table in sqlite is the root cause of the bug #823 describes — and, under heavier load or after an interrupted write, the same drift escalates from "silent stale results" to a SIGSEGV where the Rust graph-walk dereferences dangling neighbor pointers for entries that exist in the metadata segment but not in the HNSW index.

Closes #823.
The drift, in data
On one 135K-drawer palace where the bug reproduced, `index_metadata.pickle` claimed `total_elements_added: 137813` against `SELECT COUNT(*) FROM embeddings` = 135464 — a 2,349-entry drift. Fresh Python processes crashed in `col.count()` on 17/20 attempts. After renaming the segment dir out of the way and letting ChromaDB rebuild lazily, the same 20-run check went to 0 crashes and queries returned reasonable distances against the same underlying drawers.
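The 20-run fresh-process check can be reproduced with a harness along these lines. A sketch: the palace path and collection name are placeholders, and the `crash_rate` helper is mine, not part of the PR.

```python
import subprocess
import sys

# The child runs in a brand-new interpreter: a SIGSEGV in the Rust
# graph-walk kills the whole child process, which an in-process
# try/except could never observe.
CHILD = """
import chromadb
client = chromadb.PersistentClient(path="/path/to/palace")   # placeholder path
print(client.get_collection("drawers").count())              # placeholder name
"""


def crash_rate(runs: int = 20) -> int:
    crashes = 0
    for _ in range(runs):
        proc = subprocess.run([sys.executable, "-c", CHILD], capture_output=True)
        if proc.returncode != 0:  # SIGSEGV surfaces as a negative code on POSIX
            crashes += 1
    return crashes


if __name__ == "__main__":
    print(f"crashes: {crash_rate()}/20")
```

Counting non-zero child exits rather than catching exceptions is the point: the failure mode is a process-level crash, not a Python exception.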
The same failure mode is tracked publicly at neo-cortex-mcp#2 (chromadb 1.5.5, Python 3.12, SIGSEGV on `.count()` with the deep-recursion stack signature) and acknowledged at chroma-core/chroma#2594 (HNSW drift is by-design under the default `sync_threshold`).

Why not #823's suggested recovery
#823 suggests "export all drawers, recreate the collection with `sync_threshold=100`, re-import." That works, but it re-embeds every drawer — expensive on a 100K+ palace and requires the collection to be readable in the first place, which a drifted palace sometimes isn't. This helper is lighter:

- The drawer data and metadata in `chroma.sqlite3` are untouched.
- The segment directory is renamed to `<uuid>.drift-YYYYMMDD-HHMMSS` so operators can recover if the heuristic misfires.

Heuristic
If `chroma.sqlite3` is more than `stale_seconds` (default 3600) newer than the segment's `data_level0.bin`, the segment is considered suspect. One hour is deliberately conservative — normal HNSW flush cadence is seconds to minutes, so an hour of drift implies a crashed mid-write, not routine lag.

Scope
- Additive: exposes `quarantine_stale_hnsw()` as a helper. Intentionally not wired into `_client()` / startup on this PR — the goal is to land the primitive first so operators and higher layers can opt in. A natural follow-up is to call it automatically on palace open behind an env var or a config flag; happy to do that in a second PR once this shape is agreed.
- Orthogonal to the `hnsw:num_threads=1` workaround and to "fix: degrade gracefully when filtered Chroma search fails" #951 (graceful fallback when filtered queries fail) — those are sibling failure modes. This PR addresses the READ-path crash on drifted segments, which neither of them touches.
- Tuning `sync_threshold` (the other half of "ChromaDB: HNSW persistence vs add_drawer — default sync_threshold can leave semantic search stale" #823) can still be done separately and remains a reasonable addition.

Tests
Four new tests in `tests/test_backends.py`:

- `test_quarantine_stale_hnsw_renames_drifted_segment` — 2h drift: renamed, original files preserved under the `.drift-TS` suffix.
- `test_quarantine_stale_hnsw_leaves_fresh_segment_alone` — 10s drift: not touched.
- `test_quarantine_stale_hnsw_no_palace` — missing palace / missing `chroma.sqlite3`: returns `[]`, no raise.
- `test_quarantine_stale_hnsw_skips_already_quarantined` — directories already carrying the `.drift-` suffix are never re-renamed.

`pytest tests/test_backends.py` → 11 passed. `ruff check` / `ruff format --check` — clean.

Credit
Diagnosis and fix worked out in jphein/mempalace@fdfacdf while chasing the read-path SIGSEGV on a production palace. The mtime heuristic is a deliberate compromise (no `pickle.load` on `index_metadata.pickle` — linting policy in the downstream didn't allow it); happy to make it authoritative (read `total_elements_added` vs the sqlite count) on the upstream path if you'd prefer.