Skip to content

feat(backends): quarantine_stale_hnsw — recover from HNSW/sqlite drift (closes #823)#1000

Merged
bensig merged 1 commit intoMemPalace:developfrom
jphein:fix/quarantine-stale-hnsw
Apr 19, 2026
Merged

feat(backends): quarantine_stale_hnsw — recover from HNSW/sqlite drift (closes #823)#1000
bensig merged 1 commit intoMemPalace:developfrom
jphein:fix/quarantine-stale-hnsw

Conversation

@jphein
Copy link
Copy Markdown
Collaborator

@jphein jphein commented Apr 18, 2026

Summary

Adds a quarantine_stale_hnsw(palace_path, stale_seconds=3600) helper that renames HNSW segment directories whose data_level0.bin is significantly older than chroma.sqlite3. Drift between the on-disk HNSW graph and the live embeddings table in sqlite is the root cause of the bug #823 describes — and, under heavier load or after an interrupted write, the same drift escalates from "silent stale results" to a SIGSEGV where the Rust graph-walk dereferences dangling neighbor pointers for entries that exist in the metadata segment but not in the HNSW index.

Closes #823.

The drift, in data

On one 135K-drawer palace where the bug reproduced, index_metadata.pickle claimed total_elements_added: 137813 against SELECT COUNT(*) FROM embeddings = 135464 — a 2,349-entry drift. Fresh Python processes crashed in col.count() on 17/20 attempts:

import chromadb
c = chromadb.PersistentClient(path=\"/palace\")
col = c.get_collection(\"mempalace_drawers\")
col.count()  # SIGSEGV, background thread, chromadb_rust_bindings.abi3.so

After renaming the segment dir out of the way and letting ChromaDB rebuild lazily, the same 20-run check went to 0 crashes and queries returned reasonable distances against the same underlying drawers.

Same failure mode is tracked publicly at neo-cortex-mcp#2 (chromadb 1.5.5, Python 3.12, SIGSEGV on .count() with the deep-recursion stack signature) and acknowledged at chroma-core/chroma#2594 (HNSW drift is by-design under default sync_threshold).

Why not #823's suggested recovery

#823 suggests "export all drawers, recreate the collection with sync_threshold=100, re-import." That works, but it re-embeds every drawer — expensive on a 100K+ palace and requires the collection to be readable in the first place, which a drifted palace sometimes isn't. This helper is lighter:

  • Rename the drifted segment dir so ChromaDB reopens without it.
  • ChromaDB writes a fresh minimal segment on next use; the underlying embeddings in chroma.sqlite3 are untouched.
  • The renamed directory is not deleted — it's moved to <uuid>.drift-YYYYMMDD-HHMMSS so operators can recover if the heuristic misfires.

Heuristic

If chroma.sqlite3 is more than stale_seconds (default 3600) newer than the segment's data_level0.bin, the segment is considered suspect. One hour is deliberately conservative — normal HNSW flush cadence is seconds to minutes, so an hour of drift implies a crashed mid-write, not routine lag.

Scope

Tests

Four new tests in tests/test_backends.py:

  • test_quarantine_stale_hnsw_renames_drifted_segment — 2h drift: renamed, original files preserved under .drift-TS suffix.
  • test_quarantine_stale_hnsw_leaves_fresh_segment_alone — 10s drift: not touched.
  • test_quarantine_stale_hnsw_no_palace — missing palace / missing chroma.sqlite3: returns [], no raise.
  • test_quarantine_stale_hnsw_skips_already_quarantined — directories already carrying the .drift- suffix are never re-renamed.

pytest tests/test_backends.py11 passed. ruff check / ruff format --check — clean.

Credit

Diagnosis and fix worked out in jphein/mempalace@fdfacdf while chasing the read-path SIGSEGV on a production palace. The mtime heuristic is a deliberate compromise (no pickle.load on index_metadata.pickle — linting policy in the downstream didn't allow it); happy to make it authoritative (read total_elements_added vs sqlite count) on the upstream path if you'd prefer.

Copilot AI review requested due to automatic review settings April 18, 2026 17:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an operator-facing recovery primitive to mitigate ChromaDB HNSW-on-disk drift vs chroma.sqlite3 by quarantining suspected-stale HNSW segment directories, addressing the read-path crash/staleness failure mode described in #823.

Changes:

  • Introduce quarantine_stale_hnsw(palace_path, stale_seconds=3600) to rename stale HNSW segment dirs to a .drift-YYYYMMDD-HHMMSS suffix based on mtime skew vs chroma.sqlite3.
  • Add pytest coverage for renaming behavior, no-op behavior on fresh segments, missing-palace safety, and skipping already-quarantined directories.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
mempalace/backends/chroma.py Adds the quarantine_stale_hnsw helper that quarantines stale HNSW segments via renaming.
tests/test_backends.py Adds tests validating the quarantine heuristic and safety/no-op cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/backends/chroma.py Outdated
Comment on lines +15 to +60
def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 3600.0) -> list:
"""Rename HNSW segment dirs whose files are stale vs. chroma.sqlite3.

When a ChromaDB 1.5.x PersistentClient opens a palace whose on-disk
HNSW segment is significantly older than ``chroma.sqlite3``, the Rust
graph-walk can dereference dangling neighbor pointers for entries that
exist in the metadata segment but not in the HNSW index, and segfault
in a background thread on the next ``count()`` or ``query(...)`` call.

This is the same failure mode reported at #823 (semantic search stale
after ``add_drawer``), observed at neo-cortex-mcp#2 (SIGSEGV on
``count()`` with chromadb 1.5.5), and acknowledged as by-design at
chroma-core/chroma#2594. On one fork palace (135K drawers), the drift
caused a 65–85% crash rate on fresh-process opens; fresh-process
crash rate dropped to 0% after the segment dir was renamed out of the
way and ChromaDB rebuilt lazily.

Heuristic: if ``chroma.sqlite3`` is more than ``stale_seconds`` newer
than the segment's ``data_level0.bin``, the segment is considered
suspect and renamed to ``<uuid>.drift-<timestamp>``. ChromaDB reopens
cleanly without it and writes fresh index files on next use. The
original directory is renamed, not deleted, so recovery remains
possible if the heuristic misfires.

The default threshold (1h) is deliberately conservative — ChromaDB's
HNSW flush cadence means legitimate drift is normally on the order of
seconds to minutes. A segment that is more than an hour out of date is
almost certainly in a "crashed mid-write" state.

Args:
palace_path: path to the palace directory containing ``chroma.sqlite3``
stale_seconds: minimum mtime gap to treat a segment as stale

Returns:
List of paths that were quarantined (empty if nothing drifted).
"""
db_path = os.path.join(palace_path, "chroma.sqlite3")
if not os.path.isfile(db_path):
return []
try:
sqlite_mtime = os.path.getmtime(db_path)
except OSError:
return []

moved: list = []
try:
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints here are overly generic: the function returns a list of quarantined path strings, but the signature uses -> list and moved: list = []. Elsewhere in the repo newer built-in generics are used (e.g. list[str]). Consider updating this to -> list[str] and moved: list[str] = [] (and optionally stale_seconds: float already matches).

Copilot uses AI. Check for mistakes.
@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented Apr 18, 2026

Thanks @copilot-pull-request-reviewer — addressed in c6413fd: tightened the signature to -> list[str] and the local to moved: list[str] = [] to match PEP 585 usage elsewhere in the repo (e.g. exporter.py, sweeper.py).

jphein added a commit to jphein/mempalace that referenced this pull request Apr 18, 2026
README:
- Bump drawer count 134K → 137K (matches current palace)
- Update "Open upstream PRs" table with actual reviewer status, add MemPalace#999
  and MemPalace#1000, drop stale "rebased against MemPalace#863" note since MemPalace#863 is in v3.3.1

CLAUDE.md:
- Rename section header v3.3.0 → v3.3.1 (we merged v3.3.1 earlier today)
- Add fork changes 11-13: None-metadata guards, .blob_seq_ids_migrated
  marker, quarantine_stale_hnsw
- Replace "Pulled in from upstream v3.3.1" summary: i18n, BCP-47 locales,
  UTF-8 Path.read_text, non-blocking precompact (MemPalace#863), silent_save
  honoring (MemPalace#966) — ours still broader
- Update the open-upstream-PRs table to 7 open, include MemPalace#999 + MemPalace#1000 and
  refresh each row's status line to match what's actually on GitHub
  (MERGEABLE flags, @bensig ping context, Copilot review addressed, etc.)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jphein jphein force-pushed the fix/quarantine-stale-hnsw branch from c6413fd to 055386e Compare April 18, 2026 19:07
@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented Apr 18, 2026

Rebased onto develop after #995 merged (055386e). Integration was clean — RFC 001's _validate_where + _fix_blob_seq_ids kept from upstream, quarantine_stale_hnsw() added alongside with list[str] type hints matching the new typed-results style. Dropped my earlier one-line style commit since the type hint was already right in the rebased placement. Tests still pass (29/29 on test_backends.py).

jphein added a commit to jphein/mempalace that referenced this pull request Apr 18, 2026
README:
- Fork-changes table: expand the None-metadata row to cover all 8 sites
  (searcher.py CLI + API + closet-boost, miner.status, 4 mcp_server
  handlers). Previous row only called out the CLI print path.
- Add a new Search row: warnings + sqlite BM25 top-up contract (the
  "never silent miss" feature) with pointer to MemPalace#951 + MemPalace#823.
- Open-PR table: expand MemPalace#999 scope line to mention 8 sites + architectural
  note, update MemPalace#1000 to reflect post-MemPalace#995 rebase, add MemPalace#1005 with Copilot
  fixes noted.

CLAUDE.md:
- PR status header: 7 open -> 8 open (adds MemPalace#1005).
- Same PR row updates as README for MemPalace#999/MemPalace#1000/MemPalace#1005.
- Fork Changes list: expand entry 11 (None guards) to 8 sites + adapter
  consolidation proposal on MemPalace#999; add entry 14 for the warnings+BM25
  feature; keep 12 and 13 as-is.

42 README-claim tests still pass.
jphein added a commit to jphein/mempalace that referenced this pull request Apr 18, 2026
Concrete evidence:
- New "What this looks like in practice" section right after the
  status line, showing a real stop-hook systemMessage output and a
  real mempalace_search response shape (warnings, available_in_scope,
  matched_via tags, similarity scores). Someone evaluating the fork
  can see what "running in production" actually looks like.

Headlines box:
- New "Headlines" subsection at the top of Fork Changes with the
  three differentiators someone should know if they only read one
  section — silent-save hook, ChromaDB 1.5.x hardening (quarantine +
  None guards), search-never-misses contract. Links to MemPalace#673/MemPalace#999/
  MemPalace#1000/MemPalace#1005 so readers can jump to the work itself.

Citations for comparison table:
- Every row now links to its upstream repo: Hindsight (vectorize-io),
  Mem0 + OpenMemory (mem0ai), Cognee (topoteretes), Letta (letta-ai),
  engram (NickCirv), CaviraOSS OpenMemory. Cognee row updated since
  they've added MCP support since we first wrote the row.
- Replaced the "Systems mentioned without captured primary URLs"
  footnote (now stale since we have them) with a "Verification note"
  that's honest about the point-in-time nature of these columns and
  explains why TagMem is absent.

Structure cleanups:
- Removed the upstream MemPalace logo at the top — it's milla-
  jovovich's asset and using it in a fork README is awkward.
- Renamed "Roadmap" → "Planned work" — the section is decisive with
  priorities and time estimates, "Roadmap" was underselling it.

No content removed from the document beyond the stale footnote and
the upstream logo. All existing sections intact. 42 README-claim
tests pass.
jphein added a commit to jphein/mempalace that referenced this pull request Apr 18, 2026
Every remaining row in "Still ahead of upstream" now carries a status
so the reader can tell at a glance whether each change is being
upstreamed, pending a PR, or deliberately fork-local.

Dropped:
- "Guard ChromaDB 1.5.x metadata-mismatch segfault" — this row was
  overstated. The memory file for today's debugging notes that the
  try-get/except-create pattern is defensive code that never
  reproduced a specific crash (the actual crashes traced to HNSW
  drift). Leaving it in "Still ahead" implied an upstream-candidate
  fix, which it isn't. Code stays in place as defensive, but the
  README no longer claims it as a fork-ahead feature.

Moved to Superseded:
- "Stale HNSW mtime detection + mempalace_reconnect" — upstream took
  a different approach in MemPalace#757. Our broader inode+mtime detection
  and the mempalace_reconnect MCP tool remain as fork-local
  convenience; they're just not "ahead of upstream" anymore.

Statuses now populated:
- Linked PR number for the 7 changes with active upstream PRs
  (MemPalace#659, MemPalace#660, MemPalace#661, MemPalace#673 with APPROVED note, MemPalace#999, MemPalace#1000, MemPalace#1005).
- "PR pending" for 3 items that are good candidates but unfiled:
  epsilon mtime comparison, max_distance parameter, tool output
  mining.
- "fork-only" for 2 items we keep intentionally without pitching
  upstream: .blob_seq_ids_migrated marker (narrow), bulk_check_mined
  (complementary to upstream's MemPalace#784 file-locking).

Legend sentence added above the table explains the three status
values. 42 README-claim tests pass.
Copy link
Copy Markdown

@Dialectician Dialectician left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read the diff. Logic is right: compare data_level0.bin mtime to chroma.sqlite3 mtime, rename the segment directory if the gap exceeds threshold. Rename-not-delete is the right call so recovery is reversible by hand if the heuristic ever misfires. Default stale_seconds=3600.0 is conservative enough that a quick write-then-read cycle will not misclassify a fresh segment.

Tests cover the two main branches (drifted segment renamed, fresh segment untouched) with clean fixtures. Closes #823 credibly.

Two small questions, both follow-ups not blockers:

  1. Multi-segment palaces: the loop iterates all subdirs with data_level0.bin, but if two segments drift, the second rename could collide with a name from the first. Probably safe because quarantine-<ts> is unique per call, worth a quick verification.
  2. Logging: logger.exception on rename failure is right. Might also log the sqlite/hnsw mtime delta when quarantining, so postmortem grep tells you how stale the segment was.

Peace B.Sweet

@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented Apr 19, 2026

Addressed Copilot type hint nit: -> list-> list[str] and moved: list = []moved: list[str] = [] to match the repo's use of built-in generics elsewhere.

@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented Apr 19, 2026

Thanks for reading the diff carefully — the rename-not-delete design was exactly the choice I sweated over most, so good to have it independently confirmed. The 1h default being conservative-by-design is also appreciated; a false positive here (quarantining a healthy segment) would be annoying to recover from, so erring cautious is the right call.

Fixed the list[str] nit.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
…review addressed

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add a helper that renames HNSW segment directories whose
`data_level0.bin` is significantly older than `chroma.sqlite3`. Drift
between the on-disk HNSW graph and the live embeddings table is the
root cause of a segfault class where the Rust graph-walk dereferences
dangling neighbor pointers for entries in the metadata segment that no
longer exist in the HNSW index, crashing in a background thread on
`count()` or `query()`.

Issue MemPalace#823 describes the same drift as a silent-staleness symptom
(semantic search returns stale results after `add_drawer` because
`data_level0.bin` lags the sqlite metadata under the default
`sync_threshold=1000`). Under heavier load or after an interrupted
write, the same drift can escalate from "silent stale results" to
"SIGSEGV on next open," which is the failure mode observed at
neo-cortex-mcp#2 (chromadb 1.5.5, Python 3.12) and acknowledged at
chroma-core/chroma#2594.

On one 135K-drawer palace where `index_metadata.pickle` claimed 137,813
elements against 135,464 rows in sqlite (2,349-entry drift), fresh
Python processes crashed in `col.count()` 17/20 times; after renaming
the segment dir out of the way and letting ChromaDB rebuild lazily, the
same 20-run check went to 0 crashes.

The recovery path MemPalace#823 suggests (export / recreate / reimport) is heavy
— it re-embeds every drawer. This helper is lighter: rename the segment
dir so ChromaDB reopens without it, and the indexer rebuilds lazily on
the next write. The original directory is renamed (not deleted) so the
operator can recover if the heuristic misfires.

If `chroma.sqlite3` is more than `stale_seconds` (default 3600) newer
than the segment's `data_level0.bin`, the segment is considered
suspect. One hour is deliberately conservative — normal HNSW flush
cadence is seconds to minutes, so an hour of drift implies a crashed
mid-write, not routine lag.

- Additive: exposes `quarantine_stale_hnsw(palace_path, stale_seconds)`
  as a helper. Not wired into `_client()` / startup on this PR — the
  goal is to land the primitive first so operators and higher layers
  can opt in. A follow-up could call it automatically on palace open
  behind an env var or config flag.
- Closes MemPalace#823 by giving operators a first-class recovery path without
  having to install `chromadb-ops` or re-mine.

Four new tests in `tests/test_backends.py`:
- renames drifted segment, preserves original files under `.drift-TS` suffix
- leaves fresh segments alone
- no-op on missing palace path / missing `chroma.sqlite3`
- skips already-quarantined (`.drift-` suffixed) directories

`pytest tests/test_backends.py` → 11 passed. `ruff check` / `ruff format
--check` — clean.
@jphein jphein force-pushed the fix/quarantine-stale-hnsw branch from 055386e to 0c38dea Compare April 19, 2026 01:05
@bensig
Copy link
Copy Markdown
Collaborator

bensig commented Apr 19, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

@bensig bensig merged commit caf503f into MemPalace:develop Apr 19, 2026
6 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
…emPalace#1036

Overnight/morning:
- MemPalace#681, MemPalace#1000, MemPalace#1023 merged — moved from "Still ahead" to "Merged upstream (post-v3.3.1)"
- bensig reviewed MemPalace#659 (wing_ prefix + agent filter) and MemPalace#1021 (silent_guard
  default) — both addressed on their PR branches
- MemPalace#673 needed re-rebase after overnight develop merges; done
- MemPalace#1036 filed: paginate miner.status(), closes upstream MemPalace#802 and MemPalace#1015

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
igorls added a commit that referenced this pull request Apr 19, 2026
Version bumps across pyproject.toml, mempalace/version.py, README badge,
uv.lock, and plugin manifests (.claude-plugin/*, .codex-plugin/*).

CHANGELOG aligned with main (post-3.3.1) and a new [3.3.2] section added
covering the 11 PRs merged on develop since v3.3.1 — silent-transcript-drop
fix + tandem sweeper (#998), None-metadata guards (#999, #1013),
chromadb ≥1.5.4 for Py 3.13/3.14 (#1010), Windows Unicode (#681),
HNSW quarantine recovery (#1000), PID stacking guard (#1023), doc-path
cleanup (#996, #1012), and RFC 001/002 internal scaffolding (#995, #1014, #990).
@igorls igorls mentioned this pull request Apr 19, 2026
8 tasks
jphein added a commit to jphein/mempalace that referenced this pull request Apr 21, 2026
Resolutions:
- `.claude-plugin/.mcp.json`, `plugin.json` — adopt upstream's `mempalace-mcp`
  console-script command (added via upstream MemPalace#340 for pipx/uv). Run
  `pip install -e .` in plugin venv after merge to install the entry point.
- `.claude-plugin/hooks/*.sh`, `hooks/*.sh` — adopt upstream's console-command
  resolution order (`mempalace` script → `python3 -m mempalace` → `python`).
  `MEMPAL_PYTHON` override still works inside `hooks_cli.py`.
- `mempalace/hooks_cli.py`, `tests/test_hooks_cli.py` — keep fork's
  `_mempalace_python()` helper (fork-ahead #4); upstream only had
  `sys.executable`, which loses MEMPAL_PYTHON override.
- `mempalace/miner.py` — keep fork's concurrent mining path (fork-ahead #1),
  apply upstream's unicode-`✓` → ASCII-`+` fix (MemPalace#681) to both paths.
- `mempalace/backends/chroma.py` — take upstream's refined `quarantine_stale_hnsw`
  docstring (it's the version merged via our own MemPalace#1000).

Brought in: 33 upstream commits including Belarusian/Chinese/German/Spanish/French
entity detection, console-script entry points, hook plugin-root space quoting,
and v3.3.2 tag (which contains our MemPalace#681/MemPalace#1000/MemPalace#1023).

Tests: 1096 passed, 106 deselected (benchmarks). Ruff clean.
jphein added a commit to jphein/mempalace that referenced this pull request Apr 21, 2026
…#1023 merged, MemPalace#673 rebased

- Bump fork-ahead section header to "after v3.3.2"
- Strike #11 (quarantine_stale_hnsw) and MemPalace#18 (PID file guard) as
  merged-into-upstream-via-v3.3.2, keep entries for traceability
- Add v3.3.2-shipped items to "Merged into upstream (post-v3.3.1)"
- Rebuild PR table: 10 merged / 7 open / 7 closed; add MemPalace#1024 row,
  reclassify MemPalace#681/MemPalace#1000/MemPalace#1023 as merged, note MemPalace#673 rebased 2026-04-21
- Annotate MemPalace#661 status with the GitHub review-state machine caveat
  (CHANGES_REQUESTED persists until reviewer dismisses, not owed)
- Bump test count 1063 → 1096 post-merge
jphein added a commit to jphein/mempalace that referenced this pull request Apr 21, 2026
…unts

- Upstream released v3.3.2 on 2026-04-21 (our MemPalace#681/MemPalace#1000/MemPalace#1023)
- Drawer count 152,682 → 165,632; wings 23 → 28; tests 1063 → 1096
- MemPalace#673 now MERGEABLE after fresh rebase + squash to 1 commit
- MemPalace#661 status clarified: CHANGES_REQUESTED persists until reviewer
  dismisses, not an outstanding owe
- Merged-upstream section split out v3.3.2 release group
@jphein jphein deleted the fix/quarantine-stale-hnsw branch April 25, 2026 14:53
peterwangsc added a commit to peterwangsc/mempalace that referenced this pull request Apr 28, 2026
Port quarantine_stale_hnsw from upstream MemPalace#1000 (0c38dea) and wire it
into _client() / make_client() so palaces with drifted HNSW segments
self-recover on open instead of segfaulting in the Rust graph walk.

Hit on personal-v2 today: 125k-embedding palace, three segment dirs
with data_level0.bin 8-9 days older than chroma.sqlite3 after an
interrupted write. Every fresh client open SIGSEGV'd in
chromadb_rust_bindings at the same offset (KERN_INVALID_ADDRESS at
0x44, dangling neighbor pointer in HNSW graph walk). Renaming the
segment dirs out of the way let chroma rebuild lazily from sqlite
with zero data loss.

Differs from upstream by being on-by-default rather than
operator-opt-in; opt out with MEMPALACE_HNSW_NO_QUARANTINE=1. The
stat-call overhead is microseconds; the cost of skipping is a hard
crash on the next query() or count().

Threshold unchanged at 3600s — legitimate HNSW flush lag is seconds
to minutes, so an hour of drift implies a crashed mid-write.

Tests: 11/11 in test_backends.py (4 new for the helper).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ChromaDB: HNSW persistence vs add_drawer — default sync_threshold can leave semantic search stale

4 participants