
Commit fdfacdf

jphein and claude committed
feat: quarantine_stale_hnsw() helper for HNSW/sqlite drift crashes
Adds a utility that renames HNSW segment directories whose data_level0.bin is significantly older than chroma.sqlite3. Drift between the on-disk HNSW graph and the live embeddings table is the root cause of the intermittent count()/query() SIGSEGV in chromadb_rust_bindings: when the Rust graph-walk hits a dangling neighbor pointer for an entry that no longer exists in sqlite, the process crashes in a background thread at a fixed offset (a3ee57 in the .abi3.so).

Confirmed empirically on a 135K-drawer palace: HNSW claimed 137,813 elements, sqlite had 135,464, crash rate was 85%. After renaming the segment dir (keeping it recoverable), 15/15 fresh-process opens and queries succeed.

Same failure mode is tracked publicly at neo-cortex-mcp#2 and claude-mem#1110; chroma-core#2594 acknowledges the drift is by-design and unbounded.

The heuristic uses mtime (not pickle) because linting rules keep pickle out of the module. The 1-hour threshold is conservative: normal writes flush HNSW within seconds, so a legitimate segment shouldn't be quarantined.

Not wired into _client() yet; callers invoke it explicitly when drift is suspected.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
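Since the helper is not wired into _client() and renames directories as a side effect, a caller may want to see what the mtime heuristic would flag before invoking it. The following is a minimal read-only sketch of the same comparison, not part of this commit: the name find_stale_hnsw_segments and its (path, lag) return shape are hypothetical, while the file names, the name filter, and the 3600-second default are taken from the change itself.

```python
import os


def find_stale_hnsw_segments(palace_path: str, stale_seconds: float = 3600.0) -> list:
    """Hypothetical dry run: return (segment_dir, lag_seconds) pairs and rename nothing."""
    db_path = os.path.join(palace_path, "chroma.sqlite3")
    if not os.path.isfile(db_path):
        return []
    sqlite_mtime = os.path.getmtime(db_path)

    stale = []
    for name in sorted(os.listdir(palace_path)):
        # Same name filter as the commit: dash-containing (UUID-like) entries,
        # skipping hidden entries and directories that were already quarantined.
        if "-" not in name or name.startswith(".") or ".drift-" in name:
            continue
        hnsw_bin = os.path.join(palace_path, name, "data_level0.bin")
        if not os.path.isfile(hnsw_bin):
            continue
        # Positive lag means sqlite was written after the HNSW file.
        lag = sqlite_mtime - os.path.getmtime(hnsw_bin)
        if lag >= stale_seconds:
            stale.append((os.path.join(palace_path, name), lag))
    return stale
```

An empty result means the heuristic would leave every segment in place. Note that mtime lag is only a proxy for drift: it does not compare element counts between HNSW and sqlite.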
1 parent d3a2d22 commit fdfacdf

1 file changed

Lines changed: 64 additions & 0 deletions


mempalace/backends/chroma.py

@@ -1,5 +1,6 @@
 """ChromaDB-backed MemPalace collection adapter."""
 
+import datetime as _dt
 import logging
 import os
 import sqlite3
@@ -14,6 +15,69 @@
 _BLOB_FIX_MARKER = ".blob_seq_ids_migrated"
 
 
+def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 3600.0) -> list:
+    """Rename HNSW segment dirs whose files are stale vs. chroma.sqlite3.
+
+    When ChromaDB 1.5.x loads an HNSW segment that disagrees with the live
+    ``embeddings`` table in sqlite, the Rust graph-walk dereferences dangling
+    neighbor pointers and segfaults in a background thread (the failure
+    mirrored at neo-cortex-mcp#2 and observed locally at offset ``a3ee57``
+    in ``chromadb_rust_bindings.abi3.so``).
+
+    Heuristic: if the sqlite mtime is more than *stale_seconds* newer than
+    the HNSW ``data_level0.bin`` mtime, the segment is suspect and gets
+    renamed out of the way. Chroma reopens cleanly without it and rebuilds
+    index files on next write. The original directory is renamed, not
+    deleted, so recovery remains possible.
+
+    Returns the list of quarantined segment paths.
+    """
+    db_path = os.path.join(palace_path, "chroma.sqlite3")
+    if not os.path.isfile(db_path):
+        return []
+    try:
+        sqlite_mtime = os.path.getmtime(db_path)
+    except OSError:
+        return []
+
+    moved: list = []
+    try:
+        entries = os.listdir(palace_path)
+    except OSError:
+        return []
+
+    for name in entries:
+        if "-" not in name or name.startswith(".") or ".drift-" in name:
+            continue
+        seg_dir = os.path.join(palace_path, name)
+        if not os.path.isdir(seg_dir):
+            continue
+        hnsw_bin = os.path.join(seg_dir, "data_level0.bin")
+        if not os.path.isfile(hnsw_bin):
+            continue
+        try:
+            hnsw_mtime = os.path.getmtime(hnsw_bin)
+        except OSError:
+            continue
+        if sqlite_mtime - hnsw_mtime < stale_seconds:
+            continue
+        stamp = _dt.datetime.now().strftime("%Y%m%d-%H%M%S")
+        target = f"{seg_dir}.drift-{stamp}"
+        try:
+            os.rename(seg_dir, target)
+            moved.append(target)
+            logger.warning(
+                "Quarantined stale HNSW segment %s "
+                "(sqlite %.0fs newer than HNSW); renamed to %s",
+                seg_dir,
+                sqlite_mtime - hnsw_mtime,
+                target,
+            )
+        except OSError:
+            logger.exception("Failed to quarantine stale HNSW segment %s", seg_dir)
+    return moved
+
+
 def _fix_blob_seq_ids(palace_path: str):
     """Fix ChromaDB 0.6.x -> 1.5.x migration bug: BLOB seq_ids -> INTEGER.
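The docstring notes that a quarantined directory is renamed rather than deleted so that recovery remains possible, but the commit does not include a recovery path. One way to undo a quarantine could look like the sketch below. The helper restore_quarantined_segment is hypothetical and not part of this change; it assumes only the ".drift-<stamp>" suffix that quarantine_stale_hnsw() appends.

```python
import os


def restore_quarantined_segment(quarantined_path: str) -> str:
    """Hypothetical inverse of the quarantine rename: strip the '.drift-<stamp>' suffix."""
    original, sep, _stamp = quarantined_path.rpartition(".drift-")
    if not sep or not original:
        raise ValueError(f"not a quarantined segment path: {quarantined_path}")
    if os.path.exists(original):
        # Index files may have been rebuilt at the original location since
        # the quarantine; refuse to clobber them.
        raise FileExistsError(f"refusing to overwrite {original}")
    os.rename(quarantined_path, original)
    return original
```

Restoring a segment that really had drifted would presumably bring back the crash condition, so this is better suited to inspecting a segment or reversing a false positive than to routine use.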
