Summary
Opening an existing palace can fail hard when chroma.sqlite3 is much newer than the on-disk HNSW segment directories. In my local repro on Windows, the process previously aborted while opening the real palace; after quarantining stale segment dirs before PersistentClient(...), the same palace opens normally.
What I observed
- Palace path: existing real user palace with ~45k drawers
- chroma.sqlite3 mtime was about 43-45 hours newer than both segment directories
- Before the fix, commands that open the collection (status was the easiest repro) could abort during Chroma open / early collection access
- After renaming stale HNSW dirs out of the way, Chroma rebuilt lazily and the palace opened again
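To make the observation concrete, here is a minimal sketch of the staleness check and rename. It is not the actual quarantine_stale_hnsw(...) implementation. The helper names, the one-hour threshold, the ".stale-" suffix, and the rule that a segment directory is any subdirectory containing data_level0.bin are all assumptions made for illustration.

```python
import os
import time
from pathlib import Path

# Hypothetical threshold: how much newer chroma.sqlite3 must be before a
# segment directory counts as stale. Not taken from mempalace.
STALE_AFTER_SECONDS = 3600.0


def _newest_mtime(directory):
    """Newest mtime among the directory itself and the files directly inside it."""
    times = [directory.stat().st_mtime]
    times.extend(p.stat().st_mtime for p in directory.iterdir() if p.is_file())
    return max(times)


def find_stale_segment_dirs(palace_path, stale_after=STALE_AFTER_SECONDS):
    """Return segment dirs that lag chroma.sqlite3 by more than stale_after seconds."""
    root = Path(palace_path)
    db_file = root / "chroma.sqlite3"
    if not db_file.is_file():
        return []
    db_mtime = db_file.stat().st_mtime
    stale = []
    for child in sorted(root.iterdir()):
        # Assumption: an HNSW segment dir is a subdirectory holding data_level0.bin.
        if not child.is_dir() or not (child / "data_level0.bin").exists():
            continue
        # Skip directories that an earlier run already quarantined.
        if ".stale-" in child.name:
            continue
        if db_mtime - _newest_mtime(child) > stale_after:
            stale.append(child)
    return stale


def quarantine_dirs(dirs, suffix=None):
    """Rename each directory in place so it is no longer picked up; return new paths."""
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    moved = []
    for d in dirs:
        target = d.with_name(f"{d.name}.stale-{suffix}")
        os.rename(d, target)
        moved.append(target)
    return moved
```

Renaming rather than deleting keeps the old index files available for inspection, and matches the observed behavior that Chroma rebuilds lazily once the stale directories are out of the way.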
Likely cause
mempalace.backends.chroma already has quarantine_stale_hnsw(...), but ChromaBackend._client() does not call it before constructing a fresh chromadb.PersistentClient. That leaves the crash-prone stale-index path reachable on real palaces.
Proposed fix
Call quarantine_stale_hnsw(palace_path) immediately before chromadb.PersistentClient(path=palace_path) when _client() is creating or refreshing the cached client.
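The intended call order can be sketched as below. BackendSketch is a hypothetical stand-in, not the real ChromaBackend: the constructor arguments, the refresh flag, and the injected callables are assumptions so the example runs without chromadb installed. In the real backend the two callables would be quarantine_stale_hnsw and chromadb.PersistentClient.

```python
class BackendSketch:
    """Stand-in for ChromaBackend that shows only the call order in _client()."""

    def __init__(self, palace_path, quarantine, client_factory):
        self.palace_path = palace_path
        self._quarantine = quarantine          # e.g. quarantine_stale_hnsw
        self._client_factory = client_factory  # e.g. chromadb.PersistentClient
        self._cached = None

    def _client(self, refresh=False):
        # Hypothetical refresh flag: models "creating or refreshing" the client.
        if self._cached is None or refresh:
            # Quarantine first so the client never opens a stale HNSW segment.
            self._quarantine(self.palace_path)
            self._cached = self._client_factory(path=self.palace_path)
        return self._cached


# Record the order of calls with simple fakes.
calls = []
backend = BackendSketch(
    "/tmp/palace",
    quarantine=lambda path: calls.append(("quarantine", path)),
    client_factory=lambda path: calls.append(("open", path)) or object(),
)
backend._client()
backend._client()  # served from the cache: no second quarantine or open
print(calls)
```

Running the quarantine only on the create/refresh path keeps the cached fast path unchanged, so the extra filesystem scan is paid once per client construction.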
Validation
I tested the fix locally against the real palace. Opening the palace no longer aborts the process; status now gets further and fails with an ordinary, recoverable runtime error (too many SQL variables). That is a second, independent bug that can be fixed separately.
I also added a regression test that verifies stale segment directories are quarantined before the backend opens the client.
Local patch reference
Branch: codex/mempalace-fix-windows-help-and-status
Commit: dda82d1