_fix_blob_seq_ids() opens live chromadb.sqlite on every client cache miss; post-migration no-op still risks corruption #1090

@jphein

Description

What we think we're seeing

mempalace/backends/chroma.py::_fix_blob_seq_ids() is invoked every time _ChromaClientRegistry.get_or_create() has a cache miss or detects mtime/inode drift on chroma.sqlite3 (lines 468 and 489 on current develop), and also from the deprecated-ish make_client() helper. For a palace that has already been migrated from the 0.6.x BLOB seq_id format, every one of those calls does:

  1. sqlite3.connect(db_path) against a live chromadb 1.5.x sqlite file
  2. two SELECT rowid, seq_id FROM {table} WHERE typeof(seq_id) = 'blob' queries (per-table early-out)
  3. conn.commit() (even though nothing was written on the no-op path)
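For concreteness, here is a minimal sketch of those three steps as a standalone function. This is not the actual mempalace code: the table names, the big-endian decode, and the update logic are all assumptions made for illustration.

```python
import sqlite3

# Assumed table names; the real _fix_blob_seq_ids in
# mempalace/backends/chroma.py may scan a different set.
TABLES = ("embeddings_queue", "max_seq_id")

def fix_blob_seq_ids(db_path: str) -> int:
    """Return the number of BLOB seq_id rows rewritten (0 on the no-op path)."""
    conn = sqlite3.connect(db_path)  # step 1: opens the live chromadb sqlite file
    fixed = 0
    try:
        for table in TABLES:
            # step 2: per-table scan for legacy BLOB-typed seq_ids
            rows = conn.execute(
                f"SELECT rowid, seq_id FROM {table} "
                "WHERE typeof(seq_id) = 'blob'"
            ).fetchall()
            for rowid, blob in rows:
                seq = int.from_bytes(blob, "big")  # assumed encoding
                conn.execute(
                    f"UPDATE {table} SET seq_id = ? WHERE rowid = ?",
                    (seq, rowid),
                )
                fixed += 1
        conn.commit()  # step 3: commits even when fixed == 0
    finally:
        conn.close()
    return fixed
```

The second and later calls against a migrated palace hit the `fixed == 0` path: open, two SELECTs, commit, close, all against a file chromadb considers its own.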

On our fork's 165K-drawer palace, we've observed — across ~400 MCP-server starts and miner invocations — that this pattern correlates with occasional chromadb-Rust-side crashes on the next PersistentClient(...) call after a successful _fix_blob_seq_ids no-op. Not every time, but frequently enough to be a real reliability hit over hours of use. The crash signatures vary (null-pointer in the Rust compactor, "mismatched types" re-emerging, occasional SIGSEGV); the common factor is they happen after a clean _fix_blob_seq_ids return.

Hypothesis

chromadb 1.5.x's Rust layer maintains its own in-memory view of the sqlite file and doesn't expect another connection (even an in-process Python sqlite3.connect) to acquire a lock on the same file mid-session. The post-migration _fix_blob_seq_ids call opens, reads, commits, and closes the file from Python, which appears to desync the Rust side's cached state. On the next PersistentClient init, the Rust compactor runs against that stale view and crashes.

The function has to run the first time a 0.6.x palace is migrated (that's what it was added for). After that first successful migration, re-running it is a no-op that costs two SELECTs — but costs us a non-trivial crash rate on chromadb 1.5.x.

Fork workaround

After _fix_blob_seq_ids completes successfully and finds nothing to update (the "no BLOB rows" early-out), the fork writes <palace_path>/.blob_seq_ids_migrated as a sentinel. _get_client() checks for that sentinel first; if it's present, _fix_blob_seq_ids is skipped entirely. The function still runs on palaces without the sentinel (the "needs migration" case), so 0.6.x → 1.5.x migration still works correctly.
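A sketch of how the fork gates the call. The sentinel filename matches the one above; maybe_fix_blob_seq_ids and the fix_fn callback shape are hypothetical names for illustration, not the fork's actual signatures:

```python
from pathlib import Path
from typing import Callable

SENTINEL = ".blob_seq_ids_migrated"  # sentinel filename used by the fork

def maybe_fix_blob_seq_ids(palace_path: str, fix_fn: Callable[[], int]) -> bool:
    """Run the migration unless the sentinel says it already completed.

    fix_fn is assumed to return the number of rows it rewrote (0 == clean
    no-op). Returns True if fix_fn was actually invoked on this call.
    """
    sentinel = Path(palace_path) / SENTINEL
    if sentinel.exists():
        return False  # post-migration: never touch chroma.sqlite3 again
    fixed = fix_fn()
    if fixed == 0:
        # Clean "no BLOB rows" early-out: safe to skip on every future start.
        sentinel.touch()
    return True
```

Note the sentinel is written only on the clean no-op, so a palace that actually gets migrated runs the fixer once more on the next start, confirms nothing is left, and only then stops opening the file.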

Result on our palace: zero crashes of this flavor since the sentinel landed (2026-04-10 → present, ~11 days). Before the sentinel: roughly 1 crash per 10–20 process starts on the same hardware + chromadb combo.

Asking before filing a PR

I want to flag this before guessing at a fix direction, because I'm not 100% sure of the causation — "crash after a clean _fix_blob_seq_ids return" is circumstantial. Two questions:

  1. Has anyone else on chromadb 1.5.x observed this? If the correlation is specific to our environment (macOS ARM64, Python 3.13, chromadb 1.5.8, 165K-drawer palace), the fork's sentinel is a narrow-value fix. If it reproduces elsewhere, it's a broader reliability patch.
  2. Is the direction "skip after migration" or "don't open sqlite3 from Python at all"? The sentinel is one option; another is detecting the installed chromadb version and skipping on 1.5.x (since the BLOB issue was a 0.6.x → 1.x migration artifact and shouldn't appear on palaces that started life on 1.5.x). Either would get the same result with different semantics.
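For comparison, the version-gating alternative from question 2 could be as small as this; the helper name, the string-parameter shape, and the major-version threshold are all assumptions:

```python
def needs_blob_seq_id_fix(chromadb_version: str) -> bool:
    """True only on chromadb 0.x, where the BLOB seq_id format originates.

    Hypothetical helper: a real patch would obtain the version via
    importlib.metadata.version("chromadb") and skip _fix_blob_seq_ids
    entirely when this returns False.
    """
    major = int(chromadb_version.split(".", 1)[0])
    return major < 1
```

The semantic difference versus the sentinel: the version gate skips based on what chromadb is installed, while the sentinel skips based on what this particular palace has already been through.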

Happy to draft a PR once there's direction. Code for reference: fork's _get_client().
