fix: chunk collection.add() to avoid ChromaDB 5,461-item hard limit #520

phobicdotno wants to merge 3 commits into MemPalace:develop
Conversation
This is exactly the right fix and the implementation is clean. A few observations from our experience running 9,000+ drawer palaces:

- The 5,000 limit choice is correct. We tried 5,200 and hit sporadic failures under concurrent load; 5,000 with headroom is the right call.
- One minor thing: the progress print in …

Overall: solid fix for a real, hard-to-diagnose footgun. Tested on our end at 9,245 drawers; no issues.
Five new tests covering large batches (6000 items), small batches (100), empty input, boundary (exactly 5000), and upsert idempotency. Uses pre-computed embeddings to avoid downloading the sentence-transformers model during CI.
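The pre-computed-embeddings approach mentioned above can be sketched like this (the `FakeCollection` stub and `fake_embeddings` helper are illustrative inventions, not the PR's actual test code; the idea is just that passing `embeddings=` explicitly means no sentence-transformers model is ever downloaded in CI):

```python
import random


def fake_embeddings(n, dim=384):
    """Deterministic stand-in vectors so tests never need a real embedding model."""
    rng = random.Random(42)  # fixed seed keeps test runs reproducible
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


class FakeCollection:
    """Minimal in-memory stand-in for a ChromaDB collection."""

    def __init__(self):
        self.items = {}

    def add(self, ids, documents, embeddings):
        for item_id, doc, emb in zip(ids, documents, embeddings):
            self.items[item_id] = (doc, emb)


def test_small_batch_with_precomputed_embeddings():
    col = FakeCollection()
    n = 100
    col.add(
        ids=[f"id-{i}" for i in range(n)],
        documents=[f"doc {i}" for i in range(n)],
        embeddings=fake_embeddings(n),
    )
    assert len(col.items) == n
```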
Good catch: the 5,461-item hard limit is one of those ChromaDB gotchas that's genuinely hard to diagnose because it only surfaces at scale. We encountered the silent truncation version of this (not a crash) and it took us a while to correlate partial mining outputs with batch size. A few suggestions:

- The 5,000 chunk size is a safe margin, but you might want it configurable. ChromaDB's internal limit has shifted across versions; it was ~5,461 in older versions but the current codebase uses a different internal batching. A constant like `_CHROMADB_BATCH_LIMIT` keeps this easy to adjust.
- Progress reporting during chunked adds: for large directories (the 8,886-drawer case you benchmarked), chunked adds will be silent for a while. Even a simple per-chunk log line would help.
- Partial failure handling: if chunk N fails after chunks 0..N-1 have been committed, the palace is in a partially-mined state. Worth documenting this and recommending the caller re-mine the directory (the existing dedup via upsert should make that safe).

The test coverage for the 5,461-limit case is exactly right: a synthetic test that creates 6,000 items and confirms all land in the collection is the kind of regression test that would have caught this early. Good PR for a subtle production issue.
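The configurability and progress-reporting suggestions above could be folded into the helper roughly like this (a sketch only; the `chunk_size` and `on_progress` parameters are hypothetical additions, not part of the PR):

```python
_CHROMADB_BATCH_LIMIT = 5000  # safe margin under the ~5,461 hard limit


def chunked_add(collection, ids, documents, embeddings,
                chunk_size=_CHROMADB_BATCH_LIMIT, on_progress=None):
    """Add items in chunks so no single add() exceeds the backend's batch limit.

    on_progress, if given, is called as on_progress(done, total) after each chunk,
    which covers the 'silent for a while' concern on large directories.
    """
    total = len(ids)
    for start in range(0, total, chunk_size):
        end = min(start + chunk_size, total)
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            embeddings=embeddings[start:end],
        )
        if on_progress is not None:
            on_progress(end, total)
```

With the default arguments this behaves like a plain chunked add; a CLI caller could pass `on_progress=lambda done, total: print(f"{done}/{total} drawers added")` to surface progress.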
Some other PRs suggest making the storage mechanism pluggable. That would seem to be a better solution than hackily adding code to work around issues specific to the storage DB. Better to make storage switchable before complicating things further, no?
@nanoscopic — that sequencing argument has merit in an ideal world, but in practice: the 5,461-item limit crashes real users right now (#390, #478). Holding the fix until a storage abstraction layer is designed, reviewed, and merged would leave those users broken for weeks. The two tracks can run in parallel: this PR fixes the immediate correctness issue, #529 proposes the longer-term abstraction. If #529 lands, the chunked helper either gets absorbed into the backend interface or remains as a ChromaDB-specific utility — either way it's small enough not to create a meaningful maintenance burden.
switching storage might be the way to go |
Problem

ChromaDB's `collection.add()` has an undocumented hard limit of ~5,461 items per call. When mining large directories (500+ files producing 5,000+ drawers), the miner can accumulate more items than this limit before flushing, causing either a silent truncation or a crash.

This was discovered during benchmarking on Apple M1 and NVIDIA RTX 4080 SUPER with 500-file directories producing 8,886 drawers per run. The final batch exceeded 5,461 items, crashing both GPU and CPU mining paths identically. See benchmark results and discussion in #515.
This is difficult to diagnose because:

- the failure only surfaces at scale, once a single batch accumulates thousands of items;
- the silent-truncation variant produces no error at all, just a collection with fewer drawers than expected.
Fix

Added `chunked_add()` and `chunked_upsert()` helpers that automatically split large batches into chunks of 5,000 items (a safe margin under the 5,461 hard limit). All `collection.add()` and `collection.upsert()` call sites were updated to use these helpers.

Changes

- `mempalace/miner.py` — added the `_CHROMADB_BATCH_LIMIT = 5000` constant plus the `chunked_add()` and `chunked_upsert()` helpers
- `mempalace/cli.py` — updated the `repair` command to use `chunked_add()`
- `tests/test_miner.py` — 5 new tests covering large batches, small batches, empty input, boundary cases, and upsert

Test plan
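The 6,000-item regression case might look something like this sketch (an in-memory `StubCollection` stands in for a real ChromaDB collection, and the `chunked_add` shown here is a simplified illustration, not the PR's actual implementation):

```python
class StubCollection:
    """Records every add() and rejects any single batch over the hard limit."""

    HARD_LIMIT = 5461  # the ChromaDB per-call ceiling this PR works around

    def __init__(self):
        self.ids = []

    def add(self, ids, documents, embeddings):
        if len(ids) > self.HARD_LIMIT:
            raise ValueError(f"batch of {len(ids)} exceeds {self.HARD_LIMIT}")
        self.ids.extend(ids)


def chunked_add(collection, ids, documents, embeddings, chunk_size=5000):
    # Split the batch so no single add() exceeds the backend's limit.
    for start in range(0, len(ids), chunk_size):
        end = start + chunk_size
        collection.add(ids[start:end], documents[start:end], embeddings[start:end])


def test_large_batch_survives_chunking():
    n = 6000  # over the 5,461 limit; would fail as a single add()
    col = StubCollection()
    chunked_add(
        col,
        ids=[f"id-{i}" for i in range(n)],
        documents=["doc"] * n,
        embeddings=[[0.0]] * n,
    )
    assert len(col.ids) == n  # every item landed in the collection
```

The stub's `HARD_LIMIT` check is what makes this a real regression test: an unchunked `add()` of 6,000 items would raise, so the assertion can only pass if chunking is in effect.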