fix: chunk collection.add() to avoid ChromaDB 5,461-item hard limit#520

Open
phobicdotno wants to merge 3 commits into MemPalace:develop from phobicdotno:fix/chromadb-batch-limit

Conversation

@phobicdotno

@phobicdotno phobicdotno commented Apr 10, 2026

Problem

ChromaDB's collection.add() has an undocumented hard limit of ~5,461 items per call. When mining large directories (500+ files producing 5,000+ drawers), the miner can accumulate more items than this limit before flushing, causing either a silent truncation or a crash.

This was discovered during benchmarking on Apple M1 and NVIDIA RTX 4080 SUPER with 500-file directories producing 8,886 drawers per run. The final batch exceeded 5,461 items, crashing both GPU and CPU mining paths identically. See benchmark results and discussion in #515.

This is difficult to diagnose because:

  • The limit is not documented in ChromaDB's API
  • The error message doesn't clearly indicate a batch size constraint
  • Partial flushes can make it look like some files were mined when they weren't
  • The issue only surfaces with large directories, so small-scale testing won't catch it

Fix

Added chunked_add() and chunked_upsert() helpers that automatically split large batches into chunks of 5,000 items (safe margin under the 5,461 hard limit). All collection.add() and collection.upsert() call sites updated to use these helpers.
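A minimal sketch of what such a helper can look like (hypothetical signature; the actual `chunked_add()` in `mempalace/miner.py` may differ):

```python
# Hypothetical sketch of the chunking helper described in this PR; the real
# implementation in mempalace/miner.py may have a different signature.
_CHROMADB_BATCH_LIMIT = 5000  # safe margin under the ~5,461 hard limit


def chunked_add(collection, ids, documents, embeddings=None, metadatas=None):
    """Split one logical add() into batches ChromaDB will accept."""
    for start in range(0, len(ids), _CHROMADB_BATCH_LIMIT):
        end = start + _CHROMADB_BATCH_LIMIT
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            embeddings=None if embeddings is None else embeddings[start:end],
            metadatas=None if metadatas is None else metadatas[start:end],
        )
```

Callers keep passing full batches; the helper is the only place that knows about the limit.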

Changes

  • mempalace/miner.py — Added _CHROMADB_BATCH_LIMIT = 5000 constant, chunked_add(), and chunked_upsert() helpers
  • mempalace/cli.py — Updated repair command to use chunked_add()
  • tests/test_miner.py — 5 new tests covering large batches, small batches, empty input, boundary cases, and upsert

Test plan

  • All existing tests pass
  • 5 new tests for chunked batch operations
  • Ruff check and format clean
  • Tested with directories producing 8,000+ drawers — previously crashed, now completes
  • No new dependencies
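One of the new regression tests might look roughly like this — a sketch with a fake collection standing in for ChromaDB (so no server or embedding model is needed); both the test name and the inline `chunked_add` stand-in are illustrative:

```python
# Illustrative version of the large-batch regression test described above.
# FakeCollection replaces a real ChromaDB collection; chunked_add here is a
# minimal stand-in for the helper added in mempalace/miner.py.
CHUNK = 5000


def chunked_add(collection, ids, documents):
    for start in range(0, len(ids), CHUNK):
        collection.add(ids=ids[start:start + CHUNK],
                       documents=documents[start:start + CHUNK])


def test_large_batch_is_fully_added():
    received = []

    class FakeCollection:
        def add(self, ids, documents):
            # A real ChromaDB collection would fail above ~5,461 items.
            assert len(ids) <= 5461
            received.extend(ids)

    n = 6000
    chunked_add(FakeCollection(), [f"drawer-{i}" for i in range(n)], ["x"] * n)
    assert len(received) == n        # nothing silently truncated
    assert len(set(received)) == n   # no duplicates


test_large_batch_is_fully_added()
```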

@web3guru888

This is exactly the right fix and the implementation is clean. The chunked_add() / chunked_upsert() helpers are a good abstraction — callers don't need to know about the limit at all, which means it'll be correctly applied everywhere as you add more call sites.

A few observations from our experience running 9,000+ drawer palaces:

The 5,000 limit choice is correct. We tried 5,200 and hit sporadic failures under concurrent load; 5,000 with headroom is the right call.

chunked_upsert will be needed in add_drawer(). Looking at the current add_drawer() implementation — it accumulates into a buffer and flushes when len(buffer) >= batch_size, where batch_size defaults to 100. So individual flushes will never hit the limit. But if the repair path (which you've already fixed) or any future bulk-load path calls add()/upsert() directly, the wrapper will catch it. The abstraction pays forward.

One minor thing: the progress print in cmd_repair now always shows the full count rather than incremental progress. Previous code printed Re-filed 100/5000..., Re-filed 200/5000..., etc. For very large palaces (15K+ drawers) this is useful feedback. Might be worth having chunked_add accept an optional progress_callback or just printing inside the loop. Not blocking, just a UX note.
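A sketch of that suggestion (the `progress_callback` parameter is hypothetical, not part of the PR as written):

```python
def chunked_add(collection, ids, documents, *,
                chunk_size=5000, progress_callback=None):
    """Add in chunks, reporting cumulative progress after each flush.
    `progress_callback` is the optional hook suggested in review."""
    total = len(ids)
    for start in range(0, total, chunk_size):
        end = min(start + chunk_size, total)
        collection.add(ids=ids[start:end], documents=documents[start:end])
        if progress_callback is not None:
            progress_callback(end, total)


# cmd_repair could then keep its incremental output, e.g.:
# chunked_add(col, ids, docs,
#             progress_callback=lambda done, total: print(f"Re-filed {done}/{total}..."))
```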

Overall: solid fix for a real, hard-to-diagnose footgun. Tested on our end at 9,245 drawers — no issues.

Five new tests covering large batches (6000 items), small batches (100),
empty input, boundary (exactly 5000), and upsert idempotency. Uses
pre-computed embeddings to avoid downloading the sentence-transformers
model during CI.
@web3guru888

Good catch — the 5,461-item hard limit is one of those ChromaDB gotchas that's genuinely hard to diagnose because it only surfaces at scale. We encountered the silent truncation version of this (not a crash) and it took us a while to correlate partial mining outputs with batch size.

The chunked_add() / chunked_upsert() helper approach is the right fix. A few notes from operating at this scale:

5,000 chunk size is a safe margin, but you might want it configurable. ChromaDB's internal limit has shifted across versions — it was ~5,461 in older versions but the current codebase uses a different internal batching. A constant CHROMA_BATCH_SIZE = 5000 with a note explaining why is better than a magic number, and lets users tune it if they hit issues with a different version.
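A sketch of that idea — the env var name and fallback logic are hypothetical, and the `get_max_batch_size()` probe is hedged behind `getattr` since not every ChromaDB client version exposes it:

```python
import os

# Documented constant rather than a magic number: ChromaDB's internal batch
# limit has varied across versions (~5,461 in the version this PR targets),
# so 5,000 leaves headroom and can be overridden without a code change.
CHROMA_BATCH_SIZE = int(os.environ.get("MEMPALACE_CHROMA_BATCH_SIZE", "5000"))


def effective_batch_size(client=None):
    """Use the server-reported limit when the client exposes one."""
    probe = getattr(client, "get_max_batch_size", None)
    if callable(probe):
        return min(CHROMA_BATCH_SIZE, probe())
    return CHROMA_BATCH_SIZE
```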

Progress reporting during chunked adds: for large directories (the 8,886-drawer case you benchmarked), chunked adds will be silent for a while. Even a simple logger.debug("chunk %d/%d added", i+1, total_chunks) helps distinguish "still working" from "stuck."

Partial failure handling: if chunk N fails after chunks 0..N-1 have been committed, the palace is in a partially-mined state. Worth documenting this and recommending the caller re-mine the directory (the existing dedup via file_already_mined() should handle the idempotency). Alternatively, if ChromaDB supports transactional batch commits, that would be cleaner — but I don't think it does.
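Combining the progress and partial-failure points above, a hedged sketch (logger name and messages are illustrative; the key property is that upsert is idempotent on ids, so "re-run the whole batch" is a safe recovery):

```python
import logging

logger = logging.getLogger("mempalace.miner")


def chunked_upsert(collection, ids, documents, *, chunk_size=5000):
    """Upsert in chunks. If chunk N fails, chunks 0..N-1 are already
    committed; because upsert is idempotent on ids, the caller can simply
    re-run the whole batch rather than track a resume point."""
    total_chunks = -(-len(ids) // chunk_size)  # ceiling division
    for n, start in enumerate(range(0, len(ids), chunk_size)):
        end = start + chunk_size
        try:
            collection.upsert(ids=ids[start:end], documents=documents[start:end])
        except Exception:
            logger.error("chunk %d/%d failed; earlier chunks are committed, "
                         "re-run the batch to finish (upsert makes retry safe)",
                         n + 1, total_chunks)
            raise
        logger.debug("chunk %d/%d added", n + 1, total_chunks)
```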

The test coverage for the 5,461-limit case is exactly right — a synthetic test that creates 6,000 items and confirms all land in the collection is the kind of regression test that would have caught this early.

Good PR for a subtle production issue.



@nanoscopic

Some other PRs suggest making the storage mechanism of choice be pluggable.

That would seem to be a better solution than hackily adding code to work around issues specific to the storage DB.

Better to make storage switchable before complicating things further, no?

@web3guru888

@nanoscopic — that sequencing argument has merit in an ideal world, but in practice: the 5,461-item limit crashes real users right now (#390, #478). Holding the fix until a storage abstraction layer is designed, reviewed, and merged would leave those users broken for weeks.

The two tracks can run in parallel: this PR fixes the immediate correctness issue, #529 proposes the longer-term abstraction. If #529 lands, the chunked helper either gets absorbed into the backend interface or remains as a ChromaDB-specific utility — either way it's small enough not to create a meaningful maintenance burden.

@bensig
Collaborator

bensig commented Apr 11, 2026

> Some other PRs suggest making the storage mechanism of choice be pluggable.
>
> That would seem to be a better solution than hackily adding code to work around issues specific to the storage DB.
>
> Better to make storage switchable before complicating things further, no?

switching storage might be the way to go

@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:21
@igorls igorls added area/cli CLI commands area/mining File and conversation mining bug Something isn't working labels Apr 14, 2026