## Summary
When mining a project with ~56K drawers, ChromaDB's HNSW index file `link_lists.bin` grows to 441GB apparent size (168GB on disk) instead of the expected ~8MB. This caused two sequential system crashes due to memory and disk exhaustion before the problem was identified. The `mempalace status` command segfaults (exit 139) when attempting to load the corrupted index.
## Environment
- mempalace v3.0.0
- chromadb 1.5.7
- macOS Darwin 25.3.0
## Steps to Reproduce
1. Have a project with ~50K+ text files (or fewer files that chunk into 50K+ drawers)
2. Run `mempalace mine <project_dir>`
3. Observe:
   - `link_lists.bin` growing to hundreds of GB
   - `mempalace status` segfaults
Smaller mines (~10K drawers) work fine. The issue appears at scale.
## Root Cause
ChromaDB creates the HNSW collection with default parameters:

- `hnsw:batch_size` = 100
- `hnsw:sync_threshold` = 1000
- Initial capacity = 1000
- `resize_factor` = 1.2
When mempalace inserts drawers one at a time via `collection.upsert()`, every 1000 inserts trigger a `persistDirty()` call, and the index resizes ~30 times to accommodate 56K elements (1000 -> 1200 -> 1440 -> ... -> 56000+). The `persistDirty()` function uses relative seek positioning in `link_lists.bin`, and after many resize cycles the seek positions drift, causing macOS to extend the sparse file with zero-filled regions. Each resize+persist cycle compounds the problem.
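The effect of the sync threshold can be sketched with a small, self-contained simulation (the function name is illustrative, not ChromaDB's API): with the default threshold, a 56K-drawer mine flushes the index to disk dozens of times, each flush being another opportunity for the seek offsets to drift.

```python
def persist_flushes(n_inserts: int, sync_threshold: int) -> int:
    """Count how many times the index is flushed to disk when items
    are inserted one at a time: one flush per sync_threshold inserts."""
    return n_inserts // sync_threshold

# Default threshold: a flush (and a chance to extend link_lists.bin
# via drifting relative seeks) every 1000 inserts.
print(persist_flushes(56_000, 1_000))   # 56 flushes
# A 50K threshold defers nearly all of that work to a single flush.
print(persist_flushes(56_000, 50_000))  # 1 flush
```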
Relevant upstream issues:

- `hnsw:initial_capacity` (not yet merged)
## Impact
- Disk exhaustion: 441GB file on a laptop with limited SSD space
- System crashes: Two sequential crashes from memory/disk pressure before the cause was identified
- Data loss risk: The HNSW index stores the actual embedding vectors; when corrupted, they cannot be recovered from ChromaDB's SQLite (which only stores IDs, metadata, and documents)
- Segfaults: `mempalace status` and `mempalace search` crash when trying to load the bloated index
## Workaround
Delete the corrupted index and re-mine:
```sh
rm -rf ~/.mempalace/palace/*/
mempalace mine <project_dir>
```
## Fix
PR incoming -- two changes:
- Pass `hnsw:batch_size=50000` and `hnsw:sync_threshold=50000` on collection creation to defer persistence until large batches complete
- Batch per-file drawer upserts instead of inserting one drawer at a time, reducing the number of resize operations
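A sketch of both changes together. The `hnsw:*` metadata keys are the ones named above; the `batched` helper, the collection name, and the drawer fields are illustrative, not mempalace's actual code:

```python
def batched(items, size=50_000):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# HNSW settings passed as collection metadata at creation time,
# deferring persistence until large batches complete.
HNSW_METADATA = {
    "hnsw:batch_size": 50_000,
    "hnsw:sync_threshold": 50_000,
}

# Usage sketch (not executed here; requires a chromadb client):
# collection = client.get_or_create_collection("drawers", metadata=HNSW_METADATA)
# for chunk in batched(drawers):
#     collection.upsert(
#         ids=[d.id for d in chunk],
#         documents=[d.text for d in chunk],
#     )
```

At 50K drawers per call, a 56K-drawer mine becomes two upserts instead of 56K, so the index resizes and persists a handful of times rather than once per thousand inserts.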