fix: prevent HNSW index bloat from resize+persist cycles #346
yoshi280 wants to merge 1 commit into MemPalace:develop
Conversation
Set `hnsw:batch_size` and `hnsw:sync_threshold` to 50K on collection creation so ChromaDB defers index persistence until large batches complete. Batch per-file drawer upserts instead of inserting one at a time. Without this fix, mining ~56K drawers causes `link_lists.bin` to grow to 441GB (apparent) / 168GB (on disk) due to pathological sparse file allocation in hnswlib's `persistDirty()` function during resize cycles.
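For illustration, the creation-time change looks roughly like this (a sketch, not the PR's exact code; the constant name `_HNSW_METADATA`, the collection name, and the `hnsw:space` value are assumptions):

```python
import chromadb

# Tuned creation-time HNSW params (illustrative; named _HNSW_METADATA to
# match the Changes section below, but treat the exact code as an assumption).
_HNSW_METADATA = {
    "hnsw:batch_size": 50_000,      # defer flushing of bulk inserts
    "hnsw:sync_threshold": 50_000,  # defer persisting the index to disk
}

client = chromadb.PersistentClient(path="/path/to/palace")  # placeholder path
collection = client.create_collection(
    name="drawers",  # placeholder name
    metadata={"hnsw:space": "cosine", **_HNSW_METADATA},  # space value assumed
)
```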
web3guru888 left a comment
This is a serious fix — a 56,083x file size ratio is the kind of failure that's invisible until it takes down a machine. Good root cause analysis on the seek drift compounding through resize cycles.
The HNSW metadata approach (setting batch_size and sync_threshold at collection creation) is the right intervention point. One thing worth documenting: these values are baked in at collection creation time and can't be changed later without recreating the collection. Users upgrading an existing palace will need to re-mine to get the fix, since collections created before this PR will retain the default params. A note in the PR description or changelog would help people who hit mysterious post-mine slowdowns and don't realize their existing collection predates the fix.
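For anyone hitting that, a quick check along these lines should tell you whether a collection predates the fix (a sketch; the palace path and collection name are placeholders):

```python
import chromadb

client = chromadb.PersistentClient(path="/path/to/palace")  # assumed path
col = client.get_collection("drawers")  # assumed collection name
meta = col.metadata or {}
if "hnsw:batch_size" not in meta:
    print("Collection predates the fix -- re-mine to pick up the new HNSW params.")
```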
On the batching in process_file(): collecting all drawers into lists and doing a single upsert per file rather than per chunk is a meaningful change beyond just the HNSW metadata — it also reduces ChromaDB's transaction overhead. Worth calling out explicitly as a separate improvement in the PR description, since it benefits even users who never approach 56K drawers.
The test plan looks thorough. The <1K and 10K+ scale tests are the critical ones for catching regressions.
Test Results

Verified the fix on a real workload: 57,287 drawers mined across 3 projects in sequential runs.

Before fix: default HNSW params, one-at-a-time upserts.

After fix:

| Metric | Value |
|---|---|
| Drawers | 57,287 |
| `link_lists.bin` apparent size | 433KB |
| `link_lists.bin` on-disk size | 433KB (no sparse allocation) |
| Total palace size | 551MB |
| `mempalace status` | Works correctly |
| `mempalace search` | Works correctly |
| System impact | None -- stable throughout |
Test details
Three sequential mines on macOS Darwin 25.3.0, ChromaDB 1.5.7, mempalace 3.0.0:
- alpha-seek (~10K drawers) -- codebase + GSD project archive (55 milestones, 180 decisions, 343 tasks exported from SQLite to markdown)
- optimus-prime (~47K drawers) -- large prior project with 45 milestones and 18K LOC
All mines completed without errors. Index stayed healthy throughout. The deferred persistence (batch_size=50000) means ChromaDB accumulates inserts in memory and writes the HNSW index once rather than doing hundreds of resize+persist cycles.
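A quick way to check the post-mine invariant (a sketch; the path is a placeholder for the actual Chroma collection dir):

```python
import os

# Compare apparent size with actual disk allocation to catch sparse-file bloat.
st = os.stat("/path/to/palace/<collection-uuid>/link_lists.bin")  # placeholder
apparent = st.st_size          # size `ls -l` reports
on_disk = st.st_blocks * 512   # bytes actually allocated (POSIX 512-byte blocks)
print(f"apparent={apparent:,} B, on-disk={on_disk:,} B")
if apparent > on_disk:
    print("warning: sparse allocation -- the bloat pattern this fix prevents")
```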
Important note: concurrent mines corrupt the index
During testing, we also confirmed that running two mempalace mine commands in parallel (writing to the same ChromaDB collection simultaneously) corrupts the HNSW index even with the fix applied. ChromaDB's PersistentClient is not designed for multi-process writes. Mines must be run sequentially. This should probably be documented or enforced with a lock file.
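If the project goes the lock-file route, something like this sketch would be enough (names are assumptions, and `fcntl` limits it to POSIX):

```python
import fcntl
import sys
from pathlib import Path

def acquire_mine_lock(palace_dir: Path):
    """Take an exclusive non-blocking lock so a second concurrent mine fails fast."""
    lock_file = open(palace_dir / ".mine.lock", "w")  # hypothetical lock path
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("another mine is writing to this palace; run mines sequentially")
    return lock_file  # keep the handle open for the duration of the mine
```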
Hey @yoshi280 — this conflicts with main now. Could you rebase? We'd like to include this in the next release.
@yoshi280 the rebase is worth doing — the 441GB → 4.4MB reduction on 57k drawers is a compelling result that deserves to land. The test data is also exactly the right kind of evidence for this fix: real sequential mine runs on a large corpus, not a synthetic benchmark. That's hard to dismiss. If it helps to scope the conflict: the main-branch changes that usually cause HNSW-related conflicts are in the batch upsert path (PRs #298 and #492 touched the upsert loop). If those have merged since your branch was cut, the conflict is likely just in the ChromaDB client init block or the call signature. Should be a small rebase.
@bensig awesome! I will try rebasing later this week when I'm back in office. Really enjoying the app btw! I'll get back to u in a few days.
@robot-rocket-science Hey — opened #1191 with this rebased onto current |
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every collection-creation call site:

* `mempalace/backends/chroma.py` — `get_collection(create=True)` and the legacy `create_collection()` path. Preserves existing `hnsw:space`, `hnsw:num_threads=1` (race fix from MemPalace#976), and `**ef_kwargs` (embedding-function plumbing from a4868a3).
* `mempalace/mcp_server.py` — the direct `client.get_or_create_collection` path used when a palace is first opened by the MCP server. Without this third site, MCP-bootstrapped palaces would skip the guard and could still trigger the original bloat.

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index resizes and hundreds of `persistDirty()` calls. `persistDirty()` uses relative seek positioning in `link_lists.bin`; accumulated seek drift across resize cycles causes the OS to extend the sparse file with zero-filled regions, each cycle compounding the next. Result: `link_lists.bin` grows into hundreds of GB sparse, after which `status`, `search`, and `repair` all segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5 wings with this fix applied. Final palace is 376 MB, `link_lists.bin` stays at 0 bytes across both Chroma collection dirs, and status and search both return cleanly. The same workload without the fix bloated the palace to 565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb 1.5.x exposes a `collection.modify(configuration={"hnsw": {...}})` retrofit path for already-created collections (`UpdateHNSWConfiguration`), but this PR doesn't pursue it — by the time `link_lists.bin` has bloated, the index is already corrupt and the only known recovery is a fresh mine.

Tests assert both keys land on the persisted collection metadata in both `ChromaBackend` code paths, which also covers the MemPalace#1161 "config silently dropped" concern at CI time. A separate smoke test verified the metadata round-trips through a `chromadb.PersistentClient` reopen on chromadb 1.5.8.

Closes MemPalace#344
Supersedes MemPalace#346

Co-authored-by: robot-rocket-science <[email protected]>
Summary
- Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50,000 on ChromaDB collection creation to defer index persistence until large batches complete
- Batches upserts in `process_file()` instead of inserting one drawer at a time via `add_drawer()`
- Applies the same HNSW metadata in `mcp_server.py` for collections created via the MCP server

Fixes #344
Problem
Mining ~56,000 drawers (from a GSD project database export) caused `link_lists.bin` to grow to 441GB apparent / 168GB on disk instead of the expected ~8MB. This caused two sequential system crashes from disk/memory exhaustion before the problem was identified. The corrupted index made `mempalace status` segfault (exit 139) and rendered the palace unusable.

Root cause: with default HNSW params (`batch_size=100`, `sync_threshold=1000`, initial capacity 1000), inserting 56K drawers one at a time triggers ~30 index resizes and ~560 `persistDirty()` calls. The `persistDirty()` function uses relative seek positioning in `link_lists.bin`, and after many resize cycles the accumulated seek positions drift, causing the OS to extend the sparse file with zero-filled regions. Each cycle compounds the problem.

Numbers:
- `link_lists.bin`: 473,786,104,508 bytes apparent (441GB), 168GB actual disk
- `data_level0.bin`: 64MB (correctly sized)
- Expected `link_lists.bin` for 56K vectors at M=16: ~8.4MB
Changes

- `miner.py`: `_HNSW_METADATA` constant with tuned batch/sync thresholds; `get_collection()` passes the metadata on `create_collection()`; `process_file()` now collects all drawer IDs, documents, and metadata into lists and calls `collection.upsert()` once per file instead of once per chunk (see the sketch below)
- `mcp_server.py`: `_get_collection(create=True)` passes the same HNSW metadata to `get_or_create_collection()`
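The per-file batching in `process_file()` amounts to roughly this (a sketch; variable names are assumptions, not the PR's exact code):

```python
# Collect every drawer extracted from one file, then write once.
ids, documents, metadatas = [], [], []
for chunk in chunks:
    ids.append(chunk.drawer_id)
    documents.append(chunk.text)
    metadatas.append(chunk.metadata)

# One ChromaDB call per file instead of one per drawer; combined with
# hnsw:batch_size=50_000 this avoids the resize+persist churn entirely.
collection.upsert(ids=ids, documents=documents, metadatas=metadatas)
```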
Test plan

- `link_lists.bin` stays under 1MB after mines at <1K and 10K+ scale
- `mempalace status` and `mempalace search` work after large mines