fix: prevent HNSW index bloat from resize+persist cycles#346

Open
yoshi280 wants to merge 1 commit into MemPalace:develop from yoshi280:fix/hnsw-index-bloat

Conversation


yoshi280 commented Apr 9, 2026

Summary

  • Set hnsw:batch_size and hnsw:sync_threshold to 50,000 on ChromaDB collection creation to defer index persistence until large batches complete (a sketch follows below)
  • Batch per-file drawer upserts in process_file() instead of inserting one drawer at a time via add_drawer()
  • Apply the same HNSW metadata in mcp_server.py for collections created via the MCP server

Fixes #344
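
For reference, a minimal sketch of the creation-time change, assuming ChromaDB's standard `get_or_create_collection(metadata=...)` API; the client wiring here is illustrative, not the project's exact code:

```python
# Sketch of the miner.py change. _HNSW_METADATA is the constant added by
# this PR; the client setup around it is illustrative.
import chromadb

# Defer HNSW index persistence until large batches complete (see #344).
_HNSW_METADATA = {
    "hnsw:batch_size": 50_000,
    "hnsw:sync_threshold": 50_000,
}

def get_collection(client, name: str):
    # Metadata is only honored at creation time; collections created
    # before this fix keep the default HNSW params until re-mined.
    return client.get_or_create_collection(name=name, metadata=_HNSW_METADATA)
```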

Problem

Mining ~56,000 drawers (from a GSD project database export) caused link_lists.bin to grow to 441GB apparent / 168GB on disk instead of the expected ~8MB. This caused two sequential system crashes from disk/memory exhaustion before the problem was identified. The corrupted index made mempalace status segfault (exit 139) and rendered the palace unusable.

Root cause: With default HNSW params (batch_size=100, sync_threshold=1000, initial capacity 1000), inserting 56K drawers one at a time triggers ~30 index resizes and ~560 persistDirty() calls (one persist per 100-insert batch). The persistDirty() function uses relative seek positioning in link_lists.bin; after many resize cycles the accumulated seek positions drift, causing the OS to extend the sparse file with zero-filled regions. Each cycle compounds the problem.
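
The sparse-file mechanism is easy to reproduce in isolation. This toy snippet (not hnswlib code) seeks far past EOF and writes one byte, which inflates the apparent size while the filesystem allocates almost nothing:

```python
# Toy demonstration of sparse-file growth, mimicking the failure mode
# (this is not hnswlib's code): a seek past EOF plus a write makes the
# OS report a huge apparent size with almost no blocks allocated.
import os

with open("sparse_demo.bin", "wb") as f:
    f.seek(10 * 1024**3)   # jump 10GB past the start of the file
    f.write(b"\x00")

st = os.stat("sparse_demo.bin")
print(f"apparent: {st.st_size:,} bytes")           # ~10GB
print(f"on disk:  {st.st_blocks * 512:,} bytes")   # a few KB at most
```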

Numbers:

  • 56,387 drawers from 2,395 GSD project files
  • link_lists.bin: 473,786,104,508 bytes apparent (441GB), 168GB actual disk
  • data_level0.bin: 64MB (correctly sized)
  • Expected link_lists.bin for 56K vectors at M=16: ~8.4MB
  • Ratio: 56,083x expected size (sanity-checked below)
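
These figures hang together; a quick back-of-the-envelope check in Python:

```python
# Back-of-the-envelope check of the numbers above.
apparent = 473_786_104_508      # reported apparent size of link_lists.bin
expected = apparent / 56_083    # back out expected size from the ratio
print(f"{expected / 1e6:.2f} MB")   # ~8.45 MB, the "~8.4MB" figure above

# Under the old defaults (batch_size=100), persistDirty() fires roughly
# once per 100-insert batch:
print(56_387 // 100)            # 563, i.e. the "~560" persist calls
```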

Changes

miner.py:

  • Added _HNSW_METADATA constant with tuned batch/sync thresholds
  • get_collection() passes metadata on create_collection()
  • process_file() now collects all drawer IDs, documents, and metadata into lists and calls collection.upsert() once per file instead of once per chunk (sketched below)
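
The batched shape looks roughly like this; chunk_file() and make_drawer_id() are hypothetical placeholders for the miner's real chunking and ID logic:

```python
# Sketch of the batched upsert in process_file(). chunk_file() and
# make_drawer_id() are hypothetical stand-ins, not the project's actual
# helpers.
def process_file(collection, path: str) -> None:
    ids, documents, metadatas = [], [], []
    for i, chunk in enumerate(chunk_file(path)):
        ids.append(make_drawer_id(path, i))
        documents.append(chunk)
        metadatas.append({"source": path, "chunk": i})
    if ids:
        # One upsert per file instead of one add_drawer() call per chunk.
        collection.upsert(ids=ids, documents=documents, metadatas=metadatas)
```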

mcp_server.py:

  • _get_collection(create=True) passes the same HNSW metadata to get_or_create_collection() (see the sketch below)
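
A sketch of the server-side change, assuming _HNSW_METADATA is shared with miner.py (the import path is illustrative):

```python
# Sketch of the mcp_server.py change; the import path is illustrative.
from miner import _HNSW_METADATA

def _get_collection(client, name: str, create: bool = False):
    if create:
        # Same tuned HNSW params as the miner, applied at creation time.
        return client.get_or_create_collection(name=name,
                                               metadata=_HNSW_METADATA)
    return client.get_collection(name=name)
```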

Test plan

  • Mine a project with <1K drawers -- verify behavior unchanged
  • Mine a project with 10K+ drawers -- verify link_lists.bin stays under 1MB
  • Mine a project with 50K+ drawers -- verify no bloat (was 441GB, should be <50MB; see the size check below)
  • Verify mempalace status and mempalace search work after large mines
  • Verify MCP server creates collections with correct metadata
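
One way to script the size checks, comparing apparent vs. allocated size so sparse regrowth can't hide (the palace path is illustrative):

```python
# Post-mine bloat check for link_lists.bin. The palace path is
# illustrative; point it at the actual Chroma persist directory.
import glob
import os

pattern = os.path.expanduser("~/.mempalace/**/link_lists.bin")
for path in glob.glob(pattern, recursive=True):
    st = os.stat(path)
    apparent, on_disk = st.st_size, st.st_blocks * 512
    status = "OK" if apparent < 50 * 1024**2 else "BLOATED"
    print(f"{status}  {path}: {apparent:,} apparent / {on_disk:,} on disk")
```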

Set hnsw:batch_size and hnsw:sync_threshold to 50K on collection
creation so ChromaDB defers index persistence until large batches
complete. Batch per-file drawer upserts instead of inserting one
at a time.

Without this fix, mining ~56K drawers causes link_lists.bin to grow
to 441GB (apparent) / 168GB (on disk) due to pathological sparse file
allocation in hnswlib's persistDirty() function during resize cycles.

web3guru888 left a comment


This is a serious fix — a 56,083x file size ratio is the kind of failure that's invisible until it takes down a machine. Good root cause analysis on the seek drift compounding through resize cycles.

The HNSW metadata approach (setting batch_size and sync_threshold at collection creation) is the right intervention point. One thing worth documenting: these values are baked in at collection creation time and can't be changed later without recreating the collection. Users upgrading an existing palace will need to re-mine to get the fix, since collections created before this PR will retain the default params. A note in the PR description or changelog would help people who hit mysterious post-mine slowdowns and don't realize their existing collection predates the fix.
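
A hedged way to spot a pre-fix collection, assuming the keys round-trip through collection.metadata (verify on your chromadb version); the path and collection name here are illustrative:

```python
# Detect a collection created before this fix: it will lack the tuned
# HNSW keys in its persisted metadata. Path and name are illustrative.
import chromadb

client = chromadb.PersistentClient(path="palace_db")
coll = client.get_collection("drawers")
if "hnsw:batch_size" not in (coll.metadata or {}):
    print("pre-fix collection: re-mine to pick up the tuned HNSW params")
```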

On the batching in process_file(): collecting all drawers into lists and doing a single upsert per file rather than per chunk is a meaningful change beyond just the HNSW metadata — it also reduces ChromaDB's transaction overhead. Worth calling out explicitly as a separate improvement in the PR description, since it benefits even users who never approach 56K drawers.

The test plan looks thorough. The <1K and 10K+ scale tests are the critical ones for catching regressions.


yoshi280 commented Apr 9, 2026

Test Results

Verified the fix on a real workload: 57,287 drawers mined across 3 projects in sequential runs.

Before fix (default HNSW params, one-at-a-time upserts)

| Metric | Value |
| --- | --- |
| Drawers | ~56,000 |
| link_lists.bin apparent size | 441GB (473,786,104,508 bytes) |
| link_lists.bin on-disk size | 168GB |
| Expected size | ~8.4MB |
| Bloat factor | 56,083x |
| mempalace status | Segfault (exit 139) |
| System impact | 2 sequential crashes from memory/disk exhaustion |

After fix (hnsw:batch_size=50000, hnsw:sync_threshold=50000, batched upserts)

| Metric | Value |
| --- | --- |
| Drawers | 57,287 |
| link_lists.bin apparent size | 433KB |
| link_lists.bin on-disk size | 433KB (no sparse allocation) |
| Total palace size | 551MB |
| mempalace status | Works correctly |
| mempalace search | Works correctly |
| System impact | None -- stable throughout |

Test details

Two sequential mines on macOS (Darwin 25.3.0), ChromaDB 1.5.7, mempalace 3.0.0:

  1. alpha-seek (~10K drawers) -- codebase + GSD project archive (55 milestones, 180 decisions, 343 tasks exported from SQLite to markdown)
  2. optimus-prime (~47K drawers) -- large prior project with 45 milestones and 18K LOC

Both mines completed without errors. The index stayed healthy throughout. The deferred persistence (batch_size=50000) means ChromaDB accumulates inserts in memory and writes the HNSW index once, rather than performing hundreds of resize+persist cycles.

Important note: concurrent mines corrupt the index

During testing, we also confirmed that running two mempalace mine commands in parallel (writing to the same ChromaDB collection simultaneously) corrupts the HNSW index even with the fix applied. ChromaDB's PersistentClient is not designed for multi-process writes. Mines must be run sequentially. This should probably be documented or enforced with a lock file.
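
A minimal sketch of that lock-file guard, using POSIX advisory locks from the stdlib (the lock path is illustrative):

```python
# Advisory-lock guard so two `mempalace mine` runs cannot write to the
# same palace concurrently. POSIX-only (fcntl); lock path illustrative.
import fcntl
import sys

def acquire_mine_lock(palace_dir: str):
    lock = open(f"{palace_dir}/.mine.lock", "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("another mine is already running against this palace")
    return lock  # keep the handle open for the duration of the mine
```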


bensig commented Apr 11, 2026

Hey @yoshi280 — this conflicts with main now. Could you rebase? We'd like to include this in the next release.

@web3guru888

@yoshi280 the rebase is worth doing — the 441GB → 433KB reduction on 57K drawers is a compelling result that deserves to land.

The test data is also exactly the right kind of evidence for this fix: real sequential mine runs on a large corpus, not a synthetic benchmark. That's hard to dismiss.

If it helps to scope the conflict: the main-branch changes that usually cause HNSW-related conflicts are in the batch upsert path (PRs #298 and #492 touched the upsert loop). If those have merged since your branch was cut, the conflict is likely just in the ChromaDB client init block or the call signature. Should be a small rebase.

bensig changed the base branch from main to develop on April 11, 2026 at 22:22
@yoshi280

@bensig awesome! I'll try rebasing later this week when I'm back in the office. Really enjoying the app, btw! I'll get back to you in a few days.


funguf commented Apr 25, 2026

@robot-rocket-science Hey — opened #1191 with this rebased onto current develop. Hope you don't mind; saw it had been stalled since Apr 11, and the fix saved my palace from the same bloat you hit (565 GB in my case). Kept your authorship via a Co-authored-by trailer and credited you in the body. Happy to close #1191 in favour of yours if you'd rather rebase #346 yourself — no preference, just wanted the fix to land.

funguf added a commit to funguf/mempalace that referenced this pull request Apr 25, 2026
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 on collection
creation in both `get_collection(..., create=True)` and the legacy
`create_collection()` path. Preserves existing `hnsw:space` and
`hnsw:num_threads=1` (race fix from MemPalace#976) and the `**ef_kwargs` plumbing
for embedding-function injection (perf fix from MemPalace#1148/a4868a3).

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index
resizes and hundreds of persistDirty() calls. persistDirty uses relative
seek positioning in link_lists.bin; accumulated seek drift across resize
cycles causes the OS to extend the sparse file with zero-filled regions,
each cycle compounding the next. Result: link_lists.bin grows into
hundreds of GB sparse, after which `status`, `search`, and `repair` all
segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5
wings with this fix applied. Final palace 376 MB, link_lists.bin stays
at 0 bytes across both Chroma collection dirs, status and search both
return cleanly. Same workload without the fix bloated the palace to
565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb treats HNSW config as immutable post-creation,
so existing bloated palaces still need to be nuked and re-mined; this
only protects fresh collections.

Tests assert both keys land on the persisted collection metadata in
both code paths, which also covers the MemPalace#1161 "config silently dropped"
concern at CI time.

Closes MemPalace#344
Supersedes MemPalace#346

Co-authored-by: robot-rocket-science <[email protected]>
funguf added a commit to funguf/mempalace that referenced this pull request Apr 25, 2026
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every
collection-creation call site:

* `mempalace/backends/chroma.py` — `get_collection(create=True)` and the
  legacy `create_collection()` path. Preserves existing `hnsw:space`,
  `hnsw:num_threads=1` (race fix from MemPalace#976), and `**ef_kwargs`
  (embedding-function plumbing from a4868a3).
* `mempalace/mcp_server.py` — the direct `client.get_or_create_collection`
  path used when a palace is first opened by the MCP server. Without this
  third site, MCP-bootstrapped palaces would skip the guard and could
  still trigger the original bloat.

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index
resizes and hundreds of persistDirty() calls. persistDirty uses relative
seek positioning in link_lists.bin; accumulated seek drift across resize
cycles causes the OS to extend the sparse file with zero-filled regions,
each cycle compounding the next. Result: link_lists.bin grows into
hundreds of GB sparse, after which `status`, `search`, and `repair` all
segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5
wings with this fix applied. Final palace 376 MB, link_lists.bin stays
at 0 bytes across both Chroma collection dirs, status and search both
return cleanly. Same workload without the fix bloated the palace to
565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb 1.5.x exposes a
`collection.modify(configuration={"hnsw": {...}})` retrofit path for
already-created collections (`UpdateHNSWConfiguration`), but this PR
doesn't pursue it — by the time link_lists.bin has bloated the index
is already corrupt and the only known recovery is a fresh mine.

Tests assert both keys land on the persisted collection metadata in
both `ChromaBackend` code paths, which also covers the MemPalace#1161 "config
silently dropped" concern at CI time. A separate smoke test was used
to verify the metadata round-trips through `chromadb.PersistentClient`
reopen on chromadb 1.5.8.

Closes MemPalace#344
Supersedes MemPalace#346

Co-authored-by: robot-rocket-science <[email protected]>
igorls pushed a commit to funguf/mempalace that referenced this pull request Apr 27, 2026
igorls mentioned this pull request Apr 27, 2026
lealvona pushed a commit to lealvona/mempalace that referenced this pull request Apr 29, 2026
Labels

  • area/mcp: MCP server and tools
  • area/mining: File and conversation mining
  • bug: Something isn't working


Development

Successfully merging this pull request may close these issues.

HNSW index bloat: link_lists.bin grows to 441GB when mining >10K drawers

5 participants