fix: prevent HNSW index bloat from resize+persist cycles #346
yoshi280 wants to merge 1 commit into MemPalace:develop
Conversation
Set `hnsw:batch_size` and `hnsw:sync_threshold` to 50K on collection creation so ChromaDB defers index persistence until large batches complete. Batch per-file drawer upserts instead of inserting one at a time. Without this fix, mining ~56K drawers causes `link_lists.bin` to grow to 441GB (apparent) / 168GB (on disk) due to pathological sparse file allocation in hnswlib's `persistDirty()` function during resize cycles.
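For illustration, the creation-time change looks roughly like this (a sketch, not the PR's exact code; the constant name `_HNSW_METADATA`, the collection name, and the `hnsw:space` value are assumptions):

```python
import chromadb

# Tuned creation-time HNSW params (illustrative; named _HNSW_METADATA to
# match the Changes section below, but treat the exact code as an assumption).
_HNSW_METADATA = {
    "hnsw:batch_size": 50_000,      # defer flushing of bulk inserts
    "hnsw:sync_threshold": 50_000,  # defer persisting the index to disk
}

client = chromadb.PersistentClient(path="/path/to/palace")  # placeholder path
collection = client.create_collection(
    name="drawers",  # placeholder name
    metadata={"hnsw:space": "cosine", **_HNSW_METADATA},  # space value assumed
)
```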
web3guru888 left a comment
This is a serious fix — a 56,083x file size ratio is the kind of failure that's invisible until it takes down a machine. Good root cause analysis on the seek drift compounding through resize cycles.
The HNSW metadata approach (setting batch_size and sync_threshold at collection creation) is the right intervention point. One thing worth documenting: these values are baked in at collection creation time and can't be changed later without recreating the collection. Users upgrading an existing palace will need to re-mine to get the fix, since collections created before this PR will retain the default params. A note in the PR description or changelog would help people who hit mysterious post-mine slowdowns and don't realize their existing collection predates the fix.
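For anyone hitting that, a quick check along these lines should tell you whether a collection predates the fix (a sketch; the palace path and collection name are placeholders):

```python
import chromadb

client = chromadb.PersistentClient(path="/path/to/palace")  # assumed path
col = client.get_collection("drawers")  # assumed collection name
meta = col.metadata or {}
if "hnsw:batch_size" not in meta:
    print("Collection predates the fix -- re-mine to pick up the new HNSW params.")
```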
On the batching in process_file(): collecting all drawers into lists and doing a single upsert per file rather than per chunk is a meaningful change beyond just the HNSW metadata — it also reduces ChromaDB's transaction overhead. Worth calling out explicitly as a separate improvement in the PR description, since it benefits even users who never approach 56K drawers.
The test plan looks thorough. The <1K and 10K+ scale tests are the critical ones for catching regressions.
Test Results

Verified the fix on a real workload: 57,287 drawers mined across 3 projects in sequential runs.

Before fix: default HNSW params, one-at-a-time upserts.

After fix:

| Metric | Value |
|---|---|
| Drawers | 57,287 |
| `link_lists.bin` apparent size | 433KB |
| `link_lists.bin` on-disk size | 433KB (no sparse allocation) |
| Total palace size | 551MB |
| `mempalace status` | Works correctly |
| `mempalace search` | Works correctly |
| System impact | None -- stable throughout |
Test details
Three sequential mines on macOS Darwin 25.3.0, ChromaDB 1.5.7, mempalace 3.0.0:
- alpha-seek (~10K drawers) -- codebase + GSD project archive (55 milestones, 180 decisions, 343 tasks exported from SQLite to markdown)
- optimus-prime (~47K drawers) -- large prior project with 45 milestones and 18K LOC
All mines completed without errors. Index stayed healthy throughout. The deferred persistence (batch_size=50000) means ChromaDB accumulates inserts in memory and writes the HNSW index once rather than doing hundreds of resize+persist cycles.
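A quick way to check the post-mine invariant (a sketch; the path is a placeholder for the actual Chroma collection dir):

```python
import os

# Compare apparent size with actual disk allocation to catch sparse-file bloat.
st = os.stat("/path/to/palace/<collection-uuid>/link_lists.bin")  # placeholder
apparent = st.st_size          # size `ls -l` reports
on_disk = st.st_blocks * 512   # bytes actually allocated (POSIX 512-byte blocks)
print(f"apparent={apparent:,} B, on-disk={on_disk:,} B")
if apparent > on_disk:
    print("warning: sparse allocation -- the bloat pattern this fix prevents")
```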
Important note: concurrent mines corrupt the index
During testing, we also confirmed that running two mempalace mine commands in parallel (writing to the same ChromaDB collection simultaneously) corrupts the HNSW index even with the fix applied. ChromaDB's PersistentClient is not designed for multi-process writes. Mines must be run sequentially. This should probably be documented or enforced with a lock file.
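If the project goes the lock-file route, something like this sketch would be enough (names are assumptions, and `fcntl` limits it to POSIX):

```python
import fcntl
import sys
from pathlib import Path

def acquire_mine_lock(palace_dir: Path):
    """Take an exclusive non-blocking lock so a second concurrent mine fails fast."""
    lock_file = open(palace_dir / ".mine.lock", "w")  # hypothetical lock path
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("another mine is writing to this palace; run mines sequentially")
    return lock_file  # keep the handle open for the duration of the mine
```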
Hey @yoshi280 — this conflicts with main now. Could you rebase? We'd like to include this in the next release.
@yoshi280 the rebase is worth doing — the 441GB → 4.4MB reduction on 57k drawers is a compelling result that deserves to land. The test data is also exactly the right kind of evidence for this fix: real sequential mine runs on a large corpus, not a synthetic benchmark. That's hard to dismiss. If it helps to scope the conflict: the main-branch changes that usually cause HNSW-related conflicts are in the batch upsert path (PRs #298 and #492 touched the upsert loop). If those have merged since your branch was cut, the conflict is likely just in the ChromaDB client init block or the call signature. Should be a small rebase.
@bensig awesome! I will try rebasing later this week when I'm back in office. Really enjoying the app btw! I'll get back to u in a few days.
@robot-rocket-science Hey — opened #1191 with this rebased onto current |
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every collection-creation call site:

* `mempalace/backends/chroma.py` — `get_collection(create=True)` and the legacy `create_collection()` path. Preserves existing `hnsw:space`, `hnsw:num_threads=1` (race fix from MemPalace#976), and `**ef_kwargs` (embedding-function plumbing from a4868a3).
* `mempalace/mcp_server.py` — the direct `client.get_or_create_collection` path used when a palace is first opened by the MCP server. Without this third site, MCP-bootstrapped palaces would skip the guard and could still trigger the original bloat.

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index resizes and hundreds of `persistDirty()` calls. `persistDirty()` uses relative seek positioning in `link_lists.bin`; accumulated seek drift across resize cycles causes the OS to extend the sparse file with zero-filled regions, each cycle compounding the next. Result: `link_lists.bin` grows into hundreds of GB sparse, after which `status`, `search`, and `repair` all segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5 wings with this fix applied. Final palace is 376 MB, `link_lists.bin` stays at 0 bytes across both Chroma collection dirs, and status and search both return cleanly. The same workload without the fix bloated the palace to 565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb 1.5.x exposes a `collection.modify(configuration={"hnsw": {...}})` retrofit path for already-created collections (`UpdateHNSWConfiguration`), but this PR doesn't pursue it — by the time `link_lists.bin` has bloated, the index is already corrupt and the only known recovery is a fresh mine.

Tests assert both keys land on the persisted collection metadata in both `ChromaBackend` code paths, which also covers the MemPalace#1161 "config silently dropped" concern at CI time. A separate smoke test verified the metadata round-trips through a `chromadb.PersistentClient` reopen on chromadb 1.5.8.

Closes MemPalace#344
Supersedes MemPalace#346

Co-authored-by: robot-rocket-science <[email protected]>
Summary
- Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50,000 on ChromaDB collection creation to defer index persistence until large batches complete
- Batches upserts in `process_file()` instead of inserting one drawer at a time via `add_drawer()`
- Applies the same HNSW metadata in `mcp_server.py` for collections created via the MCP server

Fixes #344
Problem
Mining ~56,000 drawers (from a GSD project database export) caused `link_lists.bin` to grow to 441GB apparent / 168GB on disk instead of the expected ~8MB. This caused two sequential system crashes from disk/memory exhaustion before the problem was identified. The corrupted index made `mempalace status` segfault (exit 139) and rendered the palace unusable.

Root cause: with default HNSW params (`batch_size=100`, `sync_threshold=1000`, initial capacity 1000), inserting 56K drawers one at a time triggers ~30 index resizes and ~560 `persistDirty()` calls. The `persistDirty()` function uses relative seek positioning in `link_lists.bin`, and after many resize cycles the accumulated seek positions drift, causing the OS to extend the sparse file with zero-filled regions. Each cycle compounds the problem.

Numbers:
- `link_lists.bin`: 473,786,104,508 bytes apparent (441GB), 168GB actual disk
- `data_level0.bin`: 64MB (correctly sized)
- Expected `link_lists.bin` for 56K vectors at M=16: ~8.4MB
Changes

- `miner.py`: `_HNSW_METADATA` constant with tuned batch/sync thresholds; `get_collection()` passes the metadata on `create_collection()`; `process_file()` now collects all drawer IDs, documents, and metadata into lists and calls `collection.upsert()` once per file instead of once per chunk (see the sketch below)
- `mcp_server.py`: `_get_collection(create=True)` passes the same HNSW metadata to `get_or_create_collection()`
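The per-file batching in `process_file()` amounts to roughly this (a sketch; variable names are assumptions, not the PR's exact code):

```python
# Collect every drawer extracted from one file, then write once.
ids, documents, metadatas = [], [], []
for chunk in chunks:
    ids.append(chunk.drawer_id)
    documents.append(chunk.text)
    metadatas.append(chunk.metadata)

# One ChromaDB call per file instead of one per drawer; combined with
# hnsw:batch_size=50_000 this avoids the resize+persist churn entirely.
collection.upsert(ids=ids, documents=documents, metadatas=metadatas)
```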
Test plan

- `link_lists.bin` stays under 1MB after mines at <1K and 10K+ scale
- `mempalace status` and `mempalace search` work after large mines