HNSW index bloat: link_lists.bin grows to 441GB when mining >10K drawers #344

@yoshi280

Summary

When mining a project with ~56K drawers, ChromaDB's HNSW index file link_lists.bin grows to 441GB apparent size (168GB on disk) instead of the expected ~8MB. This caused two sequential system crashes due to memory and disk exhaustion before the problem was identified. The mempalace status command segfaults (exit 139) when attempting to load the corrupted index.

Environment

  • mempalace v3.0.0
  • chromadb 1.5.7
  • macOS Darwin 25.3.0

Steps to Reproduce

  1. Have a project with ~50K+ text files (or fewer files that chunk into 50K+ drawers)
  2. Run mempalace mine <project_dir>
  3. Observe link_lists.bin growing to hundreds of GB
  4. mempalace status segfaults

Smaller mines (~10K drawers) work fine. The issue appears at scale.
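One way to watch for step 3 is to compare the file's apparent size with the space actually allocated on disk, since the two diverge for a sparse file. The sketch below is only an illustration: the helper names and the 1 GiB threshold are assumptions, and it relies on `st_blocks` being counted in 512-byte units, which holds on macOS and Linux.

```python
import os
from pathlib import Path


def file_sizes(path):
    """Return (apparent_bytes, allocated_bytes) for one file."""
    st = os.stat(path)
    apparent = st.st_size           # logical length, including sparse holes
    allocated = st.st_blocks * 512  # blocks actually backed by disk (512-byte units)
    return apparent, allocated


def find_bloated_indexes(palace_dir, limit_bytes=1 << 30):
    """List (path, apparent, allocated) for every link_lists.bin above limit_bytes.

    palace_dir and limit_bytes are illustrative; adjust them to your setup.
    """
    hits = []
    for path in Path(palace_dir).expanduser().rglob("link_lists.bin"):
        apparent, allocated = file_sizes(path)
        if apparent > limit_bytes:
            hits.append((path, apparent, allocated))
    return hits
```

Calling `find_bloated_indexes("~/.mempalace/palace")` while a mine is running should return an empty list for a healthy index. An entry whose apparent size is far above its allocated size suggests the file is being extended with holes.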

Root Cause

ChromaDB creates the HNSW collection with default parameters:

  • hnsw:batch_size = 100
  • hnsw:sync_threshold = 1000
  • Initial capacity = 1000
  • resize_factor = 1.2

When mempalace inserts drawers one at a time via collection.upsert(), every 1000 inserts triggers a persistDirty() call, and the index resizes roughly 23 times to accommodate 56K elements (1000 -> 1200 -> 1440 -> ... -> 56000+). The persistDirty() function uses relative seek positioning in link_lists.bin, and after many resize cycles the seek positions drift, causing macOS to extend the sparse file with zero-filled regions. Each resize+persist cycle compounds the problem.
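The resize and persist counts can be checked with a little arithmetic. This is a sketch under a simplified model, not ChromaDB's actual resize logic: it assumes capacity grows by exactly the resize factor each time and is truncated to an integer, and that one persist happens per full sync_threshold inserts. The function names are made up for illustration.

```python
def count_resizes(target, initial=1000, num=6, den=5):
    """Resizes needed before capacity reaches target, growing by num/den (= 1.2).

    Integer arithmetic avoids float rounding; this is an assumed model of the
    growth described in the issue, not the library's real implementation.
    """
    capacity, resizes = initial, 0
    while capacity < target:
        capacity = capacity * num // den  # capacity * 1.2, truncated
        resizes += 1
    return resizes, capacity


def count_persists(inserts, sync_threshold=1000):
    """Persist calls triggered if one fires per sync_threshold inserts."""
    return inserts // sync_threshold


print(count_resizes(56000))          # (resizes, final capacity) with the defaults
print(count_persists(56000))         # persists with the default sync_threshold
print(count_persists(56000, 50000))  # persists with a 50000 threshold
```

Under this model, raising the threshold cuts the number of persist calls sharply, while the resize count depends only on the initial capacity and the growth factor.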

Relevant upstream issues:

Impact

  • Disk exhaustion: 441GB file on a laptop with limited SSD space
  • System crashes: Two sequential crashes from memory/disk pressure before the cause was identified
  • Data loss risk: The HNSW index stores the actual embedding vectors; when corrupted, they cannot be recovered from ChromaDB's SQLite (which only stores IDs, metadata, and documents)
  • Segfaults: mempalace status and mempalace search crash when trying to load the bloated index

Workaround

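Delete the corrupted index and re-mine. Note that the glob below removes every subdirectory of ~/.mempalace/palace, not just the bloated file: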

rm -rf ~/.mempalace/palace/*/
mempalace mine <project_dir>

Fix

PR incoming -- two changes:

  1. Pass hnsw:batch_size=50000 and hnsw:sync_threshold=50000 on collection creation to defer persistence until large batches complete
  2. Batch per-file drawer upserts instead of inserting one drawer at a time, reducing the number of resize operations
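The two changes might look roughly like the sketch below. It is not the code from the PR. The `(id, document, metadata)` drawer shape, the `upsert_drawers` helper, and the default batch size of 1000 are assumptions for illustration. Whether chromadb 1.5.7 honors these `hnsw:` keys when they are passed as collection metadata at creation time is also an assumption.

```python
from itertools import islice

# Settings from change 1. Assumed to be passed at creation time, e.g.
# client.get_or_create_collection(name, metadata=HNSW_METADATA).
HNSW_METADATA = {"hnsw:batch_size": 50000, "hnsw:sync_threshold": 50000}


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def upsert_drawers(collection, drawers, batch_size=1000):
    """Upsert drawers in batches instead of one call per drawer (change 2).

    `drawers` is an iterable of (id, document, metadata) tuples, a hypothetical
    shape. `collection` is any object with a chromadb-style
    upsert(ids=..., documents=..., metadatas=...) method.
    Returns the number of upsert calls made.
    """
    calls = 0
    for chunk in batched(drawers, batch_size):
        ids, documents, metadatas = zip(*chunk)
        collection.upsert(
            ids=list(ids),
            documents=list(documents),
            metadatas=list(metadatas),
        )
        calls += 1
    return calls
```

With this shape, a file that chunks into 2,500 drawers costs three upsert calls instead of 2,500. The batch size should stay below whatever maximum the client enforces.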

Labels

  • area/mining: File and conversation mining
  • bug: Something isn't working
  • performance: Performance improvements
