
[PERF] Add hnsw:initial_capacity to reduce memory from HNSW index resizing#6621

Open
takayan0908 wants to merge 1 commit into chroma-core:main from takayan0908:feat/hnsw-initial-capacity

Conversation

@takayan0908

Summary

  • Adds a new hnsw:initial_capacity metadata parameter that controls the initial max_elements for the HNSW index
  • Replaces all hardcoded DEFAULT_CAPACITY references in index initialization and resize logic with the user-configurable value
  • Default value (1000) preserves existing behavior — no breaking changes

Problem

When adding many vectors to a collection incrementally, the HNSW index starts at DEFAULT_CAPACITY=1000 and grows by resize_factor=1.2x each time capacity is exceeded:

1000 → 1200 → 1440 → 1728 → 2074 → 2489 → 2987  (6 resizes for ~3000 vectors)

Each resize_index() call in hnswlib allocates a new contiguous buffer and frees the old one. However, memory allocators (glibc and jemalloc) typically do not return these large freed buffers to the OS. The old buffers remain mapped in the process address space, causing RSS to grow far beyond actual data requirements.
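The growth sequence above can be reproduced with a short simulation. This is a sketch, not Chroma's actual resize logic; it assumes capacity is rounded up at each step:

```python
import math

def resize_count(n_vectors, initial_capacity=1000, resize_factor=1.2):
    """Count resize operations before the index can hold n_vectors."""
    capacity = initial_capacity
    resizes = 0
    while capacity < n_vectors:
        capacity = math.ceil(capacity * resize_factor)  # 1000 -> 1200 -> 1440 -> ...
        resizes += 1
    return resizes

print(resize_count(2786))                         # 6 resizes with the default capacity
print(resize_count(2786, initial_capacity=5000))  # 0 resizes when pre-sized
```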

In our production environment, we observed 50GB+ RSS for 54MB of vector data (2786 vectors, 1024-dim embeddings). Analysis of /proc/PID/smaps revealed a descending pattern of anonymous memory mappings (7GB, 6GB, 5GB, 4GB...) — each one a ghost of a previous resize operation.
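The smaps analysis can be scripted. The parser below is a minimal sketch that sums the Rss of anonymous mappings (header lines with no pathname) in smaps-format text; the sample content and sizes are illustrative, not the actual production dump:

```python
import re

def anon_rss_kb(smaps_text):
    """Sum Rss (in kB) of anonymous mappings in /proc/<pid>/smaps content."""
    # Mapping headers look like: "addr-addr perms offset dev inode [pathname]"
    header = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\S+\s*(\S*)")
    total, in_anon = 0, False
    for line in smaps_text.splitlines():
        m = header.match(line)
        if m:
            in_anon = m.group(1) == ""  # no pathname => anonymous mapping
        elif in_anon and line.startswith("Rss:"):
            total += int(line.split()[1])
    return total

# Illustrative smaps excerpt: one anonymous mapping, one file-backed mapping.
SAMPLE = """\
7f0000000000-7f01c0000000 rw-p 00000000 00:00 0
Rss:             7340032 kB
7f0400000000-7f0400100000 r-xp 00000000 08:01 42 /usr/lib/libc.so.6
Rss:                1024 kB
"""
print(anon_rss_kb(SAMPLE))  # 7340032 -- only the anonymous mapping counts
```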

Solution

Allow users to pre-size the index to avoid unnecessary resizes:

collection = client.get_or_create_collection(
    name="my_collection",
    metadata={"hnsw:initial_capacity": 5000}  # match expected dataset size
)

This follows the existing pattern of hnsw:* metadata parameters (hnsw:resize_factor, hnsw:batch_size, etc.).

Changes

| File | Change |
| --- | --- |
| `hnsw_params.py` | Add `initial_capacity` field + validator |
| `local_hnsw.py` | Use `self._params.initial_capacity` instead of `DEFAULT_CAPACITY` |
| `local_persistent_hnsw.py` | Same — both `init_index` and `load_index` paths |
| `test/property/strategies.py` | Add `hnsw:initial_capacity` to property-based test strategies |
| `test/property/test_schema.py` | Add to schema mapping and defaults |
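The `hnsw_params.py` change can be pictured roughly as below. This is a hypothetical sketch of the field-plus-validator shape, following the pattern of the other `hnsw:*` parameters; names and structure are illustrative and may not match Chroma's internals exactly:

```python
DEFAULT_CAPACITY = 1000  # kept as the default, so existing behavior is unchanged

def validate_initial_capacity(value):
    """Reject non-integer or non-positive values for hnsw:initial_capacity."""
    if not isinstance(value, int) or isinstance(value, bool) or value < 1:
        raise ValueError(
            f"hnsw:initial_capacity must be a positive integer, got {value!r}"
        )
    return value

# The index would then be created with the configured capacity instead of
# the hardcoded constant, e.g.:
#   index.init_index(max_elements=params.initial_capacity, ...)
print(validate_initial_capacity(5000))  # 5000
```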

Test plan

  • Existing tests pass (no behavior change with default value 1000)
  • Property-based tests now exercise hnsw:initial_capacity with random values
  • Manual verification: creating a collection with hnsw:initial_capacity=5000 and adding 3000 vectors results in 0 resize operations (vs 6 previously)

🤖 Generated with Claude Code

…quency

When adding many vectors to a collection, the HNSW index starts with
DEFAULT_CAPACITY=1000 and grows by resize_factor=1.2x each time capacity
is exceeded. For a collection of ~3000 vectors, this causes 6+ resize
operations (1000→1200→1440→1728→2074→2489→2987). Each resize_index()
call allocates a new contiguous buffer while the old one is freed — but
memory allocators (glibc, jemalloc) typically do not return these large
freed buffers to the OS, causing RSS to grow far beyond the actual data
size (e.g., 50GB+ RSS for 54MB of vector data).

This commit adds a new `hnsw:initial_capacity` metadata parameter that
allows users to set the initial index capacity to match their expected
dataset size, dramatically reducing the number of resize operations and
the associated memory fragmentation.

Usage:
    collection = client.get_or_create_collection(
        name="my_collection",
        metadata={"hnsw:initial_capacity": 5000}
    )

The default value (1000) preserves existing behavior.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (readability, modularity, intuitiveness)?

Contributor

@propel-code-bot propel-code-bot Bot left a comment


No issues found; changes appear aligned with the intended behavior and defaults.

Status: No Issues Found | Risk: Low

Review Details

📁 5 files reviewed | 💬 0 comments
