
[PERF] Add hnsw:initial_capacity to reduce memory from HNSW index resizing#6621

Open
takayan0908 wants to merge 1 commit into chroma-core:main from takayan0908:feat/hnsw-initial-capacity

Conversation

@takayan0908

Summary

  • Adds a new hnsw:initial_capacity metadata parameter that controls the initial max_elements for the HNSW index
  • Replaces all hardcoded DEFAULT_CAPACITY references in index initialization and resize logic with the user-configurable value
  • Default value (1000) preserves existing behavior — no breaking changes

Problem

When adding many vectors to a collection incrementally, the HNSW index starts at DEFAULT_CAPACITY=1000 and grows by resize_factor=1.2x each time capacity is exceeded:

1000 → 1200 → 1440 → 1728 → 2074 → 2489 → 2987  (6 resizes for ~3000 vectors)

Each resize_index() call in hnswlib allocates a new contiguous buffer and frees the old one. However, memory allocators (glibc and jemalloc) typically do not return these large freed buffers to the OS. The old buffers remain mapped in the process address space, causing RSS to grow far beyond actual data requirements.
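The growth sequence above can be reproduced with a short simulation. This is a sketch, not Chroma's actual resize logic; it assumes capacity is rounded up at each step:

```python
import math

def resize_count(n_vectors, initial_capacity=1000, resize_factor=1.2):
    """Count resize operations before the index can hold n_vectors."""
    capacity = initial_capacity
    resizes = 0
    while capacity < n_vectors:
        capacity = math.ceil(capacity * resize_factor)  # 1000 -> 1200 -> 1440 -> ...
        resizes += 1
    return resizes

print(resize_count(2786))                         # 6 resizes with the default capacity
print(resize_count(2786, initial_capacity=5000))  # 0 resizes when pre-sized
```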

In our production environment, we observed 50GB+ RSS for 54MB of vector data (2786 vectors, 1024-dim embeddings). Analysis of /proc/PID/smaps revealed a descending pattern of anonymous memory mappings (7GB, 6GB, 5GB, 4GB...) — each one a ghost of a previous resize operation.
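The smaps analysis can be scripted. The parser below is a minimal sketch that sums the Rss of anonymous mappings (header lines with no pathname) in smaps-format text; the sample content and sizes are illustrative, not the actual production dump:

```python
import re

def anon_rss_kb(smaps_text):
    """Sum Rss (in kB) of anonymous mappings in /proc/<pid>/smaps content."""
    # Mapping headers look like: "addr-addr perms offset dev inode [pathname]"
    header = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\S+\s*(\S*)")
    total, in_anon = 0, False
    for line in smaps_text.splitlines():
        m = header.match(line)
        if m:
            in_anon = m.group(1) == ""  # no pathname => anonymous mapping
        elif in_anon and line.startswith("Rss:"):
            total += int(line.split()[1])
    return total

# Illustrative smaps excerpt: one anonymous mapping, one file-backed mapping.
SAMPLE = """\
7f0000000000-7f01c0000000 rw-p 00000000 00:00 0
Rss:             7340032 kB
7f0400000000-7f0400100000 r-xp 00000000 08:01 42 /usr/lib/libc.so.6
Rss:                1024 kB
"""
print(anon_rss_kb(SAMPLE))  # 7340032 -- only the anonymous mapping counts
```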

Solution

Allow users to pre-size the index to avoid unnecessary resizes:

collection = client.get_or_create_collection(
    name="my_collection",
    metadata={"hnsw:initial_capacity": 5000}  # match expected dataset size
)

This follows the existing pattern of hnsw:* metadata parameters (hnsw:resize_factor, hnsw:batch_size, etc.).

Changes

| File | Change |
| --- | --- |
| `hnsw_params.py` | Add `initial_capacity` field + validator |
| `local_hnsw.py` | Use `self._params.initial_capacity` instead of `DEFAULT_CAPACITY` |
| `local_persistent_hnsw.py` | Same — both `init_index` and `load_index` paths |
| `test/property/strategies.py` | Add `hnsw:initial_capacity` to property-based test strategies |
| `test/property/test_schema.py` | Add to schema mapping and defaults |
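The `hnsw_params.py` change can be pictured roughly as below. This is a hypothetical sketch of the field-plus-validator shape, following the pattern of the other `hnsw:*` parameters; names and structure are illustrative and may not match Chroma's internals exactly:

```python
DEFAULT_CAPACITY = 1000  # kept as the default, so existing behavior is unchanged

def validate_initial_capacity(value):
    """Reject non-integer or non-positive values for hnsw:initial_capacity."""
    if not isinstance(value, int) or isinstance(value, bool) or value < 1:
        raise ValueError(
            f"hnsw:initial_capacity must be a positive integer, got {value!r}"
        )
    return value

# The index would then be created with the configured capacity instead of
# the hardcoded constant, e.g.:
#   index.init_index(max_elements=params.initial_capacity, ...)
print(validate_initial_capacity(5000))  # 5000
```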

Test plan

  • Existing tests pass (no behavior change with default value 1000)
  • Property-based tests now exercise hnsw:initial_capacity with random values
  • Manual verification: creating a collection with hnsw:initial_capacity=5000 and adding 3000 vectors results in 0 resize operations (vs 6 previously)

🤖 Generated with Claude Code

…quency

When adding many vectors to a collection, the HNSW index starts with
DEFAULT_CAPACITY=1000 and grows by resize_factor=1.2x each time capacity
is exceeded. For a collection of ~3000 vectors, this causes 6+ resize
operations (1000→1200→1440→1728→2074→2489→2987). Each resize_index()
call allocates a new contiguous buffer while the old one is freed — but
memory allocators (glibc, jemalloc) typically do not return these large
freed buffers to the OS, causing RSS to grow far beyond the actual data
size (e.g., 50GB+ RSS for 54MB of vector data).

This commit adds a new `hnsw:initial_capacity` metadata parameter that
allows users to set the initial index capacity to match their expected
dataset size, dramatically reducing the number of resize operations and
the associated memory fragmentation.

Usage:
    collection = client.get_or_create_collection(
        name="my_collection",
        metadata={"hnsw:initial_capacity": 5000}
    )

The default value (1000) preserves existing behavior.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (readability, modularity, intuitiveness)?

Contributor

@propel-code-bot propel-code-bot Bot left a comment


No issues found; changes appear aligned with the intended behavior and defaults.

Status: No Issues Found | Risk: Low

Review Details

📁 5 files reviewed | 💬 0 comments
