Cold-data loading optimization #7975

@1995chen

Hi, after importing 200 million vectors into 15 shards, cold start is painfully slow. We enabled binary quantization and keep the HNSW index in RAM, while the raw vectors stay on disk. Whenever we migrate to new hosts and restart the cluster, startup takes up to one hour. My first guess is that the code reads each shard's index files sequentially: it first loads the quantized index and only then the HNSW index. At this data size the serial load becomes a bottleneck, and in an emergency the long recovery time can break our SLA.
I’d like to prefetch all required files concurrently before the service starts, so that later index access hits the OS page cache.
Users could set a concurrency parameter to match the maximum IOPS their disks can deliver. It’s admittedly ugly, but it’s the quickest way to buy us time. In the long run we still need to make the index-loading code itself parallel.
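To make the proposal concrete, here is a minimal sketch of such a warm-up step as a standalone script, run before the service starts. It is not part of Qdrant: the names `warm_file`, `warm_directory`, `CHUNK_SIZE`, and the `concurrency` parameter are hypothetical. It simply reads every file under a directory and discards the data, on the assumption that the OS keeps the pages cached afterwards. A real version would probably restrict itself to the quantized and HNSW files, so that the raw vectors do not evict them from the cache.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical read size per call; tune to the volume's throughput.
CHUNK_SIZE = 4 * 1024 * 1024


def warm_file(path: str) -> int:
    """Read one file end to end and discard the data.

    The reads pull the file's pages into the OS page cache.
    Returns the number of bytes read.
    """
    total = 0
    # buffering=0 avoids a second copy in Python's own buffer.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:  # empty bytes means end of file
                break
            total += len(chunk)
    return total


def warm_directory(root: str, concurrency: int = 16) -> int:
    """Warm every regular file under `root` using `concurrency` threads.

    Threads are enough here because the work is I/O bound.
    Returns the total number of bytes read.
    """
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
    ]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sum(pool.map(warm_file, paths))
```

For example, `warm_directory("./storage/collections/t_scalar", concurrency=32)` would walk the collection directory shown in the logs below. The right `concurrency` value depends on the disk and would need to be found by measurement.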
Below are the startup logs:
2026-01-23T02:38:44.426125Z INFO storage::content_manager::consensus::persistent: Loading raft state from ./storage/raft_state.json
2026-01-23T02:38:44.435406Z INFO storage::content_manager::toc: Loading collection: t_scalar
2026-01-23T03:24:14.428535Z INFO collection::shards::local_shard: Recovering shard ./storage/collections/t_scalar/1: 0/1 (0%)
2026-01-23T03:24:14.487988Z INFO collection::shards::local_shard: Recovered collection t_scalar: 1/1 (100%)
2026-01-23T03:26:30.083950Z INFO collection::shards::local_shard: Recovering shard ./storage/collections/t_scalar/3: 0/1 (0%)
2026-01-23T03:26:30.190590Z INFO collection::shards::local_shard: Recovered collection t_scalar: 1/1 (100%)
2026-01-23T03:28:48.091480Z INFO collection::shards::local_shard: Recovering shard ./storage/collections/t_scalar/5: 0/1 (0%)
2026-01-23T03:28:48.201508Z INFO collection::shards::local_shard: Recovered collection t_scalar: 1/1 (100%)
2026-01-23T03:30:50.456139Z INFO collection::shards::local_shard: Recovering shard ./storage/collections/t_scalar/7: 0/1 (0%)
2026-01-23T03:30:50.568487Z INFO collection::shards::local_shard: Recovered collection t_scalar: 1/1 (100%)

We’re running on AWS r7g.8xlarge (32 vCPU, 256 GiB) with gp3 volumes configured at 16 000 IOPS.
