Add DiskANN index for vec0 virtual table#278

Merged
asg017 merged 2 commits into main from pr/diskann
Mar 31, 2026
Conversation

@asg017 asg017 commented Mar 31, 2026

This PR adds the third ANN index to sqlite-vec called "diskann". It's based on
the original DiskANN implementation and the 2023 LM-DiskANN paper.

This PR sits on top of #277, which adds an experimental ivf ANN index.

The DiskANN index incrementally builds a Vamana graph as vectors are inserted.
Compressed copies of each node's neighbor vectors are stored alongside its
neighbor list, and a "pruning" operation runs during each insert to keep the
graph balanced.

create virtual table vec_articles using vec0(
  id integer primary key,
  headline_embedding float[1024] distance_metric=cosine indexed by diskann(
    neighbor_quantizer=binary,
    n_neighbors=48,
    search_list_size_insert=96
  )
);

insert into vec_articles(id, headline_embedding)
  values(:id, :embedding);

-- KNN query
select
  rowid,
  distance
from vec_articles
where headline_embedding match :query_embedding
  and k = 10;

Users can tune a few parameters to trade off insert speed, recall, and KNN speed:

  • n_neighbors - the number of compressed neighbors stored for each vector.
    Higher values improve recall, but slow down inserts.
  • search_list_size_insert - the size of the search list used during the
    insert/pruning operation. Higher values improve recall at the expense of
    insert speed.
  • search_list_size_search - similar to its insert sibling, but used during
    KNN queries, and configurable per query. Higher values improve recall at
    the expense of search speed.
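To make the `neighbor_quantizer=binary` option above concrete, here is a rough sketch of what binary quantization of neighbor vectors implies: each stored neighbor vector is compressed to one bit per dimension (its sign), and candidates are compared with cheap Hamming distance instead of full-precision cosine/L2. The function names and packing scheme here are illustrative assumptions, not sqlite-vec's actual code.

```python
def binary_quantize(vector):
    """Pack a float vector into an int bitmask: bit i set if vector[i] >= 0."""
    bits = 0
    for i, v in enumerate(vector):
        if v >= 0:
            bits |= 1 << i
    return bits

def hamming_distance(a, b):
    """Number of differing bits between two quantized vectors."""
    return bin(a ^ b).count("1")

a = binary_quantize([0.3, -1.2, 0.0, 0.8])   # signs: + - + +  -> 0b1101
b = binary_quantize([-0.5, -0.9, 0.1, 0.7])  # signs: - - + +  -> 0b1100
print(hamming_distance(a, b))  # vectors differ in sign only in dim 0 -> 1
```

A 1024-dimensional float32 vector shrinks from 4096 bytes to 128 bytes under this scheme, which is what makes storing dozens of neighbor vectors inline per node affordable.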

No training required, but VERY slow INSERTs

The cool thing about DiskANN over the IVF index is that there is no training
step required. Vectors can be inserted at any time, and the pruning/balancing
operations ensure great quality no matter how the data shifts over time.
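The pruning/balancing operation mentioned above can be sketched roughly as the alpha-pruning ("RobustPrune") rule from the Vamana/DiskANN papers: greedily keep the closest candidate, then discard candidates that the kept neighbor already "covers". This is a simplified from-the-paper sketch, not sqlite-vec's implementation; `alpha`, the helper names, and the defaults are assumptions.

```python
import math

def dist(a, b):
    return math.dist(a, b)

def robust_prune(point, candidates, max_degree, alpha=1.2):
    """Pick up to max_degree diverse neighbors for `point`.

    Repeatedly keep the closest remaining candidate p, then drop any
    candidate c that p covers: alpha * dist(p, c) <= dist(point, c).
    The alpha > 1 slack keeps some long-range edges for graph navigability.
    """
    candidates = sorted(candidates, key=lambda c: dist(point, c))
    neighbors = []
    while candidates and len(neighbors) < max_degree:
        p = candidates.pop(0)
        neighbors.append(p)
        candidates = [
            c for c in candidates
            if alpha * dist(p, c) > dist(point, c)
        ]
    return neighbors

# (1.1, 0) is pruned because the nearly identical (1, 0) covers it:
pts = [(1, 0), (1.1, 0), (0, 1), (-1, 0)]
print(robust_prune((0, 0), pts, max_degree=3))  # [(1, 0), (0, 1), (-1, 0)]
```

The expensive part in practice is that each insert must fetch candidate nodes' blocks, which in sqlite-vec means shadow-table b-tree lookups rather than direct disk reads.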

However, the pruning process is slow. Since sqlite-vec stores data in hidden
shadow tables within your SQLite database, there is added overhead of b-tree
traversing and large overflow pages from large blobs.

The original LM-DiskANN paper reported that building an ANN index for 1M 960D
vectors took over 8 hours. The sqlite-vec DiskANN algorithm doesn't take that
long, but it's still pretty slow! Even worse, it gets slower as the database
grows. I hope to improve this in future PRs.

Benchmarks

As always, the performance of the DiskANN index will depend on your embeddings
model, distributions of your vector space, hardware, and parameter choice.

For my use-case on my computer (Macbook Pro M4) on a semantically diverse
dataset of 1 million New York Times headlines embedded with
mixedbread-ai/mxbai-embed-large-v1,
I get:

| vec0 configuration | Insert time | Size (GB) | Query time | Recall@10 |
|---|---|---|---|---|
| Flat index | 27.4s | 3.85 | 590ms (baseline) | 1.0 |
| DiskANN R=48 L=64 | 55m21s (121x slower) | 11.94 (3.1x bigger) | 33.6ms (17.5x speedup) | 0.980 |
| DiskANN R=40 L=48 | 42m17s (93x slower) | 10.03 (2.6x bigger) | 4.6ms (128x speedup) | 0.954 |

R is the number of compressed neighbors per node, while L is the insert
search list size.
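For reference, a Recall@10 number like 0.980 in the table above is conventionally computed as the fraction of the exact brute-force top-10 IDs that the ANN index also returned. The function name here is illustrative; this is the standard metric, not code from this PR.

```python
def recall_at_k(ann_ids, exact_ids, k=10):
    """Fraction of the exact top-k results that the ANN index recovered."""
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # brute-force (flat index) top-10
ann   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]  # ANN result missed one true neighbor
print(recall_at_k(ann, exact))  # 0.9
```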

Obviously, INSERTs are slow - nearly an hour for the 48-neighbor configuration,
and the database is over 3x the size of the flat index. But for 17x faster KNN
queries at 0.980 recall, it's quite the deal!

Reducing R to 40 means fewer neighbors to compare and balance, which leads to
faster inserts (42m vs 55m) and a slightly smaller database (10.03GB vs
11.94GB). It also leads to much faster KNN queries at less than 5ms, more than
128x faster than the brute-force method! You do sacrifice some recall, down to
0.954, but many RAG and semantic search systems will gladly make that trade.

If you can load the database into the page cache, queries can be even faster.
During benchmarks I was getting <1ms queries, which seemed fishy. I only saw
that kind of speed when I performed queries in the same process as the build
phase, meaning most of the DB was probably already in the page cache. There
are a few ways to preload a SQLite DB into the OS or SQLite page caches, but I
didn't have time to explore them much.
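As one assumed example of such preloading (not something this PR does, and the filename is hypothetical): reading the database file once warms the OS page cache, and growing SQLite's own page cache helps at the connection level.

```shell
# Warm the OS page cache by streaming the whole database file once:
cat my_vectors.db > /dev/null

# SQLite-level caching is per-connection, so run pragmas like
#   PRAGMA cache_size = -2000000;   -- negative value = size in KiB
#   PRAGMA mmap_size  = 10000000000;
# on the same connection that will execute the KNN queries.
```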

Point is - DiskANN is definitely the fastest at KNN queries of all the
sqlite-vec indexes! But be prepared for the snail-paced INSERTs and larger
database sizes.

asg017 added 2 commits March 31, 2026 01:21
Add DiskANN graph-based index: builds a Vamana graph with configurable R
(max degree) and L (search list size, separate for insert/query), supports
int8 quantization with rescore, lazy reverse-edge replacement, pre-quantized
query optimization, and insert buffer reuse. Includes shadow table management,
delete support, KNN integration, compile flag (SQLITE_VEC_ENABLE_DISKANN),
release-demo workflow, fuzz targets, and tests. Fixes rescore int8
quantization bug.
@asg017 asg017 changed the base branch from pr/ivf to main March 31, 2026 08:22
@asg017 asg017 merged commit 1e3bb3e into main Mar 31, 2026