Add DiskANN graph-based index: builds a Vamana graph with configurable R (max degree) and L (search list size, separate for insert/query), supports int8 quantization with rescore, lazy reverse-edge replacement, pre-quantized query optimization, and insert buffer reuse. Includes shadow table management, delete support, KNN integration, compile flag (SQLITE_VEC_ENABLE_DISKANN), release-demo workflow, fuzz targets, and tests. Fixes rescore int8 quantization bug.
This PR adds the third ANN index to `sqlite-vec`, called "diskann". It's based on the original DiskANN implementation and the 2023 LM-DiskANN paper.
This PR sits on top of #277, which adds an experimental `ivf` ANN index.

The DiskANN index incrementally creates a Vamana graph during vector insertion.
Compressed neighbor vectors are stored alongside neighbors. A "pruning"
operation occurs during each insert, ensuring a balanced graph.
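To illustrate how compressed neighbor vectors work, here is a minimal sketch of symmetric int8 scalar quantization, the scheme this PR uses for neighbor compression. The function names and details are hypothetical, not the `sqlite-vec` internals: the idea is that graph traversal compares cheap int8 codes, and only the final candidates get "rescored" against the full-precision vectors. This sketch shows just the quantize/dequantize half.

```python
def quantize_int8(v):
    # Symmetric scalar quantization: map the largest magnitude to +/-127.
    # The "or 1.0" guards against the all-zero vector.
    scale = max(abs(x) for x in v) / 127.0 or 1.0
    return [round(x / scale) for x in v], scale

def dequantize(codes, scale):
    # Approximate reconstruction, used when rescoring final candidates.
    return [c * scale for c in codes]

v = [0.5, -1.0, 0.25]
codes, scale = quantize_int8(v)
approx = dequantize(codes, scale)
```

At 1 byte per dimension instead of 4, each graph node can pack many more neighbors into a single blob, which is what makes storing neighbors inline practical.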
Users have a few parameters they can tune to trade off insert speed, recall, and KNN speed:

- `n_neighbors` - the number of compressed neighbors stored for each vector. The higher, the better the recall, but with reduced insert speed.
- `search_list_size_insert` - the size of the search list used during the insert/pruning operation. The higher, the better the recall, at the expense of insert speed.
- `search_list_size_search` - similar to its `insert` sibling, but used during KNN queries. This is configurable per-search. The higher, the better the recall, at the expense of search speed.
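As a rough sketch of how these knobs could surface in SQL (the diskann-specific syntax below is hypothetical and not taken from this PR; only the basic `vec0` DDL and `MATCH`/`k` query shape follow existing `sqlite-vec` conventions):

```sql
-- Hypothetical: a vec0 table whose index options would carry the diskann
-- parameters, e.g. n_neighbors=48, search_list_size_insert=200.
CREATE VIRTUAL TABLE headlines USING vec0(
  headline_embedding float[1024]
);

-- search_list_size_search is per-query, so a KNN query could raise it
-- for better recall or lower it for speed.
SELECT rowid, distance
FROM headlines
WHERE headline_embedding MATCH :query
  AND k = 10;
```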
No training required, but VERY slow INSERTs
The cool thing about DiskANN over the IVF index is that there is no training
step required. Vectors can be inserted at any time, and the pruning/balancing
operations ensure great quality no matter how the data shifts over time.
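The two operations that run on every insert can be sketched in a few lines: a greedy beam search bounded by the search list size (the `search_list_size_*` parameters above), and an alpha-pruning step bounded by the max degree R (`n_neighbors`). This is an illustrative Python sketch of the Vamana ideas, not the C implementation in this PR; all names are hypothetical.

```python
import heapq

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def greedy_search(graph, vectors, entry, query, L):
    """Beam search: keep the L closest candidates seen so far, repeatedly
    expand the nearest unvisited one until no candidate is left to expand."""
    visited = set()
    candidates = [(l2(vectors[entry], query), entry)]
    while True:
        unvisited = [(d, n) for d, n in candidates if n not in visited]
        if not unvisited:
            break
        _, node = min(unvisited)
        visited.add(node)
        for nb in graph[node]:
            if nb not in visited and all(nb != n for _, n in candidates):
                candidates.append((l2(vectors[nb], query), nb))
        candidates = heapq.nsmallest(L, candidates)  # trim search list to L
    return [n for _, n in sorted(candidates)], visited

def robust_prune(vectors, node, pool, R, alpha=1.2):
    """Keep at most R diverse neighbors: a candidate is dropped if some
    already-kept neighbor is alpha-times closer to it than the node is."""
    pool = sorted(pool, key=lambda c: l2(vectors[node], vectors[c]))
    kept = []
    for c in pool:
        if len(kept) >= R:
            break
        if all(alpha * l2(vectors[k], vectors[c]) > l2(vectors[node], vectors[c])
               for k in kept):
            kept.append(c)
    return kept

# Tiny demo: four points on a line, chained into a graph.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nearest, _ = greedy_search(graph, vectors, 0, (3.0, 0.0), L=2)
print(nearest[0])  # node 3, the true nearest neighbor
```

An insert is then "search with `search_list_size_insert`, take the visited set as the candidate pool, prune to R neighbors" - which is also why both parameters trade insert speed for graph quality.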
However, the pruning process is slow. Since `sqlite-vec` stores data in hidden shadow tables within your SQLite database, there is added overhead from b-tree traversal and from the large overflow pages that big blobs require.
The original LM-DiskANN paper reported that building an ANN index for 1M 960-dimensional vectors took more than 8 hours, and while the `sqlite-vec` DiskANN algorithm doesn't take that long, it's still pretty slow! Even worse, it gets slower as the database grows. I hope to improve this in future PRs.
Benchmarks
As always, the performance of the DiskANN index will depend on your embeddings
model, distributions of your vector space, hardware, and parameter choice.
For my use case on my computer (MacBook Pro M4), on a semantically diverse dataset of 1 million New York Times headlines embedded with `mixedbread-ai/mxbai-embed-large-v1`, I get:
[Benchmark table comparing the brute-force `vec0` configuration against the DiskANN configurations not shown.]

Obviously, INSERTs are slow - nearly an hour for the 48-neighbors configuration - and the database is 3x the size. But for 17x faster KNN queries at `0.980` recall, it's quite the deal!
Reducing R to `40` means fewer neighbors to compare and balance, which leads to faster inserts (`42m` vs `55m`) and a slightly smaller database size (`11.94GB` vs `10.03GB`). It also leads to much faster KNN queries at less than `5ms`, more than 128x faster than the brute-force method! You do sacrifice some recall in this case, at `0.954`, but many RAG and semantic search systems will gladly make that trade.
I would say that if it's possible to load the database into the page cache, queries can be even faster. During benchmarks I was getting `<1ms` queries, which seemed fishy. I only saw that kind of speed when I performed queries in the same process as the build phase, meaning most of the DB was probably already in the page cache, leading to faster queries. There are a few methods to preload a SQLite DB into the OS or SQLite page caches, but I didn't have time to explore this much.
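One such method, sketched below on a throwaway database: enable memory-mapped I/O and run a cheap full scan so every page gets touched before the timed queries. The PRAGMAs are standard SQLite; whether this actually helps a given benchmark is workload-dependent, and this is not the approach measured in this PR.

```python
import os
import sqlite3
import tempfile

# Build a small throwaway database to warm up.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE t(x)")
db.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
db.commit()
db.close()

db = sqlite3.connect(path)
db.execute("PRAGMA mmap_size = 268435456")  # let SQLite mmap up to 256 MiB
db.execute("PRAGMA cache_size = -64000")    # ~64 MB SQLite page cache
# A full scan touches every page, pulling it into the OS/SQLite caches:
(count,) = db.execute("SELECT count(*) FROM t").fetchone()
print(count)
```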
Point is - DiskANN is definitely the fastest at KNN queries of all the `sqlite-vec` indexes! But be prepared for the snail-paced INSERTs and larger database sizes.