Add DiskANN graph-based index: builds a Vamana graph with configurable R (max degree) and L (search list size, separate for insert/query), supports int8 quantization with rescore, lazy reverse-edge replacement, pre-quantized query optimization, and insert buffer reuse. Includes shadow table management, delete support, KNN integration, compile flag (SQLITE_VEC_ENABLE_DISKANN), release-demo workflow, fuzz targets, and tests. Fixes rescore int8 quantization bug.
This PR adds the third ANN index to `sqlite-vec`, called "diskann". It's based on the original DiskANN implementation and the 2023 LM-DiskANN paper.
This PR sits on top of #277, which adds an experimental `ivf` ANN index.

The DiskANN index incrementally creates a Vamana graph during vector insertion.
Compressed neighbor vectors are stored alongside neighbors. A "pruning"
operation occurs during each insert, ensuring a balanced graph.
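To illustrate how compressed neighbor vectors work, here is a minimal sketch of symmetric int8 scalar quantization, the scheme this PR uses for neighbor compression. The function names and details are hypothetical, not the `sqlite-vec` internals: the idea is that graph traversal compares cheap int8 codes, and only the final candidates get "rescored" against the full-precision vectors. This sketch shows just the quantize/dequantize half.

```python
def quantize_int8(v):
    # Symmetric scalar quantization: map the largest magnitude to +/-127.
    # The "or 1.0" guards against the all-zero vector.
    scale = max(abs(x) for x in v) / 127.0 or 1.0
    return [round(x / scale) for x in v], scale

def dequantize(codes, scale):
    # Approximate reconstruction, used when rescoring final candidates.
    return [c * scale for c in codes]

v = [0.5, -1.0, 0.25]
codes, scale = quantize_int8(v)
approx = dequantize(codes, scale)
```

At 1 byte per dimension instead of 4, each graph node can pack many more neighbors into a single blob, which is what makes storing neighbors inline practical.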
Users have a few parameters they can tune to trade off insert speed, recall, and KNN speed:

- `n_neighbors` - the number of compressed neighbors stored for each vector. The higher, the better the recall, but with reduced insert speed.
- `search_list_size_insert` - the size of the search list used during the insert/pruning operation. The higher, the better the recall, at the expense of insert speed.
- `search_list_size_search` - similar to its `insert` sibling, but used during KNN queries. This is configurable per-search. The higher, the better the recall, at the expense of search speed.
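As a rough sketch of how these knobs could surface in SQL (the diskann-specific syntax below is hypothetical and not taken from this PR; only the basic `vec0` DDL and `MATCH`/`k` query shape follow existing `sqlite-vec` conventions):

```sql
-- Hypothetical: a vec0 table whose index options would carry the diskann
-- parameters, e.g. n_neighbors=48, search_list_size_insert=200.
CREATE VIRTUAL TABLE headlines USING vec0(
  headline_embedding float[1024]
);

-- search_list_size_search is per-query, so a KNN query could raise it
-- for better recall or lower it for speed.
SELECT rowid, distance
FROM headlines
WHERE headline_embedding MATCH :query
  AND k = 10;
```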
No training required, but VERY slow INSERTs
The cool thing about DiskANN over the IVF index is that there is no training
step required. Vectors can be inserted at any time, and the pruning/balancing
operations ensure great quality no matter how the data shifts over time.
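The two operations that run on every insert can be sketched in a few lines: a greedy beam search bounded by the search list size (the `search_list_size_*` parameters above), and an alpha-pruning step bounded by the max degree R (`n_neighbors`). This is an illustrative Python sketch of the Vamana ideas, not the C implementation in this PR; all names are hypothetical.

```python
import heapq

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def greedy_search(graph, vectors, entry, query, L):
    """Beam search: keep the L closest candidates seen so far, repeatedly
    expand the nearest unvisited one until no candidate is left to expand."""
    visited = set()
    candidates = [(l2(vectors[entry], query), entry)]
    while True:
        unvisited = [(d, n) for d, n in candidates if n not in visited]
        if not unvisited:
            break
        _, node = min(unvisited)
        visited.add(node)
        for nb in graph[node]:
            if nb not in visited and all(nb != n for _, n in candidates):
                candidates.append((l2(vectors[nb], query), nb))
        candidates = heapq.nsmallest(L, candidates)  # trim search list to L
    return [n for _, n in sorted(candidates)], visited

def robust_prune(vectors, node, pool, R, alpha=1.2):
    """Keep at most R diverse neighbors: a candidate is dropped if some
    already-kept neighbor is alpha-times closer to it than the node is."""
    pool = sorted(pool, key=lambda c: l2(vectors[node], vectors[c]))
    kept = []
    for c in pool:
        if len(kept) >= R:
            break
        if all(alpha * l2(vectors[k], vectors[c]) > l2(vectors[node], vectors[c])
               for k in kept):
            kept.append(c)
    return kept

# Tiny demo: four points on a line, chained into a graph.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nearest, _ = greedy_search(graph, vectors, 0, (3.0, 0.0), L=2)
print(nearest[0])  # node 3, the true nearest neighbor
```

An insert is then "search with `search_list_size_insert`, take the visited set as the candidate pool, prune to R neighbors" - which is also why both parameters trade insert speed for graph quality.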
However, the pruning process is slow. Since `sqlite-vec` stores data in hidden shadow tables within your SQLite database, there is added overhead from b-tree traversal and from the large overflow pages that big blobs require.
The original LM-DiskANN paper reported that building an ANN index for 1M 960-dimensional vectors took more than 8 hours, and while the `sqlite-vec` DiskANN algorithm doesn't take that long, it's still pretty slow! Even worse, it gets slower as the database grows. I hope to improve this in future PRs.
Benchmarks
As always, the performance of the DiskANN index will depend on your embeddings
model, distributions of your vector space, hardware, and parameter choice.
For my use case on my computer (MacBook Pro M4), on a semantically diverse dataset of 1 million New York Times headlines embedded with `mixedbread-ai/mxbai-embed-large-v1`, I get:
[Benchmark table comparing the brute-force `vec0` configuration against the DiskANN configurations not shown.]

Obviously, INSERTs are slow - nearly an hour for the 48-neighbors configuration - and the database is 3x the size. But for 17x faster KNN queries at `0.980` recall, it's quite the deal!
Reducing R to `40` means fewer neighbors to compare and balance, which leads to faster inserts (`42m` vs `55m`) and a slightly smaller database size (`11.94GB` vs `10.03GB`). It also leads to much faster KNN queries at less than `5ms`, more than 128x faster than the brute-force method! You do sacrifice some recall in this case, at `0.954`, but many RAG and semantic search systems will gladly make that trade.
I would say that if it's possible to load the database into the page cache, queries can be even faster. During benchmarks I was getting `<1ms` queries, which seemed fishy. I only saw that kind of speed when I performed queries in the same process as the build phase, meaning most of the DB was probably already in the page cache, leading to faster queries. There are a few methods to preload a SQLite DB into the OS or SQLite page caches, but I didn't have time to explore this much.
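One such method, sketched below on a throwaway database: enable memory-mapped I/O and run a cheap full scan so every page gets touched before the timed queries. The PRAGMAs are standard SQLite; whether this actually helps a given benchmark is workload-dependent, and this is not the approach measured in this PR.

```python
import os
import sqlite3
import tempfile

# Build a small throwaway database to warm up.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE t(x)")
db.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
db.commit()
db.close()

db = sqlite3.connect(path)
db.execute("PRAGMA mmap_size = 268435456")  # let SQLite mmap up to 256 MiB
db.execute("PRAGMA cache_size = -64000")    # ~64 MB SQLite page cache
# A full scan touches every page, pulling it into the OS/SQLite caches:
(count,) = db.execute("SELECT count(*) FROM t").fetchone()
print(count)
```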
Point is - DiskANN is definitely the fastest at KNN queries of all the `sqlite-vec` indexes! But be prepared for the snail-paced INSERTs and larger database sizes.