Conversation
Add inverted file (IVF) index type: partitions vectors into clusters via k-means, quantizes to int8, and scans only the nearest nprobe partitions at query time. Includes shadow table management, insert/delete, KNN integration, compile flag (SQLITE_VEC_ENABLE_IVF), fuzz targets, and tests. Removes superseded ivf-benchmarks/ directory.
Rename SQLITE_VEC_ENABLE_IVF to SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE and flip the default from 1 to 0. IVF tests are automatically skipped when the build flag is not set. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR adds the 2nd ANN index to
sqlite-vec, called "IVF". The inverted fileindex format is a simple ANN index for vector search, implemented by many vector
databases
like in Faiss.
This PR sits on top of #276, which adds a new
rescoreANN index.This is an experimental ANN index. It will not be enabled by default on
sqlite-vecbuilds. I recommend that if people want to try it out to buildsqlite-vecyourself with theSQLITE_VEC_EXPERIMENTAL_IVF_ENABLEflag, butknow that the API will likely change and break in the future.
An IVF index requires a kmeans training step on a subset of vectors, generating
centroids in a given vector space. Vectors are stored in clusters based on which
centroid they are closest to. At query time, a vector query will find the
closest
nprobecentroids and do a KNN query on each vector in that cluster.Users have 3 parameters they can tune for speed/accuracy:
nlist- the number of centroids to train/store cluster on. Generalrecommendation is around
sqrt(N),whereNis the total number of vectorsntrain- the number of vectors to train kmeans centroids on. Generalrecommendation is anywhere between
4 * sqrt(N)and16 * sqrt(N), at leastaccording to faiss docs.
nprobe- the number of centroid clusters to search at query time, similar tothe
oversampleparameter in the rescore index.In general, the higher the
nlistthe faster the queries, at the expense ofslower inserts and training. The higher the
ntrainthe higher the chance ofbetter recall, at the expense of slower training. The higher the
nprobethebetter the recall, at the expense of query time. Tradeoffs!
Training is required (kmeans)
When you first create the IVF index and insert vectors, it acts as a flat index.
It isn't until after you call
compute-centroidsis the index in effect.This is the hard part of the IVF index - kmeans is compute intensive. The
builtin kmeans implementation is called like so:
But there's a few problems with this:
tested
possibly several minutes, meaning no other threads will be able to write data
to the database
not exposed in the current SQL API.
Now, there is another option to off-load the kmeans implementation to another
thread. For example, you can load
ntrainvectors into something like Faiss oranother GPU-powered kmeans library, export those vectors to a file, then import
them to a
ivfindex insqlite-veclike so:That way you don't hold the write lock for longer than you need, other kmeans
algorithms are likely faster and more correct, and you can tune for the exact
results you're looking for.
But this is hacky, the
set-centroidcommand is a bit hacky, and I want tospend time making this workflow right. This is why this is being released as
experimental!
Benchmarks
Like the
rescoreindex benchmarks, the performance of the IVF index willdepend heavily on your embeddings model, distribution of your data/vectors, your
hardware/platform, and method of kmeans.
HOWEVER, for my use-case on my hardware (Macbook Pro M4) on a semantically
diverse dataset of 1 million New York Times headlines embeded with
mixedbread-ai/mxbai-embed-large-v1,I get:
vec0ConfigurationIVFnlist=256 nprobe=16IVFnlist=1024 nprobe=64IVFnlist=1024 nprobe=32"Flat index" here means a non-index
vec0column via brute-force. One thing tonote — this specific IVF run offloaded kmeans to Faiss on 100K vectors (quite
high!), which took only ~3 seconds. This number is NOT included in the Insert
times. Using the builtin kmeans method would have taken much longer!
A few things to learn here:
insert times! The fastest rescore benchmark was
92msat0.962recall,which is slower and has worse recall than the
nlist=256sample here.nlistvectors, then find the closest centroids cluster, then insert intothose chunks. For the smaller
nlist=256this is nearly 15 times slower thana flat index, and for a larger
nlist=1024it's 39x slower!nprobehas an obviously ceiling. Bumpingnprobe=32to64only improvedrecall from
0.988to0.990at the expense of 1.6x the speed.Again, your mileage may vary!
Not an "official"
sqlite-vecANN index quite yetThis index option will not be available in default
sqlite-vecbuilds. Instead,you must compile
sqlite-vecyourself with the newSQLITE_VEC_EXPERIMENTAL_IVF_ENABLEflag.This is because I'm less confident in this ANN index compared to
rescore/diskann. There's a lot of moving parts in this index, including afrom-scratch C implementation of kmeans that I haven't fully reviewed. Also,
there may be subtle changes to the shadow tables for IVF that I think would make
it better/faster, but would be a breaking change.
So this feature will be behind a new
SQLITE_VEC_EXPERIMENTAL_IVF_ENABLEcompile-time option. Meaning, at least for now, the Python/Ruby/Node.js builds
of
sqlite-vecwill not include this feature by default, you must compile ityourself to try it out.