Skip to content

Add new experimental IVF ANN INdex#277

Merged
asg017 merged 3 commits intomainfrom
pr/ivf
Mar 31, 2026
Merged

Add new experimental IVF ANN INdex#277
asg017 merged 3 commits intomainfrom
pr/ivf

Conversation

@asg017
Copy link
Copy Markdown
Owner

@asg017 asg017 commented Mar 31, 2026

This PR adds the 2nd ANN index to sqlite-vec, called "IVF". The inverted file
index format is a simple ANN index for vector search, implemented by many vector
databases
like in Faiss.

This PR sits on top of #276, which adds a new rescore ANN index.

This is an experimental ANN index. It will not be enabled by default on
sqlite-vec builds. I recommend that if people want to try it out to build
sqlite-vec yourself with the SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE flag, but
know that the API will likely change and break in the future.

An IVF index requires a kmeans training step on a subset of vectors, generating
centroids in a given vector space. Vectors are stored in clusters based on which
centroid they are closest to. At query time, a vector query will find the
closest nprobe centroids and do a KNN query on each vector in that cluster.

create virtual table vec_articles using vec0(
  id integer primary key,
  headline_embedding float[1024] distance_metric=cosine indexed by ivf(
    nlist=1024
  )
);


-- insert enough vectors to "train" a decent kmeans cluster (~sqrt(N))
insert into vec_articles(id, headline_embedding)
  values(:id, :embedding);

-- train using the builtin SLOW kmeans
INSERT INTO vec_articles(id) VALUES ('compute-centroids')


-- KNN query
select
  rowid,
  distance
from vec_articles
where headline_embedding match :query_embedding
  and k = 10;

Users have 3 parameters they can tune for speed/accuracy:

  • nlist - the number of centroids to train/store cluster on. General
    recommendation is around sqrt(N), where N is the total number of vectors
  • ntrain - the number of vectors to train kmeans centroids on. General
    recommendation is anywhere between 4 * sqrt(N) and 16 * sqrt(N), at least
    according to faiss docs.
  • nprobe - the number of centroid clusters to search at query time, similar to
    the oversample parameter in the rescore index.

In general, the higher the nlist the faster the queries, at the expense of
slower inserts and training. The higher the ntrain the higher the chance of
better recall, at the expense of slower training. The higher the nprobe the
better the recall, at the expense of query time. Tradeoffs!

Training is required (kmeans)

When you first create the IVF index and insert vectors, it acts as a flat index.
It isn't until after you call compute-centroids is the index in effect.

This is the hard part of the IVF index - kmeans is compute intensive. The
builtin kmeans implementation is called like so:

INSERT INTO vec_articles(id) VALUES ('compute-centroids')

But there's a few problems with this:

  • The naive current implementation is slow, single threaded, and not entirely
    tested
  • In the above form, this will hold a long-term write lock on the database for
    possibly several minutes, meaning no other threads will be able to write data
    to the database
  • kmeans is a difficult problems with a lot of different parameters, which is
    not exposed in the current SQL API.

Now, there is another option to off-load the kmeans implementation to another
thread. For example, you can load ntrain vectors into something like Faiss or
another GPU-powered kmeans library, export those vectors to a file, then import
them to a ivf index in sqlite-vec like so:

INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:1', :centroid_1);

That way you don't hold the write lock for longer than you need, other kmeans
algorithms are likely faster and more correct, and you can tune for the exact
results you're looking for.

But this is hacky, the set-centroid command is a bit hacky, and I want to
spend time making this workflow right. This is why this is being released as
experimental!

Benchmarks

Like the rescore index benchmarks, the performance of the IVF index will
depend heavily on your embeddings model, distribution of your data/vectors, your
hardware/platform, and method of kmeans.

HOWEVER, for my use-case on my hardware (Macbook Pro M4) on a semantically
diverse dataset of 1 million New York Times headlines embeded with
mixedbread-ai/mxbai-embed-large-v1,
I get:

vec0 Configuration Insert (s) Size (GB) Query (ms) Recall @10
Flat index 27.4s 3.85 590ms (baseline) 1.0
IVF nlist=256 nprobe=16 6m40s (14.6x slower) 3.89 56ms (10.5x speedup) 0.978
IVF nlist=1024 nprobe=64 18m (39x slower) 3.99 61ms (9.7x speedup) 0.990
IVF nlist=1024 nprobe=32 18m (39x slower) 3.99 37ms (15.9x speedup) 0.988

"Flat index" here means a non-index vec0 column via brute-force. One thing to
note — this specific IVF run offloaded kmeans to Faiss on 100K vectors (quite
high!), which took only ~3 seconds. This number is NOT included in the Insert
times. Using the builtin kmeans method would have taken much longer!

A few things to learn here:

  • Query times are much faster with pretty good recall, at an obvious cost to
    insert times! The fastest rescore benchmark was 92ms at 0.962 recall,
    which is slower and has worse recall than the nlist=256 sample here.
  • Inserts are waaaay slower here. Every insert first needs to do a kmeans on
    nlist vectors, then find the closest centroids cluster, then insert into
    those chunks. For the smaller nlist=256 this is nearly 15 times slower than
    a flat index, and for a larger nlist=1024 it's 39x slower!
  • nprobe has an obviously ceiling. Bumping nprobe=32 to 64 only improved
    recall from 0.988 to 0.990 at the expense of 1.6x the speed.

Again, your mileage may vary!

Not an "official" sqlite-vec ANN index quite yet

This index option will not be available in default sqlite-vec builds. Instead,
you must compile sqlite-vec yourself with the new
SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE flag.

This is because I'm less confident in this ANN index compared to
rescore/diskann. There's a lot of moving parts in this index, including a
from-scratch C implementation of kmeans that I haven't fully reviewed. Also,
there may be subtle changes to the shadow tables for IVF that I think would make
it better/faster, but would be a breaking change.

So this feature will be behind a new SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE
compile-time option. Meaning, at least for now, the Python/Ruby/Node.js builds
of sqlite-vec will not include this feature by default, you must compile it
yourself to try it out.

@asg017 asg017 changed the base branch from pr/rescore to main March 31, 2026 08:16
asg017 and others added 3 commits March 31, 2026 01:18
Add inverted file (IVF) index type: partitions vectors into clusters via
k-means, quantizes to int8, and scans only the nearest nprobe partitions at
query time. Includes shadow table management, insert/delete, KNN integration,
compile flag (SQLITE_VEC_ENABLE_IVF), fuzz targets, and tests. Removes
superseded ivf-benchmarks/ directory.
Rename SQLITE_VEC_ENABLE_IVF to SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE and
flip the default from 1 to 0. IVF tests are automatically skipped when
the build flag is not set.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@asg017 asg017 merged commit e2c38f3 into main Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant