Add new experimental IVF ANN INdex by asg017 · Pull Request #277 · asg017/sqlite-vec

asg017 · 2026-03-31T06:21:51Z

This PR adds the 2nd ANN index to sqlite-vec, called "IVF". The inverted file
index format is a simple ANN index for vector search, implemented by many vector
databases
like in Faiss.

This PR sits on top of #276, which adds a new rescore ANN index.

This is an experimental ANN index. It will not be enabled by default on
sqlite-vec builds. I recommend that if people want to try it out to build
sqlite-vec yourself with the SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE flag, but
know that the API will likely change and break in the future.

An IVF index requires a kmeans training step on a subset of vectors, generating
centroids in a given vector space. Vectors are stored in clusters based on which
centroid they are closest to. At query time, a vector query will find the
closest nprobe centroids and do a KNN query on each vector in that cluster.

create virtual table vec_articles using vec0(
  id integer primary key,
  headline_embedding float[1024] distance_metric=cosine indexed by ivf(
    nlist=1024
  )
);


-- insert enough vectors to "train" a decent kmeans cluster (~sqrt(N))
insert into vec_articles(id, headline_embedding)
  values(:id, :embedding);

-- train using the builtin SLOW kmeans
INSERT INTO vec_articles(id) VALUES ('compute-centroids')


-- KNN query
select
  rowid,
  distance
from vec_articles
where headline_embedding match :query_embedding
  and k = 10;

Users have 3 parameters they can tune for speed/accuracy:

nlist - the number of centroids to train/store cluster on. General
recommendation is around sqrt(N), where N is the total number of vectors
ntrain - the number of vectors to train kmeans centroids on. General
recommendation is anywhere between 4 * sqrt(N) and 16 * sqrt(N), at least
according to faiss docs.
nprobe - the number of centroid clusters to search at query time, similar to
the oversample parameter in the rescore index.

In general, the higher the nlist the faster the queries, at the expense of
slower inserts and training. The higher the ntrain the higher the chance of
better recall, at the expense of slower training. The higher the nprobe the
better the recall, at the expense of query time. Tradeoffs!

Training is required (kmeans)

When you first create the IVF index and insert vectors, it acts as a flat index.
It isn't until after you call compute-centroids is the index in effect.

This is the hard part of the IVF index - kmeans is compute intensive. The
builtin kmeans implementation is called like so:

INSERT INTO vec_articles(id) VALUES ('compute-centroids')

But there's a few problems with this:

The naive current implementation is slow, single threaded, and not entirely
tested
In the above form, this will hold a long-term write lock on the database for
possibly several minutes, meaning no other threads will be able to write data
to the database
kmeans is a difficult problems with a lot of different parameters, which is
not exposed in the current SQL API.

Now, there is another option to off-load the kmeans implementation to another
thread. For example, you can load ntrain vectors into something like Faiss or
another GPU-powered kmeans library, export those vectors to a file, then import
them to a ivf index in sqlite-vec like so:

INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:1', :centroid_1);

That way you don't hold the write lock for longer than you need, other kmeans
algorithms are likely faster and more correct, and you can tune for the exact
results you're looking for.

But this is hacky, the set-centroid command is a bit hacky, and I want to
spend time making this workflow right. This is why this is being released as
experimental!

Benchmarks

Like the rescore index benchmarks, the performance of the IVF index will
depend heavily on your embeddings model, distribution of your data/vectors, your
hardware/platform, and method of kmeans.

HOWEVER, for my use-case on my hardware (Macbook Pro M4) on a semantically
diverse dataset of 1 million New York Times headlines embeded with
mixedbread-ai/mxbai-embed-large-v1,
I get:

`vec0` Configuration	Insert (s)	Size (GB)	Query (ms)	Recall @10
Flat index	27.4s	3.85	590ms (baseline)	1.0
`IVF` nlist=256 nprobe=16	6m40s (14.6x slower)	3.89	56ms (10.5x speedup)	0.978
`IVF` nlist=1024 nprobe=64	18m (39x slower)	3.99	61ms (9.7x speedup)	0.990
`IVF` nlist=1024 nprobe=32	18m (39x slower)	3.99	37ms (15.9x speedup)	0.988

"Flat index" here means a non-index vec0 column via brute-force. One thing to
note — this specific IVF run offloaded kmeans to Faiss on 100K vectors (quite
high!), which took only ~3 seconds. This number is NOT included in the Insert
times. Using the builtin kmeans method would have taken much longer!

A few things to learn here:

Query times are much faster with pretty good recall, at an obvious cost to
insert times! The fastest rescore benchmark was 92ms at 0.962 recall,
which is slower and has worse recall than the nlist=256 sample here.
Inserts are waaaay slower here. Every insert first needs to do a kmeans on
nlist vectors, then find the closest centroids cluster, then insert into
those chunks. For the smaller nlist=256 this is nearly 15 times slower than
a flat index, and for a larger nlist=1024 it's 39x slower!
nprobe has an obviously ceiling. Bumping nprobe=32 to 64 only improved
recall from 0.988 to 0.990 at the expense of 1.6x the speed.

Again, your mileage may vary!

Not an "official" `sqlite-vec` ANN index quite yet

This index option will not be available in default sqlite-vec builds. Instead,
you must compile sqlite-vec yourself with the new
SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE flag.

This is because I'm less confident in this ANN index compared to
rescore/diskann. There's a lot of moving parts in this index, including a
from-scratch C implementation of kmeans that I haven't fully reviewed. Also,
there may be subtle changes to the shadow tables for IVF that I think would make
it better/faster, but would be a breaking change.

So this feature will be behind a new SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE
compile-time option. Meaning, at least for now, the Python/Ruby/Node.js builds
of sqlite-vec will not include this feature by default, you must compile it
yourself to try it out.

Add inverted file (IVF) index type: partitions vectors into clusters via k-means, quantizes to int8, and scans only the nearest nprobe partitions at query time. Includes shadow table management, insert/delete, KNN integration, compile flag (SQLITE_VEC_ENABLE_IVF), fuzz targets, and tests. Removes superseded ivf-benchmarks/ directory.

Rename SQLITE_VEC_ENABLE_IVF to SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE and flip the default from 1 to 0. IVF tests are automatically skipped when the build flag is not set. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

asg017 mentioned this pull request Mar 31, 2026

Add DiskANN index for vec0 virtual table #278

Merged

asg017 changed the base branch from pr/rescore to main March 31, 2026 08:16

asg017 and others added 3 commits March 31, 2026 01:18

rm ivf plan file

3e26925

asg017 force-pushed the pr/ivf branch from eb3c6a1 to bb3ef78 Compare March 31, 2026 08:18

asg017 merged commit e2c38f3 into main Mar 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new experimental IVF ANN INdex#277

Add new experimental IVF ANN INdex#277
asg017 merged 3 commits intomainfrom
pr/ivf

asg017 commented Mar 31, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

asg017 commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Training is required (kmeans)

Benchmarks

Not an "official" sqlite-vec ANN index quite yet

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

asg017 commented Mar 31, 2026 •

edited

Loading

Not an "official" `sqlite-vec` ANN index quite yet