Do I Need to Re-Index My Embedding Database Periodically?

I’m currently working on a Retrieval-Augmented Generation (RAG) application where new data is added each time an article is created in the knowledge base. I want to know if it’s necessary to periodically update and re-index the entire embedding database to maintain optimal performance, or if modern vector databases can efficiently handle incremental updates without requiring frequent full re-indexing. Any insights or best practices on this would be greatly appreciated!

I don’t know what database you are using, but my generic answer is no: you don’t need to re-index.

If you are building the DB yourself, you may want to break the data up into chunks and search the chunks in parallel to lower latency. Do the search yourself in memory using dot products (cosine similarity, if the vectors are normalized).

By “data” I mean the vectors, along with a hash key into the DB to retrieve the text. The text can live in a single static monolithic DB. The actual text for your retrieval is looked up by the hash key and then fed to the prompt.
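The approach above can be sketched as follows. This is a minimal illustration, not a production implementation: the store, keys, and stand-in embeddings are all hypothetical, and it assumes the embeddings are unit-normalized so the dot product equals cosine similarity.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Hypothetical monolithic text store: hash key -> document text.
text_store = {
    "a1": "Reset your password from the account page.",
    "b2": "Invoices are emailed on the first of the month.",
    "c3": "Two-factor auth can be enabled in settings.",
}

def normalize(m):
    # Unit-normalize rows so dot product == cosine similarity.
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
keys = ["a1", "b2", "c3"]
vectors = normalize(rng.normal(size=(3, 8)))  # stand-in embeddings

# Break the vectors into chunks that can be searched in parallel.
chunks = [(keys[:2], vectors[:2]), (keys[2:], vectors[2:])]

def search_chunk(chunk, query):
    ks, vs = chunk
    scores = vs @ query  # one dot product per stored vector
    best = int(np.argmax(scores))
    return ks[best], float(scores[best])

def search(query):
    # Search each chunk in a thread, then take the best hit overall.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda c: search_chunk(c, query), chunks))
    key, score = max(results, key=lambda r: r[1])
    # The hash key indexes into the text store; that text goes to the prompt.
    return key, text_store[key]

# Query with one of the stored vectors: it matches itself exactly.
key, text = search(vectors[1])
```

New article embeddings are appended to a chunk (or start a new chunk) without touching the rest, which is why no periodic re-index is needed.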


I apologize if this has been asked before, but in my RAG application, do we need to re-embed the entire content periodically for optimal performance, or can the system handle incremental updates without requiring a full re-embedding?

We’re using a MongoDB vector database, and after pushing some changes to production, we noticed that during testing, the chunks being retrieved were irrelevant to the query, even though more relevant chunks were available. After re-embedding the entire content, it started working correctly. This raises the question of whether periodic re-embedding is necessary to avoid similar issues in the future.

If you switch to a different embedding engine, then yeah, you need to re-embed everything.

Different engines aren’t compatible.

Another common gotcha: when you switch engines, your cosine similarity thresholds change too, so you need to adjust those as well.
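One way to guard against both pitfalls is to store the embedding model name alongside the index and keep a per-model similarity threshold. This is only a sketch; the model names and cutoff values are made up for illustration.

```python
# Scores from different embedding engines are not comparable, so keep a
# separate (tuned) threshold per model. Values here are hypothetical.
THRESHOLDS = {
    "model-a": 0.80,  # an engine whose relevant matches score high
    "model-b": 0.35,  # a different engine may cluster scores much lower
}

def is_relevant(score, model):
    # Gate retrieval on the threshold tuned for the model that
    # produced the score.
    return score >= THRESHOLDS[model]

def check_compatible(stored_model, query_model):
    # Vectors from different engines are incompatible: refuse to compare
    # a query embedded with one model against an index built with another.
    if stored_model != query_model:
        raise ValueError(
            f"index built with {stored_model!r}, query embedded with "
            f"{query_model!r}: re-embed the index before searching"
        )
```

A mismatch check like this would have surfaced the MongoDB issue above immediately, instead of silently returning irrelevant chunks.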
