
The Vector Database Index

Measuring the popularity of different vector databases.

By Ben Lorica and Leo Meyerovich.

Introduction

Vector databases and vector search are on the radar of a growing number of technical teams. A key driver is that advances in neural networks have made dense vector representations of data more common. Interest has also grown due to the decision of technology companies to open source their core systems for vector search (Yahoo Vespa, Facebook Faiss). Additionally, a collection of vector database startups have raised close to $200M in funding, resulting in new enterprise solution providers and proponents.

While vector databases are used for recommendation, anomaly detection, and Q&A systems, they primarily target search and information retrieval applications. Vector databases are designed specifically to handle vector embeddings. An embedding is a low-dimensional space into which higher-dimensional data can be mapped; by analogy, think of a low-resolution picture generated from a high-resolution 3D model. Embeddings can represent many kinds of data: a sentence of text, an audio snippet, or a logged event. The key is that embeddings capture some of the semantics of the input and place semantically similar inputs close together in the embedding space. As a result, AI applications that use embeddings run faster and more cheaply without sacrificing quality.
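To make "semantically similar inputs land close together" concrete, here is a minimal sketch using made-up toy vectors (the sentences and embedding values are purely illustrative, not output from any real model):

```python
import numpy as np

# Toy 4-dimensional embeddings (hypothetical values for illustration):
# semantically similar sentences should map to nearby vectors.
emb = {
    "a dog plays in the park":   np.array([0.9, 0.1, 0.0, 0.2]),
    "a puppy runs on the grass": np.array([0.8, 0.2, 0.1, 0.3]),
    "quarterly revenue grew 5%": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine_similarity(emb["a dog plays in the park"],
                              emb["a puppy runs on the grass"])
sim_far = cosine_similarity(emb["a dog plays in the park"],
                            emb["quarterly revenue grew 5%"])
assert sim_close > sim_far  # similar meanings land closer in embedding space
```

In a real application the vectors would come from a trained model (for example, a sentence encoder), but the geometry is the same: similarity in meaning becomes proximity in the vector space.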

Figure 1: Representative sample of popular vector databases and libraries. Note that based on their taglines, many systems highlight scalability and target search applications.

The general primitive is identifying nearest neighbors (“vector search”). Since embeddings capture underlying semantics, they provide great building blocks for pipelines that power search applications. Some high-profile users of vector search include Facebook (similarity search), Gong and Clubhouse (semantic search), e-commerce sites (eBay, Tencent, Walmart, Ikea), and search engines like Google and Bing.
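At its simplest, the nearest-neighbor primitive can be computed exactly with a brute-force scan. The numpy sketch below (using random stand-in data) shows the core operation that vector databases accelerate with approximate indexes at scale:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 64)).astype("float32")  # 10k 64-d embeddings
query = rng.normal(size=(64,)).astype("float32")

def knn(query, corpus, k=5):
    # Exact k-nearest-neighbor search by Euclidean distance: a full scan
    # of the corpus, fine for thousands of vectors but not for billions.
    dists = np.linalg.norm(corpus - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

idx, dists = knn(query, corpus, k=5)
```

This exact scan is O(n) per query; the ANN and HNSW techniques discussed below trade a small amount of accuracy for sublinear query time.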

Figure 2: Semantic and neural search pipeline, from “Semantic Search and Neural Information Retrieval”.

The growing interest in embeddings has been accompanied by increased focus on scaling and speeding up vector search techniques such as KNN, ANN, and HNSW, as well as investments in related tools like hardware acceleration. The chart below shows the growth in the number of research papers that mention “similarity search” or “semantic search”:

Figure 3: Researchers are publishing more papers on “similarity search” and “semantic search”.

Vector Database Index

The purpose of this post is to compare vector databases and libraries using an index that measures popularity. For this inaugural edition, we focus on specialized systems and include only one general search engine – Elasticsearch, which incorporated vector search through Apache Lucene’s new ANN capabilities. As with our previous post on Experiment Tracking and Management tools, we use an index that relies on public data and is modeled after TIOBE’s programming language index. Our index comprises the following components:

Figure 4: Vector Database Index – an indicator of the popularity of data management tools and libraries for next-gen search applications. (If you would like to suggest systems for future editions of the Index, please fill out the form below.)

As noted in a prior index calculation, these scores reflect relative popularity rather than overall utility or operational complexity. Search activity is the largest component of the overall score. Vector databases are still quite specialized and advanced, so data on both the demand for and supply of talent remain sparse. Because of this sparsity, and because rankings in such a new category can be volatile, we segmented our index into tiers. For example, the most popular tools based on our index are Elasticsearch and Faiss, an open source library developed mainly by Facebook AI Research; technical teams frequently cite Faiss when discussing vector databases. Three well-funded startups occupy the second tier of our popularity ranking: Weaviate, Pinecone, and Milvus.

As you can see from the descriptions provided by the creators of these tools (see Figure 1), these systems focus on scalability. Vector databases traditionally index a few million vectors per server (e.g., see Pinecone, Faiss, Weaviate). Techniques such as hashing and sharding make it possible to scale to larger datasets, but doing so can be prohibitively expensive. Recently, a few systems have highlighted their ability to scale cost-effectively to a billion vectors, whether by supporting bigger-than-memory datasets through indexes optimized for SSD storage (see Vespa, Milvus) or by leveraging hardware acceleration (see GSI Technology). We expect more systems to follow suit in the near future.
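One common approach to scaling past a brute-force scan is the inverted-file (IVF) idea used by systems like Faiss and Milvus: partition vectors into cells around coarse centroids, then search only a few cells per query. The following is a minimal, numpy-only sketch of that idea; the parameters and the random-sampling "training" step are simplifications (real systems run k-means and tune cell counts carefully):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_vectors, n_cells, n_probe = 32, 5_000, 16, 2
db = rng.normal(size=(n_vectors, d)).astype("float32")

# "Train" coarse centroids; real systems use k-means, we just sample vectors.
centroids = db[rng.choice(n_vectors, n_cells, replace=False)]

def nearest_cells(x, k):
    # Rank centroids by distance to x; used for both indexing and probing.
    d2 = np.linalg.norm(centroids - x, axis=1)
    return np.argsort(d2)[:k]

# Build the inverted file: each vector is stored under its nearest centroid.
cells = [[] for _ in range(n_cells)]
for i, v in enumerate(db):
    cells[nearest_cells(v, 1)[0]].append(i)

def ivf_search(query, k=5):
    # Probe only n_probe cells instead of scanning all n_vectors: this is
    # the approximation that lets ANN indexes scale to huge datasets.
    probed = nearest_cells(query, n_probe)
    candidates = np.concatenate([cells[c] for c in probed]).astype(int)
    dists = np.linalg.norm(db[candidates] - query, axis=1)
    order = np.argsort(dists)[:k]
    return candidates[order], dists[order]

query = rng.normal(size=(d,)).astype("float32")
ids, dists = ivf_search(query)
```

The same partition-then-probe structure also maps naturally onto SSD-resident indexes: only the probed cells need to be read from storage, which is how bigger-than-memory datasets become practical.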

Closing Observations

Vector databases are what you get when you embed your entire database

Suggestion Form

Use this form to suggest systems to include in future editions of the Vector Database Index.



Ben Lorica is a Principal at Gradient Flow. He is an advisor to Graphistry and other startups.

Leo Meyerovich is founder and CEO of Graphistry, the first visual platform for graph AI. Vector search is part of how Graphistry helps security, fraud, social, supply chain, and other data-intensive teams turn graph neural networks and manifold learning into insights and actions.




If you enjoyed this post, please support our work by encouraging your friends and colleagues to subscribe to our newsletter:


Featured Image: Key Phrases found in recent Job Postings that mention: “vector database” or vector|semantic|similarity|neural search.
