{"version":"https:\/\/jsonfeed.org\/version\/1","title":"Mixedbread Blog","home_page_url":"https:\/\/www.mixedbread.com\/blog","feed_url":"https:\/\/www.mixedbread.com\/blog\/feed.json","description":"Latest news, updates, and insights from Mixedbread","icon":"https:\/\/www.mixedbread.com\/images\/og\/root.jpg","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"items":[{"id":"https:\/\/www.mixedbread.com\/blog\/latent-terms","content_html":"<p>When we think of neural retrieval, the first assumption that comes to mind is that the role of a model is to represent information in a way that allows it to be retrieved by a related query. But this view is overly simplistic: the role of a retrieval model is to <strong>represent information in a way that can be used by a scoring mechanism<\/strong>. All retrieval work, in essence, is constrained by the operator we choose to rank documents, and all model expressivity is bottlenecked by what this operator can capture.<\/p>\n<p>This raises one important question: <strong>do retrieval models know more than they are able to express<\/strong>? In other words, despite scoring limitations, does the retrieval training process allow a model to learn richer representations than we assume, waiting to be extracted? 
The answer is <strong>yes<\/strong>, and even more than that: these representations are trivial to (at least partially) extract, and their distribution approaches natural language itself.<\/p>\n<h2>Retrieval: A Song of Representations and Scoring<\/h2>\n<p>Single-vector embedding models are not a good approach to retrieval, because their <a href=\"https:\/\/arxiv.org\/abs\/2508.21038\">inherent<\/a> <a href=\"https:\/\/x.com\/lateinteraction\/status\/2034654311519547770\">limitations<\/a> make them unsuitable for a lot of situations that are all too common in the real world.<\/p>\n<p>This is a conclusion that is increasingly apparent in all domains, with multi-vector models vastly outperforming their single-vector equivalents with an order of magnitude fewer parameters in both agentic and multimodal settings, two areas which are rapidly growing in importance.<\/p>\n<p>However, while it is tempting to write another article denouncing the evils of single-vector retrieval, it is much more interesting to think about <em>why<\/em> this is the case. At the surface level, it might appear obvious: well, it&#39;s a single vector representing multi-faceted information, so the meaning is diluted!<\/p>\n<p>This assumption is not incorrect: representing nuanced information in one vector is bound to create strong dilution. But it is incomplete: what is diluted is not necessarily <strong>the representation itself<\/strong>. After all, LLMs are perfectly capable of creating complex task vectors encompassing nuances. What the single-vector setting limits, and what makes it so harmful to generalisation, is <strong>the expressiveness of scoring operators<\/strong>.<\/p>\n<p>We can think about it this way: all (first stage) retrieval operations are effectively about <strong>enabling the use of an efficient scoring mechanism<\/strong> that can produce useful relevance rankings, to be further enhanced by a reranker or directly used by the final consumer, whether human or agentic. 
The role of the embedding model is not to perform this scoring by itself: it would be far too expensive to use a neural model, even if it had just a handful of parameters, to score millions of documents at query time.<\/p>\n<p>Instead, as we discussed in our introduction, the role of the embedding model is to <strong>convert the information in the documents into a format that can be readily consumed by the efficient scoring mechanism<\/strong> we discussed above. Thus, training embedding models is a pure &quot;representation learning&quot; task, where we are provided with a constraint, the input format of the downstream scoring operator, and must learn representations that best enable it. In fact, the three great families of retrievers are shaped by their scoring operators: single-vector dense retrievers and their cosine similarity operator, sparse retrievers and various forms of weighted dot products, and multi-vector models and their MaxSim operator.<\/p>\n<p>The reason late interaction models, such as <a href=\"https:\/\/www.mixedbread.com\/blog\/wholembed-v3\">Wholembed v3<\/a>, are so powerful is this operator: MaxSim allows for a level of fine-grained expressiveness in scoring that is simply not possible with single-vector cosine similarity. Late interaction is about <strong>preserving this expressivity<\/strong> by allowing information from both the documents and the query to interact as late in the scoring process as possible (hence the &quot;late&quot; interaction naming).<\/p>\n<p>In fact, a few people in the ColBERT community sometimes rant that they <a href=\"https:\/\/x.com\/lateinteraction\/status\/1894696983077785980\">don't like the term \"multi-vector retrieval\" very much<\/a>. The reason is simple: with our current understanding of retrieval, multi-vector approaches are required to enable MaxSim, which itself is the scoring operator that is currently required to enable late interaction retrieval. 
By no means is it the &quot;perfect&quot; operator, but it is a necessary evil to allow expressive scoring, until we develop better ways to do so.<\/p>\n<p>And just like ColBERT is powerful because its scoring operator maximises expressiveness at the cost of additional engineering constraints, single-vector embeddings are brittle because their scoring operator favours engineering flexibility, with the opposite tradeoffs. But as we discussed: this is <strong>not an inherent feature of the models themselves<\/strong>, but of the way their final representations are cast to be made compatible with this operator.<\/p>\n<p>As such, we find ourselves asking the same question that we did at the start of this article: do dense models contain more information than they are able to express, and, more importantly, can we extract this knowledge? <\/p>\n<p>If you have read this far, you are probably assuming that the reason we are asking this question is that we found out that <strong>yes<\/strong>, you can. And even more than that: it&#39;s trivial to do so.<\/p>\n<h2>Extracting a model&#39;s vocabulary<\/h2>\n<ConceptProjection \/>\n\n<p>To an extent, we know that it is very simple to train dense embedding models to become strong multi-vector retrievers, and this makes sense, as MaxSim scoring can be thought of as &quot;cosine-similarity++&quot; and would leverage a great deal of the same information.<\/p>\n<p>But circling back to our previous point, this is an imperfect solution: yes, MaxSim and late interaction are very powerful. 
But it&#39;s also heavy, requires solid engineering chops, and more importantly, while it is technically a first-stage retrieval approach, it still needs its own &quot;pre-first-stage&quot; candidate generation stage in order for the MaxSim compute cost to be bearable.<\/p>\n<p>But even more importantly in this context: this conversion is performed <strong>with retrieval-specific supervised training<\/strong>, making it less of an extraction and more of an adaptation: after all, you could make the case that ModernBERT itself contains retrieval information, because if you train it with retrieval supervision, it becomes a capable retriever. Not untrue, but missing the point.<\/p>\n<p>No, instead, we want to see <strong>what information a model contains<\/strong>, without further training.<\/p>\n<h3>Sparse AutoEncoders<\/h3>\n<p>And to do so, we turn towards the field of interpretability. Specifically, we look into Sparse AutoEncoders, or SAEs. These models have become commonplace in studies trying to understand the &quot;black box&quot; of language models, having risen to prominence through work <a href=\"https:\/\/transformer-circuits.pub\/2023\/monosemantic-features\">spearheaded by Anthropic<\/a>, and are now found all over both research and industry.<\/p>\n<p>At their core, SAEs could not be simpler, as they&#39;re built on concepts that will be familiar to the vast majority of machine learning enthusiasts: they are shallow networks composed of a single encoder-decoder block. Training follows a simple reconstruction objective: the decoder has to reconstruct the original input after it has been projected through the encoder.<\/p>\n<SaeDiagram \/>\n\n<p>What makes SAEs interesting is the constraint that is added to the encoder: a <strong>sparsity penalty<\/strong> is added to ensure that each input is <strong>only able to activate a limited number of features in the encoder&#39;s latent space<\/strong>. 
The intermediate representation between the encoder and the decoder is effectively a large sparse vector, where the vast majority of dimensions have a value of 0.<\/p>\n<p>The theory, which empirical results support to a good extent, is that doing so yields some sort of &quot;latent vocabulary&quot;, in which we can analyse which tokens activate which features and better understand how information is represented within the large, dense, and otherwise uninterpretable internal activations of LLMs. This field of research is what led to the <a href=\"https:\/\/www.anthropic.com\/news\/golden-gate-claude\">(in)famous Golden Gate Claude experiment<\/a>, which we all dearly miss.<\/p>\n<p>Being a retrieval-focused company, this made us think: if SAEs are capable of extracting a latent vocabulary that can be more or less clearly mapped to concepts, could this vocabulary also be useful for retrieval? And so, we designed a simple set of experiments: train SAEs on top of common retrievers, both internal and external, and explore the makeup of the resulting features.<\/p>\n<h2>The Latent Space is Zipfian<\/h2>\n<p>Before we say more about the nature of the representations we extracted, it&#39;s important to take a step back, and discuss Zipf&#39;s Law. Indeed, one of the core underlying aspects of most classical lexical approaches to NLP is the fact that natural language tends to follow a Zipfian distribution.<\/p>\n<p>A picture is worth a thousand words, so let&#39;s first look at what a Zipfian distribution looks like before we delve (ChatGPT did not write this, but we will not let it claim a good word) further:<\/p>\n<ZipfLaw \/>\n\n<p>Zipf&#39;s law is a simple empirical observation: when you gather a set of observations and sort them in decreasing order, the distribution is often such that the value of the <em>n<\/em>th entry is inversely proportional to <em>n<\/em>. 
In practice, what this means is that the second most common element will occur about half as often as the most common one, and the third most common element about a third as often.<\/p>\n<p>Most human languages naturally tend towards a quasi-Zipfian distribution: while they don&#39;t follow a perfect Zipf&#39;s curve, they all roughly espouse its shape. And as we know by now, humans are very good at optimisation: if a distribution is known and has well-defined properties, we will figure out how to take advantage of this.<\/p>\n<p>And we did: for a long time, tools were specifically designed around this. The most famous of such approaches in retrieval is BM25, designed around TF-IDF features (Term Frequency - Inverse Document Frequency, essentially a way to give more weight to more discriminative terms) with a few additional tweaks and parameters to tune. In fact, BM25 is a fantastic example of our drive to optimise, in two ways: first, despite having been introduced in 1995, it remains today the Pareto-optimal way to do retrieval with lexical features. Second, the name itself is a throwback to the necessity of iterations to optimise: BM25 stands for <strong>B<\/strong>est <strong>M<\/strong>atch <strong>25<\/strong>, and the <strong>25<\/strong> simply refers to the fact that it was the 25th method the team had designed.<\/p>\n<p>Mainstream neural sparse methods, such as SPLADE, have largely done away with these assumptions: their training methods lead to a smoother curve, with fewer saturated features. But as it turns out, <em>Latent Terms<\/em>, the activated features we extracted through an SAE applied over retrievers, <strong>are distributed in a Zipfian way over large corpora<\/strong>:<\/p>\n<LatentZipfOverlay \/>\n\n<p>In practice, this means that the vocabulary we extract via SAEs, which we call <em>Latent Terms<\/em>, <strong>follows a distribution that is broadly similar to that of human language<\/strong>. 
This is unlike current sparse retrieval methods, which are explicitly trained with methods to induce sparsity that result in the absence of the characteristic saturated top of the curve that is present in natural words. And as you&#39;ve probably guessed, having a discrete vocabulary with a distribution similar to that of lexical terms means that this vocabulary is <strong>readily usable with methods designed for lexical terms<\/strong>.<\/p>\n<h3>What does that vocabulary look like?<\/h3>\n<p>But before we get into its suitability for retrieval methods like BM25, let us first take a look at the vocabulary itself: what exactly does it capture? What do its features look like?<\/p>\n<p>Thankfully, the vocabulary is rather small, with most of our experiments targeting a vocabulary size in the {8192, 65536} range. This enables qualitative analysis, which is then easy to scale up with the use of LLMs. <\/p>\n<p>What this activation analysis revealed is that three broad, unevenly-distributed categories are present in the identified features: <strong>Lexical Features<\/strong>, which fire on a single term, <strong>Narrow Semantic Features<\/strong>, which capture multiple ways of referring to the same concept, and <strong>Broad Topical Features<\/strong>, activated by a wide range of terms around a similar topic.<\/p>\n<LatentTermCategories \/>\n\n<p>The distribution between these 3 types differs somewhat between models, but the general trend holds: roughly 10% of the features are narrow semantic, around a third are purely lexical, and the rest, comprising over half the features, are broad topical ones. <\/p>\n<h2>Not only does it overcome failure cases, it makes for very strong retrievers<\/h2>\n<p>While the shape of the features is encouraging, it&#39;s time to put them to the real test: do <em>Latent Terms<\/em> simply have a Zipfian distribution, but ultimately do not carry meaningful enough signal, or are they effective retrieval features? 
<\/p>\n<p>A (truncated) table is worth a thousand words:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th align=\"right\">SciFact<\/th>\n<th align=\"right\">NFC<\/th>\n<th align=\"right\">FiQA<\/th>\n<th align=\"right\">TREC-Covid<\/th>\n<th align=\"right\">DBPedia<\/th>\n<th align=\"right\">NQ<\/th>\n<th align=\"right\">HotpotQA<\/th>\n<th align=\"right\">FEVER<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Lexical BM25<\/td>\n<td align=\"right\">0.686<\/td>\n<td align=\"right\">0.319<\/td>\n<td align=\"right\">0.249<\/td>\n<td align=\"right\">0.680<\/td>\n<td align=\"right\">0.300<\/td>\n<td align=\"right\">0.285<\/td>\n<td align=\"right\">0.569<\/td>\n<td align=\"right\">0.481<\/td>\n<\/tr>\n<tr>\n<td>SPLADE-v3<\/td>\n<td align=\"right\">0.710<\/td>\n<td align=\"right\">0.357<\/td>\n<td align=\"right\">0.374<\/td>\n<td align=\"right\">0.748<\/td>\n<td align=\"right\">0.450<\/td>\n<td align=\"right\">0.586<\/td>\n<td align=\"right\">0.692<\/td>\n<td align=\"right\">0.796<\/td>\n<\/tr>\n<tr>\n<td>Contriever<\/td>\n<td align=\"right\">0.655<\/td>\n<td align=\"right\">0.313<\/td>\n<td align=\"right\">0.274<\/td>\n<td align=\"right\">0.448<\/td>\n<td align=\"right\">0.377<\/td>\n<td align=\"right\">0.419<\/td>\n<td align=\"right\">0.542<\/td>\n<td align=\"right\">0.581<\/td>\n<\/tr>\n<tr>\n<td>Nomic<\/td>\n<td align=\"right\">0.703<\/td>\n<td align=\"right\">0.346<\/td>\n<td align=\"right\">0.377<\/td>\n<td align=\"right\">0.822<\/td>\n<td align=\"right\">0.431<\/td>\n<td align=\"right\">0.598<\/td>\n<td align=\"right\">0.672<\/td>\n<td align=\"right\">0.813<\/td>\n<\/tr>\n<tr>\n<td>GTE-MC<\/td>\n<td align=\"right\">0.756<\/td>\n<td align=\"right\">0.381<\/td>\n<td align=\"right\">0.456<\/td>\n<td align=\"right\">0.849<\/td>\n<td align=\"right\">0.475<\/td>\n<td align=\"right\">0.617<\/td>\n<td align=\"right\">0.773<\/td>\n<td align=\"right\">0.875<\/td>\n<\/tr>\n<tr>\n<td><strong>Latent Terms + Contriever<\/strong><\/td>\n<td align=\"right\">0.713<\/td>\n<td 
align=\"right\">0.340<\/td>\n<td align=\"right\">0.317<\/td>\n<td align=\"right\">0.709<\/td>\n<td align=\"right\">0.409<\/td>\n<td align=\"right\">0.468<\/td>\n<td align=\"right\">0.627<\/td>\n<td align=\"right\">0.751<\/td>\n<\/tr>\n<tr>\n<td><strong>Latent Terms + Nomic<\/strong><\/td>\n<td align=\"right\">0.749<\/td>\n<td align=\"right\">0.372<\/td>\n<td align=\"right\">0.382<\/td>\n<td align=\"right\">0.783<\/td>\n<td align=\"right\">0.436<\/td>\n<td align=\"right\">0.577<\/td>\n<td align=\"right\">0.732<\/td>\n<td align=\"right\">0.885<\/td>\n<\/tr>\n<tr>\n<td><strong>Latent Terms + GTE-ModernColBERT<\/strong><\/td>\n<td align=\"right\">0.730<\/td>\n<td align=\"right\">0.374<\/td>\n<td align=\"right\">0.399<\/td>\n<td align=\"right\">0.759<\/td>\n<td align=\"right\">0.387<\/td>\n<td align=\"right\">0.509<\/td>\n<td align=\"right\">0.653<\/td>\n<td align=\"right\">0.814<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>In summary: not only are <em>Latent Terms<\/em> extracted with SAEs compatible with BM25, but they are able to achieve very competitive retrieval results, matching or outperforming their single-vector backbone, whether the backbone is an older, weaker one (Contriever) or a more modern model (in this case, nomic-embed-text-v1.5). 
The overall performance does appear pretty strongly correlated with retrieval training, as the better text embedding also produces the superior <em>Latent Terms<\/em> results.<\/p>\n<p>Perhaps even more interestingly, the approach is competitive with SPLADE models from a similar era: SAE+BM25 over Nomic outperforms SPLADEv3, despite the latter&#39;s heavy use of knowledge distillation from much more powerful models.<\/p>\n<p>The last insight immediately apparent from the table is, again, that scoring operators do matter immensely: while single-vector models lose to their <em>Latent Terms<\/em> cousins, GTE-ModernColBERT, a late interaction model using the powerful MaxSim operator, comfortably outperforms its <em>Latent Terms<\/em> counterpart, even though the latter remains strong.<\/p>\n<p>But there is another benchmark we are interested in here: LIMIT. LIMIT, which we talked about previously, is a toy task: queries are very straightforward, and documents are essentially a long list of a person&#39;s attributes. It is purposefully designed to be trivial for approaches capturing fine-grained information in their scoring: in the table below, &quot;normal&quot; BM25&#39;s Recall@20 is in the mid 90s, and GTE-ModernColBERT&#39;s is in the mid 80s. 
However, its very simple formulation is antagonistic to the limitations of single-vector scoring, and even large, 8 billion parameter single-vector models fail to reach double-digit recall numbers.\nAs such, the question is pretty clear: can <em>Latent Terms<\/em>, despite being built upon the same single-vector model, recover LIMIT performance far beyond what single-vector scoring allows?<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th align=\"right\">Recall (R)@10<\/th>\n<th align=\"right\">R@20<\/th>\n<th align=\"right\">R@100<\/th>\n<th align=\"right\">R@1000<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Lexical BM25<\/td>\n<td align=\"right\">0.9440<\/td>\n<td align=\"right\">0.9490<\/td>\n<td align=\"right\">0.9645<\/td>\n<td align=\"right\">0.9945<\/td>\n<\/tr>\n<tr>\n<td>SPLADE-v3<\/td>\n<td align=\"right\">0.5760<\/td>\n<td align=\"right\">0.6650<\/td>\n<td align=\"right\">0.8095<\/td>\n<td align=\"right\">0.9440<\/td>\n<\/tr>\n<tr>\n<td>GTE-ModernColBERT<\/td>\n<td align=\"right\">0.8430<\/td>\n<td align=\"right\">0.8565<\/td>\n<td align=\"right\">0.8720<\/td>\n<td align=\"right\">0.8795<\/td>\n<\/tr>\n<tr>\n<td>Contriever<\/td>\n<td align=\"right\">0.0210<\/td>\n<td align=\"right\">0.0265<\/td>\n<td align=\"right\">0.0530<\/td>\n<td align=\"right\">0.1250<\/td>\n<\/tr>\n<tr>\n<td>Latent Terms + GTE-ModernColBERT<\/td>\n<td align=\"right\">0.7985<\/td>\n<td align=\"right\">0.8315<\/td>\n<td align=\"right\">0.8915<\/td>\n<td align=\"right\">0.9775<\/td>\n<\/tr>\n<tr>\n<td>Latent Terms + Contriever<\/td>\n<td align=\"right\">0.4140<\/td>\n<td align=\"right\">0.5100<\/td>\n<td align=\"right\">0.7295<\/td>\n<td align=\"right\">0.9290<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>The answer is <strong>yes<\/strong>, although it is not perfect: while Contriever as a single-vector model reaches a Recall@100 of just 0.053, its Latent Terms variant hits 0.729. 
This confirms that even though it has been trained only in a single-vector setting, the model itself <strong>has learned information that lets it avoid collapse on LIMIT, beyond what can be expressed in this same setting<\/strong>. As this avenue of research is still very young and primitive, this suggests that our current training methods teach models significant, meaningful signal that is just waiting for us to develop better ways to extract.<\/p>\n<h2>Is this inherent to retrievers or do all language models contain a sparse vocab eagerly waiting to meet BM25?<\/h2>\n<p>But first, we need to think about <strong>where<\/strong> this information comes from. Earlier, we talked about the core goal of sparse auto-encoders in existing research: given a set of neural activations from a complex, highly-uninterpretable language model, cast them into a sparse set of activations that can be studied to understand how these activations relate to language, thus better understanding models.<\/p>\n<p>Then, we demonstrated that in the case of retrievers, these sparse activations approximate a natural language distribution, and that we can identify three &quot;families&quot; of features capturing different levels of lexical and semantic information. 
The combination of these factors enables methods designed for natural language, in this case BM25, to work extremely well and result in strong retrieval performance.<\/p>\n<p>With these two factors in mind, there&#39;s one question that naturally follows: does this retrieval-friendly <em>Latent Terms<\/em> structure naturally emerge in the representations of encoder models, or are these extractable, meaningful features learned as a byproduct of retrieval-focused contrastive training?<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th align=\"right\">SciFact<\/th>\n<th align=\"right\">NFCorpus<\/th>\n<th align=\"right\">FiQA<\/th>\n<th align=\"right\">TREC-Covid<\/th>\n<th align=\"right\">DBPedia<\/th>\n<th align=\"right\">NQ<\/th>\n<th align=\"right\">HotpotQA<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Latent Terms + BERT<\/td>\n<td align=\"right\">0.585<\/td>\n<td align=\"right\">0.216<\/td>\n<td align=\"right\">0.131<\/td>\n<td align=\"right\">0.212<\/td>\n<td align=\"right\">0.134<\/td>\n<td align=\"right\">0.165<\/td>\n<td align=\"right\">0.345<\/td>\n<\/tr>\n<tr>\n<td>Latent Terms + Contriever<\/td>\n<td align=\"right\"><em>0.713<\/em><\/td>\n<td align=\"right\"><em>0.340<\/em><\/td>\n<td align=\"right\"><em>0.317<\/em><\/td>\n<td align=\"right\"><em>0.709<\/em><\/td>\n<td align=\"right\"><em>0.409<\/em><\/td>\n<td align=\"right\"><em>0.468<\/em><\/td>\n<td align=\"right\"><em>0.627<\/em><\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>The answer, quite clearly, is the latter: features extracted by SAEs over pre-trained language models do not contain the structure that enables <em>Latent Terms<\/em> to act as strong retrieval discriminators. This confirms one of our intuitions: the SAE process is not, by itself, creating structure or information that is useful for retrieval. 
However, it is a straightforward way to expose information that the model learns about what makes a given term (or token, in our case) in a document impact its relevance to a query, or vice-versa, in ways that the model can fail to express in a single-vector representation. This finding is also supported by the clear jump in quality between a weaker backbone (Contriever) and a stronger one (Nomic).<\/p>\n<p>This opens up quite a lot of interesting questions: is this the best way to extract this information? Should we be figuring out training methods that are not so post-hoc, so that this information is extracted in an even better way? <strong>Is the future of retrieval sparser than we have been led to believe?<\/strong><\/p>\n<h2>Where can I learn more, and What&#39;s Next?<\/h2>\n<p>If you want to dig deeper into the scientific aspect of this approach, our new preprint is now up on <a href=\"https:\/\/arxiv.org\/abs\/2605.29384\">arXiv<\/a>.<\/p>\n<p>As for what&#39;s next, this paper is the first of a series of findings that we intend to publish about this line of work. Characteristically, we expect these to be sparse, but you should stay tuned to find out more about what the vector space of retrievers is hiding. If this kind of research resonates with you, you should definitely <a href=\"mailto:ben@mixedbread.com\">get in touch<\/a> and share your excitement.<\/p>\n<h3>Citation<\/h3>\n<p>If you&#39;d like to formally reference this work, please cite the associated paper:<\/p>\n<pre><code class=\"language-latex\">@misc{latentterms,\n      title={Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies}, \n      author={Benjamin Clavi\u00e9 and Sean Lee and Aamir Shakir and Makoto P. 
Kato},\n      year={2026},\n      eprint={2605.29384},\n      archivePrefix={arXiv},\n      primaryClass={cs.IR},\n      url={https:\/\/arxiv.org\/abs\/2605.29384}, \n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/latent-terms","title":"Dense Retrievers Know More Than They Can Express","summary":"Demonstrating that dense retrieval models learn much more information than they can express through their usual scoring mechanism: they also contain an indexable, natural-language-like sparse vocabulary, which is plug-and-play with BM25.","image":"https:\/\/www.mixedbread.com\/images\/blog\/latent-terms\/latent-terms.jpg","date_modified":"2026-06-02T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/listwise-rerank","content_html":"<p>When we released <a href=\"https:\/\/mixedbread.com\/blog\/wholembed-v3\">Wholembed v3<\/a>, the retrieval quality bar moved up sharply enough that most rerankers stopped helping. Pointwise rerankers, which score documents independently and sort by score, struggled to add value on top of strong first-stage results. On harder corpora, they actively hurt them.<\/p>\n<p>Today we are releasing <strong>mxbai-rerank-v3-listwise<\/strong>, a listwise reranking model co-designed with Wholembed v3. It is the first reranker we have shipped that improves results on every domain, language, and benchmark we tested, lifting <strong>all 56 ViDoRe v3 runs, +11% NDCG@10<\/strong> on average. It also brings strong instruction-following capabilities. 
You can steer ranking with natural-language directives, like prioritizing recent documents, preferring internal sources, or resolving conflicts between knowledge bases.<\/p>\n<p>mxbai-rerank-v3-listwise is available in preview today as part of Mixedbread Search.<\/p>\n<h2>Better Reranking for Stronger Retrieval<\/h2>\n<p>On ViDoRe v3, a benchmark for real-world document retrieval, Wholembed v3 alone averaged 0.603 NDCG@10.[^1] Adding mxbai-rerank-v3-listwise lifted every one of the 56 runs, with an average gain of 11%.<\/p>\n<div class=\"max-w-2xl\">\n  <img\n    src=\"\/images\/blog\/listwise-rerank\/plot_scatter_light.png\"\n    alt=\"mxbai-rerank-v3-listwise lifts NDCG@10 on all 56 ViDoRe v3 runs\"\n    class=\"block dark:hidden\"\n  >\n  <img\n    src=\"\/images\/blog\/listwise-rerank\/plot_scatter_dark.png\"\n    alt=\"mxbai-rerank-v3-listwise lifts NDCG@10 on all 56 ViDoRe v3 runs\"\n    class=\"hidden dark:block\"\n  >\n<\/div>\n\n<table>\n<thead>\n<tr>\n<th align=\"left\">Setting<\/th>\n<th align=\"right\">NDCG@10 (avg, 56 runs)<\/th>\n<th align=\"right\">\u0394 vs. retrieval-only<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\">Wholembed v3 only<\/td>\n<td align=\"right\">0.603<\/td>\n<td align=\"right\">-<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">+ mxbai-rerank-v3-listwise<\/td>\n<td align=\"right\">0.669<\/td>\n<td align=\"right\">+10.92%<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>[^1]: We use ViDoRe v3 because it is a widely used benchmark for real-world document retrieval over complex corpora, with multilingual and domain-specific subsets. The benchmark covers seven domains (computer science, energy, finance, HR, industrial, pharmaceuticals, physics) across seven languages, for 56 paired runs in total.<\/p>\n<p>The biggest gains came on the harder, lower-baseline subsets: industrial documents in German went up 18.8%, HR in French 16.3%. 
These are the kinds of corpora where small ranking errors compound into worse downstream answers.<\/p>\n<h2>Comparative Ranking with Instruction Following<\/h2>\n<p>mxbai-rerank-v3-listwise is a listwise reranker. Unlike pointwise rerankers, which focus on the relevance of individual documents, it reads the candidate set as a whole and can resolve conflicts between candidates.<\/p>\n<p>With strong instruction-following capabilities, you can tell it to prefer newer documents, prioritize internal sources over external summaries, or favor primary sources over commentary.<\/p>\n<p>A booking confirmation may be superseded by a later cancellation. A product spec may override a launch note. A financial filing may be more authoritative than a same-day article.<\/p>\n<p>We benchmarked mxbai-rerank-v3-listwise against the strongest pointwise rerankers available on a 900-example instruction-following evaluation covering recency-aware ranking, source-priority resolution, and multi-step composite instructions.[^2]<\/p>\n<table>\n<thead>\n<tr>\n<th>Reranker<\/th>\n<th>MRR<\/th>\n<th>Accuracy@1<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>mxbai-rerank-v3-listwise<\/strong><\/td>\n<td><strong>0.93<\/strong><\/td>\n<td><strong>88.6%<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Voyage rerank-2.5<\/td>\n<td>0.84<\/td>\n<td>77.4%<\/td>\n<\/tr>\n<tr>\n<td>Cohere Rerank 4 Pro<\/td>\n<td>0.77<\/td>\n<td>68.4%<\/td>\n<\/tr>\n<tr>\n<td>ZeroEntropy zerank-2<\/td>\n<td>0.71<\/td>\n<td>60.3%<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>[^2]: The evaluation is derived from real user corpora and search patterns, then converted into controlled ranking tasks. 
Examples are designed to test whether the reranker can resolve conflicts inside a candidate set, not just match query-document relevance.<\/p>\n<p>Much of the gap comes from recency-aware ranking, where the model has to understand that a March schedule change supersedes a January booking confirmation, or that Q1 FY27 guidance supersedes Q4 FY26 guidance.<\/p>\n<h2>How Listwise Changes the Answer<\/h2>\n<p>To make the difference concrete, here are five cases where pointwise and listwise reranking diverge under an explicit instruction.<\/p>\n<p><strong>Email:<\/strong>\nPointwise scoring rewards the original BA confirmation: it has the highest &quot;BA flight to London&quot; keyword density and looks unsuperseded on its own. Listwise reads the inbox as a chain (confirmation, reschedule, cancellation, then a new booking on a different carrier) and surfaces the United flight.<\/p>\n<div class=\"not-prose my-4 rounded-2xl border border-zinc-200 bg-zinc-50 p-5 sm:p-6 text-zinc-900 dark:border-white\/[0.06] dark:bg-[#0a0a0a] dark:text-zinc-100\">\n  <div class=\"mb-5 space-y-2\">\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-zinc-500\">Query<\/span>\n      <span class=\"text-sm\">\"What's the current status of my flight to London?\"<\/span>\n    <\/div>\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-orange-600 dark:text-orange-400\">Instruction<\/span>\n      <span class=\"text-sm text-zinc-600 dark:text-zinc-300\">Resolve booking changes chronologically; the most recent valid booking wins.<\/span>\n    <\/div>\n  <\/div>\n\n  <div class=\"grid grid-cols-1 sm:grid-cols-2 gap-4\">\n    <div>\n      <div class=\"text-[10px] uppercase tracking-wider text-zinc-500 mb-3\">Pointwise (before)<\/div>\n      <div class=\"space-y-2\">\n        <div class=\"flex items-center gap-3 rounded-md 
bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">1<\/span>\n          <span class=\"text-sm flex-1\">BA286 SFO\u2192LHR booking confirmation<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Jan 12<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">2<\/span>\n          <span class=\"text-sm flex-1\">BA286 schedule change \u2192 21:30<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Feb 3<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">3<\/span>\n          <span class=\"text-sm flex-1\">UA901 SFO\u2192LHR booking confirmation<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Feb 22<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">4<\/span>\n          <span class=\"text-sm flex-1\">BA286 cancelled, refund processed<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Feb 20<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">5<\/span>\n          <span class=\"text-sm flex-1\">Hopper: cheap flights to London<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Mar 4<\/span>\n        <\/div>\n      <\/div>\n    <\/div>\n\n<pre><code>&lt;div&gt;\n  &lt;div class=&quot;text-[10px] uppercase tracking-wider text-zinc-500 mb-3&quot;&gt;Listwise (after)&lt;\/div&gt;\n  &lt;div class=&quot;space-y-2&quot;&gt;\n    &lt;div class=&quot;flex items-center gap-3 
rounded-md bg-orange-600 px-3 py-2&quot;&gt;\n      &lt;span class=&quot;text-xs text-orange-100 w-4 tabular-nums&quot;&gt;1&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1 font-medium text-white&quot;&gt;UA901 SFO\u2192LHR booking confirmation&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-orange-100 tabular-nums&quot;&gt;Feb 22&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;2&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;BA286 cancelled, refund processed&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Feb 20&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;3&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;BA286 schedule change \u2192 21:30&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Feb 3&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;4&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;BA286 SFO\u2192LHR booking confirmation&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Jan 12&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;5&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;Hopper: cheap flights to London&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 
tabular-nums&quot;&gt;Mar 4&lt;\/span&gt;\n    &lt;\/div&gt;\n  &lt;\/div&gt;\n&lt;\/div&gt;\n<\/code><\/pre>\n  <\/div>\n<\/div>\n\n<p><strong>Finance:<\/strong>\nThe Q3 FY26 call is the same speaker, same format, and same topic match, a strong pointwise signal, but the guidance it issues has been superseded. The Morgan Stanley note is more recent but is analyst commentary, not company guidance. Listwise routes around both and lands on the Q4 FY26 call as the current primary source.<\/p>\n<div class=\"not-prose my-4 rounded-2xl border border-zinc-200 bg-zinc-50 p-5 sm:p-6 text-zinc-900 dark:border-white\/[0.06] dark:bg-[#0a0a0a] dark:text-zinc-100\">\n  <div class=\"mb-5 space-y-2\">\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-zinc-500\">Query<\/span>\n      <span class=\"text-sm\">\"What is NVDA's most recent forward revenue guidance?\"<\/span>\n    <\/div>\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-orange-600 dark:text-orange-400\">Instruction<\/span>\n      <span class=\"text-sm text-zinc-600 dark:text-zinc-300\">Prefer primary company guidance; later filings supersede earlier; demote analyst commentary.<\/span>\n    <\/div>\n  <\/div>\n\n  <div class=\"grid grid-cols-1 sm:grid-cols-2 gap-4\">\n    <div>\n      <div class=\"text-[10px] uppercase tracking-wider text-zinc-500 mb-3\">Pointwise (before)<\/div>\n      <div class=\"space-y-2\">\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">1<\/span>\n          <span class=\"text-sm flex-1\">Q3 FY26 call \u2014 Q4 FY26 guidance ($37.5B)<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Nov 20 '25<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md 
bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">2<\/span>\n          <span class=\"text-sm flex-1\">Morgan Stanley NVDA note (Q1 model)<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Mar 15 '26<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">3<\/span>\n          <span class=\"text-sm flex-1\">Q4 FY26 call \u2014 Q1 FY27 guidance ($43B)<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Feb 26 '26<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">4<\/span>\n          <span class=\"text-sm flex-1\">CNBC: \"Nvidia guides Q1 above expectations\"<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Feb 26 '26<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">5<\/span>\n          <span class=\"text-sm flex-1\">NVDA 8-K \u2014 data center supply chain<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Apr 2 '26<\/span>\n        <\/div>\n      <\/div>\n    <\/div>\n\n<pre><code>&lt;div&gt;\n  &lt;div class=&quot;text-[10px] uppercase tracking-wider text-zinc-500 mb-3&quot;&gt;Listwise (after)&lt;\/div&gt;\n  &lt;div class=&quot;space-y-2&quot;&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-orange-600 px-3 py-2&quot;&gt;\n      &lt;span class=&quot;text-xs text-orange-100 w-4 tabular-nums&quot;&gt;1&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1 font-medium text-white&quot;&gt;Q4 FY26 call \u2014 Q1 FY27 guidance ($43B)&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-orange-100 
tabular-nums&quot;&gt;Feb 26 &#39;26&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;2&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;CNBC: &quot;Nvidia guides Q1 above expectations&quot;&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Feb 26 &#39;26&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;3&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;Morgan Stanley NVDA note (Q1 model)&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Mar 15 &#39;26&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;4&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;Q3 FY26 call \u2014 Q4 FY26 guidance ($37.5B)&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Nov 20 &#39;25&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;5&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;NVDA 8-K \u2014 data center supply chain&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Apr 2 &#39;26&lt;\/span&gt;\n    &lt;\/div&gt;\n  &lt;\/div&gt;\n&lt;\/div&gt;\n<\/code><\/pre>\n  <\/div>\n<\/div>\n\n<p><strong>Legal:<\/strong>\nThe Code of Conduct is keyword-dense for &quot;indemnification&quot; but belongs to a different 
agreement; Amendment 4 is the most recent document but doesn&#39;t touch \u00a712. Listwise traces the \u00a712 chain (base \u2192 scope rewrite \u2192 cap reset) to identify Amendment 3 as the operative cap, sitting on top of Amendment 2&#39;s scope.<\/p>\n<div class=\"not-prose my-4 rounded-2xl border border-zinc-200 bg-zinc-50 p-5 sm:p-6 text-zinc-900 dark:border-white\/[0.06] dark:bg-[#0a0a0a] dark:text-zinc-100\">\n  <div class=\"mb-5 space-y-2\">\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-zinc-500\">Query<\/span>\n      <span class=\"text-sm\">\"Find the operative indemnification provision.\"<\/span>\n    <\/div>\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-orange-600 dark:text-orange-400\">Instruction<\/span>\n      <span class=\"text-sm text-zinc-600 dark:text-zinc-300\">Trace Section 12 amendments; ignore amendments and docs that don't touch this section.<\/span>\n    <\/div>\n  <\/div>\n\n  <div class=\"grid grid-cols-1 sm:grid-cols-2 gap-4\">\n    <div>\n      <div class=\"text-[10px] uppercase tracking-wider text-zinc-500 mb-3\">Pointwise (before)<\/div>\n      <div class=\"space-y-2\">\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">1<\/span>\n          <span class=\"text-sm flex-1\">Code of Conduct (boilerplate indemn.)<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">2024-02<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">2<\/span>\n          <span class=\"text-sm flex-1\">MSA \u00a712 \u2014 base indemnification<\/span>\n          <span class=\"text-xs text-zinc-500 
tabular-nums\">2022-06<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">3<\/span>\n          <span class=\"text-sm flex-1\">Amendment 2 \u2014 \u00a712 scope rewrite<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">2023-11<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">4<\/span>\n          <span class=\"text-sm flex-1\">Amendment 3 \u2014 \u00a712 cap \u2192 $5M<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">2025-07<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">5<\/span>\n          <span class=\"text-sm flex-1\">Amendment 4 \u2014 \u00a78 only (term)<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">2026-01<\/span>\n        <\/div>\n      <\/div>\n    <\/div>\n\n<pre><code>&lt;div&gt;\n  &lt;div class=&quot;text-[10px] uppercase tracking-wider text-zinc-500 mb-3&quot;&gt;Listwise (after)&lt;\/div&gt;\n  &lt;div class=&quot;space-y-2&quot;&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-orange-600 px-3 py-2&quot;&gt;\n      &lt;span class=&quot;text-xs text-orange-100 w-4 tabular-nums&quot;&gt;1&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1 font-medium text-white&quot;&gt;Amendment 3 \u2014 \u00a712 cap \u2192 $5M&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-orange-100 tabular-nums&quot;&gt;2025-07&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;2&lt;\/span&gt;\n      
&lt;span class=&quot;text-sm flex-1&quot;&gt;Amendment 2 \u2014 \u00a712 scope rewrite&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;2023-11&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;3&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;MSA \u00a712 \u2014 base indemnification&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;2022-06&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;4&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;Amendment 4 \u2014 \u00a78 only (term)&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;2026-01&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;5&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;Code of Conduct (boilerplate indemn.)&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;2024-02&lt;\/span&gt;\n    &lt;\/div&gt;\n  &lt;\/div&gt;\n&lt;\/div&gt;\n<\/code><\/pre>\n  <\/div>\n<\/div>\n\n<p><strong>Memory:<\/strong>\nPointwise ranks dietary snippets by topic match, mixing current and superseded statements together. 
Listwise reads the chain (vegetarian \u2192 pescatarian \u2192 plant-based \u2192 pescatarian) and treats the peanut allergy as an orthogonal additive constraint, surfacing both the current pescatarian statement and the additive allergy statement at the top.<\/p>\n<div class=\"not-prose my-4 rounded-2xl border border-zinc-200 bg-zinc-50 p-5 sm:p-6 text-zinc-900 dark:border-white\/[0.06] dark:bg-[#0a0a0a] dark:text-zinc-100\">\n  <div class=\"mb-5 space-y-2\">\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-zinc-500\">Query<\/span>\n      <span class=\"text-sm\">\"What are the user's current dietary restrictions?\"<\/span>\n    <\/div>\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-orange-600 dark:text-orange-400\">Instruction<\/span>\n      <span class=\"text-sm text-zinc-600 dark:text-zinc-300\">Synthesize current state; later statements supersede earlier ones; allergies are additive.<\/span>\n    <\/div>\n  <\/div>\n\n  <div class=\"grid grid-cols-1 sm:grid-cols-2 gap-4\">\n    <div>\n      <div class=\"text-[10px] uppercase tracking-wider text-zinc-500 mb-3\">Pointwise (before)<\/div>\n      <div class=\"space-y-2\">\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">1<\/span>\n          <span class=\"text-sm flex-1\">\"Vegetarian for about five years.\"<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Aug 2025<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">2<\/span>\n          <span class=\"text-sm flex-1\">\"Going plant-based for a 90-day reset.\"<\/span>\n          <span class=\"text-xs 
text-zinc-500 tabular-nums\">Feb 2026<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">3<\/span>\n          <span class=\"text-sm flex-1\">\"Started eating fish again.\"<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Oct 2025<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">4<\/span>\n          <span class=\"text-sm flex-1\">\"Plant-based didn't stick \u2014 back to pescatarian.\"<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Apr 2026<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">5<\/span>\n          <span class=\"text-sm flex-1\">\"Diagnosed mild peanut allergy.\"<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">Jan 2026<\/span>\n        <\/div>\n      <\/div>\n    <\/div>\n\n<pre><code>&lt;div&gt;\n  &lt;div class=&quot;text-[10px] uppercase tracking-wider text-zinc-500 mb-3&quot;&gt;Listwise (after)&lt;\/div&gt;\n  &lt;div class=&quot;space-y-2&quot;&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-orange-600 px-3 py-2&quot;&gt;\n      &lt;span class=&quot;text-xs text-orange-100 w-4 tabular-nums&quot;&gt;1&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1 font-medium text-white&quot;&gt;&quot;Plant-based didn&#39;t stick \u2014 back to pescatarian.&quot;&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-orange-100 tabular-nums&quot;&gt;Apr 2026&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-orange-600 px-3 py-2&quot;&gt;\n      &lt;span class=&quot;text-xs text-orange-100 w-4 
tabular-nums&quot;&gt;2&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1 font-medium text-white&quot;&gt;&quot;Diagnosed mild peanut allergy.&quot;&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-orange-100 tabular-nums&quot;&gt;Jan 2026&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;3&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;&quot;Going plant-based for a 90-day reset.&quot;&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Feb 2026&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;4&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;&quot;Started eating fish again.&quot;&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Oct 2025&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;5&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;&quot;Vegetarian for about five years.&quot;&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;Aug 2025&lt;\/span&gt;\n    &lt;\/div&gt;\n  &lt;\/div&gt;\n&lt;\/div&gt;\n<\/code><\/pre>\n  <\/div>\n<\/div>\n\n<p><strong>Models:<\/strong>\nThe benchmark explainer is the most &quot;MRR + reranker&quot; keyword-dense candidate, so the pointwise model promotes it, even though it is not a measured reranker row. 
Listwise applies the instruction to compare results from the same evaluation and ranks the model with the highest reported MRR first.<\/p>\n<div class=\"not-prose my-4 rounded-2xl border border-zinc-200 bg-zinc-50 p-5 sm:p-6 text-zinc-900 dark:border-white\/[0.06] dark:bg-[#0a0a0a] dark:text-zinc-100\">\n  <div class=\"mb-5 space-y-2\">\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-zinc-500\">Query<\/span>\n      <span class=\"text-sm\">\"Which reranker has the highest MRR?\"<\/span>\n    <\/div>\n    <div class=\"flex flex-col sm:flex-row sm:items-baseline gap-1 sm:gap-3\">\n      <span class=\"text-[10px] uppercase tracking-wider text-orange-600 dark:text-orange-400\">Instruction<\/span>\n      <span class=\"text-sm text-zinc-600 dark:text-zinc-300\">Compare only rows from the same instruction-following evaluation; sort by MRR descending.<\/span>\n    <\/div>\n  <\/div>\n\n  <div class=\"grid grid-cols-1 sm:grid-cols-2 gap-4\">\n    <div>\n      <div class=\"text-[10px] uppercase tracking-wider text-zinc-500 mb-3\">Pointwise (before)<\/div>\n      <div class=\"space-y-2\">\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">1<\/span>\n          <span class=\"text-sm flex-1\">Blog: \"how MRR evaluates rerankers\"<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">n\/a<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">2<\/span>\n          <span class=\"text-sm flex-1\">Voyage rerank-2.5 benchmark row<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">0.84 MRR<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 
dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">3<\/span>\n          <span class=\"text-sm flex-1\">Cohere Rerank 4 Pro benchmark row<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">0.77 MRR<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">4<\/span>\n          <span class=\"text-sm flex-1\">mxbai-rerank-v3-listwise benchmark row<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">0.93 MRR<\/span>\n        <\/div>\n        <div class=\"flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]\">\n          <span class=\"text-xs text-zinc-500 w-4 tabular-nums\">5<\/span>\n          <span class=\"text-sm flex-1\">ZeroEntropy zerank-2 benchmark row<\/span>\n          <span class=\"text-xs text-zinc-500 tabular-nums\">0.71 MRR<\/span>\n        <\/div>\n      <\/div>\n    <\/div>\n\n<pre><code>&lt;div&gt;\n  &lt;div class=&quot;text-[10px] uppercase tracking-wider text-zinc-500 mb-3&quot;&gt;Listwise (after)&lt;\/div&gt;\n  &lt;div class=&quot;space-y-2&quot;&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-orange-600 px-3 py-2&quot;&gt;\n      &lt;span class=&quot;text-xs text-orange-100 w-4 tabular-nums&quot;&gt;1&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1 font-medium text-white&quot;&gt;mxbai-rerank-v3-listwise benchmark row&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-orange-100 tabular-nums&quot;&gt;0.93 MRR&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;2&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;Voyage rerank-2.5 benchmark row&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 
tabular-nums&quot;&gt;0.84 MRR&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;3&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;Cohere Rerank 4 Pro benchmark row&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;0.77 MRR&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;4&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;ZeroEntropy zerank-2 benchmark row&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;0.71 MRR&lt;\/span&gt;\n    &lt;\/div&gt;\n    &lt;div class=&quot;flex items-center gap-3 rounded-md bg-white px-3 py-2 dark:bg-white\/[0.04]&quot;&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 w-4 tabular-nums&quot;&gt;5&lt;\/span&gt;\n      &lt;span class=&quot;text-sm flex-1&quot;&gt;Blog: &quot;how MRR evaluates rerankers&quot;&lt;\/span&gt;\n      &lt;span class=&quot;text-xs text-zinc-500 tabular-nums&quot;&gt;n\/a&lt;\/span&gt;\n    &lt;\/div&gt;\n  &lt;\/div&gt;\n&lt;\/div&gt;\n<\/code><\/pre>\n  <\/div>\n<\/div>\n\n<h2>One API<\/h2>\n<p>mxbai-rerank-v3-listwise is available in preview today through Mixedbread Search:<\/p>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread import Mixedbread\n\nclient = Mixedbread()\n\nresults = client.stores.search(\n    store_identifiers=[&quot;my-store&quot;],\n    query=&quot;when is my flight to London? 
The most recent valid booking wins&quot;,\n    # Rerank the retrieved candidates with the listwise model\n    search_options={\n        &quot;rerank&quot;: {\n            &quot;model&quot;: &quot;mixedbread-ai\/mxbai-rerank-v3-listwise&quot;,\n        }\n    },\n)\n<\/code><\/pre>\n<p><strong>TypeScript:<\/strong><\/p>\n<pre><code class=\"language-typescript\">import { Mixedbread } from &quot;@mixedbread\/sdk&quot;;\n\nconst client = new Mixedbread();\n\nconst results = await client.stores.search({\n    store_identifiers: [&quot;my-store&quot;],\n    query: &quot;when is my flight to London? The most recent valid booking wins&quot;,\n    \/\/ Rerank the retrieved candidates with the listwise model\n    search_options: {\n        rerank: {\n            model: &quot;mixedbread-ai\/mxbai-rerank-v3-listwise&quot;,\n        },\n    },\n});\n<\/code><\/pre>\n<p>New users get $5 in free credits to <a href=\"https:\/\/www.platform.mixedbread.com\/\">try it<\/a>.<\/p>\n<p>If this is a problem you want to work on, we are <a href=\"https:\/\/mixedbread.com\/careers\">hiring<\/a>.<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/listwise-rerank","title":"Ranking Beyond Binary Relevance: mxbai-rerank-v3-listwise","summary":"Announcing mxbai-rerank-v3-listwise, our new listwise reranker codesigned with Wholembed v3. It improves results on every benchmark we ran, with state-of-the-art instruction following.","image":"https:\/\/www.mixedbread.com\/images\/blog\/listwise-rerank\/intro-listwise.jpg","date_modified":"2026-05-08T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research","product"]},{"id":"https:\/\/www.mixedbread.com\/blog\/closing-gap","content_html":"<p>In retrieval for agentic workflows, the most important number is not the raw score, but the gap to oracle.<\/p>\n<p>Oracle retrieval is the ceiling: the score you get when the system has access to the correct evidence for every question, without retrieval misses. 
The smaller the gap, the less retrieval is holding the rest of your stack back.<\/p>\n<p>Our goal with Mixedbread Search is that you should not have to think about that gap at all. Building agentic systems is already complex enough without also having to debug retrieval quality. Search should be one less thing your team has to worry about.<\/p>\n<p>Across three agentic benchmarks, Mixedbread Search v3 consistently narrows that gap: on broad general-domain tasks like BrowseComp-Plus, on newer multimodal benchmarks like MADQA, and on enterprise knowledge-work benchmarks like OfficeQA-Pro. We chose these benchmarks because they stress the workflows our customers actually care about.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/oracle-gap\/oracle-gap-comparison.jpg\" alt=\"Oracle Gap Result\"><\/p>\n<h3>BrowseComp-Plus<\/h3>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2508.06600\">BrowseComp-Plus<\/a> measures deep-research agents on multi-hop questions over a corpus of roughly 100,000 web documents. 
It is designed specifically to separate retrieval quality from model capability.<\/p>\n<table>\n<thead>\n<tr>\n<th>Retriever<\/th>\n<th>Scaffold<\/th>\n<th>Accuracy<\/th>\n<th>Gap to Oracle (93.5%)[^1]<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>Mixedbread Search<\/strong><\/td>\n<td><strong>get_document<\/strong><\/td>\n<td><strong>90.48%<\/strong><\/td>\n<td><strong>3.02<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Reason-ModernColBERT<\/td>\n<td>get_document<\/td>\n<td>87.59%<\/td>\n<td>5.91<\/td>\n<\/tr>\n<tr>\n<td>Mixedbread Search<\/td>\n<td>standard<\/td>\n<td>80.00%<\/td>\n<td>13.50<\/td>\n<\/tr>\n<tr>\n<td>Reason-ModernColBERT<\/td>\n<td>standard<\/td>\n<td>79.52%<\/td>\n<td>13.98<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-Embed-8B<\/td>\n<td>standard<\/td>\n<td>71.69%<\/td>\n<td>21.81<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>[^1]: As reported in the BrowseComp-Plus paper.<\/p>\n<p>Mixedbread Search v3 <strong>ranks #1<\/strong> on the <a href=\"https:\/\/huggingface.co\/spaces\/Tevatron\/BrowseComp-Plus\">leaderboard<\/a> in both settings: with the default standardized benchmarking scaffold, and with the stronger agentic scaffold where the model can read the full document behind each retrieved snippet.<\/p>\n<p>Notably, the gap between Mixedbread Search and other retrieval approaches widens under the better harness. That suggests that, with Mixedbread Search, retrieval is less often the limiting factor, and that more of the remaining headroom lies in the overall agent setup. In practice, that means teams can spend less time debugging retrieval and more time improving the rest of their system.<\/p>\n<h3>MADQA<\/h3>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2603.12180\">MADQA<\/a>, released just last week, tests whether agents can navigate more than 18,000 pages from 800 heterogeneous PDFs to answer 500 human-authored questions. 
The benchmark is inherently multimodal: models are given screenshots of PDF pages, and the dataset is designed to reflect complex in-domain knowledge across areas like financial reports and legal documents.<\/p>\n<p>It can be run in two settings. In one-turn mode, the model gets a single search call before it must answer. In agentic mode, the model is given up to ten turns, with metrics that explicitly measure the tradeoff between answer quality and effort, a proxy for both token cost and system speed.<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Retriever<\/th>\n<th>Accuracy<\/th>\n<th>Gap to Oracle (99.4%)<\/th>\n<th>Page F1<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Human w\/ Oracle Retriever<\/td>\n<td>Oracle<\/td>\n<td>99.4%<\/td>\n<td>0.0<\/td>\n<td>-<\/td>\n<\/tr>\n<tr>\n<td>Button, an agentic document-QA system built on Gemini 3.1 Pro (Agentic)<\/td>\n<td><strong>Hybrid w\/ Mixedbread<\/strong><\/td>\n<td><strong>91.7%<\/strong><\/td>\n<td><strong>7.7<\/strong><\/td>\n<td><strong>86.9<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Gemini 3 Pro (One-shot RAG)<\/td>\n<td><strong>Mixedbread<\/strong><\/td>\n<td><strong>88.2%<\/strong><\/td>\n<td><strong>11.2<\/strong><\/td>\n<td><strong>82.2<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Human w\/ BM25<\/td>\n<td>BM25<\/td>\n<td>82.2%<\/td>\n<td>17.2<\/td>\n<td>79.3<\/td>\n<\/tr>\n<tr>\n<td>Claude Sonnet 4.5 (Agentic)<\/td>\n<td>BM25<\/td>\n<td>80.6%<\/td>\n<td>18.8<\/td>\n<td>79.1<\/td>\n<\/tr>\n<tr>\n<td>Gemini 3 Pro (Preview) with File Search<\/td>\n<td>Google File Search<\/td>\n<td>78.6%<\/td>\n<td>20.8<\/td>\n<td>70.1<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>On the current MADQA <a href=\"https:\/\/huggingface.co\/spaces\/Snowflake\/MADQA-Leaderboard\">leaderboard<\/a>, Mixedbread-powered systems occupy the top spots in each category, with only a human using oracle documents outperforming them. 
In the single-search setting, Gemini 3 Pro with Mixedbread outperforms human experts who are allowed up to 10 BM25 searches.<\/p>\n<p>Gemini 3 Pro also gains nearly 10 points of accuracy with Mixedbread compared to Google\u2019s File Search API. Notably, the top-performing system, Button, is not ours: <a href=\"https:\/\/distyl.ai\/\">Distyl AI<\/a> paired Mixedbread Search with its own harness and reached state-of-the-art results without further tuning.<\/p>\n<p>Taken together, these results suggest that Mixedbread Search is effective at surfacing the evidence agentic workflows need, including on heterogeneous, multimodal corpora.<\/p>\n<h3>OfficeQA-Pro<\/h3>\n<p>Finally, <a href=\"https:\/\/arxiv.org\/abs\/2603.08655\">OfficeQA-Pro<\/a> is a specialized enterprise knowledge-work benchmark designed by Databricks. It contains 89,000 pages of complex financial documents, including U.S. Treasury Bulletins, dense tables, and scanned PDFs, along with questions that require multi-document reasoning to answer satisfactorily.<\/p>\n<p>We wanted to measure the effect of retrieval quality inside a widely used tool: OpenAI&#39;s Codex. So we ran the benchmark in the same Codex-based setup while varying only the retrieval tooling: a corpus baseline, where the raw corpus is stored on disk as OCR plus PDF images and Codex uses its usual tools; an oracle setting, where the model is given all correct documents; and Mixedbread Search-based retrieval.<\/p>\n<p>This setup does not isolate retrieval as cleanly as the other benchmarks, which were designed specifically for that purpose. 
Still, we think it is informative because OfficeQA documents are unusually difficult, and because many teams rely on agents\u2019 native \u201cno-RAG\u201d context tools in similar settings.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Correctness<\/th>\n<th>Gap to Oracle<\/th>\n<th>Latency (min)<\/th>\n<th>Tool Calls<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Codex (Oracle, thinking high)<\/td>\n<td>65.41<\/td>\n<td>-<\/td>\n<td>2.2*<\/td>\n<td>20.7*<\/td>\n<\/tr>\n<tr>\n<td><strong>Codex (Mixedbread, thinking high)<\/strong><\/td>\n<td><strong>64.42<\/strong><\/td>\n<td><strong>0.99<\/strong><\/td>\n<td><strong>2.36<\/strong><\/td>\n<td><strong>17.35<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Codex (Corpus, thinking high)<\/td>\n<td>56.39<\/td>\n<td>9.02<\/td>\n<td>3.6<\/td>\n<td>34.5<\/td>\n<\/tr>\n<tr>\n<td>GPT 5.4 Agent + Semantic Search **<\/td>\n<td>51.90<\/td>\n<td>13.51<\/td>\n<td>8.93<\/td>\n<td>86.4<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>With all other settings held equal, Mixedbread comes close to the oracle setting while materially reducing latency and tool use. Compared with Codex\u2019s built-in retrieval over the corpus, Mixedbread recovers 89% of the gap between the corpus baseline and oracle (8.03 of 9.02 points) while cutting latency by 34% and reducing tool calls by roughly half.<\/p>\n<p>Within the same harness, Mixedbread Search moves the agent substantially closer to oracle with less search effort.<\/p>\n<p>*: performance reported from paper<br>**: using Databricks agent (from paper)<\/p>\n<h3>More Limits to Overcome<\/h3>\n<p>Even with these results, knowledge work is far from solved. Hard cases remain: ambiguous queries, genuinely missing evidence, messy enterprise corpora, and tasks where the core difficulty is not finding documents but reasoning correctly over them once found.<\/p>\n<p>Closing the oracle gap does not eliminate system failures, and it does not fully eliminate retrieval failures either. 
But as the gap shrinks, the source of failure shifts. When retrieval is less often the bottleneck, more of the remaining headroom moves to reasoning, prompting, and domain-specific system design, which is where teams are often best served spending their effort.<\/p>\n<p><strong>One API<\/strong><\/p>\n<p>To make retrieval one less thing teams have to manage, we built Mixedbread Search behind a simple API:<\/p>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread import Mixedbread\nfrom pathlib import Path\n\nclient = Mixedbread()\n\n# Index documents (any modality)\nclient.stores.files.upload(\n    store_identifier=&quot;my-store&quot;,\n    file=Path(&quot;doc.pdf&quot;)\n)\n\n# Search\nresults = client.stores.search(\n    store_identifiers=[&quot;my-store&quot;],\n    query=&quot;quarterly revenue growth&quot;,\n)\n<\/code><\/pre>\n<p><strong>TypeScript:<\/strong><\/p>\n<pre><code class=\"language-typescript\">import { Mixedbread } from &quot;@mixedbread\/sdk&quot;;\nimport * as fs from &quot;node:fs&quot;;\n\nconst client = new Mixedbread();\n\n\/\/ Index documents (any modality)\nawait client.stores.files.upload({\n    storeIdentifier: &quot;my-store&quot;,\n    file: fs.createReadStream(&quot;doc.pdf&quot;)\n});\n\n\/\/ Search\nconst results = await client.stores.search({\n    store_identifiers: [&quot;my-store&quot;],\n    query: &quot;quarterly revenue growth&quot;,\n});\n<\/code><\/pre>\n<p>No chunking decisions, embedding model choices, image preprocessing, vector database configuration, or reranker threshold tuning.<\/p>\n<p>We handle multimodal ingestion, late-interaction encoding, indexing, and retrieval. You upload documents and start searching. 
To get started just <a href=\"https:\/\/www.platform.mixedbread.com\/\">sign up<\/a>.<\/p>\n<p>We are building Mixedbread to close the gap between the search that is possible today and what users and agents of tomorrow will demand. If that sounds like a problem you want to work on, we are <a href=\"https:\/\/mixedbread.com\/careers\">hiring<\/a>.<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/closing-gap","title":"Closing the Oracle Gap for Your Agents","summary":"Mixedbread Search v3 narrows the oracle gap for agentic retrieval, topping BrowseComp-Plus and setting leading results on MADQA and OfficeQA-Pro.","image":"https:\/\/www.mixedbread.com\/images\/blog\/oracle-gap\/orace-gap.jpg","date_modified":"2026-03-24T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["engineering","research","product"]},{"id":"https:\/\/www.mixedbread.com\/blog\/wholembed-v3","content_html":"<p>Today we&#39;re releasing Wholembed v3, our new unified, omnimodal, multilingual late-interaction retrieval model.<\/p>\n<p>Wholembed v3 sets a new state-of-the-art for search, delivering best-in-class performance across languages and modalities in both academic benchmarks and real-world industrial test cases from partners across multiple business domains.<\/p>\n<p>It is now publicly available on our <a href=\"https:\/\/www.platform.mixedbread.com\/sign-in\">platform<\/a> and will power all new stores on Mixedbread by default going forward. Co-designed with our custom-built retrieval <a href=\"https:\/\/www.mixedbread.com\/blog\/multimodal-late-interaction-billion-scale\">infrastructure<\/a>, Wholembed v3 enables Mixedbread Search to deliver state-of-the-art retrieval with high throughput, low latency, and a frictionless developer experience.<\/p>\n<h2>A Foundation Model for Modern Retrieval<\/h2>\n<p>Retrieval is at the core of most agentic applications. For AI systems to be truly powerful, they need to be grounded in relevant knowledge. 
And AI applications are already moving far beyond simple question answering: they are operating computers, controlling robots, and performing increasingly complex knowledge work.<\/p>\n<p>Yet search remains brittle, and, for many applications, still surprisingly hard to implement well. In large part, this comes down to the models and algorithms we rely on today. Much of retrieval still relies on older paradigms, which have garnered steady incremental performance gains but have not been able to keep pace with vast capability increases in other areas of AI. Inspired by this and the modeling and algorithmic breakthroughs that have transformed language models, we built Wholembed v3 to tackle current challenges, rather than simply improve on largely solved ones.<\/p>\n<p>Wholembed v3 was designed from the ground up for complex, real-world retrieval across hundreds of languages and all relevant modalities, including text, audio, and vision. With it, we set out to move beyond the limits and common failure modes of traditional semantic search.<\/p>\n<h3>Modern Problems Require Modern Evaluations<\/h3>\n<p>Two benchmarks that we found particularly insightful to help understand the limits of semantic search in the current age are LIMIT and BrowseComp-Plus, both of which map to two opposite, but very common, real-world applications.<\/p>\n<p><strong>LIMIT<\/strong> is a benchmark specifically designed to stress-test the limits of semantic retrieval under situations where the documents contain a lot of fine-grained, &quot;structured-like&quot; information expressed as natural text. In our experience working with customers across domains, we have found this kind of document to be extremely common across many industries. Yet, they are rarely captured by common retrieval benchmarks. 
LIMIT has thus represented an important frontier for semantic retrieval, with older, lexical-based methods vastly outperforming even the best billion-parameter semantic retrieval models.<\/p>\n<p><strong>BrowseComp-Plus<\/strong>, on the other hand, aims to measure how well-suited search systems are to the agent-as-a-first-class-citizen paradigm. Rather than evaluating retrievers based on retrieval metrics, which are not always well-correlated to system usefulness, they evaluate retrievers based on their ability to let an agent reach the correct answer to 830 complex deep research queries. Many of them require dozens of sequential searches to accurately answer: miss a single nugget of information and the original question remains unanswerable.<\/p>\n<p>We considered besting these two benchmarks as our North Star throughout development, while also aiming to avoid any contamination: we never evaluated any in-development model checkpoints on them until preparing for this release.<\/p>\n<table>\n<thead>\n<tr>\n<th align=\"left\">LIMIT (Benchmark to test limits of semantic search, a lot of real-world large-scale data is similar to this benchmark)<\/th>\n<th>Recall@5<\/th>\n<th>Recall@10<\/th>\n<th>Recall@100<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\"><strong>Model<\/strong><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">Cohere Embed 4<\/td>\n<td>1.8<\/td>\n<td>2.55<\/td>\n<td>5.7<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">OpenAI Text Embedding 3 Large<\/td>\n<td>1.75<\/td>\n<td>2.4<\/td>\n<td>4.95<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">Voyage 4 Large<\/td>\n<td>1.85<\/td>\n<td>2.9<\/td>\n<td>8.95<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">Gemini Embedding 2 (03\/26)<\/td>\n<td>1.85<\/td>\n<td>2.25<\/td>\n<td>6.9<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">BM25<\/td>\n<td>85.7<\/td>\n<td>90.4<\/td>\n<td>93.6<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">Mixedbread Wholembed 
V3<\/td>\n<td><strong>92.45<\/strong><\/td>\n<td><strong>94.4<\/strong><\/td>\n<td><strong>98.0<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/wholembed-v3\/chart.png\" alt=\"BrowseComp Plus Result\"><\/p>\n<p>On LIMIT, Wholembed v3 pushes semantic retrieval far beyond what was previously thought possible for semantic search. It not only outperformed previous semantic methods by a margin, but also became the first model to outperform lexical-based retrieval. On BrowseComp-Plus, we observe how important the choice of retriever, such as Mixedbread Search, is for deep research agents: without a good retrieval system, hundreds more questions are answered incorrectly.<\/p>\n<p>We believe this strong performance on difficult, real-world domains stems from design choices that align well with what agents need: precise retrieval, strong discrimination between similar candidates, and robustness to noisy, heterogeneous data.<\/p>\n<h2>Evaluating Retrieval In A Multi-Modal Multi-Lingual World<\/h2>\n<p>Wholembed v3 is not solely built around serving agents and their novel needs, but to improve retrieval for the real world: a world where information does not live in neatly pre-packaged paragraphs with clear relevance boundaries. In practice, that means a single retrieval system that can index and search text, images, audio, and video.<\/p>\n<p>We strongly believe that to be useful, search needs to be able to find information where it lives: in good quality text, yes, but also in artifacts-ridden OCRised documents, scanned PDFs and screenshots with confusing names, or even in instructional videos.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/wholembed-v3\/benchmark.jpg\" alt=\"Benchmark Results\"><\/p>\n<p>Wholembed v3 was designed with the real world in mind. 
Integrated with Mixedbread Search, it provides you with a seamless experience, ensuring that your service can retrieve exactly the information it needs without requiring you to think about how every single data processing detail will affect the quality of your application.<\/p>\n<p>In our testing, we found that Wholembed v3 outperforms existing methods in almost all situations, no matter the document type, the kind of information you need to find or the language it\u2019s in.<\/p>\n<p>As part of our pre-release phase, we have heard from our partners that this held true across a variety of domains, with Wholembed v3 preview-powered Mixedbread Search replacing convoluted pipelines encompassing many brittle parts.<\/p>\n<p>We are now very excited to finally release Wholembed v3 as our default and only model, and are looking forward to new kinds of applications it will enable. Search doesn\u2019t need to be complicated, it just needs to be good.<\/p>\n<h2>Try it Now<\/h2>\n<p>Wholembed v3 is now available to all Mixedbread Search users. All new stores use v3 by default, with built-in support for audio and video.<\/p>\n<p>New users get 2M free tokens to <a href=\"https:\/\/www.platform.mixedbread.com\/\">get started<\/a>. 
Startups can also explore our accelerator programs with <a href=\"https:\/\/vercel.com\/ai-accelerator\">Vercel<\/a> and <a href=\"https:\/\/www.tinyfish.ai\/accelerator\">TinyFish<\/a>.<\/p>\n<p>Create a store in the <a href=\"https:\/\/www.platform.mixedbread.com\/sign-in\">platform<\/a> or get started directly with <a href=\"https:\/\/www.mixedbread.com\/docs\/quickstart\">the API<\/a> in just a few lines:<\/p>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread import Mixedbread\nfrom pathlib import Path\n\nclient = Mixedbread()\n\n# Create a store\nclient.stores.create(name=&quot;example&quot;)\n\n# Index a file (any modality)\nclient.stores.files.upload(\n    store_identifier=&quot;example&quot;,\n    file=Path(&quot;example.mp4&quot;)\n)\n\n# Search\nclient.stores.search(\n    store_identifiers=[&quot;example&quot;],\n    query=&quot;how we bake search for the future?&quot;\n)\n<\/code><\/pre>\n<p><strong>TypeScript:<\/strong><\/p>\n<pre><code class=\"language-typescript\">import { Mixedbread } from &quot;@mixedbread\/sdk&quot;;\nimport * as fs from &quot;node:fs&quot;;\n\nconst client = new Mixedbread();\n\n\/\/ Create a store\nawait client.stores.create({ name: &quot;example&quot; });\n\n\/\/ Index a file (any modality)\nawait client.stores.files.upload({\n    storeIdentifier: &quot;example&quot;,\n    file: fs.createReadStream(&quot;example.mp4&quot;)\n});\n\n\/\/ Search\nawait client.stores.search({\n    store_identifiers: [&quot;example&quot;],\n    query: &quot;how we bake search for the future?&quot;\n});\n<\/code><\/pre>\n<p>We are building Mixedbread to close the gap between the search that is possible today and what users and agents of tomorrow will demand. If that sounds like a problem you want to work on, we are <a href=\"https:\/\/mixedbread.com\/careers\">hiring<\/a>. 
<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/wholembed-v3","title":"Beyond the Limit: Introduce Mixedbread Wholembed v3","summary":"Announcing Mixedbread Wholembed v3, our new unified omnimodal multilingual retrieval model, setting a new state of the art for search across languages, modalities, and real-world retrieval tasks.","image":"https:\/\/www.mixedbread.com\/images\/blog\/wholembed-v3\/wholembed-v3.jpg","date_modified":"2026-03-12T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["engineering","research","product"]},{"id":"https:\/\/www.mixedbread.com\/blog\/multimodal-late-interaction-billion-scale","content_html":"<p>Most semantic search issues don&#39;t show up as obvious failures. They show up as results that look reasonable, read well, and are still wrong. In our experience, this is a structural limitation of single-vector retrieval on dense and unfamiliar inputs: the representation collapses detail, and the retriever confidently returns &quot;close enough&quot; content that doesn&#39;t actually answer the query.<\/p>\n<p>Building a reliable retriever is also harder than it looks. You&#39;re stitching together parsing, chunking, embedding, metadata extraction, and ANN search, and each stage introduces its own brittleness. When quality drops, it&#39;s rarely clear whether the problem is upstream ingestion, representation, indexing, or scoring.<\/p>\n<p>We built a multimodal late-interaction retrieval system to make those failure modes rarer and easier to reason about. 
<strong>The system uses multi-vector representations across text, images, audio, and video, and it&#39;s deployed at billion-document scale: 1B+ documents indexed, 500+ QPS per store, and ~50ms search latency end-to-end.<\/strong> The rest of this post walks through the three pieces we had to build and jointly tune to get there: ingestion, encoding, and retrieval.<\/p>\n<p><a href=\"https:\/\/platform.mixedbread.com\">Get started for free<\/a><\/p>\n<h2>Multimodal Ingestion<\/h2>\n<p>For the vast majority of our embedding purposes, our goal is to create a true end-to-end representation pipeline, where we retrieve information from exactly where it lives in the embedding space, surrounded by all the valuable context.<\/p>\n<p>This means that audio files are first pre-processed to maximize quality before being passed to the model, which dynamically splits it into meaningful units on its own. For textual inputs, we have a series of pre-processing steps which ensures the data is broken down into manageable blocks (&quot;chunks&quot;) while maintaining all necessary context. Code is treated as its own separate input, with the AST parsed to determine logical cutoff points. For images, the model natively processes pixels.<\/p>\n<p>As for document formats, like PDFs and PowerPoint, every single page is individually exported to screenshots of the pages, ensuring that all visual and layout information, such as tables and graphs, are preserved and represented as individual semantic units, with context such as headings preserved across pages.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/multimodal-late-interaction-billion-scale\/ingestion.jpg\" alt=\"Multimodal Ingestion\" title=\"Multimodal Ingestion\"><\/p>\n<p>Unlike most other systems where the model&#39;s training data is largely decoupled from expected real-world inputs, our model is trained specifically on the output of these pre-processing steps. 
This ensures that it is fully optimized to retrieve documents exactly as they will be in production and yield more accurate search results.<\/p>\n<p>Even though our model can process the format of every type of document natively, that does not mean that downstream applications can use this information as effectively. Up until recently, the vast majority of the world&#39;s applications were built with text as the first-class, and often only, citizen. As such, parallel to preparing inputs for representation, our ingestion pipeline also runs a full document analysis step to extract text from the input. As a result, audio and video files are transcribed, and PDFs are converted into human- and LLM-readable Markdown using our custom <a href=\"https:\/\/www.mixedbread.com\/blog\/the-hidden-ceiling\">OCR pipeline<\/a>.<\/p>\n<h2>Multimodal Late-Interaction Encoding<\/h2>\n<p>&quot;Traditional&quot; semantic search relies on a simple, single-vector approach: an input is provided to a model, which then outputs a single vector representing the entire document.\nSingle-vector search has many advantages. It is fast and exceedingly simple to implement. However, it is also extremely lossy: while a single vector can capture paraphrases and general meaning well, it dilutes details and precise intent, and it struggles with information-dense documents, such as technical or legal documents and academic papers, especially in multimodal settings.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2004.12832\">ColBERT<\/a> pioneered a multi-vector approach, which replaces traditional single-vector representations with individual, lower-dimensional token-level representations. These token-level representations preserve fine-grained information in both the query and the documents. In practice, this yields more accurate search results in out-of-distribution settings, where real-world papers, slideshows or long-context texts may look fundamentally different from the training data. 
For example, our 17 million parameter open-source ColBERT outperforms 8 billion parameter embedding models on the <a href=\"https:\/\/arxiv.org\/abs\/2404.12096\">LongEmbed<\/a> benchmark, which measures the ability of embedding models to perform long-context retrieval tasks.<\/p>\n<p>In our experiments, we found that not only can multi-vector representations be compressed while continuing to capture fine-grained meaning, they also significantly outperform single-vector alternatives on image, video and audio search.<\/p>\n<table>\n<thead>\n<tr>\n<th align=\"left\">Benchmark (NDCG@10)<\/th>\n<th align=\"left\">Best Single Vector Model<\/th>\n<th align=\"left\">mxbai-wholembed-v3<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\"><a href=\"https:\/\/arxiv.org\/abs\/2412.02592\">OHR-V2<\/a><\/td>\n<td align=\"left\">86.47 (Qwen 3 VL 8B)<\/td>\n<td align=\"left\"><strong>91.26<\/strong><\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><a href=\"https:\/\/arxiv.org\/abs\/2505.11651\">Miracl-Vision<\/a><\/td>\n<td align=\"left\">59.79 (Qwen 3 VL 8B)<\/td>\n<td align=\"left\"><strong>66.02<\/strong><\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><a href=\"https:\/\/arxiv.org\/abs\/2601.08620\">Vidore 3<\/a><\/td>\n<td align=\"left\">64.81 (Qwen 3 VL 8B)<\/td>\n<td align=\"left\"><strong>67.9<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\nmxbai-wholembed-v3 is currently in internal evaluation and will be available soon. We are publishing an in-depth release about the model benchmark soon. The benchmarks shown are challenging multimodal evaluations. As of Jan. 
2026, Qwen 3 VL 8B is the latest release and achieves SOTA performance across multimodal and text only benchmarks.\n<\/div>\n\n<h3>mxbai-wholembed: A Unified Multimodal Encoder<\/h3>\n<p>Leveraging our team&#39;s previous frontier work on both <a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-large-v1\">embedding models<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/2412.13663\">ModernBERT<\/a>, among other things, we trained <strong>mxbai-wholembed<\/strong>, our unified multimodal late-interaction encoder.<\/p>\n<p>This unified model produces semantic unit-level vector representations, across text, image, audio, and video inputs, in a shared latent space, accurately capturing semantic relationships across modalities. Many previous state-of-the-art systems opted for modality-specific approaches, which introduced complexity by requiring modality-specific handling and yielded performance tradeoffs orthogonal to scaling laws. By sharing one latent space, we unlock &quot;any-to-any&quot; search and streamline architectural constraints by removing the need for any such modality-specific routing and storing. Throughout our experiments, we found that the bitter lesson of the effectiveness of gradient descent continues to hold true. Rather than competing, we observed that properly orchestrated multimodal training in a single, shared embedding space resulted in a lifting effect: improving the quality of image retrieval, for example, also improved performance for both audio and text retrieval.<\/p>\n<h4>Scaling Laws<\/h4>\n<p>Our experiments showed that, despite the huge attention given to the BERT-scale model, scaling works for retrieval. 
This has also been independently demonstrated in the <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3626772.3657743\">best paper award winner<\/a> at SIGIR, the premier Information Retrieval conference.<\/p>\n<p>Unlike the situation around LLMs, retrieval scaling laws are largely underexplored, poorly understood, and rarely used for downstream production settings. In spite of this, we found that larger scale models consistently allow for better cross-modality alignment and handling of complex information, such as composed queries, at the expense of significant efficiency constraints.<\/p>\n<p>After careful experimentation, we have settled on an infrastructure that, combined with a custom-designed inference engine, strikes the right balance at the Pareto frontier of latency and retrieval performance.<\/p>\n<p>This optimised stack has led to considerable scale increases: mxbai-wholembed has over 20 times the parameter count of our original multimodal model, and three times that of our previous one. With every generation, we saw considerable improvements in our models, especially in real-world edge cases poorly captured by standardized benchmarking.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/multimodal-late-interaction-billion-scale\/model_size.jpg\" alt=\"mxbai-wholembed size comparison\" title=\"mxbai-wholembed size comparison\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\nmxbai-wholembed-v3 has 20 times the number of parameters of v1. This scale increase leads to considerable performance improvements.\n<\/div>\n\n<h3>Dynamic vector allocation<\/h3>\n<p>mxbai-wholembed estimates the information density of a given input and decides the necessary number of vectors to represent it accordingly.  For example, a simple cat image may output a few vectors, whereas a complex slide deck may generate thousands of vectors. 
This dynamic allocation of representation capacity prevents semantic collapse on dense content often experienced by single vector representations. The optimal size of allocation is determined based on large-scale internal experiments to allow mxbai-wholembed to capture fine-grained information while keeping storage requirements low.<\/p>\n<h2>Billion-scale Late-Interaction Retrieval<\/h2>\n<p>While late-interaction makes retrieval more accurate for real-world uses, it also creates a scale challenge. A single document can produce hundreds or thousands of vectors; a single collection of documents can then contain millions of vectors to search across.<\/p>\n<h3>The Scale Challenge<\/h3>\n<p>Single-vector search efficiency is greatly helped by relying on very simple operations at the hardware-level, making storage and search straightforward. Additionally, it benefits from decades of approximate nearest neighbor (ANN) index research: algorithms like HNSW, SPFresh, and DiskANN bring search to near-constant time.<\/p>\n<p>This simplicity is why many semantic search platforms heavily favor single vector offerings: they remove considerable complexity and are extremely cheap to scale. However, the tradeoffs to retrieval quality are significant and theoretically demonstrated to be impossible to overcome with our current understanding (<a href=\"https:\/\/arxiv.org\/abs\/2508.21038\">Weller<\/a>).<\/p>\n<p>Conversely, late interaction approaches have been demonstrated to greatly alleviate these tradeoffs, but break many of the simplicity assumptions that power large-scale single vector search. Specifically, they introduce three compounding challenges:<\/p>\n<ol>\n<li><strong>Limited indexing support for multi-vectors<\/strong>: The vast majority of current ANN indexing methods are not built for multi-vectors. 
With hundreds of vectors per document, scanning becomes prohibitively expensive and searches can take seconds instead of milliseconds.<\/li>\n<li><strong>Scoring is expensive.<\/strong> A single-vector document requires a single inner product calculation: scoring across a million documents is ~1 million operations. With multi-vector methods, a document with 256 vectors scored against a 16-vector query requires 256 x 16 = 4,096 inner products (followed by score aggregation): scoring a million documents now requires roughly 4 billion operations.<\/li>\n<li><strong>Storage scales with vectors.<\/strong> More vectors per document means more storage, more memory and more I\/O pressure.<\/li>\n<\/ol>\n<h3>silo: An S3-Native Retrieval Engine<\/h3>\n<p><strong>silo<\/strong> is a custom multi-vector retrieval engine designed to support late-interaction at billions-of-documents and trillions-of-vectors scale and maintain the flexibility of common ANN indexes. Three crucial optimizations bring the search latency under 50ms.<\/p>\n<h4>Two-Stage Retrieval<\/h4>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/multimodal-late-interaction-billion-scale\/filtering.jpg\" alt=\"Filtering Pipeline\"><\/p>\n<p>silo performs two-stage retrieval to make the search space tractable. Stage 1 pre-filters the search space using approximate searches and metadata filtering, and reduces it from billions of documents to a few thousand candidates. Stage 2 scores the candidate documents using MaxSim.<\/p>\n<p>silo maintains and updates the document indexes at write time when documents are ingested, with no performance cost during retrieval. Consequently, silo avoids the preprocessing-time index builds and the latency hits on CRUD operations that are common in ColBERT-like methods.<\/p>\n<h4>Accelerated Scoring<\/h4>\n<p>A few months ago, we wrote and released <a href=\"https:\/\/github.com\/mixedbread-ai\/maxsim-cpu\">maxsim-cpu<\/a>, a high-performance Rust implementation of MaxSim on CPU. 
You can read more about it in our <a href=\"https:\/\/mixedbread.com\/blog\/maxsim-cpu\">blog post<\/a>.<\/p>\n<p>In production today, we went a step further and run a version of maxsim-cpu further optimized with custom kernels targeting the specific hardware of our inference pipeline, along with aggressive vector quantization to speed up scoring.<\/p>\n<p>As a result, scoring a thousand candidate documents using MaxSim takes under 3 milliseconds and a single CPU processor is sufficient to handle thousands of QPS.<\/p>\n<h4>Object Storage with NVMe Caching<\/h4>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/multimodal-late-interaction-billion-scale\/silo.jpg\" alt=\"silo S3 architecture\" title=\"Silo Architecture\"><\/p>\n<p>All vectors are stored in low-cost, replicated S3 object storage for scalability, durability and as the source-of-truth. Actively accessed vectors are loaded at query time into NVMe SSDs and cached in memory for computation.<\/p>\n<p>On the write path, both vectors and metadata are persisted to a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Write-ahead_logging\">write-ahead log (WAL)<\/a> in S3, and their index and cache are populated synchronously.<\/p>\n<p><strong>Guarantees:<\/strong><\/p>\n<ul>\n<li>Writes acknowledged only after S3 WAL commit (durability)<\/li>\n<li>Data never lost even if compute nodes fail (fault tolerance)<\/li>\n<li>Hot stores served from cache (memory-like latency)<\/li>\n<li>Cold stores hydrated on-demand, cached for subsequent queries<\/li>\n<\/ul>\n<p>This allows us to serve billions of vectors per tenant while keeping retrieval speed in milliseconds.<\/p>\n<h2>Performance<\/h2>\n<table>\n<thead>\n<tr>\n<th align=\"left\">Stage<\/th>\n<th align=\"left\">Latency<\/th>\n<th align=\"left\">Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\">Query encoding<\/td>\n<td align=\"left\">10-20ms<\/td>\n<td align=\"left\">Custom inference engine<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">Stage 1: 
Pruning<\/td>\n<td align=\"left\">&lt;10ms<\/td>\n<td align=\"left\">Billions \u2192 ~1000 candidates<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">Stage 2: Scoring<\/td>\n<td align=\"left\">5-10ms<\/td>\n<td align=\"left\">Custom kernels<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">Result assembly<\/td>\n<td align=\"left\">~10ms<\/td>\n<td align=\"left\">Metadata, snippets<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Total<\/strong><\/td>\n<td align=\"left\"><strong>~50ms<\/strong><\/td>\n<td align=\"left\"><\/td>\n<\/tr>\n<\/tbody><\/table>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\nLatency measured at P50 on a production system with hot cache. QPS per store means each store (search index) can sustain over 500 QPS.\n<\/div>\n\n<h2>Why it matters<\/h2>\n<p>Retrieval has always been, and continues to be, the natural interface to information. Traditionally, it was how you found useful websites and relevant snippets. Nowadays, it&#39;s how agents find exactly the pieces of context they need to answer a user&#39;s query. But retrieval has been stagnant for too long, and it has become clear that the once-omnipresent single-vector text representation is not meeting the needs of this new generation of users. Agents need the model to be able to understand long, reasoning-intensive queries. They need the ability to retrieve documents where they live, be they text, images, PDFs or even videos.<\/p>\n<p>We&#39;re building Mixedbread to close the gap between the search that is possible today and what the users of tomorrow will demand. If that sounds like a problem you&#39;d want to work on, we&#39;re <a href=\"http:\/\/mixedbread.com\/careers\">hiring<\/a>.<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/multimodal-late-interaction-billion-scale","title":"Inside Mixedbread: \nHow We Built Multimodal Late-Interaction at Billion Scale","summary":"Technical deep-dive into Mixedbread Search - the first production-ready late-interaction search with native multimodality. 
Learn how we achieve sub-50ms latency on billion-scale document collections.","image":"https:\/\/www.mixedbread.com\/images\/blog\/multimodal-late-interaction-billion-scale\/img.jpg","date_modified":"2026-01-21T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["engineering","research","product"]},{"id":"https:\/\/www.mixedbread.com\/blog\/vercel-marketplace","content_html":"<blockquote>\n<p>\u201cMixedbread cooked so hard. Instant multi-modal AI search. Everything just works and looks fire.\u201d<\/p>\n<p>Guillermo Rauch, CEO @Vercel<\/p>\n<\/blockquote>\n<p>We are excited to announce that Mixedbread Search is now <a href=\"https:\/\/vercel.com\/marketplace\/mixedbread\">fully available on the Vercel Marketplace<\/a>!<\/p>\n<p>You will now be able to seamlessly access our state-of-the-art, blazingly fast, multi-modal semantic search solution to integrate it into your Vercel projects.<\/p>\n<p>If you&#39;re already hosting a web application on Vercel, this is the easiest way for you to integrate and manage third-party dependencies like ours.<\/p>\n<h2>Why do you need Mixedbread Search?<\/h2>\n<p>We believe that retrieving the most relevant information and context is absolutely crucial to the optimization or outright viability of many systems and applications, most prominently agentic AI systems and <a href=\"https:\/\/nga.demo.mixedbread.com\/\">search systems of any kind<\/a>.<\/p>\n<p>Failure or hallucinations of AI agents when the material to base correct answers and decisions on doesn&#39;t ever make it into the context window, although you know it&#39;s there somewhere, are still among the most frustrating experiences for developers and users alike.<\/p>\n<p>With Mixedbread Search&#39;s ability to find the most relevant data from your knowledge base at extremely low latency with <a href=\"https:\/\/mixedbread.com\/blog\/mixedbread-search\">state-of-the-art search accuracy<\/a>, no matter how deep it is buried or 
what format your data is in, you can offer a magical experience to your users and supercharge your AI agents.<\/p>\n<h2>Getting started<\/h2>\n<p>Visit our <a href=\"https:\/\/vercel.com\/marketplace\/mixedbread\">Vercel Marketplace listing<\/a> and select &quot;Install&quot;.<\/p>\n<p>You will be prompted to create a new Mixedbread account and configure your Mixedbread store and installation plan.<\/p>\n<p>That&#39;s it! You will now receive your <code>MXBAI_API_KEY<\/code> and <code>MXBAI_STORE_ID<\/code> environment variables and you&#39;ll be able to connect the Mixedbread Store to your project with a couple of clicks via Vercel&#39;s UI. <\/p>\n<p>Our <a href=\"https:\/\/www.mixedbread.com\/docs\/quickstart\">docs<\/a> are also there to help you along the way to learn more about building on our API.<\/p>\n<p>It&#39;s as easy as that, so don&#39;t hesitate to add Mixedbread Search to your project and boost your search experience today!<\/p>\n<p>You can also check out our <a href=\"https:\/\/vercel.com\/templates\/next.js\/mixedbread-starter\">starter template<\/a>, a demo repo for integrating Mixedbread Search into ecommerce applications, and deploy it to Vercel.<\/p>\n<h2>We want to hear from you<\/h2>\n<p>Since Mixedbread Search is still in beta, we are rapidly iterating on feedback from our users to improve it ever further.<\/p>\n<p>Please get in touch if you encounter any issues, or if you have any ideas that would make your experience even better. <\/p>\n<p>We would love to hear from you, so please feel free to <a href=\"https:\/\/www.mixedbread.com\/contact\">reach out<\/a>!<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/vercel-marketplace","title":"Mixedbread now live on the Vercel Marketplace","summary":"Announcing the integration of Mixedbread Search with the Vercel Marketplace. 
Seamlessly integrate state-of-the-art, blazingly fast, multi-modal semantic search into your Vercel projects!","image":"https:\/\/www.mixedbread.com\/images\/blog\/vercel-marketplace\/vercel-marketplace.webp","date_modified":"2025-10-23T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["product"]},{"id":"https:\/\/www.mixedbread.com\/blog\/edge-v0","content_html":"<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br\/>\n    We introduce the mxbai-edge-colbert-v0 model family, at two model sizes: 17M and 32M. Both models are very strong baselines for future research, with the 17M variant outperforming ColBERTv2 and standing in a league of its own for small-model retrieval.<\/p>\n<\/blockquote>\n<h2>Introduction<\/h2>\n<p>This summer, we set out to prepare the next steps of an ambitious research roadmap, with the ultimate aim of designing ever-improving approaches to late interaction and multi-vector retrieval. As we started planning experimental work more thoroughly, we were faced with a problem when thinking about one important question: what model do we tinker with for early experiments?<\/p>\n<p>Indeed, to try out new things you need a strong baseline with a well-understood training process to ensure that the results of your experiments are meaningful.\nIdeally, this baseline would be both strong and small. Scaling laws exist in information retrieval as they do in the rest of machine learning, and capable strong models allow you to test out ideas in record time before scaling them with relatively little effort.<\/p>\n<p>With this in mind, we did not know what this tiny experimental testbed should be: this is not to say that there are no good ColBERT models out there, but none of them seemed to meet our needs!<\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2112.01488\">ColBERTv2<\/a> is an excellent baseline, but it is from 2021, which, in AI terms, means that it is ancient. 
<a href=\"https:\/\/huggingface.co\/lightonai\/GTE-ModernColBERT-v1\">GTE-ModernColBERT<\/a>, the current state-of-the-art, is a fantastic model, but it suffers from two problems: it is larger than we would like and it is initialized from a pre-trained dense embedding model that is hard to reproduce, limiting our control on experiments. <a href=\"https:\/\/huggingface.co\/answerdotai\/answerai-colbert-small-v1\">answerai-colbert-small-v1<\/a>, while an extremely strong model at a great compact size, is also initialized from a pre-trained checkpoint which is the averaging of not one, but <strong>two<\/strong> hard-to-reproduce embedding models. Additionally, it has a MiniLM backbone, which means it suffers from the same limitations of previous-generation encoders such as the lack of long-context support, much like ColBERTv2.<\/p>\n<p>Just a few weeks before we started pondering this, the <a href=\"https:\/\/arxiv.org\/abs\/2507.11412\">Ettin<\/a> collection of models had come out: among other things, it includes a replication of <a href=\"https:\/\/arxiv.org\/abs\/2412.13663\">ModernBERT<\/a> (with some tweaks) across a large range of model scales, from 17 million to 1 billion parameters.\nThe two smallest sizes of Ettin, with 17 and 32 million parameters respectively, seemed like perfect matches: we quickly made the decision to train tiny, capable models that could support all of our future experiments. We also immediately decided that we would release these models publicly, as we believe open source releases to be the perfect home for models that can run on just about any hardware.<\/p>\n<p>If you want the full details on our training process, please head over to the <a href=\"https:\/\/mixedbread.com\/papers\/small_colbert_report.pdf\">tech report<\/a>. 
We have attempted to make the tech report a true overview of <em><strong>&quot;how to train sane, near state-of-the-art retrievers in 2025&quot;<\/strong><\/em> and we would highly encourage you to read it if this is something of interest to you.<\/p>\n<p>If you just want an overview of what we did and HuggingFace links, however, you&#39;re in the right place.<\/p>\n<h2>Training Small ColBERTs: The Steps<\/h2>\n<p>Previous research on ColBERT models has indicated a pretty clear trend: all state-of-the-art ColBERTs are initialized from strong single-vector embedding models, which have undergone their own somewhat standardized multi-stage training process. This is likely due to a combination of reasons, ranging from <a href=\"https:\/\/www.mixedbread.com\/blog\/projection-variants\">MaxSim's learning constraints favouring already-strong representations over unaligned ones<\/a> to the lack of a standardized ColBERT pre-training recipe, among other potential culprits.<\/p>\n<h3>Training Dense Backbones<\/h3>\n<p>As such, before training our ColBERT models, we must first prepare suitable dense models to serve as backbones. As the purpose of this is to build good all-around baselines rather than to chase benchmarks, we opt to follow standardized methods and use widely used datasets with limited overfitting potential.<\/p>\n<p>The dense training process we opt for consists of three steps:<\/p>\n<ul>\n<li>First, we perform a large-scale <strong>weakly supervised contrastive pre-training<\/strong> on around 150 million training pairs. This step serves as a way to <strong>preheat<\/strong> the model&#39;s representations, shifting them from Ettin&#39;s original language modelling objective to a similarity objective. The data is not of high quality, but is large in volume, slowly shifting the embedding space in the right direction.<\/li>\n<li>Secondly, we perform <strong>supervised fine-tuning<\/strong>. 
This is the key step, where the model trained in the first stage is now exposed to retrieval queries and their matching documents, with positive documents annotated by a human. Following standard practice, we perform hard-negative mining, so as to provide the model with &quot;believable&quot; looking negative examples and teach it to distinguish near-matches from actually relevant documents.<\/li>\n<\/ul>\n<p>Our third step is, as of yet, less standard: Stella-style knowledge distillation. This step is the key component of the Stella retrieval models, which are well known in the information retrieval community as being very strong models for their size. Effectively, the aim here is to <strong>align the representations of our model with those of a much larger, better model<\/strong>. Curiosity got the better of us here: we are really big fans of the Stella models and wanted to explore this approach to distillation in depth.<\/p>\n<p>Again, we provide more information on this step in the <a href=\"https:\/\/mixedbread.com\/papers\/small_colbert_report.pdf\">tech report<\/a>, but broadly we adopted a simplified version of the Stella mixture of losses, inspired by MongoDB&#39;s recent report on LEAF-style distillation. We note that this step strongly improved the performance of our 32 million parameter model variant and resulted in a small-but-noticeable boost for the 17 million one.<\/p>\n<p>After this stage, here we are: we now have a viable backbone that is easy to produce using standardized methods!<\/p>\n<h3>Training a ColBERT model<\/h3>\n<p>We&#39;re now ready to move on to the next step: creating ColBERT models!<\/p>\n<h5>Ablations light the way<\/h5>\n<p>We decided to take this opportunity to also run many ablations, seeking to answer a few questions we still had about the underlying mechanisms of the standard training recipe. 
Namely, we were wondering if...:<\/p>\n<ol>\n<li>Muon is a good optimizer for late interaction models?<\/li>\n<li>Projection dimension matters, and if so, at what point performance begins to rapidly degrade?<\/li>\n<li>Qwen3-Reranker is a good teacher for KL-Div distillation over teacher scores?<\/li>\n<li>Our proposed <a href=\"https:\/\/www.mixedbread.com\/blog\/projection-variants\">improvements to ColBERT projection heads<\/a> improve models using state-of-the-art recipes, rather than more academic ones using a weaker base model?<\/li>\n<li>The backbone models that have undergone Stella-style distillation produce better ColBERT models?<\/li>\n<li>The use, or not, of casing has an impact?<\/li>\n<\/ol>\n<p>The answer to these questions and many more is, you guessed it, in the <a href=\"https:\/\/mixedbread.com\/papers\/small_colbert_report.pdf\">tech report<\/a>! But, as a sneak peek, the answer to 1 is yes.<\/p>\n<p>Building on our findings from the ablations, we then proceed to use the best settings we&#39;ve uncovered to train the final models, resulting in our final checkpoints: mxbai-edge-colbert-v0, at both the 17M and 32M parameter scales. 
For everything not otherwise ablated, we followed the standardized training method introduced in <a href=\"https:\/\/arxiv.org\/abs\/2407.20750\">JaColBERTv2.5<\/a>.<\/p>\n<h2>So, how do they fare?<\/h2>\n<p>Short answer: surprisingly well!<\/p>\n<p>For models which did not do anything out of the ordinary to seek SotA performance, and which have largely steered clear of any contamination data at the most important stages, our models reach robust performance across the board:<\/p>\n<h3><strong>Results on BEIR<\/strong><\/h3>\n<table>\n<thead>\n<tr>\n<th align=\"left\">Model<\/th>\n<th align=\"center\">AVG<\/th>\n<th align=\"center\">MS MARCO<\/th>\n<th align=\"center\">SciFact<\/th>\n<th align=\"center\">Touche<\/th>\n<th align=\"center\">FiQA<\/th>\n<th align=\"center\">TREC-COVID<\/th>\n<th align=\"center\">NQ<\/th>\n<th align=\"center\">DBPedia<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\"><strong>Large Models (&gt;100M)<\/strong><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">GTE-ModernColBERT-v1<\/td>\n<td align=\"center\"><strong>0.547<\/strong><\/td>\n<td align=\"center\">0.453<\/td>\n<td align=\"center\"><strong>0.763<\/strong><\/td>\n<td align=\"center\"><strong>0.312<\/strong><\/td>\n<td align=\"center\"><strong>0.453<\/strong><\/td>\n<td align=\"center\"><strong>0.836<\/strong><\/td>\n<td align=\"center\"><strong>0.618<\/strong><\/td>\n<td align=\"center\"><strong>0.480<\/strong><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">ColBERTv2<\/td>\n<td align=\"center\">0.488<\/td>\n<td align=\"center\"><strong>0.456<\/strong><\/td>\n<td align=\"center\">0.693<\/td>\n<td align=\"center\">0.263<\/td>\n<td align=\"center\">0.356<\/td>\n<td align=\"center\">0.733<\/td>\n<td align=\"center\">0.562<\/td>\n<td 
align=\"center\">0.446<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Medium Models (&lt;35M)<\/strong><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>mxbai-edge-colbert-v0-32m<\/strong><\/td>\n<td align=\"center\">0.521<\/td>\n<td align=\"center\"><strong>0.450<\/strong><\/td>\n<td align=\"center\"><strong>0.740<\/strong><\/td>\n<td align=\"center\"><strong>0.313<\/strong><\/td>\n<td align=\"center\">0.390<\/td>\n<td align=\"center\">0.775<\/td>\n<td align=\"center\"><strong>0.600<\/strong><\/td>\n<td align=\"center\">0.455<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">answerai-colbert-small-v1<\/td>\n<td align=\"center\"><strong>0.534<\/strong><\/td>\n<td align=\"center\">0.434<\/td>\n<td align=\"center\"><strong>0.740<\/strong><\/td>\n<td align=\"center\">0.250<\/td>\n<td align=\"center\"><strong>0.410<\/strong><\/td>\n<td align=\"center\"><strong>0.831<\/strong><\/td>\n<td align=\"center\">0.594<\/td>\n<td align=\"center\"><strong>0.464<\/strong><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">bge-small-en-v1.5<\/td>\n<td align=\"center\">0.517<\/td>\n<td align=\"center\">0.408<\/td>\n<td align=\"center\">0.713<\/td>\n<td align=\"center\">0.260<\/td>\n<td align=\"center\">0.403<\/td>\n<td align=\"center\">0.759<\/td>\n<td align=\"center\">0.502<\/td>\n<td align=\"center\">0.400<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">snowflake-s<\/td>\n<td align=\"center\">0.519<\/td>\n<td align=\"center\">0.402<\/td>\n<td align=\"center\">0.722<\/td>\n<td align=\"center\">0.235<\/td>\n<td align=\"center\">0.407<\/td>\n<td align=\"center\">0.801<\/td>\n<td align=\"center\">0.509<\/td>\n<td align=\"center\">0.410<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Small Models (&lt;25M)<\/strong><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td 
align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<td align=\"center\"><\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>mxbai-edge-colbert-v0-17m<\/strong><\/td>\n<td align=\"center\"><strong>0.490<\/strong><\/td>\n<td align=\"center\"><strong>0.416<\/strong><\/td>\n<td align=\"center\"><strong>0.719<\/strong><\/td>\n<td align=\"center\"><strong>0.316<\/strong><\/td>\n<td align=\"center\">0.326<\/td>\n<td align=\"center\"><strong>0.713<\/strong><\/td>\n<td align=\"center\"><strong>0.551<\/strong><\/td>\n<td align=\"center\"><strong>0.410<\/strong><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">colbert-muvera-micro<\/td>\n<td align=\"center\">0.394<\/td>\n<td align=\"center\">0.364<\/td>\n<td align=\"center\">0.662<\/td>\n<td align=\"center\">0.251<\/td>\n<td align=\"center\">0.254<\/td>\n<td align=\"center\">0.561<\/td>\n<td align=\"center\">0.386<\/td>\n<td align=\"center\">0.332<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">all-MiniLM-L6-v2<\/td>\n<td align=\"center\">0.419<\/td>\n<td align=\"center\">0.365<\/td>\n<td align=\"center\">0.645<\/td>\n<td align=\"center\">0.169<\/td>\n<td align=\"center\"><strong>0.369<\/strong><\/td>\n<td align=\"center\">0.472<\/td>\n<td align=\"center\">0.439<\/td>\n<td align=\"center\">0.323<\/td>\n<\/tr>\n<\/tbody><\/table>\n<hr>\n<h3><strong>Results on LongEmbed<\/strong><\/h3>\n<table>\n<thead>\n<tr>\n<th align=\"left\">Model<\/th>\n<th align=\"center\">AVG<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\"><strong>Large Models (&gt;100M)<\/strong><\/td>\n<td align=\"center\"><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">GTE-ModernColBERT-v1 (32k)<\/td>\n<td align=\"center\"><strong>0.898<\/strong><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">GTE-ModernColBERT-v1 (4k)<\/td>\n<td align=\"center\">0.809<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">granite-embedding-english-r2<\/td>\n<td align=\"center\">0.656<\/td>\n<\/tr>\n<tr>\n<td 
align=\"left\">ColBERTv2<\/td>\n<td align=\"center\">0.428<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Medium Models (&lt;50M)<\/strong><\/td>\n<td align=\"center\"><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">mxbai-edge-colbert-v0-32m (32k)<\/td>\n<td align=\"center\"><strong>0.849<\/strong><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">mxbai-edge-colbert-v0-32m (4k)<\/td>\n<td align=\"center\">0.783<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">granite-embedding-small-english-r2<\/td>\n<td align=\"center\">0.637<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">answerai-colbert-small-v1<\/td>\n<td align=\"center\">0.441<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">bge-small-en-v1.5<\/td>\n<td align=\"center\">0.312<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">snowflake-arctic-embed-s<\/td>\n<td align=\"center\">0.356<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Small Models (&lt;25M)<\/strong><\/td>\n<td align=\"center\"><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">mxbai-edge-colbert-v0-17m (32k)<\/td>\n<td align=\"center\"><strong>0.847<\/strong><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">mxbai-edge-colbert-v0-17m (4k)<\/td>\n<td align=\"center\">0.776<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">all-MiniLM-L6-v2<\/td>\n<td align=\"center\">0.298<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">colbert-muvera-micro<\/td>\n<td align=\"center\">0.405<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>Our 17 million parameter model in particular is a standout performer which we hope will be a very strong baseline for many experiments to come. Despite its incredibly low parameter count and a projection dimension of 48, just about one third of the standard 128, it comfortably outperforms ColBERTv2. 
And it does so while scaling exceptionally well across longer contexts: its performance on LongEmbed very comfortably exceeds the current &lt;1B parameter state-of-the-art single-vector retriever on the <a href=\"https:\/\/huggingface.co\/spaces\/mteb\/leaderboard\">LongEmbed leaderboard<\/a> by more than 19 NDCG@10 points.<\/p>\n<h3>Efficiency is the name of the game<\/h3>\n<p>Our models build upon the current wave of more efficient encoders, spearheaded by ModernBERT and carried on by subsequent models such as Ettin or ModernVBERT. As such, we designed them with efficiency in mind, attempting to minimize their computational requirements without degrading performance.<\/p>\n<p>On top of their low parameter counts and the architectural improvements inherent to the ModernBERT architecture, such as built-in unpadding and flash attention 2, we adopt <strong>very small final projection dimensions<\/strong> for our models, which makes them particularly memory-friendly:<\/p>\n<table>\n<thead>\n<tr>\n<th><strong>Model<\/strong><\/th>\n<th><strong>Params<\/strong><\/th>\n<th><strong>Dim.<\/strong><\/th>\n<th><strong>NDCG@10<\/strong><\/th>\n<th><strong>LoCo<\/strong><\/th>\n<th><strong>GPU<\/strong><\/th>\n<th><strong>CPU<\/strong><\/th>\n<th><strong>Mem. 
(MB)<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>ColBERTv2<\/strong><\/td>\n<td>130M<\/td>\n<td>128<\/td>\n<td>0.6198<\/td>\n<td>--<\/td>\n<td>81s<\/td>\n<td>1540s<\/td>\n<td>732<\/td>\n<\/tr>\n<tr>\n<td><strong>answerai-colbert-small-v1<\/strong><\/td>\n<td>33M<\/td>\n<td>96<\/td>\n<td><em><strong>0.6545<\/strong><\/em><\/td>\n<td>--<\/td>\n<td>59s<\/td>\n<td>621s<\/td>\n<td>549<\/td>\n<\/tr>\n<tr>\n<td><strong>colbert-muvera-micro<\/strong><\/td>\n<td>4M<\/td>\n<td>128<\/td>\n<td>0.5599<\/td>\n<td>--<\/td>\n<td><strong>45s<\/strong><\/td>\n<td><strong>88s<\/strong><\/td>\n<td>732<\/td>\n<\/tr>\n<tr>\n<td><strong>mxbai-edge-colbert-v0-17m<\/strong><\/td>\n<td>17M<\/td>\n<td>48<\/td>\n<td>0.6405<\/td>\n<td>\u2713<\/td>\n<td><em>51s<\/em><\/td>\n<td><em>487s<\/em><\/td>\n<td><em><strong>275<\/strong><\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>mxbai-edge-colbert-v0-32m<\/strong><\/td>\n<td>32M<\/td>\n<td>64<\/td>\n<td>0.6520<\/td>\n<td>\u2713<\/td>\n<td>55s<\/td>\n<td>589s<\/td>\n<td>366<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>We&#39;re particularly excited by the 17M variant&#39;s potential as an end-to-end retriever or reranker following a <a href=\"https:\/\/huggingface.co\/blog\/static-embeddings\">static retriever<\/a> for on-edge use cases, as it can embed dozens of documents in milliseconds on CPU with a remarkably low memory footprint.<\/p>\n<h2>What next<\/h2>\n<p>The models are already available on HuggingFace and supported in PyLate: <a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-edge-colbert-v0-17m\">mxbai-edge-colbert-v0-17m<\/a> and <a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-edge-colbert-v0-32m\">mxbai-edge-colbert-v0-32m<\/a>.<\/p>\n<p>With this release, we killed two birds with one stone, having released both the strongest existing edge retrieval model to date, mxbai-edge-colbert-v0, and a set of extremely strong baselines to support further experimentation.<\/p>\n<p>In the future, we intend to periodically update our 
edge-sized open source offerings to further disseminate our research findings in a bite-sized, anyone-can-use-it format.<\/p>\n<p>If this sounds like something you&#39;d like to contribute to, we are hiring across all technical positions! Take a look at them below and don&#39;t hesitate to apply if you feel like a fit for any of them:<\/p>\n<ul>\n<li>Research: <a href=\"https:\/\/www.mixedbread.com\/careers\/research_staff\">Research Staff<\/a>, and <a href=\"https:\/\/www.mixedbread.com\/careers\/research_intern\">Research Interns<\/a><\/li>\n<li>Software: <a href=\"https:\/\/www.mixedbread.com\/careers\/software_engineer\">Software Engineer<\/a>, <a href=\"https:\/\/www.mixedbread.com\/careers\/frontend_engineer\">Frontend Engineer<\/a> and <a href=\"https:\/\/www.mixedbread.com\/careers\/devops_engineer\">DevOps Engineer<\/a><\/li>\n<li>Product: <a href=\"https:\/\/www.mixedbread.com\/careers\/product_designer\">Product Designer<\/a><\/li>\n<\/ul>\n","url":"https:\/\/www.mixedbread.com\/blog\/edge-v0","title":"Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0","summary":"Introducing our new family of extremely efficient ColBERT models, to serve as backbones for modern late interaction research while outperforming ColBERTv2 with just 17 million parameters.","image":"https:\/\/www.mixedbread.com\/images\/blog\/edge-v0\/edge-v0.jpg","date_modified":"2025-10-16T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/projection-variants","content_html":"<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br\/>\n    ColBERT models are amazing retrievers, but a lot of their mechanisms are understudied. We show that MaxSim&#39;s unique constraints have strong implications for model training and focus on one that makes a single-layer projection suboptimal. 
We introduce several new projection head variants for ColBERT and achieve great performance through this simple modification. For more details, read our <a href=\"https:\/\/arxiv.org\/abs\/2510.12327\">arXiv pre-print<\/a> or try it out by training a model, with these new variants already supported in <a href=\"https:\/\/github.com\/lightonai\/pylate\">PyLate<\/a>.<\/p>\n<\/blockquote>\n<h2>Introduction<\/h2>\n<p>We all know the story by now: after ChatGPT, RAG became all the rage. With the rising interest in RAG, Information Retrieval as a field then saw an unprecedented level of attention. And with this wave of interest, the broad concept of semantic retrieval built on top of language models, which was previously an active-but-somewhat-niche subject, became a mainstream topic.<\/p>\n<p>In addition to single-vector retrieval, the most basic form of semantic matching where both queries and documents are represented as a single vector, many interesting research avenues enjoyed their well-deserved time in the spotlight: among them, sparse retrieval (SPLADE) and, of course, our preferred method: multi-vector retrieval, spearheaded by ColBERT.<\/p>\n<p>Without straining your attention raving about ColBERT for the millionth time, let&#39;s have a quick reminder: ColBERT-inspired models, a.k.a. late-interaction methods, a.k.a. multi-vector retrieval (yes, there are too many names), are a family of retrieval models which rely on the idea that fine-grained interactions are very important for retrieval. Because inference time is naturally a sensitive matter in search settings, these models make fine-grained interaction possible by representing documents as bags-of-vectors rather than as a single vector: each token in both queries and documents is represented by its own vector, which then goes through heavy compression to avoid ballooning storage costs. 
You may read more about how ColBERT models perform scoring in our <a href=\"https:\/\/www.mixedbread.com\/blog\/maxsim-cpu\">blog on MaxSim<\/a>, their key scoring operator.<\/p>\n<p>Recently, late interaction models have scored many wins. They&#39;ve been shown to be state-of-the-art on <a href=\"https:\/\/www.mixedbread.com\/blog\/mixedbread-search\">all<\/a> <a href=\"https:\/\/arxiv.org\/abs\/2503.19009\">kinds<\/a> of <a href=\"https:\/\/arxiv.org\/abs\/2407.01449\">multimodal<\/a> <a href=\"https:\/\/arxiv.org\/abs\/2510.01149\">retrieval<\/a> <a href=\"https:\/\/nga.demo.mixedbread.com\/\">situations<\/a>, far above any other retrieval paradigm, have showcased their ability to generalise from <a href=\"https:\/\/huggingface.co\/lightonai\/GTE-ModernColBERT-v1\">being trained on short, 300-token documents to retrieve long, 32000-token documents, far better than even models specifically trained for it<\/a>, and have demonstrated that they can <a href=\"https:\/\/huggingface.co\/lightonai\/Reason-ModernColBERT\">match models<\/a> with <a href=\"https:\/\/huggingface.co\/answerdotai\/answerai-colbert-small-v1\">orders of magnitude more parameters<\/a>.<\/p>\n<p>Now, I know what you&#39;re thinking. Is this just another Big Multi-Vector lobbying blog post? Are you writing this introduction just to tell us that ColBERT is all we need? Honestly, yes, but not only. I&#39;m writing this introduction to point out that despite going from empirical win to empirical win, we still <strong>understand very little of the mechanics that make late-interaction work<\/strong>. There are many factors to the performance of these models which have simply not yet been studied, and in this blog post we will focus on one of them to show how a simple solution can increase performance with virtually no trade-offs.<\/p>\n<h2>Learning Properties You Might Not Have Thought About<\/h2>\n<p>As mentioned above, multi-vector models perform scoring via the MaxSim operator. 
Without going too far into the details, it works through a simple process:<\/p>\n<ol>\n<li>For every token within a query, compute its cosine similarity with every token within a document.<\/li>\n<li>Discard all resulting similarity scores, except the highest for each query token.<\/li>\n<li>Sum these similarities: Voil\u00e0! That is your MaxSim score for this query-document pair.<\/li>\n<li>Repeat for every document you want to score.<\/li>\n<\/ol>\n<p>During real-world retrieval scenarios, this score is used to order documents from most to least relevant. During training, it&#39;s used for loss calculation, whether through some form of contrastive loss (maximizing the score of positive examples and minimizing that of negative ones) or knowledge distillation (attempting to match the teacher&#39;s score distribution).<\/p>\n<p>In the training setting, however, this operator results in an effect you might not necessarily have thought of: it zeroes out the vast majority of gradients, with gradients flowing <strong>only through tokens which have achieved the highest similarity with at least one query token<\/strong>. If you are training with a document length of 512 and your queries are just 8 tokens long, that means that, in the best case scenario, <strong>1.5625%<\/strong> of your document tokens (8 out of 512) will actually contribute to updating your model&#39;s weights (if you are lucky, that is, because it might well be the case that multiple query tokens are best matched with the same document token!).<\/p>\n<p>There is a lot to be said about this mechanism: could it be one of the reasons that ColBERT performs so well, because it allows hyper-specialization and reduces noise during training? 
Or is it actually a harmful signal bottleneck, just waiting to be alleviated to unleash an even more glorious era of late interaction upon us?<\/p>\n<p>The definite answer to this question is not to be found in this blog post, however, we do think that it means one thing...<\/p>\n<h2>Single-Layer Linear Projections Considered Harmful<\/h2>\n<p>And that thing is: the projection head of ColBERT models is likely suboptimal. If you aren&#39;t familiar with it, all existing multi-vector retrieval models end with a single layer, which is added to whatever backbone model is used (historically, BERT, and more recently, PaliGemma, Qwen, ModernBERT...) to downcast the model&#39;s final token representations to a much more manageable dimension, usually set to 128 by convention following the original ColBERT model.<\/p>\n<p>We think that this projection is suboptimal and that it&#39;s partially related to the effects of MaxSim on learning. To be perfectly honest with you, both the <a href=\"https:\/\/arxiv.org\/abs\/2510.12327\">paper<\/a> (and its dozen equations!) and this blog post are actually somewhat back-engineered to understand exactly <strong>why<\/strong> a simple projection isn&#39;t good enough after empirical results demonstrated it.<\/p>\n<p>The reason for this, we believe, is that ultimately a single-layer projection is just a mapping, without much learned in the process. And that is because a simple fact of the way MaxSim works and affects learning is that it rewards <strong>high peaks<\/strong>: you want particularly discriminative tokens to score very high in similarity to query tokens, as all other tokens will then not matter anymore.<\/p>\n<p>While this might be stating the obvious, the problem here is that <strong>tokens are different from one another<\/strong>. What I mean by that is that different token types will quite naturally have differing representations in the embedding space: entities, e.g. 
proper nouns, are likely to be pretty different from adjectives, which are going to be pretty different from highly-specialised medical vocabulary, and so on.<\/p>\n<p>But this is badly aligned with a single projection matrix! By definition, if you want to project from a big fixed dimension to a smaller fixed dimension and you must do so with a single learned mapping, then this mapping will have to &quot;distribute&quot; its weight budget in a way that covers all kinds of token types more or less equally well. In doing so, it weakens the representations&#39; abilities to achieve high peaks in the first place because some sharpness is lost in the projection process.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/projection-variants\/pooh_betterproj.jpeg\" alt=\"Projection variants meme illustration\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  If you remember only one thing from this session, let it be this framing.\n<\/div>\n\n<p>Do bear in mind that while I make it sound dramatic, as is customary when introducing a slightly better variant of something, in practice, a single-layer projection <strong>does<\/strong> retain a good level of representation quality, and parts of the potential harmfulness are alleviated by the backbone model&#39;s expressiveness. Nonetheless, the fact that <strong>good enough<\/strong> exists doesn&#39;t mean that there aren&#39;t simple ways to make things better.<\/p>\n<h2>Better Projections Are Possible<\/h2>\n<p>And it is indeed possible to make things better, in a rather straightforward way! Building on the limitations highlighted above, we propose a bunch of justifications (that you can, again, find in the paper) as to why we believe that additional projection depth could considerably improve sharpening, thanks to factorization better spreading the mapping, and for adding residual connections to better leverage the quality of the original representations. 
We also discuss the potential impact of gating (GLU) and non-linearities and show that their theoretical impact is unclear, as they seem to introduce both beneficial and harmful properties.<\/p>\n<p>We then set out to demonstrate how these properties hold up in practice: as always in deep learning, empirical results are king. As part of our experiments, we modified the excellent PyLate ColBERT training library to add a few knobs:<\/p>\n<ul>\n<li><strong>Projection Depth<\/strong>: How many layers the projection should use.<\/li>\n<li><strong>Projection Scale<\/strong>: Whether intermediate layers should be up-scaled (similarly to the Transformers&#39; feedforward upscaling) or not.<\/li>\n<li><strong>Activation Function<\/strong>: Which activation, if any, should be applied to the output of non-final layers.<\/li>\n<li><strong>GLU Gating<\/strong>: Whether to use GLU modules or traditional feedforward modules.<\/li>\n<li><strong>Skip-connections<\/strong>: Whether we should use residual connections for intermediate layers (note: in case of depth=2, we use ResNet-inspired residuals, where a second up-projection initialized with an identity matrix is used to upcast the residual)<\/li>\n<\/ul>\n<p><em>The above is a non-exhaustive list, and you may find more information in our <a href=\"https:\/\/arxiv.org\/abs\/2510.12327\">pre-print<\/a> if you care about the full experimental setting.<\/em><\/p>\n<p>With these knobs in place, we set out to train a few hundred models, with all other settings kept identical between runs. 
We trained each variant multiple times, with a few random seeds, in order to ensure significance, as it&#39;s previously been shown that it is <a href=\"https:\/\/aclanthology.org\/Q18-1018\/\">fairly common for new QA retrieval state-of-the-art-results to actually fall behind existing methods when properly controlling for multiple seeds<\/a>.<\/p>\n<p>If you will allow me, here is everyone&#39;s favourite money-shot, the LaTeX booktab table with best results in bold and the infamous significance dagger:<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/projection-variants\/paper-table.png\" alt=\"Paper results table: benchmarking projection variants. See text for discussion.\" title=\"Table 1: Main results from our projection variant ablation study\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  Table 1: Main results from our projection variant ablation study.\n<\/div>\n\n<p>What we find is that almost all projection variants in this table show significant gains over the existing projection method, across multiple datasets. Note that this table is <em>somewhat<\/em> cherry-picked, as it presents the best performing &quot;model families&quot;, employing both residuals and upscaling. 
Our broader results, however, show that the majority of projection variants do significantly improve retrieval performance, with the best ones reaching a rather comfortable 2-point NDCG@10 increase on average, representing an almost 4% relative improvement, at virtually no cost whatsoever except a few dozen thousand parameters.<\/p>\n<h2>Conclusion<\/h2>\n<p>Our results align with our original thought: while this, in itself, is <strong>not<\/strong> a revolutionary improvement, it is a consistent gain which highlights that there is likely much more low-hanging fruit to be discovered as we focus on better understanding the mechanisms of late-interaction.<\/p>\n<p>If doing so is something that is interesting to you, we are currently hiring across all positions:<\/p>\n<ul>\n<li>Research: <a href=\"https:\/\/www.mixedbread.com\/careers\/research_staff\">Research Staff<\/a>, and <a href=\"https:\/\/www.mixedbread.com\/careers\/research_intern\">Research Interns<\/a><\/li>\n<li>Software: <a href=\"https:\/\/www.mixedbread.com\/careers\/software_engineer\">Software Engineer<\/a>, <a href=\"https:\/\/www.mixedbread.com\/careers\/frontend_engineer\">Frontend Engineer<\/a> and <a href=\"https:\/\/www.mixedbread.com\/careers\/devops_engineer\">DevOps Engineer<\/a><\/li>\n<li>Product: <a href=\"https:\/\/www.mixedbread.com\/careers\/product_designer\">Product Designer<\/a><\/li>\n<\/ul>\n<p><em>This is the first post in our Fast Rising Science series: at Mixedbread, we conduct a lot of experiments in various aspects of Information Retrieval. We think it&#39;s a shame how much science, especially in this space, is conducted entirely behind closed doors, with the same results often being re-discovered multiple times within the same year. 
We intend to frequently release focused research into these small-but-important mechanisms in the future.<\/em><\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/projection-variants","title":"A Delicious Free Lunch: Better Projections Improve ColBERT","summary":"Discussing the unique learning constraints introduced by the MaxSim operator, and demonstrating that simple architecture improvements to accommodate for these limitations can increase performance in a free-lunch fashion.","image":"https:\/\/www.mixedbread.com\/images\/blog\/projection-variants\/projection-variants.webp","date_modified":"2025-10-15T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/mixedbread-search","content_html":"<p>Today, we are launching the public beta of Mixedbread Search, a search API that is built from the ground up for the AI era.<\/p>\n<p>Over the last year, we have been hard at work to understand what Search for the AI era would look like. It was clear to us that it should be able to <strong>find information, no matter how it&#39;s stored<\/strong>. <\/p>\n<p>The world is not powered by clean, well-formatted, neatly arranged text. <\/p>\n<p>Knowledge is found in messy PDFs. It&#39;s built upon that one image, stored among 15 million others. Crucial details have been known to live in voice memos. Sometimes, even, the answer you need is to be found in business-critical VB6 scripts which have not been modified in the last decade. <strong>And you should be able to find exactly the context you need to reach the right answer, no matter how deep it is buried.<\/strong><\/p>\n<p>Mixedbread Search encapsulates what search should be: fast, accurate, multilingual, multi-modal. 
It is designed with both human and AI users in mind, with the goal of removing all barriers between you and the information you need.<\/p>\n<p>It is a fully end-to-end solution, where all you need to bring is your data in the format it currently exists in. Upload it to our platform, and you are ready to begin rediscovering your own documents.<\/p>\n<p><a href=\"https:\/\/platform.mixedbread.com\">Get started for free<\/a><\/p>\n<h2>An end-to-end platform<\/h2>\n<p>Mixedbread Search is built on one simple idea: you shouldn&#39;t have to understand how search works to be able to enjoy good search. It is powered by our state-of-the-art research, distilling the very best methods to provide you with:<\/p>\n<ul>\n<li>Multi-Modal Search: Text, images, audio and video are all supported.<\/li>\n<li>Multi-Lingual Search: Because the world is a lot more than just English.<\/li>\n<li>Multi-Context Search: Our platform has been tested on a variety of contexts: finding code snippets and relevant documentation, retrieving audio samples with specific BPMs, looking up e-commerce products by visual descriptions, among many others...<\/li>\n<li>Low latency Search: Because you need results now, not in 5700 milliseconds.<\/li>\n<li>Meaningfully state-of-the-art Search: On realistic DeepResearch benchmarks, LLM assistants are able to reach significantly better response accuracy with Mixedbread Search over existing search systems.<\/li>\n<\/ul>\n<p>We have developed a simple platform, where every complicated step along the way is handled seamlessly. Your documents are made search-ready thanks to our processing pipeline, which includes state-of-the-art OCR and Document parsing. <\/p>\n<p>Just upload your documents and start searching!\nCheckout the quickstart to get started. 
Find out more in the <a href=\"https:\/\/www.mixedbread.com\/docs\/quickstart\">Docs<\/a>.<\/p>\n<h2>State-of-the-art search performance<\/h2>\n<p>We offer market-leading search performance thanks to the pipeline based on our in-house research. For example, Mixedbread Search reaches significantly higher accuracy while requiring fewer average requests on the BrowseComp-Plus deep search benchmark using Gemini 2.5 Flash, compared to other widely used semantic search options:<\/p>\n<p><strong>Required LLM calls<\/strong>\n<img src=\"https:\/\/mixedbread.com\/images\/blog\/mixedbread-search\/requests_needed.png\" alt=\"Comparison of required LLM calls with Mixedbread requiring 16% fewer calls than the next lowest model\" title=\"Required LLM calls\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  Required LLM calls with Mixedbread are 16% fewer than the next lowest model.\n<\/div>\n\n<p><strong>Search accuracy<\/strong>\n<img src=\"https:\/\/mixedbread.com\/images\/blog\/mixedbread-search\/benchmark.png\" alt=\"Comparison of search accuracy with Mixedbread having a 16% higher score than the next highest model\" title=\"Search accuracy\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  Search accuracy with Mixedbread is 16% higher than the next highest model.\n<\/div>\n\n<h2>Try it now<\/h2>\n<p>All you need to start using Mixedbread Search is an account on our platform:<\/p>\n<p><a href=\"https:\/\/platform.mixedbread.com\/sign-in\">Create an account<\/a><\/p>\n<h2>We want to hear from you<\/h2>\n<p>Mixedbread Search is entering beta today. This isn&#39;t an end, but a beginning: we expect to rapidly iterate on the platform as we gather more real-world feedback.<\/p>\n<p>Please get in touch if you encounter any issues, or if you have any ideas that would make your experience even better. 
We would love to hear from you, so please feel free to <a href=\"https:\/\/www.mixedbread.com\/contact\">reach out<\/a>!<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/mixedbread-search","title":"Introducing Mixedbread Search","summary":"Introducing Mixedbread Search, the first search API built from the ground up for the AI era, with both humans and AI in mind. Natively multi-modal and multi-lingual, Mixedbread Search eliminates all friction and lets you find information where it lives.","image":"https:\/\/www.mixedbread.com\/images\/blog\/mixedbread-search\/mixedbread-search.jpg","date_modified":"2025-10-01T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["product"]},{"id":"https:\/\/www.mixedbread.com\/blog\/research-vision","content_html":"<p>At Mixedbread, we are trying to strike a balance that feels increasingly hard to get <em>just right<\/em>: our goal is to be a <strong>research lab whose work directly feeds into a product<\/strong>, rather than a product company who happens to occasionally require research.<\/p>\n<p>This has led us to think a lot about why we do research, how we do it, and how we can strike the right balance between offering an attractive product to users while contributing to the broader research community.\nIn this blog post, we&#39;re sharing some of our early thoughts on our reasons for setting up Mixedbread as such, the way we approach research and decide what to work on, and how we intend to achieve this balance.<\/p>\n<h2>Why we do what we do<\/h2>\n<p>We fundamentally believe that information retrieval and capital-S Search are fundamental areas of research in the currently nascent AI era. 
There are two reasons for this belief.<\/p>\n<p>The first one is that neural retrieval, at its core, is about <strong>understanding how to shape and align the embedding space<\/strong> so that the relationship between representations of seemingly very different things, queries and documents, can be effectively captured.<\/p>\n<p>The second reason is that Search, broadly defined, is a foundational tool for any intelligent agent. This is not to say that we are RAG absolutists: knowledge can and should, as much as it is feasible, be baked into the LLMs themselves. <\/p>\n<p>But memorization, alone, is never going to solve problems that we commonly associate with intelligence. Even if it were not outright unreasonable, it would still be exceedingly inefficient to expect LLMs to memorize gigantic e-commerce catalogues that are changing on a daily basis, weekly financial reports, or even today&#39;s weather.<\/p>\n<p>The world, and the billions of knowledge bases it contains, is a rapidly evolving environment. I do not expect my accountant to remember every single invoice I have issued over the past 4 years. Nor do I expect them to know exactly what type of tax treaty is applicable to every single kind of international transaction. 
On the other hand, I do expect that he possesses tools that would make surfacing these pieces of information trivial, should it be needed.<\/p>\n<p>In fact, we agree with Nobel-prize winning psychologist Daniel Kahneman&#39;s definition of human intelligence as not only the ability to reason, but also the ability to find relevant material in memory and to deploy attention when needed, insofar that we believe information retrieval to be a cornerstone of intelligence in general.\nTo take the age-old example: geniuses of the past were not any less smart than 20th century physicists, but the very act of <strong>creating new knowledge<\/strong> requires the ability to access, search through and curate existing information.<\/p>\n<p>This is what <strong>Search<\/strong> is all about: providing an intelligent agent, human or artificial, with the ability to look up information when it is needed, building on the assumption that the agent will then know what to do with it. In our view, perfecting search is potentially one of the most important research north stars in the field, secondary only to AGI itself.<\/p>\n<p>Of course, there are many, many, many, many... many ways of representing information. And there are just as many ways of attempting to retrieve it. &quot;Old school&quot; retrieval has a lot of failure cases, but there are many promising ways to overcome these limitations as part of the endless quest for <em><strong>Perfect Search<\/strong><\/em>. Researching these, and understanding <strong>why<\/strong> and <strong>how<\/strong> they work is the aim of Mixedbread.<\/p>\n<h2>Taking the Research out of the lab<\/h2>\n<p>While the above sounds very aspirational, it is also very abstract. 
If the past few years have taught us anything, it&#39;s that research conducted in the lab, with outcomes optimized for the lab, and with the sole dissemination aim being an academic paper ultimately has <strong>low impact<\/strong>.<\/p>\n<p>Our approach to figuring out what is worth pursuing can be summed up in just 5 words: <strong>Impactful research should be useful.<\/strong><\/p>\n<p>Of course, not all research will yield a usable artefact: long-term work is important! And sometimes, the outcome of weeks of work will be the learning that X does not work. However, it is key to remember that <strong>individual research items<\/strong> should always be thought of as part of larger projects, at least fuzzily defined, towards building something that will be <strong>useful<\/strong>.<\/p>\n<h3>Useful Research<\/h3>\n<p>In practice, this way of thinking manifests itself in two ways: ensuring that we conduct research that somehow contributes tangibly to an <strong>end goal<\/strong>; and working towards things that are <strong>usable in the real world<\/strong>.<\/p>\n<h4>Pillar 1: Tangible End Goals<\/h4>\n<p>Whenever we come up with a new idea, we ask ourselves: &quot;Should we spend N weeks on this, will we get something useful out of this?&quot;.<\/p>\n<p><strong>Useful<\/strong>, here, is defined very broadly: will this project lead to an open-source artefact that will be useful to many people? Will it further our understanding in such a way that it will meaningfully improve future models? Will this improvement be a direct performance improvement, or will it facilitate our decision making (indeed, there are hundreds of knobs to turn when training retrievers, and their interactions are very poorly understood, so any insight into the underlying mechanisms <strong>is<\/strong> useful)?<\/p>\n<p>If the answer to any of the questions above is <strong>yes<\/strong>, then it is likely worth doing. 
If, on the other hand, we are unable to understand how doing this work will lead to tangible benefits for a reasonable number of people, then it&#39;s probably not going to make it to the priority list.<\/p>\n<p>The nice aspect of this way of thinking is that it is ultimately not very limiting: even &quot;moonshot&quot; projects can fall within it nicely, and we definitely have a few of those... On the other hand, it makes it considerably easier to stay focused: if we&#39;re unable to articulate why this would be useful beyond producing a paper, then we either need to think more about it, or it wasn&#39;t a great idea in the first place.<\/p>\n<h4>Pillar 2: Real-World Usability<\/h4>\n<p>The second underlying question that precedes everything we do is directly derived from the first one: <strong>is this going to be useful in the real-world<\/strong>? Or, in other words, is X <strong>realistically<\/strong> useful? <\/p>\n<p>It would be very easy, for example, to decide that performing reranking with a 1.7T param model is an acceptable research item. After all, with so many parameters, the metrics are going to look amazing. Maybe we could even fine-tune it and use it as a fully uncompressed multi-vector retrieval pipeline: imagine all the information you can squeeze into 16384 dimensions per token!<\/p>\n<p>This could potentially make a great paper, and it would have numbers in bold that would be extremely hard to beat. State-of-the-art would be achieved internally in one triumphant release.<\/p>\n<p>On the other hand, there would be serious concerns here: could this be served at a reasonable price? Could it be served at all without being a loss leader? 
Would the latency concerns be acceptable for all users, or even for any user at all?<\/p>\n<p>Realistically, the answer to those questions in this extreme example is no.\nThe much more interesting situation, however, is when the answer is &quot;Probably not as is, <strong>BUT<\/strong> it could be&quot;.<\/p>\n<p>Indeed, a lot of Search, in practice, is about <strong>efficiency engineering<\/strong>. Some of the most interesting papers in our field are about creative indexing methods, techniques that allow extreme compression (or quantization) without loss of performance, or other tricks that (almost magically) make things go much faster and require much less hardware without compromising quality.<\/p>\n<p>Hence, this is naturally something that is constantly at the back of our mind, and a lot of our day-to-day is dipping into this work at the intersection of engineering and research. The goal of our research, after all, is to be <strong>impactful<\/strong>. And to be impactful, something needs to run!<\/p>\n<h2>Striking a Balance<\/h2>\n<p>Finally, as mentioned in the introduction, a major concern of ours is <strong>how can a for-profit company produce meaningful research<\/strong>?<\/p>\n<p>There are many examples in industry. Obviously, you might be thinking of giants such as OpenAI and Anthropic, whose research very much directly feeds onto products. Or startups, such as HuggingFace and Prime Intellect, whose business model allows them to sustain pretty cool X users (who occasionally write papers and\/or libraries) while ensuring money is flowing in.<\/p>\n<p>Looking at other companies who have made it work, in various ways, it has become very clear to us that tradeoffs are inevitable, and that it takes trial and error as well as a domain-specific strategy to truly make this work. 
It&#39;s pretty clear to us that while we have a plan, it&#39;ll probably be flawed in many ways, so we expect to iterate on it as we move forward.<\/p>\n<p>Some companies have taken the approach of going full closed-source. While they conduct world-class research, most of it is only ever going to be accessible through their products, with none of the inner workings published. Others are taking the fully open research pathway, and seeking profitability in other ways.<\/p>\n<p>At Mixedbread, we want our research to directly feed into our product, and the feedback on our product to directly inform our research. This obviously means that we cannot, mechanically, be <strong>fully<\/strong> open: some things will remain closed source, especially as they are heavily embedded in our internal machinery.<\/p>\n<p>However, a lot of what we do <strong>will<\/strong> be open, and we&#39;re aiming to set up a hybrid approach. We believe that this can be sustained in various forms:<\/p>\n<ul>\n<li>Much of our research effort is focused on individual aspects of Search and Retrieval, to yield a greater understanding of certain components. In the past, we have shared <a href=\"https:\/\/arxiv.org\/abs\/2506.03487\">papers<\/a>, <a href=\"https:\/\/www.mixedbread.com\/blog\/mxbai-embed-large-v1\">technical reports<\/a> and <a href=\"https:\/\/www.mixedbread.com\/blog\/the-hidden-ceiling\">short \"findings\" blog posts<\/a>. 
We fully intend to continue to share these insights with the broader community, as we&#39;re convinced that there is no path to &quot;solved search&quot; without the broader research effort.<\/li>\n<li>We have begun releasing specialized tooling which can be used outside of our internal platform, such as <a href=\"https:\/\/www.mixedbread.com\/blog\/maxsim-cpu\">`maxsim-cpu`<\/a> or <a href=\"https:\/\/www.mixedbread.com\/blog\/dynamic-batching\">`batched`<\/a>, and intend to continue to do so in the future, once again as we believe such releases to be important to foster a more mature ecosystem.<\/li>\n<li>Open source models have long been part of Mixedbread&#39;s DNA. In fact, our embedding model represented almost <a href=\"https:\/\/huggingface.co\/spaces\/huggingface\/open-source-ai-year-in-review-2024\">3% of total HuggingFace model downloads last year<\/a>, and our <a href=\"https:\/\/www.mixedbread.com\/blog\/mxbai-rerank-v2\">state-of-the-art rerankers released earlier this year<\/a> were part of the first wave of research exploring lightweight LLMs as rerankers. In the future, we expect to sustain frequent open model releases, distilling the learnings from our private modelling work onto small, personal-device friendly models.<\/li>\n<\/ul>\n<h2>If this sounds good to you...<\/h2>\n<p>... How about joining us? We&#39;re currently hiring across all positions!<\/p>\n<p>If you&#39;re interested in multimodal Information Retrieval research, very broadly defined, we are looking for:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.mixedbread.com\/careers\/research_staff\">Research Staff<\/a>, at all levels of seniority. This is a mix of what you might see called <code>Research Scientist<\/code> and <code>Research Engineer<\/code> in other places, or more broadly, <code>Member of Technical Staff, Research<\/code>. 
This is an umbrella position for our research team, where the actual responsibilities will be tailored to your research interest and our ongoing objectives.<\/li>\n<li><a href=\"https:\/\/www.mixedbread.com\/careers\/research_intern\">Research Interns<\/a>, as part of our internship program. You will be matched with a specific, self-contained project. The goal of all of our internships is to help you build your understanding and your CV, and they&#39;re all designed with a publishable artefact (blog post, paper, model\/code release..., depending on the use case) in mind.<\/li>\n<\/ul>\n<p>If you&#39;re more interested in the engineering side of making search at scale work, we&#39;re also hiring across all engineering positions, for both <a href=\"https:\/\/www.mixedbread.com\/careers\/software_engineer\">Software Engineers<\/a> and <a href=\"https:\/\/www.mixedbread.com\/careers\/devops_engineer\">DevOps Engineers<\/a>.<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/research-vision","title":"Our Research Vision, Part 1","summary":"Introducing our research vision and our ultimate goal: Solving Search. 
In this blog post, we lay out our view on Mixedbread as a research lab, our approach for prioritizing our research and making it impactful for our journey toward the objective, as well as our thoughts on balancing open- and closed-source research.","image":"https:\/\/www.mixedbread.com\/images\/blog\/research-vision\/research-vision.webp","date_modified":"2025-09-24T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/maxsim-cpu","content_html":"<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br\/>\n    <a href=\"https:\/\/github.com\/mixedbread-ai\/maxsim-cpu\">maxsim-cpu<\/a> provides a highly-optimised implementation of the maxsim algorithm, greatly speeding up scoring for ColBERT-like models.<\/p>\n<\/blockquote>\n<p>In retrieval world, a lot of inference pipelines run on CPU: oftentimes because of cost optimization: CPU machines are pretty cheap and infinitely scalable. In a lot of cases, it&#39;s not even an optimization: if you&#39;re running a local retrieval pipeline, your laptop might not even have a GPU! In most cases, thankfully, CPUs can do a more-than-acceptable job for the typical retrieval workflow, where the only computations required are a single inference pass on a query followed by similarity calculations. <\/p>\n<p>However, the equation is a bit different when it comes to multi-vector retrieval, which powers late-interaction models such as <a href=\"https:\/\/arxiv.org\/abs\/2004.12832\">ColBERT<\/a> or <a href=\"https:\/\/arxiv.org\/abs\/2407.01449\">ColPali<\/a>. 
These models use the <strong>MaxSim<\/strong> operator, which requires considerably more computations than their single-vector counterparts (read on to know why!).<\/p>\n<p>While modern hardware thankfully makes this pretty fast, the additional computational cost adds up: running MaxSim on ~1000 documents with <code>PyTorch<\/code> on CPU can add between <strong>50 and 100 milliseconds<\/strong> to each query compared to running it on GPU: while this may seem rather small at first glance, <strong>inefficiencies add up<\/strong>, and spending so much time on distance calculation can make it unviable for many latency-sensitive environments.<\/p>\n<p>But it doesn&#39;t have to be this way! Obviously, current CPUs will never be as fast as GPUs for this sort of computation, but <strong>they don&#39;t have to be so slow<\/strong>! By taking advantage of architecture-specific instructions and low-level scientific computing libraries like <code>libxsmm<\/code>, we built <strong>maxsim-cpu<\/strong>: a small Python package, written in Rust, which cuts down the aforementioned overhead to just <strong>5 milliseconds<\/strong>.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/maxsim-cpu\/overhead.png\" alt=\"JITting better than Numba.\" title=\"MaxSim CPU overhead comparison\"><\/p>\n<p>Want to just jump in and try it out? Check out our <a href=\"https:\/\/github.com\/mixedbread-ai\/maxsim-cpu\">GitHub repo<\/a> or install the library directly:<\/p>\n<pre><code class=\"language-bash\">[uv] pip install maxsim-cpu\n<\/code><\/pre>\n<p>Want to understand a bit more about why it&#39;s useful? Read on!<\/p>\n<h2>What even is MaxSim?<\/h2>\n<p>MaxSim, or for <strong>Max<\/strong>imum <strong>Sim<\/strong>ilarity is the core element of current late-interaction models <a href=\"https:\/\/arxiv.org\/abs\/2004.12832\">ColBERT<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2407.01449\">ColPali<\/a>.... 
Its mechanism is simple: rather than performing a single cosine similarity computation between a vector representing the whole query and vectors representing entire documents, it computes <strong>token-level<\/strong> similarities. <\/p>\n<p>For each candidate document, MaxSim iterates through every token within the query, and compares its similarity to <strong>every token within the document<\/strong>, before keeping the <strong>maximum value<\/strong> for each query token (hence the <code>Max<\/code>) and summing them up to produce a document-level score.<\/p>\n<h3>Orders of Magnitudes Matter<\/h3>\n<p>This approach has many benefits: it allows for capturing semantic relationships that larger-grain methods would miss. But it is inefficient: for each query, it requires thousands of similarity calculation. For a simple example: given 1000 candidate documents, each containing 300 tokens, and a 32-token long query , a &quot;traditional&quot; single-vector query search query would perform 1000 cosine similarity calculation: one for each document against the query. Using MaxSim, we require <code>n_query_token * n_docs * n_token_per_doc<\/code> distances, or <code>32 * 1000 * 300<\/code> = <strong>9 600 000<\/strong> (yes, that&#39;s 9.6 <strong>millions<\/strong>) distance calculations.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/maxsim-cpu\/maxsim_visualised.png\" alt=\"A picture is worth a thousand words, MaxSim is worth a thousand matmuls\"><\/p>\n<p>How is it viable, then? Well, thankfully, cosine similarity calcualtions are very computationally cheap. In fact, with normalized vectors (which everyone uses), it&#39;s a simple matrix multiplication. As you might know if you&#39;ve ever looked at the math behind deep learning, matrix multiplications are the main computational operation that power all models, and GPUs are <strong>very, very good<\/strong> at running them quickly. 
This, compounded by the fact that individual vectors generated by ColBERT models are pretty small, means that the computational cost is pretty low: in fact, it&#39;s less than half the compute required to run a forward pass through <strong>a single layer of BERT-base<\/strong>, something that any GPU released in the last decade can do in a handful of milliseconds.<\/p>\n<h2>The Problem<\/h2>\n<p><strong>So, there&#39;s no problem then?<\/strong> Well, not quite. GPUs <strong>love<\/strong> matrix multiplications and parallel computation, that&#39;s what they&#39;re built for: thousands of really weak cores. Their trustworthy cousin, the humble CPU, is not quite as big a fan of this sort of workload.<\/p>\n<p>This results in the situation we mentioned in the introduction: computing MaxSim on CPU can quite often be a significant source of latency in retrieval systems. While the number of FLOPS required to perform these computations is relatively low, from the CPU&#39;s point of view, they&#39;re the most evil FLOPS there are: <strong>a lot<\/strong> of very small parallel computations. What CPUs enjoy is the complete opposite: they love performing <strong>big, more demanding computations<\/strong> that don&#39;t require so many tiny steps.<\/p>\n<p>But CPU machines are very, very cheap and widely available. So, many pipelines end up just having to take that latency hit, or figure out workarounds so there are fewer documents to score with MaxSim.<\/p>\n<h2>The (Partial) Solution: maxsim-cpu<\/h2>\n<blockquote>\n<p>But surely, CPUs can&#39;t be that slow? My scoring machine has a whole 48 cores, so it should be able to run scoring in less than 60ms!?<\/p>\n<\/blockquote>\n<p>Thank you for your question, convenient rhetorical question-asker. Indeed, no, it doesn&#39;t have to be so slow!
<\/p>\n<h3>It always comes down to optimisation tradeoffs<\/h3>\n<p>A big reason why the overhead is so large is that MaxSim-style computations, that is, a lot of multiplications over very small matrices with very low vector dimensions, are quite simply fairly uncommon, and very few major libraries actively seek to optimise for them.<\/p>\n<p>There <strong>is<\/strong> a good ecosystem of libraries to speed up CPU computations: ONNX, recent improvements in both PyTorch (better Intel MKL handling...) and JAX (XLA being pretty good at optimising CPU compute nowadays), but they&#39;re all (rightfully) much more interested in speeding up the kind of computations you need to run models.<\/p>\n<p>There is actually an entirely separate ecosystem of matrix multiplication libraries which use what are, for the purpose of this blog, essentially magic tricks to speed up MaxSim-style operations: they&#39;re spearheaded by the very clearly named <a href=\"https:\/\/github.com\/libxsmm\/libxsmm\">libxsmm<\/a> (SMM standing for Small Matrix Multiplication).<\/p>\n<p>In <code>maxsim-cpu<\/code>, we built on top of libxsmm, and added some additional optimisations to speed things up further, such as fused operations to avoid having to load all the individual distances into memory (which is comparatively very slow, <a href=\"https:\/\/x.com\/BenjDicken\/status\/1847310000735330344\">as this visualisation shows<\/a>), a separate code path to speed things up considerably when handling variable-length documents, a dedicated code path that leverages Apple Silicon-specific optimisations rather than libxsmm, etc.<\/p>\n<h3>Gotta go fast<\/h3>\n<p>The result is the <code>maxsim-cpu<\/code> package, written in Rust and exposed as a Python library. 
In our quickly run tests, we observe a considerable speed-up over other Python packages, while remaining extremely low-dependency (all you need is NumPy and maxsim-cpu itself):<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/maxsim-cpu\/cpu.png\" alt=\"CPU performance comparison\"><\/p>\n<p>Even on Mac, with far fewer cores and no <code>libxsmm<\/code> to build on top of, we&#39;ve observed noticeable speedups, especially with variable batch sizes, as our approach allows us to do away with the dreadfully slow and wasteful Python-based padding operations (note that you could probably optimise those for other libs to also be faster, but it&#39;s an annoying bit of engineering to design custom routing logic):<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/maxsim-cpu\/mac.png\" alt=\"Apple M4 Max performance comparison\"><\/p>\n<h3>Using maxsim-cpu<\/h3>\n<p>The moment you&#39;ve all been waiting for (or the moment you scrolled to if you&#39;re already familiar with MaxSim): how to get your hands on <code>maxsim-cpu<\/code>? Well, it&#39;s quite simple. 
Below are simple instructions, and you may find more detailed ones as well as the full source code in the <a href=\"https:\/\/github.com\/mixedbread-ai\/maxsim-cpu\">GitHub Repository<\/a>.<\/p>\n<h4>Installation<\/h4>\n<p>On Linux machines with x86 processors that support AVX2 instructions (a lot of fancy words to say &quot;Linux machines with a CPU released in the last decade&quot;) and Macs with Apple Silicon, you can install it directly from PyPI:<\/p>\n<pre><code class=\"language-bash\">[uv] pip install maxsim-cpu\n<\/code><\/pre>\n<p>We do not currently support any other hardware nor do we have plans to, but contributions in this direction (adding AVX512-specific code paths to go even faster or supporting Windows) are welcome and the PRs will be reviewed.<\/p>\n<p>For more detailed installation instructions, including building from source if you&#39;d like to modify the library, head to <a href=\"https:\/\/github.com\/mixedbread-ai\/maxsim-cpu\">GitHub<\/a>.<\/p>\n<h4>Usage<\/h4>\n<p>The library exposes two functions: <code>maxsim_cpu.maxsim_scores<\/code> and <code>maxsim_cpu.maxsim_scores_variable<\/code>, which you should route to depending on the nature of your input: <code>maxsim_scores<\/code> expects documents to all be the same length while <code>maxsim_scores_variable<\/code> allows variable-length inputs. In all cases, each function expects a single query and its set of candidate documents. 
Usage is as follows:<\/p>\n<pre><code class=\"language-python\">import numpy as np\nimport maxsim_cpu\n\n# Prepare normalized embeddings\nquery = np.random.randn(32, 128).astype(np.float32)  # [num_query_tokens, dim]\n\n# NOTE: maxsim-cpu expects normalized vectors.\nquery \/= np.linalg.norm(query, axis=1, keepdims=True)\n\ndocs = np.random.randn(1000, 512, 128).astype(np.float32)  # [num_docs, doc_len, dim]\n# Normalize document embeddings...\ndocs \/= np.linalg.norm(docs, axis=2, keepdims=True)\n\n# Compute MaxSim scores\nscores = maxsim_cpu.maxsim_scores(query, docs)  # Returns [num_docs] scores\n<\/code><\/pre>\n<p>Swapping in <code>maxsim_scores_variable<\/code> is straightforward:<\/p>\n<pre><code class=\"language-python\">import numpy as np\nimport maxsim_cpu\n\n# Prepare normalized embeddings\nquery = np.random.randn(32, 128).astype(np.float32)  # [num_query_tokens, dim]\n\n# NOTE: maxsim-cpu expects normalized vectors.\nquery \/= np.linalg.norm(query, axis=1, keepdims=True)\n\n# Create variable-length documents as a list\ndocs = [\n    np.random.randn(np.random.randint(50, 800), 128).astype(np.float32)  # Variable length docs\n    for _ in range(1000)\n]\n# Normalize each document in the list\ndocs = [doc \/ np.linalg.norm(doc, axis=1, keepdims=True) for doc in docs]\n\n# Compute MaxSim scores\nscores = maxsim_cpu.maxsim_scores_variable(query, docs)  # Returns [num_docs] scores\n<\/code><\/pre>\n<p>And that&#39;s pretty much it, that&#39;s all you need to know to use <code>maxsim-cpu<\/code>!<\/p>\n<h2>Conclusion<\/h2>\n<p>This blog post briefly introduces the MaxSim operator as well as our new package, <code>maxsim-cpu<\/code>. 
\nIt is a standalone library, meant to do one thing and do it well, and is part of our efforts to open source any individual component we feel might be useful to more than just ourselves, as we previously did with <a href=\"https:\/\/www.mixedbread.com\/blog\/dynamic-batching\">batched<\/a> and <a href=\"https:\/\/www.mixedbread.com\/blog\/intro-baguetter\">baguetter<\/a>. We hope it&#39;ll be useful to anyone who cares about MaxSim, and that it might even inspire more people to write more optimised versions of commonly used algorithms: search is more relevant than ever, but every small component can still be improved in so many ways.<\/p>\n<p>If figuring out how to create these improvements sounds like something you&#39;re interested in, we are <a href=\"https:\/\/mxbai.notion.site\/job-board?pvs=74\">currently hiring across all technical positions<\/a>, don&#39;t be shy!<\/p>\n<h3>Citation<\/h3>\n<pre><code class=\"language-bibtex\">@online{maxsimcpu2025mxbai,\n  title={{maxsim-cpu}: {M}aximising {M}axsim {E}fficiency},\n  author={Benjamin Clavi\u00e9 and Sean Lee},\n  year={2025},\n  url={https:\/\/www.mixedbread.com\/blog\/maxsim-cpu},\n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/maxsim-cpu","title":"maxsim-cpu: Maximising Maxsim Efficiency","summary":"Introducing maxsim-cpu, a much faster way to compute the late interaction's MaxSim operator on modern CPU hardware, optimised for both x86 and Mac ARM.","image":"https:\/\/www.mixedbread.com\/images\/blog\/maxsim-cpu\/maxsim_cpu.jpg","date_modified":"2025-07-15T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research","engineering"]},{"id":"https:\/\/www.mixedbread.com\/blog\/the-hidden-ceiling","content_html":"<p>Retrieval-Augmented Generation (RAG) has become the default way to connect Large Language Models (LLMs) with enterprise data. 
However, there&#39;s a critical flaw in this approach that&#39;s rarely discussed: nearly all production RAG pipelines rely on Optical Character Recognition (OCR) to process PDFs, scans, presentations, and other documents, with the silent assumption that the extracted text is &quot;good enough&quot; for downstream AI tasks.<\/p>\n<p>Our comprehensive analysis shows that this assumption is fundamentally flawed. OCR quality creates an invisible ceiling that limits the performance of even the most advanced RAG systems. The gap between what&#39;s possible with perfect text extraction and what&#39;s achieved with current OCR technology represents one of the most significant yet overlooked challenges in enterprise AI today.<\/p>\n<blockquote>\n<p><strong>Note<\/strong><\/p>\n<p><em><strong>TLDR:<\/strong><\/em><\/p>\n<ul>\n<li><strong>OCR creates an invisible performance ceiling.<\/strong> Text extraction errors significantly limit both retrieval accuracy and generation quality in RAG systems.<\/li>\n<li><strong>Benchmarks reveal a substantial gap.<\/strong> Even leading OCR solutions fall <strong>~4.5%<\/strong> short (NDCG@5) of ground-truth text performance, particularly with complex document layouts.<\/li>\n<li><strong>Vision-only generation is not ready yet.<\/strong> Despite rapid improvements, multimodal models still cannot reliably generate precise answers directly from multiple document images.<\/li>\n<li><strong>Multimodal retrieval beats perfect text.<\/strong> Our vector stores outperform even <em>perfect text<\/em> by <strong>~12%<\/strong> on retrieval accuracy (NDCG@5) and recover <strong>70%<\/strong> of generation quality lost to OCR errors, while simultaneously simplifying architecture and enhancing future compatibility.<\/li>\n<\/ul>\n<\/blockquote>\n<h2>Why OCR is still critical for AI systems<\/h2>\n<p>Most enterprise knowledge is locked in unstructured formats like PDFs, scanned documents, invoices, presentations, images, 
and a plethora of other formats. Before a Large Language Model (LLM) can reason over this knowledge, it needs to be converted from its original visual or semi-structured format into plain text.<\/p>\n<p>This text conversion step, typically handled by OCR engines, is crucial because it feeds two core components of a RAG system:<\/p>\n<ol>\n<li><strong>The Retrieval System:<\/strong> Most retrieval systems depend on extracted text as their main search input. When OCR quality is poor, it produces inaccurate or &quot;corrupted&quot; text representations of your documents. This results in flawed text representations, making it difficult or impossible for the retrieval system to locate the relevant documents when a user asks a question. If the text doesn&#39;t accurately reflect the content, the search fails before it even begins.<\/li>\n<li><strong>The Generation Model (LLM):<\/strong> LLMs generate answers based <em>only<\/em> on the context they are given. If the retrieved document snippets contain OCR errors (missing words, jumbled tables, incorrect numbers), the LLM receives flawed information. This directly leads to incomplete, nonsensical, or factually incorrect answers, even if the retrieval system managed to find the <em>correct<\/em> document pages.<\/li>\n<\/ol>\n<p>In short, errors introduced by OCR don&#39;t just stay in the text; they cascade through the entire RAG pipeline, impacting both the ability to <em>find<\/em> information and the ability to <em>generate accurate answers<\/em> from it.<\/p>\n<h2>Putting OCR to the Test: Our Benchmark Setup<\/h2>\n<p>To quantify this &quot;OCR ceiling&quot; and understand its real-world impact, we needed a robust way to measure performance across diverse and challenging documents. 
We conducted extensive testing using the <strong><a href=\"https:\/\/arxiv.org\/abs\/2412.02592\">OHR (OCR hinders RAG) Benchmark v2<\/a><\/strong>.<\/p>\n<p>This benchmark is specifically designed to evaluate how OCR performance affects RAG tasks and includes:<\/p>\n<ul>\n<li><strong>Diverse &amp; Challenging Documents:<\/strong> <strong>8,500+<\/strong> PDF pages across seven enterprise domains (Textbooks, Law, Finance, Newspapers, Manuals, Academic Papers, Administrative) featuring complex layouts, tables, formulas, charts, diagrams, and non-standard reading orders that are known to challenge OCR systems.<\/li>\n<li><strong>Targeted Questions:<\/strong> <strong>8,498<\/strong> question-answer pairs specifically designed to test retrieval and understanding of information related to these OCR challenges. Each answer is grounded in specific evidence pages within the documents.<\/li>\n<li><strong>Verified Ground Truth:<\/strong> Human-verified, perfect text extraction and curated answers provide a reliable &quot;gold standard&quot; for comparison.<\/li>\n<\/ul>\n<p>Against this benchmark, we evaluated a range of OCR and retrieval approaches:<\/p>\n<details>\n<summary>Click to see the tested OCR & Retrieval Solutions<\/summary>\n\n<ul>\n<li><strong><a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/models\">Gemini 2.5 Flash<\/a>:<\/strong> A frontier closed-source multimodal model capable of OCR.<\/li>\n<li><strong><a href=\"https:\/\/github.com\/opendatalab\/MinerU\">MinerU<\/a>:<\/strong> A popular open-source library implementing state-of-the-art OCR methods from academic literature.<\/li>\n<li><strong><a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/document-intelligence\/overview?view=doc-intel-4.0.0\">Azure Document Intelligence<\/a>:<\/strong> A widely used commercial OCR solution in the industry.<\/li>\n<li><strong><a href=\"https:\/\/github.com\/QwenLM\/Qwen-VL\">Qwen-2.5-VL<\/a>:<\/strong> A frontier open-source multimodal model capable 
of OCR.<\/li>\n<li><strong><a href=\"https:\/\/github.com\/Unstructured-IO\/unstructured\">Unstructured<\/a>:<\/strong> A popular open-source library with broad adoption for document parsing.<\/li>\n<li><strong><a href=\"https:\/\/mixedbread.com\/docs\/stores\/overview\">Mixedbread Vector Store<\/a>:<\/strong> Our core offering, using native multimodal retrieval (treating pages as <em>images<\/em>, not just text) powered by our internal multimodal models (<code>mxbai-omni-v0.1<\/code>). It bypasses traditional reliance on OCR for retrieval.<\/li>\n<\/ul>\n<\/details>\n<p>This comprehensive setup allowed us to isolate the impact of different OCR qualities and compare text-based approaches directly against our multimodal retrieval system.<\/p>\n<h2>Testing Retrieval: Setup and Results<\/h2>\n<p>First, we focused on retrieval - the task of finding the <em>right<\/em> information within the vast document set. If your RAG system can&#39;t surface the correct documents, the LLM has no chance of answering the user&#39;s query accurately.<\/p>\n<h3>Retrieval Setup<\/h3>\n<p>We transformed the OHR benchmark&#39;s question-answer pairs into a retrieval task: the question became the query, and the associated evidence pages were the target documents to retrieve.<\/p>\n<p>For the text-based OCR methods, we used <strong><a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\">BM25<\/a><\/strong>, a standard and robust keyword-based ranking algorithm commonly used in search engines. (We tested embedding-based retrieval and rerankers too, but found they often degraded performance on this benchmark compared to the strong BM25 baseline, likely due to OCR noise corrupting the embeddings. 
You can find more details <a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1zBGOIOCzZZjw1HXBGGI8BzNx_kYj34LlYaFteZTU7Bg\/edit?usp=sharing\">here<\/a>.)<\/p>\n<p>For the Mixedbread Vector Store, we leveraged our multimodal embedding model (<code>mxbai-omni-v0.1<\/code>), which directly processes <strong>screenshots<\/strong> of the document pages. This approach is inherently resilient to OCR errors because it &quot;sees&quot; the page layout, structure, and visual elements alongside the text.<\/p>\n<p>We measured retrieval performance using two standard metrics:<\/p>\n<details>\n<summary>Click for Retrieval Metric Definitions<\/summary>\n\n<ul>\n<li><strong>NDCG@5 (Normalized Discounted Cumulative Gain @ 5):<\/strong> This metric evaluates the quality of the top 5 retrieved documents. It cares not only <em>if<\/em> the correct documents are found but also <em>how highly ranked<\/em> they are. Relevant documents ranked higher get more points. We chose K=5 because research shows LLMs are heavily influenced by the order of documents in their context window, with earlier documents having more impact.<\/li>\n<li><strong>Recall@5:<\/strong> This metric measures whether at least one of the <em>correct<\/em> evidence pages was retrieved within the top 5 results. 
It tells us if the necessary information was surfaced at all, regardless of its exact ranking.<\/li>\n<\/ul>\n<\/details>\n<h3>Retrieval Results: The OCR Ceiling is Real<\/h3>\n<p>Our retrieval benchmarks revealed stark differences between traditional OCR-dependent methods and our multimodal approach.<\/p>\n<hr>\n<p><strong>NDCG@5 Performance<\/strong> <span class=\"text-xs font-normal\">(Average across all 7 document domains)<\/span><\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/the-hidden-ceiling\/retrieval-ndcg.webp\" alt=\"NDCG@5 Performance\" title=\"NDCG@5 Performance\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  This chart shows NDCG@5 scores for each retrieval method, averaged across seven document domains. NDCG@5 measures both the presence and ranking of relevant documents in the top 5\u2014higher values mean more accurate retrieval, with extra weight for top-ranked relevant pages.\n<\/div>\n\n<details>\n<summary>Click for Full NDCG@5 Results Table by Domain<\/summary>\n\n<table>\n<thead>\n<tr>\n<th align=\"left\">Domain<\/th>\n<th align=\"left\">Gemini 2.5 Flash<\/th>\n<th align=\"left\">MinerU<\/th>\n<th align=\"left\">Mixedbread OCR<\/th>\n<th align=\"left\">Qwen-2.5-VL<\/th>\n<th align=\"left\">Azure<\/th>\n<th align=\"left\">Unstructured<\/th>\n<th align=\"left\"><strong>Mixedbread Vector Store<\/strong><\/th>\n<th align=\"left\">Ground Truth OCR<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\">academic<\/td>\n<td align=\"left\">0.805<\/td>\n<td align=\"left\">0.786<\/td>\n<td align=\"left\">0.795<\/td>\n<td align=\"left\">0.822<\/td>\n<td align=\"left\">0.797<\/td>\n<td align=\"left\">0.693<\/td>\n<td align=\"left\"><strong>0.923<\/strong><\/td>\n<td align=\"left\">0.845<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">administration<\/td>\n<td align=\"left\">0.861<\/td>\n<td align=\"left\">0.776<\/td>\n<td align=\"left\">0.842<\/td>\n<td align=\"left\">0.853<\/td>\n<td align=\"left\">0.854<\/td>\n<td 
align=\"left\">0.672<\/td>\n<td align=\"left\"><strong>0.920<\/strong><\/td>\n<td align=\"left\">0.895<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">finance<\/td>\n<td align=\"left\">0.656<\/td>\n<td align=\"left\">0.576<\/td>\n<td align=\"left\">0.636<\/td>\n<td align=\"left\">0.666<\/td>\n<td align=\"left\">0.664<\/td>\n<td align=\"left\">0.517<\/td>\n<td align=\"left\"><strong>0.773<\/strong><\/td>\n<td align=\"left\">0.722<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">law<\/td>\n<td align=\"left\">0.876<\/td>\n<td align=\"left\">0.829<\/td>\n<td align=\"left\">0.871<\/td>\n<td align=\"left\">0.873<\/td>\n<td align=\"left\">0.889<\/td>\n<td align=\"left\">0.724<\/td>\n<td align=\"left\"><strong>0.913<\/strong><\/td>\n<td align=\"left\">0.897<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">manual<\/td>\n<td align=\"left\">0.800<\/td>\n<td align=\"left\">0.756<\/td>\n<td align=\"left\">0.820<\/td>\n<td align=\"left\">0.834<\/td>\n<td align=\"left\">0.828<\/td>\n<td align=\"left\">0.721<\/td>\n<td align=\"left\"><strong>0.923<\/strong><\/td>\n<td align=\"left\">0.861<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">news<\/td>\n<td align=\"left\">0.442<\/td>\n<td align=\"left\">0.438<\/td>\n<td align=\"left\">0.454<\/td>\n<td align=\"left\">0.415<\/td>\n<td align=\"left\">0.460<\/td>\n<td align=\"left\">0.111<\/td>\n<td align=\"left\"><strong>0.686<\/strong><\/td>\n<td align=\"left\">0.467<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">textbook<\/td>\n<td align=\"left\">0.624<\/td>\n<td align=\"left\">0.572<\/td>\n<td align=\"left\">0.673<\/td>\n<td align=\"left\">0.698<\/td>\n<td align=\"left\">0.671<\/td>\n<td align=\"left\">0.159<\/td>\n<td align=\"left\"><strong>0.915<\/strong><\/td>\n<td align=\"left\">0.720<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>avg<\/strong><\/td>\n<td align=\"left\"><strong>0.723<\/strong><\/td>\n<td align=\"left\"><strong>0.676<\/strong><\/td>\n<td align=\"left\"><strong>0.727<\/strong><\/td>\n<td align=\"left\"><strong>0.737<\/strong><\/td>\n<td 
align=\"left\"><strong>0.738<\/strong><\/td>\n<td align=\"left\"><strong>0.514<\/strong><\/td>\n<td align=\"left\"><strong>0.865<\/strong><\/td>\n<td align=\"left\"><strong>0.773<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<\/details>\n\n<hr>\n<p><strong>Recall@5 Performance<\/strong> <em>(Average across all 7 document domains)<\/em><\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/the-hidden-ceiling\/retrieval-recall.webp\" alt=\"Recall@5 Performance\" title=\"Recall@5 Performance\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  This chart shows Recall@5 for each method, averaged across domains. Recall@5 is the percentage of questions where at least one correct evidence page appeared in the top 5\u2014higher is better.\n<\/div>\n\n<details>\n<summary>Click for Full Recall@5 Results Table by Domain<\/summary>\n\n<table>\n<thead>\n<tr>\n<th align=\"left\">Domain<\/th>\n<th align=\"left\">Gemini 2.5 Flash<\/th>\n<th align=\"left\">MinerU<\/th>\n<th align=\"left\">Mixedbread OCR<\/th>\n<th align=\"left\">Qwen-2.5-VL<\/th>\n<th align=\"left\">Azure<\/th>\n<th align=\"left\">Unstructured<\/th>\n<th align=\"left\"><strong>Mixedbread Vector Store<\/strong><\/th>\n<th align=\"left\">Ground Truth OCR<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\">academic<\/td>\n<td align=\"left\">0.902<\/td>\n<td align=\"left\">0.885<\/td>\n<td align=\"left\">0.896<\/td>\n<td align=\"left\">0.911<\/td>\n<td align=\"left\">0.902<\/td>\n<td align=\"left\">0.789<\/td>\n<td align=\"left\"><strong>0.982<\/strong><\/td>\n<td align=\"left\">0.937<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">administration<\/td>\n<td align=\"left\">0.930<\/td>\n<td align=\"left\">0.857<\/td>\n<td align=\"left\">0.920<\/td>\n<td align=\"left\">0.930<\/td>\n<td align=\"left\">0.931<\/td>\n<td align=\"left\">0.735<\/td>\n<td align=\"left\"><strong>0.967<\/strong><\/td>\n<td align=\"left\">0.959<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">finance<\/td>\n<td 
align=\"left\">0.778<\/td>\n<td align=\"left\">0.677<\/td>\n<td align=\"left\">0.760<\/td>\n<td align=\"left\">0.781<\/td>\n<td align=\"left\">0.783<\/td>\n<td align=\"left\">0.625<\/td>\n<td align=\"left\"><strong>0.883<\/strong><\/td>\n<td align=\"left\">0.836<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">law<\/td>\n<td align=\"left\">0.933<\/td>\n<td align=\"left\">0.890<\/td>\n<td align=\"left\">0.929<\/td>\n<td align=\"left\">0.932<\/td>\n<td align=\"left\">0.948<\/td>\n<td align=\"left\">0.775<\/td>\n<td align=\"left\"><strong>0.968<\/strong><\/td>\n<td align=\"left\">0.951<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">manual<\/td>\n<td align=\"left\">0.874<\/td>\n<td align=\"left\">0.844<\/td>\n<td align=\"left\">0.904<\/td>\n<td align=\"left\">0.912<\/td>\n<td align=\"left\">0.915<\/td>\n<td align=\"left\">0.802<\/td>\n<td align=\"left\"><strong>0.971<\/strong><\/td>\n<td align=\"left\">0.932<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">news<\/td>\n<td align=\"left\">0.479<\/td>\n<td align=\"left\">0.468<\/td>\n<td align=\"left\">0.489<\/td>\n<td align=\"left\">0.458<\/td>\n<td align=\"left\">0.493<\/td>\n<td align=\"left\">0.115<\/td>\n<td align=\"left\"><strong>0.767<\/strong><\/td>\n<td align=\"left\">0.499<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">textbook<\/td>\n<td align=\"left\">0.644<\/td>\n<td align=\"left\">0.600<\/td>\n<td align=\"left\">0.700<\/td>\n<td align=\"left\">0.728<\/td>\n<td align=\"left\">0.702<\/td>\n<td align=\"left\">0.168<\/td>\n<td align=\"left\"><strong>0.936<\/strong><\/td>\n<td align=\"left\">0.746<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>avg<\/strong><\/td>\n<td align=\"left\"><strong>0.791<\/strong><\/td>\n<td align=\"left\"><strong>0.746<\/strong><\/td>\n<td align=\"left\"><strong>0.800<\/strong><\/td>\n<td align=\"left\"><strong>0.807<\/strong><\/td>\n<td align=\"left\"><strong>0.811<\/strong><\/td>\n<td align=\"left\"><strong>0.573<\/strong><\/td>\n<td align=\"left\"><strong>0.925<\/strong><\/td>\n<td 
align=\"left\"><strong>0.837<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<\/details>\n\n<hr>\n<p>These results reveal several critical insights:<\/p>\n<ol>\n<li><strong>OCR Creates a Performance Ceiling:<\/strong> Every single OCR solution tested underperformed compared to the Ground Truth benchmark using perfect text. The best OCR methods plateaued around 0.74 average NDCG@5, a <strong>~4.5%<\/strong> relative gap (roughly 0.035 absolute) below the Ground Truth&#39;s 0.773. This confirms that OCR errors inherently limit retrieval effectiveness.<\/li>\n<li><strong>Complexity Magnifies OCR Issues:<\/strong> The performance gap widens for documents with complex layouts (finance, textbooks, news). These domains often feature tables, formulas, multi-column text, etc., that challenge OCR.<\/li>\n<li><strong>Multimodal Excels by Seeing the Whole Picture:<\/strong> Mixedbread Vector Store consistently outperformed <em>all<\/em> other methods, including the perfect text Ground Truth benchmark. Its average NDCG@5 of <strong>0.865<\/strong> is nearly <strong>12% higher<\/strong> than Ground Truth text because it understands the <em>visual context<\/em> (layout, tables, charts) directly from the image, providing richer relevance cues.<\/li>\n<\/ol>\n<p>The Recall@5 increases from 0.84 using Ground Truth text to 0.92 using the Mixedbread Vector Store. Let&#39;s put this in perspective:<\/p>\n<ul>\n<li>With Ground Truth (perfect OCR): Recall@5 = 84% \u2192 for 84 out of every 100 questions, at least one correct evidence page is retrieved in the top 5.<\/li>\n<li>With Mixedbread Vector Store: Recall@5 = 92% \u2192 for 92 out of every 100 questions, at least one correct evidence page makes it into the top 5.<\/li>\n<\/ul>\n<p>This 8-percentage-point absolute improvement (or ~9.5% relative improvement) in recall represents a substantial gain in retrieval performance. These retrieval benchmarks quantify the hidden ceiling imposed by relying solely on OCR. 
While better OCR helps, the results strongly indicate that a multimodal approach represents a fundamental leap forward.<\/p>\n<h2>Testing Generation: Setup and Results<\/h2>\n<p>Okay, so multimodal retrieval finds better documents, overcoming the OCR ceiling. <em>But does this improved retrieval actually translate into more accurate final answers from the LLM?<\/em> To find out, we tested the end-to-end RAG performance.<\/p>\n<h3>Generation Setup<\/h3>\n<p>We set up four scenarios, feeding the top 5 retrieved documents from each into the same powerful LLM (<strong><code>gemini-2.5-flash-preview-04-17<\/code><\/strong>) for answer generation:<\/p>\n<ol>\n<li><strong>Perfect OCR &amp; Perfect Retrieval (Ground Truth):<\/strong> Using the human-verified text for generation and the true evidence pages as an input (&#39;Perfect Retrieval&#39;). This represents the theoretical maximum performance achievable with the current models if they had the correct context and perfect extraction.<\/li>\n<li><strong>Perfect OCR &amp; Retrieval:<\/strong> Using the human-verified text both for BM25 retrieval of the top 5 passages and as the generation context. This is the quality you would get with current technology if your OCR were perfect.<\/li>\n<li><strong>Mixedbread OCR (Text-Based RAG):<\/strong> Using text extracted by our high-quality OCR engine both for BM25 retrieval of the top 5 passages and as the generation context. This mirrors a standard, good-quality text-only RAG pipeline.<\/li>\n<li><strong>Mixedbread Vector Store (Multimodal Retrieval):<\/strong> Using our multimodal model to retrieve the top 5 <em>page images<\/em>, but then using the corresponding clean text extracted by Mixedbread OCR as the generation context. This isolates the benefit of <em>visual retrieval<\/em> while keeping the generation input modality (text) consistent.<\/li>\n<\/ol>\n<p>To measure success, we focused on the <strong>Correct Answers<\/strong> rate. 
We used <strong>GPT-4.1<\/strong> as an impartial judge, providing it with the original question, the ground truth answer, the ground truth <em>evidence text<\/em>, and the answer generated by <code>gemini-2.5-flash-preview-04-17<\/code> in each scenario. The final score is simply the number of correct answers divided by the total number of questions.<\/p>\n<h3>Generation Results: Better Retrieval = Better Answers<\/h3>\n<p>The generation tests confirmed our hypothesis: superior retrieval leads directly to more accurate answers.<\/p>\n<p><strong>Correct Answers Rate<\/strong><\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/the-hidden-ceiling\/generation-accuracy.webp\" alt=\"generation-accuracy\" title=\"Generation Accuracy\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n    This chart shows the percentage of correct answers from each generation method, averaged across 7 domains and judged by GPT-4.1. Higher values mean the LLM produced more accurate, ground-truth answers.\n<\/div>\n\n<details>\n<summary>Click for Full Correct Answers Table by Domain<\/summary>\n\n<table>\n<thead>\n<tr>\n<th>Domain<\/th>\n<th>Mixedbread OCR (ret &amp; gen)<\/th>\n<th>Perfect OCR + ret.<\/th>\n<th>Mixedbread Vector Store (ret) + Mixedbread OCR (gen)<\/th>\n<th>Perfect OCR + Perfect 
ret.<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>academic<\/td>\n<td>0.711<\/td>\n<td>0.797<\/td>\n<td><strong>0.876<\/strong><\/td>\n<td>0.904<\/td>\n<\/tr>\n<tr>\n<td>administration<\/td>\n<td>0.714<\/td>\n<td>0.812<\/td>\n<td><strong>0.846<\/strong><\/td>\n<td>0.896<\/td>\n<\/tr>\n<tr>\n<td>finance<\/td>\n<td>0.618<\/td>\n<td>0.686<\/td>\n<td><strong>0.742<\/strong><\/td>\n<td>0.877<\/td>\n<\/tr>\n<tr>\n<td>law<\/td>\n<td>0.866<\/td>\n<td>0.898<\/td>\n<td><strong>0.909<\/strong><\/td>\n<td>0.950<\/td>\n<\/tr>\n<tr>\n<td>manual<\/td>\n<td>0.782<\/td>\n<td>0.825<\/td>\n<td><strong>0.888<\/strong><\/td>\n<td>0.914<\/td>\n<\/tr>\n<tr>\n<td>news<\/td>\n<td>0.435<\/td>\n<td>0.447<\/td>\n<td><strong>0.753<\/strong><\/td>\n<td>0.951<\/td>\n<\/tr>\n<tr>\n<td>textbook<\/td>\n<td>0.607<\/td>\n<td>0.715<\/td>\n<td><strong>0.885<\/strong><\/td>\n<td>0.896<\/td>\n<\/tr>\n<tr>\n<td><strong>avg<\/strong><\/td>\n<td>0.676<\/td>\n<td>0.740<\/td>\n<td><strong>0.843<\/strong><\/td>\n<td>0.912<\/td>\n<\/tr>\n<\/tbody><\/table>\n<\/details>\n\n<p>Key takeaways from the generation tests:<\/p>\n<ol>\n<li><strong>OCR Flaws Amplify During Generation:<\/strong> Relying on standard OCR for both retrieval and generation resulted in a <strong>25.8% decrease<\/strong> in correct answers compared to using perfect text (0.677 vs 0.913). Flawed input context significantly degrades the LLM&#39;s ability to generate accurate answers.<\/li>\n<li><strong>Better Retrieval Dramatically Boosts Correct Answers:<\/strong> Simply swapping standard OCR-based <em>retrieval<\/em> for Mixedbread Vector Store&#39;s <em>multimodal retrieval<\/em> \u2013 while still using the <em>same potentially imperfect OCR text<\/em> for generation \u2013 caused the average correct answer rate to jump massively from 0.677 to <strong>0.843<\/strong>. 
This single change <strong>recovered 70%<\/strong> of the accuracy lost due to the limitations of a standard OCR-based pipeline.<\/li>\n<li><strong>Finding the Right Pages is Paramount:<\/strong> The <em>quality of retrieval<\/em> is often more critical than <em>perfect text<\/em> in the generation context. Getting the <em>correct<\/em> documents into the LLM&#39;s view, even with minor OCR imperfections, is far more beneficial than feeding the LLM slightly cleaner text from the <em>wrong<\/em> documents.<\/li>\n<\/ol>\n<p>These generation benchmarks confirm that state-of-the-art multimodal <em>retrieval<\/em> can mitigate a large portion of the negative downstream effects of OCR errors.<\/p>\n<h2>Direct Image Generation: Is Vision-Only RAG Ready?<\/h2>\n<p>Given the success of using visual information for retrieval, a natural question arises: can we skip OCR entirely, even for the generation step? What if we feed the <em>images<\/em> of the retrieved pages directly to a powerful multimodal LLM like <strong>Gemini 2.5 Flash<\/strong> and ask it to generate the answer by &quot;reading&quot; the images? We tested this &quot;Direct Image Understanding&quot; approach:<\/p>\n<p><strong>Correct Answers Rate<\/strong> <em>(Average across 3 document domains)<\/em><\/p>\n<table>\n<thead>\n<tr>\n<th align=\"left\">Retrieval Method<\/th>\n<th align=\"left\">Generation Input<\/th>\n<th align=\"left\">Avg. Correct Answers<\/th>\n<th align=\"left\">Performance vs. 
Perfect OCR<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\">Perfect OCR (Ground Truth)<\/td>\n<td align=\"left\">Perfect OCR Text<\/td>\n<td align=\"left\"><strong>0.899<\/strong><\/td>\n<td align=\"left\">\u00b10.0% (Baseline)<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Mixedbread Vector Store<\/strong><\/td>\n<td align=\"left\"><strong>Mixedbread OCR Text<\/strong><\/td>\n<td align=\"left\"><strong>0.869<\/strong><\/td>\n<td align=\"left\"><strong>-3.3%<\/strong><\/td>\n<\/tr>\n<tr>\n<td align=\"left\">Mixedbread OCR<\/td>\n<td align=\"left\">Mixedbread OCR Text<\/td>\n<td align=\"left\">0.678<\/td>\n<td align=\"left\">-24.6%<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">Mixedbread Vector Store<\/td>\n<td align=\"left\">Direct Image Input<\/td>\n<td align=\"left\">0.627<\/td>\n<td align=\"left\">-30.3%<\/td>\n<\/tr>\n<\/tbody><\/table>\n<details>\n<summary>Click for Full Direct Image Input Comparison Table<\/summary>\n\n<table>\n<thead>\n<tr>\n<th align=\"left\">Domain<\/th>\n<th align=\"left\">Mixedbread OCR (ret. &amp; gen.)<\/th>\n<th align=\"left\"><strong>Mixedbread Vector Store (ret.) + Mixedbread OCR (gen.)<\/strong><\/th>\n<th align=\"left\">Mixedbread Vector Store (ret.) 
+ Direct Image Input (gen.)<\/th>\n<th align=\"left\">Perfect OCR + Retrieval<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td align=\"left\">academic<\/td>\n<td align=\"left\">0.712<\/td>\n<td align=\"left\"><strong>0.876<\/strong><\/td>\n<td align=\"left\">0.534<\/td>\n<td align=\"left\">0.904<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">administration<\/td>\n<td align=\"left\">0.715<\/td>\n<td align=\"left\"><strong>0.846<\/strong><\/td>\n<td align=\"left\">0.672<\/td>\n<td align=\"left\">0.896<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">textbook<\/td>\n<td align=\"left\">0.607<\/td>\n<td align=\"left\"><strong>0.885<\/strong><\/td>\n<td align=\"left\">0.675<\/td>\n<td align=\"left\">0.896<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>avg<\/strong><\/td>\n<td align=\"left\"><strong>0.678<\/strong><\/td>\n<td align=\"left\"><strong>0.869<\/strong><\/td>\n<td align=\"left\"><strong>0.627<\/strong><\/td>\n<td align=\"left\"><strong>0.899<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<\/details>\n\n<p>The results were surprising:<\/p>\n<ul>\n<li><strong>Direct Image Input Lags Significantly:<\/strong> Feeding page images directly to the LLM for generation yielded the <em>lowest<\/em> average correct answers (0.627).<\/li>\n<li><strong>Visual Retrieval vs. 
Visual Generation:<\/strong> Multimodal models excel at using visual cues for <em>retrieval<\/em>, but current models still struggle with fine-grained <em>extraction<\/em> directly from pixels <em>across multiple documents<\/em> during generation, compared to working with pre-processed text.<\/li>\n<li><strong>Quality OCR Text Still Best for Generation (For Now):<\/strong> Providing clean, explicit text to the LLM currently leads to the most accurate answers.<\/li>\n<\/ul>\n<p><strong>In essence: While fully visual RAG is an exciting possibility, today&#39;s reality is that combining the strengths of multimodal retrieval with high-quality OCR text for generation provides the best overall performance.<\/strong><\/p>\n<h2>Illustrative Examples: Where Standard OCR Falters<\/h2>\n<p>To make the impact of OCR limitations more concrete, let&#39;s examine a few specific scenarios from our benchmark data. These examples highlight common situations where traditional OCR-based systems can struggle and demonstrate how a multimodal approach to retrieval can lead to more accurate document interpretation.<\/p>\n<details>\n<summary>Click for Full Illustrative Examples<\/summary>\n\n<h3>Example 1: The Challenge of Handwritten Data in Regulatory Filings<\/h3>\n<p><strong>The Scenario:<\/strong> Regulatory filings, such as a telecommunications company&#39;s PUCO annual report, frequently combine structured typed content with critical handwritten financial figures. 
This mixture presents a significant <strong>OCR challenge<\/strong>, as traditional systems often fail to accurately recognize handwritten entries, leading to potential compliance and analysis issues.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/the-hidden-ceiling\/example-1.png\" alt=\"Handwritten Form Snippet\" title=\"Handwritten Form Snippet\"><\/p>\n<p><strong>Typical OCR Output &amp; Its Limitations:<\/strong>\nWhen processed by a standard OCR engine, the crucial handwritten financial data is often missed entirely or garbled:<\/p>\n<blockquote>\n<pre><code>Annual Report of TSC Communications, Inc. Year Ended December 31, 2026\n\nInstructions:\nSchedule 2 is used for PUCO annual assessment purposes pursuant to Section 4905.10, RC...\n\nSTATEMENT OF INTRASTATE GROSS EARNINGS (REVENUE)\n                                                       Amount\nLine                                                   Ohio\nNo.        Item                                         Intrastate\n1    Operating and Miscellaneous Revenue - Wholesale    [???????]\n     Cellular Communications, Radio Common Carrier...    \n2    Other Revenue, Dividend and Interest Income...      
[???????]\n3    SUBTOTAL                  (1) + (2)                [???????]\n4    Earnings or receipts from sales to other public    (          )\n     utilities for resale\n5    TOTAL                     (3) + (4)                [???????]\n\n     [???????]\n     [???????]\n     [???????]\n<\/code><\/pre>\n<\/blockquote>\n<p><strong>Impact on RAG Systems:<\/strong>\nConsequently, if a query such as, <em>&quot;What is the total revenue of TSC Communications?&quot;<\/em> is posed, a RAG system relying on this flawed OCR output would likely respond: <em>&quot;Unable to determine revenue figures from the available document.&quot;<\/em> This necessitates manual data review, delaying important reporting and analytical tasks.<\/p>\n<p><strong>The Multimodal Approach:<\/strong>\nIn contrast, the multimodal system processes both the structured form and the handwritten financial figures by analyzing the document&#39;s visual layout and handwriting patterns. This holistic understanding allows it to correctly identify the total revenue as <strong>$2,775,060<\/strong>, along with component values ($2,325,472 for operating revenue and $449,588 for other revenue). This capability enables accurate, automated responses regarding the company&#39;s financial standing and regulatory obligations.<\/p>\n<hr>\n<h3>Example 2: Deciphering Visual Trends in Financial Charts<\/h3>\n<p><strong>The Scenario:<\/strong> Quarterly investment reports often feature charts, like stacked area charts showing portfolio allocation, to convey critical trends. 
The <strong>OCR challenge<\/strong> here is that traditional OCR primarily extracts textual elements (titles, labels) but fails to capture the actual visual data representing the trends themselves.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/the-hidden-ceiling\/example-2.png\" alt=\"Chart Snippet\" title=\"Chart Snippet\"><\/p>\n<p><strong>Typical OCR Output &amp; Its Limitations:<\/strong>\nA standard OCR tool might only extract the labels and title, leaving out the core data:<\/p>\n<blockquote>\n<pre><code>Portfolio Allocation Trends (Q1 2023 - Q4 2024)\nPercentage (%)\n100\n75\n50\n25\n0\nQ1 2023, Q2 2023, Q3 2023, Q4 2023, Q1 2024, Q2 2024, Q3 2024, Q4 2024\nCash, Commodities,Real Estate,Fixed Income, Equities\n<\/code><\/pre>\n<\/blockquote>\n<p><strong>Impact on RAG Systems:<\/strong>\nWhen a client asks, <em>&quot;How has my equity exposure changed over the past year?&quot;<\/em>, a RAG system using this limited OCR output might provide only generic information about portfolio components. It would completely miss the crucial visual trend, such as a 13 percentage point increase in equity exposure, which is essential for understanding investment risk.<\/p>\n<p><strong>The Multimodal Approach:<\/strong>\nThe multimodal system, by directly analyzing the chart visually, recognizes both the allocation percentages at each time point and the overall trend patterns. This allows it to accurately respond: <em>&quot;Your equity allocation has increased significantly from 45% to 58% over the past year, representing the largest shift in your portfolio composition.&quot;<\/em> The system can even extract specific quarterly changes to illustrate the gradual increase.<\/p>\n<hr>\n<h3>Example 3: Navigating Complex Financial Tables<\/h3>\n<p><strong>The Scenario:<\/strong> Financial reports frequently contain multi-column tables detailing revenue breakdowns and operating expenses. 
The <strong>OCR challenge<\/strong> with such complex table structures lies in maintaining correct column and row alignment; failures here can lead to financial figures being associated with incorrect business units or categories.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/the-hidden-ceiling\/example-3.png\" alt=\"Table Snippet\" title=\"Table Snippet\"><\/p>\n<p><strong>Typical OCR Output &amp; Its Limitations:<\/strong>\nEven if text is extracted, subtle misalignments or parsing errors by the OCR can corrupt the table&#39;s structure:<\/p>\n<blockquote>\n<pre><code>Operating Expenses\n                                          Year Ended\n                          Jan 26, 2025     Jan 28, 2024        $           %\n                                                              Change       Change\n                                           ($ in millions)\nResearch and development expenses    $        12,914    $         8,675    $    4,239        49 %\n% of net revenue                                9.9 %              14.2 %\nSales, general and administrative expenses      3,491              2,654         837         32 %\n% of net revenue                                2.7 %               4.4 %\n  Total operating expenses           $        16,405    $        11,329    $   5,076         45 %\n% of net revenue                               12.6 %              18.6 %\n<\/code><\/pre>\n<\/blockquote>\n<p><strong>Impact on RAG Systems:<\/strong>\nIf a financial analyst asks, <em>&quot;What percentage of revenue did R&amp;D represent in 2025 compared to 2024?&quot;<\/em>, a RAG system relying on poorly structured OCR output might misinterpret the relationships between figures. 
An erroneous response could be: <em>&quot;R&amp;D was 49% of revenue in 2025 compared to 8,675% in 2024.&quot;<\/em> Such nonsensical answers arise from the system&#39;s inability to correctly understand the visual and semantic structure of the table.<\/p>\n<p><strong>The Multimodal Approach:<\/strong>\nThe multimodal system analyzes the visual structure of the table, correctly understanding the complex alignments and relationships between headers, dollar amounts, and percentage figures. This enables an accurate response: <em>&quot;R&amp;D expenses represented 9.9% of net revenue in 2025, down from 14.2% in 2024, despite a 49% increase in absolute R&amp;D spending.&quot;<\/em> The system properly interprets both the spatial layout and the semantic connections within the financial data.<\/p>\n<\/details>\n\n<h2>The Mixedbread Vector Store Approach: Functionality and Implications<\/h2>\n<p>The Vector Store is designed to address the observed limitations of OCR-dependent RAG systems. Its architecture is centered on leveraging multimodal information for retrieval through our <code>mxbai-omni-v0.1<\/code> model. This model directly analyzes and creates embeddings from the visual content of page screenshots, videos, and other multimodal data, enabling an understanding of layout, structure, tables, and charts in their original context. As shown in our benchmarks, this improved retrieval accuracy (NDCG@5) by approximately 12% compared to even perfect text extraction.<\/p>\n<p>Concurrently with visual analysis, documents are processed by our OCR engine. The extracted text is stored and made available alongside the visual embeddings. 
This dual-modality approach offers distinct advantages for RAG pipelines:<\/p>\n<ul>\n<li><strong>Better Retrieval:<\/strong> Visual analysis helps locate the most relevant documents, particularly in cases where text-only search might falter due to OCR errors or the nature of the content (e.g., charts, complex tables).<\/li>\n<li><strong>Optimized Generation Context:<\/strong> High-quality OCRed text remains available, which is beneficial for current Large Language Models that primarily operate on textual input for generation.<\/li>\n<li><strong>Integrated Document Processing:<\/strong> The system handles both visual embedding and text extraction automatically, so users don&#39;t need to build or maintain separate embedding and OCR steps when ingesting and preparing data for RAG.<\/li>\n<li><strong>Adaptability for Future LLMs:<\/strong> By storing both visual representations and text, systems are better prepared for future advancements in multimodal LLMs that might directly leverage richer image data for generation.<\/li>\n<\/ul>\n<p>This integrated system design aims to improve overall RAG performance, as evidenced by the benchmarked retrieval gains and the recovery of 70% of generation accuracy typically diminished by OCR issues in conventional pipelines, all within a unified framework.<\/p>\n<h2>Conclusion: Navigating the OCR Bottleneck with Multimodal Retrieval<\/h2>\n<p>The benchmark results presented indicate that Optical Character Recognition quality can be a significant limiting factor for RAG system performance, particularly with complex, real-world documents. Errors and omissions in text extraction can restrict both the ability to accurately retrieve relevant information and the quality of the final answers generated by an LLM.<\/p>\n<p>An approach incorporating multimodal analysis for retrieval, such as that employed by the Mixedbread Vector Store, addresses some of these limitations. 
By directly interpreting visual information from page images, this method improved retrieval accuracy by approximately 12% (NDCG@5) compared to even perfect text extraction in our tests. This enhancement in retrieval subsequently contributed to recovering 70% of the generation accuracy that was otherwise diminished by OCR errors in more conventional pipelines.<\/p>\n<p>While current Large Language Models generally perform optimally with high-quality text for the generation phase, the strong retrieval performance of multimodal systems highlights a path towards more robust document understanding. An integrated system that provides both visually-driven retrieval and high-quality OCR text offers a practical solution for current application needs. Furthermore, it establishes a foundation for adapting to future advancements in LLMs that may more directly leverage rich image data for generation tasks.<\/p>\n<p>The findings suggest that for applications involving diverse and structurally complex documents, incorporating multimodal understanding into the retrieval process is a key consideration for improving the accuracy and reliability of RAG systems.<\/p>\n<hr>\n<p><strong>Join our <a href=\"https:\/\/mixedbread.com\/urls\/discord\">Discord community<\/a> to share feedback, ask questions, and connect with other developers and researchers working on the cutting edge of AI!<\/strong><\/p>\n<hr>\n<h3>Citation<\/h3>\n<pre><code class=\"language-bibtex\">@online{ocrrag2025mixedbread,\n  title={The Hidden Ceiling: How OCR Quality Limits RAG Performance},\n  author={Mixedbread AI Team},\n  year={2025},\n  url={https:\/\/www.mixedbread.com\/blog\/the-hidden-ceiling},\n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/the-hidden-ceiling","title":"The Hidden Ceiling: How OCR Quality Limits RAG Performance","summary":"Benchmarking shows OCR errors cap text-based RAG: top OCR still misses 4-5% NDCG@5, while Mixedbread's multimodal vector store beats perfect text by 12% 
and recovers 70% of lost answer accuracy.","image":"https:\/\/www.mixedbread.com\/images\/blog\/the-hidden-ceiling\/intro-the-hidden-ceiling.jpg","date_modified":"2025-05-14T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/mxbai-rerank-v2","content_html":"<p>Today, we&#39;re releasing our second generation of <strong>state-of-the-art reranking models<\/strong>\u2014the <strong>mxbai\u2011rerank\u2011v2<\/strong> family. Licensed under <strong>Apache\u00a02.0<\/strong>, they\u2019re as open as ever but now trained with <strong>reinforcement-learning<\/strong> for extra crispiness!<\/p>\n<p>Read on to learn more about our approach, how it holds up against competitors, training process, and benchmarks. If you want to dive in immediately, you can access the models on Hugging Face:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-base-v2\">mixedbread-ai\/mxbai-rerank-base-v2<\/a> (0.5B) \u2013 the best balance of size and performance.<\/li>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-large-v2\">mixedbread-ai\/mxbai-rerank-large-v2<\/a> (1.5B) \u2013 our strongest model for the most demanding tasks.<\/li>\n<\/ul>\n<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br\/>\n    Our <a href=\"https:\/\/huggingface.co\/collections\/mixedbread-ai\/reranking-series-v2-67d230aa8f3def6bcba6e887\">V2 Reranking Family<\/a> comes with state-of-the-art performance across 100+ languages, extended context length, improved reasoning, and broad use-case support, outperforming Cohere, Voyage, and more.<\/p>\n<\/blockquote>\n<h2>Why Reranking Matters<\/h2>\n<p>Most search systems rely on a first-stage retriever (e.g., a keyword engine or vector search) to gather a set of candidate results. However, the top-ranked candidate isn&#39;t always the most relevant. 
Reranking addresses this by performing a second pass that &quot;reorders&quot; those candidates based on deeper semantic relevance.<\/p>\n<p>Put simply, the reranker reads each candidate result alongside the query, scoring and sorting them so that truly relevant items rise to the top. This two-step approach can significantly enhance search quality\u2014without having to revamp your existing search pipeline.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/rerank\/rerank-flow.png\" alt=\"Two-stage search flow including rerank\" title=\"Two-stage search flow including rerank\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n    The reranking process from query to ranking\n<\/div>\n\n<h3>Why is this so powerful?<\/h3>\n<p>Many companies have invested in keyword-based search systems, and switching to a purely embedding-based or semantic\/AI-based solution can be costly and complex.<\/p>\n<p>Reranking offers a sweet spot, where you keep your current setup and simply add a semantic layer. With a single line of code calling the reranker, you can tap into advanced language understanding.<\/p>\n<p>As a result, you get a better user experience. Relevant answers surface, irrelevant matches drop, and because our models are open-source licensed, you get this improvement on your terms, whether you self-host or use our API.<\/p>\n<h2>Introducing mxbai-rerank-v2: Setting New Benchmarks<\/h2>\n<p>We&#39;re excited to share our next generation of crispy reranking models with you! 
Our new family includes two powerful options: <strong><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-base-v2\">mxbai-rerank-base-v2<\/a><\/strong>, a compact yet powerful 0.5B-parameter model with an excellent balance between size, speed, and performance, and <strong><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-large-v2\">mxbai-rerank-large-v2<\/a><\/strong>, our flagship 1.5B-parameter model with best-in-class accuracy and robust multilingual support.<\/p>\n<p>Both handle <strong>100+ languages<\/strong>, support <strong>long contexts up to 8k tokens<\/strong> (32k-compatible), and excel at <strong>complex query reasoning<\/strong>. They&#39;re also fast\u2014delivering top-tier performance with minimal latency.<\/p>\n<table>\n<thead>\n<tr>\n<th>Feature<\/th>\n<th>V1 Models<\/th>\n<th><a href=\"https:\/\/huggingface.co\/collections\/mixedbread-ai\/reranking-series-v2-67d230aa8f3def6bcba6e887\"><strong>V2 Models<\/strong><\/a><\/th>\n<th>Improvement<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Architecture<\/td>\n<td>Cross-encoder<\/td>\n<td><strong>RL-optimized Qwen-2.5<\/strong><\/td>\n<td>More powerful base model<\/td>\n<\/tr>\n<tr>\n<td>Parameters<\/td>\n<td>Up to 435M<\/td>\n<td><strong>Up to 1.5B<\/strong><\/td>\n<td>Bigger, but better<\/td>\n<\/tr>\n<tr>\n<td>Languages<\/td>\n<td>English-focused<\/td>\n<td><strong>100+ languages<\/strong><\/td>\n<td>Global coverage<\/td>\n<\/tr>\n<tr>\n<td>Context Length<\/td>\n<td>512 tokens<\/td>\n<td><strong>8K tokens (32K compatible)<\/strong><\/td>\n<td>16x longer context (64x at 32K)<\/td>\n<\/tr>\n<tr>\n<td>BEIR Score<\/td>\n<td>49.32<\/td>\n<td><strong>57.49<\/strong><\/td>\n<td>+8.17 points<\/td>\n<\/tr>\n<tr>\n<td>Speed vs Quality<\/td>\n<td>Good balance<\/td>\n<td><strong>8x faster than similar models<\/strong><\/td>\n<td>Still fast<\/td>\n<\/tr>\n<tr>\n<td>Use Cases<\/td>\n<td>Simple text support<\/td>\n<td><strong>Support for JSON, Code, MCP, 
more<\/strong><\/td>\n<td>Broader application support<\/td>\n<\/tr>\n<tr>\n<td>License<\/td>\n<td>Apache 2.0<\/td>\n<td><strong>Apache 2.0<\/strong><\/td>\n<td>Open Source \u2705<\/td>\n<\/tr>\n<\/tbody><\/table>\n<h2>Fast &amp; State of the Art: Performance Analysis<\/h2>\n<p>We tested the mxbai-rerank-v2 models on English, Chinese, multilingual, tool retrieval, and code-search benchmarks, comparing them to other open- and closed-source models.<\/p>\n<p>The results are impressive: on BEIR, both models outperform open- and closed-source competitors such as Cohere and Voyage, with the large variant leading the next-best competitor by about 2 NDCG@10 points.<\/p>\n<p><strong>English (BEIR Average)<\/strong><\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/mxbai-rerank-v2\/beir-performance.png\" alt=\"mxbai-rerank-v2 BEIR performance\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  BEIR Benchmark Performance.\n<\/div>\n\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>NDCG@10<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-large-v2<\/strong> (1.5B)<\/td>\n<td><strong>57.49<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-base-v2<\/strong> (0.5B)<\/td>\n<td><strong>55.57<\/strong><\/td>\n<\/tr>\n<tr>\n<td>cohere-rerank-3.5<\/td>\n<td>55.39<\/td>\n<\/tr>\n<tr>\n<td>BAAI\/bge-reranker-v2-gemma (2.5B)<\/td>\n<td>55.38<\/td>\n<\/tr>\n<tr>\n<td>voyage-rerank-2<\/td>\n<td>54.54<\/td>\n<\/tr>\n<tr>\n<td>jinaai\/jina-reranker-v2-base-multilingual<\/td>\n<td>54.35<\/td>\n<\/tr>\n<tr>\n<td>BAAI\/bge-reranker-v2-m3 (568M)<\/td>\n<td>53.94<\/td>\n<\/tr>\n<tr>\n<td>mixedbread-ai\/mxbai-rerank-large-v1 (435M)<\/td>\n<td>49.32<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2104.08663\">BEIR<\/a> is the industry-standard benchmark for evaluating English-language information retrieval models. 
Mixedbread\u2019s <strong>rerank-v2<\/strong> leads this BEIR comparison, outperforming every other reranker we tested and approaching the effectiveness of state-of-the-art embedding models, while using BM25 as the first-stage retriever.<\/p>\n<p>Our models significantly improve search quality by re-ranking candidate documents to better align with user queries. The <strong>large variant (1.5B parameters) achieves a 57.49 BEIR score<\/strong>, the highest in our comparison, while even our <strong>base variant (0.5B) surpasses much larger models<\/strong> with an impressive 55.57 score.<\/p>\n<p>This represents a <strong>major improvement<\/strong> over our previous generation, with <strong>v2 models showing over 8-point gains compared to v1<\/strong>. These improvements make our rerankers highly valuable for production environments where retrieval quality is critical\u2014without the extreme resource demands of larger models. Organizations can now deploy <strong>state-of-the-art search capabilities<\/strong> efficiently, ensuring <strong>high accuracy and optimal performance<\/strong>.<\/p>\n<p><strong>Multilingual (Mr.TyDi Average)<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>NDCG@10<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>BAAI\/bge-reranker-v2-m3 (568M)<\/td>\n<td><strong>30.99<\/strong><\/td>\n<\/tr>\n<tr>\n<td>BAAI\/bge-reranker-v2-gemma (2.5B)<\/td>\n<td><strong>30.40<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-large-v2<\/strong> (1.5B)<\/td>\n<td>29.79<\/td>\n<\/tr>\n<tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-base-v2<\/strong> (0.5B)<\/td>\n<td>28.56<\/td>\n<\/tr>\n<tr>\n<td>mixedbread-ai\/mxbai-rerank-large-v1 (435M)<\/td>\n<td>21.88<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>In the <a href=\"https:\/\/aclanthology.org\/2021.mrl-1.12\/\">Mr.TyDi<\/a> multilingual benchmark, our large variant (1.5B) achieves a 29.79 score, placing within 1.2 points of the top-ranked model and approaching the effectiveness 
of leading embedding-based approaches.<\/p>\n<p>We continue to refine our models to enhance multilingual performance and are actively working on further improvements for future releases.<\/p>\n<p><strong>Chinese<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>NDCG@10<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-large-v2<\/strong> (1.5B)<\/td>\n<td><strong>84.16<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-base-v2<\/strong> (0.5B)<\/td>\n<td><strong>83.70<\/strong><\/td>\n<\/tr>\n<tr>\n<td>BAAI\/bge-reranker-v2-m3 (568M)<\/td>\n<td>81.83<\/td>\n<\/tr>\n<tr>\n<td>BAAI\/bge-reranker-v2-gemma (2.5B)<\/td>\n<td>78.50<\/td>\n<\/tr>\n<tr>\n<td>mixedbread-ai\/mxbai-rerank-large-v1 (435M)<\/td>\n<td>72.53<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>Our rerank-v2 models dominate the <a href=\"https:\/\/arxiv.org\/html\/2309.07597v2\">C-Pack (Chinese)<\/a> retrieval benchmark, with the large variant (1.5B) achieving a leading score of <strong>84.16<\/strong>, followed closely by the base variant (0.5B) at 83.70.<\/p>\n<p>Compared to our previous generation, rerank-v2 delivers gains of over 11 points, marking a significant improvement in search accuracy. 
This ensures superior ranking quality for Chinese-language search applications, making it a top choice for organizations prioritizing high-precision retrieval in Chinese content.<\/p>\n<p><strong>Code Search (CoIR-Retrieval\/cosqa)<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>NDCG@10<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-large-v2<\/strong> (1.5B)<\/td>\n<td><strong>32.05<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-base-v2<\/strong> (0.5B)<\/td>\n<td><strong>31.73<\/strong><\/td>\n<\/tr>\n<tr>\n<td>BAAI\/bge-reranker-v2-gemma (2.5B)<\/td>\n<td>31.51<\/td>\n<\/tr>\n<tr>\n<td>mixedbread-ai\/mxbai-rerank-large-v1 (435M)<\/td>\n<td>30.72<\/td>\n<\/tr>\n<tr>\n<td>BAAI\/bge-reranker-v2-m3 (568M)<\/td>\n<td>24.86<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>Our rerank-v2 models excel in code search tasks, achieving best-in-class performance on the <a href=\"https:\/\/arxiv.org\/html\/2407.02883v1\">CoIR-Retrieval\/CosQA<\/a> benchmark. The large variant (1.5B) leads with a 32.05 score, while even our base model (0.5B) outperforms larger alternatives with a 31.73 score.<\/p>\n<p>This marks a clear improvement over our previous generation, with rerank-v2 surpassing v1 by over 1 point. 
By delivering more precise and context-aware retrieval for code-related queries, our models enable developers and AI-powered coding assistants to retrieve relevant code snippets with greater accuracy and efficiency.<\/p>\n<p><strong>Tool Retrieval (ToolRet)<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Score<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-large-v2<\/strong> (bm25 first stage)<\/td>\n<td><strong>39.2<\/strong><\/td>\n<\/tr>\n<tr>\n<td>jina-reranker-v2-base-multilingual (NV-Embed-v1 first stage, current state-of-the-art)<\/td>\n<td>39.09<\/td>\n<\/tr>\n<tr>\n<td>jina-reranker-v2-base-multilingual (bm25 first stage)<\/td>\n<td>37.90<\/td>\n<\/tr>\n<tr>\n<td>bm25<\/td>\n<td>31.97<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2503.01763\">ToolRet<\/a> is a benchmark designed to evaluate tool retrieval tasks, where tools refer to functions that a Large Language Model (LLM) can call to perform specific operations. For example, a tool like <code>character_count(text)<\/code> returns the number of characters in a given text.<\/p>\n<p>Our rerank-v2 models set a new performance benchmark in tool retrieval, with the large variant (1.5B) achieving a leading 39.2 score, surpassing the current state of the art. Note how much the first-stage retriever matters: the jina reranker scores 39.09 with NV-Embed-v1 as the first stage but only 37.90 with BM25. Since our result uses BM25, we expect a stronger first-stage retriever to yield even better results.<\/p>\n<p>This shows that our rerank-v2 models are highly effective at reordering tools based on their relevance to a given query. They can be used to retrieve from a large set of tools, for example in MCP implementations.<\/p>\n<span class=\"text-xs\">\n<em>For all benchmarks we used keyword-based (BM25) retrieval as the first-stage retriever. 
For more details, please check out the <a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1MhaBgkILpxgtbtqD9o-SDEv3MGpx8adZMgPP5KtwOmQ\/edit?usp=sharing\">spreadsheet<\/a>.<\/em>\n<\/span>\n\n<h2>Latency Comparison<\/h2>\n<p>To understand how quickly each model processes queries in real-world settings, we measured <strong>average latency per query<\/strong> (seconds) on the <a href=\"https:\/\/www.cl.uni-heidelberg.de\/statnlpgroup\/nfcorpus\/\">NFCorpus<\/a> dataset using an A100 (80GB) GPU:<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/mxbai-rerank-v2\/latency-mxbai-rerank-v2.png\" alt=\"latency diagram mxbai-rerank-v2\"><\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Latency (s)<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>mixedbread-ai\/mxbai-rerank-xsmall-v1<\/td>\n<td>0.32<\/td>\n<\/tr>\n<tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-base-v2<\/strong><\/td>\n<td><strong>0.67<\/strong><\/td>\n<\/tr>\n<tr>\n<td>mixedbread-ai\/mxbai-rerank-base-v1<\/td>\n<td>0.76<\/td>\n<\/tr>\n<tr>\n<td><strong>mixedbread-ai\/mxbai-rerank-large-v2<\/strong><\/td>\n<td><strong>0.89<\/strong><\/td>\n<\/tr>\n<tr>\n<td>mixedbread-ai\/mxbai-rerank-large-v1<\/td>\n<td>2.24<\/td>\n<\/tr>\n<tr>\n<td>BAAI\/bge-reranker-v2-m3<\/td>\n<td>3.05<\/td>\n<\/tr>\n<tr>\n<td>BAAI\/bge-reranker-v2-gemma<\/td>\n<td>7.20<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>Our 1.5B model is <strong>8x faster<\/strong> than bge-reranker-v2-gemma (0.89 s vs. 7.20 s per query) while delivering higher accuracy on the BEIR, Chinese, and code-search benchmarks above. 
This speed advantage means you can process more queries per second without sacrificing quality, making our models ideal for high-volume production environments where both performance and cost-efficiency matter.<\/p>\n<h2>How We Trained the Models<\/h2>\n<p>Building on insights from <a href=\"https:\/\/arxiv.org\/pdf\/2501.12948\">DeepSeek-R1<\/a> and starting with <a href=\"https:\/\/arxiv.org\/pdf\/2412.15115\"><strong>Qwen-2.5<\/strong><\/a>, we used a three-step reinforcement-learning process:<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/mxbai-rerank-v2\/training-methodology.png\" alt=\"Training methodology diagram\" title=\"Three-step training process\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n    The training process used for mxbai-rerank-v2\n<\/div>\n\n<ol>\n<li><p><strong><a href=\"https:\/\/arxiv.org\/abs\/2402.03300\">GRPO<\/a> (Group Relative Policy Optimization)<\/strong> <br\/>\nWe taught the model to output <strong>1<\/strong> for relevant documents and <strong>0<\/strong> for irrelevant ones. 
GRPO ensures format consistency and gives a strong performance boost from the start.<\/p>\n<\/li>\n<li><p><strong>Contrastive Learning<\/strong> <br\/>\nNext, we gave the model a fine-grained understanding of query-document relationships, much like embedding models learn deeper semantic similarity.<\/p>\n<\/li>\n<li><p><strong>Preference Learning<\/strong> <br\/>\nFinally, we tuned the model to rank the most relevant documents highest, mirroring how real users judge search results.<\/p>\n<\/li>\n<\/ol>\n<p>This layered approach yields a richer query understanding, whether you&#39;re reordering text results, code snippets, or product listings.<\/p>\n<span class=\"text-xs\">\n<em>A detailed technical paper is on the way, with an in-depth look at our methodology, architecture, and additional benchmarks.<\/em>\n<\/span>\n\n<h2>Try It Out<\/h2>\n<p>If you\u2019d like to see how mxbai\u2011rerank\u2011v2 performs in your own search setup, there are two ways to use the models: locally via the Python package or through the Mixedbread API. Here&#39;s how you get started:<\/p>\n<p><strong>Python (API):<\/strong><\/p>\n<pre><code class=\"language-bash\">pip install -U mixedbread\n<\/code><\/pre>\n<p><strong>TypeScript (API):<\/strong><\/p>\n<pre><code class=\"language-bash\">npm i @mixedbread\/sdk\n<\/code><\/pre>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-bash\">pip install -U mxbai-rerank\n<\/code><\/pre>\n<p>The snippets below send a query and multiple candidate passages to an mxbai\u2011rerank\u2011v2 model. 
The model scores each passage, so you can rank the most relevant content at the top:<\/p>\n<p><strong>Python (API):<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread import Mixedbread\n\nmxbai = Mixedbread(api_key=&quot;YOUR_API_KEY&quot;)\n\nresult = mxbai.rerank(\n    model=&quot;mixedbread-ai\/mxbai-rerank-large-v2&quot;,\n    query=&quot;Who is the author of To Kill a Mockingbird?&quot;,\n    input=[\n        &quot;To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n        &quot;The novel Moby-Dick was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.&quot;,\n        &quot;Harper Lee, an American novelist widely known for her novel To Kill a Mockingbird, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.&quot;,\n        &quot;Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.&quot;,\n        &quot;The Harry Potter series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.&quot;,\n        &quot;The Great Gatsby, a novel written by American author F. Scott Fitzgerald, was published in 1925. 
The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.&quot;\n    ]\n)\n\nprint(result.data)\n<\/code><\/pre>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-python\">from mxbai_rerank import MxbaiRerankV2\n\n# Load the model, here we use our base sized model\nmodel = MxbaiRerankV2(&quot;mixedbread-ai\/mxbai-rerank-base-v2&quot;)\n\n# Example query and documents\nquery = &quot;Who wrote To Kill a Mockingbird?&quot;\ndocuments = [\n    &quot;To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n    &quot;The novel Moby-Dick was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.&quot;,\n    &quot;Harper Lee, an American novelist widely known for her novel To Kill a Mockingbird, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.&quot;,\n    &quot;Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.&quot;,\n    &quot;The Harry Potter series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.&quot;,\n    &quot;The Great Gatsby, a novel written by American author F. Scott Fitzgerald, was published in 1925. 
The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.&quot;\n]\n\n# Calculate the scores\nresults = model.rank(query, documents)\n<\/code><\/pre>\n<p><strong>TypeScript (API):<\/strong><\/p>\n<pre><code class=\"language-typescript\">import { Mixedbread } from &quot;@mixedbread\/sdk&quot;;\n\nconst mxbai = new Mixedbread({\n  apiKey: &quot;YOUR_API_KEY&quot;,\n});\n\nconst result = await mxbai.rerank({\n  model: &quot;mixedbread-ai\/mxbai-rerank-large-v2&quot;,\n  query: &quot;Who is the author of To Kill a Mockingbird?&quot;,\n  input: [\n    &quot;To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n    &quot;The novel Moby-Dick was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.&quot;,\n    &quot;Harper Lee, an American novelist widely known for her novel To Kill a Mockingbird, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.&quot;,\n    &quot;Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.&quot;,\n    &quot;The Harry Potter series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.&quot;,\n    &quot;The Great Gatsby, a novel written by American author F. Scott Fitzgerald, was published in 1925. 
The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.&quot;,\n  ],\n});\n\nconsole.log(result.data);\n<\/code><\/pre>\n<p>Once you have the scores, you can reorder your documents based on them. This approach quickly upgrades the \u201csecond pass\u201d in your search pipeline, making results more relevant without overhauling your entire search infrastructure.<\/p>\n<h2>Reranking Beyond Document Retrieval<\/h2>\n<p>When most people think of reranking, they picture standard document search. However, the latest v2 models demonstrate that reranking can be applied to a much broader set of tasks. Here are a few key examples:<\/p>\n<ul>\n<li><p><strong>Code, SQL &amp; Technical Documentation<\/strong> <br\/>\nBy understanding programming syntax, SQL, and documentation structures, our models surface the most relevant sections within repositories, developer wikis, or technical manuals. This approach balances semantic intent with the specific language of code, making it highly effective for technical searches.<\/p>\n<\/li>\n<li><p><strong>LLM Tool Selection &amp; Function Calling<\/strong> <br\/>\nWhen you have thousands of functions, MCP definitions, or API endpoints, choosing the right one for a given query can be challenging. The model assists by aligning the query\u2019s intent with the most appropriate tool or function, which is particularly useful for AI assistants working with MCP or any other mechanism that requires precise function calls.<\/p>\n<\/li>\n<li><p><strong>E-Commerce Product Discovery<\/strong>  <br\/>\nIn large product catalogs, simply matching keywords often falls short of user expectations. Instead, the model examines product attributes, descriptions, and user intent to highlight the most relevant options. 
It also factors in details such as specifications and reviews, offering a more nuanced ranking than generic keyword-based methods.<\/p>\n<\/li>\n<li><p><strong>Structured Data &amp; JSON<\/strong> <br\/>\nTraditional rerankers often struggle with structured formats, but our approach is designed to interpret relationships between fields and values. This makes it straightforward to locate relevant entries in complex databases or JSON documents\u2014another step beyond plain-text ranking.<\/p>\n<\/li>\n<li><p><strong>Mixed Content Types<\/strong> <br\/>\nOften, data comes in various formats: text, metadata, technical specs, or category tags. By combining these different signals, the model produces a unified relevance judgment that reflects the bigger picture, rather than focusing on a single type of content.<\/p>\n<\/li>\n<\/ul>\n<p>This versatility means you can rely on a single open-source model for use cases ranging from customer-facing searches to internal knowledge management and developer tools\u2014without needing separate rerankers for each part of your tech stack.<\/p>\n<h2>Give Us Feedback<\/h2>\n<p>We&#39;d love to hear your thoughts on <strong>mxbai-rerank-v2<\/strong>! Whether you&#39;re reranking web pages, SQL queries, code snippets, or product listings, let us know how it performs and how we can improve.<\/p>\n<p>Join our <a href=\"https:\/\/mixedbread.com\/urls\/discord\">Discord<\/a> to share your experience, ask questions, and connect with other developers. Follow us on <a href=\"https:\/\/mixedbread.com\/urls\/x\">X<\/a> and <a href=\"https:\/\/mixedbread.com\/urls\/linked-in\">LinkedIn<\/a> for releases and updates!<\/p>\n<p><strong>Happy baking!<\/strong><\/p>\n<span class=\"text-xs\">\n<em>Reach out if you're looking for an end-to-end solution or direct support. 
We\u2019d love to help you bake the perfect search stack.<\/em>\n<\/span>\n\n<h3>Citation<\/h3>\n<pre><code class=\"language-bibtex\">@online{v2rerank2025mxbai,\n  title={Baked-in Brilliance: Reranking Meets RL with mxbai-rerank-v2},\n  author={Sean Lee and Rui Huang and Aamir Shakir and Julius Lipp},\n  year={2025},\n  url={https:\/\/www.mixedbread.com\/blog\/mxbai-rerank-v2},\n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/mxbai-rerank-v2","title":"Baked-in Brilliance: Reranking Meets RL with mxbai-rerank-v2","summary":"Introducing mxbai-rerank-v2, the second-generation reranking models from Mixedbread. They're crispier than ever\u2014featuring reinforcement learning, multilingual support, and extended context handling for even better accuracy. They come with the same open-source flexibility under Apache 2.0, now with boosted performance for a more powerful search experience.","image":"https:\/\/www.mixedbread.com\/images\/blog\/mxbai-rerank-v2\/intro-mxbai-rerank-v2.jpg","date_modified":"2025-03-13T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/mxbai-embed-xsmall-v1","content_html":"<p>We are happy to introduce <strong>mxbai-embed-xsmall-v1<\/strong>, our smallest and most efficient embedding model to date. Despite its small size, it comes with competitive performance, making it ideal for retrieval tasks where resources are limited. It has an Apache 2.0 license and is available on Hugging Face.<\/p>\n<p>Read on to learn more about our approach and to check out our benchmarks. 
If you want to skip right to the model instead, you can access it here:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-xsmall-v1\">mxbai-embed-xsmall-v1<\/a>: Tiny but mighty embedding model optimized for retrieval tasks.<\/li>\n<\/ul>\n<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br\/>\n    Introducing our smallest and most efficient English embedding model, <a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-xsmall-v1\">mxbai-embed-xsmall-v1<\/a>, offering competitive performance with a small footprint. It supports long context, binary quantization and Matryoshka representation learning!<\/p>\n<\/blockquote>\n<h2>Why Embeddings?<\/h2>\n<p>Embeddings are the backbone of many natural language processing applications. They transform complex textual data into numerical vectors, allowing us to measure semantic similarity between texts. This capability is crucial for tasks like recommendation systems, search engines, clustering, classification, and especially Retrieval-Augmented Generation (RAG).<\/p>\n<p>In RAG, embeddings enable large language models (LLMs) to access and understand custom data. For instance, if you need a report based on internal documents, an embedding model can encode these documents into a vector database. When you query the system, it retrieves the most relevant information to inform the LLM, allowing it to generate accurate and context-specific responses.<\/p>\n<h2>Introducing Our Smallest Embedding Model<\/h2>\n<p>We understand that not every application has the luxury of great computational resources. That&#39;s why we&#39;ve developed <strong>mxbai-embed-xsmall-v1<\/strong>, a compact yet powerful embedding model optimized for English retrieval tasks. 
Based on <a href=\"https:\/\/huggingface.co\/sentence-transformers\/all-MiniLM-L6-v2\">sentence-transformers\/all-MiniLM-L6-v2<\/a>, our model has only <strong>22.7 million parameters<\/strong> and is trained in <strong>float16<\/strong> for efficiency.<\/p>\n<p>Despite its small size, the model supports features like <a href=\"https:\/\/huggingface.co\/blog\/embedding-quantization\">binary quantization<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/2205.13147\">Matryoshka representation learning<\/a> (MRL), allowing for significant reductions in storage and computational requirements without sacrificing much performance.<\/p>\n<h3>Specs<\/h3>\n<ul>\n<li><strong>Small Footprint:<\/strong> Only 22.7 million parameters and 384 dimensions.<\/li>\n<li><strong>Long Context Support:<\/strong> Inputs up to 4096 tokens.<\/li>\n<li><strong>Binary Quantization and MRL:<\/strong> Up to <a href=\"https:\/\/mixedbread.com\/blog\/binary-mrl\">32x storage and 40x speed<\/a> efficiency gains.<\/li>\n<li><strong>Optimized for English Retrieval:<\/strong> Specifically trained for English retrieval tasks.<\/li>\n<\/ul>\n<h2>Model Evaluation with Benchmarks<\/h2>\n<p>Let&#39;s dive straight into the performance of our new model across various benchmarks.<\/p>\n<h3>MTEB Benchmark<\/h3>\n<p>On the <a href=\"https:\/\/arxiv.org\/abs\/2210.07316\">Massive Text Embedding Benchmark<\/a> (MTEB), our model performs well for its 
size:<\/p>\n<table>\n<thead>\n<tr>\n<th>Task<\/th>\n<th>all-MiniLM-L6-v2<\/th>\n<th><strong>mxbai-embed-xsmall-v1<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>ArguAna<\/td>\n<td>48.42<\/td>\n<td><strong>49.58<\/strong><\/td>\n<\/tr>\n<tr>\n<td>SciDOCS<\/td>\n<td><strong>21.58<\/strong><\/td>\n<td>21.50<\/td>\n<\/tr>\n<tr>\n<td>SciFact<\/td>\n<td>65.41<\/td>\n<td><strong>65.81<\/strong><\/td>\n<\/tr>\n<tr>\n<td>NFCorpus<\/td>\n<td>31.00<\/td>\n<td><strong>32.05<\/strong><\/td>\n<\/tr>\n<tr>\n<td>TREC-COVID<\/td>\n<td>46.09<\/td>\n<td><strong>48.90<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Touche2020<\/td>\n<td>16.39<\/td>\n<td><strong>17.09<\/strong><\/td>\n<\/tr>\n<tr>\n<td>FiQA2018<\/td>\n<td>36.10<\/td>\n<td><strong>37.10<\/strong><\/td>\n<\/tr>\n<tr>\n<td>HotpotQA<\/td>\n<td>46.50<\/td>\n<td><strong>48.37<\/strong><\/td>\n<\/tr>\n<tr>\n<td>MSMARCO (dev)<\/td>\n<td>36.54<\/td>\n<td><strong>36.76<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Fever<\/td>\n<td>50.93<\/td>\n<td><strong>56.45<\/strong><\/td>\n<\/tr>\n<tr>\n<td>NQ<\/td>\n<td>43.67<\/td>\n<td><strong>44.44<\/strong><\/td>\n<\/tr>\n<tr>\n<td>DBPedia<\/td>\n<td><strong>32.30<\/strong><\/td>\n<td>32.19<\/td>\n<\/tr>\n<tr>\n<td>Quora<\/td>\n<td>87.54<\/td>\n<td><strong>87.70<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Climate-Fever<\/td>\n<td>20.29<\/td>\n<td><strong>22.42<\/strong><\/td>\n<\/tr>\n<tr>\n<td>cqadupstack<\/td>\n<td>40.69<\/td>\n<td><strong>41.59<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>Average<\/strong><\/td>\n<td>41.56<\/td>\n<td><strong>42.80<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<h3>Long Context Benchmark (LoCo)<\/h3>\n<p>Our model supports inputs with a length of up to 4096 tokens, making it suitable for long 
documents:<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>GovReport<\/th>\n<th>QasperFA<\/th>\n<th>QasperFT<\/th>\n<th>QMSum<\/th>\n<th>SummScreenFD<\/th>\n<th>Average<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>all-MiniLM-L6-v2<\/td>\n<td>86.31<\/td>\n<td>81.9<\/td>\n<td>78.88<\/td>\n<td>34.86<\/td>\n<td>54.75<\/td>\n<td>67.34<\/td>\n<\/tr>\n<tr>\n<td><strong>mxbai-embed-xsmall-v1<\/strong><\/td>\n<td><strong>95.6<\/strong><\/td>\n<td><strong>94.15<\/strong><\/td>\n<td><strong>86.91<\/strong><\/td>\n<td>26.75<\/td>\n<td><strong>78.27<\/strong><\/td>\n<td><strong>76.34<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<h3>LongEmb Benchmark<\/h3>\n<p>We compared our model against <a href=\"https:\/\/huggingface.co\/sentence-transformers\/all-MiniLM-L6-v2\">sentence-transformers\/all-MiniLM-L6-v2<\/a>:<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>NarrativeQA<\/th>\n<th>QMSum<\/th>\n<th>SummScreenFD<\/th>\n<th>2WikiMultiQA<\/th>\n<th>Average<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>all-MiniLM-L6-v2<\/td>\n<td>15.6<\/td>\n<td>20.53<\/td>\n<td>60.57<\/td>\n<td>47.70<\/td>\n<td>36.10<\/td>\n<\/tr>\n<tr>\n<td><strong>mxbai-embed-xsmall-v1<\/strong><\/td>\n<td><strong>15.65<\/strong><\/td>\n<td><strong>28.62<\/strong><\/td>\n<td><strong>81.45<\/strong><\/td>\n<td><strong>58.05<\/strong><\/td>\n<td><strong>45.94<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>Our model shows significant improvements, especially in longer document tasks.<\/p>\n<h2>Optimized for Efficiency with Binary Quantization and MRL<\/h2>\n<p>One of the standout features of our model is its support for binary quantization and Matryoshka representation learning (MRL). 
These techniques allow you to drastically reduce the size of your embeddings and speed up computations, making large-scale deployments more feasible and cost-effective.<\/p>\n<h3>Matryoshka Representation Learning (MRL)<\/h3>\n<p>MRL enables the model to produce embeddings that are still effective even when truncated to smaller dimensions. This allows you to choose an embedding size that balances performance and storage requirements.<\/p>\n<p>We evaluated the model&#39;s performance at different embedding sizes:<\/p>\n<table>\n<thead>\n<tr>\n<th>Dimension (d)<\/th>\n<th>SciFact<\/th>\n<th>SciDocs<\/th>\n<th>NFCorpus<\/th>\n<th>ArguAna<\/th>\n<th>Average<\/th>\n<th>Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>384<\/strong><\/td>\n<td>65.81<\/td>\n<td>21.50<\/td>\n<td>32.05<\/td>\n<td>49.58<\/td>\n<td>42.235<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>256<\/td>\n<td>63.62<\/td>\n<td>20.56<\/td>\n<td>30.74<\/td>\n<td>47.61<\/td>\n<td>40.6325<\/td>\n<td>0.962<\/td>\n<\/tr>\n<tr>\n<td>128<\/td>\n<td>58.55<\/td>\n<td>17.20<\/td>\n<td>26.81<\/td>\n<td>40.11<\/td>\n<td>35.6675<\/td>\n<td>0.845<\/td>\n<\/tr>\n<tr>\n<td>64<\/td>\n<td>43.49<\/td>\n<td>11.67<\/td>\n<td>18.72<\/td>\n<td>30.93<\/td>\n<td>26.2025<\/td>\n<td>0.620<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>As you can see, even at smaller dimensions, the model retains a good portion of its full-dimension performance.<\/p>\n<h3>Binary Quantization<\/h3>\n<p>By converting embeddings into binary format, you can achieve up to <a href=\"https:\/\/mixedbread.com\/blog\/binary-mrl\">32x in storage and 40x in computational<\/a> efficiency gains. 
Learn more about binary quantization <a href=\"https:\/\/huggingface.co\/blog\/embedding-quantization\">here<\/a>.<\/p>\n<p>We also tested the performance of the model with binary quantization:<\/p>\n<table>\n<thead>\n<tr>\n<th>Encoding<\/th>\n<th>SciFact<\/th>\n<th>SciDocs<\/th>\n<th>NFCorpus<\/th>\n<th>ArguAna<\/th>\n<th>Average<\/th>\n<th>Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Raw<\/td>\n<td>65.81<\/td>\n<td>21.50<\/td>\n<td>32.05<\/td>\n<td>49.58<\/td>\n<td>42.235<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td><strong>Binary<\/strong><\/td>\n<td>61.95<\/td>\n<td>19.94<\/td>\n<td>30.62<\/td>\n<td>46.14<\/td>\n<td>39.6625<\/td>\n<td>0.939<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>The performance drop is minimal and can even be further improved with on-disk rescoring!<\/p>\n<h2>Using It in Action<\/h2>\n<p>To get started, install the necessary packages:<\/p>\n<pre><code class=\"language-bash\">pip install -U sentence-transformers\n<\/code><\/pre>\n<p>Here&#39;s how you can use the model:<\/p>\n<p><strong>Sentence Transformer:<\/strong><\/p>\n<pre><code class=\"language-python\">from sentence_transformers import SentenceTransformer\nfrom sentence_transformers.util import cos_sim\n\n# 1. Load the model\nmodel = SentenceTransformer(&quot;mixedbread-ai\/mxbai-embed-xsmall-v1&quot;)\n\n# 2. Prepare your data\nquery = &#39;What are the best ingredients for sourdough bread?&#39;\ndocs = [\n    query,\n    &quot;Our bakery lays at the heart of the city.&quot;,\n    &quot;The key to great croissants is high-quality butter.&quot;,\n    &quot;Our signature sourdough uses a 100-year-old starter for optimal flavor.&quot;,\n    &quot;To create a perfect loaf of sourdough, follow these steps: mix the flour and water, let it sit for 24 hours, knead the dough, shape it, let it rise, bake it, and enjoy!&quot;\n]\n\n# 3. Encode the texts\nembeddings = model.encode(docs)\n\n# 4. 
Calculate cosine similarity\nsimilarities = cos_sim(embeddings[0], embeddings[1:])\nprint(similarities)\n<\/code><\/pre>\n<p><strong>Batched API:<\/strong><\/p>\n<pre><code class=\"language-python\">import uvicorn\nimport batched\nfrom fastapi import FastAPI\nfrom fastapi.responses import ORJSONResponse\nfrom sentence_transformers import SentenceTransformer\nfrom pydantic import BaseModel\n\napp = FastAPI()\n\nmodel = SentenceTransformer(&#39;mixedbread-ai\/mxbai-embed-xsmall-v1&#39;)\nmodel.encode = batched.aio.dynamically(model.encode)\n\nclass EmbeddingsRequest(BaseModel):\n    input: str | list[str]\n\n@app.post(&quot;\/embeddings&quot;)\nasync def embeddings(request: EmbeddingsRequest):\n    return ORJSONResponse({&quot;embeddings&quot;: await model.encode(request.input)})\n\nif __name__ == &quot;__main__&quot;:\n    uvicorn.run(app, host=&quot;0.0.0.0&quot;, port=8000)\n<\/code><\/pre>\n<p>The Sentence Transformer snippet prints the query&#39;s cosine similarity to each of the four documents:<\/p>\n<pre><code class=\"language-python\">tensor([[0.2590, 0.1946, 0.3870, 0.5797]])\n<\/code><\/pre>\n<h2>How We Built mxbai-embed-xsmall-v1<\/h2>\n<p>Building <strong>mxbai-embed-xsmall-v1<\/strong> was a journey focused on maximizing efficiency without compromising performance. Here&#39;s how we achieved it:<\/p>\n<h3>Base Model and Training<\/h3>\n<p>We started with the <a href=\"https:\/\/huggingface.co\/sentence-transformers\/all-MiniLM-L6-v2\">sentence-transformers\/all-MiniLM-L6-v2<\/a> model, known for its balance between performance and size. We then trained it using the <a href=\"https:\/\/arxiv.org\/abs\/2309.12871\">AnglE loss function<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/2402.14776\">Espresso<\/a>, techniques that improve the model&#39;s ability to generate high-quality embeddings for retrieval tasks.<\/p>\n<h3>Focus on Retrieval Tasks<\/h3>\n<p>Our primary goal was to create an embedding model optimized for retrieval tasks in English. 
By tailoring the training data and focusing on relevant loss functions, we made sure that <strong>mxbai-embed-xsmall-v1<\/strong> is great at finding semantically similar texts, which is important for search engines, recommendation systems, and RAG applications.<\/p>\n<h3>Why Small Size Matters<\/h3>\n<p>In many real-world applications, computational resources are limited. Smaller models like <strong>mxbai-embed-xsmall-v1<\/strong> offer several advantages. They provide faster inference, as reduced computation leads to quicker results. These models are also ideal for deployment on devices with limited memory and processing power due to their lower resource consumption. Additionally, they offer cost-effectiveness by lowering infrastructure costs when scaling up.<\/p>\n<h2>Give Us Feedback<\/h2>\n<p>We are excited to see how <strong>mxbai-embed-xsmall-v1<\/strong> is used in your projects. Your feedback is helping us improve our models and make them more user-friendly.<\/p>\n<p>Please share your thoughts through our <a href=\"https:\/\/mixedbread.com\/urls\/discord\">Discord community<\/a>. We are here to assist and always ready to discuss the exciting field of machine learning!<\/p>\n<h3>Citation<\/h3>\n<pre><code class=\"language-bibtex\">@online{xsmall2024mxbai,\n  title={Every Byte Matters: Introducing mxbai-embed-xsmall-v1},\n  author={Sean Lee and Julius Lipp and Rui Huang and Darius Koenig},\n  year={2024},\n  url={https:\/\/www.mixedbread.com\/blog\/mxbai-embed-xsmall-v1},\n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/mxbai-embed-xsmall-v1","title":"Every Byte Matters: Introducing mxbai-embed-xsmall-v1","summary":"Announcing mxbai-embed-xsmall-v1, our smallest and most efficient English embedding model optimised for retrieval tasks. 
It comes with competitive performance at an extra small footprint, support for long context, binary quantization and Matryoshka representation learning.","image":"https:\/\/www.mixedbread.com\/images\/blog\/mxbai-embed-xsmall-v1\/intro-mxbai-embed-xsmall-v1.jpg","date_modified":"2024-10-14T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/dynamic-batching","content_html":"<p>Adding dynamic batching to our inference API back in the day was a real pain. We struggled with complex implementations and performance bottlenecks that were annoying and seemed unnecessary. We figured that many developers likely face similar challenges when trying to optimize their machine learning pipelines. That&#39;s why we&#39;re happy to open-source Batched, so other developers&#39; lives can be a bit easier when it comes to implementing dynamic batching. <\/p>\n<p>Read on to learn more about dynamic batching and how Batched can help you optimize your ML workflows, or access the repo <a href=\"https:\/\/github.com\/mixedbread-ai\/batched\">here<\/a>.<\/p>\n<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br\/>\n    Batched is a lightweight library that seamlessly adds dynamic batching to any transformer model or function, automatically grouping multiple requests for efficient processing.<\/p>\n<\/blockquote>\n<h2>Why Dynamic Batching?<\/h2>\n<p>GPUs excel at handling massive parallel computations using SIMD (Single Instruction, Multiple Data), which allows one instruction to be applied to large sets of data simultaneously. This parallelism is what makes GPUs so powerful for tasks like deep learning, where operations are performed over large tensors.<\/p>\n<p>However, in production environments, especially in online inference scenarios, requests usually come in one by one and are processed sequentially rather than in parallel. 
Each request may only use a fraction of the GPU&#39;s capability, which leads to massive underutilization of the GPU.<\/p>\n<p>This is where <strong>dynamic batching<\/strong> comes into play. Instead of handling each request individually, we gather multiple incoming requests over a short time window and process them all at once as a batch. With this approach, we&#39;re able to use the GPU&#39;s ability to perform parallel computations more effectively, improving overall throughput and resource utilization.<\/p>\n<h3>The Toaster Analogy<\/h3>\n<p>Imagine a baker who toasts bread for customers. The toaster has four slots, but when a customer orders toast, the baker immediately fills a single slot and waits for the toast to be ready. If another customer comes in while the toaster is still running, the baker has to wait until the first batch is done before being able to toast a new piece of bread for the new customer. This is sequential processing, where resources are not fully utilized.<\/p>\n<p>Now, imagine dynamic batching: the baker waits a short time after an order comes in to see if any other orders are arriving. By grouping multiple orders together, the baker fills all four slots of the toaster and toasts multiple pieces of bread simultaneously. While this approach might slightly delay the start of toasting for some customers, it significantly increases the overall throughput. More customers get their toast faster, and the toaster is used efficiently.<\/p>\n<p>Similarly, in machine learning systems, dynamic batching might introduce a small latency (typically milliseconds) for individual requests, but it dramatically improves the GPU&#39;s efficiency and the system&#39;s overall performance. By processing multiple inputs at once, the GPU&#39;s computational resources are better utilized, allowing the system to handle a much larger volume of data in less time.<\/p>\n<h3>The Challenge<\/h3>\n<p>Implementing dynamic batching isn&#39;t always straightforward. 
It can involve complex synchronization mechanisms, queue management, and careful handling of different input shapes and types. Many developers find themselves writing boilerplate code or fighting with available frameworks that may not fit seamlessly into their existing infrastructure.<\/p>\n<h2>Batched<\/h2>\n<p>While there are frameworks like <a href=\"https:\/\/developer.nvidia.com\/triton-inference-server\">Triton<\/a> that offer dynamic batching, we found that none provide this functionality without locking you into their entire ecosystem. As strong believers in developer experience and composability, we\u2019ve open-sourced <a href=\"https:\/\/github.com\/mixedbread-ai\/batched\"><strong>Batched<\/strong><\/a> to allow you to easily integrate dynamic batching into your projects without adding unnecessary complexity.<\/p>\n<p>Batched is designed to be lightweight and easy to use. It wraps around your existing functions or models, automatically handling the batching of incoming requests. Therefore, it allows you to add dynamic batching to your codebase with minimal changes and without rearchitecting your entire system.<\/p>\n<h3>Installation<\/h3>\n<p>You can install <strong>Batched<\/strong> via pip:<\/p>\n<pre><code class=\"language-bash\">pip install batched\n<\/code><\/pre>\n<h3>Add Dynamic Batching to Your Model<\/h3>\n<p>Let&#39;s look at how you can add dynamic batching to your projects using the <code>batched<\/code> library. Suppose you&#39;re using an embedding model with <code>sentence_transformers<\/code>:<\/p>\n<p><strong>Simple:<\/strong><\/p>\n<pre><code class=\"language-python\">from sentence_transformers import SentenceTransformer\nimport batched\n\n# Load your model as usual\nmodel = SentenceTransformer(&#39;mixedbread-ai\/mxbai-embed-large-v1&#39;)\n\n# Wrap the encode method with batched.dynamically\nmodel.encode = batched.dynamically(model.encode)\n<\/code><\/pre>\n<p>That&#39;s it! 
Your model&#39;s <code>encode<\/code> method now supports dynamic batching. When you call <code>model.encode<\/code>, Batched will automatically collect incoming requests and process them in batches:<\/p>\n<pre><code class=\"language-python\">embeddings = model.encode([&quot;Hello, world!&quot;])  # Now dynamically batched.\n<\/code><\/pre>\n<p><strong>Simple Async:<\/strong><\/p>\n<pre><code class=\"language-python\">from sentence_transformers import SentenceTransformer\nimport batched\n\n# Load your model as usual\nmodel = SentenceTransformer(&#39;mixedbread-ai\/mxbai-embed-large-v1&#39;)\n\n# Wrap the encode method with batched.aio.dynamically\nmodel.encode = batched.aio.dynamically(model.encode)\n<\/code><\/pre>\n<p>That&#39;s it! Your model&#39;s <code>encode<\/code> method now supports dynamic batching. When you call <code>model.encode<\/code>, Batched will automatically collect incoming requests and process them in batches:<\/p>\n<pre><code class=\"language-python\">embeddings = await model.encode([&quot;Hello, world!&quot;])  # Now dynamically batched.\n<\/code><\/pre>\n<p><strong>Forward function:<\/strong><\/p>\n<pre><code class=\"language-python\">from sentence_transformers import SentenceTransformer\nfrom batched import inference\n\nclass Model(SentenceTransformer):\n  @inference.dynamically\n  def forward(self, features):\n    return super().forward({key: value.to(self.device) for key, value in features.items()})\n\nmodel = Model(&#39;mixedbread-ai\/mxbai-embed-large-v1&#39;)\n<\/code><\/pre>\n<p>That&#39;s it! Your model&#39;s <code>forward<\/code> method now supports dynamic batching. 
When you call <code>model.forward<\/code>, Batched will automatically collect incoming requests and process them in batches:<\/p>\n<pre><code class=\"language-python\">embeddings = model.encode([&quot;Hello, world!&quot;])  # Now dynamically batched.\n<\/code><\/pre>\n<h3>Using Batched in an API<\/h3>\n<p>Dynamic batching is mostly used in inference APIs, where multiple clients might be sending requests simultaneously. By batching these requests, you can significantly improve throughput and reduce per-request latency.<\/p>\n<p>Here&#39;s how you could use Batched in an API using <a href=\"https:\/\/fastapi.tiangolo.com\/\">FastAPI<\/a>:<\/p>\n<p><strong>Sync:<\/strong><\/p>\n<pre><code class=\"language-python\">from fastapi import FastAPI\nfrom fastapi.responses import ORJSONResponse\nfrom sentence_transformers import SentenceTransformer\nfrom pydantic import BaseModel\nimport uvicorn\nimport batched\n\napp = FastAPI()\n\nmodel = SentenceTransformer(&#39;mixedbread-ai\/mxbai-embed-large-v1&#39;)\nmodel.encode = batched.dynamically(model.encode)\n\nclass EmbeddingsRequest(BaseModel):\n    input: str | list[str]\n\n@app.post(&quot;\/embeddings&quot;)\ndef embeddings(request: EmbeddingsRequest):\n    embeddings = model.encode(request.input)\n    return ORJSONResponse({&quot;embeddings&quot;: embeddings})\n\nif __name__ == &quot;__main__&quot;:\n    uvicorn.run(app, host=&quot;0.0.0.0&quot;, port=8000)\n<\/code><\/pre>\n<p><strong>Async:<\/strong><\/p>\n<pre><code class=\"language-python\">from fastapi import FastAPI\nfrom fastapi.responses import ORJSONResponse\nfrom sentence_transformers import SentenceTransformer\nfrom pydantic import BaseModel\nimport uvicorn\nimport batched\n\napp = FastAPI()\n\nmodel = SentenceTransformer(&#39;mixedbread-ai\/mxbai-embed-large-v1&#39;)\nmodel.encode = batched.aio.dynamically(model.encode)\n\nclass EmbeddingsRequest(BaseModel):\n    input: str | list[str]\n\n@app.post(&quot;\/embeddings&quot;)\nasync def embeddings(request: 
EmbeddingsRequest):\n    embeddings = await model.encode(request.input)\n    return ORJSONResponse({&quot;embeddings&quot;: embeddings})\n\nif __name__ == &quot;__main__&quot;:\n    uvicorn.run(app, host=&quot;0.0.0.0&quot;, port=8000)\n<\/code><\/pre>\n<h3>How Batched Works Under the Hood<\/h3>\n<p>Batched intercepts calls to the wrapped function and places them into a queue. It waits for a short, configurable amount of time to collect additional requests. Once the time elapses or a maximum batch size is reached, it processes all collected requests together. If used for the forward pass of a model, it will also handle padding the inputs to the correct size.<\/p>\n<p>This makes sure that requests are batched without introducing significant latency. The default settings are optimized for common use cases, but you can adjust parameters like maximum wait time or batch size to better suit your needs.<\/p>\n<h2>Benchmarks and Performance Gains<\/h2>\n<p>While specific performance gains can vary depending on your use case and hardware, our benchmarks have shown promising results.<\/p>\n<p>We set up a test environment where we simulated multiple clients sending requests to an inference API. 
Without dynamic batching, the GPU was underutilized, and the system struggled to handle a high volume of concurrent requests.<\/p>\n<p>After integrating Batched, we observed:<\/p>\n<ul>\n<li><strong>Up to 10x improvement in throughput<\/strong>: The system could process many more requests per second.<\/li>\n<li><strong>Reduced per-request latency<\/strong>: Despite the slight delay introduced by batching, the overall time from request to response decreased due to more efficient processing.<\/li>\n<li><strong>Better GPU utilization<\/strong>: The GPU&#39;s computational capacity was utilized more efficiently.<\/li>\n<\/ul>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/dynamic-batching\/benchmark.png\" alt=\"Batched Benchmarks\"><\/p>\n<p>It&#39;s important to note that the exact improvements you&#39;ll see may differ based on your specific setup, model complexity, and request patterns. We strongly recommend benchmarking dynamic batching in your own workflows to get a clear picture of the benefits for your use case.<\/p>\n<h2>Feedback<\/h2>\n<p>As always, we greatly welcome any feedback! Whether you&#39;re implementing dynamic batching or exploring other optimization techniques, we&#39;d love to hear about your experiences and challenges.<\/p>\n<p>If you&#39;re interested in collaborating or have ideas on how to improve Batched, please reach out to us on our <a href=\"https:\/\/mixedbread.com\/urls\/discord\">Discord<\/a> or contribute to the <a href=\"https:\/\/github.com\/mixedbread-ai\/batched\">GitHub repository<\/a>.<\/p>\n<p>Happy batching! \ud83c\udf5e<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/dynamic-batching","title":"Baking in Performance - Dynamic Batching with Batched","summary":"Learn how dynamic batching boosts GPU efficiency in machine learning. 
We introduce Batched, our open-source library that makes it easy to implement dynamic batching and improve performance in your ML projects.","image":"https:\/\/www.mixedbread.com\/images\/blog\/dynamic-batching\/intro-batched.jpg","date_modified":"2024-09-16T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["engineering"]},{"id":"https:\/\/www.mixedbread.com\/blog\/intro-baguetter","content_html":"<p>We are excited to introduce Baguetter, our new open-source retrieval testing framework! Baguetter combines sparse, dense, and hybrid retrieval into a single, easy-to-use interface. It enables fast benchmarking, implementation, and testing of new search methods.\nRead on to learn more about our approach and to check out some usage examples. If you want to skip right to the framework instead, head over to the <a href=\"https:\/\/github.com\/mixedbread-ai\/baguetter\">Baguetter GitHub repository<\/a>.<\/p>\n<h2>Why Baguetter?<\/h2>\n<p>While traditional search remains the foundation of most information retrieval systems, semantic search is gaining more and more traction. We&#39;re committed to advancing retrieval systems, but we identified a gap in our toolkit: we lacked a testing framework that could handle both traditional (sparse) and semantic (dense) search methodologies.<\/p>\n<p>Existing solutions often focus on a single method, present challenges in extensibility, or come with large codebases. We encountered these limitations firsthand during our work on <a href=\"https:\/\/mixedbread.com\/blog\/intro-bmx\">BMX<\/a> and while experimenting with various fusion algorithms for hybrid search. To address these challenges, we created Baguetter. 
It offers:<\/p>\n<ol>\n<li>Easy extensibility<\/li>\n<li>Support for lexical, dense, and hybrid search, as well as embedding quantization and reranking<\/li>\n<li>Pure Python implementation with an easy-to-use API<\/li>\n<li>Good performance<\/li>\n<\/ol>\n<h2>Building on Solid Foundations<\/h2>\n<p>Baguetter is a fork of <a href=\"https:\/\/github.com\/AmenRa\/retriv\">retriv<\/a>, an open-source project created by Elias Bassani. We&#39;ve adapted retriv to enhance its flexibility and added efficient implementations of keyword search algorithms, mainly <a href=\"https:\/\/github.com\/xhluca\/bm25s\">BM25S<\/a> and <a href=\"https:\/\/mixedbread.com\/blog\/intro-bmx\">BMX<\/a>. For dense search capabilities, we leverage <a href=\"https:\/\/github.com\/unum-cloud\/usearch\">USearch<\/a> and <a href=\"https:\/\/github.com\/facebookresearch\/faiss\">Faiss<\/a>. This combination allows us to explore a wide range of possibilities, from modifying BM25 tokenization to testing embedding model performance with quantization.<\/p>\n<h2>Using It<\/h2>\n<p>Getting started with Baguetter is straightforward:<\/p>\n<pre><code class=\"language-bash\">pip install baguetter\n<\/code><\/pre>\n<p>Baguetter&#39;s strength lies in its unified interface, allowing easy evaluation of different search methods:<\/p>\n<p><strong>Sparse:<\/strong><\/p>\n<pre><code class=\"language-python\">from baguetter.indices import BMXSparseIndex\n\n# Create a simple sparse index\nindex = BMXSparseIndex()\n\n# Sample documents\ndoc_ids = [&quot;1&quot;, &quot;2&quot;, &quot;3&quot;, &quot;4&quot;]\ndocs = [\n    &quot;We all love baguette and cheese&quot;,\n    &quot;Baguette is a great bread&quot;,\n    &quot;Cheese is a great source of protein&quot;,\n    &quot;Baguette is a great source of carbs&quot;,\n]\n\n# Add documents to the index\nindex.add_many(doc_ids, docs)\n\n# Perform a search\nresults = index.search(&quot;baguette and cheese&quot;)\nprint(results) # SearchResults(keys=[&#39;1&#39;, &#39;3&#39;, 
&#39;2&#39;, &#39;4&#39;], scores=array([1.888962  , 0.73008496, 0.60060966, 0.5797755 ], dtype=float32), normalized=False)\n<\/code><\/pre>\n<p><strong>Dense:<\/strong><\/p>\n<pre><code class=\"language-python\">from ofen.models import TextEncoder\nfrom baguetter.indices import USearchDenseIndex\n\n# Create an embedding function\nmodel = TextEncoder.from_pretrained(&quot;mixedbread-ai\/mxbai-embed-large-v1&quot;)\nmodel.half()\n\ndef embed_fn(text: list[str], is_query: bool = False, show_progress: bool = False):\n    if is_query:\n        text = [f&quot;Represent this sentence for searching relevant passages: {query}&quot; for query in text]\n    return model.encode(text, batch_size=32, show_progress=show_progress).embeddings\n\n# Create a dense index\nindex = USearchDenseIndex(embed_fn=embed_fn)\n\n# Sample documents\ndoc_ids = [&quot;1&quot;, &quot;2&quot;, &quot;3&quot;, &quot;4&quot;]\ndocs = [\n    &quot;We all love baguette and cheese&quot;,\n    &quot;Baguette is a great bread&quot;,\n    &quot;Cheese is a great source of protein&quot;,\n    &quot;Baguette is a great source of carbs&quot;,\n]\n\n# Add documents to the index\nindex.add_many(doc_ids, docs, show_progress=True)\n\n# Perform a search\nresults = index.search(&quot;baguette and cheese&quot;)\nprint(results) # SearchResults(keys=[&#39;1&#39;, &#39;2&#39;, &#39;4&#39;, &#39;3&#39;], scores=array([0.77662474, 0.7557292 , 0.73353887, 0.63068825], dtype=float32), normalized=True)\n<\/code><\/pre>\n<p><strong>Hybrid:<\/strong><\/p>\n<pre><code class=\"language-python\">from ofen.models import TextEncoder\nfrom baguetter.indices import HybridIndex, BMXSparseIndex, USearchDenseIndex\n\n# Set up embedding function (as in the Dense example)\nembed_fn = ...\n\n# Create individual indices\nsparse_index = BMXSparseIndex(normalize_scores=True)\ndense_index = USearchDenseIndex(embed_fn=embed_fn, normalize_scores=True)\n\n# Combine into a hybrid index\nhybrid_index = HybridIndex(indices={&quot;sparse&quot;: 
sparse_index, &quot;dense&quot;: dense_index})\n\n# Sample documents\ndoc_ids = [&quot;1&quot;, &quot;2&quot;, &quot;3&quot;, &quot;4&quot;]\ndocs = [\n    &quot;We all love baguette and cheese&quot;,\n    &quot;Baguette is a great bread&quot;,\n    &quot;Cheese is a great source of protein&quot;,\n    &quot;Baguette is a great source of carbs&quot;,\n]\n\n# Add documents to the index\nhybrid_index.add_many(doc_ids, docs, show_progress=True)\n\n# Perform a search\nresults = hybrid_index.search(&quot;baguette and cheese&quot;)\nprint(results)\n<\/code><\/pre>\n<p><strong>Reranking:<\/strong><\/p>\n<pre><code class=\"language-python\">from ofen.models import CrossEncoder\n\nfrom baguetter.evaluation import evaluate_retrievers\nfrom baguetter.indices import SearchEngine, BMXSparseIndex\nfrom baguetter.utils.model_helpers import create_post_processing_fn\n\n# Define reranking model\nreranker = CrossEncoder.from_pretrained(&quot;mixedbread-ai\/mxbai-rerank-xsmall-v1&quot;)\n\n# Define reranking engine factory\nreranking = lambda: SearchEngine(\n    post_processing_fn=create_post_processing_fn(reranker),\n    index=BMXSparseIndex()\n)\n\n# Evaluate\nresult = evaluate_retrievers(retrievers={&quot;reranking&quot;: reranking}, dataset=dataset)\nresult.save(&quot;eval_results&quot;)\n<\/code><\/pre>\n<p><strong>MRL\/Quantization:<\/strong><\/p>\n<pre><code class=\"language-python\">from functools import partial\nfrom ofen.models import TextEncoder\n\nfrom baguetter.indices import USearchDenseIndex\nfrom baguetter.utils.model_helpers import create_embed_fn\n\n# Set up embedding function\nmodel = TextEncoder.from_pretrained(&quot;mixedbread-ai\/mxbai-embed-large-v1&quot;)\nmodel.half()\n\n# Create cached embedding function for evaluation\nembed_fn = create_embed_fn(model, query_prompt=&quot;Represent this sentence for searching relevant passages: &quot;, batch_size=256, truncation_dim=512)\n\n# Create indices with different quantization settings\nfloat32_index = 
USearchDenseIndex(embed_fn=embed_fn)\n\nint8_index = USearchDenseIndex(\n    embedding_dim=512,\n    embed_fn=partial(embed_fn, encoding_format=&quot;int8&quot;),\n)\n\nubinary_index = USearchDenseIndex(\n    embedding_dim=512,\n    embed_fn=partial(embed_fn, encoding_format=&quot;ubinary&quot;),\n    metric=&quot;hamming&quot;,\n)\n\ndoc_ids = [&quot;1&quot;, &quot;2&quot;, &quot;3&quot;, &quot;4&quot;]\ndocs = [\n    &quot;We all love baguette and cheese&quot;,\n    &quot;Baguette is a great bread&quot;,\n    &quot;Cheese is a great source of protein&quot;,\n    &quot;Baguette is a great source of carbs&quot;,\n]\n\n# Add documents to the index\nfloat32_index.add_many(doc_ids, docs, show_progress=True)\nint8_index.add_many(doc_ids, docs, show_progress=True)\nubinary_index.add_many(doc_ids, docs, show_progress=True)\n\n# Perform a search\nresults_float32 = float32_index.search(&quot;baguette and cheese&quot;)\nresults_int8 = int8_index.search(&quot;baguette and cheese&quot;)\nresults_ubinary = ubinary_index.search(&quot;baguette and cheese&quot;)\n\nprint(results_float32)\nprint(results_int8)\nprint(results_ubinary)\n<\/code><\/pre>\n<h2>Evaluation<\/h2>\n<h3>Understanding Evaluation Metrics<\/h3>\n<p>Baguetter uses <a href=\"https:\/\/github.com\/AmenRa\/ranx\">ranx<\/a> to evaluate different retrieval metrics. These metrics help assess various aspects of a search engine&#39;s effectiveness and can guide optimization efforts. The key metrics include:<\/p>\n<ol>\n<li><p>Normalized Discounted Cumulative Gain (nDCG@k): A metric that takes into account both the position and the relevance grade of retrieved documents. It&#39;s particularly useful for tasks with graded relevance judgments and for tasks where the order of results matters.<\/p>\n<\/li>\n<li><p>Precision@k: Measures the proportion of relevant documents in the top k results. 
This is valuable for ensuring that a high percentage of top results are relevant.<\/p>\n<\/li>\n<li><p>Mean Reciprocal Rank (MRR): Focuses on the position of the first relevant document in the ranked list. This metric is useful for tasks where finding a single correct answer quickly is important.<\/p>\n<\/li>\n<li><p>Recall@k: Calculates the proportion of all relevant documents retrieved in the top k results. This metric is critical for tasks where finding as many relevant documents as possible is important.<\/p>\n<\/li>\n<\/ol>\n<p>When interpreting these results, consider the following:<\/p>\n<ul>\n<li>Align the metrics with your use case. Different applications may benefit from using different metrics.<\/li>\n<li>Be aware of trade-offs between metrics. Improvements in one metric may come at the cost of deteriorations in another. For example, tuning for high precision could potentially reduce recall.<\/li>\n<li>Understanding these trade-offs is crucial when optimizing your system for real-world performance.<\/li>\n<\/ul>\n<h3>Setting Up Evaluation Datasets<\/h3>\n<p>Text information retrieval datasets typically consist of three key components:<\/p>\n<ol>\n<li>A collection of documents with associated ids (corpus)<\/li>\n<li>A set of queries with associated ids (queries)<\/li>\n<li>Query-Document relevance judgments (qrels) (ground truth)<\/li>\n<\/ol>\n<p>While you would typically build a dataset yourself to judge the performance of your search engine, in this case we&#39;re using existing datasets from the <a href=\"https:\/\/github.com\/embeddings-benchmark\/mteb\">MTEB (Massive Text Embedding Benchmark)<\/a>.<\/p>\n<p>Baguetter provides a wrapper for loading datasets from the Hugging Face Hub:<\/p>\n<pre><code class=\"language-python\">from baguetter.evaluation import HFDataset\n\ndatasets = [\n    HFDataset(&quot;mteb\/scifact&quot;),\n    HFDataset(&quot;mteb\/scidocs&quot;)\n]\n<\/code><\/pre>\n<h3>Running Evaluations<\/h3>\n<p>Once you have your datasets 
ready, you can evaluate different retrieval methods. In the following example, we&#39;re comparing BMX and BM25 sparse retrieval, but you can use any index that extends the <code>BaseIndex<\/code> interface.<\/p>\n<pre><code class=\"language-python\">from baguetter.evaluation import evaluate_retrievers\nfrom baguetter.indices import BMXSparseIndex, BM25SparseIndex\n\n# Create index factories (each call returns a fresh index)\nbmx_factory = lambda: BMXSparseIndex()\nbm25_factory = lambda: BM25SparseIndex()\n\n# Evaluate\nresult = evaluate_retrievers(\n    datasets=datasets,\n    retriever_factories={\n        &quot;bmx&quot;: bmx_factory,\n        &quot;bm25&quot;: bm25_factory,\n    },\n    metrics=[&quot;ndcg@1&quot;, &quot;ndcg@5&quot;, &quot;ndcg@10&quot;, &quot;precision@1&quot;, &quot;precision@5&quot;, &quot;precision@10&quot;, &quot;mrr@1&quot;, &quot;mrr@5&quot;, &quot;mrr@10&quot;],\n)\nresult.save(&quot;eval_results&quot;)\n#Report for mteb\/scifact (rounded):\n#---------------------------------------------------------------\n#    Model      NDCG@1    NDCG@5    NDCG@10    P@1    P@5    P@10    MRR@1    MRR@5    MRR@10\n#---  -------  --------  --------  ---------  -----  -----  ------  -------  -------  --------\n#a    bmx          0.55     0.663      0.694   0.55  0.161   0.093     0.55    0.643     0.655\n#b    bm25         0.54     0.66       0.687   0.54  0.161   0.091     0.54    0.638     0.649\n#....\n<\/code><\/pre>\n<p>This will run the evaluation and save the results to the specified directory for you to analyze.<\/p>\n<h2>Feedback<\/h2>\n<p>We welcome your feedback and thoughts through our <a href=\"https:\/\/mixedbread.com\/urls\/discord\">Discord community<\/a>. 
We&#39;re here to assist and discuss topics related to information retrieval.<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/intro-baguetter","title":"Getting Better with Baguetter - New Retrieval Testing Framework","summary":"Baguetter is our new open-source retrieval experimentation framework that combines sparse, dense, and hybrid retrieval into a single, easy-to-use interface. It enables fast benchmarking, implementation, and testing of new search methods.","image":"https:\/\/www.mixedbread.com\/images\/blog\/intro-baguetter\/baguetter.jpg","date_modified":"2024-08-23T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["engineering"]},{"id":"https:\/\/www.mixedbread.com\/blog\/intro-bmx","content_html":"<p>We are proud to announce that researchers from Mixedbread and the Hong Kong Polytechnic University have developed a new lexical search algorithm, BMX, that outperforms the current standard BM25 across the board and is easy to use via Mixedbread&#39;s open-source Baguetter library.<\/p>\n<p>Read on to learn about BMX, how it works, our benchmarks, and how to use it in practice. If you want to jump right in, you can look into the paper and the library right here:<\/p>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2408.06643\">BMX Paper<\/a>: Our research paper covering the approach in more academic detail.<\/li>\n<li><a href=\"https:\/\/github.com\/mixedbread-ai\/baguetter\">Baguetter<\/a>: Our reference implementation of BMX and an easy to hack testing framework for information retrieval.<\/li>\n<\/ul>\n<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br \/>\n  Our new BMX search algorithm iterates on the long-standing industry standard\n  and can be accessed via our fully open-source Baguetter library.<\/p>\n<\/blockquote>\n<h2>What is BMX?<\/h2>\n<p>Most text search engines are powered by lexical (keyword) search algorithms, the most prominent of which is BM25. 
It powers web search, e-commerce search, legal search, and many other applications. A key strength of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\">BM25<\/a> is that it performs well in out-of-distribution settings, meaning that it generalizes well to data it has not seen before. However, the keyword search approach comes with its own limitations:<\/p>\n<ol>\n<li>BM25 scores each query token separately and does not consider the similarity between the query as a whole and any given document, which could enable a more accurate assessment of this document&#39;s relevance to the query.<\/li>\n<li>Lexical search algorithms lack semantic understanding and are therefore unable to deal with linguistic nuances like synonyms and homonyms. This limitation is a key factor in the underperformance of lexical search compared to domain-specific text embedding-based semantic search.<\/li>\n<\/ol>\n<p>To tackle these challenges, we propose BMX, a lexical search algorithm that incorporates both similarity and semantics. It comes with these key innovations:<\/p>\n<ol>\n<li><p><strong>Entropy-weighted similarity<\/strong>: We adjust the similarity scores between query tokens and related documents based on the entropy of each token. This approach reduces the influence of common tokens, allowing less frequent but more meaningful tokens to have a greater impact on the similarity score.<\/p>\n<\/li>\n<li><p><strong>Weighted query augmentation (WQA)<\/strong>: To incorporate semantics into lexical search, we developed a method that processes both the original query and its augmented versions simultaneously within BMX. This approach removes the need for multiple retrievals and reranking steps, leading to increased efficiency.<\/p>\n<\/li>\n<\/ol>\n<h2>How does BMX work?<\/h2>\n<p>Now, we will explore the core scoring algorithm in BMX as well as weighted query augmentation (WQA). 
This will be a math-heavy section, but please bear with us!<\/p>\n<Accordions class=\"not-prose [&>div]:border [&>div]:rounded-md\">\n    <Accordion title=\"Read more about BMX\">\n    <h3>TF-IDF<\/h3>\n\n<pre><code>Consider a set of documents $\\mathcal{D}$ consisting of $\\{D_1, D_2, \\ldots, D_n\\}$ documents where $n$ indicates the total number of documents, and a query $Q$ consisting of $\\{q_1, q_2, \\ldots, q_m\\}$ tokens where $m$ is the total number of query tokens. &lt;span class=&quot;font-bold&quot;&gt;Term frequency (TF)&lt;\/span&gt; $F(q_i, D)$ measures the relative frequency of the token $q_i$ in the document $D$:\n\n```math\nF(q_i, D) = \\frac{\\mathrm{count}(q_i, D)}{\\left | D \\right |}\n```\n\nwhere $\\mathrm{count}(q_i, D)$ is the number of occurrences of $q_i$ in $D$ and $\\left | D \\right |$ is the total number of tokens in $D$.\n&lt;span class=&quot;font-bold&quot;&gt;Inverse document frequency&lt;\/span&gt; $\\mathrm{IDF}(q_i)$ measures how much information $q_i$ provides across the document set $\\mathcal{D}$ based on the assumption that a rare token is more informative than a common one.  If $\\mathcal{D}^s$ is the set of documents containing $q_i$, then\n\n```math\n\\mathrm{IDF}(q_i) = \\mathrm{log} \\frac{\\left | \\mathcal{D} \\right |}{\\left | \\mathcal{D}^s \\right |}\n```\n\nwhere $\\left | \\mathcal{D} \\right |$ and $\\left | \\mathcal{D}^s \\right |$ are the total number of documents in $\\mathcal{D}$ and in $\\mathcal{D}^s$ respectively. The naive &lt;span class=&quot;font-bold&quot;&gt;score&lt;\/span&gt; of a document $D$ w.r.t. a query token $q_i$ can be obtained by multiplying its TF and IDF:\n\n```math\n\\mathrm{score}(q_i, D) = \\mathrm{IDF}(q_i) \\cdot F(q_i, D)\n```\n\n&lt;h3&gt;BM25&lt;\/h3&gt;\n\nTF-IDF looks nice, but in practice it suffers from the lack of control of $F(q_i, D)$, as it linearly rewards the token occurrences. 
If a token appears 5 times more often than another, its TF value will undesirably be 5 times larger than that of the other token. To put a cap on $F(q_i, D)$, BM25 was developed to use the trick of term saturation:\n\n```math\n\\mathrm{score}_{25}(q_i, D) = \\mathrm{IDF}(q_i) \\cdot \\mathcal{F}_{k_1, b}(q_i, D)\n```\n\nwhere\n\n```math\n\\mathcal{F}_{k_1, b}(q_i, D) = \\frac{F(q_i, D) \\cdot (k_1 + 1)}{F(q_i, D) +  k_1 \\cdot (b \\cdot \\frac{\\left | D \\right |}{avgdl} + 1-b)}\n```\n\nparameterized by $k_1$, $b$ and document length $\\left | D \\right |$, with $avgdl$ denoting the average document length in $\\mathcal{D}$. $\\mathcal{F}_{k_1, b}(q_i, D)$ has the structure of $\\frac{F(q_i, D) \\cdot A}{F(q_i, D) + B}$, ensuring a diminishing marginal contribution from $F(q_i, D)$. It also introduces $\\frac{\\left | D \\right |}{avgdl}$ into the denominator to reward matches in short documents.\n\n&lt;h3&gt;BMX&lt;\/h3&gt;\n\nBM25 is already pretty nice. BMX takes BM25 and puts it on steroids by leveraging entropy-weighted similarity to enable a more precise relevance assessment between query and documents. Briefly, &lt;span class=&quot;font-bold&quot;&gt;the BMX score&lt;\/span&gt; of a document $D$ w.r.t. a query token $q_i$ can be formulated as follows:\n\n```math\n\\mathrm{score}_x(q_i, D) =\\mathrm{IDF}(q_i) \\cdot \\mathcal{F}_{\\alpha}(q_i, D) + \\beta \\cdot  \\mathcal{S}(q_i, D)\n```\n\nCompared with BM25, BMX adopts a new term frequency $\\mathcal{F}_{\\alpha}(q_i, D)$ parameterized by $\\alpha$ and an extra component $\\mathcal{S}(q_i, D)$ weighted by $\\beta$.  BM25 only considers how relevant a document $D$ is to an individual query token $q_i$, overlooking its relevance to the whole query $Q$. We believe that the introduction of a &lt;span class=&quot;font-bold&quot;&gt;query-wise similarity measurement&lt;\/span&gt; $S(Q, D)$ is beneficial. 
There are various measurements we can use for this purpose, for example:\n```math\nS(Q, D) = \\frac{\\left | Q \\cap D \\right |}{\\left |Q \\right |}\n```\nis a set-based metric that counts the common tokens between $Q$ and $D$. Then, the share of similarity contribution from the token $q_i$ is:\n\n```math\n\\mathcal{S}(q_i, D) = E(q_i)\\cdot S(Q, D)\n```\n\n$E(q_i)$ is the normalized entropy of $q_i$, reflecting the relative importance of $q_i$ among all the query tokens in $Q$. For more details about how the entropy $E(q_i)$ and the new term frequency $\\mathcal{F}_{\\alpha} (q_i, D)$ are calculated, please refer to &lt;a href=&quot;https:\/\/arxiv.org\/abs\/2408.06643&quot;&gt; our paper. &lt;\/a&gt;\n\n&lt;h3&gt;WQA&lt;\/h3&gt;\n\nThe big disadvantage of lexical search algorithms, including BM25 and BMX, compared to embedding-based methods, is their lack of semantic understanding of queries, as they retrieve the documents solely based on query tokens, overlooking, e.g., synonyms and homonyms. \n\nTo address this limitation, query augmentation can be done through manually crafting a collection of synonyms and homonyms. Here, we suggest using an LLM to augment queries. 
\n\nGiven a query $Q$, we can prompt the LLM to generate $t$ augmented queries $\\mathcal{Q}_A = \\{Q_1, Q_2, \\ldots, Q_t\\}$.\nTo compute the overall relevance of document $D$ given a query $Q$, we combine the score for the original query with the weighted scores for the augmented queries $\\mathcal{Q}_A$:\n```math\n\\mathrm{score} (Q, \\mathcal{Q}_A, D) = \\mathrm{score}(Q, D) + \\sum_{i=1}^{t} w_i \\cdot \\mathrm{score}(Q_i, D)\n```\nwhere $w_i$ is the weight assigned to the augmented query $Q_i$. This weighted query augmentation (WQA) scheme can incorporate the semantic understanding of an LLM into lexical search algorithms while still preserving their computational efficiency.\n&lt;\/Accordion&gt;\n<\/code><\/pre>\n<\/Accordions>\n\n<h2>Performance and Benchmarking<\/h2>\n<p>To validate our ideas and assess the performance of BMX, we evaluated it extensively on different benchmarks including <a href=\"https:\/\/arxiv.org\/abs\/2104.08663\">BEIR<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2407.12883\">BRIGHT<\/a>, and some multilingual benchmarks. You can find a full overview of our benchmarking in the <a href=\"https:\/\/arxiv.org\/abs\/2408.06643\">BMX paper<\/a>. Our results show that BMX outperforms BM25 across the board and can offer a significant improvement in retrieval quality.<\/p>\n<h3>BEIR<\/h3>\n<p>BEIR is a benchmark focused on out-of-domain information retrieval. It covers datasets for different domains, document or query lengths, and tasks. 
After evaluation on the 15 available benchmarks, we compare BMX to the implementations of BM25 in <a href=\"https:\/\/github.com\/mixedbread-ai\/baguetter\">Baguetter<\/a> and the <a href=\"https:\/\/github.com\/xhluca\/bm25s\">bm25s<\/a> library:<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>AA<\/th>\n<th>SD<\/th>\n<th>SF<\/th>\n<th>NFC<\/th>\n<th>TCV<\/th>\n<th>TCH<\/th>\n<th>FQA<\/th>\n<th>HQA<\/th>\n<th>MM<\/th>\n<th>FVR<\/th>\n<th>NQ<\/th>\n<th>DBP<\/th>\n<th>QRA<\/th>\n<th>CF<\/th>\n<th>CQA<\/th>\n<th>Avg.<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>BM25S, k1=1.2, b=0.75<\/strong><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>Robertson<\/td>\n<td>49.2<\/td>\n<td>15.5<\/td>\n<td>68.3<\/td>\n<td>31.9<\/td>\n<td>59.0<\/td>\n<td>33.8<\/td>\n<td>25.4<\/td>\n<td>58.5<\/td>\n<td>22.6<\/td>\n<td>50.3<\/td>\n<td>29.2<\/td>\n<td>30.3<\/td>\n<td>80.4<\/td>\n<td>13.7<\/td>\n<td>29.9<\/td>\n<td>39.87<\/td>\n<\/tr>\n<tr>\n<td>ATIRE<\/td>\n<td>48.7<\/td>\n<td>15.6<\/td>\n<td>68.1<\/td>\n<td>31.8<\/td>\n<td>61.0<\/td>\n<td>33.2<\/td>\n<td>25.3<\/td>\n<td>58.5<\/td>\n<td>22.6<\/td>\n<td>50.3<\/td>\n<td>29.1<\/td>\n<td>30.3<\/td>\n<td>80.5<\/td>\n<td>13.7<\/td>\n<td>30.1<\/td>\n<td>39.92<\/td>\n<\/tr>\n<tr>\n<td>BM25+<\/td>\n<td>48.7<\/td>\n<td>15.6<\/td>\n<td>68.1<\/td>\n<td>31.8<\/td>\n<td>61.0<\/td>\n<td>33.2<\/td>\n<td>25.3<\/td>\n<td>58.5<\/td>\n<td>22.6<\/td>\n<td>50.3<\/td>\n<td>29.1<\/td>\n<td>30.3<\/td>\n<td>80.5<\/td>\n<td>13.7<\/td>\n<td>30.1<\/td>\n<td>39.92<\/td>\n<\/tr>\n<tr>\n<td>BM25L<\/td>\n<td>49.6<\/td>\n<td>15.8<\/td>\n<td>68.7<\/td>\n<td>32.2<\/td>\n<td>62.9<\/td>\n<td>33.0<\/td>\n<td>25.0<\/td>\n<td>55.9<\/td>\n<td>21.4<\/td>\n<td>46.6<\/td>\n<td>28.1<\/td>\n<td>29.4<\/td>\n<td>80.3<\/td>\n<td>13.5<\/td>\n<td>29.8<\/td>\n<td>39.48<\/td>\n<\/tr>\n<tr>\n<td>BM25<\/td>\n<td>48.7<
\/td>\n<td>15.6<\/td>\n<td>68.0<\/td>\n<td>31.8<\/td>\n<td>61.0<\/td>\n<td>33.2<\/td>\n<td>25.3<\/td>\n<td>58.5<\/td>\n<td>22.6<\/td>\n<td>50.3<\/td>\n<td>29.1<\/td>\n<td>30.3<\/td>\n<td><strong>80.5<\/strong><\/td>\n<td><strong>13.7<\/strong><\/td>\n<td>30.1<\/td>\n<td>39.91<\/td>\n<\/tr>\n<tr>\n<td><strong>Baguetter, k1=1.2, b=0.75<\/strong><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>Robertson<\/td>\n<td>49.55<\/td>\n<td>15.69<\/td>\n<td>68.71<\/td>\n<td>32.85<\/td>\n<td>64.36<\/td>\n<td>30.95<\/td>\n<td>25.27<\/td>\n<td>59.13<\/td>\n<td>23.19<\/td>\n<td>50.93<\/td>\n<td>28.55<\/td>\n<td>31.69<\/td>\n<td>78.73<\/td>\n<td>13.54<\/td>\n<td>30.73<\/td>\n<td>40.26<\/td>\n<\/tr>\n<tr>\n<td>ATIRE<\/td>\n<td>49.23<\/td>\n<td>15.66<\/td>\n<td>68.73<\/td>\n<td>32.71<\/td>\n<td>65.94<\/td>\n<td>31.15<\/td>\n<td>25.30<\/td>\n<td>59.13<\/td>\n<td>23.23<\/td>\n<td>50.98<\/td>\n<td>28.59<\/td>\n<td>31.56<\/td>\n<td>78.72<\/td>\n<td>13.55<\/td>\n<td>31.00<\/td>\n<td>40.37<\/td>\n<\/tr>\n<tr>\n<td>BM25+<\/td>\n<td>49.23<\/td>\n<td>15.66<\/td>\n<td>68.73<\/td>\n<td>32.71<\/td>\n<td>65.94<\/td>\n<td>31.15<\/td>\n<td>25.30<\/td>\n<td>59.13<\/td>\n<td>23.23<\/td>\n<td>50.98<\/td>\n<td>28.59<\/td>\n<td>31.56<\/td>\n<td>78.72<\/td>\n<td>13.55<\/td>\n<td><strong>31.00<\/strong><\/td>\n<td>40.37<\/td>\n<\/tr>\n<tr>\n<td>BM25L<\/td>\n<td>50.32<\/td>\n<td>15.88<\/td>\n<td>69.37<\/td>\n<td><strong>33.00<\/strong><\/td>\n<td>67.05<\/td>\n<td>31.52<\/td>\n<td>24.79<\/td>\n<td>56.35<\/td>\n<td>22.01<\/td>\n<td>47.37<\/td>\n<td>27.35<\/td>\n<td>30.87<\/td>\n<td>78.21<\/td>\n<td>13.29<\/td>\n<td>30.87<\/td>\n<td>39.88<\/td>\n<\/tr>\n<tr>\n<td>BM25<\/td>\n<td>49.32<\/td>\n<td>15.65<\/td>\n<td>68.67<\/td>\n<td>32.68<\/td>\n<td>65.78<\/td>\n<td>31.11<\/td>\n<td>25.26<\/td>\n<td>59.09<\/td>\n<td>23.24<\/td>\n<td>50.98<\/td>\n<td
>28.85<\/td>\n<td>31.56<\/td>\n<td>78.73<\/td>\n<td>13.55<\/td>\n<td>30.99<\/td>\n<td>40.36<\/td>\n<\/tr>\n<tr>\n<td><strong>BMX<\/strong><\/td>\n<td><strong>50.46<\/strong><\/td>\n<td><strong>15.91<\/strong><\/td>\n<td><strong>69.42<\/strong><\/td>\n<td>32.84<\/td>\n<td><strong>68.1<\/strong><\/td>\n<td><strong>34.34<\/strong><\/td>\n<td><strong>25.39<\/strong><\/td>\n<td><strong>61.73<\/strong><\/td>\n<td><strong>24.21<\/strong><\/td>\n<td><strong>55.75<\/strong><\/td>\n<td><strong>29.84<\/strong><\/td>\n<td><strong>32.2<\/strong><\/td>\n<td>78.56<\/td>\n<td>13.32<\/td>\n<td>30.77<\/td>\n<td><strong>41.52<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>As the table shows, BMX generally outperforms all BM25 variants, achieving the best results on 11 out of 15 datasets. In our view, this underscores the importance of incorporating query-document similarity in our search algorithm.<\/p>\n<p>Other than that, we also observe that embedding-based models consistently outperform lexical models, which can be attributed to their powerful semantic understanding. However, due to their significant advantage regarding resource requirements, lexical search algorithms still remain much more widely used.<\/p>\n<h3>BRIGHT<\/h3>\n<p>BRIGHT was introduced as the first text retrieval benchmark requiring intensive reasoning to retrieve the relevant documents. The benchmark includes 1,385 real-world queries across diverse domains such as StackExchange, LeetCode, and math competitions. These queries are paired with, for example, relevant web pages linked in StackExchange answers and theorems tagged in math Olympiad questions, all of which necessitate deliberate reasoning to identify their connections to the respective queries. 
For this evaluation, we also took a look at the performance of BMX with WQA.<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>BIO<\/th>\n<th>ES<\/th>\n<th>ECO<\/th>\n<th>PSY<\/th>\n<th>ROB<\/th>\n<th>SO<\/th>\n<th>SL<\/th>\n<th>LC<\/th>\n<th>PONY<\/th>\n<th>AOPS<\/th>\n<th>TQ<\/th>\n<th>TT<\/th>\n<th>Avg.<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>Embedding Models<\/strong><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>E5-Mistral-7B<\/td>\n<td>18.80<\/td>\n<td>26.00<\/td>\n<td>15.50<\/td>\n<td>15.80<\/td>\n<td>16.40<\/td>\n<td>9.80<\/td>\n<td><strong>18.50<\/strong><\/td>\n<td>28.70<\/td>\n<td>4.80<\/td>\n<td>7.10<\/td>\n<td>23.90<\/td>\n<td><strong>25.10<\/strong><\/td>\n<td>17.50<\/td>\n<\/tr>\n<tr>\n<td><strong>Proprietary Models<\/strong><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>OpenAI<\/td>\n<td>23.70<\/td>\n<td>26.30<\/td>\n<td>20.00<\/td>\n<td><strong>27.50<\/strong><\/td>\n<td>12.90<\/td>\n<td>12.50<\/td>\n<td>20.30<\/td>\n<td>23.60<\/td>\n<td>2.50<\/td>\n<td>8.50<\/td>\n<td>22.20<\/td>\n<td>10.80<\/td>\n<td>17.57<\/td>\n<\/tr>\n<tr>\n<td>Cohere<\/td>\n<td>19.00<\/td>\n<td>27.50<\/td>\n<td>20.20<\/td>\n<td>21.80<\/td>\n<td>16.20<\/td>\n<td>16.50<\/td>\n<td>17.70<\/td>\n<td>26.80<\/td>\n<td>1.80<\/td>\n<td>6.50<\/td>\n<td>15.10<\/td>\n<td>7.10<\/td>\n<td>16.35<\/td>\n<\/tr>\n<tr>\n<td>Voyage<\/td>\n<td>23.60<\/td>\n<td>25.10<\/td>\n<td>19.80<\/td>\n<td>24.80<\/td>\n<td>11.20<\/td>\n<td>15.00<\/td>\n<td>15.60<\/td>\n<td><strong>30.60<\/strong><\/td>\n<td>1.50<\/td>\n<td>7.40<\/td>\n<td><strong>26.10<\/strong><\/td>\n<td>11.10<\/td>\n<td>17.65<\/td>\n<\/tr>\n<tr>\n<td><strong>Lexical 
Models<\/strong><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>BM25 ($k1=0.9$, $b=0.4$)<\/td>\n<td>19.25<\/td>\n<td>29.67<\/td>\n<td>16.89<\/td>\n<td>15.77<\/td>\n<td>14.19<\/td>\n<td>15.22<\/td>\n<td>15.30<\/td>\n<td>25.05<\/td>\n<td>4.62<\/td>\n<td>8.28<\/td>\n<td>6.57<\/td>\n<td>2.34<\/td>\n<td>14.43<\/td>\n<\/tr>\n<tr>\n<td>BM25 ($k1=0.9$, $b=0.4$) + WQA<\/td>\n<td>22.9<\/td>\n<td>33.86<\/td>\n<td>17.2<\/td>\n<td>24.33<\/td>\n<td>14.54<\/td>\n<td><strong>20.21<\/strong><\/td>\n<td>16.01<\/td>\n<td>26.61<\/td>\n<td>6.03<\/td>\n<td>8.41<\/td>\n<td>10.50<\/td>\n<td>3.29<\/td>\n<td>16.99<\/td>\n<\/tr>\n<tr>\n<td>BMX ($\\alpha=0.05$)<\/td>\n<td>26.4<\/td>\n<td>33.55<\/td>\n<td>13.29<\/td>\n<td>16.06<\/td>\n<td>14.68<\/td>\n<td>13.80<\/td>\n<td>15.47<\/td>\n<td>25.35<\/td>\n<td><strong>10.91<\/strong><\/td>\n<td>9.18<\/td>\n<td>8.17<\/td>\n<td>2.38<\/td>\n<td>15.77<\/td>\n<\/tr>\n<tr>\n<td>BMX ($\\alpha=0.05$) + WQA<\/td>\n<td><strong>31.57<\/strong><\/td>\n<td><strong>40.24<\/strong><\/td>\n<td>17.40<\/td>\n<td>23.53<\/td>\n<td><strong>16.67<\/strong><\/td>\n<td>18.32<\/td>\n<td>15.05<\/td>\n<td>28.34<\/td>\n<td>10.82<\/td>\n<td><strong>9.22<\/strong><\/td>\n<td>9.98<\/td>\n<td>2.74<\/td>\n<td><strong>18.66<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>The table demonstrates that BMX with weighted query augmentation outperforms all baselines, including embedding models and other lexical models. We attribute this to BMX&#39;s ability to generalize effectively across various domains, while embedding models often struggle with out-of-domain data. The results highlight that the WQA mechanism successfully enhances BMX\u2019s semantic understanding, enabling it to handle realistic retrieval scenarios more effectively than alternative solutions. 
Also, BMX achieves this without the need for costly training on massive high-quality datasets, as would be required for embedding models.<\/p>\n<h3>Multilingual Benchmark<\/h3>\n<p>To verify the performance of BMX on multilingual data, we also performed benchmarking on datasets for Chinese, Japanese, Korean, German, and French retrieval tasks.<\/p>\n<table>\n<thead>\n<tr>\n<th><strong>Model<\/strong><\/th>\n<th>MMarcoRetrieval (Chinese)<\/th>\n<th>JaQuAD (Japanese)<\/th>\n<th>StrategyQA (Korean)<\/th>\n<th>GermanDPR (German)<\/th>\n<th>FQuAD (French)<\/th>\n<th><strong>Avg.<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>BM25<\/td>\n<td>47.41<\/td>\n<td>54.26<\/td>\n<td>33.89<\/td>\n<td>52.27<\/td>\n<td>91.80<\/td>\n<td>55.93<\/td>\n<\/tr>\n<tr>\n<td><strong>BMX<\/strong><\/td>\n<td><strong>47.94<\/strong><\/td>\n<td><strong>54.63<\/strong><\/td>\n<td><strong>35.72<\/strong><\/td>\n<td><strong>53.58<\/strong><\/td>\n<td><strong>91.92<\/strong><\/td>\n<td><strong>56.76<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>The table presents the evaluation results, demonstrating that BMX consistently outperforms BM25 across these multilingual datasets. 
This suggests the increased effectiveness of BMX in handling diverse multilingual information retrieval tasks.<\/p>\n<h2>Try it out<\/h2>\n<h3>Installation<\/h3>\n<p>Before you start, make sure to install the Baguetter library:<\/p>\n<pre><code class=\"language-sh\">pip install baguetter\n<\/code><\/pre>\n<h3>Usage<\/h3>\n<p>The following examples show how to use BMX in practice:<\/p>\n<p><strong>Simple:<\/strong><\/p>\n<pre><code class=\"language-python\">from baguetter.indices import BMXSparseIndex\n\n# Initialize BM\ud835\udcb3 index\nbmx = BMXSparseIndex()\n\n# Add bakery items to the index\ndocs = [\n    &quot;Freshly crusty baked sourdough bread with a crispy crust&quot;,\n    &quot;Flaky croissants made with French butter&quot;,\n    &quot;Chocolate chip cookies with chunks of dark chocolate&quot;,\n    &quot;Cinnamon rolls with cream cheese frosting&quot;,\n    &quot;Artisanal baguettes with a soft interior and crusty exterior&quot;\n]\nkeys = list(range(len(docs)))\n\nbmx.add_many(keys=keys, values=docs)\n\n# Search for bread\nquery = &quot;crusty bread&quot;\nresults = bmx.search(query, top_k=2)\n\nprint(results)\n# SearchResults(keys=[0, 4], scores=array([2.5519667 , 0.97304875], dtype=float32), normalized=False)\n<\/code><\/pre>\n<p><strong>WQA:<\/strong><\/p>\n<pre><code class=\"language-python\">from baguetter.indices import BMXSparseIndex\n\n# Initialize BM\ud835\udcb3 index\nbmx = BMXSparseIndex()\n\n# Add bakery items to the index\ndocs = [\n    &quot;Classic French baguette with a crispy exterior&quot;,\n    &quot;Soft and chewy chocolate chip cookies&quot;,\n    &quot;Buttery croissants with a flaky texture&quot;,\n    &quot;Rich and moist chocolate fudge cake&quot;,\n    &quot;Artisanal sourdough bread with a crispy exterior&quot;\n]\nkeys = list(range(len(docs)))\n\nbmx.add_many(keys, docs)\n\n# Perform a search with Weighted Query Augmentation\noriginal_query = &quot;French bakery&quot;\naugmented_query = &quot;crispy bread&quot;\nweights = 
[0.7, 0.3]\n\nwqa_results = bmx.search_weighted(\n    [original_query, augmented_query],\n    query_weights=weights,\n    top_k=3\n)\n\nprint(wqa_results)\n# SearchResults(keys=[0, 4, 3], scores=array([1.2393689, 0.7163555, 0.       ], dtype=float32), normalized=False)\n<\/code><\/pre>\n<h2>Why does this matter?<\/h2>\n<p>Improving the quality of lexical search without a significant increase in complexity or computational resource requirements could lead to a range of beneficial outcomes. Using BMX could improve the user experience of any application that relies on search algorithms. We also expect performance benefits for natural language processing applications that include our algorithm as part of their pipeline.<\/p>\n<p>We&#39;re excited to hear about the community&#39;s opinions on BMX and where this algorithm could be used in the future. If you want to give us feedback or discuss anything search- or machine learning-related, please come join our <a href=\"https:\/\/www.mixedbread.com\/urls\/discord\">Discord community<\/a>.<\/p>\n<pre><code class=\"language-bibtex\">@article{li2024bmx,\n      title={BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search},\n      author={Xianming Li and Julius Lipp and Aamir Shakir and Rui Huang and Jing Li},\n      year={2024},\n      eprint={2408.06643},\n      archivePrefix={arXiv},\n      primaryClass={cs.IR},\n      url={https:\/\/arxiv.org\/abs\/2408.06643},\n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/intro-bmx","title":"BM\ud835\udcb3: A Freshly Baked Take on BM25","summary":"Introducing BMX, an iteration on the industry standard BM25 search algorithm. 
Through the incorporation of entropy-weighted query-document similarity and weighted query augmentation, the algorithm can increase search performance on the most relevant information retrieval benchmarks.","image":"https:\/\/www.mixedbread.com\/images\/blog\/intro-bmx\/intro-bmx.jpg","date_modified":"2024-08-12T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research","engineering"]},{"id":"https:\/\/www.mixedbread.com\/blog\/deepset-mxbai-embed-de-large-v1","content_html":"<p>We are happy to introduce the much-requested result of a collaboration between <a href=\"https:\/\/deepset.ai\">deepset<\/a> (the creators of <a href=\"https:\/\/github.com\/deepset-ai\/haystack\">Haystack<\/a>) and Mixedbread on a project close to home: an open-source German\/English embedding model. Our model sets a new performance standard among its open-source peers. It also supports <a href=\"https:\/\/huggingface.co\/blog\/embedding-quantization\">binary quantization<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/2205.13147\">Matryoshka representation learning<\/a> (MRL), enabling <a href=\"https:\/\/mixedbread.com\/blog\/binary-mrl\">order-of-magnitude reductions<\/a> in storage and infrastructure costs.<\/p>\n<p>Read on to learn more about our approach and to check out our benchmarks. If you want to skip right to the model instead, you can access it here:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/deepset-mxbai-embed-de-large-v1\">mixedbread-ai\/deepset-mxbai-embed-de-large-v1<\/a>: The new and powerful open-source German embedding model made by Mixedbread together with deepset.<\/li>\n<\/ul>\n<h2>Why use embeddings?<\/h2>\n<p>Embeddings are among the most adaptable tools in natural language processing, applicable to a diverse range of settings and use cases. 
Specifically, embeddings are numerical representations of complex objects like text, images, and audio, expressed as n-dimensional vectors.<\/p>\n<p>By transforming objects with an embedding model, you can assess their inherent semantic similarity by calculating the similarity of their respective embeddings. This process involves measuring how closely related two objects are based on the proximity of their embeddings in the n-dimensional vector space. This technique is essential for numerous applications, forming the foundation for recommendation systems, retrieval, one-shot or few-shot learning, outlier detection, similarity search, paraphrase detection, clustering, classification, and more.<\/p>\n<p>Embeddings also play a pivotal role in Retrieval-Augmented Generation (RAG). RAG aims to enable a large language model (LLM) to access custom documents you provide, such as company analyst reports, and enhance its output based on this information. By converting documents and queries into embeddings, the LLM can retrieve the information in your data that is closest to the query and use it to generate the most relevant output for the user.<\/p>\n<h2>Embedding models and their language bias<\/h2>\n<p>Currently available embedding models are typically geared towards the English language due to the sheer volume of English-language digital content, the availability of extensive English datasets for training, and the focus of major research institutions and tech companies on English-speaking markets. This bias results in significant limitations for non-English applications, often leading to inaccuracies and misrepresentations of other languages. For the German language, as for many others, this has long meant a lack of robust, effective tools for natural language processing tasks. 
Our new German-language embedding model aims to address this problem, providing a more accurate and culturally relevant tool for the German-speaking community, which also represents one of the world&#39;s largest economic areas.<\/p>\n<h2>A new winner in German open-source embeddings<\/h2>\n<p>Our model is mainly focused on retrieval tasks and is built on the proverbial shoulders of giants, the giant in this case being the <a href=\"https:\/\/huggingface.co\/intfloat\/multilingual-e5-large\">multilingual-e5-large<\/a> model by <a href=\"https:\/\/arxiv.org\/pdf\/2402.05672\">Wang et al.<\/a> Our model was initialized from multilingual-e5-large, fine-tuned on 30+ million pairs of high-quality German data, and optimized for compression. As a result, it can perform embedding at large scale, with low cost and high performance in terms of speed and quality. It was trained using <a href=\"https:\/\/arxiv.org\/abs\/2309.12871\">AnglE loss<\/a>. During training, a mixture of full fine-tuning and LoRA was used to provide better generalization.<\/p>\n<p>We made a significant effort to avoid any overlap of the training and test data. The model is benchmarked on a mix of private and public benchmarks, which we created in collaboration with some of deepset&#39;s clients. You can find an overview of the benchmarks in this <a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1RIwLk7Ldy5CI03ckqJuOE2BJTYzP4JbVqP6Rz3WO1kw\/edit?usp=sharing\">spreadsheet<\/a>.<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Avg. 
Performance (NDCG@10)<\/th>\n<th>Binary Support<\/th>\n<th>MRL Support<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>deepset-mxbai-embed-de-large-v1<\/strong><\/td>\n<td><strong>51.7<\/strong><\/td>\n<td>\u2705<\/td>\n<td>\u2705<\/td>\n<\/tr>\n<tr>\n<td>multilingual-e5-large<\/td>\n<td>50.5<\/td>\n<td>\u274c<\/td>\n<td>\u274c<\/td>\n<\/tr>\n<tr>\n<td>jina-embeddings-v2-base-de<\/td>\n<td>50.0<\/td>\n<td>\u2705<\/td>\n<td>\u274c<\/td>\n<\/tr>\n<tr>\n<td>Commercial Models<\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>Cohere Multilingual v3<\/td>\n<td><em>52.4<\/em><\/td>\n<td>\u2705<\/td>\n<td>-<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>As the table shows, on the NDCG@10 metric, which compares the list of retrieval results against an ideally ordered list, our model sets a new standard for open-source German embedding models and even gets close to fully commercial, closed-source alternatives. While the improvement on the benchmark might not appear particularly significant at first glance, our case study demonstrates that it makes a big difference in real-world applications.<\/p>\n<h3>Case Study: A legal data client<\/h3>\n<p>For benchmarking with a focus on real-world applicability, we compared the performance of our model against a model that was specifically fine-tuned on the German legal domain. Whereas that model was fine-tuned on domain-specific data, our model had not seen any of this data before the benchmarking.<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Avg. Performance (MAP@10)<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>deepset-mxbai-embed-de-large-v1<\/strong><\/td>\n<td><strong>90.25<\/strong><\/td>\n<\/tr>\n<tr>\n<td>voyage-law-2<\/td>\n<td>84.80<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>As we can see, our general German embedding model outperforms a domain-specific alternative in the area it was specifically trained for. 
We view this result as a very promising signal for the future usefulness of our model across new domains in the German-speaking world.<\/p>\n<h2>Save 97%+ of infrastructure costs with binary MRL<\/h2>\n<p>Today, embeddings still strain storage and processing speed when used for large-scale tasks. To address the scaling issues of embeddings, two promising approaches have emerged: <a href=\"https:\/\/arxiv.org\/abs\/2205.13147\">Matryoshka representation learning<\/a> (MRL) and <a href=\"https:\/\/huggingface.co\/blog\/embedding-quantization\">binary quantization<\/a>. MRL reduces the number of output dimensions in an embedding model without significant accuracy loss by prioritizing important information in the initial dimensions, allowing for truncation of less important later dimensions. This is achieved by calibrating the model&#39;s loss function to account for performance across varying output dimensions, which in turn saves storage and processing time. In contrast, binary quantization reduces the size of each dimension by converting float32 values to binary, significantly enhancing memory and disk space efficiency while retaining high performance.<\/p>\n<p>By combining these methods, the aptly named <a href=\"https:\/\/mixedbread.com\/blog\/binary-mrl\">Binary MRL<\/a> approach demonstrates that it is possible to truncate dimensions and reduce size simultaneously, achieving substantial efficiency gains without major performance sacrifices. For the Mixedbread embedding model for which it was originally developed, this method retains 90% of performance with a 64x efficiency gain, dramatically lowering infrastructure costs in cloud computing and vector databases. During benchmarking with binary quantization, we confirmed that our German embedding model retains 91.8% of performance while the quantization increases efficiency by a factor of 32. 
Using the MRL approach and tracking performance over different dimensionalities yields the following result:<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/deepset-mxbai-embed-de-large-v1\/mrl.png\" alt=\"binary mrl\"><\/p>\n<p>As the graph shows, a reduction of vector size by 25% still leaves 97.5% of model performance. In our view, a particularly interesting trade-off emerges at 512 dimensions, where over 93% of model performance remains while embedding sizes are cut in half, with the associated benefits for infrastructure needs and cost.<\/p>\n<h2>Using it in action<\/h2>\n<p><strong>Python (API):<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread_ai.client import MixedbreadAI\nfrom sentence_transformers.util import cos_sim\n\nmxbai = MixedbreadAI(api_key=&quot;YOUR_API_KEY&quot;)\n\n# The query goes first so it can be compared against the passages below.\nquery = &#39;query: Warum sollte man biologisches Brot kaufen?&#39;\ndocs = [\n    query,\n    &quot;passage: In unserer B\u00e4ckerei bieten wir auch glutenfreies Brot an, das f\u00fcr Menschen mit Z\u00f6liakie geeignet ist.&quot;,\n    &quot;passage: Kuchen und Geb\u00e4ck sind ebenfalls Teil unseres Angebots, wobei wir auf h\u00f6chste Qualit\u00e4t und Frische achten.&quot;,\n    &quot;passage: Wir haben auch eine Auswahl an herzhaften Snacks und Sandwiches, die perfekt f\u00fcr die Mittagspause sind.&quot;,\n    &quot;passage: Biologisches Brot wird aus nat\u00fcrlichen Zutaten hergestellt und enth\u00e4lt keine k\u00fcnstlichen Zusatzstoffe. Es ist ges\u00fcnder und umweltfreundlicher.&quot;,\n    &quot;passage: Unsere B\u00e4ckerei bietet eine Vielzahl von Brotsorten an, darunter auch biologisches Brot. 
Es schmeckt besser und ist frei von chemischen Konservierungsstoffen.&quot;,\n    &quot;passage: Kunden bevorzugen zunehmend biologisches Brot, da es nicht nur gut f\u00fcr die Gesundheit ist, sondern auch einen positiven Beitrag zur Umwelt leistet.&quot;\n]\n\nresult = mxbai.embeddings(\n    model=&quot;mixedbread-ai\/deepset-mxbai-embed-de-large-v1&quot;,\n    input=docs\n)\n\nembeddings = [item.embedding for item in result.data]\n\n# Calculate cosine similarity between the query and each passage\nsimilarities = cos_sim(embeddings[0], embeddings[1:])\nprint(similarities)\n<\/code><\/pre>\n<p><strong>Sentence Transformers:<\/strong><\/p>\n<pre><code class=\"language-python\">from sentence_transformers import SentenceTransformer\nfrom sentence_transformers.util import cos_sim\n\n# 1. Load model\nmodel = SentenceTransformer(&quot;mixedbread-ai\/deepset-mxbai-embed-de-large-v1&quot;)\n\n# For retrieval, prefix the query with &quot;query: &quot; and each document with &quot;passage: &quot;.\nquery = &quot;query: Warum sollte man biologisches Brot kaufen?&quot;\ndocs = [\n    query,\n    &quot;passage: In unserer B\u00e4ckerei bieten wir auch glutenfreies Brot an, das f\u00fcr Menschen mit Z\u00f6liakie geeignet ist.&quot;,\n    &quot;passage: Kuchen und Geb\u00e4ck sind ebenfalls Teil unseres Angebots, wobei wir auf h\u00f6chste Qualit\u00e4t und Frische achten.&quot;,\n    &quot;passage: Wir haben auch eine Auswahl an herzhaften Snacks und Sandwiches, die perfekt f\u00fcr die Mittagspause sind.&quot;,\n    &quot;passage: Biologisches Brot wird aus nat\u00fcrlichen Zutaten hergestellt und enth\u00e4lt keine k\u00fcnstlichen Zusatzstoffe. Es ist ges\u00fcnder und umweltfreundlicher.&quot;,\n    &quot;passage: Unsere B\u00e4ckerei bietet eine Vielzahl von Brotsorten an, darunter auch biologisches Brot. 
Es schmeckt besser und ist frei von chemischen Konservierungsstoffen.&quot;,\n    &quot;passage: Kunden bevorzugen zunehmend biologisches Brot, da es nicht nur gut f\u00fcr die Gesundheit ist, sondern auch einen positiven Beitrag zur Umwelt leistet.&quot;\n]\n\n# 2. Encode\nembeddings = model.encode(docs)\n\n# 3. Calculate cosine similarity between the query and each passage\nsimilarities = cos_sim(embeddings[0], embeddings[1:])\nprint(similarities)\n<\/code><\/pre>\n<p><strong>AnglE:<\/strong><\/p>\n<pre><code class=\"language-python\">from angle_emb import AnglE\nfrom angle_emb.utils import cosine_similarity\n\n# 1. Specify preferred dimensions\ndimensions = 1024\n\n# 2. Load model and set pooling strategy to avg\nmodel = AnglE.from_pretrained(\n    &quot;mixedbread-ai\/deepset-mxbai-embed-de-large-v1&quot;,\n    pooling_strategy=&#39;avg&#39;\n).cuda()\nquery = &#39;query: Warum sollte man biologisches Brot kaufen?&#39;\n\ndocs = [\n    query,\n    &quot;passage: In unserer B\u00e4ckerei bieten wir auch glutenfreies Brot an, das f\u00fcr Menschen mit Z\u00f6liakie geeignet ist.&quot;,\n    &quot;passage: Kuchen und Geb\u00e4ck sind ebenfalls Teil unseres Angebots, wobei wir auf h\u00f6chste Qualit\u00e4t und Frische achten.&quot;,\n    &quot;passage: Wir haben auch eine Auswahl an herzhaften Snacks und Sandwiches, die perfekt f\u00fcr die Mittagspause sind.&quot;,\n    &quot;passage: Biologisches Brot wird aus nat\u00fcrlichen Zutaten hergestellt und enth\u00e4lt keine k\u00fcnstlichen Zusatzstoffe. Es ist ges\u00fcnder und umweltfreundlicher.&quot;,\n    &quot;passage: Unsere B\u00e4ckerei bietet eine Vielzahl von Brotsorten an, darunter auch biologisches Brot. 
Es schmeckt besser und ist frei von chemischen Konservierungsstoffen.&quot;,\n    &quot;passage: Kunden bevorzugen zunehmend biologisches Brot, da es nicht nur gut f\u00fcr die Gesundheit ist, sondern auch einen positiven Beitrag zur Umwelt leistet.&quot;\n]\n\n# 3. Encode\nembeddings = model.encode(docs, embedding_size=dimensions)\n\n# 4. Print the cosine similarity between the query and each passage\nfor doc, emb in zip(docs[1:], embeddings[1:]):\n    print(f&#39;{query} ||| {doc}&#39;, cosine_similarity(embeddings[0], emb))\n<\/code><\/pre>\n<p><strong>Haystack:<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread_ai_haystack import MixedbreadAITextEmbedder, MixedbreadAIDocumentEmbedder\n\nfrom mixedbread_ai import EncodingFormat\n\ntext_embedder = MixedbreadAITextEmbedder(\n    model=&quot;mixedbread-ai\/deepset-mxbai-embed-de-large-v1&quot;,\n    encoding_format=EncodingFormat.BINARY\n)\n\ndocument_embedder = MixedbreadAIDocumentEmbedder(\n    model=&quot;mixedbread-ai\/deepset-mxbai-embed-de-large-v1&quot;,\n    encoding_format=EncodingFormat.BINARY\n)\n<\/code><\/pre>\n<h2>Give us feedback<\/h2>\n<p>We hope you enjoy the new, strongest open-source German embedding model as much as we do. We welcome any feedback that helps us improve and refine our models&#39; user-friendliness or capabilities. Please let us know if you\u2019re yearning for any new features, want to tell us about an interesting use case, or have encountered any issues. We value your feedback!<\/p>\n<p>Please share your feedback and thoughts through the <a href=\"https:\/\/mixedbread.com\/urls\/discord\">Mixedbread Discord<\/a> or the <a href=\"https:\/\/discord.com\/invite\/VBpFzsgRVF\">Haystack Discord<\/a>. 
We are here to help and are always happy to chat about the exciting field of machine learning!<\/p>\n<h3>Citation<\/h3>\n<pre><code class=\"language-bibtex\">@online{embedde2024mxbai,\n  title={Open Source Gets DE-licious: Mixedbread x deepset German\/English Embeddings},\n  author={Sean Lee and Aamir Shakir and Julius Lipp and Darius Koenig},\n  year={2024},\n  url={https:\/\/www.mixedbread.com\/blog\/deepset-mxbai-embed-de-large-v1},\n}\n<\/code><\/pre>\n<h3>Thank you<\/h3>\n<p>To NVIDIA for providing us with cutting-edge computational resources. All of our training and evaluation was done on an <strong>NVIDIA DGX with 8xA100 GPUs<\/strong>, which they generously sponsored. This kind of support is invaluable to us, and we&#39;re truly grateful to the NVIDIA team for helping to make this project possible.<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/deepset-mxbai-embed-de-large-v1","title":"Open Source Gets DE-licious: Mixedbread x deepset German\/English Embeddings","summary":"Introducing deepset-mxbai-embed-de-large-v1, a new open-source German\/English embedding model, developed through collaboration between deepset and Mixedbread. This model sets a new performance standard among open-source peers, supporting binary quantization and Matryoshka representation learning for significant cost reductions. Outperforming domain-specific alternatives in real-world applications, it offers 97%+ infrastructure cost savings through binary MRL.","image":"https:\/\/www.mixedbread.com\/images\/blog\/deepset-mxbai-embed-de-large-v1\/baba.jpg","date_modified":"2024-07-18T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/binary-mrl","content_html":"<p>We are happy to introduce a novel embeddings compression method: Binary MRL. 
This method will make vector search much more scalable and enable a range of new embedding-based applications that weren&#39;t economically feasible before our release. Learn how the parameters influence the search results.<\/p>\n<p>Read on to learn more about our approach and to check out our benchmarks. If you want to skip right to the model instead, you can access it here:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-large-v1\">mxbai-embed-large-v1<\/a>: Our recently released flagship embedding model supports binary MRL out of the box. How cool is that?!<\/li>\n<li><a href=\"https:\/\/mixedbread.com\/demos\/binary-quantization\">Wikipedia Demo<\/a>: You can experience the speed and performance of our model in the demo (using binary quantization).<\/li>\n<\/ul>\n<h2>Why Embeddings?<\/h2>\n<p>Embeddings are one of the most versatile tools in natural language processing, supporting a wide variety of settings and use cases. In essence, embeddings are numerical representations of more complex objects like text, images, audio, etc. Specifically, the objects are represented as n-dimensional vectors.<\/p>\n<p>After transforming objects using an embedding model, you can determine their inherent semantic similarity by calculating the similarity of the respective embeddings. Essentially, you determine how strongly related two objects are by measuring how close their embeddings are to each other in the n-dimensional vector space. This is crucial for many use cases: it serves as the backbone for recommendation systems, retrieval, one-shot or few-shot learning, outlier detection, similarity search, paraphrase detection, clustering, classification, and much more.<\/p>\n<p>Using embeddings is particularly important for Retrieval-Augmented Generation (RAG). The idea behind RAG is to let an LLM access custom documents that you provide (like analyst reports in your company) and improve its output based on that information. 
Transforming the documents into embeddings (as well as the query given to the model) allows the LLM to retrieve the most relevant information from your data and utilize it to produce the most relevant output for the user.<\/p>\n<h2>Embeddings May Struggle to Scale<\/h2>\n<p>However, embeddings may be challenging to use at scale because of their memory usage, which leads to expensive solutions and high latencies. Currently, many state-of-the-art models produce embeddings with 1024 dimensions, each of which is encoded in <code>float32<\/code>, i.e., they require 4 bytes per dimension. To perform retrieval over 250 million vectors, you would therefore need around 1TB of memory! With costs estimated at $3.8 per GB\/month, using <code>x2gd<\/code> instances on AWS, this would incur monthly cloud storage costs of more than $3,500.<\/p>\n<h2>Matryoshka Representation Learning &amp; Vector Quantization to the Rescue<\/h2>\n<p>To solve the scaling issues of embeddings, two approaches have lately been gaining particular traction: Matryoshka Representation Learning (MRL) and Vector Quantization. Let&#39;s first take a look at both concepts.<\/p>\n<p><a href=\"https:\/\/www.huggingface.co\/blog\/matryoshka\">MRL<\/a> tries to make embeddings ready to scale by reducing the number of output dimensions of an embedding model without sacrificing a lot of accuracy. This can be achieved by storing more important information in earlier dimensions of the embedding, so that the less important later dimensions can be truncated, saving for example on storage cost and improving processing speed in downstream tasks. In essence, the loss function during model training needs to be calibrated in a way that not only accounts for the standard model performance on, say, 1024 output dimensions, but that tracks the performance using the first 512, 256, 128,... dimensions. 
Training the model to minimize this loss function will lead it to frontload the most important identifying information within its output vectors.<\/p>\n<p>On the other hand, <a href=\"https:\/\/www.huggingface.co\/blog\/embedding-quantization\">vector quantization<\/a> represents a very different approach to the problem. Here, instead of changing the number of output dimensions, the size of every dimension is reduced. Typically, each dimension of the embedding is stored as a float32 value, which requires 4 bytes (32 bits) of storage space. Especially when considering vectors with 1024 dimensions, potentially millions or billions of them, the benefits of reducing this size become obvious. A large gain in memory and disk space efficiency as well as retrieval speed under retention of 95% and more of performance can be realized by storing the embedding dimensions as binary values instead.<\/p>\n<p>This is achieved by simply transforming the float32-values to 1 if they are greater than 0 and to 0 if they are not. In order for this process not to result in greater loss of performance, a rescoring step can be performed when using the model for retrieval tasks. In this approach, first both query and documents are represented as binary embeddings and the most relevant search results are retrieved with them, which are then also reranked in relation to a float32-embedding of the query.<\/p>\n<h2>Taking It One Step Further with Binary MRL<\/h2>\n<p>Recognizing the potential of both of these approaches, we already published some of our research findings on the subject. 
On MRL, we published our novel <a href=\"https:\/\/mixedbread.com\/blog\/mxbai-embed-2d-large-v1\">2D-Matryoshka model<\/a>; on binary quantization, we co-authored a <a href=\"https:\/\/www.huggingface.co\/blog\/embedding-quantization\">post on the Hugging Face blog<\/a>, introducing curious members of the community to the subject.<\/p>\n<p>Now, we aim to take things one step further by combining both approaches. We want to demonstrate that it is feasible to truncate embedding dimensions and reduce the size of each dimension simultaneously, while still retaining most of the original model performance using our very own <a href=\"https:\/\/mixedbread.com\/blog\/mxbai-embed-large-v1\">embedding model<\/a>.<\/p>\n<p>The following table demonstrates that our model is able to retain over 90% of performance while reducing its output dimensions from 1024 to 512 and also reducing the size of each dimension by a factor of 32. Halving the number of dimensions (2x) and shrinking each one from 32 bits to 1 bit (32x) yields a 64x efficiency gain overall. Naturally, this decrease in memory usage also leads to a proportional - i.e., enormous - decrease in infrastructure cost when workloads run in the cloud or, specifically, on a vector database!<\/p>\n<p>We evaluated the model performance on the MTEB retrieval benchmark, which includes the 13 publicly available BEIR datasets. 
The tables show NDCG@10 scores, relative performance retention, and vector size in bytes of our model with float32 values and binary quantization combined with different output dimensions:<\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>1024 Dim.<\/th>\n<th>512 Dim.<\/th>\n<th>256 Dim.<\/th>\n<th>128 Dim.<\/th>\n<th>64 Dim.<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>NDCG@10<\/strong><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>float32<\/td>\n<td><strong>54.39<\/strong><\/td>\n<td>51.79<\/td>\n<td>46.78<\/td>\n<td>36.63<\/td>\n<td>18.63<\/td>\n<\/tr>\n<tr>\n<td>binary<\/td>\n<td>52.46<\/td>\n<td><strong>49.37<\/strong><\/td>\n<td>43.25<\/td>\n<td>32.80<\/td>\n<td>17.61<\/td>\n<\/tr>\n<tr>\n<td><strong>Performance Retention<\/strong><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>float32<\/td>\n<td><strong>100.00%<\/strong><\/td>\n<td>95.22%<\/td>\n<td>86.01%<\/td>\n<td>67.34%<\/td>\n<td>34.25%<\/td>\n<\/tr>\n<tr>\n<td>binary<\/td>\n<td>96.46%<\/td>\n<td><strong>90.76%<\/strong><\/td>\n<td>79.52%<\/td>\n<td>60.31%<\/td>\n<td>32.38%<\/td>\n<\/tr>\n<tr>\n<td><strong>Vector Size [byte]<\/strong><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>float32<\/td>\n<td><strong>4,096<\/strong><\/td>\n<td>2,048<\/td>\n<td>1,024<\/td>\n<td>512<\/td>\n<td>256<\/td>\n<\/tr>\n<tr>\n<td>binary<\/td>\n<td>128<\/td>\n<td><strong>64<\/strong><\/td>\n<td>32<\/td>\n<td>16<\/td>\n<td>8<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>As shown, <a href=\"https:\/\/mixedbread.com\/blog\/mxbai-embed-large-v1\">Mixedbread's embedding model<\/a> performs more than 90% as well using 64-byte vectors as it does using 4,096-byte vectors. 
In our view, these 512-dimensional binary embeddings also represent the sweet spot for the trade-off between performance and storage capacity.<\/p>\n<p>We can also take a look at the following graph visualizing the relation between performance and output dimensions for both <code>float32<\/code> and binary embeddings:<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/binary-mrl\/binmrl.png\" alt=\"binary mrl\"><\/p>\n<p>As we can see, the curves for <code>float32<\/code> and binary embeddings exhibit strong similarities. It&#39;s our view that the trade-off between size and performance is optimal in the less steep left part of the curve. Due to resource constraints, we did not evaluate the performance retention for <code>int8<\/code> quantization, but we would expect that curve to be extremely similar to the other two and to lie between them.<\/p>\n<p>What does all of this mean in practice?<\/p>\n<p>The following text takes 64 bytes (ASCII) to store: &quot;<em>Bread the warm and yeasty comfort that feeds both body and soul.<\/em>&quot;<\/p>\n<p>Alternatively, in the same 64 bytes, you could now store a vector that embeds a complex object like text or an image at extremely high quality. Which would you consider more useful?<\/p>\n<h2>The Economic Consequences of the Release<\/h2>\n<p>Saving space on embedding sizes is not merely a cosmetic exercise to excite a small group of experts on the subject - it makes using neural search with vector databases significantly cheaper. This can have wide-ranging consequences: we believe it will enable new and exciting embeddings-based applications that previously weren&#39;t economically feasible!<\/p>\n<p>In the following table, we compiled an overview of the required storage space, and therefore the monthly cost, of performing retrieval over 100M, 250M, and 1B vectors, respectively. 
Again, we assumed costs of $3.8 per GB\/month, using <code>x2gd<\/code> instances on AWS:<\/p>\n<table>\n<thead>\n<tr>\n<th>Data type<\/th>\n<th>Dim.<\/th>\n<th>100M embeddings<\/th>\n<th>250M embeddings<\/th>\n<th>1B embeddings<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong>float32<\/strong><\/td>\n<td><strong>1024<\/strong><\/td>\n<td><strong>381.47GB <br\/>$1,449.58 \/ mo<\/strong><\/td>\n<td><strong>953.67GB <br\/>$3,623.96 \/ mo<\/strong><\/td>\n<td><strong>3.81TB <br\/>$14,495.85 \/ mo<\/strong><\/td>\n<\/tr>\n<tr>\n<td>float32<\/td>\n<td>512<\/td>\n<td>190.73GB <br\/>$724.79 \/ mo<\/td>\n<td>476.84GB <br\/>$1,811.98 \/ mo<\/td>\n<td>1.91TB <br\/>$7,247.92 \/ mo<\/td>\n<\/tr>\n<tr>\n<td>float32<\/td>\n<td>256<\/td>\n<td>95.37GB <br\/>$362.40 \/ mo<\/td>\n<td>238.42GB <br\/>$905.99 \/ mo<\/td>\n<td>953.67GB <br\/>$3,623.96 \/ mo<\/td>\n<\/tr>\n<tr>\n<td>float32<\/td>\n<td>128<\/td>\n<td>47.68GB <br\/>$181.20 \/ mo<\/td>\n<td>119.21GB <br\/>$453.00 \/ mo<\/td>\n<td>476.84GB <br\/>$1,811.98 \/ mo<\/td>\n<\/tr>\n<tr>\n<td>binary<\/td>\n<td>1024<\/td>\n<td>11.92GB <br\/>$45.30 \/ mo<\/td>\n<td>29.80GB <br\/>$113.25 \/ mo<\/td>\n<td>119.21GB <br\/>$453.00 \/ mo<\/td>\n<\/tr>\n<tr>\n<td><strong>binary<\/strong><\/td>\n<td><strong>512<\/strong><\/td>\n<td><strong>5.96GB <br\/>$22.65 \/ mo<\/strong><\/td>\n<td><strong>14.90GB <br\/>$56.62 \/ mo<\/strong><\/td>\n<td><strong>59.60GB <br\/>$226.50 \/ mo<\/strong><\/td>\n<\/tr>\n<tr>\n<td>binary<\/td>\n<td>256<\/td>\n<td>2.98GB <br\/>$11.32 \/ mo<\/td>\n<td>7.45GB <br\/>$28.31 \/ mo<\/td>\n<td>29.80GB <br\/>$113.25 \/ mo<\/td>\n<\/tr>\n<tr>\n<td>binary<\/td>\n<td>128<\/td>\n<td>1.49GB <br\/>$5.66 \/ mo<\/td>\n<td>3.73GB <br\/>$14.16 \/ mo<\/td>\n<td>14.90GB <br\/>$56.62 \/ mo<\/td>\n<\/tr>\n<\/tbody><\/table>\n<h2>Using It in Action<\/h2>\n<p>We offer binary MRL through our API and it is also supported through Sentence Transformers. 
Here is an example of how you can use it:<\/p>\n<p><strong>Python (API):<\/strong><\/p>\n<pre><code class=\"language-bash\">pip install -U mixedbread\n<\/code><\/pre>\n<p><strong>TypeScript (API):<\/strong><\/p>\n<pre><code class=\"language-bash\">npm i @mixedbread\/sdk\n<\/code><\/pre>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-bash\">pip install -U sentence-transformers\n<\/code><\/pre>\n<p><strong>Python (API):<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread import Mixedbread\n\nmxbai = Mixedbread(api_key=&quot;YOUR_API_KEY&quot;)\n\nres = mxbai.embed(\n  model=&#39;mixedbread-ai\/mxbai-embed-large-v1&#39;,\n  input=[\n    &#39;Who is german and likes bread?&#39;,\n    &#39;Everybody in Germany.&#39;\n  ],\n  normalized=True, # this has to be True if you want to use binary with faiss\n  encoding_format=&#39;ubinary&#39;,\n  dimensions=512 # keep only the first 512 MRL dimensions\n)\n<\/code><\/pre>\n<p><strong>TypeScript (API):<\/strong><\/p>\n<pre><code class=\"language-typescript\">import { Mixedbread } from &quot;@mixedbread\/sdk&quot;;\n\nconst mxbai = new Mixedbread({\n  apiKey: &quot;YOUR_API_KEY&quot;\n});\n\nconst res = await mxbai.embed({\n  model: &#39;mixedbread-ai\/mxbai-embed-large-v1&#39;,\n  input: [\n    &#39;Who is german and likes bread?&#39;,\n    &#39;Everybody in Germany.&#39;\n  ],\n  normalized: true, \/\/ this has to be true if you want to use binary with faiss\n  encoding_format: &#39;ubinary&#39;,\n  dimensions: 512 \/\/ keep only the first 512 MRL dimensions\n})\n<\/code><\/pre>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-python\">from sentence_transformers import SentenceTransformer\nfrom sentence_transformers.quantization import quantize_embeddings\n\n# 1. Load an embedding model\nmodel = SentenceTransformer(&quot;mixedbread-ai\/mxbai-embed-large-v1&quot;)\n\n# 2. Encode some text and select MRL dimensions\nmrl_embeddings = model.encode(\n    [&quot;Who is german and likes bread?&quot;, &quot;Everybody in Germany.&quot;], normalize_embeddings=True)[..., :512]\n\n# 3. 
Apply binary quantization\nbinary_embeddings = quantize_embeddings(mrl_embeddings, precision=&quot;binary&quot;)\n<\/code><\/pre>\n<p>We also put a <a href=\"https:\/\/mixedbread.com\/demos\/binary-quantization\">demo online<\/a> where you can search through the English Wikipedia using binary embeddings; it helps you understand how the parameters influence the results.<\/p>\n<h2>Practical Considerations<\/h2>\n<p>On a practical level, you will need a vector database that supports this new concept to take full advantage of the benefits it can provide. We understand that many providers will be hesitant to offer this service, as it directly cuts into their profits if the number of users and the embeddings they perform retrieval on stay constant. Still, we believe that making vector search more affordable will increase demand, both for expanded existing applications and for completely new ones, and that this will offset the effect for the providers. Already, there are providers that recognize the potential of our findings and want to support their customers in using them in innovative and productive ways. <a href=\"https:\/\/vespa.ai\/\">Vespa<\/a> has been particularly vocal about their excitement to support the wonderful things the community will be able to do with binary MRL.<\/p>\n","url":"https:\/\/www.mixedbread.com\/blog\/binary-mrl","title":"64 bytes per embedding, yee-haw \ud83e\udd20","summary":"Binary MRL combines two popular approaches to deal with the scalability issues of embeddings. 
It helps our embedding model achieve a 64x gain in efficiency while retaining more than 90% of performance, drastically reducing infrastructure costs and enabling new applications.","image":"https:\/\/www.mixedbread.com\/images\/blog\/binary-mrl\/yeehaw.jpg","date_modified":"2024-04-12T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/mxbai-colbert-large-v1","content_html":"<p>We are excited to announce our first ColBERT model, pushing the space one step forward! It comes with an Apache 2.0 license and is available on <a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-colbert-large-v1\">Hugging Face<\/a>.<\/p>\n<p>Read on to learn more about our approach and to check out our benchmarks. If you want to skip right to the model instead, you can access it here:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-colbert-large-v1\">mixedbread-ai\/mxbai-colbert-large-v1<\/a>: SOTA ColBERT. Simply tasty and good.<\/li>\n<\/ul>\n<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br\/>\n    Our <a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-colbert-large-v1\">ColBERT model<\/a> provides state-of-the-art performance among its peers. It outperforms other available options and even cross-encoder based rerankers.<\/p>\n<\/blockquote>\n<h2>What is ColBERT?<\/h2>\n<p>In a real world use case, search is extremely complex: different domains, languages, varying text length, and many more hurdles have to be dealt with. We try to overcome this challenge by using smart embedding models, which take text as an input and produce a fixed (dimension) size vector.<\/p>\n<p>The typical search approach uses the same model to encode both documents and queries. We then choose a metric, such as cosine similarity, to measure the distance between the query and the documents. 
However, there is an issue with that: our model has to determine the optimal placement within the latent space, so that query and relevant documents are positioned closely together, but there is no interaction between query and document within the model.<\/p>\n<p>On the other side, we have models like cross-encoders. There, query and documents are fed to the model together, improving search accuracy. Unfortunately, cross-encoders are extremely compute-heavy, since we need to pass all possible combinations of documents and queries to the model. Therefore, these models are not suitable for large-scale search and mostly used for reranking.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/colbert-v1\/colbert.png\" alt=\"Similarity scoring process of query and document in a ColBERT model\" title=\"Similarity scoring process of query and document in a ColBERT model\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n    Similarity scoring process of query and document in a ColBERT model\n<\/div>\n\n<p>ColBERT stands for <strong>co<\/strong>ntextualized <strong>l<\/strong>ate interaction BERT and it combines both vector search and cross-encoders. In ColBERT, the queries and the documents are first encoded separately. However, instead of creating a single embedding for the entire document, ColBERT generates contextualized embeddings for each token in the document. To search, the token-level query embeddings are compared with the token-level embeddings of the documents using the lightweight scoring function MaxSim. This allows ColBERT to capture more nuanced matching signals while still being computationally efficient. 
The resulting scores are then used to rank the documents based on their relevance to the query.<\/p>\n<h2>Introducing mxbai-colbert-large-v1<\/h2>\n<p>Last week we released our powerful embedding model <a href=\"https:\/\/mixedbread.com\/blog\/mxbai-embed-large-v1\">mxbai-embed-large-v1<\/a>, based on which we trained our ColBERT model. Therefore, our ColBERT model provides all the benefits of our embedding model - the model has seen a huge amount of diverse data from all kinds of different domains. Our model can be easily used, no fluff like remote code etc. is needed.<\/p>\n<p>As of March 2024, our model achieves state-of-the-art performance for ColBERT models for reranking on the 13 publicly available BEIR benchmarks.<\/p>\n<h2>Using It in Action<\/h2>\n<p>We recommend using our model with the framework <a href=\"https:\/\/github.com\/bclavie\/RAGatouille\">RAGatouille<\/a>. To get started, let&#39;s install the library:<\/p>\n<pre><code class=\"language-bash\">pip install ragatouille\n<\/code><\/pre>\n<p>Now, let&#39;s see how to use our model with RAGatouille:<\/p>\n<pre><code class=\"language-python\">from ragatouille import RAGPretrainedModel\n\n# Let&#39;s create a ragatouille instance\nRAG = RAGPretrainedModel.from_pretrained(&quot;mixedbread-ai\/mxbai-colbert-large-v1&quot;)\n\ndocuments = [\n    &quot;&#39;To Kill a Mockingbird&#39; is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n    &quot;The novel &#39;Moby-Dick&#39; was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.&quot;,\n    &quot;Harper Lee, an American novelist widely known for her novel &#39;To Kill a Mockingbird&#39;, was born in 1926 in Monroeville, Alabama. 
She received the Pulitzer Prize for Fiction in 1961.&quot;,\n    &quot;Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.&quot;,\n    &quot;The &#39;Harry Potter&#39; series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.&quot;,\n    &quot;&#39;The Great Gatsby&#39;, a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.&quot;\n]\n\n# index documents\nRAG.index(documents, index_name=&quot;mockingbird&quot;)\n\n# search\nquery = &quot;Who wrote &#39;To Kill a Mockingbird&#39;?&quot;\nresults = RAG.search(query)\n<\/code><\/pre>\n<p>The results should look like this:<\/p>\n<pre><code class=\"language-python\">[\n    {\n        &#39;content&#39;: &quot;&#39;To Kill a Mockingbird&#39; is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n        &#39;score&#39;: 28.453125,\n        &#39;rank&#39;: 1,\n        &#39;document_id&#39;: &#39;9d564e82-f14f-433a-ab40-b10bda9dc370&#39;,\n        &#39;passage_id&#39;: 0\n    },\n    {\n        &#39;content&#39;: &quot;Harper Lee, an American novelist widely known for her novel &#39;To Kill a Mockingbird&#39;, was born in 1926 in Monroeville, Alabama. 
She received the Pulitzer Prize for Fiction in 1961.&quot;,\n        &#39;score&#39;: 27.03125,\n        &#39;rank&#39;: 2,\n        &#39;document_id&#39;: &#39;a35a89c3-b610-4e2e-863e-fa1e7e0710a6&#39;,\n        &#39;passage_id&#39;: 2\n    },\n    ...\n]\n<\/code><\/pre>\n<h2>Built for RAG and Reranking<\/h2>\n<p>While a lot of models use ready-made datasets -- which are pretty outdated and also quite far removed from real-world use cases -- we spent a lot of time building our own datasets. We scraped a large part of the internet, cleaned the data, and used it to construct our training dataset.<\/p>\n<p>We initialized our ColBERT model from our <a href=\"https:\/\/mixedbread.com\/blog\/mxbai-embed-large-v1\">mxbai-embed-large-v1<\/a> model, which was trained on over 700 million samples from various domains. We then adapted our embedding model to the late interaction mechanism with around 96 million samples. This allows our ColBERT model to be used for a wide range of tasks and domains.<\/p>\n<h2>Model Evaluation with <a href=\"https:\/\/arxiv.org\/abs\/2104.08663\">BEIR<\/a> for Out-of-Domain Information Retrieval<\/h2>\n<p>BEIR is a benchmark focused on out-of-domain information retrieval. We benchmark our ColBERT model in two different settings: reranking and retrieval.<\/p>\n<p>Unfortunately, many recently published models were trained on the BEIR training sets and frequently even on the actual test sets (i.e., telling the model the correct answers for the test set, which is basically cheating). For our training, we excluded any potential overlap with the test sets by removing potential test candidates from the comparison. 
This ensures that our model is evaluated on unseen data and that the results are reliable.<\/p>\n<h3>Reranking<\/h3>\n<p>Since reranking is currently the most significant setting for the use of ColBERT models, we focused on benchmarking our model against other currently available ColBERT options on all 13 publicly available BEIR tasks.<\/p>\n<p>Specifically, we evaluated the model using the NDCG@10 metric, which scores models based on their overall rankings of search results compared to their actual relevance, with a higher weight being placed on the search results higher up on the list.<\/p>\n<p>Reranking performance in NDCG@10:<\/p>\n<table>\n<thead>\n<tr>\n<th>Dataset<\/th>\n<th align=\"right\">ColBERTv2<\/th>\n<th align=\"right\">Jina-ColBERT-v1<\/th>\n<th align=\"right\">mxbai-colbert-large-v1<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>ArguAna<\/td>\n<td align=\"right\">29.99<\/td>\n<td align=\"right\"><strong>33.42<\/strong><\/td>\n<td align=\"right\">33.11<\/td>\n<\/tr>\n<tr>\n<td>ClimateFEVER<\/td>\n<td align=\"right\">16.51<\/td>\n<td align=\"right\">20.66<\/td>\n<td align=\"right\"><strong>20.85<\/strong><\/td>\n<\/tr>\n<tr>\n<td>DBPedia<\/td>\n<td align=\"right\">31.80<\/td>\n<td align=\"right\"><strong>42.16<\/strong><\/td>\n<td align=\"right\">40.61<\/td>\n<\/tr>\n<tr>\n<td>FEVER<\/td>\n<td align=\"right\">65.13<\/td>\n<td align=\"right\"><strong>81.07<\/strong><\/td>\n<td align=\"right\">80.75<\/td>\n<\/tr>\n<tr>\n<td>FiQA<\/td>\n<td align=\"right\">23.61<\/td>\n<td align=\"right\">35.60<\/td>\n<td align=\"right\"><strong>35.86<\/strong><\/td>\n<\/tr>\n<tr>\n<td>HotPotQA<\/td>\n<td align=\"right\">63.30<\/td>\n<td align=\"right\"><strong>68.84<\/strong><\/td>\n<td align=\"right\">67.62<\/td>\n<\/tr>\n<tr>\n<td>NFCorpus<\/td>\n<td align=\"right\">33.75<\/td>\n<td align=\"right\"><strong>36.69<\/strong><\/td>\n<td align=\"right\">36.37<\/td>\n<\/tr>\n<tr>\n<td>NQ<\/td>\n<td align=\"right\">30.55<\/td>\n<td align=\"right\">51.27<\/td>\n<td 
align=\"right\"><strong>51.43<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Quora<\/td>\n<td align=\"right\">78.86<\/td>\n<td align=\"right\">85.18<\/td>\n<td align=\"right\"><strong>86.95<\/strong><\/td>\n<\/tr>\n<tr>\n<td>SCIDOCS<\/td>\n<td align=\"right\">14.90<\/td>\n<td align=\"right\">15.39<\/td>\n<td align=\"right\"><strong>16.98<\/strong><\/td>\n<\/tr>\n<tr>\n<td>SciFact<\/td>\n<td align=\"right\">67.89<\/td>\n<td align=\"right\">70.20<\/td>\n<td align=\"right\"><strong>71.48<\/strong><\/td>\n<\/tr>\n<tr>\n<td>TREC-COVID<\/td>\n<td align=\"right\">59.47<\/td>\n<td align=\"right\">75.00<\/td>\n<td align=\"right\"><strong>81.04<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Webis-touch\u00e92020<\/td>\n<td align=\"right\"><strong>44.22<\/strong><\/td>\n<td align=\"right\">32.12<\/td>\n<td align=\"right\">31.70<\/td>\n<\/tr>\n<tr>\n<td>Average<\/td>\n<td align=\"right\">43.08<\/td>\n<td align=\"right\">49.82<\/td>\n<td align=\"right\"><strong>50.37<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<p><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-colbert-large-v1\">mxbai-colbert-large-v1<\/a> outperforms other models on average as well as directly in most of the tasks. Curiously, the model&#39;s exceptionally high score even beats typical scores for cross-encoder based reranker models on the benchmark, despite the advantages of the ColBERT architecture regarding resource use.<\/p>\n<h3>Retrieval<\/h3>\n<p>As mentioned, ColBERT is currently mainly used for reranking. However, since more and more people are starting to use ColBERT for retrieval tasks as well, we also tested our model&#39;s performance on retrieval tasks on a subset of the BEIR benchmarks.<\/p>\n<p>Due to resource limitations, so far we were only able to test our model on three BEIR tasks, with NDCG@10 serving as the main metric. 
We aim to complete testing on the full set of tasks in the future and will provide the full results as soon as possible.<\/p>\n<p>Retrieval performance in NDCG@10:<\/p>\n<table>\n<thead>\n<tr>\n<th>Dataset<\/th>\n<th align=\"right\">ColBERTv2<\/th>\n<th align=\"right\">Jina-ColBERT-v1<\/th>\n<th align=\"right\">mxbai-colbert-large-v1<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>NFCorpus<\/td>\n<td align=\"right\">33.7<\/td>\n<td align=\"right\">33.8<\/td>\n<td align=\"right\"><strong>36.5<\/strong><\/td>\n<\/tr>\n<tr>\n<td>SciFact<\/td>\n<td align=\"right\">68.9<\/td>\n<td align=\"right\">70.1<\/td>\n<td align=\"right\"><strong>71.3<\/strong><\/td>\n<\/tr>\n<tr>\n<td>TREC-COVID<\/td>\n<td align=\"right\">72.6<\/td>\n<td align=\"right\">75.0<\/td>\n<td align=\"right\"><strong>80.5<\/strong><\/td>\n<\/tr>\n<\/tbody><\/table>\n<div class=\"text-center w-full pb-4\"><strong>The rest of the results will be added soon. We're on it!<\/strong><\/div>\n\n<p>On this small subset, the model exhibits state-of-the-art retrieval performance when compared to other currently available ColBERT models. However, while our ColBERT model also performs well on retrieval, we still recommend using our embedding model <a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-large-v1\">mixedbread-ai\/mxbai-embed-large-v1<\/a> in this setting.<\/p>\n<h2>Give Us Feedback<\/h2>\n<p>This is our first ColBERT model, and we greatly welcome any feedback that helps us make our models better, refine their user-friendliness, or improve their capabilities. Please let us know if you&#39;re hungry for any new features or have encountered any issues. We value your feedback!<\/p>\n<p>Please share your feedback and thoughts through our <a href=\"https:\/\/mixedbread.com\/urls\/discord\">Discord Community<\/a>. 
We are here to help and also always happy to chat about the exciting field of machine learning!<\/p>\n<h3>Citation<\/h3>\n<pre><code class=\"language-bibtex\">@online{colbert2024mxbai,\n  title={ColBERTus Maximus - Introducing mxbai-colbert-large-v1},\n  author={Sean Lee and Aamir Shakir and Darius Koenig and Julius Lipp},\n  year={2024},\n  url={https:\/\/www.mixedbread.com\/blog\/mxbai-colbert-large-v1},\n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/mxbai-colbert-large-v1","title":"ColBERTus Maximus - Introducing mxbai-colbert-large-v1","summary":"mxbai-colbert-large-v1 is a state-of-the-art ColBERT model for reranking and retrieval tasks. It is based on the mxbai-embed-large-v1 model and achieves state-of-the-art performance on 13 publicly available BEIR benchmarks. It's available on Hugging Face.","image":"https:\/\/www.mixedbread.com\/images\/blog\/colbert-v1\/intro-mxbai-colbert-large-v1.jpg","date_modified":"2024-03-19T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/mxbai-embed-large-v1","content_html":"<p>And another one! We are excited to introduce our new and powerful embedding model. It comes with an Apache 2.0 license and is available on Hugging Face.<\/p>\n<p>Read on to learn more about our approach and to check out our benchmarks. If you want to skip right to the model instead, you can access it here:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-large-v1\">mxbai-embed-large-v1<\/a>: Strong, powerful, and large. But not too large.<\/li>\n<\/ul>\n<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br\/>\n    Our <a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-large-v1\">English embedding model<\/a> provides state-of-the-art performance among other efficiently sized models. 
It outperforms closed source models like OpenAI&#39;s text-embedding-v3.<\/p>\n<\/blockquote>\n<h2>Why Embeddings?<\/h2>\n<p>A significant hurdle for modern generative models is their inability to directly interact with your data. Consider a scenario where your task is to generate a report on recent market trends based on internal research documents. Traditional generative models fall short here as they don&#39;t have access to or understanding of your internal documents, making it impossible for them to generate the required report.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/embed-v1\/mxbai-embed-db.png\" alt=\"The process of Retrieval-Augmented Generation, powered by embeddings\" title=\"The process of Retrieval-Augmented Generation, powered by embeddings\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n    The process of Retrieval-Augmented Generation, powered by embeddings\n<\/div>\n\n<p>To address this challenge, the Retrieval-Augmented Generation (RAG) technique offers a solution. Imagine you have a repository of internal research on market trends. This repository can be processed through an embedding model to convert the documents into a searchable format within a vector database. When you need a report on market trends, the embedding model can locate and fetch the most relevant documents. 
These documents can then inform a generative model, enabling it to produce a detailed report based on your specific data.<\/p>\n<h2>Introducing Our Powerful Embedding Model<\/h2>\n<p>Earlier this week, we ventured to the forefront of research into new ways of making embedding models more efficient with the <a href=\"https:\/\/mixedbread.com\/blog\/mxbai-embed-2d-large-v1\">release of our  2D-\ud83e\ude86 model<\/a>, first of its kind.<\/p>\n<p>Today, we want to return some of our attention to slightly more well-charted waters: we are releasing our flagship, state-of-the-art English embedding model, which can be easily downloaded into your existing search pipeline. No fancy custom code or trust remote code required.<\/p>\n<p>As of March 2024, our model achieves state-of-the-art performance for open-source models of the same size class (for closed source models this information is not public) on the <a href=\"https:\/\/arxiv.org\/abs\/2210.07316\">Massive Text Embedding Benchmark<\/a> (MTEB).<\/p>\n<h2>Using It in Action<\/h2>\n<p>Our model is extremely easy to use with your existing search stack. You replace the first stage retrieval with our model, and you&#39;re ready to go. 
You\u2019ll have two options: use the model either offline by hosting it yourself or online by using our (upcoming) API.<\/p>\n<p>To get started, install the necessary packages:<\/p>\n<pre><code class=\"language-bash\">pip install -U mixedbread-ai sentence-transformers\n<\/code><\/pre>\n<p>Then, you can use the model like this:<\/p>\n<p><strong>Python (API):<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread_ai.client import MixedbreadAI\nfrom sentence_transformers.util import cos_sim\n\nmxbai = MixedbreadAI(api_key=&quot;YOUR_API_KEY&quot;)\n\n# For retrieval, the query needs to carry this prompt.\nquery = &#39;Represent this sentence for searching relevant passages: A man is eating a piece of bread&#39;\n\n# The query goes first, followed by the candidate documents.\ndocs = [\n    query,\n    &quot;A man is eating food.&quot;,\n    &quot;A man is eating pasta.&quot;,\n    &quot;The girl is carrying a baby.&quot;,\n    &quot;A man is riding a horse.&quot;,\n]\n\nresult = mxbai.embeddings(\n    model=&quot;mixedbread-ai\/mxbai-embed-large-v1&quot;,\n    input=docs\n)\n\nembeddings = [item.embedding for item in result.data]\n\n# Calculate cosine similarity between the query and each document\nsimilarities = cos_sim(embeddings[0], embeddings[1:])\nprint(similarities)\n<\/code><\/pre>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-python\">from sentence_transformers import SentenceTransformer\nfrom sentence_transformers.util import cos_sim\n\n# 1. Load the model\nmodel = SentenceTransformer(&quot;mixedbread-ai\/mxbai-embed-large-v1&quot;)\n\n# For retrieval, the query needs to carry this prompt.\nquery = &#39;Represent this sentence for searching relevant passages: A man is eating a piece of bread&#39;\n\n# The query goes first, followed by the candidate documents.\ndocs = [\n    query,\n    &quot;A man is eating food.&quot;,\n    &quot;A man is eating pasta.&quot;,\n    &quot;The girl is carrying a baby.&quot;,\n    &quot;A man is riding a horse.&quot;,\n]\n\n# 2. Encode\nembeddings = model.encode(docs)\n\n# 3. 
Calculate cosine similarity between the query and each document\nsimilarities = cos_sim(embeddings[0], embeddings[1:])\nprint(similarities)\n<\/code><\/pre>\n<p>The result will look like this:<\/p>\n<pre><code class=\"language-python\">[0.7920, 0.6369, 0.1651, 0.3621]\n<\/code><\/pre>\n<h3>Why Do We Need the Prompt?<\/h3>\n<p>The prompt improves the model&#39;s understanding of how the embedding will be used in subsequent tasks, which in turn improves performance. For now, we support only one prompt, but our experiments show that domain-specific prompts can improve performance further. If you are doing information retrieval, please use the prompt <code>Represent this sentence for searching relevant passages:<\/code> for your query. For everything else, just use the text as it is.<\/p>\n<h2>Built for RAG and Real World Use-Cases<\/h2>\n<p>While a lot of models use ready-made datasets -- which are pretty outdated and also quite far removed from real-world use cases -- we spent a lot of time building our own datasets. We scraped a large part of the internet, cleaned the data, and used it to construct our training dataset.<\/p>\n<p>During the whole process, we ensured zero overlap with the MTEB test sets. We went out of our way to not even use any training data from the MTEB (except MS Marco), unlike most other models. We trained our model with over 700 million pairs using contrastive training and tuned it on over 30 million high-quality triplets using the <a href=\"https:\/\/arxiv.org\/abs\/2309.12871\">AnglE loss<\/a>. 
The vast amount and high quality of data ensures that the model has seen a lot of topics and domains, and that it performs well in real life and RAG-related use cases.<\/p>\n<h2>Model Evaluation with MTEB<\/h2>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2210.07316\">MTEB<\/a> is a large text embedding benchmark that measures embedding models across seven tasks: classification, clustering, pair classification, re-ranking, retrieval, STS (semantic textual similarity), and summarization. It includes 56 datasets from various domains and with various text lengths.<\/p>\n<p>Our new model is ranked first among embedding models of similar size, outperforms the new OpenAI embedding model, text-embedding-3-large, and also matches the performance of 20x larger models like <a href=\"https:\/\/huggingface.co\/jspringer\/echo-mistral-7b-instruct-lasttoken\">echo-mistral-7b<\/a>. You can find the evaluation results on the <a href=\"https:\/\/huggingface.co\/spaces\/mteb\/leaderboard\">official MTEB leaderboard<\/a>.<\/p>\n<p>PS: You can boost the performance even further by combining our embedding model with our <a href=\"https:\/\/mixedbread.com\/blog\/mxbai-rerank-v1\">rerank model<\/a>.<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Avg (56 datasets)<\/th>\n<th>Classification (12 datasets)<\/th>\n<th>Clustering (11 datasets)<\/th>\n<th>PairClassification (3 datasets)<\/th>\n<th>Reranking (4 datasets)<\/th>\n<th>Retrieval (15 datasets)<\/th>\n<th>STS (10 datasets)<\/th>\n<th>Summarization (1 dataset)<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><strong><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-large-v1\">mxbai-embed-large-v1<\/a><\/strong><\/td>\n<td><strong>64.68<\/strong><\/td>\n<td>75.64<\/td>\n<td>46.71<\/td>\n<td>87.2<\/td>\n<td>60.11<\/td>\n<td>54.39<\/td>\n<td>85.00<\/td>\n<td>32.71<\/td>\n<\/tr>\n<tr>\n<td><a 
href=\"https:\/\/huggingface.co\/BAAI\/bge-large-en-v1.5\">bge-large-en-v1.5<\/a><\/td>\n<td>64.23<\/td>\n<td>75.97<\/td>\n<td>46.08<\/td>\n<td>87.12<\/td>\n<td>60.03<\/td>\n<td>54.29<\/td>\n<td>83.11<\/td>\n<td>31.61<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-2d-large-v1\">mxbai-embed-2d-large-v1<\/a><\/td>\n<td>63.25<\/td>\n<td>74.14<\/td>\n<td>46.07<\/td>\n<td>85.89<\/td>\n<td>58.94<\/td>\n<td>51.42<\/td>\n<td>84.9<\/td>\n<td>31.55<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/huggingface.co\/nomic-ai\/nomic-embed-text-v1\">nomic-embed-text-v1<\/a><\/td>\n<td>62.39<\/td>\n<td>74.12<\/td>\n<td>43.91<\/td>\n<td>85.15<\/td>\n<td>55.69<\/td>\n<td>52.81<\/td>\n<td>82.06<\/td>\n<td>30.08<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/huggingface.co\/jinaai\/jina-embeddings-v2-base-en\">jina-embeddings-v2-base-en<\/a><\/td>\n<td>60.38<\/td>\n<td>73.45<\/td>\n<td>41.73<\/td>\n<td>85.38<\/td>\n<td>56.98<\/td>\n<td>47.87<\/td>\n<td>80.7<\/td>\n<td>31.6<\/td>\n<\/tr>\n<tr>\n<td><em>Proprietary Models<\/em><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/openai.com\/blog\/new-embedding-models-and-api-updates\">OpenAI text-embedding-3-large<\/a><\/td>\n<td>64.58<\/td>\n<td>75.45<\/td>\n<td>49.01<\/td>\n<td>85.72<\/td>\n<td>59.16<\/td>\n<td>55.44<\/td>\n<td>81.73<\/td>\n<td>29.92<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/txt.cohere.com\/introducing-embed-v3\/\">Cohere embed-english-v3.0<\/a><\/td>\n<td>64.47<\/td>\n<td>76.49<\/td>\n<td>47.43<\/td>\n<td>85.84<\/td>\n<td>58.01<\/td>\n<td>55.00<\/td>\n<td>82.62<\/td>\n<td>30.18<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/openai.com\/blog\/new-and-improved-embedding-model\">OpenAI text-embedding-ada-002<\/a><\/td>\n<td>60.99<\/td>\n<td>70.93<\/td>\n<td>45.90<\/td>\n<td>84.89<\/td>\n<td>56.32<\/td>\n<td>49.25<\/td>\n<td>80.97<\/td>\n<td>30.80<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>The results on 
MTEB show that our model performs well for a number of different tasks and domains. This means that the model can adapt to a plethora of use cases and topics, making it an obvious choice for users.<\/p>\n<p>Unfortunately, many recently published embedding models were trained on the MTEB datasets, frequently even on the actual test sets (i.e., telling the model the correct answers for the test set, which is basically cheating). For our training, we excluded any potential overlap with the test sets by removing potential test candidates from the comparison.<\/p>\n<h2>Why No Long Context Length? Matryoshka? Preference?<\/h2>\n<p>Recently, we&#39;ve observed that some models are advertised as supporting long context to mitigate chunking. While we recognize that chunking sucks, which is also something we are working to solve, using a long context model is not the solution.<\/p>\n<p>With embeddings, we aim to capture the semantics of a text. For illustrative purposes, think of your own long context documents. They can contain any amount of different information and multiple topics which can be unrelated or contradictory. Accurately representing this with a single embedding is almost prohibitively difficult, which is why we decided not to support long context and to solve this issue in a smarter, more sensible way instead. Stay tuned, cool stuff is coming soon!<\/p>\n<p>The same goes for \ud83e\ude86 embeddings: we love the concept, as you can read in our blog post <a href=\"https:\/\/mixedbread.com\/blog\/mxbai-embed-2d-large-v1\">Fresh 2D-Matryoshka Embedding Model<\/a>. 
A new version supporting \ud83e\ude86 is in the making, as well as a version that fixes small issues regarding preference of the retrieved candidates -- we aim to close any potential performance gaps to commercial, closed-source models that our model, despite its extremely strong overall performance, might still face for specific tasks.<\/p>\n<h2>Give Us Feedback<\/h2>\n<p>This is our first production-ready embedding model, and we greatly welcome any feedback that helps make our models better, refine their user-friendliness, or improve their capabilities. Please let us know if you&#39;re hungry for any new features or have encountered any issues. We value your feedback!<\/p>\n<p>Please share your feedback and thoughts through our <a href=\"https:\/\/mixedbread.com\/urls\/discord\">discord community<\/a>. We are here to help and also always happy to chat about the exciting field of machine learning!<\/p>\n<h3>Citation<\/h3>\n<pre><code class=\"language-bibtex\">@online{embed2024mxbai,\n  title={Open Source Strikes Bread - New Fluffy Embedding Model},\n  author={Sean Lee and Aamir Shakir and Darius Koenig and Julius Lipp},\n  year={2024},\n  url={https:\/\/www.mixedbread.com\/blog\/mxbai-embed-large-v1},\n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/mxbai-embed-large-v1","title":"Open Source Strikes Bread - New Fluffy Embedding Model","summary":"Our English embedding model provides state-of-the-art performance among other efficiently sized models. It outperforms closed source models like OpenAI's text-embedding-v3.","image":"https:\/\/www.mixedbread.com\/images\/blog\/embed-v1\/intro-mxbai-embed-v1.jpg","date_modified":"2024-03-08T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/mxbai-embed-2d-large-v1","content_html":"<p>We are excited to release the world&#39;s first 2D-\ud83e\ude86 embedding model. 
Like our previous release, it comes with an Apache 2.0 license and is available on Hugging Face.<\/p>\n<p>Read on to learn more about our approach and to check out our benchmarks. If you want to skip right to the model instead, you can access it here:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-2d-large-v1\">mxbai-embed-2d-large-v1<\/a>: The first model of its kind.<\/li>\n<\/ul>\n<blockquote>\n<p><strong>Note<\/strong>\n<u><strong>TLDR:<\/strong><\/u> <br\/>\n    The 2D-\ud83e\ude86 model introduces a novel approach that enables you to reduce both the number of layers and the dimensionality of the embeddings within the model. This dual reduction strategy allows for a more compact model size while still delivering competitive performance compared to leading models, such as <a href=\"https:\/\/huggingface.co\/nomic-ai\/nomic-embed-text-v1.5\">Nomic's embedding model<\/a>. Specifically, reducing the model&#39;s layers by approximately 50% retains up to 85% of its original performance, even without additional training.<\/p>\n<\/blockquote>\n<h2>Why Embeddings?<\/h2>\n<p>A significant hurdle for modern generative models is their inability to directly interact with specific organizational data. Consider a scenario where your task is to generate a report on recent market trends based on internal research documents. Traditional generative models fall short here because they have neither access to nor any understanding of your internal documents, making it impossible for them to generate the required report.<\/p>\n<p>Retrieval-Augmented Generation (RAG) addresses this challenge. Imagine you have a repository of internal research on market trends. This repository can be processed through an embedding model to convert the documents into a searchable format within a vector database. When you need a report on market trends, the embedding model can locate and fetch the most relevant documents. 
These documents can then inform a generative model, enabling it to produce a detailed report based on your specific data.<\/p>\n<h2>What Is a Matryoshka Model? And 2D-\ud83e\ude86?<\/h2>\n<p>Dense embedding models typically produce embeddings with a fixed size, such as 768 or 1024 dimensions. All further computations (clustering, classification, semantic search, retrieval, reranking, etc.) must then be done on these full embeddings.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2205.13147\">Matryoshka Representation Learning<\/a> revisits this idea, and proposes a solution to train embedding models whose embeddings are still useful after dimensionality reduction -- truncation to much smaller sizes -- as shown in the figure below. This allows for considerably faster (bulk) processing and storage savings while maintaining most of the performance. However, the impact on inference speed and memory footprint is small, because the model still runs through all layers.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2402.14776\">2D-\ud83e\ude86<\/a> takes this idea further and proposes chunkable layers (see the second part of the figure). Here, the hidden layers are also trained on generating high quality embeddings without the higher layers. As a result, layers can be chunked from the model without losing too much performance in the embeddings generation process. This allows a user to train one large model and get multiple smaller models out of it. 
The first dimension of 2D-\ud83e\ude86, which chunks layers, allows faster inference and a lower memory footprint, and the second dimension, which chunks the embeddings, allows faster retrieval while using less storage capacity.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/2dmse\/2dmse_concept.png\" alt=\"Visualization of the difference between regular and 2D-\ud83e\ude86\" title=\"Visualization of the difference between regular and 2D-\ud83e\ude86\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n    Visualization of the difference between regular and 2D-\ud83e\ude86 (Source: <a href=\"https:\/\/arxiv.org\/abs\/2402.14776\">Li et al.<\/a>)\n<\/div>\n\n<h2>Introducing the First 2D-\ud83e\ude86 Embedding Model<\/h2>\n<p>We are excited to announce the (to our knowledge) first embedding model which supports 2D-\ud83e\ude86. The model is based on our embedding model, which we will also release soon (\ud83e\udd2b). The model was pretrained using contrastive training on over 700 million pairs, covering a huge variety of different topics across the whole internet. Then, it was finetuned with over 30 million high-quality triplets using novel loss functions. Our model lets you get multiple models out of one and use different embedding sizes, giving you full control over the tradeoffs between speed, storage consumption, and model performance.<\/p>\n<h2>Using It in Action<\/h2>\n<p>Our model is extremely easy to use with your existing search stack. You replace the first-stage retrieval with our model, and you&#39;re ready to go. You\u2019ll have two options: use the model either offline by hosting it yourself or online by using our (upcoming) API.<\/p>\n<p>To get started, install the necessary packages:<\/p>\n<pre><code class=\"language-bash\">pip install -U mixedbread-ai sentence-transformers\n<\/code><\/pre>\n<p>Here is a quick example: Given two sentences, we want to find their similarity. 
We can modify the number of layers (depth) with <code>new_num_layers<\/code> and the dimensions of the embeddings with <code>new_embedding_size<\/code>.<\/p>\n<p><strong>Python (API):<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread_ai.client import MixedbreadAI\nfrom sentence_transformers.util import cos_sim\n\nmxbai = MixedbreadAI(api_key=&quot;YOUR_API_KEY&quot;)\n\nresult = mxbai.embeddings(\n    model=&quot;mixedbread-ai\/mxbai-embed-2d-large-v1&quot;,\n    input=[\n        &#39;Who is german and likes bread?&#39;,\n        &#39;Everybody in Germany.&#39;\n    ],\n    dimensions=768\n)\n\n# Similarity between the two sentences\nsimilarities = cos_sim(result.data[0].embedding, result.data[1].embedding)\n\nprint(&#39;similarities:&#39;, similarities)\n<\/code><\/pre>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-python\">from sentence_transformers import SentenceTransformer\nfrom sentence_transformers.util import cos_sim\n\n# 1. Load model with &quot;cls&quot; pooling\nmodel = SentenceTransformer(&quot;mixedbread-ai\/mxbai-embed-2d-large-v1&quot;)\n\n# 2. Set adaptive layer and embedding size.\n# It is recommended to set layers from 20 to 24.\nnew_num_layers = 22  # 1D: number of transformer layers to keep\nmodel[0].auto_model.encoder.layer = model[0].auto_model.encoder.layer[:new_num_layers]\nnew_embedding_size = 768  # 2D: number of embedding dimensions to keep\n\n# 3. 
Encode\nembeddings = model.encode(\n    [\n        &#39;Who is german and likes bread?&#39;,\n        &#39;Everybody in Germany.&#39;\n    ]\n)\n\n# Similarity between the two sentences, using the truncated embeddings\nsimilarities = cos_sim(embeddings[0, :new_embedding_size], embeddings[1, :new_embedding_size])\n\nprint(&#39;similarities:&#39;, similarities)\n<\/code><\/pre>\n<p>This yields a similarity of <code>0.7342<\/code>.<\/p>\n<h2>Performance<\/h2>\n<p>Our first iteration of the model yields performance that is competitive with models supporting Matryoshka only in the embeddings layer. We recognise that the performance may currently be behind some larger models, but it&#39;s a big first step towards bringing 2D-\ud83e\ude86 to real-world use cases. We are working hard to match the overall performance of the state-of-the-art embedding models.<\/p>\n<h3>MTEB: Massive Text Embedding Benchmark<\/h3>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2210.07316\">MTEB<\/a> is a large text embedding benchmark that measures embedding models across seven tasks: classification, clustering, pair classification, re-ranking, retrieval, STS (semantic textual similarity), and summarization. It includes 56 datasets from various domains and with various text lengths.<\/p>\n<p>Unfortunately, a lot of new models have started overfitting to the tasks on the MTEB. Frequently, they are even trained on the respective test sets (i.e., the model is told the correct answers for the test set, which is basically cheating) or on synthetic data generated for those datasets. 
For our training, we excluded any potential overlap with the test sets and used no MTEB data at all, not even the training sets (except MS Marco) -- the reported performances are zero-shot!<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>MTEB Result<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><a href=\"https:\/\/openai.com\/blog\/new-embedding-models-and-api-updates\">text-embedding-3-large<\/a><\/td>\n<td>64.59<\/td>\n<\/tr>\n<tr>\n<td><strong><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-2d-large-v1\">mxbai-embed-2d-large-v1<\/a><\/strong><\/td>\n<td><strong>63.25<\/strong><\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/huggingface.co\/nomic-ai\/nomic-embed-text-v1.5\">nomic-embed-text-v1.5<\/a><\/td>\n<td>62.28<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/openai.com\/blog\/new-embedding-models-and-api-updates\">text-embedding-3-small<\/a><\/td>\n<td>62.26<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>As the results show, the 2D-\ud83e\ude86 model in its base form performs on par with current embedding models of various sizes on MTEB. Now, we want to investigate the model&#39;s performance on various tasks, factoring in its downsizing abilities.<\/p>\n<h3>Matryoshka on the Embeddings Layer<\/h3>\n<p>First, we investigate the model performance under dimensionality reduction for the embeddings only -- in essence like a &#39;traditional&#39; Matryoshka model. We haven&#39;t yet managed to evaluate the MTEB for every dimension size and are working on providing the full results. For now, we report the performance on the STS tasks, SciFact, and TREC-COVID. 
We are only including the comparison to <code>nomic-embed-text-v1.5<\/code> and <code>text-embedding-3-large<\/code>, since the reported performance of <code>text-embedding-3-small<\/code> only contains the performance without Matryoshka.<\/p>\n<h4>STS (whole subset of MTEB)<\/h4>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/2dmse\/model_performance_by_dimension_sts.png\" alt=\"Model performance for different embeddings sizes against the STS benchmark\" title=\"Model performance for different embeddings sizes against the STS (whole subset of MTEB) benchmark\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n    Model performance for different embeddings sizes against the STS (whole subset of MTEB) benchmark\n<\/div>\n\n<p>Clearly, the 2D-\ud83e\ude86 embedding model can perform the task with a similar or even higher performance compared to currently available models. Even if the embedding size is reduced by factor 16, the model offers competitive performance.<\/p>\n<h4>SciFact<\/h4>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>native<\/th>\n<th>512<\/th>\n<th>256<\/th>\n<th>128<\/th>\n<th>64<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><a href=\"https:\/\/openai.com\/blog\/new-embedding-models-and-api-updates\">text-embedding-3-large<\/a><\/td>\n<td>77.77<\/td>\n<td>--<\/td>\n<td>73.1 (-6.1%)<\/td>\n<td>--<\/td>\n<td>--<\/td>\n<\/tr>\n<tr>\n<td><strong><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-2d-large-v1\">mxbai-embed-2d-large-v1<\/a><\/strong><\/td>\n<td>74.11<\/td>\n<td>71.41 (-3.6%)<\/td>\n<td>68.74 (-7.2%)<\/td>\n<td>67.92 (-8.4%)<\/td>\n<td>63.75 (-14.0%)<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/huggingface.co\/nomic-ai\/nomic-embed-text-v1.5\">nomic-embed-text-v1.5<\/a><\/td>\n<td>70.28<\/td>\n<td>70.12 (-0.2%)<\/td>\n<td>68.24 (-2.9%)<\/td>\n<td>64.28 (-8.5%)<\/td>\n<td>52.71 (-25.0%)<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>Again, the model can perform the task on a comparable level to the available 
Matryoshka models. While our model loses a higher degree of performance in the lower-order dimensionality reductions, it shows comparatively much stronger performance with greater size reduction.<\/p>\n<h4>TREC-COVID<\/h4>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>native<\/th>\n<th>512<\/th>\n<th>256<\/th>\n<th>128<\/th>\n<th>64<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td><a href=\"https:\/\/openai.com\/blog\/new-embedding-models-and-api-updates\">text-embedding-3-large<\/a><\/td>\n<td>79.59<\/td>\n<td>--<\/td>\n<td>76.24 (-4.3%)<\/td>\n<td>--<\/td>\n<td>--<\/td>\n<\/tr>\n<tr>\n<td><strong><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-embed-2d-large-v1\">mxbai-embed-2d-large-v1<\/a><\/strong><\/td>\n<td>68.64<\/td>\n<td>69.67 (+1.5%)<\/td>\n<td>69.90 (+1.8%)<\/td>\n<td>65.27 (-4.9%)<\/td>\n<td>59.81 (-12.9%)<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/huggingface.co\/nomic-ai\/nomic-embed-text-v1.5\">nomic-embed-text-v1.5<\/a><\/td>\n<td>82.30<\/td>\n<td>82.12 (-0.2%)<\/td>\n<td>80.65 (-2.0%)<\/td>\n<td>74.58 (-9.4%)<\/td>\n<td>67.83 (-17.6%)<\/td>\n<\/tr>\n<\/tbody><\/table>\n<p>On this task, our model exhibited some curious behaviour. While the performance on this specific task remained a bit behind that of the Matryoshka models, we observed an interesting increase in performance for the lower-order size reductions and a comparatively slightly more stable performance towards higher-order size reductions.<\/p>\n<h3>2D-Matryoshka<\/h3>\n<p>Now, we investigate the model performance taking full advantage of the 2D-\ud83e\ude86 principle. Essentially, we repeatedly evaluate the model (using &#39;classic&#39; Matryoshka functionality), cutting one layer from the model each time. We start with the full 24-layer model and reduce it step by step to 13 layers, discarding almost 50% of the model. 
As we&#39;ve already seen the performance of our model compared to other Matryoshka models and this 2D-downsizing process has (to our knowledge) not been done before, there will be no comparison to other models in this section. We are happy to investigate what our model is capable of and excited to see where we can take the technology going forward. Because this process is very compute-intensive and we are working with quite limited resources, we only evaluated our model against two tasks, SciFact and STS. In this post, we show a selection of results for the base model, a reduction by one third of layers, and only half of the base model. The full list of results is available on our website (<a href=\"https:\/\/mixedbread.com\/images\/blog\/2dmse\/model_performance_by_hidden_size_and_layer_scifact.png\">SciFact<\/a>, <a href=\"https:\/\/mixedbread.com\/images\/blog\/2dmse\/model_performance_by_hidden_size_and_layer_sts.png\">STS<\/a>).<\/p>\n<h4>SciFact<\/h4>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/2dmse\/layerwise_performance_scifact.png\" alt=\"Model performance for 24, 16, and 13 layers and different embeddings sizes against the SciFact benchmark\" title=\"Model performance for 24, 16, and 13 layers and different embeddings sizes against the SciFact benchmark\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n    Model performance for 24, 16, and 13 layers and different embeddings sizes against the SciFact benchmark\n<\/div>\n\n<p>For SciFact, we see a reduction in performance from the base model by about a quarter of total performance when cutting the model down to 13 layers. This means that even after cutting half of the model, we still retain about 75% of performance. 
The decline in performance observed when reducing the embeddings size is consistently in line with what we would expect following the results in the above Matryoshka section.<\/p>\n<h4>STS (whole subset of MTEB)<\/h4>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/2dmse\/layerwise_performance_sts.png\" alt=\"Model performance for 24, 16, and 13 layers and different embeddings sizes against the STS benchmark\" title=\"Model performance for 24, 16, and 13 layers and different embeddings sizes against the STS benchmark\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n    Model performance for 24, 16, and 13 layers and different embeddings sizes against the STS benchmark\n<\/div>\n\n<p>For STS, the model performance after downsizing is even more promising. We still observe more than 85% of performance even after half of the model has been discarded. The results in combination with the embeddings downsizing are particularly interesting, as the performance decrease following a factor 8 dimensionality reduction is only slightly above 1%. In effect, even the combination of a 50% reduction in model size and a factor 8 reduction in embeddings size still leaves about 85% of performance.<\/p>\n<h3>Give Us Feedback<\/h3>\n<p>This is the first  model of its kind, and we welcome any feedback to make our models better and refine their user-friendliness or capabilities. Please let us know if you&#39;re hungry for any new features or have encountered any issues. We value your feedback!<\/p>\n<p>Please share your feedback and thoughts through our <a href=\"https:\/\/mixedbread.com\/urls\/discord\">discord community<\/a>. 
We are here to help and also always happy to chat about the exciting field of machine learning!<\/p>\n<h3>Citation<\/h3>\n<pre><code class=\"language-bibtex\">@online{embed2d2024mxbai,\n  title={Fresh 2D-Matryoshka Embedding Model},\n  author={Sean Lee and Aamir Shakir and Julius Lipp and Darius Koenig},\n  year={2024},\n  url={https:\/\/www.mixedbread.com\/blog\/mxbai-embed-2d-large-v1},\n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/mxbai-embed-2d-large-v1","title":"Fresh 2D-Matryoshka Embedding Model","summary":"The 2D-\ud83e\ude86 model introduces a novel approach that enables you to reduce both the number of layers and the dimensions of embeddings within the model. This dual reduction strategy allows for a more compact model size while still delivering competitive performance compared to leading models, such as Nomic's embedding model. Specifically, reducing the model's layers by approximately 50% retains up to 85% of its original performance, even without additional training.","image":"https:\/\/www.mixedbread.com\/images\/blog\/2dmse\/intro-mxbai-2dmse.jpg","date_modified":"2024-03-04T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]},{"id":"https:\/\/www.mixedbread.com\/blog\/mxbai-rerank-v1","content_html":"<p>Today, we are releasing a family of best-in-class reranking models. They come with a fully open-source Apache 2.0 license too! The Mixedbread team is happy to share these crispy models with the community \ud83c\udf5e<\/p>\n<p>Read on to learn more about our approach and to check out our benchmarks. 
If you want to skip right to the models instead, you can access them here:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-xsmall-v1\">mxbai-rerank-xsmall-v1<\/a>: Our capacity-efficient model, high-performing at a very small size.<\/li>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-base-v1\">mxbai-rerank-base-v1<\/a>: The best balance between size and performance.<\/li>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-large-v1\">mxbai-rerank-large-v1<\/a>: Our strongest model, with extraordinary accuracy and performance.<\/li>\n<\/ul>\n<h2>Why Rerank?<\/h2>\n<p>Searching data using traditional keyword-based search can be challenging and frustrating. We\u2019ve all been in a situation where we were looking for a specific piece of information and got back results that had almost nothing to do with the things we were looking for. One way to boost your search is using embeddings-based semantic search systems, which can contextualize the meaning of a user\u2019s query, allowing them to return more relevant and accurate results.<\/p>\n<p>However, many companies have built large pipelines and systems around keyword-based search. Migrating to a semantic embedding search would be resource-intensive and costly.<\/p>\n<p>With our rerank models, companies can leverage their existing search infrastructure and add a semantic boost on top. The models perform extremely well on industry relevant use cases. What\u2019s more? They\u2019re open-source and perform on par with or even better than many closed source competitors.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/rerank\/rerank-flow.png\" alt=\"Two-stage search flow including rerank\" title=\"Two-stage search flow including rerank\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  Two-stage search flow including rerank\n<\/div>\n\n<p>Reranking is applied after a first stage retrieval step. 
Keyword-based search systems like Elasticsearch or Solr can be used to retrieve the top 100 or more candidates and our reranking models can be applied at the last stage to get the most relevant candidates to the top.<\/p>\n<h2>Introducing The Mixedbread Rerank Model Family<\/h2>\n<p>We are thrilled to share our rerank model family with you. The models are fully open-source and you can host them yourself or use them with our upcoming API. These models can become an integral part of any high-performing search system.<\/p>\n<p>The models were trained using a large collection of real-life search queries and the top-10 results from search engines for these queries. A large language model ranks the results according to their relevance to the query. These signals were then used to train our rerank models. Our experiments show that our models significantly boost the search performance, particularly for complex and domain-specific queries.<\/p>\n<p>When used in combination with a keyword-based search engine, such as Elasticsearch, OpenSearch, or Solr, our rerank model endpoint can be added to the end of an existing search workflow and will allow users to incorporate semantic relevance into their keyword-based search system without changing the existing infrastructure. This is an easy, low-complexity method of improving search results by introducing semantic search technology into a user\u2019s stack with one line of code.<\/p>\n<h2>Boosting Search Quality<\/h2>\n<p>Our models are extremely easy to use with your existing search stack. Once you get the initial results from your existing search engine, pass the initial query and list of results to the model. You\u2019ll have two options: use the model either offline by hosting it yourself or online using our (upcoming) API. 
Our models come in three sizes:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-xsmall-v1\">mxbai-rerank-xsmall-v1<\/a>: Offers good performance with a slight increase in non-relevant result scores.<\/li>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-base-v1\">mxbai-rerank-base-v1<\/a>: Balances size and performance optimally.<\/li>\n<li><a href=\"https:\/\/huggingface.co\/mixedbread-ai\/mxbai-rerank-large-v1\">mxbai-rerank-large-v1<\/a>: Delivers the highest accuracy and performance.<\/li>\n<\/ul>\n<h3>Using It Locally<\/h3>\n<p>To get started, install the necessary packages:<\/p>\n<p><strong>Python (API):<\/strong><\/p>\n<pre><code class=\"language-bash\">pip install -U mixedbread\n<\/code><\/pre>\n<p><strong>TypeScript (API):<\/strong><\/p>\n<pre><code class=\"language-bash\">npm i @mixedbread-ai\/sdk\n<\/code><\/pre>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-bash\">pip install -U sentence-transformers\n<\/code><\/pre>\n<p><strong>TypeScript:<\/strong><\/p>\n<pre><code class=\"language-bash\">npm i @xenova\/transformers\n<\/code><\/pre>\n<p>Here is a quick example: Given the query <code>\u201cWho wrote &#39;To Kill a Mockingbird&#39;?\u201d<\/code>, we want to retrieve the most relevant passage to that query.<\/p>\n<p><strong>Python (API):<\/strong><\/p>\n<pre><code class=\"language-python\">from mixedbread import Mixedbread\n\nmxbai = Mixedbread(api_key=&quot;YOUR_API_KEY&quot;)\n\nres = mxbai.rerank(\n    model=&quot;mixedbread-ai\/mxbai-rerank-large-v1&quot;,\n    query=&quot;Who is the author of To Kill a Mockingbird?&quot;,\n    input=[\n        &quot;To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n        &quot;The novel Moby-Dick was written by Herman Melville and first published in 1851. 
It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.&quot;,\n        &quot;Harper Lee, an American novelist widely known for her novel To Kill a Mockingbird, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.&quot;,\n        &quot;Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.&quot;,\n        &quot;The Harry Potter series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.&quot;,\n        &quot;The Great Gatsby, a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.&quot;\n    ],\n    top_k=3,\n    return_input=False\n)\n\nprint(res.data)\n<\/code><\/pre>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-python\">from sentence_transformers import CrossEncoder\n\n# Load the model; here we use our base-sized model\nmodel = CrossEncoder(&quot;mixedbread-ai\/mxbai-rerank-base-v1&quot;)\n\n# Example query and documents\nquery = &quot;Who wrote &#39;To Kill a Mockingbird&#39;?&quot;\ndocuments = [\n    &quot;&#39;To Kill a Mockingbird&#39; is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n    &quot;The novel &#39;Moby-Dick&#39; was written by Herman Melville and first published in 1851. 
It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.&quot;,\n    &quot;Harper Lee, an American novelist widely known for her novel &#39;To Kill a Mockingbird&#39;, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.&quot;,\n    &quot;Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.&quot;,\n    &quot;The &#39;Harry Potter&#39; series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.&quot;,\n    &quot;&#39;The Great Gatsby&#39;, a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.&quot;\n]\n\n# Rank the documents by relevance to the query and keep the top 3\nresults = model.rank(query, documents, return_documents=True, top_k=3)\nprint(results)\n<\/code><\/pre>\n<p><strong>TypeScript (API):<\/strong><\/p>\n<pre><code class=\"language-typescript\">import { Mixedbread } from &quot;@mixedbread\/sdk&quot;;\n\nconst mxbai = new Mixedbread({\n  apiKey: &quot;YOUR_API_KEY&quot;,\n});\n\nconst res = await mxbai.rerank({\n  model: &quot;mixedbread-ai\/mxbai-rerank-large-v1&quot;,\n  query: &quot;Who is the author of To Kill a Mockingbird?&quot;,\n  input: [\n    &quot;To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n    &quot;The novel Moby-Dick was written by Herman Melville and first published in 1851. 
It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.&quot;,\n    &quot;Harper Lee, an American novelist widely known for her novel To Kill a Mockingbird, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.&quot;,\n    &quot;Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.&quot;,\n    &quot;The Harry Potter series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.&quot;,\n    &quot;The Great Gatsby, a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.&quot;,\n  ],\n  top_k: 3,\n  return_input: false,\n});\n\nconsole.log(res.data);\n<\/code><\/pre>\n<p><strong>TypeScript:<\/strong><\/p>\n<pre><code class=\"language-typescript\">import {AutoTokenizer, AutoModelForSequenceClassification} from &#39;@xenova\/transformers&#39;;\n\nconst model_id = &#39;mixedbread-ai\/mxbai-rerank-xsmall-v1&#39;;\nconst model = await AutoModelForSequenceClassification.from_pretrained(model_id);\nconst tokenizer = await AutoTokenizer.from_pretrained(model_id);\n\n\/**\n * Performs ranking with the CrossEncoder on the given query and documents. Returns a sorted list with the document indices and scores.\n * @param {string} query A single query\n * @param {string[]} documents A list of documents\n * @param {Object} options Options for ranking\n * @param {number} [options.top_k=undefined] Return the top-k documents. If undefined, all documents are returned.\n * @param {boolean} [options.return_documents=false] If true, also returns the documents. 
If false, only returns the indices and scores.\n *\/\nasync function rank(query, documents, {\n    top_k = undefined,\n    return_documents = false,\n} = {}) {\n    \/\/ Pair the query with every document so the cross-encoder scores each pair\n    const inputs = tokenizer(\n        new Array(documents.length).fill(query),\n        {\n            text_pair: documents,\n            padding: true,\n            truncation: true,\n        }\n    );\n    const {logits} = await model(inputs);\n    return logits\n        .sigmoid()\n        .tolist()\n        .map(([score], i) =&gt; ({\n            corpus_id: i,\n            score,\n            ...(return_documents ? {text: documents[i]} : {})\n        }))\n        .sort((a, b) =&gt; b.score - a.score)\n        .slice(0, top_k);\n}\n\n\/\/ Example usage:\nconst query = &quot;Who wrote &#39;To Kill a Mockingbird&#39;?&quot;;\nconst documents = [\n    &quot;&#39;To Kill a Mockingbird&#39; is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n    &quot;The novel &#39;Moby-Dick&#39; was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.&quot;,\n    &quot;Harper Lee, an American novelist widely known for her novel &#39;To Kill a Mockingbird&#39;, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.&quot;,\n    &quot;Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.&quot;,\n    &quot;The &#39;Harry Potter&#39; series, which consists of seven fantasy novels written by British author J.K. 
Rowling, is among the most popular and critically acclaimed books of the modern era.&quot;,\n    &quot;&#39;The Great Gatsby&#39;, a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.&quot;\n];\n\nconst results = await rank(query, documents, {return_documents: true, top_k: 3});\nconsole.log(results);\n<\/code><\/pre>\n<p>This yields a list of documents sorted by score:<\/p>\n<p><strong>Python:<\/strong><\/p>\n<pre><code class=\"language-python\">[\n    {\n        &#39;corpus_id&#39;: 0,\n        &#39;score&#39;: 0.9968497,\n        &#39;text&#39;: &quot;&#39;To Kill a Mockingbird&#39; is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;\n    },\n    {\n        &#39;corpus_id&#39;: 2,\n        &#39;score&#39;: 0.99251455,\n        &#39;text&#39;: &quot;Harper Lee, an American novelist widely known for her novel &#39;To Kill a Mockingbird&#39;, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.&quot;\n    },\n    {\n        &#39;corpus_id&#39;: 1,\n        &#39;score&#39;: 0.2528591,\n        &#39;text&#39;: &quot;The novel &#39;Moby-Dick&#39; was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.&quot;\n    }\n]\n<\/code><\/pre>\n<p><strong>TypeScript:<\/strong><\/p>\n<pre><code class=\"language-ts\">[\n  {\n    corpus_id: 0,\n    score: 0.9930055141448975,\n    text: &quot;&#39;To Kill a Mockingbird&#39; is a novel by Harper Lee published in 1960. 
It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.&quot;,\n  },\n  {\n    corpus_id: 2,\n    score: 0.9835766553878784,\n    text: &quot;Harper Lee, an American novelist widely known for her novel &#39;To Kill a Mockingbird&#39;, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.&quot;,\n  },\n  {\n    corpus_id: 3,\n    score: 0.4480893313884735,\n    text: &quot;Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.&quot;,\n  },\n];\n<\/code><\/pre>\n<p>Each result contains the <code>corpus_id<\/code> (the index of the document in the input list), the relevance <code>score<\/code>, and, if requested, the document <code>text<\/code>.<\/p>\n<p>You can try it out directly in your browser <a href=\"https:\/\/huggingface.co\/spaces\/Xenova\/cross-encoder-web\">here<\/a>. Big thanks to the Hugging Face Team and Joshua Lochner for providing the web interface and helping out in general!<\/p>\n<p>Learn more about <a href=\"https:\/\/www.sbert.net\/docs\/package_reference\/cross_encoder.html#sentence_transformers.cross_encoder.CrossEncoder.rank\">the sentence-transformers rank function<\/a> or <a href=\"https:\/\/github.com\/xenova\/transformers.js\">transformers.js<\/a>.<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/rerank\/rerank-process.png\" alt=\"The reranking process from query to ranking\" title=\"The reranking process from query to ranking\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  The reranking process from query to ranking\n<\/div>\n\n<h3>Upcoming API Integration<\/h3>\n<p>We are currently working hard to make the models available through our endpoint, so you won\u2019t have to worry about hosting and infrastructure on your end. Using the API will also provide some additional benefits, which we\u2019ll announce soon. 
Stay tuned!<\/p>\n<h2>Evaluation: Best In Class<\/h2>\n<p>We benchmarked our models against other models on common benchmarks like BEIR (using a subset). First, we measured NDCG@10, which captures the overall quality of search results by combining each result\u2019s relevance grade with its position in the list, weighting results higher in the list more heavily. Additionally, we measured Accuracy@3, the share of search queries for which the model includes a highly relevant search result in the top three results. Accuracy is a particularly relevant benchmark for the real-life use cases of search and other tasks like RAG.<\/p>\n<p>Below are the evaluation results of our models on a subset of 11 BEIR datasets. The subset was chosen for its appropriate ratio between computational demand and real-world applicability. Please note that our models have never seen any samples from these evaluation datasets, while current models regularly suffer from severe data leakage.<\/p>\n<p>First, we compare performance on NDCG@10:<\/p>\n<p><img src=\"https:\/\/mixedbread.com\/images\/blog\/rerank\/ndcg10-v1.png\" alt=\"Comparison of overall relevance scores between the Mixedbread rerank family and other models\" title=\"Comparison of overall relevance scores between the Mixedbread rerank family and other models\"><\/p>\n<div class=\"w-full text-center -mt-4 mb-4 text-xs italic\">\n  Comparison of overall relevance scores between the Mixedbread rerank family\n  and other models\n<\/div>\n\n<p>Clearly, all of our models provide a significant boost over regular lexical (keyword-based) search on the overall relevance of search results. What\u2019s more, they consistently outperform current models of the same size or even larger ones, including embeddings-based semantic search models. 
Now, we benchmark our rerank models for accuracy:<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>BEIR Accuracy (11 datasets)<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Lexical Search (Pyserini)<\/td>\n<td>66.4<\/td>\n<\/tr>\n<tr>\n<td>bge-reranker-base<\/td>\n<td>66.9<\/td>\n<\/tr>\n<tr>\n<td>bge-reranker-large<\/td>\n<td>70.6<\/td>\n<\/tr>\n<tr>\n<td>cohere-embed-v3<\/td>\n<td>70.9<\/td>\n<\/tr>\n<tr>\n<td>mxbai-rerank-xsmall-v1<\/td>\n<td>70.0<\/td>\n<\/tr>\n<tr>\n<td>mxbai-rerank-base-v1<\/td>\n<td>72.3<\/td>\n<\/tr>\n<tr>\n<td>mxbai-rerank-large-v1<\/td>\n<td>74.9<\/td>\n<\/tr>\n<\/tbody><\/table>\n<div class=\"w-full text-center mb-4 text-xs italic\">\n  Comparison of accuracy scores between the Mixedbread rerank family and other\n  models\n<\/div>\n\n<p>As the data shows, the Mixedbread rerank models again consistently perform on par with or better than the other currently available models, especially when model size is taken into account. This also includes embeddings-based semantic search models. The accuracy metric is particularly relevant because it reflects the real-world user experience of searching for information and expecting the most relevant result to show up on the screen at first glance. 
You can find more information regarding the benchmarks <a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/15ELkSMFv-oHa5TRiIjDvhIstH9dlc3pnZeO-iGz4Ld4\/edit?usp=sharing\">here<\/a>.<\/p>\n<p>One more idea: using the rerank models as a second stage after embeddings-based semantic search, rather than keyword-based search, will yield even better results!<\/p>\n<h2>Build Amazing Things With Rerank Models<\/h2>\n<p>It\u2019s our firm belief that open-sourcing the Mixedbread rerank models will help the community build amazing things, given the clear benefits of our model family:<\/p>\n<ul>\n<li><strong>Simplicity:<\/strong> The rerank step is just one line of code away from boosting your search performance.<\/li>\n<li><strong>Practicability:<\/strong> Our models can boost existing systems instead of requiring their replacement.<\/li>\n<li><strong>Performance:<\/strong> We deliver state-of-the-art performance, built for real-world use cases.<\/li>\n<\/ul>\n<p>So, what are you waiting for? Go to <a href=\"https:\/\/huggingface.co\/mixedbread-ai\">huggingface.co<\/a>, and see for yourself!<\/p>\n<h2>Give Us Feedback<\/h2>\n<p>This is our first open model release, and we welcome any feedback to make our models better and refine their user-friendliness or capabilities. Please let us know if you&#39;re hungry for any new features or have encountered any issues. We value your feedback!<\/p>\n<p>Please share your feedback and thoughts through our <a href=\"https:\/\/mixedbread.com\/urls\/discord\">Discord community<\/a>. 
We are here to help and also always happy to chat about the exciting field of machine learning!<\/p>\n<h3>Citation<\/h3>\n<pre><code class=\"language-bibtex\">@online{rerank2024mxbai,\n  title={Boost Your Search With The Crispy Mixedbread Rerank Models},\n  author={Aamir Shakir and Darius Koenig and Julius Lipp and Sean Lee},\n  year={2024},\n  url={https:\/\/www.mixedbread.com\/blog\/mxbai-rerank-v1},\n}\n<\/code><\/pre>\n","url":"https:\/\/www.mixedbread.com\/blog\/mxbai-rerank-v1","title":"Boost Your Search With The Crispy Mixedbread Rerank Models","summary":"Introducing Mixedbread rerank models - Upgrade your search results with our new, open-source reranking models from Mixedbread. These models, available in three sizes, make it easier to find relevant results by adding a semantic layer to existing search systems. They're simple to use, work with your current setup, and are proven to boost performance with many traditional and semantic search models. Check them out for a more accurate, efficient search experience.","image":"https:\/\/www.mixedbread.com\/images\/blog\/rerank\/intro-mxbai-rerank-v1.jpg","date_modified":"2024-02-29T00:00:00.000Z","author":{"name":"Mixedbread Team","url":"https:\/\/www.mixedbread.com"},"tags":["research"]}]}