Skip to content

MB-65473: Refactor and Optimize Pre-Filtered Vector Search#317

Merged
abhinavdangeti merged 14 commits intomasterfrom
preFilterOpt
Apr 1, 2025
Merged

MB-65473: Refactor and Optimize Pre-Filtered Vector Search#317
abhinavdangeti merged 14 commits intomasterfrom
preFilterOpt

Conversation

@CascadingRadium
Copy link
Copy Markdown
Member

@CascadingRadium CascadingRadium commented Mar 25, 2025

  • Refactor pre-filtered vector search to enhance performance and reduce memory footprint.
  • Replace the current bitmap-based cluster selection mechanism with a simpler approach that uses the DirectMap in the IVF index. The IVF index's DirectMap directly maps the vector ID to the cluster it belongs to.
  • Make github.com/bits-and-blooms/bitset a direct dependency of zapx and upgrade it to
    the latest version
  • Requires MB-65473: Refactor and Optimize Pre-Filtered Vector Search go-faiss#41

Copy link
Copy Markdown
Member

@abhinavdangeti abhinavdangeti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the rename request, looks fine otherwise.

Comment thread faiss_vector_posting.go Outdated
Comment thread faiss_vector_posting.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors and optimizes the pre-filtered vector search functionality to improve performance and reduce memory footprint. Key changes include:

  • Streamlining the vector filtering logic with eligibility checks and new IVF index handling.
  • Utilizing cluster vector counts for refined centroid selection.
  • Preallocating cache maps to reduce memory reallocations.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
faiss_vector_posting.go Refactored vector search logic and centroid selection with enhanced filtering.
faiss_vector_cache.go Optimized caching by preallocating the docVecIDMap based on expected size.
Comments suppressed due to low confidence (1)

faiss_vector_posting.go:472

  • [nitpick] Consider renaming 'eligibleDocsTillNow' to a more descriptive name like 'cumulativeEligibleCount' to better convey its purpose.
eligibleDocsTillNow += clusterVectorCounts[centroidID]

@Thejas-bhat
Copy link
Copy Markdown
Member

could you just update the PR description as to what's the change and how its supposed to give us benefits (jsut for my reference to start reviewing the code tomorrow)

Comment thread faiss_vector_posting.go
@CascadingRadium
Copy link
Copy Markdown
Member Author

CascadingRadium commented Mar 29, 2025

Profiles from high-selectivity tests indicated that a significant amount of time was spent generating ineligible vector IDs. This was primarily due to the overhead of hashing and handling hash collisions when using a map as a set, especially in large segments where the number of eligibleDocIDs is high.

	eligibleDocIDsSet := make(map[uint64]struct{}, len(eligibleDocIDs))
	for _, docID := range eligibleDocIDs {
		eligibleDocIDsSet[docID] = struct{}{}
	}
	ineligibleVectorIDs := make([]int64, 0, len(vecDocIDMap)-len(vectorIDsToInclude))
	for docID, vecIDs := range docVecIDMap {
		if _, exists := eligibleDocIDsSet[uint64(docID)]; !exists {

To address this, I replaced the map with a bitset from bits-and-blooms, an existing dependency of zapx. This optimization is particularly effective in high-selectivity cases, where at least 50% of the segment's docNums are eligible. Additionally, since the segment-local docNums increase monotonically from 0, the data in the bitset is dense rather than sparse.

The bitset does not require hashing and eliminates the overhead associated with hash collisions, making it inherently faster than a map. In a map-based approach, each insertion and lookup involves computing a hash and handling potential collisions, which can significantly impact performance, especially when dealing with large datasets. In contrast, a bitset directly sets or checks bits at specific positions in a pre-allocated array, resulting in constant-time operations with minimal overhead. This fundamental difference is why the bitset is expected to be much faster than a map in this scenario.

As a result, the bitset outperforms both the map and bitmap implementations for this use case.

Benchmark results confirm that the bitset is the clear winner, as expected.

Check the benchmark file below :

package main

import (
	"math/rand"
	"testing"

	"github.com/RoaringBitmap/roaring/v2/roaring64"
	"github.com/bits-and-blooms/bitset"
)

// Modify as needed
// Total documents in segment = 100K
var totalDocs = 100000

// Eligible docs = 80K (80% selectivity)
var eligibleDocs = 80000

var docVecIDMap map[uint32][]int64
var vecDocIDMap map[int64]uint32
var eligibleDocIDs []uint64

func init() {
	ratio := float32(eligibleDocs) / float32(totalDocs) // 0.8 which is > 0.5
	if ratio <= 0.5 {
		panic("eligibleDocIDs should be more than 50% of total documents for running the benchmark")
	}
	// Test Data structures
	docVecIDMap = make(map[uint32][]int64, totalDocs)
	vecDocIDMap = make(map[int64]uint32, totalDocs)
	eligibleDocIDs = make([]uint64, eligibleDocs)
	// Populate DocVecIDMap with random data
	for i := 0; i < totalDocs; i++ {
		docID := uint32(i)
		vecID := int64(rand.Int63())
		docVecIDMap[docID] = []int64{vecID}
		vecDocIDMap[vecID] = docID
	}
	// Populate eligibleDocIDs
	for i := 0; i < eligibleDocs; i++ {
		id := uint64(i)
		eligibleDocIDs[i] = id
	}
}

func BenchmarkMapSet(b *testing.B) {
	for i := 0; i < b.N; i++ {
		vectorIDsToInclude := make([]int64, 0, len(eligibleDocIDs))
		for _, id := range eligibleDocIDs {
			vecIDs := docVecIDMap[uint32(id)]
			if len(vecIDs) == 1 {
				vectorIDsToInclude = append(vectorIDsToInclude, vecIDs[0])
			} else {
				vectorIDsToInclude = append(vectorIDsToInclude, vecIDs...)
			}
		}
		eligibleDocIDsSet := make(map[uint64]struct{}, len(eligibleDocIDs))
		for _, docID := range eligibleDocIDs {
			eligibleDocIDsSet[docID] = struct{}{}
		}
		ineligibleVectorIDs := make([]int64, 0, len(vecDocIDMap)-len(vectorIDsToInclude))
		for docID, vecIDs := range docVecIDMap {
			if _, exists := eligibleDocIDsSet[uint64(docID)]; !exists {
				if len(vecIDs) == 1 {
					ineligibleVectorIDs = append(ineligibleVectorIDs, vecIDs[0])
				} else {
					ineligibleVectorIDs = append(ineligibleVectorIDs, vecIDs...)
				}
			}
		}
	}
}

func BenchmarkBitSet(b *testing.B) {
	for i := 0; i < b.N; i++ {
		vectorIDsToInclude := make([]int64, 0, len(eligibleDocIDs))
		for _, id := range eligibleDocIDs {
			vecIDs := docVecIDMap[uint32(id)]
			if len(vecIDs) == 1 {
				vectorIDsToInclude = append(vectorIDsToInclude, vecIDs[0])
			} else {
				vectorIDsToInclude = append(vectorIDsToInclude, vecIDs...)
			}
		}
		bs := bitset.New(uint(len(eligibleDocIDs)))
		for _, docID := range eligibleDocIDs {
			bs.Set(uint(docID))
		}
		ineligibleVectorIDs := make([]int64, 0, len(vecDocIDMap)-len(vectorIDsToInclude))
		for docID, vecIDs := range docVecIDMap {
			if !bs.Test(uint(docID)) {
				if len(vecIDs) == 1 {
					ineligibleVectorIDs = append(ineligibleVectorIDs, vecIDs[0])
				} else {
					ineligibleVectorIDs = append(ineligibleVectorIDs, vecIDs...)
				}
			}
		}
	}
}

func BenchmarkRoaringBitmap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		vectorIDsToInclude := make([]int64, 0, len(eligibleDocIDs))
		for _, id := range eligibleDocIDs {
			vecIDs := docVecIDMap[uint32(id)]
			if len(vecIDs) == 1 {
				vectorIDsToInclude = append(vectorIDsToInclude, vecIDs[0])
			} else {
				vectorIDsToInclude = append(vectorIDsToInclude, vecIDs...)
			}
		}
		bm := roaring64.NewBitmap()
		bm.AddMany(eligibleDocIDs)
		ineligibleVectorIDs := make([]int64, 0, len(vecDocIDMap)-len(vectorIDsToInclude))
		for docID, vecIDs := range docVecIDMap {
			if !bm.Contains(uint64(docID)) {
				if len(vecIDs) == 1 {
					ineligibleVectorIDs = append(ineligibleVectorIDs, vecIDs[0])
				} else {
					ineligibleVectorIDs = append(ineligibleVectorIDs, vecIDs...)
				}
			}
		}
	}
}

Output:

rahulrampure@HP:~/Downloads/99 Selec$ go test -bench=. -benchtime=5s -benchmem
goos: darwin
goarch: arm64
pkg: example.com/m
cpu: Apple M1 Pro
BenchmarkMapSet-10           	     859	   6604175 ns/op	 2221768 B/op	      16 allocs/op
BenchmarkBitSet-10           	    2191	   2746060 ns/op	  821280 B/op	       4 allocs/op
BenchmarkRoaringBitmap-10    	    1526	   3894757 ns/op	 1205290 B/op	      45 allocs/op
PASS
ok  	example.com/m	19.256s

Comment thread faiss_vector_posting.go
Comment thread faiss_vector_posting.go
@abhinavdangeti abhinavdangeti merged commit 8d18b86 into master Apr 1, 2025
6 checks passed
@abhinavdangeti abhinavdangeti deleted the preFilterOpt branch April 1, 2025 16:12
abhinavdangeti added a commit to blevesearch/bleve that referenced this pull request Apr 2, 2025
- Refactor pre-filtered vector search to enhance performance and reduce
memory footprint.
- Replace the current bitmap-based approach for calculating segment
local document numbers with a more direct method, where the local
document numbers are mapped directly to the segment ID during the
execution of the eligible collector.
- Requires: 
    - blevesearch/bleve_index_api#63
    - blevesearch/bleve_index_api#66
    - blevesearch/zapx#317
    - blevesearch/go-faiss#41
    - blevesearch/faiss#49

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
CascadingRadium added a commit that referenced this pull request Apr 7, 2025
- Refactor pre-filtered vector search to enhance performance and reduce
memory footprint.
- Replace the current bitmap-based cluster selection mechanism with a
simpler approach that uses the DirectMap in the IVF index. The IVF
index's DirectMap directly maps the vector ID to the cluster it belongs
to.
- Make `github.com/bits-and-blooms/bitset` a direct dependency of `zapx`
and upgrade it to
  the latest version
- Requires blevesearch/go-faiss#41

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
CascadingRadium added a commit to blevesearch/bleve that referenced this pull request Apr 7, 2025
- Refactor pre-filtered vector search to enhance performance and reduce
memory footprint.
- Replace the current bitmap-based approach for calculating segment
local document numbers with a more direct method, where the local
document numbers are mapped directly to the segment ID during the
execution of the eligible collector.
- Requires:
    - blevesearch/bleve_index_api#63
    - blevesearch/bleve_index_api#66
    - blevesearch/zapx#317
    - blevesearch/go-faiss#41
    - blevesearch/faiss#49

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
abhinavdangeti added a commit that referenced this pull request Apr 7, 2025
#320)

- Refactor pre-filtered vector search to enhance performance and reduce
memory footprint.
- Replace the current bitmap-based cluster selection mechanism with a
simpler approach that uses the DirectMap in the IVF index. The IVF
index's DirectMap directly maps the vector ID to the cluster it belongs
to.
- Make `github.com/bits-and-blooms/bitset` a direct dependency of `zapx`
and upgrade it to
  the latest version
- Requires blevesearch/go-faiss#41

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants