MB-65473: Refactor and Optimize Pre-Filtered Vector Search#317
MB-65473: Refactor and Optimize Pre-Filtered Vector Search#317abhinavdangeti merged 14 commits intomasterfrom
Conversation
abhinavdangeti
left a comment
There was a problem hiding this comment.
Just the rename request, looks fine otherwise.
There was a problem hiding this comment.
Pull Request Overview
This PR refactors and optimizes the pre-filtered vector search functionality to improve performance and reduce memory footprint. Key changes include:
- Streamlining the vector filtering logic with eligibility checks and new IVF index handling.
- Utilizing cluster vector counts for refined centroid selection.
- Preallocating cache maps to reduce memory reallocations.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| faiss_vector_posting.go | Refactored vector search logic and centroid selection with enhanced filtering. |
| faiss_vector_cache.go | Optimized caching by preallocating the docVecIDMap based on expected size. |
Comments suppressed due to low confidence (1)
faiss_vector_posting.go:472
- [nitpick] Consider renaming 'eligibleDocsTillNow' to a more descriptive name like 'cumulativeEligibleCount' to better convey its purpose.
eligibleDocsTillNow += clusterVectorCounts[centroidID]
|
could you just update the PR description as to what's the change and how its supposed to give us benefits (jsut for my reference to start reviewing the code tomorrow) |
|
Profiles from high-selectivity tests indicated that a significant amount of time was spent generating ineligible vector IDs. This was primarily due to the overhead of hashing and handling hash collisions when using a map as a set, especially in large segments where the number of eligibleDocIDs is high. To address this, I replaced the map with a bitset from The bitset does not require hashing and eliminates the overhead associated with hash collisions, making it inherently faster than a map. In a map-based approach, each insertion and lookup involves computing a hash and handling potential collisions, which can significantly impact performance, especially when dealing with large datasets. In contrast, a bitset directly sets or checks bits at specific positions in a pre-allocated array, resulting in constant-time operations with minimal overhead. This fundamental difference is why the bitset is expected to be much faster than a map in this scenario. As a result, the bitset outperforms both the map and bitmap implementations for this use case. Benchmark results confirm that the bitset is the clear winner, as expected. Check the benchmark file below : Output: |
- Refactor pre-filtered vector search to enhance performance and reduce
memory footprint.
- Replace the current bitmap-based approach for calculating segment
local document numbers with a more direct method, where the local
document numbers are mapped directly to the segment ID during the
execution of the eligible collector.
- Requires:
- blevesearch/bleve_index_api#63
- blevesearch/bleve_index_api#66
- blevesearch/zapx#317
- blevesearch/go-faiss#41
- blevesearch/faiss#49
---------
Co-authored-by: Abhinav Dangeti <[email protected]>
- Refactor pre-filtered vector search to enhance performance and reduce memory footprint. - Replace the current bitmap-based cluster selection mechanism with a simpler approach that uses the DirectMap in the IVF index. The IVF index's DirectMap directly maps the vector ID to the cluster it belongs to. - Make `github.com/bits-and-blooms/bitset` a direct dependency of `zapx` and upgrade it to the latest version - Requires blevesearch/go-faiss#41 --------- Co-authored-by: Abhinav Dangeti <[email protected]>
- Refactor pre-filtered vector search to enhance performance and reduce
memory footprint.
- Replace the current bitmap-based approach for calculating segment
local document numbers with a more direct method, where the local
document numbers are mapped directly to the segment ID during the
execution of the eligible collector.
- Requires:
- blevesearch/bleve_index_api#63
- blevesearch/bleve_index_api#66
- blevesearch/zapx#317
- blevesearch/go-faiss#41
- blevesearch/faiss#49
---------
Co-authored-by: Abhinav Dangeti <[email protected]>
#320) - Refactor pre-filtered vector search to enhance performance and reduce memory footprint. - Replace the current bitmap-based cluster selection mechanism with a simpler approach that uses the DirectMap in the IVF index. The IVF index's DirectMap directly maps the vector ID to the cluster it belongs to. - Make `github.com/bits-and-blooms/bitset` a direct dependency of `zapx` and upgrade it to the latest version - Requires blevesearch/go-faiss#41 --------- Co-authored-by: Abhinav Dangeti <[email protected]>
github.com/bits-and-blooms/bitseta direct dependency ofzapxand upgrade it tothe latest version