Add new Acorn-esque filtered HNSW search heuristic #14160
Conversation
@msokolov I'm wondering what your opinion is here. Do you think the behavior/result change is worth waiting for a major release? I do think folks should be able to use this now, but also be able to opt out. Another option I could think of is injecting a parameter directly into SPI loading for the HNSW vector readers, but I am not 100% sure how to do that. It does seem like it should be a "global" configuration for a given Lucene instance rather than one that is provided at query time.
EDIT: @benchaplin fixed a bug, and now the candidate looks way better. We should test on some more datasets, but it seems to me the new filtered approach can maybe be used all the way up to 90% filtered out. I did some more testing, this time on a single segment of our nightly runs. The recall & latency pattern is much healthier with this change, though the recall is lower. The only reason recall is so high for the restrictive filters is that the baseline over-eagerly drops to brute force because it spends way too much time doing vector comparisons. BASELINE vs. Candidate: (benchmark tables omitted)
Yeah, with the bugfix in place, this patch is looking like we don't even need it to be configurable. Let's confirm with more datasets, @benchaplin.
OK, the current implementation is about as good as I can figure it. However, one thing that bothers me is that increasing …
OK, I checked the current search, and it seems to have the same issue (increasing …).
Barring any further discussion, I am gonna merge this soon. While the filtered search still isn't perfect, this is a marked improvement.
This is a continuation and completion of the work started by @benchaplin in #14085.

The algorithm is fairly simple:

- Only score and then explore vectors that actually match the filtering criteria.
- Since this makes the graph even sparser, the search spread is increased to also include the candidates' neighbors' neighbors (e.g. generally maxConn * maxConn exploration).
- Additionally, even more scored candidates are considered for a given NSW, to combat the increased sparsity.

Some of the changes to the baseline Acorn algorithm are:

- There is a general filtering threshold that bypasses this algorithm altogether; the default suggestion is 60%.
- The number of additional neighbors explored is predicated on the percentage of the immediate neighborhood that is filtered out.
- More extended neighborhoods will be explored if the filter is exceptionally restrictive; this attempts to find valid vectors to score and explore for every candidate.
- The extended neighbors are only looked at if less than 90% of the current neighborhood matches the filter.

Here are some numbers for 1M vectors, float32 and then int4 quantized:
https://docs.google.com/spreadsheets/d/1GqD7Jw42IIqimr2nB78fzEfOohrcBlJzOlpt0NuUVDQ

Here is the "nightly" dataset (but I merged to a single segment):
https://docs.google.com/spreadsheets/d/1gk1uybtqleVtDUfhWXActyhW8q_lgG1mlMrOohnJRJA

Since this changes the behavior significantly, and there are still some weird edge cases, I am exposing this as a parameter within a new idea called `KnnSearchStrategy` that the collectors can provide. This strategy object can be provided to the queries.

closes: #13940
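To make the expansion heuristic above a bit more concrete, here is a small, self-contained sketch in plain Java. It is only an illustration under simplifying assumptions (a bare adjacency array, no visited set, an invented exploration budget), not the actual `FilteredHnswGraphSearcher` code; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/**
 * Toy sketch of the filtered-expansion heuristic described above. NOT the real
 * FilteredHnswGraphSearcher: the graph is a plain adjacency array, there is no
 * visited set or deduplication, and the 90% cutoff and exploration budget are
 * simplifications chosen to mirror the description.
 */
class FilteredExpansionSketch {

  /** Skip the extended hop when at least this fraction of immediate neighbors pass the filter. */
  private static final float MIN_PASSING_RATIO = 0.90f;

  private final int[][] neighbors; // neighbors[node] = adjacency list (up to maxConn entries)

  FilteredExpansionSketch(int[][] neighbors) {
    this.neighbors = neighbors;
  }

  /**
   * Collects the candidates to score for {@code node}: only filter-matching neighbors are
   * kept, and when the immediate neighborhood is mostly filtered out, neighbors-of-neighbors
   * (roughly maxConn * maxConn nodes) are explored as well.
   */
  List<Integer> candidatesToScore(int node, IntPredicate filter) {
    int[] hood = neighbors[node];
    List<Integer> candidates = new ArrayList<>();
    int passing = 0;
    for (int n : hood) {
      if (filter.test(n)) {
        candidates.add(n);
        passing++;
      }
    }
    float passingRatio = hood.length == 0 ? 1f : (float) passing / hood.length;
    if (passingRatio >= MIN_PASSING_RATIO) {
      // The filter barely touches this neighborhood: behave like the unfiltered search.
      return candidates;
    }
    // The more of the neighborhood the filter removes, the more second-hop nodes we allow.
    int extraBudget = Math.round((1f - passingRatio) * hood.length * hood.length);
    for (int n : hood) {
      for (int nn : neighbors[n]) {
        if (extraBudget == 0) {
          return candidates;
        }
        if (filter.test(nn)) {
          candidates.add(nn);
          extraBudget--;
        }
      }
    }
    return candidates;
  }
}
```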
Annotations & other potpourri I fixed during the prefiltered benchmarking for this change: mikemccand/luceneutil#337
Sorry, I did not follow the progress here - it looks like a nice contribution, though! I think we ended up with a per-query modification (rather than a global setting of some sort), is that right? If so, I think that makes sense if no index-level changes were required for it. Oh wait, no, I see: this is now the default behavior (dynamically selected) - no configuration needed. The only question is whether we could somehow make it available on 10.x, since it would be a behavior change? I'm not sure if that question is still relevant, though. If the results are uniformly better and there is no API breakage, would there be any concern with backporting to 10.x?
@msokolov I did backport to 10.x, but due to the recall differences I kept the current default behavior there. Folks can opt in to the new search by supplying a query-time parameter. I do think we can get uniformly better results eventually, but I thought it was "good enough" for now.
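For illustration, the query-time opt-in on 10.x might look roughly like the sketch below. The strategy-accepting `KnnFloatVectorQuery` constructor, the `KnnSearchStrategy.Hnsw` threshold argument, and the field/filter names are assumptions based on this thread, not a verified 10.x API; the javadocs are authoritative.

```java
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.knn.KnnSearchStrategy;

class FilteredKnnOptIn {
  // Opt in to the filtered HNSW search at query time. The strategy-accepting
  // KnnFloatVectorQuery constructor and the KnnSearchStrategy.Hnsw threshold shown
  // here are assumptions drawn from this thread; check the 10.x javadocs for the
  // exact names and defaults.
  static TopDocs filteredKnn(IndexSearcher searcher, float[] queryVector) throws IOException {
    Query filter = new TermQuery(new Term("color", "blue")); // any filter query
    KnnFloatVectorQuery query =
        new KnnFloatVectorQuery("vector", queryVector, 10, filter, new KnnSearchStrategy.Hnsw(60));
    return searcher.search(query, 10);
  }
}
```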
…t return enough results (#14274) When doing approximate knn search, it's possible that the approximate search returns fewer than k results. If there is a filter, we know the filter's cost, so we can check whether more than k results actually match the filter. In that case, we can switch to exact search to ensure all possible results are returned, since the approximate search was not able to find all results that matched the filter. This was already done by #14160; this PR adds tests to check this specific case.
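The fallback under test boils down to something like the sketch below. The `approximateSearch`, `exactSearch`, and `filterCost` methods are hypothetical stand-ins for the real per-segment query internals; only the decision logic is the point.

```java
import java.io.IOException;
import org.apache.lucene.search.TopDocs;

/**
 * Sketch of the exact-search fallback described above. The three abstract methods are
 * hypothetical stand-ins for the real per-segment internals.
 */
abstract class ExactFallbackSketch {
  abstract TopDocs approximateSearch(int k) throws IOException;
  abstract TopDocs exactSearch(int k) throws IOException;
  abstract long filterCost(); // number of documents matching the filter in this segment

  TopDocs searchLeaf(int k) throws IOException {
    TopDocs approx = approximateSearch(k);
    // The graph search came back short, yet the filter alone matches at least k docs:
    // the approximate search missed reachable results, so fall back to brute force.
    if (approx.scoreDocs.length < k && filterCost() >= k) {
      return exactSearch(k);
    }
    return approx;
  }
}
```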