OptimisticKnnVectorQuery #14226
Conversation
|
This is indeed interesting. I eagerly await some of the perf numbers ;). But this re-entering of the graph makes me think that collectors will need to track visitation, so that the same vector path isn't visited multiple times. It's possible that once you enter the lower level, you accidentally "back track" to previously hit vectors.
I do think that if this proves fruitful, it should go straight into the default behavior. As for a strategy or not, it depends on how extreme the performance & recall differences are :/. The reason I am thinking of a search strategy for the filter stuff I have been doing in #14160 is that the recall behavior is significantly different for certain kinds of filters. Users might want to turn it off (if they have particular kinds of filters they run that behave wonky). |
|
Hmm, GitHub's test run failed with: ... however this did not repro locally. I'm going to kick off another run |
That might not be needed? My fingers are crossed, but I'm thinking we skip most of the revisitation by seeding with the best hits, and then we would quickly terminate if nothing better is found? I was hoping to manage this without any changes to the searcher that would be needed for retaining the visited lists |
|
Well I ran some tests, and surprisingly, I saw a significant difference in both recall and latency (decreases in both). This surprised me: I expected to see more-or-less similar results, so I want to dig a little further. |
|
OK, I ran it on an 8M data set with 128 segments. Indeed it visits way fewer vectors (seemingly), and is consistent across multiple threads. 8-threaded QPS: ... The current baseline (which isn't multi-threaded consistent, so I am just showing the single-thread perf): ... Given the visited & latency, I wonder if the "visited" count actually represents what this query is visiting? But maybe the answer is allowing more exploration per segment? |
|
Ah, one other thought is that we are definitely scoring every seeded entry point twice. Once when they are gathered during initial query phase, then later through the seeded provisioning. We could adjust the internal API to allow cached scores to be provided (if we find that we can get the recall up). |
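A rough sketch of the shape such an API tweak could take; `ScoredSeedProvider`, `PerLeafSeeds`, and `seedsWithScores` are invented names for illustration, not existing Lucene API:

```java
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Hypothetical sketch: hand the second pass the first-pass hits *with* their scores,
// so that seeded entry points do not have to be re-scored when search is re-entered.
interface ScoredSeedProvider {
  ScoreDoc[] seedsWithScores(int leafOrd); // doc id plus cached similarity score per seed
}

// A trivial implementation backed by the first-pass per-leaf results.
record PerLeafSeeds(TopDocs[] perLeafResults) implements ScoredSeedProvider {
  @Override
  public ScoreDoc[] seedsWithScores(int leafOrd) {
    return perLeafResults[leafOrd].scoreDocs;
  }
}
```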
|
You can tune it by changing the magic number 3 to a bigger number. I found that with 15 I get slightly better recall and slightly lower latencies than the baseline for my test case |
|
also - we are not really tracking "visited" properly I think |
|
I think maybe what happens here is that the K controls not only how many hits are returned from each segment, but also what the "beam width" is during search, so we could have gotten better hits even from the segments that appeared to have been exhausted (because their worst hit in the top K was below the min competitive score) |
|
I am not 100% sure what's up with the behavior. However, I switched to 16. It's interesting how visited is lower, but recall is on par with the current non-consistent baseline. I don't like the magic "16" here without knowing exactly why it's working. I also fixed visited in luceneutil (accounting for the multiple topdocs merges). Here is NOT reusing the entry point scores: ... Here IS reusing the entry point scores (I have a local refactor): ... So, visited drops just a little. Here is reusing entry points, single threaded: ... Note, the visited for a SINGLE graph is around 3000. Also, as a further baseline, here is no information sharing (e.g. what it was before our inconsistent multi-threaded results), i.e. just fully exploring every graph: ... |
|
The latency improvement from re-using scores is surprisingly large. I would have expected costs to be dominated by newly-explored nodes, but this is cool. |
@msokolov this was only over 10 different query iterations on my laptop 😅 so, I wouldn't necessarily trust my latency numbers :D. But the visited count is attractive. The magic numbers are just weird to me; why 16? FWIW, I haven't tested this on any other datasets yet. If we want to provide entry-point scores, I can do that refactor in a separate PR. It isn't so bad, and is generally useful (we score entry points twice on every search, which is a minimal cost, but we could prevent doing that). |
|
I don't believe 16 is "special" except in the sense that it happens to be a sweet spot in this context. We expect that as we increase that per-segment factor we will get increased recall, because it is the equivalent of the old "fanout" parameter we used to have: it expands the beam of the search while keeping the number of results returned fixed. This is different from "fanout" because that was a global setting, and this one scales with the relative size of the segment. What I'd like to understand is whether there is a fixed value for this sweet spot, or whether it changes with {data, graph construction parameters, something else}. It seems clear that my idea that it is a probabilistic thing is incorrect, since 16 standard deviations is just crazy. But we can at least look at it empirically. |
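For concreteness, a minimal sketch of the kind of pro-rated per-leaf k under discussion, where the multiplier is the fanout-like per-segment factor whose sweet spot is in question (class, method, and parameter names are made up; this is not the PR's code):

```java
class PerLeafK {
  // Each leaf gets a share of k proportional to its size, expanded by a fanout-like factor
  // (the "magic" 3 / 15 / 16 discussed above).
  static int perLeafTopK(int k, int multiplier, int leafLiveDocs, int totalLiveDocs) {
    double leafProportion = (double) leafLiveDocs / totalLiveDocs; // NaN if totalLiveDocs == 0
    int perLeafK = (int) Math.ceil(k * multiplier * leafProportion);
    return Math.max(1, perLeafK); // guard against the NaN / zero case
  }
}
```

With this shape, increasing the multiplier widens each segment's beam without changing how many results are ultimately returned, which is why it behaves like the old fanout parameter but scales with segment size.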
|
@benwtrent I'm curious how you managed to re-use the scores. I poked around a little and I guess it requires some new plumbing / class casts since the API isn't really designed for it? Do you have a patch you wouldn't mind posting? |
|
@msokolov I can provide a quick patch tomorrow against main. As I said, it would work for stuff as it is now. |
    while (ctxIter.hasNext()) {
      LeafReaderContext ctx = ctxIter.next();
      TopDocs perLeaf = perLeafResults.get(ctx.ord);
      if (perLeaf.scoreDocs.length > 0
Can we publish the metrics for each case here to see how often/how many passes we usually need to take?
yes that would be critical to understanding what's going on here! Maybe we can extend TopK to include some metrics? Not sure what the best API is here
Maybe we can define it as an abstract method, and let the consumer override it with their own metrics system?
Yes, OK, pass in some kind of abstract Counter? But maybe a simple int[] is enough? The consumer can always convert to their metric system.
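A minimal sketch of the int[]-style hook being floated, assuming the first-pass per-leaf results and the merged global minimum score are at hand (all names here are illustrative):

```java
import java.util.List;
import org.apache.lucene.search.TopDocs;

class OptimisticKnnMetrics {
  // Hypothetical hook along the lines suggested above: a bare int[] slot the caller can
  // inspect after the query runs, counting how many leaves needed a second (seeded) pass.
  static void recordSecondPassLeaves(
      List<TopDocs> perLeafResults, float globalMinScore, int[] secondPassCount) {
    for (TopDocs perLeaf : perLeafResults) {
      if (perLeaf.scoreDocs.length > 0
          && perLeaf.scoreDocs[perLeaf.scoreDocs.length - 1].score >= globalMinScore) {
        secondPassCount[0]++; // this leaf's worst returned hit is still globally competitive
      }
    }
  }
}
```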
    assert perLeafTopK > 0; // if we divided by zero above, leafProportion can be NaN and then this would be 0
    return new TopKnnCollector(perLeafTopK, visitedLimit);
Would it make sense to cap perLeafTopK by the original k? I think k is doubled on every iteration, and perLeafTopK can theoretically go over the original k, which is excessive.
yes, we can do it in perLeafTopkCalculation
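A sketch of that cap, building on the hypothetical PerLeafK helper above rather than the PR's actual perLeafTopkCalculation:

```java
class CappedPerLeafK {
  // Never ask a single leaf for more than the original k, even after the fanout-like expansion.
  static int cappedPerLeafTopK(int k, int multiplier, int leafLiveDocs, int totalLiveDocs) {
    int perLeafTopK =
        Math.min(k, PerLeafK.perLeafTopK(k, multiplier, leafLiveDocs, totalLiveDocs));
    assert perLeafTopK > 0; // leafProportion can be NaN if we divided by zero above
    return perLeafTopK;
  }
}
```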
|
I pushed a version that re-uses scores and limits per-leaf topK to the global topK. The former didn't make very much difference, but the latter change did improve things quite a bit. Here are some numbers from cohere/768d: mainline: ... with reused scores and limiting perLeafK <= K: ... It would be awesome if you could produce similar comparisons for this version, @dungba88! |
|
There were some conflicts with other recent changes, so I rebased and removed the reuse-scores part of this, since it was conflicting and doesn't seem worth preserving. Then I had to force-push, sorry. |
@msokolov thanks for confirming and digging into that :). It was worth a shot! But it's not super surprising, as the dominant cost will be the further exploration. |
|
@msokolov I didn't see the change to cap perLeafTopK |
|
oh darn, in all my rebasing and force-pushing I seem to have ended up with the wrong version, grr. I will check reflog and recover... |
|
Do we know where we stand with this? I wonder if we should simply use this to replace the current buggy behavior: fanning out to segments multiple times with sane sync points should fix our consistency issues, and if the performance of this is on par with the current information-sharing logic, it seems like a net win. Anything I can do to help here? |
|
I'm also in favor of making this the default behavior. Maybe we can allow the lambda parameter to be tuned, with a reasonable default value (similar to greediness in the current collector), so that experts can tune it if they need to. From the benchmark we also saw that the tuning range and impact of this optimistic algorithm are larger than for the global-sharing algorithm, so it may satisfy a wider audience. |
|
@msokolov I can publish the PR for the simplified version, directly changing the default behavior, if there is no objection (and if you are not available to do it). |
|
Sorry I've been out to lunch on this - yes, I agree we should move forward here. I think we might have ended up in some analysis paralysis. We have a good candidate; let's push it to main at least. We can improve the benchmarks in parallel: I think the existing benchmarking is already quite promising. @dungba88 if you have a patch with the "simplified" version ready, please go ahead and post it.
Regarding the lambda setting, I'm in favor of fixing it as a magic number that replicates something close to the current behavior (or better recall, if we can get that while retaining the same latency, as we found with lambda=16) and letting users tune further using fanout. I think these are roughly equivalent, and I don't think we should be exposing a lot of knobs. Experts who want to tune lambda can easily write their own Query.
I am for this as well. Exposing more and more knobs makes things way too complicated. |
|
As far as making this the default, that sounds OK to me, but let's not backport until we've had a chance to verify no harm for a while in some pre-production environments |
dungba88 left a comment:
Oh I see the latest commit already incorporates the simplified version and makes it the default behavior. That's nice! I will just comment here instead.
    TimeLimitingKnnCollectorManager knnCollectorManagerInner =
        new TimeLimitingKnnCollectorManager(
            new ReentrantKnnCollectorManager(
                getKnnCollectorManager(k, indexSearcher), perLeafResults),
I think this is the second pass right? If so, we need to get the TopKnnCollector with full k instead of per leaf?
right! duh - I just collapsed all the code, haven't had a chance to retest w/luceneutil. Thanks for checking! This is hard to unit test, grr
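To make the fix concrete, a sketch of the collector choice per pass (illustrative only, not the PR's ReentrantKnnCollectorManager): the first pass uses the pro-rated per-leaf budget, while the seeded second pass uses the full k, since only leaves that still look globally competitive are re-entered.

```java
import org.apache.lucene.search.TopKnnCollector;

class TwoPassCollectors {
  // First pass: optimistic, pro-rated per-leaf budget.
  static TopKnnCollector firstPass(int perLeafTopK, int visitedLimit) {
    return new TopKnnCollector(perLeafTopK, visitedLimit);
  }

  // Second (seeded) pass: the full k, not the per-leaf share.
  static TopKnnCollector secondPass(int k, int visitedLimit) {
    return new TopKnnCollector(k, visitedLimit);
  }
}
```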
|
I did a run over 4M vectors, and another with bit quantization. In both cases, this optimistic pattern beats the baseline (sharing): it consistently gives better recall at better latency. I think the vector-ops count method I am using is "double counting" the vector ops... |
|
I did another round of testing with luceneutil over a couple different vector datasets and it is consistently outperforming the baseline. Given that and @benwtrent's recent results, I think we can go ahead. |
* main: (27 commits)
  * deps(java): bump com.github.luben:zstd-jni from 1.5.7-2 to 1.5.7-3 (apache#14621)
  * Improve user-facing docs for geo package (apache#14534)
  * Enabling histogram collection for PointRangeQuery (apache#14560)
  * Move sloppySin into SloppyMath from GeoUtils (apache#14516)
  * Rewrite APIJAR extractor to use Java 24 classfile API and kill ASM dependency also for build system (apache#14613)
  * CHANGES entry for apache#14226 (optimistic KNN Query)
  * OptimisticKnnVectorQuery (apache#14226)
  * Fix for Windows (spaces in paths) apache#14608
  * Update jdk requirements in README to OpenJDK 24 (apache#14610)
  * Always check gradle wrapper sha checksum and download if necessary (apache#14608)
  * Fix changelog verifier (apache#14606)
  * MultiRange query for SortedNumeric DocValues (apache#14404)
  * Remove RANDOM_PRELOAD read advice, which is not actually used (apache#14593)
  * Remove duplicate test (apache#14602)
  * Refactor the expressions compiler to use official ClassData BSM with indexed lookup (apache#14602)
  * Disallow EA versions to run Gradle (apache#14601)
  * Add back-compat indices for 10.2.1
  * Add Lucene 10.2.1 version constant
  * DOAP changes for release 10.2.1
  * Revert "An attempt to make jenkins pass with the currently installed jdk24-ea. To be reverted later. apache#14600"
  * ...
* OptimisticKnnVectorQuery that re-enters search for segments with strong results
* fix JDK23 lint
* simplify OptimisticKnnVectorQuery to 2 rounds only
* move OptimisticKnnVectorQuery into AbstractKnnVectorQuery
* tidy
* Use TopKnnCollector for second pass so we correctly collect total top K

---------

Co-authored-by: Michael Sokolov <[email protected]>


Description
This is a WIP patch to work out an idea for Knn hit collection that is deterministic and efficient in the sense that the number of hits collected per leaf scales with the size of the leaf. The idea is:
1. Search each leaf, collecting a per-leaf number of hits (kLeaf) pro-rated to the leaf's share of the index, on the optimistic assumption that the global top K is spread across leaves roughly in proportion to their size.
2. Merge the per-leaf results into a global top K; the worst score in that merged top K is the global minimum score.
3. Any leaf whose minimum score among its kLeaf results is >= the global minimum score (in top K across all leaves, merged) is submitted for further exploration using seeded search starting with the previous best results.
4. Merge the results of the re-entered searches with the original ones to produce the final top K.

When the presumption of uniform distribution is valid, we would be able to skip steps 3 and 4, so we should get similar performance as we do with the simple pro-rated algorithm. Otherwise we pay the cost of re-entering search, but we are guaranteed to get the global (approximate) top K in a deterministic way (without cross-thread communication during search), and hopefully the cost is minimized by seeding with the best results so far.
The key point is that we make an optimistic assumption initially, but we check it, and if it proves to be incorrect, we make up for it with some more work. In many circumstances the original assumption will hold, and in any case the cost of further searching shouldn't be too much because of the seeding. I plan to do some perf tests; no results yet beyond unit testing. I did run TestOptimisticKnnVectorQuery with 1000 repeats and it didn't fail, including the new consistency test in #14180.
This is a WIP only because the class design isn't really great. It re-uses a bunch of stuff from SeededKnnVectorQuery. It's for floats only as it is. My thinking is a better way might be to make this the default and merge it directly into AbstractKnnVectorQuery? Or it could possibly be triggered by a KnnSearchStrategy?
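To tie the description together, here is a condensed sketch of the two-pass flow; the LeafSearcher hook and every name in it are placeholders standing in for the real per-leaf graph searches, so this illustrates the idea rather than the patch itself.

```java
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

class OptimisticSearchSketch {

  interface LeafSearcher {
    TopDocs searchLeaf(int leafOrd, int k);                      // plain per-leaf graph search
    TopDocs seededSearchLeaf(int leafOrd, int k, TopDocs seeds); // re-entry seeded with prior hits
  }

  static TopDocs search(LeafSearcher searcher, int numLeaves, int k, int[] perLeafTopK) {
    // Pass 1: optimistically collect only a pro-rated share of k from each leaf.
    TopDocs[] perLeaf = new TopDocs[numLeaves];
    for (int ord = 0; ord < numLeaves; ord++) {
      perLeaf[ord] = searcher.searchLeaf(ord, perLeafTopK[ord]);
    }
    TopDocs merged = TopDocs.merge(k, perLeaf);
    float globalMinScore =
        merged.scoreDocs.length == 0
            ? Float.NEGATIVE_INFINITY
            : merged.scoreDocs[merged.scoreDocs.length - 1].score;

    // Pass 2: a leaf whose worst returned hit is still globally competitive may be hiding
    // better hits, so re-enter it with the full k, seeded by its first-pass results.
    for (int ord = 0; ord < numLeaves; ord++) {
      ScoreDoc[] hits = perLeaf[ord].scoreDocs;
      if (hits.length > 0 && hits[hits.length - 1].score >= globalMinScore) {
        perLeaf[ord] = searcher.seededSearchLeaf(ord, k, perLeaf[ord]);
      }
    }
    return TopDocs.merge(k, perLeaf);
  }
}
```

The deterministic property comes from the fact that the second-pass decision depends only on the merged first-pass results, not on cross-thread timing.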