
Conversation

@msokolov
Contributor

@msokolov msokolov commented Feb 12, 2025

Description

This is a WIP patch to work out an idea for Knn hit collection that is deterministic and efficient, in the sense that the number of hits collected per leaf scales with the size of the leaf. The idea is:

  1. Run a pro-rated search, optimistically assuming the global top-K documents are uniformly and randomly distributed among leaves, by scaling k down to kLeaf in proportion to the leaf's share of the documents.
  2. Examine the list of per-leaf results. Any leaf whose minimum score among its kLeaf results is >= the global minimum score (the minimum of the top K merged across all leaves) is submitted for further exploration using seeded search, starting with the previous best results.
  3. Repeat step 2 until every leaf's minimum score is worse than the global minimum score, or other limiting conditions are reached.

When the presumption of uniform distribution holds, no leaf needs further exploration and we can skip the re-entrant rounds, so we should get performance similar to the simple pro-rated algorithm. Otherwise we pay the cost of re-entering the search, but we are guaranteed to get the global (approximate) top K in a deterministic way (without cross-thread communication during search), and hopefully the cost is minimized by seeding with the best results so far.
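To make the control flow concrete, here is a condensed sketch of the loop described above (illustrative only: searchLeaf stands in for the per-leaf HNSW search, seeding and limits are simplified, and edge cases such as empty result lists are omitted):

static TopDocs optimisticSearch(List<LeafReaderContext> leaves, int k, int totalDocCount, int maxRounds)
    throws IOException {
  TopDocs[] perLeaf = new TopDocs[leaves.size()];
  for (LeafReaderContext ctx : leaves) {
    // step 1: optimistic per-leaf k, pro-rated by the leaf's share of the documents
    int kLeaf = Math.max(1, (int) Math.ceil((double) k * ctx.reader().maxDoc() / totalDocCount));
    perLeaf[ctx.ord] = searchLeaf(ctx, kLeaf, null); // no seeds in the first round
  }
  TopDocs merged = TopDocs.merge(k, perLeaf);
  for (int round = 0; round < maxRounds; round++) { // "other limiting conditions"
    float globalMinScore = merged.scoreDocs[merged.scoreDocs.length - 1].score;
    boolean reentered = false;
    for (LeafReaderContext ctx : leaves) {
      ScoreDoc[] leafDocs = perLeaf[ctx.ord].scoreDocs;
      // step 2: a leaf whose worst hit is still competitive may hold more of the global top K
      if (leafDocs.length > 0 && leafDocs[leafDocs.length - 1].score >= globalMinScore) {
        perLeaf[ctx.ord] = searchLeaf(ctx, k, perLeaf[ctx.ord]); // re-enter, seeded with the best hits so far
        reentered = true;
      }
    }
    if (reentered == false) {
      break; // step 3: no leaf is competitive any more
    }
    merged = TopDocs.merge(k, perLeaf);
  }
  return merged;
}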

The key point is that we make an optimistic assumption initially, but we check it, and if it proves to be incorrect, we make up for it with some more work. In many circumstances the original assumption will hold, and in any case the cost of further searching shouldn't be too much because of the seeding. I plan to do some perf tests; no results yet beyond unit testing. I did run TestOptimisticKnnVectorQuery with 1000 repeats and it didn't fail, including the new consistency test in #14180.

This is a WIP only because the class design isn't really great. It re-uses a bunch of stuff from SeededKnnVectorQuery, and as it stands it only handles floats. My thinking is that a better way might be to make this the default and merge it directly into AbstractKnnVectorQuery? Or it could possibly be triggered by a KnnSearchStrategy?

@benwtrent
Member

This is indeed interesting. I eagerly await some of the perf numbers ;).

But this re-entering of the graph makes me think that collectors will need to track visitation, so that the same vector path isn't visited multiple times. It's possible that once you re-enter the lower level, you accidentally "back-track" to previously hit vectors.

My thinking is that a better way might be to make this the default and merge it directly into AbstractKnnVectorQuery? Or it could possibly be triggered by a KnnSearchStrategy?

I do think that if this proves fruitful, it should go straight into the default behavior. As for a strategy or not, it depends on how extreme the performance & recall differences are :/.

The reason I am thinking of a search strategy for the filter work I have been doing in #14160 is that the recall behavior is significantly different for certain kinds of filters. Users might want to turn it off (if the particular kinds of filters they run behave wonky).

@msokolov
Contributor Author

Hmm, GitHub's test run failed with:

gradlew :lucene:core:test --tests "org.apache.lucene.search.TestByteVectorSimilarityQuery.testFallbackToExact" -Ptests.jvms=1 -Ptests.jvmargs= -Ptests.seed=9F91AA4B44864767 -Ptests.useSecurityManager=true -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1 -Ptests.vectorsize=default -Ptests.forceintegervectors=false

However, this did not repro locally. I'm going to kick off another run.

@msokolov
Contributor Author

msokolov commented Feb 12, 2025

But this re-entering of the graph makes me think that collectors will need to track visitation, so that the same vector path isn't visited multiple times. It's possible that once you re-enter the lower level, you accidentally "back-track" to previously hit vectors.

That might not be needed? My fingers are crossed, but I'm thinking we skip most of the revisitation by seeding with the best hits, and then we would quickly terminate if nothing better is found. I was hoping to manage this without any changes to the searcher that would be needed for retaining the visited lists.

@msokolov
Contributor Author

Well, I ran some tests and saw a significant difference in both recall and latency (decreases in both). This surprised me: I expected to see more-or-less similar results, so I want to dig a little further.

@benwtrent
Member

OK, I ran it on an 8M data set with 128 segments.

Indeed it visits way fewer vectors (seemingly), and is consistent across multiple threads.

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.719       36.100  8000000   100     100    28015           128        1.000

8-threaded QPS:

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.719        5.700  8000000   100     100    28015           128        1.000

The current baseline (which isn't consistent when multi-threaded, so I am just showing the single-threaded perf):

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.949       59.900  8000000   100     100    87299           128        1.000

Given the visited & latency, I wonder if the "visited" actually represents what this query is visiting?

But, maybe the answer is allowing more exploration per segment?

@benwtrent
Member

benwtrent commented Feb 13, 2025

Ah, one other thought is that we are definitely scoring every seeded entry point twice: once when they are gathered during the initial query phase, and then again later through the seeded provisioning. We could adjust the internal API to allow cached scores to be provided (if we find that we can get the recall up).
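For illustration only, one hypothetical shape for that (not an existing Lucene interface) would be a seed iterator that carries the first-pass score along with each seed doc, so the graph searcher could enqueue its entry points without re-scoring them:

// Hypothetical sketch only -- not an existing Lucene API.
interface ScoredSeedIterator {
  boolean next() throws IOException; // advance; returns false when the seeds are exhausted
  int docID();                       // current seed's doc id within the leaf
  float score();                     // similarity score cached from the first pass
}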

@msokolov
Contributor Author

You can tune it by changing the magic number 3 to a bigger number. I found that with 15 I get slightly better recall and slightly lower latencies than the baseline for my test case.

@msokolov
Contributor Author

Also, I don't think we are really tracking "visited" properly.

@msokolov
Contributor Author

I think maybe what happens here is that the K controls not only how many hits are returned from each segment, but also what the "beam width" is during the search, so we could have gotten better hits even from the segments that appeared to have been exhausted (because their worst score in the top K was below the min competitive score).

@benwtrent
Member

I am not 100% sure what's up with the behavior. However, I switched to 16 (which also happens to be the graph conn) instead of 3.

It's interesting how visited is lower, but recall is on par with the current non-consistent baseline. I don't like the magic "16" here without knowing exactly why it's working.

I also fixed visited in luceneutil (accounting for the multiple TopDocs merges).

Here is NOT reusing the entry point scores

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.942       10.400  8000000   100     100    68605           128        1.000

Here IS reusing the entry point scores (I have a local refactor). So, visited drops just a little.

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.942        7.400  8000000   100     100    68477           128        1.000

Here is reusing the entry point scores, single-threaded:

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.942       49.300  8000000   100     100    68477           128        1.000

Note, the visited for a SINGLE graph is around 3000.

Also, as a further baseline, here is no information sharing (i.e. what we had before our inconsistent multi-threaded results), just fully exploring every graph:

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 1.000      213.400  8000000   100     100   412280           128        1.000

@msokolov
Contributor Author

The latency improvement from re-using scores is surprisingly large. I would have expected costs to be dominated by newly-explored nodes, but this is cool.

@benwtrent
Member

The latency improvement from re-using scores is surprisingly large. I would have expected costs to be dominated by newly-explored nodes, but this is cool.

@msokolov this was only over 10 different query iterations on my laptop 😅 so, I wouldn't necessarily trust my latency numbers :D.

But, the visited count is attractive. The magic numbers are just weird to me. Why is 16 so special?

FWIW, I haven't tested this on any other datasets yet.

If we want to provide entry-point scores, I can do that refactor in a separate PR. It isn't so bad, and it is generally useful (we score the entry point twice on every search, which is a minimal cost, but we could prevent doing that).

@msokolov
Contributor Author

I don't believe 16 is "special" except in the sense that it happens to be a sweet spot in this context. We expect that as we increase that per-segment factor we will get increased recall, because it is the equivalent of the old "fanout" parameter we used to have - it expands the beam of the search while keeping the number of results returned fixed. This is different from "fanout" because that was a global setting, and this one scales with the relative size of the segment.

What I'd like to understand is whether there is a fixed value for this sweet spot, or whether it changes with {data, graph construction parameters, something else}. It seems clear that my idea that it is a probabilistic thing is incorrect since 16 standard deviations is just crazy. But we can at least look at it empirically.
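For reference, the probabilistic reading of the per-leaf sizing being discussed would look roughly like this (the exact expression in the patch may differ; leafDocCount, totalDocCount and lambda are placeholder names):

// Model the global top k as k independent draws, each landing in this leaf with probability p,
// and pad the expected count by lambda standard deviations of the resulting binomial.
double p = (double) leafDocCount / totalDocCount;
double expected = k * p;
double stdDev = Math.sqrt(k * p * (1 - p));
int perLeafTopK = Math.max(1, (int) Math.ceil(expected + lambda * stdDev)); // lambda = 3, 9, 16, ...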

@msokolov
Contributor Author

@benwtrent I'm curious how you managed to re-use the scores. I poked around a little and I guess it requires some new plumbing / class casts since the API isn't really designed for it? Do you have a patch you wouldn't mind posting?

@benwtrent
Member

@msokolov I can provide a quick patch tomorrow against main. As I said, it would work for stuff as it is now.

while (ctxIter.hasNext()) {
  LeafReaderContext ctx = ctxIter.next();
  TopDocs perLeaf = perLeafResults.get(ctx.ord);
  if (perLeaf.scoreDocs.length > 0
Contributor

Can we publish the metrics for each case here to see how often/how many passes we usually need to take?

Contributor Author

Yes, that would be critical to understanding what's going on here! Maybe we can extend TopK to include some metrics? Not sure what the best API is here.

Contributor

Maybe we can define it as an abstract method, and let the consumer override it with their own metrics system?

Contributor Author

Yes, OK, pass in some kind of abstract Counter? But maybe a simple int[] is enough? The consumer can always convert to their own metrics system.


assert perLeafTopK > 0; // if we divided by zero above, leafProportion can be NaN and then this would be 0
return new TopKnnCollector(perLeafTopK, visitedLimit);
Contributor

Would it make sense to cap perLeafTopK by the original k? I think k is doubled on every iteration, and perLeafTopK can theoretically go over the original k, which is excessive.

Contributor Author

Yes, we can do it in perLeafTopkCalculation.
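For illustration, the cap could be as simple as this inside that calculation (sketch only):

perLeafTopK = Math.min(k, perLeafTopK); // never ask a leaf for more hits than the global k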

@dungba88
Contributor

dungba88 commented Feb 22, 2025

I ran some benchmarks with the Cohere 768d dataset for 3 different algorithms: (1) the baseline "greedy", (2) this PR, "optimistic", and (3) "pro-rata" only. (2) and (3) converge with fanout >= 20, which kind of makes sense because as we increase the ef parameter we already look deeper into the graph. I'll try different values of stdev as well.

(2) still missed some matches (about 8%) when compared to (1).

Other params: ndoc=500K, topK=100

recall  latency_(ms)  fanout  quantized  greediness  index_s  index_docs/s  force_merge_s  num_segments  index_size_(MB)  vec_disk_(MB)  vec_RAM_(MB)   algorithm 
 0.936        13.285       0         no        0.00   117.24       4264.94           0.00             7          1490.05       1464.844      1464.844  greediness 
 0.933        12.766       0         no        0.10   116.45       4293.84           0.00             7          1490.14       1464.844      1464.844  greediness 
 0.927        12.045       0         no        0.30   116.69       4285.04           0.00             7          1490.12       1464.844      1464.844  greediness 
 0.923        11.709       0         no        0.50   117.74       4246.79           0.00             7          1490.06       1464.844      1464.844  greediness 
 0.917        10.950       0         no        0.70   117.68       4248.67           0.00             7          1490.06       1464.844      1464.844  greediness 
 0.907        10.996       0         no        0.90   117.40       4258.80           0.00             7          1489.98       1464.844      1464.844  greediness 
 0.836        10.349       0         no        1.00   117.07       4271.13           0.00             7          1490.03       1464.844      1464.844  greediness 
 0.846         6.682       0         no       -1.00   116.50       4291.77           0.00             7          1490.09       1464.844      1464.844  optimistic
 0.862         6.699       5         no       -1.00   118.33       4225.44           0.00             7          1489.97       1464.844      1464.844  optimistic
 0.871         7.172      10         no       -1.00   116.10       4306.63           0.00             7          1490.05       1464.844      1464.844  optimistic
 0.897         7.962      20         no       -1.00   116.11       4306.15           0.00             7          1490.06       1464.844      1464.844  optimistic
 0.925        10.975      50         no       -1.00   116.71       4284.23           0.00             7          1490.10       1464.844      1464.844  optimistic
 0.941        15.635     100         no       -1.00   116.58       4289.08           0.00             7          1490.13       1464.844      1464.844  optimistic
 0.786         5.727       0         no       -1.00   117.86       4242.25           0.00             7          1490.10       1464.844      1464.844    prorata
 0.831         6.176       5         no       -1.00   117.27       4263.70           0.00             7          1490.07       1464.844      1464.844    prorata
 0.857         6.746      10         no       -1.00   116.79       4281.04           0.00             7          1490.08       1464.844      1464.844    prorata
 0.891         7.763      20         no       -1.00   116.21       4302.59           0.00             7          1490.09       1464.844      1464.844    prorata
 0.921        11.083      50         no       -1.00   118.02       4236.46           0.00             7          1490.06       1464.844      1464.844    prorata
 0.940        15.694     100         no       -1.00   116.63       4286.91           0.00             7          1490.00       1464.844      1464.844    prorata
[Screenshot 2025-02-22 at 21 17 40]

@dungba88
Contributor

dungba88 commented Feb 23, 2025

Here is the graph with various values of stdev (3, 9, 16, 25). Again both (2) and (3) converge with a stdev value of 9, and all 3 converge at a stdev value of 16 and greediness of 0.7. With this dataset and settings, using 9 seems to be the most efficient, but then there is virtually no difference between (2) and (3). I guess (2) would shine when the data is really skewed?

[Screenshot 2025-02-23 at 21 31 42]

@msokolov
Contributor Author

msokolov commented Feb 24, 2025

I pushed a version that re-uses scores and limits per-leaf topK to global topK. The former didn't make very much difference, but the latter change did improve things quite a bit. Here are some numbers from cohere/768d:

mainline

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
0.954        12.919  500000    50       0       64        250         no    13786     0.00      Infinity             8          1501.70       1464.844      1464.844
0.981        18.488  500000    50      50       64        250         no    20371     0.00      Infinity             8          1501.70       1464.844      1464.844
0.989        22.948  500000    50     100       64        250         no    24963     0.00      Infinity             8          1501.70       1464.844      1464.844

with reused scores and limiting perLeafK <= K

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index s  index docs/s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.959        11.375  500000    50       0       64        250         no    12086   308.23       1622.15             8          1501.70       1464.844      1464.844
 0.979        14.926  500000    50      50       64        250         no    16724     0.00      Infinity             8          1501.70       1464.844      1464.844
 0.987        17.858  500000    50     100       64        250         no    20277     0.00      Infinity             8          1501.70       1464.844      1464.844

It would be awesome if you could produce similar comparisons for this version, @dungba88!

@msokolov msokolov force-pushed the optimistic-knn-query branch from 8d16223 to d20f704 on February 24, 2025 16:12
@msokolov
Contributor Author

There were some conflicts with other recent changes, so I rebased and removed the reuse-scores part of this since it was conflicting with other changes and doesn't seem worth preserving. Then I had to force-push, sorry.

@msokolov msokolov force-pushed the optimistic-knn-query branch from d20f704 to 5b06e16 on February 24, 2025 16:21
@benwtrent
Member

so I rebased and removed the reuse-scores part of this since it was conflicting with other changes and doesn't seem worth preserving

@msokolov thanks for confirming and digging into that :). It was worth a shot! But it's not super surprising, as the dominant cost will be the further exploration.

@dungba88
Contributor

@msokolov I didn't see the change to cap perLeafTopK in the latest commit, maybe it has not been published yet?

@msokolov
Contributor Author

oh darn, in all my rebasing and force-pushing I seem to have ended up with the wrong version, grr. I will check reflog and recover...

@benwtrent
Member

@dungba88 @msokolov

Do we know where we stand with this? I wonder if we should simply use this to replace the current buggy behavior: fanning out to segments multiple times with sane sync points should fix our consistency issues, and if the performance of this is on par with the current information-sharing logic, it seems like a net win.

Anything I can do to help here?

@github-actions github-actions bot removed the Stale label Apr 3, 2025
@dungba88
Contributor

dungba88 commented Apr 3, 2025

I'm also in favor of making this the default behavior. Maybe we can let the lambda parameter be tuned, with a reasonable default value (similar to greediness in the current collector), so that experts can tune it if they need to. From the benchmark we also saw that the tuning range and impact of this optimistic algorithm are larger than those of the global sharing algorithm, so it may satisfy a wider audience.

@dungba88
Contributor

dungba88 commented Apr 8, 2025

@msokolov I can publish a PR for the simplified version that directly changes the default behavior if there is no objection (and if you are not available to do it).

@msokolov
Contributor Author

Sorry, I've been out to lunch on this - yes, I agree we should move forward here. I think we might have ended up in some analysis paralysis. We have a good candidate; let's push it to main at least. We can improve the benchmarks in parallel: I think the existing benchmarking is already quite promising. @dungba88, if you have a patch with the "simplified" version ready, please go ahead and post it.

@msokolov
Contributor Author

msokolov commented Apr 15, 2025

I'm also in favor of making this the default behavior. Maybe we can let the lambda parameter be tuned, with a reasonable default value (similar to greediness in the current collector), so that experts can tune it if they need to. From the benchmark we also saw that the tuning range and impact of this optimistic algorithm are larger than those of the global sharing algorithm, so it may satisfy a wider audience.

Regarding the lambda setting, I'm in favor of fixing it as a magic number that replicates something close to the current behavior (or better recall, if we can get that while retaining the same latency, as we found with lambda=16) and letting users tune further using fanout. I think these are roughly equivalent, and I don't think we should be exposing a lot of knobs. Experts who want to tune lambda can easily write their own Query.

@benwtrent
Member

I'm in favor of fixing it as a magic number that replicates something close to the current behavior (or better recall, if we can get that while retaining the same latency, as we found with lambda=16) and letting users tune further using fanout. I think these are roughly equivalent, and I don't think we should be exposing a lot of knobs.

I am for this as well. Exposing more and more knobs makes things way too complicated.

@msokolov
Contributor Author

As far as making this the default goes, that sounds OK to me, but let's not backport until we've had a chance to verify no harm for a while in some pre-production environments.

Contributor

@dungba88 dungba88 left a comment

Oh, I see the latest commit already incorporates the simplified version and makes it the default behavior. That's nice! I will just comment here instead.

TimeLimitingKnnCollectorManager knnCollectorManagerInner =
    new TimeLimitingKnnCollectorManager(
        new ReentrantKnnCollectorManager(
            getKnnCollectorManager(k, indexSearcher), perLeafResults),
Contributor

I think this is the second pass, right? If so, we need to get the TopKnnCollector with the full k instead of the per-leaf k?

Contributor Author

Right! Duh - I just collapsed all the code and haven't had a chance to retest with luceneutil. Thanks for checking! This is hard to unit test, grr.
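For illustration, the agreed fix amounts to sizing the second-pass collector with the query-level k rather than the pro-rated per-leaf k (sketch only; visitedLimit as used elsewhere):

// second pass: collect the full k per leaf so the merged result really is the global top K
return new TopKnnCollector(k, visitedLimit);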

@benwtrent
Member

I did a run over 4M vectors, and one with bit quantization. In both cases, this optimistic pattern beats the baseline (sharing): it consistently gives better recall at better latency.

I think that the vector ops count method I am using is "double counting" the vector ops...

https://docs.google.com/spreadsheets/d/1B5L0Ob9Q_V7zwryYUJhk_T3d4zhbTLh17LezFWba08A/edit?usp=sharing

@msokolov
Contributor Author

msokolov commented May 5, 2025

I did another round of testing with luceneutil over a couple of different vector datasets and it consistently outperforms the baseline. Given that and @benwtrent's recent results, I think we can go ahead.

@msokolov msokolov merged commit d279af1 into apache:main May 5, 2025
7 checks passed
@msokolov msokolov deleted the optimistic-knn-query branch May 5, 2025 17:59
msokolov pushed a commit that referenced this pull request May 5, 2025
weizijun added a commit to weizijun/lucene that referenced this pull request May 7, 2025
* main: (27 commits)
  deps(java): bump com.github.luben:zstd-jni from 1.5.7-2 to 1.5.7-3 (apache#14621)
  Improve user-facing docs for geo package (apache#14534)
  Enabling histogram collection for PointRangeQuery (apache#14560)
  Move sloppySin into SloppyMath from GeoUtils (apache#14516)
  Rewrite APIJAR extractor to use Java 24 classfile API and kill ASM dependency also for build system (apache#14613)
  CHANGES entry for apache#14226 (optimistic KNN Query)
  OptimisticKnnVectorQuery (apache#14226)
  Fix for Windows (spaces in paths) apache#14608
  Update jdk requirements in README to OpenJDK 24 (apache#14610)
  Always check gradle wrapper sha checksum and download if necessary (apache#14608)
  Fix changelog verifier (apache#14606)
  MultiRange query for SortedNumeric DocValues (apache#14404)
  Remove RANDOM_PRELOAD read advice, which is not actually used (apache#14593)
  Remove duplicate test (apache#14602)
  Refactor the expressions compiler to use official ClassData BSM with indexed lookup (apache#14602)
  Disallow EA versions to run Gradle (apache#14601)
  Add back-compat indices for 10.2.1
  Add Lucene 10.2.1 version constant
  DOAP changes for release 10.2.1
  Revert "An attempt to make jenkins pass with the currently installed jdk24-ea. To be reverted later. apache#14600"
  ...
msokolov added a commit that referenced this pull request Jul 17, 2025
* OptimisticKnnVectorQuery that re-enters search for segments with strong results

* fix JDK23 lint

* simplify OptimisticKnnVectorQuery to 2 rounds only

* move OptimisticKnnVectorQuery into AbstractKnnVectorQuery

* tidy

* Use TopKnnCollector for second pass so we correctly collect total top K

---------

Co-authored-by: Michael Sokolov <[email protected]>
msokolov pushed a commit that referenced this pull request Jul 17, 2025