-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Better vectorize score computations. #14704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better vectorize score computations. #14704
Conversation
Existing vectorization of scores is a bit fragile since it relies on `SimScorer#score` being inlined in the for loops where it is called. This is currently the case in nightly benchmarks, but may not be the case in the real world where more implementations of `SimScorer` may be used, in particular those from `FeatureField`. Furthermore, existing vectorization has some room for improvement as @gf2121 highlighted at apache#14679 (comment).
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR. |
| * Bulk computation of scores. For each entry in the given {@code buffer}, compute the score of | ||
| * the document and set its value in the {@code scores} array at the same index. | ||
| * | ||
| * <p><b>NOTE</b>: Doc IDs must be sorted, with no duplicates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we override it in AssertingSimilarity to check these limitations?
| } | ||
| } else { | ||
| if (normInverses == null || normInverses.length < buffer.size) { | ||
| normInverses = new float[ArrayUtil.oversize(buffer.size, Float.BYTES)]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this new array or we can just reuse scores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhhh, nice idea.
|
Great feedback once again @gf2121 ! |
|
Here is what luceneutil has to say for exhaustive evaluation (totalHitsThreshold=Integer.MAX_VALUE):
|
| this.weight = boost * idf.getValue().floatValue(); | ||
| } | ||
|
|
||
| private float score(float freq, float normInverse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider the name? it is too dangerous to have both these signatures, even though float and long have different size:
float score(float, float);
float score(float, long);
|
before adding more complexity to scores can we resolve the current problem with scoring in jenkins? https://jenkins.thetaphi.de/job/Lucene-MMAPv2-Linux/3326/ This failure has only surfaced recently, so I think it would be good to avoid stacking more changes to scoring on top of it. |
|
I was looking at it too as I was worried it may have been caused by recent optimizations, but it looks like an old and rare failure. I opened a tentative test fix at #14715. |
Existing vectorization of scores is a bit fragile since it relies on `SimScorer#score` being inlined in the for loops where it is called. This is currently the case in nightly benchmarks, but may not be the case in the real world where more implementations of `SimScorer` may be used, in particular those from `FeatureField`. Furthermore, existing vectorization has some room for improvement as @gf2121 highlighted at #14679 (comment).
|
I had to revert this PR, it's not thread-safe: I thought |
|
@mikemccand FYI if a nightly run started with this change, it may fail due to this thread safety issue. The commit is now reverted, so restarting should be enough to fix the problem. |
Existing vectorization of scores is a bit fragile since it relies on `SimScorer#score` being inlined in the for loops where it is called. This is currently the case in nightly benchmarks, but may not be the case in the real world where more implementations of `SimScorer` may be used, in particular those from `FeatureField`. Furthermore, existing vectorization has some room for improvement as @gf2121 highlighted at apache#14679 (comment).
Existing vectorization of scores is a bit fragile since it relies on `SimScorer#score` being inlined in the for loops where it is called. This is currently the case in nightly benchmarks, but may not be the case in the real world where more implementations of `SimScorer` may be used, in particular those from `FeatureField`. Furthermore, existing vectorization has some room for improvement as @gf2121 highlighted at apache#14679 (comment).
Existing vectorization of scores is a bit fragile since it relies on
SimScorer#scorebeing inlined in the for loops where it is called. This is currently the case in nightly benchmarks, but may not be the case in the real world where more implementations ofSimScorermay be used, in particular those fromFeatureField.Furthermore, existing vectorization has some room for improvement as @gf2121 highlighted at
#14679 (comment).