-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Logic for collecting Histogram efficiently using Point Trees #14439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@stefanvodita / @jpountz - Would love to get your thoughts on this optimization, and how we can leverage it in Lucene. In a nutshell, it solves the below problem: Given a sorted non-overlapping set of intervals (Histogram buckets could be an example), it collects the matching documents count in single travel of |
stefanvodita
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I only had a quick look. Is the opimisation here analogous to the one HistogramLeafCollector does with the skipper?
No, this approach is different from the skipper as it leverages |
|
I didn't mean to imply that the two solutions are the same, apologies if that's how it came across.
Let me know if this doesn't answer the question @jainankitk, maybe you'd already gone through this and you were looking for a different answer. At a higher level, I'm curious if you had a use-case in mind. |
Not at all. Even I was initially confused with skipper logic, only after spending some time realized this approach is slightly different. So, thanks for reiterating the question.
Currently,
This optimization can be applied to following use cases:
Just as a data point, this change helped us improve date histogram latency from 5168 ms to 160 ms (~32x!!) for big5 workload in OpenSearch |
… test Signed-off-by: Ankit Jain <[email protected]>
Signed-off-by: Ankit Jain <[email protected]>
|
I have updated the PR, and the code flow is like below now:
Will add few more unit tests, once I get some feedback on the code changes. Can also include small performance unit test that demonstrates, pointTree based collection is faster than the docValues based collector. |
stefanvodita
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on this @jainankitk!
...src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java
Show resolved
Hide resolved
...ne/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java
Outdated
Show resolved
Hide resolved
...src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java
Outdated
Show resolved
Hide resolved
...ne/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java
Outdated
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Outdated
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Show resolved
Hide resolved
Signed-off-by: Ankit Jain <[email protected]>
…synchronized Signed-off-by: Ankit Jain <[email protected]>
Signed-off-by: Ankit Jain <[email protected]>
This comment was marked as outdated.
This comment was marked as outdated.
Signed-off-by: Ankit Jain <[email protected]>
Signed-off-by: Ankit Jain <[email protected]>
Signed-off-by: Ankit Jain <[email protected]>
jpountz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting idea! I like that you integrated it transparently into the collector, so that users can benefit from it out of the box.
...ne/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java
Outdated
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Outdated
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Show resolved
Hide resolved
Signed-off-by: Ankit Jain <[email protected]>
Thanks @jpountz for the review. Had some challenges with the integration initially, but |
This reverts commit 88d8d34.
stefanvodita
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice results in the benchmark!
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/HistogramCollectorBenchmark.java
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Show resolved
Hide resolved
Signed-off-by: Ankit Jain <[email protected]>
|
This is a nice optimization, using points (if the user indexed them) to carefully optimize counting of ranges.
Oooh those JMH benchy results are nice! Though, it's dangerous testing only on random data -- you can draw random conclusions/results. But it's better than no benchmark! Maybe we should add histogram faceting benchy to Lucene's nightly benchmarks? |
Thanks @mikemccand for the feedback. Having histogram faceting benchmark in Lucene's nightly will be great! Created issue mikemccand/luceneutil#375 for following up on this. |
Signed-off-by: Ankit Jain <[email protected]>
...ne/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java
Outdated
Show resolved
Hide resolved
...ne/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java
Outdated
Show resolved
Hide resolved
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Show resolved
Hide resolved
Signed-off-by: Ankit Jain <[email protected]>
stefanvodita
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jainankitk!
...andbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java
Show resolved
Hide resolved
Signed-off-by: Ankit Jain <[email protected]>
Description
This PR adds multi range traversal logic to collect the histogram on numeric field indexed as pointValues for MATCH_ALL cases. Even for non-match all cases like
PointRangeQuery, if the query field == histogram field, this logic can be used. For the later, need to supply thePointRangeQuerybounds for building the appropriateRangesto be collected. Need some inputs from the community on how it can be plugged correctly into theHistogramCollectorOne of the key assumptions is absence of any deleted documents. Maybe going forward (especially if the deleted documents percentage is low), we can consider correcting the collected
Rangesby subtracting for deleted documents. Although if I remember correctly, getting doc values for just deleted documents was non-trivial task!Related issue #13335