
Conversation

@jainankitk (Contributor) commented Apr 4, 2025

Description

This PR adds multi-range traversal logic to collect the histogram on a numeric field indexed as point values for MATCH_ALL cases. Even for non-MATCH_ALL cases like PointRangeQuery, this logic can be used if the query field == histogram field. For the latter, we need to supply the PointRangeQuery bounds for building the appropriate Ranges to be collected. I need some input from the community on how this can be plugged correctly into the HistogramCollector.

One of the key assumptions is the absence of any deleted documents. Going forward (especially if the percentage of deleted documents is low), we could consider correcting the collected Ranges by subtracting counts for deleted documents. Although, if I remember correctly, getting doc values for just the deleted documents was a non-trivial task!

Related issue #13335

@jainankitk changed the title from "Adding logic for collecting Histogram efficiently using Point Trees" to "Logic for collecting Histogram efficiently using Point Trees", Apr 4, 2025
@jainankitk (Contributor, Author)

@stefanvodita / @jpountz - Would love to get your thoughts on this optimization and how we can leverage it in Lucene. In a nutshell, it solves the following problem:

Given a sorted, non-overlapping set of intervals (histogram buckets are one example), it collects the matching document count for each interval in a single traversal of the points tree index, skipping over leaf blocks entirely unless the values in a leaf block overlap with more than one interval. This ensures that the number of leaf blocks actually traversed is bounded by the number of buckets, while the remaining leaf blocks are collected in bulk. Hence it can collect the doc counts very efficiently, especially when the ratio of documents to buckets is high.
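
To make that concrete, here is a minimal sketch of the traversal (my simplification for discussion, not the PR's actual code; it assumes a single-valued, one-dimensional long point field with fixed-width buckets, and `traverse`, `bucketWidth`, and `counts` are illustrative names):

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.util.NumericUtils;

class MultiRangeTraversalSketch {
  // Walk the point tree recursively. If a node's [min, max] value range maps to a
  // single bucket, collect its entire doc count in bulk; otherwise descend, and only
  // visit individual values in leaf blocks that straddle a bucket boundary.
  static void traverse(PointValues.PointTree node, long bucketWidth, Map<Long, Long> counts)
      throws IOException {
    long min = NumericUtils.sortableBytesToLong(node.getMinPackedValue(), 0);
    long max = NumericUtils.sortableBytesToLong(node.getMaxPackedValue(), 0);
    long minBucket = Math.floorDiv(min, bucketWidth);
    if (minBucket == Math.floorDiv(max, bucketWidth)) {
      counts.merge(minBucket, node.size(), Long::sum); // bulk collect, no per-doc work
    } else if (node.moveToChild()) {
      do {
        traverse(node, bucketWidth, counts);
      } while (node.moveToSibling());
      node.moveToParent();
    } else {
      // Leaf block overlapping more than one bucket: fall back to per-value collection.
      node.visitDocValues(
          new PointValues.IntersectVisitor() {
            @Override
            public void visit(int docID) {
              // Not used: compare() below never returns CELL_INSIDE_QUERY.
              throw new UnsupportedOperationException();
            }

            @Override
            public void visit(int docID, byte[] packedValue) {
              long value = NumericUtils.sortableBytesToLong(packedValue, 0);
              counts.merge(Math.floorDiv(value, bucketWidth), 1L, Long::sum);
            }

            @Override
            public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
              return PointValues.Relation.CELL_CROSSES_QUERY;
            }
          });
    }
  }
}
```

The bulk branch is where the savings come from: an inner node whose values all fall in one bucket contributes node.size() in O(1) instead of one visit per document. The entry point would be something like traverse(reader.getPointValues(field).getPointTree(), width, counts).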

@stefanvodita (Contributor) left a comment

Sorry, I only had a quick look. Is the optimisation here analogous to the one HistogramLeafCollector does with the skipper?

@jainankitk (Contributor, Author)

Sorry, I only had a quick look. Is the optimisation here analogous to the one HistogramLeafCollector does with the skipper?

No, this approach is different from the skipper, as it leverages PointValues instead of DocValues for computing the buckets.

@stefanvodita (Contributor)

I didn't mean to imply that the two solutions are the same, apologies if that's how it came across.

I need some input from the community on how this can be plugged correctly into the HistogramCollector.

Let me know if this doesn't answer the question @jainankitk; maybe you'd already gone through this and were looking for a different answer.
I think you could start in HistogramCollector.getLeafCollector (code). Right now we throw an exception if the field we're using doesn't have doc values (code). You'd need a new branch for the case you want to implement and a new LeafCollector, similar to the ones already in the file. Having that would make it easier to think through the next steps.
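
Very roughly, the new branch could look something like this (just a sketch to show where it would go; `canBulkCollect` and `collectFromPointTree` are made-up names):

```java
// Inside HistogramCollector.getLeafCollector (sketch, not actual code):
PointValues pointValues = context.reader().getPointValues(field);
if (pointValues != null && canBulkCollect(context)) { // hypothetical precondition check
  collectFromPointTree(pointValues.getPointTree()); // hypothetical bulk-collection helper
  // Everything in this leaf is already counted, so skip per-doc collection.
  throw new CollectionTerminatedException();
}
// ... otherwise fall through to the existing doc-values based leaf collectors ...
```

Throwing CollectionTerminatedException is the standard way for a collector to tell the searcher it doesn't need to see any more documents for that segment.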

At a higher level, I'm curious if you had a use-case in mind.

@jainankitk (Contributor, Author)

I didn't mean to imply that the two solutions are the same, apologies if that's how it came across.

Not at all. I was initially confused by the skipper logic myself, and only realized after spending some time on it that this approach is slightly different. So, thanks for reiterating the question.

I think you could start in HistogramCollector.getLeafCollector (code). Right now we throw an exception if the field we're using doesn't have doc values (code).

Currently, a Collector doesn't need to be aware of the Query itself. Collectors are designed to collect individual docIds, or via a DocIdStream from the scorer. But this custom collector does not need the scorer to provide documents; it can bulk-collect documents, assuming MATCH_ALL or a PointRangeQuery (where PointRangeQuery.field == histogram.field). Otherwise, it should fall back to the traditional methods for collecting matching documents.

At a higher level, I'm curious if you had a use-case in mind.

This optimization can be applied to the following use cases:

  • Number of sales per price range (0-50, 50-100, 100-250, ...)
  • Number of visits to a website for each day of a month

Just as a data point, this change helped us improve date histogram latency from 5168 ms to 160 ms (~32x!!) for the big5 workload in OpenSearch.

@jainankitk (Contributor, Author)

I have updated the PR, and the code flow is now as follows:

  • HistogramCollector overrides setWeight for accessing the underlying Query (see the sketch below)
  • To keep things simple, I'm only optimizing for the MATCH_ALL_DOCS query with no deleted documents for now
  • The optimized path is enabled only if a point tree has been built for the field
  • There are a few other conditions for enabling the optimized path; I'm being conservative for now and falling back to the original path otherwise
  • Added a small unit test to verify it works as expected

I will add a few more unit tests once I get some feedback on the code changes. I can also include a small performance unit test that demonstrates that point-tree-based collection is faster than the docValues-based collector.
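
For reference, the weight-based detection from the first bullet looks roughly like this (a simplified sketch, not the exact PR code; `canBulkCollect` is an illustrative name and the real code has a few more conditions):

```java
// Sketch: HistogramCollector keeps the Weight so it can inspect the Query later.
private Weight weight;

@Override
public void setWeight(Weight weight) {
  this.weight = weight;
}

private boolean canBulkCollect(LeafReaderContext context) throws IOException {
  // Optimized path: MATCH_ALL_DOCS query, no deleted docs, and a point tree for the field.
  return weight != null
      && weight.getQuery() instanceof MatchAllDocsQuery
      && context.reader().getLiveDocs() == null // null live docs means no deletions
      && context.reader().getPointValues(field) != null;
}
```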

@stefanvodita (Contributor) left a comment

Thanks for iterating on this @jainankitk!


@jpountz (Contributor) left a comment

Interesting idea! I like that you integrated it transparently into the collector, so that users can benefit from it out of the box.

@jainankitk (Contributor, Author)

Interesting idea! I like that you integrated it transparently into the collector, so that users can benefit from it out of the box.

Thanks @jpountz for the review. I had some challenges with the integration initially, but the setWeight method was pretty useful!

@stefanvodita (Contributor) left a comment

Nice results in the benchmark!

@mikemccand (Member)

This is a nice optimization, using points (if the user indexed them) to carefully optimize counting of ranges.

@stefanvodita - Thanks for the prompt review. Addressed most of the review comments. Adding a JMH benchmark instead of the not-so-useful performance test added earlier. The benchmark results demonstrate a significant increase in throughput with increasing document count and bucket width (fewer buckets mean less low-level traversal of the point range tree and more documents collected in bulk).

Oooh those JMH benchy results are nice! Though, it's dangerous testing only on random data -- you can draw random conclusions/results. But it's better than no benchmark! Maybe we should add a histogram faceting benchy to Lucene's nightly benchmarks?

@jainankitk (Contributor, Author)

Oooh those JMH benchy results are nice! Though, it's dangerous testing only on random data -- you can draw random conclusions/results. But it's better than no benchmark! Maybe we should add a histogram faceting benchy to Lucene's nightly benchmarks?

Thanks @mikemccand for the feedback. Having a histogram faceting benchmark in Lucene's nightly benchmarks would be great! Created issue mikemccand/luceneutil#375 to follow up on this.

github-project-automation bot moved this from Todo to In Progress in Performance Roadmap, Apr 24, 2025
@stefanvodita (Contributor) left a comment

Thank you @jainankitk!

@stefanvodita merged commit 02a8c3f into apache:main, Apr 25, 2025
7 checks passed
github-project-automation bot moved this from In Progress to Done in Performance Roadmap, Apr 25, 2025
@stefanvodita added this to the 10.3.0 milestone, Apr 25, 2025
@jainankitk deleted the mrt-collector branch, April 25, 2025 17:32