
Conversation

@jainankitk (Contributor) commented Apr 4, 2025

Description

This PR adds multi-range traversal logic to collect the histogram on a numeric field indexed as point values for MATCH_ALL cases. Even for non-MATCH_ALL cases like PointRangeQuery, this logic can be used if the query field == histogram field. For the latter, we need to supply the PointRangeQuery bounds for building the appropriate Ranges to be collected. I need some input from the community on how this can be plugged correctly into the HistogramCollector.

One of the key assumptions is the absence of any deleted documents. Going forward (especially if the percentage of deleted documents is low), we could consider correcting the collected Ranges by subtracting counts for deleted documents. Although, if I remember correctly, getting doc values for just the deleted documents was a non-trivial task!

Related issue #13335

@jainankitk changed the title from "Adding logic for collecting Histogram efficiently using Point Trees" to "Logic for collecting Histogram efficiently using Point Trees", Apr 4, 2025
@jainankitk (Contributor, Author)

@stefanvodita / @jpountz - Would love to get your thoughts on this optimization and how we can leverage it in Lucene. In a nutshell, it solves the following problem:

Given a sorted, non-overlapping set of intervals (histogram buckets are one example), it collects the matching document count for each interval in a single traversal of the points tree index, skipping over leaf blocks entirely unless the values in a leaf block overlap with more than one interval. This ensures that the number of leaf blocks actually traversed is bounded by the number of buckets, while the remaining leaf blocks are collected in bulk. Hence it can collect the doc counts very efficiently, especially when the ratio of documents to buckets is high.
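
To make that concrete, here is a minimal sketch of the traversal (my simplification for discussion, not the PR's actual code; it assumes a single-valued, one-dimensional long point field with fixed-width buckets, and `traverse`, `bucketWidth`, and `counts` are illustrative names):

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.util.NumericUtils;

class MultiRangeTraversalSketch {
  // Walk the point tree recursively. If a node's [min, max] value range maps to a
  // single bucket, collect its entire doc count in bulk; otherwise descend, and only
  // visit individual values in leaf blocks that straddle a bucket boundary.
  static void traverse(PointValues.PointTree node, long bucketWidth, Map<Long, Long> counts)
      throws IOException {
    long min = NumericUtils.sortableBytesToLong(node.getMinPackedValue(), 0);
    long max = NumericUtils.sortableBytesToLong(node.getMaxPackedValue(), 0);
    long minBucket = Math.floorDiv(min, bucketWidth);
    if (minBucket == Math.floorDiv(max, bucketWidth)) {
      counts.merge(minBucket, node.size(), Long::sum); // bulk collect, no per-doc work
    } else if (node.moveToChild()) {
      do {
        traverse(node, bucketWidth, counts);
      } while (node.moveToSibling());
      node.moveToParent();
    } else {
      // Leaf block overlapping more than one bucket: fall back to per-value collection.
      node.visitDocValues(
          new PointValues.IntersectVisitor() {
            @Override
            public void visit(int docID) {
              // Not used: compare() below never returns CELL_INSIDE_QUERY.
              throw new UnsupportedOperationException();
            }

            @Override
            public void visit(int docID, byte[] packedValue) {
              long value = NumericUtils.sortableBytesToLong(packedValue, 0);
              counts.merge(Math.floorDiv(value, bucketWidth), 1L, Long::sum);
            }

            @Override
            public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
              return PointValues.Relation.CELL_CROSSES_QUERY;
            }
          });
    }
  }
}
```

The bulk branch is where the savings come from: an inner node whose values all fall in one bucket contributes node.size() in O(1) instead of one visit per document. The entry point would be something like traverse(reader.getPointValues(field).getPointTree(), width, counts).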

@stefanvodita (Contributor) left a comment

Sorry, I only had a quick look. Is the optimisation here analogous to the one HistogramLeafCollector does with the skipper?

@jainankitk (Contributor, Author)

Sorry, I only had a quick look. Is the optimisation here analogous to the one HistogramLeafCollector does with the skipper?

No, this approach is different from the skipper, as it leverages PointValues instead of DocValues for computing the buckets.

@stefanvodita (Contributor)

I didn't mean to imply that the two solutions are the same, apologies if that's how it came across.

I need some input from the community on how this can be plugged correctly into the HistogramCollector.

Let me know if this doesn't answer the question @jainankitk; maybe you'd already gone through this and were looking for a different answer.
I think you could start in HistogramCollector.getLeafCollector (code). Right now we throw an exception if the field we're using doesn't have doc values (code). You'd need a new branch for the case you want to implement and a new LeafCollector, similar to the ones already in the file. Having that would make it easier to think through the next steps.
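
Very roughly, the new branch could look something like this (just a sketch to show where it would go; `canBulkCollect` and `collectFromPointTree` are made-up names):

```java
// Inside HistogramCollector.getLeafCollector (sketch, not actual code):
PointValues pointValues = context.reader().getPointValues(field);
if (pointValues != null && canBulkCollect(context)) { // hypothetical precondition check
  collectFromPointTree(pointValues.getPointTree()); // hypothetical bulk-collection helper
  // Everything in this leaf is already counted, so skip per-doc collection.
  throw new CollectionTerminatedException();
}
// ... otherwise fall through to the existing doc-values based leaf collectors ...
```

Throwing CollectionTerminatedException is the standard way for a collector to tell the searcher it doesn't need to see any more documents for that segment.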

At a higher level, I'm curious if you had a use-case in mind.

@jainankitk (Contributor, Author)

I didn't mean to imply that the two solutions are the same, apologies if that's how it came across.

Not at all. I was initially confused by the skipper logic myself, and only realized after spending some time on it that this approach is slightly different. So, thanks for reiterating the question.

I think you could start in HistogramCollector.getLeafCollector (code). Right now we throw an exception if the field we're using doesn't have doc values (code).

Currently, a Collector doesn't need to be aware of the Query itself. Collectors are designed to collect individual docIds, or via a DocIdStream from the scorer. But this custom collector does not need the scorer to provide documents; it can bulk-collect documents, assuming MATCH_ALL or a PointRangeQuery (where PointRangeQuery.field == histogram.field). Otherwise, it should fall back to the traditional methods for collecting matching documents.

At a higher level, I'm curious if you had a use-case in mind.

This optimization can be applied to the following use cases:

  • Number of sales per price range (0-50, 50-100, 100-250, ...)
  • Number of visits to a website for each day of a month

Just as a data point, this change helped us improve date histogram latency from 5168 ms to 160 ms (~32x!!) for the big5 workload in OpenSearch.

@jainankitk (Contributor, Author)

I have updated the PR, and the code flow is now as follows:

  • HistogramCollector overrides setWeight for accessing the underlying Query (see the sketch below)
  • To keep things simple, I'm only optimizing for the MATCH_ALL_DOCS query with no deleted documents for now
  • The optimized path is enabled only if a point tree has been built for the field
  • There are a few other conditions for enabling the optimized path; I'm being conservative for now and falling back to the original path otherwise
  • Added a small unit test to verify it works as expected

I will add a few more unit tests once I get some feedback on the code changes. I can also include a small performance unit test that demonstrates that point-tree-based collection is faster than the docValues-based collector.
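
For reference, the weight-based detection from the first bullet looks roughly like this (a simplified sketch, not the exact PR code; `canBulkCollect` is an illustrative name and the real code has a few more conditions):

```java
// Sketch: HistogramCollector keeps the Weight so it can inspect the Query later.
private Weight weight;

@Override
public void setWeight(Weight weight) {
  this.weight = weight;
}

private boolean canBulkCollect(LeafReaderContext context) throws IOException {
  // Optimized path: MATCH_ALL_DOCS query, no deleted docs, and a point tree for the field.
  return weight != null
      && weight.getQuery() instanceof MatchAllDocsQuery
      && context.reader().getLiveDocs() == null // null live docs means no deletions
      && context.reader().getPointValues(field) != null;
}
```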

@stefanvodita (Contributor) left a comment

Thanks for iterating on this @jainankitk!


@jpountz (Contributor) left a comment

Interesting idea! I like that you integrated it transparently into the collector, so that users can benefit from it out of the box.

@jainankitk (Contributor, Author)

Interesting idea! I like that you integrated it transparently into the collector, so that users can benefit from it out of the box.

Thanks @jpountz for the review. I had some challenges with the integration initially, but the setWeight method was pretty useful!

@stefanvodita (Contributor) left a comment

Nice results in the benchmark!

@mikemccand (Member)

This is a nice optimization, using points (if the user indexed them) to carefully optimize counting of ranges.

@stefanvodita - Thanks for the prompt review. Addressed most of the review comments. Adding a JMH benchmark instead of the not-so-useful performance test added earlier. The benchmark results demonstrate a significant increase in throughput with increasing document count and bucket width (fewer buckets mean less low-level traversal of the point range tree and more documents collected in bulk).

Oooh those JMH benchy results are nice! Though, it's dangerous testing only on random data -- you can draw random conclusions/results. But it's better than no benchmark! Maybe we should add a histogram faceting benchy to Lucene's nightly benchmarks?

@jainankitk (Contributor, Author)

Oooh those JMH benchy results are nice! Though, it's dangerous testing only on random data -- you can draw random conclusions/results. But it's better than no benchmark! Maybe we should add a histogram faceting benchy to Lucene's nightly benchmarks?

Thanks @mikemccand for the feedback. Having a histogram faceting benchmark in Lucene's nightly benchmarks would be great! Created issue mikemccand/luceneutil#375 to follow up on this.

github-project-automation bot moved this from Todo to In Progress in Performance Roadmap, Apr 24, 2025
@stefanvodita (Contributor) left a comment

Thank you @jainankitk!

@stefanvodita merged commit 02a8c3f into apache:main, Apr 25, 2025
7 checks passed
github-project-automation bot moved this from In Progress to Done in Performance Roadmap, Apr 25, 2025
@stefanvodita added this to the 10.3.0 milestone, Apr 25, 2025
@jainankitk deleted the mrt-collector branch, April 25, 2025 17:32