Skip to content

Conversation

@jasperjiaguo
Copy link
Contributor

@jasperjiaguo jasperjiaguo commented Sep 5, 2022

For and connected predicates, evaluate the bitmap and with the lowest number of matching docIds first, so that we limit the number of roaring bitmap containers for comparison from the beginning, this will minimize the effort of and application

Running BenchmarkAndDocIdIterator:
Screen Shot 2022-09-08 at 7 53 51 PM

speedup: around 3x after reordering suboptimal bitmap array of size 5: matching % [5/10, 4/10, 3/10, 2/10, 1/10] -> [1/10, 2/10, 3/10, 4/10, 5/10]
sorting overhead: 2.5%
please take a look at BenchmarkAndDocIdIterator.java for details

@codecov-commenter
Copy link

codecov-commenter commented Sep 5, 2022

Codecov Report

Merging #9336 (1c56dd7) into master (0f4bcfc) will decrease coverage by 1.33%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #9336      +/-   ##
============================================
- Coverage     69.80%   68.47%   -1.34%     
- Complexity     4777     5069     +292     
============================================
  Files          1875     1884       +9     
  Lines         99860   100283     +423     
  Branches      15194    15253      +59     
============================================
- Hits          69706    68667    -1039     
- Misses        25231    26681    +1450     
- Partials       4923     4935      +12     
Flag Coverage Δ
integration1 26.18% <100.00%> (-0.02%) ⬇️
integration2 ?
unittests1 66.94% <100.00%> (-0.16%) ⬇️
unittests2 15.30% <0.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...che/pinot/core/operator/docidsets/AndDocIdSet.java 100.00% <100.00%> (ø)
...t/core/plan/StreamingInstanceResponsePlanNode.java 0.00% <0.00%> (-100.00%) ⬇️
...ore/operator/streaming/StreamingResponseUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...server/starter/helix/SegmentReloadStatusValue.java 0.00% <0.00%> (-100.00%) ⬇️
.../operator/blocks/results/MetadataResultsBlock.java 0.00% <0.00%> (-100.00%) ⬇️
...ager/realtime/PeerSchemeSplitSegmentCommitter.java 0.00% <0.00%> (-100.00%) ⬇️
...urces/ServerReloadControllerJobStatusResponse.java 0.00% <0.00%> (-100.00%) ⬇️
...ator/streaming/StreamingSelectionOnlyOperator.java 0.00% <0.00%> (-90.00%) ⬇️
...he/pinot/core/plan/StreamingSelectionPlanNode.java 0.00% <0.00%> (-88.89%) ⬇️
...che/pinot/common/auth/StaticTokenAuthProvider.java 0.00% <0.00%> (-87.50%) ⬇️
... and 195 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great optimization. Can you please share some numbers of the gain? We need to evaluate the overhead of cardinality computation vs the gain. We can also explore if there are some methods we can use to do AND on multiple bitmaps efficiently

@jasperjiaguo
Copy link
Contributor Author

jasperjiaguo commented Sep 9, 2022

@Jackie-Jiang I have added a benchmark case and we can try with different bitmap array setups. This will be the first PR of the AND optimization. Still thinking how to optimize the AndDocIdSet for mixed bitmap/scanning with column cardinality.

@jasperjiaguo jasperjiaguo marked this pull request as ready for review September 9, 2022 03:15
@jasperjiaguo jasperjiaguo changed the title [WIP][Not ready for review] Enhance and filter predicate evaluation efficiency Enhance and filter predicate evaluation efficiency Sep 9, 2022
@siddharthteotia
Copy link
Contributor

@jasperjiaguo you might want to capture a Flamegraph as well (attach yourkit to local IDE)...

@Jackie-Jiang - we plan to do a couple of optimizations including this one in multiple PRs. For example:

  • making decision using the cardinality (higher first) and reordering the tree potentially (but could be less useful than this since this is accurate)

  • skipping inverted index based evaluation for very low cardinality. We have a Flamegraph for this that shows the clear overhead.

@jasperjiaguo - also add the applyAnd to this list.


@State(Scope.Benchmark)
public class BenchmarkAndDocIdIterator {
private static final int NUM_DOCS = 10_000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be try on few different cardinalities / selectivities ?

The getDocIds().getCardinality() performance will vary accordingly imo.

Copy link
Contributor Author

@jasperjiaguo jasperjiaguo Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getCardinality() doesn't really count the cardinality every time from scratch, instead it sums the saved cardinality counter from each container, I wouldn't be too worried about it's performance, but will definitely try on a few different cardinalities / selectivities setups.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends if you have runs or not, if there are runs the cardinality needs to be computed by summing the lengths of the runs. In general computing the cardinality is quite cheap though.

Copy link
Contributor

@siddharthteotia siddharthteotia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can also put this under a flag for now with disabled. Can then do more testing on perf cluster using production data using our respective use cases and enable in OSS eventually.

@richardstartin
Copy link
Member

Please take a look at BufferFastAggregation.and which eliminates most work in an intersection by exploiting the bitmaps' tree structure to eliminate work and should produce a very sparse bitmap except in degenerate cases.

@Jackie-Jiang
Copy link
Contributor

Can we evaluate the solution by @richardstartin and see if that gives better result? I feel it should give better performance because cardinality itself cannot represent the distribution of the values.

@jasperjiaguo
Copy link
Contributor Author

jasperjiaguo commented Sep 9, 2022

@richardstartin @Jackie-Jiang Yes I have done some experiments with BufferFastAggregation.and, code see #9370. However, it didn't give very promising results:

  • using BufferFastAggregation alone without sorting yields an 1.1x speedup while 60% overhead in degenerate case
  • sorting on top of this gives marginal speedup

Screen Shot 2022-09-09 at 3 08 05 PM

@jasperjiaguo
Copy link
Contributor Author

jasperjiaguo commented Sep 9, 2022

new results with more randomized degenerate test input, can see that the overhead of getting cardinality and sorting is really small (below the margine of error)

However the overhead of BufferFastAggregation is really large (~160% more !!!) in this randomized degenerate case. We probably want to evaluate BufferFastAggregation.and more

Screen Shot 2022-09-09 at 10 13 23 PM

@jasperjiaguo
Copy link
Contributor Author

jasperjiaguo commented Sep 10, 2022

another selectivity setup: [5/25, 4/25, 3/25, 2/25/, 1/25]

Screen Shot 2022-09-09 at 10 45 26 PM

@richardstartin
Copy link
Member

At a glance not using the stream API to convert the iterators to bitmaps and using the array overload instead of the iterator overload should improve the results. I seem to recall BufferFastAggregation.and being more efficient for larger numbers of bitmaps, so it may not shine in this particular context.

@jasperjiaguo
Copy link
Contributor Author

jasperjiaguo commented Sep 10, 2022

At a glance not using the stream API to convert the iterators to bitmaps and using the array overload instead of the iterator overload should improve the results. I seem to recall BufferFastAggregation.and being more efficient for larger numbers of bitmaps, so it may not shine in this particular context.

Modified to use array and blackhole. Updated the results. Briefly looked at the workShyAnd and it seems to use the similar idea of intersecting the containers prior to full intersection to minimize the range. It could be the extra preparation work is not worth it when number of bitmaps are small.

@jasperjiaguo
Copy link
Contributor Author

Another set up with low speedup, but the overhead of degeneration remains small

Screen Shot 2022-09-09 at 10 53 46 PM

@siddharthteotia
Copy link
Contributor

I think the gain here is reasonable. Since BufferFastAggregation based intersection didn't show a clear benefit here, I think we should investigate that separately for larger number of bitmaps. @jasperjiaguo will follow-up with a separate PR on that and continue on #9370

@siddharthteotia siddharthteotia merged commit 7648a96 into apache:master Sep 12, 2022
@siddharthteotia
Copy link
Contributor

@jasperjiaguo you might want to capture a Flamegraph as well (attach yourkit to local IDE)...

@Jackie-Jiang - we plan to do a couple of optimizations including this one in multiple PRs. For example:

  • making decision using the cardinality (higher first) and reordering the tree potentially (but could be less useful than this since this is accurate)
  • skipping inverted index based evaluation for very low cardinality. We have a Flamegraph for this that shows the clear overhead.

@jasperjiaguo - also add the applyAnd to this list.

It seems like point (2) above about skipping index based eval and use scan instead is same as suggested by @richardstartin in #7600

Let me try to find the profile we had for this. Want to see if we are talking about similar or two different potential optimizations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants