Combine Metadata and Dictionary based plans into single operator #8408

richardstartin · 2022-03-25T10:17:53Z

Uses column metadata to satisfy min and max style aggregation functions when the filter matches all docs in the segment, which means metadata-based optimisations are possible for raw columns.

richardstartin · 2022-03-25T10:40:56Z

this obviously has a huge impact on performance for raw columns:

Benchmark               (_numRows)                                                          (_query)  (_scenario)  Mode  Cnt      Score      Error  Units
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable   EXP(0.001)  avgt    5  12050.914 ± 1603.683  us/op
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable     EXP(0.5)  avgt    5  16311.600 ± 2757.909  us/op
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable   EXP(0.999)  avgt    5  15463.667 ± 3098.050  us/op

Benchmark               (_numRows)                                                          (_query)  (_scenario)  Mode  Cnt    Score     Error  Units
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable   EXP(0.001)  avgt    5  513.716 ±  51.830  us/op
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable     EXP(0.5)  avgt    5  484.079 ±  88.200  us/op
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable   EXP(0.999)  avgt    5  506.688 ± 119.060  us/op

However, this optimisation is so obvious it raises questions that it wasn't done before - is realtime metadata less trustworthy than the equivalent info from a dictionary?

walterddr

looks good to me. i wonder it too on why this is not optimized before.

Also, this question is not just for this PR but before:
for REALTIME table this won't work when consuming segment has no metadata right? how do we go about and add the current consuming segment's count/min/max into the final result?

pinot-core/src/test/java/org/apache/pinot/queries/ExplainPlanQueriesTest.java

pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java

richardstartin · 2022-03-25T15:20:54Z

looks good to me. i wonder it too on why this is not optimized before.

Also, this question is not just for this PR but before: for REALTIME table this won't work when consuming segment has no metadata right? how do we go about and add the current consuming segment's count/min/max into the final result?

Consuming segments have data sources and track the min and max values, I'm not sure what concurrency guarantees are made though.

Jackie-Jiang · 2022-03-25T20:57:49Z

However, this optimisation is so obvious it raises questions that it wasn't done before - is realtime metadata less trustworthy than the equivalent info from a dictionary?

Great question. I believe we might not handle min/max value properly in segment metadata in certain corner cases, and I'm sure there are some historic segments that do not have the min/max value set properly. We should make the check more strict (e.g. check if min/max value exist) to ensure the segment metadata can be used to solve the query.

codecov-commenter · 2022-03-25T23:13:46Z

Codecov Report

Merging #8408 (8f58ab1) into master (649f598) will increase coverage by 0.11%.
The diff coverage is 92.00%.

@@             Coverage Diff              @@
##             master    #8408      +/-   ##
============================================
+ Coverage     70.77%   70.88%   +0.11%     
- Complexity     4218     4281      +63     
============================================
  Files          1656     1660       +4     
  Lines         86690    87033     +343     
  Branches      13076    13138      +62     
============================================
+ Hits          61351    61690     +339     
+ Misses        21100    21084      -16     
- Partials       4239     4259      +20

Flag	Coverage Δ
integration1	`28.73% <86.00%> (+0.06%)`	⬆️
integration2	`27.25% <86.00%> (-0.12%)`	⬇️
unittests1	`67.01% <74.00%> (+0.03%)`	⬆️
unittests2	`14.16% <0.00%> (+0.05%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...rg/apache/pinot/core/plan/AggregationPlanNode.java	`89.32% <90.90%> (-1.77%)`	⬇️
...perator/query/NonScanBasedAggregationOperator.java	`81.05% <92.85%> (ø)`
...r/api/resources/WebApplicationExceptionMapper.java	`75.00% <0.00%> (-25.00%)`	⬇️
.../pinot/common/function/DateTimePatternHandler.java	`80.00% <0.00%> (-20.00%)`	⬇️
...ller/helix/core/minion/TaskTypeMetricsUpdater.java	`80.00% <0.00%> (-20.00%)`	⬇️
...pinot/common/utils/fetcher/HttpSegmentFetcher.java	`61.53% <0.00%> (-10.26%)`	⬇️
...roller/helix/core/relocation/SegmentRelocator.java	`72.97% <0.00%> (-5.41%)`	⬇️
...ntroller/helix/core/minion/TaskMetricsEmitter.java	`86.36% <0.00%> (-4.55%)`	⬇️
...n/java/org/apache/pinot/common/utils/TlsUtils.java	`76.34% <0.00%> (-3.23%)`	⬇️
.../pinot/common/function/scalar/StringFunctions.java	`71.05% <0.00%> (-2.80%)`	⬇️
... and 37 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 649f598...8f58ab1. Read the comment docs.

pinot-core/src/main/java/org/apache/pinot/core/operator/query/AggregationOperatorUtils.java

...e/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java

...ore/src/main/java/org/apache/pinot/core/operator/query/MetadataBasedAggregationOperator.java

pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java

compatibility-verifier/sample-test-suite/config/query-results/feature-test-1-rest-sql.results

...e/src/main/java/org/apache/pinot/core/operator/query/DataSourceBasedAggregationOperator.java

pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java

siddharthteotia

LGTM

(nit) Name of the new operator is not very intuitive and does not imply that result is coming from metadata or dictionary.

walterddr

this is awesome!

I have one suggestion to the integration-test comment and one question on the condition check. otherwise looks good to me.

walterddr · 2022-03-29T16:45:40Z

...core/src/main/java/org/apache/pinot/core/operator/query/NonScanBasedAggregationOperator.java

+    Dictionary dictionary = dataSource.getDictionary();
+    if (dictionary != null) {
+      return toDouble(dictionary.getMinVal());
+    }
+    return toDouble(dataSource.getDataSourceMetadata().getMinValue());


dataSource.getDictionary() seems final to me? can we optimized out this if check?

it's nullable, so no. DataSources without dictionaries but with valid metadata qualify for use now. This cost of this check is kind of a 3rd or 4th order effect in the scope of a query.

oh i meant can we create an ImmutableList<Dictionary> dataSourceDictionaries in constructor without extracting it on each block / agg function since the list of dataSource is final.

But yeah agree it is not per row so perf difference should be minimal.

It's only going to be called once per query in this context. There could be many aggregation functions but I'm tempted to keep things the way they are for the sake of simplicity.

...ache/pinot/integration/tests/MetadataAndDictionaryAggregationPlanClusterIntegrationTest.java

Jackie-Jiang

LGTM

...core/src/main/java/org/apache/pinot/core/operator/query/NonScanBasedAggregationOperator.java

richardstartin force-pushed the metadata-summary-queries branch from 5b32097 to 65629f2 Compare March 25, 2022 10:19

richardstartin mentioned this pull request Mar 25, 2022

Allow disabling dict generation for High cardinality columns #8398

Merged

richardstartin changed the title ~~satisfy queries using column metadata when convenient to~~ satisfy queries using DataSource metadata when convenient to Mar 25, 2022

richardstartin force-pushed the metadata-summary-queries branch from d875e40 to a0e66b1 Compare March 25, 2022 12:50

walterddr reviewed Mar 25, 2022

View reviewed changes

pinot-core/src/test/java/org/apache/pinot/queries/ExplainPlanQueriesTest.java Outdated Show resolved Hide resolved

pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java Outdated Show resolved Hide resolved

richardstartin changed the title ~~satisfy queries using DataSource metadata when convenient to~~ satisfy queries using DataSource metadata Mar 25, 2022

richardstartin force-pushed the metadata-summary-queries branch from b5c4750 to d0031b4 Compare March 25, 2022 22:45

Jackie-Jiang reviewed Mar 28, 2022

View reviewed changes

richardstartin added 2 commits March 28, 2022 20:00

satisfy queries using datasource metadata when convenient to

43e29b4

only us metadata if min and max value are non-null

48debe3

richardstartin force-pushed the metadata-summary-queries branch from d0031b4 to 7834990 Compare March 28, 2022 20:11

richardstartin commented Mar 28, 2022

View reviewed changes

compatibility-verifier/sample-test-suite/config/query-results/feature-test-1-rest-sql.results Outdated Show resolved Hide resolved

richardstartin changed the title ~~satisfy queries using DataSource metadata~~ Combine Metadata and Dictionary based plans into single operator Mar 28, 2022

merge Metadata and Dictionary based plans into DataSource based plan

a37ce01

richardstartin force-pushed the metadata-summary-queries branch from 7834990 to a37ce01 Compare March 28, 2022 21:39

Jackie-Jiang reviewed Mar 28, 2022

View reviewed changes

siddharthteotia approved these changes Mar 29, 2022

View reviewed changes

rename operator + review comments

f6d9fd9

walterddr approved these changes Mar 29, 2022

View reviewed changes

update test comments

c5e5d74

Jackie-Jiang approved these changes Mar 29, 2022

View reviewed changes

...core/src/main/java/org/apache/pinot/core/operator/query/NonScanBasedAggregationOperator.java Outdated Show resolved Hide resolved

fix exception message

8f58ab1

richardstartin merged commit d2ca07e into apache:master Mar 29, 2022

richardstartin mentioned this pull request Mar 29, 2022

accelerate counts over filters #8411

Merged

richardstartin added performance query labels Apr 1, 2022

Combine Metadata and Dictionary based plans into single operator #8408

Combine Metadata and Dictionary based plans into single operator #8408

Uh oh!

Conversation

richardstartin commented Mar 25, 2022

Uh oh!

richardstartin commented Mar 25, 2022

Uh oh!

walterddr left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

richardstartin commented Mar 25, 2022

Uh oh!

Jackie-Jiang commented Mar 25, 2022

Uh oh!

codecov-commenter commented Mar 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

siddharthteotia left a comment

Choose a reason for hiding this comment

Uh oh!

walterddr left a comment

Choose a reason for hiding this comment

Uh oh!

walterddr Mar 29, 2022

Choose a reason for hiding this comment

Uh oh!

richardstartin Mar 29, 2022

Choose a reason for hiding this comment

Uh oh!

walterddr Mar 29, 2022

Choose a reason for hiding this comment

Uh oh!

richardstartin Mar 29, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Jackie-Jiang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codecov-commenter commented Mar 25, 2022 •

edited

Loading