Skip to content

Conversation

@richardstartin
Copy link
Member

Uses column metadata to satisfy min and max style aggregation functions when the filter matches all docs in the segment, which means metadata-based optimisations are possible for raw columns.

@richardstartin
Copy link
Member Author

this obviously has a huge impact on performance for raw columns:

Benchmark               (_numRows)                                                          (_query)  (_scenario)  Mode  Cnt      Score      Error  Units
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable   EXP(0.001)  avgt    5  12050.914 ± 1603.683  us/op
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable     EXP(0.5)  avgt    5  16311.600 ± 2757.909  us/op
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable   EXP(0.999)  avgt    5  15463.667 ± 3098.050  us/op
Benchmark               (_numRows)                                                          (_query)  (_scenario)  Mode  Cnt    Score     Error  Units
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable   EXP(0.001)  avgt    5  513.716 ±  51.830  us/op
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable     EXP(0.5)  avgt    5  484.079 ±  88.200  us/op
BenchmarkQueries.query     1500000  SELECT MIN(RAW_INT_COL), MAX(RAW_INT_COL), COUNT(*) FROM MyTable   EXP(0.999)  avgt    5  506.688 ± 119.060  us/op

However, this optimisation is so obvious it raises questions that it wasn't done before - is realtime metadata less trustworthy than the equivalent info from a dictionary?

@richardstartin richardstartin changed the title satisfy queries using column metadata when convenient to satisfy queries using DataSource metadata when convenient to Mar 25, 2022
@richardstartin richardstartin force-pushed the metadata-summary-queries branch from d875e40 to a0e66b1 Compare March 25, 2022 12:50
Copy link
Contributor

@walterddr walterddr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me. i wonder it too on why this is not optimized before.

Also, this question is not just for this PR but before:
for REALTIME table this won't work when consuming segment has no metadata right? how do we go about and add the current consuming segment's count/min/max into the final result?

@richardstartin
Copy link
Member Author

looks good to me. i wonder it too on why this is not optimized before.

Also, this question is not just for this PR but before: for REALTIME table this won't work when consuming segment has no metadata right? how do we go about and add the current consuming segment's count/min/max into the final result?

Consuming segments have data sources and track the min and max values, I'm not sure what concurrency guarantees are made though.

@richardstartin richardstartin changed the title satisfy queries using DataSource metadata when convenient to satisfy queries using DataSource metadata Mar 25, 2022
@Jackie-Jiang
Copy link
Contributor

However, this optimisation is so obvious it raises questions that it wasn't done before - is realtime metadata less trustworthy than the equivalent info from a dictionary?

Great question. I believe we might not handle min/max value properly in segment metadata in certain corner cases, and I'm sure there are some historic segments that do not have the min/max value set properly. We should make the check more strict (e.g. check if min/max value exist) to ensure the segment metadata can be used to solve the query.

@richardstartin richardstartin force-pushed the metadata-summary-queries branch from b5c4750 to d0031b4 Compare March 25, 2022 22:45
@codecov-commenter
Copy link

codecov-commenter commented Mar 25, 2022

Codecov Report

Merging #8408 (8f58ab1) into master (649f598) will increase coverage by 0.11%.
The diff coverage is 92.00%.

@@             Coverage Diff              @@
##             master    #8408      +/-   ##
============================================
+ Coverage     70.77%   70.88%   +0.11%     
- Complexity     4218     4281      +63     
============================================
  Files          1656     1660       +4     
  Lines         86690    87033     +343     
  Branches      13076    13138      +62     
============================================
+ Hits          61351    61690     +339     
+ Misses        21100    21084      -16     
- Partials       4239     4259      +20     
Flag Coverage Δ
integration1 28.73% <86.00%> (+0.06%) ⬆️
integration2 27.25% <86.00%> (-0.12%) ⬇️
unittests1 67.01% <74.00%> (+0.03%) ⬆️
unittests2 14.16% <0.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rg/apache/pinot/core/plan/AggregationPlanNode.java 89.32% <90.90%> (-1.77%) ⬇️
...perator/query/NonScanBasedAggregationOperator.java 81.05% <92.85%> (ø)
...r/api/resources/WebApplicationExceptionMapper.java 75.00% <0.00%> (-25.00%) ⬇️
.../pinot/common/function/DateTimePatternHandler.java 80.00% <0.00%> (-20.00%) ⬇️
...ller/helix/core/minion/TaskTypeMetricsUpdater.java 80.00% <0.00%> (-20.00%) ⬇️
...pinot/common/utils/fetcher/HttpSegmentFetcher.java 61.53% <0.00%> (-10.26%) ⬇️
...roller/helix/core/relocation/SegmentRelocator.java 72.97% <0.00%> (-5.41%) ⬇️
...ntroller/helix/core/minion/TaskMetricsEmitter.java 86.36% <0.00%> (-4.55%) ⬇️
...n/java/org/apache/pinot/common/utils/TlsUtils.java 76.34% <0.00%> (-3.23%) ⬇️
.../pinot/common/function/scalar/StringFunctions.java 71.05% <0.00%> (-2.80%) ⬇️
... and 37 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 649f598...8f58ab1. Read the comment docs.

@richardstartin richardstartin force-pushed the metadata-summary-queries branch from d0031b4 to 7834990 Compare March 28, 2022 20:11
@richardstartin richardstartin changed the title satisfy queries using DataSource metadata Combine Metadata and Dictionary based plans into single operator Mar 28, 2022
@richardstartin richardstartin force-pushed the metadata-summary-queries branch from 7834990 to a37ce01 Compare March 28, 2022 21:39
Copy link
Contributor

@siddharthteotia siddharthteotia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

(nit) Name of the new operator is not very intuitive and does not imply that result is coming from metadata or dictionary.

Copy link
Contributor

@walterddr walterddr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is awesome!

I have one suggestion to the integration-test comment and one question on the condition check. otherwise looks good to me.

Comment on lines +135 to +139
Dictionary dictionary = dataSource.getDictionary();
if (dictionary != null) {
return toDouble(dictionary.getMinVal());
}
return toDouble(dataSource.getDataSourceMetadata().getMinValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataSource.getDictionary() seems final to me? can we optimized out this if check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's nullable, so no. DataSources without dictionaries but with valid metadata qualify for use now. This cost of this check is kind of a 3rd or 4th order effect in the scope of a query.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i meant can we create an ImmutableList<Dictionary> dataSourceDictionaries in constructor without extracting it on each block / agg function since the list of dataSource is final.

But yeah agree it is not per row so perf difference should be minimal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only going to be called once per query in this context. There could be many aggregation functions but I'm tempted to keep things the way they are for the sake of simplicity.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants