-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Combine Metadata and Dictionary based plans into single operator #8408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combine Metadata and Dictionary based plans into single operator #8408
Conversation
5b32097 to
65629f2
Compare
|
this obviously has a huge impact on performance for raw columns: However, this optimisation is so obvious it raises questions that it wasn't done before - is realtime metadata less trustworthy than the equivalent info from a dictionary? |
d875e40 to
a0e66b1
Compare
walterddr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me. i wonder it too on why this is not optimized before.
Also, this question is not just for this PR but before:
for REALTIME table this won't work when consuming segment has no metadata right? how do we go about and add the current consuming segment's count/min/max into the final result?
pinot-core/src/test/java/org/apache/pinot/queries/ExplainPlanQueriesTest.java
Outdated
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java
Outdated
Show resolved
Hide resolved
Consuming segments have data sources and track the min and max values, I'm not sure what concurrency guarantees are made though. |
Great question. I believe we might not handle min/max value properly in segment metadata in certain corner cases, and I'm sure there are some historic segments that do not have the min/max value set properly. We should make the check more strict (e.g. check if min/max value exist) to ensure the segment metadata can be used to solve the query. |
b5c4750 to
d0031b4
Compare
Codecov Report
@@ Coverage Diff @@
## master #8408 +/- ##
============================================
+ Coverage 70.77% 70.88% +0.11%
- Complexity 4218 4281 +63
============================================
Files 1656 1660 +4
Lines 86690 87033 +343
Branches 13076 13138 +62
============================================
+ Hits 61351 61690 +339
+ Misses 21100 21084 -16
- Partials 4239 4259 +20
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
pinot-core/src/main/java/org/apache/pinot/core/operator/query/AggregationOperatorUtils.java
Outdated
Show resolved
Hide resolved
...e/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java
Outdated
Show resolved
Hide resolved
...ore/src/main/java/org/apache/pinot/core/operator/query/MetadataBasedAggregationOperator.java
Outdated
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java
Outdated
Show resolved
Hide resolved
d0031b4 to
7834990
Compare
compatibility-verifier/sample-test-suite/config/query-results/feature-test-1-rest-sql.results
Outdated
Show resolved
Hide resolved
7834990 to
a37ce01
Compare
...e/src/main/java/org/apache/pinot/core/operator/query/DataSourceBasedAggregationOperator.java
Outdated
Show resolved
Hide resolved
...e/src/main/java/org/apache/pinot/core/operator/query/DataSourceBasedAggregationOperator.java
Outdated
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java
Outdated
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/plan/AggregationPlanNode.java
Outdated
Show resolved
Hide resolved
siddharthteotia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
(nit) Name of the new operator is not very intuitive and does not imply that result is coming from metadata or dictionary.
walterddr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is awesome!
I have one suggestion to the integration-test comment and one question on the condition check. otherwise looks good to me.
| Dictionary dictionary = dataSource.getDictionary(); | ||
| if (dictionary != null) { | ||
| return toDouble(dictionary.getMinVal()); | ||
| } | ||
| return toDouble(dataSource.getDataSourceMetadata().getMinValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataSource.getDictionary() seems final to me? can we optimized out this if check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's nullable, so no. DataSources without dictionaries but with valid metadata qualify for use now. This cost of this check is kind of a 3rd or 4th order effect in the scope of a query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i meant can we create an ImmutableList<Dictionary> dataSourceDictionaries in constructor without extracting it on each block / agg function since the list of dataSource is final.
But yeah agree it is not per row so perf difference should be minimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only going to be called once per query in this context. There could be many aggregation functions but I'm tempted to keep things the way they are for the sake of simplicity.
...ache/pinot/integration/tests/MetadataAndDictionaryAggregationPlanClusterIntegrationTest.java
Outdated
Show resolved
Hide resolved
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Uses column metadata to satisfy min and max style aggregation functions when the filter matches all docs in the segment, which means metadata-based optimisations are possible for raw columns.