Skip to content

Conversation

@nizarhejazi
Copy link
Contributor

@nizarhejazi nizarhejazi commented Jul 21, 2022

Proper null handling for SV data types in the following aggregation functions:

  • AVG
  • MIN
  • MAX
  • COUNT
  • SUM
  • SUMPRECISION

Added many unit tests (for different data types) for diff. aggregation functions.

feature, release-notes

@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels Jul 21, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jul 21, 2022

Codecov Report

Merging #9086 (20cc8f8) into master (561e471) will increase coverage by 0.29%.
The diff coverage is 62.16%.

@@             Coverage Diff              @@
##             master    #9086      +/-   ##
============================================
+ Coverage     68.39%   68.68%   +0.29%     
+ Complexity     4757     4672      -85     
============================================
  Files          1848     1848              
  Lines         97900    98386     +486     
  Branches      14746    14932     +186     
============================================
+ Hits          66960    67579     +619     
+ Misses        26251    26008     -243     
- Partials       4689     4799     +110     
Flag Coverage Δ
integration1 26.38% <11.17%> (?)
integration2 ?
unittests1 66.97% <61.62%> (+0.03%) ⬆️
unittests2 15.30% <0.36%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...a/org/apache/pinot/core/plan/DistinctPlanNode.java 86.95% <0.00%> (ø)
...tion/function/SumPrecisionAggregationFunction.java 34.26% <39.83%> (+0.43%) ⬆️
...aggregation/groupby/ObjectGroupByResultHolder.java 87.50% <50.00%> (+1.13%) ⬆️
...aggregation/function/CountAggregationFunction.java 71.25% <57.57%> (+1.25%) ⬆️
...y/aggregation/function/AvgAggregationFunction.java 81.67% <60.00%> (-15.66%) ⬇️
...y/aggregation/function/MinAggregationFunction.java 71.32% <62.35%> (-15.56%) ⬇️
...y/aggregation/function/MaxAggregationFunction.java 72.53% <64.28%> (-14.36%) ⬇️
...y/aggregation/function/SumAggregationFunction.java 73.17% <66.66%> (-10.83%) ⬇️
...core/query/reduce/AggregationDataTableReducer.java 80.95% <68.18%> (-5.33%) ⬇️
...rg/apache/pinot/core/plan/AggregationPlanNode.java 88.99% <87.50%> (+0.20%) ⬆️
... and 254 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest breaking it into 2 parts, one for the filter and one for the aggregation

for (Expression operand : operands) {
arguments.add(getExpression(operand));
}
if (arguments.size() == 0 && functionName.equalsIgnoreCase(AggregationFunctionType.COUNT.getName())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguments should never be empty

Copy link
Contributor Author

@nizarhejazi nizarhejazi Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Count(*) has no operands. Since I allow now count(colName), I removed the if in lines 84-88 and add this if here instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count(*) should have a single identifier operand with name "*"

Copy link
Contributor Author

@nizarhejazi nizarhejazi Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is never empty except in one test: testSqlWithFinalRowCountChecker (part of pinot-query-runtime tests).

Basically, pinot-query-runtime causes this issue and send count() without any argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@walterddr I think this is a bug in the multi-stage engine where the argument is not passed for COUNT. Can you please take a look?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yeah so that pinot-server expects a * on count. got it I can make the fix.

@nizarhejazi nizarhejazi changed the title Proper null handling in Aggregation functions, and comparison operators / matchers for SV data types Proper null handling in Aggregation functions for SV data types Jul 25, 2022
if (blockValSetMap.isEmpty()) {
aggregationResultHolder.setValue(aggregationResultHolder.getDoubleResult() + length);
} else if (!blockValSetMap.containsKey(STAR_TREE_COUNT_STAR_EXPRESSION)) {
if (!_nullHandlingEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this branch be hit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest changing the branching logic as following for clarity:

if (blockValSetMap.isEmpty()) {
  ...
} else if (_nullHandlingEnabled) {
  ...
} else {
  // Star-tree pre-aggregated values
  ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to check for: !blockValSetMap.containsKey(STAR_TREE_COUNT_STAR_EXPRESSION)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that? When null handling is enabled, we cannot use star-tree, so we shouldn't need this check right? Basically when the blockValSetMap is empty, either null handling is enabled or star-tree is used, and they should be mutual exclusive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jackie-Jiang We have three branches:
1- blockValSetMap.isEmpty() for count(*), and no StarTree.
2- !blockValSetMap.containsKey(STAR_TREE_COUNT_STAR_EXPRESSION) for count(columnName), and no StarTree (please see the ! at the beginning). Example: count(bigDecimalColumn). Note: we can have count(bigDecimalColumn) and nullHandling is either enable or not enabled.
3- StarTree.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost ready to be merged

if (blockValSetMap.isEmpty()) {
aggregationResultHolder.setValue(aggregationResultHolder.getDoubleResult() + length);
} else if (!blockValSetMap.containsKey(STAR_TREE_COUNT_STAR_EXPRESSION)) {
if (!_nullHandlingEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that? When null handling is enabled, we cannot use star-tree, so we shouldn't need this check right? Basically when the blockValSetMap is empty, either null handling is enabled or star-tree is used, and they should be mutual exclusive

public void aggregateGroupBySV(int length, int[] groupKeyArray, GroupByResultHolder groupByResultHolder,
Map<ExpressionContext, BlockValSet> blockValSetMap) {
if (blockValSetMap.isEmpty()) {
if (blockValSetMap.isEmpty() || !blockValSetMap.containsKey(STAR_TREE_COUNT_STAR_EXPRESSION)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here, we should be able to simplify it into 3 conditions:

  1. no value set
  2. null handling enabled
  3. star-tree

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jackie-Jiang We have three branches:
1- blockValSetMap.isEmpty() for count(*), and no StarTree.
2- !blockValSetMap.containsKey(STAR_TREE_COUNT_STAR_EXPRESSION) for count(columnName), and no StarTree (please see the ! at the beginning). Example: count(bigDecimalColumn). Note: we can have count(bigDecimalColumn) and nullHandling is either enable or not enabled.
3- StarTree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When both null handling and star-tree are not enabled, will it pass in non-empty blockValSetMap? Ideally it should pass in empty blockValSetMap, or it could add extra overhead for existing use cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I find the reason. When null handling is not enabled, ideally getInputExpressions() should return empty list to preserve the existing behavior, which is changed because we changed it to extend BaseSingleInputAggregationFunction. Since most methods in BaseSingleInputAggregationFunction cannot be reused, suggest not extending it, and keep the existing behavior when null handling is not enabled

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied a commit to demonstrate the idea: b684534
Let's see if it can pass all the tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overhead is should be minimal. I don't have strong opinion either way. Will apply your commit.

for (int i = 0; i < length; i++) {
if (!nullBitmap.contains(i)) {
// TODO(nhejazi): throw an exception here instead of ignoring infinite values?
if (Double.isFinite(doubleValues[i])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Suggest not having this check to keep the behavior consistent

Copy link
Contributor Author

@nizarhejazi nizarhejazi Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi: this is gonna fail the whole SumPrecisionAgg function. So the previous behaviour is just throwing an exception and the operation failing.
@Jackie-Jiang Do you wanna still remove this check? I

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the behavior is inconsistent without null handling (throw exception) and with null handling (values ignored). We may revisit this since you already added a TODO here

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This is a very important feature. Thanks for contributing it!

@Jackie-Jiang Jackie-Jiang merged commit d617f3b into apache:master Aug 3, 2022
@Jackie-Jiang
Copy link
Contributor

Related to #8697

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants