Skip to content

Conversation

@xiangfu0
Copy link
Contributor

@xiangfu0 xiangfu0 commented May 13, 2024

  • Support WITHIN GROUP Clause in MultiStage Engine
    • Use PinotLogicalSortExchange for WITHIN GROUP instead of PinotLogicalExchange for leaf data exchange
  • Support ListAgg aggregation function
  • Support passing Distinct flag to ListAgg implementation

Limitation:

  • Only one WITHIN GROUP Clause is supported right now.

Sample query:

SELECT 
    Origin, Dest,
    LISTAGG(distinct Carrier, ', ') WITHIN GROUP (ORDER BY Carrier)
FROM 
    airlineStats
GROUP BY 
    Dest, Origin

Query Plan:

Execution Plan
LogicalProject(Origin=[$1], Dest=[$0], EXPR$2=[$2])
  LogicalAggregate(group=[{0, 1}], EXPR$2=[LISTAGG(DISTINCT $2, $3) WITHIN GROUP ([2])])
    PinotLogicalSortExchange(distribution=[hash[0, 1]], collation=[[2]], isSortOnSender=[false], isSortOnReceiver=[true])
      LogicalProject(Dest=[$28], Origin=[$61], Carrier=[$17], $f3=[_UTF-8', '])
        LogicalTableScan(table=[[default, airlineStats]])
image
SELECT
    Carrier,
    LISTAGG(DISTINCT CONCAT(Origin, Dest, ' | '), ', ') WITHIN GROUP (ORDER BY Origin, Dest)
FROM airlineStats
GROUP BY Carrier
ORDER BY Carrier

Query Plan

Execution Plan
LogicalSort(sort0=[$0], dir0=[ASC])
  PinotLogicalSortExchange(distribution=[hash], collation=[[0]], isSortOnSender=[false], isSortOnReceiver=[true])
    LogicalAggregate(group=[{0}], EXPR$1=[LISTAGG(DISTINCT $1, $2) WITHIN GROUP ([3, 4])])
      PinotLogicalSortExchange(distribution=[hash[0]], collation=[[3, 4]], isSortOnSender=[false], isSortOnReceiver=[true])
        LogicalProject(Carrier=[$17], $f1=[CONCAT($61, $28, _UTF-8' | ')], $f2=[_UTF-8', '], Origin=[$61], Dest=[$28])
          LogicalTableScan(table=[[default, airlineStats]])
image

@codecov-commenter
Copy link

codecov-commenter commented May 13, 2024

Codecov Report

Attention: Patch coverage is 40.12346% with 97 lines in your changes are missing coverage. Please review.

Project coverage is 62.22%. Comparing base (59551e4) to head (60d2a99).
Report is 479 commits behind head on master.

Files Patch % Lines
...ry/aggregation/function/array/ListAggFunction.java 0.00% 51 Missing ⚠️
...el/rules/PinotAggregateExchangeNodeInsertRule.java 61.76% 7 Missing and 6 partials ⚠️
...gregation/function/AggregationFunctionFactory.java 0.00% 12 Missing ⚠️
...gation/function/array/ListAggDistinctFunction.java 0.00% 12 Missing ⚠️
...org/apache/pinot/core/common/ObjectSerDeUtils.java 88.88% 2 Missing and 2 partials ⚠️
...inot/query/runtime/operator/AggregateOperator.java 62.50% 2 Missing and 1 partial ⚠️
...che/pinot/query/planner/logical/RexExpression.java 80.00% 1 Missing ⚠️
...che/pinot/segment/spi/AggregationFunctionType.java 66.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13146      +/-   ##
============================================
+ Coverage     61.75%   62.22%   +0.47%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2531      +95     
  Lines        133233   138446    +5213     
  Branches      20636    21428     +792     
============================================
+ Hits          82274    86152    +3878     
- Misses        44911    45866     +955     
- Partials       6048     6428     +380     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 35.22% <40.12%> (-26.49%) ⬇️
java-21 62.11% <40.12%> (+0.49%) ⬆️
skip-bytebuffers-false 62.19% <40.12%> (+0.45%) ⬆️
skip-bytebuffers-true 62.10% <40.12%> (+34.37%) ⬆️
temurin 62.22% <40.12%> (+0.47%) ⬆️
unittests 62.22% <40.12%> (+0.47%) ⬆️
unittests1 46.75% <40.12%> (-0.14%) ⬇️
unittests2 27.87% <0.00%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@xiangfu0 xiangfu0 added feature multi-stage Related to the multi-stage query engine labels May 13, 2024
@xiangfu0 xiangfu0 force-pushed the with-in-groupby branch 4 times, most recently from 07d980d to 574394e Compare May 14, 2024 07:12
@xiangfu0 xiangfu0 requested review from Jackie-Jiang and gortiz May 14, 2024 08:50
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should generalize the query flow for aggregate with distinct or within group order by. It should be modeled as projection/distinct on the leaf stage instead of aggregation. The aggregation should only be applied on the intermediate stage, which is similar to the window function handling logic

@xiangfu0
Copy link
Contributor Author

We should generalize the query flow for aggregate with distinct or within group order by. It should be modeled as projection/distinct on the leaf stage instead of aggregation. The aggregation should only be applied on the intermediate stage, which is similar to the window function handling logic

Do you mean We should have a new rule or reuse existing rule to modify the query plan?

@xiangfu0
Copy link
Contributor Author

We should generalize the query flow for aggregate with distinct or within group order by. It should be modeled as projection/distinct on the leaf stage instead of aggregation. The aggregation should only be applied on the intermediate stage, which is similar to the window function handling logic

I think the distinct will still happen in both leaf and intermediate stage, as there could still be duplicates across leaf nodes.

@xiangfu0 xiangfu0 requested a review from Jackie-Jiang May 19, 2024 13:34
@xiangfu0 xiangfu0 force-pushed the with-in-groupby branch 5 times, most recently from 360739d to e5cd5ed Compare May 20, 2024 06:34
@xiangfu0 xiangfu0 requested review from vvivekiyer and walterddr May 20, 2024 06:39
@xiangfu0
Copy link
Contributor Author

Simplified the code to for both collation and distinct handling.

@xiangfu0 xiangfu0 requested a review from gortiz May 20, 2024 09:23
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference for ListAggFunction and ArrayAggString/ArrayAggDistinctStringFunction is just the extract final results part. Since you need to do special handling on v2 side, is it possible to directly use the ArrayAgg? If not, probably extend that and only override the extract final results

@xiangfu0
Copy link
Contributor Author

The difference for ListAggFunction and ArrayAggString/ArrayAggDistinctStringFunction is just the extract final results part. Since you need to do special handling on v2 side, is it possible to directly use the ArrayAgg? If not, probably extend that and only override the extract final results

ArrayAgg doesn't support ORDER BY so the merger/extractFinal at intermediate/final stage doesn't honer the ordering.

@xiangfu0 xiangfu0 force-pushed the with-in-groupby branch 4 times, most recently from 8bf72f0 to 9610e82 Compare May 21, 2024 03:22
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not change this signature, and use argument to pass distinct similar to ARRAYAGG. In the future, we don't want to handle distinct logic within the aggregation function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted getAggregationFunction signature, it means we need to have special handle for LISTAGG to append isDistinct to the third argument.

This is due to LISTAGG is registered in SqlStdOperatorTable, so the sql syntax cannot be changed, usage can only be LISTAGG(distinct expr, ' | ') not LISTAGG(expr, ' | ', true).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the error for the query:

SELECT 
    Origin, Dest,
    LISTAGG(Carrier, ', ', 'true') WITHIN GROUP (ORDER BY Carrier)
FROM 
    airlineStats
GROUP BY 
    Dest, Origin
ProcessingException(errorCode:150, message:SQLParsingError:
java.lang.Exception: Unable to find table for this query
	at org.apache.pinot.controller.api.resources.PinotQueryResource.getMultiStageQueryResponse(PinotQueryResource.java:222)
	at org.apache.pinot.controller.api.resources.PinotQueryResource.executeSqlQuery(PinotQueryResource.java:179)
	at org.apache.pinot.controller.api.resources.PinotQueryResource.handlePostSql(PinotQueryResource.java:127)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
Caused by: java.lang.RuntimeException: Error composing query plan for: SELECT 
    Origin, Dest,
    LISTAGG(Carrier, ', ', 'true') WITHIN GROUP (ORDER BY Carrier)
FROM 
    airlineStats
...
Caused by: org.apache.calcite.runtime.CalciteContextException: From line 3, column 5 to line 3, column 34: Invalid number of arguments to function 'LISTAGG'. Was expecting 1 arguments
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
...
Caused by: org.apache.calcite.sql.validate.SqlValidatorException: Invalid number of arguments to function 'LISTAGG'. Was expecting 1 arguments
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490))

@xiangfu0 xiangfu0 force-pushed the with-in-groupby branch 2 times, most recently from 3c88a3c to a4d2cc5 Compare May 21, 2024 10:03
@xiangfu0 xiangfu0 requested a review from Jackie-Jiang May 21, 2024 10:17
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@xiangfu0 xiangfu0 merged commit 253ede9 into apache:master May 22, 2024
@xiangfu0 xiangfu0 deleted the with-in-groupby branch May 22, 2024 04:21
gortiz pushed a commit to gortiz/pinot that referenced this pull request Jun 14, 2024
* Support ListAgg WITHIN GROUP clause

* fix rowCompartor schema

* Change rule to insert a sort for WithInGroup clause

* move isDistinct field out from FunctionContext

* address comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation feature multi-stage Related to the multi-stage query engine

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants