[7867] Handle Select * with Extra Columns #7959

suddendust · 2021-12-25T15:29:24Z

Description

This handles queries such as SELECT $segmentName, * FROM table LIMIT 1. Pinot currently doesn't support extra columns alongside * (see the linked issue). We expand the * on the broker side to the actual columns. Note that while the SQL standard does not allow adding additional columns along with *, almost all vendors have their own extension to this rule.

Behaviour:

tableName: baseballStats

Schema:

Column Name
playerID
homeRuns
playerStint
groundedIntoDoublePlays
G_old

Query	Expands To (Cols. are in order unless explicitly specified)	Comments
SELECT $docId,*,$segmentName FROM baseballStats	$docId, G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint, $segmentName	No extra default columns are added
SELECT playerID,*,G_old FROM baseballStats	playerID, G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint, G_old	playerID and G_old are not deduped
SELECT playerID,*,G_old FROM baseballStats	playerID, G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint, G_old	Selection order is maintained, * is expanded with natural ordering. Selection order overrides natural order
SELECT playerID as pid,* FROM baseballStats	AS(playerID, pid), G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint	Aliased column is returned along with original column playerID
SELECT sqrt(homeRuns),* FROM baseballStats	SQRT(homeRuns), G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint	f(homeRuns) is returned along with homeRuns
SELECT add(homeRuns,groundedIntoDoublePlays),* FROM baseballStats	ADD(homeRuns, groundedIntoDoublePlays), G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint	f(homeRuns, groundedIntoDoublePlays) is returned along with both the columns
SELECT *,* FROM baseballStats	G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint, G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint	Each * is expanded
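
The expansion rules in the table can be sketched as follows. This is a minimal, string-based illustration, not the PR's actual code: the class name `StarExpansionSketch`, the `expand` method, and the assumption that virtual columns are exactly those whose names start with `$` are all hypothetical, and the real broker code operates on Pinot `Expression` objects rather than strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: select-list entries are plain strings instead of Pinot Expressions.
final class StarExpansionSketch {
  private static final String STAR = "*";

  // Replaces every "*" with all non-virtual columns in natural (sorted) order.
  // Explicitly selected entries keep their position and are NOT deduplicated.
  static List<String> expand(List<String> selections, List<String> schemaColumns) {
    List<String> sortedColumns = new ArrayList<>();
    for (String column : schemaColumns) {
      // Assumption: virtual columns are the ones starting with '$' (e.g. $docId).
      if (!column.startsWith("$")) {
        sortedColumns.add(column);
      }
    }
    Collections.sort(sortedColumns);

    // Build a new list instead of inserting into the original one.
    List<String> expanded = new ArrayList<>();
    for (String selection : selections) {
      if (STAR.equals(selection)) {
        expanded.addAll(sortedColumns);
      } else {
        expanded.add(selection);
      }
    }
    return expanded;
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList("playerID", "homeRuns", "playerStint",
        "groundedIntoDoublePlays", "G_old", "$docId", "$segmentName");
    System.out.println(expand(Arrays.asList("$docId", "*", "$segmentName"), schema));
  }
}
```

With the baseballStats schema above, the `main` method reproduces the first row of the table: the two explicitly selected virtual columns stay in place and only the `*` is replaced.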

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

Yes

Does this PR fix a zero-downtime upgrade introduced earlier?

Yes

Does this PR otherwise need attention when creating release notes? Things to consider:

Yes

siddharthteotia · 2021-12-27T03:19:56Z

...src/test/java/org/apache/pinot/broker/requesthandler/SelectStarWithOtherColsRewriteTest.java

+    String sql = "SELECT playerID,homeRuns,* FROM baseballStats";
+    PinotQuery pinotQuery = CalciteSqlParser.compileToPinotQuery(sql);
+    BaseBrokerRequestHandler.updateColumnNames("baseballStats", pinotQuery, false, COL_MAP);
+    List<Expression> newSelections = pinotQuery.getSelectList();


This is not correct / compliant with Standard SQL as far as I can tell. The column list in the select expressions should be deduplicated. So, playerID and homeRuns should be returned only once.

Even generally for non star queries, using the same column name twice in the SELECT clause should be deduplicated.

However, the scenario of aliasing must be handled.

For example, the query SELECT CustomerID AS C1, * FROM Customers must return CustomerID column in the result twice (with result column name as C1 and CustomerID respectively).

@siddharthteotia thanks for the review. Actually I wasn't aware that this is dictated by the SQL standard. Had asked both of these questions (deduplication and returning internal columns) in the original issue for clarification. Will address these.

This is what I am seeing on mysql (no dedup when * is combined with column names in select list), but not sure how standards compliant mysql is. In either case we should check against at least two other databases before setting behavior here since it will be difficult to change once set.

mysql> select *, cases, abs(cases) from covid_cases;
+------------+-------+-------+------------+
| date       | cases | cases | abs(cases) |
+------------+-------+-------+------------+
| 2021-01-01 |    30 |    30 |         30 |
| 2021-01-02 |    60 |    60 |         60 |
| 2021-01-03 |   120 |   120 |        120 |
| 2021-01-04 |   100 |   100 |        100 |
| 2021-01-05 |    80 |    80 |         80 |
| 2021-01-06 |    70 |    70 |         70 |
| 2021-01-07 |    85 |    85 |         85 |
| 2021-01-08 |    65 |    65 |         65 |
| 2021-01-01 |    70 |    70 |         70 |
| 2021-01-02 |   100 |   100 |        100 |
+------------+-------+-------+------------+
10 rows in set (0.00 sec)

mysql> select * from covid_cases;
+------------+-------+
| date       | cases |
+------------+-------+
| 2021-01-01 |    30 |
| 2021-01-02 |    60 |
| 2021-01-03 |   120 |
| 2021-01-04 |   100 |
| 2021-01-05 |    80 |
| 2021-01-06 |    70 |
| 2021-01-07 |    85 |
| 2021-01-08 |    65 |
| 2021-01-01 |    70 |
| 2021-01-02 |   100 |
+------------+-------+
10 rows in set (0.00 sec)

Tried this with Postgres, same behaviour (column is returned twice). The ANSI SQL standard doesn't allow for extra columns to be added but DBs have their own extensions to it. I think we should go with how MySQL and Postgres are doing it, that is, returning the columns twice unless there are some implications.

@siddharthteotia

In the above query (select *, cases, abs(cases) from covid_cases), cases and abs(cases) are different since the latter uses a function. By deduplication, I mean deduplicating only the identifiers (simple column names) in the SELECT list.

My preference would be to deduplicate and not return the column twice as the output is cleaner that way. So, in the above query cases column should be returned exactly once.

If the user wants to get the same column twice, they should use aliasing (like I mentioned earlier)

SELECT CustomerID AS C1, * FROM Customers

CustomerID should be returned twice -- once as C1 and next as CustomerID

But if we want to stick to how MySQL or Postgres is doing and that is the general ANSI SQL / user expectation, then let's not deduplicate. I am ok with it. Just does not look clean

Trying the query here seems to deduplicate -- https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all

Not sure which database it is using underneath

@amrishlal @suddendust

siddharthteotia · 2021-12-27T03:30:09Z

Let's please also ensure that rewrite done here to support this feature does not change the default behavior of SELECT * queries and $ / internal columns are not returned.

amrishlal · 2021-12-27T20:54:49Z

...src/test/java/org/apache/pinot/broker/requesthandler/SelectStarWithOtherColsRewriteTest.java

+import org.testng.annotations.Test;
+
+
+public class SelectStarWithOtherColsRewriteTest {


Please also add test cases for:

SELECT sqrt(homeRuns), * FROM baseballStats,

Use of more than one unqualified * in the select statement should result in syntax error (calcite is probably doing this already).

Use of more than one unqualified * in the select statement should result in syntax error

Tried this with MySQL, it throws an error. With Postgres, it works. IMO if the context is clear enough (in this case we know the table name on which * is being selected), then we should return all the columns without the qualifier like how Postgres is doing it? Calcite doesn't throw a syntax error in this case btw.

codecov-commenter · 2022-01-24T09:08:34Z

Codecov Report

Merging #7959 (4159c57) into master (b6eeaf3) will increase coverage by 0.17%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #7959      +/-   ##
============================================
+ Coverage     71.20%   71.37%   +0.17%     
- Complexity     4174     4304     +130     
============================================
  Files          1593     1617      +24     
  Lines         82477    83896    +1419     
  Branches      12305    12543     +238     
============================================
+ Hits          58729    59884    +1155     
- Misses        19788    19929     +141     
- Partials       3960     4083     +123

Flag	Coverage Δ
integration1	`28.89% <86.95%> (-0.18%)`	⬇️
integration2	`27.77% <86.95%> (+0.29%)`	⬆️
unittests1	`67.91% <ø> (-0.27%)`	⬇️
unittests2	`14.19% <100.00%> (-0.13%)`	⬇️

Impacted Files	Coverage Δ
...roker/requesthandler/BaseBrokerRequestHandler.java	`71.87% <100.00%> (+0.59%)`	⬆️
...elix/core/minion/generator/PinotTaskGenerator.java	`33.33% <0.00%> (-66.67%)`	⬇️
.../java/org/apache/pinot/spi/filesystem/PinotFS.java	`0.00% <0.00%> (-66.67%)`	⬇️
...readers/forward/BaseChunkSVForwardIndexReader.java	`46.15% <0.00%> (-46.50%)`	⬇️
.../pinot/core/operator/docidsets/BitmapDocIdSet.java	`62.50% <0.00%> (-37.50%)`	⬇️
...ller/helix/core/minion/TaskTypeMetricsUpdater.java	`80.00% <0.00%> (-20.00%)`	⬇️
...in/stream/kafka20/KafkaStreamMetadataProvider.java	`71.42% <0.00%> (-16.08%)`	⬇️
...nt/local/startree/v2/store/StarTreeDataSource.java	`40.00% <0.00%> (-13.34%)`	⬇️
...nction/DistinctCountBitmapAggregationFunction.java	`41.96% <0.00%> (-10.89%)`	⬇️
...org/apache/pinot/core/util/ListenerConfigUtil.java	`77.00% <0.00%> (-10.81%)`	⬇️
... and 174 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6eeaf3...4159c57. Read the comment docs.

suddendust · 2022-01-24T12:02:45Z

@Jackie-Jiang @amrishlal @siddharthteotia Requesting review, thanks!

Jackie-Jiang

Can you please list the behavior for the different scenarios of select with star? E.g.
SELECT colA, * FROM ...
SELECT *, colA FROM ...
SELECT colA, *, $segmentName FROM ...
SELECT add(ColA, colB), * FROM ...

Currently we order columns alphabetically in select star. We should preserve this behavior

Jackie-Jiang · 2022-01-24T17:46:03Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

+  private static void expandStarExpressionToActualColumns(PinotQuery pinotQuery, Map<String, String> columnNameMap,
+      Expression selectStarExpr) {
+    List<Expression> originalSelections = pinotQuery.getSelectList();
+    Set<String> originallySelectedColumnNames =


Avoid using stream APIs in the query path because we have found that they perform worse than regular APIs.

Done, thanks. I was unaware of this. Is there a benchmark that I can refer to for my own knowledge? There's literature available online, but having Pinot specific info would be handy. Thanks!
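
For illustration, here is what replacing a stream pipeline with a plain loop can look like. The original stream code is not shown in this thread, so this is only a hypothetical sketch: the class `SelectedColumnsSketch` and the method `selectedColumnNames` are made-up names, and select-list entries are modeled as strings. It makes no performance claim of its own beyond what the reviewer reported.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: collects the explicitly selected column names without streams.
final class SelectedColumnsSketch {
  // A stream version might look like:
  //   selections.stream().filter(s -> !"*".equals(s)).collect(Collectors.toSet());
  // The loop below computes the same set imperatively.
  static Set<String> selectedColumnNames(List<String> selections) {
    Set<String> names = new HashSet<>();
    for (String selection : selections) {
      if (!"*".equals(selection)) {
        names.add(selection);
      }
    }
    return names;
  }
}
```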

amrishlal · 2022-01-26T05:56:20Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

+        //expand '*' to actual columns, exclude default virtual columns
+        for (String tableCol : columnNameMap.values()) {
+          //we exclude default virtual columns and those columns that are already a part of originalSelections (to
+          // dedup columns that are selected multiple times)


I thought we decided not to dedup since that is the behavior that mysql and postgres follow?

The output remains cleaner if we dedup. I like @siddharthteotia's idea that if users want the same column multiple times, they should use aliasing (which the change handles). @Jackie-Jiang What are your thoughts on this?

So by extension, SELECT *, * FROM baseballStats should also be deduped right?

Also from what I can see queries like SELECT playerID, playerID FROM baseballStats are currently supported in Pinot without the dedupe. So whatever scheme we pick should be consistent otherwise it will cause confusion later and be difficult to change.

I tried the SQL runner at w3schools, and it dedups the columns. @amrishlal Can you please share the behavior of mysql and postgres? Specifically for the following queries:

select col, * from table

select *, col from table

select *, * from table

select *, col, * from table

select col, *, col from table

select col1 + col2, * from table

I feel there is no SQL standard for this. Returning duplicate columns can add extra workload and traffic without providing much value to the user, so I prefer deduping the columns.

@Jackie-Jiang mysql doesn't dedup in any of the cases listed above, except that multiple stars are not allowed in the select list. I don't have access to postgresql, but I believe (as @suddendust tested earlier) the behavior is the same as mysql except that postgresql allows multiple stars in the select list.

The w3schools console might be custom UI logic (note how they change the column name for the query SELECT CustomerID, CustomerID FROM Customers without the use of aliasing). Nothing wrong with deduping per se and either approach would work, but I think we need to maintain consistency. For example:

if we are deduping select list, then shouldn't select *, * be also deduped (w3schools sql console is doing dedupe here)?

if we decide to dedup select *, * then shouldn't select playerID, playerID (existing functionality which is not being currently deduped) also be deduped?

what about select playerID, *, playerID - how will dedupe work in this case? (since we currently don't dedupe playerID, playerID but will dedup playerID, *)

I am wondering if we have an example of an actual commercial database that is doing dedup of select list?

I tried mysql, postgres and sql server, seems all of them don't dedup the columns automatically, so probably we should follow this convention

suddendust · 2022-01-26T09:56:24Z

@Jackie-Jiang Have updated the PR description with the behaviour.

Jackie-Jiang · 2022-01-26T22:29:21Z

Sorry for going back and forth. I checked the behavior of MySQL, PostgreSQL and SQL Server; all of them convert * to all the columns without deduplication.
Let's keep this convention, and just rewrite each * to all columns sorted in alphabetical order to be compatible with the existing Pinot behavior. The algorithm should be quite straightforward, and we can also consider rewriting the query for * without other columns for completeness.

amrishlal · 2022-01-30T23:37:20Z

@suddendust Looks good to me :-) Minor: the function expandStarExpressionToActualColumns (singular) could be renamed to expandStarExpressionsToActualColumns (plural)

suddendust · 2022-01-31T06:05:11Z

@amrishlal Thanks for the review, have addressed this comment. I think I was missing looking at this issue in the larger context, as to how this might affect users' experience with Pinot in the long term. This was a great learning experience for me, thanks :)

suddendust · 2022-01-31T16:50:00Z

@Jackie-Jiang Requesting review, thanks!

Jackie-Jiang

LGTM otherwise. Thanks for adding the tests!

Jackie-Jiang · 2022-01-31T22:09:59Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

      Map<String, String> columnNameMap) {
    Map<String, String> aliasMap = new HashMap<>();
    if (pinotQuery != null) {
+      Expression selectStarExpr = null;


No need to store and pass this expression. This can be changed to a boolean field hasSelectStar. We can add a constant for * identifier, and the compare can be simplified to expression.equals(STAR)
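
A minimal sketch of this suggestion, assuming select-list entries are modeled as strings: the names `SelectStarFlagSketch`, `STAR`, and `hasSelectStar` are hypothetical, and in the PR the constant would presumably be an `Expression` for the `*` identifier rather than a `String`.

```java
import java.util.List;

// Hypothetical sketch: track whether '*' was seen with a boolean instead of keeping the expression.
final class SelectStarFlagSketch {
  // Shared constant for the star identifier, compared with equals().
  private static final String STAR = "*";

  static boolean hasSelectStar(List<String> selections) {
    for (String selection : selections) {
      if (selection.equals(STAR)) {
        return true;
      }
    }
    return false;
  }
}
```

The caller then only needs the flag to decide whether the select list has to be rewritten at all.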

Jackie-Jiang · 2022-01-31T22:10:30Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

  }

+  private static void expandStarExpressionsToActualColumns(PinotQuery pinotQuery, Map<String, String> columnNameMap,
+      Expression selectStarExpr) {


Same here, no need to pass in this expression. Use constant instead

Jackie-Jiang · 2022-01-31T22:13:16Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

+    }
+    //sort naturally
+    expandedSelections.sort(null);
+    ListIterator<Expression> li = originalSelections.listIterator();


Modifying the existing list can be expensive (keep shifting the values). We can create a new list to replace the original one

Done, thanks!

suddendust · 2022-02-01T05:09:02Z

@Jackie-Jiang Have addressed the comments.

Jackie-Jiang

LGTM with one minor comment. Will merge once the tests pass

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

siddharthteotia · 2022-02-02T00:35:05Z

@suddendust can you please take a look at the failing tests ? We can then merge this

suddendust · 2022-02-02T06:16:38Z

Thanks everyone!

Expand '*' with other cols

12b644f

suddendust changed the title ~~[7867] Handle Select * with Other Columns~~ [7867] Handle Select * with Extra Columns Dec 25, 2021

added license header

c93e7c2

siddharthteotia reviewed Dec 27, 2021

suddendust changed the title ~~[7867] Handle Select * with Extra Columns~~ [WIP] [7867] Handle Select * with Extra Columns Dec 27, 2021

amrishlal suggested changes Dec 27, 2021

suddendust added 5 commits January 23, 2022 18:20

Additional UTs

bd9b0b6

Additional UTs for happy case

b14c500

linter

560560a

Checkstyle

83766d7

Fix failing UT

e223b56

Jackie-Jiang reviewed Jan 24, 2022

Use imperative code instead of stream APIs for performance while expanding '*'

21de33f

amrishlal reviewed Jan 26, 2022

suddendust added 2 commits January 26, 2022 14:49

Expand * in natural ordering of columns

6761eaa

Linter fixes

fc51eee

Minor refactoring

0750616

suddendust changed the title ~~[WIP] [7867] Handle Select * with Extra Columns~~ [7867] Handle Select * with Extra Columns Jan 26, 2022

suddendust added 2 commits January 29, 2022 23:18

Don't dedup columns, expand '*' request without other columns on broker

341cc5d

Add test description

0a1f0f6

Renamed from expandStarExpressionToActualColumns to `expandStarExpressionsToActualColumns`

b346f4d

Jackie-Jiang approved these changes Jan 31, 2022

Create new selections list instead of modifying the original

981ca30

siddharthteotia approved these changes Feb 1, 2022

Jackie-Jiang approved these changes Feb 1, 2022

Update pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>

4159c57

siddharthteotia merged commit ea2f0aa into apache:master Feb 2, 2022

suddendust deleted the 7867_2 branch February 2, 2022 06:16

Jackie-Jiang mentioned this pull request Sep 18, 2024

support for expansion of * in the project part of a query #6470

Closed
