Skip to content

Conversation

@suddendust
Copy link
Contributor

@suddendust suddendust commented Dec 25, 2021

Description

This handles cases like select $segmentName, * from table limit 1. Pinot currently doesn't support extra cols with *. Issue. We expand the * on the broker side to actual columns. Note that while the SQL standard does not allow adding additional columns along with *, almost all vendors have their own extension to this rule.

Behaviour:

tableName: baseballStats

Schema:

Column Name
playerID
homeRuns
playerStint
groundedIntoDoublePlays
G_old
Query Expands To (Cols. are in order unless explicitly specified) Comments
SELECT $docId,*,$segmentName FROM baseballStats $docId, G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint, $segmentName No extra default columns are added
SELECT playerID,*,G_old FROM baseballStats playerID, G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint, G_old playerID and G_old are not deduped
SELECT playerID,*,G_old FROM baseballStats playerID, G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint, G_old Selection order is maintained, * is expanded with natural ordering. Selection order overrides natural order
SELECT playerID as pid,* FROM baseballStats AS(playerID, pid), G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint Aliased column is returned along with original column playerId
SELECT sqrt(homeRuns),* FROM baseballStats SQRT(homeRuns), G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint f(homeRuns) is returned along with homeRuns
SELECT add(homeRuns,groundedIntoDoublePlays),* FROM baseballStats ADD(homeRuns, groundedIntoDoublePlays), G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint f(homeRuns, groundedIntoDoublePlays) is returned along with both the columns
SELECT *, * FROM baseballStats G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint, G_old, groundedIntoDoublePlays, homeRuns, playerID, playerStint Each * is expanded

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • Yes

Does this PR fix a zero-downtime upgrade introduced earlier?

  • Yes

Does this PR otherwise need attention when creating release notes? Things to consider:

  • Yes

@suddendust suddendust changed the title [7867] Handle Select * with Other Columns [7867] Handle Select * with Extra Columns Dec 25, 2021
String sql = "SELECT playerID,homeRuns,* FROM baseballStats";
PinotQuery pinotQuery = CalciteSqlParser.compileToPinotQuery(sql);
BaseBrokerRequestHandler.updateColumnNames("baseballStats", pinotQuery, false, COL_MAP);
List<Expression> newSelections = pinotQuery.getSelectList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct / compliant with Standard SQL as far as I can tell. The column list in the select expressions should be deduplicated. So, playerID and homeRuns should be returned only once.

Even generally for non star queries, using the same column name twice in the SELECT clause should be deduplicated.

However, the scenario of aliasing must be handled.

For example, the query SELECT CustomerID AS C1, * FROM Customers must return CustomerID column in the result twice (with result column name as C1 and CustomerID respectively).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia thanks for the review. Actually I wasn't aware that this is dictated by the SQL standard. Had asked both of these questions (deduplication and returning internal columns) in the original issue for clarification. Will address these.

Copy link
Contributor

@amrishlal amrishlal Dec 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I am seeing on mysql (no dedup when * is combined with column names in select list), but not sure how standards complaint mysql is. In either case we should check against at least two other databases before setting behavior here since it will be difficult to change once set.

mysql> select *, cases, abs(cases) from covid_cases;
+------------+-------+-------+------------+
| date       | cases | cases | abs(cases) |
+------------+-------+-------+------------+
| 2021-01-01 |    30 |    30 |         30 |
| 2021-01-02 |    60 |    60 |         60 |
| 2021-01-03 |   120 |   120 |        120 |
| 2021-01-04 |   100 |   100 |        100 |
| 2021-01-05 |    80 |    80 |         80 |
| 2021-01-06 |    70 |    70 |         70 |
| 2021-01-07 |    85 |    85 |         85 |
| 2021-01-08 |    65 |    65 |         65 |
| 2021-01-01 |    70 |    70 |         70 |
| 2021-01-02 |   100 |   100 |        100 |
+------------+-------+-------+------------+
10 rows in set (0.00 sec)

mysql> select * from covid_cases;
+------------+-------+
| date       | cases |
+------------+-------+
| 2021-01-01 |    30 |
| 2021-01-02 |    60 |
| 2021-01-03 |   120 |
| 2021-01-04 |   100 |
| 2021-01-05 |    80 |
| 2021-01-06 |    70 |
| 2021-01-07 |    85 |
| 2021-01-08 |    65 |
| 2021-01-01 |    70 |
| 2021-01-02 |   100 |
+------------+-------+
10 rows in set (0.00 sec)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried this with Postgres, same behaviour (column is returned twice). The ANSI SQL standard doesn't allow for extra columns to be added but DBs have their own extensions to it. I think we should go with how MySQL and Postgres are doing it, that is, returning the columns twice unless there are some implications.

@siddharthteotia

Copy link
Contributor

@siddharthteotia siddharthteotia Jan 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the above query mysql> select *, cases, abs(cases) from covid_cases, cases and abs(cases) are different since the latter is using function. By deduplication, I mean to do only for the identifiers (simple column names) in the SELECT list.

My preference would be to deduplicate and not return the column twice as the output is cleaner that way. So, in the above query cases column should be returned exactly once.

If the user wants to get the same column twice, they should use aliasing (like I mentioned earlier)

SELECT CustomerID AS C1, * FROM Customers

CustomerID should be returned twice -- once as C1 and next as CustomerID

But if we want to stick to how MySQL or Postgres is doing and that is the general ANSI SQL / user expectation, then let's not deduplicate. I am ok with it. Just does not look clean

Trying the query here seems to deduplicate -- https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all

Not sure which database it is using underneath

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia
Copy link
Contributor

Let's please also ensure that rewrite done here to support this feature does not change the default behavior of SELECT * queries and $ / internal columns are not returned.

@suddendust suddendust changed the title [7867] Handle Select * with Extra Columns [WIP] [7867] Handle Select * with Extra Columns Dec 27, 2021
import org.testng.annotations.Test;


public class SelectStarWithOtherColsRewriteTest {
Copy link
Contributor

@amrishlal amrishlal Dec 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add test cases for:

  • SELECT sqrt(homeRuns), * FROM baseballStats,
  • Use of more than one unqualified * in the select statement should result in syntax error (calcite is probably doing this already).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of more than one unqualified * in the select statement should result in syntax error

Tried this with MySQL, it throws an error. With Postgres, it works. IMO if the context is clear enough (in this case we know the table name on which * is being selected), then we should return all the columns without the qualifier like how Postgres is doing it? Calcite doesn't throw a syntax error in this case btw.

String sql = "SELECT playerID,homeRuns,* FROM baseballStats";
PinotQuery pinotQuery = CalciteSqlParser.compileToPinotQuery(sql);
BaseBrokerRequestHandler.updateColumnNames("baseballStats", pinotQuery, false, COL_MAP);
List<Expression> newSelections = pinotQuery.getSelectList();
Copy link
Contributor

@amrishlal amrishlal Dec 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I am seeing on mysql (no dedup when * is combined with column names in select list), but not sure how standards complaint mysql is. In either case we should check against at least two other databases before setting behavior here since it will be difficult to change once set.

mysql> select *, cases, abs(cases) from covid_cases;
+------------+-------+-------+------------+
| date       | cases | cases | abs(cases) |
+------------+-------+-------+------------+
| 2021-01-01 |    30 |    30 |         30 |
| 2021-01-02 |    60 |    60 |         60 |
| 2021-01-03 |   120 |   120 |        120 |
| 2021-01-04 |   100 |   100 |        100 |
| 2021-01-05 |    80 |    80 |         80 |
| 2021-01-06 |    70 |    70 |         70 |
| 2021-01-07 |    85 |    85 |         85 |
| 2021-01-08 |    65 |    65 |         65 |
| 2021-01-01 |    70 |    70 |         70 |
| 2021-01-02 |   100 |   100 |        100 |
+------------+-------+-------+------------+
10 rows in set (0.00 sec)

mysql> select * from covid_cases;
+------------+-------+
| date       | cases |
+------------+-------+
| 2021-01-01 |    30 |
| 2021-01-02 |    60 |
| 2021-01-03 |   120 |
| 2021-01-04 |   100 |
| 2021-01-05 |    80 |
| 2021-01-06 |    70 |
| 2021-01-07 |    85 |
| 2021-01-08 |    65 |
| 2021-01-01 |    70 |
| 2021-01-02 |   100 |
+------------+-------+
10 rows in set (0.00 sec)

@codecov-commenter
Copy link

codecov-commenter commented Jan 24, 2022

Codecov Report

Merging #7959 (4159c57) into master (b6eeaf3) will increase coverage by 0.17%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #7959      +/-   ##
============================================
+ Coverage     71.20%   71.37%   +0.17%     
- Complexity     4174     4304     +130     
============================================
  Files          1593     1617      +24     
  Lines         82477    83896    +1419     
  Branches      12305    12543     +238     
============================================
+ Hits          58729    59884    +1155     
- Misses        19788    19929     +141     
- Partials       3960     4083     +123     
Flag Coverage Δ
integration1 28.89% <86.95%> (-0.18%) ⬇️
integration2 27.77% <86.95%> (+0.29%) ⬆️
unittests1 67.91% <ø> (-0.27%) ⬇️
unittests2 14.19% <100.00%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...roker/requesthandler/BaseBrokerRequestHandler.java 71.87% <100.00%> (+0.59%) ⬆️
...elix/core/minion/generator/PinotTaskGenerator.java 33.33% <0.00%> (-66.67%) ⬇️
.../java/org/apache/pinot/spi/filesystem/PinotFS.java 0.00% <0.00%> (-66.67%) ⬇️
...readers/forward/BaseChunkSVForwardIndexReader.java 46.15% <0.00%> (-46.50%) ⬇️
.../pinot/core/operator/docidsets/BitmapDocIdSet.java 62.50% <0.00%> (-37.50%) ⬇️
...ller/helix/core/minion/TaskTypeMetricsUpdater.java 80.00% <0.00%> (-20.00%) ⬇️
...in/stream/kafka20/KafkaStreamMetadataProvider.java 71.42% <0.00%> (-16.08%) ⬇️
...nt/local/startree/v2/store/StarTreeDataSource.java 40.00% <0.00%> (-13.34%) ⬇️
...nction/DistinctCountBitmapAggregationFunction.java 41.96% <0.00%> (-10.89%) ⬇️
...org/apache/pinot/core/util/ListenerConfigUtil.java 77.00% <0.00%> (-10.81%) ⬇️
... and 174 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6eeaf3...4159c57. Read the comment docs.

@suddendust
Copy link
Contributor Author

suddendust commented Jan 24, 2022

@Jackie-Jiang @amrishlal @siddharthteotia Requesting review, thanks!

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please list down the behavior of different scenario of select with star? E.g.
SELECT colA, * FROM ...
SELECT *, colA FROM ...
SELECT colA, *, $segmentName FROM ...
SELECT add(ColA, colB), * FROM ...

Currently we order columns alphabetically in select star. We should preserve this behavior

private static void expandStarExpressionToActualColumns(PinotQuery pinotQuery, Map<String, String> columnNameMap,
Expression selectStarExpr) {
List<Expression> originalSelections = pinotQuery.getSelectList();
Set<String> originallySelectedColumnNames =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using stream apis in query path because we have found that it has poorer performance comparing to regular apis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks. I was unaware of this. Is there a benchmark that I can refer to for my own knowledge? There's literature available online, but having Pinot specific info would be handy. Thanks!

//expand '*' to actual columns, exclude default virtual columns
for (String tableCol : columnNameMap.values()) {
//we exclude default virtual columns and those columns that are already a part of originalSelections (to
// dedup columns that are selected multiple times)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we decided not to dedup since that is the behavior that mysql and postgres follow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output remains cleaner if we dedup. I like @siddharthteotia's idea that if the user wants to the same column multiple times, he should use aliasing (which the change handles). @Jackie-Jiang What are your thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So by extension, SELECT *, * FROM baseballStats should also be deduped right?

Copy link
Contributor

@amrishlal amrishlal Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also from what I can see queries like SELECT playerID, playerID FROM baseballStats are currently supported in Pinot without the dedupe. So whatever scheme we pick should be consistent otherwise it will cause confusion later and be difficult to change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the SQL runner at w3schools, and it dedups the columns. @amrishlal Can you please share the behavior of mysql and postgres? Specifically for the following queries:

  • select col, * from table
  • select *, col from table
  • select *, * from table
  • select *, col, * from table
  • select col, *, col from table
  • select col1 + col2, * from table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel there is no SQL standard for this. Returning duplicate columns can add extra workload and traffic without providing much value to the user, so I prefer deduping the columns.

Copy link
Contributor

@amrishlal amrishlal Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jackie-Jiang mysql doesn't dudup in any of the cases listed above, except that multiple stars are not allowed in the select list. I don't have access to postgresql, but I believe (as @suddendust tested earlier) the behavior is the same as mysql except that postgresql allows multiple stars in the select list.

The w3schools console might be custom UI logic (note how they change column name for query SELECT CustomerID, CustomerID, FROM Customers without the use of aliasing) . Nothing wrong with deduping per se and either approach would work, but I think we need to maintain consistency. For example:

  • if we are deduping select list, then shouldn't select *, * be also deduped (w3schools sql console is doing dudupe here)?
  • if we decide to dedup select *, * then shouldn't select playerID, playerID (existing functionality which is not being currently deduped) also be deduped?
  • what about select playerID, *, playerID - how will dudupe work in this case? (since we currently don't dedupe playerID, playerID but will dedup playerID, *)
  • I am wondering if we have an example of an actual commercial database that is doing dedup of select list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried mysql, postgres and sql server, seems all of them don't dedup the columns automatically, so probably we should follow this convention

@suddendust
Copy link
Contributor Author

@Jackie-Jiang Have updated the PR description with the behaviour.

@suddendust suddendust changed the title [WIP] [7867] Handle Select * with Extra Columns [7867] Handle Select * with Extra Columns Jan 26, 2022
@Jackie-Jiang
Copy link
Contributor

Sorry for getting back and forth. I checked the behavior of MySQL, PostgresSQL and SQL Server, all of them will convert * to all the columns without deduplication.
Let's keep this convention, and just rewrite each * to all columns sorted in alphabetically order to be compatible with the existing pinot behavior. The algorithm should be quite straight forward, and we can consider also rewrite the query for * without other columns for completeness.

@amrishlal
Copy link
Contributor

@suddendust Looks good to me :-) Minor: the function expandStarExpressionToActualColumns (singular) could be renamed to expandStarExpressionsToActualColumns (plural)

@suddendust
Copy link
Contributor Author

suddendust commented Jan 31, 2022

@amrishlal Thanks for the review, have addressed this comment. I think I was missing looking at this issue in the larger context, as to how this might affect users' experience with Pinot in the long term. This was a great learning experience for me, thanks :)

@suddendust
Copy link
Contributor Author

@Jackie-Jiang Requesting review, thanks!

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. Thanks for adding the tests!

Map<String, String> columnNameMap) {
Map<String, String> aliasMap = new HashMap<>();
if (pinotQuery != null) {
Expression selectStarExpr = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to store and pass this expression. This can be changed to a boolean field hasSelectStar. We can add a constant for * identifier, and the compare can be simplified to expression.equals(STAR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

}

private static void expandStarExpressionsToActualColumns(PinotQuery pinotQuery, Map<String, String> columnNameMap,
Expression selectStarExpr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, no need to pass in this expression. Use constant instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

}
//sort naturally
expandedSelections.sort(null);
ListIterator<Expression> li = originalSelections.listIterator();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying the existing list can be expensive (keep shifting the values). We can create a new list to replace the original one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

@suddendust
Copy link
Contributor Author

@Jackie-Jiang Have addressed the comments.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one minor comment. Will merge once the tests pass

@siddharthteotia
Copy link
Contributor

@suddendust can you please take a look at the failing tests ? We can then merge this

…ler/BaseBrokerRequestHandler.java

Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>
@siddharthteotia siddharthteotia merged commit ea2f0aa into apache:master Feb 2, 2022
@suddendust
Copy link
Contributor Author

Thanks everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants