Skip to content

Conversation

@yashmayya
Copy link
Contributor

@yashmayya yashmayya commented Jul 16, 2024

  • Fixes Incorrect precision for FLOAT columns in the multi-stage query engine #13615
  • [multistage] disable calcite float to double cast #11518 updated the TypeFactory (that is used to convert between Pinot's data types to Calcite SQL's types) to map Pinot's FLOAT data type to Calcite SQL's DOUBLE type.
  • This is incorrect since Pinot's FLOAT data type uses Java's float primitive type internally which is a 32 bit floating point type and should map to Calcite's REAL type instead (see https://calcite.apache.org/docs/reference.html#data-types).
  • This leads to precision issues where a simple query like SELECT aFloatColumn FROM table returns different values in the v1 and v2 query engines (can be verified via any of the quickstarts). The value returned in the multi-stage query engine is incorrect and won't match the value that was originally ingested.
  • The reason the original change was introduced was for queries like SELECT COUNT(*) FROM table WHERE aFloatColumn = 0.05. However, such queries should be written like SELECT COUNT(*) FROM table WHERE aFloatColumn = CAST(0.05 AS FLOAT) instead of incorrectly mapping Pinot's FLOAT data type. This is consistent with other standard databases like Postgres and MySQL where queries comparing FLOAT / REAL columns to literals need explicit casts on the literal value to return the expected result.
  • In a query like SELECT COUNT(*) FROM table WHERE aFloatColumn = 0.05, the literal 0.05 is inferred as a DECIMAL type with fixed precision and scale. During query compilation, Calcite tries to unify the operand types in an expression using a principle of finding a consistent type across the operands here. For a DECIMAL / REAL combination, this comes out to be DOUBLE here and thus there is an additional cast to DOUBLE added to the FLOAT column in the query plan. This can be avoided by explicitly casting the literal to a FLOAT type instead.
  • Note that similar explicit casts aren't required for DOUBLE column comparisons with literals since the consistent type inferred in this situation is DOUBLE itself and the column won't be auto-casted. Again, this behavior also matches standard databases like Postgres and MySQL where DOUBLE columns don't need explicit casts on comparison literals to get the expected results.

@yashmayya yashmayya added backward-incompat Referenced by PRs that introduce or fix backward compat issues bugfix multi-stage Related to the multi-stage query engine labels Jul 16, 2024
@codecov-commenter
Copy link

codecov-commenter commented Jul 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.96%. Comparing base (59551e4) to head (fa215d8).
Report is 780 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13616      +/-   ##
============================================
+ Coverage     61.75%   61.96%   +0.20%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2553     +117     
  Lines        133233   140467    +7234     
  Branches      20636    21850    +1214     
============================================
+ Hits          82274    87036    +4762     
- Misses        44911    46808    +1897     
- Partials       6048     6623     +575     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 35.10% <100.00%> (-26.61%) ⬇️
java-21 61.85% <100.00%> (+0.22%) ⬆️
skip-bytebuffers-false 61.94% <100.00%> (+0.19%) ⬆️
skip-bytebuffers-true 61.80% <100.00%> (+34.07%) ⬆️
temurin 61.96% <100.00%> (+0.20%) ⬆️
unittests 61.95% <100.00%> (+0.21%) ⬆️
unittests1 46.45% <100.00%> (-0.44%) ⬇️
unittests2 27.73% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yashmayya yashmayya marked this pull request as ready for review July 16, 2024 08:23
SELECT COUNT(*) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__
SELECT ft1.stringDimSV1, COUNT(ft1.stringDimSV1) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ GROUP BY ft1.stringDimSV1
SELECT ft1.stringDimSV2, SUM(ft1.floatMetric1) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ GROUP BY ft1.stringDimSV2
SELECT ft1.stringDimSV2, SUM(ft1.doubleMetric1) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ GROUP BY ft1.stringDimSV2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the multi-stage query engine, aggregation results for SUM are the same type as the aggregated column type - so even though the SUM aggregation function currently uses double as the aggregation result holder, the end result will be cast to FLOAT (leading to loss in precision, and slightly inaccurate results). It's better to aggregate the DOUBLE column instead, so that we get predictable and accurate results.

@yashmayya yashmayya requested review from Jackie-Jiang and gortiz July 16, 2024 08:23
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@yashmayya yashmayya force-pushed the v2-float-precision branch from 740c788 to 6ac1892 Compare July 18, 2024 02:48
@gortiz gortiz merged commit 840535f into apache:master Jul 18, 2024
yashmayya added a commit to yashmayya/pinot that referenced this pull request Jul 24, 2024
ege-st pushed a commit to ege-st/pinot that referenced this pull request Aug 1, 2024
@npawar npawar added the v1v2 label Aug 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backward-incompat Referenced by PRs that introduce or fix backward compat issues bugfix multi-stage Related to the multi-stage query engine v1v2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect precision for FLOAT columns in the multi-stage query engine

5 participants