Map Pinot's FLOAT data type to Calcite SQL's REAL type to fix precision issues #13616

yashmayya · 2024-07-16T05:17:25Z

Fixes Incorrect precision for FLOAT columns in the multi-stage query engine #13615
[multistage] disable calcite float to double cast #11518 updated the TypeFactory (that is used to convert between Pinot's data types to Calcite SQL's types) to map Pinot's FLOAT data type to Calcite SQL's DOUBLE type.
This is incorrect since Pinot's FLOAT data type uses Java's float primitive type internally which is a 32 bit floating point type and should map to Calcite's REAL type instead (see https://calcite.apache.org/docs/reference.html#data-types).
This leads to precision issues where a simple query like SELECT aFloatColumn FROM table returns different values in the v1 and v2 query engines (can be verified via any of the quickstarts). The value returned in the multi-stage query engine is incorrect and won't match the value that was originally ingested.
The reason the original change was introduced was for queries like SELECT COUNT(*) FROM table WHERE aFloatColumn = 0.05. However, such queries should be written like SELECT COUNT(*) FROM table WHERE aFloatColumn = CAST(0.05 AS FLOAT) instead of incorrectly mapping Pinot's FLOAT data type. This is consistent with other standard databases like Postgres and MySQL where queries comparing FLOAT / REAL columns to literals need explicit casts on the literal value to return the expected result.
In a query like SELECT COUNT(*) FROM table WHERE aFloatColumn = 0.05, the literal 0.05 is inferred as a DECIMAL type with fixed precision and scale. During query compilation, Calcite tries to unify the operand types in an expression using a principle of finding a consistent type across the operands here. For a DECIMAL / REAL combination, this comes out to be DOUBLE here and thus there is an additional cast to DOUBLE added to the FLOAT column in the query plan. This can be avoided by explicitly casting the literal to a FLOAT type instead.
Note that similar explicit casts aren't required for DOUBLE column comparisons with literals since the consistent type inferred in this situation is DOUBLE itself and the column won't be auto-casted. Again, this behavior also matches standard databases like Postgres and MySQL where DOUBLE columns don't need explicit casts on comparison literals to get the expected results.

codecov-commenter · 2024-07-16T05:53:27Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.96%. Comparing base (59551e4) to head (fa215d8).
Report is 780 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #13616      +/-   ##
============================================
+ Coverage     61.75%   61.96%   +0.20%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2553     +117     
  Lines        133233   140467    +7234     
  Branches      20636    21850    +1214     
============================================
+ Hits          82274    87036    +4762     
- Misses        44911    46808    +1897     
- Partials       6048     6623     +575

Flag	Coverage Δ
custom-integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration	`<0.01% <0.00%> (-0.01%)`	⬇️
integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration2	`0.00% <0.00%> (ø)`
java-11	`35.10% <100.00%> (-26.61%)`	⬇️
java-21	`61.85% <100.00%> (+0.22%)`	⬆️
skip-bytebuffers-false	`61.94% <100.00%> (+0.19%)`	⬆️
skip-bytebuffers-true	`61.80% <100.00%> (+34.07%)`	⬆️
temurin	`61.96% <100.00%> (+0.20%)`	⬆️
unittests	`61.95% <100.00%> (+0.21%)`	⬆️
unittests1	`46.45% <100.00%> (-0.44%)`	⬇️
unittests2	`27.73% <0.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

yashmayya · 2024-07-16T08:20:23Z

...verifier/multi-stage-query-engine-test-suite/config/queries/feature-test-multi-stage.queries

 SELECT COUNT(*) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__
 SELECT ft1.stringDimSV1, COUNT(ft1.stringDimSV1) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ GROUP BY ft1.stringDimSV1
-SELECT ft1.stringDimSV2, SUM(ft1.floatMetric1) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ GROUP BY ft1.stringDimSV2
+SELECT ft1.stringDimSV2, SUM(ft1.doubleMetric1) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ GROUP BY ft1.stringDimSV2


In the multi-stage query engine, aggregation results for SUM are the same type as the aggregated column type - so even though the SUM aggregation function currently uses double as the aggregation result holder, the end result will be cast to FLOAT (leading to loss in precision, and slightly inaccurate results). It's better to aggregate the DOUBLE column instead, so that we get predictable and accurate results.

Jackie-Jiang

LGTM!

pinot-query-planner/src/test/java/org/apache/pinot/query/type/TypeFactoryTest.java

…on issues

…on issues (apache#13616)

…on issues (apache#13616) (apache#254)

yashmayya added backward-incompat Referenced by PRs that introduce or fix backward compat issues bugfix multi-stage Related to the multi-stage query engine labels Jul 16, 2024

yashmayya marked this pull request as ready for review July 16, 2024 08:23

yashmayya commented Jul 16, 2024

View reviewed changes

yashmayya requested review from Jackie-Jiang and gortiz July 16, 2024 08:23

Jackie-Jiang approved these changes Jul 17, 2024

View reviewed changes

pinot-query-planner/src/test/java/org/apache/pinot/query/type/TypeFactoryTest.java Show resolved Hide resolved

yashmayya force-pushed the v2-float-precision branch from 740c788 to 6ac1892 Compare July 18, 2024 02:48

Map Pinot's FLOAT data type to Calcite SQL's REAL type to fix precisi…

fa215d8

…on issues

yashmayya force-pushed the v2-float-precision branch from 6ac1892 to fa215d8 Compare July 18, 2024 02:49

gortiz mentioned this pull request Jul 18, 2024

[multistage] disable calcite float to double cast #11518

Merged

gortiz approved these changes Jul 18, 2024

View reviewed changes

gortiz merged commit 840535f into apache:master Jul 18, 2024

yashmayya added a commit to yashmayya/pinot that referenced this pull request Jul 24, 2024

Map Pinot's FLOAT data type to Calcite SQL's REAL type to fix precisi…

09fc087

…on issues (apache#13616)

ege-st pushed a commit to ege-st/pinot that referenced this pull request Aug 1, 2024

Map Pinot's FLOAT data type to Calcite SQL's REAL type to fix precisi…

8cbc67d

…on issues (apache#13616) (apache#254)

npawar added the v1v2 label Aug 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Map Pinot's FLOAT data type to Calcite SQL's REAL type to fix precision issues #13616

Map Pinot's FLOAT data type to Calcite SQL's REAL type to fix precision issues #13616

Uh oh!

yashmayya commented Jul 16, 2024 •

edited

Loading

Uh oh!

codecov-commenter commented Jul 16, 2024 •

edited

Loading

Uh oh!

yashmayya Jul 16, 2024

Uh oh!

Jackie-Jiang left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Map Pinot's FLOAT data type to Calcite SQL's REAL type to fix precision issues #13616

Map Pinot's FLOAT data type to Calcite SQL's REAL type to fix precision issues #13616

Uh oh!

Conversation

yashmayya commented Jul 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Jul 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

yashmayya Jul 16, 2024

Choose a reason for hiding this comment

Uh oh!

Jackie-Jiang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

yashmayya commented Jul 16, 2024 •

edited

Loading

codecov-commenter commented Jul 16, 2024 •

edited

Loading