-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Map Pinot's FLOAT data type to Calcite SQL's REAL type to fix precision issues #13616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #13616 +/- ##
============================================
+ Coverage 61.75% 61.96% +0.20%
+ Complexity 207 198 -9
============================================
Files 2436 2553 +117
Lines 133233 140467 +7234
Branches 20636 21850 +1214
============================================
+ Hits 82274 87036 +4762
- Misses 44911 46808 +1897
- Partials 6048 6623 +575
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
| SELECT COUNT(*) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ | ||
| SELECT ft1.stringDimSV1, COUNT(ft1.stringDimSV1) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ GROUP BY ft1.stringDimSV1 | ||
| SELECT ft1.stringDimSV2, SUM(ft1.floatMetric1) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ GROUP BY ft1.stringDimSV2 | ||
| SELECT ft1.stringDimSV2, SUM(ft1.doubleMetric1) FROM FeatureTest1 ft1 INNER JOIN FeatureTest2 ft2 ON ft1.stringDimSV1 = ft2.stringDimSV1 WHERE ft1.generationNumber = __GENERATION_NUMBER__ AND ft2.generationNumber = __GENERATION_NUMBER__ GROUP BY ft1.stringDimSV2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the multi-stage query engine, aggregation results for SUM are the same type as the aggregated column type - so even though the SUM aggregation function currently uses double as the aggregation result holder, the end result will be cast to FLOAT (leading to loss in precision, and slightly inaccurate results). It's better to aggregate the DOUBLE column instead, so that we get predictable and accurate results.
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
pinot-query-planner/src/test/java/org/apache/pinot/query/type/TypeFactoryTest.java
Show resolved
Hide resolved
740c788 to
6ac1892
Compare
6ac1892 to
fa215d8
Compare
FLOATdata type to Calcite SQL'sDOUBLEtype.FLOATdata type uses Java'sfloatprimitive type internally which is a 32 bit floating point type and should map to Calcite'sREALtype instead (see https://calcite.apache.org/docs/reference.html#data-types).SELECT aFloatColumn FROM tablereturns different values in the v1 and v2 query engines (can be verified via any of the quickstarts). The value returned in the multi-stage query engine is incorrect and won't match the value that was originally ingested.SELECT COUNT(*) FROM table WHERE aFloatColumn = 0.05. However, such queries should be written likeSELECT COUNT(*) FROM table WHERE aFloatColumn = CAST(0.05 AS FLOAT)instead of incorrectly mapping Pinot'sFLOATdata type. This is consistent with other standard databases like Postgres and MySQL where queries comparingFLOAT/REALcolumns to literals need explicit casts on the literal value to return the expected result.SELECT COUNT(*) FROM table WHERE aFloatColumn = 0.05, the literal0.05is inferred as aDECIMALtype with fixed precision and scale. During query compilation, Calcite tries to unify the operand types in an expression using a principle of finding a consistent type across the operands here. For aDECIMAL/REALcombination, this comes out to beDOUBLEhere and thus there is an additional cast toDOUBLEadded to theFLOATcolumn in the query plan. This can be avoided by explicitly casting the literal to aFLOATtype instead.DOUBLEcolumn comparisons with literals since the consistent type inferred in this situation isDOUBLEitself and the column won't be auto-casted. Again, this behavior also matches standard databases like Postgres and MySQL whereDOUBLEcolumns don't need explicit casts on comparison literals to get the expected results.