fix: Fall back to Spark for trunc / date_trunc functions when format string is unsupported, or is not a literal value (#2634)
Conversation
Title changed: "[WIP] trunc / date_trunc functions" → "trunc / date_trunc functions"
Codecov Report
@@ Coverage Diff @@
## main #2634 +/- ##
============================================
+ Coverage 56.12% 59.29% +3.16%
- Complexity 976 1449 +473
============================================
Files 119 147 +28
Lines 11743 13789 +2046
Branches 2251 2369 +118
============================================
+ Hits 6591 8176 +1585
- Misses 4012 4388 +376
- Partials 1140 1225 +85
Title changed: "trunc / date_trunc functions" → "trunc / date_trunc functions when format string is not a literal value"
Title changed: "trunc / date_trunc functions when format string is not a literal value" → "trunc / date_trunc functions when format string is unsupported, or is not a literal value"
test("trunc with format array") {
we no longer support format being an array
I feel it is better to keep this functionality rather than lose it for a case that no user has complained about yet.
OK, I made it Incompatible rather than Unsupported and reinstated the tests.
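For context, the Incompatible-vs-Unsupported distinction being discussed can be sketched like this (the type and method names here are hypothetical, modeled on the conversation rather than on Comet's actual API):

```scala
// Hypothetical sketch of the support levels discussed above.
sealed trait SupportLevel
case object Compatible extends SupportLevel   // native execution matches Spark
case object Incompatible extends SupportLevel // native impl exists but differs from Spark; user may opt in
case object Unsupported extends SupportLevel  // no native impl; always fall back to Spark

object Fallback {
  // Decide whether to fall back to Spark for a given expression's support level.
  def shouldFallBack(level: SupportLevel, allowIncompatible: Boolean): Boolean =
    level match {
      case Compatible   => false
      case Incompatible => !allowIncompatible
      case Unsupported  => true
    }
}
```

Marking the array-format case Incompatible (rather than Unsupported) keeps the native path reachable for users who explicitly opt in, while the default remains a safe fallback to Spark.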
"minute",
"hour",
"week",
"quarter")
week and quarter are listed twice
object CometTruncDate extends CometExpressionSerde[TruncDate] {

  val supportedFormats: Seq[String] =
    Seq("year", "yyyy", "yy", "quarter", "mon", "month", "mm", "week")
The list of possible formats (from the Spark documentation) is:
fmt - the format representing the unit to be truncated to
"YEAR", "YYYY", "YY" - truncate to the first date of the year that the ts falls in, the time part will be zero out
"QUARTER" - truncate to the first date of the quarter that the ts falls in, the time part will be zero out
"MONTH", "MM", "MON" - truncate to the first date of the month that the ts falls in, the time part will be zero out
"WEEK" - truncate to the Monday of the week that the ts falls in, the time part will be zero out
"DAY", "DD" - zero out the time part
"HOUR" - zero out the minute and second with fraction part
"MINUTE" - zero out the second with fraction part
"SECOND" - zero out the second fraction part
"MILLISECOND" - zero out the microseconds
"MICROSECOND" - everything remains
Am I missing a reason why Comet doesn't support smaller than week formats?
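As a plain-Scala illustration (not the actual serde code), the fallback decision for a literal format string reduces to a case-insensitive membership test against the supported list shown in the diff above:

```scala
// Hypothetical sketch: checks whether a literal trunc format string is one
// the native TruncDate implementation handles; anything else would trigger
// a fallback to Spark. Mirrors the supportedFormats list quoted above.
object TruncFormatSupport {
  val supportedFormats: Seq[String] =
    Seq("year", "yyyy", "yy", "quarter", "mon", "month", "mm", "week")

  // Spark treats format strings case-insensitively, so normalize first.
  def isSupported(fmt: String): Boolean =
    supportedFormats.contains(fmt.toLowerCase)
}
```

Note that the sub-week formats ("day", "hour", "minute", etc.) are deliberately absent here, which is what prompted the question above.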
This is very confusing in Spark, but date_trunc is actually TruncTimestamp and not TruncDate.
comphead left a comment:
Thanks @andygrove lgtm
Btw, when reviewing the code I found a custom DateTrunc impl in Comet which is probably not in use anymore: spark-expr/src/datetime_funcs/date_trunc.rs
If it is unused, I'll create a cleanup PR.
I assumed that those were the implementations that we were using, but I could be mistaken.
…mat string is unsupported, or is not a literal value (apache#2634)
Which issue does this PR close?
Closes #2621
There is a follow-up issue #2649 for another bug discovered while working on this PR.
Rationale for this change
Both the `trunc` and `date_trunc` functions accept a format string parameter, such as `year` or `month`. My understanding is that most users would specify a literal string for this parameter. However, it is possible to specify a column, or any other valid Spark expression. Unfortunately, Comet is not compatible with Spark for this use case if the expression evaluates to a format string that is not supported. The query will fail with an error such as:

The correct behavior would be to return `NULL` instead. The same is true for literal format strings:
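Spark's documented semantics for an invalid format can be modeled in plain Scala as follows (a sketch only, not Comet code): the result is NULL rather than an error.

```scala
import java.time.LocalDate

// Plain-Scala sketch of Spark's trunc(date, fmt) semantics for a couple of
// formats: an invalid format yields NULL (modeled here as None) instead of
// raising an error, which is the behavior this PR falls back to Spark for.
object TruncSemantics {
  def truncDate(d: LocalDate, fmt: String): Option[LocalDate] =
    fmt.toLowerCase match {
      case "year" | "yyyy" | "yy" => Some(d.withDayOfYear(1))  // first day of year
      case "month" | "mon" | "mm" => Some(d.withDayOfMonth(1)) // first day of month
      case _                      => None                      // invalid format => NULL
    }
}
```

For example, `truncDate(2024-05-17, "month")` yields 2024-05-01, while `truncDate(2024-05-17, "bogus")` yields None, mirroring Spark returning NULL.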
What changes are included in this PR?
How are these changes tested?