@UBarney something I think could definitely be improved is the handling of the source expressions along the transformation and failure paths (https://github.com/drin/datafusion/blob/8cba13ceafcf0df047e753f20bf54ad85a02f019/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L690-L720). I try to avoid moving until I know what to return (transformed expression or source expression), but I don't know Rust/DataFusion well enough to know the best practices for when to clone, when to move, and how to avoid either until necessary. |
8cba13c to
e4b2cf5
Compare
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
|
I will try to push this forward this week |
|
In theory we should be able to use the API added in |
This implementation leverages `general_date_trunc` to truncate the preimage input value and then uses the input granularity to increment the truncated datetime by 1 unit.
This adds sql logic test for date_trunc preimage for test coverage
e4b2cf5 to
95cb436
Compare
|
I tried to reuse existing date_trunc functions where possible and match the structure in date_part::preimage. The calendar duration and sql logic tests were added with the help of an LLM. I reviewed the calendar duration, so that should be cohesive with my overall design, but I have no context for the sql logic tests and would appreciate some extra review (and advice) in that area. Thanks! |
preimage for date_trunc
This is to fix `DateTime<Tz>` being considered to include HTML rather than a templated type. Also improved the phrasing of the comment
|
Hey @drin, I'll take a look tomorrow. |
| } | ||
| }; | ||
|
|
||
| fn trunc_interval_for_ts<TsType: ArrowTimestampType>( |
There was a problem hiding this comment.
Could this move out of the preimage impl?
There was a problem hiding this comment.
it could, I was just following the style for the actual invocation.
| # Test YEAR granularity - basic comparisons | ||
|
|
||
| query P | ||
| SELECT ts FROM t1 WHERE date_trunc('year', ts) = timestamp '2024-01-01T00:00:00' ORDER BY ts; |
There was a problem hiding this comment.
Does your comment about this still hold?
SELECT PULocationID
,pickup_datetime
FROM taxi_view_2025
WHERE date_trunc('month', pickup_datetime) = '2025-12-03'
Without preimage it would always return False, but with preimage we create an interval [2025-12-01, 2026-01-01) and the simplification rule returns col >= 2025-12-01 and col < 2026-01-01, so we could get false positives, because 2025-12-03 falls into that interval.
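To make the false-positive concern concrete, here is a minimal sketch using a toy model. Dates are plain `(year, month, day)` tuples, and `trunc_month`, `next_month`, `month_preimage`, `original_eq`, and `naive_rewrite_eq` are hypothetical helpers written for illustration; none of them are DataFusion APIs.

```rust
// Toy model only: a date is a (year, month, day) tuple, compared lexicographically.
type Date = (i32, u32, u32);

/// Stand-in for date_trunc('month', d): the first day of d's month.
fn trunc_month(d: Date) -> Date {
    (d.0, d.1, 1)
}

/// First day of the month after d's month.
fn next_month(d: Date) -> Date {
    if d.1 == 12 { (d.0 + 1, 1, 1) } else { (d.0, d.1 + 1, 1) }
}

/// Half-open interval [lower, upper) of the month that contains `rhs`.
fn month_preimage(rhs: Date) -> (Date, Date) {
    let lower = trunc_month(rhs);
    (lower, next_month(lower))
}

/// Original predicate: date_trunc('month', col) = rhs.
fn original_eq(col: Date, rhs: Date) -> bool {
    trunc_month(col) == rhs
}

/// Naive rewrite: col >= lower AND col < upper, with no check that rhs is aligned.
fn naive_rewrite_eq(col: Date, rhs: Date) -> bool {
    let (lower, upper) = month_preimage(rhs);
    col >= lower && col < upper
}
```

In this model, a column value of `(2025, 12, 15)` compared against `(2025, 12, 3)` fails `original_eq` but passes `naive_rewrite_eq`, which is the false positive described above. With an aligned right-hand side such as `(2025, 12, 1)`, both predicates agree.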
There was a problem hiding this comment.
We'd need to change the behavior to cover this.
- One way would be to have a guard that checks the `eq Operator` specifically for `date_trunc preimage` and returns `PreimageResult::None` if `rhs != date_trunc(granularity, rhs)`. But `preimage` is `Operator` agnostic.
- Another way is to have an optimization rule that checks `rhs != date_trunc(granularity, rhs)` and returns `False` for the whole column. But that's adding a rule just for one udf.
- Another way is to only let `date_trunc preimage` work with `rhs = date_trunc(granularity, rhs)`, but this requires the user to write the date in the right way if they want the query to run faster.
For example:
++WHERE date_trunc('month', pickup_datetime) = '2025-12-01'
--WHERE date_trunc('month', pickup_datetime) = '2025-12-03'
There was a problem hiding this comment.
It was actually covered for floor preimage impl by @devanshu0987 in #20059
Check here:
https://github.com/apache/datafusion/pull/20059/changes#diff-077176fcf22cb36a0a51631a43739f5f015f46305be4f49142a450e25b152b84R280-R303
Floor is very similar to date_trunc, so we could replicate the behavior.
There was a problem hiding this comment.
I don't understand why None is returned in the case that a clear value is known. For = '2025-12-03', the value should be False. I assumed that None basically means that the preimage could not be determined because something was invalid (an error). If you use None for valid cases, how do you distinguish invalid cases?
There was a problem hiding this comment.
I guess I have a clarifying question:
What should the interval be in non-obvious cases? What happens if the Interval is None (it seems rewrite_with_preimage is only called on an actual Interval)?
There was a problem hiding this comment.
Also, looking at it a bit more, I think rewrite_with_preimage is wrong in some cases.
Here are the notes I have in our implementation:
// Special condition for these operators:
// if date_trunc(part, const_rhs) == const_rhs,
// then we use `const_rhs` instead of `next_interval(part, const_rhs)`,
// lhs(<) --> column < next_interval(part, const_rhs)
// lhs(>=) --> column >= next_interval(part, const_rhs)
In summary, for truncation, rewrite_with_preimage should compare with the upper bound instead of the lower bound.
date_trunc('month', pickup_datetime) < '2025-12-03' would become
pickup_datetime < '2026-01-01': the predicate should be true even if
pickup_datetime == '2025-12-31', because date_trunc('month', '2025-12-31') == '2025-12-01'
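The bound selection from those notes can be sketched as follows. This is a toy model under stated assumptions: dates are `(year, month, day)` tuples, the granularity is fixed to month, and `bound_for_lt_or_gteq`, `original_lt`, and `rewritten_lt` are hypothetical names rather than anything in DataFusion.

```rust
// Toy model only: a date is a (year, month, day) tuple, compared lexicographically.
type Date = (i32, u32, u32);

/// Stand-in for date_trunc('month', d).
fn trunc_month(d: Date) -> Date {
    (d.0, d.1, 1)
}

/// Stand-in for next_interval('month', d): first day of the following month.
fn next_month(d: Date) -> Date {
    if d.1 == 12 { (d.0 + 1, 1, 1) } else { (d.0, d.1 + 1, 1) }
}

/// Bound `b` such that, in this model,
///   date_trunc('month', col) <  rhs  is equivalent to  col <  b
///   date_trunc('month', col) >= rhs  is equivalent to  col >= b
/// If rhs is already truncated, the bound is rhs itself; otherwise it is
/// the start of the next month.
fn bound_for_lt_or_gteq(rhs: Date) -> Date {
    let lower = trunc_month(rhs);
    if lower == rhs { rhs } else { next_month(lower) }
}

/// Original predicate: date_trunc('month', col) < rhs.
fn original_lt(col: Date, rhs: Date) -> bool {
    trunc_month(col) < rhs
}

/// Rewritten predicate on the bare column.
fn rewritten_lt(col: Date, rhs: Date) -> bool {
    col < bound_for_lt_or_gteq(rhs)
}
```

For `rhs = (2025, 12, 3)` the bound is `(2026, 1, 1)`, so a column value of `(2025, 12, 31)` satisfies both the original and the rewritten predicate, matching the example in the comment.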
There was a problem hiding this comment.
For = '2025-12-03', the value should be False.
But if we use preimage it will simplify the expression to col >= 2025-12-01 and col < 2026-01-01 with month granularity.
This way column values can match the expression and return True even if they are not exactly 2025-12-03. This gives us false positives on = comparisons when the rhs is not itself truncated to the same granularity (i.e., round-trip doesn't hold - date_trunc(granularity, rhs) != rhs).
I propose following the Floor example and only using preimage for date_trunc when date_trunc(granularity, rhs) = rhs.
There was a problem hiding this comment.
If there is no valid interval, there is no preimage
There was a problem hiding this comment.
(but, from those notes I also realized I was handling the interval wrong in some cases)
There was a problem hiding this comment.
But these are the notes relevant to this specific case (predicate operator is =):
// Special condition:
// if date_trunc(part, const_rhs) != const_rhs, this is always false
// For this operator, truncation means that we check if the column is INSIDE of a range.
// lhs(=) --> column >= date_trunc(part, const_rhs) AND column < next_interval(part, const_rhs)
date_trunc('month', pickup_datetime) = '2025-12-03' is always false.
date_trunc('month', pickup_datetime) < '2025-12-03' requires an interval to be returned.
Without knowing if the predicate operator is = or <, preimage cannot know whether to return an Interval or None (if None is even the correct return in that case). So preimage must return the interval. But, in rewrite_with_preimage, you can do the appropriate check:
Operator::Eq && lower != <original> => False,
Operator::Eq => and(<check if within interval>),
I'm not sure if this makes sense for intervals from non-truncating functions. I'd have to simmer on that...
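A rough sketch of the `=` handling outlined above, again in a toy model. Dates are `(year, month, day)` tuples with month granularity only, and the `Rewritten` enum and `rewrite_eq` function are hypothetical; the real `rewrite_with_preimage` works on DataFusion expressions and intervals and may look quite different.

```rust
// Toy model only: a date is a (year, month, day) tuple, compared lexicographically.
type Date = (i32, u32, u32);

/// Stand-in for date_trunc('month', d).
fn trunc_month(d: Date) -> Date {
    (d.0, d.1, 1)
}

/// Stand-in for next_interval('month', d).
fn next_month(d: Date) -> Date {
    if d.1 == 12 { (d.0 + 1, 1, 1) } else { (d.0, d.1 + 1, 1) }
}

/// Hypothetical result of rewriting `date_trunc('month', col) = rhs`.
#[derive(Debug, PartialEq)]
enum Rewritten {
    /// rhs is not a truncated value, so no column value can match.
    AlwaysFalse,
    /// col >= lower AND col < upper.
    InRange { lower: Date, upper: Date },
}

/// Eq-specific rewrite: check alignment of rhs before using the interval.
fn rewrite_eq(rhs: Date) -> Rewritten {
    let lower = trunc_month(rhs);
    if lower != rhs {
        Rewritten::AlwaysFalse
    } else {
        Rewritten::InRange { lower, upper: next_month(lower) }
    }
}
```

Here `rewrite_eq((2025, 12, 3))` yields `AlwaysFalse`, while `rewrite_eq((2025, 12, 1))` yields the range from `(2025, 12, 1)` up to, but not including, `(2026, 1, 1)`. The alignment check lives in the operator-specific rewrite, so the interval computation itself can stay operator agnostic.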
|
|
||
| /// Returns true if this granularity is valid for Time types | ||
| /// Time types don't have date components, so day/week/month/quarter/year are not valid | ||
| fn valid_for_time(&self) -> bool { |
There was a problem hiding this comment.
We should add a guard for Time types granularities.
We could reuse this function.
| .as_literal() | ||
| .and_then(|sv| sv.try_as_str().flatten()) | ||
| .map(part_normalization); | ||
|
|
There was a problem hiding this comment.
We should add a guard for Type families. col_expr: TimeStamp needs a lit_expr: TimeStamp and the same for Time types.
There was a problem hiding this comment.
as in if col_expr is a TimeStamp type, then lit_expr must also be a TimeStamp type? Why is that the case?
If I have a nanosecond timestamp (time since epoch) and the comparison is a Time type, if I convert both to nanosecond timestamps aren't they still comparable?
Actually, shouldn't this type of validation be upstream of preimage in whatever function decomposes the predicate?
| general_date_trunc(TsType::UNIT, *ts_val, parsed_tz, ts_granularity)?; | ||
| let upper_val = if is_calendar_granularity { | ||
| increment_timestamp_nanos_calendar(lower_val, parsed_tz, ts_granularity)? | ||
| } else { | ||
| increment_time_nanos(lower_val, ts_granularity) | ||
| }; |
There was a problem hiding this comment.
general_date_trunc converts the values back to the original TimeStamp type, so we shouldn't increment in nanos; we should use the original TimeUnit.
| DateTruncGranularity::Minute => value + SECS_PER_MINUTE, | ||
| DateTruncGranularity::Second => value + 1, | ||
| // Other granularities are not valid for time - should be caught earlier | ||
| _ => value, |
There was a problem hiding this comment.
I can update the comment, probably just copy pasta.
This is the same behavior as the other increment functions (which only increment the correct time type at appropriate granularities)
| DateTruncGranularity::Hour => value + MILLIS_PER_HOUR, | ||
| DateTruncGranularity::Minute => value + MILLIS_PER_MINUTE, | ||
| DateTruncGranularity::Second => value + MILLIS_PER_SECOND, | ||
| DateTruncGranularity::Millisecond => value + 1, |
There was a problem hiding this comment.
Keep match arms for granularities finer than rhs?
| DateTruncGranularity::Millisecond => value + 1, | |
| DateTruncGranularity::Millisecond => value + 1, | |
| DateTruncGranularity::Microsecond => value + 1, |
There was a problem hiding this comment.
this is for incrementing milliseconds. If you increment by 1 when the granularity is microseconds then you've incremented by too much. If you have a timestamp in milliseconds and you're truncating microseconds, you should have no change because your timestamp is too coarse.
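As a minimal sketch of that reasoning, assume a value stored in milliseconds and a local, hypothetical `Granularity` enum and `increment_millis` function (the PR's own types and helpers are not reproduced here). Granularities at or coarser than the storage unit add one bucket width; a finer granularity leaves the value unchanged.

```rust
const MILLIS_PER_SECOND: i64 = 1_000;
const MILLIS_PER_MINUTE: i64 = 60 * MILLIS_PER_SECOND;
const MILLIS_PER_HOUR: i64 = 60 * MILLIS_PER_MINUTE;

/// Hypothetical, reduced set of granularities for illustration.
#[derive(Clone, Copy, Debug)]
enum Granularity {
    Hour,
    Minute,
    Second,
    Millisecond,
    Microsecond,
}

/// Advance a millisecond-unit value by one unit of `granularity`.
/// A granularity finer than the storage unit (microseconds here) cannot be
/// represented, so the value is returned unchanged.
fn increment_millis(value: i64, granularity: Granularity) -> i64 {
    match granularity {
        Granularity::Hour => value + MILLIS_PER_HOUR,
        Granularity::Minute => value + MILLIS_PER_MINUTE,
        Granularity::Second => value + MILLIS_PER_SECOND,
        Granularity::Millisecond => value + 1,
        // Too fine for a millisecond value: adding 1 would overshoot by 1 ms.
        Granularity::Microsecond => value,
    }
}
```

Adding a `Microsecond => value + 1` arm would instead advance the value by a full millisecond, which is 1000 times the requested granularity.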
Originally, this attempted to implement a custom optimizer rule in the datafusion expression simplifier. Now, this has been updated to work within the new preimage framework rather than being implemented directly in the expression simplifier.
Which issue does this PR close?
Closes #18319.
Rationale for this change
To transform binary expressions that compare `date_trunc` with a constant value into a form that can be better utilized (improved performance).
For Bauplan, we can see the following (approximate average over a handful of runs):
Q1:
Q2:
What changes are included in this PR?
A few additional support functions and additional match arms in the simplifier match expression.
Are these changes tested?
Our custom rule has tests of the expression transformations and for correct evaluation results. These will be added to the PR after the implementation is in approximately good shape.
Are there any user-facing changes?
Better performance and an occasionally confusing explain plan. In short, a `date_trunc('month', col) = '2025-12-03'::DATE` will always be false (because the truncation result can never be a non-truncated value), which may produce an unexpected expression (false).
Explain plan details below (may be overkill but it was fun to figure out):
Initial query:
After simplify_expressions:
Before and after `date_trunc_optimizer` (our custom rule):