implement `preimage` for date_trunc by drin · Pull Request #18648 · apache/datafusion

drin · 2025-11-12T14:18:52Z

Originally, this attempted to implement a custom optimizer rule in the datafusion expression simplifier. Now, this has been updated to work within the new preimage framework rather than being implemented directly in the expression simplifier.

Which issue does this PR close?

Closes #18319.

Rationale for this change

Binary expressions that compare date_trunc with a constant value are transformed into an equivalent form that can be evaluated more efficiently.

For Bauplan, we can see the following (approximate average over a handful of runs):

Q1:

SELECT PULocationID, trip_miles, tips
  FROM taxi_fhvhv
 WHERE date_trunc('month', pickup_datetime) <= '2025-01-08'::DATE

Q2:

SELECT PULocationID, trip_miles, tips
  FROM taxi_fhvhv
 WHERE pickup_datetime < date_trunc('month', '2025-02-08'::DATE)

Query	Time (s)	Options
Q1	~3	no cache, optimization enabled
Q1	~35	no cache, optimization disabled
Q2	~3	no cache, optimization enabled
Q2	~3	no cache, optimization disabled
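To illustrate why Q1 can be answered as cheaply as Q2, here is a minimal Rust sketch of the month-granularity case. It assumes dates are plain (year, month, day) tuples; `month_bounds`, `q1_per_row` and `q1_as_range` are hypothetical names for illustration, not DataFusion functions.

```rust
/// A calendar date as (year, month, day). Illustrative only: DataFusion
/// operates on Arrow timestamp values, not tuples.
type Date = (i32, u32, u32);

/// Hypothetical helper: the half-open range [lower, upper) of dates that
/// fall in the same month as `d`.
fn month_bounds(d: Date) -> (Date, Date) {
    let (year, month, _day) = d;
    let lower = (year, month, 1);
    // Roll over to January of the following year after December.
    let upper = if month == 12 {
        (year + 1, 1, 1)
    } else {
        (year, month + 1, 1)
    };
    (lower, upper)
}

/// Q1's predicate evaluated the slow way: truncate the column value for
/// every row, then compare with the constant.
fn q1_per_row(pickup: Date, rhs: Date) -> bool {
    month_bounds(pickup).0 <= rhs
}

/// The same predicate as a plain range check on the untruncated column,
/// which is the shape Q2 already has.
fn q1_as_range(pickup: Date, rhs: Date) -> bool {
    pickup < month_bounds(rhs).1
}
```

With rhs = (2025, 1, 8), both functions accept every January 2025 date and reject (2025, 2, 1), which matches Q2's `pickup_datetime < '2025-02-01'` after constant folding.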

What changes are included in this PR?

A few additional support functions and additional match arms in the simplifier match expression.

Are these changes tested?

Our custom rule has tests of the expression transformations and for correct evaluation results. These will be added to the PR after the implementation is in approximately good shape.

Are there any user-facing changes?

Better performance and an occasionally confusing explain plan. In short, date_trunc('month', col) = '2025-12-03'::DATE is always false (because a truncation result can never equal a non-truncated value), so the plan may show an unexpected expression (false).

Explain plan details below (may be overkill but it was fun to figure out):

Initial query:

SELECT  PULocationID
           ,pickup_datetime
      FROM taxi_view_2025
     WHERE date_trunc('month', pickup_datetime) = '2025-12-03'

After simplify_expressions:

logical_plan after simplify_expressions                    | Projection: taxi_view_2025.PULocationID, taxi_view_2025.pickup_datetime                                                                                            |
|                                                            |   Filter: date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)          |
|                                                            |     TableScan: taxi_view_2025

Before and after date_trunc_optimizer (our custom rule):

logical_plan after optimize_projections                    | Filter: date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)            |
|                                                            |   TableScan: taxi_view_2025 projection=[PULocationID, pickup_datetime]                                                                                             |
| logical_plan after date_trunc_optimizer                    | Filter: Boolean(false)                                                                                                                                             |
|                                                            |   TableScan: taxi_view_2025 projection=[PULocationID, pickup_datetime]

drin · 2025-11-12T15:30:55Z

@UBarney one thing I think could definitely be improved is the handling of the source expressions along the transformation and failure paths (https://github.com/drin/datafusion/blob/8cba13ceafcf0df047e753f20bf54ad85a02f019/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L690-L720).

I try to avoid moving until I know what to return (transformed expression or source expression), but I don't know rust/datafusion well enough to know best practices for when to clone and when to move and how to avoid either until necessary.

github-actions · 2026-01-12T02:14:06Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

drin · 2026-01-13T02:03:47Z

I will try to push this forward this week

alamb · 2026-01-27T21:36:01Z

In theory we should be able to use the API added in "Support API for "pre-image" for pruning predicate evaluation" (#19722).

drin · 2026-02-23T20:04:22Z

I tried to reuse existing date_trunc functions where possible and to match the structure in date_part::preimage. The calendar duration code and the sql logic tests were added with the help of an LLM. I reviewed the calendar duration code, so it should be cohesive with my overall design, but I have no context for the sql logic tests and would appreciate some extra review (and advice) in that area.

Thanks!

sdf-jkl · 2026-02-23T21:48:05Z

Hey @drin, I'll take a look tomorrow.

sdf-jkl

@drin thanks for the wait.

I left some suggestions.

sdf-jkl · 2026-02-24T21:27:23Z

datafusion/functions/src/datetime/date_trunc.rs

+            }
+        };
+
+        fn trunc_interval_for_ts<TsType: ArrowTimestampType>(


Could this move out of the preimage impl?

it could, I was just following the style for the actual invocation.

sdf-jkl · 2026-02-24T22:01:01Z

datafusion/sqllogictest/test_files/datetime/timestamps.slt

+# Test YEAR granularity - basic comparisons
+
+query P
+SELECT ts FROM t1 WHERE date_trunc('year', ts) = timestamp '2024-01-01T00:00:00' ORDER BY ts;


Does your comment about this still hold?

SELECT PULocationID ,pickup_datetime FROM taxi_view_2025 WHERE date_trunc('month', pickup_datetime) = '2025-12-03'

Without preimage, this would always return False. With preimage, we create the interval [2025-12-01, 2026-01-01) and the simplification rule returns col >= 2025-12-01 and col < 2026-01-01, so we could get false positives because 2025-12-03 falls into that interval.

We'd need to change the behavior to cover this.

One way would be to add a guard that checks for the eq Operator specifically for the date_trunc preimage and returns PreimageResult::None if rhs != date_trunc(granularity, rhs). But preimage is Operator agnostic.

Another way is to add an optimization rule that checks rhs != date_trunc(granularity, rhs) and returns False for the whole column. But that's adding a rule just for one udf.

Another way is to only let the date_trunc preimage work when rhs = date_trunc(granularity, rhs), but this requires the user to write the date in the right way if they want the query to run faster.
For example:

-WHERE date_trunc('month', pickup_datetime) = '2025-12-03'
+WHERE date_trunc('month', pickup_datetime) = '2025-12-01'

It was actually covered for floor preimage impl by @devanshu0987 in #20059
Check here:
https://github.com/apache/datafusion/pull/20059/changes#diff-077176fcf22cb36a0a51631a43739f5f015f46305be4f49142a450e25b152b84R280-R303
Floor is very similar to date_trunc, so we could replicate the behavior.
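A minimal sketch of the guard described above, using integer buckets as a stand-in for timestamps. Everything here is an assumption for illustration: `trunc` and `guarded_preimage` are hypothetical names, `Option` stands in for PreimageResult, and fixed-width buckets do not model calendar granularities such as month.

```rust
/// Stand-in for date_trunc: floors `value` to a multiple of `width`.
/// Assumption: a fixed-width bucket, unlike calendar months or years.
fn trunc(value: i64, width: i64) -> i64 {
    value.div_euclid(width) * width
}

/// Hypothetical guarded preimage: yields the half-open interval
/// [lower, upper) only when `rhs` is already aligned to the bucket,
/// that is, when trunc(rhs) == rhs.
fn guarded_preimage(rhs: i64, width: i64) -> Option<(i64, i64)> {
    let lower = trunc(rhs, width);
    if lower != rhs {
        // Not aligned: report "no preimage" so the predicate is left as-is.
        return None;
    }
    Some((lower, lower + width))
}
```

With width 10, an aligned rhs of 30 gives the interval [30, 40), while an unaligned rhs of 33 gives no interval, so the original predicate is kept and evaluated row by row.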

I don't understand why None is returned in the case that a clear value is known. For = '2025-12-03', the value should be False. I assumed that None basically means that the preimage could not be determined because something was invalid (an error). If you use None for valid cases, how do you distinguish invalid cases?

I guess I have a clarifying question:

What should the interval be in non-obvious cases? What happens if the Interval is None (it seems rewrite_with_preimage is only called on an actual Interval)?

Also, looking at it a bit more, I think rewrite_with_preimage is wrong in some cases.

Here are the notes I have in our implementation:

// Special condition for these operators:
//   if date_trunc(part, const_rhs) == const_rhs,
//   then we use `const_rhs` instead of `next_interval(part, const_rhs)`,
// lhs(<)  --> column <  next_interval(part, const_rhs)
// lhs(>=) --> column >= next_interval(part, const_rhs)

In summary, for truncation, rewrite_with_preimage should compare against the upper bound of the interval instead of the lower bound.

For example, date_trunc('month', pickup_datetime) < '2025-12-03' would become pickup_datetime < '2026-01-01': the predicate must hold even when pickup_datetime == '2025-12-31', because date_trunc('month', '2025-12-31') == '2025-12-01'.

For = '2025-12-03', the value should be False.

But if we use preimage it will simplify the expression to col >= 2025-12-01 and col < 2026-01-01 with month granularity.

This way column values can match the expression and return True even if they are not exactly 2025-12-03. This gives us false positives on = comparisons when the rhs is not itself truncated to the same granularity (i.e., the round trip doesn't hold: date_trunc(granularity, rhs) != rhs).

I propose following the Floor example and only using preimage for date_trunc when date_trunc(granularity, rhs) = rhs.

If there is no valid interval, there is no preimage

(but, from those notes I also realized I was handling the interval wrong in some cases)

But these are the notes relevant to this specific case (predicate operator is =):

// Special condition:
//   if date_trunc(part, const_rhs) != const_rhs, this is always false
// For this operator, truncation means that we check if the column is INSIDE of a range.
// lhs(=) --> column >= date_trunc(part, const_rhs) AND column < next_interval(part, const_rhs)

date_trunc('month', pickup_datetime) = '2025-12-03' is always false.
date_trunc('month', pickup_datetime) < '2025-12-03' requires an interval to be returned.

Without knowing if the predicate operator is = or <, preimage cannot know whether to return an Interval or None (if None is even the correct return in that case). So preimage must return the interval. But, in rewrite_with_preimage, you can do the appropriate check:

Operator::Eq && lower != <original> => False,
Operator::Eq => and(<check if within interval>),

I'm not sure if this makes sense for intervals from non-truncating functions. I'd have to simmer on that...
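The operator-specific notes above can be sketched and brute-force checked on integers. This is only an illustration under stated assumptions: `trunc` is a fixed-width stand-in for date_trunc, and `Op`, `direct`, `rewritten` and `rewrite_agrees` are hypothetical names, not the DataFusion rewrite_with_preimage API. NULL handling is ignored.

```rust
#[derive(Clone, Copy, Debug)]
enum Op { Eq, NotEq, Lt, LtEq, Gt, GtEq }

/// Stand-in for date_trunc: floors `value` to a multiple of `width`.
fn trunc(value: i64, width: i64) -> i64 {
    value.div_euclid(width) * width
}

/// Reference semantics: truncate the column value, then compare.
fn direct(col: i64, op: Op, rhs: i64, width: i64) -> bool {
    let t = trunc(col, width);
    match op {
        Op::Eq => t == rhs,
        Op::NotEq => t != rhs,
        Op::Lt => t < rhs,
        Op::LtEq => t <= rhs,
        Op::Gt => t > rhs,
        Op::GtEq => t >= rhs,
    }
}

/// Rewritten predicate that only looks at the raw column value.
fn rewritten(col: i64, op: Op, rhs: i64, width: i64) -> bool {
    let lower = trunc(rhs, width);
    let upper = lower + width; // plays the role of next_interval(part, rhs)
    let aligned = lower == rhs;
    match op {
        // An unaligned constant can never equal a truncated value.
        Op::Eq => aligned && lower <= col && col < upper,
        Op::NotEq => !aligned || col < lower || col >= upper,
        // For < and >=, an aligned constant is used as-is; otherwise the
        // comparison moves to the upper bound.
        Op::Lt => if aligned { col < rhs } else { col < upper },
        Op::GtEq => if aligned { col >= rhs } else { col >= upper },
        // <= and > always compare against the upper bound.
        Op::LtEq => col < upper,
        Op::Gt => col >= upper,
    }
}

/// Exhaustively compares both forms over a small range of values.
fn rewrite_agrees(width: i64, range: std::ops::Range<i64>) -> bool {
    let ops = [Op::Eq, Op::NotEq, Op::Lt, Op::LtEq, Op::Gt, Op::GtEq];
    for op in ops {
        for rhs in range.clone() {
            for col in range.clone() {
                if direct(col, op, rhs, width) != rewritten(col, op, rhs, width) {
                    return false;
                }
            }
        }
    }
    true
}
```

For instance, with width 10, `rewritten(29, Op::Lt, 23, 10)` is true (29 truncates to 20, which is below 23), while `rewritten(25, Op::Eq, 23, 10)` is false because 23 is not aligned.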

sdf-jkl · 2026-02-24T23:23:00Z

datafusion/functions/src/datetime/date_trunc.rs


    /// Returns true if this granularity is valid for Time types
    /// Time types don't have date components, so day/week/month/quarter/year are not valid
    fn valid_for_time(&self) -> bool {


We should add a guard for Time types granularities.
We could reuse this function.
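A rough sketch of what such a guard could look like. The `Granularity` enum below is a local stand-in for DateTruncGranularity (its variant list is an assumption), and `preimage_allowed` is a hypothetical helper, not part of the PR.

```rust
/// Local stand-in for DateTruncGranularity; the variant list is assumed.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Granularity {
    Microsecond, Millisecond, Second, Minute, Hour,
    Day, Week, Month, Quarter, Year,
}

impl Granularity {
    /// Time types have no date components, so day and coarser
    /// granularities are not valid for them.
    fn valid_for_time(self) -> bool {
        !matches!(
            self,
            Granularity::Day
                | Granularity::Week
                | Granularity::Month
                | Granularity::Quarter
                | Granularity::Year
        )
    }
}

/// Hypothetical guard: skip the preimage rewrite when a Time-typed
/// column is combined with a date-level granularity.
fn preimage_allowed(granularity: Granularity, is_time_type: bool) -> bool {
    !is_time_type || granularity.valid_for_time()
}
```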

sdf-jkl · 2026-02-24T23:25:14Z

datafusion/functions/src/datetime/date_trunc.rs

+            .as_literal()
+            .and_then(|sv| sv.try_as_str().flatten())
+            .map(part_normalization);
+


We should add a guard for Type families. col_expr: TimeStamp needs a lit_expr: TimeStamp and the same for Time types.

as in if col_expr is a TimeStamp type, then lit_expr must also be a TimeStamp type? Why is that the case?

If I have a nanosecond timestamp (time since epoch) and the comparison is a Time type, if I convert both to nanosecond timestamps aren't they still comparable?

Actually, shouldn't this type of validation be upstream of preimage in whatever function decomposes the predicate?

sdf-jkl · 2026-02-24T23:40:14Z

datafusion/functions/src/datetime/date_trunc.rs

+                general_date_trunc(TsType::UNIT, *ts_val, parsed_tz, ts_granularity)?;
+            let upper_val = if is_calendar_granularity {
+                increment_timestamp_nanos_calendar(lower_val, parsed_tz, ts_granularity)?
+            } else {
+                increment_time_nanos(lower_val, ts_granularity)
+            };


general_date_trunc converts the values back to the original TimeStamp type, so we shouldn't increment in nanos; we should use the original TimeUnit.
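One way to picture incrementing in the original unit, as a sketch only: the `TimeUnit` enum below is a local stand-in that mirrors the four variants of Arrow's `TimeUnit`, and `add_seconds_in_unit` is a hypothetical helper rather than anything in the PR.

```rust
/// Local stand-in mirroring the variants of Arrow's `TimeUnit`.
#[derive(Clone, Copy, Debug)]
enum TimeUnit { Second, Millisecond, Microsecond, Nanosecond }

impl TimeUnit {
    /// Number of ticks of this unit in one second.
    fn ticks_per_second(self) -> i64 {
        match self {
            TimeUnit::Second => 1,
            TimeUnit::Millisecond => 1_000,
            TimeUnit::Microsecond => 1_000_000,
            TimeUnit::Nanosecond => 1_000_000_000,
        }
    }
}

/// Hypothetical helper: advances `value`, which is expressed in `unit`,
/// by `seconds`. Returns None on overflow instead of wrapping.
fn add_seconds_in_unit(value: i64, seconds: i64, unit: TimeUnit) -> Option<i64> {
    seconds
        .checked_mul(unit.ticks_per_second())
        .and_then(|delta| value.checked_add(delta))
}
```

For a millisecond timestamp, one minute is 60_000 ticks, so `add_seconds_in_unit(1_000, 60, TimeUnit::Millisecond)` yields `Some(61_000)`; adding a nanosecond-based delta to the same value would overshoot by a factor of a million.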

sdf-jkl · 2026-02-24T23:57:09Z

datafusion/functions/src/datetime/date_trunc.rs

+        DateTruncGranularity::Minute => value + SECS_PER_MINUTE,
+        DateTruncGranularity::Second => value + 1,
+        // Other granularities are not valid for time - should be caught earlier
+        _ => value,


return Err or None?

I can update the comment, probably just copy pasta.

This is the same behavior as the other increment functions (which only increment the correct time type at appropriate granularities).

sdf-jkl · 2026-02-25T14:05:22Z

datafusion/functions/src/datetime/date_trunc.rs

+        DateTruncGranularity::Hour => value + MILLIS_PER_HOUR,
+        DateTruncGranularity::Minute => value + MILLIS_PER_MINUTE,
+        DateTruncGranularity::Second => value + MILLIS_PER_SECOND,
+        DateTruncGranularity::Millisecond => value + 1,


Keep match arms for granularities finer than rhs?

Suggested change

-        DateTruncGranularity::Millisecond => value + 1,
+        DateTruncGranularity::Millisecond => value + 1,
+        DateTruncGranularity::Microsecond => value + 1,

this is for incrementing milliseconds. If you increment by 1 when the granularity is microseconds then you've incremented by too much. If you have a timestamp in milliseconds and you're truncating microseconds, you should have no change because your timestamp is too coarse.

github-actions bot added the optimizer Optimizer rules label Nov 12, 2025

drin marked this pull request as draft November 12, 2025 15:18

drin force-pushed the octalene.feat-optimize-datetrunc branch from 8cba13c to e4b2cf5 Compare November 12, 2025 15:31

github-actions bot added the Stale PR has not had any activity for some time label Jan 12, 2026

github-actions bot removed the Stale PR has not had any activity for some time label Jan 13, 2026

drin mentioned this pull request Feb 2, 2026

Optimize the evaluation of DATE_TRUNC(<col>) == <constant>) when pushed down #18319

Open

drin added 2 commits February 23, 2026 11:59

feat: Implemented preimage for date_trunc

0f7bfa9

This implementation leverages `general_date_trunc` to truncate the preimage input value and then uses the input granularity to increment the truncated datetime by 1 unit.

test: added sql logic test for date_trunc preimage

95cb436

This adds sql logic test for date_trunc preimage for test coverage

drin force-pushed the octalene.feat-optimize-datetrunc branch from e4b2cf5 to 95cb436 Compare February 23, 2026 20:00

github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation and removed optimizer Optimizer rules labels Feb 23, 2026

drin marked this pull request as ready for review February 23, 2026 20:01

drin changed the title ~~decomposed date_trunc optimization into expr simplifier~~ implement preimage for date_trunc Feb 23, 2026

fix: changed function comments

0b1bbf1

This is to fix `DateTime<Tz>` being considered to include HTML rather than a templated type. Also improved the phrasing of the comment

sdf-jkl reviewed Feb 25, 2026

drin mentioned this pull request Feb 26, 2026

Implement preimage for floor function to enable predicate pushdown #20059

Merged
