Skip to content

implement preimage for date_trunc#18648

Open
drin wants to merge 3 commits intoapache:mainfrom
drin:octalene.feat-optimize-datetrunc
Open

implement preimage for date_trunc#18648
drin wants to merge 3 commits intoapache:mainfrom
drin:octalene.feat-optimize-datetrunc

Conversation

@drin
Copy link
Contributor

@drin drin commented Nov 12, 2025

Originally, this attempted to implement a custom optimizer rule in the datafusion expression simplifier. Now, this has been updated to work within the new preimage framework rather than being implemented directly in the expression simplifier.

Which issue does this PR close?

Closes #18319.

Rationale for this change

To transform binary expressions that compare date_trunc with a constant value into a form that can be better utilized (improved performance).

For Bauplan, we can see the following (approximate average over a handful of runs):

Q1:

SELECT PULocationID, trip_miles, tips
  FROM taxi_fhvhv
 WHERE date_trunc('month', pickup_datetime) <= '2025-01-08'::DATE

Q2:

SELECT PULocationID, trip_miles, tips
  FROM taxi_fhvhv
 WHERE pickup_datetime < date_trunc('month', '2025-02-08'::DATE)
Query Time (s) Options
Q1 ~3 no cache, optimization enabled
Q1 ~35 no cache, optimization disabled
Q2 ~3 no cache, optimization enabled
Q2 ~3 no cache, optimization disabled

What changes are included in this PR?

A few additional support functions and additional match arms in the simplifier match expression.

Are these changes tested?

Our custom rule has tests of the expression transformations and for correct evaluation results. These will be added to the PR after the implementation is in approximately good shape.

Are there any user-facing changes?

Better performance and occasionally confusing explain plan. In short, a date_trunc('month', col) = '2025-12-03'::DATE will always be false (because the truncation result can never be a non-truncated value), which may produce an unexpected expression (false).

Explain plan details below (may be overkill but it was fun to figure out):

Initial query:

SELECT  PULocationID
           ,pickup_datetime
      FROM taxi_view_2025
     WHERE date_trunc('month', pickup_datetime) = '2025-12-03'

After simplify_expressions:

logical_plan after simplify_expressions                    | Projection: taxi_view_2025.PULocationID, taxi_view_2025.pickup_datetime                                                                                            |
|                                                            |   Filter: date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)          |
|                                                            |     TableScan: taxi_view_2025

Before and after date_trunc_optimizer (our custom rule):

logical_plan after optimize_projections                    | Filter: date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)            |
|                                                            |   TableScan: taxi_view_2025 projection=[PULocationID, pickup_datetime]                                                                                             |
| logical_plan after date_trunc_optimizer                    | Filter: Boolean(false)                                                                                                                                             |
|                                                            |   TableScan: taxi_view_2025 projection=[PULocationID, pickup_datetime]

@github-actions github-actions bot added the optimizer Optimizer rules label Nov 12, 2025
@drin drin marked this pull request as draft November 12, 2025 15:18
@drin
Copy link
Contributor Author

drin commented Nov 12, 2025

@UBarney something I think could use some definite improvement in handling of the source expressions along transformation and failure paths (https://github.com/drin/datafusion/blob/8cba13ceafcf0df047e753f20bf54ad85a02f019/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L690-L720).

I try to avoid moving until I know what to return (transformed expression or source expression), but I don't know rust/datafusion well enough to know best practices for when to clone and when to move and how to avoid either until necessary.

@drin drin force-pushed the octalene.feat-optimize-datetrunc branch from 8cba13c to e4b2cf5 Compare November 12, 2025 15:31
@github-actions
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jan 12, 2026
@drin
Copy link
Contributor Author

drin commented Jan 13, 2026

I will try to push this forward this week

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Jan 13, 2026
@alamb
Copy link
Contributor

alamb commented Jan 27, 2026

In theory we should be able to use the API added in

drin added 2 commits February 23, 2026 11:59
This implementation leverages `general_date_trunc` to truncate the
preimage input value and then uses the input granularity to increment
the truncated datetime by 1 unit.
This adds sql logic test for date_trunc preimage for test coverage
@drin drin force-pushed the octalene.feat-optimize-datetrunc branch from e4b2cf5 to 95cb436 Compare February 23, 2026 20:00
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation and removed optimizer Optimizer rules labels Feb 23, 2026
@drin drin marked this pull request as ready for review February 23, 2026 20:01
@drin
Copy link
Contributor Author

drin commented Feb 23, 2026

I tried to reuse existing date_trunc functions where possible and match the structure in date_part::preimage. The calendar duration and sql logic tests were added with the help of an LLM. I reviewed the calendar duration so that should be cohesive with my overall design, but the sql logic tests I have no context in and I would appreciate some extra review (and advice) in that area.

Thanks!

@drin drin changed the title decomposed date_trunc optimization into expr simplifier implement preimage for date_trunc Feb 23, 2026
This is to fix `DateTime<Tz>` being considered to include HTML rather
than a templated type. Also improved the phrasing of the comment
@sdf-jkl
Copy link
Contributor

sdf-jkl commented Feb 23, 2026

Hey @drin, I'll take a look tomorrow.

Copy link
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drin thanks for the wait.

I left some suggestions.

}
};

fn trunc_interval_for_ts<TsType: ArrowTimestampType>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this move out of the preimage impl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could, I was just following the style for the actual invocation.

# Test YEAR granularity - basic comparisons

query P
SELECT ts FROM t1 WHERE date_trunc('year', ts) = timestamp '2024-01-01T00:00:00' ORDER BY ts;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does your comment about this still hold?

SELECT  PULocationID
           ,pickup_datetime
      FROM taxi_view_2025
     WHERE date_trunc('month', pickup_datetime) = '2025-12-03'

Without preimage it would always return False, but with preimage, we create an interval [2025-12-01, 2026-01-01) and simplification rule returns col >= 2025-12-01 and col < 2026-01-01 we could get false positives, because 2025-12-03 falls into that interval.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd need to change the behavior to cover this.

  • One way would be by having a guard checking the eq Operator specifically for date_trunc preimage and return PreimageResult::None if rhs != date_trunc(granularity, rhs). But preimage is Operator agnostic.
  • Another way is by having an optimization rule to do check this rhs != date_trunc(granularity, rhs) and return False for the whole column. But that's adding a rule just for one udf.
  • Another way is to only let date_trunc preimage work with with rhs = date_trunc(granularity, rhs), but this requires the user to write the date in the right way if they want the query to run faster.
    For example:
++WHERE date_trunc('month', pickup_datetime) = '2025-12-01'
--WHERE date_trunc('month', pickup_datetime) = '2025-12-03'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why None is returned in the case that a clear value is known. For = '2025-12-03', the value should be False. I assumed that None basically means that the preimage could not be determined because something was invalid (an error). If you use None for valid cases, how do you distinguish invalid cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I have a clarifying question:

What should the interval be in non-obvious cases? What happens if the Interval is None (it seems rewrite_with_preimage is only called on an actual Interval)?

Copy link
Contributor Author

@drin drin Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, looking at it a bit more, I think rewrite_with_preimage is wrong in some cases.

Here are the notes I have in our implementation:

        // Special condition for these operators:
        // if date_trunc(part, const_rhs) == const_rhs,
        // then we use `const_rhs` instead of `next_interval(part, const_rhs)`,
        // lhs(<)  --> column <  next_interval(part, const_rhs)
        // lhs(>=) --> column >= next_interval(part, const_rhs)

In summary, rewrite_with_preimage should compare with the upper instead of the lower for truncation.

date_trunc('month', pickup_datetime) < '2025-12-03' would become
pickup_datetime < '2026-01-01' because this should be true even if
pickup_datetime == '2025-12-31' because date_trunc('month', '2025-12-31') == '2025-12-01'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For = '2025-12-03', the value should be False.

But if we use preimage it will simplify the expression to col >= 2025-12-01 and col < 2026-01-01 with month granularity.

This way column values can match the expression and return True even if they are not exactly 2025-12-03. This gives us false positives on = comparisons when the rhs is not itself truncated to the same granularity (i.e., round-trip doesn't hold - date_trunc(granularity, rhs) != rhs).

I propose following the Floor example and only use preimage for date_trunc when date_trunc(granularity, rhs) = rhs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no valid interval, there is no preimage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but, from those notes I also realized I was handling the interval wrong in some cases)

Copy link
Contributor Author

@drin drin Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But these are the notes relevant to this specific case (predicate operator is =):

// Special condition:
// if date_trunc(part, const_rhs) != const_rhs, this is always false
// For this operator, truncation means that we check if the column is INSIDE of a range.
// lhs(=) --> column >= date_trunc(part, const_rhs) AND column < next_interval(part, const_rhs)

date_trunc('month', pickup_datetime) = '2025-12-03' is always false.
date_trunc('month', pickup_datetime) < '2025-12-03' requires an interval to be returned.

Without knowing if the predicate operator is = or <, preimage cannot know whether to return an Interval or None (if None is even the correct return in that case). So preimage must return the interval. But, in rewrite_with_preimage, you can do the appropriate check:

Operator::Eq && lower != <original> => False,
Operator::Eq => and(<check if within interval>),

I'm not sure if this makes sense for intervals from non-truncating functions. I'd have to simmer on that...


/// Returns true if this granularity is valid for Time types
/// Time types don't have date components, so day/week/month/quarter/year are not valid
fn valid_for_time(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a guard for Time types granularities.
We could reuse this function.

.as_literal()
.and_then(|sv| sv.try_as_str().flatten())
.map(part_normalization);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a guard for Type families. col_expr: TimeStamp needs a lit_expr: TimeStamp and the same for Time types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in if col_expr is a TimeStamp type, then lit_expr must also be a TimeStamp type? Why is that the case?

If I have a nanosecond timestamp (time since epoch) and the comparison is a Time type, if I convert both to nanosecond timestamps aren't they still comparable?

Actually, shouldn't this type of validation be upstream of preimage in whatever function decomposes the predicate?

Comment on lines +462 to +467
general_date_trunc(TsType::UNIT, *ts_val, parsed_tz, ts_granularity)?;
let upper_val = if is_calendar_granularity {
increment_timestamp_nanos_calendar(lower_val, parsed_tz, ts_granularity)?
} else {
increment_time_nanos(lower_val, ts_granularity)
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general_date_trunc converts the values back to the original TimeStamp type, we shouldn't increment in nanos, but use the original TimeUnit.

DateTruncGranularity::Minute => value + SECS_PER_MINUTE,
DateTruncGranularity::Second => value + 1,
// Other granularities are not valid for time - should be caught earlier
_ => value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return Err or None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update the comment, probably just copy pasta.

This is same behavior as other increment functions (which only increment the correct time type at appropriate granularities)

DateTruncGranularity::Hour => value + MILLIS_PER_HOUR,
DateTruncGranularity::Minute => value + MILLIS_PER_MINUTE,
DateTruncGranularity::Second => value + MILLIS_PER_SECOND,
DateTruncGranularity::Millisecond => value + 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep match arms for granularities finer than rhs?

Suggested change
DateTruncGranularity::Millisecond => value + 1,
DateTruncGranularity::Millisecond => value + 1,
DateTruncGranularity::Microsecond => value + 1,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for incrementing milliseconds. If you increment by 1 when the granularity is microseconds then you've incremented by too much. If you have a timestamp in milliseconds and you're truncating microseconds, you should have no change because your timestamp is too coarse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize the evaluation of DATE_TRUNC(<col>) == <constant>) when pushed down

3 participants