Closed
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
- Related to: Should PruningPredicate coerce? #14944
The use case is a predicate like this, where month_id is an integer:
month_id = '202502'
In DataFusion this results in casting the month_id column to Utf8 and then comparing it to the string '202502'.
This is non-ideal for at least 3 reasons:
- Converting many rows to strings is expensive
- Comparing strings is much slower than comparing integers
- Many data sources (like Parquet) only handle predicates of the form <col> <op> <const> and not cast(<col>) <op> <const>, so these predicates cannot be pushed down
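To make the cost difference concrete, here is a minimal sketch in plain Rust (not DataFusion code) of the two evaluation strategies. The function names `filter_with_cast` and `filter_without_cast` are hypothetical, and the sketch assumes the string literal is a canonical decimal integer such as '202502'.

```rust
/// Roughly what the current plan does: format every row as a string,
/// then compare strings. This allocates one String per row.
fn filter_with_cast(column: &[i32], literal: &str) -> Vec<bool> {
    column.iter().map(|v| v.to_string() == literal).collect()
}

/// Roughly what the requested plan would do: convert the literal once,
/// then compare integers. Returns None when the literal does not parse
/// as an i32, in which case this sketch performs no rewrite.
/// Assumption: the literal is in canonical form (no leading zeros or '+').
fn filter_without_cast(column: &[i32], literal: &str) -> Option<Vec<bool>> {
    let value: i32 = literal.parse().ok()?;
    Some(column.iter().map(|v| *v == value).collect())
}
```

For a canonical literal both functions select the same rows, but the second does a single parse instead of one string conversion per row, and its predicate has the <col> <op> <const> shape.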
Here is an example (note the predicate in the plan below is CAST(foo.month_id AS Utf8) = Utf8("2024") rather than foo.month_id = Int32(2024)):
DataFusion CLI v46.0.0
> create table foo(month_id int) as values (1), (2), (3);
0 row(s) fetched.
Elapsed 0.003 seconds.
> explain select * from foo where month_id = '2024';
+---------------+-------------------------------------------------------+
| plan_type | plan |
+---------------+-------------------------------------------------------+
| logical_plan | Filter: CAST(foo.month_id AS Utf8) = Utf8("2024") |
| | TableScan: foo projection=[month_id] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: CAST(month_id@0 AS Utf8) = 2024 |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+-------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.
Describe the solution you'd like
I would like the filter in the plan above to be FilterExec: month_id = 2024 (no cast on month_id)
Other notes:
- I think this applies to = and !=
- I don't think it can be done for <, <=, >= and > as the semantics for comparing ints and strings are different than for equality.
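The notes above can be sketched as a small decision function. This is an illustrative sketch in plain Rust, not the DataFusion implementation: the `Op` enum and `unwrap_string_literal` are hypothetical names, and the round-trip check (rejecting literals like '0123') is an extra assumption of this sketch, added because CAST(123 AS Utf8) is '123' and would never equal '0123'.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Hypothetical helper: decide whether `CAST(col AS Utf8) <op> '<literal>'`
/// can be rewritten to `col <op> <int>`, returning the integer if so.
fn unwrap_string_literal(op: Op, literal: &str) -> Option<i32> {
    // Only equality-style operators are handled; ordering on strings is
    // lexicographic and does not agree with ordering on integers.
    if !matches!(op, Op::Eq | Op::NotEq) {
        return None;
    }
    let value: i32 = literal.parse().ok()?;
    // Assumption of this sketch: only rewrite when the literal is exactly
    // the text an integer-to-string cast would produce.
    if value.to_string() == literal {
        Some(value)
    } else {
        None
    }
}
```

The ordering restriction is easy to see with the values 10 and 9: as strings, "10" < "9" holds because '1' sorts before '9', while as integers 10 > 9.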
Describe alternatives you've considered
We already have something called unwrap_in_cast that does this transformation
@jayzhan211 has moved this code in this PR
I think once that PR is merged, that would be the natural place to add this optimization
Additional context
No response