Implemented simplify for the starts_with function to convert it into a LIKE expression.#14119
Conversation
…nto a LIKE expression, enabling predicate pruning optimization.
alamb
left a comment
There was a problem hiding this comment.
Thank you @jatin510 -- this looks great
Can you please add some tests for this?
Specifically, I think you can use an explain test in sqllogictests, something like the following. The EXPLAIN should show LIKE being used in the physical plan
> create table t(x varchar) as values ('foo'), ('bar');
0 row(s) fetched.
Elapsed 0.013 seconds.
> explain select starts_with(x, 'fo') from t;
+---------------+----------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------+
| logical_plan | Projection: starts_with(t.x, Utf8("fo")) |
| | TableScan: t projection=[x] |
| physical_plan | ProjectionExec: expr=[starts_with(x@0, fo) as starts_with(t.x,Utf8("fo"))] |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+----------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.| ) -> Result<ExprSimplifyResult> { | ||
| if let Expr::Literal(ScalarValue::Utf8(Some(pattern))) = &args[1] { | ||
| // Convert starts_with (col, 'prefix') to col LIKE 'prefix%' | ||
| let like_pattern = format!("{}%", pattern); |
There was a problem hiding this comment.
I think we probably need to escape any % that appears in the pattern (or avoid doing this replacement when such a pattern exists)
|
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look |
…ype. and updated the coresponding test
| ---- | ||
| logical_plan | ||
| 01)Projection: starts_with(test.column1_utf8view, Utf8View("äöüß")) AS c1, starts_with(test.column1_utf8view, Utf8View("")) AS c2, starts_with(test.column1_utf8view, Utf8View(NULL)) AS c3, starts_with(Utf8View(NULL), test.column1_utf8view) AS c4 | ||
| 01)Projection: test.column1_utf8view LIKE Utf8View("äöüß%") AS c1, CASE test.column1_utf8view IS NOT NULL WHEN Boolean(true) THEN Boolean(true) END AS c2, starts_with(test.column1_utf8view, Utf8View(NULL)) AS c3, starts_with(Utf8View(NULL), test.column1_utf8view) AS c4 |
There was a problem hiding this comment.
this is actually pretty cool -- it figured out that STARTS_WITH(column1_utf8view, '') as c2, is true if column1_utf8view is NOT NULL
| _info: &dyn SimplifyInfo, | ||
| ) -> Result<ExprSimplifyResult> { | ||
| if let Expr::Literal(scalar_value) = &args[1] { | ||
| // Convert starts_with(col, 'prefix') to col LIKE 'prefix%' with proper escaping |
There was a problem hiding this comment.
I double checked the escaping logic and I think this looks good to me.
|
This is really sweet. You an see it working to prune parquet files here: Specifically the predicate I will also add a test to this PR demonstrating this too |
| 01)CoalesceBatchesExec: target_batch_size=8192 | ||
| 02)--FilterExec: column1@0 LIKE f% | ||
| 03)----RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1 | ||
| 04)------ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/foo.parquet]]}, projection=[column1], predicate=column1@0 LIKE f%, pruning_predicate=column1_null_count@2 != column1_row_count@3 AND column1_min@0 <= g AND f <= column1_max@1, required_guarantees=[] |
|
This looks great! |
|
I wonder if converting The difference between While benchmarking is necessary to confirm the exact performance difference, converting |
I see. This might be the tradeoff |
|
Yes that makes sense, but presumably that should happen once per pattern not once per row and be quite fast, especially for the common case of a prefix search, so I'd guess it's negligible? |
…ert it into a LIKE expression. (apache#14119)" This reverts commit 49f95af.
Implemented
simplifyfor thestarts_withfunction to convert it into aLIKEexpressionWhich issue does this PR close?
Closes #14027.
Rationale for this change
Using 'LIKE' expr for 'starts_with' to enable predicate pruning.
What changes are included in this PR?
Implemented
simplifyfor thestarts_withfunction to convert it into a LIKE expression, enabling predicate pruning optimization.Are these changes tested?
Ues
Are there any user-facing changes?
No