Skip to content

Rewrite LIKE expressions with perfect prefix or suffix#85920

Merged
Ergus merged 41 commits intoClickHouse:masterfrom
zheguang:like-perfect-affix-rewrite
Oct 15, 2025
Merged

Rewrite LIKE expressions with perfect prefix or suffix#85920
Ergus merged 41 commits intoClickHouse:masterfrom
zheguang:like-perfect-affix-rewrite

Conversation

@zheguang
Copy link
Copy Markdown
Contributor

@zheguang zheguang commented Aug 20, 2025

This PR addresses Issue #71421.

This PR adds a query tree analyzer pass to rewrite LIKE expression with perfect affix (prefix or suffix) into startsWith or endsWith functions. This rewrite can improve query performance on saving compute.

The scope of this rewrite is

Once rewritten, the query tree with string functions will be taken up by the query execution plan. Here both the primary key analysis and regular scan will work with the string functions instead.

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Improves performance of LIKE with prefix or suffix by using the new default setting optimize_rewrite_like_perfect_affix.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@zheguang zheguang marked this pull request as draft August 20, 2025 06:35
@zheguang

This comment was marked as resolved.

@rschu1ze
Copy link
Copy Markdown
Member

rschu1ze commented Aug 21, 2025

@zheguang This is a nice optimization and the right direction, thanks.

About these two cases:

ILIKE: column needs to be wrapped in a LOWER function

It only makes sense (in my view) to implement this case if the speedup by the range search is bigger than the slowdown by the LOWER wrapper. Before implementing it, I can recommend testing the speed difference in a synthetic SQL test.

NOT (I)LIKE: add negation to the rewritten range expression

So NOT can be treated by simply inverting the range, right? For example, LT would become GE?

According to your comment, the overall speedup for TPC-H is indeed small but that is expected. ClickBench unfortunately doesn't have queries which would benefit from the optimization, so no speedup is expected there as well. Anyways, I am pretty sure that LIKE 'pattern%' is a common theme in real world use cases, so I would be happy to merge this PR. Will try to find someone who can review further.

@rschu1ze rschu1ze added the can be tested Allows running workflows for external contributors label Aug 21, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Aug 21, 2025

Workflow [PR], commit [b28218b]

Summary:

job_name test_name status info comment
Stateless tests (arm_binary, parallel) failure
00601_kill_running_query FAIL cidb
Integration tests (amd_binary, 3/5) failure
test_refreshable_mv/test.py::test_backup_outer_table FAIL cidb
Performance Comparison (amd_release, master_head, 2/3) failure
Start failure

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Aug 21, 2025
@Ergus Ergus self-assigned this Aug 21, 2025
Ergus

This comment was marked as resolved.

@amosbird
Copy link
Copy Markdown
Collaborator

amosbird commented Aug 22, 2025

Wouldn't it be simpler to rewrite perfect prefix like as startsWith? We can also extend to endsWith.

@zheguang

This comment was marked as resolved.

@zheguang zheguang marked this pull request as ready for review August 27, 2025 13:48
@zheguang
Copy link
Copy Markdown
Contributor Author

Wouldn't it be simpler to rewrite perfect prefix like as startsWith? We can also extend to endsWith.

Thanks for the suggestion. My understanding is a range condition works better with an index, for example currently for primary keys.

zheguang

This comment was marked as resolved.

@zheguang
Copy link
Copy Markdown
Contributor Author

zheguang commented Aug 29, 2025

LGTM however:

  1. I would request the recursion change if possible.
  2. I would like to see an example where this brings a noticeable performance benefit to ensure that we are doing everything correctly.

Thanks for the suggestion! So for recursion, I lean keeping. My thinking is, for one, this specific usage is in unit test only, and it's almost identical to one in another unit test, https://github.com/ClickHouse/ClickHouse/pull/85920/files#diff-508f54672d863c147616450f9419522b9a267e3ec04f352674a1085ff220408a. (So I ended up with a minor refactoring). As the test cases appear to have short expressions, I think the depth limited parsing currently in place is sufficient to prevent a stack overflow?

Incidentally I tracked down the query analyzer's InDepthQueryTreeVisitorWithContext which appears to use a similar non-tail-call recursion as well:

for (auto & argument : table_function_node->getArguments())
{
if (std::find(unresolved_indexes.begin(),
unresolved_indexes.end(),
index) == unresolved_indexes.end())
visit(argument);
++index;
}
return;
}
visit(child);

Conceptually though, rewriting this unit-test recursion (and also this query analyzer recursion) would entail a stack that grows until a subtree is fully resolved.

@rschu1ze
Copy link
Copy Markdown
Member

@Ergus Could you double-check / review this PR, please?

Copy link
Copy Markdown
Member

@Ergus Ergus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Still would like to see a simple benchmark comparing performance with and without this optimization enabled, but that's is not an stopper IMHO.

@rschu1ze
Copy link
Copy Markdown
Member

rschu1ze commented Oct 7, 2025

Still would like to see a simple benchmark comparing performance with and without this optimization enabled

"tests/performance/like_perfect_affix_rewrite.xml" is doing exactly this, no?

  <!-- Test prefix matching rewrite -->
  <query>
      SELECT count() FROM tab
      WHERE str LIKE {prefix_pattern}
      SETTINGS optimize_rewrite_like_perfect_affix=0
  </query>

  <query>
      SELECT count() FROM tab
      WHERE str LIKE {prefix_pattern}
      SETTINGS optimize_rewrite_like_perfect_affix=1
  </query>

@rschu1ze
Copy link
Copy Markdown
Member

@Ergus: Like to review and merge this PR? Thanks!

@Ergus
Copy link
Copy Markdown
Member

Ergus commented Oct 14, 2025

@Ergus: Like to review and merge this PR? Thanks!

I already approved this

@Ergus
Copy link
Copy Markdown
Member

Ergus commented Oct 14, 2025

Current failing tests seems to be unrelated opened new issues: #88500, #88501, #88503

Specially #88501 I have already seen it in #87994

@Ergus Ergus enabled auto-merge October 14, 2025 10:53
@Ergus Ergus added this pull request to the merge queue Oct 14, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 14, 2025
@Ergus
Copy link
Copy Markdown
Member

Ergus commented Oct 14, 2025

Hi @zheguang

I just tried to merge this, but there is a new failure in 03594_like_perfect_affix_rewrite_pass that didn't appear before and is related with LIKE.

2025-10-14 13:56:02 Reason: result differs with reference:  
2025-10-14 13:56:02 --- /home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/queries/0_stateless/03594_like_perfect_affix_rewrite_pass.reference	2025-10-14 13:39:28.230901005 +0200
2025-10-14 13:56:02 +++ /home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/queries/0_stateless/03594_like_perfect_affix_rewrite_pass.stdout	2025-10-14 13:56:02.391500809 +0200
2025-10-14 13:56:02 @@ -18,27 +18,27 @@
2025-10-14 13:56:02  2
2025-10-14 13:56:02  -- Test LIKE perfect suffix on FixedString column - should NOT be rewritten
2025-10-14 13:56:02  SELECT count() AS `count()`
2025-10-14 13:56:02  FROM default.tab AS __table1
2025-10-14 13:56:02  WHERE __table1.col_fixedstring LIKE \'%ther\\0\'
2025-10-14 13:56:02  2
2025-10-14 13:56:02  2
2025-10-14 13:56:02  -- Test NOT LIKE with perfect prefix on String column - should be rewritten
2025-10-14 13:56:02  SELECT count() AS `count()`
2025-10-14 13:56:02  FROM default.tab AS __table1
2025-10-14 13:56:02 -WHERE NOT startsWith(__table1.col_string, \'app\')
2025-10-14 13:56:02 +WHERE not(startsWith(__table1.col_string, \'app\'))
2025-10-14 13:56:02  4
2025-10-14 13:56:02  4
2025-10-14 13:56:02  Test NOT LIKE with perfect suffix on String column - should be rewritten
2025-10-14 13:56:02  SELECT count() AS `count()`
2025-10-14 13:56:02  FROM default.tab AS __table1
2025-10-14 13:56:02 -WHERE NOT endsWith(__table1.col_string, \'Test\')
2025-10-14 13:56:02 +WHERE not(endsWith(__table1.col_string, \'Test\'))
2025-10-14 13:56:02  5
2025-10-14 13:56:02  5
2025-10-14 13:56:02  -- Test ILIKE with perfect prefix on String column - should NOT be rewritten
2025-10-14 13:56:02  SELECT count() AS `count()`
2025-10-14 13:56:02  FROM default.tab AS __table1
2025-10-14 13:56:02  WHERE __table1.col_string ILIKE \'APP%\'
2025-10-14 13:56:02  3
2025-10-14 13:56:02  3
2025-10-14 13:56:02  Test ILIKE with perfect suffix on String column - should NOT be rewritten
2025-10-14 13:56:02  SELECT count() AS `count()`
2025-10-14 13:56:02 
2025-10-14 13:56:02 
2025-10-14 13:56:02 
2025-10-14 13:56:02 Database: test_3ufzkfzb

The error looks like a pure format difference. Could you check it please?

@Ergus
Copy link
Copy Markdown
Member

Ergus commented Oct 15, 2025

Hi @zheguang :

Thank you very much for the quick reply. I will try to merge it today.

Currently failing tests seem unrelated and with history:

00601_kill_running_query: Reported as new issue #88563
test_backup_outer_table: Reported as #88566

@Ergus Ergus added this pull request to the merge queue Oct 15, 2025
Merged via the queue into ClickHouse:master with commit f302e55 Oct 15, 2025
119 of 123 checks passed
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Oct 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants