-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Rewrite LIKE expressions with perfect prefix or suffix #85920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Ergus
merged 41 commits into
ClickHouse:master
from
zheguang:like-perfect-affix-rewrite
Oct 15, 2025
Merged
Changes from all commits
Commits
Show all changes
41 commits
Select commit
Hold shift + click to select a range
fa9e55a
Add analyzer pass for like expression rewrite into ranges
zheguang 214c83f
Add test queries; clean up
zheguang e93b703
Merge remote-tracking branch 'fork/master'
zheguang 1cf764f
Add ilike, not (i)like as todo
zheguang ad20c04
Merge remote-tracking branch 'origin/master' into fork-master
zheguang 8bfdbd4
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang bebd57d
Pass LIKE functional and unit tests
zheguang cf6be48
Add NOT LIKE
zheguang 1890534
Add ILIKE rewrite
zheguang a2cfb03
Refactor string prefix functions; tighten node type; minor refactor
zheguang 60f03a4
Refactor analyzer utils
zheguang be7f1f3
Pass style check and some fast tests
zheguang 908f743
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang 3390adc
Move new setting
zheguang 15d86b1
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang e238c26
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang 405d017
Do not rewrite low cardinality attribute
zheguang 5412384
test: enable analyzer
zheguang 494ccc5
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang b84d95e
Tighten rewrite prerequisite and result type check
zheguang 6e9f19a
Merge branch 'master' into master
zheguang 8390191
Address comments
zheguang df08adb
Rewrite to startsWith and endsWith
zheguang 64a4d9f
Do not rewrite ILIKE
zheguang 6749e92
Rename objects for new pass; add performance test
zheguang 693509f
Merge branch 'master' into like-perfect-affix-rewrite
zheguang 26f5689
Rewrite nested string types
zheguang 3b221a4
perf-test: up table size
zheguang 5f60c2c
Minor: typos, redundant inlines
zheguang 10fbcc3
Merge branch 'master' into like-perfect-affix-rewrite
zheguang 841bb07
Merge branch 'master' into like-perfect-affix-rewrite
zheguang b4f5954
Move setting change
zheguang e93a347
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang c8db2ab
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang 3fe8f71
Cleanup: remove unecessary assert; remove non-affix performance query
zheguang 12148d1
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang acefd04
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang 392c612
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang c26b027
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang 8506e50
Fix test output
zheguang b28218b
Fix unit test output
zheguang File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,162 @@ | ||
| #include <Analyzer/Passes/LikePerfectAffixRewritePass.h> | ||
|
|
||
| #include <Analyzer/ConstantNode.h> | ||
| #include <Analyzer/FunctionNode.h> | ||
| #include <Analyzer/IQueryTreeNode.h> | ||
| #include <Analyzer/IdentifierNode.h> | ||
| #include <Analyzer/InDepthQueryTreeVisitor.h> | ||
| #include <Core/Settings.h> | ||
| #include <Common/StringUtils.h> | ||
| #include <DataTypes/DataTypeLowCardinality.h> | ||
| #include <DataTypes/DataTypeNullable.h> | ||
| #include <DataTypes/IDataType.h> | ||
| #include <Functions/FunctionFactory.h> | ||
|
|
||
| #include <boost/algorithm/string.hpp> | ||
|
|
||
| namespace DB | ||
| { | ||
| namespace Setting | ||
| { | ||
| extern const SettingsBool optimize_rewrite_like_perfect_affix; | ||
| } | ||
|
|
||
| namespace | ||
| { | ||
|
|
||
| /** Rewrite LIKE expressions with perfect affix (prefix or suffix) to startsWith or endsWith functions. | ||
| * The scope of the rewrite: | ||
| * - `col LIKE 'Prefix%'`: as `startsWith(col, 'Prefix')` | ||
| * - `col LIKE '%Suffix'`: as `endsWith(col, 'Suffix')` | ||
| * - `NOT LIKE `: negating above functions with `NOT` | ||
| * Out of scope: | ||
| * - `ILIKE` is not rewritten until starts/endsWithCaseInsensitive is implemented, because lower() or left() based rewrites are suboptimal | ||
| * Type requirement on the left operand, `col`: | ||
| * - `startsWith`: Only resolved column of type `String`, `FixedString`, or nested types of these: | ||
| * - `LowCardinality(S)`, `Nullable(S)`, `LowCardinality(Nullable(S))` where `S` is either `String` or `FixedString`. | ||
| * - `endsWith`: same as `startsWith` except `FixedString` (nested or not), because `c LIKE '%suffix\0'` is not equal to `endsWith(c, 'suffix\0')`. | ||
rschu1ze marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| */ | ||
| class LikePerfectAffixRewriteVisitor : public InDepthQueryTreeVisitorWithContext<LikePerfectAffixRewriteVisitor> | ||
| { | ||
| public: | ||
| using Base = InDepthQueryTreeVisitorWithContext<LikePerfectAffixRewriteVisitor>; | ||
| using Base::Base; | ||
|
|
||
| void enterImpl(QueryTreeNodePtr & node) | ||
| { | ||
| if (!getSettings()[Setting::optimize_rewrite_like_perfect_affix]) | ||
| return; | ||
|
|
||
| auto * function_node = node->as<FunctionNode>(); | ||
| if (!function_node) | ||
| return; | ||
|
|
||
| const String & function_name = function_node->getFunctionName(); | ||
| const bool is_like = (function_name == "like"); | ||
| const bool is_not_like = (function_name == "notLike"); | ||
| if (!(is_like || is_not_like)) | ||
| return; | ||
|
|
||
| auto & args = function_node->getArguments().getNodes(); | ||
| if (args.size() != 2) | ||
| return; | ||
|
|
||
| /// Do not rewrite for non-column left hand side | ||
| if (args[0]->getNodeType() != QueryTreeNodeType::COLUMN) | ||
| return; | ||
|
|
||
| /// Extract affix (prefix or suffix) and check if suitable for rewrite | ||
| auto * pattern_constant = args[1]->as<ConstantNode>(); | ||
| if (!pattern_constant || !isString(pattern_constant->getResultType())) | ||
| return; | ||
|
|
||
| auto pattern = pattern_constant->getValue().safeGet<String>(); | ||
| const bool is_suffix = pattern.starts_with("%"); | ||
|
|
||
| /// Do not rewrite for | ||
| /// (1) non-string related type (String, FixedString, or nested within LowCardinality or Nullable or both) | ||
| /// (2) Suffix matching on FixedString, which requires trimming trailing null chars | ||
| if (!isResultTypeSupported(args[0]->getResultType(), is_suffix)) | ||
| return; | ||
|
|
||
| /// Only rewrite for perfect prefix or suffix | ||
| /// Suffix is prefix in reverse | ||
| if (is_suffix) | ||
| std::reverse(pattern.begin(), pattern.end()); | ||
|
|
||
| auto [affix, is_perfect] = extractFixedPrefixFromLikePattern(pattern, true); | ||
| if (!is_perfect || affix.empty()) | ||
| return; | ||
|
|
||
| if (is_suffix) | ||
| std::reverse(affix.begin(), affix.end()); | ||
|
|
||
| auto affix_constant = std::make_shared<ConstantNode>(std::move(affix)); | ||
|
|
||
| /// Create startsWith/endsWith function | ||
| FunctionNodePtr new_node = operation(is_suffix ? "endsWith" : "startsWith", args[0], affix_constant); | ||
| if (is_not_like) | ||
| new_node = operation("not", new_node); | ||
|
|
||
| /// Replpace the original LIKE node | ||
| chassert(new_node != nullptr, "should have been created"); | ||
| chassert(new_node->getResultType()->getTypeId() == node->getResultType()->getTypeId(), "Rewrite should preserve type"); | ||
| node = std::move(new_node); | ||
| } | ||
| private: | ||
| template<typename... Args> | ||
| requires (std::is_convertible_v<Args, QueryTreeNodePtr> && ...) && (sizeof...(Args) >= 1) | ||
| FunctionNodePtr operation(const String & op, Args&&... operands) | ||
| { | ||
| auto op_function = std::make_shared<FunctionNode>(op); | ||
| auto op_resolver = FunctionFactory::instance().get(op, getContext()); | ||
| op_function->getArguments().getNodes() = {std::forward<Args>(operands)...}; | ||
| op_function->resolveAsFunction(op_resolver); | ||
| return op_function; | ||
| } | ||
|
|
||
| bool isStr(const DataTypePtr & type, const bool for_suffix) | ||
| { | ||
| /// Do not rewrite for fixedstring with suffix, because `col LIKE '%suffix\0'` is not `endsWith(col, 'suffix\0')` | ||
| if (for_suffix) | ||
| return isString(type); | ||
| else | ||
| return isStringOrFixedString(type); | ||
| } | ||
|
|
||
| bool isResultTypeSupported(const DataTypePtr & type, const bool for_suffix) | ||
| { | ||
| return isStr(type, for_suffix) || isNullableStr(type, for_suffix) || isLowCardinalityMaybeNullableStr(type, for_suffix); | ||
| } | ||
|
|
||
| bool isLowCardinalityMaybeNullableStr(const DataTypePtr & type, const bool for_suffix) | ||
| { | ||
| if (const DataTypeLowCardinality * type_low_cardinality = typeid_cast<const DataTypeLowCardinality *>(type.get())) | ||
| { | ||
| DataTypePtr type_dictionary = type_low_cardinality->getDictionaryType(); | ||
| return isStr(type_dictionary, for_suffix) || isNullableStr(type_dictionary, for_suffix); | ||
| } | ||
| return false; | ||
| } | ||
|
|
||
| bool isNullableStr(const DataTypePtr & type, const bool for_suffix) | ||
| { | ||
| if (const DataTypeNullable * type_nullable = typeid_cast<const DataTypeNullable *>(type.get())) | ||
| { | ||
| DataTypePtr type_nested = type_nullable->getNestedType(); | ||
| return isStr(type_nested, for_suffix); | ||
| } | ||
| return false; | ||
| } | ||
| }; | ||
|
|
||
| } | ||
|
|
||
| void LikePerfectAffixRewritePass::run(QueryTreeNodePtr & query_tree_node, ContextPtr context) | ||
| { | ||
| LikePerfectAffixRewriteVisitor visitor(context); | ||
| visitor.visit(query_tree_node); | ||
| } | ||
|
|
||
| } | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| #pragma once | ||
|
|
||
| #include <Analyzer/IQueryTreePass.h> | ||
|
|
||
| namespace DB | ||
| { | ||
| class LikePerfectAffixRewritePass final : public IQueryTreePass | ||
| { | ||
| public: | ||
| String getName() override { return "LikePerfectAffixRewritePass"; } | ||
|
|
||
| String getDescription() override { return "Rewrite LIKE expressions with perfect affix (prefix or suffix) into startsWith or endsWith functions."; } | ||
|
|
||
| void run(QueryTreeNodePtr & query_tree_node, ContextPtr context) override; | ||
| }; | ||
|
|
||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| #include <Analyzer/Passes/tests/gtest_analyzer_utils.h> | ||
| #include <gtest/gtest.h> | ||
|
|
||
| #include <Analyzer/ColumnNode.h> | ||
| #include <Analyzer/FunctionNode.h> | ||
| #include <Analyzer/IdentifierNode.h> | ||
| #include <Analyzer/IQueryTreePass.h> | ||
| #include <Analyzer/QueryTreeBuilder.h> | ||
| #include <Analyzer/Utils.h> | ||
| #include <Common/tests/gtest_global_context.h> | ||
| #include <Common/tests/gtest_global_register.h> | ||
| #include <Parsers/parseQuery.h> | ||
| #include <Parsers/ExpressionListParsers.h> | ||
|
|
||
| using namespace DB; | ||
|
|
||
| /// Resolve columns of a type within a query tree | ||
| static QueryTreeNodePtr resolveColumn(DataTypePtr type, QueryTreeNodePtr node, std::map<String, QueryTreeNodePtr> & resolved_map, ContextPtr context) | ||
zheguang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| { | ||
| auto * function_node = node->as<FunctionNode>(); | ||
| if (!function_node) | ||
| { | ||
| auto * identifier_node = node->as<IdentifierNode>(); | ||
| /// it is a column | ||
| if (identifier_node) | ||
| { | ||
| String col_name = identifier_node->getIdentifier().getFullName(); | ||
| auto it = resolved_map.find(col_name); | ||
| if (it != resolved_map.end()) | ||
| return it->second; | ||
| auto column = std::make_shared<ColumnNode>(NameAndTypePair(col_name, type), node); | ||
| resolved_map[col_name] = column; | ||
| return column; | ||
| } | ||
| /// it is constant | ||
| return node; | ||
| } | ||
| QueryTreeNodes new_args; | ||
| for (const auto & argument : function_node->getArguments()) | ||
| { | ||
| auto arg = resolveColumn(type, argument, resolved_map, context); | ||
| new_args.push_back(arg); | ||
| } | ||
| function_node->getArguments().getNodes() = std::move(new_args); | ||
| resolveOrdinaryFunctionNodeByName(*function_node, function_node->getFunctionName(), context); | ||
| return node; | ||
| }; | ||
|
|
||
| /// Run an analyzer pass on a condition involving a column of a type, then compare against the expected expression. | ||
| void testPassOnCondition(QueryTreePassPtr pass, DataTypePtr columnType, const String & cond, const String & expected) | ||
| { | ||
| ContextPtr context = getContext().context; ParserExpressionWithOptionalAlias exp_elem(false); | ||
| ASTPtr query = parseQuery(exp_elem, cond, 10000, 10000, 10000); | ||
| QueryTreeNodePtr node = buildQueryTree(query, context); | ||
|
|
||
| std::map<String, QueryTreeNodePtr> resolved_map; | ||
| node = resolveColumn(columnType, node, resolved_map, context); | ||
| pass->run(node, context); | ||
| EXPECT_EQ(node->formatConvertedASTForErrorMessage(), expected); | ||
| } | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| #pragma once | ||
|
|
||
| #include <Analyzer/IQueryTreePass.h> | ||
| #include <Columns/IColumn.h> | ||
| #include <base/types.h> | ||
|
|
||
| void testPassOnCondition(DB::QueryTreePassPtr pass, DB::DataTypePtr columnType, const String & cond, const String & expected); |
62 changes: 62 additions & 0 deletions
62
src/Analyzer/Passes/tests/gtest_like_perfect_affix_rewrite.cpp
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| #include <gtest/gtest.h> | ||
|
|
||
| #include <Analyzer/ColumnNode.h> | ||
| #include <Analyzer/FunctionNode.h> | ||
| #include <Analyzer/IdentifierNode.h> | ||
| #include <Analyzer/QueryTreeBuilder.h> | ||
| #include <Analyzer/Passes/LikePerfectAffixRewritePass.h> | ||
| #include <Analyzer/Passes/tests/gtest_analyzer_utils.h> | ||
| #include <Analyzer/Utils.h> | ||
| #include <Common/tests/gtest_global_context.h> | ||
| #include <Common/tests/gtest_global_register.h> | ||
| #include <DataTypes/DataTypeString.h> | ||
| #include <DataTypes/DataTypesNumber.h> | ||
| #include <Parsers/parseQuery.h> | ||
| #include <Parsers/ExpressionListParsers.h> | ||
|
|
||
| using namespace DB; | ||
|
|
||
| TEST(LikePerfectAffixRewrite, rewrite) | ||
| { | ||
| tryRegisterFunctions(); | ||
| auto test_f = [&](const String & cond, const String & expected) | ||
| { | ||
| testPassOnCondition( | ||
| QueryTreePassPtr(new LikePerfectAffixRewritePass()), DataTypePtr(new DataTypeString()), | ||
| cond, expected); | ||
| }; | ||
|
|
||
| /// Perfect affix LIKE | ||
| test_f("col LIKE 'Test%'", "startsWith(col, 'Test')"); | ||
| test_f("col LIKE 'a%'", "startsWith(col, 'a')"); | ||
| test_f("col LIKE '%Test'", "endsWith(col, 'Test')"); | ||
| test_f("col LIKE '%a'", "endsWith(col, 'a')"); | ||
|
|
||
| /// Perfect affix ILIKE should NOT be rewritten | ||
| test_f("col ILIKE 'Test%'", "col ILIKE 'Test%'"); | ||
|
|
||
| /// Perfect affix without upper bound | ||
| test_f("col LIKE '\xFF%'", "startsWith(col, '\xFF')"); | ||
| test_f("col LIKE '%\xFF'", "endsWith(col, '\xFF')"); | ||
|
|
||
| /// Perfect affix NOT LIKE | ||
| test_f("col NOT LIKE 'Test%'", "not(startsWith(col, 'Test'))"); | ||
| test_f("col NOT LIKE '%Test'", "not(endsWith(col, 'Test'))"); | ||
|
|
||
| /// Imperfect affix (I)LIKE should not be rewritten | ||
| test_f("col LIKE 'hello_world%'", "col LIKE 'hello_world%'"); | ||
| test_f("col LIKE '%hello_world'", "col LIKE '%hello_world'"); | ||
| test_f("col LIKE '%test%'", "col LIKE '%test%'"); | ||
| test_f("col LIKE '%test_'", "col LIKE '%test_'"); | ||
| test_f("col LIKE '_test%'", "col LIKE '_test%'"); | ||
| test_f("col LIKE '%'", "col LIKE '%'"); | ||
| test_f("col LIKE 'exactvalue'", "col LIKE 'exactvalue'"); | ||
|
|
||
| test_f("col ILIKE 'hello_world%'", "col ILIKE 'hello_world%'"); | ||
|
|
||
| /// Imperfect affix NOT (I)LIKE should not be rewritten | ||
| test_f("col NOT LIKE 'hello_world%'", "col NOT LIKE 'hello_world%'"); | ||
| test_f("col NOT LIKE '%hello_world'", "col NOT LIKE '%hello_world'"); | ||
| test_f("col NOT ILIKE 'hello_world%'", "col NOT ILIKE 'hello_world%'"); | ||
| test_f("col NOT ILIKE '%hello_world'", "col NOT ILIKE '%hello_world'"); | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have another problem here. According to docs, LIKE is able to process UTF8 strings (needle and haystack).
starts/endsWith processes ASCII only (again according to the docs), UTF-8 variants are available as
starts/endsWithUTF8.We either need to map to the UTF8-variants or challenge the docs of LIKE, i.e. check whether LIKE really handles UTF8 (there is a chance it doesn't).
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went down this path as well, wondering who's equals to who, amongst
LIKE,startsWith, andstartsWithUTF8. Evidence suggestsLIKEis more likestartsWith: https://fiddle.clickhouse.com/30fa7302-3023-48aa-ad8a-ab5265d71f69.I think this statement from the documentation on
Likegives it away:Aside: I think proper UTF-8 support should rather have defined behavior when the input is invalid UTF-8.
For this rewrite though, my understanding is that
likeandstartsWithboth just compares raw bytes, so they should work for UTF-8 also on the "happy" path.Aside: one problematic case is for
ILIKE, where non-Ascii Latin characters may require UTF-8 way of lowering cases. Luckily(?) we're not dealing withILIKEfor now...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ClickHouse "philosophy" is to make as few assumptions and assertions as possible for performance reasons. It simply expects that the data is valid UTF-8 and if it isn't, all bets are off. ClickHouse is basically the C++ of the database world.
I checked quickly:
startsWithdoes this:(src/Functions/FunctionStartsEndsWith.h) ... indeed a byte-by-byte comparison.
Okay, then we are good