Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
fa9e55a
Add analyzer pass for like expression rewrite into ranges
zheguang Aug 20, 2025
214c83f
Add test queries; clean up
zheguang Aug 20, 2025
e93b703
Merge remote-tracking branch 'fork/master'
zheguang Aug 20, 2025
1cf764f
Add ilike, not (i)like as todo
zheguang Aug 20, 2025
ad20c04
Merge remote-tracking branch 'origin/master' into fork-master
zheguang Aug 21, 2025
8bfdbd4
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang Aug 25, 2025
bebd57d
Pass LIKE functional and unit tests
zheguang Aug 26, 2025
cf6be48
Add NOT LIKE
zheguang Aug 26, 2025
1890534
Add ILIKE rewrite
zheguang Aug 27, 2025
a2cfb03
Refactor string prefix functions; tighten node type; minor refactor
zheguang Aug 27, 2025
60f03a4
Refactor analyzer utils
zheguang Aug 27, 2025
be7f1f3
Pass style check and some fast tests
zheguang Aug 28, 2025
908f743
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang Aug 28, 2025
3390adc
Move new setting
zheguang Aug 28, 2025
15d86b1
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang Aug 29, 2025
e238c26
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang Sep 1, 2025
405d017
Do not rewrite low cardinality attribute
zheguang Sep 2, 2025
5412384
test: enable analyzer
zheguang Sep 2, 2025
494ccc5
Merge remote-tracking branch 'origin/master' into like-to-range
zheguang Sep 3, 2025
b84d95e
Tighten rewrite prerequisite and result type check
zheguang Sep 4, 2025
6e9f19a
Merge branch 'master' into master
zheguang Sep 5, 2025
8390191
Address comments
zheguang Sep 9, 2025
df08adb
Rewrite to startsWith and endsWith
zheguang Sep 12, 2025
64a4d9f
Do not rewrite ILIKE
zheguang Sep 17, 2025
6749e92
Rename objects for new pass; add performance test
zheguang Sep 17, 2025
693509f
Merge branch 'master' into like-perfect-affix-rewrite
zheguang Sep 17, 2025
26f5689
Rewrite nested string types
zheguang Sep 17, 2025
3b221a4
perf-test: up table size
zheguang Sep 17, 2025
5f60c2c
Minor: typos, redundant inlines
zheguang Sep 17, 2025
10fbcc3
Merge branch 'master' into like-perfect-affix-rewrite
zheguang Sep 18, 2025
841bb07
Merge branch 'master' into like-perfect-affix-rewrite
zheguang Sep 23, 2025
b4f5954
Move setting change
zheguang Sep 23, 2025
e93a347
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang Sep 24, 2025
c8db2ab
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang Sep 26, 2025
3fe8f71
Cleanup: remove unecessary assert; remove non-affix performance query
zheguang Sep 29, 2025
12148d1
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang Sep 29, 2025
acefd04
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang Oct 6, 2025
392c612
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang Oct 8, 2025
c26b027
Merge remote-tracking branch 'upstream/master' into like-perfect-affi…
zheguang Oct 14, 2025
8506e50
Fix test output
zheguang Oct 15, 2025
b28218b
Fix unit test output
zheguang Oct 15, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
162 changes: 162 additions & 0 deletions src/Analyzer/Passes/LikePerfectAffixRewritePass.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
#include <Analyzer/Passes/LikePerfectAffixRewritePass.h>

#include <Analyzer/ConstantNode.h>
#include <Analyzer/FunctionNode.h>
#include <Analyzer/IQueryTreeNode.h>
#include <Analyzer/IdentifierNode.h>
#include <Analyzer/InDepthQueryTreeVisitor.h>
#include <Core/Settings.h>
#include <Common/StringUtils.h>
#include <DataTypes/DataTypeLowCardinality.h>
#include <DataTypes/DataTypeNullable.h>
#include <DataTypes/IDataType.h>
#include <Functions/FunctionFactory.h>

#include <boost/algorithm/string.hpp>

namespace DB
{
namespace Setting
{
extern const SettingsBool optimize_rewrite_like_perfect_affix;
}

namespace
{

/** Rewrite LIKE expressions with perfect affix (prefix or suffix) to startsWith or endsWith functions.
* The scope of the rewrite:
* - `col LIKE 'Prefix%'`: as `startsWith(col, 'Prefix')`
Copy link
Copy Markdown
Member

@rschu1ze rschu1ze Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have another problem here. According to docs, LIKE is able to process UTF8 strings (needle and haystack).

starts/endsWith processes ASCII only (again according to the docs), UTF-8 variants are available as starts/endsWithUTF8.

We either need to map to the UTF8-variants or challenge the docs of LIKE, i.e. check whether LIKE really handles UTF8 (there is a chance it doesn't).

Copy link
Copy Markdown
Contributor Author

@zheguang zheguang Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went down this path as well, wondering who's equals to who, amongst LIKE, startsWith, and startsWithUTF8. Evidence suggests LIKE is more like startsWith: https://fiddle.clickhouse.com/30fa7302-3023-48aa-ad8a-ab5265d71f69.
I think this statement from the documentation on Like gives it away:

If the haystack or the LIKE expression are not valid UTF-8, the behavior is undefined.

Aside: I think proper UTF-8 support should rather have defined behavior when the input is invalid UTF-8.
For this rewrite though, my understanding is that like and startsWith both just compares raw bytes, so they should work for UTF-8 also on the "happy" path.
Aside: one problematic case is for ILIKE, where non-Ascii Latin characters may require UTF-8 way of lowering cases. Luckily(?) we're not dealing with ILIKE for now...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: I think proper UTF-8 support should rather have defined behavior when the input is invalid UTF-8.

The ClickHouse "philosophy" is to make as few assumptions and assertions as possible for performance reasons. It simply expects that the data is valid UTF-8 and if it isn't, all bets are off. ClickHouse is basically the C++ of the database world.

For this rewrite though, my understanding is that like and startsWith both just compares raw bytes, so they should work for UTF-8 also on the "happy" path.

I checked quickly:

startsWith does this:

res_data[row_num] = StringRef(haystack.data, needle.size) == StringRef(needle.data, needle.size);

(src/Functions/FunctionStartsEndsWith.h) ... indeed a byte-by-byte comparison.

Okay, then we are good

* - `col LIKE '%Suffix'`: as `endsWith(col, 'Suffix')`
* - `NOT LIKE `: negating above functions with `NOT`
* Out of scope:
* - `ILIKE` is not rewritten until starts/endsWithCaseInsensitive is implemented, because lower() or left() based rewrites are suboptimal
* Type requirement on the left operand, `col`:
* - `startsWith`: Only resolved column of type `String`, `FixedString`, or nested types of these:
* - `LowCardinality(S)`, `Nullable(S)`, `LowCardinality(Nullable(S))` where `S` is either `String` or `FixedString`.
* - `endsWith`: same as `startsWith` except `FixedString` (nested or not), because `c LIKE '%suffix\0'` is not equal to `endsWith(c, 'suffix\0')`.
*/
class LikePerfectAffixRewriteVisitor : public InDepthQueryTreeVisitorWithContext<LikePerfectAffixRewriteVisitor>
{
public:
using Base = InDepthQueryTreeVisitorWithContext<LikePerfectAffixRewriteVisitor>;
using Base::Base;

void enterImpl(QueryTreeNodePtr & node)
{
if (!getSettings()[Setting::optimize_rewrite_like_perfect_affix])
return;

auto * function_node = node->as<FunctionNode>();
if (!function_node)
return;

const String & function_name = function_node->getFunctionName();
const bool is_like = (function_name == "like");
const bool is_not_like = (function_name == "notLike");
if (!(is_like || is_not_like))
return;

auto & args = function_node->getArguments().getNodes();
if (args.size() != 2)
return;

/// Do not rewrite for non-column left hand side
if (args[0]->getNodeType() != QueryTreeNodeType::COLUMN)
return;

/// Extract affix (prefix or suffix) and check if suitable for rewrite
auto * pattern_constant = args[1]->as<ConstantNode>();
if (!pattern_constant || !isString(pattern_constant->getResultType()))
return;

auto pattern = pattern_constant->getValue().safeGet<String>();
const bool is_suffix = pattern.starts_with("%");

/// Do not rewrite for
/// (1) non-string related type (String, FixedString, or nested within LowCardinality or Nullable or both)
/// (2) Suffix matching on FixedString, which requires trimming trailing null chars
if (!isResultTypeSupported(args[0]->getResultType(), is_suffix))
return;

/// Only rewrite for perfect prefix or suffix
/// Suffix is prefix in reverse
if (is_suffix)
std::reverse(pattern.begin(), pattern.end());

auto [affix, is_perfect] = extractFixedPrefixFromLikePattern(pattern, true);
if (!is_perfect || affix.empty())
return;

if (is_suffix)
std::reverse(affix.begin(), affix.end());

auto affix_constant = std::make_shared<ConstantNode>(std::move(affix));

/// Create startsWith/endsWith function
FunctionNodePtr new_node = operation(is_suffix ? "endsWith" : "startsWith", args[0], affix_constant);
if (is_not_like)
new_node = operation("not", new_node);

/// Replpace the original LIKE node
chassert(new_node != nullptr, "should have been created");
chassert(new_node->getResultType()->getTypeId() == node->getResultType()->getTypeId(), "Rewrite should preserve type");
node = std::move(new_node);
}
private:
template<typename... Args>
requires (std::is_convertible_v<Args, QueryTreeNodePtr> && ...) && (sizeof...(Args) >= 1)
FunctionNodePtr operation(const String & op, Args&&... operands)
{
auto op_function = std::make_shared<FunctionNode>(op);
auto op_resolver = FunctionFactory::instance().get(op, getContext());
op_function->getArguments().getNodes() = {std::forward<Args>(operands)...};
op_function->resolveAsFunction(op_resolver);
return op_function;
}

bool isStr(const DataTypePtr & type, const bool for_suffix)
{
/// Do not rewrite for fixedstring with suffix, because `col LIKE '%suffix\0'` is not `endsWith(col, 'suffix\0')`
if (for_suffix)
return isString(type);
else
return isStringOrFixedString(type);
}

bool isResultTypeSupported(const DataTypePtr & type, const bool for_suffix)
{
return isStr(type, for_suffix) || isNullableStr(type, for_suffix) || isLowCardinalityMaybeNullableStr(type, for_suffix);
}

bool isLowCardinalityMaybeNullableStr(const DataTypePtr & type, const bool for_suffix)
{
if (const DataTypeLowCardinality * type_low_cardinality = typeid_cast<const DataTypeLowCardinality *>(type.get()))
{
DataTypePtr type_dictionary = type_low_cardinality->getDictionaryType();
return isStr(type_dictionary, for_suffix) || isNullableStr(type_dictionary, for_suffix);
}
return false;
}

bool isNullableStr(const DataTypePtr & type, const bool for_suffix)
{
if (const DataTypeNullable * type_nullable = typeid_cast<const DataTypeNullable *>(type.get()))
{
DataTypePtr type_nested = type_nullable->getNestedType();
return isStr(type_nested, for_suffix);
}
return false;
}
};

}

void LikePerfectAffixRewritePass::run(QueryTreeNodePtr & query_tree_node, ContextPtr context)
{
LikePerfectAffixRewriteVisitor visitor(context);
visitor.visit(query_tree_node);
}

}

17 changes: 17 additions & 0 deletions src/Analyzer/Passes/LikePerfectAffixRewritePass.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
#pragma once

#include <Analyzer/IQueryTreePass.h>

namespace DB
{
class LikePerfectAffixRewritePass final : public IQueryTreePass
{
public:
String getName() override { return "LikePerfectAffixRewritePass"; }

String getDescription() override { return "Rewrite LIKE expressions with perfect affix (prefix or suffix) into startsWith or endsWith functions."; }

void run(QueryTreeNodePtr & query_tree_node, ContextPtr context) override;
};

}
60 changes: 60 additions & 0 deletions src/Analyzer/Passes/tests/gtest_analyzer_utils.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
#include <Analyzer/Passes/tests/gtest_analyzer_utils.h>
#include <gtest/gtest.h>

#include <Analyzer/ColumnNode.h>
#include <Analyzer/FunctionNode.h>
#include <Analyzer/IdentifierNode.h>
#include <Analyzer/IQueryTreePass.h>
#include <Analyzer/QueryTreeBuilder.h>
#include <Analyzer/Utils.h>
#include <Common/tests/gtest_global_context.h>
#include <Common/tests/gtest_global_register.h>
#include <Parsers/parseQuery.h>
#include <Parsers/ExpressionListParsers.h>

using namespace DB;

/// Resolve columns of a type within a query tree
static QueryTreeNodePtr resolveColumn(DataTypePtr type, QueryTreeNodePtr node, std::map<String, QueryTreeNodePtr> & resolved_map, ContextPtr context)
{
auto * function_node = node->as<FunctionNode>();
if (!function_node)
{
auto * identifier_node = node->as<IdentifierNode>();
/// it is a column
if (identifier_node)
{
String col_name = identifier_node->getIdentifier().getFullName();
auto it = resolved_map.find(col_name);
if (it != resolved_map.end())
return it->second;
auto column = std::make_shared<ColumnNode>(NameAndTypePair(col_name, type), node);
resolved_map[col_name] = column;
return column;
}
/// it is constant
return node;
}
QueryTreeNodes new_args;
for (const auto & argument : function_node->getArguments())
{
auto arg = resolveColumn(type, argument, resolved_map, context);
new_args.push_back(arg);
}
function_node->getArguments().getNodes() = std::move(new_args);
resolveOrdinaryFunctionNodeByName(*function_node, function_node->getFunctionName(), context);
return node;
};

/// Run an analyzer pass on a condition involving a column of a type, then compare against the expected expression.
void testPassOnCondition(QueryTreePassPtr pass, DataTypePtr columnType, const String & cond, const String & expected)
{
ContextPtr context = getContext().context; ParserExpressionWithOptionalAlias exp_elem(false);
ASTPtr query = parseQuery(exp_elem, cond, 10000, 10000, 10000);
QueryTreeNodePtr node = buildQueryTree(query, context);

std::map<String, QueryTreeNodePtr> resolved_map;
node = resolveColumn(columnType, node, resolved_map, context);
pass->run(node, context);
EXPECT_EQ(node->formatConvertedASTForErrorMessage(), expected);
}
7 changes: 7 additions & 0 deletions src/Analyzer/Passes/tests/gtest_analyzer_utils.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
#pragma once

#include <Analyzer/IQueryTreePass.h>
#include <Columns/IColumn.h>
#include <base/types.h>

void testPassOnCondition(DB::QueryTreePassPtr pass, DB::DataTypePtr columnType, const String & cond, const String & expected);
62 changes: 62 additions & 0 deletions src/Analyzer/Passes/tests/gtest_like_perfect_affix_rewrite.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
#include <gtest/gtest.h>

#include <Analyzer/ColumnNode.h>
#include <Analyzer/FunctionNode.h>
#include <Analyzer/IdentifierNode.h>
#include <Analyzer/QueryTreeBuilder.h>
#include <Analyzer/Passes/LikePerfectAffixRewritePass.h>
#include <Analyzer/Passes/tests/gtest_analyzer_utils.h>
#include <Analyzer/Utils.h>
#include <Common/tests/gtest_global_context.h>
#include <Common/tests/gtest_global_register.h>
#include <DataTypes/DataTypeString.h>
#include <DataTypes/DataTypesNumber.h>
#include <Parsers/parseQuery.h>
#include <Parsers/ExpressionListParsers.h>

using namespace DB;

TEST(LikePerfectAffixRewrite, rewrite)
{
tryRegisterFunctions();
auto test_f = [&](const String & cond, const String & expected)
{
testPassOnCondition(
QueryTreePassPtr(new LikePerfectAffixRewritePass()), DataTypePtr(new DataTypeString()),
cond, expected);
};

/// Perfect affix LIKE
test_f("col LIKE 'Test%'", "startsWith(col, 'Test')");
test_f("col LIKE 'a%'", "startsWith(col, 'a')");
test_f("col LIKE '%Test'", "endsWith(col, 'Test')");
test_f("col LIKE '%a'", "endsWith(col, 'a')");

/// Perfect affix ILIKE should NOT be rewritten
test_f("col ILIKE 'Test%'", "col ILIKE 'Test%'");

/// Perfect affix without upper bound
test_f("col LIKE '\xFF%'", "startsWith(col, '\xFF')");
test_f("col LIKE '%\xFF'", "endsWith(col, '\xFF')");

/// Perfect affix NOT LIKE
test_f("col NOT LIKE 'Test%'", "not(startsWith(col, 'Test'))");
test_f("col NOT LIKE '%Test'", "not(endsWith(col, 'Test'))");

/// Imperfect affix (I)LIKE should not be rewritten
test_f("col LIKE 'hello_world%'", "col LIKE 'hello_world%'");
test_f("col LIKE '%hello_world'", "col LIKE '%hello_world'");
test_f("col LIKE '%test%'", "col LIKE '%test%'");
test_f("col LIKE '%test_'", "col LIKE '%test_'");
test_f("col LIKE '_test%'", "col LIKE '_test%'");
test_f("col LIKE '%'", "col LIKE '%'");
test_f("col LIKE 'exactvalue'", "col LIKE 'exactvalue'");

test_f("col ILIKE 'hello_world%'", "col ILIKE 'hello_world%'");

/// Imperfect affix NOT (I)LIKE should not be rewritten
test_f("col NOT LIKE 'hello_world%'", "col NOT LIKE 'hello_world%'");
test_f("col NOT LIKE '%hello_world'", "col NOT LIKE '%hello_world'");
test_f("col NOT ILIKE 'hello_world%'", "col NOT ILIKE 'hello_world%'");
test_f("col NOT ILIKE '%hello_world'", "col NOT ILIKE '%hello_world'");
}
46 changes: 4 additions & 42 deletions src/Analyzer/Passes/tests/gtest_logical_expression_optimizer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
#include <Analyzer/FunctionNode.h>
#include <Analyzer/IdentifierNode.h>
#include <Analyzer/Passes/LogicalExpressionOptimizerPass.h>
#include <Analyzer/Passes/tests/gtest_analyzer_utils.h>
#include <Analyzer/QueryTreeBuilder.h>
#include <Analyzer/Utils.h>
#include <Common/tests/gtest_global_context.h>
Expand All @@ -14,53 +15,14 @@

using namespace DB;

QueryTreeNodePtr resolve_everything(QueryTreeNodePtr node, std::map<String, QueryTreeNodePtr> & resolved_map, ContextPtr context)
{
auto * function_node = node->as<FunctionNode>();
if (!function_node)
{
auto * identifier_node = node->as<IdentifierNode>();
/// it is a column
if (identifier_node)
{
String col_name = identifier_node->getIdentifier().getFullName();
auto it = resolved_map.find(col_name);
if (it != resolved_map.end())
return it->second;
DataTypePtr type = DataTypePtr(new DataTypeInt32());
auto column = std::make_shared<ColumnNode>(NameAndTypePair(col_name, type), node);
resolved_map[col_name] = column;
return column;
}
/// it is constant
return node;
}
QueryTreeNodes new_args;
for (const auto & argument : function_node->getArguments())
{
auto arg = resolve_everything(argument, resolved_map, context);
new_args.push_back(arg);
}
function_node->getArguments().getNodes() = std::move(new_args);
resolveOrdinaryFunctionNodeByName(*function_node, function_node->getFunctionName(), context);
return node;
};

TEST(OptimizeAndCompareChain, compare)
{
tryRegisterFunctions();
std::map<String, QueryTreeNodePtr> resolved_map;
auto test_f = [&](const String & cond, const String & expected)
{
ContextPtr context = getContext().context;
ParserExpressionWithOptionalAlias exp_elem(false);
ASTPtr query = parseQuery(exp_elem, cond, 10000, 10000, 10000);
QueryTreeNodePtr node = buildQueryTree(query, context);

node = resolve_everything(node, resolved_map, context);
LogicalExpressionOptimizerPass pass;
pass.run(node, context);
EXPECT_EQ(node->formatConvertedASTForErrorMessage(), expected);
testPassOnCondition(
QueryTreePassPtr(new LogicalExpressionOptimizerPass()), DataTypePtr(new DataTypeInt32()),
cond, expected);
};

// constant is large
Expand Down
2 changes: 2 additions & 0 deletions src/Analyzer/QueryTreePassManager.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@
#include <Analyzer/Passes/IfTransformStringsToEnumPass.h>
#include <Analyzer/Passes/InjectRandomOrderIfNoOrderByPass.h>
#include <Analyzer/Passes/L2DistanceTransposedPartialReadsPass.h>
#include <Analyzer/Passes/LikePerfectAffixRewritePass.h>
#include <Analyzer/Passes/LogicalExpressionOptimizerPass.h>
#include <Analyzer/Passes/MultiIfToIfPass.h>
#include <Analyzer/Passes/NormalizeCountVariantsPass.h>
Expand Down Expand Up @@ -309,6 +310,7 @@ void addQueryTreePasses(QueryTreePassManager & manager, bool only_analyze)

manager.addPass(std::make_unique<ConvertOrLikeChainPass>());

manager.addPass(std::make_unique<LikePerfectAffixRewritePass>());
manager.addPass(std::make_unique<LogicalExpressionOptimizerPass>());

manager.addPass(std::make_unique<CrossToInnerJoinPass>());
Expand Down
Loading
Loading