-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Improving writeAnyEscapedString for scalar subqueries #7077
Description
As mentioned in PR 6773, the overhead introduced by string escaping for serialization/deserialization is not negligible.
What's more, according to the experiments mentioned in issue 6880 and 6956, the overhead of string escaping could even be dominant if the bitmap intermediate results are huge, say tens of millions of ids contained.
Here are the performance comparison for using a tiny patch(patch1) :
@@ -130,7 +130,9 @@ String FieldVisitorToString::operator() (const DecimalField<Decimal128> & x) con
String FieldVisitorToString::operator() (const UInt128 & x) const { return formatQuoted(UUID(x)); }
String FieldVisitorToString::operator() (const AggregateFunctionStateData & x) const
{
- return formatQuoted(x.data);
+ return x.data;
}
The query is:
WITH
(
SELECT mid_seqs
FROM cdp_tags
WHERE tag_id = 'mobile_province_NA'
LIMIT 1
) AS bm
SELECT count(cab_type_id)
FROM trips
WHERE bitmapContains(bm, toUInt32(id));
CREATE TABLE cdp_tags ( \
tag_id String, \
mid_seqs AggregateFunction(groupBitmap, UInt32), \
cardinality UInt32 \
) engine=ReplacingMergeTree() \
ORDER BY (tag_id);
Table trips is just the same got from New York Taxi Data, while SELECT mid_seqs FROM cdp_tags WHERE tag_id = 'mobile_province_NA' LIMIT 1 will return a bitmap containing over 40M ids.
The original time takes around 19s, while after the modification, the query time is reduced to around 7s, and we could see String FieldVisitorToString::operator() (const AggregateFunctionStateData & x) const has been called over 22 times during the query.
However, patch1 could not be applied because it will lead to deserialization error under distributed queries, just as PR 6773 has fixed.
Therefore we made an extra experiments with as fewer change as possible(patch2):
--- a/dbms/src/IO/WriteHelpers.h
+++ b/dbms/src/IO/WriteHelpers.h
@@ -267,7 +267,7 @@ void writeAnyEscapedString(const char * begin, const char * end, WriteBuffer & b
while (true)
{
/// On purpose we will escape more characters than minimally necessary.
- const char * next_pos = find_first_symbols<'\b', '\f', '\n', '\r', '\t', '\0', '\\', quote_character>(pos, end);
+ const char * next_pos = find_first_symbols<'\\', quote_character>(pos, end);
if (next_pos == end)
{
Within patch2, String FieldVisitorToString::operator() (const AggregateFunctionStateData & x) const is left unchanged, while only writeAnyEscapedString is adjusted such that as few string find operation happened as possible during string escaping. Given this patch, the above query time is around 7.8s, which is a bit worse than no string escaping. patch2 seems work well, which means the scalar subquery performance is improved from 19s to 7.8s, which is a remarkable achievement.
As a result, we have the following questions:
- Why
String FieldVisitorToString::operator() (const AggregateFunctionStateData & x) constis called so many times for the scalar sub query? Does this have close relationship with the number of partitions of the table data?
select partition from system.parts where table='trips';
115 rows in set
- Perhaps some faster string find algorithms could be used to replace for
find_first_symbols, although it does not have the same remarkable achievements for performance improvements as the first point.