-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Supporting parallel execution for With clause #7329
Description
cdp_tags is a table where bitmap intermediate state is stored as a column:
CREATE TABLE cdp_tags ( \
tag_id String, \
mid_seqs AggregateFunction(groupBitmap, UInt32), \
cardinality UInt32 \
) engine=ReplacingMergeTree() \
ORDER BY (tag_id) SETTINGS index_granularity=1;
A simple bitmap aggregation could return fast, for example, only 0.254s is required for both IO and bitmap AND aggregation:
SELECT groupBitmapAnd(mid_seqs)
FROM cdp_tags
WHERE has(['tag1', 'is_member'], tag_id)
┌─groupBitmapAnd(mid_seqs)─┐
│ 16614356 │
└──────────────────────────┘
1 rows in set. Elapsed: 0.254 sec. Processed 28.21 thousand rows, 1.74 MB (110.86 thousand rows/s., 6.83 MB/s.)
However, if I move that query into WITH clause:
WITH
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag1', 'is_member'], tag_id)
) AS bm1
SELECT order_complete_time
FROM orders
WHERE (order_complete_time >= '2019-01-01 00:00:00') AND (order_complete_time <= '2019-01-01 00:01:00')
703 rows in set. Elapsed: 3.173 sec. Processed 6.20 thousand rows, 24.78 KB (5.28 thousand rows/s., 21.13 KB/s.)
Here, I cut the time range as much as possible such that query on table orders itself does not affect the overall performance, I just want to evaluate the overhead of WITH clause, then the above time is increased more than tens of times:
Furthermore, if I introduce more operation into WITH clause:
WITH
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag1', 'is_member'], tag_id)
) AS bm1,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag2', 'is_member'], tag_id)
) AS bm2
SELECT order_complete_time
FROM orders
WHERE (order_complete_time >= '2019-01-01 00:00:00') AND (order_complete_time <= '2019-01-01 00:01:00')
703 rows in set. Elapsed: 4.297 sec. Processed 6.20 thousand rows, 24.78 KB (1.44 thousand rows/s., 5.77 KB/s.)
The time elapsed is increased near linearly.
What's more, if I add more:
WITH
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag1', 'is_member'], tag_id)
) AS bm1,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag2', 'is_member'], tag_id)
) AS bm2,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag3', 'is_member'], tag_id)
) AS bm3,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag4', 'is_member'], tag_id)
) AS bm4,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag5', 'is_member'], tag_id)
) AS bm5,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag6', 'is_member'], tag_id)
) AS bm6,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag7', 'is_member'], tag_id)
) AS bm7,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag8', 'is_member'], tag_id)
) AS bm8,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag9', 'is_member'], tag_id)
) AS bm9,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag10', 'is_member'], tag_id)
) AS bm10,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag11', 'is_member'], tag_id)
) AS bm11,
(
SELECT groupBitmapAndState(mid_seqs)
FROM cdp_tags
WHERE has(['tag12', 'is_member'], tag_id)
) AS bm12
SELECT order_complete_time
FROM orders
WHERE (order_complete_time >= '2019-01-01 00:00:00') AND (order_complete_time <= '2019-01-01 00:01:00')
703 rows in set. Elapsed: 24.021 sec. Processed 6.20 thousand rows, 24.78 KB (257.89 rows/s., 1.03 KB/s.)
The time is near linearly relevant to the number of aggregation state---there are 12 of them in the above query.
Therefore, there are two kinds of approaches to improve the performance:
-
Find out the reason why the overhead is increased from around 0.2s to above 2s if the aggregation state query is added into WITH clause.
-
Having the WITH clause executed in parallel. Currently, they are executed sequentially and therefore the query preparation stage has occupied such a large percentage.