Supporting parallel execution for With clause #7329

@yingfeng

Description

cdp_tags is a table that stores bitmap aggregate intermediate states as a column:

CREATE TABLE cdp_tags
(
    tag_id String,
    mid_seqs AggregateFunction(groupBitmap, UInt32),
    cardinality UInt32
)
ENGINE = ReplacingMergeTree()
ORDER BY (tag_id)
SETTINGS index_granularity = 1;
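Conceptually, each mid_seqs value is a bitmap of UInt32 ids, and groupBitmapAnd intersects the bitmaps of all selected rows. A minimal Python sketch of those semantics, using plain sets in place of roaring bitmaps (the tag names and ids below are made up for illustration):

```python
# Hypothetical per-tag id sets standing in for the groupBitmap state in mid_seqs.
tag_bitmaps = {
    "tag1":      {1, 2, 3, 5, 8},
    "is_member": {2, 3, 5, 13},
}

def group_bitmap_and(bitmaps):
    """Intersect all bitmaps, mirroring groupBitmapAnd over the selected rows."""
    result = None
    for bm in bitmaps:
        result = set(bm) if result is None else result & bm
    return result if result is not None else set()

# WHERE has(['tag1', 'is_member'], tag_id) selects both rows; the AND
# aggregation yields the cardinality of their intersection.
selected = [tag_bitmaps[t] for t in ("tag1", "is_member")]
print(len(group_bitmap_and(selected)))  # → 3
```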

A simple bitmap aggregation on its own returns quickly; for example, only 0.254 s is needed for both the IO and the bitmap AND aggregation:

SELECT groupBitmapAnd(mid_seqs)
FROM cdp_tags
WHERE has(['tag1', 'is_member'], tag_id)

┌─groupBitmapAnd(mid_seqs)─┐
│                 16614356 │
└──────────────────────────┘

1 rows in set. Elapsed: 0.254 sec. Processed 28.21 thousand rows, 1.74 MB (110.86 thousand rows/s., 6.83 MB/s.)

However, if I move that query into a WITH clause:

WITH
    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag1', 'is_member'], tag_id)
    ) AS bm1
SELECT order_complete_time
FROM orders
WHERE (order_complete_time >= '2019-01-01 00:00:00') AND (order_complete_time <= '2019-01-01 00:01:00')

703 rows in set. Elapsed: 3.173 sec. Processed 6.20 thousand rows, 24.78 KB (5.28 thousand rows/s., 21.13 KB/s.)

Here, I narrowed the time range as much as possible so that the query on the orders table itself does not affect the overall performance; the goal is purely to measure the overhead of the WITH clause. The elapsed time above increases more than tenfold (0.254 s to 3.173 s).

Furthermore, if I introduce another subquery into the WITH clause:

WITH

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag1', 'is_member'], tag_id)
    ) AS bm1,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag2', 'is_member'], tag_id)
    ) AS bm2
SELECT order_complete_time
FROM orders
WHERE (order_complete_time >= '2019-01-01 00:00:00') AND (order_complete_time <= '2019-01-01 00:01:00')

703 rows in set. Elapsed: 4.297 sec. Processed 6.20 thousand rows, 24.78 KB (1.44 thousand rows/s., 5.77 KB/s.)

The elapsed time increases nearly linearly.

What's more, if I add even more subqueries:

WITH

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag1', 'is_member'], tag_id)
    ) AS bm1,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag2', 'is_member'], tag_id)
    ) AS bm2,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag3', 'is_member'], tag_id)
    ) AS bm3,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag4', 'is_member'], tag_id)
    ) AS bm4,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag5', 'is_member'], tag_id)
    ) AS bm5,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag6', 'is_member'], tag_id)
    ) AS bm6,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag7', 'is_member'], tag_id)
    ) AS bm7,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag8', 'is_member'], tag_id)
    ) AS bm8,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag9', 'is_member'], tag_id)
    ) AS bm9,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag10', 'is_member'], tag_id)
    ) AS bm10,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag11', 'is_member'], tag_id)
    ) AS bm11,

    (
        SELECT groupBitmapAndState(mid_seqs)
        FROM cdp_tags
        WHERE has(['tag12', 'is_member'], tag_id)
    ) AS bm12
SELECT order_complete_time
FROM orders
WHERE (order_complete_time >= '2019-01-01 00:00:00') AND (order_complete_time <= '2019-01-01 00:01:00')

703 rows in set. Elapsed: 24.021 sec. Processed 6.20 thousand rows, 24.78 KB (257.89 rows/s., 1.03 KB/s.)

The elapsed time grows nearly linearly with the number of aggregation states; there are 12 of them in the query above.
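A quick least-squares fit over the three measurements above (1, 2, and 12 subqueries) makes the linear relationship concrete, at roughly 1.9 s of extra elapsed time per WITH-clause subquery:

```python
# Measured points from the runs above: (number of WITH subqueries, elapsed seconds).
points = [(1, 3.173), (2, 4.297), (12, 24.021)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Ordinary least-squares slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / \
        sum((x - mean_x) ** 2 for x, _ in points)
intercept = mean_y - slope * mean_x

print(f"~{slope:.2f} s per subquery, ~{intercept:.2f} s base cost")
# → ~1.93 s per subquery, ~0.86 s base cost
```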

Therefore, there are two possible approaches to improving the performance:

  1. Find out why the overhead increases from around 0.2 s to above 2 s when the aggregation state query is moved into the WITH clause.

  2. Execute the WITH clause subqueries in parallel. Currently they are executed sequentially, which is why the query preparation stage accounts for such a large share of the total time.
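To illustrate the second approach: the WITH-clause subqueries are independent of each other, so they could be dispatched concurrently and the wall time would stay close to the cost of one subquery instead of their sum. A sketch of that idea; run_query is a hypothetical stand-in for the client call (it just sleeps to simulate the per-subquery cost), not a real ClickHouse API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

SUBQUERY_COST = 0.05  # seconds; stands in for the ~1.9 s observed per subquery

def run_query(tag):
    """Hypothetical stand-in for executing one WITH-clause scalar subquery."""
    time.sleep(SUBQUERY_COST)
    return f"bitmap_state({tag})"

tags = [f"tag{i}" for i in range(1, 13)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(tags)) as pool:
    states = list(pool.map(run_query, tags))
elapsed = time.monotonic() - start

# Sequential execution would take len(tags) * SUBQUERY_COST; run concurrently,
# the independent subqueries overlap and wall time stays near one cost unit.
print(f"{len(states)} states in {elapsed:.2f} s")
```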
