Skip to content

Improving the performance of scalar subqueries #6956

@yingfeng

Description

@yingfeng

Describe the situation
Given two tables, one is order, the other is customer, we need to provide aggregation results for table order given a number of customer_ids got from table customer. For example:

SELECT
 toDate(order_time) AS rowData
FROM order
WHERE order.customer_id IN
(
SELECT id FROM customer WHERE gender='MALE'
)

The above query could be accelerated through bitmapContains if id in table customer has already been written to aggregation state in another table.

CREATE TABLE user_segments ( \
    seg_id String, \
    ids AggregateFunction(groupBitmap, UInt32) \
) engine=ReplacingMergeTree() \
ORDER BY (seg_id) SETTINGS index_granularity=1;

SELECT
 toDate(order_time) AS rowData
FROM order
WHERE bitmapContains((
SELECT ids FROM user_segments WHERE seg_id = 'MALE')
,customer_id
)

According to experiments, the bitmapContains is about twice faster than traditional sub query listed above. However, both queries can only run within single thread since we saw a 100% CPU occupation during queries, this is pretty time consuming, given our experiments: suppose there are over tens of millions of ids in sub queries(or bitmaps), the number of records of table order is around 2billion, and the overall number of ClickHouse nodes is 4, the query will take around 3-5 minutes. Is it possible to have such queries run in parallel using multiple threads for further acceleration? A simple idea is to have data from different partitions be scanned in parallel.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions