-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Improving the performance of scalar subqueries #6956
Description
Describe the situation
Given two tables, one is order, the other is customer, we need to provide aggregation results for table order given a number of customer_ids got from table customer. For example:
SELECT
toDate(order_time) AS rowData
FROM order
WHERE order.customer_id IN
(
SELECT id FROM customer WHERE gender='MALE'
)
The above query could be accelerated through bitmapContains if id in table customer has already been written to aggregation state in another table.
CREATE TABLE user_segments ( \
seg_id String, \
ids AggregateFunction(groupBitmap, UInt32) \
) engine=ReplacingMergeTree() \
ORDER BY (seg_id) SETTINGS index_granularity=1;
SELECT
toDate(order_time) AS rowData
FROM order
WHERE bitmapContains((
SELECT ids FROM user_segments WHERE seg_id = 'MALE')
,customer_id
)
According to experiments, the bitmapContains is about twice faster than traditional sub query listed above. However, both queries can only run within single thread since we saw a 100% CPU occupation during queries, this is pretty time consuming, given our experiments: suppose there are over tens of millions of ids in sub queries(or bitmaps), the number of records of table order is around 2billion, and the overall number of ClickHouse nodes is 4, the query will take around 3-5 minutes. Is it possible to have such queries run in parallel using multiple threads for further acceleration? A simple idea is to have data from different partitions be scanned in parallel.