Improving the performance of scalar subqueries

**Describe the situation**
Given two tables, one is `order`, the other is `customer`, we need to provide aggregation results for table `order` given a number of `customer_ids` got from table `customer`. For example:
```
SELECT
 toDate(order_time) AS rowData
FROM order
WHERE order.customer_id IN
(
SELECT id FROM customer WHERE gender='MALE'
)

```
The above query could be accelerated through `bitmapContains` if `id` in table `customer` has already been written to aggregation state in another table.
```
CREATE TABLE user_segments ( \
    seg_id String, \
    ids AggregateFunction(groupBitmap, UInt32) \
) engine=ReplacingMergeTree() \
ORDER BY (seg_id) SETTINGS index_granularity=1;

SELECT
 toDate(order_time) AS rowData
FROM order
WHERE bitmapContains((
SELECT ids FROM user_segments WHERE seg_id = 'MALE')
,customer_id
)
```

According to experiments, the `bitmapContains` is about twice faster than traditional sub query listed above.  However, both queries can only run within single thread since we saw a 100% CPU occupation during queries, this is pretty time consuming, given our experiments: suppose there are over tens of millions of ids in sub queries(or bitmaps), the number of records of table `order` is around 2billion, and the overall number of ClickHouse nodes is 4, the query will take around 3-5 minutes.  Is it possible to have such queries run in parallel using multiple threads for further acceleration? A simple idea is to have data from different partitions be scanned in parallel.





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving the performance of scalar subqueries #6956

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improving the performance of scalar subqueries #6956

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions