WHERE clauses optimization leveraging Projection feature#63207
WHERE clauses optimization leveraging Projection feature#63207suzyw-w wants to merge 14 commits intoClickHouse:masterfrom
Conversation
|
This is an automated comment for commit cf38f59 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
…ckHouse into proj_optimization_analyzer
|
Thank you! This looks interesting... I didn't check the code yet. According to my expectations, PREWHERE will scan the column for filtering and then read the remaining data from the positions that passed the filter. |
Hi Alexey, thanks for the response. First, this is a add-on to PREWHERE not a replacement to PREWHERE. Secondly, on top of PREWHERE, this optimization utilize the projection feature by including a sub select query With |
Actually, i guess it make sense to include part name as well, because it will allow to filter out parts which doesn't have interesting rows for us. select * from table where indexHint((_part, primary_key) in (select _part, primary_key from test_a where secondary_key='-42')) and secondary_key='-42'; |
Hi @UnamedRus , I think we discussed this in the original thread. Do we still need to re-calculate projection table if we try to include _part and _part_offset? |
_part_offset was different topic, it was more like: what if we will materialize and store _part_offset column from part in projection. But here i'm talking about "virtual" column _part, which just have meaning of current part being queried, it should be always up to date, because it's virtual |
Can you elaborate on this? I dont see any differences by including _part into the query. |
Difference could be visible Imagine, if your order by is something like:
Because index is sparse, it does meant for select * from table where indexHint((tenant_id, event_id) in (select tenant_id, event_id from test_a where secondary_key='-42')) and secondary_key='-42';It will read at least 1 granule from each part. (it's still be win from perf improvement, but add unnecessary slowness) |
|
To begin with code review it requires to add a performance test. |
How to add performance test? |
Please see the documentation: https://clickhouse.com/docs/en/development/tests#performance-tests |
|
@SuzyWangIBMer, please open another PR, now with the performance test. |
Refer to original PR #57216, porting code to analyzer passes to align with other ClickHouse optimization features.
This is a proposal to optimize select query by leveraging projection feature.
Since PROJECTION can effectively create a new primary key for the table, it can increase the searching speed if using properly.
This implementation inserts a new subquery to the original where condition, in order to use the projection feature in more general cases.
For example,
select * from table where secondary_key='-42';will now become
select * from table where primary_key in (select primary_key from test_a where secondary_key='-42') and secondary_key='-42';This implementation is tested and proved increasing query execution speed vastly.
Thanks to @UnamedRus 's suggestion, indexHint() is also added to further optimize the query. Now the query is going to be re-write to
select * from table where indexHint(primary_key in (select primary_key from test_a where secondary_key='-42')) and secondary_key='-42';It is proved that this will also optimize query execution speed. Tests results are attached below.
Pseudo Code
Restrictions
=,!=,in,notInfunctionFor example,
can be re-written to
However, query like the following will not be recognized, and will not be re-written.
Test result
CREATE TABLE test_a ( `src` String, `dst` String, `other_cols` String, PROJECTION p1 ( SELECT src, dst ORDER BY dst ) ) ENGINE = MergeTree ORDER BY src/* Query using primary key (fastest) */
/* Query does not using projection */
/* Optimized non-projection query now can leveraging projection feature */
palmtops1.fyre.ibm.com :) select * from test_a where src in (select src from test_a where dst='-42') and dst='-42'; SELECT * FROM test_a WHERE (src IN ( SELECT src FROM test_a WHERE dst = '-42' )) AND (dst = '-42') Query id: 68ca1b70-940f-41d2-983b-4df7c06df5ae ┌─src─┬─dst─┬─other_cols───┐ │ 42 │ -42 │ other_col 42 │ └─────┴─────┴──────────────┘ 1 row in set. Elapsed: 0.101 sec. Processed 32.77 thousand rows, 1.18 MB (325.38 thousand rows/s., 11.75 MB/s.) Peak memory usage: 275.28 KiB/* Optimized query with indexHint */
palmtops1.fyre.ibm.com :) select * from test_a where indexHint(src in (select src from test_a where dst='-42')) and dst='-42'; SELECT * FROM test_a WHERE indexHint(src IN ( SELECT src FROM test_a WHERE dst = '-42' )) AND (dst = '-42') Query id: 365d4fae-93dc-48e5-b0c0-42dc2ba9904e ┌─src─┬─dst─┬─other_cols───┐ │ 42 │ -42 │ other_col 42 │ └─────┴─────┴──────────────┘ 1 row in set. Elapsed: 0.094 sec. Processed 32.77 thousand rows, 1.18 MB (350.31 thousand rows/s., 12.65 MB/s.) Peak memory usage: 223.92 KiB.Query using primary key (fastest) : 0.042 sec, Peak memory usage: 44.66 KiB
Query does not using projection : 2.177 sec, Peak memory usage: 1021.77 KiB.
Optimized non-projection query : 0.101 sec, Peak memory usage: 275.28 KiB
Optimized query with indexHint : 0.094 sec, Peak memory usage: 223.92 KiB.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Rewrite where clause in order to leverage PROJECTION feature.
Documentation entry for user-facing changes
Modify your CI run:
NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step
Include tests (required builds will be added automatically):
Exclude tests:
Extra options:
Only specified batches in multi-batch jobs: