Optimize queries with LIMIT/LIMIT BY/ORDER BY for distributed with GROUP BY sharding_key #10373
Conversation
So to summarize: this should solve some issues that had been introduced by #10341, and it adds support for LIMIT/LIMIT BY/ORDER BY.
@akuzm can you please take a look: is it a false positive or not? (Or should I just restart the build, e.g. by rebasing against upstream/master?)
Never mind, the problem is the protocol ABI breakage (queries like
@4ertus2 this is ready, can you take a look?
The 00956_sensitive_data_masking test is flaky (#13867).
@4ertus2 @alexey-milovidov friendly ping, can someone take a look?
So here we got the result already, but the timeout triggers (
@alexey-milovidov any thoughts on this? |
Process query until the stage where the aggregate functions were
calculated and finalized.
It will be used for optimize_distributed_group_by_sharding_key.
v2: fix aliases
v3: Fix protocol ABI breakage due to WithMergeableStateAfterAggregation
The >= conditions for QueryProcessingStage::Enum have been verified, and they
are ok (in InterpreterSelectQuery).
…OUP BY sharding_key

The previous set of QueryProcessingStage values does not allow this. But after WithMergeableStateAfterAggregation had been introduced, the following queries can be optimized too under optimize_distributed_group_by_sharding_key:
- GROUP BY sharding_key LIMIT
- GROUP BY sharding_key LIMIT BY
- GROUP BY sharding_key ORDER BY

Right now it still does not support:
- WITH TOTALS (looks like it can be supported)
- WITH ROLLUP (looks like it can be supported)
- WITH CUBE
- SETTINGS extremes=1 (looks like it can be supported)

But these will be implemented separately.

vX: fixes
v2: fix WITH *
v3: fix extremes
v4: fix LIMIT OFFSET (and make it a little bit cleaner)
v5: fix HAVING
v6: fix ORDER BY
v7: rebase against 20.7
v8: move out WithMergeableStateAfterAggregation
v9: add optimize_distributed_group_by_sharding_key into test names
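As a sketch of the query shapes this commit makes optimizable (the table name `dist` and its sharding-key column `key` are hypothetical, not from the patch):

```sql
-- Assumed: dist is a Distributed table whose sharding key is `key`,
-- so every group lives on exactly one shard and the final stages can
-- be executed on the shards themselves.
SELECT key, count() FROM dist GROUP BY key LIMIT 10;        -- GROUP BY sharding_key LIMIT
SELECT key, count() FROM dist GROUP BY key LIMIT 1 BY key;  -- GROUP BY sharding_key LIMIT BY
SELECT key, count() FROM dist GROUP BY key ORDER BY key;    -- GROUP BY sharding_key ORDER BY
```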
…merge (convert to UInt64)

Possible values:
- 1 - Do not merge aggregation states from different servers for distributed query processing, in case it is certain that there are different keys on different shards.
- 2 - Same as 1, but also apply the ORDER BY and LIMIT stages.
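A hedged illustration of the new setting value (the table `dist` and its sharding key `key` are assumptions for the example):

```sql
-- With distributed_group_by_no_merge = 2 the aggregation states are
-- still not merged across shards (as with value 1), but the ORDER BY
-- and LIMIT stages are applied as well.
SELECT key, count()
FROM dist
GROUP BY key
ORDER BY key
LIMIT 10
SETTINGS distributed_group_by_no_merge = 2;
```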
Rebased in an attempt to fix it.
```sql
drop table if exists dist_01247;
drop table if exists dist_layer_01247;

select 'Distributed(rand)-over-Distributed(rand)';
create table dist_layer_01247 as data_01247 engine=Distributed(test_cluster_two_shards, currentDatabase(), data_01247, rand());
create table dist_01247 as data_01247 engine=Distributed(test_cluster_two_shards, currentDatabase(), dist_layer_01247, number);
select count(), * from dist_01247 group by number;
drop table dist_01247;
```
Don't understand this change, will return DROP queries...
When the test DROPs the tables at the end, you cannot attach with the client and reproduce the failure; that's why I'm trying to keep them. That said, this is just a cosmetic thing.
@alexey-milovidov Is there some unspoken rule that said that tables should be removed after the test?
It's just for convenience when you run .sql files manually and don't want leftovers.
alexey-milovidov
left a comment
LGTM.
I understand the logic. It looks a bit fragile because StorageDistributed has to maintain awareness of all possible pipeline steps after GROUP BY. It is Ok, but we need to write randomized combinatorial tests for the correctness of Distributed tables on various types of clusters.
```diff
@@ -1,2 +1,42 @@
 SELECT count(), uniq(dummy) FROM remote('127.0.0.{2,3}', system.one) SETTINGS distributed_group_by_no_merge = 1;
```
I'd better add new test instead of modifying an old simple test case.
FWIW here is the previous test report (before the rebase against upstream/master): https://clickhouse-test-reports.s3.yandex.net/10373/9e4aa5954e45865d7756e1a6a174d7c1cf0ccfa0/performance_comparison/report.html#fail1
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Optimize queries with LIMIT/LIMIT BY/ORDER BY for distributed with GROUP BY sharding_key (under optimize_skip_unused_shards and optimize_distributed_group_by_sharding_key)
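As the changelog entry says, the optimization is gated behind two settings. A hedged usage sketch (the table `dist` sharded by `key` is hypothetical):

```sql
-- Assumed usage: optimize_distributed_group_by_sharding_key only takes
-- effect together with optimize_skip_unused_shards.
SET optimize_skip_unused_shards = 1;
SET optimize_distributed_group_by_sharding_key = 1;

-- GROUP BY over the sharding key plus ORDER BY/LIMIT can now be
-- finished on the shards instead of on the initiator.
SELECT key, count() FROM dist GROUP BY key ORDER BY key LIMIT 10;
```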
Detailed description / Documentation draft:
The previous set of QueryProcessingStage values does not allow this.
So this patch set introduces the new WithMergeableStateAfterAggregation stage and uses
it to optimize queries with GROUP BY sharding_key and:
Right now it still does not support:
Details
HEAD:
Continuation of: