Optimization of ORDER BY with respect to the ORDER key in MergeTree tables (#5042)
```sql
SELECT d FROM test.pk_order ORDER BY (a, b);
SELECT d FROM test.pk_order ORDER BY a;
SELECT b FROM test.pk_order ORDER BY (a, b) DESC;
```
What about `select -a as a, -b as b from test.pk_order order by (a, b)` before and after this patch?

It should work correctly (as before), because aliases are substituted before interpreting:

```sql
select -a as a, -b as b from test.pk_order order by -a, -b
```
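The alias substitution above is what keeps the optimization safe here: once `ORDER BY a, b` is rewritten to `ORDER BY -a, -b`, the ORDER BY expressions are no longer bare sorting-key columns, so the read-in-order path simply does not match. A hypothetical Python sketch (function names are illustrative, not ClickHouse APIs):

```python
# Hypothetical sketch: aliases from the SELECT list are substituted into
# ORDER BY before the optimization is considered, so the rewritten
# expressions no longer match the table's sorting key columns.

def substitute_aliases(order_by, select_aliases):
    """Replace each ORDER BY identifier with its SELECT-list expression."""
    return [select_aliases.get(col, col) for col in order_by]

def matches_sorting_key(order_by, sorting_key):
    """Only a plain column prefix of the sorting key qualifies."""
    return order_by == sorting_key[:len(order_by)]

aliases = {"a": "-a", "b": "-b"}              # SELECT -a AS a, -b AS b
order_by = substitute_aliases(["a", "b"], aliases)

print(order_by)                                # ['-a', '-b']
print(matches_sorting_key(order_by, ["a", "b"]))   # False -> optimization skipped
print(matches_sorting_key(["a"], ["a", "b"]))      # True  -> plain prefix qualifies
```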
```cpp
const Settings & settings = context.getSettingsRef();
```

```cpp
const auto & order_direction = order_descr.at(0).direction;
```

Why is only the first direction checked? I don't understand it.

The code is totally wrong.
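To make the reviewer's point concrete: looking only at `order_descr.at(0).direction` is not enough. Reading in key order is valid only when every ORDER BY element has the same direction (all ASC reads parts forward, all DESC reads them in reverse); mixed directions must fall back to a full sort. A hypothetical Python sketch (illustrative names, not ClickHouse code):

```python
# Hypothetical sketch: checking only the first element's direction is
# insufficient; all directions must agree for an in-order read.
# Each order_descr entry is (column_name, direction): 1 = ASC, -1 = DESC.

def common_direction(order_descr):
    """Return 1 or -1 if all directions agree, else None (cannot read in order)."""
    directions = {direction for _, direction in order_descr}
    return directions.pop() if len(directions) == 1 else None

print(common_direction([("a", 1), ("b", 1)]))    # 1    -> forward read
print(common_direction([("a", -1), ("b", -1)]))  # -1   -> reversed read
print(common_direction([("a", 1), ("b", -1)]))   # None -> fall back to sorting
```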
On top of that, what happens when a MergeTree is created with ORDER BY DESC? IIRC it is allowed.

No, it's not allowed. You can write

@anrodigina It's not merged with master, that's why all builds have failed.
```cpp
if (auto storage_merge_tree = dynamic_cast<StorageReplicatedMergeTree *>(storage.get()))
```

Why ReplicatedMergeTree?
```cpp
namespace DB
{

class ReverseBlockInputStream : public IBlockInputStream
{
public:
    ReverseBlockInputStream(const BlockInputStreamPtr & input);

    Block readImpl() override;
};

} // namespace DB
```
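The idea behind a reverse stream can be sketched outside of ClickHouse: if the underlying storage already yields blocks in reverse range order, then reversing the rows inside each block produces a fully descending stream, which is what serves `ORDER BY ... DESC` without a sort. A hypothetical Python sketch (blocks modeled as column-name-to-list dicts; not the actual ClickHouse API):

```python
# Hypothetical sketch of what a "reverse" stream does: reverse the rows
# inside each block, assuming the storage layer already returns the
# blocks themselves in reverse range order.

def reverse_block(block):
    """block: dict column_name -> list of values; reverse every column."""
    return {name: values[::-1] for name, values in block.items()}

def reverse_stream(blocks):
    for block in blocks:
        yield reverse_block(block)

# Storage yields the later range first (blocks already in reverse order).
blocks = [{"a": [5, 6]}, {"a": [1, 2, 3]}]
out = [b["a"] for b in reverse_stream(blocks)]
print(out)  # [[6, 5], [3, 2, 1]] -> concatenated: 6, 5, 3, 2, 1 (descending)
```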
```cpp
    return Block();
}
```

```cpp
PaddedPODArray<size_t> permutation;

for (size_t i = 0; i < result_block.rows(); ++i)
{
    permutation.emplace_back(result_block.rows() - 1 - i);
}
```

Method `Block::rows` is called in a loop.

```cpp
for (auto iter = result_block.begin(); iter != result_block.end(); ++iter)
```

A range-based loop will be OK.
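The two nits above can be sketched together in Python (hypothetical, illustrative names): hoist the row count out of the loop, build the reverse permutation once, then apply it to every column of the block.

```python
# Hypothetical sketch of the fixed version: row count computed once,
# reverse permutation applied per column.

def reverse_permutation(num_rows):
    # num_rows is evaluated once, not on every iteration
    return [num_rows - 1 - i for i in range(num_rows)]

def permute_block(block, perm):
    """block: dict column_name -> list; reorder every column by perm."""
    return {name: [values[p] for p in perm] for name, values in block.items()}

block = {"a": [1, 2, 3], "b": ["x", "y", "z"]}
perm = reverse_permutation(3)
print(perm)                        # [2, 1, 0]
print(permute_block(block, perm))  # {'a': [3, 2, 1], 'b': ['z', 'y', 'x']}
```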
```cpp
{
    bool need_sorting = false;
    const auto & sorting_key_order = storage_merge_tree->getSortingKeyColumns();
    if (!(sorting_key_order.size() < order_descr.size()) && !query.limitByValue() && !query.groupBy())
```

What's wrong with LIMIT BY?

What about FinishSortingBlockInputStream?
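The eligibility condition being reviewed here boils down to: the ORDER BY column list must be a prefix of the table's sorting key. A hypothetical Python sketch of that check (illustrative names; the length comparison is done once, outside the loop, addressing the `size()`-in-a-loop nit below):

```python
# Hypothetical sketch: an in-order read is possible only when ORDER BY is a
# column-wise prefix of the table's sorting key.

def can_read_in_order(order_by, sorting_key):
    if len(order_by) > len(sorting_key):   # lengths compared once
        return False
    # zip stops at the shorter list; lengths were checked above
    return all(o == k for o, k in zip(order_by, sorting_key))

print(can_read_in_order(["a"], ["a", "b"]))       # True
print(can_read_in_order(["a", "b"], ["a", "b"]))  # True
print(can_read_in_order(["b"], ["a", "b"]))       # False (not a prefix)
```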
```cpp
{
    query_info.do_not_steal_task = true;
```

```cpp
pipeline.transform([&](auto & stream)
```

```cpp
if (!need_sorting)
{
    query_info.do_not_steal_task = true;
```
```cpp
for (size_t i = 0; i < order_descr.size(); ++i)
{
    if (order_descr[i].column_name != sorting_key_order[i]
```

What about `SELECT column AS x ... ORDER BY x`?

Method `std::vector<...>::size` is called in a loop.
```cpp
    stream = std::make_shared<AsynchronousBlockInputStream>(stream);
});

if (order_direction == -1)
```

```cpp
PrewhereInfoPtr prewhere_info;
```

```cpp
bool do_not_steal_task = false;
```
```cpp
    return res;
}

BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreamsPKOrder(
```

I tried to hardcode `query_info.read_in_pk_order = true` when fetching columns to force this part of the code to execute, and it seems that the order of reading with this pipeline is still nondeterministic.

@CurtizJ will do this task after another task (rewriting DNS Cache).
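The nondeterminism complaint makes sense: each data part is individually sorted by the key, but reading parts concurrently interleaves them arbitrarily, so a deterministic globally ordered result still requires a k-way merge over the per-part streams. A hypothetical Python sketch of that final merging step:

```python
# Hypothetical sketch: each part stream is sorted, but only a k-way merge
# over all part streams yields a deterministic, globally sorted order.

import heapq

part1 = [1, 4, 7]   # each data part is sorted by the sorting key
part2 = [2, 3, 9]
part3 = [5, 6, 8]

merged = list(heapq.merge(part1, part2, part3))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```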
Optimization of ORDER BY with respect to the ORDER key in MergeTree tables (continuation of #5042).
Hi guys,
It's already available in the testing release:
Thanks! I have two questions, please. Sorry for these basic questions, as I'm very new to Linux & git :)
Packages in the testing repository have passed all CI tests but are not considered stable until they appear in the stable repository.
We have a stable release every two weeks, most of the time.
If you install another version of ClickHouse, it will replace the previous version on your system.
@alexey-milovidov Super clear!
Hi @alexey-milovidov,
@MahmoudGoda0 What is the ORDER BY on your table? `(imsi, start_time)` or just `start_time`? Can you show extended statistics with `send_logs_level = 'debug'`?
Yes, it's `start_time` only, but why is that expected?

Because a full scan plus ORDER BY over millions of rows is faster than millions of sequential index jumps (in all databases; it will be the same in MySQL and PG).
Actually I can't get your point, but the CREATE of my table is as below:

My point is that you don't have an index which is appropriate for your search condition, ORDER BY, and TOP 1000. In case of
@den-crane I think this is not the case, because the same query but the WHERE is
@MahmoudGoda0 Do you mind converting
I can't, as I have values ending with "b".
@MahmoudGoda0 Ok.
In ClickHouse, an indexed read is never slower than a full scan. (A query can be slower due to the index analysis stage, in rare cases.) ClickHouse does not perform point reads; it will always select ranges in the data (though a range can be as small as a single granule), then read these ranges, skipping unneeded data. Skipping unneeded data is done with file seeks, but if the seek is inside the read buffer (typically 1 MB) or inside a compressed block, it will just continue reading.
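The seek-vs-continue trade-off described above can be sketched as a simple access planner. This is a hypothetical illustration (the 1 MB threshold, function names, and action labels are assumptions for the sketch, not ClickHouse internals):

```python
# Hypothetical sketch: if the gap to the next needed range fits inside the
# current read buffer, keep reading and discard bytes instead of seeking.

BUFFER_SIZE = 1 << 20  # ~1 MB read buffer (assumed for illustration)

def plan_access(ranges):
    """ranges: sorted list of (offset, length). Yield (action, byte_count)."""
    pos = 0
    for offset, length in ranges:
        gap = offset - pos
        if 0 < gap <= BUFFER_SIZE:
            yield ("read_and_discard", gap)   # cheap: stay inside the buffer
        elif gap > BUFFER_SIZE:
            yield ("seek", gap)               # a real file seek
        yield ("read", length)
        pos = offset + length

actions = list(plan_access([(0, 100), (500, 100), (5_000_000, 100)]))
print(actions)
# [('read', 100), ('read_and_discard', 400), ('read', 100),
#  ('seek', 4999400), ('read', 100)]
```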
@alexey-milovidov
Reading "in order", as implemented in this task, can be slower than a usual read, due to:
OK, you mean if I try it some time after the import, it should be faster?
We continue optimizing the "in order" read: #6299. But some rare queries will remain slightly slower with
Queries may become slightly faster due to a smaller number of data parts, but not very significantly, because the expensive merging step is still required.
Can you please share how I can disable `optimize_read_in_order` manually?
@alexey-milovidov

Sorry, now I'm out of the context of your queries and cannot be sure.
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
For changelog. Remove if this is a non-significant change.
Category (leave one):
Short description (up to a few sentences):
https://st.yandex-team.ru/CLICKHOUSE-4013
- Added `ReverseBlockInputStream`.
- Added ORDER BY optimization in PK order.
- Added tests.