
Optimization of ORDER BY with respect to the ORDER key in MergeTree tables.#5042

Merged
KochetovNicolai merged 42 commits into ClickHouse:master from anrodigina:clickhouse-4013
Jul 27, 2019

Conversation

@anrodigina
Contributor

@anrodigina anrodigina commented Apr 17, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

  • Improvement

Short description (up to few sentences):
https://st.yandex-team.ru/CLICKHOUSE-4013

Added ReverseBlockInputStream.
Added ORDER BY optimization in primary key order.
Added tests.

@alexey-milovidov alexey-milovidov changed the title Clickhouse 4013 Optimization of ORDER BY with respect to the ORDER key in MergeTree tables. Apr 17, 2019
@alexey-milovidov alexey-milovidov added pr-improvement Pull request with some product improvements pr-performance Pull request with some performance improvements labels Apr 17, 2019
SELECT d FROM test.pk_order ORDER BY (a, b);
SELECT d FROM test.pk_order ORDER BY a;

SELECT b FROM test.pk_order ORDER BY (a, b) DESC;
Contributor

What about select -a as a, -b as b from test.pk_order order by (a, b) before and after this patch?

Member

It should work correctly (as before), because aliases are substituted before interpreting:

select -a as a, -b as b from test.pk_order order by -a, -b


const Settings & settings = context.getSettingsRef();

const auto& order_direction = order_descr.at(0).direction;
Contributor

Why is only the first direction checked? I don't understand it.

Member

The code is totally wrong.

@nvartolomei
Contributor

nvartolomei commented Apr 17, 2019

On top of that, what happens when a MergeTree is created with ORDER BY ... DESC? IIRC it is allowed.

@alexey-milovidov
Member

when a MergeTree is created with ORDER BY ... DESC? IIRC it is allowed.

No, it's not allowed. You can write ORDER BY -x instead (but it doesn't make sense).

@alexey-milovidov
Member

@anrodigina It's not merged with master, that's why all builds have failed.


const auto& order_direction = order_descr.at(0).direction;

if (auto storage_merge_tree = dynamic_cast<StorageReplicatedMergeTree *>(storage.get()))
Member

Why ReplicatedMergeTree?


const auto& order_direction = order_descr.at(0).direction;

if (auto storage_merge_tree = dynamic_cast<StorageReplicatedMergeTree *>(storage.get()))
Member

const?

namespace DB
{

class ReverseBlockInputStream : public IBlockInputStream


Block readImpl() override;
};

} // namespace DB
Member

Useless comment.

class ReverseBlockInputStream : public IBlockInputStream
{
public:
ReverseBlockInputStream(const BlockInputStreamPtr& input);
Member

Style.

return Block();
}

PaddedPODArray<size_t> permutation;
Member

IColumn::Permutation


PaddedPODArray<size_t> permutation;

for (size_t i = 0; i < result_block.rows(); ++i)
Member

Method Block::rows is called in a loop.

permutation.emplace_back(result_block.rows() - 1 - i);
}

for (auto iter = result_block.begin(); iter != result_block.end(); ++iter)
Member

A range-based loop will be OK.
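The review suggestions above (hoist the row-count call out of the loop, use a range-based loop over the columns, build a reverse permutation) can be sketched as follows. This is a hypothetical, simplified illustration using plain std::vector as a stand-in "column"; ClickHouse's real IColumn/Block API is different.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-ins for ClickHouse's Block/IColumn (assumption: real
// code uses IColumn::Permutation and IColumn::permute instead).
using Column = std::vector<int>;
using Block = std::vector<Column>;

Block reverseBlock(const Block & block)
{
    if (block.empty())
        return {};

    const size_t rows = block.front().size();  // called once, not per iteration

    std::vector<size_t> permutation;
    permutation.reserve(rows);
    for (size_t i = 0; i < rows; ++i)
        permutation.push_back(rows - 1 - i);   // reverse permutation

    Block result;
    for (const auto & column : block)          // range-based loop over columns
    {
        Column reversed;
        reversed.reserve(rows);
        for (size_t idx : permutation)
            reversed.push_back(column[idx]);
        result.push_back(std::move(reversed));
    }
    return result;
}
```

With the row count hoisted, each column is reversed with one pass over a precomputed permutation, which is the same shape as applying IColumn::permute in the actual stream.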

{
bool need_sorting = false;
const auto& sorting_key_order = storage_merge_tree->getSortingKeyColumns();
if (!(sorting_key_order.size() < order_descr.size()) && !query.limitByValue() && !query.groupBy())
Member

What's wrong with LIMIT BY?

{
bool need_sorting = false;
const auto& sorting_key_order = storage_merge_tree->getSortingKeyColumns();
if (!(sorting_key_order.size() < order_descr.size()) && !query.limitByValue() && !query.groupBy())
Member

What about FinishSortingBlockInputStream?

{
query_info.do_not_steal_task = true;

pipeline.transform([&](auto & stream)
Member

Extra whitespace.


if (!need_sorting)
{
query_info.do_not_steal_task = true;
Member

Missing comment.

{
for (size_t i = 0; i < order_descr.size(); ++i)
{
if (order_descr[i].column_name != sorting_key_order[i]
Member

SELECT column AS x ... ORDER BY x

const auto& sorting_key_order = storage_merge_tree->getSortingKeyColumns();
if (!(sorting_key_order.size() < order_descr.size()) && !query.limitByValue() && !query.groupBy())
{
for (size_t i = 0; i < order_descr.size(); ++i)
Member

Method std::vector<...>::size is called in a loop.

}
}

if (!need_sorting)
Member

Extra whitespace.

stream = std::make_shared<AsynchronousBlockInputStream>(stream);
});

if (order_direction == -1)
Member

Extra whitespace.


PrewhereInfoPtr prewhere_info;

bool do_not_steal_task = false;
Member

Missing comment.

return res;
}

BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreamsPKOrder(
Member

@CurtizJ CurtizJ Jun 22, 2019

I tried to hardcode query_info.read_in_pk_order = true when fetching columns to force this part of the code to execute, and it seems that the order of reading with this pipeline is still nondeterministic.

@alexey-milovidov
Member

@CurtizJ will do this task after another task (rewriting DNS Cache).

@KochetovNicolai KochetovNicolai merged commit 70159fb into ClickHouse:master Jul 27, 2019
KochetovNicolai added a commit that referenced this pull request Jul 27, 2019
Optimization of ORDER BY with respect to the ORDER key in MergeTree tables (continuation of #5042).
@MahmoudGoda0

Hi guys,
Can I know when this fix will be available?
This fix is really urgent for my company to use ClickHouse as our main DB engine.

@alexey-milovidov
Member

@MahmoudGoda0

Thanks, I have 2 questions please:
1 - Is this a stable version? If not, when will it be a stable version?
2 - I installed this version already from "http://repo.yandex.ru/clickhouse/deb/stable/ main/". If I install the testing release, will I have 2 releases at the same time? How can I run the second one?

Sorry for these basic questions, as I'm very new with Linux & git :)

@alexey-milovidov
Member

1 - Is this a stable version?

Packages in testing repository have passed all CI tests but are not considered stable before they appear in the stable repository.

if not, when will it be a stable version?

We have a stable release every two weeks, most of the time.

2 - I installed this version already from "http://repo.yandex.ru/clickhouse/deb/stable/ main/". If I install the testing release, will I have 2 releases at the same time? How can I run the second one?

If you install another version of ClickHouse, it will replace previous version in your system.

@MahmoudGoda0

@alexey-milovidov Super clear!
Many thanks, Really appreciated.

@MahmoudGoda0

Hi @alexey-milovidov ,
I tried it with our data. Our issue was with 10 queries: 8 of them improved, but 2 of them became slower than on the old (stable) version. Can you please help me?
One query:
select TOP 1000 ext_rec_num,xdr_id,xdr_grp,xdr_type,xdr_subtype,xdr_direction,xdr_location, start_time,stop_time, transaction_duration,response_time,protocol,chunk_count,dpc, opc,first_link_id,last_dpc,last_opc,last_link_id,first_back_opc, first_back_link_id,calling_ssn,called_ssn,called_sccp_address, calling_party_address,response_calling_address,root_end_code, root_cause_code,root_cause_pl,root_failure,root_equip, map_end_code,map_cause_code,map_cause_pl,sip_end_code, sip_cause_code,sip_cause_pl,isup_end_code,isup_cause_code, isup_cause_pl,inap_end_code,inap_cause_code,inap_cause_pl, subs_id,mccmnc,imsi,msisdn,tac,imei,msc_number,vlr_number, hlr_number,service_key_1,camel_phases,camel_cpbt_handling from ss7_table where start_time > 971128806382 and start_time <971222387788 and (imsi = '938036746436052' or imsi = '938036687724700' or imsi = '938036746030189') ORDER BY start_time DESC;
It took 135 seconds, but now it takes 180 seconds :(

@den-crane
Contributor

@MahmoudGoda0 what is the ORDER BY on your table? (imsi, start_time) or just start_time?
(I think in the case of start_time it is expected behavior.)

Can you show extended statistics with send_logs_level = 'debug'?

@MahmoudGoda0

Yes, it's start_time only, but why is that expected?
I will gather the statistics and share the results with you.

@den-crane
Contributor

den-crane commented Aug 1, 2019

Yes, it's start_time only, but why is that expected?

Because a full scan plus ORDER BY over millions of rows is faster than millions of sequential index jumps (this holds in all databases; it will be the same in MySQL and PostgreSQL).

@MahmoudGoda0

Actually I can't get your point, but the CREATE statement for my table is below:
create table tsnew(
ext_rec_num Nullable(UInt64),
xdr_id Nullable(UInt64),
xdr_grp Nullable(UInt64),
xdr_type Nullable(UInt64),
xdr_subtype Nullable(Int16),
xdr_direction Nullable(Int16),
xdr_location Nullable(Int16),
time UInt64,
stop_time UInt64,
transaction_duration Nullable(UInt64),
response_time Nullable(UInt64),
protocol Nullable(Int16),
chunk_count Nullable(Int16),
dpc Nullable(Int32),
opc Nullable(Int32),
first_link_id String,
last_dpc Nullable(Int32),
last_opc Nullable(Int32),
last_link_id String,
first_back_opc Nullable(Int32),
first_back_link_id String,
calling_ssn Nullable(Int16),
called_ssn Nullable(Int16),
called_sccp_address String,
calling_party_address String,
response_calling_address String,
root_end_code Nullable(Int32),
root_cause_code Nullable(Int32),
root_cause_pl Nullable(Int16),
root_failure Nullable(Int16),
root_equip Nullable(Int16)
)
ENGINE = MergeTree()
PARTITION BY toInt64(time/3600000)*3600000
order by time
SETTINGS index_granularity = 8192
So in this case, what do I have to do to improve the response time?

@den-crane
Contributor

den-crane commented Aug 1, 2019

@MahmoudGoda0

Actually I can't get your point, but the CREATE statement for my table is below:

My point is that you don't have an index appropriate for your search condition, ORDER BY, and TOP 1000.
Just try to create another table with ORDER BY (imsi, time) and test your search.

In the case of ORDER BY (time), the database does millions of (index -> column) reads searching for rows with imsi = '938036746436052' and skips most of them. These reads are slower than a full scan of the imsi column.

@MahmoudGoda0
Copy link
Copy Markdown

@den-crane I think this is not the case, because with the same query but the WHERE clause
where start_time > 971128806382 and start_time < 971222387788 and imsi = '938036746436052'
the time = 8 seconds, which is very good.
But adding more conditions to the WHERE part:
where start_time > 971128806382 and start_time < 971222387788 and (imsi = '938036746436052' or imsi = '938036687724700' or imsi = '938036746030189')
makes the time = 180 seconds.

@alexey-milovidov
Member

@MahmoudGoda0 Do you mind converting imsi to UInt64?

@MahmoudGoda0

MahmoudGoda0 commented Aug 4, 2019

I can't, as I have values ending with "b",
like this one: "543036746630589b".

@alexey-milovidov
Member

alexey-milovidov commented Aug 4, 2019

@MahmoudGoda0 Ok.

@den-crane

In the case of ORDER BY (time), the database does millions of (index -> column) reads searching for rows with imsi = '938036746436052' and skips most of them. These reads are slower than a full scan of the imsi column.

In ClickHouse, an indexed read is never slower than a full scan (a query can be slower due to the index analysis stage, in rare cases). ClickHouse does not perform point reads; it always selects ranges in the data (though a range can be as small as a single granule), then reads these ranges, skipping unneeded data. Skipping unneeded data is done with file seeks, but if the seek target is inside the read buffer (typically 1 MB) or inside the current compressed block, reading just continues.
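The "seek inside the read buffer" idea above can be sketched with a tiny buffered reader. This is a hypothetical simplification (the file is a string and the class name is invented); ClickHouse's actual ReadBuffer machinery is far more elaborate, but the core decision is the same: if the seek target is already in the in-memory window, just move the cursor instead of doing a real seek and refill.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical buffered reader over an in-memory "file".
class BufferedReader
{
public:
    BufferedReader(std::string file, size_t buf_size)
        : file_(std::move(file)), buf_size_(buf_size) {}

    // Position the cursor at `offset`. Returns true if the target was
    // already inside the buffered window (cheap cursor move only).
    bool seek(size_t offset)
    {
        if (offset >= buffer_start_ && offset < buffer_end_)
        {
            pos_ = offset;              // in-buffer "seek": no I/O
            return true;
        }
        buffer_start_ = offset;         // real seek: refill the window
        buffer_end_ = std::min(offset + buf_size_, file_.size());
        pos_ = offset;
        ++real_seeks_;
        return false;
    }

    char read()
    {
        if (pos_ >= buffer_end_)
            seek(pos_);                 // window exhausted: refill from here
        return file_[pos_++];
    }

    size_t realSeeks() const { return real_seeks_; }

private:
    std::string file_;
    size_t buf_size_;
    size_t buffer_start_ = 0, buffer_end_ = 0, pos_ = 0;
    size_t real_seeks_ = 0;             // counts refills, i.e. actual seeks
};
```

Small skips within the window cost nothing, which is why skipping unneeded data between selected ranges is usually cheap.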

@MahmoudGoda0

@alexey-milovidov
Thanks for your explanation. So in my case, what is your recommendation?
I'm really counting on your support; let me know if you need any other info from my side.

@alexey-milovidov
Member

Reading "in order", as implemented in this task, can be slower than a usual read, due to:

  • merging of sorted data;
  • additional thread synchronization (to read sorted data in parallel with O(1) memory and maintain the order, the threads need to wait while another thread has finished reading some previous chunk of data).
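The first cost listed above, merging already-sorted streams, can be sketched with a classic k-way merge: every output row passes through a heap, which is overhead a plain unsorted read does not pay. This is an illustrative sketch, not ClickHouse's MergingSortedBlockInputStream, and it merges single values rather than blocks.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge several individually sorted streams into one globally sorted
// sequence. Each emitted value costs a heap pop and (usually) a push,
// i.e. O(log k) work per row for k streams.
std::vector<int> mergeSortedStreams(const std::vector<std::vector<int>> & streams)
{
    using Cursor = std::pair<int, size_t>;  // (current value, stream index)
    std::priority_queue<Cursor, std::vector<Cursor>, std::greater<>> heap;
    std::vector<size_t> offsets(streams.size(), 0);

    for (size_t i = 0; i < streams.size(); ++i)
        if (!streams[i].empty())
            heap.emplace(streams[i][0], i);

    std::vector<int> result;
    while (!heap.empty())
    {
        auto [value, i] = heap.top();
        heap.pop();
        result.push_back(value);
        if (++offsets[i] < streams[i].size())
            heap.emplace(streams[i][offsets[i]], i);  // advance that stream
    }
    return result;
}
```

Reading the same data without preserving order needs no heap and no cross-stream coordination, which is why the in-order pipeline can lose on queries that do not benefit from it.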

@MahmoudGoda0

OK, you mean that if I try some time after the import, it should be faster?

@alexey-milovidov
Member

alexey-milovidov commented Aug 4, 2019

We continue optimizing "in order" read: #6299
This is in active development.

But some rare queries will remain slightly slower with optimize_read_in_order enabled. These are queries that have an ORDER BY on a prefix of the primary key, a large LIMIT, and a WHERE condition that requires reading a huge number of records before the queried data is found. For these queries you can disable optimize_read_in_order manually.

@alexey-milovidov
Member

OK, you mean that if I try some time after the import, it should be faster?

Queries may become slightly faster due to a smaller number of data parts, but not very significantly, because the expensive merging step is still required.

@MahmoudGoda0

Can you please share how I can disable optimize_read_in_order manually?

@alexey-milovidov
Member

SET optimize_read_in_order = 0 in clickhouse-client interactively,
or add optimize_read_in_order=0 as a URL parameter in the HTTP interface,
or add this setting to the user profile.

@MahmoudGoda0

@alexey-milovidov
With optimize_read_in_order=0, the time is now 22 seconds.
I believe it should be lower than this. Do you agree?

@alexey-milovidov
Member

Sorry, I'm now out of the context of your queries and cannot be sure.
