
Optimization of ORDER BY with respect to the ORDER key in MergeTree tables.#5042

Merged
KochetovNicolai merged 42 commits into ClickHouse:master from anrodigina:clickhouse-4013
Jul 27, 2019

Conversation

@anrodigina
Contributor

@anrodigina anrodigina commented Apr 17, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

  • Improvement

Short description (up to few sentences):
https://st.yandex-team.ru/CLICKHOUSE-4013

Added ReverseBlockInputStream.
Added ORDER BY optimization in primary key order.
Added tests.

@alexey-milovidov alexey-milovidov changed the title Clickhouse 4013 Optimization of ORDER BY with respect to the ORDER key in MergeTree tables. Apr 17, 2019
@alexey-milovidov alexey-milovidov added pr-improvement Pull request with some product improvements pr-performance Pull request with some performance improvements labels Apr 17, 2019
SELECT d FROM test.pk_order ORDER BY (a, b);
SELECT d FROM test.pk_order ORDER BY a;

SELECT b FROM test.pk_order ORDER BY (a, b) DESC;
Contributor

What about select -a as a, -b as b from test.pk_order order by (a, b) before and after this patch?

Member

It should work correctly (as before), because aliases are substituted before interpreting:

select -a as a, -b as b from test.pk_order order by -a, -b


const Settings & settings = context.getSettingsRef();

const auto& order_direction = order_descr.at(0).direction;
Contributor

Why is only the first direction checked? I don't understand it.

Member

The code is totally wrong.

@nvartolomei
Contributor

nvartolomei commented Apr 17, 2019

On top of that, what happens when a MergeTree is created with ORDER BY ... DESC? IIRC it is allowed.

@alexey-milovidov
Member

when a MergeTree is created with ORDER BY ... DESC? IIRC it is allowed.

No, it's not allowed. You can write ORDER BY -x instead (but it doesn't make sense).

@alexey-milovidov
Member

@anrodigina It's not merged with master, that's why all builds have failed.


const auto& order_direction = order_descr.at(0).direction;

if (auto storage_merge_tree = dynamic_cast<StorageReplicatedMergeTree *>(storage.get()))
Member

Why ReplicatedMergeTree?


const auto& order_direction = order_descr.at(0).direction;

if (auto storage_merge_tree = dynamic_cast<StorageReplicatedMergeTree *>(storage.get()))
Member

const?

namespace DB
{

class ReverseBlockInputStream : public IBlockInputStream


Block readImpl() override;
};

} // namespace DB
Member

Useless comment.

class ReverseBlockInputStream : public IBlockInputStream
{
public:
ReverseBlockInputStream(const BlockInputStreamPtr& input);
Member

Style.

return Block();
}

PaddedPODArray<size_t> permutation;
Member

IColumn::Permutation


PaddedPODArray<size_t> permutation;

for (size_t i = 0; i < result_block.rows(); ++i)
Member

Method Block::rows is called in a loop.

permutation.emplace_back(result_block.rows() - 1 - i);
}

for (auto iter = result_block.begin(); iter != result_block.end(); ++iter)
Member

A range-based loop will be OK.
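The review suggestions above (hoist the row-count call out of the loop, use a range-based loop over the columns, build a reverse permutation) can be sketched as follows. This is a hypothetical, simplified illustration using plain std::vector as a stand-in "column"; ClickHouse's real IColumn/Block API is different.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-ins for ClickHouse's Block/IColumn (assumption: real
// code uses IColumn::Permutation and IColumn::permute instead).
using Column = std::vector<int>;
using Block = std::vector<Column>;

Block reverseBlock(const Block & block)
{
    if (block.empty())
        return {};

    const size_t rows = block.front().size();  // called once, not per iteration

    std::vector<size_t> permutation;
    permutation.reserve(rows);
    for (size_t i = 0; i < rows; ++i)
        permutation.push_back(rows - 1 - i);   // reverse permutation

    Block result;
    for (const auto & column : block)          // range-based loop over columns
    {
        Column reversed;
        reversed.reserve(rows);
        for (size_t idx : permutation)
            reversed.push_back(column[idx]);
        result.push_back(std::move(reversed));
    }
    return result;
}
```

With the row count hoisted, each column is reversed with one pass over a precomputed permutation, which is the same shape as applying IColumn::permute in the actual stream.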

{
bool need_sorting = false;
const auto& sorting_key_order = storage_merge_tree->getSortingKeyColumns();
if (!(sorting_key_order.size() < order_descr.size()) && !query.limitByValue() && !query.groupBy())
Member

What's wrong with LIMIT BY?

{
bool need_sorting = false;
const auto& sorting_key_order = storage_merge_tree->getSortingKeyColumns();
if (!(sorting_key_order.size() < order_descr.size()) && !query.limitByValue() && !query.groupBy())
Member

What about FinishSortingBlockInputStream?

{
query_info.do_not_steal_task = true;

pipeline.transform([&](auto & stream)
Member

Extra whitespace.


if (!need_sorting)
{
query_info.do_not_steal_task = true;
Member

Missing comment.

{
for (size_t i = 0; i < order_descr.size(); ++i)
{
if (order_descr[i].column_name != sorting_key_order[i]
Member

SELECT column AS x ... ORDER BY x

const auto& sorting_key_order = storage_merge_tree->getSortingKeyColumns();
if (!(sorting_key_order.size() < order_descr.size()) && !query.limitByValue() && !query.groupBy())
{
for (size_t i = 0; i < order_descr.size(); ++i)
Member

Method std::vector<...>::size is called in a loop.

}
}

if (!need_sorting)
Member

Extra whitespace.

stream = std::make_shared<AsynchronousBlockInputStream>(stream);
});

if (order_direction == -1)
Member

Extra whitespace.


PrewhereInfoPtr prewhere_info;

bool do_not_steal_task = false;
Member

Missing comment.

return res;
}

BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreamsPKOrder(
Member

@CurtizJ CurtizJ Jun 22, 2019

I tried to hardcode query_info.read_in_pk_order = true when fetching columns to force this part of the code to execute, and it seems that the order of reading with this pipeline is still nondeterministic.

@alexey-milovidov
Member

@CurtizJ will do this task after another task (rewriting DNS Cache).

@KochetovNicolai KochetovNicolai merged commit 70159fb into ClickHouse:master Jul 27, 2019
KochetovNicolai added a commit that referenced this pull request Jul 27, 2019
Optimization of ORDER BY with respect to the ORDER key in MergeTree tables (continuation of #5042).
@MahmoudGoda0

Hi guys,
Can I know when this fix will be available?
This fix is really urgent for my company to use ClickHouse as our main DB engine.

@alexey-milovidov
Member

@MahmoudGoda0

Thanks, I have 2 questions please:
1 - Is this a stable version? If not, when will it be a stable version?
2 - I installed this version already from "http://repo.yandex.ru/clickhouse/deb/stable/ main/". If I install the testing release, will I have 2 releases at the same time? How can I run the second one?

Sorry for these basic questions, as I'm very new with Linux & git :)

@alexey-milovidov
Member

1 - Is this a stable version?

Packages in testing repository have passed all CI tests but are not considered stable before they appear in the stable repository.

if not, when will it be a stable version?

We have a stable release every two weeks, most of the time.

2 - I installed this version already from "http://repo.yandex.ru/clickhouse/deb/stable/ main/". If I install the testing release, will I have 2 releases at the same time? How can I run the second one?

If you install another version of ClickHouse, it will replace previous version in your system.

@MahmoudGoda0

@alexey-milovidov Super clear!
Many thanks, Really appreciated.

@MahmoudGoda0

Hi @alexey-milovidov ,
I tried it with our data. Our issue was with 10 queries: 8 of them improved, but 2 of them became slower than on the old (stable) version. Can you please help me?
One query:
select TOP 1000 ext_rec_num,xdr_id,xdr_grp,xdr_type,xdr_subtype,xdr_direction,xdr_location, start_time,stop_time, transaction_duration,response_time,protocol,chunk_count,dpc, opc,first_link_id,last_dpc,last_opc,last_link_id,first_back_opc, first_back_link_id,calling_ssn,called_ssn,called_sccp_address, calling_party_address,response_calling_address,root_end_code, root_cause_code,root_cause_pl,root_failure,root_equip, map_end_code,map_cause_code,map_cause_pl,sip_end_code, sip_cause_code,sip_cause_pl,isup_end_code,isup_cause_code, isup_cause_pl,inap_end_code,inap_cause_code,inap_cause_pl, subs_id,mccmnc,imsi,msisdn,tac,imei,msc_number,vlr_number, hlr_number,service_key_1,camel_phases,camel_cpbt_handling from ss7_table where start_time > 971128806382 and start_time <971222387788 and (imsi = '938036746436052' or imsi = '938036687724700' or imsi = '938036746030189') ORDER BY start_time DESC;
It took 135 seconds, but now it takes 180 seconds :(

@den-crane
Contributor

@MahmoudGoda0 what is the ORDER BY on your table? (imsi, start_time) or just start_time?
(I think in the case of start_time it is expected behavior.)

Can you show extended statistics with send_logs_level = 'debug'?

@MahmoudGoda0

Yes, it's start_time only, but why is that expected?
I will gather the statistics and share the results with you.

@den-crane
Contributor

den-crane commented Aug 1, 2019

Yes, it's start_time only, but why is that expected?

Because a full scan plus ORDER BY over millions of rows is faster than millions of sequential index jumps (this holds in all databases; it will be the same in MySQL and PostgreSQL).

@MahmoudGoda0

Actually I can't get your point, but the CREATE statement for my table is below:
create table tsnew(
ext_rec_num Nullable(UInt64),
xdr_id Nullable(UInt64),
xdr_grp Nullable(UInt64),
xdr_type Nullable(UInt64),
xdr_subtype Nullable(Int16),
xdr_direction Nullable(Int16),
xdr_location Nullable(Int16),
time UInt64,
stop_time UInt64,
transaction_duration Nullable(UInt64),
response_time Nullable(UInt64),
protocol Nullable(Int16),
chunk_count Nullable(Int16),
dpc Nullable(Int32),
opc Nullable(Int32),
first_link_id String,
last_dpc Nullable(Int32),
last_opc Nullable(Int32),
last_link_id String,
first_back_opc Nullable(Int32),
first_back_link_id String,
calling_ssn Nullable(Int16),
called_ssn Nullable(Int16),
called_sccp_address String,
calling_party_address String,
response_calling_address String,
root_end_code Nullable(Int32),
root_cause_code Nullable(Int32),
root_cause_pl Nullable(Int16),
root_failure Nullable(Int16),
root_equip Nullable(Int16)
)
ENGINE = MergeTree()
PARTITION BY toInt64(time/3600000)*3600000
order by time
SETTINGS index_granularity = 8192
So in this case, what do I have to do to improve the response time?

@den-crane
Contributor

den-crane commented Aug 1, 2019

@MahmoudGoda0

Actually I can't get your point, but the CREATE statement for my table is below:

My point is that you don't have an index appropriate for your search condition, ORDER BY, and TOP 1000.
Just try to create another table with ORDER BY (imsi, time) and test your search.

In the case of ORDER BY (time), the database does millions of (index -> column) reads searching for rows with imsi = '938036746436052' and skips most of them. These reads are slower than a full scan of the imsi column.

@MahmoudGoda0
Copy link
Copy Markdown

@den-crane I think this is not the case, because with the same query but the WHERE clause
where start_time > 971128806382 and start_time < 971222387788 and imsi = '938036746436052'
the time = 8 seconds, which is very good.
But adding more conditions to the WHERE part:
where start_time > 971128806382 and start_time < 971222387788 and (imsi = '938036746436052' or imsi = '938036687724700' or imsi = '938036746030189')
makes the time = 180 seconds.

@alexey-milovidov
Member

@MahmoudGoda0 Do you mind converting imsi to UInt64?

@MahmoudGoda0

MahmoudGoda0 commented Aug 4, 2019

I can't, as I have values ending with "b",
like this one: "543036746630589b".

@alexey-milovidov
Member

alexey-milovidov commented Aug 4, 2019

@MahmoudGoda0 Ok.

@den-crane

In the case of ORDER BY (time), the database does millions of (index -> column) reads searching for rows with imsi = '938036746436052' and skips most of them. These reads are slower than a full scan of the imsi column.

In ClickHouse, an indexed read is never slower than a full scan (a query can be slower due to the index analysis stage, in rare cases). ClickHouse does not perform point reads; it always selects ranges in the data (though a range can be as small as a single granule), then reads these ranges, skipping unneeded data. Skipping unneeded data is done with file seeks, but if the seek target is inside the read buffer (typically 1 MB) or inside the current compressed block, reading just continues.
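The "seek inside the read buffer" idea above can be sketched with a tiny buffered reader. This is a hypothetical simplification (the file is a string and the class name is invented); ClickHouse's actual ReadBuffer machinery is far more elaborate, but the core decision is the same: if the seek target is already in the in-memory window, just move the cursor instead of doing a real seek and refill.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical buffered reader over an in-memory "file".
class BufferedReader
{
public:
    BufferedReader(std::string file, size_t buf_size)
        : file_(std::move(file)), buf_size_(buf_size) {}

    // Position the cursor at `offset`. Returns true if the target was
    // already inside the buffered window (cheap cursor move only).
    bool seek(size_t offset)
    {
        if (offset >= buffer_start_ && offset < buffer_end_)
        {
            pos_ = offset;              // in-buffer "seek": no I/O
            return true;
        }
        buffer_start_ = offset;         // real seek: refill the window
        buffer_end_ = std::min(offset + buf_size_, file_.size());
        pos_ = offset;
        ++real_seeks_;
        return false;
    }

    char read()
    {
        if (pos_ >= buffer_end_)
            seek(pos_);                 // window exhausted: refill from here
        return file_[pos_++];
    }

    size_t realSeeks() const { return real_seeks_; }

private:
    std::string file_;
    size_t buf_size_;
    size_t buffer_start_ = 0, buffer_end_ = 0, pos_ = 0;
    size_t real_seeks_ = 0;             // counts refills, i.e. actual seeks
};
```

Small skips within the window cost nothing, which is why skipping unneeded data between selected ranges is usually cheap.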

@MahmoudGoda0

@alexey-milovidov
Thanks for your explanation. So in my case, what is your recommendation?
I'm really counting on your support; let me know if you need any other info from my side.

@alexey-milovidov
Member

Reading "in order", as implemented in this task, can be slower than a usual read, due to:

  • merging of sorted data;
  • additional thread synchronization (to read sorted data in parallel with O(1) memory and maintain the order, the threads need to wait while another thread has finished reading some previous chunk of data).
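The first cost listed above, merging already-sorted streams, can be sketched with a classic k-way merge: every output row passes through a heap, which is overhead a plain unsorted read does not pay. This is an illustrative sketch, not ClickHouse's MergingSortedBlockInputStream, and it merges single values rather than blocks.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge several individually sorted streams into one globally sorted
// sequence. Each emitted value costs a heap pop and (usually) a push,
// i.e. O(log k) work per row for k streams.
std::vector<int> mergeSortedStreams(const std::vector<std::vector<int>> & streams)
{
    using Cursor = std::pair<int, size_t>;  // (current value, stream index)
    std::priority_queue<Cursor, std::vector<Cursor>, std::greater<>> heap;
    std::vector<size_t> offsets(streams.size(), 0);

    for (size_t i = 0; i < streams.size(); ++i)
        if (!streams[i].empty())
            heap.emplace(streams[i][0], i);

    std::vector<int> result;
    while (!heap.empty())
    {
        auto [value, i] = heap.top();
        heap.pop();
        result.push_back(value);
        if (++offsets[i] < streams[i].size())
            heap.emplace(streams[i][offsets[i]], i);  // advance that stream
    }
    return result;
}
```

Reading the same data without preserving order needs no heap and no cross-stream coordination, which is why the in-order pipeline can lose on queries that do not benefit from it.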

@MahmoudGoda0

OK, you mean that if I try some time after the import, it should be faster?

@alexey-milovidov
Member

alexey-milovidov commented Aug 4, 2019

We continue optimizing "in order" read: #6299
This is in active development.

But some rare queries will remain slightly slower with optimize_read_in_order enabled. These are queries that have an ORDER BY on a prefix of the primary key, a large LIMIT, and a WHERE condition that requires reading a huge number of records before the queried data is found. For these queries you can disable optimize_read_in_order manually.

@alexey-milovidov
Member

OK, you mean that if I try some time after the import, it should be faster?

Queries may become slightly faster due to a smaller number of data parts, but not very significantly, because the expensive merging step is still required.

@MahmoudGoda0

Can you please share how I can disable optimize_read_in_order manually?

@alexey-milovidov
Member

SET optimize_read_in_order = 0 in clickhouse-client interactively,
or add optimize_read_in_order=0 as a URL parameter in the HTTP interface,
or add this setting to the user profile.

@MahmoudGoda0

@alexey-milovidov
With optimize_read_in_order=0, the time is now 22 seconds.
I believe it should be lower than this. Do you agree?

@alexey-milovidov
Member

Sorry, I'm now out of the context of your queries and cannot be sure.
