
Fix max_distributed_connections #9673

Merged
alexey-milovidov merged 5 commits into ClickHouse:master from
azat:processors-max_distributed_connections
Mar 24, 2020

Conversation

@azat
Member

@azat azat commented Mar 15, 2020

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix max_distributed_connections (w/ and w/o Processors)

@qoega
Member

qoega commented Mar 16, 2020

Probably we should find a better way to test it:

```
2020-03-16 02:27:51 01085_max_distributed_connections:                                      [ FAIL ] 1.87 sec. - return code 124
```

@qoega qoega added no-docs-needed pr-improvement Pull request with some product improvements labels Mar 16, 2020
@azat
Member Author

azat commented Mar 16, 2020

Probably we should find a better way to test it:

Actually the test fails correctly without processors; digging in.

@KochetovNicolai
Member

Let me clarify what's going on. When we build the query pipeline, there are two parameters we are interested in: the number of streams we build (max_streams) and the number of threads we use to execute the query (max_threads).

Usually, max_streams = max_threads = settings.max_threads. Let's look at the cases when it is not:

  1. The storage returned fewer streams than settings.max_threads and the query is quite simple. For example, select * from system.numbers. Then max_threads = 1 < max_streams = settings.max_threads.
  2. The setting settings.max_streams_to_max_threads_ratio > 1. Then max_streams = settings.max_threads * settings.max_streams_to_max_threads_ratio, max_threads = settings.max_threads.
  3. The query is remote, so the setting settings.max_distributed_connections is used instead of settings.max_threads.

In order to implement 1., QueryPipeline calculates the maximum number of streams used at each step (see QueryPipeline::Streams::maxParallelStreams()). The number of threads is limited via QueryPipeline::getNumThreads() in TCPHandler.

In order to support 2., the number of threads was additionally limited by settings.max_threads in TCPHandler. But that is not good, because after this 3. is not correct (we use fewer threads than expected).

This PR fixes 3., but breaks 2., because max_threads becomes settings.max_threads * settings.max_streams_to_max_threads_ratio.

It is easy to fix: set query_pipeline.setMaxThreads() not in InterpreterSelectQuery::executeWithProcessors, but in InterpreterSelectQuery::executeFetchColumns, where the settings max_streams_to_max_threads_ratio and max_distributed_connections are checked. Then it will be possible to get the number of threads directly from the query pipeline, which is quite logical.

@KochetovNicolai
Member

Please also look here. I have forgotten to limit the number of threads for the HTTP case. It must also use QueryPipeline::getNumThreads().

@alexey-milovidov alexey-milovidov added the pr-bugfix Pull request with bugfix, not backported by default label Mar 16, 2020
@alexey-milovidov
Member

alexey-milovidov commented Mar 16, 2020

Added the "bugfix" label because this was a significant performance regression that will be visible on large clusters like those we have in Yandex Metrica (and, sadly enough, is not visible in any testing environment).

@azat
Member Author

azat commented Mar 16, 2020

Please also look here

@KochetovNicolai it has been fixed already

Member

I suppose using the minimum is not the best choice. Imagine a query like select number from system.numbers union all select number from system.numbers. It could have been executed in 2 threads, but after this change it will use only one.

I think we should think more about it.

Member Author

Indeed (I also thought so), and this will go away; I'm already testing a better approach.

Member Author

Hm, I thought I had found a better alternative, but I didn't (too many interpreters :) )

As for union all:

  • if max_threads=1 is specified, then it should use one thread (like before)
  • if max_threads=2 is specified, then it will be run in parallel (like before, since each interpreter will have max_streams=2)

@KochetovNicolai what did I miss?

Member Author

FWIW:

```shell
$ time clickhouse-client --max_threads 1 -q 'select sleep(1) union all select sleep(1)'
0
0
real    0m2.322s
$ time clickhouse-client --max_threads 2 -q 'select sleep(1) union all select sleep(1)'
0
0
real    0m1.327s
$ time clickhouse-client --max_threads 2 -q 'select sleep(1) union all select sleep(1)' --experimental_use_processors 0
0
0
real    0m1.419s
$ time clickhouse-client --max_threads 1 -q 'select sleep(1) union all select sleep(1)' --experimental_use_processors 0
0
0
real    0m2.421s
```

@KochetovNicolai
Member

@KochetovNicolai it has been fixed already

Thank you!

@azat azat force-pushed the processors-max_distributed_connections branch from af92dc6 to 7659d44 Compare March 16, 2020 19:44
@azat azat changed the title Fix max_distributed_connections under Processors Fix max_distributed_connections Mar 16, 2020
@azat azat force-pushed the processors-max_distributed_connections branch from 7659d44 to afbe87c Compare March 16, 2020 20:42
@azat
Member Author

azat commented Mar 17, 2020

Test failures look unrelated:

  • 00834_kill_mutation_replicated_zookeeper
  • SELECT arraySlice(arrayFill(x -> (((x.1) % 2) = 0), arrayMap(x -> (x, toString(x)), range(100000000))), 1, 10) (perf)

@azat
Member Author

azat commented Mar 17, 2020

Please also look here. I have forgotten to limit the number of threads for http case. It must also use QueryPipeline::getNumThreads().

@KochetovNicolai added a test to avoid a regression - 7c24cee9b79878d20278be89c3994ba59d5aa1ca

@azat azat requested a review from KochetovNicolai March 17, 2020 18:57
@azat azat force-pushed the processors-max_distributed_connections branch from 7c24cee to 3d126e4 Compare March 18, 2020 17:32
@KochetovNicolai
Member

01085_max_distributed_connections_http fails :(

@KochetovNicolai
Member

Actually, it seems that the reference file is wrong.

@KochetovNicolai
Member

```cpp
        if (max_streams > 1 && !is_remote)
            max_streams *= settings.max_streams_to_max_threads_ratio;
```

I think we need to limit the number of threads here too.

@azat
Member Author

azat commented Mar 19, 2020

Actually, it seems that reference file is wrong.

Yep, you are right!

@azat
Member Author

azat commented Mar 19, 2020

I think we need to limit the number of threads here too.

So revert the last change?

@KochetovNicolai
Member

So revert the last change?

Well, let's not revert anything.
Let's just fix your test and merge this PR, because it solves the issue and all tests are fine.
Then I'll check how max_streams_to_max_threads_ratio works, and fix it separately if something is wrong.

@azat azat force-pushed the processors-max_distributed_connections branch from 3d126e4 to b90d094 Compare March 19, 2020 17:22
@azat
Member Author

azat commented Mar 19, 2020

Functional stateless tests (thread) — fail: 20, passed: 1522, skipped: 8

00111_shard_external_sort_distributed breaks everything after it, but I cannot find much in the logs (maybe clickhouse-server was killed by OOM, especially the TSAN version).

P.S. The recent changes pass all tests except 01085_max_distributed_connections_http, where a wrong .reference file had been provided.

@azat
Member Author

azat commented Mar 20, 2020

Functional stateless tests (thread) — fail: 20, passed: 1522, skipped: 8

00965_live_view_watch_heartbeat (does not look relevant)

P.S. I can rebase to force CI

azat added 5 commits March 20, 2020 20:51
With Processors, max_distributed_connections is not used anymore, since
it always uses no more than max_threads (see
TCPHandler::processOrdinaryQueryWithProcessors()/
QueryPipeline::execute()/PipelineExecutor::execute()).

So just use QueryPipeline::getNumThreads(), which takes
max_distributed_connections into account.
@azat azat force-pushed the processors-max_distributed_connections branch from b90d094 to 550cfef Compare March 20, 2020 17:56
@azat
Member Author

azat commented Mar 21, 2020

Functional stateless tests (address)

These look flaky:

  • 01006_simpod_empty_part_single_column_write
  • 01019_alter_materialized_view_consistent

@alexey-milovidov
Member

alexey-milovidov commented Mar 24, 2020

Unfortunately, these tests clearly indicate a bug (most likely not in your PR).

@alexey-milovidov alexey-milovidov merged commit 9a2760d into ClickHouse:master Mar 24, 2020
@azat azat deleted the processors-max_distributed_connections branch March 24, 2020 06:43
alesapin pushed a commit that referenced this pull request Mar 26, 2020
…ions

Fix max_distributed_connections

(cherry picked from commit 9a2760d)
azat added a commit to azat/ClickHouse that referenced this pull request Dec 6, 2020
With prefer_localhost_replica=1 max_distributed_connections uses
max_threads before this patch.

Fixes: ClickHouse#14936
Previous fix: ClickHouse#9673

Labels

no-docs-needed pr-bugfix Pull request with bugfix, not backported by default pr-improvement Pull request with some product improvements
