
Parallel wide parts writer #14150

Closed
excitoon wants to merge 9 commits into ClickHouse:master from excitoon-favorites:parallelwidewrite

Conversation

@excitoon
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one)

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Attempt to make wide parts faster in S3.

@robot-clickhouse robot-clickhouse added the pr-performance Pull request with some performance improvements label Aug 27, 2020
@excitoon excitoon force-pushed the parallelwidewrite branch 2 times, most recently from 62fc420 to 4ede75b Compare September 5, 2020 17:34
@excitoon excitoon changed the title Attempt to make wide parts faster in S3 Parallel wide parts writer Sep 11, 2020
@excitoon excitoon marked this pull request as ready for review September 11, 2020 14:36
@excitoon excitoon force-pushed the parallelwidewrite branch 6 times, most recently from 3c0b522 to 764f14a Compare September 11, 2020 22:12
Member

@azat azat left a comment


Do I understand correctly that this PR makes writes to MergeTree run in multiple threads, so that if I INSERT two separate blocks (i.e. run two INSERT queries), each will be processed in 16 threads?

I guess (though I don't know for sure) that this is fine for S3, but not for local filesystems.

So looks to me that:

  • the default should be 1 (or some auto-detection)
  • the thread pool should be static (maybe greater in size, but static), since right now you can use up all 10K threads pretty quickly
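The two suggestions above can be sketched roughly like this (a minimal illustration with a hypothetical helper name, not ClickHouse's actual API): the setting defaults to 1 so writes stay serial, a hypothetical value of 0 means "auto-detect", and the resulting count is what a single process-wide static pool would be sized with.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical helper, not ClickHouse code: resolve the writer thread
// count from the setting. Default 1 keeps writes serial; 0 stands for
// "auto-detect" from the hardware, as the review suggests.
inline size_t effectiveWriterThreads(size_t setting_value)
{
    if (setting_value == 0)
        return std::max<size_t>(1, std::thread::hardware_concurrency());
    return setting_value;  // explicit values are taken as-is
}
```

A static, process-wide pool would then be constructed once with this size, rather than a fresh pool per INSERT block.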

@Akazz Akazz self-assigned this Sep 15, 2020
@excitoon excitoon force-pushed the parallelwidewrite branch 3 times, most recently from 6f0707b to c5ef8cb Compare September 30, 2020 14:03
@excitoon excitoon force-pushed the parallelwidewrite branch 2 times, most recently from 7bc20f3 to 9bd5965 Compare October 6, 2020 07:24
@excitoon excitoon marked this pull request as draft October 29, 2020 05:38
@excitoon
Contributor Author

excitoon commented Nov 30, 2020

@azat I switched the default number of threads back to 1, but it is not very clear to me why in some cases we can use separate threads, for example here:

pool.scheduleOrThrowOnError([&, part_index, thread_group = CurrentThread::getGroup()] {

Can you also clarify what the scope of the thread pool should be: global, per-table, or query-specific?

@excitoon excitoon marked this pull request as ready for review November 30, 2020 04:23
@azat
Member

azat commented Dec 2, 2020

but it is not very clear for me why in some cases we can use separate threads, for example here:

This is for SELECT query, and:

  • it respects the max_threads setting (so if I set max_threads=1, it will not do parallel loading; max_threads is also lowered to 1 in case of a simple SELECT with LIMIT)
  • it is not done for each block, while your version does this for each INSERT block; with multiple INSERTs into one table you will get number_of_inserts * merge_tree_writer_max_threads threads, plus there is also max_insert_threads, so 16 as the default value does not look sane to me
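The multiplication described above can be made concrete (an illustrative helper, not an actual ClickHouse function):

```cpp
#include <cassert>
#include <cstddef>

// Worst-case writer thread count when each concurrent INSERT block
// spawns its own pool: the per-block setting multiplies with the
// number of concurrent INSERTs.
inline size_t worstCaseWriterThreads(size_t concurrent_inserts, size_t writer_threads_per_block)
{
    return concurrent_inserts * writer_threads_per_block;
}
```

With the proposed default of 16, just ten concurrent INSERTs into one table already account for 160 writer threads, before max_insert_threads is even considered.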

Member

@azat azat left a comment


BTW, it looks like this can be covered by a test (at least for plain MergeTree, without S3), since those threads should be accounted for in query_log (after the suggested change below)

Member

@azat azat Dec 2, 2020


Using a thread pool to write columns in parallel to a regular filesystem looks like overkill if the block is small enough (I guess it is OK if MergeTree uses S3 as the underlying storage)

Member


Also, these threads will not be accounted as threads of the query; to make them accounted for, you need to wrap write_column_job with something like:

auto thread_group = CurrentThread::getGroup();
...

writing_thread_pool->scheduleOrThrowOnError([&write_column_job, thread_group]() {
    setThreadName("QueryPipelineEx");

    if (thread_group)
        CurrentThread::attachTo(thread_group);
    SCOPE_EXIT(
            if (thread_group)
                CurrentThread::detachQueryIfNotDetached();
    );

    write_column_job();
});

P.S. the code is completely untested

Contributor

@Akazz Akazz left a comment


Your PR currently calls for at least some tests for your setting merge_tree_writer_max_threads_per_block. Such tests could also explain this PR's purpose and highlight the problem that it solves.

As I understand it, with your changes it will be difficult to reason about how many threads query processing will take. I also do not believe that trying to parallelize data writes at this level is a good idea in general.

if (settings.max_threads_per_block != 1)
{
offset_columns_per_column.reserve(columns_list.size());
writing_thread_pool = std::make_unique<ThreadPool>(settings.max_threads_per_block);
Contributor


It turns out we create a separate thread pool each time a block is written to the DataPart. Meanwhile, the pool's size is defined by settings.max_threads_per_block, which presumably stays the same between such writes. So this object could instead be created once, somewhere in the enclosing code.
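A minimal sketch of this suggestion (FakePool and WideWriter are stand-ins, not the real ThreadPool or MergeTreeDataPartWriterWide): construct the pool once in the enclosing object and reuse it for every block write.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in for ThreadPool; only records its size here.
struct FakePool
{
    explicit FakePool(size_t size_) : size(size_) {}
    size_t size;
};

// Stand-in for the writer: the pool is created once in the constructor
// and reused by every writeBlock() call, instead of being rebuilt
// inside each block write.
class WideWriter
{
public:
    explicit WideWriter(size_t max_threads_per_block)
    {
        if (max_threads_per_block != 1)
            pool = std::make_unique<FakePool>(max_threads_per_block);
    }

    // Returns how many threads this block's write would use.
    size_t writeBlock()
    {
        ++blocks_written;
        return pool ? pool->size : 1;  // no pool means serial writes
    }

    size_t blocks_written = 0;
    std::unique_ptr<FakePool> pool;
};
```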

Contributor


Also, note that this setting potentially ignores any limit on the number of execution threads prescribed by max_threads.

writing_thread_pool->wait();

// data_written = std::any_of(column_data_written.begin(), column_data_written.end(), std::identity());
data_written = std::find(column_data_written.begin(), column_data_written.end(), true) != column_data_written.end();
Contributor


What is the meaning of this data_written field now? What happens if any of the write tasks fails and some of the columns fail to be written?
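The failure mode behind this question can be sketched with plain std::async instead of the PR's ThreadPool (simulated code, assuming that waiting on the tasks rethrows a task's exception): one failing column leaves the per-column flags only partially set.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <stdexcept>
#include <vector>

// Simulated parallel column write: the task for `failing_column` throws,
// the others set their "written" flag. Returns false if any task failed.
// `written` uses int rather than vector<bool> so that distinct elements
// can be safely modified from different threads.
inline bool writeColumnsInParallel(std::vector<int> & written, size_t failing_column)
{
    std::vector<std::future<void>> tasks;
    for (size_t i = 0; i < written.size(); ++i)
        tasks.push_back(std::async(std::launch::async, [&written, i, failing_column]
        {
            if (i == failing_column)
                throw std::runtime_error("simulated I/O error");
            written[i] = 1;  // this column's data made it to storage
        }));

    bool ok = true;
    for (auto & task : tasks)
    {
        try { task.get(); }  // rethrows the task's exception
        catch (const std::runtime_error &) { ok = false; }
    }
    return ok;
}
```

In the PR's terms, data_written would then end up true even though the part is incomplete, which is presumably the reviewer's concern.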

}
else

column_data_written[i] = written;
Contributor


Can this column_data_written[i] simply be captured by reference under the alias written (in a similar manner to offset_columns)?
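This suggestion looks roughly like the following simplified sketch; the real code schedules write_column_job on a pool, and concurrently written flags would need e.g. char or atomic elements rather than vector<bool>, whose operator[] returns a proxy, not a real reference.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: capture the per-column flag by reference under the alias
// `written`, the same way offset_columns is captured, so the job writes
// through it directly instead of assigning column_data_written[i] after.
inline std::vector<int> runColumnJobs(size_t num_columns)
{
    std::vector<int> column_data_written(num_columns, 0);
    for (size_t i = 0; i < num_columns; ++i)
    {
        auto & written = column_data_written[i];        // reference alias
        auto write_column_job = [&written] { written = 1; };
        write_column_job();  // pool.scheduleOrThrowOnError(...) in real code
    }
    return column_data_written;
}
```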

}

it = columns_list.begin();
for (size_t i = 0; i < columns_list.size(); ++i, ++it)
Contributor


Parallelizing the data write by columns might undermine the idea altogether in the case of a single fat column: parallelization would be completely pointless in that case.

@Akazz
Contributor

Akazz commented Jan 26, 2021

After internal discussion we decided to decline this implementation for a number of reasons (most of them are mentioned in the discussion above). The task of parallelizing I/O writes does not seem to belong to this entity (MergeTreeDataPartWriterWide).


Labels

pr-performance Pull request with some performance improvements


5 participants