
Some fixes for ReplicatedMergeTree#42878

Merged
tavplubix merged 11 commits into master from fix_intersecting_parts2
Nov 9, 2022

Conversation

@tavplubix
Member

@tavplubix tavplubix commented Nov 1, 2022

Changelog category (leave one): Not for changelog (changelog entry is not required)

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Nov 1, 2022
@tavplubix tavplubix changed the title Some experiments with ReplicatedMergeTree Some fixes for ReplicatedMergeTree Nov 2, 2022
@tavplubix tavplubix marked this pull request as ready for review November 2, 2022 18:46
@tavplubix
Member Author

Builds - #43084
Integration tests (tsan) [1/4] - #42734
Stateless tests flaky check (asan) - 01158_zookeeper_log_long flakiness fixed in #42607
Stress test (msan) - clearOldPartsFromFilesystem on shutdown
Stress test (ubsan) - #41218

@azat
Member

azat commented Jan 24, 2023

@tavplubix can you please comment on what was the motivation to remove that 2-RTT lock from EphemeralLockInZooKeeper?

The reason I'm asking is that even after #43675, traffic and the number of List requests to ZooKeeper are 4x higher than without this patch, and the setup is pretty simple: only 50 partitions, 10-20 tables, and not that many INSERTs.

@tavplubix
Member Author

@tavplubix can you please comment on what was the motivation to remove that 2-RTT lock from EphemeralLockInZooKeeper?

The motivation was that it was 2RTT :)
Well, actually, there was another reason to remove it: I wanted to write some useful metadata (instead of an abandonable lock path) into block number nodes (see clearLockedBlockNumbersInPartition for details).

The reason I'm asking is that even after #43675, traffic and the number of List requests to ZooKeeper are 4x higher than without this patch, and the setup is pretty simple: only 50 partitions, 10-20 tables, and not that many INSERTs.

We need smarter scheduling of background tasks in ReplicatedMergeTree tables. It does not make sense to run the merge selecting task every merge_selecting_sleep_ms (5000 by default) when the table is rarely updated and no data is inserted.

@NickStepanov

@tavplubix is there any chance to roll this back?

This change brought our ZooKeeper cluster to its knees: the number of requests went up so dramatically that the replication queue went out of control and halted the entire cluster for the better part of two days before we figured out what caused it.
[Screenshot, 2023-02-26 17:10: graph of the spike in ZooKeeper requests]

This all happened after upgrading to 22.12; rolling back to 22.8 didn't help to revert the situation, by the way. We had to use the solution described here: #43647 (comment)

Changes like these should be tested more carefully not to affect users in such a dramatic way.

@tavplubix
Member Author

How many replicated tables and how many partitions in total do you have? Is it correct that you have a lot of replicated tables with very rare inserts? Then you probably hit another issue: #31919. Yes, this PR made it worse, but the root cause is different. We have to fix background task scheduling so that these tasks do not run too frequently for unchanged tables. As a temporary workaround, you can tune the merge_selecting_sleep_ms and cleanup_delay_period settings.

@tavplubix is there any chance to roll this back?

@NickStepanov, feel free to send a pull request that rolls this back. However, you have to provide an alternative fix for this issue first if you want to roll this back.

Changes like these should be tested more carefully not to affect users in such a dramatic way.

I agree that there's still plenty of room for improvement in our CI system, but it's hard to test all marginal use cases. Also, we highly recommend that users have a staging cluster with a workload similar to what they have in production. Checking everything on staging before rolling it out to production guarantees that your production will not be suddenly brought to its knees in a dramatic way.

@NickStepanov

NickStepanov commented Feb 28, 2023

Thanks for coming back @tavplubix

I am pretty confident that the issue we're facing is the one described here: #43647 (comment)

And the workaround:

Manually remove /table_path_in_zk/temp/abandonable_lock-insert and /table_path_in_zk/temp/abandonable_lock-other nodes from ZooKeeper for each table.

Did help us.
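For anyone scripting this cleanup across many tables, a minimal sketch of building the two node paths to delete per table (the helper name and the example table path are mine, not from the thread; the actual deletion via the `kazoo` client is shown commented out as an assumption — `zkCli.sh deleteall` works just as well):

```python
def legacy_lock_paths(table_zk_path: str) -> list[str]:
    # The two legacy abandonable-lock nodes named in the workaround above.
    base = table_zk_path.rstrip("/")
    return [
        f"{base}/temp/abandonable_lock-insert",
        f"{base}/temp/abandonable_lock-other",
    ]

# Hypothetical table path for illustration:
paths = legacy_lock_paths("/clickhouse/tables/01/events")
print(paths)

# Deletion with kazoo (assumed client, not executed here):
# from kazoo.client import KazooClient
# zk = KazooClient(hosts="zk:2181")
# zk.start()
# for p in paths:
#     zk.delete(p, recursive=True)
```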

The number of tables we have is around 500, with 500k+ parts in total (if I calculate correctly). The inserts are definitely not rare, I would say the insert rate is actually pretty high.

Unfortunately, I am not much of a developer, so a PR to roll it back or to provide an alternative fix is not something I can offer.

And yes, a good point on more testing before upgrades; that's what we're planning to do more of. But, as everywhere, testing on staging doesn't guarantee a problem-free upgrade in prod. This problem seems to manifest itself mostly on highly loaded systems.

@den-crane
Contributor

@NickStepanov

500k+ parts

Please share the output of:

select count(), uniqExact(partition_id) p 
from system.parts where active 
group by database, table order by p desc limit 10;

@NickStepanov

NickStepanov commented Mar 1, 2023

@den-crane alright, this is what we have:

    row  count()  p
    0    18912    18912
    1    3658     2604
    2    4113     2587
    3    3290     2570
    4    3936     2570
    5    3840     2568
    6    2565     2565
    7    2906     1914
    8    1897     1897
    9    1330     1195

@tavplubix
Member Author

I am pretty confident that the issue we're facing is the one described here: #43647 (comment)

That comment was about a high number of empty partitions that were dropped or cleaned up by TTL. It was fixed in #43675.

However, it's not the only case that can lead to ZooKeeper overload. Before this PR, the merge selecting task worked like this:

  1. Check if the table has any INSERT queries in progress (1 zk request)
  2. If there are some INSERTs - load block numbers for all partitions (N zk requests, where N is the number of partitions)

And now it always loads block numbers for all partitions. This does not actually change the number of zk requests if you always have some INSERT queries running. But if a table has many partitions and inserts into it are infrequent, the merge selecting task will constantly reload all block numbers. The real problem is that we don't need to run the merge selecting task at all if no new parts were inserted, and that's what #31919 is about.
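A rough model (my own sketch, not ClickHouse code) of the per-iteration ZooKeeper request count implied by the two behaviors described above, where `n_partitions` plays the role of N:

```python
def zk_requests_old(n_partitions: int, inserts_in_progress: bool) -> int:
    # Old behavior: 1 request to check for running inserts; block numbers
    # for all partitions are loaded only if some INSERT is in progress.
    return 1 + (n_partitions if inserts_in_progress else 0)

def zk_requests_new(n_partitions: int, inserts_in_progress: bool) -> int:
    # After this PR: block numbers are always loaded, one request per partition.
    return n_partitions

# With 50 partitions and no inserts running, the per-iteration cost goes
# from 1 request to 50:
print(zk_requests_old(50, False), zk_requests_new(50, False))  # 1 50
```

This is why idle tables with many partitions are the worst case: the old fast path (1 request) disappears while the busy-table cost stays roughly the same.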

0 18912 18912

It means that you have a table with almost 19k partitions and 19k parts, so it makes 19k zk requests every merge_selecting_sleep_ms (5 seconds, so almost 4k rps). The number of parts equals the number of partitions, so all parts are merged, and I can assume that inserts into that table are quite rare (otherwise we would see unmerged parts in some partitions). It makes sense to increase merge_selecting_sleep_ms for that table.
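Checking the arithmetic above (a quick sketch, using the 18912-partition figure from the shared query output and the default 5000 ms interval):

```python
partitions = 18912
merge_selecting_sleep_ms = 5000

# One List request per partition, once per merge-selecting iteration:
rps = partitions / (merge_selecting_sleep_ms / 1000)
print(round(rps))  # 3782, i.e. "almost 4k rps" for this single table
```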

In addition to implementing smarter merge selecting task scheduling, we can avoid loading block numbers for "small" partitions with only a few parts and prefer partitions with many parts when selecting a merge.

@tavplubix
Member Author

Or we can simply bring the check for running inserts back. It seems possible to do without 2 RTT on insert and without breaking backward compatibility; I will check.

@NickStepanov

@tavplubix ok, I am starting to get a bit more confused :)

  1. What definitely helped in our case was removing the nodes /table_path_in_zk/temp/abandonable_lock-insert and /table_path_in_zk/temp/abandonable_lock-other for all of our tables. The second we removed them, the entire ZooKeeper and ClickHouse cluster came back to life. Therefore, a process related to these nodes is the one that caused trouble for us.
  2. Our tables get a good amount of inserts, definitely not something I would call rare.

@tavplubix
Member Author

What has definitely helped in our case, was removing the files like /table_path_in_zk/temp/abandonable_lock-insert and /table_path_in_zk/temp/abandonable_lock-other

It's because these nodes make the merge selecting task on old versions skip the first step and always load all block numbers.

@NickStepanov

Ok, I see, so if the check for running inserts (step 1) is reintroduced, that's going to solve this?

@tavplubix
Member Author

so if the check for running inserts (step 1) is reintroduced, that's going to solve this?

Yep. But other improvements (like smarter scheduling) still make sense.

@den-crane
Contributor

@NickStepanov Also try

   <merge_tree>
        <cleanup_delay_period>300</cleanup_delay_period>
        <merge_selecting_sleep_ms>600000</merge_selecting_sleep_ms>
   </merge_tree>

We (Altinity) tested it with clients who have issues with ZooKeeper load; no harm from it.
It will reduce CPU usage of ClickHouse and ZooKeeper in your system on any version of ClickHouse,
because currently the merge selector analyzes all your parts/partitions every 5 seconds.
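A back-of-the-envelope sketch (mine, not from the thread) of what the settings above buy you: raising merge_selecting_sleep_ms from the default 5000 ms to 600000 ms runs the merge selector 120x less often, cutting its per-table ZooKeeper traffic by the same factor.

```python
default_ms = 5000
tuned_ms = 600_000

# How many fewer merge-selector iterations (and thus ZK request bursts)
# per unit of time after the tuning:
print(tuned_ms // default_ms)  # 120
```

The trade-off is that newly eligible merges may be picked up with up to 10 minutes of delay, which is usually acceptable for rarely updated tables.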

@NickStepanov

so if the check for running inserts (step 1) is reintroduced, that's going to solve this?

Yep. But other improvements (like smarter scheduling) still make sense.

@tavplubix is there a plan to bring it back in newer versions?

@hjun881

hjun881 commented May 29, 2023

@den-crane Thank you very much. Based on your settings, the problem has been resolved. Can you explain the underlying principle in detail, and why changing these parameters solves the problem?

   <merge_tree>
        <cleanup_delay_period>300</cleanup_delay_period>
        <merge_selecting_sleep_ms>600000</merge_selecting_sleep_ms>
   </merge_tree>

@tavplubix
Member Author

so if the check for running inserts (step 1) is reintroduced, that's going to solve this?

Yep. But other improvements (like smarter scheduling) still make sense.

@tavplubix is there a plan to bring it back in newer versions?

@NickStepanov, there's no plan to bring it back because it turned out to be much more complex than I thought. But there are two PRs that should significantly reduce the number of ZooKeeper requests in use cases like yours: #49637 and #50107.

@ozcanyarimdunya

@NickStepanov Also try

   <merge_tree>
        <cleanup_delay_period>300</cleanup_delay_period>
        <merge_selecting_sleep_ms>600000</merge_selecting_sleep_ms>
   </merge_tree>

We (Altinity) tested it with clients who have issues with ZooKeeper load; no harm from it. It will reduce CPU usage of ClickHouse and ZooKeeper in your system on any version of ClickHouse, because currently the merge selector analyzes all your parts/partitions every 5 seconds.

We hit the same issue; with these settings, the problem has been resolved.
Thanks!



Development

Successfully merging this pull request may close these issues.

Logical error: Part is already written by concurrent request
Uncaught exception in ActiveDataPartSet::add

8 participants