
Some fixes for ReplicatedMergeTree#42878

Merged
tavplubix merged 11 commits into master from fix_intersecting_parts2
Nov 9, 2022

Conversation

@tavplubix
Member

@tavplubix tavplubix commented Nov 1, 2022

Changelog category (leave one): Not for changelog (changelog entry is not required)

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Nov 1, 2022
@tavplubix tavplubix changed the title Some experiments with ReplicatedMergeTree Some fixes for ReplicatedMergeTree Nov 2, 2022
@tavplubix tavplubix marked this pull request as ready for review November 2, 2022 18:46
@tavplubix
Member Author

Builds - #43084
Integration tests (tsan) [1/4] - #42734
Stateless tests flaky check (asan) - 01158_zookeeper_log_long flakiness fixed in #42607
Stress test (msan) - clearOldPartsFromFilesystem on shutdown
Stress test (ubsan) - #41218

@azat
Member

azat commented Jan 24, 2023

@tavplubix can you please comment on what was the motivation to remove that 2-RTT lock from EphemeralLockInZooKeeper?

The reason I'm asking is that even after #43675, traffic and the number of List requests to ZooKeeper are 4x higher than without this patch, and the setup is pretty simple: only 50 partitions, 10-20 tables, and not that many INSERTs.

@tavplubix
Member Author

@tavplubix can you please comment on what was the motivation to remove that 2-RTT lock from EphemeralLockInZooKeeper?

The motivation was that it was 2RTT :)
Well, actually, there was another reason to remove it: I wanted to write some useful metadata (instead of an abandonable lock path) into block number nodes (see clearLockedBlockNumbersInPartition for details).

The reason I'm asking is that even after #43675, traffic and the number of List requests to ZooKeeper are 4x higher than without this patch, and the setup is pretty simple: only 50 partitions, 10-20 tables, and not that many INSERTs.

We need smarter scheduling of background tasks in ReplicatedMergeTree tables. It does not make sense to run the merge selecting task every merge_selecting_sleep_ms (5000 by default) when the table is rarely updated and no data is inserted.

@NickStepanov

@tavplubix is there any chance to roll this back?

This change brought our ZooKeeper cluster to its knees: the number of requests went up so dramatically that the replication queue went out of control and halted the entire cluster for the better part of two days before we figured out what caused it.
[Screenshot, 2023-02-26 17:10: graph of the spike in ZooKeeper requests]

This all happened after upgrading to 22.12; rolling back to 22.8 didn't help to revert the situation, by the way. We had to use the solution described here: #43647 (comment)

Changes like these should be tested more carefully not to affect users in such a dramatic way.

@tavplubix
Member Author

How many replicated tables and how many partitions in total do you have? Is it correct that you have a lot of replicated tables with very rare inserts? Then you probably hit another issue: #31919. Yes, this PR made it worse, but the root cause is different. We have to fix background task scheduling so that these tasks do not run too frequently for unchanged tables. As a temporary workaround, you can tune the merge_selecting_sleep_ms and cleanup_delay_period settings.

@tavplubix is there any chance to roll this back?

@NickStepanov, feel free to send a pull request that rolls this back. However, you have to provide an alternative fix for this issue first if you want to roll this back.

Changes like these should be tested more carefully not to affect users in such a dramatic way.

I agree that there's still plenty of room for improvement in our CI system, but it's hard to test all marginal use cases. Also, we highly recommend that users have a staging cluster with a workload similar to what they have in production. Checking everything on staging before rolling it out to production guarantees that your production will not be suddenly brought to its knees in a dramatic way.

@NickStepanov

NickStepanov commented Feb 28, 2023

Thanks for coming back @tavplubix

I am pretty confident that the issue we're facing is the one described here: #43647 (comment)

And the workaround:

Manually remove /table_path_in_zk/temp/abandonable_lock-insert and /table_path_in_zk/temp/abandonable_lock-other nodes from ZooKeeper for each table.

Did help us.
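For anyone scripting this cleanup across many tables, a minimal sketch of building the two node paths to delete per table (the helper name and the example table path are mine, not from the thread; the actual deletion via the `kazoo` client is shown commented out as an assumption — `zkCli.sh deleteall` works just as well):

```python
def legacy_lock_paths(table_zk_path: str) -> list[str]:
    # The two legacy abandonable-lock nodes named in the workaround above.
    base = table_zk_path.rstrip("/")
    return [
        f"{base}/temp/abandonable_lock-insert",
        f"{base}/temp/abandonable_lock-other",
    ]

# Hypothetical table path for illustration:
paths = legacy_lock_paths("/clickhouse/tables/01/events")
print(paths)

# Deletion with kazoo (assumed client, not executed here):
# from kazoo.client import KazooClient
# zk = KazooClient(hosts="zk:2181")
# zk.start()
# for p in paths:
#     zk.delete(p, recursive=True)
```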

The number of tables we have is around 500, with 500k+ parts in total (if I calculate correctly). The inserts are definitely not rare, I would say the insert rate is actually pretty high.

Unfortunately, I am not much of a developer, so a PR to roll it back or to provide an alternative fix is not something I can offer.

And yes, a good point on more testing before upgrades; that's what we're planning to do more of. But, as everywhere, testing on staging doesn't guarantee a problem-free upgrade in prod. This problem seems to manifest itself mostly on highly loaded systems.

@den-crane
Contributor

@NickStepanov

500k+ parts

Please share the output of:

select count(), uniqExact(partition_id) p 
from system.parts where active 
group by database, table order by p desc limit 10;

@NickStepanov

NickStepanov commented Mar 1, 2023

@den-crane alright, this is what we have:

    row  count()  p
    0    18912    18912
    1    3658     2604
    2    4113     2587
    3    3290     2570
    4    3936     2570
    5    3840     2568
    6    2565     2565
    7    2906     1914
    8    1897     1897
    9    1330     1195

@tavplubix
Member Author

I am pretty confident that the issue we're facing is the one described here: #43647 (comment)

That comment was about a high number of empty partitions that were dropped or cleaned up by TTL. It was fixed in #43675.

However, it's not the only case that can lead to ZooKeeper overload. Before this PR, the merge selecting task worked like this:

  1. Check if the table has any INSERT queries in progress (1 zk request)
  2. If there are some INSERTs - load block numbers for all partitions (N zk requests, where N is the number of partitions)

And now it always loads block numbers for all partitions. This does not actually change the number of zk requests if you always have some INSERT queries running. But if a table has many partitions and inserts into it are infrequent, the merge selecting task will constantly reload all block numbers. The real problem is that we don't need to run the merge selecting task at all if no new parts were inserted, and that's what #31919 is about.
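A rough model (my own sketch, not ClickHouse code) of the per-iteration ZooKeeper request count implied by the two behaviors described above, where `n_partitions` plays the role of N:

```python
def zk_requests_old(n_partitions: int, inserts_in_progress: bool) -> int:
    # Old behavior: 1 request to check for running inserts; block numbers
    # for all partitions are loaded only if some INSERT is in progress.
    return 1 + (n_partitions if inserts_in_progress else 0)

def zk_requests_new(n_partitions: int, inserts_in_progress: bool) -> int:
    # After this PR: block numbers are always loaded, one request per partition.
    return n_partitions

# With 50 partitions and no inserts running, the per-iteration cost goes
# from 1 request to 50:
print(zk_requests_old(50, False), zk_requests_new(50, False))  # 1 50
```

This is why idle tables with many partitions are the worst case: the old fast path (1 request) disappears while the busy-table cost stays roughly the same.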

0 18912 18912

It means that you have a table with almost 19k partitions and 19k parts, so it makes 19k zk requests every merge_selecting_sleep_ms (5 seconds, so almost 4k rps). The number of parts equals the number of partitions, so all parts are merged, and I can assume that inserts into that table are quite rare (otherwise we would see unmerged parts in some partitions). It makes sense to increase merge_selecting_sleep_ms for that table.
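Checking the arithmetic above (a quick sketch, using the 18912-partition figure from the shared query output and the default 5000 ms interval):

```python
partitions = 18912
merge_selecting_sleep_ms = 5000

# One List request per partition, once per merge-selecting iteration:
rps = partitions / (merge_selecting_sleep_ms / 1000)
print(round(rps))  # 3782, i.e. "almost 4k rps" for this single table
```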

In addition to implementing smarter merge selecting task scheduling, we can avoid loading block numbers for "small" partitions with only a few parts and prefer partitions with many parts when selecting a merge.

@tavplubix
Member Author

Or we can simply bring the check for running inserts back. It seems possible to do without 2 RTT on insert and without breaking backward compatibility; I will check.

@NickStepanov

@tavplubix ok, I am starting to get a bit more confused :)

  1. What definitely helped in our case was removing the nodes /table_path_in_zk/temp/abandonable_lock-insert and /table_path_in_zk/temp/abandonable_lock-other for all of our tables. The second we removed them, the entire ZooKeeper and ClickHouse cluster came back to life. Therefore, a process related to these nodes is the one that caused trouble for us.
  2. Our tables get a good amount of inserts, definitely not something I would call rare.

@tavplubix
Member Author

What has definitely helped in our case, was removing the files like /table_path_in_zk/temp/abandonable_lock-insert and /table_path_in_zk/temp/abandonable_lock-other

It's because these nodes make the merge selecting task on old versions skip the first step and always load all block numbers.

@NickStepanov

Ok, I see, so if the check for running inserts (step 1) is reintroduced, that's going to solve this?

@tavplubix
Member Author

so if the check for running inserts (step 1) is reintroduced, that's going to solve this?

Yep. But other improvements (like smarter scheduling) still make sense.

@den-crane
Contributor

@NickStepanov Also try

   <merge_tree>
        <cleanup_delay_period>300</cleanup_delay_period>
        <merge_selecting_sleep_ms>600000</merge_selecting_sleep_ms>
   </merge_tree>

We (Altinity) tested it with clients who have issues with ZooKeeper load; no harm from it.
It will reduce CPU usage of ClickHouse and ZooKeeper in your system on any version of ClickHouse,
because currently the merge selector analyzes all your parts/partitions every 5 seconds.
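A back-of-the-envelope sketch (mine, not from the thread) of what the settings above buy you: raising merge_selecting_sleep_ms from the default 5000 ms to 600000 ms runs the merge selector 120x less often, cutting its per-table ZooKeeper traffic by the same factor.

```python
default_ms = 5000
tuned_ms = 600_000

# How many fewer merge-selector iterations (and thus ZK request bursts)
# per unit of time after the tuning:
print(tuned_ms // default_ms)  # 120
```

The trade-off is that newly eligible merges may be picked up with up to 10 minutes of delay, which is usually acceptable for rarely updated tables.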

@NickStepanov

so if the check for running inserts (step 1) is reintroduced, that's going to solve this?

Yep. But other improvements (like smarter scheduling) still make sense.

@tavplubix is there a plan to bring it back in newer versions?

@hjun881

hjun881 commented May 29, 2023

@den-crane Thank you very much. Based on your settings, the problem has been resolved. Can you explain the underlying principle in detail, and why changing these parameters solves the problem?

   <merge_tree>
        <cleanup_delay_period>300</cleanup_delay_period>
        <merge_selecting_sleep_ms>600000</merge_selecting_sleep_ms>
   </merge_tree>

@tavplubix
Member Author

so if the check for running inserts (step 1) is reintroduced, that's going to solve this?

Yep. But other improvements (like smarter scheduling) still make sense.

@tavplubix is there a plan to bring it back in newer versions?

@NickStepanov, there's no plan to bring it back because it turned out to be much more complex than I thought. But there are two PRs that should significantly reduce the number of ZooKeeper requests in use cases like yours: #49637 and #50107.

@ozcanyarimdunya

@NickStepanov Also try

   <merge_tree>
        <cleanup_delay_period>300</cleanup_delay_period>
        <merge_selecting_sleep_ms>600000</merge_selecting_sleep_ms>
   </merge_tree>

We (Altinity) tested it with clients who have issues with ZooKeeper load; no harm from it. It will reduce CPU usage of ClickHouse and ZooKeeper in your system on any version of ClickHouse, because currently the merge selector analyzes all your parts/partitions every 5 seconds.

We hit the same issue; with these settings, the problem has been resolved.
Thanks!



Development

Successfully merging this pull request may close these issues.

Logical error: Part is already written by concurrent request
Uncaught exception in ActiveDataPartSet::add

8 participants