Some fixes for ReplicatedMergeTree #42878
Conversation
@tavplubix can you please comment on what the motivation was to remove that 2-RTT lock? The reason I'm asking is that even after #43675 the traffic and the number of requests remain high.
The motivation was that it was 2RTT :)
We need smarter scheduling of background tasks.
@tavplubix is there any chance to roll this back? This change brought our ZooKeeper cluster to its knees, and the amount of requests went up so dramatically that the replication queue spiraled out of control and halted the entire cluster for a good part of two days before we figured out what caused it. This all happened after upgrading to 22.12; we then tried rolling back to 22.8, which didn't help revert the situation, by the way. We had to use the solution described here: #43647 (comment). Changes like these should be tested more carefully so as not to affect users in such a dramatic way.
How many replicated tables and how many partitions in total do you have? Is it correct that you have a lot of replicated tables with very rare inserts? Then you probably hit another issue: #31919. Yes, this PR made it worse, but the root cause of the issue is different. We have to fix the background task scheduling and not run these tasks too frequently for unchanged tables. As a temporary workaround, you can tune
@NickStepanov, feel free to send a pull request that rolls this back. However, you would have to provide an alternative fix for the issue first.
I agree that there's still big room for improvement in our CI system, but it's hard to test all marginal use cases. Also, we highly recommend that users have a staging cluster with a workload similar to what they run in production. Checking everything on staging before rolling it out guarantees that your production will not be suddenly brought to its knees in a dramatic way.
Thanks for coming back, @tavplubix. I am pretty confident that the issue we're facing is the one described here: #43647 (comment). And the workaround did help us. The number of tables we have is around 500, with 500k+ parts in total (if I calculate correctly). The inserts are definitely not rare; I would say the insert rate is actually pretty high. Unfortunately, I am not much of a developer, so a PR to roll it back or to provide an alternative fix is not something I can offer. And yes, good point on more testing before upgrades; that's what we're planning to do more of. But, as everywhere, testing on staging doesn't guarantee a problem-free upgrade in prod. This problem seems to manifest itself mostly on highly loaded systems.
@den-crane alright, this is what we have:
That comment was about a high number of empty partitions that were dropped or cleaned up by TTL. That was fixed in #43675. However, it's not the only case that can lead to zk overload. Before this PR, the merge selecting task worked like this:
1. Check whether there are any running inserts.
2. Only if there are, load block numbers for all partitions.
And now it always loads block numbers for all partitions. This does not actually change the number of zk requests if you always have some INSERT queries running. But if you have a table with many partitions and insert into it only occasionally, the merge selecting task will constantly reload all block numbers. The real problem is that we don't need to run the merge selecting task at all if no new parts were inserted, and that's what #31919 is about.
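A minimal sketch (assumed logic written in Python for illustration, not actual ClickHouse code) of how the two strategies differ in ZooKeeper request count for a table with many partitions and no running inserts:

```python
# Hypothetical model of the merge selecting task's ZooKeeper traffic.
# Function names and the "1 request per check / 1 request per partition"
# accounting are assumptions for illustration only.

def old_strategy(num_partitions: int, has_running_inserts: bool) -> int:
    """Pre-#42878: one cheap check first, expensive scan only if needed."""
    requests = 1  # step 1: check for running inserts
    if not has_running_inserts:
        return requests  # idle table: skip loading block numbers entirely
    return requests + num_partitions  # step 2: load block numbers per partition

def new_strategy(num_partitions: int, has_running_inserts: bool) -> int:
    """Post-#42878: always load block numbers for all partitions."""
    return num_partitions

# Idle table with ~19k partitions, as in the comment below:
print(old_strategy(19_000, has_running_inserts=False))  # 1
print(new_strategy(19_000, has_running_inserts=False))  # 19000
```

For a busy table the two are nearly identical, which is why the change only hurts tables with many partitions and rare inserts.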
It means that you have a table with almost 19k partitions and 19k parts, so it makes 19k zk requests each time the merge selecting task runs. In addition to implementing smarter merge selecting task scheduling, we can avoid loading block numbers for "small" partitions with only a few parts and prefer partitions with many parts when selecting a merge.
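For scale, a back-of-the-envelope calculation of the request rate a single such table can generate; the 5000 ms interval between merge selecting task runs is an assumed default here:

```python
# Assumed numbers for illustration: ~19k partitions, task wakes up every 5 s.
partitions = 19_000
sleep_ms = 5_000  # assumed default merge_selecting_sleep_ms

requests_per_second = partitions / (sleep_ms / 1000)
print(requests_per_second)  # 3800.0 ZooKeeper requests/s from ONE idle table
```

Multiply that by hundreds of tables and the overload described above follows directly.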
Or we can simply bring the check for running inserts back. It seems possible to do this without 2 RTT on insert and without breaking backward compatibility; I will check.
@tavplubix ok, I am starting to get a bit more confused :)
It's because these "files" make the merge selecting task on old versions skip the first step and always load all block numbers.
OK, I see. So if the check for running inserts (step 1) is reintroduced, that's going to solve this?
Yep. But other improvements (like smarter scheduling) still make sense. |
@NickStepanov Also try We (Altinity) tested it with clients who have issues with ZK load, no harm from it. |
@tavplubix is there a plan to bring it back in newer versions? |
@den-crane Thank you very much. With your settings, the problem has been resolved. Can you explain in detail why changing these parameters solves the problem?
<merge_tree>
<cleanup_delay_period>300</cleanup_delay_period>
<merge_selecting_sleep_ms>600000</merge_selecting_sleep_ms>
</merge_tree>
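Rough arithmetic behind why these settings help (the 5000 ms default for merge_selecting_sleep_ms is an assumption here): raising the interval stretches the time between merge selecting task wake-ups, and each skipped wake-up avoids the per-partition block-number reads described above:

```python
# Assumed default vs the tuned value from the config above.
default_sleep_ms = 5_000    # assumed default merge_selecting_sleep_ms
tuned_sleep_ms = 600_000    # value set in the <merge_tree> config

reduction_factor = tuned_sleep_ms // default_sleep_ms
print(reduction_factor)  # 120 -> ~120x fewer wake-ups (and zk reads) per table
```

It does not fix the root cause (the task still runs for unchanged tables); it just makes the task run rarely enough that ZooKeeper can keep up.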
@NickStepanov, there's no plan to bring it back because it appeared to be much more complex than I thought. But there are two PRs that should significantly reduce the number of ZooKeeper requests in use cases like yours: #49637 and #50107. |
We got the same issue; with these settings, the problem was resolved.

Changelog category (leave one):
Not for changelog (changelog entry is not required)
Removed "abandonable_lock", now block number allocation takes 1 RTT
Fixed race condition between INSERT and ALTER PARTITION that might lead to intersecting parts, fixes Uncaught exception in ActiveDataPartSet::add #36610
Fixed race condition between INSERT and DROP, fixes Logical error: Part is already written by concurrent request #42120
Fixed incorrect check for is_active node in quorum inserts (see Improve the time to recover keeper connections #42541 (comment))