Lock-free parts rename for ReplicatedMergeTree to avoid INSERT affect SELECT (resubmit)#64955
Conversation
|
This is an automated comment for commit 617033f with description of existing statuses. It's updated for the latest CI running ✅ Click here to open a full report in a separate page Successful checks
|
036dd31 to
8c07fe2
Compare
|
I've added benchmark into the description, which basically shows that without any other load (the ideal scenario) QPS of SELECTs that executed in parallel with INSERT, from table with And this, is without any load, numbers will got higher under load. Also note, that under load, you can have problems even without |
8c07fe2 to
f8f66b0
Compare
|
Test failures does related:
|
f8f66b0 to
70691ed
Compare
There was a problem hiding this comment.
This was the problem, for now, I've reverted this lock-free renames for plain MergeTree
|
Can someone take a look please? |
|
The error in |
|
I don't understand why Unit tests failed... we have to check. |
a) it is a bug in gdb (images have gdb 12.1) - https://sourceware.org/bugzilla/show_bug.cgi?id=29185 |
|
@azat, then make a pull request with update to a new gdb version. |
I don't think that this will help, since by some reason it got SIGTERM, and my guess is due to ACPI shutdown, like I wrote, the CI retry logic should be improved I guess |
I will take a look, but it has nothing to do with this changes, it is better to track it here - #50911 |
|
Let's resolve conflicts and merge (if it will be mergeable). |
70691ed to
e61e13c
Compare
|
Test failures does not looks related:
Details
Details |
|
OK, let's try to rebase to resolve some tests issues |
e61e13c to
e353978
Compare
|
Tidy was broken (fixed in #66070), let's do one more rebase |
Signed-off-by: Azat Khuzhin <[email protected]> (cherry picked from commit 66a2962)
Signed-off-by: Azat Khuzhin <[email protected]>
Signed-off-by: Azat Khuzhin <[email protected]>
… SELECT
Under heavy load, or not so heavy but with fsync_part_directory=1,
rename could take longer then usual, and this will increase the time
that DataPartsLock will be held, while DataPartsLock is the bottleneck
for MergeTree (any SELECT requires it to obtain list of parts).
On one of production clusters I saw ~60 seconds with
fsync_part_directory=1.
So this patch moves rename out of critical section, by explicitly rename
the parts, to make it more clear when it is done without holding a lock.
Note, that this patch fixes the bottleneck only for ReplicatedMergeTree,
MergeTree is not that simple, and should be done separately.
v2: instead of using DataPartsLock.lock.lock()/unlock() move the renameTo() into MergeTreeData::Transaction::commit()
v3: Remove unused locked_parts from MergeTreeData::Transaction
v4: Fix replacing parts with empty
v5: Use renameParts() explicitly to avoid leaving parts in detached
Since there is an assertion that does not allows to remove detached
parts during cleanup, which sounds good in general, but breaks this new
code.
v6: Avoid race between cleanup thread and renameMergedTemporaryPart()
The problem was that with this patch set renameMergedTemporaryPart() is
called without temporary_directory_lock holded (in MergeTask), since it
is reseted just before calling renameMergedTemporaryPart(), and this can
be seen in logs:
2024.03.29 19:56:42.126919 [ 1341 ] {ea7a3fd2-cf47-4ec7-91a5-51c69fba1b95::-8_0_138_2_2} <Trace> test_btnct5cr.alter_table_0 (ea7a3fd2-cf47-4ec7-91a5-51c69fba1b95) (MergerMutator): Merged 50 parts: [-8_0_0_0_2, -8_138_138_0] -> -8_0_138_2_2
2024.03.29 19:56:42.127034 [ 1341 ] {ea7a3fd2-cf47-4ec7-91a5-51c69fba1b95::-8_0_138_2_2} <Debug> test_btnct5cr.alter_table_0 (ea7a3fd2-cf47-4ec7-91a5-51c69fba1b95): Committing part -8_0_138_2_2 to zookeeper
2024.03.29 19:56:42.128462 [ 884 ] {} <Warning> test_btnct5cr.alter_table_0 (ea7a3fd2-cf47-4ec7-91a5-51c69fba1b95): Removing temporary directory /var/lib/clickhouse/store/ea7/ea7a3fd2-cf47-4ec7-91a5-51c69fba1b95/tmp_merge_-8_0_138_2_2/
2024.03.29 19:56:42.128647 [ 1341 ] {ea7a3fd2-cf47-4ec7-91a5-51c69fba1b95::-8_0_138_2_2} <Debug> test_btnct5cr.alter_table_0 (ea7a3fd2-cf47-4ec7-91a5-51c69fba1b95): Part -8_0_138_2_2 committed to zookeeper
...
2024.03.29 19:56:54.586084 [ 57841 ] {bf240267-0620-4294-afc1-479c58e6be89} <Error> executeQuery: std::exception. Code: 1001, type: std::__1::__fs::filesystem::filesystem_error, e.what() = filesystem error: in file_size: No such file or directory ["/var/lib/clickhouse/store/ea7/ea7a3fd2-cf47-4ec7-91a5-51c69fba1b95/-8_0_138_2_2/data.cmrk3"]
This should fix failures of 00993_system_parts_race_condition_drop_zookeeper in [1].
[1]: https://s3.amazonaws.com/clickhouse-test-reports/61973/f6f826c85dd5b7bb8db16286fd10dcf441a440f7/stateless_tests__coverage__[4_6].html
Though now it looks hackish...
v7: Avoid race between cleanup thread and renameMergedTemporaryPart()
v8: Require explicit rename of parts in transaction
v9: Do not remove detached parts in Transaction::rollback
v10: Fix possible assertion when size of precommitted_parts <= precommitted_parts_need_rename
CI founds [1]:
Logical error: 'precommitted_parts.size() >= precommitted_parts_need_rename.size()'
[1]: https://s3.amazonaws.com/clickhouse-test-reports/61973/5c1e6a3e956917bdbb7eaa467934e5b75f17a923/stateless_tests__tsan__s3_storage__[5_5].html
The problem is that after precommitted_parts cleaned from detached parts
it may be less then precommitted_parts_need_rename, so to avoid this,
let's just copy it to a new container.
v11: Use rename_in_transaction=false instead of explicit renameParts()
Later the lock could be splitted, but let's not mix too much changes in
once.
v12: Rename in transaction for INSERT into MergeTree
v13: Rename in transaction for INSERT into ReplicatedMergeTree
v14: revert rename_in_transaction for MergeTree, this is not correct for now (see comment in the code)
Signed-off-by: Azat Khuzhin <[email protected]>
e353978 to
6c49586
Compare
|
Just in case, I don't recommend rebasing, because it involves |
Got it (didn't knew about how conflict resolution works with private fork, now it is clear, thanks) |
|
@alexey-milovidov BTW likely I can fix this, if you will tell me for which commit you fixed the confclits |
|
Kind ping, can someone take a look please? Test failures are not related:
|
Signed-off-by: Azat Khuzhin <[email protected]>
|
|
Can someone please take a look? |
Merge with master after: - integration tests had been fixed in private fork - conflicts had been resolved (by Alexey)
Merge with master after conflict had been resolved one more time (by Alexey)
Merge with upstream to fix temporary build issue
|
I don't know why no one reviewed this PR yet. |
|
I already reviewed it a while ago :) |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Lock-free parts rename to avoid INSERT affect SELECT (due to parts lock) (under normal circumstances with
fsync_part_directory, QPS of SELECT with INSERT in parallel, increased 2x, under heavy load the effect is even bigger). Note, this only includesReplicatedMergeTreefor nowUnder heavy load, or not so heavy but with fsync_part_directory=1,
rename could take longer then usual, and this will increase the time
that DataPartsLock will be held, while DataPartsLock is the bottleneck
for MergeTree (any SELECT requires it to obtain list of parts).
On one of production clusters I saw ~60 seconds with
fsync_part_directory=1.
So this patch moves rename out of critical section, by explicitly rename
the parts, to make it more clear when it is done without holding a lock.
Note, that this patch fixes the bottleneck only for ReplicatedMergeTree,
MergeTree is not that simple, and should be done separately.
Benchmark
Patched
Upstream
Original - https://gist.github.com/azat/7df0643cebb4ca9835d97734b18647b3
Original PR: #61973 (cc @alexey-milovidov )
Revert: #64899 (cc @alesapin )