Fix removing of parts in a Temporary state#28221
Conversation
Parts in a temporary state is not exists in data_parts_by_info, so do
not try to search there and throw LOGICAL_ERROR in case of failure:
<details>
```
02:45:49.037546 [ 5890 ] {} <Error> test_iy9rta.concurrent_kill_4 (526aa7c8-db2a-4f0e-926a-a7c8db2a9f0e): Code: 40. DB::Exception: Part all_0_0_0_1 from r11 has different columns hash. (CHECKSUM_DOESNT_MATCH) (version 21.10.1.7910 (official build)). Data after mutation is not byte-identical to data on another replicas. We will download merged part from replica to force byte-identical result.
02:45:49.049422 [ 5890 ] {} <Trace> test_iy9rta.concurrent_kill_4 (526aa7c8-db2a-4f0e-926a-a7c8db2a9f0e): Trying to immediately remove part all_0_0_0_1 (state Temporary)
02:45:49.060210 [ 5890 ] {} <Fatal> : Logical error: 'Part all_0_0_0_1 doesn't exist'.
02:47:01.572508 [ 29208 ] {} <Fatal> BaseDaemon: (version 21.10.1.7910 (official build), build id: 9309CECED9A0D32CBB00BB8EC77B91456679868E) (from thread 5890) (no query) Received signal Aborted (6)
...
02:47:06.815000 [ 29208 ] {} <Fatal> BaseDaemon: 8. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/MergeTreeData.cpp:2593: DB::MergeTreeData::tryRemovePartImmediately(std::__1::shared_ptr<DB::IMergeTreeDataPart const>&&) @ 0x1f959e0d in /usr/bin/clickhouse
02:47:06.816309 [ 29209 ] {} <Fatal> BaseDaemon: 8. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/MergeTreeData.cpp:2593: DB::MergeTreeData::tryRemovePartImmediately(std::__1::shared_ptr<DB::IMergeTreeDataPart const>&&) @ 0x1f959e0d in /usr/bin/clickhouse
02:47:09.455665 [ 29208 ] {} <Fatal> BaseDaemon: 9. ./obj-x86_64-linux-gnu/../src/Storages/StorageReplicatedMergeTree.cpp:1939: DB::StorageReplicatedMergeTree::tryExecutePartMutation(DB::ReplicatedMergeTreeLogEntry const&) @ 0x1f5f2bf6 in /usr/bin/clickhouse
02:47:09.468738 [ 29209 ] {} <Fatal> BaseDaemon: 9. ./obj-x86_64-linux-gnu/../src/Storages/StorageReplicatedMergeTree.cpp:1939: DB::StorageReplicatedMergeTree::tryExecutePartMutation(DB::ReplicatedMergeTreeLogEntry const&) @ 0x1f5f2bf6 in /usr/bin/clickhouse
02:47:11.776857 [ 29208 ] {} <Fatal> BaseDaemon: 10. ./obj-x86_64-linux-gnu/../src/Storages/StorageReplicatedMergeTree.cpp:1581: DB::StorageReplicatedMergeTree::executeLogEntry(DB::ReplicatedMergeTreeLogEntry&) @ 0x1f5e484c in /usr/bin/clickhouse
02:47:11.904232 [ 29209 ] {} <Fatal> BaseDaemon: 10. ./obj-x86_64-linux-gnu/../src/Storages/StorageReplicatedMergeTree.cpp:1581: DB::StorageReplicatedMergeTree::executeLogEntry(DB::ReplicatedMergeTreeLogEntry&) @ 0x1f5e484c in /usr/bin/clickhouse
02:47:13.941811 [ 29208 ] {} <Fatal> BaseDaemon: 11. ./obj-x86_64-linux-gnu/../src/Storages/StorageReplicatedMergeTree.cpp:3176: DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_18::operator() shared_ptr<DB::ReplicatedMergeTreeLogEntry>&) const @ 0x1f65faa5 in /usr/bin/clickhouse
02:47:14.477158 [ 29209 ] {} <Fatal> BaseDaemon: 11. ./obj-x86_64-linux-gnu/../src/Storages/StorageReplicatedMergeTree.cpp:3176: DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_18::operator() shared_ptr<DB::ReplicatedMergeTreeLogEntry>&) const @ 0x1f65faa5 in /usr/bin/clickhouse
02:47:16.475373 [ 29208 ] {} <Fatal> BaseDaemon: 12. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/type_traits:3676: decltype(std::__1::forward<DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>))(std::__1::forward<std::__1::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&>(fp0))) std::__1::__invoke<DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_18&, std::__1::shared_ptr<DB::ReplicatedMEntry>&>(DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_18&, std::__1::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&) @ 0x1f65fa32 in /usr/bin/clickhouse
02:47:16.970325 [ 29209 ] {} <Fatal> BaseDaemon: 12. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/type_traits:3676: decltype(std::__1::forward<DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>))(std::__1::forward<std::__1::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&>(fp0))) std::__1::__invoke<DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_18&, std::__1::shared_ptr<DB::ReplicatedMEntry>&>(DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_18&, std::__1::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&) @ 0x1f65fa32 in /usr/bin/clickhouse
02:47:18.979481 [ 29208 ] {} <Fatal> BaseDaemon: 13. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/__functional_base:317: bool std::__1::__invoke_void_return_wrapper<bool>::__call<DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplireeQueue::SelectedEntry>)::$_18&, std::__1::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&>(DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_18&, std::__1::shared_ptr<DB::ReplicatedMergeTreeLogE x1f65f9f2 in /usr/bin/clickhouse
02:47:19.450807 [ 29209 ] {} <Fatal> BaseDaemon: 13. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/__functional_base:317: bool std::__1::__invoke_void_return_wrapper<bool>::__call<DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplireeQueue::SelectedEntry>)::$_18&, std::__1::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&>(DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_18&, std::__1::shared_ptr<DB::ReplicatedMergeTreeLogE x1f65f9f2 in /usr/bin/clickhouse
02:47:21.055007 [ 29208 ] {} <Fatal> BaseDaemon: 14. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:1608: std::__1::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::processQueueEntry(std::__1::shared_ptr<DB::ReplicatedMergeTreeQueue::ry>)::$_18, bool (std::__1::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>::operator()(std::__1::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&) @ 0x1f65f9b0 in /usr/bin/clickhouse
02:47:23.546946 [ 413 ] {} <Fatal> Application: Child process was terminated by signal 6.
```
</details>
CI: https://clickhouse-test-reports.s3.yandex.net/0/4a8b82232c11512232df3ecdf4ffaec287116ad5/stress_test_(debug).html#fail1
alesapin
left a comment
There was a problem hiding this comment.
Looks like these changes are correct. But there are several unclear moments:
- Temporary parts removed in their destructors https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/MergeTree/IMergeTreeDataPart.cpp#L418-L420. So it seems like we have a race condition here or at least unclear responsibility....
- Why do we have part state
Temporaryand part flagis_temp? What is the difference?
|
Other failures unrelated to changes. I'm not sure about fetches test, but let's check in master. |
|
Seems like it didn't help: https://clickhouse-test-reports.s3.yandex.net/0/855a53ff8160c4638fe345b0d26e062804ba790a/stress_test_(debug).html#fail1 |
@tavplubix thanks! The fix wasn't complete, will get back with a followup patch. |
CI reports [1]:
<Fatal> : Logical error: 'Deleting data part all_0_0_0_3 doesn't exist'.
...
<Fatal> BaseDaemon: 8. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/MergeTreeData.cpp:1254: DB::MergeTreeData::removePartsFinally() @ 0x1f94dffa in /usr/bin/clickhouse
[1]: https://clickhouse-test-reports.s3.yandex.net/0/855a53ff8160c4638fe345b0d26e062804ba790a/stress_test_(debug).html#fail1
Follow-up for: ClickHouse#28221
|
Classification of this issue was incorrect. I changed to improvement. |
But under certain circumstances it may be triggered by user too, so I'm voting for a |
|
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix removing of parts in a Temporary state which can lead to an unexpected exception (
Part %name% doesn't exist). Fixes #23661.Detailed description / Documentation draft:
Parts in a temporary state does not exists in data_parts_by_info, so do
not try to search there and throw LOGICAL_ERROR in case of failure:
Details
CI: https://clickhouse-test-reports.s3.yandex.net/0/4a8b82232c11512232df3ecdf4ffaec287116ad5/stress_test_(debug).html#fail1
CI: https://clickhouse-test-reports.s3.yandex.net/0/df1fe27791f82c2a917390faa30716effbd32b8f/stress_test_(debug)/test_run.txt.out.log
Cc: @alesapin