Skip to content

Fix race between mergeSelectingTask and queue reinitialization#34096

Merged
tavplubix merged 1 commit intomasterfrom
fix_race_merge_selecting_task
Jan 31, 2022
Merged

Fix race between mergeSelectingTask and queue reinitialization#34096
tavplubix merged 1 commit intomasterfrom
fix_race_merge_selecting_task

Conversation

@tavplubix
Copy link
Copy Markdown
Member

@tavplubix tavplubix commented Jan 28, 2022

Changelog category (leave one):

  • Bug Fix (user-visible misbehaviour in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fixed minor race condition that might cause "intersecting parts" error in extremely rare cases after ZooKeeper connection loss.

Detailed description:
https://s3.amazonaws.com/clickhouse-test-reports/33665/16adcc166608b922f7351a8eedcfb483d63d7513/stress_test__debug__actions_.html
See also #34013

Connection was lost on attempt to update log_pointer. Log pointer was successfully updated in ZooKeeper, but queue-0000000020 was not added into in-memory queue.

2022.01.15 23:07:15.308960 [ 777 ] {} <Error> test_98ynp9.r4: void DB::StorageReplicatedMergeTree::queueUpdatingTask(): Code: 999. Coordination::Exception: Connection loss. (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):

0. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/exception:133: std::exception::capture() @ 0x1610c9c2 in /usr/bin/clickhouse
1. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/exception:111: std::exception::exception() @ 0x1610c993 in /usr/bin/clickhouse
2. ./obj-x86_64-linux-gnu/../contrib/poco/Foundation/src/Exception.cpp:27: Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x288e1f00 in /usr/bin/clickhouse
3. ./obj-x86_64-linux-gnu/../src/Common/Exception.cpp:55: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x160ee46e in /usr/bin/clickhouse
4. ./obj-x86_64-linux-gnu/../src/Common/ZooKeeper/IKeeper.cpp:25: Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error, int) @ 0x24c77f4f in /usr/bin/clickhouse
5. ./obj-x86_64-linux-gnu/../src/Common/ZooKeeper/IKeeper.cpp:41: Coordination::Exception::Exception(Coordination::Error) @ 0x24c78461 in /usr/bin/clickhouse
6. ./obj-x86_64-linux-gnu/../src/Common/ZooKeeper/ZooKeeper.cpp:1098: zkutil::KeeperMultiException::check(Coordination::Error, std::__1::vector<std::__1::shared_ptr<Coordination::Request>, std::__1::allocator<std::__1::shared_ptr<Coordination::Request> > > const&, std::__1::vector<std::__1::shared_ptr<Coordination::Response>, std::__1::allocator<std::__1::shared_ptr<Coordination::Response> > > const&) @ 0x24c81794 in /usr/bin/clickhouse
7. ./obj-x86_64-linux-gnu/../src/Common/ZooKeeper/ZooKeeper.cpp:611: zkutil::ZooKeeper::multi(std::__1::vector<std::__1::shared_ptr<Coordination::Request>, std::__1::allocator<std::__1::shared_ptr<Coordination::Request> > > const&) @ 0x24c8167f in /usr/bin/clickhouse
8. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp:670: DB::ReplicatedMergeTreeQueue::pullLogsToQueue(std::__1::shared_ptr<zkutil::ZooKeeper>, std::__1::function<void (Coordination::WatchResponse const&)>, DB::ReplicatedMergeTreeQueue::PullLogsReason) @ 0x242ae6bc in /usr/bin/clickhouse
9. ./obj-x86_64-linux-gnu/../src/Storages/StorageReplicatedMergeTree.cpp:2773: DB::StorageReplicatedMergeTree::queueUpdatingTask() @ 0x23c20165 in /usr/bin/clickhouse

It's not a problem, because queue-0000000020 was reloaded from queue later when connection was restored:

2022.01.15 23:07:24.541640 [ 763 ] {} <Test> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Insert entry queue-0000000020 to queue with type MERGE_PARTS with virtual parts [all_1_28_2]

But mergeSelectingTask was executed before queue reinitialization has finished:

2022.01.15 23:07:24.387149 [ 750 ] {} <Debug> test_98ynp9.r4 (MergerMutator): Selected 9 parts from all_1_15_1 to all_31_31_0

It constructed ReplicatedMergeTreeMergePredicate that called pullLogsToQueue, but it was a noop, because queue-0000000020 entry already present in the queue in ZooKeeper (but it does not present in in-memory queue and that's why queue.virtual_parts and merges_version were not consistent).

This bug was introduced in #32140 and was partially fixed in #32389, but I forgot about another similar case.
Related to #26876.

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Jan 28, 2022
@tavplubix tavplubix added the pr-must-backport Pull request should be backported intentionally. Use this label with great care! label Jan 28, 2022
Copy link
Copy Markdown
Member

@azat azat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, now everything is clear!

@tavplubix
Copy link
Copy Markdown
Member Author

Stateless tests (thread, actions) [1/3] - race in StorageFile, already fixed in master

@tavplubix tavplubix merged commit d19e24f into master Jan 31, 2022
@tavplubix tavplubix deleted the fix_race_merge_selecting_task branch January 31, 2022 09:16
robot-clickhouse pushed a commit that referenced this pull request Jan 31, 2022
robot-clickhouse pushed a commit that referenced this pull request Jan 31, 2022
tavplubix added a commit that referenced this pull request Feb 1, 2022
Backport #34096 to 22.1: Fix race between mergeSelectingTask and queue reinitialization
tavplubix added a commit that referenced this pull request Feb 1, 2022
Backport #34096 to 21.12: Fix race between mergeSelectingTask and queue reinitialization
@Felixoid Felixoid added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jul 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-bugfix Pull request with bugfix, not backported by default pr-must-backport Pull request should be backported intentionally. Use this label with great care!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants