Skip to content

RFC: Eliminate race window between selecting parts to merge and merge entry creation#34013

Closed
azat wants to merge 2 commits intoClickHouse:masterfrom
azat:intersecting-parts-fix
Closed

RFC: Eliminate race window between selecting parts to merge and merge entry creation#34013
azat wants to merge 2 commits intoClickHouse:masterfrom
azat:intersecting-parts-fix

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Jan 26, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Eliminate race window between selecting parts to merge and merge entry creation (fix one more case of Logical error: 'Part X intersects previous part Y. It is a bug or a result of manual intervention in the ZooKeeper data.'. error)

CI founds 1 intersecting parts while running 01459_manual_write_to_replicas test:

Details
2022.01.15 23:07:24.387149 [ 750 ] {} <Debug> test_98ynp9.r4 (MergerMutator): Selected 9 parts from all_1_15_1 to all_31_31_0
2022.01.15 23:07:24.541640 [ 763 ] {} <Test> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Insert entry queue-0000000020 to queue with type MERGE_PARTS with virtual parts [all_1_28_2]
2022.01.15 23:07:25.327639 [ 750 ] {} <Trace> test_98ynp9.r4: Created log entry /clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/log/log-0000000021 for merge all_1_31_2

# What goes after, not that important, just to show the error and the flow
...
2022.01.15 23:08:02.078376 [ 790 ] {} <Warning> test_98ynp9.r4 (ReplicatedMergeTreeRestartingThread): ZooKeeper session has expired. Switching to a new session.
...
2022.01.15 23:08:04.792227 [ 790 ] {} <Trace> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Initializing parts in queue
...
2022.01.15 23:08:04.868563 [ 790 ] {} <Debug> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Loading queue from /clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/replicas/r4/queue
2022.01.15 23:08:04.949219 [ 790 ] {} <Debug> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Having 0 queue entries to load, 1 entries already loaded.
2022.01.15 23:08:05.398827 [ 790 ] {} <Debug> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Pulling 1 entries to queue: log-0000000021 - log-0000000021
2022.01.15 23:08:05.417492 [ 790 ] {} <Test> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Insert entry queue-0000000021 to queue with type MERGE_PARTS with virtual parts [all_1_31_2]
2022.01.15 23:08:05.424624 [ 790 ] {} <Fatal> : Logical error: 'Part all_1_31_2 intersects previous part all_1_28_2. It is a bug or a result of manual intervention in the ZooKeeper data.'.

The reason AFAICS that there is a race window between selecting parts to
merge and merge entry in zookeeper had been created, that you can see in
first 3 lines of the log.

Refs: #26876
Cc: @tavplubix
Cc: @alesapin

azat added 2 commits January 26, 2022 15:20
…y creation

CI founds [1] intersecting parts while running 01459_manual_write_to_replicas test:

<details>

    2022.01.15 23:07:24.387149 [ 750 ] {} <Debug> test_98ynp9.r4 (MergerMutator): Selected 9 parts from all_1_15_1 to all_31_31_0
    2022.01.15 23:07:24.541640 [ 763 ] {} <Test> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Insert entry queue-0000000020 to queue with type MERGE_PARTS with virtual parts [all_1_28_2]
    2022.01.15 23:07:25.327639 [ 750 ] {} <Trace> test_98ynp9.r4: Created log entry /clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/log/log-0000000021 for merge all_1_31_2

    # What goes after, not that important, just to show the error and the flow
    ...
    2022.01.15 23:08:02.078376 [ 790 ] {} <Warning> test_98ynp9.r4 (ReplicatedMergeTreeRestartingThread): ZooKeeper session has expired. Switching to a new session.
    ...
    2022.01.15 23:08:04.792227 [ 790 ] {} <Trace> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Initializing parts in queue
    ...
    2022.01.15 23:08:04.868563 [ 790 ] {} <Debug> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Loading queue from /clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/replicas/r4/queue
    2022.01.15 23:08:04.949219 [ 790 ] {} <Debug> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Having 0 queue entries to load, 1 entries already loaded.
    2022.01.15 23:08:05.398827 [ 790 ] {} <Debug> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Pulling 1 entries to queue: log-0000000021 - log-0000000021
    2022.01.15 23:08:05.417492 [ 790 ] {} <Test> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Insert entry queue-0000000021 to queue with type MERGE_PARTS with virtual parts [all_1_31_2]
    2022.01.15 23:08:05.424624 [ 790 ] {} <Fatal> : Logical error: 'Part all_1_31_2 intersects previous part all_1_28_2. It is a bug or a result of manual intervention in the ZooKeeper data.'.

</details>

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/33665/16adcc166608b922f7351a8eedcfb483d63d7513/stress_test__debug__actions_.html

The reason AFAICS that there is a race window between selecting parts to
merge and merge entry in zookeeper had been created, that you can see in
first 3 lines of the log.

Signed-off-by: Azat Khuzhin <[email protected]>
@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Jan 26, 2022
Comment on lines +3127 to +3132
if (queue.isVirtualPart(merged_info, merged_name))
{
throw Exception(ErrorCodes::DUPLICATE_DATA_PART,
"There is already queue entry that intersects {} part", merged_name);
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK if there is a covering part in the queue, then either ReplicatedMergeTreeMergePredicate::canMergeTwoParts returned false or zookeeper request will fail due to log version mismatch, so this change looks redundant. Could you please explain in more details where is the race condition and why this check is supposed to fix it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, seems you are right, and this change is redundant.

@azat azat marked this pull request as draft January 26, 2022 14:27
@tavplubix tavplubix self-assigned this Jan 26, 2022
@tavplubix
Copy link
Copy Markdown
Member

There's nothing suspicious in logs (https://gist.github.com/tavplubix/29a621d09264dd9f142d862ac34762b6), seems like log version check did not work for some reason. Will try to check zookeeper_log.

@tavplubix
Copy link
Copy Markdown
Member

But there's no zookeeper_log_dump in stress tests...

@tavplubix
Copy link
Copy Markdown
Member

I've extracted some logs from coordination.tar: https://gist.github.com/tavplubix/c01332e692da9ca2347fc362a2249630

Seems like when r4 constructed ReplicatedMergeTreeMergePredicate for all_1_31_2 it got merges_version = 3 somehow, and that's why log entry was created successfully:

q	8433	Create	1	N	N	/clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/log/log-
s	8433	Create	1	0	N	/clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/log/log-0000000021
q	8433	Set	2	N	3	/clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/log
s	8433	Set	2	0	3	/clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/log

So probably we should look for a bug in ReplicatedMergeTreeMergePredicate...

@azat
Copy link
Copy Markdown
Member Author

azat commented Jan 27, 2022

But there's no zookeeper_log_dump in stress tests...

I thought about some issues with replicas, but there is only one keeper instance, so this should not be a problem (in this case it will be interesting to see at zookeeper_log and host_address at which it was connected)

Seems like when r4 constructed ReplicatedMergeTreeMergePredicate for all_1_31_2 it got merges_version = 3 somehow, and that's why log entry was created successfully:

Indeed, It should got merges_version=3, but it seems that it did not load all_1_28_2 into the queue in-memory.
Maybe incorrect log_pointer, but how?

Here are some my observations:

# r9 created log entry for all_1_28_2
2022.01.15 23:06:57.615350 [ 785 ] {} <Trace> test_98ynp9.r9: Created log entry /clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/log/log-0000000020 for merge all_1_28_2
# zookeeper session has some issues
2022.01.15 23:07:20.615010 [ 772 ] {} <Error> test_98ynp9.r4: void DB::StorageReplicatedMergeTree::queueUpdatingTask(): Code: 999. Coordination::Exception: Session expired (Session expired). (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):
# replica restart
2022.01.15 23:07:20.631439 [ 763 ] {} <Trace> test_98ynp9.r4 (ReplicatedMergeTreeRestartingThread): Threads finished
2022.01.15 23:07:23.645422 [ 763 ] {} <Trace> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Initializing parts in queue
2022.01.15 23:07:23.977287 [ 763 ] {} <Trace> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Queue initialized
# now ReplicatedMergeTreeQueue::load() loads the queue
2022.01.15 23:07:24.380458 [ 763 ] {} <Debug> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Having 1 queue entries to load, 0 entries already loaded.
# mergeSelectingTask selected 1..31 parts
2022.01.15 23:07:24.387149 [ 750 ] {} <Debug> test_98ynp9.r4 (MergerMutator): Selected 9 parts from all_1_15_1 to all_31_31_0
# ReplicatedMergeTreeQueue::load updated in-memory queue
# so, ReplicatedMergeTreeMergePredicate, obviously did not see it
# but ReplicatedMergeTreeMergePredicate also should load all_1_28_2 (it calls pullLogsToQueue), why this does not happens?
2022.01.15 23:07:24.541640 [ 763 ] {} <Test> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Insert entry queue-0000000020 to queue with type MERGE_PARTS with virtual parts [all_1_28_2]
# Merge entry successfully created, even though it should not
2022.01.15 23:07:25.327639 [ 750 ] {} <Trace> test_98ynp9.r4: Created log entry /clickhouse/tables/01459_manual_write_to_replicas_test_98ynp9/r/log/log-0000000021 for merge all_1_31_2
# Interesting, it takes decent amount of time between ReplicatedMergeTreeQueue::insertUnlocked() started and called ReplicatedMergeTreeQueue::insertUnlocked()
2022.01.15 23:07:26.333021 [ 763 ] {} <Test> test_98ynp9.r4 (ReplicatedMergeTreeQueue): Adding part all_1_28_2 to mutations
# it takes 4.5sec, so mergeSelectingTask started at ~23:07:21.949, but
# log-0000000020 had been created at 2022-01-15 23:06:57, so ReplicatedMergeTreeMergePredicate should see all_1_28_2
2022.01.15 23:07:26.465698 [ 750 ] {} <Trace> test_98ynp9.r4 (StorageReplicatedMergeTree::mergeSelectingTask): Execution took 4517 ms.

@tavplubix
Copy link
Copy Markdown
Member

Closed in favor of #34096

@tavplubix tavplubix closed this Jan 28, 2022
@azat azat deleted the intersecting-parts-fix branch January 28, 2022 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants