How can I restore a blocked replication faster? #15955
Closed
Labels: comp-replication, question
Description
Hello.
ClickHouse is an excellent database, and thank you for making it open source.
While using ReplicatedMergeTree, I ran into replication being blocked for 5~10 minutes.
After the replication leader crashed, replication on the shard stopped for 5~10 minutes.
I want to reduce the time it takes to recover from this blocked replication.
If there is any way to reduce it, please let me know.
To reproduce it, follow these steps.
1. Server settings
Shard 1 - Node 1
clickhouse (A failure point is injected)
test_table/ReplicatedMergeTree
Shard 1 - Node 2
test_table/ReplicatedMergeTree
Shard 1 - Node 3
test_table/ReplicatedMergeTree
version: 20.8.2
2. Inject a failure point
// Add exit(1) at line 381 (the line number may differ between source versions)
// The server will crash after writing some replication information into ZooKeeper.
void ReplicatedMergeTreeBlockOutputStream::commitPart(...)
{
...
366 else if (Coordination::isHardwareError(multi_code))
367 {
368 transaction.rollback();
369 throw Exception("Unrecoverable network error while adding block " + toString(block_number) + " with ID '" + block_id + "': "
370 + Coordination::errorMessage(multi_code), ErrorCodes::UNEXPECTED_ZOOKEEPER_ERROR);
371 }
372 else
373 {
374 transaction.rollback();
375 throw Exception("Unexpected ZooKeeper error while adding block " + toString(block_number) + " with ID '" + block_id + "': "
376 + Coordination::errorMessage(multi_code), ErrorCodes::UNEXPECTED_ZOOKEEPER_ERROR);
377 }
378
379 if (quorum)
380 {
381 exit(1); // <------ Right here, I injected exit(1) as a failure point.
382 /// We are waiting for quorum to be satisfied.
383 LOG_TRACE(log, "Waiting for quorum");
384
385 String quorum_status_path = storage.zookeeper_path + "/quorum/status";
386
387 try
388 {
389 while (true)
...
}
3. Insert data with insert_quorum 2
clickhouse client --host <Node 1> --insert_quorum 2 --query " INSERT ..."
4. After Node 1 crashed, insert more data with insert_quorum 2
clickhouse client --host <Node 2> --insert_quorum 2 --query " INSERT ..."
Received exception from server (version 20.8.2):
Code: 286. DB::Exception: Received from localhost:29000. DB::Exception: Quorum for previous write has not been satisfied yet. Status: version: 1
part_name: 20201009_29_29_0
required_number_of_replicas: 2
actual_number_of_replicas: 1
replicas:
1
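To watch the blocked state while it lasts, the pending quorum entry can be read back from ZooKeeper through ClickHouse's `system.zookeeper` table. The following is only a rough sketch: `NODE` and `ZK_PATH` are placeholders (the report does not give the table's ZooKeeper path), and it assumes the `quorum/status` node contains the same text that appears in the exception above. It only observes the block; it does not shorten it.

```shell
#!/usr/bin/env bash
# Sketch only: NODE and ZK_PATH are placeholders, not values from the report.
NODE="${NODE:-localhost}"
ZK_PATH="${ZK_PATH:-/clickhouse/tables/shard1/test_table}"  # hypothetical zookeeper_path of test_table

# Print the raw contents of <zookeeper_path>/quorum/status via system.zookeeper.
fetch_quorum_status() {
    clickhouse client --host "$NODE" --query \
        "SELECT value FROM system.zookeeper WHERE path = '${ZK_PATH}/quorum' AND name = 'status' FORMAT TabSeparatedRaw"
}

# Read a status text on stdin; exit 0 while fewer replicas than required have the part.
# Assumes the "key: value" layout shown in the exception message.
quorum_pending() {
    awk -F': ' '
        $1 == "required_number_of_replicas" { required = $2 }
        $1 == "actual_number_of_replicas"   { actual = $2 }
        END { exit !(actual + 0 < required + 0) }
    '
}
```

For example, `fetch_quorum_status | quorum_pending && echo "quorum still pending"` could be run in a loop on Node 2 to measure how long the block actually lasts.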
After waiting 5~10 minutes, inserts are unblocked again.
How can I reduce the restoration delay?
Thank you.