
Improve the time to recover keeper connections#42541

Merged
nikitamikhaylov merged 15 commits into ClickHouse:master from Algunenano:restaring-thread-investigation
Oct 25, 2022

Conversation

@Algunenano (Member) commented Oct 20, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Improve the time to recover lost keeper connections

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

Based on #42323
Closes #42323
Closes #42251

Two main changes:

  • Ephemeral nodes are now created with the server UUID as their content. When we find such a node unexpectedly (because of a hard crash or a lost ZK connection) we delete it manually if it belonged to this server; otherwise we wait as before (up to 3x the session timeout).
  • The recovery task (ReplicatedMergeTreeRestartingThread) is now relaunched on failure. Previously it was re-scheduled sometimes immediately and sometimes after 1 second (10 seconds in older releases), but the logic was buggy and in some situations it failed to re-schedule at all (thus falling back to the automated checks). Now failures are re-scheduled after 100-10000 ms.
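The randomized 100-10000 ms reschedule delay can be sketched as follows (a minimal illustration under assumptions, not the actual ClickHouse scheduling code; the name pickRescheduleDelayMs is hypothetical):

```cpp
#include <cstdint>
#include <random>

/// Hypothetical helper mirroring "failures are re-scheduled after 100-10000 ms":
/// pick a uniformly random delay in that range for the next restart attempt.
uint64_t pickRescheduleDelayMs(std::mt19937 & rng)
{
    std::uniform_int_distribution<uint64_t> dist(100, 10000);
    return dist(rng);
}
```

Randomizing the delay also avoids many replicas hammering the keeper at the same instant after a shared outage.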

After this is discussed and reviewed, I think it would make sense for some queries to be able to wait for ZK recovery instead of failing immediately; it would be something similar to lock_acquire_timeout, but just for this. For example, with the new improvements an INSERT could avoid throwing an exception by waiting 100-300 ms for a new ZK connection to be ready.

static String generateActiveNodeIdentifier()
{
-    return "pid: " + toString(getpid()) + ", random: " + toString(randomSeed());
+    return Field(ServerUUID::get()).dump();
}
Member Author

I decided to change this to use only the server UUID so that recovery is also faster after unclean shutdowns. There might be better approaches.
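The ownership check this enables can be sketched as a pure decision helper (names like decideStaleNodeAction are illustrative, not from the ClickHouse code; the real logic reads the ephemeral node's content via keeper and compares it to the server UUID):

```cpp
#include <string>

enum class StaleNodeAction { RemoveNow, WaitForSessionExpiry };

/// Hypothetical sketch: if a leftover is_active node carries our own server
/// UUID, it was created by this very server (e.g. before a hard crash) and can
/// be deleted immediately; otherwise we wait for the old session to expire,
/// as before.
StaleNodeAction decideStaleNodeAction(const std::string & node_content, const std::string & server_uuid)
{
    return node_content == server_uuid ? StaleNodeAction::RemoveNow
                                       : StaleNodeAction::WaitForSessionExpiry;
}
```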

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Oct 20, 2022
@devcrafter
Member

After this is discussed and reviewed, I'm thinking it would make sense for some queries to be able to wait for ZK recovery instead of immediately fail; it would be something similar to lock_acquire_timeout but just for this. For example, with the new improvements an INSERT could avoid throwing an exception if it waited for 100-300 ms for a new ZK connection to be ready.

INSERT retries should cover it, see #39764

@tavplubix tavplubix self-assigned this Oct 21, 2022
Comment on lines +70 to +75
            task->scheduleAfter(immediately_ms);
        }
    }
    catch (...)
    {
        task->scheduleAfter(immediately_ms);
Member

It will retry every 100-1000 ms even if the current replica is completely partitioned from the cluster. Should we increase this value a bit on each failure until it reaches 10 s (the original value of retry_period_ms)?

Member Author

Sure. I went with 1 s because that was the previous reduced value, but it would be better to have exponential backoff (or some other, simpler backoff) up to 10 seconds.
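The backoff discussed above could look roughly like this (a sketch under assumptions; backoffDelayMs and its constants are hypothetical, chosen to match the 100 ms starting delay and the original 10 s retry_period_ms cap):

```cpp
#include <algorithm>
#include <cstdint>

/// Hypothetical exponential backoff: start at 100 ms, double on each
/// consecutive failure, cap at 10000 ms (the original retry_period_ms).
uint64_t backoffDelayMs(unsigned consecutive_failures)
{
    const uint64_t initial_ms = 100;
    const uint64_t max_ms = 10000;
    /// Clamp the exponent so the shift cannot overflow for large failure counts.
    unsigned exp = std::min(consecutive_failures, 16u);
    return std::min(initial_ms << exp, max_ms);
}
```

With these constants the delay sequence is 100, 200, 400, ... ms, reaching the 10 s cap after 7 consecutive failures.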

storage.partial_shutdown_event.set();
storage.replica_is_active_node = nullptr;

LOG_TRACE(log, "Waiting for threads to finish");
Member

These messages were useful sometimes

Member Author

It was a mistake on my part. I had added a bunch of extra logs to help me understand what was going on and, when removing them, accidentally deleted some that I didn't add.

@nikitamikhaylov
Member

Stress tests (thread) - Azure thread leak. #42640

@nikitamikhaylov nikitamikhaylov merged commit 0016bc2 into ClickHouse:master Oct 25, 2022
@tavplubix
Member

Hm, there's an issue: quorum inserts rely on the content of the is_active node:

/// And what if it is possible that the current replica at this time has ceased to be active
/// and the quorum is marked as failed and deleted?
String value;
if (!zookeeper->tryGet(storage.replica_path + "/is_active", value, nullptr)
    || value != is_active_node_value)
    throw Exception("Replica become inactive while waiting for quorum", ErrorCodes::NO_ACTIVE_REPLICAS);

Maybe we should compare the node version instead.

@Algunenano
Member Author

Hm, there's an issue: quorum inserts rely on content of is_active node:

I'm not familiar with that part of the code. Do you mind explaining what the issue is? Before the change the value changed between restarts and was stable within the process (AFAICS); now the value is stable between restarts too. Is that a problem?

@tavplubix
Member

According to the code I posted above, the value is supposed to change between sessions (so probably this code did not work correctly even before this PR)

@tavplubix
Member

According to the code I posted above, the value is supposed to change between sessions (so probably this code did not work correctly even before this PR)

Fixed in #42878


Labels

pr-improvement Pull request with some product improvements


Development

Successfully merging this pull request may close these issues.

Replicated table takes too long to recover after ZK session expired

6 participants