
(faster session reinitialization) Remove timeout for reestablishing new connection in case of read-only table error#42323

Closed
nikitamikhaylov wants to merge 9 commits intomasterfrom
restaring-thread-investigation

Conversation

@nikitamikhaylov (Member) commented Oct 14, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

When the ZooKeeper session expires, the ReplicatedMergeTreeRestartingThread is notified automatically and almost instantly, but it needs several attempts to establish a new connection. This PR enables retries without any timeout after the first error occurs. This closes #42251
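
The retry behaviour described above can be sketched as follows. This is an illustrative Python sketch, not ClickHouse code; the function name `run_with_retries` and its parameters are hypothetical. The point is that with a zero retry delay the attempts happen back-to-back instead of waiting a fixed timeout between them:

```python
import time

def run_with_retries(attempt_init, max_tries=10, retry_delay_s=0.0):
    """Retry replica (re)initialization until it succeeds.

    With retry_delay_s=0.0 the retries happen back-to-back rather than
    sleeping a fixed timeout after each failure, which is the behaviour
    this PR aims for after the first error.
    """
    for n in range(1, max_tries + 1):
        if attempt_init():
            return n  # number of attempts it took
        time.sleep(retry_delay_s)
    raise RuntimeError("could not reinitialize the replica")
```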

@robot-ch-test-poll robot-ch-test-poll added the pr-improvement Pull request with some product improvements label Oct 14, 2022
@nikitamikhaylov (Member Author)

This test is based on sleep, so I will just see how flaky it is.

try
{
    storage.replica_is_active_node = nullptr;
}
catch (...)
@devcrafter (Member) commented Oct 15, 2022

This looks unnecessary, since ~EphemeralNodeHolder already catches all exceptions.
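
For readers unfamiliar with the pattern being discussed: the holder's cleanup swallows every exception itself, which is why an outer try/catch around it is redundant. A minimal Python sketch of the idea (illustrative only; the real EphemeralNodeHolder is C++ and its destructor logs the error via ClickHouse's logging machinery):

```python
class EphemeralNodeHolder:
    """Owns an ephemeral ZooKeeper node and removes it on release.

    release() swallows every exception (as the real destructor does),
    so callers do not need a try/catch of their own around it.
    """

    def __init__(self, client, path):
        self.client = client
        self.path = path
        self.client.create(path, ephemeral=True)

    def release(self):
        try:
            self.client.delete(self.path)
        except Exception:
            pass  # the real code logs the error; it never propagates
```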

@nikitamikhaylov (Member Author)

Indeed. It seems the problem is not here, contrary to what was described in the issue.

@alexey-milovidov alexey-milovidov changed the title Ignore the exception about failed node deletion from Zookeeper (faster session reinitialization) Ignore the exception about failed node deletion from Zookeeper Oct 17, 2022
@alexey-milovidov alexey-milovidov self-assigned this Oct 17, 2022
@nikitamikhaylov nikitamikhaylov changed the title (faster session reinitialization) Ignore the exception about failed node deletion from Zookeeper (faster session reinitialization) Remove timeout for reestablishing new connection in case of read-only table error Oct 17, 2022
@nikitamikhaylov (Member Author)

@Algunenano As mentioned above, EphemeralNodeHolder doesn't propagate any exception; it just catches it and writes it to the log. I checked, and it seems none of the functions we execute during partial shutdown/initialization throw exceptions. Nevertheless, I removed even the 1s timeout before rechecking the health and replica state, and I rewrote the code a bit to make it more reliable, in my opinion.

@Algunenano (Member)

I've tested this PR by doing 1 insert per second on a replicated table and restarting ZK, and it still takes ~30 seconds to recover. The same operation takes ~10 seconds on 21.8:

$ while true; do date; ch_client_prod-01 --query "Insert into t1 values (23)"; sleep 1; done 
lun 17 oct 2022 19:09:53 CEST
lun 17 oct 2022 19:09:54 CEST
lun 17 oct 2022 19:09:55 CEST
lun 17 oct 2022 19:09:56 CEST
lun 17 oct 2022 19:09:57 CEST
lun 17 oct 2022 19:09:58 CEST
lun 17 oct 2022 19:09:59 CEST
lun 17 oct 2022 19:10:00 CEST
Received exception from server (version 21.8.14):
Code: 999. DB::Exception: Received from clickhouse-01:49000. DB::Exception: Cannot allocate block number in ZooKeeper: Coordination::Exception: Connection loss. (KEEPER_EXCEPTION)
(query: Insert into t1 values (23))
lun 17 oct 2022 19:10:01 CEST
Received exception from server (version 21.8.14):
Code: 242. DB::Exception: Received from clickhouse-01:49000. DB::Exception: Table is in readonly mode (zookeeper path: /clickhouse/tables/01-01/default.t1). (TABLE_IS_READ_ONLY)
(query: Insert into t1 values (23))
lun 17 oct 2022 19:10:02 CEST
Received exception from server (version 21.8.14):
Code: 242. DB::Exception: Received from clickhouse-01:49000. DB::Exception: Table is in readonly mode (zookeeper path: /clickhouse/tables/01-01/default.t1). (TABLE_IS_READ_ONLY)
(query: Insert into t1 values (23))
[the same TABLE_IS_READ_ONLY exception repeats once per second through 19:10:09]
lun 17 oct 2022 19:10:10 CEST

With the PR it goes from 19:04:38 to 19:05:10 (still 32 seconds).

My feeling is that this time should be much lower: not just lower than the 30+ seconds in master, but lower than the 10 seconds of 21.8. I'll try tomorrow with a ZK cluster instead of a standalone node, where recovery should be immediate, but I think failed operations should influence how often the check happens. I'm not sure whether the query itself should try to recover the ZK connection (which would be nice if possible) or at least bring the next check forward to a reasonable time (so it doesn't run constantly, but maybe once a second). I need to think more about it.

@nikitamikhaylov (Member Author)

The only idea I have is that the reconnection slowness happens because of some timeouts inside ZooKeeper (like the timeout for deleting ephemeral nodes, etc.).

@Algunenano (Member)

It seems there is another problem, as initializing the replica is taking ~20 seconds:

oct 18 17:59:49 Mordor clickhouse[357188]: 2022.10.18 15:59:49.853405 [ 357376 ] {} <Trace> ZooKeeperClient: Failed to get API version
oct 18 17:59:49 Mordor clickhouse[357188]: 2022.10.18 15:59:49.853442 [ 357376 ] {} <Trace> ZooKeeper: Initialized, hosts: 127.0.0.1:2181
oct 18 17:59:49 Mordor clickhouse[357188]: 2022.10.18 15:59:49.853452 [ 357376 ] {} <Debug> Context: Establishing a new connection with ZooKeeper took 10121 ms
oct 18 17:59:49 Mordor clickhouse[357188]: 2022.10.18 15:59:49.853461 [ 357376 ] {} <Debug> default.t1 (ReplicatedMergeTreeRestartingThread): Trying to start replica up
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.539064 [ 357376 ] {} <Trace> default.t1 (ReplicatedMergeTreeQueue): Initializing parts in queue
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.539927 [ 357376 ] {} <Trace> default.t1 (ReplicatedMergeTreeQueue): Queue initialized
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.539938 [ 357376 ] {} <Debug> default.t1 (ReplicatedMergeTreeQueue): Loading queue from /clickhouse/tables/01-01/default.t1/replicas/clickhouse-01/queue
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.540736 [ 357376 ] {} <Debug> default.t1 (ReplicatedMergeTreeQueue): Having 0 queue entries to load, 0 entries already loaded.
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.541543 [ 357376 ] {} <Trace> default.t1 (ReplicatedMergeTreeQueue): Loaded queue

I'm investigating more to see what's going on.

@Algunenano (Member)

Well, narrowing it down was easy: the time is being spent in waitForEphemeralToDisappearIfAny.

I guess that since we couldn't remove the lock (as there was no connection to ZK), it finds the node and waits for it to be removed automatically by ZK (it's an EPHEMERAL_SEQUENTIAL node, which is removed on the client's disconnect). I'll dig deeper to understand whether this is something that should be handled better in the Keeper configuration, or whether there needs to be a better way to deal with it on reconnection.
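
The waiting behaviour being measured here amounts to a poll-until-gone loop with a deadline. A minimal Python sketch of the idea (illustrative only; the real waitForEphemeralToDisappearIfAny is C++ and uses ZooKeeper watches rather than polling):

```python
import time

def wait_for_ephemeral_to_disappear(node_exists, timeout_s, poll_s=0.01):
    """Wait until a stale ephemeral node is gone.

    Returns True once node_exists() reports the node has disappeared,
    or False if the deadline expires first. This models why recovery
    stalls: nobody deletes the node, so we sit here until ZK expires it.
    """
    deadline = time.monotonic() + timeout_s
    while node_exists():
        if time.monotonic() >= deadline:
            return False  # the real code raises here instead
        time.sleep(poll_s)
    return True
```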

@Algunenano (Member)

Following up:

I'm surprised about the change in that PR, because it's set up to throw a LOGICAL_ERROR if the node takes too long to disappear, which doesn't seem like the right exception to me: I was expecting this to handle configuration issues (where a user creates several servers with the same config by mistake). While this behaviour might be desirable in some cases versus removing the node, I think that in the case of a lost session it's doing a disservice. It'd make sense to keep an "unclean shutdown" flag and remove the old node if necessary, but I hope there is a simpler/cleaner way.

@nikitamikhaylov (Member Author)

Well, we already discussed internally removing this code (and bringing back explicit deletion of this node on start). AFAIK we found a couple of bugs, so LOGICAL_ERROR is a bit of an overkill, but it forces us to look at any errors.

@Algunenano (Member)

I'm currently working on going back to deleting the node if partial_shutdown_called was set on the storage. After that I also want to look at why it takes 10 seconds to get a new ZK session. If you are working on that, I'll work on something else; otherwise I'll prepare a PR with all the changes once I have them.

@nikitamikhaylov (Member Author)

Probably we can just bring back the old code but make it smarter, e.g. write the server's UUID to ZooKeeper and check it before deleting the node.

@tavplubix (Member)

I'm surprised about the change in that PR because it's set up to throw a LOGICAL_ERROR if the node takes too long to disappear, which doesn't seem the right exception to me

Yes, it's not actually a logical error, and we can change the error code. But usually 3x the session timeout should be enough for the ephemeral node to get removed, and we want to check that in CI to catch issues like #28519 and #41215.

I'm currently working on going back to deletion of the node if the storage was partial_shutdown_called.

OK, maybe we can resurrect the code that removes the node, but we should do it carefully, because ephemeral nodes are a kind of distributed lock, and ideologically it's not quite correct to forcefully remove them (we cannot unlock a lock held by someone else). But yes, adding ServerUUID::get() to generateActiveNodeIdentifier() may help ensure that it's our node.

@Algunenano (Member)

adding ServerUUID::get() to generateActiveNodeIdentifier() may help to ensure that it's our node.

I think it's not necessary to change the node identifier, as it's random and won't change because of a table restart. We can have logic like:

  • If the node doesn't exist, ok and continue.
  • If the node exists and the id matches ours, delete it and recreate it in the new session.
  • If the node exists and the id doesn't match (likely an unclean shutdown), wait?
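
The three cases above can be sketched as follows. This is an illustrative Python sketch under assumed callbacks (`get_node`, `delete_node`, `wait_for_gone` are hypothetical names, not the real ClickHouse API):

```python
def handle_stale_active_node(get_node, delete_node, wait_for_gone, our_id):
    """Decide what to do with a leftover is_active node on restart.

    get_node() returns the node's identifier data, or None if it does
    not exist; delete_node() removes it; wait_for_gone() blocks until
    ZK expires the previous session's ephemeral node.
    """
    data = get_node()
    if data is None:
        return "absent"   # nothing to do, continue startup
    if data == our_id:
        delete_node()     # our own leftover: remove it and recreate it
        return "deleted"
    wait_for_gone()       # held by someone else (unclean shutdown?): wait
    return "waited"
```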

@Algunenano (Member)

I've set up a local 3-server Keeper cluster to run tests closer to a production environment; before, I was testing with a single node, which isn't realistic or advisable. With this PR and the ephemeral node fix (delete the node if it's ours instead of waiting for it to go away), the time to recovery is less than a second.

I'm testing the time to recovery by inserting into the same table over and over, every 1 second, and restarting Keeper servers randomly. If I restart a follower, nothing happens; if I restart the leader, a re-election happens and one insert fails, but the next one, just one second later, works fine. I'll polish things up and create a PR.

@Algunenano (Member)

PR here: #42541




Development

Successfully merging this pull request may close these issues.

Replicated table takes too long to recover after ZK session expired

6 participants